2.5 ~3.7

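2.5 Preparing the data

Why automate data preparation with functions

- The transformations can be reproduced easily on any dataset

- You gradually build a library of transformation functions that can be reused in future projects

- The same functions can transform new data in a live system before it is fed to the algorithm

- It becomes easy to try various transformations and see which combination works best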

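2.5.1 Cleaning the data

Three options for handling missing values (here, in total_bedrooms):

1. Drop the affected districts

2. Drop the whole feature

3. Replace the missing values with some value (imputation): typically zero, the mean, or the median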

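housing.dropna(subset=["total_bedrooms"], inplace=True)    # option 1

housing.drop("total_bedrooms", axis=1)                     # option 2 (returns a new DataFrame)

median = housing["total_bedrooms"].median()                # option 3
# assign the result back instead of calling fillna(inplace=True) on a column selection
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)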

* The SimpleImputer class can do the same job

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
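
As a minimal sketch of how the imputer above is used, the following fits it on a toy DataFrame and inspects the medians it learned. The column values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric data with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "population": [10.0, 20.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(toy)  # learns the medians, then fills the gaps

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
toy_imputed = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
print(imputer.statistics_)  # the per-column medians learned during fit
```

Because the medians are stored in statistics_, the same fitted imputer can later fill missing values in new data with the training-set medians.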

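* Scikit-Learn design principles

- Consistency (estimators, transformers, predictors)

- Inspection

- Nonproliferation of classes

- Sensible defaults

Categorical input features are preprocessed with an encoder class

* One-hot encoding: each category becomes its own column, set to 1 for the matching category and 0 for all the others

* The encoder's output is a SciPy sparse matrix, not a NumPy array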
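
A small sketch of the one-hot encoding described above, assuming a toy ocean_proximity column with made-up rows. It shows that the result is sparse and how to convert it to a dense array.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical rows)
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

cat_encoder = OneHotEncoder()
toy_1hot = cat_encoder.fit_transform(toy_cat)

is_sparse = sparse.issparse(toy_1hot)  # the default output is a SciPy sparse matrix
dense = toy_1hot.toarray()             # convert to a dense NumPy array when needed

print(cat_encoder.categories_[0])  # the learned categories, in sorted order
print(dense)
```

The sparse format only stores the positions of the non-zero entries, which saves memory when a feature has many categories.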

- Feature scaling: standardization (zero mean, unit variance) with StandardScaler

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
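
To make explicit what standardization computes, here is a sketch that reproduces StandardScaler by hand on a toy array (the numbers are arbitrary): subtract each column's mean and divide by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 900.0]])

# Manual standardization: (x - column mean) / column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that this rescales the values but does not change the shape of the distribution, so a heavy-tailed feature stays heavy-tailed.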

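* Multimodal distribution > add a feature that measures the similarity between the feature value and one of the modes

* Similarity measure: a radial basis function (RBF)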

from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)
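
For a single feature, the Gaussian RBF used above reduces to exp(-gamma * (x - 35)^2). The sketch below checks that on a few made-up ages, which are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# A few hypothetical housing ages, as a column vector
ages = np.array([[5.0], [25.0], [35.0], [45.0]])
gamma = 0.1

# Gaussian RBF written out: similarity decays with squared distance from 35
manual_simil = np.exp(-gamma * (ages - 35.0) ** 2)
sklearn_simil = rbf_kernel(ages, [[35.0]], gamma=gamma)

print(manual_simil.ravel())
```

The similarity is exactly 1 at age 35 and falls off symmetrically on both sides. A larger gamma makes the peak narrower.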

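# extra code - this cell generates Figure 2-18
# (save_fig is a helper defined in the book's notebook)
import matplotlib.pyplot as plt
import numpy as np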

ages = np.linspace(housing["housing_median_age"].min(),
                   housing["housing_median_age"].max(),
                   500).reshape(-1, 1)
gamma1 = 0.1
gamma2 = 0.03
rbf1 = rbf_kernel(ages, [[35]], gamma=gamma1)
rbf2 = rbf_kernel(ages, [[35]], gamma=gamma2)

fig, ax1 = plt.subplots()

ax1.set_xlabel("Housing median age")
ax1.set_ylabel("Number of districts")
ax1.hist(housing["housing_median_age"], bins=50)

ax2 = ax1.twinx()  # create a twin axis that shares the same x-axis
color = "blue"
ax2.plot(ages, rbf1, color=color, label="gamma = 0.10")
ax2.plot(ages, rbf2, color=color, label="gamma = 0.03", linestyle="--")
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylabel("Age similarity", color=color)

plt.legend(loc="upper left")
save_fig("age_similarity_plot")
plt.show()

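* Fitting a model: linear regression on scaled labels, with predictions mapped back to the original scale via inverse_transform()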

from sklearn.linear_model import LinearRegression

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

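Custom transformers: custom transformations, cleanup operations, and feature combinations require writing your own transformers

ex) For a transformation that needs no training, write a function that takes a NumPy array as input and returns the transformed array

- It can take extra arguments as hyperparameters

- Scikit-Learn relies on duck typing, so no particular base class has to be inherited or overridden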

from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

rbf_transformer = FunctionTransformer(rbf_kernel,
                                      kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])

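* A custom transformer that clusters with KMeans and measures each sample's similarity to the cluster centers

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans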

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Scikit-Learn 1.2 added 'auto' as a value for `n_init`, the number of runs used to find the best result.
        # With `n_init='auto'`, the value is 10 when `init='random'` and 1 when `init='k-means++'`.
        # In Scikit-Learn 1.4 the default `n_init` changes from 10 to 'auto'; `n_init=10` is set explicitly to avoid the warning.
        self.kmeans_ = KMeans(self.n_clusters, n_init=10, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

* A transformation pipeline can be built with the Pipeline constructor

from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
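
As a sketch of what the numerical pipeline above does end to end, the following runs it on a toy array with missing values. The array is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with gaps (hypothetical values)
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 60.0]])

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# fit_transform runs each step in order: impute first, then standardize
X_prepared = num_pipeline.fit_transform(X_toy)

# make_pipeline names each step after its class, in lowercase
step_names = list(num_pipeline.named_steps)
print(step_names)
print(X_prepared)
```

With Pipeline you choose the step names yourself, whereas make_pipeline derives them automatically. The names matter later when addressing hyperparameters such as `step__param` in a grid search.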

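* Selecting and training a model

- The loss function can be defined in different ways depending on the problem

- K-fold cross-validation gives a more reliable performance estimate and helps detect overfitting to the training set

- Ensemble methods such as random forests can be used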
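
The cross-validation and random forest points above can be sketched as follows. The data comes from make_regression as a stand-in for the prepared housing data, and the model settings are arbitrary choices for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the housing set (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest_reg = RandomForestRegressor(n_estimators=20, random_state=42)

# Scikit-Learn scorers follow "greater is better", so RMSE is returned negated
rmses = -cross_val_score(forest_reg, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")

print(rmses.mean(), rmses.std())
```

Each of the 5 folds is held out once while the model trains on the other 4. A validation RMSE that is much worse than the training RMSE is a sign of overfitting.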

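* Fine-tuning the model

- GridSearchCV: evaluates every combination of the hyperparameter values you list, using cross-validation, so you do not have to try them by hand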

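from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# `preprocessing` is assumed to be the preprocessing pipeline built earlier in the book's notebook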

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)

- RandomizedSearchCV: evaluates a fixed number (n_iter) of randomly sampled hyperparameter combinations

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),
                  'random_forest__max_features': randint(low=2, high=20)}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

rnd_search.fit(housing, housing_labels)

- Ensembles

- Computing a confidence interval for the model's generalization error

- Deploying the system and monitoring the service
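
The confidence-interval step listed above can be sketched as follows. It assumes a 95% interval built from a t-distribution over the squared errors. The labels and predictions are synthetic placeholders, not results from the housing model.

```python
import numpy as np
from scipy import stats

# Synthetic test labels and predictions (placeholders for a real test set)
rng = np.random.default_rng(42)
y_test = rng.normal(loc=200_000, scale=50_000, size=500)
final_predictions = y_test + rng.normal(loc=0, scale=40_000, size=500)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
rmse = np.sqrt(squared_errors.mean())

# Interval for the mean squared error, then a square root to get back to RMSE units
lower, upper = np.sqrt(
    stats.t.interval(confidence, len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors))
)
print(lower, rmse, upper)
```

The interval shows how much the test-set RMSE estimate can be trusted. A wide interval means the point estimate alone says little.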

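3. Classification

Data: MNIST (images of handwritten digits + the actual digit as the label)

- Training a binary classifier: label = 5 is treated as True

- Stochastic gradient descent: handles large datasets, treats each sample independently, and suits online learning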

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

- Cross-validating and evaluating the model

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.95035, 0.96035, 0.9604 ])

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the dataset is
                                       # not already shuffled
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

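Confusion matrix: the usual way to evaluate a classifier's performance

- Accuracy is a poor metric for imbalanced datasets (for example, in anomaly detection, labeling every sample as normal can still give over 90% accuracy)

- Use cross_val_predict() to get the predictions needed to build the confusion matrix

( rows: actual class, columns: predicted class )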
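
The accuracy caveat and the cross_val_predict() step above can be sketched as follows. Scikit-Learn's small bundled digits dataset is used here as a stand-in for MNIST, which is an assumption made so the example runs without a download.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

# Small 8x8 digits dataset standing in for MNIST (assumption)
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)  # binary target: True for 5s, False otherwise

# A classifier that always predicts the majority class ("not 5")
dummy_clf = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(dummy_clf, X, y_5, cv=3, scoring="accuracy").mean()

# Out-of-fold predictions for every sample, then the confusion matrix
sgd_clf = SGDClassifier(random_state=42)
y_pred = cross_val_predict(sgd_clf, X, y_5, cv=3)
cm = confusion_matrix(y_5, y_pred)  # rows: actual class, columns: predicted class

print(baseline_acc)
print(cm)
```

The dummy classifier reaches roughly 90% accuracy without detecting a single 5, which is why the confusion matrix is the more informative view.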

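*** The confusion matrix terms need to be understood precisely

- Negative class: actually not a 5 (true negative (TN): predicted not 5, false positive (FP): predicted 5)

- Positive class: actually a 5 (false negative (FN): predicted not 5, true positive (TP): predicted 5)

* Precision = TP / (TP + FP): the fraction of positive predictions that are correct

* Recall = TP / (TP + FN): the fraction of actual positives that are detected

** F1 score: the harmonic mean of precision and recall, so it is high only when both are high

** In practice, raising precision usually lowers recall and vice versa: the precision/recall trade-off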
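
A worked example of the terms above, computed by hand from the four confusion-matrix cells. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (True = "is a 5")
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=bool)

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 4
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, 4 / 7

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

The hand-computed values match precision_score, recall_score and f1_score from sklearn.metrics on the same arrays.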

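# y_scores is assumed to hold the decision score of a single digit here,
# e.g. the output of sgd_clf.decision_function() for one sample
threshold = 3000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

array([False])

from sklearn.model_selection import cross_val_predict

# decision scores for every training instance, obtained with cross-validation
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")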

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

thresholds, precisions

(array([-146348.56726174, -142300.00705404, -137588.97581744, ...,
          38871.26391927,   42216.05562787,   49441.43765905]),
 array([0.09035   , 0.09035151, 0.09035301, ..., 1.        , 1.        ,
        1.        ]))

plt.figure(figsize=(8, 4))  # extra code
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")

# extra code - draws and saves Figure 3-5
idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-50000, 50000, 0, 1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")
save_fig("precision_recall_vs_threshold_plot")

plt.show()
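
One common use of these curves is to pick the lowest threshold that reaches a target precision. The sketch below does that for a 90% target on a tiny hand-made set of scores (an assumption for illustration, not MNIST scores).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical labels and decision scores; higher score = more likely positive
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1], dtype=bool)
y_scores = np.arange(12, dtype=float)

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# argmax on a boolean array returns the index of the first True value
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

# Predict with the chosen threshold instead of calling predict()
y_pred_90 = (y_scores >= threshold_for_90_precision)
precision_at_90 = precision_score(y_true, y_pred_90)
recall_at_90 = recall_score(y_true, y_pred_90)

print(threshold_for_90_precision, precision_at_90, recall_at_90)
```

Here the target precision is met only by giving up a third of the positives, which is the trade-off in concrete numbers.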

* ROC curve: plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR)

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)

y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
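
A minimal sketch of computing an ROC curve and its area under the curve (AUC) from scores. The labels and scores are hand-made assumptions; with the random forest above, the scores would be the positive-class column of the predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and scores; higher score = more likely positive
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1], dtype=bool)
y_scores = np.arange(12, dtype=float)

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# AUC is 1.0 for a perfect ranking and 0.5 for a purely random one
auc = roc_auc_score(y_true, y_scores)

print(auc)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score, here 33 of 36 pairs.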

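* Multiclass classification

- OvR / OvA strategy: train one binary classifier per class (N in total) and pick the class whose classifier gives the highest score

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])

- OvO: train a binary classifier for every pair of classes, N * (N - 1) / 2 in total (SVC uses this strategy by default)

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])  # use y_train, not y_train_5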
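
To see the difference between the two strategies, the sketch below counts the binary classifiers each one trains. It uses Scikit-Learn's bundled digits dataset as a stand-in for MNIST, and the 500-sample subset is an arbitrary choice to keep it fast.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Small 10-class digits dataset standing in for MNIST (assumption)
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:500], y[:500]

# OvR: one binary classifier per class
ovr_clf = OneVsRestClassifier(SVC(random_state=42)).fit(X_small, y_small)

# OvO: one binary classifier per pair of classes
ovo_clf = OneVsOneClassifier(SVC(random_state=42)).fit(X_small, y_small)

n_classes = len(ovr_clf.classes_)
print(n_classes, len(ovr_clf.estimators_), len(ovo_clf.estimators_))
```

With 10 classes, OvR trains 10 classifiers and OvO trains 10 * 9 / 2 = 45. Each OvO classifier only sees the samples of its two classes.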
