
[Machine Learning] Feature Engineering

λ°˜μ‘ν˜•

βœ‹ Google Colab Mount  /  Data Load

## MOUNT GOOGLE DRIVE IN COLAB
from google.colab import drive
drive.mount('/content/drive')

# Mounted at /content/drive

## DATA LOAD
import pandas as pd
print('pandas version:', pd.__version__)

wine = pd.read_csv('https://bit.ly/wine_csv_data')
wine.head()

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.20    0.0
2      9.8    2.3  3.26    0.0
3      9.8    1.9  3.16    0.0
4      9.4    1.9  3.51    0.0
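
An optional sanity check on the loaded frame before converting it; info() and describe() are standard pandas calls:

## optional sanity check: row count, dtypes, missing values
wine.info()
wine.describe()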

 

πŸ‘€ numpy - Numerical Operations

# numpy - numerical operations (convert the frame to arrays)
import numpy as np
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

data.shape, target.shape

# ((6497, 3), (6497,))
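
With target now a numpy array, the class balance can be read off directly; a small optional check:

## optional: class balance of the target
import numpy as np
labels, counts = np.unique(target, return_counts=True)
print(labels, counts)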

βœ‹ Dataset Split

## split the dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size = 0.2, random_state=42
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

# ((5197, 3), (1300, 3), (5197,), (1300,))
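
The split above is purely random. If the class ratio should be preserved exactly in both sets, train_test_split also accepts a stratify argument; a variant sketch (not used below, so the scores in this post stay reproducible):

## optional variant: stratified split keeps the class ratio in both sets
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)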

πŸ‘€ Feature Engineering

β–· Scaling Numeric Features

 - Numeric features all come in different units

   ex. height, weight - 180cm, 1.8m, 180kg  =>  the model can learn the wrong thing from mixed scales (scaling needed)

 - Min-Max Scaler : rescales the distribution into the 0 - 1 range

 - Standard Scaler : rescales so the mean is 0 and the variance is 1

 - Scaling changes the raw values, so the original magnitudes are lost, but the ordering of the data never changes (β˜…β˜…β˜…) - see the sketch below
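
A tiny sketch of the starred point above: standard and min-max scaling shift the values, but np.argsort shows the ordering is untouched (toy height data, just for illustration):

## sketch: scaling changes values but preserves order
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[180.0], [160.0], [175.0]])              # toy heights in cm
std = StandardScaler().fit_transform(x)                # mean 0, variance 1
mm = MinMaxScaler().fit_transform(x)                   # squeezed into [0, 1]

print(np.argsort(x.ravel()))                           # [1 2 0]
print(np.argsort(std.ravel()), np.argsort(mm.ravel())) # same order twice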

 

β–· Ground Rules (β˜…β˜…β˜…)

 - train and test data must stay separated

 - always suspect that the code you wrote may have mixed train and test data

 - keep steering away from Data Leakage - a minimal sketch follows this list
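
A minimal sketch of the leakage rule: fit the scaler on the training rows only; fitting on the full data lets test-set statistics leak into the transform:

## leakage sketch: statistics must come from the train set alone
from sklearn.preprocessing import StandardScaler

# wrong: test rows influence the mean/std used for scaling (leakage)
# ss_bad = StandardScaler().fit(data)

# right: fit on X_train, then transform both sets with those statistics
ss_ok = StandardScaler().fit(X_train)
X_train_ok = ss_ok.transform(X_train)
X_test_ok = ss_ok.transform(X_test)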

 

βœ‹ Train Data Scaling

# train data scaling
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X_train)

 

## fit on the train data, then apply the transform to both X_train and X_test
train_scaled = ss.transform(X_train)
test_scaled = ss.transform(X_test)

train_scaled.shape, test_scaled.shape

# ((5197, 3), (1300, 3))

## confirm the scaled feature matrix was created
X_train[0], train_scaled[0]
# (array([10.5 ,  7.7 ,  3.19]), array([ 0.01376574,  0.4646116 , -0.16888369]))

πŸ‘‰ λ³€ν™˜λœ 값을 μ μš©ν•œλ‹€λŠ” 의미 (train_scaled)

πŸ‘‰  train_scaled μ μš©ν•˜μ—¬ μƒˆλ‘œμš΄ test dataset으둜 λ°”κΎΌ ν›„ predict
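
One quick way to confirm the scaler behaved: the scaled train features should have mean β‰ˆ 0 and std β‰ˆ 1, while test_scaled only comes close, since it was transformed with the train statistics:

## optional check on the scaled arrays
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # ~0 and ~1 per column
print(test_scaled.mean(axis=0))                             # close to 0, not exact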

βœ‹ Modeling

# λ…λ¦½λ³€μˆ˜ λ§Œλ“  이후 λͺ¨λΈλ§ μž‘μ—…
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(train_scaled, y_train)

print(dt.score(train_scaled, y_train))  # train score   99.69 %
print(dt.score(test_scaled, y_test))    # test score    85.92 %

# the large gap between the train and test scores means the model overfit
# shrinking the train - test gap is what matters!!
# i.e. keep overfitting from happening

# after this it becomes a battle of crafting derived features
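
Before reaching for derived features, the gap can be probed directly by sweeping max_depth and printing both scores, which is what the rest of this post does one value at a time; a compact sketch of that sweep:

## sketch: sweep max_depth and watch the train - test gap
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 2, 3, 5, 7, None]:
    m = DecisionTreeClassifier(max_depth=depth, random_state=42)
    m.fit(train_scaled, y_train)
    print(depth, m.score(train_scaled, y_train), m.score(test_scaled, y_test))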

πŸ‘€ Decision Tree

# decision tree result
# run with no hyperparameters set at all
# the tree keeps splitting until it fits the train data perfectly (takes a while)
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure()
plot_tree(dt)
plt.show()
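
How big the unconstrained tree actually grew can be read off the fitted estimator; get_depth() and get_n_leaves() are standard scikit-learn tree methods:

## size of the fully grown tree
print(dt.get_depth())      # depth of the unconstrained tree
print(dt.get_n_leaves())   # number of leaf nodes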

πŸ‘€ Max_Depth Adjustment

- the smaller max_depth is, the lower the accuracy
πŸ‘‰ the model underfits (both train and test scores drop and sit close together)

# adjust max_depth (in plot_tree this only limits how deep the tree is drawn)
plt.figure(figsize=(10,7))
plot_tree(dt, max_depth=1, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

max_depth = 1

# adjust max_depth
plt.figure(figsize=(10,7))
plot_tree(dt, max_depth=2, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

max_depth = 2

## tune max_depth to find a value that makes a good model
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=1, random_state=42)
dt.fit(train_scaled, y_train)

print(dt.score(train_scaled, y_train))  # train score   75.3 %
print(dt.score(test_scaled, y_test))    # test score    73.7 %

plt.figure(figsize=(10,7))
plot_tree(dt, max_depth=3, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

# 0.7579372715027901
# 0.7376923076923076

max_depth = 1 → train : 75.3 %, test : 73.7 %

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(train_scaled, y_train)

print(dt.score(train_scaled, y_train))  # train score   84.5 %
print(dt.score(test_scaled, y_test))    # test score    84.1 %

plt.figure(figsize=(10,7))
plot_tree(dt, max_depth=3, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

# 0.8454877814123533
# 0.8415384615384616

πŸ‘‰ Train (84.5 %) and test (84.1 %) scores sit close together, so overfitting did not occur
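
Since the theme here is feature engineering, it is worth noting that the fitted tree also reports how much each feature contributed to the splits; feature_importances_ follows the column order alcohol, sugar, pH:

## which features drove the splits? (order: alcohol, sugar, pH)
print(dt.feature_importances_)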

λ°˜μ‘ν˜•

'Machine Learning' μΉ΄ν…Œκ³ λ¦¬μ˜ λ‹€λ₯Έ κΈ€

[Machine Learning] Poisson Regression  (0) 2023.04.26
[Machine Learning] XGBoost (Extreme Gradient Boosting)  (0) 2023.04.26
[Machine learning] scikit-learn pipeline  (0) 2023.04.21
[Machine Learning] Data Leakage  (0) 2023.04.21
[Machine Learning] Hyperparameter Tuning  (0) 2023.04.21