๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Machine Learning

[Machine Learning] XGBoost (Extreme Gradient Boosting)

๋ฐ˜์‘ํ˜•

๐Ÿ“Œ Boosting

- ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์•ฝํ•œ Decision Tree๋ฅผ ์กฐํ•ฉํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š” Ensemble ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜

- ์•ฝํ•œ ์—์ธก ๋ชจํ˜•๋“ค์˜ ํ•™์Šต ์—๋Ÿฌ์— ๊ฐ€์ค‘์น˜๋ฅผ ๋‘๊ณ , ์ˆœ์ฐจ์ ์œผ๋กœ ๋‹ค์Œ ํ•™์Šต ๋ชจ๋ธ์— ๋ฐ˜์˜ํ•˜์—ฌ ๊ฐ•ํ•œ ์˜ˆ์ธก๋ชจํ˜•์„ ๋งŒ๋“œ๋Š” ๊ฒƒ

[Figure: example of a tree ensemble model]

๐Ÿ“ŒGradient Boosting

- ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(gradient descent)์„ ์‚ฌ์šฉํ•ด ์ž”์—ฌ ์˜ค์ฐจ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ

- ์ž˜๋ชป๋œ ์˜ˆ์ธก์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์กฐ์ •ํ•˜์—ฌ ์ƒˆ๋กœ์šด ํŠธ๋ฆฌ๋ฅผ ๋งŒ๋“ฆ (๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฒฐ์ ์„ ๋ณด์™„ํ•˜๋Š” ๋งค๋ ฅ์ ์ธ ๋Œ€์•ˆ)

- ์ƒˆ๋กœ์šด ํŠธ๋ฆฌ๋Š” ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์˜ˆ์ธก๋œ ๊ฐ’์—๋Š” ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๋Š”๋‹ค

- ์˜ค์ฐจ์—๋งŒ ์ดˆ์ ์„ ๋งž์ถ”๋Š” ML ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋งŒ๋“œ๋ ค๋ฉด ์ •ํ™•ํ•œ ์ตœ์ข… ์˜ˆ์ธก์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์˜ค์ฐจ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ• ํ•„์š”

  ๋”ฐ๋ผ์„œ, ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ณผ ์‹ค์ œ ๊ฐ’ ์‚ฌ์ด์˜ ์ฐจ์ด์ธ ์ž”์ฐจ(residual)๋ฅผ ํ™œ์šฉ

- ๊ฐ ํŠธ๋ฆฌ ์˜ˆ์ธก ๊ฐ’์„ ๋”ํ•ด ๋ชจ๋ธ ํ‰๊ฐ€์— ์‚ฌ์šฉํ•œ๋‹ค.

 

[Figure: Gradient Tree Boosting]

โœ‹ XGBoost (Extreme Gradient Boosting)

- Extreme: pushing computation to the extreme to achieve both 'accuracy' and 'speed'

- XGBoost is a major upgrade of Gradient Boosting

  Therefore, to understand XGBoost's strengths, you need to know how Gradient Boosting works.

- It turns a weak learner into a strong learner by adding trees trained on the residuals

 

๐Ÿ’ก GBM ๋Œ€๋น„ ๋น ๋ฅธ ์ˆ˜ํ–‰์‹œ๊ฐ„  -  ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋กœ ํ•™์Šต, ๋ถ„๋ฅ˜ ์†๋„๊ฐ€ ๋น ๋ฅด๋‹ค

๐Ÿ’ก ๊ณผ์ ํ•ฉ ๊ทœ์ œ (Regularization)

  - ํ‘œ์ค€ GBM์˜ ๊ฒฝ์šฐ ๊ณผ์ ํ•ฉ ๊ทœ์ œ๊ธฐ๋Šฅ์ด ์—†์œผ๋‚˜, XGBoost๋Š” ์ž์ฒด์— ๊ณผ์ ํ•ฉ ๊ทœ์ œ ๊ธฐ๋Šฅ์œผ๋กœ ๊ฐ•ํ•œ ๋‚ด๊ตฌ์„ฑ ์ง€๋‹Œ๋‹ค

๐Ÿ’ก ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€์˜์—ญ์—์„œ ๋›ฐ์–ด๋‚œ ์˜ˆ์ธก ์„ฑ๋Šฅ ๋ฐœํœ˜

  - CART (Classification And Regression Tree) ์•™์ƒ๋ธ” ๋ชจ๋ธ ์‚ฌ์šฉ

๐Ÿ’ก ์กฐ๊ธฐ ์ข…๋ฃŒ (Early Stopping) ๊ธฐ๋Šฅ ์žˆ์Œ

๐Ÿ’ก ๋‹ค์–‘ํ•œ ์˜ต์…˜์„ ์ œ๊ณต, Customizing์ด ์šฉ์ด

โœ‹ XGBRegressor ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ์˜ˆ์‹œ

 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                 colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
                 importance_type='gain', interaction_constraints='',
                 learning_rate=0.1, max_delta_step=0, max_depth=5,
                 min_child_weight=1, missing=nan, monotone_constraints='()',
                 n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
                 reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
                 tree_method='exact', validate_parameters=1, verbosity=None)

โœ‹ XGBoost Parameter

๐Ÿ‘€ ์ผ๋ฐ˜ ํŒŒ๋ผ๋ฏธํ„ฐ

  -   Boosting ์ˆ˜ํ–‰ํ•  ๋•Œ ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ• ์ง€, ์„ ํ˜• ๋ชจ๋ธ์„ ์‚ฌ์šฉํ• ์ง€ ๋“ฑ์„ ๊ณ ๋ฆ„

 

  ๐Ÿ‘‰ booster [๊ธฐ๋ณธ๊ฐ’ = gbtree]

   -  ์–ด๋–ค ๋ถ€์Šคํ„ฐ ๊ตฌ์กฐ๋ฅผ ์“ธ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.

   -  ์˜์‚ฌ๊ฒฐ์ •๊ธฐ๋ฐ˜๋ชจํ˜•(gbtree), ์„ ํ˜•๋ชจํ˜•(gblinear), dart

 

  ๐Ÿ‘‰ n_jobs

   -  XGBoost๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ์ˆ˜

 

  ๐Ÿ‘‰ verbosity [๊ธฐ๋ณธ๊ฐ’ = 1]

   -  ์œ ํšจํ•œ ๊ฐ’์€ 0 (๋ฌด์Œ), 1 (๊ฒฝ๊ณ ), 2 (์ •๋ณด), 3 (๋””๋ฒ„๊ทธ)

 

 

๐Ÿ‘€ ๋ถ€์Šคํ„ฐ ํŒŒ๋ผ๋ฏธํ„ฐ

  - ์„ ํƒํ•œ Booster์— ๋”ฐ๋ผ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ์ข…๋ฅ˜๊ฐ€ ๋‹ค๋ฅด๋‹ค

 

  ๐Ÿ‘‰ learning_rate [ ๊ธฐ๋ณธ๊ฐ’ : 0.3 ]

   -  learning rate ๊ฐ€ ๋†’์„์ˆ˜๋ก ๊ณผ์ ํ•ฉ ํ•˜๊ธฐ ์‰ฝ๋‹ค

 

  ๐Ÿ‘‰ n_estimators [ ๊ธฐ๋ณธ๊ฐ’ : 100 ]

   -  ์ƒ์„ฑํ•  weak learner ์˜ ์ˆ˜

   -  learning_rate ๊ฐ€ ๋‚ฎ์„ ๋•, n_estimators ๋ฅผ ๋†’์—ฌ์•ผ ๊ณผ์ ํ•ฉ์ด ๋ฐฉ์ง€

 

  ๐Ÿ‘‰ max_depth [ ๊ธฐ๋ณธ๊ฐ’ : 6 ]

   -  ํŠธ๋ฆฌ์˜ maximum depth

   -  ์ ์ •ํ•œ ๊ฐ’์ด ์ œ์‹œ๋˜์–ด์•ผ ํ•จ (๋ณดํ†ต 3 ~ 10 ์‚ฌ์ด์˜ ๊ฐ’์ด ์ ์šฉ)

   -  max_depth ๊ฐ€ ๋†’์„์ˆ˜๋ก ๋ชจ๋ธ์˜ ๋ณต์žก๋„๊ฐ€ ์ปค์ ธ ๊ณผ์ ํ•ฉ ํ•˜๊ธฐ ์‰ฝ๋‹ค

 

  ๐Ÿ‘‰ min_child_weight [ ๊ธฐ๋ณธ๊ฐ’ : 1 ]

   -  ๊ด€์ธก์น˜์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜ ํ•ฉ์˜ ์ตœ์†Œ

   -  ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ๊ณผ์ ํ•ฉ์ด ๋ฐฉ์ง€

 

  ๐Ÿ‘‰ gamma [ ๊ธฐ๋ณธ๊ฐ’ : 0 ]

   -  leaf node์˜ ์ถ”๊ฐ€ ๋ถ„ํ• ์„ ๊ฒฐ์ •ํ•  ์ตœ์†Œ์†์‹ค ๊ฐ์†Œ๊ฐ’

   -  ํ•ด๋‹น๊ฐ’๋ณด๋‹ค ์†์‹ค์ด ํฌ๊ฒŒ ๊ฐ์†Œํ•  ๋•Œ ๋ถ„๋ฆฌ

   -  ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ๊ณผ์ ํ•ฉ์ด ๋ฐฉ์ง€

 

  ๐Ÿ‘‰ subsample [ ๊ธฐ๋ณธ๊ฐ’ : 1 ]

   -  weak learner ๊ฐ€ ํ•™์Šต์— ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ๋ง ๋น„์œจ

   -  ๋ณดํ†ต 0.5 ~ 1 ์‚ฌ์šฉ

   -  ๊ฐ’์ด ๋‚ฎ์„์ˆ˜๋ก ๊ณผ์ ํ•ฉ์ด ๋ฐฉ์ง€

 

  ๐Ÿ‘‰ colsample_bytree [ ๊ธฐ๋ณธ๊ฐ’ : 1 ]

   -  ๊ฐ tree ๋ณ„ ์‚ฌ์šฉ๋œ feature percentage

   -  ๋ณดํ†ต 0.5 ~ 1 ์‚ฌ์šฉ

   -  ๊ฐ’์ด ๋‚ฎ์„์ˆ˜๋ก ๊ณผ์ ํ•ฉ์ด ๋ฐฉ์ง€

 

  ๐Ÿ‘‰ lambda [ ๊ธฐ๋ณธ๊ฐ’ : 1, ๋ณ„์นญ : reg_lambda ]

   -  ๊ฐ€์ค‘์น˜์— ๋Œ€ํ•œ L2 Regularization ์ ์šฉ ๊ฐ’

   -  feature ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๋•Œ ์ ์šฉ ๊ฒ€ํ† 

   -  ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ

 

  ๐Ÿ‘‰ alpha [ ๊ธฐ๋ณธ๊ฐ’ : 0, ๋ณ„์นญ : reg_alpha ]

   -  ๊ฐ€์ค‘์น˜์— ๋Œ€ํ•œ L1 Regularization ์ ์šฉ ๊ฐ’

   -  feature ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๋•Œ ์ ์šฉ ๊ฒ€ํ† 

   -  ๊ฐ’์ด ํด์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ์†Œ

 

 

๐Ÿ‘€ ํ•™์Šต ๊ณผ์ • ํŒŒ๋ผ๋ฏธํ„ฐ

  - ํ•™์Šต ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ๊ฒฐ์ •

 

  ๐Ÿ‘‰ objective [ ๊ธฐ๋ณธ๊ฐ’ : reg = squarederror ]

   -  reg : squarederror

      โœ”  ์ œ๊ณฑ ์†์‹ค์ด ์žˆ๋Š” ํšŒ๊ท€

 

   -  binary : logistic (binary-logistic classification)

      โœ”  ์ดํ•ญ ๋ถ„๋ฅ˜ ๋ฌธ์ œ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ชจํ˜•์œผ๋กœ ๋ฐ˜ํ™˜๊ฐ’์ด ํด๋ž˜์Šค๊ฐ€ ์•„๋‹ˆ๋ผ ์˜ˆ์ธก ํ™•๋ฅ 

 

   -  multi : softmax

      โœ”  ๋‹คํ•ญ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ฒฝ์šฐ ์†Œํ”„ํŠธ๋งฅ์Šค(softmax)๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ถ„๋ฅ˜

      โœ”  ๋ฐ˜ํ™˜๋˜๋Š” ๊ฐ’์ด ์˜ˆ์ธกํ™•๋ฅ ์ด ์•„๋‹ˆ๋ผ ํด๋ž˜์Šค, num_class๋„ ์ง€์ •ํ•ด์•ผ ํ•จ

 

 ๐Ÿ‘‰ eval_metric

   -  ๋ชจ๋ธ์˜ ํ‰๊ฐ€ ํ•จ์ˆ˜๋ฅผ ์กฐ์ •ํ•˜๋Š” ํ•จ์ˆ˜

   -  ์„ค์ •ํ•œ objective ๋ณ„๋กœ ๊ธฐ๋ณธ ์„ค์ •๊ฐ’์ด ์ง€์ •๋˜์–ด ์žˆ์Œ

   -  ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์— ๋งž๊ฒŒ ํ‰๊ฐ€ ํ•จ์ˆ˜๋ฅผ ์กฐ์ •

      โœ”  rmse : root mean square error

      โœ”  mae : mean absolute error

      โœ”  logloss : negative log-likelihood

      โœ”  error : Binary classification error rate (0.5 threshold)

      โœ”  merror : Multiclass classification error rate

      โœ”  mlogloss : Multiclass logloss

      โœ”  auc : Area under the curve

      โœ”  map : mean average precision

 

 ๐Ÿ‘‰ seed [ ๊ธฐ๋ณธ๊ฐ’ : 0 ]

   -  ์žฌํ˜„ ๊ฐ€๋Šฅํ•˜๋„๋ก ๋‚œ์ˆ˜๋ฅผ ๊ณ ์ •

 

๐Ÿ“ข ๋ฏผ๊ฐํ•˜๊ฒŒ ์กฐ์ •ํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ

  โœ”  Booster ๋ชจ์–‘

  โœ”  eval_metric (ํ‰๊ฐ€ํ•จ์ˆ˜)  /  objective (๋ชฉ์ ํ•จ์ˆ˜)

  โœ”  eta

  โœ”  L1 form (L1 regulariztion form ์ด L2 ๋ณด๋‹ค outlier ์— ๋ฏผ๊ฐ)

  โœ”  L2 form

 

๐Ÿ“ข ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•ด ์กฐ์ •ํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ

  โœ”  learning rate ๋‚ฎ์ถ”๊ธฐ  →  n_estimators ๋Š” ๋†’์—ฌ์•ผ ํ•จ

  โœ”  max_depth ๋‚ฎ์ถ”๊ธฐ

  โœ”  min_child_weight ๋†’์ด๊ธฐ

  โœ”  gamma ๋†’์ด๊ธฐ

  โœ”  subsample, colsample_bytree ๋‚ฎ์ถ”๊ธฐ

 

โœ‹ Sample code 1. XGBClassifier

import xgboost as xgb

# declare the model
model = xgb.XGBClassifier()

# train the model (X_train and y_train are assumed to be prepared beforehand)
model.fit(X_train, y_train)

# predict with the model
y_pred = model.predict(X_test)

โœ‹ Sample code 2. XGBRegressor

import xgboost as xgb

# declare the model
my_model = xgb.XGBRegressor(learning_rate=0.1, max_depth=5, n_estimators=100)

# train the model
my_model.fit(X_train, y_train, verbose=False)

# predict with the model
y_pred = my_model.predict(X_test)

โœ‹ XGBoost ๋ชจํ˜• ์‹œ๊ฐํ™”

 

๐Ÿ‘‰ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์‹œ๊ฐํ™” library ์„ค์น˜ (graphviz)

  pip install graphviz
  conda install graphviz

 

๐Ÿ‘‰ xgb.plot_importance() ๋ฉ”์„œ๋“œ

  import xgboost as xgb
  import matplotlib.pyplot as plt

  # plot feature importances of the trained model (my_model from Sample code 2)
  xgb.plot_importance(my_model)
  plt.show()

 

๐Ÿ‘‰ xgb.plot_tree() ๋ฉ”์„œ๋“œ

import xgboost as xgb
import matplotlib.pyplot as plt

# num_trees : index of the tree to draw when the model contains several
# rankdir : direction of the tree; the default is top-to-bottom
# rankdir="LR" : draws the tree from left to right
xgb.plot_tree(my_model, num_trees=0, rankdir='LR')

fig = plt.gcf()
fig.set_size_inches(150, 100)  # enlarge the figure for readability

# to save the image:
# fig.savefig('tree.png')

plt.show()

๋ฐ˜์‘ํ˜•

'Machine Learning' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Machine Learning] Poisson Regression  (0) 2023.04.26
[Machine learning] scikit-learn pipeline  (0) 2023.04.21
[Machine Learning] Data Leakage  (0) 2023.04.21
[Machine Learning] Hyperparameter Tuning  (0) 2023.04.21
[Machine Learning] Feature Engineering  (0) 2023.04.21