Decision Trees and Ensemble Models

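"The baseline AUC-ROC is 0.83. Not bad, but there should still be room to improve."

Tanaka, the VPoE, pulls up the model comparison table.

"Even in Kaggle competitions, GBDT is the king of tabular data. Let's work through the models step by step, from a single decision tree up to XGBoost and LightGBM. What matters is understanding each model's characteristics so that you can pick the right one for the job."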

How the Models Evolved

Decision tree  →  Random forest  →  XGBoost     →  LightGBM
(single tree)     (bagging)         (boosting)     (fast boosting)

Decision Tree

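import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, classification_report

# X_train, y_train, X_val, y_val are assumed to come from the earlier
# train/validation split; X_train is a pandas DataFrame.

# Decision tree: limit depth and leaf size to curb overfitting
dt_model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=20,
    random_state=42
)
dt_model.fit(X_train, y_train)

# Probability of the positive class on the validation set
y_val_proba = dt_model.predict_proba(X_val)[:, 1]
print(f"Decision tree AUC-ROC: {roc_auc_score(y_val, y_val_proba):.4f}")

# Feature importances, sorted from most to least important
feature_imp = pd.DataFrame({
    'feature': X_train.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 features:")
print(feature_imp.head(10).to_string(index=False))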

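Characteristics of Decision Trees

Advantages	Disadvantages
Easy to interpret	Prone to overfitting
Needs little preprocessing	Unstable (sensitive to changes in the data)
Captures nonlinear relationships	Low accuracy as a single model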
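To make the idea of a tree "split" concrete, here is a minimal pure-Python sketch of how one threshold on one feature could be chosen by minimizing weighted Gini impurity (the default criterion of scikit-learn's DecisionTreeClassifier). The functions `gini` and `best_split` and the toy data are illustrative assumptions, not scikit-learn's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    # Start from the impurity of the unsplit node; a split must improve on it
    best_threshold, best_score = None, gini(ys)
    for i in range(1, n):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = (left_x + right_x) / 2, score
    return best_threshold, best_score


# Hypothetical toy data: one feature and a binary label
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
threshold, score = best_split(xs, ys)
print(threshold, score)  # 3.5 0.0
```

A full tree repeats this search over every feature and then recurses into each side, which is why an unconstrained tree can keep splitting until it memorizes the training data.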

Random Forest

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    class_weight='balanced',  # reweight classes to counter label imbalance
    random_state=42,
    n_jobs=-1                 # use all available CPU cores
)
rf_model.fit(X_train, y_train)

y_val_proba = rf_model.predict_proba(X_val)[:, 1]
print(f"Random forest AUC-ROC: {roc_auc_score(y_val, y_val_proba):.4f}")

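How Bagging Works

Random forest = bagging + feature sampling
1. Draw bootstrap samples from the data (sampling with replacement)
2. Build a decision tree on each sample (only a random subset of features is considered at each split)
3. Combine the predictions of all trees by majority vote (classification) or averaging (regression); scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities
→ Reduces variance and curbs overfitting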
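The steps above can be sketched in pure Python. This is a simplified illustration under stated assumptions: the base learner is a one-threshold "stump" rather than a full tree, the data has a single feature (so the per-split feature sampling of a random forest is not shown), and `fit_stump`, `bagged_stumps`, and `predict_proba` are hypothetical helper names.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold t with the fewest training errors for: predict 1 if x > t."""
    candidates = sorted(set(xs))
    best_t, best_err = candidates[0], len(xs) + 1
    for t in candidates:
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(xs, ys, n_estimators, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: bootstrap sample
        # step 2: fit a base learner on that sample
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict_proba(thresholds, x):
    """Step 3: average the 0/1 votes, i.e. the fraction of stumps voting for class 1."""
    return sum(x > t for t in thresholds) / len(thresholds)


rng = random.Random(42)
# Hypothetical toy data: label is 1 when x > 0.5, with about 10% of labels flipped
xs = [rng.random() for _ in range(200)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.1) for x in xs]

thresholds = bagged_stumps(xs, ys, n_estimators=25, rng=rng)
print(f"P(y=1 | x=0.9) = {predict_proba(thresholds, 0.9):.2f}")
print(f"P(y=1 | x=0.1) = {predict_proba(thresholds, 0.1):.2f}")
```

Each stump sees a slightly different resample, so its threshold wobbles; averaging the votes smooths out that wobble, which is the variance reduction bagging is after.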

XGBoost

from xgboost import XGBClassifier

# Ratio of negatives to positives, used to up-weight the positive (minority) class
neg_pos_ratio = (y_train == 0).sum() / (y_train == 1).sum()

xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,         # fraction of rows sampled for each tree
    colsample_bytree=0.8,  # fraction of features sampled for each tree
    scale_pos_weight=neg_pos_ratio,
    random_state=42,
    eval_metric='auc'
)

xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

y_val_proba = xgb_model.predict_proba(X_val)[:, 1]
print(f"XGBoost AUC-ROC: {roc_auc_score(y_val, y_val_proba):.4f}")
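The `scale_pos_weight` value used above is just the ratio of negative to positive samples. As a small worked example on hypothetical labels (the helper name `compute_scale_pos_weight` is an assumption for illustration):

```python
def compute_scale_pos_weight(labels):
    """Return (#negatives / #positives) for binary 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos


# Hypothetical labels: 90 negatives and 10 positives
labels = [0] * 90 + [1] * 10
weight = compute_scale_pos_weight(labels)
print(weight)  # 9.0
```

With a 9:1 imbalance, each positive sample counts roughly nine times as much as a negative one in the loss, so the two classes contribute about equally overall.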

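How Boosting Works

XGBoost = gradient boosting + regularization
1. Build the next tree on the residuals (errors) of the current ensemble; more generally, on the gradient of the loss
2. Each new tree learns to correct the mistakes of the trees before it
3. Regularization keeps model complexity in check
→ Reduces bias and improves accuracy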
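The residual-fitting loop can be sketched in pure Python for squared loss, where the negative gradient is exactly the residual. This is a minimal illustration, not XGBoost's algorithm: it uses one-split regression stumps, omits regularization and second-order gradient information, and the names `fit_regression_stump` and `boost` as well as the toy target are assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares stump: returns (threshold, left_mean, right_mean)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # identical values cannot be separated
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lm, rm)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]          # step 1
        t, lm, rm = fit_regression_stump(xs, residuals)         # step 2
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
        stumps.append((t, lm, rm))
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Hypothetical nonlinear target: y = x^2 on a small grid
xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
base, stumps, mse_history = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.4f}, after round 20: {mse_history[-1]:.4f}")
```

Each round shrinks the training error a little, which is how boosting reduces bias; the learning rate deliberately slows this down so that no single tree dominates.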

LightGBM

from lightgbm import LGBMClassifier

lgbm_model = LGBMClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    subsample_freq=1,      # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    is_unbalance=True,     # reweight classes automatically for imbalanced labels
    random_state=42,
    verbose=-1             # suppress training logs
)

lgbm_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='auc'
)

y_val_proba = lgbm_model.predict_proba(X_val)[:, 1]
print(f"LightGBM AUC-ROC: {roc_auc_score(y_val, y_val_proba):.4f}")

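LightGBM vs XGBoost

Item	XGBoost	LightGBM
Tree growth strategy	Level-wise (default)	Leaf-wise
Speed	Moderate	Fast
Memory usage	Higher	Lower
Categorical features	Manual encoding traditionally required (recent versions offer native support via enable_categorical)	Native support
Overfitting risk	Moderate	Somewhat higher (because of leaf-wise growth)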
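Because leaf-wise growth can produce deep, lopsided trees, `num_leaves` is the main complexity control in LightGBM, and it is usually kept at or below 2^max_depth. The parameter set below is a hypothetical, untuned illustration of that relationship; only the parameter names and the stated defaults come from LightGBM itself.

```python
# Hypothetical parameter set for illustration; the values are not tuned.
max_depth = 6
lgbm_params = {
    "max_depth": max_depth,
    "num_leaves": 31,         # LightGBM's default; caps the leaves per tree
    "min_child_samples": 20,  # LightGBM's default minimum samples per leaf
    "learning_rate": 0.1,
}

# A binary tree of depth d has at most 2**d leaves, so a larger num_leaves
# could never be reached under this depth limit.
max_leaves_for_depth = 2 ** lgbm_params["max_depth"]
assert lgbm_params["num_leaves"] <= max_leaves_for_depth

print(f"num_leaves={lgbm_params['num_leaves']} "
      f"(upper bound at depth {max_depth}: {max_leaves_for_depth})")
```

Such a dictionary could then be unpacked into the estimator, for example `LGBMClassifier(**lgbm_params)`. Lowering `num_leaves` or raising `min_child_samples` is a common first step when a leaf-wise model overfits.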

Model Comparison

import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

# baseline_model is assumed to be the logistic regression fitted in the baseline section
models = {
    'Logistic Regression': baseline_model,
    'Decision Tree': dt_model,
    'Random Forest': rf_model,
    'XGBoost': xgb_model,
    'LightGBM': lgbm_model,
}

# Score every model on the same validation set
results = []
for name, model in models.items():
    y_proba = model.predict_proba(X_val)[:, 1]
    results.append({
        'Model': name,
        'AUC-ROC': roc_auc_score(y_val, y_proba),
        'PR-AUC': average_precision_score(y_val, y_proba),
    })

comparison = pd.DataFrame(results).sort_values('AUC-ROC', ascending=False)
print(comparison.to_string(index=False))
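As a reminder of what the AUC-ROC column measures, the score equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one. The pure-Python function below is a minimal sketch of that definition on hypothetical scores; it is quadratic in the number of samples and is meant for intuition, not as a replacement for `roc_auc_score`.

```python
def auc_roc(y_true, y_score):
    """AUC-ROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75. Because only the ranking matters, AUC-ROC is unaffected by class weighting that rescales probabilities monotonically, which is why PR-AUC is reported alongside it for imbalanced data.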

Expected Results

Model	AUC-ROC	PR-AUC
LightGBM	0.84-0.86	0.65-0.70
XGBoost	0.84-0.86	0.64-0.69
Random Forest	0.83-0.85	0.62-0.67
Logistic Regression	0.82-0.84	0.60-0.65
Decision Tree	0.75-0.80	0.50-0.55

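Summary

Item	Key point
Evolution	Decision tree → RF → XGBoost → LightGBM
Bagging	Reduces variance (random forest)
Boosting	Reduces bias (XGBoost, LightGBM)
King of tabular data	GBDT (XGBoost/LightGBM)
Imbalance handling	class_weight (random forest) / scale_pos_weight (XGBoost) / is_unbalance (LightGBM)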

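Checklist

I can explain the difference between bagging and boosting
I can build and compare the four models
I can explain the differences between XGBoost and LightGBM
I can configure each model's countermeasure for imbalanced data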

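Next Step

The base models have now been compared. Next comes feature engineering: we will build features that draw on domain knowledge and aim for further performance gains.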

Estimated reading time: 30 minutes