Confusion Matrices and Threshold Tuning

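Tanaka (VPoE): "The model outputs its predictions as probabilities. By default we use 0.5 as the threshold for classifying a customer as churned or retained, but moving that threshold lets us tune the predictions to what the business needs."

You: "So if we lower the threshold, we flag more customers as likely to churn and miss fewer of them."

Tanaka (VPoE): "Exactly. The trade-off is more false alarms: customers who were never going to churn receive an intervention they did not need. What matters is finding that balance from a business perspective."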
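The trade-off in this exchange can be seen on a handful of made-up predictions. The sketch below uses hypothetical labels and probabilities (not the NetShop data) and a small helper, `count_errors`, introduced only for illustration.

```python
import numpy as np

# Hypothetical toy data: 1 = churned, 0 = retained (not from the NetShop dataset)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.32, 0.42, 0.55, 0.90])

def count_errors(threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    y_pred = (y_prob_toy >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true_toy == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true_toy == 1)))
    return fp, fn

for t in (0.5, 0.3):
    fp, fn = count_errors(t)
    print(f"threshold={t}: FP={fp}, FN={fn}")
# threshold=0.5: FP=1, FN=2
# threshold=0.3: FP=3, FN=0
```

Lowering the threshold from 0.5 to 0.3 removes both missed churners in this toy set, at the price of two extra false alarms.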

Reading the Confusion Matrix in Detail

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(cm, display_labels=['Retained', 'Churned'])
disp.plot(ax=ax, cmap='Blues')
ax.set_title('Confusion Matrix')
plt.tight_layout()
plt.show()

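# Business meaning of each cell
# For binary labels 0/1, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN (correctly predicted retained): {tn} customers → no problem")
print(f"FP (wrongly predicted churn):      {fp} customers → unnecessary coupon cost")
print(f"FN (missed churn):                 {fn} customers → lost customers (most serious)")
print(f"TP (correctly predicted churn):    {tp} customers → preventable with an intervention")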
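The unpacking order used above is easy to get wrong, so here is a minimal check on hypothetical labels. Passing `labels=[0, 1]` is an optional safeguard added in this sketch: it fixes the row and column order and keeps the matrix 2x2 even if one class happens to be absent.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 0 = retained, 1 = churned
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()

print(cm_demo)
print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
# TN=3, FP=1, FN=1, TP=3
```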

Calculating Costs

# Example cost settings for NetShop (in yen)
cost_coupon = 500        # cost of one coupon
cost_customer_loss = 30000  # annual value of one customer (lost when they churn)

# Cost of the current model
cost_fp = fp * cost_coupon       # cost of unnecessary coupons
cost_fn = fn * cost_customer_loss  # loss from missed churners
total_cost = cost_fp + cost_fn

print(f"Unnecessary coupon cost (FP): {cost_fp:,} yen")
print(f"Missed-churn loss (FN):       {cost_fn:,} yen")
print(f"Total cost:                   {total_cost:,} yen")
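To get a feel for how lopsided these two costs are, the same formula can be wrapped in a small function and applied to two sets of counts. The function name `misclassification_cost` and both sets of counts are hypothetical; only the 500 yen and 30,000 yen figures come from the settings above.

```python
def misclassification_cost(fp, fn, cost_fp=500, cost_fn=30_000):
    """Total cost in yen: wasted coupons plus value lost to missed churners."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical counts, for illustration only
cost_a = misclassification_cost(fp=40, fn=10)   # fewer false alarms, more misses
cost_b = misclassification_cost(fp=120, fn=4)   # more false alarms, fewer misses

print(f"Model A: {cost_a:,} yen")  # Model A: 320,000 yen
print(f"Model B: {cost_b:,} yen")  # Model B: 180,000 yen
```

With these settings one missed churner costs as much as 60 unnecessary coupons, which is why the option with three times as many false alarms still comes out cheaper.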

Tuning the Threshold

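The default threshold of 0.5 is not necessarily optimal. The sweep in this section evaluates on the test set for simplicity; in practice, choose the threshold on a separate validation set so that the test metrics remain an unbiased estimate.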

import numpy as np
import pandas as pd  # needed for pd.DataFrame below
from sklearn.metrics import f1_score, precision_score, recall_score

thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for threshold in thresholds:
    y_pred_t = (y_prob >= threshold).astype(int)
    results.append({
        'threshold': threshold,
        'precision': precision_score(y_test, y_pred_t, zero_division=0),
        'recall': recall_score(y_test, y_pred_t, zero_division=0),
        'f1': f1_score(y_test, y_pred_t, zero_division=0),
    })

results_df = pd.DataFrame(results)

plt.figure(figsize=(10, 6))
plt.plot(results_df['threshold'], results_df['precision'], label='Precision')
plt.plot(results_df['threshold'], results_df['recall'], label='Recall')
plt.plot(results_df['threshold'], results_df['f1'], label='F1 score', linewidth=2)
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metrics by Threshold')
plt.legend()
plt.grid(True)
plt.show()

# Threshold that maximizes the F1 score
best_idx = results_df['f1'].idxmax()
best_threshold = results_df.loc[best_idx, 'threshold']
print(f"Threshold with the highest F1: {best_threshold:.2f}")
print(f"F1 score: {results_df.loc[best_idx, 'f1']:.3f}")
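A grid with a step of 0.05 can step over the best threshold. One alternative, sketched here on hypothetical toy data, is to let `precision_recall_curve` evaluate every distinct predicted score and compute F1 at each one. The variable names are illustrative and not part of the lesson's code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy data: 1 = churned, 0 = retained
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.32, 0.40, 0.55, 0.90])

precision_d, recall_d, thresholds_d = precision_recall_curve(y_true_demo, y_prob_demo)

# The final (precision=1, recall=0) point has no threshold, so drop it
p, r = precision_d[:-1], recall_d[:-1]
denom = p + r
# Guard against 0/0 where precision and recall are both zero
f1_d = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_d = int(np.argmax(f1_d))
print(f"Best threshold: {thresholds_d[best_d]:.2f}, F1: {f1_d[best_d]:.3f}")
```

On this toy set the best threshold is one of the observed scores rather than a multiple of 0.05, which a coarse grid would not land on exactly.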

Threshold Optimization Based on Business Cost

# Search for the threshold that gives the best business outcome (highest net benefit)
cost_results = []

for threshold in thresholds:
    y_pred_t = (y_prob >= threshold).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()

    total_cost = fp_t * cost_coupon + fn_t * cost_customer_loss
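    # Revenue protected by preventing churn. Simplification: assumes every
    # flagged churner is retained and ignores the coupon cost for true positives.
    saved = tp_t * cost_customer_loss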

    cost_results.append({
        'threshold': threshold,
        'total_cost': total_cost,
        'saved': saved,
        'net_benefit': saved - total_cost,
        'fp': fp_t,
        'fn': fn_t,
        'tp': tp_t,
    })

cost_df = pd.DataFrame(cost_results)

plt.figure(figsize=(10, 6))
plt.plot(cost_df['threshold'], cost_df['net_benefit'], linewidth=2, color='green')
plt.xlabel('Threshold')
plt.ylabel('Net benefit (yen)')
plt.title('Net Benefit by Threshold')
plt.grid(True)
plt.show()

best_cost_idx = cost_df['net_benefit'].idxmax()
best_cost_threshold = cost_df.loc[best_cost_idx, 'threshold']
print(f"\nThreshold with the highest net benefit: {best_cost_threshold:.2f}")
print(f"Net benefit: {cost_df.loc[best_cost_idx, 'net_benefit']:,.0f} yen")
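As a sanity check on the grid search, a break-even threshold can be worked out by hand under a simplified cost model. The sketch assumes the predicted probabilities are well calibrated, that a false positive costs one coupon, that a false negative costs one customer's annual value, and that correct predictions cost nothing. Under those assumptions, sending a coupon to a customer with churn probability p has expected cost (1 - p) * cost_coupon, withholding it has expected cost p * cost_customer_loss, and the two are equal at p = cost_coupon / (cost_coupon + cost_customer_loss). The helper `expected_costs` is hypothetical.

```python
cost_coupon = 500
cost_customer_loss = 30_000

# Probability at which sending and withholding a coupon cost the same on average
break_even = cost_coupon / (cost_coupon + cost_customer_loss)
print(f"Break-even threshold: {break_even:.4f}")  # Break-even threshold: 0.0164

def expected_costs(p):
    """Expected cost in yen of (sending, withholding) a coupon at churn probability p."""
    send = (1 - p) * cost_coupon
    withhold = p * cost_customer_loss
    return send, withhold

print(expected_costs(0.10))  # sending is cheaper above the break-even point
print(expected_costs(0.01))  # withholding is cheaper below it
```

The break-even value is far below 0.1, the lowest threshold in the grid used above. If the search ends up picking the smallest grid value, that is a hint to extend the grid downward, and possibly to check how well calibrated the model's probabilities are.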

Precision-Recall Curve

from sklearn.metrics import precision_recall_curve, average_precision_score

precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, label=f'AP = {ap:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()

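On imbalanced data, the Precision-Recall curve shows differences in model performance more clearly than the ROC curve. The false positive rate on the ROC curve's x-axis is computed against the large number of true negatives, so it barely moves when false positives increase, whereas precision reacts to them directly.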
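A quick arithmetic example makes the difference concrete. The counts below are hypothetical, chosen to represent a population of 10,000 customers in which 1% churn.

```python
# Hypothetical confusion-matrix counts: 10,000 customers, 100 of whom churn
tn_i, fp_i, fn_i, tp_i = 9_400, 500, 20, 80

false_positive_rate = fp_i / (fp_i + tn_i)  # x-axis of the ROC curve
recall_i = tp_i / (tp_i + fn_i)             # y-axis of ROC, x-axis of the PR curve
precision_i = tp_i / (tp_i + fp_i)          # y-axis of the PR curve

print(f"False positive rate: {false_positive_rate:.3f}")  # 0.051
print(f"Recall:              {recall_i:.3f}")             # 0.800
print(f"Precision:           {precision_i:.3f}")          # 0.138
```

In ROC terms this operating point looks strong (80% recall at about a 5% false positive rate), yet only about one in seven flagged customers actually churns. The Precision-Recall curve exposes that directly.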

Guidelines for Threshold Tuning

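Business requirement	Threshold direction	Effect
Never miss a churner	Lower the threshold (e.g. 0.3)	Recall ↑ Precision ↓
Keep coupon costs down	Raise the threshold (e.g. 0.7)	Precision ↑ Recall ↓
Strike a balance	Threshold with the highest F1	Balanced precision and recall
Minimize business cost	Optimize against a cost function	Lowest cost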
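The first row of the table can be turned into a concrete rule: fix a minimum recall and take the highest threshold that still meets it, so that coupon spend is as low as the recall target allows. The sketch below is one possible way to do that; the function name, the toy data, and the recall targets are all assumptions made for illustration.

```python
import numpy as np

def highest_threshold_with_recall(y_true, y_prob, min_recall, grid):
    """Return the largest threshold in grid whose recall is >= min_recall, or None."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_pos = int(np.sum(y_true == 1))
    best = None
    # Recall never increases as the threshold rises, so the last match is the largest
    for t in sorted(grid):
        tp = int(np.sum((y_prob >= t) & (y_true == 1)))
        if n_pos > 0 and tp / n_pos >= min_recall:
            best = float(t)
    return best

# Hypothetical toy data: 1 = churned, 0 = retained
y_true_g = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob_g = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.32, 0.42, 0.55, 0.90]
grid_g = [0.2, 0.3, 0.4, 0.5, 0.6]

print(highest_threshold_with_recall(y_true_g, y_prob_g, 0.75, grid_g))  # 0.4
print(highest_threshold_with_recall(y_true_g, y_prob_g, 1.00, grid_g))  # 0.3
```

The same pattern works for the second row by swapping the constraint to a minimum precision and taking the lowest threshold that meets it.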

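Summary

Each cell of the confusion matrix corresponds to a business cost
The default threshold of 0.5 is not necessarily optimal
Changing the threshold adjusts the balance between precision and recall
Ideally, define a business cost function and optimize the threshold against it
The Precision-Recall curve is useful for evaluation on imbalanced data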

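Checklist

I can explain the business meaning of each cell of the confusion matrix
I understand the effect of adjusting the threshold
I understand how to optimize the threshold based on business cost
I understand what the Precision-Recall curve shows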

Next Steps

In the next lesson, we will learn specific techniques for dealing with overfitting.

Estimated reading time: 30 minutes