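Evaluation Metrics for Imbalanced Data

"If you can't measure a model's quality correctly, you can't improve it."

Tanaka, the VPoE, brings up a list of evaluation metrics.

"For fraud detection, plain Accuracy is of no use. What matters is choosing metrics that tie directly to business cost."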

PR Curve and PR-AUC

Precision-Recall Curve

Plot how Precision and Recall change as the decision threshold is varied.

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Get the model's predicted probability for the positive (fraud) class
y_prob = model.predict_proba(X_test)[:, 1]

# PR curve; average precision is used here as the PR-AUC summary
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
pr_auc = average_precision_score(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'PR-AUC = {pr_auc:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)

# Baseline: the Precision of random prediction equals the fraud rate
baseline = y_test.mean()
plt.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline = {baseline:.4f}')
plt.legend()
plt.show()

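Interpreting PR-AUC

How to read PR-AUC:
  1.0:      perfect detection (every fraud caught, with no false alarms)
  0.5:      what random prediction scores when the classes are balanced 50:50
  baseline: for random prediction, PR-AUC equals the fraud rate (0.0017 here)

For Credit Card Fraud Detection (rough rules of thumb):
  PR-AUC > 0.70: a good model
  PR-AUC > 0.80: an excellent model
  PR-AUC > 0.90: an outstanding model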
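To make the baseline concrete, here is a minimal sketch on synthetic data. The `average_precision` helper is a hand-rolled simplification (it ignores tied scores) written for this illustration, not scikit-learn's implementation, and the sample size and fraud rate are assumed values. It shows that scores unrelated to the label land near the fraud rate, while a perfect ranking reaches 1.0.

```python
import numpy as np


def average_precision(y_true, scores):
    """Hand-rolled average precision: mean of the precision at each positive's rank.

    Simplified sketch that ignores tied scores.
    """
    order = np.argsort(-scores, kind="stable")    # highest score first
    y_sorted = y_true[order]
    ranks = np.arange(1, len(y_sorted) + 1)
    precision_at_rank = np.cumsum(y_sorted) / ranks
    return precision_at_rank[y_sorted == 1].mean()


rng = np.random.default_rng(0)
n_samples, fraud_rate = 200_000, 0.0017           # assumed, illustrative values
y_true = (rng.random(n_samples) < fraud_rate).astype(int)

random_scores = rng.random(n_samples)             # scores unrelated to the label
perfect_scores = y_true.astype(float)             # every fraud ranked above every normal case

ap_random = average_precision(y_true, random_scores)
ap_perfect = average_precision(y_true, perfect_scores)

print(f"fraud rate:   {y_true.mean():.4f}")
print(f"AP (random):  {ap_random:.4f}")
print(f"AP (perfect): {ap_perfect:.4f}")
```

This is why a PR-AUC of, say, 0.70 on this kind of data is far above chance even though it would look mediocre on a 0-to-1 scale anchored at 0.5.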

F1-Score and Its Variants

F1-Score

from sklearn.metrics import f1_score

# F1 at a threshold of 0.5
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_test, y_pred)
print(f"F1-Score (threshold=0.5): {f1:.4f}")

F-beta Score

Use this when business requirements call for weighting Recall and Precision differently.

from sklearn.metrics import fbeta_score

# beta > 1 favors Recall; beta < 1 favors Precision
f2 = fbeta_score(y_test, y_pred, beta=2)  # Recall-weighted
f05 = fbeta_score(y_test, y_pred, beta=0.5)  # Precision-weighted

print(f"F2-Score (Recall-weighted): {f2:.4f}")
print(f"F0.5-Score (Precision-weighted): {f05:.4f}")

How to choose beta:

Fraud detection (misses are costly): beta=2 → Recall is treated as twice as important as Precision
Fraud detection (balanced): beta=1 → F1-Score
Spam detection (avoiding false positives): beta=0.5 → Precision is emphasized
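The effect of beta is easiest to see in the formula itself, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below implements it directly; the `f_beta` helper and the precision/recall values are made up for illustration. Because Recall is higher than Precision in this example, the score rises as beta grows.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Illustrative values: a model that catches most fraud but raises many false alarms
precision, recall = 0.5, 0.8

f05 = f_beta(precision, recall, beta=0.5)   # leans toward Precision
f1 = f_beta(precision, recall, beta=1.0)    # harmonic mean of the two
f2 = f_beta(precision, recall, beta=2.0)    # leans toward Recall

print(f"F0.5 = {f05:.4f}, F1 = {f1:.4f}, F2 = {f2:.4f}")
```

With the values swapped (high Precision, low Recall) the ordering reverses, which is a quick way to check which error type a chosen beta rewards.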

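Threshold Search for F1 Optimization

import numpy as np

# Compute F1 at every candidate threshold and pick the best one
# (in practice, tune the threshold on a validation set, not on the final test set)
f1_scores = []
for threshold in thresholds:
    y_pred_t = (y_prob >= threshold).astype(int)
    f1_t = f1_score(y_test, y_pred_t)
    f1_scores.append(f1_t)

best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]

print(f"Best threshold: {best_threshold:.4f}")
print(f"Max F1-Score: {best_f1:.4f}")

# Note how much F1 changes just by moving the threshold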
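Looping over every threshold recomputes the confusion matrix each time, which gets slow on large test sets. As an alternative sketch, F1 can be derived directly from the `precision` and `recall` arrays that `precision_recall_curve` already returned. The `best_f1_threshold` helper is hypothetical, and the demo arrays are hand-written in that function's layout (one more precision/recall entry than thresholds) for labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`.

```python
import numpy as np


def best_f1_threshold(precision, recall, thresholds):
    """Return (threshold, F1) maximizing F1, given PR-curve arrays."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    # Guard against 0/0 where both precision and recall are zero
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    f1 = f1[:-1]  # the final (precision=1, recall=0) point has no threshold
    best_idx = int(np.argmax(f1))
    return float(thresholds[best_idx]), float(f1[best_idx])


# Hand-written PR-curve arrays for a 4-sample toy problem
demo_precision = [0.5, 2 / 3, 0.5, 1.0, 1.0]
demo_recall = [1.0, 1.0, 0.5, 0.5, 0.0]
demo_thresholds = [0.1, 0.35, 0.4, 0.8]

best_t, best_f1_value = best_f1_threshold(demo_precision, demo_recall, demo_thresholds)
print(f"Best threshold: {best_t:.2f}, F1: {best_f1_value:.2f}")
```

On real data the call would be `best_f1_threshold(precision, recall, thresholds)` with the arrays from the PR-curve step.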

Business Cost Minimization Metric

PR-AUC and F1 are general-purpose metrics, but in business the ultimate metric is minimizing cost.

Defining the Cost Function

import numpy as np

def business_cost(y_true, y_pred, cost_fn=50000, cost_fp=500):
    """Total business cost in JPY.

    cost_fn: cost of one missed fraud (false negative)
    cost_fp: cost of one false alarm (false positive)
    """
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    total_cost = fn * cost_fn + fp * cost_fp
    return total_cost

# Compute the cost at each threshold on a 0.01-step grid
costs = []
threshold_range = np.arange(0.01, 1.0, 0.01)

for threshold in threshold_range:
    y_pred_t = (y_prob >= threshold).astype(int)
    cost = business_cost(y_test, y_pred_t)
    costs.append(cost)

best_cost_idx = np.argmin(costs)
best_cost_threshold = threshold_range[best_cost_idx]
min_cost = costs[best_cost_idx]

print(f"Cost-minimizing threshold: {best_cost_threshold:.2f}")
print(f"Minimum cost: {min_cost:,.0f} JPY")
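As a sanity check on the grid search, there is a closed-form threshold under two assumptions that the text does not state: the predicted probabilities are well calibrated, and correct decisions cost nothing. Flagging a transaction with fraud probability p costs (1 - p) * cost_fp in expectation, letting it through costs p * cost_fn, so flagging is cheaper once p exceeds cost_fp / (cost_fp + cost_fn). The helper names below are hypothetical.

```python
def cost_optimal_threshold(cost_fn, cost_fp):
    """Break-even probability above which flagging is cheaper than passing.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    return cost_fp / (cost_fp + cost_fn)


def expected_cost_if_flagged(p, cost_fp):
    """Expected cost of flagging: we pay cost_fp when the case was actually normal."""
    return (1 - p) * cost_fp


def expected_cost_if_passed(p, cost_fn):
    """Expected cost of letting it through: we pay cost_fn when it was fraud."""
    return p * cost_fn


# Same costs as the defaults of business_cost in the text
theoretical_threshold = cost_optimal_threshold(cost_fn=50_000, cost_fp=500)
print(f"Theoretical threshold: {theoretical_threshold:.4f}")

# At a probability slightly above the threshold, flagging is the cheaper action
p = theoretical_threshold + 0.001
print(expected_cost_if_flagged(p, 500) < expected_cost_if_passed(p, 50_000))
```

With a 100:1 cost ratio the break-even point sits just below 0.01, which is the lower edge of `threshold_range`. If the empirical optimum lands on that edge, a finer grid near zero may be worth trying. Real models are often not well calibrated, so the empirical search remains the practical reference.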

Visualizing the Cost Curve

COST_FN, COST_FP = 50000, 500  # same unit costs (JPY) as business_cost's defaults

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total cost curve
axes[0].plot(threshold_range, costs)
axes[0].axvline(x=best_cost_threshold, color='r', linestyle='--',
                label=f'Best threshold={best_cost_threshold:.2f}')
axes[0].set_xlabel('Threshold')
axes[0].set_ylabel('Total cost (JPY)')
axes[0].set_title('Threshold vs. Business Cost')
axes[0].legend()

# Breakdown into FN cost and FP cost
fn_costs = []
fp_costs = []
for threshold in threshold_range:
    y_pred_t = (y_prob >= threshold).astype(int)
    fn = ((y_test == 1) & (y_pred_t == 0)).sum()
    fp = ((y_test == 0) & (y_pred_t == 1)).sum()
    fn_costs.append(fn * COST_FN)
    fp_costs.append(fp * COST_FP)

axes[1].plot(threshold_range, fn_costs, label='FN cost (missed fraud)')
axes[1].plot(threshold_range, fp_costs, label='FP cost (false alarms)')
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('Cost (JPY)')
axes[1].set_title('Cost Breakdown')
axes[1].legend()

plt.tight_layout()
plt.show()

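Choosing Between Metrics

Metric	Use	Suitability for fraud detection
Accuracy	General classification performance	Unsuitable
ROC-AUC	Overall discriminative ability	Reference only
PR-AUC	Overall performance under imbalance	Primary metric
F1-Score	Precision/Recall balance	Secondary metric
F2-Score	Recall-weighted evaluation	When misses matter most
Business cost	Cost minimization	Final decision metric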
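The "Unsuitable" rating for Accuracy is easy to demonstrate with arithmetic. The sketch below uses assumed counts (170 frauds in 100,000 transactions, roughly the 0.17% fraud rate mentioned earlier) and a degenerate model that labels everything as normal.

```python
# Assumed, illustrative counts: about 0.17% of transactions are fraud
n_total, n_fraud = 100_000, 170

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # a "model" that always predicts normal

# Confusion-matrix cells counted by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.4f}")   # looks excellent
print(f"Recall:   {recall:.4f}")     # yet not a single fraud is caught
```

Accuracy is dominated by the majority class, so it barely moves whether the model catches fraud or not. Recall, Precision, and the cost function all react to the cases that actually matter.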

Evaluation Best Practices

from sklearn.metrics import classification_report, average_precision_score

def comprehensive_evaluation(y_true, y_prob, threshold=0.5,
                              cost_fn=50000, cost_fp=500):
    """Print a combined report: per-class metrics, PR-AUC, and business cost."""
    y_pred = (y_prob >= threshold).astype(int)

    print(f"=== Threshold: {threshold} ===")
    print(classification_report(y_true, y_pred,
                                target_names=['Normal', 'Fraud']))

    # Threshold-independent summary
    pr_auc = average_precision_score(y_true, y_prob)
    print(f"PR-AUC: {pr_auc:.4f}")

    # Uses business_cost defined earlier
    cost = business_cost(y_true, y_pred, cost_fn, cost_fp)
    print(f"Business cost: {cost:,.0f} JPY")

    fn = ((y_true == 1) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    print(f"Missed frauds: {fn} / False alarms: {fp}")

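Summary

Item	Key point
PR-AUC	Primary metric for imbalanced data. Reflects real-world performance better than ROC-AUC
F-beta	beta>1 favors Recall, beta<1 favors Precision
Cost optimization	The end goal is a threshold optimized against the business costs of FN and FP
Importance of the threshold	The same model can perform very differently depending on the threshold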

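Checklist

I can read a PR curve and interpret PR-AUC
I understand how to choose beta for the F-beta Score
I can define a business cost function and optimize the threshold against it
I can decide which evaluation metric to use in which situation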

Next Steps

Now that you understand the evaluation metrics, the next step is a hands-on exercise analyzing the Credit Card Fraud data.

Estimated reading time: 30 minutes