A Systematic Understanding of Performance Metrics

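"What do you think is the first thing to do in performance tuning?" CTO Sato asked. "Optimize the code? Add a cache? No. It's to measure correctly. You can't improve what you can't measure. Today we'll work thoroughly through the metrics that form the foundation of performance engineering."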

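1. The RED / USE Methods

There are systematic ways to classify performance metrics.

RED method (service level)

Metric	Definition	Example
Rate	Requests per unit of time	1,200 req/s
Errors	Proportion of requests that fail	0.5%
Duration	Time taken to process a request	p99 = 250ms

USE method (resource level)

Metric	Definition	Example
Utilization	How much of the resource is in use	CPU 75%
Saturation	Length of the queue of waiting work	12 pending requests
Errors	Number of resource errors	disk I/O errors: 3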

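// Monitoring class that collects RED and USE metrics together
import { Counter, Histogram, Gauge } from 'prom-client';

class PerformanceMetrics {
  // RED metrics
  private readonly requestRate = new Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests',
    // Pass a route template (e.g. '/users/:id') as `path`, not the raw URL,
    // so the number of label combinations stays bounded.
    labelNames: ['method', 'path', 'status_code'],
  });

  private readonly requestDuration = new Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request latency',
    labelNames: ['method', 'path'],
    buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  });

  private readonly errorRate = new Counter({
    name: 'http_request_errors_total',
    help: 'Total HTTP request errors',
    labelNames: ['method', 'path', 'error_type'],
  });

  // USE metrics
  private readonly cpuUtilization = new Gauge({
    name: 'process_cpu_utilization',
    help: 'CPU utilization percentage',
  });

  // Note: prom-client's collectDefaultMetrics() registers a gauge with this
  // same name, so use one or the other on a given registry.
  private readonly eventLoopSaturation = new Gauge({
    name: 'nodejs_eventloop_lag_seconds',
    help: 'Event loop lag in seconds',
  });

  recordRequest(method: string, path: string, statusCode: number, durationMs: number): void {
    this.requestRate.inc({ method, path, status_code: statusCode });
    // The histogram buckets are in seconds, so convert from milliseconds.
    this.requestDuration.observe(
      { method, path },
      durationMs / 1000
    );

    // Both 4xx and 5xx are counted; the error_type label keeps them apart.
    if (statusCode >= 400) {
      const errorType = statusCode >= 500 ? 'server_error' : 'client_error';
      this.errorRate.inc({ method, path, error_type: errorType });
    }
  }

  // Gauges hold the latest sample; call this from a periodic sampler.
  recordResourceUsage(cpuPercent: number, eventLoopLagSeconds: number): void {
    this.cpuUtilization.set(cpuPercent);
    this.eventLoopSaturation.set(eventLoopLagSeconds);
  }
}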
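
The counters and histogram above only accumulate raw observations; the values in the RED table are derived from them over a time window. As a rough sketch of that derivation, the following computes the three values directly from an in-memory list of requests. The `RequestRecord` shape and the `summarizeRed` name are assumptions made for illustration, and only 5xx responses are treated as errors here.

```typescript
// Hypothetical record shape for illustration; not part of prom-client.
interface RequestRecord {
  statusCode: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRatio: number; // Errors, as a fraction of all requests
  p99Ms: number;      // Duration, nearest-rank p99
}

function summarizeRed(records: RequestRecord[], windowSeconds: number): RedSummary {
  const n = records.length;
  if (n === 0 || windowSeconds <= 0) {
    return { ratePerSec: 0, errorRatio: 0, p99Ms: NaN };
  }
  const errors = records.filter((r) => r.statusCode >= 500).length;
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest rank: the smallest sample with at least 99% of samples at or below it.
  const p99Index = Math.max(0, Math.ceil(n * 0.99) - 1);
  return {
    ratePerSec: n / windowSeconds,
    errorRatio: errors / n,
    p99Ms: durations[p99Index] ?? NaN,
  };
}

// Ten requests observed over a 5-second window, the last one a 500.
const sample: RequestRecord[] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].map(
  (durationMs, i) => ({ statusCode: i === 9 ? 500 : 200, durationMs }),
);
const red = summarizeRed(sample, 5);
console.log(red); // ratePerSec: 2, errorRatio: 0.1, p99Ms: 100
```

With only ten samples the p99 is simply the slowest request, which is a reminder that high percentiles need a reasonably large window to be meaningful.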

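2. Latency Percentiles (p50 / p95 / p99)

"You'll never be first-rate as long as you're looking at averages," CTO Sato declared. "Look at p99. The slowness that one user in a hundred experiences is what decides a service's reputation."

What the percentiles mean

Percentile	Meaning	Use
p50 (median)	Half of all requests are at or below this value	Typical user experience
p90	90% of requests are at or below this value	Experience of most users
p95	95% of requests are at or below this value	Common SLO threshold
p99	99% of requests are at or below this value	Tail latency
p99.9	99.9% of requests are at or below this value	Threshold for large-scale services

Why averages are dangerous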

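// Example showing how the average and the percentiles diverge
function analyzeLatencyDistribution(latencies: number[]): void {
  if (latencies.length === 0) return; // nothing to analyze

  const sorted = [...latencies].sort((a, b) => a - b);
  const n = sorted.length;

  // Nearest-rank percentile: the smallest sample with at least a fraction p
  // of all samples at or below it. (Indexing with Math.floor(n * p) would
  // return the maximum as p99 when n = 100.)
  const nearestRank = (p: number): number => sorted[Math.max(0, Math.ceil(n * p) - 1)];

  const avg = latencies.reduce((sum, v) => sum + v, 0) / n;
  const p50 = nearestRank(0.50);
  const p95 = nearestRank(0.95);
  const p99 = nearestRank(0.99);

  console.log(`Average: ${avg.toFixed(1)}ms`);
  console.log(`p50:     ${p50}ms`);
  console.log(`p95:     ${p95}ms`);
  console.log(`p99:     ${p99}ms`);

  // Typical result:
  // Average: 45.2ms  ← looks fine at first glance
  // p50:     12ms    ← most requests are fast
  // p95:     120ms   ← 1 in 20 is slow
  // p99:     850ms   ← 1 in 100 is very slow
}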

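// Approximate percentile calculation from a histogram
class HistogramPercentile {
  private buckets: Map<number, number> = new Map();
  private totalCount = 0;
  private readonly boundaries: number[];

  constructor(boundaries: number[]) {
    // Both methods scan the bounds in ascending order, so sort a copy.
    this.boundaries = [...boundaries].sort((a, b) => a - b);
    for (const b of this.boundaries) {
      this.buckets.set(b, 0);
    }
    this.buckets.set(Infinity, 0);
  }

  observe(value: number): void {
    this.totalCount++;
    // Count the value in the first bucket whose upper bound covers it.
    for (const boundary of [...this.boundaries, Infinity]) {
      if (value <= boundary) {
        this.buckets.set(boundary, (this.buckets.get(boundary) ?? 0) + 1);
        break;
      }
    }
  }

  // Returns the upper bound of the bucket that contains the p-th percentile
  // (p in 0..100), so the result is only as precise as the bucket width.
  percentile(p: number): number {
    if (this.totalCount === 0) return NaN; // no observations yet
    const target = Math.max(1, Math.ceil(this.totalCount * (p / 100)));
    let cumulative = 0;

    for (const boundary of [...this.boundaries, Infinity]) {
      cumulative += this.buckets.get(boundary) ?? 0;
      if (cumulative >= target) {
        return boundary;
      }
    }
    return Infinity;
  }
}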
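
Returning the bucket's upper bound always rounds the estimate up to a bucket edge. A common refinement, and roughly what Prometheus's `histogram_quantile` function does, is to interpolate linearly inside the bucket on the assumption that observations are spread evenly within it. The sketch below illustrates that idea on cumulative bucket counts; the `CumulativeBucket` type and `interpolatedQuantile` name are hypothetical, and the bucket values are made up for the example.

```typescript
// Cumulative bucket: the count of observations <= upperBound.
interface CumulativeBucket {
  upperBound: number;
  cumulativeCount: number;
}

// q is a fraction in 0..1; buckets must be sorted by upperBound ascending.
function interpolatedQuantile(q: number, buckets: CumulativeBucket[]): number {
  const last = buckets[buckets.length - 1];
  if (!last || last.cumulativeCount === 0 || q < 0 || q > 1) return NaN;

  const rank = q * last.cumulativeCount;
  let lowerBound = 0; // assumes latencies are non-negative
  let lowerCount = 0;

  for (const bucket of buckets) {
    if (bucket.cumulativeCount >= rank) {
      const inBucket = bucket.cumulativeCount - lowerCount;
      // An open-ended or empty bucket cannot be interpolated; fall back
      // to the highest known finite bound.
      if (!Number.isFinite(bucket.upperBound) || inBucket === 0) return lowerBound;
      const fraction = (rank - lowerCount) / inBucket;
      return lowerBound + (bucket.upperBound - lowerBound) * fraction;
    }
    lowerBound = bucket.upperBound;
    lowerCount = bucket.cumulativeCount;
  }
  return NaN;
}

// Illustrative data: 100 observations, bounds in seconds.
const exampleBuckets: CumulativeBucket[] = [
  { upperBound: 0.1, cumulativeCount: 50 },
  { upperBound: 0.25, cumulativeCount: 90 },
  { upperBound: 0.5, cumulativeCount: 98 },
  { upperBound: 1, cumulativeCount: 100 },
];
const p95Estimate = interpolatedQuantile(0.95, exampleBuckets);
console.log(p95Estimate); // about 0.406s, where the upper-bound method reports 0.5s
```

Either way the estimate cannot be more accurate than the bucket layout allows, so bucket bounds should sit close to the latencies that matter for the SLO.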

3. Throughput and Little's Law

Little's Law

L = λ × W

L: average number of requests in the system (concurrency)
λ: arrival rate (throughput, req/s)
W: average time spent in the system (latency, seconds)

// Capacity calculation using Little's Law
interface CapacityEstimate {
  concurrentRequests: number;
  throughput: number;      // req/s
  avgLatency: number;      // seconds
  requiredInstances: number;
}

function estimateCapacity(
  targetThroughput: number,  // req/s
  avgLatencyMs: number,      // ms
  maxConcurrencyPerInstance: number
): CapacityEstimate {
  const avgLatencySec = avgLatencyMs / 1000;

  // L = λ × W
  const concurrentRequests = targetThroughput * avgLatencySec;
  const requiredInstances = Math.ceil(concurrentRequests / maxConcurrencyPerInstance);

  return {
    concurrentRequests,
    throughput: targetThroughput,
    avgLatency: avgLatencySec,
    requiredInstances,
  };
}

// Example: 10,000 req/s, 50ms average latency, at most 100 concurrent requests per instance
const estimate = estimateCapacity(10000, 50, 100);
// concurrentRequests = 10000 * 0.05 = 500
// requiredInstances = ceil(500 / 100) = 5
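
The same relation can be read in the other direction: with the concurrency limit fixed, the sustainable throughput falls as latency rises. A minimal sketch, reusing the numbers from the example above; `maxThroughput` is a hypothetical helper, and it assumes the concurrency limit is the only bottleneck.

```typescript
// Little's Law solved for throughput: λ = L / W.
function maxThroughput(totalConcurrency: number, avgLatencyMs: number): number {
  if (avgLatencyMs <= 0) {
    throw new RangeError('avgLatencyMs must be positive');
  }
  // Multiply before dividing to convert ms to seconds without rounding noise.
  return (totalConcurrency * 1000) / avgLatencyMs; // req/s
}

const poolConcurrency = 5 * 100; // 5 instances, 100 concurrent requests each
const atBaseline = maxThroughput(poolConcurrency, 50); // 10000 req/s
const atDegraded = maxThroughput(poolConcurrency, 80); // 6250 req/s
console.log({ atBaseline, atDegraded });
```

In this example a rise in average latency from 50ms to 80ms cuts the ceiling by more than a third, which is one reason to provision some headroom beyond the bare `requiredInstances` figure.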

4. Apdex Score

The Application Performance Index (Apdex) expresses user satisfaction as a score from 0 to 1.

Apdex = (Satisfied + Tolerating × 0.5) / Total

- Satisfied: at or below T (e.g. 500ms or less)
- Tolerating: above T, up to 4T (e.g. over 500ms, up to 2000ms)
- Frustrated: above 4T (e.g. over 2000ms)

interface ApdexConfig {
  threshold: number; // T (ms)
}

interface ApdexResult {
  score: number;
  satisfied: number;
  tolerating: number;
  frustrated: number;
  total: number;
  rating: 'Excellent' | 'Good' | 'Fair' | 'Poor' | 'Unacceptable';
}

function calculateApdex(latencies: number[], config: ApdexConfig): ApdexResult {
  const { threshold } = config;
  let satisfied = 0;
  let tolerating = 0;
  let frustrated = 0;

  for (const latency of latencies) {
    if (latency <= threshold) {
      satisfied++;
    } else if (latency <= threshold * 4) {
      tolerating++;
    } else {
      frustrated++;
    }
  }

  const total = latencies.length;
  const score = (satisfied + tolerating * 0.5) / total;

  const rating = score >= 0.94 ? 'Excellent'
    : score >= 0.85 ? 'Good'
    : score >= 0.70 ? 'Fair'
    : score >= 0.50 ? 'Poor'
    : 'Unacceptable';

  return { score, satisfied, tolerating, frustrated, total, rating };
}

// Example: calculated with T = 500ms
const result = calculateApdex(
  [100, 200, 300, 450, 600, 800, 1500, 2100, 3000, 5500],
  { threshold: 500 }
);
// satisfied: 4, tolerating: 3, frustrated: 3
// Apdex = (4 + 3*0.5) / 10 = 0.55 → Poor
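
When raw latencies are not retained and only cumulative histogram counts are available, the same score can be computed from two bucket counts, provided T and 4T are both bucket boundaries. For instance, the bucket list in section 1 contains 0.25 and 1, so T = 250ms would qualify. A sketch under that assumption; `apdexFromBuckets` is a hypothetical name.

```typescript
// countLeT:  requests with latency <= T   (cumulative bucket at T)
// countLe4T: requests with latency <= 4T  (cumulative bucket at 4T)
function apdexFromBuckets(countLeT: number, countLe4T: number, total: number): number {
  if (total <= 0) return NaN;
  // countLe4T already includes countLeT, so
  // (countLeT + countLe4T) / 2 equals satisfied + tolerating / 2.
  return (countLeT + countLe4T) / 2 / total;
}

// Same data as the example above: 4 samples <= 500ms, 7 samples <= 2000ms.
const apdexScore = apdexFromBuckets(4, 7, 10);
console.log(apdexScore); // 0.55
```

The result matches the per-request calculation because the halved sum counts each satisfied request once and each tolerating request half.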

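5. How SLI / SLO / SLA Relate to Metrics

Concept	Definition	Example
SLI (Service Level Indicator)	A measurable indicator	p99 latency
SLO (Service Level Objective)	An internal target	p99 < 200ms for 99.9% of the time
SLA (Service Level Agreement)	An external contract	99.95% availability, refund on violation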
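
To make the first two rows concrete: the SLI is what gets measured, and the SLO is the target it is compared against. The sketch below assumes the SLI is sampled as one p99 value per minute (an assumption; how the window is evaluated varies between systems) and checks it against the example SLO from the table. The `evaluateLatencySlo` name and the sample values are illustrative.

```typescript
interface SloEvaluation {
  goodMinutes: number;
  violationMinutes: number;
  sli: number;  // fraction of minutes that met the latency threshold
  met: boolean; // whether the SLI reaches the SLO target
}

function evaluateLatencySlo(
  p99PerMinuteMs: number[],
  thresholdMs: number,
  target: number,
): SloEvaluation {
  const total = p99PerMinuteMs.length;
  const goodMinutes = p99PerMinuteMs.filter((p99) => p99 < thresholdMs).length;
  const sli = total > 0 ? goodMinutes / total : NaN;
  return {
    goodMinutes,
    violationMinutes: total - goodMinutes,
    sli,
    met: sli >= target, // false when there is no data (NaN)
  };
}

// Four one-minute samples against "p99 < 200ms for 99.9% of the time".
const evaluation = evaluateLatencySlo([120, 150, 250, 180], 200, 0.999);
console.log(evaluation); // goodMinutes: 3, violationMinutes: 1, sli: 0.75, met: false
```

The `violationMinutes` figure is the kind of input an error budget calculation consumes.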

// SLO-based error budget calculation
interface SloConfig {
  target: number;        // e.g. 0.999 (99.9%)
  windowDays: number;    // e.g. 30
}

interface ErrorBudget {
  totalMinutes: number;
  allowedDowntimeMinutes: number;
  consumedMinutes: number;
  remainingMinutes: number;
  burnRate: number;       // 1.0 = consumed exactly on schedule
}

function calculateErrorBudget(
  config: SloConfig,
  violationMinutes: number,
  elapsedDays: number
): ErrorBudget {
  const totalMinutes = config.windowDays * 24 * 60;
  const allowedDowntimeMinutes = totalMinutes * (1 - config.target);
  const remainingMinutes = allowedDowntimeMinutes - violationMinutes;

  // Burn rate: 1.0 means even consumption, 2.0 means the budget is burning twice as fast
  const expectedConsumption = (elapsedDays / config.windowDays) * allowedDowntimeMinutes;
  const burnRate = expectedConsumption > 0 ? violationMinutes / expectedConsumption : 0;

  return {
    totalMinutes,
    allowedDowntimeMinutes,
    consumedMinutes: violationMinutes,
    remainingMinutes,
    burnRate,
  };
}

// Example: 99.9% SLO, 30-day window, 20 minutes of downtime after 10 days
const budget = calculateErrorBudget({ target: 0.999, windowDays: 30 }, 20, 10);
// allowedDowntime = 43200 * 0.001 = 43.2 min
// remaining = 43.2 - 20 = 23.2 min
// expected consumption at day 10 = (10/30) * 43.2 = 14.4 min
// burnRate = 20 / 14.4 = 1.39 (consuming 39% faster than planned)
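
A burn rate above 1.0 implies the budget will run out before the window ends. As a rough illustration, the sketch below projects when that happens, assuming consumption continues at the average pace seen so far (a simplification; real incidents are bursty). `projectBudget` and its field names are hypothetical.

```typescript
interface BudgetProjection {
  consumptionPerDay: number;  // minutes of budget used per elapsed day
  daysUntilExhausted: number; // Infinity if nothing has been consumed
  lastsWholeWindow: boolean;
}

function projectBudget(
  allowedMinutes: number,
  consumedMinutes: number,
  elapsedDays: number,
  windowDays: number,
): BudgetProjection {
  if (elapsedDays <= 0) {
    throw new RangeError('elapsedDays must be positive');
  }
  const consumptionPerDay = consumedMinutes / elapsedDays;
  const remaining = Math.max(0, allowedMinutes - consumedMinutes);
  const daysUntilExhausted =
    consumptionPerDay > 0 ? remaining / consumptionPerDay : Infinity;
  return {
    consumptionPerDay,
    daysUntilExhausted,
    lastsWholeWindow: elapsedDays + daysUntilExhausted >= windowDays,
  };
}

// Numbers from the example above: 43.2 min allowed, 20 min used after 10 of 30 days.
const projection = projectBudget(43.2, 20, 10, 30);
console.log(projection);
// consumptionPerDay is 2 min/day, leaving about 11.6 more days of budget,
// so it runs out around day 21.6, before the 30-day window ends.
```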

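Sidebar: Google's "Four Golden Signals"

The four signals that Google's SRE book proposes as essential for monitoring a service:

Latency - measure the latency of successful and failed requests separately
Traffic - the volume of requests to the service (HTTP req/s, DB queries/s)
Errors - the rate of failed requests: explicit (5xx), implicit (a success response with the wrong content), or by policy (slower than the committed response time)
Saturation - how "full" a resource is (CPU, memory, disk I/O)

These can be seen as a combination of RED and USE.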

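Summary

Topic	Key point
RED method	Track the service level with Rate/Errors/Duration
USE method	Track the resource level with Utilization/Saturation/Errors
Percentiles	p50/p95/p99 make tail latency visible; averages are dangerous
Little's Law	Estimate concurrency and capacity with L = λ × W
Apdex	Quantifies user satisfaction as a score from 0 to 1
SLI/SLO/SLA	Three layers: metric → target → contract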

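Checklist

I can explain the difference between the RED and USE methods
I understand what percentiles mean and why averages are dangerous
I can do a capacity calculation using Little's Law
I understand how the Apdex score is calculated
I can explain how SLI, SLO, and SLA relate to each other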

Next Step

You now understand how the metrics fit together. Next, learn profiling techniques and how to pinpoint bottlenecks.

Estimated reading time: 30 minutes