Availability and Fault Tolerance - L0 Curriculum

Story

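CTO Sato

Last year, a service went down for three hours. 500,000 users were affected, and the estimated loss ran to tens of millions of yen.

You

What was the cause?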

CTO Sato

There was only one database server, and its disk failed. It was a classic Single Point of Failure (SPOF).

You

If the only server breaks, everything stops... It sounds obvious, but it really does happen a lot, doesn't it?

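CTO Sato

Exactly. The assumption that "it will never break" is the biggest risk. Systems always break. How the system behaves when it does: that is the essence of availability design.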

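Calculating Availability

SLAs and Uptime

Availability is expressed as a percentage and is commonly described by its number of "nines". The downtime figures below assume a 365-day year and an average month of 365/12 days.

SLA	Downtime per year	Downtime per month	Downtime per week
99% (two nines)	3 days 15 h 36 min	7 h 18 min	1 h 41 min
99.9% (three nines)	8 h 45 min 36 s	43 min 48 s	10 min 5 s
99.95%	4 h 22 min 48 s	21 min 54 s	5 min 2 s
99.99% (four nines)	52 min 33.6 s	4 min 23 s	1 min 0.5 s
99.999% (five nines)	5 min 15.4 s	26.3 s	6 s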
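
Each cell in the table is the unavailable fraction multiplied by the length of the period. The following is a minimal sketch of that arithmetic, assuming a 365-day year and an average month of 365/12 days; `PERIOD_HOURS` and `downtimeSeconds` are illustrative names, not part of the lesson's code.

```typescript
// Hours in each period. Assumes a 365-day year and an average month of 365/12 days.
const PERIOD_HOURS = {
  year: 365 * 24,
  month: (365 * 24) / 12,
  week: 7 * 24,
} as const;

type Period = keyof typeof PERIOD_HOURS;

// Downtime budget in seconds for an availability in the range 0..1 over one period.
function downtimeSeconds(availability: number, period: Period): number {
  if (availability < 0 || availability > 1) {
    throw new RangeError("availability must be between 0 and 1");
  }
  return PERIOD_HOURS[period] * 3600 * (1 - availability);
}

// Three nines over one month: roughly 2628 seconds, i.e. 43 min 48 s.
const threeNinesMonthly = downtimeSeconds(0.999, "month");
console.log(`${(threeNinesMonthly / 60).toFixed(1)} min per month`);
```

Working in seconds keeps the short budgets of four and five nines readable without extra formatting logic.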

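How to Calculate Availability

// Availability calculation utilities
class AvailabilityCalculator {
  // Availability of a serial configuration.
  // Every component must be up, so the availabilities multiply.
  // Assumes component failures are independent.
  static serial(...availabilities: number[]): number {
    return availabilities.reduce((total, a) => total * a, 1);
  }

  // Availability of a parallel (redundant) configuration.
  // The system is up as long as at least one component is up.
  static parallel(...availabilities: number[]): number {
    const unavailability = availabilities.reduce(
      (total, a) => total * (1 - a),
      1
    );
    return 1 - unavailability;
  }

  // Expected annual downtime as display text.
  // Uses a 365-day year, matching the SLA table.
  static annualDowntime(availability: number): string {
    const totalMinutes = 365 * 24 * 60;
    const downtimeMinutes = totalMinutes * (1 - availability);

    // Round before splitting so the minutes part can never display as 60.
    const rounded = Math.round(downtimeMinutes);
    if (rounded >= 60) {
      const hours = Math.floor(rounded / 60);
      const minutes = rounded % 60;
      return `${hours} h ${minutes} min`;
    }
    // Keep one decimal place for short downtimes such as 1.6 minutes.
    return `${downtimeMinutes.toFixed(1)} min`;
  }
}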

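Worked Example: Availability of an E-Commerce Site

E-commerce site architecture:
  Web Server (99.9%) → API Server (99.9%) → Database (99.9%)

Availability in series:
  0.999 × 0.999 × 0.999 ≈ 0.997 (99.7%)
  Annual downtime: about 26 hours

Improvement: make each component redundant
  Web Server: 2 in parallel → 1-(1-0.999)^2 = 0.999999 (99.9999%)
  API Server: 2 in parallel → 0.999999
  Database:   2 in parallel → 0.999999

Availability in series after the improvement:
  0.999999 × 0.999999 × 0.999999 ≈ 0.999997 (99.9997%)
  Annual downtime: about 1.6 minutes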
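
The arithmetic of this example can be checked with a few lines of code. This is a sketch under the same assumptions as the example (three tiers at 99.9% each, two instances per tier after the improvement) plus one assumption of its own: failures are independent. Correlated failures such as shared power or a common software bug make real-world figures worse. The standalone `serial` and `parallel` helpers are restated here only so the snippet runs on its own.

```typescript
// Serial: every tier must be up, so availabilities multiply.
function serial(...availabilities: number[]): number {
  return availabilities.reduce((total, a) => total * a, 1);
}

// Parallel: the tier is down only if every instance is down at once.
function parallel(...availabilities: number[]): number {
  return 1 - availabilities.reduce((total, a) => total * (1 - a), 1);
}

const HOURS_PER_YEAR = 365 * 24;

// Before: one instance per tier.
const single = serial(0.999, 0.999, 0.999);

// After: two instances per tier, tiers still in series.
const tier = parallel(0.999, 0.999);
const redundant = serial(tier, tier, tier);

console.log(single.toFixed(6));                                     // 0.997003
console.log(((1 - single) * HOURS_PER_YEAR).toFixed(1) + " h");     // about 26.3 hours
console.log(redundant.toFixed(6));                                  // 0.999997
console.log(((1 - redundant) * HOURS_PER_YEAR * 60).toFixed(1) + " min"); // about 1.6 minutes
```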

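Identifying and Eliminating Single Points of Failure (SPOFs)

Typical SPOF Locations

graph TD
    Internet["Internet"]
    DNS["DNS<br/>← often a SPOF"]
    LB["LB<br/>← the load balancer itself is a SPOF"]
    App["App (1 instance)<br/>← a SPOF if there is only one"]
    DB["DB (1 instance)<br/>← the most common SPOF"]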

    Internet --> DNS --> LB --> App --> DB

    classDef warnStyle fill:#d9534f,stroke:#b52b27,color:#fff
    class DNS,LB,App,DB warnStyle

    classDef netStyle fill:#67b7dc,stroke:#3a8ab5,color:#fff
    class Internet netStyle

SPOF Elimination Checklist

interface SPOFAssessment {
  component: string;
  isSPOF: boolean;
  impact: 'CRITICAL' | 'HIGH' | 'MEDIUM' | 'LOW';
  mitigation: string;
}

const assessments: SPOFAssessment[] = [
  {
    component: 'Database',
    isSPOF: true,
    impact: 'CRITICAL',
    mitigation: 'Primary-standby configuration + automatic failover',
  },
  {
    component: 'Load balancer',
    isSPOF: true,
    impact: 'CRITICAL',
    mitigation: 'Active-standby configuration (VRRP/keepalived)',
  },
  {
    component: 'Cache server',
    isSPOF: true,
    impact: 'HIGH',
    mitigation: 'Redis Cluster or a Redis Sentinel setup',
  },
  {
    component: 'External API integration',
    isSPOF: true,
    impact: 'MEDIUM',
    mitigation: 'Circuit breaker + fallback',
  },
  {
    component: 'DNS',
    isSPOF: true,
    impact: 'CRITICAL',
    mitigation: 'Multiple DNS providers + health checks',
  },
];
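
A checklist like this is easier to keep current when the SPOF flag is derived from an inventory rather than typed in by hand. The sketch below assumes a hypothetical `ComponentInventory` record with an instance count and applies the simplest possible rule: anything with fewer than two instances is a SPOF. The type, the rule, and the sample data are illustrative assumptions.

```typescript
type Impact = 'CRITICAL' | 'HIGH' | 'MEDIUM' | 'LOW';

// Hypothetical inventory record; not part of the lesson's SPOFAssessment type.
interface ComponentInventory {
  component: string;
  instances: number; // independent instances currently in service
  impact: Impact;
}

const IMPACT_ORDER: Record<Impact, number> = {
  CRITICAL: 0,
  HIGH: 1,
  MEDIUM: 2,
  LOW: 3,
};

// Flags every component with fewer than two instances,
// listing the most severe impact first.
function findSPOFs(inventory: ComponentInventory[]): ComponentInventory[] {
  return inventory
    .filter((c) => c.instances < 2)
    .sort((a, b) => IMPACT_ORDER[a.impact] - IMPACT_ORDER[b.impact]);
}

const sampleInventory: ComponentInventory[] = [
  { component: 'Cache server', instances: 1, impact: 'HIGH' },
  { component: 'App server', instances: 3, impact: 'HIGH' },
  { component: 'Database', instances: 1, impact: 'CRITICAL' },
];

console.log(findSPOFs(sampleInventory).map((c) => c.component));
```

An instance count alone does not prove redundancy: two instances on the same host or behind the same switch still share a failure point, so the output is a starting list for review rather than a verdict.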

Redundancy Patterns

Active-Active

Every node handles requests.

graph TD
    LB["LB"] --> A1["Active 1<br/>(serving)"]
    LB --> A2["Active 2<br/>(serving)"]

    classDef lbStyle fill:#e8a838,stroke:#b07c1e,color:#fff
    classDef activeStyle fill:#5cb85c,stroke:#3d8b3d,color:#fff

    class LB lbStyle
    class A1,A2 activeStyle

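Advantages	Disadvantages
Every node serves traffic, so no capacity sits idle	Data synchronization is complex
Load can be distributed across nodes	Risk of split-brain
Near-zero switchover time	Higher cost and design complexity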
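
The near-zero switchover comes from the load balancer simply skipping a failed node. A minimal sketch of that behavior, assuming round-robin selection and a `healthy` flag maintained by some external health check; `BackendNode` and `RoundRobinBalancer` are illustrative names.

```typescript
interface BackendNode {
  id: string;
  healthy: boolean; // assumed to be updated by a separate health checker
}

class RoundRobinBalancer {
  private cursor = 0;

  constructor(private readonly nodes: BackendNode[]) {}

  // Returns the next healthy node, or undefined if every node is down.
  next(): BackendNode | undefined {
    for (let i = 0; i < this.nodes.length; i++) {
      const index = (this.cursor + i) % this.nodes.length;
      const node = this.nodes[index];
      if (node !== undefined && node.healthy) {
        // Resume after the chosen node on the next call.
        this.cursor = (index + 1) % this.nodes.length;
        return node;
      }
    }
    return undefined;
  }
}

const nodeA: BackendNode = { id: 'active-1', healthy: true };
const nodeB: BackendNode = { id: 'active-2', healthy: true };
const balancer = new RoundRobinBalancer([nodeA, nodeB]);

console.log(balancer.next()?.id); // active-1
console.log(balancer.next()?.id); // active-2
```

When one node fails, the survivors receive all of its traffic, so each node needs enough headroom to absorb that extra load.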

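Active-Passive

One node serves traffic, and a standby takes over when it fails.

graph LR
    subgraph Normal["Normal operation"]
        A1["Active<br/>(serving)"] -->|"data sync"| P1["Passive<br/>(standby)"]
    end

    subgraph Failure["During a failure"]
        A2["Active<br/>(failed!)"]
        P2["Passive → promoted!<br/>Failover"]
    end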

    classDef activeStyle fill:#d1fae5,stroke:#059669,color:#065f46
    classDef passiveStyle fill:#f3f4f6,stroke:#9ca3af,color:#374151
    classDef failStyle fill:#fee2e2,stroke:#dc2626,color:#991b1b
    classDef promoteStyle fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e40af

    class A1 activeStyle
    class P1 passiveStyle
    class A2 failStyle
    class P2 promoteStyle

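Advantages	Disadvantages
Data consistency is easier to maintain	Standby resources sit idle
Simple to implement	Failover takes time
Lower cost than Active-Active	No guarantee the standby works until it is actually needed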
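
Part of the failover delay comes from deciding that the active node has really failed. One common approach is to promote the standby only after several consecutive failed health checks, so a single dropped probe does not trigger a switch. The sketch below illustrates that rule; the class name, the default threshold of 3, and the node names are assumptions.

```typescript
class FailoverController {
  private consecutiveFailures = 0;
  private activeNode: string;
  private standbyNode: string;

  constructor(
    active: string,
    standby: string,
    private readonly failureThreshold: number = 3, // assumed default
  ) {
    this.activeNode = active;
    this.standbyNode = standby;
  }

  // Feed in each health check result for the active node.
  // Returns the node that should serve traffic afterwards.
  report(activeIsHealthy: boolean): string {
    if (activeIsHealthy) {
      this.consecutiveFailures = 0;
      return this.activeNode;
    }
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) {
      // Promote the standby; the old active becomes the new standby.
      [this.activeNode, this.standbyNode] = [this.standbyNode, this.activeNode];
      this.consecutiveFailures = 0;
    }
    return this.activeNode;
  }
}

const controller = new FailoverController('db-1', 'db-2');
console.log(controller.report(false)); // db-1 (1 failure)
console.log(controller.report(false)); // db-1 (2 failures)
console.log(controller.report(false)); // db-2 (threshold reached, standby promoted)
```

A real failover also has to fence off the old active node before promotion. Otherwise both nodes may accept writes at the same time.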

Health Checks and Circuit Breakers

Implementing Health Checks

// Multi-layer health check
interface HealthCheckResult {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, ComponentHealth>;
  timestamp: Date;
}

interface ComponentHealth {
  status: 'up' | 'down' | 'degraded';
  responseTime?: number; // milliseconds
  details?: string;
}

// Minimal database client shape assumed by this example.
interface DatabaseClient {
  query(sql: string): Promise<unknown>;
}

abstract class HealthChecker {
  constructor(private readonly db: DatabaseClient) {}

  async check(): Promise<HealthCheckResult> {
    const checks: Record<string, ComponentHealth> = {};

    // Database connectivity check
    checks['database'] = await this.checkDatabase();
    // Cache connectivity check
    checks['cache'] = await this.checkCache();
    // External API connectivity check
    checks['externalApi'] = await this.checkExternalApi();
    // Disk space check
    checks['disk'] = await this.checkDiskSpace();

    const overallStatus = this.determineOverallStatus(checks);
    return { status: overallStatus, checks, timestamp: new Date() };
  }

  // The remaining probes follow the same pattern as checkDatabase
  // and are left to a concrete subclass.
  protected abstract checkCache(): Promise<ComponentHealth>;
  protected abstract checkExternalApi(): Promise<ComponentHealth>;
  protected abstract checkDiskSpace(): Promise<ComponentHealth>;

  private async checkDatabase(): Promise<ComponentHealth> {
    try {
      const start = Date.now();
      await this.db.query('SELECT 1');
      return {
        status: 'up',
        responseTime: Date.now() - start,
      };
    } catch (error) {
      return { status: 'down', details: String(error) };
    }
  }

  // All up → healthy; any down → unhealthy; otherwise degraded.
  private determineOverallStatus(
    checks: Record<string, ComponentHealth>
  ): 'healthy' | 'degraded' | 'unhealthy' {
    const statuses = Object.values(checks).map((c) => c.status);
    if (statuses.every((s) => s === 'up')) return 'healthy';
    if (statuses.some((s) => s === 'down')) return 'unhealthy';
    return 'degraded';
  }
}
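
The overall status usually ends up as an HTTP status code on a health endpoint that the load balancer polls. One design question is whether every `down` dependency should take the instance out of rotation. The sketch below shows one possible policy, not the lesson's: only dependencies marked critical produce `unhealthy`, and `unhealthy` maps to 503. The `CRITICAL_COMPONENTS` set and both function names are hypothetical.

```typescript
type ComponentStatus = 'up' | 'down' | 'degraded';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

// Hypothetical policy: only these dependencies make the instance unusable.
const CRITICAL_COMPONENTS = new Set<string>(['database']);

function overallStatus(checks: Record<string, ComponentStatus>): OverallStatus {
  const entries = Object.entries(checks);
  const criticalDown = entries.some(
    ([name, status]) => status === 'down' && CRITICAL_COMPONENTS.has(name),
  );
  if (criticalDown) return 'unhealthy';
  if (entries.every(([, status]) => status === 'up')) return 'healthy';
  return 'degraded';
}

// 503 asks the load balancer to stop routing here;
// a degraded instance keeps serving and returns 200.
function httpStatusFor(status: OverallStatus): number {
  return status === 'unhealthy' ? 503 : 200;
}

const cacheOutage = overallStatus({ database: 'up', cache: 'down' });
console.log(cacheOutage, httpStatusFor(cacheOutage)); // degraded 200
```

Under this policy an outage of a shared non-critical dependency, such as an external API, does not remove every instance from the load balancer at once.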

Implementing a Circuit Breaker

// Circuit breaker pattern
enum CircuitState {
  CLOSED = 'CLOSED',       // Normal: requests pass through
  OPEN = 'OPEN',           // Failing: requests are blocked
  HALF_OPEN = 'HALF_OPEN', // Trial: requests pass through to probe for recovery
}

// Thrown when a call is rejected because the circuit is open.
class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private lastFailureTime?: Date;
  private successCount = 0;

  constructor(
    private readonly failureThreshold: number = 5,     // open after 5 consecutive failures
    private readonly resetTimeout: number = 30_000,    // move to half-open after 30 seconds
    private readonly successThreshold: number = 3,     // close after 3 successes
  ) {}

  async execute<T>(action: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Circuit is OPEN. Request blocked.');
      }
    }

    try {
      const result = await action();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.failureCount = 0;
        this.successCount = 0;
      }
    } else {
      // A success while closed resets the consecutive-failure count.
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    // Open on any failure during a trial, or once the threshold is reached while closed.
    if (
      this.state === CircuitState.HALF_OPEN ||
      this.failureCount >= this.failureThreshold
    ) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    return Date.now() - this.lastFailureTime.getTime() >= this.resetTimeout;
  }
}

Circuit Breaker State Transitions

graph TD
    CLOSED["CLOSED<br/>(normal)"] -->|"failures reach threshold"| OPEN["OPEN<br/>(blocking)"]
    OPEN -->|"reset timeout elapsed"| HALF["HALF_OPEN<br/>(trial)"]
    HALF -->|"successes reach threshold"| CLOSED
    HALF -->|"failure"| OPEN

    classDef closedStyle fill:#d1fae5,stroke:#059669,color:#065f46
    classDef openStyle fill:#fee2e2,stroke:#dc2626,color:#991b1b
    classDef halfStyle fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#92400e

    class CLOSED closedStyle
    class OPEN openStyle
    class HALF halfStyle
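
The diagram can also be written down as a lookup table, which makes the allowed transitions easy to review and to unit-test separately from timing and counting logic. This is a sketch of the four edges in the diagram only; the event names are invented for illustration.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Illustrative event names, one per edge in the diagram.
type BreakerEvent =
  | 'FAILURE_THRESHOLD_REACHED'
  | 'RESET_TIMEOUT_ELAPSED'
  | 'SUCCESS_THRESHOLD_REACHED'
  | 'FAILURE';

const TRANSITIONS: Record<BreakerState, Partial<Record<BreakerEvent, BreakerState>>> = {
  CLOSED: { FAILURE_THRESHOLD_REACHED: 'OPEN' },
  OPEN: { RESET_TIMEOUT_ELAPSED: 'HALF_OPEN' },
  HALF_OPEN: { SUCCESS_THRESHOLD_REACHED: 'CLOSED', FAILURE: 'OPEN' },
};

// Returns the next state, or the current state if the event
// has no edge leaving that state.
function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  return TRANSITIONS[state][event] ?? state;
}

console.log(nextState('CLOSED', 'FAILURE_THRESHOLD_REACHED')); // OPEN
console.log(nextState('OPEN', 'RESET_TIMEOUT_ELAPSED'));       // HALF_OPEN
console.log(nextState('HALF_OPEN', 'FAILURE'));                // OPEN
```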

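Disaster Recovery (DR)

RTO and RPO

Metric	Definition	Question it answers
RTO (Recovery Time Objective)	Target time from the start of an outage until service is restored	"Within how many hours must we recover?"
RPO (Recovery Point Objective)	Maximum acceptable window of data loss, measured in time	"How recent must the recovered data be?"

graph LR
    A["Last<br/>backup"] -->|"RPO"| B["Failure occurs"]
    B -->|"RTO"| C["Recovery complete"]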

    classDef backupStyle fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e40af
    classDef failStyle fill:#fee2e2,stroke:#dc2626,color:#991b1b
    classDef recoverStyle fill:#d1fae5,stroke:#059669,color:#065f46

    class A backupStyle
    class B failStyle
    class C recoverStyle
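
Both objectives can be checked against what the current setup actually delivers. In the worst case a failure strikes just before the next backup, so the data at risk equals the backup interval. The sketch below encodes that reasoning; the interfaces and field names are illustrative, and the recovery time is assumed to come from a restore drill rather than an estimate.

```typescript
interface RecoveryObjectives {
  rtoMinutes: number;
  rpoMinutes: number;
}

interface RecoveryCapability {
  backupIntervalMinutes: number;   // time between consecutive backups
  measuredRecoveryMinutes: number; // assumed to be measured in a restore drill
}

function meetsObjectives(
  objectives: RecoveryObjectives,
  capability: RecoveryCapability,
): { rpoMet: boolean; rtoMet: boolean } {
  return {
    // Worst case: the failure happens just before the next backup runs.
    rpoMet: capability.backupIntervalMinutes <= objectives.rpoMinutes,
    rtoMet: capability.measuredRecoveryMinutes <= objectives.rtoMinutes,
  };
}

// Daily backups cannot satisfy a 1-hour RPO, even if the restore itself is fast.
const dailyBackupCheck = meetsObjectives(
  { rtoMinutes: 240, rpoMinutes: 60 },
  { backupIntervalMinutes: 24 * 60, measuredRecoveryMinutes: 90 },
);
console.log(dailyBackupCheck); // { rpoMet: false, rtoMet: true }
```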

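Comparing DR Strategies

Strategy	RTO	RPO	Cost	Description
Backup & restore	Hours to days	Hours	Lowest	Restore from backups
Pilot light	Minutes to hours	Seconds to minutes	Low	Standby environment with a minimal footprint
Warm standby	Minutes	Seconds	Medium	Scaled-down copy of the production environment
Multi-site active	Seconds	Near zero	Highest	Active in multiple regions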
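
The table can be read as a selection rule: pick the cheapest strategy whose worst-case RTO and RPO still fit the requirement. The sketch below does this using only the order-of-magnitude labels from the table and taking the upper end of each range. The `Magnitude` scale and the function name are assumptions made for illustration.

```typescript
// Order-of-magnitude scale, smallest first.
const MAGNITUDES = ['near-zero', 'seconds', 'minutes', 'hours', 'days'] as const;
type Magnitude = (typeof MAGNITUDES)[number];

interface DrStrategy {
  name: string;
  worstRto: Magnitude;
  worstRpo: Magnitude;
}

// Ordered from cheapest to most expensive, using the upper end of each range.
const DR_STRATEGIES: DrStrategy[] = [
  { name: 'Backup & restore', worstRto: 'days', worstRpo: 'hours' },
  { name: 'Pilot light', worstRto: 'hours', worstRpo: 'minutes' },
  { name: 'Warm standby', worstRto: 'minutes', worstRpo: 'seconds' },
  { name: 'Multi-site active', worstRto: 'seconds', worstRpo: 'near-zero' },
];

const rank = (m: Magnitude): number => MAGNITUDES.indexOf(m);

// Returns the cheapest strategy that satisfies both requirements,
// or undefined if none of the four does.
function cheapestStrategy(requiredRto: Magnitude, requiredRpo: Magnitude): DrStrategy | undefined {
  return DR_STRATEGIES.find(
    (s) => rank(s.worstRto) <= rank(requiredRto) && rank(s.worstRpo) <= rank(requiredRpo),
  );
}

console.log(cheapestStrategy('hours', 'minutes')?.name);   // Pilot light
console.log(cheapestStrategy('minutes', 'minutes')?.name); // Warm standby
```

A real decision would use concrete numbers and cost figures; the magnitudes only narrow the field.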

Multi-Region Architecture

graph TD
    DNS["Global DNS / CDN<br/>(routing between regions)"]
    DNS --> RA
    DNS --> RB

    subgraph RA["Region A (Tokyo)"]
        AppA1["App"] & AppA2["App"] --> PDB["Primary DB"]
    end

    subgraph RB["Region B (Osaka)"]
        AppB1["App"] & AppB2["App"] --> RepDB["Replica DB"]
    end

    PDB -->|"asynchronous replication"| RepDB

    classDef dnsStyle fill:#e8a838,stroke:#b07c1e,color:#fff
    classDef appStyle fill:#67b7dc,stroke:#3a8ab5,color:#fff
    classDef primaryStyle fill:#5cb85c,stroke:#3d8b3d,color:#fff
    classDef replicaStyle fill:#4a90d9,stroke:#2c5f8a,color:#fff

    class DNS dnsStyle
    class AppA1,AppA2,AppB1,AppB2 appStyle
    class PDB primaryStyle
    class RepDB replicaStyle

Multi-Region Considerations

interface MultiRegionConfig {
  // Data synchronization strategy
  replication: {
    mode: 'sync' | 'async';  // synchronous or asynchronous
    lagTolerance: number;     // tolerated replication lag (ms)
    conflictResolution: 'last-write-wins' | 'merge' | 'manual';
  };

  // Failover conditions
  failover: {
    healthCheckInterval: number;  // health check interval (seconds)
    failureThreshold: number;     // threshold for declaring a failure
    automaticFailover: boolean;   // whether failover is automatic
    dnsUpdateTTL: number;         // TTL for the DNS switchover (seconds)
  };

  // Data consistency
  consistency: {
    readAfterWrite: boolean;      // guarantee that a read right after a write sees it
    crossRegionConsistency: 'eventual' | 'strong';
  };
}
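
The failover settings together bound how long clients may keep hitting a failed region. The following rough estimate assumes that `failureThreshold` means the number of consecutive failed checks and that clients honor the DNS TTL. It ignores the time needed to promote the replica, so treat the result as a lower bound on real failover time. `FailoverSettings` mirrors only the numeric fields of the config.

```typescript
// Subset of the failover settings needed for the estimate.
interface FailoverSettings {
  healthCheckInterval: number; // seconds between health checks
  failureThreshold: number;    // assumed: consecutive failed checks before failover
  dnsUpdateTTL: number;        // seconds that resolvers may cache the old record
}

// Detection time plus the time for cached DNS records to expire.
function estimateFailoverSeconds(settings: FailoverSettings): number {
  const detectionSeconds = settings.healthCheckInterval * settings.failureThreshold;
  return detectionSeconds + settings.dnsUpdateTTL;
}

const exampleSettings: FailoverSettings = {
  healthCheckInterval: 10,
  failureThreshold: 3,
  dnsUpdateTTL: 60,
};

// 10 s × 3 checks + 60 s TTL = 90 s
console.log(estimateFailoverSeconds(exampleSettings)); // 90
```

Comparing this number with the RTO shows quickly whether a long DNS TTL or a slow check interval is the dominant term.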

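Chaos Engineering

"Suppose you have designed the system to withstand failures. How do you confirm that the design actually works?" CTO Sato asked.

"By... testing?"

"Yes, but not ordinary testing. You deliberately cause failures in the production environment. That is chaos engineering."

Principles of Chaos Engineering

Form a steady-state hypothesis: define what normal behavior looks like
Mimic real-world events: inject realistic failures
Experiment in production: some problems never show up in staging
Minimize the blast radius: limit the scope of impact
Run continuously: not once, but as an ongoing practice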
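
The principles above can be sketched as the skeleton of a single experiment run: check the steady state, inject the fault, check again, and always roll back. The interface and function names are hypothetical, and the callbacks are synchronous to keep the control flow visible; a real tool would be asynchronous and would also limit which hosts or what share of traffic is affected.

```typescript
interface ChaosExperiment {
  name: string;
  steadyState: () => boolean; // hypothesis: true while the system behaves normally
  inject: () => void;         // start the fault
  rollback: () => void;       // stop the fault
}

type ExperimentOutcome = 'skipped' | 'hypothesis-held' | 'hypothesis-violated';

function runExperiment(experiment: ChaosExperiment): ExperimentOutcome {
  // Do not inject faults into a system that is already unhealthy.
  if (!experiment.steadyState()) return 'skipped';

  experiment.inject();
  try {
    return experiment.steadyState() ? 'hypothesis-held' : 'hypothesis-violated';
  } finally {
    // Always remove the fault, even if the check throws.
    experiment.rollback();
  }
}

// Simulated system: two app servers, healthy while at least one is running.
const simulated = { runningServers: 2 };
const outcome = runExperiment({
  name: 'Kill one application server',
  steadyState: () => simulated.runningServers >= 1,
  inject: () => { simulated.runningServers -= 1; },
  rollback: () => { simulated.runningServers += 1; },
});
console.log(outcome); // hypothesis-held
```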

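Example Chaos Experiments

Experiment	Target	What it verifies
Server shutdown	Kill one application server	Automatic recovery and removal from the LB
Network latency injection	Add 200 ms of latency to DB connections	Whether timeout settings are appropriate
Disk exhaustion	Fill the log volume to 100%	Alert firing and automated response
DNS failure	Delay internal DNS responses	Caching and fallback
AZ failure	Isolate an entire AZ	Multi-AZ failover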

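Summary

Point	Details
Calculating availability	Serial and parallel configurations change availability dramatically
Eliminating SPOFs	Identify and eliminate the single points of failure across all components
Redundancy patterns	Choose between Active-Active and Active-Passive as appropriate
Circuit breakers	Prevent cascading failures and protect the system as a whole
DR strategy	Choose a strategy based on RTO and RPO
Chaos engineering	Continuously verify fault tolerance in production

Checklist

I can calculate the downtime implied by an SLA availability figure (99.9%, 99.99%)
I can calculate the availability of serial and parallel configurations
I can explain the differences between Active-Active and Active-Passive and when to use each
I can explain the state transitions of a circuit breaker
I can explain the definitions of RTO and RPO and the criteria for choosing a DR strategy

Next Steps

You have learned how to design for availability and fault tolerance. Next comes "Security and Compliance": how to protect a system from external threats, how to meet legal and regulatory requirements, and which countermeasures apply at the architecture level.

Estimated reading time: 30 minutes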