Exercise: Design a Service Mesh

Story

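Sato (CTO)

You have covered enough service mesh theory. Now let's design one for real.

Sato (CTO)

I want you to design the service mesh configuration for our e-commerce site after it is split into microservices. Knowing the theory and being able to produce a design are two different things.

Sato (CTO)

We need a topology design, traffic management rules, an observability setup, and a canary deployment strategy. Be ready to write all of it in YAML.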

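Mission Overview

Mission	Topic	Estimated time
Mission 1	Design the service mesh topology	15 min
Mission 2	Configure traffic management rules	10 min
Mission 3	Build the observability stack	15 min
Mission 4	Design the canary deployment strategy	10 min

Target System

You will introduce a service mesh into the following microservice system.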

graph TD
    subgraph EC["E-commerce site"]
        GW["API Gateway"]
        WEB["Web Frontend"]
        ORD["Order Svc<br/>(order management)"]
        PAY["Payment Svc<br/>(payment processing)"]
        INV["Inventory Svc<br/>(inventory)"]
        USR["User Svc<br/>(user management)"]
        NOT["Notification Svc<br/>(notifications)"]
        SRC["Search Svc<br/>(product search)"]
    end

    GW -->|sync| ORD
    GW -->|sync| USR
    GW -->|sync| SRC
    ORD -->|sync| PAY
    ORD -->|sync| INV
    ORD -.->|async / event| NOT
    SRC -->|sync| INV

    classDef gw fill:#e8f4fd,stroke:#2196f3,color:#333
    classDef core fill:#d4edda,stroke:#28a745,color:#333
    classDef support fill:#fff3cd,stroke:#f0ad4e,color:#333
    class GW,WEB gw
    class ORD,PAY core
    class INV,USR,NOT,SRC support

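Requirements

All services run on Kubernetes (namespace: production)
Payment Service has especially strict security requirements
Search Service handles high traffic (1000 RPS at peak)
Order Service will release a new version, v2, as a canary
Observability must be ensured for all services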

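Mission 1: Design the Service Mesh Topology (15 min)

Requirements

Design the following.

An Istio DestinationRule for each service (subsets, load balancing, circuit breaker)
A security policy for Payment Service (mTLS + AuthorizationPolicy)
A PeerAuthentication setting that covers all services

Solution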

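PeerAuthentication (mTLS for all services)

# Enforce mTLS for every workload in the production namespace.
# (A mesh-wide policy would instead be created in the Istio root namespace,
# istio-system by default.)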
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT

DestinationRule (per service)

# Order Service: subsets for the canary release
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
  namespace: production
spec:
  host: order-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 200
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 500
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 30
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2

---
# Payment Service: strict circuit breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: production
spec:
  host: payment-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    connectionPool:
      tcp:
        maxConnections: 50
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRetries: 2
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
    - name: stable
      labels:
        version: v1

---
# Search Service: tuned for high traffic
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: search-service
  namespace: production
spec:
  host: search-service
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 500
      http:
        http1MaxPendingRequests: 200
        http2MaxRequests: 2000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 15s
      baseEjectionTime: 15s
      maxEjectionPercent: 20
  subsets:
    - name: stable
      labels:
        version: v1

AuthorizationPolicy (Payment Service)

# Strictly restrict access to Payment Service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-service-policy
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
    # Allow payment requests only from Order Service
    - from:
        - source:
            principals:
              - "cluster.local/ns/production/sa/order-service"
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/payments/*", "/api/refunds/*"]

    # Allow health checks from any source
    - to:
        - operation:
            methods: ["GET"]
            paths: ["/health/*"]
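
To make the effect of the two rules concrete, the following TypeScript sketch models how the ALLOW policy above would treat a request. It is a simplified illustration, not Istio's implementation: the `AllowRule` and `MeshRequest` types and the `isAllowed` helper are hypothetical names, and path matching covers only exact matches and a trailing `*` prefix.

```typescript
// Simplified model of an ALLOW policy: a request passes only if some rule matches.
interface AllowRule {
  principals?: string[]; // omitted means "any source"
  methods: string[];
  paths: string[]; // exact match, or prefix match when the pattern ends with "*"
}

interface MeshRequest {
  principal: string;
  method: string;
  path: string;
}

// Rules transcribed from the payment-service-policy above.
const paymentRules: AllowRule[] = [
  {
    principals: ["cluster.local/ns/production/sa/order-service"],
    methods: ["POST"],
    paths: ["/api/payments/*", "/api/refunds/*"],
  },
  { methods: ["GET"], paths: ["/health/*"] },
];

function pathMatches(pattern: string, path: string): boolean {
  return pattern.endsWith("*")
    ? path.startsWith(pattern.slice(0, -1))
    : pattern === path;
}

function isAllowed(rules: AllowRule[], req: MeshRequest): boolean {
  return rules.some(
    (rule) =>
      (rule.principals === undefined || rule.principals.includes(req.principal)) &&
      rule.methods.includes(req.method) &&
      rule.paths.some((pattern) => pathMatches(pattern, req.path))
  );
}
```

Under this model a POST from the `order-service` service account to `/api/payments/123` is allowed, the same request from any other service account is rejected, and `GET /health` (with no trailing segment) does not match `/health/*`.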

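Mission 2: Configure Traffic Management Rules (10 min)

Requirements

Design the following VirtualService resources.

Routing rules from the API Gateway to each service
Retry and timeout settings for Order Service
A rate-limiting approach for Search Service (design policy)

Solution

VirtualService

# Order Service: with retries and a timeout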
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: production
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 100
      timeout: 15s
      retries:
        attempts: 3
        perTryTimeout: 5s
        retryOn: 5xx,reset,connect-failure,retriable-4xx
      # Fault injection (uncomment only for testing)
      # fault:
      #   delay:
      #     percentage:
      #       value: 5
      #     fixedDelay: 3s

---
# Payment Service: strict timeout (be cautious with retries)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: production
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
            subset: stable
          weight: 100
      timeout: 30s  # payment processing can take a while
      retries:
        attempts: 1  # keep payment retries minimal (idempotency must be confirmed)
        perTryTimeout: 25s
        retryOn: reset,connect-failure  # no retry on 5xx (prevents double charging)

---
# Search Service: high-throughput settings
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: search-service
  namespace: production
spec:
  hosts:
    - search-service
  http:
    - route:
        - destination:
            host: search-service
            subset: stable
          weight: 100
      timeout: 3s  # search must respond quickly
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: 5xx,reset,connect-failure

---
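# User Service: standard settings
# Note: the "stable" subset below requires a user-service DestinationRule that
# defines it (same shape as the Mission 1 rules); without one the route cannot resolve.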
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
  namespace: production
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
            subset: stable
          weight: 100
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: 5xx,reset,connect-failure
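
Timeouts and retries interact, so it helps to check the worst case for each route. The sketch below assumes that `attempts` counts retries after the initial try (so up to `attempts + 1` tries may run) and ignores the back-off between tries. `RetryConfig` and the helper functions are illustrative names; the values are copied from the VirtualServices above.

```typescript
// Retry/timeout values copied from the VirtualServices above.
interface RetryConfig {
  service: string;
  timeoutSec: number; // route-level timeout
  attempts: number; // assumed: retries allowed after the first try
  perTryTimeoutSec: number; // timeout applied to each individual try
}

const retryConfigs: RetryConfig[] = [
  { service: "order-service", timeoutSec: 15, attempts: 3, perTryTimeoutSec: 5 },
  { service: "payment-service", timeoutSec: 30, attempts: 1, perTryTimeoutSec: 25 },
  { service: "search-service", timeoutSec: 3, attempts: 2, perTryTimeoutSec: 1 },
  { service: "user-service", timeoutSec: 10, attempts: 3, perTryTimeoutSec: 3 },
];

// Worst case if every try runs until its per-try timeout (back-off ignored).
function worstCaseTrySeconds(c: RetryConfig): number {
  return (1 + c.attempts) * c.perTryTimeoutSec;
}

// True when all tries can finish before the route-level timeout expires.
function triesFitInTimeout(c: RetryConfig): boolean {
  return worstCaseTrySeconds(c) <= c.timeoutSec;
}

for (const c of retryConfigs) {
  console.log(
    `${c.service}: worst case ${worstCaseTrySeconds(c)}s, ` +
      `timeout ${c.timeoutSec}s, fits=${triesFitInTimeout(c)}`
  );
}
```

Under these assumptions only search-service fits every try inside its overall timeout. For the other routes the route-level `timeout` would cut off the last tries, which may be intended but is worth deciding explicitly.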

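Rate-Limiting Design Policy

# Rate-limiting policy for Search Service
# Peak is 1000 RPS, so set the cap at 1200 RPS (20% buffer)
# Implement with an EnvoyFilter or an external rate limit service

# Design policy:
# 1. Global rate limit: 1200 RPS overall
# 2. Per-user rate limit: up to 30 RPS per user (see the tiers below)
# 3. Burst handling: a token bucket tolerates short bursts
#
# Rate-limit tiers:
# - Authenticated users: 30 req/s
# - Unauthenticated users: 10 req/s
# - Internal services: unlimited (trusted traffic)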
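
The token bucket mentioned in the policy can be sketched in a few lines of TypeScript. This is a minimal illustration, not how Envoy or any rate limit service implements it: the `TokenBucket` class and `bucketForTier` helper are hypothetical, time is passed in explicitly so the behavior is deterministic, and the burst sizes (equal to one second of traffic) are an assumption because the policy above does not specify them.

```typescript
// Minimal token bucket: tokens refill continuously at ratePerSec, up to `burst`.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSec: number;
  private readonly burst: number;

  constructor(ratePerSec: number, burst: number, nowMs: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full so an initial burst is accepted
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends one token if the request may proceed.
  tryConsume(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Per-user tiers from the policy above; burst sizes are assumed for illustration.
function bucketForTier(
  tier: "authenticated" | "unauthenticated",
  nowMs: number
): TokenBucket {
  return tier === "authenticated"
    ? new TokenBucket(30, 30, nowMs)
    : new TokenBucket(10, 10, nowMs);
}
```

A caller would keep one bucket per user and reject the request (for example with HTTP 429) whenever `tryConsume` returns false. Internal services would simply bypass the check.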

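Mission 3: Build the Observability Stack (15 min)

Requirements

Design the following.

Golden Signals monitoring items for each service
Prometheus alert rules (three or more)
A dashboard layout (three levels)
A sampling strategy for distributed tracing

Solution

Golden Signals Monitoring Definitions

// Golden Signals SLO definitions for each service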
interface ServiceSLO {
  service: string;
  latency: { p50: number; p95: number; p99: number };  // ms
  errorRate: number;  // allowed error rate (%)
  trafficCapacity: number;  // maximum RPS
  saturation: { cpu: number; memory: number };  // upper limit (%)
}

const serviceSLOs: ServiceSLO[] = [
  {
    service: 'api-gateway',
    latency: { p50: 10, p95: 50, p99: 100 },
    errorRate: 0.1,
    trafficCapacity: 2000,
    saturation: { cpu: 70, memory: 80 },
  },
  {
    service: 'order-service',
    latency: { p50: 50, p95: 200, p99: 500 },
    errorRate: 0.5,
    trafficCapacity: 500,
    saturation: { cpu: 70, memory: 80 },
  },
  {
    service: 'payment-service',
    latency: { p50: 200, p95: 500, p99: 1000 },
    errorRate: 0.1,  // payments have a low tolerance for errors
    trafficCapacity: 200,
    saturation: { cpu: 60, memory: 70 },
  },
  {
    service: 'search-service',
    latency: { p50: 30, p95: 100, p99: 300 },
    errorRate: 1.0,
    trafficCapacity: 1200,
    saturation: { cpu: 80, memory: 85 },
  },
  {
    service: 'inventory-service',
    latency: { p50: 20, p95: 80, p99: 200 },
    errorRate: 0.5,
    trafficCapacity: 500,
    saturation: { cpu: 60, memory: 70 },
  },
];

Prometheus Alert Rules

groups:
  - name: ec-site-service-mesh-alerts
    rules:
      # Alert 1: Payment Service error rate exceeds 0.1%
      - alert: PaymentServiceHighErrorRate
        expr: |
          sum(rate(istio_requests_total{
            response_code=~"5.*",
            destination_service_name="payment-service"
          }[5m]))
          /
          sum(rate(istio_requests_total{
            destination_service_name="payment-service"
          }[5m]))
          > 0.001
        for: 2m
        labels:
          severity: critical
          team: payment
        annotations:
          summary: "Payment Service error rate exceeded 0.1%"
          description: "Payment error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/payment-errors"

      # Alert 2: Order Service P99 latency exceeds 500 ms
      - alert: OrderServiceHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_service_name="order-service"
            }[5m])) by (le)
          ) > 500
        for: 5m
        labels:
          severity: warning
          team: order
        annotations:
          summary: "Order Service P99 latency exceeded the SLO (500 ms)"

      # Alert 3: Search Service traffic reaches 80% of capacity (80% of 1200 RPS = 960)
      - alert: SearchServiceHighTraffic
        expr: |
          sum(rate(istio_requests_total{
            destination_service_name="search-service"
          }[1m])) > 960
        for: 3m
        labels:
          severity: warning
          team: search
        annotations:
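          summary: "Search Service traffic reached 80% of capacity"

      # Alert 4: circuit breaker tripped
      # Note: sidecars may not export envoy_cluster_* stats by default; this rule
      # assumes they have been enabled (for example via proxyStatsMatcher).
      - alert: CircuitBreakerTripped
        expr: |
          sum(rate(envoy_cluster_upstream_cx_overflow[5m])) by (cluster_name) > 0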
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped for {{ $labels.cluster_name }}"
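
The numeric thresholds in alerts 1 and 3 follow from the SLO definitions: an allowed error rate in percent becomes a ratio for PromQL, and the traffic alert fires at a percentage of capacity. A small sketch of that arithmetic, with hypothetical helper names:

```typescript
// Convert an SLO error rate given in percent (e.g. 0.1) to the ratio used in PromQL.
function errorRatioThreshold(errorRatePercent: number): number {
  return errorRatePercent / 100;
}

// RPS at which the traffic alert fires, as a percentage of capacity.
function trafficThresholdRps(capacityRps: number, alertAtPercent: number): number {
  return Math.round((capacityRps * alertAtPercent) / 100);
}

// payment-service: errorRate 0.1% -> compared against the 5xx ratio in alert 1
const paymentErrorThreshold = errorRatioThreshold(0.1);

// search-service: 80% of trafficCapacity 1200 -> 960, the value used in alert 3
const searchTrafficThreshold = trafficThresholdRps(1200, 80);

console.log({ paymentErrorThreshold, searchTrafficThreshold });
```

Deriving thresholds this way keeps the alert rules in step with the SLO table when a target changes.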

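Dashboard Layout

graph TD
    L1["Level 1: Site-wide overview"] --> L2["Level 2: Per-service details"]
    L2 --> L3["Level 3: Troubleshooting"]

    L1 -.- L1a["Health map of all services (Kiali-style)<br/>Overall RPS / error rate / P99 latency<br/>Active alerts<br/>Recent deployment history"]
    L2 -.- L2a["Order Service: RPS / error rate / latency<br/>Payment Service: payment success rate / processing time<br/>Search Service: QPS / cache hit rate<br/>Inter-service traffic / Pod resource usage"]
    L3 -.- L3a["Individual request traces (Jaeger)<br/>Log search (Loki/Elasticsearch)<br/>Detailed Envoy proxy statistics<br/>Pod/container-level metrics"]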

    style L1 fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e40af
    style L2 fill:#fef3c7,stroke:#d97706,color:#92400e
    style L3 fill:#fee2e2,stroke:#dc2626,color:#991b1b
    style L1a fill:#f3f4f6,stroke:#9ca3af,color:#374151
    style L2a fill:#f3f4f6,stroke:#9ca3af,color:#374151
    style L3a fill:#f3f4f6,stroke:#9ca3af,color:#374151

Sampling Strategy

# Istio tracing sampling configuration
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 10  # normal operation: 10% sampling
    extensionProviders:
      - name: jaeger
        zipkin:
          service: jaeger-collector.observability.svc.cluster.local
          port: 9411

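# Sampling strategy policy:
# - Normal traffic: 10% (cost reduction)
# - Error requests: 100% (capture every error)
# - Payment Service: 100% (audit requirement)
# - High-latency requests (> 1 s): 100%
#
# Note: the sampling value above is a head-based rate decided when a trace starts,
# so it cannot select by error or latency. Those two rules assume tail-based
# sampling in the trace collector, and the Payment Service rate assumes a
# per-workload override (for example an Istio Telemetry resource).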
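
The four sampling rules can be read as a single keep-or-drop decision made once a trace is complete. The TypeScript sketch below shows that decision in isolation. `CompletedTrace` and `shouldKeepTrace` are hypothetical names, and in practice this logic would live in the trace collector rather than in application code.

```typescript
// Outcome of a finished trace, as a tail-based sampler would see it.
interface CompletedTrace {
  service: string;
  hasError: boolean;
  durationMs: number;
}

const BASE_SAMPLE_RATE = 0.1; // 10% of ordinary traffic
const SLOW_THRESHOLD_MS = 1000; // "high latency" means longer than 1 second

// `random` is injectable so the decision can be tested deterministically.
function shouldKeepTrace(
  trace: CompletedTrace,
  random: () => number = Math.random
): boolean {
  if (trace.hasError) return true; // keep every error
  if (trace.service === "payment-service") return true; // audit requirement
  if (trace.durationMs > SLOW_THRESHOLD_MS) return true; // keep slow requests
  return random() < BASE_SAMPLE_RATE; // otherwise sample 10%
}
```

Ordering the checks from "always keep" to "probabilistic" means the random draw only affects ordinary, fast, successful requests.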

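Mission 4: Design the Canary Deployment Strategy (10 min)

Requirements

Design the canary release plan for Order Service v2.

A phased traffic migration plan (weighted)
Metric-based success criteria for each phase
Automatic rollback conditions
VirtualService settings through to full migration

Solution

Canary Deployment Plan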

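Phase 1: Internal testing (header-based)
  Duration: 1 day
  Audience: QA team only (x-canary: true header)
  Success criteria: every QA test case passes

Phase 2: Canary 5%
  Duration: 2 days
  Audience: a random 5% of traffic
  Success criteria:
    - Error rate < 0.5% (no worse than v1)
    - P99 latency < 500 ms (no worse than v1)
    - No drop in business metrics (order completion rate)

Phase 3: Canary 25%
  Duration: 2 days
  Success criteria: same as Phase 2

Phase 4: Canary 50%
  Duration: 1 day
  Success criteria: same as Phase 2

Phase 5: Canary 100%
  Duration: 1 day (old version kept on standby)

Phase 6: Remove the old version
  Delete the v1 Pods; rollout complete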

VirtualService Settings (per phase)

# Phase 1: header-based routing (QA team only)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: production
spec:
  hosts:
    - order-service
  http:
    # For the QA team: send requests that carry the x-canary header to v2
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: canary
    # Everything else: all to v1
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 100

---
# Phase 2: 5% canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: production
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 95
        - destination:
            host: order-service
            subset: canary
          weight: 5
      retries:
        attempts: 3
        perTryTimeout: 5s
        retryOn: 5xx,reset,connect-failure

---
# Phase 3: 25% canary (change weights to 75/25)
# Phase 4: 50% canary (change weights to 50/50)
# Phase 5: 100% canary (change weights to 0/100)
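
Since Phases 2 to 5 differ only in their weights, the route list can be generated rather than edited by hand. The following TypeScript sketch builds the `route` entries for a given canary percentage. `canaryRoutes` and `WeightedRoute` are illustrative names, and rendering the result to YAML and applying it is assumed to happen elsewhere (for example in a CI/CD step).

```typescript
// One entry of a VirtualService http route list.
interface WeightedRoute {
  destination: { host: string; subset: "stable" | "canary" };
  weight: number;
}

// Build the stable/canary split; the two weights always sum to 100.
function canaryRoutes(host: string, canaryPercent: number): WeightedRoute[] {
  if (!Number.isInteger(canaryPercent) || canaryPercent < 0 || canaryPercent > 100) {
    throw new RangeError(
      `canaryPercent must be an integer from 0 to 100, got ${canaryPercent}`
    );
  }
  return [
    { destination: { host, subset: "stable" }, weight: 100 - canaryPercent },
    { destination: { host, subset: "canary" }, weight: canaryPercent },
  ];
}

// Canary percentages for Phases 2 through 5 of the plan above.
const phaseCanaryPercents = [5, 25, 50, 100];
for (const percent of phaseCanaryPercents) {
  console.log(JSON.stringify(canaryRoutes("order-service", percent)));
}
```

Calling `canaryRoutes("order-service", 0)` yields the rollback configuration, with all traffic on the stable subset.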

Automatic Rollback Conditions

# Rollback triggers implemented as Prometheus alert rules
groups:
  - name: canary-rollback-triggers
    rules:
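      # Roll back immediately if the canary error rate exceeds 3x the stable rate.
      # Note: if v1 has no 5xx samples or no traffic in the window, the right-hand
      # side is empty or NaN and this rule does not fire; pair it with an absolute
      # threshold in practice.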
      - alert: CanaryHighErrorRate
        expr: |
          (
            sum(rate(istio_requests_total{
              response_code=~"5.*",
              destination_version="v2",
              destination_service_name="order-service"
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_version="v2",
              destination_service_name="order-service"
            }[5m]))
          )
          >
          3 * (
            sum(rate(istio_requests_total{
              response_code=~"5.*",
              destination_version="v1",
              destination_service_name="order-service"
            }[5m]))
            /
            sum(rate(istio_requests_total{
              destination_version="v1",
              destination_service_name="order-service"
            }[5m]))
          )
        for: 2m
        labels:
          severity: critical
          action: rollback
        annotations:
          summary: "Canary (v2) error rate exceeds 3x stable (v1). Automatic rollback recommended"

      # Warn if the canary P99 latency exceeds 2x the stable P99
      - alert: CanaryHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_version="v2",
              destination_service_name="order-service"
            }[5m])) by (le)
          )
          >
          2 * histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket{
              destination_version="v1",
              destination_service_name="order-service"
            }[5m])) by (le)
          )
        for: 5m
        labels:
          severity: warning
          action: investigate

# Executing the rollback (manually or automatically from CI/CD):
# change the VirtualService weights to stable: 100, canary: 0
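
When the rollback is driven from CI/CD, the same 3x comparison can be evaluated in code before changing the weights. The sketch below is one possible shape for that check. `evaluateCanary`, `VersionStats`, and the `minRequests` guard are assumptions added for illustration; the guard avoids deciding on too little traffic and is not part of the alert rule above.

```typescript
// Request and 5xx counts for one version over the evaluation window.
interface VersionStats {
  requests: number;
  errors: number;
}

type CanaryDecision = "rollback" | "continue" | "insufficient-data";

function evaluateCanary(
  stable: VersionStats,
  canary: VersionStats,
  multiplier = 3, // canary may not exceed 3x the stable error rate
  minRequests = 100 // assumed minimum sample size per version
): CanaryDecision {
  if (stable.requests < minRequests || canary.requests < minRequests) {
    return "insufficient-data";
  }
  const stableRate = stable.errors / stable.requests;
  const canaryRate = canary.errors / canary.requests;
  return canaryRate > multiplier * stableRate ? "rollback" : "continue";
}

// Example: stable at 0.1% errors, canary at 1% errors -> rollback
console.log(
  evaluateCanary({ requests: 10000, errors: 10 }, { requests: 500, errors: 5 })
);
```

Returning a separate `insufficient-data` result lets the pipeline hold the current phase instead of promoting or rolling back on a handful of requests.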

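Achievement Check

Mission	Topic	Done
Mission 1	Service mesh topology design
Mission 2	Traffic management rule configuration
Mission 3	Observability stack build-out
Mission 4	Canary deployment strategy design

Summary

Point	Details
Topology design	Set load balancing and circuit breakers with a DestinationRule per service
Security	Strictly control access to the payment service with mTLS + AuthorizationPolicy
Traffic management	Design timeouts, retries, and rate limits to suit each service's characteristics
Observability	SLO monitoring and alert design based on the Golden Signals
Canary deployment	Phased weighted routing and metric-based rollback

Checklist

I designed an appropriate DestinationRule for each service
I designed the security policy (mTLS + authorization)
I wrote VirtualService settings suited to each service's characteristics
I designed SLO definitions and alerts based on the Golden Signals
I defined a phased canary deployment plan and automatic rollback conditions

Next Step

Next is the checkpoint quiz. Check your understanding of service meshes, Istio, observability, and service discovery.

Estimated reading time: 50 min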