Distributed Tracing - L0 Curriculum

Story

CTO Sato

When a failure occurs in a microservices system, the hardest part is identifying which service caused it.

You

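The recent payment delay was the same. Four services were involved (API Gateway, Payment Service, the external payment API, and the DB), and it took us two hours to pinpoint the cause.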

CTO Sato

With distributed tracing, you can follow a request's path as a single line. Those two hours could shrink to five minutes.

Distributed tracing basics

Basic concepts

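Concept	Description
Trace	The end-to-end journey of a single request. Made up of multiple Spans
Span	A unit of work within one service or operation. Has a start time, a duration, and attributes
Trace ID	A unique ID that identifies one trace. Shared across all services
Span ID	A unique ID that identifies each Span
Parent Span ID	The ID of the parent Span. Expresses the parent-child relationship between Spans
Context Propagation	The mechanism that carries Trace/Span IDs between services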
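
To make the relationships in the table concrete, here is a minimal sketch of the data model in TypeScript. The type `SpanRecord`, its field names, and the helper functions are illustrative assumptions, not the schema of OpenTelemetry or any backend.

```typescript
// Minimal data model for the concepts in the table. Field names are illustrative.
interface SpanRecord {
  traceId: string;        // shared by every Span in the same Trace
  spanId: string;         // unique per Span
  parentSpanId?: string;  // absent on the root Span
  name: string;
  startMs: number;        // start time, in ms since the request began
  durationMs: number;
  attributes: Record<string, string | number>;
}

// Group Spans into Traces by their shared Trace ID.
function groupByTrace(spans: SpanRecord[]): Record<string, SpanRecord[]> {
  const traces: Record<string, SpanRecord[]> = {};
  spans.forEach((span) => {
    const list = traces[span.traceId] || [];
    list.push(span);
    traces[span.traceId] = list;
  });
  return traces;
}

// The root Span is the one that has no parent.
function findRoot(spans: SpanRecord[]): SpanRecord | undefined {
  return spans.filter((span) => span.parentSpanId === undefined)[0];
}

// Direct children are the Spans whose Parent Span ID points at the given Span.
function childrenOf(spans: SpanRecord[], parent: SpanRecord): SpanRecord[] {
  return spans.filter((span) => span.parentSpanId === parent.spanId);
}

// Hypothetical sample data: two requests, the first with three Spans.
const sampleSpans: SpanRecord[] = [
  { traceId: 't1', spanId: 'a', name: 'API Gateway', startMs: 0, durationMs: 400, attributes: {} },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'Payment Service', startMs: 100, durationMs: 240, attributes: { 'payment.currency': 'JPY' } },
  { traceId: 't1', spanId: 'c', parentSpanId: 'b', name: 'DB Query', startMs: 110, durationMs: 60, attributes: {} },
  { traceId: 't2', spanId: 'x', name: 'API Gateway', startMs: 0, durationMs: 35, attributes: {} },
];
```

Because every Span carries the Trace ID and its Parent Span ID, a backend can rebuild the full request tree even though each service reports its Spans independently.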

Trace structure

Trace ID: abc-123-def-456

  API Gateway [Span A] ──────────────────────────────
    │
    ├─→ Auth Service [Span B] ────────
    │
    ├─→ Payment Service [Span C] ─────────────────────
    │     │
    │     ├─→ DB Query [Span D] ──────
    │     │
    │     └─→ External Payment API [Span E] ──────────
    │
    └─→ Notification Service [Span F] ────

  Time ──────────────────────────────────────────────→
  0ms        100ms       200ms       300ms      400ms
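
One way to read a waterfall like the one above is to compute each Span's self time, that is, its duration minus the time covered by its children. The sketch below does this in TypeScript. The timings are hypothetical values loosely read off the diagram, and `TimedSpan`, `selfTimeMs`, and `slowestBySelfTime` are names invented for this example.

```typescript
interface TimedSpan {
  id: string;
  parentId?: string;
  name: string;
  startMs: number;
  endMs: number;
}

// Hypothetical timings, loosely read off the diagram.
const diagramSpans: TimedSpan[] = [
  { id: 'A', name: 'API Gateway', startMs: 0, endMs: 400 },
  { id: 'B', parentId: 'A', name: 'Auth Service', startMs: 10, endMs: 90 },
  { id: 'C', parentId: 'A', name: 'Payment Service', startMs: 100, endMs: 340 },
  { id: 'D', parentId: 'C', name: 'DB Query', startMs: 110, endMs: 170 },
  { id: 'E', parentId: 'C', name: 'External Payment API', startMs: 180, endMs: 330 },
  { id: 'F', parentId: 'A', name: 'Notification Service', startMs: 345, endMs: 390 },
];

// Time covered by at least one direct child; overlapping children count once.
function coveredByChildren(parent: TimedSpan, all: TimedSpan[]): number {
  const intervals = all
    .filter((s) => s.parentId === parent.id)
    .map((s) => ({
      start: Math.max(s.startMs, parent.startMs),
      end: Math.min(s.endMs, parent.endMs),
    }))
    .filter((iv) => iv.end > iv.start)
    .sort((a, b) => a.start - b.start);

  let covered = 0;
  let cursor = parent.startMs;
  intervals.forEach((iv) => {
    const from = Math.max(iv.start, cursor);
    if (iv.end > from) {
      covered += iv.end - from;
      cursor = iv.end;
    }
  });
  return covered;
}

// Self time: how long the Span itself was busy, excluding its children.
function selfTimeMs(span: TimedSpan, all: TimedSpan[]): number {
  return span.endMs - span.startMs - coveredByChildren(span, all);
}

// The Span with the largest self time is the first suspect. Assumes a non-empty list.
function slowestBySelfTime(all: TimedSpan[]): TimedSpan {
  return all.reduce((best, s) => (selfTimeMs(s, all) > selfTimeMs(best, all) ? s : best));
}
```

With these sample timings, `slowestBySelfTime(diagramSpans)` points at the External Payment API Span, which is the kind of answer that otherwise takes hours of log correlation.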

Context Propagation

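// W3C Trace Context headers
// The Trace Context travels in the request headers

// Sender side
const headers = {
  'traceparent': '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
  // version-trace_id-parent_id-trace_flags
  // (trace_id: 32 hex characters, parent_id: 16 hex characters)
  'tracestate': 'vendor1=value1,vendor2=value2',
};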
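
On the receiving side, the `traceparent` value has to be split back into its four fields. The following is a minimal sketch of such a parser. It only accepts the four-field layout (two hex characters for the version, 32 for the trace ID, 16 for the parent ID, two for the flags, all lowercase), and `TraceParent` and `parseTraceparent` are hypothetical names. In practice an OpenTelemetry propagator does this work for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean; // lowest bit of trace_flags
}

const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null when the header is malformed.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;

  const [version, traceId, parentId, flags] =
    match.slice(1) as [string, string, string, string];

  // Version "ff" and all-zero IDs are invalid in W3C Trace Context.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}
```

A receiver that gets `null` back would typically start a new trace rather than fail the request, so that a malformed header from one caller never breaks the service.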

// Automatic context propagation with OpenTelemetry
import { context, propagation } from '@opentelemetry/api';

async function callPaymentService(orderId: string): Promise<PaymentResult> {
  // Inject the active context (including the current Span) into the headers
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);

  const response = await fetch('http://payment-service/api/payments', {
    method: 'POST',
    headers: {
      ...headers,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ orderId }),
  });

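  // fetch does not reject on HTTP error statuses, so check explicitly
  if (!response.ok) {
    throw new Error(`Payment service responded with ${response.status}`);
  }
  return response.json();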
}

Manual instrumentation

import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(
  orderId: string,
  amount: number
): Promise<PaymentResult> {
  // Create a custom Span
  return tracer.startActiveSpan(
    'processPayment',
    {
      kind: SpanKind.INTERNAL,
      attributes: {
        'payment.order_id': orderId,
        'payment.amount': amount,
        'payment.currency': 'JPY',
      },
    },
    async (span) => {
      try {
        // 1. Validate the order
        const order = await tracer.startActiveSpan(
          'validateOrder',
          async (validateSpan) => {
            try {
              const result = await orderRepository.findById(orderId);
              validateSpan.setAttribute('order.status', result.status);
              return result;
            } finally {
              // End the Span even when the lookup throws
              validateSpan.end();
            }
          }
        );

        // 2. Call the external payment API
        const paymentResult = await tracer.startActiveSpan(
          'callExternalPaymentAPI',
          { kind: SpanKind.CLIENT },
          async (apiSpan) => {
            apiSpan.setAttribute('payment.provider', 'stripe');
            try {
              const result = await stripeClient.charge({
                amount,
                currency: 'jpy',
                orderId,
              });
              apiSpan.setAttribute('payment.transaction_id', result.transactionId);
              apiSpan.setStatus({ code: SpanStatusCode.OK });
              apiSpan.end();
              return result;
            } catch (error) {
              apiSpan.setStatus({
                code: SpanStatusCode.ERROR,
                message: error instanceof Error ? error.message : 'Unknown error',
              });
              apiSpan.recordException(error as Error);
              apiSpan.end();
              throw error;
            }
          }
        );

        // 3. Save the payment result
        await tracer.startActiveSpan(
          'savePaymentResult',
          async (saveSpan) => {
            try {
              await paymentRepository.save({
                orderId,
                transactionId: paymentResult.transactionId,
                status: 'completed',
              });
            } finally {
              // End the Span even when the save throws
              saveSpan.end();
            }
          }
        );

        span.setStatus({ code: SpanStatusCode.OK });
        span.end();
        return paymentResult;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Payment failed',
        });
        span.recordException(error as Error);
        span.end();
        throw error;
      }
    }
  );
}
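
The status, exception, and `end()` handling in the function above repeats for every Span, and a Span that is never ended is easy to introduce by accident. One possible way to reduce that risk is a small wrapper, sketched below. To keep the example self-contained it is written against a hypothetical `MinimalSpan` interface and an in-memory `createRecordingStart` stand-in instead of the real OpenTelemetry types. Adapting it to `tracer.startActiveSpan` is left as an assumption.

```typescript
// Hypothetical minimal Span shape, mirroring only the calls used in this lesson.
interface MinimalSpan {
  setStatus(status: { code: 'OK' | 'ERROR'; message?: string }): void;
  recordException(error: Error): void;
  end(): void;
}

// Anything that can start a Span and run a callback inside it.
type StartSpan = <T>(name: string, fn: (span: MinimalSpan) => Promise<T>) => Promise<T>;

// Runs `work` inside a Span: sets the status, records exceptions, always ends the Span.
async function withSpan<T>(
  start: StartSpan,
  name: string,
  work: (span: MinimalSpan) => Promise<T>
): Promise<T> {
  return start(name, async (span) => {
    try {
      const result = await work(span);
      span.setStatus({ code: 'OK' });
      return result;
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: 'ERROR', message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      span.end(); // runs on both paths, so the Span is never left open
    }
  });
}

// In-memory stand-in that logs Span events; used only to demonstrate the helper.
function createRecordingStart(log: string[]): StartSpan {
  return function start<T>(
    name: string,
    fn: (span: MinimalSpan) => Promise<T>
  ): Promise<T> {
    log.push('start ' + name);
    const span: MinimalSpan = {
      setStatus: (status) => { log.push('status ' + name + ' ' + status.code); },
      recordException: (error) => { log.push('exception ' + name + ' ' + error.message); },
      end: () => { log.push('end ' + name); },
    };
    return fn(span);
  };
}
```

With a helper like this, each step of `processPayment` shrinks to a call such as `withSpan(start, 'validateOrder', async (span) => { ... })`, and the error path stays consistent across all Spans.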

Trace backends

Jaeger vs Tempo

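Aspect	Jaeger	Grafana Tempo
Storage	Elasticsearch, Cassandra	Object storage (S3, etc.)
Cost	Higher storage cost	Low cost thanks to object storage
Search	Flexible search queries	TraceQL, Trace ID lookup
Grafana integration	Via the Jaeger data source plugin	Native integration
Scaling	Supports horizontal scaling	Highly horizontally scalable
Recommended when	You already run Elasticsearch	You use the Grafana stack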

Tempo configuration example

# tempo-config.yaml
stream_over_http_enabled: true
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: ap-northeast-1
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

# Ingestion rate limits (these cap ingestion volume; they do not sample traces)
overrides:
  defaults:
    ingestion:
      rate_strategy: local
      rate_limit_bytes: 15000000
      burst_size_bytes: 20000000

Sampling strategies

# Sampling strategies: storing every trace becomes prohibitively expensive
sampling_strategies:
  # 1. Head-based sampling (decided when the request starts)
  head_based:
    type: probabilistic
    rate: 0.1  # sample 10%
    per_service:
      payment-service: 1.0   # keep 100% of payment traces
      search-service: 0.01   # keep 1% of search traces

  # 2. Tail-based sampling (decided after the request completes)
  tail_based:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
        # keep 100% of traces that contain an error

      - name: slow_requests
        type: latency
        latency: { threshold_ms: 2000 }
        # keep 100% of traces that take 2 seconds or longer

      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
        # keep 5% of everything else
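
The two decision points in the configuration above can be sketched in TypeScript as follows. This is an illustration of the idea, not the algorithm of any particular sampler. The functions `headRateFor`, `headSample`, and `tailSample` are hypothetical, and `headSample` assumes the trace ID is a hex string.

```typescript
// Per-service head sampling rates, taken from the configuration above.
const HEAD_RATES: Record<string, number> = {
  'payment-service': 1.0,
  'search-service': 0.01,
};
const DEFAULT_HEAD_RATE = 0.1;

function headRateFor(service: string): number {
  const rate = HEAD_RATES[service];
  return rate === undefined ? DEFAULT_HEAD_RATE : rate;
}

// Head-based: decide at request start, from the trace ID alone.
// Deriving the decision from the ID (rather than Math.random) means every
// service that sees the same trace ID reaches the same answer.
function headSample(traceId: string, rate: number): boolean {
  // Map the last 8 hex characters to a number in [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < rate; // a non-hex ID yields NaN and is never sampled
}

interface FinishedTrace {
  traceId: string;
  hasError: boolean;
  durationMs: number;
}

// Tail-based: decide after the trace completes, policy by policy.
// Returns the name of the policy that kept the trace, or null if it is dropped.
function tailSample(
  trace: FinishedTrace,
  random: () => number = Math.random
): string | null {
  if (trace.hasError) return 'errors';
  if (trace.durationMs >= 2000) return 'slow_requests';
  return random() < 0.05 ? 'default' : null;
}
```

The trade-off is visible in the signatures: `headSample` needs only the trace ID, so it is cheap, while `tailSample` needs the finished trace, which means all Spans must be buffered until the request completes.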

Tracing pitfalls and countermeasures

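Pitfall	Problem	Countermeasure
High-cardinality attributes	Storage costs explode	Record user_id as a tag but do not index it
Missed samples	Important traces are not recorded	Use tail-based sampling to reliably capture errors and slow requests
Broken context propagation	The Trace is cut off at asynchronous boundaries	Propagate the Trace Context through message queues as well
Performance overhead	Degraded performance in production	Tune the sampling rate and export in batches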
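
For the broken-propagation pitfall, the fix is to carry the Trace Context in the message itself, just as HTTP calls carry it in headers. Below is a dependency-free sketch that uses an in-memory array as the queue. `QueueMessage`, `publish`, and `consume` are hypothetical names. With a real broker you would put the same `traceparent` value into the message headers or attributes, usually through an OpenTelemetry propagator rather than hand-written code.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean;
}

interface QueueMessage<T> {
  headers: Record<string, string>;
  body: T;
}

// Write the context into the message headers in traceparent format.
function injectContext(ctx: TraceContext, headers: Record<string, string>): void {
  headers['traceparent'] =
    '00-' + ctx.traceId + '-' + ctx.spanId + '-' + (ctx.sampled ? '01' : '00');
}

// Read the context back; returns null if the header is missing or malformed.
function extractContext(headers: Record<string, string>): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    headers['traceparent'] || ''
  );
  if (!match) return null;
  const [traceId, spanId, flags] = match.slice(1) as [string, string, string];
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Producer: attach the current context before enqueueing.
function publish<T>(queue: QueueMessage<T>[], ctx: TraceContext, body: T): void {
  const headers: Record<string, string> = {};
  injectContext(ctx, headers);
  queue.push({ headers, body });
}

// Consumer: recover the producer's context so new Spans can use it as their parent.
function consume<T>(
  queue: QueueMessage<T>[]
): { parent: TraceContext | null; body: T } | null {
  const message = queue.shift();
  if (!message) return null;
  return { parent: extractContext(message.headers), body: message.body };
}
```

The consumer then starts its Span with the extracted `traceId` and uses `spanId` as the Parent Span ID, so the asynchronous work shows up in the same trace as the request that triggered it.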

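Summary

Key point	Details
Trace/Span	The end-to-end journey of a request and its individual units of work
Context Propagation	Propagated between services with W3C Trace Context
Manual instrumentation	Add custom Spans to important business logic
Sampling	Head-based (probabilistic), Tail-based (conditional)
Backends	Jaeger (for existing Elasticsearch environments), Tempo (for the Grafana stack)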

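Checklist

I understand the concepts of Trace, Span, and Context Propagation
I can implement manual instrumentation with OpenTelemetry
I know when to use each sampling strategy
I understand the characteristics of Jaeger and Tempo and how to choose between them
I know the pitfalls of tracing and their countermeasures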

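Next steps

Next, you will study "Structured Logging and Log Aggregation". We will dig into logs, the last of the three pillars, and how to structure them so they can be used efficiently.

Estimated reading time: 40 minutes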