Evaluating RAG Systems - L0 Curriculum

Story

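CTO Sato

You now understand each of the Advanced RAG patterns. So here is a question.

CTO Sato

After adding Query Rewriting, did quality actually improve? What effect did Re-ranking have? How would you measure it?

CTO Sato

...It feels like it got better, but I haven't measured it with numbers.

CTO Sato

“It feels better” is not acceptable in production. We are going to introduce a framework that measures RAG quality quantitatively. Retrieval and Generation each have their own appropriate metrics. Today, let's get the big picture.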

The Two Axes of RAG Evaluation

Evaluating Retrieval and Evaluating Generation

graph LR
    subgraph Framework["RAG Evaluation Framework"]
        subgraph Retrieval["Retrieval Evaluation"]
            RQ["Are the right documents<br/>being retrieved?"]
            RM["Metrics:<br/>- Recall@K<br/>- Precision@K<br/>- MRR<br/>- NDCG"]
        end
        subgraph Generation["Generation Evaluation"]
            GQ["Is an accurate answer<br/>being generated from<br/>the retrieved results?"]
            GM["Metrics:<br/>- Faithfulness<br/>- Answer Relevancy<br/>- Context Utilization<br/>- Hallucination Rate"]
        end
    end

    classDef retrieval fill:#dbeafe,stroke:#3b82f6
    classDef generation fill:#fef3c7,stroke:#f59e0b
    class RQ,RM retrieval
    class GQ,GM generation

Retrieval Metrics

Recall@K and Precision@K

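interface RetrievalEvaluation {
  query: string;
  retrievedDocs: string[];     // IDs of the retrieved documents, in rank order
  relevantDocs: string[];      // IDs of the ground-truth relevant documents
}

// Recall@K: fraction of all relevant documents that appear in the top K
function recallAtK(sample: RetrievalEvaluation, k: number): number {
  if (sample.relevantDocs.length === 0) return 0;   // avoid division by zero
  const topK = sample.retrievedDocs.slice(0, k);
  const relevant = topK.filter(id => sample.relevantDocs.includes(id));
  return relevant.length / sample.relevantDocs.length;
}

// Precision@K: fraction of the top K results that are relevant
function precisionAtK(sample: RetrievalEvaluation, k: number): number {
  if (k <= 0) return 0;   // avoid division by zero
  const topK = sample.retrievedDocs.slice(0, k);
  const relevant = topK.filter(id => sample.relevantDocs.includes(id));
  return relevant.length / k;
}

// Example: there are 3 relevant documents and 2 of them appear in the top 5 results
// Recall@5 = 2/3 = 0.667
// Precision@5 = 2/5 = 0.400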
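To make the trade-off between the two metrics concrete, the sketch below sweeps K from 1 to 5 over a single hand-made query. The document IDs and the `hitsInTopK` and `sweepK` helpers are hypothetical names introduced only for this illustration, and the metric logic is re-declared so the snippet runs on its own.

```typescript
// Hypothetical toy data: ranked retrieval results and ground-truth IDs.
const retrievedIds: string[] = ["d1", "d7", "d3", "d9", "d4"];
const relevantIds: string[] = ["d1", "d3", "d5"];

// Count how many of the top-K results are in the relevant set.
function hitsInTopK(retrieved: string[], relevant: string[], k: number): number {
  return retrieved.slice(0, k).filter(id => relevant.includes(id)).length;
}

interface KSweepRow {
  k: number;
  recall: number;
  precision: number;
}

// Compute Recall@K and Precision@K for K = 1..maxK.
function sweepK(retrieved: string[], relevant: string[], maxK: number): KSweepRow[] {
  const rows: KSweepRow[] = [];
  for (let k = 1; k <= maxK; k++) {
    const hits = hitsInTopK(retrieved, relevant, k);
    rows.push({
      k,
      recall: relevant.length === 0 ? 0 : hits / relevant.length,
      precision: hits / k,
    });
  }
  return rows;
}

const sweepRows = sweepK(retrievedIds, relevantIds, 5);
// K=1: recall 1/3, precision 1/1
// K=3: recall 2/3, precision 2/3
// K=5: recall 2/3, precision 2/5
```

In this toy case recall stops growing after K=3 while precision keeps falling, which is why the two numbers are usually reported together for a fixed K.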

MRR (Mean Reciprocal Rank)

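Measures the rank at which the first relevant document appears. MRR is the mean of that reciprocal rank over all queries.

function reciprocalRank(sample: RetrievalEvaluation): number {
  for (let i = 0; i < sample.retrievedDocs.length; i++) {
    if (sample.relevantDocs.includes(sample.retrievedDocs[i])) {
      return 1 / (i + 1);   // ranks are 1-based
    }
  }
  return 0;   // no relevant document was retrieved
}

function meanReciprocalRank(evals: RetrievalEvaluation[]): number {
  if (evals.length === 0) return 0;   // avoid 0/0 on an empty query set
  const rrs = evals.map(e => reciprocalRank(e));
  return rrs.reduce((sum, rr) => sum + rr, 0) / rrs.length;
}

// Example: the first relevant document appears at rank 3 → RR = 1/3 = 0.333
//          the first relevant document appears at rank 1 → RR = 1/1 = 1.000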
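
As a worked example of the averaging step, the sketch below computes MRR over three hypothetical queries, each represented only by a list of relevance flags in rank order. The flag representation and the function names are assumptions made for this illustration.

```typescript
// Hypothetical ranked results for three queries; `true` marks a relevant document.
const relevanceByQuery: boolean[][] = [
  [true, false, false],   // first hit at rank 1 → RR = 1
  [false, false, true],   // first hit at rank 3 → RR = 1/3
  [false, false, false],  // no hit              → RR = 0
];

// Reciprocal rank of the first relevant result, or 0 if there is none.
function reciprocalRankOfFlags(flags: boolean[]): number {
  const firstHit = flags.indexOf(true);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

// Mean of the per-query reciprocal ranks.
function mrrOfFlags(queries: boolean[][]): number {
  if (queries.length === 0) return 0;
  const total = queries.reduce((sum, flags) => sum + reciprocalRankOfFlags(flags), 0);
  return total / queries.length;
}

const toyMrr = mrrOfFlags(relevanceByQuery);
// (1 + 1/3 + 0) / 3 = 4/9 ≈ 0.444
```

A query with no relevant result contributes 0, so it pulls the mean down rather than being skipped.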

NDCG (Normalized Discounted Cumulative Gain)

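A rank-aware metric: the higher the relevant documents are ranked, the higher the score.

function dcg(relevanceScores: number[]): number {
  return relevanceScores.reduce((sum, score, i) => {
    // Gain (2^score - 1), discounted by log2(rank + 1) with rank = i + 1
    return sum + (Math.pow(2, score) - 1) / Math.log2(i + 2);
  }, 0);
}

function ndcg(retrievedScores: number[], idealScores: number[]): number {
  const actualDCG = dcg(retrievedScores);
  // Sort a copy so the caller's array is not mutated
  const idealDCG = dcg([...idealScores].sort((a, b) => b - a));
  return idealDCG === 0 ? 0 : actualDCG / idealDCG;
}

// Example: relevance of the retrieved results [3, 0, 2, 0, 1] (3 = most relevant)
// Ideal ordering                              [3, 2, 1, 0, 0]
// NDCG = DCG(actual) / DCG(ideal) → approaches 1 as highly relevant documents move toward the top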
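
The example in the comments can be carried through numerically. The sketch below re-declares the DCG formula under the names `dcgOf` and `ndcgOf` (hypothetical names, chosen to keep the snippet self-contained) and evaluates it on the same relevance scores.

```typescript
// DCG with exponential gain (2^rel - 1) and a log2(rank + 1) discount.
function dcgOf(scores: number[]): number {
  return scores.reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG: actual DCG divided by the DCG of the same scores sorted in descending order.
function ndcgOf(scores: number[]): number {
  const ideal = [...scores].sort((a, b) => b - a);
  const idealDcg = dcgOf(ideal);
  return idealDcg === 0 ? 0 : dcgOf(scores) / idealDcg;
}

const actualScores = [3, 0, 2, 0, 1];
const actualDcg = dcgOf(actualScores);          // 7/1 + 0 + 3/2 + 0 + 1/log2(6) ≈ 8.887
const idealDcgValue = dcgOf([3, 2, 1, 0, 0]);   // 7/1 + 3/log2(3) + 1/2 ≈ 9.393
const toyNdcg = ndcgOf(actualScores);           // ≈ 0.946
```

The score stays close to 1 here because the most relevant document is already at rank 1; the loss comes only from the two lower-graded documents sitting at ranks 3 and 5.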

Generation Metrics

Faithfulness

Measures whether the generated answer is grounded in the retrieved context.

interface FaithfulnessEvaluation {
  query: string;
  answer: string;
  contexts: string[];
}

class FaithfulnessEvaluator {
  constructor(private readonly llm: LLMService) {}

  async evaluate(sample: FaithfulnessEvaluation): Promise<number> {
    // Step 1: extract the claims from the answer
    const claims = await this.extractClaims(sample.answer);

    // Step 2: check whether each claim is supported by the contexts
    let supportedCount = 0;
    for (const claim of claims) {
      const isSupported = await this.verifyClaim(claim, sample.contexts);
      if (isSupported) supportedCount++;
    }

    // Faithfulness = supported claims / total claims
    // (an answer with no claims scores 1, since nothing in it is unsupported)
    return claims.length === 0 ? 1 : supportedCount / claims.length;
  }

  private async extractClaims(answer: string): Promise<string[]> {
    const prompt = `Extract the independent factual claims from the answer below.
Each claim should be a single, verifiable sentence.

Answer: ${answer}

JSON format: {"claims": ["claim 1", "claim 2", ...]}`;

    const response = await this.llm.complete(prompt);
    // Assumes the model returns bare JSON; JSON.parse throws otherwise
    return JSON.parse(response).claims;
  }

  private async verifyClaim(claim: string, contexts: string[]): Promise<boolean> {
    const contextText = contexts.join('\n\n');
    const prompt = `Does the context below support this claim?

Claim: ${claim}

Context:
${contextText}

Answer (yes/no):`;

    const response = await this.llm.complete(prompt);
    // Accept variants such as "Yes" or "yes." instead of requiring an exact match
    return response.trim().toLowerCase().startsWith('yes');
  }
}
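
Model output is not guaranteed to be clean JSON or a bare "yes", so the parsing steps are worth isolating and testing without an LLM call. The sketch below is one possible set of helpers; `stripCodeFence`, `parseClaims`, `parseYesNo`, and `faithfulnessScore` are hypothetical names, and the assumption that responses may arrive wrapped in a Markdown code fence is mine, not something the evaluator above specifies.

```typescript
const FENCE = "`".repeat(3);

// Remove an optional Markdown code fence (with an optional language tag) around a response.
function stripCodeFence(response: string): string {
  let text = response.trim();
  if (text.length >= 6 && text.startsWith(FENCE) && text.endsWith(FENCE)) {
    text = text.slice(3, -3);
    const firstNewline = text.indexOf("\n");
    const firstLine = firstNewline === -1 ? "" : text.slice(0, firstNewline).trim();
    // Drop the opening line only when it is empty or a bare language tag such as "json".
    if (firstNewline !== -1 && /^[A-Za-z]*$/.test(firstLine)) {
      text = text.slice(firstNewline + 1);
    }
  }
  return text.trim();
}

// Parse {"claims": [...]} and return only non-empty strings; malformed input yields [].
function parseClaims(response: string): string[] {
  try {
    const parsed: unknown = JSON.parse(stripCodeFence(response));
    const claims = (parsed as { claims?: unknown }).claims;
    if (!Array.isArray(claims)) return [];
    return claims.filter((c): c is string => typeof c === "string" && c.trim() !== "");
  } catch {
    return [];
  }
}

// Interpret a yes/no verdict, tolerating case, whitespace, and trailing punctuation.
function parseYesNo(response: string): boolean {
  return response.trim().toLowerCase().startsWith("yes");
}

// Faithfulness = supported claims / total claims (1 when there are no claims).
function faithfulnessScore(verdicts: boolean[]): number {
  if (verdicts.length === 0) return 1;
  return verdicts.filter(v => v).length / verdicts.length;
}

const exampleScore = faithfulnessScore([true, false, true, true]);   // 3/4 = 0.75
```

Note that `parseClaims` returns an empty list on a parse failure, which the scoring rule then treats as a perfect score. An evaluator used for real decisions would probably want to report parse failures separately instead of folding them into the metric.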

Answer Relevancy

Measures how relevant the generated answer is to the user's question.

class AnswerRelevancyEvaluator {
  constructor(
    private readonly llm: LLMService,
    private readonly embedder: EmbeddingService,
  ) {}

  async evaluate(query: string, answer: string): Promise<number> {
    // Work backwards from the answer and generate N questions
    const generatedQuestions = await this.generateQuestions(answer, 3);
    if (generatedQuestions.length === 0) return 0;   // nothing to compare against

    // Compute the similarity between the original query and each generated question
    const queryEmbedding = await this.embedder.embed(query);
    const questionEmbeddings = await this.embedder.embedBatch(generatedQuestions);

    const similarities = questionEmbeddings.map(qe =>
      cosineSimilarity(queryEmbedding, qe)
    );

    return similarities.reduce((sum, s) => sum + s, 0) / similarities.length;
  }

  private async generateQuestions(answer: string, n: number): Promise<string[]> {
    const prompt = `Generate ${n} questions for which the answer below would be an appropriate response.

Answer: ${answer}

JSON format: {"questions": ["question 1", "question 2", ...]}`;

    const response = await this.llm.complete(prompt);
    // Assumes the model returns bare JSON; JSON.parse throws otherwise
    return JSON.parse(response).questions;
  }
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  const denominator = normA * normB;
  // A zero vector has no direction, so define its similarity as 0 rather than returning NaN
  return denominator === 0 ? 0 : dotProduct / denominator;
}
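
The averaging step is easier to see with tiny vectors. The sketch below uses hypothetical 2-D "embeddings" in place of real model output (real embedding vectors typically have hundreds of dimensions or more) and a re-declared `cosine` helper so that it runs on its own.

```typescript
// Cosine similarity with a guard for zero-length vectors.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  const denom = normA * normB;
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical embedding of the original query.
const queryVec = [1, 0];

// Hypothetical embeddings of three questions generated back from the answer.
const generatedQuestionVecs = [
  [1, 0],   // same direction   → similarity 1
  [1, 1],   // 45 degrees apart → similarity ≈ 0.707
  [0, 1],   // orthogonal       → similarity 0
];

const sims = generatedQuestionVecs.map(v => cosine(queryVec, v));
const toyRelevancy = sims.reduce((sum, s) => sum + s, 0) / sims.length;
// (1 + 0.707 + 0) / 3 ≈ 0.569
```

An answer that drifts off topic produces generated questions that point away from the original query, which lowers the mean similarity.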

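The RAGAS Framework

RAGAS Overview

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework for comprehensively evaluating the quality of RAG systems.

Metric	Target	Description
Faithfulness	Generation	Is the answer faithful to the context?
Answer Relevancy	Generation	Is the answer relevant to the question?
Context Precision	Retrieval	Are the relevant contexts ranked near the top?
Context Recall	Retrieval	Was all of the necessary context retrieved?
Context Utilization	Both	Is the retrieved context actually used in the answer?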
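
Because Context Precision asks whether relevant contexts are ranked near the top, it needs a rank-aware formula rather than a plain ratio. The sketch below assumes an average-precision-style definition (the mean of Precision@k taken at each rank k that holds a relevant context). This is my reading of the metric, not a quotation of the RAGAS implementation, so check the exact definition against the RAGAS documentation. `rankAwareContextPrecision` is a hypothetical name.

```typescript
// relevanceFlags[i] is true when the context at rank i + 1 is relevant.
function rankAwareContextPrecision(relevanceFlags: boolean[]): number {
  let hitsSoFar = 0;
  let sumOfPrecisions = 0;
  relevanceFlags.forEach((isRelevant, i) => {
    if (isRelevant) {
      hitsSoFar++;
      sumOfPrecisions += hitsSoFar / (i + 1);   // Precision@k at this relevant rank
    }
  });
  return hitsSoFar === 0 ? 0 : sumOfPrecisions / hitsSoFar;
}

// Both lists contain 2 relevant contexts out of 4, so a plain ratio gives 0.5 for each.
const relevantFirst = rankAwareContextPrecision([true, true, false, false]);   // (1/1 + 2/2) / 2 = 1.0
const relevantLast = rankAwareContextPrecision([false, false, true, true]);    // (1/3 + 2/4) / 2 ≈ 0.417
```

The two orderings are indistinguishable under a set-based ratio but differ sharply here, which is the behaviour the table's description of Context Precision calls for.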

Building the Evaluation Dataset

interface RAGEvalDataset {
  samples: RAGEvalSample[];
}

interface RAGEvalSample {
  question: string;
  groundTruth: string;           // expected answer
  relevantDocIds: string[];      // IDs of the ground-truth relevant documents
  contexts?: string[];           // contexts returned by retrieval (filled in at run time)
  answer?: string;               // answer from the RAG system (generated at run time)
}

// One way to build an evaluation dataset
class EvalDatasetBuilder {
  constructor(private readonly llm: LLMService) {}

  // Automatically generate question-answer pairs from documents
  async generateFromDocuments(
    documents: Document[]
  ): Promise<RAGEvalSample[]> {
    const samples: RAGEvalSample[] = [];

    for (const doc of documents) {
      const prompt = `Generate three question-answer pairs from the document below.
Make each question specific, so that it cannot be answered without knowing the content of the document.

Document:
${doc.content}

JSON format:
{"pairs": [{"question": "question", "answer": "answer"}, ...]}`;

      const response = await this.llm.complete(prompt);
      // Assumes the model returns bare JSON; JSON.parse throws otherwise
      const parsed = JSON.parse(response);

      for (const pair of parsed.pairs) {
        samples.push({
          question: pair.question,
          groundTruth: pair.answer,
          relevantDocIds: [doc.id],
        });
      }
    }

    return samples;
  }
}
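
Automatically generated pairs tend to need a cleaning pass before they are used as ground truth. The sketch below shows one possible filter that drops incomplete samples and near-duplicate questions. The `EvalSampleDraft` type, the helper names, and the sample data are all hypothetical, and the normalization rule (trim, lowercase, collapse whitespace) is an assumption about what counts as a duplicate.

```typescript
interface EvalSampleDraft {
  question: string;
  groundTruth: string;
  relevantDocIds: string[];
}

// Normalize a question for duplicate detection: trim, lowercase, collapse whitespace.
function normalizeQuestion(q: string): string {
  return q.trim().toLowerCase().replace(/\s+/g, " ");
}

// Keep the first occurrence of each question and drop samples with missing fields.
function cleanSamples(samples: EvalSampleDraft[]): EvalSampleDraft[] {
  const seen = new Set<string>();
  const kept: EvalSampleDraft[] = [];
  for (const s of samples) {
    const key = normalizeQuestion(s.question);
    const isComplete =
      key !== "" && s.groundTruth.trim() !== "" && s.relevantDocIds.length > 0;
    if (!isComplete || seen.has(key)) continue;
    seen.add(key);
    kept.push(s);
  }
  return kept;
}

// Made-up drafts for illustration.
const draftSamples: EvalSampleDraft[] = [
  { question: "What is the retention period?", groundTruth: "90 days", relevantDocIds: ["doc-1"] },
  // Duplicate of the first question once case and spacing are normalized
  { question: "what is the  retention period? ", groundTruth: "90 days", relevantDocIds: ["doc-1"] },
  // Empty ground truth
  { question: "Who approves access?", groundTruth: "", relevantDocIds: ["doc-2"] },
];

const cleanedSamples = cleanSamples(draftSamples);   // keeps only the first draft
```

A filter like this only catches mechanical problems. Whether a generated answer is actually correct still needs a human spot check.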

Implementing the Evaluation Pipeline

interface RAGEvalResult {
  sample: RAGEvalSample;
  metrics: {
    faithfulness: number;
    answerRelevancy: number;
    contextPrecision: number;
    contextRecall: number;
  };
}

class RAGEvaluator {
  constructor(
    private readonly ragPipeline: RAGPipeline,
    private readonly faithfulnessEval: FaithfulnessEvaluator,
    private readonly relevancyEval: AnswerRelevancyEvaluator,
  ) {}

  async evaluateDataset(dataset: RAGEvalDataset): Promise<{
    results: RAGEvalResult[];
    aggregated: Record<string, number>;
  }> {
    const results: RAGEvalResult[] = [];

    for (const sample of dataset.samples) {
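      // Run the RAG pipeline
      const contexts = await this.ragPipeline.retrieve(sample.question);
      // Assumes generate() answers from the same contexts that retrieve() returned;
      // otherwise Faithfulness is scored against contexts the answer never saw
      const answer = await this.ragPipeline.generate(sample.question);

      const contextTexts = contexts.map(c => c.chunk.content);
      // relevantDocIds holds document IDs, so chunk.id must be comparable to them
      // (map each chunk to its parent document ID first if the two differ)
      const contextIds = contexts.map(c => c.chunk.id);

      // Compute the metrics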
      const faithfulness = await this.faithfulnessEval.evaluate({
        query: sample.question,
        answer,
        contexts: contextTexts,
      });

      const answerRelevancy = await this.relevancyEval.evaluate(
        sample.question,
        answer,
      );

      const contextPrecision = this.calculateContextPrecision(
        contextIds,
        sample.relevantDocIds,
      );

      const contextRecall = this.calculateContextRecall(
        contextIds,
        sample.relevantDocIds,
      );

      results.push({
        sample: { ...sample, contexts: contextTexts, answer },
        metrics: { faithfulness, answerRelevancy, contextPrecision, contextRecall },
      });
    }

    // Aggregate the metrics
    const aggregated = {
      avgFaithfulness: avg(results.map(r => r.metrics.faithfulness)),
      avgAnswerRelevancy: avg(results.map(r => r.metrics.answerRelevancy)),
      avgContextPrecision: avg(results.map(r => r.metrics.contextPrecision)),
      avgContextRecall: avg(results.map(r => r.metrics.contextRecall)),
    };

    return { results, aggregated };
  }

  // Set-based precision over all retrieved IDs; rank order is not taken into account
  private calculateContextPrecision(
    retrievedIds: string[],
    relevantIds: string[]
  ): number {
    const relevant = retrievedIds.filter(id => relevantIds.includes(id));
    return retrievedIds.length === 0 ? 0 : relevant.length / retrievedIds.length;
  }

  private calculateContextRecall(
    retrievedIds: string[],
    relevantIds: string[]
  ): number {
    const found = relevantIds.filter(id => retrievedIds.includes(id));
    return relevantIds.length === 0 ? 1 : found.length / relevantIds.length;
  }
}

function avg(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((s, v) => s + v, 0) / values.length;
}
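
The aggregated numbers become useful once two runs are compared, for example a baseline against the same pipeline with Query Rewriting enabled. The sketch below computes per-metric deltas between two aggregated results. `compareRuns`, the `MetricSummary` and `MetricDelta` types, the 0.01 cut-off, and the scores themselves are illustrative assumptions, not measured results.

```typescript
type MetricSummary = Record<string, number>;

interface MetricDelta {
  metric: string;
  baseline: number;
  candidate: number;
  delta: number;
}

// Compute candidate - baseline for every metric present in both runs.
function compareRuns(baseline: MetricSummary, candidate: MetricSummary): MetricDelta[] {
  return Object.keys(baseline)
    .filter(metric => metric in candidate)
    .map(metric => ({
      metric,
      baseline: baseline[metric],
      candidate: candidate[metric],
      delta: candidate[metric] - baseline[metric],
    }));
}

// Illustrative numbers only.
const baselineRun: MetricSummary = { avgFaithfulness: 0.8, avgContextRecall: 0.6 };
const withQueryRewriting: MetricSummary = { avgFaithfulness: 0.8, avgContextRecall: 0.75 };

const runDeltas = compareRuns(baselineRun, withQueryRewriting);
// Treat changes of 0.01 or less as "no change" (an arbitrary cut-off for this sketch).
const improvedMetrics = runDeltas.filter(d => d.delta > 0.01).map(d => d.metric);
// ["avgContextRecall"]
```

Both runs should use the same evaluation dataset, and on a small dataset a small delta may be noise rather than a real improvement.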

Automatically Generating the Evaluation Report

function generateEvalReport(
  results: RAGEvalResult[],
  aggregated: Record<string, number>
): string {
  let report = `# RAG Evaluation Report\n\n`;
  report += `## Aggregated Metrics\n\n`;
  report += `| Metric | Score | Threshold | Result |\n`;
  report += `|--------|-------|-----------|--------|\n`;

  const thresholds: Record<string, number> = {
    avgFaithfulness: 0.85,
    avgAnswerRelevancy: 0.80,
    avgContextPrecision: 0.70,
    avgContextRecall: 0.75,
  };

  for (const [key, value] of Object.entries(aggregated)) {
    const threshold = thresholds[key] ?? 0.7;
    const pass = value >= threshold ? 'PASS' : 'FAIL';
    report += `| ${key} | ${(value * 100).toFixed(1)}% | ${(threshold * 100).toFixed(0)}% | ${pass} |\n`;
  }

  report += `\n## Failure Case Analysis\n\n`;
  const failures = results.filter(
    r => r.metrics.faithfulness < 0.7 || r.metrics.answerRelevancy < 0.6
  );

  for (const failure of failures) {
    report += `### Q: ${failure.sample.question}\n`;
    report += `- Faithfulness: ${(failure.metrics.faithfulness * 100).toFixed(1)}%\n`;
    report += `- Answer Relevancy: ${(failure.metrics.answerRelevancy * 100).toFixed(1)}%\n`;
    report += `- Answer: ${failure.sample.answer?.slice(0, 200)}\n\n`;
  }

  return report;
}
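
The same thresholds can also gate a deployment instead of only annotating a report. The sketch below returns the list of metrics that fall below their threshold, so a CI step could fail when the list is non-empty. `findFailingMetrics` is a hypothetical helper, the thresholds mirror the ones in the report function, and the scores are made up for illustration.

```typescript
// Return the metrics whose score is below its threshold.
// Metrics that have a threshold but no score are skipped.
function findFailingMetrics(
  aggregated: Record<string, number>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([metric, threshold]) => metric in aggregated && aggregated[metric] < threshold)
    .map(([metric]) => metric);
}

const gateThresholds: Record<string, number> = {
  avgFaithfulness: 0.85,
  avgAnswerRelevancy: 0.80,
  avgContextPrecision: 0.70,
  avgContextRecall: 0.75,
};

// Illustrative scores only.
const gateScores: Record<string, number> = {
  avgFaithfulness: 0.90,
  avgAnswerRelevancy: 0.78,
  avgContextPrecision: 0.72,
  avgContextRecall: 0.80,
};

const failingMetrics = findFailingMetrics(gateScores, gateThresholds);   // ["avgAnswerRelevancy"]
const gatePassed = failingMetrics.length === 0;                          // false
```

Returning the failing metric names rather than a single boolean keeps the failure message specific about what regressed.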

Rough Quality Targets

Metric	PoC stage	Production minimum	Excellent
Faithfulness	> 0.70	> 0.85	> 0.95
Answer Relevancy	> 0.65	> 0.80	> 0.90
Context Precision	> 0.50	> 0.70	> 0.85
Context Recall	> 0.60	> 0.75	> 0.90
MRR	> 0.50	> 0.70	> 0.85
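
The table above can be encoded directly so that a score is mapped to its tier in code. The sketch below is one way to do that. The `tierBounds` keys, the `QualityTier` labels, and `classifyScore` are hypothetical names, while the numeric bounds are copied from the table.

```typescript
type QualityTier = "below PoC" | "PoC" | "production" | "excellent";

// Lower bounds from the table: [PoC stage, production minimum, excellent].
const tierBounds: Record<string, [number, number, number]> = {
  faithfulness: [0.70, 0.85, 0.95],
  answerRelevancy: [0.65, 0.80, 0.90],
  contextPrecision: [0.50, 0.70, 0.85],
  contextRecall: [0.60, 0.75, 0.90],
  mrr: [0.50, 0.70, 0.85],
};

function classifyScore(metric: string, score: number): QualityTier {
  const bounds = tierBounds[metric];
  if (bounds === undefined) throw new Error(`Unknown metric: ${metric}`);
  const [poc, production, excellent] = bounds;
  // The table uses strict ">" comparisons, so a score equal to a bound stays in the lower tier.
  if (score > excellent) return "excellent";
  if (score > production) return "production";
  if (score > poc) return "PoC";
  return "below PoC";
}

const faithfulnessTier = classifyScore("faithfulness", 0.88);   // "production"
const recallTier = classifyScore("contextRecall", 0.55);        // "below PoC"
```

These bounds are rough targets, so a team would likely adjust them to its own domain and risk tolerance.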

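Summary

Key point	Details
Two evaluation axes	Evaluate Retrieval (search quality) and Generation (generation quality) separately
Retrieval metrics	Measure retrieval accuracy with Recall@K, Precision@K, MRR, and NDCG
Generation metrics	Faithfulness and Answer Relevancy are the key metrics
RAGAS	A standard framework for RAG evaluation. Building the evaluation dataset matters just as much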

Checklist

I understand the Retrieval metrics (Recall, Precision, MRR, NDCG)
I understand the Generation metrics (Faithfulness, Answer Relevancy)
I understand the outline of the RAGAS framework
I understand how to build an evaluation dataset

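Next Steps

You have learned how to evaluate RAG. Next, in the exercise, you will design a RAG system for an enterprise knowledge base.

What cannot be measured cannot be improved. Evaluation is the engine that drives the quality-improvement cycle.

Estimated reading time: 40 minutes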