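Multimodal Pipeline

"Image classification, VLMs, text analysis. Let's tie these together into a single end-to-end pipeline."

VPoE Tanaka spreads out the design diagram.

"It processes the incoming image and text, then outputs the analysis results as a report. An end-to-end pipeline is the key to running this in production."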

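Pipeline overview

Input (image + text)
    ↓
[Preprocessing]
  Image: resize, normalize
  Text: tokenize
    ↓
[Feature extraction]
  Image: CNN/ViT → image features
  Text: BERT → text features
    ↓
[Multimodal fusion]
  Early fusion / late fusion / cross-attention
    ↓
[Analysis]
  Classification, findings generation, risk assessment
    ↓
[Post-processing]
  Confidence filter, report generation
    ↓
Output (analysis report)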
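
The fusion stage names three strategies without showing how they differ. The following is a minimal, framework-free sketch of the first two, assuming features and logits are plain Python lists. The function names `early_fusion` and `late_fusion` and the `img_weight` parameter are illustrative assumptions, not part of the lesson's code, and cross-attention is left out because it needs a trained attention layer.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(img_features, txt_features):
    """Early fusion: join both feature vectors so one classifier sees them together."""
    return list(img_features) + list(txt_features)


def late_fusion(img_logits, txt_logits, img_weight=0.5):
    """Late fusion: each modality predicts separately; blend the class probabilities."""
    img_probs = softmax(img_logits)
    txt_probs = softmax(txt_logits)
    return [
        img_weight * p + (1.0 - img_weight) * q
        for p, q in zip(img_probs, txt_probs)
    ]


# Hypothetical toy values, chosen only to show the shapes involved.
fused_vector = early_fusion([0.2, 0.7], [0.1, 0.9, 0.4])
print(len(fused_vector))  # 5: two image features followed by three text features

blended = late_fusion([2.0, 0.5], [0.1, 1.5], img_weight=0.7)
print(blended)  # blended class probabilities, weighted 70% toward the image model
```

Early fusion lets a single downstream model learn interactions between modalities, while late fusion keeps the per-modality models independent and only combines their outputs.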

Pipeline implementation

import torch
from PIL import Image
from torchvision import transforms


class MultimodalPipeline:
    """Multimodal analysis pipeline."""

    def __init__(self, image_model, text_model, fusion_model, vlm_model):
        self.image_model = image_model
        self.text_model = text_model
        self.fusion_model = fusion_model
        self.vlm = vlm_model

    def process(self, image_path, text_input, task_type='classification'):
        """Process one image/text pair and return the analysis result."""
        # NOTE: task_type is accepted but not used yet; only classification runs.
        # 1. Preprocessing
        image_tensor = self._preprocess_image(image_path)
        text = self._preprocess_text(text_input)

        # 2. Feature extraction (inference only, so gradients are disabled)
        with torch.no_grad():
            img_features = self.image_model.extract_features(image_tensor)
            txt_features = self.text_model.encode(text)

            # 3. Classification (fusion model); assumes a single-item batch
            logits = self.fusion_model(img_features, txt_features)
        predicted_class = torch.argmax(logits, dim=-1).item()
        confidence = torch.softmax(logits, dim=-1).max().item()

        # 4. Findings generation (VLM)
        findings = self.vlm.analyze(
            image_path,
            f"Please analyze this image. Category: {predicted_class}"
        )

        # 5. Post-processing: flag low-confidence results for human review
        result = {
            'classification': {
                'class': predicted_class,
                'confidence': round(confidence, 3),
            },
            'findings': findings,
            'needs_review': confidence < 0.8,
        }

        return result

    def _preprocess_image(self, image_path):
        """Resize, crop, and normalize an image into a 1x3x224x224 tensor."""
        transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            # ImageNet channel statistics
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])
        image = Image.open(image_path).convert('RGB')
        return transform(image).unsqueeze(0)  # add the batch dimension

    def _preprocess_text(self, text):
        """Strip surrounding whitespace; tokenization is left to the text model."""
        return text.strip()
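
The `needs_review` flag above is what makes confidence-based routing possible. As a minimal sketch, assuming results have the same dictionary shape that `process` returns, low-confidence items can be split off into a review queue. The function name `route_by_confidence` and the sample data are hypothetical, and the 0.8 default simply mirrors the threshold used in the pipeline.

```python
def route_by_confidence(results, threshold=0.8):
    """Split results into auto-accepted items and items needing human review."""
    accepted, review_queue = [], []
    for result in results:
        confidence = result['classification']['confidence']
        # Same rule as the pipeline: strictly below the threshold means review.
        if confidence < threshold:
            review_queue.append(result)
        else:
            accepted.append(result)
    return accepted, review_queue


# Hypothetical results, shaped like the pipeline's output.
sample_results = [
    {'classification': {'class': 0, 'confidence': 0.95}, 'findings': 'sample A'},
    {'classification': {'class': 2, 'confidence': 0.61}, 'findings': 'sample B'},
    {'classification': {'class': 1, 'confidence': 0.80}, 'findings': 'sample C'},
]

accepted, review_queue = route_by_confidence(sample_results)
print(len(accepted), len(review_queue))  # 2 1
```

Keeping the threshold as a parameter makes it easy to tighten or loosen the review rate per task without touching the pipeline itself.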

Batch processing

from collections import Counter


class BatchProcessor:
    """Process a large number of images in chunks."""

    def __init__(self, pipeline, batch_size=32):
        self.pipeline = pipeline
        self.batch_size = batch_size

    def process_batch(self, items):
        """Process all items, batch_size at a time, and summarize the results."""
        results = []
        for i in range(0, len(items), self.batch_size):
            batch = items[i:i + self.batch_size]
            # Items within a chunk still go through the pipeline one by one;
            # batch_size only controls how the input list is chunked.
            batch_results = [
                self.pipeline.process(
                    item['image_path'],
                    item.get('text', ''),  # text is optional
                )
                for item in batch
            ]
            results.extend(batch_results)

        # Summary report
        summary = {
            'total': len(results),
            'needs_review': sum(1 for r in results if r['needs_review']),
            'class_distribution': self._count_classes(results),
        }

        return results, summary

    def _count_classes(self, results):
        """Count how many results fall into each predicted class."""
        return Counter(r['classification']['class'] for r in results)
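
The post-processing stage mentions report generation, but the code stops at the `summary` dictionary. One possible way to turn the output of `process_batch` into a readable report is sketched below, assuming the same `results` and `summary` shapes as above. The function name `render_report`, the plain-text layout, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def render_report(results, summary):
    """Render a plain-text report: summary first, then one line per item."""
    lines = ['Analysis report', '']
    lines.append(f"Total items: {summary['total']}")
    lines.append(f"Needs review: {summary['needs_review']}")
    lines.append('Class distribution:')
    for cls, count in sorted(summary['class_distribution'].items()):
        lines.append(f'  class {cls}: {count}')
    lines.append('')
    lines.append('Details:')
    for index, result in enumerate(results, start=1):
        info = result['classification']
        flag = ' [REVIEW]' if result['needs_review'] else ''
        lines.append(
            f"  {index}. class={info['class']} "
            f"confidence={info['confidence']:.3f}{flag}"
        )
    return '\n'.join(lines)


# Hypothetical results and summary, shaped like the output of process_batch.
sample_results = [
    {'classification': {'class': 1, 'confidence': 0.912},
     'findings': 'sample A', 'needs_review': False},
    {'classification': {'class': 0, 'confidence': 0.640},
     'findings': 'sample B', 'needs_review': True},
]
sample_summary = {
    'total': 2,
    'needs_review': 1,
    'class_distribution': Counter({1: 1, 0: 1}),
}

print(render_report(sample_results, sample_summary))
```

The findings text is omitted from the detail lines here to keep each entry on one line; a fuller report would likely include it per item.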

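Summary

Item	Key point
Pipeline	Five stages: preprocessing → feature extraction → fusion → analysis → post-processing
Confidence management	Low-confidence results are routed to human review
Batch processing	Efficient processing of large volumes of data
Report	Analysis report containing both a summary and per-item details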

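Checklist

I can explain the overall structure of a multimodal pipeline
I can implement preprocessing for images and text
I can design confidence-based routing
I understand how batch processing works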

Next step

You now understand the multimodal pipeline. Next, put multimodal analysis into practice in the exercise.

Estimated reading time: 30 minutes