Vision Language Model

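"You can now classify images. But can you describe what is in an image in words?"

Tanaka, the VPoE, poses the question.

"A Vision Language Model (VLM) is a model that understands images and responds in text. CLIP embeds images and text in the same space, and LLaVA lets you hold a conversation about an image. Together they are foundational technologies for multimodal AI."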

VLM overview

Conventional image AI:
  Image → CNN → class label ("dog", "cat")

VLM:
  Image + text → VLM → text response
  "What is in this image?" → "A woman wearing a red dress..."

How CLIP works

Image encoder          Text encoder
┌──────────┐           ┌──────────┐
│  Image   │           │   Text   │
└─────┬────┘           └─────┬────┘
      ↓                      ↓
  [Vision                [Transformer
   Transformer]           Encoder]
      ↓                      ↓
  Image embedding        Text embedding
  (512-dim)              (512-dim)
      └──────────┬───────────┘
         Cosine similarity

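Feature	Description
Contrastive learning	Trained on image-text pairs so that matching pairs score higher than mismatched ones
Zero-shot	Can classify categories it was never explicitly trained on
Versatility	Useful for image search, classification, and similarity computation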
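The contrastive objective in the table can be illustrated with a minimal NumPy sketch. It is not CLIP's actual training code: the function names, the random toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration (in CLIP the temperature is a learned parameter). The idea is that row i of the image batch and row i of the text batch form a matching pair, and the loss rewards the matching pair for having the highest cosine similarity in its row and in its column.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of matched (image, text) pairs.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # logits[i, j] = cosine similarity of image i and text j, divided by the temperature
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(len(logits))
    # Image -> text: each row should concentrate its probability on the diagonal entry
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: each column should concentrate its probability on the diagonal entry
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy data: random vectors stand in for encoder outputs (illustration only)
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 512))
aligned_texts = images + 0.01 * rng.normal(size=(4, 512))  # each text close to its image
shuffled_texts = aligned_texts[[1, 2, 3, 0]]                # pairs deliberately mismatched

aligned_loss = clip_contrastive_loss(images, aligned_texts)
shuffled_loss = clip_contrastive_loss(images, shuffled_texts)
```

With correctly aligned pairs the loss is close to zero, while the shuffled pairing yields a much larger value. Training pushes the two encoders toward the aligned case.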

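CLIP implementation

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load the pretrained model and its processor once
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_zero_shot_classify(image_path, candidate_labels):
    """Zero-shot image classification with CLIP."""
    # Convert to RGB so grayscale or RGBA files are handled consistently
    image = Image.open(image_path).convert("RGB")

    inputs = processor(
        text=candidate_labels,
        images=image,
        return_tensors="pt",
        padding=True,
    )

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, number of labels); softmax turns it into probabilities
    logits = outputs.logits_per_image
    probs = logits.softmax(dim=1)

    results = [
        {'label': label, 'probability': round(prob.item(), 4)}
        for label, prob in zip(candidate_labels, probs[0])
    ]
    # Most probable label first
    return sorted(results, key=lambda x: -x['probability'])

# Example usage (English labels, because this checkpoint was trained predominantly on English text)
labels = ["a healthy crop", "a diseased crop", "a crop damaged by pests"]
results = clip_zero_shot_classify("plant_image.jpg", labels)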
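The same shared embedding space also supports image search: embed a text query, then rank stored image embeddings by cosine similarity. The sketch below shows only the ranking step. The helper name `top_k_images` and the toy 3-dimensional vectors are hypothetical. In practice the vectors would presumably come from `CLIPModel.get_image_features` and `CLIPModel.get_text_features`.

```python
import numpy as np

def top_k_images(query_emb, image_embs, k=3):
    """Return (index, cosine similarity) for the k images closest to the query."""
    query = query_emb / np.linalg.norm(query_emb)
    images = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    # After normalization, a dot product is exactly the cosine similarity
    scores = images @ query
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dimensional vectors stand in for real 512-dimensional CLIP embeddings
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_emb = np.array([0.9, 0.1, 0.0])

hits = top_k_images(query_emb, image_embs, k=2)
```

Here `hits` lists image 0 first and image 2 second, since those two vectors point in directions closest to the query. For a large collection, the image embeddings would typically be computed once and cached, so that only the query needs encoding at search time.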

LLaVA（Large Language and Vision Assistant）

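import torch
from PIL import Image
from transformers import LlavaForConditionalGeneration, AutoProcessor

MODEL_ID = "llava-hf/llava-1.5-7b-hf"

# Use float16 on a GPU; fall back to float32 on CPU, where half precision is poorly supported
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load once at module level: reloading a 7B model on every call is very slow
llava_model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
).to(device)
llava_processor = AutoProcessor.from_pretrained(MODEL_ID)

def analyze_image_with_llava(image_path, question):
    """Answer a question about an image with LLaVA."""
    image = Image.open(image_path).convert("RGB")
    prompt = f"USER: <image>\n{question}\nASSISTANT:"

    # Move tensors to the model's device; only floating-point tensors are cast to dtype
    inputs = llava_processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)
    output = llava_model.generate(**inputs, max_new_tokens=200)
    response = llava_processor.decode(output[0], skip_special_tokens=True)

    # The decoded text includes the prompt, so keep only the part after the final "ASSISTANT:"
    return response.split("ASSISTANT:")[-1].strip()

# Example usage
response = analyze_image_with_llava(
    "chest_xray.jpg",
    "Are there any abnormal findings in this chest X-ray? Please explain in detail.",
)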
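The prompt construction and answer extraction in the function above are plain string handling, so they can be separated out and checked without loading the 7B model. The following is a minimal sketch. The helper names `build_llava_prompt` and `extract_answer` are hypothetical, and the decoded string is written by hand for illustration rather than taken from real model output.

```python
def build_llava_prompt(question):
    """Wrap a question in the single-turn prompt layout used in this section."""
    return f"USER: <image>\n{question}\nASSISTANT:"

def extract_answer(decoded_text):
    """Keep only the text after the last 'ASSISTANT:' marker."""
    return decoded_text.split("ASSISTANT:")[-1].strip()

prompt = build_llava_prompt("What is shown in this image?")

# Hand-written stand-in for decoded model output (not produced by LLaVA)
decoded = "USER: \nWhat is shown in this image?\nASSISTANT: A dog running on a beach."
answer = extract_answer(decoded)
```

Note that splitting on the last `ASSISTANT:` marker would cut the answer short if the model itself generated that string, so a production version might instead slice off the prompt tokens before decoding.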

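Summary

Item	Key point
CLIP	Embeds images and text in a shared space, enabling zero-shot classification
LLaVA	A VLM that understands images and responds in natural language
Applications	Image classification, image search, visual question answering (VQA)
Limitations	Hallucination; extra caution is needed in specialized domains such as medicine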

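Checklist

I can explain how CLIP's contrastive learning works
I can implement zero-shot image classification with CLIP
I understand how LLaVA works and how to use it
I can explain the limitations of VLMs and the points that require caution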

Next steps

You now understand Vision Language Models. Next, learn techniques for fusing images and text.

Estimated reading time: 30 minutes