Content-Based Recommendation

"Collaborative filtering is weak on new products. You can't recommend something nobody has bought yet."

Tanaka, the VPoE, opens NetShop's list of new products.

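"More than 100 new products arrive every week, and being unable to recommend them until behavioral data accumulates is a lost opportunity. We also need an approach that recommends products based on what they actually are."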

Basic Principle of Content-Based Recommendation

Content-based filtering recommends items by comparing their attribute information (features). It analyzes the features of items a user liked in the past and recommends unseen items whose features are similar.
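The principle can be illustrated with a minimal sketch in plain Python. The three-dimensional feature vectors and the item names below are hypothetical and chosen only for illustration: the user profile is taken as the mean of the liked items' vectors, and unseen items are ranked by cosine similarity to that profile.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical feature vectors: [running, leather, waterproof]
items = {
    "running_a": [1, 0, 0],
    "running_b": [1, 0, 0],
    "leather_c": [0, 1, 1],
    "boots_e":   [0, 1, 1],
}
liked = ["running_a"]  # items the user liked in the past

# User profile = element-wise mean of the liked items' vectors
profile = [sum(col) / len(liked) for col in zip(*(items[i] for i in liked))]

# Score only the items the user has not seen yet
scores = {
    name: cosine(profile, vec)
    for name, vec in items.items()
    if name not in liked
}
best = max(scores, key=scores.get)
print(best, scores)
```

Because the score depends only on item attributes, an item with no purchase history can be ranked as soon as its features are known.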

Comparison with Collaborative Filtering

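Aspect	Collaborative filtering	Content-based
Input data	Behavioral data only	Item attributes (plus the target user's own history)
Cold start	Weak on new items	Can recommend new items
Serendipity	High	Low (tends to return similar items only)
Domain knowledge	Not required	Required for feature design
Data sparsity	Strongly affected	Weakly affected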

Feature Extraction

Categorical Features

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Product data
products = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'name': ['Running Shoes A', 'Running Shoes B', 'Leather Shoes C', 'Sneakers D', 'Boots E'],
    'category': ['Shoes', 'Shoes', 'Shoes', 'Shoes', 'Shoes'],
    'sub_category': ['Running', 'Running', 'Business', 'Casual', 'Outdoor'],
    'brand': ['Nike', 'Adidas', 'Regal', 'Nike', 'Timberland'],
    'price': [12000, 11000, 25000, 9000, 28000],
    'color': ['black', 'white', 'brown', 'blue', 'brown'],
    'tags': [
        ['lightweight', 'cushioning', 'mesh'],
        ['lightweight', 'stability', 'BOOST'],
        ['genuine leather', 'formal', 'waterproof'],
        ['casual', 'lightweight', 'Air'],
        ['waterproof', 'genuine leather', 'thick sole'],
    ],
})

# Multi-hot encode the tags (one binary column per tag)
mlb = MultiLabelBinarizer()
tag_features = pd.DataFrame(
    mlb.fit_transform(products['tags']),
    columns=mlb.classes_,
    index=products.index
)

print("Tag features:")
print(tag_features)
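Once each item is represented as a set of tags, a set-overlap measure already gives a usable similarity. The sketch below uses Jaccard similarity, which is not part of the code above and is shown only as one simple option; the tag lists are illustrative stand-ins keyed by product ID.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Illustrative tag lists keyed by product ID
tags = {
    1: ['lightweight', 'cushioning', 'mesh'],
    2: ['lightweight', 'stability', 'BOOST'],
    3: ['genuine leather', 'formal', 'waterproof'],
    5: ['waterproof', 'genuine leather', 'thick sole'],
}

sim_1_2 = jaccard(tags[1], tags[2])  # 1 shared tag out of 5 distinct tags
sim_3_5 = jaccard(tags[3], tags[5])  # 2 shared tags out of 4 distinct tags
print(sim_1_2, sim_3_5)
```

Cosine similarity on the multi-hot matrix would rank these pairs the same way; Jaccard is simply easier to read off by hand.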

TF-IDF Features

Generate TF-IDF vectors from the product description text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

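# Product descriptions
descriptions = [
    "Lightweight running shoes with excellent cushioning. Mesh upper for outstanding breathability. Suitable for marathons.",
    "Running shoes that combine stability and cushioning. BOOST technology provides responsive energy return.",
    "Business shoes made from high-quality genuine leather. Waterproof finish for peace of mind on rainy days. Ideal for formal occasions.",
    "Lightweight casual sneakers with Air cushioning. A design ideal for everyday wear.",
    "Waterproof genuine-leather outdoor boots. Thick sole for stable walking even on rough terrain. For winter outdoor use.",
]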

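# TF-IDF vectorization
# Note: the default tokenizer splits on word boundaries, which suits
# space-delimited text. For unsegmented languages such as Japanese, pass a
# morphological tokenizer or use analyzer='char' with character n-grams.
tfidf = TfidfVectorizer(max_features=100)
tfidf_matrix = tfidf.fit_transform(descriptions)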

# Item-to-item similarity
item_similarity = cosine_similarity(tfidf_matrix)
print("TF-IDF-based item-to-item similarity:")
sim_df = pd.DataFrame(
    item_similarity,
    index=products['name'],
    columns=products['name']
)
print(sim_df.round(3))
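To see what the vectorizer computes, the weighting can be reproduced by hand. The sketch below assumes the smoothed formula that scikit-learn uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, and it uses a naive whitespace split on three made-up documents, so its numbers are not meant to match the output above.

```python
import math
from collections import Counter

# Made-up documents, tokenized with a naive whitespace split
docs = [
    "lightweight running shoe with mesh upper",
    "stable running shoe with responsive cushioning",
    "waterproof leather boot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Sparse TF-IDF vector (dict term -> weight), L2-normalized."""
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(t) for t in tokenized]

def cosine(u, v):
    """Vectors are already unit length, so the dot product is the cosine."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

print(round(cosine(vectors[0], vectors[1]), 3), cosine(vectors[0], vectors[2]))
```

Terms that appear in many documents (such as "running" here) receive a lower IDF than rare ones (such as "mesh"), so shared rare terms contribute most to the similarity.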

Embedding Features

A more advanced option is to use embeddings from a pretrained model.

from sentence_transformers import SentenceTransformer

# Multilingual Sentence Transformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Encode the product descriptions into embeddings
embeddings = model.encode(descriptions)
print(f"Embedding dimensions: {embeddings.shape[1]}")  # 384 dimensions

# Embedding-based similarity
emb_similarity = cosine_similarity(embeddings)
print("Embedding-based item-to-item similarity:")
emb_df = pd.DataFrame(
    emb_similarity,
    index=products['name'],
    columns=products['name']
)
print(emb_df.round(3))
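Embeddings are also convenient for the cold-start case: a new item can be matched against the catalogue as soon as its description is encoded. The following sketch is a hedged illustration that replaces the model output with random vectors, so it runs without downloading a model; `top_k_similar` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.encode(descriptions): 5 items x 384 dimensions.
# Random values are used purely for illustration.
embeddings = rng.normal(size=(5, 384))

# L2-normalize once; cosine similarity then reduces to a dot product
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def top_k_similar(query_vec, k=3):
    """Return (indices, scores) of the k catalogue items closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = unit @ q
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# A "new" item with no behavioral data: here a slightly perturbed copy of item 2
new_item = embeddings[2] + 0.01 * rng.normal(size=384)
indices, scores = top_k_similar(new_item)
print(indices, scores.round(3))
```

With real embeddings, `new_item` would be the encoded description of the newly arrived product.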

Implementing Similar-Item Search

Combining the Features

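import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, StandardScaler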

class ContentBasedRecommender:
    """Content-based recommender system."""

    def __init__(self):
        self.item_vectors = None
        self.item_ids = None

    def build_item_profiles(self, products_df, descriptions):
        """Build the item profiles (one feature vector per item)."""
        self.item_ids = products_df['product_id'].values
        features = []

        # 1. Categorical features (one-hot)
        #    sparse_output requires scikit-learn >= 1.2
        ohe = OneHotEncoder(sparse_output=False)
        cat_features = ohe.fit_transform(
            products_df[['sub_category', 'brand', 'color']]
        )
        features.append(cat_features)

        # 2. Numeric features (standardized)
        scaler = StandardScaler()
        num_features = scaler.fit_transform(
            products_df[['price']].values
        )
        features.append(num_features)

        # 3. Text features (TF-IDF)
        tfidf = TfidfVectorizer(max_features=50)
        text_features = tfidf.fit_transform(descriptions).toarray()
        features.append(text_features)

        # 4. Tag features (multi-hot)
        mlb = MultiLabelBinarizer()
        tag_features = mlb.fit_transform(products_df['tags'])
        features.append(tag_features)

        # Concatenate all feature blocks
        self.item_vectors = np.hstack(features)
        print(f"Item vector dimensions: {self.item_vectors.shape[1]}")

    def get_similar_items(self, item_id, n=5):
        """Return up to n items most similar to the given item."""
        idx = np.where(self.item_ids == item_id)[0][0]
        item_vec = self.item_vectors[idx].reshape(1, -1)

        similarities = cosine_similarity(item_vec, self.item_vectors)[0]
        # Exclude the item itself (-inf, because a cosine can legitimately be -1)
        similarities[idx] = -np.inf

        top_indices = np.argsort(similarities)[::-1][:n]
        return [
            (self.item_ids[i], similarities[i])
            for i in top_indices
            if np.isfinite(similarities[i])  # never return the excluded item
        ]

    def recommend_for_user(self, user_history, n=5):
        """Recommend up to n items based on the user's purchase history."""
        # User profile = mean of the purchased items' vectors
        history_indices = [
            np.where(self.item_ids == item_id)[0][0]
            for item_id in user_history
        ]
        user_profile = self.item_vectors[history_indices].mean(axis=0).reshape(1, -1)

        # Similarity to every item
        similarities = cosine_similarity(user_profile, self.item_vectors)[0]

        # Exclude items already purchased (-inf, because a cosine can be -1)
        similarities[history_indices] = -np.inf

        top_indices = np.argsort(similarities)[::-1][:n]
        return [
            (self.item_ids[i], similarities[i])
            for i in top_indices
            if np.isfinite(similarities[i])  # never return purchased items
        ]
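One design choice worth considering when feature blocks are concatenated is how much each block should count, since one-hot columns, standardized prices, and TF-IDF weights live on different scales. The sketch below shows one simple option, scaling each block by a weight before concatenation. `combine_blocks`, the toy blocks, and the weights are all hypothetical; suitable weights would have to be tuned on real data.

```python
import numpy as np

def combine_blocks(blocks, weights):
    """Scale each feature block by its weight, then concatenate column-wise."""
    return np.hstack([
        w * np.asarray(block, dtype=float)
        for block, w in zip(blocks, weights)
    ])

# Hypothetical feature blocks for 3 items
cat_block = np.array([[1, 0], [1, 0], [0, 1]])     # one-hot sub-category
text_block = np.array([[0.2, 0.9, 0.0],
                       [0.1, 0.8, 0.3],
                       [0.7, 0.0, 0.6]])           # TF-IDF-like weights
price_block = np.array([[-1.2], [0.1], [1.1]])     # standardized price

item_vectors = combine_blocks(
    [cat_block, text_block, price_block],
    weights=[1.0, 1.0, 0.5],                       # down-weight price
)
print(item_vectors.shape)
```

Setting a weight to 0 removes that block's influence on cosine similarity entirely, which makes it easy to check how much each block contributes.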

Application to the H&M Dataset

# Example use with H&M Personalized Fashion Recommendations
# Data structure:
#   articles.csv: product information (105,542 articles)
#     - article_id, prod_name, product_type_name
#     - product_group_name, colour_group_name
#     - department_name, section_name, garment_group_name
#     - detail_desc (product description text)
#
#   customers.csv: customer information (1,371,980 customers)
#   transactions_train.csv: transaction data (31,788,324 rows)

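# Feature design for the H&M data
feature_design = {
    "Categorical features": [
        "product_type_name (51 values)",
        "product_group_name (19 values)",
        "colour_group_name (50 values)",
        "department_name (299 values)",
        "section_name (57 values)",
    ],
    "Text features": [
        "detail_desc (product description) → TF-IDF or Embedding",
        "prod_name → keyword extraction",
    ],
    "Numeric features": [
        "price (from transactions_train.csv; articles.csv has no price column)",
    ],
    "Image features (advanced)": [
        "product image → CNN Embedding",
    ],
}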
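A first preprocessing pass over this design might look like the sketch below. It uses tiny in-memory stand-ins that only borrow the Kaggle column names, so the values are made up. Filling the price of unsold articles with the median is an assumption for illustration, not a recommendation.

```python
import pandas as pd

# Tiny stand-ins that reuse the column names of articles.csv and
# transactions_train.csv; the values are invented for illustration.
articles = pd.DataFrame({
    'article_id': [101, 102, 103],
    'product_type_name': ['Trousers', 'T-shirt', 'Trousers'],
    'colour_group_name': ['Black', 'White', 'Black'],
    'detail_desc': ['Slim-fit trousers', 'Cotton jersey T-shirt', None],
})
transactions = pd.DataFrame({
    'article_id': [101, 101, 102],
    'price': [0.03, 0.05, 0.02],
})

# Price lives in the transactions, so aggregate it per article first
mean_price = (
    transactions.groupby('article_id', as_index=False)['price']
    .mean()
    .rename(columns={'price': 'mean_price'})
)
items = articles.merge(mean_price, on='article_id', how='left')

# Assumption: articles without any sales get the median of the known prices
items['mean_price'] = items['mean_price'].fillna(items['mean_price'].median())

# Missing descriptions become empty strings so a text vectorizer accepts them
items['detail_desc'] = items['detail_desc'].fillna('')

# One-hot encode the categorical columns
cat_features = pd.get_dummies(items[['product_type_name', 'colour_group_name']])
print(items[['article_id', 'mean_price']])
print(cat_features.columns.tolist())
```

With the real files, `articles` and `transactions` would come from `pd.read_csv`, and high-cardinality columns may need grouping of rare values before one-hot encoding.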

Limitations of Content-Based Recommendation and Countermeasures

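Limitations:
1. Over-specialization: recommendations stay too close to what the user has already seen
2. Feature quality: good feature design requires domain knowledge
3. Lack of serendipity: users rarely discover new genres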

Countermeasures:
1. Introduce diversity (e.g., MMR)
2. Build a hybrid with collaborative filtering
3. Use embeddings to capture implicit features as well
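The first countermeasure, MMR (Maximal Marginal Relevance), can be sketched as a greedy re-ranking step: each pick trades off relevance to the user against similarity to items already selected. The function name, the parameter `lambda_`, and the toy scores below are illustrative assumptions.

```python
def mmr_rerank(relevance, similarity, k=3, lambda_=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    relevance:  relevance[i] is the score of candidate i for the user
    similarity: similarity[i][j] is the similarity between candidates i and j
    lambda_:    1.0 means pure relevance, 0.0 means pure diversity
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy = similarity to the closest already-selected item
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: candidates 0 and 1 are near-duplicates
relevance = [0.90, 0.88, 0.60, 0.55]
similarity = [
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.15, 0.25],
    [0.10, 0.15, 1.00, 0.30],
    [0.20, 0.25, 0.30, 1.00],
]

ranked = mmr_rerank(relevance, similarity, k=3, lambda_=0.5)
print(ranked)  # [0, 2, 3]
```

With `lambda_=1.0` the same call returns `[0, 1, 2]`, the plain relevance order; lowering `lambda_` pushes the near-duplicate candidate 1 out of the list.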

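Summary

Item	Key point
Principle	Recommend by similarity of item attributes
Features	Categories, text (TF-IDF/Embedding), tags
Advantages	Strong on cold start; can reflect domain knowledge
Disadvantages	Low serendipity; requires feature design
Applications	New-product recommendation, "similar products" sections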

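Checklist

I can explain the principle and advantages of content-based recommendation
I understand the difference between TF-IDF and embeddings
I can build item profiles that combine multiple kinds of features
I understand the logic of recommending from a user profile
I can picture a feature design for the H&M dataset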

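Next Step

Now that you understand content-based recommendation, the next topic is session-based recommendation, which uses the most recent actions within a session.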

Estimated reading time: 30 minutes