Text Vectorization with TF-IDF

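"To feed text into a model, you first have to convert it into numeric vectors. One of the simplest yet most effective ways to do that is TF-IDF."

Tanaka, the VPoE, starts writing formulas on the whiteboard.

"Even in Kaggle competitions, the standard first move is to build a baseline with TF-IDF."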

Bag of Words (BoW)

The simplest approach: represent a text by how many times each word occurs in it.

from sklearn.feature_extraction.text import CountVectorizer

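# CountVectorizer does not segment Japanese. Its default token pattern splits on
# whitespace/punctuation and keeps only tokens of 2+ characters, so the text is
# pre-segmented here with spaces (in practice, use a morphological analyzer
# such as MeCab or Janome). Single-character particles (が, を, の) are dropped.
texts = [
    "商品 が 届かない",           # "The product has not arrived"
    "商品 を 返品したい",         # "I want to return the product"
    "商品 の 使い方 を 教えて",   # "Tell me how to use the product"
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(texts)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:")
print(bow_matrix.toarray())
# Vocabulary: ['使い方' '商品' '届かない' '教えて' '返品したい']
# BoW matrix:
# [[0 1 1 0 0]   ← 商品 が 届かない
#  [0 1 0 0 1]   ← 商品 を 返品したい
#  [1 1 0 1 0]]  ← 商品 の 使い方 を 教えて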

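Limitations of BoW:

Word order is lost ("the dog chases the cat" and "the cat chases the dog" get the same representation)
High-frequency words dominate (in the example above, 商品 ("product") appears in every document)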

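TF-IDF (Term Frequency - Inverse Document Frequency)

To address the dominance of high-frequency words in BoW, TF-IDF weights each word by how informative it is.

Computing TF-IDF

TF (term frequency) = number of times the word occurs in a document / total number of words in that document
IDF (inverse document frequency) = log(total number of documents / number of documents containing the word)
TF-IDF = TF × IDF

This is the textbook definition. scikit-learn's TfidfVectorizer, used below, differs in detail by default: it uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector.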
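
The textbook formulas above can be worked through by hand. The sketch below uses plain Python and a hypothetical, already tokenized three-document corpus; the helper names `tf`, `idf`, and `tf_idf` are made up for illustration and are not a library API.

```python
import math

# Pre-tokenized toy corpus (hypothetical example data).
docs = [
    ["product", "not", "delivered"],
    ["product", "return", "request"],
    ["product", "usage", "question"],
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing the term).

    Assumes the term occurs in at least one document (otherwise df is 0).
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "product" occurs in every document, so its IDF is log(3/3) = 0.
print(tf_idf("product", docs[0], docs))    # 0.0
# "delivered" occurs in one document only: (1/3) * log(3/1).
print(tf_idf("delivered", docs[0], docs))
```

A word that appears everywhere gets a weight of exactly zero under this definition, which is why smoothed variants are common in practice.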

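from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-segmented with spaces, as the output of a morphological analyzer would be;
# TfidfVectorizer's default tokenizer cannot segment unspaced Japanese text.
texts = [
    "商品 が 届かない 配送 状況 を 教えて",    # not arrived; asks for the delivery status
    "商品 を 返品 したい 返金 して ほしい",    # wants to return the product and get a refund
    "商品 の 使い方 が わからない",            # does not understand how to use the product
    "配送 が 遅い 届かない いつ 届く か",      # delivery is slow; when will it arrive?
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)

print("Vocabulary:", tfidf.get_feature_names_out())
print("\nTop TF-IDF words per document (distinctive words score high):")
feature_names = tfidf.get_feature_names_out()
for i, text in enumerate(texts):
    scores = tfidf_matrix[i].toarray()[0]
    top_indices = scores.argsort()[-3:][::-1]  # indices of the 3 highest scores
    top_words = [(feature_names[j], round(float(scores[j]), 3)) for j in top_indices]
    print(f"Document {i+1}: {top_words}")

# 商品 ("product") appears in 3 of the 4 documents, so its IDF is low → low TF-IDF score
# 返品 ("return") and 返金 ("refund") appear in only one document → high TF-IDF score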

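TF-IDF hyperparameters

tfidf = TfidfVectorizer(
    max_features=10000,      # keep only the 10,000 most frequent terms in the corpus
    min_df=2,                # keep only terms that appear in at least 2 documents
    max_df=0.95,             # drop terms that appear in more than 95% of documents
    ngram_range=(1, 2),      # use unigrams and bigrams
    sublinear_tf=True,       # log-scale TF: replace tf with 1 + log(tf)
)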

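Using N-grams

# Adding bigrams preserves some word-order information.
# Reuses `texts` from the TF-IDF example above (tokens separated by spaces).
tfidf_bigram = TfidfVectorizer(ngram_range=(1, 2))
matrix = tfidf_bigram.fit_transform(texts)

print("Sample of the unigram + bigram vocabulary:")
features = tfidf_bigram.get_feature_names_out()
bigrams = [f for f in features if ' ' in f]  # bigrams are joined by a space
print(f"Bigrams: {bigrams[:10]}")
# The bigrams include, e.g., '商品 届かない', '配送 状況', '返品 したい'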

Practice on a Kaggle dataset

import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the Disaster Tweets dataset
train_df = pd.read_csv('train.csv')
print(f"Number of rows: {len(train_df)}")
print(f"Columns: {train_df.columns.tolist()}")
print(f"Label distribution:\n{train_df['target'].value_counts()}")

# Text preprocessing
def clean_text(text):
    """Strip URLs, mentions, '#' signs, and punctuation, then lowercase."""
    text = re.sub(r'https?://\S+', '', text)   # remove URLs
    text = re.sub(r'@\w+', '', text)           # remove @mentions
    text = re.sub(r'#(\w+)', r'\1', text)      # keep the hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)        # remove punctuation and symbols
    text = text.lower().strip()
    return text

# e.g. clean_text("Fire in #California! http://t.co/x @user") == "fire in california"
train_df['clean_text'] = train_df['text'].apply(clean_text)

# TF-IDF vectorization
tfidf = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    min_df=3,
    sublinear_tf=True,
)

X = tfidf.fit_transform(train_df['clean_text'])
y = train_df['target']

print(f"TF-IDF matrix shape: {X.shape}")
# Example output: TF-IDF matrix shape: (7613, 10000)
# (one row per tweet; the number of columns is at most max_features)
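
To connect this back to the idea of a TF-IDF baseline, here is one possible sketch of the next step. It runs on a hypothetical toy corpus standing in for `train_df['clean_text']` and `train_df['target']`, and the choice of LogisticRegression is an assumption made for illustration, not something the lesson prescribes. Putting the vectorizer inside a pipeline means it is fitted on the training split only, so the validation texts do not leak into the vocabulary or IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = disaster, 0 = not a disaster).
texts = [
    "forest fire near the town",
    "earthquake shakes the city",
    "flood warning for the river",
    "people evacuated after explosion",
    "i love this sunny day",
    "new phone arrived today",
    "watching a movie tonight",
    "great coffee this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the data, keeping the class balance in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier are fitted together on the training split.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

val_predictions = model.predict(X_val)
print(f"validation accuracy: {model.score(X_val, y_val):.2f}")
```

With eight examples the accuracy itself is meaningless; the point is only the shape of the workflow.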

Inspecting the characteristic TF-IDF words

import numpy as np

def get_top_tfidf_words(tfidf_vectorizer, tfidf_matrix, labels, n=10):
    """Print the n terms with the highest mean TF-IDF score in each class."""
    feature_names = tfidf_vectorizer.get_feature_names_out()

    for label in sorted(labels.unique()):
        mask = (labels == label).to_numpy()  # boolean row mask as a NumPy array
        # Mean TF-IDF of every term over the rows of this class
        class_tfidf = np.asarray(tfidf_matrix[mask].mean(axis=0)).ravel()
        top_indices = class_tfidf.argsort()[-n:][::-1]
        top_words = [(feature_names[i], class_tfidf[i]) for i in top_indices]
        label_name = "disaster" if label == 1 else "non-disaster"
        print(f"\nCharacteristic words for {label_name}:")
        for word, score in top_words:
            print(f"  {word}: {score:.4f}")

get_top_tfidf_words(tfidf, X, y)

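Summary

Item	Key point
BoW	Represents text by word counts; ignores word order
TF-IDF	Weights words by importance; down-weights high-frequency words
N-gram	Partially preserves word order
Hyperparameters	max_features, ngram_range, min_df, max_df, sublinear_tf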

Checklist

I can explain the difference between BoW and TF-IDF
I understand the TF-IDF formula (TF × IDF)
I understand how to use N-grams
I can use scikit-learn's TfidfVectorizer

Next steps

Now that you have the basics of TF-IDF, the next topic is distributed word representations with Word2Vec.

Estimated reading time: 30 minutes