Traffine I/O

2023-01-20

TF-IDF

Machine Learning

NLP

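What is TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency and is one way to represent natural language text as vectors. It is useful when you want to extract the important words that characterize a document.

TF-IDF is obtained by multiplying TF by IDF.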

$$tf\text{-}idf(t,d,D) = tf(t,d) \times idf(t,D)$$

$t$ : term
$d$ : document
$D$ : document set (corpus)

Term Frequency

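TF (Term Frequency) measures how often a word appears in a document. Words that appear frequently are treated as more important, and words that appear rarely as less important. In other words, the idea is that frequently occurring words are useful for identifying what characterizes the document. There are several ways to define the frequency:

The number of times the word appears in the document (raw count)
Frequency adjusted for document length (the word's count divided by the number of words in the document)
Log-scaled frequency (e.g., log(1 + raw count))
Boolean frequency (1 if the term appears in the document, 0 otherwise)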
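The four definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `tf_variants` and the toy token list are assumptions made for the example, and the input is assumed to be already tokenized and non-empty.

```python
import math
from collections import Counter


def tf_variants(tokens, term):
    """Return the four term-frequency weightings listed above for one term."""
    raw = Counter(tokens)[term]  # raw count of the term in the document
    return {
        "raw_count": raw,
        "length_normalized": raw / len(tokens),  # count / number of tokens
        "log_scaled": math.log(1 + raw),         # log(1 + raw count)
        "boolean": 1 if raw > 0 else 0,          # presence or absence
    }


# Hypothetical toy document, already tokenized and lowercased.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(tf_variants(tokens, "the"))
```

All four weightings grow (or stay flat) as the raw count grows; they differ only in how strongly repeated occurrences are rewarded.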

Here, TF is computed as the word's count divided by the number of words in the document.

Suppose we have the following documents.

Document 1: It is going to rain today. I like sound of rain.
Document 2: Today I am going to watch Netflix.

Document 1 contains 11 words (rain appears twice) and Document 2 contains 7 words, so the TF values are as follows.

Term	TF (Document 1)	TF (Document 2)
it	$\frac{1}{11}=0.09$	0
is	$\frac{1}{11}=0.09$	0
going	$\frac{1}{11}=0.09$	$\frac{1}{7}=0.14$
to	$\frac{1}{11}=0.09$	$\frac{1}{7}=0.14$
rain	$\frac{2}{11}=0.18$	0
today	$\frac{1}{11}=0.09$	$\frac{1}{7}=0.14$
i	$\frac{1}{11}=0.09$	$\frac{1}{7}=0.14$
like	$\frac{1}{11}=0.09$	0
sound	$\frac{1}{11}=0.09$	0
of	$\frac{1}{11}=0.09$	0
am	0	$\frac{1}{7}=0.14$
watch	0	$\frac{1}{7}=0.14$
netflix	0	$\frac{1}{7}=0.14$
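As a rough check of the length-normalized TF, the calculation can be sketched in plain Python. The tokenizer here (lowercase, keep runs of letters) and the helper names `tokenize` and `term_frequency` are assumptions for illustration; a different tokenizer would give different counts. With this tokenizer Document 1 has 11 tokens, because rain is counted twice.

```python
import re
from collections import Counter

documents = {
    "Document 1": "It is going to rain today. I like sound of rain.",
    "Document 2": "Today I am going to watch Netflix.",
}


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(text):
    """Map each term to (occurrences in the document) / (total tokens in the document)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


tf = {name: term_frequency(text) for name, text in documents.items()}
for name, scores in tf.items():
    print(name, {term: round(score, 2) for term, score in scores.items()})
```

Because each count is divided by the document length, the TF values of one document always sum to 1.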

Inverse Document Frequency

IDF (Inverse Document Frequency) acts as a filter for common words such as this and is. Words that appear in many documents get a low IDF, and rare words get a high IDF. The formula for IDF is as follows.

$$idf(t,D) = \log\left(\frac{N}{count(d \in D:t \in d)}\right)$$

$t$ : term
$d$ : document
$D$ : document set (corpus)
$N$ : number of documents $d$ in $D$
$count(d \in D:t \in d)$ : number of documents $d$ in which the term $t$ appears

Suppose we have the following documents.

Document 1: It is going to rain today. I like sound of rain.
Document 2: Today I am going to watch Netflix.

Using a base-10 logarithm, the IDF values are as follows.

Term	IDF
it	$\log(2/1)=0.3$
is	$\log(2/1)=0.3$
going	$\log(2/2)=0$
to	$\log(2/2)=0$
rain	$\log(2/1)=0.3$
today	$\log(2/2)=0$
i	$\log(2/2)=0$
like	$\log(2/1)=0.3$
sound	$\log(2/1)=0.3$
of	$\log(2/1)=0.3$
am	$\log(2/1)=0.3$
watch	$\log(2/1)=0.3$
netflix	$\log(2/1)=0.3$
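The IDF values, and the final TF-IDF product, can be sketched in plain Python as follows. The tokenizer (lowercase, runs of letters), the base-10 logarithm, and the helper name `tf_idf` are assumptions chosen to match the hand calculation in this article, not the behavior of any particular library.

```python
import math
import re
from collections import Counter

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
]

# Assumed tokenizer: lowercase, keep runs of letters.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# idf(t, D) = log10(N / df(t))
idf = {term: math.log10(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokens):
    """tf-idf(t, d, D) = tf(t, d) * idf(t, D), with tf = count / document length."""
    counts = Counter(tokens)
    return {term: count / len(tokens) * idf[term] for term, count in counts.items()}


scores = [tf_idf(tokens) for tokens in tokenized]
for doc_scores in scores:
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

With only two documents, every term that appears in both (going, to, today, i) has an IDF of 0 and therefore a TF-IDF of 0, which is why smoothed variants of IDF are common in practice.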

Libraries for computing TF-IDF

TF-IDF can be computed with the following major libraries.

scikit-learn

TensorFlow (2.x) / Keras

TensorFlow Extended (TFX)

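Note, however, that the formula can differ depending on the library and the options specified. In machine learning, scikit-learn's TF-IDF functions are widely used.

Here, we compute TF-IDF for the following documents using sklearn.feature_extraction.text.TfidfVectorizer.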

Document 1: It is going to rain today. I like sound of rain.
Document 2: Today I am going to watch Netflix.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

document1 = "It is going to rain today. I like sound of rain."
document2 = "Today I am going to watch Netflix."

df = pd.DataFrame({'id': ["Document 1", "Document 2"],
                   'document': [document1, document2]
                   })

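# create the vectorizer (IDF weighting on, text lowercased)
tfidf_vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

# learn the vocabulary and IDF, then compute the TF-IDF matrix (documents x terms)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['document'])

# term list, in column order (get_feature_names() was removed in scikit-learn 1.2)
terms = tfidf_vectorizer.get_feature_names_out().tolist()

# dense array with one row of TF-IDF scores per document
tfidfs = tfidf_matrix.toarray()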

>> terms

['am',
 'going',
 'is',
 'it',
 'like',
 'netflix',
 'of',
 'rain',
 'sound',
 'to',
 'today',
 'watch']

>> tfidfs.round(3)

array([[0.   , 0.219, 0.308, 0.308, 0.308, 0.   , 0.308, 0.617, 0.308,
        0.219, 0.219, 0.   ],
       [0.47 , 0.335, 0.   , 0.   , 0.   , 0.47 , 0.   , 0.   , 0.   ,
        0.335, 0.335, 0.47 ]])

df_tfidf = pd.DataFrame(tfidfs,
                  columns=terms,
                  index=["Document 1", "Document 2"]
                  )
display(df_tfidf)

	am	going	is	it	like	netflix	of	rain	sound	to	today	watch
Document 1	0	0.219	0.308	0.308	0.308	0	0.308	0.617	0.308	0.219	0.219	0
Document 2	0.470	0.335	0	0	0	0.470	0	0	0	0.335	0.335	0.470

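The values above are the TF-IDF scores of each word. By default, TfidfVectorizer ignores single-character tokens such as i, uses a smoothed IDF, $\ln\frac{1+N}{1+df(t)}+1$, and L2-normalizes each document vector, so its scores differ from a hand calculation with the formulas in the earlier sections.

The larger the TF-IDF score, the more important the word is in that document.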
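To make those defaults concrete, the following sketch recomputes the weights by hand in plain Python, using the smoothed IDF and L2 normalization that scikit-learn documents for `smooth_idf=True` and `norm='l2'`. The regular expression only approximates the default token pattern, and the helper name `tfidf_l2` is an assumption for this example.

```python
import math
import re
from collections import Counter

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
]

# Approximate the default tokenization: lowercase tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_l2(tokens):
    """Raw count times smoothed IDF, then scale the vector to unit L2 norm."""
    weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vectors = [tfidf_l2(tokens) for tokens in tokenized]
for vec in vectors:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

Because of the "+1" in the smoothed IDF, terms that appear in every document (going, to, today) keep a nonzero weight instead of dropping to 0.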

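Use cases of TF-IDF

TF-IDF is used in situations such as the following.

Information retrieval
Text summarization
Keyword extraction
Similar document search
Related article recommendation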
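As one illustration of the similar-document search use case, the sketch below ranks documents by the cosine similarity between TF-IDF vectors. Everything here is an assumption made for the example: the three-sentence corpus, the simple tokenizer, the smoothed IDF, and the helper names `vectorize` and `cosine`.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus for illustration.
corpus = [
    "It is going to rain today.",
    "Today I am going to watch Netflix.",
    "I like the sound of rain.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


docs = [tokenize(d) for d in corpus]
doc_freq = Counter(t for tokens in docs for t in set(tokens))


def vectorize(tokens):
    """Sparse TF-IDF vector; the smoothed IDF keeps unseen query terms finite."""
    n = len(docs)
    return {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
            for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = vectorize(tokenize("sound of rain"))
ranking = sorted(range(len(corpus)),
                 key=lambda i: cosine(query, vectorize(docs[i])),
                 reverse=True)
print([corpus[i] for i in ranking])
```

The document sharing the most highly weighted terms with the query ranks first, and a document with no terms in common gets a similarity of 0.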


Ryusei Kakujo
