2023-01-20

TF-IDF

What is TF-IDF?

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a way of representing text as numerical vectors in which each word is weighted by how characteristic it is of a document.

TF-IDF is derived by multiplying TF and IDF.

tfidf(t,d,D) = tf(t,d) \times idf(t,D)
  • t: word
  • d: document
  • D: document set (corpus)

Term Frequency

TF (Term Frequency) represents the frequency with which a word appears. Words that occur more frequently are considered more important, while words that occur less frequently are considered less important. In other words, words that occur frequently are useful for identifying the characteristics of a document. There are several definitions of frequency of occurrence.

  • The number of times a word appears in a document (raw count)
  • Frequency of occurrence adjusted for document length (the number of occurrences of a word divided by the number of words in the document)
  • Log-transformed frequency of occurrence (e.g. log(1 + raw count))
  • Boolean frequency of occurrence (1 if the term appears in the document, 0 if it does not)

In this case, we will calculate TF as the number of occurrences of a word divided by the number of words in the document.

Suppose we have the following documents.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.

TF is as follows (Document 1 has 11 words, Document 2 has 7 words).

Word      TF (Document 1)        TF (Document 2)
it        \frac{1}{11}=0.09      0
is        \frac{1}{11}=0.09      0
going     \frac{1}{11}=0.09      \frac{1}{7}=0.14
to        \frac{1}{11}=0.09      \frac{1}{7}=0.14
rain      \frac{2}{11}=0.18      0
today     \frac{1}{11}=0.09      \frac{1}{7}=0.14
i         \frac{1}{11}=0.09      \frac{1}{7}=0.14
like      \frac{1}{11}=0.09      0
sound     \frac{1}{11}=0.09      0
of        \frac{1}{11}=0.09      0
am        0                      \frac{1}{7}=0.14
watch     0                      \frac{1}{7}=0.14
Netflix   0                      \frac{1}{7}=0.14
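The length-normalized TF above can be computed with a short sketch. The regex tokenizer here (lowercase, letters only) is an assumption for illustration, not part of the original definition.

```python
# Compute TF as raw count divided by document length.
import re

def tokenize(text):
    # Simple tokenizer: lowercase, keep alphabetic runs only (assumption)
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(document):
    tokens = tokenize(document)
    return {word: tokens.count(word) / len(tokens) for word in set(tokens)}

tf1 = term_frequency("It is going to rain today. I like sound of rain.")
tf2 = term_frequency("Today I am going to watch Netflix.")

print(round(tf1["rain"], 2))   # 2 occurrences out of 11 words -> 0.18
print(round(tf2["going"], 2))  # 1 occurrence out of 7 words -> 0.14
```

Because every count is divided by the document length, the TF values of each document sum to 1.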

Inverse Document Frequency

IDF (Inverse Document Frequency) acts as a filter for common words such as "this" and "is". Words that appear in many documents have a low IDF, while rare words have a high IDF.

idf(t,D) = \log(\frac{N}{count(d \in D : t \in d)})
  • t: word
  • d: document
  • D: document set (corpus)
  • N: the number of documents d in D
  • count(d \in D : t \in d): the number of documents d in which the word t appears

In the examples below, \log is the base-10 logarithm.
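The formula above can be sketched directly. Documents are represented here as sets of words, an assumption made for brevity; the base-10 logarithm matches the values in the table below.

```python
# IDF with a base-10 logarithm: idf(t, D) = log10(N / df(t))
import math

def idf(term, documents):
    n = len(documents)                                 # N: corpus size
    df = sum(1 for doc in documents if term in doc)    # documents containing term
    return math.log10(n / df)

corpus = [
    {"it", "is", "going", "to", "rain", "today", "i", "like", "sound", "of"},
    {"today", "i", "am", "going", "to", "watch", "netflix"},
]

print(round(idf("rain", corpus), 2))   # appears in 1 of 2 documents
print(round(idf("going", corpus), 2))  # appears in both documents -> 0
```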

Suppose we have the same documents as before.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.

The IDF will look like this.

Word      IDF
it        \log(2/1)=0.3
is        \log(2/1)=0.3
going     \log(2/2)=0
to        \log(2/2)=0
rain      \log(2/1)=0.3
today     \log(2/2)=0
i         \log(2/2)=0
like      \log(2/1)=0.3
sound     \log(2/1)=0.3
of        \log(2/1)=0.3
am        \log(2/1)=0.3
watch     \log(2/1)=0.3
Netflix   \log(2/1)=0.3
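Putting the two tables together, the TF-IDF score is simply the product of the two values. A small sketch for a few words of Document 1 (TF as raw count divided by document length, IDF with a base-10 logarithm, as above):

```python
# TF-IDF for selected words of Document 1: tfidf = tf * idf
import math

tf_doc1 = {"rain": 2 / 11, "like": 1 / 11, "going": 1 / 11, "to": 1 / 11}
idf_vals = {"rain": math.log10(2 / 1), "like": math.log10(2 / 1),
            "going": math.log10(2 / 2), "to": math.log10(2 / 2)}

tfidf_doc1 = {word: tf_doc1[word] * idf_vals[word] for word in tf_doc1}
# "rain" gets the highest score; "going" and "to" score 0
# because they appear in every document (idf = 0)
```

This shows the intended behavior: a word that is frequent in one document but absent elsewhere ("rain") scores high, while a word common to all documents ("going") scores zero.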

Libraries for calculating TF-IDF

TF-IDF can be calculated using the following major libraries.

  • scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

  • TensorFlow(2.x)/ Keras

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_matrix

  • TensorFlow Extended(TFX)

https://www.tensorflow.org/tfx/transform/api_docs/python/tft/tfidf

However, the formulas may differ depending on the library and options specified. In machine learning, scikit-learn's TF-IDF functions are often used.

In this example, we use sklearn.feature_extraction.text.TfidfVectorizer to calculate the TF-IDF of the following documents.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

document1 = "It is going to rain today. I like sound of rain."
document2 = "Today I am going to watch Netflix."

df = pd.DataFrame({'id': ["Document 1", "Document 2"],
                   'document': [document1, document2]
                   })

# calculate TF-IDF
tfidf_vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

# get TF-IDF score of all words in documents
tfidf_matrix = tfidf_vectorizer.fit_transform(df['document'])

# term list (get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead)
terms = tfidf_vectorizer.get_feature_names_out().tolist()

# vectors of words
tfidfs = tfidf_matrix.toarray()
>> terms

['am',
 'going',
 'is',
 'it',
 'like',
 'netflix',
 'of',
 'rain',
 'sound',
 'to',
 'today',
 'watch']
>> tfidfs

array([[0.        , 0.28867513, 0.28867513, 0.28867513, 0.28867513,
        0.        , 0.28867513, 0.57735027, 0.28867513, 0.28867513,
        0.28867513, 0.        ],
       [0.40824829, 0.40824829, 0.        , 0.        , 0.        ,
        0.40824829, 0.        , 0.        , 0.        , 0.40824829,
        0.40824829, 0.40824829]])
df_tfidf = pd.DataFrame(tfidfs,
                  columns=terms,
                  index=["Document 1", "Document 2"]
                  )
display(df_tfidf)
            am     going  is     it     like   netflix  of     rain   sound  to     today  watch
Document 1  0      0.289  0.289  0.289  0.289  0        0.289  0.577  0.289  0.289  0.289  0
Document 2  0.408  0.408  0      0      0      0.408    0      0      0      0.408  0.408  0.408

The above values are the TF-IDF scores for each word. Note that TfidfVectorizer normalizes each document vector to unit (L2) length internally, and its default IDF formula includes smoothing, so the scores differ from the hand calculation above.

The higher the TF-IDF score, the more important the word is in the document.

Example of using TF-IDF

TF-IDF is used in situations such as the following.

  • Information retrieval
  • Text summarization
  • Keyword extraction
  • Searching for similar documents
  • Recommendation of related articles
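As a quick illustration of the "searching for similar documents" use case, TF-IDF vectors can be compared with cosine similarity. The third document below is a hypothetical addition for the example.

```python
# Find similar documents by comparing TF-IDF vectors with cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
    "Heavy rain is expected today.",   # hypothetical extra document
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all document vectors (3 x 3 matrix)
similarities = cosine_similarity(matrix)
```

Each row of `similarities` scores one document against all others; the diagonal is 1 (every document is identical to itself), and higher off-diagonal values indicate more shared, characteristic vocabulary.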


Ryusei Kakujo
