What is TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency and is a vector representation of natural language.
TF-IDF is derived by multiplying TF and IDF.
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

- $t$: word
- $d$: document
- $D$: document set (corpus)
Term Frequency
TF (Term Frequency) represents how often a word appears in a document. Words that occur more frequently in a document are considered more important to it, while words that occur less frequently are considered less important. In other words, frequent words are useful for identifying the characteristics of a document. There are several definitions of term frequency:
- The number of times a word appears in a document (raw count)
- Frequency of occurrence adjusted for document length (the number of occurrences of a word divided by the number of words in the document)
- Log-transformed frequency of occurrence (e.g. log(1 + raw count))
- Boolean frequency of occurrence (1 if the term appears in the document, 0 if it does not)
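The four definitions above can be compared side by side on a single word. The following is a minimal sketch, not a reference implementation: the `tokenize` helper and the `tf_variants` function are hypothetical names introduced here, and the tokenizer simply lowercases, drops periods, and splits on whitespace.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical helper: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

def tf_variants(term, tokens):
    """Return the four TF definitions listed above for one term."""
    raw = Counter(tokens)[term]
    return {
        "raw_count": raw,                      # number of occurrences
        "length_adjusted": raw / len(tokens),  # occurrences / words in document
        "log_scaled": math.log(1 + raw),       # log(1 + raw count), natural log
        "boolean": 1 if raw > 0 else 0,        # 1 if present, 0 otherwise
    }

tokens = tokenize("It is going to rain today. I like sound of rain.")
result = tf_variants("rain", tokens)
print(result)
```

Here "rain" occurs twice among 11 words, so the raw count is 2 and the length-adjusted value is 2/11.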
In this case, we will calculate TF as the number of occurrences of a word divided by the number of words in the document.
Suppose we have the following documents.
- Document 1: It is going to rain today. I like sound of rain.
- Document 2: Today I am going to watch Netflix.
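TF is as follows. Document 1 contains 11 words and Document 2 contains 7 words.
Word | TF (Document 1) | TF (Document 2) |
---|---|---|
it | 1/11 | 0 |
is | 1/11 | 0 |
going | 1/11 | 1/7 |
to | 1/11 | 1/7 |
rain | 2/11 | 0 |
today | 1/11 | 1/7 |
i | 1/11 | 1/7 |
like | 1/11 | 0 |
sound | 1/11 | 0 |
of | 1/11 | 0 |
am | 0 | 1/7 |
watch | 0 | 1/7 |
Netflix | 0 | 1/7 |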
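The TF values for the two documents can be checked with a short script. This is a sketch under simple assumptions: `tokenize` and `term_frequencies` are hypothetical helpers, tokens are lowercased and split on whitespace, and `fractions.Fraction` is used so the results stay exact instead of being rounded.

```python
from collections import Counter
from fractions import Fraction

documents = {
    "Document 1": "It is going to rain today. I like sound of rain.",
    "Document 2": "Today I am going to watch Netflix.",
}

def tokenize(text):
    # Hypothetical helper: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

def term_frequencies(text):
    # TF = occurrences of the word / number of words in the document.
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {term: Fraction(count, len(tokens)) for term, count in counts.items()}

tf = {name: term_frequencies(text) for name, text in documents.items()}
vocabulary = sorted(set().union(*tf.values()))

for term in vocabulary:
    row = [str(tf[name].get(term, Fraction(0))) for name in documents]
    print(f"{term:8} {row[0]:>5} {row[1]:>5}")
```

Words that do not occur in a document are simply absent from its dictionary, so the loop falls back to 0 for them.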
Inverse Document Frequency
IDF (Inverse Document Frequency) works as a filter for common words such as "this" and "is". Words that occur in many documents have a low IDF, while words that occur in only a few documents have a high IDF.
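$$\mathrm{idf}(t, D) = \log \frac{N}{\mathrm{count}(d \in D : t \in d)}$$

- $t$: word
- $d$: document
- $D$: document set (corpus)
- $N$: the number of documents $d$ in $D$
- $\mathrm{count}(d \in D : t \in d)$: the number of documents $d$ in which the word $t$ appears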
Suppose you have the following documents.
- Document 1: It is going to rain today. I like sound of rain.
- Document 2: Today I am going to watch Netflix.
The IDF will look like this.
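Word | IDF |
---|---|
it | log(2/1) |
is | log(2/1) |
going | log(2/2) = 0 |
to | log(2/2) = 0 |
rain | log(2/1) |
today | log(2/2) = 0 |
i | log(2/2) = 0 |
like | log(2/1) |
sound | log(2/1) |
of | log(2/1) |
am | log(2/1) |
watch | log(2/1) |
Netflix | log(2/1) |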
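The IDF of each word in the two documents can be computed in a few lines. The sketch below assumes the natural logarithm (the base only rescales the values) and assumes every word looked up occurs in at least one document; `tokenize` and `inverse_document_frequency` are hypothetical names introduced for illustration.

```python
import math

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

def inverse_document_frequency(term, tokenized_docs):
    # N: number of documents; doc_count: documents that contain the term.
    n_docs = len(tokenized_docs)
    doc_count = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(n_docs / doc_count)

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
idf = {term: inverse_document_frequency(term, tokenized) for term in vocabulary}

for term, value in idf.items():
    print(f"{term:8} {value:.3f}")
```

Words that appear in both documents (going, to, today, i) get an IDF of 0, so they contribute nothing to the TF-IDF score under this definition.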
Libraries for calculating TF-IDF
TF-IDF can be calculated using the following major libraries.
- scikit-learn
- TensorFlow (2.x) / Keras
- TensorFlow Extended (TFX)
However, the formulas may differ depending on the library and options specified. In machine learning, scikit-learn's TF-IDF functions are often used.
In this example, we use sklearn.feature_extraction.text.TfidfVectorizer to calculate the TF-IDF of the following documents.
- Document 1: It is going to rain today. I like sound of rain.
- Document 2: Today I am going to watch Netflix.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
document1 = "It is going to rain today. I like sound of rain."
document2 = "Today I am going to watch Netflix."
df = pd.DataFrame({'id': ["Document 1", "Document 2"],
                   'document': [document1, document2]
                   })
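# create the vectorizer (IDF weighting on, text lowercased before tokenizing)
tfidf_vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)
# fit on the documents and get the TF-IDF score of every term in each document
tfidf_matrix = tfidf_vectorizer.fit_transform(df['document'])
# term list (get_feature_names() was removed in scikit-learn 1.2)
terms = tfidf_vectorizer.get_feature_names_out().tolist()
# one TF-IDF vector per document, as a dense array
tfidfs = tfidf_matrix.toarray()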
>> terms
['am',
'going',
'is',
'it',
'like',
'netflix',
'of',
'rain',
'sound',
'to',
'today',
'watch']
>> tfidfs
array([[0.        , 0.21938061, 0.30833179, 0.30833179, 0.30833179,
        0.        , 0.30833179, 0.61666358, 0.30833179, 0.21938061,
        0.21938061, 0.        ],
       [0.47042643, 0.33471228, 0.        , 0.        , 0.        ,
        0.47042643, 0.        , 0.        , 0.        , 0.33471228,
        0.33471228, 0.47042643]])
df_tfidf = pd.DataFrame(tfidfs,
                        columns=terms,
                        index=["Document 1", "Document 2"]
                        )
display(df_tfidf)
Document | am | going | is | it | like | netflix | of | rain | sound | to | today | watch |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0 | 0.219 | 0.308 | 0.308 | 0.308 | 0 | 0.308 | 0.617 | 0.308 | 0.219 | 0.219 | 0 |
Document 2 | 0.470 | 0.335 | 0 | 0 | 0 | 0.470 | 0 | 0 | 0 | 0.335 | 0.335 | 0.470 |
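The above values are the TF-IDF scores of each term in each document. Note that TfidfVectorizer does not apply the textbook formulas directly. By default it uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1 (smooth_idf=True), where df is the number of documents containing the term, and it scales each document vector to unit length (norm='l2'). Its default tokenizer also keeps only tokens of two or more characters, which is why i is missing from the term list.
The higher the TF-IDF score, the more important the word is in the document.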
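To make the internal weighting and normalization concrete, the following pure-Python sketch follows the defaults documented for scikit-learn: raw counts as TF, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization of each document vector. It is an illustration rather than the library's actual code; the helper names (`tokenize`, `smooth_idf`, `tfidf_vector`) are hypothetical, and the regular expression is the one documented as TfidfVectorizer's default token_pattern.

```python
import math
import re
from collections import Counter

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
]

def tokenize(text):
    # Documented default token_pattern: tokens of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(tokens):
    # Raw count times smoothed IDF, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = [counts[term] * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for name, vector in zip(["Document 1", "Document 2"], vectors):
    print(name, [round(w, 3) for w in vector])
```

Under these assumptions, each vector has length 1, and words that occur in both documents (going, to, today) receive a lower weight than words of the same count that occur in only one.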
Examples of using TF-IDF
TF-IDF is used in the following situations:
- Information retrieval
- Text summarization
- Keyword extraction
- Searching for similar documents
- Recommendation of related articles
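As an illustration of the retrieval and similar-document use cases, documents can be ranked by the cosine similarity between TF-IDF vectors. This is a minimal sketch, not a production search engine: the query string is a hypothetical example that does not come from the article, the helper names are made up for illustration, and TF-IDF uses the length-adjusted TF and unsmoothed IDF with the natural logarithm.

```python
import math
from collections import Counter

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
]
query = "sound of rain"  # hypothetical query, not from the article

def tokenize(text):
    # Hypothetical helper: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

def idf(term):
    doc_count = sum(1 for tokens in tokenized if term in tokens)
    # Assumption of this sketch: terms unseen in the corpus get weight 0.
    return math.log(n_docs / doc_count) if doc_count else 0.0

def tfidf(tokens):
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf(term) for term, count in counts.items()}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vector = tfidf(tokenize(query))
scores = [cosine(query_vector, tfidf(tokens)) for tokens in tokenized]
best = max(range(n_docs), key=lambda i: scores[i])
print(documents[best])
```

The query shares weighted terms only with Document 1, so Document 1 is ranked first and Document 2 gets a similarity of 0.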