2023-01-20

TF-IDF

What is TF-IDF?

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a way of representing text as numerical vectors in which each word is weighted by how characteristic it is of a document.

TF-IDF is derived by multiplying TF and IDF.

tfidf(t,d,D) = tf(t,d) \times idf(t,D)
  • t: word
  • d: document
  • D: document set (corpus)

Term Frequency

TF (Term Frequency) represents the frequency with which a word appears. Words that occur more frequently are considered more important, while words that occur less frequently are considered less important. In other words, words that occur frequently are useful for identifying the characteristics of a document. There are several definitions of frequency of occurrence.

  • The number of times a word appears in a document (raw count)
  • Frequency of occurrence adjusted for document length (the number of occurrences of a word divided by the number of words in the document)
  • Log-transformed frequency of occurrence (e.g. log(1 + raw count))
  • Boolean frequency of occurrence (1 if the term appears in the document, 0 if it does not)

In this case, we will calculate TF as the number of occurrences of a word divided by the number of words in the document.

Suppose we have the following documents.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.

TF is as follows (Document 1 has 11 words, Document 2 has 7 words).

Word      TF (Document 1)        TF (Document 2)
it        \frac{1}{11}=0.09      0
is        \frac{1}{11}=0.09      0
going     \frac{1}{11}=0.09      \frac{1}{7}=0.14
to        \frac{1}{11}=0.09      \frac{1}{7}=0.14
rain      \frac{2}{11}=0.18      0
today     \frac{1}{11}=0.09      \frac{1}{7}=0.14
i         \frac{1}{11}=0.09      \frac{1}{7}=0.14
like      \frac{1}{11}=0.09      0
sound     \frac{1}{11}=0.09      0
of        \frac{1}{11}=0.09      0
am        0                      \frac{1}{7}=0.14
watch     0                      \frac{1}{7}=0.14
Netflix   0                      \frac{1}{7}=0.14
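The length-normalized TF above can be computed with a short sketch. The regex tokenizer here (lowercase, letters only) is an assumption for illustration, not part of the original definition.

```python
# Compute TF as raw count divided by document length.
import re

def tokenize(text):
    # Simple tokenizer: lowercase, keep alphabetic runs only (assumption)
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(document):
    tokens = tokenize(document)
    return {word: tokens.count(word) / len(tokens) for word in set(tokens)}

tf1 = term_frequency("It is going to rain today. I like sound of rain.")
tf2 = term_frequency("Today I am going to watch Netflix.")

print(round(tf1["rain"], 2))   # 2 occurrences out of 11 words -> 0.18
print(round(tf2["going"], 2))  # 1 occurrence out of 7 words -> 0.14
```

Because every count is divided by the document length, the TF values of each document sum to 1.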

Inverse Document Frequency

IDF (Inverse Document Frequency) acts as a filter for common words such as "this" and "is". Words that appear in many documents have a low IDF, while rare words have a high IDF.

idf(t,D) = \log(\frac{N}{count(d \in D : t \in d)})
  • t: word
  • d: document
  • D: document set (corpus)
  • N: the number of documents d in D
  • count(d \in D : t \in d): the number of documents d in which the word t appears

In the examples below, \log is the base-10 logarithm.
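The formula above can be sketched directly. Documents are represented here as sets of words, an assumption made for brevity; the base-10 logarithm matches the values in the table below.

```python
# IDF with a base-10 logarithm: idf(t, D) = log10(N / df(t))
import math

def idf(term, documents):
    n = len(documents)                                 # N: corpus size
    df = sum(1 for doc in documents if term in doc)    # documents containing term
    return math.log10(n / df)

corpus = [
    {"it", "is", "going", "to", "rain", "today", "i", "like", "sound", "of"},
    {"today", "i", "am", "going", "to", "watch", "netflix"},
]

print(round(idf("rain", corpus), 2))   # appears in 1 of 2 documents
print(round(idf("going", corpus), 2))  # appears in both documents -> 0
```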

Suppose we have the same documents as before.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.

The IDF will look like this.

Word      IDF
it        \log(2/1)=0.3
is        \log(2/1)=0.3
going     \log(2/2)=0
to        \log(2/2)=0
rain      \log(2/1)=0.3
today     \log(2/2)=0
i         \log(2/2)=0
like      \log(2/1)=0.3
sound     \log(2/1)=0.3
of        \log(2/1)=0.3
am        \log(2/1)=0.3
watch     \log(2/1)=0.3
Netflix   \log(2/1)=0.3
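Putting the two tables together, the TF-IDF score is simply the product of the two values. A small sketch for a few words of Document 1 (TF as raw count divided by document length, IDF with a base-10 logarithm, as above):

```python
# TF-IDF for selected words of Document 1: tfidf = tf * idf
import math

tf_doc1 = {"rain": 2 / 11, "like": 1 / 11, "going": 1 / 11, "to": 1 / 11}
idf_vals = {"rain": math.log10(2 / 1), "like": math.log10(2 / 1),
            "going": math.log10(2 / 2), "to": math.log10(2 / 2)}

tfidf_doc1 = {word: tf_doc1[word] * idf_vals[word] for word in tf_doc1}
# "rain" gets the highest score; "going" and "to" score 0
# because they appear in every document (idf = 0)
```

This shows the intended behavior: a word that is frequent in one document but absent elsewhere ("rain") scores high, while a word common to all documents ("going") scores zero.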

Libraries for calculating TF-IDF

TF-IDF can be calculated using the following major libraries.

  • scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

  • TensorFlow(2.x)/ Keras

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_matrix

  • TensorFlow Extended(TFX)

https://www.tensorflow.org/tfx/transform/api_docs/python/tft/tfidf

However, the formulas may differ depending on the library and options specified. In machine learning, scikit-learn's TF-IDF functions are often used.

In this example, we use sklearn.feature_extraction.text.TfidfVectorizer to calculate the TF-IDF of the following documents.

  • Document 1: It is going to rain today. I like sound of rain.
  • Document 2: Today I am going to watch Netflix.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

document1 = "It is going to rain today. I like sound of rain."
document2 = "Today I am going to watch Netflix."

df = pd.DataFrame({'id': ["Document 1", "Document 2"],
                   'document': [document1, document2]
                   })

# calculate TF-IDF
tfidf_vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

# get TF-IDF score of all words in documents
tfidf_matrix = tfidf_vectorizer.fit_transform(df['document'])

# term list (get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead)
terms = tfidf_vectorizer.get_feature_names_out().tolist()

# vectors of words
tfidfs = tfidf_matrix.toarray()
>> terms

['am',
 'going',
 'is',
 'it',
 'like',
 'netflix',
 'of',
 'rain',
 'sound',
 'to',
 'today',
 'watch']
>> tfidfs

array([[0.        , 0.28867513, 0.28867513, 0.28867513, 0.28867513,
        0.        , 0.28867513, 0.57735027, 0.28867513, 0.28867513,
        0.28867513, 0.        ],
       [0.40824829, 0.40824829, 0.        , 0.        , 0.        ,
        0.40824829, 0.        , 0.        , 0.        , 0.40824829,
        0.40824829, 0.40824829]])
df_tfidf = pd.DataFrame(tfidfs,
                  columns=terms,
                  index=["Document 1", "Document 2"]
                  )
display(df_tfidf)
            am     going  is     it     like   netflix  of     rain   sound  to     today  watch
Document 1  0      0.289  0.289  0.289  0.289  0        0.289  0.577  0.289  0.289  0.289  0
Document 2  0.408  0.408  0      0      0      0.408    0      0      0      0.408  0.408  0.408

The above values are the TF-IDF scores for each word. Note that TfidfVectorizer normalizes each document vector to unit (L2) length internally, and its default IDF formula includes smoothing, so the scores differ from the hand calculation above.

The higher the TF-IDF score, the more important the word is in the document.

Example of using TF-IDF

TF-IDF is used in situations such as the following.

  • Information retrieval
  • Text summarization
  • Keyword extraction
  • Searching for similar documents
  • Recommendation of related articles
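As a quick illustration of the "searching for similar documents" use case, TF-IDF vectors can be compared with cosine similarity. The third document below is a hypothetical addition for the example.

```python
# Find similar documents by comparing TF-IDF vectors with cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "It is going to rain today. I like sound of rain.",
    "Today I am going to watch Netflix.",
    "Heavy rain is expected today.",   # hypothetical extra document
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all document vectors (3 x 3 matrix)
similarities = cosine_similarity(matrix)
```

Each row of `similarities` scores one document against all others; the diagonal is 1 (every document is identical to itself), and higher off-diagonal values indicate more shared, characteristic vocabulary.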


Ryusei Kakujo
