TF and IDF – Theory and Exercise
Introduction
Term Frequency (TF) and Inverse Document Frequency (IDF) are two fundamental measures in Natural Language Processing (NLP) and information retrieval. They are used to evaluate how important a word is within a document or across a collection of documents.
Definitions
- Term Frequency (TF): measures how often a term appears in a document relative to the total number of terms in that document:

  $$TF(t,d) = \frac{f_{t,d}}{\text{total number of terms in } d}$$

  where $f_{t,d}$ is the frequency of term $t$ in document $d$.
- Inverse Document Frequency (IDF): measures how rare, and therefore how informative, a term is across the entire collection of documents:

  $$IDF(t) = \log \frac{N}{n_t}$$

  where $N$ is the total number of documents and $n_t$ is the number of documents containing the term $t$. We use the natural logarithm throughout this exercise.
- TF-IDF: the product of the two measures,

  $$TFIDF(t,d) = TF(t,d) \times IDF(t),$$

  shows how important a term is in a single document relative to the corpus.
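These definitions translate almost directly into code. Below is a minimal Python sketch; the function names and the list-of-tokens representation of a document are my own choices for illustration, not a standard API.

```python
import math

def tf(term, doc_tokens):
    # Raw count of the term divided by the total number of tokens.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(N / n_t), natural logarithm. Assumes the term occurs in at
    # least one document; otherwise n_t is zero and the division fails.
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

def tf_idf(term, doc_tokens, corpus):
    # TF-IDF is simply the product of the two measures.
    return tf(term, doc_tokens) * idf(term, corpus)
```

Note that this `idf` assumes every queried term occurs somewhere in the corpus; production implementations typically add smoothing to avoid division by zero.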
Practical Exercise
Data
We have three documents:
- D1: “the cat eats the mouse”
- D2: “the dog chases the cat”
- D3: “the mouse runs from the dog”
Step 1: Build the vocabulary
To keep the hand computation small, we restrict the vocabulary to the content words that appear in more than one document; “the” is ignored as a stopword, and the remaining words each occur in a single document. Vocabulary:
["cat", "dog", "mouse"]
Step 2: Compute TF
| Document | cat | dog | mouse |
|---|---|---|---|
| D1 | 1/5 | 0 | 1/5 |
| D2 | 1/5 | 1/5 | 0 |
| D3 | 0 | 1/6 | 1/6 |

Note that D1 and D2 each contain five tokens, while D3 (“the mouse runs from the dog”) contains six, hence the denominator of 6 in its row.
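A short sketch to reproduce this table (whitespace tokenization again, values rounded to three decimals):

```python
docs = {
    "D1": "the cat eats the mouse".split(),
    "D2": "the dog chases the cat".split(),
    "D3": "the mouse runs from the dog".split(),
}
vocab = ["cat", "dog", "mouse"]

for name, tokens in docs.items():
    # TF = count of the term / total tokens in the document.
    row = {t: round(tokens.count(t) / len(tokens), 3) for t in vocab}
    print(name, row)
# D1 {'cat': 0.2, 'dog': 0.0, 'mouse': 0.2}
# D2 {'cat': 0.2, 'dog': 0.2, 'mouse': 0.0}
# D3 {'cat': 0.0, 'dog': 0.167, 'mouse': 0.167}
```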
Step 3: Compute IDF
- “cat” appears in 2 documents → $IDF(\text{cat}) = \log\frac{3}{2} \approx 0.405$
- “dog” appears in 2 documents → $IDF(\text{dog}) = \log\frac{3}{2} \approx 0.405$
- “mouse” appears in 2 documents → $IDF(\text{mouse}) = \log\frac{3}{2} \approx 0.405$
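A one-line check of this value, using the natural logarithm as in the definition above:

```python
import math

# Each vocabulary term appears in n_t = 2 of the N = 3 documents.
N, n_t = 3, 2
print(round(math.log(N / n_t), 3))  # 0.405
```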
Step 4: Compute TF-IDF for each term
Example for the term “cat” in document D1:

$$TFIDF(\text{cat}, D1) = TF(\text{cat}, D1) \times IDF(\text{cat}) = \frac{1}{5} \times 0.405 \approx 0.081$$
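Putting the two steps together reproduces the full TF-IDF matrix; as before, this is a from-scratch sketch rather than a library call:

```python
import math

docs = {
    "D1": "the cat eats the mouse".split(),
    "D2": "the dog chases the cat".split(),
    "D3": "the mouse runs from the dog".split(),
}
vocab = ["cat", "dog", "mouse"]

def idf(term):
    # Natural log of (total documents / documents containing the term).
    n_t = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n_t)

for name, tokens in docs.items():
    scores = {t: round(tokens.count(t) / len(tokens) * idf(t), 3)
              for t in vocab}
    print(name, scores)
# D1 {'cat': 0.081, 'dog': 0.0, 'mouse': 0.081}
# D2 {'cat': 0.081, 'dog': 0.081, 'mouse': 0.0}
# D3 {'cat': 0.0, 'dog': 0.068, 'mouse': 0.068}
```

Because every vocabulary term here appears in exactly two documents, all three share the same IDF, and the TF-IDF ranking simply mirrors TF; IDF only starts to separate terms once the corpus contains words with different document frequencies (for instance, “the” appears in all three documents and would get $IDF = \log\frac{3}{3} = 0$). Also note that off-the-shelf implementations such as scikit-learn's TfidfVectorizer apply a smoothed IDF variant and normalization, so their numbers will differ slightly from this hand computation.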
Higher TF-IDF values indicate that a term is more distinctive or important in a specific document. If a term appears in almost all documents, its IDF value decreases, reducing its overall importance.