TF and IDF – Theory and Exercise
Introduction
Term Frequency (TF) and Inverse Document Frequency (IDF) are two fundamental measures in Natural Language Processing (NLP) and information retrieval. They are used to evaluate how important a word is within a document or across a collection of documents.
Definitions
- Term Frequency (TF): measures how often a term appears in a document relative to the total number of terms in that document:

  $$TF(t,d) = \frac{f_{t,d}}{\text{total number of terms in } d}$$

  where $f_{t,d}$ is the frequency of term $t$ in document $d$.
- Inverse Document Frequency (IDF): measures how rare, and therefore how informative, a term is across the entire collection of documents:

  $$IDF(t) = \log \frac{N}{n_t}$$

  where $N$ is the total number of documents and $n_t$ is the number of documents containing the term $t$. We use the natural logarithm throughout this exercise.
- TF-IDF: the product of the two measures,

  $$TFIDF(t,d) = TF(t,d) \times IDF(t),$$

  shows how important a term is in a single document relative to the corpus.
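These definitions translate almost directly into code. Below is a minimal Python sketch; the function names and the list-of-tokens representation of a document are my own choices for illustration, not a standard API.

```python
import math

def tf(term, doc_tokens):
    # Raw count of the term divided by the total number of tokens.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(N / n_t), natural logarithm. Assumes the term occurs in at
    # least one document; otherwise n_t is zero and the division fails.
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

def tf_idf(term, doc_tokens, corpus):
    # TF-IDF is simply the product of the two measures.
    return tf(term, doc_tokens) * idf(term, corpus)
```

Note that this `idf` assumes every queried term occurs somewhere in the corpus; production implementations typically add smoothing to avoid division by zero.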
Practical Exercise
Data
We have three documents:
- D1: “the cat eats the mouse”
- D2: “the dog chases the cat”
- D3: “the mouse runs from the dog”
Step 1: Build the vocabulary
To keep the hand computation small, we restrict the vocabulary to the content words that appear in more than one document; “the” is ignored as a stopword, and the remaining words each occur in a single document. Vocabulary:
["cat", "dog", "mouse"]
Step 2: Compute TF
| Document | cat | dog | mouse |
|---|---|---|---|
| D1 | 1/5 | 0 | 1/5 |
| D2 | 1/5 | 1/5 | 0 |
| D3 | 0 | 1/6 | 1/6 |

Note that D1 and D2 each contain five tokens, while D3 (“the mouse runs from the dog”) contains six, hence the denominator of 6 in its row.
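A short sketch to reproduce this table (whitespace tokenization again, values rounded to three decimals):

```python
docs = {
    "D1": "the cat eats the mouse".split(),
    "D2": "the dog chases the cat".split(),
    "D3": "the mouse runs from the dog".split(),
}
vocab = ["cat", "dog", "mouse"]

for name, tokens in docs.items():
    # TF = count of the term / total tokens in the document.
    row = {t: round(tokens.count(t) / len(tokens), 3) for t in vocab}
    print(name, row)
# D1 {'cat': 0.2, 'dog': 0.0, 'mouse': 0.2}
# D2 {'cat': 0.2, 'dog': 0.2, 'mouse': 0.0}
# D3 {'cat': 0.0, 'dog': 0.167, 'mouse': 0.167}
```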
Step 3: Compute IDF
- “cat” appears in 2 documents → $IDF(\text{cat}) = \log\frac{3}{2} \approx 0.405$
- “dog” appears in 2 documents → $IDF(\text{dog}) = \log\frac{3}{2} \approx 0.405$
- “mouse” appears in 2 documents → $IDF(\text{mouse}) = \log\frac{3}{2} \approx 0.405$
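A one-line check of this value, using the natural logarithm as in the definition above:

```python
import math

# Each vocabulary term appears in n_t = 2 of the N = 3 documents.
N, n_t = 3, 2
print(round(math.log(N / n_t), 3))  # 0.405
```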
Step 4: Compute TF-IDF for each term
Example for the term “cat” in document D1:

$$TFIDF(\text{cat}, D1) = TF(\text{cat}, D1) \times IDF(\text{cat}) = \frac{1}{5} \times 0.405 \approx 0.081$$
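Putting the two steps together reproduces the full TF-IDF matrix; as before, this is a from-scratch sketch rather than a library call:

```python
import math

docs = {
    "D1": "the cat eats the mouse".split(),
    "D2": "the dog chases the cat".split(),
    "D3": "the mouse runs from the dog".split(),
}
vocab = ["cat", "dog", "mouse"]

def idf(term):
    # Natural log of (total documents / documents containing the term).
    n_t = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n_t)

for name, tokens in docs.items():
    scores = {t: round(tokens.count(t) / len(tokens) * idf(t), 3)
              for t in vocab}
    print(name, scores)
# D1 {'cat': 0.081, 'dog': 0.0, 'mouse': 0.081}
# D2 {'cat': 0.081, 'dog': 0.081, 'mouse': 0.0}
# D3 {'cat': 0.0, 'dog': 0.068, 'mouse': 0.068}
```

Because every vocabulary term here appears in exactly two documents, all three share the same IDF, and the TF-IDF ranking simply mirrors TF; IDF only starts to separate terms once the corpus contains words with different document frequencies (for instance, “the” appears in all three documents and would get $IDF = \log\frac{3}{3} = 0$). Also note that off-the-shelf implementations such as scikit-learn's TfidfVectorizer apply a smoothed IDF variant and normalization, so their numbers will differ slightly from this hand computation.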
Higher TF-IDF values indicate that a term is more distinctive or important in a specific document. If a term appears in almost all documents, its IDF value decreases, reducing its overall importance.