03 - Decision Trees

Decision tree

Tree with the following characteristics:

each internal node tests an attribute $A_i$
each branch corresponds to a value $a_{i,j} \in A_i$ of that attribute
each leaf node assigns a classification value $c \in C$
a classic decision tree, like the ones in Algorithms

How to form a rule

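follow the path from the root to the leaf we are interested in and put the tests along it in conjunction (AND); if several paths lead to the same class, put the resulting rules in disjunction (OR)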
for example: IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No

ID3 Algorithm

We have:

INPUT: Examples (both positive and negative), Target Attribute (PlayTennis), Attributes (Outlook, Wind…)
OUTPUT: DT (Decision Tree)

We have three base cases, where no further split is made:

Examples contains only positives: RETURN +
Examples contains only negatives: RETURN -
Attributes is empty and Examples contains both: RETURN + or -, depending on the majority class in Examples

Otherwise (the GENERAL CASE, when Attributes is NOT empty and Examples is mixed) we take these steps:

let’s choose an attribute A (using a specific criterion, such as information gain)
let’s create a node that tests A
let’s create a child node for each value of A
let’s make a recursive call to build the subtree for each subset
- each recursive call receives only the examples with that value of A, and all the attributes EXCEPT the one already chosen

An example is worth one thousand words:

Entropy and Information gain

Entropy represents the impurity or randomness in a dataset $S$, where $p_i$ is the proportion of examples in $S$ that belong to class $i$: $Entropy(S) = - \sum_{i=1}^{n} p_i \log_2 (p_i)$
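
A minimal sketch of the entropy formula, assuming the dataset is given simply as a list of class labels (the function name `entropy` and the sample labels are illustrative):

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)  # class -> number of examples
    return -sum((c / total) * log2(c / total) for c in counts.values())


balanced = ["+"] * 30 + ["-"] * 30   # maximally mixed two-class set
pure = ["+"] * 5                     # only one class
print(entropy(balanced))
print(entropy(pure))
```

Only classes that actually occur are counted, so the term $0 \log_2 0$ never has to be evaluated.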

Information Gain measures the effectiveness of a feature in classifying a dataset. It quantifies how much information a feature provides about the class label. High information gain means a feature is very useful for classification.

To calculate information gain for a feature A with respect to a dataset S, you compute the entropy of the original dataset Entropy(S), and then subtract the weighted sum of the entropies of the subsets obtained by splitting the dataset on feature A, where $S_v$ is the subset of S in which A has value v. $InformationGain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)$
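
The formula can be sketched in a few lines, assuming each example is a dict that maps attribute names to values. That representation, the function names and the toy rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    subsets = defaultdict(list)  # value v -> labels of the rows with A = v
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


# Toy data: "a" separates the classes perfectly, "b" tells us nothing.
rows = [
    {"a": 1, "b": 0, "y": "+"},
    {"a": 1, "b": 1, "y": "+"},
    {"a": 0, "b": 0, "y": "-"},
    {"a": 0, "b": 1, "y": "-"},
]
print(information_gain(rows, "a", "y"))
print(information_gain(rows, "b", "y"))
```

An attribute that yields pure subsets recovers the whole entropy of S, while one whose subsets are as mixed as S gains nothing.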

Example

Let’s consider a simple binary classification problem to illustrate how entropy and information gain are used in decision tree classification.

Suppose we have a dataset of emails that are classified as either spam (positive class) or non-spam (negative class). The dataset has two features:

Feature 1: Whether the email contains the word “lottery” (binary feature: 1 if yes, 0 if no).
Feature 2: Whether the email contains the word “discount” (binary feature: 1 if yes, 0 if no).

Let’s calculate entropy and information gain to decide the best feature for the first split in the decision tree.

Calculating Entropy for the Initial Dataset:

Let’s assume there are 60 emails in the dataset, with 30 being spam and 30 non-spam. $Entropy(S) = - (\frac{30}{60} \log_2 (\frac{30}{60}) + \frac{30}{60} \log_2 (\frac{30}{60})) = - (0.5 \times (-1) + 0.5 \times (-1)) = 1$

Calculating Information Gain for Feature 1 (“lottery”):

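Let’s assume 20 emails have the word “lottery” (10 spam, 10 non-spam) and 40 do not. $Entropy(S_{lottery}) = - (\frac{10}{20} \log_2 (\frac{10}{20}) + \frac{10}{20} \log_2 (\frac{10}{20})) = 1$ Since each class has 30 emails in total, the 40 emails without “lottery” are split 20/20, so their entropy is 1 as well. $InformationGain(S, lottery) = 1 - (\frac{20}{60} \times 1 + \frac{40}{60} \times 1) = 0$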

Calculating Information Gain for Feature 2 (“discount”):

Let’s assume 25 emails have the word “discount” (10 of one class, 15 of the other) and 35 do not. $Entropy(S_{discount}) = - (\frac{10}{25} \log_2 (\frac{10}{25}) + \frac{15}{25} \log_2 (\frac{15}{25})) \approx 0.971$ Since each class has 30 emails in total, the 35 emails without “discount” are split 20/15. $Entropy(S_{\neg discount}) = - (\frac{20}{35} \log_2 (\frac{20}{35}) + \frac{15}{35} \log_2 (\frac{15}{35})) \approx 0.985$ $InformationGain(S, discount) = 1 - (\frac{25}{60} \times 0.971 + \frac{35}{60} \times 0.985) \approx 0.02$

In this example, Feature 2 (“discount”) has a higher information gain than Feature 1 (“lottery”). Therefore, the decision tree would choose “discount” as the first split, as it reduces the uncertainty about the class labels more.
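
The numbers in this example can be re-derived from the class counts alone. The sketch below assumes the counts implied by the totals: 30/30 overall, 10/10 with “lottery” and 20/20 without, 10/15 with “discount” and 20/15 without. The helper names are illustrative.

```python
from math import log2


def entropy_from_counts(*counts):
    """Entropy (in bits) of a set described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gain_from_counts(parent, children):
    """Information gain of a split, given class counts before and after it."""
    n = sum(parent)
    remainder = sum(
        sum(child) / n * entropy_from_counts(*child) for child in children
    )
    return entropy_from_counts(*parent) - remainder


# (count of one class, count of the other) for each subset of the split
lottery_gain = gain_from_counts((30, 30), [(10, 10), (20, 20)])
discount_gain = gain_from_counts((30, 30), [(10, 15), (20, 15)])

print(f"lottery gain:  {abs(lottery_gain):.3f}")
print(f"discount gain: {discount_gain:.3f}")
```

The “discount” split comes out slightly above zero while the “lottery” split gains nothing, which matches the choice made in the text.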

So, entropy measures how mixed up things are, and information gain helps us figure out the best way to organize and understand the mix!

When the entropy is 0, every subset is pure: each one contains only positive examples or only negative examples!

ID3 for short

Create a Root node
if all Examples are positive, return Root with label +
if all Examples are negative, return Root with label -
if Attributes is empty, return Root with label = most common value of Target_attribute in Examples
otherwise
1. A ← the best decision attribute for Examples
2. assign A as the decision attribute for Root
3. for each value $v_i$ of A
  - add a new branch from Root corresponding to the test A = $v_i$
  - $Examples_{v_i}$ = subset of Examples that have value $v_i$ for A
  - if $Examples_{v_i}$ is empty, add a leaf node with label = most common value of Target_attribute in Examples
  - else add the subtree ID3($Examples_{v_i}$, Target_attribute, Attributes - {A})
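
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: it assumes examples are dicts, uses information gain as the "best attribute" criterion, returns the tree as nested dicts, and takes a hypothetical `domains` argument listing the possible values of each attribute. The five-row dataset is made up.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def majority_label(rows, target):
    # Ties are broken by whichever label was seen first.
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def id3(rows, target, attributes, domains):
    labels = {row[target] for row in rows}
    if len(labels) == 1:              # all positive or all negative
        return labels.pop()
    if not attributes:                # nothing left to test
        return majority_label(rows, target)
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]   # Attributes - {A}
    for value in domains[best]:
        subset = [row for row in rows if row[best] == value]
        if not subset:                # empty subset: majority leaf
            node["branches"][value] = majority_label(rows, target)
        else:
            node["branches"][value] = id3(subset, target, remaining, domains)
    return node


# Made-up toy data, only to exercise the function.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "PlayTennis": "No"},
]
domains = {"Outlook": ["Sunny", "Overcast", "Rain"], "Wind": ["Weak", "Strong"]}
tree = id3(rows, "PlayTennis", ["Outlook", "Wind"], domains)
print(tree)
```

On this toy data Outlook has the higher gain, so it becomes the root, and only the mixed Rain branch needs a second test on Wind.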

Avoid overfitting

stop growing the tree when a data split no longer brings a meaningful improvement
grow the full tree, then post-prune it. In practice, you simplify a node by putting a single value (a leaf) in place of a decision subtree, or you leave it as it is.
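
One common way to decide whether to replace a subtree with a value is reduced-error pruning on a held-out validation set. The notes do not say which pruning method is meant, so the sketch below is only one possible choice. The nested-dict tree, the `majority` field (most common training label at that node), and the validation rows are all assumptions for illustration.

```python
def classify(node, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["branches"].get(row[node["attribute"]], node["majority"])
    return node


def prune(node, validation, target):
    """Bottom-up reduced-error pruning.

    `validation` holds the validation rows that reach this node. A subtree
    is replaced by its majority label when that does not lower the number
    of correctly classified validation rows.
    """
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        reaching = [r for r in validation if r[node["attribute"]] == value]
        node["branches"][value] = prune(child, reaching, target)
    if not validation:
        return node  # no evidence either way: keep the subtree
    subtree_correct = sum(classify(node, r) == r[target] for r in validation)
    leaf_correct = sum(node["majority"] == r[target] for r in validation)
    return node["majority"] if leaf_correct >= subtree_correct else node


tree = {
    "attribute": "Outlook",
    "majority": "Yes",
    "branches": {
        "Sunny": "No",
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "majority": "Yes",
            "branches": {"Weak": "Yes", "Strong": "No"},
        },
    },
}
validation = [
    {"Outlook": "Rain", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Strong", "PlayTennis": "Yes"},
]
pruned = prune(tree, validation, "PlayTennis")
print(pruned)
```

Here the Wind test under Rain misclassifies one validation row that the plain majority label gets right, so that subtree collapses into a leaf, while the Outlook test at the root is kept.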

