
How a Decision Tree Works

A decision tree is widely used and the idea behind it is very simple. However, I always tend to forget how it works, precisely because it is so simple! A decision tree uses something called Information Gain as its objective: at each split, it tries to maximize the Information Gain.

Entropy

Before talking about information gain, we need to understand entropy.

\text{entropy} = - \sum_{i} p_i \log_2 p_i

Its graph looks like this: [Figure: entropy function graph]

Notice that, for two classes, the entropy reaches its maximum of 1 at p_i = 0.5.
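
To make this concrete, here is a minimal sketch of the two-class entropy in Python (binary_entropy is just an illustrative name, not a library function):

```python
import math

def binary_entropy(p):
    """Entropy of a two-class distribution with class probabilities p and 1 - p."""
    if p in (0.0, 1.0):  # a pure node has zero entropy
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0, the maximum
print(binary_entropy(0.9))  # ~0.47, a more lopsided split has lower entropy
```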

Example

| Grade | Bumpiness | Speed limit | Speed |
| ----- | --------- | ----------- | ----- |
| Steep | Bumpy     | Yes         | Slow  |
| Steep | Smooth    | Yes         | Slow  |
| Flat  | Bumpy     | No          | Fast  |
| Steep | Smooth    | No          | Fast  |

Suppose we want to predict Speed, which is either slow or fast.

The base entropy is 1 because

\begin{aligned} p_{slow} &= 0.5 \\ p_{fast} &= 0.5 \end{aligned}
\begin{aligned} \text{entropy} &= - p_{slow} \log_2 p_{slow} - p_{fast} \log_2 p_{fast} \\ &= - 0.5 \log_2 0.5 - 0.5 \log_2 0.5 \\ &= - \log_2 0.5 = 1.0 \end{aligned}

Information Gain

Finally, we can talk about information gain. Information gain is the difference between the parent's entropy and the weighted average of the children's entropy.

\text{Information Gain} = \text{base entropy} - \text{weighted average of children entropy}
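
As a rough sketch of this formula in Python (the entropy and information_gain helpers are my own names, not from a library; a split is represented as a list of child label groups):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children):
    """Parent entropy minus the size-weighted average of the children's entropies."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent_labels) - weighted

# Splitting the example data on Grade: steep -> [slow, slow, fast], flat -> [fast]
print(information_gain(["slow", "slow", "fast", "fast"],
                       [["slow", "slow", "fast"], ["fast"]]))  # ~0.311
```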

Example

We have 3 features (Grade, Bumpiness, Speed Limit). For each feature, we compute the entropy of the children produced by splitting on it and take the difference from the base entropy.

In the case of Grade:

[Figure: decision tree split on Grade]

We have two cases: the grade is either steep or flat.

When the grade is steep

There are 2 slow examples and 1 fast example. Therefore,

p_{slow} = \frac{2}{3}
p_{fast} = \frac{1}{3}
\begin{aligned} \text{entropy} &= - p_{slow} \log_2 p_{slow} - p_{fast} \log_2 p_{fast} \\ &\approx 0.9184 \end{aligned}

When the grade is flat

This case is simple because the node is "pure" (only one class is present), so its entropy is 0.

Information Gain

Therefore,

the weighted average of the children's entropy (e_1) is

e_1 = \frac{3}{4} \cdot 0.9184 + \frac{1}{4} \cdot 0

and the information gain is

\text{Information Gain} = 1.0 - e_1 \approx 0.3112

If we repeat this for the other features (Bumpiness, Speed Limit), we get:

\text{Information Gain for \textbf{grade}} = 0.311
\text{Information Gain for \textbf{bumpiness}} = 0
\text{Information Gain for \textbf{speed limit}} = 1

Therefore, the split will occur on Speed Limit because it has the greatest information gain.
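
To double-check these numbers, here is a small standalone script (again with my own helper names, not library functions) that computes the information gain of each feature on the toy table:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """Entropy of the target minus the weighted entropy after splitting on the feature."""
    n = len(target)
    gain = entropy(target)
    for value in set(feature):
        child = [t for f, t in zip(feature, target) if f == value]
        gain -= len(child) / n * entropy(child)
    return gain

# The toy dataset from the table above
grade       = ["steep", "steep", "flat", "steep"]
bumpiness   = ["bumpy", "smooth", "bumpy", "smooth"]
speed_limit = ["yes", "yes", "no", "no"]
speed       = ["slow", "slow", "fast", "fast"]

for name, feature in [("grade", grade), ("bumpiness", bumpiness), ("speed limit", speed_limit)]:
    print(name, round(information_gain(feature, speed), 3))
# grade 0.311, bumpiness 0.0, speed limit 1.0
```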

Min Split

The decision tree repeats the above process, but the question is when it should stop splitting. There are two popular hyperparameters for this: one is max depth and the other is min split.

Max depth refers to the maximum depth of the tree. Splitting stops once the current depth reaches the max depth.

Min split refers to the minimum number of samples required to split a node. For example, if min split is 3 and only 2 samples are left in a node, that node will not be split any further.

Code Samples

In sklearn,

from sklearn import tree

# X is the feature matrix and Y the target labels, prepared beforehand
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
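
As a slightly fuller sketch (the numeric encoding of the toy table below is my own choice; scikit-learn expects numeric feature arrays), we can fit a tree on the example data and pass the two hyperparameters discussed above:

```python
from sklearn import tree

# Encode the toy table: grade (steep=1, flat=0), bumpiness (bumpy=1, smooth=0),
# speed limit (yes=1, no=0); target speed (slow=0, fast=1)
X = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0],
     [1, 0, 0]]
Y = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier(
    criterion="entropy",   # split on information gain (entropy) instead of the default Gini
    max_depth=3,           # stop growing once this depth is reached
    min_samples_split=2,   # minimum number of samples required to split a node
)
clf = clf.fit(X, Y)
print(clf.predict([[1, 1, 1]]))  # [0] -> slow, since the speed limit is "yes"
```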

© 2024 Mo Kweon. All rights reserved.