Mathematical Insights: Decoding the Inner Workings of Decision Trees for Classification Problems

Unveiling the Mathematical Framework: A Deep Dive into Decision Trees for Accurate Classification in Machine Learning

Varsha C Bendre
Analytics Vidhya

--

What is a Decision Tree?
A decision tree is a supervised learning method that offers comparatively good accuracy and a self-explanatory model. It is built top-down with a greedy algorithm, and it can be applied to both classification and regression problems.

Terminologies for Decision Tree

  1. There are three types of nodes, namely:
  • Root node: the topmost node, which holds the whole data-set or a sample of it. This is the first node to be split into sub-nodes.
  • Child node: a sub-node of a parent node.
  • Terminal node: a bottom node that cannot be split any further.

2. Splitting: nodes are split into sub-nodes using the Gini index, Information Gain, or the Chi-square method, which are discussed in the later sections.
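For a quick hands-on feel, here is a minimal sketch using scikit-learn, whose DecisionTreeClassifier supports the first two criteria directly; Chi-square splitting (as in CHAID) is not built into scikit-learn. The iris data-set is used purely for illustration.

```python
# A minimal sketch (not from this article): comparing the two split criteria
# that scikit-learn's DecisionTreeClassifier supports out of the box.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.score(X, y))  # training accuracy for each criterion
```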

Types of Decision Tree

Decision trees work with two types of input variable:
1. Categorical input variables
2. Continuous input variables

Based on the target variable, there are two types of decision tree:
1. Categorical Variable Decision Tree: the probability of a record belonging to each category is calculated, and the most likely category is assigned.
2. Continuous Variable Decision Tree: the nearest category for the record's value is found and assigned.

How to split the nodes for Classification problems?

1. Gini Index and Gini Impurity

Gini index can be defined as the probability of a label being wrongly classified when it is chosen at random from a feature.

Gini Index = 1 - ∑(p)², where p is the probability of each label within the feature.

The Gini index performs only binary splits: the result of each split is either Success or Failure.

The lower the value of the index, the higher the homogeneity. Hence the feature with the least Gini index is selected as the root node.

Gini impurity measures how often a randomly chosen record would be classified incorrectly if it were labelled at random according to the distribution of labels in the data-set.

Gini impurity = ∑ p(i) * (1 - p(i)), where i ranges over the labels
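The two formulas are easy to sanity-check in code. Below is a minimal sketch; the function names gini_index and gini_impurity are my own. Note that when the probabilities sum to 1 the two expressions are algebraically identical, since ∑ p(1 - p) = 1 - ∑ p².

```python
# Minimal sketch of the two formulas above (function names are illustrative).
def gini_index(probs):
    # 1 - sum(p^2) over the label probabilities
    return 1 - sum(p ** 2 for p in probs)

def gini_impurity(probs):
    # sum(p * (1 - p)) over the label probabilities
    return sum(p * (1 - p) for p in probs)

print(gini_index([0.5, 0.5]))     # 0.5
print(gini_impurity([0.5, 0.5]))  # 0.5 -- the two expressions agree
```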

Sounds pretty complicated, doesn't it!

Let us understand with an example. The data-set used below has 10 records, with the features Trading (High/Low), Trend (+/-) and Recession, and the target column Bitcoin values (Increase/Decrease).

Gini index and Gini Impurity for Trading column

  1. P(Trading = High) = 6/10
    P(Trading = Low) = 4/10
  2. Label 'High' in the Trading column with respect to the target column:
For the Increase label in Bitcoin values,
P(Increase | Trading = High) = 3/6
For the Decrease label in Bitcoin values,
P(Decrease | Trading = High) = 3/6
Gini Index for label 'High' = 1 - [(3/6)² + (3/6)²] = 0.5

3. Label 'Low' in the Trading column with respect to the target column:

For the Increase label in Bitcoin values,
P(Increase | Trading = Low) = 3/4
For the Decrease label in Bitcoin values,
P(Decrease | Trading = Low) = 1/4
Gini Index for label 'Low' = 1 - [(3/4)² + (1/4)²] = 0.375

Gini Index of trading column

( 6/10 * 0.5 ) + ( 4/10 * 0.375 ) = 0.45 

Gini Impurity of trading column

[(6/10) * (1 - 6/10)] + [(4/10) * (1 - 4/10)] = 0.48

There is a 48% chance of incorrectly classifying a new random record using the Trading column.

Gini index and Gini Impurity for Trend column

  1. P(Trend = +) = 7/10
    P(Trend = -) = 3/10
  2. Label '+' in the Trend column with respect to the target column:
For the Increase label in Bitcoin values,
P(Increase | Trend = +) = 4/7
For the Decrease label in Bitcoin values,
P(Decrease | Trend = +) = 3/7
Gini Index for label '+' = 1 - [(4/7)² + (3/7)²] = 0.490

3. Label '-' in the Trend column with respect to the target column:

For the Increase label in Bitcoin values,
P(Increase | Trend = -) = 2/3
For the Decrease label in Bitcoin values,
P(Decrease | Trend = -) = 1/3
Gini Index for label '-' = 1 - [(2/3)² + (1/3)²] = 0.444

Gini Index of Trend column

(7/10 * 0.490) + (3/10 * 0.444) = 0.476

Gini Impurity of Trend column

[(7/10) * (1 - 7/10)] + [(3/10) * (1 - 3/10)] = 0.42

There is a 42% chance of incorrectly classifying a new random record using the Trend column.

Gini Index and Gini Impurity comparison for the two columns:

Feature    Gini Index    Gini Impurity
Trading    0.45          0.48
Trend      0.476         0.42

We should choose the column with the minimum Gini index. Hence Trading is the best option for the split.
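The whole comparison can be reproduced from the class counts alone. A minimal sketch, assuming the (Increase, Decrease) counts read off the worked example above:

```python
# Reproducing the Gini comparison from the class counts in the example above.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# (Increase, Decrease) counts in each child node
trading = {"High": (3, 3), "Low": (3, 1)}
trend = {"+": (4, 3), "-": (2, 1)}

for name, children in (("Trading", trading), ("Trend", trend)):
    n = sum(sum(c) for c in children.values())
    weighted = sum(sum(c) / n * gini(c) for c in children.values())
    print(name, round(weighted, 3))
# Trading 0.45 < Trend 0.476, so Trading is chosen for the split
```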

2. Information Gain

Entropy is a measure of the heterogeneity of the data.

  • Completely homogeneous data has an entropy of 0.
  • A data-set whose labels are split exactly in half has the maximum entropy: 1 with logarithms taken to base 2. (The calculations below use base-10 logarithms, for which the maximum is 0.301.)

The entropy of the target column is calculated in order to obtain the information gained from each feature.

Information gain = Entropy(target) - Weighted Entropy(feature)

The feature with the higher information gain is selected for the split.
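As a sketch, the two quantities can be written as small helper functions (the names are my own). The base of the logarithm is a free choice; this article's numbers use base 10, which is why the maximum binary entropy here is 0.301 rather than 1.

```python
import math

def entropy(probs, base=10):
    # -sum(p * log(p)); base 10 to match the numbers in this article
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def information_gain(target_probs, children):
    # children: (weight, child_probs) pairs for one candidate feature
    weighted = sum(w * entropy(p) for w, p in children)
    return entropy(target_probs) - weighted

print(entropy([0.5, 0.5]))  # 0.301..., the base-10 maximum for a binary split
```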

Let's do some math to get a good understanding:

From the above data-set:

  1. Entropy(target)
P(Increase) = 6/10
P(Decrease) = 4/10
Entropy(Bitcoin values) = - [(6/10) log(6/10) + (4/10) log(4/10)] = 0.2923

2. Entropy(features)

  • Trading
P(High) = 6/10
P(Low) = 4/10

Calculate the entropy of C1 (the child node where Trading = High):

P(Increase) = 3/6 = 0.5
P(Decrease) = 3/6 = 0.5
Entropy(C1) = - [0.5 log(0.5) + 0.5 log(0.5)] = 0.3010

Calculate the entropy of C2 (the child node where Trading = Low):

P(Increase) = 3/4 = 0.75
P(Decrease) = 1/4 = 0.25
Entropy(C2) = - [0.75 log(0.75) + 0.25 log(0.25)] = 0.2442

Weighted Entropy(Trading) = [(6/10) * 0.3010] + [(4/10) * 0.2442] = 0.2783

Information gain(Trading)
= Entropy(Bitcoin values) - Weighted Entropy(Trading)
= 0.2923 - 0.2783
= 0.0140
  • Trend
P(+) = 7/10
P(-) = 3/10

Calculate the entropy of C3 (the child node where Trend = +):

P(Increase) = 4/7 = 0.571
P(Decrease) = 3/7 = 0.429
Entropy(C3) = - [0.571 log(0.571) + 0.429 log(0.429)] = 0.2966

Calculate the entropy of C4 (the child node where Trend = -):

P(Increase) = 2/3 = 0.667
P(Decrease) = 1/3 = 0.333
Entropy(C4) = - [0.667 log(0.667) + 0.333 log(0.333)] = 0.2764

Weighted Entropy(Trend) = [(7/10) * 0.2966] + [(3/10) * 0.2764] = 0.2905

Information gain(Trend)
= Entropy(Bitcoin values) - Weighted Entropy(Trend)
= 0.2923 - 0.2905
= 0.0018

Information gain comparison for the two columns:

Feature    Information Gain
Trading    0.0140
Trend      0.0018

Information gain is highest for Trading, hence Trading should be selected for the split.
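Here is a short sketch that reproduces both information gains from the probabilities used above (base-10 logarithms, as in the hand calculation):

```python
import math

def entropy(probs):
    # base-10 logarithm, to match the hand calculation above
    return -sum(p * math.log10(p) for p in probs if p > 0)

target = entropy([6/10, 4/10])  # entropy of the Bitcoin values column

weighted_trading = (6/10) * entropy([3/6, 3/6]) + (4/10) * entropy([3/4, 1/4])
weighted_trend = (7/10) * entropy([4/7, 3/7]) + (3/10) * entropy([2/3, 1/3])

print(round(target - weighted_trading, 3))  # 0.014 -> Trading
print(round(target - weighted_trend, 3))    # 0.002 -> Trend
```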

3. Chi-Square

The Chi-square test measures the statistical significance of a feature: the higher the test value, the higher the statistical significance.

Chi-square = √((Actual - Expected)² / Expected), computed for each label in each child node and then summed over the nodes.

Unlike the Gini index, it can perform multi-way splits.
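A per-cell sketch of the formula (the helper name is my own; "Expected" here is the count each label would receive if the node split its classes evenly, which is my reading of how the worked example below derives its expected values):

```python
import math

def chi_cell(actual, expected):
    # square root of (Actual - Expected)^2 / Expected, for one label in one node
    return math.sqrt((actual - expected) ** 2 / expected)

print(round(chi_cell(3, 2), 3))    # 0.707
print(round(chi_cell(1, 1.5), 3))  # 0.408
```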

Here is an example:

A Chi-square test is performed on the Recession column:

  • Chi-Square(Increase) for Past = √((3 - 2)²/2) = 0.707
  • Chi-Square(Decrease) for Present = √((1 - 1.5)²/1.5) = 0.408

Chi-Square is calculated in the same way for all the labels.

Chi-Square(Recession) = 0.707 + 0.707 + 0.408 + 0.408 + 0.408 + 0.408 ≈ 3.047

Chi-square test on Trading:

Chi-Square(Trading) = 0 + 0 + 0.707 + 0.707 = 1.414

Chi-Square test on Trend:

Chi-Square(Trend) = 0.267 + 0.267 + 0.408 + 0.408 ≈ 1.351

Chi-Square comparison for the columns:

Feature     Chi-Square
Recession   3.047
Trading     1.414
Trend       1.351

The Recession feature has the highest Chi-square test value and hence the highest statistical significance, so it is selected for the split.
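The three totals can be reproduced from (actual, expected) pairs per cell. A minimal sketch; note that the four identical 0.408 terms for Recession are inferred from its total above rather than listed cell by cell in the article:

```python
import math

def chi_cell(actual, expected):
    return math.sqrt((actual - expected) ** 2 / expected)

# (actual, expected) per label per child node, read off the example above;
# the last four Recession cells are inferred from its four 0.408 terms
recession = [(3, 2), (1, 2), (2, 1.5), (1, 1.5), (2, 1.5), (1, 1.5)]
trading = [(3, 3), (3, 3), (3, 2), (1, 2)]
trend = [(4, 3.5), (3, 3.5), (2, 1.5), (1, 1.5)]

for name, cells in (("Recession", recession), ("Trading", trading), ("Trend", trend)):
    print(name, round(sum(chi_cell(a, e) for a, e in cells), 3))
# Recession has the largest value (~3.05), so it is the most significant
```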

I'll be posting Decision Tree Part 2, on splitting nodes for regression problems.
