
Decision Trees ⭐

2. Decision Trees: The Core Algorithm 🌳

A Decision Tree is a supervised learning algorithm that recursively splits data based on feature values to create a flowchart-like structure for classification or regression.

2.1 How It Works

The tree asks a series of questions, splitting data into increasingly pure groups:

```text
                  Is Monthly Spend > ₹5000?
                    /                \
                  YES                 NO
                  /                    \
    Contract > 12 months?        Support Calls > 4?
       /           \                /          \
     YES            NO            YES           NO
      |              |             |             |
  Will STAY ✅   Will CHURN ❌  Will CHURN ❌  Will STAY ✅
  (High spender, (Short contract (Unhappy      (Low spender,
  long contract)  = churn risk)   customer)     but satisfied)
```

Structure:

  • Root Node: the topmost node; represents the entire dataset
  • Decision Nodes: where the data splits based on a feature
  • Leaf Nodes: the final output (the predicted class or value)
  • Branch: the outcome of a split (e.g., "Yes" or "No")
  • Depth: the number of edges from the root to the deepest leaf
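This structure maps directly onto scikit-learn. A minimal sketch (the toy churn data below is made up for illustration): fit a shallow tree and print its flowchart as text with `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy churn data (hypothetical): [monthly_spend, contract_length_months]
X = [[6000, 24], [7000, 18], [5500, 3], [1000, 2], [800, 1], [1200, 14]]
y = [0, 0, 1, 1, 1, 0]  # 1 = churn, 0 = stay

# max_depth=2 keeps the printed tree small and readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=["monthly_spend", "contract_length"]))
```

The printed output shows each decision node's threshold and each leaf's predicted class, mirroring the diagram above.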

2.2 Splitting Criteria: How the Tree Decides Which Question to Ask

The tree selects the feature that produces the purest split, meaning the resulting groups are as homogeneous as possible.

Impurity

Impurity measures how mixed a group is. A bucket with only red balls = pure (impurity = 0). A bucket with 50% red, 50% blue = maximum impurity.

The tree's goal: each split should reduce impurity as much as possible.

Gini Index

Formula: Gini = 1 - Σ(pᵢ²), where pᵢ = proportion of class i

Worked Example: Complete Gini Calculation

Dataset: 20 customers deciding between Churn/Stay. Feature: "Has Complaint" (Yes/No).

Overall: 12 Stay, 8 Churn → Gini(parent) = 1 - (0.6² + 0.4²) = 0.48

Split by "Has Complaint":
Complaint = Yes (10 customers): 3 Stay, 7 Churn
Gini(Yes) = 1 - (0.3² + 0.7²) = 1 - (0.09 + 0.49) = 0.42

Complaint = No (10 customers): 9 Stay, 1 Churn
Gini(No) = 1 - (0.9² + 0.1²) = 1 - (0.81 + 0.01) = 0.18

Weighted Gini = (10/20)×0.42 + (10/20)×0.18 = 0.21 + 0.09 = 0.30

Gini Gain = 0.48 - 0.30 = 0.18 ← this feature reduces impurity ✅
| Gini Value | Meaning |
|------------|---------|
| 0          | Perfectly pure (all one class) |
| 0.5        | Maximum impurity for binary classification |
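The worked example above can be verified in a few lines of plain Python (no libraries; the counts are the ones from the example):

```python
def gini(counts):
    """Gini = 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([12, 8])     # 12 Stay, 8 Churn  -> 0.48
g_yes = gini([3, 7])       # Complaint = Yes   -> 0.42
g_no = gini([9, 1])        # Complaint = No    -> 0.18

weighted = (10 / 20) * g_yes + (10 / 20) * g_no   # -> 0.30
gain = parent - weighted                          # -> 0.18
print(round(parent, 2), round(weighted, 2), round(gain, 2))
```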

Entropy & Information Gain

Entropy Formula: E = -Σ pᵢ × log₂(pᵢ)

Information Gain = Entropy(parent) - Weighted Average Entropy(children)

The feature with the highest Information Gain is chosen for the split.

Worked Example:

Before split: 100 customers (50 churn, 50 stay)
Entropy = -0.5×log₂(0.5) - 0.5×log₂(0.5) = 1.0 (maximum uncertainty)

Split by "Contract Length > 12 months":
Left child: 60 customers (10 churn, 50 stay) → Entropy ≈ 0.65
Right child: 40 customers (40 churn, 0 stay) → Entropy = 0.0 (PURE!)

Weighted Entropy = (60/100)×0.65 + (40/100)×0.0 = 0.39
Information Gain = 1.0 - 0.39 = 0.61 → HIGH gain = a great feature ✅
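As with Gini, the entropy numbers above can be reproduced in plain Python:

```python
import math

def entropy(counts):
    """E = -sum(p_i * log2(p_i)), skipping empty classes (0*log(0) = 0)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = entropy([50, 50])   # 50/50 split -> 1.0 (maximum uncertainty)
left = entropy([10, 50])     # 10 churn, 50 stay -> ~0.65
right = entropy([40, 0])     # all churn -> 0.0 (pure)

weighted = (60 / 100) * left + (40 / 100) * right
gain = parent - weighted     # ~0.61
print(round(left, 2), round(gain, 2))
```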

🧠 "What's the difference between Gini and Entropy?" Both measure impurity. Gini is computationally faster; entropy is more theoretically grounded (it comes from information theory). In practice the results are very similar, and scikit-learn uses Gini by default.

2.3 Overfitting: The Biggest Problem

Overfitting occurs when a tree becomes too complex: it memorizes the training data (including noise) and performs poorly on new, unseen data.

Analogy: A student who memorizes past exam papers word-for-word. They score 100% on those exact questions but fail on any new question.

Signs: very high training accuracy but poor test accuracy; a deep tree with hundreds of leaves.
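This symptom is easy to reproduce. A sketch on synthetic data (make_classification with label noise is an assumption for illustration): an unconstrained tree scores near-perfectly on the training set but noticeably lower on held-out data, while a depth-limited tree shows a much smaller gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y) to make memorization hurt
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # no limits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep:    train", deep.score(X_tr, y_tr), "test", deep.score(X_te, y_te))
print("shallow: train", shallow.score(X_tr, y_tr), "test", shallow.score(X_te, y_te))
```

The unconstrained tree fits the training noise almost perfectly; the train/test gap is the overfitting signature described above.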

Solution: Pruning

| Type | What It Does | Parameters |
|------|--------------|------------|
| Pre-pruning | Stops the tree from growing too deep | max_depth, min_samples_split, min_samples_leaf |
| Post-pruning | Grows the full tree, then removes unhelpful branches | ccp_alpha |
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,           # tree won't go deeper than 5 levels
    min_samples_split=20,  # a node needs >= 20 samples to split
    min_samples_leaf=10,   # each leaf must have >= 10 samples
    random_state=42,
)
model.fit(X_train, y_train)
```

🧠 Memorize the key parameters: max_depth = how deep the tree may grow, min_samples_split = minimum samples needed to split a node, min_samples_leaf = minimum samples allowed in a leaf. Together, these keep the tree "well-behaved".
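Post-pruning works differently: scikit-learn exposes cost-complexity pruning through `ccp_alpha`, and `cost_complexity_pruning_path` suggests candidate alpha values (larger alpha prunes more aggressively). A sketch, again on hypothetical synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

# Grow the full, unpruned tree first
full = DecisionTreeClassifier(random_state=42).fit(X, y)

# Candidate alphas for cost-complexity pruning, from 0 (no pruning) upward
path = full.cost_complexity_pruning_path(X, y)

# Refit with a large alpha from the path: most branches get pruned away
pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[-2], random_state=42
).fit(X, y)

print("full leaves:", full.get_n_leaves(), "pruned leaves:", pruned.get_n_leaves())
```

In practice you would pick the alpha by cross-validation rather than from the end of the path; the point here is only that larger `ccp_alpha` yields a smaller tree.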

2.4 Complete Scikit-learn Pipeline

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load and prepare data
df = pd.read_csv('customer_churn.csv')
X = df[['monthly_spend', 'contract_length', 'support_calls', 'tenure']]
y = df['churned']  # 1 = churned, 0 = stayed

# 2. Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build and train the model
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# 4. Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 5. Feature importance
for name, imp in zip(X.columns, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```