
Random Forest & Regression

3. Random Forest — Decision Tree's Powerful Upgrade

Random Forest = an ensemble of many decision trees whose combined vote is more accurate than any single tree.

Why it works better than a single tree:

  • Each tree trains on a random subset of data (Bagging)
  • Each tree considers random features at each split
  • Individual errors cancel out through majority voting

Key Concepts:

| Concept | Meaning |
|---|---|
| Bagging | Bootstrap Aggregating — each tree gets a random data sample (with replacement) |
| Feature Randomness | Each split considers only √n random features (classification) or n/3 (regression) |
| Ensemble | Combining multiple models for better performance |
| OOB Score | Out-of-Bag Score — built-in validation using the ~37% of data each tree didn't train on |

🧠 Analogy: Getting an opinion from one doctor vs a committee of 100 doctors. The committee will be more accurate — individual biases cancel each other out.
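A minimal sketch of these concepts, assuming scikit-learn is available, using the built-in Iris dataset. The parameter names match scikit-learn's `RandomForestClassifier`:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees; each is trained on a bootstrap sample (bagging) and
# considers a random subset of features at every split.
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # √n features per split (classification default)
    oob_score=True,        # free validation on the ~37% out-of-bag rows
    random_state=42,
)
rf.fit(X, y)

print(f"OOB score: {rf.oob_score_:.3f}")
```

`oob_score_` gives a validation estimate without a separate hold-out set, because each tree is scored on the rows it never saw.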


4. Linear & Logistic Regression — Basics

Linear Regression (Predict a number)

Fits a straight line through data points to predict a continuous outcome.

Formula: y = mx + b

  • y = predicted value (e.g., Sales)
  • x = input feature (e.g., Ad Spend)
  • m = slope (how much y changes per unit change in x)
  • b = intercept (predicted y when x = 0)

Worked Example:

A model predicts: Sales = 200 × (Ad_Spend_in_lakhs) + 5000

Interpretation:
- Base sales (no ads) = ₹5,000
- Each ₹1 lakh in ad spend adds ₹200 to sales
- If Ad Spend = ₹10 lakhs → Sales = 200×10 + 5000 = ₹7,000
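The worked example as plain Python, with the slope and intercept taken directly from the formula above:

```python
def predicted_sales(ad_spend_lakhs: float) -> float:
    """Worked example model: Sales = 200 * Ad_Spend_in_lakhs + 5000."""
    m, b = 200, 5000   # slope and intercept from the example
    return m * ad_spend_lakhs + b

print(predicted_sales(0))    # base sales with no ads: 5000
print(predicted_sales(10))   # 200*10 + 5000 = 7000
```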

Key Assumptions:

  1. Linear relationship between x and y
  2. No multicollinearity (features shouldn't be highly correlated with each other)
  3. Homoscedasticity (constant variance of errors)
  4. Normal distribution of residuals

R² (Coefficient of Determination):

  • Measures how well the model explains variance in the data
  • R² = 0.85 → Model explains 85% of the variance, 15% unexplained
  • R² = 1.0 → Perfect fit; R² = 0.0 → Model explains nothing
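R² can be computed from first principles as 1 − SS_res/SS_tot; a small pure-Python sketch:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained
    return 1 - ss_res / ss_tot

# perfect predictions: R^2 = 1.0
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))
# predicting the mean for everything: R^2 = 0.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))
```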

Logistic Regression (Predict a category — Yes/No)

Despite the name "Regression," this is a classification algorithm. It predicts the probability of a binary outcome using the sigmoid function.

  • Output: Probability between 0 and 1
  • Decision threshold: Usually 0.5 — probability > 0.5 = Yes, ≤ 0.5 = No
  • Use cases: Churn prediction, spam detection, loan default
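The sigmoid and the 0.5 threshold in pure Python (function names here are illustrative, not a library API):

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def classify(z: float, threshold: float = 0.5) -> str:
    """Apply the decision threshold to the predicted probability."""
    return "Yes" if sigmoid(z) > threshold else "No"

print(sigmoid(0))      # 0.5, right on the decision boundary
print(classify(2.0))   # high score -> "Yes"
print(classify(-2.0))  # low score  -> "No"
```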

5. Feature Engineering & Data Preparation

5.1 Train-Test Split

Why: Evaluate model on data it has NEVER seen during training.

Common Splits:
80/20 → 80% train, 20% test (most common)
70/30 → When the dataset is smaller and you want more test data
60/20/20 → Train/Validation/Test (for tuning hyperparameters)

CRITICAL: Never use test data during training — that's data leakage!

🧠 Be sure to say this in an interview: "Stratified split ensures class proportions are maintained. If 30% of data is churn, both train and test sets will have ~30% churn."
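A sketch of a stratified split, assuming scikit-learn is available, on toy labels that stand in for real churn data (30% churn):

```python
from sklearn.model_selection import train_test_split

# toy labels: 30% churn (1), 70% stay (0)
y = [1] * 30 + [0] * 70
X = [[i] for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80/20 split
    stratify=y,        # keep the 30/70 class ratio in both sets
    random_state=42,
)

print(sum(y_train) / len(y_train))  # ~0.3 churn in train
print(sum(y_test) / len(y_test))    # ~0.3 churn in test
```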

5.2 Handling Missing Values

| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | Very few missing values (< 5%) | df.dropna() |
| Mean/Median imputation | Numerical columns | df['col'].fillna(df['col'].median()) |
| Mode imputation | Categorical columns | df['col'].fillna(df['col'].mode()[0]) |
| Forward/Back fill | Time series data | df['col'].ffill() / df['col'].bfill() |
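The median and mode strategies on a tiny made-up DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 30, 35],                  # numerical -> median
    "city": ["Delhi", "Mumbai", None, "Delhi"],   # categorical -> mode
})

# median of [25, 30, 35] is 30; mode of the cities is "Delhi"
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```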

5.3 Encoding Categorical Variables

| Method | When to Use | Example |
|---|---|---|
| Label Encoding | Ordinal data (has natural order) | Low=0, Medium=1, High=2 |
| One-Hot Encoding | Nominal data (no order) | City → is_Delhi, is_Mumbai, is_Bangalore |
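Both encodings in pandas; the column names mirror the examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "priority": ["Low", "High", "Medium"],      # ordinal
    "city": ["Delhi", "Mumbai", "Bangalore"],   # nominal
})

# Label encoding for ordinal data: the mapping preserves the natural order
order = {"Low": 0, "Medium": 1, "High": 2}
df["priority_enc"] = df["priority"].map(order)

# One-hot encoding for nominal data: one 0/1 column per city, no implied order
df = pd.get_dummies(df, columns=["city"], prefix="is")

print(df.columns.tolist())
```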

6. Model Evaluation — How Good Is the Model?

6.1 Confusion Matrix

                      PREDICTED
                  Positive   Negative
ACTUAL  Positive  [   TP   |   FN   ]
        Negative  [   FP   |   TN   ]

| Term | Meaning | Example (Churn Prediction) |
|---|---|---|
| TP (True Positive) | Predicted positive, actually positive | Predicted churn, customer did churn ✅ |
| TN (True Negative) | Predicted negative, actually negative | Predicted stay, customer did stay ✅ |
| FP (False Positive) | Predicted positive, actually negative | Predicted churn, but customer stayed ❌ (false alarm) |
| FN (False Negative) | Predicted negative, actually positive | Predicted stay, but customer churned ❌ (missed) |

6.2 Key Metrics

| Metric | Formula | What It Tells You | When It Matters |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Overall proportion correct | Only when classes are balanced |
| Precision | TP / (TP+FP) | Of those predicted positive, how many actually were? | When false positives are costly (spam filter) |
| Recall | TP / (TP+FN) | Of all actual positives, how many did we catch? | When false negatives are costly (disease detection) |
| F1 Score | 2×(P×R)/(P+R) | Harmonic mean of Precision and Recall | When you need a balance of both |
| AUC-ROC | Area under ROC curve | Model's ability to distinguish classes | Overall model discrimination ability |

Worked Problem — Complete Confusion Matrix Analysis:

A churn model's confusion matrix on 200 test customers:

                    Predicted
                  Churn    Stay
Actual Churn    [   35  |   15  ]   = 50 actual churners
Actual Stay     [   10  |  140  ]   = 150 actual stayers

Accuracy = (35+140)/200 = 87.5%
Precision = 35/(35+10) = 77.8% (of predicted churners, 78% actually churned)
Recall = 35/(35+15) = 70.0% (caught 70% of actual churners)
F1 Score = 2×(0.778×0.70)/(0.778+0.70) = 0.737

Interpretation: The model misses 30% of churners (15 FN). If each churner costs ₹5,000 to lose, that's ₹75,000 in missed retention opportunities → you might want to increase recall.
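The full worked problem as code, plugging the four counts from the matrix above into the metric formulas:

```python
# Confusion-matrix counts from the worked churn example
TP, FN, FP, TN = 35, 15, 10, 140
total = TP + FN + FP + TN   # 200 test customers

accuracy = (TP + TN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.875
print(f"Precision: {precision:.3f}")  # 0.778
print(f"Recall:    {recall:.3f}")     # 0.700
print(f"F1 Score:  {f1:.3f}")         # 0.737
```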

The Critical Interview Scenario:

Q: "Cancer detection — Precision or Recall?" A: Recall. Missing a real cancer case (False Negative) is far worse than ordering extra tests (False Positive).

Q: "Spam filter — Precision or Recall?" A: Precision. Sending an important email to spam (False Positive) is worse than letting some spam through (False Negative).

6.3 ROC Curve & AUC

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at various threshold settings.

  • AUC = 1.0 → Perfect model
  • AUC = 0.5 → Random guessing (useless)
  • AUC > 0.8 → Good model
  • AUC > 0.9 → Excellent model

🧠 One-liner answer to "What is the ROC curve?": "It shows the trade-off between catching more positives (recall) and generating false alarms at every possible threshold. AUC summarizes this trade-off into a single number."

6.4 The Accuracy Paradox

🧠 Don't be impressed just because you hear "98% accuracy"!

Example: 1000 transactions: 980 normal, 20 fraud. A model that ALWAYS predicts "Normal" → Accuracy = 980/1000 = 98%! But it detected zero fraud cases.

Lesson: For imbalanced data, accuracy is meaningless. Use F1 Score, Precision, Recall, and AUC instead.
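The paradox is easy to reproduce in a few lines of pure Python; the always-"Normal" model scores 98% accuracy with zero recall:

```python
# 1000 transactions: 980 normal (0), 20 fraud (1)
y_true = [0] * 980 + [1] * 20
y_pred = [0] * 1000   # a "model" that ALWAYS predicts Normal

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.0%}")  # 98%, looks great
print(f"Recall:   {recall:.0%}")    # 0%, caught zero fraud cases
```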


7. Bias-Variance Tradeoff

| Concept | What It Is | Analogy |
|---|---|---|
| Bias | Error from oversimplification — model misses real patterns | Arrows cluster together but far from the bullseye |
| Variance | Error from overcomplexity — model learns noise | Arrows scattered all over |
| Sweet Spot | Neither too simple nor too complex | Arrows clustered near the bullseye |

| Model State | Bias | Variance | What's Happening |
|---|---|---|---|
| Underfitting | High | Low | Too simple — misses patterns |
| Good Fit | Low | Low | Just right |
| Overfitting | Low | High | Too complex — memorizes noise |

For Decision Trees specifically:

  • Deep, unpruned tree → Low bias, High variance (overfits)
  • Shallow, pruned tree → High bias, Low variance (underfits)
  • Random Forest → Reduces variance while keeping low bias (best of both)
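A sketch of the tree-vs-forest comparison, assuming scikit-learn, on synthetic data. The deep unpruned tree typically memorizes the training set while the forest generalizes better, though exact scores depend on the data and seed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # deep, unpruned
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # ensemble of trees

# The unpruned tree fits training data perfectly (low bias, high variance);
# compare how each does on unseen test data.
print("Tree   train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("Forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```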

8. Interview Questions (12 Questions)

Q1: "Decision Tree vs Random Forest?"

Answer: "A Decision Tree is a single model that's easy to interpret and visualize — you can show it to a client and they'll understand the logic. However, it's prone to overfitting. Random Forest c