
📈 Round 6: Statistics & Probability

Complete Guide From Scratch for Fresher Data Analysts

What to expect: 20-30 minutes testing your understanding of statistical concepts and how to apply them to business problems. You'll get both direct concept questions ("Mean vs Median?") and scenario-based questions ("This data is right-skewed. Which average should we report?").


1. Descriptive Statistics: Summarizing Data

1.1 Central Tendency: "Where Is the Center?"

| Measure | What It Is | Formula | When to Use |
|---|---|---|---|
| Mean | Sum of all values ÷ count | Sum ÷ N | When the distribution is symmetric and there are no outliers |
| Median | Middle value in sorted data | Middle position | When there are outliers or the data is skewed; the safer choice |
| Mode | Most frequently occurring value | Most frequent | For categorical data (e.g., "Delhi" is the most common city) |

Critical example (comes up in almost every interview):

Employee Salaries: ₹30K, ₹35K, ₹40K, ₹45K, ₹50K, ₹10,00,000 (CEO)

  • Mean = ₹2,00,000: misleading! The CEO's salary drags the average up
  • Median = ₹42,500: an accurate representation of the typical salary

🧠 Memorize this: when the data has outliers, always use the Median. In the interview, say: "Mean is sensitive to outliers, so I prefer Median for skewed distributions like income or house prices."
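The salary example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original guide.

```python
from statistics import mean, median

# Salaries from the example above, in rupees (the last value is the CEO).
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000]

mean_salary = mean(salaries)      # pulled up by the single outlier
median_salary = median(salaries)  # average of the two middle values

print(f"Mean:   {mean_salary:,.0f}")
print(f"Median: {median_salary:,.0f}")
```

Removing the CEO from the list and re-running shows how little the median moves compared with the mean.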

1.2 Spread: "How Scattered Is the Data?"

| Measure | What It Is | Key Point |
|---|---|---|
| Range | Max - Min | Very basic, heavily affected by outliers |
| Variance (σ²) | Average of squared deviations from the mean | Measures spread, but units are squared |
| Standard Deviation (σ) | √Variance | Most important: same units as the original data |
| IQR | Q3 - Q1 (75th - 25th percentile) | Range of the middle 50% of data, best for outlier detection |

Standard Deviation โ€” Intuitive Explanation:

  • Class A marks: 70, 72, 68, 71, 69 → SD ≈ 1.6 (consistent: everyone scored similarly)
  • Class B marks: 30, 50, 90, 70, 10 → SD ≈ 30 (highly variable: very different scores)

Low SD = consistent data. High SD = lots of variation.
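As a quick check of the two classes, the sketch below computes both the sample SD (dividing by N-1) and the population SD (dividing by N) with Python's `statistics` module. The quoted value of about 1.6 for Class A matches the sample SD; for Class B, the two versions fall on either side of 30.

```python
from statistics import pstdev, stdev

class_a = [70, 72, 68, 71, 69]
class_b = [30, 50, 90, 70, 10]

for name, marks in (("Class A", class_a), ("Class B", class_b)):
    # stdev divides by N-1 (sample), pstdev divides by N (population).
    print(f"{name}: sample SD = {stdev(marks):.2f}, population SD = {pstdev(marks):.2f}")
```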

Worked Problem (Calculating Variance and SD by Hand):

Data: 4, 8, 6, 10, 2

Step 1: Mean = (4+8+6+10+2)/5 = 30/5 = 6

Step 2: Deviations from mean:
4-6 = -2, 8-6 = +2, 6-6 = 0, 10-6 = +4, 2-6 = -4

Step 3: Squared deviations:
4, 4, 0, 16, 16

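Step 4: Variance = (4+4+0+16+16)/5 = 40/5 = 8
(This is the population variance, dividing by N. The sample variance divides by N-1 = 4 and gives 40/4 = 10.)

Step 5: SD = √8 ≈ 2.83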
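The five steps above translate directly into code. The sketch below follows the hand calculation (population variance, dividing by N) and also shows the sample variance for comparison; the variable names are illustrative.

```python
import math

data = [4, 8, 6, 10, 2]
n = len(data)

# Step 1: mean
mean = sum(data) / n

# Step 2: deviations from the mean
deviations = [x - mean for x in data]

# Step 3: squared deviations
squared = [d ** 2 for d in deviations]

# Step 4: population variance (divide by N), as in the hand calculation
population_variance = sum(squared) / n

# Step 5: standard deviation is the square root of the variance
population_sd = math.sqrt(population_variance)

# For comparison: sample variance divides by N-1 instead of N
sample_variance = sum(squared) / (n - 1)

print(mean, population_variance, round(population_sd, 2), sample_variance)
```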

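1.3 Percentiles & Quartiles

"90th percentile" means: 90% of observations fall below this value. If your score is at the 90th percentile, you're in the top 10%. Quartiles are just three specific percentiles: Q1 is the 25th, Q2 (the median) is the 50th, and Q3 is the 75th.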

1.4 Box Plot: Visual Summary of Distribution

              IQR (Q3 - Q1)
        ┌───────────────────────┐
 ╶──────┤           ┃           ├──────╴    ●  ●  ← Outliers
        └───────────────────────┘
Min*    Q1     Q2 (Median)      Q3   Max*

* Whiskers extend to the most extreme data points that still lie within Q1 - 1.5×IQR and Q3 + 1.5×IQR
Points beyond the whiskers = OUTLIERS

Worked Problem (Outlier Detection):

Data (sorted): 12, 15, 18, 20, 22, 25, 28, 30, 95

Q1 = 15 (25th percentile)
Q3 = 28 (75th percentile)
IQR = 28 - 15 = 13

Lower fence = Q1 - 1.5 × IQR = 15 - 19.5 = -4.5
Upper fence = Q3 + 1.5 × IQR = 28 + 19.5 = 47.5

→ 95 > 47.5, so 95 IS an outlier ✅

Note: quartile conventions vary. Other common methods give Q1 = 16.5 or 18 for this data, but 95 is an outlier under all of them.
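The fence calculation can be sketched as a small helper. `iqr_outliers` is a hypothetical function written for this example, not a library call. It is run twice: once with the quartiles used in the hand calculation, and once with quartiles from `statistics.quantiles`, whose default "exclusive" method uses a different convention and so returns slightly different quartiles.

```python
from statistics import quantiles

data = [12, 15, 18, 20, 22, 25, 28, 30, 95]

def iqr_outliers(values, q1, q3, k=1.5):
    """Return (lower fence, upper fence, outliers) for the given quartiles."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

# Quartiles exactly as used in the hand calculation above.
lower, upper, outliers = iqr_outliers(data, q1=15, q3=28)
print(lower, upper, outliers)

# Quartiles from the standard library (different convention, same conclusion).
q1_lib, _, q3_lib = quantiles(data, n=4)
lib_lower, lib_upper, lib_outliers = iqr_outliers(data, q1_lib, q3_lib)
print(q1_lib, q3_lib, lib_lower, lib_upper, lib_outliers)
```

In an interview it is usually enough to state which quartile convention you used. The conclusion here does not depend on it.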

2. Distributions: The Shape of Data

2.1 Normal Distribution (Bell Curve)

The most important distribution in statistics. Many natural phenomena follow it approximately: heights, IQ scores, measurement errors.

               ┌───────┐
             ┌─┤       ├─┐
           ┌─┤ │       │ ├─┐
         ┌─┤ │ │   μ   │ │ ├─┐
       ──┤ │ │ │   ↓   │ │ │ ├──
       ──┴─┴─┴─┴───┼───┴─┴─┴─┴──
                │←±1σ→│
              │←──±2σ──→│
            │←────±3σ────→│

The 68-95-99.7 Rule (Empirical Rule):

| Range | % of Data | Example (Mean=50, SD=10) |
|---|---|---|
| μ ± 1σ | 68% | 40 to 60 contains 68% of observations |
| μ ± 2σ | 95% | 30 to 70 contains 95% of observations |
| μ ± 3σ | 99.7% | 20 to 80 contains virtually all observations |

Worked Problem:

Customer daily spending is normally distributed: Mean = ₹500, SD = ₹100.

Q: What % of customers spend between ₹300 and ₹700?
A: ₹300 = 500 - 2×100 = μ - 2σ
₹700 = 500 + 2×100 = μ + 2σ
By the 68-95-99.7 rule → 95% of customers ✅

Q: A customer spends ₹800. Is this unusual?
A: ₹800 = 500 + 3×100 = μ + 3σ
99.7% of customers fall within μ ± 3σ, so only about 0.15% spend this much or more → YES, highly unusual ✅

🧠 Use it like this in an interview: "If customer spending is normally distributed with mean ₹5000 and SD ₹1000, then 95% of customers spend between ₹3000 and ₹7000."
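The empirical rule gives rounded figures. A minimal sketch with `statistics.NormalDist` (Python 3.8+) computes the exact areas for the worked problem, assuming spending really is normal with mean 500 and SD 100.

```python
from statistics import NormalDist

# Model from the worked problem: mean 500, SD 100 (rupees).
spending = NormalDist(mu=500, sigma=100)

# Share of customers between 300 and 700 (mean ± 2 SD).
share_300_700 = spending.cdf(700) - spending.cdf(300)

# Share of customers spending 800 or more (mean + 3 SD and beyond).
share_above_800 = 1 - spending.cdf(800)

print(f"Between 300 and 700: {share_300_700:.2%}")
print(f"800 or more:         {share_above_800:.3%}")
```

The exact values are about 95.45% and 0.135%, which the 68-95-99.7 rule rounds to 95% and 0.15%.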

2.2 Skewness: Data Leaning to One Side

| Direction | Tail | Mean vs Median | Examples |
|---|---|---|---|
| Right Skewed | Long tail to the RIGHT | Mean > Median | Income, house prices |
| Left Skewed | Long tail to the LEFT | Mean < Median | Exam scores (easy exam), age at retirement, age at death |
| Symmetric | No tail | Mean ≈ Median | Height, weight, IQ |

🧠 Trick: The mean is always pulled toward the tail. Right-skewed → Mean > Median. In the interview: "Income data is right-skewed, so I'd report Median, not Mean."
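The "mean is pulled toward the tail" trick can be turned into a rough check. The sketch below is only a heuristic, not a formal skewness test. `skew_direction` is a hypothetical helper, and the income figures are made-up illustrative data.

```python
from statistics import mean, median

def skew_direction(values):
    """Rough heuristic: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical monthly incomes in thousands of rupees;
# a couple of high earners form the long right tail.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 90, 250]

print(mean(incomes), median(incomes), skew_direction(incomes))
```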

2.3 Other Important Distributions

| Distribution | When It Occurs | Example |
|---|---|---|
| Binomial | Fixed number of trials, each with success/fail | "Out of 10 emails, how many get opened?" |
| Poisson | Count of rare events in a fixed interval | "How many customer complaints per hour?" |
| Uniform | Every outcome equally likely | Rolling a fair die |
| Exponential | Time between events | "Time between customer arrivals at a store" |
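To make the first two rows concrete, here is a small sketch of the binomial and Poisson probability formulas using only the standard library. The 20% open rate and the average of 2 complaints per hour are assumed values chosen for illustration; they do not come from the guide.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with an average of lam events)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Assumed: each of 10 emails is opened independently with probability 0.2.
p_three_opens = binomial_pmf(3, n=10, p=0.2)

# Assumed: complaints arrive at an average rate of 2 per hour.
p_no_complaints = poisson_pmf(0, lam=2)

print(f"P(exactly 3 of 10 emails opened) = {p_three_opens:.4f}")
print(f"P(no complaints in an hour)      = {p_no_complaints:.4f}")
```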

3. Z-Score: "How Normal Is This Data Point?"

A Z-Score tells you how many standard deviations a data point is from the mean. It allows comparison across different scales.

Formula: Z = (X - μ) / σ

Worked Problem (Comparing Performance Across Subjects):

  • Maths: 80/100 (class avg 70, SD 10) → Z = (80-70)/10 = 1.0
  • English: 75/100 (class avg 60, SD 5) → Z = (75-60)/5 = 3.0

English performance is relatively much better despite lower raw marks. Assuming marks are roughly normally distributed, the student is 3 standard deviations above the class average in English (top 0.15%) compared to only 1 SD above in Maths (top 16%).

| Z-Score | Meaning | Percentile |
|---|---|---|
| 0 | Exactly at the average | 50th |
| +1 | 1 SD above average | 84th (top 16%) |
| +2 | 2 SD above average | 97.5th (top 2.5%) |
| +3 | 3 SD above average | 99.85th (top 0.15%) |
| -1 | 1 SD below average | 16th (bottom 16%) |
| -2 | 2 SD below average | 2.5th (bottom 2.5%) |
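The Z-score formula and the percentile column can be reproduced with a short sketch. `z_score` is an illustrative helper, and converting a Z-score to a percentile this way assumes the marks are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Values from the worked problem above.
maths_z = z_score(80, mean=70, sd=10)
english_z = z_score(75, mean=60, sd=5)

standard_normal = NormalDist()  # mean 0, SD 1

for subject, z in (("Maths", maths_z), ("English", english_z)):
    # The CDF gives the share of the class scoring below this Z-score.
    percentile = 100 * standard_normal.cdf(z)
    print(f"{subject}: Z = {z:.1f}, percentile ≈ {percentile:.2f}")
```

The printed percentiles (about 84.13 and 99.87) are the unrounded versions of the 84th and 99.85th entries in the table.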