Topic 01 · Phase 1

What Does the Data Say?

Exploratory Data Analysis & Statistics — before any model, understand your data


1

The Problem

You've been given a loan dataset. Before predicting defaults, what do you actually know about it? Real-world data is messy — missing values, skewed distributions, hidden correlations.

What we're trying to solve

Understand the shape and spread of each feature
Detect missing values and outliers
Find which features correlate with the target
Decide what transformations are needed before modelling
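
A first pass over the checklist above might look like the sketch below. It uses only the Python standard library and a handful of made-up loan records; the column names (`amount`, `income`, `defaulted`) and the helpers `missing_counts` and `column_summary` are assumptions for illustration, not part of any real dataset or library.

```python
import statistics

# Hypothetical loan records; None marks a missing value.
loans = [
    {"amount": 5000, "income": 42000, "defaulted": 0},
    {"amount": 12000, "income": None, "defaulted": 1},
    {"amount": 7500, "income": 58000, "defaulted": 0},
    {"amount": 250000, "income": 61000, "defaulted": 1},
    {"amount": None, "income": 39000, "defaulted": 0},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}


def column_summary(rows, column):
    """Size, min, median, and max of the non-missing values in one column."""
    values = sorted(r[column] for r in rows if r[column] is not None)
    return {
        "n": len(values),
        "min": values[0],
        "median": statistics.median(values),
        "max": values[-1],
    }


print(missing_counts(loans))
print(column_summary(loans, "amount"))
```

In this toy data the maximum `amount` sits far above the median, which is the kind of gap worth investigating before any modelling.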

"All models are wrong, but some are useful — and no model survives bad data." — after Box

2

The Intuition

Think of EDA as a conversation with the data. You ask questions, the data answers with distributions, correlations, and anomalies. A histogram tells you more about a variable than its mean ever could.

[Figure: histogram of the loan amount distribution]

μ: mean (central tendency)
σ: standard deviation (spread)
r: correlation (relationship)
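
The claim that a histogram says more than the mean can be illustrated with a small text histogram. This is a sketch on made-up loan amounts; `text_histogram` and the bin width are hypothetical choices, and empty bins are simply skipped.

```python
import statistics
from collections import Counter

# Hypothetical loan amounts with one very large loan.
amounts = [4000, 5000, 5500, 6000, 7000, 8000, 9000, 12000, 60000]


def text_histogram(values, bin_width):
    """Return one text line per non-empty bin, using '#' marks as bars."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        label = f"{start}-{start + bin_width - 1}"
        bar = "#" * bins[start]
        lines.append(f"{label:<14}{bar}")
    return lines


for line in text_histogram(amounts, bin_width=5000):
    print(line)

print("mean:  ", statistics.mean(amounts))
print("median:", statistics.median(amounts))
```

Here the mean is pulled above eight of the nine values by a single large loan, something the bars make visible at a glance and the mean alone hides.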

3

The Math

Four statistical primitives underpin almost everything in EDA:

Sample Mean & Variance

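x̄ = (1/n) Σxᵢ | s² = (1/(n−1)) Σ(xᵢ−x̄)²

Dividing by n−1 rather than n (Bessel's correction) makes s² an unbiased estimate of the population variance σ²; s stands in for σ when only a sample is available.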
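
The two divisor conventions for variance (n for a full population, n−1 for a sample) can be compared directly with Python's standard `statistics` module. The data values below are made up for illustration.

```python
import statistics

# Made-up values for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print("mean:", mean)
print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.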

Pearson Correlation

r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)²·Σ(yᵢ−ȳ)²]
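
The formula above translates almost term by term into code. This is a minimal sketch; `pearson_r` is a hypothetical helper name and the inputs are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))          # perfectly linear
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x squared
```

The second call shows a limitation worth remembering: y is fully determined by x, yet r is zero because the relationship is not linear.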

Z-score (Standardization)

z = (x − μ) / σ

For roughly normal data, a value with |z| > 3 is a likely outlier.
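
The |z| > 3 rule of thumb can be sketched as a small filter. The function name `zscore_outliers` and the loan amounts are assumptions for illustration, and the sample standard deviation is used in place of σ.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs((v - mean) / std) > threshold]


# Hypothetical loan amounts: twenty ordinary loans and one very large one.
amounts = [9000, 9500, 10000, 10500, 11000] * 4 + [250000]

print(zscore_outliers(amounts))
```

One caveat: an extreme value inflates the very mean and standard deviation it is judged against, so in small samples |z| may never reach 3 however extreme the point is.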

Hypothesis Testing (t-test)

t = (x̄ − μ₀) / (s/√n) → compare to a t-distribution with n−1 degrees of freedom

p < 0.05: reject H₀ at the 5% significance level.
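
A one-sample t-test can be sketched with the standard library alone. The sample values and μ₀ = 50 are made up, `one_sample_t` is a hypothetical helper, and the critical value 2.262 (two-sided, 5% level, 9 degrees of freedom) is hard-coded from a standard t-table rather than computed.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (x_bar - mu0) / (s / math.sqrt(n))


# Made-up measurements, n = 10, so df = 9.
sample = [52, 48, 55, 51, 49, 53, 50, 54, 47, 51]
T_CRITICAL_DF9 = 2.262  # two-sided, alpha = 0.05, from a t-table

t_stat = one_sample_t(sample, mu0=50)
reject_h0 = abs(t_stat) > T_CRITICAL_DF9

print("t =", round(t_stat, 4))
print("reject H0 at the 5% level:", reject_h0)
```

Comparing |t| to the critical value is equivalent to checking whether the two-sided p-value falls below 0.05; a statistics library would normally report the p-value directly.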

4

Assumptions & Pitfalls

Correlation ≠ causation. Ice cream sales correlate with drowning rates — both driven by summer.

Simpson's Paradox. A trend in combined data can reverse when you look at subgroups. Always segment.
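
Simpson's Paradox is easier to believe with numbers. The counts below are illustrative, not real loan data: two hypothetical branches, A and B, with (repaid, total) loans split by a hypothetical risk segment.

```python
# Illustrative (repaid, total) counts per branch and risk segment.
repaid = {
    "A": {"low_risk": (81, 87), "high_risk": (192, 263)},
    "B": {"low_risk": (234, 270), "high_risk": (55, 80)},
}


def rate(pairs):
    """Pooled repayment rate over a list of (repaid, total) pairs."""
    good = sum(g for g, _ in pairs)
    total = sum(t for _, t in pairs)
    return good / total


# Repayment rate within each segment, per branch.
segment_rates = {
    branch: {seg: rate([counts]) for seg, counts in segs.items()}
    for branch, segs in repaid.items()
}
# Repayment rate with the segments pooled together, per branch.
overall_rates = {
    branch: rate(list(segs.values())) for branch, segs in repaid.items()
}

print(segment_rates)
print(overall_rates)
```

Branch A has the higher repayment rate in both segments, yet Branch B looks better overall, because A's book is dominated by high-risk loans.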

Outlier obsession. Don't delete outliers reflexively — they may be the signal, not the noise.

Missing Not At Random (MNAR). If missingness is related to the missing value itself, naive imputation (for example, filling with the mean) will bias your estimates.
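
A small deterministic sketch shows how MNAR defeats mean imputation. The incomes are made up, and the rule that incomes of 100000 or more go unreported (`REPORTING_CAP`) is a hypothetical missingness mechanism chosen for illustration.

```python
import statistics

# Made-up true incomes; in practice these would be unknown.
true_incomes = [30000, 35000, 40000, 45000, 50000, 90000, 120000, 150000]

# Hypothetical MNAR mechanism: high earners do not report their income.
REPORTING_CAP = 100000
observed = [x if x < REPORTING_CAP else None for x in true_incomes]

# Mean imputation: fill each gap with the mean of the observed values.
known = [x for x in observed if x is not None]
observed_mean = statistics.mean(known)
imputed = [x if x is not None else observed_mean for x in observed]

true_mean = statistics.mean(true_incomes)
imputed_mean = statistics.mean(imputed)

print("true mean:   ", true_mean)
print("imputed mean:", round(imputed_mean, 2))
```

Filling with the observed mean leaves the mean exactly where the biased observed data put it, so the imputed column still understates the true average.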

5

When to Use

EDA is not optional — it's the first step before any modelling task.

Strengths

Catches data quality issues early
Reveals unexpected patterns
Informs feature engineering
No model needed — pure understanding

Limitations

Time-consuming on large datasets
Visualizations can mislead if axes are poor
Doesn't replace statistical testing
Manual process — hard to automate fully

Finance: loan risk
Healthcare: clinical data
E-commerce: customer behaviour
Any supervised ML pipeline

Key Takeaways

Look Before You Model

Distributions, missing values, and outliers can invalidate any model if not handled first.

Correlation is a Clue

High correlation with the target suggests a useful feature; high correlation between features suggests multicollinearity.

Test Your Intuitions

Hypothesis tests turn "this looks different" into "this difference is statistically significant" at a stated significance level.