Topic 01 · Phase 1

What Does the Data Say?

Exploratory Data Analysis & Statistics — before any model, understand your data

1

The Problem

You've been given a loan dataset. Before predicting defaults, what do you actually know about it? Real-world data is messy — missing values, skewed distributions, hidden correlations.

What we're trying to solve

  • Understand the shape and spread of each feature
  • Detect missing values and outliers
  • Find which features correlate with the target
  • Decide what transformations are needed before modelling
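
The checklist above can be sketched on a toy table. This is a minimal illustration using only the Python standard library; the records, the column names (`amount`, `income`, `defaulted`), and the helper `summarize` are hypothetical and not part of any real dataset.

```python
import statistics

# Hypothetical loan records; None marks a missing value.
loans = [
    {"amount": 5000,  "income": 42000, "defaulted": 0},
    {"amount": 12000, "income": None,  "defaulted": 1},
    {"amount": 7500,  "income": 55000, "defaulted": 0},
    {"amount": 30000, "income": 61000, "defaulted": 1},
    {"amount": 6500,  "income": 48000, "defaulted": 0},
    {"amount": None,  "income": 39000, "defaulted": 0},
]

def summarize(rows, column):
    """Return count, missing count, mean, median, and range for one numeric column."""
    present = [r[column] for r in rows if r[column] is not None]
    return {
        "n": len(present),
        "missing": len(rows) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }

for col in ("amount", "income"):
    print(col, summarize(loans, col))
```

Here the mean amount (12200) sits well above the median (7500), a hint of right skew and one cue that a transformation such as a log may be worth trying. In pandas, the same first look is commonly done with `DataFrame.describe()` and `DataFrame.isna().sum()`.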
"All models are wrong, but some are useful — and no model survives bad data." — after Box
2

The Intuition

Think of EDA as a conversation with the data. You ask questions, the data answers with distributions, correlations, and anomalies. A histogram tells you more about a variable than its mean ever could.
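
To see why a histogram says more than a mean, here is a small sketch with two made-up samples that share the same mean but have very different shapes. The data and the helper `text_histogram` are illustrative assumptions.

```python
import statistics
from collections import Counter

# Two hypothetical samples of loan amounts (in thousands) with the same mean.
symmetric = [8, 9, 9, 10, 10, 10, 10, 11, 11, 12]
skewed    = [4, 4, 5, 5, 5, 6, 6, 7, 8, 50]

def text_histogram(values, bin_width):
    """Count values per bin and return {bin_start: count}, sorted by bin start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

width = 5
for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
    print(f"{name}: mean = {statistics.mean(sample)}")
    for start, count in text_histogram(sample, width).items():
        print(f"  {start:>2}-{start + width - 1:<2} {'#' * count}")
```

Both samples have a mean of 10, yet the second one is dominated by a single loan of 50. Only the binned view makes that visible.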

[Figure: histogram of the loan amount distribution]

  • μ: mean (central tendency)
  • σ: standard deviation (spread)
  • r: correlation (relationship)
3

The Math

Three statistical primitives underpin almost everything in EDA:

Sample Mean & Variance
x̄ = (1/n) Σxᵢ  |  s² = (1/(n−1)) Σ(xᵢ−x̄)²

For a full population, write μ and σ² and divide the squared deviations by n instead of n−1.
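
Variance comes in two conventions that differ only in the divisor: n when the data is the whole population, n − 1 when it is a sample. A minimal sketch on made-up numbers, checked against the standard library's `statistics.pvariance` and `statistics.variance`:

```python
import statistics

amounts = [5, 7, 8, 10, 20]  # hypothetical loan amounts, in thousands

n = len(amounts)
mean = sum(amounts) / n
sum_sq = sum((x - mean) ** 2 for x in amounts)  # sum of squared deviations

pop_var = sum_sq / n           # divide by n: population variance
sample_var = sum_sq / (n - 1)  # divide by n - 1: sample variance

# The standard library implements both conventions.
assert abs(pop_var - statistics.pvariance(amounts)) < 1e-9
assert abs(sample_var - statistics.variance(amounts)) < 1e-9

print(mean, pop_var, sample_var)
```

On five points the two estimates differ noticeably (27.6 versus 34.5); on thousands of rows the gap becomes negligible.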
Pearson Correlation
r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)²·Σ(yᵢ−ȳ)²]
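
The formula translates almost term by term into code. The sketch below is one possible implementation; the function name `pearson_r` and the sample values are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from the formula.

    Undefined (division by zero) if either variable is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical values: debt-to-income ratio and a 0/1 default flag.
dti       = [0.10, 0.15, 0.22, 0.30, 0.35, 0.45]
defaulted = [0,    0,    0,    1,    0,    1]

print(round(pearson_r(dti, defaulted), 3))
```

Pearson's r only measures linear association, so a value near zero does not rule out a strong non-linear relationship.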
Z-score (Standardization)
z = (x − μ) / σ

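Values with |z| > 3 are likely outliers, provided the feature is roughly normal. On heavily skewed features the extreme values themselves inflate μ and σ, so treat the threshold as a flag for inspection rather than a verdict.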
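
A minimal sketch of the |z| > 3 rule on made-up loan amounts. The helper `z_scores` is hypothetical and uses the sample mean and sample standard deviation as estimates of μ and σ.

```python
import statistics

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical loan amounts (thousands): 29 ordinary values and one extreme one.
amounts = [10, 11, 12, 9, 10, 11, 10, 12, 9, 10] * 3
amounts[-1] = 100

flagged = [(v, round(z, 2)) for v, z in zip(amounts, z_scores(amounts)) if abs(z) > 3]
print(flagged)
```

Only the value 100 is flagged. Note that the rule needs enough data to work: with the sample standard deviation, no single point can score above (n − 1)/√n, so in a sample of 10 or fewer values nothing can ever exceed |z| = 3.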

Hypothesis Testing (t-test)
t = (x̄ − μ₀) / (s/√n)  →  compare to t-distribution

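p < 0.05: reject H₀ at the 5% significance level (α = 0.05). The p-value is the probability of seeing a result at least this extreme if H₀ were true; it is not the probability that H₀ is true.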
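
The t statistic can be computed by hand in a few lines. The sketch below uses invented interest rates and a hypothetical helper `one_sample_t`. Because the standard library has no t-distribution, it compares against the tabulated two-sided 5% critical value for 9 degrees of freedom (2.262) rather than computing a p-value.

```python
import math
import statistics

def one_sample_t(sample, mu_0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (x_bar - mu_0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical interest rates (%) for 10 loans; H0: the true mean rate is 5.0.
rates = [5.4, 5.1, 5.8, 5.6, 5.2, 5.9, 5.5, 5.3, 5.7, 5.5]
t, df = one_sample_t(rates, mu_0=5.0)

# Two-sided 5% critical value of the t-distribution with 9 degrees of freedom.
T_CRIT_DF9 = 2.262
print(f"t = {t:.2f}, df = {df}, reject H0: {abs(t) > T_CRIT_DF9}")
```

If SciPy is available, `scipy.stats.ttest_1samp` returns both the statistic and the p-value, which removes the need for a lookup table.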

4

Assumptions & Pitfalls

Correlation ≠ causation. Ice cream sales correlate with drowning rates — both driven by summer.
Simpson's Paradox. A trend in combined data can reverse or vanish once you look at subgroups. Always segment by likely confounders before trusting an aggregate.
Outlier obsession. Don't delete outliers reflexively — they may be the signal, not the noise.
Missing Not At Random (MNAR). If missingness depends on the missing value itself (for example, low earners leaving the income field blank), simple imputation such as mean or median filling will bias your results.
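
Simpson's Paradox is easiest to believe once you see the arithmetic. The sketch below uses invented repayment counts for two hypothetical lending channels, chosen so that the reversal appears: channel A has the higher repayment rate in every segment, yet channel B looks better overall because it handles mostly small, low-risk loans.

```python
# Hypothetical (repaid, total) loan counts for two lending channels,
# split by loan size. The numbers are invented to show the reversal.
data = {
    "channel_A": {"small": (81, 87),   "large": (192, 263)},
    "channel_B": {"small": (234, 270), "large": (55, 80)},
}

def rate(repaid, total):
    """Fraction of loans repaid."""
    return repaid / total

def overall(segments):
    """Pool all segments of one channel into a single (repaid, total) pair."""
    repaid = sum(r for r, _ in segments.values())
    total = sum(t for _, t in segments.values())
    return repaid, total

for channel, segments in data.items():
    per_segment = {name: round(rate(*counts), 2) for name, counts in segments.items()}
    print(channel, per_segment, "overall:", round(rate(*overall(segments)), 2))
```

The segment sizes are the confounder here: pooling hides the fact that the two channels serve very different mixes of loans.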
5

When to Use

EDA is not optional — it's the first step before any modelling task.

Strengths

  • Catches data quality issues early
  • Reveals unexpected patterns
  • Informs feature engineering
  • No model needed — pure understanding

Limitations

  • Time-consuming on large datasets
  • Visualizations can mislead if axes are poor
  • Doesn't replace statistical testing
  • Manual process — hard to automate fully

Finance: loan risk · Healthcare: clinical data · E-commerce: customer behaviour · Any supervised ML pipeline

Key Takeaways

Look Before You Model

Distributions, missing values, and outliers can invalidate any model if not handled first.

Correlation is a Clue

High correlation with the target suggests a useful feature; high correlation between features suggests multicollinearity.

Test Your Intuitions

Hypothesis tests turn "this looks different" into "this difference is statistically significant", with the strength of evidence quantified by a p-value.