Exploratory Data Analysis & Statistics — before any model, understand your data
You've been given a loan dataset. Before predicting defaults, what do you actually know about it? Real-world data is messy — missing values, skewed distributions, hidden correlations.
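A first question to put to a dataset like this is how much of it is actually there. The sketch below counts missing values per column. The records and field names (`loan_amount`, `income`, `term_months`) are invented for illustration and are not taken from any real loan dataset; `missing_fraction` is a hypothetical helper.

```python
# Hypothetical loan records; None marks a missing value. Field names are invented.
records = [
    {"loan_amount": 12000, "income": 45000, "term_months": 36},
    {"loan_amount": 8000, "income": None, "term_months": 60},
    {"loan_amount": None, "income": 52000, "term_months": 36},
    {"loan_amount": 15000, "income": None, "term_months": None},
]


def missing_fraction(rows):
    """Return {column: fraction of rows where the value is None or absent}."""
    columns = sorted({key for row in rows for key in row})
    return {
        col: sum(row.get(col) is None for row in rows) / len(rows)
        for col in columns
    }


report = missing_fraction(records)
for col, frac in report.items():
    print(f"{col}: {frac:.0%} missing")
```

On real tabular data, a dataframe library would usually do the same job in one call; the plain-Python version just makes the counting explicit.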
Think of EDA as a conversation with the data. You ask questions, the data answers with distributions, correlations, and anomalies. A histogram tells you more about a variable than its mean ever could.
Figure: loan amount distribution.
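To illustrate why a histogram can say more than a mean, here is a minimal sketch with two invented samples of loan amounts (in thousands) that share the same mean but have very different shapes. The data, the bin width of 20, and the helper `bin_counts` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical loan amounts (in thousands); both lists are invented.
symmetric = [18, 19, 20, 20, 20, 20, 21, 22]
skewed = [5, 5, 6, 6, 7, 8, 8, 115]


def bin_counts(values, bin_width, n_bins):
    """Count values into equal-width bins that start at 0."""
    counts = [0] * n_bins
    for v in values:
        # Values past the last bin edge are clamped into the last bin.
        index = min(int(v // bin_width), n_bins - 1)
        counts[index] += 1
    return counts


for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name}: mean = {mean(data)}")
    for i, count in enumerate(bin_counts(data, bin_width=20, n_bins=6)):
        low, high = i * 20, (i + 1) * 20 - 1
        print(f"  {low:>3}-{high:<3} {'#' * count}")
```

Both samples have a mean of 20, yet the bin counts show one sample clustered tightly around 20 and the other concentrated below 20 with a single very large loan.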
Three statistical primitives underpin almost everything in EDA:
Mean and (population) variance: μ = (1/n) Σxᵢ | σ² = (1/n) Σ(xᵢ−μ)²
Pearson correlation: r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)²·Σ(yᵢ−ȳ)²]
Z-score: z = (x − μ) / σ
For roughly normal data, any point with |z| > 3 is a likely outlier.
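The mean, variance, and z-score formulas can be applied directly in a few lines. This is a minimal sketch on invented loan amounts (in thousands) with one deliberately extreme value; it uses the 1/n (population) variance from the formula above and the |z| > 3 rule of thumb.

```python
import math

# Hypothetical loan amounts (in thousands); the last value is deliberately extreme.
loans = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14,
         16, 13, 14, 15, 12, 14, 15, 13, 14, 90]

n = len(loans)
mu = sum(loans) / n                               # mean
variance = sum((x - mu) ** 2 for x in loans) / n  # population variance (1/n)
sigma = math.sqrt(variance)                       # standard deviation

# z-score of every point, then keep the points with |z| > 3.
z_scores = [(x - mu) / sigma for x in loans]
outliers = [x for x, z in zip(loans, z_scores) if abs(z) > 3]

print(f"mean = {mu:.2f}, sigma = {sigma:.2f}")
print("outliers:", outliers)
```

Note that the extreme value itself inflates both μ and σ, so in small samples this rule can fail to flag a point that is clearly unusual.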
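One-sample t-test: t = (x̄ − μ₀) / (s/√n), compared against a t-distribution with n − 1 degrees of freedom (s is the sample standard deviation).
p < 0.05: reject H₀ at the 5% significance level.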
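As a worked instance of the t-statistic, the sketch below tests whether an invented sample of 10 loan amounts (in thousands) is consistent with a hypothesized mean μ₀ = 20. The sample and μ₀ are assumptions. Instead of computing a p-value, it compares |t| with the two-sided 5% critical value for 9 degrees of freedom from a standard t-table (2.262), which is equivalent to checking p < 0.05.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of 10 loan amounts (in thousands), invented for illustration.
sample = [22, 25, 19, 24, 27, 23, 26, 21, 25, 28]
mu_0 = 20  # hypothesized population mean under H0 (an assumed value)

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (n - 1 denominator)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided critical value for alpha = 0.05 with n - 1 = 9 degrees of freedom,
# read from a standard t-table.
T_CRIT = 2.262
reject_h0 = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

If SciPy is available, `scipy.stats.ttest_1samp(sample, mu_0)` returns the same statistic together with an exact p-value.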
EDA is not optional — it's the first step before any modelling task.
Skewed distributions, missing values, and outliers can invalidate a model if they are not handled first.
High correlation with the target suggests a useful feature; high correlation between features suggests multicollinearity.
Hypothesis tests turn "this looks different" into "this difference is statistically significant" at a stated significance level.
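The takeaway about correlation can be sketched with the Pearson formula given earlier: correlate each feature with the target, then check feature pairs against each other. The feature names, the values, the target `default_risk`, and the 0.9 cut-off for flagging multicollinearity are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation: covariance term over the product of spread terms."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den


# Hypothetical features and target, invented for illustration.
features = {
    "income": [30, 45, 50, 62, 75, 90],
    "annual_salary": [31, 44, 52, 60, 77, 91],  # nearly duplicates income
    "age": [25, 52, 31, 47, 29, 40],
}
default_risk = [0.9, 0.7, 0.65, 0.5, 0.35, 0.2]

# Correlation of each feature with the target.
target_corr = {name: pearson_r(col, default_risk) for name, col in features.items()}

# Feature pairs whose |r| exceeds an assumed threshold of 0.9.
THRESHOLD = 0.9
collinear = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson_r(features[a], features[b])) > THRESHOLD
]

for name, r in target_corr.items():
    print(f"{name} vs target: r = {r:.2f}")
print("collinear pairs:", collinear)
```

Here `income` and `annual_salary` both correlate strongly with the target, but they also correlate strongly with each other, so keeping both adds little new information.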