Table of Contents#
- QQ-Plots: Quantifying Normality
- Theory and Interpretation
- Implementation in Python
- Box Plots and Violin Plots
- Comparative Analysis
- Use Case Examples
- Kernel Density Estimation (KDE)
- Bandwidth Selection Best Practices
- Multimodal Distribution Analysis
- Statistical Tests for Distribution Fitting
- Kolmogorov-Smirnov Test
- Shapiro-Wilk Test
- Handling Skewed Distributions
- Transformation Techniques (Log, Box-Cox)
- When to Normalize vs. Standardize
- Conclusion
- References
1. QQ-Plots: Quantifying Normality#
Theory and Interpretation#
Quantile-Quantile (QQ) plots visually compare sample quantiles to the theoretical quantiles of a reference distribution (typically normal). If the points fall along a straight diagonal line, the data are consistent with the reference distribution. With theoretical quantiles on the x-axis and sample quantiles on the y-axis, deviations indicate:
- S-Shaped Curve: Tails that are heavier or lighter than the reference distribution
- Concave Curve: Left-skewed data
- Convex Curve: Right-skewed data
Python Implementation#
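import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import statsmodels.api as sm
# Generate right-skewed data
data = np.random.exponential(scale=2, size=1000)
# Create QQ plot; line='s' scales the reference line to the sample's mean and
# standard deviation, so unstandardized data can be compared fairly
sm.qqplot(data, dist=scipy.stats.norm, line='s')
plt.title("QQ Plot vs. Normal Distribution")
plt.show()
Points that deviate from the reference line → Non-normal distribution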
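A QQ plot can also be summarized numerically. The sketch below is one possible approach, not the only one: it uses `scipy.stats.probplot`, which returns the correlation `r` between the ordered sample and the theoretical normal quantiles. The helper name `qq_correlation`, the seed, and the two synthetic samples are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
normal_sample = rng.normal(loc=5.0, scale=2.0, size=1000)
skewed_sample = rng.exponential(scale=2.0, size=1000)


def qq_correlation(sample):
    """Correlation between ordered sample values and normal quantiles."""
    # probplot returns ((theoretical, ordered), (slope, intercept, r))
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample:      r = {r_normal:.4f}")
print(f"exponential sample: r = {r_skewed:.4f}")
```

Values of `r` close to 1 correspond to points that hug the reference line; the skewed sample should score noticeably lower than the normal one.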
2. Box Plots vs. Violin Plots#
Comparative Analysis#
| Feature | Box Plot | Violin Plot |
|---|---|---|
| Outliers | Explicitly shown | Not explicitly marked |
| Density Shape | Not shown | Kernel density estimate |
| Best For | Summary statistics | Distribution shape |
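To make the "Summary statistics" row concrete, here is a minimal sketch of the numbers a box plot is built from, assuming the common 1.5 × IQR whisker rule. The function name `box_plot_stats` and the `scores` list are hypothetical, made up for this illustration.

```python
import numpy as np


def box_plot_stats(values, whisker=1.5):
    """Quartiles, IQR fences, and outliers under the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Points beyond the fences are the ones drawn individually as outliers
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


scores = [55, 61, 64, 68, 70, 72, 75, 78, 81, 99]  # hypothetical exam scores
summary = box_plot_stats(scores)
print(summary)
```

None of these numbers describe where the data are dense between the quartiles, which is the gap a violin plot fills.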
Use Case Example: Comparing Exam Scores#
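import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to have 'exam_type' and 'score' columns
fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
# Box plot
sns.boxplot(x='exam_type', y='score', data=df, ax=ax_box)
# Violin plot with quartile lines drawn inside each violin
sns.violinplot(x='exam_type', y='score', data=df, inner="quartile", ax=ax_violin)
plt.show()
In this example, the violin plot reveals a bimodal distribution in Exam B, which the box plot cannot show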
3. Kernel Density Estimation (KDE)#
Bandwidth Selection Best Practices#
Bandwidth controls smoothness:
- Low bandwidth → Overfitting to noise
- High bandwidth → Oversmoothing peaks
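Rule of thumb: Scott’s and Silverman’s rules both assume roughly Gaussian, unimodal data, so either is a reasonable default in that case. For strongly skewed or multimodal data they tend to oversmooth; prefer a data-driven choice such as cross-validation.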
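As a rough illustration of what these rules compute, the sketch below uses `scipy.stats.gaussian_kde`, whose `bw_method` argument accepts `"scott"` and `"silverman"`. Note that SciPy's definitions are one particular variant of each rule, and other libraries may differ. The sample and seed are assumptions for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only
samples = rng.normal(size=500)

kde_scott = gaussian_kde(samples, bw_method="scott")
kde_silverman = gaussian_kde(samples, bw_method="silverman")

# In SciPy, .factor multiplies the sample standard deviation (ddof=1)
# to give the kernel bandwidth in data units.
sample_std = samples.std(ddof=1)
bw_scott = kde_scott.factor * sample_std
bw_silverman = kde_silverman.factor * sample_std

print(f"Scott:     factor={kde_scott.factor:.4f}, bandwidth={bw_scott:.4f}")
print(f"Silverman: factor={kde_silverman.factor:.4f}, bandwidth={bw_silverman:.4f}")
```

For one-dimensional data the two factors differ by only about 6%, so the choice between them matters far less than checking whether a rule-of-thumb bandwidth suits the data at all.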
Multimodal Analysis#
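import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# samples is assumed to be a 1-D NumPy array of observations
# Fit KDE; the bandwidth is in data units and must be well below the
# spacing between modes (0.4 here), otherwise the peaks merge
kde = KernelDensity(bandwidth=0.05, kernel='gaussian')
kde.fit(samples.reshape(-1, 1))
# Evaluate on a grid; score_samples returns the log-density
x_vals = np.linspace(samples.min(), samples.max(), 1000)
log_prob = kde.score_samples(x_vals.reshape(-1, 1))
plt.plot(x_vals, np.exp(log_prob))
plt.show()
Peaks at x≈0.3 and x≈0.7 indicate a bimodal distribution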
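Reading modes off a plot can be automated. The following is a minimal sketch, assuming synthetic data with modes placed near 0.3 and 0.7 to mirror the example above. It swaps in `scipy.stats.gaussian_kde` for the density estimate and `scipy.signal.find_peaks` to locate local maxima; the prominence threshold of 10% of the maximum density is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)  # fixed seed, for reproducibility only
# Synthetic bimodal sample (an assumption, not the article's data)
samples = np.concatenate([
    rng.normal(0.3, 0.05, size=500),
    rng.normal(0.7, 0.05, size=500),
])

kde = gaussian_kde(samples)  # SciPy's default bandwidth rule
x_vals = np.linspace(samples.min(), samples.max(), 1000)
density = kde(x_vals)

# Keep only peaks that stand out clearly from their surroundings,
# which filters out small wiggles in the tails.
peak_idx, _ = find_peaks(density, prominence=0.1 * density.max())
modes = x_vals[peak_idx]
print("Estimated modes:", np.round(modes, 2))
```

The number of detected modes depends on both the bandwidth and the prominence threshold, so it is worth checking that the result is stable across a few reasonable settings.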
4. Statistical Tests for Distribution Fitting#
Kolmogorov-Smirnov (KS) Test#
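Tests whether a sample comes from a reference distribution:
H₀: The data follow the reference distribution
Python:
from scipy.stats import kstest
# 'norm' alone means the standard normal N(0, 1), so pass the sample's mean and
# standard deviation; estimating them from the data makes the p-value conservative
stat, p = kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(f"KS Statistic: {stat:.3f}, p-value: {p:.4f}")
Shapiro-Wilk Test#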
Specialized for normality testing (more powerful than KS for small samples):
from scipy.stats import shapiro
stat, p = shapiro(data)
print(f"W-statistic: {stat:.3f}, p-value: {p:.4f}")
Interpretation: p < 0.05 → Reject normality hypothesis
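To see both tests side by side, the sketch below runs them on one normal and one exponential sample. The sample sizes, seed, and dictionary layout are assumptions for illustration. The data are standardized before the KS test so that the standard normal is a sensible reference; because the mean and standard deviation come from the same data, the KS p-value is only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
samples = {
    "normal": rng.normal(loc=10.0, scale=3.0, size=200),
    "exponential": rng.exponential(scale=2.0, size=200),
}

results = {}
for name, data in samples.items():
    # Standardize so that 'norm' (mean 0, sd 1) is a fair reference
    z = (data - data.mean()) / data.std(ddof=1)
    ks_stat, ks_p = stats.kstest(z, "norm")
    w_stat, sw_p = stats.shapiro(data)
    results[name] = {"ks_p": ks_p, "shapiro_p": sw_p}
    print(f"{name:12s} KS p={ks_p:.4f}  Shapiro-Wilk p={sw_p:.4f}")
```

For the exponential sample both p-values should fall well below 0.05. A p-value above 0.05 for the normal sample means only that the test found no evidence against normality, not that normality is proven.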
5. Handling Skewed Distributions#
Log Transformation#
Suitable for right-skewed, non-negative data:
import numpy as np
df['log_sales'] = np.log1p(df['sales']) # Use log1p to handle zeros
Caution: Not effective for left-skewed or symmetric distributions.
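A quick way to check whether the transformation helped is to compare skewness before and after. This is a minimal sketch on synthetic lognormal data; the `sales` array, its parameters, and the seed are hypothetical.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)  # fixed seed, for reproducibility only
# Hypothetical right-skewed "sales" values
sales = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

log_sales = np.log1p(sales)  # log(1 + x), defined at x = 0

skew_before = skew(sales)
skew_after = skew(log_sales)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")

# np.expm1 inverts log1p, so results can be mapped back to the original scale
recovered = np.expm1(log_sales)
```

A skewness near zero after the transform suggests the log scale is a good fit; if it stays large or flips strongly negative, a different transformation is probably needed.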
Box-Cox Transformation#
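The λ parameter is estimated by maximum likelihood so that the transformed data are as close to normal as possible:
from scipy.stats import boxcox
# boxcox requires strictly positive input values
df['transformed'], lambda_val = boxcox(df['skewed_column'])
print(f"Optimal lambda: {lambda_val:.2f}")
Normalization vs. Standardization#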
| Technique | Use When | Formula |
|---|---|---|
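| Standardize | Features should be centered with unit variance (e.g., SVMs, PCA, regularized regression) | (x - μ)/σ |
| Normalize | Values must lie in a bounded range such as [0, 1] | (x - min)/(max - min) |
Best Practice: As a starting point, normalize for neural networks and standardize for linear models. Neither technique changes the shape of the distribution, and min-max normalization is sensitive to outliers.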
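The two formulas in the table can be sketched directly in NumPy. The function names and the `values` array (which includes one deliberate outlier) are hypothetical, chosen for illustration.

```python
import numpy as np


def standardize(x):
    """(x - mean) / std: zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def min_max_normalize(x):
    """(x - min) / (max - min): rescales values to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


values = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # 100.0 is an outlier
z = standardize(values)
scaled = min_max_normalize(values)
print("standardized:", np.round(z, 3))
print("normalized:  ", np.round(scaled, 3))
```

With the outlier present, min-max normalization squeezes the first four values close to 0. In practice, the scaling parameters should be computed on the training data only and then reused on validation and test data.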
6. Conclusion#
Understanding data distributions is foundational for effective statistical modeling and machine learning. In this guide, we've advanced beyond basics to:
- Visualize distributions with QQ/violin plots
- Identify non-normality using statistical tests
- Model complex shapes with KDE
- Transform skewed data strategically
Always match techniques to your data’s characteristics and project goals. Remember: EDA is iterative—refine your approach as insights emerge!
References#
- Cleveland, W. S. (1993). Visualizing Data. Hobart Press.
- SciPy Documentation: Statistical Functions
- Hyndman, R. J. (1996). Computing and Graphing Highest Density Regions. The American Statistician.
- Seaborn Gallery: Distribution Plots