codelessgenie blog

Exploring Data Distribution | Set 2: Advanced Techniques and Visualizations

Welcome to Part 2 of our Exploring Data Distribution series! In Set 1, we covered foundational concepts like histograms, measures of central tendency, and standard deviation. This installment dives into advanced techniques for analyzing complex distributions. We'll explore robust visualization tools, statistical tests for distribution identification, and practical strategies for handling skewed data—essential skills for data scientists and analysts working with real-world datasets.


2026-06

Table of Contents#

  1. QQ-Plots: Quantifying Normality
    • Theory and Interpretation
    • Implementation in Python
  2. Box Plots and Violin Plots
    • Comparative Analysis
    • Use Case Examples
  3. Kernel Density Estimation (KDE)
    • Bandwidth Selection Best Practices
    • Multimodal Distribution Analysis
  4. Statistical Tests for Distribution Fitting
    • Kolmogorov-Smirnov Test
    • Shapiro-Wilk Test
  5. Handling Skewed Distributions
    • Transformation Techniques (Log, Box-Cox)
    • When to Normalize vs. Standardize
  6. Conclusion
  7. References

1. QQ-Plots: Quantifying Normality#

Theory and Interpretation#

Quantile-Quantile (QQ) plots compare sample quantiles against the theoretical quantiles of a reference distribution (typically the normal). If the points fall along a straight diagonal line, the data are consistent with the reference distribution. With theoretical quantiles on the x-axis and sample quantiles on the y-axis, systematic deviations indicate:

  • S-Shaped Curve: Tails heavier or lighter than the reference distribution
  • Concave Curve: Left-skewed data
  • Convex Curve: Right-skewed data

Python Implementation#

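import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
 
# Generate right-skewed data
data = np.random.exponential(scale=2, size=1000)
 
# Create QQ plot. fit=True estimates the normal's location and scale from the
# data and standardizes the sample, so the 45-degree line is a valid reference.
sm.qqplot(data, line='45', dist=scipy.stats.norm, fit=True)
plt.title("QQ Plot vs. Normal Distribution")
plt.show()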

Figure: QQ-plot example. The points deviate from the diagonal, so the data are not normally distributed.
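A QQ plot is read by eye, but the straightness of the points can also be summarized with a single number. The sketch below is an illustration, not part of the original walkthrough: it uses scipy.stats.probplot, which returns the correlation coefficient r of the least-squares line through the QQ points. The helper name qq_correlation and the two synthetic samples are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the illustration is repeatable
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.exponential(scale=2.0, size=1000)


def qq_correlation(sample):
    """Correlation between ordered sample values and normal theoretical quantiles."""
    # With fit=True, probplot returns ((theoretical, ordered), (slope, intercept, r)).
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm", fit=True)
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

An r close to 1 means the points hug the line; the skewed sample should score noticeably lower. Treat r as a rough summary rather than a formal test.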


2. Box Plots vs. Violin Plots#

Comparative Analysis#

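| Feature       | Box Plot           | Violin Plot             |
|---------------|--------------------|-------------------------|
| Outliers      | Explicitly shown   | Not explicitly marked   |
| Density Shape | Not shown          | Kernel density estimate |
| Best For      | Summary statistics | Distribution shape      |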

Use Case Example: Comparing Exam Scores#

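import matplotlib.pyplot as plt
import seaborn as sns
 
# df is assumed to be a DataFrame with 'exam_type' and 'score' columns.
# Draw the two plots side by side so they do not overlap on one axis.
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
 
# Box plot
sns.boxplot(x='exam_type', y='score', data=df, ax=ax_box)
 
# Violin plot with quartile lines drawn inside each violin
sns.violinplot(x='exam_type', y='score', data=df, inner="quartile", ax=ax_violin)
plt.show()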

Figure: Box plot vs. violin plot. The violin plot reveals a bimodal distribution in Exam B.
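To make concrete what a box plot "explicitly shows", here is a small sketch of the numbers behind one: the quartiles and the 1.5 × IQR fences that matplotlib and seaborn use by default to flag outliers. The function name box_plot_summary and the scores are made up for illustration.

```python
import numpy as np


def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker fences, and the points a box plot would mark as outliers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical exam scores with one unusually high value
scores = [55, 61, 64, 66, 68, 70, 71, 73, 75, 98]
summary = box_plot_summary(scores)
print(summary)
```

Five numbers and a list of outliers are all the box plot encodes, which is why two very different shapes (one peak or two) can produce nearly identical boxes.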


3. Kernel Density Estimation (KDE)#

Bandwidth Selection Best Practices#

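Bandwidth controls smoothness:

  • Low bandwidth → Overfitting to noise
  • High bandwidth → Oversmoothing peaks

Rule of thumb: Scott's and Silverman's rules both assume roughly Gaussian, unimodal data and give similar bandwidths in that setting. For strongly skewed or multimodal data they tend to oversmooth, so a data-driven choice such as cross-validation is usually safer.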
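To get a feel for how the two named rules compare, the sketch below computes both with scipy.stats.gaussian_kde, a different KDE implementation from the one used later in this section and chosen here only because it exposes both rules by name. Note that SciPy's 'silverman' option is a sample-size-based factor, not the IQR-based variant found in some textbooks. The sample is synthetic.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = rng.normal(size=500)  # synthetic, roughly Gaussian data

kde_scott = gaussian_kde(sample, bw_method="scott")
kde_silverman = gaussian_kde(sample, bw_method="silverman")

# gaussian_kde stores a unitless factor; multiplying it by the sample
# standard deviation gives the kernel bandwidth in data units.
sample_std = sample.std(ddof=1)
h_scott = kde_scott.factor * sample_std
h_silverman = kde_silverman.factor * sample_std

print(f"Scott:     factor={kde_scott.factor:.4f}, bandwidth={h_scott:.4f}")
print(f"Silverman: factor={kde_silverman.factor:.4f}, bandwidth={h_silverman:.4f}")
```

For one-dimensional data the two factors differ only by a constant of about 6%, so in SciPy the choice between them rarely changes the picture much.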

Multimodal Analysis#

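import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
 
# samples is assumed to be a 1-D NumPy array of observations.
# Fit KDE. The bandwidth must be well below the spacing between modes,
# otherwise neighboring peaks merge into one.
kde = KernelDensity(bandwidth=0.05, kernel='gaussian')
kde.fit(samples.reshape(-1, 1))
 
# Evaluate the density on a grid and plot it
x_vals = np.linspace(samples.min(), samples.max(), 1000)
log_prob = kde.score_samples(x_vals.reshape(-1, 1))  # log-density values
plt.plot(x_vals, np.exp(log_prob))
plt.show()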

Peaks at x≈0.3 and x≈0.7 indicate bimodal distribution
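Reading modes off a plot works, but they can also be located programmatically. The following sketch is an assumption-laden illustration: it generates a synthetic bimodal sample centered at 0.3 and 0.7, estimates the density with scipy.stats.gaussian_kde (default Scott bandwidth), and finds local maxima with scipy.signal.find_peaks. The prominence threshold of 25% of the tallest peak is an arbitrary choice to ignore small bumps in the tails.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
# Synthetic bimodal sample: two equal-sized clusters around 0.3 and 0.7
samples = np.concatenate([
    rng.normal(loc=0.3, scale=0.05, size=500),
    rng.normal(loc=0.7, scale=0.05, size=500),
])

kde = gaussian_kde(samples)  # default bandwidth rule (Scott)
x_vals = np.linspace(samples.min(), samples.max(), 1000)
density = kde(x_vals)

# Keep only peaks that stand out clearly from their surroundings
peak_idx, _ = find_peaks(density, prominence=0.25 * density.max())
modes = x_vals[peak_idx]
print("Estimated modes:", np.round(modes, 2))
```

The number of detected modes depends on both the bandwidth and the prominence threshold, so it is worth checking that the result is stable across a few reasonable settings.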


4. Statistical Tests for Distribution Fitting#

Kolmogorov-Smirnov (KS) Test#

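Tests whether a sample comes from a fully specified reference distribution:
H₀: The data follow the reference distribution
Python:

from scipy.stats import kstest
# 'norm' with no arguments means the standard normal N(0, 1), so pass the
# reference mean and standard deviation explicitly. Estimating them from the
# same sample, as done here, makes the p-value conservative.
stat, p = kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(f"KS Statistic: {stat:.3f}, p-value: {p:.4f}")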
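A common pitfall with this test is forgetting that the string 'norm' on its own refers to the standard normal, with mean 0 and standard deviation 1. The sketch below illustrates the effect on synthetic data that really is normal, just not standard normal. The parameters 50 and 10 are arbitrary values chosen for the example.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(7)
data = rng.normal(loc=50.0, scale=10.0, size=500)  # normal, but not N(0, 1)

# Compared against N(0, 1): rejected, even though the data are normal
stat_std, p_std = kstest(data, "norm")

# Compared against the distribution the data were actually drawn from
stat_ref, p_ref = kstest(data, "norm", args=(50.0, 10.0))

print(f"vs N(0, 1):   statistic={stat_std:.3f}, p-value={p_std:.4g}")
print(f"vs N(50, 10): statistic={stat_ref:.3f}, p-value={p_ref:.4g}")
```

The first comparison fails only because of location and scale, which is why the reference parameters should always be stated explicitly.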

Shapiro-Wilk Test#

Specialized for normality testing (more powerful than KS for small samples):

from scipy.stats import shapiro
stat, p = shapiro(data)
print(f"W-statistic: {stat:.3f}, p-value: {p:.4f}")

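Interpretation: p < 0.05 → Reject the normality hypothesis. A larger p-value means the test found no evidence against normality, not that the data are proven normal.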
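The decision rule can be wrapped in a small helper so the threshold is stated in one place. This is a sketch only: the function name is_normality_rejected and the default alpha of 0.05 are illustrative choices, and the sample is synthetic.

```python
import numpy as np
from scipy.stats import shapiro


def is_normality_rejected(sample, alpha=0.05):
    """Run Shapiro-Wilk and report whether normality is rejected at level alpha."""
    w_stat, p_value = shapiro(sample)
    return bool(p_value < alpha), w_stat, p_value


rng = np.random.default_rng(3)
skewed = rng.exponential(scale=2.0, size=200)  # clearly non-normal

rejected, w_stat, p_value = is_normality_rejected(skewed)
print(f"W={w_stat:.3f}, p={p_value:.3g}, reject normality: {rejected}")
```

With very large samples the test flags even tiny, practically irrelevant departures from normality, so it is best read alongside a QQ plot.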


5. Handling Skewed Distributions#

Log Transformation#

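Compresses the long right tail of right-skewed, non-negative data:

df['log_sales'] = np.log1p(df['sales'])  # Use log1p to handle zeros

Caution: A log transform makes left-skewed data more skewed and introduces left skew into symmetric data, so reserve it for right-skewed variables. log1p is also undefined for values at or below -1.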
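Whether the transform helped can be checked by comparing sample skewness before and after. The sketch below uses scipy.stats.skew on synthetic log-normal "sales" values; the variable names and distribution parameters are assumptions for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(11)
sales = rng.lognormal(mean=3.0, sigma=1.0, size=5000)  # synthetic, right-skewed

skew_before = skew(sales)
skew_after = skew(np.log1p(sales))

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

A skewness near 0 after the transform suggests a roughly symmetric result; a value that moves further from 0 is a sign the transform was the wrong choice for that variable.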

Box-Cox Transformation#

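Box-Cox is a power transform whose parameter λ is estimated from the data (by maximum likelihood in SciPy) so that the result is as close to normal as possible. The input must be strictly positive: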

from scipy.stats import boxcox
df['transformed'], lambda_val = boxcox(df['skewed_column'])
print(f"Optimal lambda: {lambda_val:.2f}")
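Because predictions made on the transformed scale usually need to be mapped back, it helps to keep λ and invert the transform later. Here is a minimal sketch, assuming synthetic strictly positive data, that pairs scipy.stats.boxcox with scipy.special.inv_boxcox.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

rng = np.random.default_rng(5)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # strictly positive

# Forward transform: returns the transformed values and the fitted lambda
transformed, lambda_val = boxcox(skewed)

# Inverse transform: recovers the original scale using the same lambda
restored = inv_boxcox(transformed, lambda_val)

print(f"Optimal lambda: {lambda_val:.2f}")
print("Round trip matches:", np.allclose(restored, skewed))
```

For log-normal data the fitted λ should land near 0, where Box-Cox reduces to a plain log transform.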

Normalization vs. Standardization#

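| Technique   | Use When                                                                                   | Formula               |
|-------------|--------------------------------------------------------------------------------------------|-----------------------|
| Standardize | The algorithm is sensitive to feature scale or expects zero-mean, unit-variance inputs     | (x - μ)/σ             |
| Normalize   | Values must fall in a fixed, bounded range such as [0, 1]                                  | (x - min)/(max - min) |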

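Best Practice: As a starting point, normalize when inputs must lie in a bounded range (common for neural networks) and standardize for linear models and other scale-sensitive methods. Neither operation changes the shape of the distribution, so skew must be handled separately.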
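Both formulas are one-liners in NumPy. The sketch below implements them directly so the arithmetic is visible; the function names standardize and normalize and the sample values are illustrative.

```python
import numpy as np


def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def normalize(x):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
z = standardize(values)
u = normalize(values)

print("standardized:", np.round(z, 3))
print("normalized:  ", u)
```

Both functions divide by zero on constant input, and min-max scaling is sensitive to outliers because a single extreme value sets the range.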


6. Conclusion#

Understanding data distributions is foundational for effective statistical modeling and machine learning. In this guide, we've advanced beyond basics to:

  • Visualize distributions with QQ/violin plots
  • Identify non-normality using statistical tests
  • Model complex shapes with KDE
  • Transform skewed data strategically

Always match techniques to your data’s characteristics and project goals. Remember: EDA is iterative—refine your approach as insights emerge!


References#

  1. Cleveland, W. S. (1993). Visualizing Data. Hobart Press.
  2. SciPy Documentation: Statistical Functions
  3. Hyndman, R. J. (1996). Computing and Graphing Highest Density Regions. The American Statistician.
  4. Seaborn Gallery: Distribution Plots