Choosing the Right Probability Distribution
A structured decision framework for selecting probability distributions based on data characteristics, domain knowledge, and modeling goals.
Start With Variable Type and Support
The first split is discrete vs continuous. Then consider the support: is it bounded (like [0, 1] for proportions), non-negative (like lifetimes), or the full real line (like measurement errors)?
Support constraints immediately rule out many candidates. A normal model for a proportion places probability outside [0, 1], so it needs truncation or a bounded alternative, and a count variable should not use a continuous model without justification.
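As a rough sketch of this first split, the lookup below maps a variable type and support to candidate families. The function name `candidate_families`, the category labels, and the inclusion of the beta and uniform (which this guide does not otherwise cover) are illustrative assumptions, and the table is not exhaustive.

```python
def candidate_families(kind: str, support: str) -> list[str]:
    """Return candidate families for a variable type and support.

    kind: "discrete" or "continuous"
    support: "bounded", "non-negative", or "real-line"
    The mapping is an illustrative assumption, not an exhaustive catalog.
    """
    table = {
        ("continuous", "bounded"): ["beta", "uniform"],
        ("continuous", "non-negative"): ["exponential", "gamma", "Weibull", "lognormal"],
        ("continuous", "real-line"): ["normal", "Student t", "Cauchy"],
        ("discrete", "bounded"): ["binomial"],
        ("discrete", "non-negative"): ["Poisson", "negative binomial", "geometric"],
    }
    # Unknown combinations return an empty list rather than a guess.
    return table.get((kind, support), [])


print(candidate_families("continuous", "bounded"))
print(candidate_families("discrete", "non-negative"))
```

A lookup like this only narrows the field; the later sections decide among the remaining candidates.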
Symmetric vs Skewed
If your data are roughly symmetric and unbounded, the normal distribution is the natural starting point. For heavy-tailed symmetric data, consider the Student t or Cauchy distributions.
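Right-skewed positive data suggest exponential, gamma, Weibull, or lognormal. Left-skewed data are rarer but can sometimes be modeled by reflecting the data (for example, working with c - x for a constant c at or above the maximum) and fitting a right-skewed distribution to the result.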
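One way to check the direction of skew, and to see what reflection does, is to compute the moment-based sample skewness. The sketch below uses only the standard library; the helper names `sample_skewness` and `reflect` and the toy data are assumptions made for illustration.

```python
import statistics


def sample_skewness(xs):
    """Moment-based skewness m3 / m2**1.5, using divide-by-n central moments."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def reflect(xs, pivot=None):
    """Reflect data about a pivot (default: the maximum), flipping the skew."""
    if pivot is None:
        pivot = max(xs)
    return [pivot - x for x in xs]


left_skewed = [1.0, 6.0, 8.0, 9.0, 9.5, 10.0]  # toy data with a long left tail
print(sample_skewness(left_skewed))            # negative
print(sample_skewness(reflect(left_skewed)))   # same magnitude, positive
```

Reflecting about the maximum produces a zero in the reflected data, so a pivot slightly above the maximum may be needed before fitting a family that requires strictly positive values, such as the lognormal.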
Count Data Decision Tree
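The number of successes in a fixed number of independent binary trials, each with the same success probability, follows the binomial. Counts of rare events with no fixed upper bound suggest the Poisson, whose variance equals its mean. If the sample variance clearly exceeds the mean (overdispersion), the negative binomial is a better choice.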
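The overdispersion check can be sketched as a variance-to-mean ratio. The function names and the cutoff of 1.5 below are arbitrary illustrative assumptions, not a standard threshold; a formal overdispersion test would be more rigorous.

```python
import statistics


def dispersion_index(counts):
    """Sample variance divided by sample mean; near 1 is consistent with Poisson."""
    return statistics.variance(counts) / statistics.fmean(counts)


def is_overdispersed(counts, cutoff=1.5):
    """Rule-of-thumb flag; the cutoff is an assumption chosen for illustration."""
    return dispersion_index(counts) > cutoff


bursty = [0, 1, 0, 2, 9, 0, 1, 12, 0, 3]  # toy counts with occasional bursts
steady = [2, 3, 2, 4, 3, 1, 3, 2]         # toy counts with little spread

print(dispersion_index(bursty), is_overdispersed(bursty))
print(dispersion_index(steady), is_overdispersed(steady))
```

With small samples the ratio is noisy, so a value modestly above 1 is weak evidence on its own.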
The geometric distribution models the number of trials until the first success and is the special case of the negative binomial in which the required number of successes is 1.
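The special-case relationship can be verified numerically. This sketch assumes the "number of trials" parameterization for both distributions (some references count failures instead), and the function names are illustrative.

```python
from math import comb, isclose


def neg_binomial_pmf(n, r, p):
    """P(the r-th success occurs on trial n), for n >= r."""
    return comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)


def geometric_pmf(n, p):
    """P(the first success occurs on trial n), for n >= 1."""
    return (1 - p) ** (n - 1) * p


p = 0.3
for n in range(1, 6):
    # With r = 1 the binomial coefficient is 1 and the two formulas coincide.
    print(n, geometric_pmf(n, p), neg_binomial_pmf(n, 1, p))
```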
Lifetime and Waiting Time Models
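For a constant failure rate, use the exponential. If the failure rate changes over time, the Weibull is the most common first choice: its shape parameter gives a decreasing failure rate below 1, a constant rate at 1 (the exponential), and an increasing rate above 1. The gamma generalizes the exponential to aggregate waiting times: the sum of k independent exponential waits with a common rate is gamma distributed with shape k.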
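To make the link between the Weibull shape parameter and the failure rate concrete, the sketch below evaluates the Weibull hazard function h(t) = (k / s) * (t / s)^(k - 1), with shape k and scale s. The function name and the example values are illustrative assumptions.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard (instantaneous failure rate) at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# shape = 1: constant hazard, i.e. the exponential with rate 1 / scale
print(weibull_hazard(1.0, 1.0, 2.0), weibull_hazard(5.0, 1.0, 2.0))

# shape > 1: hazard increases with time (wear-out)
print(weibull_hazard(1.0, 2.0), weibull_hazard(2.0, 2.0))

# shape < 1: hazard decreases with time (early failures dominate)
print(weibull_hazard(1.0, 0.5), weibull_hazard(4.0, 0.5))
```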
The lognormal is appropriate when failure results from multiplicative degradation, and the Pareto models heavy-tailed phenomena like income and file sizes.
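The multiplicative argument for the lognormal can be illustrated with a small simulation: a product of many independent positive factors is a sum on the log scale, so the logs tend toward a normal shape. The degradation model below (50 steps, each multiplying strength by a uniform factor between 0.8 and 1.0) is a hypothetical setup chosen only for illustration.

```python
import math
import random


def skewness(xs):
    """Moment-based skewness m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable


def degraded_strength(steps=50):
    """Hypothetical unit whose strength shrinks by a random fraction each step."""
    strength = 1.0
    for _ in range(steps):
        strength *= rng.uniform(0.8, 1.0)
    return strength


samples = [degraded_strength() for _ in range(2000)]
logs = [math.log(s) for s in samples]

print(skewness(samples))  # right-skewed on the original scale
print(skewness(logs))     # much closer to zero on the log scale
```

A near-symmetric log scale is consistent with a lognormal model but does not prove it; the validation steps in this guide still apply.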
Validating Your Choice
After selecting a candidate, fit the parameters and check the fit visually by overlaying the fitted density on a histogram of the data. QQ plots reveal tail deviations that density overlays can miss.
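The points behind a normal QQ plot can be computed without a plotting library. The sketch below assumes a normal candidate fitted by the sample mean and standard deviation, and uses the plotting positions (i - 0.5) / n, which is one common convention among several; the function name and data are illustrative.

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(data):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(fitted.inv_cdf(p), x) for p, x in zip(probs, xs)]


# Toy data with one large value in the right tail
data = [9.1, 9.8, 10.0, 10.2, 10.3, 10.5, 10.9, 11.2, 11.4, 16.0]
for theoretical, observed in normal_qq_points(data):
    print(f"{theoretical:7.2f} {observed:7.2f}")
```

Points that fall close to the line observed = theoretical indicate agreement; here the largest observation sits well above its theoretical quantile, the kind of tail deviation a histogram overlay can hide.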
Formal tests like Kolmogorov-Smirnov or Anderson-Darling quantify the discrepancy, but with large samples they flag deviations too small to matter in practice. Prioritize practical fit for your use case over p-values.
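The sample-size sensitivity can be seen from the Kolmogorov-Smirnov statistic itself. The sketch below computes the statistic by hand and compares it with the asymptotic 5% critical value, approximately 1.358 / sqrt(n), which applies when the candidate distribution is fully specified in advance; when parameters are estimated from the same data, that threshold is no longer exact. The function names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_critical_5pct(n):
    """Asymptotic 5% critical value for a fully specified distribution."""
    return 1.358 / math.sqrt(n)


sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]  # toy data
print(ks_statistic(sample, NormalDist().cdf))

# The same gap of 0.02 passes at n = 1,000 but is rejected at n = 10,000.
for n in (1_000, 10_000):
    print(n, ks_critical_5pct(n), 0.02 > ks_critical_5pct(n))
```

A maximum CDF gap of 0.02 is often negligible in practice, yet it becomes statistically significant once the sample is large enough, which is why the practical fit deserves more weight than the p-value.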