Probability distributions are prevalent in many fields, including computer science, finance, astronomy, and economics. In this blog we are going to look at the probability distributions most commonly used in Machine Learning and their properties, and discuss the key statistical ideas behind some simple models.
Note that all the discussion in this blog assumes the data points are independent and identically distributed (i.i.d.).
“A probability distribution is a statistical function that describes all the possible values a random variable can take within a given range, together with their likelihoods.” Always remember that choosing an appropriate distribution is itself a model-selection problem.

Discrete Probability Distributions
A discrete probability distribution lists all possible values of a discrete random variable together with their probabilities.
Some examples of well-known discrete probability distributions used in Machine Learning are:
- Bernoulli distribution.
- Poisson distribution.
- Uniform distribution.
- Multinomial distribution.
Bernoulli Distribution
The Bernoulli distribution is widely used to model binary (two-category) outcomes. Logistic regression, which models the probability of a binary label, is the best-known example of the Bernoulli distribution in Machine Learning.
Let's say you toss a coin and record the outcome c ∈ {0, 1}, with c = 1 representing heads and c = 0 representing tails.
Let the probability of heads be denoted by μ, so that:
P(c = 1 | μ) = μ, where 0 ≤ μ ≤ 1
So, P(c = 0 | μ) = 1 − μ
The probability distribution for a single coin flip can therefore be written as:
Bern(c | μ) = μ^c (1 − μ)^(1 − c)
This is known as the Bernoulli distribution, and from this equation it is easy to check that it is normalized: the probabilities of c = 1 and c = 0 sum to one. Now, suppose we have a dataset X = {c_1, c_2, c_3, …, c_N} of observed values of c. Because the observations are i.i.d., the likelihood, which is a function of μ, is given as:
P(X | μ) = ∏_n P(c_n | μ) = ∏_n μ^(c_n) (1 − μ)^(1 − c_n)
where the product runs over all N observations c_1, c_2, …, c_N.
Now we can estimate a value for μ by maximizing the logarithm of the likelihood, which is given by:
ln P(X | μ) = Σ_n ln P(c_n | μ) = Σ_n { c_n ln μ + (1 − c_n) ln(1 − μ) }
Setting the derivative with respect to μ to zero gives the maximum-likelihood estimate μ_ML = (1/N) Σ_n c_n, which is simply the sample mean (the fraction of heads).
If you recall, this is (up to a sign) the same expression used in logistic regression as the error function, namely the binary cross-entropy loss.
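To make this concrete, here is a minimal NumPy sketch (the data and function names are illustrative, not from the original post) that computes the Bernoulli log-likelihood and the maximum-likelihood estimate of μ:

```python
import numpy as np

def bernoulli_log_likelihood(c, mu):
    """Log-likelihood of i.i.d. binary observations c under Bern(c | mu)."""
    c = np.asarray(c, dtype=float)
    return np.sum(c * np.log(mu) + (1 - c) * np.log(1 - mu))

# Illustrative coin-flip data: 1 = heads, 0 = tails.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# The maximum-likelihood estimate of mu is simply the sample mean.
mu_ml = flips.mean()

print("MLE of mu:", mu_ml)
print("Log-likelihood at the MLE:", bernoulli_log_likelihood(flips, mu_ml))
```

Negating this log-likelihood and averaging over the data gives exactly the binary cross-entropy loss minimized in logistic regression.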
Poisson Distribution
The Poisson distribution is a measure of how many times an event is likely to occur within a given period of time.
It's handy because it is simple to use and applies to a huge range of situations: many interesting processes boil down to “events that happen in time or space.” Suppose we have historically observed E events over a total time T, so that the event rate is E / T. The probability of observing exactly k events in a period of length X is then:
P(k events in period X) = e^(−(E / T) · X) · ((E / T) · X)^k / k!
Let λ = (E / T) · X.
Then, P(k events in period X) = e^(−λ) · λ^k / k!
where E / T is the historical event rate, X is the length of the time interval of interest, and λ is the rate parameter, i.e. the expected number of events in that interval.
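As a quick illustrative sketch (the rate and window below are made up for the example), here is how these Poisson probabilities can be computed in Python:

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Example: 30 events (E) observed over 10 hours (T),
# and we care about a 2-hour window (X).
E, T, X = 30, 10, 2
lam = (E / T) * X  # expected number of events in the window

for k in range(5):
    print(f"P({k} events in {X} hours) = {poisson_pmf(k, lam):.4f}")
```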
Continuous Probability Distributions
A continuous probability distribution describes all the possible values of a continuous random variable together with their probability density.
Some examples of well-known continuous probability distributions used in Machine Learning are:
- Normal or Gaussian distribution.
- Power-law distribution.
- Pareto distribution.
Gaussian Distribution
The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In Gaussian-distributed data most observations lie close to the mean, and values far from the mean become increasingly rare. The central limit theorem explains why this shape shows up so often: the sum (or average) of a large number of independent random variables tends towards a Gaussian. In the case of a single variable x, the Gaussian distribution is given as:

N(x | μ, σ²) = (1 / √(2πσ²)) · e^(−(x − μ)² / (2σ²))

where μ is the mean and σ² is the variance.

For a single real variable with a given mean and variance, the distribution that maximizes the entropy is the Gaussian. The Gaussian also arises when we consider the sum of multiple random variables. Gaussians are used in regression models (for example as the noise model on the targets) as well as in dimensionality reduction, by identifying which directions of a dataset have the largest variance (as in PCA).
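Here is a small illustrative sketch (the height data below is synthetic) of the univariate Gaussian density in Python, together with the maximum-likelihood estimates of its parameters, which are simply the sample mean and variance:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of the univariate Gaussian N(x | mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Synthetic data: heights (cm) drawn from a Gaussian for illustration.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=1000)

# Maximum-likelihood estimates of the parameters.
mu_ml = heights.mean()
sigma2_ml = heights.var()  # ML estimate divides by N, not N - 1

print("Estimated mean:", mu_ml)
print("Estimated variance:", sigma2_ml)
print("Density at the mean:", gaussian_pdf(mu_ml, mu_ml, sigma2_ml))
```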
Lots of phenomena in nature approximately follow the Gaussian distribution, such as our height, weight, and blood pressure. Hence it is a widely used distribution and a favorite of many data scientists. 😉
I hope this article helped you in your data science journey. Was it clear enough? If you have any doubts, or want to see more articles on distributions, please do write in the comments section.