When we analyze data, it is important to have a good understanding of statistical distributions. Here, we will discuss how to work with some standard statistical distributions in R.

The concept of a probability distribution

Suppose I toss a coin (with two faces); how do we summarize what the result is going to be? The outcome of a coin toss can be Heads (H) or Tails (T), and the result depends on the “likelihood of getting Heads”, or the “probability of Heads”.

A probability distribution consists of two parts:

  1. A set of possible outcomes (e.g., Heads and Tails for a coin toss)
  2. The likelihoods of those outcomes (P(Heads) = 0.5, P(Tails) = 0.5)

Note that the coin or die need not be fair! In that case, P(Heads) could be 0.25, for example.

We can easily simulate a coin toss in R as follows.

# Tossing a Fair Coin
X<-sample(c(0,1), size=1, replace=TRUE, prob=c(0.5, 0.5))
print(X)
## [1] 1
# Tossing a Biased Coin
X<-sample(c(0,1), size=1, replace=TRUE, prob=c(0.75, 0.25))
print(X)
## [1] 0

One may repeat this multiple times and check that a 1 appears approximately 50% of the time for the fair coin (and 25% of the time for the biased coin).
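This claim can be checked directly by simulating many tosses at once and looking at the proportion of 1s (a quick sketch; the seed is fixed only for reproducibility):

```r
# Toss the fair coin 10000 times and compute the proportion of 1s
set.seed(1)  # for reproducibility
tosses <- sample(c(0, 1), size = 10000, replace = TRUE, prob = c(0.5, 0.5))
print(mean(tosses))  # should be close to 0.5
```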

Another useful and standard example is a fair six-faced die, where

  1. Set of possible outcomes = {1, 2, 3, 4, 5, 6}

  2. P(1) = P(2) = … = P(6) = 1/6

Of course, the die can be biased as well. Here’s how to simulate a biased die thrown, say, 15 times.

X2<- sample(c(1,2,3,4,5,6), size=15, replace=TRUE, prob=c(0.05,.2,.25,.3,.15,.05))
print(X2)
##  [1] 3 4 3 1 2 2 4 2 3 5 3 5 4 3 2

Let us check whether a large number of throws results in relative frequencies that are approximately the probabilities used to simulate the throws.

 X3<- sample(c(1,2,3,4,5,6), size=500, replace=TRUE, prob=c(0.05,.2,.25,.3,.15,.05))

Freq<- table(X3)

print(Freq)
## X3
##   1   2   3   4   5   6 
##  24 100 118 155  84  19
print(Freq/sum(Freq))
## X3
##     1     2     3     4     5     6 
## 0.048 0.200 0.236 0.310 0.168 0.038
plot(Freq/sum(Freq), ylab="probability or relative frequency", xlab="dice outcome", main="Frequency distribution of 500 dice throws")

Of course, if we increase the number of throws, the relative frequencies in the data get closer to the actual probabilities of the die.

 X4<- sample(c(1,2,3,4,5,6), size=5000, replace=TRUE, prob=c(0.05,.2,.25,.3,.15,.05))

Freq<- table(X4)

print(Freq)
## X4
##    1    2    3    4    5    6 
##  251 1046 1243 1454  762  244
print(Freq/sum(Freq))
## X4
##      1      2      3      4      5      6 
## 0.0502 0.2092 0.2486 0.2908 0.1524 0.0488
plot(Freq/sum(Freq), ylab="probability or relative frequency", xlab="dice outcome", main="Frequency distribution of 5000 dice throws")

Next, let us look at some commonly used probability distributions.

Binomial Distribution

A Binomial random variable is defined as the number of heads resulting from \(n\) independent tosses of a coin whose probability of heads is \(p\). We denote this by \(Binom(n, p)\). The Binomial distribution can be understood as follows:

  1. Possible outcomes: {\(0, 1, 2, \ldots, n\)}

  2. Likelihood of outcomes: mathematically,

\[ Prob(k) = {n \choose k} p^k (1-p)^{n-k}, ~~ k \in \{0, 1, 2, \ldots, n \}\]

We can compute these probabilities easily as follows. Let us take \(n=20\) and \(p=0.2\)

 n<- 20
 p<- 0.2
 K<- seq(from=0, to=n, by=1)
 
 prob_K<- dbinom(K, n, p)
 
 freq_table<- data.frame(Possibilities=K, probability=prob_K)
 print(round(freq_table,3))
##    Possibilities probability
## 1             0       0.012
## 2             1       0.058
## 3             2       0.137
## 4             3       0.205
## 5             4       0.218
## 6             5       0.175
## 7             6       0.109
## 8             7       0.055
## 9             8       0.022
## 10            9       0.007
## 11           10       0.002
## 12           11       0.000
## 13           12       0.000
## 14           13       0.000
## 15           14       0.000
## 16           15       0.000
## 17           16       0.000
## 18           17       0.000
## 19           18       0.000
## 20           19       0.000
## 21           20       0.000
 plot(K, prob_K, ylab="probability", xlab="possibilities", col="blue")

Usually, when we work with probability distributions, we are interested in the mean (or expected value), the variance, and the cumulative distribution function (CDF). These computations are illustrated below.

# Computing Mean or Expected Value

Mean<- sum(freq_table[,1]*freq_table[,2])
print(Mean)
## [1] 4
# Variance

Variance<- sum((freq_table[,1]-Mean)^2*freq_table[,2])
print(Variance)
## [1] 3.2
# Standard Deviation
print(sqrt(Variance))
## [1] 1.788854
#CDF  P(X<=10) for Binom(20,.2)

print(pbinom(10, 20 , 0.2))
## [1] 0.9994366
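As a sanity check, the mean and variance computed above agree with the closed-form expressions for a binomial, \(E[X] = np\) and \(Var(X) = np(1-p)\):

```r
n <- 20
p <- 0.2
print(n * p)            # theoretical mean: 4
print(n * p * (1 - p))  # theoretical variance: 3.2
```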

We can also simulate from a binomial distribution as follows. Note that this is essentially like throwing an \((n+1)\)-faced die with the given probabilities, the faces being the outcomes \(0, 1, \ldots, n\). Generating 25 outcomes from a binomial with \(n=20\) and \(p=0.2\) means repeating the experiment “toss a coin \(n\) times and count the number of heads” 25 times.

# Generating 25 outcomes

X5<- rbinom(25, 20, 0.2 )

print(X5)
##  [1] 2 6 6 3 2 3 1 3 4 8 6 3 8 3 5 6 3 3 5 4 5 2 5 4 3
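To see the connection with coin tossing directly, a single binomial draw can equivalently be generated by tossing the coin \(n\) times with sample() and counting the heads (a sketch; the seed is arbitrary):

```r
set.seed(7)  # for reproducibility
n <- 20
p <- 0.2
# One draw from Binom(20, 0.2) = number of heads in 20 tosses of a p = 0.2 coin
tosses <- sample(c(0, 1), size = n, replace = TRUE, prob = c(1 - p, p))
print(sum(tosses))
```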

Of course, if we repeat a large number of times, the resulting data distribution resembles the underlying binomial probabilities. See how the bars based on the data and the blue circles based on the actual probabilities almost match in the graph below.

set.seed(19)
X6<- rbinom(1000, 20, 0.2 )


Freq<- table(X6)

plot(Freq/sum(Freq), xlab="Possibilities for number of heads in 20 tosses", ylab="relative frequency")
points(K, prob_K, col="blue")

The Normal distribution

Perhaps the most commonly encountered probability distribution is the Normal distribution. The Binomial random variable gives what is called a “discrete distribution”, since it can take only discrete or countable values; a Normal random variable, however, can take continuous values. The Normal distribution is popularly referred to as the bell curve. The plot of the “probability density function” (pdf) of a Normal random variable with mean \(\mu=15\) and standard deviation \(\sigma=2\) is given below.

x<-seq(5, 25, length=1000)
plot(x, dnorm(x,15,2), type='l',xlim=c(4,25),main="Normal Density Curve", xlab="X", ylab="probability density function (pdf)")

Unlike in discrete distributions, the pdf is not itself a probability, but it is closely related to probability in the following sense: the area under the curve over a given range on the x-axis is the probability that the value lies within that range. For example, \(P(10< X \leq 14)=\) the shaded area under the curve shown below.

plot(x, dnorm(x,15,2), type='l',xlim=c(4,25),main="Normal Density Curve", xlab="X", ylab="probability density function (pdf)")

cord.x <- c(10,seq(10,14,0.01), 14)
cord.y <- c(0,dnorm(seq(10,14,0.01),15,2),0)
polygon(cord.x,cord.y,col='skyblue')

We can compute this probability value as follows.

print(pnorm(14, mean=15, sd=2) - pnorm(10, mean=15, sd=2))
## [1] 0.3023279
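Here pnorm() is the CDF; its inverse is qnorm(), which returns the value below which the variable falls with a given probability. A quick sketch of the pair, using the symmetry of the Normal around its mean:

```r
# P(X <= 15) is exactly 0.5, since 15 is the mean
print(pnorm(15, mean = 15, sd = 2))
# qnorm() inverts pnorm(): the median of the distribution is the mean, 15
print(qnorm(0.5, mean = 15, sd = 2))
```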

One can also simulate from Normal distribution. Shown below is the histogram for simulated data and how it compares to the actual pdf.

x_sim<- rnorm(1000, mean=15, sd=2)

hist(x_sim, freq=FALSE, xlab="X", ylab="density", main="Simulated Data distribution vs. actual pdf")
lines(density(x_sim), col="red")  # Kernel density: a continuous version of histogram
lines(x, dnorm(x,15,2), col="green")
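As a further check on the simulation, the sample mean and sample standard deviation of the simulated data should be close to the true parameters \(\mu=15\) and \(\sigma=2\) (a quick sketch, reseeding for reproducibility):

```r
set.seed(42)  # for reproducibility
x_sim <- rnorm(1000, mean = 15, sd = 2)
print(mean(x_sim))  # close to 15
print(sd(x_sim))    # close to 2
```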

The above is an illustration of how we can work with statistical distributions in R. We considered the Binomial and Normal distributions. Similarly, one can find functions for other statistical distributions such as the Poisson, Gamma, Exponential, Beta, Weibull, etc.
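These all follow the same d/p/q/r naming pattern (density, CDF, quantile, random generation). For instance, for the Poisson distribution (a brief sketch, using an arbitrary rate of λ = 3):

```r
lambda <- 3
print(dpois(2, lambda))   # P(X = 2)
print(ppois(4, lambda))   # P(X <= 4)
set.seed(3)               # for reproducibility
print(rpois(10, lambda))  # 10 simulated counts
```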