Here, we will learn how to implement Bayesian Linear Regression in R. Let us take the dataset pbc.vote available in the BaM package. The dataset can be used to study the relationship between the number of bad ballots cast and variables such as voting technology and the size of the voting population in Palm Beach County, Florida, USA, during the 2000 U.S. presidential election.

Fetch the raw data

install.packages('BaM', dependencies=TRUE)
library(BaM)  #used for pbc.vote data
## Warning: package 'BaM' was built under R version 3.5.3
library(MASS)  #used for multivariate normal simulation
data(pbc.vote)
Data<-pbc.vote
#View(Data)
#help(pbc.vote) # you may check the description of variables using help
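
Before defining the variables for the analysis, it can be helpful to take a quick look at the data; the commands below are optional base R checks and their output is not shown here.

str(Data)    # structure of the data: variable names and types
head(Data)   # first few rows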

Define the variables for analysis

First, we organize the data to facilitate the analysis. We create a dependent variable \(Y\) and a data matrix \(X\) containing the independent variables along with a column of ones for the intercept.

Y<-Data$badballots
tech<- (Data$technology == "Votomatic")*1   # indicator: 1 if the technology is Votomatic, 0 otherwise
X<-with(Data,cbind(1,tech, new, size, Republican, white))   # design matrix: a column of ones for the intercept plus the predictors
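
As a quick optional sanity check (output not shown), we can verify the dimensions of the design matrix and the coding of the technology indicator.

dim(X)                        # number of observations by number of columns (intercept + 5 predictors)
table(Data$technology, tech)  # rows with technology "Votomatic" should be coded 1, all others 0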

Estimate the classical linear regression model

The linear regression model can be written as

\(Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_{k} X_{ik} + \epsilon_i\)

where \(\epsilon_1, \epsilon_2, \ldots \epsilon_n \stackrel{iid}{\sim} N(0, \sigma^2)\)

Estimation of the parameters is based on maximum likelihood, which turns out to be the same as minimizing the sum of squared errors.

This can be implemented in R using the lm() command.

reg<- lm(Y~X[,-1])
summary(reg)
## 
## Call:
## lm(formula = Y ~ X[, -1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.550 -10.040  -2.118   8.135 130.588 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        54.389083   4.351886  12.498   <2e-16 ***
## X[, -1]tech       -49.951266   3.443656 -14.505   <2e-16 ***
## X[, -1]new         -0.379588   0.039182  -9.688   <2e-16 ***
## X[, -1]size         0.154160   0.006146  25.081   <2e-16 ***
## X[, -1]Republican  -0.084490   0.007321 -11.541   <2e-16 ***
## X[, -1]white       -0.049646   0.005616  -8.839   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.94 on 510 degrees of freedom
## Multiple R-squared:  0.7973, Adjusted R-squared:  0.7953 
## F-statistic: 401.1 on 5 and 510 DF,  p-value: < 2.2e-16

It is instructive to check that the above estimates can also be computed directly using the matrix formulas for regression. The least squares regression estimates are obtained as the solution to the problem of minimizing the error sum of squares: \[\min_{\beta \in \mathbb{R}^k} (Y- X\beta)^T(Y-X\beta).\] After a bit of algebraic manipulation, it follows that \[(Y- X\beta)^T(Y-X\beta) = (\beta-M)^T D (\beta-M) + Y^T(I-X(X^TX)^{-1}X^{T})Y ,\] \[\mbox{ where } D=(X^TX)~~\mbox{ and }~~ M=(X^T X)^{-1}X^{T}Y.\] We observe that the first term in the previous expression is always non-negative, and hence the minimum is achieved when \(\beta=M\). So, the least squares estimator is given by \[\widehat{\beta} = (X^T X)^{-1} X^T Y.\] It then follows that \[V(\widehat{\beta})= \sigma^{2} (X^T X)^{-1}.\] An unbiased estimator of \(\sigma^2\) is given by \[\widehat{\sigma^2} = \frac{1}{n-k} Y^T ( I- X(X^T X)^{-1}X^T) Y.\]

beta_hat<- solve(t(X)%*%X)%*%t(X)%*%Y
n<-length(Y)
k<-dim(X)[2]
sigmasq_hat<- 1/(n-k) *  (t(Y) %*% ( diag(rep(1,n)) - X%*%solve(t(X)%*%X)%*%t(X) ) %*% Y)
beta_var <- as.numeric(sigmasq_hat) * solve(t(X)%*%X)
beta_se<- sqrt( diag(beta_var))

# estimates rounded to 3 decimal places
print(cbind(round(beta_hat,3), round(beta_se,3)))
##               [,1]  [,2]
##             54.389 4.352
## tech       -49.951 3.444
## new         -0.380 0.039
## size         0.154 0.006
## Republican  -0.084 0.007
## white       -0.050 0.006

Bayesian Formulation of the Regression Model

The linear regression model can be specified in matrix notation as

\[Y_{n\times 1} = X_{n\times k}\beta_{k\times 1} + \epsilon_{n\times 1} \mbox{ where } \epsilon_{n\times 1}\sim N_{n}(0_{n\times 1}, \sigma^2 \mathcal{I}_{n\times n} ),\]

or equivalently

\[Y\sim N_{n}(X \beta , \sigma^2 \mathcal{I} ).\]

Let the prior on \(\beta\) given \(\sigma^2\) be

\[\beta \vert \sigma^2 \sim N_{k}(\beta_0, \Sigma).\]

An example is Zellner’s g-prior where \(\Sigma= g \cdot \sigma^2 \cdot {(X^T X)}^{-1}\), for a known value of \(g\).

Let the prior on \(\theta= \frac{1}{\sigma^2}\) be given by

\[\theta \sim Gamma(shape=\nu, rate=\lambda).\]

Using the simplification of the error sum of squares presented above, we can determine the conditional posterior distributions needed for Gibbs sampling as follows: \[\pi\left(\beta \vert Data, \sigma^2 \right) \propto e^{-\frac{(\beta -M)^TX^TX(\beta-M)}{2\sigma^2} } e^{-\frac{(\beta-\beta_0)^T\Sigma^{-1}(\beta-\beta_0)}{2}} \] \[\propto e^{-\frac{(\beta-\beta_1)^T A_1(\beta-\beta_1)}{2}},\] or equivalently \[\beta \vert Data, \sigma^2 \sim N_k(\beta_1, A_1^{-1}), \mbox{ where } \beta_1 = A_1^{-1}\left(\frac{1}{\sigma^2}X^T Y + \Sigma^{-1}\beta_0\right), ~~ A_1= \left(\frac{1}{\sigma^2}X^{T}X + \Sigma^{-1}\right).\]

Let us take the case of Zellner’s prior, where \(\Sigma= g \sigma^2 (X^T X)^{-1}\). Then, the posterior for \(\theta=\frac{1}{\sigma^2}\) is given by \[\pi\left(\theta \vert Data, \beta \right)\propto \theta^{\frac{n}{2}} e^{-\frac{(Y-X\beta)^T(Y-X\beta) }{2} \cdot \theta}\cdot \theta^\frac{k}{2}e^{-\frac{(\beta-\beta_0)^TX^TX(\beta-\beta_0) }{2g}\cdot \theta} \cdot \theta^{\nu-1} e^{-\lambda \theta},\] or equivalently \[ \theta\vert Data, \beta ~~\sim~~ Gamma\left(shape=\frac{n}{2}+\frac{k}{2}+\nu , rate= \frac{(Y-X\beta)^T(Y-X\beta)}{2} + \frac{(\beta-\beta_0)^TX^TX(\beta-\beta_0)}{2g} + \lambda \right) \]

Gibbs Sampling for the parameters

We will use Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique, to simulate from the posterior distribution of \((\beta, \theta)\). The code below shows a single step of the Gibbs sampling procedure.

#priors
n<- dim(X)[1]
k<- dim(X)[2]


# For theta we use the Jeffreys prior: an improper prior, which can be thought of as Gamma(shape=0, rate=0)
nu<- 0   
lambda<- 0

# For Gibbs sampling take some starting value for theta
set.seed(sqrt(23))
rtheta <- 1

rsig2<- 1/rtheta
#prior mean for beta
beta0<- cbind(rep(0,k))
#prior variance for beta
g<- 100  # fixed and known; try g=1 to represent one prior experience with a data set of similar size in which we believe the coefficients are 0, or g=1/2 for two such experiences
rSigma <- g*rsig2 * solve(t(X) %*% X)

A1<- 1/rsig2 * t(X)%*%X  + solve(rSigma)
beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + solve(rSigma)%*%beta0)

# simulate a value of beta given theta (or equivalently beta given sigma^2)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))

print(rbeta)
##                    [,1]
##             53.90995431
## tech       -49.46722051
## new         -0.37741193
## size         0.15235494
## Republican  -0.08353643
## white       -0.04883333
#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%t(X)%*%X%*%(rbeta-beta0)/(2*g) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)
print(rtheta)
## [1] 0.001785999
data_sim<- rbind(c(rbeta, rtheta))
print(data_sim)
##          [,1]      [,2]       [,3]      [,4]        [,5]        [,6]
## [1,] 53.90995 -49.46722 -0.3774119 0.1523549 -0.08353643 -0.04883333
##             [,7]
## [1,] 0.001785999

Now we will repeat the previous Gibbs sampling step a large number of times. MCMC theory ensures that, after discarding an initial burn-in sample, the remaining draws are approximately distributed according to the posterior distribution (although successive draws are correlated). One way to decide on an appropriate burn-in is to inspect the trace plots.

N_MCMC<- 10000

for(count in 1:N_MCMC){
  
    rsig2<- 1/rtheta
  #prior mean for beta
  beta0<- cbind(rep(0,k))
  #prior variance for beta
  rSigma <- g*rsig2 * solve(t(X) %*% X)
  
  A1<- 1/rsig2 * t(X)%*%X  + solve(rSigma)
  beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + solve(rSigma)%*%beta0)
  
  # simulate a value of beta given theta (or equivalently beta given sigma^2)
  rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))
  
  
  
  #simulate a value of theta given previous value of beta
  nu1<- n/2 + k/2 + nu
  lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%t(X)%*%X%*%(rbeta-beta0)/(2*g) + lambda  
  rtheta <- rgamma(1,shape=nu1,rate=lambda1)
 
  data_sim<- rbind(data_sim, c(rbeta, rtheta))
}

We can check the trace plots for convergence of the Markov chain.

for(ii in 1:6){
  plot(data_sim[,ii], type='l', main=colnames(X)[ii])
}

plot(data_sim[,7], type='l', main="theta")
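
Besides visually inspecting the trace plots, numerical diagnostics from the coda package (assuming it is installed) can serve as a rough supplementary check, for example:

# optional convergence diagnostics using the coda package
# install.packages('coda')
library(coda)
chain <- mcmc(data_sim)
effectiveSize(chain)   # effective sample size for each parameter
geweke.diag(chain)     # Geweke z-scores comparing early and late portions of the chain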

We can now compute the estimates, standard errors and 95% credible intervals for the regression parameters. If a 95% credible interval contains zero, it is an indication that the corresponding variable may not be important.

l<-N_MCMC/2
u<- N_MCMC
means<- apply(data_sim[l:u,-7], MARGIN=2, FUN=mean)
sd<- apply(data_sim[l:u,-7], MARGIN=2, FUN=sd)
Q025<- apply(data_sim[l:u,-7], MARGIN=2, FUN=quantile, prob=.025)
Q975<- apply(data_sim[l:u,-7], MARGIN=2, FUN=quantile, prob=.975)
out<-data.frame(cbind(round(means,3), round(sd,3),round(Q025,3), round(Q975,3) ))
colnames(out)<- c("Est", "SE", "Q025", "Q975")

nmm<-colnames(X)
nmm[1]<-"Intercept"
row.names(out)<-nmm    
print(out)
##                Est    SE    Q025    Q975
## Intercept   53.904 4.423  45.395  62.512
## tech       -49.515 3.492 -56.315 -42.806
## new         -0.376 0.041  -0.459  -0.296
## size         0.153 0.007   0.140   0.165
## Republican  -0.083 0.007  -0.098  -0.069
## white       -0.049 0.006  -0.061  -0.038
# Recall classical regression output for comparison
print(round(summary(reg)$coefficients,3))
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)         54.389      4.352  12.498        0
## X[, -1]tech        -49.951      3.444 -14.505        0
## X[, -1]new          -0.380      0.039  -9.688        0
## X[, -1]size          0.154      0.006  25.081        0
## X[, -1]Republican   -0.084      0.007 -11.541        0
## X[, -1]white        -0.050      0.006  -8.839        0

We can observe that with \(g=100\) the results from the Bayesian approach are almost identical to those from classical linear regression. As we decrease \(g\) to indicate more and more certainty in the prior (which is centered at zero), the estimates are shrunk toward zero and the significance of the coefficients diminishes.
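
To experiment with other values of \(g\), one convenient option is to wrap the Gibbs steps above into a function of \(g\) and rerun it. The sketch below simply repackages the code already shown; the function name gibbs_blr and its argument defaults are our own illustrative choices.

# a minimal sketch of the same Gibbs sampler as a function of g
# (gibbs_blr is an illustrative name; it reuses mvrnorm from MASS, loaded above)
gibbs_blr <- function(Y, X, g, N_MCMC = 10000, nu = 0, lambda = 0) {
  n <- dim(X)[1]
  k <- dim(X)[2]
  beta0 <- cbind(rep(0, k))          # prior mean for beta
  XtX <- t(X) %*% X
  rtheta <- 1                        # starting value for theta
  sims <- matrix(NA, nrow = N_MCMC, ncol = k + 1)
  for (count in 1:N_MCMC) {
    rsig2 <- 1 / rtheta
    rSigma <- g * rsig2 * solve(XtX)                 # Zellner-type prior variance
    A1 <- 1 / rsig2 * XtX + solve(rSigma)
    beta1 <- solve(A1) %*% (1 / rsig2 * t(X) %*% Y + solve(rSigma) %*% beta0)
    rbeta <- cbind(c(mvrnorm(1, beta1, solve(A1))))  # draw beta given theta
    nu1 <- n / 2 + k / 2 + nu
    lambda1 <- t(Y - X %*% rbeta) %*% (Y - X %*% rbeta) / 2 +
      t(rbeta - beta0) %*% XtX %*% (rbeta - beta0) / (2 * g) + lambda
    rtheta <- rgamma(1, shape = nu1, rate = lambda1) # draw theta given beta
    sims[count, ] <- c(rbeta, rtheta)
  }
  colnames(sims) <- c("Intercept", colnames(X)[-1], "theta")
  sims
}

# for example, a more informative prior with g = 1, keeping the second half of the chain
# sim_g1 <- gibbs_blr(Y, X, g = 1)
# round(apply(sim_g1[(nrow(sim_g1)/2):nrow(sim_g1), ], 2, mean), 3)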