As in part 2 of Bayesian Linear Regression, we will again use the Bayesian linear regression formulation to fit a non-linear model to data. For this purpose, we use a simulated dataset generated as below. The first few steps of the formulation are the same as in part 2, except that here we determine the regularization parameter by placing a hyper-prior on \(g\).

Define the variables for analysis

library(MASS)   # for mvrnorm, used later in the Gibbs sampler
set.seed(19)
n<- 100
x<- seq(0.2,2,length=n)
error<- rnorm(n,0,2.5)             # iid N(0, 2.5^2) noise
fx<- 2*pi*x*sin(2*pi*x) + sqrt(x)  # true function f(x)
Y<- fx + error                     # noisy observations
plot(x,Y)

We would like to fit a model as below \[ Y_i = f(x_i) + \epsilon_i , ~~~~\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\] Here the function \(f\) is unknown and needs to be estimated. If we assume \(f(x)=x\beta\), then this becomes linear regression. It turns out that we can still use linear regression to model this problem. The idea is to approximate the function \(f\) by a linear combination of basis functions. For that, we will use a cubic B-spline basis (the B stands for basis) of degree 3 with 48 knots, which corresponds to 50 (= knots + degree - 1) basis functions, or degrees of freedom. We note that the B-spline basis is equivalent to a truncated polynomial basis with the same degree and the same set of knots, but is numerically more stable. (A good reference for B-splines and penalized regression is Chapter 8 of Fahrmeir et al. (2013).)
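To illustrate this equivalence, here is a minimal sketch comparing least-squares fits in the two bases. The choice df = 10 (rather than the 50 used below) and the object names (B10, TP10, kn) are ours, purely for readability; the fitted values should agree up to numerical error.

library(splines)
# Cubic B-spline basis with 10 basis functions; bs() places the interior knots
# at quantiles of x and stores them in the "knots" attribute.
B10 <- bs(x, df = 10, degree = 3, intercept = TRUE)
kn  <- attr(B10, "knots")

# Truncated power basis of the same degree with the same interior knots:
# {1, x, x^2, x^3, (x - kappa_j)_+^3}.
TP10 <- cbind(1, x, x^2, x^3, sapply(kn, function(k0) pmax(x - k0, 0)^3))

# Both bases span the same function space, so the fitted values coincide.
fitB  <- lm(Y ~ B10 - 1)
fitTP <- lm(Y ~ TP10 - 1)
all.equal(unname(fitted(fitB)), unname(fitted(fitTP)), tolerance = 1e-6)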

Once the B-spline basis is created, the problem can be viewed again as a linear regression model, which can be specified in matrix notation as

\[Y_{n\times 1} = X_{n\times k}\beta_{k\times 1} + \epsilon_{n\times 1} \mbox{ where } \epsilon_{n\times 1}\sim N_{n}(0_{n\times 1}, \sigma^2 \mathcal{I}_{n\times n} ),\]

or equivalently

\[Y\sim N_{n}(X \beta , \sigma^2 \mathcal{I} ).\]

The R code below generates the required B-spline basis.

library(splines)
X<-as.matrix(bs(x, df=50, degree=3, intercept=TRUE))   # n x 50 design matrix of cubic B-spline basis functions
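As a quick sanity check (a small sketch), the design matrix should have \(n=100\) rows and \(k=50\) columns:

dim(X)   # should be 100 x 50: one row per observation, one column per basis function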

Regularization using Bayesian Formulation of the Regression Model

We note that introducing basis functions increases the number of parameters in the model. In such problems, one uses regularization, i.e. a penalty, to discourage overfitting. A formulation in terms of a quadratic penalty, given a regularization parameter \(g\), is as below:

\[ \min_{\beta} (Y-X\beta)^T(Y-X\beta) + \frac{1}{g} \sum_{j=2}^k (\beta_j - \beta_{j-1})^2\]

It is a property of the B-spline basis that the columns of \(X\) add up to the vector of ones (the basis functions form a partition of unity). So, if all the coefficients \(\beta_j\) are equal to each other, then the model is essentially a constant function. A smaller \(g\) penalizes deviation from the constant function more heavily and hence encourages smoother functions, while a larger \(g\) allows very wiggly functions, as the sketch below illustrates.
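Both claims can be checked numerically. The sketch below verifies the partition-of-unity property and compares the closed-form penalized fits for a very small and a very large \(g\); the penalty matrix \(K=D^TD\) is built here with diff() and is the same matrix defined in the next paragraph, and the values 0.001 and 100 are arbitrary illustrative choices.

# The rows of X sum to 1: the B-spline basis functions form a partition of unity.
summary(rowSums(X))

# Closed-form penalized least squares fits for two values of g.
D <- diff(diag(ncol(X)))   # (k-1) x k first-order difference matrix (up to sign)
K <- t(D) %*% D
plot(x, Y, main = "Penalized fits: small g (smooth) vs large g (wiggly)")
for (g in c(0.001, 100)) {
  beta_pen <- solve(t(X) %*% X + K / g, t(X) %*% Y)
  lines(x, X %*% beta_pen, lwd = 2, col = ifelse(g < 1, "red", "darkgreen"))
}
legend("topright", legend = c("g = 0.001", "g = 100"),
       col = c("red", "darkgreen"), lwd = 2)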

The above penalized approach can be formulated as a Bayesian problem by considering a prior as follows:

Define the \((k-1)\times k\) matrix \(D= \begin{bmatrix}1 & -1 & 0 & \cdots & 0\\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix}\) and \(K= D^TD\).

Note that, for a prior mean \(\beta_0\) with equal components (we will take \(\beta_0=0\)), \((\beta-\beta_0)^TK(\beta-\beta_0)=\sum_{j=2}^k (\beta_j - \beta_{j-1})^2\). So, we define the prior

\[ \pi(\beta) \propto e^{- \frac{(\beta-\beta_0)^T K (\beta - \beta_0)}{2g\sigma^2} } . \]

This prior specification is called a random walk prior of order 1: \(\beta_j - \beta_{j-1} \stackrel{iid}{\sim} N(0, g \sigma^2)\), and the prior on \(\beta_1\) is improper, i.e. \(\pi(\beta_1) \propto 1\).
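A quick numerical check of this quadratic-form identity (a sketch reusing the \(K\) built in the sketch above; since we take \(\beta_0=0\) below, the exponent of the prior reduces to exactly this form):

b <- rnorm(ncol(X))                                # an arbitrary coefficient vector
all.equal(drop(t(b) %*% K %*% b), sum(diff(b)^2))  # should be TRUE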

Let the prior on \(\theta= \frac{1}{\sigma^2}\) be given by

\[\theta \sim Gamma(\nu, \lambda)\]

In addition, here we will consider the following prior for \(\frac{1}{g}\):

\[ \frac{1}{g} \sim Gamma(\nu_g , \lambda_g)\]

Now, similar to our working in Bayesian Linear Regression part 1, we can determine the conditional posterior distributions for \((\beta, \sigma^2, \frac{1}{g})\) as follows. Writing \(M=(X^TX)^{-1}X^TY\) for the least squares estimate, \[\pi\left(\beta \vert Data, \sigma^2, g \right) \propto e^{-\frac{(\beta -M)^TX^TX(\beta-M)}{2\sigma^2} } e^{-\frac{(\beta-\beta_0)^T K(\beta-\beta_0)}{2g\sigma^2}} \propto e^{-\frac{(\beta-\beta_1)^T A_{1}(\beta-\beta_1)}{2}},\] or equivalently \[\beta \vert Data, \sigma^2, g \sim N_k(\beta_1, A_1^{-1}), \mbox{ where } \beta_1 = A_1^{-1}\left(\frac{1}{\sigma^2}X^T Y + \frac{1}{g \sigma^2}K \beta_0\right), ~~ A_1= \left(\frac{1}{\sigma^2}X^{T}X + \frac{1}{g \sigma^2}K\right)\]

The posterior for \(\theta=\frac{1}{\sigma^2}\) is given by \[\pi\left(\theta \vert Data, \beta, g \right)\propto \theta^{\frac{n}{2}} e^{-\frac{(Y-X\beta)^T(Y-X\beta) }{2} \cdot \theta}\cdot \theta^\frac{k}{2}e^{-\frac{(\beta-\beta_0)^T K (\beta-\beta_0) }{2g}\cdot \theta} \cdot \theta^{\nu-1} e^{-\lambda \theta},\] or equivalently \[ \theta\vert Data, \beta, g ~~\sim~~ Gamma\left(\mbox{shape}=\frac{n}{2}+\frac{k}{2}+\nu , ~\mbox{rate}= \frac{(Y-X\beta)^T(Y-X\beta)}{2} + \frac{(\beta-\beta_0)^T K (\beta-\beta_0)}{2g} + \lambda \right) \]

Similarly, the posterior for \(\frac{1}{g}\) given \((\beta, \theta)\) can be determined as

\[\frac{1}{g} \vert Data, \beta, \sigma^2 \sim Gamma\left(\frac{k}{2} + \nu_g, \frac{(\beta - \beta_0)^T K (\beta-\beta_0)}{2\sigma^2} + \lambda_g \right)\]
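Before writing the sampler line by line, it may help to see the three full conditionals as small helper functions. This is only a sketch: the names draw_beta, draw_theta and draw_inv_g are ours, and the functions assume that X, Y, K, beta0, n, k and the hyperparameters nu, lambda, nu_g, lambda_g exist in the workspace, exactly as set up in the code below.

library(MASS)   # for mvrnorm

# beta | Data, sigma^2, g  ~  N_k(beta1, A1^{-1})
draw_beta <- function(sig2, g) {
  A1 <- t(X) %*% X / sig2 + K / (g * sig2)
  b1 <- solve(A1, t(X) %*% Y / sig2 + K %*% beta0 / (g * sig2))
  mvrnorm(1, drop(b1), solve(A1))
}

# theta | Data, beta, g  ~  Gamma(n/2 + k/2 + nu, rate)
draw_theta <- function(beta, g) {
  rate <- sum((Y - X %*% beta)^2) / 2 +
    drop(t(beta - beta0) %*% K %*% (beta - beta0)) / (2 * g) + lambda
  rgamma(1, shape = n/2 + k/2 + nu, rate = rate)
}

# 1/g | Data, beta, theta  ~  Gamma(k/2 + nu_g, rate)
draw_inv_g <- function(beta, theta) {
  rate <- theta / 2 * drop(t(beta - beta0) %*% K %*% (beta - beta0)) + lambda_g
  rgamma(1, shape = k/2 + nu_g, rate = rate)
}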

Gibbs Sampling for the parameters

We will use Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique, to simulate from the posterior distribution of \((\beta, \theta, \frac{1}{g})\). The code below shows a single step of the Gibbs sampling procedure.

set.seed(sqrt(23))
n<- dim(X)[1]
k<- dim(X)[2]


# For theta we use the Jeffreys prior: an improper prior which can be thought of as Gamma(shape=0, rate=0)
nu<- 0   
lambda<- 0

nu_g<-1
lambda_g<- 1

plot(density(1/rgamma(1000,nu_g,lambda_g)), main="prior for g")

# For Gibbs sampling take some starting value for theta
rtheta <- 1
rg<-1

rsig2<- 1/rtheta
#prior mean for beta
beta0<- cbind(rep(0,k))

# build the first-order difference matrix D (called S1 here) and the penalty
# matrix K = D'D; the prior precision of beta is K/(g*sigma^2)
Ik<-diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

A1<- 1/rsig2 * t(X)%*%X  + 1/rg * 1/rsig2 * K

beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/rg * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta and g (equivalently, beta given sigma^2 and g)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))


#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*rg) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)

nu1_g <- k/2 + nu_g

lambda1_g<- rtheta/2* t(rbeta-beta0)%*%K%*%(rbeta- beta0) + lambda_g

rg<- 1/rgamma(1, nu1_g, lambda1_g)

data_sim<- rbind(c(rbeta, rtheta, rg))
print(data_sim)
##           [,1]     [,2]    [,3]     [,4]    [,5]    [,6]     [,7]     [,8]
## [1,] 0.6682696 1.572322 2.35206 2.480192 2.52302 1.74278 2.915734 3.140244
##           [,9]    [,10]     [,11]      [,12]     [,13]     [,14]     [,15]
## [1,] 0.9706556 1.608546 0.8090984 -0.7950023 -1.365269 -1.587111 -2.124581
##          [,16]     [,17]    [,18]     [,19]    [,20]     [,21]    [,22]
## [1,] -2.699863 -3.017831 -3.80922 -3.374187 -3.35244 -2.054132 -1.67452
##           [,23]    [,24]    [,25]    [,26]    [,27]    [,28]    [,29]    [,30]
## [1,] -0.2373175 4.932283 5.586475 6.729188 6.387577 8.784164 9.723837 9.477777
##         [,31]    [,32]    [,33]    [,34]    [,35]     [,36]     [,37]     [,38]
## [1,] 9.832505 8.724392 7.333784 4.124466 1.728072 0.3593074 -1.550768 -4.208733
##          [,39]     [,40]     [,41]     [,42]     [,43]    [,44]     [,45]
## [1,] -6.273941 -8.303028 -10.53761 -9.163626 -7.695068 -7.91989 -8.547445
##          [,46]     [,47]      [,48]   [,49]    [,50]     [,51]    [,52]
## [1,] -8.438275 -5.326607 -0.9961585 1.45681 1.326431 0.2162647 0.558193

Now we will repeat the Gibbs sampling a large number of times.

N_MCMC<- 10000

for(count in 1:N_MCMC){
  
rsig2<- 1/rtheta
#prior mean for beta
beta0<- cbind(rep(0,k))

# penalty matrix K = D'D (unchanged across iterations; recomputed here only to
# keep the loop body self-contained)
Ik<-diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

A1<- 1/rsig2 * t(X)%*%X  + 1/rg * 1/rsig2 * K

beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/rg * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta and g (equivalently, beta given sigma^2 and g)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))



#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*rg) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)


nu1_g <- k/2 + nu_g

lambda1_g<- rtheta/2* t(rbeta-beta0)%*%K%*%(rbeta- beta0) + lambda_g

rg<- 1/rgamma(1, nu1_g, lambda1_g)

 data_sim<- rbind(data_sim,c(rbeta, rtheta, rg))
}
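Before summarising the fit, it can be useful to glance at trace plots of \(\sigma^2=\frac{1}{\theta}\) and \(g\) to check that the chain is mixing reasonably (a minimal sketch; column k+1 of data_sim holds \(\theta\) and column k+2 holds \(g\)):

par(mfrow = c(2, 1))
plot(1/data_sim[, k + 1], type = "l", xlab = "iteration", ylab = "sigma^2",
     main = "Trace of sigma^2")
plot(data_sim[, k + 2], type = "l", xlab = "iteration", ylab = "g",
     main = "Trace of g")
par(mfrow = c(1, 1))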

Now, let us plot the functions simulated from the posterior distribution during MCMC, together with their mean (which is the posterior mean function), and compare them with the true function.

# discard the first half of the chain as burn-in
l<-N_MCMC/2
u<- N_MCMC
means<- apply(data_sim[l:u, 1:dim(X)[2]], MARGIN=2, FUN=mean)

pred<-X%*%cbind(means)
plot(x,Y, main="Bayesian formulation used for g")
points(x,pred, col="black", cex=.5, pch=11)

for(ii in c(l:u)){
  pred<-X%*%cbind(data_sim[ii,1:dim(X)[2]])
  points(x,pred, col="brown", cex=.1)
  
}


points(x,fx, col="blue", cex=.5)

plot(density(data_sim[l:u, (dim(X)[2] +2)]), main="Posterior pdf for g")

#mean value of g from posterior 
print(mean(data_sim[l:u, (dim(X)[2] +2)]))
## [1] 0.7705588
#quantiles Q.025 and Q.975 for a 95% credible interval
print(quantile(data_sim[l:u, (dim(X)[2] +2)], .025))
##      2.5% 
## 0.3714398
print(quantile(data_sim[l:u, (dim(X)[2] +2)], .975))
##    97.5% 
## 1.466696
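The same retained draws can be used to summarise \(\sigma^2=\frac{1}{\theta}\) (a sketch; column k+1 of data_sim holds \(\theta\), and the data were simulated with \(\sigma = 2.5\), i.e. \(\sigma^2 = 6.25\)):

sig2_draws <- 1/data_sim[l:u, (dim(X)[2] + 1)]
print(mean(sig2_draws))
print(quantile(sig2_draws, c(.025, .975)))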

Reference: Fahrmeir, L., Kneib, T., Lang, S. and Marx, B. (2013), “Regression: Models, Methods and Applications”, Springer.