We will see how the Bayesian linear regression formulation can help fit non-linear models to data. For this purpose, we use a simulated dataset generated as below.

Define the variables for analysis

library(MASS)   # provides mvrnorm, used later in the Gibbs sampler
set.seed(19)
n<- 100
x<- seq(0.2, 2, length=n)              # design points
error<- rnorm(n, 0, 2.5)               # iid N(0, 2.5^2) noise
fx<- 2*pi*x*sin(2*pi*x) + sqrt(x)      # true function f(x)
Y<- fx + error                         # observed responses
plot(x,Y)

We would like to fit a model of the form \[ Y_i = f(x_i) + \epsilon_i , ~~~~\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).\] Here the function \(f\) is unknown and needs to be estimated. If we assume \(f(x)=x\beta\), this reduces to ordinary linear regression. It turns out that we can still use linear regression to model the non-linear problem. The idea is to approximate the function \(f\) by a linear combination of basis functions. A popular choice is the truncated cubic-polynomial basis, consisting of the functions \(\{1, x, x^2, x^3, (x-\kappa_1)^3_{+},\ldots, (x-\kappa_r)^3_{+} \}\), where \(\kappa_1, \kappa_2,\ldots,\kappa_r\) are called "knots" and serve as the positions at which the polynomial pieces are joined. The notation \(w_{+}\) means \(w_{+} = \begin{cases}w, & \mbox{ if } w>0, \\ 0, & \mbox{ otherwise. } \end{cases}\)
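To make the truncated-power construction concrete, the snippet below (purely illustrative, with arbitrarily chosen knots; it is not used in what follows) builds such a basis:

# illustration only: truncated cubic-polynomial basis with three arbitrary knots
knots_ex<- c(0.6, 1.0, 1.4)
pos_part<- function(w) pmax(w, 0)   # the (.)_+ operation
X_trunc<- cbind(1, x, x^2, x^3,
                sapply(knots_ex, function(kap) pos_part(x - kap)^3))
dim(X_trunc)   # n rows and 4 + (number of knots) columns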

For our purpose, we will use a different basis called the "B-spline" basis (B stands for "basis"). The B-spline basis of degree 3 spans the same function space as the truncated cubic-polynomial basis with the same set of knots, but is computationally more stable. More specifically, we use a cubic (degree 3) B-spline basis with 48 knots, which corresponds to 50 basis functions (number of basis functions = degrees of freedom = number of knots + degree of polynomial - 1). A good reference for B-splines and penalized regression is Chapter 8 of Fahrmeir et al. (2013).
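The knot count above can be checked from the attributes that splines::bs() stores on the basis it returns (interior and boundary knots are kept separately); for example:

library(splines)
B<- bs(x, df=50, degree=3, intercept=TRUE)
length(attr(B, "knots"))           # interior knots: 46
length(attr(B, "Boundary.knots"))  # boundary knots: 2, so 48 knots in total
dim(B)[2]                          # basis functions: 50 = 48 + 3 - 1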

Once the B-spline basis is created, the problem can again be viewed as a linear regression model, specified in matrix notation as

\[Y_{n\times 1} = X_{n\times k}\beta_{k\times 1} + \epsilon_{n\times 1}, \mbox{ where } \epsilon_{n\times 1}\sim N_{n}(0_{n\times 1}, \sigma^2 \mathcal{I}_{n\times n} ),\]

or equivalently

\[Y\sim N_{n}(X \beta , \sigma^2 \mathcal{I} ).\]

The R code below generates the required B-spline basis.

library(splines)
# n x 50 design matrix: cubic B-spline basis functions evaluated at x
X<- as.matrix(bs(x, df=50, degree=3, intercept=TRUE))

Regularization using Bayesian Formulation of the Regression Model

We note that introducing basis functions increases the number of parameters in the model. In such problems, one uses regularization (a penalty) to discourage overfitting. A formulation in terms of a quadratic penalty, with regularization parameter \(g\), is as follows:

\[ \min_{\beta} (Y-X\beta)^T(Y-X\beta) + \frac{1}{g} \sum_{j=2}^k (\beta_j - \beta_{j-1})^2\]
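Writing the penalty as \(\frac{1}{g}\beta^TK\beta\), with the difference-penalty matrix \(K\) defined below, the objective is quadratic in \(\beta\) and is minimized in closed form at \[\hat{\beta} = \left(X^TX + \tfrac{1}{g}K\right)^{-1}X^TY,\] which is the fit that reappears later through the smoother matrix \(S\).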

It is a property of the B-spline basis that the columns of \(X\) add up to the vector of ones (equivalently, each row of \(X\) sums to one). So, if all the coefficients \(\beta_j\) are equal to each other, the fitted model is essentially a constant function. A smaller \(g\) penalizes deviations from the constant function more heavily and hence encourages smoother fits; a larger \(g\) allows very wiggly functions.
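This partition-of-unity property is easy to verify numerically on the basis generated above:

# each row of the B-spline design matrix sums to 1 (partition of unity)
range(rowSums(X))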

The above penalized approach can be formulated as a Bayesian problem by considering a prior as follows:

Define the \((k-1)\times k\) matrix \(D= \begin{bmatrix}1 & -1 & 0 & \cdots & 0 & 0\\ 0& 1 &-1& \cdots &0 &0 \\ \vdots & \vdots & \vdots & \ddots &\vdots &\vdots \\ 0& 0 &0& \cdots &1&-1 \end{bmatrix}\) and \(K= D^TD\).
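As a quick numerical check of this construction (for an arbitrary small \(k\)), note that in R the first differences can also be obtained with diff(diag(k)), which flips the sign of each row of \(D\) but yields the same \(K\):

# check with k = 5: K built from first differences, and the quadratic-form identity below
k_ex<- 5
D_ex<- diff(diag(k_ex))     # rows are (-1, 1, 0, ...), giving the same K = D^T D
K_ex<- t(D_ex) %*% D_ex
b_ex<- rnorm(k_ex)
all.equal(c(t(b_ex) %*% K_ex %*% b_ex), sum(diff(b_ex)^2))   # TRUE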

Note that \((\beta-\beta_0)^TK(\beta-\beta_0)=\sum_{j=2}^k (\beta_j - \beta_{j-1})^2\) whenever \(\beta_0\) has equal entries (in particular for \(\beta_0=0\), which is what we use below). So, we define the prior

\[ \pi(\beta) \propto e^{- \frac{(\beta-\beta_0)^T K (\beta - \beta_0)}{2g\sigma^2} } . \]

This prior specification is called a random walk prior of order 1: \(\beta_j - \beta_{j-1} \stackrel{iid}{\sim} N(0, g \sigma^2)\), and the prior on \(\beta_1\) is improper, i.e. \(\pi(\beta_1) \propto 1\).
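To see what this prior favours, one can simulate coefficient vectors from it and plot the corresponding functions \(X\beta\); the sketch below fixes \(\beta_1=0\) (its prior is flat) and, purely for illustration, sets \(\sigma^2=1\):

# sketch: one draw of f = X beta under the RW1 prior, for a small and a large g (sigma^2 = 1)
k_bs<- ncol(X)
for(g_try in c(0.1, 10)){
  beta_draw<- cumsum(c(0, rnorm(k_bs - 1, mean=0, sd=sqrt(g_try))))
  plot(x, X %*% beta_draw, type="l",
       main=paste("draw from the RW1 prior, g =", g_try))
}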

Let the prior on \(\theta= \frac{1}{\sigma^2}\) be given by

\[\theta \sim Gamma(\mbox{shape}=\nu, ~\mbox{rate}=\lambda).\]

Now, similar to our working in Bayesian Linear Regression part 1, we can determine the full conditional posterior distributions for \((\beta, \sigma^2)\) as follows (here \(M=(X^TX)^{-1}X^TY\) denotes the least-squares estimate): \[\pi\left(\beta \vert Data, \sigma^2 \right) \propto e^{-\frac{(\beta -M)^TX^TX(\beta-M)}{2\sigma^2} } e^{-\frac{(\beta-\beta_0)^T K(\beta-\beta_0)}{2g\sigma^2}} \propto e^{-\frac{(\beta-\beta_1)^T A_{1}(\beta-\beta_1)}{2}},\] or equivalently \[\beta \vert Data, \sigma^2 \sim N_k(\beta_1, A_1^{-1}), \mbox{ where } \beta_1 = A_1^{-1}\left(\frac{1}{\sigma^2}X^T Y + \frac{1}{g \sigma^2}K \beta_0\right), ~~ A_1= \frac{1}{\sigma^2}X^{T}X + \frac{1}{g \sigma^2}K.\]

The full conditional for \(\theta=\frac{1}{\sigma^2}\) is given by \[\pi\left(\theta \vert Data, \beta \right)\propto \theta^{\frac{n}{2}} e^{-\frac{(Y-X\beta)^T(Y-X\beta) }{2} \cdot \theta}\cdot \theta^\frac{k}{2}e^{-\frac{(\beta-\beta_0)^T K (\beta-\beta_0) }{2g}\cdot \theta} \cdot \theta^{\nu-1} e^{-\lambda \theta},\] or equivalently \[ \theta\vert Data, \beta ~~\sim~~ Gamma\left(\mbox{shape}=\frac{n}{2}+\frac{k}{2}+\nu , ~\mbox{rate}= \frac{(Y-X\beta)^T(Y-X\beta)}{2} + \frac{(\beta-\beta_0)^T K (\beta-\beta_0)}{2g} + \lambda \right). \]

Gibbs Sampling for the parameters

We will use Gibbs sampling, a Markov chain Monte Carlo technique, to simulate from the posterior distribution of \((\beta, \theta)\). The code below shows a single step of the Gibbs sampling procedure.

set.seed(sqrt(23))
n<- dim(X)[1]
k<- dim(X)[2]

# smoothing parameter
g<- 1
# For theta we use the Jeffreys prior: an improper prior which can be thought of as Gamma(shape=0, rate=0)
nu<- 0
lambda<- 0

# For Gibbs sampling take some starting value for theta
rtheta <- 1
rsig2<- 1/rtheta

# prior mean for beta
beta0<- cbind(rep(0,k))

# first-difference matrix D (called S1 here) and the penalty / prior precision matrix K = D^T D
Ik<- diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

# full-conditional precision A1 and mean beta1 for beta
A1<- 1/rsig2 * t(X)%*%X + 1/g * 1/rsig2 * K
beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/g * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta (or equivalently, beta given sigma^2)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))
print(rbeta)
##          [,1]
## 1  -0.2991040
## 2   1.3084071
## 3   1.7882028
## 4   3.2656438
## 5   2.4406175
## 6   2.2535215
## 7   2.8966222
## 8   1.9986895
## 9  -0.2332324
## 10  0.9021022
## 11  0.6501266
## 12 -1.8448455
## 13 -1.8623291
## 14 -1.9781286
## 15 -2.1679650
## 16 -2.0582530
## 17 -2.9413381
## 18 -2.6064783
## 19 -3.6540798
## 20 -3.1883503
## 21 -1.2486168
## 22 -0.8767274
## 23  0.7529009
## 24  3.4362116
## 25  6.9330308
## 26  8.0553514
## 27  6.7108396
## 28  8.6172398
## 29  8.7753766
## 30 10.1424820
## 31  9.0056878
## 32  8.6222641
## 33  6.8516434
## 34  4.1967137
## 35  2.0179854
## 36  0.9672949
## 37 -1.9376270
## 38 -4.0463005
## 39 -6.8064310
## 40 -8.8321309
## 41 -9.9360617
## 42 -8.7514001
## 43 -8.7646703
## 44 -8.4123855
## 45 -8.5453054
## 46 -8.0932743
## 47 -6.2919344
## 48 -1.9837884
## 49 -0.3424448
## 50  0.8620543
#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*g) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)
print(rtheta)
## [1] 0.1982792
data_sim<- rbind(c(rbeta, rtheta))
print(data_sim)
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] -0.299104 1.308407 1.788203 3.265644 2.440618 2.253521 2.896622 1.998689
##            [,9]     [,10]     [,11]     [,12]     [,13]     [,14]     [,15]
## [1,] -0.2332324 0.9021022 0.6501266 -1.844845 -1.862329 -1.978129 -2.167965
##          [,16]     [,17]     [,18]    [,19]    [,20]     [,21]      [,22]
## [1,] -2.058253 -2.941338 -2.606478 -3.65408 -3.18835 -1.248617 -0.8767274
##          [,23]    [,24]    [,25]    [,26]   [,27]   [,28]    [,29]    [,30]
## [1,] 0.7529009 3.436212 6.933031 8.055351 6.71084 8.61724 8.775377 10.14248
##         [,31]    [,32]    [,33]    [,34]    [,35]     [,36]     [,37]     [,38]
## [1,] 9.005688 8.622264 6.851643 4.196714 2.017985 0.9672949 -1.937627 -4.046301
##          [,39]     [,40]     [,41]   [,42]    [,43]     [,44]     [,45]
## [1,] -6.806431 -8.832131 -9.936062 -8.7514 -8.76467 -8.412385 -8.545305
##          [,46]     [,47]     [,48]      [,49]     [,50]     [,51]
## [1,] -8.093274 -6.291934 -1.983788 -0.3424448 0.8620543 0.1982792

Now we will repeat the Gibbs sampling a large number of times.

N_MCMC<- 10000

# repeat the Gibbs step N_MCMC times; beta0, K and nu1 are unchanged from the single step above
for(count in 1:N_MCMC){

  rsig2<- 1/rtheta

  # full-conditional precision and mean for beta, given the current theta
  A1<- 1/rsig2 * t(X)%*%X + 1/g * 1/rsig2 * K
  beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/g * 1/rsig2 * K %*% beta0)

  # simulate a value of beta given theta (or equivalently, beta given sigma^2)
  rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))

  # simulate a value of theta given the current value of beta
  lambda1<- t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*g) + lambda
  rtheta <- rgamma(1,shape=nu1,rate=lambda1)

  # store the draw
  data_sim<- rbind(data_sim,c(rbeta, rtheta))
}
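Before summarizing the draws, it is worth a quick look at the chain itself; for instance (not part of the original analysis), a trace plot of \(\theta\), stored in the last column of data_sim:

# trace plot of theta = 1/sigma^2 over the Gibbs iterations
plot(data_sim[, k+1], type="l", xlab="iteration", ylab="theta")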

Now, let us plot the functions simulated from the posterior distribution during MCMC, together with their average (the posterior mean function), and compare them with the true function. We discard the first half of the draws as burn-in.

# keep the second half of the chain as (approximate) posterior draws
l<- N_MCMC/2
u<- N_MCMC
# posterior mean of the coefficients
means<- apply(data_sim[l:u, 1:dim(X)[2]], MARGIN=2, FUN=mean)

# posterior mean function
pred<- X%*%cbind(means)
plot(x,Y, main=paste("smoothing parameter g=",g))
points(x,pred, col="black", cex=.5, pch=11)

# individual functions drawn from the posterior
for(ii in c(l:u)){
  pred<- X%*%cbind(data_sim[ii,1:dim(X)[2]])
  points(x,pred, col="brown", cex=.1)
}

# true function
points(x,fx, col="blue", cex=.5)

What would be a good choice for the smoothing parameter \(g\)? We will use the cross-validation criterion described in Section 8.1.9 of Fahrmeir et al. (2013), which is to minimize \[ CV = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{f}^{(-i)}(x_i))^2,\] where \(\hat{f}^{(-i)}(x_i)\) is the prediction obtained by estimating the model without using the \(i^{th}\) observation. As discussed in Fahrmeir et al. (2013), an easier computation of CV is obtained as follows: \[CV = \frac{1}{n}\sum_{i=1}^n\left(\frac{Y_i- \hat{f}(x_i)}{1-s_{ii}} \right)^2,\] where \(s_{ii}\) is the \(i^{th}\) diagonal entry of the smoother matrix \(S=X (X^T X + \frac{1}{g} K)^{-1}X^T\), and the predictions at the observed data points are obtained as \(\hat{f}=SY\). Below, we show the computation of CV for \(g=1\).

# smoother matrix, fitted values, and the CV shortcut for g = 1
S<- X%*% solve(t(X)%*%X + 1/g * K)%*%t(X)
fhat<- S%*%Y
diag_S<- c(diag(S))
Z<- (Y-fhat)/(1-diag_S)
CV<- 1/n * sum(Z*Z)
print(CV)
## [1] 8.674162
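As a direct (but slower) sanity check on this shortcut, one can refit the penalized model \(n\) times, leaving out one observation at a time; for penalized least squares with a fixed penalty the two computations agree. A sketch:

# direct leave-one-out CV for g = 1: refit without observation i, predict it, average the errors
cv_terms<- numeric(n)
for(i in 1:n){
  b_i<- solve(t(X[-i,]) %*% X[-i,] + 1/g * K) %*% (t(X[-i,]) %*% Y[-i])
  cv_terms[i]<- (Y[i] - X[i, , drop=FALSE] %*% b_i)^2
}
mean(cv_terms)   # should match the CV value printed above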

Below, we look at the fit for different values of \(g\).
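The object CV_data used below is not constructed in the code shown above; a minimal sketch of how it could be assembled, reusing the CV computation for a handful of candidate values of \(g\), is:

# sketch: CV for several candidate smoothing parameters
# (the first row is a placeholder, which is why it is dropped when printing)
CV_data<- data.frame(g=NA, CV=NA)
for(g_try in c(0.01, 1, 5, 10, 25, 1000)){
  S_g<- X %*% solve(t(X)%*%X + 1/g_try * K) %*% t(X)
  Z_g<- (Y - S_g %*% Y)/(1 - diag(S_g))
  CV_data<- rbind(CV_data, data.frame(g=g_try, CV=sum(Z_g*Z_g)/n))
}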

The table of CV for different values of \(g\) is given below.

print(CV_data[-1,])
##         g        CV
## 2    0.01 19.838802
## 3    1.00  8.674162
## 4    5.00 10.199177
## 5   10.00 11.210716
## 6   25.00 12.935874
## 7 1000.00 32.895557

We can see that, among the values considered, CV is lowest for \(g=1\). This is just an illustration; a finer grid of \(g\) values would need to be considered to find the best value.
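For example, one could wrap the CV computation in a function and hand it to a one-dimensional optimizer over a plausible range of \(g\); a sketch:

# sketch: search for the g minimizing CV on an interval around the current best value
cv_of_g<- function(g_val){
  S_g<- X %*% solve(t(X)%*%X + 1/g_val * K) %*% t(X)
  mean(((Y - S_g %*% Y)/(1 - diag(S_g)))^2)
}
optimize(cv_of_g, interval=c(0.1, 5))   # returns the minimizing g and the CV attained there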

Reference: Fahrmeir, Kneib, Lang and Marx (2013), "Regression: Models, Methods and Applications", Springer.