As in part 2 of Bayesian Linear Regression, we will again use the Bayesian linear regression formulation to fit a non-linear model to data. For this purpose, we use a simulated dataset generated as below. The first few steps of the formulation are the same as in part 2, except that here we determine the regularization parameter by placing a hyper-prior on \(g\).

Define the variables for analysis

library(MASS)   # for mvrnorm, used later in the Gibbs sampler
set.seed(19)
n<- 100
x<- seq(0.2,2,length=n)
error<- rnorm(n,0,2.5)             # iid N(0, 2.5^2) noise
fx<- 2*pi*x*sin(2*pi*x) + sqrt(x)  # true function f(x)
Y<- fx + error                     # noisy observations
plot(x,Y)

We would like to fit a model as below \[ Y_i = f(x_i) + \epsilon_i , ~~~~\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\] Here the function \(f\) is unknown and needs to be estimated. If we assume \(f(x)=x\beta\), then this becomes linear regression. It turns out that we can still use linear regression to model this problem. The idea is to approximate the function \(f\) by a linear combination of basis functions. For that, we will use a cubic B-spline basis (the B stands for basis) of degree 3 with 48 knots, which corresponds to 50 (= knots + degree - 1) basis functions, or degrees of freedom. We note that the B-spline basis is equivalent to a truncated polynomial basis with the same degree and the same set of knots, but is numerically more stable. (A good reference for B-splines and penalized regression is Chapter 8 of Fahrmeir et al. (2013).)
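To illustrate this equivalence, here is a minimal sketch comparing least-squares fits in the two bases. The choice df = 10 (rather than the 50 used below) and the object names (B10, TP10, kn) are ours, purely for readability; the fitted values should agree up to numerical error.

library(splines)
# Cubic B-spline basis with 10 basis functions; bs() places the interior knots
# at quantiles of x and stores them in the "knots" attribute.
B10 <- bs(x, df = 10, degree = 3, intercept = TRUE)
kn  <- attr(B10, "knots")

# Truncated power basis of the same degree with the same interior knots:
# {1, x, x^2, x^3, (x - kappa_j)_+^3}.
TP10 <- cbind(1, x, x^2, x^3, sapply(kn, function(k0) pmax(x - k0, 0)^3))

# Both bases span the same function space, so the fitted values coincide.
fitB  <- lm(Y ~ B10 - 1)
fitTP <- lm(Y ~ TP10 - 1)
all.equal(unname(fitted(fitB)), unname(fitted(fitTP)), tolerance = 1e-6)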

Once the B-spline basis is created, the problem can be viewed again as a linear regression model, which can be specified in matrix notation as

\[Y_{n\times 1} = X_{n\times k}\beta_{k\times 1} + \epsilon_{n\times 1} \mbox{ where } \epsilon_{n\times 1}\sim N_{n}(0_{n\times 1}, \sigma^2 \mathcal{I}_{n\times n} ),\]

or equivalently

\[Y\sim N_{n}(X \beta , \sigma^2 \mathcal{I} ).\]

The R code below generates the required B-spline basis.

library(splines)
X<-as.matrix(bs(x, df=50, degree=3, intercept=TRUE))   # n x 50 design matrix of cubic B-spline basis functions
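As a quick sanity check (a small sketch), the design matrix should have \(n=100\) rows and \(k=50\) columns:

dim(X)   # should be 100 x 50: one row per observation, one column per basis function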

Regularization using Bayesian Formulation of the Regression Model

We note that introducing basis functions increases the number of parameters in the model. In such problems, one uses regularization, i.e. a penalty, to discourage overfitting. A formulation in terms of a quadratic penalty, given a regularization parameter \(g\), is as below:

\[ \min_{\beta} (Y-X\beta)^T(Y-X\beta) + \frac{1}{g} \sum_{j=2}^k (\beta_j - \beta_{j-1})^2\]

It is a property of the B-spline basis that the columns of \(X\) add up to the vector of ones (the basis functions form a partition of unity). So, if all the coefficients \(\beta_j\) are equal to each other, then the model is essentially a constant function. A smaller \(g\) penalizes deviation from the constant function more heavily and hence encourages smoother functions, while a larger \(g\) allows very wiggly functions, as the sketch below illustrates.
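Both claims can be checked numerically. The sketch below verifies the partition-of-unity property and compares the closed-form penalized fits for a very small and a very large \(g\); the penalty matrix \(K=D^TD\) is built here with diff() and is the same matrix defined in the next paragraph, and the values 0.001 and 100 are arbitrary illustrative choices.

# The rows of X sum to 1: the B-spline basis functions form a partition of unity.
summary(rowSums(X))

# Closed-form penalized least squares fits for two values of g.
D <- diff(diag(ncol(X)))   # (k-1) x k first-order difference matrix (up to sign)
K <- t(D) %*% D
plot(x, Y, main = "Penalized fits: small g (smooth) vs large g (wiggly)")
for (g in c(0.001, 100)) {
  beta_pen <- solve(t(X) %*% X + K / g, t(X) %*% Y)
  lines(x, X %*% beta_pen, lwd = 2, col = ifelse(g < 1, "red", "darkgreen"))
}
legend("topright", legend = c("g = 0.001", "g = 100"),
       col = c("red", "darkgreen"), lwd = 2)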

The above penalized approach can be formulated as a Bayesian problem by considering a prior as follows:

Define the \((k-1)\times k\) matrix \(D= \begin{bmatrix}1 & -1 & 0 & \cdots & 0\\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix}\) and \(K= D^TD\).

Note that, for a prior mean \(\beta_0\) with equal components (we will take \(\beta_0=0\)), \((\beta-\beta_0)^TK(\beta-\beta_0)=\sum_{j=2}^k (\beta_j - \beta_{j-1})^2\). So, we define the prior

\[ \pi(\beta) \propto e^{- \frac{(\beta-\beta_0)^T K (\beta - \beta_0)}{2g\sigma^2} } . \]

This prior specification is called a random walk prior of order 1: \(\beta_j - \beta_{j-1} \stackrel{iid}{\sim} N(0, g \sigma^2)\), and the prior on \(\beta_1\) is improper, i.e. \(\pi(\beta_1) \propto 1\).
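A quick numerical check of this quadratic-form identity (a sketch reusing the \(K\) built in the sketch above; since we take \(\beta_0=0\) below, the exponent of the prior reduces to exactly this form):

b <- rnorm(ncol(X))                                # an arbitrary coefficient vector
all.equal(drop(t(b) %*% K %*% b), sum(diff(b)^2))  # should be TRUE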

Let the prior on \(\theta= \frac{1}{\sigma^2}\) be given by

\[\theta \sim Gamma(\nu, \lambda)\]

In addition, here we will consider the following prior for \(\frac{1}{g}\):

\[ \frac{1}{g} \sim Gamma(\nu_g , \lambda_g)\]

Now, similar to our working in Bayesian Linear Regression part 1, we can determine the conditional posterior distributions for \((\beta, \sigma^2, \frac{1}{g})\) as follows. Writing \(M=(X^TX)^{-1}X^TY\) for the least squares estimate, \[\pi\left(\beta \vert Data, \sigma^2, g \right) \propto e^{-\frac{(\beta -M)^TX^TX(\beta-M)}{2\sigma^2} } e^{-\frac{(\beta-\beta_0)^T K(\beta-\beta_0)}{2g\sigma^2}} \propto e^{-\frac{(\beta-\beta_1)^T A_{1}(\beta-\beta_1)}{2}},\] or equivalently \[\beta \vert Data, \sigma^2, g \sim N_k(\beta_1, A_1^{-1}), \mbox{ where } \beta_1 = A_1^{-1}\left(\frac{1}{\sigma^2}X^T Y + \frac{1}{g \sigma^2}K \beta_0\right), ~~ A_1= \left(\frac{1}{\sigma^2}X^{T}X + \frac{1}{g \sigma^2}K\right)\]

The posterior for \(\theta=\frac{1}{\sigma^2}\) is given by \[\pi\left(\theta \vert Data, \beta, g \right)\propto \theta^{\frac{n}{2}} e^{-\frac{(Y-X\beta)^T(Y-X\beta) }{2} \cdot \theta}\cdot \theta^\frac{k}{2}e^{-\frac{(\beta-\beta_0)^T K (\beta-\beta_0) }{2g}\cdot \theta} \cdot \theta^{\nu-1} e^{-\lambda \theta},\] or equivalently \[ \theta\vert Data, \beta, g ~~\sim~~ Gamma\left(\mbox{shape}=\frac{n}{2}+\frac{k}{2}+\nu , ~\mbox{rate}= \frac{(Y-X\beta)^T(Y-X\beta)}{2} + \frac{(\beta-\beta_0)^T K (\beta-\beta_0)}{2g} + \lambda \right) \]

Similarly, the posterior for \(\frac{1}{g}\) given \((\beta, \theta)\) can be determined as

\[\frac{1}{g} \vert Data, \beta, \sigma^2 \sim Gamma\left(\frac{k}{2} + \nu_g, \frac{(\beta - \beta_0)^T K (\beta-\beta_0)}{2\sigma^2} + \lambda_g \right)\]
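Before writing the sampler line by line, it may help to see the three full conditionals as small helper functions. This is only a sketch: the names draw_beta, draw_theta and draw_inv_g are ours, and the functions assume that X, Y, K, beta0, n, k and the hyperparameters nu, lambda, nu_g, lambda_g exist in the workspace, exactly as set up in the code below.

library(MASS)   # for mvrnorm

# beta | Data, sigma^2, g  ~  N_k(beta1, A1^{-1})
draw_beta <- function(sig2, g) {
  A1 <- t(X) %*% X / sig2 + K / (g * sig2)
  b1 <- solve(A1, t(X) %*% Y / sig2 + K %*% beta0 / (g * sig2))
  mvrnorm(1, drop(b1), solve(A1))
}

# theta | Data, beta, g  ~  Gamma(n/2 + k/2 + nu, rate)
draw_theta <- function(beta, g) {
  rate <- sum((Y - X %*% beta)^2) / 2 +
    drop(t(beta - beta0) %*% K %*% (beta - beta0)) / (2 * g) + lambda
  rgamma(1, shape = n/2 + k/2 + nu, rate = rate)
}

# 1/g | Data, beta, theta  ~  Gamma(k/2 + nu_g, rate)
draw_inv_g <- function(beta, theta) {
  rate <- theta / 2 * drop(t(beta - beta0) %*% K %*% (beta - beta0)) + lambda_g
  rgamma(1, shape = k/2 + nu_g, rate = rate)
}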

Gibbs Sampling for the parameters

We will use Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique, to simulate from the posterior distribution of \((\beta, \theta, \frac{1}{g})\). The code below shows a single step of the Gibbs sampling procedure.

set.seed(sqrt(23))
n<- dim(X)[1]
k<- dim(X)[2]


# For theta we use the Jeffreys prior: an improper prior which can be thought of as Gamma(shape=0, rate=0)
nu<- 0   
lambda<- 0

nu_g<-1
lambda_g<- 1

plot(density(1/rgamma(1000,nu_g,lambda_g)), main="prior for g")

# For Gibbs sampling take some starting value for theta
rtheta <- 1
rg<-1

rsig2<- 1/rtheta
#prior mean for beta
beta0<- cbind(rep(0,k))

# build the first-order difference matrix D (called S1 here) and the penalty
# matrix K = D'D; the prior precision of beta is K/(g*sigma^2)
Ik<-diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

A1<- 1/rsig2 * t(X)%*%X  + 1/rg * 1/rsig2 * K

beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/rg * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta and g (equivalently, beta given sigma^2 and g)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))


#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*rg) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)

nu1_g <- k/2 + nu_g

lambda1_g<- rtheta/2* t(rbeta-beta0)%*%K%*%(rbeta- beta0) + lambda_g

rg<- 1/rgamma(1, nu1_g, lambda1_g)

data_sim<- rbind(c(rbeta, rtheta, rg))
print(data_sim)
##           [,1]     [,2]    [,3]     [,4]    [,5]    [,6]     [,7]     [,8]
## [1,] 0.6682696 1.572322 2.35206 2.480192 2.52302 1.74278 2.915734 3.140244
##           [,9]    [,10]     [,11]      [,12]     [,13]     [,14]     [,15]
## [1,] 0.9706556 1.608546 0.8090984 -0.7950023 -1.365269 -1.587111 -2.124581
##          [,16]     [,17]    [,18]     [,19]    [,20]     [,21]    [,22]
## [1,] -2.699863 -3.017831 -3.80922 -3.374187 -3.35244 -2.054132 -1.67452
##           [,23]    [,24]    [,25]    [,26]    [,27]    [,28]    [,29]    [,30]
## [1,] -0.2373175 4.932283 5.586475 6.729188 6.387577 8.784164 9.723837 9.477777
##         [,31]    [,32]    [,33]    [,34]    [,35]     [,36]     [,37]     [,38]
## [1,] 9.832505 8.724392 7.333784 4.124466 1.728072 0.3593074 -1.550768 -4.208733
##          [,39]     [,40]     [,41]     [,42]     [,43]    [,44]     [,45]
## [1,] -6.273941 -8.303028 -10.53761 -9.163626 -7.695068 -7.91989 -8.547445
##          [,46]     [,47]      [,48]   [,49]    [,50]     [,51]    [,52]
## [1,] -8.438275 -5.326607 -0.9961585 1.45681 1.326431 0.2162647 0.558193

Now we will repeat the Gibbs sampling a large number of times.

N_MCMC<- 10000

for(count in 1:N_MCMC){
  
rsig2<- 1/rtheta
#prior mean for beta
beta0<- cbind(rep(0,k))

# penalty matrix K = D'D (unchanged across iterations; recomputed here only to
# keep the loop body self-contained)
Ik<-diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

A1<- 1/rsig2 * t(X)%*%X  + 1/rg * 1/rsig2 * K

beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/rg * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta and g (equivalently, beta given sigma^2 and g)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))



#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*rg) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)


nu1_g <- k/2 + nu_g

lambda1_g<- rtheta/2* t(rbeta-beta0)%*%K%*%(rbeta- beta0) + lambda_g

rg<- 1/rgamma(1, nu1_g, lambda1_g)

 data_sim<- rbind(data_sim,c(rbeta, rtheta, rg))
}
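Before summarising the fit, it can be useful to glance at trace plots of \(\sigma^2=\frac{1}{\theta}\) and \(g\) to check that the chain is mixing reasonably (a minimal sketch; column k+1 of data_sim holds \(\theta\) and column k+2 holds \(g\)):

par(mfrow = c(2, 1))
plot(1/data_sim[, k + 1], type = "l", xlab = "iteration", ylab = "sigma^2",
     main = "Trace of sigma^2")
plot(data_sim[, k + 2], type = "l", xlab = "iteration", ylab = "g",
     main = "Trace of g")
par(mfrow = c(1, 1))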

Now, let us plot the functions simulated from the posterior distribution during MCMC, together with their mean (which is the posterior mean function), and compare them with the true function.

# discard the first half of the chain as burn-in
l<-N_MCMC/2
u<- N_MCMC
means<- apply(data_sim[l:u, 1:dim(X)[2]], MARGIN=2, FUN=mean)

pred<-X%*%cbind(means)
plot(x,Y, main="Bayesian formulation used for g")
points(x,pred, col="black", cex=.5, pch=11)

for(ii in c(l:u)){
  pred<-X%*%cbind(data_sim[ii,1:dim(X)[2]])
  points(x,pred, col="brown", cex=.1)
  
}


points(x,fx, col="blue", cex=.5)

plot(density(data_sim[l:u, (dim(X)[2] +2)]), main="Posterior pdf for g")

#mean value of g from posterior 
print(mean(data_sim[l:u, (dim(X)[2] +2)]))
## [1] 0.7705588
#quantiles Q.025 and Q.975 for a 95% credible interval
print(quantile(data_sim[l:u, (dim(X)[2] +2)], .025))
##      2.5% 
## 0.3714398
print(quantile(data_sim[l:u, (dim(X)[2] +2)], .975))
##    97.5% 
## 1.466696
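The same retained draws can be used to summarise \(\sigma^2=\frac{1}{\theta}\) (a sketch; column k+1 of data_sim holds \(\theta\), and the data were simulated with \(\sigma = 2.5\), i.e. \(\sigma^2 = 6.25\)):

sig2_draws <- 1/data_sim[l:u, (dim(X)[2] + 1)]
print(mean(sig2_draws))
print(quantile(sig2_draws, c(.025, .975)))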

Reference: Fahrmeir, L., Kneib, T., Lang, S. and Marx, B. (2013), “Regression: Models, Methods and Applications”, Springer.