We will see how the Bayesian linear regression formulation can help fit non-linear models to data. For this purpose, we use a simulated dataset generated as below.

Define the variables for analysis

library(MASS)   # provides mvrnorm, used later in the Gibbs sampler
set.seed(19)
n<- 100
x<- seq(0.2, 2, length=n)              # design points
error<- rnorm(n, 0, 2.5)               # iid N(0, 2.5^2) noise
fx<- 2*pi*x*sin(2*pi*x) + sqrt(x)      # true function f(x)
Y<- fx + error                         # observed responses
plot(x,Y)

We would like to fit a model of the form \[ Y_i = f(x_i) + \epsilon_i , ~~~~\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2).\] Here the function \(f\) is unknown and needs to be estimated. If we assume \(f(x)=x\beta\), this reduces to ordinary linear regression. It turns out that we can still use linear regression to model the non-linear problem. The idea is to approximate the function \(f\) by a linear combination of basis functions. A popular choice is the truncated cubic-polynomial basis, consisting of the functions \(\{1, x, x^2, x^3, (x-\kappa_1)^3_{+},\ldots, (x-\kappa_r)^3_{+} \}\), where \(\kappa_1, \kappa_2,\ldots,\kappa_r\) are called "knots" and serve as the positions at which the polynomial pieces are joined. The notation \(w_{+}\) means \(w_{+} = \begin{cases}w, & \mbox{ if } w>0, \\ 0, & \mbox{ otherwise. } \end{cases}\)
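To make the truncated-power construction concrete, the snippet below (purely illustrative, with arbitrarily chosen knots; it is not used in what follows) builds such a basis:

# illustration only: truncated cubic-polynomial basis with three arbitrary knots
knots_ex<- c(0.6, 1.0, 1.4)
pos_part<- function(w) pmax(w, 0)   # the (.)_+ operation
X_trunc<- cbind(1, x, x^2, x^3,
                sapply(knots_ex, function(kap) pos_part(x - kap)^3))
dim(X_trunc)   # n rows and 4 + (number of knots) columns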

For our purpose, we will use a different basis called the "B-spline" basis (B stands for "basis"). The B-spline basis of degree 3 spans the same function space as the truncated cubic-polynomial basis with the same set of knots, but is computationally more stable. More specifically, we use a cubic (degree 3) B-spline basis with 48 knots, which corresponds to 50 basis functions (number of basis functions = degrees of freedom = number of knots + degree of polynomial - 1). A good reference for B-splines and penalized regression is Chapter 8 of Fahrmeir et al. (2013).
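The knot count above can be checked from the attributes that splines::bs() stores on the basis it returns (interior and boundary knots are kept separately); for example:

library(splines)
B<- bs(x, df=50, degree=3, intercept=TRUE)
length(attr(B, "knots"))           # interior knots: 46
length(attr(B, "Boundary.knots"))  # boundary knots: 2, so 48 knots in total
dim(B)[2]                          # basis functions: 50 = 48 + 3 - 1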

Once the B-spline basis is created, the problem can again be viewed as a linear regression model, specified in matrix notation as

\[Y_{n\times 1} = X_{n\times k}\beta_{k\times 1} + \epsilon_{n\times 1}, \mbox{ where } \epsilon_{n\times 1}\sim N_{n}(0_{n\times 1}, \sigma^2 \mathcal{I}_{n\times n} ),\]

or equivalently

\[Y\sim N_{n}(X \beta , \sigma^2 \mathcal{I} ).\]

The R code below generates the required B-spline basis.

library(splines)
# n x 50 design matrix: cubic B-spline basis functions evaluated at x
X<- as.matrix(bs(x, df=50, degree=3, intercept=TRUE))

Regularization using Bayesian Formulation of the Regression Model

We note that introducing basis functions increases the number of parameters in the model. In such problems, one uses regularization (a penalty) to discourage overfitting. A formulation in terms of a quadratic penalty, with regularization parameter \(g\), is as follows:

\[ \min_{\beta} (Y-X\beta)^T(Y-X\beta) + \frac{1}{g} \sum_{j=2}^k (\beta_j - \beta_{j-1})^2\]
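Writing the penalty as \(\frac{1}{g}\beta^TK\beta\), with the difference-penalty matrix \(K\) defined below, the objective is quadratic in \(\beta\) and is minimized in closed form at \[\hat{\beta} = \left(X^TX + \tfrac{1}{g}K\right)^{-1}X^TY,\] which is the fit that reappears later through the smoother matrix \(S\).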

It is a property of the B-spline basis that the columns of \(X\) add up to the vector of ones (equivalently, each row of \(X\) sums to one). So, if all the coefficients \(\beta_j\) are equal to each other, the fitted model is essentially a constant function. A smaller \(g\) penalizes deviations from the constant function more heavily and hence encourages smoother fits; a larger \(g\) allows very wiggly functions.
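This partition-of-unity property is easy to verify numerically on the basis generated above:

# each row of the B-spline design matrix sums to 1 (partition of unity)
range(rowSums(X))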

The above penalized approach can be formulated as a Bayesian problem by considering a prior as follows:

Define the \((k-1)\times k\) matrix \(D= \begin{bmatrix}1 & -1 & 0 & \cdots & 0 & 0\\ 0& 1 &-1& \cdots &0 &0 \\ \vdots & \vdots & \vdots & \ddots &\vdots &\vdots \\ 0& 0 &0& \cdots &1&-1 \end{bmatrix}\) and \(K= D^TD\).
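As a quick numerical check of this construction (for an arbitrary small \(k\)), note that in R the first differences can also be obtained with diff(diag(k)), which flips the sign of each row of \(D\) but yields the same \(K\):

# check with k = 5: K built from first differences, and the quadratic-form identity below
k_ex<- 5
D_ex<- diff(diag(k_ex))     # rows are (-1, 1, 0, ...), giving the same K = D^T D
K_ex<- t(D_ex) %*% D_ex
b_ex<- rnorm(k_ex)
all.equal(c(t(b_ex) %*% K_ex %*% b_ex), sum(diff(b_ex)^2))   # TRUE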

Note that \((\beta-\beta_0)^TK(\beta-\beta_0)=\sum_{j=2}^k (\beta_j - \beta_{j-1})^2\) whenever \(\beta_0\) has equal entries (in particular for \(\beta_0=0\), which is what we use below). So, we define the prior

\[ \pi(\beta) \propto e^{- \frac{(\beta-\beta_0)^T K (\beta - \beta_0)}{2g\sigma^2} } . \]

This prior specification is called a random walk prior of order 1: \(\beta_j - \beta_{j-1} \stackrel{iid}{\sim} N(0, g \sigma^2)\), and the prior on \(\beta_1\) is improper, i.e. \(\pi(\beta_1) \propto 1\).
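To see what this prior favours, one can simulate coefficient vectors from it and plot the corresponding functions \(X\beta\); the sketch below fixes \(\beta_1=0\) (its prior is flat) and, purely for illustration, sets \(\sigma^2=1\):

# sketch: one draw of f = X beta under the RW1 prior, for a small and a large g (sigma^2 = 1)
k_bs<- ncol(X)
for(g_try in c(0.1, 10)){
  beta_draw<- cumsum(c(0, rnorm(k_bs - 1, mean=0, sd=sqrt(g_try))))
  plot(x, X %*% beta_draw, type="l",
       main=paste("draw from the RW1 prior, g =", g_try))
}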

Let the prior on \(\theta= \frac{1}{\sigma^2}\) be given by

\[\theta \sim Gamma(\mbox{shape}=\nu, ~\mbox{rate}=\lambda).\]

Now, similar to our working in Bayesian Linear Regression part 1, we can determine the full conditional posterior distributions for \((\beta, \sigma^2)\) as follows (here \(M=(X^TX)^{-1}X^TY\) denotes the least-squares estimate): \[\pi\left(\beta \vert Data, \sigma^2 \right) \propto e^{-\frac{(\beta -M)^TX^TX(\beta-M)}{2\sigma^2} } e^{-\frac{(\beta-\beta_0)^T K(\beta-\beta_0)}{2g\sigma^2}} \propto e^{-\frac{(\beta-\beta_1)^T A_{1}(\beta-\beta_1)}{2}},\] or equivalently \[\beta \vert Data, \sigma^2 \sim N_k(\beta_1, A_1^{-1}), \mbox{ where } \beta_1 = A_1^{-1}\left(\frac{1}{\sigma^2}X^T Y + \frac{1}{g \sigma^2}K \beta_0\right), ~~ A_1= \frac{1}{\sigma^2}X^{T}X + \frac{1}{g \sigma^2}K.\]

The full conditional for \(\theta=\frac{1}{\sigma^2}\) is given by \[\pi\left(\theta \vert Data, \beta \right)\propto \theta^{\frac{n}{2}} e^{-\frac{(Y-X\beta)^T(Y-X\beta) }{2} \cdot \theta}\cdot \theta^\frac{k}{2}e^{-\frac{(\beta-\beta_0)^T K (\beta-\beta_0) }{2g}\cdot \theta} \cdot \theta^{\nu-1} e^{-\lambda \theta},\] or equivalently \[ \theta\vert Data, \beta ~~\sim~~ Gamma\left(\mbox{shape}=\frac{n}{2}+\frac{k}{2}+\nu , ~\mbox{rate}= \frac{(Y-X\beta)^T(Y-X\beta)}{2} + \frac{(\beta-\beta_0)^T K (\beta-\beta_0)}{2g} + \lambda \right). \]

Gibbs Sampling for the parameters

We will use Gibbs sampling, a Markov chain Monte Carlo technique, to simulate from the posterior distribution of \((\beta, \theta)\). The code below shows a single step of the Gibbs sampling procedure.

set.seed(sqrt(23))
n<- dim(X)[1]
k<- dim(X)[2]

# smoothing parameter
g<- 1
# For theta we use the Jeffreys prior: an improper prior which can be thought of as Gamma(shape=0, rate=0)
nu<- 0
lambda<- 0

# For Gibbs sampling take some starting value for theta
rtheta <- 1
rsig2<- 1/rtheta

# prior mean for beta
beta0<- cbind(rep(0,k))

# first-difference matrix D (called S1 here) and the penalty / prior precision matrix K = D^T D
Ik<- diag(rep(1,k))
S1<- as.matrix(Ik[-k,] - cbind(rep(0,(k-1)), Ik[1:(k-1),1:(k-1)]))
K<- t(S1)%*%S1

# full-conditional precision A1 and mean beta1 for beta
A1<- 1/rsig2 * t(X)%*%X + 1/g * 1/rsig2 * K
beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/g * 1/rsig2 * K %*% beta0)

# simulate a value of beta given theta (or equivalently, beta given sigma^2)
rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))
print(rbeta)
##          [,1]
## 1  -0.2991040
## 2   1.3084071
## 3   1.7882028
## 4   3.2656438
## 5   2.4406175
## 6   2.2535215
## 7   2.8966222
## 8   1.9986895
## 9  -0.2332324
## 10  0.9021022
## 11  0.6501266
## 12 -1.8448455
## 13 -1.8623291
## 14 -1.9781286
## 15 -2.1679650
## 16 -2.0582530
## 17 -2.9413381
## 18 -2.6064783
## 19 -3.6540798
## 20 -3.1883503
## 21 -1.2486168
## 22 -0.8767274
## 23  0.7529009
## 24  3.4362116
## 25  6.9330308
## 26  8.0553514
## 27  6.7108396
## 28  8.6172398
## 29  8.7753766
## 30 10.1424820
## 31  9.0056878
## 32  8.6222641
## 33  6.8516434
## 34  4.1967137
## 35  2.0179854
## 36  0.9672949
## 37 -1.9376270
## 38 -4.0463005
## 39 -6.8064310
## 40 -8.8321309
## 41 -9.9360617
## 42 -8.7514001
## 43 -8.7646703
## 44 -8.4123855
## 45 -8.5453054
## 46 -8.0932743
## 47 -6.2919344
## 48 -1.9837884
## 49 -0.3424448
## 50  0.8620543
#simulate a value of theta given previous value of beta
nu1<- n/2 + k/2 + nu
lambda1<-  t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*g) + lambda  
rtheta <- rgamma(1,shape=nu1,rate=lambda1)
print(rtheta)
## [1] 0.1982792
data_sim<- rbind(c(rbeta, rtheta))
print(data_sim)
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] -0.299104 1.308407 1.788203 3.265644 2.440618 2.253521 2.896622 1.998689
##            [,9]     [,10]     [,11]     [,12]     [,13]     [,14]     [,15]
## [1,] -0.2332324 0.9021022 0.6501266 -1.844845 -1.862329 -1.978129 -2.167965
##          [,16]     [,17]     [,18]    [,19]    [,20]     [,21]      [,22]
## [1,] -2.058253 -2.941338 -2.606478 -3.65408 -3.18835 -1.248617 -0.8767274
##          [,23]    [,24]    [,25]    [,26]   [,27]   [,28]    [,29]    [,30]
## [1,] 0.7529009 3.436212 6.933031 8.055351 6.71084 8.61724 8.775377 10.14248
##         [,31]    [,32]    [,33]    [,34]    [,35]     [,36]     [,37]     [,38]
## [1,] 9.005688 8.622264 6.851643 4.196714 2.017985 0.9672949 -1.937627 -4.046301
##          [,39]     [,40]     [,41]   [,42]    [,43]     [,44]     [,45]
## [1,] -6.806431 -8.832131 -9.936062 -8.7514 -8.76467 -8.412385 -8.545305
##          [,46]     [,47]     [,48]      [,49]     [,50]     [,51]
## [1,] -8.093274 -6.291934 -1.983788 -0.3424448 0.8620543 0.1982792

Now we will repeat the Gibbs sampling a large number of times.

N_MCMC<- 10000

# repeat the Gibbs step N_MCMC times; beta0, K and nu1 are unchanged from the single step above
for(count in 1:N_MCMC){

  rsig2<- 1/rtheta

  # full-conditional precision and mean for beta, given the current theta
  A1<- 1/rsig2 * t(X)%*%X + 1/g * 1/rsig2 * K
  beta1<- solve(A1)%*%(1/rsig2 * t(X)%*%Y + 1/g * 1/rsig2 * K %*% beta0)

  # simulate a value of beta given theta (or equivalently, beta given sigma^2)
  rbeta<- cbind(c(mvrnorm(1, beta1, solve(A1))))

  # simulate a value of theta given the current value of beta
  lambda1<- t(Y- X%*%rbeta)%*%(Y-X%*%rbeta)/2 + t(rbeta-beta0)%*%K%*%(rbeta-beta0)/(2*g) + lambda
  rtheta <- rgamma(1,shape=nu1,rate=lambda1)

  # store the draw
  data_sim<- rbind(data_sim,c(rbeta, rtheta))
}
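Before summarizing the draws, it is worth a quick look at the chain itself; for instance (not part of the original analysis), a trace plot of \(\theta\), stored in the last column of data_sim:

# trace plot of theta = 1/sigma^2 over the Gibbs iterations
plot(data_sim[, k+1], type="l", xlab="iteration", ylab="theta")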

Now, let us plot the functions simulated from the posterior distribution during MCMC, together with their average (the posterior mean function), and compare them with the true function. We discard the first half of the draws as burn-in.

# keep the second half of the chain as (approximate) posterior draws
l<- N_MCMC/2
u<- N_MCMC
# posterior mean of the coefficients
means<- apply(data_sim[l:u, 1:dim(X)[2]], MARGIN=2, FUN=mean)

# posterior mean function
pred<- X%*%cbind(means)
plot(x,Y, main=paste("smoothing parameter g=",g))
points(x,pred, col="black", cex=.5, pch=11)

# individual functions drawn from the posterior
for(ii in c(l:u)){
  pred<- X%*%cbind(data_sim[ii,1:dim(X)[2]])
  points(x,pred, col="brown", cex=.1)
}

# true function
points(x,fx, col="blue", cex=.5)

What would be a good choice for the smoothing parameter \(g\)? We will use the cross-validation criterion described in Section 8.1.9 of Fahrmeir et al. (2013), which is to minimize \[ CV = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{f}^{(-i)}(x_i))^2,\] where \(\hat{f}^{(-i)}(x_i)\) is the prediction obtained by estimating the model without using the \(i^{th}\) observation. As discussed in Fahrmeir et al. (2013), an easier computation of CV is obtained as follows: \[CV = \frac{1}{n}\sum_{i=1}^n\left(\frac{Y_i- \hat{f}(x_i)}{1-s_{ii}} \right)^2,\] where \(s_{ii}\) is the \(i^{th}\) diagonal entry of the smoother matrix \(S=X (X^T X + \frac{1}{g} K)^{-1}X^T\), and the predictions at the observed data points are obtained as \(\hat{f}=SY\). Below, we show the computation of CV for \(g=1\).

# smoother matrix, fitted values, and the CV shortcut for g = 1
S<- X%*% solve(t(X)%*%X + 1/g * K)%*%t(X)
fhat<- S%*%Y
diag_S<- c(diag(S))
Z<- (Y-fhat)/(1-diag_S)
CV<- 1/n * sum(Z*Z)
print(CV)
## [1] 8.674162
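As a direct (but slower) sanity check on this shortcut, one can refit the penalized model \(n\) times, leaving out one observation at a time; for penalized least squares with a fixed penalty the two computations agree. A sketch:

# direct leave-one-out CV for g = 1: refit without observation i, predict it, average the errors
cv_terms<- numeric(n)
for(i in 1:n){
  b_i<- solve(t(X[-i,]) %*% X[-i,] + 1/g * K) %*% (t(X[-i,]) %*% Y[-i])
  cv_terms[i]<- (Y[i] - X[i, , drop=FALSE] %*% b_i)^2
}
mean(cv_terms)   # should match the CV value printed above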

Below, we look at the fit for different values of \(g\).
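The object CV_data used below is not constructed in the code shown above; a minimal sketch of how it could be assembled, reusing the CV computation for a handful of candidate values of \(g\), is:

# sketch: CV for several candidate smoothing parameters
# (the first row is a placeholder, which is why it is dropped when printing)
CV_data<- data.frame(g=NA, CV=NA)
for(g_try in c(0.01, 1, 5, 10, 25, 1000)){
  S_g<- X %*% solve(t(X)%*%X + 1/g_try * K) %*% t(X)
  Z_g<- (Y - S_g %*% Y)/(1 - diag(S_g))
  CV_data<- rbind(CV_data, data.frame(g=g_try, CV=sum(Z_g*Z_g)/n))
}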

The table of CV for different values of \(g\) is given below.

print(CV_data[-1,])
##         g        CV
## 2    0.01 19.838802
## 3    1.00  8.674162
## 4    5.00 10.199177
## 5   10.00 11.210716
## 6   25.00 12.935874
## 7 1000.00 32.895557

We can see that, among the values considered, CV is lowest for \(g=1\). This is just an illustration; a finer grid of \(g\) values would need to be considered to find the best value.
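For example, one could wrap the CV computation in a function and hand it to a one-dimensional optimizer over a plausible range of \(g\); a sketch:

# sketch: search for the g minimizing CV on an interval around the current best value
cv_of_g<- function(g_val){
  S_g<- X %*% solve(t(X)%*%X + 1/g_val * K) %*% t(X)
  mean(((Y - S_g %*% Y)/(1 - diag(S_g)))^2)
}
optimize(cv_of_g, interval=c(0.1, 5))   # returns the minimizing g and the CV attained there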

Reference: Fahrmeir, Kneib, Lang and Marx (2013), "Regression: Models, Methods and Applications", Springer.