Getting started with Bayesian variable selection using JAGS and rjags

Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference engine from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

y = \alpha\mathbf{1} + X\beta + \varepsilon,

where X is an n\times p matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the 2^p possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable I_i for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient \beta_i as \beta_i=I_i\beta^\star_i, so that \beta^\star_i is the regression coefficient when I_i=1, and “doesn’t matter” when I_i=0. There are various ways to choose the prior over I_i and \beta^\star_i, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.
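
Written out, the hierarchical prior used below takes the form (with \pi denoting the prior inclusion probability and \tau_\beta the prior precision of the coefficients, both of which appear as hard-coded constants in the first variable selection JAGS model):

I_i \sim \mathrm{Bern}(\pi), \qquad \beta^\star_i \sim N(0, 1/\tau_\beta), \qquad \beta_i = I_i\beta^\star_i, \qquad i=1,\ldots,p,

with the I_i and \beta^\star_i mutually independent a priori.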

Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500   # number of observations
p=20    # number of covariates
X=matrix(rnorm(n*p),ncol=p)   # design matrix of standard normal covariates
beta=2^(0:(1-p))   # geometrically decaying true regression coefficients
print(beta)
alpha=3   # true intercept
tau=2     # true error precision
eps=rnorm(n,0,1/sqrt(tau))   # noise with standard deviation 1/sqrt(tau)
y=alpha+as.vector(X%*%beta + eps)   # simulated response

Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))

This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.62390 -0.48917 -0.02355  0.45683  2.35448 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *  
X6           0.0136591  0.0335817   0.407 0.684379    
X7           0.0035329  0.0321935   0.110 0.912662    
X8           0.0445844  0.0329189   1.354 0.176257    
X9           0.0269504  0.0318558   0.846 0.397968    
X10          0.0114942  0.0326022   0.353 0.724575    
X11         -0.0045308  0.0330039  -0.137 0.890868    
X12          0.0111247  0.0342482   0.325 0.745455    
X13         -0.0584796  0.0317723  -1.841 0.066301 .  
X14         -0.0005005  0.0343499  -0.015 0.988381    
X15         -0.0410424  0.0334723  -1.226 0.220742    
X16          0.0084832  0.0329650   0.257 0.797026    
X17          0.0346331  0.0327433   1.058 0.290718    
X18          0.0013258  0.0328920   0.040 0.967865    
X19         -0.0086980  0.0354804  -0.245 0.806446    
X20          0.0093156  0.0342376   0.272 0.785671    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707 
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16 

The first 4 variables are “highly significant” and the 5th is borderline.

Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      beta[j]~dnorm(0,0.001)
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
			n.iter=10000,thin=1)
print(summary(output))
plot(output)

I’ve hard-coded various hyper-parameters in the script, with values which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.
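
As a quick sanity check (not part of the original script), the posterior means can be compared against the least squares estimates from the earlier lm() fit; the snippet below assumes the standard structure of the coda summary object.

post.mean=summary(output)$statistics[,"Mean"]   # posterior means of all monitored variables
print(cbind(jags=post.mean[c("alpha",paste0("beta[",1:p,"]"))],ls=coef(mod)))   # side-by-side comparison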

Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)   # inclusion indicator for covariate j
      betaT[j]~dnorm(0,0.001)   # coefficient used when ind[j]=1
      beta[j]<-ind[j]*betaT[j]   # effective coefficient: zero when excluded
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

Note that I’ve hard-coded an expectation that around 20% of the variables should be included in the model. Again, I won’t include all of the output here, but the posterior means of the indicator variables can be interpreted as posterior probabilities that the corresponding variables should be included in the model. Inspecting the output reveals that the first three variables have posterior inclusion probabilities very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the remaining variables all have very small probabilities of inclusion.
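
For example, the inclusion probabilities can be pulled out of the coda output along the following lines (a quick sketch, assuming the standard structure of the coda summary):

post.mean=summary(output)$statistics[,"Mean"]   # posterior means of monitored variables
inclusion=post.mean[paste0("ind[",1:p,"]")]     # posterior inclusion probabilities
print(round(inclusion,3))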

This is fine as far as it goes, but it is not entirely satisfactory. One problem is that the choice of a vague “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the included coefficients using a “random effects” approach, with a variance that is estimated from the data, leading to a more appropriate distribution for the included coefficients.

Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau","taub"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.
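
To see just how concentrated this prior is, it is worth computing a central prior interval for the number of included variables under the hard-coded Bern(0.2) indicators (with p=20 here):

qbinom(c(0.025,0.975),size=p,prob=0.2)   # central 95% prior interval for the number of included variables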

Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.
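
A quick prior simulation (not part of the model fitting) illustrates the point: with the same prior mean inclusion probability of 0.2, the resulting beta-binomial distribution for the number of included variables is noticeably more dispersed than the plain binomial.

nfixed=rbinom(10000,size=p,prob=0.2)   # number included under the fixed-probability prior
nbeta=rbinom(10000,size=p,prob=rbeta(10000,2,8))   # number included when pind~Beta(2,8)
print(c(sd.fixed=sd(nfixed),sd.beta=sd(nbeta)))   # the beta-binomial prior has the larger spread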

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(pind)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
    pind~dbeta(2,8)   # prior mean inclusion probability of 0.2
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau","taub","pind"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

It turns out that for this particular problem the posterior distribution is not very different from the previous case, as the hard-coded choice of 20% happens to be quite consistent with the data. In general, however, the variable inclusion probabilities can be rather sensitive to the hard-coded inclusion proportion.

Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118.
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.