Getting started with Bayesian variable selection using JAGS and rjags

Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

y = \alpha \boldmath{1} + X\beta + \varepsilon,

where X is an n\times p matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the 2^p possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable I_i for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient \beta_i as \beta_i=I_i\beta^\star_i, so that \beta^\star_i is the regression coefficient when I_i=1, and “doesn’t matter” when I_i=0. There are various ways to choose the prior over I_i and \beta^\star_i, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.

Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)

Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))

This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.62390 -0.48917 -0.02355  0.45683  2.35448 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *  
X6           0.0136591  0.0335817   0.407 0.684379    
X7           0.0035329  0.0321935   0.110 0.912662    
X8           0.0445844  0.0329189   1.354 0.176257    
X9           0.0269504  0.0318558   0.846 0.397968    
X10          0.0114942  0.0326022   0.353 0.724575    
X11         -0.0045308  0.0330039  -0.137 0.890868    
X12          0.0111247  0.0342482   0.325 0.745455    
X13         -0.0584796  0.0317723  -1.841 0.066301 .  
X14         -0.0005005  0.0343499  -0.015 0.988381    
X15         -0.0410424  0.0334723  -1.226 0.220742    
X16          0.0084832  0.0329650   0.257 0.797026    
X17          0.0346331  0.0327433   1.058 0.290718    
X18          0.0013258  0.0328920   0.040 0.967865    
X19         -0.0086980  0.0354804  -0.245 0.806446    
X20          0.0093156  0.0342376   0.272 0.785671    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707 
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16 

The first 4 variables are “highly significant” and the 5th is borderline.

Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      beta[j]~dnorm(0,0.001)
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
			n.iter=10000,thin=1)
print(summary(output))
plot(output)

I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.

Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)
      betaT[j]~dnorm(0,0.001)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of the indicator variables can be interpreted as posterior probabilities that the variables should be included in the model. Inspecting the output then reveals that the first three variables have a posterior probability of very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the other variables all have very small probabilities of inclusion.

This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the distribution of included variables using a “random effects” approach, leading to a more appropriate distribution for the included variables.

Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau","taub"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.

Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(pind)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
    pind~dbeta(2,8)
  }
"
model=jags.model(textConnection(modelstring),
				data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
		variable.names=c("alpha","beta","ind","tau","taub","pind"),
		n.iter=10000,thin=1)
print(summary(output))
plot(output)

It turns out that for this particular problem the posterior distribution is not very different to the previous case, as for this problem the hard-coded choice of 20% is quite consistent with the data. However, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.

Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118. [DOI, PDF, Supp, BUGS Code]
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.

Published by

darrenjw

I am Professor of Statistics within the Department of Mathematical Sciences at Durham University, UK. I am an Bayesian statistician interested in computation and applications, especially to engineering and the life sciences.

23 thoughts on “Getting started with Bayesian variable selection using JAGS and rjags”

  1. The approach of introducing 0-1 indicators in Bayesian model selection goes back well before 1998. I think it’s a bit much to say that Kuo and Mallick introduced such an idea.

    1. OK – fair comment – I’ve changed “introduced” to “used”. The point is that it is important to distinguish between the many closely related but subtly different models for BVS, and for that I’m following the terminology of O’Hara and Sillanpaa.

  2. The first time that the zero-one indicators are used are in SSVS (George and McCullogh, 1993, Jasa) which is slightly different that KM. Also see the work of Dellaportas Forster and Ntzoufras (2002) in Stats+Comp that compares the methods and introduces a more efficient gibbs sampler for variable selection. Also the same group has published a paper in 2000 with BUGS code and I have a paper in 2003 in JSS explaining in detail how you can implement these methods in BUGS. Chapter 11 in my book also illustrates how you can implement the variable selection Gibbs samplers in WinBUGS. Finally, In a recent tutorial for WinBUGS in WIRES Computational Statistics we explain how Lasso methods and Lasso based variable selection can be implemented in WinBUGS.

    Regards

    Ioannis

  3. Great post (even if I got here more >1 year later)!

    It did make me wonder whether a Bayes Factor can be constructed from the indicator variables for model selection?
    If instead of choosing pind=0.2 but rather 0.5 so that model priors were equal, would the ratio of ind[j]==1 to ind[j]==0 be the Bayes Factor for a model including variable j?

    Thanks for the great blogging!

    1. Great post even another one more year later. And I have the same question. How to calculate Bayes Factor here? Thank Darren’s great blogging and Sameer’s question. If you have figure out the answer, please share it with me.

      1. The question isn’t very well posed. A Bayes factor compares 2 models, but here we have 2^p models, and it isn’t very clear why we’d be interested in specifically comparing any two of them.

  4. Hello, thank you for your great post related to the Bayesian variable selection for linear models. I have a question, in the last section, you used Beta(2,8) prior for the inclusion probability p, it seems that often no-informative prior uniform(0,1) is imposed for p, here I understand that you want to learn about degree of sparsity so instead you use Beta prior, do you know any reference paper that mentioned Beta prior is reasonable to be used even if you don’t know any prior information for p?

    1. The beta distribution is conjugate in this context. Note that the Uniform(0,1) is a special case of the beta, so you are using a beta if you use the Uniform(0,1).

      1. Thank you for your reply. I have one more question, why do you use Beta(2,8) rather than other Beta distribution, say, Beta(1,8), Beta(0.5,0.5), Unif(0,1)?
        Right now I am doing some data analysis, I need to justify that I could choose the prior Beta(2,8) for inclusion probability p, since I found that the results are a little different if I use different beta distributions as prior for inclusion probability.

    2. See Scott and Berger 2010 “BAYES AND EMPIRICAL-BAYES MULTIPLICITY ADJUSTMENT
      IN THE VARIABLE-SELECTION PROBLEM” for comparisons between a uniform and beta prior. By integrating out the inclusion probability p it is interesting to consider the marginal prior distribution induced over the model space. See figure 1 for the log prior vs model size. Uniform prior favours relatively small and large models, for which the latter could be considered undesirable.

  5. Thanks for a nice and clear tutorial. What happens if the model is binomial? Can this approach still be used? I would imagine that the prior on the inclusion probability differs for a binomial model?

  6. Ok, thank for the answer. I am wondering, because in my model (binomial), the betas are totally different when the Indicator is included compared with the model without indicators (when using the last approach for indicators that you have presented here).

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.