Bayesian variable selection
In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference engine from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.
I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.
Consider a linear regression problem with $n$ observations and $p$ covariates, which we can write in matrix form as

$$y = \alpha\mathbf{1} + X\beta + \varepsilon,$$

where $X$ is an $n\times p$ matrix. The idea of variable selection is that probably not all of the $p$ covariates are useful for predicting $y$, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection problem amounts to choosing among the $2^p$ possible models. For large values of $p$ it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) to which the reader is referred for further details.
The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is, we write the $i$th regression coefficient $\beta_i$ as $\beta_i = I_i\beta_i^\star$, so that $\beta_i^\star$ is the regression coefficient when $I_i = 1$, and “doesn’t matter” when $I_i = 0$. There are various ways to choose the prior over $I_i$ and $\beta_i^\star$, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.
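Putting the pieces together, here is a sketch of the hierarchical structure used throughout this post (in the simplest version below the precision $\tau_b$ of the included coefficients is a fixed constant, and only later is it given its own prior; similarly, the inclusion probability $\pi$ is fixed at first and estimated at the end):

$$\begin{aligned}
y_i &\sim N\Big(\alpha + \textstyle\sum_{j=1}^{p} x_{ij}\beta_j,\; \tau^{-1}\Big), & i &= 1,\dots,n,\\
\beta_j &= I_j\,\beta_j^\star, \qquad I_j \sim \mathrm{Bern}(\pi), \qquad \beta_j^\star \sim N(0,\,\tau_b^{-1}), & j &= 1,\dots,p.
\end{aligned}$$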
Simulating some data
In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.
n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)
Let’s also fit the model by least squares.
mod=lm(y~X)
print(summary(mod))
This should give output something like the following.
Call:
lm(formula = y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.62390 -0.48917 -0.02355  0.45683  2.35448 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *  
X6           0.0136591  0.0335817   0.407 0.684379    
X7           0.0035329  0.0321935   0.110 0.912662    
X8           0.0445844  0.0329189   1.354 0.176257    
X9           0.0269504  0.0318558   0.846 0.397968    
X10          0.0114942  0.0326022   0.353 0.724575    
X11         -0.0045308  0.0330039  -0.137 0.890868    
X12          0.0111247  0.0342482   0.325 0.745455    
X13         -0.0584796  0.0317723  -1.841 0.066301 .  
X14         -0.0005005  0.0343499  -0.015 0.988381    
X15         -0.0410424  0.0334723  -1.226 0.220742    
X16          0.0084832  0.0329650   0.257 0.797026    
X17          0.0346331  0.0327433   1.058 0.290718    
X18          0.0013258  0.0328920   0.040 0.967865    
X19         -0.0086980  0.0354804  -0.245 0.806446    
X20          0.0093156  0.0342376   0.272 0.785671    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,	Adjusted R-squared: 0.707 
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16
The first 4 variables are “highly significant” and the 5th is borderline.
Saturated model
We can fit the saturated model using JAGS with the following code.
require(rjags)

# data and initial values for the saturated (all covariates included) model
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))

modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      beta[j]~dnorm(0,0.001)
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"

model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
                    n.iter=10000,thin=1)
print(summary(output))
plot(output)
I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.
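Before moving on, a quick sanity check (not in the original post): the posterior means of the regression coefficients can be compared directly with the simulated truth using the coda summary object. Something along these lines should do it:

# compare posterior means of beta[1],...,beta[p] with the true coefficients
post=summary(output)$statistics
betapost=post[grep("^beta\\[",rownames(post)),"Mean"]
print(cbind(true=beta,posterior.mean=betapost))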
Basic variable selection
Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.
# Kuo and Mallick style variable selection: each regression coefficient is the
# product of a Bernoulli indicator and an underlying coefficient betaT
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))

modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)
      betaT[j]~dnorm(0,0.001)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
  }
"

model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
                    variable.names=c("alpha","beta","ind","tau"),
                    n.iter=10000,thin=1)
print(summary(output))
plot(output)
Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of each indicator variable can be interpreted as the posterior probability that the corresponding variable should be included in the model. Inspecting the output then reveals that the first three variables have posterior inclusion probabilities very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the remaining variables all have very small probabilities of inclusion.
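For reference, here is one way (again, not from the original post) to pull out the posterior inclusion probabilities, by averaging the sampled indicator variables across the MCMC output:

# posterior inclusion probabilities are posterior means of the indicators
mat=as.matrix(output)
pip=colMeans(mat[,grep("^ind\\[",colnames(mat))])
print(round(pip,3))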
This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the coefficients of the included variables using a “random effects” approach, which leads to a more appropriate distribution for the included coefficients.
Variable selection with random effects
Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.
# as above, but with an unknown precision taub for the included coefficients
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))

modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(0.2)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
  }
"

model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
                    variable.names=c("alpha","beta","ind","tau","taub"),
                    n.iter=10000,thin=1)
print(summary(output))
plot(output)
This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.
Variable selection with random effects and a prior on the inclusion probability
The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.
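One way to see this effect on the prior for the number of included variables is to tabulate the binomial pmf implied by a fixed inclusion probability of 0.2 against the beta-binomial pmf implied by a Beta(2,8) prior (which also has mean 0.2). The following little snippet (not in the original post) does just that, and shows the beta-binomial putting noticeably more mass in both tails:

# prior on the number of included variables: Bin(p,0.2) versus the
# beta-binomial induced by pind~Beta(2,8)
k=0:p
fixed=dbinom(k,p,0.2)
betabinom=choose(p,k)*exp(lbeta(k+2,p-k+8)-lbeta(2,8))
print(round(cbind(k,fixed,betabinom),4))

The JAGS code for the modified model is as follows.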
# as above, but now with a Beta(2,8) prior on the inclusion probability pind
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))

modelstring="
  model {
    for (i in 1:n) {
      mean[i]<-alpha+inprod(X[i,],beta)
      y[i]~dnorm(mean[i],tau)
    }
    for (j in 1:p) {
      ind[j]~dbern(pind)
      betaT[j]~dnorm(0,taub)
      beta[j]<-ind[j]*betaT[j]
    }
    alpha~dnorm(0,0.0001)
    tau~dgamma(1,0.001)
    taub~dgamma(1,0.001)
    pind~dbeta(2,8)
  }
"

model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
                    variable.names=c("alpha","beta","ind","tau","taub","pind"),
                    n.iter=10000,thin=1)
print(summary(output))
plot(output)
It turns out that for this particular problem the posterior distribution is not very different from the previous case, since the hard-coded choice of 20% happens to be quite consistent with the data. In general, however, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.
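It can also be instructive to look at the marginal posterior for pind itself, to see how much the data have updated the Beta(2,8) prior; for example:

# marginal posterior summary for the inclusion probability
print(summary(output)$statistics["pind",])
print(summary(output)$quantiles["pind",])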
Conclusion
Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.
References
- O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118.
- Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.