## Posts Tagged ‘rstats’

### Statistical computing languages at the RSS

22/11/2014

On Friday the Royal Statistical Society hosted a meeting on Statistical computing languages, organised by my colleague Colin Gillespie. Four languages were presented at the meeting: Python, Scala, Matlab and Julia. I presented the talk on Scala. The slides I presented are available, in addition to the code examples and instructions on how to run them, in a public github repo I have created. The talk mainly covered material I have discussed in various previous posts on this blog. What was new was the emphasis on how easy it is to use and run Scala code, and the inclusion of complete examples and instructions on how to run them on any platform with a JVM installed. I also discussed some of the current limitations of Scala as an environment for interactive statistical data analysis and visualisation, and how these limitations could be overcome with a little directed effort. Colin suggested that all of the speakers covered a couple of examples (linear regression and a Monte Carlo integral) in “their” languages, and he provided an R solution. It is interesting to see the examples in the five different languages side by side for comparison purposes. Colin is collecting together all of the material relating to the meeting in a combined github repo, which should fill out over the next few days.

For other Scala posts on this blog, see all of my posts with a “scala” tag.

22/11/2014

### Doctoral programme in cloud computing for big data

I’ve spent much of this year working to establish our new EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, which partly explains the lack of posts on this blog in recent months. The CDT is now established, with 11 students in the first cohort, and we have begun recruiting for the second cohort, to start in September 2015. We admit roughly equal numbers of students from a Computing Science and Mathematics/Statistics background, and provide an intensive programme of inter-disciplinary training in the first (of four) years. After initial induction and cohort team-building events, the programme begins with 8 weeks of intensive bespoke training developed especially for the CDT students. For the first two weeks the cohort is split into two streams for an initial crash course. Students from a M/S background receive basic training in CS and programming, with an emphasis on object-oriented programming in Java. Students from a CS background get basic training in M/S, emphasising elementary probability and statistics and basic linear algebra. From the start of the third week the entire cohort is trained together. This whole cohort training begins with two 6 week courses running in parallel. One is Programming for big data, concentrating on R programming, Java, databases, and software development tools and techniques. The other course is Statistics for big data, a course that I developed and taught, and the main topic of this post. After the initial 8 weeks of specialist training, the students next take courses on Cloud Computing and Machine learning followed by Big data analytics and Time series data, and finally courses on Research skills and Professional skills, which run along side a Group project.

### Statistics for big data

I have found it an interesting challenge to try and put together a 15 credit (150 hours of student effort) course to start from a basic level of statistical knowledge and cover most of the important concepts and methods necessary for modelling and analysis of large and complex data sets. Although the course was not about Big Data per se, it emphasised scalability in general, and exploitation of linearity in particular. Given the mixed levels of mathematical sophistication of the cohort I felt it important to start off with a practical non-technical introduction. For this I covered the first 6 chapters of Introduction to Statistical Learning with R. This provided the students with a basic grounding in statistical modelling ideas, and practical skills in working with data in R. This book is excellent as a non-technical introduction, but is missing the underpinning mathematical and computational details necessary for understanding how to implement such methods in practice. I’ve not found Elements of Statistical Learning to be very suitable as a student text, so I revised, expanded and updated my old multivariate notes to produce a new set of notes on Multivariate Data Analysis using R (I discussed an early version of these notes in a previous post). These notes emphasise the role of numerical linear algebra for solving problems of linear statistical inference in an efficient and numerically stable manner. In particular, they illustrate the use of different matrix factorisations, such as Cholesky, QR, SVD, as well as spectral decompositions, for constructing efficient solutions. Although some of the students with a weaker mathematical background struggled with some of the more technical topics, the associated practical sessions illustrated how the techniques are used in practice.

After filling in a few additional topics of frequentist linear statistical inference (including ANOVA, contrasts, missing data, and experimental design), we moved on to computational Bayesian inference. For the introductory material, I used a variety of sources, including Bayesian reasoning and machine learning, and some introductory material from my own textbook, Stochastic modelling for systems biology. The emphasis for this part of the course was the use of flexible Bayesian hierarchical modelling using MCMC. The primary software tool used was JAGS, used via the rjags package from R. For more advanced material on Bayesian computation, I made substantial use of various posts on this blog, as well as some other material we use for teaching Bayesian inference as part of our undergraduate programmes. As for the first part of the course, the emphasis was on practical modelling and data analysis rather than theory. A few of the computer practicals from the Bayesian part of the course are on-line. I may re-write one or two of these practicals as blog posts in due course. Although the emphasis was on flexible modelling, we did touch on efficiency and exploitation of linearity, and I included an extra Chapter on linear Bayesian inference in my multivariate notes in this context. I rounded off this part of the course with a lecture on parallelisation of Monte Carlo algorithms, along the lines of my BIRS lecture on that topic.

The course finished with a group data analysis project, concerned with linear modelling and variable selection for a fairly large heterogeneous data set containing missing data, using both frequentist and Bayesian approaches. The course has just finished, and I’m reasonably happy with how it has gone, but I’ll reflect on it for a couple of weeks and get some feedback before deciding on some revisions to make before delivering it again for the new cohort next year.

### Tuning particle MCMC algorithms

08/06/2014

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

01/10/2013

## Introduction

In the previous post I showed that it is possible to couple parallel tempered MCMC chains in order to improve mixing. Such methods can be used when the target of interest is a Bayesian posterior distribution that is difficult to sample. There are (at least) a couple of obvious ways that one can temper a Bayesian posterior distribution. Perhaps the most obvious way is a simple flattening, so that if

$\pi(\theta|y) \propto \pi(\theta)\pi(y|\theta)$

is the posterior distribution, then for $t\in [0,1]$ we define

$\pi_t(\theta|y) \propto \pi(\theta|y)^t \propto [ \pi(\theta)\pi(y|\theta) ]^t.$

This corresponds with the tempering that is often used in statistical physics applications. We recover the posterior of interest for $t=1$ and tend to a flat distribution as $t\longrightarrow 0$. However, for Bayesian posterior distributions, there is a different way of tempering that is often more natural and useful, and that is to temper using the power posterior, defined by

$\pi_t(\theta|y) \propto \pi(\theta)\pi(y|\theta)^t.$

Here we again recover the posterior for $t=1$, but get the prior for $t=0$. Thus, the family of distributions forms a natural bridge or path from the prior to the posterior distributions. The power posterior is a special case of the more general concept of a geometric path from distribution $f(\theta)$ (at $t=0$) to $g(\theta)$ (at $t=1$) defined by

$h_t(\theta) \propto f(\theta)^{1-t}g(\theta)^t,$

where, in our case, $f(\cdot)$ is the prior and $g(\cdot)$ is the posterior.

So, given a posterior distribution that is difficult to sample, choose a temperature schedule

$0=t_0

and run a parallel tempering scheme as outlined in the previous post. The idea is that for small values of $t$ mixing will be good, as prior-like distributions are usually well-behaved, and the mixing of these "high temperature" chains can help to improve the mixing of the "low temperature" chains that are more like the posterior (note that $t$ is really an inverse temperature parameter the way I’ve defined it here…).

## Marginal likelihood and normalising constants

The marginal likelihood of a Bayesian model is

$\pi(y) = \int_\Theta \pi(\theta)\pi(y|\theta)d\theta.$

This quantity is of interest for many reasons, including calculation of the Bayes factor between two competing models. Note that this quantity has several different names in different fields. In particular, it is often known as the evidence, due to its role in Bayes factors. It is also worth noting that it is the normalising constant of the Bayesian posterior distribution. Although it is very easy to describe and define, it is notoriously difficult to compute reliably for complex models.

The normalising constant is conceptually very easy to estimate. From the above integral representation, it is clear that

$\pi(y) = E_\pi [ \pi(y|\theta) ]$

where the expectation is taken with respect to the prior. So, given samples from the prior, $\theta_1,\theta_2,\ldots,\theta_n$, we can construct the Monte Carlo estimate

$\displaystyle \widehat{\pi}(y) = \frac{1}{n}\sum_{i=1}^n \pi(y|\theta_i)$

and this will be a consistent estimator of the true evidence under fairly mild regularity conditions. Unfortunately, in practice it is likely to be a very poor estimator if the posterior and prior are not very similar. Now, we could also use Bayes theorem to re-write the integral as an expectation with respect to the posterior, so we could then use samples from the posterior to estimate the evidence. This leads to the harmonic mean estimator of the evidence, which has been described as the worst Monte Carlo method ever! Now it turns out that there are many different ways one can construct estimators of the evidence using samples from the prior and the posterior, some of which are considerably better than the two I’ve outlined. This is the subject of the bridge sampling paper of Meng and Wong. However, the reality is that no method will work well if the prior and posterior are very different.

If we have tempered chains, then we have a sequence of chains targeting distributions which, by construction, are not too different, and so we can use the output from tempered chains in order to construct estimators of the evidence that are more numerically stable. If we call the evidence of the $i$th chain $z_i$, so that $z_0=1$ and $z_N=\pi(y)$, then we can write the evidence in telescoping fashion as

$\displaystyle \pi(y)=z_N = \frac{z_N}{z_0} = \frac{z_1}{z_0}\times \frac{z_2}{z_1}\times \cdots \times \frac{z_N}{z_{N-1}}.$

Now the $i$th term in this product is $z_{i+1}/z_{i}$, which can be estimated using the output from the $i$th and/or $(i+1)$th chain(s). Again, this can be done in a variety of ways, using your favourite bridge sampling estimator, but the point is that the estimator should be reasonably good due to the fact that the $i$th and $(i+1)$th targets are very similar. For the power posterior, the simplest method is to write

$\displaystyle \frac{z_{i+1}}{z_i} = \frac{\displaystyle \int \pi(\theta)\pi(y|\theta)^{t_{i+1}}d\theta}{z_i} = \int \pi(y|\theta)^{t_{i+1}-t_i}\times \frac{\pi(y|\theta)^{t_i}\pi(\theta)}{z_i}d\theta$

$\displaystyle \mbox{}\qquad = E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right],$

where the expectation is with respect to the $i$th target, and hence can be estimated in the usual way using samples from the $i$th chain.

For numerical stability, in practice we compute the log of the evidence as

$\displaystyle \log\pi(y) = \sum_{i=0}^{N-1} \log\frac{z_{i+1}}{z_i} = \sum_{i=0}^{N-1} \log E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right]$

$\displaystyle = \sum_{i=0}^{N-1} \log E_i\left[\exp\{(t_{i+1}-t_i)\log\pi(y|\theta)\}\right].\qquad(\dagger)$

The above expression is exact, and is the obvious formula to use for computation. However, it is clear that if $t_i$ and $t_{i+1}$ are sufficiently close, it will be approximately OK to switch the expectation and exponential, giving

$\displaystyle \log\pi(y) \approx \sum_{i=0}^{N-1}(t_{i+1}-t_i)E_i\left[\log\pi(y|\theta)\right].$

In the continuous limit, this gives rise to the well-known path sampling identity,

$\displaystyle \log\pi(y) = \int_0^1 E_t\left[\log\pi(y|\theta)\right]dt.$

So, an alternative approach to computing the evidence is to use the samples to approximately numerically integrate the above integral, say, using the trapezium rule. However, it isn’t completely clear (to me) that this is better than using $(\dagger)$ directly, since there there is no numerical integration error to worry about.

## Numerical illustration

We can illustrate these ideas using the simple double potential well example from the previous post. Now that example doesn’t really correspond to a Bayesian posterior, and is tempered directly, rather than as a power posterior, but essentially the same ideas follow for general parallel tempered distributions. In general, we can use the sample to estimate the ratio of the last and first normalising constants, $z_N/z_0$. Here it isn’t obvious why we’d want to know that, but we’ll compute it anyway to illustrate the method. As before, we expand as a telescopic product, where the $i$th term is now

$\displaystyle \frac{z_{i+1}}{z_i} = E_i\left[\exp\{-(\gamma_{i+1}-\gamma_i)(x^2-1)^2\}\right].$

A Monte Carlo estimate of each of these terms is formed using the samples from the $i$th chain, and the logs of these are then summed to give $\log(z_N/z_0)$. A complete R script to run the Metropolis coupled sampler and compute the evidence is given below.

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

temps=2^(0:3)
iters=1e5

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mat=chains()
mat=mat[,1:(length(temps)-1)]
diffs=diff(temps)
mat=(mat*mat-1)^2
mat=-t(diffs*t(mat))
mat=exp(mat)
logEvidence=sum(log(colMeans(mat)))
message(paste("The log of the ratio of the last and first normalising constants is",logEvidence))


It turns out that these double well potential densities are tractable, and so the normalising constants can be computed exactly. So, with a little help from Wolfram Alpha, I compute log of the ratio of the last and first normalising constants to be approximately -1.12. Hopefully the above script will output something a bit like that…

29/09/2013

## Introduction

Parallel tempering is a method for getting Metropolis-Hastings based MCMC algorithms to work better on multi-modal distributions. Although the idea has been around for more than 20 years, and works well on many problems, it still isn’t routinely used in applications. I think this is partly because relatively few people understand how it works, and partly due to the perceived difficulty of implementation. I hope to show here that it is both very easy to understand and to implement. It is also rather easy to implement in parallel on multi-core systems, though I won’t get into that in this post.

## Sampling a double-well potential

To illustrate the ideas, we need a toy multi-modal distribution to sample. There are obviously many possibilities here, but I rather like to use a double potential well distribution. The simplest version of this assumes a potential function of the form

$U(x) = \gamma (x^2-1)^2$

for some given potential barrier height $\gamma$. The potential function $U(x)$ corresponds to the probability density function

$\pi(x) \propto \exp\{-U(x)\}.$

There is a physical explanation for the terminology, via Langevin diffusions, but that isn’t really important here. It is fine to just think of potentials as being a (negative) log-density scale. On this scale, high potential barrier heights correspond to regions of very low probability density. We can set up a double well potential and plot it for the case $\gamma=4$ in R with the following code

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)


Incidentally, the function curried(), which curries the potential function, did not include the message() statement when I first wrote it. It mostly worked fine, but some of the code below didn’t behave as I expected. I inserted the message() statement to figure out what was going on, and the code started behaving perfectly – a beautiful example of a Heisenbug! The reason is that the message statement is not redundant – it forces evaluation of the gam variable, which is necessary in some cases, due to the lazy evaluation model that R uses for function arguments. If you don’t like the message() statement, replacing it with a simple gam works just as well.

Anyway, the point is that we have defined a multi-modal density, and that a Metropolis-Hastings algorithm initialised in one of the modes will have a hard time jumping to the other mode, and the difficulty of this jump will increase as we increase the value of $\gamma$.

We can write a simple function for a Metropolis algorithm targeting a particular potential function as follows.

chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}


We can use this code to run some chains for a few different values of $\gamma$ as follows.

temps=2^(0:3)
iters=1e5

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))


We see that as $\gamma$ increases, the chain jumps between modes less frequently. Indeed, for $\gamma=8$, the chain fails to jump to the second mode at all during this particular run of 100,000 iterations. That’s a problem if we are really interested in sampling from distributions like this. Of course, for this particular problem, there are all kinds of ways to fix this sampler, but the point is to try and develop methods that will work in high-dimensional situations where we cannot just “look” at what is going wrong.

Before we see how to couple the chains and improve the mixing, it is useful to think how to re-write this sampler. Above we sampled each chain in turn for different barrier heights. To couple the chains, we need to sample them together, sampling each iteration for all of the chains in turn. This is very easy to do. The code below isn’t especially efficient, but it is written to look very similar to the single chain code above.

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))


This code should behave identically to the previous code, simulating independent parallel MCMC chains. However, the code is now in the form that is very easy to modify to couple the chains together in an attempt to improve mixing.

## Coupling parallel chains

In the above example the chains we are simulating are all independent of one another. Some mix reasonably well, and some mix very badly. But the distributions are all related to one another, changing gradually as the barrier height changes. The idea behind coupling the chains is to try and swap states between the chains to use the chains which are mixing well to improve the mixing of the chains which aren’t. In particular, suppose interest is in the target of the worst mixing chain. The other chains can be constructed as “tempered” versions of the target of interest, here by raising it to a power between 0 and 1, with 0 corresponding to a complete flattening of the distribution, and 1 corresponding to the desired target. The use of parallel chains to improve mixing in this way is usually referred to as parallel tempering, but also sometimes as $(\text{MC})^3$. In the context of Bayesian inference, tempering using the “power posterior” can often be more natural and useful (to be discussed in a subsequent post).

So, how do we swap states between the chains without affecting the target distributions? As always, just use a Metropolis-Hastings update… To keep things simple, let’s just suppose that we have two (independent, parallel) chains, one with target $f(x)$ and the other with target $g(y)$. We can consider these chains to be evolving together, with joint target $\pi(x,y)=f(x)g(y)$. The updates chosen to update the within-chain states will obviously preserve this joint target. Now we consider how to swap states between the two chains without destroying the target. We simply propose a swap of $x$ and $y$. That is, we propose to move from $(x,y)$ to $(x^\star,y^\star)$, where $x^\star=y$ and $y^\star=x$. We are proposing this move as a standard Metropolis-Hastings update of the joint chain. We may therefore wonder about exactly what the proposal density for this move is. In fact, it is a point mass at the swapped values, and therefore has density

$q((x^\star,y^\star)|(x,y)) = \delta(x^\star-y)\delta(y^\star-x),$

but it really doesn’t matter, as it is clearly a symmetric proposal, and hence will drop out of the M-H ratio. Our acceptance probability is therefore $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star,y^\star)}{\pi(x,y)} = \frac{\pi(y,x)}{\pi(x,y)} = \frac{f(y)g(x)}{f(x)g(y)}.$

So, if we use this acceptance probability whenever we propose a swap of the states between two chains, then we will preserve the joint target, and hence the marginal targets and asymptotic independence of the target. However, the chains themselves will no longer be independent of one another. They will be coupled – Metropolis coupled. This is very easy to implement. We can just add a few lines to our previous chains() function as follows

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}


This can be used as before, but now gives results as illustrated in the following plots.

We see immediately from the plots that whilst the individual target distributions remain unchanged, the mixing of the chains is greatly improved (though still far from perfect). Note that in the code above I just picked two chains at random to propose a state swap. In practice people typically only propose to swap states between chains which are adjacent – i.e. most similar, since proposed swaps between chains with very different targets are unlikely to be accepted. Since implementation of either strategy is very easy, I would recommend trying both to see which works best.

## Parallel implementation

It should be clear that there are opportunities for parallelising the above algorithm to make effective use of modern multi-core hardware. An approach using OpenMP with C++ is discussed in this blog post. Also note that if the state space of the chains is large, it can be much more efficient to swap temperatures between the chains rather than states – so long as you are careful about keeping track of stuff – this is explored in detail in Altekar et al (’04).

## Complete R script

For convenience, a complete R script to run all of the examples in this post is given below.

# temper.R
# functions for messing around with tempering MCMC

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
#gam
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)

# global settings
temps=2^(0:3)
iters=1e5

# First look at some independent chains
chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))

# Next, let's generate 5 chains at once...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# Next let's couple the chains...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# eof


### Summary stats for ABC

01/09/2013

#### Introduction

In the previous post I gave a very brief introduction to ABC, including a simple example for inferring the parameters of a Markov process given some time series observations. Towards the end of the post I observed that there were (at least!) two potential problems with scaling up the simple approach described, one relating to the dimension of the data and the other relating to the dimension of the parameter space. Before moving on to the (to me, more interesting) problem of the dimension of the parameter space, I will briefly discuss the data dimension problem in this post, and provide a couple of references for further reading.

#### Summary stats

Recall that the simple rejection sampling approach to ABC involves first sampling a candidate parameter $\theta^\star$ from the prior and then sampling a corresponding data set $x^\star$ from the model. This simulated data set is compared with the true data $x$ using some (pseudo-)norm, $\Vert\cdot\Vert$, and accepting $\theta^\star$ if the simulated data set is sufficiently close to the true data, $\Vert x^\star - x\Vert <\epsilon$. It should be clear that if we are using a proper norm then as $\epsilon$ tends to zero the distribution of the accepted values tends to the desired posterior distribution of the parameters given the data.

However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context. If we can find a finite set of sufficient statistics, $s(x)$ for $\theta$, then it should be clear that replacing the acceptance criterion with $\Vert s(x^\star) - s(x)\Vert <\epsilon$ will also lead to a scheme tending to the true posterior as $\epsilon$ tends to zero (assuming a proper norm on the space of sufficient statistics), and will typically be better than the naive method, since the sufficient statistics will be of lower dimension and less “noisy” that the raw data, leading to higher acceptance rates with no loss of information.

Unfortunately for most problems of practical interest it is not possible to find low-dimensional sufficient statistics, and so people in practice use domain knowledge and heuristics to come up with a set of summary statistics, $s(x)$ which they hope will closely approximate sufficient statistics. There is still a question as to how these statistics should be weighted or transformed to give a particular norm. This can be done using theory or heuristics, and some relevant references for this problem are given at the end of the post.

#### Implementation in R

Let’s now look at the problem from the previous post. Here, instead of directly computing the Euclidean distance between the real and simulated data, we will look at the Euclidean distance between some (normalised) summary statistics. First we will load some packages and set some parameters.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e7
bs=1e5
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))


Next we will define some summary stats for a univariate time series – the mean, the (log) variance, and the first two auto-correlations.

ssinit <- function(vec)
{
ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3] c(mean(vec),log(var(vec)+1),ac23) }  Once we have this, we can define some stats for a bivariate time series by combining the stats for the two component series, along with the cross-correlation between them. ssi <- function(ts) { c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2])) }  This gives a set of summary stats, but these individual statistics are potentially on very different scales. They can be transformed and re-weighted in a variety of ways, usually on the basis of a pilot run which gives some information about the distribution of the summary stats. Here we will do the simplest possible thing, which is to normalise the variance of the stats on the basis of a pilot run. This is not at all optimal – see the references at the end of the post for a description of better methods. message("Batch 0: Pilot run batch") prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) sumstats=mclapply(samples,ssi) sds=apply(sapply(sumstats,c),1,sd) print(sds) # now define a standardised distance ss<-function(ts) { ssi(ts)/sds } ss0=ss(LVperfect) distance <- function(ts) { diff=ss(ts)-ss0 sum(diff*diff) }  Now we have a normalised distance function defined, we can proceed exactly as before to obtain an ABC posterior via rejection sampling. post=NULL for (i in 1:batches) { message(paste("batch",i,"of",batches)) prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) dist=mclapply(samples,distance) dist=sapply(dist,c) cutoff=quantile(dist,1000/N,na.rm=TRUE) post=rbind(post,prior[dist<cutoff,]) } message(paste("Finished. Kept",dim(post)[1],"simulations"))  Having obtained the posterior, we can use the following code to plot the results. th=c(th1 = 1, th2 = 0.005, th3 = 0.6) op=par(mfrow=c(2,3)) for (i in 1:3) { hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep="")) abline(v=th[i],lwd=2,col=2) } for (i in 1:3) { hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep="")) abline(v=log(th[i]),lwd=2,col=2) } par(op)  This gives the plot shown below. From this we can see that the ABC posterior obtained here is very similar to that obtained in the previous post using the full data. Here the dimension reduction is not that great – reducing from 32 data points to 9 summary statistics – and so the improvement in performance is not that noticable. But in higher dimensional problems reducing the dimension of the data is practically essential. #### Summary and References As before, I recommend the wikipedia article on approximate Bayesian computation for further information and a comprehensive set of references for further reading. Here I just want to highlight two references particularly relevant to the issue of summary statistics. It is quite difficult to give much practical advice on how to construct good summary statistics, but how to transform a set of summary stats in a “good” way is a problem that is reasonably well understood. In this post I did something rather naive (normalising the variance), but the following two papers describe much better approaches. I still haven’t addressed the issue of a high-dimensional parameter space – that will be the topic of a subsequent post. #### The complete R script require(smfsb) require(parallel) options(mc.cores=4) data(LVdata) N=1e6 bs=1e5 batches=N/bs message(paste("N =",N," | bs =",bs," | batches =",batches)) ssinit <- function(vec) { ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3]
c(mean(vec),log(var(vec)+1),ac23)
}

ssi <- function(ts)
{
c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2]))
}

message("Batch 0: Pilot run batch")
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
sumstats=mclapply(samples,ssi)
sds=apply(sapply(sumstats,c),1,sd)
print(sds)

# now define a standardised distance
ss<-function(ts)
{
ssi(ts)/sds
}

ss0=ss(LVperfect)

distance <- function(ts)
{
diff=ss(ts)-ss0
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N,na.rm=TRUE)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

# plot the results
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
op=par(mfrow=c(2,3))
for (i in 1:3) {
hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep=""))
abline(v=th[i],lwd=2,col=2)
}
for (i in 1:3) {
hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep=""))
abline(v=log(th[i]),lwd=2,col=2)
}
par(op)


### Introduction to Approximate Bayesian Computation (ABC)

31/03/2013

Many of the posts in this blog have been concerned with using MCMC based methods for Bayesian inference. These methods are typically “exact” in the sense that they have the exact posterior distribution of interest as their target equilibrium distribution, but are obviously “approximate”, in that for any finite amount of computing time, we can only generate a finite sample of correlated realisations from a Markov chain that we hope is close to equilibrium.

Approximate Bayesian Computation (ABC) methods go a step further, and generate samples from a distribution which is not the true posterior distribution of interest, but a distribution which is hoped to be close to the real posterior distribution of interest. There are many variants on ABC, and I won’t get around to explaining all of them in this blog. The wikipedia page on ABC is a good starting point for further reading. In this post I’ll explain the most basic rejection sampling version of ABC, and in a subsequent post, I’ll explain a sequential refinement, often referred to as ABC-SMC. As usual, I’ll use R code to illustrate the ideas.

#### Basic idea

There is a close connection between “likelihood free” MCMC methods and those of approximate Bayesian computation (ABC). To keep things simple, consider the case of a perfectly observed system, so that there is no latent variable layer. Then there are model parameters $\theta$ described by a prior $\pi(\theta)$, and a forwards-simulation model for the data $x$, defined by $\pi(x|\theta)$. It is clear that a simple algorithm for simulating from the desired posterior $\pi(\theta|x)$ can be obtained as follows. First simulate from the joint distribution $\pi(\theta,x)$ by simulating $\theta^\star\sim\pi(\theta)$ and then $x^\star\sim \pi(x|\theta^\star)$. This gives a sample $(\theta^\star,x^\star)$ from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless $x^\star$ matches the true data $x$ clearly gives a sample from the required posterior distribution.

#### Exact rejection sampling

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $x^\star=x$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

This algorithm is exact, and for discrete $x$ will have a non-zero acceptance rate. However, in most interesting problems the rejection rate will be intolerably high. In particular, the acceptance rate will typically be zero for continuous valued $x$.

#### ABC rejection sampling

The ABC “approximation” is to accept values provided that $x^\star$ is “sufficiently close” to $x$. In the first instance, we can formulate this as follows.

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $\Vert x^\star-x\Vert< \epsilon$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

Euclidean distance is usually chosen as the norm, though any norm can be used. This procedure is “honest”, in the sense that it produces exact realisations from

$\theta^\star\big|\Vert x^\star-x\Vert < \epsilon.$

For suitable small choice of $\epsilon$, this will closely approximate the true posterior. However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context.

In the simplest case, this is done by forming a (vector of) summary statistic(s), $s(x^\star)$ (ideally a sufficient statistic), and accepting provided that $\Vert s(x^\star)-s(x)\Vert<\epsilon$ for some suitable choice of metric and $\epsilon$. We will return to this issue in a subsequent post.

#### Inference for an intractable Markov process

I’ll illustrate ABC in the context of parameter inference for a Markov process with an intractable transition kernel: the discrete stochastic Lotka-Volterra model. A function for simulating exact realisations from the intractable kernel is included in the smfsb CRAN package discussed in a previous post. Using pMCMC to solve the parameter inference problem is discussed in another post. It may be helpful to skim through those posts quickly to become familiar with this problem before proceeding.

So, for a given proposed set of parameters, realisations from the process can be sampled using the functions simTs and stepLV (from the smfsb package). We will use the sample data set LVperfect (from the LVdata dataset) as our “true”, or “target” data, and try to find parameters for the process which are consistent with this data. A fairly minimal R script for this problem is given below.

require(smfsb)
data(LVdata)

N=1e5
message(paste("N =",N))
prior=cbind(th1=exp(runif(N,-6,2)),th2=exp(runif(N,-6,2)),th3=exp(runif(N,-6,2)))
rows=lapply(1:N,function(i){prior[i,]})
message("starting simulation")
samples=lapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
message("finished simulation")

distance<-function(ts)
{
diff=ts-LVperfect
sum(diff*diff)
}

message("computing distances")
dist=lapply(samples,distance)
message("distances computed")

dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=prior[dist<cutoff,]

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


This script should take 5-10 minutes to run on a decent laptop, and will result in histograms of the posterior marginals for the components of $\theta$ and $\log(\theta)$. Note that I have deliberately adopted a functional programming style, making use of the lapply function for the most computationally intensive steps. The reason for this will soon become apparent. Note that rather than pre-specifying a cutoff $\epsilon$, I’ve instead picked a quantile of the distance distribution. This is common practice in scenarios where the distance is difficult to have good intuition about. In fact here I’ve gone a step further and chosen a quantile to give a final sample of size 1000. Obviously then in this case I could have just selected out the top 1000 directly, but I wanted to illustrate the quantile based approach.

One problem with the above script is that all proposed samples are stored in memory at once. This is problematic for problems involving large numbers of samples. However, it is convenient to do simulations in large batches, both for computation of quantiles, and also for efficient parallelisation. The script below illustrates how to implement a batch parallelisation strategy for this problem. Samples are generated in batches of size 10^4, and only the best fitting samples are stored before the next batch is processed. This strategy can be used to get a good sized sample based on a more stringent acceptance criterion at the cost of addition simulation time. Note that the parallelisation code will only work with recent versions of R, and works by replacing calls to lapply with the parallel version, mclapply. You should notice an appreciable speed-up on a multicore machine.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e5
bs=1e4
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))

distance<-function(ts)
{
diff=ts[,1]-LVprey
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


Note that there is an additional approximation here, since the top 100 samples from each of 10 batches of simulations won’t correspond exactly to the top 1000 samples overall, but given all of the other approximations going on in ABC, this one is likely to be the least of your worries.

Now, if you compare the approximate posteriors obtained here with the “true” posteriors obtained in an earlier post using pMCMC, you will see that these posteriors are really quite poor. However, this isn’t a very fair comparison, since we’ve only done 10^5 simulations. Jacking N up to 10^7 gives the ABC posterior below.

ABC posterior from 10^7 samples

This is a bit better, but really not great. There are two basic problems with the simplistic ABC strategy adopted here, one related to the dimensionality of the data and the other the dimensionality of the parameter space. The most basic problem that we have here is the dimensionality of the data. We have 16 (bivariate) observations, so we want our stochastic simulation to shoot at a point in a 16- or 32-dimensional space. That’s a tough call. The standard way to address this problem is to reduce the dimension of the data by introducing a few carefully chosen summary statistics and then just attempting to hit those. I’ll illustrate this in a subsequent post. The other problem is that often the prior and posterior over the parameters are quite different, and this problem too is exacerbated as the dimension of the parameter space increases. The standard way to deal with this is to sequentially adapt from the prior through a sequence of ABC posteriors. I’ll examine this in a future post as well, once I’ve also posted an introduction to the use of sequential Monte Carlo (SMC) samplers for static problems.

For further reading, I suggest browsing the reference list of the Wikipedia page for ABC. Also look through the list of software on that page. In particular, note that there is a CRAN package, abc, providing R support for ABC. There is a vignette for this package which should be sufficient to get started.

### Getting started with Bayesian variable selection using JAGS and rjags

20/11/2012

#### Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

$y = \alpha \boldmath{1} + X\beta + \varepsilon,$

where $X$ is an $n\times p$ matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the $2^p$ possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable $I_i$ for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient $\beta_i$ as $\beta_i=I_i\beta^\star_i$, so that $\beta^\star_i$ is the regression coefficient when $I_i=1$, and “doesn’t matter” when $I_i=0$. There are various ways to choose the prior over $I_i$ and $\beta^\star_i$, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.

#### Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)


Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))


This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
Min       1Q   Median       3Q      Max
-1.62390 -0.48917 -0.02355  0.45683  2.35448

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *
X6           0.0136591  0.0335817   0.407 0.684379
X7           0.0035329  0.0321935   0.110 0.912662
X8           0.0445844  0.0329189   1.354 0.176257
X9           0.0269504  0.0318558   0.846 0.397968
X10          0.0114942  0.0326022   0.353 0.724575
X11         -0.0045308  0.0330039  -0.137 0.890868
X12          0.0111247  0.0342482   0.325 0.745455
X13         -0.0584796  0.0317723  -1.841 0.066301 .
X14         -0.0005005  0.0343499  -0.015 0.988381
X15         -0.0410424  0.0334723  -1.226 0.220742
X16          0.0084832  0.0329650   0.257 0.797026
X17          0.0346331  0.0327433   1.058 0.290718
X18          0.0013258  0.0328920   0.040 0.967865
X19         -0.0086980  0.0354804  -0.245 0.806446
X20          0.0093156  0.0342376   0.272 0.785671
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16


The first 4 variables are “highly significant” and the 5th is borderline.

#### Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
beta[j]~dnorm(0,0.001)
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.

#### Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,0.001)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of the indicator variables can be interpreted as posterior probabilities that the variables should be included in the model. Inspecting the output then reveals that the first three variables have a posterior probability of very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the other variables all have very small probabilities of inclusion.

This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the distribution of included variables using a “random effects” approach, leading to a more appropriate distribution for the included variables.

#### Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.

#### Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(pind)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
pind~dbeta(2,8)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub","pind"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


It turns out that for this particular problem the posterior distribution is not very different to the previous case, as for this problem the hard-coded choice of 20% is quite consistent with the data. However, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.

#### Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

#### References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118. [DOI, PDF, Supp, BUGS Code]
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.

### Keeping R up to date on Ubuntu linux

10/11/2012

R is included as part of the standard Ubuntu distribution, and can be installed with a command like

sudo apt-get install r-base


Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.4 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.4, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.

Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I generally use the Bristol mirror, and so I will use it in the following.

sudo su
echo "deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
apt-get update


That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in line 2 above. The first is the address of the CRAN mirror (here “www.stats.bris.ac.uk”). The second is the name of the Ubuntu distro you are running (here “precise”).

### Inlining JAGS models in R scripts for rjags

02/10/2012

JAGS (Just Another Gibbs Sampler) is a general purpose MCMC engine similar to WinBUGS and OpenBUGS. I have a slight preference for JAGS as it is free and portable, works well on Linux, and interfaces well with R. It is tempting to write a tutorial introduction to JAGS and the corresponding R package, rjags, but there is a lot of material freely available on-line already, so it isn’t really necessary. If you are new to JAGS, I suggest starting with Getting Started with JAGS, rjags, and Bayesian Modelling. In this post I want to focus specifically on the problem of inlining JAGS models in R scripts as it can be very useful, and is usually skipped in introductory material.

#### JAGS and rjags on Ubuntu Linux

On recent versions of Ubuntu, assuming that R is already installed, the simplest way to install JAGS and rjags is using the command

sudo apt-get install jags r-cran-rjags


Now rjags is a CRAN package, so it can be installed in the usual way with install.packages("rjags"). However, taking JAGS and rjags direct from the Ubuntu repos should help to ensure that the versions of JAGS and rjags are in sync, which is a good thing.

#### Toy model

For this post, I will use a trivial toy example of inference for the mean and precision of a normal random sample. That is, we will assume data

$X_i \sim N(\mu,1/\tau),\quad i=1,2,\ldots n,$

with priors on $\mu$ and $\tau$ of the form

$\tau\sim Ga(a,b),\quad \mu \sim N(c,1/d).$

#### Separate model file

The usual way to fit this model in R using rjags is to first create a separate file containing the model

  model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}


Then, supposing that this file is called jags1.jags, an R session to fit the model could be constructed as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
model=jags.model("jags1.jags",data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is all fine, and it can be very useful to have the model declared in a separate file, especially if the model is large and complex, and you might want to use it from outside R. However, very often for simple models it can be quite inconvenient to have the model separate from the R script which runs it. In particular, people often have issues with naming files correctly, making sure R is looking in the correct directory, moving the model with the R script, etc. So it would be nice to be able to just inline the JAGS model within an R script, to keep the model, the data, and the analysis all together in one place.

#### Using a temporary file

What we want to do is declare the JAGS model within a text string inside an R script and then somehow pass this into the call to jags.model(). The obvious way to do this is to write the string to a text file, and then pass the name of that text file into jags.model(). This works fine, but some care needs to be taken to make sure this works in a generic platform independent way. For example, you need to write to a file that you know doesn’t exist in a directory that is writable using a filename that is valid on the OS on which the script is being run. For this purpose R has an excellent little function called tempfile() which solves exactly this naming problem. It should always return the name of a file which does not exist in a writable directly within the standard temporary file location on the OS on which R is being run. This function is exceedingly useful for all kinds of things, but doesn’t seem to be very well known by newcomers to R. Using this we can construct a stand-alone R script to fit the model as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
tmpf=tempfile()
tmps=file(tmpf,"w")
cat(modelstring,file=tmps)
close(tmps)
model=jags.model(tmpf,data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


Now, although there is a file containing the model temporarily involved, the script is stand-alone and portable.

#### Using a text connection

The solution above works fine, but still involves writing a file to disk and reading it back in again, which is a bit pointless in this case. We can solve this by using another under-appreciated R function, textConnection(). Many R functions which take a file as an argument will work fine if instead passed a textConnection object, and the rjags function jags.model() is no exception. Here, instead of writing the model string to disk, we can turn it into a textConnection object and then pass that directly into jags.model() without ever actually writing the model file to disk. This is faster, neater and cleaner. An R session which takes this approach is given below.

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
model=jags.model(textConnection(modelstring), data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is my preferred way to use rjags. Note again that textConnection objects have many and varied uses and applications that have nothing to do with rjags.

The stupidest thing...