## Posts Tagged ‘rstats’

### Introduction to Approximate Bayesian Computation (ABC)

31/03/2013

Many of the posts in this blog have been concerned with using MCMC based methods for Bayesian inference. These methods are typically “exact” in the sense that they have the exact posterior distribution of interest as their target equilibrium distribution, but are obviously “approximate”, in that for any finite amount of computing time, we can only generate a finite sample of correlated realisations from a Markov chain that we hope is close to equilibrium.

Approximate Bayesian Computation (ABC) methods go a step further, and generate samples from a distribution which is not the true posterior distribution of interest, but a distribution which is hoped to be close to the real posterior distribution of interest. There are many variants on ABC, and I won’t get around to explaining all of them in this blog. The wikipedia page on ABC is a good starting point for further reading. In this post I’ll explain the most basic rejection sampling version of ABC, and in a subsequent post, I’ll explain a sequential refinement, often referred to as ABC-SMC. As usual, I’ll use R code to illustrate the ideas.

#### Basic idea

There is a close connection between “likelihood free” MCMC methods and those of approximate Bayesian computation (ABC). To keep things simple, consider the case of a perfectly observed system, so that there is no latent variable layer. Then there are model parameters $\theta$ described by a prior $\pi(\theta)$, and a forwards-simulation model for the data $x$, defined by $\pi(x|\theta)$. It is clear that a simple algorithm for simulating from the desired posterior $\pi(\theta|x)$ can be obtained as follows. First simulate from the joint distribution $\pi(\theta,x)$ by simulating $\theta^\star\sim\pi(\theta)$ and then $x^\star\sim \pi(x|\theta^\star)$. This gives a sample $(\theta^\star,x^\star)$ from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless $x^\star$ matches the true data $x$ clearly gives a sample from the required posterior distribution.

#### Exact rejection sampling

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $x^\star=x$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

This algorithm is exact, and for discrete $x$ will have a non-zero acceptance rate. However, in most interesting problems the rejection rate will be intolerably high. In particular, the acceptance rate will typically be zero for continuous valued $x$.

#### ABC rejection sampling

The ABC “approximation” is to accept values provided that $x^\star$ is “sufficiently close” to $x$. In the first instance, we can formulate this as follows.

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $\Vert x^\star-x\Vert< \epsilon$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

Euclidean distance is usually chosen as the norm, though any norm can be used. This procedure is “honest”, in the sense that it produces exact realisations from

$\theta^\star\big|\Vert x^\star-x\Vert < \epsilon.$

For suitable small choice of $\epsilon$, this will closely approximate the true posterior. However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context.

In the simplest case, this is done by forming a (vector of) summary statistic(s), $s(x^\star)$ (ideally a sufficient statistic), and accepting provided that $\Vert s(x^\star)-s(x)\Vert<\epsilon$ for some suitable choice of metric and $\epsilon$. We will return to this issue in a subsequent post.

#### Inference for an intractable Markov process

I’ll illustrate ABC in the context of parameter inference for a Markov process with an intractable transition kernel: the discrete stochastic Lotka-Volterra model. A function for simulating exact realisations from the intractable kernel is included in the smfsb CRAN package discussed in a previous post. Using pMCMC to solve the parameter inference problem is discussed in another post. It may be helpful to skim through those posts quickly to become familiar with this problem before proceeding.

So, for a given proposed set of parameters, realisations from the process can be sampled using the functions simTs and stepLV (from the smfsb package). We will use the sample data set LVperfect (from the LVdata dataset) as our “true”, or “target” data, and try to find parameters for the process which are consistent with this data. A fairly minimal R script for this problem is given below.

require(smfsb)
data(LVdata)

N=1e5
message(paste("N =",N))
prior=cbind(th1=exp(runif(N,-4,2)),th2=exp(runif(N,-4,2)),th3=exp(runif(N,-4,2)))
rows=lapply(1:N,function(i){prior[i,]})
message("starting simulation")
samples=lapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
message("finished simulation")

distance<-function(ts)
{
diff=ts-LVperfect
sum(diff*diff)
}

message("computing distances")
dist=lapply(samples,distance)
message("distances computed")

dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=prior[dist<cutoff,]

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


This script should take 5-10 minutes to run on a decent laptop, and will result in histograms of the posterior marginals for the components of $\theta$ and $\log(\theta)$. Note that I have deliberately adopted a functional programming style, making use of the lapply function for the most computationally intensive steps. The reason for this will soon become apparent. Note that rather than pre-specifying a cutoff $\epsilon$, I’ve instead picked a quantile of the distance distribution. This is common practice in scenarios where the distance is difficult to have good intuition about. In fact here I’ve gone a step further and chosen a quantile to give a final sample of size 1000. Obviously then in this case I could have just selected out the top 1000 directly, but I wanted to illustrate the quantile based approach.

One problem with the above script is that all proposed samples are stored in memory at once. This is problematic for problems involving large numbers of samples. However, it is convenient to do simulations in large batches, both for computation of quantiles, and also for efficient parallelisation. The script below illustrates how to implement a batch parallelisation strategy for this problem. Samples are generated in batches of size 10^4, and only the best fitting samples are stored before the next batch is processed. This strategy can be used to get a good sized sample based on a more stringent acceptance criterion at the cost of addition simulation time. Note that the parallelisation code will only work with recent versions of R, and works by replacing calls to lapply with the parallel version, mclapply. You should notice an appreciable speed-up on a multicore machine.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e5
bs=1e4
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))

distance<-function(ts)
{
diff=ts[,1]-LVprey
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-4,2)),th2=exp(runif(bs,-4,2)),th3=exp(runif(bs,-4,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


Note that there is an additional approximation here, since the top 100 samples from each of 10 batches of simulations won’t correspond exactly to the top 1000 samples overall, but given all of the other approximations going on in ABC, this one is likely to be the least of your worries.

Now, if you compare the approximate posteriors obtained here with the “true” posteriors obtained in an earlier post using pMCMC, you will see that these posteriors are really quite poor. However, this isn’t a very fair comparison, since we’ve only done 10^5 simulations. Jacking N up to 10^7 gives the ABC posterior below.

ABC posterior from 10^7 samples

This is a bit better, but really not great. There are two basic problems with the simplistic ABC strategy adopted here, one related to the dimensionality of the data and the other the dimensionality of the parameter space. The most basic problem that we have here is the dimensionality of the data. We have 16 bivariate observations, so we want our stochastic simulation to shoot at a point in 32-dimensional space. That’s a tough call. The standard way to address this problem is to reduce the dimension of the data by introducing a few carefully chosen summary statistics and then just attempting to hit those. I’ll illustrate this in a subsequent post. The other problem is that often the prior and posterior over the parameters are quite different, and this problem too is exacerbated as the dimension of the parameter space increases. The standard way to deal with this is to sequentially adapt from the prior through a sequence of ABC posteriors. I’ll examine this in a future post as well, once I’ve also posted an introduction to the use of sequential Monte Carlo (SMC) samplers for static problems.

For further reading, I suggest browsing the reference list of the Wikipedia page for ABC. Also look through the list of software on that page. In particular, note that there is a CRAN package, abc, providing R support for ABC. There is a vignette for this package which should be sufficient to get started.

### Getting started with Bayesian variable selection using JAGS and rjags

20/11/2012

#### Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

$y = \alpha \boldmath{1} + X\beta + \varepsilon,$

where $X$ is an $n\times p$ matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the $2^p$ possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable $I_i$ for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient $\beta_i$ as $\beta_i=I_i\beta^\star_i$, so that $\beta^\star_i$ is the regression coefficient when $I_i=1$, and “doesn’t matter” when $I_i=0$. There are various ways to choose the prior over $I_i$ and $\beta^\star_i$, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.

#### Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)


Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))


This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
Min       1Q   Median       3Q      Max
-1.62390 -0.48917 -0.02355  0.45683  2.35448

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *
X6           0.0136591  0.0335817   0.407 0.684379
X7           0.0035329  0.0321935   0.110 0.912662
X8           0.0445844  0.0329189   1.354 0.176257
X9           0.0269504  0.0318558   0.846 0.397968
X10          0.0114942  0.0326022   0.353 0.724575
X11         -0.0045308  0.0330039  -0.137 0.890868
X12          0.0111247  0.0342482   0.325 0.745455
X13         -0.0584796  0.0317723  -1.841 0.066301 .
X14         -0.0005005  0.0343499  -0.015 0.988381
X15         -0.0410424  0.0334723  -1.226 0.220742
X16          0.0084832  0.0329650   0.257 0.797026
X17          0.0346331  0.0327433   1.058 0.290718
X18          0.0013258  0.0328920   0.040 0.967865
X19         -0.0086980  0.0354804  -0.245 0.806446
X20          0.0093156  0.0342376   0.272 0.785671
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16


The first 4 variables are “highly significant” and the 5th is borderline.

#### Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
beta[j]~dnorm(0,0.001)
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.

#### Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,0.001)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of the indicator variables can be interpreted as posterior probabilities that the variables should be included in the model. Inspecting the output then reveals that the first three variables have a posterior probability of very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the other variables all have very small probabilities of inclusion.

This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the distribution of included variables using a “random effects” approach, leading to a more appropriate distribution for the included variables.

#### Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.

#### Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(pind)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
pind~dbeta(2,8)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub","pind"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


It turns out that for this particular problem the posterior distribution is not very different to the previous case, as for this problem the hard-coded choice of 20% is quite consistent with the data. However, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.

#### Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

#### References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118. [DOI, PDF, Supp, BUGS Code]
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.

### Keeping R up to date on Ubuntu linux

10/11/2012

R is included as part of the standard Ubuntu distribution, and can be installed with a command like

sudo apt-get install r-base


Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.4 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.4, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.

Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I generally use the Bristol mirror, and so I will use it in the following.

sudo su
echo "deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
apt-get update


That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in line 2 above. The first is the address of the CRAN mirror (here “www.stats.bris.ac.uk”). The second is the name of the Ubuntu distro you are running (here “precise”).

### Inlining JAGS models in R scripts for rjags

02/10/2012

JAGS (Just Another Gibbs Sampler) is a general purpose MCMC engine similar to WinBUGS and OpenBUGS. I have a slight preference for JAGS as it is free and portable, works well on Linux, and interfaces well with R. It is tempting to write a tutorial introduction to JAGS and the corresponding R package, rjags, but there is a lot of material freely available on-line already, so it isn’t really necessary. If you are new to JAGS, I suggest starting with Getting Started with JAGS, rjags, and Bayesian Modelling. In this post I want to focus specifically on the problem of inlining JAGS models in R scripts as it can be very useful, and is usually skipped in introductory material.

#### JAGS and rjags on Ubuntu Linux

On recent versions of Ubuntu, assuming that R is already installed, the simplest way to install JAGS and rjags is using the command

sudo apt-get install jags r-cran-rjags


Now rjags is a CRAN package, so it can be installed in the usual way with install.packages("rjags"). However, taking JAGS and rjags direct from the Ubuntu repos should help to ensure that the versions of JAGS and rjags are in sync, which is a good thing.

#### Toy model

For this post, I will use a trivial toy example of inference for the mean and precision of a normal random sample. That is, we will assume data

$X_i \sim N(\mu,1/\tau),\quad i=1,2,\ldots n,$

with priors on $\mu$ and $\tau$ of the form

$\tau\sim Ga(a,b),\quad \mu \sim N(c,1/d).$

#### Separate model file

The usual way to fit this model in R using rjags is to first create a separate file containing the model

  model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}


Then, supposing that this file is called jags1.jags, an R session to fit the model could be constructed as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
model=jags.model("jags1.jags",data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is all fine, and it can be very useful to have the model declared in a separate file, especially if the model is large and complex, and you might want to use it from outside R. However, very often for simple models it can be quite inconvenient to have the model separate from the R script which runs it. In particular, people often have issues with naming files correctly, making sure R is looking in the correct directory, moving the model with the R script, etc. So it would be nice to be able to just inline the JAGS model within an R script, to keep the model, the data, and the analysis all together in one place.

#### Using a temporary file

What we want to do is declare the JAGS model within a text string inside an R script and then somehow pass this into the call to jags.model(). The obvious way to do this is to write the string to a text file, and then pass the name of that text file into jags.model(). This works fine, but some care needs to be taken to make sure this works in a generic platform independent way. For example, you need to write to a file that you know doesn’t exist in a directory that is writable using a filename that is valid on the OS on which the script is being run. For this purpose R has an excellent little function called tempfile() which solves exactly this naming problem. It should always return the name of a file which does not exist in a writable directly within the standard temporary file location on the OS on which R is being run. This function is exceedingly useful for all kinds of things, but doesn’t seem to be very well known by newcomers to R. Using this we can construct a stand-alone R script to fit the model as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
tmpf=tempfile()
tmps=file(tmpf,"w")
cat(modelstring,file=tmps)
close(tmps)
model=jags.model(tmpf,data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


Now, although there is a file containing the model temporarily involved, the script is stand-alone and portable.

#### Using a text connection

The solution above works fine, but still involves writing a file to disk and reading it back in again, which is a bit pointless in this case. We can solve this by using another under-appreciated R function, textConnection(). Many R functions which take a file as an argument will work fine if instead passed a textConnection object, and the rjags function jags.model() is no exception. Here, instead of writing the model string to disk, we can turn it into a textConnection object and then pass that directly into jags.model() without ever actually writing the model file to disk. This is faster, neater and cleaner. An R session which takes this approach is given below.

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
model=jags.model(textConnection(modelstring), data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is my preferred way to use rjags. Note again that textConnection objects have many and varied uses and applications that have nothing to do with rjags.

04/06/2012

### Introduction

Very often it is desirable to use Metropolis Hastings MCMC for a target distribution which does not have full support (for example, it may correspond to a non-negative random variable), using a proposal distribution which does (for example, a Gaussian random walk proposal). This isn’t a problem at all, but on more than one occasion now I have come across students getting this wrong, so I thought it might be useful to have a brief post on how to do it right, see what people sometimes get wrong, and why, and then think about correcting the wrong method in order to make it right…

### A simple example

For this post we will consider a simple $Ga(2,1)$ target distribution, with density

$\pi(x) = xe^{-x},\quad x\geq 0.$

Of course this is a very simple distribution, and there are many straightforward ways to simulate it directly, but for this post we will use a random walk Metropolis-Hastings (MH) scheme with standard Gaussian innovations. So, if the current state of the chain is $x$, a proposed new value $x^\star$ will be generated from

$f(x^\star|x) = \phi(x^\star-x),$

where $\phi(\cdot)$ is the standard normal density. This proposed new value is accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{f(x|x^\star)}{f(x^\star|x)} = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)} = \frac{\pi(x^\star)}{\pi(x)} ,$

since the standard normal density is symmetric.

#### Correct implementation

We can easily implement this using R as follows:

met1=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=dgamma(xs,2,1)/dgamma(x,2,1)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run it, plot the results and check it against the true target with the following commands.

iters=1000000
out=met1(iters)
hist(out,100,freq=FALSE,main="met1")


If you have a slow computer, you may prefer to use iters=100000. The above code uses R’s built-in gamma density. Alternatively, we can hard-code the density as follows.

met2=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run this code using the following commands, to verify that it does work as expected.

out=met2(iters)
hist(out,100,freq=FALSE,main="met2")


However, there is a potential problem with the above code that we have got away with in this instance, which often catches people out. We have hard-coded the density for $x>0$ without checking the sign of $x$. Here we get away with it as a negative proposal will lead to a negative acceptance ratio that we will reject straight away. This is not always the case (consider, for example, a $Ga(3,1)$ distribution). So really we should check the sign of $x^\star$ and reject immediately if is not within the support of the target.

Although this problem often catches people out, it tends not to be a big issue in practice, as it typically leads to an obviously incorrect sampler, or a sampler which crashes, and is relatively simple to debug and fix.

#### An incorrect sampler

The problem I want to focus on here is more subtle, but closely related. It is clear that any $x^\star<0$ should be rejected. With the above code, such values are indeed rejected, and the sampler advances to the next iteration. However, in more complex samplers, where an update like this might be one tiny part of a massive sampler with a very high-dimensional state space, it seems like a bit of a "waste" of a MH move to just propose a negative value, throw it away, and move on. Evidently, it seems tempting, therefore, to keep on sampling $x^\star$ values until a non-negative value is obtained, and then evaluate the acceptance ratio and decide whether or not to accept. We could code up this sampler as follows.

met3=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


As reasonable as this idea may at first seem, it does not lead to a sampler having the desired target, as can be verified using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


So, this sampler seems to be sampling something close to the desired target, but not the same. This raises a couple of questions. First and most important, can we fix this sampler so that it does sample the correct target (yes), and second, can we figure out what target density the incorrect sampler is actually sampling (again, yes)? Let’s start with the issue of how to fix the sampler, as this will also help us to understand what the incorrect sampler is doing.

#### Fixing the truncated sampler

By repeatedly sampling from the proposal until we obtain a non-negative value, we are actually implementing a rejection sampler for sampling from the proposal distribution truncated at zero. This is a perfectly reasonable proposal distribution, so we can use it provided that we use the correct MH acceptance ratio. Now, the truncated density has the same density as the untruncated density, apart from the differing support and a normalising constant. Indeed, this may be why people often assume this method will work, because normalising constants often don’t matter in MH schemes. However, the normalising constant only doesn’t matter if it is independent of the state, and here it is not… Explicitly, we have

$f(x^\star|x) \propto \phi(x^\star-x),\quad x^\star>0.$

Including the normalising constant we have

$\displaystyle f(x^\star|x) = \frac{\phi(x^\star-x)}{\Phi(x)},\quad x^\star>0,$

where $\Phi(x)$ is the standard normal CDF. Consequently, the correct acceptance ratio to use with this proposal is

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)}\frac{\Phi(x)}{\Phi(x^\star)} = \frac{\pi(x^\star)}{\pi(x)}\frac{\Phi(x)}{\Phi(x^\star)},$

where we see that the normalising constants do not cancel out. We can modify the previous sampler to use the correct acceptance ratio as follows.

met4=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
A=A*pnorm(x)/pnorm(xs)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can verify that this sampler gives leads to the correct target with the following commands.

out=met4(iters)
hist(out,100,freq=FALSE,main="met4")


So, truncating the proposal at zero is fine, provided that you modify the acceptance ratio accordingly.

#### What does the incorrect sampler target?

Now that we understand why the naive truncated sampler was wrong and how to fix it, we can, out of curiosity, wonder what distribution that sampler actually targets. Now we understand what proposal we are actually using, we can re-write the acceptance ratio as

$\displaystyle A = \frac{\pi(x^\star)\Phi(x^\star)}{\pi(x)\Phi(x)}\frac{\frac{\phi(x-x^\star)}{\Phi(x^\star)}}{\frac{\phi(x^\star-x)}{\Phi(x)}},$

from which it is clear that the actual target of this chain is

$\tilde\pi(x) \propto \pi(x)\Phi(x),$

or

$\tilde\pi(x)\propto xe^{-x}\Phi(x),\quad x\geq 0.$

The constant of proportionality is not immediately obvious, but is tractable, and turns out to be a nice undergraduate exercise in integration by parts, leading to

$\displaystyle \tilde\pi(x) = \frac{2\sqrt{2\pi}}{2+\sqrt{2\pi}}xe^{-x}\Phi(x),\quad x\geq 0.$

We can verify this using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


Now we know the actual target of the incorrect sampler, we can compare it with the correct target as follows.

curve(dgamma(x,2,1),0,10,col=2,lwd=2,main="Densities")


So we see that the distributions are different, but not so different that one would immediate suspect an error on the basis of a sample of output. This makes it a difficult bug to track down.

### Summary

There is no problem in principle using a proposal with full support for a target with limited support in MH algorithms. However, it is important to check whether a proposed value is within the support of the target and reject the proposed move if it is not. If you are concerned that such a scheme might be inefficient, it is possible to use a truncated proposal provided that you modify the MH acceptance ratio to include the relevant normalisation constants. If you don’t modify the acceptance probability, you will get a sampler which targets the wrong distribution, but it will often be quite similar to the correct target, making it a difficult bug to spot and track down.

### Multivariate data analysis (using R)

29/05/2012

I’ve been very quiet on-line in the last few months, due mainly to the fact that I’ve been writing a new undergraduate course on multivariate data analysis. Although there are many books and on-line notes on the general topic of multivariate statistics, I wanted to do something a little bit different from any text I have yet discovered. First, I wanted to have a strong emphasis on using techniques in practice on example data sets of reasonable size. For this, I found Hastie et al (2009) to be very useful, as it covered some interesting example data sets which have been bundled in the CRAN R package, ElemStatLearn. I used several of the data sets from this package as running examples throughout the course. In fact my initial plan was to use Hastie et al as the main course text, but it turned out that this text was in some places overly technical and in many places far too terse to be good as an undergraduate text. I would still recommend the book for researchers who want a good overview of the interface between statistics and machine learning, but with hindsight I’m not convinced it is ideal for typical statistics undergraduate students.

I also wanted to have a strong emphasis on numerical linear algebra as the basis for multivariate statistical computation. Again, this is a bit different from “old school” multivariate statistics (which reminds me, John Marden has produced a great text available freely on-line on old school multivariate analysis, which isn’t quite as “old school” as the title might suggest). I wanted to spend some time talking about linear systems and matrix factorisations, explaining, for example how the LU decomposition, the Cholesky factorisation and the QR factorisations are related, and why the latter two are both fundamental to multivariate data analysis, and how the singular value decomposition (SVD) is related to the spectral decomposition, and why it is generally better to construct principal components from the SVD of the centred data matrix than the eigen-decomposition of the sample variance matrix, etc. These sorts of topics are not often covered in undergraduate statistics courses, but they are crucial to understanding how to analyse large multivariate data sets in a numerically stable way.

I also wanted to downplay distribution theory as much as possible, as multivariate distribution theory is quite difficult, and not necessary for understanding most of the essential concepts in multivariate data analysis. Also, it is not obviously very useful. Essentially all introductory courses are based around the multivariate normal distribution, but I have yet to see a real non-trivial multivariate data set for which an assumption of multivariate normality is appropriate. Consequently I delayed the introduction of the multivariate normal until well into the course, and didn’t bother with the Wishart distribution, or testing for multivariate normality. Like much frequentist testing, it is really just a matter of seeing if you have yet collected a large enough sample to reject the null hypothesis – I just don’t see the point (null)!

Finally, I wanted to use R to illustrate all of the methods in practice as they were introduced. We use R throughout our undergraduate statistics programme, and I think it is a good language for learning about statistical methods, algorithms and concepts. In most cases I begin by showing how to carry out analyses using “elementary” operations (such as matrix manipulations), and then go on to show how to accomplish the same task more simply using higher-level R functions and packages. Again, I think it really helps understanding to first see the mathematical description directly translated into computer code before jumping to high-level data analysis functions.

There are several aspects of the course that I would like to distil out into self-contained blog posts, but I have a busy summer schedule, and a couple of other things I want to write about before I’ll have a chance to get around to it, so in the mean time, anyone interested is welcome to download a copy of the course notes (PDF, with hyperlinks). This is the student version, containing gaps, but the gaps mainly correspond to bits of standard theory and examples designed to be worked through by hand. All of the essential theory and context and all of the R examples are present in this version of the notes. There are seven chapters: Introduction to multivariate data; PCA and matrix factorisations; Inference, the MVN and multivariate regression; Cluster analysis and unsupervised learning; Discrimination and classification; Graphical modelling; Variable selection and multiple testing.

29/12/2011

### Introduction

In a previous post I showed how to construct a PMMH pMCMC algorithm for parameter estimation with partially observed Markov processes. The inner loop of a pMCMC algorithm consists of running a particle filter to construct an unbiased estimate of marginal likelihood. This inner loop is the place where the code spends almost all of its time, and so speeding up the particle filter will result in dramatic speedup of the pMCMC algorithm. This is fortunate, since as previously discussed, MCMC algorithms are difficult to parallelise other than on a per iteration basis. Here, each iteration can be speeded up if we can effectively parallelise a particle filter. Particle filters are much easier to parallelise than MCMC algorithms, and so it is tempting to try and exploit this within R. In fact, although it is the case that it is possible to effectively parallelise particle filters in efficient languages using low-level parallelisation tools (say, using C with MPI, or Java concurrency tools), it is not so easy to speed up R-based particle filters using R’s high-level parallelisation constructs, as we shall see.

### Particle filters

In the previous post we looked at the function pfMLLik within the CRAN package smfsb. As a reminder, the source code is

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


The function itself doesn’t actually run a particle filter, but instead returns a function closure which does (see the previous post for a discussion of lexical scope and function closures in R). There are obviously several different steps within the particle filter, and several of these are amenable to parallelisation. However, for complex models, forward simulation from the model will be the rate-limiting step, where the vast majority of CPU cycles will be spent. Line 9 in the above code is where forward simulation takes place, and in particular, the key function call is the apply call:

apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...)


This call applies the forward simulation algorithm stepFun to each row of the matrix xmat independently. Since there are no dependencies between the function calls, this is in principle very straightforward to parallelise on multicore hardware.

### Multicore support in R

I’m writing this post on a laptop with an Intel i7 quad core chip, running the 64 bit version of Ubuntu 11.10. R has support for multicore processing on this platform – it is just a simple matter of installing the relevant packages. However, things are changing rapidly regarding multicore support in R right now, so YMMV. Ubuntu 11.10 has R 2.13 by default, but the multicore support is slightly different in the recently released R 2.14. I’m still using R 2.13. I may update this post (or comment) when I move to R 2.14. The main difference is that the package multicore has been replaced by the package parallel. There are a few other minor changes, but it should be easy to adapt what is presented here to 2.14.

There is a new O’Reilly book called Parallel R. I’ve got a copy of it. It does cover the new parallel package in R 2.14, as well as other parallel R topics, but the book is a bit light weight, to say the least, and I reviewed it on this blog. Please read my review for further details before you buy it.

If you haven’t used multicore in R previously, then

install.packages(c("multicore","doMC"))


should get you started (again, I’m assuming that your R version is strictly < 2.14). You can test it has worked with:

library(multicore)
multicore:::detectCores()


When I do this, I get the answer 8 (I have 4 cores, each of which is hyper-threaded). To begin with, I want to tell R to use just 4 process threads, and I can do this with

library(doMC)
registerDoMC(4)


Replacing the second line with registerDoMC() will set things up to use all detected cores (in my case, 8). There are a couple of different strategies we could use to parallelise this. One strategy for parallelising the apply call discussed above is to be to replace it with a foreach / %dopar% loop. This is best illustrated by example. Start with line 9 from the function pfMLLik:

xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))


We can produce a parallelised version by replacing this line with the following block of code:

res=foreach(j=1:dim(xmat)[1]) %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}
xmat=t(sapply(res,cbind))


Each iteration of the foreach loop is executed independently (possibly using multiple cores), and the result of each iteration is returned as a list, and captured in res. This list of return vectors is then coerced back into a matrix with the final line.

In fact, we can improve on this by using the .combine argument to foreach, which describes how to combine the results from each iteration. Here we can just use rbind to combine the results into a matrix, using:

xmat=foreach(j=1:dim(xmat)[1], .combine="rbind") %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}


This code is much neater, and in principle ought to be a bit faster, though I haven’t noticed much difference in practice.

In fact, it is not necessary to use the foreach construct at all. The multicore package provides the mclapply function, which is a multicore version of lapply. To use mclapply (or, indeed, lapply) here, we first need to split our matrix into a list of rows, which we can do using the split command. So in fact, our apply call can be replaced with the single line:

xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))


This is actually a much cleaner solution than the method using foreach, but it does require grokking a bit more R. Note that mclapply uses a different method to specify the number of threads to use than foreach/doMC. Here you can either use the named argument to mclapply, mc.cores, or use options(), eg. options(cores=4).

As well as being much cleaner, I find that the mclapply approach is much faster than the foreach/dopar approach for this problem. I’m guessing that this is because foreach doesn’t pre-schedule tasks by default, whereas mclapply does, but I haven’t had a chance to dig into this in detail yet.

### A parallelised particle filter

We can now splice the parallelised forward simulation step (using mclapply) back into our particle filter function to get:

require(multicore)
pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


This can be used in place of the version supplied with the smfsb package for slow simulation algorithms running on modern multicore machines.

There is an issue regarding Monte Carlo simulations such as this and the multicore package (whether you use mclapply or foreach/dopar) in that it adopts a “different seeds” approach to parallel random number generation, rather than a true parallel random number generator. This probably isn’t worth worrying too much about now, since it is fixed in the new parallel package in R 2.14, but is something to be aware of. I discuss parallel random number generation issues in Wilkinson (2005).

### Multiple particle filters and pMCMC

Let’s look again at the main loop of the pMCMC algorithm discussed in the previous post:

for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}


It is clear that the main computational bottleneck of this code is the call to mLLik on line 5, as this is the call which runs the particle filter. The purpose of making the call is to obtain an unbiased estimate of marginal likelihood. However, there are plenty of other ways that we can obtain such estimates than by running a single particle filter. In particular, we could run multiple particle filters and average the results. So, let’s look at how to do this in the multicore setting. Let’s start by thinking about running 4 particle filters. We could just replace the line

llprop=mLLik(thprop)


with the code

llprop=0.25*foreach(i=1:4, .combine="+") %dopar% {
mLLik(thprop)
}


Now, there are at least 2 issues with this. The first is that we are now just running 4 particle filters rather than 1, and so even with perfect parallelisation, it will run no quicker than the code we started with. However, the idea is that by running 4 particle filters we ought to be able to get away with each particle filter using fewer particles, though it isn’t trivial to figure out exactly how many. For example, averaging the results from 4 particle filters, each of which uses 25 particles is not as good as running a single particle filter with 100 particles. In practice, some trial and error is likely to be required. The second problem is that we have computed the mean of the log of the likelihoods, and not the likelihoods themselves. This will almost certainly work fine in practice, as the resulting estimate will in most cases be very close to unbiased, but it will not be exactly unbiased, as so will not lead to an “exact” approximate algorithm. In principle, this can be fixed by instead using

res=foreach(i=1:4) %dopar% {
mLLik(thprop)
}
llprop=log(mean(sapply(res,exp)))


but in practice this is likely to be subject to numerical underflow problems, as it involves manipulating raw likelihood values, which is generally a bad idea. It is possible to compute the log of the mean of the likelihoods in a more numerically stable way, but that is left as an exercise for the reader, as this post is way too long already… However, one additional tip worth mentioning is that the foreach package includes a convenience function called times for situations like the above, where the argument is not varying over calls. So the above code can be replaced with

res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))


which is a bit cleaner and more readable.

Using this approach to parallelisation, there is now a much better chance of getting some speedup on multicore architectures, as the granularity of the tasks being parallelised is now much larger. Consider the example from the previous post, where at each iteration we ran a particle filter with 100 particles. If we now re-run that example, but instead use 4 particle filters each using 25 particles, we do get a slight speedup. However, on my laptop, the speedup is only around a factor of 1.6 using 4 cores, and as already discussed, 4 filters each with 25 particles isn’t actually quite as good as a single filter with 100 particles anyway. So, the benefits are rather modest here, but will be much better with less trivial examples (slower simulators). For completeness, a complete runnable demo script is included after the references. Also, it is probably worth emphasising that if your pMCMC algorithm has a short burn-in period, you may well get much better overall speed-ups by just running parallel MCMC chains. Depressing, perhaps, but true.

### References

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

require(multicore)
require(doMC)
registerDoMC(4)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter

# use 25 particles instead of 100
mLLik=pfMLLik(25,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


23/11/2011

## Introduction

R is different to many “easy to use” statistical software packages – it expects to be given commands at the R command prompt. This can be intimidating for new users, but is at the heart of its power. Most powerful software tools have an underlying scripting language. This is because scriptable tools are typically more flexible, and easier to automate, script, program, etc. In fact, even software packages like Excel or Minitab have a macro programming language behind the scenes available for “power users” to exploit.

### Programming from the ground up

It is natural to want to automate (repetitive) tasks on a computer, to automate a “work flow”. This is especially natural for computational tasks, as all software tools are built from programming language components, anyway. In R, you do stuff by executing a sequence of commands. By putting a bunch of commands one after another into a text file, we can source the file, and script R. Scripting is the simplest form of programming – automating a sequence of tasks. Indeed, in Unix (including Linux and MacOS), we can put a bunch of Unix shell commands together in a shell script. In Windows, you can put a bunch of terminal commands together in a batch file.

Next, one can add in simple control structures, to support looping, branching and conditional execution. Looping allows repetition of very similar tasks. Branching and conditional execution allow decisions to be made depending on what has already happened. Most scripting languages support simple control structures – this allows carrying out of tasks which we could do in principle, but perhaps not in practice, due to the laborious and repetitive nature of some work-flows. We can go a long way with this, but…

Although scripting is a simple form of programming, it isn’t “real” programming, or software engineering. Software engineering is about developing flexible, modular, robust, re-usable, generic program components, and using them to build large, complex software systems – modularity is absolutely key here. Functions and procedures are a first step towards introducing modularity, allowing the development of “real” software. Proper support for these tends to distinguish “real” programming languages from scripting languages (though many modern “scripting” languages have at least some limited support, and the distinction between scripting languages and “real” languages is now very blurred).

## Functions and procedures

Procedures (or subroutines) are re-usable pieces of code which can be called from other pieces of code when needed. They may be provided with inputs, but do not have to be. They are usually called for their “side-effects”, such as doing plots, changing global variables, or reading/writing data to/from disk.

Functions are also re-usable pieces of code, but are mainly used to obtain a return-value that is computed on the basis of the given inputs. “Pure” functions do not have any side-effects. Functions and procedures may be combined in a hierarchical way to build large, complex algorithms from much simpler modular components. Note that many languages (including R), do not make a distinction between functions and procedures in the syntax of the language, but conceptually the distinction is really quite important.

## Variable scope

Almost all programming languages allow the definition of variables which are labels or tags representing or pointing at some value that may be defined and re-defined at run-time. In most modern programming languages, functions can define local variables which can be used in addition to any inputs (formal parameters) of the function – these are very important for the development of modular, re-usable code components. In particular, they help to avoid unanticipated name clashes in the global name-space. If a function refers to a variable which is neither a formal parameter nor a local variable, then a rule is needed to find which (if any) variable with that label is in scope for the function, so that the program can know what value to use.

### Dynamic scope

Under dynamic scope, if an “unknown” variable is referred to in a function, the idea is to use the version of the variable that is in scope at the time that the function was called (and apply this rule recursively) – this is the scoping rule used by the S-PLUS implementation of the S language. Dynamic scope was common among early dynamic programming languages – including early implementations of LISP (and is still used in Emacs LISP), as it was quite intuitive and natural to implement using a stack-based approach similar to the stack-based approach to passing variables in and out of subroutines commonly used by machine code and assembly programmers.

Despite being intuitively appealing, at least initially, there are a number of problems with dynamic scope in practice. In particular, we can’t really know by code inspection whether or not a given section of code will run in all situations without actually running the code, as we can’t know whether all variable bindings will resolve correctly. This is an issue even for dynamic languages, but is particularly problematic for strongly typed compiled languages, as it becomes difficult for the compiler to figure out the types of all variables correctly and therefore generate the appropriate byte-code. It is also very difficult for a function to have associated state – to do this, you must somehow get state variables into global name-space where they then become vulnerable to masking and name clashes. See the Wikipedia page on scope for further details.

### Lexical scope

Under lexical scoping rules, if an “unknown” variable is referred to in a function, the idea is to use the version that is “in scope” in the enclosing piece of code (and apply this rule recursively) — this is the scoping rule used by R (as R is built on top of a Scheme interpreter, a LISP derivative which emphasises lexical scope). Variable bindings can be all resolved, checked and verified at compile-time – this is safer, and in many other ways better. Most modern languages adopt lexical scoping, including most functional languages, such as LISPs (including LISP-STAT) and derivatives. In fact, I first read about lexical scope, function closures and their use in statistical computing in Luke Tierney’s LISP-STAT book (Tierney, 1990) in the early 1990s. That book was published over 20 years ago, so it just goes to show that there is nothing new about these functional programming approaches. In fact, although Tierney’s book describes a now obsolete system, I would nevertheless recommend reading it if you can find a copy, as I think it is still one of the best books on statistical computing ever written. It really puts the recent glut of horrible R-themed books to shame!

Given that R has been lexically scoped and has supported function closures since day one, it is reasonable to wonder why this programming style is not used more widely in R code. I think it is the difference in scoping rules between S-PLUS and R that has led to a fear of developing any R code which relies on non-local scoping rules. Certainly, in the early days of R, I would use S-PLUS at work and R at home, and I would want my code to work in exactly the same in both places! This is a shame, as lexical scoping is very powerful, and exploited widely in functional programming styles. The use of lexical scope and function closures in R is described quite nicely in Gentleman (2008), along with many other things.

To make sure that the concepts are clear, inspect the following piece of code and figure out what the result of the final function call will be. The answer is given below the code, so try not to peek before reading on…

a=1
b=2
f<-function(x)
{
a*x + b
}
g<-function(x)
{
a=2
b=1
f(x)
}
g(2)


No, really, try and figure it out before reading on for the answer! Understanding this example is key to understanding the difference between lexical and dynamic scope. Clearly the obvious answers are 4 and 5. If you didn’t get one of those, go back and try again! So, one of those is the result you get in a dynamically scoped language like S-PLUS, and the other is the result that you get in a lexically scoped language like R. But which is which? Many people when asked what this code does give the answer 5. This is the result for a dynamically scoped language. It is not the answer you get in R. In R, you get the answer 4. This is because f() was defined in the global environment, so it is the global bindings of a and b which count. Although the function g() defines its own local bindings for a and b, these have no impact on the global bindings, and are simply not relevant to the evaluation of f().

## Function closures

Some languages (including LISPs and derivatives such as Scheme, Python, and R) have functions as “first class objects”, which means that a function is able to return as its value another function. If the function (fChild) returned by the function (fParent) refers to variables not local to fChild, then scoping rules must apply to the resolution of the variable binding. If the language is lexically scoped, then the binding is determined by the variables in scope within the function fParent. The function fChild therefore has an associated environment, which provides bindings for non-local variable references – this allows maintaining of state. A function together with its environment is referred to as a function closure, and is a very powerful programming tool. Below is some more code to help illustrate what is going on. Again, try to figure out the result of the final function call before reading on for the answer and explanation…

a=1
b=2
f<-function(a,b)
{
return( function(x) {
a*x + b
})
}
g=f(2,1)
g(2)


Here, the function g(), together with its associated environment, is referred to as a function closure. See the Wikipedia page for closure for further details. So, what is the result of calling g(2) in this case? Again, some people get this wrong, and give the answer 4. This isn’t what you get in R – in R you get 5, again due to lexical scope. The point is that the function g() is created inside f(), and so it is the variable bindings in scope within f() at the time g() was created which matter. Since f() has a and b as formal arguments, these mask the global variables of the same name, so it is the 2 and 1 that are passed into f() to create g() which matter in the evaluation of g(). This is why function closures are so powerful. They are not simply functions, they are functions together with an associated environment, and the associated environment allows function closures to have associated state. Here the state corresponds to the values of a and b that were used in the creation of g(), but in principle the state can be essentially any data structure.

## Function closures for scientific computing

Function closures have numerous important applications in a variety of problems in scientific computing that involve dealing in some way with the “function environment problem”. There is quite a nice discussion of this issue in Oliveira and Stewart (2006), in the context of several strongly typed compiled languages. Consider, for example, a function that will numerically integrate a univariate function using (say) the trapezium rule. This integration function might expect that you pass in the function to be integrated, together with the limits of integration, and possibly a step size. Most likely this integration function will expect that the function passed in is univariate. However, in practice many functions have additional parameters (eg. the straight line example, above, which was a function of x, but depending on additional parameters a and b). This problem is solved by passing in a univariate function closure that contains the necessary environment to evaluate this univariate function correctly. Similar considerations apply for functions that carry out optimisation, solve ODEs by passing in the RHS, etc.

### The smfsb R package

The second edition of my textbook, Stochastic Modelling for Systems Biology, has recently been published (Wilkinson, 2011). The second edition has an associated R package, smfsb, available from CRAN – I gave a tutorial introduction in a previous post. The code makes extensive use of lexical scope and function closures, precisely to solve the function environment problem…

## References

• Oliveira, S, Stewart, D.E. (2006) Writing scientific software, CUP.
• Gentleman, R. (2008) R Programming for Bioinformatics, Chapman & Hall/CRC Press.
• Tierney, L. (1990) LISP-STAT, Wiley.
• Wilkinson (2011), Stochastic Modelling for Systems Biology, second edition, Chapman & Hall/CRC Press.
• ### Review of “Parallel R” by McCallum and Weston

16/11/2011

#### Introduction

This is the first book review I’ve done on this blog, and I don’t intend to make it a regular feature, but I ordered a copy of “Parallel R” a few days ago. It arrived today, and I’m quite disappointed with it, so I wanted to write a quick review to provide some additional information for people thinking of buying a copy. Just to be clear, the book is:

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• I generally like O’Reilly books, and own a number of them. I use R a lot, I am very interested in parallel computing (traditionally using C and MPI), and I’ve dabbled a little with some parallel stuff in R, but don’t consider myself to be an expert. In other words, people like me are probably the target audience for this book, and sure enough I have handed over some of my hard-earned cash to buy a copy.

The main problem with the book is that it just doesn’t feel finished. It seems as though the authors have rushed the text as quickly as possible and published it without any kind of critical review or reflection. It is very short – just 108 pages of the main text, and most annoyingly, doesn’t have an index. This wouldn’t be so much of a problem for an electronic version, but selling a technical computing book in dead tree format without an index is really unforgivable. All of my other O’Reilly books contain a decent index, so I’m just baffled as to why this one doesn’t. It really feels like this is the first draft of a manuscript that you would circulate to a few friends and colleagues for comments and suggested improvements. There is the kernel of a decent book here, but most of the current material will be obsolete before a second edition could be put together and published, so the second edition will have to be a complete re-write.

#### Chapter by chapter

##### Chapter 1 – Getting started – 5 pages

A brief introduction to the rest of the book, and has a pointer to the companion website, parallelrbook.com (at the time of writing, it is empty…).

##### Chapter 2 – snow – 30 pages

This chapter is the most substantial, and arguably the best, chapter of the book. It provides a very reasonable introduction to the snow package for a simple network of workstations, for running embarrassingly parallel jobs on a cluster.

##### Chapter 3 – multicore – 13 pages

This chapter provides a very brief and superficial introduction to the multicore package, for exploiting modern multicore hardware. It provides a very brief introduction to the high level API (mclapply, pvec, parallel, collect). Discussion of the low-level API is almost non-existent (an example function is given which uses some low-level calls). Also, there is no discussion here, or anywhere else, of the foreach package/function, or the doMC back-end. Unfortunately I couldn’t verify this by checking the index (see above), but as there are only 100 or so pages, it didn’t take that long to flick through them all to double-check… Now I can understand that the book will not cover all obscure parallel packages for R, but foreach/doMC?! Missing from a book called “Parallel R”? Seriously? It is all the more weird as one of the authors (Weston) is an author of foreach/doMC. Go figure…

##### Chapter 4 – parallel – 8 pages

This chapter provides an even more brief introduction to the new parallel package for R 2.14. It should be noted that the book went to press before 2.14 was frozen for release, but the content that was there looked OK on the basis of a very quick skim. But at 8 pages, don’t expect too much.

##### Chapter 5 – A primer on MapReduce and Hadoop – 8 pages

A very brief introduction to the ideas behind MapReduce and Hadoop. Not actually anything to do with R, but necessary for the next chapter.

##### Chapter 6 – R+Hadoop – 18 pages

I’ve not worked through this chapter in detail, but it looks like a reasonable “getting started guide” for using Hadoop with R.

##### Chapter 7 – RHIPE – 16 pages

Again, I’ve not studied this chapter carefully, but it seems like a reasonable introduction to the RHIPE package (it is a package to make it simpler to use R with Hadoop, by hiding Hadoop stuff from the R user).

##### Chapter 8 – Segue – 6 pages

A very brief introduction to the segue package, which enables running jobs on Amazon’s Elastic MapReduce service.

##### Chapter 9 – New and upcoming – 2 pages

A very brief mention of some of the things that weren’t covered in the book… foreach is mentioned here, but no example is given.

#### Conclusion

For people who have absolutely no idea about parallel computing in R, or about what different options are available, then this book does provide a useful overview, together with some simple examples to illustrate the ideas, and try out for themselves. It is generally very brief and superficial, there are some gaping holes, and much of the material will become obsolete very quickly. It is a shame that there is not more discussion of low level functions, or of parallel computing in anything other than a simple embarrassingly parallel context. Admittedly, if your job isn’t embarrassingly parallel, you probably don’t want to use R anyway, but some discussion would still have been nice. And did I mention that there is no index?! I did toy briefly with the idea of sending it back, but I’m not going to. To be fair, there is quite a bit of useful information in the book, and I’d like to work through the Hadoop chapters at some point. So in summary, it’s OK, but don’t expect to love it.

#### References

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• ### Particle filtering and pMCMC using R

12/11/2011

In the previous post I gave a quick introduction to the CRAN R package smfsb, and how it can be used for simulation of Markov processes determined by stochastic kinetic networks. In this post I’ll show how to use data and particle MCMC techniques in order to carry out Bayesian inference for the parameters of partially observed Markov processes.

### The simulation model and the data

For this post we will assume that the smfsb package is installed and loaded (see previous post for details). The package includes the function stepLVc which simulates from the Markov transition kernel of a Lotka-Volterra process by calling out to some native C code for speed. So, for example,

stepLVc(c(x1=50,x2=100),0,1)


will simulate the state of the process at time 1 given an initial condition of 50 prey and 100 predators at time 0, using the default rate parameters of the function, th = c(1, 0.005, 0.6). The package also includes some data simulated from this model using these parameters, with and without added noise. The datasets can be loaded with

data(LVdata)


For simplicity, we will just make use of the dataset LVnoise10 in this post. This dataset is a multivariate time series consisting of 16 equally spaced observations on both prey and predators subject to Gaussian measurement error with a standard deviation of 10. We can plot the data with

plot(LVnoise10,plot.type="single",col=c(2,4))


giving:

The Bayesian inference problem is to see how much we are able to learn about the parameters which generated the data using only the data and our knowledge of the structure of the problem. There are many approaches one can take to this problem, but most are computationally intensive, due to the analytical intractability of the transition kernel of the LV process. Here we will follow Wilkinson (2011) and take a particle MCMC (pMCMC) approach, and specifically, use a pseudo-marginal “exact approximate” MCMC algorithm based on the particle marginal Metropolis-Hastings (PMMH) algorithm. I have discussed the pseudo-marginal approach, using particle filters for marginal likelihood estimation, and the PMMH algorithm in previous posts, so if you have been following my posts for a while, this should all make perfect sense…

### Particle filter

One of the key ingredients required to implement the pseudo-marginal MCMC scheme is a (bootstrap) particle filter which generates an unbiased estimate of the marginal likelihood of the data given the parameters (integrated over the unobserved state trajectory). The algorithm was discussed in this post, and R code to implement this is included in the smfsb R package as pfMLLik. For reasons of numerical stability, the function computes and returns the log of the marginal likelihood, but it is important to understand that it is the actually likelihood estimate that is unbiased for the true likelihood, and not the corresponding statement for the logs. The actual code of the function is relatively short, and for completeness is given below:

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


We need to set up the prior and the data likelihood correctly before we can use this function, but first note that the function does not actually run a particle filter at all, but instead stores everything it needs to know to run the particle filter in the local environment, and then returns a function closure for evaluating the marginal likelihood at a given set of parameters. The resulting function (closure) can then be used to run a particle filter for a given set of parameters, simply by passing the required parameters into the function. This functional programming style is consistent with that used throughout the smfsb R package, and leads to quite simple, modular code. To use pfMLLik, we first need to define a function which evaluates the log-likelihood of an observation conditional on the true state, and another which samples from the prior distribution of the initial state of the system. Here, we can do that as follows.

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)


Now the function (closure) mLLik will, for a given parameter vector, run a particle filter (using 100 particles) and return the log of the particle filter’s unbiased estimate of the marginal likelihood of the data. It is then very easy to use this function to create a simple PMMH algorithm for parameter inference.

### PMMH algorithm

Below is an algorithm based on flat priors and a simple Metropolis-Hastings update for the parameters using the function closure mLLik, defined above.

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


This will take a little while to run, but in the end should give a plot something like the following (click for full size):

So, although we should really run the chain for a bit longer, we see that we can learn a great deal about the parameters of the process from very little data. For completeness, a full runnable demo script is included below the references. Of course there are many obvious extensions of this basic problem, such as partial observation (eg. only observing the prey) and unknown measurement error. These are discussed in Wilkinson (2011), and code for these cases is included within the demo(PMCMC), which should be inspected for further details.

### Discussion

At this point it is probably worth emphasising that there are other “likelihood free” approaches which can be taken to parameter inference for partially observed Markov process (POMP) models. Many of these are implemented in the pomp R package, also available from CRAN, by King et al (2008). The pomp package is well documented, and has a couple of good tutorial vignettes which should be sufficient to get people started. The API of the package is rather cumbersome, but the algorithms appear to be quite robust. Approximate Bayesian computation (ABC) approaches are also quite natural for POMP models (see, for example, Toni et al (2009)). This is because “exact” likelihood free procedures break down in the case of low/no measurement error or high-dimensional observations. There are some R packages for ABC, but I am not sufficiently familiar with them to be able to give recommendations.

If one is able to move away from the “likelihood free” paradigm, it is possible to develop “exact” pMCMC algorithms which do not break down in challenging observation scenarios. The problem here is the intractability of the Markov transition kernel. In the case of nonlinear Markov jump processes, finding very generic solutions seems quite difficult, but for diffusion (approximation) processes based on stochastic differential equations, it seems to be possible to develop irreducible pMCMC algorithms which have very broad applicability – see Golightly and Wilkinson (2011) for further details of how such techniques can be used in the context of stochastic kinetic models similar to those considered in this post.

### References

• Golightly, A., Wilkinson, D. J. (2011) Bayesian parameter inference for stochastic biochemical network models using particle MCMC, Interface Focus, 1(6):807-820.
• King, A.A., Ionides, E.L., & Breto, C.M. (2008) pomp: Statistical inference for partially observed Markov processes, CRAN.
• Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface 6(31): 187-202.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


Xi'an's Og

an attempt at bloggin, from scratch...

Normal Deviate

Thoughts on Statistics and Machine Learning