MCMC, Monte Carlo likelihood estimation, and the bootstrap particle filter

The pseudo-marginal approach to MCMC for Bayesian inference

In a previous post I described a generalisation of the Metropolis-Hastings MCMC algorithm which uses unbiased Monte Carlo estimates of the likelihood in the acceptance ratio, but is nevertheless exact: the pseudo-marginal, or “exact approximate”, approach to MCMC. To be useful in the context of Bayesian inference, we need to be able to compute unbiased estimates of the (marginal) likelihood of the data given some proposed model parameters, with any “latent variables” integrated out.

To be more precise, consider a model for data y with parameters \theta of the form \pi(y|\theta) together with a prior on \theta, \pi(\theta), giving a joint model

\displaystyle  \pi(\theta,y)=\pi(\theta)\pi(y|\theta).

Suppose now that interest is in the posterior distribution

\displaystyle  \pi(\theta|y) \propto  \pi(\theta,y)=\pi(\theta)\pi(y|\theta).

We can construct a fairly generic (marginal) MCMC scheme for this posterior by first proposing \theta^\star \sim f(\theta^\star|\theta) from some fairly arbitrary proposal distribution and then accepting the value with probability \min\{1,A\} where

\displaystyle  A = \frac{\pi(\theta^\star)}{\pi(\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{\pi(y|\theta^\star)}{\pi(y|\theta)}.

This method is great provided that the (marginal) likelihood of the data \pi(y|\theta) is available to us analytically, but in many (most) interesting models it is not. However, in the previous post I explained why substituting in a Monte Carlo estimate \hat\pi(y|\theta) will still lead to the exact posterior if the estimate is unbiased in the sense that E[\hat\pi(y|\theta)]=\pi(y|\theta). Consequently, sources of (cheap) unbiased Monte Carlo estimates of (marginal) likelihood are of potential interest in the development of exact MCMC algorithms.
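To make this concrete, here is a minimal sketch in R of such a pseudo-marginal Metropolis-Hastings scheme. The functions logPrior, rprop and dprop (the log proposal density) are hypothetical placeholders defining the prior and proposal for a particular model, and likEst is assumed to return an unbiased (non-negative) Monte Carlo estimate of \pi(y|\theta).

pmMH=function(theta0,iters,logPrior,rprop,dprop,likEst) {
    theta=theta0
    lik=likEst(theta) # estimate attached to the current state
    mat=matrix(0,nrow=iters,ncol=length(theta0))
    for (i in 1:iters) {
        thetaS=rprop(theta) # propose theta* ~ f(theta*|theta)
        likS=likEst(thetaS) # fresh unbiased estimate at theta*
        logA=logPrior(thetaS)-logPrior(theta)+
            dprop(theta,thetaS)-dprop(thetaS,theta)+
            log(likS)-log(lik) # log of the acceptance ratio A
        if (log(runif(1)) < logA) { # accept with probability min(1,A)
            theta=thetaS
            lik=likS
        }
        mat[i,]=theta
    }
    mat
}

Note that the likelihood estimate attached to the current state is stored and reused until the next acceptance; refreshing it at every iteration would destroy the exactness of the pseudo-marginal scheme.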

Latent variables and marginalisation

Often the reason that we cannot evaluate \pi(y|\theta) is that there are latent variables in the problem, and the model for the data is conditional on those latent variables. Explicitly, if we denote the latent variables by x, then the joint distribution for the model takes the form

\displaystyle  \pi(\theta,x,y) = \pi(\theta)\pi(x|\theta)\pi(y|x,\theta).

Now since

\displaystyle  \pi(y|\theta) = \int_X \pi(y|x,\theta)\pi(x|\theta)\,dx

there is a simple and obvious Monte Carlo strategy for estimating \pi(y|\theta) provided that we can evaluate \pi(y|x,\theta) and simulate realisations from \pi(x|\theta). That is, simulate values x_1,x_2,\ldots,x_n from \pi(x|\theta) for some suitably large n, and then put

\displaystyle  \hat\pi(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta).

It is clear by the law of large numbers that this estimate will converge to \pi(y|\theta) as n\rightarrow \infty. That is, \hat\pi(y|\theta) is a consistent estimate of \pi(y|\theta). However, a moment’s thought reveals that this estimate is not only consistent, but also unbiased, since each term in the sum has expectation \pi(y|\theta). This simple Monte Carlo estimate of likelihood can therefore be substituted into a Metropolis-Hastings acceptance ratio without affecting the (marginal) target distribution of the Markov chain. Note that this estimate of marginal likelihood is sometimes referred to as the Rao-Blackwellised estimate, due to its connection with the Rao-Blackwell theorem.
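As a quick illustration, consider a toy model (an assumption for this example only) in which x|\theta \sim N(\theta,1) and y|x \sim N(x,1), so that the true marginal likelihood \pi(y|\theta) is a N(\theta,2) density that we can use for checking. A few lines of R:

theta=1; y=2.5; n=10000
x=rnorm(n,theta,1) # simulate from pi(x|theta)
mean(dnorm(y,x,1)) # unbiased Monte Carlo estimate of pi(y|theta)
dnorm(y,theta,sqrt(2)) # exact value, for comparison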

Importance sampling

Suppose now that we cannot sample values directly from \pi(x|\theta), but can sample instead from a distribution \pi'(x|\theta) having the same support as \pi(x|\theta). We can then instead produce an importance sampling estimate for \pi(y|\theta) by noting that

\displaystyle  \pi(y|\theta) = \int_X \pi(y|x,\theta)\frac{\pi(x|\theta)}{\pi'(x|\theta)}\pi'(x|\theta)\,dx.

Consequently, samples x_1,x_2,\ldots,x_n from \pi'(x|\theta) can be used to construct the estimate

\displaystyle  \hat{\pi}(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta) \frac{\pi(x_i|\theta)}{\pi'(x_i|\theta)}

which again is clearly both consistent and unbiased. This estimate is often written

\displaystyle  \hat{\pi}(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta) w_i
where w_i=\pi(x_i|\theta)/\pi'(x_i|\theta). The weights, w_i, are known as importance weights.
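Continuing the toy example above (again, purely for illustration), we could estimate the same marginal likelihood using draws from a Cauchy auxiliary centred at \theta:

xi=rcauchy(n,theta) # sample from the auxiliary pi'(x|theta)
wi=dnorm(xi,theta,1)/dcauchy(xi,theta) # importance weights
mean(dnorm(y,xi,1)*wi) # unbiased importance sampling estimate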

Importance resampling

An idea closely related to importance sampling is that of importance resampling, in which importance weights are used to resample so as to equalise the weights, often prior to a further round of weighting and resampling. The basic idea is to generate an approximate sample from a target density \pi(x) using values sampled from an auxiliary distribution \pi'(x), where we now suppress any dependence of the distributions on model parameters, \theta.

First generate a sample x_1,\ldots,x_n from \pi'(x) and compute weights w_i=\pi(x_i)/\pi'(x_i),\ i=1,\ldots,n. Then compute normalised weights \tilde{w}_i=w_i/\sum_{k=1}^n w_k. Finally, generate a new sample of size n by sampling n times with replacement from the original sample, with the probability of choosing each value given by its normalised weight.

As an example, consider using a sample from the Cauchy distribution as an auxiliary distribution for approximately sampling standard normal random quantities. We can do this using a few lines of R as follows.

n=1000
xa=rcauchy(n) # sample from the Cauchy auxiliary
w=dnorm(xa)/dcauchy(xa) # importance weights (unnormalised)
x=sample(xa,n,prob=w,replace=TRUE) # resample in proportion to the weights
hist(x,30) # approximately standard normal
mean(w) # should be close to 1

Note that we don’t actually need to compute the normalised weights, as the sample function normalises for us, and that the average weight will be close to one; indeed, when both the target and auxiliary densities are correctly normalised, the expected value of each weight is exactly 1. The procedure also works when one or both of the densities are not correctly normalised, since the weights are normalised prior to sampling anyway. In that case the expected weight is the (ratio of) normalising constant(s), and so the average weight gives an estimate of the normalising constant.
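For example (another purely illustrative check), if we use an unnormalised target such as \exp(-x^2/2) with the Cauchy samples above, the average weight estimates the missing normalising constant, \sqrt{2\pi}\approx 2.507.

w=exp(-xa^2/2)/dcauchy(xa) # weights for an unnormalised target
mean(w) # estimates the normalising constant sqrt(2*pi)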

Note that the importance resampling procedure is approximate: unlike a technique such as rejection sampling, it does not produce samples having exactly the target distribution. Indeed, it is clear that in the n=1 case the final sample is drawn exactly from the auxiliary and not the target. The procedure is asymptotically exact, improving as the sample size increases and tending to the exact target as n\rightarrow \infty.

We can understand why importance resampling works by first considering the univariate case, using correctly normalised densities. Consider a very large number of particles, N. The proportion of the auxiliary samples falling in a small interval [x,x+dx) will be \pi'(x)dx, corresponding to roughly N\pi'(x)dx particles. The weight for each of those particles will be w(x)=\pi(x)/\pi'(x), and since the expected weight of a random particle is 1, the sum of all weights will be (roughly) N, leading to normalised weights for the particles near x of \tilde{w}(x)=\pi(x)/[N\pi'(x)]. The combined weight of all particles in [x,x+dx) is therefore \pi(x)dx. Clearly then, when we resample N times we expect to select roughly N\pi(x)dx particles from this interval, corresponding to a proportion \pi(x)dx, and hence a density of \pi(x) in the final sample.

Obviously the above argument is very informal, but can be tightened up into a reasonably rigorous proof for the 1d case without too much effort, and the multivariate extension is also reasonably clear.

The bootstrap particle filter

The bootstrap particle filter is an iterative method for carrying out Bayesian inference for dynamic state space (partially observed Markov process) models, sometimes also known as hidden Markov models (HMMs). Here, an unobserved Markov process, x_0,x_1,\ldots,x_T, governed by a transition kernel p(x_{t+1}|x_t), is partially observed via some measurement model p(y_t|x_t), leading to data y_1,\ldots,y_T. The idea is to make inference for the hidden states x_{0:T} given the data y_{1:T}. The method is a very simple application of the importance resampling technique. At each time, t, we assume that we have an (approximate) sample from p(x_t|y_{1:t}) and use importance resampling to generate an approximate sample from p(x_{t+1}|y_{1:t+1}).

More precisely, the procedure is initialised with a sample x_0^k \sim p(x_0),\ k=1,\ldots,M, with uniform normalised weights {w'}_0^k=1/M. Now suppose that we have a weighted sample \{x_t^k,{w'}_t^k|k=1,\ldots,M\} from p(x_t|y_{1:t}). First generate an equally weighted sample by resampling with replacement M times to obtain \{\tilde{x}_t^k|k=1,\ldots,M\} (giving an approximate random sample from p(x_t|y_{1:t})); note that each value is drawn independently from \sum_{i=1}^M {w'}_t^i\delta(x-x_t^i). Next propagate each particle forward according to the Markov process model by sampling x_{t+1}^k\sim p(x_{t+1}|\tilde{x}_t^k),\ k=1,\ldots,M (giving an approximate random sample from p(x_{t+1}|y_{1:t})). Finally, for each of the new particles compute a weight w_{t+1}^k=p(y_{t+1}|x_{t+1}^k), and then a normalised weight {w'}_{t+1}^k=w_{t+1}^k/\sum_i w_{t+1}^i.

It is clear from our understanding of importance resampling that these weights are appropriate for representing a sample from p(x_{t+1}|y_{1:t+1}), and so the particles and weights can be propagated forward to the next time point. It is also clear that the average weight at each time gives an estimate of the marginal likelihood of the current data point given the data so far: the raw weights are evaluations of p(y_{t+1}|x_{t+1}) at (approximate) draws from p(x_{t+1}|y_{1:t}), so their average is a Monte Carlo estimate of \int p(y_{t+1}|x_{t+1})p(x_{t+1}|y_{1:t})\,dx_{t+1} = p(y_{t+1}|y_{1:t}). So we define

\displaystyle  \hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{k=1}^M w_t^k

and

\displaystyle  \hat{p}(y_{1:T}) = \hat{p}(y_1)\prod_{t=2}^T \hat{p}(y_t|y_{1:t-1}).

Again, from our understanding of importance resampling, it should be reasonably clear that \hat{p}(y_{1:T}) is a consistent estimator of {p}(y_{1:T}). It is much less clear, but nevertheless true that this estimator is also unbiased. The standard reference for this fact is Del Moral (2004), but this is a rather technical monograph. A much more accessible proof (for a very general particle filter) is given in Pitt et al (2011).
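Putting the pieces together, here is a minimal sketch of a bootstrap particle filter in R, returning the log of the likelihood estimate just defined. The functions rinit (sampling p(x_0) for M particles), rtrans (sampling the transition kernel, vectorised over particles) and dmeas (the measurement density p(y_t|x_t)) are hypothetical placeholders for a particular state space model.

bpf=function(y,M,rinit,rtrans,dmeas) {
    x=rinit(M) # initialise particles from p(x_0)
    ll=0 # running log-likelihood estimate
    for (t in 1:length(y)) {
        x=rtrans(x) # propagate: x_t^k ~ p(x_t|x_{t-1}^k)
        w=dmeas(y[t],x) # raw weights w_t^k = p(y_t|x_t^k)
        ll=ll+log(mean(w)) # accumulate log of p-hat(y_t|y_{1:t-1})
        x=sample(x,M,prob=w,replace=TRUE) # resample to equalise weights
    }
    ll
}

Note that resampling at the end of each pass through the loop is equivalent to resampling at the start of the next, as in the description above.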

It should therefore be clear that if one is interested in developing MCMC algorithms for state space models, one can use a pseudo-marginal MCMC scheme, substituting in \hat{p}_\theta(y_{1:T}) from a bootstrap particle filter in place of p(y_{1:T}|\theta). This turns out to be a simple special case of the particle marginal Metropolis-Hastings (PMMH) algorithm described in Andrieu et al (2010). However, the PMMH algorithm in fact has the full joint posterior p(\theta,x_{0:T}|y_{1:T}) as its target. I will explain the PMMH algorithm in a subsequent post.
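As a sketch of how the pieces above might fit together, here is the pseudo-marginal scheme driven by the bootstrap filter for an illustrative AR(1) model with Gaussian measurement error (a toy model assumed for this example, with a single unknown autoregressive parameter and a symmetric random walk proposal).

set.seed(1)
xs=numeric(50); xs[1]=rnorm(1)
for (t in 2:50) xs[t]=0.8*xs[t-1]+rnorm(1,0,0.5) # latent AR(1) path
y=rnorm(50,xs,1) # synthetic observations
likEst=function(theta)
    exp(bpf(y,500,function(M) rnorm(M),
        function(x) theta*x+rnorm(length(x),0,0.5),
        function(yt,x) dnorm(yt,x,1)))
logPrior=function(theta) dnorm(theta,0,10,log=TRUE)
rprop=function(theta) theta+rnorm(1,0,0.05) # random walk proposal
dprop=function(new,old) 0 # symmetric, so this term cancels
out=pmMH(0.5,5000,logPrior,rprop,dprop,likEst)

In practice the likelihood would be kept on the log scale throughout to avoid numerical underflow; the exponentiation here is just to match the pmMH sketch above.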

Published by

darrenjw

I am Professor of Statistics within the Department of Mathematical Sciences at Durham University, UK. I am a Bayesian statistician interested in computation and applications, especially to engineering and the life sciences.

12 thoughts on “MCMC, Monte Carlo likelihood estimation, and the bootstrap particle filter”

  1. Hi Darren,

    First of all, thank you so much! Your explanations are far clearer than the papers you cite and have given a huge boost to my research. I cannot stress enough what a help you’ve been.

    I wanted to point out what I believe is an error in this post so you can set me right or keep things true. I don’t believe that using a bootstrap particle filter, we end up with an unbiased approximation to the likelihood. The problem lies in that normalizing by the sum of weights at each iteration for any finite number of particles introduces a bias. Take a simple 1 step, single particle bootstrap filter to see that after correcting for one observation, we do not have an unbiased estimate for the state at time 1 given an observation at time zero. I admittedly skimmed the Pitt 2011 paper, but from his arguments it seems that he considers the bootstrap filter without resampling.

    Let me know if this seems right to you.

    1. Thanks for your feedback. Actually, the bootstrap filter does give an unbiased estimate of likelihood, even with resampling. Resampling is explicitly considered in the Pitt paper. The argument is a little involved, but reasonably straightforward. You may find it helpful to write out the special case of the Pitt proof for the simple bootstrap filter. The key to really understanding what is going on is to see the estimate as a martingale (which is the essence of the Del Moral proof), but the Pitt paper explains clearly why this is the case. Also note that you don’t have to understand why the estimate is unbiased to understand the PMMH algorithm discussed in the following post.

      1. You’re absolutely right, I misread the paper. Thanks!

  2. Hiya Darren,

    I’m a first year PhD student working in mathematical modelling of infectious disease dynamics at Imperial College! Firstly just wanted to say thank you so much for having written this series on Particle Filters and the associated maths, it’s been immensely useful!

    I’ve got just two quick questions based on the post above:

    1) It wasn’t clear to me how the average weight at each time provides an estimate of the marginal likelihood of the current data point given the data so far. Could you possibly explain how that’s the case?
    2) Just to double check, is the estimate of the marginal likelihood calculated using the weights or the normalised weights?

    Thanks again!

    Charlie Whittaker

  3. 1) Just because you are averaging the (raw) weight p(y_{t+1}|x_{t+1}) wrt the distribution p(x_{t+1}|y_{1:t}) to get an estimate of p(y_{t+1}|y_{1:t}).
    2) Raw, because of the above.
