## Part 6: Hamiltonian Monte Carlo (HMC)

### Introduction

This is the sixth part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1.

In the previous post we saw how to construct an MCMC algorithm utilising gradient information by considering a Langevin equation having our target distribution of interest as its equilibrium. This equation has a physical interpretation in terms of the stochastic dynamics of a particle in a potential equal to minus the log of the target density. It turns out that thinking about the deterministic dynamics of a particle in such a potential can lead to more efficient MCMC algorithms.

### Hamiltonian dynamics

Hamiltonian dynamics is often presented as an extension of a fairly general version of Lagrangian dynamics. However, for our purposes a rather simple version is quite sufficient, based on basic concepts from Newtonian dynamics, familiar from school. Inspired by our Langevin example, we will consider the dynamics of a particle in a potential function $V(q)$. We will see later why we want $V(q) = -\log \pi(q)$ for our target of interest, $\pi(\cdot)$. In the context of Hamiltonian (and Lagrangian) dynamics we typically use $q$ as our position variable, rather than $x$.

The potential function induces a (conservative) force on the particle equal to $-\nabla V(q)$ when the particle is at position $q$. Then Newton’s second law of motion, "F=ma", takes the form

$\displaystyle \nabla V(q) + m \ddot{q} = 0.$

In Newtonian mechanics, we often consider the position vector $q$ as 3-dimensional. Here it will be $n$-dimensional, where $n$ is the number of variables in our target. We can then think of our second law as governing a single $n$-dimensional particle of mass $m$, or $n$ one-dimensional particles all of mass $m$. But in this latter case, there is no need to assume that all particles have the same mass, and we could instead write our law of motion as

$\displaystyle \nabla V(q) + M \ddot{q} = 0,$

where $M$ is a diagonal matrix. But in fact, since we could change coordinates, there’s no reason to require that $M$ is diagonal. All we need is that $M$ is positive definite, so that we don’t have negative mass in any coordinate direction.

We will take the above equation as the fundamental law governing our dynamical system of interest. The motivation from Newtonian dynamics is interesting, but not required. What is important is that the dynamics of such a system are conservative, in a way that we will shortly make precise.

Our law of motion is a second-order differential equation, since it involves the second derivative of $q$ wrt time. If you’ve ever studied differential equations, you’ll know that there is an easy way to turn a second order equation into a first order equation with twice the dimension by augmenting the system with the velocities. Here, it is more convenient to augment the system with "momentum" variables, $p$, which we define as $p = M\dot{q}$. Then we can write our second order system as a pair of first order equations

$\displaystyle \dot{q} = M^{-1}p$

$\displaystyle \dot{p} = -\nabla V(q)$

These are, in fact, Hamilton’s equations for this system, though this isn’t how they are typically written.

If we define the kinetic energy as

$\displaystyle T(p) = \frac{1}{2}p^\text{T}M^{-1}p,$

then the Hamiltonian

$\displaystyle H(q,p) = V(q) + T(p),$

representing the total energy in the system, is conserved, since

$\displaystyle \dot{H} = \nabla V\cdot \dot{q} + \dot{p}^\text{T}M^{-1}p = \nabla V\cdot \dot{q} + \dot{p}^\text{T}\dot{q} = [\nabla V + \dot{p}]\cdot\dot{q} = 0.$

So, if we obey our Hamiltonian dynamics, our trajectory in $(q,p)$-space will follow contours of the Hamiltonian. It’s also clear that the system is time-reversible, so flipping the sign of the momentum $p$ and integrating will exactly reverse the direction in which the contours are traversed. Another quite important property of Hamiltonian dynamics is that they are volume preserving. This can be verified by checking that the divergence of the flow is zero.

$\displaystyle \nabla\cdot(\dot{q},\dot{p}) = \nabla_q\cdot\dot{q} + \nabla_p\cdot\dot{p} = 0,$

since $\dot{q}$ is a function of $p$ only and $\dot{p}$ is a function of $q$ only.

### Hamiltonian Monte Carlo (HMC)

In Hamiltonian Monte Carlo we introduce an augmented target distribution,

$\displaystyle \tilde \pi(q,p) \propto \exp[-H(q,p)]$

It is clear from this definition that moves leaving the Hamiltonian invariant will also leave the augmented target density unchanged. By following the Hamiltonian dynamics, we will be able to make big (reversible) moves in the space that will be accepted with high probability. Also, our target factorises into two independent components as

$\displaystyle \tilde \pi(q,p) \propto \exp[-V(q)]\exp[-T(p)],$

and so choosing $V(q)=-\log \pi(q)$ will ensure that the $q$-marginal is our real target of interest, $\pi(\cdot)$. It’s also clear that our $p$-marginal is $\mathcal N(0,M)$. This is also the full-conditional for $p$, so re-sampling $p$ from this distribution and leaving $q$ unchanged is a Gibbs move that will leave the augmented target invariant. Re-sampling $p$ will be necessary to properly explore our augmented target, since this will move us to a different contour of $H$.

So, an idealised version of HMC would proceed as follows: First, update $p$ by sampling from its known tractable marginal. Second, update $p$ and $q$ jointly by following the Hamiltonian dynamics. If this second move is regarded as a (deterministic) reversible M-H proposal, it will be accepted with probability one since it leaves the augmented target density unchanged. If we could exactly integrate Hamilton’s equations, this would be fine. But in practice, we will need to use some imperfect numerical method for the integration step. But just as for MALA, we can regard the numerical method as a M-H proposal and correct for the fact that it is imperfect, preserving the exact augmented target distribution.

Hamiltonian systems admit nice numerical integration schemes called symplectic integrators. In HMC a simple alternating Euler method is typically used, known as the leap-frog algorithm. The component updates are all shear transformations, and therefore volume preserving, and exact reversibility is ensured by starting and ending with a half-step update of the momentum variables. In principle, to ensure reversibility of the proposal the momentum variables should be sign-flipped (reversed) to finish, but in practice this doesn’t matter since it doesn’t affect the evaluation of the Hamiltonian and it will then get refreshed, anyway.

So, advancing our system by a time step $\epsilon$ can be done with

$\displaystyle p(t+\epsilon/2) := p(t) - \frac{\epsilon}{2}\nabla V(q(t))$

$\displaystyle q(t+\epsilon) := q(t) + \epsilon M^{-1}p(t+\epsilon/2)$

$\displaystyle p(t+\epsilon) := p(t+\epsilon/2) - \frac{\epsilon}{2}\nabla V(q(t+\epsilon))$

It is clear that if many such updates are chained together, adjacent momentum updates can be collapsed together, giving rise to the "leap-frog" nature of the algorithm, and therefore requiring roughly one gradient evaluation per $\epsilon$ update, rather than two. Since this integrator is volume preserving and exactly reversible, for reasonably small $\epsilon$ it follows the Hamiltonian dynamics reasonably well, but not exactly, and so it does not exactly preserve the Hamiltonian. However, it does make a good M-H proposal, and reasonable acceptance probabilities can often be obtained by chaining together $l$ updates to advance the time of the system by $T=l\epsilon$. The "optimal" value of $l$ and $\epsilon$ will be highly problem dependent, but values of $l=20$ or $l=50$ are not unusual. There are various more-or-less standard methods for tuning these, but we will not consider them here.

Note that since our HMC update on the augmented space consists of a Gibbs move and a M-H update, it is important that our M-H kernel does not keep or thread through the old log target density from the previous M-H update, since the Gibbs move will have changed it in the meantime.

## Implementations

### R

We need a M-H kernel that does not thread through the old log density.

mhKernel = function(logPost, rprop)
function(x) {
prop = rprop(x)
a = logPost(prop) - logPost(x)
if (log(runif(1)) < a)
prop
else
x
}


We can then use this to construct a M-H move as part of our HMC update.

hmcKernel = function(lpi, glpi, eps = 1e-4, l=10, dmm = 1) {
sdmm = sqrt(dmm)
leapf = function(q, p) {
p = p + 0.5*eps*glpi(q)
for (i in 1:l) {
q = q + eps*p/dmm
if (i < l)
p = p + eps*glpi(q)
else
p = p + 0.5*eps*glpi(q)
}
list(q=q, p=-p)
}
alpi = function(x)
lpi(x$q) - 0.5*sum((x$p^2)/dmm)
rprop = function(x)
leapf(x$q, x$p)
mhk = mhKernel(alpi, rprop)
function(q) {
d = length(q)
x = list(q=q, p=rnorm(d, 0, sdmm))
mhk(x)$q } }  See the full runnable script for further details. ### Python First a M-H kernel, def mhKernel(lpost, rprop): def kernel(x): prop = rprop(x) a = lpost(prop) - lpost(x) if (np.log(np.random.rand()) < a): x = prop return x return kernel  and then an HMC kernel. def hmcKernel(lpi, glpi, eps = 1e-4, l=10, dmm = 1): sdmm = np.sqrt(dmm) def leapf(q, p): p = p + 0.5*eps*glpi(q) for i in range(l): q = q + eps*p/dmm if (i < l-1): p = p + eps*glpi(q) else: p = p + 0.5*eps*glpi(q) return (q, -p) def alpi(x): (q, p) = x return lpi(q) - 0.5*np.sum((p**2)/dmm) def rprop(x): (q, p) = x return leapf(q, p) mhk = mhKernel(alpi, rprop) def kern(q): d = len(q) p = np.random.randn(d)*sdmm return mhk((q, p))[0] return kern  See the full runnable script for further details. #### JAX Again, we want an appropriate M-H kernel, def mhKernel(lpost, rprop, dprop = jit(lambda new, old: 1.)): @jit def kernel(key, x): key0, key1 = jax.random.split(key) prop = rprop(key0, x) ll = lpost(x) lp = lpost(prop) a = lp - ll + dprop(x, prop) - dprop(prop, x) accept = (jnp.log(jax.random.uniform(key1)) < a) return jnp.where(accept, prop, x) return kernel  and then an HMC kernel. def hmcKernel(lpi, glpi, eps = 1e-4, l = 10, dmm = 1): sdmm = jnp.sqrt(dmm) @jit def leapf(q, p): p = p + 0.5*eps*glpi(q) for i in range(l): q = q + eps*p/dmm if (i < l-1): p = p + eps*glpi(q) else: p = p + 0.5*eps*glpi(q) return jnp.concatenate((q, -p)) @jit def alpi(x): d = len(x) // 2 return lpi(x[jnp.array(range(d))]) - 0.5*jnp.sum((x[jnp.array(range(d,2*d))]**2)/dmm) @jit def rprop(k, x): d = len(x) // 2 return leapf(x[jnp.array(range(d))], x[jnp.array(range(d, 2*d))]) mhk = mhKernel(alpi, rprop) @jit def kern(k, q): key0, key1 = jax.random.split(k) d = len(q) x = jnp.concatenate((q, jax.random.normal(key0, [d])*sdmm)) return mhk(key1, x)[jnp.array(range(d))] return kern  There is something a little bit strange about this implementation, since the proposal for the M-H move is deterministic, the function rprop just ignores the RNG key that is passed to it. We could tidy this up by making a M-H function especially for deterministic proposals. We won’t pursue this here, but this issue will crop up again later in some of the other functional languages. See the full runnable script for further details. ### Scala A M-H kernel, def mhKern[S]( logPost: S => Double, rprop: S => S, dprop: (S, S) => Double = (n: S, o: S) => 1.0 ): (S) => S = val r = Uniform(0.0,1.0) x0 => val x = rprop(x0) val ll0 = logPost(x0) val ll = logPost(x) val a = ll - ll0 + dprop(x0, x) - dprop(x, x0) if (math.log(r.draw()) < a) x else x0  and a HMC kernel. def hmcKernel(lpi: DVD => Double, glpi: DVD => DVD, dmm: DVD, eps: Double = 1e-4, l: Int = 10) = val sdmm = sqrt(dmm) def leapf(q: DVD, p: DVD): (DVD, DVD) = @tailrec def go(q0: DVD, p0: DVD, l: Int): (DVD, DVD) = val q = q0 + eps*(p0/:/dmm) val p = if (l > 1) p0 + eps*glpi(q) else p0 + 0.5*eps*glpi(q) if (l == 1) (q, -p) else go(q, p, l-1) go(q, p + 0.5*eps*glpi(q), l) def alpi(x: (DVD, DVD)): Double = val (q, p) = x lpi(q) - 0.5*sum(pow(p,2) /:/ dmm) def rprop(x: (DVD, DVD)): (DVD, DVD) = val (q, p) = x leapf(q, p) val mhk = mhKern(alpi, rprop) (q: DVD) => val d = q.length val p = sdmm map (sd => Gaussian(0,sd).draw()) mhk((q, p))._1  See the full runnable script for further details. ### Haskell A M-H kernel: mdKernel :: (StatefulGen g m) => (s -> Double) -> (s -> s) -> g -> s -> m s mdKernel logPost prop g x0 = do let x = prop x0 let ll0 = logPost x0 let ll = logPost x let a = ll - ll0 u <- (genContVar (uniformDistr 0.0 1.0)) g let next = if ((log u) < a) then x else x0 return next  Note that here we are using a M-H kernel specifically for deterministic proposals, since there is no non-determinism signalled in the type signature of prop. We can then use this to construct our HMC kernel. hmcKernel :: (StatefulGen g m) => (Vector Double -> Double) -> (Vector Double -> Vector Double) -> Vector Double -> Double -> Int -> g -> Vector Double -> m (Vector Double) hmcKernel lpi glpi dmm eps l g = let sdmm = cmap sqrt dmm leapf q p = let go q0 p0 l = let q = q0 + (scalar eps)*p0/dmm p = if (l > 1) then p0 + (scalar eps)*(glpi q) else p0 + (scalar (eps/2))*(glpi q) in if (l == 1) then (q, -p) else go q p (l - 1) in go q (p + (scalar (eps/2))*(glpi q)) l alpi x = let (q, p) = x in (lpi q) - 0.5*(sumElements (p*p/dmm)) prop x = let (q, p) = x in leapf q p mk = mdKernel alpi prop g in (\q0 -> do let d = size q0 zl <- (replicateM d . genContVar (normalDistr 0.0 1.0)) g let z = fromList zl let p0 = sdmm * z (q, p) <- mk (q0, p0) return q)  See the full runnable script for further details. ### Dex Again we can use a M-H kernel specific to deterministic proposals. def mdKernel {s} (lpost: s -> Float) (prop: s -> s) (x0: s) (k: Key) : s = x = prop x0 ll0 = lpost x0 ll = lpost x a = ll - ll0 u = rand k select (log u < a) x x0  and use this to construct an HMC kernel. def hmcKernel {n} (lpi: (Fin n)=>Float -> Float) (dmm: (Fin n)=>Float) (eps: Float) (l: Nat) (q0: (Fin n)=>Float) (k: Key) : (Fin n)=>Float = sdmm = sqrt dmm idmm = map (\x. 1.0/x) dmm glpi = grad lpi def leapf (q0: (Fin n)=>Float) (p0: (Fin n)=>Float) : ((Fin n)=>Float & (Fin n)=>Float) = p1 = p0 + (eps/2) .* (glpi q0) q1 = q0 + eps .* (p1*idmm) (q, p) = apply_n l (q1, p1) \(qo, po). pn = po + eps .* (glpi qo) qn = qo + eps .* (pn*idmm) (qn, pn) pf = p + (eps/2) .* (glpi q) (q, -pf) def alpi (qp: ((Fin n)=>Float & (Fin n)=>Float)) : Float = (q, p) = qp (lpi q) - 0.5*(sum (p*p*idmm)) def prop (qp: ((Fin n)=>Float & (Fin n)=>Float)) : ((Fin n)=>Float & (Fin n)=>Float) = (q, p) = qp leapf q p mk = mdKernel alpi prop [k1, k2] = split_key k z = randn_vec k1 p0 = sdmm * z (q, p) = mk (q0, p0) k2 q  Note that the gradient is obtained via automatic differentiation. See the full runnable script for details. ## Next steps This was the main place that I was trying to get to when I started this series of posts. For differentiable log-posteriors (as we have in the case of Bayesian logistic regression), HMC is a pretty good algorithm for reasonably efficient posterior exploration. But there are lots of places we could go from here. We could explore the tuning of MCMC algorithms, or HMC extensions such as NUTS. We could look at MCMC algorithms that are specifically tailored to the logistic regression problem, or we could look at new MCMC algorithms for differentiable targets based on piecewise deterministic Markov processes. Alternatively, we could temporarily abandon MCMC and look at SMC or ABC approaches. Another possibility would be to abandon this multi-language approach and have a bit of a deep dive into Dex, which I think has the potential to be a great programming language for statistical computing. All of these are possibilities for the future, but I’ve a busy few weeks coming up, so the frequency of these posts is likely to substantially decrease. Remember that all of the code associated with this series of posts is available from this github repo. Advertisement ## Bayesian inference for a logistic regression model (Part 5) ## Part 5: the Metropolis-adjusted Langevin algorithm (MALA) ### Introduction This is the fifth part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1. In the previous post we saw how to use Langevin dynamics to construct an approximate MCMC scheme using the gradient of the log target distribution. Each step of the algorithm involved simulating from the Euler-Maruyama approximation to the transition kernel of the process, based on some pre-specified step size, $\Delta t$. We can improve the accuracy of this approximation by making the step size smaller, but this will come at the expense of a more slowly mixing MCMC chain. Fortunately, there is an easy way to make the algorithm "exact" (in the sense that the equilibrium distribution of the Markov chain will be the exact target distribution), for any finite step size, $\Delta t$, simply by using the Euler-Maruyama approximation as the proposal distribution in a Metropolis-Hastings algorithm. This is the Metropolis-adjusted Langevin algorithm (MALA). There are various ways this could be coded up, but here, for clarity, a HoF for generating a MALA kernel will be used, and this function will in turn call on a HoF for generating a Metropolis-Hastings kernel. ## Implementations ### R First we need a function to generate a M-H kernel. mhKernel = function(logPost, rprop, dprop = function(new, old, ...) { 1 }) function(x, ll) { prop = rprop(x) llprop = logPost(prop) a = llprop - ll + dprop(x, prop) - dprop(prop, x) if (log(runif(1)) < a) list(x=prop, ll=llprop) else list(x=x, ll=ll) }  Then we can easily write a function for returning a MALA kernel that makes use of this M-H function. malaKernel = function(lpi, glpi, dt = 1e-4, pre = 1) { sdt = sqrt(dt) spre = sqrt(pre) advance = function(x) x + 0.5*pre*glpi(x)*dt mhKernel(lpi, function(x) rnorm(p, advance(x), spre*sdt), function(new, old) sum(dnorm(new, advance(old), spre*sdt, log=TRUE))) }  Notice that our MALA function requires as input both the gradient of the log posterior (for the proposal) and the log posterior itself (for the M-H correction). Other details are as we have already seen – see the full runnable script. ### Python Again, we need a M-H kernel def mhKernel(lpost, rprop, dprop = lambda new, old: 1.): def kernel(x, ll): prop = rprop(x) lp = lpost(prop) a = lp - ll + dprop(x, prop) - dprop(prop, x) if (np.log(np.random.rand()) < a): x = prop ll = lp return x, ll return kernel  and then a MALA kernel def malaKernel(lpi, glpi, dt = 1e-4, pre = 1): p = len(init) sdt = np.sqrt(dt) spre = np.sqrt(pre) advance = lambda x: x + 0.5*pre*glpi(x)*dt return mhKernel(lpi, lambda x: advance(x) + np.random.randn(p)*spre*sdt, lambda new, old: np.sum(sp.stats.norm.logpdf(new, loc=advance(old), scale=spre*sdt)))  See the full runnable script for further details. #### JAX If we want our algorithm to run fast, and if we want to exploit automatic differentiation to avoid the need to manually compute gradients, then we can easily convert the above code to use JAX. def mhKernel(lpost, rprop, dprop = jit(lambda new, old: 1.)): @jit def kernel(key, x, ll): key0, key1 = jax.random.split(key) prop = rprop(key0, x) lp = lpost(prop) a = lp - ll + dprop(x, prop) - dprop(prop, x) accept = (jnp.log(jax.random.uniform(key1)) < a) return jnp.where(accept, prop, x), jnp.where(accept, lp, ll) return kernel def malaKernel(lpi, dt = 1e-4, pre = 1): p = len(init) glpi = jit(grad(lpost)) sdt = jnp.sqrt(dt) spre = jnp.sqrt(pre) advance = jit(lambda x: x + 0.5*pre*glpi(x)*dt) return mhKernel(lpi, jit(lambda k, x: advance(x) + jax.random.normal(k, [p])*spre*sdt), jit(lambda new, old: jnp.sum(jsp.stats.norm.logpdf(new, loc=advance(old), scale=spre*sdt))))  See the full runnable script for further details. ### Scala def mhKernel[S]( logPost: S => Double, rprop: S => S, dprop: (S, S) => Double = (n: S, o: S) => 1.0 ): ((S, Double)) => (S, Double) = val r = Uniform(0.0,1.0) state => val (x0, ll0) = state val x = rprop(x0) val ll = logPost(x) val a = ll - ll0 + dprop(x0, x) - dprop(x, x0) if (math.log(r.draw()) < a) (x, ll) else (x0, ll0) def malaKernel(lpi: DVD => Double, glpi: DVD => DVD, pre: DVD, dt: Double = 1e-4) = val sdt = math.sqrt(dt) val spre = sqrt(pre) val p = pre.length def advance(beta: DVD): DVD = beta + (0.5*dt)*(pre*:*glpi(beta)) def rprop(beta: DVD): DVD = advance(beta) + sdt*spre.map(Gaussian(0,_).sample()) def dprop(n: DVD, o: DVD): Double = val ao = advance(o) (0 until p).map(i => Gaussian(ao(i), spre(i)*sdt).logPdf(n(i))).sum mhKernel(lpi, rprop, dprop)  See the full runnable script for further details. ### Haskell mhKernel :: (StatefulGen g m) => (s -> Double) -> (s -> g -> m s) -> (s -> s -> Double) -> g -> (s, Double) -> m (s, Double) mhKernel logPost rprop dprop g (x0, ll0) = do x <- rprop x0 g let ll = logPost(x) let a = ll - ll0 + (dprop x0 x) - (dprop x x0) u <- (genContVar (uniformDistr 0.0 1.0)) g let next = if ((log u) < a) then (x, ll) else (x0, ll0) return next malaKernel :: (StatefulGen g m) => (Vector Double -> Double) -> (Vector Double -> Vector Double) -> Vector Double -> Double -> g -> (Vector Double, Double) -> m (Vector Double, Double) malaKernel lpi glpi pre dt g = let sdt = sqrt dt spre = cmap sqrt pre p = size pre advance beta = beta + (scalar (0.5*dt))*pre*(glpi beta) rprop beta g = do zl <- (replicateM p . genContVar (normalDistr 0.0 1.0)) g let z = fromList zl return$ advance(beta) + (scalar sdt)*spre*z
dprop n o = let
in sum $(\i -> logDensity (normalDistr (ao!i) ((spre!i)*sdt)) (n!i)) <$> [0..(p-1)]
in mhKernel lpi rprop dprop g


See the full runnable script for further details.

### Dex

Recall that Dex is differentiable, so we don’t need to provide gradients.

def mhKernel {s} (lpost: s -> Float) (rprop: s -> Key -> s) (dprop: s -> s -> Float)
(sll: (s & Float)) (k: Key) : (s & Float) =
(x0, ll0) = sll
[k1, k2] = split_key k
x = rprop x0 k1
ll = lpost x
a = ll - ll0 + (dprop x0 x) - (dprop x x0)
u = rand k2
select (log u < a) (x, ll) (x0, ll0)

def malaKernel {n} (lpi: (Fin n)=>Float -> Float)
(pre: (Fin n)=>Float) (dt: Float) :
((Fin n)=>Float & Float) -> Key -> ((Fin n)=>Float & Float) =
sdt = sqrt dt
spre = sqrt pre
v = dt .* pre
vinv = map (\ x. 1.0/x) v
def advance (beta: (Fin n)=>Float) : (Fin n)=>Float =
beta + (0.5*dt) .* (pre*(glp beta))
def rprop (beta: (Fin n)=>Float) (k: Key) : (Fin n)=>Float =
(advance beta) + sdt .* (spre*(randn_vec k))
def dprop (new: (Fin n)=>Float) (old: (Fin n)=>Float) : Float =
diff = new - ao
-0.5 * sum ((log v) + diff*diff*vinv)
mhKernel lpi rprop dprop


See the full runnable script for further details.

## Next steps

MALA gives us an MCMC algorithm that exploits gradient information to generate "informed" M-H proposals. But it still has a rather "diffusive" character, making it difficult to tune in such a way that large moves are likely to be accepted in challenging high-dimensional situations.

The Langevin dynamics on which MALA is based can be interpreted as the (over-damped) stochastic dynamics of a particle moving in a potential energy field corresponding to minus the log posterior. It turns out that the corresponding deterministic dynamics can be exploited to generate proposals better able to make large moves while still having a high probability of acceptance. This is the idea behind Hamiltonian Monte Carlo (HMC), which we’ll look at next.

## Part 4: Gradients and the Langevin algorithm

### Introduction

This is the fourth part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1.

In the previous post we saw how the Metropolis algorithm could be used to generate a Markov chain targeting our posterior distribution. In high dimensions the diffusive nature of the Metropolis random walk proposal becomes increasingly inefficient. It is therefore natural to try and develop algorithms that use additional information about the target distribution. In the case of a differentiable log posterior target, a natural first step in this direction is to try and make use of gradient information.

## Gradient of a logistic regression model

There are various ways to derive the gradient of our logistic regression model, but it might be simplest to start from the first form of the log likelihood that we deduced in Part 2:

$\displaystyle l(\beta;y) = y^\textsf{T}X\beta - \mathbf{1}^\textsf{T}\log(\mathbf{1}+\exp[X\beta])$

We can write this out in component form as

$\displaystyle l(\beta;y) = \sum_j\sum_j y_iX_{ij}\beta_j - \sum_i\log\left(1+\exp\left[\sum_jX_{ij}\beta_j\right]\right).$

Differentiating wrt $\beta_k$ gives

$\displaystyle \frac{\partial l}{\partial \beta_k} = \sum_i y_iX_{ik} - \sum_i \frac{\exp\left[\sum_j X_{ij}\beta_j\right]X_{ik}}{1+\exp\left[\sum_j X_{ij}\beta_j\right]}.$

It’s then reasonably clear that stitching all of the partial derivatives together will give the gradient vector

$\displaystyle \nabla l = X^\textsf{T}\left[ y - \frac{\mathbf{1}}{\mathbf{1}+\exp[-X\beta]} \right].$

This is the gradient of the log likelihood, but we also need the gradient of the log prior. Since we are assuming independent $\beta_i \sim N(0,v_i)$ priors, it is easy to see that the gradient of the log prior is just $-\beta\circ v^{-1}$. It is the sum of these two terms that gives the gradient of the log posterior.

### R

In R we can implement our gradient function as

glp = function(beta) {
glpr = -beta/(pscale*pscale)
gll = as.vector(t(X) %*% (y - 1/(1 + exp(-X %*% beta))))
glpr + gll
}


### Python

In Python we could use

def glp(beta):
glpr = -beta/(pscale*pscale)
gll = (X.T).dot(y - 1/(1 + np.exp(-X.dot(beta))))
return (glpr + gll)


We don’t really need a JAX version, since JAX can auto-diff the log posterior for us.

### Scala

  def glp(beta: DVD): DVD =
val glpr = -beta /:/ pvar
val gll = (X.t)*(y - ones/:/(ones + exp(-X*beta)))
glpr + gll



Using hmatrix we could use something like

glp :: Matrix Double -> Vector Double -> Vector Double -> Vector Double
glp x y b = let
glpr = -b / (fromList [100.0, 1, 1, 1, 1, 1, 1, 1])
gll = (tr x) #> (y - (scalar 1)/((scalar 1) + (cmap exp (-x #> b))))
in glpr + gll


There’s something interesting to say about Haskell and auto-diff, but getting into this now will be too much of a distraction. I may come back to it in some future post.

### Dex

Dex is differentiable, so we don’t need a gradient function – we can just use grad lpost. However, for interest and comparison purposes we could nevertheless implement it directly with something like

prscale = map (\ x. 1.0/x) pscale

def glp (b: (Fin 8)=>Float) : (Fin 8)=>Float =
glpr = -b*prscale*prscale
gll = (transpose x) **. (y - (map (\eta. 1.0/(1.0 + eta)) (exp (-x **. b))))
glpr + gll


## Langevin diffusions

Now that we have a way of computing the gradient of the log of our target density we need some MCMC algorithms that can make good use of it. In this post we will look at a simple approximate MCMC algorithm derived from an overdamped Langevin diffusion model. In subsequent posts we’ll look at more sophisticated, exact MCMC algorithms.

The multivariate stochastic differential equation (SDE)

$\displaystyle dX_t = \frac{1}{2}\nabla\log\pi(X_t)dt + dW_t$

has $\pi(\cdot)$ as its equilibrium distribution. Informally, an SDE of this form is a continuous time process with infinitesimal transition kernel

$\displaystyle X_{t+dt}|(X_t=x_t) \sim N\left(x_t+\frac{1}{2}\nabla\log\pi(x_t)dt,\mathbf{I}dt\right).$

There are various more-or-less formal ways to see that $\pi(\cdot)$ is stationary. A good way is to check it satisfies the Fokkerâ€“Planck equation with zero LHS. A less formal approach would be to see that the infinitesimal transition kernel for the process satisfies detailed balance with $\pi(\cdot)$.

Similar arguments show that for any fixed positive definite matrix $A$, the SDE

$\displaystyle dX_t = \frac{1}{2}A\nabla\log\pi(X_t)dt + A^{1/2}dW_t$

also has $\pi(\cdot)$ as a stationary distribution. It is quite common to choose a diagonal matrix $A$ to put the components of $X_t$ on a common scale.

Simulating exact sample paths from SDEs such as the overdamped Langevin diffusion model is typically difficult (though not necessarily impossible), so we instead want something simple and tractable as the basis of our MCMC algorithms. Here we will just simulate from the Eulerâ€“Maruyama approximation of the process by choosing a small but finite time step $\Delta t$ and using the transition kernel

$\displaystyle X_{t+\Delta t}|(X_t=x_t) \sim N\left(x_t+\frac{1}{2}A\nabla\log\pi(x_t)\Delta t, A\Delta t\right)$

as the basis of our MCMC method. For sufficiently small $\Delta t$ this should accurately approximate the Langevin dynamics, leading to an equilibrium distribution very close to $\pi(\cdot)$. That said, we would like to choose $\Delta t$ as large as we can get away with, since that will lead to a more rapidly mixing MCMC chain. Below are some implementations of this kernel for a diagonal pre-conditioning matrix.

## Implementation

### R

We can create a kernel for the unadjusted Langevin algorithm in R with the following function.

ulKernel = function(glpi, dt = 1e-4, pre = 1) {
sdt = sqrt(dt)
spre = sqrt(pre)
advance = function(x) x + 0.5*pre*glpi(x)*dt
}


Here, we can pass in pre, which is expected to be a vector representing the diagonal of the pre-conditioning matrix, $A$. We can then use this kernel to generate an MCMC chain as we have seen previously. See the full runnable script for further details.

### Python

def ulKernel(glpi, dt = 1e-4, pre = 1):
p = len(init)
sdt = np.sqrt(dt)
spre = np.sqrt(pre)
advance = lambda x: x + 0.5*pre*glpi(x)*dt
def kernel(x):
return kernel


See the full runnable script for further details.

#### JAX

def ulKernel(lpi, dt = 1e-4, pre = 1):
p = len(init)
sdt = jnp.sqrt(dt)
spre = jnp.sqrt(pre)
advance = jit(lambda x: x + 0.5*pre*glpi(x)*dt)
@jit
def kernel(key, x):
return kernel


Note how for JAX we can just pass in the log posterior, and the gradient function can be obtained by automatic differentiation. See the full runnable script for further details.

### Scala

def ulKernel(glp: DVD => DVD, pre: DVD, dt: Double): DVD => DVD =
val sdt = math.sqrt(dt)
val spre = sqrt(pre)
beta + (0.5*dt)*(pre*:*glp(beta))


See the full runnable script for further details.

ulKernel :: (StatefulGen g m) =>
(Vector Double -> Vector Double) -> Vector Double -> Double -> g ->
Vector Double -> m (Vector Double)
ulKernel glpi pre dt g beta = do
let sdt = sqrt dt
let spre = cmap sqrt pre
let p = size pre
let advance beta = beta + (scalar (0.5*dt))*pre*(glpi beta)
zl <- (replicateM p . genContVar (normalDistr 0.0 1.0)) g
let z = fromList zl
ll = pair$ll } mat[i, ] = x } if (verb) message("Done.") mat }  Then, in the context of our running logistic regression example, and using the log-posterior from the previous post, we can construct our kernel and run it as follows. pre = c(10.0,1,1,1,1,1,5,1) out = mcmc(init, mhKernel(lpost, function(x) x + pre*rnorm(p, 0, 0.02)), thin=1000)  Note the use of a symmetric proposal, so the proposal density is not required. Also note the use of a larger proposal variance for the intercept term and the second last covariate. See the full runnable script for further details. ### Python We can do something very similar to R in Python using NumPy. Our HoF for constructing a M-H kernel is def mhKernel(lpost, rprop, dprop = lambda new, old: 1.): def kernel(x, ll): prop = rprop(x) lp = lpost(prop) a = lp - ll + dprop(x, prop) - dprop(prop, x) if (np.log(np.random.rand()) < a): x = prop ll = lp return x, ll return kernel  Our Markov chain runner function is def mcmc(init, kernel, thin = 10, iters = 10000, verb = True): p = len(init) ll = -np.inf mat = np.zeros((iters, p)) x = init if (verb): print(str(iters) + " iterations") for i in range(iters): if (verb): print(str(i), end=" ", flush=True) for j in range(thin): x, ll = kernel(x, ll) mat[i,:] = x if (verb): print("\nDone.", flush=True) return mat  We can use this code in the context of our logistic regression example as follows. pre = np.array([10.,1.,1.,1.,1.,1.,5.,1.]) def rprop(beta): return beta + 0.02*pre*np.random.randn(p) out = mcmc(init, mhKernel(lpost, rprop), thin=1000)  See the full runnable script for further details. #### JAX The above R and Python scripts are fine, but both languages are rather slow for this kind of workload. Fortunately it’s rather straightforward to convert the Python code to JAX to obtain quite amazing speed-up. We can write our M-H kernel as def mhKernel(lpost, rprop, dprop = jit(lambda new, old: 1.)): @jit def kernel(key, x, ll): key0, key1 = jax.random.split(key) prop = rprop(key0, x) lp = lpost(prop) a = lp - ll + dprop(x, prop) - dprop(prop, x) accept = (jnp.log(jax.random.uniform(key1)) < a) return jnp.where(accept, prop, x), jnp.where(accept, lp, ll) return kernel  and our MCMC runner function as def mcmc(init, kernel, thin = 10, iters = 10000): key = jax.random.PRNGKey(42) keys = jax.random.split(key, iters) @jit def step(s, k): [x, ll] = s x, ll = kernel(k, x, ll) s = [x, ll] return s, s @jit def iter(s, k): keys = jax.random.split(k, thin) _, states = jax.lax.scan(step, s, keys) final = [states[0][thin-1], states[1][thin-1]] return final, final ll = -np.inf x = init _, states = jax.lax.scan(iter, [x, ll], keys) return states[0]  There are really only two slightly tricky things about this code. The first relates to the way JAX handles pseudo-random numbers. Since JAX is a pure functional eDSL, it can’t be used in conjunction with the typical pseudo-random number generators often used in imperative programming languages which rely on a global mutable state. This can be dealt with reasonably straightforwardly by explicitly passing around the random number state. There is a standard way of doing this that has been common practice in functional programming languages for decades. However, this standard approach is very sequential, and so doesn’t work so well in a parallel context. JAX therefore uses a splittable random number generator, where new states are created by splitting the current state into two (or more). We’ll come back to this when we get to the Haskell examples. The second thing that might be unfamiliar to imperative programmers is the use of the scan operation (jax.lax.scan) to generate the Markov chain rather than a "for" loop. But scans are standard operations in most functional programming languages. We can then call this code for our logistic regression example with pre = jnp.array([10.,1.,1.,1.,1.,1.,5.,1.]).astype(jnp.float32) @jit def rprop(key, beta): return beta + 0.02*pre*jax.random.normal(key, [p]) out = mcmc(init, mhKernel(lpost, rprop), thin=1000)  See the full runnable script for further details. ### Scala In Scala we can use a similar approach to that already seen for defining a HoF to return a M-H kernel. def mhKernel[S]( logPost: S => Double, rprop: S => S, dprop: (S, S) => Double = (n: S, o: S) => 1.0 ): ((S, Double)) => (S, Double) = val r = Uniform(0.0,1.0) state => val (x0, ll0) = state val x = rprop(x0) val ll = logPost(x) val a = ll - ll0 + dprop(x0, x) - dprop(x, x0) if (math.log(r.draw()) < a) (x, ll) else (x0, ll0)  Note that Scala’s static typing does not prevent us from defining a function that is polymorphic in the type of the chain state, which we here call S. Also note that we are adopting a pragmatic approach to random number generation, exploiting the fact that Scala is not a pure functional language, using a mutable generator, and omitting to capture the non-determinism of the rprop function (and the returned kernel) in its type signature. In Scala this is a choice, and we could adopt a purer approach if preferred. We’ll see what such an approach will look like in Haskell, coming up next. Now that we have the kernel, we don’t need to write an explicit runner function since Scala has good support for streaming data. There are many more-or-less sophisticated ways that we can work with data streams in Scala, and the choice depends partly on how pure one is being about tracking effects (such as non-determinism), but here I’ll just use the simple LazyList from the standard library for unfolding the kernel into an infinite MCMC chain before thinning and truncating appropriately.  val pre = DenseVector(10.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0) def rprop(beta: DVD): DVD = beta + pre *:* (DenseVector(Gaussian(0.0,0.02).sample(p).toArray)) val kern = mhKernel(lpost, rprop) val s = LazyList.iterate((init, -Inf))(kern) map (_._1) val out = s.drop(150).thin(1000).take(10000)  See the full runnable script for further details. ### Haskell Since Haskell is a pure functional language, we need to have some convention regarding pseudo-random number generation. Haskell supports several styles. The most commonly adopted approach wraps a mutable generator up in a monad. The typical alternative is to use a pure functional generator and either explicitly thread the state through code or hide this in a monad similar to the standard approach. However, Haskell also supports the use of splittable generators, so we can consider all three approaches for comparative purposes. The approach taken does affect the code and the type signatures, and even the streaming data abstractions most appropriate for chain generation. Starting with a HoF for producing a Metropolis kernel, an approach using the standard monadic generators could like like mKernel :: (StatefulGen g m) => (s -> Double) -> (s -> g -> m s) -> g -> (s, Double) -> m (s, Double) mKernel logPost rprop g (x0, ll0) = do x <- rprop x0 g let ll = logPost(x) let a = ll - ll0 u <- (genContVar (uniformDistr 0.0 1.0)) g let next = if ((log u) < a) then (x, ll) else (x0, ll0) return next  Note how non-determinism is captured in the type signatures by the monad m. The explicit pure approach is to thread the generator through non-deterministic functions. mKernelP :: (RandomGen g) => (s -> Double) -> (s -> g -> (s, g)) -> g -> (s, Double) -> ((s, Double), g) mKernelP logPost rprop g (x0, ll0) = let (x, g1) = rprop x0 g ll = logPost(x) a = ll - ll0 (u, g2) = uniformR (0, 1) g1 next = if ((log u) < a) then (x, ll) else (x0, ll0) in (next, g2)  Here the updated random number generator state is returned from each non-deterministic function for passing on to subsequent non-deterministic functions. This explicit sequencing of operations makes it possible to wrap the generator state in a state monad giving code very similar to the stateful monadic generator approach, but as already discussed, the sequential nature of this approach makes it unattractive in parallel and concurrent settings. Fortunately the standard Haskell pure generator is splittable, meaning that we can adopt a splitting approach similar to JAX if we prefer, since this is much more parallel-friendly. mKernelP :: (RandomGen g) => (s -> Double) -> (s -> g -> s) -> g -> (s, Double) -> (s, Double) mKernelP logPost rprop g (x0, ll0) = let (g1, g2) = split g x = rprop x0 g1 ll = logPost(x) a = ll - ll0 u = unif g2 next = if ((log u) < a) then (x, ll) else (x0, ll0) in next  Here non-determinism is signalled by passing a generator state (often called a "key" in the context of splittable generators) into a function. Functions receiving a key are responsible for splitting it to ensure that no key is ever used more than once. Once we have a kernel, we need to unfold our Markov chain. When using the monadic generator approach, it is most natural to unfold using a monadic stream mcmc :: (StatefulGen g m) => Int -> Int -> s -> (g -> s -> m s) -> g -> MS.Stream m s mcmc it th x0 kern g = MS.iterateNM it (stepN th (kern g)) x0 stepN :: (Monad m) => Int -> (a -> m a) -> (a -> m a) stepN n fa = if (n == 1) then fa else (\x -> (fa x) >>= (stepN (n-1) fa))  whereas for the explicit approaches it is more natural to unfold into a regular infinite data stream. So, for the explicit sequential approach we could use mcmcP :: (RandomGen g) => s -> (g -> s -> (s, g)) -> g -> DS.Stream s mcmcP x0 kern g = DS.unfold stepUf (x0, g) where stepUf xg = let (x1, g1) = kern (snd xg) (fst xg) in (x1, (x1, g1))  and with the splittable approach we could use mcmcP :: (RandomGen g) => s -> (g -> s -> s) -> g -> DS.Stream s mcmcP x0 kern g = DS.unfold stepUf (x0, g) where stepUf xg = let (x1, g1) = xg x2 = kern g1 x1 (g2, _) = split g1 in (x2, (x2, g2))  Calling these functions for our logistic regression example is similar to what we have seen before, but again there are minor syntactic differences depending on the approach. For further details see the full runnable scripts for the monadic approach, the pure sequential approach, and the splittable approach. ### Dex Dex is a pure functional language and uses a splittable random number generator, so the style we use is similar to JAX (or Haskell using a splittable generator). We can generate a Metropolis kernel with def mKernel {s} (lpost: s -> Float) (rprop: Key -> s -> s) : Key -> (s & Float) -> (s & Float) = def kern (k: Key) (sll: (s & Float)) : (s & Float) = (x0, ll0) = sll [k1, k2] = split_key k x = rprop k1 x0 ll = lpost x a = ll - ll0 u = rand k2 select (log u < a) (x, ll) (x0, ll0) kern  We can then unfold our Markov chain with def markov_chain {s} (k: Key) (init: s) (kern: Key -> s -> s) (its: Nat) : Fin its => s = with_state init \st. for i:(Fin its). x = kern (ixkey k i) (get st) st := x x  Here we combine Dex’s state effect with a for loop to unfold the stream. See the full runnable script for further details. ## Next steps As previously discussed, none of these codes are optimised, so care should be taken not to over-interpret running times. However, JAX and Dex are noticeably faster than the alternatives, even running on a single CPU core. Another interesting feature of both JAX and Dex is that they are differentiable. This makes it very easy to develop algorithms using gradient information. In subsequent posts we will think about the gradient of our example log-posterior and how we can use gradient information to develop "better" sampling algorithms. The complete runnable scripts are all available from this public github repo. ## Bayesian inference for a logistic regression model (Part 2) ## Part 2: The log posterior ### Introduction This is the second part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1. In the previous post we looked at the basic modelling concepts, and how to fit the model using a variety of PPLs. In this post we will prepare for doing MCMC by considering the problem of computing the unnormalised log posterior for the model. We will then see how this posterior can be implemented in several different languages and libraries. ## Derivation ### Basic structure In Bayesian inference the posterior distribution is just the conditional distribution of the model parameters given the data, and therefore proportional to the joint distribution of the model and data. We often write this as $\displaystyle \pi(\theta|y) \propto \pi(\theta,y) = \pi(\theta)\pi(y|\theta).$ Taking logs we have $\displaystyle \log \pi(\theta, y) = \log \pi(\theta) + \log \pi(y|\theta).$ So (up to an additive constant) the log posterior is just the sum of the log prior and log likelihood. There are many good (numerical) reasons why we try to work exclusively with the log posterior and try to avoid ever evaluating the raw posterior density. For our example logistic regression model, the parameter vector $\theta$ is just the vector of regression coefficients, $\beta$. We assumed independent mean zero normal priors for the components of this vector, so the log prior is just the sum of logs of normal densities. Many scientific libraries will have built-in functions for returning the log-pdf of standard distributions, but if an explicit form is required, the log of the density of a $N(0,\sigma^2)$ at $x$ is just $\displaystyle -\log(2\pi)/2 - \log|\sigma| - x^2/(2\sigma^2),$ and the initial constant term normalising the density can often be dropped. ### Log-likelihood (first attempt) Information from the data comes into the log posterior via the log-likelihood. The typical way to derive the likelihood for problems of this type is to assume the usual binary encoding of the data (success 1, failure 0). Then, for a Bernoulli observation with probability $p_i,\ i=1,\ldots,n$, the likelihood associated with observation $y_i$ is $\displaystyle f(y_i|p_i) = \left[ \hphantom{1-}p_i \quad :\ y_i=1 \atop 1-p_i \quad :\ y_i=0 \right. \quad = \quad p_i^{y_i}(1-p_i)^{1-y_i}.$ Taking logs and then switching to parameter $\eta_i=\text{logit}(p_i)$ we have $\displaystyle \log f(y_i|\eta_i) = y_i\eta_i - \log(1+e^{\eta_i}),$ and summing over $n$ observations gives the log likelihood $\displaystyle \log\pi(y|\eta) \equiv \ell(\eta;y) = y\cdot \eta - \mathbf{1}\cdot\log(\mathbf{1}+\exp{\eta}).$ In the context of logistic regression, $\eta$ is the linear predictor, so $\eta=X\beta$, giving $\displaystyle \ell(\beta;y) = y^\textsf{T}X\beta - \mathbf{1}^\textsf{T}\log(\mathbf{1}+\exp[X\beta]).$ This is a perfectly good way of expressing the log-likelihood, and we will come back to it later when we want the gradient of the log-likelihood, but it turns out that there is a similar-but-different way of deriving it that results in an expression that is equivalent but slightly cheaper to evaluate. ### Log-likelihood (second attempt) For our second attempt, we will assume that the data is coded in a different way. Instead of the usual binary encoding, we will assume that the observation $\tilde y_i$ is 1 for success and -1 for failure. This isn’t really a problem, since the two encodings are related by $\tilde y_i = 2y_i-1$. This new encoding is convenient in the context of a logit parameterisation since then $\displaystyle f(y_i|\eta_i) = \left[ p_i \ :\ \tilde y_i=1\atop 1-p_i\ :\ \tilde y_i=-1 \right. \ = \ \left[ (1+e^{-\eta_i})^{-1} \ :\ \tilde y_i=1\atop (1+e^{\eta_i})^{-1} \ :\ \tilde y_i=-1 \right. \ = \ (1+e^{-\tilde y_i\eta_i})^{-1} ,$ and hence $\displaystyle \log f(y_i|\eta_i) = -\log(1+e^{-\tilde y_i\eta_i}).$ Summing over observations gives $\displaystyle \ell(\eta;\tilde y) = -\mathbf{1}\cdot \log(\mathbf{1}+\exp[-\tilde y\circ \eta]),$ where $\circ$ denotes the Hadamard product. Substituting $\eta=X\beta$ gives the log-likelihood $\displaystyle \ell(\beta;\tilde y) = -\mathbf{1}^\textsf{T} \log(\mathbf{1}+\exp[-\tilde y\circ X\beta]).$ This likelihood is a bit cheaper to evaluate that the one previously derived. If we prefer to write in terms of the original data encoding, we can obviously do so as $\displaystyle \ell(\beta; y) = -\mathbf{1}^\textsf{T} \log(\mathbf{1}+\exp[-(2y-\mathbf{1})\circ (X\beta)]),$ and in practice, it is this version that is typically used. To be clear, as an algebraic function of $\beta$ and $y$ the two functions are different. But they coincide for binary vectors $y$ which is all that matters. ## Implementation ### R In R we can create functions for evaluating the log-likelihood, log-prior and log-posterior as follows (assuming that X and y are in scope). ll = function(beta) sum(-log(1 + exp(-(2*y - 1)*(X %*% beta)))) lprior = function(beta) dnorm(beta[1], 0, 10, log=TRUE) + sum(dnorm(beta[-1], 0, 1, log=TRUE)) lpost = function(beta) ll(beta) + lprior(beta)  ### Python In Python (with NumPy and SciPy) we can define equivalent functions with def ll(beta): return np.sum(-np.log(1 + np.exp(-(2*y - 1)*(X.dot(beta))))) def lprior(beta): return (sp.stats.norm.logpdf(beta[0], loc=0, scale=10) + np.sum(sp.stats.norm.logpdf(beta[range(1,p)], loc=0, scale=1))) def lpost(beta): return ll(beta) + lprior(beta)  #### JAX Python, like R, is a dynamic language, and relatively slow for MCMC algorithms. JAX is a tensor computation framework for Python that embeds a pure functional differentiable array processing language inside Python. JAX can JIT-compile high-performance code for both CPU and GPU, and has good support for parallelism. It is rapidly becoming the preferred way to develop high-performance sampling algorithms within the Python ecosystem. We can encode our log-posterior in JAX as follows. @jit def ll(beta): return jnp.sum(-jnp.log(1 + jnp.exp(-(2*y - 1)*jnp.dot(X, beta)))) @jit def lprior(beta): return (jsp.stats.norm.logpdf(beta[0], loc=0, scale=10) + jnp.sum(jsp.stats.norm.logpdf(beta[jnp.array(range(1,p))], loc=0, scale=1))) @jit def lpost(beta): return ll(beta) + lprior(beta)  ### Scala JAX is a pure functional programming language embedded in Python. Pure functional programming languages are intrinsically more scalable and compositional than imperative languages such as R and Python, and are much better suited to exploit concurrency and parallelism. I’ve given a bunch of talks about this recently, so if you are interested in this, perhaps start with the materials for my Laplace’s Demon talk. Scala and Haskell are arguably the current best popular general purpose functional programming languages, so it is possibly interesting to consider the use of these languages for the development of scalable statistical inference codes. Since both languages are statically typed compiled functional languages with powerful type systems, they can be highly performant. However, neither is optimised for numerical (tensor) computation, so you should not expect that they will have performance comparable with optimised tensor computation frameworks such as JAX. We can encode our log-posterior in Scala (with Breeze) as follows:  def ll(beta: DVD): Double = sum(-log(ones + exp(-1.0*(2.0*y - ones)*:*(X * beta)))) def lprior(beta: DVD): Double = Gaussian(0,10).logPdf(beta(0)) + sum(beta(1 until p).map(Gaussian(0,1).logPdf(_))) def lpost(beta: DVD): Double = ll(beta) + lprior(beta)  #### Spark Apache Spark is a Scala library for distributed "big data" processing on clusters of machines. Despite fundamental differences, there is a sense in which Spark for Scala is a bit analogous to JAX for Python: both Spark and JAX are concerned with scalability, but they are targeting rather different aspects of scalability: JAX is concerned with getting regular sized data processing algorithms to run very fast (on GPUs), whereas Spark is concerned with running huge data processing tasks quickly by distributing work over clusters of machines. Despite obvious differences, the fundamental pure functional computational model adopted by both systems is interestingly similar: both systems are based on lazy transformations of immutable data structures using pure functions. This is a fundamental pattern for scalable data processing transcending any particular language, library or framework. We can encode our log posterior in Spark as follows.  def ll(beta: DVD): Double = df.map{row => val y = row.getAs[Double](0) val x = BDV.vertcat(BDV(1.0),toBDV(row.getAs[DenseVector](1))) -math.log(1.0 + math.exp(-1.0*(2.0*y-1.0)*(x.dot(beta))))}.reduce(_+_) def lprior(beta: DVD): Double = Gaussian(0,10).logPdf(beta(0)) + sum(beta(1 until p).map(Gaussian(0,1).logPdf(_))) def lpost(beta: DVD): Double = ll(beta) + lprior(beta)  ### Haskell Haskell is an old, lazy pure functional programming language with an advanced type system, and remains the preferred language for the majority of functional programming language researchers. Hmatrix is the standard high performance numerical linear algebra library for Haskell, so we can use it to encode our log-posterior as follows. ll :: Matrix Double -> Vector Double -> Vector Double -> Double ll x y b = (negate) (vsum (cmap log ( (scalar 1) + (cmap exp (cmap (negate) ( (((scalar 2) * y) - (scalar 1)) * (x #> b) ) ))))) pscale :: [Double] -- prior standard deviations pscale = [10.0, 1, 1, 1, 1, 1, 1, 1] lprior :: Vector Double -> Double lprior b = sum$  (\x -> logDensity (normalDistr 0.0 (snd x)) (fst x)) <$> (zip (toList b) pscale) lpost :: Matrix Double -> Vector Double -> Vector Double -> Double lpost x y b = (ll x y b) + (lprior b)  Again, a reminder that, here and elsewhere, there are various optimisations could be done that I’m not bothering with. This is all just proof-of-concept code. ### Dex JAX proves that a pure functional DSL for tensor computation can be extremely powerful and useful. But embedding such a language in a dynamic imperative language like Python has a number of drawbacks. Dex is an experimental statically typed stand-alone DSL for differentiable array and tensor programming that attempts to combine some of the correctness and composability benefits of powerful statically typed functional languages like Scala and Haskell with the performance benefits of tensor computation systems like JAX. It is currently rather early its development, but seems very interesting, and is already quite useable. We can encode our log-posterior in Dex as follows. def ll (b: (Fin 8)=>Float) : Float = neg$  sum (log (map (\ x. (exp x) + 1) ((map (\ yi. 1 - 2*yi) y)*(x **. b))))

pscale = [10.0, 1, 1, 1, 1, 1, 1, 1] -- prior SDs
prscale = map (\ x. 1.0/x) pscale

def lprior (b: (Fin 8)=>Float) : Float =
bs = b*prscale
neg $sum ((log pscale) + (0.5 .* (bs*bs))) def lpost (b: (Fin 8)=>Float) : Float = (ll b) + (lprior b)  ## Next steps Now that we have a way of evaluating the log posterior, we can think about constructing Markov chains having the posterior as their equilibrium distribution. In the next post we will look at one of the simplest ways of doing this: the Metropolis algorithm. Complete runnable scripts are available from this public github repo. ## Bayesian inference for a logistic regression model (Part 1) ## Part 1: The basics ### Introduction This is the first in a series of posts on MCMC-based fully Bayesian inference for a logistic regression model. In this series we will look at the model, and see how the posterior distribution can be sampled using a variety of different programming languages and libraries. ## Logistic regression Logistic regression is concerned with predicting a binary outcome based on some covariate information. The probability of "success" is modelled via a logistic transformation of a linear predictor constructed from the covariate vector. This is a very simple model, but is a convenient toy example since it is arguably the simplest interesting example of an intractable (nonlinear) statistical model requiring some kind of iterative numerical fitting method, even in the non-Bayesian setting. In a Bayesian context, the posterior distribution is intractable, necessitating either approximate or computationally intensive numerical methods of "solution". In this series of posts, we will mainly concentrate on MCMC algortithms for sampling the full posterior distribution of the model parameters given some observed data. We assume $n$ observations and $p$ covariates (including an intercept term that is always 1). The binary observations $y_i,\ i=1,\ldots,n$ are 1 for a "success" and 0 for a "failure". The covariate $p$-vectors $x_i, i=1,\ldots,n$ all have 1 as the first element. The statistical model is $\displaystyle \text{logit}\left(\text{Pr}[Y_i = 1]\right) = x_i \cdot \beta,\quad i=1,\ldots,n,$ where $\beta$ is a $p$-vector of parameters, $a\cdot b = a^\textsf{T} b$, and $\displaystyle \text{logit}(q) \equiv \log\left(\frac{q}{1-q}\right),\quad \forall\ q\in (0,1).$ Equivalently, $\displaystyle \text{Pr}[Y_i = 1] = \text{expit}(x_i \cdot \beta),\quad i=1,\ldots,n,$ where $\displaystyle \text{expit}(\theta) \equiv \frac{1}{1+e^{-\theta}},\quad \forall\ \theta\in\mathbb{R}.$ Note that the expit function is sometimes called the logistic or sigmoid function, but expit is slightly less ambiguous. The statistical problem is to choose the parameter vector $\beta$ to provide the "best" model for the probability of success. In the Bayesian setting, a prior distribution (typically multivariate normal) is specified for $\beta$, and then the posterior distribution after conditioning on the data is the object of inferential interest. ### Example problem In order to illustrate the ideas, it is useful to have a small running example. Here we will use the (infamous) Pima training dataset (MASS::Pima.tr in R). Here there are $n=200$ observations and 7 predictors. Adding an intercept gives $p=8$ covariates. For the Bayesian analysis, we need a prior on $\beta$. We will assume independent mean zero normal distributions for each component. The prior standard deviation for the intercept will be 10 and for the other covariates will be 1. ### Describing the model in some PPLs In this first post in the series, we will use probabilistic programming languages (PPLs) to describe the model and sample the posterior distribution. #### JAGS JAGS is a stand-alone domain specific language (DSL) for probabilistic programming. It can be used independently of general purpose programming languages, or called from popular languages for data science such as Python and R. We can describe our model in JAGS with the following code.  model { for (i in 1:n) { y[i] ~ dbern(pr[i]) logit(pr[i]) <- inprod(X[i,], beta) } beta[1] ~ dnorm(0, 0.01) for (j in 2:p) { beta[j] ~ dnorm(0, 1) } }  Note that JAGS uses precision as the second parameter of a normal distribution. See the full runnable R script for further details. Given this model description, JAGS can construct an MCMC sampler for the posterior distribution of the model parameters given the data. See the full script for how to feed in the data, run the sampler, and analyse the output. #### Stan Stan is another stand-alone DSL for probabilistic programming, and has a very sophisticated sampling algorithm, making it a popular choice for non-trivial models. It uses gradient information for sampling, and therefore requires a differentiable log-posterior. We could encode our logistic regression model as follows. data { int<lower=1> n; int<lower=1> p; int<lower=0, upper=1> y[n]; real X[n,p]; } parameters { real beta[p]; } model { for (i in 1:n) { real eta = dot_product(beta, X[i,]); real pr = 1/(1+exp(-eta)); y[i] ~ binomial(1, pr); } beta[1] ~ normal(0, 10); for (j in 2:p) { beta[j] ~ normal(0, 1); } }  Note that Stan uses standard deviation as the second parameter of the normal distribution. See the full runnable R script for further details. #### PyMC JAGS and Stan are stand-alone DSLs for probabilistic programming. This has the advantage of making them independent of any particular host (general purpose) programming language. But it also means that they are not able to take advantage of the language and tool support of an existing programming language. An alternative to stand-alone DSLs are embedded DSLs (eDSLs). Here, a DSL is embedded as a library or package within an existing (general purpose) programming language. Then, ideally, in the context of PPLs, probabilistic programs can become ordinary values within the host language, and this can have some advantages, especially if the host language is sophisticated. A number of probabilistic programming languages have been implemented as eDSLs in Python. Python is not a particularly sophisticated language, so the advantages here are limited, but not negligible. PyMC is probably the most popular eDSL PPL in Python. We can encode our model in PyMC as follows. pscale = np.array([10.,1.,1.,1.,1.,1.,1.,1.]) with pm.Model() as model: beta = pm.Normal('beta', 0, pscale, shape=p) eta = pmm.matrix_dot(X, beta) pm.Bernoulli('y', logit_p=eta, observed=y) traces = pm.sample(2500, tune=1000, init="adapt_diag", return_inferencedata=True)  See the full runnable script for further details. #### NumPyro NumPyro is a fork of Pyro for NumPy and JAX (of which more later). We can encode our model with NumPyro as follows. pscale = jnp.array([10.,1.,1.,1.,1.,1.,1.,1.]).astype(jnp.float32) def model(X, y): coefs = numpyro.sample("beta", dist.Normal(jnp.zeros(p), pscale)) logits = jnp.dot(X, coefs) return numpyro.sample("obs", dist.Bernoulli(logits=logits), obs=y)  See the full runnable script for further details. Please note that none of the above examples have been optimised, or are even necessarily expressed idiomatically within each PPL. I’ve just tried to express the model in as simple and similar way across each PPL. For example, I know that there is a function bernoulli_logit_glm in Stan which would simplify the model and improve sampling efficiency, but I’ve deliberately not used it in order to try and keep the implementations as basic as possible. The same will be true for all of the code examples in this series of blog posts. The code has not been optimised and should therefore not be used for serious benchmarking. ### Next steps PPLs are convenient, and are becoming increasingly sophisticated. Each of the above PPLs provides a simple way to pass in observed data, and automatically construct an MCMC algorithm for sampling from the implied posterior distribution – see the full scripts for details. All of the PPLs work well for this problem, and all produce identical results up to Monte Carlo error. Each PPL has its own approach to sampler construction, and some PPLs offer multiple choices. However, more challenging problems often require highly customised samplers. Such samplers will often need to be created from scratch, and will require (at least) the ability to compute the (unnormalised log) posterior density at proposed parameter values, so in the next post we will look at how this can be derived for this model (in a couple of different ways) and coded up from scratch in a variety of programming languages. All of the complete, runnable code associated with this series of blog posts can be obtained from this public github repo. ## MCMC code for Bayesian inference for a discretely observed stochastic kinetic model In June this year the (twice COVID-delayed) Richard J Boys Memorial Workshop finally took place, celebrating the life and work of my former colleague and collaborator, who died suddenly in 2019 (obituary). I completed the programme of talks by delivering the inaugural RSS North East Richard Boys lecture. For this, I decided that it would be most appropriate to talk about the paper Bayesian inference for a discretely observed stochastic kinetic model, published in Statistics and Computing in 2008. The paper is concerned with (exact and approximate) MCMC-based fully Bayesian inference for continuous time Markov jump processes observed partially and discretely in time. Although the ideas are generally applicable to a broad class of “stochastic kinetic models”, the running example throughout the paper is a discrete stochastic Lotka Volterra model. In preparation for the talk, I managed to track down most of the MCMC codes used for the examples presented in the paper. This included C code I wrote for exact and approximate block-updating algorithms, and Fortran code written by Richard using an exact reversible jump MCMC approach. I’ve fixed up all of the codes so that they are now easy to build and run on a modern Linux (or other Unix-like) system, and provided some minimal documentation. It is all available in a public github repo. Hopefully this will be of some interest or use to a few people. ## Unbiased MCMC with couplings Yesterday there was an RSS Read Paper meeting for the paper Unbiased Markov chain Monte Carlo with couplings by Pierre Jacob, John O’Leary and Yves F. AtchadÃ©. The paper addresses the bias in MCMC estimates due to lack of convergence to equilibrium (the “burn-in” problem), and shows how it is possible to modify MCMC algorithms in order to construct estimates which exactly remove this bias. The requirement is to couple a pair of MCMC chains so that they will at some point meet exactly and thereafter remain coupled. This turns out to be easier to do that one might naively expect. There are many reasons why we might want to remove bias from MCMC estimates, but the primary motivation in the paper was the application to parallel MCMC computation. The idea here is that many pairs of chains can be run independently on any available processors, and the unbiased estimates from the different pairs can be safely averaged to get an (improved) unbiased estimate based on all of the chains. As a discussant of the paper, I’ve spent a bit of time thinking about this idea, and have created a small repository of materials relating to the paper which may be useful for others interested in understanding the method and how to use it in practice. The repo includes a page of links to related papers, blog posts, software and other resources relating to unbiased MCMC that I’ve noticed on-line. Earlier in the year I gave an internal seminar at Newcastle giving a tutorial introduction to the main ideas from the paper, including runnable R code implementations of the examples. The talk was prepared as an executable R Markdown document. The R Markdown source code is available in the repo, but for the convenience of casual browsers I’ve also included a pre-built set of PDF slides. Code examples include code for maximal coupling of two (univariate) distributions, coupling Metropolis-Hastings chains, and coupling a Gibbs sampler for an AR(1) process. I haven’t yet finalised my written discussion contribution, but the slides I presented at the Read Paper meeting are also available. Again, there is source code and pre-built PDF slides. My discussion focused on seeing how well the technique works for Gibbs samplers applied to high-dimensional latent process models (an AR(1) process and a Gaussian Markov random field), and reflecting on the extent to which the technique really solves the burn-in/parallel MCMC problem. The repo also contains a few stand-alone code examples. There are some simple tutorial examples in R (largely derived from my tutorial introduction), including implementation of (univariate) independent and reflection maximal couplings, and a coupled AR(1) process example. The more substantial example concerns a coupled Gibbs sampler for a GMRF. This example is written in the Scala programming language. There are a couple of notable features of this implementation. First, the code illustrates monadic coupling of probability distributions, based on the Rand type in the Breeze scientific library. This provides an elegant way to max couple arbitrary (continuous) random variables, and to create coupled Metropolis(-Hastings) kernels. For example, a coupling of two distributions can be constructed as  def couple[T](p: ContinuousDistr[T], q: ContinuousDistr[T]): Rand[(T, T)] = { def ys: Rand[T] = for { y <- q w <- Uniform(0, 1) ay <- if (math.log(w) > p.logPdf(y) - q.logPdf(y)) Rand.always(y) else ys } yield ay val pair = for { x <- p w <- Uniform(0, 1) } yield (math.log(w) <= q.logPdf(x) - p.logPdf(x), x) pair flatMap { case (b, x) => if (b) Rand.always((x, x)) else (ys map (y => (x, y))) } }  and then draws can be sampled from the resulting Rand[(T, T)] polymorphic type as required. Incidentally, this also illustrates how to construct an independent maximal coupling without evaluating any raw likelihoods. The other notable feature of the code is the use of a parallel comonadic image type for parallel Gibbs sampling of the GMRF, producing a (lazy) Stream of coupled MCMC samples. ## The smfsb R package ## Introduction In the previous post I gave a brief introduction to the third edition of my textbook, Stochastic modelling for systems biology. The algorithms described in the book are illustrated by implementations in R. These implementations are collected together in an R package on CRAN called smfsb. This post will provide a brief introduction to the package and its capabilities. ## Installation The package is on CRAN – see the CRAN package page for details. So the simplest way to install it is to enter install.packages("smfsb")  at the R command prompt. This will install the latest version that is on CRAN. Once installed, the package can be loaded with library(smfsb)  The package is well-documented, so further information can be obtained with the usual R mechanisms, such as vignette(package="smfsb") vignette("smfsb") help(package="smfsb") ?StepGillespie example(StepCLE1D)  The version of the package on CRAN is almost certainly what you want. However, the package is developed on R-Forge – see the R-Forge project page for details. So the very latest version of the package can always be installed with install.packages("smfsb", repos="http://R-Forge.R-project.org")  if you have a reason for wanting it. ## A brief tutorial The vignette gives a quick introduction the the library, which I don’t need to repeat verbatim here. If you are new to the package, I recommend working through that before continuing. Here I’ll concentrate on some of the new features associated with the third edition. ### Simulating stochastic kinetic models Much of the book is concerned with the simulation of stochastic kinetic models using exact and approximate algorithms. Although the primary focus of the text is the application to modelling of intra-cellular processes, the methods are also appropriate for population modelling of ecological and epidemic processes. For example, we can start by simulating a simple susceptible-infectious-recovered (SIR) disease epidemic model. set.seed(2) data(spnModels) stepSIR = StepGillespie(SIR) plot(simTs(SIR$M, 0, 8, 0.05, stepSIR),
main="Exact simulation of the SIR model")


The focus of the text is stochastic simulation of discrete models, so that is the obvious place to start. But there is also support for continuous deterministic simulation.

plot(simTs(SIR$M, 0, 8, 0.05, StepEulerSPN(SIR)), main="Euler simulation of the SIR model")  My favourite toy population dynamics model is the Lotka-Volterra (LV) model, so I tend to use this frequently as a running example throughout the book. We can simulate this (exactly) as follows. stepLV = StepGillespie(LV) plot(simTs(LV$M, 0, 30, 0.2, stepLV),
main="Exact simulation of the LV model")


### Stochastic reaction-diffusion modelling

The first two editions of the book were almost exclusively concerned with well-mixed systems, where spatial effects are ignorable. One of the main new features of the third edition is the inclusion of a new chapter on spatially extended systems. The focus is on models related to the reaction diffusion master equation (RDME) formulation, rather than individual particle-based simulations. For these models, space is typically divided into a regular grid of voxels, with reactions taking place as normal within each voxel, and additional reaction events included, corresponding to the diffusion of particles to adjacent voxels. So to specify such models, we just need an initial condition, a reaction model, and diffusion coefficients (one for each reacting species). So, we can carry out exact simulation of an RDME model for a 1D spatial domain as follows.

N=20; T=30
x0=matrix(0, nrow=2, ncol=N)
rownames(x0) = c("x1", "x2")
x0[,round(N/2)] = LV$M stepLV1D = StepGillespie1D(LV, c(0.6, 0.6)) xx = simTs1D(x0, 0, T, 0.2, stepLV1D, verb=TRUE) image(xx[1,,], main="Prey", xlab="Space", ylab="Time")  image(xx[2,,], main="Predator", xlab="Space", ylab="Time")  Exact simulation of discrete stochastic reaction diffusion systems is very expensive (and the reference implementation provided in the package is very inefficient), so we will often use diffusion approximations based on the CLE. stepLV1DC = StepCLE1D(LV, c(0.6, 0.6)) xx = simTs1D(x0, 0, T, 0.2, stepLV1D) image(xx[1,,], main="Prey", xlab="Space", ylab="Time")  image(xx[2,,], main="Predator", xlab="Space", ylab="Time")  We can think of this algorithm as an explicit numerical integration of the obvious SPDE approximation to the exact model. The package also includes support for simulation of 2D systems. Again, we can use the Spatial CLE to speed things up. m=70; n=50; T=10 data(spnModels) x0=array(0, c(2,m,n)) dimnames(x0)[[1]]=c("x1", "x2") x0[,round(m/2),round(n/2)] = LV$M
stepLV2D = StepCLE2D(LV, c(0.6,0.6), dt=0.05)
xx = simTs2D(x0, 0, T, 0.5, stepLV2D)
N = dim(xx)[4]
image(xx[1,,,N],main="Prey",xlab="x",ylab="y")


image(xx[2,,,N],main="Predator",xlab="x",ylab="y")


### Bayesian parameter inference

Although much of the book is concerned with the problem of forward simulation, the final chapters are concerned with the inverse problem of estimating model parameters, such as reaction rate constants, from data. A computational Bayesian approach is adopted, with the main emphasis being placed on “likelihood free” methods, which rely on forward simulation to avoid explicit computation of sample path likelihoods. The second edition included some rudimentary code for a likelihood free particle marginal Metropolis-Hastings (PMMH) particle Markov chain Monte Carlo (pMCMC) algorithm. The third edition includes a more complete and improved implementation, in addition to approximate inference algorithms based on approximate Bayesian computation (ABC).

The key function underpinning the PMMH approach is pfMLLik, which computes an estimate of marginal model log-likelihood using a (bootstrap) particle filter. There is a new implementation of this function with the third edition. There is also a generic implementation of the Metropolis-Hastings algorithm, metropolisHastings, which can be combined with pfMLLik to create a PMMH algorithm. PMMH algorithms are very slow, but a full demo of how to use these functions for parameter inference is included in the package and can be run with

demo(PMCMC)


Simple rejection-based ABC methods are facilitated by the (very simple) function abcRun, which just samples from a prior and then carries out independent simulations in parallel before computing summary statistics. A simple illustration of the use of the function is given below.

data(LVdata)
rprior <- function() { exp(c(runif(1, -3, 3),runif(1,-8,-2),runif(1,-4,2))) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, th) }
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
diff = s - ssd
sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcRun(10000, rprior, rdist)
q=quantile(out$dist, c(0.01, 0.05, 0.1)) print(q)  ## 1% 5% 10% ## 772.5546 845.8879 881.0573  accepted = out$param[out\$dist < q[1],]
print(summary(accepted))

##        V1                V2                  V3
##  Min.   :0.06498   Min.   :0.0004467   Min.   :0.01887
##  1st Qu.:0.16159   1st Qu.:0.0012598   1st Qu.:0.04122
##  Median :0.35750   Median :0.0023488   Median :0.14664
##  Mean   :0.68565   Mean   :0.0046887   Mean   :0.36726
##  3rd Qu.:0.86708   3rd Qu.:0.0057264   3rd Qu.:0.36870
##  Max.   :4.76773   Max.   :0.0309364   Max.   :3.79220

print(summary(log(accepted)))

##        V1                V2               V3
##  Min.   :-2.7337   Min.   :-7.714   Min.   :-3.9702
##  1st Qu.:-1.8228   1st Qu.:-6.677   1st Qu.:-3.1888
##  Median :-1.0286   Median :-6.054   Median :-1.9198
##  Mean   :-0.8906   Mean   :-5.877   Mean   :-1.9649
##  3rd Qu.:-0.1430   3rd Qu.:-5.163   3rd Qu.:-0.9978
##  Max.   : 1.5619   Max.   :-3.476   Max.   : 1.3329


Naive rejection-based ABC algorithms are notoriously inefficient, so the library also includes an implementation of a more efficient, sequential version of ABC, often known as ABC-SMC, in the function abcSmc. This function requires specification of a perturbation kernel to “noise up” the particles at each algorithm sweep. Again, the implementation is parallel, using the parallel package to run the required simulations in parallel on multiple cores. A simple illustration of use is given below.

rprior <- function() { c(runif(1, -3, 3), runif(1, -8, -2), runif(1, -4, 2)) }
dprior <- function(x, ...) { dunif(x[1], -3, 3, ...) +
dunif(x[2], -8, -2, ...) + dunif(x[3], -4, 2, ...) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, exp(th)) }
rperturb <- function(th){th + rnorm(3, 0, 0.5)}
dperturb <- function(thNew, thOld, ...){sum(dnorm(thNew, thOld, 0.5, ...))}
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
diff = s - ssd
sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcSmc(5000, rprior, dprior, rdist, rperturb,
dperturb, verb=TRUE, steps=6, factor=5)

## 6 5 4 3 2 1

print(summary(out))

##        V1                V2               V3
##  Min.   :-2.9961   Min.   :-7.988   Min.   :-3.999
##  1st Qu.:-1.9001   1st Qu.:-6.786   1st Qu.:-3.428
##  Median :-1.2571   Median :-6.167   Median :-2.433
##  Mean   :-1.0789   Mean   :-6.014   Mean   :-2.196
##  3rd Qu.:-0.2682   3rd Qu.:-5.261   3rd Qu.:-1.161
##  Max.   : 2.1128   Max.   :-2.925   Max.   : 1.706


We can then plot some results with

hist(out[,1],30,main="log(c1)")


hist(out[,2],30,main="log(c2)")


hist(out[,3],30,main="log(c3)")


Although the inference methods are illustrated in the book in the context of parameter inference for stochastic kinetic models, their implementation is generic, and can be used with any appropriate parameter inference problem.

## The smfsbSBML package

smfsbSBML is another R package associated with the third edition of the book. This package is not on CRAN due to its dependency on a package not on CRAN, and hence is slightly less straightforward to install. Follow the available installation instructions to install the package. Once installed, you should be able to load the package with

library(smfsbSBML)


This package provides a function for reading in SBML files and parsing them into the simulatable stochastic Petri net (SPN) objects used by the main smfsb R package. Examples of suitable SBML models are included in the main smfsb GitHub repo. An appropriate SBML model can be read and parsed with a command like:

model = sbml2spn("mySbmlModel.xml")


The resulting value, model is an SPN object which can be passed in to simulation functions such as StepGillespie for constructing stochastic simulation algorithms.

## Other software

In addition to the above R packages, I also have some Python scripts for converting between SBML and the SBML-shorthand notation I use in the book. See the SBML-shorthand page for further details.

Although R is a convenient language for teaching and learning about stochastic simulation, it isn’t ideal for serious research-level scientific computing or computational statistics. So for the third edition of the book I have also developed scala-smfsb, a library written in the Scala programming language, which re-implements all of the models and algorithms from the third edition of the book in Scala, a fast, efficient, strongly-typed, compiled, functional programming language. I’ll give an introduction to this library in a subsequent post, but in the meantime, it is already well documented, so see the scala-smfsb repo for further details, including information on installation, getting started, a tutorial, examples, API docs, etc.

## Source

This blog post started out as an RMarkdown document, the source of which can be found here.

## Stochastic Modelling for Systems Biology, third edition

The third edition of my textbook, Stochastic Modelling for Systems Biology has recently been published by Chapman & Hall/CRC Press. The book has ISBN-10 113854928-2 and ISBN-13 978-113854928-9. It can be ordered from CRC Press, Amazon.com, Amazon.co.uk and similar book sellers.

I was fairly happy with the way that the second edition, published in 2011, turned out, and so I haven’t substantially re-written any of the text for the third edition. Instead, I’ve concentrated on adding in new material and improving the associated on-line resources. Those on-line resources are all free and open source, and hence available to everyone, irrespective of whether you have a copy of the new edition. I’ll give an introduction to those resources below (and in subsequent posts). The new material can be briefly summarised as follows:

• New chapter on spatially extended systems, covering the spatial Gillespie algorithm for reaction diffusion master equation (RDME) models in 1- and 2-d, the next subvolume method, spatial CLE, scaling issues, etc.
• Significantly expanded chapter on inference for stochastic kinetic models from data, covering approximate methods of inference (ABC), including ABC-SMC. The material relating to particle MCMC has also been improved and extended.
• Updated R package, including code relating to all of the new material
• New R package for parsing SBML models into simulatable stochastic Petri net models
• New software library, written in Scala, replicating most of the functionality of the R packages in a fast, compiled, strongly typed, functional language

## New content

Although some minor edits and improvements have been made throughout the text, there are two substantial new additions to the text in this new edition. The first is an entirely new chapter on spatially extended systems. The first two editions of the text focused on the implications of discreteness and stochasticity in chemical reaction systems, but maintained the well-mixed assumption throughout. This is a reasonable first approach, since discreteness and stochasticity are most pronounced in very small volumes where diffusion should be rapid. In any case, even these non-spatial models have very interesting behaviour, and become computationally challenging very quickly for non-trivial reaction networks. However, we know that, in fact, the cell is a very crowded environment, and so even at small spatial scales, many interesting processes are diffusion limited. It therefore seems appropriate to dedicate one chapter (the new Chapter 9) to studying some of the implications of relaxing the well-mixed assumption. Entire books can be written on stochastic reaction-diffusion systems, so here only a brief introduction is provided, based mainly around models in the reaction-diffusion master equation (RDME) style. Exact stochastic simulation algorithms are discussed, and implementations provided in the 1- and 2-d cases, and an appropriate Langevin approximation is examined, the spatial CLE.

The second major addition is to the chapter on inference for stochastic kinetic models from data (now Chapter 11). The second edition of the book included a discussion of “likelihood free” Bayesian MCMC methods for inference, and provided a working implementation of likelihood free particle marginal Metropolis-Hastings (PMMH) for stochastic kinetic models. The third edition improves on that implementation, and discusses approximate Bayesian computation (ABC) as an alternative to MCMC for likelihood free inference. Implementation issues are discussed, and sequential ABC approaches are examined, concentrating in particular on the method known as ABC-SMC.

## New software and on-line resources

Accompanying the text are new and improved on-line resources, all well-documented, free, and open source.

### New website/GitHub repo

Information and materials relating to the previous editions were kept on my University website. All materials relating to this new edition are kept in a public GitHub repo: darrenjw/smfsb. This will be simpler to maintain, and will make it much easier for people to make copies of the material for use and studying off-line.

### Updated R package(s)

Along with the second edition of the book I released an accompanying R package, “smfsb”, published on CRAN. This was a very popular feature, allowing anyone with R to trivially experiment with all of the models and algorithms discussed in the text. This R package has been updated, and a new version has been published to CRAN. The updates are all backwards-compatible with the version associated with the second edition of the text, so owners of that edition can still upgrade safely. I’ll give a proper introduction to the package, including the new features, in a subsequent post, but in the meantime, you can install/upgrade the package from a running R session with

install.packages("smfsb")


and then pop up a tutorial vignette with:

vignette("smfsb")


This should be enough to get you started.

In addition to the main R package, there is an additional R package for parsing SBML models into models that can be simulated within R. This package is not on CRAN, due to its dependency on a non-CRAN package. See the repo for further details.

There are also Python scripts available for converting SBML models to and from the shorthand SBML notation used in the text.

### New Scala library

Another major new resource associated with the third edition of the text is a software library written in the Scala programming language. This library provides Scala implementations of all of the algorithms discussed in the book and implemented in the associated R packages. This then provides example implementations in a fast, efficient, compiled language, and is likely to be most useful for people wanting to use the methods in the book for research. Again, I’ll provide a tutorial introduction to this library in a subsequent post, but it is well-documented, with all necessary information needed to get started available at the scala-smfsb repo/website, including a step-by-step tutorial and some additional examples.