An introduction to functional programming for scalable statistical computing and machine learning

Functional programming (FP) languages are great for statistical computing, computational statistics, and machine learning. They are particularly well-suited to scalable computation, where this could either mean scaling up to distributed algorithms for very big data, or running algorithms for more moderately sized data sets very fast in parallel on GPUs. However, people unfamiliar with FP often find FP languages quite intimidating, due to the fairly steep initial learning curve. This issue is exacerbated by the fact that there is very little introductory documentation available for people new to FP who are interested in applications to statistical computing and machine learning (ML).

So for some time I’ve been meaning to put together materials for a short course (suitable for self-study) that will get people started with FP in few different languages, with a very basic problem from statistical computing used as the running example, together with a catalogue of resources for further learning, in order to provide people with the information they need to keep going once they have got over the initial hurdle. But as with many things, it never got high enough up my priority list to actually sit down and do it. Fortunately, StatML invited me to deliver some training in advanced statistical computing, so this gave me the perfect motivation to actually assemble something. The in-person training has been delayed (due to the UCU strike), but the materials are all prepared and publicly available, and suitable for self-study now.

The course gives a very quick introduction to the ideas of FP, followed by very quick hands-on introductions to my favourite FP languages/libraries: Scala, Haskell, JAX and Dex. There is also a brief introduction to splittable random number generators which are becoming increasingly popular for the development of functional parallel Monte Carlo algorithms.

If you’ve been vaguely interested in FP for statistical computing and ML but not sure how to get started, hopefully this solves the problem.

Part 5: the Metropolis-adjusted Langevin algorithm (MALA)

Introduction

This is the fifth part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1.

In the previous post we saw how to use Langevin dynamics to construct an approximate MCMC scheme using the gradient of the log target distribution. Each step of the algorithm involved simulating from the Euler-Maruyama approximation to the transition kernel of the process, based on some pre-specified step size, $\Delta t$. We can improve the accuracy of this approximation by making the step size smaller, but this will come at the expense of a more slowly mixing MCMC chain.

Fortunately, there is an easy way to make the algorithm "exact" (in the sense that the equilibrium distribution of the Markov chain will be the exact target distribution), for any finite step size, $\Delta t$, simply by using the Euler-Maruyama approximation as the proposal distribution in a Metropolis-Hastings algorithm. This is the Metropolis-adjusted Langevin algorithm (MALA). There are various ways this could be coded up, but here, for clarity, a HoF for generating a MALA kernel will be used, and this function will in turn call on a HoF for generating a Metropolis-Hastings kernel.

Implementations

R

First we need a function to generate a M-H kernel.

mhKernel = function(logPost, rprop, dprop = function(new, old, ...) { 1 })
function(x, ll) {
prop = rprop(x)
llprop = logPost(prop)
a = llprop - ll + dprop(x, prop) - dprop(prop, x)
if (log(runif(1)) < a)
list(x=prop, ll=llprop)
else
list(x=x, ll=ll)
}


Then we can easily write a function for returning a MALA kernel that makes use of this M-H function.

malaKernel = function(lpi, glpi, dt = 1e-4, pre = 1) {
sdt = sqrt(dt)
spre = sqrt(pre)
advance = function(x) x + 0.5*pre*glpi(x)*dt
function(new, old) sum(dnorm(new, advance(old), spre*sdt, log=TRUE)))
}


Notice that our MALA function requires as input both the gradient of the log posterior (for the proposal) and the log posterior itself (for the M-H correction). Other details are as we have already seen – see the full runnable script.

Python

Again, we need a M-H kernel

def mhKernel(lpost, rprop, dprop = lambda new, old: 1.):
def kernel(x, ll):
prop = rprop(x)
lp = lpost(prop)
a = lp - ll + dprop(x, prop) - dprop(prop, x)
if (np.log(np.random.rand()) < a):
x = prop
ll = lp
return x, ll
return kernel


and then a MALA kernel

def malaKernel(lpi, glpi, dt = 1e-4, pre = 1):
p = len(init)
sdt = np.sqrt(dt)
spre = np.sqrt(pre)
advance = lambda x: x + 0.5*pre*glpi(x)*dt
return mhKernel(lpi, lambda x: advance(x) + np.random.randn(p)*spre*sdt,
lambda new, old: np.sum(sp.stats.norm.logpdf(new, loc=advance(old), scale=spre*sdt)))


See the full runnable script for further details.

JAX

If we want our algorithm to run fast, and if we want to exploit automatic differentiation to avoid the need to manually compute gradients, then we can easily convert the above code to use JAX.

def mhKernel(lpost, rprop, dprop = jit(lambda new, old: 1.)):
@jit
def kernel(key, x, ll):
key0, key1 = jax.random.split(key)
prop = rprop(key0, x)
lp = lpost(prop)
a = lp - ll + dprop(x, prop) - dprop(prop, x)
accept = (jnp.log(jax.random.uniform(key1)) < a)
return jnp.where(accept, prop, x), jnp.where(accept, lp, ll)
return kernel

def malaKernel(lpi, dt = 1e-4, pre = 1):
p = len(init)
sdt = jnp.sqrt(dt)
spre = jnp.sqrt(pre)
advance = jit(lambda x: x + 0.5*pre*glpi(x)*dt)
return mhKernel(lpi, jit(lambda k, x: advance(x) +
jax.random.normal(k, [p])*spre*sdt),
jit(lambda new, old:
jnp.sum(jsp.stats.norm.logpdf(new,


See the full runnable script for further details.

Scala

def mhKernel[S](
logPost: S => Double, rprop: S => S,
dprop: (S, S) => Double = (n: S, o: S) => 1.0
): ((S, Double)) => (S, Double) =
val r = Uniform(0.0,1.0)
state =>
val (x0, ll0) = state
val x = rprop(x0)
val ll = logPost(x)
val a = ll - ll0 + dprop(x0, x) - dprop(x, x0)
if (math.log(r.draw()) < a)
(x, ll)
else
(x0, ll0)

def malaKernel(lpi: DVD => Double, glpi: DVD => DVD, pre: DVD, dt: Double = 1e-4) =
val sdt = math.sqrt(dt)
val spre = sqrt(pre)
val p = pre.length
beta + (0.5*dt)*(pre*:*glpi(beta))
def rprop(beta: DVD): DVD =
def dprop(n: DVD, o: DVD): Double =
(0 until p).map(i => Gaussian(ao(i), spre(i)*sdt).logPdf(n(i))).sum
mhKernel(lpi, rprop, dprop)


See the full runnable script for further details.

mhKernel :: (StatefulGen g m) => (s -> Double) -> (s -> g -> m s) ->
(s -> s -> Double) -> g -> (s, Double) -> m (s, Double)
mhKernel logPost rprop dprop g (x0, ll0) = do
x <- rprop x0 g
let ll = logPost(x)
let a = ll - ll0 + (dprop x0 x) - (dprop x x0)
u <- (genContVar (uniformDistr 0.0 1.0)) g
let next = if ((log u) < a)
then (x, ll)
else (x0, ll0)
return next

malaKernel :: (StatefulGen g m) =>
(Vector Double -> Double) -> (Vector Double -> Vector Double) ->
Vector Double -> Double -> g ->
(Vector Double, Double) -> m (Vector Double, Double)
malaKernel lpi glpi pre dt g = let
sdt = sqrt dt
spre = cmap sqrt pre
p = size pre
advance beta = beta + (scalar (0.5*dt))*pre*(glpi beta)
rprop beta g = do
zl <- (replicateM p . genContVar (normalDistr 0.0 1.0)) g
let z = fromList zl
return $advance(beta) + (scalar sdt)*spre*z dprop n o = let ao = advance o in sum$ (\i -> logDensity (normalDistr (ao!i)
((spre!i)*sdt)) (n!i)) <$> [0..(p-1)] in mhKernel lpi rprop dprop g  See the full runnable script for further details. Dex Recall that Dex is differentiable, so we don’t need to provide gradients. def mhKernel {s} (lpost: s -> Float) (rprop: s -> Key -> s) (dprop: s -> s -> Float) (sll: (s & Float)) (k: Key) : (s & Float) = (x0, ll0) = sll [k1, k2] = split_key k x = rprop x0 k1 ll = lpost x a = ll - ll0 + (dprop x0 x) - (dprop x x0) u = rand k2 select (log u < a) (x, ll) (x0, ll0) def malaKernel {n} (lpi: (Fin n)=>Float -> Float) (pre: (Fin n)=>Float) (dt: Float) : ((Fin n)=>Float & Float) -> Key -> ((Fin n)=>Float & Float) = sdt = sqrt dt spre = sqrt pre glp = grad lpi v = dt .* pre vinv = map (\ x. 1.0/x) v def advance (beta: (Fin n)=>Float) : (Fin n)=>Float = beta + (0.5*dt) .* (pre*(glp beta)) def rprop (beta: (Fin n)=>Float) (k: Key) : (Fin n)=>Float = (advance beta) + sdt .* (spre*(randn_vec k)) def dprop (new: (Fin n)=>Float) (old: (Fin n)=>Float) : Float = ao = advance old diff = new - ao -0.5 * sum ((log v) + diff*diff*vinv) mhKernel lpi rprop dprop  See the full runnable script for further details. Next steps MALA gives us an MCMC algorithm that exploits gradient information to generate "informed" M-H proposals. But it still has a rather "diffusive" character, making it difficult to tune in such a way that large moves are likely to be accepted in challenging high-dimensional situations. The Langevin dynamics on which MALA is based can be interpreted as the (over-damped) stochastic dynamics of a particle moving in a potential energy field corresponding to minus the log posterior. It turns out that the corresponding deterministic dynamics can be exploited to generate proposals better able to make large moves while still having a high probability of acceptance. This is the idea behind Hamiltonian Monte Carlo (HMC), which we’ll look at next. Bayesian inference for a logistic regression model (Part 4) Part 4: Gradients and the Langevin algorithm Introduction This is the fourth part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1. In the previous post we saw how the Metropolis algorithm could be used to generate a Markov chain targeting our posterior distribution. In high dimensions the diffusive nature of the Metropolis random walk proposal becomes increasingly inefficient. It is therefore natural to try and develop algorithms that use additional information about the target distribution. In the case of a differentiable log posterior target, a natural first step in this direction is to try and make use of gradient information. Gradient of a logistic regression model There are various ways to derive the gradient of our logistic regression model, but it might be simplest to start from the first form of the log likelihood that we deduced in Part 2: $\displaystyle l(\beta;y) = y^\textsf{T}X\beta - \mathbf{1}^\textsf{T}\log(\mathbf{1}+\exp[X\beta])$ We can write this out in component form as $\displaystyle l(\beta;y) = \sum_j\sum_j y_iX_{ij}\beta_j - \sum_i\log\left(1+\exp\left[\sum_jX_{ij}\beta_j\right]\right).$ Differentiating wrt $\beta_k$ gives $\displaystyle \frac{\partial l}{\partial \beta_k} = \sum_i y_iX_{ik} - \sum_i \frac{\exp\left[\sum_j X_{ij}\beta_j\right]X_{ik}}{1+\exp\left[\sum_j X_{ij}\beta_j\right]}.$ It’s then reasonably clear that stitching all of the partial derivatives together will give the gradient vector $\displaystyle \nabla l = X^\textsf{T}\left[ y - \frac{\mathbf{1}}{\mathbf{1}+\exp[-X\beta]} \right].$ This is the gradient of the log likelihood, but we also need the gradient of the log prior. Since we are assuming independent $\beta_i \sim N(0,v_i)$ priors, it is easy to see that the gradient of the log prior is just $-\beta\circ v^{-1}$. It is the sum of these two terms that gives the gradient of the log posterior. R In R we can implement our gradient function as glp = function(beta) { glpr = -beta/(pscale*pscale) gll = as.vector(t(X) %*% (y - 1/(1 + exp(-X %*% beta)))) glpr + gll }  Python In Python we could use def glp(beta): glpr = -beta/(pscale*pscale) gll = (X.T).dot(y - 1/(1 + np.exp(-X.dot(beta)))) return (glpr + gll)  We don’t really need a JAX version, since JAX can auto-diff the log posterior for us. Scala  def glp(beta: DVD): DVD = val glpr = -beta /:/ pvar val gll = (X.t)*(y - ones/:/(ones + exp(-X*beta))) glpr + gll  Haskell Using hmatrix we could use something like glp :: Matrix Double -> Vector Double -> Vector Double -> Vector Double glp x y b = let glpr = -b / (fromList [100.0, 1, 1, 1, 1, 1, 1, 1]) gll = (tr x) #> (y - (scalar 1)/((scalar 1) + (cmap exp (-x #> b)))) in glpr + gll  There’s something interesting to say about Haskell and auto-diff, but getting into this now will be too much of a distraction. I may come back to it in some future post. Dex Dex is differentiable, so we don’t need a gradient function – we can just use grad lpost. However, for interest and comparison purposes we could nevertheless implement it directly with something like prscale = map (\ x. 1.0/x) pscale def glp (b: (Fin 8)=>Float) : (Fin 8)=>Float = glpr = -b*prscale*prscale gll = (transpose x) **. (y - (map (\eta. 1.0/(1.0 + eta)) (exp (-x **. b)))) glpr + gll  Langevin diffusions Now that we have a way of computing the gradient of the log of our target density we need some MCMC algorithms that can make good use of it. In this post we will look at a simple approximate MCMC algorithm derived from an overdamped Langevin diffusion model. In subsequent posts we’ll look at more sophisticated, exact MCMC algorithms. The multivariate stochastic differential equation (SDE) $\displaystyle dX_t = \frac{1}{2}\nabla\log\pi(X_t)dt + dW_t$ has $\pi(\cdot)$ as its equilibrium distribution. Informally, an SDE of this form is a continuous time process with infinitesimal transition kernel $\displaystyle X_{t+dt}|(X_t=x_t) \sim N\left(x_t+\frac{1}{2}\nabla\log\pi(x_t)dt,\mathbf{I}dt\right).$ There are various more-or-less formal ways to see that $\pi(\cdot)$ is stationary. A good way is to check it satisfies the Fokkerâ€“Planck equation with zero LHS. A less formal approach would be to see that the infinitesimal transition kernel for the process satisfies detailed balance with $\pi(\cdot)$. Similar arguments show that for any fixed positive definite matrix $A$, the SDE $\displaystyle dX_t = \frac{1}{2}A\nabla\log\pi(X_t)dt + A^{1/2}dW_t$ also has $\pi(\cdot)$ as a stationary distribution. It is quite common to choose a diagonal matrix $A$ to put the components of $X_t$ on a common scale. The unadjusted Langevin algorithm Simulating exact sample paths from SDEs such as the overdamped Langevin diffusion model is typically difficult (though not necessarily impossible), so we instead want something simple and tractable as the basis of our MCMC algorithms. Here we will just simulate from the Eulerâ€“Maruyama approximation of the process by choosing a small but finite time step $\Delta t$ and using the transition kernel $\displaystyle X_{t+\Delta t}|(X_t=x_t) \sim N\left(x_t+\frac{1}{2}A\nabla\log\pi(x_t)\Delta t, A\Delta t\right)$ as the basis of our MCMC method. For sufficiently small $\Delta t$ this should accurately approximate the Langevin dynamics, leading to an equilibrium distribution very close to $\pi(\cdot)$. That said, we would like to choose $\Delta t$ as large as we can get away with, since that will lead to a more rapidly mixing MCMC chain. Below are some implementations of this kernel for a diagonal pre-conditioning matrix. Implementation R We can create a kernel for the unadjusted Langevin algorithm in R with the following function. ulKernel = function(glpi, dt = 1e-4, pre = 1) { sdt = sqrt(dt) spre = sqrt(pre) advance = function(x) x + 0.5*pre*glpi(x)*dt function(x, ll) rnorm(p, advance(x), spre*sdt) }  Here, we can pass in pre, which is expected to be a vector representing the diagonal of the pre-conditioning matrix, $A$. We can then use this kernel to generate an MCMC chain as we have seen previously. See the full runnable script for further details. Python def ulKernel(glpi, dt = 1e-4, pre = 1): p = len(init) sdt = np.sqrt(dt) spre = np.sqrt(pre) advance = lambda x: x + 0.5*pre*glpi(x)*dt def kernel(x): return advance(x) + np.random.randn(p)*spre*sdt return kernel  See the full runnable script for further details. JAX def ulKernel(lpi, dt = 1e-4, pre = 1): p = len(init) glpi = jit(grad(lpi)) sdt = jnp.sqrt(dt) spre = jnp.sqrt(pre) advance = jit(lambda x: x + 0.5*pre*glpi(x)*dt) @jit def kernel(key, x): return advance(x) + jax.random.normal(key, [p])*spre*sdt return kernel  Note how for JAX we can just pass in the log posterior, and the gradient function can be obtained by automatic differentiation. See the full runnable script for further details. Scala def ulKernel(glp: DVD => DVD, pre: DVD, dt: Double): DVD => DVD = val sdt = math.sqrt(dt) val spre = sqrt(pre) def advance(beta: DVD): DVD = beta + (0.5*dt)*(pre*:*glp(beta)) beta => advance(beta) + sdt*spre.map(Gaussian(0,_).sample())  See the full runnable script for further details. Haskell ulKernel :: (StatefulGen g m) => (Vector Double -> Vector Double) -> Vector Double -> Double -> g -> Vector Double -> m (Vector Double) ulKernel glpi pre dt g beta = do let sdt = sqrt dt let spre = cmap sqrt pre let p = size pre let advance beta = beta + (scalar (0.5*dt))*pre*(glpi beta) zl <- (replicateM p . genContVar (normalDistr 0.0 1.0)) g let z = fromList zl return$latex  advance(beta) + (scalar sdt)*spre*z


See the full runnable script for further details.

Dex

In Dex we can write a function that accepts a gradient function

def ulKernel {n} (glpi: (Fin n)=>Float -> (Fin n)=>Float)
(pre: (Fin n)=>Float) (dt: Float)
(b: (Fin n)=>Float) (k: Key) : (Fin n)=>Float =
sdt = sqrt dt
spre = sqrt pre
b + (((0.5)*dt) .* (pre*(glpi b))) +
(sdt .* (spre*(randn_vec k)))


or we can write a function that accepts a log posterior, and uses auto-diff to construct the gradient

def ulKernel {n} (lpi: (Fin n)=>Float -> Float)
(pre: (Fin n)=>Float) (dt: Float)
(b: (Fin n)=>Float) (k: Key) : (Fin n)=>Float =
sdt = sqrt dt
spre = sqrt pre
b + ((0.5)*dt) .* (pre*(glpi b)) +
sdt .* (spre*(randn_vec k))


and since Dex is statically typed, we can’t easily mix these functions up.

See the full runnable scripts, without and with auto-diff.

Next steps

In this post we have seen how to construct an MCMC algorithm that makes use of gradient information. But this algorithm is approximate. In the next post we’ll see how to correct for the approximation by using the Langevin updates as proposals within a Metropolis-Hastings algorithm.

Part 3: The Metropolis algorithm

Introduction

This is the third part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1.

In the previous post we derived the log posterior for the model and implemented it in a variety of programming languages and libraries. In this post we will construct a Markov chain having the posterior as its equilibrium.

MCMC

Detailed balance

A homogeneous Markov chain with transition kernel $p(\theta_{n+1}|\theta_n)$ is said to satisfy detailed balance for some target distribution $\pi(\theta)$ if

$\displaystyle \pi(\theta)p(\theta'|\theta) = \pi(\theta')p(\theta|\theta'), \quad \forall \theta, \theta'$

Integrating both sides wrt $\theta$ gives

$\displaystyle \int_\Theta \pi(\theta)p(\theta'|\theta)d\theta = \pi(\theta'),$

from which it is clear that $\pi(\cdot)$ is a stationary distribution of the chain (and the chain is reversible). Under fairly mild regularity conditions we expect $\pi(\cdot)$ to be the equilibrium distribution of the chain.

For a given target $\pi(\cdot)$ we would like to find an easy-to-sample-from transition kernel $p(\cdot|\cdot)$ that satisfies detailed balance. This will then give us a way to (asymptotically) generate samples from our target.

In the context of Bayesian inference, the target $\pi(\theta)$ will typically be the posterior distribution, which in the previous post we wrote as $\pi(\theta|y)$. Here we drop the notational dependence on $y$, since MCMC can be used for any target distribution of interest.

Metropolis-Hastings

Suppose we have a fairly arbitrary easy-to-sample-from transition kernel $q(\theta_{n+1}|\theta_n)$ and a target of interest, $\pi(\cdot)$. Metropolis-Hastings (M-H) is a strategy for using $q(\cdot|\cdot)$ to construct a new transition kernel $p(\cdot|\cdot)$ satisfying detailed balance for $\pi(\cdot)$.

The kernel $p(\theta'|\theta)$ can be described algorithmically as follows:

1. Call the current state of the chain $\theta$. Generate a proposal $\theta^\star$ by simulating from $q(\theta^\star|\theta)$.
2. Compute the acceptance probability

$\displaystyle \alpha(\theta^\star|\theta) = \min\left[1,\frac{\pi(\theta^\star)q(\theta|\theta^\star)}{\pi(\theta)q(\theta^\star|\theta)}\right].$

1. With probability $\alpha(\theta^\star|\theta)$ return new state $\theta'=\theta^\star$, otherwise return $\theta'=\theta$.

It is clear from the algorithmic description that this kernel will have a point mass at $\theta'=\theta$, but that for $\theta'\not=\theta$ the transition kernel will be $p(\theta'|\theta)=q(\theta'|\theta)\alpha(\theta'|\theta)$. But then

$\displaystyle \pi(\theta)p(\theta'|\theta) = \min[\pi(\theta)q(\theta'|\theta),\pi(\theta')q(\theta|\theta')]$

is symmetric in $\theta$ and $\theta'$, and so detailed balance is satisfied. Since detailed balance is trivial at the point mass at $\theta=\theta'$ we are done.

Metropolis algorithm

It is often convenient to generate proposals perturbatively, using a distribution that is symmetric about the current state of the chain. But then $q(\theta'|\theta)=q(\theta|\theta')$, and so $q(\cdot|\cdot)$ drops out of the acceptance probability. This is the Metropolis algorithm.

Some computational tricks

To generate an event with probability $\alpha(\theta^\star|\theta)$, we can generate a $u\sim U(0,1)$ and accept if $u < \alpha(\theta^\star|\theta)$. This is convenient for several reasons. First, it means that we can ignore the "min", and just accept if

$\displaystyle u < \frac{\pi(\theta^\star)q(\theta|\theta^\star)}{\pi(\theta)q(\theta^\star|\theta)}$

since $u\leq 1$ regardless. Better still, we can take logs, and accept if

$\displaystyle \log u < \log\pi(\theta^\star) - \log\pi(\theta) + \log q(\theta|\theta^\star) - \log q(\theta^\star|\theta),$

so there is no need to evaluate any raw densities. Again, in the case of a symmetric proposal distribution, the $q(\cdot|\cdot)$ terms can be dropped.

Another trick worth noting is that in the case of the simple M-H algorithm described, using a single update for the entire state space (and not multiple component-wise updates, for example), and assuming that the same M-H kernel is used repeatedly to generate successive states of a Markov chain, then the $\log \pi(\theta)$ term (which in the context of Bayesian inference will typically be the log posterior) will have been computed at the previous update (irrespective of whether or not the previous move was accepted). So if we are careful about how we pass on that old value to the next iteration, we can avoid recomputing the log posterior, and our algorithm will only require one log posterior evaluation per iteration rather than two. In functional programming languages it is often convenient to pass around this current log posterior density evaluation explicitly, effectively augmenting the state space of the Markov chain to include the log posterior density.

HoF for a M-H kernel

Since I’m a fan of functional programming, we will adopt a functional style throughout, and start by creating a higher-order function (HoF) that accepts a log-posterior and proposal kernel as input and returns a Metropolis kernel as output.

R

In R we can write a function to create a M-H kernel as follows.

mhKernel = function(logPost, rprop, dprop = function(new, old, ...) { 1 })
function(x, ll) {
prop = rprop(x)
llprop = logPost(prop)
a = llprop - ll + dprop(x, prop) - dprop(prop, x)
if (log(runif(1)) < a)
list(x=prop, ll=llprop)
else
list(x=x, ll=ll)
}


Note that the kernel returned requires as input both a current state x and its associated log-posterior, ll. The new state and log-posterior densities are returned.

We need to use this transition kernel to simulate a Markov chain by successive substitution of newly simulated values back into the kernel. In more sophisticated programming languages we will use streams for this, but in R we can just use a for loop to sample values and write the states into the rows of a matrix.

mcmc = function(init, kernel, iters = 10000, thin = 10, verb = TRUE) {
p = length(init)
ll = -Inf
mat = matrix(0, nrow = iters, ncol = p)
colnames(mat) = names(init)
x = init
if (verb)
message(paste(iters, "iterations"))
for (i in 1:iters) {
if (verb)
message(paste(i, ""), appendLF = FALSE)
for (j in 1:thin) {
pair = kernel(x, ll)
x = pair$x ll = pair$ll
}
mat[i, ] = x
}
if (verb)
message("Done.")
mat
}


Then, in the context of our running logistic regression example, and using the log-posterior from the previous post, we can construct our kernel and run it as follows.

pre = c(10.0,1,1,1,1,1,5,1)
out = mcmc(init, mhKernel(lpost,
function(x) x + pre*rnorm(p, 0, 0.02)), thin=1000)


Note the use of a symmetric proposal, so the proposal density is not required. Also note the use of a larger proposal variance for the intercept term and the second last covariate. See the full runnable script for further details.

Python

We can do something very similar to R in Python using NumPy. Our HoF for constructing a M-H kernel is

def mhKernel(lpost, rprop, dprop = lambda new, old: 1.):
def kernel(x, ll):
prop = rprop(x)
lp = lpost(prop)
a = lp - ll + dprop(x, prop) - dprop(prop, x)
if (np.log(np.random.rand()) < a):
x = prop
ll = lp
return x, ll
return kernel


Our Markov chain runner function is

def mcmc(init, kernel, thin = 10, iters = 10000, verb = True):
p = len(init)
ll = -np.inf
mat = np.zeros((iters, p))
x = init
if (verb):
print(str(iters) + " iterations")
for i in range(iters):
if (verb):
print(str(i), end=" ", flush=True)
for j in range(thin):
x, ll = kernel(x, ll)
mat[i,:] = x
if (verb):
print("\nDone.", flush=True)
return mat


We can use this code in the context of our logistic regression example as follows.

pre = np.array([10.,1.,1.,1.,1.,1.,5.,1.])

def rprop(beta):
return beta + 0.02*pre*np.random.randn(p)

out = mcmc(init, mhKernel(lpost, rprop), thin=1000)


See the full runnable script for further details.

JAX

The above R and Python scripts are fine, but both languages are rather slow for this kind of workload. Fortunately it’s rather straightforward to convert the Python code to JAX to obtain quite amazing speed-up. We can write our M-H kernel as

def mhKernel(lpost, rprop, dprop = jit(lambda new, old: 1.)):
@jit
def kernel(key, x, ll):
key0, key1 = jax.random.split(key)
prop = rprop(key0, x)
lp = lpost(prop)
a = lp - ll + dprop(x, prop) - dprop(prop, x)
accept = (jnp.log(jax.random.uniform(key1)) < a)
return jnp.where(accept, prop, x), jnp.where(accept, lp, ll)
return kernel


and our MCMC runner function as

def mcmc(init, kernel, thin = 10, iters = 10000):
key = jax.random.PRNGKey(42)
keys = jax.random.split(key, iters)
@jit
def step(s, k):
[x, ll] = s
x, ll = kernel(k, x, ll)
s = [x, ll]
return s, s
@jit
def iter(s, k):
keys = jax.random.split(k, thin)
_, states = jax.lax.scan(step, s, keys)
final = [states[0][thin-1], states[1][thin-1]]
return final, final
ll = -np.inf
x = init
_, states = jax.lax.scan(iter, [x, ll], keys)
return states[0]


The first relates to the way JAX handles pseudo-random numbers. Since JAX is a pure functional eDSL, it can’t be used in conjunction with the typical pseudo-random number generators often used in imperative programming languages which rely on a global mutable state. This can be dealt with reasonably straightforwardly by explicitly passing around the random number state. There is a standard way of doing this that has been common practice in functional programming languages for decades. However, this standard approach is very sequential, and so doesn’t work so well in a parallel context. JAX therefore uses a splittable random number generator, where new states are created by splitting the current state into two (or more). We’ll come back to this when we get to the Haskell examples.

The second thing that might be unfamiliar to imperative programmers is the use of the scan operation (jax.lax.scan) to generate the Markov chain rather than a "for" loop. But scans are standard operations in most functional programming languages.

We can then call this code for our logistic regression example with

pre = jnp.array([10.,1.,1.,1.,1.,1.,5.,1.]).astype(jnp.float32)

@jit
def rprop(key, beta):
return beta + 0.02*pre*jax.random.normal(key, [p])

out = mcmc(init, mhKernel(lpost, rprop), thin=1000)


See the full runnable script for further details.

Scala

In Scala we can use a similar approach to that already seen for defining a HoF to return a M-H kernel.

def mhKernel[S](
logPost: S => Double, rprop: S => S,
dprop: (S, S) => Double = (n: S, o: S) => 1.0
): ((S, Double)) => (S, Double) =
val r = Uniform(0.0,1.0)
state =>
val (x0, ll0) = state
val x = rprop(x0)
val ll = logPost(x)
val a = ll - ll0 + dprop(x0, x) - dprop(x, x0)
if (math.log(r.draw()) < a)
(x, ll)
else
(x0, ll0)


Note that Scala’s static typing does not prevent us from defining a function that is polymorphic in the type of the chain state, which we here call S. Also note that we are adopting a pragmatic approach to random number generation, exploiting the fact that Scala is not a pure functional language, using a mutable generator, and omitting to capture the non-determinism of the rprop function (and the returned kernel) in its type signature. In Scala this is a choice, and we could adopt a purer approach if preferred. We’ll see what such an approach will look like in Haskell, coming up next.

Now that we have the kernel, we don’t need to write an explicit runner function since Scala has good support for streaming data. There are many more-or-less sophisticated ways that we can work with data streams in Scala, and the choice depends partly on how pure one is being about tracking effects (such as non-determinism), but here I’ll just use the simple LazyList from the standard library for unfolding the kernel into an infinite MCMC chain before thinning and truncating appropriately.

  val pre = DenseVector(10.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0)
def rprop(beta: DVD): DVD = beta + pre *:* (DenseVector(Gaussian(0.0,0.02).sample(p).toArray))
val kern = mhKernel(lpost, rprop)
val s = LazyList.iterate((init, -Inf))(kern) map (_._1)
val out = s.drop(150).thin(1000).take(10000)


See the full runnable script for further details.

Since Haskell is a pure functional language, we need to have some convention regarding pseudo-random number generation. Haskell supports several styles. The most commonly adopted approach wraps a mutable generator up in a monad. The typical alternative is to use a pure functional generator and either explicitly thread the state through code or hide this in a monad similar to the standard approach. However, Haskell also supports the use of splittable generators, so we can consider all three approaches for comparative purposes. The approach taken does affect the code and the type signatures, and even the streaming data abstractions most appropriate for chain generation.

Starting with a HoF for producing a Metropolis kernel, an approach using the standard monadic generators could like like

mKernel :: (StatefulGen g m) => (s -> Double) -> (s -> g -> m s) ->
g -> (s, Double) -> m (s, Double)
mKernel logPost rprop g (x0, ll0) = do
x <- rprop x0 g
let ll = logPost(x)
let a = ll - ll0
u <- (genContVar (uniformDistr 0.0 1.0)) g
let next = if ((log u) < a)
then (x, ll)
else (x0, ll0)
return next


Note how non-determinism is captured in the type signatures by the monad m. The explicit pure approach is to thread the generator through non-deterministic functions.

mKernelP :: (RandomGen g) => (s -> Double) -> (s -> g -> (s, g)) ->
g -> (s, Double) -> ((s, Double), g)
mKernelP logPost rprop g (x0, ll0) = let
(x, g1) = rprop x0 g
ll = logPost(x)
a = ll - ll0
(u, g2) = uniformR (0, 1) g1
next = if ((log u) < a)
then (x, ll)
else (x0, ll0)
in (next, g2)


Here the updated random number generator state is returned from each non-deterministic function for passing on to subsequent non-deterministic functions. This explicit sequencing of operations makes it possible to wrap the generator state in a state monad giving code very similar to the stateful monadic generator approach, but as already discussed, the sequential nature of this approach makes it unattractive in parallel and concurrent settings.

Fortunately the standard Haskell pure generator is splittable, meaning that we can adopt a splitting approach similar to JAX if we prefer, since this is much more parallel-friendly.

mKernelP :: (RandomGen g) => (s -> Double) -> (s -> g -> s) ->
g -> (s, Double) -> (s, Double)
mKernelP logPost rprop g (x0, ll0) = let
(g1, g2) = split g
x = rprop x0 g1
ll = logPost(x)
a = ll - ll0
u = unif g2
next = if ((log u) < a)
then (x, ll)
else (x0, ll0)
in next


Here non-determinism is signalled by passing a generator state (often called a "key" in the context of splittable generators) into a function. Functions receiving a key are responsible for splitting it to ensure that no key is ever used more than once.

Once we have a kernel, we need to unfold our Markov chain. When using the monadic generator approach, it is most natural to unfold using a monadic stream

mcmc :: (StatefulGen g m) =>
Int -> Int -> s -> (g -> s -> m s) -> g -> MS.Stream m s
mcmc it th x0 kern g = MS.iterateNM it (stepN th (kern g)) x0

stepN :: (Monad m) => Int -> (a -> m a) -> (a -> m a)
stepN n fa = if (n == 1)
then fa
else (\x -> (fa x) >>= (stepN (n-1) fa))


whereas for the explicit approaches it is more natural to unfold into a regular infinite data stream. So, for the explicit sequential approach we could use

mcmcP :: (RandomGen g) => s -> (g -> s -> (s, g)) -> g -> DS.Stream s
mcmcP x0 kern g = DS.unfold stepUf (x0, g)
where
stepUf xg = let
(x1, g1) = kern (snd xg) (fst xg)
in (x1, (x1, g1))


and with the splittable approach we could use

mcmcP :: (RandomGen g) =>
s -> (g -> s -> s) -> g -> DS.Stream s
mcmcP x0 kern g = DS.unfold stepUf (x0, g)
where
stepUf xg = let
(x1, g1) = xg
x2 = kern g1 x1
(g2, _) = split g1
in (x2, (x2, g2))


Calling these functions for our logistic regression example is similar to what we have seen before, but again there are minor syntactic differences depending on the approach. For further details see the full runnable scripts for the monadic approach, the pure sequential approach, and the splittable approach.

Dex

Dex is a pure functional language and uses a splittable random number generator, so the style we use is similar to JAX (or Haskell using a splittable generator). We can generate a Metropolis kernel with

def mKernel {s} (lpost: s -> Float) (rprop: Key -> s -> s) :
Key -> (s & Float) -> (s & Float) =
def kern (k: Key) (sll: (s & Float)) : (s & Float) =
(x0, ll0) = sll
[k1, k2] = split_key k
x = rprop k1 x0
ll = lpost x
a = ll - ll0
u = rand k2
select (log u < a) (x, ll) (x0, ll0)
kern


We can then unfold our Markov chain with

def markov_chain {s} (k: Key) (init: s) (kern: Key -> s -> s) (its: Nat) :
Fin its => s =
with_state init \st.
for i:(Fin its).
x = kern (ixkey k i) (get st)
st := x
x


Here we combine Dex’s state effect with a for loop to unfold the stream. See the full runnable script for further details.

Next steps

As previously discussed, none of these codes are optimised, so care should be taken not to over-interpret running times. However, JAX and Dex are noticeably faster than the alternatives, even running on a single CPU core. Another interesting feature of both JAX and Dex is that they are differentiable. This makes it very easy to develop algorithms using gradient information. In subsequent posts we will think about the gradient of our example log-posterior and how we can use gradient information to develop "better" sampling algorithms.

The complete runnable scripts are all available from this public github repo.

Part 2: The log posterior

Introduction

This is the second part in a series of posts on MCMC-based Bayesian inference for a logistic regression model. If you are new to this series, please go back to Part 1.

In the previous post we looked at the basic modelling concepts, and how to fit the model using a variety of PPLs. In this post we will prepare for doing MCMC by considering the problem of computing the unnormalised log posterior for the model. We will then see how this posterior can be implemented in several different languages and libraries.

Derivation

Basic structure

In Bayesian inference the posterior distribution is just the conditional distribution of the model parameters given the data, and therefore proportional to the joint distribution of the model and data. We often write this as

$\displaystyle \pi(\theta|y) \propto \pi(\theta,y) = \pi(\theta)\pi(y|\theta).$

Taking logs we have

$\displaystyle \log \pi(\theta, y) = \log \pi(\theta) + \log \pi(y|\theta).$

So (up to an additive constant) the log posterior is just the sum of the log prior and log likelihood. There are many good (numerical) reasons why we try to work exclusively with the log posterior and try to avoid ever evaluating the raw posterior density.

For our example logistic regression model, the parameter vector $\theta$ is just the vector of regression coefficients, $\beta$. We assumed independent mean zero normal priors for the components of this vector, so the log prior is just the sum of logs of normal densities. Many scientific libraries will have built-in functions for returning the log-pdf of standard distributions, but if an explicit form is required, the log of the density of a $N(0,\sigma^2)$ at $x$ is just

$\displaystyle -\log(2\pi)/2 - \log|\sigma| - x^2/(2\sigma^2),$

and the initial constant term normalising the density can often be dropped.

Log-likelihood (first attempt)

Information from the data comes into the log posterior via the log-likelihood. The typical way to derive the likelihood for problems of this type is to assume the usual binary encoding of the data (success 1, failure 0). Then, for a Bernoulli observation with probability $p_i,\ i=1,\ldots,n$, the likelihood associated with observation $y_i$ is

$\displaystyle f(y_i|p_i) = \left[ \hphantom{1-}p_i \quad :\ y_i=1 \atop 1-p_i \quad :\ y_i=0 \right. \quad = \quad p_i^{y_i}(1-p_i)^{1-y_i}.$

Taking logs and then switching to parameter $\eta_i=\text{logit}(p_i)$ we have

$\displaystyle \log f(y_i|\eta_i) = y_i\eta_i - \log(1+e^{\eta_i}),$

and summing over $n$ observations gives the log likelihood

$\displaystyle \log\pi(y|\eta) \equiv \ell(\eta;y) = y\cdot \eta - \mathbf{1}\cdot\log(\mathbf{1}+\exp{\eta}).$

In the context of logistic regression, $\eta$ is the linear predictor, so $\eta=X\beta$, giving

$\displaystyle \ell(\beta;y) = y^\textsf{T}X\beta - \mathbf{1}^\textsf{T}\log(\mathbf{1}+\exp[X\beta]).$

This is a perfectly good way of expressing the log-likelihood, and we will come back to it later when we want the gradient of the log-likelihood, but it turns out that there is a similar-but-different way of deriving it that results in an expression that is equivalent but slightly cheaper to evaluate.

Log-likelihood (second attempt)

For our second attempt, we will assume that the data is coded in a different way. Instead of the usual binary encoding, we will assume that the observation $\tilde y_i$ is 1 for success and -1 for failure. This isn’t really a problem, since the two encodings are related by $\tilde y_i = 2y_i-1$. This new encoding is convenient in the context of a logit parameterisation since then

$\displaystyle f(y_i|\eta_i) = \left[ p_i \ :\ \tilde y_i=1\atop 1-p_i\ :\ \tilde y_i=-1 \right. \ = \ \left[ (1+e^{-\eta_i})^{-1} \ :\ \tilde y_i=1\atop (1+e^{\eta_i})^{-1} \ :\ \tilde y_i=-1 \right. \ = \ (1+e^{-\tilde y_i\eta_i})^{-1} ,$

and hence

$\displaystyle \log f(y_i|\eta_i) = -\log(1+e^{-\tilde y_i\eta_i}).$

Summing over observations gives

$\displaystyle \ell(\eta;\tilde y) = -\mathbf{1}\cdot \log(\mathbf{1}+\exp[-\tilde y\circ \eta]),$

where $\circ$ denotes the Hadamard product. Substituting $\eta=X\beta$ gives the log-likelihood

$\displaystyle \ell(\beta;\tilde y) = -\mathbf{1}^\textsf{T} \log(\mathbf{1}+\exp[-\tilde y\circ X\beta]).$

This likelihood is a bit cheaper to evaluate that the one previously derived. If we prefer to write in terms of the original data encoding, we can obviously do so as

$\displaystyle \ell(\beta; y) = -\mathbf{1}^\textsf{T} \log(\mathbf{1}+\exp[-(2y-\mathbf{1})\circ (X\beta)]),$

and in practice, it is this version that is typically used. To be clear, as an algebraic function of $\beta$ and $y$ the two functions are different. But they coincide for binary vectors $y$ which is all that matters.

Implementation

R

In R we can create functions for evaluating the log-likelihood, log-prior and log-posterior as follows (assuming that X and y are in scope).

ll = function(beta)
sum(-log(1 + exp(-(2*y - 1)*(X %*% beta))))

lprior = function(beta)
dnorm(beta[1], 0, 10, log=TRUE) + sum(dnorm(beta[-1], 0, 1, log=TRUE))

lpost = function(beta) ll(beta) + lprior(beta)


Python

In Python (with NumPy and SciPy) we can define equivalent functions with

def ll(beta):
return np.sum(-np.log(1 + np.exp(-(2*y - 1)*(X.dot(beta)))))

def lprior(beta):
return (sp.stats.norm.logpdf(beta[0], loc=0, scale=10) +
np.sum(sp.stats.norm.logpdf(beta[range(1,p)], loc=0, scale=1)))

def lpost(beta):
return ll(beta) + lprior(beta)


JAX

Python, like R, is a dynamic language, and relatively slow for MCMC algorithms. JAX is a tensor computation framework for Python that embeds a pure functional differentiable array processing language inside Python. JAX can JIT-compile high-performance code for both CPU and GPU, and has good support for parallelism. It is rapidly becoming the preferred way to develop high-performance sampling algorithms within the Python ecosystem. We can encode our log-posterior in JAX as follows.

@jit
def ll(beta):
return jnp.sum(-jnp.log(1 + jnp.exp(-(2*y - 1)*jnp.dot(X, beta))))

@jit
def lprior(beta):
return (jsp.stats.norm.logpdf(beta[0], loc=0, scale=10) +
jnp.sum(jsp.stats.norm.logpdf(beta[jnp.array(range(1,p))], loc=0, scale=1)))

@jit
def lpost(beta):
return ll(beta) + lprior(beta)



Scala

JAX is a pure functional programming language embedded in Python. Pure functional programming languages are intrinsically more scalable and compositional than imperative languages such as R and Python, and are much better suited to exploit concurrency and parallelism. I’ve given a bunch of talks about this recently, so if you are interested in this, perhaps start with the materials for my Laplace’s Demon talk. Scala and Haskell are arguably the current best popular general purpose functional programming languages, so it is possibly interesting to consider the use of these languages for the development of scalable statistical inference codes. Since both languages are statically typed compiled functional languages with powerful type systems, they can be highly performant. However, neither is optimised for numerical (tensor) computation, so you should not expect that they will have performance comparable with optimised tensor computation frameworks such as JAX. We can encode our log-posterior in Scala (with Breeze) as follows:

  def ll(beta: DVD): Double =
sum(-log(ones + exp(-1.0*(2.0*y - ones)*:*(X * beta))))

def lprior(beta: DVD): Double =
Gaussian(0,10).logPdf(beta(0)) +
sum(beta(1 until p).map(Gaussian(0,1).logPdf(_)))

def lpost(beta: DVD): Double = ll(beta) + lprior(beta)


Spark

Apache Spark is a Scala library for distributed "big data" processing on clusters of machines. Despite fundamental differences, there is a sense in which Spark for Scala is a bit analogous to JAX for Python: both Spark and JAX are concerned with scalability, but they are targeting rather different aspects of scalability: JAX is concerned with getting regular sized data processing algorithms to run very fast (on GPUs), whereas Spark is concerned with running huge data processing tasks quickly by distributing work over clusters of machines. Despite obvious differences, the fundamental pure functional computational model adopted by both systems is interestingly similar: both systems are based on lazy transformations of immutable data structures using pure functions. This is a fundamental pattern for scalable data processing transcending any particular language, library or framework. We can encode our log posterior in Spark as follows.

    def ll(beta: DVD): Double =
df.map{row =>
val y = row.getAs[Double](0)
val x = BDV.vertcat(BDV(1.0),toBDV(row.getAs[DenseVector](1)))
-math.log(1.0 + math.exp(-1.0*(2.0*y-1.0)*(x.dot(beta))))}.reduce(_+_)
def lprior(beta: DVD): Double =
Gaussian(0,10).logPdf(beta(0)) +
sum(beta(1 until p).map(Gaussian(0,1).logPdf(_)))
def lpost(beta: DVD): Double =
ll(beta) + lprior(beta)



Haskell is an old, lazy pure functional programming language with an advanced type system, and remains the preferred language for the majority of functional programming language researchers. Hmatrix is the standard high performance numerical linear algebra library for Haskell, so we can use it to encode our log-posterior as follows.

ll :: Matrix Double -> Vector Double -> Vector Double -> Double
ll x y b = (negate) (vsum (cmap log (
(scalar 1) + (cmap exp (cmap (negate) (
(((scalar 2) * y) - (scalar 1)) * (x #> b)
)
)))))

pscale :: [Double] -- prior standard deviations
pscale = [10.0, 1, 1, 1, 1, 1, 1, 1]
lprior :: Vector Double -> Double
lprior b = sum $(\x -> logDensity (normalDistr 0.0 (snd x)) (fst x)) <$> (zip (toList b) pscale)

lpost :: Matrix Double -> Vector Double -> Vector Double -> Double
lpost x y b = (ll x y b) + (lprior b)


Again, a reminder that, here and elsewhere, there are various optimisations could be done that I’m not bothering with. This is all just proof-of-concept code.

Dex

JAX proves that a pure functional DSL for tensor computation can be extremely powerful and useful. But embedding such a language in a dynamic imperative language like Python has a number of drawbacks. Dex is an experimental statically typed stand-alone DSL for differentiable array and tensor programming that attempts to combine some of the correctness and composability benefits of powerful statically typed functional languages like Scala and Haskell with the performance benefits of tensor computation systems like JAX. It is currently rather early its development, but seems very interesting, and is already quite useable. We can encode our log-posterior in Dex as follows.

def ll (b: (Fin 8)=>Float) : Float =
neg $sum (log (map (\ x. (exp x) + 1) ((map (\ yi. 1 - 2*yi) y)*(x **. b)))) pscale = [10.0, 1, 1, 1, 1, 1, 1, 1] -- prior SDs prscale = map (\ x. 1.0/x) pscale def lprior (b: (Fin 8)=>Float) : Float = bs = b*prscale neg$  sum ((log pscale) + (0.5 .* (bs*bs)))

def lpost (b: (Fin 8)=>Float) : Float =
(ll b) + (lprior b)


Next steps

Now that we have a way of evaluating the log posterior, we can think about constructing Markov chains having the posterior as their equilibrium distribution. In the next post we will look at one of the simplest ways of doing this: the Metropolis algorithm.

Complete runnable scripts are available from this public github repo.