In the previous post I gave a brief introduction to Rainier, a new HMC-based probabilistic programming library/DSL for Scala. In that post I assumed that people were using the latest source version of the library. Since then, version 0.1.1 of the library has been released, so in this post I will demonstrate use of the released version of the software (using the binaries published to Sonatype), and will walk through a slightly more interesting example – a dynamic linear state space model with unknown static parameters. This is similar to, but slightly different from, the DLM example in the Rainier library. So to follow along with this post, all that is required is SBT.

First run SBT from an empty directory, and paste the following at the SBT prompt:

```
set libraryDependencies += "com.stripe" %% "rainier-plot" % "0.1.1"
set scalaVersion := "2.12.4"
console
```

This should give a Scala REPL with appropriate dependencies (`rainier-plot` has all of the relevant transitive dependencies). We'll begin with some imports, and then simulate some synthetic data from a dynamic linear state space model with an AR(1) latent state and Gaussian noise on the observations.

```scala
import com.stripe.rainier.compute._
import com.stripe.rainier.core._
import com.stripe.rainier.sampler._

implicit val rng = ScalaRNG(1)

val n = 60 // number of observations/time points
val mu = 3.0 // AR(1) mean
val a = 0.95 // auto-regressive parameter
val sig = 0.2 // AR(1) SD
val sigD = 3.0 // observational SD
val state = Stream.
  iterate(0.0)(x => mu + (x - mu) * a + sig * rng.standardNormal).
  take(n).toVector
val obs = state.map(_ + sigD * rng.standardNormal)
```
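As a quick sanity check on this simulation (and a useful guide when thinking about prior scales), the stationary distribution of an AR(1) process has mean `mu` and standard deviation `sig / sqrt(1 - a^2)`. This is a standard result, not computed in the post; a minimal sketch using the values above:

```scala
// Stationary SD of the AR(1) state process (standard result, not part of
// the post): sig / sqrt(1 - a^2), using the simulation parameters above.
val mu = 3.0
val a = 0.95
val sig = 0.2
val statSd = sig / math.sqrt(1 - a * a)
println(statSd) // roughly 0.64, so the latent state wanders on a modest scale
```

So the state fluctuates within well under one unit of its mean, while the observation noise (`sigD = 3.0`) is several times larger, which is what makes this inference problem non-trivial.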

Now we have some synthetic data, let’s think about building a probabilistic program for this model. Start with a prior.

```scala
case class Static(mu: Real, a: Real, sig: Real, sigD: Real)

val prior = for {
  mu <- Normal(0, 10).param
  a <- Normal(1, 0.1).param
  sig <- Gamma(2, 1).param
  sigD <- Gamma(2, 2).param
  sp <- Normal(0, 50).param
} yield (Static(mu, a, sig, sigD), List(sp))
```

Note the use of a case class for wrapping the static parameters. Next, let’s define a function to add a state and associated observation to an existing model.

```scala
def addTimePoint(current: RandomVariable[(Static, List[Real])],
    datum: Double) = for {
  tup <- current
  static = tup._1
  states = tup._2
  os = states.head
  ns <- Normal(((Real.one - static.a) * static.mu) + (static.a * os),
    static.sig).param
  _ <- Normal(ns, static.sigD).fit(datum)
} yield (static, ns :: states)
```

Given this, we can generate the probabilistic program for our model as a *fold* over the data initialised with the prior.

```scala
val fullModel = obs.foldLeft(prior)(addTimePoint(_, _))
```
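The fold pattern is worth internalising: each step consumes one datum and returns an enlarged program, prepending a new state just as `addTimePoint` prepends `ns` to the state list. The same shape, stripped of Rainier and using a made-up numeric update purely for illustration, looks like this:

```scala
// The same fold shape with plain numbers instead of RandomVariables.
// The update rule here is invented for illustration, not the model's:
// each step consumes one datum and prepends a new "state".
def addPoint(states: List[Double], datum: Double): List[Double] =
  (0.95 * states.head + 0.05 * datum) :: states

val obs = List(1.0, 2.0, 3.0)
val states = obs.foldLeft(List(0.0))(addPoint)
// states holds 4 elements, newest first, with the seed 0.0 at the end
```

The accumulator in the real model is a `RandomVariable`, so each step also extends the joint distribution being built, but the threading mechanism is identical.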

If we don’t want to keep samples for all of the variables, we can focus on the parameters of interest, wrapping the results in a `Map` for convenient sampling and plotting.

```scala
val model = for {
  tup <- fullModel
  static = tup._1
  states = tup._2
} yield Map("mu" -> static.mu, "a" -> static.a,
  "sig" -> static.sig, "sigD" -> static.sigD,
  "SP" -> states.reverse.head)
```

We can sample with

```scala
val out = model.sample(HMC(3), 100000, 10000 * 500, 500)
```

(this will take several minutes) and plot some diagnostics with

```scala
import com.cibo.evilplot.geometry.Extent
import com.stripe.rainier.plot.EvilTracePlot._

val truth = Map("mu" -> mu, "a" -> a,
  "sigD" -> sigD, "sig" -> sig, "SP" -> state(0))
render(traces(out, truth), "traceplots.png", Extent(1200, 1400))
render(pairs(out, truth), "pairs.png")
```

This generates the following diagnostic plots:

Everything looks good.

Rainier is a monadic embedded DSL for probabilistic programming in Scala. We can use standard functional combinators and for-expressions for building models to sample, and then run an efficient HMC algorithm on the resulting probability monad in order to obtain samples from the posterior distribution of the model.

See the Rainier repo for further details.

Rainier is an interesting new probabilistic programming library for Scala recently open-sourced by Stripe. Probabilistic programming languages provide a computational framework for building and fitting Bayesian models to data. There are many interesting probabilistic programming languages, and there is currently a lot of interesting innovation happening with probabilistic programming languages embedded in strongly typed functional programming languages such as Scala and Haskell. However, most such languages tend to be developed by people lacking expertise in statistics and numerics, leading to elegant, composable languages which work well for toy problems, but don’t scale well to the kinds of practical problems that applied statisticians are interested in. Conversely, there are a few well-known probabilistic programming languages developed by and for statisticians which have efficient inference engines, but are hampered by inflexible, inelegant languages and APIs. Rainier is interesting because it is an attempt to bridge the gap between these two worlds: it has a functional, composable, extensible, monadic API, yet is backed by a very efficient, high-performance scalable inference engine, using HMC and a static compute graph for reverse-mode AD. Clearly there will be some loss of generality associated with choosing an efficient inference algorithm (eg. for HMC, there needs to be a fixed number of parameters and they must all be continuous), but it still covers a large proportion of the class of hierarchical models commonly used in applied statistical modelling.

In this post I’ll give a quick introduction to Rainier using an interactive session requiring only that SBT is installed and the Rainier repo is downloaded or cloned.

To follow along with this post just clone, or download and unpack, the Rainier repo, and run SBT from the top-level Rainier directory and paste commands. First start a Scala REPL.

```
project rainierPlot
console
```

Before we start building models, we need some data. For this post we will focus on a simple logistic regression model, and so we will begin by simulating some synthetic data consistent with such a model.

```scala
val r = new scala.util.Random(0)
val N = 1000
val beta0 = 0.1
val beta1 = 0.3
val x = (1 to N) map { i => 3.0 * r.nextGaussian }
val theta = x map { xi => beta0 + beta1 * xi }
def expit(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
val p = theta map expit
val y = p map (pi => (r.nextDouble < pi))
```

Now we have some synthetic data, we can fit the model and see if we are able to recover the “true” parameters used to generate the synthetic data. In Rainier, we build models by declaring probabilistic programs for the model and the data, and then run an inference engine to generate samples from the posterior distribution.

Start with a bunch of Rainier imports:

```scala
import com.stripe.rainier.compute._
import com.stripe.rainier.core._
import com.stripe.rainier.sampler._
import com.stripe.rainier.repl._
```

Now we want to build a model. We do so by describing the joint distribution of parameters and data. Rainier has a few built-in distributions, and these can be combined using standard functional monadic combinators such as `map`, `zip`, `flatMap`, etc., to create a probabilistic program representing a probability monad for the model. Due to the monadic nature of such probabilistic programs, it is often most natural to declare them using a `for`-expression.

```scala
val model = for {
  beta0 <- Normal(0, 5).param
  beta1 <- Normal(0, 5).param
  _ <- Predictor.from{ x: Double =>
    {
      val theta = beta0 + beta1 * x
      val p = Real(1.0) / (Real(1.0) + (Real(0.0) - theta).exp)
      Categorical.boolean(p)
    }
  }.fit(x zip y)
} yield Map("b0" -> beta0, "b1" -> beta1)
```

This kind of construction is very natural for anyone familiar with monadic programming in Scala, but will no doubt be a little mysterious otherwise. `RandomVariable` is the probability monad used for HMC sampling, and these can be constructed from `Distributions` using `.param` (for unobserved parameters) and `.fit` (for variables with associated observations). `Predictor` is just a convenience for observations corresponding to covariate information. `model` is therefore a `RandomVariable` over `beta0` and `beta1`, the two unobserved parameters of interest. Note that I briefly discussed this kind of pure functional approach to describing probabilistic programs (using `Rand` from Breeze) in my post on MCMC as a stream.

Now we have our probabilistic program, we can sample from it using HMC as follows.

```scala
implicit val rng = ScalaRNG(3)
val its = 10000
val thin = 5
val out = model.sample(HMC(5), 10000, its * thin, thin)
println(out.take(10))
```

The argument to `HMC()` is the number of leapfrog steps to take per iteration.

Finally, we can use EvilPlot to look at the HMC output and check that we have managed to reasonably recover the true parameters associated with our synthetic data.

```scala
import com.cibo.evilplot.geometry.Extent
import com.stripe.rainier.plot.EvilTracePlot._

render(traces(out, truth = Map("b0" -> beta0, "b1" -> beta1)),
  "traceplots.png", Extent(1200, 1000))
render(pairs(out, truth = Map("b0" -> beta0, "b1" -> beta1)), "pairs.png")
```

Everything looks good, and the sampling is very fast!

For further information, see the Rainier repo. In particular, start with the tour of Rainier’s core, which gives a more detailed introduction to how Rainier works than this post. Those interested in how the efficient AD works may want to read about the compute graph, and the implementation notes explain how it all fits together. There is some basic ScalaDoc for the core package, and also some examples (including this one), and there’s a gitter channel for asking questions. This is a very new project, so there are a few minor bugs and wrinkles in the initial release, but development is progressing rapidly, so I fully expect the library to get properly battle-hardened over the next few months.

For those unfamiliar with the monadic approach to probabilistic programming, Ścibior et al (2015) is probably a good starting point.

EvilPlot is a new functional data visualisation library for Scala. Although there are several data viz libraries for Scala, this new library has a nice functional API for producing attractive, flexible, compositional plots which can be rendered in JVM applications and in web applications (via Scala.js). For a quick introduction, see this blog post from one of the library’s creators. For further information, see the official documentation and the github repo. For a quick overview of the kinds of plots that the library is capable of generating, see the plot catalog.

The library is designed to produce plots which can be rendered into applications. However, when doing data analysis in the REPL on the JVM, it is often convenient to be able to just pop up a plot in a window on the desktop. EvilPlot doesn’t seem to contain code for on-screen rendering, but the plots can be rendered to a bitmap image. In the previous post I described a small library, scala-view, which renders such images, and image sequences on the desktop. In this post I’ll walk through using scala-view to render EvilPlot plots on-screen.

To follow this session, you just need to run SBT from an empty directory. Just run `sbt` and paste the following at the SBT prompt:

```
set libraryDependencies += "com.cibo" %% "evilplot" % "0.2.0"
set libraryDependencies += "com.github.darrenjw" %% "scala-view" % "0.6-SNAPSHOT"
set resolvers += Resolver.bintrayRepo("cibotech", "public")
set resolvers += "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/"
set scalaVersion := "2.12.4"
set fork := true
console
```

This will give a Scala REPL prompt. First we need some imports:

```scala
import com.cibo.evilplot.plot._
import com.cibo.evilplot.colors._
import com.cibo.evilplot.plot.aesthetics.DefaultTheme._
import com.cibo.evilplot.numeric.Point
import java.awt.Image.SCALE_SMOOTH
import scalaview.Utils._
```

We can simulate some data and produce a simple line chart:

```scala
val data = Seq.tabulate(100) { i =>
  Point(i.toDouble, scala.util.Random.nextDouble())
}
val plot = LinePlot.series(data, "Line graph", HSL(210, 100, 56)).
  xAxis().yAxis().frame().
  xLabel("x").yLabel("y").render()
```

This `plot` object contains the rendering instructions, but doesn’t actually produce a plot. We can use scala-view to display it as follows:

```scala
scalaview.SfxImageViewer(biResize(plot.asBufferedImage, 1000, 800, SCALE_SMOOTH))
```

This will produce a window on screen something like the following:

Don’t close this plot yet, as this will confuse the REPL. Just switch back to the REPL and continue.

Sometimes we want to produce a sequence of plots. Let’s now suppose that the data above arises sequentially as a stream, and that we want to produce a sequence of plots with each observation as it arrives. First create a stream of partial datasets and map a function which turns a dataset into a plot to get a stream of images representing the plots. Then pass the stream of images into the viewer to get an animated sequence of plots on-screen:

```scala
val dataStream = data.toStream
val cumulStream = dataStream.scanLeft(Nil: List[Point])((l, p) => p :: l).drop(1)
def dataToImage(data: List[Point]) = LinePlot.
  series(data, "Line graph", HSL(210, 100, 56)).
  xAxis().yAxis().frame().
  xLabel("x").yLabel("y").render().asBufferedImage
val plotStream = cumulStream map (d =>
  biResize(dataToImage(d), 1000, 800, SCALE_SMOOTH))
scalaview.SfxImageViewer.bi(plotStream, 100000, autoStart = true)
```
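The `scanLeft` step is the only subtle part: it builds the sequence of growing datasets, with the newest point at the head of each list, and `drop(1)` discards the empty seed accumulator. A tiny standalone illustration of that trick:

```scala
// Standalone illustration of the scanLeft trick used above: each element
// of the result is the data seen so far, newest first; drop(1) removes
// the initial empty accumulator.
val pts = List(1, 2, 3)
val prefixes = pts.scanLeft(Nil: List[Int])((l, p) => p :: l).drop(1)
// prefixes == List(List(1), List(2, 1), List(3, 2, 1))
```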

In the previous post I discussed how comonads can be useful for structuring certain kinds of scientific and statistical computations. Two of the examples I gave were concerned with the time-evolution of 2-d images. In that post I used Breeze to animate the sequence of computed images. In this post I want to describe an alternative that is better suited to animating an image sequence.

Scala-view is a small Scala library for animating a Stream of Images on-screen in a separate window managed by your window manager. It works with both ScalaFX `Images` (recommended) and Scala Swing/AWT `BufferedImages` (legacy). The stream of images is animated in a window with some simple controls to start and stop the animation, and to turn on and off the saving of image frames to disk (typically for the purpose of turning the image sequence into a movie). An example of what a window might look like is given below.

More comprehensive documentation is available from the scala-view github repo, but here I give a quick introduction to the library to outline its capabilities.

This brief tutorial gives a quick introduction to using the Scala-view library for viewing a ScalaFX Image Stream. It assumes only that you have SBT installed, and that you run SBT from an empty directory.

Start by running SBT from an empty or temporary directory to get an SBT prompt:

```
$ sbt
>
```

Now we need to configure SBT to use the Scala-view library, and start a console. From the SBT prompt:

```
set libraryDependencies += "com.github.darrenjw" %% "scala-view" % "0.5"
set scalaVersion := "2.12.4"
console
```

This should result in a `scala>` REPL prompt. We can now use Scala and the Scala-view library interactively.

You should be able to paste the code snippets below directly into the REPL. You may find `:paste` mode helpful.

We will replicate the heat equation example from the examples-sfx directory, which is loosely based on the example from my blog post on comonads. We will start by defining a simple parallel Image and corresponding comonadic pointed image `PImage` type. If you aren’t familiar with comonads, you may find it helpful to read through that post.

```scala
import scala.collection.parallel.immutable.ParVector

case class Image[T](w: Int, h: Int, data: ParVector[T]) {
  def apply(x: Int, y: Int): T = data(x * h + y)
  def map[S](f: T => S): Image[S] = Image(w, h, data map f)
  def updated(x: Int, y: Int, value: T): Image[T] =
    Image(w, h, data.updated(x * h + y, value))
}

case class PImage[T](x: Int, y: Int, image: Image[T]) {
  def extract: T = image(x, y)
  def map[S](f: T => S): PImage[S] = PImage(x, y, image map f)
  def coflatMap[S](f: PImage[T] => S): PImage[S] = PImage(
    x, y, Image(image.w, image.h,
      (0 until (image.w * image.h)).toVector.par.map(i => {
        val xx = i / image.h
        val yy = i % image.h
        f(PImage(xx, yy, image))
      })))
  def up: PImage[T] = {
    val py = y - 1
    val ny = if (py >= 0) py else (py + image.h)
    PImage(x, ny, image)
  }
  def down: PImage[T] = {
    val py = y + 1
    val ny = if (py < image.h) py else (py - image.h)
    PImage(x, ny, image)
  }
  def left: PImage[T] = {
    val px = x - 1
    val nx = if (px >= 0) px else (px + image.w)
    PImage(nx, y, image)
  }
  def right: PImage[T] = {
    val px = x + 1
    val nx = if (px < image.w) px else (px - image.w)
    PImage(nx, y, image)
  }
}
```

We will need a function to convert this image into a ScalaFX `WritableImage`.

```scala
import scalafx.scene.image.WritableImage
import scalafx.scene.paint._

def toSfxI(im: Image[Double]): WritableImage = {
  val wi = new WritableImage(im.w, im.h)
  val pw = wi.pixelWriter
  (0 until im.w) foreach (i =>
    (0 until im.h) foreach (j =>
      pw.setColor(i, j, Color.gray(im(i, j)))
    ))
  wi
}
```

We will need a starting image representing the initial condition for the heat equation.

```scala
val w = 600
val h = 500
val pim0 = PImage(0, 0, Image(w, h,
  ((0 until w * h).toVector map { i: Int =>
    {
      val x = i / h
      val y = i % h
      0.1 * math.cos(0.1 * math.sqrt((x * x + y * y))) + 0.1 + 0.8 * math.random
    }
  }).par))
```

We can define a kernel associated with the update of a single image pixel based on a single time step of a finite difference solution of the heat equation.

```scala
def kernel(pi: PImage[Double]): Double = (2 * pi.extract +
  pi.up.extract + pi.down.extract + pi.left.extract + pi.right.extract) / 6.0
```

We can now create a `Stream` of `PImage` with

```scala
def pims = Stream.iterate(pim0)(_.coflatMap(kernel))
```

We can turn this into a `Stream[WritableImage]` with

```scala
def sfxis = pims map (im => toSfxI(im.image))
```

Note that we are essentially finished at this point, but so far everything we have done has been purely functional with no side effects. We haven’t even computed our solution to the heat equation. All we have constructed are lazy infinite streams representing the solution of the heat equation.

Finally, we can render our Stream of Images on screen with

```scala
scalaview.SfxImageViewer(sfxis, 1e7.toInt)
```

which has a delay of 1e7 nanoseconds (10 milliseconds) between frames.

This should pop up a window on your display containing the initial image. Click on the Start button to animate the solution of the heat equation. See the API docs for SfxImageViewer for additional options. The ScalaFX API docs may also be useful, especially the docs for Image and WritableImage.

In a previous post I gave a brief introduction to *monads* in Scala, aimed at people interested in scientific and statistical computing. Monads are a concept from category theory which turn out to be exceptionally useful for solving many problems in functional programming. But most categorical concepts have a *dual*, usually prefixed with “co”, so the dual of a monad is the *comonad*. Comonads turn out to be especially useful for formulating algorithms from scientific and statistical computing in an elegant way. In this post I’ll illustrate their use in signal processing, image processing, numerical integration of PDEs, and Gibbs sampling (of an Ising model). Comonads enable the *extension* of a *local computation* to a *global computation*, and this pattern crops up all over the place in statistical computing.

Simplifying massively, from the viewpoint of a Scala programmer, a monad is a mappable (functor) type class augmented with the methods `pure` and `flatMap`:

```scala
trait Monad[M[_]] extends Functor[M] {
  def pure[T](v: T): M[T]
  def flatMap[T, S](v: M[T])(f: T => M[S]): M[S]
}
```

In category theory, the dual of a concept is typically obtained by “reversing the arrows”. Here that means reversing the direction of the methods `pure` and `flatMap` to get `extract` and `coflatMap`, respectively.

```scala
trait Comonad[W[_]] extends Functor[W] {
  def extract[T](v: W[T]): T
  def coflatMap[T, S](v: W[T])(f: W[T] => S): W[S]
}
```

So, while `pure` allows you to wrap plain values in a monad, `extract` allows you to get a value out of a comonad. So you can always get a value out of a comonad (unlike a monad). Similarly, while `flatMap` allows you to transform a monad using a function returning a monad, `coflatMap` allows you to transform a comonad using a function which collapses a comonad to a single value. It is `coflatMap` (sometimes called `extend`) which can extend a local computation (producing a single value) to the entire comonad. We’ll look at how that works in the context of some familiar examples.

One of the simplest examples of a comonad is an *infinite* stream of data. I’ve discussed streams in a previous post. By focusing on infinite streams we know the stream will never be empty, so there will always be a value that we can `extract`. Which value does `extract` give? For a `Stream` encoded as some kind of lazy list, the only value we actually know is the value at the head of the stream, with subsequent values to be lazily computed as required. So the head of the list is the only reasonable value for `extract` to return.

Understanding `coflatMap` is a bit more tricky, but it is `coflatMap` that provides us with the power to apply a non-trivial statistical computation to the stream. The input is a function which transforms a stream into a value. In our example, that will be a function which computes a weighted average of the first few values and returns that weighted average as the result. But the return type of `coflatMap` must be a stream of such computations. Following the types, a few minutes’ thought reveals that the only reasonable thing to do is to return the stream formed by applying the weighted average function to all sub-streams, recursively. So, for a `Stream` `s` (of type `Stream[T]`) and an input function `f: W[T] => S`, we form a stream whose head is `f(s)` and whose tail is `coflatMap(f)` applied to `s.tail`. Again, since we are working with an infinite stream, we don’t have to worry about whether or not the `tail` is empty. This gives us our comonadic `Stream`, and it is exactly what we need for applying a linear filter to the data stream.

In Scala, Cats is a library providing type classes from category theory, and instances of those type classes for parametrised types in the standard library. In particular, it provides us with comonadic functionality for the standard Scala `Stream`. Let’s start by defining a stream corresponding to the logistic map.

```scala
import cats._
import cats.implicits._

val lam = 3.7
def s = Stream.iterate(0.5)(x => lam * x * (1 - x))
s.take(10).toList
// res0: List[Double] = List(0.5, 0.925, 0.25668749999999985,
//   0.7059564011718747, 0.7680532550204203, 0.6591455741499428, ...
```

Let us now suppose that we want to apply a linear filter to this stream, in order to smooth the values. The idea behind using comonads is that you figure out how to generate *one* desired value, and let `coflatMap` take care of applying the same logic to the rest of the structure. So here, we need a function to generate the *first* filtered value (since `extract` is focused on the head of the stream). A simple first attempt at a function to do this might look like the following.

```scala
def linearFilterS(weights: Stream[Double])(s: Stream[Double]): Double =
  (weights, s).parMapN(_ * _).sum
```

This aligns each weight in parallel with a corresponding value from the stream, and combines them using multiplication. The resulting (hopefully finite length) stream is then summed (with addition). We can test this with

```scala
linearFilterS(Stream(0.25, 0.5, 0.25))(s)
// res1: Double = 0.651671875
```
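If you don’t have Cats to hand, the same align-multiply-sum can be checked with plain standard-library code. This is a throwaway equivalent, not the post’s version: like `parMapN` with the zip-stream applicative, `zip` truncates to the shorter of the two collections, which is why a three-element weight sequence picks out only the first three values of the series.

```scala
// Plain-Scala equivalent of linearFilterS (no Cats): zip aligns weights
// with the data, truncating to the shorter, then multiply and sum.
def linearFilterZip(weights: List[Double], s: List[Double]): Double =
  weights.zip(s).map { case (w, x) => w * x }.sum

// First three values of the logistic map stream above:
val result = linearFilterZip(List(0.25, 0.5, 0.25),
  List(0.5, 0.925, 0.2566875))
// agrees with res1 above
```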

and let `coflatMap` extend this computation to the rest of the stream with something like:

```scala
s.coflatMap(linearFilterS(Stream(0.25, 0.5, 0.25))).take(5).toList
// res2: List[Double] = List(0.651671875, 0.5360828502929686, ...
```

This is all completely fine, but our `linearFilterS` function is specific to the `Stream` comonad, despite the fact that all the function really relies on is that the structure is parallelly composable and foldable. We can make this much more generic as follows:

```scala
def linearFilter[F[_]: Foldable, G[_]](weights: F[Double], s: F[Double])(
    implicit ev: NonEmptyParallel[F, G]): Double =
  (weights, s).parMapN(_ * _).fold
```

This uses some fairly advanced Scala concepts which I don’t want to get into right now (I should also acknowledge that I had trouble getting the syntax right for this, and got help from Fabio Labella (@SystemFw) on the Cats gitter channel). But this version is more generic, and can be used to linearly filter other data structures than `Stream`. We can use this for regular `Streams` as follows:

```scala
s.coflatMap(s => linearFilter(Stream(0.25, 0.5, 0.25), s))
// res3: scala.collection.immutable.Stream[Double] = Stream(0.651671875, ?)
```

But we can apply this new filter to other collections. This could be other, more sophisticated, streams such as provided by FS2, Monix or Akka streams. But it could also be a non-stream collection, such as `List`:

```scala
val sl = s.take(10).toList
sl.coflatMap(sl => linearFilter(List(0.25, 0.5, 0.25), sl))
// res4: List[Double] = List(0.651671875, 0.5360828502929686, ...
```

Assuming that we have the Breeze scientific library available, we can plot the raw and smoothed trajectories.

```scala
def myFilter(s: Stream[Double]): Double =
  linearFilter(Stream(0.25, 0.5, 0.25), s)

val n = 500

import breeze.plot._
import breeze.linalg._

val fig = Figure(s"The (smoothed) logistic map (lambda=$lam)")
val p0 = fig.subplot(3, 1, 0)
p0 += plot(linspace(1, n, n), s.take(n))
p0.ylim = (0.0, 1.0)
p0.title = s"The logistic map (lambda=$lam)"
val p1 = fig.subplot(3, 1, 1)
p1 += plot(linspace(1, n, n), s.coflatMap(myFilter).take(n))
p1.ylim = (0.0, 1.0)
p1.title = "Smoothed by a simple linear filter"
val p2 = fig.subplot(3, 1, 2)
p2 += plot(linspace(1, n, n), s.coflatMap(myFilter).coflatMap(myFilter).
  coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).take(n))
p2.ylim = (0.0, 1.0)
p2.title = "Smoothed with 5 applications of the linear filter"
fig.refresh
```

Streaming data is in no way the only context in which a comonadic approach facilitates an elegant approach to scientific and statistical computing. Comonads crop up anywhere where we want to extend a computation that is local to a small part of a data structure to the full data structure. Another commonly cited area of application of comonadic approaches is image processing (I should acknowledge that this section of the post is very much influenced by a blog post on comonadic image processing in Haskell). However, the kinds of operations used in image processing are in many cases very similar to the operations used in finite difference approaches to numerical integration of partial differential equations (PDEs) such as the heat equation, so in this section I will blur (sic) the distinction between the two, and numerically integrate the 2D heat equation in order to Gaussian blur a noisy image.

First we need a simple image type which can have pixels of arbitrary type `T` (this is very important – all functors must be fully type polymorphic).

```scala
import scala.collection.parallel.immutable.ParVector

case class Image[T](w: Int, h: Int, data: ParVector[T]) {
  def apply(x: Int, y: Int): T = data(x * h + y)
  def map[S](f: T => S): Image[S] = Image(w, h, data map f)
  def updated(x: Int, y: Int, value: T): Image[T] =
    Image(w, h, data.updated(x * h + y, value))
}
```
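The flat-vector layout deserves a comment: pixels are stored column-major, with flat index `i = x*h + y`, and the inverse mapping `x = i/h`, `y = i%h` is what the `coflatMap` implementation relies on. A quick standalone check of the round trip:

```scala
// Round-trip check of the column-major indexing used by Image:
// flat index i = x*h + y is recovered by x = i/h, y = i%h.
val h = 4
val (x, y) = (2, 3)
val i = x * h + y
val recovered = (i / h, i % h)
// recovered == (2, 3)
```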

Here I’ve chosen to back the image with a parallel immutable vector. This wasn’t necessary, but since this type has a `map` operation which automatically parallelises over multiple cores, any `map` operations applied to the image will be automatically parallelised. This will ultimately lead to all of our statistical computations being automatically parallelised without us having to think about it.

As it stands, this image isn’t comonadic, since it doesn’t implement `extract` or `coflatMap`. Unlike the case of `Stream`, there isn’t really a uniquely privileged pixel, so it’s not clear what `extract` should return. For many data structures of this type, we make them comonadic by adding a “cursor” pointing to a “current” element of interest, and use this as the focus for computations applied with `coflatMap`. This is simplest to explain by example. We can define our “pointed” image type as follows:

```scala
case class PImage[T](x: Int, y: Int, image: Image[T]) {
  def extract: T = image(x, y)
  def map[S](f: T => S): PImage[S] = PImage(x, y, image map f)
  def coflatMap[S](f: PImage[T] => S): PImage[S] = PImage(
    x, y, Image(image.w, image.h,
      (0 until (image.w * image.h)).toVector.par.map(i => {
        val xx = i / image.h
        val yy = i % image.h
        f(PImage(xx, yy, image))
      })))
```

There is a closing brace missing, as I’m not quite finished. Here `x` and `y` represent the location of our cursor, so `extract` returns the value of the pixel indexed by our cursor. Similarly, `coflatMap` forms an image where the value of the image at each location is the result of applying the function `f` to the image which had the cursor set to that location. Clearly `f` should use the cursor in some way, otherwise the image will have the same value at every pixel location. Note that `map` and `coflatMap` operations will be automatically parallelised. The intuitive idea behind `coflatMap` is that it extends local computations. For the stream example, the local computation was a linear combination of nearby values. Similarly, in image analysis problems, we often want to apply a linear filter to nearby pixels. We can get at the pixel at the cursor location using `extract`, but we probably also want to be able to move the cursor around to nearby locations. We can do that by adding some appropriate methods to complete the class definition.

```scala
  def up: PImage[T] = {
    val py = y - 1
    val ny = if (py >= 0) py else (py + image.h)
    PImage(x, ny, image)
  }
  def down: PImage[T] = {
    val py = y + 1
    val ny = if (py < image.h) py else (py - image.h)
    PImage(x, ny, image)
  }
  def left: PImage[T] = {
    val px = x - 1
    val nx = if (px >= 0) px else (px + image.w)
    PImage(nx, y, image)
  }
  def right: PImage[T] = {
    val px = x + 1
    val nx = if (px < image.w) px else (px - image.w)
    PImage(nx, y, image)
  }
}
```

Here each method returns a new pointed image with the cursor shifted by one pixel in the appropriate direction. Note that I’ve used periodic boundary conditions, which often make sense for numerical integration of PDEs, but make less sense for real image analysis problems. Note also that we have embedded all “indexing” issues inside the definition of our classes, so none of the statistical algorithms that we develop will involve any explicit indexing. This makes it much less likely that we will introduce “off-by-one” or flipped-axis bugs.
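The periodic boundary handling in `up`/`down`/`left`/`right` consists of instances of a single wrap-around rule, which can be isolated as a small helper (hypothetical; not part of the original code, which inlines the branching for offsets of one):

```scala
// Toroidal index wrap: maps any offset index back into 0 until n.
// Equivalent to the py/ny branching in the cursor-moving methods above
// for single-pixel moves, but valid for arbitrary offsets too.
def wrap(i: Int, n: Int): Int = ((i % n) + n) % n

// e.g. moving up from row 0 of a height-5 image lands on row 4:
val above = wrap(0 - 1, 5)
// above == 4
```

(The double `%` is needed because the JVM’s `%` returns negative results for negative operands.)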

This class is now fine for our requirements. But if we wanted Cats to understand that this structure is really a comonad (perhaps because we wanted to use derived methods, such as `coflatten`), we would need to provide evidence for this. The details aren’t especially important for this post, but we can do it simply as follows:

```scala
implicit val pimageComonad = new Comonad[PImage] {
  def extract[A](wa: PImage[A]) = wa.extract
  def coflatMap[A, B](wa: PImage[A])(f: PImage[A] => B): PImage[B] =
    wa.coflatMap(f)
  def map[A, B](wa: PImage[A])(f: A => B): PImage[B] = wa.map(f)
}
```

It’s handy to have some functions for converting Breeze dense matrices back and forth with our image class.

```scala
import breeze.linalg.{Vector => BVec, _}

def BDM2I[T](m: DenseMatrix[T]): Image[T] =
  Image(m.cols, m.rows, m.data.toVector.par)
def I2BDM(im: Image[Double]): DenseMatrix[Double] =
  new DenseMatrix(im.h, im.w, im.data.toArray)
```

Now we are ready to see how to use this in practice. Let’s start by defining a very simple linear filter.

```scala
def fil(pi: PImage[Double]): Double = (2 * pi.extract +
  pi.up.extract + pi.down.extract + pi.left.extract + pi.right.extract) / 6.0
```

This simple filter can be used to “smooth” or “blur” an image. However, from a more sophisticated viewpoint, exactly this type of filter can be used to represent one time step of a numerical method for time integration of the 2D heat equation. Now we can simulate a noisy image and apply our filter to it using `coflatMap`:

```scala
import breeze.stats.distributions.Gaussian

val bdm = DenseMatrix.tabulate(200, 250){ case (i, j) =>
  math.cos(0.1 * math.sqrt(i*i + j*j)) + Gaussian(0.0, 2.0).draw }
val pim0 = PImage(0, 0, BDM2I(bdm))
def pims = Stream.iterate(pim0)(_.coflatMap(fil))
```

Note that here, rather than just applying the filter once, I’ve generated an infinite stream of pointed images, each one representing an additional application of the linear filter. Thus the sequence represents the time solution of the heat equation with initial condition corresponding to our simulated noisy image.
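As a toy illustration of the same idea (not from the post), the pattern of iterating a local smoothing filter with `Stream.iterate` can be shown in one dimension, where a three-point moving average with periodic boundaries plays the role of the 2D filter:

```scala
// 1D analogue of the image-filter stream: repeatedly apply a three-point
// moving-average filter (with periodic boundaries) to a vector, giving a
// lazy stream of successively smoother states.
def smooth(v: Vector[Double]): Vector[Double] = {
  val n = v.length
  Vector.tabulate(n)(i => (v((i - 1 + n) % n) + v(i) + v((i + 1) % n)) / 3.0)
}

// a single "hot spot" diffuses outwards, frame by frame
val frames = Stream.iterate(Vector(0.0, 0.0, 3.0, 0.0, 0.0))(smooth)
// frames(1) == Vector(0.0, 1.0, 1.0, 1.0, 0.0)
```

As with the heat equation, the total “mass” (the vector sum) is conserved at each step, since each element is redistributed with weights summing to one.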

We can render the first few frames to check that it seems to be working.

```scala
import breeze.plot._

val fig = Figure("Diffusing a noisy image")
pims.take(25).zipWithIndex.foreach{ case (pim, i) => {
  val p = fig.subplot(5, 5, i)
  p += image(I2BDM(pim.image))
}}
```

Note that the numerical integration is carried out in parallel on all available cores automatically. Other image filters can be applied, and other (parabolic) PDEs can be numerically integrated in an essentially similar way.

Another place where the concept of extending a local computation to a global computation crops up is in the context of Gibbs sampling a high-dimensional probability distribution by cycling through the sampling of each variable in turn from its full-conditional distribution. I’ll illustrate this here using the Ising model, so that I can reuse the pointed image class from above, but the principles apply to any Gibbs sampling problem. In particular, the Ising model that we consider has a conditional independence structure corresponding to a graph of a square lattice. As above, we will use the comonadic structure of the square lattice to construct a Gibbs sampler. However, we can construct a Gibbs sampler for arbitrary graphical models in an essentially identical way by using a graph comonad.

Let’s begin by simulating a random image containing +/-1s:

```scala
import breeze.stats.distributions.{Binomial, Bernoulli}

val beta = 0.4
val bdm = DenseMatrix.tabulate(500, 600){ case (i, j) =>
  (new Binomial(1, 0.2)).draw
}.map(_ * 2 - 1) // random matrix of +/-1s
val pim0 = PImage(0, 0, BDM2I(bdm))
```

We can use this to initialise our Gibbs sampler. We now need a Gibbs kernel representing the update of each pixel.

```scala
def gibbsKernel(pi: PImage[Int]): Int = {
  val sum = pi.up.extract + pi.down.extract + pi.left.extract + pi.right.extract
  val p1 = math.exp(beta * sum)
  val p2 = math.exp(-beta * sum)
  val probplus = p1 / (p1 + p2)
  if (new Bernoulli(probplus).draw) 1 else -1
}
```

So far so good, but there are a couple of issues that we need to consider before we plough ahead and start coflatMapping. The first is that pure functional programmers will object to the fact that this function is not pure: it is a stochastic function which has the side-effect of mutating the random number state. I’m just going to duck that issue, as I’ve previously discussed how to fix it using probability monads, and I don’t want it to distract us here.

However, there is a more fundamental problem here relating to parallel versus sequential application of Gibbs kernels. `coflatMap` is conceptually parallel (irrespective of how it is implemented) in that all computations used to build the new comonad are based solely on the information available in the starting comonad. OTOH, detailed balance of the Markov chain will only be preserved if the kernels for each pixel are applied sequentially. So if we `coflatMap` this kernel over the image we will break detailed balance. I should emphasise that this has nothing to do with the fact that I’ve implemented the pointed image using a parallel vector. Exactly the same issue would arise if we switched to backing the image with a regular (sequential) immutable `Vector`.

The trick here is to recognise that if we coloured alternate pixels black and white using a chequerboard pattern, then all of the black pixels are conditionally independent given the white pixels and vice-versa. Conditionally independent pixels *can* be updated by parallel application of a Gibbs kernel. So we just need separate kernels for updating odd and even pixels.

```scala
def oddKernel(pi: PImage[Int]): Int =
  if ((pi.x + pi.y) % 2 != 0) pi.extract else gibbsKernel(pi)

def evenKernel(pi: PImage[Int]): Int =
  if ((pi.x + pi.y) % 2 == 0) pi.extract else gibbsKernel(pi)
```

Each of these kernels can be coflatMapped over the image preserving detailed balance of the chain. So we can now construct an infinite stream of MCMC iterations as follows.

```scala
def pims = Stream.iterate(pim0)(
  _.coflatMap(oddKernel).coflatMap(evenKernel))
```

We can animate the first few iterations with:

```scala
import breeze.plot._

val fig = Figure("Ising model Gibbs sampler")
fig.width = 1000
fig.height = 800
pims.take(50).zipWithIndex.foreach{ case (pim, i) => {
  print(s"$i ")
  fig.clear
  val p = fig.subplot(1, 1, 0)
  p.title = s"Ising model: frame $i"
  p += image(I2BDM(pim.image.map{_.toDouble}))
  fig.refresh
}}
println
```

Here I have a movie showing the first 1000 iterations. Note that youtube seems to have over-compressed it, but you should get the basic idea.

Again, note that this MCMC sampler runs in parallel on all available cores, automatically. This issue of odd/even pixel updating emphasises another issue that crops up a lot in functional programming: very often, thinking about how to express an algorithm functionally leads to an algorithm which parallelises naturally. For general graphs, figuring out which groups of nodes can be updated in parallel is essentially the graph colouring problem. I’ve discussed this previously in relation to parallel MCMC in:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) *Handbook of Parallel Computing and Statistics*, Marcel Dekker/CRC Press, 481-512.

There are quite a few blog posts discussing comonads in the context of Haskell. In particular, the post on comonads for image analysis I mentioned previously, and this one on cellular automata. Bartosz’s post on comonads gives some connection back to the mathematical origins. Runar’s Scala comonad tutorial is the best source I know for comonads in Scala.

Full runnable code corresponding to this blog post is available from my blog repo.

As discussed in the previous post, I’ve recently constructed and delivered a short course on statistical computing with Scala. Much of the course is concerned with writing statistical algorithms in Scala, typically making use of the scientific and numerical computing library, Breeze. Breeze has all of the essential tools necessary for building statistical algorithms, but doesn’t contain any higher level modelling functionality. As part of the course, I walked through how to build a small library for regression modelling on top of Breeze, including all of the usual regression diagnostics (such as standard errors, t-statistics, p-values, F-statistics, etc.). While preparing the course materials it occurred to me that it would be useful to package and document this code properly for general use. In advance of the course I packaged the code up into a bare-bones library, but since then I’ve fleshed it out, tidied it up and documented it properly, so it’s now ready for people to use.

The library covers PCA, linear regression modelling and simple one-parameter GLMs (including logistic and Poisson regression). The underlying algorithms are fairly efficient and numerically stable (eg. linear regression uses the QR decomposition of the model matrix, and the GLM fitting uses QR within each IRLS step), though they are optimised more for clarity than speed. The library also includes a few utility functions and procedures, including a pairs plot (scatter-plot matrix).

Plenty of documentation is available from the scala-glm github repo which I won’t repeat here. But to give a rough idea of how things work, I’ll run through an interactive session for the linear regression example.

First, download a dataset from the UCI ML Repository to disk for subsequent analysis (caching the file on disk is good practice, as it avoids unnecessary load on the UCI server, and allows running the code off-line):

```scala
import scalaglm._
import breeze.linalg._

val url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat"
val fileName = "self-noise.csv"

// download the file to disk if it hasn't been already
val file = new java.io.File(fileName)
if (!file.exists) {
  val s = new java.io.PrintWriter(file)
  val data = scala.io.Source.fromURL(url).getLines
  data.foreach(l => s.write(l.trim.
    split('\t').filter(_ != "").
    mkString("", ",", "\n")))
  s.close
}
```

Once we have a CSV file on disk, we can load it up and look at it.

```scala
val mat = csvread(new java.io.File(fileName))
// mat: breeze.linalg.DenseMatrix[Double] =
// 800.0   0.0  0.3048  71.3  0.00266337  126.201
// 1000.0  0.0  0.3048  71.3  0.00266337  125.201
// 1250.0  0.0  0.3048  71.3  0.00266337  125.951
// ...
println("Dim: " + mat.rows + " " + mat.cols)
// Dim: 1503 6
val figp = Utils.pairs(mat,
  List("Freq", "Angle", "Chord", "Velo", "Thick", "Sound"))
// figp: breeze.plot.Figure = breeze.plot.Figure@37718125
```

We can then regress the response in the final column on the other variables.

```scala
val y = mat(::, 5) // response is the final column
// y: DenseVector[Double] = DenseVector(126.201, 125.201, ...
val X = mat(::, 0 to 4)
// X: breeze.linalg.DenseMatrix[Double] =
// 800.0   0.0  0.3048  71.3  0.00266337
// 1000.0  0.0  0.3048  71.3  0.00266337
// 1250.0  0.0  0.3048  71.3  0.00266337
// ...
val mod = Lm(y, X, List("Freq", "Angle", "Chord", "Velo", "Thick"))
// mod: scalaglm.Lm =
// Lm(DenseVector(126.201, 125.201, ...
mod.summary
// Estimate S.E. t-stat p-value Variable
// ---------------------------------------------------------
// 132.8338 0.545 243.866 0.0000 * (Intercept)
// -0.0013 0.000 -30.452 0.0000 * Freq
// -0.4219 0.039 -10.847 0.0000 * Angle
// -35.6880 1.630 -21.889 0.0000 * Chord
// 0.0999 0.008 12.279 0.0000 * Velo
// -147.3005 15.015 -9.810 0.0000 * Thick
// Residual standard error: 4.8089 on 1497 degrees of freedom
// Multiple R-squared: 0.5157, Adjusted R-squared: 0.5141
// F-statistic: 318.8243 on 5 and 1497 DF, p-value: 0.00000
val fig = mod.plots
// fig: breeze.plot.Figure = breeze.plot.Figure@60d7ebb0
```

There is a `.predict` method for generating point predictions (and standard errors) given a new model matrix, and fitting GLMs is very similar – these things are covered in the quickstart guide for the library.

scala-glm is a small Scala library built on top of the Breeze numerical library which enables simple and convenient regression modelling in Scala. It is reasonably well documented and usable in its current form, but I intend to gradually add additional features according to demand as time permits.

I’ve written a brief self-study guide to point people in the right direction. For people studying the material in their spare time, the course is probably best done over nine weeks (one chapter per week), and this will then cover material at a similar rate to a typical MOOC.

The nine chapters are:

1. Introduction

2. Scala and FP Basics

3. Collections

4. Scala Breeze

5. Monte Carlo

6. Statistical modelling

7. Tools

8. Apache Spark

9. Advanced topics

For anyone frustrated by the limitations of dynamic languages such as R, Python or Octave, this course should provide a good pathway to an altogether more sophisticated, modern programming paradigm.

This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post is available from my blog repo.

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

```r
metrop3 <- function(n=1000, eps=0.5) {
  vec = vector("numeric", n)
  x = 0
  oldll = dnorm(x, log=TRUE)
  vec[1] = x
  for (i in 2:n) {
    can = x + runif(1, -eps, eps)
    loglik = dnorm(can, log=TRUE)
    loga = loglik - oldll
    if (log(runif(1)) < loga) {
      x = can
      oldll = loglik
    }
    vec[i] = x
  }
  vec
}
```

I will begin this post with a fairly direct translation of this algorithm into Scala:

```scala
def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
  val vec = DenseVector.fill(n)(0.0)
  var x = 0.0
  var oldll = Gaussian(0.0, 1.0).logPdf(x)
  vec(0) = x
  (1 until n).foreach { i =>
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) {
      x = can
      oldll = loglik
    }
    vec(i) = x
  }
  vec
}
```

This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmer’s perspective. One issue is that we have committed to storing all MCMC output in RAM in a `DenseVector`. This probably isn’t an issue here, but for some big problems we might prefer not to store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

```scala
def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
  var x = 0.0
  var oldll = Gaussian(0.0, 1.0).logPdf(x)
  (1 to n).foreach { i =>
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) {
      x = can
      oldll = loglik
    }
    println(x)
  }
}
```

But now we have two versions of the algorithm: one for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and the old likelihood. Let’s fix that now by re-writing the algorithm as a tail recursion.

```scala
@tailrec
def metrop3(n: Int = 1000, eps: Double = 0.5,
    x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
  if (n > 0) {
    println(x)
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga)
      metrop3(n - 1, eps, can, loglik)
    else
      metrop3(n - 1, eps, x, oldll)
  }
}
```

This has eliminated the `var`s, and is just as fast and efficient as the previous version of the code. Note that the `@tailrec` annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

```scala
@tailrec
def metrop4(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0,
    oldll: Double = Double.MinValue,
    acc: List[Double] = Nil): DenseVector[Double] = {
  if (n == 0) DenseVector(acc.reverse.toArray) else {
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga)
      metrop4(n - 1, eps, can, loglik, can :: acc)
    else
      metrop4(n - 1, eps, x, oldll, x :: acc)
  }
}
```

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

```scala
def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
  val can = x + Uniform(-eps, eps).draw
  val loglik = Gaussian(0.0, 1.0).logPdf(can)
  val loga = loglik - oldll
  if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}
```

This function takes as input a current state and associated log-likelihood and returns a new state and log-likelihood following the execution of one step of an MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

```scala
@tailrec
def metrop5(n: Int = 1000, eps: Double = 0.5,
    x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
  if (n > 0) {
    println(x)
    val ns = newState(x, oldll, eps)
    metrop5(n - 1, eps, ns._1, ns._2)
  }
}
```

and if we want to accumulate the set of states visited, we can write that as

```scala
@tailrec
def metrop6(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0,
    oldll: Double = Double.MinValue,
    acc: List[Double] = Nil): DenseVector[Double] = {
  if (n == 0) DenseVector(acc.reverse.toArray) else {
    val ns = newState(x, oldll, eps)
    metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)
  }
}
```

Both of these functions call `newState` to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a `Stream`, which represents a (potentially infinite) sequence of states. The `Stream` can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

```scala
def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
  val x = state._1
  val oldll = state._2
  val can = x + Uniform(-eps, eps).draw
  val loglik = Gaussian(0.0, 1.0).logPdf(can)
  val loga = loglik - oldll
  if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}
```

The “real” state of the chain is just `x`, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this `nextState` function in order to construct a `Stream`.

```scala
def metrop7(eps: Double = 0.5, x: Double = 0.0,
    oldll: Double = Double.MinValue): Stream[Double] =
  Stream.iterate((x, oldll))(nextState(eps)) map (_._1)
```

The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation – but it captures the logic of iteration and computation in a `Stream`, which can be thought of as a lazy `List`. We can get values out by converting the `Stream` to a regular collection, being careful to truncate the `Stream` to one of finite length beforehand! eg. `metrop7().drop(1000).take(10000).toArray` will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an `Array`. Note that `metrop7().drop(1000).take(10000)` is a `Stream`, and so nothing is actually computed until the `toArray` is encountered. Conversely, if printing to console is required, just replace the `.toArray` with `.foreach(println)`.
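Thinning fits into the same lazy pipeline. As a hedged sketch using only the standard library (a deterministic stream stands in for `metrop7()` here, purely so the stream operations themselves are visible):

```scala
// Illustrative sketch: burn-in, thinning and truncation on a lazy Stream.
// `chain` stands in for an MCMC stream such as metrop7().
val chain: Stream[Double] = Stream.iterate(0.0)(_ + 1.0)

val thinned = chain.
  drop(1000).              // burn-in: discard the first 1000 states
  zipWithIndex.
  filter(_._2 % 10 == 0).  // keep every 10th remaining state
  map(_._1).
  take(100)                // main monitoring run of length 100

// nothing is computed until the stream is forced, e.g. by toArray
val out = thinned.toArray
```

All of the intermediate stages remain lazy streams, so the thinning step never materialises the discarded states.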

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t quite pure functional. The problem is that the code isn’t pure – it has a side-effect, which is to mutate the state of the underpinning pseudo-random number generator. If the code were pure, calling `nextState` with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So `nextState` represents a function for randomly sampling from a conditional probability distribution.

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your `main` method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of `eps`. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a *probability monad*, which in Breeze is called `Rand`. We can couple together such functions using monadic binds (`flatMap` in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

```scala
def kernel(x: Double): Rand[Double] = for {
  innov <- Uniform(-0.5, 0.5)
  can = x + innov
  oldll = Gaussian(0.0, 1.0).logPdf(x)
  loglik = Gaussian(0.0, 1.0).logPdf(can)
  loga = loglik - oldll
  u <- Uniform(0.0, 1.0)
} yield if (math.log(u) < loga) can else x
```

This is now pure – the same input `x` will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a `MarkovChain` object which can take kernels of this form and return a stochastic `Process` object representing the iteration logic, as follows.

```scala
MarkovChain(0.0)(kernel).
  steps.
  drop(1000).
  take(10000).
  foreach(println)
```

The `steps` method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the `foreach` method is encountered – this is when the sampling occurs and the side-effects happen.

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

```scala
MarkovChain.
  metropolisHastings(0.0,
    (x: Double) => Uniform(x - 0.5, x + 0.5))(
    x => Gaussian(0.0, 1.0).logPdf(x)).
  steps.
  drop(1000).
  take(10000).
  toArray
```

Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.

Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple `Stream` from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post is available from my blog repo.

Apache Spark is a Scala library for analysing "big data". It can be used for analysing huge (internet-scale) datasets distributed across large clusters of machines. The analysis can be anything from the computation of simple descriptive statistics associated with the datasets, through to rather sophisticated machine learning pipelines involving data pre-processing, transformation, nonlinear model fitting and regularisation parameter tuning (via methods such as cross-validation). A relatively impartial overview can be found in the Apache Spark Wikipedia page.

Although Spark is really aimed at data that can’t easily be analysed on a laptop, it is actually very easy to install and use (in standalone mode) on a laptop, and a good laptop with a fast multicore processor and plenty of RAM is fine for datasets up to a few gigabytes in size. This post will walk through getting started with Spark: installing it locally (not requiring admin/root access), doing some simple descriptive analysis, and moving on to fit a simple linear regression model to some simulated data. After this walk-through it should be relatively easy to take things further by reading the Spark documentation, which is generally pretty good.

Anyone who is interested in learning more about setting up and using Spark clusters may want to have a quick look over on my personal blog (mainly concerned with the Raspberry Pi), where I have previously considered installing Spark on a Raspberry Pi 2, setting up a small Spark cluster, and setting up a larger Spark cluster. Although these posts are based around the Raspberry Pi, most of the material there is quite generic, since the Raspberry Pi is just a small (Debian-based) Linux server.

The only pre-requisite for installing Spark is a recent Java installation. On Debian-based Linux systems (such as Ubuntu), Java can be installed with:

```bash
sudo apt-get update
sudo apt-get install openjdk-8-jdk
```

For other systems you should Google for the best way to install Java. If you aren’t sure whether you have Java or not, type `java -version` into a terminal window. If you get a version number of the form 1.7.x or 1.8.x you should be fine.

Once you have Java installed, you can download and install Spark in any appropriate place in your file-system. If you are running Linux, or a Unix-alike, just `cd` to an appropriate place and enter the following commands:

```bash
wget http://www.eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar xvfz spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
bin/run-example SparkPi 10
```

If all goes well, the last command should run an example. Don’t worry if there are lots of INFO and WARN messages – we will sort that out shortly. On other systems it should simply be a matter of downloading and unpacking Spark somewhere appropriate, then running the example from the top-level Spark directory. Get Spark from the downloads page. You should get version 2.1.0 built for Hadoop 2.7. It doesn’t matter if you don’t have Hadoop installed – it is not required for single-machine use.

The INFO messages are useful for debugging cluster installations, but are too verbose for general use. On a Linux system you can turn down the verbosity with:

```bash
sed 's/rootCategory=INFO/rootCategory=WARN/g' \
  < conf/log4j.properties.template > conf/log4j.properties
```

On other systems, copy the file `log4j.properties.template` in the `conf` sub-directory to `log4j.properties` and edit the file, replacing `INFO` with `WARN` on the relevant line. Check it has worked by re-running the `SparkPi` example – it should be much less verbose this time. You can also try some other examples:

```bash
bin/run-example SparkLR
ls examples/src/main/scala/org/apache/spark/examples/
```

There are several different ways to use Spark. For this walk-through we are just going to use it interactively from the "Spark shell". We can pop up a shell with:

```bash
bin/spark-shell --master local[4]
```

The "4" refers to the number of worker threads to use. Four is probably fine for most decent laptops. `Ctrl-D` or `:quit` will exit the Spark shell and take you back to your OS shell. It is more convenient to have the Spark `bin` directory in your path. If you are using `bash` or a similar OS shell, you can temporarily add the Spark `bin` directory to your path with the OS shell command:

```bash
export PATH=$PATH:`pwd`/bin
```

You can make this permanent by adding a line like this (but with the full path hard-coded) to your `.profile` or similar start-up dot-file. I prefer not to do this, as I typically have several different Spark versions on my laptop and want to be able to select exactly the version I need. If you are not running `bash`, Google how to add a directory to your path. Check the path update has worked by starting up a shell with:

```bash
spark-shell --master local[4]
```

Note that if you want to run a script containing Spark commands to be run in "batch mode", you could do it with a command like:

```bash
spark-shell --driver-memory 25g --master local[4] \
  < spark-script.scala | tee script-out.txt
```

There are much better ways to develop and submit batch jobs to Spark clusters, but I won’t discuss those in this post. Note that while Spark is running, diagnostic information about the "cluster" can be obtained by pointing a web browser at port 4040 on the master, which here is just http://localhost:4040/ – this is extremely useful for debugging purposes.

We are now ready to start using Spark. From a Spark shell in the top-level directory, enter:

```scala
sc.textFile("README.md").count
```

If all goes well, you should get a count of the number of lines in the file `README.md`. The value `sc` is the "Spark context", containing information about the Spark cluster (here it is just a laptop, but in general it could be a large cluster of machines, each with many processors and each processor with many cores). The `textFile` method loads up the file into an RDD (Resilient Distributed Dataset). The RDD is the fundamental abstraction provided by Spark. It is a lazy distributed parallel monadic collection. After loading a text file like this, each element of the collection represents one line of the file. I’ve talked about monadic collections in previous posts, so if this isn’t a familiar concept, it might be worth having a quick skim through at least the post on first steps with monads in Scala. The point is that although RDDs are potentially huge and distributed over a large cluster, using them is very similar to using any other monadic collection in Scala. We can unpack the previous command slightly as follows:

```scala
val rdd1 = sc.textFile("README.md")
rdd1
rdd1.count
```

Note that RDDs are "lazy", and this is important for optimising complex pipelines. So here, after assigning the value `rdd1`, no data is actually loaded into memory. All of the actual computation is deferred until an "action" is called – `count` is an example of such an action, and therefore triggers the loading of data into memory and the counting of elements.

We can now look at a very slightly more complex pipeline – counting the number of words in a text file rather than the number of lines. This can be done as follows:

```scala
sc.textFile("README.md").
  map(_.trim).
  flatMap(_.split(' ')).
  count
```

Note that `map` and `flatMap` are both lazy ("transformations" in Spark terminology), and so no computation is triggered until the final action, `count`, is called. The call to `map` will just trim any redundant white-space from the line ends, so after the call to `map` the RDD will still have one element for each line of the file. However, the call to `flatMap` splits each line on white-space, so after this call each element of the RDD will correspond to a word, and not a line. So the final `count` will again count the number of elements in the RDD, but here this corresponds to the number of words in the file.
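Since RDDs are monadic collections, the same pipeline can be run unchanged against a plain in-memory collection of lines, which is a useful way to prototype Spark logic. A small illustrative sketch (the only difference is that plain collections use `size` where Spark's action is `count`):

```scala
// The word-count pipeline against an in-memory collection of lines,
// to emphasise that the RDD API mirrors the standard collections API.
val lines = Vector("Apache Spark ", " is a fast cluster computing system")

val nWords = lines.
  map(_.trim).             // trim white-space, one element per line
  flatMap(_.split(' ')).   // split on spaces: one element per word
  size                     // count the words (Spark: .count)
```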

A final example before moving on to look at quantitative data analysis: counting the frequency with which each character occurs in a file. This can be done as follows:

```scala
sc.textFile("README.md").
  map(_.toLowerCase).
  flatMap(_.toCharArray).
  map{(_, 1)}.
  reduceByKey(_+_).
  collect
```

The first call to `map` converts upper case characters to lower case, as we don’t want separate counts for upper and lower case characters. The call to `flatMap` then makes each element of the RDD correspond to a single character in the file. The second call to `map` transforms each element of the RDD to a key-value pair, where the key is the character and the value is the integer 1. RDDs have special methods for key-value pairs in this form – the method `reduceByKey` is one such – it applies the reduction operation (here just "+") to all values corresponding to a particular value of the key. Since each character has the value 1, the sum of the values will be a character count. Note that the reduction will be done in parallel, and for this to work it is vital that the reduction operation is associative. Simple addition of integers is clearly associative, so here we are fine. Note that `reduceByKey` is a (lazy) transformation, and so the computation needs to be triggered by a call to the action `collect`.
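A sequential sketch of what `reduceByKey` is doing (ignoring the parallelism) is easy to write in plain Scala: group the pairs by key, then reduce each group's values with the supplied operation. This is an illustration, not Spark's actual (distributed) implementation:

```scala
// Sequential sketch of reduceByKey: group (key, value) pairs by key,
// then fold the values for each key with the reduction operation.
def reduceByKey[K, V](kvs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
  kvs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(op) }

// Character counts, exactly as in the Spark pipeline above.
val charCounts = reduceByKey("abracadabra".toSeq.map(c => (c, 1)))(_ + _)
// e.g. charCounts('a') == 5, charCounts('b') == 2
```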

On most Unix-like systems there is a file called `words` that is used for spell-checking. The example below applies the character count to this file. Note the calls to `filter`, which filter out any elements of the RDD not matching the predicate. Here it is used to filter out special characters.

sc.textFile("/usr/share/dict/words").
  map(_.trim).
  map(_.toLowerCase).
  flatMap(_.toCharArray).
  filter(_ > '/').
  filter(_ < '}').
  map{(_,1)}.
  reduceByKey(_+_).
  collect

We first need some quantitative data, so let’s simulate some. Breeze is the standard Scala library for scientific and statistical computing. I’ve given a quick introduction to Breeze in a previous post. Spark has a dependency on Breeze, so Breeze can be used from inside the Spark shell – this is very useful. So, we start by using Breeze to simulate a vector of normal random quantities:

import breeze.stats.distributions._
val x = Gaussian(1.0,2.0).sample(10000)

Note, though, that `x` is just a regular serial Scala collection, stored entirely in RAM on the master thread. To use it as a Spark RDD, we must convert it to one, using the `parallelize` function:

val xRdd = sc.parallelize(x)

Now `xRdd` is an RDD, and so we can do Spark transformations and actions on it. There are some special methods for RDDs containing numeric values:

xRdd.mean
xRdd.sampleVariance

Each summary statistic is computed with a single pass through the data, but if several summary statistics are required, it is inefficient to make a separate pass through the data for each summary, so the `stats` method makes a single pass through the data returning a `StatsCounter` object that can be used to compute various summary statistics.
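To see how all of these summaries can come from a single pass, here is a plain-Scala sketch of a `StatsCounter`-like accumulator using Welford's online algorithm. This is an illustration of the idea, not Spark's actual implementation:

```scala
// One-pass count, mean and (sample) variance via Welford's online algorithm.
case class Stats(count: Long, mean: Double, m2: Double) {
  def update(x: Double): Stats = {
    val n = count + 1
    val d = x - mean
    val newMean = mean + d / n
    Stats(n, newMean, m2 + d * (x - newMean)) // m2 accumulates squared deviations
  }
  def sampleVariance: Double = if (count > 1) m2 / (count - 1) else Double.NaN
}

val s = List(1.0, 2.0, 3.0, 4.0).foldLeft(Stats(0, 0.0, 0.0))(_ update _)
// s.mean == 2.5, s.sampleVariance == 5.0/3.0
```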

val xStats = xRdd.stats
xStats.mean
xStats.sampleVariance
xStats.sum

The `StatsCounter` methods are: `count`, `mean`, `sum`, `max`, `min`, `variance`, `sampleVariance`, `stdev`, `sampleStdev`.

Moving beyond very simple descriptive statistics, we will look at a simple linear regression model, which will also allow us to introduce Spark `DataFrame`s – a high-level abstraction layered on top of RDDs which makes working with tabular data much more convenient, especially in the context of statistical modelling.

We start with some standard (non-Spark) Scala Breeze code to simulate some data from a simple linear regression model. We use the `x` already simulated as our first covariate. Then we simulate a second covariate, `x2`. Then, using some residual noise, `eps`, we simulate a regression model scenario, where we know that the "true" intercept is 1.5 and the "true" covariate regression coefficients are 2.0 and 1.0.

val x2 = Gaussian(0.0,1.0).sample(10000)
val xx = x zip x2
val lp = xx map {p => 2.0*p._1 + 1.0*p._2 + 1.5}
val eps = Gaussian(0.0,1.0).sample(10000)
val y = (lp zip eps) map (p => p._1 + p._2)
val yx = (y zip xx) map (p => (p._1,p._2._1,p._2._2))
val rddLR = sc.parallelize(yx)

Note that the last line converts the regular Scala collection into a Spark RDD using `parallelize`. We could, in principle, do regression modelling using raw RDDs, and early versions of Spark required this. However, statisticians used to statistical languages such as R know that data frames are useful for working with tabular data. I gave a brief overview of Scala data frame libraries in a previous post. We can convert an RDD of tuples to a Spark `DataFrame` as follows:

val dfLR = rddLR.toDF("y","x1","x2")
dfLR.show
dfLR.show(5)

Note that `show` shows the first few rows of a `DataFrame`, and giving it a numeric argument specifies the number to show. This is very useful for quick sanity-checking of `DataFrame` contents.

Note that there are other ways of getting data into a Spark `DataFrame`. One of the simplest ways to get data into Spark from other systems is via a CSV file. A properly formatted CSV file with a header row can be read into Spark with a command like:

// Don't run unless you have an appropriate CSV file...
val df = spark.read.
  option("header","true").
  option("inferSchema","true").
  csv("myCsvFile.csv")

This requires two passes over the data – one to infer the schema and one to actually read the data. For very large datasets it is better to declare the schema and not use automatic schema inference. However, for very large datasets, CSV probably isn’t a great choice of format anyway. Spark supports many more efficient data storage formats. Note that Spark also has functions for querying SQL (and other) databases, and reading query results directly into `DataFrame` objects. For people familiar with databases, this is often the most convenient way of ingesting data into Spark. See the Spark DataFrames guide and the API docs for DataFrameReader for further information.
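Declaring the schema up front looks like the following sketch, which assumes a Spark session `spark` and a CSV file with `Double` columns `y`, `x1` and `x2` (adjust the names and types to your data). With an explicit schema, the `inferSchema` option, and hence the extra pass over the data, is unnecessary:

```scala
import org.apache.spark.sql.types._

// Explicit schema: one StructField per column, with its type and nullability.
val schema = StructType(Seq(
  StructField("y",  DoubleType, nullable = false),
  StructField("x1", DoubleType, nullable = false),
  StructField("x2", DoubleType, nullable = false)))

// Single pass over the file, no schema inference required.
val df = spark.read.
  option("header", "true").
  schema(schema).
  csv("myCsvFile.csv")
```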

Spark has an extensive library of tools for the development of sophisticated machine learning pipelines. Included in this are functions for fitting linear regression models, regularised regression models (Lasso, ridge, elastic net), generalised linear models, including logistic regression models, etc., and tools for optimising regularisation parameters, for example, using cross-validation. For this post I’m just going to show how to fit a simple OLS linear regression model: see the ML pipeline documentation for further information, especially the docs on classification and regression.

We start by creating an object for fitting linear regression models:

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg._
val lm = new LinearRegression
lm.explainParams
lm.getStandardization
lm.setStandardization(false)
lm.getStandardization
lm.explainParams

Note that there are many parameters associated with the fitting algorithm, including regularisation parameters. These are set to defaults corresponding to no regularisation (simple OLS). Note, however, that the algorithm defaults to standardising covariates to be mean zero variance one. We can turn that off before fitting the model if desired.
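To be concrete about what the standardization option does, here is a plain-Scala sketch of rescaling a single covariate to mean zero and (sample) standard deviation one:

```scala
// Rescale a covariate to mean 0 and sample standard deviation 1.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val n = xs.size
  val mean = xs.sum / n
  val sd = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
  xs.map(x => (x - mean) / sd)
}

val z = standardize(Seq(1.0, 2.0, 3.0))
// z == Seq(-1.0, 0.0, 1.0)
```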

Also note that the model fitting algorithm assumes that the `DataFrame` to be fit has (at least) two columns, one called `label` containing the response variable, and one called `features`, where each element is a `Vector` of covariates. So we first need to transform our `DataFrame` into the required format.

// Transform data frame to required format
val dflr = (dfLR map {row => (row.getDouble(0),
    Vectors.dense(row.getDouble(1),row.getDouble(2)))}).
  toDF("label","features")
dflr.show(5)

Now we have the data in the correct format, it is simple to fit the model and look at the estimated parameters.

// Fit model
val fit = lm.fit(dflr)
fit.intercept
fit.coefficients

You should see that the estimated parameters are close to the "true" parameters that were used to simulate from the model. More detailed diagnostics can be obtained from the fitted summary object.

val summ = fit.summary
summ.r2
summ.rootMeanSquaredError
summ.coefficientStandardErrors
summ.pValues
summ.tValues
summ.predictions
summ.residuals

So, that’s how to fit a simple OLS linear regression model. Fitting GLMs (including logistic regression) is very similar, and setting up routines to tune regularisation parameters via cross-validation is not much more difficult.

As previously mentioned, once you are up and running with a Spark shell, the official Spark documentation is reasonably good. First go through the quick start guide, then the programming guide, then the ML guide, and finally, consult the API docs. I discussed books on scala for data science in the previous post – many of these cover Spark to a greater or lesser extent.

I recently gave a talk on some of the general principles behind the use of functional programming for scalable statistical computing, and how concepts from category theory, such as monads, can help. The PDF slides are available. I’m not sure how comprehensible they will be without my explanations and white-board diagrams, but come to think of it, I’m not sure how comprehensible they were *with* my explanations and white-board diagrams… Also note that I occasionally run a three-day short-course on Scala for statistical computing, and much of the final day is concerned with using Apache Spark.

People regularly ask me about books and other resources for getting started with Scala for statistical computing and data science. This post will focus on books, but it’s worth briefly noting that there are a number of other resources available, on-line and otherwise, that are also worth considering. I particularly like the Coursera course Functional Programming Principles in Scala – I still think this is probably the best way to get started with Scala and functional programming for most people. In fact, there is an entire Functional Programming in Scala Specialization that is worth considering – I’ll probably discuss that more in another post. I’ve got a draft page of Scala links which has a bias towards scientific and statistical computing, and I’m currently putting together a short course in that area, which I’ll also discuss further in future posts. But this post will concentrate on books.

Before one can dive into statistical computing and data science using Scala, it’s a good idea to understand a bit about the language and about functional programming. There are by now many books on Scala, and I haven’t carefully reviewed all of them, but I’ve looked at enough to have an idea about good ways of getting started.

- Programming in Scala: Third edition, Odersky et al, Artima.
- This is *the* Scala book, often referred to on-line as **PinS**. It is a weighty tome, and works through the Scala language in detail, starting from the basics. Every serious Scala programmer should own this book. However, it isn’t the easiest introduction to the language.
- Scala for the Impatient, Horstmann, Addison-Wesley.
- As the name suggests, this is a much quicker and easier introduction to Scala than PinS, but assumes reasonable familiarity with programming in general, and sort-of assumes that the reader has a basic knowledge of Java and the JVM ecosystem. That said, it does not assume that the reader is a Java expert. My feeling is that for someone who has a reasonable programming background and a passing familiarity with Java, this book is probably the best introduction to the language. Note that there is a second edition in the works.
- Functional Programming in Scala, Chiusano and Bjarnason, Manning.
- It is possible to write Scala code in the style of "Java-without-the-semi-colons", but really the whole point of Scala is to move beyond that kind of Object-Oriented programming style. How much you venture down the path towards pure Functional Programming is very much a matter of taste, but many of the best Scala programmers are pretty hard-core FP, and there’s probably a reason for that. But many people coming to Scala don’t have a strong FP background, and getting up to speed with strongly-typed FP isn’t easy for people who only know an imperative (Object-Oriented) style of programming. *This* is the book that will help you to make the jump to FP. Sometimes referred to on-line as **FPiS**, or more often just as the **red book**, this is another book that every serious Scala programmer should own (and read!). Note that it isn’t really a book *about* Scala – it is a book about strongly typed FP that just "happens" to *use* Scala for illustrating the ideas. Consequently, you will probably want to augment this book with a book that really is about Scala, such as one of the books above. Since this is the first book on the list published by Manning, I should also mention how much I like computing books from this publisher. They are typically well-produced, and their paper books (pBooks) come with complimentary access to well-produced DRM-free eBook versions, however you purchase them.
- Functional and Reactive Domain Modeling, Ghosh, Manning.
- This is another book that isn’t really *about* Scala, but about software engineering using a strongly typed FP language. But again, it uses Scala to illustrate the ideas, and is an excellent read. You can think of it as a more practical "hands-on" follow-up to the red book, which shows how the ideas from the red book translate into effective solutions to real-world problems.
- Structure and Interpretation of Computer Programs, second edition, Abelson et al, MIT Press.
- This is not a Scala book! This is the only book in this list which doesn’t use Scala at all. I’ve included it on the list because it is one of the best books on programming that I’ve read, and is the book that I wish someone had told me about 20 years ago! In fact the book uses Scheme (a Lisp derivative) as the language to illustrate the ideas. There are obviously important differences between Scala and Scheme – e.g. Scala is strongly statically typed and compiled, whereas Scheme is dynamically typed and interpreted. However, there are also similarities – e.g. both languages support and encourage a functional style of programming but are not pure FP languages. Referred to on-line as **SICP**, this book is a classic. Note that there is no need to buy a paper copy if you like eBooks, since electronic versions are available free on-line.

- Scala for Data Science, Bugnion, Packt.
- Not to be confused with the (terrible) book Scala for Machine Learning by the same publisher. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala. I have reviewed this book in another post, so I won’t say more about it here (but I like it).
- Scala Data Analysis Cookbook, Manivannan, Packt.
- I’m not a huge fan of the cookbook format, but this book is really mis-named, as it isn’t really a cookbook and isn’t really about data analysis in Scala! It is really a book about Apache Spark, and proceeds fairly sequentially in the form of a tutorial introduction to Spark. Spark is an impressive piece of technology, and it is obviously one of the factors driving interest in Scala, but it’s important to understand that Spark isn’t Scala, and that many typical data science applications will be better tackled using Scala without Spark. I’ve not read this book cover-to-cover as it offers little over Scala for Data Science, but its coverage of Spark is a bit more up-to-date than the Spark books I mention below, so it could be of interest to those who are mainly interested in Scala for Spark.
- Scala High Performance Programming, Theron and Diamant, Packt.
- This is an interesting book, fundamentally about developing high performance streaming data processing algorithm pipelines in Scala. It makes no reference to Spark. The running application is an on-line financial trading system. It takes a deep dive into understanding performance in Scala and on the JVM, and looks at how to benchmark and profile performance, diagnose bottlenecks and optimise code. This is likely to be of more interest to those interested in developing efficient algorithms for scientific and statistical computing rather than applied data scientists, but it covers some interesting material not covered by any of the other books in this list.
- Learning Spark, Karau et al, O’Reilly.
- This book provides an introduction to Apache Spark, written by some of the people who developed it. Spark is a big data analytics framework built on top of Scala. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. The book is a perfectly good introduction to Spark, and shows most examples implemented using the Java and Python APIs in addition to the canonical Scala (Spark Shell) implementation. This is useful for people working with multiple languages, but can be mildly irritating to anyone who is only interested in Scala. However, the big problem with this (and every other) book on Spark is that Spark is evolving very quickly, and so by the time any book on Spark is written and published it is inevitably very out of date. It’s not clear that it is worth buying a book specifically about Spark at this stage, or whether it would be better to go for a book like *Scala for Data Science*, which has a couple of chapters of introduction to Spark, which can then provide a starting point for engaging with Spark’s on-line documentation (which is reasonably good).
- Advanced Analytics with Spark, Ryza et al, O’Reilly.
- This book has a bit of a "cookbook" feel to it, which some people like and some don’t. It’s really more like an "edited volume" with different chapters authored by different people. Unlike Learning Spark it focuses exclusively on the Scala API. The book basically covers the development of a bunch of different machine learning pipelines for a variety of applications. My main problem with this book is that it has aged particularly badly, as all of the pipelines are developed with raw RDDs, which isn’t how ML pipelines in Spark are constructed any more. So again, it’s difficult for me to recommend. The message here is that if you are thinking of buying a book about Spark, check very carefully when it was published and what version of Spark it covers, and whether that is sufficiently recent to be of relevance to you.

There are lots of books to get started with Scala for statistical computing and data science applications. My "bare minimum" recommendation would be some generic Scala book (doesn’t really matter which one), the *red book*, and *Scala for data science*. After reading those, you will be very well placed to top-up your knowledge as required with on-line resources.