A scalable particle filter in Scala

Introduction

Many modern algorithms in computational Bayesian statistics have at their heart a particle filter or some other sequential Monte Carlo (SMC) procedure. In this blog I’ve discussed particle MCMC algorithms which use a particle filter in the inner loop in order to compute a (noisy, unbiased) estimate of the marginal likelihood of the data. These algorithms are often very computationally intensive, either because the forward model used to propagate the particles is expensive, or because the likelihood associated with each particle/observation is expensive (or both). In such cases it is desirable to parallelise the particle filter to run on all available cores of a machine, or even to distribute the particle filter computation across a cluster of machines.

Parallelisation is difficult when using the conventional imperative programming languages typically used in scientific and statistical computing, but is much easier using modern functional languages such as Scala. In fact, in languages such as Scala it is possible to describe algorithms at a higher level of abstraction, so that exactly the same algorithm can run in serial, run in parallel across all available cores on a single machine, or run in parallel across a cluster of machines, all without changing any code. Doing so renders parallelisation a non-issue. In this post I’ll talk through how to do this for a simple bootstrap particle filter, but the same principle applies for a large range of statistical computing algorithms.

Typeclasses and monadic collections

In the previous post I gave a quick introduction to the monad concept, and to monadic collections in particular. Many computational tasks in statistics can be accomplished using a sequence of operations on monadic collections. We would like to write code that is independent of any particular implementation of a monadic collection, so that we can switch to a different implementation without changing the code of our algorithm (for example, switching from a serial to a parallel collection). But in strongly typed languages we need to know at compile time that the collection we use has the methods that we require. Typeclasses provide a nice solution to this problem. I don’t want to get bogged down in a big discussion about Scala typeclasses here, but suffice it to say that they describe a family of types conforming to a particular interface in an ad hoc, loosely coupled way (they are said to provide ad hoc polymorphism). They are not the same as classes in traditional O-O languages, but they do solve a similar problem to the adaptor design pattern, in a much cleaner way. We can describe a simple typeclass for our monadic collection as follows:

import scala.collection.GenTraversable

// Typeclass describing the operations we require of a generic (monadic) collection
trait GenericColl[C[_]] {
  def map[A, B](ca: C[A])(f: A => B): C[B]
  def reduce[A](ca: C[A])(f: (A, A) => A): A
  def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
  def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
  def length[A](ca: C[A]): Int
}

In the typeclass we just list the methods that we expect our generic collection to provide, but do not say anything about how they are implemented. For example, we know that operations such as map and reduce can be executed in parallel, but this is a separate concern. We can now write code that can be used for any collection conforming to the requirements of this typeclass. The full code for this example is provided in the associated github repo for this blog, and includes the obvious syntax for this typeclass, together with typeclass instances for the Scala collections Vector and ParVector, which we will exploit later in the example.
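
To give a flavour of what such an instance involves, here is a minimal sketch of what the Vector instance might look like (the code in the repo may differ in detail); the ParVector instance is essentially identical, with Vector replaced by ParVector throughout.

import scala.collection.GenTraversable

implicit val vectorColl: GenericColl[Vector] = new GenericColl[Vector] {
  // Each method simply delegates to the corresponding standard collection method
  def map[A, B](ca: Vector[A])(f: A => B): Vector[B] = ca map f
  def reduce[A](ca: Vector[A])(f: (A, A) => A): A = ca reduce f
  def flatMap[A, B, D[B] <: GenTraversable[B]](ca: Vector[A])(f: A => D[B]): Vector[B] =
    ca flatMap f
  def zip[A, B](ca: Vector[A])(cb: Vector[B]): Vector[(A, B)] = ca zip cb
  def length[A](ca: Vector[A]): Int = ca.length
}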

SIR step for a bootstrap filter

We can now write some code for a single observation update of a bootstrap particle filter.

def update[S: State, O: Observation, C[_]: GenericColl](
  dataLik: (S, O) => LogLik, stepFun: S => S
)(x: C[S], o: O): (LogLik, C[S]) = {
  val xp = x map (stepFun(_)) // propagate each particle through the forward model
  val lw = xp map (dataLik(_, o)) // log-weight of each propagated particle
  val max = lw reduce (math.max(_, _)) // max log-weight, for the log-sum-exp trick
  val rw = lw map (lwi => math.exp(lwi - max)) // shifted weights on the natural scale
  val srw = rw reduce (_ + _) // sum of shifted weights
  val l = rw.length // number of particles
  val z = rw zip xp // pair each weight with its particle
  val rx = z flatMap (p => Vector.fill(Poisson(p._1 * l / srw).draw)(p._2)) // Poisson resampling
  (max + math.log(srw / l), rx) // (observation log-likelihood estimate, resampled particles)
}

This is a very simple bootstrap filter, using Poisson resampling for simplicity and data locality, but it does include use of the log-sum-exp trick to prevent over/underflow of raw weight calculations, and it tracks the marginal (log-)likelihood of the observation. With this function we can now pass in a “prior” particle distribution in any collection conforming to our typeclass, together with a propagator function, an observation (log-)likelihood, and an observation, and it will return a new collection of particles of exactly the same type that was provided as input. Note that all of the operations we require can be accomplished with the standard monadic collection operations declared in our typeclass.
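
The function above also assumes a few supporting definitions which live in the repo. Roughly (this is a sketch rather than the repo code verbatim), these amount to an alias for the log-likelihood type, empty typeclass “markers” for the types that may serve as states, observations and parameters, instances of those markers for the Double-valued model used later in this post, and the Poisson distribution from Breeze used for resampling:

import breeze.stats.distributions.Poisson

type LogLik = Double // log-likelihoods are just Doubles

// Typeclass markers for admissible state, observation and parameter types
trait State[S]
trait Observation[O]
trait Parameter[P]

// Instances for the Double-valued model used in the example below
implicit val doubleState: State[Double] = new State[Double] {}
implicit val doubleObs: Observation[Double] = new Observation[Double] {}
implicit val doubleParam: Parameter[Double] = new Parameter[Double] {}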

Filtering as a functional fold

Once we have a function for executing one step of a particle filter, we can produce a function for particle filtering as a functional fold over a sequence of observations:

def pFilter[S: State, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
  x0: C[S], data: D[O], dataLik: (S, O) => LogLik, stepFun: S => S
): (LogLik, C[S]) = {
  val updater = update[S, O, C](dataLik, stepFun) _
  // fold over the observations, accumulating the log-likelihood and the particle cloud
  data.foldLeft((0.0, x0))((prev, o) => {
    val next = updater(prev._2, o)
    (prev._1 + next._1, next._2)
  })
}

Folding data structures is a fundamental concept in functional programming, and is exactly what is required for any kind of filtering problem. Note that Brian Beckman has recently written a series of articles on Kalman filtering as a functional fold.
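
For concreteness, a direct call to pFilter might look something like the following sketch, using hypothetical values (a Vector of initial particles and a short List of observations) together with the Double instances sketched above:

// Hypothetical usage of pFilter for a simple AR(1)-plus-noise model
import breeze.stats.distributions.Gaussian

val x0 = Gaussian(0.0, 10.0).sample(1000).toVector // particles from a diffuse prior
val obs = List(1.2, -0.3, 0.8, 2.1) // a few made-up observations
val (ll, particles) = pFilter(
  x0, obs,
  (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o), // observation log-likelihood
  (s: Double) => Gaussian(0.8 * s, 1.0).draw // forward propagation
)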

Marginal likelihoods and parameter estimation

So far we haven’t said anything about parameters or parameter estimation, but this is appropriate, since parametrisation is a separate concern from filtering. However, once we have a function for particle filtering, we can produce a function concerned with evaluating marginal likelihoods trivially:

def pfMll[S: State, P: Parameter, O: Observation, 
            C[_]: GenericColl, D[O] <: GenTraversable[O]](
  simX0: P => C[S], stepFun: P => S => S, 
  dataLik: P => (S, O) => LogLik, data: D[O]
): (P => LogLik) = (th: P) => 
       pFilter(simX0(th), data, dataLik(th), stepFun(th))._1

Note that this higher-order function does not return a log-likelihood value directly, but rather a function which will accept a parameter as input and return a (log-)likelihood as output. This can then be used for parameter estimation purposes, perhaps within a PMMH pMCMC algorithm, or something else. Again, this is a separate concern.

Example

Here I’ll just give a completely trivial toy example, purely to show how the functions work. For the avoidance of doubt, I know that there are many better/simpler/easier ways to tackle this problem! We will just look at inferring the auto-regression parameter of a linear Gaussian AR(1)-plus-noise model using the functions we have developed.

First we can simulate some synthetic data from this model, using a value of 0.8 for the auto-regression parameter:

import breeze.linalg.DenseVector
import breeze.stats.distributions.Gaussian

val inNoise = Gaussian(0.0, 1.0).sample(99)
val state = DenseVector(inNoise.scanLeft(0.0)((s, i) => 0.8 * s + i).toArray)
val noise = DenseVector(Gaussian(0.0, 2.0).sample(100).toArray)
val data = (state + noise).toArray.toList

Now assuming that we don’t know the auto-regression parameter, we can construct a function to evaluate the likelihood of different parameter values as follows:

val mll = pfMll(
  (th: Double) => Gaussian(0.0, 10.0).sample(10000).toVector.par,
  (th: Double) => (s: Double) => Gaussian(th * s, 1.0).draw,
  (th: Double) => (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o),
  data
)

Note that the 4 characters “.par” at the end of line 2 are the only difference between running this code serially or in parallel! Now we can run this code by calling the returned function with different values. So, hopefully mll(0.8) will return a larger log-likelihood than (say) mll(0.6) or mll(0.9). The example code in the github repo plots the results of calling mll() for a range of values (note that if that were the genuine use-case, it would be much better to parallelise the parameter range than the particle filter, as this gives coarser-grained and hence more efficient parallelisation, but many other examples require parallelisation of the particle filter itself). In this particular example, both the forward model and the likelihood are very cheap operations, so there is little to be gained from parallelisation. Nevertheless, I still get a speedup of more than a factor of two using the parallel version on my laptop.
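
As a rough sketch of what parallelising over the parameter range might look like, assume a function mllSerial built exactly as mll above but without the final “.par”, so that each particle filter runs serially while the parameter grid is evaluated in parallel:

// Hypothetical: evaluate the serial particle filter at many parameter values in parallel
val grid = (0 to 24).map(i => 0.5 + 0.025 * i).toVector
val mllGrid = grid.par map (th => (th, mllSerial(th)))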

Conclusion

In this post we have shown how typeclasses can be used in Scala to write code that is parallelisation-agnostic. Code written in this way can be run on one or many cores as desired. We’ve illustrated the concept with a scalable particle filter, but nothing about the approach is specific to that application. It would be easy to build up a library of statistical routines this way, all of which can effectively exploit available parallel hardware. Further, although we haven’t demonstrated it here, it is trivial to extend this idea to allow code to be distributed over a cluster of parallel machines if necessary. For example, if an Apache Spark cluster is available, it is easy to make a Spark RDD instance for our generic collection typeclass, which will then allow us to run our (unmodified) particle filter code over a Spark cluster. This emphasises the fact that Spark can be useful for distributing computation as well as just processing “big data”. I’ll say more about Spark in subsequent posts.

Scala as a platform for statistical computing and data science

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

  • be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
  • be free, open-source and platform independent
  • be fast and efficient
  • have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
  • have a strong type system, and be statically typed with good compile-time type checking and type safety
  • have reasonable type inference
  • have a REPL for interactive use
  • have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
  • have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
  • allow imperative programming for those (rare) occasions where it makes sense
  • be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

  • have excellent data viz capability built-in
  • have vast numbers of statistical routines in the standard library

The above are points (and there are other similar points) where other languages (for example, R) currently score better than Scala. It is not that these things are not important – indeed, they are highly desirable. But I consider them to be of lesser importance as they are much easier to fix, given a suitable platform, than fixing an unsuitable language and platform. Visualisation is not trivial, but it is not fantastically difficult in a language with excellent GUI libraries. Similarly, most statistical routines are quite straightforward to implement for anyone with reasonable expertise in scientific and statistical computing and numerical linear algebra. These are things that are relatively easy for a community to contribute to. Building a great programming language, on the other hand, is really, really difficult.

I will now expand briefly on each point in turn.

be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks

History has demonstrated, time and time again, that domain specific languages (DSLs) are synonymous with idiosyncratic, inconsistent languages that are terrible for anything other than what they were specifically designed for. They can often be great for precisely the thing that they were designed for, but people always want to do other things, and that is when the problems start. For the avoidance of controversy I won’t go into details, but the whole Python versus R thing is a perfect illustration of this general versus specific trade-off. Similarly, although there has been some buzz around another new language recently, which is faster than R and Python, my feeling is that the last thing the world needs right now is Just Unother Language for Indexed Arrays…

In this day-and-age it is vital that statistical code can use a variety of libraries, and communicate with well-designed network libraries and web frameworks, as statistical analysis does not exist in a vacuum. Scala certainly fits the bill here, being used in a large number of important high-profile systems, ensuring a lively, well-motivated ecosystem. There are numerous well-maintained libraries for almost any task. Picking on web frameworks, for example, there are a number of excellent libraries, including Lift and Play. Scala also has the advantage of offering seamless Java integration, for those (increasingly rare) occasions when a native Scala library for the task at hand doesn’t exist.

be free, open-source and platform independent

This hardly needs expanding upon, other than to observe that there are a few well-known commercial software solutions for scientific, statistical and mathematical computing. There are all kinds of problems with using closed proprietary systems, including lack of transparency and reproducibility, but also platform and scalability problems, e.g. running code requiring a license server in the cloud. The academic statistical community has largely moved away from commercial software, and I don’t think there is any going back. Scala is open source and runs on the JVM, which is about as platform independent as it is possible to get.

be fast and efficient

Speed and efficiency continue to be important, despite increasing processor speeds. Computationally intensive algorithms are being pushed to ever larger and more complex models and data sets. Compute cycles and memory efficiency really matter, and can’t be ignored. This doesn’t mean that we all have to code in C/C++/Fortran, but we can’t afford to code in languages which are orders of magnitude slower – that will always be a problem, however fast processors become. Scala code generally runs well within a factor of 2 of comparable native code – see my Gibbs sampler post for a simple example including timings.

have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra

I hesitated about including this in my list of essentials, because it is certainly something that can, in principle, be added to a language at a later date. However, building such libraries is far from trivial, and they need to be well-designed, comprehensive and efficient. For Scala, Breeze is rapidly becoming the standard scientific library, including special functions, non-uniform random number generation and numerical linear algebra. For a data library, there is Saddle, and for a scalable analytics library there is Spark. These libraries certainly don’t cover everything that can be found in R/CRAN, but they provide a fairly solid foundation on which to build.

have a strong type system, and be statically typed with good compile-time type checking and type safety

I love dynamic languages – they are fun and exciting. It is fun to quickly throw together a few functions in a scripting language without worrying about declaring the types of anything. And it is exciting to see the myriad of strange and unanticipated ways your code can crash-and-burn at runtime! 😉 But this excitement soon wears off, and you end up adding lots of boilerplate argument checking code that would not only be much cleaner and simpler in a statically typed language, but would be checked at compile-time, making the static code faster and more efficient. For messing about prototyping, dynamic languages are attractive, but as a solid platform for statistical computing, they really don’t make sense. Scala has a strong type system offering a high degree of compile-time checking, making it a safe and efficient language.

have reasonable type inference

A common issue with statically typed languages is that they lead to verbose code containing many redundant type declarations that the compiler ought to be able to check. This doesn’t just mean more typing – it leads to verbose code that can hide the program logic. Languages with type inference offer the best of both worlds – the safety of static typing without the verbosity. Scala does a satisfactory job here.

have a REPL for interactive use

One thing that dynamic languages have taught us is that it is actually incredibly useful to have a REPL for interactive analysis. This is true generally, but especially so for statistical computing, where human intervention is often desirable. Again, Scala has a nice REPL.

have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)

Tools matter. Scala has an excellent build tool in sbt. It has code documentation in the form of scaladoc (similar to javadoc). It has a unit testing framework, and a reasonably intelligent IDE in the form of the Scala IDE (based on Eclipse).

have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design

I, like many others, am gradually coming to realise that functional programming offers many advantages over other programming styles. In particular, it provides the best route to building scalable software, in terms of both program complexity and data size/complexity. Scala has good support for functional programming, including immutable named values, immutable data structures and for-comprehensions. And if off-the-shelf Scala isn’t sufficiently functional already, libraries such as scalaz make it even more so.
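
As a tiny illustration of the sort of “monadic” composition that for-comprehensions make natural, here is a sketch (the helper functions are hypothetical) which sequences two computations that may fail, without any explicit error handling:

// Hypothetical helpers returning Option to signal failure
def safeLog(x: Double): Option[Double] = if (x > 0.0) Some(math.log(x)) else None
def safeSqrt(x: Double): Option[Double] = if (x >= 0.0) Some(math.sqrt(x)) else None

// The for-comprehension short-circuits to None if either step fails
val result: Option[Double] = for {
  l <- safeLog(2.5)
  s <- safeSqrt(l)
} yield s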

allow imperative programming for those (rare) occasions where it makes sense

Although most algorithms in scientific computing are typically conceived of and implemented in an imperative style, I’m increasingly convinced that most can be recast in a pure functional way without significant loss of efficiency, and with significant benefits. That said, there really are some problems that are more efficient to implement in an imperative framework. It is therefore important that the language is not so “pure” functional that this is forbidden. Again, Scala fits the bill.

be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

These days scalability typically means exploiting concurrency and parallelism. In an imperative world this is hard, and libraries such as MPI prove that it is difficult to bolt parallelism on top of a language post-hoc. Check-points, communication overhead, deadlocks and race conditions make it very difficult to build codes that scale well to more than a few processors. Concurrency is more straightforward in functional languages, and this is one of the reasons for the recent resurgence of functional languages and programming. Scala has good concurrency support built-in, and libraries such as Akka make it relatively easy to build truly scalable software.

Summary

The Scala programming language ticks many boxes when it comes to forming a nice solid foundation for building a platform for efficient scalable statistical computing. Although I still use R and Python almost every day, I’m increasingly using Scala for serious algorithm development. In the short term I can interface to my Scala code from R using jvmr, but in the longer term I hope that Scala will become a complete framework for statistics and data science. In a subsequent post I will attempt to give a very brief introduction to Scala and the Breeze numerical library.