Scala for Machine Learning [book review]

Full disclosure: I received a free electronic version of this book from the publisher for the purposes of review.

There is clearly a market for a good book about using Scala for statistical computing, machine learning and data science. So when the publisher of “Scala for Machine Learning” offered me a copy for review purposes, I eagerly accepted. Three months later, I have eventually forced myself to read through the whole book, but I was very disappointed. It is important to be clear that I’m not just disappointed because I personally didn’t get much from the book – I am not really the target audience. I am disappointed because I struggle to envisage any audience that will benefit greatly from reading this book. There are several potential audiences for a book with this title: eg. people with little knowledge of Scala or machine learning (ML), people with knowledge of Scala but not ML, people with knowledge of ML but not Scala, and people with knowledge of both. I think there is scope for a book targeting any of those audiences. Personally, I fall in the latter category. The book author claimed to be aiming primarily at those who know Scala but not ML. This is sensible in that the book assumes a good working knowledge of Scala, and uses advanced features of the Scala language without any explanation: this book is certainly not appropriate for people hoping to learn about Scala in the context of ML. However, it is also a problem, as this would probably be the worst book I have ever encountered for learning about ML from scratch, and there are a lot of poor books about ML! The book just picks ML algorithms out of thin air without any proper explanation or justification, and blindly applies them to tedious financial data sets irrespective of whether or not it would be in any way appropriate to do so. It presents ML as an incoherent “bag of tricks” to be used indiscriminately on any data of the correct “shape”. It is by no means the only ML book to take such an approach, but there are many much better books which don’t. The author also claims that the book will be useful to people who know ML but not Scala, but as previously explained, I do not think that this is the case (eg. monadic traits appear on the fifth page, without proper explanation, and containing typos). I think that the only audience that could potentially benefit from this book would be people who know some Scala and some ML and want to see some practical examples of real world implementations of ML algorithms in Scala. I think those people will also be disappointed, for reasons outlined below.

The first problem with the book is that it is just full of errors and typos. It doesn’t really matter to me that essentially all of the equations in the first chapter are wrong – I already know the difference between an expectation and a sample mean, and know Bayes theorem – so I can just see that they are wrong, correct them, and move on. But for the intended audience it would be a complete nightmare. I wonder about the quality of copy-editing and technical review that this book received – it is really not of “publishable” quality. All of the descriptions of statistical/ML methods and algorithms are incredibly superficial, and usually contain factual errors or typos. One should not attempt to learn ML by reading this book. So the only hope for this book is that the Scala implementations of ML algorithms are useful and insightful. Again, I was disappointed.

For reasons that are not adequately explained or justified, the author decides to use a combination of plain Scala interfaced to legacy Java libraries (especially Apache Commons Math) for all of the example implementations. In addition, the author is curiously obsessed with an F# style pipe operator, which doesn’t seem to bring much practical benefit. Consequently, all of the code looks like a strange and rather inelegant combination of Java, Scala, C++, and F#, with a hint of Haskell, and really doesn’t look like clean idiomatic Scala code at all. For me this was the biggest disappointment of all – I really wouldn’t want any of this code in my own Scala code base (though the licensing restrictions on the code probably forbid this, anyway). It is a real shame that Scala libraries such as Breeze were not used for all of the examples – this would have led to much cleaner and more idiomatic Scala code, which could have really taken proper advantage of the functional power of the Scala language. As it is, advanced Scala features were used without much visible pay-off. Reading this book one could easily get the (incorrect) impression that Scala is an unnecessarily complex language which doesn’t offer much advantage over Java for implementing ML algorithms.

On the positive side, the book consists of nearly 500 pages of text, covering a wide range of ML algorithms and examples, and has a zip file of associated code containing the implementation and examples, which builds using sbt. If anyone is interested in seeing examples of ML algorithms implemented in Scala using Java rather than Scala libraries together with a F# pipe operator, then there is definitely something of substance here of interest.

Alternatives

It should be clear from the above review that I think there is still a gap in the market for a good book about using Scala for statistical computing, machine learning and data science. Hopefully someone will fill this gap soon. In the meantime it is necessary to learn about Scala and ML separately, and to put the ideas together yourself. This isn’t so difficult, as there are many good resources and code repositories to help. For learning about ML, I would recommend starting off with ISLR, which uses R for the examples (but if you work in data science, you need to know R anyway). Once the basic concepts are understood, one can move on to a serious text, such as Machine Learning (which has associated Matlab code). Converting algorithms from R or Matlab to Scala (plus Breeze) is generally very straightforward, if you know Scala. For learning Scala, there are many on-line resources. If you want books, I recommend Functional Programming in Scala and Programming in Scala, 2e. Once you know about Scala, learn about scientific computing using Scala by figuring out Breeze. At some point you will probably also want to know about Spark, and there are now books on this becoming available – I’ve just got a copy of Learning Spark, which looks OK.

Calling R from Scala sbt projects

Overview

In previous posts I’ve shown how the jvmr CRAN R package can be used to call Scala sbt projects from R and inline Scala Breeze code in R. In this post I will show how to call to R from a Scala sbt project. This requires that R and the jvmr CRAN R package are installed on your system, as described in the previous posts. Since I’m focusing here on Scala sbt projects, I’m also assuming that sbt is installed.

The only “trick” required for calling back to R from Scala is telling sbt where the jvmr jar file is located. You can find the location from the R console as illustrated by the following session:

> library(jvmr)
> .jvmr.jar
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the jvmr documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "jvmr test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

scalaVersion := "2.11.2"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.jvmr.RInScala
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi => (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poission regression model
    val R = RInScala() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficents back into scala
    val beta = DenseVector[Double](R.toVector[Double]("mod$coefficients"))

    // print the fitted coefficents
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the jvmr documentation for further details.

Inlining Scala Breeze code in R using jvmr and sbt

Introduction

In the previous post I showed how to call Scala code from R using sbt and jvmr. The approach described in that post is the one I would recommend for any non-trivial piece of Scala code – mixing up code from different languages in the same source code file is not a good strategy in general. That said, for very small snippets of code, it can sometimes be convenient to inline Scala code directly into an R source code file. The canonical example of this is a computationally intensive algorithm being prototyped in R which has a slow inner loop. If the inner loop can be recoded in a few lines of Scala, it would be nice to just inline this directly into the R code without having to create a separate Scala project. The CRAN package jvmr provides a very simple and straightforward way to do this. However, as discussed in the last post, most Scala code for statistical computing (even short and simple code) is likely to rely on Breeze for special functions, probability distributions, non-uniform random number generation, numerical linear algebra, etc. In this post we will see how to use sbt in order to make sure that the Breeze library and all of its dependencies are downloaded and cached, and to provide a correct classpath with which to initialise a jvmr scalaInterpreter session.

Setting up

Configuring your system to be able to inline Scala Breeze code is very easy. You just need to install Java, R and sbt. Then install the CRAN R package jvmr. At this point you have everything you need except for the R function breezeInit, given at the end of this post. I’ve deferred the function to the end of the post as it is a bit ugly, and the details of it are not important. All it does is get sbt to ensure that Breeze is correctly downloaded and cached and then starts a scalaInterpreter with Breeze on the classpath. With this function available, we can use it within an R session as the following R session illustrates:

> b=breezeInit()
> b['import breeze.stats.distributions._']
NULL
> b['Poisson(10).sample(20).toArray']
 [1] 13 14 13 10  7  6 15 14  5 10 14 11 15  8 11 12  6  7  5  7
> summary(b['Gamma(3,2).sample(10000).toArray'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2124  3.4630  5.3310  5.9910  7.8390 28.5200 
> 

So we see that Scala Breeze code can be inlined directly into an R session, and if we are careful about return types, have the results of Scala expressions automatically unpack back into convenient R data structures.

Summary

In this post I have shown how easy it is to inline Scala Breeze code into R using sbt in conjunction with the CRAN package jvmr. This has many potential applications, with the most obvious being the desire to recode slow inner loops from R to Scala. This should give performance quite comparable with alternatives such as Rcpp, with the advantage being that you get to write beautiful, elegant, functional Scala code instead of horrible, ugly, imperative C++ code! ;-)

The breezeInit function

The actual breezeInit() function is given below. It is a little ugly, but very simple. It is obviously easy to customise for different libraries and library versions as required. All of the hard work is done by sbt which must be installed and on the default system path in order for this function to work.

breezeInit<-function()
{
  library(jvmr)
  sbtStr="name := \"tmp\"

version := \"0.1\"

libraryDependencies  ++= Seq(
            \"org.scalanlp\" %% \"breeze\" % \"0.10\",
            \"org.scalanlp\" %% \"breeze-natives\" % \"0.10\"
)

resolvers ++= Seq(
            \"Sonatype Snapshots\" at \"https://oss.sonatype.org/content/repositories/snapshots/\",
            \"Sonatype Releases\" at \"https://oss.sonatype.org/content/repositories/releases/\"
)

scalaVersion := \"2.11.2\"

lazy val printClasspath = taskKey[Unit](\"Dump classpath\")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e => print(e.data+\"!\")
  }
}
"
  tmps=file(file.path(tempdir(),"build.sbt"),"w")
  cat(sbtStr,file=tmps)
  close(tmps)
  owd=getwd()
  setwd(tempdir())
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[2:(length(cpsp)-1)]
  si=scalaInterpreter(cp,use.jvmr.class.path=FALSE)
  setwd(owd)
  si
}

Calling Scala code from R using jvmr

Introduction

In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work.

Since I use both Scala and R for statistical computing, it is very convenient to have a degree of interoperability between the two languages. I could call R from Scala code or Scala from R code, or both. Fortunately, some software tools have been developed recently which make this much simpler than it used to be. The software is jvmr, and as explained at the website, it enables calling Java and Scala from R and calling R from Java and Scala. I have previously discussed calling Java from R using the R CRAN package rJava. In this post I will focus on calling Scala from R using the CRAN package jvmr, which depends on rJava. I may examine calling R from Scala in a future post.

On a system with Java installed, it should be possible to install the jvmr R package with a simple

install.packages("jvmr")

from the R command prompt. The package has the usual documentation associated with it, but the draft paper describing the package is the best way to get an overview of its capabilities and a walk-through of simple usage.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler which relies on the Breeze scientific library, and will be built using the simple build tool, sbt. Most non-trivial Scala projects depend on various versions of external libraries, and sbt is an easy way to build even very complex projects trivially on any system with Java installed. You don’t even need to have Scala installed in order to build and run projects using sbt. I give some simple complete worked examples of building and running Scala sbt projects in the github repo associated with my recent RSS talk. Installing sbt is trivial as explained in the repo READMEs.

For this post, the Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"
    }

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
    }

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc
      go(s,stop,Nil).reverse
    }

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
        sys.exit(1)
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new java.io.FileWriter(outF)
        s.write("x , y\n")
        out map { it => s.write(it.toString) }
        s.close
      }
    }

}

This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.2"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("sbt \"run output.csv 50000 1000\"")
out=read.csv("output.csv")
library(smfsb)
mcmcSummary(out,rows=2)

This works fine, but won’t work so well for code which needs to be called repeatedly. For this, tighter integration between R and Scala would be useful, which is where jvmr comes in.

Calling sbt-based Scala projects via jvmr

jvmr provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. For an sbt project such as we are considering here, it is relatively easy to get sbt to provide us with all of the information we need in a fully automated way.

First, we need to add a new task to our sbt build instructions, which will output the full classpath in a way that is easy to parse from R. Just add the following to the end of the file build.sbt:

lazy val printClasspath = taskKey[Unit]("Dump classpath")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e => print(e.data+"!")
  }
}

Be aware that blank lines are significant in sbt build files. Once we have this in our build file, we can write a small R function to get the classpath from sbt and then initialise a jvmr scalaInterpreter with the correct full classpath needed for the project. An R function which does this, sbtInit(), is given below

sbtInit<-function()
{
  library(jvmr)
  system2("sbt","compile")
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[1:(length(cpsp)-1)]
  scalaInterpreter(cp,use.jvmr.class.path=FALSE)
}

With this function at our disposal, it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

sc=sbtInit()
sc['import gibbs.Gibbs._']
out=sc['genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}']
library(smfsb)
mcmcSummary(out,rows=2)

Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

Summary

The CRAN package jvmr makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to add a task to the sbt build file and a function to R in order to initialise the jvmr Scala interpreter with the full classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

Statistical computing languages at the RSS

On Friday the Royal Statistical Society hosted a meeting on Statistical computing languages, organised by my colleague Colin Gillespie. Four languages were presented at the meeting: Python, Scala, Matlab and Julia. I presented the talk on Scala. The slides I presented are available, in addition to the code examples and instructions on how to run them, in a public github repo I have created. The talk mainly covered material I have discussed in various previous posts on this blog. What was new was the emphasis on how easy it is to use and run Scala code, and the inclusion of complete examples and instructions on how to run them on any platform with a JVM installed. I also discussed some of the current limitations of Scala as an environment for interactive statistical data analysis and visualisation, and how these limitations could be overcome with a little directed effort. Colin suggested that all of the speakers covered a couple of examples (linear regression and a Monte Carlo integral) in “their” languages, and he provided an R solution. It is interesting to see the examples in the five different languages side by side for comparison purposes. Colin is collecting together all of the material relating to the meeting in a combined github repo, which should fill out over the next few days.

For other Scala posts on this blog, see all of my posts with a “scala” tag.

Parallel Monte Carlo using Scala

Introduction

In previous posts I have discussed general issues regarding parallel MCMC and examined in detail parallel Monte Carlo on a multicore laptop. In those posts I used the C programming language in conjunction with the MPI parallel library in order to illustrate the concepts. In this post I want to take the example from the second post and re-examine it using the Scala programming language.

The toy problem considered in the parallel Monte Carlo post used 10^9 U(0,1) random quantities to construct a Monte Carlo estimate of the integral

\displaystyle I=\int_0^1\exp\{-u^2\}du.

A very simple serial program to implement this algorithm is given below:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

  @tailrec
  def sum(its: Long,acc: Double): Double = {
    if (its==0) 
      (acc)
    else {
      val u=ThreadLocalRandom.current().nextDouble()
      sum(its-1,acc+exp(-u*u))
    }
  }

  def main(args: Array[String]) = {
    println("Hello")
    val iters=1000000000
    val result=sum(iters,0.0)
    println(result/iters)
    println("Goodbye")
  }

}

Note that ThreadLocalRandom is a parallel random number generator introduced into recent versions of the Java programming language, which can be easily utilised from Scala code. Assuming that Scala is installed, this can be compiled and run with commands like

scalac monte-carlo.scala
time scala MonteCarlo

This program works, and the timings (in seconds) for three runs are 57.79, 57.77 and 57.55 on the same laptop considered in the previous post. The first thing to note is that this Scala code is actually slightly faster than the corresponding C+MPI code in the single processor special case! Now that we have a good working implementation we should think how to parallelise it…

Parallel implementation

Before constructing a parallel implementation, we will first construct a slightly re-factored serial version that will be easier to parallelise. The simplest way to introduce parallelisation into Scala code is to parallelise a map over a collection. We therefore need a collection and a map to apply to it. Here we will just divide our 10^9 iterations into N=4 separate computations, and use a map to compute the required Monte Carlo sums.

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

  @tailrec
  def sum(its: Long,acc: Double): Double = {
    if (its==0) 
      (acc)
    else {
      val u=ThreadLocalRandom.current().nextDouble()
      sum(its-1,acc+exp(-u*u))
    }
  }

  def main(args: Array[String]) = {
    println("Hello")
    val N=4
    val iters=1000000000
    val its=iters/N
    val sums=(1 to N).toList map {x => sum(its,0.0)}
    val result=sums.reduce(_+_)
    println(result/iters)
    println("Goodbye")
  }

}

Running this new code confirms that it works and gives similar estimates for the Monte Carlo integral as the previous version. The timings for 3 runs on my laptop were 57.57, 57.67 and 57.80, similar to the previous version of the code. So far so good. But how do we make it parallel? Like this:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

  @tailrec
  def sum(its: Long,acc: Double): Double = {
    if (its==0) 
      (acc)
    else {
      val u=ThreadLocalRandom.current().nextDouble()
      sum(its-1,acc+exp(-u*u))
    }
  }

  def main(args: Array[String]) = {
    println("Hello")
    val N=4
    val iters=1000000000
    val its=iters/N
    val sums=(1 to N).toList.par map {x => sum(its,0.0)}
    val result=sums.reduce(_+_)
    println(result/iters)
    println("Goodbye")
  }

}

That’s it! It’s now parallel. Studying the above code reveals that the only difference from the previous version is the introduction of the 4 characters .par in line 22 of the code. R programmers will find this very much analagous to using lapply() versus mclapply() in R code. The function par converts the collection (here an immutable List) to a parallel collection (here an immutable parallel List), and then subsequent maps, filters, etc., can be computed in parallel on appropriate multicore architectures. Timings for 3 runs on my laptop were 20.74, 20.82 and 20.88. Note that these timings are faster than the timings for N=4 processors for the corresponding C+MPI code…

Varying the size of the parallel collection

We can trivially modify the previous code to make the size of the parallel collection, N, a command line argument:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

  @tailrec
  def sum(its: Long,acc: Double): Double = {
    if (its==0) 
      (acc)
    else {
      val u=ThreadLocalRandom.current().nextDouble()
      sum(its-1,acc+exp(-u*u))
    }
  }

  def main(args: Array[String]) = {
    println("Hello")
    val N=args(0).toInt
    val iters=1000000000
    val its=iters/N
    val sums=(1 to N).toList.par map {x => sum(its,0.0)}
    val result=sums.reduce(_+_)
    println(result/iters)
    println("Goodbye")
  }

}

We can now run this code with varying sizes of N in order to see how the runtime of the code changes as the size of the parallel collection increases. Timings on my laptop are summarised in the table below.

 N     T1     T2     T3
 1   57.67  57.62  57.83
 2   32.20  33.24  32.76
 3   26.63  26.60  26.63
 4   20.99  20.92  20.75
 5   20.13  18.70  18.76
 6   16.57  16.52  16.59
 7   15.72  14.92  15.27
 8   13.56  13.51  13.32
 9   18.30  18.13  18.12
10   17.25  17.33  17.22
11   17.04  16.99  17.09
12   15.95  15.85  15.91

16   16.62  16.68  16.74
32   15.41  15.54  15.42
64   15.03  15.03  15.28

So we see that the timings decrease steadily until the size of the parallel collection hits 8 (the number of processors my hyper-threaded quad-core presents via Linux), and then increases very slightly, but not much as the size of the collection increases. This is better than the case of C+MPI where performance degrades noticeably if too many processes are requested. Here, the Scala compiler and JVM runtime manage an appropriate number of threads for the collection irrespective of the actual size of the collection. Also note that all of the timings are faster than the corresponding C+MPI code discussed in the previous post.

However, the notion that the size of the collection is irrelevant is only true up to a point. Probably the most natural way to code this algorithm would be as:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

  def main(args: Array[String]) = {
    println("Hello")
    val iters=1000000000
    val sums=(1 to iters).toList map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
    val result=sums.reduce(_+_)
    println(result/iters)
    println("Goodbye")
  }

}

or as the parallel equivalent

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

  def main(args: Array[String]) = {
    println("Hello")
    val iters=1000000000
    val sums=(1 to iters).toList.par map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
    val result=sums.reduce(_+_)
    println(result/iters)
    println("Goodbye")
  }

}

Although these algorithms are in many ways cleaner and more natural, they will bomb out with a lack of heap space unless you have a huge amount of RAM, as they rely on having all 10^9 realisations in RAM simultaneously. The lesson here is that even though functional languages make it very easy to write clean, efficient parallel code, we must still be careful not to fill up the heap with gigantic (immutable) data structures…

Brief introduction to Scala and Breeze for statistical computing

Introduction

In the previous post I outlined why I think Scala is a good language for statistical computing and data science. In this post I want to give a quick taste of Scala and the Breeze numerical library to whet the appetite of the uninitiated. This post certainly won’t provide enough material to get started using Scala in anger – but I’ll try and provide a few pointers along the way. It also won’t be very interesting to anyone who knows Scala – I’m not introducing any of the very cool Scala stuff here – I think that some of the most powerful and interesting Scala language features can be a bit frightening for new users.

To reproduce the examples, you need to install Scala and Breeze. This isn’t very tricky, but I don’t want to get bogged down with a detailed walk-through here – I want to concentrate on the Scala language and Breeze library. You just need to install a recent version of Java, then Scala, and then Breeze. You might also want SBT and/or the ScalaIDE, though neither of these are necessary. Then you need to run the Scala REPL with the Breeze library in the classpath. There are several ways one can do this. The most obvious is to just run scala with the path to Breeze manually specified (or specified in an environment variable). Alternatively, you could run a console from an sbt session with a Breeze dependency (which is what I actually did for this post), or you could use a Scala Worksheet from inside a ScalaIDE project with a Breeze dependency.

A Scala REPL session

A first glimpse of Scala

We’ll start with a few simple Scala concepts that are not dependent on Breeze. For further information, see the Scala documentation.

Welcome to Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25).
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a = 5
a: Int = 5

scala> a
res0: Int = 5

So far, so good. Using the Scala REPL is much like using the Python or R command line, so will be very familiar to anyone used to these or similar languages. The first thing to note is that labels need to be declared on first use. We have declared a to be a val. These are immutable values, which can not be just re-assigned, as the following code illustrates.

scala> a = 6
<console>:8: error: reassignment to val
       a = 6
         ^
scala> a
res1: Int = 5

Immutability seems to baffle people unfamiliar with functional programming. But fear not, as Scala allows declaration of mutable variables as well:

scala> var b = 7
b: Int = 7

scala> b
res2: Int = 7

scala> b = 8
b: Int = 8

scala> b
res3: Int = 8

The Zen of functional programming is to realise that immutability is generally a good thing, but that really isn’t the point of this post. Scala has excellent support for both mutable and immutable collections as part of the standard library. See the API docs for more details. For example, it has immutable lists.

scala> val c = List(3,4,5,6)
c: List[Int] = List(3, 4, 5, 6)

scala> c(1)
res4: Int = 4

scala> c.sum
res5: Int = 18

scala> c.length
res6: Int = 4

scala> c.product
res7: Int = 360

Again, this should be pretty familiar stuff for anyone familiar with Python. Note that the sum and product methods are special cases of reduce operations, which are well supported in Scala. For example, we could compute the sum reduction using

scala> c.foldLeft(0)((x,y) => x+y)
res8: Int = 18

or the slightly more condensed form given below, and similarly for the product reduction.

scala> c.foldLeft(0)(_+_)
res9: Int = 18

scala> c.foldLeft(1)(_*_)
res10: Int = 360

Scala also has a nice immutable Vector class, which offers a range of constant time operations (but note that this has nothing to do with the mutable Vector class that is part of the Breeze library).

scala> val d = Vector(2,3,4,5,6,7,8,9)
d: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d
res11: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d.slice(3,6)
res12: scala.collection.immutable.Vector[Int] = Vector(5, 6, 7)

scala> val e = d.updated(3,0)
e: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)

scala> d
res13: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> e
res14: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)

Note that when e is created as an updated version of d the whole of d is not copied – only the parts that have been updated. And we don’t have to worry that aspects of d and e point to the same information in memory, as they are both immutable… As should be clear by now, Scala has excellent support for functional programming techniques. In addition to the reduce operations mentioned already, maps and filters are also well covered.

scala> val f=(1 to 10).toList
f: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> f
res15: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> f.map(x => x*x)
res16: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f map {x => x*x}
res17: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f filter {_ > 4}
res18: List[Int] = List(5, 6, 7, 8, 9, 10)

Note how Scala allows methods with a single argument to be written as an infix operator, making for more readable code.

A first look at Breeze

The next part of the session requires the Breeze library – see the Breeze quickstart guide for further details. We begin by taking a quick look at everyone’s favourite topic of non-uniform random number generation. Let’s start by generating a couple of draws from a Poisson distribution with mean 3.

scala> import breeze.stats.distributions._
import breeze.stats.distributions._

scala> val poi = Poisson(3.0)
poi: breeze.stats.distributions.Poisson = Poisson(3.0)

scala> poi.draw
res19: Int = 2

scala> poi.draw
res20: Int = 3

If more than a single draw is required, an iid sample can be obtained.

scala> val x = poi.sample(10)
x: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x
res21: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x.sum
res22: Int = 25

scala> x.length
res23: Int = 10

scala> x.sum.toDouble/x.length
res24: Double = 2.5

Note that this Vector is mutable. The probability mass function (PMF) of the Poisson distribution is also available.

scala> poi.probabilityOf(2)
res25: Double = 0.22404180765538775

scala> x map {x => poi.probabilityOf(x)}
res26: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)

scala> x map {poi.probabilityOf(_)}
res27: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)

Obviously, Gaussian variables (and Gamma, and several others) are supported in a similar way.

scala> val gau=Gaussian(0.0,1.0)
gau: breeze.stats.distributions.Gaussian = Gaussian(0.0, 1.0)

scala> gau.draw
res28: Double = 1.606121255846881

scala> gau.draw
res29: Double = -0.1747896055492152

scala> val y=gau.sample(20)
y: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y
res30: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y.sum/y.length
res31: Double = -0.34064156102380994

scala> y map {gau.logPdf(_)}
res32: IndexedSeq[Double] = Vector(-1.8654307403000054, -1.6568463163564844, -0.9191916849836235, -0.9715564183413823, -0.9836614354155007, -1.3847302992371653, -1.0023094506890617, -0.9256472309869705, -1.3059361584943119, -0.975419259871957, -1.1669755840586733, -1.6444202843394145, -0.93783943912556, -0.9683690047171869, -0.9209315167224245, -2.090114759123421, -1.6843650876361744, -1.0915455053203147, -1.359378517654625, -1.1399116208702693)

scala> Gamma(2.0,3.0).sample(5)
res33: IndexedSeq[Double] = Vector(2.38436441278546, 2.125017198373521, 2.333118708811143, 5.880076392566909, 2.0901427084667503)

This is all good stuff for those of us who like to do Markov chain Monte Carlo. There are not masses of statistical data analysis routines built into Breeze, but a few basic tools are provided, including some basic summary statistics.

scala> import breeze.stats.DescriptiveStats._
import breeze.stats.DescriptiveStats._

scala> mean(y)
res34: Double = -0.34064156102380994

scala> variance(y)
res35: Double = 0.574257149387757

scala> meanAndVariance(y)
res36: (Double, Double) = (-0.34064156102380994,0.574257149387757)

Support for linear algebra is an important part of any scientific library. Here the Breeze developers have made the wise decision to provide a nice Scala interface to netlib-java. This in turn calls out to any native optimised BLAS or LAPACK libraries installed on the system, but will fall back to Java code if no optimised libraries are available. This means that linear algebra code using Scala and Breeze should run as fast as code written in any other language, including C, C++ and Fortran, provided that optimised libraries are installed on the system. For further details see the Breeze linear algebra guide. Let’s start by creating and messing with a dense vector.

scala> import breeze.linalg._
import breeze.linalg._

scala> val v=DenseVector(y.toArray)
v: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1) = 0

scala> v
res38: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 0.0, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := 1.0
res39: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.0, 1.0)

scala> v
res40: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.0, 1.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := DenseVector(1.0,1.5,2.0)
res41: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.5, 2.0)

scala> v
res42: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.5, 2.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v :> 0.0
res43: breeze.linalg.BitVector = BitVector(1, 2, 3, 4, 5, 7, 10, 12, 14, 17)

scala> (v :> 0.0).toArray
res44: Array[Boolean] = Array(false, true, true, true, true, true, false, true, false, false, true, false, true, false, true, false, false, true, false, false)

Next let’s create and mess around with some dense matrices.

scala> val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0   11.0  16.0  
2.0  7.0   12.0  17.0  
3.0  8.0   13.0  18.0  
4.0  9.0   14.0  19.0  
5.0  10.0  15.0  20.0  

scala> m
res45: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0   11.0  16.0  
2.0  7.0   12.0  17.0  
3.0  8.0   13.0  18.0  
4.0  9.0   14.0  19.0  
5.0  10.0  15.0  20.0  

scala> m.rows
res46: Int = 5

scala> m.cols
res47: Int = 4

scala> m(::,1)
res48: breeze.linalg.DenseVector[Double] = DenseVector(6.0, 7.0, 8.0, 9.0, 10.0)

scala> m(1,::)
res49: breeze.linalg.DenseMatrix[Double] = 2.0  7.0  12.0  17.0  

scala> m(1,::) := linspace(1.0,2.0,4)
res50: breeze.linalg.DenseMatrix[Double] = 1.0  1.3333333333333333  1.6666666666666665  2.0  

scala> m
res51: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0                 11.0                16.0  
1.0  1.3333333333333333  1.6666666666666665  2.0   
3.0  8.0                 13.0                18.0  
4.0  9.0                 14.0                19.0  
5.0  10.0                15.0                20.0  

scala> 

scala> val n = m.t
n: breeze.linalg.DenseMatrix[Double] = 
1.0   1.0                 3.0   4.0   5.0   
6.0   1.3333333333333333  8.0   9.0   10.0  
11.0  1.6666666666666665  13.0  14.0  15.0  
16.0  2.0                 18.0  19.0  20.0  

scala> n
res52: breeze.linalg.DenseMatrix[Double] = 
1.0   1.0                 3.0   4.0   5.0   
6.0   1.3333333333333333  8.0   9.0   10.0  
11.0  1.6666666666666665  13.0  14.0  15.0  
16.0  2.0                 18.0  19.0  20.0  

scala> val o = m*n
o: breeze.linalg.DenseMatrix[Double] = 
414.0              59.33333333333333  482.0              516.0              550.0              
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333  
482.0              71.33333333333333  566.0              608.0              650.0              
516.0              77.33333333333333  608.0              654.0              700.0              
550.0              83.33333333333333  650.0              700.0              750.0              

scala> o
res53: breeze.linalg.DenseMatrix[Double] = 
414.0              59.33333333333333  482.0              516.0              550.0              
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333  
482.0              71.33333333333333  566.0              608.0              650.0              
516.0              77.33333333333333  608.0              654.0              700.0              
550.0              83.33333333333333  650.0              700.0              750.0              

scala> val p = n*m
p: breeze.linalg.DenseMatrix[Double] = 
52.0                117.33333333333333  182.66666666666666  248.0              
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667  
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334  
248.0               613.6666666666667   979.3333333333334   1345.0             

scala> p
res54: breeze.linalg.DenseMatrix[Double] = 
52.0                117.33333333333333  182.66666666666666  248.0              
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667  
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334  
248.0               613.6666666666667   979.3333333333334   1345.0             

So, messing around with vectors and matrices is more-or-less as convenient as in well-known dynamic and math languages. To conclude this section, let us see how to simulate some data from a regression model and then solve the least squares problem to obtain the estimated regression coefficients. We will simulate 1,000 observations from a model with 5 covariates.

scala> val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
X: breeze.linalg.DenseMatrix[Double] = 
-0.40186606934180685  0.9847148198711287    ... (5 total)
-0.4760404521336951   -0.833737041320742    ...
-0.3315199616926892   -0.19460446824586297  ...
-0.14764615494496836  -0.17947658245206904  ...
-0.8357372755800905   -2.456222113596015    ...
-0.44458309216683184  1.848007773944826     ...
0.060314034896221065  0.5254462055311016    ...
0.8637867740789016    -0.9712570453363925   ...
0.11620167261655819   -1.2231380938032232   ...
-0.3335514290842617   -0.7487303696662753   ...
-0.5598937433421866   0.11083382409013512   ...
-1.7213395389510568   1.1717491221846357    ...
-1.078873342208984    0.9386859686451607    ...
-0.7793854546738327   -0.9829373863442161   ...
-1.054275201631216    0.10100826507456745   ...
-0.6947188686537832   1.215...
scala> val b0 = linspace(1.0,2.0,5)
b0: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.25, 1.5, 1.75, 2.0)

scala> val y0 = X * b0
y0: breeze.linalg.DenseVector[Double] = DenseVector(0.08200546839589107, -0.5992571365601228, -5.646398002309553, -7.346136663325798, -8.486423788193362, 1.451119214541837, -0.25792385841948406, 2.324936340609002, -1.2285599639827862, -4.030261316643863, -4.1732627416377674, -0.5077151099958077, -0.2087263741903591, 0.46678616461409383, 2.0244342278575975, 1.775756468177401, -4.799821190728213, -1.8518388060564481, 1.5892306875621767, -1.6528539564387008, 1.4064864330994125, -0.8734630221484178, -7.75470002781836, -0.2893619536998493, -5.972958583649336, -4.952666733286302, 0.5431255990489059, -2.477076684976403, -0.6473617571867107, -0.509338416957489, -1.5415350935719594, -0.47068802465681125, 2.546118380362026, -7.940401988804477, -1.037049442788122, -1.564016663370888, -3.3147087994...
scala> val y = y0 + DenseVector(gau.sample(1000).toArray)
y: breeze.linalg.DenseVector[Double] = DenseVector(-0.572127338358624, -0.16481167194161406, -4.213873268823003, -10.142015065601388, -7.893898543052863, 1.7881055848475076, -0.26987820512025357, 3.3289433195054148, -2.514141419925489, -4.643625974157769, -3.8061000214061886, 0.6462624993109218, 0.23603338389134149, 1.0211137806779267, 2.0061727641393317, 0.022624943149799348, -5.429601401989341, -1.836181225242386, 1.0265599173053048, -0.1673732536615371, 0.8418249443853956, -1.1547110533101967, -8.392100167478764, -1.1586377992526877, -6.400362975646245, -5.487018086963841, 0.3038055584347069, -1.2247410435868684, -0.06476921390724344, -1.5039074374120407, -1.0189111630970076, 1.307339668865724, 2.048320821568789, -8.769328824477714, -0.9104251029228555, -1.3533910178496698, -2.178788...
scala> val b = X \ y  // defaults to a QR-solve of the least squares problem
b: breeze.linalg.DenseVector[Double] = DenseVector(0.9952708232116663, 1.2344546192238952, 1.5543512339052412, 1.744091673457169, 1.9874158953720507)

So all of the most important building blocks for statistical computing are included in the Breeze library.

At this point it is really worth reminding yourself that Scala is actually a statically typed language, despite the fact that in this session we have not explicitly declared the type of anything at all! This is because Scala has type inference, which makes type declarations optional when it is straightforward for the compiler to figure out what the types must be. For example, for our very first expression, val a = 5, because the RHS is an Int, it is clear that the LHS must also be an Int, and so the compiler infers that the type of a must be an Int, and treats the code as if the type had been declared as val a: Int = 5. This type inference makes Scala feel very much like a dynamic language in general use. Typically, we carefully specify the types of function arguments (and often the return type of the function, too), but then for the main body of each function, just let the compiler figure out all of the types and write code as if the language were dynamic. To me, this seems like the best of all worlds. The convenience of dynamic languages with the safety of static typing.

Declaring the types of function arguments is not usually a big deal, as the following simple example demonstrates.

scala> def mean(arr: Array[Int]): Double = {
     |   arr.sum.toDouble/arr.length
     | }
mean: (arr: Array[Int])Double

scala> mean(Array(3,1,4,5))
res55: Double = 3.25

A complete Scala program

For completeness, I will finish this post with a very simple but complete Scala/Breeze program. In a previous post I discussed a simple Gibbs sampler in Scala, but in that post I used the Java COLT library for random number generation. Below is a version using Breeze instead.

object BreezeGibbs {

  import breeze.stats.distributions._
  import scala.math.sqrt

  class State(val x: Double, val y: Double)

  def nextIter(s: State): State = {
    val newX = Gamma(3.0, 1.0 / ((s.y) * (s.y) + 4.0)).draw()
    new State(newX, Gaussian(1.0 / (newX + 1), 1.0 / sqrt(2 * newX + 2)).draw())
  }

  def nextThinnedIter(s: State, left: Int): State = {
    if (left == 0) s
    else nextThinnedIter(nextIter(s), left - 1)
  }

  def genIters(s: State, current: Int, stop: Int, thin: Int): State = {
    if (!(current > stop)) {
      println(current + " " + s.x + " " + s.y)
      genIters(nextThinnedIter(s, thin), current + 1, stop, thin)
    } else s
  }

  def main(args: Array[String]) {
    println("Iter x y")
    genIters(new State(0.0, 0.0), 1, 50000, 1000)
  }

}

Summary

In this post I’ve tried to give a quick taste of the Scala language and the Breeze library for those used to dynamic languages for statistical computing. Hopefully I’ve illustrated that the basics don’t look too different, so there is no reason to fear Scala. It is perfectly possible to start using Scala as a better and faster Python or R. Once you’ve mastered the basics, you can then start exploring the full power of the language. There’s loads of introductory Scala material to be found on-line. It probably makes sense to start with the links I’ve highlighted above. After that, just start searching – there’s an interesting set of tutorials I noticed just the other day. A very time-efficient way to learn Scala quickly is to do the FP with Scala course on Coursera, but whether this makes sense will depend on when it is next running. For those who prefer real books, the book Programming in Scala is the standard reference, and I’ve also found Functional programming in Scala to be useful (free text of the first edition of the former and a draft of the latter can be found on-line).

REPL Script

Below is a copy of the complete REPL script, for reference.

// start with non-Breeze stuff

val a = 5
a
a = 6
a

var b = 7
b
b = 8
b

val c = List(3,4,5,6)
c(1)
c.sum
c.length
c.product
c.foldLeft(0)((x,y) => x+y)
c.foldLeft(0)(_+_)
c.foldLeft(1)(_*_)

val d = Vector(2,3,4,5,6,7,8,9)
d
d.slice(3,6)
val e = d.updated(3,0)
d
e

val f=(1 to 10).toList
f
f.map(x => x*x)
f map {x => x*x}
f filter {_ > 4}

// introduce breeze through random distributions
// https://github.com/scalanlp/breeze/wiki/Quickstart

import breeze.stats.distributions._
val poi = Poisson(3.0)
poi.draw
poi.draw
val x = poi.sample(10)
x
x.sum
x.length
x.sum.toDouble/x.length
poi.probabilityOf(2)
x map {x => poi.probabilityOf(x)}
x map {poi.probabilityOf(_)}

val gau=Gaussian(0.0,1.0)
gau.draw
gau.draw
val y=gau.sample(20)
y
y.sum/y.length
y map {gau.logPdf(_)}

Gamma(2.0,3.0).sample(5)

import breeze.stats.DescriptiveStats._
mean(y)
variance(y)
meanAndVariance(y)


// move on to linear algebra
// https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra

import breeze.linalg._
val v=DenseVector(y.toArray)
v(1) = 0
v
v(1 to 3) := 1.0
v
v(1 to 3) := DenseVector(1.0,1.5,2.0)
v
v :> 0.0
(v :> 0.0).toArray

val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m
m.rows
m.cols
m(::,1)
m(1,::)
m(1,::) := linspace(1.0,2.0,4)
m

val n = m.t
n
val o = m*n
o
val p = n*m
p

// regression and QR solution

val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
val b0 = linspace(1.0,2.0,5)
val y0 = X * b0
val y = y0 + DenseVector(gau.sample(1000).toArray)
val b = X \ y  // defaults to a QR-solve of the least squares problem

// a simple function example

def mean(arr: Array[Int]): Double = {
  arr.sum.toDouble/arr.length
}

mean(Array(3,1,4,5))