Scala for Data Science [book review]

This post reviews the book Scala for Data Science.

Disclaimer: This book review has not been solicited by the publisher (or anyone else) in any way. I purchased the review copy of this book myself. I have not received any benefit from the writing of this review.


On this blog I previously reviewed the (terrible) book, Scala for Machine Learning, by the same publisher. I was therefore rather wary of buying this book. But the topic coverage looked good, so I decided to buy it, and wasn’t disappointed. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala.


The book assumes a basic familiarity with programming in Scala, at around the level of someone who has completed the Functional Programming Principles in Scala Coursera course. That is, it (quite sensibly) doesn’t attempt to teach the reader how to program in Scala, but rather how to approach the development of data science applications using Scala. It introduces more advanced Scala idioms gradually (eg. typeclasses don’t appear until Chapter 5), so it is relatively approachable for those who aren’t yet Scala experts. The book does cover Apache Spark, but Spark isn’t introduced until Chapter 10, so it isn’t “just another Spark book”. Most of the book is about developing data science applications in Scala, completely independently of Spark. That said, it also provides one of the better introductions to Spark, so doubles up as a pretty good introductory Spark book, in addition to being a good introduction to the development of data science applications with Scala. It should probably be emphasised that the book is very much focused on data science, rather than statistical computing, but there is plenty of material of relevance to those who are more interested in statistical computing than applied data science.

Chapter by chapter

  1. Scala and Data Science – motivation for using Scala in preference to certain other languages I could mention…
  2. Manipulating data with Breeze – Breeze is the standard Scala library for scientific and statistical computing. It’s pretty good, but documentation is rather lacking. This Chapter provides a good tutorial introduction to Breeze, which should be enough to get people going sufficiently to be able to make some sense of the available on-line documentation.
  3. Plotting with breeze-viz – Breeze has some support for plotting and visualisation of data. It’s somewhat limited when compared to what is available in R, but is fine for interactive exploratory analysis. However, the available on-line documentation for breeze-viz is almost non-existent. This Chapter is the best introduction to breeze-viz that I have seen.
  4. Parallel collections and futures – the Scala standard library has built-in support for parallel and concurrent programming based on functional programming concepts such as parallel (monadic) collections and Futures. Again, this Chapter provides an excellent introduction to these powerful concepts, allowing the reader to start developing parallel algorithms for multi-core hardware with minimal fuss.
  5. Scala and SQL through JDBC – this Chapter looks at connecting to databases using standard JVM mechanisms such as JDBC. However, it gradually introduces more functional ways of interfacing with databases using typeclasses, motivating:
  6. Slick – a functional interface for SQL – an introduction to the Slick library for a more Scala-esque way of database interfacing.
  7. Web APIs – the practicalities of talking to web APIs, eg. authenticated HTTP requests and parsing of JSON responses.
  8. Scala and MongoDB – working with a NoSQL database from Scala
  9. Concurrency with Akka – Akka is the canonical implementation of the actor model in Scala, for building large concurrent applications. It is the foundation on which Spark is built.
  10. Distributed batch processing with Spark – a tutorial introduction to Apache Spark. Spark is a big data analytics framework built on top of Scala and Akka. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. Indeed, Spark is driving some of the interest in Scala.
  11. Spark SQL and DataFrames – interfacing with databases using Spark, and more importantly, an introduction to Spark’s DataFrame abstraction, which is now fundamental to developing machine learning pipelines in Spark.
  12. Distributed machine learning with MLlib – MLlib is the machine learning library for Spark. It is worth emphasising that unlike many early books on Spark, this chapter covers the newer DataFrame-based pipeline API, in addition to the original RDD-based API. Together, Chapters 10, 11 and 12 provide a pretty good tutorial introduction to Spark. After working through these, it should be easy to engage with the official on-line Spark documentation.
  13. Web APIs with Play – developing a web API at the end of a data science pipeline.
  14. Visualisation with D3 and the Play framework – integrating visualisation into a data science web application.
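
To give a flavour of the kind of material covered in Chapter 4, the following is a minimal sketch of Futures from the Scala standard library. It is not taken from the book: the names FutureSketch, sumOfSquares and parallelSumOfSquares are hypothetical, and the chunked sum of squares is just a stand-in for a more expensive computation.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // Hypothetical stand-in for an expensive computation on one chunk of data.
  def sumOfSquares(chunk: Seq[Double]): Double = chunk.map(x => x * x).sum

  // Split the data into chunks, start one Future per chunk,
  // then combine the partial results once all of them have completed.
  def parallelSumOfSquares(data: Seq[Double], chunkSize: Int): Double = {
    val partials: Vector[Future[Double]] =
      data.grouped(chunkSize).toVector.map(chunk => Future(sumOfSquares(chunk)))
    val total: Future[Double] = Future.sequence(partials).map(_.sum)
    // Blocking here keeps the sketch simple; real applications usually
    // keep composing Futures rather than waiting on them.
    Await.result(total, 10.seconds)
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).map(_.toDouble)
    println(parallelSumOfSquares(data, chunkSize = 100))
  }
}
```

Each chunk is processed on the global execution context, so on multi-core hardware the chunks can run concurrently without any explicit thread management.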


This book provides a good tutorial introduction to a large number of topics relevant to statisticians and data scientists interested in developing data science applications using Scala. After working through this book, readers should be well-placed to augment their knowledge with readily searchable on-line documentation.

In a follow-up post I will give a quick overview of some other books relevant to getting started with Scala for statistical computing and data science.


Scala for Machine Learning [book review]

Full disclosure: I received a free electronic version of this book from the publisher for the purposes of review.

There is clearly a market for a good book about using Scala for statistical computing, machine learning and data science. So when the publisher of “Scala for Machine Learning” offered me a copy for review purposes, I eagerly accepted. Three months later, I have eventually forced myself to read through the whole book, but I was very disappointed. It is important to be clear that I’m not just disappointed because I personally didn’t get much from the book – I am not really the target audience. I am disappointed because I struggle to envisage any audience that will benefit greatly from reading this book.

There are several potential audiences for a book with this title: eg. people with little knowledge of Scala or machine learning (ML), people with knowledge of Scala but not ML, people with knowledge of ML but not Scala, and people with knowledge of both. I think there is scope for a book targeting any of those audiences. Personally, I fall in the last category. The author claims to be aiming primarily at those who know Scala but not ML. This is sensible in that the book assumes a good working knowledge of Scala, and uses advanced features of the Scala language without any explanation: this book is certainly not appropriate for people hoping to learn about Scala in the context of ML. However, it is also a problem, as this would probably be the worst book I have ever encountered for learning about ML from scratch, and there are a lot of poor books about ML! The book just picks ML algorithms out of thin air without any proper explanation or justification, and blindly applies them to tedious financial data sets irrespective of whether or not it would be in any way appropriate to do so. It presents ML as an incoherent “bag of tricks” to be used indiscriminately on any data of the correct “shape”. It is by no means the only ML book to take such an approach, but there are many much better books which don’t.

The author also claims that the book will be useful to people who know ML but not Scala, but as previously explained, I do not think that this is the case (eg. monadic traits appear on the fifth page, without proper explanation, and containing typos). I think that the only audience that could potentially benefit from this book would be people who know some Scala and some ML and want to see some practical examples of real world implementations of ML algorithms in Scala. I think those people will also be disappointed, for reasons outlined below.

The first problem with the book is that it is just full of errors and typos. It doesn’t really matter to me that essentially all of the equations in the first chapter are wrong – I already know the difference between an expectation and a sample mean, and know Bayes theorem – so I can just see that they are wrong, correct them, and move on. But for the intended audience it would be a complete nightmare. I wonder about the quality of copy-editing and technical review that this book received – it is really not of “publishable” quality. All of the descriptions of statistical/ML methods and algorithms are incredibly superficial, and usually contain factual errors or typos. One should not attempt to learn ML by reading this book. So the only hope for this book is that the Scala implementations of ML algorithms are useful and insightful. Again, I was disappointed.

For reasons that are not adequately explained or justified, the author decides to use a combination of plain Scala interfaced to legacy Java libraries (especially Apache Commons Math) for all of the example implementations. In addition, the author is curiously obsessed with an F# style pipe operator, which doesn’t seem to bring much practical benefit. Consequently, all of the code looks like a strange and rather inelegant combination of Java, Scala, C++, and F#, with a hint of Haskell, and really doesn’t look like clean idiomatic Scala code at all. For me this was the biggest disappointment of all – I really wouldn’t want any of this code in my own Scala code base (though the licensing restrictions on the code probably forbid this, anyway). It is a real shame that Scala libraries such as Breeze were not used for all of the examples – this would have led to much cleaner and more idiomatic Scala code, which could have really taken proper advantage of the functional power of the Scala language. As it is, advanced Scala features were used without much visible pay-off. Reading this book one could easily get the (incorrect) impression that Scala is an unnecessarily complex language which doesn’t offer much advantage over Java for implementing ML algorithms.
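
For readers who haven’t met it, an F# style pipe operator simply passes a value to a function, so that a chain of transformations reads left to right. The following is a minimal sketch of the idea, not the book’s implementation: the names PipeSketch, Pipe, mean, rmsPiped and rmsDirect are all hypothetical, chosen purely for illustration.

```scala
object PipeSketch {
  // A minimal F#-style pipe: `x |> f` is just `f(x)`.
  implicit class Pipe[A](private val value: A) extends AnyVal {
    def |>[B](f: A => B): B = f(value)
  }

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Root mean square written with the pipe operator.
  def rmsPiped(xs: Seq[Double]): Double =
    xs |> (_.map(x => x * x)) |> mean |> math.sqrt

  // The same computation using ordinary method calls and function application.
  def rmsDirect(xs: Seq[Double]): Double =
    math.sqrt(mean(xs.map(x => x * x)))

  def main(args: Array[String]): Unit = {
    val xs = Seq(3.0, 4.0)
    println(rmsPiped(xs) == rmsDirect(xs))
  }
}
```

Since Scala collections already chain naturally through method calls, the piped version offers little over the direct one in a case like this.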

On the positive side, the book consists of nearly 500 pages of text, covering a wide range of ML algorithms and examples, and has a zip file of associated code containing the implementation and examples, which builds using sbt. If anyone is interested in seeing examples of ML algorithms implemented in Scala using Java rather than Scala libraries, together with an F# pipe operator, then there is definitely something of substance here.


It should be clear from the above review that I think there is still a gap in the market for a good book about using Scala for statistical computing, machine learning and data science. Hopefully someone will fill this gap soon. In the meantime it is necessary to learn about Scala and ML separately, and to put the ideas together yourself. This isn’t so difficult, as there are many good resources and code repositories to help. For learning about ML, I would recommend starting off with ISLR, which uses R for the examples (but if you work in data science, you need to know R anyway). Once the basic concepts are understood, one can move on to a serious text, such as Machine Learning (which has associated Matlab code). Converting algorithms from R or Matlab to Scala (plus Breeze) is generally very straightforward, if you know Scala. For learning Scala, there are many on-line resources. If you want books, I recommend Functional Programming in Scala and Programming in Scala, 2e. Once you know about Scala, learn about scientific computing using Scala by figuring out Breeze. At some point you will probably also want to know about Spark, and there are now books on this becoming available – I’ve just got a copy of Learning Spark, which looks OK.

Review of “Parallel R” by McCallum and Weston


This is the first book review I’ve done on this blog, and I don’t intend to make it a regular feature, but I ordered a copy of “Parallel R” a few days ago. It arrived today, and I’m quite disappointed with it, so I wanted to write a quick review to provide some additional information for people thinking of buying a copy. Just to be clear, the book is:

  • McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.

    I generally like O’Reilly books, and own a number of them. I use R a lot, I am very interested in parallel computing (traditionally using C and MPI), and I’ve dabbled a little with some parallel stuff in R, but don’t consider myself to be an expert. In other words, people like me are probably the target audience for this book, and sure enough I have handed over some of my hard-earned cash to buy a copy.

    The main problem with the book is that it just doesn’t feel finished. It seems as though the authors have rushed the text as quickly as possible and published it without any kind of critical review or reflection. It is very short – just 108 pages of the main text, and most annoyingly, doesn’t have an index. This wouldn’t be so much of a problem for an electronic version, but selling a technical computing book in dead tree format without an index is really unforgivable. All of my other O’Reilly books contain a decent index, so I’m just baffled as to why this one doesn’t. It really feels like this is the first draft of a manuscript that you would circulate to a few friends and colleagues for comments and suggested improvements. There is the kernel of a decent book here, but most of the current material will be obsolete before a second edition could be put together and published, so the second edition will have to be a complete re-write.

    Chapter by chapter

    Chapter 1 – Getting started – 5 pages

    A brief introduction to the rest of the book, with a pointer to the companion website (which, at the time of writing, is empty…).

    Chapter 2 – snow – 30 pages

    This chapter is the most substantial, and arguably the best, chapter of the book. It provides a very reasonable introduction to the snow package for a simple network of workstations, for running embarrassingly parallel jobs on a cluster.

    Chapter 3 – multicore – 13 pages

    This chapter provides a very brief and superficial introduction to the multicore package, for exploiting modern multicore hardware. It provides a very brief introduction to the high level API (mclapply, pvec, parallel, collect). Discussion of the low-level API is almost non-existent (an example function is given which uses some low-level calls). Also, there is no discussion here, or anywhere else, of the foreach package/function, or the doMC back-end. Unfortunately I couldn’t verify this by checking the index (see above), but as there are only 100 or so pages, it didn’t take that long to flick through them all to double-check… Now I can understand that the book will not cover all obscure parallel packages for R, but foreach/doMC?! Missing from a book called “Parallel R”? Seriously? It is all the more weird as one of the authors (Weston) is an author of foreach/doMC. Go figure…

    Chapter 4 – parallel – 8 pages

    This chapter provides an even briefer introduction to the new parallel package for R 2.14. It should be noted that the book went to press before 2.14 was frozen for release, but the content that was there looked OK on the basis of a very quick skim. But at 8 pages, don’t expect too much.

    Chapter 5 – A primer on MapReduce and Hadoop – 8 pages

    A very brief introduction to the ideas behind MapReduce and Hadoop. Not actually anything to do with R, but necessary for the next chapter.

    Chapter 6 – R+Hadoop – 18 pages

    I’ve not worked through this chapter in detail, but it looks like a reasonable “getting started guide” for using Hadoop with R.

    Chapter 7 – RHIPE – 16 pages

    Again, I’ve not studied this chapter carefully, but it seems like a reasonable introduction to the RHIPE package (it is a package to make it simpler to use R with Hadoop, by hiding Hadoop stuff from the R user).

    Chapter 8 – Segue – 6 pages

    A very brief introduction to the segue package, which enables running jobs on Amazon’s Elastic MapReduce service.

    Chapter 9 – New and upcoming – 2 pages

    A very brief mention of some of the things that weren’t covered in the book… foreach is mentioned here, but no example is given.


    For people who have absolutely no idea about parallel computing in R, or about what different options are available, this book does provide a useful overview, together with some simple examples that illustrate the ideas and that readers can try out for themselves. However, it is generally very brief and superficial, there are some gaping holes, and much of the material will become obsolete very quickly. It is a shame that there is not more discussion of low level functions, or of parallel computing in anything other than a simple embarrassingly parallel context. Admittedly, if your job isn’t embarrassingly parallel, you probably don’t want to use R anyway, but some discussion would still have been nice. And did I mention that there is no index?! I did toy briefly with the idea of sending it back, but I’m not going to. To be fair, there is quite a bit of useful information in the book, and I’d like to work through the Hadoop chapters at some point. So in summary, it’s OK, but don’t expect to love it.

