There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.
A feature wish list
- be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
- be free, open-source and platform independent
- be fast and efficient
- have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
- have a strong type system, and be statically typed with good compile-time type checking and type safety
- have reasonable type inference
- have a REPL for interactive use
- have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
- have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
- allow imperative programming for those (rare) occasions where it makes sense
- be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications
The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:
- have excellent data viz capability built-in
- have vast numbers of statistical routines in the standard library
The above are points (and there are other similar points) where other languages (for example, R) currently score better than Scala. It is not that these things are not important – indeed, they are highly desirable. But I consider them to be of lesser importance, as they are much easier to fix, given a suitable platform, than an unsuitable language and platform. Visualisation is not trivial, but it is not fantastically difficult in a language with excellent GUI libraries. Similarly, most statistical routines are quite straightforward to implement for anyone with reasonable expertise in scientific and statistical computing and numerical linear algebra. These are things that are relatively easy for a community to contribute to. Building a great programming language, on the other hand, is really, really, difficult.
I will now expand briefly on each point in turn.
be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
History has demonstrated, time and time again, that domain specific languages (DSLs) are synonymous with idiosyncratic, inconsistent languages that are terrible for anything other than what they were specifically designed for. They can often be great for precisely the thing that they were designed for, but people always want to do other things, and that is when the problems start. For the avoidance of controversy I won’t go into details, but the whole Python versus R thing is a perfect illustration of this general versus specific trade-off. Similarly, although there has been some buzz around another new language recently, which is faster than R and Python, my feeling is that the last thing the world needs right now is Just Unother Language for Indexed Arrays…
In this day and age it is vital that statistical code can use a variety of libraries and communicate with well-designed network libraries and web frameworks, as statistical analysis does not exist in a vacuum. Scala certainly fits the bill here, being used in a large number of important high-profile systems, ensuring a lively, well-motivated ecosystem. There are numerous well-maintained libraries for almost any task. Web frameworks are a case in point: there are a number of excellent options, including Lift and Play. Scala also has the advantage of offering seamless Java integration, for those (increasingly rare) occasions when a native Scala library for the task at hand doesn’t exist.
be free, open-source and platform independent
This hardly needs expanding upon, other than to observe that there are a few well-known commercial software solutions for scientific, statistical and mathematical computing. There are all kinds of problems with using closed proprietary systems, including transparency and reproducibility, but also platform and scalability problems, e.g. running code requiring a license server in the cloud. The academic statistical community has largely moved away from commercial software, and I don’t think there is any going back. Scala is open source and runs on the JVM, which is about as platform independent as it is possible to get.
be fast and efficient
Speed and efficiency continue to be important, despite increasing processor speeds. Computationally intensive algorithms are being pushed to ever larger and more complex models and data sets. Compute cycles and memory efficiency really matter, and can’t be ignored. This doesn’t mean that we all have to code in C/C++/Fortran, but we can’t afford to code in languages which are orders of magnitude slower. This will always be a problem. Scala code generally runs well within a factor of 2 of comparable native code – see my Gibbs sampler post for a simple example including timings.
have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
I hesitated about including this in my list of essentials, because it is certainly something that can, in principle, be added to a language at a later date. However, building such libraries is far from trivial, and they need to be well-designed, comprehensive and efficient. For Scala, Breeze is rapidly becoming the standard scientific library, including special functions, non-uniform random number generation and numerical linear algebra. For a data library, there is Saddle, and for a scalable analytics library there is Spark. These libraries certainly don’t cover everything that can be found in R/CRAN, but they provide a fairly solid foundation on which to build.
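To give a flavour of the sort of routine such a library must provide, here is a hedged plain-Scala sketch of Gaussian random number generation via the Box–Muller transform. In practice Breeze supplies this ready-made (in breeze.stats.distributions), so the function name gaussian below is purely illustrative:

```scala
import scala.util.Random

// Box-Muller transform: turns two independent uniforms into a
// standard Gaussian variate -- the kind of routine that Breeze
// provides ready-made via breeze.stats.distributions
def gaussian(rng: Random): Double = {
  val u1 = 1.0 - rng.nextDouble() // in (0, 1], so the log is safe
  val u2 = rng.nextDouble()
  math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.Pi * u2)
}

val rng = new Random(42)
val sample = Vector.fill(10000)(gaussian(rng))
val mean = sample.sum / sample.size
println(f"Sample mean: $mean%.4f") // should be close to 0
```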
have a strong type system, and be statically typed with good compile-time type checking and type safety
I love dynamic languages – they are fun and exciting. It is fun to quickly throw together a few functions in a scripting language without worrying about declaring the types of anything. And it is exciting to see the myriad of strange and unanticipated ways your code can crash-and-burn at runtime! 😉 But this excitement soon wears off, and you end up adding lots of boilerplate argument checking code that would not only be much cleaner and simpler in a statically typed language, but would be checked at compile-time, making the static code faster and more efficient. For messing about prototyping, dynamic languages are attractive, but as a solid platform for statistical computing, they really don’t make sense. Scala has a strong type system offering a high degree of compile-time checking, making it a safe and efficient language.
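As a small illustration of what compile-time checking buys you, consider this hypothetical log-density function; the commented-out call would be rejected by the compiler rather than crashing at runtime:

```scala
// The argument and return types are fixed and checked at compile
// time -- calling this with a String simply will not compile
def logDensity(x: Double, mean: Double, sd: Double): Double = {
  val z = (x - mean) / sd
  -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.Pi)
}

println(logDensity(1.0, 0.0, 1.0))
// logDensity("1.0", 0.0, 1.0)  // compile-time error: type mismatch
```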
have reasonable type inference
A common issue with statically typed languages is that they lead to verbose code containing many redundant type declarations that the compiler ought to be able to work out for itself. This doesn’t just mean more typing – it leads to verbose code that can hide the program logic. Languages with type inference offer the best of both worlds – the safety of static typing without the verbosity. Scala does a satisfactory job here.
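For example (a trivial sketch), none of the values below carries a type annotation, yet each has a precise static type inferred by the compiler:

```scala
// No type annotations, yet everything is fully statically typed
val xs = List(1.0, 2.5, 4.0)           // inferred: List[Double]
val pairs = xs.map(x => (x, x * x))    // inferred: List[(Double, Double)]
val total = pairs.map(_._2).sum        // inferred: Double
println(total)
```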
have a REPL for interactive use
One thing that dynamic languages have taught us is that it is actually incredibly useful to have a REPL for interactive analysis. This is true generally, but especially so for statistical computing, where human intervention is often desirable. Again, Scala has a nice REPL.
have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
Tools matter. Scala has an excellent build tool in sbt. It has code documentation in the form of scaladoc (similar to javadoc). It has unit testing frameworks (such as ScalaTest), and a reasonably intelligent IDE in the form of the Scala IDE (based on Eclipse).
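For the curious, a minimal build.sbt might look something like the following; the project name and version numbers here are illustrative assumptions, not recommendations:

```scala
// build.sbt -- a minimal sbt build definition (versions illustrative)
name := "stats-project"

scalaVersion := "2.11.2"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.10",
  "org.scalatest" %% "scalatest" % "2.2.1" % "test"
)
```

With this file in the project directory, `sbt console` drops you into a REPL with the dependencies on the classpath.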
have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
I, like many others, am gradually coming to realise that functional programming offers many advantages over other programming styles. In particular, it provides the best route to building scalable software, in terms of both program complexity and data size/complexity. Scala has good support for functional programming, including immutable named values, immutable data structures and for-comprehensions. And if off-the-shelf Scala isn’t sufficiently functional already, libraries such as scalaz make it even more so.
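As a small, hypothetical illustration of monadic design with immutable values, here is failure-safe parsing composed with a for-comprehension over Option:

```scala
// Monadic composition with Option: failures propagate automatically,
// with no exceptions thrown and no mutable state
def parse(s: String): Option[Double] =
  try { Some(s.toDouble) } catch { case _: NumberFormatException => None }

val ratio = for {
  num <- parse("3.0")
  den <- parse("2.0")
  if den != 0.0
} yield num / den

println(ratio)         // Some(1.5)
println(parse("oops")) // None
```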
allow imperative programming for those (rare) occasions where it makes sense
Although most algorithms in scientific computing are typically conceived of and implemented in an imperative style, I’m increasingly convinced that most can be recast in a pure functional way without significant loss of efficiency, and with significant benefits. That said, there really are some problems that are more efficient to implement in an imperative framework. It is therefore important that the language is not so “pure” functional that this is forbidden. Again, Scala fits the bill.
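To illustrate, with a deliberately trivial, hypothetical example, both of the following are perfectly idiomatic Scala:

```scala
// An imperative running sum using a mutable var and a while loop --
// occasionally clearer or faster, and perfectly legal Scala
def sumImperative(xs: Array[Double]): Double = {
  var total = 0.0
  var i = 0
  while (i < xs.length) {
    total += xs(i)
    i += 1
  }
  total
}

// The equivalent functional one-liner
def sumFunctional(xs: Array[Double]): Double = xs.sum

val data = Array(1.0, 2.0, 3.5)
println(sumImperative(data)) // 6.5
println(sumFunctional(data)) // 6.5
```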
be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications
These days scalability typically means exploiting concurrency and parallelism. In an imperative world this is hard, and libraries such as MPI prove that it is difficult to bolt parallelism on to a language post hoc. Checkpoints, communication overhead, deadlocks and race conditions make it very difficult to build code that scales well to more than a few processors. Concurrency is more straightforward in functional languages, and this is one of the reasons for the recent resurgence of functional languages and programming. Scala has good concurrency support built in, and libraries such as Akka make it relatively easy to build truly scalable software.
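As a minimal sketch of what this looks like using only the standard library's Futures (the function slowSquare is hypothetical):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two independent computations launched concurrently, then combined --
// no locks and no shared mutable state in sight
def slowSquare(x: Int): Int = { Thread.sleep(100); x * x }

val fa = Future(slowSquare(3)) // starts running immediately
val fb = Future(slowSquare(4)) // runs concurrently with fa
val combined = for { a <- fa; b <- fb } yield a + b

println(Await.result(combined, 5.seconds)) // 25
```

Akka builds on the same style of asynchronous composition, scaling it up to distributed actor systems.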
The Scala programming language ticks many boxes when it comes to forming a nice solid foundation for building a platform for efficient scalable statistical computing. Although I still use R and Python almost every day, I’m increasingly using Scala for serious algorithm development. In the short term I can interface to my Scala code from R using jvmr, but in the longer term I hope that Scala will become a complete framework for statistics and data science. In a subsequent post I will attempt to give a very brief introduction to Scala and the Breeze numerical library.
18 thoughts on “Scala as a platform for statistical computing and data science”
Nice article. I’m surprised that you didn’t mention F#. It’s similar to Scala. It’s a great all around tool for data analysis.
Check out my blog
True, F#’s code is free, but its runtime environment is not (ignoring Mono, as it’s so immature). It therefore breaks the principle of platform independence, which in turn breaks a billion other principles, because Windows is such a terrible operating system for being productive and doing Big Data.
You simply cannot be productive in an environment that doesn’t have a native and virtually perfect command-line shell – i.e. bash (ignoring Cygwin, which sucks; PowerShell, or rather VerbosityShell, is even worse). Furthermore, you simply cannot build Big Data products on an OS that isn’t performant enough to do so. This is why high-frequency trading and supercomputers almost exclusively run Linux, why high-availability middleware for emergency services runs on either Linux or Unix, and why big tech companies like Google, Facebook and Twitter all use Linux, not Windows. Windows is an OS for secretaries, accountants and simple low-velocity websites, not serious computing.
Another point is that Scala is interoperable with the rest of the Big Data world, which is primarily written in Java – i.e. HDFS, NoSQL databases, Kafka (which is written in Scala), etc. Connecting F# up to HDFS and Kafka is not straightforward, but it is in Scala.
Furthermore, F# doesn’t have Spark! (And MBrace is not even close to the awesomeness of Spark.)
Conclusion: one can build end-to-end data science products with Scala; this is simply not possible in F#.
F# is Microsoft proprietary.
F# has been open sourced for some time.
Very interesting post, thank you.
In my opinion, the most successful example of scientific computing is the NumPy/SciPy platform. It is built on a single overarching data structure (ndarray) and has, over many years, built up a formidable platform of libraries across many fields (http://www.lfd.uci.edu/~gohlke/pythonlibs/). It can be linked to fast, optimised native code in Intel MKL, and so runs at speeds close to C or C++. With upcoming library releases it will get even better, with more sophisticated data types (Blaze) and JIT and GPU integration (Numba). For Scala to succeed in this field, it will simply need to do all of this and more.
The first condition for this to happen is for the underlying data type, as well as fast linear algebra libraries, to be available – i.e. the NumPy analogue for Scala. I looked hard for such a solution but did not find anything as well maintained or as fast (Intel MKL integration is crucial). BIDMat (https://github.com/jcanny/BIDMat) is promising (it supports MKL and CUDA) but is rather poorly documented and is developed by a single lab. I hope it gets more traction eventually.
I hope to use Scala for scientific computing as well but much work needs to be done before this is possible.
Julia has been designed for this from the ground up. I think it’s going to have a tough time breaking through against Python/NumPy, compiled Python, C, Fortran, Java and R.
So Scala will have an even tougher time.
Scala is really a nice language, and I would love to see it become a platform like Java in the near future.
“Building a great programming language, on the other hand, is really, really, difficult.”
That is a fantastic point. One can easily and quickly build more libraries, but one cannot easily and quickly fix interoperability with Java, or fix type safety, etc. After developing in Scala for two years, I’ve got to the point where I actually find it quicker to implement a machine learning utility from scratch than to read the documentation for an existing one. Of course reinventing the wheel is bad practice, but the point is that Scala is such an awesomely powerful language that the absence of a utility really doesn’t matter.
Nice Article. If you ever get a spare moment, it might be of interest to readers to hear how you set up a typical working environment for scala (with dependencies like Breeze, Saddle etc). For example, do you set up separate build projects like this (http://scalatutorials.com/beginner/2013/07/18/getting-started-with-sbt) for *every* single set of examples and projects you work on? This kind of granular dependency control makes sense when developing production tools but seems cumbersome for smaller statistical explorations and examples. Apologies if you’ve already posted this elsewhere. I had a look and didn’t see anything related to these kind of details.
Actually, I do discuss this to some extent in the slides and code examples for my talk at the RSS meeting – https://github.com/darrenjw/statslang-scala – discussed in https://darrenjw.wordpress.com/2014/11/22/statistical-computing-languages-at-the-rss/
Although it is obviously a matter of personal taste how one does this sort of thing, I generally create a new directory for each project (dataset) I work on, and I don’t find it particularly cumbersome to copy a sbt file into the directory. The advantage of this is that it fixes the versions of all dependencies.
If it is a small exploratory project, you can do everything in the same directory without creating a big directory tree (like my “monte-carlo” example in the above repo). If it’s a bigger project, following the standard directory structure is a relatively small setup overhead (like my “regression” example), and works better with IDEs such as the ScalaIDE (based on Eclipse). I use the sbt “eclipse” plugin to generate project files for the ScalaIDE.
I like this blog post about Scala for statistical computing. I am about to offer a Spark-based scalable data science course in New Zealand.
Your blog is a nice resource for introducing ‘why Scala for data science and computational stats?’.