Introduction
This is the first book review I’ve done on this blog, and I don’t intend to make it a regular feature, but I ordered a copy of “Parallel R” a few days ago. It arrived today, and I’m quite disappointed with it, so I wanted to write a quick review to provide some additional information for people thinking of buying a copy. Just to be clear, the book is “Parallel R” by McCallum and Weston, published by O’Reilly.
I generally like O’Reilly books, and own a number of them. I use R a lot, I am very interested in parallel computing (traditionally using C and MPI), and I’ve dabbled a little with some parallel stuff in R, but don’t consider myself to be an expert. In other words, people like me are probably the target audience for this book, and sure enough I have handed over some of my hard-earned cash to buy a copy.
The main problem with the book is that it just doesn’t feel finished. It seems as though the authors have rushed the text out as quickly as possible and published it without any kind of critical review or reflection. It is very short, with just 108 pages of main text, and, most annoyingly, it doesn’t have an index. This wouldn’t be so much of a problem for an electronic version, but selling a technical computing book in dead tree format without an index is really unforgivable. All of my other O’Reilly books contain a decent index, so I’m just baffled as to why this one doesn’t. It really feels like the first draft of a manuscript that you would circulate to a few friends and colleagues for comments and suggested improvements. There is the kernel of a decent book here, but most of the current material will be obsolete before a second edition could be put together and published, so the second edition will have to be a complete re-write.
Chapter by chapter
Chapter 1 – Getting started – 5 pages
A brief introduction to the rest of the book, with a pointer to the companion website, parallelrbook.com (at the time of writing, it is empty…).
Chapter 2 – snow – 30 pages
This chapter is the most substantial, and arguably the best, chapter of the book. It provides a very reasonable introduction to the snow (“Simple Network of Workstations”) package, for running embarrassingly parallel jobs on a cluster.
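For anyone who hasn’t seen snow before, an embarrassingly parallel job of the kind this chapter deals with looks roughly like the sketch below. It is an illustration only, not an example taken from the book; it assumes the snow package is installed, and the two local socket workers and the toy function slow_square are made up for the purpose.

```r
# A minimal sketch, assuming the snow package is installed.
library(snow)

# Hypothetical toy task: each call is independent of the others,
# which is what makes the job embarrassingly parallel.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Start two worker processes on the local machine using sockets.
cl <- makeCluster(2, type = "SOCK")

# Apply the function to each input, spreading the calls over the workers.
res <- parSapply(cl, 1:4, slow_square)

# Always shut the workers down when finished.
stopCluster(cl)

print(res)
```

On a real cluster the first argument to makeCluster would typically be a vector of host names rather than a worker count.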
Chapter 3 – multicore – 13 pages
This chapter provides a very brief and superficial introduction to the multicore package, for exploiting modern multicore hardware. It covers the high-level API (mclapply, pvec, parallel, collect), but discussion of the low-level API is almost non-existent (one example function is given which uses some low-level calls). Also, there is no discussion here, or anywhere else, of the foreach package/function, or the doMC back-end. Unfortunately I couldn’t verify this by checking the index (see above), but as there are only 100 or so pages, it didn’t take that long to flick through them all to double-check… Now, I can understand that the book can’t cover every obscure parallel package for R, but foreach/doMC?! Missing from a book called “Parallel R”? Seriously? It is all the more weird as one of the authors (Weston) is an author of foreach/doMC. Go figure…
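To give an idea of what is missing, the basic foreach/doMC idiom is only a few lines. The following is a minimal sketch rather than anything from the book; it assumes the foreach and doMC packages are installed on a Unix-alike system (doMC relies on forking), and the choice of two cores and the toy loop body are arbitrary.

```r
# A minimal sketch, assuming foreach and doMC are installed (Unix-alike only).
library(foreach)
library(doMC)

# Register the multicore back-end; 2 cores is an arbitrary choice here.
registerDoMC(cores = 2)

# %dopar% hands each iteration to the registered back-end, and
# .combine = c concatenates the per-iteration results into one vector.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

print(squares)
```

Swapping %dopar% for %do% runs the same loop sequentially, which is handy for debugging.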
Chapter 4 – parallel – 8 pages
This chapter provides an even briefer introduction to the new parallel package, introduced in R 2.14. It should be noted that the book went to press before 2.14 was frozen for release, but the content that is there looked OK on the basis of a very quick skim. But at 8 pages, don’t expect too much.
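The parallel package builds on both multicore and snow, so the two styles from the previous chapters can be used side by side from a single package. The sketch below is an illustration under stated assumptions and not an example from the book: it assumes a Unix-alike system (mclapply with more than one core is not supported on Windows), and the worker count and the toy function slow_square are made up.

```r
# A minimal sketch using the parallel package that ships with R (2.14 onwards).
library(parallel)

# Hypothetical toy task; each call is independent of the others.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# multicore style: fork the current R process (Unix-alike systems only).
res_fork <- mclapply(1:4, slow_square, mc.cores = 2)

# snow style: start separate worker processes and send them the work.
cl <- makeCluster(2)
res_sock <- parLapply(cl, 1:4, slow_square)
stopCluster(cl)

# Both approaches return a list with one element per input.
print(unlist(res_fork))
print(unlist(res_sock))
```

The forking version needs no explicit cluster set-up, while the socket version also works on Windows and across machines.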
Chapter 5 – A primer on MapReduce and Hadoop – 8 pages
A very brief introduction to the ideas behind MapReduce and Hadoop. Not actually anything to do with R, but necessary for the next chapter.
Chapter 6 – R+Hadoop – 18 pages
I’ve not worked through this chapter in detail, but it looks like a reasonable “getting started guide” for using Hadoop with R.
Chapter 7 – RHIPE – 16 pages
Again, I’ve not studied this chapter carefully, but it seems like a reasonable introduction to the RHIPE package (it is a package to make it simpler to use R with Hadoop, by hiding Hadoop stuff from the R user).
Chapter 8 – Segue – 6 pages
A very brief introduction to the segue package, which enables running jobs on Amazon’s Elastic MapReduce service.
Chapter 9 – New and upcoming – 2 pages
A very brief mention of some of the things that weren’t covered in the book… foreach is mentioned here, but no example is given.
Conclusion
For people who have absolutely no idea about parallel computing in R, or about what different options are available, this book does provide a useful overview, together with some simple examples that illustrate the ideas and that readers can try out for themselves. However, it is generally very brief and superficial, there are some gaping holes, and much of the material will become obsolete very quickly. It is a shame that there is not more discussion of low-level functions, or of parallel computing in anything other than a simple embarrassingly parallel context. Admittedly, if your job isn’t embarrassingly parallel, you probably don’t want to use R anyway, but some discussion would still have been nice. And did I mention that there is no index?! I did toy briefly with the idea of sending it back, but I’m not going to. To be fair, there is quite a bit of useful information in the book, and I’d like to work through the Hadoop chapters at some point. So in summary, it’s OK, but don’t expect to love it.