Statistics for Big Data

Doctoral programme in cloud computing for big data

I’ve spent much of this year working to establish our new EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, which partly explains the lack of posts on this blog in recent months. The CDT is now established, with 11 students in the first cohort, and we have begun recruiting for the second cohort, to start in September 2015. We admit roughly equal numbers of students from a Computing Science and a Mathematics/Statistics background, and provide an intensive programme of inter-disciplinary training in the first (of four) years.

After initial induction and cohort team-building events, the programme begins with 8 weeks of intensive bespoke training developed especially for the CDT students. For the first two weeks the cohort is split into two streams for an initial crash course. Students from an M/S background receive basic training in CS and programming, with an emphasis on object-oriented programming in Java, while students from a CS background get basic training in M/S, emphasising elementary probability and statistics and basic linear algebra. From the start of the third week the entire cohort is trained together. This whole-cohort training begins with two 6-week courses running in parallel. One is Programming for big data, concentrating on R programming, Java, databases, and software development tools and techniques. The other is Statistics for big data, a course that I developed and taught, and the main topic of this post. After the initial 8 weeks of specialist training, the students next take courses on Cloud Computing and Machine learning, followed by Big data analytics and Time series data, and finally courses on Research skills and Professional skills, which run alongside a Group project.

Statistics for big data

I have found it an interesting challenge to try to put together a 15-credit course (150 hours of student effort) that starts from a basic level of statistical knowledge and covers most of the important concepts and methods necessary for modelling and analysis of large and complex data sets. Although the course was not about Big Data per se, it emphasised scalability in general, and exploitation of linearity in particular. Given the mixed levels of mathematical sophistication of the cohort, I felt it important to start off with a practical, non-technical introduction. For this I covered the first 6 chapters of Introduction to Statistical Learning with R. This provided the students with a basic grounding in statistical modelling ideas, and practical skills in working with data in R. The book is excellent as a non-technical introduction, but is missing the underpinning mathematical and computational details necessary for understanding how to implement such methods in practice. I’ve not found Elements of Statistical Learning to be very suitable as a student text, so I revised, expanded and updated my old multivariate notes to produce a new set of notes on Multivariate Data Analysis using R (I discussed an early version of these notes in a previous post). These notes emphasise the role of numerical linear algebra in solving problems of linear statistical inference in an efficient and numerically stable manner. In particular, they illustrate the use of different matrix factorisations, such as the Cholesky, QR and SVD factorisations, as well as spectral decompositions, for constructing efficient solutions. Although some of the students with a weaker mathematical background struggled with some of the more technical topics, the associated practical sessions illustrated how the techniques are used in practice.
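To give a flavour of the kind of thing the notes cover, here is a minimal sketch (not taken from the notes themselves, and using synthetic data) of how two different factorisations can be used to solve the same least squares problem:

```r
## Solving the least squares problem X beta = y via two factorisations
## (illustrative sketch with synthetic data)
set.seed(42)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), ncol = p)
y <- X %*% rnorm(p) + rnorm(n)

## Cholesky solve of the normal equations X'X beta = X'y:
## fast, but squares the condition number of X
A <- crossprod(X)                 # X'X
b <- crossprod(X, y)              # X'y
R <- chol(A)                      # A = R'R, with R upper triangular
beta_chol <- backsolve(R, forwardsolve(t(R), b))

## QR solve: a little slower, but numerically stable,
## since X'X is never formed explicitly
beta_qr <- qr.coef(qr(X), y)

max(abs(beta_chol - beta_qr))     # agreement for well-conditioned X
```

For well-conditioned problems the two agree to machine precision; the QR route is generally preferred when X is close to rank-deficient.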

After filling in a few additional topics of frequentist linear statistical inference (including ANOVA, contrasts, missing data, and experimental design), we moved on to computational Bayesian inference. For the introductory material, I used a variety of sources, including Bayesian reasoning and machine learning, and some introductory material from my own textbook, Stochastic modelling for systems biology. The emphasis for this part of the course was flexible Bayesian hierarchical modelling using MCMC. The primary software tool was JAGS, used via the rjags package from R. For more advanced material on Bayesian computation, I made substantial use of various posts on this blog, as well as some other material we use for teaching Bayesian inference as part of our undergraduate programmes. As in the first part of the course, the emphasis was on practical modelling and data analysis rather than theory. A few of the computer practicals from the Bayesian part of the course are on-line, and I may re-write one or two of them as blog posts in due course. Although the emphasis was on flexible modelling, we did touch on efficiency and the exploitation of linearity, and I included an extra chapter on linear Bayesian inference in my multivariate notes in this context. I rounded off this part of the course with a lecture on the parallelisation of Monte Carlo algorithms, along the lines of my BIRS lecture on that topic.
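For those who haven’t used it, the basic rjags workflow looks something like the following minimal sketch. The model here is a simple hierarchical normal model fitted to synthetic grouped data, purely for illustration; it is not one of the actual course practicals.

```r
library(rjags)

## A simple hierarchical normal model: observations within groups,
## with group means drawn from a common population distribution.
## JAGS parameterises dnorm by mean and precision.
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[group[i]], tau)
  }
  for (j in 1:J) {
    mu[j] ~ dnorm(theta, tau.mu)   # group-level means
  }
  theta ~ dnorm(0, 0.0001)         # population mean
  tau ~ dgamma(1, 0.1)             # observation precision
  tau.mu ~ dgamma(1, 0.1)          # between-group precision
}
"

## Synthetic grouped data
set.seed(1)
J <- 5; N <- 100
group <- sample(1:J, N, replace = TRUE)
y <- rnorm(N, mean = group, sd = 1)

jm <- jags.model(textConnection(model_string),
                 data = list(y = y, group = group, N = N, J = J))
update(jm, 1000)                   # burn-in
samp <- coda.samples(jm, c("theta", "mu"), n.iter = 5000)
summary(samp)
```

The same pattern (specify the model declaratively, compile, burn in, sample, then summarise with coda) carries over directly to much more elaborate hierarchical models.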

The course finished with a group data analysis project, concerned with linear modelling and variable selection for a fairly large heterogeneous data set containing missing data, using both frequentist and Bayesian approaches. I’m reasonably happy with how the course has gone, but I’ll reflect on it for a couple of weeks and gather some feedback before deciding on revisions to make ahead of delivering it again to the new cohort next year.


Multivariate data analysis (using R)

I’ve been very quiet on-line in the last few months, mainly because I’ve been writing a new undergraduate course on multivariate data analysis. Although there are many books and on-line notes on the general topic of multivariate statistics, I wanted to do something a little different from any text I have yet discovered. First, I wanted a strong emphasis on using the techniques in practice on example data sets of reasonable size. For this, I found Hastie et al (2009) to be very useful, as it covers some interesting example data sets which have been bundled in the CRAN R package ElemStatLearn. I used several of the data sets from this package as running examples throughout the course. In fact, my initial plan was to use Hastie et al as the main course text, but it turned out to be in some places overly technical and in many places far too terse to work well as an undergraduate text. I would still recommend the book to researchers who want a good overview of the interface between statistics and machine learning, but with hindsight I’m not convinced it is ideal for typical statistics undergraduates.
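For anyone wanting to experiment with those data sets, they are easy to pull in directly from the package. The snippet below uses the prostate cancer data as an example, chosen here just for illustration rather than because it was necessarily one of the running examples from the course:

```r
## Grab an example data set from the ElemStatLearn package
## (install.packages("ElemStatLearn") first, if necessary)
library(ElemStatLearn)
data(prostate)             # prostate cancer data from Hastie et al.
str(prostate)              # inspect the variables
pairs(prostate[, 1:9])     # scatterplot matrix of the numeric variables
```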

I also wanted to have a strong emphasis on numerical linear algebra as the basis for multivariate statistical computation. Again, this is a bit different from “old school” multivariate statistics (which reminds me, John Marden has produced a great text, available freely on-line, on old school multivariate analysis, which isn’t quite as “old school” as the title might suggest). I wanted to spend some time talking about linear systems and matrix factorisations: explaining, for example, how the LU decomposition, the Cholesky factorisation and the QR factorisation are related, why the latter two are both fundamental to multivariate data analysis, how the singular value decomposition (SVD) is related to the spectral decomposition, and why it is generally better to construct principal components from the SVD of the centred data matrix than from the eigen-decomposition of the sample variance matrix. These sorts of topics are not often covered in undergraduate statistics courses, but they are crucial to understanding how to analyse large multivariate data sets in a numerically stable way.
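As a quick sketch of that last point (my own illustration, on synthetic data), here is PCA computed both ways:

```r
## PCA two ways: SVD of the centred data matrix versus the
## eigen-decomposition of the sample variance matrix
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)        # placeholder data matrix
Xc <- sweep(X, 2, colMeans(X))               # centre the columns

## Route 1: SVD of the centred data (numerically preferable,
## since X'X is never formed, so the condition number isn't squared)
s <- svd(Xc)
pc_svd <- s$v                                # principal component loadings
sdev_svd <- s$d / sqrt(nrow(X) - 1)          # component standard deviations

## Route 2: eigen-decomposition of the sample variance matrix
e <- eigen(var(X))
pc_eig <- e$vectors
sdev_eig <- sqrt(e$values)

## Same loadings up to column sign flips
max(abs(abs(pc_svd) - abs(pc_eig)))
```

This is essentially the difference between R’s prcomp (which uses the SVD of the centred data) and the older princomp (which uses eigen on the variance matrix).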

I also wanted to downplay distribution theory as much as possible, as multivariate distribution theory is quite difficult, not necessary for understanding most of the essential concepts in multivariate data analysis, and not obviously very useful. Essentially all introductory courses are based around the multivariate normal distribution, but I have yet to see a real non-trivial multivariate data set for which an assumption of multivariate normality is appropriate. Consequently I delayed the introduction of the multivariate normal until well into the course, and didn’t bother with the Wishart distribution, or with testing for multivariate normality. Like much frequentist testing, such a test is really just a matter of seeing whether you have yet collected a large enough sample to reject the null hypothesis – I just don’t see the point (null)!

Finally, I wanted to use R to illustrate all of the methods in practice as they were introduced. We use R throughout our undergraduate statistics programme, and I think it is a good language for learning about statistical methods, algorithms and concepts. In most cases I begin by showing how to carry out analyses using “elementary” operations (such as matrix manipulations), and then go on to show how to accomplish the same task more simply using higher-level R functions and packages. Again, I think it really helps understanding to first see the mathematical description directly translated into computer code before jumping to high-level data analysis functions.
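As a small illustration of that progression (my own sketch, using R’s built-in trees data rather than anything from the course), here is the same regression fitted first by direct translation of the maths, and then via lm():

```r
## Multiple regression two ways, using R's built-in trees data set
y <- trees$Volume
X <- cbind(1, trees$Girth, trees$Height)   # design matrix with intercept

## "Elementary" version: direct translation of beta-hat = (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

## High-level version: the same fit using lm()
fit <- lm(Volume ~ Girth + Height, data = trees)
cbind(beta_hat, coef(fit))                 # the two sets of coefficients agree
```

Seeing the explicit matrix version first makes it much clearer what lm() is actually doing under the hood (and, per the earlier discussion, lm() itself uses a QR-based solve rather than the naive normal equations).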

There are several aspects of the course that I would like to distil out into self-contained blog posts, but I have a busy summer schedule, and a couple of other things I want to write about first, so in the meantime, anyone interested is welcome to download a copy of the course notes (PDF, with hyperlinks). This is the student version, containing gaps, but the gaps mainly correspond to bits of standard theory and examples designed to be worked through by hand. All of the essential theory and context, and all of the R examples, are present in this version of the notes. There are seven chapters: Introduction to multivariate data; PCA and matrix factorisations; Inference, the MVN and multivariate regression; Cluster analysis and unsupervised learning; Discrimination and classification; Graphical modelling; Variable selection and multiple testing.