Multivariate data analysis (using R)

I’ve been very quiet on-line in the last few months, due mainly to the fact that I’ve been writing a new undergraduate course on multivariate data analysis. Although there are many books and on-line notes on the general topic of multivariate statistics, I wanted to do something a little bit different from any text I have yet discovered. First, I wanted to have a strong emphasis on using techniques in practice on example data sets of reasonable size. For this, I found Hastie et al (2009) to be very useful, as it covered some interesting example data sets which have been bundled in the CRAN R package, ElemStatLearn. I used several of the data sets from this package as running examples throughout the course. In fact my initial plan was to use Hastie et al as the main course text, but it turned out that this text was in some places overly technical and in many places far too terse to be good as an undergraduate text. I would still recommend the book for researchers who want a good overview of the interface between statistics and machine learning, but with hindsight I’m not convinced it is ideal for typical statistics undergraduate students.

I also wanted to have a strong emphasis on numerical linear algebra as the basis for multivariate statistical computation. Again, this is a bit different from “old school” multivariate statistics (which reminds me, John Marden has produced a great text available freely on-line on old school multivariate analysis, which isn’t quite as “old school” as the title might suggest). I wanted to spend some time talking about linear systems and matrix factorisations, explaining, for example how the LU decomposition, the Cholesky factorisation and the QR factorisations are related, and why the latter two are both fundamental to multivariate data analysis, and how the singular value decomposition (SVD) is related to the spectral decomposition, and why it is generally better to construct principal components from the SVD of the centred data matrix than the eigen-decomposition of the sample variance matrix, etc. These sorts of topics are not often covered in undergraduate statistics courses, but they are crucial to understanding how to analyse large multivariate data sets in a numerically stable way.

I also wanted to downplay distribution theory as much as possible, as multivariate distribution theory is quite difficult, and not necessary for understanding most of the essential concepts in multivariate data analysis. Also, it is not obviously very useful. Essentially all introductory courses are based around the multivariate normal distribution, but I have yet to see a real non-trivial multivariate data set for which an assumption of multivariate normality is appropriate. Consequently I delayed the introduction of the multivariate normal until well into the course, and didn’t bother with the Wishart distribution, or testing for multivariate normality. Like much frequentist testing, it is really just a matter of seeing if you have yet collected a large enough sample to reject the null hypothesis – I just don’t see the point (null)!

Finally, I wanted to use R to illustrate all of the methods in practice as they were introduced. We use R throughout our undergraduate statistics programme, and I think it is a good language for learning about statistical methods, algorithms and concepts. In most cases I begin by showing how to carry out analyses using “elementary” operations (such as matrix manipulations), and then go on to show how to accomplish the same task more simply using higher-level R functions and packages. Again, I think it really helps understanding to first see the mathematical description directly translated into computer code before jumping to high-level data analysis functions.

There are several aspects of the course that I would like to distil out into self-contained blog posts, but I have a busy summer schedule, and a couple of other things I want to write about before I’ll have a chance to get around to it, so in the mean time, anyone interested is welcome to download a copy of the course notes (PDF, with hyperlinks). This is the student version, containing gaps, but the gaps mainly correspond to bits of standard theory and examples designed to be worked through by hand. All of the essential theory and context and all of the R examples are present in this version of the notes. There are seven chapters: Introduction to multivariate data; PCA and matrix factorisations; Inference, the MVN and multivariate regression; Cluster analysis and unsupervised learning; Discrimination and classification; Graphical modelling; Variable selection and multiple testing.