Parallel Monte Carlo with an Intel i7 Quad Core

I’ve recently acquired a new laptop with an Intel i7 Quad Core CPU – an i7-940XM to be precise, and I’m interested in the possibility of running parallel MCMC codes on this machine (a Dell Precision M4500) in order to speed things up a bit. I’m running the AMD64 version of Ubuntu Linux 10.10 on it, as it has 8GB of (1333MHz DDR2 dual channel) RAM. It also contains an NVIDIA 1GB Quadro FX graphics card, which I could probably also use for GPU-style speed-up using CUDA, but I don’t really want that kind of hassle if I can avoid it. In a previous post I gave an Introduction to parallel MCMC, which explains how to set up an MPI-based parallel computing environment with Ubuntu. In this post I’m just going to look at a very simple embarrassingly parallel Monte Carlo code, based on the code monte-carlo.c from the previous post. The slightly modified version of the code is given below.

#include <math.h>
#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"

int main(int argc,char *argv[])
{
  int i,k,N; long Iters; double u,ksum,Nsum; gsl_rng *r;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&N);
  MPI_Comm_rank(MPI_COMM_WORLD,&k);
  Iters=1e9;
  r=gsl_rng_alloc(gsl_rng_sprng20);
  for (i=0;i<(Iters/N);i++) {
    u = gsl_rng_uniform(r);
    ksum += exp(-u*u);
  }
  MPI_Reduce(&ksum,&Nsum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  if (k == 0) {
    printf("Monte carlo estimate is %f\n", Nsum/Iters );
  }
  MPI_Finalize();
  exit(EXIT_SUCCESS);
}

This code does 10^9 iterations of a Monte Carlo integral, dividing them equally among the available processes. This code can be compiled with something like:

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o monte-carlo monte-carlo.c -lsprng -lgmp -lgsl -lgslcblas

and run with (say) 4 processes with a command like:

time mpirun -np 4 monte-carlo

How many processes should one run on this machine? This is an interesting question. There is only one CPU in this laptop, but as the name suggests, it has 4 cores. Furthermore, each of those cores is hyper-threaded, so the linux kernel presents it to the user as 8 processors (4 cores, 8 siblings), as a quick cat /proc/cpuinfo will confirm. Does this really mean that it is the equivalent of 8 independent CPUs? The short answer is “no”, but running 8 processes of an MPI (or OpenMP) job could still be optimal. I investigated this issue empirically by varying the number of processes for the MPI job from 1 to 10, doing 3 repeats for each to get some kind of idea of variability (I’ve been working in a lab recently, so I know that all good biologists always do 3 repeats of everything!). The conditions were by no means optimal, but probably quite typical – I was running the Ubuntu window manager, several terminal windows, had Firefox open with several tabs in another workspace, and Xemacs open with several buffers, etc. The raw timings (in seconds) are given below.

NP	T1	T2	T3
1	62.046	62.042	61.900
2	35.652	34.737	36.116
3	29.048	28.238	28.567
4	23.273	24.184	22.207
5	24.418	24.735	24.580
6	21.279	21.184	22.379
7	20.072	19.758	19.836
8	17.858	17.836	18.330
9	20.392	21.290	21.279
10	22.342	19.685	19.309

A quick scan of the numbers in the table makes sense – the time taken decreases steadily as the number of processes increases up to 8 processes, and then increases again as the number goes above 8. This is exactly the kind of qualitative pattern one would hope to see, but the quantitative pattern is a bit more interesting. First let’s look at a plot of the timings.

You will probably need to click on the image to be able to read the axes. The black line gives the average time, and the grey envelope covers all 3 timings. Again, this is broadly what I would have expected – time decreasing steadily to 4 processes, then diminishing returns from 4 to 8 processes, and a penalty for attempting to run more than 8 processes in parallel. Now lets look at the speed up (speed relative to the average time of the 1 processor version).

Here again, the qualitative pattern is as expected. However, inspection of the numbers on the y-axis is a little disappointing. Remeber that this not some cleverly parallelised MCMC chain with lots of inter-process communication – this is an embarrassingly parallel Monte Carlo job – one would expect to see close to 8x speedup on 8 independent CPUs. Here we see that for 4 processes, we get a little over 2.5x speedup, and if we crank things all the way up to 8 processes, we get nearly 3.5x speedup. This isn’t mind-blowing, but it is a lot better than nothing, and this is a fairly powerful processor, so even the 1 processor code is pretty quick…

So, in answer to the question of how many processes to run, the answer seems to be that if the code is very parallel, running 8 processes will probably be quickest, but running 4 processes probably won’t be much slower, and will leave some spare capacity for web browsing, editing, etc, while waiting for the MPI job to run. As ever, YMMV…

Tags: , , , , , , , , , , , , ,

7 Responses to “Parallel Monte Carlo with an Intel i7 Quad Core”

  1. Sidafa Conde Says:

    great post. I have myself an I7 processor, hackintosh I built. I did a performance monte carlo MPI study, my results are very similar to the ones I see here

  2. xcorr Says:

    Interesting. Did you try disabling Hyperthreading in the BIOS and running the code with 4 threads? I find that Matlab (R2011a, Ubuntu 64-bit) is a bit quicker on benchmarks with HT disabled. Wondering if that’s true in general of number crunching programs or it’s just a Matlab thing.

    • darrenjw Says:

      No I didn’t disable hyperthreading, but I thought about it, as it raises some interesting issues. I’m not surprised by the results you see – I would expect to see the same with my MPI jobs – there is no point wasting time trying to cleverly hyperthread when you have 4 CPU intensive threads – it makes sense that that would slow things down. However, the other potential big difference is the “baseline” speed of one thread. Presumably one thread will run more slowly with hyperthreading disabled, which will then make the multi-threaded jobs look faster relative to 1 thread, even if they are not actually any faster than when hyperthreading is turned on… This is potentially misleading, I think. But the main reason for leaving hyperthreading turned on (other than lazyness) is the same reason I ran stuff alongside a window manager, firefox, etc. – to get some idea about real-world speed-ups, rather than theoretical speedup. Most people are going to want hyperthreading turned on, just as they want to run a window manager, and I am really interested in how much faster my code can run relative to my single process hyperthreaded job… Thanks for your feedback.

      • Patrick Mineault Says:

        I see. We have 4 i7s with similar configurations here and initially I had disabled HT on just one. The benchmarks looked a bit better and I didn’t see adverse consequences in everyday use (Matlab, Python, Firefox, NX and so forth), so I disabled HT on all 4 after a few months. Haven’t switched back.

  3. Mike Croucher Says:

    I’ve never used the RNGs in the GSL so sorry if this is a stupid question but how have you handled seeding the generator. Is each thread definitely getting its own, independent random number stream?

  4. Catalogue of my first 25 blog posts « Darren Wilkinson's research blog Says:

    [...] Parallel Monte Carlo with an Intel i7 Quad Core: a quick look at the potential speed-ups possible exploiting parallelisation on a laptop with a [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 35 other followers