High Performance Computing with Python


Python is the programming language of choice for many scientists, but one of the criticisms you often hear levelled at it is 'it's so slow!'- which is a fair point, as pure python can be orders of magnitude slower than the equivalent C code. There are many modules and packages out there to address precisely this issue, though, and I recently attended a course describing some of the main ones: numpy, numba, MPI4py, cython, and multiprocess. I'll talk about the first three of those here, and the last two plus an overall comparison in another post. The code from this post can be found here.


numpy (numpy.org) is definitely the package I was most familiar with beforehand, but it was still interesting to see how much of an improvement it gives over the standard math libraries. For example, the following code:

import math
def python_sine(array):
    results = [math.sin(value) for value in array]
    return results

takes 2.21 seconds to run on an array with 10 million elements on my laptop. The equivalent numpy code:

import numpy as np
def numpy_sine(array):
    return np.sin(array)

takes just 350 ms!

Interestingly, I hadn't realised that applying math.sin to a single element is much faster than np.sin- numpy functions are only faster when we can apply them to whole arrays at once. I also hadn't realised that with just a few lines of code, we can do ~5-10 times better than numpy- but that's where numba and cython come in.
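Both effects are easy to check with a quick timing sketch (illustrative only- the exact numbers will vary from machine to machine):

```python
import math
import timeit

import numpy as np

# Per-element: math.sin beats np.sin on a single scalar, because each
# np.sin call pays overhead for type dispatch and array wrapping.
scalar_math = timeit.timeit(lambda: math.sin(0.5), number=100_000)
scalar_numpy = timeit.timeit(lambda: np.sin(0.5), number=100_000)

# Whole-array: np.sin wins easily, because the loop runs in C.
arr = np.linspace(0.0, 2.0 * np.pi, 100_000)
loop_python = timeit.timeit(lambda: [math.sin(x) for x in arr], number=10)
whole_numpy = timeit.timeit(lambda: np.sin(arr), number=10)

print(f"scalar: math {scalar_math:.4f}s, numpy {scalar_numpy:.4f}s")
print(f"array:  loop {loop_python:.4f}s, numpy {whole_numpy:.4f}s")
```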


numba (numba.pydata.org/) compiles your python functions to give performance similar to C or Fortran. It uses 'Just in Time' compilation, meaning a function is compiled into machine code the first time it's called, just before it's executed. Any further calls to this function don't need to go via the python interpreter again, making them much quicker.

It's very easy to add numba to your existing code- just import jit and apply it as a decorator to a function.

import math
import numpy as np
from numba import jit

@jit
def numba_sine(array):
    results = np.empty_like(array)
    for i, value in enumerate(array):
        results[i] = math.sin(value)
    return results

This takes around the same time as the numpy code above, but we were also shown the example of finding the 'p-norm' of a vector where numba outperforms np.linalg.norm by a factor of ~10.

You can use numba to enforce the types of the function and its inputs, turn functions into 'universal functions' which act on arrays (like numpy does) and parallelise your code with the addition of a simple 'target="parallel"' to the arguments of the decorator. numba is still a pretty young package- it's only on version 0.33- but definitely one to watch.


MPI (Message Passing Interface) is the industry standard for distributed, parallel computing. It allows different processes to execute the same program and communicate with each other, sharing data or results between them. MPI4py brings the MPI functionality to python.

The key thing which took me a while to get my head around was that every process runs the same script. I was thinking that the master process would have to see different code to all the others, but that's not how it works- a few well placed if statements take care of that. To perform our test of taking the sine of each element of a large array, we have to do a few things:

  • Set up a communicator between different processes
  • Make an array on the master process which we want to do something with- our input
  • Set up the points at which we want to split this array, in order to distribute it between processes
  • Make these splits and send a chunk of the array to each process
  • Have each process do something to that chunk
  • Gather each sub-array back and reassemble them again

A python script which does this for our simple example can be found here (it borrows a lot from this stack overflow answer). As you can see, MPI4py is far more complicated than using either numpy or numba! The code has to be written with MPI in mind, rather than simply editing functions you already have, and should be called using mpiexec -np $n_processes python script_name, rather than just python script_name. But the possible gains are bigger, especially if you're running on a cluster of computers, so it might be worth taking the time to use it.

Next time- cython, multiprocess and a comparison of everything on a simple Monte Carlo problem.