[cctbxbb] Should we enable boost threads in bootstrap?
Tristan Croll
tic20 at cam.ac.uk
Wed Sep 27 08:54:51 PDT 2017
Sorry for not replying directly to the existing thread, but I've only
just subscribed to cctbxbb. Since I started the discussion of the Global
Interpreter Lock (GIL) yesterday, I thought I should give a quick
run-down of why it's important and where/how it can be released.
In brief, the much-maligned GIL exists because much of the Python API
proper (in particular, reference counting) is not thread-safe. Its
purpose is to ensure that only one thread can ever be using Python
objects at any given time. In practice this means that a naive
implementation of Python threads gives you the worst of all possible
worlds: parallel logic, but slower-than-single-threaded performance.
Python regularly tries to swap between all threads it has running, but
only the one that currently holds the GIL can go forward. So if one
thread runs a method that takes 10 seconds without releasing the GIL,
*all* threads will hang for 10 seconds.
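To see that serialization in action, here is a minimal sketch of my own (not part of the original post): a pure-Python busy loop run twice sequentially and then in two threads. Because neither thread ever releases the GIL, the threaded version is no faster, and usually slightly slower from the switching overhead:

```python
import threading
import time

def count(n):
    # Pure-Python busy loop; it holds the GIL the whole time it runs
    while n > 0:
        n -= 1

N = 5_000_000

# Sequential: two calls, one after the other
t0 = time.perf_counter()
count(N)
count(N)
sequential = time.perf_counter() - t0

# Threaded: two threads, but the GIL lets only one advance at a time
t0 = time.perf_counter()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print('sequential: %.2f s, threaded: %.2f s' % (sequential, threaded))
```

On standard CPython the two timings come out roughly equal, which is exactly the "parallel logic, slower-than-single-threaded performance" trap described above.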
To be clear, this *doesn't* prevent you from using C++ threads, as long
as those threads are not acting on Python objects.
The thing is, in most production code the heavy-duty computation is
*not* done on Python objects - it's done in C++ on objects that Python
simply holds a pointer to. All such functions can safely release the GIL
as long as they reacquire it before returning to Python. In the context
of the example above, that means all other threads are able to continue
on doing their own thing while that 10-second function is running.
So why use threads rather than multiprocessing? Three key reasons:
- threads are trivial to set up and run in the same way on Linux, macOS
and Windows
- threads share memory by default, making life much easier when they
need to communicate regularly
- It turns out that OpenCL/CUDA do not play at all well with forked
processes.
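The second point is easy to demonstrate with a small sketch of mine (not from the original post): worker threads write directly into a list the main thread allocated, and a Queue carries messages back. No pickling and no pipes, which is what communicating between forked processes would require:

```python
import threading
import queue

# Shared state: workers write results straight into this list, and a
# Queue carries notifications back - no serialization boundary at all.
results = [0] * 4
messages = queue.Queue()

def worker(idx):
    results[idx] = idx * idx      # direct write into shared memory
    messages.put(('done', idx))   # notify the main thread

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [0, 1, 4, 9]
done = sorted(messages.get()[1] for _ in range(4))
```

With multiprocessing the same pattern needs explicitly shared memory or message passing, and every object crossing the boundary must be picklable.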
A real-world example: in ISOLDE I'm currently bringing together two key
packages with their own Python APIs: ChimeraX for the molecular
graphics, OpenMM for MD simulation. Since this is an interactive
application, speed is of the essence. Graphics performance needs to be
independent of simulation performance (to allow the user to
rotate/translate etc. smoothly no matter how fast the simulation runs)
so parallelism is mandatory. There is constant back-and-forth
communication needed (the simulation needs to update coordinates;
interactions need to be sent back to the simulation), which is
easier and faster in shared memory. The simulation is running on a GPU. My
initial implementation used Python's multiprocessing module via
os.fork() - this worked on Linux under the very specific circumstance
that the GPU had not previously been used for OpenCL or CUDA by the
master process, but failed on the Mac and is of course impossible on
Windows, which has no fork(). Switching from multiprocessing to threading (with no other
changes to my code) gave me an implementation that works equally well on
Linux and Mac with no OS-specific code, and should work just as well on
Windows once I get around to doing a build. The performance is
effectively equal to what I was getting from multiprocessing, since all
the major libraries I'm using (ChimeraX, OpenMM, NumPy) release the GIL
in all their C++ calls. At the moment, though, CCTBX functions don't -
but there's no reason why they can't. See
https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_my_function.
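As a quick sanity check from the Python side (my own sketch, not from the post), you can test whether a C-level call releases the GIL by spinning a pure-Python counter thread alongside it. If the counter barely advances during the call, the call held the GIL; if it races ahead, the GIL was released. NumPy's ufuncs release it for large arrays, which is what makes the timings below possible:

```python
import threading
import numpy

ticks = 0
stop = False

def ticker():
    # Pure-Python spin loop: it can only advance when the GIL is free
    global ticks
    while not stop:
        ticks += 1

t = threading.Thread(target=ticker)
t.start()

a = numpy.random.rand(20_000_000)
before = ticks
# If this ufunc chain releases the GIL (NumPy does, above a size
# threshold), the ticker keeps running during the C-level work; a call
# that held the GIL would freeze the ticker for its whole duration.
b = numpy.exp(numpy.cos(a) + numpy.sin(a))
delta = ticks - before

stop = True
t.join()
print('ticker advanced %d times during the call' % delta)
```

The same probe run against a CCTBX function would currently show the ticker frozen, since those wrappers do not yet release the GIL.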
A simple toy example using Python threads to speed up a heavyish
computation:
from multiprocessing.pool import ThreadPool
import numpy
from math import ceil

def f(a, target, start_i, end_i):
    arr = a[start_i:end_i]
    target[start_i:end_i] = numpy.exp(numpy.cos(arr)+numpy.sin(arr))

def test_threads(a, num_threads):
    l = len(a)
    ret = numpy.empty(l, a.dtype)
    stride = int(ceil(l/num_threads))
    with ThreadPool(processes=num_threads) as p:
        for i in range(num_threads):
            start = stride*i
            end = stride*(i+1)
            if end > l:
                end = l
            p.apply_async(f, (a, ret, start, end))
        p.close()
        p.join()
    return ret

a = numpy.random.rand(50000000)
target = numpy.empty(len(a), a.dtype)
%timeit f(a, target, 0, len(a))
2.35 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 1)
2.55 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 2)
1.5 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 3)
1.2 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)