[cctbxbb] Should we enable boost threads in bootstrap?
Tristan Croll
tic20 at cam.ac.uk
Wed Sep 27 08:54:51 PDT 2017
Sorry for not replying directly to the existing thread, but I've only
just subscribed to cctbxbb. Since I started the discussion of the Global
Interpreter Lock (GIL) yesterday, I thought I should give a quick
run-down of why it's important and where/how it can be released.
In brief, the much-maligned GIL exists because much of the Python API
proper (in particular, reference counting) is not thread-safe. Its
purpose is to ensure that only one thread can ever be using Python
objects at any given time. In practice this means that a naive
implementation of Python threads gives you the worst of all possible
worlds: parallel logic, but slower-than-single-threaded performance.
Python regularly tries to swap between all threads it has running, but
only the one that currently holds the GIL can go forward. So if one
thread runs a method that takes 10 seconds without releasing the GIL,
*all* threads will hang for 10 seconds.
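To see that serialization in action, here is a minimal sketch of my own (not part of the original post): a pure-Python busy loop run twice sequentially and then in two threads. Because neither thread ever releases the GIL, the threaded version is no faster, and usually slightly slower from the switching overhead:

```python
import threading
import time

def count(n):
    # Pure-Python busy loop; it holds the GIL the whole time it runs
    while n > 0:
        n -= 1

N = 5_000_000

# Sequential: two calls, one after the other
t0 = time.perf_counter()
count(N)
count(N)
sequential = time.perf_counter() - t0

# Threaded: two threads, but the GIL lets only one advance at a time
t0 = time.perf_counter()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print('sequential: %.2f s, threaded: %.2f s' % (sequential, threaded))
```

On standard CPython the two timings come out roughly equal, which is exactly the "parallel logic, slower-than-single-threaded performance" trap described above.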
To be clear, this *doesn't* prevent you from using C++ threads, as long
as those threads are not acting on Python objects.
The thing is, in most production code the heavy-duty computation is
*not* done on Python objects - it's done in C++ on objects that Python
simply holds a pointer to. All such functions can safely release the GIL
as long as they reacquire it before returning to Python. In the context
of the example above, that means all other threads are able to continue
on doing their own thing while that 10-second function is running.
So why use threads rather than multiprocessing? Three key reasons:
- threads are trivial to set up and run in the same way on Linux, macOS
and Windows
- threads share memory by default, making life much easier when they
need to communicate regularly
- It turns out that OpenCL/CUDA do not play at all well with forked
processes.
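The second point is easy to demonstrate with a small sketch of mine (not from the original post): worker threads write directly into a list the main thread allocated, and a Queue carries messages back. No pickling and no pipes, which is what communicating between forked processes would require:

```python
import threading
import queue

# Shared state: workers write results straight into this list, and a
# Queue carries notifications back - no serialization boundary at all.
results = [0] * 4
messages = queue.Queue()

def worker(idx):
    results[idx] = idx * idx      # direct write into shared memory
    messages.put(('done', idx))   # notify the main thread

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [0, 1, 4, 9]
done = sorted(messages.get()[1] for _ in range(4))
```

With multiprocessing the same pattern needs explicitly shared memory or message passing, and every object crossing the boundary must be picklable.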
A real-world example: in ISOLDE I'm currently bringing together two key
packages with their own Python APIs: ChimeraX for the molecular
graphics, OpenMM for MD simulation. Since this is an interactive
application, speed is of the essence. Graphics performance needs to be
independent of simulation performance (to allow the user to
rotate/translate etc. smoothly no matter how fast the simulation runs)
so parallelism is mandatory. There is constant back-and-forth
communication needed (the simulation needs to update coordinates;
interactions need to be sent back to the simulation), which is
easier and faster in shared memory. The simulation is running on a GPU. My
initial implementation used Python's multiprocessing module via
os.fork() - this worked on Linux under the very specific circumstance
that the GPU had not previously been used for OpenCL or CUDA by the
master process, but failed on the Mac and is of course impossible on
Windows, which has no fork(). Switching from multiprocessing to threading (with no other
changes to my code) gave me an implementation that works equally well on
Linux and Mac with no OS-specific code, and should work just as well on
Windows once I get around to doing a build. The performance is
effectively equal to what I was getting from multiprocessing, since all
the major libraries I'm using (ChimeraX, OpenMM, NumPy) release the GIL
in all their C++ calls. At the moment, though, CCTBX functions don't -
but there's no reason why they can't. See
https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_my_function.
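As a quick sanity check from the Python side (my own sketch, not from the post), you can test whether a C-level call releases the GIL by spinning a pure-Python counter thread alongside it. If the counter barely advances during the call, the call held the GIL; if it races ahead, the GIL was released. NumPy's ufuncs release it for large arrays, which is what makes the timings below possible:

```python
import threading
import numpy

ticks = 0
stop = False

def ticker():
    # Pure-Python spin loop: it can only advance when the GIL is free
    global ticks
    while not stop:
        ticks += 1

t = threading.Thread(target=ticker)
t.start()

a = numpy.random.rand(20_000_000)
before = ticks
# If this ufunc chain releases the GIL (NumPy does, above a size
# threshold), the ticker keeps running during the C-level work; a call
# that held the GIL would freeze the ticker for its whole duration.
b = numpy.exp(numpy.cos(a) + numpy.sin(a))
delta = ticks - before

stop = True
t.join()
print('ticker advanced %d times during the call' % delta)
```

The same probe run against a CCTBX function would currently show the ticker frozen, since those wrappers do not yet release the GIL.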
A simple toy example using Python threads to speed up a heavyish
computation:
from multiprocessing.pool import ThreadPool
import numpy
from math import ceil

def f(a, target, start_i, end_i):
    arr = a[start_i:end_i]
    target[start_i:end_i] = numpy.exp(numpy.cos(arr)+numpy.sin(arr))

def test_threads(a, num_threads):
    l = len(a)
    ret = numpy.empty(l, a.dtype)
    stride = int(ceil(l/num_threads))
    with ThreadPool(processes=num_threads) as p:
        for i in range(num_threads):
            start = stride*i
            end = stride*(i+1)
            if end > l:
                end = l
            p.apply_async(f, (a, ret, start, end))
        p.close()
        p.join()
    return ret

a = numpy.random.rand(50000000)
target = numpy.empty(len(a), a.dtype)
%timeit f(a, target, 0, len(a))
2.35 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 1)
2.55 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 2)
1.5 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_threads(a, 3)
1.2 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)