[cctbxbb] Should we enable boost threads in bootstrap?

Tristan Croll tic20 at cam.ac.uk
Wed Sep 27 12:34:35 PDT 2017


Correction: "... and is of course impossible on *Windows*."

On 2017-09-27 11:54, Tristan Croll wrote:
> Sorry for not replying directly to the existing thread, but I've only
> just subscribed to cctbxbb. Since I started the discussion of the
> Global Interpreter Lock (GIL) yesterday, I thought I should give a
> quick run-down of why it's important and where/how it can be released.
> 
> In brief, the much-maligned GIL exists because much of the Python API
> proper (in particular, reference counting) is not thread-safe. Its
> purpose is to ensure that only one thread can ever be using Python
> objects at any given time. In practice this means that a naive
> implementation of Python threads gives you the worst of all possible
> worlds: parallel logic, but slower-than-single-threaded performance.
> Python regularly switches between all of its running threads, but
> only the one currently holding the GIL can make progress. So if one
> thread runs a method that takes 10 seconds without releasing the GIL,
> *all* threads will hang for those 10 seconds.
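> 
> A minimal sketch of this effect: two threads running pure-Python
> bytecode take about as long as the same work run twice in series,
> because only the thread holding the GIL ever executes bytecode:
> 
> from threading import Thread
> from time import time
> 
> def busy(n=10**7):
>     # Pure-Python loop: holds the GIL for every bytecode it executes.
>     total = 0
>     for i in range(n):
>         total += i*i
> 
> t0 = time()
> busy(); busy()
> print('serial:   %.2f s' % (time() - t0))
> 
> t0 = time()
> threads = [Thread(target=busy) for _ in range(2)]
> for t in threads: t.start()
> for t in threads: t.join()
> # Expect roughly the same wall time as the serial run: the two
> # threads just take turns holding the GIL.
> print('threaded: %.2f s' % (time() - t0))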
> 
> To be clear, this *doesn't* prevent you from using C++ threads, as
> long as those threads are not acting on Python objects.
> 
> The thing is, in most production code the heavy-duty computation is
> *not* done on Python objects - it's done in C++ on objects that Python
> simply holds a pointer to. All such functions can safely release the
> GIL as long as they reacquire it before returning to Python. In the
> context of the example above, that means all other threads are able to
> continue on doing their own thing while that 10-second function is
> running.
> 
> So why use threads rather than multiprocessing? Three key reasons:
> - threads are trivial to set up and run in the same way on Linux,
> macOS and Windows;
> - threads share memory by default, making life much easier when they
> need to communicate regularly (see the sketch after this list);
> - it turns out that OpenCL/CUDA do not play at all well with forked
> processes.
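> 
> As a minimal sketch of the shared-memory point (the names here are
> illustrative only): a worker thread writes into a NumPy buffer in
> place, and the main thread reads the very same memory, with an Event
> for the hand-off - no pickling and no pipes, as multiprocessing
> would need:
> 
> import numpy
> from threading import Thread, Event
> 
> coords = numpy.zeros(3)   # buffer shared by both threads
> ready = Event()           # signals that new data is available
> 
> def worker():
>     # A producer (think: one MD step) writes its result in place.
>     coords[:] = numpy.random.rand(3)
>     ready.set()
> 
> t = Thread(target=worker)
> t.start()
> ready.wait()
> print(coords)   # the consumer reads the worker's update directly
> t.join()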
> 
> A real-world example: in ISOLDE I'm currently bringing together two
> key packages with their own Python APIs: ChimeraX for the molecular
> graphics, OpenMM for MD simulation. Since this is an interactive
> application, speed is of the essence. Graphics performance needs to be
> independent of simulation performance (to allow the user to
> rotate/translate etc. smoothly no matter how fast the simulation runs)
> so parallelism is mandatory. There is constant back-and-forth
> communication needed (the simulation needs to update coordinates;
> interactions need to be sent back to the simulation), which is
> easier and faster in shared memory. The simulation is running on a
> GPU. My initial implementation used Python's multiprocessing module
> via os.fork() - this worked on Linux under the very specific
> circumstance that the GPU had not previously been used for OpenCL or
> CUDA by the master process, but failed on the Mac and is of course
> impossible on Windows, where os.fork() does not exist. Switching
> from multiprocessing to threading (with no other
> changes to my code) gave me an implementation that works equally well
> on Linux and Mac with no OS-specific code, and should work just as
> well on Windows once I get around to doing a build. The performance is
> effectively equal to what I was getting from multiprocessing, since
> all the major libraries I'm using (ChimeraX, OpenMM, NumPy) release
> the GIL in all their C++ calls. At the moment, though, CCTBX functions
> don't - but there's no reason why they can't. See
> https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_my_function.
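> 
> A rough way to check from Python whether a given call releases the
> GIL (a sketch, not a cctbx utility): run the call in a worker thread
> and time a fixed pure-Python loop in the main thread. If the call
> releases the GIL, the loop takes about as long as it does alone; if
> not, the loop stalls for most of the call's duration:
> 
> from threading import Thread
> from time import time
> import numpy
> 
> def py_loop(n=10**7):
>     # Fixed pure-Python workload; needs the GIL for every bytecode.
>     s = 0
>     for i in range(n):
>         s += i
> 
> def loop_time_during(func, *args):
>     # Time py_loop while func runs in a background thread.
>     t = Thread(target=func, args=args)
>     t.start()
>     t0 = time()
>     py_loop()
>     elapsed = time() - t0
>     t.join()
>     return elapsed
> 
> a = numpy.random.rand(50000000)
> t0 = time(); py_loop(); print('alone:  %.2f s' % (time() - t0))
> # NumPy ufuncs release the GIL, so this should be close to the
> # baseline; a GIL-holding call would push it up substantially.
> print('during: %.2f s' % loop_time_during(numpy.exp, a))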
> 
> A simple toy example using Python threads to speed up a heavyish 
> computation:
> 
> from multiprocessing.pool import ThreadPool
> import numpy
> 
> def f(a, target, start_i, end_i):
>     # Heavy NumPy work on one slice. NumPy releases the GIL inside
>     # these calls, so several slices can be processed in parallel.
>     arr = a[start_i:end_i]
>     target[start_i:end_i] = numpy.exp(numpy.cos(arr)+numpy.sin(arr))
> 
> def test_threads(a, num_threads):
>     l = len(a)
>     ret = numpy.empty(l, a.dtype)
>     stride = -(-l // num_threads)  # ceiling division
>     with ThreadPool(processes=num_threads) as p:
>         # Hand each thread one contiguous slice of the array.
>         for i in range(num_threads):
>             start = stride*i
>             end = min(stride*(i+1), l)
>             p.apply_async(f, (a, ret, start, end))
>         p.close()
>         p.join()
>     return ret
> 
> a = numpy.random.rand(50000000)
> target = numpy.empty(len(a), a.dtype)
> 
> %timeit f(a, target, 0, len(a))
> 2.35 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> %timeit test_threads(a, 1)
> 2.55 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> %timeit test_threads(a, 2)
> 1.5 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> %timeit test_threads(a, 3)
> 1.2 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> _______________________________________________
> cctbxbb mailing list
> cctbxbb at phenix-online.org
> http://phenix-online.org/mailman/listinfo/cctbxbb



