[cctbxbb] Removal of Boost.Thread support

Luc Bourhis luc_j_bourhis at mac.com
Wed Aug 15 21:51:50 PDT 2012


Hi Jeffrey,

To rephrase what has already been written by Nat: a significant portion of CPU time is often spent in algorithms that are not easily parallelised.

Let me elaborate on two examples so as to give you a more precise idea of the issues at hand: one that is unfavourable, the computation of protein structure factors and their derivatives, and one that is more favourable, the computation of the normal matrix in least-squares refinement.

I. Protein structure factors and their derivatives
It is roughly done in two steps:

1. the electron density is sampled on a grid as follows: for each X-ray scatterer S, compute the contribution of S to the grid nodes in some neighbourhood of the location of S;
2. an FFT is used to compute the structure factors from the sampled density.

For the derivatives, just replace the electron density and the structure factors by their derivatives in that description.

The FFT is very amenable to parallelisation, but the sampling in step 1 is much more difficult. Parallelising the loop as described with a mere OpenMP directive is hopeless: each grid point gets contributions from several scatterers, thus requiring careful locking to prevent race conditions, and the resulting parallel code is then hopelessly inefficient (see the sketch below). Giving each thread its own private grid puts too much strain on memory: a no-go as well. We even tried to parallelise the inner loop over the grid points, again with a mere OpenMP directive, but that was predictably hopeless too: the outer loop has many iterations whereas each inner loop has comparatively few, making the overhead of firing up the threads lethal.
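To make the race concrete, here is a minimal sketch of the scatterer loop; nodes_near and density_contribution are illustrative stand-ins for the real sampling machinery, not the actual cctbx code:

#include <cstddef>
#include <vector>

struct scatterer { double site[3]; /* plus occupancy, ADPs, ... */ };

// Hypothetical stand-ins for the real sampling machinery:
std::vector<std::size_t> nodes_near(scatterer const& s);         // grid nodes near s
double density_contribution(scatterer const& s, std::size_t n);  // density of s at node n

void sample_density(std::vector<double>& grid,
                    std::vector<scatterer> const& scatterers)
{
  // Parallelising the outer loop looks trivial...
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)scatterers.size(); ++i) {
    std::vector<std::size_t> const nbhd = nodes_near(scatterers[i]);
    for (std::size_t k = 0; k < nbhd.size(); ++k) {
      // ...but neighbourhoods of different scatterers overlap, so several
      // threads can hit the same grid node: without the atomic this is a
      // race, and with it the serialised updates destroy the speedup.
      #pragma omp atomic
      grid[nbhd[k]] += density_contribution(scatterers[i], nbhd[k]);
    }
  }
}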

Since the CPU time spent in the sampling step is significant, the benefit of parallelising the FFT step alone is severely tarnished by Amdahl's law: if sampling takes half of the total time, even an infinitely fast parallel FFT yields at most a 2x speedup. One would need to parallelise the sampling step itself, and that is definitely difficult to do. At the very least, it would require a major reorganisation of a very tricky piece of code, i.e. lots of work and lots of testing. For example, one could think of looping over the grid points and, for each of them, adding up the contributions of the neighbouring scatterers. Such a loop would be amenable to parallelisation with a simple OpenMP directive, but this new algorithm requires a preparation phase that builds data structures to efficiently find those neighbouring atoms (see the sketch below). So a significant rewrite, as stated.
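Here is what that gather formulation could look like; build_neighbour_lists and density_contribution are hypothetical stand-ins for the preparation phase and the sampling kernel:

#include <cstddef>
#include <vector>

struct scatterer { double site[3]; };

double density_contribution(scatterer const& s, std::size_t n);

// Stand-in for the preparation phase: for every grid node, the indices of
// the scatterers whose neighbourhood covers that node.
std::vector<std::vector<std::size_t> >
build_neighbour_lists(std::vector<scatterer> const& scatterers,
                      std::size_t n_grid_nodes);

void sample_density_gather(std::vector<double>& grid,
                           std::vector<scatterer> const& scatterers)
{
  std::vector<std::vector<std::size_t> > const neighbours
      = build_neighbour_lists(scatterers, grid.size());

  // Each grid node is now written by exactly one thread: no locking needed.
  #pragma omp parallel for
  for (std::ptrdiff_t n = 0; n < (std::ptrdiff_t)grid.size(); ++n) {
    double rho = 0;
    for (std::size_t k = 0; k < neighbours[n].size(); ++k)
      rho += density_contribution(scatterers[neighbours[n][k]], n);
    grid[n] = rho;
  }
}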

II. Computation of the normal matrix in least-squares refinement

Here, for each reflection hkl, we compute the gradient g of the structure factor Fc(hkl) with respect to the crystallographic parameters and then perform
A = A + g g^T  (#)
where A is the so-called normal matrix, which is then used to compute the shift of parameters bringing the model closer to a minimum of the least-squares fit of the modelled Fc against the observed structure factors. Those operations (#), called symmetric rank-1 updates if you are mathematically minded, are very amenable to parallelisation, and also to vectorisation actually. Since they dominate the CPU time, speeding them up with parallelisation and vectorisation is worth the effort, even though Amdahl's law will still spoil the speedup if, e.g., the computation of the gradients themselves is not parallelised. A sketch of (#) follows.
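For concreteness, here is a minimal sketch of (#) on the packed upper triangle of A; this is illustrative code, not the actual cctbx implementation:

#include <cstddef>
#include <vector>

// Accumulate A += g g^T on the upper triangle of the symmetric normal
// matrix A, stored packed row by row in 'a' (size n*(n+1)/2). g is the
// gradient of Fc(hkl) wrt the n crystallographic parameters.
void symmetric_rank_1_update(std::vector<double>& a,
                             std::vector<double> const& g)
{
  std::ptrdiff_t const n = (std::ptrdiff_t)g.size();
  // Rows are independent, so a plain OpenMP directive is safe;
  // schedule(dynamic) compensates for the shrinking row lengths.
  #pragma omp parallel for schedule(dynamic)
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    std::ptrdiff_t const row = i*n - i*(i-1)/2;  // offset of row i in packed storage
    double const gi = g[i];
    for (std::ptrdiff_t j = i; j < n; ++j)
      a[row + (j - i)] += gi * g[j];  // contiguous: compilers vectorise this well
  }
}

Packing the upper triangle halves the storage and the work, and the inner loop runs over contiguous memory, which is exactly what a vectorising compiler wants.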

Unfortunately, the use of the normal matrix is only feasible for structures with up to around 1000 atoms in the asymmetric unit: with of the order of ten parameters per atom (coordinates, occupancy, anisotropic displacement parameters), the normal matrix already has of the order of 10^8 elements, and both its storage and the cost of each update (#) grow quadratically with the number of parameters.

Moreover, even for small- to medium-size structures, this particular algorithm is more the exception than the rule: hotspots that are easy to parallelise are definitely rare in crystallographic computing.

Best wishes,

Luc Bourhis


