[cctbxbb] Exceptions squashed by easy_mp (revenge of)

Gabor Bunkoczi gabor.bunkoczi at googlemail.com
Tue Apr 3 06:27:09 PDT 2018


Hi Graeme,

if all you want to do is print the stacktrace out to stdout/stderr, then
the following lines should do this:

from libtbx.scheduling import stacktrace
stacktrace.enable()
parallel_map(...)

This will install a custom excepthook and print the propagated stacktrace
onto the console when the program crashes.

If you want to accumulate potentially multiple tracebacks, you need to do
what Rob suggests (although I would not actually copy the code, but instead
create a wrapping function that fails silently and stores the tracebacks) -
it is not immediately clear to me whether this has any advantages over the
previous approach, but let me know if you need some help with such a generic
solution.
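
As a rough illustration (untested, and the names silently_failing, do_work and
tasks are only placeholders, not anything that exists in libtbx), such a
wrapping function could look like this:

import traceback

class silently_failing(object):
  """Wrap the work function so that an exception in a child process is
  captured and returned rather than raised (illustrative sketch only)."""

  def __init__(self, func):
    self.func = func

  def __call__(self, *args, **kwargs):
    try:
      return ( self.func( *args, **kwargs ), None )
    except Exception:
      # store the formatted traceback alongside a None result
      return ( None, traceback.format_exc() )

# e.g. results = parallel_map( func = silently_failing( do_work ), iterable = tasks, ... )
# tracebacks = [ tb for ( result, tb ) in results if tb is not None ]

Note that with the multiprocessing backend the wrapped callable still has to
be picklable, so do_work would need to be a module-level function.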

BW, Gabor


On Tue, Apr 3, 2018 at 1:46 PM, Dr. Robert Oeffner <rdo20 at cam.ac.uk> wrote:

> Hi Graeme,
>
> Had a look at the code again in parallel_map(). I think it may be possible
> to adapt it to retain the stack traces of individual qsub jobs.
> In libtbx/easy_mp.py, compare line 627 with line 718. Both do
>         result = res()
> but the latter is guarded by a try/except block. Any exception there is the
> result of a child process dying, and its stack trace is added as the third
> member of the parmres tuple which is passed on to the user.
>
> I think something similar could be done with parallel_map(). So I suggest
> fashioning a parallel_map2() function which is a copy of parallel_map() but
> with the added exception handler around the result = res() statement.
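>
> Roughly (an untested sketch - the helper name call_guarded is just a
> placeholder), the guarded call could be factored like this:
>
>   import traceback
>
>   def call_guarded( res ):
>     """Call the scheduled result object; if the child died, return its
>     formatted traceback instead of raising."""
>     try:
>       return ( res(), None )
>     except Exception:
>       return ( None, traceback.format_exc() )
>
>   # inside the hypothetical parallel_map2(), in place of result = res():
>   #   result, trace = call_guarded( res )
>   # 'trace' would then be passed back to the caller alongside the result,
>   # like the third member of the parmres tuple at line 718.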
>
> As I have no access to a qsub cluster I can't test whether this would work.
>
> Regards,
>
> Rob
>
>
>
> On 03/04/2018 13:23, Graeme.Winter at Diamond.ac.uk wrote:
>
>> Hi Rob
>>
>> I think this is true … sometimes
>>
>> It sets up the qsub every time, but does not always use it - at least it
>> works on my MacBook with no qsub ;-)
>>
>> That said, the question remains why exception reports are bad for
>> parallel map… we *are* using preserve_exception_message…
>>
>> Cheers Graeme
>>
>>
>> On 3 Apr 2018, at 13:20, Dr. Robert Oeffner <rdo20 at cam.ac.uk> wrote:
>>>
>>> Hi Graeme,
>>>
>>> Just had a look at the code in dials/util/mp.py. It seems that you are
>>> using parallel_map() on a cluster via qsub. Unfortunately,
>>> multi_core_run() is not designed for that; it only runs on a single
>>> multi-core machine.
>>>
>>> Sorry,
>>>
>>> Rob
>>>
>>>
>>> On 03/04/2018 12:44, Graeme.Winter at Diamond.ac.uk wrote:
>>>
>>>> Thanks Rob, I could not dig out the thread (and the mailing list archive
>>>> does not have a search that I could find).
>>>> I’ll talk to the crew about swapping this out for dials.* - though it is
>>>> possibly quite a big change?
>>>> Cheers Graeme
>>>> On 3 Apr 2018, at 12:26, Dr. Robert Oeffner <rdo20 at cam.ac.uk> wrote:
>>>> Hi Graeme,
>>>> I recall we've been here before,
>>>> http://phenix-online.org/pipermail/cctbxbb/2017-December/001807.html
>>>> I believe the solution is to use easy_mp.multi_core_run() instead of
>>>> easy_mp.parallel_map(). The former preserves the stack traces of
>>>> individual processes, unlike easy_mp.parallel_map().
>>>> Regards,
>>>> Rob
>>>> On 03/04/2018 07:16, Graeme.Winter at Diamond.ac.uk wrote:
>>>> Folks,
>>>> Following up again on user reports of errors within easy_mp - all that
>>>> gets logged is “something went wrong”, i.e.
>>>>   Using multiprocessing with 10 parallel job(s)
>>>> Traceback (most recent call last):
>>>>   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 613, in <module>
>>>>     halraiser(e)
>>>>   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 611, in <module>
>>>>     script.run()
>>>>   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 341, in run
>>>>     reflections = integrator.integrate()
>>>>   File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/integrator.py", line 1214, in integrate
>>>>     self.reflections, _, time_info = processor.process()
>>>>   File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/processor.py", line 271, in process
>>>>     preserve_exception_message = True)
>>>>   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 171, in multi_node_parallel_map
>>>>     preserve_exception_message = preserve_exception_message)
>>>>   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 53, in parallel_map
>>>>     preserve_exception_message = preserve_exception_message)
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
>>>>     result = res()
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
>>>>     self.traceback( exception = self.exception() )
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 115, in __call__
>>>>     self.raise_handler( exception = exception )
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/mainthread.py", line 100, in poll
>>>>     value = target( *args, **kwargs )
>>>>   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 91, in __call__
>>>>     preserve_exception_message = self.preserve_exception_message)
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
>>>>     result = res()
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
>>>>     self.traceback( exception = self.exception() )
>>>>   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 86, in __call__
>>>>     raise exception
>>>> RuntimeError: Please report this error to dials-support at lists.sourceforge.net: exit code = -9
>>>> I forget why it was decided that keeping the proper stack trace was a
>>>> bad thing, but could this be revisited? It would greatly help to see it in
>>>> the output of the program (if, as is the case here, I do not have the user
>>>> data).
>>>> My email-fu is not strong enough to dig out the previous conversation
>>>> Cheers Graeme
>>>> --
>>>> Robert Oeffner, Ph.D.
>>>> Research Associate, The Read Group
>>>> Department of Haematology,
>>>> Cambridge Institute for Medical Research
>>>> University of Cambridge
>>>> Cambridge Biomedical Campus
>>>> Wellcome Trust/MRC Building
>>>> Hills Road
>>>> Cambridge CB2 0XY
>>>> www.cimr.cam.ac.uk/investigators/read/index.html
>>>> tel: +44(0)1223 763234
>>>>
>>>
>>>
>>> --
>>> Robert Oeffner, Ph.D.
>>> Research Associate, The Read Group
>>> Department of Haematology,
>>> Cambridge Institute for Medical Research
>>> University of Cambridge
>>> Cambridge Biomedical Campus
>>> Wellcome Trust/MRC Building
>>> Hills Road
>>> Cambridge CB2 0XY
>>>
>>> www.cimr.cam.ac.uk/investigators/read/index.html
>>> tel: +44(0)1223 763234
>>>
>>
>>
>>
>
> --
> Robert Oeffner, Ph.D.
> Research Associate, The Read Group
> Department of Haematology,
> Cambridge Institute for Medical Research
> University of Cambridge
> Cambridge Biomedical Campus
> Wellcome Trust/MRC Building
> Hills Road
> Cambridge CB2 0XY
>
> www.cimr.cam.ac.uk/investigators/read/index.html
> tel: +44(0)1223 763234
> _______________________________________________
> cctbxbb mailing list
> cctbxbb at phenix-online.org
> http://phenix-online.org/mailman/listinfo/cctbxbb
>

