Re: [phenixbb] mr_rosetta on a supercomputer

18 May 2011

Hi Patricia,

I'm sorry for the trouble!  I am seeing that figuring out problems with
the multi-processing part of mr_rosetta is pretty difficult...

So first I suggest running the mr_rosetta regression tests on one node to
make sure it can all run.  In your script itself, instead of
"phenix.mr_rosetta etc etc", say:

  phenix_regression.wizards.test_command_line_rosetta_quick_tests

That should take 10-20 minutes to run and say "OK" for all the tests.
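
On a batch system it helps to keep the full output and the exit status of the test run in one file. A minimal sketch, assuming a POSIX shell; the helper name run_logged and the log name regression.log are placeholders of mine, not part of PHENIX:

```shell
# Run a command, capture stdout and stderr in a log file, and append the
# exit status to that log.
# Usage: run_logged <log file> <command> [args...]
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status: $status" >> "$log"
    return "$status"
}

# In the batch script, in place of the phenix.mr_rosetta command:
#   run_logged regression.log phenix_regression.wizards.test_command_line_rosetta_quick_tests
```

The log can then be read after the queue job has ended, whether or not the tests finished.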

If that fails, you can go into the directory of the failed test (for
example, test_autobuild/) and run the script there (e.g.,
./test_autobuild.com). It should fail again, and you can track down what
is not working.

If it succeeds, then there is something specific to your data or script.

The best way to debug this is to go to the last sub-process that has
failed or hung.  Here is how to get there:

1. in your main log file the last lines will be something like...

Starting job 1...Log will be:
/net/omega/raid1/scratch1/terwill/MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1/RUN_FILE_1.log

(1a. This log file in turn may say that further jobs were submitted...if
so, work from the last run submitted there.)
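
As a rough sketch of step 1, the path of the last sub-process log can be pulled out of a log file with grep. The function name last_sub_log and the file name mr_rosetta.log are placeholders of mine, and the pattern assumes sub-process logs are always named RUN_FILE_<n>.log and written as absolute paths, as in the example above:

```shell
# Print the last RUN_FILE_<n>.log path mentioned in a log file.
# Assumption: sub-process logs are named RUN_FILE_<n>.log and appear as
# absolute paths without spaces, as in the example above.
last_sub_log() {
    grep -o '/[^ ]*RUN_FILE_[0-9]*\.log' "$1" | tail -n 1
}

# Example (mr_rosetta.log is a placeholder for your main log file):
#   last_sub_log mr_rosetta.log
```

For step 1a, running the function again on the file it prints gives the next level down, if that log lists further submitted jobs.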

2. Your last run is in the directory where RUN_FILE_1.log is located. 
There will be the following files (more if there are lots of runs in this
directory of course):

terwill@sigma> cd MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1/
terwill@sigma> ls -tlr
total 60
-rwx------ 1 terwill lanl  1495 Feb  5 14:54 RUN_FILE_1.sh*
-rwx------ 1 terwill lanl   282 Feb  5 14:54 RUN_FILE_1*
-rw-r--r-- 1 terwill lanl  6431 Feb  5 14:54 PARAMS_1.eff
-rw-r--r-- 1 terwill lanl  6564 Feb  5 14:54 mr_rosetta_params.eff
-rw-r--r-- 1 terwill lanl   130 Feb  5 14:54 INFO_FILE_1
drwxr-xr-x 6 terwill lanl  4096 Feb  5 16:44 RUN_1/
-rw-r--r-- 1 terwill lanl 21575 Feb  5 16:45 RUN_FILE_1.log
-rw-r--r-- 1 terwill lanl    51 Feb  5 16:46 JOBS_RUNNING

Here:
PARAMS_1.eff are the parameters used in the run
RUN_FILE_1.sh actually runs the job (e.g., phenix.mr_rosetta PARAMS_1.eff)
  NOTE: usually this is phenix.mr_rosetta but it could also be another
routine, so you do have to look at it or the first line of PARAMS_1.eff
which will name the routine used.
RUN_FILE_1.log is the log file for this run.  Look at the end of this file.
The job is run in RUN_1/

The key here is that you can type

  phenix.mr_rosetta PARAMS_1.eff

and the exact same job that failed or ran will be run again. You can use
this to debug what is going on.
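
Before rerunning, it can be handy to see in one go which routine the run used and how its log ended. A small sketch, assuming the file naming shown in the listing above; the function name show_run and the choice of 20 log lines are mine:

```shell
# Show which routine a sub-run used and how its log ended.
# Usage: show_run <directory> <run number>
show_run() {
    dir=$1
    n=$2
    echo "== first line of PARAMS_$n.eff (names the routine) =="
    head -n 1 "$dir/PARAMS_$n.eff"
    echo "== end of RUN_FILE_$n.log =="
    tail -n 20 "$dir/RUN_FILE_$n.log"
}

# Example:
#   show_run MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1 1
```

The job can then be rerun with whatever routine the first line names, for example phenix.mr_rosetta PARAMS_1.eff.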

3. Look at the log file RUN_FILE_1.log and the files in RUN_1/.  Notice
what the last file written in RUN_1/ is...this may give a clue as to when
and where the problem occurred. Usually there will be an error message in
RUN_FILE_1.log that may be informative.
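
Step 3 can be sketched with two small helpers. The names newest_in and log_errors are placeholders of mine, and searching for the word "error" is only a guess at how the failure was reported:

```shell
# Print the most recently modified entry in a run directory.
newest_in() {
    ls -t "$1" | head -n 1
}

# Print log lines (with line numbers) that mention "error", in any case.
# Assumption: the failure was reported with the word "error"; other
# wordings will need a different pattern.
log_errors() {
    grep -n -i 'error' "$1"
}

# Example:
#   newest_in RUN_1
#   log_errors RUN_FILE_1.log
```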

4. If the run in question is a Rosetta job, then the actual Rosetta job is
run in a subdirectory of RUN_1/.  This will be in a directory like:

MR_ROSETTA_2/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1/RUN_5/WORK_1

Here this is in RUN_1 of a group of rosetta models, set 1, run 5, working
directory.  In this directory you will find something like:

terwill@sigma> cd WORK_1/
terwill@sigma> ls -tlr
total 684
-rw-r--r-- 1 terwill lanl   1475 Feb  5 16:48 rebuild.flags
-rwxr-xr-x 1 terwill lanl    304 Feb  5 16:48 run_rebuild.sh*
-rw-r--r-- 1 terwill lanl 422921 Feb  5 17:26 S_3DZB__0001.pdb
-rw-r--r-- 1 terwill lanl    665 Feb  5 17:26 score.sc
-rw-r--r-- 1 terwill lanl  97437 Feb  5 17:26 rebuild.log
-rw-r--r-- 1 terwill lanl 158717 Feb  5 18:17 S_3DZB__0001_ed.pdb

Here:
 rebuild.flags are the commands to Rosetta
 run_rebuild.sh  is a command file to run Rosetta with rebuild.flags
 rebuild.log is the log file

You can look at the log file and see if there are any messages.  Then you
can rerun the Rosetta job in a scratch directory with:

mkdir junk
cd junk
../run_rebuild.sh
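
When many Rosetta jobs were submitted, it may be quicker to first list the working directories that have no model yet and rerun one of those. A sketch, assuming that a finished job leaves at least one .pdb file in its WORK_* directory as in the listing above; the function name unfinished_work_dirs is mine:

```shell
# List WORK_* directories under a root that contain no .pdb file.
# Assumption: a Rosetta job that finished leaves at least one .pdb file
# in its WORK_* directory, as in the listing above.
unfinished_work_dirs() {
    find "$1" -type d -name 'WORK_*' | sort | while read -r d; do
        if ! ls "$d"/*.pdb > /dev/null 2>&1; then
            echo "$d"
        fi
    done
}

# Example:
#   unfinished_work_dirs MR_ROSETTA_2/GROUP_OF_ROSETTA_REBUILD_1
```

A directory that shows up here with a rebuild.log in it is a job that started but wrote no model, which is a good one to look at first.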

I hope that helps in debugging! I will put these instructions in the
manual too.

All the best,
Tom T
...
...
Hello,
I have been trying to run phenix.mr_rosetta on a supercomputer
where we are limited to 24h for parallel jobs and we have different
queues for mono jobs.
We do not really understand how the code proceeds with nproc, so we just
put 6 for now.
When we submitted it, it created 1000 jobs in
MR_ROSETTA_1/GROUP_OF_PLACE_MODEL_1/RUN_1/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1
It submitted 10 mono jobs, but ran only the first 5 and then the other
five. Then the job running the script hangs.
When we looked at the log files in the
MR_ROSETTA_1/GROUP_OF_PLACE_MODEL_1/RUN_1/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1
directories,
only 4 were filled in; when we re-ran, only 3 were, and never the same ones.
Do you have any idea what we're doing wrong?
Thanks in advance for your input,
Patricia Amara
PS: To run we have two files.
One to submit the script:
#!/bin/bash
#MSUB -e ros1_%I.e
#MSUB -o ros1_%I.o
#MSUB -n 6
#MSUB -T 86400
cd /scratch/cont003/amara/Rosetta-NadA
./rosetta.run
and the script itself:
phenix.mr_rosetta \
  seq_file=/scratch/cont003/amara/Rosetta-NadA/NADA.seq \
  data=/scratch/cont003/amara/Rosetta-NadA/NadA-7-BM16-remote-no_ano-P21212-Rfree_shell.mtz \
  search_models=/scratch/cont003/amara/Rosetta-NadA/model_2.pdb \
  run_prerefine=True \
  number_of_prerefine_models=1000 \
  fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_03_05.200_v1_3.gz \
  fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_09_05.200_v1_3.gz \
  rescore_mr.relax=False \
  rosetta_models=20 \
  ncs_copies=3 \
  space_group=p21212 \
  use_all_plausible_sg=False \
  max_wait_time=100 \
  nproc=6 \
  group_run_command="/usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/bsub -n 1  -q mono -W 03:00"

Thomas C. Terwilliger