[phenixbb] mr_rosetta on a supercomputer
Thomas C. Terwilliger
terwilliger at lanl.gov
Wed May 18 07:50:59 PDT 2011
I'm sorry for the trouble! I can see that tracking down problems with
the multi-processing part of mr_rosetta is pretty difficult.
So first I suggest running the mr_rosetta regression tests on one node to
make sure it can all run. In your script itself, instead of
"phenix.mr_rosetta etc etc", say:
That should take 10-20 minutes to run and say "OK" for all the tests.
If that fails, you can go into the failed run (for example,
test_autobuild/) and run the script there (e.g., ./test_autobuild.com),
which should fail again, so you can track down what is not working.
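
As a rough sketch of rerunning one failed test, the helper below changes
into the test directory, runs its script, and keeps the output in a file.
The function name rerun_test and the file name rerun.log are my own
inventions for illustration; the only thing taken from the layout above is
that a test named test_autobuild lives in test_autobuild/ and is started
with ./test_autobuild.com.

```shell
# Hypothetical helper (not part of phenix): rerun one regression test
# inside its own directory and save everything it prints.
# Assumes the layout <name>/<name>.com described above.
rerun_test() {
    name="$1"
    (
        cd "$name" || exit 1
        # Capture stdout and stderr so the failure can be read afterwards.
        ./"$name".com > rerun.log 2>&1
        status=$?
        echo "$name exited with status $status"
        # Show the end of the output, where an error usually appears.
        tail -n 20 rerun.log
        exit "$status"
    )
}

# Usage (from the directory that contains test_autobuild/):
#   rerun_test test_autobuild
```

The subshell keeps the cd from changing your current directory, and the
exit status of the test script is passed back to the caller.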
If it succeeds, then there is something specific to your data or script.
The best way to debug this is to go to the last sub-process that has
failed or hung. Here is how to get there:
1. in your main log file the last lines will be something like...
Starting job 1...Log will be:
(1a. This log file in turn may say that further jobs were submitted...if
so, work from the last run submitted there.)
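
If the log is long, something like the following may help to find the
last job that was started. It is only a sketch: the function name
last_job_log is made up, and I am assuming that the text "Log will be"
appears on the line announcing each job, with the log file name either on
that line or on the one after it, so both lines are printed.

```shell
# Hypothetical helper: print the last line containing "Log will be"
# in a log file, together with the line that follows it (if any).
# The exact log layout is an assumption; adjust the pattern if needed.
last_job_log() {
    awk '
        grab           { after = $0; grab = 0 }
        /Log will be/  { hit = $0; after = ""; grab = 1 }
        END {
            if (hit != "") {
                print hit
                if (after != "") print after
            }
        }
    ' "$1"
}

# Usage:
#   last_job_log my_main_run.log
```

Applying the same helper to the log file it reports lets you follow a
chain of submitted jobs down to the last one.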
2. Your last run is in the directory where RUN_FILE_1.log is located.
There will be the following files (more if there are lots of runs in this
directory of course):
terwill@sigma> cd MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1/
terwill@sigma> ls -tlr
-rwx------ 1 terwill lanl 1495 Feb 5 14:54 RUN_FILE_1.sh*
-rwx------ 1 terwill lanl 282 Feb 5 14:54 RUN_FILE_1*
-rw-r--r-- 1 terwill lanl 6431 Feb 5 14:54 PARAMS_1.eff
-rw-r--r-- 1 terwill lanl 6564 Feb 5 14:54 mr_rosetta_params.eff
-rw-r--r-- 1 terwill lanl 130 Feb 5 14:54 INFO_FILE_1
drwxr-xr-x 6 terwill lanl 4096 Feb 5 16:44 RUN_1/
-rw-r--r-- 1 terwill lanl 21575 Feb 5 16:45 RUN_FILE_1.log
-rw-r--r-- 1 terwill lanl 51 Feb 5 16:46 JOBS_RUNNING
PARAMS_1.eff are the parameters used in the run
RUN_FILE_1.sh actually runs the job (e.g., phenix.mr_rosetta PARAMS_1.eff)
NOTE: usually this is phenix.mr_rosetta but it could also be another
routine, so you do have to look at it or the first line of PARAMS_1.eff
which will name the routine used.
RUN_FILE_1.log is the log file for this run. Look at the end of this file.
The job is run in RUN_1/
The key here is that you can type
and the exact same job that failed or ran will be run again. You can use
this to debug what is going on.
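
Before rerunning, it can be useful to confirm which routine the run
actually uses. A minimal sketch, using the file names from the listing
above (the function name describe_run is hypothetical):

```shell
# Hypothetical helper: show which routine run number N in a directory uses,
# by printing the first line of PARAMS_N.eff and the contents of
# RUN_FILE_N.sh (both file names as in the directory listing above).
describe_run() {
    dir="$1"
    n="$2"
    echo "== first line of PARAMS_$n.eff =="
    head -n 1 "$dir/PARAMS_$n.eff"
    echo "== RUN_FILE_$n.sh =="
    cat "$dir/RUN_FILE_$n.sh"
}

# Usage:
#   describe_run MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1 1
```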
3. Look at the log file RUN_FILE_1.log and the files in RUN_1/. Notice
what the last file written in RUN_1/ is...this may give a clue as to when
and where the problem occurred. Usually there will be an error message in
RUN_FILE_1.log that may be informative.
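
Step 3 can be sketched as a small shell function that prints the most
recently modified entry in RUN_N/ and the end of RUN_FILE_N.log. The
function name last_activity and the choice of 30 lines are arbitrary
assumptions of mine; the file and directory names are those in the
listing above.

```shell
# Hypothetical helper: for run number N in a group directory, report the
# most recently modified entry in RUN_N/ and the tail of RUN_FILE_N.log.
last_activity() {
    group_dir="$1"
    n="$2"
    echo "Newest entry in RUN_$n/:"
    # ls -t sorts newest first, so the first line is the last file written.
    ls -t "$group_dir/RUN_$n" | head -n 1
    echo "End of RUN_FILE_$n.log:"
    tail -n 30 "$group_dir/RUN_FILE_$n.log"
}

# Usage:
#   last_activity MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1 1
```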
4. If the run in question is a Rosetta job, then the actual Rosetta job is
run in a subdirectory of RUN_1/. This will be in a directory like:
Here this is in RUN_1 of a group of rosetta models, set 1, run 5, working
directory. In this directory you will find something like:
terwill@sigma> cd WORK_1/
terwill@sigma> ls -tlr
-rw-r--r-- 1 terwill lanl 1475 Feb 5 16:48 rebuild.flags
-rwxr-xr-x 1 terwill lanl 304 Feb 5 16:48 run_rebuild.sh*
-rw-r--r-- 1 terwill lanl 422921 Feb 5 17:26 S_3DZB__0001.pdb
-rw-r--r-- 1 terwill lanl 665 Feb 5 17:26 score.sc
-rw-r--r-- 1 terwill lanl 97437 Feb 5 17:26 rebuild.log
-rw-r--r-- 1 terwill lanl 158717 Feb 5 18:17 S_3DZB__0001_ed.pdb
rebuild.flags are the commands to Rosetta
run_rebuild.sh is a command file to run Rosetta with rebuild.flags
rebuild.log is the log file
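
When experimenting with such a job it is safer not to overwrite the
original output files. One possible way to set up a separate copy is
sketched below. This is an assumption on my part, not a documented
procedure: the function name stage_rosetta_inputs is invented, and the
sketch assumes that rebuild.flags and run_rebuild.sh are the only inputs
that need to sit in the working directory. If the flags file refers to
other files by relative path, those would have to be copied as well.

```shell
# Hypothetical helper: copy the two Rosetta input files named above from a
# working directory into a scratch directory, leaving the originals alone.
# Assumes no other files are needed in the working directory.
stage_rosetta_inputs() {
    work_dir="$1"
    scratch_dir="$2"
    mkdir -p "$scratch_dir" || return 1
    for f in rebuild.flags run_rebuild.sh; do
        # -p keeps the executable bit on run_rebuild.sh
        cp -p "$work_dir/$f" "$scratch_dir/" || return 1
    done
    echo "Staged inputs in $scratch_dir"
}

# Usage:
#   stage_rosetta_inputs WORK_1 /tmp/rosetta_scratch
```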
You can look at the log file and see if there are any messages. Then you
can rerun the Rosetta job in a scratch directory with:
I hope that helps in debugging! I will put these instructions in the
All the best,
>> I have been trying to run phenix.mr_rosetta on a supercomputer
>> where we are limited to 24h for parallel jobs and we have separate
>> queues for mono jobs.
>> Thus to run we have two files.
>> We do not really understand how the code proceeds with nproc so we just
>> put 6 for now.
>> When we sent it, it created 1000 jobs on
>> it sent 10 mono jobs, but ran only the first 5, then the other five. Then
>> the job running the script hangs.
>> when we looked at the log files in the
>> only 4 were filled; if we re-run, only 3, and never the same ones.
>> do you have any idea what we're doing wrong?
>> thanks in advance for your input
>> Patricia Amara
>> P.S. To run we have two files.
>> One to send the script:
>> #MSUB -e ros1_%I.e
>> #MSUB -o ros1_%I.o
>> #MSUB -n 6
>> #MSUB -T 86400
>> cd /scratch/cont003/amara/Rosetta-NadA
>> and the script itself:
>> phenix.mr_rosetta \
>> seq_file=/scratch/cont003/amara/Rosetta-NadA/NADA.seq \
>> search_models=/scratch/cont003/amara/Rosetta-NadA/model_2.pdb \
>> run_prerefine=True \
>> number_of_prerefine_models=1000 \
>> fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_03_05.200_v1_3.gz \
>> fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_09_05.200_v1_3.gz \
>> rescore_mr.relax=False \
>> rosetta_models=20 \
>> ncs_copies=3 \
>> space_group=p21212 \
>> use_all_plausible_sg=False \
>> max_wait_time=100 \
>> nproc=6 \
>> "/usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/bsub -n 1 -q mono -W
>> phenixbb mailing list
>> phenixbb at phenix-online.org