Hi Patricia,

I'm sorry for the trouble! I can see that tracking down problems with the multi-processing part of mr_rosetta is pretty difficult.

First, I suggest running the mr_rosetta regression tests on one node to make sure it can all run. In your script itself, instead of "phenix.mr_rosetta etc etc", say:

    phenix_regression.wizards.test_command_line_rosetta_quick_tests

That should take 10-20 minutes to run and say "OK" for all the tests. If that fails, you can go into the failed run (for example, test_autobuild/) and run the script there (e.g., ./test_autobuild.com), which should fail, and you can track down what is not working.

If it succeeds, then there is something specific to your data or script. The best way to debug this is to go to the last sub-process that has failed or hung. Here is how to get there:

1. In your main log file the last lines will be something like:

    Starting job 1...Log will be: /net/omega/raid1/scratch1/terwillMR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1/RUN_FILE_1.log

   (1a. This log file in turn may say that further jobs were submitted. If so, work from the last run submitted there.)

2. Your last run is in the directory where RUN_FILE_1.log is located. There will be the following files (more if there are lots of runs in this directory, of course):

    terwill@sigma> cd MR_ROSETTA_2/GROUP_OF_PLACE_MODEL_1/
    terwill@sigma> ls -tlr
    total 60
    -rwx------ 1 terwill lanl  1495 Feb 5 14:54 RUN_FILE_1.sh*
    -rwx------ 1 terwill lanl   282 Feb 5 14:54 RUN_FILE_1*
    -rw-r--r-- 1 terwill lanl  6431 Feb 5 14:54 PARAMS_1.eff
    -rw-r--r-- 1 terwill lanl  6564 Feb 5 14:54 mr_rosetta_params.eff
    -rw-r--r-- 1 terwill lanl   130 Feb 5 14:54 INFO_FILE_1
    drwxr-xr-x 6 terwill lanl  4096 Feb 5 16:44 RUN_1/
    -rw-r--r-- 1 terwill lanl 21575 Feb 5 16:45 RUN_FILE_1.log
    -rw-r--r-- 1 terwill lanl    51 Feb 5 16:46 JOBS_RUNNING

   Here:

    PARAMS_1.eff    are the parameters used in the run
    RUN_FILE_1.sh   actually runs the job (e.g., phenix.mr_rosetta PARAMS_1.eff)
    RUN_FILE_1.log  is the log file for this run. Look at the end of this file.
    RUN_1/          is the directory where the job is run

   NOTE: usually the routine is phenix.mr_rosetta, but it could also be another one, so you do have to look at RUN_FILE_1.sh or at the first line of PARAMS_1.eff, which will name the routine used.

   The key here is that you can type

    phenix.mr_rosetta PARAMS_1.eff

   and the exact same job that failed or ran will be run again. You can use this to debug what is going on.

3. Look at the log file RUN_FILE_1.log and the files in RUN_1/. Notice what the last file written in RUN_1/ is; this may give a clue as to when and where the problem occurred. Usually there will be an error message in RUN_FILE_1.log that may be informative.

4. If the run in question is a Rosetta job, then the actual Rosetta job is run in a subdirectory of RUN_1/. This will be in a directory like:

    MR_ROSETTA_2/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1/RUN_5/WORK_1

   Here this is RUN_1 of a group of Rosetta models, set 1, run 5, working directory. In this directory you will find something like:

    terwill@sigma> cd WORK_1/
    terwill@sigma> ls -tlr
    total 684
    -rw-r--r-- 1 terwill lanl   1475 Feb 5 16:48 rebuild.flags
    -rwxr-xr-x 1 terwill lanl    304 Feb 5 16:48 run_rebuild.sh*
    -rw-r--r-- 1 terwill lanl 422921 Feb 5 17:26 S_3DZB__0001.pdb
    -rw-r--r-- 1 terwill lanl    665 Feb 5 17:26 score.sc
    -rw-r--r-- 1 terwill lanl  97437 Feb 5 17:26 rebuild.log
    -rw-r--r-- 1 terwill lanl 158717 Feb 5 18:17 S_3DZB__0001_ed.pdb

   Here:

    rebuild.flags   are the commands to Rosetta
    run_rebuild.sh  is a command file to run Rosetta with rebuild.flags
    rebuild.log     is the log file

   You can look at the log file and see if there are any messages. Then you can rerun the Rosetta job in a scratch directory with:

    mkdir junk
    cd junk
    ../run_rebuild.sh

I hope that helps in debugging! I will put these instructions in the manual too.

All the best,
Tom T
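
Steps 1 to 3 above can be partly scripted. What follows is only a minimal shell sketch, not something that ships with Phenix. The function names (last_sublog, follow_logs, newest_file) are made up for illustration, and the sketch assumes that a log names each sub-job log with the text "Log will be: " followed by a path without spaces, as in the example in step 1. If the wording in your logs differs, the grep pattern needs adjusting.

```shell
# Print the last sub-job log path named in the log file "$1".
# ASSUMPTION: lines look like "Starting job 1...Log will be: /path/RUN_FILE_1.log".
last_sublog() {
    grep -o 'Log will be: *[^ ]*' "$1" | tail -n 1 | sed 's/^Log will be: *//'
}

# Step 1a: follow the chain of logs until a log names no further
# (existing) sub-job log, then print the path of that final log.
# The depth limit of 20 is an arbitrary guard against looping forever.
follow_logs() {
    log=$1
    depth=0
    while [ "$depth" -lt 20 ]; do
        next=$(last_sublog "$log")
        [ -n "$next" ] && [ -f "$next" ] || break
        log=$next
        depth=$((depth + 1))
    done
    printf '%s\n' "$log"
}

# Step 3: print the name of the most recently modified entry in directory "$1".
newest_file() {
    ls -t "$1" | head -n 1
}

# Example use (mr_rosetta.log is a hypothetical name for your main log):
#   last=$(follow_logs mr_rosetta.log)
#   tail -n 30 "$last"
#   newest_file "$(dirname "$last")/RUN_1"
```

The last log in the chain is usually the place to look for an error message, and the newest entry in its RUN_1/ directory hints at how far the job got before it failed or hung.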
Hello,
I have been trying to run phenix.mr_rosetta on a supercomputer where parallel jobs are limited to 24h and where there are separate queues for single-processor ("mono") jobs. Thus to run we have two files. We do not really understand how the code proceeds with nproc, so we just put 6 for now.

When we submitted it, it created 1000 jobs in MR_ROSETTA_1/GROUP_OF_PLACE_MODEL_1/RUN_1/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1. It submitted 10 mono jobs, but ran only 5 first, then the other five. Then the job running the script hangs. When we looked at the log files in the MR_ROSETTA_1/GROUP_OF_PLACE_MODEL_1/RUN_1/GROUP_OF_ROSETTA_REBUILD_1/RUN_1/REBUILD_IN_SETS_1 directories, only 4 had been filled in; if we re-run, only 3 are, and never the same ones.

Do you have any idea what we're doing wrong? Thanks in advance for your input.

Patricia Amara

PS: Thus to run we have two files. One to send the script:

    #!/bin/bash
    #MSUB -e ros1_%I.e
    #MSUB -o ros1_%I.o
    #MSUB -n 6
    #MSUB -T 86400
    cd /scratch/cont003/amara/Rosetta-NadA
    ./rosetta.run

and the script itself:

    phenix.mr_rosetta \
      seq_file=/scratch/cont003/amara/Rosetta-NadA/NADA.seq \
      data=/scratch/cont003/amara/Rosetta-NadA/NadA-7-BM16-remote-no_ano-P21212-Rfree_shell.mtz \
      search_models=/scratch/cont003/amara/Rosetta-NadA/model_2.pdb \
      run_prerefine=True \
      number_of_prerefine_models=1000 \
      fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_03_05.200_v1_3.gz \
      fragment_files = /scratch/cont003/amara/Rosetta-NadA/aat000_09_05.200_v1_3.gz \
      rescore_mr.relax=False \
      rosetta_models=20 \
      ncs_copies=3 \
      space_group=p21212 \
      use_all_plausible_sg=False \
      max_wait_time=100 \
      nproc=6 \
      group_run_command= "/usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/bsub -n 1 -q mono -W 03:00"