Automating the setup of parallel refinement jobs
Dear PHENIX developers,

I'm writing to ask if it would be possible to automate the following. What we do now is a series of parallel throw-away refinements to test different parameters. I will set up 12 folders with the same data and different parameter files (all with the same file names). On top of individual sites and ADP refinement I would add:

1 = nothing (control)
2 = ncs = True
3 = fix_rotamers = True
4 = flip_peptides = True
5 = nqh_flips = True
6 = individual_sites_real_space
7 = tls with 3 groups per monomer
8 = tls with 6 groups per monomer
9 = tls with 9 groups per monomer
10 = tls with 12 groups per monomer
11 = tls with 15 groups per monomer
12 = tls with 20 groups per monomer

I then launch all of them with a script that sends them to a cluster (roughly sketched in the P.S. below). After reviewing the results, we combine the parameters into one job that we keep.

So here is my feature request: I would love to have a GUI interface to generate the folders and parameter files, where you would select a list of common parameters and a list of parameters to populate one per job, instead of having to set up each one as a separate job. I think this would really speed things up for your industrial users. I'm working on 20 structures of the same protein with different ligands, and expect to spend maybe 8 hours generating TLS groups and editing the 240 parameter files. A GUI interface would make it 10 or 20 minutes!

I don't see a way to select different TLS groups. I really prefer to test different ones myself for comparison, at least to convince myself that the auto selection is superior. We see big differences from optimizing the number of TLS groups. Right now I am asking TLSMD to generate all these TLS files, copying them to my desktop, and copying them into parameter files. Help!

All the best,
Kendall
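P.S. To make the request concrete, the setup I do by hand amounts to roughly the sketch below. Everything in it is illustrative: the folder names, the per-job parameter strings (written here as extra phenix.refine command-line arguments or .eff files), and the qsub call would all be whatever your own cluster and parameter choices require.

#!/bin/bash
# Set up one folder per test, each with the same model and data plus one
# extra parameter (or parameter file) to vary, then hand each run to the queue.
pdb=model.pdb
mtz=data.mtz
jobs=(
  "01_control|"
  "02_ncs|ncs=True"
  "03_fix_rotamers|fix_rotamers=True"
  "07_tls_3_per_monomer|tls_3.eff"
  # ...the remaining variants follow the same pattern
)
for job in "${jobs[@]}"; do
  dir=${job%%|*}      # folder name
  extra=${job#*|}     # extra argument for this run (may be empty)
  mkdir -p "$dir"
  cp "$pdb" "$mtz" "$dir"/
  ( cd "$dir" &&
    echo "phenix.refine $pdb $mtz $extra" > run.sh &&
    qsub run.sh )     # or whatever your queueing system uses
done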
On Feb 17, 2011, at 1:02 PM, Kendall Nettles wrote:
So here is my feature request: I would love to have a GUI interface to generate the folders and parameter files, where you would select a list of common parameters and a list of parameters to populate one per job, instead of having to set up each one as a separate job.
A talented bash scripter can get this done for you. If you want a GUI, an even more talented Automator (OS X) scripter can get this done for you. However, since you have 12 specific parameter sets to run, you shouldn't need a GUI, since the only input is the PDB and the data (run_refine_workflow.sh file.pdb file.mtz). Pulling the R/R_free out of each is a straightforward grep. I've found that OS X Xgrid works for parallelization (I've got a pipeline that parameterizes Phaser searches, and it can run anomalous FFTs or rigid-body refinements of the solutions).
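For example, something along these lines pulls a one-line summary out of each job folder (the log file name and the exact wording of the line differ between phenix.refine versions, so treat the pattern and paths below as placeholders):

for d in */ ; do
  # print the folder name plus the last line of its refinement log that
  # mentions r_free
  printf '%s ' "$d"
  grep -i "r_free" "$d"*refine*.log | tail -1
done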
Right now I am asking TLSMD to generate all these TLS files, copying them to my desktop, and copying them into parameter files. Help!
Is phenix.find_tls_groups (in dev-610) not sufficient? Pavel writes:
PHENIX users:
Starting with dev-610 (a development version of PHENIX), there is a new tool available for completely automated partitioning of a model into TLS groups:
http://www.phenix-online.org/download/nightly_builds.cgi
To run:
phenix.find_tls_groups model.pdb
or, if you have a multi-CPU machine:
phenix.find_tls_groups model.pdb nproc=N
where N is the number of CPUs available (thanks Nat for parallelization!). There are no parameters that a user is supposed to tweak (except defining the number of CPUs, if desired).
The result of running the above command is a set of atom selections that define TLS groups. These atom selections are ready to use in phenix.refine.
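For example, the resulting selections can be pasted into a parameter file of the form below (the groups shown here are purely illustrative, not what the tool would produce for any particular model):

refinement.refine.adp {
  tls = "chain A and resseq 1:120"
  tls = "chain A and resseq 121:250"
}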
This is available from the PHENIX GUI too, where automatically defined TLS groups can be readily visualized and checked in the graphics window (thanks Nat!).
The algorithm is fast. For example, for a GroEL structure (3668 residues, 26957 atoms, 7 chains) it takes only 135 seconds using 1 CPU, and 44 seconds using 10 CPUs. An analogous job takes 3630 seconds using the TLSMD server. For a lysozyme structure it takes 9.5 seconds with one CPU and 2.5 seconds using 10 CPUs. The timing results may vary depending on the performance of your computer.
There is ongoing work that will slightly improve phenix.find_tls_groups within the next few weeks to a month; however, the current version is functional and can be tried now. Examples of such improvements are analyzing (scoring) user-defined TLS groups (for example, TLS groups from the PDB file header), automated combining of cross-chain TLS groups (non-contiguous segments) obtained through connectivity analysis, better handling of non-protein chains, and more. Integration with phenix.refine is also planned.
Any feedback is very much appreciated!
Thanks, Pavel.
---------------------------------------------
Francis E. Reyes M.Sc.
215 UCB
University of Colorado at Boulder
gpg --keyserver pgp.mit.edu --recv-keys 67BA8D5D
8AE2 F2F4 90F7 9640 28BC 686F 78FD 6669 67BA 8D5D
Hi Francis,

We always scan different TLS groups. It makes at least a 1% difference in R/Rfree in every structure I've done. I haven't compared with the phenix version yet to see if we can drop this.

Kendall
On Thu, Feb 17, 2011 at 12:02 PM, Kendall Nettles wrote:
So here is my feature request: I would love to have a GUI interface to generate the folders and parameter files, where you would select a list of common parameters and a list of parameters to populate one per job, instead of having to set up each one as a separate job.
I think this would really speed things up for your industrial users. I'm working on 20 structures of the same protein with different ligands, and expect to spend maybe 8 hours generating TLS groups and editing the 240 parameter files. A GUI interface would make it 10 or 20 minutes!
I did something like this for Phaser as a proof-of-concept for simple parallelization of tasks:

http://cci.lbl.gov/~nat/img/phenix/phaser_mp_config.png
http://cci.lbl.gov/~nat/img/phenix/phaser_mp_results.png

It runs all search models in parallel, and can sample multiple expected RMSDs too. The calculations can be parallelized over multiple cores (I never tried more than 12, I think, but there's no limit that I'm aware of) or across a cluster. It only uses one dataset with many models, but I could have just as easily done the reverse, or both model- and data-parallel. This isn't a very sophisticated program (it was maybe 2 days of effort), but eventually we'll have a new MR frontend that does something similar, with lots more pre-processing of search models.

So, from a technical standpoint, it's fairly easy to set up, and distributing the jobs and displaying results is relatively easy. The main reason I haven't done anything like this yet is that it isn't obvious to me which parameters need to be sampled and which would be in common. (Also, I'm already at the limit of my multitasking ability.) I like the idea of making the user choose; since almost all of the controls in the GUI can be generated automatically, a dynamic interface is not difficult to set up. There is a separate problem of how to group inputs, but this may not be as hard as I'm imagining. (From my perspective, there is yet another issue with how to organize and save results - should those 20 structures be one project or 20, etc.)

All that said, I think the immediate problem is actually not too bad - phenix.refine will take as many parameter files as you want, so for TLS, for example, you just need to make one file that looks like this:

refinement.refine.adp {
  tls = "chain A"
  tls = "chain B"
}

... and call it "tls.eff", then run "phenix.refine model1.pdb data1.mtz tls.eff other_params.eff", and so on with each dataset. You can have additional parameter files for other settings that you want to vary. It doesn't solve the organizational problem, however, nor does it display the results conveniently, but at least it's less time spent in a text editor.

-Nat
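P.S. For the 20 datasets, that could be as simple as a little shell loop like the one below (the file names are just an example of the pattern; each dataset gets its own TLS file plus whatever common settings you keep in one shared file):

for i in $(seq 1 20); do
  phenix.refine model$i.pdb data$i.mtz tls$i.eff common_params.eff
done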
On Thu, 17 Feb 2011 at 15:42, Nathaniel Echols wrote:
I did something like this for Phaser as a proof-of-concept for simple parallelization of tasks:
http://cci.lbl.gov/~nat/img/phenix/phaser_mp_config.png http://cci.lbl.gov/~nat/img/phenix/phaser_mp_results.png
It runs all search models in parallel, and can sample multiple expected RMSDs too. The calculations can be parallelized over multiple cores (I never tried more than 12, I think, but there's no limit that I'm aware of) or across a cluster. It only uses one dataset with many models, but I could have just as easily done the reverse, or both model and data parallel. This isn't a very sophisticated program (it was maybe 2 days effort), but eventually we'll have a new MR frontend that does something similar, with lots more pre-processing of search models.
Intriguing. I'd really love to see it also run over resolution limits in parallel, so you could do a test like Figure 1(a) in this recent Acta paper:

Bjørn P. Pedersen, J. Preben Morth and Poul Nissen. "Structure determination using poorly diffracting membrane-protein crystals: the H+-ATPase and Na+,K+-ATPase case history." Acta Cryst. D66, 309-313 (2010).

--
Thanks,
Donnie

Donald S. Berkholz, Ph.D.
Research Fellow
James R. Thompson lab, Physiology & Biomedical Engineering
Grazia Isaya lab, Pediatric & Adolescent Medicine
Medical Sciences 2-66
Mayo Clinic College of Medicine
200 First Street SW
Rochester, MN 55905
office: 507-538-6924
cell: 612-991-1321
On Feb 17, 2011, at 5:21 PM, Donnie Berkholz wrote:
Bjørn P. Pedersen, J. Preben Morth and Poul Nissen. "Structure determination using poorly diffracting membrane-protein crystals: the H+-ATPase and Na+,K+-ATPase case history" Acta Cryst D66: 309-313 (2010).
I've tried this out with a few of my own structures. In good cases (> 50% coverage of the ASU with your search model), there is a correlation between high TFZs and high anomalous peaks (i.e. if you took their Figure 1 and made the same plot, but instead scored for high anomalous peaks). However, I currently have a case similar to their MHp1 example (30% coverage of the ASU with a fold that I believe to be in my structure), and using their pipeline my TFZs are in the 7s with very low anomalous peaks... i.e. no solution.

Another interesting case: if you have anomalous data and an MR model, could anomalous peaks be a filter for scoring MR solutions (i.e. a separate column in Phaser next to the TFZ that checks the anomalous map calculated with the MR model phases)? I had a conversation with Randy Read about this before, and in his test cases it didn't seem reliable. I wonder if an anomalous LLG map would be better at teasing this out. Or better yet, if you have a bound cofactor (my case) or know exactly where your heavy atoms are (Se-Met), it would be interesting to check the anomalous peaks only in a given region (basically search the anomalous map given a PDB selection of the model).

Suffice to say, I never do a single Phaser run anymore, but parameterize RMSD and resolution. It's a lot of jobs, but when it's passed to the cluster it goes relatively quickly.

F
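P.S. The driver for those scans is nothing fancy, roughly the loop below, where run_phaser.sh is my own wrapper around a Phaser run (not a PHENIX or Phaser command) and the RMSD / resolution grids are only examples:

for rms in 0.8 1.0 1.2 1.5; do
  for reso in 4.0 3.5 3.0 2.5; do
    # one cluster job per (RMSD, resolution) combination; adjust to however
    # your queue passes arguments to scripts
    qsub run_phaser.sh model.pdb data.mtz $rms $reso
  done
done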
---------------------------------------------
Francis E. Reyes M.Sc.
215 UCB
University of Colorado at Boulder
gpg --keyserver pgp.mit.edu --recv-keys 67BA8D5D
8AE2 F2F4 90F7 9640 28BC 686F 78FD 6669 67BA 8D5D
Hi Nat,

That would be a great help. One thing you could do would be to have a tab of parameters that will be distributed to all jobs, and another tab where each item checked generates a separate run. We usually start with just individual xyz and ADP refinement, as long as the structure is at 3 angstroms. If an additional parameter improves R/Rfree, we add the successful ones together for the final run. I generally have a folder called refine1 that contains all the subjobs, including the throwaway ones and the final combined one; then, after rebuilding, I start the next round as refine2. Ideally you could name the folders to identify the parameters, and maybe add a check box on the GUI tab to mark the final keeper run, which would go into the file name. So far we have just looked at Rfree to pick which ones to keep, and find a spread of 1-3% between parameters, depending on the structure. For outputs it would be great to have a table with R/Rfree.

Have you compared the phenix TLS grouping with various groupings from TLSMD? I will do so this week on our 20 structures and can provide some feedback. We have a range of 2-3 angstroms, so it should provide a good test set.

Best regards,
Kendall
Hi Kendall,
Have you compared the phenix TLS grouping with various groupings from TLSMD?
How (name the comparison criteria, apart from Rfree, which is not the best one in this case)? Anyway, this is what I have from the last PHENIX developers' workshop:

http://cci.lbl.gov/~afonine/tls.pdf

By no means is it general or conclusive, but I hope it gives some idea.
I will do so this week on our 20 structures and can provide some feedback.
This is always helpful. I would greatly appreciate your feedback! Thanks! Pavel.
Hi Kendall, one day there will be a new keyword, strategy=auto, which will make as many automatic decisions as possible.
and expect to spend maybe 8 hours generating TLS groups
I've never seen phenix.find_tls_groups running more than a few minutes on one CPU for a huge structure (it takes seconds for small to medium structures, and may take a minute for large ones).
I don't see a way to select different TLS groups. I really prefer to test different ones myself for comparison, at least to convince myself that the auto selection is superior. We see big differences from optimizing the number of TLS groups. Right now I am asking TLSMD to generate all these TLS files, copying them to my desktop, and copying them into parameter files. Help!
I wrote phenix.find_tls_groups to avoid the tedium of doing this. If you find a problem using phenix.find_tls_groups (as part of your systematic tests) please let me know. Good luck! Pavel.
Hi Pavel, the time is not CPU time, it's cutting and pasting. Other responses to follow. Kendall
Hi Kendall,
Hi Pavel, the time is not CPU time, it's cutting and pasting.
Hm... I'm a bit confused, sorry. If running phenix.find_tls_groups takes from a few seconds to a few minutes, do you mean that the remaining 7+ hours go to cutting and pasting? Am I completely misunderstanding something? Anyway, I'm sure there are ways to improve it. I do lots of systematic refinement runs and other statistics-gathering runs across the whole PDB and have never faced such a challenge, so I believe we can automate what you do. Let's discuss it (maybe off-list). Pavel.
participants (5)
- Donnie Berkholz
- Francis E Reyes
- Kendall Nettles
- Nathaniel Echols
- Pavel Afonine