Re: [phenixbb] Phaser output and error messages

17 Nov 2009

Hi,

I think a number of these questions could be answered by looking
carefully through the whole logfile and seeing what it tells you about
what is happening in each step of the calculation.  In addition, the
primary literature should be considered part of the documentation for
a program.

1. Phaser uses likelihood to solve structures by molecular
replacement, so the best solution is the one with the highest
log-likelihood gain (LLG).  One of the ways we talk about this is to
consider molecular replacement as testing a series of hypotheses about  
how the molecule is oriented and how it is positioned, and likelihood  
measures how consistent the data are with each of these hypotheses.   
The one with the highest LLG is the one that is supported most  
strongly by the data.
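
To illustrate choosing the solution with the highest LLG, here is a minimal Python sketch that reads the annotations on "SOLU SET" lines of the form quoted in the question further down this message. The helper names are hypothetical and not part of Phaser, and treating the last LLG on a line as the current score is an assumption of the sketch, not documented behaviour.

```python
import re

# Hypothetical helpers, not part of Phaser. They read KEY=value annotations
# from "SOLU SET" lines like the ones quoted in this thread.
ANNOTATION = re.compile(r"\b(RFZ|TFZ|PAK|LLG)=(-?\d+(?:\.\d+)?)")


def parse_solu_sets(sol_text):
    """Return one dict per SOLU SET line, mapping each key to a list of values."""
    sets = []
    for line in sol_text.splitlines():
        if not line.lstrip().startswith("SOLU SET"):
            continue
        scores = {}
        for key, value in ANNOTATION.findall(line):
            scores.setdefault(key, []).append(float(value))
        sets.append(scores)
    return sets


def best_by_llg(sets):
    """Pick the set with the highest LLG; sets without any LLG are skipped.

    Assumption: when a line carries several LLG values, the last one is used.
    """
    scored = [s for s in sets if s.get("LLG")]
    if not scored:
        return None
    return max(scored, key=lambda s: s["LLG"][-1])


example = """SOLU SET  RFZ=3.1 TFZ=5.0 PAK=0
SOLU SET  RFZ=3.7 TFZ=4.6 PAK=0 LLG=25
SOLU SET  RFZ=3.8 TFZ=4.1 PAK=0 LLG=21 LLG=20
"""

best = best_by_llg(parse_solu_sets(example))
print(best["LLG"][-1])
```

A set with no LLG annotation is simply left out of the comparison, since there is nothing to rank it by.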

On the point of "tuning" parameters, I'm not sure what you mean.  In a  
particular case, you should know what you put into your  
crystallization drop, and you will usually have some expectation about  
the stoichiometry of any complexes, so you usually have a good idea of  
the possible content of the asymmetric unit and the sequence identity  
of the models (thus giving you a rough idea of the expected RMS  
error).  You may have to test different choices for the number of  
copies, if different numbers are consistent with the range of solvent  
contents observed in crystals, but you can do better than a generic  
assumption of, say, 50% solvent.  The sequence identity <-> RMS error  
relationship is only approximate, but once the structure is solved  
then refinement programs like phenix.refine will do a better job of  
estimating the impact of errors in the coordinates.  However, if you  
only see negative LLG values in a search, then (as the documentation  
says) you should revise the estimated RMS error upwards, because the  
model is clearly worse than you would expect from the sequence identity.
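
As a rough illustration of testing different numbers of copies against plausible solvent contents, the sketch below uses the standard Matthews estimate (solvent fraction of about 1 - 1.23/V_M, with V_M in cubic angstroms per dalton). The cell volume, number of symmetry operators, molecular weight and the accepted solvent range are made-up illustrative values, and the function names are hypothetical.

```python
def matthews_coefficient(cell_volume, n_symops, n_copies, mw_da):
    """V_M in cubic angstroms per dalton of protein in the unit cell."""
    return cell_volume / (n_symops * n_copies * mw_da)


def solvent_fraction(cell_volume, n_symops, n_copies, mw_da):
    """Approximate solvent fraction from the Matthews estimate 1 - 1.23 / V_M."""
    vm = matthews_coefficient(cell_volume, n_symops, n_copies, mw_da)
    return 1.0 - 1.23 / vm


def plausible_copy_numbers(cell_volume, n_symops, mw_da,
                           low=0.30, high=0.75, max_copies=12):
    """List (copies, solvent fraction) pairs inside an assumed solvent range.

    The default range is an illustrative assumption, not a Phaser setting.
    """
    results = []
    for n in range(1, max_copies + 1):
        frac = solvent_fraction(cell_volume, n_symops, n, mw_da)
        if low <= frac <= high:
            results.append((n, frac))
    return results


# Hypothetical crystal: 400000 cubic angstrom cell, 4 symmetry operators,
# 25 kDa protein.
for n, frac in plausible_copy_numbers(400000.0, 4, 25000.0):
    print(n, round(frac, 3))
```

With these invented numbers both one and two copies fall in the accepted range, so both would be worth testing rather than assuming a single generic solvent content.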

2. Given that the potential solutions in the .sol file are sorted by  
LLG, I'm not sure where the idea would come from that they could be  
given in the order they were found.  You can follow the solutions in  
the logfile and see this.  The whole computation has to finish before  
it is known which is the best solution.  We have heuristics to stop
Phaser spending too much time looking down blind alleys and, as we
improve our understanding of how to distinguish a correct solution from
noise, we will improve these heuristics.  So Phaser already stops as
early as we know how to make it stop once the solution is found.

The ten-minute timeout is not a good idea.  A Phaser molecular  
replacement run comes after weeks to years of protein expression,  
crystallization and data collection, and before days to months of  
rebuilding, refinement and interpretation, so if it takes 30 minutes,  
two hours or even a day to find a solution, then it doesn't seem too  
long to wait.

3. To help people, in cases where (say) the computer crashes in the  
middle of a long run, we've made Phaser write out  
intermediate .sol, .pdb and .mtz files, so that (in principle) you  
could pick up from the middle, or you could examine an intermediate  
solution, say with 2 of 3 components placed.  If you stop it in the  
middle, then you will get files from, say, after the translation  
search but before the packing check, or after the packing check but  
before the rigid-body refinement.  The results won't be as good, and  
you may well miss something better that would have been found later.

4. If Phaser reports that there is no scattering in a model, it means  
that you have supplied an empty PDB file, or one where all the  
occupancies are equal to zero, or one containing only HETATM records  
and no ATOM records.  If this happens in other circumstances, then it  
would be a bug and we would appreciate seeing the offending PDB file.
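
The three cases above can be checked before running Phaser with a short script along these lines. This is only a sketch: the function names are hypothetical, it relies on the fixed-column PDB format (occupancy in columns 55-60), and counting an ATOM record with an unreadable occupancy as non-scattering is a choice made here, not a statement about how Phaser treats it.

```python
def diagnose_pdb(pdb_text):
    """Count ATOM records, HETATM records, and ATOM records with occupancy > 0."""
    counts = {"atom": 0, "hetatm": 0, "scattering": 0}
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            counts["hetatm"] += 1
        elif line.startswith("ATOM"):
            counts["atom"] += 1
            try:
                occupancy = float(line[54:60])  # PDB columns 55-60
            except ValueError:
                continue  # missing or unreadable occupancy (sketch's choice)
            if occupancy > 0.0:
                counts["scattering"] += 1
    return counts


def explain(counts):
    """Map the counts onto the three situations described above."""
    if counts["atom"] == 0 and counts["hetatm"] == 0:
        return "empty file: no ATOM or HETATM records"
    if counts["atom"] == 0:
        return "only HETATM records, no ATOM records"
    if counts["scattering"] == 0:
        return "all ATOM occupancies are zero or unreadable"
    return "file contains scattering ATOM records"


def make_record(record, serial, occupancy):
    """Build a column-correct test record (illustrative data only)."""
    return "%-6s%5d  CA  ALA A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        record, serial, serial, 1.0, 2.0, 3.0, occupancy, 20.0)


zero_occ = "\n".join(make_record("ATOM", i, 0.0) for i in range(1, 4))
print(explain(diagnose_pdb(zero_occ)))
```

Running the same check on a file that triggers the error should show which of the three situations applies, or that none of them does.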

I hope that helps.

Regards,

Randy Read

On 15 Nov 2009, at 13:13, Ian Stokes-Rees wrote:
...
I'm having some discussion with a colleague about Phaser output (we're
using Phaser 2.1.4).  We haven't been able to find any documentation
which can clarify our situation, and I'm hoping someone on the list can
help answer these questions.  I should mention that I am relatively new
to Phaser.

1. PHASER.sol files:  Which "SOLU SET" does Phaser consider to be the
best?  The first or the last?  Or the one with the highest LLG, wherever
that may be?  In our experience of running Phaser over several MTZ files
with a range of models the best Phaser solution has always been the
first, and this has had the highest LLG.

Note: this is with "untuned" Phaser settings for identity, solvent
fraction, or number of search models in ASU -- our goal is to do a first
run with "generic" settings for these over a larger set of models, then
(from TFZ and LLG scores) select a subset for which we will tune Phaser
parameters and PDB search model variations.

2. If we are right that the first "SOLU SET" entry is indicative of the
potential for the search model to form a good MR candidate, then is it
the case that the first entry is the first Phaser solution that is
computed?  Or is the PHASER.sol file a sorted list output at the end of
the run?  From my reading of the documentation it is output in order of
computation, and *for our purposes* (if my first statement in this
question is correct) Phaser can stop after it outputs this first
solution.  Is there some way to tell Phaser to stop after the first
solution is output?

I realize that this doesn't sound like it makes sense (how could Phaser
know to pick the best solution first, and even if it could, why would it
ever continue past this point), however I ask because we have put a 10
minute timeout into our Phaser runs and we have many situations where we
get a timeout but PHASER.sol has already been generated and the best LLG
solutions are output first.  It leaves me wondering why it didn't just
stop on its own after outputting the first result instead of being
aborted by our (external) timeout that terminates the process?

3. PHASER.sol files: For single domain search models, we usually get
output of the form:

SOLU SET  RFZ=4.5 TFZ=5.2 PAK=0 LLG=14 LLG=14
SOLU 6DIM ENSE model1 EULER  242.049   45.040  326.088 FRAC -0.09425  0.50268  0.42575

however we see three variations:

i) No LLG:
SOLU SET  RFZ=3.1 TFZ=5.0 PAK=0
SOLU 6DIM ENSE model2 EULER   59.983   69.335  319.701 FRAC -1.17131 -0.70030  0.23150

ii) One LLG:
SOLU SET  RFZ=3.7 TFZ=4.6 PAK=0 LLG=25
SOLU 6DIM ENSE model3 EULER  293.943  128.068  332.147 FRAC  0.06273  0.13175  0.25054

iii) Two LLG entries, but with different values:
SOLU SET  RFZ=3.8 TFZ=4.1 PAK=0 LLG=21 LLG=20
SOLU 6DIM ENSE model4 EULER  278.058  129.347   33.292 FRAC  0.28446  0.29011 -0.07986

4. Occasionally we get an error that we don't understand:

FATAL RUNTIME ERROR: No scattering in pdbfile model1.pdb

What does this mean?  Is there a problem with the PDB file?  We can't
see anything obvious in the ones which produce this error.

Thanks,
Ian
-- 
Ian Stokes-Rees, Research Associate
SBGrid, Harvard Medical School
http://sbgrid.org
------
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research      Tel: + 44 1223 336500
Wellcome Trust/MRC Building                   Fax: + 44 1223 336827
Hills Road                                    E-mail: [email protected]
Cambridge CB2 0XY, U.K.                       www-structmed.cimr.cam.ac.uk