[phenixbb] questions related to Phenix refinement

Sun Jan 18 09:58:14 PST 2015

Dear Kay,

thanks for your email and for bringing up this topic!

>>>> In the X-ray statistics by resolution bin of the Phenix.refine result,
>>>> there is a column "%complete".  For my refinement data, I find the
>>>> better the resolution (from lower resolution to the higher
>>>> resolution), the lower the completeness (for example for 40-6 A,
>>>> %complete is 98, for 3.1-3.0 A, %complete is 60%, for 2.2-2.1 A,
>>>>   %complete is  6%).
>>>>
>>>> Will you please tell me what does this "%complete" mean? why it
>>>> decreases in the better diffraction bin?
>> Completeness is how many reflections you have compared to theoretically
>> possible. So the higher completeness the better. Ideally (and it's not
>> that uncommon these days) you should have 100% complete data set in
>> d_min-inf resolution. Anything below say 80 in any resolution bin is
>> bad, and numbers you quote 6-60% mean something is wrong with the dataset.
>>
> Given your standing in the community, the last sentence will lead many
> unexperienced people to believe that they should cut their data at the
> resolution where the completeness falls below "say 80"%.
>
> But that would be wrong. There is no reason to consider a completeness
> as "too low in a high-resolution shell" as long as the data in that
> shell are good. Particularly in refinement any reflection helps to
> improve the model, and to reduce overfitting.

Clearly, email is not the best means of communication, especially when
written without a lawyer's help and then read between the lines!

No, I was not suggesting cutting the data, particularly if the cutoff is
judged by completeness alone. What I was really saying is that if
the data set is that incomplete, this should raise a flag and prompt a
review of the data collection and processing steps (rather than spending
months struggling with a poor data set!).
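
To make the "%complete" column concrete: it is the number of observed
unique reflections in a resolution shell divided by the number
theoretically possible in that shell. Below is a minimal Python sketch of
that bookkeeping, not what phenix.refine actually runs. It assumes a toy
cell with orthogonal axes in space group P1 with Friedel mates merged; the
function names (d_spacing, possible_reflections, completeness_by_shell)
are made up for illustration.

```python
import math


def d_spacing(hkl, cell):
    """d-spacing (A) of reflection hkl for a cell with orthogonal axes (a, b, c)."""
    h, k, l = hkl
    a, b, c = cell
    return 1.0 / math.sqrt((h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2)


def possible_reflections(cell, d_min):
    """Unique reflections with d >= d_min, assuming P1 and Friedel's law.

    Keeping only indices that sort above (0, 0, 0) drops the origin and
    retains exactly one member of each (h,k,l) / (-h,-k,-l) pair.
    """
    hmax, kmax, lmax = (int(axis / d_min) for axis in cell)
    unique = []
    for h in range(-hmax, hmax + 1):
        for k in range(-kmax, kmax + 1):
            for l in range(-lmax, lmax + 1):
                hkl = (h, k, l)
                if hkl > (0, 0, 0) and d_spacing(hkl, cell) >= d_min:
                    unique.append(hkl)
    return unique


def completeness_by_shell(observed, possible, cell, edges):
    """Percent completeness per shell.

    edges are shell limits in A, from low to high resolution, e.g.
    [25.0, 6.0, 3.0, 2.0]. observed must use the same index convention
    as possible (one member of each Friedel pair).
    Returns rows of (d_low_res, d_high_res, n_observed, n_possible, percent).
    """
    observed_set = set(observed)
    rows = []
    for d_low_res, d_high_res in zip(edges, edges[1:]):
        in_shell = [hkl for hkl in possible
                    if d_high_res <= d_spacing(hkl, cell) < d_low_res]
        n_obs = sum(1 for hkl in in_shell if hkl in observed_set)
        percent = 100.0 * n_obs / len(in_shell) if in_shell else float("nan")
        rows.append((d_low_res, d_high_res, n_obs, len(in_shell), percent))
    return rows
```

With a real data set one would use the true unit cell and space-group
symmetry to count the possible reflections; the point here is only that a
shell at 6% holds a small fraction of the reflections it could contain.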

Also, I think that extremes, such as routine data cutoffs by "sigma" and/or
resolution (as was common in the past) or a panicked fear of throwing away
a single reflection (the modern trend), may be counterproductive. Indeed,
for example, non-permanent data cutoffs by resolution (or by other
criteria, such as those derived from Fobs vs Fmodel differences) may be
essential for the success of refinement and of phasing by Molecular
Replacement:

           J. Appl. Cryst. (2008). 41, 491-522
           Structure refinement: some background theory and practical
           strategies
           D. Watkin

           Acta Cryst. (1999). D55, 1759-1764
           Detecting outliers in non-redundant diffraction data
           R. J. Read

           J. Appl. Cryst. (2009). 42, 607-615
           Automatic multiple-zone rigid-body refinement with a large
           convergence radius
           P. V. Afonine, R. W. Grosse-Kunstleve, A. Urzhumtsev and
           P. D. Adams

           STIR option in SHELX.

Also, incomplete data can distort maps. As few as 1% of missing
reflections may be sufficient to destroy the image of the molecule in
Fourier maps:

           Acta Cryst. (1991). A47, 794-801
           Low-resolution phases: influence on SIR syntheses and
           retrieval with double-step filtration
           A. G. Urzhumtsev

           Acta Cryst. (2014). D70, 2593-2606
           Metrics for comparison of crystallographic maps
           A. Urzhumtsev, P. V. Afonine, V. Y. Lunin, T. C. Terwilliger
           and P. D. Adams

           Retrieval of lost reflections in high resolution Fourier
           syntheses by 'soft' solvent flattening.
           Natalia L. Lunina, Vladimir Y. Lunin and Alberto D. Podjarny
           http://www.ccp4.ac.uk/newsletters/newsletter41/00_contents.html
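
The effect of missing terms on a Fourier synthesis can be seen even in one
dimension. The following Python sketch is a toy illustration only and does
not reproduce the results of the papers above: it builds a made-up periodic
"density" from three Gaussians, computes its Fourier coefficients, omits
the two lowest-order pairs of terms, and compares the resulting synthesis
with the original by a correlation coefficient. All names and parameters
(toy_density, drop_terms, the grid size, the peak positions) are
assumptions chosen for illustration.

```python
import cmath
import math


def toy_density(n=64, centres=(10, 14, 40), width=2.0):
    """Periodic 1D 'density': a sum of Gaussians on an n-point grid."""
    rho = []
    for x in range(n):
        value = 0.0
        for c in centres:
            dist = min(abs(x - c), n - abs(x - c))  # periodic distance
            value += math.exp(-0.5 * (dist / width) ** 2)
        rho.append(value)
    return rho


def dft(rho):
    """Fourier coefficients F(h), h = 0..n-1, of a sampled density."""
    n = len(rho)
    return [sum(rho[x] * cmath.exp(-2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]


def synthesis(coeffs):
    """Fourier synthesis (inverse transform); returns the real part."""
    n = len(coeffs)
    return [sum(coeffs[h] * cmath.exp(2j * math.pi * h * x / n)
                for h in range(n)).real / n
            for x in range(n)]


def drop_terms(coeffs, missing):
    """Zero the listed indices together with their Friedel mates (-h)."""
    n = len(coeffs)
    out = list(coeffs)
    for h in missing:
        out[h % n] = 0.0
        out[-h % n] = 0.0
    return out


def correlation(a, b):
    """Pearson correlation coefficient between two sampled maps."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rho = toy_density()
F = dft(rho)
cc_full = correlation(rho, synthesis(F))
cc_missing = correlation(rho, synthesis(drop_terms(F, [1, 2])))
print("CC, all terms:       %.4f" % cc_full)
print("CC, h=1,2 omitted:   %.4f" % cc_missing)
```

The map computed from all terms reproduces the density, while omitting a
handful of strong low-order terms lowers the correlation noticeably. How
much damage a given set of missing reflections does in a real 3D map
depends on which reflections are missing and how strong they are.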

Finally, it is a poor idea to quote the resolution of the
highest-resolution reflection as the resolution of the data set unless the
data set is 100% complete. Instead, the effective resolution (which has a
strict mathematical definition and meaning) should be used:

           Acta Cryst. (2013). D69, 1921-1934
           On effective and optical resolutions of diffraction data sets
           L. Urzhumtseva, B. Klaholz and A. Urzhumtsev

Summarizing, a severely incomplete data set should trigger suspicion. If
that is the only data set available, then realistic expectations should be
set about the (possible difficulty of) structure solution and the quality
of the final model.

All the best,
Pavel