Re: [phenixbb] WAS: changing TLS groups mid refinement

17 May 2010

Can we turn the argument on its head?

Demonstrate that the way phenix.refine, as currently implemented, 
throws away weak data (inappropriately, in my view) can never be 
deleterious to the quality of a protein structure model.

See below - a test on real data/structure suggests that the data does 
matter.
...
phenix.refine does not have intensity-based X-ray refinement targets and 
therefore does not use intensities in refinement, although it accepts 
input reflection files with intensities, which it then converts to 
amplitudes for all subsequent purposes.
So let's look at real data:

Short version:

phenix.refine throws out 2896 of 48895 reflections, including 11% of the 
data in the outermost shell, compared to using TRUNCATE to prep my data. 
Evaluated on the common data subset, R-free is 0.8% lower for the model 
refined against the truncate=yes data than for the model from the 
default refinement.

0.8% at a 24% R-free (24.0 vs 24.8) is pretty significant IMHO.
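
The arithmetic behind the short version can be checked in a few lines of 
Python. This is only a sanity check: the counts and R-free values are 
copied from the REMARK 3 blocks and the common-subset result reported in 
this message, and the variable names are illustrative.

```python
# Reflection counts copied from the two REMARK 3 blocks in this message.
n_truncate = 48895   # truncate=yes run (F, SIGF)
n_default = 45999    # default run (IMEAN, SIGIMEAN)

discarded = n_truncate - n_default
fraction_discarded = discarded / n_truncate

# R-free of the default run vs. the truncate=yes model on the common subset.
rfree_default = 0.2479
rfree_common_subset = 0.2399
delta_rfree_points = (rfree_default - rfree_common_subset) * 100

print(f"discarded reflections: {discarded}")
print(f"fraction of measured reflections: {fraction_discarded:.1%}")
print(f"R-free difference: {delta_rfree_points:.1f} percentage points")
```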

Longer version:

Both runs use the same MTZ file and PDB file. The only difference is 
whether phenix.refine uses its default column label selection (IMEAN, 
SIGIMEAN) or the truncate=yes structure factors (F, SIGF).

phenix.refine's default behavior

| 12:  2.1292 -  2.0684 0.90   2536  123 0.1851 0.2381
| 13:  2.0684 -  2.0139 0.88   2462  123 0.1813 0.2654
| 14:  2.0139 -  1.9648 0.87   2400  136 0.1936 0.2637
| 15:  1.9648 -  1.9201 0.81   2273  130 0.2090 0.2789
| 16:  1.9201 -  1.8793 0.83   2303  118 0.2216 0.2761
| 17:  1.8793 -  1.8417 0.72   1998  106 0.2388 0.2790

phenix.refine, forcing it to use F, SIGF out of truncate (truncate=yes)

| 13:  2.1079 -  2.0524 0.97   2546  135 0.1862 0.2383
| 14:  2.0524 -  2.0023 0.96   2545  128 0.1874 0.2623
| 15:  2.0023 -  1.9568 0.96   2499  149 0.1920 0.2562
| 16:  1.9568 -  1.9152 0.92   2452  132 0.2106 0.2385
| 17:  1.9152 -  1.8769 0.95   2500  130 0.2169 0.2895
| 18:  1.8769 -  1.8415 0.83   2182  115 0.2403 0.2680

Columns are resolution range, completeness (work+free), #work, #free, 
Rwork, Rfree.  The incompleteness in the outer shell of the "complete" 
data is because I was overly pessimistic in setting the detector 
distance.  Mea culpa.  The outer shell R-symm in SCALEPACK is 53.8%.
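
For anyone who wants to compare the shells programmatically, here is a 
minimal sketch that splits one of the per-shell lines above into the 
columns just described. The function name and dictionary keys are 
hypothetical, and the parser assumes exactly the layout pasted in this 
message, not the phenix.refine log format in general.

```python
def parse_shell(line):
    """Split one per-shell line into named fields (layout as pasted above)."""
    label, rest = line.lstrip("| ").split(":", 1)
    # Drop the lone "-" that separates the two resolution limits.
    tokens = [t for t in rest.split() if t != "-"]
    d_max, d_min, completeness, n_work, n_free, r_work, r_free = tokens
    return {
        "shell": int(label),
        "d_max": float(d_max),
        "d_min": float(d_min),
        "completeness": float(completeness),
        "n_work": int(n_work),
        "n_free": int(n_free),
        "r_work": float(r_work),
        "r_free": float(r_free),
    }

# Outermost shells copied from the two tables above.
default_outer = parse_shell("| 17:  1.8793 -  1.8417 0.72   1998  106 0.2388 0.2790")
truncate_outer = parse_shell("| 18:  1.8769 -  1.8415 0.83   2182  115 0.2403 0.2680")

gap = truncate_outer["completeness"] - default_outer["completeness"]
print(f"outer-shell completeness gap: {gap:.2f}")
```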

Default behavior yields:
Final: r_work = 0.1898 r_free = 0.2479 bonds = 0.007 angles = 1.114

REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 0.02
REMARK   3   COMPLETENESS FOR RANGE        (%) : 91.38
REMARK   3   NUMBER OF REFLECTIONS             : 45999
REMARK   3
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1928
REMARK   3   R VALUE            (WORKING SET) : 0.1898
REMARK   3   FREE R VALUE                     : 0.2479
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.08
REMARK   3   FREE R VALUE TEST SET COUNT      : 2339

Truncate=yes data yields:
Final: r_work = 0.1932 r_free = 0.2473 bonds = 0.007 angles = 1.113

REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.841
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 1.34
REMARK   3   COMPLETENESS FOR RANGE        (%) : 97.10
REMARK   3   NUMBER OF REFLECTIONS             : 48895
REMARK   3
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1960
REMARK   3   R VALUE            (WORKING SET) : 0.1932
REMARK   3   FREE R VALUE                     : 0.2473
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.10
REMARK   3   FREE R VALUE TEST SET COUNT      : 2494

Despite the inclusion of more weak data, R-free barely changes (0.2473 
vs 0.2479). It should increase a little, the same way that R-work does 
(0.1932 vs 0.1898). However, phenix.refine discards 5.7% of the data 
overall (completeness 91.4% vs 97.1%) and 11% of the data in the 
outermost shell, and this is for a dataset that is not at all 
anisotropic. You would expect the trend to be far worse with anisotropic 
data, where a lot of the data can be weak at the high resolution limit.
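
To make concrete why a weak or negative intensity need not be thrown 
away, here is a minimal sketch of the idea behind the French and Wilson 
treatment that TRUNCATE applies, for an acentric reflection: combine a 
Gaussian likelihood for the measured intensity with a Wilson prior on 
the true intensity and take the posterior mean amplitude. It is an 
illustration only, not TRUNCATE's or phenix.refine's code. The function 
names, the crude midpoint integration and the example numbers (sigma = 1, 
mean intensity = 10) are all assumptions. The naive conversion shown for 
contrast simply rejects I <= 0.

```python
import math

def naive_amplitude(i_obs):
    """sqrt(I) conversion; returns None when the intensity is not positive."""
    return math.sqrt(i_obs) if i_obs > 0 else None

def posterior_mean_amplitude(i_obs, sigma, mean_intensity, n_steps=20000):
    """Posterior mean of F = sqrt(J) for an acentric reflection.

    Prior on the true intensity J >= 0: exp(-J / mean_intensity).
    Likelihood: Gaussian in (i_obs - J) with standard deviation sigma.
    Integrated with a midpoint rule on [0, j_max].
    """
    j_max = max(i_obs, 0.0) + 10.0 * sigma
    step = j_max / n_steps
    js = [(k + 0.5) * step for k in range(n_steps)]
    log_w = [-j / mean_intensity - 0.5 * ((i_obs - j) / sigma) ** 2 for j in js]
    shift = max(log_w)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - shift) for lw in log_w]
    return sum(math.sqrt(j) * w for j, w in zip(js, weights)) / sum(weights)

# Assumed example values: sigma = 1, mean intensity in the shell = 10.
for i_obs in (-1.0, 0.5, 100.0):
    f_naive = naive_amplitude(i_obs)
    f_post = posterior_mean_amplitude(i_obs, sigma=1.0, mean_intensity=10.0)
    print(f"I = {i_obs:6.1f}  naive F = {f_naive}  posterior mean F = {f_post:.3f}")
```

The negative measurement still yields a small positive amplitude 
estimate, while for the strong reflection the two conversions nearly 
agree.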

The bigger question is: what would R-free be for the common data subset 
(Imean > 0) but using the truncate=yes F values and PDB file? I used 
SFTOOLS to make this selection and then refined just the bulk solvent 
correction for the truncate=yes PDB file against this data subset:
Final R-work = 0.1884, R-free = 0.2399
i.e. if you refine the model against all the data from TRUNCATE, but 
then cut to the subset that phenix.refine would use by default, the 
R-free is 0.8% lower than in the default run (0.2399 vs 0.2479).
The R-free test count was the same as for the default phenix.refine 
behavior, which superficially suggests I didn't do anything wrong.
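
The bookkeeping behind that comparison, evaluating one model on only the 
reflections with Imean > 0, can be sketched as follows. The reflection 
values are invented toy numbers, not the data discussed here; the record 
layout is hypothetical; and the R-factor is the plain 
sum|Fobs - Fcalc| / sum(Fobs) with no scaling or bulk solvent term. The 
actual selection was made in SFTOOLS.

```python
def r_factor(pairs):
    """Plain R = sum|Fobs - Fcalc| / sum(Fobs) over (Fobs, Fcalc) pairs."""
    numerator = sum(abs(f_obs - f_calc) for f_obs, f_calc in pairs)
    denominator = sum(f_obs for f_obs, _ in pairs)
    return numerator / denominator

# Toy reflections: (Imean, Fobs from truncate=yes, Fcalc). Values are invented.
reflections = [
    (250.0, 15.8, 15.0),
    (40.0, 6.3, 6.9),
    (3.0, 1.9, 2.4),
    (-1.5, 0.8, 1.6),   # negative Imean, but a positive truncated amplitude
    (-0.2, 1.0, 0.3),
]

all_pairs = [(f_obs, f_calc) for _, f_obs, f_calc in reflections]
# Common subset: keep only reflections the default run would also have used.
common_pairs = [(f_obs, f_calc) for imean, f_obs, f_calc in reflections if imean > 0]

print(f"R (all data)       = {r_factor(all_pairs):.4f}")
print(f"R (Imean > 0 only) = {r_factor(common_pairs):.4f}")
```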

Phil Jeffrey