[phenixbb] phenix and weak data

Wed Dec 12 08:04:10 PST 2012

On Dec 12, 2012, at 10:47 AM, Randy Read <rjr27 at CAM.AC.UK> wrote:

> On 12 Dec 2012, at 15:36, Douglas Theobald wrote:
> 
>> On Dec 12, 2012, at 1:46 AM, Ed Pozharski <epozh001 at UMARYLAND.EDU> wrote:
>> 
>>> On Tue, 2012-12-11 at 11:27 -0500, Douglas Theobald wrote:
>>> 
>>>> What is the evidence, if any, that the experimental sigmas are actually negligible compared to the fitted beta (is it alluded to in Lunin 2002)?  Is there somewhere in the phenix output where I can verify this myself?
>>> 
>>> Essentially, equation 4 in Lunin (2002) is the same as equation 14 in
>>> Murshudov (1997), equation 1 in Cowtan (2005), or equation 12-79 in Rupp (2010).
>>> The difference is that instead of a combination of sigf^2 and sigma_wc you
>>> have a single parameter, beta.  One can do that by assuming that
>>> sigf<<sqrt(beta).  Phenix log files list the optimized beta parameter for
>>> each resolution shell.
>> 
>> From the log file: 
>> 
>> |-----------------------------------------------------------------------------|
>> |R-free likelihood based estimates for figures of merit, absolute phase error,|
>> |and distribution parameters alpha and beta (Acta Cryst. (1995). A51, 880-887)|
>> |                                                                             |
>> | Bin     Resolution      No. Refl.   FOM  Phase Scale    Alpha        Beta   |
>> |  #        range        work  test        error factor                       |
>> |  1: 44.4859 -  3.0705 14086   154  0.93  12.12   1.00     0.98     118346.13|
>> |  2:  3.0705 -  2.4372 13777   149  0.91  15.26   1.00     0.99      58331.77|
>> |  3:  2.4372 -  2.1291 13644   148  0.94  11.42   1.00     0.99      23216.31|
>> 
>> it appears that phenix estimates alpha and beta from the R-free set rather than from the working set (I might be misreading that).  Is that correct?
> 
> Yes, using the cross-validation data was a key step in getting maximum likelihood refinement to work.  A long time ago (a few years before our first paper on ML refinement) I implemented a first version of the MLF target we put into CNS, but the sigmaA values were estimated from the working data.  What happened was that the data would be over-fit, then the sigmaA estimates would go up (with part of the increase being a result of the overfitting), then in the next cycle the pressure to fit the data compared to the restraints would be higher, and so on.  The best I could claim for this at the time was that the resulting models were at least as good as the ones from least-squares refinement, but the R-factors were higher (indicating that there was still less over-fitting).  It would have been hard to sell the advantage of higher R-factors to the protein crystallography community so it was good that, when we started using cross-validated sigmaA values, the convergence radius improved and we could get significantly better models with lower R-factors.  I think you'll find that all the programs use just the cross-validation data to estimate the variance parameters for the likelihood target, not just phenix.refine.
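
The feedback loop described above can be illustrated with a toy calculation that has nothing crystallographic in it. This is only a sketch under stated assumptions: the "model" is one free parameter per bin (the mean of two working observations), the true signal is zero, and the noise is Gaussian with a known sigma. None of the names below come from phenix or CNS. The point is just that a variance estimated from the same data the model was fitted to comes out too small, while the held-out observations give an honest estimate of how far the model is from new data.

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is reproducible
TRUE_SIGMA = 1.0        # assumed noise level of every observation
N_BINS = 5000           # hypothetical number of independent bins

work_resid_sq = []
test_resid_sq = []
for _ in range(N_BINS):
    # Two "working" observations and one "test" observation per bin;
    # the true signal is zero everywhere.
    work = [random.gauss(0.0, TRUE_SIGMA) for _ in range(2)]
    test = random.gauss(0.0, TRUE_SIGMA)

    # Over-parameterised toy model: one fitted value per bin.
    fitted = statistics.fmean(work)

    work_resid_sq.extend((w - fitted) ** 2 for w in work)
    test_resid_sq.append((test - fitted) ** 2)

# Mean squared residuals: the working-set value is deflated by the fit,
# the test-set value includes the error of the fitted model itself.
var_from_work = statistics.fmean(work_resid_sq)
var_from_test = statistics.fmean(test_resid_sq)

print(f"true variance           : {TRUE_SIGMA ** 2:.2f}")
print(f"estimate from work data : {var_from_work:.2f}")
print(f"estimate from test data : {var_from_test:.2f}")
```

For this toy setup the expected values are sigma^2 * (1 - 1/2) from the working data and sigma^2 * (1 + 1/2) from the test data, so the working-set estimate says the model is better than it is. Feeding that optimistic estimate back into the weighting is the kind of runaway described above.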

Thanks Randy.  I must say I'm quite surprised by this, and coming from a likelihoodist/Bayesian POV it seems very wrong :).  I realize it works in practice, and evidently works quite well, but there's a very odd marriage of statistical philosophies going on here.  From a likelihood POV, the joint ML estimates really should all come from the same data set, i.e. the working set.  (There's also the issue that a likelihood "compromise" of sorts has already been made by excluding some of the data from estimation in the first place, a practice which violates the likelihood principle.)  And from a frequentist cross-validation POV, you certainly should not be refining against the test set, since that violates the very rationale of a test set.
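
On the earlier question of checking sigf<<sqrt(beta) from the output: a minimal sketch, assuming the per-bin lines always have the eleven whitespace-separated fields seen in the excerpt quoted above, and assuming beta is on the scale of F^2 so that sqrt(beta) is directly comparable to sigF (which is how I read Ed's remark). The function names are hypothetical, and the mean sigF per shell is not in this part of the log, so it has to be supplied from your own data.

```python
import math

# The three bin lines quoted from the log earlier in this thread.
LOG_EXCERPT = """\
|  1: 44.4859 -  3.0705 14086   154  0.93  12.12   1.00     0.98     118346.13|
|  2:  3.0705 -  2.4372 13777   149  0.91  15.26   1.00     0.99      58331.77|
|  3:  2.4372 -  2.1291 13644   148  0.94  11.42   1.00     0.99      23216.31|
"""


def parse_bins(text):
    """Extract resolution limits and beta from per-bin log lines.

    Assumed field order (taken from the excerpt, not from any spec):
    "N:", d_max, "-", d_min, n_work, n_test, FOM, phase error,
    scale factor, alpha, beta.
    """
    bins = []
    for line in text.splitlines():
        fields = line.strip().strip("|").split()
        if len(fields) != 11 or not fields[0].rstrip(":").isdigit():
            continue  # skip headers, borders and anything unexpected
        bins.append({
            "bin": int(fields[0].rstrip(":")),
            "d_max": float(fields[1]),
            "d_min": float(fields[3]),
            "beta": float(fields[10]),
        })
    return bins


def sigf_over_sqrt_beta(mean_sigf, beta):
    """Ratio that should be much less than 1 if sigF is negligible."""
    return mean_sigf / math.sqrt(beta)


bins = parse_bins(LOG_EXCERPT)
for b in bins:
    print(f"bin {b['bin']}: {b['d_max']:.2f}-{b['d_min']:.2f} A, "
          f"sqrt(beta) = {math.sqrt(b['beta']):.1f}")
```

To finish the check, call sigf_over_sqrt_beta() for each shell with the mean sigF computed from your own reflection file over the same resolution limits; a ratio near or above 1 in the outer shells would suggest the experimental sigmas are not negligible there.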

> 
> Randy
> 
>> _______________________________________________
>> phenixbb mailing list
>> phenixbb at phenix-online.org
>> http://phenix-online.org/mailman/listinfo/phenixbb
> 
> ------
> Randy J. Read
> Department of Haematology, University of Cambridge
> Cambridge Institute for Medical Research      Tel: + 44 1223 336500
> Wellcome Trust/MRC Building                   Fax: + 44 1223 336827
> Hills Road                                    E-mail: rjr27 at cam.ac.uk
> Cambridge CB2 0XY, U.K.                       www-structmed.cimr.cam.ac.uk
> 