[phenixbb] phenix and weak data

Randy Read rjr27 at cam.ac.uk
Wed Dec 12 09:17:06 PST 2012


Hi,

I'm not too worried about refining sigmaA values against the test set, since there's only one parameter per resolution shell, and it doesn't (directly) affect the Fcalc values.  Nonetheless, I agree that it would be better in principle to refine all parameters against the working data, and maybe we could do that if we understood how to correct the variance estimates in the likelihood targets for the number of degrees of freedom.
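For concreteness, here's a rough sketch of what "one parameter per resolution shell" amounts to, written in Python with numpy/scipy (this is not the code phenix.refine actually uses, and it handles only the acentric case, ignoring the epsilon factors):

    import numpy as np
    from scipy.special import i0e
    from scipy.optimize import minimize_scalar

    def shell_log_likelihood(sigma_a, eo, ec):
        # Rice density for acentric normalized amplitudes:
        #   p(Eo) = (2 Eo / V) exp(-(Eo^2 + sigmaA^2 Ec^2) / V)
        #           * I0(2 sigmaA Eo Ec / V),  with V = 1 - sigmaA^2
        var = 1.0 - sigma_a ** 2
        arg = 2.0 * sigma_a * eo * ec / var
        # i0e(x) = exp(-x) * I0(x), so this is a stable log I0(x)
        log_i0 = np.log(i0e(arg)) + arg
        return np.sum(np.log(2.0 * eo / var)
                      - (eo ** 2 + sigma_a ** 2 * ec ** 2) / var
                      + log_i0)

    def estimate_sigma_a(eo, ec):
        # eo, ec: |Eo| and |Ec| for the test-set reflections in one
        # resolution shell; returns the ML estimate of that shell's sigmaA
        res = minimize_scalar(lambda s: -shell_log_likelihood(s, eo, ec),
                              bounds=(1e-3, 1.0 - 1e-3), method='bounded')
        return res.x

Feeding estimate_sigma_a the working-set reflections instead is exactly where the feedback loop I describe in my earlier message below crept in.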

Randy

On 12 Dec 2012, at 16:04, Douglas Theobald wrote:

> On Dec 12, 2012, at 10:47 AM, Randy Read <rjr27 at CAM.AC.UK> wrote:
> 
>> On 12 Dec 2012, at 15:36, Douglas Theobald wrote:
>> 
>>> On Dec 12, 2012, at 1:46 AM, Ed Pozharski <epozh001 at UMARYLAND.EDU> wrote:
>>> 
>>>> On Tue, 2012-12-11 at 11:27 -0500, Douglas Theobald wrote:
>>>> 
>>>>> What is the evidence, if any, that the exptl sigmas are actually negligible compared to the fitted beta (is it alluded to in Lunin 2002)?  Is there somewhere in the phenix output where I can verify this myself?
>>>> 
>>>> Essentially, equation 4 in Lunin (2002) is the same as equation 14 in
>>>> Murshudov (1997), equation 1 in Cowtan (2005), or equation 12-79 in
>>>> Rupp (2010).  The difference is that instead of a combination of
>>>> sigf^2 and sigma_wc you have a single parameter, beta.  One can do
>>>> that assuming that sigf << sqrt(beta).  Phenix log files list the
>>>> optimized beta parameter in each resolution shell.
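>>>> 
>>>> Written out (this is just the generic acentric Rice form, with the
>>>> epsilon factors and the centric case left aside), all of those
>>>> equations are
>>>> 
>>>>   P(|Fo|) = \frac{2|Fo|}{V}
>>>>             \exp\left(-\frac{|Fo|^2 + \alpha^2 |Fc|^2}{V}\right)
>>>>             I_0\left(\frac{2 \alpha |Fo| |Fc|}{V}\right)
>>>> 
>>>> with V built from sigma_wc and sigf^2 in the Murshudov/Cowtan/Rupp
>>>> form and V = beta in the Lunin form; the two coincide when sigf^2
>>>> is negligible next to beta.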
>>> 
>>> From the log file: 
>>> 
>>> |-----------------------------------------------------------------------------|
>>> |R-free likelihood based estimates for figures of merit, absolute phase error,|
>>> |and distribution parameters alpha and beta (Acta Cryst. (1995). A51, 880-887)|
>>> |                                                                             |
>>> | Bin     Resolution      No. Refl.   FOM  Phase Scale    Alpha        Beta   |
>>> |  #        range        work  test        error factor                       |
>>> |  1: 44.4859 -  3.0705 14086   154  0.93  12.12   1.00     0.98     118346.13|
>>> |  2:  3.0705 -  2.4372 13777   149  0.91  15.26   1.00     0.99      58331.77|
>>> |  3:  2.4372 -  2.1291 13644   148  0.94  11.42   1.00     0.99      23216.31|
>>> 
>>> it appears that phenix estimates alpha and beta from the R-free set rather than from the working set (I might be misreading that).  Is that correct?
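>>> 
>>> Going back to my earlier question about verifying this myself: a
>>> quick check of Ed's sigf << sqrt(beta) assumption against those
>>> numbers might look like the sketch below (d_spacings and sig_f are
>>> hypothetical arrays taken from one's own data file, assumed to be
>>> on the same scale as the betas in the log):
>>> 
>>> import numpy as np
>>> 
>>> # beta per shell and shell boundaries (Angstrom), copied from the
>>> # log table above
>>> betas = np.array([118346.13, 58331.77, 23216.31])
>>> edges = np.array([44.4859, 3.0705, 2.4372, 2.1291])
>>> 
>>> def check_beta(d_spacings, sig_f):
>>>     # compare the mean experimental variance <sigf^2> in each
>>>     # shell to the fitted beta for that shell
>>>     for i, beta in enumerate(betas):
>>>         sel = (d_spacings <= edges[i]) & (d_spacings > edges[i + 1])
>>>         ratio = np.mean(sig_f[sel] ** 2) / beta
>>>         print("shell %d: <sigf^2>/beta = %.4f" % (i + 1, ratio))
>>> 
>>> If those ratios come out tiny, then treating sigf^2 as negligible
>>> next to beta seems justified for this data set.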
>> 
>> Yes, using the cross-validation data was a key step in getting maximum likelihood refinement to work.  A long time ago (a few years before our first paper on ML refinement) I implemented a first version of the MLF target we put into CNS, but with the sigmaA values estimated from the working data.  What happened was that the data would be over-fit, then the sigmaA estimates would go up (with part of the increase being a result of the over-fitting), then in the next cycle the pressure to fit the data relative to the restraints would be higher, and so on.
>> 
>> The best I could claim at the time was that the resulting models were at least as good as the ones from least-squares refinement, even though the R-factors were higher (the higher R-factors actually indicating less over-fitting).  It would have been hard to sell the advantage of higher R-factors to the protein crystallography community, so it was fortunate that, when we started using cross-validated sigmaA values, the convergence radius improved and we could get significantly better models with lower R-factors.  I think you'll find that all the programs, not just phenix.refine, use only the cross-validation data to estimate the variance parameters for the likelihood target.
> 
> Thanks Randy.  I must say I'm quite surprised by this, and coming from a likelihoodist/Bayesian POV it seems very wrong :).  I realize it works in practice, and evidently works quite well, but there's a very odd marriage of statistical philosophies going on here.  From a likelihood POV, the joint ML estimates really should come from the same data set, i.e. the working set (and there is the further issue that a likelihood "compromise" of sorts has already been made by excluding some of the data from estimation in the first place, a practice which violates the likelihood principle).  And from a frequentist cross-validation POV, you certainly should not be refining against the test set: that violates the very rationale of a test set.
> 
>> 
>> Randy
>> 
> 

------
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research      Tel: + 44 1223 336500
Wellcome Trust/MRC Building                   Fax: + 44 1223 336827
Hills Road                                    E-mail: rjr27 at cam.ac.uk
Cambridge CB2 0XY, U.K.                       www-structmed.cimr.cam.ac.uk


