[phenixbb] phenix and weak data
dtheobald at brandeis.edu
Wed Dec 12 09:30:46 PST 2012
On Dec 12, 2012, at 12:17 PM, Randy Read <rjr27 at CAM.AC.UK> wrote:
> I'm not too worried about refining sigmaA values against the test set, since there's only one parameter per resolution shell, and it doesn't (directly) affect the Fcalc values.
That's all good, and pragmatics should probably take priority. But it is somewhat concerning that you have to do it in the first place --- a parameter blowing up in a likelihood analysis is usually a good indication that something is fundamentally wrong, most often non-identifiability of some sort, or else an error in implementation. I wonder if anyone has recently revisited the issue with the latest software --- sometimes these problems just "disappear" when other issues get worked out.
> Nonetheless, I agree that it would be better in principle to refine all parameters against the working data, and maybe we could do that if we understood how to correct the variance estimates in the likelihood targets for the number of degrees of freedom.
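As a rough illustration of the degrees-of-freedom issue raised here, the toy simulation below fits a single parameter (a mean) to Gaussian "working" data and then estimates the residual variance twice: once from the working data and once from held-out "test" data. Everything in it is an assumption made for the sketch (the Gaussian model, the sample sizes, the function name `simulate`); it is not the crystallographic likelihood target and not how any refinement program does it. With n = 5 working points and p = 1 fitted parameter, the working-set estimate should come out low by roughly (n - p)/n, while the held-out estimate does not shrink and instead picks up the error in the fitted parameter.

```python
import random
import statistics


def simulate(n_work=5, n_test=5, sigma=1.0, trials=20000, seed=0):
    """Toy model (hypothetical): fit one parameter, the mean, to the working
    data, then estimate the residual variance from the working set and from a
    held-out test set. Returns the two estimates averaged over all trials."""
    rng = random.Random(seed)
    work_var, test_var = [], []
    for _ in range(trials):
        work = [rng.gauss(0.0, sigma) for _ in range(n_work)]
        test = [rng.gauss(0.0, sigma) for _ in range(n_test)]
        fitted = statistics.fmean(work)  # the single "refined" parameter
        # ML-style variance estimates: mean squared residual about the fit
        work_var.append(sum((x - fitted) ** 2 for x in work) / n_work)
        test_var.append(sum((x - fitted) ** 2 for x in test) / n_test)
    return statistics.fmean(work_var), statistics.fmean(test_var)


n, p = 5, 1
work_est, test_est = simulate(n_work=n, n_test=n)
corrected = work_est * n / (n - p)  # textbook n / (n - p) correction

print(f"true noise variance        : 1.000")
print(f"working-set estimate       : {work_est:.3f}")
print(f"working-set, df-corrected  : {corrected:.3f}")
print(f"held-out (test) estimate   : {test_est:.3f}")
```

In this toy case the correction is trivial because p is known exactly. The sketch says nothing about how to count effective parameters in a restrained refinement, which is the hard part of the suggestion above.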
> On 12 Dec 2012, at 16:04, Douglas Theobald wrote:
>> On Dec 12, 2012, at 10:47 AM, Randy Read <rjr27 at CAM.AC.UK> wrote:
>>> On 12 Dec 2012, at 15:36, Douglas Theobald wrote:
>>>> On Dec 12, 2012, at 1:46 AM, Ed Pozharski <epozh001 at UMARYLAND.EDU> wrote:
>>>>> On Tue, 2012-12-11 at 11:27 -0500, Douglas Theobald wrote:
>>>>>> What is the evidence, if any, that the exptl sigmas are actually negligible compared to fit beta (is it alluded to in Lunin 2002)? Is there somewhere in phenix output I can verify this myself?
>>>>> Essentially, equation 4 in Lunin (2002) is the same as equation 14 in
>>>>> Murshudov (1997) or equation 1 in Cowtan (2005) or 12-79 in Rupp (2010).
>>>>> The difference is that instead of a combination of sigf^2 and sigma_wc you
>>>>> have a single parameter, beta. One can do that assuming that
>>>>> sigf<<sqrt(beta). Phenix log files list the optimized beta parameter in
>>>>> each resolution shell.
>>>> From the log file:
>>>> |R-free likelihood based estimates for figures of merit, absolute phase error,|
>>>> |and distribution parameters alpha and beta (Acta Cryst. (1995). A51, 880-887)|
>>>> |                                                                             |
>>>> | Bin   Resolution        No. Refl.    FOM   Phase  Scale   Alpha        Beta |
>>>> |  #    range             work   test        error  factor                    |
>>>> |  1:   44.4859 - 3.0705  14086   154  0.93  12.12  1.00    0.98    118346.13 |
>>>> |  2:    3.0705 - 2.4372  13777   149  0.91  15.26  1.00    0.99     58331.77 |
>>>> |  3:    2.4372 - 2.1291  13644   148  0.94  11.42  1.00    0.99     23216.31 |
>>>> it appears that phenix estimates alpha and beta from the R-free set rather than from the working set (I might be misreading that). Is that correct?
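One way to check the sigf<<sqrt(beta) assumption by hand is to take the beta values from the three shells in the log excerpt and compare sqrt(beta) with the experimental sigmas in the same shell. The sketch below assumes, as a simplification, that the total variance is just sigf^2 + beta; the actual targets carry extra factors that are ignored here. The sigF values are hypothetical fractions of sqrt(beta), not values from any data set, and `dropped_fraction` is a name invented for the illustration. For these shells sqrt(beta) comes out at roughly 344, 242 and 152, on the same scale as the amplitudes.

```python
import math

# Beta values copied from the three resolution shells in the log excerpt.
betas = {1: 118346.13, 2: 58331.77, 3: 23216.31}


def dropped_fraction(sigf, beta):
    """Fraction of the total variance that is lost when sigf**2 is dropped.

    Assumes (simplification) that the total variance is sigf**2 + beta.
    """
    return sigf ** 2 / (sigf ** 2 + beta)


# Hypothetical sigF values, given as fractions of sqrt(beta).
ratios = (0.1, 0.3, 1.0)

for shell, beta in betas.items():
    root = math.sqrt(beta)
    print(f"shell {shell}: sqrt(beta) = {root:.1f}")
    for r in ratios:
        frac = dropped_fraction(r * root, beta)
        print(f"  sigF = {r:.1f} * sqrt(beta) -> {100 * frac:.1f}% of variance ignored")
```

Under this simplified sum, a sigF one tenth the size of sqrt(beta) changes the variance by about 1%, whereas a sigF comparable to sqrt(beta) accounts for half of it, so the approximation is only harmless in shells where the measured sigmas are well below the values printed by the loop.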
>>> Yes, using the cross-validation data was a key step in getting maximum likelihood refinement to work. A long time ago (a few years before our first paper on ML refinement) I implemented a first version of the MLF target we put into CNS, but the sigmaA values were estimated from the working data. What happened was that the data would be over-fit, then the sigmaA estimates would go up (with part of the increase being a result of the overfitting), then in the next cycle the pressure to fit the data compared to the restraints would be higher, and so on. The best I could claim for this at the time was that the resulting models were at least as good as the ones from least-squares refinement, but the R-factors were higher (indicating that there was still less over-fitting). It would have been hard to sell the advantage of higher R-factors to the protein crystallography community so it was good that, when we started using cross-validated sigmaA values, the convergence radius improved and we could get significantly better models with lower R-factors. I think you'll find that all the programs use just the cross-validation data to estimate the variance parameters for the likelihood target, not just phenix.refine.
>> Thanks Randy. I must say I'm quite surprised by this, and coming from a likelihoodist/Bayesian POV it seems very wrong :). I realize it works in practice, and works quite well evidently, but there's a very odd marriage of statistical philosophies going on here. From a likelihood POV, the joint ML estimates really should come from the same data set (i.e. the working set --- and there's the issue that there's already been a likelihood "compromise" of sorts by excluding some of the data from estimation anyway, a practice which violates the likelihood principle). And from a frequentist cross-validation POV, you certainly should not be refining against the test set --- that violates the very rationale of a test set.
>>>> phenixbb mailing list
>>>> phenixbb at phenix-online.org
>>> Randy J. Read
>>> Department of Haematology, University of Cambridge
>>> Cambridge Institute for Medical Research Tel: + 44 1223 336500
>>> Wellcome Trust/MRC Building Fax: + 44 1223 336827
>>> Hills Road E-mail: rjr27 at cam.ac.uk
>>> Cambridge CB2 0XY, U.K. www-structmed.cimr.cam.ac.uk