[cctbxbb] Question about sigma(I) calculation in merge_equivalents

Kay Diederichs Kay.Diederichs at uni-konstanz.de
Mon Sep 10 05:37:43 PDT 2012


Dear Keitaro, Phil, Luc,

after looking into the question of how to merge equivalent reflections 
(in the context of XDS data processing), I am now convinced that cctbx 
should _not_ use max(internal sigma, external sigma) by default.
Why?
- it is not obvious to the "end user" of the data that come out of cctbx 
that this formula was applied (I am not aware that the above formula is 
clearly documented)
- the data processing programs that I know of do not use this formula, 
but rather use the "external sigma". I don't imply that the "external 
sigma" is better, but deviating from a proven procedure violates the 
principle of least surprise.
- if the observations (that are merged in cctbx) come from one of these 
data processing programs then there is a chance that the error model was 
already adjusted such that the reduced chi**2 is near 1 (at least this 
is the case for XDS and SCALA). If the sigmas of the merged data then 
are adjusted upwards again (by the formula above), I have a strong 
feeling that this leads to an inconsistency.
- using max() seems ad hoc and lacks a clear rationale.
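
For reference, the two estimates being compared can be sketched as
follows. This is a minimal Python sketch assuming the usual
inverse-variance weighting of the observations; it is not taken from the
cctbx source, and merge_group and merged_sigma are hypothetical names
used only for illustration.

```python
import math


def merge_group(intensities, sigmas):
    """Merge one group of symmetry-equivalent observations.

    Returns (weighted mean, external sigma, internal sigma).
    Assumes inverse-variance weights w = 1/sigma**2 (an assumption of
    this sketch, not necessarily what cctbx does).
    """
    weights = [1.0 / (s * s) for s in sigmas]
    sum_w = sum(weights)
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum_w

    # "External" variance: propagated from the quoted sigmas only.
    var_ext = 1.0 / sum_w

    n = len(intensities)
    if n > 1:
        # "Internal" variance: from the weighted scatter about the mean.
        # Equals var_ext multiplied by the reduced chi**2 of the group.
        chi2 = sum(w * (i - mean) ** 2 for w, i in zip(weights, intensities))
        var_int = chi2 / ((n - 1) * sum_w)
    else:
        # A single observation carries no scatter information.
        var_int = 0.0

    return mean, math.sqrt(var_ext), math.sqrt(var_int)


def merged_sigma(sigma_ext, sigma_int, mode="external"):
    """Pick the sigma of the merged intensity (hypothetical option names)."""
    if mode == "external":
        return sigma_ext
    if mode == "internal":
        return sigma_int
    if mode == "max":
        return max(sigma_ext, sigma_int)
    raise ValueError("unknown mode: %r" % mode)
```

In this sketch the internal sigma equals the external sigma times the
square root of the group's reduced chi**2. For three observations 100,
110, 90 with sigma 10 each, the reduced chi**2 is 1 and both estimates
coincide; for groups scattered around chi**2 = 1, taking max() can only
move the merged sigma upward.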

I would like to see comparisons of downstream 
crystallographic calculations (most importantly, experimental phasing) 
using different ways of calculating the sigma of the merged data. Until 
such comparisons demonstrate the superiority of the above formula, I 
believe it should be an option, not a default.

thanks,

Kay



> Keitaro Yamashita yamashita at castor.sci.hokudai.ac.jp
> Thu Sep 6 09:20:34 PDT 2012
>
> Dear Luc,
>
> Thank you for your explanation! I understand better than before.
>
> Usually, sigmas given by data processing programs are already
> corrected based on their error model.
> Sigmas are adjusted to match the actual scatter.
> I think using the internal variance amounts to re-correcting the sigmas.
> Is that a valid approach? If the internal variance is bigger, does that
> suggest the error model is not perfect?
>
> I calculated external/internal variances using lysozyme (a standard
> sample in protein crystallography) data.
> Intensities and sigmas were determined by XDS.
>
> I attached two plots, where "Imean", "wsigma", "sigma" are the averaged
> intensity, internal sigma, and external sigma, respectively.
> One plot is a histogram of wsigma/sigma by multiplicity. For lower
> multiplicity, we can see extreme discrepancies. (Note that the
> vertical axes are not on the same scale.)
> The other plot is wsigma/sigma vs intensity. Extreme discrepancies can
> be seen at lower intensities.
> I hope this is interesting for you.
>
>
>  > Crystals uses only the external variance by default, the reasoning
> being that the internal variance, being based on sample statistics, is
> almost always too unreliable because groups of equivalent reflections
> are too small.
>
> Then, I think it would be nice if we could choose the method in cctbx.
> I mean, an option to choose "use the bigger one" or "always use the
> external/internal variance" would be nice to have.
>
> I am looking forward to your comment.
>
> Best regards,
> Keitaro
>
> 2012/9/6 Luc Bourhis <luc_j_bourhis at mac.com>:
>  > Dear Keitaro,
>  >
>  >> But it is still unclear to me why it takes the greatest of the
>  >> "internal" variance and "external" variance.
>  >> Is it based on some tests using real data? or is it theoretically
>  >> superior to using always external variance?
>  >
>  > Those are good questions and to be honest I do not know for sure the
> answer to them. As it seems common in applied statistics, the treatment
> starts with by-the-book methods relying on a well defined theory but at
> the end there is always a completely heuristic twist. Particularly true
> in crystallography I would argue. But let me try to give some rationales.
>  >
>  > It seems to me that the internal and external variance should not
> differ too much. Let's consider the two ways this may not be true.
>  >
>  > 1. The quoted intensities of a group of equivalent reflections have a
> small spread, leading to a small internal variance, but the quoted
> sigma's are comparatively big, resulting in an external variance
> significantly bigger than the internal one. This is a possible event but
> an unlikely one: the statistical intuition in that case is to say that
> the small internal variance is a fluke and to use the external one instead.
>  >
>  > 2. An external variance significantly smaller than the internal one
> should ring an alarm bell. Indeed, a small external variance means that
> the small quoted sigma's strongly suggest the intensities cannot spread
> too far from their assumed common true value, whereas the comparatively
> bigger internal variance blatantly contradicts that. Thus either the
> intensities or the sigma's have not been correctly determined.
> Crystallographers seem to err on the side of trusting data here, i.e. to
> disregard the sigma's, and therefore to choose the internal variance.
>  >
>  >> I would like to know how this method affects further
> crystallographic process.
>  >
>  > I am afraid I do not have experience with your domain, protein
> crystallography. I know that the small molecule program Crystals uses
> only the external variance by default, the reasoning being that the
> internal variance, being based on sample statistics, is almost always too
> unreliable because groups of equivalent reflections are too small. Since
> Crystals is as well accepted as ShelXL to produce publishable
> structures, it answers your question in at least one corner of
> crystallography, unfortunately not yours.
>  >
>  > I think it could be a simple and interesting exercise to take a
> representative protein dataset of yours, then to print the redundancy,
> the internal, and the external variance. Actually I would be surprised
> if such a study has not already been done and published. Perhaps some of
> the gurus on this forum can shed more light on that subject.
>  >
>  > Best wishes,
>  >
>  > Luc

-- 
Kay Diederichs                http://strucbio.biologie.uni-konstanz.de
email: Kay.Diederichs at uni-konstanz.de    Tel +49 7531 88 4049 Fax 3183
Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz