Data quality assessment with phenix.xtriage


	Python-based Hierarchical ENvironment for Integrated Xtallography
Documentation Home

Data quality assessment with phenix.xtriage

Author(s)
Purpose
Usage: How xtriage works; Output files from xtriage; Xtriage keywords in detail; Interpreting Xtriage output
Examples: Standard run of xtriage
Possible Problems: Specific limitations and problems
Literature
Additional information: List of all xtriage keywords

Author(s)

xtriage: Peter Zwart
Phil command interpreter: Ralf W. Grosse-Kunstleve

Purpose

The xtriage method is a tool for analyzing structure factor data to identify outliers, presence of twinning and other conditions that the user should be aware of.

Usage

How xtriage works

Basic sanity checks performed by xtriage are

Wilson plot sanity
Probabilistic Matthews analysis
Data strength analysis
Ice ring analysis
Twinning analysis
Reference analysis (determines possible re-indexing. optional)
Detwinning and data massaging (optional)

See also: phenix.reflection_statistics (comparison of multiple data sets)

Output files from xtriage

(1) A log file that contains all the screen output plus some ccp4 style graphs

(2) optional: an mtz file with massaged data

Xtriage keywords in detail

Scope: parameters.asu_contents

keys: * n_residues :: Number of residues per monomer/unit
      * n_bases :: Number of nucleotides per monomer/unit
      * n_copies_per_asu :: Number of copies in the ASU.

These keywords control the determination of the absolute scale. If the number of residues/bases is not specified, a solvent content of 50% is assumed. Scope: parameters.misc_twin_parameters.missing_symmetry

keys: * tanh_location :: tanh decision rule parameter
      * tanh_slope :: tanh decision rule parameter

The tanh_location and tanh_slope parameter control what R-value is considered to be low enough to be considered a 'proper' symmetry operator. the tanh_location parameter corresponds to the inflection point of the approximate step function. Increasing tanh_location will result in large R-value thresholds. tanh_slope is set to 50 and should be okai. Scope: parameters.misc_twin_parameters.twinning_with_ncs

keys: * perform_test :: can be set to True or False
      * n_bins :: Number of bins in determination of D_ncs

The perform_test is by default set to False. Setting it to True triggers the determination of the twin fraction while taking into account NCS parallel to the twin axis. Scope: parameters.misc_twin_parameters.twin_test_cuts

keys: * high_resolution : high resolution for twin tests
      * low_resolution: low resolution for twin tests
      * isigi_cut: I/sig(I) threshold in automatic determination
                   of high resolution limit
      * completeness_cut: completeness threshold in automatic
                          determination of high resolution limit

The automatic determination of the resolution limit for the twinning test is determined on the basis of the completeness after removing intensities for which I/sigI < isigi_cut. The lowest limit obtain in this way is 3.5A. The value determined by the automatic procedure can be overruled by specification of the high_resolution keyword. The low resolution is set to 10A by default. Scope: parameters.reporting

keys: * verbose :: verbosity level.
      * log :: log file name
      * ccp4_style_graphs :: Either True or False. Determines whether or
                             not ccp4 style logfgra plots are written to the
                             log file

Scope: xray_data

keys: * file_name :: file name with xray data.
      * obs_labels :: labels for observed data is format is mtz or XPLOR/CNS
      * calc_labels :: optional; labels for calculated data
      * unit_cell :: overrides unit cell in reflection file (if present)
      * space_group :: overrides space group in reflection file (if present)
      * high_resolution :: High resolution limit of the data
      * low_resolution :: Low resolution limit of the data

Note that the matching of specified and present labels involves a sub-string matching algorithm. Scope: optional

keys: * hklout :: output mtz file
      * twinning.action :: Whether to detwin the data
      * twinning.twin_law :: using this twin law (h,k,l or x,y,z notation)
      * twinning.fraction :: The detwinning fraction.
      * b_value :: the resulting Wilson B value

The output mtz file contains an anisotropy corrected mtz file, with suspected outliers removed. The data is put scaled and has the specified Wilson B value. These options have an associated expert level of 10, and are not shown by default. Specification of the expert level on the command line as 'level=100' will show all available options.

Interpreting Xtriage output

Typing:

%phenix.xtriage some_data.sca residues=290 log=some_data.log

results in the following output (parts omitted). Matthews analysis First, a cell contents analysis is performed. Matthews coefficients, solvent content and solvent content probabilities are listed, and the most likely composition is guessed

Matthews coefficient and Solvent content statistics
----------------------------------------------------------------
| Copies | Solvent content | Matthews Coed. | P(solvent cont.) |
|--------|-----------------|----------------|------------------|
|      1 |      0.705      |      4.171     |       0.241      |
|      2 |      0.411      |      2.085     |       0.750      |
|      3 |      0.116      |      1.390     |       0.009      |
----------------------------------------------------------------
|              Best guess :    2  copies in the asu            |
----------------------------------------------------------------

Data strength The next step, the strength of the data is gauged by determining the completeness of the in resolution bins after application of several I/sigI cut off values

Completeness and data strength analysis

  The following table lists the completeness in various resolution
  ranges, after applying a I/sigI cut. Miller indices for which
  individual I/sigI values are larger than the value specified in
  the top row of the table, are retained, while other intensities
  are discarded. The resulting completeness profiles are an indication
  of the strength of the data.

----------------------------------------------------------------------------------------
| Res. Range   | I/sigI>1  | I/sigI>2  | I/sigI>3  | I/sigI>5  | I/sigI>10 | I/sigI>15 |
----------------------------------------------------------------------------------------
| 19.87 - 7.98 | 96.4%     | 95.3%     | 94.5%     | 93.6%     | 91.7%     | 89.3%     |
|  7.98 - 6.40 | 99.2%     | 98.2%     | 97.1%     | 95.5%     | 90.9%     | 84.7%     |
|  6.40 - 5.61 | 97.8%     | 95.4%     | 93.3%     | 87.1%     | 76.6%     | 66.8%     |
|  5.61 - 5.11 | 98.2%     | 95.9%     | 94.0%     | 87.9%     | 74.1%     | 58.0%     |
|  5.11 - 4.75 | 97.9%     | 96.2%     | 94.5%     | 91.1%     | 79.2%     | 62.5%     |
|  4.75 - 4.47 | 97.4%     | 95.4%     | 93.1%     | 88.9%     | 76.6%     | 56.9%     |
|  4.47 - 4.25 | 96.5%     | 94.5%     | 92.1%     | 88.0%     | 75.3%     | 56.5%     |
|  4.25 - 4.07 | 96.6%     | 94.0%     | 91.2%     | 85.4%     | 69.3%     | 44.9%     |
|  4.07 - 3.91 | 95.6%     | 92.1%     | 87.8%     | 80.1%     | 61.9%     | 34.8%     |
|  3.91 - 3.78 | 94.3%     | 89.6%     | 83.7%     | 71.1%     | 48.7%     | 20.5%     |
|  3.78 - 3.66 | 95.7%     | 90.9%     | 85.6%     | 71.5%     | 42.4%     | 14.8%     |
|  3.66 - 3.56 | 91.6%     | 85.0%     | 78.0%     | 63.3%     | 34.1%     | 9.5%      |
|  3.56 - 3.46 | 89.8%     | 80.4%     | 70.2%     | 52.8%     | 22.2%     | 3.8%      |
|  3.46 - 3.38 | 87.4%     | 76.3%     | 64.6%     | 46.7%     | 15.5%     | 1.7%      |
----------------------------------------------------------------------------------------

This analysis is also used in the automatic determination of the high resolution limit used in the intensity statistics and twin analyses. Absolute, likelihood based Wilson scaling The (anisotropic) B value of the data is determined using a likelihood based approach. The resulting B value/tensor is reported:

Maximum likelihood isotropic Wilson scaling
ML estimate of overall B value of sec17.sca:i_obs,sigma:
75.85 A**(-2)
Estimated -log of scale factor of sec17.sca:i_obs,sigma:
-2.50

Maximum likelihood anisotropic Wilson scaling
ML estimate of overall B_cart value of sec17.sca:i_obs,sigma:
68.92,  0.00,  0.00
       68.92,  0.00
              91.87
Equivalent representation as U_cif:
 0.87, -0.00, -0.00
        0.87,  0.00
               1.16

ML estimate of  -log of scale factor of sec17.sca:i_obs,sigma:
-2.50
Correcting for anisotropy in the data

A large spread in (especially the diagonal) values indicates anisotropy. The anisotropy is corrected for. This clears up intensity statistics. Low resolution completeness analysis Mostly data processing software do not provide a clear picture of the completeness of the data at low resolution. For this reason, xtriage lists the completeness of the data up to 5 Angstrom:

Low resolution completeness analysis

 The following table shows the completeness
 of the data to 5 Angstrom.
unused:         - 19.8702 [  0/68 ] 0.000
bin  1: 19.8702 - 10.3027 [425/455] 0.934
bin  2: 10.3027 -  8.3766 [443/446] 0.993
bin  3:  8.3766 -  7.3796 [446/447] 0.998
bin  4:  7.3796 -  6.7336 [447/449] 0.996
bin  5:  6.7336 -  6.2673 [450/454] 0.991
bin  6:  6.2673 -  5.9080 [428/429] 0.998
bin  7:  5.9080 -  5.6192 [459/466] 0.985
bin  8:  5.6192 -  5.3796 [446/450] 0.991
bin  9:  5.3796 -  5.1763 [437/440] 0.993
bin 10:  5.1763 -  5.0006 [460/462] 0.996
unused:  5.0006 -         [  0/0  ]

This analysis allows one to quickly see if there is any unusually low completeness at low resolution, for instance due to missing overloads. Wilson plot analysis A Wilson plot analysis a la ARP/wARP is carried out, albeit with a slightly different standard curve:

Mean intensity analysis
 Analysis of the mean intensity.
 Inspired by: Morris et al. (2004). J. Synch. Rad.11, 56-59.
 The following resolution shells are worrisome:
------------------------------------------------
| d_spacing | z_score | compl. | <Iobs>/<Iexp> |
------------------------------------------------
|    5.773  |   7.95  |   0.99 |     0.658     |
|    5.423  |   8.62  |   0.99 |     0.654     |
|    5.130  |   6.31  |   0.99 |     0.744     |
|    4.879  |   5.36  |   0.99 |     0.775     |
|    4.662  |   4.52  |   0.99 |     0.803     |
|    3.676  |   5.45  |   0.99 |     1.248     |
------------------------------------------------

 Possible reasons for the presence of the reported
 unexpected low or elevated mean intensity in
 a given resolution bin are :
 - missing overloaded or weak reflections
 - suboptimal data processing
 - satellite (ice) crystals
 - NCS
 - translational pseudo symmetry (detected elsewhere)
 - outliers (detected elsewhere)
 - ice rings (detected elsewhere)
 - other problems
 Note that the presence of abnormalities
 in a certain region of reciprocal space might
 confuse the data validation algorithm throughout
 a large region of reciprocal space, even though
 the data is acceptable in those areas.

A very long list of warnings could indicate a serious problem with your data. Decisions on whether or not the data is useful, should be cut or should thrown away altogether, is not straightforward and falls beyond the scope of xtriage. Outlier detection and rejection Possible outliers are detected on the basis Wilson statistics:

Possible outliers
 Inspired by: Read, Acta Cryst. (1999). D55, 1759-1764

Acentric reflections:

-----------------------------------------------------------------
| d_space |      H     K     L |  |E|  | p(wilson) | p(extreme) |
-----------------------------------------------------------------
|   3.716 |      8,    6,   31 |  3.52 |  4.06e-06 |   5.87e-02 |
-----------------------------------------------------------------

p(wilson)  : 1-(1-exp[-|E|^2])
p(extreme) : 1-(1-exp[-|E|^2])^(n_acentrics)
p(wilson) is the probability that an E-value of the specified
value would be observed when it would selected at random from
the given data set.
p(extreme) is the probability that the largest |E| value is
larger or equal than the observed largest |E| value.

Both measures can be used for outlier detection. p(extreme)
takes into account the size of the data set.

Outliers are removed from the data set in the further analysis. Note that if pseudo translational symmetry is present, a large number of 'outliers' will be present. Ice ring detection Ice rings in the data are detected by analyzing the completeness and the mean intensity:

Ice ring related problems

 The following statistics were obtained from ice-ring
 insensitive resolution ranges
  mean bin z_score      : 3.47
      ( rms deviation   : 2.83 )
  mean bin completeness : 0.99
     ( rms deviation   : 0.00 )

 The following table shows the z-scores
 and completeness in ice-ring sensitive areas.
 Large z-scores and high completeness in these
 resolution ranges might be a reason to re-assess
 your data processing if ice rings were present.

------------------------------------------------
| d_spacing | z_score | compl. | Rel. Ice int. |
------------------------------------------------
|    3.897  |   0.12  |   0.97 |     1.000     |
|    3.669  |   0.96  |   0.95 |     0.750     |
|    3.441  |   2.14  |   0.94 |     0.530     |
------------------------------------------------

 Abnormalities in mean intensity or completeness at
 resolution ranges with a relative ice ring intensity
 lower then 0.10 will be ignored.

 At 3.67 A there is an lower occupancy
  then expected from the rest of the data set.
  Even though the completeness is lower as expected,
  the mean intensity is still reasonable at this resolution

 At 3.44 A there is an lower occupancy
  then expected from the rest of the data set.
  Even though the completeness is lower as expected,
  the mean intensity is still reasonable at this resolution

 There were  2 ice ring related warnings
 This could indicate the presence of ice rings.

Anomalous signal If the input reflection file contains separate intensities for each Friedel mate, a quality measure of the anomalous signal is reported:

Analysis of anomalous differences

  Table of measurability as a function of resolution

  The measurability is defined as the fraction of
  Bijvoet related intensity differences for which
  |delta_I|/sigma_delta_I > 3.0
  min[I(+)/sigma_I(+), I(-)/sigma_I(-)] > 3.0
  holds.
  The measurability provides an intuitive feeling
  of the quality of the data, as it is related to the
  number of reliable Bijvoet differences.
  When the data is processed properly and the standard
  deviations have been estimated accurately, values larger
  than 0.05 are encouraging.

unused:         - 19.8704 [   0/68  ]
bin  1: 19.8704 -  7.0211 [1551/1585]  0.1924
bin  2:  7.0211 -  5.6142 [1560/1575]  0.0814
bin  3:  5.6142 -  4.9168 [1546/1555]  0.0261
bin  4:  4.9168 -  4.4729 [1563/1582]  0.0081
bin  5:  4.4729 -  4.1554 [1557/1577]  0.0095
bin  6:  4.1554 -  3.9124 [1531/1570]  0.0083
bin  7:  3.9124 -  3.7178 [1541/1585]  0.0069
bin  8:  3.7178 -  3.5569 [1509/1552]  0.0028
bin  9:  3.5569 -  3.4207 [1522/1606]  0.0085
bin 10:  3.4207 -  3.3032 [1492/1574]  0.0044
unused:  3.3032 -         [   0/0   ]

 The anomalous signal seems to extend to about 5.9 A
 (or to 5.2 A, from a more optimistic point of view)
 The quoted resolution limits can be used as a guideline
 to decide where to cut the resolution for phenix.hyss
 As the anomalous signal is not very strong in this data set
 substructure solution via SAD might prove to be a challenge.
 Especially if only low resolution reflections are used,
 the resulting substructures could contain a significant amount of
 of false positives.

Determination of twin laws Twin laws are found using a modified le-Page algorithm and classified as merohedral and pseudo merohedral:

Determining possible twin laws.

The following twin laws have been found:

-------------------------------------------------------------------------------
| Type | Axis   | R metric (%) | delta (le Page) | delta (Lebedev) | Twin law
|
-------------------------------------------------------------------------------
|   M  | 2-fold | 0.000        | 0.000           | 0.000           | -h,k,-l
|
-------------------------------------------------------------------------------
M:  Merohedral twin law
PM: Pseudomerohedral twin law

  1 merohedral twin operators found
  0 pseudo-merohedral twin operators found
In total,   1 twin operator were found

Non-merohedral (reticular) twinning is not considered. The R-metric is equal to :

Sum (M_i-N_i)^2 / Sum M_i^2

M_i are elements of the original metric tensor and N_i are elements of the metric tensor after 'idealizing' the unit cell, in compliance with the restrictions the twin law poses on the lattice if it would be a 'true' symmetry operator. The delta le-Page is the familiar obliquity. The delta Lebedev is a twin law quality measure developed by A. Lebedev (Lebedev, Vagin & Murshudov; Acta Cryst. (2006). D62, 83-95.). Note that for merohedral twin laws, all quality indicators are 0. For non-merohedral twin laws, this value is larger or equal to zero. If a twin law is classified as non-merohedral, but has a delta le-page equal to zero, the twin law is sometimes referred to as a metric merohedral twin law. Locating translational pseudo symmetry (TPS) TPS is located by inspecting a low resolution Patterson function. Peaks and their significance levels are reported:

Largest Patterson peak with length larger then 15 Angstrom

Frac. coord.        :   0.027    0.057    0.345
Distance to origin  :  17.444
Height (origin=100) :   3.886
p_value(height)     :   9.982e-01

  The reported p_value has the following meaning:
    The probability that a peak of the specified height
    or larger is found in a Patterson function of a
    macro molecule that does not have any translational
    pseudo symmetry is equal to  9.982e-01
    p_values smaller then 0.05 might indicate
    weak translation pseudo symmetry, or the self vector of
    a large anomalous scatterer such as Hg, whereas values
    smaller then 1e-3 are a very strong indication for
    the presence of translational pseudo symmetry.

Moments of the observed intensities The moment of the observed intensity/amplitude distribution, are reported, as well as their expected values:

Wilson ratio and moments

Acentric reflections
   <I^2>/<I>^2    :1.955   (untwinned: 2.000; perfect twin 1.500)
   <F>^2/<F^2>    :0.796   (untwinned: 0.785; perfect twin 0.885)
   <|E^2 - 1|>    :0.725   (untwinned: 0.736; perfect twin 0.541)

Centric reflections
   <I^2>/<I>^2    :2.554   (untwinned: 3.000; perfect twin 2.000)
   <F>^2/<F^2>    :0.700   (untwinned: 0.637; perfect twin 0.785)
   <|E^2 - 1|>    :0.896   (untwinned: 0.968; perfect twin 0.736)

Significant departure from the ideal values could indicate the presence of twinning or pseudo translations. For instance, an <I^2>/<I>^2 value significantly lower than 2.0, might point to twinning, whereas a value significantly larger than 2.0, might point towards pseudo translational symmetry. Cumulative intensity distribution The cumulative intensity distribution is reported:

-----------------------------------------------
|  Z  | Nac_obs | Nac_theo | Nc_obs | Nc_theo |
-----------------------------------------------
| 0.0 |   0.000 |    0.000 |  0.000 |   0.000 |
| 0.1 |   0.081 |    0.095 |  0.168 |   0.248 |
| 0.2 |   0.167 |    0.181 |  0.292 |   0.345 |
| 0.3 |   0.247 |    0.259 |  0.354 |   0.419 |
| 0.4 |   0.321 |    0.330 |  0.420 |   0.474 |
| 0.5 |   0.392 |    0.394 |  0.473 |   0.520 |
| 0.6 |   0.452 |    0.451 |  0.521 |   0.561 |
| 0.7 |   0.506 |    0.503 |  0.570 |   0.597 |
| 0.8 |   0.552 |    0.551 |  0.603 |   0.629 |
| 0.9 |   0.593 |    0.593 |  0.636 |   0.657 |
| 1.0 |   0.635 |    0.632 |  0.673 |   0.683 |
-----------------------------------------------
| Maximum deviation acentric      :  0.015    |
| Maximum deviation centric       :  0.080    |
|                                             |
| <NZ(obs)-NZ(twinned)>_acentric  : -0.004    |
| <NZ(obs)-NZ(twinned)>_centric   : -0.039    |
-----------------------------------------------

The N(Z) test is related to the moments based test discussed above. Nac_obs is the observed cumulative distribution of normalized intensities of the acentric data, and uses the full distribution rather then just a moment. The effects of twinning shows itself for Nac_obs having a more sigmoidal character. In the case of pseudo centering, Nac_obs will tend towards Nc_theo. The L test The L-test is an intensity statistic developed by Padilla and Yeates (Acta Cryst. (2003), D59: 1124-1130) and is reasonably robust in the presence of anisotropy and pseudo centering, especially if the miller indices are partitioned properly. Partitioning is carried out on the basis of a Patterson analysis. A significant deviation of both <|L|> and <L^2> from the expected values indicate twinning or other problems:

 L test for acentric data

 using difference vectors (dh,dk,dl) of the form:
(2hp,2kp,2lp)
  where hp, kp, and lp are random signed integers such that
  2 <= |dh| + |dk| + |dl| <= 8

  Mean |L|   :0.482  (untwinned: 0.500; perfect twin: 0.375)
  Mean  L^2  :0.314  (untwinned: 0.333; perfect twin: 0.200)

  The distribution of |L| values indicates a twin fraction of
  0.00. Note that this estimate is not as reliable as obtained
  via a Britton plot or H-test if twin laws are available.

Whether or not the <|L|> and <L^2> differ significantly from the expected values, is shown in the final summary (see below). Analysis of twin laws Twin law specific tests (Britton, H and RvsR) are performed:

Results of the H-test on a-centric data:

 (Only 50.0% of the strongest twin pairs were used)

mean |H| : 0.183   (0.50: untwinned; 0.0: 50% twinned)
mean H^2 : 0.055   (0.33: untwinned; 0.0: 50% twinned)
Estimation of twin fraction via mean |H|: 0.317
Estimation of twin fraction via cum. dist. of H: 0.308

Britton analysis

  Extrapolation performed on  0.34 < alpha < 0.495
  Estimated twin fraction: 0.283
  Correlation: 0.9951

R vs R statistic:
  R_abs_twin = <|I1-I2|>/<|I1+I2|>
  Lebedev, Vagin, Murshudov. Acta Cryst. (2006). D62, 83-95

   R_abs_twin observed data   : 0.193
   R_abs_twin calculated data : 0.328

  R_sq_twin = <(I1-I2)^2>/<(I1+I2)^2>
   R_sq_twin observed data    : 0.044
   R_sq_twin calculated data  : 0.120

Maximum Likelihood twin fraction determination
    Zwart, Read, Grosse-Kunstleve & Adams, to be published.

   The estimated twin fraction is equal to 0.227

These tests allow one to estimate the twin fraction and (if calculated data is provided) determine if rotational pseudo symmetry is present. Another option (albeit more computationally expensive), is to estimate the correlation between error free, untwinned, twin related normalized intensities (use the key perform=True on the command line)

Estimation of twin fraction, while taking into account the
effects of possible NCS parallel to the twin axis.
    Zwart, Read, Grosse-Kunstleve & Adams, to be published.

  A parameters D_ncs will be estimated as a function of resolution,
  together with a global twin fraction.
  D_ncs is an estimate of the correlation coefficient between
  untwinned, error-free, twin related, normalized intensities.
  Large values (0.95) could indicate an incorrect point group.
  Value of D_ncs larger than say, 0.5, could indicate the presence
  of NCS. The twin fraction should be smaller or similar to other
  estimates given elsewhere.

  The refinement can take some time.
  For numerical stability issues, D_ncs is limited between 0 and 0.95.
  The twin fraction is allowed to vary between 0 and 0.45.
  Refinement cycle numbers are printed out to keep you entertained.

. . . .   5  . . . .  10  . . . .  15  . . . .  20  . . . .  25  . . . .  30
. . . .  35  . . . .  40  . . . .  45  . . . .  50  . . . .  55  . . . .  60
. . . .  65  . . . .  70  . . . .  75  . . .

  Cycle :  78
  -----------
  Log[likelihood]:       22853.700
  twin fraction: 0.201
  D_ncs in resolution ranges:
     9.8232 -- 4.5978 :: 0.830
     4.5978 -- 3.7139 :: 0.775
     3.7139 -- 3.2641 :: 0.745
     3.2641 -- 2.9747 :: 0.746
     2.9747 -- 2.7666 :: 0.705
     2.7666 -- 2.6068 :: 0.754
     2.6068 -- 2.4784 :: 0.735

 The correlation of the calculated F^2 should be similar to
 the estimated values.

 Observed correlation between twin related, untwinned calculated F^2
 in resolution ranges, as well as estimates D_ncs^2 values:
 Bin    d_max     d_min     CC_obs   D_ncs^2
  1)    9.8232 -- 4.5978 ::  0.661    0.689
  2)    4.5978 -- 3.7139 ::  0.544    0.601
  3)    3.7139 -- 3.2641 ::  0.650    0.556
  4)    3.2641 -- 2.9747 ::  0.466    0.557
  5)    2.9747 -- 2.7666 ::  0.426    0.497
  6)    2.7666 -- 2.6068 ::  0.558    0.569
  7)    2.6068 -- 2.4784 ::  0.531    0.540

The twin fraction obtained via this method is usually lower than what is obtained by refinement. The estimated correlation coefficient (D_ncs^2) between the twin related F^2 values, is however reasonably accurate. Exploring higher metric symmetry The fact that a twin law is present, could indicate that the data was incorrectly processed as well. The example below, shows a P41212 data set processed in P1:

Exploring higher metric symmetry

Point group of data as dictated by the space group is P 1
  the point group in the Niggli setting is P 1
The point group of the lattice is P 4 2 2
A summary of R values for various possible point groups follow.

-----------------------------------------------------------------------------------------------
| Point group              | mean R_used | max R_used | mean R_unused | min R_unused | choice |
-----------------------------------------------------------------------------------------------
| P 1                      | None        | None       | 0.022         | 0.017        |        |
| P 4 2 2                  | 0.022       | 0.025      | None          | None         | <---   |
| P 1 2 1                  | 0.017       | 0.017      | 0.026         | 0.024        |        |
| Hall:  C 2y (x-y,x+y,z)  | 0.025       | 0.025      | 0.022         | 0.017        |        |
| P 4                      | 0.025       | 0.028      | 0.025         | 0.025        |        |
| Hall:  C 2 2 (x-y,x+y,z) | 0.024       | 0.025      | 0.017         | 0.017        |        |
| Hall:  C 2y (x+y,-x+y,z) | 0.024       | 0.024      | 0.023         | 0.017        |        |
| P 1 1 2                  | 0.028       | 0.028      | 0.021         | 0.017        |        |
| P 2 1 1                  | 0.027       | 0.027      | 0.022         | 0.017        |        |
| P 2 2 2                  | 0.023       | 0.028      | 0.025         | 0.025        |        |
-----------------------------------------------------------------------------------------------

R_used: mean and maximum R value for symmetry operators *used* in this point group
R_unused: mean and minimum R value for symmetry operators *not used* in this point group
The likely point group of the data is:  P 4 2 2

As in phenix.explore_metric_symmetry, the possible space groups are listed as well (not shown here). Twin analysis summary The results of the twin analysis are summarized. Typical outputs look as follows for cases of wrong symmetry, twin laws but no suspected twinning and twinned data respectively. Wrong symmetry:

-------------------------------------------------------------------------------
Twinning and intensity statistics summary (acentric data):

Statistics independent of twin laws
  - <I^2>/<I>^2 : 2.104
  - <F>^2/<F^2> : 0.770
  - <|E^2-1|>   : 0.757
  - <|L|>, <L^2>: 0.512, 0.349
       Multivariate Z score L-test: 2.777
       The multivariate Z score is a quality measure of the given
       spread in intensities. Good to reasonable data is expected
       to have a Z score lower than 3.5.
       Large values can indicate twinning, but small values do not
       necessarily exclude it.

Statistics depending on twin laws
------------------------------------------------------
| Operator | type | R obs. | Britton alpha | H alpha |
------------------------------------------------------
| k,h,-l   |  PM  | 0.025  | 0.458         | 0.478   |
| -h,k,-l  |  PM  | 0.017  | 0.459         | 0.487   |
| -k,h,l   |  PM  | 0.024  | 0.458         | 0.478   |
| -k,-h,-l |  PM  | 0.024  | 0.458         | 0.478   |
| -h,-k,l  |  PM  | 0.028  | 0.458         | 0.476   |
| h,-k,-l  |  PM  | 0.027  | 0.458         | 0.477   |
| k,-h,l   |  PM  | 0.024  | 0.457         | 0.478   |
------------------------------------------------------

Patterson analysis
  - Largest peak height   : 6.089
   (corresponding p value : 6.921e-01)

The largest off-origin peak in the Patterson function is 6.09% of the
height of the origin peak. No significant pseudo-translation is detected.

The results of the L-test indicate that the intensity statistics
behave as expected. No twinning is suspected.
The symmetry of the lattice and intensity however suggests that the
input space group is too low. See the relevant sections of the log
file for more details on your choice of space groups.
As the symmetry is suspected to be incorrect, it is advisable to reconsider
data processing.
-------------------------------------------------------------------------------

Twin laws present but no suspected twinning:

-------------------------------------------------------------------------------
Twinning and intensity statistics summary (acentric data):

Statistics independent of twin laws
  - <I^2>/<I>^2 : 1.955
  - <F>^2/<F^2> : 0.796
  - <|E^2-1|>   : 0.725
  - <|L|>, <L^2>: 0.482, 0.314
       Multivariate Z score L-test: 1.225
       The multivariate Z score is a quality measure of the given
       spread in intensities. Good to reasonable data is expected
       to have a Z score lower than 3.5.
       Large values can indicate twinning, but small values do not
       necessarily exclude it.

Statistics depending on twin laws
------------------------------------------------------
| Operator | type | R obs. | Britton alpha | H alpha |
------------------------------------------------------
| -h,k,-l  |   M  | 0.455  | 0.016         | 0.035   |
------------------------------------------------------

Patterson analysis
  - Largest peak height   : 3.886
   (corresponding p value : 9.982e-01)

The largest off-origin peak in the Patterson function is 3.89% of the
height of the origin peak. No significant pseudo-translation is detected.

The results of the L-test indicate that the intensity statistics
behave as expected. No twinning is suspected.
Even though no twinning is suspected, it might be worthwhile carrying out
a refinement using a dedicated twin target anyway, as twinned structures with
low twin fractions are difficult to distinguish from non-twinned structures.

-------------------------------------------------------------------------------

Twinned data:

-------------------------------------------------------------------------------
Twinning and intensity statistics summary (acentric data):

Statistics independent of twin laws
  - <I^2>/<I>^2 : 1.587
  - <F>^2/<F^2> : 0.871
  - <|E^2-1|>   : 0.568
  - <|L|>, <L^2>: 0.387, 0.212
       Multivariate Z score L-test: 11.589
       The multivariate Z score is a quality measure of the given
       spread in intensities. Good to reasonable data is expected
       to have a Z score lower than 3.5.
       Large values can indicate twinning, but small values do not
       necessarily exclude it.

Statistics depending on twin laws
------------------------------------------------------
| Operator | type | R obs. | Britton alpha | H alpha |
------------------------------------------------------
| -l,-k,-h |  PM  | 0.170  | 0.330         | 0.325   |
------------------------------------------------------

Patterson analysis
  - Largest peak height   : 7.300
   (corresponding p value : 4.454e-01)

The largest off-origin peak in the Patterson function is 7.30% of the
   height of the origin peak. No significant pseudo-translation is detected.

The results of the L-test indicate that the intensity statistics
are significantly different then is expected from good to reasonable,
untwinned data.
As there are twin laws possible given the crystal symmetry, twinning could
be the reason for the departure of the intensity statistics from normality.
It might be worthwhile carrying refinement with a twin specific target
function.
-------------------------------------------------------------------------------

In the summary, the significance of the departure of the values of the L-test from normality are stated. The multivariate Z-score (also known as the Mahalanobis distance) is used for this purpose.

Examples

Standard run of xtriage

Running xtriage is easy. From the command-line you can type:

phenix.xtriage data.sca

When an MTZ or CNS file is used, labels have to be specified:

phenix.xtriage file=my_brilliant_data.mtz obs_labels='F(+),SIGF(+),F(-),SIGF(-)'

In order to perform a Matthews analysis, it might be useful to specify the number of residues/nucleotides in the crystallized macro molecule:

phenix.xtriage data.sca n_residues=230 n_bases=25

By default, the screen output plus additional ccp4 style graphs (viewable with the ccp4 programs loggraph) are echoed to a file named logfile.log. The command line arguments and all other defaults settings are summarized in a PHIL parameter data block given at the beginning of the logfile / screen output:


scaling.input {
  parameters {
    asu_contents {
      n_residues = None
      n_bases = None
      n_copies_per_asu = None
    }
    misc_twin_parameters {
      missing_symmetry {
        tanh_location = 0.08
        tanh_slope = 50
      }
      twinning_with_ncs {
        perform_analysis = False
        n_bins = 7
      }
      twin_test_cuts {
        low_resolution = 10
        high_resolution = None
        isigi_cut = 3
        completeness_cut = 0.85
      }
    }
    reporting {
      verbose = 1
      log = "logfile.log"
      ccp4_style_graphs = True
    }
  }
  xray_data {
    file_name = "some_data.sca"
    obs_labels = None
    calc_labels = None
    unit_cell = 64.5 69.5 45.5 90 104.3 90
    space_group = "P 1 21 1"
    high_resolution = None
    low_resolution = None
  }
}

The defaults are good for most applications.

Possible Problems

Specific limitations and problems

Xtriage doesn't deal with data in centric space groups

Literature

CCP4 newsletter No. 42, Summer 2005: Characterization of X-ray data sets

CCP4 newsletter No. 43, Winter 2005: Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation

Additional information

List of all xtriage keywords

------------------------------------------------------------------------------- 
Legend: black bold - scope names
        black - parameter names
        red - parameter values
        blue - parameter help
        blue bold - scope help
        Parameter values:
          * means selected parameter (where multiple choices are available)
          False is No
          True is Yes
          None means not provided, not predefined, or left up to the program
          "%3d" is a Python style formatting descriptor
------------------------------------------------------------------------------- 
scaling
   input
      expert_level= 1 Expert level
      asu_contents Defines the ASU contents
         sequence_file= None File containing protein or nucleic acid
                        sequences. Values for n_residues and n_bases will be
                        extracted automatically if this is provided.
         n_residues= None Number of residues in structural unit
         n_bases= None Number of nucleotides in structural unit
         n_copies_per_asu= None Number of copies per ASU. If not specified,
                           Matthews analyses is performed
      xray_data Defines xray data
         file_name= None File name with data
         obs_labels= None Labels for observed data
         calc_labels= None Lables for calculated data
         unit_cell= None Unit cell parameters
         space_group= None space group
         high_resolution= None High resolution limit
         low_resolution= None Low resolution limit
         reference A reference data set. For the investigation of possible
                   reindexing options
            data Defines an x-ray dataset
               file_name= None File name
               labels= None Labels
               unit_cell= None Unit cell parameters"
               space_group= None Space group
            structure
               file_name= None Filename of reference PDB file
      parameters Basic settings
         reporting Some output issues
            verbose= 1 Verbosity
            log= logfile.log Logfile
            ccp4_style_graphs= True SHall we include ccp4 style graphs?
         misc_twin_parameters Various settings for twinning or symmetry tests
            apply_basic_filters_prior_to_twin_analysis= True Keep data cutoffs
                                                        from the
                                                        basic_analyses module
                                                        (I/sigma,Wilson
                                                        scaling,Anisotropy)
                                                        when twin stats are
                                                        computed.
            missing_symmetry Settings for missing symmetry tests
               sigma_inflation= 1.25 Standard deviations of intensities can be
                                increased to make point group determination
                                more reliable.
            twinning_with_ncs Analysing the possibility of an NCS operator
                              parallel to a twin law.
               perform_analyses= False Determines whether or not this analyses
                                 is carried out.
               n_bins= 7 Number of bins used in NCS analyses.
            twin_test_cuts Various cuts used in determining resolution limit
                           for data used in intensity statistics
               low_resolution= 10.0 Low resolution
               high_resolution= None High resolution
               isigi_cut= 3.0 I/sigI ratio used in completeness cut
               completeness_cut= 0.85 Data is cut at resolution where
                                 intensities with I/sigI greater than
                                 isigi_cut are more than completeness_cut
                                 complete
      optional Optional data massage possibilities
         hklout= None HKL out
         hklout_type= mtz sca *mtz_or_sca Output format
         label_extension= "massaged" Label extension
         aniso Parameters dealing with anisotropy correction
            action= *remove_aniso None Remove anisotropy?
            final_b= *eigen_min eigen_mean user_b_iso Final b value
            b_iso= None User specified B value
         outlier Outlier analyses
            action= *extreme basic beamstop None Outlier protocol
            parameters Parameters for outlier detection
               basic_wilson
                  level= 1E-6
               extreme_wilson
                  level= 0.01
               beamstop
                  level= 0.001
                  d_min= 10.0
         symmetry
            action= detwin twin *None
            twinning_parameters
               twin_law= None
               fraction= None
   gui GUI-specific parameters, not applicable to command-line version.
      result_file= None Pickled result file for Phenix GUI
      job_title= None Job title in PHENIX GUI, not used on command line