Python-based Hierarchical ENvironment for Integrated Xtallography

Data quality assessment with phenix.xtriage

Author(s)
Purpose
Usage
How xtriage works
Output files from xtriage
Xtriage keywords in detail
Interpreting Xtriage output
Examples
Standard run of xtriage
Possible Problems
Specific limitations and problems
Literature
Additional information
List of all xtriage keywords

Author(s)

  • xtriage: Peter Zwart
  • Phil command interpreter: Ralf W. Grosse-Kunstleve

Purpose

The xtriage method is a tool for analyzing structure factor data to identify outliers, the presence of twinning, and other conditions that the user should be aware of.

Usage

How xtriage works

Basic sanity checks performed by xtriage are:

  • Wilson plot sanity
  • Probabilistic Matthews analysis
  • Data strength analysis
  • Ice ring analysis
  • Twinning analysis
  • Reference analysis (determines possible re-indexing; optional)
  • Detwinning and data massaging (optional)
See also: phenix.reflection_statistics (comparison of multiple data sets)

Output files from xtriage

  • (1) A log file that contains all the screen output plus some ccp4 style graphs
  • (2) optional: an mtz file with massaged data

    Xtriage keywords in detail

    Scope: parameters.asu_contents

    keys: * n_residues :: Number of residues per monomer/unit
          * n_bases :: Number of nucleotides per monomer/unit
          * n_copies_per_asu :: Number of copies in the ASU.
    
    These keywords control the determination of the absolute scale. If the number of residues/bases is not specified, a solvent content of 50% is assumed.

    Scope: parameters.misc_twin_parameters.missing_symmetry
    keys: * tanh_location :: tanh decision rule parameter
          * tanh_slope :: tanh decision rule parameter
    
    The tanh_location and tanh_slope parameters control which R-value is low enough for an operator to be considered a 'proper' symmetry operator. The tanh_location parameter corresponds to the inflection point of the approximate step function; increasing tanh_location results in larger R-value thresholds. tanh_slope is set to 50, which should be fine in most cases.

    Scope: parameters.misc_twin_parameters.twinning_with_ncs
    keys: * perform_test :: can be set to True or False
          * n_bins :: Number of bins in determination of D_ncs
    
    perform_test is set to False by default. Setting it to True triggers the determination of the twin fraction while taking into account NCS parallel to the twin axis.

    Scope: parameters.misc_twin_parameters.twin_test_cuts
    keys: * high_resolution :: high resolution for twin tests
          * low_resolution :: low resolution for twin tests
          * isigi_cut :: I/sig(I) threshold in automatic determination
                         of high resolution limit
          * completeness_cut :: completeness threshold in automatic
                                determination of high resolution limit
    
    The high resolution limit for the twinning tests is determined automatically on the basis of the completeness after removing intensities for which I/sigI < isigi_cut. The lowest limit obtained in this way is 3.5A. The value determined by the automatic procedure can be overruled by specifying the high_resolution keyword. The low resolution limit is set to 10A by default.

    Scope: parameters.reporting
    keys: * verbose :: verbosity level.
          * log :: log file name
          * ccp4_style_graphs :: Either True or False. Determines whether or
                                 not ccp4 style loggraph plots are written to the
                                 log file
    
    Scope: xray_data
    keys: * file_name :: file name with xray data.
          * obs_labels :: labels for observed data if the format is mtz or XPLOR/CNS
          * calc_labels :: optional; labels for calculated data
          * unit_cell :: overrides unit cell in reflection file (if present)
          * space_group :: overrides space group in reflection file (if present)
          * high_resolution :: High resolution limit of the data
          * low_resolution :: Low resolution limit of the data
    
    Note that the matching of specified and present labels involves a sub-string matching algorithm.

    Scope: optional
    keys: * hklout :: output mtz file
          * twinning.action :: Whether to detwin the data
          * twinning.twin_law :: using this twin law (h,k,l or x,y,z notation)
          * twinning.fraction :: The detwinning fraction.
          * b_value :: the resulting Wilson B value
    
    The output mtz file contains anisotropy corrected data, with suspected outliers removed. The data are scaled and have the specified Wilson B value. These options have an associated expert level of 10, and are not shown by default. Specifying the expert level on the command line as 'level=100' will show all available options.
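
To illustrate what the twinning.action and twinning.fraction options imply, the following is a minimal sketch of textbook algebraic detwinning of one twin-related intensity pair. It assumes the standard hemihedral twinning model; the function names are hypothetical and this is not necessarily how xtriage implements detwinning internally.

```python
def twin_pair(i1, i2, twin_fraction):
    """Apply hemihedral twinning to two twin-related true intensities."""
    a = twin_fraction
    return (1.0 - a) * i1 + a * i2, a * i1 + (1.0 - a) * i2


def detwin_pair(j1, j2, twin_fraction):
    """Recover the true intensities from a twinned pair (j1, j2).

    Only defined for twin fractions below 0.5; near 0.5 the division
    by (1 - 2 * alpha) strongly amplifies measurement errors.
    """
    if not 0.0 <= twin_fraction < 0.5:
        raise ValueError("twin_fraction must satisfy 0 <= alpha < 0.5")
    a = twin_fraction
    scale = 1.0 - 2.0 * a
    i1 = ((1.0 - a) * j1 - a * j2) / scale
    i2 = ((1.0 - a) * j2 - a * j1) / scale
    return i1, i2


# Round trip on a made-up pair of intensities.
j1, j2 = twin_pair(100.0, 20.0, 0.2)
recovered = detwin_pair(j1, j2, 0.2)
```

The error amplification near a twin fraction of 0.5 is the reason refinement against a twin target is often preferred over detwinned data.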

    Interpreting Xtriage output

    Typing:

    %phenix.xtriage some_data.sca residues=290 log=some_data.log
    
    results in the following output (parts omitted).

    Matthews analysis

    First, a cell contents analysis is performed. Matthews coefficients, solvent content and solvent content probabilities are listed, and the most likely composition is guessed:
    Matthews coefficient and Solvent content statistics
    ----------------------------------------------------------------
    | Copies | Solvent content | Matthews Coed. | P(solvent cont.) |
    |--------|-----------------|----------------|------------------|
    |      1 |      0.705      |      4.171     |       0.241      |
    |      2 |      0.411      |      2.085     |       0.750      |
    |      3 |      0.116      |      1.390     |       0.009      |
    ----------------------------------------------------------------
    |              Best guess :    2  copies in the asu            |
    ----------------------------------------------------------------
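
The relation between the columns of this table can be sketched as follows. The protein constant of 1.23 A^3/Da is the common textbook approximation and is an assumption here, not necessarily the value xtriage uses, so the last digit may differ from the table. The function names are hypothetical.

```python
def matthews_coefficient(cell_volume, n_symops, n_copies, mw_per_copy):
    """Matthews coefficient V_M in A^3/Da.

    cell_volume : unit cell volume in A^3
    n_symops    : number of symmetry operators of the space group
    n_copies    : number of copies in the asymmetric unit
    mw_per_copy : molecular weight of one copy in Da
    """
    return cell_volume / (n_symops * n_copies * mw_per_copy)


def solvent_content(v_m, protein_constant=1.23):
    """Approximate solvent fraction for a given Matthews coefficient."""
    return 1.0 - protein_constant / v_m


# V_M scales as 1/n_copies, so the table can be rebuilt from its first row.
v_m_one_copy = 4.171  # taken from the table above
for n_copies in (1, 2, 3):
    v_m = v_m_one_copy / n_copies
    print(n_copies, round(v_m, 3), round(solvent_content(v_m), 3))
```

The probability column cannot be reproduced this way, since it depends on an empirical distribution of solvent contents.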
    
    Data strength

    Next, the strength of the data is gauged by determining the completeness in resolution bins after application of several I/sigI cutoff values:
    Completeness and data strength analysis
    
      The following table lists the completeness in various resolution
      ranges, after applying a I/sigI cut. Miller indices for which
      individual I/sigI values are larger than the value specified in
      the top row of the table, are retained, while other intensities
      are discarded. The resulting completeness profiles are an indication
      of the strength of the data.
    
    ----------------------------------------------------------------------------------------
    | Res. Range   | I/sigI>1  | I/sigI>2  | I/sigI>3  | I/sigI>5  | I/sigI>10 | I/sigI>15 |
    ----------------------------------------------------------------------------------------
    | 19.87 - 7.98 | 96.4%     | 95.3%     | 94.5%     | 93.6%     | 91.7%     | 89.3%     |
    |  7.98 - 6.40 | 99.2%     | 98.2%     | 97.1%     | 95.5%     | 90.9%     | 84.7%     |
    |  6.40 - 5.61 | 97.8%     | 95.4%     | 93.3%     | 87.1%     | 76.6%     | 66.8%     |
    |  5.61 - 5.11 | 98.2%     | 95.9%     | 94.0%     | 87.9%     | 74.1%     | 58.0%     |
    |  5.11 - 4.75 | 97.9%     | 96.2%     | 94.5%     | 91.1%     | 79.2%     | 62.5%     |
    |  4.75 - 4.47 | 97.4%     | 95.4%     | 93.1%     | 88.9%     | 76.6%     | 56.9%     |
    |  4.47 - 4.25 | 96.5%     | 94.5%     | 92.1%     | 88.0%     | 75.3%     | 56.5%     |
    |  4.25 - 4.07 | 96.6%     | 94.0%     | 91.2%     | 85.4%     | 69.3%     | 44.9%     |
    |  4.07 - 3.91 | 95.6%     | 92.1%     | 87.8%     | 80.1%     | 61.9%     | 34.8%     |
    |  3.91 - 3.78 | 94.3%     | 89.6%     | 83.7%     | 71.1%     | 48.7%     | 20.5%     |
    |  3.78 - 3.66 | 95.7%     | 90.9%     | 85.6%     | 71.5%     | 42.4%     | 14.8%     |
    |  3.66 - 3.56 | 91.6%     | 85.0%     | 78.0%     | 63.3%     | 34.1%     | 9.5%      |
    |  3.56 - 3.46 | 89.8%     | 80.4%     | 70.2%     | 52.8%     | 22.2%     | 3.8%      |
    |  3.46 - 3.38 | 87.4%     | 76.3%     | 64.6%     | 46.7%     | 15.5%     | 1.7%      |
    ----------------------------------------------------------------------------------------
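
A single cell of this table can be sketched as below. The example bin and its values are made up for illustration, and the function name is hypothetical.

```python
def completeness_after_cut(i_over_sigma, n_possible, cut):
    """Fraction of the possible reflections in a bin with I/sigI > cut.

    i_over_sigma : I/sigI values of the measured reflections in the bin
    n_possible   : number of reflections theoretically possible in the bin
    """
    if n_possible <= 0:
        raise ValueError("n_possible must be positive")
    n_kept = sum(1 for value in i_over_sigma if value > cut)
    return n_kept / n_possible


# Hypothetical bin: 8 measured reflections out of 10 possible.
bin_i_over_sigma = [0.5, 1.5, 2.5, 4.0, 6.0, 12.0, 18.0, 25.0]
profile = {cut: completeness_after_cut(bin_i_over_sigma, 10, cut)
           for cut in (1, 2, 3, 5, 10, 15)}
```

A profile that drops quickly with increasing cutoff, as in the highest resolution bins of the table, indicates weak data in that bin.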
    
    This analysis is also used in the automatic determination of the high resolution limit used in the intensity statistics and twin analyses.

    Absolute, likelihood based Wilson scaling

    The (anisotropic) B value of the data is determined using a likelihood based approach. The resulting B value/tensor is reported:
    Maximum likelihood isotropic Wilson scaling
    ML estimate of overall B value of sec17.sca:i_obs,sigma:
    75.85 A**(-2)
    Estimated -log of scale factor of sec17.sca:i_obs,sigma:
    -2.50
    
    Maximum likelihood anisotropic Wilson scaling
    ML estimate of overall B_cart value of sec17.sca:i_obs,sigma:
    68.92,  0.00,  0.00
           68.92,  0.00
                  91.87
    Equivalent representation as U_cif:
     0.87, -0.00, -0.00
            0.87,  0.00
                   1.16
    
    ML estimate of  -log of scale factor of sec17.sca:i_obs,sigma:
    -2.50
    Correcting for anisotropy in the data
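
The diagonal values of the two tensors printed above are consistent with the usual relation U = B / (8 pi^2). A minimal sketch follows; the spread computed at the end is a hypothetical indicator chosen for illustration, not a quantity reported by xtriage.

```python
import math


def b_to_u(b_value):
    """Convert a B value (A^2) to the equivalent U value: U = B / (8 pi^2)."""
    return b_value / (8.0 * math.pi ** 2)


# Diagonal of the B_cart tensor printed above.
b_cart_diagonal = (68.92, 68.92, 91.87)
u_diagonal = [round(b_to_u(b), 2) for b in b_cart_diagonal]

# Hypothetical anisotropy indicator: spread of the diagonal elements.
spread = max(b_cart_diagonal) - min(b_cart_diagonal)
```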
    
    A large spread in the values (especially the diagonal ones) indicates anisotropy. The anisotropy is corrected for, which cleans up the intensity statistics.

    Low resolution completeness analysis

    Most data processing software does not provide a clear picture of the completeness of the data at low resolution. For this reason, xtriage lists the completeness of the data up to 5 Angstrom:
    Low resolution completeness analysis
    
     The following table shows the completeness
     of the data to 5 Angstrom.
    unused:         - 19.8702 [  0/68 ] 0.000
    bin  1: 19.8702 - 10.3027 [425/455] 0.934
    bin  2: 10.3027 -  8.3766 [443/446] 0.993
    bin  3:  8.3766 -  7.3796 [446/447] 0.998
    bin  4:  7.3796 -  6.7336 [447/449] 0.996
    bin  5:  6.7336 -  6.2673 [450/454] 0.991
    bin  6:  6.2673 -  5.9080 [428/429] 0.998
    bin  7:  5.9080 -  5.6192 [459/466] 0.985
    bin  8:  5.6192 -  5.3796 [446/450] 0.991
    bin  9:  5.3796 -  5.1763 [437/440] 0.993
    bin 10:  5.1763 -  5.0006 [460/462] 0.996
    unused:  5.0006 -         [  0/0  ]
    
    This analysis allows one to quickly see if there is any unusually low completeness at low resolution, for instance due to missing overloads.

    Wilson plot analysis

    A Wilson plot analysis a la ARP/wARP is carried out, albeit with a slightly different standard curve:
    Mean intensity analysis
     Analysis of the mean intensity.
     Inspired by: Morris et al. (2004). J. Synch. Rad.11, 56-59.
     The following resolution shells are worrisome:
    ------------------------------------------------
    | d_spacing | z_score | compl. | <Iobs>/<Iexp> |
    ------------------------------------------------
    |    5.773  |   7.95  |   0.99 |     0.658     |
    |    5.423  |   8.62  |   0.99 |     0.654     |
    |    5.130  |   6.31  |   0.99 |     0.744     |
    |    4.879  |   5.36  |   0.99 |     0.775     |
    |    4.662  |   4.52  |   0.99 |     0.803     |
    |    3.676  |   5.45  |   0.99 |     1.248     |
    ------------------------------------------------
    
     Possible reasons for the presence of the reported
     unexpected low or elevated mean intensity in
     a given resolution bin are :
     - missing overloaded or weak reflections
     - suboptimal data processing
     - satellite (ice) crystals
     - NCS
     - translational pseudo symmetry (detected elsewhere)
     - outliers (detected elsewhere)
     - ice rings (detected elsewhere)
     - other problems
     Note that the presence of abnormalities
     in a certain region of reciprocal space might
     confuse the data validation algorithm throughout
     a large region of reciprocal space, even though
     the data is acceptable in those areas.
    
    A very long list of warnings could indicate a serious problem with your data. Deciding whether the data are useful, should be cut, or should be thrown away altogether is not straightforward and falls beyond the scope of xtriage.

    Outlier detection and rejection

    Possible outliers are detected on the basis of Wilson statistics:
    Possible outliers
     Inspired by: Read, Acta Cryst. (1999). D55, 1759-1764
    
    Acentric reflections:
    
    -----------------------------------------------------------------
    | d_space |      H     K     L |  |E|  | p(wilson) | p(extreme) |
    -----------------------------------------------------------------
    |   3.716 |      8,    6,   31 |  3.52 |  4.06e-06 |   5.87e-02 |
    -----------------------------------------------------------------
    
    p(wilson)  : 1-(1-exp[-|E|^2])
    p(extreme) : 1-(1-exp[-|E|^2])^(n_acentrics)
    p(wilson) is the probability that an E-value of the specified
    value would be observed when it would selected at random from
    the given data set.
    p(extreme) is the probability that the largest |E| value is
    larger or equal than the observed largest |E| value.
    
    Both measures can be used for outlier detection. p(extreme)
    takes into account the size of the data set.
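
The two probabilities for acentric reflections can be sketched as follows. Note that the printed expression for p(wilson) simplifies to exp(-|E|^2). The number of acentric reflections used below (15000) is an assumption for illustration; it is not given in the output above.

```python
import math


def p_wilson_acentric(e_abs):
    """Probability of observing an |E| at least this large (acentric)."""
    return math.exp(-e_abs ** 2)


def p_extreme_acentric(e_abs, n_acentrics):
    """Probability that the largest of n_acentrics |E| values is >= e_abs.

    Computed as 1 - (1 - p)^n in a numerically stable form.
    """
    p = p_wilson_acentric(e_abs)
    return -math.expm1(n_acentrics * math.log1p(-p))


p_w = p_wilson_acentric(3.52)          # |E| of the listed reflection
p_e = p_extreme_acentric(3.52, 15000)  # hypothetical data set size
```

Because the listed |E| is rounded to two decimals, the computed p(wilson) only approximately matches the tabulated value.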
    
    Outliers are removed from the data set in the further analysis. Note that if pseudo translational symmetry is present, a large number of 'outliers' will be present.

    Ice ring detection

    Ice rings in the data are detected by analyzing the completeness and the mean intensity:
    Ice ring related problems
    
     The following statistics were obtained from ice-ring
     insensitive resolution ranges
      mean bin z_score      : 3.47
          ( rms deviation   : 2.83 )
      mean bin completeness : 0.99
         ( rms deviation   : 0.00 )
    
     The following table shows the z-scores
     and completeness in ice-ring sensitive areas.
     Large z-scores and high completeness in these
     resolution ranges might be a reason to re-assess
     your data processing if ice rings were present.
    
    ------------------------------------------------
    | d_spacing | z_score | compl. | Rel. Ice int. |
    ------------------------------------------------
    |    3.897  |   0.12  |   0.97 |     1.000     |
    |    3.669  |   0.96  |   0.95 |     0.750     |
    |    3.441  |   2.14  |   0.94 |     0.530     |
    ------------------------------------------------
    
     Abnormalities in mean intensity or completeness at
     resolution ranges with a relative ice ring intensity
     lower then 0.10 will be ignored.
    
     At 3.67 A there is an lower occupancy
      then expected from the rest of the data set.
      Even though the completeness is lower as expected,
      the mean intensity is still reasonable at this resolution
    
     At 3.44 A there is an lower occupancy
      then expected from the rest of the data set.
      Even though the completeness is lower as expected,
      the mean intensity is still reasonable at this resolution
    
     There were  2 ice ring related warnings
     This could indicate the presence of ice rings.
    
    Anomalous signal

    If the input reflection file contains separate intensities for each Friedel mate, a quality measure of the anomalous signal is reported:
    Analysis of anomalous differences
    
      Table of measurability as a function of resolution
    
      The measurability is defined as the fraction of
      Bijvoet related intensity differences for which
      |delta_I|/sigma_delta_I > 3.0
      min[I(+)/sigma_I(+), I(-)/sigma_I(-)] > 3.0
      holds.
      The measurability provides an intuitive feeling
      of the quality of the data, as it is related to the
      number of reliable Bijvoet differences.
      When the data is processed properly and the standard
      deviations have been estimated accurately, values larger
      than 0.05 are encouraging.
    
    unused:         - 19.8704 [   0/68  ]
    bin  1: 19.8704 -  7.0211 [1551/1585]  0.1924
    bin  2:  7.0211 -  5.6142 [1560/1575]  0.0814
    bin  3:  5.6142 -  4.9168 [1546/1555]  0.0261
    bin  4:  4.9168 -  4.4729 [1563/1582]  0.0081
    bin  5:  4.4729 -  4.1554 [1557/1577]  0.0095
    bin  6:  4.1554 -  3.9124 [1531/1570]  0.0083
    bin  7:  3.9124 -  3.7178 [1541/1585]  0.0069
    bin  8:  3.7178 -  3.5569 [1509/1552]  0.0028
    bin  9:  3.5569 -  3.4207 [1522/1606]  0.0085
    bin 10:  3.4207 -  3.3032 [1492/1574]  0.0044
    unused:  3.3032 -         [   0/0   ]
    
     The anomalous signal seems to extend to about 5.9 A
     (or to 5.2 A, from a more optimistic point of view)
     The quoted resolution limits can be used as a guideline
     to decide where to cut the resolution for phenix.hyss
     As the anomalous signal is not very strong in this data set
     substructure solution via SAD might prove to be a challenge.
     Especially if only low resolution reflections are used,
     the resulting substructures could contain a significant amount of
     of false positives.
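
The measurability defined in the output above can be sketched as follows. The standard deviation of the difference is taken as the quadrature sum of the two individual sigmas, which assumes independent errors; this choice and the function name are assumptions for illustration.

```python
import math


def measurability(bijvoet_pairs, threshold=3.0):
    """Fraction of Bijvoet pairs with a significant intensity difference.

    bijvoet_pairs : iterable of (I_plus, sigma_plus, I_minus, sigma_minus)
    A pair counts when |delta_I|/sigma_delta_I > threshold and both
    mates have I/sigma_I > threshold.
    """
    n_total = 0
    n_measurable = 0
    for i_plus, sig_plus, i_minus, sig_minus in bijvoet_pairs:
        n_total += 1
        delta = abs(i_plus - i_minus)
        # Assumption: independent errors, so the variances add.
        sigma_delta = math.sqrt(sig_plus ** 2 + sig_minus ** 2)
        strong = min(i_plus / sig_plus, i_minus / sig_minus) > threshold
        if strong and delta / sigma_delta > threshold:
            n_measurable += 1
    return n_measurable / n_total if n_total else 0.0


# Made-up pairs: only the first satisfies both conditions.
pairs = [(100.0, 5.0, 60.0, 5.0),
         (100.0, 5.0, 98.0, 5.0),
         (10.0, 5.0, 60.0, 5.0)]
fraction = measurability(pairs)
```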
    
    Determination of twin laws

    Twin laws are found using a modified Le Page algorithm and classified as merohedral or pseudo-merohedral:
    Determining possible twin laws.
    
    The following twin laws have been found:
    
    -------------------------------------------------------------------------------
    | Type | Axis   | R metric (%) | delta (le Page) | delta (Lebedev) | Twin law |
    -------------------------------------------------------------------------------
    |   M  | 2-fold | 0.000        | 0.000           | 0.000           | -h,k,-l  |
    -------------------------------------------------------------------------------
    M:  Merohedral twin law
    PM: Pseudomerohedral twin law
    
      1 merohedral twin operators found
      0 pseudo-merohedral twin operators found
    In total,   1 twin operator were found
    
    Non-merohedral (reticular) twinning is not considered. The R-metric is equal to:

    Sum (M_i-N_i)^2 / Sum M_i^2

    where M_i are elements of the original metric tensor and N_i are elements of the metric tensor after 'idealizing' the unit cell, in compliance with the restrictions the twin law would pose on the lattice if it were a 'true' symmetry operator. The delta Le Page is the familiar obliquity. The delta Lebedev is a twin law quality measure developed by A. Lebedev (Lebedev, Vagin & Murshudov; Acta Cryst. (2006). D62, 83-95.). Note that for merohedral twin laws, all quality indicators are 0. For pseudo-merohedral twin laws, these values are larger than or equal to zero. If a twin law is classified as pseudo-merohedral but has a delta Le Page equal to zero, the twin law is sometimes referred to as a metric merohedral twin law.

    Locating translational pseudo symmetry (TPS)

    TPS is located by inspecting a low resolution Patterson function. Peaks and their significance levels are reported:
    Largest Patterson peak with length larger then 15 Angstrom
    
    Frac. coord.        :   0.027    0.057    0.345
    Distance to origin  :  17.444
    Height (origin=100) :   3.886
    p_value(height)     :   9.982e-01
    
      The reported p_value has the following meaning:
        The probability that a peak of the specified height
        or larger is found in a Patterson function of a
        macro molecule that does not have any translational
        pseudo symmetry is equal to  9.982e-01
        p_values smaller then 0.05 might indicate
        weak translation pseudo symmetry, or the self vector of
        a large anomalous scatterer such as Hg, whereas values
        smaller then 1e-3 are a very strong indication for
        the presence of translational pseudo symmetry.
    
    Moments of the observed intensities

    The moments of the observed intensity/amplitude distribution are reported, as well as their expected values:
    Wilson ratio and moments
    
    Acentric reflections
       <I^2>/<I>^2    :1.955   (untwinned: 2.000; perfect twin 1.500)
       <F>^2/<F^2>    :0.796   (untwinned: 0.785; perfect twin 0.885)
       <|E^2 - 1|>    :0.725   (untwinned: 0.736; perfect twin 0.541)
    
    Centric reflections
       <I^2>/<I>^2    :2.554   (untwinned: 3.000; perfect twin 2.000)
       <F>^2/<F^2>    :0.700   (untwinned: 0.637; perfect twin 0.785)
       <|E^2 - 1|>    :0.896   (untwinned: 0.968; perfect twin 0.736)
    
    Significant departure from the ideal values could indicate the presence of twinning or pseudo translations. For instance, an <I^2>/<I>^2 value significantly lower than 2.0 might point to twinning, whereas a value significantly larger than 2.0 might point towards pseudo translational symmetry.

    Cumulative intensity distribution

    The cumulative intensity distribution is reported:
    -----------------------------------------------
    |  Z  | Nac_obs | Nac_theo | Nc_obs | Nc_theo |
    -----------------------------------------------
    | 0.0 |   0.000 |    0.000 |  0.000 |   0.000 |
    | 0.1 |   0.081 |    0.095 |  0.168 |   0.248 |
    | 0.2 |   0.167 |    0.181 |  0.292 |   0.345 |
    | 0.3 |   0.247 |    0.259 |  0.354 |   0.419 |
    | 0.4 |   0.321 |    0.330 |  0.420 |   0.474 |
    | 0.5 |   0.392 |    0.394 |  0.473 |   0.520 |
    | 0.6 |   0.452 |    0.451 |  0.521 |   0.561 |
    | 0.7 |   0.506 |    0.503 |  0.570 |   0.597 |
    | 0.8 |   0.552 |    0.551 |  0.603 |   0.629 |
    | 0.9 |   0.593 |    0.593 |  0.636 |   0.657 |
    | 1.0 |   0.635 |    0.632 |  0.673 |   0.683 |
    -----------------------------------------------
    | Maximum deviation acentric      :  0.015    |
    | Maximum deviation centric       :  0.080    |
    |                                             |
    | <NZ(obs)-NZ(twinned)>_acentric  : -0.004    |
    | <NZ(obs)-NZ(twinned)>_centric   : -0.039    |
    -----------------------------------------------
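
The theoretical columns of this table follow from Wilson statistics for untwinned data and can be reproduced with the sketch below. The function names are hypothetical; nz_observed is a simple empirical counterpart for a list of normalized intensities.

```python
import math


def nz_acentric(z):
    """Theoretical N(Z) for untwinned acentric data: 1 - exp(-Z)."""
    return 1.0 - math.exp(-z)


def nz_centric(z):
    """Theoretical N(Z) for untwinned centric data: erf(sqrt(Z/2))."""
    return math.erf(math.sqrt(z / 2.0))


def nz_observed(normalized_intensities, z):
    """Observed fraction of normalized intensities not exceeding z."""
    values = list(normalized_intensities)
    if not values:
        raise ValueError("no intensities given")
    return sum(1 for value in values if value <= z) / len(values)


# Rebuild the Nac_theo and Nc_theo columns for Z = 0.0, 0.1, ..., 1.0.
theory = [(k / 10.0, nz_acentric(k / 10.0), nz_centric(k / 10.0))
          for k in range(11)]
```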
    
    The N(Z) test is related to the moments based test discussed above, but uses the full distribution rather than just a moment. Nac_obs is the observed cumulative distribution of normalized intensities of the acentric data. Twinning shows itself by Nac_obs having a more sigmoidal character. In the case of pseudo centering, Nac_obs will tend towards Nc_theo.

    The L test

    The L-test is an intensity statistic developed by Padilla and Yeates (Acta Cryst. (2003), D59: 1124-1130) and is reasonably robust in the presence of anisotropy and pseudo centering, especially if the Miller indices are partitioned properly. Partitioning is carried out on the basis of a Patterson analysis. A significant deviation of both <|L|> and <L^2> from the expected values indicates twinning or other problems:
     L test for acentric data
    
     using difference vectors (dh,dk,dl) of the form:
    (2hp,2kp,2lp)
      where hp, kp, and lp are random signed integers such that
      2 <= |dh| + |dk| + |dl| <= 8
    
      Mean |L|   :0.482  (untwinned: 0.500; perfect twin: 0.375)
      Mean  L^2  :0.314  (untwinned: 0.333; perfect twin: 0.200)
    
      The distribution of |L| values indicates a twin fraction of
      0.00. Note that this estimate is not as reliable as obtained
      via a Britton plot or H-test if twin laws are available.
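
The two statistics reported above can be sketched as follows, using L = (I1 - I2) / (I1 + I2) for pairs of intensities whose indices differ by a vector (dh,dk,dl). The selection of the pairs is not shown, and the function name is hypothetical.

```python
def l_statistics(intensity_pairs):
    """Return (mean |L|, mean L^2) for pairs of local intensities.

    intensity_pairs : iterable of (I1, I2), where I2 belongs to the
    reflection displaced from that of I1 by a difference vector.
    Pairs with a non-positive sum are skipped.
    """
    l_values = [(i1 - i2) / (i1 + i2)
                for i1, i2 in intensity_pairs if i1 + i2 > 0]
    if not l_values:
        raise ValueError("no usable intensity pairs")
    n = len(l_values)
    mean_abs_l = sum(abs(value) for value in l_values) / n
    mean_l_sq = sum(value * value for value in l_values) / n
    return mean_abs_l, mean_l_sq


# Made-up pairs, only to show the call.
mean_abs_l, mean_l_sq = l_statistics([(3.0, 1.0), (1.0, 3.0)])
```

For real data the results are compared with the expected values quoted in the output above (0.500 and 0.333 for untwinned data, 0.375 and 0.200 for a perfect twin).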
    
    Whether or not <|L|> and <L^2> differ significantly from the expected values is shown in the final summary (see below).

    Analysis of twin laws

    Twin law specific tests (Britton, H and RvsR) are performed:
    Results of the H-test on a-centric data:
    
     (Only 50.0% of the strongest twin pairs were used)
    
    mean |H| : 0.183   (0.50: untwinned; 0.0: 50% twinned)
    mean H^2 : 0.055   (0.33: untwinned; 0.0: 50% twinned)
    Estimation of twin fraction via mean |H|: 0.317
    Estimation of twin fraction via cum. dist. of H: 0.308
    
    Britton analysis
    
      Extrapolation performed on  0.34 < alpha < 0.495
      Estimated twin fraction: 0.283
      Correlation: 0.9951
    
    R vs R statistic:
      R_abs_twin = <|I1-I2|>/<|I1+I2|>
      Lebedev, Vagin, Murshudov. Acta Cryst. (2006). D62, 83-95
    
       R_abs_twin observed data   : 0.193
       R_abs_twin calculated data : 0.328
    
      R_sq_twin = <(I1-I2)^2>/<(I1+I2)^2>
       R_sq_twin observed data    : 0.044
       R_sq_twin calculated data  : 0.120
    
    Maximum Likelihood twin fraction determination
        Zwart, Read, Grosse-Kunstleve & Adams, to be published.
    
       The estimated twin fraction is equal to 0.227
    
    These tests allow one to estimate the twin fraction and (if calculated data is provided) determine if rotational pseudo symmetry is present. Another, more computationally expensive, option is to estimate the correlation between error free, untwinned, twin related normalized intensities (use the key perform=True on the command line):
    Estimation of twin fraction, while taking into account the
    effects of possible NCS parallel to the twin axis.
        Zwart, Read, Grosse-Kunstleve & Adams, to be published.
    
      A parameters D_ncs will be estimated as a function of resolution,
      together with a global twin fraction.
      D_ncs is an estimate of the correlation coefficient between
      untwinned, error-free, twin related, normalized intensities.
      Large values (0.95) could indicate an incorrect point group.
      Value of D_ncs larger than say, 0.5, could indicate the presence
      of NCS. The twin fraction should be smaller or similar to other
      estimates given elsewhere.
    
      The refinement can take some time.
      For numerical stability issues, D_ncs is limited between 0 and 0.95.
      The twin fraction is allowed to vary between 0 and 0.45.
      Refinement cycle numbers are printed out to keep you entertained.
    
    . . . .   5  . . . .  10  . . . .  15  . . . .  20  . . . .  25  . . . .  30
    . . . .  35  . . . .  40  . . . .  45  . . . .  50  . . . .  55  . . . .  60
    . . . .  65  . . . .  70  . . . .  75  . . .
    
      Cycle :  78
      -----------
      Log[likelihood]:       22853.700
      twin fraction: 0.201
      D_ncs in resolution ranges:
         9.8232 -- 4.5978 :: 0.830
         4.5978 -- 3.7139 :: 0.775
         3.7139 -- 3.2641 :: 0.745
         3.2641 -- 2.9747 :: 0.746
         2.9747 -- 2.7666 :: 0.705
         2.7666 -- 2.6068 :: 0.754
         2.6068 -- 2.4784 :: 0.735
    
     The correlation of the calculated F^2 should be similar to
     the estimated values.
    
     Observed correlation between twin related, untwinned calculated F^2
     in resolution ranges, as well as estimates D_ncs^2 values:
     Bin    d_max     d_min     CC_obs   D_ncs^2
      1)    9.8232 -- 4.5978 ::  0.661    0.689
      2)    4.5978 -- 3.7139 ::  0.544    0.601
      3)    3.7139 -- 3.2641 ::  0.650    0.556
      4)    3.2641 -- 2.9747 ::  0.466    0.557
      5)    2.9747 -- 2.7666 ::  0.426    0.497
      6)    2.7666 -- 2.6068 ::  0.558    0.569
      7)    2.6068 -- 2.4784 ::  0.531    0.540
    
    The twin fraction obtained via this method is usually lower than what is obtained by refinement. The estimated correlation coefficient (D_ncs^2) between the twin related F^2 values is, however, reasonably accurate.

    Exploring higher metric symmetry

    The fact that a twin law is present could also indicate that the data were incorrectly processed. The example below shows a P41212 data set processed in P1:
    Exploring higher metric symmetry
    
    Point group of data as dictated by the space group is P 1
      the point group in the Niggli setting is P 1
    The point group of the lattice is P 4 2 2
    A summary of R values for various possible point groups follow.
    
    -----------------------------------------------------------------------------------------------
    | Point group              | mean R_used | max R_used | mean R_unused | min R_unused | choice |
    -----------------------------------------------------------------------------------------------
    | P 1                      | None        | None       | 0.022         | 0.017        |        |
    | P 4 2 2                  | 0.022       | 0.025      | None          | None         | <---   |
    | P 1 2 1                  | 0.017       | 0.017      | 0.026         | 0.024        |        |
    | Hall:  C 2y (x-y,x+y,z)  | 0.025       | 0.025      | 0.022         | 0.017        |        |
    | P 4                      | 0.025       | 0.028      | 0.025         | 0.025        |        |
    | Hall:  C 2 2 (x-y,x+y,z) | 0.024       | 0.025      | 0.017         | 0.017        |        |
    | Hall:  C 2y (x+y,-x+y,z) | 0.024       | 0.024      | 0.023         | 0.017        |        |
    | P 1 1 2                  | 0.028       | 0.028      | 0.021         | 0.017        |        |
    | P 2 1 1                  | 0.027       | 0.027      | 0.022         | 0.017        |        |
    | P 2 2 2                  | 0.023       | 0.028      | 0.025         | 0.025        |        |
    -----------------------------------------------------------------------------------------------
    
    R_used: mean and maximum R value for symmetry operators *used* in this point group
    R_unused: mean and minimum R value for symmetry operators *not used* in this point group
    The likely point group of the data is:  P 4 2 2
    
    As in phenix.explore_metric_symmetry, the possible space groups are listed as well (not shown here).

    Twin analysis summary

    The results of the twin analysis are summarized. Typical outputs for cases of wrong symmetry, twin laws but no suspected twinning, and twinned data, respectively, look as follows.

    Wrong symmetry:
    -------------------------------------------------------------------------------
    Twinning and intensity statistics summary (acentric data):
    
    Statistics independent of twin laws
      - <I^2>/<I>^2 : 2.104
      - <F>^2/<F^2> : 0.770
      - <|E^2-1|>   : 0.757
      - <|L|>, <L^2>: 0.512, 0.349
           Multivariate Z score L-test: 2.777
           The multivariate Z score is a quality measure of the given
           spread in intensities. Good to reasonable data is expected
           to have a Z score lower than 3.5.
           Large values can indicate twinning, but small values do not
           necessarily exclude it.
    
    Statistics depending on twin laws
    ------------------------------------------------------
    | Operator | type | R obs. | Britton alpha | H alpha |
    ------------------------------------------------------
    | k,h,-l   |  PM  | 0.025  | 0.458         | 0.478   |
    | -h,k,-l  |  PM  | 0.017  | 0.459         | 0.487   |
    | -k,h,l   |  PM  | 0.024  | 0.458         | 0.478   |
    | -k,-h,-l |  PM  | 0.024  | 0.458         | 0.478   |
    | -h,-k,l  |  PM  | 0.028  | 0.458         | 0.476   |
    | h,-k,-l  |  PM  | 0.027  | 0.458         | 0.477   |
    | k,-h,l   |  PM  | 0.024  | 0.457         | 0.478   |
    ------------------------------------------------------
    
    Patterson analysis
      - Largest peak height   : 6.089
       (corresponding p value : 6.921e-01)
    
    The largest off-origin peak in the Patterson function is 6.09% of the
    height of the origin peak. No significant pseudo-translation is detected.
    
    The results of the L-test indicate that the intensity statistics
    behave as expected. No twinning is suspected.
    The symmetry of the lattice and intensity however suggests that the
    input space group is too low. See the relevant sections of the log
    file for more details on your choice of space groups.
    As the symmetry is suspected to be incorrect, it is advisable to reconsider
    data processing.
    -------------------------------------------------------------------------------
    
    Twin laws present but no suspected twinning:
    -------------------------------------------------------------------------------
    Twinning and intensity statistics summary (acentric data):
    
    Statistics independent of twin laws
      - <I^2>/<I>^2 : 1.955
      - <F>^2/<F^2> : 0.796
      - <|E^2-1|>   : 0.725
      - <|L|>, <L^2>: 0.482, 0.314
           Multivariate Z score L-test: 1.225
           The multivariate Z score is a quality measure of the given
           spread in intensities. Good to reasonable data is expected
           to have a Z score lower than 3.5.
           Large values can indicate twinning, but small values do not
           necessarily exclude it.
    
    Statistics depending on twin laws
    ------------------------------------------------------
    | Operator | type | R obs. | Britton alpha | H alpha |
    ------------------------------------------------------
    | -h,k,-l  |   M  | 0.455  | 0.016         | 0.035   |
    ------------------------------------------------------
    
    Patterson analysis
      - Largest peak height   : 3.886
       (corresponding p value : 9.982e-01)
    
    The largest off-origin peak in the Patterson function is 3.89% of the
    height of the origin peak. No significant pseudo-translation is detected.
    
    The results of the L-test indicate that the intensity statistics
    behave as expected. No twinning is suspected.
    Even though no twinning is suspected, it might be worthwhile carrying out
    a refinement using a dedicated twin target anyway, as twinned structures with
    low twin fractions are difficult to distinguish from non-twinned structures.
    
    -------------------------------------------------------------------------------
    
    Twinned data:
    -------------------------------------------------------------------------------
    Twinning and intensity statistics summary (acentric data):
    
    Statistics independent of twin laws
      - <I^2>/<I>^2 : 1.587
      - <F>^2/<F^2> : 0.871
      - <|E^2-1|>   : 0.568
      - <|L|>, <L^2>: 0.387, 0.212
           Multivariate Z score L-test: 11.589
           The multivariate Z score is a quality measure of the given
           spread in intensities. Good to reasonable data is expected
           to have a Z score lower than 3.5.
           Large values can indicate twinning, but small values do not
           necessarily exclude it.
    
    Statistics depending on twin laws
    ------------------------------------------------------
    | Operator | type | R obs. | Britton alpha | H alpha |
    ------------------------------------------------------
    | -l,-k,-h |  PM  | 0.170  | 0.330         | 0.325   |
    ------------------------------------------------------
    
    Patterson analysis
      - Largest peak height   : 7.300
       (corresponding p value : 4.454e-01)
    
    The largest off-origin peak in the Patterson function is 7.30% of the
    height of the origin peak. No significant pseudo-translation is detected.
    
    The results of the L-test indicate that the intensity statistics
    are significantly different then is expected from good to reasonable,
    untwinned data.
    As there are twin laws possible given the crystal symmetry, twinning could
    be the reason for the departure of the intensity statistics from normality.
    It might be worthwhile carrying refinement with a twin specific target
    function.
    -------------------------------------------------------------------------------
    
    In the summary, the significance of the departure of the L-test values from normality is stated. The multivariate Z-score (also known as the Mahalanobis distance) is used for this purpose.
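
    As a rough illustration of the idea, the sketch below computes a Mahalanobis distance between an observed (<|L|>, <L^2>) pair and the theoretical values for untwinned acentric data (1/2 and 1/3). The function name and the covariance matrix are assumptions made for this example only; the covariance used by xtriage is not given in this document, so the numbers produced here will not reproduce the Z scores in the logs above.

```python
import math


def mahalanobis_2d(observed, expected, covariance):
    """Mahalanobis distance between two 2-vectors for a 2x2 covariance matrix."""
    dx = observed[0] - expected[0]
    dy = observed[1] - expected[1]
    (a, b), (c, d) = covariance
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # The inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]], so the
    # quadratic form delta^T * inverse * delta expands to the line below.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Theoretical expectations for acentric, untwinned data: <|L|> = 1/2, <L^2> = 1/3.
EXPECTED_UNTWINNED = (0.5, 1.0 / 3.0)

# HYPOTHETICAL covariance of (<|L|>, <L^2>), chosen only so the example runs.
# It is NOT the matrix used by xtriage.
HYPOTHETICAL_COV = ((1.0e-4, 0.8e-4), (0.8e-4, 1.0e-4))

if __name__ == "__main__":
    # (<|L|>, <L^2>) pairs copied from two of the example logs above.
    for label, obs in [("no suspected twinning", (0.482, 0.314)),
                       ("twinned data", (0.387, 0.212))]:
        z = mahalanobis_2d(obs, EXPECTED_UNTWINNED, HYPOTHETICAL_COV)
        print("%-22s distance = %.2f" % (label, z))
```

    Under any reasonable covariance, the twinned example lies much further from the untwinned expectation than the untwinned one, which is the behaviour the Z score is meant to capture.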

    Examples

    Standard run of xtriage

    Running xtriage is easy. From the command line, type:

    phenix.xtriage data.sca
    
    When an MTZ or CNS file is used, labels have to be specified:
    phenix.xtriage file=my_brilliant_data.mtz obs_labels='F(+),SIGF(+),F(-),SIGF(-)'
    
    To perform a Matthews analysis, it might be useful to specify the number of residues/nucleotides in the crystallized macromolecule:
    phenix.xtriage data.sca n_residues=230 n_bases=25
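
    When xtriage is driven from a script, the same command lines can be assembled programmatically. The helper below is a minimal sketch: the function name build_xtriage_command is hypothetical, and it only mirrors the two invocation forms shown above (a bare file name, or file= together with obs_labels=). It builds the argument list without running anything.

```python
import shlex


def build_xtriage_command(data_file, obs_labels=None, n_residues=None, n_bases=None):
    """Return an argument list mirroring the command lines shown above."""
    if obs_labels is None:
        # Form used above for a file that needs no labels (e.g. data.sca).
        argv = ["phenix.xtriage", data_file]
    else:
        # Form used above for MTZ or CNS files, where labels must be given.
        argv = ["phenix.xtriage", "file=" + data_file, "obs_labels=" + obs_labels]
    if n_residues is not None:
        argv.append("n_residues=%d" % n_residues)
    if n_bases is not None:
        argv.append("n_bases=%d" % n_bases)
    return argv


if __name__ == "__main__":
    argv = build_xtriage_command("data.sca", n_residues=230, n_bases=25)
    # shlex.quote protects characters such as parentheses in label strings.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

    The resulting list can be passed to subprocess.run on a machine where phenix.xtriage is on the PATH.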
    
    By default, the screen output plus additional ccp4 style graphs (viewable with the ccp4 program loggraph) are echoed to a file named logfile.log. The command line arguments and all other default settings are summarized in a PHIL parameter data block given at the beginning of the logfile / screen output:
    
    scaling.input {
      parameters {
        asu_contents {
          n_residues = None
          n_bases = None
          n_copies_per_asu = None
        }
        misc_twin_parameters {
          missing_symmetry {
            tanh_location = 0.08
            tanh_slope = 50
          }
          twinning_with_ncs {
            perform_analysis = False
            n_bins = 7
          }
          twin_test_cuts {
            low_resolution = 10
            high_resolution = None
            isigi_cut = 3
            completeness_cut = 0.85
          }
        }
        reporting {
          verbose = 1
          log = "logfile.log"
          ccp4_style_graphs = True
        }
      }
      xray_data {
        file_name = "some_data.sca"
        obs_labels = None
        calc_labels = None
        unit_cell = 64.5 69.5 45.5 90 104.3 90
        space_group = "P 1 21 1"
        high_resolution = None
        low_resolution = None
      }
    }
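
    The nesting in this block determines the full dotted name of each parameter (for example, n_residues lives in scaling.input.parameters.asu_contents). The sketch below flattens such a block into dotted paths to make that mapping explicit. It is a hand-written toy parser for the simple layout shown above, not the libtbx.phil parser used by PHENIX, and the function name flatten_phil is hypothetical.

```python
def flatten_phil(text):
    """Flatten a brace-nested 'name = value' block into {dotted.path: value}."""
    scope = []   # stack of currently open scope names
    flat = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            scope.append(line[:-1].strip())   # open a scope
        elif line == "}":
            scope.pop()                       # close the innermost scope
        elif "=" in line:
            name, value = line.split("=", 1)
            flat[".".join(scope + [name.strip()])] = value.strip()
    return flat


# Excerpt of the parameter block shown above.
EXAMPLE = """
scaling.input {
  parameters {
    asu_contents {
      n_residues = None
    }
    misc_twin_parameters {
      twin_test_cuts {
        isigi_cut = 3
        completeness_cut = 0.85
      }
    }
  }
  xray_data {
    space_group = "P 1 21 1"
  }
}
"""

if __name__ == "__main__":
    for path, value in sorted(flatten_phil(EXAMPLE).items()):
        print("%s = %s" % (path, value))
```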
    
    The defaults are good for most applications.

    Possible Problems

    Specific limitations and problems

  • Xtriage doesn't deal with data in centric space groups

    Literature

    Additional information

    List of all xtriage keywords

    ------------------------------------------------------------------------------- 
    Legend: black bold - scope names
            black - parameter names
            red - parameter values
            blue - parameter help
            blue bold - scope help
            Parameter values:
              * means selected parameter (where multiple choices are available)
              False is No
              True is Yes
              None means not provided, not predefined, or left up to the program
              "%3d" is a Python style formatting descriptor
    ------------------------------------------------------------------------------- 
    scaling
       input
          expert_level= 1 Expert level
          asu_contents Defines the ASU contents
             sequence_file= None File containing protein or nucleic acid
                            sequences. Values for n_residues and n_bases will be
                            extracted automatically if this is provided.
             n_residues= None Number of residues in structural unit
             n_bases= None Number of nucleotides in structural unit
             n_copies_per_asu= None Number of copies per ASU. If not specified,
                               a Matthews analysis is performed
          xray_data Defines xray data
             file_name= None File name with data
             obs_labels= None Labels for observed data
             calc_labels= None Labels for calculated data
             unit_cell= None Unit cell parameters
             space_group= None space group
             high_resolution= None High resolution limit
             low_resolution= None Low resolution limit
             reference A reference data set. For the investigation of possible
                       reindexing options
                data Defines an x-ray dataset
                   file_name= None File name
                   labels= None Labels
                   unit_cell= None Unit cell parameters
                   space_group= None Space group
                structure
                   file_name= None Filename of reference PDB file
          parameters Basic settings
             reporting Some output issues
                verbose= 1 Verbosity
                log= logfile.log Logfile
                ccp4_style_graphs= True Shall we include ccp4 style graphs?
             misc_twin_parameters Various settings for twinning or symmetry tests
                apply_basic_filters_prior_to_twin_analysis= True Keep data cutoffs
                                                            from the
                                                            basic_analyses module
                                                            (I/sigma,Wilson
                                                            scaling,Anisotropy)
                                                            when twin stats are
                                                            computed.
                missing_symmetry Settings for missing symmetry tests
                   sigma_inflation= 1.25 Standard deviations of intensities can be
                                    increased to make point group determination
                                    more reliable.
                twinning_with_ncs Analysing the possibility of an NCS operator
                                  parallel to a twin law.
                   perform_analyses= False Determines whether or not this analysis
                                     is carried out.
                   n_bins= 7 Number of bins used in NCS analyses.
                twin_test_cuts Various cuts used in determining resolution limit
                               for data used in intensity statistics
                   low_resolution= 10.0 Low resolution
                   high_resolution= None High resolution
                   isigi_cut= 3.0 I/sigI ratio used in completeness cut
                   completeness_cut= 0.85 Data is cut at resolution where
                                     intensities with I/sigI greater than
                                     isigi_cut are more than completeness_cut
                                     complete
          optional Optional data massage possibilities
             hklout= None HKL out
             hklout_type= mtz sca *mtz_or_sca Output format
             label_extension= "massaged" Label extension
             aniso Parameters dealing with anisotropy correction
                action= *remove_aniso None Remove anisotropy?
                final_b= *eigen_min eigen_mean user_b_iso Final b value
                b_iso= None User specified B value
             outlier Outlier analyses
                action= *extreme basic beamstop None Outlier protocol
                parameters Parameters for outlier detection
                   basic_wilson
                      level= 1E-6
                   extreme_wilson
                      level= 0.01
                   beamstop
                      level= 0.001
                      d_min= 10.0
             symmetry
                action= detwin twin *None
                twinning_parameters
                   twin_law= None
                   fraction= None
       gui GUI-specific parameters, not applicable to command-line version.
          result_file= None Pickled result file for Phenix GUI
          job_title= None Job title in PHENIX GUI, not used on command line
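
    The twin_test_cuts keywords listed above (isigi_cut and completeness_cut) describe a resolution cutoff based on the completeness of strong reflections. The sketch below shows one plausible reading of that rule: walking from low to high resolution, keep extending the limit while the fraction of reflections with I/sigI above isigi_cut exceeds completeness_cut. The function name, the shell layout, and the stop-at-first-failing-shell behaviour are assumptions for illustration; the actual xtriage procedure may differ in detail. The shell data are toy values.

```python
def strong_data_resolution_limit(shells, isigi_cut=3.0, completeness_cut=0.85):
    """Return the d_min of the last shell that passes the completeness test.

    shells: list of (d_min, i_over_sigma_values, n_possible) tuples, ordered
            from low resolution (large d_min) to high resolution (small d_min).
    Returns None if even the first shell fails.
    """
    limit = None
    for d_min, i_over_sigma, n_possible in shells:
        n_strong = sum(1 for value in i_over_sigma if value > isigi_cut)
        if n_possible == 0 or n_strong / n_possible <= completeness_cut:
            break  # strong data are no longer complete enough
        limit = d_min
    return limit


# Toy shells: (d_min in Angstrom, I/sigI values, number of possible reflections).
TOY_SHELLS = [
    (4.0, [10.0] * 95, 100),               # 95% strong
    (3.0, [5.0] * 90 + [1.0] * 10, 100),   # 90% strong
    (2.5, [4.0] * 60 + [2.0] * 40, 100),   # 60% strong, fails the 0.85 cut
    (2.0, [1.0] * 100, 100),               # no strong reflections
]

if __name__ == "__main__":
    print(strong_data_resolution_limit(TOY_SHELLS))
```

    With the default values from the keyword list, the toy data are cut at the second shell, because the third shell has too few strong reflections.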