Contents

- xtriage: Peter Zwart
- Phil command interpreter: Ralf W. Grosse-Kunstleve

The xtriage method is a tool for analyzing structure factor data to identify outliers, presence of twinning and other conditions that the user should be aware of.

Basic sanity checks performed by xtriage are

- Wilson plot sanity
- Probabilistic Matthews analysis
- Data strength analysis
- Ice ring analysis
- Twinning analysis
- Reference analysis (determines possible re-indexing. optional)
- Detwinning and data massaging (optional)

See also:
``phenix.reflection_statistics` <reflection_statistics.html>`__
(comparison of multiple data sets)

- (1) A log file that contains all the screen output plus some ccp4 style graphs
- optional: an mtz file with massaged data

**Scope**: *parameters.asu_contents*

keys: * n_residues :: Number of residues per monomer/unit * n_bases :: Number of nucleotides per monomer/unit * n_copies_per_asu :: Number of copies in the ASU.

These keywords control the determination of the absolute scale. If the number of residues/bases is not specified, a solvent content of 50% is assumed.

**Scope**: *parameters.misc_twin_parameters.missing_symmetry*

keys: * tanh_location :: tanh decision rule parameter * tanh_slope :: tanh decision rule parameter

The `tanh_location` and `tanh_slope` parameter control what R-value
is considered to be low enough to be considered a 'proper' symmetry
operator. the `tanh_location parameter` corresponds to the inflection
point of the approximate step function. Increasing `tanh_location`
will result in large R-value thresholds. `tanh_slope` is set to 50 and
should be okai.

**Scope**: *parameters.misc_twin_parameters.twinning_with_ncs*

keys: * perform_test :: can be set to True or False * n_bins :: Number of bins in determination of D_ncs

The perform_test is by default set to False. Setting it to True triggers the determination of the twin fraction while taking into account NCS parallel to the twin axis.

**Scope**: *parameters.misc_twin_parameters.twin_test_cuts*

keys: * high_resolution : high resolution for twin tests * low_resolution: low resolution for twin tests * isigi_cut: I/sig(I) threshold in automatic determination of high resolution limit * completeness_cut: completeness threshold in automatic determination of high resolution limit

The automatic determination of the resolution limit for the twinning test is determined on the basis of the completeness after removing intensities for which I/sigI < isigi_cut. The lowest limit obtain in this way is 3.5A. The value determined by the automatic procedure can be overruled by specification of the high_resolution keyword. The low resolution is set to 10A by default.

**Scope**: *parameters.reporting*

keys: * verbose :: verbosity level. * log :: log file name * ccp4_style_graphs :: Either True or False. Determines whether or not ccp4 style logfgra plots are written to the log file

**Scope**: *xray_data*

keys: * file_name :: file name with xray data. * obs_labels :: labels for observed data is format is mtz or XPLOR/CNS * calc_labels :: optional; labels for calculated data * unit_cell :: overrides unit cell in reflection file (if present) * space_group :: overrides space group in reflection file (if present) * high_resolution :: High resolution limit of the data * low_resolution :: Low resolution limit of the data

Note that the matching of specified and present labels involves a sub-string matching algorithm.

**Scope**: *optional*

keys: * hklout :: output mtz file * twinning.action :: Whether to detwin the data * twinning.twin_law :: using this twin law (h,k,l or x,y,z notation) * twinning.fraction :: The detwinning fraction. * b_value :: the resulting Wilson B value

The output mtz file contains an anisotropy corrected mtz file, with suspected outliers removed. The data is put scaled and has the specified Wilson B value. These options have an associated expert level of 10, and are not shown by default. Specification of the expert level on the command line as 'level=100' will show all available options.

Typing:

%phenix.xtriage some_data.sca residues=290 log=some_data.log

results in the following output (parts omitted).

*Matthews analysis*

First, a cell contents analysis is performed. Matthews coefficients, solvent content and solvent content probabilities are listed, and the most likely composition is guessed

Matthews coefficient and Solvent content statistics ---------------------------------------------------------------- | Copies | Solvent content | Matthews Coed. | P(solvent cont.) | |--------|-----------------|----------------|------------------| | 1 | 0.705 | 4.171 | 0.241 | | 2 | 0.411 | 2.085 | 0.750 | | 3 | 0.116 | 1.390 | 0.009 | ---------------------------------------------------------------- | Best guess : 2 copies in the asu | ----------------------------------------------------------------

*Data strength*

The next step, the strength of the data is gauged by determining the completeness of the in resolution bins after application of several I/sigI cut off values

Completeness and data strength analysis The following table lists the completeness in various resolution ranges, after applying a I/sigI cut. Miller indices for which individual I/sigI values are larger than the value specified in the top row of the table, are retained, while other intensities are discarded. The resulting completeness profiles are an indication of the strength of the data. ---------------------------------------------------------------------------------------- | Res. Range | I/sigI>1 | I/sigI>2 | I/sigI>3 | I/sigI>5 | I/sigI>10 | I/sigI>15 | ---------------------------------------------------------------------------------------- | 19.87 - 7.98 | 96.4% | 95.3% | 94.5% | 93.6% | 91.7% | 89.3% | | 7.98 - 6.40 | 99.2% | 98.2% | 97.1% | 95.5% | 90.9% | 84.7% | | 6.40 - 5.61 | 97.8% | 95.4% | 93.3% | 87.1% | 76.6% | 66.8% | | 5.61 - 5.11 | 98.2% | 95.9% | 94.0% | 87.9% | 74.1% | 58.0% | | 5.11 - 4.75 | 97.9% | 96.2% | 94.5% | 91.1% | 79.2% | 62.5% | | 4.75 - 4.47 | 97.4% | 95.4% | 93.1% | 88.9% | 76.6% | 56.9% | | 4.47 - 4.25 | 96.5% | 94.5% | 92.1% | 88.0% | 75.3% | 56.5% | | 4.25 - 4.07 | 96.6% | 94.0% | 91.2% | 85.4% | 69.3% | 44.9% | | 4.07 - 3.91 | 95.6% | 92.1% | 87.8% | 80.1% | 61.9% | 34.8% | | 3.91 - 3.78 | 94.3% | 89.6% | 83.7% | 71.1% | 48.7% | 20.5% | | 3.78 - 3.66 | 95.7% | 90.9% | 85.6% | 71.5% | 42.4% | 14.8% | | 3.66 - 3.56 | 91.6% | 85.0% | 78.0% | 63.3% | 34.1% | 9.5% | | 3.56 - 3.46 | 89.8% | 80.4% | 70.2% | 52.8% | 22.2% | 3.8% | | 3.46 - 3.38 | 87.4% | 76.3% | 64.6% | 46.7% | 15.5% | 1.7% | ----------------------------------------------------------------------------------------

This analysis is also used in the automatic determination of the high resolution limit used in the intensity statistics and twin analyses.

*Absolute, likelihood based Wilson scaling*

The (anisotropic) B value of the data is determined using a likelihood based approach. The resulting B value/tensor is reported:

Maximum likelihood isotropic Wilson scaling ML estimate of overall B value of sec17.sca:i_obs,sigma: 75.85 A**(-2) Estimated -log of scale factor of sec17.sca:i_obs,sigma: -2.50 Maximum likelihood anisotropic Wilson scaling ML estimate of overall B_cart value of sec17.sca:i_obs,sigma: 68.92, 0.00, 0.00 68.92, 0.00 91.87 Equivalent representation as U_cif: 0.87, -0.00, -0.00 0.87, 0.00 1.16 ML estimate of -log of scale factor of sec17.sca:i_obs,sigma: -2.50 Correcting for anisotropy in the data

A large spread in (especially the diagonal) values indicates anisotropy. The anisotropy is corrected for. This clears up intensity statistics.

*Low resolution completeness analysis*

Mostly data processing software do not provide a clear picture of the
completeness of the data at low resolution. For this reason, *xtriage*
lists the completeness of the data up to 5 Angstrom:

Low resolution completeness analysis The following table shows the completeness of the data to 5 Angstrom. unused: - 19.8702 [ 0/68 ] 0.000 bin 1: 19.8702 - 10.3027 [425/455] 0.934 bin 2: 10.3027 - 8.3766 [443/446] 0.993 bin 3: 8.3766 - 7.3796 [446/447] 0.998 bin 4: 7.3796 - 6.7336 [447/449] 0.996 bin 5: 6.7336 - 6.2673 [450/454] 0.991 bin 6: 6.2673 - 5.9080 [428/429] 0.998 bin 7: 5.9080 - 5.6192 [459/466] 0.985 bin 8: 5.6192 - 5.3796 [446/450] 0.991 bin 9: 5.3796 - 5.1763 [437/440] 0.993 bin 10: 5.1763 - 5.0006 [460/462] 0.996 unused: 5.0006 - [ 0/0 ]

This analysis allows one to quickly see if there is any unusually low completeness at low resolution, for instance due to missing overloads.

*Wilson plot analysis*

A Wilson plot analysis *a la* ARP/wARP is carried out, albeit with a
slightly different standard curve:

Mean intensity analysis Analysis of the mean intensity. Inspired by: Morris et al. (2004). J. Synch. Rad.11, 56-59. The following resolution shells are worrisome: ------------------------------------------------ | d_spacing | z_score | compl. | <Iobs>/<Iexp> | ------------------------------------------------ | 5.773 | 7.95 | 0.99 | 0.658 | | 5.423 | 8.62 | 0.99 | 0.654 | | 5.130 | 6.31 | 0.99 | 0.744 | | 4.879 | 5.36 | 0.99 | 0.775 | | 4.662 | 4.52 | 0.99 | 0.803 | | 3.676 | 5.45 | 0.99 | 1.248 | ------------------------------------------------ Possible reasons for the presence of the reported unexpected low or elevated mean intensity in a given resolution bin are : - missing overloaded or weak reflections - suboptimal data processing - satellite (ice) crystals - NCS - translational pseudo symmetry (detected elsewhere) - outliers (detected elsewhere) - ice rings (detected elsewhere) - other problems Note that the presence of abnormalities in a certain region of reciprocal space might confuse the data validation algorithm throughout a large region of reciprocal space, even though the data is acceptable in those areas.

A very long list of warnings could indicate a serious problem with your
data. Decisions on whether or not the data is useful, should be cut or
should thrown away altogether, is not straightforward and falls beyond
the scope of *xtriage*.

*Outlier detection and rejection*

Possible outliers are detected on the basis Wilson statistics:

Possible outliers Inspired by: Read, Acta Cryst. (1999). D55, 1759-1764 Acentric reflections: ----------------------------------------------------------------- | d_space | H K L | |E| | p(wilson) | p(extreme) | ----------------------------------------------------------------- | 3.716 | 8, 6, 31 | 3.52 | 4.06e-06 | 5.87e-02 | ----------------------------------------------------------------- p(wilson) : 1-(1-exp[-|E|^2]) p(extreme) : 1-(1-exp[-|E|^2])^(n_acentrics) p(wilson) is the probability that an E-value of the specified value would be observed when it would selected at random from the given data set. p(extreme) is the probability that the largest |E| value is larger or equal than the observed largest |E| value. Both measures can be used for outlier detection. p(extreme) takes into account the size of the data set.

Outliers are removed from the data set in the further analysis. Note that if pseudo translational symmetry is present, a large number of 'outliers' will be present.

*Ice ring detection*

Ice rings in the data are detected by analyzing the completeness and the mean intensity:

Ice ring related problems The following statistics were obtained from ice-ring insensitive resolution ranges mean bin z_score : 3.47 ( rms deviation : 2.83 ) mean bin completeness : 0.99 ( rms deviation : 0.00 ) The following table shows the z-scores and completeness in ice-ring sensitive areas. Large z-scores and high completeness in these resolution ranges might be a reason to re-assess your data processing if ice rings were present. ------------------------------------------------ | d_spacing | z_score | compl. | Rel. Ice int. | ------------------------------------------------ | 3.897 | 0.12 | 0.97 | 1.000 | | 3.669 | 0.96 | 0.95 | 0.750 | | 3.441 | 2.14 | 0.94 | 0.530 | ------------------------------------------------ Abnormalities in mean intensity or completeness at resolution ranges with a relative ice ring intensity lower then 0.10 will be ignored. At 3.67 A there is an lower occupancy then expected from the rest of the data set. Even though the completeness is lower as expected, the mean intensity is still reasonable at this resolution At 3.44 A there is an lower occupancy then expected from the rest of the data set. Even though the completeness is lower as expected, the mean intensity is still reasonable at this resolution There were 2 ice ring related warnings This could indicate the presence of ice rings.

*Anomalous signal*

If the input reflection file contains separate intensities for each Friedel mate, a quality measure of the anomalous signal is reported:

Analysis of anomalous differences Table of measurability as a function of resolution The measurability is defined as the fraction of Bijvoet related intensity differences for which |delta_I|/sigma_delta_I > 3.0 min[I(+)/sigma_I(+), I(-)/sigma_I(-)] > 3.0 holds. The measurability provides an intuitive feeling of the quality of the data, as it is related to the number of reliable Bijvoet differences. When the data is processed properly and the standard deviations have been estimated accurately, values larger than 0.05 are encouraging. unused: - 19.8704 [ 0/68 ] bin 1: 19.8704 - 7.0211 [1551/1585] 0.1924 bin 2: 7.0211 - 5.6142 [1560/1575] 0.0814 bin 3: 5.6142 - 4.9168 [1546/1555] 0.0261 bin 4: 4.9168 - 4.4729 [1563/1582] 0.0081 bin 5: 4.4729 - 4.1554 [1557/1577] 0.0095 bin 6: 4.1554 - 3.9124 [1531/1570] 0.0083 bin 7: 3.9124 - 3.7178 [1541/1585] 0.0069 bin 8: 3.7178 - 3.5569 [1509/1552] 0.0028 bin 9: 3.5569 - 3.4207 [1522/1606] 0.0085 bin 10: 3.4207 - 3.3032 [1492/1574] 0.0044 unused: 3.3032 - [ 0/0 ] The anomalous signal seems to extend to about 5.9 A (or to 5.2 A, from a more optimistic point of view) The quoted resolution limits can be used as a guideline to decide where to cut the resolution for phenix.hyss As the anomalous signal is not very strong in this data set substructure solution via SAD might prove to be a challenge. Especially if only low resolution reflections are used, the resulting substructures could contain a significant amount of of false positives.

*Determination of twin laws*

Twin laws are found using a modified le-Page algorithm and classified as merohedral and pseudo merohedral:

Determining possible twin laws. The following twin laws have been found: ------------------------------------------------------------------------------- | Type | Axis | R metric (%) | delta (le Page) | delta (Lebedev) | Twin law | ------------------------------------------------------------------------------- | M | 2-fold | 0.000 | 0.000 | 0.000 | -h,k,-l | ------------------------------------------------------------------------------- M: Merohedral twin law PM: Pseudomerohedral twin law 1 merohedral twin operators found 0 pseudo-merohedral twin operators found In total, 1 twin operator were found

Non-merohedral (reticular) twinning is not considered. The `R-metric`
is equal to :

Sum (M_i-N_i)^2 / Sum M_i^2

M_i are elements of the original metric tensor and N_i are elements of the metric tensor after 'idealizing' the unit cell, in compliance with the restrictions the twin law poses on the lattice if it would be a 'true' symmetry operator.

The `delta le-Page` is the familiar obliquity. The `delta Lebedev`
is a twin law quality measure developed by A. Lebedev (Lebedev, Vagin &
Murshudov; Acta Cryst. (2006). D62, 83-95.).

Note that for merohedral twin laws, all quality indicators are 0. For
non-merohedral twin laws, this value is larger or equal to zero. If a
twin law is classified as non-merohedral, but has a `delta le-page`
equal to zero, the twin law is sometimes referred to as a *metric
merohedral twin law*.

*Locating translational pseudo symmetry (TPS)*

TPS is located by inspecting a low resolution Patterson function. Peaks and their significance levels are reported:

Largest Patterson peak with length larger then 15 Angstrom Frac. coord. : 0.027 0.057 0.345 Distance to origin : 17.444 Height (origin=100) : 3.886 p_value(height) : 9.982e-01 The reported p_value has the following meaning: The probability that a peak of the specified height or larger is found in a Patterson function of a macro molecule that does not have any translational pseudo symmetry is equal to 9.982e-01 p_values smaller then 0.05 might indicate weak translation pseudo symmetry, or the self vector of a large anomalous scatterer such as Hg, whereas values smaller then 1e-3 are a very strong indication for the presence of translational pseudo symmetry.

*Moments of the observed intensities*

The moment of the observed intensity/amplitude distribution, are reported, as well as their expected values:

Wilson ratio and moments Acentric reflections <I^2>/<I>^2 :1.955 (untwinned: 2.000; perfect twin 1.500) <F>^2/<F^2> :0.796 (untwinned: 0.785; perfect twin 0.885) <|E^2 - 1|> :0.725 (untwinned: 0.736; perfect twin 0.541) Centric reflections <I^2>/<I>^2 :2.554 (untwinned: 3.000; perfect twin 2.000) <F>^2/<F^2> :0.700 (untwinned: 0.637; perfect twin 0.785) <|E^2 - 1|> :0.896 (untwinned: 0.968; perfect twin 0.736)

Significant departure from the ideal values could indicate the presence of twinning or pseudo translations. For instance, an <I^2>/<I>^2 value significantly lower than 2.0, might point to twinning, whereas a value significantly larger than 2.0, might point towards pseudo translational symmetry.

*Cumulative intensity distribution*

The cumulative intensity distribution is reported:

----------------------------------------------- | Z | Nac_obs | Nac_theo | Nc_obs | Nc_theo | ----------------------------------------------- | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | | 0.1 | 0.081 | 0.095 | 0.168 | 0.248 | | 0.2 | 0.167 | 0.181 | 0.292 | 0.345 | | 0.3 | 0.247 | 0.259 | 0.354 | 0.419 | | 0.4 | 0.321 | 0.330 | 0.420 | 0.474 | | 0.5 | 0.392 | 0.394 | 0.473 | 0.520 | | 0.6 | 0.452 | 0.451 | 0.521 | 0.561 | | 0.7 | 0.506 | 0.503 | 0.570 | 0.597 | | 0.8 | 0.552 | 0.551 | 0.603 | 0.629 | | 0.9 | 0.593 | 0.593 | 0.636 | 0.657 | | 1.0 | 0.635 | 0.632 | 0.673 | 0.683 | ----------------------------------------------- | Maximum deviation acentric : 0.015 | | Maximum deviation centric : 0.080 | | | | <NZ(obs)-NZ(twinned)>_acentric : -0.004 | | <NZ(obs)-NZ(twinned)>_centric : -0.039 | -----------------------------------------------

The N(Z) test is related to the moments based test discussed above. Nac_obs is the observed cumulative distribution of normalized intensities of the acentric data, and uses the full distribution rather then just a moment.

The effects of twinning shows itself for Nac_obs having a more sigmoidal character. In the case of pseudo centering, Nac_obs will tend towards Nc_theo.

*The L test*

The `L-test` is an intensity statistic developed by Padilla and Yeates
(Acta Cryst. (2003), D59: 1124-1130) and is reasonably robust in the
presence of anisotropy and pseudo centering, especially if the miller
indices are partitioned properly. Partitioning is carried out on the
basis of a Patterson analysis. A significant deviation of both <|L|>
and <L^2> from the expected values indicate twinning or other problems:

L test for acentric data using difference vectors (dh,dk,dl) of the form: (2hp,2kp,2lp) where hp, kp, and lp are random signed integers such that 2 <= |dh| + |dk| + |dl| <= 8 Mean |L| :0.482 (untwinned: 0.500; perfect twin: 0.375) Mean L^2 :0.314 (untwinned: 0.333; perfect twin: 0.200) The distribution of |L| values indicates a twin fraction of 0.00. Note that this estimate is not as reliable as obtained via a Britton plot or H-test if twin laws are available.

Whether or not the <|L|> and <L^2> differ significantly from the expected values, is shown in the final summary (see below).

*Analysis of twin laws*

Twin law specific tests (Britton, H and RvsR) are performed:

Results of the H-test on a-centric data: (Only 50.0% of the strongest twin pairs were used) mean |H| : 0.183 (0.50: untwinned; 0.0: 50% twinned) mean H^2 : 0.055 (0.33: untwinned; 0.0: 50% twinned) Estimation of twin fraction via mean |H|: 0.317 Estimation of twin fraction via cum. dist. of H: 0.308 Britton analysis Extrapolation performed on 0.34 < alpha < 0.495 Estimated twin fraction: 0.283 Correlation: 0.9951 R vs R statistic: R_abs_twin = <|I1-I2|>/<|I1+I2|> Lebedev, Vagin, Murshudov. Acta Cryst. (2006). D62, 83-95 R_abs_twin observed data : 0.193 R_abs_twin calculated data : 0.328 R_sq_twin = <(I1-I2)^2>/<(I1+I2)^2> R_sq_twin observed data : 0.044 R_sq_twin calculated data : 0.120 Maximum Likelihood twin fraction determination Zwart, Read, Grosse-Kunstleve & Adams, to be published. The estimated twin fraction is equal to 0.227

These tests allow one to estimate the twin fraction and (if calculated
data is provided) determine if rotational pseudo symmetry is present.
Another option (albeit more computationally expensive), is to estimate
the correlation between error free, untwinned, twin related normalized
intensities (use the key `perform=True` on the command line)

Estimation of twin fraction, while taking into account the effects of possible NCS parallel to the twin axis. Zwart, Read, Grosse-Kunstleve & Adams, to be published. A parameters D_ncs will be estimated as a function of resolution, together with a global twin fraction. D_ncs is an estimate of the correlation coefficient between untwinned, error-free, twin related, normalized intensities. Large values (0.95) could indicate an incorrect point group. Value of D_ncs larger than say, 0.5, could indicate the presence of NCS. The twin fraction should be smaller or similar to other estimates given elsewhere. The refinement can take some time. For numerical stability issues, D_ncs is limited between 0 and 0.95. The twin fraction is allowed to vary between 0 and 0.45. Refinement cycle numbers are printed out to keep you entertained. . . . . 5 . . . . 10 . . . . 15 . . . . 20 . . . . 25 . . . . 30 . . . . 35 . . . . 40 . . . . 45 . . . . 50 . . . . 55 . . . . 60 . . . . 65 . . . . 70 . . . . 75 . . . Cycle : 78 ----------- Log[likelihood]: 22853.700 twin fraction: 0.201 D_ncs in resolution ranges: 9.8232 -- 4.5978 :: 0.830 4.5978 -- 3.7139 :: 0.775 3.7139 -- 3.2641 :: 0.745 3.2641 -- 2.9747 :: 0.746 2.9747 -- 2.7666 :: 0.705 2.7666 -- 2.6068 :: 0.754 2.6068 -- 2.4784 :: 0.735 The correlation of the calculated F^2 should be similar to the estimated values. Observed correlation between twin related, untwinned calculated F^2 in resolution ranges, as well as estimates D_ncs^2 values: Bin d_max d_min CC_obs D_ncs^2 1) 9.8232 -- 4.5978 :: 0.661 0.689 2) 4.5978 -- 3.7139 :: 0.544 0.601 3) 3.7139 -- 3.2641 :: 0.650 0.556 4) 3.2641 -- 2.9747 :: 0.466 0.557 5) 2.9747 -- 2.7666 :: 0.426 0.497 6) 2.7666 -- 2.6068 :: 0.558 0.569 7) 2.6068 -- 2.4784 :: 0.531 0.540

The twin fraction obtained via this method is usually lower than what is obtained by refinement. The estimated correlation coefficient (D_ncs^2) between the twin related F^2 values, is however reasonably accurate.

*Exploring higher metric symmetry*

The fact that a twin law is present, could indicate that the data was incorrectly processed as well. The example below, shows a P41212 data set processed in P1:

Exploring higher metric symmetry Point group of data as dictated by the space group is P 1 the point group in the Niggli setting is P 1 The point group of the lattice is P 4 2 2 A summary of R values for various possible point groups follow. ----------------------------------------------------------------------------------------------- | Point group | mean R_used | max R_used | mean R_unused | min R_unused | choice | ----------------------------------------------------------------------------------------------- | P 1 | None | None | 0.022 | 0.017 | | | P 4 2 2 | 0.022 | 0.025 | None | None | <--- | | P 1 2 1 | 0.017 | 0.017 | 0.026 | 0.024 | | | Hall: C 2y (x-y,x+y,z) | 0.025 | 0.025 | 0.022 | 0.017 | | | P 4 | 0.025 | 0.028 | 0.025 | 0.025 | | | Hall: C 2 2 (x-y,x+y,z) | 0.024 | 0.025 | 0.017 | 0.017 | | | Hall: C 2y (x+y,-x+y,z) | 0.024 | 0.024 | 0.023 | 0.017 | | | P 1 1 2 | 0.028 | 0.028 | 0.021 | 0.017 | | | P 2 1 1 | 0.027 | 0.027 | 0.022 | 0.017 | | | P 2 2 2 | 0.023 | 0.028 | 0.025 | 0.025 | | ----------------------------------------------------------------------------------------------- R_used: mean and maximum R value for symmetry operators *used* in this point group R_unused: mean and minimum R value for symmetry operators *not used* in this point group The likely point group of the data is: P 4 2 2

As in phenix.explore_metric_symmetry, the possible space groups are listed as well (not shown here).

*Twin analysis summary*

The results of the twin analysis are summarized. Typical outputs look as follows for cases of wrong symmetry, twin laws but no suspected twinning and twinned data respectively.

Wrong symmetry:

------------------------------------------------------------------------------- Twinning and intensity statistics summary (acentric data): Statistics independent of twin laws - <I^2>/<I>^2 : 2.104 - <F>^2/<F^2> : 0.770 - <|E^2-1|> : 0.757 - <|L|>, <L^2>: 0.512, 0.349 Multivariate Z score L-test: 2.777 The multivariate Z score is a quality measure of the given spread in intensities. Good to reasonable data is expected to have a Z score lower than 3.5. Large values can indicate twinning, but small values do not necessarily exclude it. Statistics depending on twin laws ------------------------------------------------------ | Operator | type | R obs. | Britton alpha | H alpha | ------------------------------------------------------ | k,h,-l | PM | 0.025 | 0.458 | 0.478 | | -h,k,-l | PM | 0.017 | 0.459 | 0.487 | | -k,h,l | PM | 0.024 | 0.458 | 0.478 | | -k,-h,-l | PM | 0.024 | 0.458 | 0.478 | | -h,-k,l | PM | 0.028 | 0.458 | 0.476 | | h,-k,-l | PM | 0.027 | 0.458 | 0.477 | | k,-h,l | PM | 0.024 | 0.457 | 0.478 | ------------------------------------------------------ Patterson analysis - Largest peak height : 6.089 (corresponding p value : 6.921e-01) The largest off-origin peak in the Patterson function is 6.09% of the height of the origin peak. No significant pseudo-translation is detected. The results of the L-test indicate that the intensity statistics behave as expected. No twinning is suspected. The symmetry of the lattice and intensity however suggests that the input space group is too low. See the relevant sections of the log file for more details on your choice of space groups. As the symmetry is suspected to be incorrect, it is advisable to reconsider data processing. -------------------------------------------------------------------------------

Twin laws present but no suspected twinning:

------------------------------------------------------------------------------- Twinning and intensity statistics summary (acentric data): Statistics independent of twin laws - <I^2>/<I>^2 : 1.955 - <F>^2/<F^2> : 0.796 - <|E^2-1|> : 0.725 - <|L|>, <L^2>: 0.482, 0.314 Multivariate Z score L-test: 1.225 The multivariate Z score is a quality measure of the given spread in intensities. Good to reasonable data is expected to have a Z score lower than 3.5. Large values can indicate twinning, but small values do not necessarily exclude it. Statistics depending on twin laws ------------------------------------------------------ | Operator | type | R obs. | Britton alpha | H alpha | ------------------------------------------------------ | -h,k,-l | M | 0.455 | 0.016 | 0.035 | ------------------------------------------------------ Patterson analysis - Largest peak height : 3.886 (corresponding p value : 9.982e-01) The largest off-origin peak in the Patterson function is 3.89% of the height of the origin peak. No significant pseudo-translation is detected. The results of the L-test indicate that the intensity statistics behave as expected. No twinning is suspected. Even though no twinning is suspected, it might be worthwhile carrying out a refinement using a dedicated twin target anyway, as twinned structures with low twin fractions are difficult to distinguish from non-twinned structures. -------------------------------------------------------------------------------

Twinned data:

------------------------------------------------------------------------------- Twinning and intensity statistics summary (acentric data): Statistics independent of twin laws - <I^2>/<I>^2 : 1.587 - <F>^2/<F^2> : 0.871 - <|E^2-1|> : 0.568 - <|L|>, <L^2>: 0.387, 0.212 Multivariate Z score L-test: 11.589 The multivariate Z score is a quality measure of the given spread in intensities. Good to reasonable data is expected to have a Z score lower than 3.5. Large values can indicate twinning, but small values do not necessarily exclude it. Statistics depending on twin laws ------------------------------------------------------ | Operator | type | R obs. | Britton alpha | H alpha | ------------------------------------------------------ | -l,-k,-h | PM | 0.170 | 0.330 | 0.325 | ------------------------------------------------------ Patterson analysis - Largest peak height : 7.300 (corresponding p value : 4.454e-01) The largest off-origin peak in the Patterson function is 7.30% of the height of the origin peak. No significant pseudo-translation is detected. The results of the L-test indicate that the intensity statistics are significantly different then is expected from good to reasonable, untwinned data. As there are twin laws possible given the crystal symmetry, twinning could be the reason for the departure of the intensity statistics from normality. It might be worthwhile carrying refinement with a twin specific target function. -------------------------------------------------------------------------------

In the summary, the significance of the departure of the values of the L-test from normality are stated. The multivariate Z-score (also known as the Mahalanobis distance) is used for this purpose.

Running xtriage is easy. From the command-line you can type:

phenix.xtriage data.sca

When an MTZ or CNS file is used, labels have to be specified:

phenix.xtriage file=my_brilliant_data.mtz obs_labels='F(+),SIGF(+),F(-),SIGF(-)'

In order to perform a Matthews analysis, it might be useful to specify the number of residues/nucleotides in the crystallized macro molecule:

phenix.xtriage data.sca n_residues=230 n_bases=25

By default, the screen output plus additional ccp4 style graphs (viewable with the ccp4 programs loggraph) are echoed to a file named logfile.log. The command line arguments and all other defaults settings are summarized in a PHIL parameter data block given at the beginning of the logfile / screen output:

scaling.input { parameters { asu_contents { n_residues = None n_bases = None n_copies_per_asu = None } misc_twin_parameters { missing_symmetry { tanh_location = 0.08 tanh_slope = 50 } twinning_with_ncs { perform_analysis = False n_bins = 7 } twin_test_cuts { low_resolution = 10 high_resolution = None isigi_cut = 3 completeness_cut = 0.85 } } reporting { verbose = 1 log = "logfile.log" ccp4_style_graphs = True } } xray_data { file_name = "some_data.sca" obs_labels = None calc_labels = None unit_cell = 64.5 69.5 45.5 90 104.3 90 space_group = "P 1 21 1" high_resolution = None low_resolution = None } }

The defaults are good for most applications.

- Xtriage doesn't deal with data in centric space groups

- scaling
- input
- expert_level = 1 Expert level
- asu_contentsDefines the ASU contents
- sequence_file = None File containing protein or nucleic acid sequences. Values for n_residues and n_bases will be extracted automatically if this is provided.
- n_residues = None Number of residues in structural unit
- n_bases = None Number of nucleotides in structural unit
- n_sites = 5 Number of atoms in anomalous substructure
- n_copies_per_asu = None Number of copies per ASU. If not specified, Matthews analyses is performed

- xray_dataDefines xray data
- file_name = None File name with data
- obs_labels = None Labels for observed data
- calc_labels = None Lables for calculated data
- unit_cell = None Unit cell parameters
- space_group = None space group
- high_resolution = None High resolution limit
- low_resolution = None Low resolution limit
- skip_sanity_checks = False Disable guards against corrupted input data.
- completeness_as_non_anomalous = True Report completeness of non-anomalous version of arrays
- referenceA reference data set. For the investigation of possible reindexing options
- dataDefines an x-ray dataset
- file_name = None File name
- labels = None Labels
- unit_cell = None Unit cell parameters"
- space_group = None Space group

- structure
- file_name = None Filename of reference PDB file

- dataDefines an x-ray dataset

- parametersBasic settings
- reportingSome output issues
- verbose = 1 Verbosity
- log = logfile.log Logfile
- loggraphs = False

- merging
- n_bins = 10
- skip_merging = False

- misc_twin_parametersVarious settings for twinning or symmetry tests
- l_test_dhkl = None
- apply_basic_filters_prior_to_twin_analysis = True Keep data cutoffs from the basic_analyses module (I/sigma,Wilson scaling,Anisotropy) when twin stats are computed.
- missing_symmetrySettings for missing symmetry tests
- sigma_inflation = 1.25 Standard deviations of intensities can be increased to make point group determination more reliable.

- twinning_with_ncsAnalysing the possibility of an NCS operator parallel to a twin law.
- perform_analyses = False Determines whether or not this analyses is carried out.
- n_bins = 7 Number of bins used in NCS analyses.

- twin_test_cutsVarious cuts used in determining resolution limit for data used in intensity statistics
- low_resolution = 10.0 Low resolution
- high_resolution = None High resolution
- isigi_cut = 3.0 I/sigI ratio used in completeness cut
- completeness_cut = 0.85 Data is cut at resolution where intensities with I/sigI greater than isigi_cut are more than completeness_cut complete

- reportingSome output issues
- optional
- hklout = None
- hklout_type = mtz sca *Auto
- label_extension = "massaged"
- anisoParameters dealing with anisotropy correction
- action = *remove_aniso None Remove anisotropy?
- final_b = *eigen_min eigen_mean user_b_iso Final b value
- b_iso = None User specified B value

- outlierOutlier analyses
- action = *extreme basic beamstop None Outlier protocol
- parametersParameters for outlier detection
- basic_wilson
- level = 1E-6

- extreme_wilson
- level = 0.01

- beamstop
- level = 0.001
- d_min = 10.0

- basic_wilson

- symmetry
- action = detwin twin *None
- twinning_parameters
- twin_law = None
- fraction = None

- guiGUI-specific parameters, not applicable to command-line version.
- result_file = None Pickled result file for Phenix GUI
- output_dir = None
- job_title = None Job title in PHENIX GUI, not used on command line

- input