phenix.hyss examples

The only input file required for running HySS is a file with the reflection data. HySS reads the following formats directly:

merged scalepack files

unmerged scalepack files (but merged files are preferred!)

CCP4 MTZ files with merged data

CCP4 MTZ files with unmerged data (but merged files are preferred!)

d*trek .ref files

XDS_ASCII files with merged data

CNS reflection files

SHELX reflection files with amplitudes

nsf_d2_peak.sca

The CCI Apps binary bundles include a scalepack file with anomalous peak data for the structure with the PDB access code 1NSF (courtesy of A.T. Brunger). To find the 8 selenium sites enter:

phenix.hyss nsf_d2_peak.sca 8 se

This leads to:

Reading reflection file: nsf_d2_peak.sca

Space group found in file: P 6
Is this the correct space group? [Y/N]:

HySS prompts for a confirmation of the space group because space group P6 is often used as a placeholder during data reduction. If the space group symbol found in the reflection file is not correct it can be changed. However, in this case the symbol is correct. At the prompt enter Y to continue. Alternatively, the interactive prompt can be avoided by using the --space_group option:

phenix.hyss nsf_d2_peak.sca 8 se --space_group=p6

HySS will quickly print a few screen-pages with information about the data (e.g. the magnitude of the anomalous signal) and the many search parameters. The most interesting output is produced after this point:

Entering search loop:

p = peaklist index in Patterson map
f = peaklist index in two-site translation function
cc = correlation coefficient after extrapolation scan
r = number of dual-space recycling cycles
cc = final correlation coefficient

p=000 f=000 cc=0.364 r=015 cc=0.479 [ best cc: 0.479 ]
p=000 f=001 cc=0.310 r=015 cc=0.477 [ best cc: 0.479 0.477 ]
Number of matching sites of top 2 structures: 11
p=000 f=002 cc=0.166 r=015 cc=0.479 [ best cc: 0.479 0.479 0.477 ]
Number of matching sites of top 2 structures: 11
Number of matching sites of top 3 structures: 11

It will take a few seconds for each line starting with p= to appear. Each of these lines summarizes the result of one trial consisting of an evaluation of the Patterson function, two fast translation functions, and 15 cycles of dual-space recycling. The important number to watch is the final correlation. In the first three trials HySS finds three substructure models with promisingly high correlations. These models are compared, taking allowed origin shifts and the hand ambiguity into account. The three models have more than 2/3 of the expected number of sites in common. Therefore HySS decides that the search is complete and prints a summary of the matching sites:

Top 3 correlations:
p=000 f=000 cc=0.364 r=015 cc=0.479
p=000 f=002 cc=0.166 r=015 cc=0.479
p=000 f=001 cc=0.310 r=015 cc=0.477
Match summary:
  Operator:
       rotation: {{-1.0, 0.0, 0.0}, {0.0, -1.0, 0.0}, {0.0, 0.0, -1.0}}
    translation: (-9.6289517721653785e-38, 0.0, 0.091526465343537006)
  rms coordinate differences: 0.06
  Pairs: 11
    site001 site001 0.018
    site002 site002 0.056
    site003 site003 0.033
    site004 site004 0.026
    site005 site005 0.050
    site006 site006 0.103
    site007 site007 0.040
    site008 site008 0.063
    site009 site010 0.067
    site010 site009 0.120
    site011 site011 0.029
  Singles model 1: 0
  Singles model 2: 0
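
The numbers in the output above can be cross-checked with a few lines of Python. The following is only an illustrative sketch, not HySS code: the function names are hypothetical, the trial lines and pair distances are copied from the output shown above, and it assumes that the reported rms value is the root-mean-square of the listed pair distances and that the stopping rule is exactly the "more than 2/3 of the expected sites" criterion described in the text.

```python
import math
import re

# Trial lines copied from the search-loop output above.
TRIAL_LINES = [
    "p=000 f=000 cc=0.364 r=015 cc=0.479 [ best cc: 0.479 ]",
    "p=000 f=001 cc=0.310 r=015 cc=0.477 [ best cc: 0.479 0.477 ]",
    "p=000 f=002 cc=0.166 r=015 cc=0.479 [ best cc: 0.479 0.479 0.477 ]",
]

# Per-pair distances copied from the "Pairs" listing above.
PAIR_DISTANCES = [0.018, 0.056, 0.033, 0.026, 0.050, 0.103,
                  0.040, 0.063, 0.067, 0.120, 0.029]

TRIAL_RE = re.compile(r"p=(\d+) f=(\d+) cc=([\d.]+) r=(\d+) cc=([\d.]+)")


def parse_trial(line):
    """Return (p, f, cc_after_extrapolation, r, cc_final) for one 'p=' line."""
    match = TRIAL_RE.match(line)
    if match is None:
        raise ValueError("not a trial line: %r" % line)
    p, f, cc_first, r, cc_final = match.groups()
    return int(p), int(f), float(cc_first), int(r), float(cc_final)


def rank_trials(lines):
    """Sort trials by final correlation, highest first; ties keep input order."""
    return sorted((parse_trial(line) for line in lines), key=lambda t: -t[4])


def enough_sites_in_common(n_matching, n_expected):
    """Assumed stopping rule: more than 2/3 of the expected sites match."""
    return n_matching > (2.0 / 3.0) * n_expected


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


ranked = rank_trials(TRIAL_LINES)
print([(trial[1], trial[4]) for trial in ranked])  # (f index, final cc)
print(enough_sites_in_common(11, 8))
print("%.2f" % rms(PAIR_DISTANCES))
```

With the values above, the ranking reproduces the order of the "Top 3 correlations" listing (f=000, f=002, f=001), the 11 matching sites exceed 2/3 of the 8 expected sites, and the rms of the pair distances rounds to the reported 0.06.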

The matching sites are used to build a consensus model. The coordinates and occupancies are quickly refined using a quasi-Newton minimizer:

Minimizing consensus model (11 sites).
Truncating consensus model to expected number of sites.
Minimizing consensus model (8 sites).
Correlation coefficient for consensus model (8 sites): 0.483

The refined sites are sorted by occupancy in descending order. The model is truncated to the expected number of sites and refined again. After printing detailed timing information (not shown) the output ends with:

Storing all substructures found: nsf_d2_peak_hyss_models.pickle

Storing consensus model: nsf_d2_peak_hyss_consensus_model.pickle

Writing consensus model as PDB file: nsf_d2_peak_hyss_consensus_model.pdb

Writing consensus model as CNS SDB file: nsf_d2_peak_hyss_consensus_model.sdb

Writing consensus model as SOLVE xyz records: nsf_d2_peak_hyss_consensus_model.xyz
The fractional coordinates may also be useful in other programs.

Total CPU time: 49.60 seconds

The resulting coordinate files can be used for phasing and density modification with other programs.

gere_MAD.mtz

The CCP4 distribution includes a four-wavelength MAD dataset in the tutorial directory. To find the 12 selenium sites with HySS enter:

phenix.hyss $CEXAM/tutorial2000/data/gere_MAD.mtz 12 se

HySS automatically picks the wavelength with the strongest anomalous signal and finishes after about 34 seconds (2.8GHz Pentium 4 Linux), writing out the 12 (or sometimes only 11) sites in the various file formats.

mbp.hkl

The CNS tutorial includes data from a MAD experiment with ytterbium as the anomalous scatterer. CNS reflection files do not contain information about the unit cell and space group. However, HySS is able to extract this information from other files, e.g. other reflection files, CNS files, SOLVE files, PDB files or SHELX files. For example:

phenix.hyss $CNS_SOLVE/doc/html/tutorial/data/mbp/mbp.hkl 4 yb --symmetry $CNS_SOLVE/doc/html/tutorial/data/mbp/def

HySS reads the reflection data from the mbp.hkl file. The --symmetry option instructs HySS to scan the def file for unit cell parameters and a space group symbol. HySS finishes after about 26 seconds (2.8GHz Pentium 4 Linux).

Command line options

Enter phenix.hyss without arguments to obtain a list of the available command line options:

Usage: phenix.hyss [options] reflection_file n_sites element_symbol

Options:
  -h, --help            show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
                        External unit cell parameters
  --space_group=P212121|FILENAME
                        External space group symbol
  --symmetry=FILENAME   External file with symmetry information
  --chunk=n,i           Number of chunks for parallel execution and index for
                        one process
  --search=fast|full    Search mode
  --resolution=FLOAT    High resolution limit (minimum d-spacing, d_min)
  --low_resolution=FLOAT
                        Low resolution limit (maximum d-spacing, d_max)
  --site_min_distance=FLOAT
                        Minimum distance between substructure sites (default:
                        3.5)
  --site_min_distance_sym_equiv=FLOAT
                        Minimum distance between symmetrically-equivalent
                        substructure sites (overrides --site_min_distance)
  --site_min_cross_distance=FLOAT
                        Minimum distance between substructure sites not
                        related by symmetry (overrides --site_min_distance)
  --molecular_weight=FLOAT
                        Molecular weight
  --solvent_content=FLOAT
                        Solvent content (default: 0.55)
  --random_seed=INT     Seed for random number generator
  --real_space_squaring
                        Use real space squaring (as opposed to the tangent
                        formula)
  --data_label=STRING   Substring of reflection data label

See also:
  http://www.phenix-online.org/download/documentation/cci_apps/hyss/

Example: phenix.hyss w1.sca 66 Se

The --data_label, --resolution and --low_resolution options can be used to override the automatic selection of the reflection data and the resolution range. For example, to instruct HySS to use the peak data in the gere_MAD.mtz file (instead of the inflection point data) and to set the high resolution limit to 5 Angstrom, one may enter the following command:

phenix.hyss gere_MAD.mtz 12 se --data_label=peak --resolution=5

Output:

Command line arguments: gere_MAD.mtz 12 se --data_label=peak --resolution=5

Reading reflection file: gere_MAD.mtz

Ambiguous --data_label=peak

Possible choices:
  5: gere_MAD.mtz:FSEpeak,SIGFSEpeak,DSEpeak,SIGDSEpeak,merged
  6: gere_MAD.mtz:F(+)SEpeak,SIGF(+)SEpeak,F(-)SEpeak,SIGF(-)SEpeak

Please specify an unambiguous substring of the target label.

Sorry: Please try again.

That is a reasonable first try, but --data_label=peak is ambiguous here, so HySS asks for more information. Second try:

phenix.hyss gere_MAD.mtz 12 se --data_label="F(+)SEpeak" --resolution=5

Now HySS will actually perform the search. Typically the search finishes in less than 10 seconds, finding 8-12 sites depending on the random number generator (which is seeded with the current time unless the --random_seed option is used).
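
The substring selection behind --data_label can be pictured with a short Python sketch. This is not the HySS implementation: the function name is hypothetical, plain case-sensitive substring matching is an assumption, and the list contains only labels that appear in this document rather than the full contents of gere_MAD.mtz.

```python
# Labels copied from the listings in this document (not the complete file).
LABELS = [
    "gere_MAD.mtz:FSEinfl,SIGFSEinfl,DSEinfl,SIGDSEinfl",
    "gere_MAD.mtz:F(+)SEinfl,SIGF(+)SEinfl,F(-)SEinfl,SIGF(-)SEinfl",
    "gere_MAD.mtz:FSEpeak,SIGFSEpeak,DSEpeak,SIGDSEpeak,merged",
    "gere_MAD.mtz:F(+)SEpeak,SIGF(+)SEpeak,F(-)SEpeak,SIGF(-)SEpeak",
]


def matching_labels(labels, substring):
    """Return all labels containing the substring (assumed: case-sensitive)."""
    return [label for label in labels if substring in label]


def select_label(labels, substring):
    """Return the single matching label, or raise if none or several match."""
    matches = matching_labels(labels, substring)
    if len(matches) != 1:
        raise ValueError(
            "Ambiguous or unknown data label %r: %d matches"
            % (substring, len(matches)))
    return matches[0]


# "peak" occurs in two labels, so it cannot identify a single data array.
print(len(matching_labels(LABELS, "peak")))

# "F(+)SEpeak" occurs in exactly one label.
print(select_label(LABELS, "F(+)SEpeak"))
```

In this sketch "peak" matches both the merged and the F(+)/F(-) peak arrays, while "F(+)SEpeak" matches only the latter, which mirrors the two tries shown above.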

The --site_min_distance, --site_min_distance_sym_equiv, and --site_min_cross_distance options are available to override the default minimum distance of 3.5 Angstroms between substructure sites.

The --real_space_squaring option can be useful for large structures with high-resolution data. In this case the large number of triplets generated for the reciprocal-space direct methods procedure (i.e. the tangent formula) may lead to excessive memory allocation. By default HySS switches to real-space direct methods (i.e. E-map squaring) if it searches for more than 100 sites. If this limit is too high given the available memory, use the --real_space_squaring option. In our experience it is not critical to employ reciprocal-space direct methods for substructures with a large number of sites.

If the --molecular_weight and --solvent_content options are used, HySS will help in determining the number of substructure sites in the unit cell, interpreting the number of sites specified on the command line as the number of sites per molecule. For example:

phenix.hyss gere_MAD.mtz 2 se --molecular_weight=8000 --solvent_content=0.70

This tells HySS that we have a molecule with a molecular weight of 8 kDa, a crystal with an estimated solvent content of 70%, and that we expect to find 2 Se sites per molecule. The HySS output will now show the following:

#---------------------------------------------------------------------------#
| Formula for calculating the number of molecules given a molecular weight. |
|---------------------------------------------------------------------------|
| n_mol = ((1.0-solvent_content)*v_cell)/(molecular_weight*n_sym*.783)      |
#---------------------------------------------------------------------------#
Number of molecules: 6
Number of sites: 12
Values used in calculation:
  Solvent content: 0.70
  Unit cell volume: 476839
  Molecular weight: 8000.00
  Number of symmetry operators: 4

HySS will go on searching for 12 sites.
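
The arithmetic in the box above can be reproduced with the values printed in the output. The following Python sketch is only a worked example: the function name is hypothetical, and rounding to the nearest integer is an assumption that happens to agree with the reported "Number of molecules: 6".

```python
def number_of_molecules(solvent_content, v_cell, molecular_weight, n_sym):
    """Formula exactly as printed in the HySS output box."""
    return ((1.0 - solvent_content) * v_cell) / (molecular_weight * n_sym * 0.783)


# Values copied from the "Values used in calculation" listing.
n_mol_exact = number_of_molecules(
    solvent_content=0.70,
    v_cell=476839,
    molecular_weight=8000.0,
    n_sym=4,
)

# Assumption: the estimate is rounded to the nearest whole molecule.
n_mol = int(round(n_mol_exact))

# Two Se sites per molecule were given on the command line.
sites_per_molecule = 2
n_sites = n_mol * sites_per_molecule

print("%.2f" % n_mol_exact)
print(n_mol, n_sites)
```

The unrounded estimate is about 5.7 molecules, which gives 6 molecules and therefore 12 sites, in agreement with the output.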

If things go wrong

If the HySS consensus model does not lead to an interpretable electron density map, try the --search full option:

phenix.hyss your_file.sca 100 se --search full

This disables the automatic termination detection and the run will in general take considerably longer. If the full search leads to a better consensus model please let us know because we will want to improve the automatic termination detection.

Another possibility is to override the automatic determination of the high-resolution limit with the --resolution option. In some cases the resolution limit is very critical. Truncating the high-resolution limit of the data can sometimes lead to a successful search, as more reflections with a weak anomalous signal are excluded.

If there is no consensus model at the end of a HySS run please try alternative programs. For example, run SHELXD with the .ins and .hkl files that are automatically generated by HySS:

Writing anomalous differences as SHELX HKLF file: mbp_anom_diffs.hkl

Writing SHELXD ins file: mbp_anom_diffs.ins

If HySS does not produce a consensus model even though it is possible to solve the substructure with other programs we would like to investigate. Please send email to bugs@phenix-online.org.

Auxiliary programs

phenix.emma

EMMA stands for Euclidean Model Matching and is the algorithm used by HySS to superimpose two putative solutions and to derive the consensus model. The same algorithm is also available through the external phenix.emma command-line interface. Enter phenix.emma without arguments to obtain the help page:

usage: phenix.emma [options] reference_coordinates other_coordinates

options:
  -h, --help            show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
                        External unit cell parameters
  --space_group=P212121|FILENAME
                        External space group symbol
  --symmetry=FILENAME   External file with symmetry information
  --tolerance=FLOAT     match tolerance
  --diffraction_index_equivalent
                        Use only if models are diffraction-index equivalent.

Example: phenix.emma model1.pdb model2.sdb

The command takes two coordinate files in various formats (.pdb, CNS .sdb, SOLVE output, SHELX .ins) and compares the structures taking the space group symmetry, the allowed origin shifts and the hand ambiguity into account. The output is similar to the Match summary shown above in the example HySS output.

The match tolerance defaults to 3 Angstrom. For structures obtained with very low resolution data it may be necessary to specify a different tolerance, e.g. --tolerance=5.

The --symmetry option works just like it does for phenix.hyss. It can be used to extract symmetry information from external files such as input files for other programs (CNS, SHELX, SOLVE, ...) or reflection files. However, the --symmetry option is only required if the information about the unit cell and the space group is missing in both coordinate files given to phenix.emma.

phenix.emma conducts an exhaustive search and, in contrast to HySS, displays all possible matches. The match with the largest number of matching sites is shown first; the match with the smallest number of matching sites (often just one site) is shown last. The best match is therefore at the beginning of the output. If the output goes to the screen, do not be distracted by a large number of Singles near the end; scroll back to see the best match.

EMMA is also available via a web interface.

phenix.xtriage

If HySS cannot solve the structure in default mode it may be worth looking at some statistics of the reflection data using the phenix.xtriage command. Please refer to the dedicated phenix.xtriage documentation.

phenix.reflection_statistics

Comparisons between multiple datasets are available via the phenix.reflection_statistics command:

usage: phenix.reflection_statistics [options] reflection_file [...]

options:
  -h, --help            show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
                        External unit cell parameters
  --space_group=P212121|FILENAME
                        External space group symbol
  --symmetry=FILENAME   External file with symmetry information
  --quick               Do not compute statistics between pairs of data arrays
  --resolution=FLOAT    High resolution limit (minimum d-spacing, d_min)
  --low_resolution=FLOAT
                        Low resolution limit (maximum d-spacing, d_max)
  --bins=INT            Number of bins

Example: phenix.reflection_statistics data1.mtz data2.sca

This utility reads one or more reflection files in any of the formats listed near the top of the document. For each of the datasets found in the reflection files the output shows a block like the following:

Miller array info: gere_MAD.mtz:FSEinfl,SIGFSEinfl,DSEinfl,SIGDSEinfl
Observation type: xray.reconstructed_amplitude
Type of data: double, size=20994
Type of sigmas: double, size=20994
Number of Miller indices: 20994
Anomalous flag: 1
Unit cell: (108.742, 61.679, 71.652, 90, 97.151, 90)
Space group: C 1 2 1 (No. 5)
Systematic absences: 0
Centric reflections: 0
Resolution range: 24.7492 2.74876
Completeness in resolution range: 0.873513
Completeness with d_max=infinity: 0.872315
Bijvoet pairs: 10497
Lone Bijvoet mates: 0
Anomalous signal: 0.1065

This is followed by a listing of the completeness and the anomalous signal in resolution bins (the number of bins and the resolution range may be adjusted with the options shown above).

Unless the --quick option is specified the output will also show the correlations between the datasets and, if applicable, between the anomalous differences, both as overall values and in bins. The correlation between anomalous differences is often a very powerful indicator for the resolution up to which the anomalous signal is useful for substructure determination. In general one should use reflection data only up to the resolution to which the correlation is better than 0.30. For example:

Anomalous difference correlation of:
  gere_MAD.mtz:F(+)SEinfl,SIGF(+)SEinfl,F(-)SEinfl,SIGF(-)SEinfl
  gere_MAD.mtz:F(+)SElrm,SIGF(+)SElrm,F(-)SElrm,SIGF(-)SElrm
Overall correlation:  0.390
unused:              d >   24.7502:  0.000
bin  1:   24.7502 >= d >    5.8979:  0.874
bin  2:    5.8979 >= d >    4.6917:  0.702
bin  3:    4.6917 >= d >    4.1017:  0.647
bin  4:    4.1017 >= d >    3.7281:  0.483
bin  5:    3.7281 >= d >    3.4616:  0.346
bin  6:    3.4616 >= d >    3.2580:  0.188
bin  7:    3.2580 >= d >    3.0952:  0.189
bin  8:    3.0952 >= d >    2.9607:  0.138
bin  9:    2.9607 >= d >    2.8469:  0.138
bin 10:    2.8469 >= d >    2.7488:  0.136
unused:    2.7488 >  d            :  0.000

In this case the correlation drops below 0.30 in the bin between 3.46 and 3.26 Angstrom. This suggests running HySS with --resolution=3.5.
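
The rule of thumb above can be applied to the binned correlations with a short Python sketch. It is not part of phenix.reflection_statistics: the function name is hypothetical, the bin limits and correlations are copied from the listing above, and stopping at the first bin below the threshold (rather than, say, smoothing over neighbouring bins) is an assumption.

```python
# (d_max, d_min, correlation) for bins 1-10, copied from the listing above,
# ordered from low resolution to high resolution.
BINS = [
    (24.7502, 5.8979, 0.874),
    (5.8979, 4.6917, 0.702),
    (4.6917, 4.1017, 0.647),
    (4.1017, 3.7281, 0.483),
    (3.7281, 3.4616, 0.346),
    (3.4616, 3.2580, 0.188),
    (3.2580, 3.0952, 0.189),
    (3.0952, 2.9607, 0.138),
    (2.9607, 2.8469, 0.138),
    (2.8469, 2.7488, 0.136),
]


def suggest_d_min(bins, threshold=0.30):
    """Return the high-resolution edge of the last bin, counted from low
    resolution, before the correlation first falls below the threshold.
    Returns None if even the first bin is below the threshold."""
    d_min = None
    for _d_max_bin, d_min_bin, correlation in bins:
        if correlation < threshold:
            break
        d_min = d_min_bin
    return d_min


print(suggest_d_min(BINS))
```

For the bins above the sketch returns the edge of bin 5 (3.4616 Angstrom), which is consistent with rounding to --resolution=3.5.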