The only input file required for running HySS is a file with the reflection data. HySS reads the following formats directly:
- merged scalepack files
- unmerged scalepack files (but merged files are preferred!)
- CCP4 MTZ files with merged data
- CCP4 MTZ files with unmerged data (but merged files are preferred!)
- d*trek .ref files
- XDS_ASCII files with merged data
- CNS reflection files
- SHELX reflection files with amplitudes
The CCI Apps binary bundles include a scalepack file with anomalous peak data for the structure with the PDB access code 1NSF (courtesy of A.T. Brunger). To find the 8 selenium sites enter:
phenix.hyss nsf_d2_peak.sca 8 se
This leads to:
Reading reflection file: nsf_d2_peak.sca
Space group found in file: P 6
Is this the correct space group? [Y/N]:
HySS prompts for a confirmation of the space group because space group P6 is often used as a placeholder during data reduction. If the space group symbol found in the reflection file is not correct it can be changed. However, in this case the symbol is correct. At the prompt enter Y to continue. Alternatively, the interactive prompt can be avoided by using the --space_group option:
phenix.hyss nsf_d2_peak.sca 8 se --space_group=p6
HySS will quickly print a few screen-pages with information about the data (e.g. the magnitude of the anomalous signal) and the many search parameters. The most interesting output is produced after this point:
Entering search loop:
  p = peaklist index in Patterson map
  f = peaklist index in two-site translation function
  cc = correlation coefficient after extrapolation scan
  r = number of dual-space recycling cycles
  cc = final correlation coefficient
p=000 f=000 cc=0.364 r=015 cc=0.479 [ best cc: 0.479 ]
p=000 f=001 cc=0.310 r=015 cc=0.477 [ best cc: 0.479 0.477 ]
Number of matching sites of top 2 structures: 11
p=000 f=002 cc=0.166 r=015 cc=0.479 [ best cc: 0.479 0.479 0.477 ]
Number of matching sites of top 2 structures: 11
Number of matching sites of top 3 structures: 11
It will take a few seconds for each line starting with p= to appear. Each of these lines summarizes the result of one trial consisting of an evaluation of the Patterson function, two fast translation functions, and 15 cycles of dual-space recycling. The important number to watch is the final correlation. In the first three trials HySS finds three substructure models with promisingly high correlations. These models are compared, taking allowed origin shifts and the hand ambiguity into account. The three models have more than 2/3 of the expected number of sites in common. Therefore HySS decides that the search is complete and prints a summary of the matching sites:
Top 3 correlations:
  p=000 f=000 cc=0.364 r=015 cc=0.479
  p=000 f=002 cc=0.166 r=015 cc=0.479
  p=000 f=001 cc=0.310 r=015 cc=0.477
Match summary:
  Operator:
    rotation: {{-1.0, 0.0, 0.0}, {0.0, -1.0, 0.0}, {0.0, 0.0, -1.0}}
    translation: (-9.6289517721653785e-38, 0.0, 0.091526465343537006)
  rms coordinate differences: 0.06
  Pairs: 11
    site001 site001 0.018
    site002 site002 0.056
    site003 site003 0.033
    site004 site004 0.026
    site005 site005 0.050
    site006 site006 0.103
    site007 site007 0.040
    site008 site008 0.063
    site009 site010 0.067
    site010 site009 0.120
    site011 site011 0.029
  Singles model 1: 0
  Singles model 2: 0
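The termination test described above (the top models share more than 2/3 of the expected number of sites) can be illustrated with a minimal Python sketch. The function name search_is_complete and the strict "more than" comparison are assumptions made for illustration; the test actually implemented in HySS may involve further conditions.

```python
def search_is_complete(n_matching_sites, n_expected_sites):
    """Return True if more than 2/3 of the expected sites are matched.

    Illustrative only (not the HySS source). Integer arithmetic is used
    so that the 2/3 comparison is not affected by floating-point rounding.
    """
    return 3 * n_matching_sites > 2 * n_expected_sites


# Values from the example run: 11 matching sites, 8 expected sites.
print(search_is_complete(11, 8))
```

With the numbers from the example run the test is satisfied, which is consistent with HySS ending the search after three trials.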
The matching sites are used to build a consensus model. The coordinates and occupancies are quickly refined using a quasi-Newton minimizer:
Minimizing consensus model (11 sites).
Truncating consensus model to expected number of sites.
Minimizing consensus model (8 sites).
Correlation coefficient for consensus model (8 sites): 0.483
The refined sites are sorted by occupancy in descending order. The model is truncated to the expected number of sites and refined again. After printing detailed timing information (not shown) the output ends with:
Storing all substructures found: nsf_d2_peak_hyss_models.pickle
Storing consensus model: nsf_d2_peak_hyss_consensus_model.pickle
Writing consensus model as PDB file: nsf_d2_peak_hyss_consensus_model.pdb
Writing consensus model as CNS SDB file: nsf_d2_peak_hyss_consensus_model.sdb
Writing consensus model as SOLVE xyz records: nsf_d2_peak_hyss_consensus_model.xyz
  The fractional coordinates may also be useful in other programs.
Total CPU time: 49.60 seconds
The resulting coordinate files can be used for phasing and density modification with other programs.
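The occupancy-based sorting and truncation applied to the consensus model in this example can be sketched in a few lines of Python. The function name truncate_to_expected, the (label, occupancy) representation and the occupancy values are hypothetical; they are not taken from the HySS output or source code.

```python
def truncate_to_expected(sites, n_expected):
    """Sort (label, occupancy) pairs by occupancy, highest first,
    and keep only the first n_expected entries."""
    ranked = sorted(sites, key=lambda site: site[1], reverse=True)
    return ranked[:n_expected]


# Hypothetical refined occupancies, for illustration only.
sites = [
    ("site001", 0.95),
    ("site002", 0.40),
    ("site003", 0.88),
    ("site004", 0.71),
]
kept = truncate_to_expected(sites, 3)
print([label for label, _ in kept])
```

In the example run the same idea reduces the refined 11-site consensus model to the 8 expected sites before the second refinement.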
The CCP4 distribution includes a four-wavelength MAD dataset in the tutorial directory. To find the 12 selenium sites with HySS enter:
phenix.hyss $CEXAM/tutorial2000/data/gere_MAD.mtz 12 se
HySS automatically picks the wavelength with the strongest anomalous signal and finishes after about 34 seconds (2.8GHz Pentium 4 Linux), writing out the 12 (or sometimes only 11) sites in the various file formats.
The CNS tutorial includes data from a MAD experiment with Ytterbium as the anomalous scatterer. CNS reflection files do not contain information about the unit cell and space group. However, HySS is able to extract this information from other files, e.g. other reflection files, CNS files, SOLVE files, PDB files or SHELX files. For example:
phenix.hyss $CNS_SOLVE/doc/html/tutorial/data/mbp/mbp.hkl 4 yb --symmetry $CNS_SOLVE/doc/html/tutorial/data/mbp/def
HySS reads the reflection data from the mbp.hkl file. The --symmetry option instructs HySS to scan the def file for unit cell parameters and a space group symbol. HySS finishes after about 26 seconds (2.8GHz Pentium 4 Linux).
Enter phenix.hyss without arguments to obtain a list of the available command line options:
Usage: phenix.hyss [options] reflection_file n_sites element_symbol
Options:
  -h, --help
      show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
      External unit cell parameters
  --space_group=P212121|FILENAME
      External space group symbol
  --symmetry=FILENAME
      External file with symmetry information
  --chunk=n,i
      Number of chunks for parallel execution and index for one process
  --search=fast|full
      Search mode
  --resolution=FLOAT
      High resolution limit (minimum d-spacing, d_min)
  --low_resolution=FLOAT
      Low resolution limit (maximum d-spacing, d_max)
  --site_min_distance=FLOAT
      Minimum distance between substructure sites (default: 3.5)
  --site_min_distance_sym_equiv=FLOAT
      Minimum distance between symmetrically-equivalent substructure sites (overrides --site_min_distance)
  --site_min_cross_distance=FLOAT
      Minimum distance between substructure sites not related by symmetry (overrides --site_min_distance)
  --molecular_weight=FLOAT
      Molecular weight
  --solvent_content=FLOAT
      Solvent content (default: 0.55)
  --random_seed=INT
      Seed for random number generator
  --real_space_squaring
      Use real space squaring (as opposed to the tangent formula)
  --data_label=STRING
      Substring of reflection data label
See also: http://www.phenix-online.org/download/documentation/cci_apps/hyss/
Example: phenix.hyss w1.sca 66 Se
The --data_label, --resolution and --low_resolution options can be used to override the automatic selection of the reflection data and the resolution range. For example, to instruct HySS to use the peak data in the gere_MAD.mtz file (instead of the inflection point data) and to set the high-resolution limit to 5 Angstrom, one may enter:
phenix.hyss gere_MAD.mtz 12 se --data_label=peak --resolution=5
Output:
Command line arguments: gere_MAD.mtz 12 se --data_label=peak --resolution=5
Reading reflection file: gere_MAD.mtz
Ambiguous --data_label=peak
Possible choices:
  5: gere_MAD.mtz:FSEpeak,SIGFSEpeak,DSEpeak,SIGDSEpeak,merged
  6: gere_MAD.mtz:F(+)SEpeak,SIGF(+)SEpeak,F(-)SEpeak,SIGF(-)SEpeak
Please specify an unambiguous substring of the target label.
Sorry: Please try again.
That was a good first try, but --data_label=peak matches more than one data label, so HySS asks for more information. Second try:
phenix.hyss gere_MAD.mtz 12 se --data_label="F(+)SEpeak" --resolution=5
Now HySS will actually perform the search. Typically the search finishes in less than 10 seconds finding 8-12 sites, depending on the random number generator (which is seeded with the current time unless the --random_seed option is used).
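The substring selection behind --data_label can be mimicked with a small Python sketch. The function name select_label is hypothetical, only the two labels listed in the output above are included, and plain case-sensitive substring matching is an assumption about how HySS compares labels.

```python
def select_label(labels, substring):
    """Return the single label containing substring.

    Raises ValueError if zero or several labels match, mirroring the
    "Ambiguous --data_label" situation shown above (illustrative only).
    """
    matches = [label for label in labels if substring in label]
    if len(matches) != 1:
        raise ValueError(
            f"{len(matches)} labels match {substring!r}; expected exactly one"
        )
    return matches[0]


# The two candidate labels printed by HySS for gere_MAD.mtz.
labels = [
    "FSEpeak,SIGFSEpeak,DSEpeak,SIGDSEpeak,merged",
    "F(+)SEpeak,SIGF(+)SEpeak,F(-)SEpeak,SIGF(-)SEpeak",
]
print(select_label(labels, "F(+)SEpeak"))
```

The substring "peak" occurs in both labels and would raise the error, whereas "F(+)SEpeak" occurs only in the second one.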
The --site_min_distance, --site_min_distance_sym_equiv, and --site_min_cross_distance options are available to override the default minimum distance of 3.5 Angstroms between substructure sites.
The --real_space_squaring option can be useful for large structures with high-resolution data. In this case the large number of triplets generated for the reciprocal-space direct methods procedure (i.e. the tangent formula) may lead to excessive memory allocation. By default HySS switches to real-space direct methods (i.e. E-map squaring) if it searches for more than 100 sites. If this limit is too high given the available memory use the --real_space_squaring option. For substructures with a large number of sites it is in our experience not critical to employ reciprocal-space direct methods.
If the --molecular_weight and --solvent_content options are used, HySS will help in determining the number of substructure sites in the unit cell, interpreting the number of sites specified on the command line as the number of sites per molecule. For example:
phenix.hyss gere_MAD.mtz 2 se --molecular_weight=8000 --solvent_content=0.70
This is telling HySS that we have a molecule with a molecular weight of 8 kD, a crystal with an estimated solvent content of 70%, and that we expect to find 2 Se sites per molecule. The HySS output will now show the following:
#---------------------------------------------------------------------------#
| Formula for calculating the number of molecules given a molecular weight. |
|---------------------------------------------------------------------------|
| n_mol = ((1.0-solvent_content)*v_cell)/(molecular_weight*n_sym*.783)      |
#---------------------------------------------------------------------------#
Number of molecules: 6
Number of sites: 12
Values used in calculation:
  Solvent content: 0.70
  Unit cell volume: 476839
  Molecular weight: 8000.00
  Number of symmetry operators: 4
HySS will go on searching for 12 sites.
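The arithmetic behind these numbers can be reproduced with the formula printed by HySS. The sketch below is only a worked example: the function name number_of_molecules is hypothetical, and rounding to the nearest integer is an assumption that happens to reproduce the 6 molecules reported above.

```python
def number_of_molecules(solvent_content, v_cell, molecular_weight, n_sym):
    """Evaluate the formula shown in the HySS output."""
    return ((1.0 - solvent_content) * v_cell) / (molecular_weight * n_sym * 0.783)


# Values listed under "Values used in calculation" in the output above.
n_mol = number_of_molecules(
    solvent_content=0.70, v_cell=476839, molecular_weight=8000.0, n_sym=4
)
n_mol_rounded = round(n_mol)  # assumption: round to the nearest integer
sites_per_molecule = 2        # the n_sites argument on the command line
n_sites = sites_per_molecule * n_mol_rounded

print(f"n_mol = {n_mol:.2f}, rounded = {n_mol_rounded}, sites = {n_sites}")
```

The unrounded value is about 5.7, so 6 molecules with 2 sites each give the 12 sites HySS searches for.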
If the HySS consensus model does not lead to an interpretable electron density map please try the --search full option:
phenix.hyss your_file.sca 100 se --search full
This disables the automatic termination detection and the run will in general take considerably longer. If the full search leads to a better consensus model please let us know because we will want to improve the automatic termination detection.
Another possibility is to override the automatic determination of the high-resolution limit with the --resolution option. In some cases the resolution limit is very critical. Truncating the high-resolution limit of the data can sometimes lead to a successful search, as more reflections with a weak anomalous signal are excluded.
If there is no consensus model at the end of a HySS run please try alternative programs. For example, run SHELXD with the .ins and .hkl files that are automatically generated by HySS:
Writing anomalous differences as SHELX HKLF file: mbp_anom_diffs.hkl
Writing SHELXD ins file: mbp_anom_diffs.ins
If HySS does not produce a consensus model even though it is possible to solve the substructure with other programs we would like to investigate. Please send email to bugs@phenix-online.org.
EMMA stands for Euclidean Model Matching and is the algorithm used by HySS to superimpose two putative solutions and to derive the consensus model. The same algorithm is also available through the external phenix.emma command-line interface. Enter phenix.emma without arguments to obtain the help page:
usage: phenix.emma [options] reference_coordinates other_coordinates
options:
  -h, --help
      show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
      External unit cell parameters
  --space_group=P212121|FILENAME
      External space group symbol
  --symmetry=FILENAME
      External file with symmetry information
  --tolerance=FLOAT
      match tolerance
  --diffraction_index_equivalent
      Use only if models are diffraction-index equivalent.
Example: phenix.emma model1.pdb model2.sdb
The command takes two coordinate files in various formats (.pdb, CNS .sdb, SOLVE output, SHELX .ins) and compares the structures taking the space group symmetry, the allowed origin shifts and the hand ambiguity into account. The output is similar to the Match summary shown above in the example HySS output.
The match tolerance defaults to 3 Angstrom. For structures obtained with very low resolution data it may be necessary to specify a different tolerance, e.g. --tolerance=5.
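To give a feeling for the role of the tolerance, the following Python sketch pairs the sites of two models if they lie within a given distance of each other. It is deliberately not the EMMA algorithm: it is a greedy nearest-neighbour pairing in Cartesian coordinates that ignores space group symmetry, allowed origin shifts and the hand ambiguity. The function name pair_sites and the coordinates are hypothetical.

```python
import math


def pair_sites(model_a, model_b, tolerance=3.0):
    """Greedily pair each site in model_a with the closest unused site
    in model_b that lies within tolerance (same units as the coordinates).

    Returns a list of (index_a, index_b, distance) tuples.
    """
    pairs = []
    unused_b = set(range(len(model_b)))
    for i, site_a in enumerate(model_a):
        best_j, best_dist = None, tolerance
        for j in sorted(unused_b):
            dist = math.dist(site_a, model_b[j])
            if dist <= best_dist:
                best_j, best_dist = j, dist
        if best_j is not None:
            unused_b.remove(best_j)
            pairs.append((i, best_j, best_dist))
    return pairs


# Hypothetical Cartesian coordinates in Angstrom.
model_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
model_b = [(0.5, 0.0, 0.0), (10.0, 0.0, 2.0), (30.0, 30.0, 30.0)]

print(len(pair_sites(model_a, model_b, tolerance=3.0)))
print(len(pair_sites(model_a, model_b, tolerance=1.0)))
```

In this toy example a looser tolerance yields more pairs and fewer singles, which is the trade-off to keep in mind when changing --tolerance.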
The --symmetry option works just like it does for phenix.hyss. It can be used to extract symmetry information from external files such as input files for other programs (CNS, SHELX, SOLVE, ...) or reflection files. However, the --symmetry option is only required if the information about the unit cell and the space group is missing in both coordinate files given to phenix.emma.
phenix.emma conducts an exhaustive search and, in contrast to HySS, displays all possible matches. The match with the largest number of matching sites is shown first, and the match with the smallest number of matching sites (often just one site) is shown last. The best match is therefore at the beginning of the output. If the output goes to the screen, do not be distracted by a large number of Singles near the end; scroll back to see the best match.
EMMA is also available via a web interface.
If HySS cannot solve the structure in default mode it may be worth looking at some statistics of the reflection data using the phenix.xtriage command. Please refer to the dedicated phenix.xtriage documentation.
Comparisons between multiple datasets are available via the phenix.reflection_statistics command:
usage: phenix.reflection_statistics [options] reflection_file [...]
options:
  -h, --help
      show this help message and exit
  --unit_cell=10,10,20,90,90,120|FILENAME
      External unit cell parameters
  --space_group=P212121|FILENAME
      External space group symbol
  --symmetry=FILENAME
      External file with symmetry information
  --quick
      Do not compute statistics between pairs of data arrays
  --resolution=FLOAT
      High resolution limit (minimum d-spacing, d_min)
  --low_resolution=FLOAT
      Low resolution limit (maximum d-spacing, d_max)
  --bins=INT
      Number of bins
Example: phenix.reflection_statistics data1.mtz data2.sca
This utility reads one or more reflection files in any of the formats listed near the top of the document. For each of the datasets found in the reflection files the output shows a block like the following:
Miller array info: gere_MAD.mtz:FSEinfl,SIGFSEinfl,DSEinfl,SIGDSEinfl
Observation type: xray.reconstructed_amplitude
Type of data: double, size=20994
Type of sigmas: double, size=20994
Number of Miller indices: 20994
Anomalous flag: 1
Unit cell: (108.742, 61.679, 71.652, 90, 97.151, 90)
Space group: C 1 2 1 (No. 5)
Systematic absences: 0
Centric reflections: 0
Resolution range: 24.7492 2.74876
Completeness in resolution range: 0.873513
Completeness with d_max=infinity: 0.872315
Bijvoet pairs: 10497
Lone Bijvoet mates: 0
Anomalous signal: 0.1065
This is followed by a listing of the completeness and the anomalous signal in resolution bins (the number of bins and the resolution range may be adjusted with the options shown above).
Unless the --quick option is specified the output will also show the correlations between the datasets and, if applicable, between the anomalous differences, both as overall values and in bins. The correlation between anomalous differences is often a very powerful indicator for the resolution up to which the anomalous signal is useful for substructure determination. In general one should use reflection data only up to the resolution to which the correlation is better than 0.30. For example:
Anomalous difference correlation of:
  gere_MAD.mtz:F(+)SEinfl,SIGF(+)SEinfl,F(-)SEinfl,SIGF(-)SEinfl
  gere_MAD.mtz:F(+)SElrm,SIGF(+)SElrm,F(-)SElrm,SIGF(-)SElrm
Overall correlation: 0.390
unused:            d > 24.7502: 0.000
bin  1: 24.7502 >= d >  5.8979: 0.874
bin  2:  5.8979 >= d >  4.6917: 0.702
bin  3:  4.6917 >= d >  4.1017: 0.647
bin  4:  4.1017 >= d >  3.7281: 0.483
bin  5:  3.7281 >= d >  3.4616: 0.346
bin  6:  3.4616 >= d >  3.2580: 0.188
bin  7:  3.2580 >= d >  3.0952: 0.189
bin  8:  3.0952 >= d >  2.9607: 0.138
bin  9:  2.9607 >= d >  2.8469: 0.138
bin 10:  2.8469 >= d >  2.7488: 0.136
unused:  2.7488 >  d          : 0.000
In this case the correlation drops below 0.30 in the bin between 3.46 and 3.26 Angstrom. This suggests running HySS with --resolution=3.5.
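The same reading of the table can be done programmatically. The Python sketch below scans the bins from low to high resolution and reports the low-resolution edge of the first bin whose correlation falls below the threshold. The function name suggest_d_min is hypothetical, the bin values are copied from the output above, and the 0.30 threshold is the rule of thumb stated in the text, not a hard limit.

```python
# (d_max, d_min, correlation) for bins 1 to 10, copied from the output above.
bins = [
    (24.7502, 5.8979, 0.874),
    (5.8979, 4.6917, 0.702),
    (4.6917, 4.1017, 0.647),
    (4.1017, 3.7281, 0.483),
    (3.7281, 3.4616, 0.346),
    (3.4616, 3.2580, 0.188),
    (3.2580, 3.0952, 0.189),
    (3.0952, 2.9607, 0.138),
    (2.9607, 2.8469, 0.138),
    (2.8469, 2.7488, 0.136),
]


def suggest_d_min(bins, threshold=0.30):
    """Return the d-spacing at which the correlation first drops below
    threshold, i.e. the low-resolution edge of the first failing bin.

    Bins must be ordered from low to high resolution. If no bin fails,
    the high-resolution edge of the last bin is returned.
    """
    for d_max, d_min, correlation in bins:
        if correlation < threshold:
            return d_max
    return bins[-1][1]


print(suggest_d_min(bins))
```

For this table the function returns the edge at 3.4616 Angstrom, which rounds to the value of 3.5 suggested for --resolution.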