PHENIX Python-based Hierarchical ENvironment for Integrated Xtallography

Hybrid Substructure Search

Contents

  • Authors
  • HySS overview
  • A new powerful and parallel Hybrid Substructure Search (HySS)
  • Graphical interface
  • Command line options
  • NOTES
  • If things go wrong
  • Auxiliary programs
    • phenix.emma
  • References
    • List of all available keywords

Authors

Ralf Grosse-Kunstleve, Paul Adams, Randy Read, Gabor Bunkoczi, Tom Terwilliger

HySS overview

The HySS (Hybrid Substructure Search) submodule of the Phenix package is a highly-automated procedure for the location of anomalous scatterers in macromolecular structures. HySS starts with the automatic detection of the reflection file format and analyses all available datasets in a given reflection file to decide which of these is best suited for solving the structure. The search parameters are automatically adjusted based on the available data and the number of expected sites given by the user. The search method is a systematic multi-trial procedure employing

  • direct-space Patterson interpretation followed by
  • reciprocal-space Patterson interpretation followed by
  • dual-space direct methods or Phaser LLG completion followed by
  • automatic comparison of the solutions and
  • automatic termination detection.

The end result is a consensus model which is exported in a variety of file formats suitable for frequently used phasing and density modification packages.

The core search procedure is applicable to both anomalous diffraction and isomorphous replacement problems. However, the command line interface is currently limited to working with anomalous diffraction data or externally preprocessed difference data.

A new powerful and parallel Hybrid Substructure Search (HySS)

HySS has many new features in 2014. The first thing you might notice in HySS is that it can use as many processors as you have on your computer. This can make for a really quick direct methods search for your anomalously-scattering substructure.

You might notice next that HySS now automatically tries Phaser completion if the direct methods approach does not give a clear solution right away. Phaser completion uses the likelihood function to create a log-likelihood gradient (LLG) map that is used to find additional sites. Phaser completion in HySS can be much more powerful than direct methods in HySS. It takes a lot longer than direct methods completion, but it is now quite feasible, particularly if you have several processors on your computer.

The next thing you might notice in HySS is that it automatically tries several resolution cutoffs for the searches if the first try does not give a convincing solution. HySS also starts with a few Patterson seeds and tries more if those do not give a clear solution.

HySS now considers a solution convincing if it finds the same solution several times, starting with different initial Patterson peaks as seeds. The more sites in the solution, the fewer duplicates need to be found to have a convincing solution.

Putting all these together, the new HySS is much faster than the old HySS and it can solve substructures that the old HySS could not touch.

Graphical interface

The HySS GUI is listed in the "Experimental phasing" category of the main PHENIX GUI. Most options are shown in the main window, but only the fields highlighted below are mandatory. The data labels will be selected automatically if the reflections file contains anomalous arrays, and any symmetry information present in the file will be loaded in the unit cell and space group fields.

../images/hyss_config.png

It may be helpful to run Xtriage first to determine an appropriate high resolution cutoff, as most datasets do not have significant anomalous signal in the highest resolution shells. The wavelength is only required if Phaser is being used for rescoring. Additional options are described below in the command-line documentation.

At the end of the run, a tab will be added showing output files and basic statistics. A correlation coefficient of XXX usually indicates that the sites are real. If you are happy with the sites, you can load them into AutoSol or Phaser directly from this window.

../images/hyss_result.png

A full list of sites is displayed in the "Edit sites" tab. For a typical high-quality selenomethionine dataset, such as the p9-sad tutorial data used here, valid sites should have an occupancy close to 1, but for certain types of heavy-atom soaks (such as bromine) all sites may have partial occupancy. You can edit the sites by changing the occupancy or unchecking any that you wish to discard, then clicking the "Save selected" button.

../images/hyss_edit.png

Command line options

Enter phenix.hyss without arguments to obtain a list of the available command line options:

usage: phenix.hyss [options] reflection_file n_sites element_symbol

Example: phenix.hyss w1.sca 66 Se

NOTES

The site_min_distance, site_min_distance_sym_equiv, and site_min_cross_distance options are available to override the default minimum distance of 3.5 Angstroms between substructure sites.
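As a rough illustration of what a minimum-distance constraint means, the following sketch lists pairs of sites that are closer than a given cutoff. It is not HySS code: the function name `close_pairs` is hypothetical, the coordinates are assumed to be Cartesian and in Angstroms, and symmetry-equivalent positions and lattice translations are ignored for simplicity.

```python
import math
from itertools import combinations

def close_pairs(sites, min_distance=3.5):
    """Return index pairs of sites closer than min_distance (Angstroms).

    Assumption: `sites` holds Cartesian coordinates; symmetry mates and
    unit-cell translations are NOT considered in this simplified sketch.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(sites), 2):
        if math.dist(a, b) < min_distance:
            pairs.append((i, j))
    return pairs

# Three made-up sites: the first two are only 2.0 Angstroms apart.
sites = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(close_pairs(sites))  # [(0, 1)]
```

Lowering `min_distance` in this sketch mirrors the effect of lowering site_min_distance: fewer pairs are flagged as too close.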

The real_space_squaring option can be useful for large structures with high-resolution data. In this case the large number of triplets generated for the reciprocal-space direct methods procedure (i.e. the tangent formula) may lead to excessive memory allocation. By default HySS switches to real-space direct methods (i.e. E-map squaring) if it searches for more than 100 sites. If this limit is too high given the available memory, use the real_space_squaring option. In our experience it is not critical to employ reciprocal-space direct methods for substructures with a large number of sites.

If the molecular_weight and solvent_content options are used, HySS will help in determining the number of substructure sites in the unit cell, interpreting the number of sites specified on the command line as the number of sites per molecule. For example:

phenix.hyss gere_MAD.mtz 2 se molecular_weight=8000 solvent_content=0.70

This tells HySS that we have a molecule with a molecular weight of 8 kDa, a crystal with an estimated solvent content of 70%, and that we expect to find 2 Se sites per molecule. The HySS output will now show the following:

#---------------------------------------------------------------------------#
| Formula for calculating the number of molecules given a molecular weight. |
|---------------------------------------------------------------------------|
| n_mol = ((1.0-solvent_content)*v_cell)/(molecular_weight*n_sym*.783)      |
#---------------------------------------------------------------------------#
Number of molecules: 6
Number of sites: 12
Values used in calculation:
  Solvent content: 0.70
  Unit cell volume: 476839
  Molecular weight: 8000.00
  Number of symmetry operators: 4

HySS will go on searching for 12 sites.
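The arithmetic in this output can be reproduced with a short sketch. The formula is taken from the HySS output quoted above; the function name `estimate_n_sites` and the rounding to the nearest whole molecule are assumptions made for illustration, not a description of the HySS source code.

```python
def estimate_n_sites(solvent_content, v_cell, molecular_weight, n_sym,
                     sites_per_molecule):
    """Estimate molecule and site counts from the formula printed by HySS."""
    # Formula as quoted in the HySS output.
    n_mol = ((1.0 - solvent_content) * v_cell) / (molecular_weight * n_sym * 0.783)
    # Assumption: round to the nearest whole molecule, with a minimum of one.
    n_mol_rounded = max(1, round(n_mol))
    return n_mol, n_mol_rounded, n_mol_rounded * sites_per_molecule

# Values from the example output above, with 2 Se sites per molecule.
n_mol, n_mol_rounded, n_sites = estimate_n_sites(
    solvent_content=0.70,
    v_cell=476839,
    molecular_weight=8000.0,
    n_sym=4,
    sites_per_molecule=2,
)
print(f"{n_mol:.2f} {n_mol_rounded} {n_sites}")  # 5.71 6 12
```

With these inputs the unrounded estimate is about 5.71 molecules, which matches the 6 molecules and 12 sites reported above.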

If things go wrong

If the HySS consensus model does not lead to an interpretable electron density map please try the search=full option:

phenix.hyss your_file.sca 100 se search=full

This disables the automatic termination detection and the run will in general take considerably longer. If the full search leads to a better consensus model please let us know because we will want to improve the automatic termination detection.

Another possibility is to override the automatic determination of the high-resolution limit with the resolution option. In some cases the resolution limit is very critical. Truncating the high-resolution limit of the data can sometimes lead to a successful search, as more reflections with a weak anomalous signal are excluded.

Enabling a Phaser-based rescoring protocol can also help (rescore=phaser-complete is recommended). It is less affected by suboptimal resolution cutoffs and also provides more discrimination with noisy data. Switching on the phaser-map extrapolation protocol is also worthwhile, since it increases the success rate and adds only a small runtime overhead compared to Phaser-based rescoring.
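When trying several of these options in turn, it can help to assemble the command line programmatically. The sketch below only builds an argument list; `build_hyss_command` is a hypothetical helper, not part of Phenix. It assumes that options are passed in keyword=value form, as in the molecular_weight example earlier, and that the short keyword names shown are accepted without their scope prefix (for instance, direct_methods.extrapolation may need to be written in full). The resolution value is purely illustrative.

```python
import shlex

def build_hyss_command(reflection_file, n_sites, element, **keywords):
    """Assemble a phenix.hyss argument list (hypothetical helper)."""
    args = ["phenix.hyss", reflection_file, str(n_sites), element]
    # Append each option in keyword=value form.
    for name, value in keywords.items():
        args.append(f"{name}={value}")
    return args

cmd = build_hyss_command(
    "your_file.sca", 100, "se",
    resolution=3.5,              # illustrative cutoff, not a recommendation
    rescore="phaser-complete",   # Phaser-based rescoring
    extrapolation="phaser-map",  # phaser-map extrapolation protocol
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where phenix.hyss is on the PATH.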

If there is no consensus model at the end of a HySS run please try alternative programs. For example, run SHELXD with the .ins and .hkl files that are automatically generated by HySS:

Writing anomalous differences as SHELX HKLF file: mbp_anom_diffs.hkl

Writing SHELXD ins file: mbp_anom_diffs.ins

If HySS does not produce a consensus model even though it is possible to solve the substructure with other programs we would like to investigate. Please send email to bugs@phenix-online.org.

Auxiliary programs

phenix.emma

EMMA stands for Euclidean Model Matching. It superimposes two sets of coordinates as well as possible, taking symmetry and alternative origin choices into account. See the phenix.emma documentation for more details.

References

Substructure search procedures for macromolecular structures. R.W. Grosse-Kunstleve and P.D. Adams. Acta Cryst. D59, 1966-1973 (2003).

Simple algorithm for a maximum-likelihood SAD function. A.J. McCoy, L.C. Storoni, and R.J. Read. Acta Cryst. D60, 1220-1228 (2004).

List of all available keywords

  • data = None
  • data_label = None
  • n_sites = None
  • scattering_type = None
  • wavelength = None
  • space_group = None
  • unit_cell = None
  • symmetry = None File with crystal symmetry
  • resolution = None
  • low_resolution = None
  • search = *fast full
  • root = None Root for output file names
  • output_dir = None
  • rescore = *Auto correlation phaser-refine phaser-complete Strategy and rescore function. Auto runs the automatic procedure. correlation (correlation coefficient), phaser-refine (Phaser refinement), and phaser-complete (Phaser substructure completion) each select a single rescoring procedure. The automatic procedure for substructure determination assumes sad=True, wavelength=peak, and a variable resolution range if not set. It first tries direct methods, then Phaser completion based on Patterson peaks, rescoring everything at the best resolution. It terminates when the n_top_models_to_compare top-scoring solutions from different Patterson seeds match.
  • llgc_sigma = None Log-likelihood gradient map sigma cutoff. Default is 5; if your anomalous signal is very low you might want to try lowering this parameter to as low as 2.
  • nproc = None Number of processors. NOTE: on Windows nproc=1 always
  • phaser_resolution = None Resolution for phaser rescoring and extrapolation. If not set, same as overall resolution.
  • rms_cutoff = 3.5 Anomalous differences larger than rms_cutoff times the rms will be ignored
  • sigma_cutoff = 1 Anomalous differences smaller than sigma_cutoff times sigma will be ignored
  • verbose = False
  • pdb_only = False Suppress all output except for pdb file output
  • cluster_termination = False Terminate by looking for bimodal distribution in phaser rescoring. Does not apply to correlation rescoring
  • max_view = 10 Number of solutions to show when run in sub-processes
  • minimum_reflections_for_phaser = 300 Minimum reflections to run phaser rescoring
  • job_title = None Job title in PHENIX GUI, not used on command line
  • auto_control
    • lowest_high_resolution_to_try = None
    • highest_high_resolution_to_try = None
    • default_high_resolution_to_try = None
    • direct_methods = True Run direct methods in automatic procedure
    • complete_direct_methods = False Run Phaser completion on direct methods solutions in automatic procedure
    • phaser_completion = True Run Phaser completion on Patterson seeds in automatic procedure
    • try_multiple_resolutions = True
    • try_full_resolution = False
    • max_multiple = 5
    • max_groups = 5
    • seeds_to_use = None
    • starting_seed = None
    • default_seeds_to_use = 20
    • solutions_to_save = 20 If you are running multiprocessing it may be necessary to make solutions_to_save small (e.g. 10 or 20 rather than 1000)
    • max_tries = None Used internally by automated procedures
    • write_output_model = True
  • termination
    • matches_must_not_share_patterson_vector = True
    • n_top_models_to_compare = None Number of top models that must share sites to terminate (set automatically).
  • parameter_estimation
    • solvent_content = 0.55 Solvent content (default: 0.55)
    • molecular_weight = None
  • search_control
    • random_seed = None Seed for random number generator
    • site_min_distance = 3.5 Minimum distance between substructure sites (default: 3.5)
    • site_min_distance_sym_equiv = None Minimum distance between symmetrically-equivalent substructure sites (overrides --site_min_distance)
    • site_min_cross_distance = None Minimum distance between substructure sites not related by symmetry (overrides --site_min_distance)
    • skip_consensus = False
  • direct_methods
    • real_space_squaring = False Use real space squaring (as opposed to the tangent formula)
    • extrapolation = *fast_nv1995 phaser-map Extrapolation function
  • fragment_search
    • minimum_fragments_to_consider = 0 Minimum fragments per patterson vector
    • n_patterson_vectors = None Patterson vectors to consider
    • score_initial_solutions = False Score 2-site Patterson solutions and sort them
    • input_emma_model_list = None Input model or models. Each model group in a file is treated separately.
    • input_add_model = None Input atoms to add to input_model
    • keep_seed_sites = False Keep seed sites (never omit them)
    • n_shake = 0 Number of random variants of each fragment
    • rms_shake = 1. RMS for random variants
    • dump_all_fragments = False Dump fragments to fragment_x.pdb. Requires score_initial_solutions=True
    • score_only = False Just score input models
    • dump_all_models = False Dump models to dump_10xxxyy.pdb xxx=patterson yy=trans
  • comparison_files
    • comparison_emma_model = None Comparison model or models
    • comparison_emma_model_tolerance = 1.5 Comparison model tolerance
  • chunk
    • n = 1
    • i = 0
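The rms_cutoff and sigma_cutoff keywords in the list above both discard anomalous differences before the search. The following sketch shows one plausible reading of those two descriptions. The function name is hypothetical, and two details are assumptions: the rms is computed over all differences in the list, and both tests are applied to absolute values. HySS may compute these quantities differently.

```python
import math

def filter_anomalous_differences(diffs, sigmas, rms_cutoff=3.5, sigma_cutoff=1.0):
    """Keep (difference, sigma) pairs that pass both cutoffs."""
    # Assumption: rms is taken over the whole list of differences.
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    kept = []
    for d, s in zip(diffs, sigmas):
        too_large = abs(d) > rms_cutoff * rms   # likely outlier
        too_small = abs(d) < sigma_cutoff * s   # below the noise level
        if not (too_large or too_small):
            kept.append((d, s))
    return kept

# Made-up data: 18 ordinary differences, one weak one, one outlier.
diffs = [1.0] * 18 + [0.2, 50.0]
sigmas = [0.5] * 20
kept = filter_anomalous_differences(diffs, sigmas)
print(len(kept))  # 18
```

In this made-up dataset the value 50.0 exceeds 3.5 times the rms (about 39.3) and the value 0.2 is smaller than its sigma of 0.5, so both are dropped.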