Generating alternative conformations matching a map with create_alt_conf

Author(s)

create_alt_conf: Tom Terwilliger, Pavel Afonine, Nigel Moriarty, Dorothee Liebschner

Purpose

The purpose of create_alt_conf is to create alternative side-chain conformations for a structure using very high-resolution X-ray data or a very high-resolution cryo-EM map as a guide.

The problem that it solves is that when multiple conformations are present in a structure, residues that have the same altloc (A, B, C, etc.) should have plausible relationships and, in particular, they should not clash. Residues with different altlocs (A vs B) can have any relationship, as they are in different conformers. Finding the assignment of side-chain conformations to conformers A, B, C, etc. that yields good geometry for all the conformers can be difficult.

Note that the model that is produced is not intended to be a final model; rather, it is a model that has some plausible alternative conformations. It is recommended that you use the conformers in the output model as suggestions for your model.

How create_alt_conf works:

Generating alternative conformations:

The starting point for create_alt_conf is normally a model with a single conformation (no A or B altlocs).

The first step is to generate a set of plausible conformations for each side chain based on the current model and the X-ray data or cryo-EM map.

For X-ray data, a 4mFo-3DFc map is created. This map is expected to show conformations that are not represented in the model. For cryo-EM data, the input map is used.

In either case, plausible conformations at each side-chain position are found by testing each rotamer in a rotamer library for the side-chain type present at that position and choosing those that fit the density. The conformer that best fits the density and that is different from the current conformer at that position is chosen. If no suitable alternative is found, the original is used. Conformers that clash with all conformers at any other position are discarded. The new side-chain conformers are then used to create a new model with side chains that match the density map and that are as different as possible from the starting side chains. This model is refined using the X-ray data or cryo-EM map. The result of this step is a pair of models, the original and one that has alternative side-chain conformations. Each model has just one conformer.
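
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used by create_alt_conf: the Rotamer class, the density_fit values and the min_fit cutoff are hypothetical, and only min_rotamer_rmsd (default 0.25) corresponds to a documented keyword.

```python
from dataclasses import dataclass


@dataclass
class Rotamer:
    name: str
    density_fit: float      # hypothetical map-fit score, higher is better
    rmsd_to_current: float  # side-chain rmsd to the current conformer (A)


def pick_alternative(rotamers, min_rotamer_rmsd=0.25, min_fit=0.5):
    """Return the best-fitting rotamer that differs from the current one.

    Returns None when no rotamer qualifies, in which case the caller
    would keep the original conformer. min_fit is an assumed cutoff.
    """
    candidates = [
        r for r in rotamers
        if r.rmsd_to_current >= min_rotamer_rmsd and r.density_fit >= min_fit
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.density_fit)


rotamers = [
    Rotamer("current-like", 0.90, 0.10),  # too close to the current conformer
    Rotamer("alt-1", 0.72, 1.80),         # different and fits the density
    Rotamer("alt-2", 0.40, 2.50),         # different but fits poorly
]
choice = pick_alternative(rotamers)
print(choice.name if choice else "keep original")
```

The clash filter mentioned above is omitted here; it would remove candidates before the final max() call.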

If you want, you can instead supply a model with multiple conformations and it will just split that model into separate models, one for each conformation (each altloc A or B).
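
As a rough illustration of what splitting by altloc means, the sketch below separates PDB-format atom records into one list per altloc, copying atoms with a blank altloc into every conformer. It relies only on the fixed-column PDB layout (altloc in column 17); the helper names atom_line and split_by_altloc are hypothetical and are not part of Phenix.

```python
def atom_line(serial, name, altloc, resname="SER", chain="A", resseq=1):
    """Build the leading columns of a PDB ATOM record (altloc in column 17)."""
    return f"ATOM  {serial:5d} {name:<4}{altloc}{resname:>3} {chain}{resseq:4d}"


def split_by_altloc(pdb_lines):
    """Return {altloc: [atom lines]}.

    Atoms with a blank altloc are shared, so they are copied into
    every conformer. A model without altlocs yields a single entry.
    """
    atoms = [l for l in pdb_lines if l.startswith(("ATOM", "HETATM"))]
    altlocs = sorted({l[16] for l in atoms if l[16] != " "})
    if not altlocs:
        return {"": atoms}
    return {alt: [l for l in atoms if l[16] in (" ", alt)] for alt in altlocs}


lines = [
    atom_line(1, " N  ", " "),  # shared main-chain atom
    atom_line(2, " OG ", "A"),  # side chain, conformer A
    atom_line(3, " OG ", "B"),  # side chain, conformer B
]
for alt, conformer in split_by_altloc(lines).items():
    print(alt, len(conformer))
```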

Optimizing assignment of side-chain conformers to models:

The key step in create_alt_conf is creating a set of models in which each model has good geometry and the models collectively fit the density. The starting point for this step is a set of models that have different side-chain conformations at some or all positions. The challenging part of this step is arranging the side-chain conformers in a way that minimizes clashes but that uses all the supplied side chain conformers.

The procedure used is to generate a diverse set of refined multi-conformer models, to score each model, then to find an optimized model by recombination among the multi-conformer models. The procedure is complicated somewhat by the need to select a small number of test models for scoring, as scoring requires refinement and is therefore a slow process.

Scoring function:

The scoring function used to group side-chain conformers is the Holton geometry validation score. This score is a composite of geometry restraints used in refinement and validation metrics, and it is a rather good indication of the overall geometric quality of a model. The scoring function requires a refined model, as unrefined models will generally have very poor geometry.

The scoring function is also calculated on a per-residue basis. This per-residue score is the Holton geometry validation score, only including interactions that involve this residue. Note that the sum of per-residue scores calculated in this way does not equal the total score, both because of the way in which scores are calculated, and because some interactions are only within a residue and others are between residues. Even so, the sum of per-residue scores is closely related to the total score. This allows estimation of the total score from per-residue scores.
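
A toy example shows why the per-residue scores do not add up to the total. It illustrates only the second reason given above (interactions between residues are counted once for each residue involved) and uses made-up energies; it is not the Holton score itself.

```python
# Each interaction lists the residues involved and a hypothetical energy.
interactions = [
    ((10,), 1.0),     # within residue 10
    ((11,), 0.5),     # within residue 11
    ((10, 11), 2.0),  # between residues 10 and 11
]


def total_score(interactions):
    """Every interaction is counted exactly once."""
    return sum(energy for _, energy in interactions)


def per_residue_scores(interactions):
    """Each residue collects every interaction that involves it."""
    scores = {}
    for residues, energy in interactions:
        for residue in residues:
            scores[residue] = scores.get(residue, 0.0) + energy
    return scores


per_residue = per_residue_scores(interactions)
print(total_score(interactions), per_residue, sum(per_residue.values()))
```

Here the inter-residue term appears in the scores of both residue 10 and residue 11, so the per-residue sum exceeds the total while still tracking it.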

Probabilistic estimation of score for a new arrangement of side chains:

To reduce the number of models that need to be fully scored, a procedure for probabilistic estimation of the expected score for a model with a new arrangement of side chains is used. In this estimation process, it is assumed that many of the most significant contacts between residues will be with nearby residues (segments of length group_length are considered).

The starting point for this estimation is a set of scored models with varying arrangements of alternative conformations. These arrangements of alternative conformations are set up so that for any stretch of group_length residues, there will be models that match at all other positions but that differ in all possible ways at this set of group_length residues. The differences in per-residue scores among these models are then used to estimate the expected effect of changing the arrangement of conformers for any one residue, in the context of the conformers surrounding that residue. Additionally, the uncertainty in this expected effect is estimated.
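
To give a sense of how many arrangements "all possible ways" implies for one group, the sketch below enumerates them under one possible reading: at each residue, the side-chain conformers can be assigned to the altlocs in any order. This interpretation and the function names are assumptions; only group_length (default 3) and conformers (default 2) come from the keyword list.

```python
from itertools import permutations, product


def residue_states(n_conformers):
    """All ways to assign n side-chain conformers to n altlocs at one residue."""
    return list(permutations(range(n_conformers)))


def group_patterns(n_conformers=2, group_length=3):
    """All combinations of per-residue assignments for one group of residues."""
    return list(product(residue_states(n_conformers), repeat=group_length))


print(len(group_patterns()))      # two conformers, three residues
print(len(group_patterns(3, 3)))  # three conformers, three residues
```

Under this reading the count grows quickly with the number of conformers, which is consistent with the need to limit how many models are fully refined and scored.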

With these predictors for the effects of changes in alternative conformations at any one residue in the context of its neighbors, an estimate can be made for the expected score for a model with any arrangement of conformations. The score for the most similar model is taken as a starting value, then the minimal set of changes in arrangement is applied and the expected change in score for each change is noted, along with its uncertainty. This yields a probabilistic estimate of the expected score for this arrangement.
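
The bookkeeping in this estimate can be sketched as follows. The sketch assumes that the uncertainties of the individual changes are independent and add in quadrature, which is a simplification made here (the correlated_errors keyword suggests the tool can treat them differently). The function name and the numbers are hypothetical; score_confidence_ratio (default 2) is the documented keyword, and lower scores are taken to be better.

```python
import math


def estimate_score(base_score, changes, score_confidence_ratio=2.0):
    """Estimate the score of a new arrangement.

    base_score: score of the most similar model already refined and scored.
    changes: list of (expected_delta, sigma) pairs, one per changed residue.
    Returns (expected score, uncertainty, conservative score), where the
    conservative score is expected + score_confidence_ratio * uncertainty.
    """
    expected = base_score + sum(delta for delta, _ in changes)
    sigma = math.sqrt(sum(s * s for _, s in changes))  # assumes independence
    return expected, sigma, expected + score_confidence_ratio * sigma


expected, sigma, conservative = estimate_score(
    10.0, [(-1.5, 0.3), (0.5, 0.4)]
)
print(round(expected, 3), round(sigma, 3), round(conservative, 3))
```

Ranking candidate arrangements by the conservative value rather than the expected value favors arrangements whose predicted improvement is well supported by the scored models.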

Recombination procedure:

The recombination procedure is targeted (not random recombination). The reason for this is that scoring requires refinement, a relatively slow process. The current set of scored models is used as described above to create a probabilistic estimate of the score that would be found for any new arrangement of side chains as conformers in a model. A simple recombination and mutation procedure is used to find arrangements that are predicted to have good Holton geometry validation scores. The arrangements with good predicted scores are then generated, refined and scored to identify optimized arrangements that actually do have good scores.
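
A minimal sketch of a recombination and mutation search guided by predicted scores is shown below for the two-conformer case. It is a generic illustration, not the procedure implemented in create_alt_conf: arrangements are represented as tuples of 0/1 (which altloc receives the original side chain at each residue), and the predict function, the toy target and all parameter names are hypothetical.

```python
import random


def recombine(a, b, rng):
    """Single-point crossover between two arrangements."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(arrangement, rng):
    """Swap the conformer assignment at one randomly chosen residue."""
    i = rng.randrange(len(arrangement))
    return arrangement[:i] + (1 - arrangement[i],) + arrangement[i + 1:]


def propose(scored, predict, n_keep=4, n_trials=200, seed=0):
    """Return the n_keep unscored arrangements with the best predicted scores.

    scored: dict mapping already-scored arrangements to their scores.
    predict: function giving a predicted score (lower is better).
    """
    rng = random.Random(seed)
    parents = list(scored)
    candidates = set()
    for _ in range(n_trials):
        a, b = rng.sample(parents, 2)
        child = recombine(a, b, rng)
        if rng.random() < 0.5:
            child = mutate(child, rng)
        if child not in scored:
            candidates.add(child)
    return sorted(candidates, key=predict)[:n_keep]


# Toy predictor: number of residues that differ from a hidden best arrangement.
target = (0, 1, 1, 0, 1, 0)
predict = lambda arr: sum(x != y for x, y in zip(arr, target))
scored = {(0,) * 6: 3.0, (1,) * 6: 3.0}
for arrangement in propose(scored, predict):
    print(arrangement, predict(arrangement))
```

In the real procedure only the proposed arrangements would then be generated, refined and scored, so the slow step is applied to a small, targeted set.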

Quick run:

You can skip the extensive optimization procedure and use a shorter one instead with the keyword quick=True. This is still not that fast, but it is a lot faster than the full version.

Examples

Standard run of create_alt_conf:

Running create_alt_conf is easy. From the command-line you can type:

phenix.create_alt_conf model.pdb data.mtz

where model.pdb is the model you would like to use as a starting point and data.mtz is an X-ray data file. The tool will run (for a long time) and create a model named model_overall_best.pdb that will contain alternative conformations based on the X-ray data. If you want to use more processors, specify nproc=32 or whatever number you would like. If you want to create more than two conformers, you can specify the number with conformers=3 or whatever you like. If you want to use the conformers present in model.pdb as starting points, rearranging them as needed, you can specify use_existing_altlocs=True.
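
If you launch runs from a script, the command line can be assembled from the keywords mentioned above. This is a small convenience sketch; the helper create_alt_conf_command is hypothetical and simply formats keyword=value arguments as they would be typed on the command line.

```python
import shlex


def create_alt_conf_command(model, data, **keywords):
    """Build the argument list for a phenix.create_alt_conf run."""
    args = ["phenix.create_alt_conf", model, data]
    args += [f"{key}={value}" for key, value in keywords.items()]
    return args


cmd = create_alt_conf_command("model.pdb", "data.mtz", nproc=32, conformers=3)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Phenix is installed and set up in the environment.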

Possible Problems

The procedure takes a long time. You would not normally want to run this on a machine with just a few processors. Running with quick=True is much faster, but not as comprehensive.

This procedure creates N full conformers (altlocs A, B, C etc), with all atoms in the macromolecule present in all the conformers. If you want to only have multiple conformations in a few places, you will need to use phenix.pdbtools or another method to remove multiple conformations from the rest of the structure.

This method is only suitable if you have very high-resolution data. Normally 1.5 A is about the lowest resolution data you would want to use.

The procedure only creates alternative side-chain conformers, not main-chain conformers. If your main chain has substantial alternative conformations (not just slight adjustments to match the side-chain conformers), you will need to use another approach.

Literature

Additional information

List of all available keywords

job_title = None Job title in PHENIX GUI, not used on command line
input_files
- xray_data = None Data file with experimental data ( FP SIGFP or I SIGI ).
- xray_data_labels = None Optional labels for X-ray data. Normally these would be something like I,SIGI or F,SIGF
- free_r_labels = None Optional labels for free_r flags. Normally these would be something like FreeR_flags or R_free_flags
- free_r_flag_value = 0 FreeR flag value
- ncs_file = None File with NCS information (typically point-group NCS with the center specified). Typically in PDB format. Can also be a .ncs_spec file from phenix. Created automatically if symmetry is specified.
- map_coeffs = None MTZ file with coefficients for a map
- use_map_coeffs_if_optional = False Use map_coeffs (a map) instead of Fobs, sigFobs if both are available
- map_coeffs_virtual = None Used internally
- map_coeffs_labels = None If map coefficients cannot be identified automatically from your MTZ file, you can specify the label or labels for them. (Please separate labels with blank space, MTZ columns grouped together separated by commas with no blanks.) You can specify: map_coeffs_labels (e.g., FWT,PHIFWT) amplitudes and phases (e.g., FP,SIGFP PHIB) or amplitudes, phases, weights (e.g., FP,SIGFP PHIB FOM)
- alt_models = None Set of alternate conformations as individual model files. You can supply alternate conformations with alt_models, or you can supply a single model file with multiple conformations and specify use_existing_altlocs=True.
- map_model
  - full_map = None Input full map file
  - half_map = None Input half map files
  - model = None Input model file
output
- alt_conf_model = default Model with alternate conformations
- overwrite = True Overwrite files with same names
- file_name = None Not used
- filename = None Not used
- serial = None Not used
- temp_dir = TEMP_CREATE_ALT_CONF Temporary directory
alt_confs
- cycles = 4 Number of overall cycles
- use_existing_altlocs = False Use existing conformers. Default is False (remove any altlocs in the input model if one model is supplied). If alt_models are supplied, they are used as a set of conformers.
- sift_and_rename_waters_only = False Assign waters to altlocs if possible and stop
- water_and_hydrogens = *at_end_of_cycle always never When to add hydrogens and waters in refinement and scoring. Refinement and rebuilding of side chains is done at the same time. If at_end_of_cycle, added at end of each cycle
- recycle_water = False Use input waters and waters from previous models as possible water positions
- max_allowed_residual = 1 Maximum contact energy allowed for adding waters
- max_models_per_cycle = 48 Maximum models to consider in any cycle
- max_rotamers_to_consider = None Limit the number of rotamers to consider at each site. Default is to consider all possible rotamers at each site.
- min_rotamer_rmsd = 0.25 Minimum rmsd between rotamers. If side chain atoms differ by less than this, consider them the same.
- conformers = 2 Number of conformers to create. If you supply conformers as alt_models or as a single model (with use_existing_altlocs=True) the number of supplied conformers must match the value of conformers. (This check is to make sure that the number of supplied conformers is what you intend.)
- pool_size = 32 Target number of possibilities to examine at once
- mask_at_large_diffs = None Mask all changes to include only residues with large differences between conformations (defined by large_rmsd). Default is False if two conformations, otherwise True
- score_confidence_ratio = 2. Score for a predicted pattern is calculated score plus score_confidence_ratio times the uncertainty. For a score_confidence_ratio of 2, this is the score for which there is a 95% probability that the true score is less than this.
- use_edited_model = True Use model edited to remove poor rotamers in next cycle
- rebuild_main_chain = False Find alternate conformations for main chain. Default is False, only look for new side chain positions. Not implemented.
- diff_map_resolution = 2 Resolution for difference map (typically 2 A)
- macro_cycles = 3 Refinement macro_cycles
- water_and_h_macro_cycles = 5 Refinement macro_cycles when placing waters and hydrogens
- rsr_macro_cycles = 3 Real-space refinement macro_cycles
- mask_atoms_atom_radius = 6 Radius for masking atoms. Set high enough to capture density in alternative conformations not present in starting model.
- large_rmsd = 0.5 Large rmsd between conformer residues indicating need for alternates
- minimum_patterns = 4 Minimum patterns to consider
- group_length = 3 Number of residues in group that is to be varied. Residues in a chain are grouped in sets of group_length and all patterns of side-chain assignments are tested.
- domain_size = 200 Maximum residues to consider when finding clashes
- n_swap = 1 Add extra variation of swapping chains, n_swap times existing chains added on. Random locations
- alt_labels = ABCDEFGHIJKLMNOPQRSTUVWXYZ Labels to use for alt id. Include enough for all conformations. These are one-character labels only.
- geometry_score_type = magnitudes gradients residue_score magnitudes_by_chain gradients_by_chain *residue_score_by_chain Score type for geometry evaluation. Both are based on the geometry gradients by residue after refinement. Gradients uses the magnitudes of the mean gradient in a residue. Magnitudes uses the average magnitude of gradients in a residue.
- correlated_errors = True Assume errors in residue geometry scores are correlated
- geometry_mean_or_min = *mean min Geometry mean or minimum. Use the mean value of scores for all examples of a particular pattern of side chains in a set of group_length residues.
- clash_weight = 0 Weight on clashes within a conformer
- allow_close = NO ON OO NN atom pairs that can be close (for clashes)
- close_dist = 2.8 Close distance for allow_close atoms
- other_dist = 4.0 Close distance for non-allow_close atoms
crystal_info
- resolution = None Nominal resolution of map
- wrapping = None You can specify whether the map is wrapped (can map values outside bounds to inside with cell translations). Always true for crystallographic maps.
- scattering_table = *n_gaussian wk1995 it1992 electron neutron Choice of scattering table for structure factor calculations. Standard for X-ray is n_gaussian, for cryoEM is electron.
- chain_type = *None PROTEIN RNA DNA Type of polymer (normally identified by chain automatically).
control
- multiprocessing = *multiprocessing sge lsf pbs condor pbspro slurm Choices are multiprocessing (single machine) or queuing systems. Not implemented.
- queue_run_command = None run command for queue jobs. For example qsub.
- nproc = 4 Number of processors to use. NOTE: by default multiple processors will only be used in the map-to-model step (this is because multiprocessing requires writing out nproc sets of huge files and it can be very slow with distributed queues.). You can override this with force_nproc = True.
- ignore_symmetry_conflicts = False You can ignore the symmetry information (CRYST1) from coordinate files. This may be necessary if your model has been placed in a box with box_map for example.
- use_existing_files = False Use existing files
- random_seed = 771914 random seed
- verbose = False Verbose output
- quick = False Quick run (no optimization with refinement, just find a plausible set of altlocs and finish up)
- clean_up = True Remove temporary directory when done
gui GUI-specific parameter required for output directory
- output_dir = None