Merging two models with combine_models

Author(s)

combine_models: Tom Terwilliger

Purpose

This routine takes a model and finds pieces of a second model file that improve its fit to density when they replace the corresponding pieces in the first model.

Usage

The main uses of phenix.combine_models are:

Taking the best parts of 2 models and merging them to make a new model.
Replacing a segment in one model with a corresponding segment from another.

How combine_models works:

The phenix.combine_models tool can combine the best parts of two or more models. You can use it in either of two different ways. One approach keeps segments in the models intact, and the second applies crossovers within segments.

In the first approach (merge_by_segment_correlation=True), each fragment in each model is scored based on correlation to the map. The scoring also includes a weight based on the square root of the length of the segment. It also includes weights based on whether a segment has been assigned to the sequence and the number of secondary structure hydrogen bonds in that segment. This approach can combine chains of different types as well. A final increment is added to the score for a segment if it is considered very likely to be correct with different thresholds for chains of different types.

Then the segments are picked in rank order to create a composite model. Any part of a segment that does not overlap an existing part of the composite model is kept. If symmetry is present, overlaps include that symmetry.

In the second approach, (merge_by_segment_correlation=False), a working model is created by taking all of the first model supplied and filling in any empty regions with fragments from other models. Then one by one, the segments in the working model are recombined with all other segments. To carry out recombination between two chains, residues that match in the two chains are identified. Then for each segment between matching pairs of residues, whichever chain has the higher correlation to the map is kept to create a composite model.

This second approach to combine_models starts with two input models. The first model is used as the default; if nothing can be found in the second model that is better than what is in the first model, then that part of the first model is kept. The second model is used as a template for improving the first model. Fragments of the second model are considered as alternatives for corresponding segments in the first model.

The fit of the models to density is used to decide which of a pair of fragments is best. In general, the correlation of model density with the map is used as the criterion. In cases where unequal numbers of residues are considered, then this correlation is weighted by the square root of the number of residues in each case. During the optional merge_second_model step, the scoring is optionally based on correlation of density, or by default, based on density at the positions of the main-chain atoms in the model.

If the two input models are not in the same asymmetric unit of the crystal, then combine_models will move the pieces from the second model to the corresponding locations in the first model. In this way the final model has all its parts in the same place.

The second approach in combine_models has four main steps, each of which is optional:

Selecting parts of the input models to consider, using atom selections. This step allows you to do things such as removing a segment from the first model, then crossing with the corresponding segment from the second model, effectively replacing that segment in a way that you specify.
Creating as complete a starting model as possible (merge_second_model=True). This is done by cutting the second model into small pieces and reassembling the first model, considering these pieces along with all the segments in the first model as potential pieces of the reassembled model. The result is a new working model.
Crossing the working model with the second input model, allowing only equal-length crossovers (matching=True). In this step, pairs of segments in the working model and the second input model that have overlapping residues are considered, one at a time. The local map correlation is calculated for each residue in each fragment. Then at each position, the coordinates from one fragment or the other are chosen, with the choice crossing over between one fragment and the other only at positions where the main-chain atoms in the residue are within match_distance (default 0.5 A) of each other.
Crossing the working model with the second input model, allowing only unequal-length crossovers (non_matching=True). This step is just like the previous one, except that only unequal-length crossovers are considered. If this method is included, then it may be necessary to reassign the sequence afterwards as the alignment may change
Checking the sequence. If unequal-length crossovers are made, then the sequence alignment to the model may need to be changed. This step carries that out.

Output files from combine_models

combine_models.pdb: A PDB file with your combined model.

Examples

Standard run of combine_models:

Running combine_models is easy. From the command-line you can type:

phenix.combine_models pdb_in=first.pdb \
   second_pdb_in=second.pdb \
   seq.dat \
   map_coeffs.mtz

This will combine first.pdb and second.pdb based on fit to the map from map_coeffs.mtz, recheck the sequence alignment to seq.dat, and write out the resulting model.

Ranking all fragments and picking the best ones:

phenix.combine_models pdb_in=model.pdb

seq.dat map_coeffs.mtz merge_by_segment_correlation=True

This will score each segment (fragment), then work down from best to worst, keeping any part of any segment that does not overlap with a better-scoring segment.

Selecting pieces from the two models:

To take first.pdb and then see if residues A21-A30 of second.pdb can improve it, you can type:

phenix.combine_models pdb_in=first.pdb \
   second_pdb_in=second.pdb \
   seq.dat \
   map_coeffs.mtz \
   second_pdb_in_atom_selection="(chain A and resid 21:30)" \

Replacing a specific segment:

To take first.pdb and then see if residues A21-A30 and B21-B30 can be improved by replacing them with residues C10-C20 and D10-D20 of second.pdb, you can tell combine_models to ignore residues A22-A29 and B22-B29 and to consider only residues C10-C20 and D10-D20 of second.pdb:

phenix.combine_models pdb_in=first.pdb \
   second_pdb_in=second.pdb \
   seq.dat \
   map_coeffs.mtz \
   pdb_in_atom_selection="(not  ( (chain A or chain B) and resid 22:29) )" \
   second_pdb_in_atom_selection="( (chain C or chain D) and resid 10:20)" \

Crossing two models that have entirely matched residues:

If your first.pdb and second.pdb have exactly the same residues present, and just differ in coordinates, then you might want to preserve all the connectivity by skipping the merge_second_model step, and by skipping the non_matching crossover step, and by skipping the reassignment of sequence. You can type preserve_connectivity=True as a shortcut for this:

phenix.combine_models pdb_in=first.pdb \
   second_pdb_in=second.pdb \
   seq.dat \
   map_coeffs.mtz \
   preserve_connectivity=True

Possible Problems

Specific limitations and problems:

Literature

Additional information

List of all available keywords

combine_models
- input_files
  - pdb_in = None Input starting PDB file. This model will be the default output model. Parts of the model that are better in the second model be replaced by the second model.
  - pdb_in_atom_selection = None Any selection specified with pdb_in_atom_selection is applied to the pdb_in input model before using it.
  - second_pdb_in = None Input second PDB file. Parts that are better in this model will replace corresponding parts of the first model.
  - second_pdb_in_atom_selection = None Any selection specified with second_pdb_in_atom_selection is applied to the second_pdb_in input model before using it . If multiple second_pdb_in files are supplied, the same number of atom selections are expected.
  - map_in = None CCP4 or MRC-style map file (alternative to mtz_in)
  - mtz_in = None Input MTZ file with map coefficients
  - map_coeff_labels = None If map coefficients cannot be identified automatically from your MTZ file, you can specify the label or labels for them. (Please separate labels with blank space, MTZ columns grouped together separated by commas with no blanks.) You can specify: map_coeff_labels (e.g., FWT,PHIFWT) amplitudes and phases (e.g., FP,SIGFP PHIB) or amplitudes, phases, weights (e.g., FP,SIGFP PHIB FOM)
  - seq_file = None Sequence file for sequence alignment
  - ncs_file = None Input NCS spec file for use with segment_and_split_map.
- output_files
  - pdb_out = combine_models.pdb Output PDB file
  - log = combine_models.log Output log file
  - params_out = combine_models_params.eff Parameters file to rerun combine_models
  - linkage_output_file = linkage_output.pkl Linkage dictionary
  - output_directory = None Not used (output directory for passing to map_to_model)
- crystal_info
  - chain_type = *PROTEIN DNA RNA You can specify whether to include protein, DNA, or RNA If you use merge_by_segment_correlation you can include multiple chain types and chain_type is ignored
  - resolution = None High-resolution limit
  - solvent_content = None Solvent fraction of the cell. If this is density cut out from a bigger cell, you can specify the fraction of the volume of this cell that is taken up by the macromolecule. Normally set automatically. Values go from 0 to 1.
  - space_group = None Space group (normally read from the data file)
  - unit_cell = None Unit Cell (normally read from the data file)
  - map_inside_box = True Place centers of all chains inside (0,1). This is required during normal operation.
  - use_sg_symmetry = None Not used
- directories
  - output_dir = "" Output directory where files are to be written
- crossing
  - high_resolution = None Resolution used in map calculation
  - match_distance = 1.5 Residue pairs must have rmsd of match_distance (A) or lower to be crossed. A value between 0.5 A and 1.5 A is generally best.
  - merge_by_segment_correlation = False Split all models into segments (breaking where residue numbers are not sequential), rank segments by correlation to map weighted by number of atoms, take highest-scoring non-overlapping segments. Followed by standard merge with all unused segments.
  - optimize_ncs = False Optimize NCS after merge_by_segment_correlation. NOTE: assumes that all NCS operators are allowed and may not place the resulting model in density if the NCS is not present in the density.
  - trim_to_au = True Split up any chains that are in multiple NCS au
  - weight_rms = 1 Weight on closest approach of fragments
  - cc_cut = 0.90 Eliminate NCS copies with CC less than cc_cut of max
  - weight_cc = 0.1 Weight on CC in optimizing NCS (typically 10)
  - weight_rad_gyr = 0.1 Weight on radius of gyration in optimize_ncs
  - n_substitutions = 3 Substitutions in optimization
  - n_consider = 20 Substitutions in optimization applied to last n_consider fragments
  - n_random_opt = 0 Try factor
  - max_optimize_tries = 500 Maximum optimization steps
  - overlap_tolerance = None Minimum separation between N/C1'/P atoms in separate chains. Applies to merge_by_segment_correlation only. Default is 3 A for C1'/P and 2 A for N
  - ncs_overlap_tolerance_factor = 1.5 Overlap tolerance required for NCS-related chains. Typically larger than for chains in same NCS au to ensure they do not overlap
  - very_good_protein_score = 20.0 Score indicating in combination with very_good_protein_cc that an PROTEIN fragment is almost certainly correct
  - very_good_protein_cc = 0.75 CC indicating in combination with very_good_protein_score that an PROTEIN fragment is almost certainly correct
  - very_good_rna_score = 7.0 Score indicating in combination with very_good_rna_cc that an RNA fragment is almost certainly correct
  - very_good_rna_cc = 0.60 CC indicating in combination with very_good_rna_score that an RNA fragment is almost certainly correct
  - min_fractional = 0.05 Minimum fraction of a chain remaining after removing parts occupied by a better-scoring chain
  - min_fractional_multiple_chain_types = 0.35 Minimum fraction of a chain remaining after removing parts occupied by a better-scoring chain. Applies if multiple chain types are present (otherwise min_fractional applies)
  - max_residues_in_score = 20 Maximum number of residues to use when increasing score by sqrt(n_residues)
  - weight_by_ss = False Apply extra weight on residues with H-bonds
  - assignment_weight = None Apply extra weight on residues assigned to sequence
  - extra_weight_h_bonds = 1. Extra weight on residues with H-bonds
  - min_absolute = 5 Minimum residues in a chain remaining after removing parts occupied by a better-scoring chain
  - merge_remainder = True Merge remainder in merge_by_segment_correlation
  - dump_remainder = False If merge_remainder is False and dump_remainder is True, dump the remainder pdb to remainder.pdb
  - ncs_in_cell_only = True Only consider NCS-related molecules that are in cell
  - remove_only_if_centers_outside = False Remove outside cell only if centers outside cell. Modifies remove_outside_cell. Must set remove_outside_cell as well to apply.
  - remove_outside_cell = False Remove chains with atoms outside cell
  - merge_segments = False Identify segments in first and second models For each gap or end in first model, try to insert or attach a segment from the second model. If selected, no other merging is done.
  - min_segment_length = 5 Segments shorter than this will be ignored in merge_segments.
  - merge_second_model = True Cut second model into pieces and merge them into first model. (Useful for filling in gaps in first model.)
  - merge_both_models = False Cut first and second models into pieces and merge them (useful for combining all pieces of both models). Alternative to merge_second_model.
  - trim_in_merge = False Trim all the fragments of first and second model back to match density before merging of models. See also remove_bad_fragments in merge which removes bad fragments (after any trimming is done).
  - remove_bad_fragments_in_merge = True Remove bad fragments in merge. See also trim_in_merge which trims back fragments before trying to merge them.
  - extend_in_merge = False Extend fragments of first and second model during merging of models
  - fragment_length = 10 Length of pieces that second model will be cut down to in merge_second_model or merge_both_models
  - use_cc_in_combine_extend = None You can choose to use the correlation of density rather than density at atomic positions to score models in the merge_second_model or merge_both_models step. This may be useful at lower resolution (> 3 A)
  - matching = True Carry out crossover using segments that match (have the same number of residues)
  - non_matching = True Carry out crossover using segments that do not match (do not have the same number of residues)
  - check_sequence = None After running all other procedures, redo the sequence assignment. This may be necessary if the models did not have the same sequence assignments or if unassigned pieces are now put together and can be assigned. If None, then check_sequence will be set to True if non_matching=True.
  - preserve_connectivity = None This is a shortcut for turning off merge_second_model, merge_both_models, non_matching, and check_sequence. It is useful if your two models have the same residues, just with different coordinates, and you want to maintain the connectivity.
  - merge_only = None This is a shortcut for turning off everything except merge_second_model and merge_both_models.
  - standard_merge = None Merge by reading all chains and running resolve merging.
  - wang_radius = 5 Smoothing radius for solvent identification
  - cc_min = 0.20 Minimum map-model correlation to keep a segment after refinement
  - cc_ratio_min = None Minimum map-model correlation relative to maximum to keep a segment after refinement
  - combine_models_score_min = None Minimum score for a chain in combine_models (after helices_strands). Score is mostly defined by sqrt(n_atoms)*overall map CC .
  - cc_min_rna = 0.25 Minimum map-model correlation (RNA building)
  - map_inside_box = None Map final chains inside unit cell
- control
  - verbose = False Verbose output
  - nproc = 1 Number of processors to use
  - multiprocessing = *multiprocessing sge lsf pbs condor pbspro slurm Choices are multiprocessing (single machine) or queuing systems
  - queue_run_command = None run command for queue jobs. For example qsub.
  - raise_sorry = False Raise sorry if problems
  - return_on_failure = False Just return on failure
  - debug = False Debugging output
  - dry_run = False Just read in and check parameter names
  - i_ran_seed = 351219 random seed
  - resolve_command_list = None You can supply any resolve command here NOTE: for command-line usage you need to enclose the whole set of commands in double quotes (") and each individual command in single quotes (') like this: resolve_command_list="'no_build' 'b_overall 23' "