Fixing register errors in a model with fix_insertions_deletions

Author(s)

fix_insertions_deletions: Tom Terwilliger

Purpose

The routine fix_insertions_deletions is a tool for fixing register errors in a model by comparing the density at side-chain positions to a sequence file.

Usage

Normally you access fix_insertions_deletions by running the Phenix map_to_model tool in the Phenix GUI. However, you can also run it directly from the command line (there is no GUI for fix_insertions_deletions itself).

How fix_insertions_deletions works:

The fix_insertions_deletions tool examines the density in the supplied map at the position of each side chain in the supplied model and creates a table of side-chain probabilities corresponding to each segment in the model.

These side-chain probabilities are used to generate a map-based sequence for the model and map. This map-based sequence is then compared to the actual sequence to identify positions where the sequence register is likely to be incorrect and to determine what changes in register are needed to fix it.
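
The comparison described above can be illustrated with a toy example. The sketch below is not the Phenix implementation: the function names, the probability-table layout (one dict per residue, mapping one-letter codes to probabilities), and the simple ungapped identity count are all assumptions made for illustration. It takes the most probable residue at each position as the map-based sequence, then aligns the two halves of a segment separately against the true sequence. A difference between the two offsets suggests a register shift somewhere in the segment.

```python
def map_based_sequence(prob_table):
    """Pick the most probable residue type at each position.

    prob_table is a hypothetical layout: one dict per residue,
    mapping one-letter codes to side-chain probabilities.
    """
    return "".join(max(probs, key=probs.get) for probs in prob_table)


def best_offset(fragment, true_seq):
    """Return the ungapped offset in true_seq with the most identities."""
    best_off, best_matches = 0, -1
    for offset in range(len(true_seq) - len(fragment) + 1):
        matches = sum(a == b for a, b in zip(fragment, true_seq[offset:]))
        if matches > best_matches:
            best_off, best_matches = offset, matches
    return best_off


def register_shift(map_seq, true_seq):
    """Estimate the register change between the two halves of a segment.

    Positive: residues appear to be missing from the model (insert needed).
    Negative: the model appears to have extra residues (delete needed).
    """
    mid = len(map_seq) // 2
    left = best_offset(map_seq[:mid], true_seq)
    right = best_offset(map_seq[mid:], true_seq)
    return (right - mid) - left


true_seq = "MKTAYIAKQRQISFVKSHFSRQ"
# Simulate a model that skips residue 6 (the second "A") of the true sequence.
map_seq = true_seq[:6] + true_seq[7:13]
shift = register_shift(map_seq, true_seq)
```

Here shift comes out as 1, meaning that one residue would have to be inserted into the model for both halves to match the sequence in the same register.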

Each position where a register shift is needed becomes a target for main-chain rebuilding. During rebuilding, the specified insertion or deletion is enforced so that, where possible, only models with the desired changes are obtained.

Additional rebuilding of the worst-fitting regions is also carried out.

Using fix_insertions_deletions:

The tool fix_insertions_deletions is usually run automatically as part of trace_and_build. However, you can run it yourself to try to fix a model.

Input map file: The map file should cover the model you supply.

Resolution: Specify the resolution of your map (usually the resolution defined by your half-dataset Fourier shell correlation).

Model: Supply the model that you want to fix. Only the main chain matters.

Sequence: Supply a sequence file that covers at least the part of the structure included in the supplied model.

Examples

Standard run of fix_insertions_deletions:

You can use fix_insertions_deletions to fix register shifts in a model based on a cryo-EM map:

phenix.fix_insertions_deletions my_map.mrc resolution=3 my_model.pdb seq.dat
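
If you script your runs, the same command can be assembled in Python. This is a minimal sketch, not part of Phenix: build_command is a hypothetical helper, and it assumes that extra keywords from the keyword list can be appended as name=value pairs in the same way as resolution=3 above.

```python
def build_command(map_file, model_file, seq_file, resolution, **keywords):
    """Assemble the argument list for phenix.fix_insertions_deletions.

    build_command is a hypothetical helper; extra keywords are passed
    as name=value pairs, mirroring resolution=3 in the example above.
    """
    args = [
        "phenix.fix_insertions_deletions",
        map_file,
        f"resolution={resolution}",
        model_file,
        seq_file,
    ]
    # Append any additional keywords, e.g. nproc=4.
    args += [f"{name}={value}" for name, value in keywords.items()]
    return args


command = build_command("my_map.mrc", "my_model.pdb", "seq.dat", 3, nproc=4)
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Phenix is installed and on the PATH.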

Possible Problems

Specific limitations and problems:

Literature

Additional information

List of all available keywords

job_title = None Job title in PHENIX GUI, not used on command line
input_files
- map_file = None File with CCP4-style map. May have origin in any location.
- model_file = None Input PDB file with chains to be adjusted.
- seq_file = None Optional sequence file
- placement_pickle_file = None Read placements from this file
output_files
- target_output_format = *None pdb mmcif Desired output format (if possible). Choices are None (try to use input format), pdb, mmcif. If output model does not fit in pdb format, mmcif will be used. Default is pdb.
- pdb_out = fix_insertions_deletions.pdb Output rebuilt model
- output_placement_pickle_file = None Write placements to this file
- temp_dir = None Temporary directory. Default is fix_insertions_deletions_xx, where xx is chosen so that a new directory is created
crystal_info
- resolution = None High-resolution limit for map analysis.
- scattering_table = n_gaussian wk1995 it1992 *electron neutron Choice of scattering table for structure factor calculations. Standard for X-ray is n_gaussian, for cryoEM is electron.
- chain_type = *PROTEIN Chain type. Must be PROTEIN
- solvent_content = None Solvent fraction of the cell. If this is density cut out from a bigger cell, you can specify the fraction of the volume of this cell that is taken up by the macromolecule. Normally set automatically. Values go from 0 to 1.
- solvent_content_iterations = 3 Iterations of solvent fraction estimation
- use_mask_if_present = True If map is masked, use the mask as solvent content
- sequence = None Sequences
- origin_cart = None Origin (cartesian coordinates, overrides value based on map)
strategy
- first_cycle_for_fixing = 1 First macro-cycle where insertions/deletions should be tested
- loop_method = *trace_chain *extend_only *split_loop *rebuild Method for loop building. None means try everything. trace_chain finds CA positions to trace the chain. extend_only uses resolve model-building to build the loop. split_loop cuts the loop in the middle and refines. rebuild rebuilds loops
- mask_secondary_structure_in_split_loop = True Mask out secondary structure in split loop
- residues_to_skip_on_ends = -2 Skip residues_to_skip_on_ends at ends of secondary structure in masking. Negative means keep that many residues off ends of secondary structure
rebuilding
- fix_insertions_deletions = False Fix insertions and deletions using sequence_from_map to adjust alignment to match sequence that is supplied. NOTE: Use split_with_sequence instead, which works better.
- restrain_ends_to_original = None Restrain this many residues at each end to original
- fix_insertions_deletions_only = None Only fix clear insertions and deletions
- take_all_insertions_deletions = None Accept insertion/deletion fixes without scoring (default quick is True)
- ratio_for_sequence_register = 1.0 Accept fixes for insertions/deletions if score is above ratio_for_sequence_register times previous score
- try_as_is = None Try rebuilding without insertions/deletions
- try_insertions = None Try insertions
- try_deletions = None Try deletions
- refine = None Refine models at start of procedure. Default is True unless quick is set.
- refine_cycles = 1 Refinement cycles (except final cycles)
- refine_b = None Refine B-values
- good_enough_cc = None If all residues have this CC, don't bother rebuilding them. Default is 0.7 (0.6 if quick=True)
- rebuild_length_worst = 15 Longest rebuild length to try
- max_insert_or_delete = 1 Maximum residues to insert or delete in one rebuild stage
- average_rebuild_length = True Choose rebuild length as average of minimum and optimal if optimal is longer than rebuild_length_worst. Alternative is use minimum if optimal is too long.
- time_per_residue = 1 How long to try in fitting loops (sec/residue)
- max_rebuild_cycles = None Maximum rebuilding cycles per macro cycle. Default is 20 (1 if quick is set and sequence is present)
- macro_cycles = None Macro cycles of rebuilding and refinement. Default is 4 (1 if quick is set)
- start_rebuild = None Starting residue to rebuild. If specified, this is all that is done
- end_rebuild = None Ending residue to rebuild. If specified, this is all that is done
- rebuild_segment = 1 Segment to rebuild from start_rebuild to end_rebuild. None means rebuild all segments from start_rebuild to end_rebuild.
- minimum_contact_distance = 3 Minimum distance between CA atoms not immediately connected
- split_with_sequence = False Use sequence assignment to identify sequence register errors. Runs sequence_from_map to split and fix assignment and then fit_all_loops to fill in gaps.
- keep_connectivity_in_split_with_sequence = True Keep connectivity in split_with_sequence
- first_cycle_for_split = 2 First macro-cycle where splitting should be tested
- split_input_model = True Input model will be split into segments
weights
- weight_vdw = 10 Weight on very close CA-CA contacts
- weight_ca_ca_dist = 1. Weight on CA-CA distance
- weight_proximity_to_known_position = 0.5 Weight on proximity to well-identified CA position
- weight_density = 1.0 Weight on density at mid-points between CA atoms
- weight_cc_mask_score = 1.0 Weight on map correlation in evaluating chain direction
- weight_seq_score = 1.0 Weight on sequence-map match in evaluating chain direction
- weight_x_gly_score = 1.0 Weight on excess Gly and X residues in evaluating chain direction
sequencing
- random_sequences = 100 Number of random sequences of each length to use as baseline
- positive_gap_penalty = 1 Penalty for missing residues is positive_gap_penalty * gap**2
- negative_gap_penalty = 2 Gap penalty for extra residues is negative_gap_penalty * gap**2
- max_gap_length = 2 Maximum gap length for adjacent alignments
- minimum_alignment_length = 5 Minimum length of an alignment
- score_by_residue_groups = None Use residue groups in sequence alignment and listing of optimal sequences. Default is True unless no sequence is supplied.
trace_chain
- helices_strands_cc_min = 0.5 Minimum map CC for helices/strands
- n_random_frag = 100
control
- multiprocessing = *multiprocessing sge lsf pbs condor pbspro slurm Choices are multiprocessing (single machine) or queuing systems
- queue_run_command = None run command for queue jobs. For example qsub.
- nproc = 1 Number of processors to use
- random_seed = 171731 Random seed
- verbose = False Verbose output
- skip_temp_dir = True Skip temp_dir when scoring
- quick = True Quick run
- superquick = False Very quick run
- ignore_symmetry_conflicts = False You can ignore the symmetry information (CRYST1) from coordinate files. This may be necessary if your model has been placed in a box with box_map for example.
- max_dirs = 1000 Maximum number of directories (fix_insertions_deletions_xxxx)
- resolve_size = None Size of resolve to use.
- coarse_grid = None Use a coarse grid in RESOLVE (saves on memory)
- em_side_density = False Use EM side chain density. Alternative is to use standard x-ray side chain density in sequence templates.
gui GUI-specific parameter required for output directory
- output_dir = None
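
Two of the keywords above state their scoring rules explicitly: the gap penalties (positive_gap_penalty * gap**2 and negative_gap_penalty * gap**2) and the acceptance test for ratio_for_sequence_register. The sketch below restates them in Python using the default values from the list. It is an illustration, not Phenix code: the function names are hypothetical, and treating a positive gap as missing residues, a negative gap as extra residues, and "above" as a strict comparison are assumptions.

```python
def gap_penalty(gap, positive_gap_penalty=1, negative_gap_penalty=2):
    """Quadratic gap penalty, following the keyword descriptions.

    Assumption: gap > 0 means residues are missing, gap < 0 means
    there are extra residues.
    """
    weight = positive_gap_penalty if gap >= 0 else negative_gap_penalty
    return weight * gap ** 2


def accept_fix(new_score, previous_score, ratio_for_sequence_register=1.0):
    """Accept a fix if its score exceeds ratio * previous score.

    Assumption: "above" is read as a strict comparison.
    """
    return new_score > ratio_for_sequence_register * previous_score


# Penalties for gaps of -2 to 2 residues with the default weights.
penalties = {gap: gap_penalty(gap) for gap in (-2, -1, 0, 1, 2)}
```

With the default weights and under these assumptions, an extra residue costs twice as much as a missing one for the same gap length.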