Rapid model-building with trace_and_build
Author(s)
- trace_and_build: Tom Terwilliger
Purpose
The routine trace_and_build is a tool for rapid protein model-building.
Usage
How trace_and_build works:
The trace_and_build tool traces the path of a polypeptide chain by
working from high to low density and following the highest density path
that does not yield branching. It then finds CB positions and builds an
atomic model.
Using trace_and_build:
Normally you will run trace_and_build on the unique part of a cryo-EM map.
Your main options are the resolution, whether to try a quick run
(quick=True) or a more thorough run (quick=False), and how many segments
to try and build.
Input map file: Usually you should supply trace_and_build with a
map that represents the
unique part of your structure. If your map has symmetry you can use
phenix.map_box with extract_unique=true to extract this part of the map.
Resolution: Specify the resolution of your map (usually the
resolution defined by your half-dataset Fourier shell correlation).
Sequence file: Supply a sequence file with the sequence or sequences of
the molecule(s) to be built. If there is more than one sequence, they are
simply concatenated at this point.
Segments to build: you can choose to build just one or a few of the
longest segments that trace_and_build can find (max_segments=3), or
everything (max_segments=None).
Quick run: You can try to run quickly (quick=True) or more thoroughly
(quick=False). One difference is that with quick=True, the number
of segments to build is by default 3, while with quick=False, it is
unlimited. The other is that with quick=False, after building the model
an attempt to fix insertions and deletions will be made
(fix_insertions_deletions=True). You can set these parameters separately
as well.
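The way quick interacts with the other two settings can be sketched in Python as follows. This is only an illustration of the defaults described above, not code from trace_and_build. The function name resolve_defaults is hypothetical, and None is taken to mean that you did not set the parameter yourself.

```python
def resolve_defaults(quick=True, max_segments=None, fix_insertions_deletions=None):
    """Fill in unset values as described in the text (hypothetical helper)."""
    if max_segments is None and quick:
        # Quick run: keep only the 3 longest segments by default.
        max_segments = 3
    # With quick=False, max_segments stays None, meaning "no limit".
    if fix_insertions_deletions is None:
        # Insertions/deletions are fixed only in a thorough run by default.
        fix_insertions_deletions = not quick
    return {
        "quick": quick,
        "max_segments": max_segments,
        "fix_insertions_deletions": fix_insertions_deletions,
    }
```

For example, resolve_defaults(quick=True, max_segments=10) keeps max_segments at 10 while still skipping the insertion/deletion step.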
Input model: You can supply an input fixed model and trace_and_build will
use it as a potential interpretation of the tracing of part of the map.
It will not be used as is; rather, CA positions will be extracted and used
in later interpretation steps. This fixed model will take the place
of the find_helices_strands step that is otherwise carried out.
Procedure used by trace_and_build
The procedure used by trace_and_build has several steps:
If no model is supplied, an initial fixed model is created by searching
the map for regular secondary structure with the tool find_helices_strands.
Optionally both directions of each segment of the fixed model can
be kept (allow_reverse=True).
The core of trace_and_build is to find, extend, and connect segments of
high density in the map. The way this is done is a lot like the way a person
would examine the path of a chain in a map, starting from a region of
clear (high) density, following the chain until it ends, then lowering the
contour level until a path is visible and following that one.
Initial segments of density are identified using the
model that is supplied or found with find_helices_strands. New
segments are identified from extended regions of high density, working
down from high to low density in the map. Connections and extensions are
also made working down from high to low density. Chain tracings are only
kept if they do not have branching.
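The idea of working down from high to low density and rejecting branched tracings can be illustrated with a toy Python sketch. It is not the algorithm used by trace_and_build: the map is reduced to a small graph of points, the names density, neighbors and trace_unbranched are hypothetical, and any branch at all is rejected (the real tool tolerates short branches, see max_branch_length).

```python
def trace_unbranched(density, neighbors, seed, thresholds):
    """Return the points of the largest unbranched region around seed.

    density   : dict mapping point id -> density value
    neighbors : dict mapping point id -> list of adjacent point ids
    thresholds: contour levels to try; they are visited from high to low
    """
    best = [seed]
    for t in sorted(thresholds, reverse=True):
        if density[seed] < t:
            continue  # the seed itself is not visible at this contour level
        # Flood-fill the connected region at or above threshold t.
        region, stack = {seed}, [seed]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if q not in region and density[q] >= t:
                    region.add(q)
                    stack.append(q)
        # A chain-like region has no point with more than two in-region neighbors.
        branched = any(
            sum(q in region for q in neighbors[p]) > 2 for p in region
        )
        if branched:
            break  # lower thresholds only add points, so the branch remains
        best = sorted(region)
    return best
```

With a linear chain of points of decreasing density and one weak side branch, the sketch returns the chain up to the contour level at which the side branch would first appear.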
Once the path(s) of the polypeptide chain are identified, likely positions
of CB atoms are identified from the presence of side-chain density along
the traced path. The positions of CA atoms are then guessed and refined
with the tool phenix.refine_ca_model, which adjusts the number of CA atoms
and their positions to match the likely CB positions, the chain tracing,
and expected CA-CA distances.
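As a rough illustration of how an initial guess of CA positions could be laid out along a traced path, the Python sketch below places points at intervals of the expected CA-CA distance (3.8 A, the default of ca_ca_distance). It is a simplification made for this example: the function name place_ca_along_path is hypothetical, spacing is measured along the path rather than as a straight-line distance, and none of the refinement against CB positions done by phenix.refine_ca_model is included.

```python
import math


def place_ca_along_path(path, ca_ca_distance=3.8):
    """Place points every ca_ca_distance (A) along a polyline of (x, y, z) points."""
    placed = [path[0]]
    needed = ca_ca_distance  # path length still to walk before the next point
    for a, b in zip(path, path[1:]):
        seg = math.dist(a, b)
        walked = 0.0  # distance already covered on this segment
        while seg - walked >= needed:
            walked += needed
            f = walked / seg
            # Interpolate between the two path points.
            placed.append(tuple(a[i] + f * (b[i] - a[i]) for i in range(3)))
            needed = ca_ca_distance
        needed -= seg - walked
    return placed
```

For a straight path 10 A long sampled every 1 A, this gives three positions, at 0, 3.8 and 7.6 A along the path.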
Once a CA-only model is created, an attempt is made to correct the model using
CA positions in the fixed model (input directly or from find_helices_strands).
In this step the CA positions in segments in the fixed model are matched
with those of the CA-only model, and if they overlap, the CA positions from
the fixed model are used.
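The correction step can be pictured with the following Python sketch. It is a simplified stand-in, not the matching procedure used by trace_and_build: the tool matches whole segments, whereas this example works atom by atom, and the function name snap_to_fixed and the 1.5 A cutoff max_dist are assumptions made for illustration.

```python
import math


def snap_to_fixed(ca_model, fixed_ca, max_dist=1.5):
    """Use the fixed-model CA position wherever it overlaps a traced CA.

    ca_model : list of (x, y, z) CA positions from the tracing
    fixed_ca : list of (x, y, z) CA positions from the fixed model
    max_dist : hypothetical overlap cutoff in A
    """
    corrected = []
    for ca in ca_model:
        # Closest fixed-model CA, or None if the fixed model is empty.
        nearest = min(fixed_ca, key=lambda f: math.dist(ca, f), default=None)
        if nearest is not None and math.dist(ca, nearest) <= max_dist:
            corrected.append(nearest)  # overlap: trust the fixed model
        else:
            corrected.append(ca)  # no overlap: keep the traced position
    return corrected
```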
An all-atom model is generated from each CA-only model using Pulchra.
If allow_reverse is set, then each possible
direction of each segment is considered. Each of these possibilities
is refined and scored based on map-model correlation (CC), agreement of
side-chain density with the sequence, and H-bonding in the model.
The highest-scoring model is written out so that it superimposes on the
input map.
Examples
Standard run of trace_and_build:
You can use trace_and_build to build a model based on a cryo-EM map:
phenix.trace_and_build my_map.mrc resolution=2.8 my_seq.dat
Using trace_and_build to evaluate forward and reverse versions of a model:
You can use trace_and_build to work on one or more segments, checking both
directions:
phenix.trace_and_build my_map.mrc resolution=2.8 my_seq.dat my_fragment.pdb \
find_chains=False extend_chains=False connect_chains=False allow_reverse=True
This will read in your fragment(s), create forward and reverse versions, score
them both, and try to build the better one (but without extending it or
building any new model). You can set verbose=True to see more details of
the scoring if you like. If one direction is clearly better than the other,
only it will be kept. If you want to keep both, set the parameter
convincing_delta_score to a big number or None (take everything).
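The effect of convincing_delta_score can be summarized with a short Python sketch. It only illustrates the behavior described above. The function name directions_to_keep is hypothetical, and the exact comparison (a score difference of at least convincing_delta_score) is an assumption.

```python
def directions_to_keep(forward_score, reverse_score, convincing_delta_score=10.0):
    """Decide which chain directions survive the forward/reverse comparison."""
    if convincing_delta_score is None:
        return ["forward", "reverse"]  # take everything
    delta = forward_score - reverse_score
    if delta >= convincing_delta_score:
        return ["forward"]  # forward is clearly better
    if -delta >= convincing_delta_score:
        return ["reverse"]  # reverse is clearly better
    return ["forward", "reverse"]  # no convincing difference: keep both
```

With the default of 10, scores of 25 and 10 keep only the forward direction, while scores of 12 and 10 keep both.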
Possible Problems
Specific limitations and problems:
Literature
- Fast procedure for reconstruction of full-atom protein models from reduced representations. P. Rotkiewicz and J. Skolnick. J Comput Chem 29, 1460-5 (2008).
Additional information
List of all available keywords
- job_title = None Job title in PHENIX GUI, not used on command line
- input_files
- map_file = None File with CCP4-style map. May have origin in any location.
- trace_map_file = None File with map for tracing 2 fragments and enclosed loop. When trace_loops is specified, supply a map_file with map showing just the loop and a trace_map_file with map showing fragments and the loop.
- model_file = None Input PDB file with suggestions for CA positions.
- marker_model_file = None Input PDB file with marker atoms from trace_chain
- trace_model_file = None Input PDB file with dummy CA atoms marking path of chain at about 0.5-1 A intervals
- input_scoring_file = None Input .pkl file with scoring information (can have more than one)
- seq_file = None Optional sequence file
- fixed_model = None Optional partial model to be used as a starting point. This model substitutes for a helices-strands model that is normally created. The chains in this model will be used with only minor adjustment and additional chains will be built. For adjustment only supply a model_file only. For adjustment of part of a model, specify the part of the model to keep as is with fixed_model and the whole model in model_file.
- placement_pickle_file = None Read placements from this file
- output_files
- pdb_out = trace_and_build.pdb Model with adjusted position/connectivity determined from map.
- output_scoring_file = None Output .pkl file with scoring information
- output_model_file = trace_chain_model.pdb Output PDB file with possible CA/CB positions
- output_marker_model_file = trace_chain_marker_atoms.pdb Output PDB file with marker atoms from trace_chain
- output_placement_pickle_file = None Write placements to this file
- temp_dir = None Temporary directory. Default is trace_and_build_xx, where xx is chosen so that a new directory is created
- crystal_info
- resolution = None High-resolution limit for map analysis.
- scattering_table = n_gaussian wk1995 it1992 *electron neutron Choice of scattering table for structure factor calculations. Standard for X-ray is n_gaussian, for cryoEM is electron.
- chain_type = *PROTEIN Chain type. Must be PROTEIN
- ca_ca_distance = 3.8 CA-CA distance
- solvent_content = None Solvent fraction of the cell. If this is density cut out from a bigger cell, you can specify the fraction of the volume of this cell that is taken up by the macromolecule. Normally set automatically. Values go from 0 to 1.
- solvent_content_iterations = 3 Iterations of solvent fraction estimation
- use_mask_if_present = True If map is masked, use the mask as solvent content
- sequence = None Sequences
- wrapping = False For cryo-EM maps, wrapping should be off
- origin_cart = None Origin (cartesian coordinates, overrides value based on map)
- strategy
- find_chains = True Try to create new chains in build_all_loops
- extend_chains = True Try to extend chains in build_all_loops
- connect_chains = True Try to connect chains in build_all_loops
- allow_reverse = True If two chains are opposite direction but connected, reverse the one with lower score (usually shorter) and connect them.
- correct_segments = True Correct segments. Try to fix errors using fixed model as a template
- mask_side_chains = False Mask side chains in existing model in build_all_loops
- same_chain_rmsd_max = 1.0 RMSD between two CA models of same length to consider them the same
- local_trace = True Run trace in local regions
- box_buffer = 5 Box buffer. Box will be at least the size of fragments plus this buffer in each direction (grid units)
- box_size = 150 150 150 You can specify the size of the boxes to use (grid units) when finding chains
- target_n_overlap = 10 You can specify the targeted overlap of boxes
- ends_only = True Keep track only of the very ends of chains. Ignore direction. Just 2 atoms for each chain (one at each end).
- minimum_new_chain_length = 15 Minimum length to try building a new chain
- similar_threshold_ratio = 0.1 Similar threshold ratio. Thresholds differing by this fraction of the difference between maximum and minimum thresholds are considered similar.
- max_merge_cycles = 100 Maximum cycles of merging fragments
- minimum_extension_length = 10 Minimum length to try extending a chain
- minimum_extension_improvement = 2 Minimum improvement to keep a trial extension
- weight_path_by_density = True Weight path by density to trace through high density. Scale on distance is exp(path_density_weight times log(density - density_min)/(density_max - density_min))
- path_density_weight = 2 Weight on having high density along path of trace.
- min_weight_ratio = 0.0001 Minimum scale on distance in trace through high density
- max_branch_length = 10 Maximum branch (not main path) to keep a segment. If a branch is longer than max_branch_length or longer than max_fractional_branch_length times the main path, reject
- max_fractional_branch_length = 0.5 Maximum branch (not main path) to keep a segment. If a branch is longer than max_branch_length or longer than max_fractional_branch_length times the main path, reject
- matching_end_dist = 3 Maximum distance between end atoms to consider them matching for purposes of excluding duplicate ends
- sharing_end_dist = 12 Maximum distance between end atoms to consider them matching for purposes of linking them. Can be larger than matching because the end points may not be at the ends of the chain.
- create_loop_maps = None Find connections between fragments in input model and write out small maps with just the density for the connection and the associated fragments. Default is True.
- max_points_per_region = 4000 Maximum number of points in a region. Will be ignored if more than this. (Limiting this can prevent very long connections from being made, but reduces possibility that a connection that is totally wrong is made).
- max_new_chains = 1 Maximum number of new chains to obtain in one pass
- regions_for_new_chains = 3 Maximum number of top regions to examine for new chain info
- max_loops = 10 Maximum number of loops to write out in create_loop_maps
- range = 2 Range (grid points) for examining shape of density
- max_loop_iterations = 999 Maximum number of iterations to look for loops
- spacing = 1 Spacing of points for a trace
- intervals = 10 Number of threshold values to try in create_loop_maps
- mean_density_ratio = 2 Guess of mean density at coordinates of atoms is mean density in map plus SD of density in map, corrected for solvent content, times mean_density_ratio
- min_sd_to_mean = 0.20 Limit SD of density at atoms to at least this value times the mean value.
- residues_to_have_middle = 10 Length of a chain to have a middle that can be masked out.
- min_points_in_region = 10 Minimum points in region to be considered as a new chain
- max_points_in_region = 3000 Maximum points in region to be considered as a new chain
- sd_density_ratio = 0.5 Guess of SD of density at coordinates of atoms is SD of density in map, corrected for solvent content, times sd_density_ratio
- threshold = None Density threshold for following density.
- threshold_low = None Density threshold low value for following density.
- threshold_high = None Density threshold high value for following density.
- sd_ratio_intervals = 1 SD ratio intervals. Density is traced starting from highest and going to lowest in sd_ratio_intervals. XXX not used
- sd_ratio_create = 1 SD ratio for creating new chains. Default is same as sd_ratio.
- sd_ratio = 2. Lower limit of density to consider is sd_ratio below mean of density at C/N/CA atoms in current model in create_loop_maps. Applies when connecting segments. A high value may result in incorrect connections.
- mean_ratio = 0.25 Lower limit of density to consider is mean_ratio times mean of density at C/N/CA atoms in current model in create_loop_maps after subtracting mean of map
- minimum_fraction_intervals_to_try = 0.67 If at least this fraction of intervals have been tried in trace, terminate if recent fraction of long branches is too high.
- max_fraction_long_branches = 0.4 Terminate trace if recent fraction of long branches is high
- recent_fraction_length = 8 Number of tries to include in recent fraction of long branches
- expand_size = 1 Expansion of mask when cutting out loop density
- build_segments = True Build segments
- first_cycle_for_fixing = None First macro-cycle where insertions/deletions should be tested
- trace_and_build
- max_segments = None Maximum number of segments to keep (after joining and insertion). Default is take all unless quick is set (then default is 3)
- fix_insertions_deletions = None Retrace chains to fix insertions and deletions. Default is True unless quick is set.
- convincing_delta_score = 10. Convincing delta score for forward vs reverse directions that nearly always means the higher-scoring one is correct.
- find_helices_strands = True Find helices/strands before tracing chain
- minimum_length_angstroms_helices_strands = 12 Minimum length (A CA start - CA end) for helix/strand to keep
- trim_chains = True Try to trim chains if turns are too tight (i-j-k with dist(i,k) < min_dist_ik)
- min_dist_ik = 4.5 Minimum i-k distance for CA i j k.
- fill_gaps = True Try to fill short gaps in input model
- tolerance_residue_distance = 2 Tolerance for residue-residue distance (typically 4.0)
- dot_min = -0.2 Minimum cosine of angle between sequential CA-CA directions.
- minimum_length = 2 Minimum length between CA-CA positions
- max_gap_residues = 2 Maximum number of residues to try and fill in gaps. Must be 1 or 2
- max_trim_ends = 1 Maximum number of residues to trim from ends before gap filling
- max_gap_dist = 3 Maximum distance to span in gap
- minimum_relative_density = 0.50 Minimum density at marker sites, relative to mean at coordinates of input model, to keep. Also minimum along path between a marker site and nearest main_chain trace. Also minimum ratio of density at ends of a connection and in gap to be filled.
- morph_segments = True Morph segments to fit density before looking for side chains
- morphing_iterations = 3 Number of iterations of morphing
- points_per_atom = 3 Number of points between atoms in tracing path of main chain
- shift_radius = 2 Marker atoms within shift_radius of a main-chain atom are used to identify shift of main-chain in morphing
- minimum_sites = 4 Minimum nearby sites for shifting main chain in morphing
- smoothing_window = 5 Smoothing window for shifts of main-chain in morphing
- main_chain_radius = 1.5 Approximate radius of tube of density for main-chain. Marked points within this radius of an atom in the main-chain are considered main-chain, those outside are side chains and other chains
- side_chain_radius = 3.0 Consider marked points between main_chain_radius and side_chain_radius as likely side-chain markers. Best to use value of 3 or less or neighboring side chains will interfere.
- mask_atoms_radius = 2.5 Radius to use when masking around already-built model atoms
- delta_atoms_radius = 2. Increase in mask_atoms_radius including already-built model.
- non_expanded_atoms_radius = None Atoms radius for creating mask showing main-chain. Should be slightly larger than the grid used so that a point on a grid point will have 6 adjacent grid points within this radius. Also should be comparable to N-CA-C distances. Default=max(1.0,grid+0.01)
- exclude_residues = 3 Residues to exclude at each end of chain in masking
- shell_radius = 0.5 Divide region between main_chain_radius and side_chain_radius into shells of thickness shell_radius
- cluster_width_ratio = 0.2 Maximum cluster width relative to CA-CA distance
- minimum_vector = 1.0 Minimum length of CB vector to include
- pruning_ratio = 3 Minimum distance between cluster centers relative to cluster width
- ca_ca_distance_tol = 1.0 CA-CA tolerance in scoring a set of possible CA atoms
- ca_ca_distance_tol_curves = 1.0 CA-CA tolerance in curves
- ca_ca_distance_curves = 3.0 CA-CA distance for residues in curves
- residues_defining_curves = 3 Number of residues in a row checked for defining a segment that curves.
- sites_to_delete = 3 Sites to delete and then try to rebuild as a group with 1 extra residue
- add_group = True Delete sites_to_delete and then rebuild as a group with 1 extra residue
- n_add_group = 10 Cycles for sites_to_delete
- n_tries_factor = 2 Try n_sites times n_tries_factor times to improve trace
- weights
- weight_vdw = 10 weight on very close CA-CA contacts
- weight_ca_ca_dist = 1. weight on CA-CA distance
- weight_proximity_to_known_position = 0.5 weight on proximity to well-identified CA position
- weight_chain_follows_trace = 10. weight on following the trace. Basically makes sure that the main chain does not cut across weak density
- distance_chain_follows_trace = 0.5 Target distance for following the trace.
- weight_target_length = 0.5 weight on total path length given number of CA positions
- weight_density = 1.0 weight on density at mid-points between CA atoms
- weight_cc_mask_score = 1.0 Weight on map correlation in evaluating chain direction
- weight_seq_score = 1.0 Weight on sequence-map matching evaluating chain direction
- weight_x_gly_score = 1.0 Weight on excess Gly and X residues in evaluating chain direction
- trace_chain
- helices_strands_cc_min = 0.35 Minimum map CC for helices/strands. Along with combine_models_score_min defines which fragments are tossed. Fragments are kept unless both criteria fail.
- use_all_trace_chains = False Use all trace_chains possibilities, not just the best
- n_overlap = 2 Maximum overlap in trace_chain nonamers
- n_random_frag = 100
- n_sift_nona = 2 Maximum nonamers with a common mid-point
- dist_ca_tol_start = 0.40 Minimum tolerance for CA-CA distances.
- dist_ca_tol_max = 0.6 Maximum deviation of CA-CA distances from target
- reject_ca_z_score = 2.5 Reject CA atoms in trace if their density is more than Z of reject_ca_z_score below the mean and more than fraction reject_ca_fraction below the mean.
- reject_ca_fraction = 0.5 Reject CA atoms in trace if their density is more than Z of reject_ca_z_score below the mean and more than fraction reject_ca_fraction below the mean.
- cc_ratio_min = None Minimum map CC relative to maximum for helices/strands.
- combine_models_score_min = 3.0 Minimum score for a chain in combine_models (after helices_strands). Score is mostly defined by sqrt(n_atoms)*overall map CC.
- rho_cut_min = 0.75 Minimum density (rho/sigma) at coordinates of potential CA atoms in trace_chain, after normalization for solvent fraction. For constant actual local rms in a map, the sigma (overall rms) of the map is proportional to the sqrt(1-solvent_fraction). Therefore rho_cut_min is adjusted by sqrt(0.5)/sqrt(1-solvent_fraction) to place it on a constant scale relative to a map with standard local rms.
- target_angle = 180 Target angle for CA-CA-CA (set to 180 to maximize chain length)
- rho_cut_min_low = 1. Starting value of rho_cut_min. Applies if rho_cut_min_delta is set (rho_cut_min is ignored in this case)
- rho_cut_min_high = 5 Ending value of rho_cut_min. Applies if rho_cut_min_delta is set
- rho_cut_min_delta = None Incremental value of rho_cut_min. If set, rho_cut_min will be ignored
- rat_pair_min = 0.5 Minimum ratio of density at midpoint between points to trace chain between them
- rad_sep_trace = 0.6 Dummy atom separation in trace_chain. Usual value is 0.6 A for a thorough run and 0.75 A for a quick run. Increased automatically if resolution is greater than 3 A. Value of rad_mask_trace in resolve will be rad_sep_trace*2
- target_p_ratio = 4 Target ratio of atoms to peaks in trace_chain
- target_n_ratio = 1 Target ratio of nonamers to peaks in trace_chain
- max_triple_ratio = None Maximum ratio of triples to pairs in trace_chain
- max_pent_ratio = None Maximum ratio of pentamers to pairs in trace_chain
- n_atoms_total_scale = 3 Ratio of estimated atoms in au to standard estimate
- atom_target_ratio = 1.0 Target ratio of CA to look for to expected atoms in structure. Standard is 0.45, quick is 0.35
- min_end_correl = 0.5 Minimum correlation of direction estimated from two ends to use end matching as criterion for keeping a chain
- add_side_chains = True Add in side chains at trace_chain step
- user_end_ratio = 2.0 Ratio of rad_user for ends of input chains to middle
- user_end_length = 3 Points in input chains within user_end_length of an end are considered near the end
- rad_user = 2.5 Radius for input chains in trace_chains. Points within this radius of an input chain will be removed.
- time_per_volume = None How long to try in trace chain (sec per volume of 10000 A**3) Try 2-20 to speed up trace_chain. Has similar effect as n_tries_target_p=2 n_tries_p_ratio=5.
- n_tries_target_p = None How many tries to get target p value in trace_chain. n_target_p. n_tries_target_p. Set to 40 for slow, 2 for quick
- n_tries_p_ratio = None Tries for p ratio. n_p_ratio. n_tries_max. Set to 20 for slow, 5 for quick
- sequencing
- random_sequences = None Number of random sequences of each length to use as baseline
- minimum_length = None Minimum length of a segment to consider it for sequencing
- positive_gap_penalty = None Penalty for missing residues is positive_gap_penalty * gap**2
- negative_gap_penalty = None Gap penalty for extra residues is negative_gap_penalty * gap**2
- max_gap_length = None Maximum gap length for adjacent alignments
- minimum_crossover_length = None Minimum length of crossovers
- minimum_alignment_length = None Minimum length of an alignment
- minimum_crossover_segment_length = 10 Minimum length of a segment in crossovers
- too_far_crossover = 0.75 fraction of CA-CA distance that is too far to cross over between chains
- too_much_further_crossover = 0.5 fraction of CA-CA distance that is too much further than the last matching CA-CA distance to cross over between chains
- rebuilding
- refine = True Refine all-atom models at end of procedure
- refine_cycles = None Refinement cycles (except final cycles). Typical is 1 for quick or superquick and 5 for thorough building
- final_refine_cycles = 5 Cycles of refinement for final model
- good_enough_cc = None If all residues have this CC, don't bother rebuilding them
- minimum_contact_distance = 3 Minimum distance between CA atoms not immediately connected
- split_with_sequence = None Use sequence to identify sequence register errors. Runs replace_side_chains with reassign_sequence=true and assign_sequence with iterative_assignment=true.
- rebuild_length_worst = 15 Longest rebuild length to try
- max_insert_or_delete = 1 Maximum residues to insert or delete in one rebuild stage
- time_per_residue = 1 How long to try in fitting loops (sec/residue)
- try_as_is = None Try rebuilding without insertions/deletions
- try_insertions = None Try insertions
- try_deletions = None Try deletions
- max_rebuild_cycles = None Maximum rebuilding cycles per macro cycle
- macro_cycles = None Macro cycles of rebuilding and refinement
- start_rebuild = None Starting residue to rebuild. If specified, this is all that is done
- end_rebuild = None Ending residue to rebuild. If specified, this is all that is done
- rebuild_segment = 1 Segment to rebuild from start_rebuild to end_rebuild. None means rebuild all segments from start_rebuild to end_rebuild.
- min_z_score = 2.5 Minimum Z-score for keeping a residue in trace-chain stage. (density at N C CA atoms less than mean for all residues minus min_z_score times SD for all residues).
- min_average_z_score = 2 Minimum average Z-score for keeping a residue in trace-chain stage. (average density at N C CA atoms less than mean for all residues minus min_average_z_score times SD for all residues).
- control
- nproc = 1 Number of processors to use
- random_seed = 171731 Random seed
- verbose = False Verbose output
- write_maps = True Verbose map output
- trace_only = False Trace map and stop
- quick = True Quick run. Refine just 1 cycle and look for up to 3 segments
- superquick = False Very quick run
- resolve_size = None Size of resolve to use.
- coarse_grid = None Use a coarse grid in RESOLVE (saves on memory)
- gui GUI-specific parameter required for output directory