Guessing sequence from a map with sequence_from_map
Author(s)
- sequence_from_map: Tom Terwilliger
Purpose
The routine sequence_from_map will use a map to guess the sequence of a
supplied segment or of segments built automatically.
Usage
How sequence_from_map works:
Sequence-from-map examines the features in a map at the position where side
chains in a model are expected to be located and uses these features
to guess the identity of the side chain at each position.
The procedure used by sequence-from-map is has several steps:
If map (and no model) is supplied, a model is created by searching for regular
secondary structure in the map. First the unique part of the map is
identified and cut out to form a new map.
This is done first using the map_symmetry to find helical or point-group
symmetry in the map. If any is found, a new map is created containing just
the unique part of the map. Then (if improper_symmetry is set), the new map
is examined to see if there is any local symmetry remaining. If so, the unique
part is cut out and a new map is created. Then the tool find_helices_strands
is used to analyze the new map to find regular secondary structure.
Once a map and model are available, each segment (a piece of chain without
breaks) is compared to the map and sequences compatible with the map
are identified.
The core of sequence-from-map is to generate a scoring matrix for each
segment that reflects the relative probability that each possible side chain
is located at each position in the segment. This probability is estimated
from the map-model correlation for each side chain (after examining all
rotamers of the side chain and picking the one with the highest correlation).
A Z-score (value for this side chain minus mean for this side chain at all
positions, divided by the standard deviation of this quantity) is calculated
for each side chain at each position, where Z values less than zero are set
to zero. These Z-scores are used as approximate minus-log-likelihood scores
for side chain probabilities and make up the scoring matrix for the segment.
A best-fitting sequence is generated for each segment by simply picking the
highest-scoring side chain at each position. Optionally, if
score_by_residue_groups is set and residue groups are defined
(residue_groups = "VGASCTI P LDNEQM KR FHY W"), one representative of each
residue group is chosen to represent all the side chains in that group.
Optionally (if minimum_discrimination is defined and trim_models is set),
the scoring matrix is used to cut the model up into pieces, removing
any residues where the discrimination between the most probable and
least probable side chain is less than minimum_discrimination. The
segments are then cut up at these points and smaller segments are created.
Also optionally (if trim_models is set to some value N), the N residues on
each end of each segment are removed before analysis.
If one or more potential sequences representing the molecule in the map are
provided, these sequences are scored, one at a time, by evaluating the
optimal alignment of the sequence to the segments in the model.
The alignment of a sequence to a set of segments is done by considering
all possible alignments for all segments, keeping the highest-scoring
segment-sequence alignment, then repeating the process with all remaining
segments. Optionally, the alignments can be restricted to a set that
uses each residue in the sequence only once (allow_duplicates set to False).
Optionally, sequences that are shorter than the number of residues in the
model are rejected (if skip_if_too_short is set).
The raw score for one segment-sequence alignment is given by the sum of the
position-dependent minus-log-likelihood scores for all the side chains
in the alignment. This raw score is then adjusted by subtracting the mean
value of raw segment-sequence alignment scores for random sequences of the
same length as the sequence being considered. The random sequences are
created using a residue frequency of eukaryotic amino acids (this can be
adjusted with the default_sequence parameter). If positive_only is set,
negative segment-sequence alignment scores are ignored.
Once the best-fitting sequence is identified, the side chains of the
model are re-fit using this sequence and a model with the fitted side
chains is written out.
Examples
Standard run of sequence_from_map:
You can use sequence_from_map to guess the sequence of a fragment you
have built:
phenix.sequence_from_map my_fragment.pdb my_map.mrc resolution=2.8
You can also use it to try and pick out which sequence file matches your map:
phenix.sequence_from_map my_map.mrc resolution=3 seq_dir=sequences/
where the directory sequences has a set of fasta or similar files representing
possible sequences that might match your map.
Re-sequencing and filling in a gap with run_assign_sequence=True:
You can use the special keyword run_assign_sequence=True in sequence_from_map
to carry out a two-step process to try and fix a mistraced part of your
model. If residues 154-162 of your model look bad and perhaps the sequence
register is wrong after residue 162, you can try something like this:
First remove the bad residues:
- phenix.pdbtools bad_model.pdb
- remove="resseq 154:162" output.file_name=bad_model_del.pdb
Then run sequence_from_map to reassign the sequence and fill in the gap
(you can supply a map_file or an mtz_file; either will be fine):
- phenix.sequence_from_map run_assign_sequence=True
- model_file=bad_model_del.pdb seq_file=sequence.dat map_file=map_file.ccp4 pdb_out=bad_model_fixed.pdb resolution=2
This will produce bad_model_fixed.pdb
Possible Problems
Specific limitations and problems:
Sequence_from_map does not recognize insertion codes, so if your model uses
these, first run phenix.pdb_tools with the "renumber=True" keyword to remove
them (actually to renumber the residues and leave the insertion codes in place).
Literature
- Protein identification from electron cryomicroscopy maps by automated model-building and side-chain matching. T.C. Terwilliger, O.V. Sobolev, Afonine, PV, P.D. Adams, Ho, CM, Li, X, Zhou, and ZH. Acta Cryst D. 77, 457-462 (2021).
Additional information
List of all available keywords
- job_title = None Job title in PHENIX GUI, not used on command line
- input_files
- map_file = None File with CCP4-style map. May have origin in any location.
- multiple_seq_file = None Optional sequence file containing multiple potential sequences. (Fasta format or sequences separated by blank lines)
- seq_file = None Optional sequence file(s) containing potential sequences
- seq_dir = None Optional directory containing sequence files
- model_file = None Input PDB file with chains to be sequenced NOTE: if you supply model_file no model-building will be done. If you supply starting_model_file it will be used as a starting model for model-building.
- starting_model_file = None Input PDB file to be used as starting model in model-building. NOTE: if you supply model_file no model-building will be done. If you supply starting_model_file it will be used as a starting model for model-building.
- symmetry_file = None Optional symmetry file (.ncs_spec format or MATRX records) with reconstruction symmetry. Used in identification of unique part of map.
- probability_matrix_file = None Optional .pkl file with probability matrices. Note: if build_type is all then a prefix of either helices_strands or full_model will be added to the file name (if not already there)
- model_pickle_file = None Optional .pkl file with final fragments. Note: if build_type is all then a prefix of either helices_strands or full_model will be added to the file name (if not already there)
- output_files
- pdb_out = sequence_from_map.pdb Model with sequence determined from map. Note: if build_type is all then a prefix of either helices_strands or full_model will be added to the file name (if not already there)
- output_probability_matrix_file = None Optional output .pkl file with probability matrices
- output_model_pickle_file = None Optional output .pkl file with final fragments
- temp_dir = None Temporary directory. Default is sequence_from_map_xx where xx creates new directory
- crystal_info
- resolution = None High-resolution limit for map analysis.
- scattering_table = n_gaussian wk1995 it1992 *electron neutron Choice of scattering table for structure factor calculations. Standard for X-ray is n_gaussian, for cryoEM is electron.
- sequences = None Sequences to be matched as text strings
- default_sequence = '' Sequence reflecting relative usage of residues/nucleotides. Default is amino acid composition for eukaryotes.
- chain_type = *PROTEIN DNA RNA Chain type
- molecular_mass = None Molecular mass of the entire structure in the map (not the mass of one chain). Does not have to be accurate, but helpful if using symmetry to extract unique part of map.
- strategy
- fill_in_sequence = None Just fill in optimal sequence. Default if no sequence supplied.
- extract_best_parts = False Try to identify best-fitting parts of sequence and use those
- create_gap_length = None Create gaps of this length if segments are nearly overlapping
- extract_length = 35 Target lengths of extracted best parts.
- extract_cc = 0.80 Target map CC of extracted best parts.
- extract_cc_delta = 0.01 Delta for target map CC of extracted best parts.
- extract_cc_min = 0.70 Minimum target map CC of extracted best parts.
- extract_n = 10 Maximum number of extracted best parts.
- extract_length_vs_cc = 10 Trade off between length and CC
- score_with_optimal_sequence = True Identify optimal sequence with reduced residue set and use it to match sequence
- unique_sequence = True Only allow one use of each part of sequence unless model is longer than sequence
- split_using_alignment = None At end, split fragments based on alignment. Default is True if score_with_optimal_sequence is True and sequence is supplied
- keep_segment_order = None Keep order of segments (only change their local positions). Only applies if unique_sequence=True and score_with_optimal_sequence=True
- extract_ss = False Extract secondary structure groups (usual alternative is extract regions)
- extract_ss_dist_max = 15 RMSD of ends of secondary structure must match within this dist.
- extract_ss_groups = 1 Number of secondary structure groups to generate
- build_only = False Just set up and build model and stop
- retrace = None Retrace chains and score with map correlation and sequence alignment in trace_and_build procedure. Default is False if quick = True
- block_gaps_in_secondary_structure = True Block creation of gaps in alignments in the middle of secondary structure
- run_map_box = False Run map box on input model with radius atom_radius
- map_box_radius = 10 RMSD of ends of secondary structure must match within this dist.
- rerun_with_avail_seq = True Rerun sequence alignment with parts of the sequence already used removed. With a large structure this can take a long time.
- speed_up_ca_insertion = False Speed up CA insertion by running on short segments first.
- chains_from_fixed_model = None Use fixed model to mark initial chain positions.
- max_fraction_long_branches = None Terminate trace if recent fraction of long branches is high
- split_and_join_iterations = None Iterations of split_and_join
- split_and_join = None Split fragments at low density and then rejoin. Criterion is sd_ratio_pare_model
- rejoin_split_fragments = False Specifically rejoin the split fragments leaving out poor junctions
- test_threshold_fixed_segments = None Test thresholds when creating fixed segments
- require_ends_in_region = None Add ends of chain to region if necessary
- try_alternate_joins = None try joining each end of pair of fragments, not just the first
- skip_path_if_missing = None Skip path if points are missing
- sequencing
- create_final_model = True Create final model with sequence obtained
- residue_number_from_sequence = True Start chains with best guess of residue number from sequence file
- build_type = *None helices_and_full helices_strands full_model trace_and_build trace_and_build_two_cycles Default is trace_and_build if quick = true and trace_and_build_two_cycles if quick = False. You can choose to build with trace_and_build, just helices/strands (quick) or a full model (takes more time). Note you can define how many regions to analyze with regions_to_keep (set it to 1 to run quickly). You can choose helices_and_full to run both helices_strands and full_model or trace_and_build_two_cycles to run trace and build two cycles and pick the best result.
- tnb_build_cycles = 1 Trace and build build cycles
- regions_to_keep = None You can specify a limit to the number of regions to keep when generating the asymmetric unit of density. Default is 1 for build_type = full_model or trace_and_build and None for build_type = helices_strands.
- minimum_score = -10. Minimum score. Scores less than this will be replaced with this value.
- alignment_method = *smith_waterman muscle Alignment method. Alternatives are muscle and smith_waterman
- gap_ratio = 0.5 Ratio of gap extension penalty to gap opening penalty
- blocking_ratio = 10 Ratio of gap extension penalty in regions to be blocked to to allowed regions
- gap_opening_penalty_max = None Gap opening penalty for sequence alignments. Goes from gap_opening_penalty_min to gap_opening_penalty_max. Default is 10 for muscle and 150 for phenix
- gap_opening_penalty_min = None Gap opening penalty for sequence alignments. Goes from gap_opening_penalty_min to gap_opening_penalty_max. Default is 2.5 for muscle and 35 for phenix
- trim_ends_by = 1 When blocking alignments with secondary structure, trim the secondary structure back by this amount to allow insertions and deletions at the end of the secondary structure.
- convincing_delta_score = 10. Convincing delta score for two solutions that nearly always means the higher-scoring one is correct.
- average_z_scores = False Use average Z-score for multiple builds. Alternative is use highest score.
- minimum_final_z_score = 4 Skip additional builds if a solution with this Z-score or better is found.
- minimum_length_after_split = 5 Minimum length of a segment after splitting with split_using_alignment
- minimum_length_after_split_penalty = 2 Penalty per residue shorter than minimum
- minimum_length = None Minimum length of a segment to consider it for sequencing. Default is 10 if a multiple_sequence_file or sequence_dir is supplied
- minimum_length_ratio = None Reject segments that are less than this fraction of longest segment. Applies if max_segments is not set. Default is 0.15 if a multiple_sequence_file or sequence_dir is supplied.
- maximum_segments_in_sequencing = None Maximum number of segments to consider (remainder ignored) in analyzing sequence placement. Default is 10 if a multiple_sequence_file or sequence_dir is supplied.
- max_segments = None Maximum segments to consider in analyzing sequences. Default is to take all segments unless a multiple_sequence_file or sequence_dir is supplied, in which case the default is 3.
- minimum_residues = None Try to choose enough segments to get this many residues. Applies if max_segments is not set. Default is 500 if multiple sequences are supplied.
- maximum_sequence_length = None Maximum sequence length to consider. Default is 2000 if multiple sequences are supplied.
- subtract_expected = None *single_try multiple_tries Subtract expected score based on random sequence for a single try or for the number of tries carried out.
- positive_only = False Just sum scores that are positive and skip others
- allow_duplicates = True Allow duplication of use of sequence.
- skip_if_too_short = False Skip sequences that are shorter than the number of residues in the model.
- skip_if_too_short_for_segment = None Skip sequences that are shorter than the number of residues in a segment, with tolerance of short_tolerance_percent.
- short_tolerance_percent = 10 Tolerance for skip_if_too_short_for_segment
- refinement = None Run refinement after finding model. Default is True for helices_strands or full_model and False for trace_and_build.
- simulated_annealing_tries = None Run simulated annealing this many times on segments and weight sequence probability based on the resulting likelihood scores.
- average_sa_by_residue = False Average SA tries by residue (alternative is average by model)
- split_then_fix = False First use split_at_unknown to split up model, screen all input sequences and choose top n_split_then_fix sequences. Then run again on these sequences using fix_insertions_deletions on each of them for final scoring.
- n_split_then_fix = None Number of sequences to keep in play in split_then_fix
- fix_insertions_deletions = False Try to fix alignment allowing up to n_split*n_sites/10 pieces but keep them linked to maximize sequence match
- n_split = 1 Maximum number of pieces to cut segments up into is n_sites * n_split/10
- max_start_points_to_consider = 5 Maximum number of starting points for sequence to consider per 100 residues in sequence file.
- positive_gap_penalty = 1 Penalty for missing residues is positive_gap_penalty * gap**2
- negative_gap_penalty = 2 Gap penalty for extra residues is negative_gap_penalty * gap**2
- max_gap_length = 2 Maximum gap length for adjacent alignments
- end_overlap = 2 Maximum end overlap of fragments
- minimum_alignment_length = 5 Minimum length of an alignment
- test_lengths = 10 20 Lengths of segments to be tested for sequence matching.
- macro_cycles = 5 Refinement macro-cycles
- trim_from_ends = None You can take final fragments and trim them N residues in from each end.
- residue_groups = "GVASCTI P LDNEQM KR YFH W" Optional groups of residues to score together. If RNA or DNA groups are not allowed.
- score_by_residue_groups = None Use residue groups in sequence alignment and listing of optimal sequences. Default is True unless no sequence is supplied.
- trim_models = False Trim models to break at residues with low discrimination. Default is False.
- maximum_length = 50 If segment is longer than this, try to cut it into pieces, splitting at residues that are poorly-defined. Assumption is very long segments probably have register shifts. Only applies if trim_models = True
- split_at_unknown = None Split chains at positions where residue is unknown (X). Default is True unless fix_insertions_deletions = True or score_with_optimal_sequence = True
- keep_split_together = True Apply penalty if chains that are split at unknown are not close in sequence
- vary_minimum_cc = False Vary the minimum_cc in optimizing percent to reject. Alternative is vary_minimum_z_score.
- percent_to_reject = 7.5 Adjust minimum_cc and minimum_z_score to reject this percentage of residues.
- minimum_z_score = 0.5 Minimum residue Z-score for a residue (or residue groups if score_by_residue_groups is set) to keep the choice. Otherwise report residue as X.
- minimum_cc = None Minimum map-model correlation for a residue (or residue groups if score_by_residue_groups is set) to keep the choice. Otherwise report residue as X.
- minimum_discrimination = 0.01 Minimum discrimination between lowest and highest probability residue (or residue groups if score_by_residue_groups is set) to keep the choice. Otherwise report residue as X.
- random_sequences = 20 Number of random sequences of each length to use as baseline
- use_default_sequence = False Base the random sequences on the default sequence. Alternative is to base them on the actual sequence
- remove_clashes = True Truncate clashing side chains on output
- symmetry = ANY Optional symmetry used in reconstruction. For example D7, C3, C2 I (icosahedral), T (tetrahedral), or ANY (try everything and use the highest symmetry found). Not needed if symmetry is supplied.
- include_helical_symmetry = False Include search for helical symmetry if symmetry = ALL
- run_assign_sequence = False Run assign sequence to identify sequence of multiple fragments and connect them.
- improper_symmetry = False After any proper symmetry is found, take new unique part of map and search for any symmetry at all with find_ncs_from_density. Useful if your molecule has both proper reconstruction symmetry such as C3 and within the reconstruction unit there is local symmetry which can be of any kind. Final result is just the unique part of the map.
- separate_chains = True Write each segment as a separate chain with unique chain ID
- null_score = -2 Score for a segment that is not placed
- n_direct = 10 Extreme values for N<= n_direct calculated directly
- distance_cutoff = None Distance cutoff (CA-CA or P-P) for chain breaks. Large value used to prevent breaking chain due to big gaps.
- min_ncs_cc = 0.80 Minimum Symmetry CC to keep operators when identifying automatically
- rebuilding
- split_input_model = True Input model will be split into segments
- refine = True Refine all-atom models at end of procedure
- refine_b = None Refine B-values
- refine_cycles = 1 Refinement cycles (except final cycles). Typical is 1 for quick or superquick and 5 for thorough building
- trace_and_build_min_segment_length = 15 Minimum segment length in trace_and_build
- control
- multiprocessing = *multiprocessing sge lsf pbs condor pbspro slurm Choices are multiprocessing (single machine) or queuing systems
- queue_run_command = None run command for queue jobs. For example qsub.
- nproc = 1 Number of processors to use
- random_seed = 171731 Random seed
- verbose = False Verbose output
- quick = True Quick run. Suitable for most cases. Use quick = False to obtain maximum sensitivity by setting build_type = trace_and_build_two_cycles simulated_annealing_tries = 32
- superquick = False Very quick run
- ignore_symmetry_conflicts = False You can ignore the symmetry information (CRYST1) from coordinate files. This may be necessary if your model has been placed in a box with box_map for example.
- max_dirs = 1000 Maximum number of directories (sequence_from_map_xxx)
- em_side_density = False Use EM side chain density. Alternative is to use standard x-ray side chain templates in sequence assignment
- em_side_density_in_tnb = False EM side chain density in trace_and_build ( if em_side_density is True)
- skip_temp_dir = False Do not create or use a temp_dir
- clean_up = False Clean up and remove temporary sequence_from_map directory after finishing
- guiGUI-specific parameter required for output directory