Sequence assignment and linkage of neighboring segments with assign_sequence
Author(s)
- assign_sequence: Tom Terwilliger
 
Purpose
You can now carry out an improved sequence assignment of a model that
you have already built with phenix.assign_sequence. Further, once the
sequence has been assigned, this method will use the sequence and
proximity to identify chains that should be connected, and it will
connect those that have the appropriate relationships using the new loop
libraries available in phenix.fit_loops. The result is that you may be
able to obtain a more complete model with more chains assigned to
sequence than previously.
assign_sequence is a command line tool for reanalyzing resolve sequence
assignment for a model and a map including the non-crystallographic
symmetry, exclusion of sequence by previously-assigned regions, and
requirement for plausible distances and geometries between ends of
fragments with assigned sequences. Additionally assign_sequence will
use the fit_loops loop library to connect segments that are separated
by a short loop.
Note: assign_sequence is designed to be used after resolve
model-building in which residues that are not assigned to sequence are
given residue numbers higher than any residue in the input sequence
file. If you input a model not built by resolve or in phenix, or if you
would like to completely redo the sequence assignment for your model, be
sure to set "allow_fixed_segments=False".
NOTE: assign_sequence is normally called from phenix.phase_and_build
but you can run it interactively if you want.
 
Usage
How assign_sequence works:
The starting point for assign_sequence is a set of segments of
structure read in from the input model. assign_sequence then uses
resolve to calculate the compatibility of each possible side chain with
each residue in each segment. Then assign_sequence tests out possible
combinations of alignments of all the segments in the input model and
chooses the set of alignments that is most compatible with the density
map, the number of NCS copies, and with the geometries and distances
between ends of the segments.
Sequence probabilities:
assign_sequence uses the side-chain to map compatibility matrix
calculated by resolve to assess the relative probabilities of each
possible side chain at each position in the input model. Segments that
are positively assigned to a sequence by resolve are (by default)
maintained and used as anchors for further sequence assignment. All
other segments have a relative probability associated with each possible
alignment of the segment to the input sequence. The score for each
alignment is the logarithm of this probability (essentially a
log-likelihood LL score).
Connection scores:
Any pair of segments with some assignment of sequence to each segment
has an additional score corresponding to the plausibility of a
connection of the expected length existing between the segments. If the
distance between ends is greater than can be bridged by the number of
residues separating them, then the connection is not possible. If the
connection is possible, it is scored based on the best density fit (CC)
of a loop from the fit_loops loop library. This additional score is
normally 10*CC.
Generating sequence alignments and connectivities
assign_sequence starts with the segments with the most convincing
assignments of sequence. Often these are those with sequence positively
assigned by resolve; otherwise they are those with the
highest-probability assignments. This yields a starting arrangement
(sequence assignment for a set of segments). Then each possible sequence
assignment of each unassigned segment is tested for compatibility with
the existing arrangement and the one that is most compatible (based on
the connections that would result, duplication of sequence, and
sequence-map matching) is added to the arrangement. Optionally many
arrangements can be built up in parallel, but often a very good one can
be found simply by taking the top one at each step. This process is
repeated until no additional segments can be added to the arrangement to
yield an increase in log-likelihood score of (by default) 2 or greater.
NCS copies:
assign_sequence builds up a set of possible sequence assignments and
connectivities that depends on the expected number of copies in the
asymmetric unit of the crystal. If there is only one copy of the
molecule in the crystal, then no residues in the sequence can be used
more than once in sequence assignment. If there are N copies, then a
residue can be used up to N times. If there are multiple copies, then
each molecule must be self-consistent, with plausible distances and
geometries relating each segment to the next.
Connecting segments:
Once a final arrangement is found, including NCS if applicable, all
segments that are separated by short loops (typically 0-3 residues) are
connected using loops from the fit_loops loop library. This yields
longer segments of structure with sequences fully assigned. The
resulting model then has side chains added to match the newly-assigned
sequence and is written out.
 
Output files from assign_sequence
assign_sequence.pdb: A PDB file with your input model assigned to
sequence (to the extent possible). Residues not assigned to sequence
will be given a chain ID higher than those assigned, and they will be
given residue numbers higher than any residue number in the sequence
file.
 
Examples
Standard run of assign_sequence:
Running assign_sequence is easy. From the command-line you can type:
phenix.assign_sequence map_coeffs.mtz coords.pdb sequence.dat
If you want (or need) to specify the column names from your mtz file,
you will need to tell assign_sequence what FP and PHIB (and optionally
FOM) are, in this format:
phenix.assign_sequence map_coeffs.mtz coords.pdb \
labin="FP=2FOFCWT PHIB=PH2FOFCWT" sequence.dat
 
Possible Problems
Specific limitations and problems:
Literature
Additional information
List of all available keywords
- input_files- seq_file = None File with 1-letter code sequence of molecule.
              Chains separated by blank line or greater-than sign
- pdb_in = None Optional starting PDB file (ends will be extended if present)
- mtz_in = None MTZ file with coefficients for a map 
- map_coeff_labels = None If map coefficients cannot be identified automatically from your MTZ file, you can specify the label or labels  for them. (Please separate labels with blank space,  MTZ columns grouped together separated by commas with no blanks.) You can specify: map_coeff_labels (e.g., FWT,PHIFWT) amplitudes and phases (e.g., FP,SIGFP PHIB) or amplitudes, phases, weights (e.g., FP,SIGFP PHIB FOM)
- labin = ""  For backward compatibility only.
                 Use instead map_coeff_labels.
                 Labin line for MTZ file with map coefficients.
                 This is optional if assign_sequence
                 can guess the correct coefficients
                 for FP PHIB .  Otherwise specify:
                 LABIN FP=myFP PHIB=myPHI
                 where myFP is your column label for FP.
                 NOTE: myFP and myPHI must be adjacent in the mtz file.
- map_in = None CCP4 or MRC-style map file
- prob_file = None File with sequence probability information from resolve
- ec_hist_file = /home/builder/slave/phenix-docs-ci/modules/solve_resolve/ext_ref_files/ec_hist.pkl File with EC histograms
- ec_dict_file = None Input file with pre-calculated EC scores
- ss_pred_file = None File with secondary structure probability information in the format of raptorX (http://raptorx.uchicago.edu/StructurePropertyPred/predict/)
- linkage_file = None File with linkage information from combine_models
                 Must have been run with the same input pdb file
                 and the same value of min_segment_length
- loop_dict_file = None File with loop information from previous run as pickle file
- pair_object_dict_file = None pair object dict as pickle file
- checked_connections_file = None checked connections dict as pickle file
- density_removed_mtz_in = None MTZ file with density_removed coefficients for a map 
- comparison_model = None Comparison model (normally just for testing)
- ec_file = None EC file (format is residue i residue j EC value)
 
- output_files- target_output_format = *None pdb mmcif Desired output format (if possible). Choices are None ( try to use input format), pdb, mmcif. If output model does not fit in pdb format, mmcif will be used. Default is pdb.
- pdb_out = assign_sequence.pdb Output PDB file
- log = None Output log file
- params_out = assign_sequence_params.eff Parameters file to rerun assign_sequence
- output_loop_dict_file = None loop dict as pickle file
- output_pair_object_dict_file = None pair object dict as pickle file
- output_checked_connections_file = None checked connections dict as pickle file
- output_ec_dict_file = None ec dict as pickle file
 
- assignment- map_all_connections = False Map all connecting parts, even those that are not linked
- ec_scale = None Scale on EC scores (compared to density analysis). If None, adjust to match SD of EC scores to SD of scores from density.
- fraction_ec_inter_chain = 0.1 Fraction of inter-chain EC values to save
- ncs_resolution = None Resolution for NCS identification
- find_ncs = False Try to find NCS in chains after sequence assignment
- range_to_keep = 4.0 Keep solutions with score within range_to_keep of the maximum
- max_keep = 10 Number of possibilities to keep in optimization
- max_write = 1 Number of possibilities to write out at end
- linkage_score = 10. Score for creating a link between segments
- max_linkage_score = 12. Maximum linkage score attainable
- loop_score = 10. Score for a loop is loop_score*(1.+loop_cc)
- max_overlap_rmsd = 2. Maximum rmsd for 3 residues on each end of loop-lib fit
- allowed_overlap = 4 Allowed overlap of segments allowed. At the end if there are overlaps they will be trimmed off.
- length_mismatch_penalty = 0 Decrease in linkage score if linkage is not correct length
- depth_to_keep = 8.0 In full optimization solutions will be kept with score  depth_to_keep + max_linkage_score below the best
- maximum_length_mismatch = 3 Maximum length mismatch in linkages
- min_confidence = 0.9999 Sets required confidence level in a placement
            of a segment to keep the best one. 
- convincing_score = 2. Score gain required to keep a sequence assignment
- starting_convincing_score = 10. Score gain required to keep an initial sequence assignment
- use_ec = None Use EC values in assigning sequence. Requires ec_file.
- use_density = True Use density in assigning sequence. Requires map coefficients
- minimum_inter_residue_difference = None Ignore EC values for residue differences shorter than this
- maximum_inter_residue_difference = None Ignore EC values for residue differences longer than this
- subtract_expected_value = True Subtract ec values for distances associated with intra-residue differences. Alternative is subtract ec values for all distances corresponding to residue differences.
- match_ec_to_pdb = None Align ec to pdb sequence and adjust ec residue numbers. Alternatives are no matching or match_c_to_seq
- match_ec_to_seq = None Align ec to input sequence and adjust ec residue numbers. Alternatives are no matching or match_c_to_pdb. NOTE: sequence file must have only one chain and start at residue 1 unless start_res is specified.
- min_ec_length = 5 Minimum length of match between PDB and EC file
- n_random_score = 20 Number of random distances to average for random scoring
- max_separation_for_bins = 20 Maximum distance (in residues) between residues to consider   in binning EC values
- max_ec_seq_mismatch = 2 Maximum mismatches of sequence in EC database compared to PDB file or sequence file.
- use_cb_for_ec = True use C-beta positions for EC calculations
- minimum_ec = -1000 Minimum EC to consider
- minimum_length = 4 Minimum length of a segment to place
- min_segment_length = 5 Segments shorter than this will be ignored on read-in.
- max_levels = None Number of segments to consider in building a  complete sequence assignment (quick default = 6, otherwise 20)
- max_indiv_tries_per_level = None Number of sequence assignments to consider for each  segment (quick default = 1, otherwise 3)
- max_total_tries_per_level = None Number of sequence arrangements to consider for all  additional segments (quick default = 1, otherwise 6)
- max_placements = 100 Number of placements of any segment to consider
- max_final_placements = 20 Number of final arrangements to consider
- check_ncs_with_offset = True Check to verify that segments that seem to show NCS are
            actually different if offset by 1 residue. If protein is just helices
            then you might need to try check=False
- list_only_complete = False Only include complete arrangements; ignore those that
            have arrangements of some segments that are subsequently
            removed as incompatible
- allow_fixed_segments = True If True, then input segments with sequences assigned
           are kept fixed.
            If fix_known=False, then assigned segments are identified by
            sequence numbers less than or equal to the longest
            segment in the sequence file.
           
- all_are_fixed = False Fix all segments. Note: will fail if multiple chains
- fix_known = False If True, then all input segments with except those marked
            as segid=UNK will be considered fixed. (Instead of identifying
            them based on sequence number in input file).
            Forces allow_fixed_segments to be True.
- keep_connectivity = False This is very useful if you expect your model to have the same
            connectivity as your template. If True, then input segments
            are kept in the order found in the
            input PDB and kept assigned to the original chains, but
            their assignments may change otherwise (allowing
            insertions/deletions). NOTE: If True, then allow_fixed_segments=False
            unless fix_known=True
- include_chain_u_in_keep_connectivity = False If True, chains with chainID of 'U' are included in analysis with keep_connectivity=True
- optimize_arrangements = None Try to optimize arrangements at the end, including removal
                of uncertain segments
- use_connectivity_in_optimize = True If True and keep_connectivity is True and optimize_arrangements is True , then connectivity will be used in chain optimization
- remove_uncertain_segments = True If True, then remove uncertain segments. NOTE: at very end of  iteration, if any, then the removal of uncertain segments is  determined by iteration.remove_uncertain_at_end.
- optimize_sequence_alignment = False If True, then try to align fragments to template sequence Only appropriate if fully assigned model is input  (not likely)
- minimize_alignment_changes = False If True, then try to keep sequence numbers matched to
            original as much as possible.
- allow_longer_connections = True If a connection segment is available, try with lower
                scoring also n+1,n+2,n+3 connections. This may be a
                good idea because loops are often built short by one or a few
                residues.
                
- replace_side_chains = True At the end of sequence assignment identify side-chain
            rotamers and replace existing side-chains
- replace_direct_joins = False Use fit-loops to rebuild all junctions that are joined flush
- short_segment_length = 12 Definition of a short segment
- max_loop_length = 8 Maximum length of loop to try to fit
- n_random_loop = 500 Number of loop versions to build
- max_unassigned_short_segments = 20 Maximum number of segments short_segment_length or fewer  residues that are  not assigned to sequence to consider in connections.  Keeping  too many can make the analysis take a very long time.
- compare_only = False Just compare input model to comparison model
- reset_sequence = False Adjust sequence numbering and sequence of segments in  pdb_in based  on sequence numbers of matching positions in comparison_model. Cannot be used with compare_only.
- make_unique = None Make segments of input model unique (no overlapping) residues. Normally keep this at None. Used in iteration of assign_sequence.
- start_res = None Starting residue number of sequence file. Used only for  compare_only with ec weighting
- ignore_ter_or_break = None Ignore TER/BREAK records in PDB file. Use only residue numbers and chain ID to identify breaks in seq_prob steps
 
- iteration- iterative_assignment = False You can iteratively assign sequence and fit loops This can improve the assignment, but will take longer
- cycles = 3 Cycles of iteration
- start_step = *assign_sequence fit_loops insert_loops get_connections You can decide where to start in the cycle
- end_step = *None assign_sequence fit_loops insert_loops get_connections You can specify last step to carry out in iteration
- skip_step = None You can specify steps to skip in iteration
- assign_sequence_file = None File partially assigned to sequence (output of assign_sequence)
- loops_file = None File with loops to insert
- assign_sequence_insertion_file = None File partially assigned to sequence with insertions
- connections_file_list = None list of files with possible connections to consider
- only_assign_sequence_on_last_cycle = True Just run assign_sequence step on last iteration cycle
- build_outside_model = True Build outside model in get_connections
- build_outside_model_once = True Run build outside model a maximum of one time (Run once if build_outside_model=True, never if  build_outside_model=False)
- remove_uncertain_at_end = False If True, then remove uncertain segments at very end of iteration (in addition to removing them along the way if  remove_uncertain_segments=True )
- connect_all_segments = True Connect all segments sequentially in get_connections
- fill_in_gaps = True After optimize and keep_connectivity, try to fill in gaps
- standard_gap_score = 0. Score for filling default gap
- decrease_tries_with_levels = False Reduce number of tries each level of search
 
- directories- temp_dir = "temp_dir" Optional temporary work directory 
- output_dir = "" Optional output directory
 
- crystal_info- resolution = 0. high-resolution limit for map calculation
- solvent_fraction = 0.5 solvent fraction
- chain_type = *PROTEIN DNA RNA Chain type (for identifying main-chain and side-chain atoms)
- ncs_copies = 1 NCS copies
 
- model_building- dist_max = 20. Maximum distance ends can be apart to consider for linking
- max_loop_lib_gap = 3 Maximum number of residues in working loop library (This must match loop libraries that are available)
 
- control- verbose = False Verbose output
- quick = True Run quickly
- raise_sorry = False Raise sorry if problems
- debug = False Debugging output
- dry_run = False Just read in and check parameter names
- nproc = 1    You can specify the number of processors to use 
- check_wait_time = 1.0    You can specify the length of time (seconds) to wait  between checking for subprocesses to end
- wait_between_submit_time = 1.0    You can specify the length of time (seconds) to wait  between each job that is submitted when running sub-processes. This can be helpful on NFS-mounted systems when running  with multiple processors to avoid file conflicts. The symptom of too short a wait_between_submit_time is File exists:....
- resolve_command_list = None You can supply any resolve command here NOTE: for command-line usage you need to enclose the whole set of commands in double quotes (") and each individual command in single quotes (') like this:  resolve_command_list="'no_build' 'b_overall 23' " 
- background = None run jobs in background or not (if nproc is greater than 1) Usually set automatically. If run_command is sh or csh, True
- run_command = "sh "  Command for running jobs (e.g., sh or qsub )
- resolve_size = 24 Size of resolve to use. 
- em_side_density = False Use EM side chain density
 
- non_user_params- print_citations = True Print citation information at end of run