Fitting loops to fill in gaps in models with fit_loops
- fit_loops: Tom Terwilliger
fit_loops is a tool for building a loop into density to connect
existing chain ends. You supply a model with a gap and a sequence file
and coefficients for an electron density map, and you specify the first
and last residues to be built. Then fit_loops will attempt to build the
loop that you specify. One loop can be done at a time with fit_loops
(but if you have multiple identical chains, you can fit them all at
once).
You can use either of two methods to fit the loops. By default
fit_loops uses resolve chain extension to try to trace residues from
the ends of segments in your input PDB file. If it can connect the
segments, it writes out the connecting loops. Alternatively you can use
a loop library supplied with PHENIX to connect ends of segments from
your input PDB file.
If you want a more complete model-building process, then you will want
to use phenix.autobuild.
fit_loops can be run from the command line or from the PHENIX GUI.
fit_loops calculates a map based on the supplied map coefficients, then
tries to extend the ends of the supplied model into the gap region,
following the electron density in the map.
model_with_loops.pdb: The output from fit_loops is a new PDB file
containing your input model with the newly-built loop inserted into it
(if a loop could be found).
A typical command-line input would be:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=37 end=43 chain_id=None
This will fit a loop starting with residue 37 and ending with residue 43
in nsf_gap.pdb. phenix.fit_loops will expect that your existing
nsf_gap.pdb model has a chain ending at residue 36 and another starting
at residue 44. Because chain_id=None in this example, if nsf_gap.pdb
contains multiple chains (for example A, B and C), the gap will be
filled in all of them.
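The relationship between a gap and its flanking residues can be sketched in Python. This is an illustrative helper only, not part of PHENIX: the function name find_gaps and its input (the residue numbers present in one chain) are assumptions made for the example.

```python
def find_gaps(residue_numbers):
    """Return (start, end) for each gap in one chain.

    start and end are the first and last MISSING residues, which is the
    convention used for the start= and end= keywords.
    """
    present = sorted(set(residue_numbers))
    gaps = []
    for before, after in zip(present, present[1:]):
        if after - before > 1:
            # Residues before+1 .. after-1 are absent from the model.
            gaps.append((before + 1, after - 1))
    return gaps


# Hypothetical chain with residues 30-36 and 44-50 present.
chain = list(range(30, 37)) + list(range(44, 51))
for start, end in find_gaps(chain):
    print(f"start={start} end={end}")
```

For the hypothetical chain above, the only gap found is the one between residues 36 and 44, so the values to pass are start=37 and end=43.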
If you want (or need) to specify the column names from your mtz file,
you will need to tell fit_loops what FP and PHIB (and optionally FOM)
are, in this format:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=37 end=43 chain_id=None \
labin="FP=2FOFCWT PHIB=PH2FOFCWT"
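If you launch fit_loops from a script, the labin value has to reach the program as a single argument even though it contains a space. The following Python sketch shows one way to assemble the arguments; the helper name fit_loops_args is hypothetical, and only keywords that appear in the commands above are used.

```python
import shlex


def fit_loops_args(pdb_in, mtz_in, seq_file, start, end, labin=None):
    """Return the argument list for one phenix.fit_loops run.

    labin maps column roles (FP, PHIB, optionally FOM) to MTZ column labels.
    """
    args = [
        "phenix.fit_loops",
        f"pdb_in={pdb_in}",
        f"mtz_in={mtz_in}",
        f"seq_file={seq_file}",
        f"start={start}",
        f"end={end}",
    ]
    if labin:
        # All role=label pairs go into ONE argument, separated by blanks.
        pairs = " ".join(f"{role}={label}" for role, label in labin.items())
        args.append(f"labin={pairs}")
    return args


args = fit_loops_args(
    "nsf_gap.pdb", "map_coeffs.mtz", "nsf.seq", 37, 43,
    labin={"FP": "2FOFCWT", "PHIB": "PH2FOFCWT"},
)
# shlex.join quotes the labin argument so a shell treats it as one word.
print(shlex.join(args))
```

The list form can be handed to subprocess.run directly; the quoted string printed at the end is only needed when pasting the command into a shell.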
If you want to try to fit a loop with poor density, you might want to
lower the threshold for the correlation of density in the loop (the
default minimum correlation is 0.2):
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=37 end=43 chain_id=None \
loop_cc_min=0.1
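To give a feeling for what a correlation cutoff means, here is a minimal Python sketch. It assumes a plain Pearson correlation between model-calculated density and map density sampled at the same points, and an inclusive comparison against the cutoff; how fit_loops actually samples and scores the density is not described here, so treat both as assumptions. The function names are hypothetical.

```python
import math


def correlation(model_density, map_density):
    """Pearson correlation between two equally long lists of density values."""
    n = len(model_density)
    mean_model = sum(model_density) / n
    mean_map = sum(map_density) / n
    cov = sum((a - mean_model) * (b - mean_map)
              for a, b in zip(model_density, map_density))
    var_model = sum((a - mean_model) ** 2 for a in model_density)
    var_map = sum((b - mean_map) ** 2 for b in map_density)
    return cov / math.sqrt(var_model * var_map)


def passes_cutoff(cc, loop_cc_min=0.2):
    """Assumed acceptance rule: keep a loop whose correlation reaches the cutoff."""
    return cc >= loop_cc_min


# A loop scoring 0.15 fails the default cutoff but passes loop_cc_min=0.1.
print(passes_cutoff(0.15), passes_cutoff(0.15, loop_cc_min=0.1))
```

Lowering the cutoff admits loops that agree less well with the map, so loops accepted this way deserve a careful visual check.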
To use the loop library in PHENIX, use the keyword loop_lib:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=37 end=39 chain_id=None loop_lib=True
This will fit a loop starting with residue 37 and ending with residue
39. Currently the maximum loop length in the loop library is 3 residues.
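Before requesting the loop library it can help to check that the gap is short enough. The Python sketch below does only that arithmetic; the names are hypothetical and the limit of 3 is taken from the statement above.

```python
MAX_LOOP_LIB_LENGTH = 3  # current loop library limit stated in the text


def loop_length(start, end):
    """Number of residues in a gap whose first and last missing residues are given."""
    return end - start + 1


def fits_loop_library(start, end, max_length=MAX_LOOP_LIB_LENGTH):
    """True if the gap is non-empty and no longer than the library limit."""
    return 1 <= loop_length(start, end) <= max_length


print(fits_loop_library(37, 39))  # 3 residues
print(fits_loop_library(37, 43))  # 7 residues
```

A gap such as 37-43 is too long for the loop library and is a case for the default chain-extension method instead.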
To use the loop library in PHENIX and try to connect any pair of
segments that have an appropriate geometrical relationship, use the
keyword connect_all:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq connect_all=True
This will go through all pairs of segments, trying to connect them with
a loop from the PHENIX loop library. Note that this is a last-resort
approach. Normally you should use the default instead and let fit_loops
connect segments that are close in sequence.
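The idea of screening all pairs of segments can be sketched in Python. This is not the fit_loops algorithm: it assumes that a pair is a candidate when the last CA of one segment lies within dist_max (default 15 in the keyword list) of the first CA of another, and the data layout and function name are invented for the example.

```python
import math
from itertools import permutations

DIST_MAX = 15.0  # default dist_max from the keyword list


def candidate_connections(segments, dist_max=DIST_MAX):
    """Return (from, to, distance) for ordered segment pairs that are close enough."""
    found = []
    for (name_a, seg_a), (name_b, seg_b) in permutations(segments.items(), 2):
        # Distance from the C-terminal CA of one segment to the
        # N-terminal CA of the other (assumed screening criterion).
        distance = math.dist(seg_a["last_ca"], seg_b["first_ca"])
        if distance <= dist_max:
            found.append((name_a, name_b, distance))
    return found


# Hypothetical segment ends, coordinates in Angstroms.
segments = {
    "s1": {"first_ca": (0.0, 0.0, 0.0), "last_ca": (10.0, 0.0, 0.0)},
    "s2": {"first_ca": (18.0, 0.0, 0.0), "last_ca": (30.0, 0.0, 0.0)},
    "s3": {"first_ca": (60.0, 0.0, 0.0), "last_ca": (70.0, 0.0, 0.0)},
}
for name_a, name_b, distance in candidate_connections(segments):
    print(f"{name_a} -> {name_b}: {distance:.1f}")
```

Because every ordered pair is examined regardless of sequence, connections that are geometrically plausible but wrong in sequence are possible, which is why this mode is a last resort.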
To specify a particular gap for loop fitting, you can say:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=37 end=39 chain_id=A
Note that the residue numbers for start and end are the first and last
residues in the gap, not the flanking residues that are already present.
This command fits a gap present at residues 37-39. Note: if any of the
residues from 37-39 are still present in your model, specify
remove_loops=True and fit_loops will remove them (otherwise it will just
fit the existing gap).
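A quick way to find out whether the removal keyword described above is needed is to list which residues in the requested range are still in the model. The Python helper below is a hypothetical illustration that works on a plain list of residue numbers for one chain.

```python
def residues_still_present(residue_numbers, start, end):
    """Return the residues inside start..end (inclusive) that the model contains."""
    return sorted(r for r in set(residue_numbers) if start <= r <= end)


# Hypothetical chain: 30-37 and 40-50 present, so only 38-39 are truly missing.
chain = list(range(30, 38)) + list(range(40, 51))
leftover = residues_still_present(chain, 37, 39)
if leftover:
    print("Residues to remove before fitting 37-39:", leftover)
else:
    print("Gap 37-39 is already empty")
```

In this example residue 37 is still present, so the requested range 37-39 does not match the existing gap and the existing residue would have to be removed first.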
One of the most common problems is that the residues just before and
after a gap are themselves poorly fitted. A solution is to remove the
last couple of residues before the loop and the first couple of residues
after the loop, and then to try to fit the resulting larger gap. To do
this you can say something like:
phenix.fit_loops pdb_in=nsf_gap.pdb mtz_in=map_coeffs.mtz \
seq_file=nsf.seq start=35 end=41 chain_id=A replace_residues=True
Here the replace_residues=True command is required to tell fit_loops to
remove residues 35 and 36 and residues 40 and 41 before trying to fit
residues 35-41.
You can also accomplish this by leaving the parameter
trim_ends_if_needed at its default value of True, which trims the ends
automatically (trying several possible trimmings of the ends).
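The notion of trying several trimmings can be illustrated by enumerating widened gaps. The Python sketch below is an assumption about what such a search might look like, not a description of what fit_loops does internally: the function name, the limit of two residues per end, and the ordering (least trimming first) are all chosen for illustration.

```python
def trimmed_gaps(start, end, max_trim=2):
    """List gaps obtained by removing 0..max_trim residues from each flanking end.

    Candidates are ordered so that the least trimming is tried first.
    """
    trims = [(left, right)
             for left in range(max_trim + 1)
             for right in range(max_trim + 1)]
    trims.sort(key=lambda lr: (lr[0] + lr[1], lr[0]))
    # Trimming the N-terminal side lowers start; trimming the C-terminal
    # side raises end.
    return [(start - left, end + right) for left, right in trims]


candidates = trimmed_gaps(37, 39)
for new_start, new_end in candidates:
    print(f"start={new_start} end={new_end}")
```

The last candidate for the gap 37-39 is 35-41, which corresponds to the manual example above in which two residues are removed on each side.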
- input_files
- pdb_in = None Model with gap to fill.
- mtz_in = None MTZ file with coefficients for a map
- mtz_in_virtual = None Used internally
- map_coeff_labels = None If map coefficients cannot be identified automatically from your MTZ file, you can specify the label or labels for them. (Please separate labels with blank space; MTZ columns grouped together are separated by commas with no blanks.) You can specify: map coefficient labels (e.g., FWT,PHIFWT), amplitudes and phases (e.g., FP,SIGFP PHIB), or amplitudes, phases and weights (e.g., FP,SIGFP PHIB FOM).
- labin = "" Labin line for MTZ file with map coefficients.
Normally use map_coeff_labels instead. This is available
for backward compatibility. You can specify:
LABIN FP=myFP PHIB=myPHI FOM=myFOM
where myFP is your column label for FP, and likewise for PHIB and FOM.
- map_in = None CCP4 or MRC-style map file
- seq_file = None sequence file (1-letter code, one copy of each chain)
- seq_prob_file = None File seq_prob.dat from resolve sequence alignment
- output_files
- pdb_out = connect.pdb Output PDB file (will be missing if no result).
- results_as_json = None Output result summary as JSON
- log = None Output logfile
- params_out = fit_loops_params.eff Parameters file to rerun fit_loops
- fitting
- min_acceptable_prob = None Minimum loop probability to consider
- refine_loops = True Refine fitted loops in loop_lib
- chain_id = None Chain ID for chain containing missing loops.
If None any chain ID is allowed.
All missing segments matching chain_id, start and end will be fit.
If chain_id is specified without start and end, all gaps in that
chain will be fit. If no chain_id and no start and end are
specified, all gaps are fit.
- start = None Starting residue number of loop(s) to be fit.
NOTE: This is the residue number of the first residue that is NOT
present in your model already (one more than the residue number of the
residue just before the gap).
If None any start is allowed.
All missing segments matching chain_id, start and end will be fit
- end = None Ending residue number of loop(s) to be fit.
NOTE: This is the residue number of the last residue that is NOT
present in your model already (one fewer than the residue number of the
residue just after the gap).
If None any end is allowed.
All missing segments matching chain_id, start and end will be fit
- insert_or_delete_residues = None Try to insert or delete this many residues in the loop (ignore loop sequence if so and use any side chain
- remove_loops = True Remove existing residues and replace with new loop
All segments matching chain_id, start and end will be fit
- skip_trim = True If skip_trim=True (default) then model with added loops will be written out without checking for overlaps with non-sequence-aligned residues. If skip_trim=False then this check will be carried out. Note that skip_trim=False can cause some residues to be de-assigned from sequence if they cannot be successfully matched to density. In such cases you might try skip_trim=True.
- score_min = 1.0 Minimum connection score for connect_all_segments
- min_dist = 3. Minimum distance between connections for connect_all_segments
- min_log_prob = -5. Minimum log(P) for sequence to consider location
- skip_seq_prob = False Skip sequence probability calculation
- save_acceptable_loops = False Just return a file with possible loops
- n_random_loop = 200 Number of tries at building loops
- connect_all_segments = False Try to connect all segments to each other, regardless of sequence and residue numbers. Note: this is a last-resort approach. Normally just use the default and let fit_loops connect segments that are close in sequence.
- sequential_only = False Only connect adjacent segments in connect_all_segments
- dist_max = 15 Maximum CA-CA distance in connect_all_segments
- ignore_sequence_register = False Ignore the input sequence register
- all_assigned = True Assume all residues in model can be assigned to sequence. If you set all_assigned=False then any residues with residue number greater than the number of residues in the longest sequence in the sequence file are assumed not to be assigned to sequence. This is useful if you are using resolve model-building, as resolve will identify non-sequence-aligned residues with high residue numbers.
- sequence_offset = None You can specify that fit_loops should offset input sequence file residue numbers by sequence_offset before use (same effect as adding sequence_offset*X to the beginning of the sequence file). Note: the number of sequence_offset values must match the number of chains in the sequence file.
- loop_cc_min = 0.2 Minimum loop map-model correlation
- aggressive = False Aggressive loop building (risky)
- target_insert = None You can try to force the number of residues to insert with trace_loops. If None, try to fill in the gap based on the number of missing residues. If set and greater than 0, only accept the number of residues given. If zero, take any length. Not supported.
- time_per_residue = None You can specify how long to try in fitting (sec/residue)
- trim_ends_if_needed = True If a single loop is specified and no loop is found, try trimming back the ends.
- loop_backup_residues = None Number of tries removing one residue at a time from each end of existing ends of loop if no loop is found with initial gap. Default is 4 (1 if quick).
- split_by_gap = True Split up work by gaps and run individually (normally True)
- loop_lib = False Use loop library to fit loops. Only applicable for chain_type=PROTEIN
- standard_loops = True Use standard loop fitting
- trace_loops = False Use loop tracing to fit loops. Only applicable for chain_type=PROTEIN
- refine_trace_loops = True Refine loops (real-space) after trace_loops
- density_of_points = None Packing density of points to consider as possible CA atoms in trace_loops. Try 1.0 for a quick run, up to 5 for a much more thorough run. If None, try value depending on value of quick.
- a_cut_min = None Minimum density (relative to SD of map, normalized for solvent content) for trace_loop points
- max_density_of_points = None Maximum packing density of points to consider as possible CA atoms in trace_loops.
- cutout_model_radius = None Radius to cut out density for trace_loops. If None, guess based on length of loop.
- max_cutout_model_radius = 20. Maximum value of cutout_model_radius to try
- padding = 1. Padding for cut out density in trace_loops
- cut_out_density = True Cut out density for trace_loops
- max_span = 30 Maximum length of a gap to try to fill
- max_c_ca_dist = None Maximum C-CA distance for a connection
- max_overlap_rmsd = 2. Maximum rmsd for 3 residues on each end of loop-lib fit
- max_overlap = None Maximum number of residues from ends to start with (1=use existing ends, 2=one in from ends, etc.). If None, set based on value of quick.
- min_overlap = None Minimum number of residues from ends to start with. (1=use existing ends, 2=one in from ends etc)
- crystal_info
- resolution = 0. high-resolution limit for map calculation
- chain_type = *PROTEIN DNA RNA Chain type (for identifying main-chain and side-chain atoms)
- final_solvent_content = None Solvent content (after cutout if trace_loops used). Normally determined automatically.
- directories
- temp_dir = "temp_dir" Temporary work directory
- create_temp_dir_if_missing = True Create temp_dir if not present
- output_dir = None Output directory where files are to be written
- gui_output_dir = None GUI use only - does not apply to command line version
- control
- verbose = False Verbose output
- quick = False Try to run quickly
- raise_sorry = False Raise sorry if problems
- debug = False Debugging output
- dry_run = False Just read in and check parameter names
- coarse_grid = False Use coarse grid (saves on memory)
- i_ran_seed = None random seed
- resolve_command_list = None Commands for resolve. One per line in the form: keyword value (value can be optional). Examples: coarse_grid; resolution 200 2.0; hklin test.mtz. NOTE: for command-line usage you need to enclose the whole set of commands in double quotes (") and each individual command in single quotes (') like this: resolve_command_list="'no_build' 'b_overall 23' "
- write_run_directory_to_file = None The working directory name is written to this file
- pickled_arg_dict = None Pickled keywords to __init__ are in this file
- nproc = 1 You can specify the number of processors to use
- max_wait_time = 1.0 You can specify the length of time (seconds) to wait when looking for a file. If you have a cluster where jobs do not start right away you may need a longer time to wait. The symptom of too short a wait time is 'File not found'
- wait_between_submit_time = 1.0 You can specify the length of time (seconds) to wait between each job that is submitted when running sub-processes. This can be helpful on NFS-mounted systems when running with multiple processors to avoid file conflicts. The symptom of too short a wait_between_submit_time is File exists:....
- background = None Run jobs in background or not (if nproc is greater than 1). Usually set automatically: True if run_command is sh or csh.
- run_command = "sh " Command for running jobs (e.g., sh or qsub )
- non_user_params
- print_citations = True Print citation information at end of run
- gui GUI-specific parameters, not used on command line
- result_file = None
- job_title = None Job title in PHENIX GUI, not used on command line
- output Duplicate of output scope, needed because fit_loops constructs its own parameters in __init__ with params=self.process_params
- target_output_format = *None mmcif pdb
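The descriptions of chain_id, start and end in the fitting scope above say that a value of None allows anything and that every gap matching all three is fit. That selection rule can be written out as a small Python sketch; the function name and the (chain, start, end) tuples are hypothetical and only illustrate the matching logic, not how fit_loops stores gaps.

```python
def select_gaps(gaps, chain_id=None, start=None, end=None):
    """Return the gaps matching chain_id, start and end; None acts as a wildcard."""
    selected = []
    for gap_chain, gap_start, gap_end in gaps:
        if chain_id is not None and gap_chain != chain_id:
            continue
        if start is not None and gap_start != start:
            continue
        if end is not None and gap_end != end:
            continue
        selected.append((gap_chain, gap_start, gap_end))
    return selected


# Hypothetical model with the same gap in chains A and B and a second gap in A.
gaps = [("A", 37, 43), ("B", 37, 43), ("A", 80, 85)]

print(select_gaps(gaps, start=37, end=43))  # same gap in every chain
print(select_gaps(gaps, chain_id="A"))      # all gaps in chain A
print(select_gaps(gaps))                    # all gaps
```

The three calls correspond to the three cases in the chain_id description: start and end without a chain, a chain without start and end, and no selection at all.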