Rapid fitting of secondary structure to a map with find_helices_strands

Author(s)
Purpose
Running the program
Methods
Using find_helices_strands to bootstrap phenix.autobuild
Command-line use
Specific limitations and problems
References
List of all available keywords

Author(s)

find_helices_strands: Tom Terwilliger

PULCHRA: Piotr Rotkiewicz

Purpose

find_helices_strands is a command line tool for finding helices and strands in a map and building an model of the parts of a structure that have regular secondary structure. It can be used for protein, RNA, and DNA. An option is to use a rapid chain-tracing algorithm to build CA of proteins, followed by reconstruction of a full model.

Running the program

The program is available in the Phenix GUI under "Model building". The only required input is an electron density map, such as the output of density modification. A sequence file is optional.

The default behavior is to search for protein helices and strands, but you may instead look for nucleic acid helices, or use the trace_chain method (describe below) to find C-alpha positions independently of secondary structure motifs. Using the default options without a sequence, the model will be a poly-alanine backbone; with the trace_chain method it will be a C-alpha trace, but you may optionally have PULCHRA (Rotkiewicz & Skolnick 2008) complete the backbone.

The program usually only requires a few minutes at most to run. Output will be to a new directory, with a single PDB file containing a partial model.

Methods

How find_helices_strands finds helices and strands in maps:

find_helices_strands first identifies helical segments as rods of density at 5-8 A. Then it identifies helices at higher resolution keeping the overall locations of the helices fixed. Then it identifies the directions and CA positions of helices by noting the helical pattern of high-density points offset slightly along the helix axis from the main helical density (as used in "O" to identify helix direction). Finally model helices are fit to the density using the positions and orientations identified in the earlier steps. A similar procedure is used to identify strands. Then the helices and strands are combined into a single model.

How find_helices_strands finds RNA and DNA helices in maps:

find_helices_strands finds RNA and DNA helices differently than it finds helices in proteins. It uses a convolution search to find places in the asymmetric unit where an A-form RNA or B-form DNA helix can be placed. These are assembled into contiguous helical segments if possible. The resolution of this search is 4.5 A if you have resolution beyond 4.5 A, and the resolution of your data otherwise.

How trace_chain finds CA positions in maps:

The RESOLVE trace_chain algorithm places dummy atoms down the middle of all the tubes of density in a map, then it attempts to find sets of these atoms that may be CA atoms, where the atoms are spaced by 3.8 A and where there is strong density between each pair. This yields segments represented by CA atoms. Next PULCHRA (Rotkiewicz & Skolnick 2008) can optionally be used to reconstruct a full main-chain model. Finally RESOLVE is used to assemble all the resulting fragments into a model.

Using find_helices_strands to bootstrap phenix.autobuild

If you run phenix.autobuild at low resolution (3.5 A or lower) then your model may have strands built instead of helices. You can use find_helices_strands to help bootstrap autobuild model-building by providing the helical model from find_helices_strands to phenix.autobuild. Just run phenix.find_helices_strands with your best map map_coeffs.mtz. Then take the helical model pass it to phenix.autobuild with the keyword (in addition to your usual keywords for autobuild):

consider_main_chain_list=map_coeffs.mtz_helices.pdb

Then the AutoBuild wizard will treat your helical model just like one of the models that it builds, and merge it into the model as it is being assembled.

Command-line use

From the command-line you can type:

phenix.find_helices_strands map_coeffs.mtz quick=True

If you want a more thorough run, then skip the "quick=True" flag. If you want (or need) to specify the column names from your mtz file, you will need to tell find_helices_strands what FP and PHIB are, in this format:

phenix.find_helices_strands map_coeffs.mtz \
  labin="LABIN FP=2FOFCWT PHIB=PH2FOFCWT"

If you want to specify a sequence file, then in the last step find_helices_strands will try to align your sequence with the map and model:

phenix.find_helices_strands map_coeffs.mtz seq_file=seq.dat

If you want to use the trace_chain algorithm, then specify:

phenix.find_helices_strands map_coeffs.mtz seq_file=seq.dat trace_chain=True

Here is an example using data from the PHENIX examples library:

phenix.find_helices_strands $PHENIX/phenix_examples/p9-build/p9-resolve.mtz \
  labin="FP=FP PHIB=PHIM FOM=FOMM" trace_chain=True

That should build a model using sample data in a few seconds.

Specific limitations and problems

References

Fast procedure for reconstruction of full-atom protein models from reduced representations. P. Rotkiewicz, and J. Skolnick. J Comput Chem 29, 1460-5 (2008).

Rapid model building of alpha-helices in electron-density maps. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 66, 268-75 (2010).

Rapid model building of beta-sheets in electron-density maps. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 66, 276-84 (2010).

Rapid chain tracing of polypeptide backbones in electron-density maps. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 66, 285-94 (2010).

List of all available keywords

use_pdb_in_directly = False Use PDB input model directly
output_dir = None Output directory
seq_file = None Sequence file for sequence alignment
compare_file = None PDB file for comparison only
res_convolution = 4.5 high-resolution limit for convolution calculation. (Applies to nucleic acids only)
temp_dir = "temp_dir" Optional temporary work directory
helices_only = False Find only helices (Applies only if trace_chain=False)
strands_only = False Find only strands (Applies only if trace_chain=False)
trace_chain = False Use resolve trace_chain algorithm
helices_before_trace = None Find helices before running resolve trace_chain algorithm
strands_before_trace = None Find strands before running resolve trace_chain algorithm
pulchra = False Use PULCHRA to create all-atom model from CA model
reverse_if_better = True Reverse chain direction in trace_chain if pulchra=True and reversed chain has improved geometry.
remove_side = False Try to remove density for side chains in trace_chain
rho_cut_min = None Minimum rho/sigma at potential CA positions
rat_pair_min = None Target minimum ratio of rho/sigma at midpoint between potential CA positions to the average at the two CA positions
minimum_length_angstroms_helices_strands = None Minimum length (A CA start - CA end) for helix/strand to keep
dist_ca_target = 3.8 Target CA-CA distance
dist_ca_sep = 4.0 Max separation of CA atoms
dist_ca_tol_max = None Maximum tolerance for CA-CA distances. Normally 0.8 A for medium and 1.3 A for quick
dist_ca_start = None Guess of tolerance for CA-CA distances. Set automatically by default.
dist_ca_tol = None Tolerance for CA-CA distances. Normally set automatically Typical values are 0.3-1.0 A A high number can be used to force a more thorough search. Compare with target_p_ratio which adjusts CA-CA tolerance to achieve a targeted ratio of nonamers to atoms.
cutoff_trace = 0.000 The top cutoff_trace fraction of peaks in trace_chain will be assumed to be non-protein, and all peaks near them will be ignored. (Default = 0.00)
ncut_trace_min = 0 The top ncut_trace_min peaks in trace_chain will be assumed to be non-protein, and all peaks near them will be ignored. (Default = 0)
target_p_ratio = None Target ratio of nonamers found to atoms in a.u default=3 for quick, 4 otherwise. This can be used as an alternate method to adjust the thoroughness of trace_chain searches. Differs from setting dist_ca_tol by adjusting CA-CA tolerance to achieve the desired target ratio, while dist_ca_tol=tol sets the tolerance directly to tol
trace_ratio_long = 0.5 When setting tolerances for CA-CA distances in trace_chain the upper bound will increase trace_ratio_long as fast as the lower bound. (Default=0.5)
ratio_trace_extra = None Minimum ratio of dist between extra atoms added in trace_chain to rad_sep_trace Default is 1.25 for standard and 1.5 for quick
rad_sep_trace = None Dummy atom separation in trace_chain Default is 0.6 A for standard and 0.75 for quick Increased if resolution is greater than 3 A Value of rad_mask_trace in resolve will be rad_sep_trace*2
atom_target_ratio = 1.0 Ratio of target number of atoms to estimated actual number
n_atoms_total_scale = 1.0 Ratio of estimated atoms in au to standard estimate
fill_gaps = True Try to fill in gaps in trace_chain
target_angle = None Target angle for CA-CA-CA or P-P-P
use_any_side = False Use any side chain that fits density in assembly
cc_helix_min = None Minimum CC of low-res helical density to map to keep.
group_ca_length = None Minimum length of a segment of helix or strand to keep. (only applies if trace_chain=False)
cc_strand_min = None Minimum CC of strand density to map to keep. (only applies if trace_chain=False)
quick = False Try to find structure quickly
target_time_for_trace_chain = 15 Target length of time to work on trace_chain (sec)
recycle = False Recycle CA positions in trace_chain
optimize = True Try to optimize CA-CA tolerance in trace_chain to obtain nonamer ratio equal to target_p_ratio
assemble = False Assemble model with resolve after trace_chain
resolve_size = 12 Size of resolve to use. You may need a bigger size than in other resolve applications
coarse_grid = False Coarse_grid allows the use of a smaller resolve size
verbose = False Verbose output
raise_sorry = False Raise sorry if problems
debug = False Debugging output
dry_run = False Just read in and check parameter names
mtz_label_prefix = None Prefix for column names, used by Coot. The GUI will detect and set this automatically. It should not be changed directly by users.
delete_tmp_dir = False GUI setting, does not apply to command-line version
job_title = None Job title in PHENIX GUI, not used on command line
input_files
- mtz_in = None MTZ file with coefficients for a map
- map_coeff_labels = None If map coefficients cannot be identified automatically from your MTZ file, you can specify the label or labels for them. (Please separate labels with blank space, MTZ columns grouped together separated by commas with no blanks.) You can specify: map_coeff_labels (e.g., FWT,PHIFWT) amplitudes and phases (e.g., FP,SIGFP PHIB) or amplitudes, phases, weights (e.g., FP,SIGFP PHIB FOM)
- labin = "" Labin line for MTZ file with map coefficients. Normally use instead map_coeff_labels. This is available for backward compatibility. You can specify: LABIN FP=myFP PHIB=myPHI FOM=myFOM where myFP is your column label for FP
- map_in = None CCP4 or MRC-style map file
- pdb_in = None Optional PDB file to be used for seeding trace_chain with CA
output_files
- target_output_format = *None pdb mmcif Desired output format (if possible). Choices are None ( try to use input format), pdb, mmcif. If output model does not fit in pdb format, mmcif will be used. Default is pdb.
- output_model = None Output PDB file
- output_log = None Output log file name. If you want to specify a directory to put this file in then please use "output_dir=myoutput_dir"
- gui_output_dir = None GUI parameter only
crystal_info
- resolution = None high-resolution limit for map calculation
- chain_type = *PROTEIN DNA RNA Chain type (for identifying main-chain and side-chain atoms)
- solvent_fraction = 0.5 You can specify the solvent fraction