Removing protein chains built into RNA density with remove_poor_fragments

Author(s)

remove_poor_fragments: Tom Terwilliger

Purpose

This routine takes a model that contains RNA and protein and tries to identify fragments of protein that were accidentally built into RNA density.

Usage

When building models of complexes between protein and RNA, protein is sometimes built accidentally into density that really belongs to RNA. The phenix.remove_poor_fragments tool tries to identify such incorrectly built regions.

How remove_poor_fragments works:

The key idea in this approach is that these incorrectly built regions tend to be isolated (not in the middle of a protein domain) and to have lower map-model correlation than the correct parts of the protein chains. The phenix.remove_poor_fragments tool ranks all protein segments on these criteria and removes the worst ones. The threshold is chosen based on the number of RNA residues expected in the molecule that were not built, the fraction of protein that was built, and the quality of the model, using the formula:

residues_to_remove = A * protein_built * rna_not_built/(cc*protein_present)

where cc is the map-model correlation for the protein part of the model, protein_present is the number of protein residues in the sequence file, protein_built is the number of protein residues built, and rna_not_built is the number of RNA residues in the sequence file minus the number of RNA residues built.

The logic of this formula is that a certain volume, proportional to rna_not_built, is really RNA but might accidentally be built as protein. The number of protein residues that might be built in that volume is higher if more residues of protein have been built and higher if the map-model correlation for the protein model is low. These relationships were seen in an analysis of ribosome structures. A scale factor A of 0.53, relating the optimal number of residues to remove to the other factors, was found empirically.
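
As a worked illustration of the formula above, here is a minimal Python sketch. The function name, the argument names and the input numbers are hypothetical and chosen only for illustration; this is not the code used inside phenix.remove_poor_fragments. Clamping rna_not_built at zero is an assumption added here, not something stated in the description.

```python
def residues_to_remove(protein_built, protein_present,
                       rna_present, rna_built, cc, scale=0.53):
    """Estimate how many protein residues to remove.

    Implements: A * protein_built * rna_not_built / (cc * protein_present)
    with A = scale (0.53 is the empirical default quoted in the text).
    """
    if cc <= 0 or protein_present <= 0:
        raise ValueError("cc and protein_present must be positive")
    # Assumption: never report a negative amount of unbuilt RNA.
    rna_not_built = max(0, rna_present - rna_built)
    return scale * protein_built * rna_not_built / (cc * protein_present)


# Hypothetical complex: 4000 protein residues in the sequence file,
# 3000 of them built; 3000 RNA residues expected, 2500 built; cc = 0.75.
estimate = residues_to_remove(protein_built=3000, protein_present=4000,
                              rna_present=3000, rna_built=2500, cc=0.75)
print(round(estimate))
```

With these made-up inputs the estimate is about 265 residues. Halving cc to 0.375 doubles it, which matches the reasoning that a poorer protein model is more likely to contain protein built into RNA density.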

Output files from remove_poor_fragments

trimmed_pdb.pdb: A PDB file with your trimmed model.

Examples

Standard run of remove_poor_fragments:

Running remove_poor_fragments is easy. From the command-line you can type:

phenix.remove_poor_fragments model_with_protein_and_rna.pdb \
   seq.dat \
   map.ccp4

Possible Problems

Specific limitations and problems:

Literature

Additional information

List of all available keywords

map_file_name = None Map file name
map_coeffs_file = None File with map coefficients (alternative to map file)
map_coeffs_labels = None Optional label specifying which columns of map coefficients to use
model_file_name = None Model file names. Normally only one specified.
output_pdb_file_name = trimmed_pdb.pdb Output model file name
seq_file_name = None Sequence file name. Used to calculate n_protein_in_complex, n_rna_in_complex and rna_ratio
select_good_models = True Select good models from input models and write to output PDB file
cutoff_method = *volume_of_rna_not_built sigma_cutoff Choice of method for deciding how many residues to remove. Volume of RNA not built means calculate the volume in the structure that should be RNA, subtract the volume of RNA actually built, and weight by 1/mean map CC and the scale factor rna_volume_ratio. Sigma cutoff means use a cutoff based on resolution and the ratio of RNA:protein residues, with the parameters toss_sigma_cutoff_a, toss_sigma_cutoff_b and toss_sigma_cutoff_c. Default is Volume of RNA not built.
rna_volume_ratio = 0.53 Scale factor on RNA volume used to decide how many residues to remove.
n_protein_in_complex = None Number of protein residues in complex. Normally obtained from sequence file.
n_rna_in_complex = None Number of RNA residues in complex. Normally obtained from sequence file.
n_protein_in_model = None Number of protein residues in model. Normally obtained from the input model. Note distinction from n_protein_in_complex which is how many there are based on the sequence file.
n_rna_in_model = None Number of RNA residues in model. Normally obtained from the input model. Note distinction from n_rna_in_complex which is how many there are based on the sequence file.
rna_ratio = None Used to estimate optimal cutoff in sigma_cutoff method. Calculated automatically from sequence file. Note that this whole procedure is optimized for cases where the rna_ratio is greater than 0.2.
toss_sigma_cutoff_a = 1.4 For the sigma_cutoff method. At the conclusion of the procedure, the worst portion of chains, with low z-scores, is removed. The cutoff is toss_sigma_cutoff = toss_sigma_cutoff_a + toss_sigma_cutoff_b*(resolution-3.0) + toss_sigma_cutoff_c*(rna_ratio-0.5). You can fix the cutoff at a single value by setting toss_sigma_cutoff_a to that value and setting toss_sigma_cutoff_b and toss_sigma_cutoff_c to zero. To accept a few more chains, increase toss_sigma_cutoff_a by a little (try 0.1 higher).
toss_sigma_cutoff_b = -1.1 For the sigma_cutoff method. At the conclusion of the procedure, the worst portion of chains, with low z-scores, is removed. The cutoff is toss_sigma_cutoff = toss_sigma_cutoff_a + toss_sigma_cutoff_b*(resolution-3.0) + toss_sigma_cutoff_c*(rna_ratio-0.5)
toss_sigma_cutoff_c = -3.3 For the sigma_cutoff method. At the conclusion of the procedure, the worst portion of chains, with low z-scores, is removed. The cutoff is toss_sigma_cutoff = toss_sigma_cutoff_a + toss_sigma_cutoff_b*(resolution-3.0) + toss_sigma_cutoff_c*(rna_ratio-0.5)
resolution = None Resolution
metrics = *mean_delta_rama *map_value *cc fraction_with_neighbor n_residues fraction_favored Metrics for a chain (fragment) being likely to be correct. Mean_delta_rama is mean change in Ramachandran angles between neighboring residues. Map_value is maximum map value at coordinates of a CA atom in chain for low_resolution map (typically 30 A) calculated from the model. Fraction_with_neighbor is fraction of residues in a chain near another chain. N_residues is length of chain. Fraction_favored is fraction of residues with favored Ramachandran angles
low_resolution_map_d_min = 30 Resolution for low-resolution map showing location of molecule
distance_cutoff = 7 In Score for fraction_with_neighbor, a residue in the same chain has to be within distance_cutoff to count as being close to another chain
any_in_series = 4 In score for fraction_with_neighbor (the fraction of residues in a chain that are near another chain), a residue counts as close if any residue in a string of any_in_series residues is near another chain
n_offset = 10 In Score for fraction_with_neighbor, a residue in the same chain has to be separated by n_offset to count as being from another chain
atom_selection = "name ca and (not element Ca)" Specification of atoms to use in fraction_with_neighbor. Normally CA only
top_half_value = 2. The top set of chains (typically about one half) are used to estimate statistics for good chains. Value is 1/top_half_value.
cross_validate = True Cross-validate values of mean/sd for each metric.
optimize_weights = False Optimize weights. Development only. Only applies if model_file_name specifies a model with good chains in a structure, second_model_file_name specifies one with poor chains in the same structure, and weights are to be optimized.
n_cycle = 2 Number of minor cycles of estimation of histograms.
n_big_cycle = 1 Number of big cycles of estimation of histograms.
weight_mean_delta_rama = 1 Weight on mean_delta_rama metric
weight_map_value = 1 Weight on map_value metric
weight_cc = 1 Weight on cc metric
weight_fraction_favored = 1 Weight on fraction_favored metric
weight_n_residues = 1 Weight on n_residues metric
weight_fraction_with_neighbor = 1 Weight on fraction_with_neighbor metric
minimum_number_of_fragments = 10 Minimum number of fragments evaluated with a metric to use that metric.
distance_cutoff_for_split_model = 5. Distance cutoff for splitting chain into fragments.
job_title = None Job title in PHENIX GUI, not used on command line
gui: GUI-specific parameter required for output directory
- output_dir = None
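
The toss_sigma_cutoff expression given under toss_sigma_cutoff_a, toss_sigma_cutoff_b and toss_sigma_cutoff_c can be evaluated by hand or with a small Python sketch such as the one below. The function name and the example resolution and rna_ratio values are hypothetical; the defaults are the keyword defaults listed above. This only reproduces the arithmetic of the documented expression and is not taken from the Phenix source.

```python
def toss_sigma_cutoff(resolution, rna_ratio, a=1.4, b=-1.1, c=-3.3):
    """Evaluate a + b*(resolution - 3.0) + c*(rna_ratio - 0.5).

    a, b, c correspond to toss_sigma_cutoff_a, _b and _c.
    """
    return a + b * (resolution - 3.0) + c * (rna_ratio - 0.5)


# At the reference point (3.0 A, rna_ratio 0.5) the cutoff is just a.
print(toss_sigma_cutoff(3.0, 0.5))

# Hypothetical case: 3.5 A resolution, rna_ratio 0.6.
print(round(toss_sigma_cutoff(3.5, 0.6), 2))

# Setting b and c to zero fixes the cutoff at a, whatever the inputs.
print(toss_sigma_cutoff(4.2, 0.3, a=1.5, b=0.0, c=0.0))
```

With the default coefficients, both lower resolution (a larger value in A) and a higher rna_ratio lower the cutoff, since toss_sigma_cutoff_b and toss_sigma_cutoff_c are negative.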