Processing AlphaFold2, RoseTTAFold and other predicted models
Author(s)
- process_predicted_model: Tom Terwilliger, Claudia Millan Nebot, Tristan Croll
Purpose
Process model files produced by AlphaFold, RoseTTAFold and other prediction
software, replacing information these programs put in the B-factor field
with pseudo-B values and optionally breaking the model into compact domains.
Background
As described in AlphaFold and Phenix,
structure prediction software is now capable of generating models that are
highly accurate over some or all parts of the models. Importantly,
these predictions often come with reliable residue-by-residue estimates of
uncertainty. The process_predicted_model tool is designed to help you
remove low-confidence residues and break the model into domains for
docking or molecular replacement.
How process_predicted_model works:
The process_predicted_model tool uses estimates of uncertainty supplied
by structure prediction tools in the B-value (atomic displacement parameters)
field of a model to create new pseudo-B values, to remove uncertain parts
of the model, and to break up the model into domains.
The B-value field in most predicted models represents one of three possible
values:
An actual B-value (atomic displacement parameter)
An estimate of error in A (rmsd)
A confidence (LDDT) on a scale of either 0 to 1 or 0 to 100.
In process_predicted_model, error estimates in A or confidence values are
first converted to B-values. Then residues with high B-values are removed. Then
the remaining residues are grouped (optionally) into domains.
Conversion of error estimates to B-values
Error estimates in A are converted to B-values using the standard formula
for the relationship between rms positional variation and B-values:
B = rmsd**2 * ((8 * (pi**2)) / 3.0)
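As a minimal sketch of this conversion in plain Python (the function names here are illustrative and are not part of the Phenix API):

```python
import math

def rmsd_to_b_value(rmsd):
    """Convert an rms positional error (A) to a B-value using
    B = rmsd**2 * (8 * pi**2) / 3."""
    return rmsd ** 2 * (8.0 * math.pi ** 2) / 3.0

def b_value_to_rmsd(b_value):
    """Inverse of rmsd_to_b_value."""
    return math.sqrt(3.0 * b_value / (8.0 * math.pi ** 2))

# An rms error of 1.5 A corresponds to a B-value of roughly 59
print(round(rmsd_to_b_value(1.5), 1))
```

This is where the approximate correspondence between an RMSD of 1.5 and a B-value of about 60 comes from.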
Conversion of LDDT values to error estimates
LDDT values are first converted to a scale of 0 to 1. You can specify
whether the LDDT values in your model are from 0 to 1 (fractional) or
from 0 to 100. If you don't specify, a model with all LDDT values between
0 and 1 is assumed to contain fractional LDDT values.
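The scale detection described above could look roughly like the following sketch. The function name is hypothetical, and the handling of the explicit setting is an assumption modeled on the input_lddt_is_fractional keyword; the tool's own logic may differ in detail.

```python
def lddt_to_fractional(lddt_values, input_lddt_is_fractional=None):
    """Return LDDT values on a 0-1 scale.

    If input_lddt_is_fractional is None, guess the scale: values that
    all lie between 0 and 1 are treated as already fractional.
    """
    values = list(lddt_values)
    if input_lddt_is_fractional is None:
        input_lddt_is_fractional = all(0.0 <= v <= 1.0 for v in values)
    if input_lddt_is_fractional:
        return values
    return [v / 100.0 for v in values]
```

Note that a model on the 0 to 100 scale whose values all happen to be 1 or lower would be misread by this automatic guess, which is one reason to specify the scale explicitly when you know it.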
Then LDDT values on a scale of 0 to 1 are converted to error estimates using
an empirical formula from
Hiranuma, N., Park, H., Baek, M. et al. Improved protein structure
refinement guided by deep learning based accuracy estimation.
Nat Commun 12, 1340 (2021).
https://doi.org/10.1038/s41467-021-21511-x
This empirical formula is:
RMSD = 1.5 * exp(4*(0.7-LDDT))
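A short sketch of the empirical formula, assuming LDDT is already on the 0 to 1 scale (the function name is illustrative only):

```python
import math

def lddt_to_rmsd(lddt):
    """Estimate rms error (A) from a fractional LDDT value using
    RMSD = 1.5 * exp(4 * (0.7 - LDDT))."""
    return 1.5 * math.exp(4.0 * (0.7 - lddt))

# The error estimate rises steeply as confidence drops
for lddt in (0.9, 0.7, 0.5):
    print(lddt, round(lddt_to_rmsd(lddt), 2))
```

By construction, an LDDT of 0.7 maps to an RMSD of exactly 1.5.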
Trimming away low-confidence regions from predicted models
Normally it is a good idea to remove low-confidence regions from a predicted
model before using it as a starting point for experimental
structure determination. For AlphaFold2 models, low confidence corresponds
to an LDDT value of about 0.7 (on a scale of 0 to 1, or
70 on a scale of 0 to 100), to an RMSD value of about 1.5 A, or to
a B-value of about 60. For other types of models these values might vary,
so you might need to experiment or use values that others have found
useful.
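The following sketch shows how the three thresholds relate and what trimming amounts to. It assumes fractional LDDT values and an inclusive cutoff; the function names and the (residue number, LDDT) input format are hypothetical and are not how the tool represents a model.

```python
import math

def lddt_to_b_value(lddt):
    """Chain the two conversions: LDDT -> rms error (A) -> B-value."""
    rmsd = 1.5 * math.exp(4.0 * (0.7 - lddt))
    return rmsd ** 2 * (8.0 * math.pi ** 2) / 3.0

def trim_low_confidence(residue_lddt, minimum_lddt=0.7):
    """Keep (residue_number, lddt) pairs at or above the cutoff.
    Whether the real cutoff is inclusive is an assumption here."""
    return [(resno, lddt) for resno, lddt in residue_lddt
            if lddt >= minimum_lddt]

# LDDT 0.7 -> RMSD 1.5 -> B-value of about 59
print(round(lddt_to_b_value(0.7), 1))

residues = [(1, 0.95), (2, 0.72), (3, 0.41), (4, 0.88)]
print(trim_low_confidence(residues))
```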
After trimming low-confidence residues, you will usually be left with a
model that has some complete parts of various sizes and some small pieces.
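To see which pieces remain after trimming, you can group the surviving residue numbers into contiguous segments. This is only an illustrative sketch with a hypothetical helper name; it does not reproduce how the tool handles short pieces (see the minimum_sequential_residues keyword).

```python
def contiguous_segments(residue_numbers):
    """Group residue numbers into (first, last) runs of consecutive residues."""
    segments = []
    for resno in sorted(residue_numbers):
        if segments and resno == segments[-1][1] + 1:
            segments[-1][1] = resno  # extend the current run
        else:
            segments.append([resno, resno])  # start a new run
    return [tuple(segment) for segment in segments]

kept = [1, 2, 3, 7, 8, 20]
print(contiguous_segments(kept))
```

Here residues 1-3 and 7-8 form segments, and residue 20 is left as an isolated piece.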
Splitting a trimmed model into domains
It can be helpful to group the pieces from your trimmed model into
compact domains, or even to split some pieces into compact domains.
The process_predicted_model tool allows you to choose a typical domain
size, and if you want, a maximum number of domains, and then it will
try to split your model into compact domains.
There are two methods available. One is based on finding compact domains; the
other is based on using the predicted alignment error matrix (AlphaFold2 only).
Finding domains from a low-resolution model representation
This method calculates a low-resolution map based on the input model,
then identifies large blobs corresponding to domains. All the residues in
the structure are assigned to an initial domain. Then the residues are
regrouped to minimize the number of cases where small parts of the model
belong to one domain while neighboring parts belong to another.
When using this method, you can easily adjust the number of domains you get
by adjusting the target domain size (in A). You can also simply restrict the
number using the maximum_domains keyword, although this usually works less well.
Finding domains using the predicted alignment error matrix
This method analyzes the predicted alignment error matrix (PAE) provided by
AlphaFold2 and finds groupings of residues that have small mutual alignment
error. These groupings often correspond to domains.
When using this method you can adjust the number of domains by changing the
value of pae_power (the exponent applied to PAE before using it in finding
domains). You can also simply restrict the number using the maximum_domains
keyword, although this usually works less well.
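The idea of grouping residues by small mutual PAE can be illustrated with a toy graph. In this sketch, edges are created for residue pairs with PAE below pae_cutoff and weighted by 1/pae**pae_power, as described for those keywords. Everything else is an assumption made for illustration: only pae[i][j] with i < j is used, a small floor avoids division by zero, and plain connected components stand in for the tool's clustering step, which takes the edge weights and pae_graph_resolution into account.

```python
def pae_edges(pae, pae_power=1.0, pae_cutoff=5.0, floor=0.2):
    """Build weighted edges (i, j, weight) for residue pairs with low PAE.
    The floor is an illustrative guard against zero PAE values."""
    edges = []
    n = len(pae)
    for i in range(n):
        for j in range(i + 1, n):
            value = max(pae[i][j], floor)
            if value < pae_cutoff:
                edges.append((i, j, 1.0 / value ** pae_power))
    return edges

def connected_groups(n, edges):
    """Group residues linked by any edge (union-find); a simple stand-in
    for a weighted graph clustering algorithm."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, _weight in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy 4-residue PAE matrix: residues 0-1 and 2-3 are mutually well placed
pae = [[0, 1, 20, 20],
       [1, 0, 20, 20],
       [20, 20, 0, 2],
       [20, 20, 2, 0]]
edges = pae_edges(pae)
print(connected_groups(len(pae), edges))
```

In a weighted clustering, raising pae_power increases the contrast between low-PAE and moderate-PAE pairs, which is why it changes the number of domains found; the connected-components stand-in used here ignores the weights.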
Examples
Standard run of process_predicted_model:
Running process_predicted_model is easy. From the command line you can type:
phenix.process_predicted_model my_model.pdb b_value_field_is=lddt
This will convert the B-value field in my_model.pdb from LDDT to B-values,
trim residues with LDDT less than 0.7, and write out a new model with
individual chains (separate chain ID values) corresponding to
compact domains.
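If you launch the tool from a script, a small helper can assemble the same key=value command line. This is a sketch only: the helper name is hypothetical, my_pae.json is a placeholder file name, and it assumes the keywords from the list below can be given by their short names in the same way as b_value_field_is above.

```python
import shlex

def build_command(model_file, **keywords):
    """Assemble a phenix.process_predicted_model argument list."""
    args = ["phenix.process_predicted_model", model_file]
    for name, value in keywords.items():
        args.append("%s=%s" % (name, value))
    return args

command = build_command(
    "my_model.pdb",
    b_value_field_is="lddt",
    pae_file="my_pae.json",   # placeholder PAE file name
    maximum_domains=3,
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run in an environment where Phenix is installed and on your path.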
Possible Problems
Specific limitations and problems:
Literature
Additional information
List of all available keywords
- job_title = None Job title in PHENIX GUI, not used on command line
- input_files
- chain_id = None If specified, find domains in this chain only. NOTE: only one chain can be used for finding domains at a time.
- selection = None If specified, use only selected part of model
- pae_file = None Optional input json file with matrix of inter-residue estimated errors (pae file)
- distance_model_file = None A PDB or mmCIF file containing the model corresponding to the PAE matrix. Only needed if weight_by_ca_ca_distance is True.
- model = None Input predicted model (e.g., AlphaFold model). Assumed to have LDDT values in B-value field (or RMSD values).
- output_files
- processed_model_prefix = None Output file with processed models will begin with this prefix. If not specified, the input model file name will be used with the suffix _processed.
- remainder_seq_file_prefix = None Output file with sequences of deleted parts of model will begin with this prefix
- process_predicted_model
- remove_low_confidence_residues = True Remove low-confidence residues (based on minimum lddt or maximum_rmsd, whichever is specified)
- split_model_by_compact_regions = True Split model into compact regions after removing low-confidence residues.
- maximum_domains = 3 Maximum domains to obtain. You can use this to merge the closest domains at the end of splitting the model. Make it bigger (and optionally make domain_size smaller) to get more domains.
- domain_size = 15 Approximate size of domains to be found (A units). This is the resolution that will be used to make a domain map. If you are getting too many domains, try making domain_size bigger (maximum is 70 A).
- minimum_domain_length = 10 Minimum length of a domain to keep (reject at end if smaller).
- maximum_fraction_close = 0.3 Maximum fraction of CA in one domain close to one in another before merging them
- minimum_sequential_residues = 5 Minimum length of a short segment to keep (reject at end).
- minimum_remainder_sequence_length = 15 Used to choose whether the sequence of a removed segment is written to the remainder sequence file.
- b_value_field_is = *lddt rmsd b_value The B-factor field in predicted models can be LDDT (confidence, 0-1 or 0-100) or rmsd (A) or a B-factor
- input_lddt_is_fractional = None You can specify if the input lddt values (in B-factor field) are fractional (0-1) or not (0-100). By default if all values are between 0 and 1 it is fractional.
- minimum_lddt = None If low-confidence residues are removed, the cutoff is defined by minimum_lddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum lddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum lddt values are fractional or not depending on the value of input_lddt_is_fractional.
- maximum_rmsd = 1.5 If low-confidence residues are removed, the cutoff is defined by minimum_lddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum lddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum lddt values are fractional or not depending on the value of input_lddt_is_fractional.
- default_maximum_rmsd = 1.5 Default value of maximum_rmsd, used if maximum_rmsd is not set
- subtract_minimum_b = False If set, subtract the lowest B-value from all B-values just before writing out the final files. Does not affect the cutoff for removing low-confidence residues.
- pae_power = 1 If PAE matrix (predicted alignment error matrix) is supplied, each edge in the graph will be weighted proportional to (1/pae**pae_power). Use this to try and get the number of domains that you want (try 1, 0.5, 1.5, 2)
- pae_cutoff = 5 If PAE matrix (predicted alignment error matrix) is supplied, graph edges will only be created for residue pairs with pae<pae_cutoff
- pae_graph_resolution = 1 If PAE matrix (predicted alignment error matrix) is supplied, pae_graph_resolution regulates how aggressive the clustering algorithm is. Smaller values lead to larger clusters. Value should be larger than zero, and values larger than 5 are unlikely to be useful
- weight_by_ca_ca_distance = False Adjust the edge weighting for each residue pair according to the distance between CA atoms. If this is True, then distance_model_file must be provided. See also distance_power
- distance_power = 1 If weight_by_ca_ca_distance is True, then edge weights will be multiplied by 1/distance**distance_power.
- control
- write_files = True Write output files
- gui GUI-specific parameter required for output directory