PredictModel: Predict structures of chains in a sequence file

Author(s)

Tutorial video

A tutorial video for AlphaFold prediction is available on the Phenix YouTube channel.

Purpose

Predict model can be used to generate predicted models from a sequence file. One predicted model is generated for each unique sequence.

You can choose whether to use a multiple sequence alignment (MSA), whether to use templates from the PDB, and whether to supply your own template to be used as a guide in prediction.

Sequence file

The sequence file that you supply specifies what is going to be predicted and how many copies of each chain are present in the structure (one for every copy in the sequence file). All models that are created and input will be associated with one or more of the chains in your sequence file based on sequence identity (normally every model should match a chain in the sequence file exactly).
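As a rough illustration of how copies are counted, the sketch below tallies identical sequences in a FASTA-style sequence file. It assumes FASTA-style headers starting with ">"; the function name count_chain_copies is hypothetical and is not part of Phenix.

```python
from collections import Counter


def count_chain_copies(fasta_text):
    """Return a Counter mapping each unique sequence to its number of copies."""
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous entry, if there was one.
            if current:
                sequences.append("".join(current))
                current = []
        elif line:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return Counter(sequences)


example = """>chain A
MKTAYIAK
QRQISFVK
>chain B
MKTAYIAKQRQISFVK
>chain C
GSHMLEDP
"""

copies = count_chain_copies(example)
for seq, n in copies.items():
    print(f"{n} x {seq}")
```

In this example two entries share the same sequence, so there are two unique sequences to predict, one of which is present in two copies.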

Model prediction

Predict model normally uses the Phenix server notebook to carry out AlphaFold prediction of one chain at a time.

You can specify what inputs the AlphaFold prediction should use. A sequence file is always required. Optional inputs include a multiple sequence alignment file, templates, and keywords for model prediction such as the number of models to generate, the random seed, and whether to use a multiple sequence alignment.

When prediction is being carried out, the Phenix GUI waits for the predicted models to appear in the working directory, then it writes out the resulting models.

Using the Phenix server for model prediction

Prediction is fully automated with the Phenix server through the Phenix GUI, so all you have to do is let it run. The Phenix GUI will use the Phenix server to carry out the prediction and put the predicted models in the working directory.

MSA calculation vs model prediction

The calculation of multiple sequence alignments (MSAs) is a separate step from model prediction. The Phenix GUI on your computer sends a request to the mmseqs2 server, which creates an MSA and sends it back to your computer. Then your computer uploads the MSA to the Phenix server, which uses it in an AlphaFold prediction.

If you want, you can supply your own MSAs. The key requirement is that the sequence of the first entry in your MSA must exactly match the sequence to be predicted.
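Before supplying your own MSA, you can check the first-entry requirement yourself. The sketch below assumes an a3m (FASTA-like) file in which comment lines start with "#"; the helper names are hypothetical and not part of Phenix.

```python
def first_a3m_sequence(a3m_text):
    """Return the first sequence entry of an a3m/FASTA-style alignment, as written."""
    seq_lines = []
    seen_header = False
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if seen_header:
                break  # second header reached: the first entry is complete
            seen_header = True
            continue
        if seen_header:
            seq_lines.append(line)
    return "".join(seq_lines)


def msa_matches_target(a3m_text, target_sequence):
    """True if the first MSA entry is exactly the sequence to be predicted."""
    return first_a3m_sequence(a3m_text) == target_sequence.strip().upper()


a3m = "#16\t1\n>query\nMKTAYIAKQRQISFVK\n>hit1\nMKTAYIAKQRQ--FVK\n"
print(msa_matches_target(a3m, "MKTAYIAKQRQISFVK"))
print(msa_matches_target(a3m, "MKTAYIAKQRQISFVR"))
```

A first entry that contains gap characters or lowercase insertion states will fail this check, which is the intended behavior for an exact match.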

You can also skip the use of MSAs. This can be useful if you supply a template and you want AlphaFold to rebuild your template instead of doing a new prediction.

Number of models

AlphaFold can carry out multiple predictions for a sequence. You can specify how many of these to carry out. The PredictModel tool will choose the one with the highest value of pLDDT (predicted local distance difference test, Mariani et al., 2013). Normally the default (5 models) is a reasonable compromise between optimal prediction and the length of time needed to carry out the prediction. Note that the Phenix server will stop carrying out predictions if it appears unlikely that further predictions will significantly improve the pLDDT (the estimate is based on the variability of the pLDDT values so far).
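The selection step can be pictured with a minimal sketch. It assumes that the "value of pLDDT" for a model is the mean over residues, which is not stated above; the function and model names are hypothetical.

```python
def mean_plddt(per_residue_plddt):
    """Average pLDDT over all residues of one predicted model."""
    return sum(per_residue_plddt) / len(per_residue_plddt)


def pick_best_model(models):
    """models maps a model name to its per-residue pLDDT list; return (name, mean)."""
    best = max(models, key=lambda name: mean_plddt(models[name]))
    return best, mean_plddt(models[best])


models = {
    "model_1": [88.0, 91.0, 75.0],
    "model_2": [90.0, 93.0, 81.0],
    "model_3": [70.0, 85.0, 79.0],
}
best_name, best_score = pick_best_model(models)
print(best_name, best_score)
```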

Using templates from the PDB

You can request that templates from the PDB be used in prediction. If you use this feature, models will be predicted both with and without the templates, and the model with the highest pLDDT will be saved.

Using supplied templates

You can supply your own templates. As for templates from the PDB, if you use this feature, models will be predicted both with and without the templates, and the model with the highest pLDDT will be saved.

Predicting very long sequences in chunks and reassembling

Sequences longer than about 1500 residues take quite a long time to run. By default, a sequence longer than 1500 residues is split into chunks of 600 residues, with consecutive chunks overlapping by 200 residues. The chunks are predicted individually and reassembled at the end. To improve the contacts between domains in different chunks, a contact map is created, and domains that are in contact but in different chunks are re-predicted together. During assembly, these domain pairs are used to adjust the positions of domains, and domain positions are also adjusted to minimize overlaps. The repositioned domains are then used as a template to morph the full reassembled prediction. The result is a full-length prediction assembled from overlapping chunk predictions and distorted to minimize overlaps and to form the domain-domain contacts that span chunks.
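To get a feel for the default chunking, here is a minimal sketch that lists chunk boundaries for a given sequence length. It uses the defaults quoted above (threshold 1500, chunk length 600, overlap 200). How Phenix actually places the boundaries, in particular for the last chunk, is an assumption here, and the function name is hypothetical.

```python
def split_into_chunks(n_residues, chunk_length=600, overlap=200, threshold=1500):
    """Return (start, end) index pairs (0-based, end exclusive) for each chunk."""
    if n_residues <= threshold:
        return [(0, n_residues)]  # short enough to predict in one piece
    step = chunk_length - overlap  # each new chunk starts 400 residues later
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_length, n_residues)
        chunks.append((start, end))
        if end == n_residues:
            break
        start += step
    return chunks


chunks = split_into_chunks(1600)
print(chunks)
```

For a 1600-residue sequence this yields four chunks, each sharing 200 residues with its neighbor; in this sketch the last chunk is simply shorter than the others.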

Note that if you have a sequence longer than the threshold for breaking into chunks then you can only include one sequence in a PredictModel run.

Precalculated MSAs and models save time if you rerun with the same sequence

When you run a prediction on the Phenix server, it saves the MSA and the predicted model. Then if you run the same request again (same number of AlphaFold models, same sequence, same choice of including templates from the PDB, no supplied template), the server will return the result from the original run. You can turn this off with the precalculated results and precalculated MSA keywords if you want. Note that the server does not save your sequence or parameters. It just makes a hash string from them so that if the same request is made again it can detect it. The results are simply saved in a file that has the hash string as a name.
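The idea of recognizing a repeated request from a hash, without storing the sequence itself, can be sketched as follows. The hash function (SHA-256), the fields included, and the function name are all assumptions for illustration; the scheme the Phenix server actually uses is not described here.

```python
import hashlib
import json


def request_key(sequence, n_models, use_pdb_templates):
    """Build a reproducible hash string from the request parameters."""
    payload = json.dumps(
        {
            "sequence": sequence.upper(),
            "n_models": n_models,
            "use_pdb_templates": use_pdb_templates,
        },
        sort_keys=True,  # fixed key order so equal requests give equal hashes
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = request_key("MKTAYIAKQRQISFVK", 5, False)
key_b = request_key("MKTAYIAKQRQISFVK", 5, False)
key_c = request_key("MKTAYIAKQRQISFVK", 10, False)
print(key_a == key_b, key_a == key_c)
```

Identical requests give the same key, so a stored result can be looked up by file name, while changing any parameter gives a different key.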

Precalculated MSAs and models allow you to get your result even if disconnected

If you get disconnected from the server during a prediction job, you can just wait for the job to finish, then submit a new request with the same parameters. As the server saves the predictions, it will return the result to you right away (if it is finished).

Using AlphaFold to improve a model you already have

As AlphaFold often produces models with quite good geometry, you can use it as a procedure for geometry optimization. You supply your working model and your sequence, and you turn off the use of MSAs and the use of templates from the PDB. Of course AlphaFold does not know about your density map, so it could move the model away from the density. Normally you would use refinement to help with this.

Using AlphaFold to build parts of a model that are missing

If you want AlphaFold to try to build the missing parts of a chain that you have partially built, supply your working partial model and your sequence, and allow the use of MSAs (and optionally, the use of templates from the PDB). Your working model will be used as a template for the part of the structure that you have already built. Note that your partially-built model must have the correct sequence.

Notes on using the Phenix server from the Phenix GUI for model prediction

The Phenix GUI can automatically send a request to the Phenix server for an AlphaFold prediction. This requires an internet connection, the Phenix server to be up, and a token that is supplied with Phenix.

If the Phenix GUI successfully communicates with the server, it will wait for the server to carry out the prediction, showing the status of the prediction while it is being carried out. The status will usually start out as waiting in the queue for a place on the GPU. If this status lasts more than a few seconds, an estimated waiting time (before starting processing) is displayed in the Phenix GUI. The waiting time is very approximate. It is based on how long previous jobs waited to start and how many jobs are ahead of yours. If the currently-running jobs happen to take a very short or long time, the estimate could be off in either direction. If you submit more than one job at a time, the first job you submit will get higher priority and may move nearer the front of the queue than the others.

Once your job starts, the status in the Phenix GUI will change to reflect that it is running. There won't be an estimated run time shown, but most of the time jobs with a few hundred residues take 10-20 min, and longer sequences take more or less proportionally longer. If you include templates from the PDB, it takes longer to run. If you specify that the number of AlphaFold models to use is more than the default (5), it will also take longer.

If your prediction is successful, the Phenix GUI will list the estimated confidence (pLDDT) and the output data files. The output files include the AlphaFold model (named something like PredictAndBuild_10_ALPHAFOLD_A.pdb), the multiple sequence alignment used (PredictAndBuild_10_ALPHAFOLD_A_MSA.a3m), and the Predicted Aligned Error (PAE) file (PredictAndBuild_10_ALPHAFOLD_A_PAE.jsn).

The output predicted model

The predicted model produced by AlphaFold with Phenix will consist of a single chain with exactly the sequence that is supplied. It will be in the standard PDB (version 3.2) format. It will contain all (non-hydrogen) atoms for all the residues. The coordinates of the atoms have the usual meaning (locations of atoms in 3D space), but the orientation and position of the chain are arbitrary. The B-value field (normally atomic displacement parameters) will contain the values of pLDDT (confidence), with one value per residue. This means that some processing of this model is necessary before use (this is normally done by phenix.process_predicted_model).
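If you want to look at the confidence values yourself, they can be read from the B-value field of the CA atoms. The sketch below relies on the standard fixed-column layout of PDB ATOM records; the function names are hypothetical, and atom_line exists only to build test data. In practice phenix.process_predicted_model handles this for you.

```python
def plddt_per_residue(pdb_text):
    """Map residue number -> value stored in the B-value field of its CA atom."""
    values = {}
    for line in pdb_text.splitlines():
        # ATOM records: atom name in columns 13-16, residue number in 23-26,
        # B-value field in columns 61-66 (1-based PDB column numbering).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values[int(line[22:26])] = float(line[60:66])
    return values


def atom_line(serial, name, resname, resseq, x, y, z, b):
    """Format one fixed-column ATOM record for chain A (example data only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} A{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b:6.2f}")


pdb_text = "\n".join([
    atom_line(1, " N  ", "MET", 1, 0.0, 0.0, 0.0, 92.5),
    atom_line(2, " CA ", "MET", 1, 1.458, 0.0, 0.0, 92.5),
    atom_line(3, " CA ", "LYS", 2, 5.2, 1.1, 0.0, 64.0),
])
plddt = plddt_per_residue(pdb_text)
print(plddt)
```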

The pLDDT confidence measure

AlphaFold provides residue-level estimates of model accuracy in the form of predicted values of the LDDT (Local Distance Difference Test, Mariani et al., 2013). These predicted values are referred to as pLDDT values (predicted Local Distance Difference Test). The LDDT measure is a number from zero to one that reflects the similarity of CA positions in two structures.

The LDDT is not a simple measure of accuracy, but rather a composite based on the distances between all pairs of CA atoms in one model and how similar these distances are to the corresponding distances in the other model. It is a local test, where only inter-atomic distances that are less than 15 Å are considered.

The LDDT is a composite that is generated from the fraction of CA-CA distances that are very accurate (less than 0.5 Å off), the fraction that are quite accurate (less than 1 Å off), moderately accurate (less than 2 Å off), and somewhat accurate (less than 4 Å off). If all CA-CA distances are within 0.5 Å, the resulting score is 1. If all are more than 4 Å off, the score is zero. Note that the LDDT is related to RMSD, but not in a simple way.
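The composite described above can be sketched for a whole chain as follows. This is a simplified CA-only version that averages the four fractions with equal weight and selects pairs by their distance in the reference structure; both choices are assumptions, and this is not the code used by AlphaFold or Phenix.

```python
import math


def ca_lddt(predicted, reference, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """CA-only LDDT between two equally long lists of (x, y, z) coordinates."""
    preserved = 0
    total = 0
    n = len(reference)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= cutoff:
                continue  # local test: ignore pairs that are far apart
            diff = abs(math.dist(predicted[i], predicted[j]) - d_ref)
            total += len(thresholds)
            # Count how many of the four tolerance levels this distance meets.
            preserved += sum(diff < t for t in thresholds)
    return preserved / total if total else 0.0


reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.1, 0.0, 0.0)]
score_same = ca_lddt(reference, reference)
score_shifted = ca_lddt(shifted, reference)
print(score_same, round(score_shifted, 3))
```

In the shifted example one CA atom is displaced by 1.5 Å, so two of the three distances meet only the 2 Å and 4 Å tolerances, and the score drops from 1 to 8/12.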

The pLDDT values (predicted Local Distance Difference Test) provided by AlphaFold are estimates of the LDDT values. The pLDDT values are essentially unbiased (just as likely to be too low as too high), and reasonably accurate. The correlation between pLDDT values and actual LDDT values calculated using AlphaFold models and models in the PDB is about 0.7 - 0.75, which means that the values of pLDDT you get are useful indicators of model quality but also that sometimes AlphaFold will have high confidence in an incorrect prediction or low confidence in a correct prediction.

As the pLDDT metric is an estimate of local error, it does not tell you that much about long-range distortions or domain movements between the predicted and actual structure. The PAE matrix does have some information about these features because it is not restricted to short distances.

The output MSA file

The MSA file that was used in AlphaFold prediction is available if the Phenix server is used. This MSA file can be used in subsequent predictions (if the sequence is identical). Normally there is no need to supply an MSA file because the MSA is saved by the Phenix server and it will just be re-used if the same sequence is supplied. However if you want to edit this MSA file before running an AlphaFold prediction, you can do that.

The output PAE file

AlphaFold provides a Predicted Aligned Error matrix in the form of a PAE file. The error that it is describing reflects the estimated error in CA positions in the predicted model compared to the (unknown) true structure.

Suppose we had both the prediction and the true structure. Then suppose we superpose the predicted model on the true structure, basing the superposition on the residues near residue i. Using that superposition, we can now look at the location of the CA atom of residue j in the predicted model and in the true structure. We could in particular note the distance between the predicted and actual position of the CA atom of residue j, when the structures are superposed using residues near residue i. We could do this for every residue pair (i,j), and the result would be the Aligned Error matrix.

The Predicted Aligned Error matrix is an estimate of the Aligned Error matrix. It is generated by AlphaFold: along with the predicted model, AlphaFold produces an estimate of the error in that prediction.

The PAE matrix is useful for identifying domains within a predicted chain that are accurately predicted, along with connecting regions that are more uncertain. In the Phenix GUI the PAE matrix is plotted with dark green indicating low PAE values (relationships between pairs of residues that are predicted to be accurate). A domain that is accurately predicted can be seen in such a plot as a solid green square along the diagonal. If your plot has just one big green square, it is probably all quite accurate. If it looks instead like a few squares along the diagonal, probably the individual domains are accurately predicted, but their relationship is not known.

Relationship between the pLDDT values and the PAE matrix

The PAE matrix is related to the pLDDT values, but not in a simple way.

The pLDDT value for a residue describes the expected errors in the distances between that CA atom and nearby CA atoms (compared to the true structure) on a scale of 0 to 1, where 1 means all CA-CA distances are expected to match within 0.5 Å.

The value of the PAE matrix for a pair of residues is the estimated error in Å in the position of the CA atom for the second residue if the model is superposed on the true structure using CA atoms near the first CA atom.

Both measures tell you something about the uncertainties in your model, but neither tells you exactly how well the model would superpose on the true structure.

Rough guide as to expected RMSD for a residue if you have the pLDDT

Here is a table that gives you an idea of the RMSD of CA atoms in an AlphaFold prediction to a superposed deposited model (see Terwilliger et al., 2023):

pLDDT          Median prediction error     % with error over 2 Å

> 90                   0.6 Å                    10%
80-90                  1.1 Å                    22%
70-80                  1.5 Å                    33%
< 70                   3.5 Å                    77%

This table shows that high-confidence residues (pLDDT > 90) are generally very accurate, but a few (10%) are still off by more than 2 Å. Low-confidence residues (pLDDT < 70) are usually quite far off (median of 3.5 Å), but almost a quarter (23%) are still within 2 Å.
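The table can be turned into a small lookup for quick use. The sketch below assumes pLDDT on the 0-100 scale used in the table, and its handling of values that fall exactly on a bin boundary is an assumption; the function name is hypothetical.

```python
def expected_error(plddt):
    """Return (median CA error in Å, percent of residues off by more than 2 Å)."""
    if plddt > 90:
        return (0.6, 10)
    if plddt > 80:
        return (1.1, 22)
    if plddt > 70:
        return (1.5, 33)
    return (3.5, 77)  # low-confidence bin; exactly 70 is placed here by assumption


for value in (95.0, 85.0, 72.0, 40.0):
    median_error, percent_over_2 = expected_error(value)
    print(f"pLDDT {value}: median error {median_error} Å, {percent_over_2}% over 2 Å")
```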

Examples

Standard run of predict_model

Running predict_model is easy. From the command-line you can type:

phenix.predict_and_build jobname=myjob seq_file=seq.dat \
  prediction_server=PhenixServer \
  stop_after_predict=True

Common questions

How long will it take?

A typical protein chain with 200 residues will take about 5-10 minutes to run, once it has started on the Phenix server.

Timing is more or less proportional to chain length. If you specify that more than 5 models are to be built, it will typically take longer.

If your chain has already been predicted for you or for someone else, you normally should get a result (a copy of that prediction) in about 30 seconds or less.

If the Phenix server is full (typically 6 jobs can run at once), you can use the Server button on the Phenix GUI to see the expected wait time to start a job.

Server problems

The most common problem in running Predict model is that the Phenix server is not working as expected. The first thing to try is to let the program retry, which it normally does for a while. If that does not work, you can take the files in the packaged .tgz file (listed in the GUI output), use them to get your own prediction with any server, and put the resulting predicted models in the place specified in the GUI or program output.

Specific limitations and problems:

Literature

Additional information

List of all available keywords (same as predict_and_build keywords)