AlphaFold and Phenix

You can use the predicted models from AlphaFold and other prediction software in Phenix. Using these models can be very helpful in structure determination because the models can be very accurate over much of their length and the models come with accuracy estimates that allow removal of poorly-predicted regions.

Overall procedure for using AlphaFold models in Phenix

A typical procedure for structure determination with AlphaFold models is:

1. Analyze your data (Xtriage for X-ray data or Mtriage for cryo-EM data)
2. Identify whether most parts of most of the components of your structure can be predicted or modeled (PredictModel)
3. Automatically generate an interpretation of your data (PredictAndBuild)

1. Analyzing your data

It is a good idea to evaluate your data before working with it so that you know its resolution and overall quality. You can use phenix.xtriage to do this for X-ray data and phenix.mtriage for cryo-EM data. For X-ray data in particular, you will need to know the space group of your structure (or its enantiomorph if the space group belongs to an enantiomorphic pair, such as P61 and P65).

2. Identify whether most parts of most of your structure can be predicted

You might want to predict models for all the chains in your structure (even before getting the data) to get an idea of how much information structure prediction is going to give you. You can use phenix.predict_model to do this for protein chains. For nucleic acid chains, you will normally have to either predict their structures with an external procedure or build them separately (e.g., with phenix.autobuild).

In general terms, if most parts of most of the components of your structure can be predicted with high confidence (i.e., a value of the pLDDT metric of 70 or higher), then predicted models are likely to be very useful.
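
As a rough illustration of the confidence check described above, the following Python sketch reports the fraction of residues in a predicted model that reach the cutoff of 70. It assumes a PDB-format model in which the B-factor column holds pLDDT on a 0-100 scale, as in AlphaFold output (some other predictors use a 0-1 scale). The function names are illustrative only and are not part of Phenix.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain_id, residue_number): pLDDT} read from CA atoms.

    Assumes fixed-column PDB format with pLDDT stored in the
    B-factor field (columns 61-66) on a 0-100 scale.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            scores[(chain_id, residue_number)] = float(line[60:66])
    return scores


def fraction_confident(scores, cutoff=70.0):
    """Return the fraction of residues with pLDDT at or above the cutoff."""
    if not scores:
        return 0.0
    confident = sum(1 for value in scores.values() if value >= cutoff)
    return confident / len(scores)
```

For example, reading a model file into a string, passing it to plddt_per_residue and then to fraction_confident gives a single number per model that you can compare across the chains in your structure.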

3. Automatically generate an interpretation of your data

If you have either cryo-EM or X-ray data with a resolution of about 4.5 Å or better, and if you have useful predicted models for most of the components of your structure, you may be able to use Predict and build to automatically generate an initial interpretation of your data.

The Predict and build procedure (phenix.predict_and_build) requires only a sequence file (one sequence for each chain in the model to be built) and either X-ray data (an MTZ file) or cryo-EM data (two half-maps). This procedure carries out iterative prediction and structure determination, yielding a final rebuilt model as well as a final morphed model that contains the final set of predicted models.
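
Before starting a run, it can be useful to confirm that the sequence file really contains one sequence per chain. The sketch below assumes the sequence file is in FASTA format (a ">" header line followed by sequence lines); the function name read_fasta is illustrative and not part of Phenix.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-format text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The length of the returned list is the number of sequences, which you can compare with the number of chains you expect to build.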

Other procedures for using AlphaFold models in Phenix

1. Get an AlphaFold model (or a model from the PDB) for each chain in your structure. You can use the phenix.predict_model method in the Phenix GUI to do this using the Phenix server.

2. Trim your model and break it into rigid domains. You can use phenix.process_predicted_model to do this.

3. Dock models (cryo-EM) or carry out molecular replacement (crystallography) to place your models in the right places in your map or unit cell. For cryo-EM structures, you can supply models directly to phenix.predict_and_build if you want. Alternatively you can dock your models with phenix.dock_predicted_model or phenix.em_placement. For X-ray structures you can also use phenix.predict_and_build, or you can use phenix.phaser to find the locations of your models in the crystal.

4. Fill in missing parts of your models with loop fitting or iterative model-building. You can do this with phenix.predict_and_build if you want. Alternatively, for crystallography you can use phenix.autobuild.

5. Refine the rebuilt predicted models that you obtain. You can use phenix.real_space_refine for cryo-EM models and phenix.refine for crystallographic models.

6. Examine your resulting model in detail, using the validation tools that are part of phenix.real_space_refine and phenix.refine to help you identify problem areas and using manual model-building tools to fix them.

Advanced steps

Iteration of AlphaFold and model-building with experimental data

You may be able to improve your rebuilt AlphaFold model by iterating the AlphaFold step and including your rebuilt model as a template (and skipping any other templates). This process is what the phenix.predict_and_build tool in the Phenix GUI does for you automatically.

Structures with multiple chains

The phenix.predict_and_build tool can work with a structure with multiple chains, but you may wish to run one chain at a time for a cryo-EM structure. You can use the whole map for each chain, or if you have some idea of what chain goes where, you can mask out or box the map so that it shows only one chain and use that as your map. Boxing or masking the map can speed up the process and improve the result considerably.

Boxing maps before rebuilding

For complex structures with many chains or with chains that contain domains with long linkers, docking can be very complicated and take a long time. In these cases it may be especially helpful to box or mask the map if you have an idea where each chain is located. If you do not, you might want to run the docking step individually with each domain that you get from phenix.process_predicted_model and then examine where each domain was placed in the map. If the domains seem to correspond to different molecules, you might want to mask out the part of the density that corresponds to the molecule you don't want to fit and try again. You can also try running phenix.dock_in_map with one domain at a time, asking it to find multiple copies; then you can choose the copy that matches up with the other domains you have placed.
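
If you have an approximate placement for a chain, one simple way to choose the region to box is to take the extent of its atoms and add a margin. The Python sketch below only computes the corners of such a box; it assumes coordinates in Å, the 5 Å padding is an arbitrary illustrative value, and padded_box is a hypothetical helper rather than a Phenix function. Cutting the map itself should be done with the map-boxing tools in Phenix.

```python
def padded_box(coordinates, padding=5.0):
    """Return (lower_corner, upper_corner) enclosing all points plus padding.

    coordinates: non-empty iterable of (x, y, z) tuples in Angstroms.
    """
    xs, ys, zs = zip(*coordinates)
    lower = (min(xs) - padding, min(ys) - padding, min(zs) - padding)
    upper = (max(xs) + padding, max(ys) + padding, max(zs) + padding)
    return lower, upper
```

The two corners describe the region of the map that contains the chain, which is the part you would keep when boxing.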

Background

Structure prediction software is now capable of generating models that are highly accurate over some or all parts of the models. Importantly, these predictions often come with reliable residue-by-residue estimates of uncertainty.

Compact domains in these predicted models in which all the residues have high confidence often will be very accurate over the entire domains. However, separate domains that each have high confidence but are connected by lower confidence residues sometimes have relative positions and orientations that differ between predicted and experimentally-determined structures.

When using predicted models as a starting point for experimental structure determination, it can be helpful to:

Remove low-confidence residues entirely

Break up the model into domains and allow the domains to have different orientations

For a high-confidence predicted model, you might try using the predicted model as-is first. For most predicted models, you may want to try removing low-confidence residues, then additionally try breaking the model into domains and placing the domains one at a time.
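
The first of these steps, removing low-confidence residues, can be illustrated with a short Python sketch. It is not the algorithm used by phenix.process_predicted_model, which also identifies compact domains; it only drops atoms below a pLDDT cutoff and reports the contiguous stretches of residues that remain. It assumes a single-chain, well-formed PDB-format model with pLDDT (0-100) in the B-factor column, and all function names are illustrative.

```python
def trim_low_confidence(pdb_text, cutoff=70.0):
    """Keep only ATOM records whose B-factor field (pLDDT) is >= cutoff."""
    kept = []
    for line in pdb_text.splitlines():
        # Columns 61-66 of an ATOM record hold the B-factor (here: pLDDT)
        if line.startswith("ATOM") and float(line[60:66]) >= cutoff:
            kept.append(line)
    return "\n".join(kept)


def residue_numbers(pdb_text):
    """Return the sorted residue numbers present in the ATOM records."""
    return sorted({int(line[22:26]) for line in pdb_text.splitlines()
                   if line.startswith("ATOM")})


def contiguous_segments(numbers):
    """Group residue numbers into (first, last) runs with no gaps."""
    segments = []
    for number in sorted(set(numbers)):
        if segments and number == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], number)  # extend current run
        else:
            segments.append((number, number))  # start a new run
    return segments
```

Calling contiguous_segments(residue_numbers(trim_low_confidence(text))) lists the high-confidence pieces that remain after trimming. Splitting at gaps in this way is only a crude stand-in for a real domain analysis.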

An important feature of recent predicted models is that they generally have very accurate sequence alignment. That means that the assignment of the sequence to the high-confidence parts of the model is usually correct. This can make a very big difference in completion of the remainder of the structure (the parts that were not predicted with high confidence) because you know exactly what residues go in the gaps. This means that model-building of the remainder of the structure can often be completed with loop-fitting tools instead of trying to rebuild everything.
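
Because the sequence register of the high-confidence parts is known, the residues that belong in each gap can be read directly from the chain sequence. The Python sketch below shows this bookkeeping; it assumes residue numbering that runs consecutively along the sequence, and gap_sequences is an illustrative name rather than a Phenix function.

```python
def gap_sequences(sequence, kept_residue_numbers, first_residue_number=1):
    """Return [(start, end, subsequence)] for residues missing from the model.

    sequence: one-letter sequence of the whole chain.
    kept_residue_numbers: residue numbers present in the trimmed model.
    """
    kept = set(kept_residue_numbers)
    gaps = []
    start = None
    for offset in range(len(sequence)):
        number = first_residue_number + offset
        if number not in kept:
            if start is None:
                start = number  # a new gap opens here
        elif start is not None:
            # The gap closed at the previous residue
            first = start - first_residue_number
            gaps.append((start, number - 1, sequence[first:offset]))
            start = None
    if start is not None:
        # The chain ends inside a gap
        end = first_residue_number + len(sequence) - 1
        gaps.append((start, end, sequence[start - first_residue_number:]))
    return gaps
```

Each tuple gives the first and last residue number of a gap together with the residues that a loop-fitting step would need to build there.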

What to do after applying Phenix tools to predicted models

While AlphaFold and other predicted models can be quite accurate overall, some details in otherwise-accurate regions and some whole regions can be incompatible with your experimental data. The Phenix procedures for producing models based on your experimental data and using predicted models as starting points are designed to keep the accurate parts of the predicted models and to replace the inaccurate parts. Depending on the resolution of your data, the automatically-produced models may be quite accurate or may themselves need a lot of trimming and rebuilding.

Once you have used Phenix tools to modify a predicted model based on experimental data, you will want to analyze the resulting model carefully, comparing every detail to the experimental map or data. You will generally want to use manual model-building tools such as Coot or ISOLDE to fix the small and large errors that remain in the model.

You can also iterate the AlphaFold prediction, using your rebuilt model as an input to a new round of AlphaFold (see also the documentation for running AlphaFold predictions).

Acknowledgements

The Phenix AlphaFold server uses AlphaFold (Jumper et al., 2021), the MMseqs2 MSA server (Steinegger and Söding, 2017), and scripts derived from ColabFold (Mirdita et al., 2022).

Literature

Accelerating crystal structure determination with iterative AlphaFold prediction. T.C. Terwilliger, P.V. Afonine, D. Liebschner, T.I. Croll, A.J. McCoy, R.D. Oeffner, C.J. Williams, B.K. Poon, J.S. Richardson, R.J. Read, and P.D. Adams. Acta Cryst D 79, 234-244 (2023).

Improved AlphaFold modeling with implicit experimental information. T.C. Terwilliger, B.K. Poon, P.V. Afonine, C.J. Schlicksup, T.I. Croll, C. Millán, J.S. Richardson, R.J. Read, and P.D. Adams. Nature Methods 19, 1376-1382 (2022).

AlphaFold predictions: great hypotheses but no match for experiment. T.C. Terwilliger, P.V. Afonine, D. Liebschner, T.I. Croll, A.J. McCoy, R.D. Oeffner, C.J. Williams, B.K. Poon, J.S. Richardson, R.J. Read, and P.D. Adams. bioRxiv (2023).

Making protein folding accessible to all. M. Mirdita, K. Schütze, Y. Moriwaki, L. Heo, S. Ovchinnikov, and M. Steinegger. Nature Methods 19, 679–682 (2022).

Highly accurate protein structure prediction with AlphaFold. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S.A.A. Kohl, A.J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A.W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis. Nature 596, 583-589 (2021).

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. M. Steinegger and J. Söding. Nature Biotechnology 35, 1026-1028 (2017).