PredictAndBuild: Solving an X-ray structure or interpreting a cryo-EM map using predicted models

Author(s)

Acknowledgements

The Phenix AlphaFold server uses AlphaFold (Jumper et al., 2021), the mmseqs2 MSA server (Steinegger, and Söding, 2017), and scripts derived from ColabFold (Mirdita et al., 2022).

Tutorial video - X-ray

Tutorial video - Cryo-EM

Tutorial videos for X-ray and cryo-EM are available on the Phenix YouTube channel and cover the following topics:

Purpose

Predict and build can be used to generate predicted models and to use them to solve an X-ray structure by molecular replacement or to interpret a cryo-EM map. Predict and build then carries out iterative model rebuilding and prediction to improve the models. The iterative procedure allows creation of more accurate predicted models than can be obtained with a simple prediction.

Normally Predict and build is used as a way to automatically generate a fairly accurate model starting from just a sequence file and either cryo-EM half-maps or X-ray data. Additionally Predict and build provides morphed versions of unrefined predicted models that can be useful as reference models for refinement.

Steps in predict_and_build

The principal steps in predict_and_build are:

Model prediction (e.g., with AlphaFold)

Model trimming

Structure solution by molecular replacement (X-ray)  or map interpretation
by docking (cryo-EM) to create a scaffold model

Morphing of original predicted models onto the scaffold model

Model refinement and rebuilding

Iteration of model prediction using the rebuilt models as templates

Sequence file

The sequence file that you supply specifies what is going to be predicted and how many copies of each chain are present in the structure (one for every copy in the sequence file). All models that are created and input will be associated with one or more of the chains in your sequence file based on sequence identity (normally every model should match a chain in the sequence file exactly).

Input data (cryo-EM)

For cryo-EM maps, normally you will supply two full-size half-maps. These will be used in density modification to create a single full-size map. This map will then be boxed to extract just the part that contains the molecule (the part with high density). Alternatively, you can supply a single full-size or boxed map if you wish.

Input data (X-ray)

For a crystal structure, you will supply an mtz-style file containing data for Fobs and sigFobs. Optionally this file can contain a test set (FreeR_flags). The space group specified in this data file and its enantiomer, if any, will be used in structure determination. This means you should run phenix.xtriage first so that you have a good idea of the space group.

Model prediction

Predict and build normally uses the Phenix server to carry out AlphaFold prediction of one chain at a time. It can also use predictions that you supply or that you generate on demand and put in the same place on your computer as those that are created by one of the servers.

Prediction (see phenix.predict_model for details) is fully automated with the Phenix server through the Phenix GUI, so all you have to do is let it run. The Phenix GUI will use the Phenix server to carry out the prediction and put it in the working directory.

You can specify what inputs the AlphaFold prediction should use. These always include a sequence file, but it can include an optional multiple sequence alignment file, optional templates, keywords for model prediction such as the number of models to generate, random seed, and whether to use multiple sequence alignment.

When prediction is being carried out, the Phenix GUI waits for the predicted models to appear in the working directory, then it goes on to the next steps. If you want to create these models in some other way, you can just put them in the working directory (the expected file names are listed in the GUI when prediction starts) and they will be used. Note that all prediction files must have all the residues in the corresponding sequence present.

Model trimming

The predicted models are trimmed to remove low-confidence sections and to split into compact domains using phenix.process_predicted_model . This allows the molecular replacement (X-ray) or docking (cryo-EM) procedures to use the most accurate parts of the models, increasing the chance of finding the correct solution. Splitting into compact domains increases the chance of correctly placing parts of chains that have different orientations in the predicted model compared to the actual structure.

Structure determination by molecular replacement (X-ray data)

Predict and build uses phenix.phaser with all default parameters to solve a structure by molecular replacement. If your structure cannot be solved automatically in this way, then you will want to take your trimmed predicted models, solve your structure separately by molecular replacement (e.g., by using phenix.phaser but changing some parameters), then supplying the molecular replacement model to Predict and build as a "docked model". After molecular replacement, domains are rearranged if necessary to allow sequential parts to be connected, creating a scaffold model for use in the morphing step to follow.

The molecular replacement model is refined using the X-ray data, yielding a 2mFo-DFc map. This map is density-modified as well, and the map that has the higher correlation with the refined model is used as the working map.

Structure determination by docking (cryo-EM data)

Predict and build first carries out density modification based on the two half-maps that you provide. This creates an optimized map for interpretation.

Predict and build then sequentially docks all the domains from all the predicted models into the map. After docking, the domains are rearranged if necessary to allow sequential parts to be connected. This docking is carried out by phenix.dock_and_rebuild . The rearranged docked model is the scaffold model that will be used in the next step.

Morphing of original predicted models onto the scaffold model

The scaffold model obtained from predicted models and X-ray or cryo-EM maps usually contains one chain for each sequence in your sequence file. Your predicted models normally include one model matching each unique sequence in the sequence file. For each chain in the scaffold model, the predicted model that is most similar is then morphed (adjusted) to superimpose on the scaffold model.

In the simplest case, your predicted models all yielded a single domain in the trimming step. In this case the chains in your scaffold model will match your predicted models exactly and morphing consistes of simple superposition of the full predicted model on the docked chain. In a more complicated case, your trimmed model may have consisted of multiple domains for one chain, so that your scaffold model may have separately-placed domains for a particular chain. In this case, the morphing consists of superimposing the parts of the full predicted model that match the scaffold, and smoothly deforming the connecting segments.

Model refinement, rebuilding, and trimming

Model refinement and rebuilding is done by phenix.dock_and_rebuild . This procedure consists of identifying all the parts of each chain that are either predicted with low confidence or that do not match the density, rebuilding them in several different ways, and picking the one that matches the density the best. Then the individual chains are refined with real-space refinement.

At the end of rebuilding, residues in the model that are in poor density are identified and marked with zero occupancies. A new trimmed final model lacking these residues is also created

For X-ray data, the trimmed model is refined using the X-ray data, yielding a new working map (2mFo-DFc or density modified, as in the molecular replacement step).

Iteration of model prediction using the rebuilt models as templates

A key element of the Predict and build procedure is iteration of model prediction using chains from the rebuilt model as templates. This improves model prediction compared to a single prediction step. Normally the entire procedure is repeated until the change in predicted models between subsequent cycles is small.

Output models

The results of the Predict and build procedure are:

A final docked model.  This model is the last scaffold model and is
normally suitable for use as a reference model in further refinement.
Normally this model will consist of pure predicted models,
placed appropriately to match the map.

A final trimmed rebuilt model. This model is obtained by rebuilding and
refinement, followed by trimming residues that are in poor density.
It is normally suitable for use as a
starting point for further model rebuilding, refinement, and addition of
ligands and covalent modifications.  A final untrimmed model is also
provided. This can be used as a hypothesis for the poorly-fitting regions.

In some cases (low-resolution data), the trimmed rebuilt model may have very
poor geometry and you might want to use the final docked model as the
starting point for refinement and continuation of building of your structure.

The Carry-on directory and restarting or re-running jobs

Predict and build saves the working files in a carry_on_directory. At the end of the log file (and in the GUI at the end of a run) the carry-on directory is listed. In the GUI it is normally a directory with the same number as the job and ending in "CarryOn": for PredictAndBuild_30 it is PredictAndBuild_30_CarryOn. If you have a previous run that you want to restart or continue, specify the name of this file in the GUI as the Carry-on directory, make sure the flag "carry_on=True" is set, and rerun the job with the same inputs (otherwise) as before. Then running the job will result in reading all the results that had been accumulated by the previous job, then continuing from where it left off.

The carry-on directory lets you restart after stopping or crashing. It also lets you change some parameters and carry on from where you left off.

The jobname also lets you move a project to another computer if you want. You can copy the carry-on directory and put it in any directory on any computer. Then you can specify that directory as the Carry-on directory on that computer and it will carry on from where you left off.

Note that the file names inside the Carry-on directory contain the jobname, so if you want to move individual files around, you might have to change their names.

The universal solution for (most) problems in PredictAndBuild

The solution to many problems that you may have with PredictAndBuild is simply to completely stop running your job, then restart, specifying the Carry-on directory from the job that stopped. This will pick up where you left off. It is suitable for cases where something crashed, where you accidentally stopped a job, where you lost connections to the server, or where something else went wrong.

In some cases to stop running your job you might have to stop all python processes on your computer and quit the GUI, but in most cases you can just hit Abort in the GUI and wait a little.

MSA calculation vs model prediction

If you use the Phenix server, the calculation of multiple sequence alignments (MSAs) is a separate step from model prediction. The Phenix GUI on your computer sends a request to the mmseqs2 server, which creates an MSA and sends it back to your computer. Then your computer uploads the MSA to the Phenix server, which uses it in an AlphaFold prediction.

If you want, you can supply your own MSAs. The key requirement is that the sequence of the first entry in your MSA must exactly match the sequence to be predicted.

You can also skip the use of MSAs. This can be useful if you supply a template and you want AlphaFold to rebuild your template instead of doing a new prediction.

Number of models

AlphaFold can carry out multiple predictions for a sequence. You can specify how many of these to carry out. The PredictModel tool will choose the one with the highest value of pLDDT (predicted local difference distance test).

Using templates from the PDB

You can request that templates from the PDB be used in prediction. If you use this feature, models will be predicted both with and without the templates, and the model with the highest pLDDT will be saved.

Using supplied templates

You can supply your own templates. As for templates from the PDB, if you use this feature, models will be predicted both with and without the templates, and the model with the highest pLDDT will be saved.

Using supplied predictions or RNA/DNA models

You can supply homology models or models that you have built yourself for any of the chains in your structure. If you are working with RNA or DNA, you have to do this. You just supply a model (does not have to be perfect) for each chain that you do not want to have AlphaFold predict, and you call it a predicted model (note the difference between a template -- a model to guide AlphaFold prediction, and a predicted model -- the prediction or an alternative to a prediction.)

Using PredictAndBuild to change the sequences in a model

If you have docked (cryo-EM) or placed (X-ray) chains for a structure from one species using data from another species, you can use PredictAndBuild to fix all the sequences and create a plausible model for each chain.

There are several ways you can do this. If you first run AlphaFold to get a predicted model for each unique chain, you can use a command like this:

 phenix.predict_and_build predicted_model=replacements.pdb \
input_files.b_value_field_is=b_value scaffold_model=existing.pdb \
seq_file=seq.dat cycles=1 refine_only=True refine_cycles=0 \
crystal_info.resolution=3 full_map=full_map.ccp4

Here existing.pdb is the model you have already created (wrong sequences). The model replacements.pdb contains your predicted models for each unique chain, where each predicted model has the right sequence, but can be in any orientation or position. The sequence file seq.dat has all the correct sequences. The map is required as it is used to adjust the predicted models. The keyword b_value_field_is=b_value tells PredictAndBuild that the predicted models contain B-values (atomic displacement parameters), not pLDDT values (predicted Local Difference Distance Test).

If you actually have no map, you can create a model-based map from your existing.pdb structure with the phenix.fmodel tool. The map type you want is called complex (structure factors) and you would want to make it a low-resolution map (like 10 A).

If you don't want to generate the predicted models yourself, you can skip that input and just let PredictAndBuild do the predictions. Of course if you have RNA or DNA chains, you will need to supply predicted models for those chains.

Rebuilding a model without prediction (morph and refine only):

You can use predict_and_build to just rebuild a model that is placed (in a cryo-EM map) like this:

In the GUI, select PredictAndBuild (cryo-EM data)

Supply:  your map (call it “Map file”), your model (call it
“predicted model”), your sequence.

Just below the inputs there is a line that starts with
“Contents of b-value field”… Select “b_value” (not plddt),
as this model has B values in its b-value field.

In Prediction and building settings, go to All parameters,
the “Search parameters”, and search for “already”.  Tick
the check box for “Models are already placed”.

In Prediction and building settings, select “Building” and check
the box for “Refine only”.

Hit Run.  It should take your model, morph it to match
the map, and refine it.

Examples

Standard run of predict_and_build

Running predict_and_build is easy. From the command-line you can type:

phenix.predict_and_build jobname=myjob xray_data=my_xray_data.mtz seq_file=seq.dat

This will carry out all the steps of prediction, molecular replacement and iterative rebuilding and prediction to yield my_model_rebuilt.pdb

If you run again with the same command, Predict and build will read all the previous files and just give you the results.

Common questions

How long will it take to run?

Running predict_and_build can take anywhere from an hour to several days, depending largely on the size of the chains to be placed and the number of chains. Increasing the number of processors used (default of 4) will speed it up. If you are unsure if the program is running, have a look at the log file, where the status and current time are printed out periodically. Also you can just see if any Python processes are running on your machine.

Server problems

The most common problem in running Predict and build is that the Phenix server is not working as expected. Normally the first thing to try is just let the program retry (it will do this for a while normally).

Here is a test you can run to see if everything is ok: In the GUI, hit AlphaFold/AlphaFold model prediction/Prediction settings/ and check the box for “Run test job on server and stop”. Then hit Run. In about 10 sec it should say “Test job completed successfully” if everything is ok.

If the server is still not working, you can take the files in the packaged .tgz file (listed in the GUI output), use them to get your own prediction with any server, and put the resulting predicted models in the place specified in the GUI or program output.

Problems with low-confidence models

If your predicted models have low confidence, the entire procedure may not work.

If the resolution of your data is about 4.5 A or better, and most of most of your predicted models have high confidence (plDDT > 70), then the procedure has a good chance of working. Otherwise, the chances are lower.

Symmetry problems

If your cryo-EM map has pseudo-symmetry (like a proteasome) you might need to box one subunit or try ssm_search=False to use a more thorough search in docking.

Working on individual chains in large cryo-EM structures

If you are working with a large cryo-EM structure (or even a small one), you may find it most efficient to break up the work into small pieces. If you can identify where individual chains are in your map, you can box around each chain, creating individual boxes for each chain to work on. One way to create a box around a chain is to put a dummy molecule in the density you are interested (anything that more or less covers the region you want), then use the MapBox tool to create a little map surrounding that dummy molecule.

When you use the MapBox tool be sure to use the defaults and do not shift the origin (if you shift the origin then the models you get will not superimpose on the original map).

Then of course you have to guess which sequence goes with that density. You could try a few possibilites and run PredictAndBuild on your small map with each likely sequence.

When you are done with each box, you can just combine all the models because they will stay in their correct locations.

Inverted map:

If your map is inverted (left-handed), docking and model-building will not work properly. You can often tell if your map is inverted because any helices will be left-handed. If you are unsure, you can run MapBox with invert_hand=True to invert the map and then see if docking works. Note that if your map is inverted, you will want to invert all your maps and start everything from the beginning.

Specific limitations and problems:

AlphaFold only predicts protein structures, not RNA or DNA, so you will have to either build those chains separately or supply predicted models (homology models, for example) for those chains. If you supply predicted models for some chains and not others, predict_and_build will try to predict the missing chains and build with all the available models.

The predict_and_build tool has two command-line parameters for nproc (prediction.nproc and control.nproc), two for resolution (prediction.resolution and crystal_info.resolution), and two for random_seed (prediction.random_seed and control.random_seed). Normally use the control.nproc, crystal_info.resolution and control.random_seed parameters. However if you want to change the number of processors (on the server, nproc only used for getting templates with structure_search and limited to 4 processors) you can set prediction.nproc. If you want to set the random seed used in prediction to a particular value you can set prediction.random_seed (and also usually set use_supplied_random_seed_in_prediction=True).

Literature

Additional information

List of all available keywords