Automated protein-ligand structure determination with phenix.ligand_pipeline

Overview
Input files
Program workflow
Interactive mode
Output files
Next steps
Limitations
Troubleshooting
Other notes about use
References
List of all available keywords

Overview

phenix.ligand_pipeline is an automation system combining Xtriage, Phaser, eLBOW, phenix.refine, AutoBuild, and LigandFit, with optional interaction with Coot. In favorable cases, it can produce a near-finished structure of a protein-ligand complex starting from minimal inputs, and can significantly reduce the manual effort required for more difficult structures.

Input files

The typical input for the pipeline consists of processed experimental data, a starting model for molecular replacement, a sequence file equivalent to the starting model (with the same number of chains), and a source of ligand information. File types will be automatically detected if explicit parameter names are not used, for instance:

phenix.ligand_pipeline model.pdb data.mtz seq.fa lig.cif

The data may be in any format; if R-free flags are not found they will be generated automatically, but you may also supply the flags in a separate file (r_free_flags.file_name=flags.mtz). This is espeically recommended if you are working with multiple isomorphous structures. The sequence file is optional but highly recommended.

To specify the ligand geometry, several mutually exclusive options are available:

Chemical file, such as MOL2, CIF (this can include existing restraints), or SMILES: ligand=lig.cif
SMILES string: ligand_smiles=c1ccccc1
For ligands already present in the PDB, if you know the three-letter code used, you may specify this instead: ligand_code=ATP

It is recommended that you use the most detailed and explicit information about the ligand chemistry and geometry that you have available: a pre-generated and manually validated CIF is ideal. You may also supply a PDB file (template_pdb=lig.pdb) containing the atom names to use. However, PDB files should not be the primary source of ligand geometry information! Additionally, if you have additional ligands in your input model, you can supply CIF files for refinement with the extra_cif keyword. You should also specify keep_hetatms=True to prevent Phaser from resetting the occupancies to zero.

Finally, if all files are in the same directory and their uses can be unambiguously determined, you may supply the path name as a single argument when launching the program:

phenix.ligand_pipeline /path/to/data [options]

Program workflow

Data preparation. The program begins by importing the experimental data and converting it to amplitudes with standard MTZ column labels (this simplifies later steps). R-free flags will be generated if not available, otherwise the user-supplied flags will be used (extended to higher resolution if necessary). Xtriage will be run to assess data quality and guess an appropriate number of search copies for MR (which, by default, will be the same as the number of ligands to look for).

Ligand generation. eLBOW will be run on the input ligand definition to generate suitable geometry restraints and a starting structure for LigandFit. The most important option is the choice of whether to run the semi-empirical AM1 quantum-mechanical optimization or not (optimize=True). This is not done by default because it is more time-consuming, but in some cases it may lead to improved geometry. For troublesome cases, we recommend generating the ligand restraints and structure manually prior to running the pipeline, then supplying the output CIF as input to the pipeline with the additional argument keep_input_restraints=True. In this case, eBLOW will only be used to generate a PDB file, and the input restraints will be used for refinement.

Model preparation. For each input model, a series of cleanup steps is run to remove specific details such as alternate conformations, hydrogen atoms, or waters. The model will also be processed with Sculptor to ensure that all residues match the target sequence. (This will not extend truncated sidechains, but this is available as a later step after ligand placement.)

Model placement. Two options are available, rigid-body refinement or molecular replacement:

Molecular replacement. This uses the full MR_AUTO mode in Phaser, which includes rigid-body and group B-factor refinement (although the B-factors are not output). The number of search copies will be determined automatically by Xtriage (copies=Auto), but since this will occasionally guess too few copies, you can explicitly set the search number instead (e.g. copies=2). If Phaser finds fewer copies than expected, the number of expected ligands will automatically be adjusted. Non-water heteroatoms will be preserved at their original occupancies. If you want to preserve a specific placement or orientation of the MR solution, you can specify a previously determined model by passing the reference_structure parameter.
Rigid-body refinement. If the input model is already close enough to the target structure, rigid-body refinement alone will be suitable.

Currently MR is run by default, but this will be changed in the future; if you are confident that the model can be correctly placed without MR, add mr=False or skip_mr=True or --skip_mr to jump straight to refinement. (This will automatically turn on rigid-body refinement in the subsequent step.) A more useful (but slightly slower) option is mr=Auto, which will attempt rigid-body refinement and then revert to running Phaser if the R-free is greater than 0.4 or no models with appropriate symmetry are found.

In automatic or MR mode, you may specify multiple models as input; the best model will be selected either by Phaser or the rigid-body refinement step. The space group will be updated based on the Phaser solution or selected model if necessary.

Initial refinement. The MR solution (or starting model, if MR was not performed) will be refined using a relatively conservative resolution-dependent strategy, by default rotamer fitting, real-space and reciprocal-space coordinate minimization, individual B-factor refinement, and TLS groups. Torsion NCS will be used if appropriate. Rigid-body refinement will be used if Phaser was not previously run. Water molecules will be added if the resolution is sufficient, but these will be removed prior to ligand fitting. Simulated annealing may be run if desired (anneal=True), but this will significantly increase the runtime. Because rapid convergence is more important at this point than ideal geometry, weight optimization will not be used by default. (Water picking is used because this improves the maps, but the solvent atoms will be removed prior to ligand fitting to avoid overlap.)

If the refined R-free is above 0.5, the program will immediately exit, since this indicates that the structure may be incorrect or at least in need of extensive manual rebuilding (or possibly additional search copies for MR). If R-free is above 0.3, the default behavior (build=Auto) is to rebuild the model in AutoBuild. Otherwise, the structure is considered suitable for ligand fitting.

Rebuilding. Rebuilding is optional but may be essential in some cases where large rearrangements are required. AutoBuild will be run with rebuild_in_place=Auto, which in most cases means that residues will not be added or removed. If run, AutoBuild is generally the most time-consuming stage in the pipeline, although it can take advantage of multiple processors.

An alternate protocol, run by default, is to simply delete those residues with poor fit to density, with the goal of moving them out of the way of the ligand binding site. Either the mainchain or sidechain atoms or both can be removed, but the default is to only prune sidechains. These may optionally be rebuilt after the ligand is placed.

As a final preparation step, waters are removed from the model and the difference map is recalculated.

Ligand placement. Currently only LigandFit is supported, although additional options will be available in the future. The target map will be the mFo-DFc map from phenix.refine. The parameters have been adjusted slightly for optimal performance at the expense of slightly longer runtime; you may specify quick=True or aggressive=True to override the defaults. Real-space refinement will be performed on ligands after placement to obtain an optimized fit and local geometry. Unlike most of the other steps, here the resolution will be truncated by default to 1.8Å (ligandfit.d_min=1.8), because higher-resolution data tend to lead to lower correlations even when the fit is excellent.

An essential feature of automatic ligand fitting is a cutoff model-to-map CC to prevent false positives (i.e. spurious placement). By default this is set to 0.7, which has been empirically determined to nearly always indicate a successful fit. Any ligands with a CC below this value will be omitted from the model. In some cases this may actually reject correct fits (see Troubleshooting section below), but we recommend that these solutions be viewed with suspicion. If no acceptable ligand fits are obtained, the pipeline will skip further refinement and exit early with an error message.

In many cases the one or more ligands will be placed, but fewer than the desired number; in these situations, a separate post-LigandFit routine will first attempt to use NCS operators to place additional copies of the ligand, pruning the model as necessary to avoid clashes. If there are still fewer copies than desired, the pipeline will continue with refinement but print a warning at the end of the run.

Final refinement and validation. If the ligand placement was at least partially successful, a final round of refinement will be performed on the combined model. In this run, waters will be added automatically if the resolution is at least 2.9Å, and the geometry weight will be optimized if the resolution is worse than 1.75Å (this is much slower but can run on multiple processors, and generally produces a superior model).

Interactive mode

For users desiring more control over the rebuilding process, or for difficult cases where more aggressive model modifications are required, the program can be run semi-interactively by adding the flag interactive=True (or alternately, --interactive). This will launch Coot after the first round of refinement, with model and maps loaded, allowing rebuilding to take place. The recommended procedure is to locate the putative ligand density in the mFo-DFc map, and adjust the protein structure as necessary to remove overlap with the ligand binding site (as well as any other modifications that will not be automatically fixed by refinement and/or AutoBuild). Once rebuilding is complete, click the button "Return to PHENIX" to save the edited model and resume the pipeline (the next step will be LigandFit, with a freshly calculated mFo-DFc map). At any point during rebuilding, you can obtain new maps for the edited model by clicking "Fetch new maps" on the toolbar, which will write out a temporary model and recalculate maps in the background, automatically loading these into Coot when finished.

Because the interactive mode overcomes many of the limitations present when significant rebuilding is required, it is the reocmmended method for running the pipeline, especially for new users.

Output files

All files will be written to a directory named pipeline_X, where X is an automatically incremented integer. Most of the important output files will be named after the ligand code, by default LIG:

LIG_data.mtz: filtered data and R-free flags
LIG_final.pdb: final output model
LIG_final.mtz: final map coefficients (as well as data and flags)
LIG_ligand.cif: ligand geometry restraints
pipeline.log: log file for the entire run
visualize_coot.py: script for loading results in Coot

The Coot script can be run like this:

coot --script pipeline_1/visualize_coot.py

This will open the final model and map coefficients, and if ligand fitting was successful, a window will be created containing buttons to zoom in on the ligand(s).

Inside the output directory, separate subdirectories will be created for each stage of the pipeline, e.g. phaser_0, refine_0, refine_1, LigandFit_run_1_. You do not normally need any of the files in these directories, but they may be useful for debugging purposes. In particular, if LigandFit failed to place one or more copies, the ligand PDB files will not be incorporated into the final model, but will still be available in the LigandFit output directory.

Next steps

Although the pipeline can produce nearly-complete models in favorable cases (typically at moderate-to-high resolution, approximately 2.2 to 1.8Å), the structures are unlikely to be truly complete. Several recommendations for finishing the refinement in Coot:

Inspect the ligand binding closely. A number of structures in the PDB have been shown to exhibit ligand placements which are not supported by the data, leading to several retractions so far. Although the CC cutoff for ligand placement is designed to avoid this outcome, it is essential that you visually confirm the refined ligand position. Of particular concern:
- Does the difference map show residual positive or (especially) negative peaks around the ligand binding site? Even if the overall position is approximately correct, this may indicate problems with the specific conformation and/or geometry; it is not uncommon, for example, for nearly-symmetric rings to be fit backwards.
- Does the overall shape of the 2mFo-DFc map approximate that of the ligand? If not, this could mean that the ligand has been mistakenly placed in density for missing protein atoms, or for another unrelated ligand (such as contaminants from the crystallization solution). Note that it is quite possible for flexible ligands to be partially disordered, but in such cases the majority of the ligand should still be clearly defined in the map.
- Does the binding make biological sense? Pay special attention to hydrogen bond donors and acceptors in both the ligand and the protein, and especially the interactions of charged groups. Chemically nonsensical binding usually indicates an incorrect fit (or, in extreme cases, an incorrect ligand).
Look for additional ligands. It is quite common for protein structures to co-crystallize with buffer molecules, ions, and endogenous contaminants. These can be placed by running LigandFit, or interactively in Coot.
Check for errors or missing pieces in the protein model. Even after refinement and/or rebuilding, some local fitting problems may be present (especially at lower resolution). Additionally, it is very common for pieces of the start model to be missing from the refined structure, or vice-versa; this will usually be obvious from the maps. Alternate conformations will normally be stripped out before molecular replacement, so at high resolution (typically 1.6Å or better), these may need to be added manually. You should also check the validation results to identify geometry problems and misfitting.
If none of the above steps requires further adjustments, your model is probably complete. Otherwise, if you make any changes, you should run a final round of refinement with weight optimization to get the best possible model, and a final set of maps which should also be inspected.

More generally, with any automation system, two rules apply: exercise appropriate skepticism and seek expert opinion if unsure of the results. (If a local expert is not available, email help@phenix-online.org for advice.)

Limitations

For maximum flexibility, no restriction is placed on the quality of the starting model or its similarity to the target model; the Sculptor step will take care of any necessary sequence deletions. However, in practice it is vastly more effective when working with near-identical structures.
When working with a set of related structures that differ from previously solved crystal forms, it will probably be most productive to solve the first one interactively, then use this as a starting point for the remaining structures.
Only one type of ligand is supported at present. Essential cofactors or other molecules whose binding mode is already known may included in the input model; in these cases, you should add keep_hetatms=True to instruct Phaser to leave the occupancies unmodified. Any additional ligands will need to be placed after the final refinement.
Multiple stereoisomers will need to be screened independently; the program does not currently have the ability to evaluate different chiral forms.
Covalently bound ligands may be placed correctly, but the covalent linkage will not be handled, and nonbonded forces may distort binding. Additionally, the ligand parameterization may be based on the unbound molecule.
Structures such as HIV protease which exhibit symmetric binding of inhibitors will not be handled properly - and more generally, alternate conformations for the ligand are not modeled. (The pipeline may be able to place a single conformation, however.)
The pipeline is designed to work at resolutions of 3.0A or better; while it is possible to solve complexes at lower resolution, the ligand fitting procedure is not expected to work as well.
There is no high-resolution limit, but structures at atomic resolution will inevitably require more manual effort to complete due to the presence of alternate conformations and (often) additional small-molecule ligands, which are not presently handled by Phenix.
Building is currently limited to either rebuild-in-place mode or building an entirely new model from scratch, and rebuilds the entire model. This means that for structures where local rebuilding or addition of a few residues is required, interactive mode will probably be more successful.
Most of the optimization of program behavior and parameters has been focused on structures bound to drug-like molecules; less attention has been given to fragments. These tend to have weaker binding and partial occupancy, and are therefore more difficult to distinguish in maps.

Troubleshooting

The pipeline may fail for a number of reasons, usually because LigandFit could not successfully place any copies of the ligand. If this happens, the program will skip the final refinement step and exit early with an error message. You can still run the Coot script to load the output of the previous refinement or AutoBuild, which may provide clues about the cause of failure. Several common problems are listed below. Note that these may also apply to cases where the pipeline runs to completion but was unable to place all copies of the ligand successfully (in which case a warning message will be printed).

The protein blocks the ligand site. This is one of the most common problems, especially in structures which undergo extensive conformational changes or local rearrangements, such as kinases. Even a single stray sidechain may prevent LigandFit from placing the ligand correctly, both due to flattened difference density, and steric clashes. For structures with NCS, this can sometimes be overcome by the NCS application procedure described above, but this still requires that at least one copy was placed correctly. Running AutoBuild can sometimes help, but this also has the risk of moving the protein into ligand density. The interactive mode is strongly recommended for overcoming this problem.
Spurious difference density for other atoms. If large regions of positive mFo-DFc density corresponding to missing protein residues are present, LigandFit may sometimes place the ligand in these sites instead of the correct binding site. Again, interactive mode is the recommended solution (and an optimal search model can help too). Note that unexpected ordered buffer components may also lead to false positives; this is especially problematic for fragments.
Poor or incomplete difference density. In some cases, a particularly flexible ligand may not fit entirely into difference density; this is not uncommon or necessarily problematic, but it can reduce the CC below the cutoff required to continue. More common is poor density quality for the entire molecule; this can be caused by an incompletely refined model, low resolution, or partial binding. In some cases, it means that the ligand is not bound at all. While reducing the CC cutoff may allow some structures to proceed, we recommend inspecting the density interactively, both before and after ligand placement. Also note that maximum entropy treatment of the difference map (maxent=True) may improve the fitting in borderline cases.
Poor ligand geometry. In some cases eLBOW is unable to automatically determine the optimal geometry from the inputs. Running AM1 optimization may help for structures where the initial geometry is partially correct. However, in extreme cases the geometry will be too distorted for AM1 to work, and additional information on geometry is required. For this reason, we recommend supplying as rich a source of information as possible; in particular, if geometry from a crystal structure of the target compound is available, this should be included in the inputs.

Other notes about use

Speeding up execution: the bulk of the runtime is spent refining the model; the default protocol is very robust but not especially fast, and may be unnecessarily intensive for easy cases. You can switch to a faster protocol by specifying quick_refine=True, which will shorten both refinement steps from 6 to 3 cycles, and disable weight optimization.

Hydrogen atoms: these will be added prior to the final refinement step. Usually this will improve geometry somewhat (and also R-factors at high resolution); however, it has the side effect of making refinement significantly slower. Specfiy after_ligand.hydrogens=False to leave them off.

Rebuilding sidechains: sidechains truncated by a previous step (Sculptor or the pruning step), or missing in the input model, may optionally be restored after ligand placement by specifying extend_model=True.

Standalone tools: although most of the pipeline is built around pre-existing programs in Phenix, several new tools have been developed for this protocol and are also available as separate command-line programs:

mmtbx.select_best_starting_model: given a data file and a collection of PDB files, calculate R-factors and (optionally) run rigid-body refinement to pick the best one for further refinement
mmtbx.prune_model: remove protein atoms with poor fit to density.
mmtbx.apply_ncs_to_ligand: place additional copies of the desired ligand using NCS operators between protein chains.
mmtbx.extend_sidechains: rebuild missing protein sidechain atoms and perform quick real-space refinement.
mmtbx.validate_ligand: check for possible fitting errors in a specific ligand.
phenix.sort_hetatms: group heteroatoms (ligands, water, etc.) with the closest macromolecule(s) in the PDB file, similar to the procedure used at PDB deposition.

References

Automating crystallographic structure solution and refinement of protein-ligand complexes. N. Echols, N.W. Moriarty, H.E. Klei, P.V. Afonine, G. Bunkóczi, J.J. Headd, A.J. McCoy, R.D. Oeffner, R.J. Read, T.C. Terwilliger, and P.D. Adams. Acta Crystallogr D Biol Crystallogr 70, 144-54 (2014).

Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation. P.H. Zwart, R.W. Grosse-Kunstleve, and P.D. Adams. CCP4 Newsletter Winter, Contribution 7 (2005).

Improvement of molecular-replacement models with Sculptor. G. Bunkóczi, and R.J. Read. Acta Crystallogr D Biol Crystallogr 67, 303-12 (2011).

Phaser crystallographic software. A.J. McCoy, R.W. Grosse-Kunstleve, P.D. Adams, M.D. Winn, L.C. Storoni, and R.J. Read. J Appl Crystallogr 40, 658-674 (2007).

electronic Ligand Builder and Optimization Workbench (eLBOW): a tool for ligand coordinate and restraint generation. N.W. Moriarty, R.W. Grosse-Kunstleve, and P.D. Adams. Acta Crystallogr D Biol Crystallogr 65, 1074-80 (2009).

Towards automated crystallographic structure refinement with phenix.refine. P.V. Afonine, R.W. Grosse-Kunstleve, N. Echols, J.J. Headd, N.W. Moriarty, M. Mustyakimov, T.C. Terwilliger, A. Urzhumtsev, P.H. Zwart, and P.D. Adams. Acta Crystallogr D Biol Crystallogr 68, 352-67 (2012).

Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. T.C. Terwilliger, R.W. Grosse-Kunstleve, P.V. Afonine, N.W. Moriarty, P.H. Zwart, L.-W. Hung, R.J. Read, and P.D. Adams. Acta Cryst. D64, 61-69 (2008).

Automated ligand fitting by core-fragment fitting and extension into density. T.C. Terwilliger, H. Klei, P.D. Adams, N.W. Moriarty, and J.D. Cohn. Acta Crystallogr D Biol Crystallogr 62, 915-22 (2006).

MolProbity: More and better reference data for improved all-atom structure validation. C.J. Williams, B.J. Hintze, J.J. Headd, N.W. Moriarty, V.B. Chen, S. Jain, S.M. Prisant MG Lewis, L.L. Videau, D.A. Keedy, L.N. Deis, I.I.I. Arendall WB, V. Verma, J.S. Snoeyink, P.D. Adams, S.C. Lovell, J.S. Richardson, and D.C. Richardson. Protein Science 27, 293-315 (2018).

List of all available keywords