Python-based Hierarchical ENvironment for Integrated Xtallography |
Documentation Home |
Automated protein-ligand structure determination with phenix.ligand_pipeline
Overviewphenix.ligand_pipeline is an automation system combining Xtriage, Phaser, eLBOW, phenix.refine, AutoBuild, and LigandFit, with optional interaction with Coot. In favorable cases, it can produce a near-finished structure of a protein-ligand complex starting from minimal inputs, and can significantly reduce the manual effort required for more difficult structures. Input filesThe typical input for the pipeline consists of processed experimental data, a starting model for molecular replacement, a sequence file equivalent to the starting model (with the same number of chains), and a source of ligand information. File types will be automatically detected if explicit parameter names are not used, for instance: phenix.ligand_pipeline model.pdb data.mtz seq.fa lig.smi The data may be in any format; if R-free flags are not found they will be generated automatically, but you may also supply the flags in a separate file (r_free_flags.file_name=flags.mtz). The sequence file is optional but highly recommended. To specify the ligand geometry, several mutually exclusive options are available:
You may also supply a PDB file (template_pdb=lig.pdb) containing the atom names to use. However, PDB files should not be the primary source of ligand geometry information! Additionally, if you have additional ligands in your input model, you can supply CIF files for refinement with the extra_cif keyword. You should also specify keep_hetatms=True to prevent Phaser from resetting the occupancies to zero. Finally, if all files are in the same directory and their uses can be unambiguously determined, you may supply the path name as a single argument when launching the program: phenix.ligand_pipeline /path/to/data [options] Program workflowData preparation. The program begins by importing the experimental data and converting it to amplitudes with standard MTZ column labels (this simplifies later steps). R-free flags will be generated if not available, otherwise the user-supplied flags will be used (extended to higher resolution if necessary). Xtriage will be run to assess data quality and guess an appropriate number of search copies for MR (which, by default, will be the same as the number of ligands to look for). Ligand generation. eLBOW will be run on the input ligand definition to generate suitable geometry restraints and a starting structure for LigandFit. The most important option is the choice of whether to run the semi-empirical AM1 quantum-mechanical optimization or not (optimize=True). This is not done by default because it is more time-consuming, but in some cases it may lead to improved geometry. For troublesome cases, we recommend generating the ligand restraints and structure manually prior to running the pipeline, then supplying the output CIF as input to the pipeline with the additional argument keep_input_restraints=True. In this case, eBLOW will only be used to generate a PDB file, and the input restraints will be used for refinement. Molecular replacement. Although molecular replacement is optional if the starting model is already nearly-complete and isomorphous to the target structure, in most cases it is preferable to run Phaser. The number of search copies will be determined automatically by Xtriage (copies=Auto), but since this will occasionally guess too few copies, you can explicitly set the search number instead (e.g. copies=2). If Phaser finds fewer copies than expected, the number of expected ligands will automatically be adjusted. Non-water heteroatoms will be preserved at their original occupancies, but alternate conformations will be removed. The model will also be processed with Sculptor to ensure that all residues match the target sequence. If you want to preserve a specific placement or orientation of the MR solution, you can specify a previously determined model by passing the reference_structure parameter. Initial refinement. The MR solution (or starting model, if MR was not performed) will be refined using a relatively conservative resolution-dependent strategy, by default rotamer fitting, real-space and reciprocal-space coordinate minimization, individual B-factor refinement, and TLS groups. Torsion NCS will be used if appropriate. Rigid-body refinement will be used if Phaser was not previously run. Water molecules will be added if the resolution is sufficient, but these will be removed prior to ligand fitting. Simulated annealing may be run if desired (anneal=True), but this will significantly increase the runtime. Because rapid convergence is more important at this point than ideal geometry, weight optimization will not be used by default. If the refined R-free is above 0.5, the program will immediately exit, since this indicates that the structure may be incorrect or at least in need of extensive manual rebuilding (or possibly additional search copies for MR). If R-free is above 0.3, the default behavior (build=Auto) is to rebuild the model in AutoBuild. Otherwise, the structure is considered suitable for ligand fitting. Rebuilding. Rebuilding is optional but may be essential in some cases where large rearrangements are required. AutoBuild will be run with rebuild_in_place=Auto, which in most cases means that residues will not be added or removed. If run, AutoBuild is generally the most time-consuming stage in the pipeline, although it can take advantage of multiple processors. An alternate protocol is to simply delete those residues with poor fit to density, with the goal of moving them out of the way of the ligand binding site. This can be turned on by specifying prune=True; you may also wish to set build=False. Note that residues deleted this way will not be restored by running AutoBuild in rebuild-in-place mode. Ligand placement. Currently only LigandFit is supported, although additional options will be available in the future. The target map will be the mFo-DFc map from phenix.refine. The parameters have been adjusted slightly for optimal performance at the expense of slightly longer runtime; you may specify quick=True or aggressive=True to override the defaults. Real-space refinement will be performed on ligands after placement to obtain an optimized fit and local geometry. Unlike most of the other steps, here the resolution will be truncated by default to 1.8Å (ligandfit.d_min=1.8), because higher-resolution data tend to lead to lower correlations even when the fit is excellent. An essential feature of automatic ligand fitting is a cutoff model-to-map CC to prevent false positives (i.e. spurious placement). By default this is set to 0.7, which has been empirically determined to nearly always indicate a successful fit. Any ligands with a CC below this value will be omitted from the model. In some cases this may actually reject correct fits (see Troubleshooting section below), but we recommend that these solutions be viewed with suspicion. If no acceptable ligand fits are obtained, the pipeline will skip further refinement and exit early with an error message. In many cases the one or more ligands will be placed, but fewer than the desired number; in these situations, a separate post-LigandFit routine will first attempt to use NCS operators to place additional copies of the ligand, pruning the model as necessary to avoid clashes. If there are still fewer copies than desired, the pipeline will continue with refinement but print a warning at the end of the run. Final refinement and validation. If the ligand placement was at least partially successful, a final round of refinement will be performed on the combined model. In this run, waters will be added automatically if the resolution is at least 2.9Å, and the geometry weight will be optimized if the resolution is worse than 1.75Å (this is much slower but can run on multiple processors, and generally produces a superior model). Interactive modeFor users desiring more control over the rebuilding process, or for difficult cases where more aggressive model modifications are required, the program can be run semi-interactively by adding the flag interactive=True (or alternately, --interactive). This will launch Coot after the first round of refinement, with model and maps loaded, allowing rebuilding to take place. The recommended procedure is to locate the putative ligand density in the mFo-DFc map, and adjust the protein structure as necessary to remove overlap with the ligand binding site (as well as any other modifications that will not be automatically fixed by refinement and/or AutoBuild). Once rebuilding is complete, click the button "Return to PHENIX" to save the edited model and resume the pipeline (the next step will be LigandFit, with a freshly calculated mFo-DFc map). At any point during rebuilding, you can obtain new maps for the edited model by clicking "Fetch new maps" on the toolbar, which will write out a temporary model and recalculate maps in the background, automatically loading these into Coot when finished. Because the interactive mode overcomes many of the limitations present when significant rebuilding is required, it is the reocmmended method for running the pipeline, especially for new users. Output filesAll files will be written to a directory named pipeline_X, where X is an automatically incremented integer. Most of the important output files will be named after the ligand code, by default LIG:
The Coot script can be run like this: coot --script pipeline_1/visualize_coot.py This will open the final model and map coefficients, and if ligand fitting was successful, a window will be created containing buttons to zoom in on the ligand(s). Inside the output directory, separate subdirectories will be created for each stage of the pipeline, e.g. phaser_0, refine_0, refine_1, LigandFit_run_1_. You do not normally need any of the files in these directories, but they may be useful for debugging purposes. In particular, if LigandFit failed to place one or more copies, the ligand PDB files will not be incorporated into the final model, but will still be available in the LigandFit output directory. Next stepsAlthough the pipeline can produce nearly-complete models in favorable cases (typically at moderate-to-high resolution, approximately 2.2 to 1.8Å), the structures are unlikely to be truly complete. Several recommendations for finishing the refinement in Coot:
More generally, with any automation system, two rules apply: exercise appropriate skepticism and seek expert opinion if unsure of the results. (If a local expert is not available, email help@phenix-online.org for advice.) Limitations
TroubleshootingThe pipeline may fail for a number of reasons, usually because LigandFit could not successfully place any copies of the ligand. If this happens, the program will skip the final refinement step and exit early with an error message. You can still run the Coot script to load the output of the previous refinement or AutoBuild, which may provide clues about the cause of failure. Several common problems are listed below. Note that these may also apply to cases where the pipeline runs to completion but was unable to place all copies of the ligand successfully (in which case a warning message will be printed).
References
List of all ligand_pipeline keywords------------------------------------------------------------------------------- Legend: black bold - scope names black - parameter names red - parameter values blue - parameter help blue bold - scope help Parameter values: * means selected parameter (where multiple choices are available) False is No True is Yes None means not provided, not predefined, or left up to the program "%3d" is a Python style formatting descriptor ------------------------------------------------------------------------------- input_files seq_file= None Sequence of the crystallized protein. This should correspond to the search model, i.e. if the model contains two copies of the protein, the sequence file should as well. model= None Search model for MR (or starting model for refinement). Water molecules will be removed. Additional ligands may be present but their occupancy will be set to zero by Phaser unless you specify keep_hetatms=True. ligand= None Ligand description in chemical format (e.g. SMILES) ligand_smiles= None SMILES string for ligand ligand_pdb= None PDB file with ligand geometry and/or atom names ligand_code= None Chemical components code for ligand (if no file is provided), for molecules already present in the PDB. extra_cif= None Additional CIF file(s) for ligands already present in the structure. reference_structure= None If specified, phenix.find_alt_orig_sym_mate will be applied to map the solution to the reference structure. Primarily intended for testing, but also useful in cases where a common reference frame is desired. xray_data wavelength= None file_name= None labels= None high_resolution= None low_resolution= None outliers_rejection= True Remove basic wilson outliers , extreme wilson outliers , and beamstop shadow outliers french_wilson_scale= True sigma_fobs_rejection_criterion= None sigma_iobs_rejection_criterion= None ignore_all_zeros= True force_anomalous_flag_to_be_equal_to= None phase_labels= None space_group= None unit_cell= None french_wilson max_bins= 60 Maximum number of resolution bins min_bin_size= 40 Minimum number of reflections per bin r_free_flags file_name= None This is normally the same as the file containing Fobs and is usually selected automatically. label= None test_flag_value= None This value is usually selected automatically - do not change unless you really know what you're doing! ignore_r_free_flags= False Use all reflections in refinement (work and test) disable_suitability_test= False ignore_pdb_hexdigest= False If True, disables safety check based on MD5 hexdigests stored in PDB files produced by previous runs. generate= False Generate R-free flags (if not available in input files) fraction= 0.1 max_free= 2000 lattice_symmetry_max_delta= 5 use_lattice_symmetry= True use_dataman_shells= False Used to avoid biasing of the test set by certain types of non-crystallographic symmetry. n_shells= 20 output prefix= None directory= None make_subdir= True If True, the program will create a subdirectory for the current run. directory_number= None verbose= False job_title= None Job title in PHENIX GUI, not used on command line coot_absolute_paths= True runtime nproc= 1 Number of processors to use for parallelized steps (including LigandFit, AutoBuild, and phenix.refine with weight optimization). If set to Auto, the program will use as many CPUs as are available, minus the current load average. ligand_copies= None Number of copies of the ligand to place. By default, this is the same as the number of copies as the MR search model. skip_xtriage= False skip_mr= False prune= Auto Prune the model after refinement to remove residues and sidechains in poor density. build= Auto Run AutoBuild after initial refinement. By default, this will be done if R-free is greater than the max_r_free cutoff. skip_ligand= False refine_after_fitting= True Run phenix.refine a final time after ligand placement. If not True, new maps will be calculated instead. refine_ligand_occupancy= False mr_program= *Auto phaser mrage Program to use for molecular replacement. Will use Phaser by default if a PDB file is included in input, otherwise will use phaser.mrage (sequence required). method= *ligandfit Program to use for ligand placement. Currently only LigandFit is supported, but other programs will be added in the future. fill_maps= True extend_model= False validate= True sort_hetatms= True Rearrange non-polymer heteroatoms (waters and ligands) to have the same chain ID and asymmetric unit as the nearest polymer chain. run_command= None ncs_for_ligands= True Use NCS operators to place additional ligands if LigandFit was not able to find all copies with acceptable CC. interactive= False Run program interactively with manual editing in Coot between refinement (or autobuilding) and ligand-fitting steps. Requires that Coot be installed and present on the path. stop_if_r_free_greater_than= 0.5 Cutoff for continuing the program after initial round of refinement - usually an R-free above 0.5 indicates that the model contains serious defects which cannot be automatically fixed (or is simply wrong altogether). Set to None to disable this check. xtriage i_over_sigma_cutoff= None completeness_cutoff= None phaser d_min= None Resolution limit for running Phaser. copies= Auto Number of copies to search for. If Auto, Xtriage will be used to guess based on the expected solvent content. identity= None Sequence identity of the search model to the target structure. If None and rmsd is also None, perfect sequence identity will be assumed. rmsd= None RMSD of the search model to the target structure. If not specified, the sequence identity will be used instead. sg_alt= none hand *all Choice of alternate space groups to try. ignore_packing= False pack_cutoff= 10 clean_model= True If True, alternate conformations will be removed and all atoms will have their occupancies reset to 1. sculpt= True Run Sculptor to homogenize sidechains before MR keep_hetatms= True If True, HETATMs will not have their occupancies reset to zero. sort_first= True force_accept_composition= False prune_model resolution_factor= 1/4. Map grid spacing (multiplied times d_min). sidechains= True Remove poor sidechains mainchain= False Remove entire residues in poor density min_backbone_2fofc= 0.8 Minimum 2mFo-DFc sigma level at C-alpha to keep. Residues with C-alpha in density below this cutoff will be deleted. min_backbone_fofc= -3.0 Maximum mFo-DFc sigma level at C-alpha to keep. Residues with C-alpha in difference density below this cutoff will be deleted. min_sidechain_2fofc= 0.6 Minimum mean 2mFo-DFc sigma level for sidechain atoms to keep. Residues with sidechains below this level will be truncated. max_sidechain_fofc= -2.8 Maximum mean 2mFo-DFc sigma level for sidechain atoms to keep. Residues with sidechains below this level will be truncated. min_cc= 0.7 Minimum overall CC for entire residue to keep. min_cc_sidechain= 0.6 Minimum overall CC for sidechains to keep. min_fragment_size= 3 Minimum fragment size to keep. Fragments smaller than this will be deleted in the final step (based on the assumption that the adjacent residues were already removed). Set this to None to prevent fragment filtering. check_cgamma= True Check for poor density at the C-gamma atom for long sidechains. Useful in cases where the terminal atoms may have been misfit into nearby density. autobuild d_min= None Resolution for running AutoBuild max_r_free= 0.3 R-free cutoff for deciding whether to run AutoBuild when build=Auto. max_frac_missing= 0.05 quick= False Run AutoBuild in quick mode. Inferior results, but a huge time-saver. rebuild_in_place= Auto Controls whether AutoBuild will only rebuild the existing model, without adding or removing atoms, or build an entirely new model. include_input_model= True Specifies whether to incorporate parts of the input model when building a new model (i.e. rebuild_in_place=False). phil_file= None Parameters file for AutoBuild. refine d_min= None ncs_type= *Auto torsion cartesian None twin_law= Auto phil_file= None File containing additional parameters for phenix.refine. after_mr cycles= 6 Number of macrocycles of refinement to run. optimize_weights= False Perform weight optimization (geometry and ADP restraints). sites= True Refine with individual_sites strategy. real_space= Auto rigid_body= Auto Run rigid-body refinement. By default this is only run if MR was not used to place the model. adp_type= *Auto aniso iso tls= Auto update_waters= Auto Add/remove waters automatically. These will not interfere with ligand placement, since the fitting programs ignore them. anneal= False Run simulated annealing. May increase radius of convergence, especially for rough initial models, but significantly adds to runtime. ready_set= False Run ready_set to generate additional parameter files for phenix.refine. Hydrogens are not added at this stage. after_ligand optimize_weights= Auto Perform a grid search to optimize the XYZ and ADP restraint weights. Not usually necessary if d_min <= 1.75. hydrogens= True Add hydrogens to model with phenix.ready_set. If Auto, the decision will be made based on resolution, the presence of hydrogens in the input model, or a mention of riding hydrogens in the PDB header. cycles= 6 Number of macrocycles. tls= Auto Refine TLS groups. adp_type= *Auto aniso iso B-factor parameterization waters= Auto ions= None List of element symbols for ion picking (separated by commas or spaces). outlier_rejection= True elbow ligand_id= Auto Three-letter residue code to use in output PDB file. If not specified, will default to LIG . optimize= False Run AM1 quantum-mechanical optimization on ligand geometry. initial_geometry= None Starting geometry for AM1 optimization. keep_input_restraints= True If True, and the input files include pre-calculated restraints for the target ligand, eLBOW will propagate these restraints instead of generating new ones. template_pdb= None PDB file containing atom names to use for ligand. If not specified (and not present in the input ligand file), the atom names will be generated automatically. ligandfit d_min= 1.5 Resolution cutoff for ligand fitting. For atomic resolution datasets, LigandFit usually works better at slightly lower resolution. map_type= *mFo-DFc quick= False Run LigandFit in quick (single-job) mode. Not recommended. aggressive= False Adjusts the LigandFit parameters to search for more conformations, at the expense of increased runtime. min_cc= None conformers= 20 n_group_search= 5 ligand_near_residue= None ligand_near_chain= None min_ligand_cc_keep= 0.7 Minimum CC to consider a ligand placement correct. Ligands with at least this CC will be incorporated into the current model for refinement. min_overall_cc_stop= 0.75 Minimum overall CC (for all ligands) to consider the structure complete. maxent= False ligand_ncs Parameters for placing missing ligands using NCS operators between protein chains. d_min= 2.5 max_rmsd= 2.0 min_cc= 0.7 min_cc_reference= 0.85 min_2fofc= 1.0 min_dist_center= 5 remove_clashing_atoms= True clash_cutoff= 2.0 write_sampled_pdbs= False mode= refit *copy extension selection= None build_hydrogens= Auto max_atoms_missing= None use_rotamers= True anneal_residues= False skip_rsr= False output_model= None output_map_coeffs= None sorting preserve_chain_id= False The default behavior is to group heteroatoms with the nearest macromolecule chain, whose ID is inherited. This parameter disables the change of chain ID, and preserves the original chain ID. waters_only= False Rearrange waters, but leave all other ligands alone. sort_waters_by= none *b_iso Ordering of waters - by default it will sort them by the isotropic B-factor. set_hetatm_record= True Set HETATM label ignore_selection= None Selection of atoms to skip. Any residue group which overlaps with this selection will be preserved with the original chain ID and numbering. renumber= True Renumber heteroatoms once they are in new chains. sequential_numbering= True If True, the heteroatoms will be renumbered starting from the next available residue number after the end of the associated macromolecule chain. Otherwise, numbering will start from 1. distance_cutoff= 6.0 Cutoff for identifying nearby macromolecule chains. This should be kept relatively small for speed reasons, but it may miss waters that are far out in solvent channels. loose_chain_id= X Chain ID assigned to heteroatoms that can't be mapped to a nearby macromolecule chain. testing Parameters for testing and benchmarking the program - not intended for general use. pdb_id= None If specified, the program will download all relevant data including search model and SMILES string from the PDB, and exit. |