Ensemble creation with Ensembler

Author(s)

Purpose

Ensembler can be used to superpose multiple chains to be used as an ensemble search model for molecular replacement.

Usage

Ensembler can be run from the PHENIX GUI and the command line, the only difference being the way commands are taken from the user.

The graphical user interface makes all settings accessible either as part of the main window (for frequently used options) or as a dialog box Ensemble generation settings.... Input files are specified either by the +/- button pair, or by drag-and-drop onto the window area. File types are automatically recognized, and added to the relevant input section.

../images/ensembler_gui.png

Input files

Output files

The superposed chains can be written out either as a quasi multiple model PDB file that is readable by phaser directly (output style merged, output file name root_merged.pdb) or as a series of files containing each chain separately (output style separate, file name root_pdb_chain.pdb, where pdb is the name of the PDB file the chain was read from, and chain is the chain identifier). The file name root can be changed via the root parameter of the output (default: ensemble).

Description

The workflow consists of several stages that can be independently configured. These are listed in order of execution, and control parameters are accessible from the Ensemble generation settings... dialog box. For a summary of all parameters with the corresponding defaults, see the Additional information section.

Residue mapping

Establishes the equivalence of residues among the input chains. There are several options available:

Most common residue mapping mode is ssm. If no SSM alignment can be done or this is imprecise (e.g. no secondary structure), muscle is a good second choice.

Atom mapping

Maps selected atoms within equivalent residues to each other. The mapping is done by name hence the order of atoms in the residue does not matter. If atoms are missing from certain residues (or if certain residues contain extra atoms), a gap will be filled where necessary. Atom selection is controlled by the atoms parameter of the configuration scope. Default atom selection: CA.

Superposition

Equivalent positions are superposed iteratively to find a globally optimal solution. There are two superposition algorithms implemented, which primarily differ in how they handle gaps in equivalent positions.

Both algorithms use Diamond's formulation to solve the pairwise rotational superposition problem (Diamond, 1988).

An exception is raised if there are less than 3 sites present for superposition.

Multiple superposition is an iterative process and consists of a series of pairwise superpositions. The convergence criterion is controlled by the convergence parameter (in the superposition scope), which is the r.m.s. difference change between two consecutive iterations.

Weighting

Automatic weighting can be used to improve superposition, either to amplify highly homologous regions or to decrease the effect of incorrect site-equivalence (typically arises because of a wrong alignment). Implemented weighting schemes are as follows:

Weighting is iterated with superposition until weights converge, which can be controlled by the convergence parameter of the weighting scope.

In case of highly dissimilar structures (or incorrect residue mapping), weight determination may temporarily need to be damped to avoid divergence. This is done automatically (in steps controlled by the incremental_damping_factor parameter of the weighting scope), until a preset value (controlled by the max_damping_factor parameter of the weighting scope) is reached, at which point an exception is raised.

Cluster analysis

Hierarchical cluster analysis is performed using the pairwise r.m.s. differences as a distance measure. The clustering parameter of the configuration score can be used to adjust cluster boundaries.

Chain trimming

This option trims residues from the final superposed model where the unweighted r.m.s.d. is above a certain threshold (threshold parameter of the trimming scope). Useful in removing flexible loops, etc. Default: no trimming.

Sorting

After superposition is complete, the chains can be sorted by sequence identity (identity), fraction of common sites wrt all aligned atom positions (overlap), weighted r.m.s.d. (wrmsd) or unweighted r.m.s.d. (unwrmsd). This is controlled by the sort parameter of the output scope. Default: input order (input).

Command line

phenix.ensembler \
    [ command-line switches ] \
    [ PHIL-format parameter files ] \
    [ PHIL command-line assignments ] \
    [ PDB-files ] \
    [ alignment files ]

Command-line switches:

-h, --help            show this help message and exit
--show-defaults       print PHIL and exit
-i, --stdin           read PHIL from stdin as well
-v, --verbosity       set verbosity level (DEBUG,INFO,WARNING,VERBOSE)

PHIL arguments:

Everything not starting with a dash('-') is interpreted as a PHIL argument. This can be a PHIL-format file containing parameters, command-line assignment or a file whose type is automatically recognized (based on file extension; structure files and alignment files are recognized automatically).

Specific limitations and possible problems

Processing features

  • Ensembler considers each chain individually and therefore it is not possible to superpose assemblies. The rationale is that its primary purpose is ensemble generation for molecular replacement and since intermolecular interactions are weak, assemblies are unlikely to be preserved, and one would obtain better models by an assembly of ensembles of monomers than the other way round.
  • Very short residue segments (shorter than 3 consecutive residues) cannot be reliably aligned to the sequence, and these will be discarded from the superposition.

Warning and error messages

  • No alignments given: no alignments were specified and a user-supplied alignment based mapping method was requested.
  • Unsuitable alignment set: different master sequences: this indicates that the series of input alignments cannot be assembled because of a different master sequence.
  • Several alignment files specified; using first one only: this warning indicates that several alignment files were specified, but multiple_alignment was selected as mapping mode, which cannot assemble alignments.
  • No matching alignment for chain...: no alignment sequence matches chain ... and a user-supplied alignment based mapping method was requested.
  • SSM error:...: an error occurred during SSM superposition.
  • Less than 2 chains for superposition: superposition cannot proceed because there are less than 2 chains found.
  • Less than 3 sites for superposition: superposition cannot proceed because there are less than 3 sites available (this is checked after discarding gaps). If this error occurs, discarding the chain with the lowest overlap (printed just before the error is raised) may allow one to proceed.
  • Excessive weight shift, damping weight change: this warning is emitted if weights need to be damped, because all weights are zero.
  • Weight damping recovery exhausted: this error is raised if even after successive rounds of re-weighting, all weights are still zero. This either indicates incorrect alignment of the chains or too tight expected error.

References

[Diamond1988]A note on the rotational superposition problem R. Diamond Acta Cryst. A44, 211-216 (1988)
[Diamond1992]On the multiple simultaneous superposition of molecular structures by rigid body transformations R. Diamond Protein Science 1, 1279-1287 (1992)
[WangSnoeyink2008]Defining and computing optimum RMSD for gapped and weighted multiple-structure alignment X. Wang and J. Snoeyink EEE/ACM Transactions on Computational Biology and Bioinformatics 5, 525-533 (2008)

Additional_information

List of all available keywords