Ensemble creation with Ensembler

Author(s)

ensembler: Gabor Bunkoczi
PHENIX GUI: Nathaniel Echols

Purpose

Ensembler can be used to superpose multiple chains to be used as an ensemble search model for molecular replacement.

Usage

Ensembler can be run from the PHENIX GUI and the command line, the only difference being the way commands are taken from the user.

The graphical user interface makes all settings accessible either as part of the main window (for frequently used options) or as a dialog box Ensemble generation settings.... Input files are specified either by the +/- button pair, or by drag-and-drop onto the window area. File types are automatically recognized, and added to the relevant input section.

Input files

Structures (compulsory). The chains that are to be superposed. Protein chains are automatically identified, other chains are discarded. In case the structure contains multiple chains, these will all be used for superposition. Accepted formats: PDB. Recognized extensions: .pdb, .ent.
Alignment (optional). In case a user-supplied residue mapping is desired, this can be achieved by selecting the appropriate mode and supplying alignment files. Accepted extensions (with the corresponding format) are .aln (CLUSTAL format), .pir (PIR-format) and .ali (relaxed PIR-like format).

Output files

The superposed chains can be written out either as a quasi multiple model PDB file that is readable by phaser directly (output style merged, output file name root_merged.pdb) or as a series of files containing each chain separately (output style separate, file name root_pdb_chain.pdb, where pdb is the name of the PDB file the chain was read from, and chain is the chain identifier). The file name root can be changed via the root parameter of the output (default: ensemble).

Description

The workflow consists of several stages that can be independently configured. These are listed in order of execution, and control parameters are accessible from the Ensemble generation settings... dialog box. For a summary of all parameters with the corresponding defaults, see the Additional information section.

Residue mapping

Establishes the equivalence of residues among the input chains. There are several options available:

ssm - Do structure-based SSM alignment (default).
muscle - Generate a sequence alignment with MUSCLE and use that to align the residues.
multiple_alignment - A user supplied multiple alignment is used. An error message will be raised if this is not present or does not cover all present chains.
alignments - A series of alignments is used that all have the same first sequence (i.e. a series of pairwise alignments against the same target sequence). An error message will be raised if these do not cover all present chains.
resid - Residue sequence number with insertion code will be used.

Most common residue mapping mode is ssm. If no SSM alignment can be done or this is imprecise (e.g. no secondary structure), muscle is a good second choice.

Atom mapping

Maps selected atoms within equivalent residues to each other. The mapping is done by name hence the order of atoms in the residue does not matter. If atoms are missing from certain residues (or if certain residues contain extra atoms), a gap will be filled where necessary. Atom selection is controlled by the atoms parameter of the configuration scope. Default atom selection: CA.

Superposition

Equivalent positions are superposed iteratively to find a globally optimal solution. There are two superposition algorithms implemented, which primarily differ in how they handle gaps in equivalent positions.

gapless. This algorithm discards all positions where gaps are present, For further details, see Diamond (1992).

gapped. This algorithm can use all positions as long as there are two sites presents (i.e. not gaps), and may give better results for chains with distant sequence identity. For the exact algorithm, see Wang & Snoeyink (2008).

Both algorithms use Diamond's formulation to solve the pairwise rotational superposition problem (Diamond, 1988).

An exception is raised if there are less than 3 sites present for superposition.

Multiple superposition is an iterative process and consists of a series of pairwise superpositions. The convergence criterion is controlled by the convergence parameter (in the superposition scope), which is the r.m.s. difference change between two consecutive iterations.

Weighting

Automatic weighting can be used to improve superposition, either to amplify highly homologous regions or to decrease the effect of incorrect site-equivalence (typically arises because of a wrong alignment). Implemented weighting schemes are as follows:

unit - Unit weights (equivalent to no weighting).
robust_resistant - Robust-resistant weighting scheme (default). This tends to converge fast and give reliable results. The exact formula for weighting is as follows:
```
w = 1 - ( delta2 / tolerance2 )2 if delta < tolerance
w = 0 (otherwise)
```
where delta is the deviation from the average. Tolerance is an empirical value, and its optimal value is close to (unweighted r.m.s.d.) ² and can be controlled by the critical parameter of the robust_resistant scope.

Weighting is iterated with superposition until weights converge, which can be controlled by the convergence parameter of the weighting scope.

In case of highly dissimilar structures (or incorrect residue mapping), weight determination may temporarily need to be damped to avoid divergence. This is done automatically (in steps controlled by the incremental_damping_factor parameter of the weighting scope), until a preset value (controlled by the max_damping_factor parameter of the weighting scope) is reached, at which point an exception is raised.

Cluster analysis

Hierarchical cluster analysis is performed using the pairwise r.m.s. differences as a distance measure. The clustering parameter of the configuration score can be used to adjust cluster boundaries.

Chain trimming

This option trims residues from the final superposed model where the unweighted r.m.s.d. is above a certain threshold (threshold parameter of the trimming scope). Useful in removing flexible loops, etc. Default: no trimming.

Sorting

After superposition is complete, the chains can be sorted by sequence identity (identity), fraction of common sites wrt all aligned atom positions (overlap), weighted r.m.s.d. (wrmsd) or unweighted r.m.s.d. (unwrmsd). This is controlled by the sort parameter of the output scope. Default: input order (input).

Command line

phenix.ensembler \
    [ command-line switches ] \
    [ PHIL-format parameter files ] \
    [ PHIL command-line assignments ] \
    [ PDB-files ] \
    [ alignment files ]

Command-line switches:

-h, --help            show this help message and exit
--show-defaults       print PHIL and exit
-i, --stdin           read PHIL from stdin as well
-v, --verbosity       set verbosity level (DEBUG,INFO,WARNING,VERBOSE)

PHIL arguments:

Everything not starting with a dash('-') is interpreted as a PHIL argument. This can be a PHIL-format file containing parameters, command-line assignment or a file whose type is automatically recognized (based on file extension; structure files and alignment files are recognized automatically).

Specific limitations and possible problems

Processing features

Ensembler considers each chain individually and therefore it is not possible to superpose assemblies. The rationale is that its primary purpose is ensemble generation for molecular replacement and since intermolecular interactions are weak, assemblies are unlikely to be preserved, and one would obtain better models by an assembly of ensembles of monomers than the other way round.
Very short residue segments (shorter than 3 consecutive residues) cannot be reliably aligned to the sequence, and these will be discarded from the superposition.

Warning and error messages

No alignments given: no alignments were specified and a user-supplied alignment based mapping method was requested.
Unsuitable alignment set: different master sequences: this indicates that the series of input alignments cannot be assembled because of a different master sequence.
Several alignment files specified; using first one only: this warning indicates that several alignment files were specified, but multiple_alignment was selected as mapping mode, which cannot assemble alignments.
No matching alignment for chain...: no alignment sequence matches chain ... and a user-supplied alignment based mapping method was requested.
SSM error:...: an error occurred during SSM superposition.
Less than 2 chains for superposition: superposition cannot proceed because there are less than 2 chains found.
Less than 3 sites for superposition: superposition cannot proceed because there are less than 3 sites available (this is checked after discarding gaps). If this error occurs, discarding the chain with the lowest overlap (printed just before the error is raised) may allow one to proceed.
Excessive weight shift, damping weight change: this warning is emitted if weights need to be damped, because all weights are zero.
Weight damping recovery exhausted: this error is raised if even after successive rounds of re-weighting, all weights are still zero. This either indicates incorrect alignment of the chains or too tight expected error.

References

[Diamond1988]

A note on the rotational superposition problem R. Diamond Acta Cryst. A44, 211-216 (1988)

[Diamond1992]

On the multiple simultaneous superposition of molecular structures by rigid body transformations R. Diamond Protein Science 1, 1279-1287 (1992)

[WangSnoeyink2008]

Defining and computing optimum RMSD for gapped and weighted multiple-structure alignment X. Wang and J. Snoeyink EEE/ACM Transactions on Computational Biology and Bioinformatics 5, 525-533 (2008)

Additional_information

List of all available keywords

inputInput files
- model = None Input model file
- alignment = None Input alignment file
outputOutput file(s)
- location = ""
- gui_output_dir = None Sets base output directory for Phenix GUI - not used when run from the command line.
- root = ensemble Output file root
- style = *merged separate Output options for ensemble
- sort = *input identity overlap wrmsd unwrmsd Sort ensemble components
- job_title = None Job title in PHENIX GUI, not used on command line
configurationGeneral parameters for ensemble generation
- mapping = resid alignments multiple_alignment muscle *ssm Residue correspondence mapping
- atoms = CA Include atom in superpositions
- clustering = 0.5 Cutoff distance for cluster analysis
- trim = False Trim differing loops
- superpositionSuperposition setup
  - method = *gapless gapped Superposition algorithm
  - convergence = 1.0E-4 Convergence criterion for superposition
- weightingParameters for weighting scheme
  - scheme = unit *robust_resistant Weighting scheme
  - convergence = 1.0E-3 Convergence criterion for weight iteration
  - incremental_damping_factor = 1.5 Damping factor in recovery cycle
  - max_damping_factor = 3.34 Quit recovery if cumulative damping factor is above
  - robust_resistantSetting for robust-resistance scheme
    - critical = 9 tolerance
- trimmingTrimming parameters
  - threshold = 3.0 Max position RMSD for inclusion