Model preparation with sculpt_ensemble

Contents

Purpose
Usage
- Input files
- Output files
Description
- Trimming
Command line
- Command-line switches
- PHIL arguments
Additional information
- List of all available keywords

Purpose

sculpt_ensemble is a unified model preparation program that internally uses Sculptor and Ensembler. When the two programs are run separately, the result depends on the order in which they are run, and whichever order is chosen, there are corner cases where the result is suboptimal. In the unified program no information is lost, and there are further benefits as well.

Usage

The program sculpt_ensemble is available from the command line and will also be available from the PHENIX GUI.

Input files

Structures (compulsory): structures to be superposed and modified. For each structure, specific parts can be selected using PHENIX atom selection syntax, and assembly information can be associated with it using the assembly keyword, or the convenience keyword read_single_assembly_from_file (for more information please see the Ensembler documentation).

Error files associated with the input PDB file can also be input using the errors scope. See the Sculptor documentation for more details (please note that there is no input for superposition errors, since those are calculated within the ensembling step and passed onto Sculptor).

Accepted formats: PDB. Recognized extensions: .pdb, .ent.
Sequence files: target sequences (one for each assembly position). These are used to identify the target sequence for each alignment. If no alignment is available but the target sequence for the given assembly position is, sculpt_ensemble prepares an alignment using the sequence of the model chain. If neither a target sequence nor a suitable alignment is available, a dummy alignment is used instead. Accepted extensions are .fasta, .faa or .fa for FASTA format, .pir for PIR format, and .seq or .dat for a relaxed PIR/FASTA-like format.
Alignments: sequence alignments with the target sequence. An alignment contains information that can be exploited in model improvement (this is currently only implemented for protein chains). Chains are automatically associated with the corresponding alignment based on a sequence comparison. The target sequence is also identified automatically if it is provided through the target_sequence keyword (see above); otherwise the first sequence in the alignment is used as the target. Alignment input is optional: if none is provided, an alignment is made by aligning the chain sequence with itself. Accepted extensions (with the corresponding format) are .aln (CLUSTAL format), .pir (PIR format) and .ali (relaxed PIR-like format).
Homology search files: hits from a homology search. These work very similarly to a set of alignment files. It is assumed that the first sequence in each alignment is the target sequence (the sequence used for searching homologues). Accepted extensions (with the corresponding format) are .xml (BLAST-family XML output) and .hhr (HHPRED output).
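The extension lists above can be collected into a small lookup table. The following is only an illustrative sketch, not the program's own file-type detection: the function name and the case-insensitive matching are assumptions. It does show one consequence of the lists as written, namely that .pir appears under both sequence files and alignments, so the extension alone does not settle the input type.

```python
from pathlib import Path

# Extension table transcribed from the input-file descriptions in this section.
# Assumption: matching is case-insensitive; the real program may behave differently.
EXTENSIONS = {
    "structure": {".pdb", ".ent"},
    "sequence": {".fasta", ".faa", ".fa", ".pir", ".seq", ".dat"},
    "alignment": {".aln", ".pir", ".ali"},
    "homology_search": {".xml", ".hhr"},
}


def candidate_input_types(file_name):
    """Return every input category whose extension list contains the file's suffix."""
    suffix = Path(file_name).suffix.lower()
    return sorted(kind for kind, exts in EXTENSIONS.items() if suffix in exts)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the table.
    for name in ("model.pdb", "target.pir", "hits.hhr", "notes.txt"):
        print(name, candidate_input_types(name))
```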

Output files

The fully processed structure is written to a single file named root_merged.pdb, where root is a user-defined parameter (accessible from the output scope). This is similar to the merged output style of Ensembler.
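As a quick illustration of the naming convention, the helper below (a hypothetical function, not part of PHENIX) derives the expected output file name from the root parameter. The default value sculpt_ensemble is taken from the keyword list at the end of this page.

```python
def merged_output_name(root="sculpt_ensemble"):
    """Build the output file name following the root_merged.pdb convention."""
    return f"{root}_merged.pdb"


if __name__ == "__main__":
    print(merged_output_name())          # default root from the keyword list
    print(merged_output_name("my_run"))  # hypothetical user-defined root
```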

Description

The program reads in the input files, creates the assemblies, and runs ensembling first. The assemblies are then transformed, the superposition error is associated with the respective chains, and the chains are processed sequentially according to the corresponding Sculptor protocol (note that although the superposition errors are always present, they are not used unless requested explicitly by enabling a protocol). See the respective documentation for more details.

In case there is only a single assembly present, the superposition step is skipped, and the program works as the standalone Sculptor. It is also possible to disable the processing step (sculpting.disable), in which case the program works as the standalone Ensembler.
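The two special cases described above can be summarised as a small decision function. This is a sketch of the documented behaviour only: the function and the stage labels are invented for illustration and do not correspond to names inside the program.

```python
def planned_stages(assembly_count, sculpting_disabled=False):
    """List the processing stages implied by the description above.

    assembly_count     -- number of assemblies built from the input structures
    sculpting_disabled -- mirrors the sculpting.disable parameter
    """
    stages = []
    if assembly_count > 1:
        # Superposition is skipped when there is only a single assembly.
        stages.append("superposition")
    if not sculpting_disabled:
        # Processing is skipped when sculpting.disable is set.
        stages.append("sculpting")
    return stages


if __name__ == "__main__":
    print(planned_stages(3))        # full sculpt_ensemble run
    print(planned_stages(1))        # behaves like standalone Sculptor
    print(planned_stages(3, True))  # behaves like standalone Ensembler
```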

Trimming

sculpt_ensemble also supports chain trimming, but unlike the explicit option in Ensembler, it is available as the superposition option within the mainchain processing section of the sculpting machinery. It can also be used in combination with the polishing options.

Command line

phenix.sculpt_ensemble \
    [ command-line switches ] \
    [ PHIL-format parameter files ] \
    [ PHIL command-line assignments ] \
    [ PDB-files ] \
    [ alignment files ] \
    [ sequence files ]
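A command line following this synopsis can be assembled programmatically, for example when scripting several runs. The sketch below only builds and prints the argument list and does not execute anything. The helper and the file names are hypothetical, and the parameter paths output.root and sculpting.disable are assumed from the scope names used on this page.

```python
import shlex


def build_command(structures, alignments=(), root=None, disable_sculpting=False):
    """Assemble an argument list in the order given by the synopsis."""
    argv = ["phenix.sculpt_ensemble"]
    # PHIL command-line assignments come before the input files.
    if root is not None:
        argv.append(f"output.root={root}")
    if disable_sculpting:
        argv.append("sculpting.disable=True")
    argv.extend(structures)   # PDB files
    argv.extend(alignments)   # alignment files
    return argv


# Hypothetical input files, for illustration only.
command = build_command(["model_a.pdb", "model_b.pdb"], ["target.aln"], root="my_run")
print(shlex.join(command))
```

Keeping the command as a list makes it easy to hand over to subprocess.run later without worrying about shell quoting.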

Command-line switches

-h, --help            show this help message and exit
--show-defaults       print PHIL and exit
-i, --stdin           read PHIL from stdin as well
-v, --verbosity       set verbosity level (info,debug,verbose)
--version             show program's version number and exit
--text-logfile FILE   Verbatim copy of log stream
--html-logfile FILE   Verbatim copy of log stream in HTML

PHIL arguments

Everything not starting with a dash ('-') is interpreted as a PHIL argument. This can be a PHIL-format file containing parameters, a command-line assignment, or a file whose type is automatically recognized (based on its file extension). Note that sequence files are not accepted on the command line, since the associated chains cannot easily be guessed and require a fully specified parameter scope.
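The dash rule can be illustrated with a small token classifier. This is a rough sketch and not the program's actual parser. It assumes that -v/--verbosity, --text-logfile and --html-logfile consume the following token as their value, and that any remaining token containing '=' is a command-line assignment.

```python
# Switches assumed to take the next token as their value (see the switch list above).
VALUE_SWITCHES = {"-v", "--verbosity", "--text-logfile", "--html-logfile"}


def split_command_line(tokens):
    """Split tokens into (switches, PHIL assignments, file arguments)."""
    switches, assignments, files = [], [], []
    pending = None  # a switch that is still waiting for its value
    for token in tokens:
        if pending is not None:
            switches.append((pending, token))
            pending = None
        elif token.startswith("-"):
            if token in VALUE_SWITCHES:
                pending = token
            else:
                switches.append((token, None))
        elif "=" in token:
            # Heuristic: treat "name=value" as a PHIL command-line assignment.
            assignments.append(token)
        else:
            # Anything else is taken to be a file (PHIL parameter file or data file).
            files.append(token)
    if pending is not None:
        switches.append((pending, None))
    return switches, assignments, files


if __name__ == "__main__":
    # Hypothetical arguments, for illustration only.
    print(split_command_line(["-v", "debug", "output.root=run1", "model.pdb"]))
```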

Additional information

List of all available keywords

input: Input files
- alignment = None Input alignment file
- target_sequence = None Target sequence for assembly position
- structure: Input structures
  - remove_alternate_conformations = False Remove alternate conformations
  - sanitize_occupancies = False Sets occupancies > 1.0 to 1.0
  - model: Input model file
    - file_name = None Input file name
    - selection = None Selection string
    - assembly = None Assembly information
    - read_single_assembly_from_file = False Consider contents of the file as a single assembly
    - errors: Estimated errors for chain
      - file_name = None Error file name
      - chain_ids = None Which chain IDs the error file corresponds to
- homology_search: Alignment from homology search file
  - file_name = None Homology search file
  - use = None Which alignments to use
output: Output file(s)
- root = sculpt_ensemble Output file root
- gui_output_dir = None Sets base output directory for Phenix GUI - not used when run from the command line.
- job_title = None Job title in PHENIX GUI, not used on command line
chain_to_alignment_matching: Chain-to-alignment matching options
- consecutivity = *geometry numbering Consecutivity criterion to detect chain breaks
- min_hss_length = 3 Minimum length of a sequence fragment to be included in chain alignment
- max_seed_hss_count = 12 Number of HSS to use in extensive search
- max_completion_hss_count = 6 Number of HSS to use in gap filling
- min_sequence_overlap = 10 Minimum overlap between sequences to perform full alignment
- min_sequence_identity = 0.80 Minimum sequence identity of accepted chain alignment
error_search: Parameters for matching error files and/or estimating errors
- min_sequence_identity = 0.8 Minimum sequence identity to accept error file
- min_sequence_overlap = 0.8 Minimum sequence overlap to accept error file
- calculate_if_not_provided = False Use Rosetta and the ProQ2 server to get error estimates
- output_prefix = None File name for results of intermediate steps
- homology_modelling: Parameters for homology modelling
  - min_residue_margin = 2 Minimum number of residues to cut back on both sides of loops
  - residue_distance = 2.5 Distance covered by a single residue (in A)
  - min_edge_segment_length = 5 Discard edge segments if shorter (after considering gap margins)
  - min_internal_segment_length = 2 Discard internal segments if shorter (after considering gap margins)
  - max_loop_length = 60 Maximum loop length
  - rosetta_max_build_attempts = 1000 Maximum build attempts to close loop
  - rosetta_bump_overlap_factor = 0.1 Allows some atomic overlap in initial loop closures
  - rosetta_loop_closure = kic *ngk Algorithm for Rosetta loop closure
  - rosetta_loop_refinement = default quick test fast *no Algorithm for Rosetta loop refinement
ensembling: Options for ensembling
- mapping = None Residue mapping methods. Valid choices: ssm, multiple_alignment, ssm_multiple_alignment, muscle, resid
- alignment = None Alignment file (only necessary for 'alignments' modes)
- atoms = None Atom to include in superposition
- clustering = None Cutoff distance for cluster analysis
- superposition: Superposition setup
  - method = *gapless gapped Superposition algorithm
  - convergence = 1.0E-4 Convergence criterion for superposition
- weighting: Parameters for weighting scheme
  - scheme = unit *robust_resistant Weighting scheme
  - convergence = 1.0E-3 Convergence criterion for weight iteration
  - incremental_damping_factor = 1.5 Damping factor in recovery cycle
  - max_damping_factor = 3.34 Quit recovery if cumulative damping factor is above
  - robust_resistant: Setting for robust-resistance scheme
    - critical = 3 tolerance
sculpting: Options for sculpting
- disable = False No sculpting
- protein: Options to process protein chains
  - completion = sidechain cbeta_and_pro cbeta Method to build missing sidechain atoms
  - mainchain: Options for main chain processing
    - remove_unaligned = True Delete residues that could not be matched to an alignment
    - deletion: Mainchain deletion algorithms
      - use = *gap threshold_based_similarity completeness_based_similarity residue_count_based_similarity remove_long rms superposition Algorithm to use
      - gap: Delete residue if aligned with gap
      - threshold_based_similarity: Delete residue if sequence similarity is low
        - threshold = -0.20 Threshold to accept a residue
        - similarity_calculation: Configure sequence similarity calculation
          - matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
          - smoothing: Configure raw similarity smoothing
            - use = *linear spatial Method to use
            - linear: Parameters for linear averaging
              - window = 5 Averaging window width
              - weighting = *triangular uniform Weighting scheme
            - spatial: Parameters for spatial averaging
              - gap_bleed_length = 3 Alignment positions affected by gaps
              - maximum_distance = 10 Spatial distance for averaging
      - completeness_based_similarity: Delete residues based on sequence similarity to get same number of gaps as the Schwarzenbacher algorithm
        - offset = 0.0 Completeness in fraction of model length (0.0 = completeness from Schwarzenbacher algorithm, useful range: +/-0.05)
        - similarity_calculation: Configure sequence similarity calculation
          - matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
          - smoothing: Configure raw similarity smoothing
            - use = *linear spatial Method to use
            - linear: Parameters for linear averaging
              - window = 5 Averaging window width
              - weighting = *triangular uniform Weighting scheme
            - spatial: Parameters for spatial averaging
              - gap_bleed_length = 3 Alignment positions affected by gaps
              - maximum_distance = 10 Spatial distance for averaging
      - residue_count_based_similarity: Delete residues based on sequence similarity to delete the same number of residues as the Schwarzenbacher algorithm
        - offset = 0.0 Completeness in fraction of model length (0.0 = completeness from Schwarzenbacher algorithm, useful range: +/-0.05)
        - similarity_calculation: Configure sequence similarity calculation
          - matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
          - smoothing: Configure raw similarity smoothing
            - use = *linear spatial Method to use
            - linear: Parameters for linear averaging
              - window = 5 Averaging window width
              - weighting = *triangular uniform Weighting scheme
            - spatial: Parameters for spatial averaging
              - gap_bleed_length = 3 Alignment positions affected by gaps
              - maximum_distance = 10 Spatial distance for averaging
      - remove_long: Delete gap segments if longer than a threshold
        - minimum_length = 3 Minimum length for mainchain segment to remove
      - rms: Delete residues if estimated error is larger than a threshold
        - threshold = 5.0 Maximum allowed error for a residue
        - missing_value_substitution: Policy to fill in missing values
          - use = *maximum_value scaled_interpolated_value Algorithm to use
          - maximum_value: Substitute maximum from the sequence
          - scaled_interpolated_value: Substitute interpolated value scaled with the distance
            - extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
            - interpolation_step_scale = 1.10 Stepwise scale factor for extrapolation
      - superposition: Delete residues if estimated error is larger than a threshold
        - threshold = 2.0 Maximum allowed error for a residue
        - missing_value_substitution: Policy to fill in missing values
          - use = *maximum_value scaled_interpolated_value Algorithm to use
          - maximum_value: Substitute maximum from the sequence
          - scaled_interpolated_value: Substitute interpolated value scaled with the distance
            - extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
            - interpolation_step_scale = 1.10 Stepwise scale factor for extrapolation
    - polishing: Mainchain polishing algorithms
      - use = remove_short undo_short keep_regular Algorithm to use
      - remove_short: Delete short unconnected segments
        - minimum_length = 3 Minimum length
      - undo_short: Delete short gaps
        - maximum_length = 2 Maximum length
      - keep_regular: Keep residues in secondary structure
        - maximum_length = 1 Maximum length
  - bfactor: Configure B-factor modification
    - use = *original asa similarity rms Algorithm to use
    - minimum_b = 10.00 Minimum B-factor
    - original: Use original bfactors to predict new B-values
      - factor = 1.00 Scale factor
    - asa: Use accessible surface area to predict new B-values
      - precision = 960 Number of points per atom
      - probe_radius = 1.40 Radius for probing surface accessibility
      - factor = 2.00 Scale factor
    - similarity: Use sequence similarity to predict new B-values
      - factor = -100.00 Scale factor
      - similarity_calculation: Configure sequence similarity calculation
        - matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
        - smoothing: Configure raw similarity smoothing
          - use = *linear spatial Method to use
          - linear: Parameters for linear averaging
            - window = 5 Averaging window width
            - weighting = *triangular uniform Weighting scheme
          - spatial: Parameters for spatial averaging
            - gap_bleed_length = 3 Alignment positions affected by gaps
            - maximum_distance = 10 Spatial distance for averaging
    - rms: Use external RMS error estimates to calculate B-values
      - factor = 1.00 Scale factor
      - missing_value_substitution: Policy to fill in missing values
        - use = *maximum_value scaled_interpolated_value Algorithm to use
        - maximum_value: Substitute maximum from the sequence
        - scaled_interpolated_value: Substitute interpolated value scaled with the distance
          - extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
          - interpolation_step_scale = 1.10 Stepwise scale factor for extrapolation
  - renumber: Residue renumbering
    - use = model *target original Method to use
    - start = 1 Start residue number
  - pruning: Options for sidechain pruning
    - use = null *schwarzenbacher similarity Algorithm to use
    - pruning_level_unaligned = 2 Pruning level for residues that could not be matched to an alignment
    - null: Do not impose bond distance threshold
    - schwarzenbacher: Truncate atoms if target residue != source residue
      - pruning_level = 2 Level of truncation
    - similarity: Truncate atoms based on sequence similarity
      - pruning_level = 2 Level of intermediate truncation
      - full_length_limit = 0.2 Limit of no truncation
      - full_truncation_limit = -0.2 Limit for full truncation
      - similarity_calculation: Configure sequence similarity calculation
        - matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
        - smoothing: Configure raw similarity smoothing
          - use = *linear spatial Method to use
          - linear: Parameters for linear averaging
            - window = 1 Averaging window width
            - weighting = *triangular uniform Weighting scheme
          - spatial: Parameters for spatial averaging
            - gap_bleed_length = 3 Alignment positions affected by gaps
            - maximum_distance = 10 Spatial distance for averaging
  - rename: Residue renaming
    - use = original *target Method to use
    - keep_ptm = False Preserve post-translational modification of model residue if base residue types agree
    - gapname = ALA Name residues corresponding to alignment gaps
  - mapping: Options for sidechain mapping
    - use = *connectivity geometry Algorithm to use
    - map_if_identical = True Do mapping procedure for identical residues types
    - connectivity: Match atoms by connectivity
      - match_chemical_elements = True Take chemical element into account
    - geometry: Match atoms by geometry considering all rotamers
      - match_chemical_elements = False Take chemical element into account
      - tolerance = 0.1 Distance tolerance
      - fine_sampling = False Use fine sampling for rotamers
- dna: Options to process DNA chains
  - mainchain: Options for main chain processing
    - remove_unaligned = True Delete residues that could not be matched to an alignment
    - deletion: Mainchain deletion algorithms
      - use = *all superposition Algorithm to use
      - all: Delete all residues
      - superposition: Delete residues if estimated error is larger than a threshold
        - threshold = 2.0 Maximum allowed error for a residue
        - missing_value_substitution: Policy to fill in missing values
          - use = *maximum_value scaled_interpolated_value Algorithm to use
          - maximum_value: Substitute maximum from the sequence
          - scaled_interpolated_value: Substitute interpolated value scaled with the distance
            - extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
            - interpolation_step_scale = 1.10 Stepwise scale factor for extrapolation
    - polishing: Mainchain polishing algorithms
      - use = remove_short Algorithm to use
      - remove_short: Delete short unconnected segments
        - minimum_length = 3 Minimum length
- rna: Options to process RNA chains
  - mainchain: Options for main chain processing
    - remove_unaligned = True Delete residues that could not be matched to an alignment
    - deletion: Mainchain deletion algorithms
      - use = *all superposition Algorithm to use
      - all: Delete all residues
      - superposition: Delete residues if estimated error is larger than a threshold
        - threshold = 2.0 Maximum allowed error for a residue
        - missing_value_substitution: Policy to fill in missing values
          - use = *maximum_value scaled_interpolated_value Algorithm to use
          - maximum_value: Substitute maximum from the sequence
          - scaled_interpolated_value: Substitute interpolated value scaled with the distance
            - extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
            - interpolation_step_scale = 1.10 Stepwise scale factor for extrapolation
    - polishing: Mainchain polishing algorithms
      - use = remove_short Algorithm to use
      - remove_short: Delete short unconnected segments
        - minimum_length = 3 Minimum length
- hetero: Options to process hetero chains
  - keep = None Keep named hetero residues
- monosaccharide: Options to process glycosyl chains
  - maximum_depth = 0 Keep chain up to maximum depth (None to keep all)
  - maximum_bond_length = 1.5 Maximum bond length for glycosidic bond