Model preparation with sculpt_ensemble

Purpose

sculpt_ensemble is a unified model preparation program that uses Sculptor and Ensembler internally. When the two programs are run separately, the result depends on the order in which they are run, and whichever order is chosen, there are corner cases in which the result is suboptimal. In the unified program no information is lost between the two steps, and there are further benefits as well.

Usage

The program sculpt_ensemble is available from the command line and will also be available from the PHENIX GUI.

Input files

  • Structures (compulsory): structures to be superposed and modified. For each structure, specific parts can be selected using PHENIX atom selection syntax, and assembly information can be associated with it using the assembly keyword or the convenience keyword read_single_assembly_from_file (for more information please see the Ensembler documentation).

    Error files associated with the input PDB file can also be input using the errors scope. See the Sculptor documentation for more details (please note that there is no input for superposition errors, since those are calculated within the ensembling step and passed onto Sculptor).

    Accepted formats: PDB. Recognized extensions: .pdb, .ent.

  • Sequence files: target sequences (one for each assembly position). These are used to identify the target sequence in each alignment. If no alignment is available but the target sequence for the given assembly position is, sculpt_ensemble prepares an alignment using the sequence of the model chain (if no target sequence is available and no suitable alignment is found, a dummy alignment is used instead). Accepted extensions are .fasta, .faa or .fa for FASTA format, .pir for PIR format and .seq or .dat for a relaxed PIR/FASTA-like format.

  • Alignments: sequence alignment with the target sequence. The alignment contains information that can be exploited in model improvement (this is currently only implemented for protein chains). The chains are automatically associated with the corresponding alignment based on a sequence comparison. The target sequence is also identified automatically if it is provided through the target_sequence keyword (see above); otherwise the first sequence in the alignment is used as the target. Alignment input is optional; if it is not provided, an alignment of the chain sequence with itself is used. Accepted extensions (with the corresponding format) are .aln (CLUSTAL format), .pir (PIR format) and .ali (relaxed PIR-like format).

  • Homology search files: hits from a homology search. These work very similarly to a set of alignment files. It is assumed that the first sequence in each alignment is the target sequence (the sequence used for searching homologues). Accepted extensions (with the corresponding format) are .xml (BLAST-family XML output) and .hhr (HHPRED output).
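As an illustration of how these inputs fit together, the shell sketch below writes a PHIL parameter file that declares two structures, one alignment and one target sequence, and runs the program only if it is installed. All file names (model_a.pdb, model_b.pdb, target.aln, target.fasta, sculpt_ensemble_inputs.eff) and the chain selection are placeholders. The scope layout is taken from the keyword list at the end of this page, and the assumption that the model scope may be repeated should be checked against the output of --show-defaults.

```shell
#!/bin/sh
# Write a PHIL parameter file. Every file name below is a placeholder.
cat > sculpt_ensemble_inputs.eff <<'EOF'
input {
  alignment = target.aln
  target_sequence = target.fasta
  structure {
    model {
      file_name = model_a.pdb
      selection = "chain A"
      read_single_assembly_from_file = True
    }
    model {
      file_name = model_b.pdb
      read_single_assembly_from_file = True
    }
  }
}
EOF

# Run only when PHENIX is on the PATH; otherwise just report what was written.
if command -v phenix.sculpt_ensemble >/dev/null 2>&1; then
  phenix.sculpt_ensemble sculpt_ensemble_inputs.eff
else
  echo "phenix.sculpt_ensemble not found; wrote sculpt_ensemble_inputs.eff"
fi
```

Passing the target sequence through the parameter file, rather than as a bare file name, keeps the association between sequence and assembly position explicit.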

Output files

The fully processed structure is output. The file is named root_merged.pdb, where root is a user-defined parameter (set in the output scope); this is similar to the merged output style of Ensembler.
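A small sketch of the naming convention, assuming a hypothetical root value of my_model: the script derives the expected file name and reports whether a previous run has already produced it.

```shell
#!/bin/sh
# "my_model" is a placeholder for the value given to output.root.
root="my_model"
expected="${root}_merged.pdb"   # convention described above: root_merged.pdb

if [ -f "$expected" ]; then
  echo "found $expected"
else
  echo "$expected not present; a run with output.root=${root} would create it"
fi
```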

Description

The program reads in the input files, creates the assemblies, and runs ensembling first. The assemblies are then transformed, the superposition error is associated with the respective chains, and the chains are processed sequentially according to the corresponding Sculptor protocol (note that although the superposition errors are always present, they are not used unless requested explicitly by enabling a protocol). See the respective documentation for more details.

In case there is only a single assembly present, the superposition step is skipped, and the program works as the standalone Sculptor. It is also possible to disable the processing step (sculpting.disable), in which case the program works as the standalone Ensembler.
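For example, an Ensembler-like run can be requested with a command-line assignment to sculpting.disable. The sketch below only assembles the argument list and runs it if the program is installed; model_a.pdb, model_b.pdb and the root ensemble_only are placeholder names.

```shell
#!/bin/sh
# Argument list for a run without the sculpting step.
# The two PDB file names and the output root are placeholders.
set -- sculpting.disable=True output.root=ensemble_only model_a.pdb model_b.pdb

if command -v phenix.sculpt_ensemble >/dev/null 2>&1; then
  phenix.sculpt_ensemble "$@"
else
  echo "would run: phenix.sculpt_ensemble $*"
fi
```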

Trimming

sculpt_ensemble also supports chain trimming, but unlike the explicit option in Ensembler, this is available as the superposition option within the mainchain processing section in the sculpting machinery. In addition, this can also be used in combination with polishing options.
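A possible parameter file for trimming is sketched below. It selects the superposition deletion algorithm and combines it with the remove_short polishing option. The nesting (deletion and polishing as children of mainchain) is inferred from the keyword list at the end of this page and should be confirmed with --show-defaults; the file name trimming.eff is a placeholder, and the threshold and minimum_length values simply repeat the listed defaults. As written, only superposition is selected, so the default gap algorithm is not applied.

```shell
#!/bin/sh
# Write a PHIL file that enables superposition-based trimming for protein
# chains. Scope nesting is an assumption based on the keyword list.
cat > trimming.eff <<'EOF'
sculpting {
  protein {
    mainchain {
      deletion {
        use = superposition
        superposition {
          threshold = 5.0
        }
      }
      polishing {
        use = remove_short
        remove_short {
          minimum_length = 3
        }
      }
    }
  }
}
EOF
echo "wrote trimming.eff"
```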

Command line

phenix.sculpt_ensemble \
    [ command-line switches ] \
    [ PHIL-format parameter files ] \
    [ PHIL command-line assignments ] \
    [ PDB-files ] \
    [ alignment files ] \
    [ sequence files ]

Command-line switches

-h, --help            show this help message and exit
--show-defaults       print PHIL and exit
-i, --stdin           read PHIL from stdin as well
-v, --verbosity       set verbosity level (info,debug,verbose)
--version             show program's version number and exit
--text-logfile FILE   Verbatim copy of log stream
--html-logfile FILE   Verbatim copy of log stream in HTML

PHIL arguments

Everything not starting with a dash ('-') is interpreted as a PHIL argument. This can be a PHIL-format file containing parameters, command-line assignment or a file whose type is automatically recognized (based on file extension). Note that sequence files are not accepted on the command line, since associated chains could not easily be guessed and require a fully specified parameter scope.
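The three kinds of PHIL argument can be mixed in one call. The sketch below writes a small parameter file, adds a command-line assignment and three files recognized by extension, and shows that none of them would be read as a switch. The file names settings.eff, model_a.pdb, model_b.pdb and target.aln and the root my_model are placeholders; the keywords come from the list below.

```shell
#!/bin/sh
# 1) A PHIL-format parameter file.
cat > settings.eff <<'EOF'
ensembling {
  weighting {
    scheme = unit
  }
}
EOF

# 2) A command-line assignment and 3) files recognized by their extension.
set -- settings.eff output.root=my_model model_a.pdb model_b.pdb target.aln

# Only arguments starting with a dash are switches; the rest are PHIL arguments.
for arg in "$@"; do
  case "$arg" in
    -*) echo "switch:        $arg" ;;
    *)  echo "PHIL argument: $arg" ;;
  esac
done

if command -v phenix.sculpt_ensemble >/dev/null 2>&1; then
  phenix.sculpt_ensemble "$@"
else
  echo "would run: phenix.sculpt_ensemble $*"
fi
```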

Additional information

List of all available keywords

  • input: Input files
    • alignment = None Input alignment file
    • target_sequence = None Target sequence for assembly position
    • structure: Input structures
      • remove_alternate_conformations = False Remove alternate conformations
      • sanitize_occupancies = False Sets occupancies > 1.0 to 1.0
      • model: Input model file
        • file_name = None Input file name
        • selection = None Selection string
        • assembly = None Assembly information
        • read_single_assembly_from_file = False Consider contents of the file as a single assembly
        • errors: Estimated errors for chain
          • file_name = None Error file name
          • chain_ids = None Which chain IDs the error file corresponds to
    • homology_search: Alignment from homology search file
      • file_name = None Homology search file
      • use = None Which alignments to use
  • output: Output file(s)
    • root = sculpt_ensemble Output file root
    • job_title = None Job title in PHENIX GUI, not used on command line
  • chain_to_alignment_matching: Chain-to-alignment matching options
    • consecutivity = *geometry numbering Consecutivity criterion to detect chain breaks
    • min_hss_length = 3 Minimum length of a sequence fragment to be included in chain alignment
    • max_seed_hss_count = 12 Number of HSS to use in extensive search
    • max_completion_hss_count = 6 Number of HSS to use in gap filling
    • min_sequence_overlap = 10 Minimum overlap between sequences to perform full alignment
    • min_sequence_identity = 0.80 Minimum sequence identity of accepted chain alignment
  • error_search: Parameters for matching error files and/or estimating errors
    • min_sequence_identity = 0.8 Minimum sequence identity to accept error file
    • min_sequence_overlap = 0.8 Minimum sequence overlap to accept error file
    • calculate_if_not_provided = False Use Rosetta and the ProQ2 server to get error estimates
    • output_prefix = None File name for results of intermediate steps
    • homology_modelling: Parameters for homology modelling
      • min_residue_margin = 2 Minimum number of residues to cut back on both sides of loops
      • residue_distance = 2.5 Distance covered by a single residue (in A)
      • min_edge_segment_length = 5 Discard edge segments if shorter (after considering gap margins)
      • min_internal_segment_length = 2 Discard internal segments if shorter (after considering gap margins)
      • max_loop_length = 60 Maximum loop length
      • rosetta_max_build_attempts = 1000 Maximum build attempts to close loop
      • rosetta_bump_overlap_factor = 0.1 Allows some atomic overlap in initial loop closures
      • rosetta_loop_closure = kic *ngk Algorithm for Rosetta loop closure
      • rosetta_loop_refinement = default quick test fast *no Algorithm for Rosetta loop refinement
  • ensembling: Options for ensembling
    • mapping = None Residue mapping methods. Valid choices: ssm, multiple_alignment, ssm_multiple_alignment, muscle, resid
    • alignment = None Alignment file (only necessary for 'alignments' modes)
    • atoms = None Atom to include in superposition
    • clustering = None Cutoff distance for cluster analysis
    • superposition: Superposition setup
      • method = *gapless gapped Superposition algorithm
      • convergence = 1.0E-4 Convergence criterion for superposition
    • weighting: Parameters for weighting scheme
      • scheme = unit *robust_resistant Weighting scheme
      • convergence = 1.0E-3 Convergence criterion for weight iteration
      • incremental_damping_factor = 1.5 Damping factor in recovery cycle
      • max_damping_factor = 3.34 Quit recovery if cumulative damping factor is above
      • robust_resistant: Setting for robust-resistance scheme
        • critical = 3 tolerance
  • sculpting: Options for sculpting
    • disable = False No sculpting
    • protein: Options to process protein chains
      • completion = sidechain cbeta_and_pro cbeta Method to build missing sidechain atoms
      • mainchain: Options for main chain processing
        • remove_unaligned = True Delete residues that could not be matched to an alignment
        • deletion: Mainchain deletion algorithms
          • use = *gap threshold_based_similarity completeness_based_similarity residue_count_based_similarity remove_long rms superposition Algorithm to use
          • gap: Delete residue if aligned with gap
          • threshold_based_similarity: Delete residue if sequence similarity is low
            • threshold = -0.20 Threshold to accept a residue
            • similarity_calculation: Configure sequence similarity calculation
              • matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
              • smoothing: Configure raw similarity smoothing
                • use = *linear spatial Method to use
                • linear: Parameters for linear averaging
                  • window = 5 Averaging window width
                  • weighting = *triangular uniform Weighting scheme
                • spatial: Parameters for spatial averaging
                  • gap_bleed_length = 3 Alignment positions affected by gaps
                  • maximum_distance = 10 Spatial distance for averaging
          • completeness_based_similarity: Delete residues based on sequence similarity to get same number of gaps as the Schwarzenbacher algorithm
            • offset = 0.0 Completeness in fraction of model length (0.0 = completeness from Schwarzenbacher algorithm, useful range: +/-0.05)
            • similarity_calculation: Configure sequence similarity calculation
              • matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
              • smoothing: Configure raw similarity smoothing
                • use = *linear spatial Method to use
                • linear: Parameters for linear averaging
                  • window = 5 Averaging window width
                  • weighting = *triangular uniform Weighting scheme
                • spatial: Parameters for spatial averaging
                  • gap_bleed_length = 3 Alignment positions affected by gaps
                  • maximum_distance = 10 Spatial distance for averaging
          • residue_count_based_similarity: Delete residues based on sequence similarity to delete the same number of residues as the Schwarzenbacher algorithm
            • offset = 0.0 Completeness in fraction of model length (0.0 = completeness from Schwarzenbacher algorithm, useful range: +/-0.05)
            • similarity_calculation: Configure sequence similarity calculation
              • matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
              • smoothing: Configure raw similarity smoothing
                • use = *linear spatial Method to use
                • linear: Parameters for linear averaging
                  • window = 5 Averaging window width
                  • weighting = *triangular uniform Weighting scheme
                • spatial: Parameters for spatial averaging
                  • gap_bleed_length = 3 Alignment positions affected by gaps
                  • maximum_distance = 10 Spatial distance for averaging
          • remove_long: Delete gap segments if longer than a threshold
            • minimum_length = 3 Minimum length for mainchain segment to remove
          • rms: Delete residues if estimated error is larger than a threshold
            • threshold = 5.0 Maximum allowed error for a residue
            • missing_value_substitution: Policy to fill in missing values
              • use = *maximum_value scaled_interpolated_value Algorithm to use
              • maximum_value: Substitute maximum from the sequence
              • scaled_interpolated_value: Substitute interpolated value scaled with the distance
                • extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
                • interpolation_step_scale = 1.10 Stepwise scale factor for interpolation
          • superposition: Delete residues if estimated error is larger than a threshold
            • threshold = 5.0 Maximum allowed error for a residue
            • missing_value_substitution: Policy to fill in missing values
              • use = *maximum_value scaled_interpolated_value Algorithm to use
              • maximum_value: Substitute maximum from the sequence
              • scaled_interpolated_value: Substitute interpolated value scaled with the distance
                • extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
                • interpolation_step_scale = 1.10 Stepwise scale factor for interpolation
        • polishing: Mainchain polishing algorithms
          • use = remove_short undo_short keep_regular Algorithm to use
          • remove_short: Delete short unconnected segments
            • minimum_length = 3 Minimum length
          • undo_short: Delete short gaps
            • maximum_length = 2 Maximum length
          • keep_regular: Keep residues in secondary structure
            • maximum_length = 1 Maximum length
      • bfactor: Configure B-factor modification
        • use = *original asa similarity rms Algorithm to use
        • minimum_b = 10.00 Minimum B-factor
        • original: Use original bfactors to predict new B-values
          • factor = 1.00 Scale factor
        • asa: Use accessible surface area to predict new B-values
          • precision = 960 Number of points per atom
          • probe_radius = 1.40 Radius for probing surface accessibility
          • factor = 2.00 Scale factor
        • similarity: Use sequence similarity to predict new B-values
          • factor = -100.00 Scale factor
          • similarity_calculation: Configure sequence similarity calculation
            • matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
            • smoothing: Configure raw similarity smoothing
              • use = *linear spatial Method to use
              • linear: Parameters for linear averaging
                • window = 5 Averaging window width
                • weighting = *triangular uniform Weighting scheme
              • spatial: Parameters for spatial averaging
                • gap_bleed_length = 3 Alignment positions affected by gaps
                • maximum_distance = 10 Spatial distance for averaging
        • rms: Use external RMS error estimates to calculate B-values
          • factor = 1.00 Scale factor
          • missing_value_substitution: Policy to fill in missing values
            • use = *maximum_value scaled_interpolated_value Algorithm to use
            • maximum_value: Substitute maximum from the sequence
            • scaled_interpolated_value: Substitute interpolated value scaled with the distance
              • extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
              • interpolation_step_scale = 1.10 Stepwise scale factor for interpolation
      • renumber: Residue renumbering
        • use = model *target original Method to use
        • start = 1 Start residue number
      • pruning: Options for sidechain pruning
        • use = null *schwarzenbacher similarity Algorithm to use
        • pruning_level_unaligned = 2 Pruning level for residues that could not be matched to an alignment
        • null: Do not impose bond distance threshold
        • schwarzenbacher: Truncate atoms if target residue != source residue
          • pruning_level = 2 Level of truncation
        • similarity: Truncate atoms based on sequence similarity
          • pruning_level = 2 Level of intermediate truncation
          • full_length_limit = 0.2 Limit of no truncation
          • full_truncation_limit = -0.2 Limit for full truncation
          • similarity_calculation: Configure sequence similarity calculation
            • matrix = blosum50 *blosum62 dayhoff identity Similarity matrix
            • smoothing: Configure raw similarity smoothing
              • use = *linear spatial Method to use
              • linear: Parameters for linear averaging
                • window = 1 Averaging window width
                • weighting = *triangular uniform Weighting scheme
              • spatial: Parameters for spatial averaging
                • gap_bleed_length = 3 Alignment positions affected by gaps
                • maximum_distance = 10 Spatial distance for averaging
      • rename: Residue renaming
        • use = original *target Method to use
        • keep_ptm = False Preserve post-translational modification of model residue if base residue types agree
        • gapname = ALA Name residues corresponding to alignment gaps
      • mapping: Options for sidechain mapping
        • use = *connectivity geometry Algorithm to use
        • map_if_identical = True Do mapping procedure for identical residues types
        • connectivity: Match atoms by connectivity
          • match_chemical_elements = True Take chemical element into account
        • geometry: Match atoms by geometry considering all rotamers
          • match_chemical_elements = False Take chemical element into account
          • tolerance = 0.1 Distance tolerance
          • fine_sampling = False Use fine sampling for rotamers
    • dna: Options to process DNA chains
      • mainchain: Options for main chain processing
        • remove_unaligned = True Delete residues that could not be matched to an alignment
        • deletion: Mainchain deletion algorithms
          • use = *all superposition Algorithm to use
          • all: Delete all residues
          • superposition: Delete residues if estimated error is larger than a threshold
            • threshold = 5.0 Maximum allowed error for a residue
            • missing_value_substitution: Policy to fill in missing values
              • use = *maximum_value scaled_interpolated_value Algorithm to use
              • maximum_value: Substitute maximum from the sequence
              • scaled_interpolated_value: Substitute interpolated value scaled with the distance
                • extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
                • interpolation_step_scale = 1.10 Stepwise scale factor for interpolation
        • polishing: Mainchain polishing algorithms
          • use = remove_short Algorithm to use
          • remove_short: Delete short unconnected segments
            • minimum_length = 3 Minimum length
    • rna: Options to process RNA chains
      • mainchain: Options for main chain processing
        • remove_unaligned = True Delete residues that could not be matched to an alignment
        • deletion: Mainchain deletion algorithms
          • use = *all superposition Algorithm to use
          • all: Delete all residues
          • superposition: Delete residues if estimated error is larger than a threshold
            • threshold = 5.0 Maximum allowed error for a residue
            • missing_value_substitution: Policy to fill in missing values
              • use = *maximum_value scaled_interpolated_value Algorithm to use
              • maximum_value: Substitute maximum from the sequence
              • scaled_interpolated_value: Substitute interpolated value scaled with the distance
                • extrapolation_step_scale = 1.20 Stepwise scale factor for extrapolation
                • interpolation_step_scale = 1.10 Stepwise scale factor for interpolation
        • polishing: Mainchain polishing algorithms
          • use = remove_short Algorithm to use
          • remove_short: Delete short unconnected segments
            • minimum_length = 3 Minimum length
    • hetero: Options to process hetero chains
      • keep = None Keep named hetero residues
    • monosaccharide: Options to process glycosyl chains
      • maximum_depth = 0 Keep chain up to maximum depth (None to keep all)
      • maximum_bond_length = 1.5 Maximum bond length for glycosidic bond
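Keywords from this list can be set either in a parameter file, using nested scopes, or as dotted command-line assignments. The sketch below writes the same three settings both ways. It assumes that bfactor and pruning sit directly under sculpting.protein, which is an interpretation of the list above and should be checked with --show-defaults; overrides.eff is a placeholder file name.

```shell
#!/bin/sh
# Nested-scope form, written to a PHIL parameter file.
cat > overrides.eff <<'EOF'
sculpting {
  protein {
    bfactor {
      use = asa
      minimum_b = 10.0
    }
    pruning {
      use = similarity
    }
  }
}
EOF

# Dotted form of the same three settings, usable as command-line assignments.
set -- \
  sculpting.protein.bfactor.use=asa \
  sculpting.protein.bfactor.minimum_b=10.0 \
  sculpting.protein.pruning.use=similarity
printf '%s\n' "$@"
```

The parameter file is easier to keep under version control; the dotted form is convenient for changing a single value between runs.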