Processing, docking and rebuilding AlphaFold2 and other predicted models in cryo-EM maps

Author(s)

dock_and_rebuild: Tom Terwilliger

Purpose

Dock and rebuild combines three operations on a predicted model produced by AlphaFold, RoseTTAFold or other prediction software: processing the model, docking it into a cryo-EM map, and rebuilding it using that map.

There are three steps in working with an AlphaFold or other predicted model and a cryo-EM map. These steps can also be carried out one at a time.

The first step is to process the predicted model with phenix.process_predicted_model, trimming off all the uncertain residues and breaking up the remaining structure into a best guess of rigid domains.

The next step is to dock each of the domains of the processed model into the map, keeping plausible connectivity. This is done with phenix.dock_predicted_model.

The third step is to morph the predicted model onto the docked domains and then to rebuild all the parts of the predicted model using the density in the map.

How dock_and_rebuild works:

The dock_and_rebuild procedure connects all three steps: processing, docking and rebuilding a predicted model. You may want to use this procedure if you have a simple case, or if you have many models you want to process. If it does not work well, you may want to take its intermediate outputs (e.g., the docked model) and carry out the remaining steps individually.

The model input to dock_predicted_model is your starting predicted model file (e.g., an AlphaFold model).

The map input to dock_predicted_model is normally your best sharpened or density-modified cryo-EM map. It can also be a map generated by any other procedure (including crystallography).

If you are able to mask your map, keeping only the part representing the region where this model belongs, that can be very helpful. You can also box the map around this region. If you have a map that has many chains, this masking can greatly shorten the time for docking. If you don't know at all where in your map the model goes, you can supply the entire map. If your map has symmetry, this will normally be found automatically.

The three steps described above are then carried out: processing the model to obtain domains representing the accurate parts of the model (phenix.process_predicted_model), docking the domains (phenix.dock_predicted_model), and morphing and rebuilding the predicted model to yield a rebuilt version of the predicted model.

Examples

Standard run of dock_and_rebuild:

Running dock_and_rebuild is easy. From the command-line you can type:

phenix.dock_and_rebuild model=my_model.pdb \
   full_map=my_map.ccp4 \
   resolution=3

This will carry out all the steps of processing, docking and rebuilding to yield my_model_rebuilt.pdb.
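
If you have many predicted models to process, the standard command can be generated once per model. The following is a minimal Python sketch and not part of Phenix: the helper name build_dock_and_rebuild_command and the file names are hypothetical, and only the model, full_map and resolution keywords from the standard run are used.

```python
import shlex


def build_dock_and_rebuild_command(model, full_map, resolution):
    """Return the argument list for one phenix.dock_and_rebuild run.

    Hypothetical helper: it only assembles the three keywords used in the
    standard run and does not execute anything.
    """
    return [
        "phenix.dock_and_rebuild",
        f"model={model}",
        f"full_map={full_map}",
        f"resolution={resolution}",
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own models and map.
    for model in ["chain_A_af.pdb", "chain_B_af.pdb"]:
        command = build_dock_and_rebuild_command(model, "my_map.ccp4", 3)
        # Print a shell-ready line; pass `command` to subprocess.run to execute.
        print(shlex.join(command))
```

Keeping the command as a list of arguments avoids quoting problems if a file name contains spaces.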

Possible Problems

If your map has pseudo-symmetry (like a proteasome) you might need to box one subunit or try ssm_search=False to use a more thorough search in docking.

If your map is inverted (left-handed), docking and model-building will not work properly. You can often tell if your map is inverted because any helices will be left-handed. If you are unsure, you can run MapBox with invert_hand=True to invert the map and then see if docking works. Note that if your map is inverted, you will want to invert all your maps and start everything from the beginning.
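
To see why an inverted map shows left-handed helices, the following purely geometric Python sketch may help. It does not read maps or call Phenix; the function names are hypothetical and the helix parameters (about 100 degrees and 1.5 A rise per residue, 2.3 A radius) are approximate values for an alpha-helical CA trace. It builds a right-handed helix, mirrors it, and reports the hand from the signed triple product of consecutive bond vectors.

```python
import math


def helix_points(n_points, rise_per_point=1.5, radius=2.3, turn_degrees=100.0):
    """Generate points along a right-handed helix (alpha-helix-like defaults)."""
    step = math.radians(turn_degrees)
    return [
        (radius * math.cos(i * step), radius * math.sin(i * step), rise_per_point * i)
        for i in range(n_points)
    ]


def mirror_x(points):
    """Mirror the points through the yz plane (x -> -x), which inverts the hand."""
    return [(-x, y, z) for x, y, z in points]


def handedness(points):
    """Return +1 for a right-handed trace, -1 for left-handed, 0 if undetermined.

    Sums the signed triple product (a x b) . c over consecutive bond vectors.
    """
    bonds = [
        tuple(b - a for a, b in zip(p, q)) for p, q in zip(points, points[1:])
    ]
    total = 0.0
    for (ax, ay, az), (bx, by, bz), (cx, cy, cz) in zip(bonds, bonds[1:], bonds[2:]):
        total += (
            (ay * bz - az * by) * cx
            + (az * bx - ax * bz) * cy
            + (ax * by - ay * bx) * cz
        )
    return (total > 0) - (total < 0)


if __name__ == "__main__":
    trace = helix_points(12)
    print("original hand:", handedness(trace))
    print("mirrored hand:", handedness(mirror_x(trace)))
```

A single mirror operation is enough to change the hand, which is why a model of normal protein cannot fit an inverted map.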

Specific limitations and problems:

Literature

Additional information

List of all available keywords

job_title = None Job title in PHENIX GUI, not used on command line
input_files
- seq_file = None Input sequence file. Required if no models supplied. Used to create predicted models. Format should be Fasta or simple sequences separated by blank lines. One sequence per chain to be generated. Must match input models if both are supplied.
- xray_data_file = None Input X-ray data file (MTZ format).
- xray_data_label = None Label specifying which data column to use from xray_data_file
- xray_test_flag_label = None Label specifying which test flag column (if any) to use from xray_data_file
- truncate_models_to_match_sequences = True If input sequences start after or end before supplied predicted models, trim the predicted models to match sequences
- density_select = None If set, trim map to region containing molecule (non-zero region) before use. Recommended for cryo-EM maps that have a large map that is mostly unused. Default is True if map is full-size and False if not.
- predicted_model = None One or more input models in one or more files (normally predicted models) to be placed. Normally should start at residue 1 and match sequence file exactly.
- b_value_field_is = *plddt rmsd b_value The B-factor field in predicted models can be pLDDT (confidence, 0-1 or 0-100) or rmsd (A) or a B-factor. Only applies to protein chains (always B-factor for non-protein).
- msa_list = None Multiple sequence alignment (MSA) file. Format is a3m only. First sequence in an MSA file must match a sequence in the sequence file exactly.
- models_are_already_placed = False You can specify that all models (predicted_model or model) are already placed (docked, placed in unit cell). No docking or MR is done if so. Equivalent to specifying a scaffold_model with the same contents as the model.
- model_copies_list = None List of number of copies to find of each model in predicted_model. Default is 1 (must specify all or none of them). Specify all together like model_copies_list=1,2,1,1
- processed_model_file = None Processed model file (e.g., from phenix.process_predicted_model). If supplied, skip the process_predicted_model step and use this file. This model is expected to have one chain for each domain and actual B-values in the B-value field. It is expected that all poorly-predicted residues have been removed.
- docked_model_file = None Docked model (e.g., output of dock_processed_model). If supplied, use this file as the docked model. (Skip process_predicted_model and docking steps.) This model is expected to have a single chain with gaps for parts of the model that are not accurately known from the prediction. NOTE: You still need to supply the predicted model as the input model for this procedure in addition to this docked model.
- morphed_model_file = None Morphed model (e.g., full model morphed to match output of dock_processed_model). If supplied, skip the process_predicted_model and docking steps and use this file as the morphed model. This model is expected to have a single chain with no gaps. NOTE: You do not need to supply the predicted model as the input model for this procedure in addition to this morphed model.
- previous_model_file = None Previous model (from a previous run or external). Must match sequence of working model exactly. Used as a source of possible model information and as a hypothesis for docking of working model. If docked_model_file is also supplied, previous_model is used only as source of model information.
- scaffold_model = None Scaffold model. If supplied, used as target for docking of each chain in predicted_model or model. Must be similar in sequence to models to be docked. If supplied, model_copies_list is ignored and docking is done by superposition onto this structure instead of docking into density or MR. If the scaffold model does not have all chains represented, they are added in by docking.
- scaffold_minimum_chain_cc = 0.35 Chains in scaffold model that have a CC of at least scaffold_minimum_chain_cc are not re-docked, but those with lower CC are.
- base_scaffold_minimum_chain_cc = -1 Value of scaffold_minimum_chain_cc in cases where scaffold is known to be good.
- fragments_model_file = None Fragments model (e.g., map_to_model.pdb). Used as a source of possible model information and as a hypothesis for docking of working model. If docked_model_file is also supplied, fragments_model is used only as source of model information.
- model_to_rebuild_file = None Model to be rebuilt. Must be already in place (already docked). Must match supplied sequences exactly.
- symmetry_file = None Symmetry file (.ncs_spec format or MATRX records) with reconstruction symmetry. Used in identification of unique part of map. NOTE: symmetry file applies to map file in original position. NOTE 2: only proper symmetry (point-group, helical) is allowed NOTE 3: Only applies to cryo-EM maps
- pae_file = None Optional input json file with matrix of inter-residue estimated errors (pae file)
- distance_model_file = None Distance_model_file. A PDB or mmCIF file containing the model corresponding to the PAE matrix. Only needed if weight_by_ca_ca_distances is True.
- search_model_copies = None Used internally only
- search_model = None Used internally only
- map_model
  - full_map = None Input map file (for cryo-EM). This can be a boxed or masked map showing just the molecule to dock (best) or a full map with symmetry. If your map has symmetry be sure to set asymmetric_map = False. If you have a map with symmetry you can supply a symmetry file if you want. Otherwise symmetry will be automatically determined.
  - half_map = None Input half map files. Usually supply one full map or 2 half maps
  - model = None Input predicted model (e.g., AlphaFold model). Assumed to have pLDDT values in B-value field (or RMSD values). May have multiple chains. Normally use predicted_model instead.
output_files
- output_model_prefix = None Output files with superposed models will begin with this prefix
- output_seq_file = None Sequence file (possibly edited) written to this file by PredictAndBuild
- pdb_out = None Used internally only
- temp_dir = None Temporary directory. Default is dock_and_build_xx where xx creates new directory
crystal_info
- resolution = None High-resolution limit for main search. This can be lower resolution than the data. The search is quicker at lower resolution. If your model is poor, try 2-3 A lower resolution than your data (e.g., if your data extend to 2.5 A, try 5 A).
- scattering_table = *None n_gaussian wk1995 it1992 electron neutron Choice of scattering table for structure factor calculations. Default for X-ray is n_gaussian, for cryoEM is electron.
- wrapping = None You can specify whether the map is wrapped (can map values outside bounds to inside with cell translations).
- asymmetric_map = None Specifies that this is an asymmetric map and no symmetry is to be supplied or found. Alternative to supplying a symmetry file or symmetry. Applies to cryo-EM reconstructions only.
- solvent_content = None Solvent fraction (content) of the cell. You can specify the fraction of the volume of this cell that is taken up by the macromolecule. Normally set automatically. Values go from 0 to 1.
- pdb70_text = None Database of PDB entries to be used as templates (pdb70 file). Normally used internally only
- sequence = None Old-style sequence string (alternative to sequence file).
- chain_type = *PROTEIN DNA RNA Chain type
- msa = None MSA as text. Supply MSA as text or as an msa_file. Used internally
- templates_as_string = None Templates as string (single PDB file with one or more chains)
- space_group = None Space group (normally read from the data file, applies to X-ray)
- space_group_alternatives = *HAND ALL NONE Space group alternatives for MR: ALL: all space groups in point group HAND: enantiomer and listed space group LIST: list supplied (not available) NONE: only supplied space group
- unit_cell = None Unit Cell (normally read from the data file, applies to X-ray)
- unique_sequences File names and data labels.
  - sequence = "Enter or edit sequence"
  - copies = 1 Copies of this sequence
  - label = ""
process_predicted_model
- remove_low_confidence_residues = True Remove low-confidence residues (based on minimum plddt or maximum_rmsd, whichever is specified)
- continuous_chain = False When removing low-confidence residues, only trim from ends
- split_model_by_compact_regions = True Split model into compact regions after removing low-confidence residues.
- maximum_domains = 3 Maximum domains to obtain. You can use this to merge the closest domains at the end of splitting the model. Make it bigger (and optionally make domain_size smaller) to get more domains. If model is processed in chunks, maximum_domains will apply to each chunk.
- domain_size = 15 Approximate size of domains to be found (A units). This is the resolution that will be used to make a domain map. If you are getting too many domains, try making domain_size bigger (maximum is 70 A).
- adjust_domain_size = True If more than maximum_domains are initially found, increase domain_size in increments of 5 A and take the value that gives the smallest number of domains, but at least maximum_domains.
- minimum_domain_length = 10 Minimum length of a domain to keep (reject at end if smaller).
- maximum_fraction_close = 0.3 Maximum fraction of CA in one domain close to one in another before merging them
- minimum_sequential_residues = 5 Minimum length of a short segment to keep (reject at end).
- minimum_remainder_sequence_length = 15 Used to choose whether the sequence of a removed segment is written to the remainder sequence file.
- b_value_field_is = *plddt rmsd b_value The B-factor field in predicted models can be pLDDT (confidence, 0-1 or 0-100) or rmsd (A) or a B-factor
- input_plddt_is_fractional = None You can specify if the input plddt values (in B-factor field) are fractional (0-1) or not (0-100). By default if all values are between 0 and 1 it is fractional.
- minimum_plddt = None If low-confidence residues are removed, the cutoff is defined by minimum_plddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum plddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum plddt values are fractional or not depending on the value of input_plddt_is_fractional.
- maximum_rmsd = 1.5 If low-confidence residues are removed, the cutoff is defined by minimum_plddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum plddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum plddt values are fractional or not depending on the value of input_plddt_is_fractional.
- default_maximum_rmsd = 1.5 Default value of maximum_rmsd, used if maximum_rmsd is not set
- subtract_minimum_b = False If set, subtract the lowest B-value from all B-values just before writing out the final files. Does not affect the cutoff for removing low- confidence residues.
- pae_power = 1 If PAE matrix (predicted alignment error matrix) is supplied, each edge in the graph will be weighted proportional to (1/pae**pae_power). Use this to try and get the number of domains that you want (try 1, 0.5, 1.5, 2)
- pae_cutoff = 5 If PAE matrix (predicted alignment error matrix) is supplied, graph edges will only be created for residue pairs with pae<pae_cutoff
- pae_graph_resolution = 0.5 If PAE matrix (predicted alignment error matrix) is supplied, pae_graph_resolution regulates how aggressive the clustering algorithm is. Smaller values lead to larger clusters. Value should be larger than zero, and values larger than 5 are unlikely to be useful
- weight_by_ca_ca_distance = False Adjust the edge weighting for each residue pair according to the distance between CA residues. If this is True, then distance_model can be provided, otherwise supplied model will be used. See also distance_power
- distance_power = 1 If weight_by_ca_ca_distance is True, then edge weights will be multiplied by 1/distance**distance_power.
- stop_if_no_residues_obtained = True Raise Sorry and stop if processing yields no residues
- keep_all_if_no_residues_obtained = False Keep everything if processing yields no residues
- vrms_from_rmsd_intercept = 0.25 Estimate of vrms (error in model) from pLDDT will be based on vrms_from_rmsd_intercept + vrms_from_rmsd_slope * pLDDT where mean pLDDT of non-low_confidence_residues is used.
- vrms_from_rmsd_slope = 1.0 Estimate of vrms (error in model) from pLDDT will be based on vrms_from_rmsd_intercept + vrms_from_rmsd_slope * pLDDT where mean pLDDT of non-low_confidence_residues is used.
- break_into_chunks_if_length_is = 1500 If a sequence is at least break_into_chunks_if_length_is, break it into chunks of length chunk_size with overlap of overlap_size for domain identification using split_model_by_compact_regions without a pae matrix
- chunk_size = 600 If a sequence is at least break_into_chunks_if_length_is, break it into chunks of length chunk_size with overlap of overlap_size for domain identification using split_model_by_compact_regions without a pae_matrix
- overlap_size = 200 If a sequence is at least break_into_chunks_if_length_is, break it into chunks of length chunk_size with overlap of overlap_size for domain identification using split_model_by_compact_regions without a pae_matrix
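
The minimum_plddt and maximum_rmsd cutoffs in the process_predicted_model keywords above are two views of the same confidence estimate. The following minimal Python sketch shows one way to relate them. It assumes the empirical relation rmsd = 1.5 * exp(4 * (0.7 - pLDDT)) for fractional pLDDT, which reproduces the stated correspondence between a pLDDT of 0.70 and an rmsd of 1.5 A, and the standard isotropic relation B = 8 * pi**2 * rmsd**2 / 3. The function names are hypothetical and this is not the Phenix implementation.

```python
import math


def plddt_to_rmsd(plddt):
    """Estimate rmsd (A) from a fractional pLDDT (0-1).

    Assumed empirical relation: rmsd = 1.5 * exp(4 * (0.7 - pLDDT)).
    """
    return 1.5 * math.exp(4.0 * (0.7 - plddt))


def rmsd_to_b_value(rmsd):
    """Convert an rmsd (A) to an isotropic B-value (A**2)."""
    return 8.0 * math.pi ** 2 * rmsd ** 2 / 3.0


def as_fractional(plddt_values):
    """Return pLDDT values on a 0-1 scale.

    Mirrors the documented default for input_plddt_is_fractional: values are
    treated as fractional only if all of them lie between 0 and 1.
    """
    if all(0.0 <= value <= 1.0 for value in plddt_values):
        return list(plddt_values)
    return [value / 100.0 for value in plddt_values]


def keep_residue(plddt, maximum_rmsd=1.5):
    """True if the estimated rmsd for this fractional pLDDT is within maximum_rmsd."""
    return plddt_to_rmsd(plddt) <= maximum_rmsd


if __name__ == "__main__":
    for plddt in as_fractional([95, 70, 50]):
        rmsd = plddt_to_rmsd(plddt)
        print(plddt, round(rmsd, 2), round(rmsd_to_b_value(rmsd), 1), keep_residue(plddt))
```
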
search
- dock_chains_individually = False Dock each chain (identified by chain_id field) individually. This is useful if the chains might have different orientations in the map compared to the search model, such as in a search model that is a predicted model. Normally used along with create_unique_chain_at_end=True for predicted models to put the model back together.
- create_unique_chain_at_end = None Take all docked chains and try to create one chain that has no duplicate residue numbers and that has distances between ends of fragments consistent with the number of residues between them. If symmetry has been applied, create N copies representing that symmetry. Default is True if dock_chains_individually = True and False otherwise.
- minimum_cc_to_keep_domain = 0.2 Do not keep domains in create_unique_chain_at_end if cc is less than minimum_cc_to_keep_domain.
- weight_sequential_fragments_by_distance = None Try to dock a chain in a way that minimizes the distance between its first residue and the previously-docked residue with the highest residue number less than this, and similarly for the last residue and the next available placed residue. Default is True if create_unique_chain_at_end is set. Normally use only if you have a model with a single chain and you are docking pieces of that chain.
- choose_better_of_individual_and_group_docking = None Dock entire search model (normally one only) and also by chain...pick better-fitting of the two for each chain
- low_res_search = True Try to fit by searching at low resolution first
- dock_with_mr = True Try using MR to dock first
- ssm_search = None Try to fit by searching for secondary structure. Default is False for dock_in_map and True for dock_and_rebuild if dock_with_mr is not used.
- refine_cycles = 3 rigid-body refinement cycles
- ssm_search_min_cc = 0.30 Stop ssm search if this cc achieved
- backup_resolution_cutoff_searches = 2 If initial fitting does not work, try up to this many times increasing resolution by backup_resolution_ratio each time
- backup_resolution_ratio = 1.67 If initial fitting does not work, try up to backup_resolution_cutoff_searches times increasing the resolution by backup_resolution_ratio each time
- resolution_radius_scale = 0.5 Resolution for low-res search will be resolution_radius_scale times the radius of gyration of the search model.
- align_moments = False Try to fit by aligning moments of inertia if size of density region and molecule are similar
- max_radius_ratio = 2. Radius of gyration of molecule and density must be within max_radius_ratio of each other
- radius_scale = 1.5 Mask for density and atoms will be radius_scale times the radius of gyration of model
- use_symmetry = True If search_model_copies or search_map_copies are the same for each model and greater than one and allow_symmetry is set, use symmetry in the map to place all copies after the first one.
- skip_if_low_cc = True Skip solution before rigid-body refinement if map-model CC is less than half of min_cc.
- rigid_body_refinement = True Run rigid-body refinement on final model
- rigid_body_refinement_single_unit = True Run rigid-body refinement with just one unit (do not break up into chains)
- rigid_body_refinement_split_method = *chain_id segid When splitting up molecule for rigid-body refinement (if rigid_body_refinement_single_unit=False), use either chain_id or segid to split up molecule
- rigid_body_refinement_resolution = None Run rigid-body refinement at this resolution if specified
- append_to_fixed_model = True Append placed search model to fixed model (if any) after search
- min_cc = 0.4 If quick run, stop if minimum CC is achieved in local search. Also always skip if starting CC_mask is less than 1/4 min_cc.
- run_in_boxes = True Run on sub-boxes and combine at the end
- target_box_size = 60 Try to get boxes about this big on a side (grid units)
- target_boxes = None Try to get this many boxes. Default is nproc unless this makes box size much smaller than target_box_size
- box_to_run = None Run only this box
- box_overlap_scale = 1 box overlap (overlap of boxes) will be box_overlap_scale times the density radius
- edge_ratio = 10 box edge box_overlap times edge_ratio
- density_radius = None Radius for density to be cut out and compared. Default is 6 times the resolution.
- model_radius = 3 Radius for removing density near fixed_model
- zero_value = 0 Value to set map in regions overlapping fixed model
- density_peaks = 20 Number of NCS-related peaks of density to check
- delta_phi = 20 Angular spacing of search
- max_rot = None Maximum rotations to try
- rotz_only = None Rotate only around Z
- single_positions_to_try = 10 Number of offset positions to try in optimizing orientation. Positions along the chain are selected as centers and a local fit near each position is carried out. The resulting offsets relative to the original placement are used to optimize the overall orientation and position.
- max_position_shift_frac = 0.05 Maximum fractional positional shift in single_positions run
- min_relative_cc = 0.67 Minimum local CC relative to original CC to keep a local search. This is a way to reject local searches that are completely wrong.
- sieve_fit = None Use sieve_fit fraction of single positions in fitting. If None, use all
- ncs_copies_max = None Maximum number of matching models to write. If more than one they will be written as MODEL records in the output PDB file. You can get them individually afterwards with phenix.pdbtools placed_model.pdb keep="model 1" etc.
- start_rot = None Three numbers rotx, rotz, rotx defining the starting rotation of the search model. Normally used along with delta_phi=1000 or max_rot=1 to generate exactly one defined rotation.
- search_center = None Optional coordinates in search model for centering search. Note this is different from target_search_center which is the location xyz in the map to look.
- search_center_selection = None Optional selection defining coordinates in search model for centering search.
- target_search_center = None Optional coordinates in reference map where search_center should be approximately located after superimposing maps. Used to eliminate possible superpositions that place the search center elsewhere. Overlap scores decreased based on distance to target_search_center/density_radius.
- model_search_position = None Optional coordinates (usually part of search model) for matching to target_search_position after transformation. These can be specified in addition to target_search_center. Can have multiple model_search_position and target_search_position pairs by specifying each multiple times in order. NOTE: Not compatible with map search
- target_search_position = None Target positions for model_search_position after transformation.
- search_position_radius = None Radius for comparison of target_search_position and transformed search_position values. Default is density_radius. If specified, must be a single value or the same number of values as entries in model_search_position and target_search_position
- rot_id_n = None Number of rotation groups. Along with rot_id_group, allows defining groups of rotations to be carried out in one run.
- rot_id_group = None rotation group to include. See rot_id_n.
- map_box = True Run map_box to extract useful part of map before search
- fix_search_position = False You can choose to not move your model center of mass to the origin by fixing the search position
- search_box_size = None You can choose the size of the search box for local searches. Normally set by default corresponding to 3 times density_radius
- lower_bounds = None You can select a part of your map for analysis with lower_bounds and upper_bounds.
- upper_bounds = None You can select a part of your map for analysis with lower_bounds and upper_bounds.
- keep_search_order = None Keep search order as input
- remove_water = False Remove waters and other hetero atoms from input files
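
The resolution-related search keywords above (backup_resolution_cutoff_searches, backup_resolution_ratio and resolution_radius_scale) describe simple arithmetic, sketched below in Python. This is an illustration only: the function names are hypothetical, and it assumes that "increasing resolution" means multiplying the resolution value in A by backup_resolution_ratio, giving a coarser search each time.

```python
def backup_resolutions(resolution, searches=2, ratio=1.67):
    """Resolutions (A) for the initial search followed by the backup searches.

    `searches` plays the role of backup_resolution_cutoff_searches and
    `ratio` the role of backup_resolution_ratio.
    """
    return [resolution * ratio ** attempt for attempt in range(searches + 1)]


def low_res_search_resolution(radius_of_gyration, resolution_radius_scale=0.5):
    """Resolution (A) for the low-resolution search of a model with this radius of gyration."""
    return resolution_radius_scale * radius_of_gyration


if __name__ == "__main__":
    print([round(value, 2) for value in backup_resolutions(3.0)])
    print(low_res_search_resolution(20.0))
```
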
iteration
- cycles = None Cycles of prediction and rebuilding (default is 10 for Thorough, 3 for Standard and 1 for Quick)
- cycle = 1 Iteration cycle
build
- rebuild_strategy = Thorough Standard Quick Rebuilding strategy. Standard is up to 3 cycles, refining and replacing poorly-fitting loops in predicted models each cycle, using autobuild for density modification (if Xray). Quick is refinement only, one cycle. Thorough is up to 10 cycles, extensive rebuilding.
- refine_only = None Refine only, no rebuilding steps (set automatically with Quick)
- refine_only_resolution = 3.5 Default cutoff for using Refine only
- run_fit_loops = True Run standard fit_loops
- run_iterative_morph = True Run iterative morphing
- run_trace_loops_through_density = True Run loop fitting with trace_through_density algorithm
- run_refine = True Refine morphed model. Required for run_fit_loops or run_trace_loops_through_density
- run_iterative_resolution_refine = True Refine morphed model with iterative resolution method
- run_extend = True Extend ends of morphed model
- extract_unique = True Extract unique part of map. Applies after density_select (if set).
- use_symmetry_in_extract_unique = True Use symmetry from symmetry file (if available) when extracting unique part of map.
- acceptable_docking_cc = 0.5 Acceptable docking CC for a chain
- minimum_docking_cc = 0.15 Minimum docking CC for a chain
- minimum_cutoff = 0.1 Minimum cutoff for estimating density in better parts of model
- reasonable_cc_ratio = 0.80 Acceptable CC (ratio) for a chain relative to average. Used to make sure a chain entirely in very bad density is not kept
- reasonable_cc_diff = 0.15 Acceptable CC (difference) for a chain relative to average. Used to make sure a chain entirely in very bad density is not kept
- shift_field_distance = None Shift field characteristic distance (default = 10 A) for morphing
- cc_sd_ratio = 3. Keep residues with CC at least within cc_sd_ratio of the mean CC for good residues
- cc_sd_ratio_end = 2. Keep residues with CC at least within cc_sd_ratio_end of the mean CC for good residues (applies to end of fragments)
- cc_sd_ratio_ok = 2. Residues with CC at least within cc_sd_ratio_ok of the mean CC for good residues are ok (do not delete on the basis of pLDDT)
- max_gap_ratio = 3. Allow CA-CA distance to be up to max_gap_ratio times expected
- maximum_connectivity_deviation = 15 Maximum connectivity deviation. Reject a solution with bigger deviation than this plus 2 * resolution.
- keep_fraction_of_best = 0.5 Acceptable CC to keep as ratio to best found. Applies if best found is at least acceptable_docking_cc.
- keep_maximum_entries = 10 Maximum dock positions to keep
- rmsd_for_similar_placement = None RMSD value indicating that two placements are similar. default is resolution of map
- rigid_body_refine_cycles = 1 Refinement cycles for rigid-body refinement
- overlap_ca_ca_distance = 3 Overlap distance for CA-CA atoms (or P-P)
- ca_ca_distance = 3.8 CA-CA distance (or P-P)
- allowed_fraction_overlapping = 0.10 Allowed fraction overlapping
- maximum_combinations = 100000 Maximum combinations of placements to consider
- proceed_with_any_symmetry = False Run even with symmetry that is not point-group or helical. Not recommended as symmetry may not work properly
- box_cushion = 20 Size of buffer around model when boxing
- refine_cycles = 3 Refinement cycles
- loop_refine_cycles = 5 Refinement cycles for loops
- loop_backup_residues = 3 Number of tries removing one residue at a time from each end of existing ends of loop if no loop is found with initial gap
- residues_to_trim = 5 Residues to trim on each end of all trimmed fragments
- find_ncs_from_model = True Find NCS (symmetry) from working models and apply in density modification. Also turns on finding NCS in any autobuild density modification (including density-based search)
- allow_split_more_than_one_chain_for_mr = False Allow splitting chains into domains for mr (X-ray only) even if there are multiple chains. Normally only split if a single chain as reassembly may not work with split multiple chains.
- acceptable_cc_ratio = 0.8 Keep segments with CC at least equal to average of base segments times acceptable_cc_ratio minus difference between segment confidence (pLDDT) and confidence cutoff (typically 0.7)
- low_res_if_multiple_solutions = 3.5 Try phaser MR at increasing lower resolutions up to this value if multiple solutions are found
- delta_low_res = 0.5 Resolution increment for low_res_if_multiple_solutions
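
The docking thresholds in the build keywords above (minimum_docking_cc, acceptable_docking_cc, keep_fraction_of_best and keep_maximum_entries) can be read as a filter on candidate placements. The Python sketch below shows one plausible way to combine them. The keyword descriptions do not say how Phenix combines these values internally, so the order of the tests and the function name select_placements are assumptions.

```python
def select_placements(ccs, acceptable_docking_cc=0.5, minimum_docking_cc=0.15,
                      keep_fraction_of_best=0.5, keep_maximum_entries=10):
    """Return the map-model CC values of placements to keep, best first."""
    # Discard anything below the absolute minimum CC.
    kept = sorted((cc for cc in ccs if cc >= minimum_docking_cc), reverse=True)
    # The relative cutoff applies only if the best placement is acceptable.
    if kept and kept[0] >= acceptable_docking_cc:
        cutoff = keep_fraction_of_best * kept[0]
        kept = [cc for cc in kept if cc >= cutoff]
    # Limit the number of placements carried forward.
    return kept[:keep_maximum_entries]


if __name__ == "__main__":
    print(select_placements([0.6, 0.35, 0.2, 0.1]))
    print(select_placements([0.4, 0.2, 0.1]))
```

In the first call the best CC (0.6) is acceptable, so placements below 0.3 are dropped. In the second the best CC (0.4) is not, so only the absolute minimum applies.
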
prediction
- prediction_method = *alphafold Prediction method to use
- template_search_method = *mmseqs2 structure_search Method to identify templates from PDB and generate MSAs. If structure_search is set, you can specify what PDB databases to use (same as in phenix.structure_search). If structure_search is set, you must supply your own MSA file with the keyword upload_msa_file=True (structure_search does not generate MSAs).
- starting_alphafold_model = None Starting AlphaFold model. You can supply an AlphaFold model and skip the initial AlphaFold step. This is equivalent to setting up an output_directory with just an AlphaFold model named as the expected first AlphaFold model and specifying carry_on=True.
- input_directory = ColabInputs Input directory containing density map. The map filename must start with the same characters as the jobname (only including characters before the first underscore). If you are supplying an MSA file it goes here as well.
- output_directory = ColabOutputs Output directory. Copy outputs to output_directory. Used to restart with carry on.
- save_outputs_in_google_drive = False If run on Colab, copy outputs to output_directory in Google drive
- content_dir = None Content directory. Default is working directory
- maxit_path = None Path to maxit (pdb to cif) converter. Optional.
- data_dir = /mnt Data directory (location of AlphaFold parameters)
- upload_file_with_jobname_resolution_sequence_lines = None Upload a file with a set of jobs (Colab only). Each line in the file is a jobname, resolution, and sequence
- maximum_cycles = 10 Maximum cycles to carry out
- cycle_rmsd_to_resolution_ratio = 0.25 Stop iteration if rmsd between subsequent AlphaFold models is less than cycle_rmsd_to_resolution_ratio times the resolution for two cycles in a row
- significant_increase_in_residues = 5 Significant increase in residues in model
- password = None Phenix download password (Colab only). The password used to download Phenix at your institution. Updated weekly, so you may need to request a new one frequently.
- version = dev-4502 Version of Phenix to run (Colab only)
- query_sequence = None Sequence
- resolution = None Resolution of map (A). Internal use only. Normally set crystal_info.resolution instead.
- jobname = None Name of this job. The first characters before any underscore must be unique and will define the first characters of the corresponding map file in the input directory. The job name will normally also be the name of the working directory. Used internally only
- use_msa = True Use multiple sequence alignments at some point
- skip_all_msa_after_first_cycle = False Skip multiple sequence alignments after first cycle
- include_templates_from_pdb = True Include templates from PDB. Note the special behavior when running predict_and_build: this applies only on the first cycle, and if PhenixServer or Colab is used, prediction will be run with and without templates from the PDB and the model with the highest plDDT will be kept.
- maximum_templates_from_pdb = 20 Maximum templates from PDB to include
- release_date = None Release date for templates from the PDB (only templates released up to this date are used). The format is, for example, 2020-05-14.
- upload_msa_file = False Supply MSA directly (.a3m format). You can supply the MSA as a file in your input_directory and it will be used instead of using mmseqs2 to generate an MSA. Your file name must end in .a3m. The format is two lines per sequence, the first starts with a greater-than sign and is ignored, the second is a sequence of letters or minus signs. All sequence lines must have the same length. The first sequence must be the target.
- upload_manual_templates = False Supply templates for AlphaFold prediction. Used in the same way as templates from the PDB unless uploaded_templates_are_map_to_model is set. May be .cif or .pdb files. Template file names must start with characters matching the first characters in the jobname (before the first underscore); the remainder of the file name can be anything but must end in .cif or .pdb. The files must be in the input_directory or be uploaded (in Colab only).
- uploaded_templates_are_map_to_model = False The manual templates are models that may or may not have the sequence of the AlphaFold models. These are used only as suggestions for placement of the main chain.
- upload_maps = True Use maps (required to be True)
- random_seed = 7231771 Random seed. Used internally
- random_seed_iterations = 1 Random seed iterations of AlphaFold in first cycle. The model with the highest plDDT will be used. If running predict_and_build and include_templates_from_pdb is True, this many iterations will be carried out with and without templates. In predict_and_build and predict_chain, set number_of_models instead.
- minimum_random_seed_iterations = 1 Random seed iterations of AlphaFold after first cycle. The model with the highest plDDT will be used
- big_improvement = 10 How much improvement in plDDT is worth going through all randomization cycles
- good_enough_plddt = 80 Value of plDDT that is good enough to not make any more models
- nproc = 4 Number of processors to use. Internal use only. Normally set control.nproc instead.
- debug = False Debugging run (print traceback error messages)
- get_msa_only = False Just get MSA and save it on server, no prediction. Does not return the MSA
- carry_on = False Carry on from where previous run ended. Used (usually in Colab) to go on after a crash or timeout. Requires that files are saved in the output directory
- cif_dir = None Location of templates (normally set automatically)
- template_hit_list = None List of templates (normally set automatically)
- jobnames = None List of jobnames (normally set automatically)
- resolutions = None List of resolutions (normally set automatically)
- include_templates_from_pdb_list = None List of include_templates_from_pdb (normally set automatically)
- include_side_in_templates_list = None List of include_side_in_templates_list (normally set automatically)
- map_filename_dict = None Map filename dict (used internally only)
- msa_filename_dict = None MSA filename dict (used internally only)
- cif_filename_dict = None CIF filename dict (used internally only)
- query_sequences = None List of query sequences (one per jobname) (used internally only)
- manual_templates_uploaded = None List of manual templates (used internally only)
- upload_dir = None Upload directory (used internally only)
- working_directory = None Working directory (used internally only)
- maps_uploaded = None List of maps (used internally only)
- msas_uploaded = None List of msas (used internally only)
- num_models = 1 Number of models (used internally only)
- homooligomer = 1 Number of copies (used internally only)
- cycle = None Cycle number (internal use only)
- host_url = https://api.colabfold.com Host url for mmseqs
- use_env = None Use env (internal use only)
- use_custom_msa = None Use custom msa (internal use only)
- use_templates = None Use templates (internal use only)
- template_paths = None Template paths (internal use only)
- mtm_file_name = None map_to_model file name (internal use only)
- cycle_model_file_name = None cycle_model_file_name (internal use only)
- previous_final_model_name = None previous_final_model_name (internal use only)
- msa = None MSA (internal use only)
- msa_is_msa_object = None MSA info (internal use only)
- deletion_matrix = None Deletion matrix (internal use only)
- structure_search_params
  - structure_search
    - pdb_file = None Enter a PDB file name
    - sequence = None Optional Fasta sequence file. Only needed for a quick sequence search against RCSB without a PDB.
    - output_prefix = 'output' Provide an output prefix if needed
    - blastpath = None Enter path to blastall executable
    - sequence_only = False Do a Blast search against PDBaa sequence instead of doing a Ramachandran-based structure search
    - structure_only = False Do only a Ramachandran-based structure search.
    - db_used = 'rcsb' structure database used in search. rcsb, scop95, or AF2
    - db = 'rcsb' Database used in search. rcsb, scop95, or AF2.
    - get_ligand = False Use get_ligand=True to retrieve ligands.
    - get_ramacode_only = False Generate Rama code for input pdb/cif only. This is for developers only.
    - get_xml_only = False Get BLAST XML output returned as a string object. No coordinate superposition will be performed. Developers only.
    - use_pdb100aa = False Use PDB100 sequence database for sequence search.
    - use_custom_db = False Use custom database specified by custom_db_files/custom_db_dir.
    - custom_db_dir = None The directory of pdb/cif files to make custom database. Default is current directory
    - custom_db_files = None Filenames of the pdb/cif files, separated by spaces, to use for the database. If none are specified, all pdb/cif files in the custom_db_dir will be collected
    - atom_selection = 'all' Choose part of the pdb used in the search (default=all). For example: chain B, resseq 113:219, etc.
    - get_pdb = 10 get_pdb=N will collect and superpose the top N homologous pdbs. Use get_pdb=0 to disable this option.
    - deposited_before = 0 Specify the latest year of matching structures to be considered for scoring. Pdbs deposited after this year will be discarded.
    - deposited_after = 0 Specify the earliest year of matching structures to be considered for scoring. Pdbs deposited before this year will be discarded.
    - batch_size = 0 Process the pdbs in batches of <batch_size> until <min_match> hits are identified or until all <get_pdb> pdbs are processed
    - min_match = 0 Finish structure_search when <min_match> matches are found. Usually used with <trim_ends> to exit the search once suitable pdbs are found.
    - keep_all_pdb = False Keep all the PDB files, including full PDB, PDB_Chain and superposed PDB_Chain. Default is False which will keep only superposed PDB_Chain files in the directory specified in the output message.
    - trim_ends = False Remove terminal residues of hit pdbs extending beyond those of the target pdb.
    - write_pdb = True Set to False if no output pdb file is needed. Sometimes useful when using Structure_Search within another program and only passing pdb objects.
    - write_results = True Set to False if no output results/log files are needed. Useful when calling Structure_Search within another program and only passing pdb objects.
    - trim_hit_pdb = False Remove extra domains, extended loops, and unfit portions of hit pdbs after they are superposed on the target pdb.
    - pickle_hits = False Pickle blast hit results from xml output.
    - coot_display = False (default) Display output pdb files in coot.
    - ask_coot = True Prompt for coot display options
    - PDB_MIRRORDIR = None Enter the top directory of the local RCSB PDB mirror. The program will try to retrieve PDBs and/or structure factors from this mirror first. Note that this assumes the directory tree under it follows that of RCSB: pdb files named 'pdb####.ent.gz' in the PDB_MIRRORDIR/data/structures/divided/pdb directory. If you use PDB's rsync script, this variable would be the same as the $MIRRORDIR set in the script
    - PDB_MIRROR_MMCIF = None Enter the parent directory of the mmcif files in the local PDB mirror. MMCIFs will be retrieved from subdirectory ## where ## are the second and third letters in the PDB id. This keyword should be $PDB_MIRRORDIR/data/structures/divided/mmcif directory.
    - PDB_MIRROR_PDB = None Enter the parent directory of the PDB files in the local PDB mirror. PDBs will be retrieved from subdirectory ## where ## are the second and third letters in the PDB id. This keyword should be $PDB_MIRRORDIR/data/structures/divided/pdb directory. We recommend setting PDB_MIRRORDIR and it will take care of both PDB_MIRROR_PDB and others together. However, users may choose to specify PDB_MIRROR_PDB directly
    - PDB_MIRROR_STRUCTURE_FACTORS = None Enter the parent directory of the structure factor files in the local PDB mirror. Structure factors will be retrieved from subdirectory ## where ## are the second and third letters in the PDB id. This keyword should be the same as the $PDB_MIRRORDIR/data/structures/divided/structure_factors directory. We recommend setting PDB_MIRRORDIR and it will take care of both PDB_MIRROR_PDB and PDB_MIRROR_STRUCTURE_FACTORS together. However, users may choose to specify PDB_MIRROR_STRUCTURE_FACTORS directly
    - local_pdb_dir = None Enter the path directly to your local PDB repository.
    - verbose = False verbose output
    - debug = False debugging output
    - job_title = None Job title in PHENIX GUI, not used on command line
    - gui GUI-specific parameter required for output directory
      - output_dir = None
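The local mirror layout described for PDB_MIRRORDIR and PDB_MIRROR_PDB above can be illustrated with a short Python sketch. The function name mirror_pdb_path and the example mirror directory are hypothetical and are not part of Phenix. The sketch only assembles the path implied by the descriptions (a subdirectory named by the second and third characters of the PDB id, holding a file named pdb####.ent.gz) and assumes lower-case ids in file names, as in the standard RCSB layout.

```python
from pathlib import PurePosixPath


def mirror_pdb_path(mirror_dir, pdb_id):
    """Return the expected location of a PDB entry in a local RCSB-style mirror.

    mirror_dir plays the role of PDB_MIRRORDIR; this helper is illustrative only.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("Expected a 4-character PDB id, got %r" % pdb_id)
    # Subdirectory is the second and third characters of the id
    subdirectory = pdb_id[1:3]
    return (
        PurePosixPath(mirror_dir)
        / "data" / "structures" / "divided" / "pdb"
        / subdirectory
        / ("pdb%s.ent.gz" % pdb_id)
    )


print(mirror_pdb_path("/data/pdb_mirror", "1ABC"))
# /data/pdb_mirror/data/structures/divided/pdb/ab/pdb1abc.ent.gz
```

Checking that a path built this way exists for an entry you know is in your mirror is a quick way to confirm that PDB_MIRRORDIR points at the top of the mirror rather than at one of its subdirectories.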
control
- stop_after_dock = None Stop after docking step
- stop_after_morph = False Stop after morphing step
- read_files = False Read existing output files and use them if present
- write_files = True Write output files
- nproc = 1 Number of processors to use on your local machine
- ignore_symmetry_conflicts = True You can ignore the symmetry information (CRYST1) from coordinate files. This may be necessary if your model has been placed in a box with box_map for example.
- random_seed = 171731 Random seed
- max_dirs = 1000 Maximum number of directories (dock_and_build_xxx)
- verbose = False Verbose output
- quick = True Run quickly
gui GUI-specific parameter required for output directory
- output_dir = None
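The MSA file rules given under upload_msa_file above (two lines per sequence, a header line starting with a greater-than sign, sequence lines made of letters or minus signs, all of equal length, target first) can be checked before a run with a short Python sketch. The function check_a3m and the example alignment are hypothetical and are not part of Phenix; the sketch tests only the rules stated in this document and may be stricter or looser than what the program actually accepts.

```python
import string

# Characters allowed in a sequence line according to the description above
ALLOWED = set(string.ascii_letters + "-")


def check_a3m(text, target_sequence=None):
    """Validate MSA text against the stated rules and return the sequence lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or len(lines) % 2 != 0:
        raise ValueError("Expected two lines (header, sequence) per entry")
    sequences = []
    for i in range(0, len(lines), 2):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith(">"):
            raise ValueError("Header line %d does not start with '>'" % (i + 1))
        if not set(sequence) <= ALLOWED:
            raise ValueError("Sequence line %d has characters other than letters or '-'" % (i + 2))
        sequences.append(sequence)
    # All sequence lines must have the same length
    if len({len(sequence) for sequence in sequences}) != 1:
        raise ValueError("Sequence lines do not all have the same length")
    # The first sequence must be the target (compared without gap characters)
    if target_sequence is not None:
        first = sequences[0].replace("-", "").upper()
        if first != target_sequence.upper():
            raise ValueError("First sequence does not match the target sequence")
    return sequences


example = ">target\nMKTAYIA\n>hit_1\nMK-AYIA\n"
print(len(check_a3m(example, target_sequence="MKTAYIA")), "sequences passed the checks")
```

The file itself still has to be named with the .a3m extension and placed in the input_directory, as described for upload_msa_file.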