Docking processed AlphaFold2 and other predicted models in cryo-EM maps

Author(s)

dock_predicted_model: Tom Terwilliger

Purpose

Dock predicted model docks the domains from a model produced by AlphaFold, RoseTTAFold and other prediction software into a cryo EM map. It uses the connectivity of the model as a restraint in the docking process so that the docked domains normally are in a reasonable arrangement. It can take map symmetry into account.

The dock_predicted_model program is normally the second step in working on an AlphaFold or other predicted model with a cryo-EM structure.

The first step is to process the predicted model by trimming off all the uncertain residues in the predicted model and breaking up the remaining structure into a best guess of rigid domains with phenix.process_predicted_model .

The next step is to dock each of the domains of the processed model into the map, keeping plausible connectivity. This is done with phenix.dock_predicted_model .

The third step is to morph the predicted model onto the docked domains and then to rebuild all the parts of the predicted model using the density in the map. This is done with phenix.rebuild_predicted_model .

Note that you can run all three steps sequentially with phenix.dock_and_rebuild .

How dock_predicted_model works:

The model input to dock_predicted_model is a model file that comes from your starting predicted model (AlphaFold model) but which has been split so that each chain represents one domain (fixed rigid unit) of the predicted model, and from which all uncertain residues are removed. This is normally the output of phenix.process_predicted_model .

Note that you can work with one chain at a time even in a cryo-EM map that has many chains.

The map input to dock_predicted model is normally your best sharpened or density-modified cryo-EM map. It can also be a map generated by any other procedure (including crystallography).

If you are able to mask your map, keeping only the part representing the region where this model (one chain) belongs, that can be very helpful. You can also box the map around this region. If you have a map that has many chains, this masking can greatly shorted the time for docking. If you don't know where in your map the model goes at all, you can supply the entire map. If your map has symmetry this will normally be found automatically.

The first step in docking the predicted model is to extract the unique part of the map. If the map is asymmetric, this is the region where there is density. If it is symmetric, it is the unique part of the density, chosen to make a compact unit. Note that this step may not be perfect, particularly in the case of a very asymmetric molecule. You can possibly do much better by segmenting the map by hand and selecting just the region represenging your molecule if you have the time.

The next step is to sort the domains in the processed model by size and dock them in the map. Each domain is docked and refined by rigid-body refinement. Then the transformation obtained for that domain is applied to all the other domains as a quick guess as to how they should be placed. If they match the density, they are refined as well. If a complete model can be obtained by connecting these pieces (with distances between residues in one domain and the next residues in another domain compatible with the number of residues in between that are missing), then the docking is complete. If it is not, additional domains are docked, until all have been tried.

When all domains are placed, all the residues in the domains are sorted and a single clean chain with chain ID from the original predicted model and gaps corresponding to the processed model is produced.

Examples

Standard run of dock_predicted_model:

Running dock_predicted_model is easy. From the command-line you can type:

phenix.dock_predicted_model model=my_model.pdb \
   processed_model_file=my_model_processed.pdb map_file=my_map.ccp4 \
   resolution=3

This will dock the domains (one for each chain ID) from my_model_processed.pdb into my_map.ccp4. It will rename all the chains to match the chain in my_model.pdb and arrange all the residues in order.

Possible Problems

For complicated molecules, this procedure may not work well because it does not examine every possible docking position of every domain, and a complex structure might have more than one copy of a particular chain. You can improve the success a lot if you are able to mask the map around one copy of the chain you are looking for.

If that doesn't work, you can try to dock the domains from your AlphaFold model using phenix.dock_in_map and asking for multiple copies of the domains, then you could manually try to choose which domains go together. You can then create one PDB file with one copy of each domain (you will need to create a PDB file that has just one chain ID and all residues arranged sequentially, but with gaps allowed) and go on to phenix.rebuild_predicted_model with that as your docked model.

Specific limitations and problems:

Literature

Additional information

List of all available keywords

job_title = None Job title in PHENIX GUI, not used on command line
input_files
- processed_model_file = None Processed model file (e.g., from phenix.process_predicted_model). This model is expected to have one chain for each domain and actual B-values in the B-value field. It is expected that all poorly-predicted residues have been removed.
- docked_model_file = None Not used in this method
- morphed_model_file = None Not used in this method
- previous_model_file = None Not used in this method
- symmetry_file = None Symmetry file (.ncs_spec format or MATRX records) with reconstruction symmetry. Used in identification of unique part of map. NOTE: symmetry file applies to map file in original position. NOTE 2: only proper symmetry (point-group, helical) is allowed
- pae_file = None Not used in this method
- distance_model_file = None Not used in this method
- search_model_copies = None Used internally only
- search_model = None used internally only
- fragments_model_file = None Fragments model (e.g., map_to_model.pdb). Used as a source of possible model information and as a hypothesis for docking of working model. If docked_model_file is also supplied, fragments_model is used only as source of model information.
- map_model
  - full_map = None Input map file. This can be a boxed or masked map showing just the molecule to dock (best) or a full map with symmetry. If your map has symmetry be sure to set asymmetric_map = False. You can supply a symmetry file if you want. Otherwise symmetry will be automatically determined.
  - half_map = None Not used in this method
  - model = None Input predicted model (e.g., AlphaFold model). Assumed to have LDDT values in B-value field (or RMSD values).
output_files
- output_model_prefix = None Output file with docked model will begin with this prefix
- pdb_out = None Used internally only
- temp_dir = None Temporary directory. Default is dock_and_build_xx where xx creates new directory
crystal_info
- resolution = None High-resolution limit for main search. This can be lower resolution than the data. The search is quicker at lower resolution. If your model is poor, try 2-3 A lower resolution than your data (i.e, if your data is 2.5 A, try 5 A).
- scattering_table = n_gaussian wk1995 it1992 *electron neutron Choice of scattering table for structure factor calculations. Standard for X-ray is n_gaussian, for cryoEM is electron.
- wrapping = None You can specify whether the map is wrapped (can map values outside bounds to inside with cell translations).
- asymmetric_map = None Specifies that this is an asymmetric map and no symmetry is to be supplied or found. Alternative to supplying a symmetry file or symmetry.
- solvent_content = None Solvent fraction (content) of the cell. You can specify the fraction of the volume of this cell that is taken up by the macromolecule. Normally set automatically. Values go from 0 to 1.
- chain_type = *PROTEIN DNA RNA Chain type
process_predicted_model
- remove_low_confidence_residues = True Remove low-confidence residues (based on minimum lddt or maximum_rmsd, whichever is specified)
- split_model_by_compact_regions = True Split model into compact regions after removing low-confidence residues.
- maximum_domains = 3 Maximum domains to obtain. You can use this to merge the closest domains at the end of splitting the model. Make it bigger (and optionally make domain_size smaller) to get more domains.
- domain_size = 15 Approximate size of domains to be found (A units). This is the resolution that will be used to make a domain map. If you are getting too many domains, try making domain_size bigger (maximum is 70 A).
- minimum_domain_length = 10 Minimum length of a domain to keep (reject at end if smaller).
- maximum_fraction_close = 0.3 Maximum fraction of CA in one domain close to one in another before merging them
- minimum_sequential_residues = 5 Minimum length of a short segment to keep (reject at end ).
- minimum_remainder_sequence_length = 15 used to choose whether the sequence of a removed segment is written to the remainder sequence file.
- b_value_field_is = *lddt rmsd b_value The B-factor field in predicted models can be LDDT (confidence, 0-1 or 0-100) or rmsd (A) or a B-factor
- input_lddt_is_fractional = None You can specify if the input lddt values (in B-factor field) are fractional (0-1) or not (0-100). By default if all values are between 0 and 1 it is fractional.
- minimum_lddt = None If low-confidence residues are removed, the cutoff is defined by minimum_lddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum lddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum lddt values are fractional or not depending on the value of input_lddt_is_fractional.
- maximum_rmsd = 1.5 If low-confidence residues are removed, the cutoff is defined by minimum_lddt or maximum_rmsd, whichever is defined (you cannot define both). A minimum lddt of 0.70 corresponds to a maximum rmsd of 1.5. Minimum lddt values are fractional or not depending on the value of input_lddt_is_fractional.
- default_maximum_rmsd = 1.5 Default value of maximum_rmsd, used if maximum_rmsd is not set
- subtract_minimum_b = False If set, subtract the lowest B-value from all B-values just before writing out the final files. Does not affect the cutoff for removing low- confidence residues.
- pae_power = 1 If PAE matrix (predicted alignment error matrix) is supplied, each edge in the graph will be weighted proportional to (1/pae**pae_power). Use this to try and get the number of domains that you want (try 1, 0.5, 1.5, 2)
- pae_cutoff = 5 If PAE matrix (predicted alignment error matrix) is supplied, graph edges will only be created for residue pairs with pae<pae_cutoff
- pae_graph_resolution = 1 If PAE matrix (predicted alignment error matrix) is supplied, pae_graph_resolution regulates how aggressively the clustering algorithm is. Smaller values lead to larger clusters. Value should be larger than zero, and values larger than 5 are unlikely to be useful
- weight_by_ca_ca_distance = False Adjust the edge weighting for each residue pair according to the distance between CA residues. If this is True, then distance_model can be provided, otherwise supplied model will be used. See also distance_power
- distance_power = 1 If weight_by_ca_ca_distance is True, then edge weights will be multiplied by 1/distance**distance_power.
search
- dock_chains_individually = False Dock each chain (identified by chain_id field) individually. This is useful if the chains might have different orientations in the map compared to the search model, such as in a search model that is a predicted model. Normally used along with create_unique_chain_at_end=True for predicted models to put the model back together.
- create_unique_chain_at_end = None Take all docked chains and try to create one chain that has no duplicate residue numbers and that has distances between ends of fragments consistent with the number of residues between them. If symmetry has been applied, create N copies representing that symmetry. Default is True if dock_chains_individually = True and False otherwise.
- minimum_cc_to_keep_domain = 0.2 Do not keep domains in create_unique_chain_at_end if cc is less than minimum_cc_to_keep_domain.
- weight_sequential_fragments_by_distance = None Try to dock a chain in a way that minimizes the distance between its first residue and the previously-docked residue with the highest residue number less than this, and similarly for the last residue and the next available placed residue. Default is True if create_unique_chain_at_end is set. Normally use only if you have a model with a single chain and you are docking pieces of that chain.
- choose_better_of_individual_and_group_docking = None Dock entire search model (normally one only) and also by chain...pick better-fitting of the two for each chain
- low_res_search = True Try to fit by searching at low resolution first
- ssm_search = None Try to fit by searching for secondary structure. Default is False for dock_in_map and True for dock_and_rebuild.
- refine_cycles = 3 rigid-body refinement cycles
- ssm_search_min_cc = 0.30 Stop ssm search if this cc achieved
- backup_resolution_cutoff_searches = 2 If initial fitting does not work, try up to this many times increasing resolution by backup_resolution_ratio each time
- backup_resolution_ratio = 1.67 If initial fitting does not work, try up to backup_resolution_cutoff_searches times increasing the resolution by backup_resolution_ratio each time
- resolution_radius_scale = 0.5 Resolution for low-res search will be resolution_radius_scale times the radius of gyration of the search model.
- align_moments = False Try to fit by aligning moments of inertia if size of density region and molecule are similar
- max_radius_ratio = 2. Radius of gyration of molecule and density must be within max_radius_ratio of each other
- radius_scale = 1.5 Mask for density and atoms will be radius_scale times the radius of gyration of model
- use_symmetry = True If search_model_copies or search_map_copies are the same for each model and greater than one and allow_symmetry is set, use symmetry in the map to place all copies after the first one.
- skip_if_low_cc = True Skip solution before rigid-body refinement if map-model CC is less than half of min_cc.
- rigid_body_refinement = True Run rigid-body refinement on final model
- rigid_body_refinement_single_unit = True Run rigid-body refinement with just one unit (do not break up into chains)
- rigid_body_refinement_split_method = *chain_id segid When splitting up molecule for rigid-body refinement (if rigid_body_refinement_single_unit=False), use either chain_id or segid to split up molecule
- rigid_body_refinement_resolution = None Run rigid-body refinement at this resolution if specified
- append_to_fixed_model = True Append placed search model to fixed model (if any) after search
- min_cc = 0.4 If quick run, stop if minimum CC is achieved in local search. Also always skip if starting CC_mask is less than 1/4 min_cc.
- run_in_boxes = True Run on sub-boxes and combine at the end
- target_box_size = 60 Try to get boxes about this big on a side (grid units)
- target_boxes = None Try to get this many boxes. Default is nproc unless this makes box size much smaller than target_box_size
- box_to_run = None Run only this box
- box_overlap_scale = 1 box overlap (overlap of boxes) will be box_overlap_scale times the density radius
- edge_ratio = 10 box edge box_overlap times edge_ratio
- density_radius = None Radius for density to be cut out and compared. Default is 6 times the resolution.
- model_radius = 3 Radius for removing density near fixed_model
- zero_value = 0 Value to set map in regions overlapping fixed model
- density_peaks = 20 Number of NCS-related peaks of density to check
- delta_phi = 20 Angular spacing of search
- max_rot = None Maximum rotations to try
- rotz_only = None Rotate only around Z
- single_positions_to_try = 10 Number of offset positions to try in optimizing orientation. Positions along the chain are selected as centers and a local fit near each position is carried out. The resulting offsets relative to the original placement are used to optimize the overall orientation and position.
- max_position_shift_frac = 0.05 Maximum fractional positional shift in single_positions run
- min_relative_cc = 0.67 Minimum local CC relative to original CC to keep a local search. This is a way to reject local searches that are completely wrong.
- sieve_fit = None Use sieve_fit fraction of single positions in fitting. If None, use all
- ncs_copies_max = None Maximum number of matching models to write. If more than one they will be written as MODEL records in the output PDB file. You can get them individually afterwards with phenix.pdbtools placed_model.pdb keep="model 1" etc.
- start_rot = None Three numbers rotx, rotz, rotx defining the starting rotation of the search model. Normally used along with delta_phi=1000 or max_rot=1 to generate exactly one defined rotation.
- search_center = None Optional coordinates in search model for centering search. Note this is different from target_search_center which is the location xyz in the map to look.
- search_center_selection = None Optional selection defining coordinates in search model for centering search.
- target_search_center = None Optional coordinates in reference map where search_center should be approximately located after superimposing maps. Used to eliminate possible superpositions that place the search center elsewhere. Overlap scores decreased based on distance to target_search_center/density_radius.
- model_search_position = None Optional coordinates (usually part of search model) for matching to target_search_position after transformation. These can be specified in addition to target_search_center. Can have multiple model_search_position and target_search_position pairs by specifying each multiple times in order. NOTE: Not compatible with map search
- target_search_position = None Target positions for model_search_position after transformation.
- search_position_radius = None Radius for comparison of target_search_position and transformed search_position values. Default is density_radius. If specified, must be a single value or the same number of values as entries in model_search_position and target_search_position
- rot_id_n = None Number of rotation groups. Along with rot_id_group, allows defining groups of rotations to be carried out in one run. .short_caption = Number of rotation groups
- rot_id_group = None rotation group to include. See rot_id_n.
- map_box = True Run map_box to extract useful part of map before search
- fix_search_position = False You can choose to not move your model center of mass to the origin by fixing the search position
- search_box_size = None You can choose the size of the search box for local searches. Normally set by default corresponding to 3 times density_radius
- lower_bounds = None You can select a part of your map for analysis with lower_bounds and upper_bounds.
- upper_bounds = None You can select a part of your map for analysis with lower_bounds and upper_bounds.
- keep_search_order = None Keep search order as input
- remove_water = False Remove waters and other hetero atoms from input files
build
- run_fit_loops = True Run standard fit_loops
- run_trace_loops_through_density = True Run loop fitting with trace_through_density algorithm
- run_refine = True Refine morphed model
- run_iterative_resolution_refine = True Refine morphed model with iterative resolution method
- run_extend = True Extend ends of morphed model
- extract_unique = True Extract unique part of map
- use_symmetry_in_extract_unique = True Use symmetry from symmetry file (if available) when extracting unique part of map.
- acceptable_docking_cc = 0.5 Acceptable docking CC for a chain
- minimum_docking_cc = 0.15 Minimum docking CC for a chain
- cc_sd_ratio = 3. Keep residues with CC at least within cc_sd_ratio of the mean CC for good residues
- cc_sd_ratio_end = 2. Keep residues with CC at least within cc_sd_ratio of the mean CC for good residues (applies to end of fragments)
- max_gap_ratio = 3. Allow CA-CA distance to be up to max_gap_ratio times expected
- maximum_connectivity_deviation = 15 Maximum connectivity deviation. Reject a solution with bigger deviation than this plus 2 * resolution.
- keep_fraction_of_best = 0.5 Acceptable CC to keep as ratio to best found
- keep_maximum_entries = 10 Maximum dock positions to keep
- rmsd_for_similar_placement = None RMSD value indicating that two placements are similar. default is resolution of map
- rigid_body_refine_cycles = 1 Refinement cycles for rigid-body refinement
- overlap_ca_ca_distance = 3 Overlap distance for CA-CA atoms (or P-P)
- ca_ca_distance = 3.8 CA-CA distance (or P-P)
- allowed_fraction_overlapping = 0.10 Allowed fraction overlapping
- maximum_combinations = 100000 Maximum combinations of placements to consider
- proceed_with_any_symmetry = False Run even with symmetry that is not point-group or helical. Not recommended as symmetry may not work properly
- box_cushion = 20 Size of buffer around model when boxing
- refine_cycles = 3 Refinement cycles
- loop_refine_cycles = 5 Refinement cycles for loops
- residues_to_trim = 5 Residues to trim on each end of all trimmed fragments
- run_iterative_morph = True Run iterative morphing
control
- stop_after_dock = True Stop after docking step. If False, go on and rebuild the docked model.
- read_files = False Read existing output files and use them if present
- write_files = True Write output files
- nproc = 1 Number of processors (if None, use all available)
- ignore_symmetry_conflicts = True You can ignore the symmetry information (CRYST1) from coordinate files. This may be necessary if your model has been placed in a box with box_map for example.
- random_seed = 171731 Random seed
- verbose = False Verbose output
- quick = True Run quickly
- max_dirs = 1000 Maximum number of directories (dock_and_build_xxx)
guiGUI-specific parameter required for output directory
- output_dir = None