Segmenting cryo-EM maps with segment_and_split_map
Author(s)
- segment_and_split_map: Tom Terwilliger
Purpose
The routine segment_and_split_map will identify the asymmetric
unit of a map (typically cryo-EM) and contiguous regions of density
within the asymmetric unit of the map.
Usage
How segment_and_split_map works:
If you have a CCP4-style (mrc, etc) map and a sequence file,
you can use segment_and_split_map to split the map into smaller pieces
suitable for model-building or viewing.
The segment_and_split_map tool will find where your molecule is in the map,
cut out that part of the density, and work with just that part.
If your map has been averaged based on NCS symmetry and
you supply a file with that NCS information (.ncs_spec, biomtr.dat, etc),
segment_and_split_map will find the asymmetric unit of NCS and work
with that.
Finally, the segment_and_split_map tool will cut the density
in the asymmetric unit of your map into small pieces of connected density and
write out a map for each one.
All the maps that are written by segment_and_split_map are superimposable
on each other. They typically are all shifted from the original map to place
the origin of the maps on the grid point (0,0,0).
Output files from segment_and_split_map
shifted_map.ccp4: Original map, shifted to place the origin on grid point (0,0,0).
shifted_ncs.ncs_spec: NCS operators (if any), shifted to match shifted_map.ccp4.
shifted_pdb.pdb: Input PDB file (if any), shifted to match shifted_map.ccp4.
box_map_au.ccp4: Same as shifted_map.ccp4, except that everything except the asymmetric unit of NCS is zeroed out (map shows the asymmetric unit only).
box_mask_au.ccp4: Mask showing location of NCS asymmetric unit. Superimposes on box_map_au.ccp4 and shifted_map.ccp4.
segment_and_split_map_info.pkl: Pickled file with information about the segmentation. Used in phenix.map_to_model and to restore a shifted PDB file to original location.
Shifting the map to the origin
Most crystallographic maps have the origin at the corner of the map
(grid point [0,0,0]), while most cryo-EM maps have the origin in the
middle of the map. To make a consistent map, any maps with an origin not
at the corner are shifted to put the origin at grid point [0,0,0]. This map
is the shifted map that is used for further steps in model-building.
At the conclusion of model-building, the model is shifted back to
superimpose on the original map.
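The bookkeeping behind this shift can be illustrated with a short Python sketch. It is not the Phenix implementation: the function names, the uniform orthogonal grid spacing, and the numbers are assumptions chosen for illustration. It shows only that a coordinate moved into the frame of the shifted map returns to its original position when the same shift is added back.

```python
def origin_shift(first_grid_point, grid_spacing):
    """Cartesian position (A) of the first grid point of the original map.

    Assumes an orthogonal grid with the same spacing along x, y and z.
    """
    return tuple(index * grid_spacing for index in first_grid_point)


def to_shifted_frame(xyz, shift):
    """Express a coordinate relative to the shifted map (origin at [0,0,0])."""
    return tuple(x - s for x, s in zip(xyz, shift))


def to_original_frame(xyz, shift):
    """Undo the shift so the coordinate superimposes on the original map."""
    return tuple(x + s for x, s in zip(xyz, shift))


# Hypothetical cryo-EM map whose grid starts at index (-40,-40,-40),
# i.e. grid point [0,0,0] lies in the middle of the box.
first_grid_point = (-40, -40, -40)
grid_spacing = 1.5  # A per grid unit (assumed)

shift = origin_shift(first_grid_point, grid_spacing)
atom = (3.0, -4.5, 12.0)                 # position in the original map
shifted_atom = to_shifted_frame(atom, shift)
restored_atom = to_original_frame(shifted_atom, shift)
```

In this toy case the atom sits at (63.0, 55.5, 72.0) in the shifted frame and is restored exactly to its starting position.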
Finding the region containing the molecule
By default (density_select=True), the region of the map containing density
is cut out of the entire map. This is particularly useful if the original map
is very large and the molecule only takes up a small part of the map. This
portion of the map is then shifted to place the origin at grid point [0,0,0].
(At the conclusion of model-building, the final model is shifted back to
superimpose on the original map.) The region containing density is chosen
as a box containing all the points above a threshold, typically 5% of the
maximum in the map.
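The box selection described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used by segment_and_split_map: the map is a nested list indexed as grid[i][j][k], the function name density_box is hypothetical, and no buffer is added around the box.

```python
def density_box(grid, fraction=0.05):
    """Return (lower, upper) grid indices of the smallest box that holds
    every point whose value is at least `fraction` times the map maximum."""
    max_value = max(v for plane in grid for row in plane for v in row)
    if max_value <= 0:
        raise ValueError("map has no positive density")
    cutoff = fraction * max_value
    selected = [
        (i, j, k)
        for i, plane in enumerate(grid)
        for j, row in enumerate(plane)
        for k, value in enumerate(row)
        if value >= cutoff
    ]
    lower = tuple(min(p[axis] for p in selected) for axis in range(3))
    upper = tuple(max(p[axis] for p in selected) for axis in range(3))
    return lower, upper


# Toy 6x6x6 map: two points of real density and one weak noise point.
grid = [[[0.0] * 6 for _ in range(6)] for _ in range(6)]
grid[2][3][1] = 10.0   # map maximum, so the cutoff is 0.5
grid[4][3][2] = 1.0    # above the cutoff, included in the box
grid[1][1][1] = 0.2    # below the cutoff, ignored

lower, upper = density_box(grid, fraction=0.05)
```

Here the box runs from grid point (2, 3, 1) to (4, 3, 2); the weak point at (1, 1, 1) does not enlarge it.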
Finding the NCS asymmetric unit of the map
If you supply NCS matrices describing the NCS used to average the map (if any),
then segment_and_split_map will try to define a region of the map that represents
the NCS asymmetric unit. Application of the NCS operators to the NCS
asymmetric unit will generate the entire map, and application to a model built
into the asymmetric unit will generate the entire model. Normally
identification of the NCS asymmetric unit and segmentation of the map (below)
are done as a single step, yielding an asymmetric unit and a set of
contiguous regions of density within that asymmetric unit. The asymmetric unit
of NCS will be written out as a map to the segmentation_dir directory,
superimposed on the shifted map (so that they can be viewed together in Coot).
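The relationship between the asymmetric unit and the full model can be illustrated with a small Python sketch. The operators here are a hypothetical two-fold axis along z through the origin, written as (rotation, translation) pairs; the direction convention and the names are assumptions for illustration and are not read from an .ncs_spec file.

```python
def apply_operator(rotation, translation, xyz):
    """Return rotation * xyz + translation for one coordinate."""
    return tuple(
        sum(rotation[r][c] * xyz[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
TWO_FOLD_Z = ((-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 1.0))
NO_SHIFT = (0.0, 0.0, 0.0)

# Hypothetical C2 symmetry: the identity plus a 180-degree rotation about z.
operators = [(IDENTITY, NO_SHIFT), (TWO_FOLD_Z, NO_SHIFT)]

# Two atoms built into the asymmetric unit (coordinates in A, made up).
au_model = [(10.0, 2.0, 5.0), (11.5, 3.0, 5.5)]

# Applying every operator to the asymmetric unit gives the entire model.
full_model = [
    apply_operator(rotation, translation, xyz)
    for rotation, translation in operators
    for xyz in au_model
]
```

The full model contains one copy of the asymmetric unit per operator, four atoms in this example.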
Segmentation of the map
By default (segment=True) the map or NCS asymmetric unit of the map will
be segmented (cut into small pieces) into regions of connected density. This
is done by choosing a threshold of density and identifying contiguous regions
where all grid points are above this threshold. The threshold is chosen to
yield regions that have a size corresponding to about 50 residues. The
regions of density are written out to the segmentation_dir directory
and are superimposed on the shifted map (if you load the shifted map and a
region map in Coot, they should superimpose).
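The idea of contiguous regions above a threshold can be sketched with a breadth-first flood fill. This is an illustration only: the function name connected_regions, the nested-list map, and the use of face-sharing (6-connected) neighbors without periodic wrapping are assumptions, and the actual tool additionally tunes the threshold to reach the target region size.

```python
from collections import deque

# Face-sharing neighbors of a grid point (6-connectivity, assumed).
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_regions(grid, threshold):
    """Group grid points with value above `threshold` into contiguous
    regions and return them as lists of (i, j, k), largest first."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    seen = set()
    regions = []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                if (i, j, k) in seen or grid[i][j][k] <= threshold:
                    continue
                # Start a new region and flood-fill outward from this seed.
                region = []
                queue = deque([(i, j, k)])
                seen.add((i, j, k))
                while queue:
                    a, b, c = queue.popleft()
                    region.append((a, b, c))
                    for da, db, dc in NEIGHBORS:
                        p = (a + da, b + db, c + dc)
                        inside = 0 <= p[0] < nx and 0 <= p[1] < ny and 0 <= p[2] < nz
                        if inside and p not in seen and grid[p[0]][p[1]][p[2]] > threshold:
                            seen.add(p)
                            queue.append(p)
                regions.append(region)
    return sorted(regions, key=len, reverse=True)


# Toy map: two strong blobs joined by a weak bridge along one row.
grid = [[[0.0] * 7 for _ in range(3)] for _ in range(3)]
for k, value in enumerate([2.0, 2.0, 0.6, 2.0, 2.0]):
    grid[1][1][k] = value

low_sizes = [len(r) for r in connected_regions(grid, threshold=0.5)]
high_sizes = [len(r) for r in connected_regions(grid, threshold=1.0)]
```

At the lower threshold the bridge keeps everything in one region of 5 points; raising the threshold above the bridge value splits the density into two regions of 2 points each. This is the effect that lets a threshold be chosen to give regions of a desired size.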
Examples
Standard run of segment_and_split_map:
Running segment_and_split_map is easy. From the command-line you can type:
phenix.segment_and_split_map my_map.map seq.fa ncs_file=find_ncs.ncs_spec
where my_map.map is a CCP4, mrc or other related map format, seq.fa is a
sequence file, and find_ncs.ncs_spec is an optional file specifying any
NCS operators used in averaging the map. This can be in the form of
BIOMTR records from a PDB file as well.
Possible Problems
Specific limitations and problems:
Literature
Additional information
List of all available keywords
- input_files
- seq_file = None Sequence file (unique chains only, 1-letter code, chains separated by blank line or greater-than sign.) Can have chains that are DNA/RNA/protein and all can be present in one file.
- map_file = None File with CCP4-style map
- ncs_file = None File with NCS information (typically point-group NCS with the center specified). Typically in PDB format. Can also be a .ncs_spec file from phenix. Created automatically if ncs_type is specified.
- pdb_in = None Optional PDB file matching map_file to be offset
- pdb_to_restore = None Optional PDB file to restore to position matching original map_file. Used in combination with info_file=xxx.pkl and restored_pdb=yyyy.pdb
- info_file = None Optional pickle file with information from a previous run. Can be used with pdb_to_restore to restore a PDB file to the position matching the original map_file.
- target_ncs_au_file = None Optional PDB file to partially define the ncs asymmetric unit of the map. The coordinates in this file will be used to mark part of the ncs au and all points nearby that are not part of another ncs au will be added.
- output_files
- magnification_map_file = magnification_map.ccp4 Input map file with magnification applied. Only written if magnification is applied.
- magnification_ncs_file = magnification_ncs.ncs_spec Input NCS with magnification applied. Only written if magnification is applied.
- sharpening_map_file = sharpening_map.ccp4 Input map file with sharpening applied. Only written if sharpening is applied.
- shifted_map_file = shifted_map.ccp4 Input map file shifted to new origin.
- shifted_sharpened_map_file = shifted_sharpened_map.ccp4 Input map file shifted to new origin and sharpened.
- shifted_pdb_file = shifted_pdb.pdb Input pdb file shifted to new origin.
- shifted_ncs_file = shifted_ncs.ncs_spec NCS information shifted to new origin.
- output_directory = segmented_maps Directory where output files are to be written.
- box_map_file = box_map_au.ccp4 Output map file with one NCS asymmetric unit, cut out box
- box_mask_file = box_mask_au.ccp4 Output mask file with one NCS asymmetric unit, cut out box
- box_buffer = 5 Buffer (grid units) around NCS asymmetric unit in box_mask and map
- au_output_file_stem = shifted_au File stem for output map files with one NCS asymmetric unit
- write_intermediate_maps = False Write out intermediate maps and masks for visualization
- write_output_maps = True Write out maps
- remainder_map_file = remainder_map.ccp4 output map file with remainder after initial regions identified
- output_info_file = segment_and_split_map_info.pkl Output pickle file with information about map and masks
- restored_pdb = None Output name of PDB restored to position matching original map_file. Used in combination with info_file=xxx.pkl and pdb_to_restore=xxxx.pdb
- crystal_info
- chain_type = *None PROTEIN RNA DNA Chain type. Determined automatically from sequence file if not given. Mixed chain types are fine (leave blank if so).
- is_crystal = False Defines whether this is a crystal (or cryo-EM). Only affects printout of what the NCS represents.
- use_sg_symmetry = False If you set use_sg_symmetry=True then the symmetry of the space group will be used. For example in P1 a point at one end of the unit cell is next to a point on the other end. Normally for cryo-EM data this should be set to False and for crystal data it should be set to True.
- resolution = None Nominal resolution of the map. This is used later to decide on resolution cutoffs for Fourier inversion of the map. Note: the resolution is not cut at this value, it is cut at resolution*d_min_ratio if at all.
- space_group = None Space group (used for boxed maps)
- unit_cell = None Unit Cell (used for boxed maps)
- solvent_content = None Solvent fraction of the cell. Used for ID of solvent content in boxed maps.
- solvent_content_iterations = 3 Iterations of solvent fraction estimation. Used for ID of solvent content in boxed maps.
- reconstruction_symmetry
- ncs_type = None Symmetry used in reconstruction. For example D7, C3, C2, I (icosahedral), T (tetrahedral), or ANY (try everything and use the highest symmetry found). Not needed if ncs_file is supplied. Note: ANY does not search for helical symmetry
- ncs_center = None Center (in A) for NCS operators (if ncs is found automatically). If set to None, first guess is the center of the cell and then if that fails, found automatically as the center of the density in the map.
- optimize_center = None Optimize position of NCS center. Default is False if ncs_center is supplied or the center of the map is used, and True if the center is found automatically.
- helical_rot_deg = None helical rotation about z in degrees
- helical_trans_z_angstrom = None helical translation along z in Angstrom units
- two_fold_along_x = None Specifies if D or I two-fold is along x (True) or y (False). If None, both are tried.
- random_points = 100 Number of random points in map to examine in finding NCS
- n_rescore = 5 Number of NCS operators to rescore
- op_max = 14 If ncs_type is ANY, try up to op_max-fold symmetries
- map_modification
- magnification = None Magnification to apply to input map. Input map grid will be scaled by magnification factor before anything else is done.
- b_iso = None Target B-value for map (sharpening will be applied to yield this value of b_iso)
- b_sharpen = None Sharpen with this B-value. Contrast with b_iso, which yields a targeted value of b_iso
- resolution_dependent_b = None If set, apply resolution_dependent_b (b0 b1 b2). Log10(amplitudes) will start at 1, change to b0 at half of resolution specified, changing linearly, change to b1 at resolution specified, and change to b2 at high-resolution limit of map
- d_min_ratio = 0.833 Sharpening will be applied using d_min equal to d_min_ratio times resolution. If None, box of reflections with the same grid as the map used.
- auto_sharpen = True Automatically determine sharpening using kurtosis maximization or adjusted surface area
- auto_sharpen_methods = *no_sharpening *b_iso *b_iso_to_d_cut *resolution_dependent None Methods to use in sharpening. b_iso searches for b_iso to maximize sharpening target (kurtosis or adjusted_sa). b_iso_to_d_cut applies b_iso only up to resolution specified, with fall-over of k_sharpen. Resolution dependent adjusts 3 parameters to sharpen variably over resolution range.
- box_in_auto_sharpen = True Use a representative box of density for initial auto-sharpening instead of the entire map.
- max_box_fraction = 0.5 If box is greater than this fraction of entire map, use entire map.
- k_sharpen = 10 Steepness of transition between sharpening (up to resolution) and not sharpening (d < resolution). Note: for blurring, all data are blurred (regardless of resolution), while for sharpening, only data with d about resolution or lower are sharpened. This prevents making very high-resolution data too strong. Note 2: if k_sharpen is zero or None, then no transition is applied and all data are sharpened or blurred. Note 3: only used if b_iso is set.
- search_b_min = -100 Low bound for b_iso search.
- search_b_max = 300 High bound for b_iso search.
- search_b_n = 21 Number of b_iso values to search.
- residual_target = 'adjusted_sa' Target for maximization steps in sharpening. Can be kurtosis or adjusted_sa (adjusted surface area)
- sharpening_target = 'adjusted_sa' Overall target for sharpening. Can be kurtosis or adjusted_sa (adjusted surface area). Used to decide which sharpening approach is used. Note that during optimization, residual_target is used (they can be the same.)
- require_improvement = True Require improvement in score for sharpening to be applied
- region_weight = 40 Region weighting in adjusted surface area calculation. Score is surface area minus region_weight times number of regions. Default is 40. A smaller value will give more sharpening.
- sa_percent = 30. Percent of target regions used in calculation of adjusted surface area. Default is 30.
- fraction_occupied = 0.20 Fraction of molecular volume targeted to be inside contours. Used to set contour level. Default is 0.20
- n_bins = 20 Number of resolution bins for sharpening. Default is 20.
- max_regions_to_test = 30 Number of regions to test for surface area in adjusted_sa scoring of sharpening
- eps = None
- segmentation
- density_select = True Run map_box with density_select=True to cut out the region in the input map that contains density. Useful if the input map is much larger than the structure. Done before segmentation is carried out.
- density_select_threshold = 0.05 Choose region where density is this fraction of maximum or greater
- mask_threshold = None threshold in identification of overall mask. If None, guess volume of molecule from sequence and NCS copies.
- grid_spacing_for_au = 3 Grid spacing for asymmetric unit when constructing asymmetric unit.
- radius = None Radius for constructing asymmetric unit.
- value_outside_mask = 0.0 Value to assign to density outside masks
- density_threshold = None Threshold density for identifying regions of density. Applied after normalizing the density in the region of the molecule to an rms of 1 and mean of zero.
- starting_density_threshold = None Optional guess of threshold density
- max_overlap_fraction = 0.05 Maximum fractional overlap allowed to density in another asymmetric unit. Definition of a bad region.
- remove_bad_regions_percent = 1 Remove the worst regions that are part of more than one NCS asymmetric unit, up to remove_bad_regions_percent of the total
- require_complete = True Require all NCS copies to be represented for a region
- split_if_possible = True Split regions that are split in some NCS copies. If None, split if most copies are split.
- write_all_regions = False Write all regions to ccp4 map files.
- fraction_occupied = 0.2 Fraction of volume inside macromolecule that should be above threshold density
- max_per_au = None Maximum number of regions to be kept in the NCS asymmetric unit
- max_per_au_ratio = 5. Maximum ratio of number of regions to be kept in the NCS asymmetric unit to those expected
- min_ratio_of_ncs_copy_to_first = 0.5 Minimum ratio of the last ncs_copy region size to maximum
- min_ratio = 0.1 Minimum ratio of region size to maximum to keep it
- max_ratio_to_target = 3 Maximum ratio of grid points in top region to target
- min_ratio_to_target = 0.3 Minimum ratio of grid points in top region to target
- min_volume = 10 Minimum region size to consider (in grid points)
- residues_per_region = 50 Target number of residues per region
- seeds_to_try = 10 Number of regions to try as centers
- iterate_with_remainder = True Iterate looking for regions based on remainder from first analysis
- weight_rad_gyr = 0.1 Weight on radius of gyration of group of regions in NCS AU relative to weight on closeness to neighbors. Normalized to largest cell dimension with weight=weight_rad_gyr*300/cell_max
- expand_size = None Grid points to expand size of regions when excluding for next round. If None, set to approx number of grid points to get expand_target below
- expand_target = 1.5 Target expansion of regions (A)
- mask_additional_expand_size = 1 Mask expansion in addition to expand_size for final map
- exclude_points_in_ncs_copies = True Exclude points that are in NCS copies when creating NCS au
- control
- verbose = False Verbose output
- sharpen_only = None Sharpen map and stop
- resolve_size = None Size of resolve to use.
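Two of the keyword descriptions above state simple formulas: the adjusted surface area score (region_weight) and the resolution cutoff used in sharpening (d_min_ratio and resolution). The following Python sketch only restates that arithmetic. The function names and the input numbers are hypothetical, and how Phenix measures surface area or counts regions is not shown.

```python
def adjusted_sa_score(surface_area, n_regions, region_weight=40):
    """Surface area minus region_weight times the number of regions."""
    return surface_area - region_weight * n_regions


def sharpening_d_min(resolution, d_min_ratio=0.833):
    """Resolution cutoff for sharpening: d_min_ratio times resolution."""
    return d_min_ratio * resolution


# Hypothetical maps: the second has more surface but is broken into
# many more regions, so its score is lower with the default weight.
connected = adjusted_sa_score(surface_area=5000, n_regions=12)
fragmented = adjusted_sa_score(surface_area=5600, n_regions=30)

# With resolution=3.0 and the default ratio the cutoff is about 2.5 A.
d_min = sharpening_d_min(3.0)
```

With these made-up inputs the scores are 4520 and 4400. A smaller region_weight penalizes fragmentation less, which is consistent with the note that a smaller value gives more sharpening.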