Model-building into cryo-EM and low-resolution maps with map_to_model

Author(s)

Purpose

The routine map_to_model will interpret a map (cryo-EM or low-resolution X-ray) and try to build an atomic model, fully automatically.

Usage

How map_to_model works:

If you have a CCP4-style map (MRC, etc.) or just MTZ map coefficients and a sequence file, you can use map_to_model to build a model into your map. The tool map_to_model will identify what kind of chains to build based on your sequence file. It will find where your molecule is in the map, cut out that part of the density, and work with just that part.

If your map has been averaged based on NCS symmetry and you supply a file with that NCS information (.ncs_spec, biomtr.dat, etc), map_to_model will find the asymmetric unit of NCS, build that, then expand to the entire map. You can also just supply the type of symmetry (e.g., D7) or ask to check all plausible symmetry.

The map_to_model tool will cut the density in your map into small pieces of connected density and try to build a model into each one. It will merge all the pieces into a compact model, refine it, and superimpose the final model on the original map. The type of chain or chains to be built is chosen based on your sequence file. If multiple chain types are considered, the entire map is interpreted with each chain type, and the best-fitting non-overlapping chains are then chosen.

Output files from map_to_model

map_to_model.pdb: A PDB file with the resulting model, superimposed on
the original map (or on the magnified map if magnification is applied).

Applying magnification to the map

Cryo-EM maps often have a scale that is not precisely defined by the experiment. map_to_model allows application of a scale factor (magnification) to the grid of the map. Normally this scale factor will be close to 1. If the magnification is specified and is not equal to 1, it will be applied to the input map, and a magnified map will be written out to the output directory and used as if it were the original input file from then on. Additionally, any input NCS information will be adjusted by the same magnification factor (translations and centers are scaled by the magnification factor; rotations are unchanged). Input models are not modified.
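
As an illustration of this bookkeeping (not the actual Phenix code), applying a magnification factor amounts to scaling the map grid spacing and the translational parts of any NCS operators while leaving the rotations untouched. A minimal numpy sketch, with invented example values:

  import numpy as np

  magnification = 1.02                           # hypothetical scale factor, normally close to 1
  grid_spacing = np.array([1.05, 1.05, 1.05])    # Angstroms per grid point (example values)
  ncs_rotations = [np.eye(3)]                    # example NCS operators
  ncs_translations = [np.array([10.0, 0.0, 5.0])]
  ncs_centers = [np.array([25.0, 25.0, 25.0])]

  # The grid spacing (and therefore the apparent cell) is scaled ...
  grid_spacing = magnification * grid_spacing
  # ... NCS translations and centers are scaled by the same factor ...
  ncs_translations = [magnification * t for t in ncs_translations]
  ncs_centers = [magnification * c for c in ncs_centers]
  # ... and the rotation matrices are left unchanged.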

Shifting the map to the origin

Most crystallographic maps have the origin at the corner of the map (grid point [0,0,0]), while most cryo-EM maps have the origin in the middle of the map. To handle both consistently, any map with an origin not at the corner is shifted to put the origin at grid point [0,0,0]. This shifted map is used for all further steps in model-building. At the conclusion of model-building, the model is shifted back to superimpose on the original map.
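
The bookkeeping involved looks roughly like the following numpy sketch (an illustration only; the origin, grid spacing, and coordinates are invented):

  import numpy as np

  origin = np.array([-90, -90, -90])             # example: map origin in the middle of the map
  grid_spacing = np.array([1.05, 1.05, 1.05])    # Angstroms per grid point (example values)

  # Work in a shifted frame where the origin is at grid point [0,0,0].
  shift_cart = -origin * grid_spacing            # Cartesian shift applied to the map

  # A model built in the shifted frame is moved back at the end so that it
  # superimposes on the original map.
  model_xyz_shifted = np.array([[12.0, 30.5, 44.1]])
  model_xyz_original = model_xyz_shifted - shift_cart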

Finding the region containing the molecule

By default (density_select=True), the region of the map containing density is cut out of the entire map. This is particularly useful if the original map is very large and the molecule only takes up a small part of the map. This portion of the map is then shifted to place the origin at grid point [0,0,0]. (At the conclusion of model-building, the final model is shifted back to superimpose on the original map.) The region containing density is chosen as a box containing all the points above a threshold, typically 5% of the maximum in the map.
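
Conceptually, this selection is just a bounding box around the grid points above the threshold. A schematic numpy version (the 5% threshold matches the default described above; everything else is illustrative):

  import numpy as np

  map_data = np.random.random((100, 100, 100))   # stand-in for the real map

  threshold = 0.05 * map_data.max()              # about 5% of the maximum value
  above = np.argwhere(map_data > threshold)      # grid points above the threshold
  lower = above.min(axis=0)                      # corners of the bounding box
  upper = above.max(axis=0) + 1

  # Cut out the box containing the molecule; its first grid point becomes [0,0,0].
  boxed_map = map_data[lower[0]:upper[0], lower[1]:upper[1], lower[2]:upper[2]]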

Map sharpening/blurring

By default (auto_sharpen=True) the resolution dependence of the map will be adjusted to maximize the clarity of the map. You can choose to use map kurtosis or the adjusted surface area of the map (default) for this purpose.

Kurtosis is a standard statistical measure that reflects the peakiness of the map.

The adjusted surface area is a combination of the surface area of contours in the map at a particular threshold and of the number of distinct regions enclosed by the top 30% (default) of those contours. The threshold is chosen by default to be one where the volume enclosed by the contours is 20% of the non-solvent volume in the map. The weighting between the surface area (to be maximized) and number of regions enclosed (to be minimized) is chosen empirically (default region_weight=20).
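
For reference, kurtosis can be computed directly from the map values. A small numpy illustration (the standard fourth-moment definition; the normalization used internally may differ in detail):

  import numpy as np

  def map_kurtosis(map_data):
      # Fourth central moment divided by the squared variance;
      # sharper, peakier maps generally give larger values.
      values = map_data.ravel()
      delta = values - values.mean()
      return (delta**4).mean() / (delta**2).mean()**2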

Several resolution-dependent functions are tested, and the one that gives the best kurtosis (or adjusted surface area) is chosen. In each case the map is transformed to obtain Fourier coefficients. The amplitudes of these coefficients are then adjusted, keeping the phases constant. The available functions for modifying the amplitudes are:

No sharpening (map is left as is)

Sharpening b-factor applied over the entire resolution range (b_sharpen
applied to achieve an effective isotropic overall b-value of b_iso); a
minimal sketch of this amplitude scaling is shown after this list.

Sharpening b-factor applied up to the resolution specified with the
resolution=xxx keyword, then not applied beyond this resolution (with the
transition specified by the keywords k_sharpen and b_iso_to_d_cut).  If
blurring (sharpening with a value less than zero) is applied,
the blurring is applied over the entire resolution range.

Resolution-dependent sharpening factor with three parameters.
First the resolution-dependence of the map is removed by normalizing the
amplitudes.  Then a scale factor S is applied to the data, where
log10(S) is determined by coefficients b[0], b[1], b[2] and a resolution
d_cut (typically d_cut is the nominal resolution of the map).
The value of log10(S) varies smoothly from 0 at resolution=infinity, to b[0]
at d_cut/2, to b[1] at d_cut, and to b[1]+b[2] at the highest resolution
in the map.  The value of b[1] is limited to be no larger than b[0], and the
value of b[1]+b[2] is limited to be no larger than b[1].

Sharpening using a half-dataset correlation.
The resolution-dependent correlation of density in two half-maps
is used to identify the optimal resolution-dependent weighting of
the map.  This approach requires a target resolution which is used
to set the overall fall-off with resolution for an ideal map.  That
fall-off for an ideal map is then multiplied by an estimated
resolution-dependent correlation of density in the map with the true
map (the estimation comes from the half-map correlations).


Model-based sharpening.
You can identify the sharpening parameters using your map and a
model.  This approach requires a guess of the RMSD between the model and
the true model.  The resolution-dependent correlation of model and map
density is used as in the half-map approach above to identify the
weighting of Fourier coefficients.
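
To make the first of these options concrete, here is a schematic numpy sketch of applying an overall sharpening b-factor to the Fourier amplitudes while keeping the phases fixed. The grid spacing and b_sharpen value are invented, and the real implementation works with crystallographic Fourier coefficients rather than a plain FFT:

  import numpy as np

  map_data = np.random.random((64, 64, 64))      # stand-in for the real map
  grid_spacing = 1.0                             # Angstroms per grid point (example)
  b_sharpen = 50.0                               # positive sharpens, negative blurs

  # Transform the map to Fourier coefficients.
  coeffs = np.fft.fftn(map_data)

  # Resolution of each coefficient, expressed as |s| = 1/d.
  freqs = [np.fft.fftfreq(n, d=grid_spacing) for n in map_data.shape]
  sx, sy, sz = np.meshgrid(*freqs, indexing="ij")
  s_sq = sx**2 + sy**2 + sz**2

  # Scale the amplitudes by exp(b_sharpen * s^2 / 4); the phases are unchanged
  # because each coefficient is multiplied by a real, positive factor.
  sharpened_map = np.real(np.fft.ifftn(coeffs * np.exp(b_sharpen * s_sq / 4.0)))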

Finding the NCS asymmetric unit of the map

If you supply NCS matrices describing the NCS used to average the map (if any), then map_to_model will try to define a region of the map that represents the NCS asymmetric unit. Application of the NCS operators to the NCS asymmetric unit will generate the entire map, and application to a model built into the asymmetric unit will generate the entire model.
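
For example, expanding a model built in the NCS asymmetric unit is simply repeated application of each (rotation, translation) operator to the coordinates. A minimal numpy sketch with invented operators:

  import numpy as np

  # Hypothetical NCS operators (rotation matrices and translations in Angstroms).
  rotations = [np.eye(3),
               np.array([[0.0, -1.0, 0.0],
                         [1.0,  0.0, 0.0],
                         [0.0,  0.0, 1.0]])]
  translations = [np.zeros(3), np.array([5.0, 0.0, 0.0])]

  asu_xyz = np.array([[10.0, 20.0, 30.0]])       # coordinates in the asymmetric unit

  # Applying every operator generates the full model (and, applied to the
  # density, the full map).
  full_xyz = np.vstack([asu_xyz @ R.T + t for R, t in zip(rotations, translations)])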

You can also supply the type of symmetry (e.g., C3, D7, etc.) and map_to_model will try to find that symmetry in the map. You can even search for all plausible types of symmetry (ANY). For helical symmetry, however, either an NCS file or the helical rotation and translation information is required.

Normally identification of the NCS asymmetric unit and segmentation of the map (below) are done as a single step, yielding an asymmetric unit and a set of contiguous regions of density within that asymmetric unit. The asymmetric unit of NCS will be written out as a map to the segmentation_dir directory, superimposed on the shifted map (so that they can be viewed together in Coot).

Segmentation of the map

By default (segment=True) the map or the NCS asymmetric unit of the map will be segmented (cut into small pieces) into regions of connected density. This is done by choosing a threshold of density and identifying contiguous regions in which all grid points are above this threshold. The threshold is chosen to yield regions with a size corresponding to about 50 residues. The regions of density are written out to the segmentation_dir directory and are superimposed on the shifted map (if you load the shifted map and a region map in Coot, they should superimpose).
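
Schematically, this segmentation step is connected-component labeling of the grid points above a density threshold. The sketch below uses scipy.ndimage for illustration; the threshold value is invented (in the real procedure it is adjusted until regions correspond to roughly 50 residues):

  import numpy as np
  from scipy import ndimage

  map_data = np.random.random((80, 80, 80))      # stand-in for the real map
  threshold = 0.9                                # illustrative value only

  mask = map_data > threshold
  labels, n_regions = ndimage.label(mask)        # contiguous regions of connected density

  # Each labeled region can then be written out as its own small map and
  # interpreted independently.
  region_sizes = ndimage.sum(mask, labels, index=range(1, n_regions + 1))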

Model-building

Models are built in several ways by map_to_model and then the best-fitting, non-overlapping models are chosen. The main methods used for model-building are:

Standard RESOLVE model-building for PROTEIN/RNA/DNA for the entire
asymmetric unit of NCS (or the entire molecule if no NCS).

Helices (RNA) or helices/strands (PROTEIN) for the entire asymmetric unit

Chain tracing (RNA/PROTEIN/DNA) for each segmented region, with various
values of map sharpening applied

RESOLVE model-building for each segmented region, with various values of
map sharpening applied

Intermediate models are refined with phenix.real_space_refine and are written out relative to the shifted map with origin at [0,0,0]. You can view these intermediate models, the shifted map, the shifted map containing just the NCS asymmetric unit, and any region maps in Coot; they should all superimpose.

Once all intermediate models are built, all models of each chain type are combined, taking the best-fitting model for each part of the map. Then all chain types are combined, once again taking the best-fitting model for each part of the map. The models are refined again.
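
A schematic illustration of keeping the best-fitting, non-overlapping pieces is shown below. This is not the actual RESOLVE/Phenix selection code; the fragments, scores, and overlap test are placeholders:

  # Each candidate fragment carries the residues it covers and a fit score
  # (for example, a map-model correlation).
  candidates = [
      {"residues": set(range(1, 40)),  "score": 0.78},
      {"residues": set(range(30, 70)), "score": 0.85},
      {"residues": set(range(70, 95)), "score": 0.60},
  ]

  # Greedy selection: take fragments in order of decreasing score and keep a
  # fragment only if it does not overlap anything already accepted.
  selected, covered = [], set()
  for frag in sorted(candidates, key=lambda f: f["score"], reverse=True):
      if not (frag["residues"] & covered):
          selected.append(frag)
          covered |= frag["residues"]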

Then (if present) NCS is applied to the model and the full model is refined. Finally the best model, with NCS applied if present, is shifted to match the original map and is written out.

Iterative map improvement with model-building and sharpening

You can carry out multiple cycles of model-building and map sharpening, using the model from each cycle in the sharpening process. The map is sharpened to make it as similar as possible to density calculated from the current model and the new map is used for the next cycle of model building.
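
In outline, the iteration alternates between rebuilding the model and re-sharpening the map against that model. A pseudocode-style sketch with placeholder helpers (build_model and sharpen_with_model are hypothetical names, not Phenix calls):

  def build_model(map_data, sequence):
      # Placeholder for a cycle of model-building into the current map.
      return "model built in this cycle"

  def sharpen_with_model(map_data, model):
      # Placeholder for sharpening the map to match density calculated
      # from the current model.
      return map_data

  starting_map, sequence, n_cycles = "input map", "sequence file", 3

  current_map = starting_map
  for cycle in range(n_cycles):
      model = build_model(current_map, sequence)
      current_map = sharpen_with_model(starting_map, model)   # used in the next cycle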

Examples

Standard run of map_to_model:

Running map_to_model is easy. From the command-line you can type:

phenix.map_to_model my_map.map seq.fa ncs_file=find_ncs.ncs_spec

where my_map.map is a map in CCP4, MRC, or another related format, seq.fa is a sequence file, and find_ncs.ncs_spec is an optional file specifying any NCS operators used in averaging the map. The NCS information can also be supplied in the form of BIOMTR records from a PDB file.

Standard run of map_to_model, specifying NCS type:

Finding NCS with map_to_model is easy. From the command-line you can type:

phenix.map_to_model my_map.map seq.fa ncs_type=D7

Here map_to_model will look for D7 (7-fold symmetry along the c-axis and 2-fold symmetry along a or b) in the map and apply it if it is found. It will also write out the NCS matrices corresponding to this symmetry.

Combining partial runs from map_to_model:

If you have a completed or partially-completed run of map_to_model and you want to run again but you do not want to re-run all the steps, you can skip building some models by specifying a previously-built model for some steps. The models you can specify are:

init_PROTEIN

init_RNA

init_DNA

helices_strands_only_PROTEIN

build_rna_helices_RNA

standard_PROTEIN

standard_RNA

standard_DNA

final_PROTEIN

final_RNA

final_DNA

For example, you can say:

phenix.map_to_model my_map.map seq.fa ncs_file=find_ncs.ncs_spec \
  partial_model=last_run/final_PROTEIN.pdb \
  partial_model_type=final_PROTEIN \
  partial_model=last_run/final_RNA.pdb \
  partial_model_type=final_RNA

and then the files last_run/final_PROTEIN.pdb and last_run/final_RNA.pdb will simply be combined to create a new model, instead of doing a full model-building run.

The stages of model-building that can be replaced are:

init (building into individual regions of density)

helices_strands_only (PROTEIN search for helices and strands)

build_rna_helices (RNA search for A-form helices)

standard (RESOLVE model-building on entire structure)

final (combined model for one chain type)

The chain-types that can be specified are:

PROTEIN

RNA

DNA

Possible Problems

If you have a very large structure, it is possible that your computer may not have enough memory to run map_to_model and that one or more sub-processes might crash. The best solution is to try a computer with more memory. You can also cut back the resolution used in steps that involve Fourier transformation.

If your queueing system crashes during a run, or one or more sub-processes crash, you might end up with models built for some stages of building but not others. You can carry out another run and read in the models that have already been built so that you do not need to build them again (see the section above on combining partial runs).

Specific limitations and problems:

Literature

Additional information

List of all available keywords