Model-building into cryo-EM and low-resolution maps with map_to_model

Author(s)

Purpose

The routine map_to_model will interpret a map (cryo-EM, low-resolution X-ray) and try to build an atomic model, fully automatically.

Video Tutorial

The tutorial video is available on the Phenix YouTube channel and covers the following topics:

GUI

A Graphical User Interface is available.

Usage

How map_to_model works:

If you have a CCP4-style (mrc, etc) map or just mtz map coefficients and a sequence file, you can use map_to_model to build a model into your map. The tool map_to_model will identify what kind of chains to build based on your sequence file. It will find where your molecule is in the map and cut out and work with just that part of the density.

To run map_to_model successfully, you'll need a map with resolution of about 4.5 A or better. The higher the resolution, the more complete the model that you will typically get.

It is easiest to supply map_to_model with a map that has already been sharpened optimally (for example with phenix.auto_sharpen). It is also easiest if you cut out just the unique part of your map before supplying it to map_to_model (you can use phenix.map_box to do this with the extract_unique=True option). After you are all done you can reconstruct the entire molecule using phenix.apply_ncs.

(Alternatively you can have map_to_model do everything for you (auto_sharpen, symmetry average, cut out the unique part, reconstruct the entire molecule). This is a lot slower so normally it is best to prepare just the right part of the map in advance.)

You have a choice in map_to_model of running quickly (quick=True) or more thoroughly. Normally quick=True should be fine, particularly for building protein chains. In this mode, map_to_model runs the phenix.trace_and_build tool to trace the chain, identify positions of side chains, build a model for each segment of density, and refine the entire model. This can be done quickly with multiple processors, requiring perhaps 15 minutes to build 300 residues with 4 processors.

If you choose thorough building (quick=False), the map_to_model will try several different methods of building your model. One method is to build in small regions of the map. The map_to_model tool will cut the density in your map into small pieces of connected density and try to build model into each one. It will merge all the pieces into a compact model, refine it, and superimpose final model on the original map. Another method is the trace_and_build method used in the quick version. A third is standard resolve model-building. A fourth is finding helices and strands.

If your structure has RNA or DNA, model-building for these chains is done separately from protein. The type of chain or chains to be built are chosen based on your sequence file. If multiple chain types are considered, the entire map is interpreted with each chain type, then the best-fitting non-overlapping chains are chosen.

How to use map_to_model and trace_and_build to build and complete a model

The recommended way to build a model using phenix is first to run map_to_model and get a starting model, and then to try and build additional model and fix errors using trace_and_build and fix_insertions_deletions. Note that map_to_model will show you all the chains that are built along with their density, so you can see even before it is done whether there are changes in chain tracing that you would like to make.

Once map_to_model is done, you can try to connect parts of the model, extend chains, or build new chains using trace_and_build. map_to_model writes out each segment as a separate chain, so you can specify chains to be connected with trace_and_build. For example, if you think chains A and B should be connected, you can use pdbtools to create a working model working_AB.pdb with just chains A and B. Then you can run trace_and_build with working_AB.pdb and your map, specifying connect_chains and turning off extend_chains and find_chains. Then trace_and_build will attempt to connect your chains A and B and create a new model with a single chain. You can then replace chains A and B in your model with the new chain.

You can also try to fix insertions and deletions in your model by running fix_insertions_deletions with your model, your sequence file, and your map.

You can try to improve the sequence assignment with sequence_from_map. This is run automatically at the end of map_to_model and trace_and_build but if you create new chains you might want to try and assign them to sequence separately.

If you want to create a model with full symmetry, you can do this with map_to_model as well. You supply the full map, your model of the unique part of the structure as the starting_model, and the symmetry matrices that you can obtain from the map_symmetry tool. You can turn off all the model-building methods so that the only thing that happens is assembly and refinement. Then map_to_model will apply the symmetry, remove any pieces that overlap due to symmetry, and refine the entire model.

Applying magnification to the map

Cryo-EM maps often have a scale that is not precisely defined by the experiment. map_to_model allows application of a scale factor (magnification) to the grid of the map. Normally this scale factor will be close to 1. If the magnification is specified and is not equal to 1, it will be applied to the input map and a magnification map will be written out to the output directory and will be used as if it were the original input file from then on. Additionally any input symmetry information will be adjusted by the same magnification factor (translations and centers are scaled by the magnification factor, rotations are unchanged). Any input models are not modified.

Shifting the map to the origin

Most crystallographic maps have the origin at the corner of the map ( grid point [0,0,0]), while most cryo-EM maps have the orgin in the middle of the map. To make a consistent map, any maps with an origin not at the corner are shifted to put the origin at grid point [0,0,0]. This map is the shifted map that is used for further steps in model-building. At the conclusion of model-building, the model is shifted back to superimpose on the original map.

Finding the region containing the molecule

If requested, (density_select=True), the region of the map containing density is cut out of the entire map. This is particularly useful if the original map is very large and the molecule only takes up a small part of the map. This portion of the map is then shifted to place the origin at grid point [0,0,0]. (At the conclusion of model-building, the final model is shifted back to superimpose on the original map.) The region containing density is chosen as a box containing all the points above a threshold, typically 5% of the maximum in the map.

Map sharpening/blurring

If requested, (auto_sharpen=True) the resolution dependence of the map will be adjusted to maximize the clarity of the map. It is generally preferable to do this in advance using the phenix.auto_sharpen tool however.

If you run auto_sharpen, You can choose to use map kurtosis or the adjusted surface area of the map (default) for this purpose.

Kurtosis is a standard statistical measure that reflects the peakiness of the map.

The adjusted surface area is a combination of the surface area of contours in the map at a particular threshold and of the number of distinct regions enclosed by the top 30% (default) of those contours. The threshold is chosen by default to be one where the volume enclosed by the contours is 20% of the non-solvent volume in the map. The weighting between the surface area (to be maximized) and number of regions enclosed (to be minimized) is chosen empirically (default region_weight=20).

Several resolution-dependent functions are tested, and the one that gives the best adjusted surface area (or kurtosis) is chosen. In each case the map is transformed to obtain Fourier coefficients. The amplitudes of these coefficients are then adjusted, keeping the phases constant. The available functions for modifying the amplitudes are:

No sharpening (map is left as is)

Sharpening b-factor applied over entire resolution range (b_sharpen
applied to achieve an effective isotropic overall b-value of b_iso).

Sharpening b-factor applied up to resolution specified with the
resolution=xxx keyword, then blurring applied beyond this resolution (with
transition specified by the keyword k_sharpen, b_iso_to_d_cut).  If
sharpening value is less than zero (map is to be blurred),
the blurring is applied over the entire resolution range.

Resolution-dependent sharpening factor with three parameters.
First the resolution-dependence of the map is removed by normalizing the
amplitudes.  Then a scale factor S is to the data, where
log10(S) is determined by coefficients b[0],b[1],b[2] and a resolution
d_cut (typically d_cut is the nominal resolution of the map).
The value of log10(S) varies smoothly from 0 at resolution=infinity, to b[0]
at d_cut/2, to b[1] at d_cut, and to b[1]+b[2] at the highest resolution
in the map.  The value of b[1] is limited to being no larger than b[0] and the
value of b[1]+b[2] is limited to be no larger than b[1].

Sharpening using a half-dataset correlation.
The resolution-dependent correlation of density in two half-maps
is used to identify the optimal resolution-dependent weighting of
the map.  This approach requires a target resolution which is used
to set the overall fall-off with resolution for an ideal map.  That
fall-off for an ideal map is then multiplied by an estimated
resolution-dependent correlation of density in the map with the true
map (the estimation comes from the half-map correlations).


Model-based sharpening.
You can identify the sharpening parameters using your map and a
model.  This approach requires a guess of the RMSD between the model and
the true model.  The resolution-dependent correlation of model and map
density is used as in the half-map approach above to identify the
weighting of Fourier coefficients.

Local sharpening

Any of the sharpening methods can be applied locally instead of globally over the entire map. For local sharpening, the map is cut into overlapping boxes and sharpening parameter are determined for that box and applied. For parts of boxes that overlap, the distances from a grid point to the centers of the corresponding boxes are used to weight the values for those boxes.

Finding the asymmetric unit of the map

Normally you should supply map_to_model with the unique part (asymmetric unit of the map. You can get this using the phenix.map_box tool with the extract_unique=True command. However if you want map_to_model to do this automatically, you can supply symmetry matrices describing the symmetry used to average the map (if any), then map_to_model will try to define a region of the map that represents the asymmetric unit of the map. Application of the symmetry operators to the asymmetric unit will generate the entire map, and application to a model built into the asymmetric unit will generate the entire model.

You can also supply the type of symmetry (e.g., C3, D7, etc) and map_to_model will try to find that symmetry in the map. You can even search for all plausible symmetry (ANY). For helical symmetry either a symmetry file or the rotation and translation information is required, however.

Normally identification of the asymmetric unit and segmentation of the map (below) are done as a single step, yielding an asymmetric unit and a set of contiguous regions of density within that asymmetric unit. The asymmetric unit will be written out as a map to the segmentation_dir directory, superimposed on the shifted map (so that they can be viewed together in Coot).

Segmentation of the map

By default (segment=True) the map or asymmetric unit of the map will be segmented (cut into small pieces) into regions of connected density. This is done by choosing a threshold of density and identifying contiguous regions where all grid points are above this threshold. The threshold is chosen to yield regions that have a size corresponding to about 50 residues. The regions of density are written out to the segmentation_dir directory and are superimposed on the shifted map (if you load the shifted map in Coot and a region map in Coot, they should superimpose.)

Model-building

Models are built in several ways by map_to_model and then the best-fitting, non-overlapping models are chosen. If the quick method for map_to_model is selected and the chain type is protein, then the default is to run just the trace_and_build model-building method. If quick=False, all methods are tried. The main methods used for model-building are:

Chain tracing followed by model-building (trace_and_build).  Continuous
density is identified in the map, choosing the longest paths if there
are choices.  Then C-alpha positions are identified from the presence of
side chain density, an all-atom model is constructed and refined.

Standard RESOLVE model-building for PROTEIN/RNA/DNA for the entire
asymmetric unit  (or the entire molecule if no symmetry was used ).

Helices (RNA) or helices/strands (PROTEIN) for entire asymmetric unit

tracing chain (RNA/PROTEIN/DNA) for each segmented region, with various
values of map sharpening applied

RESOLVE model-building for each segmented region, with various values of
map sharpening applied

Intermediate models are refined with phenix.real_space_refine and are written out relative to the shifted map with origin at [0,0,0]. You can view these intermediate models, the shifted map, and the shifted map containing just the asymmetric unit , and any region maps in Coot and they should all superimpose.

Once all intermediate models are built, all models of each chain type are combined, taking the best-fitting model for each part of the map. Then all chain types are combined, once again taking the best-fitting model for each part of the map. The models are refined again.

Then (if present) symmetry is applied to the model and the full model is refined. Finally the best model, with symmetry applied if present, is shifted to match the original map and is written out.

Iterative map improvement with model-building and sharpening

A procedure is available for carrying out multiple cycles multiple cycles of model-building and map sharpening, using the model from each cycle in the sharpening process. In practice this may help in borderline cases but it is not recommended for normal use. The map is sharpened to make it as similar as possible to density calculated from the current model and the new map is used for the next cycle of model building.

Examples

Standard run of map_to_model:

Running map_to_model is easy. From the command-line you can type:

phenix.map_to_model my_map.map seq.fa resolution=3

where my_map.map is a CCP4, mrc or other related map format, seq.fa is a sequence file. Normally you should sharpen your map with phenix.auto_sharpen and cut out the unique part of the map with phenix.map_box (using the extract_unique=True keyword) before supplying it to map_to_model.

Output files from map_to_model

map_to_model.pdb: A PDB file with the resulting model, superimposed on
the original map (or on the magnified map if magnification is applied).

Standard run of map_to_model, specifying symmetry type:

Running map_to_model with symmetry is also easy. From the command-line you can type:

phenix.map_to_model my_map.map seq.fa symmetry=D7 resolution=3

Here map_to_model will look for D7 (7-fold symmetry along the c-axis and 2-fold symmetry along a or b) in the map and apply it if it is found. It will also write out the matrices corresponding to this symmetry.

Using carry_on to continue with a partially-finished run

If you have a completed or partially-completed run of map_to_model and you want to run again but you do not want to re-run all the steps, you can just carry on from where you left off by using the keyword carry_on=True.

Using carry_on to break up your run into small pieces

You can use carry_on=True to progressively build up your model from pieces or to run many jobs in parallel.

First you run map_to_model with segment_and_split_only=True to split up your map and write a file (the info_pickle_file) that has all the necessary information to put everything together.

Then you can create script files to run each small step in the analysis. You can do that with the keyword split_up_job=True. This will put your scripts in a commands/ subdirectory and you can run them with your queueing system. When they are done, you can use carry_on=True to put everything together.

You can also specify a source or other command to be placed in all your scripts with the keyword source_command="xxx".

The scripts created by the split_up_job command use a set of keywords like this (here only RNA will be built, building will be done in just region number 26, secondary structure searches are disabled and overall model-building is disabled:

do_only_one_thing=True
chain_type=RNA
build_in_regions=True
input_map_id_start=26
input_map_id_end=26

include_helices_strands_only=False
include_phase_and_build=False

Another run might look like this, where overall model-building is done starting with model number 4:

do_only_one_thing=True
chain_type=PROTEIN
build_in_regions=False
include_helices_strands_only=False
include_phase_and_build=True
model_start=4

You can run all the combinations that you would like in parallel.

Special treatment of structures with very high symmetry

If your structure has very high symmetry (by default, 20 or more symmetry operators), then segment_and_split_map will try to cut out a piece of your map and work with that instead of working with the whole map. You can control this with the keyword select_au_box=True (or None, which will give the default behavior). If you use select_au_box then a box that contains about n_au_box asymmetric units of the map (default of 5) will be cut out of your map. That map will be worked with along with the symmetry operators that apply to it for the remainder of the analysis. At the end of the analysis, just the region cut out will be built and placed in the original reference frame. This process can greatly speed up building and save on memory.

Building just the asymmetric unit of your model to save memory

Another way to save on memory is to run just the segment_and_split step of map_to_model on a computer with a lot of memory, then use the small map created by segment_and_split for subsequent stages, and then finally take the resulting model for this small map, put it into map_to_model on the computer with a lot of memory and use it to create a final model. Here are the steps for doing this:

In a clean directory,
run phenix.map_to_model with the keyword
stop_after_segment_and_split_map=True on your map, including symmetry
information. Run this on a computer with a huge amount of memory.
Your segmentation information will be in segmented_maps/

Take the file segmented_maps/box_map_au.ccp4, which contains just the
asymmetric unit of density from your original map, and in a new
directory (perhaps box_map_directory/)
run phenix.map_to_model with box_map_au.ccp4 and the unique
part of your sequence file.  This can be run (with luck) on a computer
with a more moderate amount of memory.  The final model will be
box_map_directory/map_to_model/map_to_model.pdb;
it should match the box_map_au.ccp4 map.

Go back to the original directory where you split up your original map.
Run phenix.map_to_model with carry_on=True and specifying three keywords:

starting_model=box_map_directory/map_to_model/map_to_model.pdb
starting_model_is_from_box_map_au=True
build_new_model=False

Now map_to_model will start with your model from the small map, shift it
to match the original map, apply any symmetry you used originally,
refine the model, and write out a model map_to_model/map_to_model.pdb

Possible Problems

If you have a very large structure it is possible that your computer may not have enough memory to run map_to_model and that one or more sub-processes might crash.

The easiest solution is to run phenix.map_box with the extract_unique=True option beforehand so that you are running map_to_model on just the unique part of the structure.

You can also just cut out boxes of density from your structure with the phenix.map_box tool and work on each one individually.

You can also set the keyword coarse_grid=True to use a coarse grid in RESOLVE and save memory. You can also cut back the resolution which will save memory. Otherwise, you can try on a computer with even more memory.

If your queueing system crashes during a run or one or more sub-processes crashes, then you might end up with models built for some stages of building and others not. You can continue on with the keyword carry_on=True in this case. (see above in the section on combining partial runs).

Specific limitations and problems:

Literature

Additional information

List of all available keywords