Tutorial: Solving a structure with SAD data


This tutorial will use some good SAD data (peak wavelength from an IF5A dataset diffracting to 1.7 A) as an example of how to solve a SAD dataset with AutoSol. It is designed to be read all the way through, giving pointers for you along the way. Once you have read it all and run the example data and looked at the output files, you will be in a good position to run your own data through AutoSol.

Setting up to run PHENIX

If PHENIX is already installed and your environment is all set, then if you type:

echo $PHENIX

then you should get back something like this:


If instead you get:

PHENIX: undefined variable

then you need to set up your PHENIX environment. See the PHENIX installation page for details of how to do this. If you are using the C-shell environment (csh) then all you will need to do is add one line to your .cshrc (or equivalent) file that looks like this:

source /xtal/phenix-1.3/phenix_env

(except that the path in this statement will be where your PHENIX is installed). Then the next time you log in $PHENIX will be defined.

Running the demo p9 data with AutoSol

To run AutoSol on the demo p9 data, make yourself a tutorials directory and cd into that directory:

mkdir tutorials
cd tutorials

Now type the phenix command:

phenix.run_example --help

to list the available examples. Choosing p9-sad for this tutorial, you can now use the phenix command:

phenix.run_example p9-sad

to solve the p9 structure with AutoSol. This command will copy the directory $PHENIX/examples/p9-sad to your current directory (tutorials) and call it tutorials/p9-sad/ . Then it will run AutoSol using the command file run.sh that is present in this tutorials/p9-sad/ directory.

This command file run.sh is simple. It says:

phenix.autosol  seq_file=seq.dat sites=4 atom_type=Se  data=p9_se_w2.sca  \
space_group="I4" unit_cell="113.949 113.949  32.474 90.000  90.000  90.00"  \
resolution=2.4 thoroughness=quick

The first line (#!/bin/sh) tells the system to interpret the remainder of the text in the file using the sh (or bash) -shell (sh).

The command phenix.autosol runs the command-line version of AutoSol (see Automated Structure Solution using AutoSol for all the details about AutoSol including a full list of keywords). The arguments on the command line tell AutoSol about the sequence file (seq_file=seq.dat), the number of sites to look for (sites=4), and the atom type (atom_type=Se). (Note that each of these is specified with an = sign, and that there are no spaces around the = sign.) The Phaser heavy-atom refinement and model completion algorithm used in the AutoSol SAD phasing will add additional sites if warranted.

Note the backslash "\" at the end of some of the lines in the phenix.autosol command. This tells the C-shell (which interprets everything in this file) that the next line is a continuation of the current line. There must be no characters (not even a space) after the backslash for this to work.

The SAD data to be used to solve the structure is in the datafile p9_se_w2.sca. This datafile is in Scalepack unmerged format, which means that there may be multiple instances of each reflection and the cell parameters are not in the file, so we need to provide the cell parameters with the command, unit_cell="113.949 113.949 32.474 90.000 90.000 90.00". (Note that the cell parameters are surrounded by quotation marks. That tells the parser that these are all together.) In this example, the space group in the p9_se_w2.sca file is I41, but the correct space group is I4, so we need to tell AutoSol the correct space group with space_group="I4".

The resolution of the data in p9_se_w2.sca is to 1.74 A, but in this example we would like to solve the structure quickly, so we have cut the resolution back with the commands resolution=2.4 and thoroughness=quick. The quick command sets several defaults to give a less comprehensive search for heavy-atom sites and a less thorough model-building than if you use the default of thoroughness=thorough.

Although the phenix.run_example p9-sad command has just run AutoSol from a script (run.sh), you can run AutoSol yourself from the command line with the same phenix.autosol seq_file= ... command. You can also run AutoSol from a GUI, or by putting commands in another type of script file. All these possibilities are described in Using the PHENIX Wizards.

Where are my files?

Once you have started AutoSol or another Wizard, an output directory will be created in your current (working) directory. The first time you run AutoSol in this directory, this output directory will be called AutoSol_run_1_ (or AutoSol_run_1_/, where the slash at the end just indicates that this is a directory). All of the output from run 1 of AutoSol will be in this directory. If you run AutoSol again, a new subdirectory called AutoSol_run_2_ will be created.

Inside the directory AutoSol_run_1_ there will be one or more temporary directories such as TEMP0 created while the Wizard is running. The files in this temporary directory may be useful sometimes in figuring out what the Wizard is doing (or not doing!). By default these directories are emptied when the Wizard finishes (but you can keep their contents with the command clean_up=False if you want.)

What parameters did I use?

Once the AutoSol wizard has started (when run from the command line), a parameters file called autosol.eff will be created in your output directory (e.g., AutoSol_run_1_/autosol.eff). This parameters file has a header that says what command you used to run AutoSol, and it contains all the starting values of all parameters for this run (including the defaults for all the parameters that you did not set).

The autosol.eff file is good for more than just looking at the values of parameters, though. If you copy this file to a new one (for example autosol_hires.eff) and edit it to change the values of some of the parameters (resolution=1.74) then you can re-run AutoSol with the new values of your parameters like this:

phenix.autosol autosol_hires.eff

This command will do everything just the same as in your first run but use all the data to 1.74 A.

Reading the log files for your AutoSol run file

While the AutoSol wizard is running, there are several places you can look to see what is going on. The most important one is the overall log file for the AutoSol run. This log file is located in:


for run 1 of AutoSol. (The second 1 in this log file name will be incremented if you stop this run in the middle and restart it with a command like phenix.autosol run=1).

The AutoSol_run_1_1.log file is a running summary of what the AutoSol Wizard is doing. Here are a few of the key sections of the log files produced for the p9 SAD dataset.

Summary of the command-line arguments

Near the top of the log file you will find:

Starting AutoSol with the command:

phenix.autosol seq_file=seq.dat sites=4 atom_type=Se data=p9_se_w2.sca space_group=I4   \
unit_cell='113.949 113.949  32.474 90.000  90.000  90.00' resolution=2.4   \

This is just a repeat of how you ran AutoSol; you can copy it and paste it into the command line to repeat this run.


The input data file p9_se_w2.sca is in unmerged Scalepack format. The AutoSol wizard converts everything to premerged Scalepack format before proceeding. Here is where the AutoSol Wizard identifies the format and then calls the ImportRawData Wizard:

HKLIN ENTRY:  p9_se_w2.sca
CONTENTS: ['p9_se_w2.sca', 'sca', 'unmerged', 'I 41', None, None, ['I', 'SIGI']]
Converting the files ['p9_se_w2.sca'] to sca format before proceeding
Running import directly...
WIZARD:  ImportRawData

Using the datafiles converted to premerged format.

After completing the ImportRawData step, the AutoSol Wizard goes back to the beginning, but uses the newly-converted file p9_se_w2_PHX.sca:

HKLIN ENTRY:  AutoSol_run_1_/p9_se_w2_PHX.sca
FILE TYPE scalepack_merge
Unit cell: (113.949, 113.949, 32.474, 90, 90, 90)
Space group: I 4 (No. 79)
CONTENTS: ['AutoSol_run_1_/p9_se_w2_PHX.sca', 'sca', 'premerged', 'I 4',
[113.949, 113.949, 32.473999999999997, 90.0, 90.0, 90.0],
1.7443432606877809, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
Total of 1 input data files

Guessing cell contents

The AutoSol Wizard uses the sequence information in your sequence file (seq.dat) and the cell parameters and space group to guess the number of NCS copies and the solvent fraction, and the number of total methionines (approximately equal to the number of heavy-atom sites for SeMet proteins):

AutoSol_guess_setup_for_scaling  AutoSol  Run 1 Fri Mar  7 00:53:48 2008

Solvent fraction and resolution and ha types/scatt fact
This is the last dataset to scale
Guessing setup for scaling dataset 1
SG I 4
cell [113.949, 113.949, 32.473999999999997, 90.0, 90.0, 90.0]
Number of residues in unique chains in seq file: 139
Unit cell: (113.949, 113.949, 32.474, 90, 90, 90)
Space group: I 4 (No. 79)
CELL VOLUME :421654.580793
Total residues:139
Total Met:4
resolution estimate: 2.4

Running phenix.xtriage

The AutoSol Wizard automatically runs phenix.xtriage on each of your input datafiles to analyze them for twinning, outliers, translational symmetry, and other special conditions that you should be aware of. You can read more about xtriage in Data quality assessment with phenix.xtriage. Part of the summary output from xtriage for this dataset looks like this:

The largest off-origin peak in the Patterson function is 6.49% of the
height of the origin peak. No significant pseudotranslation is detected.

The results of the L-test indicate that the intensity statistics
behave as expected. No twinning is suspected.

Testing for anisotropy in the data

The AutoSol Wizard tests for anisotropy by determining the range of effective anisotropic B values along the principal lattice directions. If this range is large and the ratio of the largest to the smallest value is also large then the data are by default corrected to make the anisotropy small (see Analyzing and scaling the data in the AutoSol web page for more discussion of the anisotropy correction). In the p9 case, the range of anisotropic B values is small and no correction is made:

Range of aniso B:  15.67 26.14
Not using aniso-corrected data files as the range of aniso b  is only
10.47  and 'remove_aniso' is not set

Choosing datafiles with high signal-to-noise

During scaling, the AutoSol Wizard estimates the signal-to-noise in each datafile and the resolution where there is significant signal-to-noise (above 0.3:1 signal-to-noise). You can see this analysis in the log file dataset_scale_1.log for dataset 1. In this case, the signal-to-noise is 1.4 to a resolution of 2.4 A:

FILE DATA:AutoSol_run_1_/p9_se_w2_PHX.sca sn: 1.420786

Running HYSS to find the heavy-atom substructure

The HYSS (hybrid substructure search) procedure for heavy-atom searching uses a combination of a Patterson search for 2-site solutions with direct methods recycling. The search ends when the same solution is found beginning with several different starting points. The HYSS log files are named after the datafile that they are based on and the type of differences (ano, iso) that are being used. In this p9 SAD dataset, the HYSS logfile is p9_se_w2_PHX.sca_ano_1.sca_hyss.log. The key part of this HYSS log file is:

 Entering search loop:

p = peaklist index in Patterson map
f = peaklist index in two-site translation function
cc = correlation coefficient after extrapolation scan
r = number of dual-space recycling cycles
cc = final correlation coefficient

p=000 f=000 cc=0.392 r=015 cc=0.532 [ best cc: 0.532 ]
p=000 f=001 cc=0.381 r=015 cc=0.532 [ best cc: 0.532 0.532 ]
Number of matching sites of top 2 structures: 6

Here a correlation coefficient of 0.5 is very good (0.1 is hopeless, 0.2 is possible, 0.3 is good) and 6 sites were found that matched in the first two tries. The program continues until 5 structures all have matching sites, then ends and prints out the final correlations, after taking the top 4 sites.

Finding the hand and scoring heavy-atom solutions

Normally either hand of the heavy-atom substructure is a possible solution, and both must be tested by calculating phases and examining the electron density map and by carrying out density modification, as they will give the same statistics for all heavy-atom analysis and phasing steps. Note that in chiral space groups (those that have a handedness such as P61, both hands of the space group must be tested. The AutoSol Wizard will do this for you, inverting the hand of the heavy-atom substructure and the space group at the same time. For example, in space group P61 the hand of the substructure is inverted and then it is placed in space group P65.

Scoring heavy-atom solutions

The AutoSol Wizard scores heavy-atom solutions based on two criteria by default. The first criterion is the skew of the electron density in the map (SKEW). Good values for the skew are anything greater than 0.1. In a SAD structure determination, the heavy-atom solution with the correct hand may have a much more positive skew than the one with the inverse hand. The second criterion is the correlation of local RMS density (CORR_RMS). This is a measure of how contiguous the solvent and non-solvent regions are in the map. (If the local rms is low at one point and also low at neighboring points, then the solvent region must be relatively contiguous, and not split up into small regions.) For SAD datasets, Phaser is used for calculating phases. For a SAD dataset, a figure of merit of 0.3 is acceptable, 0.4 is fine and anything above 0.5 is very good. The scores for solution #1 are listed in the AutoSol log file:

Scoring for this solution now...

AutoSol_run_1_/TEMP0/resolve.scores SKEW -0.047612928
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.8755398

CC-EST (BAYES-CC) SKEW : 10.0 +/- 26.1
CC-EST (BAYES-CC) CORR_RMS : 55.7 +/- 36.1
Resetting sigma of quality estimate due to wide range of estimated values:
Overall quality:  14.7
Highest lower bound of quality for individual estimates: 37.6
Current 2*sigma:  37.5 New 2*sigma:  45.7
ESTIMATED MAP CC x 100:  14.7 +/- 45.7

The ESTIMATED MAP CC x 100 is an estimate of the quality of the experimental electron density map (not the density-modified one). A set of real structures was used to calibrate the range of values of each score that were obtained for phases with varying quality. The resulting probability distributions are used above to estimate the correlation between the experimental map and an ideal map for this structure. Then all the estimates are combined to yield an overall Bayesian estimate of the map quality. These are reported as CC x 100 +/- 2SD. These estimated map CC values are usually fairly close, so as the estimate is 14.7 +/- 45.7, you can be quite confident that this solution is not the right one.

The wizard then tries the inverse solution...

Scoring for this solution now...

AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.2644597
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9274329

CC-EST (BAYES-CC) SKEW : 56.5 +/- 18.1
CC-EST (BAYES-CC) CORR_RMS : 63.1 +/- 28.5
ESTIMATED MAP CC x 100:  60.0 +/- 13.6

Reading NCS information from:  AutoSol_run_1_/TEMP0/resolve.log
 based on  [ha_2.pdb,phaser_2.mtz]
Reformatting  ha_2.pdb  and putting it in  ha_2.pdb_formatted.pdb
RANGE to KEEP :1.28
Confident of the hand (Quality diff from opp hand is 1.9 sigma)

This solution looks a lot better. The overall estimated map CC value is 60.0 +/- 13.6. This means that your structure is not only solved but that you will have a good map when it is density modified.

Final phasing with Phaser

Once the best heavy-atom solution or solutions are chosen based on ESTIMATED MAP CC x 100, these are used in a final round of phasing with Phaser (for SAD phasing). The log file from phasing for solution 2 is in phaser_2.log. Here is the final part of the output from this log file, showing the refined coordinates, occupancies, thermal (B) factors for the 4 sites, along with the refined scattering factors (in this case only f" is refined), and the final figure of merit of phasing (0.544):

Atom Parameters: 4 atoms in list
     X      Y      Z      O      B      (AnisoB) M Atomtype
#1    0.180 -0.113 -0.681  1.135   22.8 ( ---- ) 1 SE
#2    0.686 -0.238 -0.710  0.980   23.0 (+22.40) 1 SE
#3    0.665 -0.206 -0.774  1.020   28.2 (+26.14) 1 SE
#5    0.027  0.758  0.905  0.176   23.9 ( ---- ) 1 SE

Scattering Parameters:
 Atom             f"           (f')
   SE         5.5196        -8.0000

Figures of Merit
Bin Resolution   Acentric     Centric      Single       Total
                 Number FOM   Number FOM   Number FOM   Number FOM
ALL  28.49- 2.40   7502 0.594    874 0.140     51 0.057   8427 0.544

log-likelihood gain -90088

Statistical density modification with RESOLVE

After SAD phases are calculated with Phaser, the AutoSol Wizard uses RESOLVE density modification to improve the quality of the electron density map. The statistical density modification in RESOLVE takes advantage of the flatness of the solvent region and the expected distribution of electron density in the region containing the macromolecule, as well as any NCS that can be found from the heavy-atom substructure. The weighted structure factors and phases (FWT, PHWT) from Phaser are used to calculate the starting map for RESOLVE, and the experimental structure factor amplitudes (FP) and SAD Hendrickson-Lattman coefficients from Phaser are used in the density modification process. The output from RESOLVE for solution 1 can be found in resolve_2.log. Here are key sections of this output.

First, the plot of how many points in the "protein" region of the map have each possible value of electron density. The plot below is normalized so that a density of zero is the mean of the solvent region, and the standard deviation of the density in the map is 1.0. A perfect map has a lot of points with density slightly less than zero on this scale (the points between atoms) and a few points with very high density (the points near atoms), and no points with very negative density. Such a map has a very high skew (think "skewed off to the right"). This map is good, with a positive skew, though it is not perfect.

Plot of Observed (o) and model (x) electron density distributions for protein
region, where the model distribution is given by,
 p_model(beta*(rho+offset)) = p_ideal(rho)
and then convoluted with a gaussian with width of sigma
where sigma, offset and beta are given below under "Error estimate."

                             .                   .                            .
                             .                   .                            .
                             .             xxxxxxx                            .
                             .          ooxoooooooxxo                         .
                             .          xxo      .  xoo                       .
                             .         xo        .   xxo                      .
               p(rho)        .        xx         .     xxoo                   .
                             .      ox           .       xxooo                .
                             .      xo           .         xxo                .
                             .    xx             .           xxxx             .
                             .   xo              .              xxxx          .
                             .  xx               .                 xxxx       .
                             xxx                 .                    xxxxx   .
                             x                   .                       ooxxxx
                        0.0  x................................................x

                            -2        -1         0         1         2        3

                                 normalized rho (0 = mean of solvent region)

After density modification is complete, this plot becomes much more like one from a perfect structure:

              .                   .                            .
              .          xxxxx    .                            .
              .         xooooxxo  .                            .
              .       oxo     oxo .                            .
              .       xx        xo.                            .
              .      ox          xxoo                          .
p(rho)        .     ox            .xoo                         .
              .     x             . xo                         .
              .    x              .  xxxx                      .
              .   xx              .     xxxxo                  .
              .  xx               .         xxxxxxxx           .
              . xx                .              ooxxxxxx      .
              xx                  .                   o oxxxxxoo
              x                   .                           xo
         0.0  o................................................x

             -2        -1         0         1         2        3

                  normalized rho (0 = mean of solvent region)

The key statistic from this RESOLVE density modification is the R-factor for comparison of observed structure factor amplitudes (FP) with those calculated from the density modification procedure (FC). In this p9 SAD phasing the R-factor is very low:

Overall R-factor for FC vs FP: 0.239 for       8422 reflections

An acceptable value is anything below 0.35; below 0.30 is good.

Generation of FreeR flags

The AutoSol Wizard will create a set of free R flags indicating which reflections are not to be used in refinement. By default 5% of reflections (up to a maximum of 2000) are reserved for this test set. If you want to supply a reflection file hires.mtz that has higher resolution than the data used to solve the structure, or has a test set already marked, then you can do this with the keyword input_refinement_file=hires.mtz. The files to be used for model-building and refinement are listed in the AutoSol log file:

FreeR_flag added to  phaser_2.mtz
Saving  exptl_fobs_phases_freeR_flags_2.mtz  for refinement
THE FILE AutoSol_run_1_/resolve_2.mtz will be used for model-building

Model-building with RESOLVE

The AutoSol Wizard by default uses a very quick method to build just the secondary structure of your macromolecule. This is controlled by the keyword helices_strands_only=True. The Wizard will guess from your sequence file whether the structure is protein or RNA or DNA (but you can tell it if you want with (chain_type=PROTEIN).

If the quick model-building does not build a satisfactory model (if the correlation of map and model is less than acceptable_secondary_structure_cc=0.35), then model-building is tried again with the standard build procedure, essentially the same as one cycle of model-building with the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild, except that if you specify thoroughness=quick as we have in this example, the model-building is done less comprehensively to speed things up.

In this case the secondary-structure-only model-building produces an initial model with 61 residues built and side chains assigned to 0, and which has a model-map correlation of 0.33:

Model with helices and strands is in  Build_1.pdb
Log for helices and strands is in  Build_1.log
Final file:  AutoSol_run_1_/TEMP0/Build_1.pdb
Log file:  Build_1.log  copied to  Build_1.log
Model 1: Residues built=61  placed=0  Chains=9  Model-map CC=0.33
This is new best model with cc =  0.33
Getting R for model:  Build_1.pdb
Model: AutoSol_run_1_/TEMP0/refine_1.pdb  R/Rfree=0.55/0.58

As the model-map correlation is only 0.33, the Wizard decides that this is not good enough and tries again with regular model-building, yielding a better model with 86 residues built and a map correlation of 0.55:

Model 2: Residues built=86  placed=7  Chains=15  Model-map CC=0.55
This is new best model with cc =  0.55
Refining model:  Build_2.pdb
Model: AutoSol_run_1_/TEMP0/refine_2.pdb  R/Rfree=0.46/0.49

After one model completion cycle (including extending ends of chains, fitting loops, and building outside the region already built, the best model built has 77 residues built, 22 assigned to sequence and a map correlation of 0.61:

Model completion cycle 1
Models to combine and extend:  ['Build_2.pdb', 'refine_2.pdb']
Model 3: Residues built=77  placed=22  Chains=10  Model-map CC=0.61
This is new best model with cc =  0.61
Refining model:  Build_combine_extend_3.pdb
Model: AutoSol_run_1_/TEMP0/refine_3.pdb  R/Rfree=0.45/0.47

This initial model is written out to refine_3.pdb in the output directory. It is still just a preliminary model, but it is good enough to tell that the structure is solved. For full model-building you will want to go on and use the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild )

The AutoSol_summary.dat summary file

A quick summary of the results of your AutoSol run is in the AutoSol_summary.dat file in your output directory. This file lists the key files that were produced in your run of AutoSol (all these are in the output directory) and some of the key statistics for the run, including the scores for the heavy-atom substructure and the model-building and refinement statistics. These statistics are listed for all the solutions obtained, with the highest-scoring solutions first. Here is part of the summary for this p9 SAD dataset:

 -----------CURRENT SOLUTIONS FOR RUN 1 : -------------------
 *** FILES ARE IN THE DIRECTORY: AutoSol_run_1_ ****

Solution # 2  BAYES-CC: 60.0 +/- 13.6 Dataset #1   FOM: 0.54 ----------------

Solution  2 using HYSS on
p9_se_w2_PHX.sca_ano_1.sca and taking inverse. Dataset #1
Dataset number: 1
Dataset type: sad
Datafiles used: [
Sites: 4 (Already used for Phasing at resol of 2.4)      Refined Sites: 4
NCS information  in: AutoSol_2.ncs_spec
Experimental phases in: phaser_2.mtz
Experimental phases plus FreeR_flags for refinement in:
Density-modified phases in: resolve_2.mtz
HA sites (PDB format) in: ha_2.pdb_formatted.pdb
Sequence file in: seq.dat
Model in: refine_3.pdb
  Residues built: 77
  Side-chains built: 22
  Chains: 10
  Overall model-map correlation: 0.61
  R/R-free: 0.45/0.47
Scaling logfile in: dataset_1_scale.log
HYSS logfile in: p9_se_w2_PHX.sca_ano_1.sca_hyss.log
Phasing logfile in: phaser_2.log
Density modification logfile in: resolve_2.log (R=0.24)
Build logfile in: Build_combine_extend_3.log

 Score type:     SKEW    CORR_RMS
Raw scores:     0.26      0.93
BAYES-CC:      56.50     63.07

Refined heavy atom sites (fractional):
xyz       0.180     -0.113     -0.681
xyz       0.686     -0.238     -0.710
xyz       0.665     -0.206     -0.774
xyz       0.027      0.758      0.905

How do I know if I have a good solution?

Here are some of the things to look for to tell if you have obtained a correct solution:

What to do next

Once you have run AutoSol and have obtained a good solution and model, the next thing to do is to run the AutoBuild Wizard. If you run it in the same directory where you ran AutoSol, the AutoBuild Wizard will pick up where the AutoSol Wizard left off and carry out iterative model-building, density modification and refinement to improve your model and map. See the web page Automated Model Building and Rebuilding with AutoBuild for details on how to run AutoBuild.

If you do not obtain a good solution, then it's not time to give up yet. There are a number of standard things to try that may improve the structure determination. Here are a few that you should always try:

Additional information

For details about the AutoSol Wizard, see Automated structure solution with AutoSol. For help on running Wizards, see Using the PHENIX Wizards.