Python-based Hierarchical ENvironment for Integrated Xtallography

Tutorial 2: Solving a structure with MAD data

Introduction
Setting up to run PHENIX
Running the demo gene-5 data with AutoSol
Where are my files?
What parameters did I use?
Reading the log files for your AutoSol run file
Summary of the command-line arguments
Reading the datafiles.
Guessing cell contents
Running phenix.xtriage
Testing for anisotropy in the data
Scaling MAD data and estimating FA values
Choosing datafiles with high signal-to-noise
Running HYSS to find the heavy-atom substructure
Finding the hand and scoring heavy-atom solutions
Final phasing with SOLVE
Statistical density modification with RESOLVE
Generation of FreeR flags
Model-building with RESOLVE
The AutoSol_summary.dat summary file
How do I know if I have a good solution?
What to do next
Additional information

Introduction

This tutorial will use some moderately good MAD data (3 wavelengths from a gene-5 protein SeMet dataset diffracting to 2.6 A) as an example of how to solve a MAD dataset with AutoSol. It is designed to be read all the way through, giving pointers for you along the way. Once you have read it all and run the example data and looked at the output files, you will be in a good position to run your own data through AutoSol.

Setting up to run PHENIX

If PHENIX is already installed and your environment is all set, then if you type:

echo $PHENIX
then you should get back something like this:
/xtal/phenix-1.3
If instead you get:
PHENIX: undefined variable
then you need to set up your PHENIX environment. See the PHENIX installation page for details of how to do this. If you are using the C-shell environment (csh) then all you will need to do is add one line to your .cshrc (or equivalent) file that looks like this:
source /xtal/phenix-1.3/phenix_env
(except that the path in this statement will be where your PHENIX is installed). Then the next time you log in $PHENIX will be defined.
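The same check can be scripted. The sketch below is only an illustration: the helper name phenix_root is made up for this example, and it relies on nothing beyond the PHENIX environment variable described above.

```python
import os

def phenix_root(environ=os.environ):
    """Return the value of $PHENIX, or None if the environment is not set up."""
    value = environ.get("PHENIX", "").strip()
    return value or None

root = phenix_root()
if root is None:
    print("PHENIX is not defined; source your phenix_env file first")
else:
    print("PHENIX is installed in", root)
```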

Running the demo gene-5 data with AutoSol

To run AutoSol on the demo gene-5 data, make yourself a tutorials directory and cd into that directory:

mkdir tutorials
cd tutorials 
Now type the phenix command:
phenix.run_example --help 
to list the available examples. Choosing gene-5-mad for this tutorial, you can now use the phenix command:
phenix.run_example gene-5-mad 
to solve the gene-5 structure with AutoSol. This command will copy the directory $PHENIX/examples/gene-5-mad to your current directory (tutorials) and call it tutorials/gene-5-mad/. Then it will run AutoSol using the command file run.sh that is present in this tutorials/gene-5-mad/ directory. This command file run.sh is simple. It says:
#!/bin/sh
echo "Running AutoSol on gene-5 protein data..."
phenix.autosol  gene-5-mad.eff
The first line (#!/bin/sh) tells the system to interpret the remainder of the text in the file using the sh (or bash) shell. The command phenix.autosol gene-5-mad.eff runs the command-line version of AutoSol with the parameters in the file gene-5-mad.eff (see Automated Structure Solution using AutoSol for all the details about AutoSol including a full list of keywords). The file gene-5-mad.eff is a typical PHENIX parameters file:
# parameters for autosol run with gene-5 3-wavelength MAD data
# 
autosol {
  atom_type = Se
  sites = 2
  seq_file = sequence.dat
  crystal_info {
    space_group = C2 
    unit_cell = 76.08 27.97 42.36 90 103.2 90
    resolution = 2.6
  }
  wavelength {
    data = high.sca
    lambda = 0.9600
    f_prime = -1.5
    f_double_prime = 3
  }
  wavelength {
    data = peak.sca
    lambda = 0.9792
    f_prime = -3
    f_double_prime = 4
  }
  wavelength {
    data = infl.sca
    lambda = 0.9798
    f_prime = -5
    f_double_prime = 2
  }
}
Notice how the brackets ({ and }) work in this file. Everything in this file after the word "autosol" that is between the opening left-bracket ({) and the closing right-bracket (}) is part of the autosol "scope". The AutoSol wizard looks for "autosol { lots of parameters }" and interprets everything inside these brackets. Everything outside the scope "autosol" is ignored.

Within the autosol scope there are some keywords like "atom_type = Se"; these are normally one per line.

There are also additional scopes, with keywords inside them. For example the space_group and unit_cell information are inside the scope "crystal_info".

The information about each wavelength of data is in a separate scope called "wavelength". You can have as many of these as you like. Within this scope you can define the datafile name, the wavelength (lambda), and f_prime and f_double_prime values for that wavelength.

The MAD data to be used to solve the structure are in the datafiles peak.sca, infl.sca and high.sca. These datafiles are in Scalepack premerged format, which means that there is just one instance of each reflection and the cell parameters are in the file, so we do not need to provide the cell parameters or the space group (unless the ones in the .sca files are incorrect!). The resolution of the data is about 2.6 A.

Although the phenix.run_example gene-5-mad command has just run AutoSol from a script (run.sh), you can run AutoSol yourself from the command line with the same phenix.autosol gene-5-mad.eff command. You can also run AutoSol from a GUI; see the page on running a Wizard from a GUI or the command line.
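To make the idea of nested scopes concrete, here is a toy reader for the bracket syntax of a file like gene-5-mad.eff. It is a sketch for illustration only and is not the parser that PHENIX itself uses; the function name parse_scopes is invented here, and the code assumes one keyword or one bracket per line, as in the example file.

```python
def parse_scopes(text):
    """Parse 'name { key = value ... }' text into nested dictionaries.

    Every scope name maps to a list of dictionaries, so that repeated
    scopes (several 'wavelength' blocks) are all kept.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        if line.endswith("{"):                # open a new scope
            child = {}
            stack[-1].setdefault(line[:-1].strip(), []).append(child)
            stack.append(child)
        elif line == "}":                     # close the current scope
            stack.pop()
        elif "=" in line:                     # keyword = value
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
    return root

EXAMPLE = """
# shortened copy of gene-5-mad.eff
autosol {
  atom_type = Se
  sites = 2
  crystal_info {
    resolution = 2.6
  }
  wavelength {
    data = high.sca
    lambda = 0.9600
  }
  wavelength {
    data = peak.sca
    lambda = 0.9792
  }
}
"""

autosol = parse_scopes(EXAMPLE)["autosol"][0]
print(autosol["atom_type"])
print([w["data"] for w in autosol["wavelength"]])
```

All values are kept as strings; converting them to numbers is left to the caller.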

Where are my files?

Once you have started AutoSol or another Wizard, an output directory will be created in your current (working) directory. The first time you run AutoSol in this directory, this output directory will be called AutoSol_run_1_ (or AutoSol_run_1_/, where the slash at the end just indicates that this is a directory). All of the output from run 1 of AutoSol will be in this directory. If you run AutoSol again, a new subdirectory called AutoSol_run_2_ will be created. Inside the directory AutoSol_run_1_ there will be one or more temporary directories such as TEMP0 created while the Wizard is running. The files in this temporary directory may be useful sometimes in figuring out what the Wizard is doing (or not doing!). By default these directories are emptied when the Wizard finishes (but you can keep their contents with the keyword clean_up=False if you want).
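If you have many runs in one working directory, a few lines of Python can list them. This is a sketch that relies only on the AutoSol_run_N_ naming pattern described above; the helper names are invented, and the guess that the next run is numbered one higher than the largest existing run is an assumption.

```python
import re

RUN_DIR = re.compile(r"^AutoSol_run_(\d+)_$")

def run_numbers(names):
    """Return the sorted run numbers found in a list of directory names."""
    numbers = []
    for name in names:
        match = RUN_DIR.match(name.rstrip("/"))
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)

def next_run_dir(names):
    """Name the directory that the next run would be expected to create."""
    numbers = run_numbers(names)
    return "AutoSol_run_%d_" % (numbers[-1] + 1 if numbers else 1)

listing = ["AutoSol_run_1_", "AutoSol_run_2_/", "sequence.dat", "run.sh"]
print(run_numbers(listing))
print(next_run_dir(listing))
```

For a real directory, pass os.listdir(".") instead of the hard-coded listing.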

What parameters did I use?

Once the AutoSol wizard has started (when run from the command line), a parameters file called autosol.eff will be created in your output directory (e.g., AutoSol_run_1_/autosol.eff). This parameters file has a header that says what command you used to run AutoSol, and it contains all the starting values of all parameters for this run (including the defaults for all the parameters that you did not set). The autosol.eff file is good for more than just looking at the values of parameters, though. If you copy this file to a new one (for example autosol_lores.eff) and edit it to change the values of some of the parameters (for example, resolution=3.0) then you can re-run AutoSol with the new values of your parameters like this:

phenix.autosol autosol_lores.eff
This command will do everything just the same as in your first run but use only the data to 3.0 A.
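The edit to the copied parameters file can also be scripted. The helper below (set_parameter, a name invented for this sketch) changes only the first line that sets the given keyword. A real autosol.eff may use the same keyword in more than one scope, so check the result before running it.

```python
import re

def set_parameter(eff_text, name, value):
    """Return eff_text with the first 'name = ...' line set to value."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(name) + r"[ \t]*=[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda match: match.group(1) + str(value), eff_text, count=1)
    if count == 0:
        raise KeyError(name)   # keyword not present in the text
    return new_text

original = """autosol {
  crystal_info {
    resolution = 2.6
  }
}
"""

edited = set_parameter(original, "resolution", 3.0)
print(edited)
```

Writing the edited text to autosol_lores.eff then gives a file that can be passed to phenix.autosol as shown above.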

Reading the log files for your AutoSol run file

While the AutoSol wizard is running, there are several places you can look to see what is going on. The most important one is the overall log file for the AutoSol run. This log file is located in:

AutoSol_run_1_/AutoSol_run_1_1.log
for run 1 of AutoSol. (The second 1 in this log file name will be incremented if you stop this run in the middle and restart it with a command like phenix.autosol run=1). The AutoSol_run_1_1.log file is a running summary of what the AutoSol Wizard is doing. Here are a few of the key sections of the log files produced for the gene-5 MAD dataset.

Summary of the command-line arguments

Near the top of the log file you will find:

 ------------------------------------------------------------
Starting AutoSol with the command:

phenix.autosol

Reading effective parameters from gene-5-mad.eff

This is just telling you how you ran AutoSol. Next comes a list of the values of all the parameters that were used to run AutoSol. It is the same as the file "autosol.eff" that is written to record these values. The beginning of this file looks like:
autosol {
  atom_type = "Se"
  lambda = None
  f_prime = None
  f_double_prime = None
  wavelength_name = peak inf high low remote
  sites = 2
  sites_file = None
  seq_file = "sequence.dat"
Here anything that is None was not set. Also anything that has a list of choices (wavelength_name in this case) is not set unless a "*" is next to one of the choices.

This first set of parameters are all keywords that apply to all datasets or are shortcuts for setting keywords that apply to a single dataset. For example "atom_type=Se" will set the atom_type for all wavelengths and all derivatives.

Later in the parameters file comes each wavelength:

 wavelength {
    wavelength_name = peak inf high low remote
    data = "high.sca"
    labels = None
    atom_type = None
    lambda = 0.96
    f_prime = -1.5
    f_double_prime = 3
    sites = 2
    sites_file = None
    group = 1
    added_wavelength = False
  }

Here are listed all the values that you have set for this wavelength. The group is a number you can set to identify what group of wavelengths/natives/derivatives this wavelength is a part of. That lets you enter MIR+SAD or any other combinations of datasets. The "added_wavelength" keyword is added by the wizard and just marks whether you specifically added this wavelength or not.

Reading the datafiles.

The AutoSol Wizard will read in your datafiles and check their contents, printing out a summary for each one:

HKLIN ENTRY:  high.sca
FILE TYPE scalepack_merge
GUESS FILE TYPE MERGE TYPE sca premerged
LABELS['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']
Unit cell: (76.08, 27.97, 42.36, 90, 103.2, 90)
Space group: C 1 2 1 (No. 5)
CONTENTS: ['high.sca', 'sca', 'premerged', 'C 1 2 1', 
[76.079999999999998, 27.969999999999999, 42.359999999999999, 90.0, 103.2, 90.0],
2.5940784397029653, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
Total of 3 input data files
['peak.sca', 'infl.sca', 'high.sca']

Guessing cell contents

The AutoSol Wizard uses the sequence information in your sequence file (sequence.dat) and the cell parameters and space group to guess the number of NCS copies and the solvent fraction, and the number of total methionines (approximately equal to the number of heavy-atom sites for SeMet proteins):

 
AutoSol_guess_setup_for_scaling  AutoSol  Run 1 Thu Mar  6 21:43:20 2008
Solvent fraction and resolution and ha types/scatt fact
This is the last dataset to scale
Guessing setup for scaling dataset 1
SG C 1 2 1
cell [76.079999999999998, 27.969999999999999, 42.359999999999999, 90.0, 103.2, 90.0]
Number of residues in unique chains in seq file: 87
Unit cell: (76.08, 27.97, 42.36, 90, 103.2, 90)
Space group: C 1 2 1 (No. 5)
CELL VOLUME :87758.6787391
N_EQUIV:4
GUESS OF NCS COPIES: 1
SOLVENT FRACTION ESTIMATE: 0.46
Total residues:87
Total Met:2
resolution estimate: 2.59
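The numbers in this part of the log can be checked by hand. The sketch below computes the cell volume from the cell parameters and then makes a Matthews-style solvent estimate. The average mass of 110 Da per residue and the constant 1.23 are common rules of thumb used here as assumptions; this is not necessarily how AutoSol makes its guess.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Volume (cubic Angstroms) of a unit cell; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

def solvent_fraction(volume, n_equiv, ncs_copies, n_residues,
                     da_per_residue=110.0):
    """Matthews-style estimate (110 Da/residue and 1.23 are assumptions)."""
    matthews = volume / (n_equiv * ncs_copies * n_residues * da_per_residue)
    return 1.0 - 1.23 / matthews

volume = cell_volume(76.08, 27.97, 42.36, 90, 103.2, 90)
solvent = solvent_fraction(volume, n_equiv=4, ncs_copies=1, n_residues=87)
print("cell volume: %.1f" % volume)
print("solvent fraction: %.2f" % solvent)
```

With these assumptions the volume agrees with the CELL VOLUME line and the solvent fraction rounds to the 0.46 reported in the log.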

Running phenix.xtriage

The AutoSol Wizard automatically runs phenix.xtriage on each of your input datafiles to analyze them for twinning, outliers, translational symmetry, and other special conditions that you should be aware of. You can read more about xtriage in Data quality assessment with phenix.xtriage. Part of the summary output from xtriage for this dataset looks like this:

 
The largest off-origin peak in the Patterson function is 12.60% of the
height of the origin peak. No significant pseudotranslation is detected.

The results of the L-test indicate that the intensity statistics
behave as expected. No twinning is suspected.

Testing for anisotropy in the data

The AutoSol Wizard tests for anisotropy by determining the range of effective anisotropic B values along the principal lattice directions. If this range is large and the ratio of the largest to the smallest value is also large then the data are by default corrected to make the anisotropy small (see Analyzing and scaling the data in the AutoSol web page for more discussion of the anisotropy correction). In the gene-5 case, the range of anisotropic B values is small and no correction is made:

Range of aniso B:  24.58 27.92
Not using aniso-corrected data files as the range of aniso b  is 
only  3.43  and 'correct_aniso' is not set
Note that if any one of the datafiles in a MAD dataset has a high anisotropy, then by default all of them will be corrected for anisotropy.

Scaling MAD data and estimating FA values

The AutoSol Wizard uses SOLVE localscaling to scale MAD data. The procedure is basically to scale all the data to the most complete dataset, ignoring anomalous differences, to create a reference dataset. Then all F+ and F- observations at all wavelengths are scaled to this reference dataset, and then the data are merged to the asymmetric unit, averaging duplicate observations. During this process outliers that deviate from the reference values by more than ratio_out (default=1) standard deviations (using all data in the appropriate resolution shell to estimate the SD) are rejected. After scaling, the values of f' and f" are refined based on the relative values of anomalous differences at the various wavelengths and the relative values of dispersive differences among the data at different wavelengths. Then FA values (estimates of the heavy-atom structure factor) are estimated. These FA values can often be more useful than the anomalous differences at any of the individual wavelengths because they combine the anomalous and dispersive information. At the same time as FA values are calculated, an estimate is obtained of the phase difference between the structure factor of the anomalously-scattering atoms and the structure factor corresponding to all other atoms. This phase difference is useful later in calculating Fourier maps showing the positions of the anomalously-scattering atoms.

Choosing datafiles with high signal-to-noise

For MAD data the AutoSol Wizard analyzes the correlation of anomalous differences at the various wavelengths. The anomalous difference for a particular reflection is related to the f" value at each wavelength. Consequently if the data are good then the anomalous differences at different wavelengths (but for the same reflections) are highly correlated. A shell of resolution in which the anomalous differences have a correlation of about 0.3 or greater has some useful information. A strong SeMet dataset will have an overall correlation of 0.6-0.7 for the peak and high energy remote wavelengths. You can see this analysis in the log file dataset_1_scale.log for this MAD dataset:


Correlation of anomalous differences at different wavelengths.
(You should probably cut your data off at the resolution where
 this drops below about 0.3. A good dataset has correlation
 between peak and remote of at least 0.7 overall. Data with
 correlations below about 0.5 probably are not contributing much.)

           CORRELATION FOR
           WAVELENGTH PAIRS
 DMIN    1 VS 2   1 VS 3   2 VS 3

 5.18     0.79     0.89     0.73
 3.88     0.68     0.75     0.55
 3.63     0.68     0.72     0.46
 3.43     0.53     0.61     0.41
 3.24     0.51     0.58     0.26
 3.11     0.51     0.59     0.36
 2.98     0.36     0.54     0.13
 2.85     0.50     0.45     0.35
 2.72     0.28     0.30     0.10
 2.59     0.32     0.23     0.14

 ALL      0.55     0.66     0.40
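The advice in the log header (cut the data where the correlation drops below about 0.3) can be applied to the table mechanically. The sketch below is one simple reading of that advice: it reports the last resolution shell before the correlation first falls below the threshold. The function name and the "first drop" rule are assumptions of this example, not something AutoSol is stated to do.

```python
# (dmin, corr 1 vs 2, corr 1 vs 3, corr 2 vs 3), copied from the table above
SHELLS = [
    (5.18, 0.79, 0.89, 0.73),
    (3.88, 0.68, 0.75, 0.55),
    (3.63, 0.68, 0.72, 0.46),
    (3.43, 0.53, 0.61, 0.41),
    (3.24, 0.51, 0.58, 0.26),
    (3.11, 0.51, 0.59, 0.36),
    (2.98, 0.36, 0.54, 0.13),
    (2.85, 0.50, 0.45, 0.35),
    (2.72, 0.28, 0.30, 0.10),
    (2.59, 0.32, 0.23, 0.14),
]

def resolution_cutoff(shells, column, threshold=0.3):
    """Return dmin of the last shell before the correlation first drops
    below threshold, or None if the first shell is already below it."""
    cutoff = None
    for shell in shells:
        if shell[column] < threshold:
            break
        cutoff = shell[0]
    return cutoff

for column, pair in ((1, "1 vs 2"), (2, "1 vs 3"), (3, "2 vs 3")):
    print(pair, resolution_cutoff(SHELLS, column))
```

Correlations in individual shells are noisy (the 1 vs 2 column recovers to 0.32 in the last shell), so treat a cut-off found this way as a guide rather than a hard limit.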

During scaling, the AutoSol Wizard estimates the signal-to-noise in each datafile and the resolution where there is significant signal-to-noise (above 0.3:1 signal-to-noise). In this case, the FA's appear to have the highest signal-to-noise (3.1) and the inflection data the lowest (0.3):
 FILE DATA:FA.sca sn: 3.078877
FILE DATA:peak.sca sn: 2.313605
FILE DATA:high.sca sn: 1.365164
FILE DATA:infl.sca sn: 0.3432194
order of datasets for trying phasing:['FA.sca', 'peak.sca', 'high.sca', 'infl.sca']
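The order in the last line of this excerpt is the same as you get by sorting the files on their signal-to-noise values, as the short sketch below shows. It parses the "FILE DATA" lines exactly as they appear above; whether the Wizard uses any criteria beyond this ranking is not something the log excerpt tells us.

```python
import re

LOG = """\
FILE DATA:FA.sca sn: 3.078877
FILE DATA:peak.sca sn: 2.313605
FILE DATA:high.sca sn: 1.365164
FILE DATA:infl.sca sn: 0.3432194
"""

SN_LINE = re.compile(r"FILE DATA:(\S+)\s+sn:\s*([0-9.]+)")

def rank_by_signal_to_noise(log_text):
    """Return (filename, sn) pairs, highest signal-to-noise first."""
    pairs = [(name, float(sn)) for name, sn in SN_LINE.findall(log_text)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = rank_by_signal_to_noise(LOG)
print([name for name, _ in ranked])
```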

Running HYSS to find the heavy-atom substructure

The HYSS (hybrid substructure search) procedure for heavy-atom searching uses a combination of a Patterson search for 2-site solutions with direct methods recycling. The search ends when the same solution is found beginning with several different starting points. The HYSS log files are named after the datafile that they are based on and the type of differences (ano, iso) that are being used. In this gene-5 MAD dataset, the HYSS logfile is peak.sca_ano_1.sca_hyss.log. The key part of this HYSS log file is:

Entering search loop:
p = peaklist index in Patterson map
f = peaklist index in two-site translation function
cc = correlation coefficient after extrapolation scan
r = number of dual-space recycling cycles
cc = final correlation coefficient

p=000 f=000 cc=0.144 r=015 cc=0.280 [ best cc: 0.280 ]
p=000 f=001 cc=0.134 r=015 cc=0.282 [ best cc: 0.282 0.280 ]
Number of matching sites of top 2 structures: 3
p=000 f=002 cc=0.100 r=015 cc=0.234 [ best cc: 0.282 0.280 ]
p=001 f=000 cc=0.181 r=015 cc=0.263 [ best cc: 0.282 0.280 0.263 ]
Number of matching sites of top 2 structures: 3
Number of matching sites of top 3 structures: 3
p=001 f=001 cc=0.146 r=015 cc=0.277 [ best cc: 0.282 0.280 0.277 0.263 ]
Number of matching sites of top 2 structures: 3
Number of matching sites of top 3 structures: 3
Number of matching sites of top 4 structures: 3
p=001 f=002 cc=0.142 r=015 cc=0.278 [ best cc: 0.282 0.280 0.278 0.277 0.263 ]
Number of matching sites of top 2 structures: 3
Number of matching sites of top 3 structures: 3
Number of matching sites of top 4 structures: 3
Number of matching sites of top 5 structures: 3

Here a correlation coefficient of 0.5 is very good (0.1 is hopeless, 0.2 is possible, 0.3 is good) and 3 sites were found that matched in the first two tries. The program continues until 5 structures all have matching sites, then ends and prints out the final correlations, after taking the top 2 sites.
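If you want to pull the trial results out of a HYSS log yourself, a regular expression is enough. The sketch below parses the "p=... f=... cc=... r=... cc=..." lines shown above. The describe_cc helper turns the rule of thumb from the text into ranges; the exact cut points between categories are an interpretation made for this example.

```python
import re

HYSS_LOG = """\
p=000 f=000 cc=0.144 r=015 cc=0.280 [ best cc: 0.280 ]
p=000 f=001 cc=0.134 r=015 cc=0.282 [ best cc: 0.282 0.280 ]
Number of matching sites of top 2 structures: 3
p=000 f=002 cc=0.100 r=015 cc=0.234 [ best cc: 0.282 0.280 ]
p=001 f=000 cc=0.181 r=015 cc=0.263 [ best cc: 0.282 0.280 0.263 ]
"""

TRIAL = re.compile(r"p=(\d+) f=(\d+) cc=([0-9.]+) r=(\d+) cc=([0-9.]+)")

def hyss_trials(log_text):
    """Return one dictionary per search trial found in the log text."""
    trials = []
    for p, f, cc_start, cycles, cc_final in TRIAL.findall(log_text):
        trials.append({"p": int(p), "f": int(f),
                       "cc_start": float(cc_start),
                       "cycles": int(cycles),
                       "cc_final": float(cc_final)})
    return trials

def describe_cc(cc):
    """Rough label for a final CC (cut points are an interpretation)."""
    if cc >= 0.5:
        return "very good"
    if cc >= 0.3:
        return "good"
    if cc >= 0.2:
        return "possible"
    return "hopeless"

trials = hyss_trials(HYSS_LOG)
best = max(trials, key=lambda trial: trial["cc_final"])
print(best, describe_cc(best["cc_final"]))
```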

Finding the hand and scoring heavy-atom solutions

Normally either hand of the heavy-atom substructure is a possible solution, and both must be tested by calculating phases and examining the electron density map and by carrying out density modification, as they will give the same statistics for all heavy-atom analysis and phasing steps. Note that in chiral space groups (those that have a handedness, such as P61), both hands of the space group must be tested. The AutoSol Wizard will do this for you, inverting the hand of the heavy-atom substructure and the space group at the same time. For example, in space group P61 the hand of the substructure is inverted and then it is placed in space group P65.

The AutoSol Wizard scores heavy-atom solutions based on two criteria by default. The first criterion is the skew of the electron density in the map (SKEW). Good values for the skew are anything greater than 0.1. In a MAD structure determination, the heavy-atom solution with the correct hand may have a more positive skew than the one with the inverse hand. The second criterion is the correlation of local RMS density (CORR_RMS). This is a measure of how contiguous the solvent and non-solvent regions are in the map. (If the local rms is low at one point and also low at neighboring points, then the solvent region must be relatively contiguous, and not split up into small regions.)

For MAD datasets, SOLVE is used for calculating phases. For a MAD dataset, a figure of merit of 0.5 is acceptable, 0.6 is fine and anything above 0.7 is very good. The first three solutions scored are all quite good. Here is the first and best one:

SCORING SOLUTION 1: Solution 1 using HYSS on FA.sca. Dataset #1 SG="C 1 2 1", with 2 sites
Number of scoring criteria:  2
Using BAYES-CC (Bayesian estimate of CC of map to perfect) as scores
Evaluating solution 1
FOM found:  0.59
Number of scoring criteria:  2
Using BAYES-CC (Bayesian estimate of CC of map to perfect) as scores
...
Scoring for this solution now...

AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.2597009
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.8756687

CC-EST (BAYES-CC) SKEW : 56.1 +/- 18.3
CC-EST (BAYES-CC) CORR_RMS : 55.7 +/- 36.1
ESTIMATED MAP CC x 100:  57.9 +/- 14.1

The ESTIMATED MAP CC x 100 is an estimate of the quality of the experimental electron density map (not the density-modified one). A set of real structures was used to calibrate the range of values of each score that were obtained for phases with varying quality. The resulting probability distributions are used above to estimate the correlation between the experimental map and an ideal map for this structure. Then all the estimates are combined to yield an overall Bayesian estimate of the map quality. These are reported as CC x 100 +/- 2SD. These estimated map CC values are usually fairly close, so if the estimate is 57.9 +/- 14.1 then you can be confident that your structure is not only solved but that you will have a good map when it is density modified. In this case the datasets used to find heavy-atom substructures were the FA values in FA.sca and the peak data in peak.sca_ano_1.sca. For each dataset one solution was found, and that solution and its inverse were scored. The scores were (skipping extra text below):
SCORING SOLUTION 1: Solution 1 using HYSS on FA.sca. 
Dataset #1 SG="C 1 2 1", with 2 sites
ESTIMATED MAP CC x 100:  57.9 +/- 14.1

SCORING SOLUTION 2: Solution  2 using HYSS on FA.sca and taking inverse. 
Dataset #1 SG="C 1 2 1", with 2 sites
ESTIMATED MAP CC x 100:  42.0 +/- 24.3

SCORING SOLUTION 3: Solution 3 using HYSS on peak.sca_ano_1.sca. 
Dataset #1 SG="C 1 2 1", with 2 sites
ESTIMATED MAP CC x 100:  53.2 +/- 16.7

SCORING SOLUTION 4: Solution  4 using HYSS on peak.sca_ano_1.sca and taking inverse. 
Dataset #1 SG="C 1 2 1", with 2 sites
ESTIMATED MAP CC x 100:  53.1 +/- 16.7

In this case the best score was using the FA values and taking the original hand (ESTIMATED MAP CC x 100: 57.9 +/- 14.1), and the score for the inverted hand of the heavy-atom substructure was worse (ESTIMATED MAP CC x 100: 42.0 +/- 24.3) and so the hand was clear.
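The comparison of solutions can be reproduced from the log text. The sketch below collects the ESTIMATED MAP CC value for each SCORING SOLUTION block and picks the highest. The function name is invented for this example, and choosing simply by the largest estimate ignores the quoted uncertainties.

```python
import re

SCORES = """\
SCORING SOLUTION 1: Solution 1 using HYSS on FA.sca.
ESTIMATED MAP CC x 100:  57.9 +/- 14.1
SCORING SOLUTION 2: Solution  2 using HYSS on FA.sca and taking inverse.
ESTIMATED MAP CC x 100:  42.0 +/- 24.3
SCORING SOLUTION 3: Solution 3 using HYSS on peak.sca_ano_1.sca.
ESTIMATED MAP CC x 100:  53.2 +/- 16.7
SCORING SOLUTION 4: Solution  4 using HYSS on peak.sca_ano_1.sca and taking inverse.
ESTIMATED MAP CC x 100:  53.1 +/- 16.7
"""

HEADER = re.compile(r"SCORING SOLUTION\s+(\d+):")
ESTIMATE = re.compile(
    r"ESTIMATED MAP CC x 100:\s*([0-9.]+)\s*\+/-\s*([0-9.]+)")

def map_cc_by_solution(log_text):
    """Map solution number -> (estimated map CC x 100, uncertainty)."""
    scores = {}
    current = None
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        estimate = ESTIMATE.match(line)
        if estimate and current is not None:
            scores[current] = (float(estimate.group(1)),
                               float(estimate.group(2)))
    return scores

scores = map_cc_by_solution(SCORES)
best = max(scores, key=lambda number: scores[number][0])
print("best solution:", best, scores[best])
```

Note that for the peak data (solutions 3 and 4) the two hands score almost the same, so only the FA-based pair distinguishes the hand clearly here.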

Final phasing with SOLVE

Once the best heavy-atom solution or solutions are chosen based on ESTIMATED MAP CC, these are used in a final round of phasing with SOLVE (for MAD phasing). The log file from phasing for solution 1 is in solve_1.prt. This SOLVE log file repeats the correlation analysis of anomalous differences between data at each wavelength. Then it carries out a detailed refinement of the scattering factors at each wavelength. Finally the heavy-atom model is refined and phases are calculated with Bayesian correlated MAD phasing. The final occupancies and coordinates are listed at the end:

                    SITE  ATOM       OCCUP     X       Y       Z         B
 CURRENT VALUES:      1    Se       0.9609  0.4842  0.5199  0.5931   52.0344
 CURRENT VALUES:      2    Se       0.5891  0.4723  0.3052  0.4479   60.0000

In this case the occupancy of one site is quite near 1 and the other is lower. The second site is a selenomethionine that is not well ordered (it is the N-terminal residue in the protein).

Statistical density modification with RESOLVE

After MAD phases are calculated with SOLVE, the AutoSol Wizard uses RESOLVE density modification to improve the quality of the electron density map. The statistical density modification in RESOLVE takes advantage of the flatness of the solvent region and the expected distribution of electron density in the region containing the macromolecule, as well as any NCS that can be found from the heavy-atom substructure. The weighted structure factors and phases (FWT, PHWT) from SOLVE are used to calculate the starting map for RESOLVE, and the experimental structure factor amplitudes (FP) and MAD Hendrickson-Lattman coefficients from SOLVE are used in the density modification process. The output from RESOLVE for solution 1 can be found in resolve_1.log. Here are key sections of this output. First, the plot of how many points in the "protein" region of the map have each possible value of electron density. The plot below is normalized so that a density of zero is the mean of the solvent region, and the standard deviation of the density in the map is 1.0. A perfect map has a lot of points with density slightly less than zero on this scale (the points between atoms) and a few points with very high density (the points near atoms), and no points with very negative density. Such a map has a very high skew (think "skewed off to the right"). This map is good, with a positive skew, though it is not perfect.


 Plot of Observed (o) and model (x) electron density distributions for protein
 region, where the model distribution is given by,
  p_model(beta*(rho+offset)) = p_ideal(rho)
 and then convoluted with a gaussian with width of sigma
 where sigma, offset and beta are given below under "Error estimate."

                          0.03..................................................
                              .                   .                            .
                              .                 xx.                            .
                              .               xxooxx                           .
                              .              xo   . xx                         .
                              .             xx    .  xx                        .
                              .            xo     .    xo                      .
                p(rho)        .           ox      .     xo                     .
                              .          xx       .      xo                    .
                              .          x        .       xxo                  .
                              .        ox         .         xx                 .
                              .        xx         .           xx               .
                              .       xx          .            xxx             .
                              .    oxx            .              oxxx          .
                              .  oxx              .                oxxxxx      .
                         0.0  xxxx......................................oxxxxxxx

                             -2        -1         0         1         2        3

                                  normalized rho (0 = mean of solvent region)
 -------------------------------------------------------------------------------

After density modification is complete, this plot becomes much more like one from a perfect structure:

                          0.03..................................................
                              .                   .                            .
                              .                   .                            .
                              .             xxxxoo.                            .
                              .            xxoooxx.                            .
                              .          xxo      xoo                          .
                              .          oo       .xxo                         .
                p(rho)        .         x         .  x                         .
                              .        x          .   xx                       .
                              .      ox           .    oxxx                    .
                              .     ox            .      oxxxx                 .
                              .    xx             .         ooxxxxxx           .
                              .   ox              .              oooxxxxxx     .
                              . oxx               .                   o oxxxxxx.
                              xxxx                .                           xo
                         0.0  x................................................x

                             -2        -1         0         1         2        3

                                  normalized rho (0 = mean of solvent region)
 -------------------------------------------------------------------------------

The key statistic from this RESOLVE density modification is the R-factor for comparison of observed structure factor amplitudes (FP) with those calculated from the density modification procedure (FC). In this gene-5 MAD phasing the R-factor is very low:
 Overall R-factor for FC vs FP: 0.284 for       2669 reflections
An acceptable value is anything below 0.35; below 0.30 is good.
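For reference, a conventional crystallographic R-factor between two sets of amplitudes can be computed as below. This is a generic sketch that assumes the two lists are already on a common scale. RESOLVE's own calculation (scaling, weighting, choice of reflections) is not described here and may differ, and the amplitudes in the example are made up.

```python
def r_factor(f_obs, f_calc):
    """R = sum(|Fobs - Fcalc|) / sum(Fobs) for amplitudes on a common scale."""
    if len(f_obs) != len(f_calc):
        raise ValueError("amplitude lists must have the same length")
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)

def describe_r(r):
    """Label an R-factor using the thresholds quoted in the text."""
    if r < 0.30:
        return "good"
    if r < 0.35:
        return "acceptable"
    return "higher than acceptable"

# Made-up amplitudes, purely to exercise the function
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 70.0, 35.0]
print("R = %.3f" % r_factor(f_obs, f_calc))
print(describe_r(0.284))
```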

Generation of FreeR flags

The AutoSol Wizard will create a set of free R flags indicating which reflections are not to be used in refinement. By default 5% of reflections (up to a maximum of 2000) are reserved for this test set. If you want to supply a reflection file hires.mtz that has higher resolution than the data used to solve the structure, or has a test set already marked, then you can do this with the keyword input_refinement_file=hires.mtz. The files to be used for model-building and refinement are listed in the AutoSol log file:

 
Copying  AutoSol_run_1_/solve_1.mtz  and adding free R flags for refinement
input_data_file_use:  AutoSol_run_1_/solve_1.mtz
labin_use:  labin FP=FP SIGFP=SIGFP PHIB=PHIB FOM=FOM HLA=HLA HLB=HLB HLC=HLC HLD=HLD
Adding FreeR_flag to  AutoSol_run_1_/TEMP0/solve_1.mtz
...
THE FILE AutoSol_run_1_/resolve_1.mtz will be used for model-building
THE FILE exptl_fobs_phases_freeR_flags_1.mtz will be used for refinement
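The "5% up to a maximum of 2000" rule is easy to express in code. The sketch below only illustrates that arithmetic with a purely random choice of reflections. The function names, the 1 = test-set flag convention and the random selection are all assumptions of this example; PHENIX's own flag generation may work differently.

```python
import random

def free_set_size(n_reflections, fraction=0.05, max_free=2000):
    """Number of reflections to reserve: a fraction, capped at max_free."""
    return min(int(n_reflections * fraction), max_free)

def assign_free_flags(n_reflections, fraction=0.05, max_free=2000, seed=0):
    """Return a list of 0/1 flags, with 1 marking the reserved test set."""
    rng = random.Random(seed)   # fixed seed so the choice is repeatable
    chosen = set(rng.sample(range(n_reflections),
                            free_set_size(n_reflections, fraction, max_free)))
    return [1 if index in chosen else 0 for index in range(n_reflections)]

flags = assign_free_flags(2669)
print(free_set_size(2669), sum(flags), len(flags))
print(free_set_size(100000))
```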

Model-building with RESOLVE

The AutoSol Wizard by default uses a very quick method to build just the secondary structure of your macromolecule. This is controlled by the keywords helices_strands_only=True and helices_strands_start=True. If you set helices_strands_only=True then only secondary structure will be built. If you instead set helices_strands_start=True then a secondary-structure model will be built and then it will be extended with standard RESOLVE model-building. The Wizard will guess from your sequence file whether the structure is protein or RNA or DNA (but you can tell it if you want with chain_type=PROTEIN). If the quick model-building does not build a satisfactory model (if the correlation of map and model is less than acceptable_secondary_structure_cc=0.35), then model-building is tried again with the standard build procedure, essentially the same as one cycle of model-building with the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild), except that if you specify thoroughness=quick as we have in this example, the model-building is done less comprehensively to speed things up. In this case the secondary-structure-only model-building produces an initial model with 38 residues built and side chains assigned to 0, and which has a model-map correlation of 0.40:

Secondary-structure model:  AutoSol_run_1_/TEMP0/Build_1.pdb
Log file:  Build_1.log  copied to  Build_1.log
Models to combine and extend:  ['Build_1.pdb']
Using CC to score in combine_extend
Model 2: Residues built=38  placed=0  Chains=5  Model-map CC=0.40 (Build_combine_extend_2.pdb)
This is new best model with cc =  0.4
Refining model:  Build_combine_extend_2.pdb
Model: AutoSol_run_1_/TEMP0/refine_2.pdb  R/Rfree=0.50/0.50
After several cycles of model completion, the model-map correlation is now reasonably good (0.63), the model-building is considered successful and the refined initial model is written out to refine_12.pdb in the output directory. It is still just a preliminary model, but it is good enough to tell that the structure is solved. For full model-building you will want to go on and use the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild).

The AutoSol_summary.dat summary file

A quick summary of the results of your AutoSol run is in the AutoSol_summary.dat file in your output directory. This file lists the key files that were produced in your run of AutoSol (all these are in the output directory) and some of the key statistics for the run, including the scores for the heavy-atom substructure and the model-building and refinement statistics. These statistics are listed for all the solutions obtained, with the highest-scoring solutions first. Here is part of the summary for this gene-5 MAD dataset:

 
----------CURRENT SOLUTIONS FOR RUN 1 : -------------------
 *** FILES ARE IN THE DIRECTORY: AutoSol_run_1_ ****

Solution # 1  BAYES-CC: 57.9 +/- 14.1 Dataset #1   FOM: 0.5

Solution 1 using HYSS on FA.sca. Dataset #1 SG="C 1 2 1"
Dataset number: 1
Dataset type: mad
Datafiles used: ['high.sca', 'peak.sca', 'infl.sca']
Sites: 2 (Already used for Phasing at resol of 2.6)
NCS information  in: AutoSol_1.ncs_spec
Experimental phases in: solve_1.mtz
Experimental phases plus FreeR_flags for refinement in: 
   exptl_fobs_phases_freeR_flags_1.mtz
Density-modified phases in: resolve_1.mtz
HA sites (PDB format) in: ha_1.pdb_formatted.pdb
Sequence file in: sequence.dat
Model in: refine_12.pdb
  Residues built: 65
  Side-chains built: 10
  Chains: 6
  Overall model-map correlation: 0.63
  R/R-free: 0.42/0.48
Scaling logfile in: dataset_1_scale.log
HYSS logfile in: FA.sca_hyss.log
Phasing logfile in: solve_1.prt
Density modification logfile in: resolve_1.log (R=0.28)
Build logfile in: Build_combine_extend_12.log

 Score type:       SKEW    CORR_RMS
Raw scores:        0.26      0.88
100x EST OF CC:   56.10     55.69

Heavy atom sites (fractional):
xyz       0.486     -0.480      0.592
xyz       0.473      0.324      0.450
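If you want to compare several runs, the key statistics can be pulled out of the summary text with a small script. The sketch below is plain Python and assumes the file follows the layout of the excerpt above; parse_summary and the dictionary keys are hypothetical names chosen for illustration, and the value after "+/-" is stored as printed, without interpreting it.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"

# Header layout assumed from the excerpt:
# "Solution # 1  BAYES-CC: 57.9 +/- 14.1 Dataset #1   FOM: 0.5"
HEADER = re.compile(
    r"Solution #\s*(\d+)\s+BAYES-CC:\s*" + NUM + r"\s*\+/-\s*" + NUM
    + r".*?FOM:\s*" + NUM
)
FIELDS = [
    ("residues_built", re.compile(r"Residues built:\s*(\d+)"), int),
    ("side_chains_built", re.compile(r"Side-chains built:\s*(\d+)"), int),
    ("chains", re.compile(r"Chains:\s*(\d+)"), int),
    ("model_map_cc",
     re.compile(r"Overall model-map correlation:\s*" + NUM), float),
]
R_LINE = re.compile(r"R/R-free:\s*" + NUM + "/" + NUM)


def parse_summary(text):
    """Return a list of dicts, one per 'Solution # N' block in the text."""
    solutions = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {
                "solution": int(header.group(1)),
                "bayes_cc": float(header.group(2)),
                "bayes_cc_uncertainty": float(header.group(3)),
                "fom": float(header.group(4)),
            }
            solutions.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first solution header
        for key, pattern, convert in FIELDS:
            found = pattern.match(line)
            if found:
                current[key] = convert(found.group(1))
        r_values = R_LINE.match(line)
        if r_values:
            current["r"] = float(r_values.group(1))
            current["r_free"] = float(r_values.group(2))
    return solutions
```

A typical use would be to read AutoSol_summary.dat from the output directory into a string, pass it to parse_summary, and look at the first entry, since the highest-scoring solutions are listed first.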

How do I know if I have a good solution?

Here are some of the things to look for to tell if you have obtained a correct solution:

  • How much of the model was built? More than 50% is good, particularly if you are using the default of helices_strands_only=True. If less than 25% of the model is built, then it may be entirely incorrect. Have a look at the model. If you see clear sets of parallel or antiparallel strands, or if you see helices and strands with the expected relationships, your model is going to be correct. If you see a lot of short fragments everywhere, your model and solution are going to be incorrect. How many side chains were fitted to density? More than 25% is OK, more than 50% is very good.
  • What is the R-factor of the model? This only applies if you are building a full model (not for helices_strands_only=True). For a solution at moderate to high resolution (2.5 A or better) the R-factor should be in the low 30's to be very good. For lower-resolution data, a model with an R-factor in the low 40's is probably largely correct, but it is not a very good model.
  • What was the overall signal-to-noise in the data? Above 1 is good, below 0.5 is very low.
  • What are the individual CC-BAYES estimates of map correlation for your top solution? For a good solution they are all around 50 or more, with 2SD uncertainties that are about 10-20.
  • What is the overall "ESTIMATED MAP CC x 100" of your top solution? This should also be 50 or more for a good solution. This is an estimate of the map correlation before density modification, so if you have a lot of solvent or several NCS-related copies in the asymmetric unit, then lower values may still give you a good map.
  • What is the difference in "ESTIMATED MAP CC x 100" between the top solution and its inverse? If this is large (more than the 2SD values for each) that is a good sign.
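The rules of thumb in this list can be written down as a few small checks. The sketch below is plain Python with hypothetical function names; the thresholds are the ones quoted in the list, and the expected number of residues in the structure is something you must supply yourself (it is not read from any AutoSol file here).

```python
def classify_fraction_built(residues_built, residues_expected):
    """More than 50% built is good; less than 25% may be entirely incorrect."""
    fraction = residues_built / float(residues_expected)
    if fraction > 0.50:
        return "good"
    if fraction < 0.25:
        return "possibly entirely incorrect"
    return "inconclusive"


def classify_signal_to_noise(signal_to_noise):
    """Above 1 is good; below 0.5 is very low."""
    if signal_to_noise > 1.0:
        return "good"
    if signal_to_noise < 0.5:
        return "very low"
    return "intermediate"


def map_cc_is_good(estimated_cc_x100):
    """An estimated map CC x 100 of 50 or more suggests a good solution."""
    return estimated_cc_x100 >= 50


def hand_is_clear(cc_top, sd2_top, cc_inverse, sd2_inverse):
    """True if the top solution beats its inverse by more than both 2SD values."""
    return (cc_top - cc_inverse) > max(sd2_top, sd2_inverse)


if __name__ == "__main__":
    # 65 residues built and a CC estimate of 57.9 come from the summary above;
    # the expected residue count of 100 is a placeholder, not the real value.
    print(classify_fraction_built(65, 100))
    print(map_cc_is_good(57.9))
```

These checks are only a screening aid. As the list says, looking at the model and map yourself remains the decisive test, and a low estimated CC can still lead to a good map when there is a lot of solvent or NCS.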

What to do next

Once you have run AutoSol and have obtained a good solution and model, the next thing to do is to run the AutoBuild Wizard. If you run it in the same directory where you ran AutoSol, the AutoBuild Wizard will pick up where the AutoSol Wizard left off and carry out iterative model-building, density modification and refinement to improve your model and map. See the web page Automated Model Building and Rebuilding with AutoBuild for details on how to run AutoBuild.

If you do not obtain a good solution, then it's not time to give up yet. There are a number of standard things to try that may improve the structure determination. Here are a few that you should always try:

  • Try setting thoroughness=thorough if it had previously been set to quick. This can make a big difference, though it takes longer.
  • Try setting max_choices to a larger number, or desired_coverage to a higher value.
  • Have a careful look at all the output files. Work your way through the main log file (e.g., AutoSol_run_1_1.log) and all the other principal log files in order beginning with scaling (dataset_1_scale.log), then looking at heavy-atom searching (FA.sca_hyss.log), phasing (e.g., solve_10.log or solve_xx.log depending on which solution xx was the top solution) and density modification (e.g., resolve_xx.log). Is there anything strange or unusual in any of them that may give you a clue as to what to try next? For example did the phasing work well (high figure of merit) yet the density modification failed? (Perhaps the hand is incorrect). Was the solvent content estimated correctly? (You can specify it yourself if you want). What does the xtriage output say? Is there twinning or strong translational symmetry? Are there problems with reflections near ice rings? Are there many outlier reflections?
  • Try a different resolution cutoff, for example 0.5 A lower resolution than you tried before. Often the highest-resolution shells have little useful information for structure solution (though the data may be useful in refinement and density modification).
  • Try a different rejection criterion for outliers. The default is ratio_out=10.0 (toss reflections with delta F more than 10 times the rms delta F of all reflections in the shell). Try instead ratio_out=3 to toss more of the outliers.
  • If the heavy-atom substructure search did not yield plausible solutions, try searching with HYSS using the command-line interface, and vary the resolution and number of sites you look for. Can you find a solution that has a higher CC than the one found in AutoSol? If so, you can read your solution in to AutoSol with sites_file=my_sites.pdb.
  • Was an anisotropy correction applied in AutoSol? If there is some anisotropy but no correction was applied, you can force AutoSol to apply the correction with correct_aniso=True.
  • Try related space groups. If you are not positive about your space group (for example, that it really is P212121), then try other possibilities with different or no screw axes.
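When working through several of these suggestions, it can help to generate the variant argument lists from one base command so that each retry differs in a single keyword. The sketch below is plain Python; apply_override and RETRY_OVERRIDES are hypothetical names, the keyword=value strings are the ones quoted in the list above, and the base argument list is a placeholder for whatever command you originally ran.

```python
# Keyword settings taken from the suggestions above; each is tried separately.
RETRY_OVERRIDES = [
    ("more thorough search", ["thoroughness=thorough"]),
    ("stricter outlier rejection", ["ratio_out=3"]),
    ("force anisotropy correction", ["correct_aniso=True"]),
    ("read in externally found sites", ["sites_file=my_sites.pdb"]),
]


def apply_override(base_args, overrides):
    """Return base_args with any same-named keywords replaced by overrides."""
    keys = {item.split("=", 1)[0] for item in overrides}
    kept = [arg for arg in base_args if arg.split("=", 1)[0] not in keys]
    return kept + list(overrides)


def retry_commands(base_args):
    """Return (description, argument list) pairs, one per suggested retry."""
    return [(label, apply_override(base_args, overrides))
            for label, overrides in RETRY_OVERRIDES]


if __name__ == "__main__":
    # Placeholder base command; substitute the arguments of your own run.
    base = ["<your AutoSol command>", "thoroughness=quick"]
    for label, args in retry_commands(base):
        print(label + ": " + " ".join(args))
```

Replacing an existing keyword rather than appending a second copy keeps each generated command unambiguous. Running each variant in its own directory is a sensible precaution so that the results of different retries are not mixed.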

Additional information

For details about the AutoSol Wizard, see Automated structure solution with AutoSol. For help on running Wizards, see Using the PHENIX Wizards.