Automated Structure Solution using AutoSol

Authors

AutoSol Wizard: Tom Terwilliger
Phaser: Gabor Bunkoczi, Airlie McCoy, Randy Read
HySS: Ralf Grosse-Kunstleve, Tom Terwilliger, Gabor Bunkoczi, Randy Read
Xtriage: Peter Zwart
PHENIX GUI: Nathaniel Echols

Purpose

The AutoSol Wizard uses HYSS, SOLVE, Phaser, RESOLVE, xtriage and phenix.refine to solve a structure and generate experimental phases with the MAD, MIR, SIR, or SAD methods. The Wizard begins with datafiles (.sca, .hkl, etc.) containing amplitudes (or intensities) of structure factors, identifies heavy-atom sites, calculates phases, carries out density modification and NCS identification, and builds and refines a preliminary model.

Usage

The AutoSol Wizard can be run from the PHENIX GUI, from the command-line, and from parameters files. All three versions are identical except in the way that they take commands from the user. See Using the PHENIX Wizards for details of how to run a Wizard. The command-line version will be described here, except for MIR and multiple datasets, which can only be run with the GUI or with a parameters file. Nearly all the parameters described here as command-line keyword=value pairs are parameters that can be set in the GUI, so nearly everything described here can be done in the GUI. The GUI is documented separately.

How the AutoSol Wizard works

The basic steps that the AutoSol Wizard carries out are described below. They are: Setting up inputs, Analyzing and scaling the data, Finding heavy-atom (anomalously-scattering atom) sites, Scoring of heavy-atom solutions, Phasing, Density modification (including NCS averaging), and Preliminary model-building and refinement. The data for structure solution are grouped into Datasets and solutions are stored in Solution objects.

Setting up inputs

The AutoSol Wizard expects the following basic information:

You can also specify many other parameters, including resolution, number of sites, whether to search in a thorough or quick fashion, how thoroughly to build a model, etc. If you have a heavy-atom solution from a previous run or another approach, you can read it in directly as well.

Your parameters can be specified on the command-line, using a GUI, or by editing a parameters file (examples below).

Datafile formats in AutoSol

AutoSol will accept the following formats of data:

The data from any of these formats will be converted internally to amplitudes (F+, sigF+ and F-, sigF-).
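
If you are curious how such a conversion works in outline: amplitudes scale as the square root of intensities. The sketch below uses the naive F = sqrt(I) relation with simple clipping of negative intensities; it is illustrative only, as real conversions use a more careful statistical treatment (such as the French-Wilson method) for weak and negative data.

```python
import math

def intensities_to_amplitudes(i_obs, sig_i):
    # Naive conversion: F = sqrt(I), sigF = sigI / (2F) by error
    # propagation. Negative intensities are clipped to zero, with a
    # crude fallback for F = 0. Production code (e.g. French-Wilson)
    # treats weak and negative intensities statistically instead.
    out = []
    for i, s in zip(i_obs, sig_i):
        i = max(i, 0.0)
        f = math.sqrt(i)
        sig_f = s / (2.0 * f) if f > 0.0 else math.sqrt(s)
        out.append((f, sig_f))
    return out
```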

If you have multiple data files for a single dataset, or if you have original index data with multiple observations of each reflection, you might want to run phenix.scale_and_merge on your data to scale and merge the datasets before you run AutoSol. This is particularly useful for weak SAD data (such as S-SAD datasets).

Additionally, within AutoSol, if you supply all scalepack unmerged original-index files or all mtz unmerged files, the information in these files can be used to improve scaling; in most cases, however, this is not necessary. If all the files are scalepack unmerged original index, or all the files are mtz unmerged, and no anisotropy correction is applied, then SOLVE local scaling will be applied to the data before equivalent reflections are merged and averaged. In all other cases equivalent reflections will be averaged prior to scaling, so the scaling may be less effective at removing systematic errors due to absorption or other effects.

Sequence file format

The sequence file for AutoSol (and AutoBuild) contains the one-letter-code sequence of each chain in the structure to be solved, where chains are separated by blank lines or by lines starting with a > character. Here is a simple example with two different chains:

> sequence of chain A follows. This line not required
LVLKWVMSTKYVEAGELKEGSYVVIDGEPCRVVEIEKSKTGKHGSAKARIVAVGVFDGGKRTLSLPVDAQVEVPIIEKFT
> sequence of chain B follows. This line could be blank to indicate new chain
AQILSVSGDVIQLMDMRDYKTIEVPMKYVEEEAKGRLAPGAEVEVWQILDRYKIIRVKG

Usually the chain type (RNA, DNA, PROTEIN) is guessed from the sequence file. You can also specify it directly with a command such as chain_type=PROTEIN.

If there are multiple copies of a chain (NCS), then you can put in a single copy and it will be used for all of them. If there are multiple copies of a set of chains (A,A,A, B,B,B would be 3 copies of chains A and B), then you can put in the unique set (A and B). If there are different numbers of copies of different chains, then put in the unique set, and be sure to set the solvent_fraction.

If you have more than one type of chain (RNA, DNA, PROTEIN), then just put in the sequence for the largest one, and be sure to specify solvent_fraction=xxx so that the correct solvent fraction is used.
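
If you do need to supply solvent_fraction yourself, a rough estimate can be derived from the Matthews coefficient. The sketch below assumes an average protein residue mass of about 110 Da and the common approximation Vsolv = 1 - 1.23/Vm; it is an illustration of the idea, not the calculation AutoSol performs internally.

```python
def estimate_solvent_fraction(n_residues, cell_volume, z):
    # Matthews coefficient Vm = V_cell / (Z * MW), in A^3 per Dalton,
    # where Z is the number of molecules in the unit cell and MW is
    # approximated as 110 Da per protein residue.
    mw = 110.0 * n_residues
    vm = cell_volume / (z * mw)
    # Common approximation relating Vm to the solvent fraction.
    return 1.0 - 1.230 / vm
```

For example, 200 residues with Z = 4 in a 220,000 A^3 cell gives Vm = 2.5 and a solvent fraction near 0.51.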

NOTE 1: Characters such as numbers and non-printing characters in the sequence file are ignored.

NOTE 2: Be sure that your sequence file does not have any blank lines in the middle of your sequence, as these are interpreted as the beginning of another chain.
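
The parsing rules above can be sketched in Python as follows (chains split on blank lines or lines starting with >, with numbers and other non-letter characters dropped); this is a conceptual sketch, not AutoSol's parser:

```python
def parse_sequence_file(text):
    # A blank line or a '>' line ends the current chain; anything else
    # is sequence, from which non-letter characters are stripped.
    chains, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(">"):
            if current:
                chains.append("".join(current))
                current = []
            continue
        current.append("".join(c for c in stripped if c.isalpha()))
    if current:
        chains.append("".join(current))
    return chains
```

Note that this reproduces the behavior in NOTE 2: a blank line in the middle of a sequence starts a new chain.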

Datasets and Solutions in AutoSol

AutoSol breaks down the data for a structure solution into datasets, where a dataset is a set of data that corresponds to a single set of heavy-atom sites. An entire MAD dataset is a single dataset. An MIR structure solution consists of several datasets (one for each native-derivative combination). A MAD + SIR structure has one dataset for the MAD data and a second dataset for the SIR data. The heavy-atom sites for each dataset are found separately (but using difference Fouriers from any previously-solved datasets to help). In the phasing step all the information from all datasets is merged into a single set of phases.

The AutoSol wizard uses a "Solution" object to keep track of heavy-atom solutions and the phased datasets that go with them. There are two types of Solutions: those which consist of a single dataset (Primary Solutions) and those that are combinations of datasets (Composite Solutions). "Primary" Solutions have information on the datafiles that were part of the dataset and on the heavy-atom sites for this dataset. Composite Solutions are simply sets of Primary Solutions, with associated origin shifts. The hand of the heavy-atom or anomalously-scattering atom substructure is part of a Solution, so if you have two datasets, each with two Solutions related by inversion, then AutoSol would normally construct four different Composite Solutions from these and score each one as described below.

Analyzing and scaling the data

The AutoSol Wizard analyzes input datasets with phenix.xtriage to identify twinning and other conditions that may require special care. The data is scaled with SOLVE. For MAD data, FA values are calculated as well.

Note on anisotropy corrections:

The AutoSol wizard will apply an anisotropy correction and B-factor sharpening to all the raw experimental data by default (controlled by the keyword remove_aniso=True). The target overall Wilson B factor can be set with the keyword b_iso, as in b_iso=25. By default the target Wilson B will be 10 times the resolution of the data (e.g., if the resolution is 3 A then b_iso=30), or the actual Wilson B of the data, whichever is lower.
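
The default-target rule just described can be summarized as: a user-supplied b_iso always wins; otherwise use the lower of 10 times the resolution and the data's actual Wilson B. A small sketch of the stated rule (not AutoSol's code):

```python
def target_wilson_b(resolution, wilson_b, b_iso=None):
    # b_iso, if set by the user, overrides the default.
    if b_iso is not None:
        return b_iso
    # Default: the lower of 10 * resolution and the actual Wilson B.
    return min(10.0 * resolution, wilson_b)
```

For 3 A data with a Wilson B of 80, the target is 30; if the actual Wilson B were 20, the target would be 20 instead.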

If an anisotropy correction is applied then the entire AutoSol run will be carried out with anisotropy-corrected and sharpened data. At the very end of the run the final model will be re-refined against the uncorrected refinement data and this re-refined model and the uncorrected refinement data (with freeR flags) will be written out. For the top solution this will be as overall_best.pdb and overall_best_refine_data.mtz; for all other solutions the files will be listed at the end of the log file.

Finding heavy-atom (anomalously-scattering atom) sites

The AutoSol Wizard uses HYSS to find heavy-atom sites. The result of this step is a list of possible heavy-atom solutions for a dataset. For SIR or SAD data, the isomorphous or anomalous differences, respectively, are used as input to HYSS. For MAD data, the anomalous differences at each wavelength, and the FA estimates of complete heavy-atom structure factors from SOLVE, are each used as separate inputs to HYSS. Each heavy-atom substructure obtained from HYSS corresponds to a potential solution. In space groups where the heavy-atom structure can have either hand, a pair of enantiomorphic solutions is saved for each run of HYSS.

For SAD and MAD data (except for FA estimates) the Phaser LLG completion approach is used to find the heavy-atom sites. This can be quite a bit more powerful than direct methods completion.

Scoring of heavy-atom solutions

Potential heavy-atom solutions are scored based on a set of criteria (SKEW, CORR_RMS, CC_DENMOD, RFACTOR, NCS_OVERLAP, TRUNCATE, REGIONS, CONTRAST, FOM, FLATNESS, described below), using either a Bayesian estimate or a Z-score system to put all the scores on a common scale and to combine them into a single overall score. The overall scoring method chosen (BAYES-CC or Z-SCORE) is determined by the value of the keyword overall_score_method. The default is BAYES-CC. Note that for all scoring methods, the map that is being evaluated, and the estimates of map-perfect-model correlation, refer to the experimental electron density map, not the density-modified map.

Bayesian CC scores (BAYES-CC). Bayesian estimates of the quality of experimental electron density maps are obtained using data from a set of previously-solved datasets. The standard scoring criteria were evaluated for 1905 potential solutions in a set of 246 MAD, SAD, and MIR datasets. As each dataset had previously been solved, the correlation between the refined model and each experimental map (CC_PERFECT) could be calculated for each solution (after offsetting the maps to account for origin differences).

Histograms have been tabulated of the number of instances that a scoring criterion (e.g., SKEW) had various possible values, as a function of the CC_PERFECT of the corresponding experimental map to the refined model. These histograms yield the relative probability of measuring a particular value of that scoring criterion (SKEW), given the value of CC_PERFECT. Using Bayes' rule, these probabilities can be used to estimate the relative probabilities of values of CC_PERFECT given the value of each scoring criterion for a particular electron density map. The mean estimate (BAYES-CC) is reported (multiplied by 100), with a +/-2SD estimate of the uncertainty in this estimate of CC_PERFECT. The BAYES-CC values are estimated independently for each scoring criterion used, and also from all those selected with the keyword score_type_list and not selected with the keyword skip_score_list.
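
As an illustration of how independent per-criterion estimates might be pooled into one value, the sketch below combines (BAYES-CC, SD) pairs by inverse-variance weighting, a standard way to merge independent Gaussian estimates. This is a conceptual sketch only; the exact combination AutoSol performs over score_type_list may differ.

```python
def combine_cc_estimates(estimates):
    # estimates: list of (bayes_cc, sd) pairs, one per scoring
    # criterion. Each estimate is weighted by 1/SD^2; the pooled SD
    # is smaller than any individual SD, reflecting the combined
    # information.
    weights = [1.0 / sd ** 2 for _, sd in estimates]
    total = sum(weights)
    mean = sum(w * cc for (cc, _), w in zip(estimates, weights)) / total
    return mean, (1.0 / total) ** 0.5
```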

Z-scores (Z-SCORE). The Z-score for one criterion for a particular solution is given by:

  Z = (Score - mean_random_solution_score) / (SD_of_random_solution_scores)

where Score is the score for this solution, mean_random_solution_score is the mean score for solutions with randomized phases, and SD_of_random_solution_scores is the standard deviation of the scores of solutions with randomized phases. To create a total score based on Z-scores, the Z-scores for each criterion are simply summed.
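
In Python, the Z-score and total score could be computed as follows (a sketch of the formula above, not AutoSol's implementation):

```python
def z_score(score, random_scores):
    # Z = (Score - mean_random) / SD_random, where the reference
    # scores come from solutions with randomized phases.
    n = len(random_scores)
    mean = sum(random_scores) / n
    sd = (sum((s - mean) ** 2 for s in random_scores) / n) ** 0.5
    return (score - mean) / sd

def total_z_score(scores, random_score_sets):
    # The total score is simply the sum of per-criterion Z-scores.
    return sum(z_score(s, r) for s, r in zip(scores, random_score_sets))
```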

The principal scoring criteria are:

Phasing

The AutoSol Wizard uses Phaser to calculate experimental phases from SAD data, and SOLVE to calculate phases from MIR, MAD, and multiple-dataset cases.

Density modification (including NCS averaging)

The AutoSol Wizard uses RESOLVE to carry out density modification. It identifies NCS from symmetries in heavy-atom sites with RESOLVE and applies this NCS if it is present in the electron density map.

Preliminary model-building and refinement

The AutoSol Wizard carries out one cycle of model-building and refinement after obtaining density-modified phases. The model-building is done with RESOLVE. The refinement is carried out with phenix.refine.

Resolution limits in AutoSol

There are several resolution limits used in AutoSol. You can leave them all to default, or you can set any of them individually. Here is a list of these limits and how their default values are set:

Output files from AutoSol

When you run AutoSol the output files will be in a subdirectory with your run number:

AutoSol_run_1_/

The key output files that are produced are:

NOTE: If there are multiple chains or multiple ncs copies, each chain will be given its own chainID (A B C D...). Segments that are not assigned to a chain are given a separate chainID and are given a segid of "UNK" to indicate that their assignment is unknown. The chainID for solvent molecules is normally S, and the chainID for heavy-atoms is normally Z.

How to run the AutoSol Wizard

Running the AutoSol Wizard is easy. From the command-line you can type:

phenix.autosol w1.sca seq.dat 2 Se f_prime=-8 f_double_prime=4.5

From the GUI you load in these files and specify the number of anomalously-scattering atoms, the atom type, and the scattering factors on the main GUI page.

The AutoSol Wizard will assume that w1.sca is a datafile (because it ends in .sca and is a file) and that seq.dat is a sequence file, that there are 2 heavy-atom sites, and that the heavy-atom is Se. The f_prime and f_double_prime values are set explicitly.

You can also specify each of these things directly:

phenix.autosol data=w1.sca seq_file=seq.dat sites=2 \
 atom_type=Se f_prime=-8 f_double_prime=4.5

You can specify many more parameters as well. See the list of keywords, defaults and descriptions at the end of this page and also general information about running Wizards at Using the PHENIX Wizards for how to do this. Some of the most common parameters are:

sites=3     # 3 sites
sites_file=sites.pdb  # ha sites in PDB or fractional xyz format
atom_type=Se   # Se is the heavy-atom
seq_file=seq.dat   # sequence file (1-letter code; separate chains with > or blank lines)
quick=True  # try to find sites quickly
data=w1.sca  # input datafile
lambda=0.9798  # wavelength for SAD

Running from a parameters file

You can run phenix.autosol from a parameters file. This is often convenient because you can generate a default one with:

phenix.autosol --show_defaults > my_autosol.eff

and then you can just edit this file to match your needs and run it with:

phenix.autosol  my_autosol.eff

NOTE: the autosol parameters file my_autosol.eff will have just one blank native, derivative, and wavelength. You can cut and paste them to put in as many as you want to have.

NEW: AutoSol is optimized for weak SAD data from version 1.8.5

If you want to use a very weak anomalous signal in AutoSol you will want to turn on enable_extreme_dm. This allows AutoSol to turn on the features below if the figure of merit of phasing is low.

The new AutoSol is specifically engineered to be able to solve structures at low or high resolution with a very weak anomalous signal. One feature you may notice right away is that the new AutoSol will try to optimize several choices on the fly. AutoSol will use the Bayesian estimates of map quality and the R-value in density modification to decide which choices lead to the best phasing. AutoSol will try using sharpened data for substructure identification as well as unscaled data as input to AutoSol and pick the one leading to the best map. AutoSol will also try several smoothing radii for identification of the solvent boundary and pick the one that gives the best density modification R-value.

You'll also notice that AutoSol uses the new parallel HySS and that it can find substructures with SAD data that are very weak or that only have signal to low resolution. You can use any number of processors on your machine in the HySS step (so far the parallelization is only for HySS, not the other steps in AutoSol, but those are planned as well). The biggest change in AutoSol is that it now uses iterative Phaser LLG completion to improve the anomalously-scattering substructure for SAD phasing.

The key idea is to use the density-modified map (and later, the model built by AutoSol) to iterate the identification of the substructure. This feature is amazingly powerful in cases where only some of the sites can be identified at the start by HySS and by initial Phaser completion. Phaser LLG completion is more powerful if an estimate of part of the structure (from the map or from a model) is available. The new AutoSol may take a little longer than the old one due to the heavy-atom iteration, but you may find that it gives a much improved map and model.

Examples

SAD dataset

You can use AutoSol on SAD data specifying just a few key items. You can load a sequence file and a data file into the GUI and specify the atom type and wavelength; that is sufficient. You can do this on the command line with:

phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798

The sequence file is used to estimate the solvent content of the crystal and for model-building. The wavelength (lambda) is used to look up values for f_prime and f_double_prime from a table, but if measured values are available from a fluorescence scan, these should be given in addition to the wavelength.

SAD dataset specifying solvent fraction

You can set the solvent fraction in the AutoSol GUI main page or on the command line:

phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 \
  solvent_fraction=0.45

This will force the solvent fraction to be 0.45. This illustrates a general feature of the Wizards: they will try to estimate values of parameters, but if you input them directly, they will use your input values.

SAD dataset without model-building

To skip model-building, you can set build to False in the GUI or on the command line:

phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 build=False

This will carry out the usual structure solution, but will skip model-building.

SAD dataset, building RNA instead of protein

You can specify the chain type (RNA, DNA, PROTEIN):

phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 \
  chain_type=RNA

This will carry out the usual structure solution, but will build an RNA chain. For DNA, specify chain_type=DNA. You can only build one type of chain at a time in the AutoSol Wizard. To build protein and DNA, use the AutoBuild Wizard and run it first with chain_type=PROTEIN, then run it again specifying the protein model as input_lig_file_list=proteinmodel.pdb and with chain_type=DNA.

SAD dataset, selecting a particular dataset from an MTZ file

If you have an input MTZ file with more than one anomalous dataset, you can type something like:

phenix.autosol w1.mtz seq.dat 2 Se lambda=0.9798 \
 labels='F+ SIGF+ F- SIGF-'

This will carry out the usual structure solution, but will choose the input data columns based on the labels: 'F+ SIGF+ F- SIGF-'.

NOTE: to specify anomalous data with F+ SIGF+ F- SIGF- like this, these 4 columns must be adjacent to each other in the MTZ file with no other columns in between.

FURTHER NOTE: to instead use a FAVG SIGFAVG DANO SIGDANO array in AutoSol, the data file or an input refinement file MUST also contain a separate array for FP SIGFP or I SIGI or equivalent. This is because FAVG DANO arrays are ONLY allowed as anomalous information, not as amplitudes or intensities. You can use F+ SIGF+ F- SIGF- arrays as a source of both anomalous differences and amplitudes if you want, however.

If you run the AutoSol Wizard with SAD data and an MTZ file containing more than one anomalous dataset and don't tell it which one to use, all possible values of labels are printed out for you so that you can just paste in the one you want.

You can also find out all the possible label strings to use by typing:

phenix.autosol display_labels=w1.mtz  # display all labels for w1.mtz

MRSAD -- SAD dataset with an MR model; Phaser SAD phasing including the model

If you are carrying out SAD phasing with Phaser, you can carry out a combination of molecular replacement phasing and SAD phasing (MRSAD) by adding a single new keyword to your AutoSol run:

input_partpdb_file=MR.pdb

You can optionally also specify an estimate of the RMSD between your model and the true structure with a command like:

partpdb_rms=1.5

In this case the MR.pdb file will be used as a partial model in a maximum-likelihood SAD phasing calculation with Phaser to calculate phases and identify sites, and the combined MR+SAD phases will be written out.

Notes on model bias and MRSAD.

There are a number of factors that influence how much model bias there is in an MR-SAD experiment. Additionally, the model bias is different at different steps in the procedure. You also have several choices that affect how much model bias there is:

Options for reducing model bias in MR-SAD initial phasing

Considering just the initial phasing step, you have two options in Phenix:

  1. You can find sites with Phaser LLG maximization using your input model as part of the total model, then phase using the sites that you find and exclude the model in the phasing step. This is the keyword "phaser_sites_then_phase=True" and it has essentially no model bias.
  2. You can find sites and phase in one step with Phaser, in which case your phases will contain information from the model and there can be model bias. You get two different sets of HL coefficients: one is HLA etc., the other is HLanomA etc. The first contains the model information in the phases and may be biased by the model. The second does not contain phase information from the model, only from the anomalous differences.

Options for reducing model bias in MR-SAD in density modification and refinement

Once you have obtained calculated phases and phase probabilities with and without the model (HL and HLanom), you have additional choices as to how to use this information. In all cases the starting phases are those from the initial phasing step, and include model information unless you set phaser_sites_then_phase=True. The choices in this stage consist of whether to use the phase probabilities with model information (HL) or without (HLanom):

  1. You can specify whether to use phase probabilities with or without the model in density modification. If you set use_hl_anom_in_denmod=False (the default), then model information will be used in density modification. If you set use_hl_anom_in_denmod=True, then the HLanom coefficients, which do not include model information, will be used.
  2. You can separately specify whether to use model information in refinement. If you set use_hl_anom_in_refinement=True (the default if a partial model has been supplied, as in MR-SAD), then phase probabilities without the model will be used.

Avoiding model bias in MR-SAD

You can avoid model bias in MR-SAD completely with phaser_sites_then_phase=True. You can also have reduced model bias if you set use_hl_anom_in_denmod=True and use_hl_anom_in_refinement=True. If you do not reduce model bias in one of these ways, the amount of bias is going to depend very much on the strength of the anomalous signal, the amount of solvent, and the resolution, as all of these factors contribute to the phase information coming from somewhere other than the model.

A good way to check for model bias in this or any other case is to remove a part of your starting model (say, a helix) that you are worried about, and use that omit model in the whole procedure. If the density for this feature shows up in the end, you can be sure it is not due to model bias. You can also do this after the fact: if you have solved your structure and are worried that something in the map is model bias, run the whole procedure again starting with a model that lacks this feature. If the feature comes back in your maps, you know it is real, and you can show the resulting map to anyone as evidence that it is not biased.

Using placed density from a Phaser MR density search to find sites and as a source of phase information (method #2 for MRSAD)

You can carry out a procedure almost like using input_partpdb_file, except that the model information comes as a set of map coefficients for density. These map coefficients can come from Phaser or any other source of phase information. In this case the keyword to use is input_part_map_coeffs_file instead of input_partpdb_file. You can also identify the labels with input_part_map_coeffs_labels.

SAD dataset, reading heavy-atom sites from a PDB file written by phenix.hyss

You can type from the command_line:

phenix.autosol 11 Pb data=deriv.sca seq_file=seq.dat \
  sites_file=deriv_hyss_consensus_model.pdb lambda=0.95

This will carry out the usual structure solution process, but will read sites from deriv_hyss_consensus_model.pdb, try both hands, and carry on from there. If you know the hand of the substructure, you can fix it with have_hand=True.

MAD dataset

The inputs for a MAD dataset need to specify f_prime and f_double_prime for each wavelength. You can use a parameters file "mad.eff" to input MAD data. You run it with "phenix.autosol mad.eff". Here is an example of a parameters file for a MAD dataset. You can set many additional parameters as well (see the list at the end of this document).

autosol {
 seq_file = seq.dat
 sites = 2
 atom_type = Se
 wavelength {
   data = peak.sca
   lambda = .9798
   f_prime = -8.0
   f_double_prime = 4.5
 }
 wavelength {
   data = inf.sca
   lambda = .9792
   f_prime = -9.0
   f_double_prime = 1.5
 }
}

MAD dataset, selecting particular datasets from an MTZ file

This is similar to the case for running a SAD analysis, selecting particular columns of data from an MTZ file. If you have an input MTZ file with more than one anomalous dataset, you can use a parameters file like the one above for MAD data, but adding information on the labels in the MTZ file that are to be chosen for each wavelength:

autosol {
 seq_file = seq.dat
 sites = 2
 atom_type = Se
 wavelength {
   data = mad.mtz
   lambda = .9798
   f_prime = -8.0
   f_double_prime = 4.5
   labels='peak(+) SIGpeak(+) peak(-) SIGpeak(-)'

 }
 wavelength {
   data = mad.mtz
   lambda = .9792
   f_prime = -9.0
   f_double_prime = 1.5
   labels='infl(+) SIGinfl(+) infl(-) SIGinfl(-)'
 }
}

This will carry out the usual structure solution, but will choose the input peak data columns based on the label keywords. As in the SAD case, you can find out all the possible label strings to use by typing:

phenix.autosol display_labels=w1.mtz # display all labels for w1.mtz

SIR dataset

The standard inputs for a SIR dataset are the native and derivative, the sequence file, the heavy-atom type, and the number of sites, as well as whether to use anomalous differences (or just isomorphous differences). From the command line you can say:

phenix.autosol native.data=native.sca deriv.data=deriv.sca \
 atom_type=I sites=2 inano=inano

This will set the heavy-atom type to Iodine, look for 2 sites, and include anomalous differences. You can also specify many more parameters using a parameters file. This parameters file shows some of them:

autosol {
  seq_file = seq.dat
  native {
    data = native.sca
  }
  deriv {
    data = pt.sca
    lambda = 1.4
    atom_type = Pt
    f_prime = -3.0
    f_double_prime = 3.5
    sites = 3
  }
}

SAD with more than one anomalously-scattering atom

You can tell the AutoSol wizard to look for more than one anomalously-scattering atom. Specify one atom type (Se) in the usual way. Then specify any additional ones in the GUI window or like this if you are running AutoSol from the command line:

mad_ha_add_list="Br Pt"

Optionally, you can add f_prime and f_double_prime values for the additional atom types with commands like mad_ha_add_f_prime_list=" -7 -10" and mad_ha_add_f_double_prime_list=" 4.2 12", but the values from table lookup should be fine. Note that there must be the same number of entries in each of these three keyword lists, if given. During phasing, Phaser will try to add whichever atom types best fit the scattering from each new site. This option is available for SAD phasing only, and only for a single dataset (not with SAD+MIR etc.).

A particularly useful way to use this feature is to add in S atoms in a protein structure that has SeMet (specifying S with mad_ha_add_list will then, with luck, find the Cys sulfurs), or to add in P atoms in a nucleic acid structure phased by some other atom such as I. This works especially well at high resolution.

MIR dataset

You can run an MIR dataset from the GUI or using a parameters file such as "mir.eff" which you then run with "phenix.autosol mir.eff". Here is an example parameters file for MIR:

autosol {
  seq_file = seq.dat
  native {
    data = native.sca
  }
  deriv {
    data = pt.sca
    lambda = 1.4
    atom_type = Pt
  }
  deriv {
    data = ki.sca
    lambda = 1.5
    atom_type = I
  }
}

You can enter as many derivatives as you want. If you specify a wavelength and heavy-atom type, then scattering factors are calculated for that heavy atom from a table. You can instead enter scattering factors directly with the keywords "f_prime = -3.0" and "f_double_prime = 5.0" if you want.

SIR + SAD datasets

A combination of SIR and SAD datasets (or of SAD+SAD, MIR+SAD+SAD, or any other combination) is easy with a parameters file. You tell the Wizard which grouping each wavelength, native, or derivative goes with, using a keyword such as "group=1":

autosol {
  seq_file = seq.dat
  native {
    group = 1
    data = native.sca
  }
  deriv {
    group = 1
    data = pt.sca
    lambda = 1.4
    atom_type = Pt
  }
  wavelength {
    group = 2
    data = w1.sca
    lambda = .9798
    atom_type = Se
    f_prime = -7.
    f_double_prime = 4.5
  }
}

The SIR and SAD datasets will be solved separately (but whichever one is solved first will use difference or anomalous difference Fouriers to locate sites for the other). Then phases will be combined by addition of Hendrickson-Lattman coefficients and the combined phases will be density modified.
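
The combination step can be illustrated as follows: since Hendrickson-Lattman coefficients encode log phase probabilities, independent sources of phase information are combined simply by adding coefficients, and the figure of merit of the combined distribution can be obtained by numerical integration over the phase. This Python sketch shows the idea (not the SOLVE/RESOLVE implementation):

```python
import math

def combine_hl(hl1, hl2):
    # Independent phase information combines by addition of
    # Hendrickson-Lattman coefficients (A, B, C, D).
    return tuple(a + b for a, b in zip(hl1, hl2))

def fom_from_hl(hl, n_steps=360):
    # Figure of merit m = |<exp(i*phi)>| over the distribution
    # P(phi) ~ exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi),
    # evaluated by simple numerical integration on a phase grid.
    a, b, c, d = hl
    num_r = num_i = denom = 0.0
    for k in range(n_steps):
        phi = 2.0 * math.pi * k / n_steps
        p = math.exp(a * math.cos(phi) + b * math.sin(phi) +
                     c * math.cos(2.0 * phi) + d * math.sin(2.0 * phi))
        num_r += p * math.cos(phi)
        num_i += p * math.sin(phi)
        denom += p
    return math.hypot(num_r, num_i) / denom
```

Combining two weak distributions that agree sharpens the result: fom_from_hl(combine_hl((1, 0, 0, 0), (1, 0, 0, 0))) is larger than fom_from_hl((1, 0, 0, 0)).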

AutoSol with an extremely weak signal

You can use AutoSol in cases with a very weak anomalous signal. The big challenges in this kind of situation are finding the anomalously-scattering atoms and density modification. For finding the sites, you may need to try everything available, including running HySS with various resolution cutoffs for the data and trying other software for finding sites as well. You may then want to start with the sites found using one of these approaches and provide those to phenix.autosol with a command like:

sites_file=my_sites.pdb

For the density modification step, there is a keyword (extreme_dm) that may be helpful in cases where the starting phases are very poor:

extreme_dm = True

This keyword works with the keyword fom_for_extreme_dm to determine whether a set of defaults for density modification with weak phases is appropriate. If so, a very large radius for identification of the solvent boundary (20 A) is used, and the number of density modification cycles is reduced. This can make a big difference in getting started with density modification in such a case.
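
The decision logic described here can be sketched as follows. The threshold value and the returned settings are illustrative stand-ins, not AutoSol's real internals; see the actual fom_for_extreme_dm default in the keyword list.

```python
def extreme_dm_settings(fom, extreme_dm=True, fom_for_extreme_dm=0.35):
    # When extreme_dm is enabled and the phasing figure of merit falls
    # below the threshold, switch to weak-phase defaults: a very large
    # (20 A) solvent-boundary smoothing radius and fewer density
    # modification cycles. The 0.35 threshold here is a placeholder.
    if extreme_dm and fom < fom_for_extreme_dm:
        return {"solvent_radius": 20.0, "denmod_cycles": "reduced"}
    return {"solvent_radius": None, "denmod_cycles": "standard"}
```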

AutoSol with a cluster compound

You can run SAD phasing in AutoSol with a cluster compound (not MIR or MAD yet). Normally you should supply a PDB file with an example of the cluster using the keyword cluster_pdb_file=my_cluster_compound, with the unique residue name XX (yes, really the two letters XX, not XX replaced by some other name). Set the keyword atom_type=XX as well. If your cluster is Ta6Br12, then you can simply put atom_type=TX and skip the cluster_pdb_file. For MAD/MIR, cluster compounds are not currently supported; instead just use a standard atom.

R-free flags and test set

In Phenix the parameter test_flag_value sets the value of the test set that is to be free. Normally Phenix sets up test sets with values of 0 and 1 with 1 as the free set. The CCP4 convention is values of 0 through 19 with 0 as the free set. Either of these is recognized by default in Phenix and you do not need to do anything special. If you have any other convention (for example values of 0 to 19 and test set is 1) then you can specify this with:

test_flag_value=1

Specific limitations and problems

Literature

Decision-making in structure solution using Bayesian estimates of map quality: the PHENIX AutoSol wizard. T.C. Terwilliger, P.D. Adams, R.J. Read, A.J. McCoy, N.W. Moriarty, R.W. Grosse-Kunstleve, P.V. Afonine, P.H. Zwart, and L.W. Hung. Acta Crystallogr D Biol Crystallogr 65, 582-601 (2009).

Simple algorithm for a maximum-likelihood SAD function. A.J. McCoy, L.C. Storoni, and R.J. Read. Acta Crystallogr D Biol Crystallogr 60, 1220-8 (2004).

Substructure search procedures for macromolecular structures. R.W. Grosse-Kunstleve, and P.D. Adams. Acta Cryst. D59, 1966-1973. (2003).

MAD phasing: Bayesian estimates of F(A). T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 50, 11-6 (1994).

Rapid automatic NCS identification using heavy-atom substructures. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 58, 2213-5 (2002).

Maximum-likelihood density modification. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 56, 965-72 (2000).

Statistical density modification with non-crystallographic symmetry. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 58, 2082-6 (2002).

Automated side-chain model building and sequence assignment by template matching. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 59, 45-9 (2003).

Model morphing and sequence assignment after molecular replacement. T.C. Terwilliger, R.J. Read, P.D. Adams, A.T. Brunger, P.V. Afonine, and L.W. Hung. Acta Crystallogr D Biol Crystallogr 69, 2244-50 (2013).

Automated main-chain model building by template matching and iterative fragment extension. T.C. Terwilliger. Acta Crystallogr D Biol Crystallogr 59, 38-44 (2003).

List of all available keywords