Automated Structure Solution using AutoSol
AutoSol Wizard: Tom Terwilliger
Phaser: Gabor Bunkoczi, Airlie McCoy, Randy Read
Hyss: Ralf Grosse-Kunstleve, Tom Terwilliger, Gabor Bunkoczi, Randy Read
Xtriage: Peter Zwart
PHENIX GUI: Nathaniel Echols
The AutoSol Wizard uses HYSS, SOLVE, Phaser, RESOLVE, xtriage and phenix.refine to solve a structure and generate experimental phases with the MAD, MIR, SIR, or SAD methods. The Wizard begins with datafiles (.sca, .hkl, etc.) containing amplitudes (or intensities) of structure factors, identifies heavy-atom sites, calculates phases, carries out density modification and NCS identification, and builds and refines a preliminary model.
The AutoSol Wizard can be run from the PHENIX GUI, from the command-line, and from parameters files. All three versions are identical except in the way that they take commands from the user. See Using the PHENIX Wizards for details of how to run a Wizard. The command-line version will be described here, except for MIR and multiple datasets, which can only be run with the GUI or with a parameters file. Nearly all the parameters described here as command-line keyword=value pairs are parameters that can be set in the GUI, so nearly everything described here can be done in the GUI. The GUI is documented separately.
The basic steps that the AutoSol Wizard carries out are described below. They are: Setting up inputs, Analyzing and scaling the data, Finding heavy-atom (anomalously-scattering atom) sites, Scoring of heavy-atom solutions, Phasing, Density modification (including NCS averaging), and Preliminary model-building and refinement. The data for structure solution are grouped into Datasets and solutions are stored in Solution objects.
AutoSol breaks down the data for a structure solution into datasets, where a dataset is a set of data that corresponds to a single set of heavy-atom sites. An entire MAD dataset is a single dataset. An MIR structure solution consists of several datasets (one for each native-derivative combination). A MAD + SIR structure has one dataset for the MAD data and a second dataset for the SIR data. The heavy-atom sites for each dataset are found separately (but using difference Fouriers from any previously-solved datasets to help). In the phasing step all the information from all datasets is merged into a single set of phases.
The AutoSol wizard uses a "Solution" object to keep track of heavy-atom solutions and the phased datasets that go with them. There are two types of Solutions: those which consist of a single dataset (Primary Solutions) and those that are combinations of datasets (Composite Solutions). Primary Solutions have information on the datafiles that were part of the dataset and on the heavy-atom sites for this dataset. Composite Solutions are simply sets of Primary Solutions, with associated origin shifts. The hand of the heavy-atom or anomalously-scattering atom substructure is part of a Solution, so if you have two datasets, each with two Solutions related by inversion, then AutoSol would normally construct four different Composite Solutions from these and score each one as described below.
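The count of four Composite Solutions in the example above follows from taking one Primary Solution per dataset in every possible way. The following is a minimal Python sketch of that enumeration only; the solution labels are hypothetical placeholders, and the origin shifts that AutoSol stores with each Composite Solution are not modelled.

```python
from itertools import product

def composite_solutions(datasets):
    """Return every combination that takes one Primary Solution per dataset."""
    return [tuple(combo) for combo in product(*datasets)]

# Hypothetical labels: each dataset has one solution in each hand.
mad_solutions = ["mad_hand_1", "mad_hand_2"]
sir_solutions = ["sir_hand_1", "sir_hand_2"]

composites = composite_solutions([mad_solutions, sir_solutions])
for combo in composites:
    print(combo)
```

With two datasets of two Solutions each, the list has four entries, one per pairing of hands.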
The AutoSol Wizard analyzes input datasets with phenix.xtriage to identify twinning and other conditions that may require special care. The data is scaled with SOLVE. For MAD data, FA values are calculated as well.
Note on anisotropy corrections:
The AutoSol wizard will apply an anisotropy correction and B-factor sharpening to all the raw experimental data by default (controlled by the keyword remove_aniso=True). The target overall Wilson B factor can be set with the keyword b_iso, as in b_iso=25. By default the target Wilson B will be 10 times the resolution of the data (e.g., if the resolution is 3 A then b_iso=30), or the actual Wilson B of the data, whichever is lower.
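The default target described above can be written as a one-line rule. This is a sketch of the arithmetic only; the function name is made up for illustration and is not part of Phenix.

```python
def default_b_iso(resolution, wilson_b):
    """Default target Wilson B: 10 x resolution (A) or the data's Wilson B, whichever is lower."""
    return min(10.0 * resolution, wilson_b)

# 3 A data with a Wilson B of 45 is sharpened to a target of 30.
print(default_b_iso(3.0, 45.0))
# 3 A data with a Wilson B of 22 keeps its own Wilson B as the target.
print(default_b_iso(3.0, 22.0))
```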
If an anisotropy correction is applied then the entire AutoSol run will be carried out with anisotropy-corrected and sharpened data. At the very end of the run the final model will be re-refined against the uncorrected refinement data and this re-refined model and the uncorrected refinement data (with freeR flags) will be written out. For the top solution this will be as overall_best.pdb and overall_best_refine_data.mtz; for all other solutions the files will be listed at the end of the log file.
The AutoSol Wizard uses HYSS to find heavy-atom sites. The result of this step is a list of possible heavy-atom solutions for a dataset. For SIR or SAD data, the isomorphous or anomalous differences, respectively, are used as input to HYSS. For MAD data, the anomalous differences at each wavelength and the FA estimates of complete heavy-atom structure factors from SOLVE are each used as separate inputs to HYSS. Each heavy-atom substructure obtained from HYSS corresponds to a potential solution. In space groups where the heavy-atom structure can be either hand, a pair of enantiomorphic solutions is saved for each run of HYSS.
For SAD and MAD data (except for FA estimates) the Phaser LLG completion approach is used to find the heavy-atom sites. This can be quite a bit more powerful than direct methods completion.
Potential heavy-atom solutions are scored based on a set of criteria (SKEW, CORR_RMS, CC_DENMOD, RFACTOR, NCS_OVERLAP, TRUNCATE, REGIONS, CONTRAST, FOM, FLATNESS, described below), using either a Bayesian estimate or a Z-score system to put all the scores on a common scale and to combine them into a single overall score. The overall scoring method chosen (BAYES-CC or Z-SCORE) is determined by the value of the keyword overall_score_method. The default is BAYES-CC. Note that for all scoring methods, the map that is being evaluated, and the estimates of map-perfect-model correlation, refer to the experimental electron density map, not the density-modified map.
Bayesian CC scores (BAYES-CC). Bayesian estimates of the quality of experimental electron density maps are obtained using data from a set of previously-solved datasets. The standard scoring criteria were evaluated for 1905 potential solutions in a set of 246 MAD, SAD, and MIR datasets. As each dataset had previously been solved, the correlation between the refined model and each experimental map (CC_PERFECT) could be calculated for each solution (after offsetting the maps to account for origin differences).
Histograms have been tabulated of the number of instances that a scoring criterion (e.g., SKEW) had various possible values, as a function of the CC_PERFECT of the corresponding experimental map to the refined model. These histograms yield the relative probability of measuring a particular value of that scoring criterion (SKEW), given the value of CC_PERFECT. Using Bayes' rule, these probabilities can be used to estimate the relative probabilities of values of CC_PERFECT given the value of each scoring criterion for a particular electron density map. The mean estimate (BAYES-CC) is reported (multiplied by 100), with a +/-2SD estimate of the uncertainty in this estimate of CC_PERFECT. The BAYES-CC values are estimated independently for each scoring criterion used, and also from all those selected with the keyword score_type_list and not selected with the keyword skip_score_list.
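The Bayes' rule step can be illustrated for a single criterion with a small Python sketch. The bin centres and likelihood values below are invented for illustration and are not the tabulated AutoSol histograms; the flat prior is also an assumption.

```python
import math

def bayes_cc(cc_values, likelihoods, prior=None):
    """Posterior mean and SD of CC_PERFECT, given P(observed score | CC) for each CC bin."""
    if prior is None:
        prior = [1.0] * len(cc_values)  # flat prior over CC bins (assumption)
    weights = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(weights)
    posterior = [w / total for w in weights]
    mean = sum(c * p for c, p in zip(cc_values, posterior))
    variance = sum((c - mean) ** 2 * p for c, p in zip(cc_values, posterior))
    return mean, math.sqrt(variance)

# Invented numbers: P(observed SKEW value | CC_PERFECT) read from a histogram column.
cc_values = [0.1, 0.3, 0.5, 0.7]
likelihoods = [0.05, 0.15, 0.40, 0.40]

mean, sd = bayes_cc(cc_values, likelihoods)
print(f"BAYES-CC = {100 * mean:.1f} +/- {100 * 2 * sd:.1f}")
```

The printed value follows the reporting convention described above: the mean times 100, with a +/-2SD uncertainty.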
Z-scores (Z-SCORE). The Z-score for one criterion for a particular solution is given by,
Z= (Score - mean_random_solution_score)/(SD_of_random_solution_scores)
where Score is the score for this solution, mean_random_solution_score is the mean score for a solution with randomized phases, and SD_of_random_solution_scores is the standard deviation of the scores of solutions with randomized phases.
To create a total score based on Z-scores, the Z-scores for each criterion are simply summed.
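The Z-score formula and the summed total can be sketched in Python as follows. The use of the sample standard deviation and the example numbers are assumptions made for illustration.

```python
from statistics import mean, stdev

def z_score(score, random_scores):
    """Z = (Score - mean_random_solution_score) / SD_of_random_solution_scores."""
    return (score - mean(random_scores)) / stdev(random_scores)

def total_z(scores, random_scores_by_criterion):
    """Sum the Z-scores of all criteria present in `scores`."""
    return sum(
        z_score(value, random_scores_by_criterion[name])
        for name, value in scores.items()
    )

# Invented scores for one solution and for solutions with randomized phases.
scores = {"SKEW": 2.0, "CORR_RMS": 3.0}
random_scores = {"SKEW": [0.0, 1.0, 2.0], "CORR_RMS": [0.0, 1.0, 2.0]}
print(total_z(scores, random_scores))
```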
The principal scoring criteria are:
- The skew (SKEW; third moment or normalized <rho**3>) of the density in an electron density map is a good measure of its quality, because a random map has a skew of zero (density histograms look like a Gaussian), while a good map has a very positive skew (density histograms very strong near zero, but many points with very high density). This criterion is used in scoring by default.
- Correlation of local rms density (CORR_RMS). The presence of contiguous flat solvent regions in a map was detected using the correlation coefficient of the smoothed squared electron density calculated as described above, with the same quantity calculated using half the value of the smoothing radius, yielding the correlation of rms density, r2RMS. In this way the local value of the rms density within a small local region (typically within a radius of 3 A) is compared with the local rms density in a larger local region (typically within a radius of 6 A). If there were a large, contiguous solvent region and another large contiguous region containing the macromolecule, the local rms density in the small region would be expected to be highly correlated with the rms density in the larger region. On the other hand, if the solvent region were broken up into many small flat regions, then this correlation would be expected to be smaller.
- Correlation of map-phased electron density map with experimentally-phased map (CC_DENMOD). The statistical density modification in RESOLVE allows the calculation of map-based phases that are (mostly) independent of the experimental phases. The phase information in statistical density modification comes from two sources: your experimental phases and maximization of the agreement of the map with expectations (such as a flat solvent region). Normally the phase probabilities from these two sources are merged together, yielding your density-modified phases. This score is calculated based on the correlation of the phase information from these two sources before combining them, and is a good indication of the quality of the experimental phases. This criterion is used in scoring by default.
- The R-factor for density modification (RFACTOR). Statistical density modification provides an estimate of structure factors that is (mostly) independent of the measured structure factors, so the R-factor between FC and Fobs is a good measure of the quality of experimental phases. This criterion is used in scoring by default.
- Non-crystallographic symmetry (NCS_OVERLAP). The presence of NCS in a map is a nearly-positive indication that the map is good, or has some correct features. The AutoSol Wizard uses symmetry in heavy-atom sites to suggest NCS, and RESOLVE identifies the actual correlation of NCS-related density for the NCS overlap score. This score is used by default if NCS is present in the Z-score method of scoring.
- Figure of merit (FOM). The figure of merit of phasing is a good indicator of the internal consistency of a solution. This score is not normalized by the SD of randomized phase sets (as that has no meaning; rather a standard SD=0.05 is used). This score is used by default if NCS is present in the Z-score method of scoring and in the Bayesian CC estimate method.
- Map correlation after truncation (TRUNCATION). Dummy atoms (the same number as estimated non-hydrogen atoms in the structure) are placed in positions of high density of the map, and a new map is calculated based on these atomic positions. The correlation of these maps is calculated after adjusting an overall B-value for the dummy atoms to maximize the correlation. A good map will show a high correlation of these maps. This score is by default not used.
- Number of contiguous regions per 100 A**3 comprising top 5% of density in map (REGIONS). The top 5% of points in the map are marked, and the number of contiguous regions that result are counted, and divided by the volume of the asymmetric unit, then multiplied by 100. A good map will have just a few contiguous regions at a high contour level, a poor map will have many isolated peaks. This score is by default not used.
- Contrast, or standard deviation of local rms density (CONTRAST). The local rms density in the map is calculated using a smoothing radius of 3 times the high-resolution cutoff (or 6 A, if less than 6 A). Then the standard deviation of the local rms, normalized to the mean value of the local rms, is reported. This criterion will be high if there are regions of high local rms (the macromolecule) and separate regions of low local rms (the solvent) and low if the map is random. This score is by default not used.
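As an illustration of the SKEW criterion in the list above, the following Python sketch computes a normalized third moment for a list of map values. The exact normalization used inside SOLVE/RESOLVE is not given here, so the formula (third central moment divided by the cube of the standard deviation) is an assumption.

```python
def skew(density):
    """Normalized third moment <(rho - mean)**3> / sigma**3 of a list of map values."""
    n = len(density)
    mu = sum(density) / n
    m2 = sum((x - mu) ** 2 for x in density) / n
    m3 = sum((x - mu) ** 3 for x in density) / n
    return m3 / m2 ** 1.5

# A symmetric (random-like) distribution has zero skew.
print(skew([-2.0, -1.0, 0.0, 1.0, 2.0]))
# Mostly flat values with a few strong peaks give a clearly positive skew.
print(skew([0.0] * 9 + [10.0]))
```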
The AutoSol Wizard uses Phaser to calculate experimental phases from SAD data, and SOLVE to calculate phases from MIR, MAD, and multiple-dataset cases.
The AutoSol Wizard uses RESOLVE to carry out density modification. It identifies NCS from symmetries in heavy-atom sites with RESOLVE and applies this NCS if it is present in the electron density map.
The AutoSol Wizard carries out one cycle of model-building and refinement after obtaining density-modified phases. The model-building is done with RESOLVE. The refinement is carried out with phenix.refine.
There are several resolution limits used in AutoSol. You can leave them all to default, or you can set any of them individually. Here is a list of these limits and how their default values are set:
- resolution: This is the overall resolution for a dataset. By default it is based on the highest resolution for any datafile in this dataset. For multiple datasets, it is the highest resolution for any dataset.
- refinement_resolution: This is the resolution for refinement
- resolution_build: This is the resolution for model-building
- res_phase: This is the resolution for phasing for a dataset. If phase_full_resolution is True then this is the same as "resolution". Otherwise, the value of "recommended_resolution" based on an analysis of signal-to-noise in the dataset is used to define it.
- res_eval: This is the resolution for evaluation of solution quality. The default is the value of "resolution" or 2.5 A, whichever is lower resolution.
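The default rules in the list above can be summarized in a short Python sketch. Resolutions are d-spacings in A, so "higher resolution" means a smaller number. The function names are hypothetical, and the defaults for refinement_resolution and resolution_build are not stated above, so they are left out.

```python
def overall_resolution(datafile_resolutions):
    """Default 'resolution': the highest resolution (smallest d-spacing) of any datafile."""
    return min(datafile_resolutions)

def default_res_phase(resolution, recommended_resolution, phase_full_resolution):
    """Default 'res_phase': full resolution if requested, else the recommended value."""
    return resolution if phase_full_resolution else recommended_resolution

def default_res_eval(resolution):
    """Default 'res_eval': 'resolution' or 2.5 A, whichever is lower resolution."""
    return max(resolution, 2.5)

resolution = overall_resolution([2.1, 2.4, 2.8])
print(resolution, default_res_phase(resolution, 2.6, False), default_res_eval(resolution))
```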
When you run AutoSol the output files will be in a subdirectory with your run number:
AutoSol_run_1_/
The key output files that are produced are:
- A log file describing everything in the run and the files produced: AutoSol_run_1_1.log # overall log file
- A summary file listing the results of the run and the other files produced: AutoSol_summary.dat # overall summary
- A warnings file listing any warnings about the run: AutoSol_warnings.dat # any warnings
- Density-modified map coefficients (NOTE: These files will be aniso-corrected and sharpened if remove_aniso=True): overall_best_denmod_map_coeffs.mtz # map coefficients (density modified phases)
- Current preliminary model: overall_best.pdb # model produced for top solution
NOTE: If there are multiple chains or multiple ncs copies, each chain will be given its own chainID (A B C D...). Segments that are not assigned to a chain are given a separate chainID and are given a segid of "UNK" to indicate that their assignment is unknown. The chainID for solvent molecules is normally S, and the chainID for heavy-atoms is normally Z.
- An MTZ file for use in refinement: overall_best_refine_data.mtz # F Sigma HL coeffs, freeR-flags for refinement
NOTE 1: This file is not aniso corrected and not sharpened. NOTE 2: Two sets of HL coefficients may be present. Normally use HLA HLB etc. However, if you supplied a model with input_partpdb_file=my_model.pdb then use HLanomA HLanomB etc. instead. The reason is that the HL coefficients contain phase information from my_model.pdb in this case and you do not want that information passed to your refinement program. NOTE 3: If this is a SAD or MAD dataset then the overall_best_refine_data.mtz file will normally have your original anomalous data. For MAD data this will be from the wavelength of data with the highest-resolution data present.
- Heavy atom sites in PDB format: overall_best_ha_pdb.pdb # ha file for top solution
- NCS information (if any): overall_best_ncs_file.ncs_spec # NCS information for top solution
- Experimental phases and HL coefficients (NOTE: These files are aniso-corrected and sharpened if remove_aniso=True): overall_best_hklout_phased.mtz # phases and HL coeffs for top solution
- Log file for experimental phasing: overall_best_log_phased.log # experimental phasing log file for top solution
- Log file for scaling: overall_best_log_phased.log # experimental phasing log file for top solution
- Log file for heavy-atom substructure search: overall_best_log_hyss.log # ha search log file for top solution
Running the AutoSol Wizard is easy. From the command-line you can type:
phenix.autosol w1.sca seq.dat 2 Se f_prime=-8 f_double_prime=4.5
From the GUI you load in these files and specify the number of anomalously-scattering atoms, the atom type, and the scattering factors on the main GUI page.
The AutoSol Wizard will assume that w1.sca is a datafile (because it ends in .sca and is a file) and that seq.dat is a sequence file, that there are 2 heavy-atom sites, and that the heavy-atom is Se. The f_prime and f_double_prime values are set explicitly.
You can also specify each of these things directly:
phenix.autosol data=w1.sca seq_file=seq.dat sites=2 \
atom_type=Se f_prime=-8 f_double_prime=4.5
You can specify many more parameters as well. See the list of keywords, defaults and descriptions at the end of this page and also general information about running Wizards at Using the PHENIX Wizards for how to do this. Some of the most common parameters are:
sites=3 # 3 sites
sites_file=sites.pdb # ha sites in PDB or fractional xyz format
atom_type=Se # Se is the heavy-atom
seq_file=seq.dat # sequence file (1-aa code, separate chains with >>>>)
quick=True # try to find sites quickly
data=w1.sca # input datafile
lambda=0.9798 # wavelength for SAD
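If you script many runs, the keyword=value pairs above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of Phenix; it only builds the argument list from keywords shown above and does not run anything.

```python
import shlex

def autosol_command(params):
    """Return an argv list: phenix.autosol followed by keyword=value pairs."""
    return ["phenix.autosol"] + [f"{key}={value}" for key, value in params.items()]

# "lambda" is a reserved word in Python, so the keywords are passed as a dict.
cmd = autosol_command({
    "data": "w1.sca",
    "seq_file": "seq.dat",
    "sites": 2,
    "atom_type": "Se",
    "lambda": 0.9798,
})
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where Phenix is installed.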
You can run phenix.autosol from a parameters file. This is often convenient because you can generate a default one with:
phenix.autosol --show_defaults > my_autosol.eff
and then you can just edit this file to match your needs and run it with:
phenix.autosol my_autosol.eff
NOTE: the autosol parameters file my_autosol.eff will have just one blank native, derivative, and wavelength. You can cut and paste them to put in as many as you want to have.
If you want to use a very weak anomalous signal in AutoSol you will want to turn on enable_extreme_dm. This allows AutoSol to turn on the features below if the figure of merit of phasing is low.
The new AutoSol is specifically engineered to be able to solve structures at low or high resolution with a very weak anomalous signal. One feature you may notice right away is that the new AutoSol will try to optimize several choices on the fly. AutoSol will use the Bayesian estimates of map quality and the R-value in density modification to decide which choices lead to the best phasing. AutoSol will try using sharpened data for substructure identification as well as unscaled data as input to AutoSol and pick the one leading to the best map. AutoSol will also try several smoothing radii for identification of the solvent boundary and pick the one that gives the best density modification R-value.
You'll also notice that AutoSol uses the new parallel HySS and that it can find substructures with SAD data that are very weak or that only have signal to low resolution. You can use any number of processors on your machine in the HySS step (so far the parallelization is only for HySS, not the other steps in AutoSol, but those are planned as well). The biggest change in AutoSol is that it now uses iterative Phaser LLG completion to improve the anomalously-scattering substructure for SAD phasing.
The key idea is to use the density-modified map (and later, the model built by AutoSol) to iterate the identification of the substructure. This feature is amazingly powerful in cases where only some of the sites can be identified at the start by HySS and by initial Phaser completion. Phaser LLG completion is more powerful if an estimate of part of the structure (from the map or from a model) is available. The new AutoSol may take a little longer than the old one due to the heavy-atom iteration, but you may find that it gives a much improved map and model.
You can use AutoSol on SAD data specifying just a few key items. You can load a sequence file and a data file in to the GUI and specify the atom type and wavelength and that is sufficient. You can do this on the command line with:
phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798
The sequence file is used to estimate the solvent content of the crystal and for model-building. The wavelength (lambda) is used to look up values for f_prime and f_double_prime from a table, but if measured values are available from a fluorescence scan, these should be given in addition to the wavelength.
You can set the solvent fraction in the AutoSol GUI main page or on the command line:
phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 \
solvent_fraction=0.45
This will force the solvent fraction to be 0.45. This illustrates a general feature of the Wizards: they will try to estimate values of parameters, but if you input them directly, they will use your input values.
To skip model-building you can set build to False in the GUI or on the command line:
phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 build=False
This will carry out the usual structure solution, but will skip model-building.
You can specify the chain type (RNA, DNA, PROTEIN):
phenix.autosol w1.sca seq.dat 2 Se lambda=0.9798 \
chain_type=RNA
This will carry out the usual structure solution, but will build an RNA chain. For DNA, specify chain_type=DNA. You can only build one type of chain at a time in the AutoSol Wizard. To build protein and DNA, use the AutoBuild Wizard and run it first with chain_type=PROTEIN, then run it again specifying the protein model as input_lig_file_list=proteinmodel.pdb and with chain_type=DNA.
If you have an input MTZ file with more than one anomalous dataset, you can type something like:
phenix.autosol w1.mtz seq.dat 2 Se lambda=0.9798 \
labels='F+ SIGF+ F- SIGF-'
This will carry out the usual structure solution, but will choose the input data columns based on the labels: 'F+ SIGF+ F- SIGF-' NOTE: to specify anomalous data with F+ SIGF+ F- SIGF- like this, these 4 columns must be adjacent to each other in the MTZ file with no other columns in between. FURTHER NOTE: to instead use a FAVG SIGFAVG DANO SIGDANO array in AutoSol, the data file or an input refinement file MUST also contain a separate array for FP SIGFP or I SIGI or equivalent. This is because FAVG DANO arrays are ONLY allowed as anomalous information, not as amplitudes or intensities. You can use F+ SIGF+ F- SIGF- arrays as a source of both anomalous differences and amplitudes if you want, however.
If you run the AutoSol Wizard with SAD data and an MTZ file containing more than one anomalous dataset and don't tell it which one to use, all possible values of labels are printed out for you so that you can just paste the one you want in.
You can also find out all the possible label strings to use by typing:
phenix.autosol display_labels=w1.mtz # display all labels for w1.mtz
If you are carrying out SAD phasing with Phaser, you can carry out a combination of molecular replacement phasing and SAD phasing (MRSAD) by adding a single new keyword to your AutoSol run:
input_partpdb_file=MR.pdb
You can optionally also specify an estimate of the RMSD between your model and the true structure with a command like:
partpdb_rms=1.5
In this case the MR.pdb file will be used as a partial model in a maximum-likelihood SAD phasing calculation with Phaser to calculate phases and identify sites in Phaser, and the combined MR+SAD phases will be written out.
There are a number of factors that influence how much model bias there is in an MR-SAD experiment. Additionally, the model bias is different at different steps in the procedure. You also have several choices that affect how much model bias there is:
- Considering just the initial phasing step, you have two options in Phenix:
- You can find sites with phaser LLG maximization using your input model as part of the total model, then phase using the sites that you find and exclude the model in the phasing step. This is the keyword "phaser_sites_then_phase=True" and it has essentially no model bias.
- You can find sites and phase in one step with Phaser, in which case your phases will contain information from the model and there can be model bias. You get two different sets of HL coefficients, one is HLA etc, the other is HLanomA etc. The first contains the same information in the phases, and may be biased by the model. The second does not contain phase information from the model, only from the anomalous differences.
- Once you have obtained calculated phases and phase probabilities with and without the model (HL and HLanom), you have additional choices as to how to use this information. In all cases the starting phases are those from step 1, and include model information unless you set phaser_sites_then_phase=True. The choices in this stage consist of whether to use the phase probabilities with model information (HL) or without (HLanom).
- You can specify whether to use phase probabilities with or without the model in density modification and refinement. If you set use_hl_anom_in_denmod=False (default) then model information will be used in density modification. If you set use_hl_anom_in_denmod=True then HLanom coefficients not including model information will be used.
- You can separately specify whether to use model information in refinement. If you set use_hl_anom_in_refinement=True (default if a partial model has been supplied as in MR-SAD), then phase probabilities without the model will be used.
You can avoid model bias in MR-SAD completely with phaser_sites_then_phase=True. You can also have reduced model bias if you set use_hl_anom_in_denmod=True and use_hl_anom_in_refinement=True. If you do not reduce model bias in one of these ways, the amount of bias is going to depend very much on the strength of the anomalous signal, the amount of solvent, and the resolution, as all of these factors contribute to the phase information coming from somewhere other than the model.
A good way to check on model bias in this or any other case is to remove a part of your starting model (say, a helix) that you are worried about, and use that omit model in the whole procedure. Then if the density shows up in the end for this feature you can be sure it is not due to model bias. You can do this after the fact: you solve your structure, you are worried that something in the map is model bias, you run the whole thing over starting with a model that does not have this feature, it comes back in your maps, and you know it is ok. You can show the resulting map to anyone and they will know that it is not biased as well.
You can type from the command line:
phenix.autosol 11 Pb data=deriv.sca seq_file=seq.dat \
sites_file=deriv_hyss_consensus_model.pdb lambda=0.95
This will carry out the usual structure solution process, but will read sites from deriv_hyss_consensus_model.pdb, try both hands, and carry on from there. If you know the hand of the substructure, you can fix it with have_hand=True.
The inputs for a MAD dataset need to specify f_prime and f_double_prime for each wavelength. You can use a parameters file "mad.eff" to input MAD data. You run it with "phenix.autosol mad.eff". Here is an example of a parameters file for a MAD dataset. You can set many additional parameters as well (see the list at the end of this document).
autosol {
seq_file = seq.dat
sites = 2
atom_type = Se
wavelength {
data = peak.sca
lambda = .9798
f_prime = -8.0
f_double_prime = 4.5
}
wavelength {
data = inf.sca
lambda = .9792
f_prime = -9.0
f_double_prime = 1.5
}
}
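When there are several wavelengths, a parameters file like the one above can be written out by a short script. The Python sketch below is a hypothetical helper, not a Phenix tool; it emits only the keywords used in the example above, and the third wavelength you might add would follow the same pattern.

```python
def mad_params(seq_file, sites, atom_type, wavelengths):
    """Return the text of an autosol parameters file with one wavelength block per entry."""
    lines = [
        "autosol {",
        f"  seq_file = {seq_file}",
        f"  sites = {sites}",
        f"  atom_type = {atom_type}",
    ]
    for w in wavelengths:
        lines.append("  wavelength {")
        for key in ("data", "lambda", "f_prime", "f_double_prime"):
            lines.append(f"    {key} = {w[key]}")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines) + "\n"

text = mad_params("seq.dat", 2, "Se", [
    {"data": "peak.sca", "lambda": 0.9798, "f_prime": -8.0, "f_double_prime": 4.5},
    {"data": "inf.sca", "lambda": 0.9792, "f_prime": -9.0, "f_double_prime": 1.5},
])
print(text)
```

The text can be saved as mad.eff and run with "phenix.autosol mad.eff".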
This is similar to the case for running a SAD analysis, selecting particular columns of data from an MTZ file. If you have an input MTZ file with more than one anomalous dataset, you can use a parameters file like the one above for MAD data, but adding information on the labels in the MTZ file that are to be chosen for each wavelength:
autosol {
seq_file = seq.dat
sites = 2
atom_type = Se
wavelength {
data = mad.mtz
lambda = .9798
f_prime = -8.0
f_double_prime = 4.5
labels='peak(+) SIGpeak(+) peak(-) SIGpeak(-)'
}
wavelength {
data = mad.mtz
lambda = .9792
f_prime = -9.0
f_double_prime = 1.5
labels='infl(+) SIGinfl(+) infl(-) SIGinfl(-)'
}
}
This will carry out the usual structure solution, but will choose the input peak data columns based on the label keywords.
As in the SAD case, you can find out all the possible label strings to use by typing:
phenix.autosol display_labels=w1.mtz # display all labels for w1.mtz
The standard inputs for a SIR dataset are the native and derivative, the sequence file, the heavy-atom type, and the number of sites, as well as whether to use anomalous differences (or just isomorphous differences). From the command line you can say:
phenix.autosol native.data=native.sca deriv.data=deriv.sca \
atom_type=I sites=2 inano=inano
This will set the heavy-atom type to Iodine, look for 2 sites, and include anomalous differences.
You can also specify many more parameters using a parameters file. This parameters file shows some of them:
autosol {
seq_file = seq.dat
native {
data = native.sca
}
deriv {
data = pt.sca
lambda = 1.4
atom_type = Pt
f_prime = -3.0
f_double_prime = 3.5
sites = 3
}
}
You can tell the AutoSol wizard to look for more than one anomalously-scattering atom. Specify one atom type (Se) in the usual way. Then specify any additional ones in the GUI window or like this if you are running AutoSol from the command line:
mad_ha_add_list="Br Pt"
Optionally, you can add f_prime and f_double_prime values for the additional atom types with commands like
mad_ha_add_f_prime_list=" -7 -10"
mad_ha_add_f_double_prime_list=" 4.2 12"
but the values from table lookup should be fine. Note that there must be the same number of entries in each of these three keyword lists, if given. During phasing Phaser will try to add whichever atom types best fit the scattering from each new site. This option is available for SAD phasing only and only for a single dataset (not with SAD+MIR etc).
A particularly useful way to use this feature is to add in S atoms in a protein structure that has SeMet (specifying S with mad_ha_add_list will then, with luck, find the Cys sulfurs), or to add in P atoms in a nucleic acid structure phased by some other atom such as I. This works especially well at high resolution.
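The requirement that the three keyword lists have the same number of entries can be checked before a run. The Python sketch below is a hypothetical pre-flight check written for illustration; it is not how AutoSol itself validates its input.

```python
def check_ha_add_lists(atoms, f_prime=None, f_double_prime=None):
    """Return the number of additional atom types, raising if list lengths differ."""
    n = len(atoms.split())
    optional = (
        ("mad_ha_add_f_prime_list", f_prime),
        ("mad_ha_add_f_double_prime_list", f_double_prime),
    )
    for name, value in optional:
        if value is not None and len(value.split()) != n:
            raise ValueError(f"{name} must have {n} entries to match mad_ha_add_list")
    return n

# Values taken from the keyword examples above.
print(check_ha_add_lists("Br Pt", " -7 -10", " 4.2 12"))
```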
You can run an MIR dataset from the GUI or using a parameters file such as "mir.eff" which you then run with "phenix.autosol mir.eff". Here is an example parameters file for MIR:
autosol {
seq_file = seq.dat
native {
data = native.sca
}
deriv {
data = pt.sca
lambda = 1.4
atom_type = Pt
}
deriv {
data = ki.sca
lambda = 1.5
atom_type = I
}
}
You can enter as many derivatives as you want. If you specify a wavelength and heavy atom type then scattering factors are calculated from a table for that heavy-atom. You can instead enter scattering factors with the keywords "f_prime = -3.0 " "f_double_prime = 5.0" if you want.
A combination of SIR and SAD datasets (or of SAD+SAD or MIR+SAD+SAD or any other combination) is easy with a parameters file. You tell the wizard which group each wavelength, native, or derivative belongs to with a keyword such as "group=1":
autosol {
seq_file = seq.dat
native {
group = 1
data = native.sca
}
deriv {
group = 1
data = pt.sca
lambda = 1.4
atom_type = Pt
}
wavelength {
group = 2
data = w1.sca
lambda = .9798
atom_type = Se
f_prime = -7.
f_double_prime = 4.5
}
}
The SIR and SAD datasets will be solved separately (but whichever one is solved first will use difference or anomalous difference Fouriers to locate sites for the other). Then phases will be combined by addition of Hendrickson-Lattman coefficients and the combined phases will be density modified.
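Adding Hendrickson-Lattman coefficients is equivalent to multiplying the phase probability distributions from the two sources, since P(phi) is proportional to exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi). The Python sketch below illustrates this for a single reflection with invented coefficients; it is not the code used by AutoSol, and the numerical integration over 360 steps is a simplification.

```python
import math

def combine_hl(*hl_sets):
    """Add Hendrickson-Lattman coefficients (A, B, C, D) term by term."""
    return tuple(sum(terms) for terms in zip(*hl_sets))

def centroid_phase(hl, steps=360):
    """Centroid phase (degrees) and figure of merit from the HL phase probability."""
    a, b, c, d = hl
    sum_p = sum_cos = sum_sin = 0.0
    for i in range(steps):
        phi = 2.0 * math.pi * i / steps
        p = math.exp(a * math.cos(phi) + b * math.sin(phi)
                     + c * math.cos(2.0 * phi) + d * math.sin(2.0 * phi))
        sum_p += p
        sum_cos += p * math.cos(phi)
        sum_sin += p * math.sin(phi)
    fom = math.hypot(sum_cos, sum_sin) / sum_p
    phase = math.degrees(math.atan2(sum_sin, sum_cos)) % 360.0
    return phase, fom

# Invented coefficients for one reflection from each dataset.
sir_hl = (1.0, 0.0, 0.0, 0.0)   # favours a phase near 0 degrees
sad_hl = (0.0, 1.0, 0.0, 0.0)   # favours a phase near 90 degrees
combined = combine_hl(sir_hl, sad_hl)
phase, fom = centroid_phase(combined)
print(combined, phase, fom)
```

In this example the combined distribution peaks between the two individual estimates and has a higher figure of merit than either source alone.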
You can use AutoSol in cases with a very weak anomalous signal. The big challenges in this kind of situation are finding the anomalously-scattering atoms and density modification. For finding the sites, you may need to try everything available, including running HySS with various resolution cutoffs for the data and trying other software for finding sites as well. You may then want to start with the sites found using one of these approaches and provide those to phenix.autosol with a keyword like:
sites_file=my_sites.pdb
For the density modification step, there is a keyword (extreme_dm) that may be helpful in cases where the starting phases are very poor:
extreme_dm = True
This keyword works with the keyword fom_for_extreme_dm to determine whether a set of defaults for density modification with weak phases is appropriate. If so, a very large radius for identification of the solvent boundary (20 A) is used, and the number of density modification cycles is reduced. This can make a big difference in getting started with density modification in such a case.
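The interaction between extreme_dm and fom_for_extreme_dm can be pictured as a simple threshold test. The sketch below is one reading of the keyword descriptions, not the Wizard's actual code: the function name is hypothetical, the override values are those listed in the fom_for_extreme_dm keyword description, and treating "up to" as an inclusive comparison is an assumption.

```python
def density_modification_overrides(extreme_dm, fom, fom_for_extreme_dm=0.35):
    """Return the weak-phase defaults if they apply, else an empty dict.

    extreme_dm may be True, False or "Auto". fom is the figure of merit
    of the phasing solution.
    """
    enabled = extreme_dm is True or extreme_dm == "Auto"
    # Assumption: "up to fom_for_extreme_dm" is read as fom <= threshold.
    if enabled and fom <= fom_for_extreme_dm:
        return {
            "mask_type": "wang",
            "wang_radius": 20,   # very large solvent-boundary radius (A)
            "mask_cycles": 1,
            "minor_cycles": 4,   # fewer density modification cycles
        }
    return {}


print(density_modification_overrides(True, fom=0.25))   # overrides apply
print(density_modification_overrides(True, fom=0.50))   # {}
```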
You can run SAD phasing in AutoSol with a cluster compound. Normally you should supply a PDB file with an example of the cluster using the keyword cluster_pdb_file=my_cluster_compound and a unique residue name XX (yes, really the two letters XX, not XX replaced by some other name). Set the keyword atom_type=XX as well. If your cluster is Ta6Br12 then you can simply put atom_type=TX and skip the cluster_pdb_file. Cluster compounds are not yet supported for MAD or MIR; for those methods, just use a standard atom instead.
In Phenix the parameter test_flag_value sets the value of the test set that is to be free. Normally Phenix sets up test sets with values of 0 and 1 with 1 as the free set. The CCP4 convention is values of 0 through 19 with 0 as the free set. Either of these is recognized by default in Phenix and you do not need to do anything special. If you have any other convention (for example values of 0 to 19 and test set is 1) then you can specify this with:
test_flag_value=1
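To see why an unusual convention has to be stated explicitly, consider a simplified guess based only on which flag values occur. This is an illustrative sketch, not the detection that Phenix actually performs, and guess_test_flag_value is a hypothetical name.

```python
def guess_test_flag_value(flags):
    """Guess the free-set flag from the two conventions described above.

    Returns None when the set of values matches neither convention.
    """
    values = set(flags)
    if values == {0, 1}:
        return 1   # Phenix-style flags: 1 marks the free set
    if values == set(range(20)):
        return 0   # CCP4-style flags: 0 marks the free set
    return None    # unknown convention: set test_flag_value yourself


print(guess_test_flag_value([0, 0, 1, 0, 1]))        # 1
print(guess_test_flag_value(list(range(20)) * 3))    # 0
```

A file that uses values 0 to 19 with 1 as the free set contains exactly the same values as the CCP4 case, so no guess based on the values alone can distinguish them. That is the situation in which test_flag_value=1 is needed.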
- The size of the asymmetric unit in the SOLVE/RESOLVE portion of the AutoSol wizard is limited by the memory in your computer and the binaries used. The Wizard is supplied with regular-size ("", size=6), giant ("_giant", size=12), huge ("_huge", size=18) and extra_huge ("_extra_huge", size=36). Larger-size versions can be obtained on request.
- The keywords "cell" and "sg" have been replaced with "unit_cell" and "space_group" to make the keywords the same as in other phenix applications.
- The keywords for running MIR and SIR and MAD datasets from parameter files and the command line have been changed to make the inputs more consistent and suitable for a static GUI.
- The AutoSol Wizard can take a maximum of 6 derivatives for MIR.
- The AutoSol Wizard can take most settings of most space groups. However, it can only use the hexagonal setting of rhombohedral space groups (e.g., #146 R3:H or #155 R32:H), and it cannot use space groups 114-119 (not found in macromolecular crystallography), even in the standard setting. Both restrictions are due to difficulties with the use of asuset in the version of the CCP4 libraries used in PHENIX for these settings and space groups.
- autosol
- atom_type = None Anomalously-scattering atom type. This sets the atom_type in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases. NOTE: For SAD only: if you have a cluster compound, normally you should supply a PDB file with an example of the cluster with the keyword cluster_pdb_file=my_cluster_compound and a unique residue name XX (yes, XX). Set the keyword atom_type=XX as well. If your cluster is Ta6Br12 then you can simply put atom_type=TX and skip the cluster_pdb_file. For MAD/MIR, cluster compounds are not currently supported. Instead just use a standard atom. NOTE: if you specify keep_atom_type and supply a sites_file then atom_type will still apply to added sites.
- lambda = None Wavelength (A). This sets the wavelength value in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases.
- f_prime = None F-prime value. This sets the f_prime value in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases.
- f_double_prime = None F-double-prime value. This sets the f_double_prime value in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases.
- wavelength_name = peak inf high low remote Optional name of wavelength for SAD data. This sets the name in all wavelengths. Normally it is used as a shortcut for SAD cases.
- sites = None Number of heavy-atom sites. This sets the number of sites in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases.
- sites_file = None PDB or plain-text file with heavy-atom sites. This sets the sites in all derivatives and wavelengths. Normally it is used as a shortcut for SAD or SIR cases. If you want to keep the atom types, specify keep_atom_types=True.
- seq_file = Auto Text file with 1-letter code of protein sequence. Separate chains with a blank line or a line starting with >. Normally you should include one copy of each unique chain. NOTE: if 1 copy of each unique chain is provided it is assumed that there are ncs_copies (could be 1) of each unique chain. If more than one copy of any chain is provided it is assumed that the asymmetric unit contains the number of copies of each chain that are given, multiplied by ncs_copies. So if the sequence file has two copies of the sequence for chain A and one of chain B, the cell contents are assumed to be ncs_copies*2 of chain A and ncs_copies of chain B. If there are unequal numbers of copies of chains, be sure to set solvent_fraction. ADDITIONAL NOTES: 1. Lines starting with > are ignored and separate chains. 2. FASTA format is fine. 3. If you enter a PDB file for rebuilding and it has the sequence you want, then the sequence file is not necessary. NOTE: You can also enter the name of a PDB file that contains SEQRES records, and the sequence from the SEQRES records will be read, written to seq_from_seqres_records.dat, and used as your input sequence. If you have a duplex DNA, enter each strand as a separate chain. 4. Characters such as numbers and non-printing characters in the sequence file are ignored. 5. Be sure that your sequence file does not have any blank lines in the middle of your sequence, as these are interpreted as the beginning of another chain.
- quick = None Run everything quickly (Same as thoroughness=quick)
- data = None Shortcut for name of datafile (SAD data only). For SIR use "native.data=native.sca" and "deriv.data=deriv.sca". For MIR and MAD use a parameters file and specify data under "native" and "deriv" or under "wavelength". NOTE: For command-line input it is easiest if each wavelength of data is in a separate data file with obvious data columns. File types that are easy to read include Scalepack sca files, CNS hkl files, and mtz files with just one wavelength of data, or just native or just derivative. In this case the Wizard can read your data without further information. If you have a datafile with many columns, you can use the "labels" keyword to specify which data columns to read. (It may be easier in some cases to use the GUI or to split it with phenix.reflection_file_converter first, however.)
- labels = None Shortcut for specification string for data labels (SAD data only). Only necessary if the wizard does not automatically choose the correct set of data from your file. For SIR use "native.labels" and "deriv.labels". For MIR and MAD use a parameters file and specify labels under "native" and "deriv". NOTE: To find out what the appropriate strings are, type "phenix.autosol display_labels=your-datafile-here.mtz"
- derscale = None shortcut for derivative scale
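The way the seq_file rules above translate into assumed cell contents can be sketched in a few lines of Python. This is an illustration of the counting rule only, not the Wizard's parser: chain_copies is a hypothetical name and the sequences are made-up placeholders.

```python
from collections import Counter


def chain_copies(seq_text, ncs_copies=1):
    """Count the copies of each unique chain implied by a sequence file.

    Chains are separated by blank lines or lines starting with ">";
    characters that are not letters are ignored.
    """
    chains, current = [], []
    for line in seq_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            # A blank line or a ">" line ends the current chain.
            if current:
                chains.append("".join(current))
                current = []
            continue
        cleaned = "".join(ch for ch in line.upper() if ch.isalpha())
        if cleaned:
            current.append(cleaned)
    if current:
        chains.append("".join(current))
    # Copies given in the file are multiplied by ncs_copies.
    return {seq: n * ncs_copies for seq, n in Counter(chains).items()}


# Two copies of chain A and one of chain B (placeholder sequences).
seq_text = """>A
MKTAYIAK

>A copy
MKTAYIAK
>B
GSHMLEDP
"""
print(chain_copies(seq_text, ncs_copies=2))
# {'MKTAYIAK': 4, 'GSHMLEDP': 2}
```

The example also shows why a stray blank line in the middle of a sequence matters: it would split one chain into two shorter ones.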
- crystal_info
- unit_cell = None Enter cell parameter (a b c alpha beta gamma)
- space_group = None Space group symbol (e.g., C2221 or C 2 2 21)
- solvent_fraction = None Solvent fraction in crystals (0 to 1). This is normally set automatically from the number of NCS copies and the sequence. If your crystal has unequal numbers of different chains, then be sure to set the solvent fraction.
- chain_type = *Auto PROTEIN DNA RNA You can specify whether to build protein, DNA, or RNA chains. At present you can only build one of these in a single run. If you have both DNA and protein, build one first, then run AutoBuild again, supplying the prebuilt model in the "input_lig_file_list" and build the other. NOTE: default for this keyword is Auto, which means "carry out normal process to guess this keyword". The process is to look at the sequence file and/or input pdb file to see what the chain type is. If there is more than one type, the type with the larger number of residues is guessed. If you want to force the chain_type, then set it to PROTEIN, RNA, or DNA.
- resolution = 0 High-resolution limit. Used as resolution limit for density modification and as general default high-resolution limit. If resolution_build or refinement_resolution are set then they override this for model-building or refinement. If overall_resolution is set then data beyond that resolution is ignored completely. Zero means keep everything.
- change_sg = False You can change the space group. In AutoSol the Wizard will use ImportRawData and let you specify the sg and cell. In AutoMR the wizard will give you an entry form to specify them. NOTE: This only applies when reading in new datasets. It does nothing when changed after datasets are read in.
- residues = None Number of amino acid residues in the au (or equivalent)
- sequence = None Plain text containing 1-letter code of protein sequence Same as seq_file except the sequence is read directly, not from a file. If both are given, seq_file is ignored.
- input_files
- cif_def_file_list = None You can enter any number of CIF definition files. These are normally used to tell phenix.refine about the geometry of a ligand or unusual residue. You usually will use these in combination with "PDB file with metals/ligands" (keyword "input_lig_file_list" ) which allows you to attach the contents of any PDB file you like to your model just before it gets refined. You can use phenix.elbow to generate these if you do not have a CIF file and one is requested by phenix.refine
- group_labels_list = None For command-line and script running of AutoSol, you may wish to use keywords to specify which set of data columns to be used from an MTZ or other file type with multiple datasets. (From the GUI, it is easy because you are prompted with the column labels). You can do this by specifying a string that identifies which dataset to include. All allowed values of this identification string will be written out any time AutoSol is run on this dataset like this: NOTE: To specify a particular set of data you can specify one of the following (this example is for MAD data, specifying data for peak wavelength): ...: peak.labels='F SIGF DANO SIGDANO' peak.labels='F(+) SIGF(+) F(-) SIGF(-)' You can then use one of the above commands on the command-line to identify the dataset of interest. If you want to use a script instead, you can specify N files in your input_data_file_list, and then specify N values for group_labels_list like this: group_labels_list 'F,SIGF,DANO,SIGDANO' 'F(+),SIGF(+),F(-),SIGF(-)' This will take 'F,SIGF,DANO,SIGDANO' as the data for datafile 1 and 'F(+),SIGF(+),F(-),SIGF(-)' for datafile 2 You can identify one dataset from each input file in this way. If you want more than one, then please use phenix.reflection_file_converter to split your input file, or else use the GUI version of AutoSol in which you can select any subset of the data that you wish.
- input_file_list = None Normally not used. Use "data=" or "wavelength.data=" or "native.data=" or "deriv.data=" instead.
- input_phase_file = None MTZ data file with FC PHIC or equivalent to use for finding heavy-atom sites with difference Fourier methods. NOTE: compare with input_part_map_coeffs_file which is a file with map coefficients for use with Phaser completion to find anomalously-scattering atoms.
- input_phase_labels = None Labels for FC and PHIC for data file with FC PHIC or equivalent to use for finding heavy-atom sites with difference Fourier methods.
- input_refinement_file = None Data file to use for refinement. The data in this file should not be corrected for anisotropy. It will be combined with experimental phase information for refinement. If you leave this blank, then the output of phasing will be used in refinement (see below). If no anisotropy correction is applied to the data you do not need to specify a datafile for refinement. If an anisotropy correction is applied to the data files, then you must enter a datafile for refinement if you want to refine your model. (See "remove_aniso" for specifying whether an anisotropy correction is applied. In most cases it is not.) If an anisotropy correction is applied and no refinement datafile is supplied, then no refinement will be carried out in the model-building step. You can choose any of your datafiles to be the refinement file, or a native that is not part of the datasets for structure solution. If there are more than one dataset you will be asked each time for a refinement file, but only the last one will be used. Any standard format is fine; normally only F and sigF will be used. Bijvoet pairs and duplicates will be averaged. If an mtz file is provided then a free R flag can be read in as well. If you do not provide a refinement file then the structure factors from the phasing step will be used in refinement. This is normally satisfactory for SAD data and MIR data. For MAD data you may wish to supply a refinement file because the structure factors from phasing are a combination of data from different wavelengths of data. It is better if you choose your best wavelength of data for refinement.
- input_refinement_labels = None Labels for input refinement file columns (FP SIGFP FreeR_flag)
- input_seq_file = Auto Normally not used. Use instead "seq_file"
- refine_eff_file_list = None You can enter any number of refinement parameter files. These are normally used to tell phenix.refine defaults to apply, as well as creating specialized definitions such as unusual amino acid residues and linkages. These parameters override the normal phenix.refine defaults. They themselves can be overridden by parameters set by the Wizard and by you, controlling the Wizard. NOTE: Any parameters set by AutoBuild directly (such as number_of_macro_cycles, high_resolution, etc...) will not be taken from this parameters file. This is useful only for adding extra parameters not normally set by AutoBuild.
- wavelength Enter a SAD or MAD dataset by filling in information for one or more wavelengths. You can cut and paste an entire wavelength section and enter as many as you like. If you have multiple datasets (e.g., MIR+MAD) then group them using the "group" keyword.
- wavelength_name = peak inf high low remote Optionally indicate if this is the peak, inflection point, high energy remote or low energy remote or remote
- data = None Datafile for this wavelength.
- labels = None Specification string for data labels for this wavelength. Only necessary if the wizard does not automatically choose the correct set of data from your file. To find out what the appropriate strings are, type "phenix.autosol display_labels=your-datafile-here.mtz"
- atom_type = None Anomalously-scattering atom type. You only need to specify this for one of the wavelengths in MAD datasets. NOTE: if you want Phaser to add additional heavy-atoms of other types, you can specify them with mad_ha_add_list.
- lambda = None wavelength (A). If you supply an atom_type and lambda then if you do not supply f_prime and f_double_prime a guess will be made for them from a table.
- res_hyss = None resolution for running HYSS for this wavelength/deriv
- res_eval = None resolution for evaluation of solutions for this wavelength/deriv
- f_prime = None F-prime value for this wavelength. It is best to supply it if you know it.
- f_double_prime = None F-double_prime value for this wavelength. It is best to supply it if you know it.
- sites = None Number of anomalously-scattering sites for this wavelength. You only need to specify this for one wavelength. If you have only MAD data you can also just specify "sites=2".
- sites_file = None PDB or plain-text file with heavy-atom sites. The sites will be taken from this file if supplied
- derscale = None derivative scale factor
- group = 1 Phasing group(s) this wavelength is associated with (Relevant in cases where you have 2 MAD datasets or MAD+SAD or MAD+MIR etc...)
- added_wavelength = False Used internally to flag if this wavelength was added automatically
- ignore = False Ignore this wavelength of data
- native Enter an MIR or SIR dataset by filling in information for a native and one or more derivatives. You can cut and paste these sections and enter as many as you like. If you have multiple datasets (e.g., MIR+MAD) then group them using the "group" keyword.
- data = None Datafile for native
- labels = None Specification string for data labels for native. Only necessary if the wizard does not automatically choose the correct set of data from your file. To find out what the appropriate strings are, type "phenix.autosol display_labels=your-datafile-here.mtz"
- lambda = None wavelength (A) (Not used, for your reference only).
- group = 1 Phasing group(s) this native is associated with (Relevant in cases where you have more than one group of native+derivs or you have MIR + MAD or SAD)
- added_native = False Used internally to flag if this native was added automatically
- ignore = False Ignore this native data
- deriv Enter an MIR or SIR dataset by filling in information for a native and one or more derivatives. You can cut and paste these sections and enter as many as you like. If you have multiple datasets (e.g., MIR+MAD) then group them using the "group" keyword.
- data = None Datafile for this derivative
- labels = None Specification string for data labels for deriv. Only necessary if the wizard does not automatically choose the correct set of data from your file. To find out what the appropriate strings are, type "phenix.autosol display_labels=datafile.mtz"
- atom_type = None Heavy-atom type for deriv.
- sites = None Number of heavy-atom sites for deriv.
- sites_file = None PDB or plain-text file with heavy-atom sites. The sites will be taken from this file if supplied
- res_hyss = None resolution for running HYSS for this wavelength/deriv
- res_eval = None resolution for evaluation of solutions for this wavelength/deriv
- inano = noinano *inano anoonly Use anomalous differences for deriv. noinano means do not use anomalous differences. inano means use anomalous differences and isomorphous differences. anoonly means use anomalous differences and not isomorphous differences.
- f_prime = None F-prime value for this derivative.
- f_double_prime = None F-double_prime value for this derivative.
- lambda = None wavelength (A). Used with atom_type to calculate f_prime and f_double_prime if they are not supplied
- derscale = None derivative scale factor
- group = 1 Phasing group(s) this derivative is associated with (Relevant in cases where you have more than one group of native+derivs or you have MIR + MAD or SAD)
- added_deriv = False Used internally to flag if this derivative was added automatically
- ignore = False Ignore this deriv data
- decision_making
- always_include_peak = True Choose True to add PEAK dataset on for HYSS if not automatically chosen
- add_extra_if_fa = True Choose True to try an extra file for HYSS if FA values are used. This may be useful to solve cases where FA values are poor but their sigmas are small. If True then the anomalous differences will be used for HYSS as well.
- create_scoring_table = None Choose whether you want a scoring table for solutions. A scoring table is slower but better.
- desired_coverage = None Choose what probability you want to have that the correct solution is in your current list of top solutions. A good value is 0.80. If you set a low value (0.01) then only one solution will be kept at any time; if you set a high value, then many solutions will be kept (and it will take longer).
- skip_hyss_other_derivs_if_quick = True You can skip HYSS and instead use difference Fouriers only to find sites for all other derivatives once a good HYSS solution is found. Only used if thoroughness=quick.
- self_diff_fourier = True Choose whether, in cases where there are multiple derivatives or multiple datasets, you want to use difference Fourier analysis on the same derivative(s) used in phasing (True), or instead (False) only phasing other derivatives
- combine_siblings = True You can specify that in MIR or multiple-dataset solutions the solutions to combine must all be ultimately derived by difference Fourier from the same parent. Compare with combine_same_parent_only where any solutions must have the same immediate parent (unless one is a composite solution).
- max_cc_extra_unique_solutions = 0.5 Specify the maximum value of CC between experimental maps for two solutions to consider them substantially different. Solutions that are within the range for consideration based on desired_coverage, but are outside of the number of allowed max_choices, will be considered, up to max_extra_unique_solutions, if they have a correlation of no more than max_cc_extra_unique_solutions with all other solutions to be tested.
- max_choices = None Number of choices for solutions to consider. Set automatically with quick: 1 and thorough:3
- max_composite_choices = 8 Number of choices for composite solutions to consider
- max_extra_unique_solutions = None Specify the maximum number of solutions to consider based on their uniqueness as well as their high scores. Solutions that are within the range for consideration based on desired_coverage, but are outside of the number of allowed max_choices, will be considered, up to max_extra_unique_solutions, if they have a correlation of no more than max_cc_extra_unique_solutions with all other solutions to be tested. Set automatically with quick:0 ; thorough:2
- max_range_to_keep = 4 The range of solutions to be kept is range_to_keep * SD of the group of solutions. This sets the maximum of range_to_keep
- min_fom = 0.05 Minimum fom of a solution to keep it at all
- low_fom = 0.20 If best FOM is less than low_fom, double range_to_keep
- minimum_merge_cc = 0.25 Minimum ratio of the CC of solutions to the expected CC in merge_mir for a solution to be kept at all
- min_fom_for_dm = 0 Minimum fom of a solution to density modify (otherwise just copy over phases). This is useful in cases where the phasing is so weak that density modification does nothing or makes the phases worse.
- extreme_dm = False Turns on extreme density modification if True or if Auto and FOM is up to fom_for_extreme_dm. Use extreme_dm=True if your phasing is really weak and density modification is not working well. NOTE: if left default, extreme_dm is set to Auto if quick=False.
- fom_for_extreme_dm = 0.35 If extreme_dm is on and FOM of phasing is up to fom_for_extreme_dm then defaults for density modification become: mask_type=wang wang_radius=20 mask_cycles=1 minor_cycles=4
- minimum_ncs_cc = 0.30 Minimum NCS correlation to keep, except in case of extreme_dm
- min_phased_each_deriv = 1 You can require that the wizard phase at least this number of solutions from each derivative, even if they are poor solutions. Usually at least 1 is a good idea so that one derivative does not dominate the solutions.
- n_random = 6 Number of random solutions to generate when setting up scoring table
- res_eval = 0 Resolution for running resolve evaluation (usually 2.5 A). It will be set automatically if you do not set it.
- score_individual_offset_list = None Offsets for individual scores in CC-scoring. Each score will be multiplied by the score_individual_scale_list value, then score_individual_offset_list value is added, to estimate the CC**2 value using this score by itself. The uncertainty in the CC**2 value is given by score_individual_sd_list. NOTE: These scores are not used in calculation of the overall score. They are for information only
- score_individual_scale_list = None Scale factors for individual scores in CC-scoring. Each score will be multiplied by the score_individual_scale_list value, then score_individual_offset_list value is added, to estimate the CC**2 value using this score by itself. The uncertainty in the CC**2 value is given by score_individual_sd_list. NOTE: These scores are not used in calculation of the overall score. They are for information only
- score_individual_sd_list = None Uncertainties for individual scores in CC-scoring. Each score will be multiplied by the score_individual_scale_list value, then score_individual_offset_list value is added, to estimate the CC**2 value using this score by itself. The uncertainty in the CC**2 value is given by score_individual_sd_list. NOTE: These scores are not used in calculation of the overall score. They are for information only
- score_overall_offset = None Overall offset for scores in CC-scoring. The weighted scores will be summed, then all multiplied by score_overall_scale, then score_overall_offset will be added.
- score_overall_scale = None Overall scale factor for scores in CC-scoring. The weighted scores will be summed, then all multiplied by score_overall_scale, then score_overall_offset will be added.
- score_overall_sd = None Overall SD of CC**2 estimate for scores in CC-scoring. The weighted scores will be summed, then all multiplied by score_overall_scale, then score_overall_offset will be added. This is an estimate of CC**2, with uncertainty about score_overall_sd. Then the square root is taken to estimate CC and SD(CC), where SD(CC) now depends on CC due to the square root.
- score_type_list = SKEW CORR_RMS You can choose what scoring methods to include in scoring of solutions in AutoSol. (The choices available are: CC_DENMOD RFACTOR SKEW NCS_COPIES NCS_IN_GROUP TRUNCATE FLATNESS CORR_RMS REGIONS CONTRAST FOM.) NOTE: If you are using Z-SCORE or BAYES-CC scoring, the default is CC_RMS RFACTOR SKEW FOM (and NCS_OVERLAP if ncs_copies is at least equal to ncs_copies_min_for_overlap).
- score_weight_list = None Weights on scores for CC-scoring. Enter the weight on each score in score_type_list. The weighted scores will be summed, then all multiplied by score_overall_scale, then score_overall_offset will be added.
- skip_score_list = None You can evaluate some scores but not use them. Include the ones you do not want to use in the final score in skip_score_list.
- ncs_copies_min_for_overlap = 2 Minimum number of ncs copies (set automatically from composition and cell or with ncs_copies=xx) to use NCS_OVERLAP in scoring
- rho_overlap_min = 0.3 Sets minimum average overlap of NCS-related density to keep NCS. Cutoff of overlap will be rho_overlap_min for 2 ncs copies, and proportionally smaller (rho_overlap_min*2/N) for N ncs copies.
- rho_overlap_min_scoring = 0.5 Once NCS is found, rho_overlap_min_scoring sets threshold for whether the NCS is used in scoring. Cutoff of overlap will be rho_overlap_min_scoring for 2 ncs copies, and proportionally smaller (rho_overlap_min_scoring*2/N) for N ncs copies. (Compare with rho_overlap_min, which sets cutoff for finding NCS, not scoring with it)
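The scaling of the two NCS overlap cutoffs with the number of copies is the same simple proportion. As a worked illustration (ncs_overlap_cutoff is a hypothetical name, and rejecting fewer than 2 copies is an assumption of this sketch):

```python
def ncs_overlap_cutoff(rho_overlap, n_copies):
    """Overlap cutoff for n_copies NCS copies: rho_overlap * 2 / N."""
    if n_copies < 2:
        # Assumption: with fewer than 2 copies there is no NCS to test.
        raise ValueError("NCS needs at least 2 copies")
    return rho_overlap * 2.0 / n_copies


# Defaults: rho_overlap_min = 0.3 (finding NCS),
#           rho_overlap_min_scoring = 0.5 (using NCS in scoring).
for n in (2, 4):
    print(n, ncs_overlap_cutoff(0.3, n), ncs_overlap_cutoff(0.5, n))
```

With the default values, 4 NCS copies halve both cutoffs relative to 2 copies, so NCS among many copies is kept and scored at a lower average overlap.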
- hyss_scoring
- model_ha_iteration = None You can iterate heavy-atom location using a model in the process. This takes a lot longer than simple ha_iteration but can be more powerful for finding weak sites. NOTE: only available for Phaser SAD phasing. Default is to use it if extreme_dm applies. Note: not available on Windows.
- acceptable_r = 0.35 Used to decide whether the model is acceptable enough to not do heavy-atom iteration. A good value is 0.35
- max_model_ha_iterations = None Max number of iterations of model-building and finding heavy-atom sites.
- max_models_in_ha_iteration = None Max number of models to include in iterations of model-building and finding heavy-atom sites.
- ha_iteration = None Choose whether you want to iterate the heavy-atom search. With iteration, sites are found with HYSS, then used to phase and carry out quick density-modification, then either Phaser LLG completion (for SAD data) or difference Fourier or anomalous difference Fourier analysis is used to find sites. Default is to use ha_iteration if extreme_dm applies. Note: not available on Windows.
- max_ha_iterations = None Number of iterations of phasing/density modification searching for heavy-atom sites.
- minimum_improvement = 0 Minimum improvement in score to continue ha iteration
- build_scoring
- overall_score_method = *BAYES-CC Z-SCORE You have 2 choices for an overall scoring method: (1) Sum of individual Z-scores (Z-SCORE) (2) Bayesian estimate of CC of map to perfect model (BAYES-CC) You can specify which scoring criteria to include with score_type_list (default is SKEW CORR_RMS for BAYES-CC and CC RFACTOR SKEW FOM for Z-SCORE. Additionally, if NCS is present, NCS_OVERLAP is used by default in the Z-SCORE method).
- r_switch = 0.4 R-value criteria for deciding whether to use R-value or residues built. A good value is 0.40
- acceptable_quality = 40 You can specify the minimum overall quality of a model (as defined by overall_score_method) to be considered acceptable
- acceptable_secondary_structure_cc = 0.35 You can specify the minimum correlation of density from a secondary structure model to be considered acceptable
- trace_chain = False You can build a CA-only model right after density modification using trace_chain
- trace_chain_score = False You can score density-modified maps with the number of residues built with regular secondary-structure using trace_chain.
- include_dm_score = None You can score density-modified maps with the R-factor. Normally only included if map_model_cc is not clear and best dm R is less than max_useful_dm_r
- use_map_model_cc = None You can score models with CC to map. Only included if CC is higher than quality_cc_min
- quality_cc_min = 0.50 Minimum map-model CC value to use CC in scoring
- max_useful_dm_r = 0.40 Maximum R-value in density modification to use in dm_score
- dev_scoring
- random_scoring = False For testing purposes you can generate random scores
- use_perfect = False You can use the CC between each solution and hklperfect in scoring. This is only for methods development purposes.
- hklperfect = None You can supply an mtz file with idealized coefficients for a map. This will be compared with all maps calculated during structure solution
- perfect_labels = None Labels for input data columns for hklperfect if present. Typical value: "FP PHIC FOM"
- scaling
- remove_aniso = Auto *True False Choose if you want to apply a correction for anisotropy to the data. True means always apply the correction, False means never apply it, Auto means apply it if the data is severely anisotropic (recommended=True). If you set remove_aniso=Auto, then the correction will be applied if the range of anisotropic B-factors is greater than delta_b_for_auto_remove_aniso and the ratio of the largest to the smallest is greater than ratio_b_for_auto_remove_aniso. Anisotropy correction will be applied to all input data before scaling. If used, the default overall target B factor is the minimum of (max_b_iso, lowest B of datasets, target_b_ratio*resolution)
- b_iso = None Target overall B value for anisotropy correction. Ignored if remove_aniso = False. If None, default is minimum of (max_b_iso, lowest B of datasets, target_b_ratio*resolution)
- max_b_iso = 40. Default maximum overall B value for anisotropy correction. Ignored if remove_aniso = False. Ignored if b_iso is set. If used, default is minimum of (max_b_iso, lowest B of datasets, target_b_ratio*resolution)
- target_b_ratio = 10. Default ratio of target B value to resolution for anisotropy correction. Ignored if remove_aniso = False. Ignored if b_iso is set. If used, default is minimum of (max_b_iso, lowest B of datasets, target_b_ratio*resolution)
- try_orig_sad_data_in_hyss = Auto You can try original (unscaled) data in Hyss in addition to scaled data. Applies to SAD datasets only. Default is False if quick=True.
- localscale_before_phaser = True You can apply SOLVE localscaling to SAD data before passing it to Phaser for SAD phasing
- delta_b_for_auto_remove_aniso = 20 Choose what range of aniso B values is so big that you want to correct for anisotropy by default. Both ratio_b and delta_b must be large to correct. See also ratio_b_for_auto_remove_aniso. See also "remove_aniso" which overrides this default if set to "True"
- ratio_b_for_auto_remove_aniso = 1.0 Choose what ratio of aniso B values is so big that you want to correct for anisotropy by default. Both ratio_b and delta_b must be large to correct. See also delta_b_for_auto_remove_aniso. See also "remove_aniso" which overrides this default if set to "True"
- test_remove_aniso = True Choose whether you want to try applying or not applying an anisotropy correction if the run fails. First your original selection for applying or not will be tried, and then the opposite will be tried if the run fails.
- use_sca_as_is = True Choose True to allow use of sca files (and mtz files) without conversion even if the space group is changed. If False, then original index files will always be converted to premerged if the space group is changed
- derscale_list = None List of deriv scale factors. Not normally used. Use derscale for deriv or wavelength.
- scale_only = False Just scale data and stop
- phase_only = False Just scale data, phase, and stop
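The default target B for anisotropy correction described under b_iso, max_b_iso and target_b_ratio is a minimum over three quantities. A minimal Python sketch of that arithmetic follows; the function name default_target_b and its argument names are hypothetical, and the code only restates the formula given in the keyword descriptions.

```python
def default_target_b(resolution, dataset_b_values, b_iso=None,
                     max_b_iso=40.0, target_b_ratio=10.0):
    """Target overall B for anisotropy correction (illustrative only).

    resolution       : high-resolution limit of the data, in Angstroms
    dataset_b_values : overall B values of the input datasets
    b_iso            : if set, it overrides the default (max_b_iso and
                       target_b_ratio are then ignored)
    """
    if b_iso is not None:
        return float(b_iso)
    # minimum of (max_b_iso, lowest B of datasets, target_b_ratio*resolution)
    return min(max_b_iso, min(dataset_b_values), target_b_ratio * resolution)
```

For example, at 2.5 A with dataset B values of 55 and 48, the three candidates are 40, 48 and 25, so the default target B would be 25.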
- heavy_atom_search
- min_hyss_cc = 0.05 Minimum CC of a heavy-atom solution in HYSS to keep it at all
- acceptable_cc_hyss = 0.2 Solutions with CC better than acceptable_cc_hyss will not be rescored.
- good_cc_hyss = 0.3 Hyss will be run up to best_of_n_hyss_always times at a given resolution. If the best CC value is greater than good_cc_hyss and the number of sites found is at least min_fraction_of_sites_found times the number expected and Hyss was tried at least best_of_n_hyss times, then the search is ended. Also if thoroughness=quick and a solution with CC at least as high as good_cc_hyss is found, no more searches will be done at all
- n_add_res_max = 2 Hyss will be run at up to n_add_res_max+1 resolutions starting with res_hyss and adding increments of add_res_max/n_add_res_max. If the best CC value is greater than good_cc_hyss then no more resolutions are tried.
- add_res_max = 2 Hyss will be run at up to n_add_res_max+1 resolutions starting with res_hyss and adding increments of add_res_max/n_add_res_max. If the best CC value is greater than good_cc_hyss then no more resolutions are tried.
- try_recommended_resolution_for_hyss = True If True, then HYSS will be run at recommended_resolution (based on anomalous signal) in addition to the default resolution if the CC at the default resolution is less than good_cc_hyss and recommended_resolution is more than 0.1 A less than the default
- hyss_runs_min = 2 If there are multiple derivatives or candidate wavelengths for HYSS, run at least hyss_runs_min of these.
- best_of_n_hyss = 1 Hyss will be run up to best_of_n_hyss_always times at a given resolution. If the best CC value is greater than good_cc_hyss and the number of sites found is at least min_fraction_of_sites_found times the number expected and Hyss was tried at least best_of_n_hyss times, then the search is ended if hyss_runs_min data files have been attempted.
- best_of_n_hyss_always = 10 Hyss will be run up to best_of_n_hyss_always times at a given resolution. If the best CC value is greater than good_cc_hyss and the number of sites found is at least min_fraction_of_sites_found times the number expected and Hyss was tried at least best_of_n_hyss times, then the search is ended if hyss_runs_min data files have been attempted.
- min_fraction_of_sites_found = 0.667 Hyss will be run up to best_of_n_hyss_always times at a given resolution. If the best CC value is greater than good_cc_hyss and the number of sites found is at least min_fraction_of_sites_found times the number expected and Hyss was tried at least best_of_n_hyss times, then the search is ended if hyss_runs_min data files have been attempted.
- max_single_sites = 5 In sites_from_denmod a core set of sites that are strong is identified. If the hand of the solution is known then additional sites are added all at once up to the expected number of sites. Otherwise sites are added one at a time, up to a maximum number of tries of max_single_sites
- hyss_enable_early_termination = True You can specify whether to stop HYSS as soon as it finds a convincing solution (True, default) or to keep trying...
- hyss_general_positions_only = True Select True if you want HYSS only to consider general positions and ignore sites on special positions. This is appropriate for SeMet or S-Met solutions, not so appropriate for heavy-atom soaks
- hyss_min_distance = 3.5 Enter the minimum distance between heavy-atom sites to keep them in HYSS
- hyss_n_fragments = 3 Enter the number of fragments in HYSS
- hyss_n_patterson_vectors = 33 Enter the number of Patterson vectors to consider in HYSS
- hyss_random_seed = 792341 Enter an integer as random seed for HYSS
- res_hyss = None Overall resolution for running HYSS (usually default is fine)
- try_low_res_for_cys = True Use low-resolution hyss (4.5 A) if number of cys is greater than 2x the number of met and the anomalously-scattering atom is S
- direct_methods_only = False Use only direct methods (no phaser completion) in Hyss (applies to SAD/MAD data only)
- use_measurability = True Use measurability (from xtriage) to estimate recommended resolution for HYSS and for initial phasing. Only applies to MAD/SAD phasing. Alternative is to use signal-to-noise from Solve scaling.
- use_phaser_rescoring = False Run phaser rescoring for HYSS heavy-atom search (only SAD data) if initial try fails
- use_automatic_hyss = True Use automated hyss with direct methods/Phaser rescoring combination for SAD/MAD data
- mad_ha_n = None Normally not used. Use instead "sites" for a wavelength. Number of anomalously-scattering atoms in the au
- mad_ha_type = "Se" Normally not used. Use instead "atom_type" for a wavelength. Anomalously-scattering or heavy atom type. For example, Se or Au. NOTE: if you want Phaser to add additional heavy-atoms of other types, you can specify them with mad_ha_add_list.
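The resolution schedule set by res_hyss, n_add_res_max and add_res_max, and the stopping rule shared by good_cc_hyss, best_of_n_hyss, best_of_n_hyss_always, min_fraction_of_sites_found and hyss_runs_min, can be sketched in Python as below. This is an illustration of the keyword descriptions only, not the Wizard's implementation; both function names are hypothetical, and the way the conditions are combined is an assumption based on the wording above.

```python
def hyss_resolutions(res_hyss, n_add_res_max=2, add_res_max=2.0):
    """Resolutions (in Angstroms) at which HYSS may be run.

    Starts at res_hyss and adds increments of add_res_max/n_add_res_max,
    giving up to n_add_res_max+1 values.
    """
    if n_add_res_max <= 0:
        return [res_hyss]
    step = add_res_max / n_add_res_max
    return [res_hyss + i * step for i in range(n_add_res_max + 1)]


def search_is_finished(best_cc, sites_found, sites_expected, tries, files_tried,
                       good_cc_hyss=0.3, min_fraction_of_sites_found=0.667,
                       best_of_n_hyss=1, best_of_n_hyss_always=10,
                       hyss_runs_min=2):
    """Decide whether to stop searching at the current resolution."""
    # Hard cap on the number of HYSS runs at one resolution.
    if tries >= best_of_n_hyss_always:
        return True
    good_cc = best_cc > good_cc_hyss
    enough_sites = sites_found >= min_fraction_of_sites_found * sites_expected
    return (good_cc and enough_sites
            and tries >= best_of_n_hyss
            and files_tried >= hyss_runs_min)
```

With res_hyss=3.0 and the defaults shown, the schedule would be 3.0, 4.0 and 5.0 A. Later resolutions are skipped once the best CC exceeds good_cc_hyss.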
- phasing
- keep_atom_types = False Allows you to keep the atom types in your sites_file
- do_madbst = True Choose whether you want to carry out the FA calculation. Skipping it speeds up MAD phasing but may reduce the ability to find the sites with HYSS
- overallscale = False You can choose to have only an overall scale factor for this dataset (no local scaling applied). Use this if your data is already fully scaled.
- res_phase = 0 Enter the high-resolution limit for phasing (0= use all)
- phase_full_resolution = True You can choose to use the full resolution of the data in phasing, instead of using the recommended_resolution. This is always a good idea with Phaser phases.
- fixscattfactors = None For SOLVE phasing and MAD data you can choose whether scattering factors are to be fixed by choosing True to fix them or False to refine them. Normally choose True (fix) if the data are weak and False (refine) if the data are strong.
- fixscattfactors_in_phasing = False Fix scattering factors in phasing step. For SOLVE phasing and MAD data you can choose whether scattering factors are to be fixed by choosing True to fix them or False to refine them. Normally False. This command only applies to the phasing step and not initial heavy-atom refinement. It does not apply to Phaser SAD phasing.
- fix_xyz_in_phasing = None Fix coordinates in phasing step. For SOLVE phasing and MAD data you can choose whether ha coordinates are to be fixed by choosing True to fix them or False to refine them. May be useful in maintaining the coordinates of the solutions that were tested in initial phasing steps. If None, then it will be set to True if the resolution of the final phasing step is higher than the highest resolution of the test phasing runs. This command only applies to the phasing step and not initial heavy-atom refinement. It does not apply to Phaser SAD phasing.
- have_hand = False Normally you will not know the hand of the heavy-atom substructure, so have_hand=False. However if you do know it (you got the sites from a difference Fourier or you know the answer another way) you can specify that the hand is known.
- id_scale_ref = None By default the datafile with the highest resolution is used for the first step in scaling of MAD data. You can choose to use any of the datafiles in your MAD dataset. NOTE: not applicable for multi-dataset analyses
- ratio_out = 10. You can choose the ratio of del ano or del iso to the rms in the shell for rejection of a reflection. Default = 10.
- ratmin = 0. Reflections with I/sigI less than ratmin will be ignored when read in.
- require_nat = True Choose True to skip any reflection with no native (for SIR) or no data (MAD/SAD) or where the anomalous difference is very large. This keyword (default=True) allows the routines in SOLVE to remove reflections with an implausibly large anomalous difference (greater than ratio_out times the rms anomalous difference).
- ikeepflag = 1 You can choose to keep all reflections in merging steps. This is separate from rejecting reflections with high iso or ano diffs. Default=1 (keep them)
- phasing_method = SOLVE *PHASER You can choose to phase with SOLVE or with Phaser. (Only applies to SAD phasing at present)
- cluster_pdb_file = None For SAD only: if you have a cluster compound, normally you should supply a PDB file with an example of the cluster with the keyword cluster_pdb_file=my_cluster_compound and a unique residue name XX (yes, XX). Set the keyword atom_type=XX as well. If your cluster is Ta6Br12 then you can simply put atom_type=TX and skip the cluster_pdb_file. For MAD/MIR, cluster compounds are not currently supported. Instead just use a standard atom.
- input_partpdb_file = None You can enter a PDB file (usually from molecular replacement) for use in identifying heavy-atom sites and phasing. NOTE 1: This procedure works best if the model is refined. NOTE 2: This file is only used in SAD phasing with Phaser on a single dataset. In all other cases it is ignored. NOTE 3: The output phases in phaser_xx.mtz will contain both SAD and model information. They are not completely suitable for use with AutoBuild or other iterative model-building procedures because the phases are not entirely experimental (but they may work). Note that you can choose if this file is used to find sites only and then phase with those sites (phaser_sites_then_phase=True) or to phase with the model and also the sites. If you use phaser_sites_then_phase=True then you do not need to worry about bias from the model, but the phasing may not be as good.
- input_part_map_ha_fraction = 0.75 Use effective fo+f' value of input_part_map_ha_fraction for real part of anomalously-scattering atoms when part_map_coeffs are used. (Lower than 1.0 because the map will have some contribution corresponding to these atoms)
- phaser_sites_then_phase = False When using part_map_coeffs or partpdb_file or iterating heavy-atom searches, find sites with this partial model, then calculate phases without the partial model. This approach removes the bias from the model and keeps the real part of the scattering from the heavy atoms. If you use phaser_sites_then_phase=True you can use HLA HLB HLC HLD and not worry about HLanom values.
- partpdb_rms = None Estimate of rmsd of partial PDB file from correct structure. Default is 1A for partpdb_file and 1.5 for part_map_coeffs_file
- input_part_map_coeffs_file = None If you have done MR on density instead of on a model you can enter the output map coefficients file from Phaser MR instead of an input_partpdb_file. It will be used in the same way as an input_partpdb_file. These map coefficients must correspond to the placed density. The map coefficients must also be in the same unit cell as the input data. Note that partpdb_rms also applies to these map coefficients. It is an estimate of how close this density matches the true model. Note that you can choose if this file is used to find sites only and then phase with those sites (phaser_sites_then_phase=True) or to phase with the model and also the sites. If you use phaser_sites_then_phase=True then you do not need to worry about bias from the model, but the phasing may not be as good. NOTE: labels in this mtz file must be FC,PHIC unless they are set with input_part_map_coeffs_labels Note also that there is a similar keyword: input_phase_file that has a different result. The input_phase_file keyword uses a difference Fourier to find sites. Also if you use input_phase_file=xxx and you use ha_iteration=True then during iteration of density modification and finding sites the sites will be found with a difference Fourier.
- input_part_map_coeffs_labels = FC,PHIC Labels for input_part_map_coeffs. Normally FC,PHIC.
- llgc_sigma = None
- phaser_completion = True You can choose to use phaser log-likelihood gradients to complete your heavy-atom sites. This can be used with or without the ha_iteration option.
- use_phaser_hklstart = True You can choose to start density modification with FWT PHWT from Phaser (Only applies to SAD phasing at present)
- combine_same_parent_only = False You can choose to only combine solutions with the same parent (and that have a parent) in MIR, unless one solution is a composite. Compare with combine_siblings, in which case the solutions do not have to have the same immediate parents, but can be derived from the same ultimate parent through several difference Fourier steps.
- skip_extra_phasing = *Auto True False You can choose to skip an extra phasing step to speed up the process. If the extra step is used then the evaluation of solutions is done with data to res_eval (2.5 A) and then all the data are used in an extra phasing step. Only applicable to Phaser SAD phasing.
- read_sites = False Choose if you want to enter ha sites from a file. The name of the file will be requested after scaling is finished. The file can have sites in fractional coordinates or be a PDB file. Normally you do not need to set this. Set automatically if you specify a sites_file
- f_double_prime_list = None f-double-prime for the heavy-atom for this dataset. Normally not used. Use f_double_prime for wavelength or deriv
- f_prime_list = None f-prime for the heavy-atom for this dataset. Normally not used. Use f_prime for wavelength or deriv
- mad_ha_add_f_double_prime_list = None F-double_prime values of additional heavy-atom types. You must specify the same number of entries of mad_ha_add_f_double_prime_list as you do for mad_ha_add_f_prime_list and for mad_ha_add_list. Only use for Phaser SAD phasing with a single dataset
- mad_ha_add_f_prime_list = None F-prime values of additional heavy-atom types. You must specify the same number of entries of mad_ha_add_f_prime_list as you do for mad_ha_add_f_double_prime_list and for mad_ha_add_list. Only use for Phaser SAD phasing with a single dataset
- mad_ha_add_list = None You can specify heavy atom types in addition to the one you named in mad_ha_type. The heavy-atoms found in initial HySS searches will be given the type of mad_ha_type, and Phaser (if used for phasing) will try to find additional heavy atoms of both the type mad_ha_type and any listed in mad_ha_add_list. You must also specify the same number of mad_ha_add_f_prime_list entries and of mad_ha_add_f_double_prime_list entries. Only use for Phaser SAD phasing with a single dataset
- n_ha_list = None Enter a guess of the number of HA sites. Normally not used. Use sites in deriv instead
- nat_der_list = None Enter Native or a heavy-atom symbol (Pt, Se). Normally not used. Use atom_type in deriv instead
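The three keywords mad_ha_add_list, mad_ha_add_f_prime_list and mad_ha_add_f_double_prime_list must each have the same number of entries. A small Python sketch of that consistency check is shown below. The function name check_additional_atom_lists is hypothetical, and the numeric values in the usage note are placeholders rather than real scattering factors.

```python
def check_additional_atom_lists(mad_ha_add_list,
                                mad_ha_add_f_prime_list,
                                mad_ha_add_f_double_prime_list):
    """Pair each additional atom type with its f' and f'' values.

    Raises ValueError if the three lists differ in length, mirroring the
    requirement stated in the keyword descriptions.
    """
    n = len(mad_ha_add_list)
    if (len(mad_ha_add_f_prime_list) != n
            or len(mad_ha_add_f_double_prime_list) != n):
        raise ValueError(
            "mad_ha_add_list, mad_ha_add_f_prime_list and "
            "mad_ha_add_f_double_prime_list must have the same number of entries")
    return list(zip(mad_ha_add_list,
                    mad_ha_add_f_prime_list,
                    mad_ha_add_f_double_prime_list))
```

For instance, check_additional_atom_lists(["S", "Cl"], [0.1, 0.2], [0.3, 0.4]) pairs each atom type with its two values, while passing a single f' value for two atom types raises ValueError.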
- density_modification
- add_classic_denmod = Auto You can run classic density modification with solvent flipping after any other kind of density modification. Default is False if quick=True and extreme_dm=False
- skip_classic_if_worse_fom = True Skip results of add_classic_denmod if FOM gets worse during density modification
- skip_ncs_in_add_classic = True Skip using NCS in add_classic_denmod (speeds it up)
- tolerance_add_classic = 0.02 Take classic density modification if R-factor is not made worse by more than this
- fix_xyz = False You can choose to not refine coordinates, and instead to fix them to the values found by the heavy-atom search.
- fix_xyz_after_denmod = None When sites are found after density modification you can choose whether you want to fix the coordinates to the values found in that map.
- hl_in_resolve = False AutoSol normally does not write out HL coefficients in the resolve.mtz file with density-modified phases. You can turn them on with hl_in_resolve=True
- mask_type = *histograms probability wang classic Choose method for obtaining probability that a point is in the protein vs solvent region. Default is "histograms". If you have a SAD dataset with a heavy atom such as Pt or Au then you may wish to choose "wang" because the histogram method is sensitive to very high peaks. Options are: histograms: compare local rms of map and local skew of map to values from a model map and estimate probabilities. This one is usually the best. probability: compare local rms of map to distribution for all points in this map and estimate probabilities. In a few cases this one is much better than histograms. wang: take points with highest local rms and define as protein. Classic runs classical density modification with solvent flipping.
- test_mask_type = None You can choose to have AutoSol test histograms/wang methods for identifying solvent region and statistical vs classical density modification based on the final density modification r-factor.
- mask_cycles = 5 Number of mask cycles in density modification (5 is usual for thorough density modification)
- minor_cycles = 10 Number of minor cycles in density modification for each mask cycle (10 is usual for thorough density modification)
- thorough_denmod = None Choose whether you want thorough density modification (usual) or quick density modification (faster, and sometimes better for a terrible map)
- truncate_ha_sites_in_resolve = Auto *True False You can choose to truncate the density near heavy-atom sites at a maximum of 2.5 sigma. This is useful in cases where the heavy-atom sites are very strong, and rarely hurts in cases where they are not. The heavy-atom sites are specified with "input_ha_file" and radius is rad_mask
- rad_mask = None You can define the radius for calculation of the protein mask. Applies only to truncate_ha_sites_in_resolve. Default is resolution of data.
- use_ncs_in_denmod = True This script normally uses available ncs information in density modification. Set to False to skip this. See also find_ncs
- mask_as_mtz = False Defines how omit_output_mask_file ncs_output_mask_file and protein_output_mask_file are written out. If mask_as_mtz=False it will be a ccp4 map. If mask_as_mtz=True it will be an mtz file with map coefficients FP PHIM FOMM (all three required)
- protein_output_mask_file = None Name of map to be written out representing your protein (non-solvent) region. If mask_as_mtz=False the map will be a ccp4 map. If mask_as_mtz=True it will be an mtz file with map coefficients FP PHIM FOMM (all three required)
- ncs_output_mask_file = None Name of map to be written out representing your ncs asymmetric unit. If mask_as_mtz=False the map will be a ccp4 map. If mask_as_mtz=True it will be an mtz file with map coefficients FP PHIM FOMM (all three required)
- omit_output_mask_file = None Name of map to be written out representing your omit region. If mask_as_mtz=False the map will be a ccp4 map. If mask_as_mtz=True it will be an mtz file with map coefficients FP PHIM FOMM (all three required)
- use_hl_anom_in_denmod = None Default is False (use HL coefficients including model information in density modification). Allows you to specify that HL coefficients including only the phase information from the imaginary (anomalous difference) contribution from the anomalous scatterers are to be used in density modification. Two sets of HL coefficients are produced by Phaser. HLA HLB etc are HL coefficients including the contribution of both the real scattering and the anomalous differences. HLanomA HLanomB etc are HL coefficients including the contribution of the anomalous differences alone. These HL coefficients for anomalous differences alone are the ones that you will want to use in cases where you are bringing in model information that includes the real scattering from the model used in Phaser, such as when you are carrying out density modification with a model or refinement of a model. If use_hl_anom_in_denmod=True then the HLanom HL coefficients from Phaser are used in density modification.
- use_hl_anom_in_denmod_with_model = None Default is True if input_partpdb_file is included. (See also use_hl_anom_in_denmod) If use_hl_anom_in_denmod=True then the HLanom HL coefficients from Phaser (not including model information) are used in density modification with a model
- precondition = False Precondition density before modification
- model_building
- build = True Build model after density modification?
- phase_improve_and_build = True Carry out cycles of phase improvement with quick model-building followed by a full model-building step NOTE: This is now the standard model-building approach for AutoSol
- sort_hetatms = False Waters are automatically named with the chain of the closest macromolecule if you set sort_hetatms=True This is for the final model only.
- map_to_object = None You can supply a target position for your model with map_to_object=my_target.pdb. Then at the very end your molecule will be placed as close to this as possible. The center of mass of the autobuild model will be superimposed on the center of mass of my_target.pdb using space group symmetry, taking any match closer than 15 A within 3 unit cells of the original position. The new file will be overall_best_mapped.pdb
- helices_strands_only = False You can choose to use a quick model-building method that only builds secondary structure. At low resolution this may be both quicker and more accurate than trying to build the entire structure
- resolution_helices_strands = 3 Resolution to switch to helices_strands_only
- helices_strands_start = False You can choose to use a quick model-building method that builds secondary structure as a way to get started...then model completion is done as usual. (Contrast with helices_strands_only which only does secondary structure)
- cc_helix_min = None Minimum CC of helical density to map at low resolution when using helices_strands_only
- cc_strand_min = None Minimum CC of strand density to map when using helices_strands_only
- build_type = *RESOLVE RESOLVE_AND_BUCCANEER You can choose to build models with RESOLVE or with RESOLVE and BUCCANEER, and how many different models to build with RESOLVE. The more you build, the more likely you are to get a complete model. Note that rebuild_in_place can only be carried out with RESOLVE model-building. For BUCCANEER model building you need CCP4 version 6.1.2 or higher and BUCCANEER version 1.3.0 or higher
- resolve Parameters specific for RESOLVE model-building
- n_cycle_build = None Choose number of cycles (3).
- refine = True This script normally refines the model during building. Say False to skip refinement
- ncycle_refine = 3 Choose number of refinement cycles (3)
- number_of_builds = None Number of different solutions to build models for
- number_of_models = None This parameter lets you choose how many initial models to build with RESOLVE within a single build cycle.
- resolution_build = 0 Enter the high-resolution limit for model-building. If 0.0, the value of resolution is used as a default.
- fit_loops = True You can fit loops automatically if sequence alignment has been done.
- loop_cc_min = 0.4 You can specify the minimum correlation of density from a loop with the map.
- group_ca_length = 4 In resolve building you can specify how short a fragment to keep. Normally 4 or 5 residues should be the minimum.
- group_length = 2 In resolve building you can specify how many fragments must be joined to make a connected group that is kept. Normally 2 fragments should be the minimum.
- input_compare_file = None If you are rebuilding a model or already think you know what the model should be, you can include a comparison file in rebuilding. The model is not used for anything except to write out information on coordinate differences in the output log files. NOTE: this feature does not always work correctly.
- n_random_frag = 0 In resolve building you can randomize each fragment slightly so as to generate more possibilities for tracing based on extending it.
- n_random_loop = 3 Number of randomized tries from each end for building loops. If 0, then one try. If N, then N additional tries with randomization based on rms_random_loop.
- offsets_list = 53 7 23 You can specify an offset for the orientation of the helix and strand templates in building. This is used in generating different starting models.
- remove_outlier_segments_z_cut = 3.0 You can remove any segments that are not assigned to sequence during model-building if the mean density at atomic positions are more than remove_outlier_segments_z_cut sd lower than the mean for the structure.
- resolve_command_list = None Commands for resolve. One per line in the form: keyword value, where value can be optional. Examples: coarse_grid resolution 200 2.0 hklin test.mtz NOTE: for command-line usage you need to enclose the whole set of commands in double quotes (") and each individual command in single quotes (') like this: resolve_command_list="'no_build' 'b_overall 23' "
- solve_command_list = None Commands for solve. One per line in the form: keyword value, where value can be optional Examples: verbose resolution 200 2.0 For specification from command_line enclose each command and value in quotes, and then use a different type of quotes to enclose all of them (same as resolve_command_list)
- rms_random_frag = None Rms random position change added to residues on ends of fragments when extending them. If you enter a negative number, defaults will be used.
- rms_random_loop = None Rms random position change added to residues on ends of loops in tries for building loops. If you enter a negative number, defaults will be used.
- semet = None You can specify that the dataset that is used for refinement is a selenomethionine dataset, and that the model should be the SeMet version of the protein, with all SD of MET replaced with Se of MSE. By default if your heavy-atom is Se then this will be set to True
- use_met_in_align = Auto *True False You can use the heavy-atom positions in input_ha_file as markers for Met SD positions.
- start_chains_list = None You can specify the starting residue number for each of the unique chains in your structure. If you use a sequence file then the unique chains are extracted and the order must match the order of your starting residue numbers. For example, if your sequence file has chains A and B (identical) and chains C and D (identical to each other, but different than A and B) then you can enter 2 numbers, the starting residues for chains A and C. NOTE: you need to specify an input sequence file for start_chains_list to be applied.
- thorough_loop_fit = None Try many conformations and accept them even if the fit is not perfect. If you set this to True, the parameters for thorough loop fitting are: n_random_loop=100 rms_random_loop=0.3 rho_min_main=0.5, while if you set it to False, those for quick loop fitting are: n_random_loop=20 rms_random_loop=0.3 rho_min_main=1.0
- trace_as_lig = False You can specify that in building steps the ends of chains are to be extended using the LigandFit algorithm. This is default for nucleic acid model-building.
- use_any_side = False You can choose to have resolve model-building place the best-fitting side chain at each position, even if the sequence is not matched to the map.
- loop_lib = False Use loop library to fit loops. Only applicable for chain_type=PROTEIN
- standard_loops = True Use standard loop fitting
- trace_loops = False Use loop tracing to fit loops. Only applicable for chain_type=PROTEIN
- refine_trace_loops = True Refine loops (real-space) after trace_loops
- density_of_points = None Packing density of points to consider as possible CA atoms in trace_loops. Try 1.0 for a quick run, up to 5 for a much more thorough run. If None, try a value depending on the value of quick.
- max_density_of_points = None Maximum packing density of points to consider as possible CA atoms in trace_loops.
- cutout_model_radius = None Radius to cut out density for trace_loops. If None, guess based on length of loop
- max_cutout_model_radius = 20. Maximum value of cutout_model_radius to try
- padding = 1. Padding for cut out density in trace_loops
- max_span = 30 Maximum length of a gap to try to fill
- max_overlap = None Maximum number of residues from ends to start with. (1=use existing ends, 2=one in from ends etc) If None, set based on value of quick.
- min_overlap = None Minimum number of residues from ends to start with. (1=use existing ends, 2=one in from ends etc)
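The quoting convention described for resolve_command_list and solve_command_list (each command in single quotes, the whole set in double quotes) is easy to get wrong by hand. The Python sketch below assembles such an argument from a list of commands; the helper name format_command_list is hypothetical and is not part of PHENIX.

```python
def format_command_list(keyword, commands):
    """Build a command-line argument such as
    resolve_command_list="'no_build' 'b_overall 23'"

    keyword  : "resolve_command_list" or "solve_command_list"
    commands : list of strings, one "keyword value" command each
    """
    for command in commands:
        # Embedded quotes would break the nested quoting scheme.
        if "'" in command or '"' in command:
            raise ValueError("commands must not contain quote characters: %r"
                             % command)
    quoted = " ".join("'%s'" % command for command in commands)
    return '%s="%s"' % (keyword, quoted)
```

Calling format_command_list("resolve_command_list", ["no_build", "b_overall 23"]) produces the same form as the example given in the resolve_command_list description.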
- ncs
- find_ncs = Auto *True False The wizard normally deduces ncs information from the NCS in heavy atom sites, and then later from any NCS in chains of models that are built during model-building. The update is done each cycle in which an improved model is obtained. Set to False to skip this update.
- ncs_copies = None Number of copies of the molecule in the au (note: only one type of molecule allowed at present)
- ncs_refine_coord_sigma_from_rmsd = False You can choose to use the current NCS rmsd as the value of the sigma for NCS restraints. See also ncs_refine_coord_sigma_from_rmsd_ratio
- ncs_refine_coord_sigma_from_rmsd_ratio = 1 You can choose to multiply the current NCS rmsd by this value before using it as the sigma for NCS restraints See also ncs_refine_coord_sigma_from_rmsd
- optimize_ncs = True This script normally deduces ncs information from the NCS in chains of models that are built during iterative model-building. Optimize NCS adds a step to try and make the molecule formed by NCS as compact as possible, without losing any point-group symmetry.
- refine_with_ncs = True This script can allow phenix.refine to automatically identify NCS and use it in refinement.
- ncs_in_refinement = *torsion cartesian None Use torsion_angle refinement of NCS. Alternative is cartesian or None (None will use phenix.refine default)
- refinement
- refine_b = True You can choose whether phenix.refine is to refine individual atomic displacement parameters (B values)
- refine_se_occ = True You can choose to refine the occupancy of SE atoms in a SEMET structure (default=True). This only applies if semet=true
- skip_clash_guard = True Skip refinement check for atoms that clash
- correct_special_position_tolerance = None Adjust tolerance for special position check. If 0., then check for clashes near special positions is not carried out. This sometimes allows phenix.refine to continue even if an atom is near a special position. If 1., then checks within 1 A of special positions. If None, then uses phenix.refine default. (1)
- use_mlhl = True This script normally uses information from the input file (HLA HLB HLC HLD) in refinement. Say No to only refine on Fobs
- generate_hl_if_missing = False This script normally uses information from the input file (HLA HLB HLC HLD) in refinement. Set to True to generate HL coefficients from the input phases when they are missing; False (the default) does not generate them.
- place_waters = True You can choose whether phenix.refine automatically places ordered solvent (waters) during the refinement process.
- refinement_resolution = 0 Enter the high-resolution limit for refinement only. This high-resolution limit can be different than the high-resolution limit for other steps. The default ("None" or 0.0) is to use the overall high-resolution limit for this run (as set by resolution)
- ordered_solvent_low_resolution = None You can choose what resolution cutoff to use for placing ordered solvent in phenix.refine. If the resolution of refinement is greater than this cutoff, then no ordered solvent will be placed, even if refinement.main.ordered_solvent=True.
- link_distance_cutoff = 3 You can specify the maximum bond distance for linking residues in phenix.refine called from the wizards.
- r_free_flags_fraction = 0.1 Maximum fraction of reflections in the free R set. You can choose the maximum fraction of reflections in the free R set and the maximum number of reflections in the free R set. The number of reflections in the free R set will be up to the lower of the values defined by these two parameters.
- r_free_flags_max_free = 2000 Maximum number of reflections in the free R set. You can choose the maximum fraction of reflections in the free R set and the maximum number of reflections in the free R set. The number of reflections in the free R set will be up to the lower of the values defined by these two parameters.
- r_free_flags_use_lattice_symmetry = True When generating r_free_flags you can decide whether to include lattice symmetry (good in general, necessary if there is twinning).
- r_free_flags_lattice_symmetry_max_delta = 5 You can set the maximum deviation of distances in the lattice that are to be considered the same for purposes of generating a lattice-symmetry-unique set of free R flags.
- allow_overlapping = None Default is None (set automatically, normally False unless S or Se atoms are the anomalously-scattering atoms). You can allow atoms in your ligand files to overlap atoms in your protein/nucleic acid model. This overrides 'keep_pdb_atoms'. Useful in early stages of model-building and refinement. The ligand atoms get the altloc indicator 'L'. NOTE: The ligand occupancy will be refined by default if you set allow_overlapping=True (because of the altloc indicator). You can turn this off with fix_ligand_occupancy=True
- fix_ligand_occupancy = None If allow_overlapping=True then ligand occupancies are refined as a group. You can turn this off with fix_ligand_occupancy=true NOTE: has no effect if allow_overlapping=False
- remove_outlier_segments = True You can remove any segments that are not assigned to sequence if their mean B values are more than remove_outlier_segments_z_cut sd higher than the mean for the structure. NOTE: this is done after refinement, so the R/Rfree are no longer applicable; the remarks in the PDB file are removed
- twin_law = None You can specify a twin law for refinement like this: twin_law='-h,k,-l'
- use_hl_anom_in_refinement = None Default is True if input_partpdb_file is used (See also use_hl_anom_in_denmod). If use_hl_anom_in_refinement=True then the HLanom HL coefficients from Phaser (not including model information) are used in refinement
- include_ha_in_refinement = None You can choose to include your heavy-atom sites in the model for refinement. This is a good idea if your structure includes these heavy-atom sites (i.e., for SAD or MAD structures where you are not using a native dataset). Heavy-atom sites that overlap an atom in your model will be ignored. Default is True unless the dataset is SAD/MAD with Se or S
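As an illustration of how r_free_flags_fraction and r_free_flags_max_free combine, the upper limit on the size of the free R set can be sketched as below. The function name is hypothetical and the rounding (truncation to an integer) is an assumption; the actual flag generation in Phenix may round or select reflections differently.

```python
def free_set_upper_limit(n_reflections, fraction=0.1, max_free=2000):
    """Upper limit on the number of free R reflections.

    fraction and max_free mirror r_free_flags_fraction and
    r_free_flags_max_free; the limit is the lower of the two caps.
    Truncating fraction * n_reflections to an integer is an assumption.
    """
    limit_from_fraction = int(n_reflections * fraction)
    return min(limit_from_fraction, max_free)


if __name__ == "__main__":
    # Small dataset: the fraction is the limiting value.
    print(free_set_upper_limit(5000))
    # Large dataset: the absolute maximum is the limiting value.
    print(free_set_upper_limit(50000))
```

With the defaults shown in the list above, datasets with more than 20000 reflections are limited by r_free_flags_max_free rather than by r_free_flags_fraction.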
- display
- number_of_solutions_to_display = 1 Number of solutions to put on screen and to write out
- solution_to_display = 0 Solution number of the solution to display and write out ( use 0 to let the wizard display the top solution)
- general
- data_quality = *moderate strong weak The defaults are set for you depending on the anticipated data quality. You can choose "moderate" if you are unsure. NOTE: if best FOM of phasing is less than fom_for_extreme_dm data_quality will be automatically reset to weak.
- thoroughness = *quick medium thorough You can try to run quickly and see if you can get a solution ("quick") or more thoroughly to get the best possible solution ("thorough"). NOTE: if best FOM of phasing is less than fom_for_extreme_dm thoroughness will be automatically reset to thorough.
- nproc = 1 Normally you may want to set nproc to the number of processors on your machine so HySS can use them all. You can specify the number of processors to use (nproc) and the number of batches to divide the data into for parallel jobs. Normally you will set nproc to the number of processors available and leave nbatch alone. If you leave nbatch as None it will be set automatically, with a value depending on the Wizard. This is recommended. The value of nbatch can affect the results that you get, as the jobs are not split into exact replicates, but are rather run with different random numbers. If you want to get the same results, keep the same value of nbatch.
- nbatch = 1 You can specify the number of processors to use (nproc) and the number of batches to divide the data into for parallel jobs. Normally you will set nproc to the number of processors available and leave nbatch alone. If you leave nbatch as None it will be set automatically, with a value depending on the Wizard. This is recommended. The value of nbatch can affect the results that you get, as the jobs are not split into exact replicates, but are rather run with different random numbers. If you want to get the same results, keep the same value of nbatch.
- keep_files = overall_best* phaser_*.mtz resolve_*.mtz solve_*.mtz ha_*.pdb dataset*.log List of files that are not to be cleaned up. Wildcards are permitted.
- coot_name = "coot" If your version of coot is called something else, then you can specify that here.
- i_ran_seed = 72432 Random seed (positive integer) for model-building and simulated annealing refinement
- raise_sorry = False You can have any failure end with a Sorry instead of simply printing a message to the screen
- background = True When you specify nproc=nn, you can run the jobs in background (default if nproc is greater than 1) or foreground (default if nproc=1). If you set run_command=qsub (or otherwise submit to a batch queue), then you should set background=False, so that the batch queue can keep track of your runs. There is no need to use background=True in this case because all the runs go as controlled by your batch system. If you use run_command='sh ' (or similar, sh is default) then normally you will use background=True so that all the jobs run simultaneously.
- check_wait_time = 1.0 You can specify the length of time (seconds) to wait between checking for subprocesses to end
- max_wait_time = 1.0 You can specify the length of time (seconds) to wait when looking for a file. If you have a cluster where jobs do not start right away you may need a longer time to wait. The symptom of too short a wait time is 'File not found'
- wait_between_submit_time = 1.0 You can specify the length of time (seconds) to wait between each job that is submitted when running sub-processes. This can be helpful on NFS-mounted systems when running with multiple processors to avoid file conflicts. The symptom of too short a wait_between_submit_time is File exists:....
- cache_resolve_libs = True Use caching of resolve libraries to speed up resolve
- resolve_size = 12 Size for solve/resolve ("", "_giant", "_huge", "_extra_huge", or a number, where 12=giant and 18=huge)
- check_run_command = False You can have the wizard check your run command at startup
- run_command = "sh " When you specify nproc=nn, you can run the subprocesses as jobs in background with sh (default) or submit them to a queue with the command of your choice (e.g., qsub). If you have a multi-processor machine, use sh. If you have a cluster, use qsub or the equivalent command for your system. NOTE: If you set run_command=qsub (or otherwise submit to a batch queue), then you should set background=False, so that the batch queue can keep track of your runs. There is no need to use background=True in this case because all the runs go as controlled by your batch system. If nproc is greater than 1 and you use run_command='sh ' (or similar, sh is default) then normally you will use background=True so that all the jobs run simultaneously.
- queue_commands = None You can add any commands that need to be run for your queueing system. These are written before any other commands in the file that is submitted to your queueing system. For example on a PBS system you might say: queue_commands='#PBS -N mr_rosetta' queue_commands='#PBS -j oe' queue_commands='#PBS -l walltime=03:00:00' queue_commands='#PBS -l nodes=1:ppn=1' NOTE: you can put in the characters '<path>' in any queue_commands line and this will be replaced by a string of characters based on the path to the run directory. The first character and last two characters of each part of the path will be included, separated by '_',up to 15 characters. For example 'test_autobuild/WORK_5/AutoBuild_run_1_/TEMP0/RUN_1' would be represented by: 'tld_W_5_A1__TP0_1'
- condor_universe = vanilla The universe for condor is usually vanilla. However you might need to set it to local for your cluster
- add_double_quotes_in_condor = True You might need to turn on or off double quotes in condor job submission scripts. These are already default elsewhere but may interfere with condor paths.
- condor = None Specifies if the group_run_command is submitting a job to a condor cluster. Set by default to True if group_run_command=condor_submit, otherwise False. For condor job submission mr_rosetta uses a customized script with condor commands. Also uses one_subprocess_level=True
- last_process_is_local = True If true, run the last process in a group in background with sh as part of the job that is submitting jobs. This prevents having the job that is submitting jobs sit and wait for all the others while doing nothing
- skip_r_factor = False You can skip R-factor calculation if refinement is not done and maps_only=True
- test_flag_value = Auto Normally leave this at Auto (default). This parameter sets the value of the test set that is to be free. Normally phenix sets up test sets with values of 0 and 1 with 1 as the free set. The CCP4 convention is values of 0 through 19 with 0 as the free set. Either of these is recognized by default in Phenix. If you have any other convention (for example values of 0 to 19 and test set is 1) then you can specify this with test_flag_value.
- skip_xtriage = False You can bypass xtriage if you want. This will prevent you from applying anisotropy corrections, however.
- base_path = None You can specify the base path for files (default is current working directory)
- temp_dir = None Define a temporary directory (it must exist)
- clean_up = None At the end of the entire run the TEMP directories will be removed if clean_up is True. Files listed in keep_files will not be deleted. If you want to remove files after your run is finished use a command like "phenix.autobuild run=1 clean_up=True"
- print_citations = True Print citations at end of run
- solution_output_pickle_file = None At end of run, write solutions to this file in output directory if defined
- job_title = None Job title in PHENIX GUI, not used on command line
- top_output_dir = None This is used in subprocess calls of wizards and to tell the Wizard where to look for the STOPWIZARD file.
- wizard_directory_number = None This is used by the GUI to define the run number for Wizards. It is the same as desired_run_number. NOTE: this value can only be specified on the command line, as the directory number is set before parameters files are read.
- verbose = False Command files and other verbose output will be printed
- extra_verbose = False Facts and possible commands will be printed every cycle if True
- debug = False You can have the wizard stop with error messages about the code if you use debug. Additionally the output goes to the terminal if you specify "debug=True"
- require_nonzero = True Require non-zero values in data columns to consider reading in.
- remove_path_word_list = None List of words identifying paths to remove from PATH. These can be used to shorten your PATH. For example, "cns ccp4 coot" would remove all paths containing these words except those also containing phenix. Capitalization is ignored.
- fill = False Fill in all missing reflections to resolution res_fill. Applies to density modified maps. See also filled_2fofc_maps in autobuild.
- res_fill = None Resolution for filling in missing data (default = highest resolution of any datafile). Only applies to density modified maps. Default is fill to high resolution of data. Ignored if fill=False
- check_only = False Just read in and check initial parameters. Not for general use
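The two free-set conventions described under test_flag_value can be illustrated with a minimal sketch. This is not the Phenix detection code: the function name is hypothetical, and the rule used here (only 0 and 1 present means the Phenix convention, otherwise values within 0 through 19 mean the CCP4 convention) is an assumption that will misread a CCP4-style file that happens to contain only the values 0 and 1.

```python
def guess_test_flag_value(flag_values, test_flag_value=None):
    """Guess which flag value marks the free (test) set.

    flag_values: iterable of integer flags read from the data file.
    test_flag_value: explicit user choice; None stands in for Auto.
    """
    if test_flag_value is not None:
        # An explicit setting always wins over guessing.
        return test_flag_value
    present = set(flag_values)
    if present <= {0, 1}:
        # Phenix convention: values 0 and 1, with 1 as the free set.
        return 1
    if present <= set(range(20)):
        # CCP4 convention: values 0 through 19, with 0 as the free set.
        return 0
    raise ValueError("Unrecognized flag convention; set test_flag_value")


if __name__ == "__main__":
    print(guess_test_flag_value([0, 0, 1, 0, 1]))
    print(guess_test_flag_value(range(20)))
    print(guess_test_flag_value(range(20), test_flag_value=1))
```

The last call corresponds to the case mentioned in the text, where the values run from 0 to 19 but the test set is 1, so the value has to be given explicitly.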
- run_control
- coot = None Set coot to True and optionally run=[run-number] to run Coot with the current model and map for run run-number. In some wizards (AutoBuild) you can edit the model and give it back to PHENIX to use as part of the model-building process. If you just say coot then the facts for the highest-numbered existing run will be shown.
- ignore_blanks = None ignore_blanks allows you to have a command-line keyword with a blank value like "input_lig_file_list="
- stop = None You can stop the current wizard with "stopwizard" or "stop". If you type "phenix.autobuild run=3 stop" then this will stop run 3 of autobuild.
- display_facts = None Set display_facts to True and optionally run=[run-number] to display the facts for run run-number. If you just say display_facts then the facts for the highest-numbered existing run will be shown.
- display_summary = None Set display_summary to True and optionally run=[run-number] to show the summary for run run-number. If you just say display_summary then the summary for the highest-numbered existing run will be shown.
- carry_on = None Set carry_on to True to carry on with highest-numbered run from where you left off.
- run = None Set run to n to continue with run n where you left off.
- copy_run = None Set copy_run to n to copy run n to a new run and continue where you left off.
- display_runs = None List all runs for this wizard.
- delete_runs = None List runs to delete: 1 2 3-5 9:12
- display_labels = None display_labels=test.mtz will list all the labels that identify data in test.mtz. You can use the label strings that are produced in AutoSol to identify which data to use from a datafile like this: peak.data="F+ SIGF+ F- SIGF-". The entire string in quotes counts here. You can use the individual labels from these strings as identifiers for data columns in AutoSol or AutoBuild like this: input_refinement_labels="FP SIGFP FreeR_flags" # each individual label counts
- dry_run = False Just read in and check parameter names
- params_only = False Just read in and return parameter defaults. Not for general use
- display_all = False Just read in and display parameter defaults
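The delete_runs example "1 2 3-5 9:12" mixes single run numbers with ranges. A minimal sketch of how such a string could be expanded into individual run numbers is shown below. The function name is hypothetical, and treating both "-" and ":" as inclusive ranges is an assumption based only on that example, not on the Wizard's parser.

```python
import re


def expand_run_list(spec):
    """Expand a run list such as "1 2 3-5 9:12" into run numbers.

    Assumes "a-b" and "a:b" both mean the inclusive range a..b.
    """
    runs = []
    for token in spec.split():
        parts = re.split(r"[-:]", token)
        if len(parts) == 1:
            # A single run number.
            runs.append(int(parts[0]))
        elif len(parts) == 2:
            first, last = int(parts[0]), int(parts[1])
            if last < first:
                raise ValueError("Descending range: %s" % token)
            runs.extend(range(first, last + 1))
        else:
            raise ValueError("Cannot interpret: %s" % token)
    return runs


if __name__ == "__main__":
    print(expand_run_list("1 2 3-5 9:12"))
```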
- special_keywords
- write_run_directory_to_file = None Writes the full name of a run directory to the specified file. This can be used as a call-back to tell a script where the output is going to go.
- non_user_parameters These are obsolete parameters and parameters that the wizards use to communicate among themselves. Not normally for general use.
- gui_output_dir = None Used only by the GUI
- allow_negative_f_double_prime = False Allow a negative f-double-prime value
- inano_list = None Choose inano to include anomalous differences, noinano not to include them, and anoonly for just anomalous differences (no isomorphous differences). Not normally used. Use inano in deriv instead
- ha_sites_file = None Not normally used. Use sites_file for wavelength or deriv
- expt_type = *Auto mad sir sad Not normally used. Determined automatically from your inputs for wavelength and native/deriv. Experiment type (MAD SIR SAD) NOTE: Please treat MIR experiments as a set of SIR experiments. NOTE: The default for this keyword is Auto which means "carry out normal process to guess this keyword". If you have a single file, then it is assumed to be SAD. If you specify native.data and deriv.data it is SIR, if you specify peak.data and infl.data it is MAD. If the Wizard does not guess correctly, you can set it with this keyword.
- wavelength_list = None Optional wavelength of x-ray data (A). Not normally used. Use wavelength/deriv and lambda instead
- wavelength_name_list = None Names of wavelengths. Not normally used. Use wavelength/deriv and name instead
- sg = None Obsolete. Use space_group instead