About Phenix-supported file formats

This document provides an introduction to the types of files used by (or generated by) Phenix. Many of these formats are also used by other software, but we have emphasized the most popular file types.

We do not recommend editing any of these file types by hand. Some may be human-readable (and log files, a special case, are intended almost entirely for this purpose), but in most cases the format is strictly defined and careless editing may lead to errors.

Model files

Models are represented primarily as a series of records for each atom, plus associated metadata such as crystal symmetry. The most common format in macromolecular crystallography is the PDB format, which might look like this:

ATOM      1  N  AILE A   1       9.968  51.572  -0.800  1.00 20.69           N
ATOM      2  CA BILE A   1       8.566  51.700  -0.272  1.00 21.30           C
HETATM 2545 ZN    ZN A1317      36.762  44.115  -7.277  1.00 15.32          ZN
HETATM 2557  O   HOH A2001      10.269  53.403  -3.264  1.00 27.45           O

Implicit in this information is a hierarchy of chains (with ID "A" in this case), residues (denoted by unique numbers), and atoms. Each residue may have multiple conformations (such as the isoleucine residue above, which has "A" and "B" conformers). The individual atom records have columns for the XYZ coordinates, occupancy, isotropic B-factor, and element symbol. (They also include a "serial number" - the first numerical column - but this is not usually used in Phenix.) If the structure is refined with anisotropic displacements (either for individual atoms, or TLS parameters), relevant atoms may be followed by an ANISOU record. Note that atoms may be denoted by either ATOM (for standard amino acid and nucleic acid residues) or HETATM (for waters, ions, ligands, and non-standard polymer residues). Most programs do not distinguish between these record types.
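The column layout described above can be illustrated with a short Python sketch that slices one ATOM or HETATM line by its fixed column positions. This is only an illustration under the standard PDB column conventions, not how Phenix itself reads models; the names AtomRecord and parse_atom_line are hypothetical.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Fields of one ATOM/HETATM line (hypothetical helper type)."""
    record_type: str   # "ATOM" or "HETATM"
    serial: int        # serial number (rarely used in Phenix)
    name: str          # atom name, e.g. "CA"
    altloc: str        # conformer label, "" if none
    resname: str
    chain_id: str
    resseq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_iso: float       # isotropic B-factor
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice a single ATOM/HETATM line by fixed column positions."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return AtomRecord(
        record_type=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        altloc=line[16].strip(),
        resname=line[17:20].strip(),
        chain_id=line[21].strip(),
        resseq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_iso=float(line[60:66]),
        element=line[76:78].strip(),
    )


# The zinc ion from the example above.
example = "HETATM 2545 ZN    ZN A1317      36.762  44.115  -7.277  1.00 15.32          ZN"
atom = parse_atom_line(example)
print(atom.resname, atom.chain_id, atom.resseq, atom.element)
```

Because the format is column-based rather than whitespace-delimited, splitting on spaces would fail for lines such as this one, where the chain ID and residue number run together ("A1317").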

Besides atom information, the PDB format also usually includes a CRYST1 record specifying the unit cell and space group, and some number of REMARK records containing metadata such as refinement statistics, TLS matrices, and information about program use. The crystal symmetry is important for many applications, but the REMARK section is usually ignored (except for deposition with the PDB).
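As a rough illustration of how the crystal symmetry is encoded, the following Python sketch extracts the unit cell and space group from a CRYST1 record using its fixed columns. The function name parse_cryst1 is hypothetical, and the example line uses made-up illustrative values rather than data from a real structure.

```python
def parse_cryst1(line: str):
    """Return ((a, b, c, alpha, beta, gamma), space_group) from a CRYST1 line."""
    if not line.startswith("CRYST1"):
        raise ValueError("not a CRYST1 record")
    line = line.ljust(70)
    # Cell lengths occupy three 9-character fields, angles three 7-character fields.
    fields = ((6, 15), (15, 24), (24, 33), (33, 40), (40, 47), (47, 54))
    cell = tuple(float(line[start:end]) for start, end in fields)
    space_group = line[55:66].strip()
    return cell, space_group


# Illustrative values only (an assumption, not taken from this document).
cryst1 = "CRYST1   52.000   58.600   61.900  90.00  90.00  90.00 P 21 21 21    8"
cell, space_group = parse_cryst1(cryst1)
print(cell, space_group)
```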

If you need to edit a PDB file (other than rebuilding the model in Coot), we recommend that you use either PDBTools, a non-interactive tool available both on the command line and the graphical interface, or phenix.pdb_editor, an interactive graphical editor. Editing by hand is strongly discouraged and often leads to program errors.

In some programs in Phenix (in particular phenix.refine), support is also provided for reading and writing mmCIF files. This is a variation on CIF format (described in detail below) specific to macromolecular models and data. mmCIF is considerably more complicated than PDB format, and less human-readable, but it is also much more flexible and better-defined. In the future, all PDB depositions will be done using mmCIF format, but for now PDB files remain the primary model format during the structure determination process.

See the wwPDB site for full documentation on the PDB format and mmCIF.

Reflection files

Experimental data may be stored in a variety of formats. Some of these may include additional information, described below. In each format, a list of "Miller indices" (h, k, l) is associated with one or more data fields, but the exact contents vary widely.

Currently, the best-supported format in Phenix and other macromolecular crystallography software is MTZ format, which was developed as part of the CCP4 project. These are binary files (not human-readable), which makes them especially compact and fast to read. MTZ files may contain any number and combination of experimental datasets, R-free flags, phases, weights, or map coefficients. Each field has a unique column label; in Phenix, these are usually grouped together. For instance, "F(+),SIGF(+),F(-),SIGF(-)" groups four columns from an MTZ file specifying the Friedel mates for experimental amplitudes and their estimated errors, and "2FOFCWT,PH2FOFCWT" specifies the amplitudes and phases for a 2mFo-DFc map. MTZ files will always explicitly specify the crystal symmetry, and include a field for the wavelength of each dataset (although this may not always be reliable).

In Phenix, every program that reads or writes reflection data supports MTZ format; these files should usually be compatible with other software such as CCP4 or Coot. (Note, however, that the wavelength written by Phenix programs may be incorrect; this field is not widely used.) Nearly any MTZ file is suitable as input, except for the raw output of Mosflm, which contains intensities that have not yet been scaled.

Several other formats are supported in various contexts. Some of these differ in whether individual observations of each unique reflection have been merged or not (see the separate documentation on using unmerged data in Phenix for more details), but all of these should contain scaled data. Only CNS format can include R-free flags; the rest are usually the output of data processing programs and require additional processing to be useful, and will ultimately be converted to MTZ format.

CNS format is text, and is the primary input and output format of the program of the same name. It may include any combination of data, but usually this will be amplitudes or intensities plus R-free flags. A variety of file extensions may be used, including .hkl, .fobs, or .cv.

Scalepack format is a text format output by the data processing software HKL2000. It only includes experimental intensities, with or without Friedel pairs. Crystal symmetry is explicitly supplied. These almost always use the extension .sca.

Unmerged Scalepack format is also produced by HKL2000, but includes unmerged intensities and additional data processing information. This is used by several programs (including SOLVE), but it has the disadvantage of lacking the unit cell parameters.

SHELX format is an even simpler text format. We recommend avoiding it if possible, because it does not explicitly specify either the crystal symmetry or the data type (amplitudes or intensities). These files also tend to have the extension .hkl.

XDS format is also text, specifying experimental intensities and associated information from data processing. XDS actually outputs several different types of files, but the most appropriate will nearly always be named "XDS_ASCII.HKL". These files contain crystal symmetry information and may include either merged or unmerged data.

If you need to edit or combine reflection files, Phenix includes a graphical editor which supports all of these formats. However, output is limited to MTZ. Visualization of reflection data is provided by phenix.data_viewer.

CIF files

CIF (Crystallographic Information File) is a general-purpose machine-readable syntax for storing any type of crystallographic information. The contents of a CIF file are defined by specific "tags", many of which are officially defined by a committee. A typical CIF file might look like this (simplified from the PDB's chemical components database):

data_ALA
#
_chem_comp.id                                    ALA
_chem_comp.name                                  ALANINE
_chem_comp.formula                               "C3 H7 N O2"
loop_
_chem_comp_bond.comp_id
_chem_comp_bond.atom_id_1
_chem_comp_bond.atom_id_2
_chem_comp_bond.value_order
ALA N   CA  SING
ALA N   H   SING
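To make the structure of this example concrete, here is a minimal Python sketch that reads only the subset of CIF syntax shown above: a data_ block name, single "tag value" pairs, and loop_ tables. It is not a general CIF parser and is not the reader used by Phenix; the function name read_simple_cif is hypothetical, and features such as multi-line values or values containing apostrophes are deliberately not handled.

```python
import shlex


def read_simple_cif(text: str):
    """Return (block_name, items, loops) for a very simple CIF file.

    items maps each single-valued tag to its value; loops is a list of
    (tags, rows) pairs, where each row is a dict keyed by tag.
    """
    block_name = None
    items = {}
    loops = []
    loop_tags = None
    in_loop_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block_name = line[len("data_"):]
        elif line == "loop_":
            loop_tags, rows = [], []
            loops.append((loop_tags, rows))
            in_loop_header = True
        elif line.startswith("_"):
            if in_loop_header:
                loop_tags.append(line)  # column name of the current loop
            else:
                tag, value = line.split(None, 1)
                items[tag] = shlex.split(value)[0]  # strips the quotes
        else:
            if loop_tags is None:
                raise ValueError("data row outside of a loop: " + line)
            in_loop_header = False
            loops[-1][1].append(dict(zip(loop_tags, shlex.split(line))))
    return block_name, items, loops


cif_text = '''data_ALA
#
_chem_comp.id                                    ALA
_chem_comp.name                                  ALANINE
_chem_comp.formula                               "C3 H7 N O2"
loop_
_chem_comp_bond.comp_id
_chem_comp_bond.atom_id_1
_chem_comp_bond.atom_id_2
_chem_comp_bond.value_order
ALA N   CA  SING
ALA N   H   SING
'''
block_name, items, loops = read_simple_cif(cif_text)
print(block_name, items["_chem_comp.formula"], len(loops[0][1]))
```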

In small molecule crystallography CIF is used widely, but in Phenix the primary use is for geometry restraints used in refinement and related tasks. The Phenix installer includes a large number of CIF files including some from the PDB and the CCP4 monomer library, but for novel and/or uncommon ligands you may have to generate your own restraints using eLBOW. You can also edit an existing restraints CIF file using REEL.

In addition to geometry restraints, both models and reflection data may be stored in CIF format using the mmCIF subset of tags. mmCIF is the preferred working format for the PDB, which distributes structure factors only in mmCIF format. A variety of tools are able to convert to PDB or MTZ formats; the most commonly used program for this purpose is phenix.cif_as_mtz. Some programs such as phenix.refine can also read and write mmCIF files.

Sequence files

A number of programs require sequence information as input, including Phaser, AutoSol and AutoBuild. Several different formats are allowed, all of which contain the sequence as single-character codes (usually uppercase). For a single sequence, no other information needs to be specified. Otherwise, the most common format is FASTA, which can contain multiple sequences with a header for each, for instance:

> lysozyme
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRN
LCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
> rnase
KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQ
SYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHFDASV
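A FASTA file like the one above can be read with a few lines of Python. The sketch below is only an illustration (Phenix has its own sequence readers); the function name read_fasta is hypothetical, and the sequences in the usage example are truncated from the lysozyme and RNase entries above for brevity.

```python
def read_fasta(text: str) -> dict:
    """Return a dict mapping each header to its full sequence."""
    parts = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # header text after ">"
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before any header line")
        else:
            parts[name].append(line.upper())  # sequences may span lines
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Truncated versions of the entries shown above.
fasta_text = """> lysozyme
KVFGRCELAA
AMKRHGLDNY
> rnase
KETAAAKFER
"""
sequences = read_fasta(fasta_text)
print(sequences)
```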

Map files

Although maps are usually stored as the Fourier coefficients (weighted amplitude and phase), which is more convenient and saves disk space, several programs allow them to be output in CCP4 or X-PLOR format. These are then suitable for viewing in PyMOL, Chimera, or other molecular viewers (as well as Coot). We recommend that you use CCP4 format, which is binary and therefore much faster and smaller than the text X-PLOR format files. If you have map coefficients for which you would like to obtain a map file, phenix.mtz2map will convert them for you by running an FFT. In the Phenix GUI, most programs that output map coefficients will automatically perform the FFT if you click a button to open the results in PyMOL.

PHIL files (.eff, .def, .phil)

PHIL, which stands for "Python Hierarchical Input Language", is the standard format for specifying program parameters in Phenix and CCTBX. It uses a very lightweight syntax that is both human-readable and human-writable. GUI users will not normally need to view or edit these files (although they are extensively used internally), but some knowledge of the syntax may be helpful anyway. A typical PHIL file might look like this (loosely based on the parameters used in phenix.refine):

refinement {
  input {
    pdb {
      file_name = model.pdb
      file_name = ligands.pdb
    }
  }
  refine.strategy = *individual_sites *individual_adp tls occupancies
  main {
    ordered_solvent = True
    number_of_macro_cycles = 10
  }
}

Some parameters or parameter groups ("scopes") can have multiple values, as in the case of refinement.input.pdb.file_name above. Consult the manual for each individual program for more information about specific options.
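The relationship between nested scopes and full parameter names can be sketched in Python as follows. This is a simplified illustration that handles only the constructs in the example above (one "name {", "}", or "name = value" per line); it is not the PHIL parser that ships with CCTBX, and the function name flatten_phil is hypothetical.

```python
from collections import defaultdict


def flatten_phil(text: str) -> dict:
    """Map each full dotted parameter name to the list of values given for it."""
    values = defaultdict(list)
    scope = []  # names of the currently open scopes, outermost first
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            scope.append(line[:-1].strip())   # entering a scope
        elif line == "}":
            scope.pop()                       # leaving the innermost scope
        else:
            name, _, value = line.partition("=")
            full_name = ".".join(scope + [name.strip()])
            values[full_name].append(value.strip())
    return dict(values)


phil_text = """refinement {
  input {
    pdb {
      file_name = model.pdb
      file_name = ligands.pdb
    }
  }
  refine.strategy = *individual_sites *individual_adp tls occupancies
  main {
    ordered_solvent = True
    number_of_macro_cycles = 10
  }
}
"""
params = flatten_phil(phil_text)
print(params["refinement.input.pdb.file_name"])
```

Repeated definitions are collected into a list rather than overwritten, which mirrors the multiple-value behaviour of refinement.input.pdb.file_name described above. All values are kept as plain strings in this sketch.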

Geometry restraints info (.geo)

These files are generated by phenix.refine and related programs. They are not widely used but are very helpful for debugging issues with geometry restraints.

Other chemical data (various)

eLBOW supports a variety of other formats (in addition to CIF) for specifying chemical information to be used to generate molecular structures and restraints. The most popular is SMILES, but the output of various chemoinformatics programs is also recognized.

Log files (.log)

Most log files do not follow any specific format, and vary widely between different programs. Even for GUI users, we recommend that you familiarize yourself with the log output as it may contain useful debugging information not displayed graphically, as well as providing some insight into how a program works.