[phenixbb] Why is CIF monomer data split into individual files?

Joe Krahn krahn at niehs.nih.gov
Wed Oct 14 08:19:41 PDT 2009


Is there a good reason not to merge all monomer data into a single 
file, similar to PDB's components.cif? My first thought is that it is a 
performance issue. However, I wrote a simple CIF parser in Tcl, and 
found that carefully designed regexp processing can index data segments 
very fast. The entire components.cif takes just a few seconds. (The key 
is to use a regexp that matches an entire data block, and not tokenize 
the whole file.) So, I am thinking that a single merged file might be 
better.
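
As a rough illustration of the indexing idea, here is a minimal Python 
sketch (not the Tcl parser mentioned above). It scans for data_ headers 
with one regexp and records where each block starts and ends, without 
tokenizing the contents. The function name index_data_blocks is made up 
for this example. It assumes that no line inside a semicolon-delimited 
text field begins with "data_", which a full parser would have to check.

```python
import re

# A "data_<name>" header at the start of a line. CIF keywords are
# case-insensitive, hence IGNORECASE.
DATA_HEADER = re.compile(rb"^data_(\S+)", re.MULTILINE | re.IGNORECASE)


def index_data_blocks(buf):
    """Map each data block name to its (start, end) byte offsets in buf.

    Only the headers are matched; block contents are never tokenized.
    """
    headers = list(DATA_HEADER.finditer(buf))
    index = {}
    for i, match in enumerate(headers):
        # A block runs until the next header, or to the end of the buffer.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(buf)
        index[match.group(1).decode("ascii")] = (match.start(), end)
    return index


# Usage on a tiny two-block sample (made-up content).
sample = b"data_ALA\n_chem_comp.id ALA\n#\ndata_GLY\n_chem_comp.id GLY\n"
blocks = index_data_blocks(sample)
start, end = blocks["GLY"]
print(sample[start:end].decode("ascii"))
```

For a real merged file, buf would be the whole file read in binary mode 
(or a memory map), so the offsets are valid for later seeks.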

With separate files, there is already a need for a list of residues 
(mon_lib_list.cif), which could just as well store file offsets into a 
merged file. So, even if parsing were not so fast, it would not be an 
issue, because we already have an index file.
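
To make the offset idea concrete, here is a hedged Python sketch of how 
an index of (offset, length) pairs could be built once and then used to 
read a single monomer without parsing the rest of the file. The names 
build_offset_index and read_block are hypothetical, and this says nothing 
about how mon_lib_list.cif is actually laid out. It has the same 
limitation as any line-based scan: a text-field line starting with 
"data_" would be mistaken for a header.

```python
import os
import tempfile


def build_offset_index(path):
    """Return {block_name: (offset, length)} for each data_ block in path."""
    starts = []  # (name, byte offset of the data_ header line)
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if line[:5].lower() == b"data_":
                parts = line[5:].split()
                if parts:  # ignore a bare "data_" with no block name
                    starts.append((parts[0].decode("ascii"), pos))
            pos += len(line)
    index = {}
    for i, (name, start) in enumerate(starts):
        # A block ends where the next one starts, or at end of file.
        end = starts[i + 1][1] if i + 1 < len(starts) else pos
        index[name] = (start, end - start)
    return index


def read_block(path, index, name):
    """Seek straight to one block and return its text."""
    offset, length = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("ascii")


# Usage: write a tiny two-block file, index it, and pull out one block.
with tempfile.NamedTemporaryFile("wb", suffix=".cif", delete=False) as tmp:
    tmp.write(b"data_ALA\n_chem_comp.id ALA\n#\ndata_GLY\n_chem_comp.id GLY\n")
index = build_offset_index(tmp.name)
gly = read_block(tmp.name, index, "GLY")
os.remove(tmp.name)
print(gly)
```

The index itself could be saved alongside the merged file, so a lookup 
costs one seek and one read regardless of the file size.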

One problem with separate files is that they rely on the 3-letter 
residue name being unique. With a merged file, a different database key 
can be used without concern for filename restrictions. The PDB format 
was originally designed so that HET 3-letter codes only need a file 
scope, where HETNAM defines the full name of each HET residue in a 
file. Everyone has been using HET codes the same way as standard residue 
names, which has caused the 3-letter codes to become meaningless database 
keys rather than useful abbreviations. Eventually, we will run out of 
3-letter codes, and will have to change. Why not start sooner rather 
than later? Even now, there are potential conflicts. For example, CCP4 
has DNA residues using a lower-case d, which may not give a unique 
filename on non-POSIX operating systems with case-insensitive file names.
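
A quick way to check a residue list for this kind of clash is to group 
the names by their case-folded form. The sketch below is in Python, and 
the sample names are illustrative only, not taken from any particular 
library release.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that would map to the same case-insensitive filename."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where two or more distinct names collide.
    return {key: group for key, group in groups.items() if len(group) > 1}


# Illustrative names: "Ad" and "AD" differ only by case.
clashes = case_collisions(["Ad", "AD", "GLY", "Td"])
print(clashes)
```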

If anyone actually wants a Tcl CIF parser, I can share it. I only wrote 
it as a simple and efficient way to read CIF files from a CCP4i script.

Joe Krahn


