[phenixbb] Why is CIF monomer data split into individual files?
Joe Krahn
krahn at niehs.nih.gov
Wed Oct 14 08:19:41 PDT 2009
Is there a good reason not to merge all monomer data into a single
file, similar to PDB's components.cif? My first thought is that it is a
performance issue. However, I wrote a simple CIF parser in Tcl, and
found that carefully designed regexp processing can index data segments
very quickly. The entire components.cif takes just a few seconds. (The key
is to use a regexp that matches an entire data block, and not tokenize
the whole file.) So, I am thinking that a single merged file might be
better.
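The block-at-a-time indexing idea can be sketched as follows. This is a
minimal Python illustration rather than the original Tcl code, and the
two-block CIF fragment is made up for demonstration: a regexp matches
each data_ block header, and only byte offsets are recorded, so no
per-token parsing of the whole file is needed.

```python
import re

def index_cif_blocks(text):
    """Index data_ blocks by locating their headers instead of
    tokenizing the whole file.  Returns {block_name: (start, end)}
    offsets into text; each block runs to the next header or EOF."""
    starts = [(m.start(), m.group(1))
              for m in re.finditer(r'(?m)^data_(\S+)', text)]
    index = {}
    for i, (start, name) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        index[name] = (start, end)
    return index

# Hypothetical two-block CIF fragment for demonstration only.
cif = """\
data_ALA
_chem_comp.id ALA
data_GLY
_chem_comp.id GLY
"""

idx = index_cif_blocks(cif)
print(sorted(idx))               # -> ['ALA', 'GLY']
print(cif[slice(*idx['ALA'])])   # just the ALA block
```

The point is that the regexp engine scans the file once for headers;
extracting any one monomer is then a simple slice.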
With separate files, there is already a need for a list of residues
(mon_lib_list.cif), which could just as well store file offsets. So,
even if parsing were not so fast, it would not be an issue, because we
already have an index file.
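With such an offset index, reading one monomer from a merged file is a
seek plus a read. A rough Python sketch, assuming a hypothetical index
format of name -> (byte offset, length) (mon_lib_list.cif does not
actually store offsets today; this illustrates the proposal):

```python
import os
import tempfile

# Hypothetical merged monomer file and its offset index.
blocks = {'ALA': 'data_ALA\n_chem_comp.id ALA\n',
          'GLY': 'data_GLY\n_chem_comp.id GLY\n'}

fd, path = tempfile.mkstemp()
index = {}
with os.fdopen(fd, 'wb') as f:
    for name, body in blocks.items():
        # Record where each block begins before writing it.
        index[name] = (f.tell(), len(body))
        f.write(body.encode('ascii'))

def fetch_block(path, index, name):
    """Read one monomer entry directly via its byte offset,
    without parsing the rest of the merged file."""
    start, length = index[name]
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(length).decode('ascii')

result = fetch_block(path, index, 'GLY')
os.remove(path)
print(result)
```

Lookup cost is then independent of how many monomers the merged file
contains.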
One problem with separate files is that it relies on 3-letter residue
names being unique. With a merged file, a different database key can be
used without concern for filename restrictions. The PDB format was
originally designed so that HET 3-letter codes only needed to be unique
within a single file, with HETNAM defining the full name of each HET
residue in that file. Everyone has been using HET codes the same way as
standard residue names, which has caused the 3-letter codes to become
meaningless database keys rather than useful abbreviations. Eventually,
we will run out of 3-letter codes and will have to change. Why not
start sooner rather than later? Even now, there are potential
conflicts. For example, CCP4 names DNA residues with a lower-case 'd',
which may not produce a unique filename on case-insensitive (non-POSIX)
filesystems.
If anyone actually wants a Tcl CIF parser, I can share it. I only wrote
it as a simple and efficient way to read CIF files from a CCP4i script.
Joe Krahn