Why is CIF monomer data split into individual files?
Is there a good reason not to merge all monomer data into a single file, similar to PDB's components.cif?

My first thought was that it is a performance issue. However, I wrote a simple CIF parser in Tcl, and found that carefully designed regexp processing can index data segments very fast. Indexing the entire components.cif takes just a few seconds. (The key is to use a regexp that matches an entire data block, rather than tokenizing the whole file.) So, I am thinking that a single merged file might be better.

With separate files, there is already a need for a list of residues (mon_lib_list.cif), which could just as well hold file offsets. So, even if parsing were not so fast, it would not be an issue, because we already have an index file.

One problem with separate files is that they rely on the 3-letter residue name being unique. With a merged file, a different database key can be used without concern for filename restrictions. The PDB format was originally designed so that HET 3-letter codes only need file scope, with HETNAM defining the full name of each HET residue in a file. Everyone has been using HET codes the same way as standard residue names, which has caused the 3-letter codes to become meaningless database keys rather than useful abbreviations. Eventually, we will run out of 3-letter codes, and will have to change. Why not start sooner rather than later?

Even now, there are potential conflicts. For example, CCP4 has DNA residues named with a lower-case d, which may not give a unique filename on non-POSIX operating systems with case-insensitive filesystems.

If anyone actually wants a Tcl CIF parser, I can share it. I only wrote it as a simple and efficient way to read CIF files from a CCP4i script.

Joe Krahn
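To illustrate the indexing idea described above, here is a minimal sketch in Python. It is not the Tcl parser mentioned in the post. The names `index_blocks`, `read_block`, and `SAMPLE` are made up for the example, and the sample data is illustrative, not real monomer library content. It assumes that a line starting with `data_` always begins a new block, so it would be fooled by such a line inside a multi-line text field.

```python
import re

# Matches a data block header at the start of a line, e.g. "data_ALA".
# CIF keywords are case-insensitive, hence re.IGNORECASE; the captured
# block name keeps its original case.
BLOCK_HEADER = re.compile(rb"^data_(\S+)", re.MULTILINE | re.IGNORECASE)


def index_blocks(buf):
    """Map each data block name to its (byte offset, byte length) in buf."""
    headers = list(BLOCK_HEADER.finditer(buf))
    index = {}
    for i, match in enumerate(headers):
        start = match.start()
        # A block runs until the next header, or to the end of the buffer.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(buf)
        index[match.group(1).decode("ascii")] = (start, end - start)
    return index


def read_block(buf, index, name):
    """Return the raw bytes of one data block using the offset index."""
    offset, length = index[name]
    return buf[offset:offset + length]


# Illustrative merged file: block names that differ only in case stay
# distinct, because they are dictionary keys and not filenames.
SAMPLE = b"""data_ALA
_chem_comp.id ALA
_chem_comp.name ALANINE
data_Ad
_chem_comp.id Ad
data_AD
_chem_comp.id AD
"""

INDEX = index_blocks(SAMPLE)
```

Only the block headers are matched here; nothing inside a block is tokenized until that block is actually requested with `read_block`. The offsets in `INDEX` are the kind of information a list file could carry in place of filenames.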