.cif file has 'UNK' auth_asym_id

I’ve finished refining a structure with three protein chains A, B, C. The pdb file looks ‘normal’ to me, but when I inspect the .cif file written by phenix, the auth_asym_id for chains B and C is ‘UNK’. label_asym_id is correct for all chains. I’m not really sure what the difference is between the auth_ and label_ fields. When I try to perform actions on the .cif file, I get duplicate atom label errors.
Here’s a few example lines from the .pdb and .cif files, where auth_asym_id is field 6 and label_asym_id is field 16:
ATOM 460 OXT PRO A 59 5.272 55.689 31.481 1.00 52.63 O
ANISOU 460 OXT PRO A 59 6541 6212 7244 1448 1260 96 O
TER
ATOM 461 N ALA B 2 39.292 61.974 39.403 1.00 67.81 N
ANISOU 461 N ALA B 2 7118 8729 9916 -1711 -388 2066 N
ATOM 460 OXT . PRO A 59 ? 5.27169 55.68852 31.48123 1.000 52.63052 O ? A ? 58 1
ATOM 461 N . ALA UNK 2 ? 39.29207 61.97430 39.40271 1.000 67.80544 N ? B ? 1 1
I’m able to convert the pdb file to a usable cif file using gemmi but wanted to report this weird behavior with phenix 1.21.2_5419.
--
Kevin Jude, PhD
Structural Biology Research Specialist, Garcia Lab
Howard Hughes Medical Institute
Stanford University School of Medicine
Beckman B177, 279 Campus Drive, Stanford CA 94305
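The two chain columns in the quoted .cif rows can be compared with a few lines of Python. This is only a minimal sketch: it assumes whitespace-separated atom_site rows with auth_asym_id in field 6 and label_asym_id in field 16, as stated in the message. In a real mmCIF file the column order is set by the loop header and values may be quoted, so a proper parser such as gemmi is the safer route. The function name chain_id_mismatches is hypothetical.

```python
# Field positions (1-based) as given in the message. This is an assumption
# that holds only for this particular atom_site column order.
AUTH_ASYM_FIELD = 6
LABEL_ASYM_FIELD = 16


def chain_id_mismatches(rows):
    """Return sorted (auth_asym_id, label_asym_id) pairs that differ."""
    pairs = set()
    for row in rows:
        fields = row.split()
        # Skip anything that is not a complete ATOM/HETATM row.
        if len(fields) < LABEL_ASYM_FIELD or fields[0] not in ("ATOM", "HETATM"):
            continue
        auth = fields[AUTH_ASYM_FIELD - 1]
        label = fields[LABEL_ASYM_FIELD - 1]
        if auth != label:
            pairs.add((auth, label))
    return sorted(pairs)


# The two .cif rows quoted in the message.
rows = [
    "ATOM 460 OXT . PRO A 59 ? 5.27169 55.68852 31.48123 1.000 52.63052 O ? A ? 58 1",
    "ATOM 461 N . ALA UNK 2 ? 39.29207 61.97430 39.40271 1.000 67.80544 N ? B ? 1 1",
]
print(chain_id_mismatches(rows))
```

A mismatch is not an error in itself, since auth_asym_id and label_asym_id are allowed to differ. What matters here is that several label chains map onto the single auth value 'UNK'.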

Hi Kevin,
Thank you for the report. I would be happy to fix the issue. For this I
need to be able to reproduce it myself. Can you please share (off-list) the
inputs that you used for the refinement to produce this result? All files
will be treated confidentially.
Best regards,
Oleg Sobolev.

Thanks Oleg. I investigated some more and found a clue:
A few residues at the termini of the B and C chains in the input pdb file have UNK as the segid (columns 73-76). The segids were introduced in autobuild. I had apparently noticed this and removed them in the .pdb when I finished refinement a few months ago, because the final output pdb file has a later edit date than the rest of the output files. In the .cif files, all residues in those chains are labeled as ‘UNK’ in the auth_asym_id. Now three months later when making Table 1 using the .cif file, I was surprised when phenix complained about ‘duplicate atoms’ in the cif file.
So now I guess the mystery to me is why phenix extends the UNK segid to the whole chain, and why phenix sees atoms with the same auth_asym_id (segid) but different label_asym_id (chain) as being duplicates. I’ll leave it to you to decide if this is a bug in the program or in the user, but still happy to share my files with you off-list if you like.
Best wishes
Kevin
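Since segids are easy to miss in a viewer, a quick scan of the PDB text can reveal them. The sketch below is one way to do that in plain Python, assuming the legacy fixed-column layout with the chain ID in column 22 and the segid in columns 73-76. The helper atom_line and the function segids_by_chain are hypothetical names, and the segid placed on the example residue is illustrative rather than taken from the actual file.

```python
from collections import defaultdict


def atom_line(serial, name, resname, chain, resseq, x, y, z, occ, b, element, segid=""):
    """Build a fixed-column PDB ATOM record (name must already be 4 characters)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}      "
        f"{segid:<4s}{element:>2s}"
    )


def segids_by_chain(pdb_lines):
    """Map each chain ID to the set of non-blank segid values found in it."""
    found = defaultdict(set)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain_id = line[21:22]       # column 22
        segid = line[72:76].strip()  # columns 73-76
        if segid:
            found[chain_id].add(segid)
    return dict(found)


lines = [
    atom_line(460, " OXT", "PRO", "A", 59, 5.272, 55.689, 31.481, 1.00, 52.63, "O"),
    atom_line(461, " N  ", "ALA", "B", 2, 39.292, 61.974, 39.403, 1.00, 67.81, "N", segid="UNK"),
]
print(segids_by_chain(lines))
```

An empty result means no segids are present. Any chain listed in the output carries at least one segid and is worth checking before refinement or conversion to mmCIF.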

Hi Kevin,
Thank you for the clarification; the situation is much clearer now.
The "UNK" in the segid field was the root cause of the issue. When segid is
present, Phenix prioritizes it over the chain ID when reading PDB files.
This explains why the segid was applied to the entire chain. Essentially,
Phenix interprets any segid as a chain ID. Handling cases with different
segids for the same chain becomes overly complex, especially since the
segid is not a commonly used feature in the PDB format. I'm not sure
whether your specific case could be accommodated within the current
processing workflow.
When Phenix reads an mmCIF model file, it uses the auth_asym_id as the
chain ID and largely disregards the label_asym_id. Because chains B and C
both carry the auth_asym_id "UNK", their atoms end up with identical
labels, which explains the duplicate-atom errors.
Great that you figured out the root cause. Please let me know if you have
more questions!
Best regards,
Oleg Sobolev
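The duplicate-label effect described above can be illustrated with a toy example. This is a sketch only, not Phenix's actual implementation: it assumes an atom label is composed of chain ID, residue number, residue name and atom name, and the records and the function name duplicate_atom_labels are made up for illustration.

```python
from collections import Counter


def duplicate_atom_labels(atoms, chain_key):
    """Return atom labels seen more than once when chains are named by chain_key."""
    labels = Counter(
        (atom[chain_key], atom["resseq"], atom["resname"], atom["name"])
        for atom in atoms
    )
    return sorted(label for label, count in labels.items() if count > 1)


# Toy records: the same residue in chains B and C, both carrying
# auth_asym_id "UNK" as in the report. Values are illustrative only.
atoms = [
    {"auth_asym_id": "UNK", "label_asym_id": "B", "resseq": 2, "resname": "ALA", "name": "N"},
    {"auth_asym_id": "UNK", "label_asym_id": "C", "resseq": 2, "resname": "ALA", "name": "N"},
    {"auth_asym_id": "A", "label_asym_id": "A", "resseq": 59, "resname": "PRO", "name": "OXT"},
]

print(duplicate_atom_labels(atoms, "auth_asym_id"))   # chains named by auth_asym_id
print(duplicate_atom_labels(atoms, "label_asym_id"))  # chains named by label_asym_id
```

Keyed on auth_asym_id, the two ALA 2 N atoms collapse onto one label. Keyed on label_asym_id, they stay distinct, which is why a reader that trusts only the auth_ field reports duplicates.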

Got it. Phenix.autobuild assigns the UNK segid to chains with low confidence where the backbone was correct but sequence assignment was not. I corrected the sequence assignment for those positions in Coot, but the presence of the segids is not apparent without looking at the plain text of the pdb file.
Best wishes
Kevin

Hi Kevin,
Thanks again for figuring out what was going on! I have fixed AutoBuild so
that it now sends PDB files with blank segid values to phenix.refine. This
results in all the final autobuild files being free of the UNK segid that
you found. Tomorrow's nightly build and later should have this fix. Let me
know if that does not work for you!
All the best,
Tom T
--
Thomas C Terwilliger
Laboratory Fellow, Los Alamos National Laboratory
Senior Scientist, New Mexico Consortium
100 Entrada Dr, Los Alamos, NM 87544
Tel: 505-431-0010
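For PDB files that already carry stray segids, a similar result can be approximated by blanking the segid columns before refinement. The sketch below is not the AutoBuild fix itself. It is a plain-Python illustration that assumes the legacy fixed-column layout (segid in columns 73-76) and treats ATOM, HETATM and ANISOU records alike. The function name blank_segid is hypothetical.

```python
SEGID_COLUMNS = slice(72, 76)  # PDB columns 73-76 as a 0-based slice


def blank_segid(line):
    """Return a coordinate record with its segid columns set to spaces."""
    if not line.startswith(("ATOM", "HETATM", "ANISOU")):
        return line  # leave TER, CRYST1, REMARK, etc. untouched
    body = line.rstrip("\n")
    if len(body) <= SEGID_COLUMNS.start:
        return line  # record too short to carry a segid
    padded = body.ljust(SEGID_COLUMNS.stop)
    cleaned = padded[:SEGID_COLUMNS.start] + "    " + padded[SEGID_COLUMNS.stop:]
    return cleaned + ("\n" if line.endswith("\n") else "")


# Example record: only the segid ("UNK") and element columns matter here,
# so the coordinate columns are left blank for brevity.
line = f"{'ATOM    461  N   ALA B   2':<72}{'UNK':<4}{'N':>2}"
cleaned = blank_segid(line)
print(repr(cleaned[72:76]))
```

Applied line by line to a whole file, this leaves chain IDs, coordinates and element symbols untouched and changes only columns 73-76.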

Hi,
This was the source of a recent issue I have been having with pdb depositions. There were no ligands involved; as Oleg explained, presence of segids resulted in a cif file uninterpretable by PyMOL, Coot and for pdb deposition as chain ids were overwritten in the cif file only, not pdb.
If phenix chooses to overwrite chain ids with segids in cif, while I would not prefer that, that's one rational way. I am puzzled, though, why the pdb file is not handled the same way? Why produce pdb and cif files with different chain connectivities? I think treating both files consistently makes most sense.
(Back to re-running all phenix.refine jobs after deleting all segid columns...)
Sorry if I am misinterpreting this and thank you!
Engin
--
Engin Özkan, Ph.D.
Associate Professor
Dept of Biochemistry and Molecular Biology
University of Chicago
Phone: (773) 834-5498
http://ozkan.uchicago.edu

Hi Engin,
I share your frustration over this issue. Without defending the current
approach too much, let me share the rationale behind what is going on in
Phenix.
A brief historical note, thanks to
https://www.wwpdb.org/documentation/file-format
SEGID was not in the original PDB format description of 1972. It was
introduced in 1996:
https://cdn.rcsb.org/wwpdb/docs/documentation/file-format/PDB_format_Jan_199...
and quickly disappeared in 1998:
https://www.wwpdb.org/documentation/file-format-content/format23/sect9.html#...
Likely, because of a real necessity, two years were enough for the
community to start using it and refuse to let it go despite its
disappearance from the format specifications. The demand for the support of
segid was probably the reason why CCTBX processes it even though CCTBX was
first published in 2002.
Part of the structural biology community is using segids largely instead of
chain IDs, often leaving the chain ID field blank. This is the major use
case I'm aware of and the case CCTBX supports.
Now comes mmCIF, and there is NO place for segid because there has been no
formal segid definition for the last 25 years:
https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ATOMP
The absence of segid prevents us from converting such PDB files into mmCIF
directly, so we have to get creative. Here is the present state of my
understanding: either no segids and no ambiguity, or segids are used
instead of chain IDs. I admit this is a rather narrow use-case scenario,
and I can definitely see that random leftover or carry-over segids can
spoil the output.
Connectivity differences resulting from such mmCIF/PDB files are
second-order consequences, as we definitely use a lot of heuristics to
figure out connectivity, and since conversion PDB ⇄ mmCIF is not equivalent
in the presence of random segids, connectivity might be compromised.
The hope is that with the PDB format being gradually phased out, all of
this will be of less concern for developers and users.
Best regards,
Oleg Sobolev.

Thank you, Oleg, for the explanation. I was not aware of the history of segids.
As someone who started in crystallography in 1999, I've never used segids. They have only been a nuisance. To my knowledge, they are not intentionally added by users like me, who only play with chain IDs, but software adds them. There is a "Change chain ID" menu item in Coot but no segid equivalent. I've always thought Phaser or Coot added them during some operation, since they appear after molecular replacement and/or model building, but I have not investigated, so I may be wrong. In Kevin's case, it was AutoBuild (and Tom kindly fixed that).
Reading your message, I am convinced segids are unnecessary and unused. So, ignore them I'd say, but I'm sure the phenix team has thought more deeply about this, and knows of cases of actual use.
However, I do not agree with the point on second-order consequences: neither Coot nor PyMOL can display these cifs with correct chain sequences, and wwPDB does not accept them. Losing all carefully curated chain IDs (sometimes going up to AA and on) because of a stray segid is a pain. Those are significant consequences at the moment.
Regardless, I should be able to fix this issue going forward. I really appreciate all the work you and the team do. Thank you!
Engin
P.S. I can now go back to worrying only about having a ton of new chains for each N-linked glycan in the cif file.
Hi Engin,
I share your frustration over this issue. Without defending the current approach too much, let me share the rationale behind what is going on in Phenix.
A brief historical note, thanks to https://www.wwpdb.org/documentation/file-format
SEGID was not in the original PDB format description of 1972. It was introduced in 1996: https://cdn.rcsb.org/wwpdb/docs/documentation/file-format/PDB_format_Jan_199... and quickly disappeared in 1998: https://www.wwpdb.org/documentation/file-format-content/format23/sect9.html#...
Likely, because of a real necessity, two years were enough for the community to start using it and refuse to let it go despite its disappearance from the format specifications. The demand for the support of segid was probably the reason why CCTBX processes it even though CCTBX was first published in 2002.
Part of the structural biology community is using segids largely instead of chain IDs, often leaving the chain ID field blank. This is the major use case I'm aware of and the case CCTBX supports.
Now comes mmCIF, and there is NO place for segid because there has been no formal segid definition for the last 25 years: https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ATOMP
The absence of segid prevents us from converting such PDB files into mmCIF directly, so we have to get creative. Here is the present state of my understanding: either no segids and no ambiguity, or segids are used instead of chain IDs. I admit this is a rather narrow use-case scenario, and I can definitely see that random leftover or carry-over segids can spoil the output.
Connectivity differences resulting from such mmCIF/PDB files are second-order consequences, as we definitely use a lot of heuristics to figure out connectivity, and since conversion PDB ⇄ mmCIF is not equivalent in the presence of random segids, connectivity might be compromised.
The hope is that with the PDB format being gradually phased out, all of this will be of less concern for developers and users.
Best regards, Oleg Sobolev.
On Tue, Mar 11, 2025 at 6:15 PM Engin Özkan
wrote: Hi,
This was the source of a recent issue I have been having with pdb depositions. There were no ligands involved; as Oleg explained, presence of segids resulted in a cif file uninterpretable by PyMOL, Coot and for pdb deposition as chain ids were overwritten in the cif file only, not pdb.
If phenix chooses to overwrite chain ids with segids in cif, while I would not prefer that, that's one rational way. I am puzzled, though, why the pdb file is not handled the same way? Why produce pdb and cif files with different chain connectivities? I think treating both files consistently makes most sense.
(Back to re-running all phenix.refine jobs after deleting all segid columns...)
Sorry if I am misinterpreting this and thank you!
Engin
On 2/10/25 11:51 AM, Kevin M Jude wrote:
Got it. Phenix.autobuild assigns the UNK segid to chains with low confidence where the backbone was correct but sequence assignment was not. I corrected the sequence assignment for those positions in Coot, but the presence of the segids is not apparent without looking at the plain text of the pdb file.
Best wishes
Kevin
*From: *Oleg Sobolev
mailto:[email protected] *Date: *Monday, February 10, 2025 at 9:31 AM *To: *Kevin M Jude mailto:[email protected] *Cc: *[email protected] mailto:[email protected] *Subject: *Re: [phenixbb] .cif file has 'UNK' auth_asym_id Hi Kevin,
Thank you for the clarification; the situation is much clearer now.
The "UNK" in the segid field was the root cause of the issue. When segid is present, Phenix prioritizes it over the chain ID when reading PDB files. This explains why the segid was applied to the entire chain. Essentially, Phenix interprets any segid as a chain ID. Handling cases with different segids for the same chain becomes overly complex, especially since the segid is not a commonly used feature in the PDB format. I'm not sure whether your specific case could be accommodated within the current processing workflow.
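When Phenix reads an mmCIF model file, it uses the auth_asym_id as the chain ID and largely disregards the label_asym_id, which explains the duplicated labels: chains B and C both carry the auth_asym_id "UNK", so their atoms appear to belong to a single chain.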
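The collision can be illustrated with a few lines of Python. This is a sketch of the idea only, not Phenix's actual code: the helper `find_duplicate_labels` and the dictionary layout are invented for the example, atom 461 is taken from the original report, and the chain C atom (serial 900) is a hypothetical stand-in.

```python
from collections import defaultdict


def find_duplicate_labels(atoms, chain_key):
    """Return {(chain, residue number, atom name, altloc): [atom ids]} for keys seen more than once."""
    seen = defaultdict(list)
    for atom in atoms:
        key = (atom[chain_key], atom["auth_seq_id"], atom["atom_id"], atom["alt_id"])
        seen[key].append(atom["id"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


atoms = [
    # From the report: atom 461, N of ALA 2, auth_asym_id UNK, label_asym_id B.
    {"id": 461, "atom_id": "N", "alt_id": ".", "auth_seq_id": 2,
     "auth_asym_id": "UNK", "label_asym_id": "B"},
    # Hypothetical equivalent atom in chain C, which received the same auth_asym_id.
    {"id": 900, "atom_id": "N", "alt_id": ".", "auth_seq_id": 2,
     "auth_asym_id": "UNK", "label_asym_id": "C"},
]

clash_by_auth = find_duplicate_labels(atoms, "auth_asym_id")
clash_by_label = find_duplicate_labels(atoms, "label_asym_id")
```

Keyed on auth_asym_id, the two atoms share one label and are flagged; keyed on label_asym_id, they remain distinct.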
Great that you figured out the root cause, please let me know if you have more questions!
Best regards,
Oleg Sobolev
On Fri, Feb 7, 2025 at 8:10 PM Kevin M Jude wrote:
Thanks Oleg. I investigated some more and found a clue:
A few residues at the termini of the B and C chains in the input pdb file have UNK as the segid (column 73-75). The segids were introduced in autobuild. I had apparently noticed this and removed them in the .pdb when I finished refinement a few months ago, because the final output pdb file has a later edit date than the rest of the output files. In the .cif files, all residues in those chains are labeled as ‘UNK’ in the auth_asym_id. Now three months later when making Table 1 using the .cif file, I was surprised when phenix complained about ‘duplicate atoms’ in the cif file.
So now I guess the mystery to me is why phenix extends the UNK segid to the whole chain, and why phenix sees atoms with the same auth_asym_id (segid) but different label_asym_id (chain) as being duplicates. I’ll leave it to you to decide if this is a bug in the program or in the user, but I'm still happy to share my files with you off-list if you like.
Best wishes
Kevin
From: Oleg Sobolev
Date: Friday, February 7, 2025 at 4:52 PM
To: Kevin M Jude
Cc: [email protected]
Subject: Re: [phenixbb] .cif file has 'UNK' auth_asym_id
Hi Kevin,
Thank you for the report. I would be happy to fix the issue. For this I need to be able to reproduce it myself. Can you please share (off-list) the inputs that you used for the refinement to produce this result? All files will be treated confidentially.
Best regards,
Oleg Sobolev.
On Fri, Feb 7, 2025 at 12:34 PM Kevin M Jude wrote:
I’ve finished refining a structure with three protein chains A, B, C. The pdb file looks ‘normal’ to me, but when I inspect the .cif file written by phenix, the auth_asym_id for chains B and C is ‘UNK’. label_asym_id is correct for all chains. I’m not really sure what the difference is between the auth_ and label_ fields. When I try to perform actions on the .cif file, I get duplicate atom label errors.
Here’s a few example lines from the .pdb and .cif files, where auth_asym_id is field 6 and label_asym_id is field 16
ATOM 460 OXT PRO A 59 5.272 55.689 31.481 1.00 52.63 O
ANISOU 460 OXT PRO A 59 6541 6212 7244 1448 1260 96 O
TER
ATOM 461 N ALA B 2 39.292 61.974 39.403 1.00 67.81 N
ANISOU 461 N ALA B 2 7118 8729 9916 -1711 -388 2066 N
ATOM 460 OXT . PRO A 59 ? 5.27169 55.68852 31.48123 1.000 52.63052 O ? A ? 58 1
ATOM 461 N . ALA UNK 2 ? 39.29207 61.97430 39.40271 1.000 67.80544 N ? B ? 1 1
I’m able to convert the pdb file to a usable cif file using gemmi but wanted to report this weird behavior with phenix 1.21.2_5419.
--
Kevin Jude, PhD
Structural Biology Research Specialist, Garcia Lab
Howard Hughes Medical Institute
Stanford University School of Medicine
Beckman B177, 279 Campus Drive, Stanford CA 94305
_______________________________________________ phenixbb mailing list -- [email protected] To unsubscribe send an email to [email protected] Unsubscribe: phenixbb-leave@%(host_name)s
-- Engin Özkan, Ph.D. Associate Professor Dept of Biochemistry and Molecular Biology University of Chicago Phone: (773) 834-5498 http://ozkan.uchicago.edu

Hi All,
if memory serves, a long time ago, the idea of purging support of segment identifiers (SEGIDs) in cctbx/Phenix faced rather strong resistance from researchers working on ribosome structures because they used SEGIDs to annotate structures in ways that chain IDs could not. I can’t recall any details, and this may no longer even be relevant in the mmCIF era. Perhaps someone from this community could comment -- if they decide to read this thread with the rather unrelated (to SEGIDs) subject line!
Pavel
On 3/13/25 21:25, Engin Özkan wrote:
Thank you, Oleg, for the explanation. I was not aware of the history of segids.
As someone who started in crystallography in 1999, I've never used segids. They have only been a nuisance. To my knowledge, they are not intentionally added by users like me, who only play with chain IDs, but software adds them. There is a "Change chain ID" menu item in Coot but no segid equivalent. I've always thought Phaser or Coot added them during some operation, since they appear after molecular replacement and/or model building, but I have not investigated, so I may be wrong. In Kevin's case, it was AutoBuild (and Tom kindly fixed that).
Reading your message, I am convinced segids are unnecessary and unused. So, ignore them I'd say, but I'm sure the phenix team has thought more deeply about this, and knows of cases of actual use.
However, I do not agree with the point on second-order consequences: neither Coot nor PyMOL can display these cifs with correct chain sequences, and wwPDB does not accept them. Losing all carefully curated chain IDs (sometimes going up to AA and on) because of a stray segid is a pain. Those are significant consequences at the moment.
Regardless, I should be able to fix this issue going forward. I really appreciate all the work you and the team do.
Thank you!
Engin
P.S. I can now go back to worrying only about having a ton of new chains for each N-linked glycan in the cif file.
On 3/13/25 7:21 PM, Oleg Sobolev wrote:
Hi Engin,
I share your frustration over this issue. Without defending the current approach too much, let me share the rationale behind what is going on in Phenix.
A brief historical note, thanks to https://www.wwpdb.org/documentation/file-format
SEGID was not in the original PDB format description of 1972. It was introduced in 1996: https://cdn.rcsb.org/wwpdb/docs/documentation/file-format/PDB_format_Jan_199... and quickly disappeared in 1998: https://www.wwpdb.org/documentation/file-format-content/format23/sect9.html#...
Likely, because of a real necessity, two years were enough for the community to start using it and refuse to let it go despite its disappearance from the format specifications. The demand for the support of segid was probably the reason why CCTBX processes it even though CCTBX was first published in 2002.
Part of the structural biology community is using segids largely instead of chain IDs, often leaving the chain ID field blank. This is the major use case I'm aware of and the case CCTBX supports.
Now comes mmCIF, and there is NO place for segid because there has been no formal segid definition for the last 25 years: https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ATOMP
The absence of segid prevents us from converting such PDB files into mmCIF directly, so we have to get creative. Here is the present state of my understanding: either no segids and no ambiguity, or segids are used instead of chain IDs. I admit this is a rather narrow use-case scenario, and I can definitely see that random leftover or carry-over segids can spoil the output.
Connectivity differences resulting from such mmCIF/PDB files are second-order consequences, as we definitely use a lot of heuristics to figure out connectivity, and since conversion PDB ⇄ mmCIF is not equivalent in the presence of random segids, connectivity might be compromised.
The hope is that with the PDB format being gradually phased out, all of this will be of less concern for developers and users.
Best regards, Oleg Sobolev.

Yes indeed, Pavel, there are just too many ribosomal proteins (e.g. about 50 to ~80 depending on species etc.) for the alphabet. Hence, combinations of letters are used for SEGIDs, even mixing lowercase and capital letters, with a preceding “L” or “S” to distinguish between the large and small subunits. This often causes column issues because no space is left in between, e.g. ILELp or PROSW instead of ILE Lp and PRO SW etc., to keep compatibility between Coot, Phenix and the PDB.
In mmCIF this is placed towards the end of the line (with a space) with that same nomenclature, e.g. see PDB ID 8QOI:
HETATMA0OWO CB LYSLg 71 213.683 260.121 340.960 1.00 19.63 C
gives:
ATOM 132237 C CB . LYS IA 35 71 ? 213.683 260.121 340.960 1.00 19.63 ? 71 LYS Lg CB 1
etc.
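The fused fields are harmless as long as the PDB record is read by column rather than split on whitespace. Here is a small sketch of that difference; `residue_fields` is a name made up for this example, the record start is re-spaced by hand from the 8QOI line above, and the two-character chain ID in columns 21-22 is an assumption inferred from the ILELp/PROSW examples rather than something taken from the format specification.

```python
def residue_fields(record):
    """Slice residue name, chain ID and residue number out of a PDB coordinate record."""
    return {
        "resname": record[17:20].strip(),  # columns 18-20
        "chain": record[20:22].strip(),    # columns 21-22 (column 21 is blank in the standard format)
        "resseq": int(record[22:26]),      # columns 23-26
    }


# Start of the 8QOI record, rebuilt field by field (assumed column layout).
EXAMPLE = "HETATM" + "A0OWO" + " " + " CB " + " " + "LYS" + "Lg" + "  71"

fields = residue_fields(EXAMPLE)   # fixed columns keep LYS and Lg apart
tokens = EXAMPLE.split()           # whitespace splitting fuses them
```

With these inputs, `fields` holds the residue name and chain separately, while `tokens` contains the fused `LYSLg`, which is why whitespace-based parsing of such files goes wrong.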
Best,
Bruno
From: Pavel Afonine

Ironic, in some sense, that this bit of history gets missed on this list of all places. I date to 1986 with crystallography.
AFAIK segment IDs were introduced with Axel Brunger's program X-PLOR (~1987) as a non-standard extension in columns 73:76 (Fortran-77 character string numbering) as a means to split up the coordinate file that basically ignored chain labels.
https://atbweb.stanford.edu/atb_publications/brunger_xplor_manual_1992.pdf
http://www.ocms.ox.ac.uk/mirrored/xplor/manual/htmlman/htmlman.html
section 3.13.2 gives you a sense of it - SEGId's were the definitive identifier for chain blocks and entities. I still have code in my PDB manipulation program that splits up a file by segment ID. I emphasize non-standard PDB format extension - the format at the time had quite a lot of limitations and (I think) had not started to add element type on the end of the ATOM/HETATM record.
The successor program, CNS, did much the same thing.
https://cns-online.org/v1.3/
https://www.mrc-lmb.cam.ac.uk/public/xtal/doc/cns/cns_1.3/tutorial/text.html
If you read https://www.mrc-lmb.cam.ac.uk/public/xtal/doc/cns/cns_1.3/tutorial/generate/... you might get a sense of where phenix, at least syntactically, comes from.
Ah, nostalgia.
Programs that add SEGId or manipulate them are more of a historical artifact, but at the time lower case chain labels were not recognized, nor were double character chain labels, so this might be viewed as the most sensible choice for larger structures.
I prefer PDB, warts and all, because mmCIF is a really bad format for hacking structures during rebuilding. If someone writes a good graphics-based mmCIF editing program that is faster than I am with PDB and emacs, I might change my mind. But I haven't seen one.
Phil Jeffrey
(entering the "institutional memory" part of his career, apparently)
Princeton
On 3/14/25 12:25 AM, Engin Özkan wrote:
Thank you, Oleg, for the explanation. I was not aware of the history of segids.
As someone who started in crystallography in 1999, I've never used segids. They have only been a nuisance. To my knowledge, they are not intentionally added by users like me, who only play with chain IDs, but software adds them. There is a "Change chain ID" menu item in Coot but no segid equivalent. I've always thought Phaser or Coot added them during some operation, since they appear after molecular replacement and/or model building, but I have not investigated, so I may be wrong. In Kevin's case, it was AutoBuild (and Tom kindly fixed that).
Reading your message, I am convinced segids are unnecessary and unused. So, ignore them I'd say, but I'm sure the phenix team has thought more deeply about this, and knows of cases of actual use.
However, I do not agree with the point on second-order consequences: neither Coot nor PyMOL can display these cifs with correct chain sequences, and wwPDB does not accept them. Losing all carefully curated chain IDs (sometimes going up to AA and on) because of a stray segid is a pain. Those are significant consequences at the moment.
Regardless, I should be able to fix this issue going forward. I really appreciate all the work you and the team do.
Thank you!
Engin
P.S. I can now go back to worrying only about having a ton of new chains for each N-linked glycan in the cif file.
participants (7)
- Bruno KLAHOLZ
- Engin Özkan
- Kevin M Jude
- Oleg Sobolev
- Pavel Afonine
- Phil Jeffrey
- Tom Terwilliger