Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.
Randy
On 24 Apr 2022, at 01:36, James Holton wrote:
I downloaded all sequences from the PDB to do a frequency analysis, which I think will enable a more sophisticated "random peptide" than giving all 20 amino acids equal likelihood with no neighbor correlations. Turns out the most common heptapeptide in the PDB is:
XXXXXXX
because, apparently, NCBI lists gaps like this. If I eliminate "X", the most common heptapeptide is:
SHHHHHH
Right. His tags. If I eliminate those, the next one is:
GLVPRGS
thrombin cleavage site. Ugh!
Cleaving off those, the next one is:
AAAAAAA
apparently also used to denote unknown amino acids.
Ignoring those, the next one is:
YFPEPVT
this is from the human antibody heavy chain. Thousands of those in the PDB. This made me realize that there is going to be a lot of "homology bias".
I next tried the NCBI refseq_select to try and get something non-redundant, and then I get:
GSGKSTL
lots of ABC transporters. >10k of this sequence in the db.
On the other hand, of the ~330e6 heptapeptide sequences I'm looking at, 52% of them only appear once. Another 19% of my list of unique sequences occur twice, 9% occur 3x, etc. It is a VERY steep curve. Only 4% of my sample sequences occur more than 10 times, and 0.05% occur more than 100x. I never thought of the PDB this way, but I take this as indicative of how much repetition there is in the sampling. Perhaps I need to brush up on my statistics, but I did not expect this.
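The tallies above amount to counting every overlapping 7-residue window and then asking how many distinct heptapeptides occur once, twice, and so on. Below is a minimal Python sketch of that bookkeeping. It is not the script that was actually used: the function names `count_kmers` and `multiplicity_fractions` and the toy sequences are made up for illustration, and windows containing "X" are skipped on the assumption that this matches the "eliminate X" step.

```python
from collections import Counter


def count_kmers(sequences, k=7, skip_char="X"):
    """Count every overlapping k-residue window, skipping windows that contain skip_char."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if skip_char not in kmer:
                counts[kmer] += 1
    return counts


def multiplicity_fractions(counts):
    """Map n -> fraction of distinct k-mers that occur exactly n times."""
    histogram = Counter(counts.values())
    total = len(counts)
    return {n: histogram[n] / total for n in sorted(histogram)}


# Toy input (hypothetical sequences, not taken from the PDB download).
toy_sequences = [
    "MKTAYIAKQRQISFVK",
    "GSSHHHHHHSSGLVPRGSH",
    "MKTAYIAKXXXXXXXQ",
]
toy_counts = count_kmers(toy_sequences)
print(toy_counts.most_common(3))
print(multiplicity_fractions(toy_counts))
```

On a real download the same two functions would give the "most common heptapeptide" list and the once/twice/3x fractions, though a few hundred million windows would likely need something more memory-efficient than an in-memory Counter.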
So, I said "heck with it" and simply extracted all heptads that only occur EXACTLY ONCE in the PDB. There are 7.4e6 of them. I chose one at random as the starting 7 residues: FLACISE (from 6ekr). Rather than simply appending random heptads, I took the final ISE and extracted all one-off heptads that begin with ISE. There are 1905 of those. I chose one at random: ISENDGN (from 6ixy). I then extracted heptads beginning with DGN, etc. Final sequence of randomly-assembled yet abutting heptads from random places in the PDB is then:
random1
FLACISENDGNDGDKSNLAQDTDLALGTWVRDISIALRSGFIAEEYPAWWKESVVPYYNQLNGN
DIEHSHSALLPTGIYFLEIPLVGTRDVALDTEGETTEEAFRHLDGIVGHPIVIGRRAADGGPLVIRGL
NEVLLRLSEVYNAWKLLKVQEILKGLEAYNRAMDYETILREGTWKPAGYTPNALAGGAQCGNAD
IVNERIISDGVEASQLATLERRENRALQTALEIENGELEQKWGRLPSAXPAVERQRYRGPSTATW
EGFSTDGRFPSEVDIFDRETLDGVEGLAKDVEALKGVLMGLFAPYWKWCTHNCIGYAARFVALN
KAIRTALTFARTHSNNETIHNEVVGMIPDIDIAVSDINSQEYTKTRSERQGGVLKEHLNALNAKIEPS
VNLKVSA
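The heptad-chaining procedure described above (take the last three residues, pick a random one-off heptad that starts with them, append its remaining four residues) can be sketched in Python as follows. This is only an illustration under assumptions, not the original script: `index_by_prefix` and `chain_heptads` are hypothetical names, the three-heptad pool is a toy stand-in for the 7.4e6 one-off heptads, and whether a heptad may be reused or what happens at a dead end is not stated in the message, so the sketch simply allows reuse and stops when no continuation exists. The third heptad, DGNDGDK, is read off the start of the posted sequence.

```python
import random
from collections import defaultdict


def index_by_prefix(heptads, overlap=3):
    """Group heptads by their first `overlap` residues for fast lookup."""
    by_prefix = defaultdict(list)
    for heptad in heptads:
        by_prefix[heptad[:overlap]].append(heptad)
    return by_prefix


def chain_heptads(heptads, n_steps, overlap=3, start=None, rng=None):
    """Grow a sequence by repeatedly appending a heptad whose first
    `overlap` residues match the current last `overlap` residues."""
    rng = rng or random.Random()
    pool = sorted(heptads)
    by_prefix = index_by_prefix(pool, overlap)
    seq = start if start is not None else rng.choice(pool)
    for _ in range(n_steps):
        candidates = by_prefix.get(seq[-overlap:])
        if not candidates:
            break  # dead end: no heptad begins with the current tail
        seq += rng.choice(candidates)[overlap:]  # append only the new residues
    return seq


# Toy pool: the first three heptads visible in the posted sequence.
toy_pool = ["FLACISE", "ISENDGN", "DGNDGDK"]
print(chain_heptads(toy_pool, n_steps=2, start="FLACISE"))  # FLACISENDGNDGDK
```

Each step adds four new residues, so a chain of n successful steps has 7 + 4n residues. Passing a seeded `random.Random` as `rng` makes a run reproducible.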
A quick BLASTP shows a surprisingly high homology match:
I mean... not great, but higher than I'd expect for "completely random"?
The AlphaFold2 Colab result looks like this:
So still lots of helical content. This shape makes me want to make this sequence and see if it folds. It probably doesn't. But if this thing weren't red I'd wonder.
Discuss?
-James
-----
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research    Tel: +44 1223 336500
The Keith Peters Building                   Fax: +44 1223 336827
Hills Road                                  E-mail: [email protected]
Cambridge CB2 0XY, U.K.                     www-structmed.cimr.cam.ac.uk
I did not try that. And I also did not save it to my Google Drive. Guess I will have to run it again...
On 4/26/2022 7:43 AM, Randy John Read wrote:
Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.
Randy