Re: [phenixbb] NaN of AlphaFold

26 Apr 2022

Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.

Randy
...
On 24 Apr 2022, at 01:36, James Holton wrote:
I downloaded all sequences from the PDB to do a frequency analysis, which I think will enable a more sophisticated "random peptide" than giving all 20 amino acids equal likelihood with no neighbor correlations.  Turns out the most common heptapeptide in the PDB is:
XXXXXXX
because, apparently, NCBI lists gaps like this.  If I eliminate "X", the most common heptapeptide is:
SHHHHHH
Right. His tags.  If I eliminate those, the next one is:
GLVPRGS
thrombin cleavage site. Ugh!
Cleaving off those, the next one is:
AAAAAAA
apparently also used to denote unknown amino acids.
Ignoring those, the next one is:
YFPEPVT
this is from the human antibody heavy chain. Thousands of those in the PDB.  This made me realize that there is going to be a lot of "homology bias".
I next tried the NCBI refseq_select to try and get something non-redundant, and then I get:
GSGKSTL
lots of ABC transporters. >10k of this sequence in the db.
On the other hand, of the ~330e6 heptapeptide sequences I'm looking at, 52% of them only appear once. Another 19% of my list of unique sequences occur twice, 9% occur 3x, etc. It is a VERY steep curve. Only 4% of my sample sequences occur more than 10 times, and 0.05% occur more than 100x.  I never thought of the PDB this way, but I take this as indicative of how much repetition there is in the sampling.  Perhaps I need to brush up on my statistics, but I did not expect this.
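A minimal sketch of the kind of heptapeptide tally described above, for anyone who wants to reproduce the numbers. It is not the script actually used: the function names `count_kmers` and `occurrence_spectrum` are made up for illustration, the input is assumed to be a plain list of one-letter sequence strings, and the percentages are taken as fractions of distinct heptapeptides, which is only one possible reading of the figures quoted.

```python
from collections import Counter

def count_kmers(sequences, k=7, skip_chars="X", exclude=frozenset()):
    """Count every overlapping k-mer in a list of one-letter sequences.

    k-mers containing any character in skip_chars (e.g. the gap/unknown
    marker "X") or listed in exclude (e.g. "SHHHHHH", "GLVPRGS",
    "AAAAAAA") are ignored.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in exclude or any(c in skip_chars for c in kmer):
                continue
            counts[kmer] += 1
    return counts

def occurrence_spectrum(counts):
    """Map n -> fraction of distinct k-mers that occur exactly n times."""
    spectrum = Counter(counts.values())
    total = len(counts)
    return {n: m / total for n, m in sorted(spectrum.items())}

if __name__ == "__main__":
    # Toy input only; real input would be the downloaded PDB sequences.
    toy = ["ABCDEFGH", "ABCDEFGX", "XXXXXXXX"]
    counts = count_kmers(toy)
    print(counts.most_common(1))
    print(occurrence_spectrum(counts))
```

Passing the tag and linker heptapeptides through `exclude` mimics the manual "eliminate those, look at the next one" steps; `counts.most_common(1)` then gives the current top hit.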
So, I said "heck with it" and simply extracted all heptads that occur EXACTLY ONCE in the PDB.  There are 7.4e6 of them.  I chose one at random as the starting 7 residues: FLACISE (from 6ekr). Rather than simply appending random heptads, I took the final ISE and extracted all one-off heptads that begin with ISE. There are 1905 of those.  I chose one at random: ISENDGN (from 6ixy). I then extracted heptads beginning with DGN, etc.  The final sequence of randomly assembled yet abutting heptads from random places in the PDB is then:
...
random1
FLACISENDGNDGDKSNLAQDTDLALGTWVRDISIALRSGFIAEEYPAWWKESVVPYYNQLNGN
DIEHSHSALLPTGIYFLEIPLVGTRDVALDTEGETTEEAFRHLDGIVGHPIVIGRRAADGGPLVIRGL
NEVLLRLSEVYNAWKLLKVQEILKGLEAYNRAMDYETILREGTWKPAGYTPNALAGGAQCGNAD
IVNERIISDGVEASQLATLERRENRALQTALEIENGELEQKWGRLPSAXPAVERQRYRGPSTATW
EGFSTDGRFPSEVDIFDRETLDGVEGLAKDVEALKGVLMGLFAPYWKWCTHNCIGYAARFVALN
KAIRTALTFARTHSNNETIHNEVVGMIPDIDIAVSDINSQEYTKTRSERQGGVLKEHLNALNAKIEPS
VNLKVSA
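The chaining procedure that produced random1 (start from a one-off heptad, then repeatedly pick another one-off heptad that begins with the last three residues and append its remaining four) can be sketched roughly as below. This is an illustration under assumptions, not the original script: `build_prefix_index` and `assemble` are hypothetical names, and the list of one-off heptads is assumed to have been extracted already.

```python
import random
from collections import defaultdict

def build_prefix_index(heptads, overlap=3):
    """Group heptads by their first `overlap` residues."""
    index = defaultdict(list)
    for h in heptads:
        index[h[:overlap]].append(h)
    return index

def assemble(heptads, n_steps, start=None, overlap=3, rng=None):
    """Chain heptads so that each one begins with the last `overlap`
    residues of the growing sequence; stop early at a dead end."""
    if rng is None:
        rng = random.Random()
    heptads = sorted(heptads)  # fixed order so a seeded rng is reproducible
    index = build_prefix_index(heptads, overlap)
    seq = start if start is not None else rng.choice(heptads)
    for _ in range(n_steps):
        candidates = index.get(seq[-overlap:])
        if not candidates:
            break  # no one-off heptad begins with this tripeptide
        nxt = rng.choice(candidates)
        seq += nxt[overlap:]  # append only the non-overlapping residues
    return seq

if __name__ == "__main__":
    # The first three heptads implied by the start of random1.
    toy = ["FLACISE", "ISENDGN", "DGNDGDK"]
    print(assemble(toy, n_steps=5, start="FLACISE"))
```

With the three heptads shown, the result is the first 15 residues of random1, FLACISENDGNDGDK, after which the toy list runs out of continuations.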
A quick BLASTP shows a surprisingly high homology match:

I mean... not great, but higher than I'd expect for "completely random" ?
The alphafold2 colab result looks like this:

So still lots of helical content.  This shape makes me want to make this sequence and see if it folds.  It probably doesn't.  But if this thing weren't red I'd wonder.
Discuss?
-James
-----
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research    Tel: +44 1223 336500
The Keith Peters Building                   Fax: +44 1223 336827
Hills Road                                  E-mail: [email protected]
Cambridge CB2 0XY, U.K.                     www-structmed.cimr.cam.ac.uk
