[phenixbb] NaN of AlphaFold

Tue Apr 26 07:43:08 PDT 2022

Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.

Randy

> On 24 Apr 2022, at 01:36, James Holton <jmholton at lbl.gov> wrote:
> 
> I downloaded all sequences from the PDB to do a frequency analysis, which I think will enable a more sophisticated "random peptide" than giving all 20 amino acids equal likelihood with no neighbor correlations.  Turns out the most common heptapeptide in the PDB is:
> XXXXXXX
> 
> because, apparently, NCBI lists gaps like this.  If I eliminate "X", the most common heptapeptide is:
> 
> SHHHHHH
> 
> Right. His tags.  If I eliminate those, the next one is:
> GLVPRGS
> thrombin cleavage site. Ugh!
> 
> Cleaving off those, the next one is:
> AAAAAAA
> apparently also used to denote unknown amino acids. 
> 
> Ignoring those, the next one is:
> YFPEPVT
> this is from the human antibody heavy chain. Thousands of those in the PDB.  This made me realize that there is going to be a lot of "homology bias".
> 
> I next tried the NCBI refseq_select to try and get something non-redundant, and then I get:
> GSGKSTL
> lots of ABC transporters. >10k of this sequence in the db.
> 
> On the other hand, of the ~330e6 heptapeptide sequences I'm looking at, 52% of them only appear once. Another 19% of my list of unique sequences occur twice, 9% occur 3x, etc. It is a VERY steep curve. Only 4% of my sample sequences occur more than 10 times, and 0.05% occur more than 100x.  I never thought of the PDB this way, but I take this as indicative of how much repetition there is in the sampling.  Perhaps I need to brush up on my statistics, but I did not expect this.
> 
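A minimal sketch of the counting step described above, for anyone who wants to reproduce the tally. It assumes overlapping 7-residue windows and that any window containing "X" is skipped; the post does not say exactly how the windows were taken, so both are assumptions. The function names and the toy sequences are made up for illustration and are not real PDB entries.

```python
from collections import Counter

def count_kmers(sequences, k=7, skip=("X",)):
    """Count every overlapping k-residue window across all sequences.

    Windows containing any character in `skip` are ignored
    (assumption: this is how gap markers such as "X" are dropped).
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if any(ch in kmer for ch in skip):
                continue
            counts[kmer] += 1
    return counts

def occurrence_histogram(counts):
    """Map n -> number of distinct k-mers that were seen exactly n times."""
    return Counter(counts.values())

# Toy input, invented for illustration only.
toy = ["MKTAYIAKQRQISFVK", "GSSHHHHHHSSGLVPRGSH", "MKTAYIAKQR"]
counts = count_kmers(toy)
hist = occurrence_histogram(counts)

# Heptads seen exactly once, i.e. the "one-off" pool.
singletons = sorted(kmer for kmer, n in counts.items() if n == 1)
```

Dividing each entry of `hist` by `len(counts)` gives the fraction of distinct heptads at each occurrence count, which is one way to read the percentages quoted above.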
> So, I said "heck with it" and simply extracted all heptads that occur EXACTLY ONCE in the PDB.  There are 7.4e6 of them.  I chose one at random as the starting 7 residues: FLACISE (from 6ekr). Rather than simply appending random heptads, I took the final ISE and extracted all one-off heptads that begin with ISE. There are 1905 of those.  I chose one at random: ISENDGN (from 6ixy). I then extracted heptads beginning with DGN, etc.  The final sequence of randomly assembled yet abutting heptads from random places in the PDB is then:
> 
> > random1
> FLACISENDGNDGDKSNLAQDTDLALGTWVRDISIALRSGFIAEEYPAWWKESVVPYYNQLNGN
> DIEHSHSALLPTGIYFLEIPLVGTRDVALDTEGETTEEAFRHLDGIVGHPIVIGRRAADGGPLVIRGL
> NEVLLRLSEVYNAWKLLKVQEILKGLEAYNRAMDYETILREGTWKPAGYTPNALAGGAQCGNAD
> IVNERIISDGVEASQLATLERRENRALQTALEIENGELEQKWGRLPSAXPAVERQRYRGPSTATW
> EGFSTDGRFPSEVDIFDRETLDGVEGLAKDVEALKGVLMGLFAPYWKWCTHNCIGYAARFVALN
> KAIRTALTFARTHSNNETIHNEVVGMIPDIDIAVSDINSQEYTKTRSERQGGVLKEHLNALNAKIEPS
> VNLKVSA
> 
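The chaining procedure that produced random1 can be sketched as follows. This is a hedged illustration, not the script that was actually used: the function names, the `start` parameter, and the stop-at-dead-end behaviour are assumptions. The only details taken from the post are the 3-residue overlap and the heptads FLACISE and ISENDGN; DGNDGDK is read off the start of the posted sequence.

```python
import random
from collections import defaultdict

def build_prefix_index(heptads, overlap=3):
    """Group heptads by their first `overlap` residues for fast lookup."""
    index = defaultdict(list)
    for h in heptads:
        index[h[:overlap]].append(h)
    return index

def chain_heptads(heptads, n_steps, overlap=3, start=None, rng=None):
    """Grow a sequence by repeatedly appending a random heptad whose first
    `overlap` residues match the current last `overlap` residues.

    Stops early if no heptad matches (assumption: the post does not say
    how dead ends were handled).
    """
    rng = rng or random.Random()
    index = build_prefix_index(heptads, overlap)
    seq = start if start is not None else rng.choice(heptads)
    for _ in range(n_steps):
        candidates = index.get(seq[-overlap:])
        if not candidates:
            break
        nxt = rng.choice(candidates)
        seq += nxt[overlap:]  # add only the 4 residues past the overlap
    return seq

# Tiny pool: with one candidate per step the result is deterministic.
pool = ["FLACISE", "ISENDGN", "DGNDGDK"]
result = chain_heptads(pool, n_steps=5, start="FLACISE")
print(result)
```

With the real pool of one-off heptads, `heptads` would be the list of singletons and each step would pick among many candidates (1905 for ISE, according to the post), so passing a seeded `random.Random` as `rng` keeps a run reproducible.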
> A quick BLASTP shows a surprisingly high homology match:
> <o4TgiNP8vaFOcUkl.png>
> 
> I mean... not great, but higher than I'd expect for "completely random"?
> 
> 
> The alphafold2 colab result looks like this:
> 
> 
> <rHffn0Fa0WfwR4Yg.png>
> 
> So still lots of helical content.  This shape makes me want to make this sequence and see if it folds.  It probably doesn't.  But if this thing weren't red I'd wonder.
> 
> Discuss?
> 
> -James
> 

-----
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research    Tel: +44 1223 336500
The Keith Peters Building                   Fax: +44 1223 336827
Hills Road                                  E-mail: rjr27 at cam.ac.uk
Cambridge CB2 0XY, U.K.                     www-structmed.cimr.cam.ac.uk