[phenixbb] NaN of AlphaFold

James Holton jmholton at lbl.gov
Tue Apr 26 08:24:49 PDT 2022


I did not try that.  And I also did not save it to my Google Drive. 
Guess I will have to run it again...

On 4/26/2022 7:43 AM, Randy John Read wrote:
> Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.
>
> Randy
>
>> On 24 Apr 2022, at 01:36, James Holton <jmholton at lbl.gov> wrote:
>>
>> I downloaded all sequences from the PDB to do a frequency analysis, which I think will enable a more sophisticated "random peptide" than giving all 20 amino acids equal likelihood with no neighbor correlations.  Turns out the most common heptapeptide in the PDB is:
>> XXXXXXX
>>
>> because, apparently, NCBI lists gaps like this.  If I eliminate "X", the most common heptapeptide is:
>>
>> SHHHHHH
>>
>> Right. His tags.  If I eliminate those, the next one is:
>> GLVPRGS
>> thrombin cleavage site. Ugh!
>>
>> Cleaving off those, the next one is:
>> AAAAAAA
>> apparently also used to denote unknown amino acids.
>>
>> Ignoring those, the next one is:
>> YFPEPVT
>> this is from the human antibody heavy chain. Thousands of those in the PDB.  This made me realize that there is going to be a lot of "homology bias".
>>
>> I next tried the NCBI refseq_select to try and get something non-redundant, and then I get:
>> GSGKSTL
>> lots of ABC transporters. >10k of this sequence in the db.
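
A minimal sketch of the heptapeptide tally described above, assuming the sequences are already loaded as plain one-letter strings. The name `count_kmers`, the `skip` parameter, and the toy sequences are hypothetical illustrations, not the script that was actually run, and dropping every window that contains an "X" is an assumption about how the gaps were eliminated.

```python
from collections import Counter

def count_kmers(sequences, k=7, skip=("X",)):
    """Count every overlapping k-residue window across the given sequences.

    Windows containing any residue code listed in `skip` are ignored
    (assumption: this is how "X" gap markers are eliminated).
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if any(code in kmer for code in skip):
                continue
            counts[kmer] += 1
    return counts

# Toy input only: two made-up tagged constructs and one gap-filled entry.
toy_sequences = [
    "MGSSHHHHHHSSGLVPRGSH",
    "MAHHHHHHGLVPRGSM",
    "XXXXXXXXAAAA",
]
counts = count_kmers(toy_sequences)
print(counts.most_common(3))
```

Filtering out His tags, cleavage sites and poly-Ala placeholders would then be a matter of deleting those keys from the counter before ranking.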
>>
>> On the other hand, of the ~330e6 heptapeptide sequences I'm looking at, 52% of them only appear once. Another 19% of my list of unique sequences occur twice, 9% occur 3x, etc. It is a VERY steep curve. Only 4% of my sample sequences occur more than 10 times, and 0.05% occur more than 100x.  I never thought of the PDB this way, but I take this as indicative of how much repetition there is in the sampling.  Perhaps I need to brush up on my statistics, but I did not expect this.
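
The steep curve above is a "count of counts": for each occurrence number n, the fraction of distinct heptapeptides seen exactly n times. A rough sketch follows, assuming a counter of heptapeptide occurrences is already in hand and that the percentages are taken over distinct sequences. The function names and the toy counts are hypothetical.

```python
from collections import Counter

def occurrence_spectrum(kmer_counts):
    """Map n -> fraction of distinct k-mers that occur exactly n times."""
    spectrum = Counter(kmer_counts.values())
    total = sum(spectrum.values())
    return {n: spectrum[n] / total for n in sorted(spectrum)}

def fraction_above(kmer_counts, threshold):
    """Fraction of distinct k-mers that occur more than `threshold` times."""
    total = len(kmer_counts)
    return sum(1 for c in kmer_counts.values() if c > threshold) / total

# Toy counts only, not PDB data.
toy_counts = Counter({"AAAAAAA": 1, "CCCCCCC": 1, "DDDDDDD": 2, "EEEEEEE": 12})
spectrum = occurrence_spectrum(toy_counts)
print(spectrum)
print(fraction_above(toy_counts, 10))
```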
>>
>> So, I said "heck with it" and simply extracted all heptads that occur EXACTLY ONCE in the PDB.  There are 7.4e6 of them.  I chose one at random as the starting 7 residues: FLACISE (from 6ekr). Rather than simply appending random heptads, I took the final ISE and extracted all one-off heptads that begin with ISE. There are 1905 of those.  I chose one at random: ISENDGN (from 6ixy). I then extracted heptads beginning with DGN, etc., so that each new heptad overlaps the previous one by three residues.  The final sequence of randomly assembled yet abutting heptads from random places in the PDB is then:
>>
>> >random1
>> FLACISENDGNDGDKSNLAQDTDLALGTWVRDISIALRSGFIAEEYPAWWKESVVPYYNQLNGN
>> DIEHSHSALLPTGIYFLEIPLVGTRDVALDTEGETTEEAFRHLDGIVGHPIVIGRRAADGGPLVIRGL
>> NEVLLRLSEVYNAWKLLKVQEILKGLEAYNRAMDYETILREGTWKPAGYTPNALAGGAQCGNAD
>> IVNERIISDGVEASQLATLERRENRALQTALEIENGELEQKWGRLPSAXPAVERQRYRGPSTATW
>> EGFSTDGRFPSEVDIFDRETLDGVEGLAKDVEALKGVLMGLFAPYWKWCTHNCIGYAARFVALN
>> KAIRTALTFARTHSNNETIHNEVVGMIPDIDIAVSDINSQEYTKTRSERQGGVLKEHLNALNAKIEPS
>> VNLKVSA
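
The chaining procedure that produced this sequence can be sketched roughly as follows, assuming the one-off heptads are available as a list of strings. The names `build_prefix_index` and `chain_heptads`, the `start` and `seed` parameters, and the dead-end handling are all hypothetical. The toy pool holds the two heptads named above plus two windows read off the posted sequence, not a real extract from the PDB.

```python
import random
from collections import defaultdict

def build_prefix_index(heptads, overlap=3):
    """Group heptads by their first `overlap` residues for fast lookup."""
    index = defaultdict(list)
    for heptad in heptads:
        index[heptad[:overlap]].append(heptad)
    return index

def chain_heptads(heptads, target_len, overlap=3, start=None, seed=None):
    """Grow a sequence by repeatedly picking a random heptad whose first
    `overlap` residues match the current last `overlap` residues, and
    appending only its non-overlapping tail."""
    rng = random.Random(seed)
    pool = sorted(heptads)
    index = build_prefix_index(pool, overlap)
    seq = start if start is not None else rng.choice(pool)
    while len(seq) < target_len:
        candidates = index.get(seq[-overlap:])
        if not candidates:
            break  # dead end: no heptad in the pool starts with this suffix
        seq += rng.choice(candidates)[overlap:]
    return seq

# Toy pool only.
toy_pool = ["FLACISE", "ISENDGN", "DGNDGDK", "GDKSNLA"]
result = chain_heptads(toy_pool, target_len=19, start="FLACISE", seed=0)
print(result)
```

Each step adds four new residues, so every window of seven taken at a stride of four is a heptad from the pool, while the windows in between are new combinations.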
>>
>> A quick BLASTP shows a surprisingly high homology match:
>> <o4TgiNP8vaFOcUkl.png>
>>
>> I mean... not great, but higher than I'd expect for "completely random" ?
>>
>>
>> The AlphaFold2 Colab result looks like this:
>>
>>
>> <rHffn0Fa0WfwR4Yg.png>
>>
>> So still lots of helical content.  This shape makes me want to make this sequence and see if it folds.  It probably doesn't.  But if this thing weren't red I'd wonder.
>>
>> Discuss?
>>
>> -James
>>
> -----
> Randy J. Read
> Department of Haematology, University of Cambridge
> Cambridge Institute for Medical Research     Tel: +44 1223 336500
> The Keith Peters Building                               Fax: +44 1223 336827
> Hills Road                                                       E-mail: rjr27 at cam.ac.uk
> Cambridge CB2 0XY, U.K.                              www-structmed.cimr.cam.ac.uk
>


