[phenixbb] NaN of AlphaFold
Randy John Read
rjr27 at cam.ac.uk
Tue Apr 26 07:43:08 PDT 2022
Did you try looking at a surface representation of the structure? The one that Pavel generated from a more simplistic random sequence looked vaguely plausible as a chain trace, but had very large cavities when viewed as accessible surface, e.g. in ChimeraX.
Randy
> On 24 Apr 2022, at 01:36, James Holton <jmholton at lbl.gov> wrote:
>
> I downloaded all sequences from the PDB to do a frequency analysis, which I think will enable a more sophisticated "random peptide" than giving all 20 amino acids equal likelihood with no neighbor correlations. Turns out the most common heptapeptide in the PDB is:
> XXXXXXX
>
> because, apparently, NCBI lists gaps like this. If I eliminate "X", the most common heptapeptide is:
>
> SHHHHHH
>
> Right. His tags. If I eliminate those, the next one is:
> GLVPRGS
> thrombin cleavage site. Ugh!
>
> Cleaving off those, the next one is:
> AAAAAAA
> apparently also used to denote unknown amino acids.
>
> Ignoring those, the next one is:
> YFPEPVT
> this is from the human antibody heavy chain. Thousands of those in the PDB. This made me realize that there is going to be a lot of "homology bias".
>
> I next tried the NCBI refseq_select to try and get something non-redundant, and then I get:
> GSGKSTL
> lots of ABC transporters. >10k of this sequence in the db.
>
> On the other hand, of the ~330e6 heptapeptide sequences I'm looking at, 52% of them only appear once. Another 19% of my list of unique sequences occur twice, 9% occur 3x, etc. It is a VERY steep curve. Only 4% of my sample sequences occur more than 10 times, and 0.05% occur more than 100x. I never thought of the PDB this way, but I take this as indicative of how much repetition there is in the sampling. Perhaps I need to brush up on my statistics, but I did not expect this.
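The tally described above could be sketched roughly as follows. This is only an illustration, not the script James used: the function names are hypothetical, "eliminating X" is assumed to mean dropping any heptapeptide that contains an X, and the percentages are assumed to be fractions of the unique heptapeptides.

```python
from collections import Counter


def count_kmers(sequences, k=7, skip="X"):
    """Count every overlapping k-residue window in the given sequences.

    Windows containing any character in `skip` are ignored (assumption:
    this is what "eliminating X" means in the message above).
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if any(c in skip for c in kmer):
                continue
            counts[kmer] += 1
    return counts


def occurrence_histogram(counts):
    """Map n -> fraction of unique k-mers that occur exactly n times."""
    hist = Counter(counts.values())
    total = len(counts)
    return {n: hist[n] / total for n in sorted(hist)}


def one_off_kmers(counts):
    """Return the k-mers that occur exactly once."""
    return [kmer for kmer, n in counts.items() if n == 1]
```

For the full PDB sequence set the Counter would hold a few hundred million windows, so a real run would probably need a disk-backed or sorted-file approach rather than an in-memory dictionary.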
>
> So, I said "heck with it" and simply extracted all heptads that occur EXACTLY ONCE in the PDB. There are 7.4e6 of them. I chose one at random as the starting 7 residues: FLACISE (from 6ekr). Rather than simply appending random heptads, I took the final ISE and extracted all one-off heptads that begin with ISE. There are 1905 of those. I chose one at random: ISENDGN (from 6ixy). I then extracted heptads beginning with DGN, etc. The final sequence of randomly assembled yet abutting heptads from random places in the PDB is then:
>
> >random1
> FLACISENDGNDGDKSNLAQDTDLALGTWVRDISIALRSGFIAEEYPAWWKESVVPYYNQLNGN
> DIEHSHSALLPTGIYFLEIPLVGTRDVALDTEGETTEEAFRHLDGIVGHPIVIGRRAADGGPLVIRGL
> NEVLLRLSEVYNAWKLLKVQEILKGLEAYNRAMDYETILREGTWKPAGYTPNALAGGAQCGNAD
> IVNERIISDGVEASQLATLERRENRALQTALEIENGELEQKWGRLPSAXPAVERQRYRGPSTATW
> EGFSTDGRFPSEVDIFDRETLDGVEGLAKDVEALKGVLMGLFAPYWKWCTHNCIGYAARFVALN
> KAIRTALTFARTHSNNETIHNEVVGMIPDIDIAVSDINSQEYTKTRSERQGGVLKEHLNALNAKIEPS
> VNLKVSA
>
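The assembly procedure James describes (pick a one-off heptad, then repeatedly pick another one-off heptad whose first three residues match the last three of the growing chain) might look something like the sketch below. The function names and the `start`, `n_steps`, and `rng` parameters are hypothetical. Whether a heptad may be reused and how the chain length was chosen are not stated in the message, so this version allows reuse and simply stops after a fixed number of extensions or at a dead end.

```python
import random
from collections import defaultdict


def build_prefix_index(heptads, overlap=3):
    """Group heptads by their first `overlap` residues."""
    index = defaultdict(list)
    for h in heptads:
        index[h[:overlap]].append(h)
    return index


def assemble_chain(heptads, n_steps, overlap=3, start=None, rng=None):
    """Chain heptads so each new one begins with the previous tail.

    Only the residues after the shared `overlap` are appended, so
    FLACISE followed by ISENDGN gives FLACISENDGN.
    """
    rng = rng or random.Random()
    pool = sorted(heptads)  # fixed order so a seeded rng is reproducible
    index = build_prefix_index(pool, overlap)
    chain = start if start is not None else rng.choice(pool)
    for _ in range(n_steps):
        candidates = index.get(chain[-overlap:])
        if not candidates:
            break  # dead end: no one-off heptad starts with this tail
        chain += rng.choice(candidates)[overlap:]
    return chain
```

With the three heptads quoted in the message, assemble_chain(["FLACISE", "ISENDGN", "DGNDGDK"], n_steps=2, start="FLACISE") reproduces the first 15 residues of random1.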
> A quick BLASTP shows a surprisingly high homology match:
> <o4TgiNP8vaFOcUkl.png>
>
> I mean... not great, but higher than I'd expect for "completely random" ?
>
>
> The alphafold2 colab result looks like this:
>
>
> <rHffn0Fa0WfwR4Yg.png>
>
> So still lots of helical content. This shape makes me want to make this sequence and see if it folds. It probably doesn't. But if this thing weren't red I'd wonder.
>
> Discuss?
>
> -James
>
-----
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research     Tel: +44 1223 336500
The Keith Peters Building                    Fax: +44 1223 336827
Hills Road                                   E-mail: rjr27 at cam.ac.uk
Cambridge CB2 0XY, U.K.                      www-structmed.cimr.cam.ac.uk