dev-572 I am checking out sculptor, looks pretty cool, but I provided what i thought was a clustal format alignment to sculptor using --verbose, and it reports "Sorry: Wrong alignment format:" that's it. sounds like user error, but i'm not sure where to proceed next. i tried mafft and clustalw2 output for this. a sample of a bit of the top of the .aln is below.
-Bryan
p.s: shouldn't readseq convert it to .pir? that'd be code 14 PIR/CODATA?
CLUSTAL format alignment by MAFFT L-INS-i (v6.833b)
gi|16132042|ref --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
gi|157163695|re --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
On Thu, Nov 18, 2010 at 2:31 PM, Bryan Lepore
I am checking out sculptor, looks pretty cool, but I provided what i thought was a clustal format alignment to sculptor using --verbose, and it reports
"Sorry: Wrong alignment format:"
that's it. sounds like user error, but i'm not sure where to proceed next. i tried mafft and clustalw2 output for this. a sample of a bit of the top of the .aln is below.
CLUSTAL format alignment by MAFFT L-INS-i (v6.833b)
gi|16132042|ref --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
gi|157163695|re --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
I just looked at the parser code, and I think it is expecting exactly the same output as CLUSTALW, which has a slightly different first line than MAFFT. We now distribute a multiple sequence alignment program called MUSCLE with Phenix - you can use it like this:

  phenix.muscle -in seqs.fa -out seqs.aln -clwstrict

I hope this fixes it - if not, Sculptor should also take alignments in FASTA or PIR format. (Make sure you use the correct file extension, because this is what it uses to decide how to parse a file.)
-Nat
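To illustrate what "uses the file extension to decide how to parse a file" can look like, here is a minimal Python sketch. It is not Sculptor's actual code: the function name `guess_alignment_format` and the extension table are assumptions made for illustration, and the set of extensions Sculptor really recognizes may differ.

```python
import os

# Hypothetical extension table; the extensions Sculptor actually
# accepts are not listed in this thread and may differ.
FORMAT_BY_EXTENSION = {
    ".aln": "clustal",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".pir": "pir",
}


def guess_alignment_format(file_name):
    """Return a format name chosen only from the file extension."""
    ext = os.path.splitext(file_name)[1].lower()
    try:
        return FORMAT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError("Unrecognized alignment extension: %r" % ext)
```

With a scheme like this, a valid FASTA alignment saved as `seqs.aln` would be handed to the CLUSTAL parser and rejected, which is why the extension matters.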
On Thu, Nov 18, 2010 at 5:47 PM, Nathaniel Echols
On Thu, Nov 18, 2010 at 2:31 PM, Bryan Lepore
wrote: I think it is expecting exactly the same output as CLUSTALW, which has a slightly different first line than MAFFT.
bingo
We now distribute a multiple sequence alignment program called MUSCLE with Phenix - you can use it like this:
yes thanks - i saw that in there ;^)
Sculptor should also take alignments in FASTA or PIR format. (Make sure you use the correct file extension,
HERE we go - .fasta is running. but clustalw extension is .aln, right? oh also : wouldn't this be a good candidate for multi-processor threading? -Bryan
On Thu, Nov 18, 2010 at 3:04 PM, Bryan Lepore
Sculptor should also take alignments in FASTA or PIR format. (Make sure you use the correct file extension,
HERE we go - .fasta is running. but clustalw extension is .aln, right?
Correct.
oh also : wouldn't this be a good candidate for multi-processor threading?
Which part? -Nat
oh, and i figure that the alignment needs the structure's sequence in there - right? this would actually save me some keystrokes. -Bryan
On Thu, Nov 18, 2010 at 3:39 PM, Bryan Lepore
oh, and i figure that the alignment needs the structure's sequence in there - right? this would actually save me some keystrokes.
Yes, and the target sequence (i.e. the protein that you crystallized) needs to be first. One annoying thing about MUSCLE is that it sorts sequences and (currently) won't let you disable this behavior, so you may need to do some cut-and-pasting. Maybe we need to modify the Python wrapper to do this automatically. (Although I think it would be even easier if Sculptor did the alignment automatically.) -Nat
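The cut-and-pasting step can be scripted. The following is a minimal sketch, assuming the alignment is in FASTA format and the target can be identified by the start of its header line; the helper names (`read_fasta`, `move_target_first`, `write_fasta`) are hypothetical and are not part of Phenix or MUSCLE.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def move_target_first(records, target_name):
    """Return the records with the first header match moved to the front."""
    for i, (header, _) in enumerate(records):
        if header.startswith(target_name):
            return [records[i]] + records[:i] + records[i + 1:]
    raise ValueError("No record whose header starts with %r" % target_name)


def write_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(">%s\n%s\n" % (h, s) for h, s in records)
```

Usage would be something like `write_fasta(move_target_first(read_fasta(text), "my_target"))`, where `my_target` stands for whatever name the target carries in the alignment. The relative order of the other sequences is left unchanged.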
Yes, and the target sequence (i.e. the protein that you crystallized) needs to be first. One annoying thing about MUSCLE is that it sorts sequences and (currently) won't let you disable this behavior, so you may need to do some cut-and-pasting. Maybe we need to modify the Python wrapper to do this automatically. (Although I think it would be even easier if Sculptor did the alignment automatically.)
-Nat
This is actually a bit odd, since one could run version 3.7 of MUSCLE with a -stable flag that would output sequences in the same order as the input... but apparently this has been disabled in the version that ships with PHENIX (3.8.31). Maybe we could just ask Robert Edgar to put it back? Luca
On Fri, Nov 19, 2010 at 4:22 AM, Jovine Luca
This is actually a bit odd, since one could run version 3.7 of MUSCLE with a -stable flag that would output sequences in the same order as the input... but apparently this has been disabled in the version that ships with PHENIX (3.8.31). Maybe we could just ask Robert Edgar to put it back?
Yeah, I'll ask him about this. Unfortunately we needed a bug fix in 3.8, otherwise I might have left it alone. But I still think the solution is to do the alignment internally - the user should be able to provide a PDB file and a target sequence, and the program will run MUSCLE automatically and figure out which row in the alignment output corresponds to each sequence. This avoids an extra step and would clear up the confusion regarding formats, ordering, etc. -Nat
But I still think the solution is to do the alignment internally - the user should be able to provide a PDB file and a target sequence, and the program will run MUSCLE automatically and figure out which row in the alignment output corresponds to each sequence. This avoids an extra step and would clear up the confusion regarding formats, ordering, etc.
-Nat
I agree this would be a solution to at least part of the problem. There is already some support for this, e.g.

  sequence {
    file_name = foo.seq
    chain_ids = A,B,C
  }

would create an alignment internally using the sequence from foo.seq and chains A, B and C (and apply it to the same chains). It is not currently using muscle, but there is possibly no need - these are only pairwise sequence alignments. On the other hand, the alignment can be of crucial importance, and the user should not be limited to alignment algorithms implemented in Sculptor (and it is possibly not realistic to have an interface to all alignment programs, many of which are web services). Creating multiple sequence alignments via Sculptor (which may be more precise) would also complicate the user interface, for little apparent gain.
BW, Gabor
"Sorry: Wrong alignment format:"
I agree that the message is not very helpful. I will change the parser so that you get something more informative!
CLUSTAL format alignment by MAFFT L-INS-i (v6.833b)
gi|16132042|ref --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
gi|157163695|re --ANVRLQVEGLSGQLEKNVRAQLSTIESDEVTPDRRFRARVDDAIREGLKALGYYQPTI
In this case, the problem is that the CLUSTAL header is not what the parser expects - I was not aware that there are so many .aln-like formats out there. Fortunately, this is just a simple fix.
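One way such a fix could look is a header check that only requires the first line to begin with the word CLUSTAL, rather than matching one program's exact banner. This is a sketch under that assumption, not the change that was made in Sculptor; `is_clustal_header` is a hypothetical name.

```python
def is_clustal_header(line):
    """Accept any first line that begins with the word CLUSTAL.

    This is deliberately lenient: it does not check the program name or
    version that follows, since those differ between tools.
    """
    return line.strip().upper().startswith("CLUSTAL")
```

The MAFFT header quoted above passes this check, as would a ClustalW-style banner, while a FASTA header line starting with ">" does not.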
Yes, and the target sequence (i.e. the protein that you crystallized) needs to be first. One annoying thing about MUSCLE is that it sorts sequences and (currently) won't let you disable this behavior, so you may need to do some cut-and-pasting. Maybe we need to modify the Python wrapper to do this automatically. (Although I think it would be even easier if Sculptor did the alignment automatically.)
There is already some alignment functionality in Sculptor, but it is not very sophisticated (this is why you could use an external alignment if you want full control). Just specify a target sequence and the chain id in the PDB file you want to align the sequence with, and it does the rest.

It seems to be a common property of alignment programs to rearrange the sequences, and I am wondering what would be more convenient from the user perspective. We could in principle write wrappers for programs that come with PHENIX so that the order of sequences stays the same, but this will not solve the problem for external alignments. Would it be sufficient to provide an option, so that one can specify the index of the target sequence in the alignment file? Would it be better to provide the target sequence, and let Sculptor figure out the index itself?

BW, Gabor
--
##################################################
Dr Gabor Bunkoczi
Cambridge Institute for Medical Research
Wellcome Trust/MRC Building
Addenbrooke's Hospital
Hills Road
Cambridge CB2 0XY
##################################################
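For the second option (provide the target sequence and let the program find the matching row), a minimal sketch could compare the target with each alignment row after stripping gap characters. The names `ungapped` and `find_target_index` are hypothetical, and the sketch assumes the target sequence appears in the alignment exactly, which may not hold if the alignment was made from a truncated or mutated construct.

```python
def ungapped(seq):
    """Remove gap characters and normalize case."""
    return seq.replace("-", "").replace(".", "").upper()


def find_target_index(aligned_sequences, target_sequence):
    """Return the index of the single row whose residues equal the target."""
    wanted = ungapped(target_sequence)
    matches = [
        i for i, row in enumerate(aligned_sequences) if ungapped(row) == wanted
    ]
    if len(matches) != 1:
        # Zero matches or duplicate rows are both ambiguous; let the user decide.
        raise ValueError("Expected exactly one match, found %d" % len(matches))
    return matches[0]
```

Raising an error on zero or multiple matches keeps the explicit index option useful as a fallback.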
On Fri, Nov 19, 2010 at 5:16 AM, Dr G. Bunkoczi
Would it be sufficient to provide an option, so that one can specify the index of the target sequence in the alignment file?
in the interest of automation, i guess - though the titles (the stuff on the line starting with ">") can get pretty disgusting if you keep everything from the database it came from : non-alphabetic characters, spaces,
Would it be better to provide the target sequence, and let Sculptor figure out the index itself?
can't say if its better, but if it works - why not. i have seen programs where you say e.g. "fubar" and then the alignment has a part with ">fubar" then everything between ">fubar" and the next ">" is defined as the target sequence (IIUC). ... anyways, how is the target sequence used that a consensus sequence couldn't be used instead? i mean, the model looks like it has one applied to it. -Bryan
Hi Bryan,
i have seen programs where you say e.g. "fubar" and then the alignment has a part with ">fubar" then everything between ">fubar" and the next ">" is defined as the target sequence (IIUC).
OK, this is easy - simply select the sequence with a keyword (which may be a database record identifier) as target.
... anyways, how is the target sequence used that a consensus sequence couldn't be used instead? i mean, the model looks like it has one applied to it.
Many current algorithms only work with pairwise alignments, and therefore require a target sequence, e.g. delete residues from the model that align with gaps in the target. Some of these concepts can be generalized to multiple sequence alignments, and make these decisions based on the local sequence similarity (calculated from residue substitution scores taking nearby residues into account), but not all (e.g. if a Phe in the model aligns with a Gly in the target, one would possibly want the Phe sidechain to be deleted). One could possibly get away without a target sequence by selecting the right algorithms, but I am wondering whether this has any relevance to current practice. The sequence of the protein is usually known, and indeed used in the homology search to provide template models. BW, Gabor
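The two pairwise operations mentioned above (deleting model residues that align with gaps in the target, and flagging substitutions such as Phe in the model against Gly in the target) can be sketched as follows. This is an illustration of the idea only, not Sculptor's algorithm; the function names and the use of 0-based indices into the model sequence are assumptions.

```python
def residues_to_keep(target_aligned, model_aligned):
    """Return 0-based model residue indices that align with a target residue."""
    if len(target_aligned) != len(model_aligned):
        raise ValueError("Aligned sequences must have equal length")
    keep = []
    model_index = 0
    for t, m in zip(target_aligned, model_aligned):
        if m == "-":
            continue  # no model residue in this column
        if t != "-":
            keep.append(model_index)  # model residue has a target partner
        model_index += 1
    return keep


def mismatched_residues(target_aligned, model_aligned):
    """Return (model_index, model_residue, target_residue) for substitutions."""
    if len(target_aligned) != len(model_aligned):
        raise ValueError("Aligned sequences must have equal length")
    mismatches = []
    model_index = 0
    for t, m in zip(target_aligned, model_aligned):
        if m == "-":
            continue
        if t != "-" and t != m:
            mismatches.append((model_index, m, t))
        model_index += 1
    return mismatches
```

For example, with target `AC-DE` and model `ACWD-`, the model's third residue (W) aligns with a gap and is dropped. What to do with each mismatch (truncate the side chain, keep it, and so on) is a separate decision that this sketch leaves open.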
participants (4)
- Bryan Lepore
- Dr G. Bunkoczi
- Jovine Luca
- Nathaniel Echols