18 September 2009

Translating DNA with XALAN: A custom extension for XSLT

In this post, I'll show how I created a custom extension for XALAN, a Java-based XSLT engine. My favorite XSLT processor has always been xsltproc, but I was missing a way to write a custom function to process the XML document. Here I show how a Java class can be plugged into XALAN to translate a DNA sequence into a peptide.

The Java class

The following class, test.Translate, translates a DNA sequence into an amino acid sequence. The argument of the constructor is the transl_table string given by the NCBI (see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi): 64 amino acid letters, one per codon, with the bases ordered T, C, A, G. This is not rocket science.
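package test;

/** Translates DNA into protein using an NCBI transl_table string. */
public class Translate
{
    /** 64 amino acid letters, one per codon, bases ordered T, C, A, G. */
    private final String geneticCode;

    public Translate(String geneticCode)
    {
        this.geneticCode = geneticCode;
    }

    /** Maps a base to its rank in the NCBI ordering, or -1 if it is not T, C, A or G. */
    private static int base2index(char c)
    {
        switch (Character.toLowerCase(c))
        {
            case 't': return 0;
            case 'c': return 1;
            case 'a': return 2;
            case 'g': return 3;
            default: return -1;
        }
    }

    /** Translates each complete codon; a codon with an unknown base becomes '?'. */
    public String translate(String sequence)
    {
        StringBuilder b = new StringBuilder(1 + sequence.length() / 3);
        // Stop before a trailing partial codon.
        for (int i = 0; i + 2 < sequence.length(); i += 3)
        {
            int base1 = base2index(sequence.charAt(i));
            int base2 = base2index(sequence.charAt(i + 1));
            int base3 = base2index(sequence.charAt(i + 2));
            if (base1 == -1 || base2 == -1 || base3 == -1)
            {
                b.append('?');
            }
            else
            {
                // The first base selects a block of 16, the second a block of 4.
                b.append(geneticCode.charAt(base1 * 16 + base2 * 4 + base3));
            }
        }
        return b.toString();
    }
}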
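To make the table lookup concrete, here is a small standalone sketch of the index arithmetic used above. The class name CodonIndexDemo and the method codonIndex are hypothetical names introduced only for this illustration; the table string is the standard genetic code, the same value used as the default GENETICCODE in the stylesheet.

```java
final class CodonIndexDemo {
    // Standard genetic code (transl_table 1), bases ordered T, C, A, G.
    static final String STANDARD =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Same mapping as base2index in test.Translate.
    static int baseToIndex(char c) {
        switch (Character.toLowerCase(c)) {
            case 't': return 0;
            case 'c': return 1;
            case 'a': return 2;
            case 'g': return 3;
            default: return -1;
        }
    }

    /** Returns the 0..63 offset of a codon in the table, or -1 if a base is not T, C, A or G. */
    static int codonIndex(String codon) {
        int b1 = baseToIndex(codon.charAt(0));
        int b2 = baseToIndex(codon.charAt(1));
        int b3 = baseToIndex(codon.charAt(2));
        if (b1 < 0 || b2 < 0 || b3 < 0) {
            return -1;
        }
        return b1 * 16 + b2 * 4 + b3;
    }

    public static void main(String[] args) {
        // ATG -> 2*16 + 0*4 + 3 = 35 -> 'M'; TGA -> 0*16 + 3*4 + 2 = 14 -> '*'
        for (String codon : new String[] {"ATG", "TTT", "TGA", "GGG"}) {
            int idx = codonIndex(codon);
            System.out.println(codon + " -> index " + idx + " -> " + STANDARD.charAt(idx));
        }
    }
}
```

Because the lookup is purely positional, a genetic code string that is one character too long or too short silently shifts every amino acid, so it may be worth checking that the string has exactly 64 characters.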

Compiling and packaging the source


javac test/Translate.java
jar cvf translate.jar test

The XML source


The XML source is a set of INSDSeq sequences downloaded from NCBI/GenBank:
<?xml version="1.0"?>
<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
<INSDSet>
<INSDSeq>
<INSDSeq_locus>NM_004953</INSDSeq_locus>
<INSDSeq_length>4888</INSDSeq_length>
<INSDSeq_strandedness>single</INSDSeq_strandedness>
<INSDSeq_moltype>mRNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>PRI</INSDSeq_division>
<INSDSeq_update-date>03-SEP-2009</INSDSeq_update-date>
<INSDSeq_create-date>14-MAY-1999</INSDSeq_create-date>
<INSDSeq_definition>Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 5, mRNA</INSDSeq_definition>
<INSDSeq_primary-accession>NM_004953</INSDSeq_primary-accession>
<INSDSeq_accession-version>NM_004953.3</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>ref|NM_004953.3|</INSDSeqid>
<INSDSeqid>gi|148277098</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_source>Homo sapiens (human)</INSDSeq_source>
<INSDSeq_organism>Homo sapiens</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo</INSDSeq_taxonomy>
(...)
<INSDFeature_key>CDS</INSDFeature_key>
<INSDFeature_location>207..4418</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>207</INSDInterval_from>
<INSDInterval_to>4418</INSDInterval_to>
<INSDInterval_accession>NM_004953.3</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>gene</INSDQualifier_name>
<INSDQualifier_value>EIF4G1</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>gene_synonym</INSDQualifier_name>
<INSDQualifier_value>DKFZp686A1451; EIF4F; EIF4G; p220</INSDQualifier_value>
</INSDQualifier>
(...)
cttttaatgatgagggtaactatttcagttgtgagccttctagggccccaggctgggaggctcagaggactgaatctgggacctgtgttccccccggcaggcagggacaagatggcatggcaagcatgggggcggggtgggtggggagggatgctgcatttctcagctgggcagtaatcaatttaatggtcctttaaaatgtctgtgtattaaaaatttaagaataccacactttaatattaaatattcataaggtctagtatcttgataataatgtagatgttttaataacaatttttgtccttcttaaaataaaatgaaagaaacttgcttcccttagcctttgttctagaaaataaacttgtgcactttga</INSDSeq_sequence>
</INSDSeq>
</INSDSet>

The XSLT stylesheet

In the following XSLT stylesheet:
in the header, the prefix bio is bound to our Translate class through the namespace xalan://test.Translate;
a parameter named GENETICCODE defines the default transl_table (the standard genetic code);
a new Translate object named code is created with bio:new($GENETICCODE);
the translate method of this object is then called on the substring of DNA containing the CDS: bio:translate($code,substring($dna,$start,1+($end - $start))).
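The XPath substring() function counts characters from 1 and takes a length, which matches the 1-based, inclusive coordinates of INSDInterval_from and INSDInterval_to. As a rough illustration of the same extraction in Java, where substring is 0-based with an exclusive end, here is a minimal sketch; the class name CdsSubstringDemo, the method extract and the toy sequence are made up for this example.

```java
final class CdsSubstringDemo {
    /**
     * Extracts the 1-based, inclusive interval [from, to] from a sequence,
     * mirroring substring($dna, $start, 1 + ($end - $start)) in XPath.
     */
    static String extract(String dna, int from, int to) {
        // Java's substring is 0-based with an exclusive end, hence from - 1 and to.
        return dna.substring(from - 1, to);
    }

    public static void main(String[] args) {
        // Toy sequence: the "CDS" is written in upper case and spans positions 3..11.
        String dna = "ccATGAAATAGtt";
        String cds = extract(dna, 3, 11);
        System.out.println(cds + " (" + cds.length() + " bases)");
    }
}
```

For the CDS 207..4418 shown in the XML above, the same arithmetic gives 1 + (4418 - 207) = 4212 bases, that is 1404 codons including the stop codon.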


<xsl:stylesheet version="1.0"
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:bio="xalan://test.Translate"
extension-element-prefixes="bio">

<xsl:output method="xml" indent="yes"/>

<xsl:param name="GENETICCODE">FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG</xsl:param>

<xsl:template match="/INSDSet">
<html><body>
<xsl:apply-templates select="INSDSeq"/>
</body></html>
</xsl:template>


<xsl:template match="INSDSeq">
<xsl:variable name="dna" select="INSDSeq_sequence"/>
<div>
<h2><xsl:value-of select="INSDSeq_locus"/><xsl:text> : </xsl:text><xsl:value-of select="INSDSeq_definition"/></h2>
<xsl:for-each select="INSDSeq_feature-table/INSDFeature[INSDFeature_key='CDS']/INSDFeature_intervals">
<xsl:variable name="code" select="bio:new($GENETICCODE)"/>
<xsl:variable name="start" select="number(INSDInterval/INSDInterval_from)"/>
<xsl:variable name="end" select="number(INSDInterval/INSDInterval_to)"/>
<div style="font-family:monospace;word-wrap:break-word;width:400px;background-color:rgb(230,230,230);"><xsl:value-of select="bio:translate($code,substring($dna,$start,1+($end - $start)))"/></div>
</xsl:for-each>
</div>
</xsl:template>


</xsl:stylesheet>

Running this XSLT stylesheet with XALAN


java -cp ${XALAN_PATH}/org.apache.xalan_2.7.1.v200905122109.jar:${XALAN_PATH}/org.apache.xml.serializer_2.7.1.v200902170519.jar:translate.jar org.apache.xalan.xslt.Process -IN sequences.gbc -XSL seq2html.xsl

Result


NM_004953 : Homo sapiens eukaryotic translation initiation factor 4 gamma, 1
(EIF4G1), transcript variant 5, mRNA


MSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPDDRSQGAIIADRPGLPGPEHSPSESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTTIQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTPLASHTVEIHEPNGMVPSEDLEPEVESSPELAPPPACPSESPVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEVTASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGEAGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKTPLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQGPRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQQAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCAGHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLRPLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLGEESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVCYSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAEEESDHN*



NM_182917 : Homo sapiens eukaryotic translation initiation factor 4 gamma, 1
(EIF4G1), transcript variant 1, mRNA


MNKAPQSTGPPPAPSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQHFYPSRAQPPSSAASRVQSAAPARPGPAAHVYPAGSQVMMIPSQISYPASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQGVQQFPTGVAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPDDRSQGAIIADRPGLPGPEHSPSESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTTIQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTPLASHTVEIHEPNGMVPSEDLEPEVESSPELAPPPACPSESPVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEVTASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGEAGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKTPLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQGPRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQQAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCAGHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLRPLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLGEESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVCYSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAEEESDHN*



Running this XSLT stylesheet with XALAN and another genetic code


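We set the param GENETICCODE to the Scenedesmus obliquus mitochondrial code. The string passed below differs from the default only at the twelfth letter (codon TAG), which reads Q instead of a stop; as far as I can tell this matches the NCBI transl_table 15 rather than table 22, so check the string against the NCBI page. Note also that the inner single quotes seem to reach the stylesheet as part of the value: they shift every lookup by one position, which would explain the ' characters and the changed residues throughout the result below (for example the leading M becoming I), so the quotes should probably be dropped: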
java -cp ${XALAN_PATH}/org.apache.xalan_2.7.1.v200905122109.jar:${XALAN_PATH}/org.apache.xml.serializer_2.7.1.v200902170519.jar:translate.jar org.apache.xalan.xslt.Process -IN sequences.gbc -XSL seq2html.xsl -PARAM GENETICCODE "'FFLLSSSSYY*QCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'"
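The same parameter can also be set from Java through the standard JAXP API instead of the command line. The sketch below is only an assumption about how one might do it and is not part of the original setup: the class name ParamDemo and its tiny inline stylesheet are hypothetical, and the stylesheet merely reports the length of GENETICCODE so that the example runs without XALAN or translate.jar on the classpath. It also illustrates that setParameter passes a string verbatim, so quote characters inside the value count as part of it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ParamDemo {
    // Hypothetical stylesheet: prints the number of characters in GENETICCODE.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:param name='GENETICCODE'>default</xsl:param>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='string-length($GENETICCODE)'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet on the XML text, overriding GENETICCODE when paramValue is not null. */
    static String run(String xml, String xsl, String paramValue) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            if (paramValue != null) {
                // The value is handed to the stylesheet as a plain string.
                t.setParameter("GENETICCODE", paramValue);
            }
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<INSDSet/>", XSL, "abc"));   // three characters
        System.out.println(run("<INSDSet/>", XSL, "'abc'")); // five: the quotes are kept
    }
}
```

To run the stylesheet from this post in the same way, XALAN and translate.jar would have to be on the classpath so that the xalan://test.Translate namespace can be resolved.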

Result



NM_004953 : Homo sapiens eukaryotic translation initiation factor 4 gamma, 1
(EIF4G1), transcript variant 5, mRNA


ILGARMASTPTLPQTGGELELHVTGETPQRVVRVRPADRSQGAIRVDRPGLLGPEPSLSDSQLSSLLPTPSPSPVLDPGLELTLAVLLRLGDMITMIHILVDDSTPISQDMGEPSRLLPDPMLLADPILDVDVTWSNPRPDLE'LSKLLQVPTLLALHTVDRPELTGIVPLDALDPEVESSPEWVLPPVCPSDSLVPRVPMAHLEDLLNGAPSPPVVDFSPVKEPEEQAKEVTASIAPPTIPLVMPVTVLSVMSPVQEEDIDDDDDEDDGDAGDAGDVEKENGGEDLLPPEKTLRPANLLQTLEAAAAMHVAVLVPKRSRNRKELTKKEVRGDWLAAFKEANPAVPEVDTQLLAGSTPGPELEGKEVPPQLEDAAET*DSKDDNRHTVENIQPGDQKSDSKSAQ*KLLNLEENNQYDQEFLWE'QFI'AKIQKPEGLPPIKDVVLDKATNTPLRPLAPMSLHGITQGPDFMPS'ANWGRTTWSTQGPPREGPEGELPQGPVGLGPRRLQQGPRNDPRKIRATVFITDAINLNNAENA*NPSSKRTAVAKARGDDAVAGSNTQDLFRRVRSILTNLTPQIFQQLIKHVTQLAIDTEDRLNGVRDLR'EKARSEPNFLVASANICRCLIALNVPMTDKPTVMVNFRKLLLTRQQKE'ENDNAAAER'EKKHNEIADVVTAEDRGRLKDELDEVRDIARRRLFGTIK'RGELFNLKIFTEAIIPDQVVNLWKNPADESWECWQQLLTTRGNDLD'DNAKPRIAQSFNQIDNIRNDKKTSSRIR'ILQDVLALRGST*VPRRGAQEPKTRDQIPKEVEIDDPREHINVQQLIAKGKDKQRGELPGLPISQGWPWVAAEG*NTRPISNESRPRDTSRLTKITKLGSIALNNQL'ALGGRLS*GKGSSGGSGAKPSDAASDVVRPVMKMLTRFSAWHHAVPTDSTATSQVVQRKSLSRDRGENVGDRGDRLERKDRGGDQGDRWAQARTLVTKRSFSKDVEERKSDRPSQLEGLRKAVSLTEARDQGRAAVKRDVALPPVSPLKAVLLEEEFEKNSKVIREDSLPLTDINEAVQCVQELASPSLLFI'VRPEVELTLERKARVQEPIGQLLHQLLQVGPLLMVQYYHGLSDILDLVEDIDRDIPHV*LYLADLVTPRLQDEGVPIGELFRERTKLLSPLGNVVSLLLEILGLLCNSIELNKVGTL*RDAGWS*KD'LLDGQDREAFVVDQKVESTLGEESDALGQRALPSEELNRQLEKLLKEGSKNQRVFD*IEANLKEQQIVSNTFRRALITVVCSLARR'EMPLRVDRAVLNARAKLLQNYLQDEQKELQALYALQAWVVTFDQLPNLLRIF'DALSDEDVVKEAAFYK*EKSKDPVEQQGKEVAWNLVTAFFK*LQDAEEELDHNC



NM_182917 : Homo sapiens eukaryotic translation initiation factor 4 gamma, 1
(EIF4G1), transcript variant 1, mRNA


INNVPQSTGPPPAPSPGLPQPA'PPGQTAPVVFKTPHATHINTLLQPRQHFYLSRAQPPSKAASRVQKAALARLGPVAPVYLVGSHVIIILSQISYPASQGAYYILGQGQSTYRVPTQQYLVQPGAPGFSPEASLTD'GTYVGAYSPAHGVQQ'PMGVAPAPRLINQPPQRVPKREQKTIRRRAPNHGGKAITEEIILGARMASTPTLPQTGGELELHVTGETPQRVVRVRPADRSQGAIRVDRPGLLGPEPSLSDSQLSSLLPTPSPSPVLDPGLELTLAVLLRLGDMITMIHILVDDSTPISQDMGEPSRLLPDPMLLADPILDVDVTWSNPRPDLE'LSKLLQVPTLLALHTVDRPELTGIVPLDALDPEVESSPEWVLPPVCPSDSLVPRVPMAHLEDLLNGAPSPPVVDFSPVKEPEEQAKEVTASIAPPTIPLVMPVTVLSVMSPVQEEDIDDDDDEDDGDAGDAGDVEKENGGEDLLPPEKTLRPANLLQTLEAAAAMHVAVLVPKRSRNRKELTKKEVRGDWLAAFKEANPAVPEVDTQLLAGSTPGPELEGKEVPPQLEDAAET*DSKDDNRHTVENIQPGDQKSDSKSAQ*KLLNLEENNQYDQEFLWE'QFI'AKIQKPEGLPPIKDVVLDKATNTPLRPLAPMSLHGITQGPDFMPS'ANWGRTTWSTQGPPREGPEGELPQGPVGLGPRRLQQGPRNDPRKIRATVFITDAINLNNAENA*NPSSKRTAVAKARGDDAVAGSNTQDLFRRVRSILTNLTPQIFQQLIKHVTQLAIDTEDRLNGVRDLR'EKARSEPNFLVASANICRCLIALNVPMTDKPTVMVNFRKLLLTRQQKE'ENDNAAAER'EKKHNEIADVVTAEDRGRLKDELDEVRDIARRRLFGTIK'RGELFNLKIFTEAIIPDQVVNLWKNPADESWECWQQLLTTRGNDLD'DNAKPRIAQSFNQIDNIRNDKKTSSRIR'ILQDVLALRGST*VPRRGAQEPKTRDQIPKEVEIDDPREHINVQQLIAKGKDKQRGELPGLPISQGWPWVAAEG*NTRPISNESRPRDTSRLTKITKLGSIALNNQL'ALGGRLS*GKGSSGGSGAKPSDAASDVVRPVMKMLTRFSAWHHAVPTDSTATSQVVQRKSLSRDRGENVGDRGDRLERKDRGGDQGDRWAQARTLVTKRSFSKDVEERKSDRPSQLEGLRKAVSLTEARDQGRAAVKRDVALPPVSPLKAVLLEEEFEKNSKVIREDSLPLTDINEAVQCVQELASPSLLFI'VRPEVELTLERKARVQEPIGQLLHQLLQVGPLLMVQYYHGLSDILDLVEDIDRDIPHV*LYLADLVTPRLQDEGVPIGELFRERTKLLSPLGNVVSLLLEILGLLCNSIELNKVGTL*RDAGWS*KD'LLDGQDREAFVVDQKVESTLGEESDALGQRALPSEELNRQLEKLLKEGSKNQRVFD*IEANLKEQQIVSNTFRRALITVVCSLARR'EMPLRVDRAVLNARAKLLQNYLQDEQKELQALYALQAWVVTFDQLPNLLRIF'DALSDEDVVKEAAFYK*EKSKDPVEQQGKEVAWNLVTAFFK*LQDAEEELDHNC



That's it
Pierre
