30 March 2011

Parsing a genomic position with javacc

Parsing a genomic position (chrom:start-end) is an easy task, but I've always been too lazy to create a library for it. Today I wrote a JavaCC-based parser that handles the various syntaxes of a genomic position. Here is the grammar I used:

COMMA: "," ;
LETTER: (["a"-"z"]|["A"-"Z"]|"_") ;
DIGIT: ["0"-"9"];
INT:<DIGIT> ( (<DIGIT>|<COMMA>)* <DIGIT>)? ;
BP: "b" ("p")? ;
KB: ("k") ("B")? ;
MB: ("m") ("B")? ;
GB: ("g") ("B")? ;
IDENTIFIER: <LETTER> (<DIGIT>|<LETTER>)* ;
COLON: ":" ;
DASH: "-" ;
PLUS: "+" ;
DELIM: ("|"|";") ;


java.util.List<Segment> many(): ( segment() ( (<DELIM>)? segment() )* )? <EOF> ;
Segment one(): segment() <EOF>;
Segment segment(): chromName() <COLON> position() ( <DASH> position() | <PLUS> position() )? ;
BigInteger position():integer() (factor())?;
BigInteger factor(): ( <BP> | <KB>| <MB> | <GB> );
BigInteger integer():<INT> ;
String chromName():( integer() | identifier());
String identifier(): <IDENTIFIER> ;
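
For readers who do not want to generate a parser, the core of this grammar can be approximated with a regular expression. The following is only a minimal sketch, not the JavaCC parser described in this post: the class name SimpleSegmentParser is hypothetical, it handles a single segment, it ignores the "+" form, and it assumes that a missing end means start+1 (as the c1:1000 example in this post suggests).

```java
import java.math.BigInteger;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based sketch; NOT the JavaCC parser from this post. */
class SimpleSegmentParser {
    // INT: a digit, optionally followed by digits/commas and ending with a digit
    private static final String INT = "([0-9](?:[0-9,]*[0-9])?)";
    // optional unit: b, bp, k, kb, m, mb, g, gb (matched case-insensitively)
    private static final String UNIT = "(bp|kb|mb|gb|b|k|m|g)?";
    // chrom ':' position ( '-' position )?
    private static final Pattern SEGMENT = Pattern.compile(
        "\\s*(\\w+):" + INT + UNIT + "(?:-" + INT + UNIT + ")?\\s*",
        Pattern.CASE_INSENSITIVE);

    /** Removes the commas and applies the unit factor, if any. */
    private static BigInteger position(String digits, String unit) {
        BigInteger value = new BigInteger(digits.replace(",", ""));
        if (unit == null) return value;
        switch (unit.toLowerCase(Locale.ROOT).charAt(0)) {
            case 'k': return value.multiply(BigInteger.valueOf(1_000L));
            case 'm': return value.multiply(BigInteger.valueOf(1_000_000L));
            case 'g': return value.multiply(BigInteger.valueOf(1_000_000_000L));
            default:  return value; // "b" or "bp"
        }
    }

    /** Returns "chrom:start-end". Assumption of this sketch: no end means start+1. */
    static String parse(String text) {
        Matcher m = SEGMENT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a genomic position: " + text);
        }
        BigInteger start = position(m.group(2), m.group(3));
        BigInteger end = (m.group(4) == null)
            ? start.add(BigInteger.ONE)
            : position(m.group(4), m.group(5));
        if (end.compareTo(start) < 0) {
            throw new IllegalArgumentException("end < start in: " + text);
        }
        return m.group(1) + ":" + start + "-" + end;
    }

    public static void main(String[] args) {
        System.out.println(parse(" chrM:1-100,000")); // chrM:1-100000
        System.out.println(parse("c1:1000"));         // c1:1000-1001
        System.out.println(parse("2:1Gb-2gb"));       // 2:1000000000-2000000000
    }
}
```

A regular expression is enough for one segment, but a real grammar remains easier to extend to lists of segments and to report precise error positions.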

Source code



Compiling

javacc SegmentParser.jj
javac SegmentParser.java

Running

echo " chrM:1-100,000"| java SegmentParser
chrM:1-100000
echo " c1:1000"| java SegmentParser
c1:1000-1001
echo "2:1Gb+1 " | java SegmentParser
chr2:999999999-1000000002
echo "chr2:10+100" | java SegmentParser
ParseException: -90 < 0)
echo "chrX:3147483647" | java SegmentParser
ParseException: 3147483647 > 2147483647 (int-max)
echo "2:1Gb+a azd " | java SegmentParser
ParseException: Encountered "a" at line 1, column 7


That's it,

Pierre

29 March 2011

Mapping a mutation on a protein to the genome.

A colleague asked me to solve the following problem: starting from an article describing a mutation in a protein (don't dream, there was no accession number), she wanted to know the position of the mutation on the human genome, to determine whether a known SNP was located there.
The program I wrote, backlocate, is available on github: https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BackLocate.java. It uses the public MySQL server of the UCSC.

  • The input is the name of a gene and a mutation "{AA-wild}{position}{AA-mut}"
  • A first SQL query searches for the gene symbol in the table kgXref.
  • A second SQL query searches for all the transcripts of the table knownGene having this kgXref
  • The genomic DNA for a transcript is downloaded from the DAS-DNA server of the UCSC
  • The protein, the mRNA and the genomic sequences are reconstituted to find the 3 possible bases of the mutated codon.
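
The arithmetic behind the last step can be sketched as follows. This is an illustration under stated assumptions, not code taken from BackLocate.java: the class CodonLocator and its methods are hypothetical, only the "+" strand is handled, the exon coordinates in main are made up, and the exons are assumed to be 0-based, half-open intervals restricted to coding bases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; NOT code from BackLocate.java. */
class CodonLocator {
    // "{AA-wild}{position}{AA-mut}", e.g. D240Y
    private static final Pattern MUTATION = Pattern.compile("([A-Z*])(\\d+)([A-Z*])");

    /** Extracts the 1-based protein position from a mutation such as "D240Y". */
    static int proteinPosition(String mutation) {
        Matcher m = MUTATION.matcher(mutation);
        if (!m.matches()) {
            throw new IllegalArgumentException("Bad mutation: " + mutation);
        }
        return Integer.parseInt(m.group(2));
    }

    /** 0-based indexes, in the coding sequence, of the three bases of the codon. */
    static int[] codonIndexes(int proteinPos1) {
        int first = (proteinPos1 - 1) * 3;
        return new int[] { first, first + 1, first + 2 };
    }

    /**
     * Maps a 0-based index in the spliced coding sequence to a genomic index
     * ("+" strand only). Each exon is {start, end}, 0-based, half-open, sorted.
     */
    static int toGenomic(int[][] codingExons, int cdsIndex) {
        int remaining = cdsIndex;
        for (int[] exon : codingExons) {
            int length = exon[1] - exon[0];
            if (remaining < length) return exon[0] + remaining;
            remaining -= length; // skip this exon and the intron that follows it
        }
        throw new IndexOutOfBoundsException("Index beyond coding sequence: " + cdsIndex);
    }

    public static void main(String[] args) {
        int[] indexes = codonIndexes(proteinPosition("D240Y")); // 717, 718, 719
        // made-up exons: the first one holds 718 coding bases, so the codon straddles the boundary
        int[][] exons = { { 1000, 1718 }, { 2000, 2500 } };
        for (int i : indexes) {
            System.out.println(i + " -> " + toGenomic(exons, i));
        }
    }
}
```

With these toy exons the three bases land on 1717, 2000 and 2001, which is the same pattern as a codon split between two exons.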

Example


Let's find the genomic position for EIF4G1 at position 240 in the protein (note: the codon for this position spans two exons on the transcript "uc010hxy.2"):
echo -e "EIF4G1\tD240Y" | java -jar backlocate.jar

Result:
#User.Gene AA1 petide.pos.1 AA2 knownGene.name knownGene.strand knownGene.AA index0.in.rna codon base.in.rna chromosome index0.in.genomic exon
##uc003fnt.2
EIF4G1 D 240 Y uc003fnt.2 + D 717 GAC G chr3 184040214 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 718 GAC A chr3 184040215 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 719 GAC C chr3 184040216 Exon 7
##uc010hxy.2
EIF4G1 D 240 Y uc010hxy.2 + D 717 GAT G chr3 184038780 Exon 9
EIF4G1 D 240 Y uc010hxy.2 + D 718 GAT A chr3 184039069 Exon 10
EIF4G1 D 240 Y uc010hxy.2 + D 719 GAT T chr3 184039070 Exon 10
##uc003fnw.2
EIF4G1 D 240 Y uc003fnw.2 + D 717 GAT G chr3 184038780 Exon 8
EIF4G1 D 240 Y uc003fnw.2 + D 718 GAT A chr3 184039069 Exon 9
EIF4G1 D 240 Y uc003fnw.2 + D 719 GAT T chr3 184039070 Exon 9
##Warning ref aminod acid for uc003fnp.2 [240] is not the same (I/D)
EIF4G1 D 240 Y uc003fnp.2 + I 717 ATC A chr3 184039089 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 718 ATC T chr3 184039090 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 719 ATC C chr3 184039091 Exon 10
(...)


That's it,

Pierre

24 March 2011

The ultimate Bioinformatics “Cheat Sheet”

The following question was recently asked on Biostar:
inspired by Keith Robison's post on "cheat sheets", what would you put on a cheat sheet for bioinformatics? This might include one-line scripts, conversion factors, handy rules of thumb, etc...

Well, here is my 'cheat sheet': ;-)







That's it,

Pierre

(arrows are from Wikimedia Commons)

22 March 2011

Blast Stylesheet : XML to HTML

I wrote an XSLT stylesheet for the following question on Biostar: I'd like to create an HTML file (from the XML file and XSL stylesheet) similar to what It can be achieved when we performed a BLAST search on the NCBI server.

The stylesheet I wrote is available on github at: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2html.xsl (see also my previous post, blast2svg).

Usage:

xsltproc --novalid blast2html.xsl blast.xml > blast.html
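
The same kind of transformation can also be run from Java with the standard javax.xml.transform API. The following is a minimal sketch: the class XsltDemo and the toy stylesheet are hypothetical (the toy stylesheet only extracts the program name and is not blast2html.xsl), and the handling of the DOCTYPE that xsltproc skips with --novalid is not covered here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo class: applies an XSLT stylesheet to an XML document. */
class XsltDemo {
    // toy stylesheet (NOT blast2html.xsl): prints the BLAST program name as text
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='BlastOutput/BlastOutput_program'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // toy input: a tiny fragment shaped like the BLAST XML output
    static final String TOY_XML =
        "<BlastOutput><BlastOutput_program>blastp</BlastOutput_program></BlastOutput>";

    /** Applies the stylesheet to the XML and returns the result as a String. */
    static String transform(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(TOY_XSLT, TOY_XML)); // blastp
    }
}
```

To process real files, the two StringReader sources can be replaced by StreamSource objects built from java.io.File instances.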

Example:

Here is an XML output of BLAST:
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.25+</BlastOutput_version>
<BlastOutput_reference>Alejandro A. Sch&auml;ffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.</BlastOutput_reference>
<BlastOutput_db>N/A</BlastOutput_db>
<BlastOutput_query-ID>gi|187956781|gb|AAI40897.1|</BlastOutput_query-ID>
<BlastOutput_query-def>EIF4G1 protein [Homo sapiens]</BlastOutput_query-def>
<BlastOutput_query-len>1606</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|187956781|gb|AAI40897.1|</Iteration_query-ID>
<Iteration_query-def>EIF4G1 protein [Homo sapiens]</Iteration_query-def>
<Iteration_query-len>1606</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|293340930|ref|XP_002724789.1|</Hit_id>
<Hit_def>PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus]</Hit_def>
<Hit_accession>XP_002727969</Hit_accession>
<Hit_len>1584</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>2715.64</Hsp_bit-score>
<Hsp_score>7038</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>1606</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>1584</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>1450</Hsp_identity>
<Hsp_positive>1450</Hsp_positive>
<Hsp_gaps>36</Hsp_gaps>
<Hsp_align-len>1613</Hsp_align-len>
<Hsp_qseq>MNKAPQSTGPPPAPSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQGGFRSLQHFYPSRAQPPSSAASRVQSAAPARPGPAAHVYPAGSQVMMIPSQISYPASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQGVQQFPTGVAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPDDRSQGAIIADRPGLPGPEHSP-SESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTT--IQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTPLASHTVEIHEPNGMVPSEDLEPEVESSPELAPPP--ACPSESPVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEV-TASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGEAGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKTPLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQGPRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQQAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCAGHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLRPLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLGEESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVCYSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAE-EESDHN</Hsp_qseq>
<Hsp_hseq>MNKAPQPTGPPPARSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ-------HFYPSRAQPPSSAASRVQSAAPARPGPAPHVYPAGSQVMMIPSQISYSASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQSVQQFPASVAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGSLEPQPNGESPQVAVIIRPDDRSQGAAIGGRPGLPGPEHSPGTESQPSSPSPTPSPPPILEPGSESNLGVLSIPGDTMTTGMIPISVEESTPISCESGEPYCLSPEPT-LAEPILEVEVTLSKPIPESEFSSSPLQVSTSLVPHRAETHEPNGVIPSEDLEPEVESSTEPAPPPLSACASESLVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEVPSAALA--SIVSPTPPVAPSDTSAAQEEEIEED-------EDEDGEAESEKGGEDL-PLDSTPVPAQLSQNLEVAAAPQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEVDPAVPEVENQPPTGSNPSPESEGSAALPQPEEAEETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHITDVVLDKANKTPLRSLDPSRLPGINCGPDFTPSFANLGRPTLSSRGPPRGGPGGELPRGPAGLGPRRSQQGPRKETRKIISSVIMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFAKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRQSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGGDKRRGGPPGPP-------VNDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDTASEATRPA--TLNRFSALQQTLPVENTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTE--DRGRDPVKREATLPPVSPPKAALAVDEVERKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRLGIESTLERSTIAREHMGRLLHQLLCAGHLSTAQYYQGLYETLELAEDMEIDIPHVWLYLAELITPILQEDGVPMGELFREITKPLRPMGKATSLLLEILGLLCKSMGPKKVGMLWREAGLSWREFLAEGQDVGSFVAEKKVEYTLGEESEAPGQRALAFEELRRQLEKLLKDGGSNQRVFDWIEANLNEQQIASNTLVRALMTTVCYSAIIFETPLRVDVQVLKVRARLLQKYLSDEQKELQALYALQALVVTLEQPANLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFNWLREAEDEESDHN</Hsp_hseq>
<Hsp_midline>MNKAPQ TGPPPA SPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ HFYPSRAQPPSSAASRVQSAAPARPGPA HVYPAGSQVMMIPSQISY ASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQ VQQFP VAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGG LEPQ NGE PQVAVI RPDDRSQGA I RPGLPGPEHSP ESQPSSPSPTPSP P LEPGSE NL VLSIPGDTMTT I SVEESTPIS E GEPY LSPEPT LAEPILEVEVTLSKP PESEFSSSPLQ T L H E HEPNG PSEDLEPEVESS E APPP AC SES VPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEV A A I S TP APS TS AQEEE EE E GEAESEKGGE L P STP PA LSQNLE AAA QVAVSVPKRRRKIKELNKKEAVGDLLDAFKE PAVPEVENQPP GSNP PESEGS P PEEA ETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHI DVVLDKANKTPLR LDP RL GINCGPDFTPSFANLGR TLS RGPPRGGPGGELPRGPAGLGPRRSQQGPRKE RKII V MTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDF KAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLR SNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKG DKRRGGPPGPP V DGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSD ASEA RPA TLNRFSALQQ P E TDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTE DRGRD VKREA LPPVSP KAAL E E KSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVR G ESTLERS IAREHMG LLHQLLCAGHLSTAQYYQGLYE LELAEDMEIDIPHVWLYLAEL TPILQE GVPMGELFREITKPLRP GKA SLLLEILGLLCKSMGPKKVG LWREAGLSW EFL EGQD G FVAE KVEYTLGEESEAPGQRAL EEL RQLEKLLK G SNQRVFDWIEANL EQQI SNTLVRALMT VCYSAIIFETPLRVDV VLK RA LLQKYL DEQKELQALYALQALVVTLEQP NLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFF WLREAE EESDHN</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

After processing:

(...)

Descriptions

Accession | Def | e-value
XP_002727969 | PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus] | 0
(...)

Alignments

>gi|293340930|ref|XP_002724789.1||XP_002727969|PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus]
Length=1584
Score = 2715.64 bits (7038), Expect = 0
Identities = 1450/1613 (89.8946063236206%), Gaps = 36/1613 (2.231866088034718%)
Strand = Plus/Plus

Query 1 MNKAPQSTGPPPAPSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQGGFRSLQHFYP 60
MNKAPQ TGPPPA SPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ HFYP
Sbjct 1 MNKAPQPTGPPPARSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ-------HFYP 53

Query 61 SRAQPPSSAASRVQSAAPARPGPAAHVYPAGSQVMMIPSQISYPASQGAYYIPGQGRSTY 120
SRAQPPSSAASRVQSAAPARPGPA HVYPAGSQVMMIPSQISY ASQGAYYIPGQGRSTY
Sbjct 54 SRAQPPSSAASRVQSAAPARPGPAPHVYPAGSQVMMIPSQISYSASQGAYYIPGQGRSTY 113

Query 121 VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQGVQQFPTGVAPAPVLMNQPPQIAP 180
VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQ VQQFP VAPAPVLMNQPPQIAP
Sbjct 114 VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQSVQQFPASVAPAPVLMNQPPQIAP 173

Query 181 KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPD 240
KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGG LEPQ NGE PQVAVI RPD
Sbjct 174 KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGSLEPQPNGESPQVAVIIRPD 233

Query 241 DRSQGAIIADRPGLPGPEHSP-SESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTT 299
DRSQGA I RPGLPGPEHSP ESQPSSPSPTPSP P LEPGSE NL VLSIPGDTMTT
Sbjct 234 DRSQGAAIGGRPGLPGPEHSPGTESQPSSPSPTPSPPPILEPGSESNLGVLSIPGDTMTT 293

Query 300 --IQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTP 357
I SVEESTPIS E GEPY LSPEPT LAEPILEVEVTLSKP PESEFSSSPLQ T
Sbjct 294 GMIPISVEESTPISCESGEPYCLSPEPT-LAEPILEVEVTLSKPIPESEFSSSPLQVSTS 352

Query 358 LASHTVEIHEPNGMVPSEDLEPEVESSPELAPPP--ACPSESPVPIAPTAQPEELLNGAP 415
L H E HEPNG PSEDLEPEVESS E APPP AC SES VPIAPTAQPEELLNGAP
Sbjct 353 LVPHRAETHEPNGVIPSEDLEPEVESSTEPAPPPLSACASESLVPIAPTAQPEELLNGAP 412

Query 416 SPPAVDLSPVSEPEEQAKEV-TASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGE 474
SPPAVDLSPVSEPEEQAKEV A A I S TP APS TS AQEEE EE
Sbjct 413 SPPAVDLSPVSEPEEQAKEVPSAALA--SIVSPTPPVAPSDTSAAQEEEIEED------- 463

Query 475 AGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAV 534
E GEAESEKGGE L P STP PA LSQNLE AAA QVAVSVPKRRRKIKELNKKEAV
Sbjct 464 EDEDGEAESEKGGEDL-PLDSTPVPAQLSQNLEVAAAPQVAVSVPKRRRKIKELNKKEAV 522

Query 535 GDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENI 594
GDLLDAFKE PAVPEVENQPP GSNP PESEGS P PEEA ETWDSKEDKIHNAENI
Sbjct 523 GDLLDAFKEVDPAVPEVENQPPTGSNPSPESEGSAALPQPEEAEETWDSKEDKIHNAENI 582

Query 595 QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKT 654
QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHI DVVLDKANKT
Sbjct 583 QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHITDVVLDKANKT 642

Query 655 PLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQG 714
PLR LDP RL GINCGPDFTPSFANLGR TLS RGPPRGGPGGELPRGPAGLGPRRSQQG
Sbjct 643 PLRSLDPSRLPGINCGPDFTPSFANLGRPTLSSRGPPRGGPGGELPRGPAGLGPRRSQQG 702

Query 715 PRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI 774
PRKE RKII V MTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI
Sbjct 703 PRKETRKIISSVIMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI 762

Query 775 LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV 834
LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV
Sbjct 763 LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV 822

Query 835 PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA 894
PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA
Sbjct 823 PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA 882

Query 895 RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD 954
RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD
Sbjct 883 RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD 942

Query 955 FEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHK 1014
F KAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLR SNWVPRRGDQGPKTIDQIHK
Sbjct 943 FAKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRQSNWVPRRGDQGPKTIDQIHK 1002

Query 1015 EAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTS 1074
EAEMEEHREHIKVQQLMAKG DKRRGGPPGPP V DGGWNTVPISKGSRPIDTS
Sbjct 1003 EAEMEEHREHIKVQQLMAKGGDKRRGGPPGPP-------VNDGGWNTVPISKGSRPIDTS 1055

Query 1075 RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQ 1134
RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSD ASEA RPA TLNRFSALQ
Sbjct 1056 RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDTASEATRPA--TLNRFSALQ 1113

Query 1135 QAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS 1194
Q P E TDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS
Sbjct 1114 QTLPVENTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS 1173

Query 1195 KEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSK 1254
KEVEERSRERPSQPEGLRKAASLTE DRGRD VKREA LPPVSP KAAL E E KSK
Sbjct 1174 KEVEERSRERPSQPEGLRKAASLTE--DRGRDPVKREATLPPVSPPKAALAVDEVERKSK 1231

Query 1255 AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCA 1314
AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVR G ESTLERS IAREHMG LLHQLLCA
Sbjct 1232 AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRLGIESTLERSTIAREHMGRLLHQLLCA 1291

Query 1315 GHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLR 1374
GHLSTAQYYQGLYE LELAEDMEIDIPHVWLYLAEL TPILQE GVPMGELFREITKPLR
Sbjct 1292 GHLSTAQYYQGLYETLELAEDMEIDIPHVWLYLAELITPILQEDGVPMGELFREITKPLR 1351

Query 1375 PLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLG 1434
P GKA SLLLEILGLLCKSMGPKKVG LWREAGLSW EFL EGQD G FVAE KVEYTLG
Sbjct 1352 PMGKATSLLLEILGLLCKSMGPKKVGMLWREAGLSWREFLAEGQDVGSFVAEKKVEYTLG 1411

Query 1435 EESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVC 1494
EESEAPGQRAL EEL RQLEKLLK G SNQRVFDWIEANL EQQI SNTLVRALMT VC
Sbjct 1412 EESEAPGQRALAFEELRRQLEKLLKDGGSNQRVFDWIEANLNEQQIASNTLVRALMTTVC 1471

Query 1495 YSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFD 1554
YSAIIFETPLRVDV VLK RA LLQKYL DEQKELQALYALQALVVTLEQP NLLRMFFD
Sbjct 1472 YSAIIFETPLRVDVQVLKVRARLLQKYLSDEQKELQALYALQALVVTLEQPANLLRMFFD 1531

Query 1555 ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAE-EESDHN 1606
ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFF WLREAE EESDHN
Sbjct 1532 ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFNWLREAEDEESDHN 1584



That's it,

Pierre

18 March 2011

Japan Science Support

via Rutger Vos:





Rutger Vos, Reading, Berkshire, UK

Support scientists in Japan, RT or visit this page http://biohelpathon.org/ via @rvosa


17 March 2011

Truncating mutations in the last exon of NOTCH2 cause a rare skeletal disorder with osteoporosis

Here is my presentation describing the computational aspect of our paper recently published in Nature Genetics. This paper shows how truncating mutations in the NOTCH2 protein are the main cause of the Hajdu-Cheney syndrome.

Isidor, Bertrand et al. (2011).
"Truncating mutations in the last exon of NOTCH2 cause a rare skeletal disorder with osteoporosis"
Nature Genetics doi:10.1038/ng.778





That's it,

Pierre

07 March 2011

Drawing a protein (Biostar #6172)

This post is my answer to the following question on Biostar, "Drawing a protein":
Dear all I often find protein's image like this (...) Do you know if there's a program to draw them (I mean circles with letters).

I wrote a Java-Swing application named WirePeptide displaying a draggable peptide. This application is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/WirePeptide.java. The user can save the image as PNG, SVG and HTML+Canvas.

Compile & Run

cd jsandbox
ant wirepeptide
java -jar dist/wirepeptide.jar


Result (Canvas)



That's it,

Pierre

06 March 2011

Creating a pdf of your favorite tweets with Apache FOP.

This post describes how I created a PDF document from a set of twitter statuses.

I created an XSLT stylesheet that transforms a Twitter status, provided as XML.

This stylesheet is available on github at: https://github.com/.../twitter/status2fo.xsl.

The stylesheet transforms the XML file generated by the Twitter API (e.g. http://api.twitter.com/1/statuses/show/44175380516585472.xml) to XSL-FO.

This XSL-FO is then processed by Apache FOP to generate a PDF.

fop -pdf result.pdf -xml status.xml -xsl status2fo.xsl


Result:


That's it,

Pierre

03 March 2011

count(biostar-users)=f(time)

Another quick post about http://biostar.stackexchange.com.
Here, I plotted the number of users over time. Note: I embedded the images in the
HTML source with the following command:
(echo -n "<img src='data:image/png;base64,"; base64 < myimage.png | tr -d "\n" ; echo "'/>" ) > image_in_html.html
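
For comparison, the same data-URI trick can be written in Java with java.util.Base64. This is a minimal sketch: the class InlineImage is hypothetical, and the three toy bytes used when no file is given are not a real PNG.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

/** Hypothetical helper: embeds PNG bytes in an HTML <img> tag as a data URI. */
class InlineImage {
    /** Returns an <img> element whose src is the base64-encoded content. */
    static String imgTag(byte[] pngBytes) {
        // the basic encoder emits a single line, like "base64 | tr -d '\n'"
        String encoded = Base64.getEncoder().encodeToString(pngBytes);
        return "<img src='data:image/png;base64," + encoded + "'/>";
    }

    public static void main(String[] args) throws IOException {
        // read the file given as first argument, or fall back to toy bytes
        byte[] bytes = (args.length > 0)
            ? Files.readAllBytes(Paths.get(args[0]))
            : new byte[] { 1, 2, 3 };
        System.out.println(imgTag(bytes));
    }
}
```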






Error in the Y-label: I didn't use the log10 for this one (thanks Tim)


That's it,

Pierre