Indexing the content of Gene Ontology with Apache SOLR
Download and install SOLR
Download from http://mirrors.ircam.fr/pub/apache/lucene/solr/3.5.0/apache-solr-3.5.0.tgz
$ tar xvfz apache-solr-3.5.0.tgz
$ rm apache-solr-3.5.0.tgz
Configure schema.xml
We need to tell SOLR which fields of GO will be indexed, what their types are, and how they should be tokenized and parsed. This information is defined in schema.xml. The following components will be indexed: accession, name, synonym and definition. The accession goes into the id field that the example schema already defines, so only three fields need to be added. Edit apache-solr-3.5.0/example/solr/conf/schema.xml and add the following to the <fields> section:
<field name="go_name" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="go_synonym" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="go_definition" type="text_general" indexed="true" stored="true" multiValued="false"/>
Start the SOLR server
In this example, the SOLR server is started using the simple Jetty server provided in the distribution:
$ cd apache-solr-3.5.0/example
$ java -jar start.jar
(...)
Indexing Gene Ontology
GO is downloaded as RDF/XML from http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go PUBLIC "-//Gene Ontology//Custom XML/RDF Version 2.0//EN" "http://www.geneontology.org/dtd/go.dtd">
<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0000001">
      <go:accession>GO:0000001</go:accession>
      <go:name>mitochondrion inheritance</go:name>
      <go:synonym>mitochondrial inheritance</go:synonym>
      <go:definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</go:definition>
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048308" />
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048311" />
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0000002">
      <go:accession>GO:0000002</go:accession>
      <go:name>mitochondrial genome maintenance</go:name>
      <go:definition>The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</go:definition>
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0007005" />
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>InterPro</go:database_symbol>
(...)
We now need to transform this XML file to another XML file that can be indexed by the SOLR server.
"You can modify a Solr index by POSTing XML Documents containing instructions to add (or update) documents, delete documents, commit pending adds and deletes, and optimize your index."
The following XSLT stylesheet is used to transform the RDF/XML for GO:
<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    version='1.0'
    >
  <xsl:output method='xml' encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <add>
      <xsl:apply-templates select="go:go/rdf:RDF/go:term"/>
    </add>
  </xsl:template>

  <xsl:template match="go:term">
    <doc>
      <xsl:apply-templates select="go:accession|go:name|go:synonym|go:definition"/>
    </doc>
  </xsl:template>

  <xsl:template match="go:accession">
    <field name="id">
      <xsl:value-of select="."/>
    </field>
  </xsl:template>

  <xsl:template match="go:synonym|go:definition|go:name">
    <xsl:element name="field">
      <xsl:attribute name="name">
        <xsl:value-of select="translate(name(),':','_')"/>
      </xsl:attribute>
      <xsl:if test="local-name()='name'">
        <xsl:attribute name="boost">2</xsl:attribute>
      </xsl:if>
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
$ xsltproc --novalid go2solr.xsl go_daily-termdb.rdf-xml.gz > add.xml
$ cat add.xml
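For readers who want to see what the stylesheet produces without running xsltproc, here is a minimal Python sketch of the same mapping using only the standard library. It is not the method used in this post, only an illustration: the function name go_to_solr_add and the SAMPLE constant are hypothetical, and the sample is a single term copied from the RDF/XML excerpt above.

```python
import xml.etree.ElementTree as ET

GO_NS = "http://www.geneontology.org/dtds/go.dtd#"
NS = {
    "go": GO_NS,
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# One term copied from the GO RDF/XML excerpt (illustrative input only).
SAMPLE = """<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0000001">
      <go:accession>GO:0000001</go:accession>
      <go:name>mitochondrion inheritance</go:name>
      <go:synonym>mitochondrial inheritance</go:synonym>
      <go:definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</go:definition>
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048308" />
    </go:term>
  </rdf:RDF>
</go:go>"""


def go_to_solr_add(rdf_xml):
    """Build a Solr <add> element from GO RDF/XML (hypothetical helper)."""
    root = ET.fromstring(rdf_xml)
    add = ET.Element("add")
    for term in root.findall("rdf:RDF/go:term", NS):
        doc = ET.SubElement(add, "doc")
        for child in term:
            # Skip anything that is not in the GO namespace.
            if not child.tag.startswith("{" + GO_NS + "}"):
                continue
            local = child.tag.split("}", 1)[1]
            if local == "accession":
                # The accession becomes the document id.
                field = ET.SubElement(doc, "field", name="id")
            elif local in ("name", "synonym", "definition"):
                field = ET.SubElement(doc, "field", name="go_" + local)
                if local == "name":
                    # Same boost as in the stylesheet.
                    field.set("boost", "2")
            else:
                # is_a, dbxref, etc. are not indexed.
                continue
            field.text = "".join(child.itertext())
    return add


print(ET.tostring(go_to_solr_add(SAMPLE), encoding="unicode"))
```

Each go:term becomes one doc whose fields are named id, go_name, go_synonym and go_definition, which are the names declared in schema.xml.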
Before indexing, the disk usage under apache-solr-3.5.0 is 136 MB. We can now use the Java utility post.jar to index Gene Ontology.
$ cd ~/package/apache-solr-3.5.0/example/exampledocs
$ java -jar post.jar add.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file jeter.xml
SimplePostTool: COMMITting Solr index changes..
After indexing, the disk usage under apache-solr-3.5.0 is 153 MB.
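As the log suggests, post.jar POSTs the XML file to the update URL and then asks for a commit. A rough Python equivalent, assuming the server runs at the URL printed in the log, could look like the sketch below. The helper names build_update_request and post_file are hypothetical, and the text/xml content type is an assumption about what the update handler accepts.

```python
import urllib.request

# URL printed by post.jar in the log above.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_payload, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) a POST carrying an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_payload,  # a non-empty body makes urllib issue a POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def post_file(path, url=SOLR_UPDATE_URL):
    """Send the file content, then a <commit/> message (hypothetical helper)."""
    with open(path, "rb") as handle:
        payload = handle.read()
    for body in (payload, b"<commit/>"):
        with urllib.request.urlopen(build_update_request(body, url)) as response:
            print(response.status)


# Usage, with the SOLR server running:
#   post_file("add.xml")
```

Building the request is kept separate from sending it, so the payload and headers can be inspected without a running server.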
Querying
Search for the GO terms having a go:definition containing "cancer" or a go:name containing "genome", but discard those having a go:definition containing "metabolism":
curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on"
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">go_definition:cancer go_name:genome -go_definition:metabolism</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="35" start="0">
    <doc>
      <str name="go_definition">The whole of the genetic information of a virus, contained as either DNA or RNA.</str>
      <str name="go_name">viral genome</str>
      <str name="id">GO:0019015</str>
    </doc>
    <doc>
      <str name="go_definition">The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</str>
      <str name="go_name">mitochondrial genome maintenance</str>
      <str name="id">GO:0000002</str>
    </doc>
    <doc>
      <str name="go_definition">A viral genome that consists of one continuous nucleic acid molecule.</str>
      <str name="go_name">non-segmented viral genome</str>
      <str name="id">GO:0019016</str>
    </doc>
    <doc>
      <str name="go_definition">A viral genome that is divided into two or more physically separate molecules of nucleic acid and packaged into a single virion.</str>
      <str name="go_name">segmented viral genome</str>
      <str name="id">GO:0019017</str>
    </doc>
    <doc>
      <str name="go_definition">A segmented viral genome consisting of two sub-genomic nucleic acids but each nucleic acid is packaged into a different virion.</str>
      <str name="go_name">bipartite viral genome</str>
      <str name="id">GO:0019018</str>
    </doc>
    <doc>
      <str name="go_definition">A segmented viral genome consisting of three sub-genomic nucleic acids but each nucleic acid is packaged into a different virion.</str>
      <str name="go_name">tripartite viral genome</str>
      <str name="id">GO:0019019</str>
    </doc>
    <doc>
      <str name="go_definition">A segmented viral genome consisting of more than three sub-genomic nucleic acids but each nucleic acid is packaged into a different virion.</str>
      <str name="go_name">multipartite viral genome</str>
      <str name="id">GO:0019020</str>
    </doc>
    <doc>
      <str name="go_definition">A viral genome composed of deoxyribonucleic acid.</str>
      <str name="go_name">DNA viral genome</str>
      <str name="id">GO:0019021</str>
    </doc>
    <doc>
      <str name="go_definition">A viral genome composed of ribonucleic acid. This results in genome replication and expression of genetic information being inextricably linked.</str>
      <str name="go_name">RNA viral genome</str>
      <str name="id">GO:0019022</str>
    </doc>
    <doc>
      <str name="go_definition">A viral genome composed of double stranded RNA.</str>
      <str name="go_name">dsRNA viral genome</str>
      <str name="id">GO:0019023</str>
    </doc>
  </result>
</response>
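The percent-encoded URL passed to curl is tedious to write by hand. As a small sketch, it can be assembled with Python's standard urllib.parse module. The function name build_select_url is hypothetical; the parameter names and values are the ones used in the curl command.

```python
from urllib.parse import urlencode

# Select handler used in the curl command.
SOLR_SELECT_URL = "http://localhost:8983/solr/select/"


def build_select_url(query, start=0, rows=10, wt=None):
    """Return the select URL for a query (hypothetical helper)."""
    params = [
        ("q", query),
        ("version", "2.2"),
        ("start", start),
        ("rows", rows),
        ("indent", "on"),
    ]
    if wt is not None:
        # Optional response writer, for example "json".
        params.append(("wt", wt))
    # urlencode turns ':' into %3A and spaces into '+'.
    return SOLR_SELECT_URL + "?" + urlencode(params)


query = "go_definition:cancer go_name:genome -go_definition:metabolism"
print(build_select_url(query))
print(build_select_url(query, wt="json"))
```

Writing the query in plain Solr syntax and letting the library encode it avoids mistakes with the %3A and + escapes.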
The same query, this time asking for a JSON response with wt=json:
curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"on",
      "start":"0",
      "q":"go_definition:cancer go_name:genome -go_definition:metabolism",
      "wt":"json",
      "version":"2.2",
      "rows":"10"}},
  "response":{"numFound":35,"start":0,"docs":[
      {
        "id":"GO:0019015",
        "go_name":"viral genome",
        "go_definition":"The whole of the genetic information of a virus, contained as either DNA or RNA."},
      {
        "id":"GO:0000002",
        "go_name":"mitochondrial genome maintenance",
        "go_definition":"The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome."},
      {
        "id":"GO:0019016",
        "go_name":"non-segmented viral genome",
        "go_definition":"A viral genome that consists of one continuous nucleic acid molecule."},
      {
        "id":"GO:0019017",
        "go_name":"segmented viral genome",
        "go_definition":"A viral genome that is divided into two or more physically separate molecules of nucleic acid and packaged into a single virion."},
      {
        "id":"GO:0019018",
        "go_name":"bipartite viral genome",
        "go_definition":"A segmented viral genome consisting of two sub-genomic nucleic acids but each nucleic acid is packaged into a different virion."},
      {
        "id":"GO:0019019",
        "go_name":"tripartite viral genome",
        "go_definition":"A segmented viral genome consisting of three sub-genomic nucleic acids but each nucleic acid is packaged into a different virion."},
      {
        "id":"GO:0019020",
        "go_name":"multipartite viral genome",
        "go_definition":"A segmented viral genome consisting of more than three sub-genomic nucleic acids but each nucleic acid is packaged into a different virion."},
      {
        "id":"GO:0019021",
        "go_name":"DNA viral genome",
        "go_definition":"A viral genome composed of deoxyribonucleic acid."},
      {
        "id":"GO:0019022",
        "go_name":"RNA viral genome",
        "go_definition":"A viral genome composed of ribonucleic acid. This results in genome replication and expression of genetic information being inextricably linked."},
      {
        "id":"GO:0019023",
        "go_name":"dsRNA viral genome",
        "go_definition":"A viral genome composed of double stranded RNA."}]
  }}
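The JSON output is convenient to consume from a script. The following sketch reads such a response with Python's json module and lists the matching terms. The function name summarize is hypothetical, and SAMPLE is a trimmed copy of the response above (first two documents only) so that the example runs without a server.

```python
import json

# Trimmed copy of the JSON response (first two docs only).
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {"numFound": 35, "start": 0, "docs": [
    {"id": "GO:0019015",
     "go_name": "viral genome",
     "go_definition": "The whole of the genetic information of a virus, contained as either DNA or RNA."},
    {"id": "GO:0000002",
     "go_name": "mitochondrial genome maintenance",
     "go_definition": "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome."}
  ]}
}
"""


def summarize(response_text):
    """Return (numFound, [(id, go_name), ...]) for a select response."""
    payload = json.loads(response_text)
    # A status of 0 in the response header means the query succeeded.
    if payload["responseHeader"]["status"] != 0:
        raise RuntimeError("SOLR reported a non-zero status")
    response = payload["response"]
    hits = [(doc["id"], doc["go_name"]) for doc in response["docs"]]
    return response["numFound"], hits


total, hits = summarize(SAMPLE)
print(total, "terms found; showing", len(hits))
for accession, name in hits:
    print(accession, name, sep="\t")
```

numFound is the total number of matches, while docs only holds the page selected by the start and rows parameters.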
4 comments:
Hi Pierre
As it happens we're investigating Solr as a backend for both QuickGO and AmiGO. We're looking at storing not only the ontology terms, but also all genes together with the pre-computed inferences over the whole ontology. We should have a demo server available soon.
hi
pierre, I have to use a custom genome (silkworm) and go annotation to identify binding sites and annotate peaks of chipseq data.
I have go and interpro annotations.I was browsing to manually do the annotation, a program called ceas needs sqlite3 format. could you please suggest how to convert a go annotation to a sqlite3 file.
or an alternative where i can annotate peaks (bed) with a go or interpro file.
thanks
harsha
pune, india
harsha, ask biostars.org please
Just a quick thank you / merci beaucoup - almost 4 years from you posting it, this has been particularly helpful ... can look forward to a bright ontology-searching future ;)