17 July 2009

Indexing and Searching NCBI Genes with Apache Lucene



In this post I'll show how Apache Lucene can be used programmatically to index the content of a set of NCBI Gene entries, and how to query and retrieve those data.

(via Wikipedia:) Apache Lucene is a free/open source information retrieval Java library. It is supported by the Apache Software Foundation. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others, can all be indexed so long as their textual information can be extracted.

Here my source of data is a set of XML EntrezGene entries related to the initiation of translation and downloaded from the NCBI.

<Entrezgene-Set>

<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>1981</Gene-track_geneid>
<Gene-track_status value="live">0</Gene-track_status>
<Gene-track_create-date>
<Date>
(...)
</Date>
</Gene-track_create-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Homo sapiens</Org-ref_taxname>
<Org-ref_common>human</Org-ref_common>
<Org-ref_syn>
<Org-ref_syn_E>man</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Homo</BinomialOrgName_genus>
<BinomialOrgName_species>sapiens</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>PRI</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
<BioSource_subtype>
<SubSource>
<SubSource_subtype value="chromosome">1</SubSource_subtype>
<SubSource_name>3</SubSource_name>
</SubSource>
</BioSource_subtype>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
<Gene-ref_locus>EIF4G1</Gene-ref_locus>
<Gene-ref_desc>eukaryotic translation initiation factor 4 gamma, 1</Gene-ref_desc>
<Gene-ref_maploc>3q27-qter</Gene-ref_maploc>
<Gene-ref_db>
<Dbtag>
(...)
</Dbtag>
</Gene-ref_db>
<Gene-ref_syn>
<Gene-ref_syn_E>p220</Gene-ref_syn_E>
<Gene-ref_syn_E>EIF4F</Gene-ref_syn_E>
<Gene-ref_syn_E>EIF4G</Gene-ref_syn_E>
<Gene-ref_syn_E>DKFZp686A1451</Gene-ref_syn_E>
</Gene-ref_syn>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_prot>
<Prot-ref>
<Prot-ref_name>
<Prot-ref_name_E>eukaryotic translation initiation factor 4 gamma, 1</Prot-ref_name_E>
<Prot-ref_name_E>EIF4-gamma</Prot-ref_name_E>
</Prot-ref_name>
<Prot-ref_desc>eukaryotic translation initiation factor 4 gamma, 1</Prot-ref_desc>
</Prot-ref>
</Entrezgene_prot>
<Entrezgene_summary>The protein encoded by this gene is a component of the protein complex EIF4F, which is involved in the recognition of the mRNA cap, ATP-dependent unwinding of 5'-terminal secondary structure, and recruitment of mRNA to the ribosome. Alternative splicing results in five transcript variants encoding four distinct isoforms. [provided by RefSeq]</Entrezgene_summary>
<Entrezgene_location>
<Maps>
<Maps_display-str>3q27-qter</Maps_display-str>
<Maps_method>
<Maps_method_map-type value="cyto"/>
</Maps_method>
</Maps>
</Entrezgene_location>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>1981</Gene-source_src-int>
<Gene-source_src-str2>1981</Gene-source_src-str2>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC1981</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>



Indexing the XML


We create a new StandardAnalyzer. It splits the text into tokens, lower-cases them and discards common English stop words.
Analyzer analyzer=new StandardAnalyzer();
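
To give an idea of what such an analysis step does, here is a minimal sketch using only the JDK. It is not Lucene's StandardAnalyzer: the class name ToyAnalyzer, the splitting rule and the very short stop list are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis step: NOT Lucene's StandardAnalyzer, only an illustration. */
final class ToyAnalyzer {
    // Hypothetical, deliberately short stop list chosen for this example.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
        Arrays.asList("a", "an", "and", "by", "in", "is", "of", "the", "this", "to"));

    /** Splits on anything that is not a letter or a digit, lower-cases, drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue; // a leading separator yields an empty string
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Beginning of the Entrezgene_summary of gene 1981
        System.out.println(tokenize(
            "The protein encoded by this gene is a component of the protein complex EIF4F"));
    }
}
```

With this toy stop list, the sentence above is reduced to the tokens protein, encoded, gene, component, protein, complex and eif4f. The real analyzer has its own tokenization rules and stop list.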

An IndexWriter uses this Analyzer to create and maintain the index.

IndexWriter indexWriter=new IndexWriter(
this.luceneDir,//working directory
analyzer,
true,//create
IndexWriter.MaxFieldLength.UNLIMITED //no limit
);

The Entrezgene XML entries are parsed by a SAX handler. Each time a textual field is found, its value is appended to a buffer that will later be tokenized by the Analyzer. We also store the plain values of the ID and a title for each <Entrezgene> entry.
if(name.equals("Gene-track_geneid"))
{
this.id= this.content.toString();
}
else if(this.title==null && StringUtils.isIn(name,"Gene-ref_desc","Prot-ref_desc"))
{
this.title= this.content.toString();
}
this.text.append(this.content.toString()).append(" ");
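
The buffering pattern above can be tried without Lucene. The following is a minimal sketch, assuming a tiny hand-written XML fragment and a hypothetical class named GeneIdTitleSketch. It only extracts the id and the title and leaves out the indexing part.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-alone handler: keeps the gene id and the first description found. */
final class GeneIdTitleSketch extends DefaultHandler {
    private StringBuilder content = null; // text of the current tag, or null if ignored
    String id = null;
    String title = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only buffer the text of the two tags used in this sketch.
        boolean wanted = qName.equals("Gene-track_geneid") || qName.equals("Gene-ref_desc");
        content = wanted ? new StringBuilder() : null;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (content != null) content.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (content != null) {
            if (qName.equals("Gene-track_geneid")) id = content.toString();
            else if (title == null && qName.equals("Gene-ref_desc")) title = content.toString();
        }
        content = null;
    }

    /** Parses an XML string; checked exceptions are wrapped to keep the sketch short. */
    static GeneIdTitleSketch parse(String xml) {
        GeneIdTitleSketch handler = new GeneIdTitleSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<Entrezgene><Gene-track_geneid>1981</Gene-track_geneid>"
            + "<Gene-ref_desc>eukaryotic translation initiation factor 4 gamma, 1</Gene-ref_desc>"
            + "</Entrezgene>";
        GeneIdTitleSketch h = parse(xml);
        System.out.println(h.id + "\t" + h.title);
    }
}
```

The text is collected in characters() because a SAX parser may deliver the content of one element in several chunks. The value is only complete when endElement() is called.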


Each time an <Entrezgene> tag is closed, a Document is created.
The values of the id and the title are saved to this document and the textual content is analyzed.

Document document=new Document();
document.add(
new Field(
"id",
this.id,
Field.Store.YES,//Store the original field value in the index.
Field.Index.NOT_ANALYZED //Index the field's value without using an Analyzer, so it can be searched.
)
);
document.add(
new Field(
"title",
(this.title==null?this.id:this.title),
Field.Store.YES,//Store the original field value in the index.
Field.Index.NOT_ANALYZED //Index the field's value without using an Analyzer, so it can be searched.
)
);
document.add(
new Field(
"content",
this.text.toString(),
Field.Store.YES,//Store the original field value in the index.
Field.Index.ANALYZED//Index the tokens produced by running the field's value through an Analyzer.
)
);

A specific 'weight' (boost) can be assigned to some documents (default is 1.0). For example here, a weight of 100.0 is set for each document containing the word 'Rotavirus'.
if(this.text.toString().toLowerCase().contains("rotavirus"))
{
document.setBoost(100f);
}

... and the document is saved by the indexer:
this.indexWriter.addDocument(document);

At the end, the index is optimized and the indexer is closed.

/* multiple files for each segment are merged into a single file when a new segment is flushed. */
indexWriter.setUseCompoundFile(true);
/* Requests an "optimize" operation on an index, priming the index for the fastest available search. */
indexWriter.optimize();
indexWriter.close();




Output


java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes -p index gene_result.txt.gz
INFO: indexing genes in /tmp/lucene4genes
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 6"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 2B, subunit 5 epsilon, 82kDa"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 2, subunit 1 alpha, 35kDa"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 2B, subunit 4 delta, 67kDa"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 5B"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 4A, isoform 1"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 2A, 65kDa"
Jul 17, 2009 4:14:21 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "eukaryotic translation initiation factor 4A, isoform 2"
(...)
INFO: adding document "eukaryotic translation initiation factor 2B, subunit 5 epsilon"
Jul 17, 2009 4:14:22 PM org.lindenb.tinytools.Lucene4Genes$GeneHandler endElement
INFO: adding document "mitochondrial translational initiation factor 2"


Querying


Lucene provides a rich query language through the QueryParser.
First an IndexSearcher is created for the current directory.
Directory directory= FSDirectory.getDirectory(this.luceneDir);
IndexSearcher searcher=new IndexSearcher(directory);

The QueryParser translates query expressions into one of Lucene’s built-in query types. By default it will search in the "content" field of each Document.
QueryParser q=new QueryParser("content", new StandardAnalyzer());

The TopDocCollector will contain the five best results:
TopDocCollector hitCollector = new TopDocCollector(5);
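
Keeping only the N best hits while scanning many candidates is commonly done with a bounded priority queue. Here is a minimal sketch of that idea with the JDK only. The names TopNSketch and Hit are hypothetical, and this is not a copy of Lucene's collector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector keeping the n highest-scoring hits (n must be at least 1). */
final class TopNSketch {
    /** A (document number, score) pair, named Hit for this illustration only. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    // Min-heap on the score: the weakest hit kept so far sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNSketch(int n) {
        this.n = n;
        this.heap = new PriorityQueue<Hit>(n, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
        });
    }

    /** Offers one hit; it is kept only if it beats the current weakest one. */
    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(doc, score));
        }
    }

    /** Returns the kept hits, best score first. */
    List<Hit> best() {
        List<Hit> out = new ArrayList<Hit>(heap);
        Collections.sort(out, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
        });
        return out;
    }
}
```

The benefit of this design is that memory stays proportional to N, not to the number of matching documents.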

We can now parse and execute the query, then loop over the results. For each document found, we print its id, its title and its score.
Query query =q.parse(terms);
searcher.search(
query,
null,//if non-null, used to permit documents to be collected.
hitCollector
);
TopDocs topDocs = hitCollector.topDocs();
if (topDocs!=null && topDocs.totalHits>0)
{
for(ScoreDoc scoredoc:topDocs.scoreDocs)
{
Document document = searcher.doc(scoredoc.doc);
System.out.println(
document.get("id")+"\t"+
document.get("title")+"\t"+
scoredoc.score
);

}
}

Result


Search for alpha
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "alpha"
200526 similar to eukaryotic translation initiation factor 2 alpha kinase PEK 0.98296475
201554 similar to eukaryotic translation initiation factor 3, subunit 1 (alpha, 35kD) 0.8425412
340467 similar to Eukaryotic translation initiation factor 3 subunit 1 (eIF-3 alpha) (eIF3 p35) (eIF3j) 0.8425412
203221 similar to eukaryotic translation initiation factor 3, subunit 1 (alpha, 35kD) 0.8425412
82918 similar to eukaryotic translation initiation factor 3, subunit 1 (alpha, 35kD) (H. sapiens) 0.8425412

Search for 1967
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "1967"
1967 eukaryotic translation initiation factor 2B, subunit 1 alpha, 26kDa 0.3756647

Search for alpha but NOT subunit
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "+alpha -subunit"
200526 similar to eukaryotic translation initiation factor 2 alpha kinase PEK 0.98296475
56478 eukaryotic translation initiation factor 4E nuclear import factor 1 0.24823609
27102 eukaryotic translation initiation factor 2-alpha kinase 1 0.19858888
1983 eukaryotic translation initiation factor 5 0.19858888
5610 eukaryotic translation initiation factor 2-alpha kinase 2 0.17552942
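
The meaning of the + (required) and - (prohibited) prefixes can be illustrated with plain sets of tokens. This is only a sketch of the boolean part. It ignores scoring and optional terms, and the class name and the sample tokens are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical boolean matcher for queries such as "+alpha -subunit". */
final class RequiredProhibitedSketch {
    /** True if the tokens contain every required term and no prohibited term. */
    static boolean matches(Set<String> docTokens, Set<String> required, Set<String> prohibited) {
        if (!docTokens.containsAll(required)) return false;
        for (String term : prohibited) {
            if (docTokens.contains(term)) return false;
        }
        return true;
    }

    private static Set<String> setOf(String... terms) {
        return new HashSet<String>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        Set<String> required = setOf("alpha");
        Set<String> prohibited = setOf("subunit");
        // Made-up token sets standing for the analyzed content of two documents
        Set<String> doc1 = setOf("factor", "2", "alpha", "kinase");
        Set<String> doc2 = setOf("factor", "3", "subunit", "1", "alpha");
        System.out.println(matches(doc1, required, prohibited)); // true
        System.out.println(matches(doc2, required, prohibited)); // false
    }
}
```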

Search for eif4G. The first entry contains the word Rotavirus; since we boosted such documents, its score is much higher than the others.
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "eif4G"
1981 eukaryotic translation initiation factor 4 gamma, 1 51.464252
1982 eukaryotic translation initiation factor 4 gamma, 2 0.44605052
1973 eukaryotic translation initiation factor 4A, isoform 1 0.42054045
3646 eukaryotic translation initiation factor 3, subunit E 0.26019612
8661 eukaryotic translation initiation factor 3, subunit A 0.22302526

Search for rotavirus AND anyvirus.
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "(rotavirus AND anyvirus)"
(empty)

Search for rotavirus OR anyvirus.
java -cp lucene-2.4.1/lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "(rotavirus OR anyvirus)"
1981 eukaryotic translation initiation factor 4 gamma, 1 10.424776

Search for the document whose id field equals 203221.
java -cp lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "id:203221"
203221 similar to eukaryotic translation initiation factor 3, subunit 1 (alpha, 35kD) 6.0106354

Search for the document whose id field equals 00000 (there is none).
java -cp lucene-2.4.1/lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "id:00000"
(empty)

Search for the documents containing a term starting with chrom (wildcard query chrom*).
java -cp lucene-2.4.1/lucene-core-2.4.1.jar:build org.lindenb.tinytools.Lucene4Genes \
-p query "chrom*"
83754 eukaryotic translation initiation factor 1A, X chromosome 0.7984222
653994 similar to Eukaryotic translation initiation factor 4H (eIF-4H) (Williams-Beuren syndrome chromosome region 1 protein homolog) 0.5432575
1968 eukaryotic translation initiation factor 2, subunit 3 gamma, 52kDa 0.4981929
3646 eukaryotic translation initiation factor 3, subunit E 0.38104227
54791 argonaute 4 0.30731285


Source code


The source code is also available at Lucene4Genes.java.
package org.lindenb.tinytools;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.lindenb.io.IOUtils;
import org.lindenb.util.Compilation;
import org.lindenb.util.StringUtils;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
* A test for Apache Lucene
*
*/
public class Lucene4Genes
{
private static Logger LOG= Logger.getLogger(Lucene4Genes.class.getName());
private File luceneDir;


/**
* A SAXHandler parsing Entrez Gene and indexing the textual data
* @author pierre
*
*/
private static class GeneHandler
extends DefaultHandler
{
//current value of the tag
private StringBuilder content=null;
//entrez gene id
private String id=null;
//entrez gene title
private String title=null;
//entrez gene concatenated textual data
private StringBuilder text= new StringBuilder();
//lucene indexer
private IndexWriter indexWriter;

GeneHandler(IndexWriter indexWriter)
{
this.indexWriter=indexWriter;
}

@Override
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException
{
this.content=null;
if(StringUtils.isIn(
name,
"Gene-track_geneid",
"Gene-ref_locus",
"Gene-ref_desc",
"Prot-ref_name_E",
"Prot-ref_desc",
"Entrezgene_summary",
"Gene-commentary_text"
))
{
this.content=new StringBuilder();
}

}

@Override
public void endElement(String uri, String localName, String name)
throws SAXException
{
if(name.equals("Entrezgene"))
{
try {
LOG.info("adding document \""+title+"\"");
Document document=new Document();
document.add(
new Field(
"id",
this.id,
Field.Store.YES,//Store the original field value in the index.
Field.Index.NOT_ANALYZED //Index the field's value without using an Analyzer, so it can be searched.
)
);
document.add(
new Field(
"title",
(this.title==null?this.id:this.title),
Field.Store.YES,//Store the original field value in the index.
Field.Index.NOT_ANALYZED //Index the field's value without using an Analyzer, so it can be searched.
)
);
document.add(
new Field(
"content",
this.text.toString(),
Field.Store.YES,//Store the original field value in the index.
Field.Index.ANALYZED//Index the tokens produced by running the field's value through an Analyzer.
)
);
//Sets a boost factor for hits on any field of this document. This value will be multiplied into the score of all hits on this document.
if(this.text.toString().toLowerCase().contains("rotavirus"))
{
document.setBoost(100f);
}
//Adds a document to this index.
this.indexWriter.addDocument(document);

} catch (CorruptIndexException e) {
throw new SAXException(e);
} catch (IOException e) {
throw new SAXException(e);
}
this.id=null;
this.title=null;
this.text= new StringBuilder();
}
else if(this.content!=null)
{
if(name.equals("Gene-track_geneid"))
{
this.id= this.content.toString();
}
else if(this.title==null && StringUtils.isIn(name,"Gene-ref_desc","Prot-ref_desc"))
{
this.title= this.content.toString();
}

this.text.append(this.content.toString()).append(" ");
}
this.content=null;
}

@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
if(content!=null)
{
content.append(ch, start, length);
}
}

}

/** Constructor, create the directory if it does not exist */
private Lucene4Genes(File luceneDir)
throws IOException
{
if(!luceneDir.exists())
{
if(!luceneDir.mkdir())
{
throw new IOException("Cannot create "+luceneDir);
}
System.err.println("Created "+luceneDir);
}
if(!luceneDir.isDirectory())
{
throw new IOException("Not a directory "+luceneDir);
}
this.luceneDir=luceneDir;
}

/**
* Index the XML stream containing the entrez genes
* @param in xml stream
* @throws IOException
* @throws SAXException
*/
private void indexGenes(InputStream in) throws IOException,SAXException
{
LOG.info("indexing genes in "+this.luceneDir);
SAXParserFactory f= SAXParserFactory.newInstance();
f.setNamespaceAware(true);
SAXParser parser= null;
try {
parser=f.newSAXParser();
}
catch (ParserConfigurationException err)
{
throw new SAXException(err);
}

/* An Analyzer builds TokenStreams, which analyze text.
* It thus represents a policy for extracting index terms from text.
*/
Analyzer analyzer=new StandardAnalyzer();

/* An IndexWriter creates and maintains an index. */
IndexWriter indexWriter=new IndexWriter(
this.luceneDir,//data dir
analyzer,
true,//create
IndexWriter.MaxFieldLength.UNLIMITED //no limit
);

parser.parse(in, new GeneHandler(indexWriter));

/* multiple files for each segment are merged into a single file when a new segment is flushed. */
indexWriter.setUseCompoundFile(true);
/* Requests an "optimize" operation on an index, priming the index for the fastest available search. */
indexWriter.optimize();
indexWriter.close();
}

/**
* Search our database with the user query, print the result to stdout
* @param terms
* @throws IOException
*/
private void search(String terms) throws IOException
{
Directory directory= FSDirectory.getDirectory(this.luceneDir);
IndexSearcher searcher=new IndexSearcher(directory);
/* QueryParser translates query expressions into one of Lucene’s built-in query types */
QueryParser q=new QueryParser("content", new StandardAnalyzer());
try
{
TopDocCollector hitCollector = new TopDocCollector(5);
Query query =q.parse(terms);
searcher.search(
query,
null,//if non-null, used to permit documents to be collected.
hitCollector
);
TopDocs topDocs = hitCollector.topDocs();

if (topDocs!=null && topDocs.totalHits>0)
{
for(ScoreDoc scoredoc:topDocs.scoreDocs)
{
Document document = searcher.doc(scoredoc.doc);
System.out.println(
document.get("id")+"\t"+
document.get("title")+"\t"+
scoredoc.score
);

}
}
}
catch(ParseException err)
{
throw new IOException(err);
}
}


public static void main(String[] args)
{
Lucene4Genes app=null;
try
{
File dir= new File(System.getProperty("java.io.tmpdir"),"lucene4genes");
String program=null;
int optind=0;
while(optind< args.length)
{
if(args[optind].equals("-h"))
{
System.err.println("Lucene for genes. Pierre Lindenbaum PhD (2009).");
System.err.println(Compilation.getLabel());
System.err.println("options:");
System.err.println(" -d <lucene-directory> default:"+dir);
System.err.println(" -p <program>");
System.err.println(" 'index' <stdin|files> index the EntrezGenes input");
System.err.println(" 'query' '<the query>'");
}
else if(args[optind].equals("-d"))
{
dir=new File(args[++optind]);
}
else if(args[optind].equals("-p"))
{
program=args[++optind];
}
else if(args[optind].equals("--"))
{
optind++;
break;
}
else if(args[optind].startsWith("-"))
{
System.err.println("Unknown option "+args[optind]);
}
else
{
break;
}
++optind;
}
if(program==null)
{
System.err.println("Undefined program");
return;
}
app= new Lucene4Genes(dir);
if(program.equals("query"))
{
if(optind+1!=args.length)
{
System.err.println("Illegal number of arguments.");
return;
}
String query= args[optind++];
app.search(query);
}
else if(program.equals("index"))
{

if(optind==args.length)
{
LOG.info("reading stdin");
app.indexGenes(System.in);
}
else
{
while(optind< args.length)
{
String filename=args[optind++];
LOG.info("reading file "+filename);
java.io.InputStream r= IOUtils.openInputStream(filename);
app.indexGenes(r);
r.close();
}
}
}
else
{
System.err.println("Unknown program "+program);
return;
}
}
catch(Throwable err)
{
err.printStackTrace();
}
}
}


That's it !
Pierre
