23 June 2009

Back to basics: PostScript

(From Wikipedia): The PostScript language is a vector drawing format used to describe a page. It is an interpreted, stack-based language. The syntax uses reverse Polish notation, which makes the order of operations unambiguous, but reading a program requires some practice because one has to keep the layout of the stack in mind.
For example, the following PostScript program draws a line from the point (0,0) to (100,100):

%!
0 0 moveto
100 100 lineto
stroke
You can view this file in Ghostscript or send the script directly to your PostScript printer.
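To see how the stack works, here is a minimal sketch in Java of a reverse-Polish evaluator. It only implements made-up 'add' and 'mul' words standing in for PostScript's arithmetic operators; it is an illustration of the stack discipline, not a PostScript interpreter:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Rpn {
    // Evaluate a whitespace-separated RPN expression such as "100 100 add".
    // Numbers are pushed; an operator pops its operands and pushes its result,
    // just as a PostScript interpreter consumes '0 0 moveto 100 100 lineto'.
    static double eval(String program) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String token : program.trim().split("\\s+")) {
            switch (token) {
                case "add": stack.push(stack.pop() + stack.pop()); break;
                case "mul": stack.push(stack.pop() * stack.pop()); break;
                default:    stack.push(Double.parseDouble(token));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(eval("100 100 add"));   // 200.0
        System.out.println(eval("2 3 add 4 mul")); // 20.0
    }
}
```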

I've played with PostScript and created an XSLT stylesheet transforming a simple SVG document into PostScript. The stylesheet is available here: svg2ps.xsl.

The following SVG document
<svg xmlns="http://www.w3.org/2000/svg" width="300px" height="300px" preserveAspectRatio="xMidYMid meet" zoomAndPan="magnify" version="1.0" contentScriptType="text/ecmascript" contentStyleType="text/css">
<title content="structured text">Small SVG example</title>
<circle cx="120" cy="150" r="60" fill="gold"/>
<polyline points="120 30, 25 150, 290 150" stroke-width="4" stroke="brown" fill="none"/>
<polygon points="210 100, 210 200, 270 150" fill="lawngreen"/>
<text x="60" y="250" fill="blue">Hello World</text>
</svg>




was transformed using my XSLT stylesheet; here is the result, uploaded to scribd.com:

A Simple SVG document transformed to PDF with XSLT

Pierre

22 June 2009

Event-driven XML parsing (SAX) with Java+JavaScript

I just wrote SAXScript, a Java program wrapping an event-driven SAX parser that invokes user-defined JavaScript callbacks. It can be used to quickly write a piece of code that parses a huge XML file.
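Under the hood, SAXScript simply bridges the standard SAX callbacks to JavaScript. A minimal pure-Java sketch of the same event-driven model (without the scripting bridge; class and method names are made up) looks like this:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parse XML and fire a callback for each element: the document is
    // streamed, never held whole in memory, which is why SAX scales
    // to huge XML files.
    public static int countElements(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String name, Attributes atts) {
                count[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><b/></a>")); // 3
    }
}
```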

Download


Download saxscript.jar from http://code.google.com/p/lindenb/downloads/list

Invoke


java -jar saxscript.jar (options) [file|url]s

Options


-h (help) this screen
-f read javascript script from file
-e 'script' read javascript script from argument
-D add a variable (as string) in the scripting context.
__FILENAME__ is the current uri.
-n SAX parser is NOT namespace aware (default true)
-v SAX parser is validating (default false)

Callbacks


function startDocument()
{println("Start doc");}
function endDocument()
{println("End doc");}
function startElement(uri,localName,name,atts)
{
print(""+__FILENAME__+" START uri: "+uri+" localName:"+localName);
for(var i=0;atts!=undefined && i< atts.getLength();++i)
{
print(" @"+atts.getQName(i)+"="+atts.getValue(i));
}
println("");
}
function characters(s)
{println("Characters :" +s);}
function endElement(uri,localName,name)
{println("END: uri: "+uri+" localName:"+localName);}

Source Code



Example


The following shell script invokes NCBI/ESearch to retrieve a history key giving access to all the bibliographic references about the Rotaviruses (8793 references).
This key is then used to download each PubMed entry, and we then count the number of times each journal (tag "MedlineTA") was cited.
#!/bin/sh
JAVA=${JAVA_HOME}/bin/java
WEBENV=`${JAVA} -jar saxscript.jar \
-e '
var WebEnv=null;
function startElement(uri,localName,name,atts)
{
if(name=="WebEnv") WebEnv="";
}

function characters(s)
{
if(WebEnv!=null) WebEnv+=s;
}

function endElement(uri,localName,name)
{
if(WebEnv!=null)
{
print(WebEnv);
WebEnv=null;
}
}
' \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&usehistory=y&retmode=xml&term=Rotavirus"`


${JAVA} -jar saxscript.jar -e '
var content=null;
var hash=new Array();
function startElement(uri,localName,name,atts)
{
if(name=="MedlineTA") content="";
}

function characters(s)
{
if(content!=null) content+=s;
}

function endElement(uri,localName,name)
{
if(content!=null)
{
var c=hash[content];
hash[content]=(c==null?1:c+1);
content=null;
}
}
function endDocument()
{
for(var content in hash)
{
println(content+"\t"+ hash[content]);
}
}
' "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=${WEBENV}&retmode=xml" |\
sort -t ' ' -k2n

Result


Acta Gastroenterol Latinoam 1
Acta Histochem Suppl 1
Acta Microbiol Acad Sci Hung 1
Acta Microbiol Hung 1
Acta Microbiol Immunol Hung 1
Acta Pathol Microbiol Scand C 1
Acta Vet Acad Sci Hung 1
Adv Neonatal Care 1
Adv Nurse Pract 1
Adv Ther 1
Adv Vet Med 1
Afr J Med Med Sci 1
Age Ageing 1
AIDS Res Hum Retroviruses 1
AJNR Am J Neuroradiol 1
AJR Am J Roentgenol 1
Akush Ginekol (Sofiia) 1
(...)
Appl Environ Microbiol 87
J Pediatr Gastroenterol Nutr 97
J Virol Methods 130
Lancet 130
Vaccine 158
Pediatr Infect Dis J 177
J Gen Virol 217
Arch Virol 254
J Med Virol 262
J Infect Dis 265
Virology 278
J Virol 460
J Clin Microbiol 514


That's it!
Pierre

12 June 2009

RDF storage with a Key/Value Engine

Just curious: can I store some RDF statements in a key/value engine like BerkeleyDB (Java Edition)?
Yes, it's like re-inventing the wheel but, again, I like re-inventing the wheel :-)

A BerkeleyDB database contains a set of key/value pairs, e.g.:






Key                    Value
SecurityNumber:9877    FirstName:John LastName:Doe
SecurityNumber:9899    FirstName:Peter LastName:Parker
SecurityNumber:9988    FirstName:Edith LastName:Parker

Data are stored as arrays of bytes. Keys are ordered byte-wise and the records are stored in a B-Tree. A database can be configured to store unique or duplicate keys.
In BerkeleyDB, a Cursor is an iterator used to scan the database: as the keys are sorted, accessing a range of keys is very fast.
Some indexes (Secondary Databases) can be linked to a database. For example, in the previous table, if I wanted to quickly access the persons having LastName=="Parker", I would create a secondary database on LastName. Deleting an item through the secondary database automatically deletes the corresponding item in the main database. As far as I know, you cannot create a secondary database if your primary database allows duplicate keys.





Key2               Key1                   Value1
LastName:Doe       SecurityNumber:9877    FirstName:John LastName:Doe
LastName:Parker    SecurityNumber:9899    FirstName:Peter LastName:Parker
LastName:Parker    SecurityNumber:9988    FirstName:Edith LastName:Parker
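The two tables above can be sketched with plain java.util collections (made-up keys, not the real BerkeleyDB API): a TreeMap plays the role of the B-Tree, so sorted keys make range scans cheap, and a second sorted map acts as the secondary index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class BTreeSketch {
    // Primary "database": key -> value, kept sorted like a B-Tree.
    static final NavigableMap<String, String> primary = new TreeMap<>();
    // Secondary "index": LastName-based key -> primary key. A suffix keeps
    // duplicate last names distinct, since TreeMap itself forbids duplicates.
    static final NavigableMap<String, String> byLastName = new TreeMap<>();

    static {
        primary.put("SecurityNumber:9877", "FirstName:John LastName:Doe");
        primary.put("SecurityNumber:9899", "FirstName:Peter LastName:Parker");
        primary.put("SecurityNumber:9988", "FirstName:Edith LastName:Parker");
        byLastName.put("Doe|9877", "SecurityNumber:9877");
        byLastName.put("Parker|9899", "SecurityNumber:9899");
        byLastName.put("Parker|9988", "SecurityNumber:9988");
    }

    // Because keys are sorted, fetching all entries for one last name is a
    // cheap range scan, like a BerkeleyDB Cursor over a SecondaryDatabase.
    static List<String> findByLastName(String lastName) {
        List<String> hits = new ArrayList<>();
        for (String pk : byLastName.subMap(lastName + "|", lastName + "|\uffff").values()) {
            hits.add(primary.get(pk));
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findByLastName("Parker")); // both Parkers
    }
}
```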


OK, now I want to store some RDF statements, that is to say something like the following triple:
{
SUBJECT = RESOURCE;
PREDICATE = RESOURCE;
OBJECT = ( RESOURCE || LITERAL)
}

I need to create an index to quickly find any statement matching one, two, or all three components of the statement.
In the solution I've implemented, there is only one primary database: all the components of a statement are part of the 'DATA', and the 'KEY' of the database is just a unique number.





Key    Value
1      (s1,p1,o1)
2      (s2,p2,o2)
3      (s3,p3,o3)
Some secondary databases (subject2triple, predicate2triple, objectLiteral2triple, objectRsrc2triple) are used to quickly find each RDF Statement from a given node.
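The idea behind those secondary databases can be sketched with plain Java collections (made-up node names, not the real BerkeleyDB API): one inverted index per component, and an intersection of the candidate id sets when several components are given:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TripleJoinSketch {
    // One inverted index per statement component: node -> ids of the triples
    // using it (the ids stand in for the unique numeric primary keys).
    static final Map<String, Set<Integer>> subject2triple = new HashMap<>();
    static final Map<String, Set<Integer>> predicate2triple = new HashMap<>();
    static final Map<String, Set<Integer>> object2triple = new HashMap<>();

    static void add(int id, String s, String p, String o) {
        subject2triple.computeIfAbsent(s, k -> new TreeSet<>()).add(id);
        predicate2triple.computeIfAbsent(p, k -> new TreeSet<>()).add(id);
        object2triple.computeIfAbsent(o, k -> new TreeSet<>()).add(id);
    }

    static {
        add(1, "alice", "knows", "bob");
        add(2, "alice", "knows", "carol");
    }

    // A null component is a wildcard; otherwise intersect the candidate id
    // sets, as BerkeleyDB's JoinCursor does over the secondary cursors.
    static Set<Integer> find(String s, String p, String o) {
        List<Map<String, Set<Integer>>> indexes =
            Arrays.asList(subject2triple, predicate2triple, object2triple);
        String[] keys = {s, p, o};
        Set<Integer> result = null;
        for (int i = 0; i < 3; i++) {
            if (keys[i] == null) continue;
            Set<Integer> hits = indexes.get(i).getOrDefault(keys[i], Collections.emptySet());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        System.out.println(find("alice", "knows", null)); // [1, 2]
        System.out.println(find(null, null, "bob"));      // [1]
    }
}
```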

Creating the database 'triplesDB':
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
this.environment= new Environment(envFile,envCfg);
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setSortedDuplicates(false);
this.triplesDB= this.environment.openDatabase( null, "triples", cfg);


We then create the secondary indexes for each component of a statement (subject2triple, predicate2triple, objectLiteral2triple, objectRsrc2triple). For example, the following code opens the secondary database 'objectRsrc2triple'. We create a 'SecondaryKeyCreator' to tell BerkeleyDB how it should extract the secondary key from the primary data.

config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(!stmt.getValue().isResource()) return false;
Resource L= Resource.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveResource(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectRsrc2triple=this.environment.openSecondaryDatabase(null,"objectRsrc2triple", triplesDB, config2);

We need some methods to write and read the components of a Statement from/to an array of bytes:
public void objectToEntry(Statement stmt, TupleOutput out)
{
saveResource(stmt.getSubject(),out);
saveResource(stmt.getPredicate(),out);
if(stmt.getValue().isResource())
{
out.writeByte(OPCODE_RESOURCE);
saveResource(Resource.class.cast(stmt.getValue()),out);
}
else
{
out.writeByte(OPCODE_LITERAL);
saveLiteral(Literal.class.cast(stmt.getValue()),out);
}
}

public Statement entryToObject(TupleInput in)
{
Resource subject = readResource(in);
Resource predicate = readResource(in);
RDFNode object=null;
switch(in.readByte())
{
case OPCODE_RESOURCE:
{
object= readResource(in);
break;
}
case OPCODE_LITERAL:
{
object= readLiteral(in);
break;
}
default: throw new IllegalStateException("Unknown opcode");
}
return new Statement(subject,predicate,object);
}

I've also wrapped the Cursor in a java.util.Iterator. One interesting iterator is the JoinIterator, which quickly retrieves the *common* Statements returned by a distinct set of Cursors: on the first call, each cursor seeks to its searched key; then we let the BerkeleyDB API find the intersection of those cursors.
private class JoinIterator
extends AbstractIterator<Statement>
{
/** the joined iterators */
protected List<CursorAndEntries> cursorEntries;
/** our join cursor */
private JoinCursor joinCursor=null;
/** current key */
protected DatabaseEntry keyEntry=new DatabaseEntry();
/** current value */
protected DatabaseEntry valueEntry=new DatabaseEntry();
protected JoinIterator(List<CursorAndEntries> cursorsEntries)
{
this.cursorEntries=new ArrayList<CursorAndEntries>(cursorsEntries);
}
protected Statement readNext() throws DatabaseException
{
if(super._firstCall)
{
super._firstCall=false;
for(CursorAndEntries ca:this.cursorEntries)
{
if(ca.cursor.getSearchKey(ca.keyEntry,ca.valueEntry,null)!=OperationStatus.SUCCESS)
{
return null;
}
}
Cursor cursors[]= new Cursor[this.cursorEntries.size()];
for(int i=0;i< this.cursorEntries.size();++i)
{
cursors[i]= cursorEntries.get(i).cursor;
}
joinCursor = BerkeleyDBModel.this.triplesDB.join(cursors, null);
}
if(joinCursor.getNext(keyEntry, valueEntry,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(valueEntry);
}
else
{
return null;
}
}

@Override
public void close()
{
for(CursorAndEntries cursor:cursorEntries)
{
try {
if(cursor.cursor!=null) cursor.cursor.close();
cursor.cursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
try {
if(joinCursor!=null) joinCursor.close();
joinCursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
}

This JoinCursor is then used for any specific query. For example, the following method 'listStatements' takes three parameters (s,p,o) and returns an Iterator over the matching Statements. If a parameter is null, it is treated as a wildcard: this is where the secondary databases are used to find the statements.
public CloseableIterator<Statement> listStatements(
Resource s,
Resource p,
RDFNode o) throws RDFException
{
if(s==null && p==null && o==null)
{
return listStatements();
}
try
{
List<CursorAndEntries> cursors= new ArrayList<CursorAndEntries>(3);
if(s!=null)/** subject is not null */
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.subject2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(s, ca.keyEntry);
cursors.add(ca);
}
if(p!=null)/** predicate is not null */
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.predicate2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(p, ca.keyEntry);
cursors.add(ca);
}
if(o!=null)/** object is not null */
{
if(o.isResource())
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectRsrc2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(Resource.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
else
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectLiteral2triple.openCursor(null, null);
LITERAL_KEY_BINDING.objectToEntry(Literal.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
}
return new JoinIterator(cursors);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}
}


Performance


This RDFStore was used to download and parse a remote gzipped file from geneontology.org. The uncompressed file is ~61 MB and contains 774,579 statements. It took about 6 minutes to download and digest the file. Remember that each time a statement is about to be inserted, we need to check that it doesn't already exist in the database. The amount of space required to store the database was 591 MB (ouch!!).

This code was my first idea about how to solve this problem. Obviously, some other engines are far more efficient :-) : "(...) Good progress though, last night's run yielded 5 billion triples loaded in just under 10 hours for an average throughput of 135k triples per second. Max throughput was just above 210k triples per second. 1 billion triples was reached in an astonishing 78 minutes. (...)"

As a conclusion, here is the full source code for the first version of this storage engine:
package org.lindenb.sw.model;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLStreamException;
import org.lindenb.sw.RDFException;
import org.lindenb.sw.io.RDFHandler;
import org.lindenb.sw.nodes.Literal;
import org.lindenb.sw.nodes.RDFNode;
import org.lindenb.sw.nodes.Resource;
import org.lindenb.sw.nodes.Statement;
import org.lindenb.util.iterator.CloseableIterator;

import com.sleepycat.bind.tuple.IntegerBinding;
import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.bind.tuple.TupleInput;
import com.sleepycat.bind.tuple.TupleOutput;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.JoinCursor;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import com.sleepycat.je.SecondaryConfig;
import com.sleepycat.je.SecondaryDatabase;
import com.sleepycat.je.SecondaryKeyCreator;

public class BerkeleyDBModel
{
private static final byte OPCODE_RESOURCE = 'R';
private static final byte OPCODE_LITERAL = 'L';

private Environment environment;
private Database triplesDB;
private SecondaryDatabase subject2triple= null;
private SecondaryDatabase predicate2triple= null;
private SecondaryDatabase objectLiteral2triple= null;
private SecondaryDatabase objectRsrc2triple= null;

private static final StatementBinding STMT_VALUE_BINDING= new StatementBinding();
private static final ResourceBinding RSRC_KEY_BINDING= new ResourceBinding();
private static final LiteralBinding LITERAL_KEY_BINDING= new LiteralBinding();

/**
* TupleBinding for a Resource
*
*/
private static class ResourceBinding
extends TupleBinding<Resource>
{
public void objectToEntry(Resource rsrc, TupleOutput out)
{
saveResource(rsrc,out);
}

public Resource entryToObject(TupleInput in)
{
return readResource(in);
}
}

/**
* TupleBinding for a Literal
*
*/
private static class LiteralBinding
extends TupleBinding<Literal>
{
public void objectToEntry(Literal rsrc, TupleOutput out)
{
saveLiteral(rsrc,out);
}

public Literal entryToObject(TupleInput in)
{
return readLiteral(in);
}
}

/**
* TupleBinding for a Statement
*
*/
private static class StatementBinding
extends TupleBinding<Statement>
{
public void objectToEntry(Statement stmt, TupleOutput out)
{
saveResource(stmt.getSubject(),out);
saveResource(stmt.getPredicate(),out);
if(stmt.getValue().isResource())
{
out.writeByte(OPCODE_RESOURCE);
saveResource(Resource.class.cast(stmt.getValue()),out);
}
else
{
out.writeByte(OPCODE_LITERAL);
saveLiteral(Literal.class.cast(stmt.getValue()),out);
}
}

public Statement entryToObject(TupleInput in)
{
Resource subject = readResource(in);
Resource predicate = readResource(in);
RDFNode object=null;
switch(in.readByte())
{
case OPCODE_RESOURCE:
{
object= readResource(in);
break;
}
case OPCODE_LITERAL:
{
object= readLiteral(in);
break;
}
default: throw new IllegalStateException("Unknown opcode");
}
return new Statement(subject,predicate,object);
}
}






/**
* AbstractIterator
*/
private abstract class AbstractIterator<T>
implements CloseableIterator<T>
{
protected T _object=null;
private boolean _nextTested=false;
private boolean _hasNext=false;
protected boolean _firstCall=true;
protected AbstractIterator()
{
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

@Override
public boolean hasNext()
{
if(_nextTested) return _hasNext;
_nextTested=true;
_hasNext=false;

T obj=null;
try
{
obj=readNext();
_firstCall=false;
if(obj!=null)
{
_object=obj;
_hasNext=true;
}
}
catch(DatabaseException err)
{
err.printStackTrace();
}

if(!_hasNext)
{
close();
}
return _hasNext;
}

protected abstract T readNext() throws DatabaseException;

@Override
public T next()
{
if(!_nextTested)
{
if(!hasNext()) throw new IllegalStateException();
}
_nextTested=false;
_hasNext=false;

T x= _object;
_object=null;
return x;
}

@Override
public abstract void close();
}


/**
* CursorIterator
*/
private abstract class CursorIterator<T>
extends AbstractIterator<T>
{
protected Cursor cursor;
private DatabaseEntry keyEntry;
private DatabaseEntry valueEntry;

protected CursorIterator(Cursor cursor)
{
this.cursor=cursor;
this.keyEntry= new DatabaseEntry();
this.valueEntry= new DatabaseEntry();
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

protected abstract T readNext(DatabaseEntry key,DatabaseEntry value) throws DatabaseException;


protected final T readNext() throws DatabaseException
{
if(this.cursor==null) return null;
return readNext(this.keyEntry,this.valueEntry);
}

@Override
public void close()
{
if(this.cursor!=null)
{
try { this.cursor.close(); }
catch (Exception e) { e.printStackTrace();}
}
this.cursor=null;
}
}

/**
* A container for the 3 values used
* in the following next JoinIterator
**/
private static class CursorAndEntries
{
Cursor cursor;
DatabaseEntry keyEntry=new DatabaseEntry();
DatabaseEntry valueEntry=new DatabaseEntry();
}

/**
* JoinIterator
*/
private class JoinIterator
extends AbstractIterator<Statement>
{
/** the joined iterators */
protected List<CursorAndEntries> cursorEntries;
/** our join cursor */
private JoinCursor joinCursor=null;
/** current key */
protected DatabaseEntry keyEntry=new DatabaseEntry();
/** current value */
protected DatabaseEntry valueEntry=new DatabaseEntry();
protected JoinIterator(List<CursorAndEntries> cursorsEntries)
{
this.cursorEntries=new ArrayList<CursorAndEntries>(cursorsEntries);
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

protected Statement readNext() throws DatabaseException
{
if(super._firstCall)
{
super._firstCall=false;
for(CursorAndEntries ca:this.cursorEntries)
{
if(ca.cursor.getSearchKey(ca.keyEntry,ca.valueEntry,null)!=OperationStatus.SUCCESS)
{
return null;
}
}
Cursor cursors[]= new Cursor[this.cursorEntries.size()];
for(int i=0;i< this.cursorEntries.size();++i)
{
cursors[i]= cursorEntries.get(i).cursor;
}
joinCursor = BerkeleyDBModel.this.triplesDB.join(cursors, null);
}
if(joinCursor.getNext(keyEntry, valueEntry, LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(valueEntry);
}
else
{
return null;
}
}

@Override
public void close()
{

for(CursorAndEntries cursor:cursorEntries)
{
try {
if(cursor.cursor!=null) cursor.cursor.close();
cursor.cursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
try {
if(joinCursor!=null) joinCursor.close();
joinCursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
}


/**
*
* BerkeleyDBModel
*
*/
public BerkeleyDBModel(
File envFile
) throws RDFException
{

try {
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
this.environment= new Environment(envFile,envCfg);


DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setSortedDuplicates(false);


this.triplesDB= this.environment.openDatabase(
null, "triples", cfg);


/* create secondary key on literal as value */
SecondaryConfig config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);

config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException
{
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(stmt.getValue().isResource()) return false;
Literal L= Literal.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveLiteral(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectLiteral2triple=this.environment.openSecondaryDatabase(null,"objectLiteral2triple", triplesDB, config2);

/* create secondary key on resource as value */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(!stmt.getValue().isResource()) return false;
Resource L= Resource.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveResource(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectRsrc2triple=this.environment.openSecondaryDatabase(null,"objectRsrc2triple", triplesDB, config2);

/* create secondary key on predicate */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
TupleOutput out= new TupleOutput();
saveResource(stmt.getSubject(), out);
result.setData(out.toByteArray());
return true;
}
});
this.subject2triple=this.environment.openSecondaryDatabase(null, "subject2triple", triplesDB, config2);


/* create secondary key on predicate */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
TupleOutput out= new TupleOutput();
saveResource(stmt.getPredicate(), out);
result.setData(
out.getBufferBytes(),
out.getBufferOffset(),
out.getBufferLength()
);
return true;
}
});
this.predicate2triple=this.environment.openSecondaryDatabase(null, "predicate2triple", triplesDB, config2);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}
}

/** Close this model */
public void close() throws RDFException
{
try
{
subject2triple.close();
predicate2triple.close();
objectLiteral2triple.close();
objectRsrc2triple.close();
triplesDB.close();
environment.close();
} catch(DatabaseException err)
{
throw new RDFException(err);
}
subject2triple=null;
predicate2triple=null;
objectLiteral2triple=null;
objectRsrc2triple=null;
triplesDB=null;
environment=null;
}

public void clear() throws RDFException
{
try {
DatabaseEntry key= new DatabaseEntry();
DatabaseEntry data= new DatabaseEntry();
Cursor c= triplesDB.openCursor(null, null);
while(c.getNext(key, data, null)==OperationStatus.SUCCESS)
{
c.delete();
}
c.close();
} catch (Exception e) {
e.printStackTrace();
}
}


protected Environment getEnvironment()
{
return this.environment;
}

protected Database getTripleDB()
{
return this.triplesDB;
}

public CloseableIterator<Resource> listSubjects() throws RDFException
{
try {
return new CursorIterator<Resource>(this.subject2triple.openCursor(null, null))
{
@Override
protected Resource readNext(DatabaseEntry key,DatabaseEntry value)
throws DatabaseException
{
if(this.cursor.getNext(key, value, null)==OperationStatus.SUCCESS)
{
return RSRC_KEY_BINDING.entryToObject(value);
}
return null;
}
};
} catch (Exception e)
{
throw new RDFException(e);
}

}


public CloseableIterator<Statement> listStatements() throws RDFException
{
try {
return new CursorIterator<Statement>(this.triplesDB.openCursor(null, null))
{
@Override
protected Statement readNext(DatabaseEntry key,DatabaseEntry
value) throws DatabaseException
{
if(this.cursor.getNext(key, value, null)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(value);
}
return null;
}
};
} catch (DatabaseException e)
{
throw new RDFException(e);
}
}


public CloseableIterator<Statement> listStatements(
Resource s,
Resource p,
RDFNode o) throws RDFException
{
if(s==null && p==null && o==null)
{
return listStatements();
}
try
{
List<CursorAndEntries> cursors= new ArrayList<CursorAndEntries>(3);
if(s!=null)
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.subject2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(s, ca.keyEntry);
cursors.add(ca);
}
if(p!=null)
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.predicate2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(p, ca.keyEntry);
cursors.add(ca);
}
if(o!=null)
{
if(o.isResource())
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectRsrc2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(Resource.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
else
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectLiteral2triple.openCursor(null, null);
LITERAL_KEY_BINDING.objectToEntry(Literal.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
}
return new JoinIterator(cursors);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}

}


public Resource createResource(String uri)
{
return new Resource(uri);
}

public Resource createResource(URI uri)
{
return createResource(uri.toString());
}

public Literal createLiteral(String text)
{
return new Literal(text);
}


private static void saveResource(Resource rsrc,TupleOutput out)
{
out.writeString(rsrc.getURI());
}

private static Resource readResource(TupleInput in)
{
return new Resource(in.readString());
}

private static void saveLiteral(Literal literal,TupleOutput out)
{
out.writeString(literal.getLexicalForm());

}

private static Literal readLiteral(TupleInput in)
{
String s= in.readString();
return new Literal(s);
}


public long size()throws RDFException
{
try
{
return getTripleDB().count();
}catch(DatabaseException err)
{
throw new RDFException(err);
}
}

public boolean contains(Statement stmt ) throws RDFException
{
CloseableIterator<Statement> iter=null;
try {
iter= listStatements(stmt.getSubject(), stmt.getPredicate(),stmt.getValue());
return (iter.hasNext());
} catch (RDFException e) {
throw e;
}
finally
{
if(iter!=null) iter.close();
}
}

public BerkeleyDBModel add(Resource s,Resource p,RDFNode o) throws RDFException
{
return add(new Statement(s,p,o));
}


public BerkeleyDBModel add(Statement stmt) throws RDFException
{
DatabaseEntry key= new DatabaseEntry();
DatabaseEntry value= new DatabaseEntry();
Cursor c=null;
try
{
if(contains(stmt)) return this;
c= triplesDB.openCursor(null, null);
int id=0;

if(c.getLast(key, value, null)==OperationStatus.SUCCESS)
{
id= IntegerBinding.entryToInt(key);
}
STMT_VALUE_BINDING.objectToEntry(stmt, value);
IntegerBinding.intToEntry(id+1,key);
getTripleDB().put(null, key, value);
return this;
}
catch(DatabaseException error)
{
throw new RDFException(error);
}
finally
{
if(c!=null) try {c.close(); } catch(DatabaseException err) {}
}
}

public void read(InputStream in) throws
IOException,RDFException,XMLStreamException
{
org.lindenb.sw.io.RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index) {
try
{
if(value instanceof URI)
{
add(createResource(subject),createResource(predicate),createResource((URI)value));
}
else
{
add(createResource(subject),createResource(predicate),createLiteral((String)value));
}
} catch(RDFException err)
{
throw new RuntimeException(err);
}
}
};
h.parse(in);
}

public static void main(String[] args) {
BerkeleyDBModel rdfStore= null;
try {
URL url= new
URL("http://archive.geneontology.org/latest-lite/go_20090607-termdb.owl.gz");

rdfStore = new BerkeleyDBModel(new File("/tmp/rdfdb"));
for(int i=0;i<10;i++)
{
long now= System.currentTimeMillis();
rdfStore.clear();
InputStream in= new GZIPInputStream(url.openStream());
rdfStore.read(in);
in.close();
System.err.println("("+i+") "+rdfStore.size()+" time="+(System.currentTimeMillis()-now)/1000);
}
rdfStore.clear();
}
catch (Exception e) {
e.printStackTrace();
}
finally
{
if(rdfStore!=null) try { rdfStore.close(); } catch(Exception err) {err.printStackTrace();}
}
}

}



That's it
Pierre

RDF: JavaScript, XSL stylesheets

A few notes:
I've implemented a JavaScript library to parse RDF (I love re-inventing the wheel; it's always interesting to learn how software and algorithms work). The RDF syntax is still not fully implemented (e.g. it doesn't support xml:lang, parseType="Literal", etc.).



I've also created 3 XSLT stylesheets transforming RDF/XML into other formats:

  • N3:
    xsltproc rdf2n3.xsl http://www.w3.org/TR/rdf-syntax-grammar/example12.rdf
    <http://www.w3.org/TR/rdf-syntax-grammar> <http://purl.org/dc/elements/1.1/title> "RDF/XML Syntax Specification (Revised)" .
    <http://www.w3.org/TR/rdf-syntax-grammar> <http://example.org/stuff/1.0/editor> <_:anodeid2245696> .
    <_:anodeid2245696> <http://example.org/stuff/1.0/fullName> "Dave Beckett" .
    <_:anodeid2245696> <http://example.org/stuff/1.0/homePage> <http://purl.org/net/dajobe/> .
  • rdf:Statement:
    xsltproc rdf2rdf.xsl http://www.w3.org/TR/rdf-syntax-grammar/example12.rdf

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Statement>
    <rdf:subject rdf:resource="http://www.w3.org/TR/rdf-syntax-grammar"/>
    <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
    <rdf:object>RDF/XML Syntax Specification (Revised)</rdf:object>
    </rdf:Statement>
    <rdf:Statement>
    <rdf:subject rdf:resource="http://www.w3.org/TR/rdf-syntax-grammar"/>
    <rdf:predicate rdf:resource="http://example.org/stuff/1.0/editor"/>
    <rdf:object rdf:resource="_:anodeid2245620"/>
    </rdf:Statement>
    <rdf:Statement>
    <rdf:subject rdf:resource="_:anodeid2245620"/>
    <rdf:predicate rdf:resource="http://example.org/stuff/1.0/fullName"/>
    <rdf:object>Dave Beckett</rdf:object>
    </rdf:Statement>
    <rdf:Statement>
    <rdf:subject rdf:resource="_:anodeid2245620"/>
    <rdf:predicate rdf:resource="http://example.org/stuff/1.0/homePage"/>
    <rdf:object rdf:resource="http://purl.org/net/dajobe/"/>
    </rdf:Statement>
    </rdf:RDF>
  • SQL statements:
    xsltproc rdf2sql.xsl http://www.w3.org/TR/rdf-syntax-grammar/example12.rdf


    create table IF NOT EXISTS TRIPLE
    (
    subject varchar(50) not null,
    predicate varchar(50) not null,
    value_is_uri enum('true','false') not null,
    value varchar(50) not null -- need to fix dataType and xml:lang
    );
    insert into TRIPLE(subject,predicate,value_is_uri,value) values ("http://www.w3.org/TR/rdf-syntax-grammar","http://purl.org/dc/elements/1.1/title","false","RDF/XML Syntax Specification (Revised)");
    insert into TRIPLE(subject,predicate,value_is_uri,value) values ("http://www.w3.org/TR/rdf-syntax-grammar","http://example.org/stuff/1.0/editor","true","_:anodeid2245974");
    insert into TRIPLE(subject,predicate,value_is_uri,value) values ("_:anodeid2245974","http://example.org/stuff/1.0/fullName","false","Dave Beckett");
    insert into TRIPLE(subject,predicate,value_is_uri,value) values ("_:anodeid2245974","http://example.org/stuff/1.0/homePage","true","http://purl.org/net/dajobe/");
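
The generated statements can be loaded into any SQL engine. As a quick sanity check (not part of the stylesheet; the table and rows are transcribed from the output above, and SQLite's plain TEXT stands in for MySQL's enum), here is a Python/sqlite3 sketch that loads the triples and queries them:

```python
import sqlite3

# In-memory database standing in for MySQL; SQLite has no enum type,
# so value_is_uri becomes a plain TEXT column holding 'true'/'false'.
db = sqlite3.connect(":memory:")
db.execute("""create table if not exists TRIPLE (
    subject   varchar(50) not null,
    predicate varchar(50) not null,
    value_is_uri text not null,
    value     varchar(50) not null
)""")

# Rows transcribed from the rdf2sql.xsl output above
triples = [
    ("http://www.w3.org/TR/rdf-syntax-grammar",
     "http://purl.org/dc/elements/1.1/title", "false",
     "RDF/XML Syntax Specification (Revised)"),
    ("http://www.w3.org/TR/rdf-syntax-grammar",
     "http://example.org/stuff/1.0/editor", "true", "_:anodeid2245974"),
    ("_:anodeid2245974",
     "http://example.org/stuff/1.0/fullName", "false", "Dave Beckett"),
]
db.executemany("insert into TRIPLE values (?,?,?,?)", triples)

# Find every literal statement about the specification document
rows = db.execute(
    "select predicate, value from TRIPLE "
    "where subject=? and value_is_uri='false'",
    ("http://www.w3.org/TR/rdf-syntax-grammar",)).fetchall()
print(rows)
```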



That's it.
Pierre

11 June 2009

A RDF Editor for Media wiki (draft)

(This page was copied from the article I started on mediawiki.org)
I've created an applet that can be used as an RDF editor for MediaWiki, the wiki engine behind Wikipedia. (This is mainly a proof of concept; I don't know if I'm going to use this system myself.) The XML/RDF syntax of the document is checked and it is validated against an RDFS schema. This method was inspired by the one described in the article "Add Java extensions to your wiki".

On opening, the Java applet loads and the user writes an XML/RDF document in an input area.

The syntax of the XML/RDF is checked. The document is also validated against a schema located at ${MW}/mwrdf/schema.rdf. If there is an error, a message is displayed and the 'Save' button is disabled.

Once the document is saved, it is displayed as a <PRE> section.
Categories are bound to mwrdf/schema.rdf and are automatically added.


The RDF document can then be retrieved with the MediaWiki API.


Installation

  • Install the Java JRE (version 1.6 or later)
  • The MediaWiki API must be enabled for action=query
  • append the following code at the end of ${MW}/LocalSettings.php
    require_once("mwrdf/RDFEdit.php");
  • download mwrdf.zip from http://code.google.com/p/lindenb/downloads/list
  • unzip the file mwrdf.zip in ${MW}
  • edit the schema mwrdf/schema.rdf (TODO, describe the Schema for ... the schema... :-) )


That's it.
Pierre

Learning OWL: a simple Ontology for contributions

This post is a simple reminder about creating a simple OWL ontology. I know RDFS (RDF Schema), but I'm not at all an expert in the OWL language, so feel free to make any comment about the following ontology.

OK...

At the beginning there is an empty RDF document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
]>
<rdf:RDF xmlns:rdf="&rdf;">
</rdf:RDF>


We're going to use a few more namespaces:
  • DC: the Dublin Core provides the basic metadata to describe a resource (title, author...)
  • RDFS: the namespace for the simplest RDF schema language
  • OWL: a more expressive language for describing an ontology, extending RDFS
  • FOAF: defines People, Images, Documents, etc...
  • biogang: this will be the prefix for our ontology


<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
<!ENTITY owl "http://www.w3.org/2002/07/owl#">
<!ENTITY dc "http://purl.org/dc/elements/1.1/">
<!ENTITY biogang "urn:biogang/ontology/contribution#">
<!ENTITY foaf "http://xmlns.com/foaf/0.1/">
]>
<rdf:RDF xmlns:rdf="&rdf;"
xmlns:rdfs="&rdfs;"
xmlns:owl="&owl;"
xmlns:foaf="&foaf;"
xmlns:dc="&dc;"
xmlns:biogang="&biogang;">

</rdf:RDF>


In a first statement we describe our ontology (label, comment, ...):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
<!ENTITY owl "http://www.w3.org/2002/07/owl#">
<!ENTITY dc "http://purl.org/dc/elements/1.1/">
<!ENTITY foaf "http://xmlns.com/foaf/0.1/">
<!ENTITY biogang "urn:biogang/ontology/contribution#">
]>
<rdf:RDF xmlns:rdf="&rdf;" xmlns:rdfs="&rdfs;" xmlns:owl="&owl;" xmlns:dc="&dc;" xmlns:foaf="&foaf;" xmlns:biogang="&biogang;">

<owl:Ontology rdf:about="">
<dc:date>2009-06-11</dc:date>
<dc:creator>Pierre Lindenbaum</dc:creator>
<rdfs:label>The Attribution Ontology</rdfs:label>
<rdfs:comment>
An ontology used to provide a quantitative citation
for every unique author of a document
</rdfs:comment>
</owl:Ontology>
</rdf:RDF>

Let's declare our main class, a Contribution:
<owl:Class rdf:about="&biogang;Contribution">
<rdfs:label xml:lang="en">Contribution</rdfs:label>
<rdfs:label xml:lang="fr">Contribution</rdfs:label>
<rdfs:comment xml:lang="en">A contribution</rdfs:comment>
<rdfs:comment xml:lang="fr">Une contribution</rdfs:comment>
</owl:Class>

A Contribution is a link between an author and a document.
An author is a foaf:Person as defined in the FOAF ontology.
A document is a foaf:Document as defined in the FOAF ontology (this could be an article but also a picture, a song, a web site, etc...).
So, a Contribution is a Class with two properties: contributor and contributedTo.
The domain of both contributor and contributedTo is an object of type Contribution.
The range of contributor is NOT the name of the author (a Literal, i.e. a DatatypeProperty) but a link to a resource describing the Person. So contributor is an ObjectProperty pointing to a foaf:Person. It also extends the Dublin Core property dc:creator:
<owl:ObjectProperty rdf:about="&biogang;contributor">
<rdfs:domain rdf:resource="&biogang;Contribution"/>
<rdfs:range rdf:resource="&foaf;Person"/>
<rdfs:label>contributor</rdfs:label>
<rdfs:subPropertyOf rdf:resource="&dc;creator"/>
</owl:ObjectProperty>

The range of contributedTo is NOT the literal description of the document but a link to a resource describing the Document. So contributedTo is an ObjectProperty pointing to a foaf:Document.
<owl:ObjectProperty rdf:about="&biogang;contributedTo">
<rdfs:domain rdf:resource="&biogang;Contribution"/>
<rdfs:range rdf:resource="&foaf;Document"/>
<rdfs:label>contributed to</rdfs:label>
</owl:ObjectProperty>


We can also add a Literal property to describe the nature of the contribution.
<owl:DatatypeProperty rdf:about="&biogang;comment">
<rdfs:domain rdf:resource="&biogang;Contribution"/>
<rdfs:range rdf:resource="&rdfs;Literal"/>
<rdfs:label>description of this contribution</rdfs:label>
<rdfs:subPropertyOf rdf:resource="&dc;description"/>
</owl:DatatypeProperty>


But a Contribution should contain exactly one 'contributor' and exactly one 'contributedTo', so we add a restriction on the cardinality of those properties:


<owl:Class rdf:about="&biogang;Contribution">
<rdfs:label xml:lang="en">Contribution</rdfs:label>
<rdfs:label xml:lang="fr">Contribution</rdfs:label>
<rdfs:comment xml:lang="en">A contribution</rdfs:comment>
<rdfs:comment xml:lang="fr">Une contribution</rdfs:comment>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="&biogang;contributor"/>
<owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="&biogang;contributedTo"/>
<owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
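
Note that OWL cardinality restrictions are declarative: a plain triple store will happily accept a Contribution with no contributor unless something checks it. As a toy illustration (entirely hypothetical code, using bare (subject, predicate, object) tuples rather than a real RDF library), such a check could look like:

```python
from collections import Counter

BIOGANG = "urn:biogang/ontology/contribution#"

# Toy triple list: _:c1 is a valid contribution, _:c2 lacks a contributor.
triples = [
    ("_:c1", "rdf:type", BIOGANG + "Contribution"),
    ("_:c1", BIOGANG + "contributor", "mailto:plindenbaum@yahoo.fr"),
    ("_:c1", BIOGANG + "contributedTo",
     "http://www.ncbi.nlm.nih.gov/pubmed/8985320"),
    ("_:c2", "rdf:type", BIOGANG + "Contribution"),
    ("_:c2", BIOGANG + "contributedTo", "http://example.org/doc"),
]

def check_cardinality(triples, prop, expected=1):
    """Return the Contributions whose number of `prop` values != expected."""
    contributions = {s for (s, p, o) in triples
                     if p == "rdf:type" and o == BIOGANG + "Contribution"}
    counts = Counter(s for (s, p, o) in triples if p == prop)
    return sorted(c for c in contributions if counts[c] != expected)

bad = check_cardinality(triples, BIOGANG + "contributor")
print(bad)  # _:c2 violates the cardinality=1 restriction
```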

We can also create a set of subclasses of Contribution to give a quantifiable view of the contribution.
<owl:Class rdf:about="&biogang;MajorContribution">
<rdfs:subClassOf rdf:resource="&biogang;Contribution"/>
<rdfs:label>Major contribution</rdfs:label>
</owl:Class>

<owl:Class rdf:about="&biogang;MediumContribution">
<rdfs:subClassOf rdf:resource="&biogang;Contribution"/>
<rdfs:label>Medium contribution</rdfs:label>
</owl:Class>

<owl:Class rdf:about="&biogang;MicroContribution">
<rdfs:subClassOf rdf:resource="&biogang;Contribution"/>
<rdfs:label>Micro contribution</rdfs:label>
</owl:Class>

It would also be nice to extend this ontology to describe the 'nature' of the contribution (drawing figures, useful discussions, writing the paper). This would make it easy to find the people who are good at a given task.
Anyway, at the end, here is the full ontology:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:biogang="urn:biogang/ontology/contribution#">

<owl:Ontology rdf:about="">
<dc:date>2009-06-11</dc:date>
<dc:creator>Pierre Lindenbaum</dc:creator>
<rdfs:label>The Attribution Ontology</rdfs:label>
<rdfs:comment>
An ontology used to provide a quantitative citation
for every unique author of a document
</rdfs:comment>
</owl:Ontology>

<owl:Class rdf:about="urn:biogang/ontology/contribution#Contribution">
<rdfs:label xml:lang="en">Contribution</rdfs:label>
<rdfs:label xml:lang="fr">Contribution</rdfs:label>
<rdfs:comment xml:lang="en">A contribution</rdfs:comment>
<rdfs:comment xml:lang="fr">Une contribution</rdfs:comment>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="urn:biogang/ontology/contribution#contributor"/>
<owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="urn:biogang/ontology/contribution#contributedTo"/>
<owl:cardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>


<owl:ObjectProperty rdf:about="urn:biogang/ontology/contribution#contributor">
<rdfs:domain rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:range rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<rdfs:label>contributor</rdfs:label>
<rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/creator"/>
</owl:ObjectProperty>

<owl:ObjectProperty rdf:about="urn:biogang/ontology/contribution#contributedTo">
<rdfs:domain rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:range rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
<rdfs:label>contributed to</rdfs:label>
</owl:ObjectProperty>

<owl:DatatypeProperty rdf:about="urn:biogang/ontology/contribution#comment">
<rdfs:domain rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
<rdfs:label>description of this contribution</rdfs:label>
<rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/description"/>
</owl:DatatypeProperty>

<owl:Class rdf:about="urn:biogang/ontology/contribution#MajorContribution">
<rdfs:subClassOf rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:label>Major contribution</rdfs:label>
</owl:Class>

<owl:Class rdf:about="urn:biogang/ontology/contribution#MediumContribution">
<rdfs:subClassOf rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:label>Medium contribution</rdfs:label>
</owl:Class>

<owl:Class rdf:about="urn:biogang/ontology/contribution#MicroContribution">
<rdfs:subClassOf rdf:resource="urn:biogang/ontology/contribution#Contribution"/>
<rdfs:label>Micro contribution</rdfs:label>
</owl:Class>

</rdf:RDF>


Finally, we can create new instances of contributions just like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:biogang="urn:biogang/ontology/contribution#" (...) >
<biogang:MajorContribution>
<biogang:comment>Made the Y2H experiments</biogang:comment>
<biogang:contributor rdf:resource="mailto:plindenbaum@yahoo.fr"/>
<biogang:contributedTo rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/8985320"/>
</biogang:MajorContribution>

<foaf:Person rdf:about="mailto:plindenbaum@yahoo.fr">
<foaf:name>Pierre</foaf:name>
</foaf:Person>

<foaf:Document rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/8985320">
<dc:title>In vivo and in vitro phosphorylation of rotavirus NSP5 correlates with its localization in viroplasms</dc:title>
</foaf:Document>
</rdf:RDF>
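
Since the instance data is ordinary XML, it can even be inspected without a dedicated RDF library. Here is a rough Python sketch using the standard xml.etree module on a trimmed-down copy of the document above (a real application would of course use a proper RDF/XML parser, which handles the full syntax, nodeIDs, etc.):

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BIOGANG = "urn:biogang/ontology/contribution#"

# Trimmed-down copy of the instance document above
doc = """<rdf:RDF xmlns:rdf="%s" xmlns:biogang="%s">
<biogang:MajorContribution>
  <biogang:comment>Made the Y2H experiments</biogang:comment>
  <biogang:contributor rdf:resource="mailto:plindenbaum@yahoo.fr"/>
  <biogang:contributedTo
     rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/8985320"/>
</biogang:MajorContribution>
</rdf:RDF>""" % (RDF, BIOGANG)

root = ET.fromstring(doc)
contrib = root.find("{%s}MajorContribution" % BIOGANG)
# rdf:resource attributes carry the links to the person and the document
who = contrib.find("{%s}contributor" % BIOGANG).get("{%s}resource" % RDF)
what = contrib.find("{%s}contributedTo" % BIOGANG).get("{%s}resource" % RDF)
print(who, "contributed to", what)
```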


That's it.
Pierre

09 June 2009

Exploring JavaFX: creating charts for pubmed

The JavaFX SDK, a scripting language for the Java platform, is now available for Linux. The API is said to be a new and simpler way to write a graphical interface, and it has some nice data bindings to create an immediate and direct relationship between two variables. But it comes after Flash, AIR, SVG and Processing, so isn't it already too late for this new language?
However, JavaFX contains a chart API and I've generated some JavaFX code from XML/Pubmed and the following XSL stylesheet:

I worked with the following pubmed query: ontology[ti] “Database Management Systems”[mesh].
The XML Document was transformed into a JavaFX code with XSLT
xsltproc pubmed2fx01.xsl pubmed_result.txt > code.fx

The source code was then compiled to a Java class with the JavaFX compiler:
javafxc code.fx

And the compiled class was executed with javafx:
javafx code

The executable creates two charts:




That's it.
Pierre

04 June 2009

Fun with SVG: NCBI/pubchem+XSLT= SVG

Just for fun, I've played with the compounds stored in NCBI/PubChem and I've created an XSLT stylesheet transforming the PubChem XML format into an SVG figure.
The XSLT stylesheet is available here:

This xml format was new to me, so feel free to tell me if I've missed something...

Here are two examples:
The stylesheet takes a few optional arguments:
  • scale := scale factor
  • show-bounds := (true/false)
  • xradius := scale factor for atoms

Example


xsltproc --stringparam scale 30 --stringparam xradius 2 --stringparam show-bounds false src/xsl/pubchem2svg.xsl CID_16204538.xml > file.svg


A few points about how it works


First we need to collect the min/max values of each x/y/z coordinate. For example, for max-x we sort all the X coordinates in descending order and take the first item:
<xsl:variable name="max-x">
<xsl:for-each select="$x_array/c:PC-Conformer_x_E">
<xsl:sort select="." data-type="number" order="descending"/>
<xsl:if test="position() = 1">
<xsl:value-of select="."/>
</xsl:if>
</xsl:for-each>
</xsl:variable>


Then we loop over the index of each atom and call the template xyz:
<xsl:comment>BEGIN ATOMS</xsl:comment>
<xsl:element name="svg:g">
<xsl:call-template name="xyz">
<xsl:with-param name="index" select="1"/>
</xsl:call-template>
</xsl:element>
<xsl:comment>END ATOMS</xsl:comment>
</xsl:element>

For each index-th atom, the template xyz creates an SVG figure for the atom. Calls to the templates coord-x and coord-y return the (x/y) coordinates of this element on the SVG panel.
<xsl:element name="svg:use">
<xsl:attribute name="xlink:href">#atom<xsl:choose>
<xsl:when test="$s='o' or $s='c' or $s='h'"><xsl:value-of select="$s"/></xsl:when>
<xsl:otherwise>UN</xsl:otherwise>
</xsl:choose></xsl:attribute>
<xsl:attribute name="x"><xsl:call-template name="coord-x"><xsl:with-param name="index" select="$index"/></xsl:call-template></xsl:attribute>
<xsl:attribute name="y"><xsl:call-template name="coord-y"><xsl:with-param name="index" select="$index"/></xsl:call-template></xsl:attribute>
<xsl:attribute name="title"><xsl:value-of select="$s"/><xsl:text> </xsl:text><xsl:value-of select="$index"/></xsl:attribute>
</xsl:element>

The template coord-x itself aligns the coordinate of the atom according to the scale and the margin of the SVG panel:
<xsl:template name="coord-x">
<xsl:param name="index"/>
<xsl:variable name="x" select="$x_array/c:PC-Conformer_x_E[$index]"/>
<xsl:value-of select="$margin + ($x - $min-x) * $scale"/>
</xsl:template>
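
The sort-and-take-first trick above is needed because XSLT 1.0 has no min()/max() functions. In a general-purpose language the same bounding-box-and-scale logic collapses to a few lines; here is a Python sketch with made-up coordinates and arbitrary margin/scale values:

```python
# X coordinates of the atoms, as read from the PC-Conformer_x_E elements
xs = [1.0, -1.0, 0.5, 2.0]
margin, scale = 10, 30

# What the descending/ascending xsl:sort loops compute
min_x, max_x = min(xs), max(xs)

def coord_x(x):
    """Mirror of the coord-x template: shift by min-x, then scale."""
    return margin + (x - min_x) * scale

svg_xs = [coord_x(x) for x in xs]
print(svg_xs)
```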


That's it.

Pierre

PS: See also the post I wrote one year ago: NCBI Blast+ XSLT => XHTML + SVG.