03 September 2009

Generating a C Pull Parser for dbSNP with XSLT



I've used the XSD schema describing dbSNP to generate a C "pull parser" that reads the content of the dbSNP XML files. To transform the schema into C code I wrote an XSLT stylesheet. This stylesheet was developed specifically for dbSNP, so it might not handle a more complicated schema (for example a schema that would use <xsd:elementType>). The generated C code is a scaffold for a pull parser built on the libxml2 library. For example, here is a simplified snippet of the code handling the tag <Assembly/>:
/** A collection of genome sequence records (curated gene regions (NG's),
contigs (NWNT's) and chromosomes (NC/AC's)) produced by a genome
sequence project. Structure is populated from ContigInfo tables. */

static int processAssembly(StatePtr state)
{
    int returnValue=EXIT_SUCCESS;
    int success;
    int nodeType;
    const int isEmptyElement= xmlTextReaderIsEmptyElement(state->reader);

    /** Name of the group(s) or organization(s) that generated the assembly */
    xmlChar* assemblySourceAttr=NULL;

    //(...) declare other attributes

    assemblySourceAttr= xmlTextReaderGetAttribute(
        state->reader,
        BAD_CAST "assemblySource"
        );

    //(...) other attributes

    if(!isEmptyElement)
    {
        /* xmlTextReaderRead returns 1 on success, 0 when there is
           nothing left to read and -1 on error */
        success = xmlTextReaderRead( state->reader );
        if(success!=1)
        {
            fprintf( state->error,"In Assembly I/O Error. xmlTextReaderRead returned %d\n",success);
            returnValue = EXIT_FAILURE;
            goto cleanup;
        }
        nodeType = xmlTextReaderNodeType( state->reader );

        /* process childNode <Component/> */

        while(nodeType == XML_READER_TYPE_ELEMENT)
        {
            /* stop as soon as the current element is not a <Component/> */
            if(xmlStrcmp(
                xmlTextReaderConstName(state->reader),
                BAD_CAST "Component"
                )!=0)
            {
                break;
            }

            if(processComponent(state)!=EXIT_SUCCESS)
            {
                returnValue = EXIT_FAILURE;
                goto cleanup;
            }

            /* read next event */
            success= xmlTextReaderRead(state->reader);
            if(success!=1)
            {
                returnValue = EXIT_FAILURE;
                fprintf( state->error,"In Assembly/Component I/O Error.\n");
                goto cleanup;
            }
            nodeType=xmlTextReaderNodeType(state->reader);
        }

        /* process childNode <SnpStat/> */
        (...)

    }//end of if(!isEmptyElement)

cleanup:

    //free attributes
    if(assemblySourceAttr!=NULL)
    {
        xmlFree(assemblySourceAttr);
    }
    //(...) other attributes
    return returnValue;
}
Using this prototype I was able to quickly write a fast parser, for example one echoing a JSON description of the SNPs of the Human Mitochondrial Genome (time: 0.11user 0.01system 0:00.19elapsed 66%CPU).
[
{
"rsId":8896,
"seq5":"GGTGTTGGTTCTCTTAATCTTTAACTTAAAAGGTTAATGCTAAGTTAGCTTTACAGTGGGCTCTAGAGGGGGTAGAGGGGGTG",
"observed":"C/T",
"seq3":"TATAGGGTAAATACGGGCCCTATTTCAAAGATTTTTAGGGGAATTAATTCTAGGACGATGGGCATGAAACTGTGGTTTGCTCCACAGATTTCAGAGCATT"
}
,
{
"rsId":8936,
"seq5":"ACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCATTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACATCACAAGACGTCTTGCACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGT",
"observed":"A/C/T",
"seq3":"TAAACCAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGTGGAGCAAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAAATAGGGCCCGTATTTACCCTATAGCACCCCCTCTACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAAC"
}
(...)
{
"rsId":72619366,
"seq5":"TGCTTACAAGCAAGTACAGCAATCAACCTTCAACTATCACACATCAACTGCAACTCCAAAGCCACCCCTCACCCACTAGGATACCAACAAACCTACCCAC",
"observed":"C/T",
"seq3":"CTTAACAGTACATAGTACATAAAGCCATTTACCGTACATAGCACATTACAGTCAAATCCCTTCTCGTCCCCATGGATGACCCCCCTCAGATAGGGGTCCC"
}
]
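
To give an idea of how little code is needed once the scaffold has collected the values, here is a minimal sketch of a helper that formats one record like those above. It is not the code I generated: the `SnpRecord` struct and the `snpToJson` function are hypothetical names introduced for illustration, and the sketch assumes that the strings never contain characters that need JSON escaping (true for nucleotide sequences and for alleles such as "C/T").

```c
#include <stdio.h>

/* Hypothetical record holding the four fields shown in the JSON output.
   The real parser state is not shown in this post. */
typedef struct
{
    long rsId;
    const char* seq5;
    const char* observed;
    const char* seq3;
} SnpRecord;

/* Formats one record as a JSON object into buf.
   Returns the value of snprintf: the number of characters needed
   (excluding the final '\0'), so a result >= size means truncation.
   Assumption: no string contains a quote, a backslash or a control
   character, so no JSON escaping is performed. */
int snpToJson(const SnpRecord* snp, char* buf, size_t size)
{
    return snprintf(buf, size,
        "{\"rsId\":%ld,\"seq5\":\"%s\",\"observed\":\"%s\",\"seq3\":\"%s\"}",
        snp->rsId,
        snp->seq5,
        snp->observed,
        snp->seq3
        );
}
```

A caller would fill a `SnpRecord` while walking the <Rs/> elements, call `snpToJson` with a large enough buffer, and print a "," between two records. Formatting into a buffer rather than printing directly makes it possible to detect truncation before anything is written.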



That's it
Pierre
