31 July 2011

Storing some SNPs using the leveldb library. My notebook.

In this post I'll describe how I've used the leveldb API, a C++ key-value database, to store some SNPs (key=rs,value=sequence). "LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.". A benchmark for this engine is available here: http://code.google.com/p/leveldb/source/browse/trunk/doc/benchmark.html.

Download & install

$ svn checkout http://leveldb.googlecode.com/svn/trunk/ leveldb-read-only
$ cd leveldb-read-only/
$ make

Open & close a leveldb database

#include "leveldb/db.h"
(...)
leveldb::DB* db=NULL;
leveldb::Options options;
options.comparator=&rs_comparator;/* custom comparator for ordering the keys, see later */
options.create_if_missing = true;
clog << "[LOG]opening database" << endl;
leveldb::Status status = leveldb::DB::Open(options,db_home, &db);
//check status
if (!status.ok())
{
cerr << "cannot open " << db_home << " : " << status.ToString() << endl;
return EXIT_FAILURE;
}
(...)
/* use the database */
(...)
delete db;

A custom comparator for ordering the keys

Here, the keys are the rs-ids ordered on their numerical value. Both keys and values have a type "leveldb::Slice".
class RsComparator : public leveldb::Comparator
{
private:
/* parses the rsId: rs[0-9]+ */
int rs(const leveldb::Slice& s) const
{
int n=0;
for(size_t i=2;i< s.size();++i)
{
n=n*10+s[i]-'0';
}
return n;
}
public:
virtual int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const
{
return rs(a)-rs(b);
}
(...)
}
(...)
RsComparator rs_comparator;
options.comparator=&rs_comparator;

Inserting a FASTA sequence


(...)
std::string name;
std::string sequence;
(...)
leveldb::Status status = db->Put(leveldb::WriteOptions(), name, sequence);
if(!status.ok())
{
cerr << "Cannot insert "<< name << " "
<< status.ToString() << endl;
return EXIT_FAILURE;
}

Searching for a rs###

(...)
std::string name(argv[optind]);
std::string sequence;
(...)
leveldb::Status status = db->Get(leveldb::ReadOptions(),seqname, &sequence);
if(!status.ok())
{
cerr << "Cannot find " << seqname<< " in " << db_home << endl;
continue;
}
printFasta(seqname,sequence);
(...)

Dumping all the SNPs using an iterator

(...)
leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next())
{
printFasta(it->key(),it->value());
}
delete it;
(...)

Examples

Reading the SNPs:
$ curl -s "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/rs_chMT.fas.gz" |\
./rsput -D snp.db

[LOG]opening database
[LOG] added rs8896
[LOG] added rs8936
[LOG] added rs9743
(...)
[LOG]closing database

$ du -h snp.db
336K snp.db
Dump the snps

$ ./rsget  -D snp.db

[LOG]opening database
>rs8896
GGTGTTGGTTCTCTTAATCTTTAACTTAAAAGGTTAATGCTAAGTTAGCTTTACAGTGGG
CTCTAGAGGGGGTAGAGGGGGTGYTATAGGGTAAATACGGGCCCTATTTCAAAGATTTTT
AGGGGAATTAATTCTAGGACGATGGGCATGAAACTGTGGTTTGCTCCACAGATTTCAGAG
CATT
>rs8936
ACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCATTATTCCTAGAACCAGGCG
ACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTA
TAATAATTACATCACAAGACGTCTTGCACTCATGAGCTGTCCCCACATTAGGCTTAAAAA
CAGATGCAATTCCCGGACGTHTAAACCAAACCACTTTCACCGCTACACGACCGGGGGTAT
ACTACGGTCAATGCTCTGAAATCTGTGGAGCAAACCACAGTTTCATGCCCATCGTCCTAG
AATTAATTCCCCTAAAAATCTTTGAAATAGGGCCCGTATTTACCCTATAGCACCCCCTCT
ACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAAC
>rs9743
CCATGTGATTTCACTTCCACTCCATAACGCTCCTCATACTAGGCCTACTAACCAACACAC
TAACCATATACCAATGATGNCGCGATGTAACACGAGAAAGCACATACCAAGGCCACCACA
CACCACCTGTCCAAAAAGGCCTTCGATACGGGATAATCCTATTTATTACCTCAGAANTTT
TTTTCTTCGCAGGATTTTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAAN
(...)
[LOG]closing database
Search for some SNPs:
$ ./rsget -D snp.db rs78894381 rs72619361 rs25
[LOG]opening database
[LOG]searching for rs78894381
>rs78894381
CTACTAATCTCATCAACACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTA
ACCCCATACCCCGAACCAACCAAACCCCAAAGACACCCCCNCACAGTTTATGTAGCTTAC
CTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAAATA
GGTTTGGTCCTAGCCTTTCTA
[LOG]searching for rs72619361
>rs72619361
ATGCATTTGGTATTTTAATCTGGGGGGTGTGCACGCGATAGCATTGTGAAACGCTGGCCC
CAGAGCACCCTATGTCGCAGTGTCTGTCTTTGATTCCTGCCYCATCCCATTATTGATCAC
ACCTACATTCAATATCCCAGGCGAGCATACCTATCACAAGGTGTTAATTAATTAATGCTT
GTAGGACATAACAATCAGTAAAC
[LOG]searching for rs25
Cannot find rs25 in snp.db
[LOG]closing database

Source code






That's it,

Pierre

27 July 2011

Controlling IGV through a Port: my notebook.

A protocol for interacting with the The Integrative Genomics Viewer (IGV) is described here: http://www.broadinstitute.org/igv/PortCommands. So, you can save some pictures of a genomic region without interacting with IGV.

Via a shell script


$ cat script.sh

mkdir -p /tmp/igv
sleep 2
echo "snapshotDirectory /tmp/igv"
sleep 2
echo "goto chr3:100"
sleep 2
echo "snapshot"
sleep 2
echo "goto chr3:200-300"
sleep 2
echo "snapshot"
sleep 2

invoke the script
$ sh script.sh |  telnet 127.0.0.1 60151
Trying 127.0.0.1...
Connected to srv-clc-02.irt.univ-nantes.prive (127.0.0.1).
Escape character is '^]'.
INFO [2011-07-27 17:58:16,680] [CommandExecutor.java:124] [Thread-6] OK
OK
INFO [2011-07-27 17:58:18,685] [CommandExecutor.java:124] [Thread-6] OK
OK
INFO [2011-07-27 17:58:18,687] [MacroSnapshotAction.java:85] [Thread-6] Snapshot: chr3_80_119.png
INFO [2011-07-27 17:58:20,787] [CommandExecutor.java:124] [Thread-6] OK
OK
INFO [2011-07-27 17:58:22,792] [CommandExecutor.java:124] [Thread-6] OK
OK
INFO [2011-07-27 17:58:22,794] [MacroSnapshotAction.java:85] [Thread-6] Snapshot: chr3_200_300.png
Connection closed by foreign host.


Via java


The code
import java.io.*;
import java.net.*;
class Test
{
public static void main(String[] args) throws Exception
{
Socket socket = new Socket("127.0.0.1", 60151);
PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));


out.println("snapshotDirectory /tmp/igv");
System.out.println(in.readLine());
out.println("goto chr3:100");
System.out.println(in.readLine());

out.println("snapshot");
System.out.println(in.readLine());

in.close();
out.close();
socket.close();
}
}

Compile and execute:
$ javac Test
$ java Test
INFO [2011-07-27 18:01:15,706] [CommandExecutor.java:124] [Thread-6] OK
OK

INFO [2011-07-27 18:01:17,711] [CommandExecutor.java:124] [Thread-6] OK
OK

INFO [2011-07-27 18:01:17,729] [MacroSnapshotAction.java:85] [Thread-6] Snapshot: chr3_80_119.png
INFO [2011-07-27 18:01:20,533] [CommandExecutor.java:124] [Thread-6] OK
OK

26 July 2011

A mysql full-text parser searching for some SNPs

A mysql full-text parser server plugin can be used to replace or modify the built-in full-text parser.
This post a full-text parser plugin named bioparser that extracts the rs#### id from a text.

The source

The source of this plugin is available on github at:
https://github.com/lindenb/bioudf/blob/master/src/bioparser.c.

The work horse of the plugin is a simple function bioparser_parse scanning the SQL query or the TEXT. Each time a word starting with "rs" and followed by one or more number is found, it is added to the list of words to be find:
static int bioparser_parse(MYSQL_FTPARSER_PARAM *param)
{
char *curr=param->doc;
const char *begin=param->doc;
const char *end= begin + param->length;

param->flags = MYSQL_FTFLAGS_NEED_COPY;
while(curr+2<end)
{
if(tolower(*curr)=='r' &&
tolower(*(curr+1))=='s' &&
isdigit(*(curr+2)) &&
(curr==begin || IS_DELIM(*(curr-1) ) )
)
{
char* p=curr+2;
while(p!=end && isdigit(*p))
{
++p;
}
if(p==end || IS_DELIM(*p))
{
my_add_word(param,curr,p-curr);
}
curr=p;
}
else
{
curr++;
}
}

return 0;
}

Install the Plugin


mysql> INSTALL PLUGIN bioparser SONAME 'bioparser.so';
Query OK, 0 rows affected (0.00 sec)

mysql> show plugins;
+-----------------------+----------+--------------------+--------------+---------+
| Name | Status | Type | Library | License |
+-----------------------+----------+--------------------+--------------+---------+
(...)
| partition | ACTIVE | STORAGE ENGINE | NULL | GPL |
| bioparser | ACTIVE | FTPARSER | bioparser.so | GPL |
+-----------------------+----------+--------------------+--------------+---------+
21 rows in set (0.00 sec)

Invoke Plugin


create a table that will use the plugin:
mysql> create table pubmed(
abstract TEXT,
FULLTEXT (abstract) WITH PARSER bioparser
) ENGINE=MyISAM;

Insert some abstracts.
mysql> insert into pubmed(abstract) values("A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP.");
Query OK, 1 row affected (0.00 sec)

(...)

mysql> select abstract from pubmed;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| abstract |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP. |
| Nonhomologous end joining (NHEJ) is a pathway that repairs DNA double-strand breaks (DSBs) to maintain genomic stability in response to irradiation. The authors hypothesized that single nucleotide polymorphisms (SNPs) in NHEJ repair genes may affect clinical outcomes in patients with nonsmall cell lung cancer (NSCLC) who receive definitive radio(chemo)therapy. |
| The authors genotyped 5 potentially functional SNPs-x-ray repair complementing defective repair in Chinese hamster cells 4 (XRCC4) reference SNP (rs) number rs6869366 (-1394 guanine to thymine [-1394G?T] change) and rs28360071 (intron 3, deletion/insertion), XRCC5 rs3835 (guanine to adenine [G?A] change at nucleotide 2408), XRCC6 rs2267437 (-1310 cytosine to guanine [C?G) change], and DNA ligase IV (LIG4) rs1805388 (threonine-to-isoleucine change at codon 9 [T9I])-and estimated their associations with severe radiation pneumonitis (RP) (grade ?3) in 195 patients with NSCLC. |
| The current results indicated that NHEJ genetic polymorphisms, particularly LIG4 rs1805388, may modulate the risk of RP in patients with NSCLC who receive definitive radio(chemo)therapy. Large studies will be needed to confirm these findings. |
| The repair of DNA double-strand breaks (DSBs) is the major mechanism to maintain genomic stability in response to irradiation. We hypothesized that genetic polymorphisms in DSB repair genes may affect clinical outcomes among non-small cell lung cancer (NSCLC) patients treated with definitive radio(chemo)therapy. |
| We also found that RAD51 -135G>C and XRCC2 R188H SNPs were independent prognostic factors for overall survival (adjusted HR?=?1.70, 95% CI, 1.14-2.62, P?=?0.009 for CG/CC vs. GG; and adjusted HR?=?1.70; 95% CI, 1.02-2.85, P?=?0.043 for AG vs. GG, respectively) and that the SNP-survival association was most pronounced in the presence of RP. |
| A total of 291 patients (145 male/146 female, mean age (± S.D.) 52.2 (± 13.1) years) with PsA were examined clinically, by standard laboratory tests and their DNA was genotyped for the SNP rs2476601 (PTPN22 +1858 C/T). Allelic frequencies were determined and compared with 725 controls. |
| this is a test rs2476601, rs1805388, rs3835 and rs25 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

Now, test the plugin:

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("DNA double-strand) as SCORE
from pubmed group by 2 HAVING SCORE!=0;
Empty set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25") as SCORE
from pubmed group by 2 HAVING SCORE!=0;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 1.8603347539901733 |
+---------------------------------------------+--------------------+
1 row in set (0.01 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs2476601 rs1805388 rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| this is a test rs2476601, rs1805388, rs3... | 1.0502985715866089 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25,rs2476601,rs1805388,rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 2.9106333255767822 |
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

uninstall the plugin


mysql> UNINSTALL PLUGIN bioparser;
Query OK, 0 rows affected (0.00 sec)


That's it,

Pierre

25 July 2011

The samtools C API: creating a WIG file for the genome coverage

I wrote a tool which uses the samtools API to generate a WIG file for the genome coverage. This tool, bam2wig, as well as some other tools using this library are available in the following git repository:


My code is based on this original C source hosted on: http://samtools.sourceforge.net/sam-exam.shtml. (I hope I understood this API as many functions/structures remain undocumented).

Example

#compile
$ export SAMDIR=/path/to/samtools-0.1.17
$ cd amtools-utilities
$ make
#invoke bam2wig
$ .bin/bam2wig -t /path/to/my/sample.bam "chr1:120408225-120558920" | head
track name="__TRACK_NAME__" description="__TRACK_DESC__" type="wiggle_0"
fixedStep chrom=chr1 start=120408225 step=1 span=1
1
1
1
2
2
3
4
5

Result

The wig file uploaded as a custom track in the UCSC genome browser:


That's it

Pierre

20 July 2011

Drawing a SVG timeline with the http://data.bnf.fr data

The National French Library has recently started to release
its data as RDF/XML. Here, I've played with the biographies of the famous French writers to create a simple timeline.

Unfortunately, this timeline is too large to be displayed here. :-)






However, here is the java code I used to generate this map as a SVG document:


See also: Freebase and the History of Sciences

That's it,

Pierre