09 July 2011

Storing SNPs in a HDF5 file: my notebook

I discovered the benefits of using twitter in 2008, the day Deepak Singh replied to one of my tweets related to the storage of a large number of genotypes.





Since that day, I've tried to use the HDF5 library, without any success (there's a large disheartening documentation/API on the HDF5 site and the API seems to be only used by a small number of geeks). Furthermore, HDF5 is a technology used by the IGV genome browser.

In that post, I'll describe a C program loading a set of SNPs defined by the following C structure:
typedef struct structSnp
{
char rsId[RS_LENGTH];//rs##
char chrom[CHROM_LENGTH];//chromosome name
int chromStart;//genomic start index
int chromEnd;//genomic end index
}Snp;
The SNPs will be read from stdin. Four columns are expected: rs-id, chrom, chromStart and chromEnd. Here is the command line I used to get a small input file:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
gunzip -c |\
cut -d ' ' -f 2-5 |\
head -n 10000 > sample.tsv

Source code


The C program is described below. I hope my comments will make the code readable.

The Makefile



Compile and run

gcc -Wall -o a.out -I  /path/to/hdf5/include -L /path/to/hdf5/lib dbsnp2hdf5.c  -lhdf5
./a.out < sample.tsv

Dump the data


At the en, the program creates a structured binary file containing our SNPs. The HDF5 utility h5dump can be used to display the data as text:
$ h5dump storage.h5

HDF5 "storage.h5" {
GROUP "/" {
GROUP "variations" {
DATASET "dbSNP" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE 13;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "rs";
H5T_STRING {
STRSIZE 7;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "chrom";
H5T_STD_I32LE "chromStart";
H5T_STD_I32LE "chromEnd";
}
DATASPACE SIMPLE { ( 10000 ) / ( H5S_UNLIMITED ) }
DATA {
(0): {
"rs112750067",
"chr1",
10326,
10327
},
(1): {
"rs56289060",
"chr1",
10433,
10433
},
(2): {
"rs112155239",
"chr1",
10439,
10440
},
(3): {
"rs112766696",
"chr1",
10439,
10440
},
(...)
450425,
450426
}
}
ATTRIBUTE "rdfs:comment" {
DATATYPE H5T_STRING {
STRSIZE 15;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "My list of SNPs"
}
}
}
}
}
}


That's it !
Pierre

1 comment:

Anonymous said...

Well and then there's GWASpi, which loads genotypes in a HDF compatible netCDF file.
It seems to be the same type of storage as you can open up netCDf files with HDFView.

For sequences (reads and alignments) there's also the BioHDF libraries which do all the storage and I/O work for you.