23 April 2010

Short post: Plain text vs Binary data.

This week, I was asked about the difference between storing data in a plain text format and in a binary format. I wrote the following C++ code to illustrate how to store genotypes in a (possibly huge) table saved as a binary file. The first bytes of this file give the number of individuals, the number of markers and the names of the individuals. Then, for each marker, come the name of the marker, its position and an array of genotypes, one per individual.
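The practical benefit of such a layout is that, when every field has a fixed width, the byte offset of any genotype can be computed instead of searched for. Here is a minimal sketch of that arithmetic. The field widths (16-byte names, 32-bit counts and positions, one byte per genotype) and all the function names are assumptions made for illustration, not necessarily the exact format of the program described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-width layout (an assumption, not the original format):
//   header : uint32 n_individuals, uint32 n_markers,
//            then n_individuals names of NAME_SIZE bytes each
//   marker : NAME_SIZE-byte name, uint32 position,
//            then n_individuals genotypes of one byte each
constexpr std::uint64_t NAME_SIZE = 16;

// Size in bytes of the file header.
constexpr std::uint64_t header_size(std::uint32_t n_individuals) {
    return 2 * sizeof(std::uint32_t) + n_individuals * NAME_SIZE;
}

// Size in bytes of one marker record (name + position + genotypes).
constexpr std::uint64_t marker_record_size(std::uint32_t n_individuals) {
    return NAME_SIZE + sizeof(std::uint32_t) + n_individuals;
}

// Byte offset, from the start of the file, of the genotype of
// `individual` at `marker` (both are 0-based indexes).
inline std::uint64_t genotype_offset(std::uint32_t n_individuals,
                                     std::uint32_t marker,
                                     std::uint32_t individual) {
    assert(individual < n_individuals);
    return header_size(n_individuals)               // skip the header
         + marker * marker_record_size(n_individuals) // skip earlier markers
         + NAME_SIZE + sizeof(std::uint32_t)        // skip this marker's name and position
         + individual;                              // index into the genotype array
}
```

With a plain text table, finding the same cell would generally mean scanning the lines one by one, because their lengths are not known in advance.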

The first invocation of this program (-p write) creates a random table and a second call (-p read) returns the genotypes for some random individuals/markers.
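As a rough sketch of what these two steps might look like (this is not the original program): write_table fills a file with random genotypes and read_cell jumps straight to one marker/individual with seekg. The function names, the Cell struct, the 16-byte names, the "ind"/"rs" naming and the native byte order are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>

// Assumed layout (hypothetical, native byte order):
//   uint32 n_individuals, uint32 n_markers, n_individuals names of NAME_SIZE bytes,
//   then per marker: NAME_SIZE-byte name, uint32 position, n_individuals genotype bytes.
const std::size_t NAME_SIZE = 16;

// Write a name padded with '\0' (or truncated) to exactly NAME_SIZE bytes.
static void write_name(std::ostream& out, const std::string& name) {
    char buf[NAME_SIZE] = {0};
    name.copy(buf, NAME_SIZE - 1);
    out.write(buf, NAME_SIZE);
}

// "write" step: create a table of random genotypes (coded 0, 1 or 2 here).
void write_table(const std::string& path, std::uint32_t n_ind,
                 std::uint32_t n_markers, unsigned seed) {
    std::ofstream out(path, std::ios::binary);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> genotype(0, 2);
    out.write(reinterpret_cast<const char*>(&n_ind), sizeof n_ind);
    out.write(reinterpret_cast<const char*>(&n_markers), sizeof n_markers);
    for (std::uint32_t i = 0; i < n_ind; ++i) write_name(out, "ind" + std::to_string(i));
    for (std::uint32_t m = 0; m < n_markers; ++m) {
        write_name(out, "rs" + std::to_string(m));
        const std::uint32_t position = 1000 * (m + 1);  // made-up positions
        out.write(reinterpret_cast<const char*>(&position), sizeof position);
        for (std::uint32_t i = 0; i < n_ind; ++i) out.put(static_cast<char>(genotype(rng)));
    }
}

struct Cell {
    std::string marker;          // marker name
    std::uint32_t position = 0;  // marker position
    int genotype = -1;           // genotype of the requested individual
};

// "read" step: fetch one genotype without loading the whole table.
Cell read_cell(const std::string& path, std::uint32_t marker, std::uint32_t individual) {
    std::ifstream in(path, std::ios::binary);
    std::uint32_t n_ind = 0, n_markers = 0;
    in.read(reinterpret_cast<char*>(&n_ind), sizeof n_ind);
    in.read(reinterpret_cast<char*>(&n_markers), sizeof n_markers);
    assert(in && marker < n_markers && individual < n_ind);

    // Every record has the same size, so the marker's offset is simple arithmetic.
    const std::uint64_t header = 2 * sizeof(std::uint32_t) + std::uint64_t(n_ind) * NAME_SIZE;
    const std::uint64_t record = NAME_SIZE + sizeof(std::uint32_t) + n_ind;
    in.seekg(static_cast<std::streamoff>(header + marker * record));

    char name[NAME_SIZE + 1] = {0};
    Cell cell;
    in.read(name, NAME_SIZE);
    in.read(reinterpret_cast<char*>(&cell.position), sizeof cell.position);
    in.seekg(individual, std::ios::cur);  // skip the genotypes of earlier individuals
    cell.marker = name;
    cell.genotype = in.get();
    return cell;
}
```

A call such as write_table("genotypes.bin", 1000, 100000, 42) followed by read_cell("genotypes.bin", 5000, 12) would only touch a few dozen bytes of the file on the read side, whatever the size of the table.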



That's it.

Pierre

2 comments:

Mailund said...

While working on some rather large SNP datasets at DeCODE I needed a file format for it. One that was efficient to access directly on the disk, since I couldn't load all of it into RAM. The first attempt was very similar to yours, but I needed to extend it again and again with meta data about the individuals and the markers, so a student and I extended it with general meta data that can be stored directly in the files so we didn't have to worry about having the main marker data and the meta data go out of sync.

We had a number of small programs to manipulate these files, but that became too tedious an approach. We had to write a new program for each new kind of meta data. So we wrote a Python interface to the file format instead. That meant we had to add type information to the meta data which was a bit difficult to get right, but in the end we had a pretty flexible data format.

You can see the code here: http://bircwww.daimi.au.dk/~mailund/SNPfile/index.html

Pierre Lindenbaum said...

Thomas, I *would not* store genotypes using the binary format I've described in this post: as you said, it would be too difficult to insert new (sorted) data, to add metadata, etc. That's why I would rather use something like BerkeleyDB to manage my genotypes.