14 April 2008

Bio-Twitters, Unite !

If your a scientist, a bioinformatician, etc... join the scientific community of the biotwitters on http://twitter.com. Follow @biotecher to find all the biotwitters in one place (Thanks Attila !) and follow me on @yokofakun.

If you don't know twitter, here is a short video about it:


Pierre

04 April 2008

BerkeleyDB and Hapmap: My notebook.

I'm currently trying to find the best way to store some genotypes. For example I need to store 278.766.958 illumina genotypes (marker,individual, allele1, allele2) and mysql, even with indexes, is getting slow when I'm looking for the Mendelian incompatibilities. Deepak suggested via twitter to use href="http://hdf.ncsa.uiuc.edu/HDF5/">HDF5 but as far as I understand the documentation, HDF5 is "just" a smarter implementation of the C functions fseek/fread/fwrite.

I've been looking at the java implementation of the BerkeleyDB (BDB) just to watch its performance according to my needs. This engine can be used as en embedded database and doesn't need a database server running as a daemon (just like JavaDB). A BDB berkeley database contains records where each record contains a "key" and a "value" (a kind of two columns database, or a kind of C++ std::multimap). In the current post, I'll describe a program which 1) download some genotypes from HapMap 2) Find the pedigree of the samples 3) Loop over the markers (ordered on their position on the genome) and get the frequency 4) find Mendelian incompatibilities

The source code is available on pastie at:



First we need a few classes:

class Position holds a position on a chromosome
private static class Position
{
String chromosome;
int position;
Position(String chromosome,int position)
{
this.chromosome=chromosome;
this.position=position;
}
}


class Marker contains the informations about a snp :
private static class Marker
{
int rsId;
String alleles;
Position pos;
char strand;
}


class Individual contains the informations about an individual including his 'id' on corriel ( see http://cardiogenomics.med.harvard.edu)
private static class Individual
{
/**Coriell Repository Number*/
String name;
/** id in corriel */
String individualID=null;
/** father id */
String fatherID=null;
/** mother id */
String motherID=null;
}


As BDB is a collection of pairs(key,value) we need a class MarkerIndividual holding the pair(marker,individual) to store the genotypes with f(pair(marker,individual)=genotype.
private static class MarkerIndividual
{
private int rsId;
/**Coriell Repository Number*/
private String individualId;
}


At the end, we need a class Genotype to store two alleles:
private static class Genotype
{
char a1,a2;

public boolean same(char a1,char a2)
{
return (this.a1==a1 && this.a2==a2) ||
(this.a1==a2 && this.a2==a1)
;
}
}


BDB has several alternatives to read and write the java objects from/to the databases, this operation requires the object to be converted into an array of bytes: 1) the java Serialization can be used, 2) a TupleBinding can be implemented, this class must impements two functions which are used to encode/decode the object to/from an array of bytes. I've choosen to use this later option, and here is for example the TupleBinding implementation for the class Individual:

private TupleBinding individualTupleBinding=new TupleBinding()
{
@Override
public Object entryToObject(TupleInput input)
{
Individual indi= new Individual();
indi.name= input.readString();
indi.individualID= input.readString();
indi.fatherID=input.readString();
indi.motherID= input.readString();
return indi;
}

@Override
public void objectToEntry(Object object, TupleOutput output) {
Individual indi= Individual.class.cast(object);
output.writeString(indi.name);
output.writeString(indi.individualID);
output.writeString(indi.fatherID);
output.writeString(indi.motherID);
}
};


To open and create an Berkeley Environment the following code was written:
EnvironmentConfig envConf= new EnvironmentConfig();
envConf.setAllowCreate(!readOnly);
envConf.setReadOnly(readOnly);
envConf.setTransactional(false);
this.env= new Environment(
file,
envConf
);


Then, we open each database. We need 3 databases: markers, individuals and genotypes. Opening a Database looks like this:

DatabaseConfig dbConfig= new DatabaseConfig();
/* allow create if database does not exists */
dbConfig.setAllowCreate(!readOnly);
dbConfig.setReadOnly(readOnly);
Database db= this.env.openDatabase(null, "database-name", d
bConfig);


Althoug DBD is based on a pair(key,value) another component of the value could be searched and need to be indexed. This process is called a "Secondary Database". For example, with this project I created a secondary database:1) to find/loop over the markers using their position 2) to find/loop over the individuals using individualID instead of name. We need a class extending SecondaryKeyCreator
to extract this second key for the original value. For example here is the class extraction the Position from the Marker.
class MarkerPositionCreator implements SecondaryKeyCreator
{
public boolean createSecondaryKey(
SecondaryDatabase secDb,
DatabaseEntry keyEntry,
DatabaseEntry dataEntry,
DatabaseEntry resultEntry)
{
Marker marker= (Marker)Test01.this.markerTupleBinding.entryToObject(dataEntry);
Test01.this.positionTupleBinding.objectToEntry(marker.pos, resultEntry);
return true;
}
}

We also need to open those "secondary" databases
SecondaryConfig secondConfig= new SecondaryConfig();
secondConfig.setAllowCreate(!readOnly);
/* on open, if the secondary database is empty then the primary
database is read in its entirety and additions/modifications to the secondary's records occur automatically */
secondConfig.setAllowPopulate(true);
secondConfig.setSortedDuplicates(true);
MarkerPositionCreator posCreator = new MarkerPositionCreator();
/* Identifies the key creator object to be used for secondary key creation. */
secondConfig.setKeyCreator(posCreator);
positionDB = this.env.openSecondaryDatabase(null, "position
", this.markerDB,secondConfig);


OK, more genetic now. The HapMap genotypes are available here:http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/ (the path may change). A file looks like a table f(marker,individual)=genotype:
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code NA18940 NA18942 NA189
rs28412942 A/T chrMT 410 + ncbi_B36 affymetrix GenomeWideSNP_6.02 Japanese:2 QC+ AA AA AA AA AA AA AA AA AA AA
rs3937039 A/G chrMT 665 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ AA AA AA AA AA AA AA AA AA AA AA
rs2853517 A/G chrMT 711 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ GG GG GG GG GG GG GG GG GG GG GG
rs28358568 C/T chrMT 712 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ TT TT TT TT TT TT TT TT TT TT TT
(...)


The file is processed as follow.

Pattern space=Pattern.compile("[ ]");
String HEADER[]=new String[]{"rs#","SNPalleles","chrom","pos","strand","genome_build","center","protLSID","assayLSID","panelLSID","QC_code"};
BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));

String line= in.readLine();
if(line==null) throw new IOException("Empty file");
/* The first line of this file is the header*/
String header[]=space.split(line);
for(int i=0;i< HEADER.length;++i)
{
if(!header[i].equals(HEADER[i])) throw new IOException("Bad header "+header[i]+" expected "+HEADER[i]);
}
/* the header contains the name of the Individual which will be inserted. At this time we don't know what are the relationships between those individuals.*/
for(int i=HEADER.length;i< header.length;++i)
{
Individual individual= new Individual();
individual.name=header[i];
DatabaseEntry key= new DatabaseEntry(individual.name.getBytes());
DatabaseEntry data= new DatabaseEntry();
this.individualTupleBinding.objectToEntry(individual, data);
getIndividualDB().put(null
,key
,data
);
}
/** the following lines are the markers and the genotypes */
TupleBinding INT_BINDING=TupleBinding.getPrimitiveBinding(Integer.class);
while((line=in.readLine())!=null)
{
if(!line.startsWith("rs")) continue;
String tokens[]=space.split(line);
//System.err.println(line);
/* fill the information of this marker */
Marker marker= new Marker(Integer.parseInt( tokens[0].substring(2)));
marker.alleles= tokens[1];
marker.pos= new Position(tokens[2],Integer.parseInt(tokens[3]));
marker.strand= tokens[4].charAt(0);

DatabaseEntry key= new DatabaseEntry();
INT_BINDING.objectToEntry(marker.getRSId(), key);
DatabaseEntry data= new DatabaseEntry();
this.markerTupleBinding.objectToEntry(marker, data);
getMarkerDB().put(null
,key
,data
);
/** loop over this marker and get the genotypes */
for(int i=HEADER.length;i<header.length;++i)
{
if(tokens[i].length()!=2 || tokens[i].equals("NN")) continue;
/** create the KEY */
MarkerIndividual mi= new MarkerIndividual(marker.rsId,header[i]);
this.markerIndividualTupleBinding.objectToEntry(mi, key);
/** create the genotype */
this.genotypeTupleBinding.objectToEntry(
new Genotype(tokens[i].charAt(0),tokens[i].charAt(1)),
data
);
/** put the new pair( pair(marker,individual), genotype) in the BDB */
getGenotypeDB().put(null
,key
,data
);
}
}
in.close();


OK, I want to find the relationships between those individuals, this information is available here. For each "Coriell Repository Number" we find the individual in our database, if it exists we add the information and write the individual back to the database. (See function ">makePedigree line 466).

To retrieve the genotype g=f(marker,individual) I wrote the following simple utility function getGenotypeAt:

Genotype getGenotypeAt(Marker marker,Individual indi) throws DatabaseException
{
if(marker==null || indi==null) return null;
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
this.markerIndividualTupleBinding.objectToEntry(new MarkerIndividual(marker.rsId,indi.name),key);
if(this.genotypeDB.get(null, key, value, LockMode.DEFAULT)!=OperationStatus.SUCCESS) return null;
Genotype g= Genotype.class.cast(this.genotypeTupleBinding.entryToObject(value));
return g;
}



To get the frequencies of the alleles, we loop each over each marker (using a secondary database to get the markers ordered on the genome (not ordered on rs##)) and we get all the genotypes for each individual. To loop over a BDB an instance Cursor (looks like a java.util.Iterator) is used.
void frequencies() throws DatabaseException
{
SecondaryCursor cM= this.positionDB.openSecondaryCursor(null, null);
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
HashMap<Character, Integer> allele2count= new HashMap<Character, Integer>();
int totalGenotyped=0;
int totalFailures=0;
System.out.print("rs"+m.rsId+" "+m.alleles+" "+m.pos.chromosome+" "+m.pos.position+" "+m.strand);
Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));
Genotype g= getGenotypeAt(m, indi);
if(g!=null)
{
totalGenotyped++;
for(int i=0;i< 2;++i)
{
char c= (i==0?g.a1:g.a2);
Integer count= allele2count.get(c);
if(count==null) count=0;
allele2count.put(c,count+1);
}
}
else
{
totalFailures++;
}
}
cI.close();
System.out.print(" genotyped:"+(int)(100.0*(totalGenotyped-totalFailures)/(float)totalGenotyped)+"%");
for(Character allele: allele2count.keySet())
{
System.out.print(" f("+allele+")="+allele2count.get(allele)/(2.0*totalGenotyped));
}
System.out.println();
}
cM.close();
}



Finding the Mendelian incompatibilities is much like the same: I sued here the brute force, we loop over each individual and over each marker. If the individual as any parent, we check that his genotype is compatible with them.
void incompats() throws DatabaseException
{
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();

Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));

if(indi.fatherID==null && indi.motherID==null)
{
continue;
}

Individual father= findIndividualByCorrielId(indi.fatherID);
Individual mother= findIndividualByCorrielId(indi.motherID);

Cursor cM= this.markerDB.openCursor(null, null);
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
Genotype gChild= getGenotypeAt(m, indi);
Genotype gFather= getGenotypeAt(m, father);
if(isIncompat(gChild,gFather))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his father "+indi.fatherID+" is "+
gFather
);
continue;
}
Genotype gMother= getGenotypeAt(m, mother);
if(isIncompat(gChild,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his mother "+indi.motherID+" is "+
gMother
);
continue;
}
if(isIncompat(gChild,gFather,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+
" and his father "+indi.fatherID+" is "+ gFather+" and his mother "+indi.motherID+" is "+ gMother
);
}
}
cM.close();
}
cI.close();
}


That's it. I first test the chromosome 1 at http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/genotypes_chr1_CEU_r23a_nr.b36.txt.gz(11Mo) but I pressed Ctrl-C when the files reached 1.4Go !
I then used the chr22 file directly downloaded on my computer. The space required by BerkeleyDB to hold the database and the indexes was 721Mo whereas the zipped original source of data was 2Mo (25Mo unzipped)!!! (Arghhhhhhhhhhhh !!!!!).

  • Individuals count:=90

  • Markers count:=54786

  • Genotypes count:=4853237


Time required to load the hapmap file : 1174secs (20min)

rs9624968 A/G chr22 24783030 + genotyped:86% f(G)=0.879746835443038 f(A)=0.12025316455696203
rs9624969 C/T chr22 24784595 + genotyped:87% f(T)=0.075 f(C)=0.925
rs6004919 C/T chr22 24785216 + genotyped:100% f(T)=0.12777777777777777 f(C)=0.8722222222222222
rs4585127 A/G chr22 24785559 + genotyped:100% f(G)=0.8722222222222222 f(A)=0.12777777777777777
rs5752262 A/G chr22 24786367 + genotyped:95% f(G)=0.5116279069767442 f(A)=0.4883720930232558
rs16981296 C/G chr22 24787784 + genotyped:95% f(G)=0.8488372093023255 f(C)=0.1511627906976744
rs1003547 G/T chr22 24788134 + genotyped:86% f(T)=0.44936708860759494 f(G)=0.5506329113924051
rs9613094 A/G chr22 24788388 + genotyped:100% f(G)=0.2222222222222222 f(A)=0.7777777777777778
rs16986627 A/G chr22 24789298 + genotyped:88% f(G)=0.2654320987654321 f(A)=0.7345679012345679
(...)


Time required to generate the frequencies Time:955secs

Incompat: for rs133457(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs136009(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs394518(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs628437(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs731403(C/G) 1341M02 is GG and his mother 1341MM14 is CC
Incompat: for rs1122940(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs2845343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs4822498(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs4822499(C/T) 1341M02 is TT and his father 1341MF13 is CC
Incompat: for rs4823195(A/G) 1341M02 is AA and his mother 1341MM14 is GG
Incompat: for rs4991267(C/T) 1341M02 is CC and his mother 1341MM14 is TT
Incompat: for rs5755047(A/T) 1341M02 is TT and his father 1341MF13 is AA
Incompat: for rs5755343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs5755420(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765056(C/T) 1341M02 is CT and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765436(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765499(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5768636(G/T) 1341M02 is GT and his father 1341MF13 is GG and his mother 1341MM14 is GG
Incompat: for rs5769710(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5770600(A/C) 1341M02 is CC and his father 1341MF13 is AA
Incompat: for rs5997220(A/G) 1341M02 is AG and his father 1341MF13 is GG and his mother 1341MM14 is GG
(...)
764 errors


Time required to find the Mendelian incompatibilities: 11021secs (~3H00)

When the program closed, the database was compacted down to 710Mo.

Conclusion: Too slow, some huges files generated. Definitely a bad choice to handle this kind of data.

03 April 2008

Study Collaborators Included in PubMed

Via NLM Technical Bulletin:

As of November 2007, there were over 57,000 occurrences of group (corporate) authors in MEDLINE/PubMed with over 17,000 citations with no co-occurring personal authors. Not everyone involved in a group is actually writing or authoring the paper, however. NLM agrees (...) that "Authorship credit should be based on

  1. substantial contributions to conception and design, or acquisition of data, or analysis and interpretation of data
  2. drafting the article or revising it critically for important intellectual content
  3. final approval of the version to be published
Taking these and other factors into consideration, NLM decided that it is time to include the individual names associated with the group authors in MEDLINE/PubMed. Therefore, when a group name is included as an author, the respective group member names appearing in the article will be acknowledged as collaborators but not associated with authorship. This significant enhancement allows PubMed users to identify articles to which an individual has contributed, whether as an author or as a collaborator. NLM has implemented this new feature routinely for MEDLINE citations created beginning in March 2008 if the article was published in 2008 forward.


See also: