18 September 2012

Notes about bwa 0.6.2 and the multiple hits.

Here are my notes about the way bwa 0.6.2 handles the multiple hits.


I've created a sub-sequence (~6Mb ) of the human chr22 (see the Makefile below). The sequence doesn't contain any base 'N' or any lowercase (=repeatMasked) letter.
1E6 reads (forward and reverse) have been generated from this ~chr22 with samtools wgsim. No mutation and no sequencing error have been allowed.

This ~chr22 sequence have been duplicated and renamed as 'DUP'. Both sequences have been merged in one fasta file named 'twoChrom.fa'

The reads have been aligned with BWA (v. 0.6.2-r126) on this genome.


There's only one SAM alignment per read

$ samtools view align.sorted.bam  | wc -l
2000000

1% of the reads were not properly paired/mapped on the genome

$ samtools view  -f 2 align.sorted.bam | wc -l
1999926

The properly mapped reads contain some informations about the alternative hits in 'XA'

$samtools view  -f 2 align.sorted.bam | grep 22_10_402_0 | verticalize -n

>>> 1
$1   22_10_402_0:0:0_0:0:0_c1608
$2   99
$3   22
$4   10
$5   0
$6   70M
$7   =
$8   333
$9   393
$10  GGGACGGTCATGCAATCTGGACAACATTCACCTTTAAAAGTTTATTGATCTTTTGTGACATGCACGTGGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:DUP,+10,70M,0;
<<< 1

>>> 2
$1   22_10_402_0:0:0_0:0:0_c1608
$2   147
$3   22
$4   333
$5   0
$6   70M
$7   =
$8   10
$9   -393
$10  ATACCTGCCAGATGAGTCACTGGCAAAAGGTGCTGCTCCCTGGTGAGGGAGAAACACCAGGGGCTGGGAG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:DUP,-333,70M,0;
<<< 2

50% mapped on chr22, 50% mapped on DUP

$ samtools view  -f 2 align.sorted.bam | cut -d '      ' -f 3 | sort | uniq -c
 999860 22
1000066 DUP

Some pairs have been unmapped because the mate have been mapped on the other chromosome !

$ samtools-0.1.18/samtools view  -F 2 align.sorted.bam | grep 22_1210321_1210924 | verticalize -n

>>> 1
$1   22_1210321_1210924_0:0:0_0:0:0_e28f5
$2   81
$3   22
$4   1210855
$5   0
$6   70M
$7   DUP
$8   1210321
$9   0
$10  AGAGACTGAGTCCGGCTAGAGAACAGGGTGGAGCCCCTTTGGACCTTAGAGCTGGGCCTTTGGGCCTTGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:6
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:22,-1996342,70M,0;DUP,-1996342,70M,0;22,+2551778,70M,0;DUP,+2551778,70M,0;DUP,-1210855,70M,0;
<<< 1

>>> 2
$1   22_1210321_1210924_0:0:0_0:0:0_e28f5
$2   161
$3   DUP
$4   1210321
$5   0
$6   70M
$7   22
$8   1210855
$9   0
$10  CCCGGGCCGGATGGCTCGCCTGCGCGGCCAGCTCCGGGCCGAAGCGGCTTCGCGGTCCGAGGTGCCGCGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:22,+1210321,70M,0;
<<< 2

The Makefile

SAMDIR=/usr/local/package/samtools-0.1.18
SAMTOOLS=$(SAMDIR)/samtools
BCFTOOLS=$(SAMDIR)/bcftools/bcftools
BWA=/usr/local/package/bwa-0.6.2/bwa
REF1=chr22.fa
REF2=twoChrom.fa

.INTERMEDIATE : align.sam random_1.sai random_2.sai align.bam variations.bcf

%.bam : %.sam
        $(SAMTOOLS) view -o $@ -b -S -T $(REF2) $<
%.bam.bai : %.bam
        $(SAMTOOLS) index $<



align.sorted.bam : align.bam
        $(SAMTOOLS) sort $< align.sorted


align.sam : random_1.sai random_2.sai  
        $(BWA) sampe -a 600 $(REF2) $^ random_1.fq.gz random_2.fq.gz > $@

$(REF1):
        curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/$(REF1).gz" |\
        gunzip -c | sed 's/^>chr/>/' | grep -v '[Nnatgc]' | head -n 100000 > $@


random_1.sai :  random_1.fq.gz $(REF2).bwt
        $(BWA) aln -f $@ $(REF2) $<

random_2.sai :  random_2.fq.gz $(REF2).bwt
        $(BWA) aln -f $@ $(REF2) $<

random_1.fq.gz random_2.fq.gz : $(REF1)
        $(SAMDIR)/misc/wgsim -N 1000000 -e 0.0 -r 0.0 -d 400 $< random_1.fq random_2.fq > wgsim.output
        gzip -f --best random_1.fq random_2.fq

$(REF2): $(REF1)
        cp $< $@
        sed 's/^>.*/>DUP/' $< >> $@

$(REF2).bwt : $(REF2)
        $(BWA) index -a bwtsw $<

$(REF2).fai :  $(REF2)
        $(SAMTOOLS) faidx $< 



clean:
        rm -f chr22.* *.bam *.vcf *.bcf *.sai *.gz *.fq *.bai  wgsim.output *.sam
That's it,

Pierre

No comments: