31 August 2012

The 500th post: Generating a pipeline of analysis (Makefile) for NGS with xslt.

This is my 500th post ! Happy birthday my blog ! (I know, I know, I have been unfaithful since the beginning of the year).

In my previous post, I've shown how a the .INTERMEDIATE keyword can be used in a Makefile to reduce the number of temporary files generated in a classical NGS pipeline. 'Make' also has an option '-j' to specify the number of threads: some independent tasks (such as aligning two distinct samples) can be processed in parallel .

At this point, I should cite this tweet from Sean Davis:



... and this even-more-recent tweet from Justin H. Johnson:




Here, I will show how XSLT can be used to generate a whole Makefile to process the output of Illumina/Casava after the FASTQs files have been generated using configureBclToFastq.pl using --fastq-cluster-count.

A Xml descriptor generated by Illumina Casava looks like this:
<DemultiplexConfig>
  <Software/>
  <FlowcellInfo ID="C0KTCACXX" Operator="JohnDoe" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane1" Desc="Clusters with unmatched barcodes for lane 1" Ref="unknown" />
    </Lane>
    <Lane Number="2">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane2" Desc="Clusters with unmatched barcodes for lane 2" Ref="unknown" />
    </Lane>
  </FlowcellInfo>
</DemultiplexConfig>
This document contains the required informations to find the fastqs and to generate a Makefile. I wrote the following XSLT stylesheet:


This stylesheet transforms the DemultiplexConfig.xml file to the following Makefile:

#
#files must have been generated using CASAVA/configureBclToFastq.pl with --fastq-cluster-count 0
#path to bwa
override BWA=/path/to/bwa-0.6.1/bwa
BWASAMPEFLAG= -a 600
BWAALNFLAG= -t 4
#path to samtools
SAMTOOLS=/path/to/samtools
#path to BEDTOOLS
BEDTOOLS=/path/to/bedtools/bin
#output directory
OUTDIR=2012831.OUTPUT.idp0
INPUTDIR=.
REF=/path/to/human_g1k_v37.fasta
JAVA=java

PICARDLIB=/path/to/picard-tools
TABIX=/path/to/tabix
GATK=/path/to/GenomeAnalysisTK.jar
GATKFLAGS=
VCFDBSNP=/path/to/ncbi/snp/00-All.vcf.gz

#rule how to create a bai from a bam:
%.bam.bai: %.bam
 #indexing BAM $<
 $(SAMTOOLS) index $<

#creates the default output directory
$(OUTDIR):
 mkdir -p $@

#indexes the reference for bwa
$(REF).bwt $(REF).ann $(REF).amb $(REF).pac $(REF).sa : $(REF)
 $(BWA) index -a bwtsw $<

#indexes the reference for samtools
$(REF).fai: $(REF)
 $(SAMTOOLS) faidx $<

#creates sequence dict for picard
$(REF).dict: $(REF)
 $(JAVA) -jar $(PICARDLIB)/CreateSequenceDictionary.jar REFERENCE=$(REF) OUTPUT=$<



#intermediate see http://plindenbaum.blogspot.fr/2012/08/next-generation-sequencing-gnu-make-and.html
.INTERMEDIATE :  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam


#main target : generate the VCFs
all : vcfs


vcfs:  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz


$(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Father/Father.bam $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Father' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Father/Father.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Father/Father_CTTGTA_L001'

$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp22528 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp22528.sam \
  $(OUTDIR)/idp22528_1.sai \
  $(OUTDIR)/idp22528_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp22528_1.sai $(OUTDIR)/idp22528_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528.sam
 #remove the sam
 rm -f $(OUTDIR)/idp22528.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp22528_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp22528_sorted.bam \
  OUTPUT=$(OUTDIR)/idp22528_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp22528_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp22528_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp31344 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp31344.sam \
  $(OUTDIR)/idp31344_1.sai \
  $(OUTDIR)/idp31344_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp31344_1.sai $(OUTDIR)/idp31344_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344.sam
 #remove the sam
 rm -f $(OUTDIR)/idp31344.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp31344_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp31344_sorted.bam \
  OUTPUT=$(OUTDIR)/idp31344_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp31344_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp31344_dup.bam $@




#creates a VCF for 'Father'
$(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Mother' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Mother/Mother_GCCAAT_L001'

$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp15352 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp15352.sam \
  $(OUTDIR)/idp15352_1.sai \
  $(OUTDIR)/idp15352_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp15352_1.sai $(OUTDIR)/idp15352_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352.sam
 #remove the sam
 rm -f $(OUTDIR)/idp15352.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp15352_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp15352_sorted.bam \
  OUTPUT=$(OUTDIR)/idp15352_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp15352_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp15352_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idm22432 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idm22432.sam \
  $(OUTDIR)/idm22432_1.sai \
  $(OUTDIR)/idm22432_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idm22432_1.sai $(OUTDIR)/idm22432_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432.sam
 #remove the sam
 rm -f $(OUTDIR)/idm22432.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idm22432_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idm22432_sorted.bam \
  OUTPUT=$(OUTDIR)/idm22432_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idm22432_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idm22432_dup.bam $@




#creates a VCF for 'Mother'
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Daughter' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Daughter/Daughter_TGACCA_L001'

$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp5176 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp5176.sam \
  $(OUTDIR)/idp5176_1.sai \
  $(OUTDIR)/idp5176_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp5176_1.sai $(OUTDIR)/idp5176_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176.sam
 #remove the sam
 rm -f $(OUTDIR)/idp5176.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp5176_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp5176_sorted.bam \
  OUTPUT=$(OUTDIR)/idp5176_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp5176_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp5176_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp211744 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp211744.sam \
  $(OUTDIR)/idp211744_1.sai \
  $(OUTDIR)/idp211744_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp211744_1.sai $(OUTDIR)/idp211744_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744.sam
 #remove the sam
 rm -f $(OUTDIR)/idp211744.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp211744_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp211744_sorted.bam \
  OUTPUT=$(OUTDIR)/idp211744_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp211744_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp211744_dup.bam $@




#creates a VCF for 'Daughter'
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz
(Of course, this workflow would satisfy my needs, may be not yours)

Et voila !


That's it !


Pierre

1 comment:

Anonymous said...

I've created a tool that just infers the CASAVA file structure and creates the Makefile from that. I like make + SGE it's a nice and simple solution. Hierarchical makefiles also makes it dead easy to re-run parts of the analysis.

/Lars