Assembly of a human genome with latest sequencing technologies

So Illumina released a nice dataset of a human genome with their latest improvements in their sequencing products.

The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.

Using Ray with the polytope virtual network in Ray Platform, I assembled a pretty good assembly in a timely way. The job ran on 64 nodes or 512 cores, using 512 MPI ranks.

$ cat Ray-polytope.sh
#!/bin/bash

#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR

source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh

mpiexec -n 512 \
-bind-to-core -bynode \
Ray \
-route-messages -connection-type polytope -routing-graph-degree 21 \
 -o \
 HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
 -k \
 31 \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz \


With the recent optimizations, the slowest step is no longer graph traversal.


tail HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/ElapsedTime.txt -n 24

 Network testing: 1 seconds
 Counting sequences to assemble: 8 minutes, 27 seconds
 Sequence loading: 1 hours, 9 minutes, 12 seconds
 K-mer counting: 31 minutes, 35 seconds
 Coverage distribution analysis: 16 seconds
 Graph construction: 47 minutes, 10 seconds
 Null edge purging: 7 minutes, 21 seconds
 Selection of optimal read markers: 56 minutes, 4 seconds
 Detection of assembly seeds: 3 hours, 11 minutes, 0 seconds
 Estimation of outer distances for paired reads: 16 minutes, 47 seconds
 Bidirectional extension of seeds: 4 hours, 37 minutes, 53 seconds
 Merging of redundant paths: 1 days, 1 hours, 10 minutes, 46 seconds
 Generation of contigs: 4 minutes, 54 seconds
 Scaffolding of contigs: 6 hours, 54 minutes, 46 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 16 seconds
 Counting contig biological abundances: 4 minutes, 15 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 13 seconds
 Loading tree: 16 seconds
 Processing gene ontologies: 27 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 days, 20 hours, 1 minutes, 55 seconds


The assembly is quite good too:

$ cat HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/OutputNumbers.txt
Contigs >= 100 nt
 Number: 632240
 Total length: 2715204138
 Average: 4294
 N50: 9119
 Median: 2234
 Largest: 160698
Contigs >= 500 nt
 Number: 504237
 Total length: 2693905405
 Average: 5342
 N50: 9204
 Median: 3185
 Largest: 160698
Scaffolds >= 100 nt
 Number: 592711
 Total length: 2724259174
 Average: 4596
 N50: 10391
 Median: 2212
 Largest: 160698
Scaffolds >= 500 nt
 Number: 464708
 Total length: 2702960441
 Average: 5816
 N50: 10491
 Median: 3323
 Largest: 160698


Inserts are not that long.

$ grep -A1 AverageO HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/LibraryStatistics.txt
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61

Comments

Popular posts from this blog

Learning to solve the example 1 of puzzle 3aa6fb7a in the ARC prize

The Thorium actor engine is operational now, we can start to work on actor applications for metagenomics

The source code of SOAPdenovo2 sits in the shadows