2013-01-21

Assembly of a human genome with latest sequencing technologies

So Illumina released a nice dataset of a human genome with their latest improvements in their sequencing products.

The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.

Using Ray with the polytope virtual network in Ray Platform, I assembled a pretty good assembly in a timely way. The job ran on 64 nodes or 512 cores, using 512 MPI ranks.

$ cat Ray-polytope.sh
#!/bin/bash

#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR

source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh

mpiexec -n 512 \
-bind-to-core -bynode \
Ray \
-route-messages -connection-type polytope -routing-graph-degree 21 \
 -o \
 HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
 -k \
 31 \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz \


With the recent optimizations, the slowest step is no longer graph traversal.


tail HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/ElapsedTime.txt -n 24

 Network testing: 1 seconds
 Counting sequences to assemble: 8 minutes, 27 seconds
 Sequence loading: 1 hours, 9 minutes, 12 seconds
 K-mer counting: 31 minutes, 35 seconds
 Coverage distribution analysis: 16 seconds
 Graph construction: 47 minutes, 10 seconds
 Null edge purging: 7 minutes, 21 seconds
 Selection of optimal read markers: 56 minutes, 4 seconds
 Detection of assembly seeds: 3 hours, 11 minutes, 0 seconds
 Estimation of outer distances for paired reads: 16 minutes, 47 seconds
 Bidirectional extension of seeds: 4 hours, 37 minutes, 53 seconds
 Merging of redundant paths: 1 days, 1 hours, 10 minutes, 46 seconds
 Generation of contigs: 4 minutes, 54 seconds
 Scaffolding of contigs: 6 hours, 54 minutes, 46 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 16 seconds
 Counting contig biological abundances: 4 minutes, 15 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 13 seconds
 Loading tree: 16 seconds
 Processing gene ontologies: 27 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 days, 20 hours, 1 minutes, 55 seconds


The assembly is quite good too:

$ cat HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/OutputNumbers.txt
Contigs >= 100 nt
 Number: 632240
 Total length: 2715204138
 Average: 4294
 N50: 9119
 Median: 2234
 Largest: 160698
Contigs >= 500 nt
 Number: 504237
 Total length: 2693905405
 Average: 5342
 N50: 9204
 Median: 3185
 Largest: 160698
Scaffolds >= 100 nt
 Number: 592711
 Total length: 2724259174
 Average: 4596
 N50: 10391
 Median: 2212
 Largest: 160698
Scaffolds >= 500 nt
 Number: 464708
 Total length: 2702960441
 Average: 5816
 N50: 10491
 Median: 3323
 Largest: 160698


Inserts are not that long.

$ grep -A1 AverageO HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/LibraryStatistics.txt
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61
--
  AverageOuterDistance: 296
  StandardDeviation: 61

No comments: