Assembly of a human genome with latest sequencing technologies

- January 21, 2013

So Illumina released a nice dataset of a human genome with their latest improvements in their sequencing products.

The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.

Using Ray with the polytope virtual network in Ray Platform, I assembled a pretty good assembly in a timely way. The job ran on 64 nodes or 512 cores, using 512 MPI ranks.

$ cat Ray-polytope.sh
#!/bin/bash

#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR

source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh

mpiexec -n 512 \
-bind-to-core -bynode \
Ray \
-route-messages -connection-type polytope -routing-graph-degree 21 \
-o \
HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
-k \
31 \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz \
-p \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz \
      HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz \

With the recent optimizations, the slowest step is no longer graph traversal.

$ tail HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/ElapsedTime.txt -n 24

Network testing: 1 seconds
Counting sequences to assemble: 8 minutes, 27 seconds
Sequence loading: 1 hours, 9 minutes, 12 seconds
K-mer counting: 31 minutes, 35 seconds
Coverage distribution analysis: 16 seconds
Graph construction: 47 minutes, 10 seconds
Null edge purging: 7 minutes, 21 seconds
Selection of optimal read markers: 56 minutes, 4 seconds
Detection of assembly seeds: 3 hours, 11 minutes, 0 seconds
Estimation of outer distances for paired reads: 16 minutes, 47 seconds
Bidirectional extension of seeds: 4 hours, 37 minutes, 53 seconds
Merging of redundant paths: 1 days, 1 hours, 10 minutes, 46 seconds
Generation of contigs: 4 minutes, 54 seconds
Scaffolding of contigs: 6 hours, 54 minutes, 46 seconds
Counting sequences to search: 0 seconds
Graph coloring: 16 seconds
Counting contig biological abundances: 4 minutes, 15 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 13 seconds
Loading tree: 16 seconds
Processing gene ontologies: 27 seconds
Computing neighbourhoods: 0 seconds
Total: 1 days, 20 hours, 1 minutes, 55 seconds

The assembly is quite good too:

$ cat HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/OutputNumbers.txt
Contigs >= 100 nt
Number: 632240
Total length: 2715204138
Average: 4294
N50: 9119
Median: 2234
Largest: 160698
Contigs >= 500 nt
Number: 504237
Total length: 2693905405
Average: 5342
N50: 9204
Median: 3185
Largest: 160698
Scaffolds >= 100 nt
Number: 592711
Total length: 2724259174
Average: 4596
N50: 10391
Median: 2212
Largest: 160698
Scaffolds >= 500 nt
Number: 464708
Total length: 2702960441
Average: 5816
N50: 10491
Median: 3323
Largest: 160698

Inserts are not that long.

$ grep -A1 AverageO HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/LibraryStatistics.txt
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61

Search This Blog

DSKernel: AI and Strength Training

Assembly of a human genome with latest sequencing technologies

Comments

Popular posts from this blog

Learning to solve the example 1 of puzzle 3aa6fb7a in the ARC prize

The Thorium actor engine is operational now, we can start to work on actor applications for metagenomics

The source code of SOAPdenovo2 sits in the shadows