Assembly of a human genome with latest sequencing technologies
So Illumina released a nice dataset of a human genome with their latest improvements in their sequencing products.
The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.
Using Ray with the polytope virtual network in Ray Platform, I assembled a pretty good assembly in a timely way. The job ran on 64 nodes or 512 cores, using 512 MPI ranks.
$ cat Ray-polytope.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR
source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh
mpiexec -n 512 \
-bind-to-core -bynode \
Ray \
-route-messages -connection-type polytope -routing-graph-degree 21 \
-o \
HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
-k \
31 \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz \
With the recent optimizations, the slowest step is no longer graph traversal.
$ tail HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/ElapsedTime.txt -n 24
Network testing: 1 seconds
Counting sequences to assemble: 8 minutes, 27 seconds
Sequence loading: 1 hours, 9 minutes, 12 seconds
K-mer counting: 31 minutes, 35 seconds
Coverage distribution analysis: 16 seconds
Graph construction: 47 minutes, 10 seconds
Null edge purging: 7 minutes, 21 seconds
Selection of optimal read markers: 56 minutes, 4 seconds
Detection of assembly seeds: 3 hours, 11 minutes, 0 seconds
Estimation of outer distances for paired reads: 16 minutes, 47 seconds
Bidirectional extension of seeds: 4 hours, 37 minutes, 53 seconds
Merging of redundant paths: 1 days, 1 hours, 10 minutes, 46 seconds
Generation of contigs: 4 minutes, 54 seconds
Scaffolding of contigs: 6 hours, 54 minutes, 46 seconds
Counting sequences to search: 0 seconds
Graph coloring: 16 seconds
Counting contig biological abundances: 4 minutes, 15 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 13 seconds
Loading tree: 16 seconds
Processing gene ontologies: 27 seconds
Computing neighbourhoods: 0 seconds
Total: 1 days, 20 hours, 1 minutes, 55 seconds
The assembly is quite good too:
$ cat HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/OutputNumbers.txt
Contigs >= 100 nt
Number: 632240
Total length: 2715204138
Average: 4294
N50: 9119
Median: 2234
Largest: 160698
Contigs >= 500 nt
Number: 504237
Total length: 2693905405
Average: 5342
N50: 9204
Median: 3185
Largest: 160698
Scaffolds >= 100 nt
Number: 592711
Total length: 2724259174
Average: 4596
N50: 10391
Median: 2212
Largest: 160698
Scaffolds >= 500 nt
Number: 464708
Total length: 2702960441
Average: 5816
N50: 10491
Median: 3323
Largest: 160698
Inserts are not that long.
$ grep -A1 AverageO HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/LibraryStatistics.txt
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.
Using Ray with the polytope virtual network in Ray Platform, I assembled a pretty good assembly in a timely way. The job ran on 64 nodes or 512 cores, using 512 MPI ranks.
$ cat Ray-polytope.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR
source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh
mpiexec -n 512 \
-bind-to-core -bynode \
Ray \
-route-messages -connection-type polytope -routing-graph-degree 21 \
-o \
HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
-k \
31 \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz \
-p \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz \
HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz \
With the recent optimizations, the slowest step is no longer graph traversal.
$ tail HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/ElapsedTime.txt -n 24
Network testing: 1 seconds
Counting sequences to assemble: 8 minutes, 27 seconds
Sequence loading: 1 hours, 9 minutes, 12 seconds
K-mer counting: 31 minutes, 35 seconds
Coverage distribution analysis: 16 seconds
Graph construction: 47 minutes, 10 seconds
Null edge purging: 7 minutes, 21 seconds
Selection of optimal read markers: 56 minutes, 4 seconds
Detection of assembly seeds: 3 hours, 11 minutes, 0 seconds
Estimation of outer distances for paired reads: 16 minutes, 47 seconds
Bidirectional extension of seeds: 4 hours, 37 minutes, 53 seconds
Merging of redundant paths: 1 days, 1 hours, 10 minutes, 46 seconds
Generation of contigs: 4 minutes, 54 seconds
Scaffolding of contigs: 6 hours, 54 minutes, 46 seconds
Counting sequences to search: 0 seconds
Graph coloring: 16 seconds
Counting contig biological abundances: 4 minutes, 15 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 13 seconds
Loading tree: 16 seconds
Processing gene ontologies: 27 seconds
Computing neighbourhoods: 0 seconds
Total: 1 days, 20 hours, 1 minutes, 55 seconds
The assembly is quite good too:
$ cat HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/OutputNumbers.txt
Contigs >= 100 nt
Number: 632240
Total length: 2715204138
Average: 4294
N50: 9119
Median: 2234
Largest: 160698
Contigs >= 500 nt
Number: 504237
Total length: 2693905405
Average: 5342
N50: 9204
Median: 3185
Largest: 160698
Scaffolds >= 100 nt
Number: 592711
Total length: 2724259174
Average: 4596
N50: 10391
Median: 2212
Largest: 160698
Scaffolds >= 500 nt
Number: 464708
Total length: 2702960441
Average: 5816
N50: 10491
Median: 3323
Largest: 160698
Inserts are not that long.
$ grep -A1 AverageO HiSeq-2500-NA12878-demo-2x150-2013-01-18-2/LibraryStatistics.txt
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
--
AverageOuterDistance: 296
StandardDeviation: 61
Comments