Re: Miseq 2x250 – Does Length Really Matter?
Justin Johnson at Edge BioSystems, Inc. posted assemblies done with CLCBio for fresh MiSeq data. Here are some assemblies with Ray for the same data.
*Jobs were done on 4 nodes with a total of 32 processor cores.
**Contigs produced by Ray contain only symbols from {A, C, G, T}.
Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **
Sample | K-mer length | N50 (kb) | Max (kb) | Mean (kb) | Count | Time (hour:min) | Max mem. usage per core (MiB)
2x150 | 31 | 67.1 | 227.5 | 37.4 | 123 | 01:06 | 351 +/- 19
2x150 | 51 | 95.5 | 296.6 | 48.3 | 94 | 01:13 | 395 +/- 54
2x150 | 61 | 87.5 | 326.8 | 44.0 | 104 | 01:19 | 408 +/- 58
2x250 | 31 | 97.8 | 327.1 | 51.0 | 89 | 01:56 | 421 +/- 39
2x250 | 51 | 95.9 | 327.1 | 55.4 | 82 | 01:49 | 447 +/- 46
2x250 | 61 | 113.7 | 327.1 | 57.3 | 80 | 01:51 | 468 +/- 39
Table 2: Metrics for scaffolds >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. Jobs were done on 4 nodes with a total of 32 processor cores.
Sample | K-mer length | N50 (kb) | Max (kb) | Mean (kb) | Count | Time (hour:min) | Max mem. usage per core (MiB)
2x150 | 31 | 70.9 | 227.5 | 41.1 | 112 | 01:06 | 351 +/- 19
2x150 | 51 | 100.6 | 296.6 | 51.0 | 89 | 01:13 | 395 +/- 54
2x150 | 61 | 96.9 | 326.8 | 48.2 | 95 | 01:19 | 408 +/- 58
2x250 | 31 | 108.5 | 327.1 | 53.4 | 85 | 01:56 | 421 +/- 39
2x250 | 51 | 101.6 | 327.1 | 58.2 | 78 | 01:49 | 447 +/- 46
2x250 | 61 | 118.4 | 327.1 | 61.9 | 74 | 01:51 | 468 +/- 39
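The N50 column is the metric most often quoted from these tables. As a reminder of what it measures, here is a minimal sketch assuming the usual definition: the length L such that sequences of length >= L hold at least half of the total assembled length. The function name `n50` is illustrative, and Ray's own implementation may differ in details such as tie handling.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the common definition: walk the lengths from largest to
    smallest and return the first length at which the running sum
    reaches half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one length")


# Toy example: total is 100, and 40 + 30 = 70 is the first sum >= 50.
print(n50([10, 20, 30, 40]))
```

Because N50 depends on the total length, it shifts with the length cutoff applied before computing it, which is why the metrics are reported separately for the >= 100 nt and >= 500 nt sets.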
Assemblies done:
3.1. 2x150 -k 31
3.2. 2x150 -k 51
3.3. 2x150 -k 61
4.1. 2x250 -k 31
4.2. 2x250 -k 51
4.3. 2x250 -k 61
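The six jobs differ only in the sample and the k-mer length. The sketch below rebuilds the six command lines from the pattern used in the Command listings of sections 3 and 4. The helper name `ray_command` and the dictionary layout are illustrative choices, not part of Ray. The date tag in the output directory name is simply the one used for these runs.

```python
# Sample directory -> (R1 file, R2 file), as they appear in the commands
# listed in sections 3 and 4.
SAMPLES = {
    "EdgeBio-MiSeq-E.Coli-DH10B-150x2": (
        "Ecoli350_S1_L001_R1_001.fastq",
        "Ecoli350_S1_L001_R2_001.fastq",
    ),
    "EdgeBio-MiSeq-E.Coli-DH10B-250x2": (
        "Ecoli-650_S1_L001_R1_001.fastq",
        "Ecoli-650_S1_L001_R2_001.fastq",
    ),
}
KMER_LENGTHS = (31, 51, 61)


def ray_command(sample, k, r1, r2, cores=32, date_tag="2012-12-19-1"):
    """Build one mpiexec command line (hypothetical helper)."""
    output = f"{sample}-Ray-{k}-{date_tag}"
    return (
        f"mpiexec -n {cores} Ray -o {output} -k {k} "
        f"-p {sample}/{r1} {sample}/{r2}"
    )


commands = [
    ray_command(sample, k, r1, r2)
    for sample, (r1, r2) in SAMPLES.items()
    for k in KMER_LENGTHS
]

for command in commands:
    print(command)
```

Note that the -k 31 jobs were run with the MAXKMERLENGTH=32 build and the others with the MAXKMERLENGTH=64 build, so the `Ray` on the path has to match the k-mer length of the job.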
1. Software
1.1. de novo assembler: Ray v2.1.1-devel
1.2. Source code:
git://github.com/sebhtml/ray.git b80e067346ae38
git://github.com/sebhtml/RayPlatform.git bca919fb
1.3. Compiler: compilers/gcc/4.7.2
1.4. MPI library and runtime: mpi/openmpi/1.6.3_gcc
1.5. Kernel: Linux 2.6.32.40-clumeq
Build instructions *, **
make MAXKMERLENGTH=32 ASSERT=y HAVE_LIBZ=y HAVE_LIBBZ2=y \
-j 10 \
PREFIX=/software/ray/2.1.1-devel-b80e067346ae38-bca919fb \
CXXFLAGS=" -O3 -std=c++98 -Wall -march=native "
make install
* Compiled with MAXKMERLENGTH=32 for the -k 31 jobs and with MAXKMERLENGTH=64 for the others.
** ASSERT=n may lower the running time.
2. Hardware
2.1. Recommended hardware for bacterial genomes
Processor: anything with a few cores
Total cores: between 16 and 32
Memory per core: at least 256 MB
Storage: storage for input files, genome size * 3 bytes for output files (there are no intermediate files)
2.2. Hardware used for these jobs:
Nodes: 4
Processor: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
Processors per node: 2
Cores per processor: 4
Total cores: 32
Memory per node: 24661852 kB
Hyper-threading: no
Interconnect: Mellanox Technologies MT26428
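To put the per-core memory figures in context, the following sketch estimates how much of each node's memory a job occupied. It takes the "Max. memory usage per core" values reported for the six runs, adds the reported deviation as a rough upper bound, and multiplies by the number of cores per node. Two assumptions are made here: the 32 MPI ranks were spread evenly over the 4 nodes, and "kB" in the node memory figure means 1024 bytes. The helper name is illustrative.

```python
NODES = 4
TOTAL_CORES = 32
NODE_MEMORY_KB = 24661852  # "Memory per node" from section 2.2

# (mean, deviation) of max. memory usage per core in MiB, per run.
USAGE_MIB = {
    ("2x150", 31): (351, 19),
    ("2x150", 51): (395, 54),
    ("2x150", 61): (408, 58),
    ("2x250", 31): (421, 39),
    ("2x250", 51): (447, 46),
    ("2x250", 61): (468, 39),
}


def per_node_usage_mib(mean_mib, deviation_mib, cores_per_node):
    """Rough upper bound: every core on the node at mean + deviation."""
    return (mean_mib + deviation_mib) * cores_per_node


# Assumption: ranks are distributed evenly across the nodes.
cores_per_node = TOTAL_CORES // NODES
# Assumption: "kB" is 1024 bytes, as in /proc/meminfo.
node_memory_mib = NODE_MEMORY_KB / 1024

for (sample, k), (mean, deviation) in USAGE_MIB.items():
    used = per_node_usage_mib(mean, deviation, cores_per_node)
    share = used / node_memory_mib
    print(f"{sample} -k {k}: about {used} MiB per node ({share:.0%} of node memory)")
```

Under these assumptions even the largest run stays well under a quarter of the memory available on each node.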
3. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-150x2
Reads: 26020908
Read length (nucleotides): 150
Insert size (nucleotides, include reads): 281 +/- 30
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.
3.1. -k 31
Metrics:
Contigs >= 100 nt
Number: 155
Total length: 4616834
Average: 29786
N50: 67106
Median: 15855
Largest: 227502
Contigs >= 500 nt
Number: 123
Total length: 4609180
Average: 37473
N50: 67106
Median: 30135
Largest: 227502
Scaffolds >= 100 nt
Number: 143
Total length: 4617737
Average: 32291
N50: 70985
Median: 16455
Largest: 227502
Scaffolds >= 500 nt
Number: 112
Total length: 4610547
Average: 41165
N50: 70985
Median: 31521
Largest: 227502
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1 \
-k 31 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq
Running time:
Network testing: 0 seconds
Counting sequences to assemble: 13 seconds
Sequence loading: 1 minutes, 5 seconds
K-mer counting: 5 minutes, 52 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 8 minutes, 38 seconds
Null edge purging: 45 seconds
Selection of optimal read markers: 11 minutes, 54 seconds
Detection of assembly seeds: 1 minutes, 58 seconds
Estimation of outer distances for paired reads: 55 seconds
Bidirectional extension of seeds: 24 minutes, 25 seconds
Merging of redundant paths: 4 minutes, 55 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 23 seconds
Counting sequences to search: 0 seconds
Graph coloring: 2 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 2 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 6 minutes, 27 seconds
Max. memory usage per core:
351 +/- 19 MiB
3.2. -k 51
Metrics:
Contigs >= 100 nt
Number: 517
Total length: 4598589
Average: 8894
N50: 87569
Median: 110
Largest: 296628
Contigs >= 500 nt
Number: 94
Total length: 4546717
Average: 48369
N50: 95528
Median: 29691
Largest: 296628
Scaffolds >= 100 nt
Number: 512
Total length: 4599494
Average: 8983
N50: 97574
Median: 110
Largest: 296628
Scaffolds >= 500 nt
Number: 89
Total length: 4547622
Average: 51096
N50: 100680
Median: 31489
Largest: 296628
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1 \
-k 51 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq
Running time:
Network testing: 0 seconds
Counting sequences to assemble: 13 seconds
Sequence loading: 1 minutes, 6 seconds
K-mer counting: 6 minutes, 31 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 9 minutes, 44 seconds
Null edge purging: 46 seconds
Selection of optimal read markers: 10 minutes, 53 seconds
Detection of assembly seeds: 1 minutes, 49 seconds
Estimation of outer distances for paired reads: 55 seconds
Bidirectional extension of seeds: 28 minutes, 29 seconds
Merging of redundant paths: 6 minutes, 28 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 52 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 6 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 2 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 13 minutes, 12 seconds
Max. memory usage per core:
395 +/- 54 MiB
3.3. -k 61
Metrics:
Contigs >= 100 nt
Number: 53566
Total length: 10789399
Average: 201
N50: 120
Median: 113
Largest: 326854
Contigs >= 500 nt
Number: 104
Total length: 4580417
Average: 44042
N50: 87572
Median: 28005
Largest: 326854
Scaffolds >= 100 nt
Number: 53556
Total length: 10790842
Average: 201
N50: 120
Median: 113
Largest: 326854
Scaffolds >= 500 nt
Number: 95
Total length: 4582327
Average: 48235
N50: 96902
Median: 29363
Largest: 326854
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1 \
-k 61 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq
Running time:
Network testing: 0 seconds
Counting sequences to assemble: 9 seconds
Sequence loading: 1 minutes, 8 seconds
K-mer counting: 6 minutes, 25 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 9 minutes, 27 seconds
Null edge purging: 44 seconds
Selection of optimal read markers: 10 minutes, 4 seconds
Detection of assembly seeds: 1 minutes, 47 seconds
Estimation of outer distances for paired reads: 1 minutes, 55 seconds
Bidirectional extension of seeds: 32 minutes, 6 seconds
Merging of redundant paths: 6 minutes, 38 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 8 minutes, 53 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 9 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 19 minutes, 46 seconds
Max. memory usage per core:
408 +/- 58 MiB
4. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-250x2
Reads: 18928232
Read length (nucleotides): 250
Insert size (nucleotides, include reads): 490 +/- 74
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.
4.1. -k 31
Metrics:
Contigs >= 100 nt
Number: 158
Total length: 4550832
Average: 28802
N50: 97852
Median: 1660
Largest: 327190
Contigs >= 500 nt
Number: 89
Total length: 4539062
Average: 51000
N50: 97852
Median: 32382
Largest: 327190
Scaffolds >= 100 nt
Number: 154
Total length: 4551116
Average: 29552
N50: 108530
Median: 1234
Largest: 327190
Scaffolds >= 500 nt
Number: 85
Total length: 4539346
Average: 53404
N50: 108530
Median: 32862
Largest: 327190
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1 \
-k 31 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq
Running time:
Network testing: 1 seconds
Counting sequences to assemble: 14 seconds
Sequence loading: 1 minutes, 17 seconds
K-mer counting: 7 minutes, 37 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 11 minutes, 58 seconds
Null edge purging: 1 minutes, 32 seconds
Selection of optimal read markers: 16 minutes, 31 seconds
Detection of assembly seeds: 4 minutes, 33 seconds
Estimation of outer distances for paired reads: 39 seconds
Bidirectional extension of seeds: 1 hours, 1 minutes, 50 seconds
Merging of redundant paths: 4 minutes, 57 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 2 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 56 minutes, 37 seconds
Max. memory usage per core:
421 +/- 39 MiB
4.2. -k 51
Metrics:
Contigs >= 100 nt
Number: 1554
Total length: 4714179
Average: 3033
N50: 88483
Median: 109
Largest: 327166
Contigs >= 500 nt
Number: 82
Total length: 4544573
Average: 55421
N50: 95958
Median: 33541
Largest: 327166
Scaffolds >= 100 nt
Number: 1550
Total length: 4715048
Average: 3041
N50: 98417
Median: 109
Largest: 327166
Scaffolds >= 500 nt
Number: 78
Total length: 4545442
Average: 58274
N50: 101674
Median: 35232
Largest: 327166
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1 \
-k 51 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq
Running time:
Network testing: 0 seconds
Counting sequences to assemble: 12 seconds
Sequence loading: 1 minutes, 14 seconds
K-mer counting: 9 minutes, 10 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 14 minutes, 17 seconds
Null edge purging: 1 minutes, 39 seconds
Selection of optimal read markers: 16 minutes, 21 seconds
Detection of assembly seeds: 4 minutes, 41 seconds
Estimation of outer distances for paired reads: 47 seconds
Bidirectional extension of seeds: 52 minutes, 35 seconds
Merging of redundant paths: 4 minutes, 32 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 3 minutes, 44 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 4 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 49 minutes, 39 seconds
Max. memory usage per core:
447 +/- 46 MiB
4.3. -k 61
Metrics:
Contigs >= 100 nt
Number: 177174
Total length: 24678110
Average: 139
N50: 119
Median: 114
Largest: 327181
Contigs >= 500 nt
Number: 80
Total length: 4585518
Average: 57318
N50: 113734
Median: 33332
Largest: 327181
Scaffolds >= 100 nt
Number: 177168
Total length: 24679683
Average: 139
N50: 119
Median: 114
Largest: 327181
Scaffolds >= 500 nt
Number: 74
Total length: 4587091
Average: 61987
N50: 118481
Median: 35548
Largest: 327181
Command:
mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1 \
-k 61 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq
Running time:
Network testing: 0 seconds
Counting sequences to assemble: 15 seconds
Sequence loading: 1 minutes, 18 seconds
K-mer counting: 9 minutes, 25 seconds
Coverage distribution analysis: 4 seconds
Graph construction: 14 minutes, 34 seconds
Null edge purging: 1 minutes, 37 seconds
Selection of optimal read markers: 15 minutes, 50 seconds
Detection of assembly seeds: 4 minutes, 41 seconds
Estimation of outer distances for paired reads: 1 minutes, 29 seconds
Bidirectional extension of seeds: 44 minutes, 20 seconds
Merging of redundant paths: 10 minutes, 36 seconds
Generation of contigs: 5 seconds
Scaffolding of contigs: 6 minutes, 54 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 17 seconds
Counting sequence biological abundances: 1 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 51 minutes, 41 seconds
Max. memory usage per core:
468 +/- 39 MiB
5. Discussion
This comparison between 2x150 and 2x250 is not fair because the insert size is not the same: 281 +/- 30 for 2x150 and 490 +/- 74 for 2x250. As Chaisson et al. showed in Genome Research in 2009, the insert size alone is sufficient as long as the read length is above a threshold that depends on the organism being studied.
From Chaisson et al. 2009 Genome Research:
"When the read length exceeds a certain threshold, the read-length barrier, the efficiency reaches nearly 100%, so that the read length, indeed, does not matter. For example, for the Escherichia coli genome, the read-length barrier is 35 nt."
Source: Chaisson et al. Genome Res. 2009. 19: 336-346
This elegant paper is already a classic in the de novo genome assembly literature.
Comments
Our team will also improve the quality of service (QoS) of Ray by inspecting the graphs with our upcoming tool, "Ray Cloud Browser", which is already demoed at http://ec2-54-242-197-219.compute-1.amazonaws.com/~sebhtml/Ray-Cloud-Browser/client/
In any case, the datasets from EdgeBio cannot be used to tell whether or not read length matters, because the insert size differs between the two cases.
Adequate datasets should have a varying read length but a constant insert size, and a roughly equal k-mer coverage depth too.
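To check the last condition, the expected k-mer coverage depth can be estimated from the read count, the read length, and k with the standard relation C_k = N * (L - k + 1) / G. The sketch below applies it to the two datasets. It assumes that the "Reads" figures count individual reads rather than pairs, and it uses a round genome size of 4.6 Mb, close to the total assembled lengths reported above. Both are assumptions made for illustration, and the function names are hypothetical.

```python
GENOME_SIZE = 4_600_000  # assumption: approximate size in nucleotides

# name -> (number of reads, read length), from sections 3 and 4
DATASETS = {
    "2x150": (26020908, 150),
    "2x250": (18928232, 250),
}


def base_coverage(reads, read_length, genome_size):
    """Expected nucleotide coverage depth: N * L / G."""
    return reads * read_length / genome_size


def kmer_coverage(reads, read_length, k, genome_size):
    """Expected k-mer coverage depth: N * (L - k + 1) / G.

    Each read of length L contributes L - k + 1 k-mers.
    """
    return reads * (read_length - k + 1) / genome_size


for name, (reads, read_length) in DATASETS.items():
    depth = base_coverage(reads, read_length, GENOME_SIZE)
    print(f"{name}: base coverage about {depth:.0f}x")
    for k in (31, 51, 61):
        k_depth = kmer_coverage(reads, read_length, k, GENOME_SIZE)
        print(f"  k = {k}: k-mer coverage about {k_depth:.0f}x")
```

Under these assumptions the two datasets do not have the same k-mer depth at any of the three k values (at k = 31, roughly 680x for 2x150 versus roughly 905x for 2x250), so coverage is a second confounding variable on top of the insert size.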