Re: Miseq 2x250 – Does Length Really Matter?

Justin Johnson at Edge BioSystems, Inc. posted assemblies done with CLCBio for fresh MiSeq data. Here are some assemblies with Ray for the same data.

Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **

Sample	K-mer length	N50 (kb)	Max (kb)	Mean (kb)	Count	Time (hour:min)	Max mem. usage per core (MiB)
2x150	31	67.1	227.5	37.4	123	01:06	351 +/- 19
2x150	51	95.5	296.6	48.3	94	01:13	395 +/- 54
2x150	61	87.5	326.8	44.0	104	01:19	408 +/- 58
2x250	31	97.8	327.1	51.0	89	01:56	421 +/- 39
2x250	51	95.9	327.1	55.4	82	01:49	447 +/- 46
2x250	61	113.7	327.1	57.3	80	01:51	468 +/- 39

*Jobs were done on 4 nodes with a total of 32 processor cores. **Contigs produced by Ray contain only symbols from {A, C, G, T}.

Table 2: Metrics for scaffolds >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. Jobs were done on 4 nodes with a total of 32 processor cores.

Sample	K-mer length	N50 (kb)	Max (kb)	Mean (kb)	Count	Time (hour:min)	Max mem. usage per core (MiB)
2x150	31	70.9	227.5	41.1	112	01:06	351 +/- 19
2x150	51	100.6	296.6	51.0	89	01:13	395 +/- 54
2x150	61	96.9	326.8	48.2	95	01:19	408 +/- 58
2x250	31	108.5	327.1	53.4	85	01:56	421 +/- 39
2x250	51	101.6	327.1	58.2	78	01:49	447 +/- 46
2x250	61	118.4	327.1	61.9	74	01:51	468 +/- 39

Assemblies done:

3.1. 2x150 -k 31 ~~Idle~~ ~~Running~~ Completed
3.2. 2x150 -k 51 ~~Idle~~ ~~Running~~ Completed
3.3. 2x150 -k 61 ~~Idle~~ ~~Running~~ Completed
4.1. 2x250 -k 31 ~~Idle~~ ~~Running~~ Completed
4.2. 2x250 -k 51 ~~Idle~~ ~~Running~~ Completed
4.3. 2x250 -k 61 ~~Idle~~ ~~Running~~ Completed

1. Software

1.1. de novo assembler: Ray v2.1.1-devel

1.2. Source code:

git://github.com/sebhtml/ray.git b80e067346ae38
git://github.com/sebhtml/RayPlatform.git bca919fb

1.3. Compiler: compilers/gcc/4.7.2
1.4. MPI library and runtime: mpi/openmpi/1.6.3_gcc
1.5. Kernel: Linux 2.6.32.40-clumeq

Build instructions *, **

make MAXKMERLENGTH=32 ASSERT=y HAVE_LIBZ=y HAVE_LIBBZ2=y \
-j 10 \
PREFIX=/software/ray/2.1.1-devel-b80e067346ae38-bca919fb \
CXXFLAGS=" -O3 -std=c++98 -Wall -march=native "

make install

*compiled with MAXKMERLENGTH=32 for -k 31 jobs, with MAXKMERLENGTH=64 for others.
** ASSERT=n may lower the running time

2. Hardware

2.1. Recommended hardware for bacterial genomes

Processor: anything with a few cores
Total cores: between 16 and 32
Memory per core: at least 256 MB
Storage: storage for input files, genome size * 3 bytes for output files (there are no intermediate files)
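
As a rough illustration of these rules of thumb, the following Python sketch turns them into numbers. The function name and the 4.6 Mb genome size are assumptions made for the example (a round figure for E. coli), not values reported by Ray.

```python
def estimate_resources(genome_size_nt, cores, mem_per_core_mb=256):
    """Rough sizing based on the recommendations listed above."""
    # "genome size * 3 bytes" for output files, no intermediate files
    output_bytes = genome_size_nt * 3
    # "at least 256 MB" of memory per core
    total_mem_mb = cores * mem_per_core_mb
    return output_bytes, total_mem_mb


# Assumption: about 4.6 Mb for an E. coli genome, 32 cores as an upper bound.
output_bytes, total_mem_mb = estimate_resources(4_600_000, cores=32)
print(f"Output files: about {output_bytes / 1e6:.1f} MB")
print(f"Minimum total memory: {total_mem_mb} MB across 32 cores")
```

Storage for the input files has to be added on top of this and depends only on the size of the FASTQ files.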

2.2. Hardware used for these jobs:

Nodes: 4
Processor: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
Processors per node: 2
Cores per processor: 4
Total cores: 32
Memory per node: 24661852 kB
Hyper-threading: no
Interconnect: Mellanox Technologies MT26428

3. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-150x2

Download data

Reads: 26020908
Read length (nucleotides): 150
Insert size (nucleotides, includes reads): 281 +/- 30
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.

3.1. -k 31

Metrics:

Contigs >= 100 nt
Number: 155
Total length: 4616834
Average: 29786
N50: 67106
Median: 15855
Largest: 227502
Contigs >= 500 nt
Number: 123
Total length: 4609180
Average: 37473
N50: 67106
Median: 30135
Largest: 227502
Scaffolds >= 100 nt
Number: 143
Total length: 4617737
Average: 32291
N50: 70985
Median: 16455
Largest: 227502
Scaffolds >= 500 nt
Number: 112
Total length: 4610547
Average: 41165
N50: 70985
Median: 31521
Largest: 227502
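
For readers unfamiliar with these metrics, here is a minimal Python sketch of how they can be computed from a list of sequence lengths. It is not Ray's own code: the function name, the integer rounding of the average, and the toy lengths are assumptions made for the example.

```python
from statistics import median


def assembly_metrics(lengths, min_length=500):
    """Summarize sequences of at least min_length nucleotides."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequence passes the length threshold")
    total = sum(kept)
    # N50: length of the sequence at which the running sum, taken from the
    # longest sequence down, first reaches half of the total length.
    running = 0
    n50 = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return {
        "number": len(kept),
        "total": total,
        "average": total // len(kept),  # assumption: truncated to an integer
        "n50": n50,
        "median": median(kept),
        "largest": kept[0],
    }


# Hypothetical toy lengths, not taken from the assemblies in this post.
metrics = assembly_metrics([100, 300, 500, 1000, 2000, 3000, 3500])
print(metrics)
```

Changing min_length from 500 to 100 gives the other threshold used in the listings.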

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1 \
-k 31 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq
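
The same command pattern is used for every job in this post, with only the sample and the k-mer length changing. A small Python sketch that generates the six command lines could look like the following. The helper name ray_command and the SAMPLES dictionary are made up for the example; the options (-n, -o, -k, -p) and the file names are those shown in this post. The script only prints the commands and does not run them.

```python
import shlex

# Sample directory -> FASTQ file prefix, as used in the commands of this post.
SAMPLES = {
    "EdgeBio-MiSeq-E.Coli-DH10B-150x2": "Ecoli350_S1_L001",
    "EdgeBio-MiSeq-E.Coli-DH10B-250x2": "Ecoli-650_S1_L001",
}


def ray_command(sample, prefix, k, cores=32, tag="2012-12-19-1"):
    """Build the argument list for one paired-end Ray job."""
    left = f"{sample}/{prefix}_R1_001.fastq"
    right = f"{sample}/{prefix}_R2_001.fastq"
    return [
        "mpiexec", "-n", str(cores), "Ray",
        "-o", f"{sample}-Ray-{k}-{tag}",
        "-k", str(k),
        "-p", left, right,
    ]


commands = [
    ray_command(sample, prefix, k)
    for sample, prefix in SAMPLES.items()
    for k in (31, 51, 61)
]
for command in commands:
    print(shlex.join(command))
```

Keep in mind that the Ray binary must have been built with a MAXKMERLENGTH large enough for the chosen k (see the build instructions).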

Running time:

Network testing: 0 seconds
Counting sequences to assemble: 13 seconds
Sequence loading: 1 minutes, 5 seconds
K-mer counting: 5 minutes, 52 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 8 minutes, 38 seconds
Null edge purging: 45 seconds
Selection of optimal read markers: 11 minutes, 54 seconds
Detection of assembly seeds: 1 minutes, 58 seconds
Estimation of outer distances for paired reads: 55 seconds
Bidirectional extension of seeds: 24 minutes, 25 seconds
Merging of redundant paths: 4 minutes, 55 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 23 seconds
Counting sequences to search: 0 seconds
Graph coloring: 2 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 2 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 6 minutes, 27 seconds
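
To see where the time goes, the report above can be parsed with a few lines of Python. This is only a sketch: the function names are made up for the example, and it assumes each line has the form "Step name: N hours, N minutes, N seconds" as shown above.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}


def parse_duration(text):
    """Convert e.g. '1 hours, 6 minutes, 27 seconds' to seconds."""
    pairs = re.findall(r"(\d+) (hours|minutes|seconds)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in pairs)


def parse_report(report):
    """Map each step name to its duration in seconds."""
    steps = {}
    for line in report.strip().splitlines():
        name, _, duration = line.partition(": ")
        steps[name] = parse_duration(duration)
    return steps


# Running time report of the 2x150, -k 31 job, copied from above.
REPORT = """
Network testing: 0 seconds
Counting sequences to assemble: 13 seconds
Sequence loading: 1 minutes, 5 seconds
K-mer counting: 5 minutes, 52 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 8 minutes, 38 seconds
Null edge purging: 45 seconds
Selection of optimal read markers: 11 minutes, 54 seconds
Detection of assembly seeds: 1 minutes, 58 seconds
Estimation of outer distances for paired reads: 55 seconds
Bidirectional extension of seeds: 24 minutes, 25 seconds
Merging of redundant paths: 4 minutes, 55 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 23 seconds
Counting sequences to search: 0 seconds
Graph coloring: 2 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 2 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 6 minutes, 27 seconds
"""

steps = parse_report(REPORT)
total = steps.pop("Total")
slowest = max(steps, key=steps.get)
print(f"Sum of steps: {sum(steps.values())} s, reported total: {total} s")
print(f"Slowest step: {slowest} ({100 * steps[slowest] / total:.0f}% of total)")
```

For this job the individual steps add up exactly to the reported total, and the bidirectional extension of seeds is the most expensive step, taking a bit more than a third of the running time.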

Max. memory usage per core:

351 +/- 19 MiB

3.2. -k 51

Metrics:

Contigs >= 100 nt
Number: 517
Total length: 4598589
Average: 8894
N50: 87569
Median: 110
Largest: 296628
Contigs >= 500 nt
Number: 94
Total length: 4546717
Average: 48369
N50: 95528
Median: 29691
Largest: 296628
Scaffolds >= 100 nt
Number: 512
Total length: 4599494
Average: 8983
N50: 97574
Median: 110
Largest: 296628
Scaffolds >= 500 nt
Number: 89
Total length: 4547622
Average: 51096
N50: 100680
Median: 31489
Largest: 296628

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1 \
-k 51 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

Network testing: 0 seconds
Counting sequences to assemble: 13 seconds
Sequence loading: 1 minutes, 6 seconds
K-mer counting: 6 minutes, 31 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 9 minutes, 44 seconds
Null edge purging: 46 seconds
Selection of optimal read markers: 10 minutes, 53 seconds
Detection of assembly seeds: 1 minutes, 49 seconds
Estimation of outer distances for paired reads: 55 seconds
Bidirectional extension of seeds: 28 minutes, 29 seconds
Merging of redundant paths: 6 minutes, 28 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 52 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 6 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 2 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 13 minutes, 12 seconds

Max. memory usage per core:

395 +/- 54 MiB

3.3. -k 61

Metrics:

Contigs >= 100 nt
Number: 53566
Total length: 10789399
Average: 201
N50: 120
Median: 113
Largest: 326854
Contigs >= 500 nt
Number: 104
Total length: 4580417
Average: 44042
N50: 87572
Median: 28005
Largest: 326854
Scaffolds >= 100 nt
Number: 53556
Total length: 10790842
Average: 201
N50: 120
Median: 113
Largest: 326854
Scaffolds >= 500 nt
Number: 95
Total length: 4582327
Average: 48235
N50: 96902
Median: 29363
Largest: 326854

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1 \
-k 61 \
-p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

Network testing: 0 seconds
Counting sequences to assemble: 9 seconds
Sequence loading: 1 minutes, 8 seconds
K-mer counting: 6 minutes, 25 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 9 minutes, 27 seconds
Null edge purging: 44 seconds
Selection of optimal read markers: 10 minutes, 4 seconds
Detection of assembly seeds: 1 minutes, 47 seconds
Estimation of outer distances for paired reads: 1 minutes, 55 seconds
Bidirectional extension of seeds: 32 minutes, 6 seconds
Merging of redundant paths: 6 minutes, 38 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 8 minutes, 53 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 9 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 19 minutes, 46 seconds

Max. memory usage per core:

408 +/- 58 MiB

4. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-250x2

Download data

Reads: 18928232
Read length (nucleotides): 250
Insert size (nucleotides, includes reads): 490 +/- 74
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.

4.1. -k 31

Metrics:

Contigs >= 100 nt
Number: 158
Total length: 4550832
Average: 28802
N50: 97852
Median: 1660
Largest: 327190
Contigs >= 500 nt
Number: 89
Total length: 4539062
Average: 51000
N50: 97852
Median: 32382
Largest: 327190
Scaffolds >= 100 nt
Number: 154
Total length: 4551116
Average: 29552
N50: 108530
Median: 1234
Largest: 327190
Scaffolds >= 500 nt
Number: 85
Total length: 4539346
Average: 53404
N50: 108530
Median: 32862
Largest: 327190

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1 \
-k 31 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

Network testing: 1 seconds
Counting sequences to assemble: 14 seconds
Sequence loading: 1 minutes, 17 seconds
K-mer counting: 7 minutes, 37 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 11 minutes, 58 seconds
Null edge purging: 1 minutes, 32 seconds
Selection of optimal read markers: 16 minutes, 31 seconds
Detection of assembly seeds: 4 minutes, 33 seconds
Estimation of outer distances for paired reads: 39 seconds
Bidirectional extension of seeds: 1 hours, 1 minutes, 50 seconds
Merging of redundant paths: 4 minutes, 57 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 5 minutes, 2 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 56 minutes, 37 seconds

Max. memory usage per core:

421 +/- 39 MiB

4.2. -k 51

Metrics:

Contigs >= 100 nt
Number: 1554
Total length: 4714179
Average: 3033
N50: 88483
Median: 109
Largest: 327166
Contigs >= 500 nt
Number: 82
Total length: 4544573
Average: 55421
N50: 95958
Median: 33541
Largest: 327166
Scaffolds >= 100 nt
Number: 1550
Total length: 4715048
Average: 3041
N50: 98417
Median: 109
Largest: 327166
Scaffolds >= 500 nt
Number: 78
Total length: 4545442
Average: 58274
N50: 101674
Median: 35232
Largest: 327166

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1 \
-k 51 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

Network testing: 0 seconds
Counting sequences to assemble: 12 seconds
Sequence loading: 1 minutes, 14 seconds
K-mer counting: 9 minutes, 10 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 14 minutes, 17 seconds
Null edge purging: 1 minutes, 39 seconds
Selection of optimal read markers: 16 minutes, 21 seconds
Detection of assembly seeds: 4 minutes, 41 seconds
Estimation of outer distances for paired reads: 47 seconds
Bidirectional extension of seeds: 52 minutes, 35 seconds
Merging of redundant paths: 4 minutes, 32 seconds
Generation of contigs: 3 seconds
Scaffolding of contigs: 3 minutes, 44 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 5 seconds
Counting sequence biological abundances: 0 seconds
Loading taxons: 3 seconds
Loading tree: 4 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 49 minutes, 39 seconds

Max. memory usage per core:

447 +/- 46 MiB

4.3. -k 61

Metrics:

Contigs >= 100 nt
Number: 177174
Total length: 24678110
Average: 139
N50: 119
Median: 114
Largest: 327181
Contigs >= 500 nt
Number: 80
Total length: 4585518
Average: 57318
N50: 113734
Median: 33332
Largest: 327181
Scaffolds >= 100 nt
Number: 177168
Total length: 24679683
Average: 139
N50: 119
Median: 114
Largest: 327181
Scaffolds >= 500 nt
Number: 74
Total length: 4587091
Average: 61987
N50: 118481
Median: 35548
Largest: 327181

Command:

mpiexec -n 32 Ray \
-o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1 \
-k 61 \
-p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

Network testing: 0 seconds
Counting sequences to assemble: 15 seconds
Sequence loading: 1 minutes, 18 seconds
K-mer counting: 9 minutes, 25 seconds
Coverage distribution analysis: 4 seconds
Graph construction: 14 minutes, 34 seconds
Null edge purging: 1 minutes, 37 seconds
Selection of optimal read markers: 15 minutes, 50 seconds
Detection of assembly seeds: 4 minutes, 41 seconds
Estimation of outer distances for paired reads: 1 minutes, 29 seconds
Bidirectional extension of seeds: 44 minutes, 20 seconds
Merging of redundant paths: 10 minutes, 36 seconds
Generation of contigs: 5 seconds
Scaffolding of contigs: 6 minutes, 54 seconds
Counting sequences to search: 0 seconds
Graph coloring: 3 seconds
Counting contig biological abundances: 17 seconds
Counting sequence biological abundances: 1 seconds
Loading taxons: 3 seconds
Loading tree: 3 seconds
Processing gene ontologies: 6 seconds
Computing neighbourhoods: 0 seconds
Total: 1 hours, 51 minutes, 41 seconds

Max. memory usage per core:

468 +/- 39 MiB

5. Discussion

This comparison between 2x150 and 2x250 is not fair because the insert size is not the same: 281 +/- 30 for 2x150 and 490 +/- 74 for 2x250. As Chaisson et al. showed in Genome Research in 2009, once the read length is above a threshold that depends on the life form being studied, the insert size is what matters.

From Chaisson et al. 2009 Genome Research:

"When the read length exceeds a certain threshold, the read-length barrier, the efficiency reaches nearly 100%, so that the read length, indeed, does not matter. For example, for the Escherichia coli genome, the read-length barrier is 35 nt."
Source: Chaisson et al. Genome Res. 2009. 19: 336-346

This elegant paper is already a classic in the de novo genome assembly literature.
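
Besides the insert size, the two datasets also differ in sequencing depth. The following Python sketch estimates the base coverage and the k-mer coverage from the read counts and read lengths given in sections 3 and 4. Several things here are assumptions: the genome size of 4.6 Mb is a round figure close to the assembly total lengths, the "Reads" value is treated as the total number of reads, and the k-mer coverage uses the common approximation C * (L - k + 1) / L rather than anything reported by Ray.

```python
def base_coverage(reads, read_length, genome_size):
    """Average number of reads covering each genome position."""
    return reads * read_length / genome_size


def kmer_coverage(coverage, read_length, k):
    """Approximate k-mer coverage: each read of length L yields L - k + 1 k-mers."""
    return coverage * (read_length - k + 1) / read_length


GENOME_SIZE = 4_600_000  # assumption: round figure for E. coli
DATASETS = {
    "2x150": (26_020_908, 150),
    "2x250": (18_928_232, 250),
}

for name, (reads, length) in DATASETS.items():
    depth = base_coverage(reads, length, GENOME_SIZE)
    print(f"{name}: base coverage about {depth:.0f}x")
    for k in (31, 51, 61):
        print(f"  k={k}: k-mer coverage about {kmer_coverage(depth, length, k):.0f}x")
```

Under these assumptions the 2x250 dataset is deeper than the 2x150 dataset, so read length is not the only variable that changes between the two sets of assemblies.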

Comments

Torsten Seemann said…

Ultimately, comparing simple metrics like N50 etc is only meaningful if the correctness of the assemblies is equal.

Friday, December 21, 2012 at 4:10:00 PM EST

sebhtml said…

I agree with what you said. For a single bacterial genome from multiple cells (like the samples above), the error rate of Ray is almost null.

Furthermore, our team will further improve the quality of service (QoS) of Ray by inspecting the graphs with our upcoming tool called "Ray Cloud Browser" demoed already at http://ec2-54-242-197-219.compute-1.amazonaws.com/~sebhtml/Ray-Cloud-Browser/client/

But anyway, the datasets from EdgeBio cannot be used to tell whether or not read length matters because the insert size is different between the two cases.

Adequate datasets should have a varying read length, but a constant insert size, and a roughly equal k-mer coverage depth too.

Monday, January 7, 2013 at 8:01:00 PM EST
