Re: Miseq 2x250 – Does Length Really Matter?

Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            67.1      227.5     37.4       123    01:06            351 +/- 19
2x150   51            95.5      296.6     48.3       94     01:13            395 +/- 54
2x150   61            87.5      326.8     44.0       104    01:19            408 +/- 58
2x250   31            97.8      327.1     51.0       89     01:56            421 +/- 39
2x250   51            95.9      327.1     55.4       82     01:49            447 +/- 46
2x250   61            113.7     327.1     57.3       80     01:51            468 +/- 39

* Jobs were done on 4 nodes with a total of 32 processor cores.
** Contigs produced by Ray contain only symbols from {A, C, G, T}.



Table 2: Metrics for scaffolds >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. Jobs were done on 4 nodes with a total of 32 processor cores.

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            70.9      227.5     41.1       112    01:06            351 +/- 19
2x150   51            100.6     296.6     51.0       89     01:13            395 +/- 54
2x150   61            96.9      326.8     48.2       95     01:19            408 +/- 58
2x250   31            108.5     327.1     53.4       85     01:56            421 +/- 39
2x250   51            101.6     327.1     58.2       78     01:49            447 +/- 46
2x250   61            118.4     327.1     61.9       74     01:51            468 +/- 39


Assemblies done:

3.1. 2x150 -k 31: Completed
3.2. 2x150 -k 51: Completed
3.3. 2x150 -k 61: Completed
4.1. 2x250 -k 31: Completed
4.2. 2x250 -k 51: Completed
4.3. 2x250 -k 61: Completed

3. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-150x2

Download data

Reads: 26020908
Read length (nucleotides): 150
Insert size (nucleotides, including reads): 281 +/- 30
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.

3.1. -k 31


Metrics:

Contigs >= 100 nt
 Number: 155
 Total length: 4616834
 Average: 29786
 N50: 67106
 Median: 15855
 Largest: 227502
Contigs >= 500 nt
 Number: 123
 Total length: 4609180
 Average: 37473
 N50: 67106
 Median: 30135
 Largest: 227502
Scaffolds >= 100 nt
 Number: 143
 Total length: 4617737
 Average: 32291
 N50: 70985
 Median: 16455
 Largest: 227502
Scaffolds >= 500 nt
 Number: 112
 Total length: 4610547
 Average: 41165
 N50: 70985
 Median: 31521
 Largest: 227502
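
For readers who want to recompute this kind of summary from a list of contig or scaffold lengths, here is a minimal Python sketch using the usual definition of N50. It is not Ray's own code; the function name `assembly_metrics` and the toy lengths are made up for illustration. It also shows why the ">= 100 nt" and ">= 500 nt" rows can differ: the length cutoff changes which sequences are counted.

```python
def assembly_metrics(lengths, min_length=500):
    """Summarise sequence lengths (in nt) that are at least min_length long."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return None
    total = sum(kept)
    # N50: walking from the longest sequence downwards, the length at which
    # the running sum first reaches half of the total length.
    running = 0
    n50 = 0
    for n in kept:
        running += n
        if 2 * running >= total:
            n50 = n
            break
    return {
        "number": len(kept),
        "total_length": total,
        "average": total // len(kept),  # integer average (an assumption)
        "n50": n50,
        "largest": kept[0],
    }


# Toy lengths, NOT taken from the assemblies in this post.
toy = [120, 110, 800, 600, 400, 2000, 1000]
print(assembly_metrics(toy, min_length=100))
print(assembly_metrics(toy, min_length=500))
```

With real data, `lengths` would be read from the assembler's contig or scaffold FASTA output.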


Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1 \
 -k 31 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq
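
All six runs in this post use the same command and differ only in the dataset and the value of -k. As a rough sketch, the argument lists can be generated in Python as below. The helper `ray_command`, the `DATASETS` table and the `run_tag` parameter are illustrative names of mine; only the flags that appear in the commands of this post (-n, -o, -k, -p) are used.

```python
# FASTQ file prefixes as they appear in the commands of this post.
DATASETS = {
    "150x2": "Ecoli350_S1_L001",
    "250x2": "Ecoli-650_S1_L001",
}


def ray_command(dataset, k, cores=32, run_tag="2012-12-19-1"):
    """Build the mpiexec/Ray argument list for one dataset and one k (hypothetical helper)."""
    sample = f"EdgeBio-MiSeq-E.Coli-DH10B-{dataset}"
    prefix = f"{sample}/{DATASETS[dataset]}"
    return [
        "mpiexec", "-n", str(cores), "Ray",
        "-o", f"{sample}-Ray-{k}-{run_tag}",
        "-k", str(k),
        "-p", f"{prefix}_R1_001.fastq", f"{prefix}_R2_001.fastq",
    ]


# Print the six command lines; pass a list to subprocess.run to actually launch one.
for dataset in DATASETS:
    for k in (31, 51, 61):
        print(" ".join(ray_command(dataset, k)))
```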

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 13 seconds
 Sequence loading: 1 minutes, 5 seconds
 K-mer counting: 5 minutes, 52 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 8 minutes, 38 seconds
 Null edge purging: 45 seconds
 Selection of optimal read markers: 11 minutes, 54 seconds
 Detection of assembly seeds: 1 minutes, 58 seconds
 Estimation of outer distances for paired reads: 55 seconds
 Bidirectional extension of seeds: 24 minutes, 25 seconds
 Merging of redundant paths: 4 minutes, 55 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 23 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 2 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 2 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 6 minutes, 27 seconds
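
The per-step timings can be checked against the reported total with a few lines of Python. This is only a sketch: `parse_duration` and `check_total` are hypothetical helpers, and the parsing assumes the "N hours, N minutes, N seconds" wording shown in the report above.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}


def parse_duration(text):
    """Convert e.g. '1 hours, 6 minutes, 27 seconds' to a number of seconds."""
    pairs = re.findall(r"(\d+) (hours|minutes|seconds)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in pairs)


def check_total(report):
    """Return (sum of the step durations, reported total), both in seconds."""
    steps = 0
    total = None
    for line in report.strip().splitlines():
        name, _, duration = line.strip().rpartition(": ")
        if name == "Total":
            total = parse_duration(duration)
        else:
            steps += parse_duration(duration)
    return steps, total


# Timing report of run 3.1 (2x150, -k 31), copied from this post.
REPORT = """
 Network testing: 0 seconds
 Counting sequences to assemble: 13 seconds
 Sequence loading: 1 minutes, 5 seconds
 K-mer counting: 5 minutes, 52 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 8 minutes, 38 seconds
 Null edge purging: 45 seconds
 Selection of optimal read markers: 11 minutes, 54 seconds
 Detection of assembly seeds: 1 minutes, 58 seconds
 Estimation of outer distances for paired reads: 55 seconds
 Bidirectional extension of seeds: 24 minutes, 25 seconds
 Merging of redundant paths: 4 minutes, 55 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 23 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 2 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 2 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 6 minutes, 27 seconds
"""

steps, total = check_total(REPORT)
print(steps, total)  # both are 3987 seconds, i.e. 1 h 6 min 27 s
```

The same helper makes it easy to see which step dominates: in this run, the bidirectional extension of seeds accounts for 1465 of the 3987 seconds.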


Max. memory usage per core:

351 +/- 19 MiB

3.2. -k 51


Metrics:

Contigs >= 100 nt
 Number: 517
 Total length: 4598589
 Average: 8894
 N50: 87569
 Median: 110
 Largest: 296628
Contigs >= 500 nt
 Number: 94
 Total length: 4546717
 Average: 48369
 N50: 95528
 Median: 29691
 Largest: 296628
Scaffolds >= 100 nt
 Number: 512
 Total length: 4599494
 Average: 8983
 N50: 97574
 Median: 110
 Largest: 296628
Scaffolds >= 500 nt
 Number: 89
 Total length: 4547622
 Average: 51096
 N50: 100680
 Median: 31489
 Largest: 296628

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1 \
 -k 51 \
 -p  EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 13 seconds
 Sequence loading: 1 minutes, 6 seconds
 K-mer counting: 6 minutes, 31 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 9 minutes, 44 seconds
 Null edge purging: 46 seconds
 Selection of optimal read markers: 10 minutes, 53 seconds
 Detection of assembly seeds: 1 minutes, 49 seconds
 Estimation of outer distances for paired reads: 55 seconds
 Bidirectional extension of seeds: 28 minutes, 29 seconds
 Merging of redundant paths: 6 minutes, 28 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 52 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 6 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 2 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 13 minutes, 12 seconds

Max. memory usage per core:

395 +/- 54 MiB

3.3. -k 61

Metrics:

Contigs >= 100 nt
 Number: 53566
 Total length: 10789399
 Average: 201
 N50: 120
 Median: 113
 Largest: 326854
Contigs >= 500 nt
 Number: 104
 Total length: 4580417
 Average: 44042
 N50: 87572
 Median: 28005
 Largest: 326854
Scaffolds >= 100 nt
 Number: 53556
 Total length: 10790842
 Average: 201
 N50: 120
 Median: 113
 Largest: 326854
Scaffolds >= 500 nt
 Number: 95
 Total length: 4582327
 Average: 48235
 N50: 96902
 Median: 29363
 Largest: 326854

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1 \
 -k 61 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 9 seconds
 Sequence loading: 1 minutes, 8 seconds
 K-mer counting: 6 minutes, 25 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 9 minutes, 27 seconds
 Null edge purging: 44 seconds
 Selection of optimal read markers: 10 minutes, 4 seconds
 Detection of assembly seeds: 1 minutes, 47 seconds
 Estimation of outer distances for paired reads: 1 minutes, 55 seconds
 Bidirectional extension of seeds: 32 minutes, 6 seconds
 Merging of redundant paths: 6 minutes, 38 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 8 minutes, 53 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 9 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 19 minutes, 46 seconds

Max. memory usage per core:

408 +/- 58 MiB

 

4. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-250x2

Download data

Reads: 18928232
Read length (nucleotides): 250
Insert size (nucleotides, including reads): 490 +/- 74
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.


4.1. -k 31


Metrics:

Contigs >= 100 nt
 Number: 158
 Total length: 4550832
 Average: 28802
 N50: 97852
 Median: 1660
 Largest: 327190
Contigs >= 500 nt
 Number: 89
 Total length: 4539062
 Average: 51000
 N50: 97852
 Median: 32382
 Largest: 327190
Scaffolds >= 100 nt
 Number: 154
 Total length: 4551116
 Average: 29552
 N50: 108530
 Median: 1234
 Largest: 327190
Scaffolds >= 500 nt
 Number: 85
 Total length: 4539346
 Average: 53404
 N50: 108530
 Median: 32862
 Largest: 327190

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1 \
 -k 31 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 1 seconds
 Counting sequences to assemble: 14 seconds
 Sequence loading: 1 minutes, 17 seconds
 K-mer counting: 7 minutes, 37 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 11 minutes, 58 seconds
 Null edge purging: 1 minutes, 32 seconds
 Selection of optimal read markers: 16 minutes, 31 seconds
 Detection of assembly seeds: 4 minutes, 33 seconds
 Estimation of outer distances for paired reads: 39 seconds
 Bidirectional extension of seeds: 1 hours, 1 minutes, 50 seconds
 Merging of redundant paths: 4 minutes, 57 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 2 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 56 minutes, 37 seconds


Max. memory usage per core:

421 +/- 39 MiB

 

4.2. -k 51


Metrics:

Contigs >= 100 nt
 Number: 1554
 Total length: 4714179
 Average: 3033
 N50: 88483
 Median: 109
 Largest: 327166
Contigs >= 500 nt
 Number: 82
 Total length: 4544573
 Average: 55421
 N50: 95958
 Median: 33541
 Largest: 327166
Scaffolds >= 100 nt
 Number: 1550
 Total length: 4715048
 Average: 3041
 N50: 98417
 Median: 109
 Largest: 327166
Scaffolds >= 500 nt
 Number: 78
 Total length: 4545442
 Average: 58274
 N50: 101674
 Median: 35232
 Largest: 327166

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1 \
 -k 51 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 12 seconds
 Sequence loading: 1 minutes, 14 seconds
 K-mer counting: 9 minutes, 10 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 14 minutes, 17 seconds
 Null edge purging: 1 minutes, 39 seconds
 Selection of optimal read markers: 16 minutes, 21 seconds
 Detection of assembly seeds: 4 minutes, 41 seconds
 Estimation of outer distances for paired reads: 47 seconds
 Bidirectional extension of seeds: 52 minutes, 35 seconds
 Merging of redundant paths: 4 minutes, 32 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 3 minutes, 44 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 4 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 49 minutes, 39 seconds

Max. memory usage per core:

447 +/- 46 MiB

4.3. -k 61


Metrics:

Contigs >= 100 nt
 Number: 177174
 Total length: 24678110
 Average: 139
 N50: 119
 Median: 114
 Largest: 327181
Contigs >= 500 nt
 Number: 80
 Total length: 4585518
 Average: 57318
 N50: 113734
 Median: 33332
 Largest: 327181
Scaffolds >= 100 nt
 Number: 177168
 Total length: 24679683
 Average: 139
 N50: 119
 Median: 114
 Largest: 327181
Scaffolds >= 500 nt
 Number: 74
 Total length: 4587091
 Average: 61987
 N50: 118481
 Median: 35548
 Largest: 327181

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1 \
 -k 61 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 15 seconds
 Sequence loading: 1 minutes, 18 seconds
 K-mer counting: 9 minutes, 25 seconds
 Coverage distribution analysis: 4 seconds
 Graph construction: 14 minutes, 34 seconds
 Null edge purging: 1 minutes, 37 seconds
 Selection of optimal read markers: 15 minutes, 50 seconds
 Detection of assembly seeds: 4 minutes, 41 seconds
 Estimation of outer distances for paired reads: 1 minutes, 29 seconds
 Bidirectional extension of seeds: 44 minutes, 20 seconds
 Merging of redundant paths: 10 minutes, 36 seconds
 Generation of contigs: 5 seconds
 Scaffolding of contigs: 6 minutes, 54 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 17 seconds
 Counting sequence biological abundances: 1 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 51 minutes, 41 seconds

Max. memory usage per core:

468 +/- 39 MiB

5. Discussion


This comparison between 2x150 and 2x250 is not fair because the insert size is not the same: 281 +/- 30 for 2x150 and 490 +/- 74 for 2x250. As Chaisson et al. showed in Genome Research in 2009, the insert size alone is sufficient as long as the read length is above a threshold that depends on the life form being studied.

From Chaisson et al. 2009 Genome Research:

"When the read length exceeds a certain threshold, the read-length barrier, the efficiency reaches nearly 100%, so that the read length, indeed, does not matter. For example, for the Escherichia coli genome, the read-length barrier is 35 nt."
Source: Chaisson et al. 2009. Genome Res. 19: 336-346

This elegant paper is already a classic in the de novo genome assembly literature.


Comments

Torsten Seemann said…
Ultimately, comparing simple metrics like N50 etc is only meaningful if the correctness of the assemblies is equal.
sebhtml said…
I agree with what you said. For a single bacterial genome from multiple cells (like the samples above), the error rate of Ray is close to zero.

Furthermore, our team will further improve the quality of service (QoS) of Ray by inspecting the graphs with our upcoming tool called "Ray Cloud Browser" demoed already at http://ec2-54-242-197-219.compute-1.amazonaws.com/~sebhtml/Ray-Cloud-Browser/client/

But anyway, the datasets from EdgeBio cannot be used to tell whether or not read length matters, because the insert size differs between the two cases.

Adequate datasets should have a varying read length, but a constant insert size, and a roughly equal k-mer coverage depth too.
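
To make the "roughly equal k-mer coverage depth" condition concrete, here is a small Python sketch of the standard relation between base coverage and k-mer coverage: a read of length L contains L - k + 1 k-mers. The genome size of 4.6 Mb is an assumption (roughly the total assembled length reported above), the read counts are assumed to be individual reads rather than pairs, and the result is a nominal depth that ignores sequencing errors and any trimming.

```python
def base_coverage(reads, read_length, genome_size):
    """Nominal base coverage depth: sequenced bases divided by genome size."""
    return reads * read_length / genome_size


def kmer_coverage(base_cov, read_length, k):
    """Nominal k-mer coverage depth: each read of length L yields L - k + 1 k-mers."""
    return base_cov * (read_length - k + 1) / read_length


GENOME_SIZE = 4_600_000  # assumption: about the total assembled length above

# (number of reads, read length) as listed for each dataset in this post
datasets = {
    "2x150": (26020908, 150),
    "2x250": (18928232, 250),
}

for name, (reads, length) in datasets.items():
    cov = base_coverage(reads, length, GENOME_SIZE)
    for k in (31, 51, 61):
        print(f"{name} k={k}: base {cov:.0f}x, k-mer {kmer_coverage(cov, length, k):.0f}x")
```

A fair read-length comparison would subsample the reads so that these k-mer depths are about the same in both datasets, while keeping the insert size fixed.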
