2012-12-27

The source code of SOAPdenovo2 sits in the shadows

Update 2012-12-28: SOAPdenovo 2.04-r223 source code was posted online on 2012-12-28. Some minor concerns:

1. The tarball suffers from the bundled library problem (libbam.a and libbammac.a are shipped precompiled). This is incompatible with the GPLv3 license.

2. The Makefile is not portable because it hard-codes compiler paths (/opt/blc/gcc-4.5.0/bin/gcc should just be gcc).


The SOAPdenovo2 article was published today under the terms of the Creative Commons Attribution License. Here are the Availability and requirements for SOAPdenovo2 as reported in the article:

Availability and requirements

Project name: SOAPdenovo2
Project home page: http://soapdenovo2.sourceforge.net/
Operating system(s): e.g. Platform independent
Programming language: C, C++
Other requirements: GCC version ≥ 4.5.0
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Contact: bgi-soap@googlegroups.com




I fired up my browser and in a flash went to the "Project home page". I downloaded the latest SOAPdenovo2 distribution and was disappointed to see that it was binary-only. This contradicts the GNU General Public License, version 3, the license under which SOAPdenovo2 is distributed according to the article above.

Binary-only distribution is the path to the dark side (proprietary)

The SOAPdenovo2 distribution contains 4 pre-compiled binaries and 2 plain-text files. No source files were found, which is confusing because the GNU General Public License, version 3, is for distributing open source software with an emphasis on freedom.

Table 1: Files distributed in the tarball called SOAPdenovo2_revision217.tgz
File                           Type
MANUAL                         ASCII text, with very long lines
update.log                     ASCII text
pregraph_sparse_127mer.v1.0.3  ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped
pregraph_sparse_63mer.v1.0.3   ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped
SOAPdenovo-127mer              ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped
SOAPdenovo-63mer               ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped

Concerns of a bioinformatics adventurer

Concern #1: ELF 64-bit LSB executables for GNU/Linux 2.6.9 are not platform-independent. Therefore, the claim of platform independence is false. For instance, I cannot run these executables on OpenBSD.

Concern #2: GCC version ≥ 4.5.0 is not required because this is a proprietary, binary-only distribution with nothing to compile. Therefore, this requirement is untrue.

Concern #3: proprietary software distributions are not eligible for licensing under the GNU General Public License version 3. Therefore, the authors should select their own proprietary license or release the source code of SOAPdenovo2. The last publicly available version before this one was SOAPdenovo v1.05.


What did the reviewers have to say?


Reviewer #AJN was concerned here and here by the lack of source code.

Freedoms provided by free software

According to the Free Software Foundation, Inc.:

A program is free software if the program's users have the four essential freedoms:
  • The freedom to run the program, for any purpose (freedom 0).
  • The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  • The freedom to redistribute copies so you can help your neighbor (freedom 2).
  • The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this. 


2012-12-19

Re: Miseq 2x250 – Does Length Really Matter?







Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            67.1      227.5     37.4       123    01:06            351 +/- 19
2x150   51            95.5      296.6     48.3       94     01:13            395 +/- 54
2x150   61            87.5      326.8     44.0       104    01:19            408 +/- 58
2x250   31            97.8      327.1     51.0       89     01:56            421 +/- 39
2x250   51            95.9      327.1     55.4       82     01:49            447 +/- 46
2x250   61            113.7     327.1     57.3       80     01:51            468 +/- 39

*Jobs were done on 4 nodes with a total of 32 processor cores.
**Contigs produced by Ray contain only symbols from {A, C, G, T}.



Table 2: Metrics for scaffolds >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. Jobs were done on 4 nodes with a total of 32 processor cores.

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            70.9      227.5     41.1       112    01:06            351 +/- 19
2x150   51            100.6     296.6     51.0       89     01:13            395 +/- 54
2x150   61            96.9      326.8     48.2       95     01:19            408 +/- 58
2x250   31            108.5     327.1     53.4       85     01:56            421 +/- 39
2x250   51            101.6     327.1     58.2       78     01:49            447 +/- 46
2x250   61            118.4     327.1     61.9       74     01:51            468 +/- 39


Assemblies done:

3.1. 2x150 -k 31: completed
3.2. 2x150 -k 51: completed
3.3. 2x150 -k 61: completed
4.1. 2x250 -k 31: completed
4.2. 2x250 -k 51: completed
4.3. 2x250 -k 61: completed

3. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-150x2

Download data

Reads: 26020908
Read length (nucleotides): 150
Insert size (nucleotides, including reads): 281 +/- 30
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.

3.1. -k 31


Metrics:

Contigs >= 100 nt
 Number: 155
 Total length: 4616834
 Average: 29786
 N50: 67106
 Median: 15855
 Largest: 227502
Contigs >= 500 nt
 Number: 123
 Total length: 4609180
 Average: 37473
 N50: 67106
 Median: 30135
 Largest: 227502
Scaffolds >= 100 nt
 Number: 143
 Total length: 4617737
 Average: 32291
 N50: 70985
 Median: 16455
 Largest: 227502
Scaffolds >= 500 nt
 Number: 112
 Total length: 4610547
 Average: 41165
 N50: 70985
 Median: 31521
 Largest: 227502
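
For readers new to these metrics: N50 is the contig length at which the contigs of that length or longer hold at least half of all assembled bases. The Python sketch below only illustrates that definition. The function name n50 and the toy lengths are made up for the example and are not taken from Ray.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig is required")


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Because the longest contigs are counted first, a handful of very short contigs barely moves the N50, whereas the average and the median are pulled down sharply.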


Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1 \
 -k 31 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 13 seconds
 Sequence loading: 1 minutes, 5 seconds
 K-mer counting: 5 minutes, 52 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 8 minutes, 38 seconds
 Null edge purging: 45 seconds
 Selection of optimal read markers: 11 minutes, 54 seconds
 Detection of assembly seeds: 1 minutes, 58 seconds
 Estimation of outer distances for paired reads: 55 seconds
 Bidirectional extension of seeds: 24 minutes, 25 seconds
 Merging of redundant paths: 4 minutes, 55 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 23 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 2 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 2 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 6 minutes, 27 seconds
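
To compare steps across runs, the durations printed above can be converted to seconds. The helper below is a small sketch written for this post; the name to_seconds is hypothetical, and it assumes the "N hours, N minutes, N seconds" wording used in these listings.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}


def to_seconds(duration):
    """Convert text such as '1 hours, 6 minutes, 27 seconds' to seconds."""
    parts = re.findall(r"(\d+) (hours|minutes|seconds)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


extension = to_seconds("24 minutes, 25 seconds")
total = to_seconds("1 hours, 6 minutes, 27 seconds")
print(f"{100 * extension / total:.1f}% of the run")  # 36.7% of the run
```

For this run, the bidirectional extension of seeds is the single most expensive step.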


Max. memory usage per core:

351 +/- 19 MiB

3.2. -k 51


Metrics:

Contigs >= 100 nt
 Number: 517
 Total length: 4598589
 Average: 8894
 N50: 87569
 Median: 110
 Largest: 296628
Contigs >= 500 nt
 Number: 94
 Total length: 4546717
 Average: 48369
 N50: 95528
 Median: 29691
 Largest: 296628
Scaffolds >= 100 nt
 Number: 512
 Total length: 4599494
 Average: 8983
 N50: 97574
 Median: 110
 Largest: 296628
Scaffolds >= 500 nt
 Number: 89
 Total length: 4547622
 Average: 51096
 N50: 100680
 Median: 31489
 Largest: 296628

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1 \
 -k 51 \
 -p  EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 13 seconds
 Sequence loading: 1 minutes, 6 seconds
 K-mer counting: 6 minutes, 31 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 9 minutes, 44 seconds
 Null edge purging: 46 seconds
 Selection of optimal read markers: 10 minutes, 53 seconds
 Detection of assembly seeds: 1 minutes, 49 seconds
 Estimation of outer distances for paired reads: 55 seconds
 Bidirectional extension of seeds: 28 minutes, 29 seconds
 Merging of redundant paths: 6 minutes, 28 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 52 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 6 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 2 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 13 minutes, 12 seconds

Max. memory usage per core:

395 +/- 54 MiB

3.3. -k 61

Metrics:

Contigs >= 100 nt
 Number: 53566
 Total length: 10789399
 Average: 201
 N50: 120
 Median: 113
 Largest: 326854
Contigs >= 500 nt
 Number: 104
 Total length: 4580417
 Average: 44042
 N50: 87572
 Median: 28005
 Largest: 326854
Scaffolds >= 100 nt
 Number: 53556
 Total length: 10790842
 Average: 201
 N50: 120
 Median: 113
 Largest: 326854
Scaffolds >= 500 nt
 Number: 95
 Total length: 4582327
 Average: 48235
 N50: 96902
 Median: 29363
 Largest: 326854

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1 \
 -k 61 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-150x2/Ecoli350_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 9 seconds
 Sequence loading: 1 minutes, 8 seconds
 K-mer counting: 6 minutes, 25 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 9 minutes, 27 seconds
 Null edge purging: 44 seconds
 Selection of optimal read markers: 10 minutes, 4 seconds
 Detection of assembly seeds: 1 minutes, 47 seconds
 Estimation of outer distances for paired reads: 1 minutes, 55 seconds
 Bidirectional extension of seeds: 32 minutes, 6 seconds
 Merging of redundant paths: 6 minutes, 38 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 8 minutes, 53 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 9 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 19 minutes, 46 seconds

Max. memory usage per core:

408 +/- 58 MiB

 

4. Assemblies for EdgeBio-MiSeq-E.Coli-DH10B-250x2

Download data

Reads: 18928232
Read length (nucleotides): 250
Insert size (nucleotides, including reads): 490 +/- 74
Technology: Illumina(R) MiSeq(R), Illumina, Inc.
Service provider: Next Generation Sequencing Services, Edge BioSystems, Inc.


4.1. -k 31


Metrics:

Contigs >= 100 nt
 Number: 158
 Total length: 4550832
 Average: 28802
 N50: 97852
 Median: 1660
 Largest: 327190
Contigs >= 500 nt
 Number: 89
 Total length: 4539062
 Average: 51000
 N50: 97852
 Median: 32382
 Largest: 327190
Scaffolds >= 100 nt
 Number: 154
 Total length: 4551116
 Average: 29552
 N50: 108530
 Median: 1234
 Largest: 327190
Scaffolds >= 500 nt
 Number: 85
 Total length: 4539346
 Average: 53404
 N50: 108530
 Median: 32862
 Largest: 327190

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1 \
 -k 31 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 1 seconds
 Counting sequences to assemble: 14 seconds
 Sequence loading: 1 minutes, 17 seconds
 K-mer counting: 7 minutes, 37 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 11 minutes, 58 seconds
 Null edge purging: 1 minutes, 32 seconds
 Selection of optimal read markers: 16 minutes, 31 seconds
 Detection of assembly seeds: 4 minutes, 33 seconds
 Estimation of outer distances for paired reads: 39 seconds
 Bidirectional extension of seeds: 1 hours, 1 minutes, 50 seconds
 Merging of redundant paths: 4 minutes, 57 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 5 minutes, 2 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 56 minutes, 37 seconds


Max. memory usage per core:

421 +/- 39 MiB

 

4.2. -k 51


Metrics:

Contigs >= 100 nt
 Number: 1554
 Total length: 4714179
 Average: 3033
 N50: 88483
 Median: 109
 Largest: 327166
Contigs >= 500 nt
 Number: 82
 Total length: 4544573
 Average: 55421
 N50: 95958
 Median: 33541
 Largest: 327166
Scaffolds >= 100 nt
 Number: 1550
 Total length: 4715048
 Average: 3041
 N50: 98417
 Median: 109
 Largest: 327166
Scaffolds >= 500 nt
 Number: 78
 Total length: 4545442
 Average: 58274
 N50: 101674
 Median: 35232
 Largest: 327166

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1 \
 -k 51 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 12 seconds
 Sequence loading: 1 minutes, 14 seconds
 K-mer counting: 9 minutes, 10 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 14 minutes, 17 seconds
 Null edge purging: 1 minutes, 39 seconds
 Selection of optimal read markers: 16 minutes, 21 seconds
 Detection of assembly seeds: 4 minutes, 41 seconds
 Estimation of outer distances for paired reads: 47 seconds
 Bidirectional extension of seeds: 52 minutes, 35 seconds
 Merging of redundant paths: 4 minutes, 32 seconds
 Generation of contigs: 3 seconds
 Scaffolding of contigs: 3 minutes, 44 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 5 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 3 seconds
 Loading tree: 4 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 49 minutes, 39 seconds

Max. memory usage per core:

447 +/- 46 MiB

4.3. -k 61


Metrics:

Contigs >= 100 nt
 Number: 177174
 Total length: 24678110
 Average: 139
 N50: 119
 Median: 114
 Largest: 327181
Contigs >= 500 nt
 Number: 80
 Total length: 4585518
 Average: 57318
 N50: 113734
 Median: 33332
 Largest: 327181
Scaffolds >= 100 nt
 Number: 177168
 Total length: 24679683
 Average: 139
 N50: 119
 Median: 114
 Largest: 327181
Scaffolds >= 500 nt
 Number: 74
 Total length: 4587091
 Average: 61987
 N50: 118481
 Median: 35548
 Largest: 327181

Command:

mpiexec -n 32 Ray \
 -o EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1 \
 -k 61 \
 -p EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R1_001.fastq \
 EdgeBio-MiSeq-E.Coli-DH10B-250x2/Ecoli-650_S1_L001_R2_001.fastq

Running time:

 Network testing: 0 seconds
 Counting sequences to assemble: 15 seconds
 Sequence loading: 1 minutes, 18 seconds
 K-mer counting: 9 minutes, 25 seconds
 Coverage distribution analysis: 4 seconds
 Graph construction: 14 minutes, 34 seconds
 Null edge purging: 1 minutes, 37 seconds
 Selection of optimal read markers: 15 minutes, 50 seconds
 Detection of assembly seeds: 4 minutes, 41 seconds
 Estimation of outer distances for paired reads: 1 minutes, 29 seconds
 Bidirectional extension of seeds: 44 minutes, 20 seconds
 Merging of redundant paths: 10 minutes, 36 seconds
 Generation of contigs: 5 seconds
 Scaffolding of contigs: 6 minutes, 54 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 3 seconds
 Counting contig biological abundances: 17 seconds
 Counting sequence biological abundances: 1 seconds
 Loading taxons: 3 seconds
 Loading tree: 3 seconds
 Processing gene ontologies: 6 seconds
 Computing neighbourhoods: 0 seconds
 Total: 1 hours, 51 minutes, 41 seconds

Max. memory usage per core:

468 +/- 39 MiB

5. Discussion


This comparison between 2x150 and 2x250 is not fair because the insert sizes differ: 281 +/- 30 for 2x150 and 490 +/- 74 for 2x250. As Chaisson et al. showed in 2009 in Genome Research, the insert size alone is sufficient as long as the read length is above a threshold that depends on the life form being studied.

From Chaisson et al. 2009 Genome Research:

"When the read length exceeds a certain threshold, the read-length barrier, the efficiency reaches nearly 100%, so that the read length, indeed, does not matter. For example, for the Escherichia coli genome, the read-length barrier is 35 nt."
Source: Chaisson et al. Genome Res. 2009. 19: 336-346

This elegant paper is already a classic in the de novo genome assembly literature.


Plant genome (white spruce), Illumina HiSeq 2000, IBM Blue Gene/Q, and Ray

The SRA056234 dataset contains reads for Picea glauca (white spruce). The reads were obtained with the Illumina HiSeq 2000. It's 2.8 TiB of uncompressed fastq files.

$ du -sh blocks/
2.8T    blocks/

We are using Ray on an IBM Blue Gene/Q. In particular, we are using 512 nodes with 1 IBM PowerPC A2 processor and 16 GiB of DDR3 memory per node. Each processor has 16 cores, and each core has 4 threads.

Because of the memory limit per core, we are only using 16 MPI ranks per node for the time being. Therefore, we are using 16384 Ray processes, each with a maximum of 1 GiB of memory, on the Blue Gene/Q. We have 16 TiB of distributed memory.

First, we can list the Ray plugins we are using.

$ ls -1 SRA056234-Picea-glauca-2012-12-18-4/Plugins/
plugin_Amos.txt
plugin_CoverageGatherer.txt
plugin_DummySun.txt
plugin_EdgePurger.txt
plugin_FusionData.txt
plugin_FusionTaskCreator.txt
plugin_GeneOntology.txt
plugin_GenomeNeighbourhood.txt
plugin_JoinerTaskCreator.txt
plugin_KmerAcademyBuilder.txt
plugin_Library.txt
plugin_MachineHelper.txt
plugin_MessageProcessor.txt
plugin_Mock.txt
plugin_NetworkTest.txt
plugin_Partitioner.txt
plugin_PhylogenyViewer.txt
plugin_Scaffolder.txt
plugin_Searcher.txt
plugin_SeedExtender.txt
plugin_SeedingData.txt
plugin_SequencesIndexer.txt
plugin_SequencesLoader.txt
plugin_SwitchMan.txt
plugin_VerticesExtractor.txt


The file partition has a nice layout. On the Blue Gene/Q, I needed to split my files into 4288 fastq files with a maximum of 2000000 sequences each because I/O operations are offloaded to I/O drawers. Aside from that, Ray works as is.

$ head SRA056234-Picea-glauca-2012-12-18-4/FilePartition.txt
#File   Name    FirstSequence   LastSequence    NumberOfSequences
0       blocks/SRR525188_1-block-0.fastq        0       1999999 2000000
1       blocks/SRR525188_2-block-0.fastq        2000000 3999999 2000000
2       blocks/SRR525188_1-block-1.fastq        4000000 5999999 2000000
3       blocks/SRR525188_2-block-1.fastq        6000000 7999999 2000000
4       blocks/SRR525188_1-block-10.fastq       8000000 9999999 2000000
5       blocks/SRR525188_2-block-10.fastq       10000000        11999999        2000000
6       blocks/SRR525188_1-block-11.fastq       12000000        13999999        2000000
7       blocks/SRR525188_2-block-11.fastq       14000000        15999999        2000000
8       blocks/SRR525188_1-block-12.fastq       16000000        17999999        2000000
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/FilePartition.txt
4278    blocks/SRR525214_1-block-98.fastq       8498092212      8500092211      2000000
4279    blocks/SRR525214_2-block-98.fastq       8500092212      8502092211      2000000
4280    blocks/SRR525214_1-block-99.fastq       8502092212      8504092211      2000000
4281    blocks/SRR525214_2-block-99.fastq       8504092212      8506092211      2000000
4282    blocks/SRR525215_1-block-0.fastq        8506092212      8508092211      2000000
4283    blocks/SRR525215_2-block-0.fastq        8508092212      8510092211      2000000
4284    blocks/SRR525215_1-block-1.fastq        8510092212      8512092211      2000000
4285    blocks/SRR525215_2-block-1.fastq        8512092212      8514092211      2000000
4286    blocks/SRR525215_1-block-2.fastq        8514092212      8515006210      913999
4287    blocks/SRR525215_2-block-2.fastq        8515006211      8515920209      913999
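
The splitting step mentioned above can be sketched in Python. This is not the script that produced the blocks listed here; it is a minimal illustration that assumes every fastq record spans exactly 4 lines and that reuses the "-block-N.fastq" naming seen in the listing. The function name split_fastq and its parameters are hypothetical.

```python
from pathlib import Path


def split_fastq(path, records_per_block=2_000_000, out_dir="blocks"):
    """Split a fastq file into blocks of at most records_per_block records.

    Assumes 4 lines per record (header, sequence, separator, qualities).
    Returns the list of block files that were written.
    """
    source = Path(path)
    destination = Path(out_dir)
    destination.mkdir(parents=True, exist_ok=True)
    lines_per_block = 4 * records_per_block
    written = []
    block_handle = None
    with source.open() as handle:
        for line_number, line in enumerate(handle):
            # Open a new block file every lines_per_block lines.
            if line_number % lines_per_block == 0:
                if block_handle is not None:
                    block_handle.close()
                target = destination / f"{source.stem}-block-{len(written)}.fastq"
                block_handle = target.open("w")
                written.append(target)
            block_handle.write(line)
    if block_handle is not None:
        block_handle.close()
    return written
```

Calling split_fastq("SRR525188_1.fastq") would write blocks/SRR525188_1-block-0.fastq, blocks/SRR525188_1-block-1.fastq, and so on. The two files of a pair must be split with the same block size so that mates stay in matching blocks.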


The 8515920210 input sequences are uniformly distributed onto 16384 MPI ranks. There are 519770 sequences per MPI rank.


$ head SRA056234-Picea-glauca-2012-12-18-4/SequencePartition.txt
#Rank   FirstSequence   LastSequence    NumberOfSequences
0       0       519769  519770
1       519770  1039539 519770
2       1039540 1559309 519770
3       1559310 2079079 519770
4       2079080 2598849 519770
5       2598850 3118619 519770
6       3118620 3638389 519770
7       3638390 4158159 519770
8       4158160 4677929 519770
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/SequencePartition.txt
16374   8510713980      8511233749      519770
16375   8511233750      8511753519      519770
16376   8511753520      8512273289      519770
16377   8512273290      8512793059      519770
16378   8512793060      8513312829      519770
16379   8513312830      8513832599      519770
16380   8513832600      8514352369      519770
16381   8514352370      8514872139      519770
16382   8514872140      8515391909      519770
16383   8515391910      8515920209      528300
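
The numbers in SequencePartition.txt are consistent with a simple scheme: every rank gets the integer quotient, and the last rank also takes the remainder. The Python sketch below reproduces the listing under that assumption. Whether Ray computes the ranges exactly this way is not confirmed here, and the function name sequence_range is made up for the example.

```python
def sequence_range(rank, ranks, total):
    """Return (first, last, count) for one rank.

    Every rank receives total // ranks sequences; the last rank also
    receives whatever is left over.
    """
    per_rank = total // ranks
    first = rank * per_rank
    count = per_rank if rank < ranks - 1 else total - first
    return first, first + count - 1, count


print(sequence_range(0, 16384, 8515920210))      # (0, 519769, 519770)
print(sequence_range(16383, 16384, 8515920210))  # (8515391910, 8515920209, 528300)
```

The remainder is only 8530 sequences here, so the last rank carries less than 2% more sequences than the others.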


Finally, the 42831057656 k-mers are distributed uniformly on 16384 MPI ranks, with around 2614200 k-mers per MPI rank.

$ head SRA056234-Picea-glauca-2012-12-18-4/GraphPartition.txt   
#Rank   NumberOfKmers   IdealNumberOfKmers      Difference      RelativeDifference
#TotalKmers: 42831057656
#Ranks: 16384
#IdealNumberOfKmers: 2614200
0       2611430 2614200 -2770   -0.10596%
1       2612276 2614200 -1924   -0.073598%
2       2613476 2614200 -724    -0.0276949%
3       2611618 2614200 -2582   -0.0987683%
4       2616320 2614200 2120    0.0810956%
5       2615454 2614200 1254    0.0479688%
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/GraphPartition.txt
16374   2612820 2614200 -1380   -0.0527886%
16375   2615682 2614200 1482    0.0566904%
16376   2610428 2614200 -3772   -0.144289%
16377   2617236 2614200 3036    0.116135%
16378   2615978 2614200 1778    0.0680132%
16379   2614000 2614200 -200    -0.00765052%
16380   2619520 2614200 5320    0.203504%
16381   2614508 2614200 308     0.0117818%
16382   2614830 2614200 630     0.0240992%
16383   2614080 2614200 -120    -0.00459031%


The RelativeDifference column indicates really good automated load balancing.
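
The RelativeDifference column can be recomputed from the other columns. The sketch below assumes that IdealNumberOfKmers is the integer quotient of the total by the number of ranks, which matches the header of GraphPartition.txt; the function name relative_difference is hypothetical.

```python
def relative_difference(kmers, total_kmers, ranks):
    """Return the deviation from the ideal share of k-mers, in percent."""
    ideal = total_kmers // ranks
    return 100.0 * (kmers - ideal) / ideal


# Rank 0 in the listing above holds 2611430 k-mers.
print(f"{relative_difference(2611430, 42831057656, 16384):.5f}%")  # -0.10596%
```

Among the ranks shown, the largest deviation is rank 16380 with about 0.2%, so no rank holds noticeably more of the graph than the others.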

Ray is currently running the slave mode RAY_SLAVE_MODE_INDEX_SEQUENCES.

2012-12-12

Provisioning secure cloud instances locally

Hello,

I decided to deploy Ray Cloud Browser on a local virtual machine.

This guide is for installing and using libvirt on Fedora 17.

The guest instance in this tutorial is OpenBSD 5.2 as it is very lightweight and secure.


To use NAT with a bridge that has no interface, virsh net-start must be run as root.

Connecting to the host (Fedora 17)


First we connect to ls31 (Fedora 17). ls31 is the host on which we will run domain 0. In our case, ls31 has 64 VCPU (AMD Opteron(TM) Processor 6272) and 128 GiB of memory.

 $ ssh -lboiseb01 192.168.3.31 -X

Installing packages

Then, we need to install virtualization packages.

 $ sudo su
 # yum install -y @virtualization

We will put disks in /kvm/img and cdroms in /kvm/iso.

 # mkdir -p /kvm/{iso,img}

Installing the guest (OpenBSD 5.2)


Then we install the virtual machine with 1024 MiB of memory, an 8 GiB disk, and graphics.
We'll remove the graphics once ssh is working.

 # virt-install --name test-1 --ram 1024 --disk path=/kvm/img/test-1.img,size=8 --cdrom /kvm/iso/OpenBSD-5.2.iso --graphics vnc

Now we can check that it is running

 # virsh list
  Id Name                 State
 ----------------------------------
   1 test-1               running

Now, we start the viewer. In the listing above, 1 is the domain number of the guest (domain 0 is the host); the :0 below is the VNC display, not the domain number.
We can do this as a normal user.

 $ vncviewer :0


Inside the guest, we basically just install OpenBSD 5.2. I prefer to use only 1 mount point for the whole file system.


Installing vim, the editor


Once this is over, reboot the guest, and install vim with this (in OpenBSD 5.2):

 # export PKG_PATH=http://ftp.openbsd.org/pub/OpenBSD/5.2/packages/`machine -a`/
 # pkg_add vim


Setup keys for the guest


Now, we create a key pair on the client (can be the host).

 $ ssh-keygen -t rsa -f openbsd-vm.pem -P ""

Then, we add the content of openbsd-vm.pem.pub to ~/.ssh/authorized_keys in the guest.

 vim ~/.ssh/authorized_keys

Finally, we change the OpenBSD passwords to something silly and use only the key to connect to the guest (this will work after adding NAT with iptables):

 ssh -i openbsd-vm.pem  seb@192.168.3.31 -p 23422


To stop a virtual machine, the best way is to run halt or shutdown inside the guest; otherwise, force it off from the host with:

 virsh destroy 1

To start a virtual machine, use

 virsh start test-1

Network


By default, the guest can use the network of the host (domain 0), and a bridge is configured for the other direction too. However, that bridge is only reachable from the host itself.

Enable IP forwarding on the host (domain 0):

 # sysctl -w  net.ipv4.ip_forward=1

If the virtual machine runs on a virsh network, then its IPv4 address is NAT'ed by default, so a specific port of the host has to be forwarded to the guest.

Route packets from 192.168.3.31:23422 to 192.168.122.234:22:

 # iptables -t nat -A PREROUTING -p tcp -d 192.168.3.31 --dport 23422 -j DNAT --to 192.168.122.234:22

Redirect local traffic as well, so that we can also connect to the guest from 192.168.3.31 itself:

 # iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1 --dport 23422 -j DNAT --to 192.168.122.234:22
 # iptables -t nat -A OUTPUT -p tcp -d 192.168.3.31 --dport 23422 -j DNAT --to 192.168.122.234:22


Accept forwarded packets to 192.168.122.234:22:

 # iptables -A FORWARD -p tcp -d 192.168.122.234 --dport 22 -j ACCEPT
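
With more than one guest, the same four rules have to be repeated with other addresses and ports. The Python sketch below only builds the command strings used above from four parameters and does not run anything. The function name port_forward_rules is made up for this example.

```python
def port_forward_rules(host_ip, host_port, guest_ip, guest_port):
    """Build the iptables commands that forward host_ip:host_port to the guest."""
    target = f"{guest_ip}:{guest_port}"
    rules = [
        # Traffic that arrives from other machines.
        f"iptables -t nat -A PREROUTING -p tcp -d {host_ip} "
        f"--dport {host_port} -j DNAT --to {target}",
    ]
    # Traffic that originates on the host itself.
    for local_ip in ("127.0.0.1", host_ip):
        rules.append(
            f"iptables -t nat -A OUTPUT -p tcp -d {local_ip} "
            f"--dport {host_port} -j DNAT --to {target}"
        )
    # Let the translated packets through the FORWARD chain.
    rules.append(
        f"iptables -A FORWARD -p tcp -d {guest_ip} --dport {guest_port} -j ACCEPT"
    )
    return rules


for rule in port_forward_rules("192.168.3.31", 23422, "192.168.122.234", 22):
    print(rule)
```

Printing the commands instead of executing them leaves room to review them before running them as root.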



Voilà, we just configured an OpenBSD 5.2 guest with libvirt, and configured routes so that 192.168.3.31:23422 goes to 192.168.122.234:22.

To connect to the guest instance:


$ ssh -i openbsd-vm.pem  seb@192.168.3.31 -p 23422

References



* https://fedoraproject.org/wiki/Getting_started_with_virtualization?rd=Virtualization_Quick_Start
* http://wiki.libvirt.org/page/Networking
* http://www.hackorama.com/network/portfwd.shtml


2012-12-04

Showcasing a pre-alpha version of Ray Cloud Browser

Dear genomic enthusiasts,

Explaining the de Bruijn graph -- or the de novo assembly process for that matter -- to people can be a daunting task. All biologists have a web browser ready to fire up at any time. Furthermore, all modern browsers support HTML5 -- a way of making nice portable user interfaces.

Ray Cloud Browser is a visualizer for genomic data. But unlike classical genome browsers, Ray Cloud Browser is dynamic, and you can move things with energy if you want to.

The current version is really pre-alpha, but the hardest-to-implement core features are there. The client is in JavaScript (ECMAScript). The web server is Apache httpd, but any web server will do. The server-side application code is in C++ 1998, and runs in Apache httpd using the standard CGI 1.1 (Common Gateway Interface).

The stateful HTML5 client provides the graph layout engine, the physics engine, the rendering engine, the active-object engine, and the communication engine. All the client code was written from scratch for efficiency. No external JavaScript library is required.

The stateless server is in C++ 1998. It receives HTTP GET queries from clients in the form of:

/cgi-bin/RayCloudBrowser.webServer.cgi?tag=RAY_MESSAGE_TAG_GET_KMER_FROM_STORE&object=TCGTCTTCGTCTCGGCCATCGGCGTGACGCT&depth=512

RayCloudBrowser.webServer.cgi is the single executable that needs to be deployed on the web server. The query string (QUERY_STRING in CGI 1.1, the part of the URL after the ?) is what is provided to the executable RayCloudBrowser.webServer.cgi.

A mandatory parameter is the tag parameter, which is a message tag. Given the message tag, the program will do something in particular.

For instance,

http://ec2-54-242-197-219.compute-1.amazonaws.com/cgi-bin/RayCloudBrowser.webServer.cgi?tag=RAY_MESSAGE_TAG_GET_KMER_FROM_STORE&object=TCGTCTTCGTCTCGGCCATCGGCGTGACGCT&depth=1

gives you information about the requested object.
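
Such a query can be assembled programmatically. The Python sketch below uses only the parameter names visible in the URL above (tag, object, depth); the function name kmer_query_url is hypothetical, and the sketch builds the URL without contacting the server.

```python
from urllib.parse import urlencode


def kmer_query_url(host, kmer, depth=1):
    """Build the URL of a RAY_MESSAGE_TAG_GET_KMER_FROM_STORE query."""
    query = urlencode({
        "tag": "RAY_MESSAGE_TAG_GET_KMER_FROM_STORE",
        "object": kmer,
        "depth": depth,
    })
    return f"http://{host}/cgi-bin/RayCloudBrowser.webServer.cgi?{query}"


print(kmer_query_url("ec2-54-242-197-219.compute-1.amazonaws.com",
                     "TCGTCTTCGTCTCGGCCATCGGCGTGACGCT"))
```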

A readahead technology is also implemented and can be enabled by increasing the depth parameter.

For example:

http://ec2-54-242-197-219.compute-1.amazonaws.com/cgi-bin/RayCloudBrowser.webServer.cgi?tag=RAY_MESSAGE_TAG_GET_KMER_FROM_STORE&object=TCGTCTTCGTCTCGGCCATCGGCGTGACGCT&depth=512

gives all objects at a depth of at most 512 from the origin, which is the object provided with the object parameter. The maximum depth is 4096, which bounds the cost of a single query and helps against denial of service.
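
The real server is written in C++ and its code is not shown in this post. Purely as an illustration of the mandatory tag and the depth cap, here is a Python sketch of how a query string could be validated. The function name parse_query, the default depth of 1, and the choice to clamp rather than reject oversized depths are all assumptions.

```python
from urllib.parse import parse_qs

MAXIMUM_DEPTH = 4096  # cap stated in the text


def parse_query(query_string):
    """Return (tag, object, depth) from a CGI query string.

    The tag is mandatory; the depth is clamped to [0, MAXIMUM_DEPTH].
    """
    fields = parse_qs(query_string)
    if "tag" not in fields:
        raise ValueError("the tag parameter is mandatory")
    # Assumed default: a depth of 1 when the parameter is absent.
    depth = int(fields.get("depth", ["1"])[0])
    depth = max(0, min(depth, MAXIMUM_DEPTH))
    return fields["tag"][0], fields.get("object", [""])[0], depth


print(parse_query("tag=RAY_MESSAGE_TAG_GET_KMER_FROM_STORE&object=ACGT&depth=100000"))
```

Because the server is stateless, every request has to carry everything needed to answer it, which is why the validation happens on each call.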

Every communication between the client and the server is done in JSON (JavaScript Object Notation). At any moment, the number of active communication pipes between the client and the server is capped; the default maximum is 8.


Now, more about the really sophisticated production server in Amazon EC2 (a free micro instance):

Physical memory (RAM):

[root@ip-10-194-103-146 cgi-bin]# head -n1 /proc/meminfo
MemTotal:         608740 kB

No swap partition:

[root@ip-10-194-103-146 cgi-bin]# cat /proc/swaps
Filename                Type        Size    Used    Priority


The full specification from the cloud provider (Amazon Web Services, LLC) for a micro instance is:

613 MiB memory
Up to 2 EC2 Compute Units (for short periodic bursts)
EBS storage only
32-bit or 64-bit platform
I/O Performance: Low
EBS-Optimized Available: No
API name: t1.micro


The web server needs two files in cgi-bin: the executable and, at the moment, one data file:

[root@ip-10-194-103-146 cgi-bin]# file *
Database.dat:                  data
RayCloudBrowser.webServer.cgi: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped


[root@ip-10-194-103-146 cgi-bin]# ls -lh
total 2.6G
-rw-r--r-- 1 sebhtml sebhtml 2.6G Dec  2 06:48 Database.dat
-rwxr-xr-x 1 root    root     31K Dec  3 19:21 RayCloudBrowser.webServer.cgi


With some magic, the 613-MiB instance can serve queries using a 2.6-GiB data file.
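
The post does not say what the magic is. One common way to query a file larger than physical memory is to memory-map it and let the operating system page in only the parts that are touched. The Python sketch below shows that general technique on a hypothetical file of sorted 8-byte keys. It is an assumption for illustration and does not describe the actual format of Database.dat.

```python
import mmap

RECORD_SIZE = 8  # hypothetical fixed-width record: one 8-byte key


def contains(path, key):
    """Binary search for an 8-byte key in a file of sorted fixed-width records.

    Only the pages touched by the search are read from disk.
    """
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            low, high = 0, len(view) // RECORD_SIZE
            while low < high:
                middle = (low + high) // 2
                record = view[middle * RECORD_SIZE:(middle + 1) * RECORD_SIZE]
                if record == key:
                    return True
                if record < key:
                    low = middle + 1
                else:
                    high = middle
            return False
```

A lookup in a multi-gigabyte file of this kind touches only a few dozen pages, which is why such a scheme can work on a small instance without swap.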

2012-12-01

Optimizing genomic alignment workloads

In some projects, there are a lot of files. Each of these files contains DNA sequences, and the number of sequences often differs from one file to the next. The workflow is to start several jobs, each of which aligns the sequences of a subset of the files.
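
When the files differ in size, how they are grouped decides how evenly the jobs finish. One common heuristic, not necessarily the one used in this post, is to take the files from largest to smallest and always give the next file to the job with the fewest sequences so far. A Python sketch follows; the function name balance is made up for the example.

```python
import heapq


def balance(sequence_counts, jobs):
    """Greedily assign files to jobs, largest file first.

    sequence_counts maps a file name to its number of sequences.
    Returns a list of (total_sequences, job_index, file_names), one per job.
    """
    heap = [(0, job, []) for job in range(jobs)]
    heapq.heapify(heap)
    by_size = sorted(sequence_counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in by_size:
        # The heap root is always the least loaded job.
        total, job, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (total + count, job, names))
    return sorted(heap, key=lambda entry: entry[1])
```

With paired files, the two files of a pair should be treated as one item so that both mates go to the same job.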

As an example, see the table below that contains files for the public human sample SRA000271.

Table 1: Number of sequences per file for sample SRA000271.


File Sequences
SRR002271_1.fastq.bz2 22243273
SRR002271_2.fastq.bz2 22243273
SRR002272_1.fastq.bz2 35756808
SRR002272_2.fastq.bz2 35756808
SRR002273_1.fastq.bz2 4276214
SRR002273_2.fastq.bz2 4276214
SRR002274_1.fastq.bz2 18095255
SRR002274_2.fastq.bz2 18095255
SRR002275_1.fastq.bz2 33729638
SRR002275_2.fastq.bz2 33729638
SRR002276_1.fastq.bz2 47074312
SRR002276_2.fastq.bz2 47074312
SRR002277_1.fastq.bz2 6757955
SRR002277_2.fastq.bz2 6757955
SRR002278_1.fastq.bz2 6093595
SRR002278_2.fastq.bz2 6093595
SRR002279_1.fastq.bz2 7177292
SRR002279_2.fastq.bz2 7177292
SRR002280_1.fastq.bz2 6580048
SRR002280_2.fastq.bz2 6580048
SRR002281_1.fastq.bz2 16693687
SRR002281_2.fastq.bz2 16693687
SRR002282_1.fastq.bz2 13383178
SRR002282_2.fastq.bz2 13383178
SRR002283_1.fastq.bz2 18374441
SRR002283_2.fastq.bz2 18374441
SRR002284_1.fastq.bz2 5600070
SRR002284_2.fastq.bz2 5600070
SRR002285_1.fastq.bz2 6286076
SRR002285_2.fastq.bz2 6286076
SRR002286_1.fastq.bz2 5709652
SRR002286_2.fastq.bz2 5709652
SRR002287_1.fastq.bz2 6309083
SRR002287_2.fastq.bz2 6309083
SRR002288_1.fastq.bz2 6006869
SRR002288_2.fastq.bz2 6006869
SRR002289_1.fastq.bz2 4776565
SRR002289_2.fastq.bz2 4776565
SRR002290_1.fastq.bz2 13044591
SRR002290_2.fastq.bz2 13044591
SRR002291_1.fastq.bz2 34898264
SRR002291_2.fastq.bz2 34898264
SRR002292_1.fastq.bz2 22854308
SRR002292_2.fastq.bz2 22854308
SRR002293_1.fastq.bz2 18042487
SRR002293_2.fastq.bz2 18042487
SRR002294_1.fastq.bz2 8109548
SRR002294_2.fastq.bz2 8109548
SRR002295_1.fastq.bz2 30697316
SRR002295_2.fastq.bz2 30697316
SRR002296_1.fastq.bz2 6101023
SRR002296_2.fastq.bz2 6101023
SRR002297_1.fastq.bz2 46467908
SRR002297_2.fastq.bz2 46467908
SRR002298_1.fastq.bz2 10848356
SRR002298_2.fastq.bz2 10848356
SRR002299_1.fastq.bz2 6078655
SRR002299_2.fastq.bz2 6078655
SRR002300_1.fastq.bz2 6483460
SRR002300_2.fastq.bz2 6483460
SRR002301_1.fastq.bz2 7260908
SRR002301_2.fastq.bz2 7260908
SRR002302_1.fastq.bz2 6283624
SRR002302_2.fastq.bz2 6283624
SRR002303_1.fastq.bz2 6092616
SRR002303_2.fastq.bz2 6092616
SRR002304_1.fastq.bz2 8669549
SRR002304_2.fastq.bz2 8669549
SRR002305_1.fastq.bz2 7079980
SRR002305_2.fastq.bz2 7079980
SRR002306_1.fastq.bz2 6061012
SRR002306_2.fastq.bz2 6061012
SRR002307_1.fastq.bz2 7990437
SRR002307_2.fastq.bz2 7990437
SRR002308_1.fastq.bz2 6403325
SRR002308_2.fastq.bz2 6403325
SRR002309_1.fastq.bz2 5929366
SRR002309_2.fastq.bz2 5929366
SRR002310_1.fastq.bz2 9745625
SRR002310_2.fastq.bz2 9745625
SRR002311_1.fastq.bz2 6113076
SRR002311_2.fastq.bz2 6113076
SRR002312_1.fastq.bz2 9808666
SRR002312_2.fastq.bz2 9808666
SRR002313_1.fastq.bz2 7841529
SRR002313_2.fastq.bz2 7841529
SRR002314_1.fastq.bz2 4334295
SRR002314_2.fastq.bz2 4334295
SRR002315_1.fastq.bz2 4857885
SRR002315_2.fastq.bz2 4857885
SRR002316_1.fastq.bz2 14025787
SRR002316_2.fastq.bz2 14025787
SRR002317_1.fastq.bz2 9732073
SRR002317_2.fastq.bz2 9732073
SRR002318_1.fastq.bz2 1670064
SRR002318_2.fastq.bz2 1670064
SRR002319_1.fastq.bz2 9264776
SRR002319_2.fastq.bz2 9264776
SRR003810_1.fastq.bz2 4605877
SRR003810_2.fastq.bz2 4605877
SRR003812_1.fastq.bz2 4891691
SRR003812_2.fastq.bz2 4891691
SRR003813_1.fastq.bz2 4908529
SRR003813_2.fastq.bz2 4908529
SRR003814_1.fastq.bz2 3958657
SRR003814_2.fastq.bz2 3958657
SRR003815_1.fastq.bz2 4780683
SRR003815_2.fastq.bz2 4780683
SRR003816_1.fastq.bz2 4673240
SRR003816_2.fastq.bz2 4673240
SRR003817_1.fastq.bz2 1226466
SRR003817_2.fastq.bz2 1226466
SRR003818_1.fastq.bz2 6450471
SRR003818_2.fastq.bz2 6450471
SRR003819_1.fastq.bz2 6593471
SRR003819_2.fastq.bz2 6593471
SRR003820_1.fastq.bz2 6553586
SRR003820_2.fastq.bz2 6553586
SRR003821_1.fastq.bz2 6328227
SRR003821_2.fastq.bz2 6328227
SRR003823_1.fastq.bz2 6247413
SRR003823_2.fastq.bz2 6247413
SRR003824_1.fastq.bz2 6025237
SRR003824_2.fastq.bz2 6025237
SRR003825_1.fastq.bz2 1561734
SRR003825_2.fastq.bz2 1561734
SRR003837_1.fastq.bz2 7512805
SRR003837_2.fastq.bz2 7512805
SRR003838_1.fastq.bz2 7152063
SRR003838_2.fastq.bz2 7152063
SRR003839_1.fastq.bz2 7162665
SRR003839_2.fastq.bz2 7162665
SRR003840_1.fastq.bz2 7316431
SRR003840_2.fastq.bz2 7316431
SRR003841_1.fastq.bz2 1978964
SRR003841_2.fastq.bz2 1978964
SRR003845_1.fastq.bz2 7425426
SRR003845_2.fastq.bz2 7425426
SRR003846_1.fastq.bz2 6496063
SRR003846_2.fastq.bz2 6496063
SRR003847_1.fastq.bz2 7303975
SRR003847_2.fastq.bz2 7303975
SRR003848_1.fastq.bz2 7322831
SRR003848_2.fastq.bz2 7322831
SRR003849_1.fastq.bz2 7353518
SRR003849_2.fastq.bz2 7353518
SRR003850_1.fastq.bz2 1956095
SRR003850_2.fastq.bz2 1956095
SRR003851_1.fastq.bz2 6349952
SRR003851_2.fastq.bz2 6349952
SRR003852_1.fastq.bz2 6750815
SRR003852_2.fastq.bz2 6750815
SRR003853_1.fastq.bz2 6676849
SRR003853_2.fastq.bz2 6676849
SRR003854_1.fastq.bz2 6582580
SRR003854_2.fastq.bz2 6582580
SRR003855_1.fastq.bz2 6764442
SRR003855_2.fastq.bz2 6764442
SRR003856_1.fastq.bz2 6721795
SRR003856_2.fastq.bz2 6721795
SRR003857_1.fastq.bz2 1637242
SRR003857_2.fastq.bz2 1637242
SRR003859_1.fastq.bz2 3741843
SRR003859_2.fastq.bz2 3741843
SRR003860_1.fastq.bz2 3946844
SRR003860_2.fastq.bz2 3946844
SRR003861_1.fastq.bz2 4062736
SRR003861_2.fastq.bz2 4062736
SRR003863_1.fastq.bz2 6555145
SRR003863_2.fastq.bz2 6555145
SRR003864_1.fastq.bz2 6419174
SRR003864_2.fastq.bz2 6419174
SRR003866_1.fastq.bz2 6498466
SRR003866_2.fastq.bz2 6498466
SRR003867_1.fastq.bz2 6459933
SRR003867_2.fastq.bz2 6459933
SRR003868_1.fastq.bz2 7433254
SRR003868_2.fastq.bz2 7433254
SRR003869_1.fastq.bz2 7308992
SRR003869_2.fastq.bz2 7308992
SRR003870_1.fastq.bz2 7557219
SRR003870_2.fastq.bz2 7557219
SRR003871_1.fastq.bz2 7443963
SRR003871_2.fastq.bz2 7443963
SRR003872_1.fastq.bz2 7499185
SRR003872_2.fastq.bz2 7499185
SRR003873_1.fastq.bz2 6323645
SRR003873_2.fastq.bz2 6323645
SRR003874_1.fastq.bz2 6145944
SRR003874_2.fastq.bz2 6145944
SRR003875_1.fastq.bz2 6348097
SRR003875_2.fastq.bz2 6348097
SRR003876_1.fastq.bz2 6296738
SRR003876_2.fastq.bz2 6296738
SRR003877_1.fastq.bz2 6450647
SRR003877_2.fastq.bz2 6450647
SRR003878_1.fastq.bz2 6543352
SRR003878_2.fastq.bz2 6543352
SRR003879_1.fastq.bz2 6418736
SRR003879_2.fastq.bz2 6418736
SRR003960_1.fastq.bz2 9722922
SRR003960_2.fastq.bz2 9722922
SRR003961_1.fastq.bz2 8995012
SRR003961_2.fastq.bz2 8995012
SRR003962_1.fastq.bz2 9249927
SRR003962_2.fastq.bz2 9249927
SRR003963_1.fastq.bz2 9185577
SRR003963_2.fastq.bz2 9185577
SRR003964_1.fastq.bz2 9484360
SRR003964_2.fastq.bz2 9484360
SRR003965_1.fastq.bz2 8959911
SRR003965_2.fastq.bz2 8959911
SRR003966_1.fastq.bz2 5851368
SRR003966_2.fastq.bz2 5851368
SRR003967_1.fastq.bz2 5236932
SRR003967_2.fastq.bz2 5236932
SRR003968_1.fastq.bz2 6170713
SRR003968_2.fastq.bz2 6170713
SRR003969_1.fastq.bz2 6276516
SRR003969_2.fastq.bz2 6276516
SRR003970_1.fastq.bz2 5765690
SRR003970_2.fastq.bz2 5765690
SRR003971_1.fastq.bz2 1615149
SRR003971_2.fastq.bz2 1615149
SRR004105_1.fastq.bz2 6815628
SRR004105_2.fastq.bz2 6815628
SRR004106_1.fastq.bz2 6857870
SRR004106_2.fastq.bz2 6857870
SRR004107_1.fastq.bz2 6961705
SRR004107_2.fastq.bz2 6961705
SRR004108_1.fastq.bz2 7019609
SRR004108_2.fastq.bz2 7019609
SRR004109_1.fastq.bz2 7017622
SRR004109_2.fastq.bz2 7017622
SRR004110_1.fastq.bz2 1510651
SRR004110_2.fastq.bz2 1510651
SRR004111_1.fastq.bz2 6500202
SRR004111_2.fastq.bz2 6500202
SRR004112_1.fastq.bz2 6382108
SRR004112_2.fastq.bz2 6382108
SRR004113_1.fastq.bz2 6769812
SRR004113_2.fastq.bz2 6769812
SRR004114_1.fastq.bz2 6749632
SRR004114_2.fastq.bz2 6749632
SRR004116_1.fastq.bz2 6660494
SRR004116_2.fastq.bz2 6660494
SRR004117_1.fastq.bz2 6894164
SRR004117_2.fastq.bz2 6894164
SRR004118_1.fastq.bz2 7012762
SRR004118_2.fastq.bz2 7012762
SRR004119_1.fastq.bz2 7106679
SRR004119_2.fastq.bz2 7106679
SRR004120_1.fastq.bz2 1890977
SRR004120_2.fastq.bz2 1890977
SRR004121_1.fastq.bz2 6688708
SRR004121_2.fastq.bz2 6688708
SRR004122_1.fastq.bz2 6442268
SRR004122_2.fastq.bz2 6442268
SRR004123_1.fastq.bz2 7219728
SRR004123_2.fastq.bz2 7219728
SRR004124_1.fastq.bz2 6650577
SRR004124_2.fastq.bz2 6650577
SRR004125_1.fastq.bz2 6980725
SRR004125_2.fastq.bz2 6980725
SRR004126_1.fastq.bz2 7121389
SRR004126_2.fastq.bz2 7121389
SRR004127_1.fastq.bz2 1920925
SRR004127_2.fastq.bz2 1920925
SRR004186_1.fastq.bz2 6039526
SRR004186_2.fastq.bz2 6039526
SRR004187_1.fastq.bz2 6189345
SRR004187_2.fastq.bz2 6189345
SRR004188_1.fastq.bz2 6202959
SRR004188_2.fastq.bz2 6202959
SRR004190_1.fastq.bz2 5896818
SRR004190_2.fastq.bz2 5896818
SRR004191_1.fastq.bz2 5619426
SRR004191_2.fastq.bz2 5619426
SRR004192_1.fastq.bz2 1399560
SRR004192_2.fastq.bz2 1399560
SRR004193_1.fastq.bz2 6978310
SRR004193_2.fastq.bz2 6978310
SRR004194_1.fastq.bz2 6918542
SRR004194_2.fastq.bz2 6918542
SRR004195_1.fastq.bz2 5848753
SRR004195_2.fastq.bz2 5848753
SRR004197_1.fastq.bz2 7159088
SRR004197_2.fastq.bz2 7159088
SRR004198_1.fastq.bz2 1930314
SRR004198_2.fastq.bz2 1930314
SRR004199_1.fastq.bz2 6536064
SRR004199_2.fastq.bz2 6536064
SRR004200_1.fastq.bz2 6839637
SRR004200_2.fastq.bz2 6839637
SRR004201_1.fastq.bz2 6943961
SRR004201_2.fastq.bz2 6943961
SRR004202_1.fastq.bz2 6886712
SRR004202_2.fastq.bz2 6886712
SRR004203_1.fastq.bz2 6920882
SRR004203_2.fastq.bz2 6920882
SRR004204_1.fastq.bz2 1862546
SRR004204_2.fastq.bz2 1862546
SRR004205_1.fastq.bz2 4905164
SRR004205_2.fastq.bz2 4905164
SRR004206_1.fastq.bz2 4757265
SRR004206_2.fastq.bz2 4757265
SRR004207_1.fastq.bz2 5253223
SRR004207_2.fastq.bz2 5253223
SRR004208_1.fastq.bz2 5061637
SRR004208_2.fastq.bz2 5061637
SRR004209_1.fastq.bz2 5250983
SRR004209_2.fastq.bz2 5250983
SRR004210_1.fastq.bz2 5348659
SRR004210_2.fastq.bz2 5348659
SRR004211_1.fastq.bz2 1331516
SRR004211_2.fastq.bz2 1331516
SRR004809_1.fastq.bz2 6281732
SRR004809_2.fastq.bz2 6281732
SRR004810_1.fastq.bz2 6479897
SRR004810_2.fastq.bz2 6479897
SRR004811_1.fastq.bz2 6428286
SRR004811_2.fastq.bz2 6428286
SRR004812_1.fastq.bz2 6386024
SRR004812_2.fastq.bz2 6386024
SRR004813_1.fastq.bz2 6397082
SRR004813_2.fastq.bz2 6397082
SRR004814_1.fastq.bz2 6108069
SRR004814_2.fastq.bz2 6108069
SRR004815_1.fastq.bz2 6525195
SRR004815_2.fastq.bz2 6525195
SRR004816_1.fastq.bz2 6511455
SRR004816_2.fastq.bz2 6511455
SRR004817_1.fastq.bz2 6462803
SRR004817_2.fastq.bz2 6462803
SRR004818_1.fastq.bz2 6433169
SRR004818_2.fastq.bz2 6433169
SRR004819_1.fastq.bz2 6338234
SRR004819_2.fastq.bz2 6338234
SRR004820_1.fastq.bz2 7308557
SRR004820_2.fastq.bz2 7308557
SRR004821_1.fastq.bz2 7156960
SRR004821_2.fastq.bz2 7156960
SRR004822_1.fastq.bz2 6213660
SRR004822_2.fastq.bz2 6213660
SRR004823_1.fastq.bz2 5949864
SRR004823_2.fastq.bz2 5949864
SRR004824_1.fastq.bz2 6087720
SRR004824_2.fastq.bz2 6087720
SRR004825_1.fastq.bz2 6078285
SRR004825_2.fastq.bz2 6078285
SRR004826_1.fastq.bz2 5882054
SRR004826_2.fastq.bz2 5882054
SRR004827_1.fastq.bz2 5541480
SRR004827_2.fastq.bz2 5541480
SRR004828_1.fastq.bz2 7078458
SRR004828_2.fastq.bz2 7078458
SRR004829_1.fastq.bz2 7524672
SRR004829_2.fastq.bz2 7524672
SRR004830_1.fastq.bz2 7820645
SRR004830_2.fastq.bz2 7820645
SRR004831_1.fastq.bz2 8105229
SRR004831_2.fastq.bz2 8105229
SRR004832_1.fastq.bz2 7956781
SRR004832_2.fastq.bz2 7956781
SRR004833_1.fastq.bz2 7547869
SRR004833_2.fastq.bz2 7547869
SRR004834_1.fastq.bz2 6447823
SRR004834_2.fastq.bz2 6447823
SRR004835_1.fastq.bz2 8616259
SRR004835_2.fastq.bz2 8616259
SRR004836_1.fastq.bz2 8975677
SRR004836_2.fastq.bz2 8975677
SRR004837_1.fastq.bz2 8642976
SRR004837_2.fastq.bz2 8642976
SRR004838_1.fastq.bz2 7023195
SRR004838_2.fastq.bz2 7023195
SRR004839_1.fastq.bz2 8062696
SRR004839_2.fastq.bz2 8062696
SRR004840_1.fastq.bz2 8060450
SRR004840_2.fastq.bz2 8060450
SRR004841_1.fastq.bz2 6678858
SRR004841_2.fastq.bz2 6678858
SRR004842_1.fastq.bz2 7091815
SRR004842_2.fastq.bz2 7091815
SRR004843_1.fastq.bz2 6890492
SRR004843_2.fastq.bz2 6890492
SRR004844_1.fastq.bz2 8549164
SRR004844_2.fastq.bz2 8549164
SRR004845_1.fastq.bz2 8837159
SRR004845_2.fastq.bz2 8837159
SRR004846_1.fastq.bz2 7759225
SRR004846_2.fastq.bz2 7759225
SRR004847_1.fastq.bz2 5661124
SRR004847_2.fastq.bz2 5661124
SRR004848_1.fastq.bz2 5408098
SRR004848_2.fastq.bz2 5408098
SRR004849_1.fastq.bz2 5950977
SRR004849_2.fastq.bz2 5950977
SRR004850_1.fastq.bz2 9849395
SRR004850_2.fastq.bz2 9849395
SRR004851_1.fastq.bz2 9534571
SRR004851_2.fastq.bz2 9534571
SRR004852_1.fastq.bz2 9340649
SRR004852_2.fastq.bz2 9340649
SRR004853_1.fastq.bz2 6474762
SRR004853_2.fastq.bz2 6474762
SRR004854_1.fastq.bz2 6360372
SRR004854_2.fastq.bz2 6360372
SRR004855_1.fastq.bz2 6512147
SRR004855_2.fastq.bz2 6512147
SRR004856_1.fastq.bz2 10212539
SRR004856_2.fastq.bz2 10212539
SRR004857_1.fastq.bz2 10214756
SRR004857_2.fastq.bz2 10214756
SRR004858_1.fastq.bz2 9948010
SRR004858_2.fastq.bz2 9948010
SRR004859_1.fastq.bz2 10152277
SRR004859_2.fastq.bz2 10152277
SRR004860_1.fastq.bz2 9972388
SRR004860_2.fastq.bz2 9972388
SRR004861_1.fastq.bz2 8818998
SRR004861_2.fastq.bz2 8818998
SRR004862_1.fastq.bz2 8128087
SRR004862_2.fastq.bz2 8128087
SRR004863_1.fastq.bz2 8579536
SRR004863_2.fastq.bz2 8579536
SRR004864_1.fastq.bz2 8333329
SRR004864_2.fastq.bz2 8333329
SRR004865_1.fastq.bz2 283573
SRR004865_2.fastq.bz2 283573
SRR004866_1.fastq.bz2 4325494
SRR004866_2.fastq.bz2 4325494
SRR004867_1.fastq.bz2 4441061
SRR004867_2.fastq.bz2 4441061
SRR004868_1.fastq.bz2 4417212
SRR004868_2.fastq.bz2 4417212
SRR004869_1.fastq.bz2 4426593
SRR004869_2.fastq.bz2 4426593
SRR004870_1.fastq.bz2 4325451
SRR004870_2.fastq.bz2 4325451
SRR004871_1.fastq.bz2 4437127
SRR004871_2.fastq.bz2 4437127
SRR005657_1.fastq.bz2 5950061
SRR005657_2.fastq.bz2 5950061
SRR005658_1.fastq.bz2 6023796
SRR005658_2.fastq.bz2 6023796
SRR005659_1.fastq.bz2 6046660
SRR005659_2.fastq.bz2 6046660
SRR005660_1.fastq.bz2 5778417
SRR005660_2.fastq.bz2 5778417
SRR005661_1.fastq.bz2 6083478
SRR005661_2.fastq.bz2 6083478
SRR005718_1.fastq.bz2 32158952
SRR005718_2.fastq.bz2 32158952
SRR005719_1.fastq.bz2 19748766
SRR005719_2.fastq.bz2 19748766
SRR005720_1.fastq.bz2 26060741
SRR005720_2.fastq.bz2 26060741
SRR005721_1.fastq.bz2 11956691
SRR005721_2.fastq.bz2 11956691
SRR005734_1.fastq.bz2 24962746
SRR005734_2.fastq.bz2 24962746
SRR005735_1.fastq.bz2 32958349
SRR005735_2.fastq.bz2 32958349
SRR006550_1.fastq.bz2 6485562
SRR006550_2.fastq.bz2 6485562
SRR006551_1.fastq.bz2 7621955
SRR006551_2.fastq.bz2 7621955
SRR006552_1.fastq.bz2 6982624
SRR006552_2.fastq.bz2 6982624
SRR006553_1.fastq.bz2 7074228
SRR006553_2.fastq.bz2 7074228
SRR006554_1.fastq.bz2 7003372
SRR006554_2.fastq.bz2 7003372
SRR006555_1.fastq.bz2 7308114
SRR006555_2.fastq.bz2 7308114
SRR006556_1.fastq.bz2 4359382
SRR006556_2.fastq.bz2 4359382
SRR006557_1.fastq.bz2 4337291
SRR006557_2.fastq.bz2 4337291
SRR006558_1.fastq.bz2 3599374
SRR006558_2.fastq.bz2 3599374
SRR006559_1.fastq.bz2 3450777
SRR006559_2.fastq.bz2 3450777
SRR006560_1.fastq.bz2 3022934
SRR006560_2.fastq.bz2 3022934
SRR006561_1.fastq.bz2 5884872
SRR006561_2.fastq.bz2 5884872
SRR006562_1.fastq.bz2 5477538
SRR006562_2.fastq.bz2 5477538
SRR006563_1.fastq.bz2 6660338
SRR006563_2.fastq.bz2 6660338
SRR006564_1.fastq.bz2 7568813
SRR006564_2.fastq.bz2 7568813
SRR029278_1.fastq.bz2 9014183
SRR029278_2.fastq.bz2 9014183
SRR029333_1.fastq.bz2 3422203
SRR029333_2.fastq.bz2 3422203
SRR029334_1.fastq.bz2 5765255
SRR029334_2.fastq.bz2 5765255
SRR029335_1.fastq.bz2 7473894
SRR029335_2.fastq.bz2 7473894
SRR029336_1.fastq.bz2 3905986
SRR029336_2.fastq.bz2 3905986
SRR029337_1.fastq.bz2 18361953
SRR029337_2.fastq.bz2 18361953
SRR029338_1.fastq.bz2 3602769
SRR029338_2.fastq.bz2 3602769


First, we need to use a more general terminology. Instead of talking about files, we will say objects. The weight of an object here is its number of sequences. As an example, the object SRR029335_2.fastq.bz2 has a weight of 7473894. And finally, instead of talking about jobs, we'll use the term 'bins'.
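To make the terminology concrete, here is a minimal sketch that turns rows like those of Table 1 into a mapping from object name to weight. The function name parse_weights and the three-row sample are my own choices for illustration; this is not taken from the solver linked at the end of the post.

```python
def parse_weights(lines):
    """Map each object (file name) to its weight (number of sequences)."""
    weights = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and the "File Sequences" header row.
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        weights[fields[0]] = int(fields[1])
    return weights


# Three rows copied from Table 1, plus its header.
table = """File Sequences
SRR029335_1.fastq.bz2 7473894
SRR029335_2.fastq.bz2 7473894
SRR029336_1.fastq.bz2 3905986
"""

weights = parse_weights(table.splitlines())
print(weights["SRR029335_2.fastq.bz2"])
```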

So let's say we have 16 bins and the objects above. The goal is to distribute the objects across the 16 bins so that each bin has roughly the same weight.
 
With 508 objects and 16 bins, one way is to use 32 objects per bin (and 28 for the last, 15*32 + 1 * 28).

Here, the 32-objects-per-bin approach does not work well because the bins end up with very different weights.
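The problem with splitting by count can be seen on a tiny sample. The sketch below (split_by_count is a hypothetical helper, not part of the solver) puts a fixed number of objects in each bin, in input order, and is applied to the first 8 weights of Table 1 with 2 bins.

```python
def split_by_count(weights, bin_count):
    """Fill bins in input order with a fixed number of objects per bin."""
    per_bin = -(-len(weights) // bin_count)  # ceiling division
    return [weights[i:i + per_bin] for i in range(0, len(weights), per_bin)]


# First 8 weights of Table 1 (SRR002271 to SRR002274, both reads).
sample = [22243273, 22243273, 35756808, 35756808,
          4276214, 4276214, 18095255, 18095255]

groups = split_by_count(sample, 2)
bin_weights = [sum(group) for group in groups]
print(bin_weights)  # [116000162, 44742938]

# With 508 objects and 16 bins, this gives 15 bins of 32 and one of 28.
sizes = [len(group) for group in split_by_count(list(range(508)), 16)]
print(sizes)
```

Both bins hold 4 objects, yet the first one is more than twice as heavy as the second.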

Since the number of bins and the weight of each object are known, the expected weight per bin can be calculated by summing the weights of all objects and dividing by the number of bins. In our case, the 508 objects have a total weight of 4045335994.
With 16 bins, we expect an average weight of 252833499 (rounded down) for any bin.
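The arithmetic can be checked in a couple of lines. The variable names are mine; the numbers are the ones quoted above.

```python
total_weight = 4045335994  # sum of the weights of the 508 objects
bin_count = 16

# Integer division gives the expected weight; the remainder shows
# that a perfectly even split is not possible.
expected, remainder = divmod(total_weight, bin_count)
print(expected, remainder)  # 252833499 10
```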

Given a distribution of objects, a score can be calculated by summing, for each bin, the absolute difference between the expected and actual weights. By minimizing this score, a good distribution can be obtained quickly.
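As a minimal sketch, the score can be written as follows. The function name score and the toy weights are assumptions made for illustration only.

```python
def score(bin_weights, expected):
    """Sum over bins of |actual weight - expected weight|; lower is better."""
    return sum(abs(weight - expected) for weight in bin_weights)


# Toy example: 3 bins, total weight 30, so the expected weight is 10.
print(score([12, 8, 10], 10))  # 4
print(score([30, 0, 0], 10))   # 40
```

A perfectly balanced distribution has a score of 0, and putting everything in one bin gives the worst score of the two shown.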

For the first state, all the objects are put in the first bin.

Given a state, the next one is generated by picking 2 random objects. Each of these 2 objects is removed from its bin and deposited in a new randomly-selected bin. A new score is computed, and the change is accepted only if the score is lower (better); otherwise it is undone. This is repeated as long as there is an improvement.
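The procedure can be sketched as below. This is my own minimal rendering of the description, not the code of the solver linked at the end of the post. The function name balance, the fixed iteration count used as a stopping rule, and the seed parameter are all assumptions.

```python
import random


def balance(weights, bin_count, iterations, seed=0):
    """Random local search: move 2 random objects, keep the move if it helps."""
    rng = random.Random(seed)
    expected = sum(weights) / bin_count

    # First state: all the objects are in the first bin.
    assignment = [0] * len(weights)
    bin_weights = [0] * bin_count
    bin_weights[0] = sum(weights)
    best = sum(abs(w - expected) for w in bin_weights)

    for _ in range(iterations):
        picked = rng.sample(range(len(weights)), 2)
        targets = [rng.randrange(bin_count) for _ in picked]
        previous = [assignment[i] for i in picked]

        # Move each picked object to its randomly-selected bin.
        for i, target in zip(picked, targets):
            bin_weights[assignment[i]] -= weights[i]
            bin_weights[target] += weights[i]
            assignment[i] = target

        candidate = sum(abs(w - expected) for w in bin_weights)
        if candidate < best:
            best = candidate  # accept the change
        else:
            # Undo the move: put both objects back where they were.
            for i, old_bin in zip(picked, previous):
                bin_weights[assignment[i]] -= weights[i]
                bin_weights[old_bin] += weights[i]
                assignment[i] = old_bin

    return assignment, bin_weights, best


# Toy input: 8 objects, 2 bins, expected weight 14 per bin.
weights = [5, 5, 3, 3, 2, 2, 4, 4]
assignment, bin_weights, best = balance(weights, 2, 2000)
print(bin_weights, best)
```

Only the weights of the affected bins are updated on each move, so an iteration costs a number of operations proportional to the number of bins rather than to the number of objects.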

After 46620000 iterations (3 minutes), this solution was obtained:

Table 2: Weight of each bin after the balancing.

Bin ExpectedWeight ActualWeight
0 252833499 252837466
1 252833499 252829890
2 252833499 252835796
3 252833499 252832018
4 252833499 252839402
5 252833499 252832650
6 252833499 252827646
7 252833499 252833920
8 252833499 252830266
9 252833499 252833886
10 252833499 252827036
11 252833499 252836770
12 252833499 252828776
13 252833499 252840966
14 252833499 252835218
15 252833499 252834288
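As a sanity check on Table 2, the deviations can be recomputed from the ActualWeight column. This is a small sketch with variable names of my own; the numbers are copied from the table.

```python
expected = 252833499

# ActualWeight column of Table 2, bins 0 to 15.
actual = [
    252837466, 252829890, 252835796, 252832018,
    252839402, 252832650, 252827646, 252833920,
    252830266, 252833886, 252827036, 252836770,
    252828776, 252840966, 252835218, 252834288,
]

deviations = [abs(weight - expected) for weight in actual]
total_deviation = sum(deviations)
worst = max(deviations)

print(sum(actual))             # 4045335994, the total weight of all objects
print(total_deviation, worst)  # 52432 7467
```

The worst bin (bin 13) is off by 7467 sequences, which is roughly 0.003% of the expected weight.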

For this solution, the objects were distributed as follows.

Table 3: Distribution of objects into bins.

Object Weight Bin
SRR002271_1.fastq.bz2 22243273 11
SRR002271_2.fastq.bz2 22243273 11
SRR002272_1.fastq.bz2 35756808 10
SRR002272_2.fastq.bz2 35756808 10
SRR002273_1.fastq.bz2 4276214 9
SRR002273_2.fastq.bz2 4276214 9
SRR002274_1.fastq.bz2 18095255 15
SRR002274_2.fastq.bz2 18095255 15
SRR002275_1.fastq.bz2 33729638 0
SRR002275_2.fastq.bz2 33729638 0
SRR002276_1.fastq.bz2 47074312 13
SRR002276_2.fastq.bz2 47074312 13
SRR002277_1.fastq.bz2 6757955 12
SRR002277_2.fastq.bz2 6757955 12
SRR002278_1.fastq.bz2 6093595 3
SRR002278_2.fastq.bz2 6093595 3
SRR002279_1.fastq.bz2 7177292 14
SRR002279_2.fastq.bz2 7177292 14
SRR002280_1.fastq.bz2 6580048 14
SRR002280_2.fastq.bz2 6580048 14
SRR002281_1.fastq.bz2 16693687 12
SRR002281_2.fastq.bz2 16693687 12
SRR002282_1.fastq.bz2 13383178 9
SRR002282_2.fastq.bz2 13383178 9
SRR002283_1.fastq.bz2 18374441 2
SRR002283_2.fastq.bz2 18374441 2
SRR002284_1.fastq.bz2 5600070 12
SRR002284_2.fastq.bz2 5600070 12
SRR002285_1.fastq.bz2 6286076 14
SRR002285_2.fastq.bz2 6286076 14
SRR002286_1.fastq.bz2 5709652 11
SRR002286_2.fastq.bz2 5709652 11
SRR002287_1.fastq.bz2 6309083 0
SRR002287_2.fastq.bz2 6309083 0
SRR002288_1.fastq.bz2 6006869 10
SRR002288_2.fastq.bz2 6006869 10
SRR002289_1.fastq.bz2 4776565 15
SRR002289_2.fastq.bz2 4776565 15
SRR002290_1.fastq.bz2 13044591 4
SRR002290_2.fastq.bz2 13044591 4
SRR002291_1.fastq.bz2 34898264 8
SRR002291_2.fastq.bz2 34898264 8
SRR002292_1.fastq.bz2 22854308 7
SRR002292_2.fastq.bz2 22854308 7
SRR002293_1.fastq.bz2 18042487 10
SRR002293_2.fastq.bz2 18042487 10
SRR002294_1.fastq.bz2 8109548 4
SRR002294_2.fastq.bz2 8109548 4
SRR002295_1.fastq.bz2 30697316 11
SRR002295_2.fastq.bz2 30697316 11
SRR002296_1.fastq.bz2 6101023 3
SRR002296_2.fastq.bz2 6101023 3
SRR002297_1.fastq.bz2 46467908 8
SRR002297_2.fastq.bz2 46467908 8
SRR002298_1.fastq.bz2 10848356 13
SRR002298_2.fastq.bz2 10848356 13
SRR002299_1.fastq.bz2 6078655 5
SRR002299_2.fastq.bz2 6078655 5
SRR002300_1.fastq.bz2 6483460 2
SRR002300_2.fastq.bz2 6483460 2
SRR002301_1.fastq.bz2 7260908 15
SRR002301_2.fastq.bz2 7260908 15
SRR002302_1.fastq.bz2 6283624 15
SRR002302_2.fastq.bz2 6283624 15
SRR002303_1.fastq.bz2 6092616 9
SRR002303_2.fastq.bz2 6092616 9
SRR002304_1.fastq.bz2 8669549 14
SRR002304_2.fastq.bz2 8669549 14
SRR002305_1.fastq.bz2 7079980 1
SRR002305_2.fastq.bz2 7079980 1
SRR002306_1.fastq.bz2 6061012 12
SRR002306_2.fastq.bz2 6061012 12
SRR002307_1.fastq.bz2 7990437 0
SRR002307_2.fastq.bz2 7990437 0
SRR002308_1.fastq.bz2 6403325 13
SRR002308_2.fastq.bz2 6403325 13
SRR002309_1.fastq.bz2 5929366 9
SRR002309_2.fastq.bz2 5929366 9
SRR002310_1.fastq.bz2 9745625 14
SRR002310_2.fastq.bz2 9745625 14
SRR002311_1.fastq.bz2 6113076 8
SRR002311_2.fastq.bz2 6113076 8
SRR002312_1.fastq.bz2 9808666 0
SRR002312_2.fastq.bz2 9808666 0
SRR002313_1.fastq.bz2 7841529 15
SRR002313_2.fastq.bz2 7841529 15
SRR002314_1.fastq.bz2 4334295 13
SRR002314_2.fastq.bz2 4334295 13
SRR002315_1.fastq.bz2 4857885 7
SRR002315_2.fastq.bz2 4857885 7
SRR002316_1.fastq.bz2 14025787 13
SRR002316_2.fastq.bz2 14025787 13
SRR002317_1.fastq.bz2 9732073 11
SRR002317_2.fastq.bz2 9732073 11
SRR002318_1.fastq.bz2 1670064 11
SRR002318_2.fastq.bz2 1670064 11
SRR002319_1.fastq.bz2 9264776 9
SRR002319_2.fastq.bz2 9264776 9
SRR003810_1.fastq.bz2 4605877 12
SRR003810_2.fastq.bz2 4605877 12
SRR003812_1.fastq.bz2 4891691 3
SRR003812_2.fastq.bz2 4891691 3
SRR003813_1.fastq.bz2 4908529 11
SRR003813_2.fastq.bz2 4908529 11
SRR003814_1.fastq.bz2 3958657 12
SRR003814_2.fastq.bz2 3958657 12
SRR003815_1.fastq.bz2 4780683 2
SRR003815_2.fastq.bz2 4780683 2
SRR003816_1.fastq.bz2 4673240 0
SRR003816_2.fastq.bz2 4673240 0
SRR003817_1.fastq.bz2 1226466 3
SRR003817_2.fastq.bz2 1226466 3
SRR003818_1.fastq.bz2 6450471 1
SRR003818_2.fastq.bz2 6450471 1
SRR003819_1.fastq.bz2 6593471 0
SRR003819_2.fastq.bz2 6593471 0
SRR003820_1.fastq.bz2 6553586 9
SRR003820_2.fastq.bz2 6553586 9
SRR003821_1.fastq.bz2 6328227 5
SRR003821_2.fastq.bz2 6328227 5
SRR003823_1.fastq.bz2 6247413 9
SRR003823_2.fastq.bz2 6247413 9
SRR003824_1.fastq.bz2 6025237 11
SRR003824_2.fastq.bz2 6025237 11
SRR003825_1.fastq.bz2 1561734 12
SRR003825_2.fastq.bz2 1561734 12
SRR003837_1.fastq.bz2 7512805 6
SRR003837_2.fastq.bz2 7512805 6
SRR003838_1.fastq.bz2 7152063 0
SRR003838_2.fastq.bz2 7152063 0
SRR003839_1.fastq.bz2 7162665 14
SRR003839_2.fastq.bz2 7162665 14
SRR003840_1.fastq.bz2 7316431 9
SRR003840_2.fastq.bz2 7316431 9
SRR003841_1.fastq.bz2 1978964 13
SRR003841_2.fastq.bz2 1978964 13
SRR003845_1.fastq.bz2 7425426 10
SRR003845_2.fastq.bz2 7425426 10
SRR003846_1.fastq.bz2 6496063 9
SRR003846_2.fastq.bz2 6496063 9
SRR003847_1.fastq.bz2 7303975 5
SRR003847_2.fastq.bz2 7303975 5
SRR003848_1.fastq.bz2 7322831 3
SRR003848_2.fastq.bz2 7322831 3
SRR003849_1.fastq.bz2 7353518 2
SRR003849_2.fastq.bz2 7353518 2
SRR003850_1.fastq.bz2 1956095 15
SRR003850_2.fastq.bz2 1956095 15
SRR003851_1.fastq.bz2 6349952 6
SRR003851_2.fastq.bz2 6349952 6
SRR003852_1.fastq.bz2 6750815 6
SRR003852_2.fastq.bz2 6750815 6
SRR003853_1.fastq.bz2 6676849 11
SRR003853_2.fastq.bz2 6676849 11
SRR003854_1.fastq.bz2 6582580 1
SRR003854_2.fastq.bz2 6582580 1
SRR003855_1.fastq.bz2 6764442 2
SRR003855_2.fastq.bz2 6764442 2
SRR003856_1.fastq.bz2 6721795 6
SRR003856_2.fastq.bz2 6721795 6
SRR003857_1.fastq.bz2 1637242 12
SRR003857_2.fastq.bz2 1637242 12
SRR003859_1.fastq.bz2 3741843 7
SRR003859_2.fastq.bz2 3741843 7
SRR003860_1.fastq.bz2 3946844 7
SRR003860_2.fastq.bz2 3946844 7
SRR003861_1.fastq.bz2 4062736 4
SRR003861_2.fastq.bz2 4062736 4
SRR003863_1.fastq.bz2 6555145 11
SRR003863_2.fastq.bz2 6555145 11
SRR003864_1.fastq.bz2 6419174 6
SRR003864_2.fastq.bz2 6419174 6
SRR003866_1.fastq.bz2 6498466 0
SRR003866_2.fastq.bz2 6498466 0
SRR003867_1.fastq.bz2 6459933 10
SRR003867_2.fastq.bz2 6459933 10
SRR003868_1.fastq.bz2 7433254 11
SRR003868_2.fastq.bz2 7433254 11
SRR003869_1.fastq.bz2 7308992 8
SRR003869_2.fastq.bz2 7308992 8
SRR003870_1.fastq.bz2 7557219 3
SRR003870_2.fastq.bz2 7557219 3
SRR003871_1.fastq.bz2 7443963 1
SRR003871_2.fastq.bz2 7443963 1
SRR003872_1.fastq.bz2 7499185 3
SRR003872_2.fastq.bz2 7499185 3
SRR003873_1.fastq.bz2 6323645 15
SRR003873_2.fastq.bz2 6323645 15
SRR003874_1.fastq.bz2 6145944 7
SRR003874_2.fastq.bz2 6145944 7
SRR003875_1.fastq.bz2 6348097 12
SRR003875_2.fastq.bz2 6348097 12
SRR003876_1.fastq.bz2 6296738 9
SRR003876_2.fastq.bz2 6296738 9
SRR003877_1.fastq.bz2 6450647 10
SRR003877_2.fastq.bz2 6450647 10
SRR003878_1.fastq.bz2 6543352 4
SRR003878_2.fastq.bz2 6543352 4
SRR003879_1.fastq.bz2 6418736 13
SRR003879_2.fastq.bz2 6418736 13
SRR003960_1.fastq.bz2 9722922 6
SRR003960_2.fastq.bz2 9722922 6
SRR003961_1.fastq.bz2 8995012 14
SRR003961_2.fastq.bz2 8995012 14
SRR003962_1.fastq.bz2 9249927 10
SRR003962_2.fastq.bz2 9249927 10
SRR003963_1.fastq.bz2 9185577 1
SRR003963_2.fastq.bz2 9185577 1
SRR003964_1.fastq.bz2 9484360 4
SRR003964_2.fastq.bz2 9484360 4
SRR003965_1.fastq.bz2 8959911 5
SRR003965_2.fastq.bz2 8959911 5
SRR003966_1.fastq.bz2 5851368 5
SRR003966_2.fastq.bz2 5851368 5
SRR003967_1.fastq.bz2 5236932 5
SRR003967_2.fastq.bz2 5236932 5
SRR003968_1.fastq.bz2 6170713 4
SRR003968_2.fastq.bz2 6170713 4
SRR003969_1.fastq.bz2 6276516 12
SRR003969_2.fastq.bz2 6276516 12
SRR003970_1.fastq.bz2 5765690 1
SRR003970_2.fastq.bz2 5765690 1
SRR003971_1.fastq.bz2 1615149 12
SRR003971_2.fastq.bz2 1615149 12
SRR004105_1.fastq.bz2 6815628 7
SRR004105_2.fastq.bz2 6815628 7
SRR004106_1.fastq.bz2 6857870 2
SRR004106_2.fastq.bz2 6857870 2
SRR004107_1.fastq.bz2 6961705 3
SRR004107_2.fastq.bz2 6961705 3
SRR004108_1.fastq.bz2 7019609 7
SRR004108_2.fastq.bz2 7019609 7
SRR004109_1.fastq.bz2 7017622 5
SRR004109_2.fastq.bz2 7017622 5
SRR004110_1.fastq.bz2 1510651 6
SRR004110_2.fastq.bz2 1510651 6
SRR004111_1.fastq.bz2 6500202 12
SRR004111_2.fastq.bz2 6500202 12
SRR004112_1.fastq.bz2 6382108 9
SRR004112_2.fastq.bz2 6382108 9
SRR004113_1.fastq.bz2 6769812 9
SRR004113_2.fastq.bz2 6769812 9
SRR004114_1.fastq.bz2 6749632 2
SRR004114_2.fastq.bz2 6749632 2
SRR004116_1.fastq.bz2 6660494 6
SRR004116_2.fastq.bz2 6660494 6
SRR004117_1.fastq.bz2 6894164 9
SRR004117_2.fastq.bz2 6894164 9
SRR004118_1.fastq.bz2 7012762 4
SRR004118_2.fastq.bz2 7012762 4
SRR004119_1.fastq.bz2 7106679 0
SRR004119_2.fastq.bz2 7106679 0
SRR004120_1.fastq.bz2 1890977 1
SRR004120_2.fastq.bz2 1890977 1
SRR004121_1.fastq.bz2 6688708 13
SRR004121_2.fastq.bz2 6688708 13
SRR004122_1.fastq.bz2 6442268 2
SRR004122_2.fastq.bz2 6442268 2
SRR004123_1.fastq.bz2 7219728 7
SRR004123_2.fastq.bz2 7219728 7
SRR004124_1.fastq.bz2 6650577 10
SRR004124_2.fastq.bz2 6650577 10
SRR004125_1.fastq.bz2 6980725 5
SRR004125_2.fastq.bz2 6980725 5
SRR004126_1.fastq.bz2 7121389 4
SRR004126_2.fastq.bz2 7121389 4
SRR004127_1.fastq.bz2 1920925 15
SRR004127_2.fastq.bz2 1920925 15
SRR004186_1.fastq.bz2 6039526 4
SRR004186_2.fastq.bz2 6039526 4
SRR004187_1.fastq.bz2 6189345 15
SRR004187_2.fastq.bz2 6189345 15
SRR004188_1.fastq.bz2 6202959 3
SRR004188_2.fastq.bz2 6202959 3
SRR004190_1.fastq.bz2 5896818 4
SRR004190_2.fastq.bz2 5896818 4
SRR004191_1.fastq.bz2 5619426 4
SRR004191_2.fastq.bz2 5619426 4
SRR004192_1.fastq.bz2 1399560 2
SRR004192_2.fastq.bz2 1399560 2
SRR004193_1.fastq.bz2 6978310 15
SRR004193_2.fastq.bz2 6978310 15
SRR004194_1.fastq.bz2 6918542 1
SRR004194_2.fastq.bz2 6918542 1
SRR004195_1.fastq.bz2 5848753 3
SRR004195_2.fastq.bz2 5848753 3
SRR004197_1.fastq.bz2 7159088 3
SRR004197_2.fastq.bz2 7159088 3
SRR004198_1.fastq.bz2 1930314 0
SRR004198_2.fastq.bz2 1930314 0
SRR004199_1.fastq.bz2 6536064 5
SRR004199_2.fastq.bz2 6536064 5
SRR004200_1.fastq.bz2 6839637 4
SRR004200_2.fastq.bz2 6839637 4
SRR004201_1.fastq.bz2 6943961 2
SRR004201_2.fastq.bz2 6943961 2
SRR004202_1.fastq.bz2 6886712 4
SRR004202_2.fastq.bz2 6886712 4
SRR004203_1.fastq.bz2 6920882 6
SRR004203_2.fastq.bz2 6920882 6
SRR004204_1.fastq.bz2 1862546 13
SRR004204_2.fastq.bz2 1862546 13
SRR004205_1.fastq.bz2 4905164 15
SRR004205_2.fastq.bz2 4905164 15
SRR004206_1.fastq.bz2 4757265 5
SRR004206_2.fastq.bz2 4757265 5
SRR004207_1.fastq.bz2 5253223 7
SRR004207_2.fastq.bz2 5253223 7
SRR004208_1.fastq.bz2 5061637 7
SRR004208_2.fastq.bz2 5061637 7
SRR004209_1.fastq.bz2 5250983 7
SRR004209_2.fastq.bz2 5250983 7
SRR004210_1.fastq.bz2 5348659 6
SRR004210_2.fastq.bz2 5348659 6
SRR004211_1.fastq.bz2 1331516 6
SRR004211_2.fastq.bz2 1331516 6
SRR004809_1.fastq.bz2 6281732 3
SRR004809_2.fastq.bz2 6281732 3
SRR004810_1.fastq.bz2 6479897 12
SRR004810_2.fastq.bz2 6479897 12
SRR004811_1.fastq.bz2 6428286 5
SRR004811_2.fastq.bz2 6428286 5
SRR004812_1.fastq.bz2 6386024 9
SRR004812_2.fastq.bz2 6386024 9
SRR004813_1.fastq.bz2 6397082 8
SRR004813_2.fastq.bz2 6397082 8
SRR004814_1.fastq.bz2 6108069 3
SRR004814_2.fastq.bz2 6108069 3
SRR004815_1.fastq.bz2 6525195 3
SRR004815_2.fastq.bz2 6525195 3
SRR004816_1.fastq.bz2 6511455 3
SRR004816_2.fastq.bz2 6511455 3
SRR004817_1.fastq.bz2 6462803 8
SRR004817_2.fastq.bz2 6462803 8
SRR004818_1.fastq.bz2 6433169 7
SRR004818_2.fastq.bz2 6433169 7
SRR004819_1.fastq.bz2 6338234 5
SRR004819_2.fastq.bz2 6338234 5
SRR004820_1.fastq.bz2 7308557 1
SRR004820_2.fastq.bz2 7308557 1
SRR004821_1.fastq.bz2 7156960 2
SRR004821_2.fastq.bz2 7156960 2
SRR004822_1.fastq.bz2 6213660 15
SRR004822_2.fastq.bz2 6213660 15
SRR004823_1.fastq.bz2 5949864 2
SRR004823_2.fastq.bz2 5949864 2
SRR004824_1.fastq.bz2 6087720 3
SRR004824_2.fastq.bz2 6087720 3
SRR004825_1.fastq.bz2 6078285 5
SRR004825_2.fastq.bz2 6078285 5
SRR004826_1.fastq.bz2 5882054 4
SRR004826_2.fastq.bz2 5882054 4
SRR004827_1.fastq.bz2 5541480 5
SRR004827_2.fastq.bz2 5541480 5
SRR004828_1.fastq.bz2 7078458 12
SRR004828_2.fastq.bz2 7078458 12
SRR004829_1.fastq.bz2 7524672 11
SRR004829_2.fastq.bz2 7524672 11
SRR004830_1.fastq.bz2 7820645 14
SRR004830_2.fastq.bz2 7820645 14
SRR004831_1.fastq.bz2 8105229 0
SRR004831_2.fastq.bz2 8105229 0
SRR004832_1.fastq.bz2 7956781 14
SRR004832_2.fastq.bz2 7956781 14
SRR004833_1.fastq.bz2 7547869 6
SRR004833_2.fastq.bz2 7547869 6
SRR004834_1.fastq.bz2 6447823 9
SRR004834_2.fastq.bz2 6447823 9
SRR004835_1.fastq.bz2 8616259 6
SRR004835_2.fastq.bz2 8616259 6
SRR004836_1.fastq.bz2 8975677 5
SRR004836_2.fastq.bz2 8975677 5
SRR004837_1.fastq.bz2 8642976 12
SRR004837_2.fastq.bz2 8642976 12
SRR004838_1.fastq.bz2 7023195 3
SRR004838_2.fastq.bz2 7023195 3
SRR004839_1.fastq.bz2 8062696 3
SRR004839_2.fastq.bz2 8062696 3
SRR004840_1.fastq.bz2 8060450 0
SRR004840_2.fastq.bz2 8060450 0
SRR004841_1.fastq.bz2 6678858 9
SRR004841_2.fastq.bz2 6678858 9
SRR004842_1.fastq.bz2 7091815 15
SRR004842_2.fastq.bz2 7091815 15
SRR004843_1.fastq.bz2 6890492 2
SRR004843_2.fastq.bz2 6890492 2
SRR004844_1.fastq.bz2 8549164 13
SRR004844_2.fastq.bz2 8549164 13
SRR004845_1.fastq.bz2 8837159 5
SRR004845_2.fastq.bz2 8837159 5
SRR004846_1.fastq.bz2 7759225 14
SRR004846_2.fastq.bz2 7759225 14
SRR004847_1.fastq.bz2 5661124 9
SRR004847_2.fastq.bz2 5661124 9
SRR004848_1.fastq.bz2 5408098 10
SRR004848_2.fastq.bz2 5408098 10
SRR004849_1.fastq.bz2 5950977 7
SRR004849_2.fastq.bz2 5950977 7
SRR004850_1.fastq.bz2 9849395 14
SRR004850_2.fastq.bz2 9849395 14
SRR004851_1.fastq.bz2 9534571 6
SRR004851_2.fastq.bz2 9534571 6
SRR004852_1.fastq.bz2 9340649 9
SRR004852_2.fastq.bz2 9340649 9
SRR004853_1.fastq.bz2 6474762 14
SRR004853_2.fastq.bz2 6474762 14
SRR004854_1.fastq.bz2 6360372 12
SRR004854_2.fastq.bz2 6360372 12
SRR004855_1.fastq.bz2 6512147 1
SRR004855_2.fastq.bz2 6512147 1
SRR004856_1.fastq.bz2 10212539 1
SRR004856_2.fastq.bz2 10212539 1
SRR004857_1.fastq.bz2 10214756 1
SRR004857_2.fastq.bz2 10214756 1
SRR004858_1.fastq.bz2 9948010 8
SRR004858_2.fastq.bz2 9948010 8
SRR004859_1.fastq.bz2 10152277 5
SRR004859_2.fastq.bz2 10152277 5
SRR004860_1.fastq.bz2 9972388 4
SRR004860_2.fastq.bz2 9972388 4
SRR004861_1.fastq.bz2 8818998 8
SRR004861_2.fastq.bz2 8818998 8
SRR004862_1.fastq.bz2 8128087 1
SRR004862_2.fastq.bz2 8128087 1
SRR004863_1.fastq.bz2 8579536 11
SRR004863_2.fastq.bz2 8579536 11
SRR004864_1.fastq.bz2 8333329 14
SRR004864_2.fastq.bz2 8333329 14
SRR004865_1.fastq.bz2 283573 6
SRR004865_2.fastq.bz2 283573 6
SRR004866_1.fastq.bz2 4325494 11
SRR004866_2.fastq.bz2 4325494 11
SRR004867_1.fastq.bz2 4441061 12
SRR004867_2.fastq.bz2 4441061 12
SRR004868_1.fastq.bz2 4417212 0
SRR004868_2.fastq.bz2 4417212 0
SRR004869_1.fastq.bz2 4426593 13
SRR004869_2.fastq.bz2 4426593 13
SRR004870_1.fastq.bz2 4325451 13
SRR004870_2.fastq.bz2 4325451 13
SRR004871_1.fastq.bz2 4437127 7
SRR004871_2.fastq.bz2 4437127 7
SRR005657_1.fastq.bz2 5950061 14
SRR005657_2.fastq.bz2 5950061 14
SRR005658_1.fastq.bz2 6023796 2
SRR005658_2.fastq.bz2 6023796 2
SRR005659_1.fastq.bz2 6046660 12
SRR005659_2.fastq.bz2 6046660 12
SRR005660_1.fastq.bz2 5778417 0
SRR005660_2.fastq.bz2 5778417 0
SRR005661_1.fastq.bz2 6083478 7
SRR005661_2.fastq.bz2 6083478 7
SRR005718_1.fastq.bz2 32158952 6
SRR005718_2.fastq.bz2 32158952 6
SRR005719_1.fastq.bz2 19748766 12
SRR005719_2.fastq.bz2 19748766 12
SRR005720_1.fastq.bz2 26060741 1
SRR005720_2.fastq.bz2 26060741 1
SRR005721_1.fastq.bz2 11956691 2
SRR005721_2.fastq.bz2 11956691 2
SRR005734_1.fastq.bz2 24962746 10
SRR005734_2.fastq.bz2 24962746 10
SRR005735_1.fastq.bz2 32958349 15
SRR005735_2.fastq.bz2 32958349 15
SRR006550_1.fastq.bz2 6485562 14
SRR006550_2.fastq.bz2 6485562 14
SRR006551_1.fastq.bz2 7621955 15
SRR006551_2.fastq.bz2 7621955 15
SRR006552_1.fastq.bz2 6982624 7
SRR006552_2.fastq.bz2 6982624 7
SRR006553_1.fastq.bz2 7074228 2
SRR006553_2.fastq.bz2 7074228 2
SRR006554_1.fastq.bz2 7003372 4
SRR006554_2.fastq.bz2 7003372 4
SRR006555_1.fastq.bz2 7308114 4
SRR006555_2.fastq.bz2 7308114 4
SRR006556_1.fastq.bz2 4359382 0
SRR006556_2.fastq.bz2 4359382 0
SRR006557_1.fastq.bz2 4337291 11
SRR006557_2.fastq.bz2 4337291 11
SRR006558_1.fastq.bz2 3599374 13
SRR006558_2.fastq.bz2 3599374 13
SRR006559_1.fastq.bz2 3450777 2
SRR006559_2.fastq.bz2 3450777 2
SRR006560_1.fastq.bz2 3022934 6
SRR006560_2.fastq.bz2 3022934 6
SRR006561_1.fastq.bz2 5884872 13
SRR006561_2.fastq.bz2 5884872 13
SRR006562_1.fastq.bz2 5477538 3
SRR006562_2.fastq.bz2 5477538 3
SRR006563_1.fastq.bz2 6660338 1
SRR006563_2.fastq.bz2 6660338 1
SRR006564_1.fastq.bz2 7568813 14
SRR006564_2.fastq.bz2 7568813 14
SRR029278_1.fastq.bz2 9014183 5
SRR029278_2.fastq.bz2 9014183 5
SRR029333_1.fastq.bz2 3422203 4
SRR029333_2.fastq.bz2 3422203 4
SRR029334_1.fastq.bz2 5765255 2
SRR029334_2.fastq.bz2 5765255 2
SRR029335_1.fastq.bz2 7473894 3
SRR029335_2.fastq.bz2 7473894 3
SRR029336_1.fastq.bz2 3905986 0
SRR029336_2.fastq.bz2 3905986 0
SRR029337_1.fastq.bz2 18361953 7
SRR029337_2.fastq.bz2 18361953 7
SRR029338_1.fastq.bz2 3602769 14
SRR029338_2.fastq.bz2 3602769 14



 The solver is available here: https://github.com/sebhtml/NGS-Pipelines/blob/master/Balance-Objects.py