2012-11-21

Fetching data from Illumina BaseSpace

We are working to deploy Ray in Illumina BaseSpace.

For our tests, we needed the data on our infrastructure in Québec City.


First, I listed the objects of the data set "2x150bp Human Genome in Record Time with the HiSeq 2500":

$ cat RawFiles.txt
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_001.fastq.gz?id=25033024&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_002.fastq.gz?id=25054488&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_001.fastq.gz?id=25081698&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_002.fastq.gz?id=25123588&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_001.fastq.gz?id=25155266&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_002.fastq.gz?id=25175449&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R2_001.fastq.gz?id=25196878&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R2_002.fastq.gz?id=25237826&appResultId=



Then I wrote a script that downloads each object with curl.


$ cat Get.sh
# start one background curl download per URL listed in RawFiles.txt
while read -r object
do
        # keep only the file name: strip the query string, then the path
        fileName=${object%%\?*}
        fileName=${fileName##*/}

        # resume partial downloads, follow redirects, send the login cookie
        curl --output "$fileName" --continue-at - --location \
                --cookie "IComLogin=$(cat ~/123)" "$object" &> "$fileName.log" &
done < RawFiles.txt


~/123 contains the value of my IComLogin cookie for Illumina BaseSpace.

If you are using a shared machine, use a cookie file instead; otherwise the cookie value appears in the process list, where every other user can read it.
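A minimal sketch of the cookie-file approach, assuming the Netscape cookie-file format that curl can read. The function name write_cookie_file, the path ~/cookies.txt and the domain .illumina.com are made up for illustration; check the real cookie in your browser.

```shell
#!/bin/bash
# write_cookie_file <cookie value> <file>: store the IComLogin cookie in
# Netscape cookie-file format (7 tab-separated fields per line).
write_cookie_file() {
    (
        # a newly created file is readable and writable by its owner only
        umask 077
        # fields: domain, match subdomains, path, HTTPS only,
        #         expiry (0 = session cookie), name, value
        # ASSUMPTION: the domain .illumina.com is a guess, adjust it
        printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
            .illumina.com TRUE / TRUE 0 IComLogin "$1" > "$2"
    )
}
```

With such a file, the curl call in Get.sh would take --cookie ~/cookies.txt in place of --cookie IComLogin=...; curl treats a --cookie argument without an "=" sign as a file name, so the value never shows up on the command line.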

Then I started the parallel downloads.

$ bash Get.sh

Finally, I obtained file size, sha1sum, and number of entries for each file.


Table 1: 2x150bp Human Genome in Record Time with the HiSeq 2500.
+--------------------------------+---------------+-------------+------------------------------------------+
| File                           | DNA sequences | Bytes       | Sha1                                     |
+--------------------------------+---------------+-------------+------------------------------------------+
| sorted_S1_L001_R1_001.fastq.gz | 143818693     | 18629367106 | 9f964962cc1253d0ab034db8b3ac4b5e74d0859a |
| sorted_S1_L001_R2_001.fastq.gz | 143818693     | 19363819987 | 3b795c8e7f844756cc41e5470aa3aa15d3eb047e |
| sorted_S1_L001_R1_002.fastq.gz | 149973420     | 19495696779 | 05e3d6ec3dd21466be020e808f6ec923dc392dc2 |
| sorted_S1_L001_R2_002.fastq.gz | 149973420     | 20247581108 | d1ced61fee4cbfef61ff9b082169435a67162d11 |
| sorted_S1_L002_R1_001.fastq.gz | 144295306     | 18683353035 | 74cb3168e853578d5261eb659c71a8c79fd1b1fe |
| sorted_S1_L002_R2_001.fastq.gz | 144295306     | 19422921920 | 24ba63efb2b7255914361db925e9dc9ea31f302e |
| sorted_S1_L002_R1_002.fastq.gz | 147591231     | 19122587922 | 6cf2b8a24d849f5b59d37e182c79182f4732337d |
| sorted_S1_L002_R2_002.fastq.gz | 147591231     | 19906116448 | 13e015736e6a4e6006c6d5089b8ddd2053e7e653 |
+--------------------------------+---------------+-------------+------------------------------------------+
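
The three values reported for each file can be computed with standard tools. This is a sketch, not necessarily the commands that were actually used: it assumes every FASTQ entry spans exactly 4 lines and that sha1sum from GNU coreutils is installed. The function name describe is made up for illustration.

```shell
#!/bin/bash
# describe <file.fastq.gz>: print the file name, the number of DNA
# sequences, the size in bytes and the SHA-1 checksum, tab-separated.
describe() {
    local file=$1 lines
    # ASSUMPTION: one FASTQ entry = 4 lines, so sequences = lines / 4
    lines=$(gzip -dc "$file" | wc -l)
    printf '%s\t%d\t%d\t%s\n' "$file" "$((lines / 4))" \
        "$(wc -c < "$file")" "$(sha1sum "$file" | cut -d ' ' -f 1)"
}
```

Running describe on each file in a loop (for f in *.fastq.gz; do describe "$f"; done) gives one row per file.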


2012-11-19

The life cycle of a scientific manuscript

The manuscript:

Sebastien Boisvert, Frederic Raymond, Elenie Godzaridis, Francois Laviolette and Jacques Corbeil
Ray Meta: scalable de novo metagenome assembly and profiling
Genome Biology

For the authors' contributions, see the paper on the publisher's website.


We started to work on this project on 2012-02-25 according to "git log".

+------------+-----------------+------------------------------------------------------+
| Date       | Journal         | Event                                                |
+------------+-----------------+------------------------------------------------------+
| 2012-06-07 | Nature Genetics | Manuscript submission                                |
| 2012-06-12 | Nature Genetics | Editorial rejection                                  |
| 2012-06-15 | Nature Methods  | Manuscript submission                                |
| 2012-06-29 | Nature Methods  | Editorial rejection                                  |
| 2012-07-05 | Genome Research | Manuscript submission                                |
| 2012-07-10 | Genome Research | Editorial rejection                                  |
| 2012-07-11 | Genome Biology  | Presubmission enquiry                                |
| 2012-07-18 | Genome Biology  | Positive editorial response to presubmission enquiry |
| 2012-09-18 | Genome Biology  | Receipt of reviewers' comments from the Editor       |
| 2012-10-24 | Genome Biology  | Revised manuscript submission                        |
| 2012-11-19 | Genome Biology  | Editorial acceptance                                 |
| 2012-12-22 | Genome Biology  | Publication prior to copyediting and formatting      |
| 2013-xx-xx | Genome Biology  | Publication of copy-edited and formatted manuscript  |
+------------+-----------------+------------------------------------------------------+


I don't mind any of the delays.

I am very appreciative of the reviewers who took the time to read and understand every experiment we did.

All contributing authors believe that the reviewers' comments significantly improved the quality of the manuscript.

My accomplishments in 2012

Dear readers,

2012 was a busy year.

2012 accomplishments
(starting with the most significant)
  1. Publication of a scientific paper in the really good journal Genome Biology (impact factor: 9.04)
  2. Ray Meta, free and open-source software (a generalization of the Ray heuristics to metagenomes)
  3. The RayPlatform framework
  4. Invited participation as an expert at Next Generation Sequence Analysis 2012
  5. Invited seminar at Argonne National Laboratory
  6. Invited SciNet Developer Seminars (for computer scientists; for biologists and bioinformaticians) at SciNet in Toronto
  7. Technical guidance (for director Jacques Corbeil and codirector François Laviolette) for a 1 M$ Genome Canada application to the 2012 Bioinformatics and Computational Biology Competition
  8. Creation of the mini-ranks hybrid programming model (with Fangfang Xia and Rick Stevens)
  9. Editorial rejection at Nature Genetics (2012-06-12)
  10. Editorial rejection at Nature Methods (2012-06-29)
  11. Editorial rejection at Genome Research (2012-07-10)

Agenda for 2013

  1. A paper about mini-ranks as a hybrid programming model
  2. Ray Cloud Browser: interactively skim processed genomics data with energy
  3. Deposit my Ph.D. thesis (2013-08-31)
  4. Finish Ph.D. (2013-12-31)

2012-11-09

Table 1: Comparison of Ray instances with MPI ranks and mini-ranks.



+--------------------------------+-------------------------------+--------------------------+
| Metric                         | Ray with MPI ranks            | Ray with mini-ranks      |
+--------------------------------+-------------------------------+--------------------------+
| Accession                      | SRA010766                     | SRA010766                |
| Description                    | Jay T. Flatley genome         | Jay T. Flatley genome    |
| Input files                    | 164                           | 164                      |
| Input sequences                | 1593032322                    | 1593032322               |
| Compression                    | Bzip2                         | Bzip2                    |
| K-mer length                   | 21                            | 21                       |
| MPI implementation             | Open-MPI 1.6.2                | Open-MPI 1.6.2           |
| Compiler                       | GNU g++ 4.7.0                 | GNU g++ 4.7.0            |
| Ray version                    | 2.1.0-pre-release             | 2.1.1-devel (9cbf2a8277) |
| RayPlatform version            | 1.1.0-pre-release             | 7.0.0-devel (7e38d17e0f) |
| Interconnect                   | Mellanox MT26428              | Mellanox MT26428         |
| Processor                      | AMD Opteron 6172              | AMD Opteron 6172         |
| Machines                       | 43                            | 100                      |
| MPI ranks per machine          | 23 or 24                      | 1                        |
| MPI ranks                      | 1024                          | 100                      |
| Mini-ranks per MPI rank        | None                          | 23                       |
| Mini-ranks                     | None                          | 2300                     |
| Messaging strategy             | Virtual routing with polytope | None                     |
| Routing graph                  | Polytope                      | None                     |
| Routing graph degree           | 62                            | None                     |
| Network testing                | 00:00:20                      | 00:00:03                 |
| Counting sequences to assemble | 00:13:45                      | 00:10:34                 |
| Sequence loading               | 00:45:01                      | 00:30:09                 |
| K-mer counting                 | 02:55:18                      | 00:11:44                 |
| Coverage distribution analysis | 00:00:22                      | 00:00:08                 |
| Graph construction             | 00:22:33                      | 00:17:04                 |
| Null edge purging              | 00:20:25                      | 00:19:49                 |
+--------------------------------+-------------------------------+--------------------------+


2012-11-06

Commands for Debian packaging

# build the .deb
dpkg-buildpackage -rfakeroot

# check the produced .deb
lintian ray_2.1.0-1_amd64.deb

# check the .dsc
lintian ray_2.1.0-1.dsc

# check the changes
lintian ray_2.1.0-1_amd64.changes

# add an upstream tarball
pristine-tar commit

Cost Effectiveness Analysis (CEA) of running Ray on Amazon EC2

Sample: SRA001125
URL: http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA001125
DNA reads: 34911784 (2 * 17455892)
Read length (nt): 36
Technology: Illumina Genome Analyzer

API name: m1.large
2 Ray processes
Running time: 05:28:46
Pricing: 0.260 $ / h
Cost: 1.560 $

API name: m3.xlarge
4 Ray processes
Running time: 02:31:34
Pricing: 0.580 $ / h
Cost: 1.730 $

API name: cc2.8xlarge
32 Ray processes
Running time: 00:54:06
Pricing: 2.400 $ / h
Cost: 2.400 $
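
The costs for m1.large and cc2.8xlarge match a rule where every started hour is billed at the full hourly price (for example, 6 h at 0.260 $ / h gives 1.560 $). A minimal sketch of that arithmetic, assuming this billing rule; the function name cost is made up for illustration.

```shell
#!/bin/bash
# cost <hh:mm:ss> <price per hour>: bill every started hour at the
# full hourly price and print the total with 3 decimals.
cost() {
    awk -v t="$1" -v p="$2" 'BEGIN {
        split(t, f, ":")
        seconds = f[1] * 3600 + f[2] * 60 + f[3]
        # round the running time up to a whole number of hours
        hours = int((seconds + 3599) / 3600)
        printf "%.3f\n", hours * p
    }'
}

cost 05:28:46 0.260   # m1.large
cost 00:54:06 2.400   # cc2.8xlarge
```

Under the same assumption, cost 02:31:34 0.580 gives 1.740 for m3.xlarge (3 h at 0.580 $ / h), one cent more than the figure listed, so the rule may not be exactly how that cost was obtained.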

Conclusions:

1. You get your results faster if you pay more.

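2. For cc2.8xlarge, about a third of the running time (00:19:40) was spent loading sequences from EBS.
That's a lot!

3. The scalability on this problem is not that good because the
problem size is not very large: going from 2 to 32 Ray processes (16 times more) made the run only about 6 times faster.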

4. Amazon EC2 is really affordable for de novo assemblies of bacterial genomes.

If you want to try these tests yourself => http://github.com/sebhtml/Ray-in-Amazon-EC2-CLOUD

Comparing fastq compression with gzip, bzip2 and xz


Storage is expensive. Lossless compression reduces storage requirements without discarding any data.
Sébastien Boisvert
2012-11-05


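Table 1: Comparison of compression methods on SRR001665_1.fastq (10408224 DNA sequences of length 36). Tests were run on Fedora 17 x86_64 with an Intel Core i5-3360M processor and an Intel SSDSC2BW180A3 drive. Tests were not run in parallel. The time is the "real" entry reported by the time command. Each test was run twice.
+---------------------------------------------------------------+--------------+--------------+--------------+----------+
| Compression                                                   | Time (run 1) | Time (run 2) | Size (bytes) | Size (%) |
+---------------------------------------------------------------+--------------+--------------+--------------+----------+
| none                                                          | 0            |              | 2085533064   | 100      |
| time cat SRR001665_1.fastq | gzip -9 > SRR001665_1.fastq.gz   | 7m31.519s    | 7m20.340s    | 376373619    | 18       |
| time cat SRR001665_1.fastq | bzip2 -9 > SRR001665_1.fastq.bz2 | 3m12.601s    | 3m25.243s    | 295000140    | 14       |
| time cat SRR001665_1.fastq | xz -9 > SRR001665_1.fastq.xz     | 32m45.933s   |              | 257621508    | 12       |
+---------------------------------------------------------------+--------------+--------------+--------------+----------+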



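Table 2: Decompression tests. Each test was run twice.
+--------------------------------------------------------------+--------------+--------------+
| Decompression                                                | Time (run 1) | Time (run 2) |
+--------------------------------------------------------------+--------------+--------------+
| time cat SRR001665_1.fastq.gz | gunzip > SRR001665_1.fastq   | 0m14.612s    | 0m13.247s    |
| time cat SRR001665_1.fastq.bz2 | bunzip2 > SRR001665_1.fastq | 1m3.412s     | 1m4.337s     |
| time cat SRR001665_1.fastq.xz | unxz > SRR001665_1.fastq     | 0m24.194s    | 0m23.923s    |
+--------------------------------------------------------------+--------------+--------------+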

It is strange that bzip2 is faster than gzip for compression.
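
To repeat this kind of comparison on another file, a loop over the three tools is enough. This is a rough sketch, not the procedure used for the tables: it measures whole seconds with date instead of using the time command, and the function name compare and the output naming (input name plus the tool name) are made up for illustration.

```shell
#!/bin/bash
# compare <file>: compress <file> with gzip, bzip2 and xz at level 9 and
# print the tool, the elapsed wall-clock seconds and the compressed size
# in bytes, tab-separated. The input file is left untouched.
compare() {
    local input=$1 tool start end size
    for tool in gzip bzip2 xz
    do
        # skip tools that are not installed
        command -v "$tool" > /dev/null || continue
        start=$(date +%s)
        "$tool" -9 < "$input" > "$input.$tool"
        end=$(date +%s)
        size=$(wc -c < "$input.$tool")
        printf '%s\t%d\t%d\n' "$tool" "$((end - start))" "$size"
    done
}
```

For example, compare SRR001665_1.fastq writes SRR001665_1.fastq.gzip, SRR001665_1.fastq.bzip2 and SRR001665_1.fastq.xz next to the input and prints one line per tool.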

2012-11-02

Justification for access to a Microsoft(R) Windows(R) HPC(R) cluster

** Problem

Current DNA sequencers (such as the Illumina(R) HiSeq(R) 2500) produce more than 6 000 000 000 digital DNA sequences, each between 100 and 200 letters long (A, T, C, G), in a single run. One of the possible types of analysis is "de novo genome assembly".

** Software system

My software is called Ray and is written in C++ 1998 (Microsoft(R) Visual Studio(R) 2010 fully supports this standard). An MPI message-passing library is also required. MPICH2 and Open-MPI are the two that are available as binary distributions for Microsoft(R) Windows(R).

Ray is free software (GPLv3 license) and uses the RayPlatform parallel library (license: LGPLv3).

- http://www.ohloh.net/p/ray-assembler
- http://www.ohloh.net/p/rayplatform

- http://denovoassembler.sourceforge.net/


Ray performs de novo assembly of genomes or metagenomes for the life sciences industry (genomics). Ray scales very well on "Big Data" problems.

** Tested platforms (incomplete list)

- Windows 7 on Intel(R) Q6600 (Visual Studio 2010, MPICH2)
- Amazon EC2 (MIT StarCluster)
- CentOS on IBM(R) iDataPlex(R) (Intel Xeon)
- CentOS on Sun(R) Constellation (Intel Xeon)
- IBM Blue Gene/Q (IBM PowerPC A2)
- Cray Linux on Cray XE6 (AMD Opteron)
- Ubuntu on Apple PowerBook G4 (PowerPC G4)
- Debian on Sparc (Sun SunBlade 100)
- Fedora on HP ProLiant (AMD Opteron)
- Funtoo on ARMv6j (Raspberry Pi)

** Goals

1. Have a distributor of Ray in the Microsoft(R) Windows(R) world
2. Test the portability of the code on Windows(R)
3. Test Microsoft(R) MPI
4. Test the performance on a Microsoft(R) / Fujitsu(R) / DMR(R) HPC system

** Data sets

Name: SRA001125
Description: E. coli / ILLUMINA / 2008-07-01
Address: http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA001125

Name: SRS011098
Description: Human metagenome sample from G_DNA_Supragingival plaque of a female
participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)"
Address: http://trace.ddbj.nig.ac.jp/DRASearch/sample?acc=SRS011098

** Distributors

Current:

- Geeknet, Inc. (sourceforge)
- GitHub, Inc.
- Calcul Canada, Inc. (Calcul Québec)

Coming soon:

- Software in the Public Interest, Inc. (Debian)
- Canonical, Inc. (Ubuntu)
- Amazon, Inc. (a CloudBioLinux AMI image)

In progress:

- Cray, Inc.
- Red Hat, Inc. (via Fedora(TM))

2012-11-01

New "mini-ranks" hybrid programming model.



Table 1: Comparison of MPI ranks with mini-ranks on the Colosse
super-computer at Laval University.
+-------+---------------------------------------------------+
| Cores | Average round-trip latency (us)                   |
+-------+-----------------------+---------------------------+
|       | MPI ranks             | mini-ranks                |
|       | (pure MPI)            | (MPI + pthread)           |
+-------+-----------------------+---------------------------+
| 8     | 11.25 +/- 0           | 24.1429 +/- 0             |
| 16    | 35.875 +/- 6.92369    | 43.0179 +/- 8.76275       |
| 32    | 66.3125 +/- 6.76387   | 41.7143 +/- 1.23924       |
| 64    | 90 +/- 16.5265        | 37.75 +/- 6.41984         |
| 128   | 126.562 +/- 25.0116   | 43.0179 +/- 8.76275       |
| 256   | 203.637 +/- 67.4579   | 44.6429 +/- 6.11862       |
| 512   |                       |                           |
+-------+-----------------------+---------------------------+