Showing posts from November, 2012

Fetching data from Illumina BaseSpace

We are working to deploy Ray in Illumina BaseSpace.

For our tests, we needed the data on our infrastructure in Québec City.


First, I listed the raw files for the dataset "2x150bp Human Genome in Record Time with the HiSeq 2500":

$ cat RawFiles.txt
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_001.fastq.gz?id=25033024&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_002.fastq.gz?id=25054488&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_001.fastq.gz?id=25081698&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_002.fastq.gz?id=25123588&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_001.fastq.gz?id=25155266&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_002.fastq.gz?id=25175449&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R2_001.fastq…
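
Each entry in RawFiles.txt carries a file name and a numeric id in its query string. The sketch below shows one way to pull those apart before downloading, so that files can be grouped by lane and read. It is a minimal illustration, not the script used for the tests: the function name `describe` and the reading of the `_L…_R…_…` parts as lane, read and chunk are assumptions based only on the names shown above.

```python
import posixpath
import re
from urllib.parse import parse_qs, urlsplit

# Assumed naming pattern, inferred from the listing: ..._L<lane>_R<read>_<chunk>.fastq.gz
FILE_NAME = re.compile(r"_L(?P<lane>\d{3})_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$")


def describe(url):
    """Split one RawFiles.txt entry into file name, id, lane, read and chunk."""
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    query = parse_qs(parts.query)  # the empty appResultId= value is dropped by default
    match = FILE_NAME.search(name)
    if match is None:
        raise ValueError("unexpected file name: " + name)
    return {
        "name": name,
        "id": query["id"][0],
        "lane": int(match.group("lane")),
        "read": int(match.group("read")),
        "chunk": int(match.group("chunk")),
    }


example = describe(
    "https://basespace.illumina.com/sample/262264/files/raw/"
    "sorted_S1_L001_R1_001.fastq.gz?id=25033024&appResultId="
)
print(example)
```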

The life cycle of a scientific manuscript

The manuscript:

Sebastien Boisvert, Frederic Raymond, Elenie Godzaridis, Francois Laviolette and Jacques Corbeil
Ray Meta: scalable de novo metagenome assembly and profiling
Genome Biology

For the authors' contributions, see the paper on the publisher's website.


We started to work on this project on 2012-02-25 according to "git log".
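+------------+-----------------+------------------------------------------------------+
| Date       | Journal         | Event                                                |
+------------+-----------------+------------------------------------------------------+
| 2012-06-07 | Nature Genetics | Manuscript submission                                |
| 2012-06-12 | Nature Genetics | Editorial rejection                                  |
| 2012-06-15 | Nature Methods  | Manuscript submission                                |
| 2012-06-29 | Nature Methods  | Editorial rejection                                  |
| 2012-07-05 | Genome Research | Manuscript submission                                |
| 2012-07-10 | Genome Research | Editorial rejection                                  |
| 2012-07-11 | Genome Biology  | Presubmission enquiry                                |
| 2012-07-18 | Genome Biology  | Positive editorial response to presubmission enquiry |
| 2012-07-24 | Genome Biology  | Manuscript submission                                |
| 2012-09-18 | Genome Biology  | Reception of reviewers' comments from the Editor     |
| 2012-10-24 | Genome Biology  | Revised manuscript submission                        |
| 2012-11-19 |                 |                                                      |
+------------+-----------------+------------------------------------------------------+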

My accomplishments in 2012

Dear readers,

2012 was a busy year.

2012 accomplishments
(starting with the most significant)
- Publication of a scientific paper in the really good journal Genome Biology (impact factor: 9.04)
- Ray Meta, free and open-source software (generalization of Ray heuristics for metagenomes)
- The RayPlatform framework
- Invited participation as an expert at Next Generation Sequence Analysis 2012
- Invited seminar at Argonne National Laboratory
- Invited SciNet Developer Seminars (for computer scientists; for biologists and bioinformaticians) at SciNet in Toronto
- Technical guidance (for director Jacques Corbeil and codirector François Laviolette) for a 1 M$ Genome Canada application, 2012 Bioinformatics and Computational Biology Competition
- Creation of the mini-ranks hybrid programming model (with Fangfang Xia and Rick Stevens)
- Editorial rejection at Nature Genetics (2012-06-12)
- Editorial rejection at Nature Methods (2012-06-29)
- Editorial rejection at Genome Research (2012-07-10)

Agenda for 2013

A paper about mini-ranks

Table 1: Comparison of Ray instances with MPI ranks and mini-ranks.

+---------------------+-----------------------+--------------------------+
| Metric              | Ray with MPI ranks    | Ray with mini-ranks      |
+---------------------+-----------------------+--------------------------+
| Accession           | SRA010766             | SRA010766                |
| Description         | Jay T. Flatley genome | Jay T. Flatley genome    |
| Input files         | 164                   | 164                      |
| Input sequences     | 1593032322            | 1593032322               |
| Compression         | Bzip2                 | Bzip2                    |
| K-mer length        | 21                    | 21                       |
| MPI implementation  | Open-MPI 1.6.2        | Open-MPI 1.6.2           |
| Compiler            | GNU g++ 4.7.0         | GNU g++ 4.7.0            |
| Ray version         | 2.1.0-pre-release     | 2.1.1-devel (9cbf2a8277) |
| RayPlatform version | 1.1.0-pre-release     | 7.0.0-devel (7e38d17e0f) |
| Interconnect        | Mellanox MT26428      | Mellanox MT26428         |
| Processor           |                       |                          |
+---------------------+-----------------------+--------------------------+

Commands for Debian packaging

# build the .deb
dpkg-buildpackage -rfakeroot

# check the produced .deb
lintian ray_2.1.0-1_amd64.deb

# check the .dsc
lintian ray_2.1.0-1.dsc

# check the changes
lintian ray_2.1.0-1_amd64.changes

# add an upstream tarball
pristine-tar commit

Cost Effectiveness Analysis (CEA) of running Ray on Amazon EC2

Sample: SRA001125
URL: http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA001125
DNA reads: 34911784 (2 * 17455892)
Read length (nt): 36
Technology: Illumina Genome Analyzer

API name: m1.large
2 Ray processes
Running time: 05:28:46
Pricing: 0.260 $ / h
Cost: 1.560 $

API name: m3.xlarge
4 Ray processes
Running time: 02:31:34
Pricing: 0.580 $ / h
Cost: 1.730 $

API name: cc2.8xlarge
32 Ray processes
Running time: 00:54:06
Pricing: 2.400 $ / h
Cost: 2.400 $

Conclusions:

1. You get your results faster if you pay more.
2. For cc2.8xlarge, 33% (00:19:40) of the time was spent loading sequences from EBS. That's a lot!
3. Scalability on this problem is limited because the problem size is not very large.
4. Amazon EC2 is really affordable for de novo assemblies of bacterial genomes.

If you want to try these tests yourself => http://github.com/sebhtml/Ray-in-Amazon-EC2-CLOUD
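
The arithmetic behind the cost figures can be sketched in a few lines of Python. This assumes each started hour is billed in full, which the post does not state. Under that assumption the sketch gives 1.56 $ for m1.large and 2.40 $ for cc2.8xlarge, matching the listed costs, and 1.74 $ for m3.xlarge, within 0.01 $ of the listed figure. The helper names `to_seconds` and `billed_cost` are illustrative only.

```python
import math


def to_seconds(hms):
    """Convert an HH:MM:SS running time to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def billed_cost(running_time, price_per_hour):
    """Cost when every started hour is billed in full (an assumption)."""
    billed_hours = math.ceil(to_seconds(running_time) / 3600)
    return billed_hours * price_per_hour


# (API name, Ray processes, running time, price in $ per hour), taken from the post.
runs = [
    ("m1.large", 2, "05:28:46", 0.260),
    ("m3.xlarge", 4, "02:31:34", 0.580),
    ("cc2.8xlarge", 32, "00:54:06", 2.400),
]

baseline = to_seconds(runs[0][2])
for name, processes, running_time, price in runs:
    cost = billed_cost(running_time, price)
    speedup = baseline / to_seconds(running_time)  # relative to m1.large
    print(f"{name}: {processes} processes, cost {cost:.3f} $, speedup {speedup:.2f}x")
```

Comparing the speedup column with the growth in process count puts a number on the scalability remark in the conclusions.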

Comparing fastq compression with gzip, bzip2 and xz

Storage is expensive. Compression is a lossless approach to reduce the storage requirements.
Sébastien Boisvert 2012-11-05

Table 1: Comparison of compression methods on SRR001665_1.fastq -- 10408224 DNA sequences of length 36. Tests were run on Fedora 17 x86_64 with an Intel Core i5-3360M processor and an Intel SSDSC2BW180A3 drive. Tests were not run in parallel. The time is the "real" entry from the time command. Each test was done twice.
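+---------------------------------------------------------------+----------------------+-------------------+
| Compression                                                   | Time                 | Size (bytes)      |
+---------------------------------------------------------------+----------------------+-------------------+
| none                                                          | 0                    | 2085533064 (100%) |
| time cat SRR001665_1.fastq | gzip -9 > SRR001665_1.fastq.gz   | 7m31.519s, 7m20.340s | 376373619 (18%)   |
| time cat SRR001665_1.fastq | bzip2 -9 > SRR001665_1.fastq.bz2 | 3m12.601s, 3m25.243s | 295000140 (14%)   |
| time cat SRR001665_1.fastq | xz -9 > SRR001665_1.fastq.xz     | 32m45.933s           | 257621508 (12%)   |
+---------------------------------------------------------------+----------------------+-------------------+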
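
For readers who want to repeat this kind of comparison without the command line, here is a minimal Python sketch using the standard gzip, bz2 and lzma modules. It is not the procedure used for the table: it compresses a small synthetic FASTQ held in memory (the generator `synthetic_fastq` is hypothetical), so the sizes and times it prints are not comparable to the figures above. The demo uses level 6 to keep memory use modest; pass level=9 to mirror the -9 flag of the commands.

```python
import bz2
import gzip
import lzma
import random
import time


def synthetic_fastq(n_reads, read_length=36, seed=0):
    """Build a small random FASTQ as a stand-in for a real file."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_length))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_length))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compare(data, level=9):
    """Return {method: (compressed size in bytes, seconds)} for each codec."""
    codecs = {
        "gzip": (lambda d: gzip.compress(d, compresslevel=level), gzip.decompress),
        "bzip2": (lambda d: bz2.compress(d, compresslevel=level), bz2.decompress),
        "xz": (lambda d: lzma.compress(d, preset=level), lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        assert decompress(packed) == data  # compression must be lossless
        results[name] = (len(packed), elapsed)
    return results


data = synthetic_fastq(2000)
results = compare(data, level=6)
for name, (size, seconds) in results.items():
    print(f"{name}: {size} bytes ({100 * size / len(data):.0f}%), {seconds:.3f} s")
```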



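Table 2: Decompression tests. Each test was run twice.
+------------------------------------------------------------+----------------------+
| Decompression                                              | Time                 |
+------------------------------------------------------------+----------------------+
| time cat SRR001665_1.fastq.gz | gunzip > SRR001665_1.fastq | 0m14.612s, 0m13.247s |
+------------------------------------------------------------+----------------------+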

ti…

Justification for access to a Microsoft(R) Windows(R) HPC(R) computing system

** Problem

Current DNA sequencers (such as the Illumina(R) HiSeq(R) 2500) produce more than 6,000,000,000 digital DNA sequences, each between 100 and 200 letters long (A, T, C, G), in a single run. One of the possible types of analysis is "de novo genome assembly."

** Software system

My software is called Ray and is written in C++ 1998 (Microsoft(R) Visual Studio(R) 2010 fully supports this standard). An MPI message-passing library is also required. MPICH2 and Open-MPI are the two that are available as binary distributions for Microsoft(R) Windows(R).

Ray is free software (GPLv3 license) and uses the RayPlatform parallel library (license: LGPLv3).

- http://www.ohloh.net/p/ray-assembler
- http://www.ohloh.net/p/rayplatform

- http://denovoassembler.sourceforge.net/


Ray performs de novo assembly of genomes or metagenomes in the life sciences industry (genomics). Ray scales very well for pro…

New "mini-ranks" hybrid programming model.

Table 1: Comparison of MPI ranks with mini-ranks on the Colosse
super-computer at Laval University.
+-------+---------------------------------------------------+
| Cores | Average round-trip latency (us)                   |
+-------+-----------------------+---------------------------+
|       | MPI ranks             | mini-ranks                |
|       | (pure MPI)            | (MPI + pthread)           |
+-------+-----------------------+---------------------------+
| 8     | 11.25 +/- 0           | 24.1429 +/- 0             |
| 16    | 35.875 +/- 6.92369    | 43.0179 +/- 8.76275       |
| 32    | 66.3125 +/- 6.76387   | 41.7143 +/- 1.23924       |
| 64    | 90 +/- 16.5265        | 37.75 +/- 6.41984         |
| 128   | 126.562 +/- 25.0116   | 43.0179 +/- 8.76275       |
| 256   | 203.637 +/- 67.4579   | 44.6429 +/- 6.11862       |
| 512   |                       |                           |
+-------+-----------------------+---------------------------+
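
To read the trend in the table more easily, the averages can be turned into ratios. The sketch below copies the average latencies from the rows above (the empty 512-core row is left out, and the +/- spreads are ignored) and reports, for each core count, how many times lower the mini-ranks latency is. The names `latency_us`, `ratios` and `crossover` are illustrative only.

```python
# Average round-trip latency in microseconds: cores -> (MPI ranks, mini-ranks).
latency_us = {
    8: (11.25, 24.1429),
    16: (35.875, 43.0179),
    32: (66.3125, 41.7143),
    64: (90.0, 37.75),
    128: (126.562, 43.0179),
    256: (203.637, 44.6429),
}


def ratios(table):
    """Ratio of pure-MPI latency to mini-ranks latency for each core count."""
    return {cores: mpi / mini for cores, (mpi, mini) in table.items()}


def crossover(table):
    """Smallest core count at which mini-ranks show the lower average latency."""
    return min(cores for cores, (mpi, mini) in table.items() if mini < mpi)


for cores, ratio in ratios(latency_us).items():
    print(f"{cores} cores: MPI ranks / mini-ranks = {ratio:.2f}")
print("mini-ranks are faster from", crossover(latency_us), "cores")
```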