Posts

Showing posts from January, 2013

Best Amazon EC2 instance type for Ray Cloud Browser on metagenomes and bacterial genomes

Hi,

I am using spot instances on Amazon Elastic Compute Cloud (EC2) to deploy a few installations of Ray Cloud Browser. Initially, I opted for m1.small instances because the Ray Cloud Browser web service that answers HTTP GET API calls was not optimized. Namely, the C++ back-end was memory-mapping a huge binary file (~16-20 GiB): the index, the source of information about the large graph describing a given biological sample. A recent patch improved performance by packing information into every available bit of the binary file, reducing the number of blocks by 75% and thereby speeding things up as well.
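A 75% reduction is what you get from packing each nucleotide into 2 bits instead of a full byte. A minimal sketch of that idea (the actual back-end is C++; the names below are illustrative, not Ray Cloud Browser's API):

```python
# Pack a DNA sequence into 2 bits per nucleotide (4 bases per byte)
# instead of 1 byte per base -- a 75% size reduction.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(sequence):
    """Return a bytearray holding 4 nucleotides per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return packed

def unpack(packed, length):
    """Recover the original sequence of `length` nucleotides."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )

seq = "GATTACA"
encoded = pack(seq)
assert unpack(encoded, len(seq)) == seq
assert len(encoded) == 2  # 7 bases fit in 2 bytes instead of 7
```

The same packing applied to a multi-gigabyte index means fewer blocks to read per API call, which is where the performance gain comes from.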

Recently, there were a lot of peaks in the m1.small spot pricing, and I figured out that my use case was all about bursts -- discrete HTTP API calls.


I then looked at the pricing history for the last 3 months, and these pesky peaks seem to be a 2013 thing.

The t1.micro spot instance pricing history also shows these sharp highs.

With all…

Metagenome lumps & artifactual mutations

Hey !

A recent arXiv manuscript by the group of Professor C. Titus Brown revealed "sequencing artefacts" in metagenomes. In this work, the authors discovered topological features in de Bruijn graphs that they called "lumps."

In the Microbiome40 sample (confidential stuff I guess ;-)), a collaborator observed these metagenome lumps using Ray Cloud Browser (public demo on Amazon EC2; source code on GitHub).

Here are some nice pictures of this system.






Ray assembly plugins are shielded against these artifacts, as the algorithms are well designed.



Another recent paper in Nucleic Acids Research by a group at the Broad Institute carefully documented "artifactual mutations" due to particular events during sample preparation.

Also using Ray Cloud Browser, we observed these topological features in the de Bruijn graph on Illumina MiSeq data, namely the public dataset called "E.Coli-DH10B-2x250".

Here is a nice picture of a bubble -- a branching point in the graph…
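A bubble appears when two nearly identical sequences (e.g., reads differing by a single substitution) produce paths of k-mers that diverge at one node and re-converge a few nodes later. A tiny illustrative sketch, assuming plain string k-mers (not Ray's actual data structures):

```python
# Build a tiny de Bruijn graph (k=4) from two reads that differ at one
# position (T vs A); the single-base difference creates a "bubble".

from collections import defaultdict

def kmer_edges(read, k):
    """Yield (prefix, suffix) edges between overlapping (k-1)-mers."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        yield kmer[:-1], kmer[1:]

reads = ["AAGCTTCGG", "AAGCATCGG"]  # differ only at position 4
graph = defaultdict(set)
for read in reads:
    for u, v in kmer_edges(read, 4):
        graph[u].add(v)

# The divergence point of the bubble is the node with two successors.
branches = [node for node, succ in graph.items() if len(succ) > 1]
print(branches)  # -> ['AGC']
```

The two paths re-converge downstream (here at the node with two predecessors), which is exactly the topological signature one sees when browsing such a region.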

Visualizing autonomous components in de Bruijn graphs

In Ray Cloud Browser (demo on Amazon EC2), I just found some interesting bits: autonomous components occurring in free form in a de Bruijn graph. You can see these yourself by going to the location displayed in the figures below.







I am working on these strange things in the graph because those useless parts consume a sizable amount of compute time. So instead of patching the problem downstream (i.e., removing the dust), I decided to write Ray Cloud Browser to fix things at the right place (i.e., removing the source of the dust rather than the dust it produces).
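For reference, the downstream patch would look something like this: compute connected components and discard the small ones. A minimal sketch on an undirected adjacency map (illustrative only, not Ray's implementation):

```python
# Find connected components in an undirected graph and drop the small
# ones ("dust"), keeping only components of at least `min_size` nodes.

from collections import deque

def connected_components(adjacency):
    """Yield each connected component as a set of nodes (BFS)."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in component:
                continue
            component.add(node)
            queue.extend(adjacency[node])
        seen |= component
        yield component

def remove_dust(adjacency, min_size):
    """Keep only nodes whose component has >= min_size nodes."""
    keep = set()
    for component in connected_components(adjacency):
        if len(component) >= min_size:
            keep |= component
    return {n: adjacency[n] & keep for n in keep}

graph = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},  # main component (3 nodes)
    "x": {"y"}, "y": {"x"},                   # dust (2 nodes)
}
cleaned = remove_dust(graph, min_size=3)
print(sorted(cleaned))  # -> ['a', 'b', 'c']
```

Fixing the source of the dust is better precisely because this kind of sweep has to traverse the whole graph every time.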

My roadmap for 2013

Table 1: Roadmap for year 2013.

in the future (waiting for co-author comments; initially 2013-02-15, then 2013-06-04): Manuscript (Godzaridis et al.) submission to Big Data
2013-07-15: Send my thesis to director and co-director for review
2013-08-15: Initial thesis submission
2013-08-31: End of Banting & Best doctoral scholarship (director will pay me up to Dec. 2013)
2013-10-01: Canadian Institutes of Health Research Fellowship application (2012-2013)
2013-10-03: Louis-Berlinguet postdoctoral research scholarship application (2012-2013)
2013-11-01: Banting Postdoctoral Fellowship application (2012-2013)
2013-11-15: Doctoral oral defence
2013-12-15: Final thesis submission
2014-04-15: Starting post-doc with Rick Stevens





The Ray Cloud Browser paper also needs to be in there -- but the super enhanced visualization zombie version.

Edits:

2013-01-27 moved defence from 2013-09-15 to 2013-11-15
2013-01-27 added zombie version

Measuring latency in the cloud

Hey,

I started 10 cc2.8xlarge instances on Amazon EC2 using the spot instance market.

The cc2.8xlarge spot rate sits at 0.270 $/h. The on-demand rate is 2.400 $/h.
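At those rates, the spot market is a large saving; a quick check of the arithmetic:

```python
# Spot vs. on-demand cost for 10 cc2.8xlarge instances for one hour.
spot_rate = 0.270       # $ per instance-hour (spot, at the time)
on_demand_rate = 2.400  # $ per instance-hour (on-demand)
instances = 10

spot_cost = spot_rate * instances
on_demand_cost = on_demand_rate * instances
savings = 1 - spot_rate / on_demand_rate  # fraction saved per hour

print(f"spot: ${spot_cost:.2f}/h, on-demand: ${on_demand_cost:.2f}/h")
print(f"savings: {savings:.1%}")
```

That is roughly an 89% discount per instance-hour, with the usual spot caveat that instances can be reclaimed when the market price spikes.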

Results

Table 1: Cloud latencies. The number of instances was 10 and the instance type was cc2.8xlarge.

MPI ranks   MPI ranks per instance   Average roundtrip latency (microseconds)
 10          1                       235.200785
 20          2                       313.960899
 40          4                       365.403384
 80          8                       473.863469
120         12                       563.779653
160         16                       322.942884
200         20                       258.164757
250         25                       220.151894
320         32                       280.532563
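The numbers above come from roundtrip (ping-pong) measurements between MPI ranks: send a small message, wait for the echo, and average over many iterations. The measurement technique can be sketched with plain TCP sockets on localhost (illustrative only; the table was measured with MPI messages, not sockets, so the magnitudes differ):

```python
# Measure average roundtrip latency with a ping-pong loop: send a small
# message, wait for the echo, and time many iterations.

import socket
import threading
import time

def echo_server(sock):
    """Echo back every 8-byte message until the peer closes."""
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(8)
            if not data:
                break
            conn.sendall(data)

server = socket.socket()
server.bind(("127.0.0.1", 0))  # pick a free port
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    client.sendall(b"ping0000")
    client.recv(8)  # wait for the echo before starting the next roundtrip
elapsed = time.perf_counter() - start
client.close()

average_us = elapsed / iterations * 1e6
print(f"average roundtrip latency: {average_us:.1f} microseconds")
```

Because only one message is in flight at a time, each iteration times exactly one full roundtrip, which is what the table reports per pair of MPI ranks.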


Instances
The specification of 1 cc2.8xlarge is:

Cluster Compute Eight Extra Large Instance

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No*
API name: cc2.8xlarge


(that's 32 vCPUs with hyperthreading!)

These are the 10 instances in my placement group:

i-b078b9c0: ec2-17…

Assembly of a human genome with latest sequencing technologies

So Illumina released a nice dataset of a human genome sequenced with the latest improvements to their sequencing products.

The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides.
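From the read count and read length, the expected depth of coverage works out to roughly 55x, assuming a ~3.2 Gbp human genome (the genome size here is my assumption, not part of the dataset description):

```python
# Expected coverage = total sequenced bases / genome size.
reads = 1_171_357_300
read_length = 151          # nucleotides per read
genome_size = 3.2e9        # assumed human genome size in bp

total_bases = reads * read_length
coverage = total_bases / genome_size
print(f"{total_bases / 1e9:.1f} Gbp sequenced, ~{coverage:.0f}x coverage")
```

That depth is comfortably above what de Bruijn graph assemblers typically need, which is part of why the assembly below works well.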

Using Ray with the polytope virtual network in Ray Platform, I produced a pretty good assembly in a timely way. The job ran on 64 nodes (512 cores), using 512 MPI ranks.

$ cat Ray-polytope.sh
#!/bin/bash

#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M sebastien.boisvert.3@ulaval.ca
#PBS -m bea
cd $PBS_O_WORKDIR

source /rap/nne-790-ab/software/NGS-Pipelines/LoadModules.sh

mpiexec -n 512 \
    -bind-to-core -bynode \
    Ray \
    -route-messages -connection-type polytope -routing-graph-degree 21 \
    -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
    -k 31 \
    -p \
    HiSeq-2500-NA12…

1 human genome, 1 Amazon EC2 instance, x days

Today we are trying to assemble 1 human genome on 1 Amazon EC2 instance using SSD disks. We pay by the hour -- 3.1 $ per hour, to be exact.

Colors
black -> just text explaining stuff
purple -> commands to type on your laptop as a normal user with ec2-api-tools configured (I use a Lenovo X230 with Fedora)
red -> commands to type on the instance as root
blue -> commands to type on the instance as ec2-user

High I/O Quadruple Extra Large Instance
60.5 GiB of memory
35 EC2 Compute Units (16 virtual cores*)
2 SSD-based volumes each with 1024 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Storage I/O Performance: Very High**
EBS-Optimized Available: No***
API name: hi1.4xlarge

Blog post

Pricing (on-demand for Linux/UNIX usage)

$3.100 per Hour
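Since billing is hourly, the total bill for "x days" is simple arithmetic; a quick hypothetical example for a few run lengths:

```python
# On-demand cost of one hi1.4xlarge instance running for x days.
hourly_rate = 3.10  # $ per hour (on-demand, at the time)

def cost_for_days(days):
    """Total cost in dollars, billed hourly."""
    return hourly_rate * 24 * days

for days in (1, 2, 4):
    print(f"{days} day(s): ${cost_for_days(days):.2f}")
```

So even a multi-day assembly on this instance stays in the low hundreds of dollars, which is the point of the experiment.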


Starting the instance
1..2..3.. let's start!

$ ec2-authorize default -p 22

$ ec2-run-instances ami-08249861 -t hi1.4xlarge -n 1 -b /dev/sdf=ephemeral0 -b /dev/sdg=ephemeral1 -k GoldThor -g de…