Showing posts from January, 2013

Best Amazon EC2 instance type for Ray Cloud Browser on metagenomes and bacterial genomes

Hi, I am using spot instances on Amazon Elastic Compute Cloud (EC2) to deploy a few installations of Ray Cloud Browser. Initially, I opted for m1.small instances because the Ray Cloud Browser web service, which answers a bunch of HTTP GET API calls, was not optimized. Namely, the C++ back-end code was memory-mapping a huge file (~16-20 GiB). This bloated binary file was the index, that is, the source of information about the huge graph describing a given biological sample. A recent patch packs information into every available bit of the binary file, which reduces the number of blocks by 75% and improves performance as well. Recently, there were a lot of peaks in the m1.small spot pricing, and I figured out that my use case was all about bursts (discrete HTTP API calls). I then looked at the pricing history for the last 3 months, and these pesky peaks seem to be a 2013 thing. The t1.micro spot instance pricing history also has these sophisticated highs.

Metagenome lumps & artifactual mutations

Hey! A recent arXiv manuscript by the group of Professor C. Titus Brown revealed "sequencing artefacts" in metagenomes. In this work, the authors discovered topological features in de Bruijn graphs that they called "lumps." In the Microbiome40 sample (confidential stuff I guess ;-)), a collaborator observed these metagenome lumps using Ray Cloud Browser (public demo on Amazon EC2; source code on GitHub). Here are some nice pictures of this system. Ray assembly plugins are shielded against these as the algorithms are well-designed. Another recent paper in Nucleic Acids Research by a group at the Broad Institute carefully documented "artifactual mutations" due to particular events during sample preparation. Also using Ray Cloud Browser, we observed these topological features in the de Bruijn graph on Illumina MiSeq data, namely the public dataset called "E.Coli-DH10B-2x250". Here is a nice picture of a bubble

Visualizing autonomous components in de Bruijn graphs

In Ray Cloud Browser (demo on Amazon EC2), I just found some interesting bits: autonomous components occurring in free form in a de Bruijn graph. You can see these yourself by going to the location displayed in the figures below. I am working on these strange things in the graph because those useless parts consume a sizable amount of compute time. So instead of patching this problem downstream (i.e., removing the dust), I decided to write Ray Cloud Browser to fix things in the right places (i.e., removing the source of the dust instead of the dust being produced).

My roadmap for 2013

Table 1: Roadmap for year 2013.

2013-02-15 2013-06-04 in the future (waiting for co-author comments)  Manuscript (Godzaridis et al.) submission to Big Data
2013-07-15  Send my thesis to director and codirector for review.
2013-08-15  Initial thesis submission
2013-08-31  End of Banting & Best doctoral scholarship (director will pay me up to Dec. 2013)
2013-10-01  Canadian Institutes of Health Research Fellowship application (2012-2013)
2013-10-03  Louis-Berlinguet postdoctoral research scholarship application (2012-2013)
2013-11-01  Banting Postdoctoral Fellowship application (2012-2013)
2013-11-15  Doctoral oral defence
2013-12-15  Final thesis submission
2014-04-15  Starting post-doc with Rick Stevens

Measuring latency in the cloud

Hey, I started 10 cc2.8xlarge instances in Amazon EC2 using the spot instance market. The cc2.8xlarge hourly spot rate sits at 0.270 $ / h. The on-demand rate is 2.400 $ / h.

Results

Table 1: Cloud latencies. Number of instances was 10 and the instance type was cc2.8xlarge.

MPI ranks   MPI ranks per instance   Average roundtrip latency (microseconds)
10          1                        235.200785
20          2                        313.960899
40          4                        365.403384
80          8                        473.863469
120         12                       563.779653
160         16                       322.942884
200         20                       258.164757
250         25                       220.151894
320         32                       280.532563

Instances

The specification of 1 cc2.8xlarge is: Cluster Compute Eight Extra Large Instance 60.5 GiB of memo
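To put the two hourly rates quoted above side by side, here is a small shell sketch of the arithmetic for the 10-instance cluster. Only the two rates and the instance count come from the post; the function names are made up for illustration, and prices are held in thousandths of a dollar so that plain integer arithmetic is enough.

```shell
#!/bin/sh
# Prices in thousandths of a dollar per instance-hour, taken from the post.
SPOT_MILLI=270        # 0.270 $ / h (spot)
ON_DEMAND_MILLI=2400  # 2.400 $ / h (on-demand)
INSTANCES=10

# hourly_cost_milli PRICE_MILLI COUNT -> cluster cost per hour, in thousandths of a dollar
hourly_cost_milli() {
    echo $(( $1 * $2 ))
}

# savings_percent SPOT_MILLI ON_DEMAND_MILLI -> integer percent saved, rounded down
savings_percent() {
    echo $(( ($2 - $1) * 100 / $2 ))
}

# format_dollars MILLI -> dollars with three decimals, e.g. 2700 -> 2.700
format_dollars() {
    printf '%d.%03d\n' $(( $1 / 1000 )) $(( $1 % 1000 ))
}

echo "spot:      $(format_dollars "$(hourly_cost_milli "$SPOT_MILLI" "$INSTANCES")") \$ / h"
echo "on-demand: $(format_dollars "$(hourly_cost_milli "$ON_DEMAND_MILLI" "$INSTANCES")") \$ / h"
echo "saved:     about $(savings_percent "$SPOT_MILLI" "$ON_DEMAND_MILLI") %"
```

At these rates the whole cluster costs 2.700 $ / h on the spot market against 24.000 $ / h on demand.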

Assembly of a human genome with latest sequencing technologies

So Illumina released a nice dataset of a human genome with their latest improvements in their sequencing products. The dataset HiSeq-2500-NA12878-demo-2x150 has 1171357300 reads, each with 151 nucleotides. Using Ray with the polytope virtual network in Ray Platform, I obtained a pretty good assembly in a timely way. The job ran on 64 nodes (512 cores), using 512 MPI ranks.

$ cat
#!/bin/bash
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-2013-01-18-2
#PBS -o HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-2013-01-18-2.stderr
#PBS -A nne-790-ac
#PBS -l walltime=48:00:00
#PBS -l qos=SPJ1024
#PBS -l nodes=64:ppn=8
#PBS -M
#PBS -m bea
cd $PBS_O_WORKDIR
source /rap/nne-790-ab/software/NGS-Pipelines/
mpiexec -n 512 \
 -bind-to-core -bynode \
 Ray \
 -route-messages -connection-type polytope -routing-graph-degree 21 \
 -o \
 HiSeq-2500-NA12878-demo-2x150-2013-01-18-2 \
 -k \
 31
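As a quick sanity check on the numbers in this job, the sketch below verifies that the rank count passed to mpiexec equals nodes times processes per node, and computes how many nucleotides the dataset holds. All values are copied from the post; the function names are only illustrative, and the multiplication assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Values copied from the post.
NODES=64
PPN=8            # processes per node, from "#PBS -l nodes=64:ppn=8"
MPI_RANKS=512    # from "mpiexec -n 512"
READS=1171357300
READ_LENGTH=151

# ranks_available NODES PPN -> number of cores reserved by the scheduler
ranks_available() {
    echo $(( $1 * $2 ))
}

# total_nucleotides READS READ_LENGTH -> number of sequenced nucleotides
# (the result does not fit in 32 bits, so 64-bit shell arithmetic is assumed)
total_nucleotides() {
    echo $(( $1 * $2 ))
}

if [ "$(ranks_available "$NODES" "$PPN")" -eq "$MPI_RANKS" ]; then
    echo "mpiexec -n $MPI_RANKS matches nodes=$NODES:ppn=$PPN"
else
    echo "rank count does not match the reserved cores" >&2
fi
echo "nucleotides in the dataset: $(total_nucleotides "$READS" "$READ_LENGTH")"
```

The dataset therefore holds a little under 177 billion nucleotides, spread over 512 ranks.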

1 human genome, 1 Amazon EC2 instance, x days

Today we are trying to assemble 1 human genome on 1 Amazon EC2 instance using SSD disks. We pay by the hour, 3.1 $ per hour to be exact.

Colors
black -> just text explaining stuff
purple -> commands to type on your laptop as a normal user with ec2-api-tools configured (I use a Lenovo X230 with Fedora)
red -> commands to type on the instance as root
blue -> commands to type on the instance as ec2-user

High I/O Quadruple Extra Large Instance
60.5 GiB of memory
35 EC2 Compute Units (16 virtual cores*)
2 SSD-based volumes each with 1024 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Storage I/O Performance: Very High**
EBS-Optimized Available: No***
API name: hi1.4xlarge
Blog post
Pricing (on-demand for Linux/UNIX usage) $3.100 per Hour

Starting the instance
1..2..3.. let's start!
$ ec2-authorize default -p 22
$ ec2-run-instances ami-08249861 -t hi1.4xlarge -n 1 -b /dev/sdf=
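Since the bill grows with every hour the instance stays up, here is a rough shell sketch of what "x days" would cost at the 3.100 $ per hour quoted above. The rate comes from the post; the rule that a started hour counts as a full hour is an assumption about hourly billing, and the function names are hypothetical.

```shell
#!/bin/sh
RATE_MILLI=3100   # 3.100 $ per hour, in thousandths of a dollar (from the post)

# billed_hours MINUTES -> hours charged, assuming a started hour counts as a full hour
billed_hours() {
    echo $(( ($1 + 59) / 60 ))
}

# bill_milli MINUTES -> total bill in thousandths of a dollar
bill_milli() {
    echo $(( $(billed_hours "$1") * $RATE_MILLI ))
}

# format_dollars MILLI -> dollars with three decimals, e.g. 74400 -> 74.400
format_dollars() {
    printf '%d.%03d\n' $(( $1 / 1000 )) $(( $1 % 1000 ))
}

# Print the bill for 1, 2 and 3 full days of runtime.
for days in 1 2 3; do
    minutes=$(( days * 24 * 60 ))
    echo "$days day(s): $(format_dollars "$(bill_milli "$minutes")") \$"
done
```

One full day comes to 74.400 $, so the final bill depends mostly on how many days the assembly takes.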