Computing an assembly in the cloud
I have been working on Ray for quite some time now. When people hear that Ray is distributed, meaning that several Ray processes work together on a single genome or metagenome assembly, some of them reply that the future of bioinformatics is on the desktop, not on supercomputers.
I agree that not everyone is well-versed in the art of supercomputers. But given the ridiculously high yields of DNA sequencers, supercomputers are not out of the question. Perhaps it is the desktop that should go away in bioinformatics.
Flowcells share more with clouds than with desktops, I think.
That is where the cloud enters the game. Cloud providers such as Amazon Elastic Compute Cloud (EC2, my favorite), Microsoft Azure or Rackspace sell instance-hours. One instance-hour gives you access to a single, possibly virtual, computer for one hour.
The trick is to give the user an interface that hides the high performance computing, while the backend still runs in the cloud for efficiency. That's the idea behind Stanford's DNAnexus, Illumina BaseSpace, GenomeSpace (by The Broad Institute), or Galaxy.
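To make the instance-hour unit concrete, here is a minimal sketch of the arithmetic. It assumes that every started hour is billed as a full hour on each instance; that billing rule and the function name instance_hours are my assumptions for illustration, so check your provider's current terms.

```shell
#!/bin/sh
# instance_hours INSTANCES SECONDS
# Prints the number of instance-hours consumed, ASSUMING that each
# instance is charged for every started hour (partial hours round up).
instance_hours() {
    instances=$1
    seconds=$2
    # Round the duration up to a whole number of hours.
    hours=$(( (seconds + 3599) / 3600 ))
    echo $(( instances * hours ))
}

instance_hours 1 170     # one instance for 170 seconds -> prints 1
instance_hours 4 5400    # four instances for 1.5 hours -> prints 8
```

Under this assumption a very short job costs as much as a job that keeps the instance busy for the whole hour.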
Obviously, all the existing tools need to be ported to the cloud. Otherwise, you just reinvent the same old wheel. Porting to the cloud, perhaps, means using a lot of cores because it is cheap and you pay as you go. For example, on an Amazon EC2 instance with 8 EC2 Compute Units (ECU), it is preferable to launch an 8-core job than a job that uses only 1 core. That way, you get what you paid for and the job finishes earlier. You don't waste precious cycles on idle processor cores.
At the moment, Amazon is attracting new users with a lot of free instance-hours. The free hours apply only to on-demand micro instances. An Amazon micro instance, internally named t1.micro, provides 613 megabytes of memory and 1 EC2 Compute Unit (you see 1 processor core in Linux). One EC2 Compute Unit is equivalent to one 1.0-1.2 GHz processor core (2007 AMD Opteron or 2007 Intel Xeon).
Just to test Ray, which uses the RayPlatform engine, on Amazon, I assembled 20000 paired reads into a 70 kb assembly in the cloud using one free micro instance. It took 2 minutes and 50 seconds, and memory usage peaked at 65 megabytes.
What I needed to do was quite simple: log in to Amazon, start one micro instance, connect to it, install Open-MPI, g++ and Ray, launch the job, transfer my results and finally stop the instance. All of this was done either in my browser or in my terminal.
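As a rough sketch of "one process per core", the script below asks Linux how many cores it sees and prints the matching mpiexec command line. The helper names core_count and ray_command are hypothetical, and the Ray arguments are simply copied from the command shown later in this post; nothing is executed, the command is only printed.

```shell
#!/bin/sh
# Print the number of online processor cores, falling back to 1
# when the value cannot be determined.
core_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Print (without running it) the mpiexec command for a given number
# of processes. The Ray arguments are the ones used in this post.
ray_command() {
    echo "mpiexec -n $1 ./RayApp/Ray -p Sample/_1.fasta Sample/_2.fasta -o AmazonBuild"
}

# One MPI process per core that Linux reports on this instance.
ray_command "$(core_count)"
```

On a micro instance this prints a command with -n 1; on an 8-core instance it would print -n 8, so no paid core sits idle.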
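The terminal part of those steps can be sketched roughly as follows. This is not a transcript of my session: the Ubuntu package names are assumptions, the step that downloads and builds Ray is left out because it depends on the release you use, and starting and stopping the instance happens in the browser. By default the script only prints each command; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Key file and host name as used in the commands at the end of this post.
KEY=BlackMesa.pem
HOST=ubuntu@ec2-23-22-108-139.compute-1.amazonaws.com

# Print a command in dry-run mode, otherwise execute it.
step() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run_session() {
    # Install the compiler and Open-MPI (assumed Ubuntu package names).
    step ssh -i "$KEY" "$HOST" "sudo apt-get update && sudo apt-get install -y g++ make openmpi-bin libopenmpi-dev"
    # (Downloading and building Ray goes here; omitted in this sketch.)
    # Launch the assembly on the instance with a single process.
    step ssh -i "$KEY" "$HOST" "mpiexec -n 1 ./RayApp/Ray -p Sample/_1.fasta Sample/_2.fasta -o AmazonBuild"
    # Copy the output directory back to the local machine.
    step scp -i "$KEY" -r "$HOST:AmazonBuild" .
}

run_session
```

Printing the commands first makes it easy to check the host name and key before anything touches the instance.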
My next post will be about running Ray on 4 free Amazon micro instances, which means passing messages between them.
Conflict of interest
I have no ties, financial or ideological, to any of the cloud providers.
Connecting to the clouds
ssh -i BlackMesa.pem ubuntu@ec2-23-22-108-139.compute-1.amazonaws.com
Assembling genomes
mpiexec -n 1 ./RayApp/Ray -p Sample/_1.fasta Sample/_2.fasta -o AmazonBuild