Showing posts from December, 2012

The source code of SOAPdenovo2 sits in the shadows

Update 2012-12-28: SOAPdenovo 2.04-r223 source code was posted online on 2012-12-28. Some minor concerns:

1. The tarball suffers from the bundled library problem (libbam.a and libbammac.a are shipped precompiled). This is incompatible with the GPLv3 license.

2. The Makefile is not portable with its hard-coded paths to compilers (/opt/blc/gcc-4.5.0/bin/gcc should just be gcc).


The SOAPdenovo2 article was published today under the terms of the Creative Commons Attribution License. Here are the Availability and requirements for SOAPdenovo2 as reported in the article:

Availability and requirements

Project name: SOAPdenovo2
Project home page: http://soapdenovo2.sourceforge.net/
Operating system(s): e.g. Platform independent
Programming language: C, C++
Other requirements: GCC version ≥ 4.5.0
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Contact: bgi-soap@googlegroups.com




I fired up my browser and in a flash went to the "Project home page"…

Re: Miseq 2x250 – Does Length Really Matter?

Justin Johnson at Edge BioSystems, Inc. posted assemblies done with CLCBio for fresh MiSeq data. Here are some assemblies with Ray for the same data.





Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            67.1      227.5     37.4       123    01:06            351 +/- 19
2x150   51            95.5      296.6     48.3       94     01:13            395 +/- 54
2x150   61            87.5      326.8     44.0       104    01:19            408 +/- 58
2x250   31            97.8      327.1     51.0       89     …

Plant genome (white spruce), Illumina HiSeq 2000, IBM Blue Gene/Q, and Ray

The SRA056234 dataset contains reads for Picea glauca (white spruce). The reads were obtained with the Illumina HiSeq 2000. It's 2.8 TiB of uncompressed fastq files.

$ du -sh blocks/
2.8T    blocks/

We are using Ray on an IBM Blue Gene/Q. In particular, we are using 512 nodes with 1 IBM PowerPC A2 processor and 16 GiB of DDR3 memory per node. Each processor has 16 cores, and each core has 4 threads.

Because of the memory limit per core, we are only using 16 MPI ranks per node for the time being. Therefore, we are using 16384 Ray processes each with a maximum of 1 GiB of memory on the Blue Gene/Q. We have 16 TiB of distributed memory.

First, we can list the Ray plugins we are using.

$ ls -1 SRA056234-Picea-glauca-2012-12-18-4/Plugins/
plugin_Amos.txt
plugin_CoverageGatherer.txt
plugin_DummySun.txt
plugin_EdgePurger.txt
plugin_FusionData.txt
plugin_FusionTaskCreator.txt
plugin_GeneOntology.txt
plugin_GenomeNeighbourhood.txt
plugin_JoinerTaskCreator.txt
plugin_KmerAcademyBuilder.txt
plugin_Libra…

Provisioning secure cloud instances locally

Hello,

I decided to deploy Ray Cloud Browser on a local virtual machine.

This guide is for installing and using libvirt on Fedora 17.

The guest instance in this tutorial is OpenBSD 5.2 as it is very lightweight and secure.


To use NAT with a bridge that has no interface, virsh net-start must be run as root.

Connecting to the host (Fedora 17)
First we connect to ls31 (Fedora 17). ls31 is the host on which we will run domain 0. In our case, ls31 has 64 VCPUs (AMD Opteron(TM) Processor 6272) and 128 GiB of memory.

 $ ssh -lboiseb01 192.168.3.31 -X

Then, we need to install virtualization packages.

 $ sudo su

Installing packages
 # yum install -y @virtualization

We will put disks in /kvm/img and cdroms in /kvm/iso.

Installing the guest (OpenBSD 5.2)
 # mkdir -p /kvm/{iso,img}

Then we install the virtual machine with 1024 MiB of memory, an 8 GiB disk, and graphics.
We'll remove the graphics once ssh is working.

 # virt-install --name test-1 --ram 1024 --disk path=/kvm/img/test-1.img,size=8…

Showcasing a pre-alpha version of Ray Cloud Browser

Dear genomic enthusiasts,

Explaining the de Bruijn graph -- or the de novo assembly process for that matter -- to people can be a daunting task. All biologists have a web browser ready to fire up at any time. Furthermore, all modern browsers support HTML5 -- a way of making nice portable user interfaces.

Ray Cloud Browser is a visualizer for genomic data. But unlike classical genome browsers, Ray Cloud Browser is dynamic, and you can move things with energy if you want to.

The current version is really pre-alpha, but the hardest-to-implement core features are there. The client is in JavaScript (ECMAScript). The web server is Apache httpd, but any web server will do. The server-side application code is in C++ 1998, and runs in Apache httpd using the standard CGI 1.1 (Common Gateway Interface).

The stateful HTML5 client provides the graph layout engine, the physics engine, the rendering engine, the active-object engine, and the communication engine. All the client code was written from scra…

Optimizing genomic alignment workloads

In some projects, there are a lot of files. Each of these files contains DNA sequences, and the number of sequences can differ from file to file. The workflow is to start several jobs, each of which aligns the sequences for a subset of the files.
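One way to picture the balancing problem is the sketch below. It is only an illustration, not necessarily the approach this post goes on to describe: it assumes that alignment time grows with the number of sequences, and it uses a greedy "largest file first" heuristic that always hands the next file to the job with the fewest sequences so far. The function name balance_files and the choice of 3 jobs are hypothetical; the counts are those reported in this post for the first read files of sample SRA000271.

```python
import heapq


def balance_files(sequence_counts, job_count):
    """Assign files to jobs greedily, largest file first.

    sequence_counts maps a file name to its number of sequences.
    Returns (jobs, totals): the file names per job and the number of
    sequences per job.
    """
    # Min-heap of (sequences assigned so far, job index): the least
    # loaded job is always popped first.
    heap = [(0, job) for job in range(job_count)]
    heapq.heapify(heap)
    jobs = [[] for _ in range(job_count)]

    # Visit files from the largest to the smallest (ties broken by name).
    ordered = sorted(sequence_counts.items(),
                     key=lambda item: (-item[1], item[0]))
    for name, count in ordered:
        total, job = heapq.heappop(heap)
        jobs[job].append(name)
        heapq.heappush(heap, (total + count, job))

    totals = [sum(sequence_counts[name] for name in files) for files in jobs]
    return jobs, totals


# Sequence counts for the first read files of sample SRA000271.
counts = {
    "SRR002271_1.fastq.bz2": 22243273,
    "SRR002272_1.fastq.bz2": 35756808,
    "SRR002273_1.fastq.bz2": 4276214,
    "SRR002274_1.fastq.bz2": 18095255,
    "SRR002275_1.fastq.bz2": 33729638,
    "SRR002276_1.fastq.bz2": 47074312,
}

# Hypothetical setup: 3 alignment jobs.
jobs, totals = balance_files(counts, 3)
for files, total in zip(jobs, totals):
    print(total, " ".join(files))
```

Splitting the files by name alone would pair the two largest files in one job; weighting by sequence count keeps the jobs closer in size, so no single job dominates the wall-clock time.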

As an example, see the table below that contains files for the public human sample SRA000271.

Table 1: Number of sequences per file for sample SRA000271.


File                   Sequences
SRR002271_1.fastq.bz2  22243273
SRR002271_2.fastq.bz2  22243273
SRR002272_1.fastq.bz2  35756808
SRR002272_2.fastq.bz2  35756808
SRR002273_1.fastq.bz2  4276214
SRR002273_2.fastq.bz2  4276214
SRR002274_1.fastq.bz2  18095255
SRR002274_2.fastq.bz2  18095255
SRR002275_1.fastq.bz2  33729638
SRR002275_2.fastq.bz2  33729638
SRR002276_1.fastq.bz2  47074312
SRR002276_2.fastq.bz2  47074312
SRR002277_1.fastq.bz2  6757955
SRR002277_2.fastq.bz2  6757955
SRR002278_1.fastq.bz2  6093595
SRR002278_2.fastq.bz2  6093595
SRR002279_1.fastq.bz2  7177292
SRR002279_2.fastq.bz2  7177292
SRR002280_1.fastq.bz2  6580048
SRR002280_2.fastq.bz2  6580048
SRR002281_…