Showing posts from 2012

The source code of SOAPdenovo2 sits in the shadows

Update 2012-12-28: the SOAPdenovo 2.04-r223 source code was posted online on 2012-12-28. Some minor concerns:

1. The tarball suffers from the bundled-library problem: libbam.a and libbammac.a are shipped precompiled. This is incompatible with the GPLv3 license.
2. The Makefile is not portable, with its hard-coded paths to compilers (/opt/blc/gcc-4.5.0/bin/gcc should just be gcc).

The SOAPdenovo2 article was published today under the terms of the Creative Commons Attribution License. Here are the availability and requirements for SOAPdenovo2 as reported in the article:

Project name: SOAPdenovo2
Project home page:
Operating system(s): e.g. Platform independent
Programming language: C, C++
Other requirements: GCC version ≥ 4.5.0
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Contact:

I fired up my browser and in a
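The hard-coded compiler path is easy to work around without patching anything, because GNU make lets command-line variable assignments override assignments made inside a Makefile. A small demonstration with a toy Makefile (hypothetical, not the actual SOAPdenovo2 one):

```shell
# Toy Makefile that hard-codes a compiler path, like the SOAPdenovo2 tarball does.
printf 'CC = /opt/blc/gcc-4.5.0/bin/gcc\nshow:\n\t@echo using $(CC)\n' > Makefile.demo

# A command-line assignment takes precedence over the one in the Makefile.
make -f Makefile.demo show CC=gcc
# prints: using gcc
```

A portable Makefile would simply set `CC = gcc` in the first place; users who really want a specific compiler can still override it the same way at build time.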

Re: Miseq 2x250 – Does Length Really Matter?

Justin Johnson at Edge BioSystems, Inc. posted assemblies done with CLCBio for fresh MiSeq data. Here are some assemblies with Ray for the same data.

Table 1: Metrics for contigs >= 500 b with Ray for samples EdgeBio-MiSeq-E.Coli-DH10B-150x2 and EdgeBio-MiSeq-E.Coli-DH10B-250x2. *, **

Sample  K-mer length  N50 (kb)  Max (kb)  Mean (kb)  Count  Time (hour:min)  Max mem. usage per core (MiB)
2x150   31            67.1      227.5     37.4       123    01:06            351 +/- 19
2x150   51            95.5      296.6     48.3       94     01:13            395 +/- 54
2x150   61            87.5      326.8     44.0       104    01:19            4
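The N50 column is the usual contig-length summary statistic: sort the contig lengths in decreasing order and report the length at which the cumulative sum first reaches half of the total assembly size. A quick sketch with toy contig lengths (not the assemblies above):

```shell
# N50 of a toy set of contig lengths.
# Sort decreasing, then walk until the running sum covers half the total.
printf '%s\n' 8000 5000 4000 2000 1000 | sort -nr | awk '
{ len[NR] = $1; total += $1 }
END {
  sum = 0
  for (i = 1; i <= NR; i++) {
    sum += len[i]
    if (2 * sum >= total) { print "N50:", len[i]; exit }
  }
}'
# prints: N50: 5000
```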

Plant genome (white spruce), Illumina HiSeq 2000, IBM Blue Gene/Q, and Ray

The SRA056234 dataset contains reads for Picea glauca (white spruce). The reads were obtained with the Illumina HiSeq 2000. It's 2.8 TiB of uncompressed fastq files.

$ du -sh blocks/
2.8T    blocks/

We are using Ray on an IBM Blue Gene/Q. In particular, we are using 512 nodes with 1 IBM PowerPC A2 processor and 16 GiB of DDR3 memory per node. Each processor has 16 cores, and each core has 4 threads. Because of the memory limitation per core, we are only using 16 MPI ranks per node for the time being. Therefore, we are using 8192 Ray processes, each with a maximum of 1 GiB of memory, on the Blue Gene/Q. We have 8 TiB of distributed memory.

First, we can list the Ray plugins we are using.

$ ls -1 SRA056234-Picea-glauca-2012-12-18-4/Plugins/
plugin_Amos.txt
plugin_CoverageGatherer.txt
plugin_DummySun.txt
plugin_EdgePurger.txt
plugin_FusionData.txt
plugin_FusionTaskCreator.txt
plugin_GeneOntology.txt
plugin_GenomeNeighbourhood.txt
plugin_JoinerTaskCreator.txt
plugin_KmerAcademyB
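The job geometry can be sanity-checked quickly from the per-node figures (512 nodes, 16 MPI ranks per node, 16 GiB of memory per node):

```shell
# Sanity check on the Blue Gene/Q job geometry.
NODES=512
RANKS_PER_NODE=16
MEM_PER_NODE_GIB=16

echo "total ranks:     $((NODES * RANKS_PER_NODE))"            # 8192
echo "GiB per rank:    $((MEM_PER_NODE_GIB / RANKS_PER_NODE))" # 1
echo "distributed TiB: $((NODES * MEM_PER_NODE_GIB / 1024))"   # 8
```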

Provisioning secure cloud instances locally

Hello, I decided to deploy Ray Cloud Browser on a local virtual machine. This guide is for installing and using libvirt on Fedora 17. The guest instance in this tutorial is OpenBSD 5.2, as it is very lightweight and secure. To use NAT with a bridge that has no interface, virsh net-start must be run as root.

Connecting to the host (Fedora 17)

First, we connect to ls31 (Fedora 17). ls31 is the host on which we will run domain 0. In our case, ls31 has 64 VCPUs (AMD Opteron(TM) Processor 6272) and 128 GiB of memory.

$ ssh -lboiseb01 -X

Then, we need to install the virtualization packages.

$ sudo su

Installing packages

# yum install -y @virtualization

We will put disks in /kvm/img and cdroms in /kvm/iso.

Installing the guest (OpenBSD 5.2)

# mkdir -p /kvm/{iso,img}

Then we install the virtual machine with 1024 MiB of memory, an 8 GiB disk, and graphics. We'll remove the graphics once ssh is working.

# virt-install --name test-1 --ram 102
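For reference, a complete virt-install invocation consistent with the parameters above (1024 MiB of memory, an 8 GiB disk under /kvm/img, the install CD under /kvm/iso, graphics over VNC) might look like the sketch below. The ISO file name and disk path are assumptions; adjust them to your setup.

```shell
# Hypothetical full invocation; not the exact command from this post.
virt-install \
    --name test-1 \
    --ram 1024 \
    --vcpus 1 \
    --disk path=/kvm/img/test-1.img,size=8 \
    --cdrom /kvm/iso/install52.iso \
    --graphics vnc
```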

Showcasing a pre-alpha version of Ray Cloud Browser

Dear genomic enthusiasts,

Explaining the de Bruijn graph -- or the de novo assembly process, for that matter -- to people can be a daunting task. All biologists have a web browser ready to fire up at any time. Furthermore, all modern browsers support HTML5 -- a way of making nice, portable user interfaces. Ray Cloud Browser is a visualizer for genomic data. But unlike classical genome browsers, Ray Cloud Browser is dynamic, and you can move things with energy if you want to. The current version is really pre-alpha, but the hardest-to-implement core features are there. The client is written in JavaScript (ECMAScript). The web server is Apache httpd, but any web server will do. The server-side application code is in C++ (C++98) and runs in Apache httpd using the standard CGI 1.1 (Common Gateway Interface). The stateful HTML5 client provides the graph layout engine, the physics engine, the rendering engine, the active-object engine, and the communication engine. All the client code was written f
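The CGI 1.1 contract the server side relies on is simple: the web server passes the request through environment variables (QUERY_STRING among them), and the program writes response headers, a blank line, and then the body to standard output. A toy CGI program as a shell script for illustration (the real back end is C++, and the `action=getKmer` query is a made-up example):

```shell
# Minimal CGI 1.1 program: headers, blank line, then the body, all on stdout.
cat > hello.cgi <<'EOF'
#!/bin/sh
printf 'Content-Type: application/json\r\n\r\n'
printf '{"query": "%s"}\n' "$QUERY_STRING"
EOF
chmod +x hello.cgi

# The web server would set QUERY_STRING from the URL; simulate it here.
QUERY_STRING='action=getKmer' ./hello.cgi
```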

Optimizing genomic alignment workloads

In some projects, there are a lot of files. Each of these files contains DNA sequences, and sometimes each file contains a different number of sequences. The workflow is to start several jobs, each of which aligns the sequences from a subset of the files. As an example, see the table below, which lists the files for the public human sample SRA000271.

Table 1: Number of sequences per file for sample SRA000271.

File                   Sequences
SRR002271_1.fastq.bz2  22243273
SRR002271_2.fastq.bz2  22243273
SRR002272_1.fastq.bz2  35756808
SRR002272_2.fastq.bz2  35756808
SRR002273_1.fastq.bz2  4276214
SRR002273_2.fastq.bz2  4276214
SRR002274_1.fastq.bz2  18095255
SRR002274_2.fastq.bz2  18095255
SRR002275_1.fastq.bz2  33729638
SRR002275_2.fastq.bz2  33729638
SRR002276_1.fastq.bz2  47074312
SRR002276_2.fastq.bz2  47074312
SRR002277_1.fastq.bz2  6757955
SRR002277_2.fastq.bz2  675
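Splitting such uneven files across jobs is a classic load-balancing problem. A common heuristic is greedy longest-processing-time (LPT) assignment: take the files largest first and always hand the next one to the least-loaded job. A sketch using the _1 counts from Table 1 (the counts.txt input format is an assumption for this example):

```shell
# Input: one "<file> <sequence count>" pair per line (assumed format).
cat > counts.txt <<'EOF'
SRR002271_1.fastq.bz2 22243273
SRR002272_1.fastq.bz2 35756808
SRR002273_1.fastq.bz2 4276214
SRR002274_1.fastq.bz2 18095255
SRR002275_1.fastq.bz2 33729638
SRR002276_1.fastq.bz2 47074312
SRR002277_1.fastq.bz2 6757955
EOF

# Greedy LPT: largest file first, always to the least-loaded job.
sort -k2,2nr counts.txt | awk -v jobs=3 '
{
  best = 1
  for (j = 2; j <= jobs; j++) if (load[j] < load[best]) best = j
  load[best] += $2
  print "job-" best, $1
}
END { for (j = 1; j <= jobs; j++) print "job-" j, "total:", load[j] }'
```

With three jobs, the loads come out to 58108481, 53852063 and 55972911 sequences, which is far more even than cutting the file list into fixed-size chunks.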