Posts

Showing posts from October, 2013

Pulling next-generation sequences from the European Nucleotide Archive (ENA)

Today, I read the abstract (and the methods) of a PNAS (Proceedings of the National Academy of Sciences) paper entitled "Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease". In this paper, researchers sequenced the DNA of a large number of Staphylococcus aureus isolates from patients over a period of time. This is science.

So this is nice and all, but after reading some bits of it, I wanted the data to do my own tests.

Getting sequence data is actually very easy, thanks to the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC has 3 members (alphabetical order):
1. DNA Data Bank of Japan (DDBJ)
2. European Nucleotide Archive (ENA)
3. GenBank (raw sequence data is stored in the Sequence Read Archive (SRA))

In a nutshell, everything submitted to one of these gets mirrored to the others. The Metadata Model is also quite nice.

SRA is unusable because I have to download data files in the SRA format. This SRA format -- which is…

Nice public microbiome datasets for metagenomics

Edit 2013-10-29: added humongous dataset from Qin et al. 2012


There are at least 3 huge datasets for metagenomics. One of the challenges is to analyze all the data simultaneously in an integrative fashion.

That's what Ray Surveyor will do when ready!

Ray Surveyor will generate a Gramian Matrix, and also a Pairwise Distance Matrix. All of this is based on reference-free algorithms.
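
To make these two matrices concrete, here is a minimal Python sketch of a reference-free comparison. It assumes that each sample is reduced to its set of k-mers, that the Gramian entry for two samples is the number of k-mers they share, and that the distance is the squared Euclidean distance between binary k-mer presence vectors (which can be read straight off the Gramian). These choices, the function names, and the toy sequences are all assumptions made for illustration; they are not necessarily what Ray Surveyor computes.

```python
def kmers(sequence, k):
    """Return the set of k-mers (substrings of length k) found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def gramian_matrix(samples):
    """Gramian of binary presence vectors: G[i][j] = number of shared k-mers."""
    return [[len(a & b) for b in samples] for a in samples]


def distance_matrix(gramian):
    """Squared Euclidean distance between presence vectors, from the Gramian.

    For binary vectors x and y: |x - y|^2 = x.x + y.y - 2 x.y
    """
    n = len(gramian)
    return [
        [gramian[i][i] + gramian[j][j] - 2 * gramian[i][j] for j in range(n)]
        for i in range(n)
    ]


# Made-up toy "samples" (hypothetical sequences, for illustration only).
samples = [kmers(s, 3) for s in ("ACGTAC", "ACGTTT", "GGGCCC")]

gramian = gramian_matrix(samples)
distances = distance_matrix(gramian)

for row in gramian:
    print(row)
for row in distances:
    print(row)
```

No reference genome appears anywhere in this sketch: the samples are compared only through the k-mers they have in common.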

Here are some huge datasets:


Dataset #1: A human gut microbial gene catalog established by deep metagenomic sequencing (ERA000116)
Paper: A human gut microbial gene catalogue established by metagenomic sequencing
Size: 406 GiB (compressed with gzip)
Accession: ERA000116
Samples: 124


Dataset #2: HMP Studies and samples provided by HMP program and staged at NCBI (SRA012041)

Paper: Structure, function and diversity of the healthy human microbiome
Size: 4.1 TiB (compressed with gzip)
Accession: SRA012041
Samples: 764


Dataset #3: Type 2 Diabetes gut metagenome (microbiome) data from 368 Chinese samples and updated met…

Daily scrums in bioinformatics research

Since I started my doctorate, I have been using the Scrum method for managing all of my doctoral projects (product backlogs, sprint backlogs, roadmaps, stuff like that).

This summer, we had a software engineering intern in our group. To manage the intern (he reported to me), I decided (before his arrival) that we were going to hold daily scrums. Daily scrums are timeboxed meetings that last a maximum of 15 minutes each day. The daily scrum must be held at the same location and at the same time every day in order to create an awesome habit in the group.

In such a meeting, the Scrum Master asks everyone three exciting questions. On the 5th floor of our building, I am the Scrum Master, which means that I facilitate and moderate our scrum meetings. These three questions are (custom flavour):

1. What did you do yesterday?
2. What will you do today?
3. Do you have any roadblocks?

So we (me, the intern, and other people on our floor) met every day for the whole duration of th…

A playground for actors and fellowship application progress

Lately, I have been partly busy with paperwork related to my postdoctoral fellowship applications. While investing quite a bit of myself into these applications, I am also researching and devising better ways to create massively parallel software tools for genomics. I came up with the idea of "RayPlatform Actor Playground" and implemented an illustrious set of classes that embody the ideas conveyed by the "Playground."
The user story that motivated this novel playground for actors is that of what I named "Ray Surveyor." The purpose of this workflow is to compare genomic content between samples using DNA content alone, without any references involved.

This Surveyor business is invoked simply by using a new option for Ray: "-run-surveyor". An example command line is available here. All the required runtime code is implemented within 4 types of actors: Mother, CoalescenceManager, GenomeGraphReader, and StoreKeeper. Mothers are the only actors …