Posts

Showing posts from 2013

Summary of papers I have read today

[kSNP v1] K-mers can be used to find single nucleotide polymorphisms. Gardner, S. N. and T. Slezak (2010). Scalable SNP analyses of 100+ bacterial or viral genomes. J Forensic Res 1, 107. Summary (from my understanding):
1. Enumerate k-mers
2. Find k-mers with a varying central letter
  example: CACCGTTCAAAGACATTAAATCTTTACAAGC and CACCGTTCAAAGACAGTAAATCTTTACAAGC
3. Align varying k-mers to reference to get coordinates
4. Enjoy
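
Steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal sketch from my understanding, not kSNP's actual code: the function names (`enumerate_kmers`, `find_central_variants`) and the genome names are made up for illustration, reverse-complement strands are ignored, and step 3 (aligning to a reference) is left out.

```python
from collections import defaultdict


def enumerate_kmers(sequence, k):
    """Yield every k-mer (substring of length k) of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def find_central_variants(genomes, k):
    """Group k-mers that share both flanks but differ at the central base.

    genomes maps a (hypothetical) genome name to its sequence; k must be odd.
    Returns {(left_flank, right_flank): {central_base: {genome names}}},
    keeping only flank pairs seen with more than one central base.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a central base exists")
    mid = k // 2
    groups = defaultdict(lambda: defaultdict(set))
    for name, sequence in genomes.items():
        for kmer in enumerate_kmers(sequence, k):
            flanks = (kmer[:mid], kmer[mid + 1:])
            groups[flanks][kmer[mid]].add(name)
    return {
        flanks: {base: set(names) for base, names in bases.items()}
        for flanks, bases in groups.items()
        if len(bases) > 1
    }


# The two 31-mers from the example above, one per made-up genome.
genomes = {
    "genome_a": "CACCGTTCAAAGACATTAAATCTTTACAAGC",
    "genome_b": "CACCGTTCAAAGACAGTAAATCTTTACAAGC",
}
for (left, right), bases in find_central_variants(genomes, k=31).items():
    print(left, sorted(bases), right)
```

Keying on the two flanks means that k-mers differing only at the central letter end up in the same group, so a SNP candidate is simply a group with more than one central base.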



[kSNP v2] A new alignment-free method (implemented in Perl) discovers variations using k-mers. Gardner, S. N. and B. G. Hall (2013, December). When Whole-Genome alignments just won't work: kSNP v2 software for Alignment-Free SNP discovery and phylogenetics of hundreds of microbial genomes. PLoS ONE 8 (12), e81760+. Code: http://sourceforge.net/projects/ksnp/

Sequencing highlights variations unseen by conventional methods for the 2011 European E. coli O104:H4 outbreak. Grad, Y. H., M. Lipsitch, M. Feldgarden, H. M. Arachchi, G. C. Cerqu…

Beatles and Bioinformatics! 27th November 2013 in Liverpool

Twitter superstar Nick Loman has organized a great meeting in Liverpool (United Kingdom) called Beatles and Bioinformatics! on 27th November 2013. I just gave my keynote talk about Ray and metagenomics. I am currently attending the other great talks.

You can watch the live talks from Beatles and Bioinformatics! on the YouTube channel called pathogenomics.


Also, the CGR (Centre for Genomic Research) Course in Metagenomics (held on 28th & 29th November 2013 in Liverpool) will cover 16S ribosomal RNA analysis and so-called OTUs (Operational Taxonomic Units). The agenda is available too.


Update 2013-11-27 23:07 GMT:

Direct links to each talk on YouTube:

13.00 – KEYNOTE: Sebastien Boisvert (@sebhtml), Université Laval, Québec, Canada – “Ray and Ray Cloud Browser for Metagenomics” (the sound starts 40 seconds in)
13.50 – Chris Stewart (@CJStewart7), University of Northumbria at Newcastle – “Development of the Gut Microbiome in Preterm Infants at Risk of Necrotising Enterocoliti…

Pulling next-generation sequences from the European Nucleotide Archive (ENA)

Today, I read the abstract (and the methods) of a PNAS (Proceedings of the National Academy of Sciences) paper entitled Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. In this paper, researchers sequenced the DNA of a large number of Staphylococcus aureus isolates from patients over a certain period of time. This is science.

So this is nice and all, but after reading some bits of it, I wanted the data to do my own tests.

Getting sequence data is actually very easy, thanks to the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC has 3 members (alphabetical order):
1. DNA Data Bank of Japan (DDBJ)
2. European Nucleotide Archive (ENA)
3. GenBank (raw sequence data is stored in the Sequence Read Archive (SRA))

In a nutshell, everything submitted to one of these gets mirrored to the others. The Metadata Model is also quite nice.

SRA is unusable because I have to download data files in the SRA format. This SRA format -- which is…

Nice public microbiome datasets for metagenomics

Edit 2013-10-29: added humongous dataset from Qin et al. 2012


There are at least 3 huge datasets for metagenomics. One of the challenges is to analyze all the data simultaneously in an integrative fashion.

That's what Ray Surveyor will do when ready!

Ray Surveyor will generate a Gramian Matrix, and also a Pairwise Distance Matrix. All of this is based on reference-free algorithms.
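
To make these two matrices concrete, here is a toy sketch in Python. It is not Ray Surveyor's code. It assumes that each sample is represented by its set of k-mers, that entry (i, j) of the Gramian matrix is the number of k-mers shared by samples i and j, and that the distance is the Euclidean distance induced by that inner product. The function names and the tiny sequences are made up for illustration.

```python
import math


def kmer_set(sequence, k):
    """Return the set of distinct k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def gramian_matrix(samples, k):
    """Return (names, gram) where gram[i][j] counts k-mers shared by samples i and j.

    Assumption: each sample is a 0/1 vector over k-mers, so the inner
    product of two samples is the size of the intersection of their sets.
    """
    names = sorted(samples)
    sets = [kmer_set(samples[name], k) for name in names]
    gram = [[len(a & b) for b in sets] for a in sets]
    return names, gram


def distance_matrix(gram):
    """Derive Euclidean distances from a Gramian matrix.

    Uses d(i, j)^2 = G[i][i] + G[j][j] - 2 * G[i][j].
    """
    n = len(gram)
    return [
        [math.sqrt(gram[i][i] + gram[j][j] - 2 * gram[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Two made-up samples; no reference genome is needed at any point.
samples = {"sample_1": "ACGTAC", "sample_2": "ACGTTT"}
names, gram = gramian_matrix(samples, k=3)
distances = distance_matrix(gram)
print(names)
print(gram)
print(distances)
```

With 0/1 vectors, the squared distance is just the number of k-mers found in one sample but not the other, which is why everything stays reference-free.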

Here are some huge datasets:


Dataset #1: A human gut microbial gene catalog established by deep metagenomic sequencing (ERA000116)
Paper: A human gut microbial gene catalogue established by metagenomic sequencing
Size: 406 GiB (compressed with gzip)
Accession: ERA000116
Samples: 124


Dataset #2: HMP Studies and samples provided by HMP program and staged at NCBI (SRA012041)

Paper: Structure, function and diversity of the healthy human microbiome
Size: 4.1 TiB (compressed with gzip)
Accession: SRA012041
Samples: 764


Dataset #3: Type 2 Diabetes gut metagenome (microbiome) data from 368 Chinese samples and updated met…

Daily scrums in bioinformatics research

Since I started my doctorate, I have been using the Scrum method for managing all of my doctoral projects (product backlogs, sprint backlogs, roadmaps, stuff like that).

This summer, we had a software engineering intern in our group. To manage the intern (he reported to me), I decided (before his arrival) that we were going to hold daily scrums. Daily scrums are timeboxed meetings that last a maximum of 15 minutes each day. And each day, the daily scrum must be held at the same location and at the same time in order to create an awesome habit in the group.

In such a meeting, a Scrum Master asks everyone three exciting questions. On the 5th floor of our building, I am the Scrum Master, which means that I facilitate and moderate our scrum meetings. These three questions are (custom flavour):

1. What did you do yesterday?
2. What will you do today?
3. Do you have any roadblocks?

So we (me, the intern, and other people on our floor) met every day for the whole duration of th…

A playground for actors and fellowship application progress

Lately, I have been partly busy with paperwork related to my postdoctoral fellowship applications. While investing quite a bit of myself into these applications, I am also researching and devising better ways to create massively parallel software tools for genomics. I came up with the idea of "RayPlatform Actor Playground" and implemented an illustrious set of classes that embody the ideas conveyed by the "Playground."
The user story that motivated this novel playground for actors is that of what I named "Ray Surveyor." The purpose of this workflow is to compare genomic content between samples using DNA content, without any reference genomes involved.

This Surveyor business is simply invoked with a new option for Ray: "-run-surveyor". An example command line is available here. All the required runtime code is implemented within 4 types of actors. These actor types are: Mother, CoalescenceManager, GenomeGraphReader, and StoreKeeper. Mothers are the only actors …