Partitions, genomes, and life sequences

- June 21, 2010

Tackling a problem by breaking it into smaller, more tractable parts is how numerous, very different real-life problems get solved. Divide-and-conquer can also help to read through a genome's sequences of As, Ts, Cs, and Gs. A method published in Genome Research uses partitioning to make genomes easier to read. Sébastien Boisvert reports.

Reading life

Genomes contain the core code of life, or so we think. Recent technological advances are pushing toward massively parallel acquisition of short digital sequences of DNA. However, these new sequencers -- machines that literally read DNA molecules -- can only decode about 30-150 letters at a time from each of the many DNA molecules they process.

Accordingly, algorithms implemented as computer programs make sense of these sequence reads by assembling them into longer sequences -- akin to solving puzzles.

High-hanging fruit

Whole-genome shotgun sequencing and assembly with short reads are truly approachable only for bacterial-sized genomes -- say 1-7 Mb. Beyond that, all you get is thousands of sequences that cannot be related to one another. The linchpin of genome biology, however, is reading larger genomes -- those of multicellular organisms such as crop plants, and human genomes, to build knowledge and study human health thoroughly.

Genome partitions & life sequences

In the February 2010 issue of the scientific publication Genome Research, home of papers about -- obviously -- genome research, an international team of researchers from the National Human Genome Research Institute in the United States of America and the European Bioinformatics Institute in the United Kingdom proposed a novel approach to sequence larger genomes with current highly-parallel DNA sequencing instrumentation.

Low-flying fruits

The paper proposed a partitioning scheme to make sequence assembly less hard. The authors compare the method's output favorably with results obtainable by capillary-based methods, at a fraction of the cost and time. In the reported approach, the genome is cut at specific sites called restriction sites. Restriction enzymes recognize these sites and cut the DNA there.

According to the authors, these sites can be the workhorse for partitioning genomes.

The positions of restriction sites depend on the nucleotide content of the genome to be sequenced. Thus, the spectrum of obtainable fragment lengths depends on the genome, too. The necessary a priori knowledge of these sites weakens, not to say nullifies, the authors' claims about the de novo nature of their published procedure.
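
To make the dependence on the genome concrete, here is a minimal sketch of an in-silico digest. It is an illustration only, not the authors' procedure: the function name `digest` and the toy sequences are made up, and EcoRI (recognition site GAATTC, cut after the G) is an assumption chosen for the example, since the enzymes used in the paper are not named here.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear sequence at every occurrence of a recognition site.

    site and cut_offset default to EcoRI (G^AATTC), an assumption made
    for illustration only. Returns the list of resulting fragments.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset           # position of the cut inside the site
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])   # remainder after the last cut
    return fragments


# Two hypothetical toy "genomes" digested with the same enzyme.
genome_a = "AAGAATTCCCCGAATTCT"
genome_b = "GAATTCGAATTC"

lengths_a = [len(f) for f in digest(genome_a)]
lengths_b = [len(f) for f in digest(genome_b)]
print(lengths_a)  # [3, 9, 6]
print(lengths_b)  # [1, 6, 5]
```

The same enzyme yields a different set of fragment lengths for each sequence, and computing those lengths required the sequence itself.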

The fly sites

In the paper, the authors suggested using two specific restriction enzymes and, for each of them, selecting 4 different fragment lengths with gel electrophoresis.
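
The size-selection step can be pictured as sorting fragments into length windows, each window becoming its own, smaller library. The sketch below assumes four hypothetical windows per enzyme; the boundaries, the window names, and the function `size_select` are invented for illustration and are not taken from the paper.

```python
def size_select(fragment_lengths, windows):
    """Group fragment lengths into named size windows (inclusive bounds).

    Fragments that fall outside every window are discarded, much as
    gel regions that are not excised would be.
    """
    selected = {name: [] for name in windows}
    for length in fragment_lengths:
        for name, (low, high) in windows.items():
            if low <= length <= high:
                selected[name].append(length)
                break  # windows are assumed not to overlap
    return selected


# Hypothetical windows in base pairs; real boundaries would depend on the gel.
windows = {
    "window_1": (100, 199),
    "window_2": (200, 299),
    "window_3": (300, 399),
    "window_4": (400, 499),
}
lengths = [50, 120, 250, 260, 410, 900]  # made-up fragment lengths

libraries = size_select(lengths, windows)
for name, members in libraries.items():
    print(name, members)
```

Each window holds only a subset of the fragments, which is what makes each partition smaller than the whole genome.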

The method was applied to the 125-Mb genome of the fruit fly Drosophila melanogaster, the heavily studied source code of an extensively characterized multicellular life form. As the scientists report, their fly genome assembly contained the target genetic blueprint contaminated by the blueprint of its microbiome.

De novo? Not really.

Contrary to what the authors wrote -- that their method can decode the genomes of yet unsequenced and uninvestigated life forms -- it requires thorough a priori knowledge of either the target genome or the genome of its microbiome, in order to fish out the target genome or to weed out the microbiome genome.

And the accurate choice of restriction enzymes surely requires a genome sequence.

To pair or not?

In 2008, the Illumina DNA sequencing technology could already output paired sequences. The partition-and-sequence paper was received for peer review on July 1st, 2009. One can read in that article that "the ability to generate accurate paired-end data was not available to us when we initiated this project." The claimed lack of pairs thus rests solely on the definition of the word "accurate".

Genome constructs to nonsense constructs

On page 255, one can confusingly read that "The RR libraries were sequenced individually using Velvet."

Software is intangible and cannot read DNA molecules; doing so would of course be a hit, if not the century's breakthrough. Inadequate wording or an editing error presumably explains the sentence.

Fair enough or not enough?

The method is not readily applicable to organisms whose genomes are unknown, because the resulting sequences would be an undesirable mix of target and microbiome. Moreover, knowledge of the genome sequence is a prerequisite for an adequate choice of restriction enzymes.

Genome partitioning is a good start, but the reported method is inadequate and overfits the fly.

The method is not conceptually so different from existing DNA capture-based enrichment methods that aim at partitioning genomes: they both need genome sequences.

Reference

Genome Res. 2010. 20: 249-256 doi:10.1101/gr.097956.109
