SHREC mines errors

SHREC: a short-read error correction method
Bioinformatics doi:10.1093/bioinformatics/btp379 (2009)

In their paper, the authors stress the importance of read correction in sequencing applications such as resequencing and de novo assembly. They mention that error correction techniques for Sanger reads are outdated, and that novelty is required. The authors say that assemblers perform better when there are no errors, that is, that they work well with error-corrected reads. This is incorrect, as both resequencing and de novo assembly allow consensus calling. The authors write that the Euler assembly program is 'established'; meanwhile, nobody uses it. During the Sanger reign, very few programs were introduced for error correction in reads. Accordingly, the authors cite only two such works.

They compare SHREC with the error-correction components of EULER-SR and ALLPATHS -- two assemblers -- on simulated and real data. The paper is fun to read, and the described work is sound and easy to grasp. The authors are careful and scientific; they define every term they use. Their method outperforms the others, and the approach is easy to use on a computer. The authors generated random errors and used Illumina reads, which likewise contain only randomly located errors. However, relying on coverage depth only addresses minor errors, and it ignores correlated errors, such as the 454's insertions and deletions in homopolymers.
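
To make the coverage argument concrete, here is a minimal sketch of coverage-based correction. It is not SHREC's actual algorithm; it is a simplified k-mer counting illustration, and the names `kmer_counts`, `correct_read`, and the `min_count` threshold are assumptions made up for this example. A k-mer seen fewer than `min_count` times is treated as suspicious and replaced by a well-covered k-mer that differs by a single substitution.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (a proxy for coverage depth)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, min_count=2):
    """Fix weakly covered k-mers with single-base substitutions (hypothetical helper).

    Only substitutions are tried; insertions and deletions are not handled.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= min_count:
            continue  # well covered, assumed correct
        best = None  # (candidate k-mer, offset in k-mer, replacement base)
        for j in range(k):
            for b in "ACGT":
                if b == kmer[j]:
                    continue
                cand = kmer[:j] + b + kmer[j + 1:]
                if counts[cand] >= min_count and (
                    best is None or counts[cand] > counts[best[0]]
                ):
                    best = (cand, j, b)
        if best is not None:
            bases[i + best[1]] = best[2]
    return "".join(bases)


# One random error among five correct copies gets corrected.
reads = ["ACGTTGCATG"] * 5 + ["ACGTAGCATG"]
counts = kmer_counts(reads, k=5)
fixed = correct_read("ACGTAGCATG", counts, k=5)

# The same error repeated in three reads looks well covered and survives.
correlated = ["ACGTTGCATG"] * 5 + ["ACGTAGCATG"] * 3
kept = correct_read("ACGTAGCATG", kmer_counts(correlated, k=5), k=5)
```

In the first case the lone erroneous read is restored to `ACGTTGCATG`. In the second case the error recurs often enough to pass the coverage threshold, so it is left untouched, which is the weakness with correlated errors described above.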

The paper delivers on novelty, the work is well-motivated, the software is open-source, and benchmarks are adequately reported.


Polish researchers elucidate genome assembly

Whole genome assembly from 454 sequencing output via modified DNA graph concept
Computational Biology and Chemistry doi:10.1016/j.compbiolchem.2009.04.005 (2009)

The Human Genome Project was a scientific success which allowed bioinformatics to grow. During this project, only Sanger sequencing was in use. Recently, however, new platforms such as pyrosequencing have emerged, and they provide much more data at lower cost, in a few hours. This data storm has prompted the need for novel assembly algorithms.

The authors provide a new computational framework for genome assembly -- SR-ASM (Short Reads ASseMbly). They use the 'recently available' Roche/454 technology, released in 2005. They evaluate their tool against Velvet and Newbler. Velvet is designed solely for Illumina (the authors say it runs on 454), whereas Newbler is sold along with the 454 sequencer from Roche. The authors say Newbler cannot load fasta files, but it can. With this argument, they avoided the need to compare their tool with Roche's software -- the mighty Newbler. Roche/454 is a commercial success, mainly because of Newbler. The authors write that Phrap is not 454-aware; this is incorrect. The 454 sequencer takes 7 h to read DNA, and Newbler assembles the reads in about 15 minutes. Meanwhile, SR-ASM runs in only 80 hours.

Tests were done on the Prochlorococcus marinus 1.84 Mbp genome, and an 11717 nt segment of human chromosome 15. The authors write that a lack of coverage in 454 experiments causes assembly gaps. In principle, this is incorrect -- repeated elements are most likely to be responsible, because novel sequencing technologies offer large depth and breadth of coverage. In their paper, the tables are incomprehensible, and they don't capture the true value of the assemblies. The authors only check contig lengths; they don't assess assembly errors.

In summary, the paper is hard to read, not scientific, and the novelty is very weak, if not nonexistent.


Beware of bioinformatics

The advent of computing in our world has made research endeavours easier. The whole field of bioinformatics is built on the mindset that once you have data, regardless of its quality, you can go on and publish the data-analysis recipe and associated observations. Why do computational biologists feel they have to tell everyone that their sequence contains specific sub-sequences? Even more alarming, unless you are publishing in a good journal -- like Nature -- a bioinformatics paper is very unlikely to undergo extensive copy-editing, and thus will presumably be super-boring to read, unless you are an outstanding writer.

Suppose that a person prepares bread with his own novel recipe (yeah, right -- a novel bread recipe). He mixes the ingredients properly, then he puts the mix in the oven, sets it up, waits, and gets his result. This result is void of any scientific value, just as most of the creepy crap published in the bioinformatics sphere is void of discoveries. One could argue that this lack of discoveries stems from the methodological nature of bioinformatics -- that bioinformatics in essence is about how you discover things, not about discovering them.

Let me consider the persistent problem of short-read alignment, which is pervasive in resequencing adventures. Considering that a genome sequence is very long, and that the small chunks of data obtained are very short, but long enough to allow unique placement when they occur in unique genome segments, even a child could design a novel algorithm. People are smashing themselves into walls to publish their own programs that align reads onto genomes. Whoa, reads! Genomes! My nature, this is complex science. No, it's not.

Overall, the open access movement has sparked many publishing start-ups, whose commercial activity lies in making the manuscripts they receive available online, along with about a thousand boxes. Sure, this is a winning business model, given that 'genomics' and 'computation' are power words.
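
For what it is worth, the unique-placement idea from the short-read alignment remark above fits in a few lines. This is only a naive sketch under strong assumptions: exact matches only, no mismatches, no indels, no reverse strand, and a genome small enough to index in memory. The names `build_index` and `place_read`, and the seed length `k`, are made up for this example.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer of the genome to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def place_read(read, genome, index, k):
    """Return the position of the read if it matches exactly once, else None.

    The first k bases of the read act as a seed; each seed hit is then
    verified against the full read. Absent and repeated reads give None.
    """
    hits = [
        pos
        for pos in index.get(read[:k], [])
        if genome[pos:pos + len(read)] == read
    ]
    return hits[0] if len(hits) == 1 else None


genome = "TTACGGATCCATTACGGCAGT"
index = build_index(genome, k=4)

unique = place_read("GATCCA", genome, index, k=4)     # occurs once
repeated = place_read("TTACGG", genome, index, k=4)   # occurs twice, ambiguous
missing = place_read("AAAAAA", genome, index, k=4)    # not in the genome
```

Here `unique` is 5, while `repeated` and `missing` are both `None`: a read from a unique segment is placed, and a read from a repeated segment cannot be.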

Below, I have attached a carefully prepared list of peer-reviewed journals in bioinformatics. Go check them out!

Algorithms for Molecular Biology
Briefings in Bioinformatics
BMC Bioinformatics
BMC Genomics
Computational Biology and Chemistry
Journal of Bioinformatics and Computational Biology
Journal of Computational Biology
PLoS Computational Biology
The Open Bioinformatics Journal

(a more detailed list is available here)

Edits: 2010-06-22 (typos)


Updates on my assembler software


I have been working very hard since December 10th on a novel algorithm for fragment assembly. I have already registered an open-source project on SourceForge. I started with some of the ideas of Pevzner et al. 2001, but I added several enhancements, and I modeled the problems with equations. My software is compatible with the AMOS specification and will be released on SourceForge upon acceptance in a journal. We are planning to submit our work to PNAS. We collected public data sets from the Short Read Archive to assess the performance of our software.