Friday, February 26, 2010

Aligning long reads rapidly and accurately

The last couple of years saw a large number of algorithms and papers on rapid short read alignment to enable the rapid assembly and alignment of short reads generated by next-generation sequencers (will these machines be called last-generation sequencers at some future point?). These new algorithms included SOAP, MAQ, Bowtie, and others.

With ever increasing read lengths though these aligners which were optimized for short reads are not as efficient as one might want. Roche/454 sequencing technology has already produced reads >400 bp, Illumina gradually increases read length >100 bp, and Pacific Biosciences looks at generating 1000 bp reads. Thus, reads coming from the new sequencing technologies are not really short any more, which rules out many of the new aligners exclusively designed for reads no longer than 100 bp.

A recent paper from Li and Durbin in Bioinformatics introduces a new algorithm called Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. On real datasets SSAHA2 performs better for shorter reads though it takes much longer and is heavier on memory. From the paper:

To confirm this speculation, we compared the two aligners on 99 958 reads from run SRR002644 with average read length 206 bp. This time BWA-SW misses 1092 SSAHA2 Q20 alignments and produces 39 questionable alignments; SSAHA2 misses 325 and produces 10 questionable ones. SSAHA2 is more accurate on this shorter dataset, although it is nine times slower than BWA-SW and uses 40% more memory.