Friday, February 26, 2010

Aligning long reads rapidly and accurately

The last couple of years have seen a large number of algorithms and papers on rapidly aligning and assembling the short reads generated by next-generation sequencers (will these machines be called last-generation sequencers at some future point?). These new algorithms include SOAP, MAQ, Bowtie, and others.

With ever-increasing read lengths, though, these aligners, optimized as they are for short reads, are not as efficient as one might want. Roche/454 sequencing technology already produces reads longer than 400 bp, Illumina is gradually pushing read lengths beyond 100 bp, and Pacific Biosciences is looking at generating 1000 bp reads. Reads coming from the new sequencing technologies are thus not really short any more, which rules out many of the new aligners designed exclusively for reads no longer than 100 bp.

A recent paper from Li and Durbin in Bioinformatics introduces a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences of up to 1 Mb against a large sequence database (e.g. the human genome) with only a few gigabytes of memory. On real datasets SSAHA2 is more accurate for shorter reads, though it is much slower and heavier on memory. From the paper:

To confirm this speculation, we compared the two aligners on 99 958 reads from run SRR002644 with average read length 206 bp. This time BWA-SW misses 1092 SSAHA2 Q20 alignments and produces 39 questionable alignments; SSAHA2 misses 325 and produces 10 questionable ones. SSAHA2 is more accurate on this shorter dataset, although it is nine times slower than BWA-SW and uses 40% more memory.
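
To get a rough feel for the Smith-Waterman half of the name, here is a minimal sketch of local alignment scoring in Python. This is only my illustration: BWA-SW actually runs this kind of dynamic programming against a Burrows-Wheeler/FM-index representation of the reference rather than the raw sequence, and the scoring parameters below are assumptions.

```python
# Minimal Smith-Waterman local alignment scoring (illustrative only --
# BWA-SW couples dynamic programming like this with an FM-index/prefix
# trie of the reference; match/mismatch/gap values here are assumptions).

def smith_waterman(query, target, match=1, mismatch=-3, gap=-5):
    rows, cols = len(query) + 1, len(target) + 1
    # H[i][j] = best local alignment score ending at query[i-1], target[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # small toy example
```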

Tuesday, January 5, 2010

GNUMap: probabilistic mapping of next generation sequencing reads

A recent publication in Bioinformatics, The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing, first describes an algorithm that probabilistically maps reads to repeat regions in the genome. This is important because, firstly, a fair amount of the genome in many organisms is repetitive and, secondly, with the short reads produced by next-generation sequencers, even small repetitive regions could be unmappable. One argument is that repetitive regions are typically uninformative, so the loss is not that great, but there are places where repeats are informative, and any algorithm that helps make use of all available data is a good thing.
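
As a rough sketch of the idea (my reading of it, not GNUMAP's actual code), a read that hits several repeat copies can be assigned fractionally to each location in proportion to how likely its alignment is there, instead of being thrown away or placed arbitrarily. The exponential weighting below is a placeholder assumption.

```python
import math

# Illustrative fractional assignment of a multi-mapping read (not GNUMAP's
# implementation): each candidate location gets a weight proportional to
# exp(alignment_score), so repeat copies share the read's contribution.

def fractional_assignment(alignment_scores):
    """alignment_scores: dict mapping genomic position -> alignment score."""
    best = max(alignment_scores.values())
    weights = {pos: math.exp(score - best) for pos, score in alignment_scores.items()}
    total = sum(weights.values())
    return {pos: w / total for pos, w in weights.items()}

# A read aligning almost equally well to two repeat copies and poorly to a third:
print(fractional_assignment({1200: 38.0, 54100: 37.5, 90210: 20.0}))
```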


In the next part of the paper they develop a probabilistic Needleman–Wunsch algorithm which uses the _prb.txt and _int.txt files produced by the Solexa/Illumina pipeline to improve mapping accuracy for lower-quality reads. Most software simply ignores the per-base quality scores, so for bases of mediocre quality it either discards the base or the whole read, or takes the risk of a miscalled base. Using the quality information should allow better mapping of reads, as well as making it possible to use reads that would previously have been discarded.
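
To make that concrete, here is a hedged sketch of how a Needleman–Wunsch substitution score could be weighted by per-base call probabilities (the kind of distribution carried in a _prb.txt file). This is my illustration of the concept, not the GNUMAP implementation, and the scoring values are assumptions.

```python
# Sketch of a probability-weighted Needleman-Wunsch score (illustrative,
# not GNUMAP's code). Instead of a hard base call, each read position carries
# a probability distribution over A, C, G, T, and the substitution score is
# the expectation of match/mismatch under that distribution.

MATCH, MISMATCH, GAP = 1.0, -1.0, -2.0  # assumed scoring parameters

def expected_sub_score(probs, ref_base):
    """probs: dict base -> probability for one read position."""
    return sum(p * (MATCH if b == ref_base else MISMATCH) for b, p in probs.items())

def prob_needleman_wunsch(read_probs, reference):
    n, m = len(read_probs), len(reference)
    # F[i][j] = best global alignment score of read[:i] vs reference[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + expected_sub_score(read_probs[i - 1], reference[j - 1]),
                F[i - 1][j] + GAP,
                F[i][j - 1] + GAP,
            )
    return F[n][m]

# A 3-base read: confident A, ambiguous C/T, confident G, aligned to "ACG".
read = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.05, "C": 0.55, "G": 0.05, "T": 0.35},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
]
print(prob_needleman_wunsch(read, "ACG"))
```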
