Tuesday, January 5, 2010

GNUMap: probabilistic mapping of next generation sequencing reads

A recent publication in Bioinformatics, "The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing", first describes an algorithm that probabilistically maps reads to repeat regions in the genome. This is important for two reasons: a fair amount of the genome in many organisms is repetitive, and with the short reads produced by next generation sequencers, even small repetitive regions can be unmappable. One argument is that repetitive regions are typically uninformative, so not much is lost by skipping them. But there are places where repeats are informative, and any algorithm that helps make use of all available data is a good thing.
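To make the idea of probabilistic mapping concrete, here is a minimal sketch of how a read that matches several locations could be shared among them instead of being thrown away. This is only an illustration and not GNUMAP's actual scoring: the function name `allocate_read`, the location labels and the likelihood values are all assumptions made up for the example.

```python
def allocate_read(likelihoods):
    """Split one read across candidate locations in proportion to how
    well it aligns at each one.

    likelihoods: dict mapping a location label to a non-negative
    alignment likelihood (hypothetical values, not GNUMAP output).
    Returns a dict of fractional weights that sum to 1.
    """
    total = sum(likelihoods.values())
    if total == 0:
        # The read aligns nowhere, so no location gets any weight.
        return {loc: 0.0 for loc in likelihoods}
    return {loc: score / total for loc, score in likelihoods.items()}


# A read that fits two copies of a repeat, one three times better than the other.
weights = allocate_read({"chr1:100": 3.0, "chr2:500": 1.0})
```

With these made-up numbers the first location receives three quarters of the read and the second one quarter, so both copies of the repeat still get coverage.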

In the next part of the paper they develop a probabilistic Needleman–Wunsch algorithm which uses the _prb.txt and _int.txt files produced by the Solexa/Illumina pipeline to improve mapping accuracy for lower quality reads. Most software simply ignores the quality score of a base, so when a score is not very high it either rejects the base (or the whole read) or takes the risk of a miscalled base. Using quality information should allow better mapping of reads, as well as making use of reads that were previously discarded.
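One way to picture a quality-aware Needleman–Wunsch is to replace the usual match/mismatch lookup with an expected score over the four possible bases at each read position. The sketch below assumes each read position has already been turned into a probability vector over A, C, G, T (for example from the per-base values in a _prb.txt file). The scoring values, the linear gap penalty and the function names are assumptions for illustration, not the scheme used in the paper.

```python
BASES = "ACGT"


def expected_score(probs, ref_base, match=1.0, mismatch=-1.0):
    """Expected substitution score when the read base is uncertain.

    probs: probabilities for A, C, G, T at one read position.
    """
    return sum(p * (match if b == ref_base else mismatch)
               for b, p in zip(BASES, probs))


def prob_needleman_wunsch(read_probs, ref, gap=-2.0, match=1.0, mismatch=-1.0):
    """Global alignment score of an uncertain read against a reference.

    read_probs: list of 4-element probability vectors, one per read position.
    ref: reference sequence as a string over A, C, G, T.
    """
    n, m = len(read_probs), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per position.
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + expected_score(
                read_probs[i - 1], ref[j - 1], match, mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# A confident "A" followed by a base that is equally likely to be A or C.
uncertain_read = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
score = prob_needleman_wunsch(uncertain_read, "AC")
```

The uncertain second base contributes an expected score between a full match and a full mismatch, so a low quality call is down-weighted rather than being trusted outright or discarded.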
