Friday, February 26, 2010

Aligning long reads rapidly and accurately

The last couple of years saw a large number of algorithms and papers on short read alignment, all aimed at the rapid alignment and assembly of the short reads generated by next-generation sequencers (will these machines be called last-generation sequencers at some future point?). These new algorithms include SOAP, MAQ, Bowtie, and others.

With ever-increasing read lengths, though, these aligners, which were optimized for short reads, are not as efficient as one might want. Roche/454 sequencing technology has already produced reads longer than 400 bp, Illumina is gradually pushing read lengths beyond 100 bp, and Pacific Biosciences is looking at generating 1000 bp reads. Reads coming from the new sequencing technologies are thus not really short any more, which rules out many of the new aligners designed exclusively for reads no longer than 100 bp.

A recent paper from Li and Durbin in Bioinformatics introduces a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), which aligns long sequences of up to 1 Mb against a large sequence database (e.g. the human genome) using a few gigabytes of memory. On real datasets, SSAHA2 is more accurate for shorter reads, though it takes much longer and uses more memory. From the paper:

To confirm this speculation, we compared the two aligners on 99 958 reads from run SRR002644 with average read length 206 bp. This time BWA-SW misses 1092 SSAHA2 Q20 alignments and produces 39 questionable alignments; SSAHA2 misses 325 and produces 10 questionable ones. SSAHA2 is more accurate on this shorter dataset, although it is nine times slower than BWA-SW and uses 40% more memory.
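
As a reminder of what the Smith-Waterman half of the name refers to, here is a minimal sketch of plain local alignment scoring in Python. This is not how BWA-SW is implemented: it leaves out the Burrows-Wheeler index and all of the heuristics that make the method fast, and the function name and scoring values (match, mismatch, gap) are my own assumptions chosen for illustration.

```python
def smith_waterman_score(query, target, match=1, mismatch=-3, gap=-5):
    """Best local alignment score between two sequences.

    Textbook dynamic programming with a linear gap penalty.
    The scoring values are illustrative assumptions, not BWA-SW defaults.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            same = query[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Filling the whole matrix costs time proportional to the product of the two sequence lengths, which is why doing this directly against a genome is not practical and an index is needed.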

Tuesday, January 5, 2010

GNUMap: probabilistic mapping of next generation sequencing reads

A recent publication in Bioinformatics, "The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing", first describes an algorithm that probabilistically maps reads to repeat regions in the genome. This is important because, firstly, a fair amount of the genome in many organisms is repetitive and, secondly, with the short reads produced by next-generation sequencers, even small repetitive regions can be unmappable. One argument is that repetitive regions are typically uninformative, so the loss is not that great. There are places where repeats are informative, however, and any algorithm that helps make use of all the available data is a good thing.
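
To make the idea of probabilistic mapping concrete, here is a small Python sketch of one way a read that matches several locations could be shared between them instead of being thrown away. I have not reproduced the paper's formulas: the function name, the location labels, and the assumption that each candidate location already has an alignment likelihood are all mine.

```python
def allocate_read(likelihoods):
    """Split one read across candidate locations.

    `likelihoods` maps a location label to a non-negative alignment
    likelihood (hypothetical input). Each location receives a share
    of the read proportional to its likelihood.
    """
    total = sum(likelihoods.values())
    if total == 0:
        # No location is supported at all: assign nothing anywhere.
        return {loc: 0.0 for loc in likelihoods}
    return {loc: value / total for loc, value in likelihoods.items()}


if __name__ == "__main__":
    # A read that fits a repeat copy on chr1 three times better than one on chr5.
    print(allocate_read({"chr1:100": 3.0, "chr5:2000": 1.0}))
```

With this kind of weighting, a read in a repeat still contributes to coverage at every plausible location, rather than being discarded or assigned to one copy at random.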

In the next part of the paper they develop a probabilistic Needleman–Wunsch algorithm, which uses the _prb.txt and _int.txt files produced by the Solexa/Illumina pipeline to improve mapping accuracy for lower-quality reads. Most software simply ignores the per-base quality scores, so for lower-quality bases it either rejects the base (or the whole read) or runs the risk of accepting a miscalled base. Using the quality information should allow better mapping of reads and make it possible to use reads that were previously discarded.
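
Here is a rough Python sketch of how base-call probabilities could feed into a Needleman–Wunsch alignment. It is my own simplified reading of the idea, not GNUMAP's code: I assume each read position has already been turned into four probabilities for A, C, G, and T (parsing _prb.txt is left out), and the scoring values and function names are made up for illustration.

```python
BASES = "ACGT"


def expected_score(probs, ref_base, match=1.0, mismatch=-1.0):
    """Expected substitution score of one read position against a reference base.

    `probs` holds the probabilities of A, C, G, T at that read position.
    """
    return sum(p * (match if base == ref_base else mismatch)
               for base, p in zip(BASES, probs))


def prob_needleman_wunsch(read_probs, ref, gap=-2.0):
    """Global alignment score of a probabilistic read against a reference string."""
    n, m = len(read_probs), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # read prefix aligned only to gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # reference prefix aligned only to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + expected_score(read_probs[i - 1], ref[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    return dp[n][m]


if __name__ == "__main__":
    sure = {b: [1.0 if x == b else 0.0 for x in BASES] for b in BASES}
    unsure = [0.25, 0.25, 0.25, 0.25]  # a base the sequencer could not call
    read = [sure["A"], sure["C"], unsure, sure["T"]]
    print(prob_needleman_wunsch(read, "ACGT"))
```

An uncertain base costs less than a confident miscall would, so a read with a few poor-quality positions can still align instead of being rejected outright.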

Thursday, December 31, 2009

Bioinformatics software for biologists

I have mostly used my own scripts and packages like BioPerl (book), Biopython (book), AWK (book), or R (book) for processing biological data. More recently I have also been using Galaxy a lot to work on genome data. It is especially useful because it has fast connections to many databases, and not having to download raw data to my machine saves a lot of time and space. I do think more biologists should familiarize themselves with tools like Galaxy, if only because it saves your work history and records exactly how a particular set of data was processed. Understandably, though, many who are not familiar with programming tend to be unaware of, or unwilling to use, tools that require any programming, and are only slightly more comfortable with tools like Galaxy.

Since the amount of data in biology is rapidly increasing and working through it by hand is less and less of a serious option, a number of commercial and free alternatives are becoming available for those who want to do data (especially sequence) analysis in a user-friendly way without having to learn any programming. Some of the software in this genre are UGENE, Gbench, Geneious, CLC Main Workbench, MacVector (Mac only), and more. I am partial to open source programs, so they are listed first, followed by the closed source ones, though I am fine with closed source if it does something better. In the coming days I will write about how these programs compare from my perspective, covering things like ease of manipulating sequences, building multiple alignments, tree generation, and working with phylogenetic trees.

Monday, December 28, 2009

First Post

This blog is going to be about new ways to gain insight into biological data using informatics, in particular bioinformatics. While I am largely a bioinformatician (-icist?), I do collaborate with biologists, so I also intend to write about new technologies and new ways to generate data. Next-generation (or high-throughput) sequencing and tools for understanding epigenetics, like ChIP-chip and ChIP-seq, are what I am currently working with. Part of this is about trying to make sense of all the new software being published. Every month tens, if not hundreds, of new papers describing software are published in Bioinformatics, Nucleic Acids Research, and so on, but it's hard to say how good the tools are or how they compare with others for similar applications. I would like to review how they turn out in my own usage.