Hot off the press: a new paper has just appeared in advance online publication in Nature Biotechnology. The paper describes a new computational pipeline to recover RNA transcript sequences and their abundances (aka the "transcriptome") from RNA-seq data without using the genome as a reference.
This paper is joint work by Moran from our group and Brian Haas and Manfred Grabherr, both from the Broad Institute. The three of them (Brian, Manfred and Moran) developed a three-part tool that handles massive numbers of sequencing reads and assembles them into an accurate account of the transcriptome. They also carried out a very thorough evaluation of the method and of how it compares to most of the state-of-the-art methods in the field. This evaluation is of interest in its own right for this emerging area of analysis.
The basic problem is that RNA-seq returns many (millions of) short sequences, each read from a different fragment of the original RNA molecules in the sample. To make sense of them, we need to assemble them (like a puzzle) into longer pieces. The two general strategies for doing so are nicely illustrated in this figure (from a review by Haas and Zody):
The "straightforward" approach is to align reads to the reference genome, and then use this mapping to guide reconstruction. The less obvious approach, which we took here, is to first assemble the puzzle, and then map to the genome. This often turns out to be as accurate, or even more so, since it is less susceptible to problems in mapping to the genome, to differences between the reference genome and the actual sample, and to partial or fragmented references.
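To make the "assemble the puzzle first" idea concrete, here is a toy sketch in Python. It is not the algorithm from the paper. It is a simple greedy overlap merge, chosen only because it is easy to read. The function names (`overlap`, `greedy_assemble`), the `min_len` parameter, and the example reads are all made up for illustration. The sketch also assumes error-free reads from a single strand and a single transcript, which real RNA-seq data never gives you.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy assembler: repeatedly merge the pair of pieces with the longest overlap.

    Illustrative assumption: reads are error-free and come from one strand.
    No reference genome is consulted at any point.
    """
    pieces = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return pieces  # nothing left to merge
        merged = pieces[best_i] + pieces[best_j][best_k:]
        pieces = [p for idx, p in enumerate(pieces)
                  if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads: 6-base windows over a made-up 15-base "transcript".
reads = ["TACGTT", "ATGGCG", "GTTAGC", "GCGTAC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these four reads the pieces join into one contig that spells out the made-up transcript. Real data adds sequencing errors, both strands, and alternatively spliced isoforms that share sequence, which is exactly where a naive greedy merge like this breaks down and a more careful method is needed.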
Our strategy is based on three steps, each processing the data very efficiently while maintaining information and dealing with sequence errors and rare events.