De Novo Genome Assembly

High-throughput next-generation sequencing technologies make it feasible to assemble mammalian-sized reference genomes from massive numbers of short reads. However, many existing assemblers were originally designed for genomes of only a few megabases and do not scale well to large genomes because of their high computational and memory overhead. A parallel short-read assembler that can assemble large genomes such as the human genome in a reasonably short time with only modest computing resources is therefore of high importance to research in this area.

PASHA

PASHA is a parallel short-read assembler for large genomes based on de Bruijn graphs. By taking advantage of both shared-memory multi-core CPUs and distributed-memory compute clusters, PASHA has demonstrated that it can perform high-quality de novo assembly of large genomes in reasonable time with modest computing resources. Our evaluation on three small real paired-end datasets shows that PASHA produces better assemblies than three leading assemblers (Velvet, ABySS and SOAPdenovo) at comparable genome coverage and mis-assembly rates. Moreover, PASHA is the fastest assembler on all three datasets when run on a single CPU. For the human genome, PASHA achieves assembly quality competitive with ABySS and completes the assembly in about 21 hours, about 2.38× faster than ABySS on the same hardware configuration.
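For readers unfamiliar with de Bruijn graphs, the following is a minimal, single-threaded Python sketch of the general idea: reads are split into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and contigs are read off unambiguous paths. It is an illustration only and is not PASHA's implementation; the function names, the toy reads and the choice of k = 4 are assumptions made for this example, and real assemblers additionally handle reverse complements, sequencing errors, in-degree checks and parallel graph construction.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def extend_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one distinct successor.

    Simplification: only out-degree is checked, and the walk stops on a revisit
    so that cycles cannot loop forever.
    """
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle detected
        seen.add(node)
        contig += node[-1]  # each step contributes one new base
    return contig


# Hypothetical toy input: three overlapping error-free reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

In this toy case the three overlapping reads collapse into a single 10-base contig. Repeated edges, such as the two copies of GGC to GCG, reflect k-mer multiplicity, which assemblers commonly use as a coverage signal.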

Download: Google

Publication:

Taipan

Taipan is a fast algorithm for single-end short-read assembly with Illumina reads. A recent comparison, "A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies", ranked Taipan as the best-performing assembler for small genomes from single-end reads of length 36 to 75.

Download: Sourceforge

Publication: