Short Read Error Correction | High Performance Computing

Reads produced by high-throughput sequencers are short and have high sequencing error rates. These errors complicate downstream short-read analyses, including re-sequencing, single nucleotide polymorphism (SNP) calling, and genome assembly. Fortunately, the low cost of sequencing makes it possible to produce enough reads to cover a genome with high redundancy, and this redundancy can be used to detect and correct sequencing errors. However, error correction is both compute- and memory-intensive because of the large number of reads involved, so it requires correctors that are efficient in both time and memory.

Musket

Musket is an efficient multistage k-mer based corrector for Illumina short-read data. We employ the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared to all other evaluated correctors as well as a highly competitive overall execution time.
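The k-mer spectrum idea can be illustrated with a minimal Python sketch. This is not Musket's implementation and does not reproduce its three correction stages; it only shows the general principle of counting k-mers, treating frequent ones as trusted, and accepting a single-base substitution when it is the only one that makes every k-mer in the read trusted. The function names, the `min_count` threshold and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build the k-mer spectrum: how often each k-mer occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def all_trusted(seq, counts, k, min_count):
    """True if every k-mer in seq occurs at least min_count times."""
    return all(
        counts[seq[i:i + k]] >= min_count
        for i in range(len(seq) - k + 1)
    )


def correct_read(read, counts, k, min_count):
    """Hypothetical single-substitution corrector (illustrative only).

    Returns the read unchanged unless exactly one single-base
    substitution makes all of its k-mers trusted.
    """
    if all_trusted(read, counts, k, min_count):
        return read
    candidates = []
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if all_trusted(candidate, counts, k, min_count):
                candidates.append(candidate)
    # Only correct when the fix is unambiguous.
    return candidates[0] if len(candidates) == 1 else read


# Toy data: five error-free copies of a sequence plus one read
# with a substitution error at position 8 (A -> T).
genome = "ACGTTGCAAGGCTA"
bad_read = "ACGTTGCATGGCTA"
reads = [genome] * 5 + [bad_read]

spectrum = count_kmers(reads, k=4)
fixed = correct_read(bad_read, spectrum, k=4, min_count=2)
```

In this toy setting the k-mers that overlap the erroneous base occur only once, while all other k-mers are supported by the redundant coverage, which is what lets the error be located. Real correctors must additionally handle repeats, uneven coverage and the choice of threshold, none of which this sketch attempts.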

Download: Sourceforge

Publication:

Yongchao Liu, Jan Schröder, Bertil Schmidt: "Musket: a multistage k-mer spectrum based error corrector for Illumina sequence data". Bioinformatics, 2013, 29(3): 308-315.

DecGPU

DecGPU (Distributed Error Correction on GPUs) is the first parallel and distributed error correction algorithm for high-throughput short reads using CUDA C++ and MPI. On both simulated and real datasets, it outperforms existing error correction algorithms in terms of correction quality and execution speed. Because it is distributed, the algorithm is feasible and flexible for correcting large-scale datasets.

Download: Sourceforge

Publications:

Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI". BMC Bioinformatics, 2011, 12:85.
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig: "A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware". Journal of Computational Biology, 2010, 17(4): 603-615.

SHREC

SHREC is a new error correction method based on a suffix tree. It is implemented in Java and runs on standard multi-core machines.

Download: Sourceforge

Publication:

Jan Schröder, Heiko Schröder, Simon J. Puglisi, Ranjan Sinha, Bertil Schmidt: "SHREC: a short-read error correction method". Bioinformatics, 2009, 25(17): 2157-2163.