Multiple Sequence Alignment | High Performance Computing

Multiple sequence alignment (MSA) is the foundation of many bioinformatics studies of molecular evolution and of the relationships between sequence, function and structure. An optimal MSA can be produced by aligning all sequences simultaneously with dynamic programming. Unfortunately, this approach is impractical for more than a few sequences, because its time and memory requirements grow exponentially with the number of sequences. Many heuristics have therefore been proposed to compute sub-optimal alignments, based on approaches such as progressive alignment, iterative alignment, and alignment based on profile HMMs.
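To make the cost argument concrete, here is a minimal sketch of the two-sequence case (Needleman-Wunsch global alignment). The scoring values and the function names `needleman_wunsch_score` and `dp_cells` are illustrative assumptions, not taken from any of the tools described on this page. The simultaneous alignment of N sequences fills an N-dimensional table in the same way, which is what `dp_cells` counts.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty.

    The scoring parameters are illustrative defaults (an assumption).
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # a[i-1] against a gap
                          curr[j - 1] + gap)   # b[j-1] against a gap
        prev = curr
    return prev[-1]


def dp_cells(lengths):
    """Number of cells in the full DP table for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells
```

For two sequences of length 100 the table has about ten thousand cells; for five such sequences `dp_cells([100] * 5)` is already above ten billion, which is why the heuristics listed above are used in practice.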

MSAProbs

MSAProbs is a new and practical multiple alignment algorithm for protein sequences. Its design combines pair hidden Markov models and partition functions to calculate posterior probabilities. Assessed on the popular benchmarks BAliBASE, PREFAB, SABmark and OXBENCH, MSAProbs achieves statistically significant accuracy improvements over the existing top-performing aligners, including ClustalW, MAFFT, MUSCLE, ProbCons and Probalign. Furthermore, MSAProbs is optimized for multi-core CPUs through a multi-threaded design, which gives it a competitive execution time compared to other aligners.
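As a rough illustration of the partition-function side of this idea, the sketch below computes, for every residue pair, the posterior probability that the two residues are aligned, summed over all possible pairwise alignments. It is a simplification under stated assumptions: a linear gap penalty, toy match/mismatch scores, and a hypothetical function name `match_posteriors`. It is not the MSAProbs implementation, which also uses a pair HMM and its own scoring parameters.

```python
import math


def match_posteriors(x, y, match=2.0, mismatch=-1.0, gap=-2.0, temperature=1.0):
    """Posterior probability that x[i] is aligned to y[j], for all i, j.

    All scoring parameters are illustrative assumptions.
    """
    n, m = len(x), len(y)
    w_gap = math.exp(gap / temperature)

    def w(i, j):
        # Boltzmann weight of aligning x[i] with y[j] (0-based indices).
        s = match if x[i] == y[j] else mismatch
        return math.exp(s / temperature)

    # fwd[i][j]: summed weight of all alignments of x[:i] with y[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += fwd[i - 1][j - 1] * w(i - 1, j - 1)
            if i > 0:
                total += fwd[i - 1][j] * w_gap
            if j > 0:
                total += fwd[i][j - 1] * w_gap
            fwd[i][j] = total

    # bwd[i][j]: summed weight of all alignments of x[i:] with y[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += w(i, j) * bwd[i + 1][j + 1]
            if i < n:
                total += w_gap * bwd[i + 1][j]
            if j < m:
                total += w_gap * bwd[i][j + 1]
            bwd[i][j] = total

    z = fwd[n][m]  # partition function: weight of all full alignments
    return [[fwd[i][j] * w(i, j) * bwd[i + 1][j + 1] / z for j in range(m)]
            for i in range(n)]
```

Each row of the result sums to at most 1, since a residue is aligned to at most one partner in any alignment; the remainder is the probability that it faces a gap.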

Download: Sourceforge

Publication:

Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities". Bioinformatics, 2010, 26(16): 1958-196
Yongchao Liu and Bertil Schmidt: "Multiple protein sequence alignment with MSAProbs". Methods in Molecular Biology, 2014, 1079: 211-218

MSA-CUDA

MSA-CUDA parallelizes all three stages of the ClustalW progressive alignment pipeline using CUDA, and achieves significant speedups compared to the sequential ClustalW for a variety of large protein sequence datasets. Furthermore, these speedups also compare favorably to ClustalW-MPI on 32 CPU cores in a high-performance compute cluster. In terms of alignment quality, MSA-CUDA is remarkably consistent with ClustalW on the BAliBASE dataset.
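The three ClustalW stages are pairwise distance computation, guide tree construction by neighbor-joining, and progressive alignment along that tree. As an illustration of the middle stage only, here is a minimal sequential sketch of neighbor-joining that returns just the tree topology as nested tuples. The function name `neighbor_joining` and the tuple representation are assumptions made for this example; branch lengths are omitted, and this is not the MSA-CUDA code.

```python
def neighbor_joining(dist, names):
    """Build a guide-tree topology from a symmetric distance matrix.

    dist[i][j] is the distance between names[i] and names[j].
    Leaves are the given names; each internal node is a tuple of its children.
    """
    nodes = list(names)
    # Dict-of-dicts so that joined nodes can be added without reindexing.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}

    while len(nodes) > 2:
        n = len(nodes)
        # r[a]: sum of distances from a to all current nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}

        # Select the pair that minimizes the neighbor-joining criterion Q.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Replace a and b by a new internal node u and update distances.
        u = (a, b)
        d[u] = {u: 0.0}
        for c in nodes:
            if c != a and c != b:
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [u]

    return (nodes[0], nodes[1])
```

The pair search is quadratic in the number of remaining nodes and is repeated after every join, so the whole procedure is cubic; the inner Q evaluations are independent of each other, which is the kind of work that lends itself to a GPU.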

Publication:

Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA". 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2009), 2009, 121-128 (BEST PAPER AWARD)
Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: "Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA". 8th IEEE International Workshop on High Performance Computational Biology (HiCOMB 2009), 2009, Rome, Italy, IEEE Press.

Distance Matrix Computation for MSA on Cell/BE and x86

We introduce an implementation that accelerates distance matrix computation on x86 and on the Cell Broadband Engine (Cell/BE), which are homogeneous and heterogeneous multi-core systems, respectively. By exploiting multiple processors together with Single Instruction Multiple Data (SIMD) vectorization, we achieve speed-ups of two orders of magnitude over the publicly available implementation used in ClustalW.
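The distance matrix parallelizes well because each of the N(N-1)/2 sequence pairs can be scored independently. The sketch below shows only that decomposition, using a thread pool and a toy distance (edit distance divided by the longer sequence length). The names `edit_distance` and `distance_matrix` and the choice of distance are assumptions for illustration; ClustalW derives its distances from pairwise alignments, and the SIMD vectorization of the actual implementation is not shown. In standard CPython, threads do not run CPU-bound code in parallel, so a process pool or native code would be needed for a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j - 1] + (ca != cb),  # substitute or match
                            prev[j] + 1,               # delete from a
                            curr[j - 1] + 1))          # insert into a
        prev = curr
    return prev[-1]


def distance_matrix(seqs, workers=4):
    """Symmetric matrix of normalized edit distances, one task per pair."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    pairs = list(combinations(range(n), 2))

    def work(pair):
        i, j = pair
        longest = max(len(seqs[i]), len(seqs[j]), 1)  # avoid division by zero
        return edit_distance(seqs[i], seqs[j]) / longest

    # Every pair is independent, so the pairs can be handed to workers freely.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for (i, j), dij in zip(pairs, pool.map(work, pairs)):
            dist[i][j] = dist[j][i] = dij
    return dist
```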

Download: Sourceforge

Publication:

Adrianto Wirawan, Chee Keong Kwoh, Bertil Schmidt: "Multi Threaded Vectorized Distance Matrix Computation on the Cell/BE and x86/SSE2 Architectures". Bioinformatics, 2010, 26(10): 1368-1369