Inspired solely by Atbrox's list of academic papers for MapReduce & Hadoop algorithms. Unlike computer science, where applications of MapReduce/Hadoop are highly diversified, most published implementations in bioinformatics still focus on the analysis and/or assembly of biological sequences. As usual, this list will be updated from time to time. If you find that an important paper is missing from the list, please drop a comment at the end of the post.
Review articles
The paper describes the concepts behind Hadoop and the associated HBase project, and current bioinformatics software that employs Hadoop.
Sequence analysis/assembly
The PeakRanger paper describes a Hadoop version with support for splitting the job by chromosome to take advantage of the chromosome-level independence (CLI) of ChIP-seq data sets. In the CLI case, "map-then-reduce" becomes "split-by-chromosome-then-call-peaks", where chromosomes are used as keys.
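The "split-by-chromosome-then-call-peaks" idea can be sketched in a few lines of Python. This is only an illustration of using chromosomes as keys: the function names, the read tuples, and the toy windowed peak caller (`window`, `min_reads`) are assumptions made for the example, not PeakRanger's actual algorithm or API.

```python
from collections import Counter, defaultdict

def map_by_chromosome(reads):
    """'Map' step: group read positions under their chromosome key."""
    groups = defaultdict(list)
    for chrom, pos in reads:
        groups[chrom].append(pos)
    return groups

def call_peaks(positions, window=100, min_reads=3):
    """Toy stand-in for a peak caller (hypothetical): report fixed-size
    windows that contain at least min_reads read positions."""
    bins = Counter(pos // window for pos in positions)
    return sorted((b * window, (b + 1) * window)
                  for b, n in bins.items() if n >= min_reads)

# Made-up (chromosome, position) reads, for illustration only.
reads = [("chr1", 105), ("chr1", 120), ("chr1", 180), ("chr1", 950),
         ("chr2", 10), ("chr2", 40), ("chr2", 55), ("chr2", 700)]

# 'Reduce' step: each chromosome is processed independently of the others,
# so on a cluster each key could be handled by a different worker.
peaks = {chrom: call_peaks(pos) for chrom, pos in map_by_chromosome(reads).items()}
print(peaks)
```

Because no chromosome needs data from another, the per-key work requires no communication between workers, which is what makes the split a natural fit for Hadoop.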
A Hadoop cluster was used for counting k-mers and for summing the partial counts computed on individual machines, using an extension of the MapReduce word-counting algorithm.
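To make the word-count analogy concrete, here is a minimal single-machine sketch in Python, assuming each input chunk is an independent read. The names `count_kmers` and `merge_counts` and the example reads are hypothetical; a real Hadoop job would distribute the map and reduce steps across machines.

```python
from collections import Counter
from functools import reduce

def count_kmers(sequence, k):
    """'Map' step: count every k-mer in one read, as one worker might."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def merge_counts(a, b):
    """'Reduce' step: sum two partial count tables."""
    return a + b

# Made-up reads standing in for the data held by individual machines.
reads = ["ACGTAC", "GTACGT"]

partials = [count_kmers(r, 3) for r in reads]
total = reduce(merge_counts, partials, Counter())
print(total)
```

The only difference from word counting is that the "words" are overlapping substrings of length k rather than whitespace-separated tokens.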
The study illustrates two use cases, one the analysis of gene sequence data (35339 Alu sequences) and the other a study of medical information (over 100,000 patient records), and compares the performance of the MapReduce computing model with MPI.
Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. Myrna is designed with a parallel Hadoop/MapReduce model in mind. Myrna can be run on the cloud using Amazon Elastic MapReduce, on any Hadoop cluster, or on a single computer (without requiring Hadoop).
Describes adapting a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Compute Cloud (EC2).
Describes a parallel read mapping algorithm optimized for aligning next-generation sequence data to reference genomes.
Describes Hadoop implementations of three algorithms: BLAST, GSEA and GRAMMAR.
Describes CloudBurst, a new highly scalable read-mapping algorithm optimized for next-generation sequence data.
Describes an implementation which integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed.
Describes Crossbow, a cloud-computing software tool that executes in parallel using Hadoop and combines the aligner Bowtie and the SNP caller SOAPsnp.
- The Genome Analysis Toolkit: a MapReduce framework for analyzing next generation DNA sequencing data
Describes the Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce.
Describes Dryad (Microsoft’s implementation of MapReduce) and Azure with applications in EST sequence assembly, identification of HLA-associated viral evolution, and pairwise Alu gene alignment. Dryad combines the MapReduce programming style with dataflow graphs to solve the computation tasks.
Describes a multiple sequence alignment method with improved computation time and accuracy using the Hadoop framework.
Describes a Hadoop-based implementation of the BLAST and GSEA (Gene Set Enrichment Analysis) algorithms.
Describes a wide range of topics using Microsoft’s MapReduce framework Dryad, including an iterative MapReduce programming model for analysing metagenomics data. (Highly recommended)
Phylogenetic
Describes a MapReduce framework for designing phylogenetic applications.
Workflow
Describes the integration of Hadoop with the Kepler workflow system, which enables users to compose and execute MapReduce applications.
Describes Hydra, a MapReduce-like middleware that supports data parallelism and parameter sweep parallelism in bioinformatics workflows.
Pattern finding/motif detection
Describes a MapReduce-based pattern finding algorithm (MRPF) for analyzing complex networks.