Abstract

Dear Editor:
What data analysis problems in science can use clouds and/or MapReduce. What data analysis problems need iterative algorithms poorly supported by basic MapReduce. What are tradeoffs in performance, usability, flexibility, and fault tolerance between MPI and Iterative MapReduce. What are requirements for workflow systems needed to support complicated science data processing.
Here we examine different ways for using clouds for pleasingly parallel applications where we have compared five different approaches using two biomedical applications. We look at the cloud infrastructure service-based virtual machine utility computing models of Amazon AWS and Microsoft Windows Azure; MapReduce-based computing frameworks Apache Hadoop (deployed on raw hardware as well as on virtual machines) and Microsoft DryadLINQ. We compare performance showing strong variations in cost between different EC2 machine choices and comparable performance between the utility computing (spawn off a set of jobs) and managed parallelism (MapReduce). The MapReduce approach offered the most user-friendly approach. Typical results (Gunarathne et al., 2010) are shown in Figure 1.

Time to process a single biology sequence file (458 reads) per core with different frameworks.
Results such as in Figure 2, where the results of 30,000 Metagenomics sequences in 3D are shown, can be seen as in a typical bioinformatics pipeline of Smith-Waterman distance Computation, Deterministic Annealing Clustering and MDS visualization as shown in Figure 3.

Results of 17 clusters for full sample using Sammon's version of MDS for visualization.

Pipeline for analysis of metagenomics data.
The visualization uses dimension reduction where we have implemented two powerful methods GTM (Generative Topographic Mapping) and MDS (Multidimensional Scaling) (Bae et al., 2010; Choi et al., 2010).
Only MDS can be used for DNA sequence visualization as GTM requires a vector representation of original high dimensional data, whereas MDS only requires the N by N matrix of dissimilarity scores between sequences. Multiple Sequence Alignment needed to obtain a uniform vector representation of sequences is typically infeasible. The distance matrix calculation needed by MDS is very suitable for cloud implementation as the computations are independent. However, both clustering and MDS require parallel implementation as they are expensive O(N2) computations; the run time of these on a 768 core cluster is about 3 h for 30,000 sequences with a speed up of 500. These parallel implementation run poorly on clouds or MapReduce as their iterative algorithms require the long running processes and low latency of MPI. Thus, we see hybrid cluster-cloud architectures as needed for this class of problem where a complete workflow is gotten by linking separate services in clouds and closely coupled clusters.
We have developed new interpolation algorithms for both MDS and GTM, which can exploit clouds and MapReduce for the dominant part of the computation for large problems. These perform a basic dimension reduction for a sample of the data (20,000–100,000 points), which runs using MPI on a cluster; the remaining points are interpolated, which is a pleasingly parallel cloud application. We will present performance results for run time and quality of dimension reduction.
Alternatively we have extended MapReduce in an open source system, Twister (Ekanayake et al., 2010; Twister, 2010; available at http://www.iterativemapreduce.org/), that supports iterative computations of the type needed in clustering, MDS and GTM. This programming paradigm is attractive as it supports all phases of the pipeline in Figure 1. We present performance comparisons between MPI, MapReduce, and Twister on kernel applications such as matrix multiplication as well as the core services of Figure 1 in (Ekanayake et al., 2010). Other approaches to Iterative MapReduce are Pregel (Malewicz et al., 2010), HaLoop (Bu et al., 2010), and Spark (Zaharia et al., 2010).
Footnotes
Author Disclosure Statement
The author declares that no conflicting financial interests exist.
