An API and Middleware Systems for Scientific Data Intensive Computing

Gagan Agrawal, Wei Jiang, Tekin Bicer, Yi Wang
Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{agrawal,jiangwei,bicer,wayi}@cse.ohio-state.edu

ABSTRACT

This paper gives an overview of recent and ongoing research at Ohio State on creating middleware solutions (including a programming model) for scientific data-intensive computing. In particular, we have developed a variant of the MapReduce API that reduces the shuffling, sorting, and memory overheads of the original MapReduce API and appears well suited to scientific data analytics applications. In addition, we are addressing many of the limitations of existing MapReduce implementations by adding support for transparent processing of data in popular scientific formats, for accelerators and other emerging architectures, and for cloud environments.

1. INTRODUCTION

In analyzing large-scale data, exploiting parallelism is essential, yet parallel programming and the implementation of parallel algorithms have always been difficult. In this context, the emergence of MapReduce [9] and its recent variants [34, 7, 37, 24, 26, 17, 5, 28, 27] has helped ease parallel implementation. MapReduce, particularly its Hadoop implementation, has been used extensively for parallel data mining. Most of the current work with MapReduce, however, has been in the context of commercial applications. Scientific data-intensive applications are becoming more prominent, and there is a clear need for systems that can support them. MapReduce and related technologies are promising for addressing the needs of scientific data-intensive applications. However, existing implementations of MapReduce have all or most of the following limitations, which are obstacles to their use: 1) they require that data be loaded into a specialized file system, which is simply not feasible when working with massive scientific datasets; 2) they do not allow an algorithm specification to be portable across different data formats, as is often required in scientific domains; 3) they cannot exploit accelerators and other modern many-core architectures; and 4) they are not optimized for cloud-based data storage services. In addition, the system must offer an API suitable for parallelizing the various algorithms for the association studies of interest here.

In this project, we are building on top of earlier work [22, 23, 13, 12, 21, 19, 1] to develop a middleware we will refer to as MATE++, which will address the above needs.

Figure 1: Processing Structures (the Map-Reduce API, the Map-Reduce API with combine(), and the Generalized Reduction API with local reduction into a reduction object rObj followed by global reduction)

2. MATE++ MIDDLEWARE: PRIOR AND ONGOING WORK
2.1 Precursor Work: MATE System
The work proposed here builds directly on top of a system developed at Ohio State, referred to as MATE [21]. MATE, which stands for Map-reduce with an AlternaTE API, demonstrated a variant of the original MapReduce API that still allows easy specification of parallel algorithms but eliminates the shuffling/grouping/sorting overheads. Our API is based on the notion of a Generalized Reduction. Figure 1 shows the processing structures for generalized reduction and for the Map-Reduce programming model with and without the Combine function. The generalized reduction API has two phases. The local reduction phase aggregates the processing, combination, and reduction operations into a single step, shown simply as proc(e) in the figure: each data element e is processed and reduced locally before the next data element is processed. After all data elements have been processed, a global reduction phase commences: all reduction objects from the local reduction nodes are merged, using an all-to-all collective operation or a user-defined function, to obtain the final result. The advantage of this design is that it avoids the overheads of intermediate memory requirements, sorting, grouping, and shuffling, which can degrade performance in Map-Reduce implementations [21]. At first glance, it may appear that our API is very similar to Map-Reduce with the Combine function. However, the Combine function can only reduce communication; the (key, value) pairs are still generated on each map node, which can result in high memory requirements and cause application slowdowns. Our generalized reduction API integrates map, combine, and reduce while processing each element. Because the updates to the reduction object are performed directly after processing, we avoid intermediate memory overheads. Our earlier work has shown that MATE can outperform Hadoop by a factor of 10-20 on several standard data mining algorithms, and that the improvement from the use of this API itself is a factor of 1.5 or higher [21]. In a related development, we have shown how support for very large (disk-resident) reduction objects can be added to the MATE framework, enabling a number of graph mining algorithms; the resulting implementation outperformed the graph mining package PEGASUS, developed at CMU, by a factor of 10 or more [19].

Figure 2: Proposed Design for MATE++ (a user program runs on a runtime system providing partitioning, scheduling, restructuring, splitting, reduction, combination, and finalization, backed by a scientific data processing module)
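To make the two-phase structure concrete, the following small C sketch (written for this overview; it is not the MATE API, and all names are illustrative) counts key occurrences: proc(e) folds each element directly into a small reduction object, mirroring rObj in Figure 1, and global_reduce() merges the per-node objects.

    #include <stdio.h>

    #define NUM_KEYS 4

    typedef struct {
        long counts[NUM_KEYS];   /* the reduction object: one slot per key */
    } rObj;

    /* proc(e): process one element and update the local reduction object */
    static void proc(rObj *local, int element) {
        local->counts[element % NUM_KEYS] += 1;
    }

    /* global reduction: merge another node's reduction object into ours */
    static void global_reduce(rObj *merged, const rObj *other) {
        for (int k = 0; k < NUM_KEYS; k++)
            merged->counts[k] += other->counts[k];
    }

    int main(void) {
        int data[] = {3, 7, 2, 2, 9, 4, 4, 4};
        rObj node0 = {{0}}, node1 = {{0}};

        /* local reduction phase: each "node" folds its split element by element */
        for (int i = 0; i < 4; i++) proc(&node0, data[i]);
        for (int i = 4; i < 8; i++) proc(&node1, data[i]);

        /* global reduction phase: merge the reduction objects */
        global_reduce(&node0, &node1);
        for (int k = 0; k < NUM_KEYS; k++)
            printf("key %d -> %ld\n", k, node0.counts[k]);
        return 0;
    }

Note that no (key, value) pairs are buffered at any point; the only state that grows with the problem is the reduction object itself.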
2.2 Handling Different Scientific Data Formats
Scientific data processing is often complicated by the large number of data formats used to store the same kind of information, and by frequent changes to data format standards. Ideally, data mining algorithms should be implemented independently of particular data formats, and the issues related to loading data from specific formats, and to changes in those formats, should be kept separate. For example, the 1000 Genomes Project uses a different format than its predecessor, HapMap, and an upcoming European project, UK10K, is likely to use a different set of formats. We have been extending the MATE framework to support a variety of scientific formats while keeping the specification of algorithms independent of data format issues. Note that MATE, unlike most MapReduce implementations, does not require data to be reloaded into a specific file system, making it feasible to analyze massive existing datasets.

Figure 2 gives an overview of the design for handling different scientific data formats. The main idea is to support a scientific data processing module, which invokes format-specific partitioning and data loading functions. The partitioning function needs to be implemented for each data format, and it interacts with the corresponding library to retrieve dataset information such as dimensionality/rank, the length of each dimension, the number of units, and the total size of the dataset. Figure 3 shows additional details of the proposed scientific data processing module. The goal is to encapsulate the details of the data format APIs within the data processing module and make them transparent to the implementors of the algorithms. Thus, users can develop applications on disk-resident scientific datasets as if they were ordinary in-memory arrays.

Figure 3: Details of the Proposed Scientific Data Processing Module (data format selector, per-format adapters and third-party libraries, block loader, access strategy selector with full and partial reads, and memory blocks feeding the reduction object)
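As a rough illustration of the kind of interface such a module could expose (a sketch with invented names, written for this overview, not the actual MATE++ code), each format might register an adapter providing metadata, partitioning, and block-loading routines, so that the reduction code only ever sees plain memory blocks:

    #include <stddef.h>

    /* Hypothetical dataset metadata retrieved through the format's own
       library (e.g., NetCDF or HDF5); field names are invented here. */
    typedef struct {
        size_t rank;          /* dataset dimensionality */
        size_t dims[8];       /* length of each dimension */
        size_t element_size;  /* bytes per element */
    } DatasetInfo;

    /* Hypothetical per-format adapter: one instance per supported format. */
    typedef struct {
        int (*get_info)(const char *path, DatasetInfo *info);
        int (*partition)(const DatasetInfo *info, int num_splits,
                         size_t split_offsets[], size_t split_lengths[]);
        int (*load_split)(const char *path, size_t offset, size_t length,
                          void *buffer);   /* fills a plain memory block */
    } FormatAdapter;

A data format selector would then pick the adapter matching the input file, while the partitioning, scheduling, and reduction layers remain unchanged across formats.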
2.3 Optimizing for Cloud Environments
Cloud environments like those provided by Amazon not only offer additional computing power, but are also a means for sharing massive scientific data. Since downloading datasets from a cloud environment is extremely expensive and time consuming, and scientists may not have the space to store and analyze these datasets locally anyway, datasets shared through a cloud environment will likely have to be analyzed using compute resources on the cloud. Though a collection of instances in a cloud environment can be viewed as a cluster, the performance characteristics tend to be very different. Specifically, the S3 storage service on Amazon, where many important datasets such as the bioinformatics 1000 Genomes dataset are currently stored, has very different performance characteristics than a typical parallel file system: S3 is optimized as a data store for web services and provides better performance for a large number of small accesses. In our preliminary work, we have developed a data access optimizer for S3 in which multiple threads independently request chunks of data for a single compute instance. We have shown that MATE, with such optimizations, can outperform Amazon Elastic MapReduce by 1-2 orders of magnitude. As part of the proposed MATE++ framework, we are adding cloud-specific optimizations for the access patterns arising in our target algorithms [1].
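The sketch below illustrates the kind of multi-threaded, chunked retrieval described above. It is not our optimizer: fetch_range() is a placeholder for whatever range-read call the runtime issues against S3 (for example, an HTTP Range request), and the thread count and chunk size are arbitrary values for the example.

    #include <pthread.h>
    #include <stddef.h>

    #define NUM_THREADS 8
    #define CHUNK_SIZE (8 * 1024 * 1024)   /* 8 MB chunks; illustrative only */

    /* Placeholder for an S3 range read; not a real API. */
    extern void fetch_range(const char *object, size_t offset, size_t len,
                            char *dst);

    typedef struct {
        const char *object;
        size_t offset, length;
        char *dst;
    } FetchTask;

    static void *worker(void *arg) {
        FetchTask *t = (FetchTask *)arg;
        /* Issue many smaller range requests, which S3 serves well,
           instead of one large sequential read. */
        for (size_t off = 0; off < t->length; off += CHUNK_SIZE) {
            size_t len = t->length - off;
            if (len > CHUNK_SIZE) len = CHUNK_SIZE;
            fetch_range(t->object, t->offset + off, len, t->dst + off);
        }
        return NULL;
    }

    /* Fetch one compute instance's split with independent reader threads. */
    static void parallel_fetch(const char *object, size_t total, char *dst) {
        pthread_t tid[NUM_THREADS];
        FetchTask task[NUM_THREADS];
        size_t per = (total + NUM_THREADS - 1) / NUM_THREADS;
        for (int i = 0; i < NUM_THREADS; i++) {
            size_t off = (size_t)i * per;
            task[i].object = object;
            task[i].offset = off;
            task[i].length = (off >= total) ? 0
                             : (total - off < per ? total - off : per);
            task[i].dst = dst + off;
            pthread_create(&tid[i], NULL, worker, &task[i]);
        }
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);
    }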
2.4 Exploiting Current and Emerging Accelerator Architectures
In the last few years, many-core accelerators like GPUs have been providing cost-effective, power-efficient, and extreme-scale parallelism. GPUs have been widely used for data mining and related problems (see, for example, http://blogs.nvidia.com/2010/05/the-world-is-parallel-mining-data-on-gpus/). GPUs are part of some of the fastest supercomputers today and are available in cloud environments as well (e.g., Amazon, Zunicore, and others). In scaling our big-data framework to perform more extensive analyses and/or to handle larger and larger datasets, it is important to exploit GPU-level parallelism. Because programming a GPU is hard, a MapReduce(-like) API can again ease the mapping of data mining computations onto GPUs. Our MATE++ framework will support GPU-level parallelism starting from both MapReduce and our generalized reduction API, considering two types of architectures: the more common discrete GPUs, where a GPU is connected over a PCI-Express bus, and the emerging fused designs, where a multi-core CPU and a GPU are integrated on the same chip (e.g., AMD Fusion or Intel Ivy Bridge). For the first architecture, we have already shown that our continuous reduction idea leads to better utilization of shared memory (a small programmable cache on each multiprocessor of a GPU), and hence to much better performance than other GPU-based implementations of MapReduce [4]. We will be extending this work to clusters of GPUs. In addition, architectures where the CPU and GPU are integrated on the same chip provide many interesting options for accelerating MapReduce computations. We propose to consider the following three options: 1) dividing map computations between CPU and GPU cores, with continuous reduction and finalization of the reduction object; 2) scheduling all map computations on GPU cores and all reduce computations on CPU cores; and 3) scheduling all map computations on CPU cores and all reduce computations on GPU cores.
3. BENEFIT OF OUR API: A CASE STUDY
We will now use the apriori association mining algorithm to show the similarities and differences between the map-reduce and generalized reduction APIs. Apriori is a well-known algorithm for association mining, which also forms the basis for many newer algorithms [?]. Association rule mining is the process of analyzing a set of transactions to extract association rules and is a commonly used and well-studied data mining problem. Given a set of transactions, each of them being a set of items, the problem involves finding subsets of items that appear frequently in these transactions. Formally, let L_i represent the set of frequent itemsets of length i and C_i denote the set of candidate itemsets of length i. In iteration i, C_i is generated at the beginning of the stage and then used to compute L_i. After that, L_i is used as C_{i+1} for iteration (i + 1). This process iterates until the candidate itemsets become empty.
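For concreteness, the following toy, sequential C example (written for this overview with invented data; it is not the MATE or Hadoop implementation) instantiates the L_i / C_{i+1} loop described above, encoding itemsets over eight items as bitmasks. The support-counting step inside this loop is what the reduction() and map() functions in Figures 4 and 5 parallelize.

    #include <stdio.h>
    #include <string.h>

    #define NUM_ITEMS 8
    #define MASKS 256

    static int popcount(unsigned m) {
        int c = 0;
        while (m) { c += m & 1; m >>= 1; }
        return c;
    }

    int main(void) {
        /* Each transaction is a bitmask over items 0..7 (made-up data). */
        unsigned tx[] = {0x0B, 0x03, 0x0E, 0x0B, 0x07, 0x03};
        int ntx = 6, min_support = 3;

        char candidate[MASKS], frequent[MASKS];

        /* C_1: every single item is a candidate. */
        memset(candidate, 0, sizeof candidate);
        for (int i = 0; i < NUM_ITEMS; i++) candidate[1u << i] = 1;

        for (int len = 1; ; len++) {
            /* Count support of each candidate in C_len and keep L_len. */
            int any = 0;
            memset(frequent, 0, sizeof frequent);
            for (unsigned m = 1; m < MASKS; m++) {
                if (!candidate[m]) continue;
                int count = 0;
                for (int t = 0; t < ntx; t++)
                    if ((tx[t] & m) == m) count++;
                if (count >= min_support) {
                    frequent[m] = 1;
                    any = 1;
                    printf("frequent itemset 0x%02X (length %d), support %d\n",
                           m, len, count);
                }
            }
            if (!any) break;   /* no frequent itemsets of this length: stop */

            /* C_{len+1}: itemsets of size len+1 whose len-subsets are frequent. */
            memset(candidate, 0, sizeof candidate);
            for (unsigned m = 1; m < MASKS; m++) {
                if (popcount(m) != len + 1) continue;
                int ok = 1;
                for (int i = 0; i < NUM_ITEMS; i++)
                    if (((m >> i) & 1) && !frequent[m & ~(1u << i)]) ok = 0;
                if (ok) candidate[m] = 1;
            }
        }
        return 0;
    }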
Generalized Reduction (Apriori)

    void reduction(void *reduction_data) {
        for each transaction ∈ reduction_data {
            for (i = 0; i < candidates_size; i++) {
                match = false;
                itemset = candidates[i];
                match = itemset_exists(transaction, itemset);
                if (match == true) {
                    object_id = itemset.object_id;
                    accumulate(object_id, 0, 1);
                }
            }
        }
    }

    void update_frequent_candidates(int stage_num) {
        j = 0;
        for (i = 0; i < candidates_size; i++) {
            object_id = candidates[i].object_id;
            count = get_intermediate_result(stage_num, object_id, 0);
            if (count >= (support_level * num_transactions) / 100.0) {
                temp_candidates[j++] = candidates[i];
            }
        }
        candidates = temp_candidates;
    }
Figure 4: Pseudo-code for apriori using the Generalized Reduction API

Fig. 4 shows the pseudo-code of apriori using the generalized reduction API. In iteration i, the programmer is first responsible for creating and initializing the reduction object, which is allocated to store an object id for each frequent-i itemset candidate, along with its support count. The reduction operation then takes a block of input transactions and, for each transaction, scans through the frequent-i itemset candidates, which are generated from the frequent-(i - 1) itemsets of the previous iteration. During the scan, if a frequent-i itemset candidate is found in the transaction, the object id for that candidate is retrieved and its support count is incremented by one. After the reduction operation has been applied to all transactions, the update_frequent_candidates operation is invoked to remove the itemset candidates whose support count is below the support level defined by the programmer.

Fig. 5 gives the pseudo-code of apriori using the map-reduce API. The map function is quite similar to the reduction in Fig. 4; the only difference is that the emit_intermediate function is used instead of the accumulate operation for updating the reduction object. In iteration i, the map function produces the itemset candidate as the key and the count one as the value if the itemset is found in the transaction. After the Map phase is done, for each distinct itemset candidate, the reduce function sums up all the ones associated with that itemset. If its total count of occurrences is not less than the support level, the itemset and the total count are emitted as the reduce output pair. The reduce output is then used in update_frequent_candidates to compute L_i for the current iteration and use it as C_{i+1} for the next iteration, the same as in Generalized Reduction.
MapReduce (Apriori)

    void map(void *map_data) {
        for each transaction ∈ map_data {
            for (i = 0; i < candidates_size; i++) {
                match = false;
                itemset = candidates[i];
                match = itemset_exists(transaction, itemset);
                if (match == true) {
                    emit_intermediate(itemset, one);
                }
            }
        }
    }

    void reduce(void *key, void **vals, int vals_length) {
        count = 0;
        for (i = 0; i < vals_length; i++) {
            count += *vals[i];
        }
        if (count >= (support_level * num_transactions) / 100.0) {
            emit(key, count);
        }
    }

    void update_frequent_candidates(void *reduce_data_out) {
        j = 0;
        length = reduce_data_out->length;
        for (i = 0; i < length; i++) {
            temp_candidates[j++] = reduce_data_out->key;
        }
        candidates = temp_candidates;
    }
Figure 5: Pseudo-code for apriori using MapReduce API
By comparing the implementations, we can make the following observations. In Generalized Reduction, the reduction object is explicitly declared to store the number of transactions that contain each itemset candidate. In each iteration, the reduction object has to be created and initialized first, since the candidate itemsets grow dynamically. For each itemset candidate, the reduction object is updated in the reduction operation if the candidate appears in a transaction. When all transactions have been processed, the update_frequent_candidates operation computes the frequent itemsets and uses them as the new candidate itemsets for the next iteration. In map-reduce, the map function checks whether each itemset candidate appears in a transaction and, if so, emits it with the count one. The reduce function then gathers the occurrence counts for the same itemset candidate and compares the total count with the support level. To summarize, for each distinct itemset candidate, the generalized reduction sums up the number of transactions containing it and checks whether that total reaches the support level. In comparison, map-reduce only examines whether the itemset candidate is in the transaction during the Map phase, without performing the sum; it then completes the accumulation and compares the total count with the support level in the Reduce phase. Therefore, the reduction can be seen as a combination of map and reduce connected by the reduction object.

Figure 6: K-means: Comparison between MATE, Phoenix and Hadoop on 16 cores

Moreover, the intermediate pairs produced by the Map tasks are stored in the file system or main memory and accessed by the Reduce tasks, and the number of such intermediate pairs can be very large: in apriori, processing each transaction can produce several such pairs. In comparison, the reduction object holds only a single count for each distinct itemset and therefore requires much less memory. In addition, map-reduce requires extra sorting, grouping, and shuffling of the intermediate pairs, since each reduce task accumulates the pairs with the same key; the generalized reduction API does not incur any of these costs.
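To make the memory contrast concrete, consider a rough, illustrative estimate (the per-transaction match count here is an assumption made for the example, not a measured value). With the 1,000,000-transaction apriori dataset used in Section 3.1, if each transaction matches on the order of ten candidates in an iteration, the Map phase materializes roughly

    10^6 transactions × 10 matching candidates ≈ 10^7 (itemset, 1) pairs

per iteration, all of which must be stored, grouped, and shuffled, whereas the reduction object needs only one counter per candidate in C_i.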
3.1 Quantifying Performance Improvements
We evaluate the MATE system on multi-core machines by comparing its performance against Phoenix and Hadoop. We chose three popular data mining algorithms: k-means clustering, apriori association mining, and principal components analysis (PCA). The datasets and application parameters were as follows. For k-means, the dataset size is 1.2 GB and the points are 3-dimensional; the number of clusters, k, was set to 100. For apriori, we used a dataset with 1,000,000 transactions, each having at most 100 items and an average of 10 items per transaction; the support and confidence levels were 3% and 9%, respectively, and the application ran for 2 passes. For PCA, the numbers of rows and columns used in our experiments were 8,000 and 1,024, respectively. Our experiments were conducted on an AMD Opteron Processor 8350 machine with 4 quad-core CPUs (16 cores in all); each core has a clock frequency of 1.00 GHz, and the system has 16 GB of main memory. For two of the applications, k-means and apriori, we also compare with Hadoop; we could not compare PCA on Hadoop, as we did not have a Hadoop implementation of PCA. The Hadoop implementations of k-means and apriori were carefully tuned by optimizing various parameters [20]. Figures 6 through 8 show the comparison results for these three applications as we scale the number of cores used. Since our system uses the same scheduler as Phoenix, the main performance difference is likely due to the different processing structure, as discussed in the earlier sections. Note that for Hadoop, we only report results with the maximum number of cores available (16), because the number of tasks in Hadoop cannot be completely controlled; there was no mechanism available to make it use only a subset of the available cores.
Figure 7: PCA: Comparison between MATE and Phoenix on 16 cores
Figure 8: Apriori: Comparison between MATE, Phoenix and Hadoop on 16 cores
Figure 6 shows the comparison for k-means. Our system outperforms both Phoenix and Hadoop; MATE is almost three times as fast as Phoenix with 16 threads. Hadoop is much slower than both MATE and Phoenix, which we believe is due to the high overheads of data initialization, the file system implementation, and the use of Java for computations. Compared to MATE, it also has the overhead of grouping, sorting, and shuffling intermediate pairs [20]. We also see very good scalability with an increasing number of threads or cores with MATE: our system achieves a speedup of about 15.0 going from 1 to 16 threads, whereas Phoenix achieves 5.1. With more than 100 million points in the k-means dataset, Phoenix was slower because of the high memory requirements associated with creating, storing, and accessing a large number of intermediate pairs. MATE, in comparison, directly accumulates values into the reduction object, which is much smaller.

The results for PCA are shown in Figure 7. PCA does not scale as well with an increasing number of cores, because some segments of the program are either not parallelized or do not have regular parallelism. Still, with 16 threads on the 16-core machine, our system was twice as fast as Phoenix. Besides the overheads associated with the intermediate pairs, the execution-time breakdown for Phoenix shows that the Reduce and Merge phases account for a non-trivial fraction of the total time; this overhead is non-existent in the MATE system because of the use of the reduction object.

Finally, the results for apriori are shown in Figure 8. For 1 and 2 threads, MATE is around 1.5 times as fast as Phoenix. With 4 threads and more, the performance of Phoenix was relatively close to that of MATE. One reason is that, compared to k-means and PCA, the Reduce and Merge phases take only a small fraction of the time on the apriori dataset, so the dominant amount of time is spent in the Map phase, leading to good scalability for Phoenix. For the MATE system, the reduction object can be large and needs to be grown and reallocated for all threads as new candidate itemsets are generated during processing, which we believe introduced significant overheads for our system. Hadoop was much slower than our system, by a factor of more than 25.
4. RELATED WORK
There has been a lot of recent work in the area of scalable and scientific data analytics. After the development of MapReduce, many researchers have focused on improving the approach. Seo et al. [34] proposed two optimization schemes, prefetching and pre-shuffling; MapReduce Online [7] extended Hadoop with support for online aggregation and continuous queries; Verma et al. [37] and Kambatla et al. [24] showed that less restrictive synchronization semantics can be supported; Ranger et al. [31] implemented Phoenix for multi-core systems; and Chu et al. [6] used and evaluated map-reduce for machine learning algorithms. GraphLab [26] was developed by Hellerstein's group for machine learning algorithms and improves upon abstractions like map-reduce using a graph-based data model. Sarje and Aluru [32] implemented a map-reduce-like framework to support searches and computations on tree structures. Other programming abstractions for data-intensive computing have also been developed, either on top of map-reduce or as alternatives to it, including Dryad [17], Pig Latin [28], Map-Reduce-Merge [5], and Sawzall [29]. It has also been observed that map-reduce is not well suited to graph operations, due to the high overheads of map-reduce iterations and communication [8]. Pregel [27] was subsequently developed as a new programming model and an optimized infrastructure for mining relationships from graphs.

Integrating MapReduce systems with scientific data has been a topic of much interest recently [14, 38, 39, 25, 2, 18], as summarized by Buck et al. [2]. The Kepler+Hadoop project [38], as the name suggests, combines MapReduce processing with Kepler, a scientific workflow platform; its limitation is that it cannot support processing of data in different scientific formats. Another system [39] allows NetCDF processing using Hadoop, but the data has to be converted into text, causing high overheads. SciHadoop [2] integrates Hadoop with the NetCDF library to allow processing of NetCDF data with the MapReduce API. MRAP [33] is another MapReduce-based framework for HPC analytics applications, which optimizes for certain access patterns; however, it cannot directly operate on scientific data formats.

MapReduce has been implemented on accelerator architectures as well. CellMR [30] was a MapReduce framework implemented on asymmetric Cell multi-core processors; Mars [15] was the first attempt to harness the GPU's power for MapReduce applications; and MapCG [16] was a subsequent implementation shown to outperform Mars. Catanzaro et al. [3] also built a framework around the MapReduce abstraction to support vector machine training as well as classification on GPUs. StreamMR [10] was a MapReduce framework implemented on AMD GPUs. MITHRA [11] was introduced by Farivar et al. as an architecture to integrate Hadoop MapReduce with the power of GPGPUs in heterogeneous environments. Shirahata et al. [35] have extended Hadoop to GPU-based heterogeneous clusters. GPMR [36] was a recent project to leverage the power of GPU clusters. Our existing MapReduce implementation [4] has been compared against several of these and shown to outperform them for reduction-intensive computations.
5. CONCLUSIONS AND FUTURE WORK
This paper has provided a brief overview of recent and ongoing work at Ohio State whose goal is to create programming systems for supporting data-intensive scientific applications. We have shown that the variant of the original MapReduce API that we developed leads to better performance for data mining tasks. In addition, our ongoing work addresses many of the limitations of existing MapReduce implementations, including support for transparent processing of data in popular scientific formats, use of accelerators and other emerging architectures, and support for cloud environments. Our future work will focus on working with scientific data analysis applications and on further extending our implementations.
6. REFERENCES

[1] Tekin Bicer, David Chiu, and Gagan Agrawal. A Framework for Data-Intensive Computing with Cloud Bursting. In Proceedings of the Conference on Cluster Computing, September 2011.
[2] Joe B. Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, and Scott Brandt. SciHadoop: Array-based Query Processing in Hadoop. In Proceedings of SC, 2011.
[3] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. A Map Reduce Framework for Programming Graphics Processors. In Third Workshop on Software Tools for MultiCore Systems (STMCS), 2008.
[4] Linchuan Chen and Gagan Agrawal. Optimizing MapReduce for GPUs with Effective Shared Memory Usage. In Proceedings of the Conference on High Performance Distributed Computing (HPDC), June 2012.
[5] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and Douglas Stott Parker Jr. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In Proceedings of the SIGMOD Conference, pages 1029–1040. ACM, 2007.
[6] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems (NIPS), pages 281–288. MIT Press, 2006.
[7] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce Online. In NSDI, pages 313–328, 2010.
[8] Jonathan Cohen. Graph Twiddling in a MapReduce World. Computing in Science & Engineering, 11(4):29–41, July 2009.
[9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI, pages 137–150, 2004.
[10] Marwa Elteir, Heshan Lin, and Wu-chun Feng. StreamMR: An Optimized MapReduce Framework for AMD GPUs. In ICPADS '11, Tainan, Taiwan, December 2011.
[11] Reza Farivar, Abhishek Verma, Ellick Chan, and Roy Campbell. MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture. In Proceedings of the 2009 IEEE Cluster. IEEE, 2009.
[12] Leo Glimcher and Gagan Agrawal. A Middleware for Developing and Deploying Scalable Remote Mining Services. In Proceedings of the Conference on Cluster Computing and the Grid (CCGRID), 2008.
[13] Leo Glimcher, Ruoming Jin, and Gagan Agrawal. FREERIDE-G: Supporting Applications that Mine Data Repositories. In Proceedings of the International Conference on Parallel Processing (ICPP), 2006.
[14] T. Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey C. Fox. MapReduce in the Clouds for Science. In 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010), Indianapolis, IN, November 2010. IEEE.
[15] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars: A MapReduce Framework on Graphics Processors. In Proceedings of PACT 2008, pages 260–269. ACM, 2008.
[16] Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. MapCG: Writing Parallel Program Portable between CPU and GPU. In PACT, pages 217–226, 2010.
[17] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In Proceedings of the 2007 EuroSys Conference, pages 59–72. ACM, 2007.
[18] Jerry Chou, Kesheng Wu, and Prabhat. FastQuery: A Parallel Indexing System for Scientific Data. In CLUSTER, pages 455–464. IEEE, 2011.
[19] Wei Jiang and Gagan Agrawal. Ex-MATE: Data Intensive Computing with Large Reduction Objects and Its Application to Graph Mining. In Proceedings of the Conference on Cluster Computing and the Grid (CCGRID), 2011.
[20] Wei Jiang, Vignesh Ravi, and Gagan Agrawal. Comparing Map-Reduce and FREERIDE for Data-Intensive Applications. In Proceedings of the Conference on Cluster Computing, 2009.
[21] Wei Jiang, Vignesh Ravi, and Gagan Agrawal. A Map-Reduce System with an Alternate API for Multi-Core Environments. In Proceedings of the Conference on Cluster Computing and the Grid (CCGRID), 2010.
[22] Ruoming Jin and Gagan Agrawal. A Middleware for Developing Parallel Data Mining Implementations. In Proceedings of the First SIAM Conference on Data Mining, April 2001.
[23] Ruoming Jin and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. In Proceedings of the Second SIAM Conference on Data Mining, April 2002.
[24] Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, and Ananth Grama. Relaxed Synchronization and Eager Scheduling in MapReduce. Technical Report CSD-TR-09-010, Department of Computer Science, Purdue University, 2009.
[25] Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and Jeffrey P. Gardner. Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help? In 2009 IEEE International Conference on Cluster Computing (Workshops), pages 1–10. IEEE, 2009.
[26] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[27] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135–146, 2010.
[28] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of the SIGMOD Conference, pages 1099–1110. ACM, 2008.
[29] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005.
[30] M. Mustafa Rafique, Benjamin Rose, Ali Raza Butt, and Dimitrios S. Nikolopoulos. CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters. In IPDPS, pages 1–12, 2009.
[31] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary R. Bradski, and Christos Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 13th HPCA, pages 13–24. IEEE Computer Society, 2007.
[32] Abhinav Sarje and Srinivas Aluru. A MapReduce Style Framework for Computations on Trees. In ICPP, 2010.
[33] Saba Sehrish, Grant Mackey, Jun Wang, and John Bent. MRAP: A Novel MapReduce-based Framework to Support HPC Analytics Applications with Access Patterns. In Proceedings of HPDC, pages 107–118, 2010.
[34] Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng. HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment. In Proceedings of the 2009 IEEE Cluster. IEEE, 2009.
[35] Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters. In CloudCom '10, pages 733–740, 2010.
[36] Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In IPDPS, 2011.
[37] Abhishek Verma, Nicolas Zea, Brian Cho, Indranil Gupta, and Roy H. Campbell. Breaking the MapReduce Stage Barrier. In CLUSTER. IEEE, 2010.
[38] Jianwu Wang, Daniel Crawl, and Ilkay Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In SC-WORKS '09, 2009.
[39] Hui Zhao, SiYun Ai, ZhenHua Lv, and Bo Li. Parallel Accessing Massive NetCDF Data Based on MapReduce. In Proceedings of the 2010 International Conference on Web Information Systems and Mining (WISM '10), pages 425–431, Berlin, Heidelberg, 2010. Springer-Verlag.