MapReduce and Parallel DBMSs: Together at Last - MIT Database ...

Recommend Documents

CSE 444 Practice Problems Parallel DBMSs and MapReduce

CSE 444 Practice Problems. Parallel DBMSs and MapReduce. 1. Parallel Data Processing Algorithms. (a) Describe how to compute the equi-join of two ...

The EU and Africa: Coming Together at last?

the African Union, AU) and with leadership from major regional .... Future missions in Africa â from Darfur to Somalia â look likely. In parallel, EU-Africa dialogue ...

MapReduce and Relational Database Management Systems ...

(McClean et al., 2013), MapReduce tasks run on top of Distributed File ..... Olston C., Reed B., Srivastava U., Kumar R. and Tom- kins A. 2008. Pig latin: a ...

MapReduce and Relational Database Management Systems ...

The data is growing at an alarming speed in both ..... tems: Putting analytics and big data in cloud. Deci- ... A Comparison of MapReduce and Parallel Database.

Virtually Living Together - People - MIT

contrary, if we design using pure intuition and visions, the design is likely to fail due ... touch for use as a mood-induction technique for emotional communication ...

Virtually Living Together - People - MIT

Virtually Living Together. A Design Framework for New Communication Media. Konrad Tollmar, Stefan Junestrand, Olle Torgny. Smart Things and Environments ...

Parallel PSO Using MapReduce - Bouncing Chairs

MapReduce Particle Swarm Optimization (MRPSO) is a parallel implementation of ..... the interface between Java and Python code. C. Methodology. MRPSO ...

Parallel PSO Using MapReduce - Bouncing Chairs

Google faced these same problems in large-scale paral- lelization, with hundreds ... and fault tolerance. MapReduce Particle Swarm Optimization (MRPSO) is a.

Distributed and Parallel Database Systems

distributed database management systems and parallel database management ... A parallel computer, or multiprocessor, is itself a distributed system made of a ...

SCOPE: parallel databases meet MapReduce - Semantic Scholar

fits from both traditional parallel databases and MapReduce execution engines to ..... Scope outputs data by means of built-in or custom output- ters. The system ...

Parallel Sorted Neighborhood Blocking with MapReduce

Oct 15, 2010 - match product offers for price comparison portals. ..... becomes feasible and the window size w allows for a dedicated ..... dedicated server.

Parallel K-Means Clustering Based on MapReduce

Keywords: Data mining; Parallel clustering; K-means; Hadoop; MapRe- duce. 1 Introduction. With the development of information technology, data volumes ...

Parallel K-Means Clustering Based on MapReduce

In this paper, we propose a parallel k-means clustering algorithm based on MapReduce ... oriented parallel clustering algorithms should be developed. .... on a cluster of computers, each of which has two 2.8 GHz cores and 4GB of memory.

The Efficiency of MapReduce in Parallel External

Dec 16, 2011 - ing pioneer work that relates the MapReduce framework with PRAM ... ãkey, valueã pair as input and generates a number of intermediate ãkey, valueã .... Our contribution We provide upper and lower bounds on the parallel.

Towards Systematic Parallel Programming over MapReduce

to develop parallel programs with MapReduce systematically, since it is ... ++ to denote the list concatenation: [3, 1, 4] ++ [1, 5] = [3, 1, 4, 1, 5]. A list that has only ...

SCM - MIT Database Group

NAND. Desktop. HDD. DRAM. 1990. 1995. 2000. 2005. 2010. 2015. Enterprise. HDD. SCM .... Today's failure semantics lets f

A Last, Loving Look at an MIT Landmark -- Building 20

The building was the only one on campus with its own lunchroom. ... cence, see RLE's Building 20 web page: htt~://rleweb

Fermat's Last Theorem - DSpace@MIT

May 14, 2015 - proved, and why it implies Fermat's Last Theorem; this implication is a ... We will say very little about the details of Wiles' proof, which are.

At Last

English is not the home language, learning what children can do across a range ... they can serve to identify critical issues and inequities; instead, we call ... focus on a narrow range of what constitutes the students' literacy toolkit and rep- ...

Parallel Database Systems

http://www.research.microsoft.com/research/BARC/Gray/PDB95.ppt. Chapter 22, Part A ... Parallelism is natural to DBMS processing .... Parallel DBMS Summary.

Topology in DBMSs - CiteSeerX

are maintained in one DataBase Management System. Mainstream DBMSs have implemented spatial data types, spatial functions and spatial indexes.

MapReduce and PACT - Comparing Data Parallel Programming Models

... of data processing jobs. Its open-source implementation ... from the perspective of the developer writing the data analysis tasks. We present tasks from.

PARTIALLY ASYNCHRONOUS, PARALLEL ALGORITHMS ... - MIT

Key words, parallel algorithms, asynchronous algorithms, nonexpansive functions, ..... can be modified by introducing a relaxation parameter, so that it satisfies ...

Together at home - IPPR

Jun 26, 2012 - local authorities to leverage their assets and income. â .... housing out of reach for all but the poor

MapReduce and Parallel DBMSs: Together at Last - MIT Database ...

Download PDF

0 downloads 94 Views 670KB Size Report

Comment

Jan 28, 2010 - Large-Scale Data Analysis. â« CACM '10. â« MapReduce ... Analytical Tasks: â« Web Log Aggregation ...

MapReduce and Parallel DBMSs: Together at Last Andrew Pavlo New England Database Summit – MIT CSAIL January 28, 2010

January 28, 2010

Co-Authors      

Daniel Abadi (Yale) David DeWitt (Microsoft) Samuel Madden (MIT) Erik Paulson (Wisconsin) Alexander Rasin (Brown) Michael Stonebraker (MIT)

January 28, 2010

Today’s Talk  SIGMOD ‘09  A Comparison of Approaches to Large-Scale Data Analysis

 CACM ‘10  MapReduce and Parallel DBMSs: Friends or Foes?  Compare/Contrast with Jeffrey Dean & Sanjay Ghemawat (Google)

January 28, 2010

Outline     

Introduction Benchmark Study & Results Sweet Spots Together At Last Concluding Remarks

January 28, 2010

Benchmark Environment  Tested Systems:  Hadoop (MapReduce)  Vertica (Column-store DBMS)  DBMS-X (Row-store DBMS)

 100-node cluster at Wisconsin

XXX

 Additional configuration information is available on our website.

January 28, 2010

Benchmark Tasks  Original MR Grep Task:  Find 3-byte pattern in 100-byte record  Dean et al. (OSDI ‘04)

 Analytical Tasks:  Web Log Aggregation  Table Join with Aggregation  User-defined Function

January 28, 2010

Results Summary Grep Task Web Log Join

Hadoop 284 sec 1146 sec 1158 sec

DBMS-X 194 sec 740 sec 32 sec

Vertica 108 sec 268 sec 55 sec

 Full results are available in our SIGMOD & CACM papers.

January 28, 2010

Outline     

Introduction Benchmark Study & Results Sweet Spots Together At Last Concluding Remarks

January 28, 2010

Extract-Transform-Load  “Read Once” data sets:     

Read data from several different sources. Parse and clean. Perform complex transformations. Decide what attribute data to store. Load the information into a DBMS.

 Allows for quick-and-dirty data analysis.

January 28, 2010

Semi-Structured Data  MapReduce systems can easily stored semistructured data since no schema is needed:  Typically key/value records with a varying number of attributes.

 Awkward to stored in relational DBMS:  Wide-tables with many nullable attributes.  Column store fairs better.

January 28, 2010

Limited Budget Operations  MapReduce frameworks:  Community supported and driven.  Attractive for projects with modest budgets and requirements.

 Parallel DBMSs are expensive:  No open-source option.

January 28, 2010

Together At Last?  What can MapReduce learn from Databases?  Fast query times.  Schemas.  Supporting tools.

 What can Databases learn from MapReduce?  Ease of use, “out of box” experience.  Attractive fault tolerance properties.  Fast load times.

January 28, 2010

MR+DBMS Integration  Vertica now integrates directly with Hadoop:  Hadoop jobs can use Vertica as input source.  Push Map/Reduce tasks down directly into DBMS nodes.

 Other notable commercial MR integrations:  Greenplum  AsterData  Sybase IQ

January 28, 2010

MR+DBMS Integration  HadoopDB (Yale+Brown):  Replace Hadoop’s distributed file system with multiple database instances.  Rewrite Hive query plans into localized SQL for each execution node.

 Position available for HadoopDB @ Yale

January 28, 2010

Other Work  MRi (Wisconsin):  Improving Hadoop by adding DBMS technologies that are transparent to users.  Ported GiST Search Trees to Hadoop.

 SQL Server 2008 R2 (Microsoft):  Including “MapReduce-like” functionality into parallel data warehouse version of MSSQL (Project Madison)

January 28, 2010

Conclusion  Complete benchmark information and source code is available at our website:  http://database.cs.brown.edu/sigmod09/

 Questions/Comments?