Proceedings of
Third International Workshop on
Data Management on New Hardware (DaMoN 2007 ) Anastasia Ailamaki
Qiong Luo
(Editors)
Second International Workshop on
Performance and Evaluation of Data Management Systems (ExpDB 2007 ) Philippe Bonnet
Stefan Manegold
(Editors)
Sponsored by
June 15, 2007, Beijing International Convention Center (BICC), Beijing, China
Contents
Program ............................................................ iii
DaMoN Foreword ..................................................... v
ExpDB Foreword ..................................................... vii
Invited Talks

How do DBMS take advantage of future computer systems? (DaMoN) ..... ix
  Honesty Young (IBM China Research Lab)
From Moore to Metcalf - The Network as the Next Database Platform (ExpDB) ..... ix
  Michael J. Franklin (University of California, Berkeley)
Multi-core, Multi-threading, and Deep Memory Hierarchies

Pipelined Hash-Join on Multithreaded Architectures (DaMoN) ..... 1
  Philip Garcia (University of Wisconsin - Madison), Henry Korth (Lehigh University)
Parallel Buffers for Chip Multiprocessors (DaMoN) ..... 9
  John Cieslewicz (Columbia University), Ken Ross (Columbia University), Ioannis Giannakakis (Columbia University)
A General Framework for Improving Query Processing Performance on Multi-Level Memory Hierarchies (DaMoN) ..... 19
  Bingsheng He (HKUST), Yinan Li (Peking University), Qiong Luo (HKUST), Dongqing Yang (Peking University)
Query Processing on Unconventional Processors

Vectorized Data Processing on the Cell Broadband Engine (DaMoN) ..... 29
  Sándor Héman (CWI), Niels Nes (CWI), Marcin Zukowski (CWI), Peter Boncz (CWI)
In-Memory Grid Files on Graphics Processors (DaMoN) ..... 35
  Ke Yang (HKUST), Bingsheng He (HKUST), Rui Fang (HKUST), Mian Lu (HKUST), Naga Govindaraju (Microsoft Corporation), Qiong Luo (HKUST), Pedro Sander (HKUST), Jiaoying Shi (Zhejiang University)

Trends and Workload Characterization

The five-minute rule twenty years later, and how flash memory changes the rules (DaMoN) ..... 43
  Goetz Graefe (HP Labs)
Architectural Characterization of XQuery Workloads on Modern Processors (ExpDB) ..... 53
  Rubao Lee (ICT, Chinese Academy of Sciences), Bihui Duan (ICT, Chinese Academy of Sciences), Taoying Liu (ICT, Chinese Academy of Sciences)
Program
This year, the DaMoN and ExpDB audiences are united for a joint, fun-filled day with several excellent technical talks and two very interesting keynotes.
8:30 - 8:45    Registration
8:45 - 9:00    Welcome & Opening remarks
9:00 - 10:00   DaMoN Invited talk by Honesty Young (IBM China Research Lab) on How do DBMS take advantage of future computer systems?
10:00 - 10:30  Coffee break
10:30 - 12:00  Session 1 (DaMoN): Multi-core, Multi-threading, and Deep Memory Hierarchies
12:00 - 1:30   Lunch break
1:30 - 2:30    Session 2 (DaMoN): Query Processing on Unconventional Processors
2:30 - 2:45    Short break
2:45 - 3:45    Session 3 (DaMoN/ExpDB): Trends and Workload Characterization
3:45 - 4:00    Short break
4:00 - 5:00    ExpDB Invited talk by Michael J. Franklin (University of California, Berkeley) on From Moore to Metcalf - The Network as the Next Database Platform
5:00 - 5:30    Reflections and Feedback
DaMoN Foreword
The DaMoN workshop takes place for the third time in cooperation with the ACM SIGMOD/PODS 2007 conference in Beijing, China. The second DaMoN workshop took place in cooperation with the ACM SIGMOD/PODS 2006 conference in Chicago, Illinois, USA. The first DaMoN workshop took place in cooperation with the ACM SIGMOD/PODS 2005 conference in Baltimore, Maryland, USA.

Objective

The aim of this one-day workshop is to bring together researchers who are interested in optimizing database performance on modern computing infrastructure by designing new data management techniques and tools.

Topics of Interest

The continued evolution of computing hardware and infrastructure imposes new challenges and bottlenecks to program performance. As a result, traditional database architectures that focus solely on I/O optimization increasingly fail to utilize hardware resources efficiently. CPUs with superscalar out-of-order execution, simultaneous multi-threading, multi-level memory hierarchies, and future storage hardware (such as MEMS) impose a great challenge to optimizing database performance. Consequently, exploiting the characteristics of modern hardware has become an important topic of database systems research.

The goal is to make database systems adapt automatically to the sophisticated hardware characteristics, thus maximizing performance transparently to applications. To achieve this goal, the data management community needs interdisciplinary collaboration with computer architecture, compiler, and operating systems researchers. This involves rethinking traditional data structures, query processing algorithms, and database software architectures to adapt to the advances in the underlying hardware infrastructure.

Workshop Co-Chairs

Anastasia Ailamaki (Carnegie Mellon University, [email protected])
Qiong Luo (Hong Kong University of Science and Technology, [email protected])
Program Committee

Christiana Amza (University of Toronto)
Peter Boncz (CWI Amsterdam)
Philippe Bonnet (University of Copenhagen)
Shimin Chen (Intel Research)
Bettina Kemme (McGill University)
Jun Rao (IBM)
Ken Ross (Columbia University)
Jingren Zhou (Microsoft)
Anastasia Ailamaki
Qiong Luo
ExpDB Foreword

The ExpDB workshop takes place for the second time in cooperation with the ACM SIGMOD/PODS 2007 conference in Beijing, China. The first ExpDB workshop took place in cooperation with the ACM SIGMOD/PODS 2006 conference in Chicago, Illinois, USA.

Objective

The first goal of this workshop is to present insights gained from experimental results in the area of data management systems. The second goal is to promote the scientific validation of experimental results in the database community and to facilitate the emergence of an accepted methodology for gathering, reporting, and sharing performance measures in the data management community.

Current conferences and journals do not encourage the submission of mostly (or purely) experimental results. It is often difficult or impossible to reproduce the experimental results being published, either because the source code of research prototypes is not made available or because the experimental framework is under-documented. Most performance studies have limited depth because of space limitations, and their validity is limited in time because assumptions made in the experimental framework become obsolete.

Topics of Interest

ExpDB is meant as a forum for presenting quantitative evaluations of various data management techniques and systems. We invite the submission of original results from researchers, practitioners, and developers. Of particular interest are:

• performance comparisons between competing techniques,
• studies revisiting published results,
• unexpected performance results on rare but interesting cases,
• scalability experiments,
• contributions quantifying the performance of deployed applications of data management systems.

Workshop Co-Chairs

Philippe Bonnet (University of Copenhagen, Denmark, [email protected])
Stefan Manegold (CWI Amsterdam, The Netherlands, [email protected])
Program Committee

Gustavo Alonso (ETH Zurich, Switzerland)
Mehmet Altinel (IBM Almaden Research Center, USA)
Laurent Amsaleg (IRISA, France)
David DeWitt (University of Wisconsin, Madison, USA)
Stavros Harizopoulos (MIT, USA)
Björn Þór Jónsson (Reykjavik University, Iceland)
Carl-Christian Kanne (Universität Mannheim, Germany)
Paul Larson (Microsoft, USA)
Ioana Manolescu (INRIA Futurs, France)
Matthias Nicola (IBM Silicon Valley Lab., USA)
Raghunath Othayoth Nambiar (Hewlett-Packard, USA)
Meikel Poess (Oracle, USA)
Kian Lee Tan (NUS, Singapore)
Jens Teubner (Technische Universität München, Germany)
Anthony Tomasic (CMU, USA)
Jingren Zhou (Microsoft, USA)

Philippe Bonnet
Stefan Manegold
Invited Talks

How do DBMS take advantage of future computer systems?
(DaMoN)
Speaker

Honesty Young (IBM China Research Lab)

Abstract

Historically, CMOS scaling has automatically provided a certain level of performance enhancement. However, that "free" performance enhancement from device scaling will come to an end, even though CMOS scaling will continue for several more generations. Multi-core has been one architectural feature used to improve chip-level performance. Partially because of the power-dissipation limit, each core of a multi-core chip becomes simpler and smaller and offers weaker single-thread performance. In this talk, we will explain how to avoid potential performance bottlenecks when running typical DBMS software on a massive multi-core chip. For a high-end transaction system, the main memory cost is easily several times the CPU cost; the storage cost is even higher than the main memory cost. We will examine how potential future memory technologies (such as phase-change memory) may impact computer system architecture. A new class of high-volume transaction systems is emerging. Each transaction is relatively simple, but the potential revenue per transaction may be very low. Thus, transaction systems designed for banking-like applications may not be suitable for this new type of application. We will describe the problem and encourage researchers and practitioners to come up with cost-effective solutions.

Biography

Dr. Honesty Young earned his Ph.D. in Computer Science from the University of Wisconsin-Madison. Currently he is the Deputy Director and CTO of IBM China Research Lab. He helped build the first parallel database prototype inside IBM and led an effort that achieved leadership TPC database benchmark results. He has initiated and managed projects in storage appliances and controllers, and spent a year at IBM Research Division Headquarters as a technical staff member. Dr. Young has published more than 40 journal and conference papers, including one best paper and one invited paper. He was the Industrial Program Chair of the Parallel and Distributed Information Systems (PDIS) conference, taught two tutorials at key conferences, and served on the program committees of eight conferences. He is an IBM Master Inventor.
From Moore to Metcalf - The Network as the Next Database Platform
(ExpDB)
Speaker

Michael J. Franklin (University of California, Berkeley)

Abstract

Database systems architecture has traditionally been driven by Moore's Law and Shugart's Law, which dictate the continued exponential improvement of both processing and storage. In an increasingly interconnected world, however, Metcalf's Law is what will drive the need for database systems innovation going forward. Metcalf's Law states that the value of a network grows with the square of the number of participants, meaning that networked applications will become increasingly ubiquitous. Stream query processing is one emerging approach that enables database technology to be better integrated into the fabric of network-intensive environments. For many applications, this technology can provide orders-of-magnitude performance improvement over traditional database systems, while retaining the benefits of SQL-based application development. Increasingly, stream processing has been moving from the research lab into the real world. In this talk, I'll survey the state of the art in stream query processing and related technologies, discuss some of the implications for database system architectures, and provide my views on the future role of this technology from both a research and a commercial perspective.

Biography

Michael Franklin is a Professor of Computer Science at the University of California, Berkeley, and a Co-Founder and CTO of Amalgamated Insight, Inc., a technology start-up in Foster City, CA. At Berkeley, his research focuses on the architecture and performance of distributed data management and information systems. His recent projects cover the areas of wireless sensor networks, XML message brokers, data stream processing, scientific grid computing, and data management for the digital home. He worked several years as a database systems developer prior to attending graduate school at the University of Wisconsin, Madison, where he received his Ph.D. in 1993. He was program committee chair of the 2005 ICDE conference and the 2002 ACM SIGMOD conference, and has served on the editorial boards of the ACM Transactions on Database Systems, ACM Computing Surveys, and the VLDB Journal. He is a Fellow of the Association for Computing Machinery, a recipient of the National Science Foundation CAREER Award, and a recipient of the ACM SIGMOD "Test of Time" award.
Pipelined Hash-Join on Multithreaded Architectures

Philip Garcia (University of Wisconsin-Madison, Madison, WI 53706 USA, [email protected])
Henry F. Korth (Lehigh University, Bethlehem, PA 18015 USA, [email protected])
ABSTRACT

Multi-core and multithreaded processors present both opportunities and challenges in the design of database query processing algorithms. Previous work has shown the potential for performance gains, but also that, in adverse circumstances, multithreading can actually reduce performance. This paper examines the performance of a pipeline of hash-join operations when executing on multithreaded and multi-core processors. We examine the optimal number of threads to execute and the partitioning of the workload across those threads. We then describe a buffer-management scheme that minimizes cache conflicts among the threads. Additionally, we compare the performance of full materialization of the output at each stage in the pipeline versus passing pointers between stages.

1. INTRODUCTION

Recently, multi-core and multithreaded processors have reached the mainstream market. Unfortunately, software designs must be restructured to exploit the new architectures fully. Doing so presents both opportunities and challenges in the design of query-processing algorithms. In this paper, we describe some of the challenges presented to database system designers by modern computer architectures. We then propose parallelization techniques that speed up individual database operations and improve overall throughput, while avoiding some of the problems, such as those described in [18], that can limit performance gains on multithreaded processors. This study builds on the work in [9, 7, 24], but instead of focusing solely on optimizing a single join operation, we examine a pipeline of join operations on uniform heterogeneous multithreaded (UHM) processors, an architectural model that we describe in Section 2.1. The techniques we develop and evaluate are applicable beyond join, and relate to other data-intensive operations. By accounting for the heterogeneous threading model of modern processors and the efficient sharing of data offered by them, we develop query processing algorithms that are more efficient and allow for more accurate runtime estimates, which then can be used by query optimizers. In this paper, we make the following observations:

• Assigning threads to specific "processor thread slots" allows for high performance and throughput.
• Single-die UHM architectures can share data among threads more efficiently than SMP architectures.
• Writing pointers to a buffer instead of writing the full tuple does not save as much work as previously thought.
• Hardware and software prefetching can result in large performance gains within query pipelines.
• Properly scheduling threads on an SMT processor can significantly improve query pipeline runtimes.
• To exploit a multithreaded processor fully, a query pipeline should generate more threads than the architecture can execute concurrently.
• A large memory bandwidth is required to keep all of the processing units busy in multi-core systems.

In Section 2, we describe the changes in computer architectures that motivate this work. Then, we discuss the implications of these new architectures on database systems and describe the specific database query-processing issues on which we focus. In Section 4, we propose a threading model to help take advantage of these processors, and finally, in Section 5, we discuss the results of our study and speculate how this model will perform on future UHM processors.
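The two-join pipeline studied in this paper can be sketched in miniature. The following Python sketch is illustrative only, not the authors' implementation: each hash-join stage runs as its own thread, and stages communicate through bounded buffers. The relations, schemas, and buffer size are invented for the example.

```python
import threading
from queue import Queue

SENTINEL = None  # marks the end of a stage's output stream

def hash_join_stage(build_rows, build_key, probe_key, in_q, out_q):
    """Build a hash table once, then probe it with every tuple
    arriving on in_q, emitting merged tuples on out_q."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    while True:
        probe = in_q.get()
        if probe is SENTINEL:
            out_q.put(SENTINEL)
            return
        for match in table.get(probe[probe_key], []):
            out_q.put({**match, **probe})

# Tiny invented relations: A(name, a), B(name, bkey), C(bkey, c)
A = [{"name": "x", "a": 1}, {"name": "y", "a": 2}]
B = [{"name": "x", "bkey": 10}, {"name": "y", "bkey": 20}]
C = [{"bkey": 10, "c": "u"}, {"bkey": 20, "c": "v"}]

q1, q2, q3 = Queue(maxsize=4), Queue(maxsize=4), Queue(maxsize=4)
# O1: A join B on name; O2: (A join B) join C on bkey
t1 = threading.Thread(target=hash_join_stage, args=(B, "name", "name", q1, q2))
t2 = threading.Thread(target=hash_join_stage, args=(C, "bkey", "bkey", q2, q3))
t1.start(); t2.start()
for row in A:
    q1.put(row)
q1.put(SENTINEL)

results = []
while (row := q3.get()) is not SENTINEL:
    results.append(row)
t1.join(); t2.join()
print(len(results))
```

In this structure the buffer capacity (`maxsize`) is the knob the paper's buffer-management discussion is about: it bounds how far one stage can run ahead of the next.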
2. PROCESSOR ARCHITECTURE
Computer architectures are continuously evolving to take advantage of the rapidly increasing number of transistors that can fit on a single processor die. These new architectures include larger caches, increased memory and cache latencies (in terms of CPU cycles), the ability to execute multiple threads on the same core simultaneously, and the packaging of multiple cores (processors) on the same die. These new features interact in complex ways that make traditional simulations difficult. We have therefore chosen to run our tests on real hardware, which provides a more realistic view of both the processor and the main-memory subsystem. We ran our tests on a dual 3.0 GHz Xeon Northwood processor, a 2.0 GHz Core Duo (Yonah) processor, as well
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN 2007), June 15, 2007, Beijing, China Copyright 2007 ACM 978-1-59593-772-8 ...$5.00.
                  P4 Prescott   Xeon Northwood   Core Duo
Number of cores   1             2                2
Clock speed       2.8 GHz       3 GHz            2 GHz
FSB speed         800 MHz       533 MHz          667 MHz
L1 size           16 KB         8 KB             32 KB
L2 size           1 MB          512 KB           2 MB (shared)
L3 size           -             1 MB             -

will contain many more cores that are each capable of executing multiple threads (using fine-grained multithreading or SMT)[6]. Many of these architectures (such as the Sun Niagara processor[3]) will implement multiple simple cores that sacrifice single-thread performance but yield substantially more throughput per watt and/or die area[6, 5, 3].
2.2 Impact on Database System Design
The architectural changes that we have discussed force a re-examination of database system design. Concurrent database transactions generate inter-query parallelism, but that increased parallelism can result in cache contention when threads or cores share one or more levels of the processor's cache. This puts a higher premium on intra-query parallelism (see, e.g., [12]), which current database systems do not exploit to the same degree as inter-query parallelism. The rapidly expanding number of concurrently executing threads in a UHM architecture[21], combined with increasing memory latency (in terms of cycles), means database systems must be capable of executing an increasing number of threads at once to keep up with the growing thread-level parallelism offered by modern computer architectures.

We propose a threading model that breaks down a query into not just a series of pipeline operations (where each stage executes a thread), but into a series of operations that can themselves be broken down and executed by multiple threads. This allows the system to choose a level of threading that is appropriate for both the workload presented to it and the architectural features of the machine on which it is running. Additionally, on UHM systems, the system can choose the thread context on which to schedule a thread, in order to make the greatest use of the resources available at the time.

While much work has been done on optimizing query pipelines, much of it has focused on either uniprocessor or SMP systems that assume a homogeneous threading model. New designs with UHM processors must first decide on which physical processor to execute a thread, and separately decide both on which core within the processor, and on which thread context within the core, to run it. New schedulers must take into account how many threads are currently executing on the core, as well as what each thread on the core is doing.
Much of the work on query pipeline optimization has also not taken into account the effects of using software prefetch instructions within the pipeline to improve performance further, with exceptions being [7, 9]. In this study, we examine intra-query parallelism within multiple hash-join operations. By breaking down each join into parallelizable threads, we have shown that both response time and throughput can be improved.
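As a hedged illustration of intra-operator parallelism of the kind described above (the operator, data, and thread count are invented for the example, and CPython's GIL means this shows the structure rather than real parallel speedup), the probe phase of a single hash join can be split across worker threads that share a read-only hash table and write to private output buffers, so no locking is needed:

```python
import threading

def parallel_probe(table, probe_rows, key, n_threads=4):
    """Probe a shared read-only hash table with n_threads workers,
    each handling an interleaved slice of the probe relation."""
    chunks = [probe_rows[i::n_threads] for i in range(n_threads)]
    out = [[] for _ in range(n_threads)]  # one private buffer per worker

    def worker(idx):
        for row in chunks[idx]:
            for match in table.get(row[key], []):
                out[idx].append({**match, **row})

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return [r for part in out for r in part]

# Invented example: join R(k, r) with S(k, s) on k
R = [{"k": i % 3, "r": i} for i in range(9)]
S = [{"k": 0, "s": "a"}, {"k": 1, "s": "b"}]
ht = {}
for row in S:
    ht.setdefault(row["k"], []).append(row)
joined = parallel_probe(ht, R, "k", n_threads=3)
print(len(joined))
```

The per-worker output buffers are the interesting design point: because the hash table is read-only during the probe and each worker writes only to its own list, the workers never contend on shared state.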
Table 1: Details of the processors used

as a 2.8 GHz Pentium 4 Prescott, as shown in Table 1. All of the machines ran Debian GNU/Linux with kernel version 2.6. We focused on the results obtained on the Pentium 4 processor, and unless otherwise noted, all results given are for it. In this section, we discuss some of the details of multithreaded architectures and their impact on database query processing.
2.1 Multithreaded Architectures
Multithreaded processor architectures are being designed not only to enable the highest performance per unit die area, but also to obtain the highest performance per watt of power consumed[6, 5, 19, 3]. To achieve these goals, computer architects are no longer focusing on increasing instruction-level parallelism and clock frequencies, and instead are designing new architectures that can exploit thread-level parallelism (TLP). These architectures manifest themselves in two ways: chip multiprocessors (CMP) and multithreaded processors. CMP systems are a logical extension of SMP systems, but with the multiple cores integrated on a single processor die. However, many CMP systems differ from traditional SMP systems in that the cores share one or more levels of cache. Multithreaded processors, on the other hand, allow the system to execute multiple threads simultaneously on the same processor core. One of the more popular forms of multithreading is simultaneous multithreading (SMT); however, other methods are possible[8, 23, 16, 22].

Many of these new multithreaded and CMP processors belong to a class of processors called uniform heterogeneous multithreaded (UHM) processors[21]. This class of architectures allows multiple threads (of the same instruction set) to share limited resources in order to maximize utilization. In this model, not all hardware-thread contexts are equivalent, and the behavior of one thread can adversely affect the behavior of another. This effect is generally due to shared caches, but it can also be caused by poor instruction mixes. UHM architectures should not be confused with heterogeneous multiprocessors, in which the processor units themselves vary significantly or have differing instruction sets, such as a graphics coprocessor.1 Multithreaded processors have become the standard for high-performance microcomputing.
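Because not all hardware-thread contexts are equivalent on a UHM processor, a runtime may want to choose exactly which contexts a worker may occupy. A minimal sketch of such explicit placement, assuming a Linux host (`os.sched_setaffinity` is Linux-only, and the choice of context here is arbitrary for the example):

```python
import os

# Hardware-thread contexts this process may currently run on
# (pid 0 refers to the calling process/thread).
allowed = os.sched_getaffinity(0)
# Pin ourselves to a single context, e.g. the lowest-numbered one.
target = {min(allowed)}
os.sched_setaffinity(0, target)
print(os.sched_getaffinity(0))
```

A real scheduler for a UHM machine would choose the target set based on which contexts share a core or a cache, rather than simply taking the lowest-numbered one.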
The major vendors of high-performance processors are currently focusing on dual and multi-core designs[2, 1, 14, 3], and many are either shipping processors using multithreaded and/or SMT technology [16, 14, 3] to accelerate their processors. Today’s high-end database servers often contain 2-16 processors that are each capable of executing two threads. Within the next few years it is likely that a single microprocessor
2.3 Prior Work
The work we describe here differs from earlier work [9, 7, 24] in several significant ways. In earlier work, software prefetching was examined in a single-threaded simulation[7], and was later extended to run on real machines[24, 9]. The prior work of Zhou et al. [24] examined a single hash-join operation on an SMT processor; however, this work was done on the Northwood variant of the Pentium 4, which does not fully support software prefetching, so a form of data preloading was used instead of prefetching. [9] further built upon the model in [7, 24] and was designed such that multiple threads could perform a single hash join. That work, however, did not consider a pipeline of operations, and additionally required an initial partitioning that can result in suboptimal performance. In this paper, we consider a larger problem domain (pipelines) and a richer processing model aimed at UHM processors. This work differentiates itself by studying not the algorithms involved, but rather the impact of architecture on the end result. Through executing an example database pipeline, we can observe the interaction of program structures with the system architecture. By doing this, we gain valuable insight into how best to design query pipeline execution strategies and how to choose an appropriate platform for query processing systems.

1 See [11] for an example of database processing on a graphics co-processor.
3. PROBLEM DESCRIPTION

[Figure: the example query plan. At the root, O3 projects A.a, B.b, and C.c; below it, O2 joins on B.bkey = C.bkey, taking as inputs relation C and the join O1, which joins on A.name = B.name over relation A (and relation B).]
We chose to examine a pipeline of two joins; however, our algorithm can easily be extended to support more general n-way joins. For this study, we examine the performance of the query pipeline when running on various computer architectures. We also examine the performance of our threading model as a function of the number, size, and type of data stored in the buffers used to share data among the threads. An important consideration in query-pipeline processing is the buffer size used and the number of buffers that are allocated to facilitate inter-process communication. We show that buffer size has a major effect on overall algorithm performance, as do prefetching attempts (done by both hardware and software). Another important consideration is whether or not to materialize pointers. This becomes doubly important in a query pipeline consisting of operations O1, O2, ..., Om because the data must be brought into cache for the first join (operation Oi) and are possibly reused in the next join (operation Oi+j).2 Because of this, materializing the output requires memory to store both the input relation and the output relation. This results in a larger overall cache footprint3, although there is no deterministic way to tell how much larger this is on current computer architectures (due to streaming prefetch buffers, memory access patterns, prefetch instructions, etc.). Recent research [9] has also shown that the time required to copy small amounts of data (