Parallel Set Operations in Complex Object-Oriented Queries

A Dissertation Presented to the Faculty of the School of Engineering and Applied Science, University of Virginia, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Computer Science)

by

Russell F. Haddleton January 1998

Approvals

This dissertation is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science)

Russell F. Haddleton Approved:

John L. Pfaltz (Advisor)

Alan P. Batson

Sang Son

James C. French (Chair)

Margaret Mayer (Minor Representative)

Accepted by the School of Engineering and Applied Science:

Richard W. Miksad (Dean) January 1998

Abstract

This dissertation presents a new parallel object-oriented database system implementation and architecture. The system, parallel ADAMS, is implemented for large-scale scientific database applications, where the retrieval of complex data from very large collections is a primary operation. Aside from being a parallel implementation, parallel ADAMS differs from typical OODBMSs in three significant ways: (1) it employs the decomposed storage model, rather than contiguous object storage, (2) it is based on a query server architecture, and (3) it employs a shared nothing distributed architecture. Parallel ADAMS sets are partitioned by oid. In the dissertation, we demonstrate that set operators, and therefore logical query connectives, can be performed in a completely data parallel fashion. More complex queries involving implicit joins require inter-processor communication, which is minimized in our implementation. The implementation runs on general purpose hardware. Results are provided for a group of 1-8 SUN processors. We observe "super-linear" speed up and nearly linear scale up for queries over a one million object (approximately 500 megabytes) database. In addition to measuring parallel performance in terms of time, we develop a formal model of behavior which could also be used for other database implementations. We model data movement in terms of primitive operations. Given accurate times of these primitive operations, performance times can be predicted. We use the model to explain the behavior of the parallel ADAMS system, and use

extensive tests of the system to validate the model. ADAMS is a working system which supports the popular "Oracle of Bacon" web site.

Copyright © 1998 Russell Flanders Haddleton. All Rights Reserved. January, 1998

To Anne Marie For patiently walking this long trail with me.

Acknowledgements

This work would not have been possible without the guidance of my advisor John Pfaltz. His technical insights were valuable. His insights into my working style were beyond value. I owe him a great debt. Jim French, my doctoral committee chair, well deserves my thanks. His efforts to improve this dissertation are much appreciated. The other members of my dissertation committee, Sang Son, Alan Batson, and Margaret Mayer, also have my gratitude.

My fiancée, Anne Marie Longobucco, has shared this effort with me. Her support and patience through the years have been indispensable. I am also grateful for the support my parents have shown me in all my educational endeavors.

This work required many hours in the lab. I am grateful to my colleagues for helping to make this time bearable. For technical help, Adam Ferrari and Nguyen-Tuong deserve thanks for insights into the Mentat system, while Brett Tjaden and Glenn Wasson deserve thanks for providing a stupendous ADAMS test application. Chi Nguyen and Colleen DeJong are also of particular recent note, but there are many others who have contributed to what I have experienced as the lab sense of community over the years. I am also grateful to the department's support staff for their assistance, both in the front office and in the technical arena.

This work has been supported in part by Department of Energy grant DE-FG05-95ER25254. This assistance is greatly appreciated.

Contents

Abstract
Acknowledgements
List of Symbols

1 Introduction
  1.1 The Problem
  1.2 The Importance of the Problem
  1.3 Approach
  1.4 Contributions
  1.5 Structure of this Dissertation

2 Objects, Sets, and ADAMS Parallel Sets
  2.1 Objects
    2.1.1 Object Storage
  2.2 ADAMS Sets
  2.3 Streams
    2.3.1 Stream Operations
    2.3.2 Streams and Complex Queries
  2.4 Parallel Sets
    2.4.1 Partitioning
    2.4.2 Abstract Model and Implementation

3 Representation of Data on Disk
  3.1 Data Structures
  3.2 The Set Implementation
    3.2.1 Optimizing File Accesses
    3.2.2 The O-tree Cache
    3.2.3 Sequential Insertions

4 Client/Server Database Architectures
  4.1 Task Partitioning
  4.2 Page Server
  4.3 Object Server
  4.4 Query Server
  4.5 ADAMS Parallel Query Server

5 Parallelism
  5.1 Data Parallelism
  5.2 Dataflow Parallelism
  5.3 No Return Parallelism
  5.4 Loop Parallelism
  5.5 Simple I/O Parallelism
  5.6 Parallelism

6 Related Work
  6.1 Implementations
    6.1.1 FAD and Bubba
    6.1.2 SHORE
    6.1.3 ACOS
    6.1.4 AGNA
    6.1.5 OSAM*.KBMS
    6.1.6 PRIMA
    6.1.7 Main Memory Systems
  6.2 Conceptual Models
    6.2.1 Kilian's ParSETS
  6.3 Analytic Models
  6.4 Benchmarks
    6.4.1 The 007 Benchmark
    6.4.2 Earlier OODB Benchmarks
    6.4.3 Relational Benchmarks
    6.4.4 Benchmark Summary

7 The Parallel Implementation
  7.1 Process Structure
    7.1.1 Client Side - Application Program
    7.1.2 Server Side - ADAMS Server
    7.1.3 Butler Process
  7.2 Employing Mentat
  7.3 Parallel and Sequential ADAMS
  7.4 Summary

8 The Analytical Model
  8.1 Database Performance Models
    8.1.1 Simulations
    8.1.2 Analytical Modeling and ADAMS
  8.2 An Access-Based Analytic Approach
    8.2.1 Database Model Components
  8.3 An Analytic Model
    8.3.1 Union
    8.3.2 Intersection and Complement
    8.3.3 Set Expressions
    8.3.4 Set Creation with Unordered OIDs
    8.3.5 Retrieval Sets
    8.3.6 Compound Map Queries
  8.4 Parallel Query Processing
  8.5 Page Server Union
  8.6 Summary

9 Performance
  9.1 Parallel Performance Goals
  9.2 Benchmark Objectives
  9.3 The Benchmark Database
  9.4 The Benchmark Queries
  9.5 Speed up and Scale up Examples
    9.5.1 Speed up
    9.5.2 Scale up
  9.6 Model Validation
    9.6.1 Experimental Non-Factors
    9.6.2 Observed Behavior
    9.6.3 Validation Technique
  9.7 Validation Run Results
    9.7.1 Union, Intersection, and Combined Set Operation Results
    9.7.2 Inverse Attribute Range Search Results
    9.7.3 Inverse Map Results
    9.7.4 Complex Query Results
  9.8 Benchmark Summary

10 Conclusion
  10.1 Contributions
  10.2 Future Work

A Disk Drive Specifications
B Disk Access Testing
C Parallel Query Processing Details

List of Figures

2.1 Streams and Query Processing
2.2 Streams and Complex Queries
2.3 Parallel Set S
2.4 Representation of sets A and B as partitioned sets, n = 4
2.5 Representation of a partitioned attribute pressure and its inverse pressure⁻¹, n = 4
3.1 Thin Chunked Set Pattern
3.2 Completely Contiguous Set Pattern
3.3 Internally Disordered Set Pattern
3.4 Wide-Chunked Set Pattern
3.5 Block Pattern Timing Results
4.1 Page-Server Process Architecture
4.2 Page-Server Data Shipping
5.1 Data Parallel Execution of Scan
5.2 Parallel Dataflow Execution of Union
5.3 No Return Parallelism - Insertion Example
5.4 Loop Parallelism - Print Loop Example
5.5 Simple I/O Parallelism
7.1 ADAMS Process Architecture
7.2 The Para oid C++ Class Definition
7.3 ADAMS Union - application code through server execution
7.4 ADAMS No Return Operations
8.1 Union
8.2 A Combined Set Operation Expression
8.3 Inverse Attribute Retrieval
8.4 Complex Attribute Retrieval Example
8.5 A Complex Serial Query
8.6 Complex Disjunctive Query
9.1 Measurement Object Schema
9.2 Frequency Object Schema
9.3 Parallel Union Execution
9.4 Complex Query Execution
9.5 Complex Query - Tree Reads
9.6 Union - Scale Up
9.7 Complex Query - Scale Up
9.8 Complex Query - Random Reads
9.9 Union - Random Writes
9.10 Intersection - Random Writes
9.11 Intersection - Random Reads
9.12 Inverse Map - Long Sends
A.1 Cluster Disk Drive Specifications
C.1 ADAMS Query - source code
C.2 ADAMS Query - source code (continued)

List of Tables

8.1 Primitive Operations
8.2 Advanced Operations
8.3 Constants and Parameters
8.4 Steps for Combined Set Operation Expression
8.5 Query Message Descriptions
8.6 Parallel Union Steps
8.7 Parallel Map Retrieval Load Steps
8.8 Parallel Map Retrieval Transfer Steps
9.1 Benchmark Queries and Operations
9.2 Primitive Operations
9.3 Observed and Predicted Behavior of Complex Query 6, n = 4 (Observed/Predicted)
9.4 Union Relative Error
9.5 Set 1 Union Set 6
9.6 Intersection Relative Error
9.7 Random Writes - Intersection 1 and 3
9.8 Combined Set Operation Relative Error
9.9 Range Search Relative Error
9.10 Range Search 1 Sequential Reads
9.11 Inverse Map Relative Error
9.12 Complex Query Relative Error

List of Symbols

RS      Sequential Disk Read
WS      Sequential Disk Write
RR      Random Disk Read
WR      Random Disk Write
RC      Tree Cache Read
WC      Tree Cache Write
TA      Tree Append
N       Get Next oid from a Stream
A       Append an oid to a Stream
MBS     Brief Message Send (less than 100 bytes)
MLS     Long Message Send (8K bytes)
MBR     Brief Message Receive
MLR     Long Message Receive
MPS     Page Send
MPR     Page Receive
LS(S)   Load all Stream S segments from Sequentially Created Set
LR(S)   Load all Stream S segments from Randomly Created Set
F(S)    Find a Key in Random Order in tree S
IR(S)   Insert in Random Order into tree S
IS(S)   Insert in Sequential Order into tree S
R       Read a stream segment's blocks from disk into cache
W       Write a stream segment's blocks to disk from cache
CR(S)   Create Set S from random oid source
CS(S)   Create Set S from sequential oid source
        oids Per Block in a tree node
(S)     oids Per Block in a node for tree S
        Tree Cache Size (number of blocks)
        Tree Chunk Size (number of blocks)
        Stream Segment Size (number of oids)
        oids Per Map Transfer Message
        Bytes per oid
n       Number of ADAMS Servers

1 Introduction

1.1 The Problem

The ability of the scientific community to gather data continues to grow at a rapid rate. The time needed to analyze these great volumes of complex information, to find trends or anomalies of interest, can mean that data accumulates faster than it can be examined, and the researcher risks being overwhelmed or out of date. An increase in data gathering capacity is often accompanied by an increase in expectations. Thus a desirable enhancement in data gathering capability can turn out to be a mixed blessing if the analytical tools employed cannot be adjusted to meet the increasing demands, both in volume and complexity.

We see the analysis task here as involving the use of queries, particularly queries we see as complex: those involving multiple conjuncts and disjuncts, with references to data value ranges and/or items in the database. The identification of a small sub-volume of interest from a very large database can be the most daunting part of the analysis task.

Accommodating data complexity is an additional challenge. The object model is better suited to expressing data complexity than the relational model, and its value to scientific data analysis is a frequent theme [CM95, MH94, Kim90b]. But object-oriented databases have been seen as poor performers when handling very large databases ([Rao94], p.43).

This may be due to their relatively young age when compared to relational systems. Their ability to easily express greater complexity could also impede performance. A third challenge is ease of use. An elaborate solution requiring great expertise to deploy and use is not likely to be widely adopted, particularly by scientific researchers responsible for developing their own applications. We see the combination of these three challenges, providing complex query processing over very large volumes of complex data in an easy-to-use manner, as the problem.

1.2 The Importance of the Problem

Many scientific and business applications will require the design flexibility of an object-oriented database system while demanding the capability to perform complex query operations over very large datasets. This high-volume query capability has gone wanting because OODBMS designers have typically chosen CAD or other small- to medium-scale database applications as a target, while the focus on large-scale relational systems has generally been on providing support for many users making small transactional database updates.¹

Maintaining great volumes within a DBMS is difficult. A number of researchers are attempting to address this problem [SSU91, BCV93]. Some are investigating approaches including employing tertiary storage within a DBMS [Sto94, CHL93, Moh93],² while others are examining parallel database systems [DG92]. Successful commercial parallel relational database systems have been deployed [BFG+95, Pag92]. According to Gesman [Ges96], "the exploitation of parallelism within OODBMS has been almost completely ignored" (p.852). Without good object-oriented solutions capable of exploiting parallelism, a substantial portion of the scientific and business community will be forced to use outdated relational technology to meet their performance needs, sacrificing the expressiveness of the object-oriented model while needlessly enduring the overhead of countless joins and sequential scans.

¹ According to Mohan et al. [MPTW94], "Traditionally, users have been forced to deal with this problem of handling the transaction and query workloads properly by maintaining two different databases on two different systems." (p.354).
² We contrast this with recent work in main memory DBMS systems [AHH+90, vdB94], which are not likely to be able to address the same applications in the near future.

1.3 Approach

To address these problems we have devoted a great deal of effort towards the design and implementation of a suitable database system. We have extended the single-processor ADAMS object-oriented database system with features which we believe are invaluable in meeting the needs expressed above. A feature of particular note is seamless parallelism. The features we have developed include a unique parallel set representation, which is coupled with a unique (for object-oriented systems) parallel query server architecture. We have seen little discussion of the appropriate choice of client/server database architecture being strongly tied to application area; but we see in many scientific applications just such a link between these high volume operations and our selection of query servers.

As part of our approach we develop an analytical model. It serves not to replace experimental results, but as an aid in explaining them. Through the model we expect to clearly show why our system performs as it does, and to give insight into how it might perform with other hardware configurations or applications.

This implementation effort addresses a complaint of Baru et al. [BFG+95] regarding earlier academic work: "The results of these studies form a basis for our knowledge of parallel database issues today. However, two major limitations with many of these projects were: (1) Many of the problems were considered in isolation, so the implementation tended to be very simple, and (2) in several cases, people resorted to simulation and analysis because the implementation required enormous effort." (p.293). In the literature we have found results given for prototype systems where a review of the (partial) system implementation left serious questions as to whether the results would remain valid had a full implementation been completed or a real application been tested.

There are other, more complete systems, such as SHORE, where interesting functionality is developed (e.g. Parsets as in [DNSV]), but later one finds: "The Parset feature is not supported. It was an experimental feature implemented by a graduate student as part of his Ph.D. research a couple of years ago and is no longer used." (Marvin Solomon, SHORE Mailing List, 2/97). There are some database features that our implementation does not support,³ and several possible ADAMS enhancements are discussed in [Had95]. But we believe the current implementation could provide the capabilities discussed below in a suitable real-world environment.

We note that while distributed commercial object-oriented query-server database systems are available, we know of no existing query-server based object-oriented database system permitting parallel execution. Versant [Vera], for example, is a commercial query-server based system which allows objects from different physical databases in a network to reference each other, yet "VERSANT does not currently perform parallelization of selection queries at the database level" (personal communication from Brian Cunningham, Sr. Systems Engineer at Versant, 9/96). The Objectivity/DB [Obj95] system, on the other hand, is a parallel page-server object-oriented database system. It is not clear with Objectivity/DB that any parallel benefit would be seen within a single query execution; the benefit may instead occur when requests by many users are spread over multiple servers. This would not be particularly beneficial in our application area, where many more servers than clients would be likely. Client/server architectural issues are critical for performance, and we will be discussing these client/server options in detail in Chapter 4.

The I/O bottleneck is seen as a major problem [DG92]. According to Mohan et al. [MPTW94]: "the speed of I/O devices has not increased over time as fast as that of CPUs" (p.356). Therefore systems manipulating vast quantities of data are likely to find their performance limited by I/O. We have focused on reducing I/O costs in our work with sequential ADAMS, and we discuss several related modifications. In the parallel version of ADAMS we continued with that goal. Additional goals included limiting message passing, which is both a network burden and a CPU burden, and limiting process creation (which caused the ACOS system [Tee93] severe problems). These goals were pursued within the context of exploiting opportunities for parallel processing. We will be discussing these goals and how they were achieved.

³ Transactions and logging are not supported, because (a) they consume considerable overhead and (b) scientific databases are seldom transaction oriented. We also do not support arbitrary user object methods.

1.4 Contributions

The contributions of this work include:

1. An implementation of a new design of parallel sets for object-oriented database systems, particularly including their use in complex object-oriented queries.
2. An implementation of a parallel, query-server, object-oriented database system.
3. An analytic model explaining the performance of the system.

In particular, we have:

1. Analyzed a number of client/server architectures and determined that a query server architecture seems best for query-intensive large database applications.
2. Extended an existing object-oriented database system (ADAMS) with seamless parallelism.
3. Obtained linear parallel speed up and scale up for representative ADAMS queries.
4. Added numerous features to ADAMS to ensure efficient query processing over large volumes. These include streams and "chunked" sets.

1.5 Structure of this Dissertation

Chapter 1 - Introduction: We have briefly outlined the problem we are addressing, and our approach to solving it.

Chapter 2 - Objects, Sets, and ADAMS Parallel Sets: We discuss objects and sets of objects. We then introduce the ADAMS set, its features, and design. We continue by describing ADAMS parallel sets, a natural object-oriented extension of the ADAMS set.

Chapter 3 - Representation of Data on Disk: We review some details behind the implementation of ADAMS sets, with a focus on block placement on disk and cache usage. These storage techniques are reflected in the analytical model.

Chapter 4 - Client/Server Database Architectures: We discuss client/server architectures, and why we believe a query-server client/server architecture is the superior choice in our application area.

Chapter 5 - Parallelism: We provide an overview of parallelism as we have seen it in database operations.

Chapter 6 - Related Work: A review of relevant related work is given. Sections include implementations (including Bubba/FAD and SHORE), conceptual models (such as Kilian's Parallel Sets), analytic models, and benchmarks.

Chapter 7 - The Parallel Implementation: Here we build on our earlier descriptions of ADAMS parallel sets, parallelism, and client/server architectures to provide a description of the ADAMS implementation. Sources of parallelism in query examples are given.

Chapter 8 - The Analytical Model: Now that we have descriptions of ADAMS parallel sets and the ADAMS implementation, we present an analytical model of ADAMS query processing. The model allows one to predict, given the sizes of input sets and partial result sets, the number of operations required to complete a query.

Chapter 9 - Performance: This chapter describes standard metrics associated with parallel database performance, and continues to describe a benchmark developed for evaluating both the analytical model and ADAMS parallel performance in general. Results are given.

Chapter 10 - Conclusions: Our results are briefly reviewed, particularly in light of the problem we set out to address.

2 Objects, Sets, and ADAMS Parallel Sets

In this chapter we describe ADAMS parallel sets. We begin by discussing objects and the object model. We then provide an operational definition of "set" and review some of the details of the ADAMS "set" and "parallel set". We include a description of ADAMS "streams", which constitute an in-memory set representation used in query processing.

2.1 Objects

According to Özsu and Blakeley [OB95], "OODBMSs lack a universally accepted object model definition. Even though there is reasonable consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems" (p.148). The Object Database Management Group [CAB+96], a consortium of OODBMS vendors, provides an object model describing a number of OODBMS constructs such as the "basic features" given above. In addition, the ODMG-93 standard provides an object definition language (ODL), an object query language (OQL), and C++ and Smalltalk bindings for ODL and OML (object manipulation language). A system supporting the ODMG object model and following the ODMG's language interface guidelines is said to be ODMG

compliant. Such a standard has value for commercial systems, where source code portability can reduce development costs and risk. The ODMG standard is still evolving. Whether it will be seen to encompass the desired object model, or simply as a commercial expedient, is unclear.

ADAMS [Pfa93] predates the organization of the ODMG. The ADAMS language was not designed to seamlessly allow object-oriented programmers to persistently maintain objects. OODBMS systems evolved in part from such a need, as in CAD applications. Rather, the purpose of ADAMS was to "interface many different computing environments to a common, persistent data space" (p.1). Thus ADAMS is an embedded language, with C++ and Fortran versions. The "basic" object model features are supported:

1. Identity: every ADAMS object has a unique identity, distinguishing it from all other objects in the data space. This is accomplished, in part, through the use of object identifiers, or oids, where an oid is an immutable unique label associated with a particular object throughout the object's existence. In the ADAMS implementation an oid is an 8-byte string, with oids assigned in lexicographically increasing order.

2. Encapsulation of state and behavior: the state of an ADAMS object is represented through its attributes (numeric, character, or raw data items) and maps (references to other objects). An application retrieves, assigns, or modifies attribute values (or map references) through a fixed set of ADAMS statements identifying the object and the attribute (or map) name, thus providing encapsulation of state. Object behavior in ADAMS currently consists of the storage and retrieval of object data.

3. Type: support for type, including inheritance, is provided through the use of the ADAMS dictionary. All ADAMS objects have a type associated with them.

Sets, which in ADAMS are also objects, will be discussed further in the next section.

The object model and the traditional relational database model are very different. A

number of database texts [Dat95, Kho93, O'N93] discuss these differences, while papers of particular interest include [Loo90, KC90, Zdo94]. Identity is very significant. Two objects are equal in an OODBMS context if they have the same object identifier; that is, if they are literally the same object. When performing an intersection on two sets, for example, the object model requires only that a third set of object identifiers (oids) be generated. Comparison of oids alone is sufficient to ensure a correct result.

In contrast, the relational model is value based. Two tuples are equal, in a relational context, if their respective column values are equal. But value equality does not convey identity. Modifying a tuple in one relation which is equal to a tuple in a second relation does not result in an update to the second tuple. The intersection of two relations requires the generation of a third relation containing full copies of the relevant tuple data. The comparison process cannot be done by comparing tuple/row identifiers,¹ as these do not convey identity; rather, data from the tuples must be compared.

¹ See [O'N94] for row/tuple identifier details.

2.1.1 Object Storage

ADAMS employs the Decomposition Storage Model (DSM) described by Copeland, Khoshafian, and Valduriez [CK85, PV86], also known in the literature as the Binary Storage Model ([Tee93], p.53). As Copeland and Khoshafian indicate, "The DSM pairs each attribute value with the surrogate of its conceptual schema record on a binary relation" (p.268). Rather than having all the data for a particular object stored contiguously, an object with j attributes will have its components stored in j locations. For an object of the MEASUREMENT class below, there are five attributes and a map (which is simply an object-valued attribute).

[Declaration of the MEASUREMENT class: five attributes, among them time and temperature, plus a map named instrument.]

In ADAMS, we store these (surrogate, value) pairs using O-trees [Orl89]. The surrogate in ADAMS is the oid. For the MEASUREMENT schema above, ADAMS uses an O-tree to record all time attribute values, another to maintain all temperature values, and four more for the remaining three attributes and the map. These O-trees are indexed by oid. Thus, given the oid of a specific object in the class MEASUREMENT, ADAMS can quickly access its time attribute value by tree lookup in the time O-tree with the oid as search key.

Since an ADAMS map is nothing more than an object-valued attribute, it is implemented in exactly the same fashion. The expression x.instrument returns the oid of the instrument on which the measurement x was recorded. Finally, this kind of attribute evaluation using the decomposed storage model allows the composition of access operators. The expression x.instrument.elevation denotes the elevation of the instrument on which x was recorded. x.instrument is evaluated first, then this oid becomes the search key in the attribute elevation O-tree.

In attribute evaluation, the oid is known and the attribute value is retrieved. In selection, the value is known and all objects, or oids, with that attribute value are returned. To select with respect to an attribute (or a range of attribute values), an inverse attribute O-tree is employed. Inverse O-trees are indexed by the attribute value. Given a specific time, or range of times, the inverse time O-tree returns the oids of all objects with the specified time attribute value(s).

There are several other storage models described in [TRSB93, CK85, Tee93]. Teeuw states "as far as we know, BSM² as such is not used by any relational system" (p.54). The other models will not be covered here.

² the Binary Storage Model

While the choice of the decomposed storage model was made prior to our work with ADAMS, we find the various arguments made in favor of DSM in [CK85] compelling, particularly those relating to simplicity and flexibility. The choice of DSM had other consequences. As we will show in Chapters 4 and 7, it had a strong influence on the ADAMS server architecture.
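To make the decomposed layout concrete, the following is a minimal C++ sketch of DSM-style attribute storage, attribute composition, and inverse selection. It is illustrative only: std::map and std::multimap stand in for O-trees, and the type and variable names (Oid, instrument_tree, and so on) are our own, not ADAMS identifiers.

#include <cstdint>
#include <map>
#include <optional>
#include <vector>

using Oid = std::uint64_t;  // ADAMS oids are 8-byte strings; an integer stands in here

// One tree per attribute: (surrogate oid -> attribute value). A map, i.e. an
// object-valued attribute, is the same structure with Oid as the value type.
std::map<Oid, double> time_tree;        // MEASUREMENT.time
std::map<Oid, Oid>    instrument_tree;  // MEASUREMENT.instrument (a map)
std::map<Oid, double> elevation_tree;   // INSTRUMENT.elevation

// Attribute evaluation: the oid is known; the value is found by tree lookup.
template <typename V>
std::optional<V> eval(const std::map<Oid, V>& tree, Oid x) {
    auto it = tree.find(x);
    if (it == tree.end()) return std::nullopt;
    return it->second;
}

// Composition of access operators: x.instrument.elevation.
std::optional<double> instrument_elevation(Oid x) {
    auto inst = eval(instrument_tree, x);       // x.instrument yields an oid
    if (!inst) return std::nullopt;
    return eval(elevation_tree, *inst);         // that oid becomes the search key
}

// Selection: an inverse tree indexed by value returns all oids in a range.
std::multimap<double, Oid> time_inverse;
std::vector<Oid> select_time_range(double lo, double hi) {
    std::vector<Oid> result;
    for (auto it = time_inverse.lower_bound(lo);
         it != time_inverse.end() && it->first <= hi; ++it)
        result.push_back(it->second);
    return result;
}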


2.2 ADAMS Sets

A set, within the context of an object-oriented database, is a collection of objects. The following operations are supported:

member(S,X) - True if the object X is a member of the set S, false otherwise.
insert(S,X) - Inserts the object X into the set S.
remove(S,X) - Removes object X from the set, if it is there.
union(S1, S2) - Denotes the set resulting from the union of sets S1 and S2.
intersection(S1, S2) - Denotes the set resulting from the intersection of sets S1 and S2.
difference(S1, S2) - Denotes the set resulting from the difference of set S1 less S2.
cardinality(S) - Denotes the number of objects referenced by the set S.
assign(S,T) - Assigns to the set S all objects of the set T.

Because oids are uniquely identified with objects, an actual set representation can be a list of oids, and each object X in the representation can be just the oid denoting the object. Since oids have a lexicographic ordering, our sets of oids are kept in this linear order. We then have the following two operations which facilitate iteration over a set:

first(S) - Returns the smallest oid in S. If S is empty then an End of Set token is returned.
next(S) - Returns the oid immediately greater than that previously returned for the set. If first was not previously invoked on the set then the result is undefined. If no oids remain, then an End of Set token is returned.

Both first and next assume S is invariant in the course of the iteration. In implementing ADAMS sets, we employ O-trees [Orl89]. As with attributes in Section 2.1.1, set O-trees are indexed by oid. Conceptually, the objects constituting the set are simply the oids in the leaf blocks of the trees. In practice, the way these leaf blocks are

distributed in disk storage can be very important. We discuss data placement on disk further in Chapter 3.
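As a small illustration, here is the iteration protocol as it might look in C++, with std::set standing in for the O-tree and a sentinel value standing in for the End of Set token; the cursor structure is our own illustrative device, not the ADAMS implementation.

#include <cstdint>
#include <set>

using Oid = std::uint64_t;
constexpr Oid END_OF_SET = ~Oid{0};  // sentinel standing in for the End of Set token

struct SetCursor {
    const std::set<Oid>* s = nullptr;      // std::set keeps oids in sorted order
    std::set<Oid>::const_iterator pos;
};

// first(S): the smallest oid in S, or End of Set if S is empty.
Oid first(SetCursor& c, const std::set<Oid>& s) {
    c.s = &s;
    c.pos = s.begin();
    if (c.pos == s.end()) return END_OF_SET;
    return *c.pos++;
}

// next(S): the oid immediately greater than the one previously returned.
// Assumes first() was called and that S is unchanged during the iteration.
Oid next(SetCursor& c) {
    if (c.pos == c.s->end()) return END_OF_SET;
    return *c.pos++;
}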

2.3 Streams

In resolving queries involving combinations of set operations, such as

    R <- S1 intersect (S2 union S3),

one goal of an implementation is to avoid the creation of temporary files for partial results. In this query, we could perform a union of S2 and S3, saving that result on disk as a set T, and then perform the intersection of T with S1 to complete the processing. But in keeping with our goal, we employ streams, also called iterators [Gra93], and transfer partial results between operators in order to avoid unnecessary disk I/O. Because S2 and S3 are in oid order, we can produce a partial result of the S2 union S3 operation, call it T1, and then begin to intersect this result with S1. In this case, we use streams as buffers between the two set operations. When the buffer T1 that is being intersected with S1 becomes empty, it is refilled by running another portion of the S2 union S3 operation. We illustrate such a query in Figure 2.1, where the rectangular ladder-like objects represent streams, and the stream corresponding to the set Si is denoted by Si.

[Figure 2.1: Streams and Query Processing - streams over sets S2 and S3 feed a union operator that fills stream T1; T1 and a stream over S1 feed an intersection that produces the result R.]

We note that ADAMS streams are not used to transfer data between processes, or even between process threads. We wish to avoid operating system and message passing overheads when possible. These goals in the ADAMS stream implementation echo those cited by Graefe ([Gra93], p.79), although an element contained in a stream as discussed there is a "granule" which is "typically a single record", and not strictly an oid as in ADAMS.

2.3.1 Stream Operations

We have described streams as buffers used to store partial results of operations, particularly when employed in queries (in a demand-driven manner). Here we provide a more formal definition of stream. For our purposes a stream is a non-persistent, consumable buffer of oids supporting the following operations:

next(ST) - Returns the smallest oid remaining in ST. If ST is empty then an End of Stream token is returned. While a next operation may be invoked repeatedly on the same stream, each oid within the stream is only returned once (the stream is consumed by next operations).
append(ST,X) - Places the oid X at the end of ST.

Thus a stream is a queue.
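As an illustration of how these operations compose, the sketch below implements the demand-driven pipeline of Figure 2.1 in C++: a union of two ordered oid sources refills the buffer stream T1, which is consumed by an intersection with S1. The class and function names are ours, std::set again stands in for the disk-resident O-trees, and the 2,000-oid refill quota mirrors the stream buffer size mentioned in Section 3.2.1.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <set>

using Oid = std::uint64_t;

// A stream is a consumable queue of oids with next() and append().
struct Stream {
    std::deque<Oid> q;
    bool has_next() const { return !q.empty(); }
    Oid next() { Oid x = q.front(); q.pop_front(); return x; }  // consumes the oid
    void append(Oid x) { q.push_back(x); }
};

// Refill t1 with up to `quota` oids of the ordered merge (union) of [a,aEnd) and [b,bEnd).
void refill_union(std::set<Oid>::const_iterator& a, std::set<Oid>::const_iterator aEnd,
                  std::set<Oid>::const_iterator& b, std::set<Oid>::const_iterator bEnd,
                  Stream& t1, std::size_t quota) {
    while (quota > 0 && (a != aEnd || b != bEnd)) {
        if (b == bEnd || (a != aEnd && *a < *b)) t1.append(*a++);
        else if (a == aEnd || *b < *a)           t1.append(*b++);
        else { t1.append(*a); ++a; ++b; }        // same oid in both inputs: emit once
        --quota;
    }
}

// R <- S1 intersect (S2 union S3), refilling T1 only when it runs empty.
std::set<Oid> query(const std::set<Oid>& s1, const std::set<Oid>& s2, const std::set<Oid>& s3) {
    std::set<Oid> r;
    Stream t1;
    auto a = s2.begin(), b = s3.begin();
    auto i = s1.begin();
    for (;;) {
        if (!t1.has_next()) {
            refill_union(a, s2.end(), b, s3.end(), t1, 2000);
            if (!t1.has_next()) break;           // the union is exhausted
        }
        Oid x = t1.next();
        while (i != s1.end() && *i < x) ++i;     // advance the S1 side to x
        if (i != s1.end() && *i == x) r.insert(x);
    }
    return r;
}

Because every source is already in oid order, the whole query proceeds with a single forward pass over each input and a bounded amount of memory, which is precisely what the stream mechanism is meant to guarantee.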

2.3.2 Streams and Complex Queries

Employing streams in situations where all oids are maintained in oid order, such as in the query of Figure 2.1, is straightforward. Streams are also used in what we call "complex queries", queries involving one or more inverses and one or more set operations. An example is

    R <- { x | x.pressure > 23 AND x in S1 }

depicted in Figure 2.2. The inverse pressure O-tree is used to retrieve a partial query result; but this partial result is in attribute and not oid order. The temporary disk-resident set T1 is used to sort the partial result. The query then proceeds, using streams, as an intersection of sets T1 and S1.

[Figure 2.2: Streams and Complex Queries - a select on the inverse pressure O-tree feeds a sort into the temporary set T1; streams over T1 and S1 are intersected to produce R.]

While streams are frequently employed within ADAMS, particularly within complex queries, they are invisible to the ADAMS user.
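Under the same stand-in containers as before, the plan of Figure 2.2 can be sketched as follows; the inverse pressure index is modeled with std::multimap, and the sort step plays the role of the temporary disk-resident set T1.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <vector>

using Oid = std::uint64_t;

// Inverse attribute index: value -> oid, i.e. attribute order, not oid order.
using InverseTree = std::multimap<double, Oid>;

// R <- { x | x.pressure > 23 AND x in S1 }
std::set<Oid> complex_query(const InverseTree& pressure_inv, const std::set<Oid>& s1) {
    // 1. The range scan on the inverse tree yields oids in attribute order.
    std::vector<Oid> hits;
    for (auto it = pressure_inv.upper_bound(23.0); it != pressure_inv.end(); ++it)
        hits.push_back(it->second);

    // 2. Sort into oid order; this is the role of the temporary set T1.
    std::sort(hits.begin(), hits.end());
    hits.erase(std::unique(hits.begin(), hits.end()), hits.end());

    // 3. Stream-style merge intersection of T1 with S1.
    std::set<Oid> r;
    std::set_intersection(hits.begin(), hits.end(), s1.begin(), s1.end(),
                          std::inserter(r, r.end()));
    return r;
}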

2.4 Parallel Sets

Central to parallel set operations, to parallel queries and retrieval, and to the performance of a parallel database is the representation of its sets of objects. In this section we describe the ADAMS parallel set.

2.4.1 Partitioning

The horizontal partition of a set is the distribution of the objects in the set to multiple database subsets. Those systems which partition sets based on some property of the object, such as an attribute value, must resolve several difficult questions [CABK88, GD90]. For example, if the attribute value of the object changes, the location of the object must change

as well. This may include updating all sets in which the object resides. Partitioning by the immutable oid value avoids many of these problems. Addressing schemes employing a simple mapping between object identifier and partition subset are common (as in ORION [JWKL90]), although few claims of direct database support for data parallel set operations have been encountered. The SHORE system is one such example, providing support for a parallel structure called the ParSet [Kil92, CDF+94, DNSV], which is discussed in Chapter 6. We use the oid of an object to uniquely determine the partition subset. In our work we have allocated each partition to a separate node, that is, a processor and storage, on which the object (and all its attribute values) resides. As depicted in Figure 2.3, an ADAMS set S is a structure identifying n subsets, S1, ..., Sn, where each subset resides on a different node in the system.

[Figure 2.3: Parallel Set S - an oid-based partitioning function distributes set S over subsets S1, S2, ..., Sn.]

If a particular object x is a member of two ADAMS sets, say x ∈ A and x ∈ B, x will always be represented as a member in two subsets on the same partition subset. That is,

if x ∈ Aj, 1 ≤ j ≤ n, and x ∈ B, then x ∈ Bj, as depicted in Figure 2.4.

[Figure 2.4: Representation of sets A and B as partitioned sets, n = 4 - subsets A1..A4 and B1..B4 stored on partitions P1..P4.]

Using such a

scheme, large granularity set operations such as union, intersection, and difference can be performed in a data parallel manner. We hash oids to determine where an object resides. In practice, this hashing has been very fast. Should skew be observed in storage requirements or processing time, modifying the oid generation routine to reduce assignments to a particular node is also simple. More complex object placement algorithms may also be effective [GWLZ94].

An attribute, such as pressure, is a set. It is a set of ordered pairs of the form (oid, attribute value). Given the oid of any object that has a pressure attribute, ADAMS finds the matching ordered pair and returns the corresponding attribute value. Because an attribute is a set, it too can be partitioned, in this case by the oid of the attribute function's argument. Attributes are partitioned by oid. A request for the pressure attribute value for an object x ∈ Aj is sent to the node where Aj resides. A search is then performed at that


node in the O-tree representing pressure_j, and the result of that search is returned to the requesting client process. This data distribution is depicted in Figure 2.5.

[Figure 2.5: Representation of a partitioned attribute pressure and its inverse pressure⁻¹, n = 4 - subsets pressure_1..pressure_4 and pressure⁻¹_1..pressure⁻¹_4 stored alongside A1..A4 on partitions P1..P4.]

Inverse attributes are also partitioned by oid. An exact match or range search query on the inverse attribute pressure⁻¹, such as

    R <- { x | 21 < x.pressure < 23 }

is processed by sending a message to all n nodes, where searches are performed on the n O-trees representing pressure⁻¹. As message transfer is generally fast when compared to disk I/O, most of this searching is done in parallel. With large result sizes the message transfer time is relatively insignificant and all searching is effectively done in parallel. This form of data parallelism is discussed further in Chapter 5. While the result of a search for an object's attribute value is always a single value, which is returned to the requesting client process, the search using an attribute value or attribute

value range yields a number of oids. These oids are generally not returned to the requesting client process during a query; rather, the oids at each node are maintained at that node as a subset of the retrieved set. Because the inverse attribute is partitioned by oid, the retrieved set is also properly partitioned by oid and may then be used in parallel set operations such as union or intersection. Thus an attribute evaluation of an oid is performed on a single node, and returns an attribute value. An attribute inverse evaluation of a value or range of values is performed on all n nodes and returns the oid of a new properly partitioned parallel set.

Queries involving inter-object references, which we call maps, can be somewhat more complex to process. A map is an object-valued function. It is an attribute with entries of the form (oid_i, oid_k). When applied to an object having oid_i, a map returns the oid_k of the referenced object. A map inverse is an index having the referenced oid_k as the key values and the referencing oid_i as the entries, e.g. (oid_k, oid_i). We partition maps by the referencing oid_i and inverse maps by the referenced oid_k. The key oids (for both map and map inverse indices) always reside at their database partition,³ and therefore it is clear where a function is to be applied for a given oid. The result of a function application will be an oid (possibly several oids for inverse maps) which may belong in a different database partition.

For queries involving maps, an inter-server communication phase is required in which the set of oids resulting from an inverse map, e.g. the oid_i, must be repartitioned to yield a well-defined ADAMS parallel set in which each subset contains only oids belonging to that partition. This repartitioning has been fast in practice, as only oids are transferred, and many are sent in each message. But such operations prevent the servers from operating in complete isolation during complex query processing.

³ Note that we will often use the term "partition" to denote the "subset of the partition", as we have here. The latter is correct but needlessly pedantic.
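A compact sketch of the two mechanisms just described, under assumed names: the oid-hash placement that makes set operations data parallel, and the bucketing step behind repartitioning. The modulo hash and the vector-of-buckets message layer are placeholders for the actual ADAMS routines.

#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using Oid = std::uint64_t;

// Every structure is partitioned by the same oid hash, so an object lands on
// the same node in every set, attribute, and inverse in which it appears.
std::size_t node_of(Oid x, std::size_t n) { return x % n; }  // placeholder hash

using ParallelSet = std::vector<std::set<Oid>>;  // one subset per node

// Data parallel intersection: node j intersects A_j with B_j in isolation.
ParallelSet intersect(const ParallelSet& a, const ParallelSet& b) {
    ParallelSet r(a.size());
    for (std::size_t j = 0; j < a.size(); ++j)   // each node works independently
        for (Oid x : a[j])
            if (b[j].count(x)) r[j].insert(x);
    return r;
}

// Repartitioning: an inverse map evaluated at node j yields referencing oids
// that may belong to other partitions; bucket them by hash before exchange.
std::vector<std::vector<Oid>> repartition(const std::vector<Oid>& local_result,
                                          std::size_t n) {
    std::vector<std::vector<Oid>> outgoing(n);   // outgoing[k] is sent to node k
    for (Oid x : local_result) outgoing[node_of(x, n)].push_back(x);
    return outgoing;                             // many oids per message, as in ADAMS
}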


2.4.2 Abstract Model and Implementation

ADAMS parallel sets are seen as sets, and ADAMS parallel attributes are seen as attributes; thus the abstract model does not change. This has enabled the parallel implementation of ADAMS sets to maintain essentially the same procedural interface as the single-processor version, and helps to enable a valuable feature of ADAMS: a seamless parallel interface.

3 Representation of Data on Disk

In this chapter we review some of the details of the ADAMS O-tree implementation. The O-tree implementation is itself not central to our work. But file placement can significantly affect performance, and the techniques presented below are reflected in the analytical model of ADAMS query processing. The reader may skip this chapter with relatively little loss of comprehension. The essential results are that the leaves of a tree may be stored either in a sequential fashion in the file or in a disordered fashion, and why.

3.1 Data Structures

There are several data structures which can be employed to represent a set. The original version of ADAMS used two. The first, for small sets, is a contiguous ordered list of oids used for sets of 10 items or less. The second is a tree structure (O-trees) for larger sets, also containing oids as entries. Sets are indexed by oid, so as we iterate through sets we access the oids in oid order. We have found that maintaining data in a completely ordered manner facilitates query processing, and query processing has been a primary concern in ADAMS research. In the course of this research, we introduced ADAMS streams, discussed

in Section 2.3, and the parallel set representation¹ described in Section 2.4. The O2 database system [Dea90] similarly uses a tree structure (B-trees) for large sets, and smaller contiguous structures for smaller sets. The SHORE [DNSV94] database system employs two set variations in the parallel version of its system. Primary sets contain entire objects rather than references, while secondary sets contain oids. Both SHORE and O2 will be discussed further in Chapter 6.
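A sketch of this dual representation in C++; the names are invented, std::set again stands in for the O-tree, and the 10-element crossover comes from the text.

#include <algorithm>
#include <cstdint>
#include <set>
#include <variant>
#include <vector>

using Oid = std::uint64_t;

// Small sets: a contiguous ordered list of oids. Large sets: a tree.
class AdamsSet {
    std::variant<std::vector<Oid>, std::set<Oid>> rep_;  // starts as a small list
public:
    void insert(Oid x) {
        if (auto* v = std::get_if<std::vector<Oid>>(&rep_)) {
            auto pos = std::lower_bound(v->begin(), v->end(), x);
            if (pos != v->end() && *pos == x) return;    // already a member
            v->insert(pos, x);                           // keep oid order
            if (v->size() > 10)                          // promote to a tree
                rep_ = std::set<Oid>(v->begin(), v->end());
        } else {
            std::get<std::set<Oid>>(rep_).insert(x);
        }
    }
};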

3.2 The Set Implementation

ADAMS sets are most frequently created in two ways. The first is during a program which populates or adds substantial numbers of objects to the database, creating objects in an iterative manner and inserting them into one or more sets. The second method is as a result of query processing. A set is often the final result of a query, and temporary sets are often created to store partial results. The details of query processing in ADAMS are discussed later in this thesis; but we have found that the block allocation strategies² used in processing these two types of sets can have profound implications for performance.

¹ which may reference a combination of n small sets and O-trees, or n streams
² O-tree nodes are allocated as blocks of approximately 2048 bytes each. Each leaf node is capable of storing as many as 198 oids.

3.2.1 Optimizing File Accesses

As discussed in the introduction, disk I/O dominates processing time for our expected application area. We expect the I/O bottleneck to continue for the foreseeable future, and for that reason we have paid particular attention to file access patterns within ADAMS. Our experiments in this area have employed a Seagate 41200N 1.2 GB disk drive (full specifications are given in Appendix A). Drive parameters include:

    Average Rotational Latency    8.33 ms
    Average Access (seek)         15 ms
    Transfer Rate                 1.875 - 2.875 MB/sec

As can be seen from the above, the performance penalty incurred by executing a random disk I/O (requiring a seek and a rotation) when a sequential I/O could have sufficed is severe. Approximately 46K can be transferred in the time required for an average seek and rotation (15 ms + 8.33 ms ≈ 23.3 ms, and at roughly 2 MB/sec, 23.3 ms × 2 MB/sec ≈ 46 KB). The presence of a 256K read look-ahead adaptive multi-segmented cache on the disk drive provides a further incentive for localized access.

We note that a file, as seen by the file system, may not be sequentially organized on disk. It is understood that, as Thompson [Tho86] puts it, "the user usually thinks of the file as sequential with the I/O pointer automatically counting the number of blocks that have been read/written from the file." (p. PS2:4-8). We are not currently interested in the details of how a particular file system may reward sequential access at the file level. Rather, we have seen through experimental work that sequential (as opposed to random) file access is well rewarded, and we expect the performance differences between random and sequential I/O will be found on most systems.

In an earlier version of ADAMS, three file access patterns were seen when reading the input sets of a union or intersection operation. The first was the set block pattern of Figure 3.1. The storage allocated to the two sets is distributed throughout the file. Each contiguous segment of storage allocated to contain blocks from a single O-tree we will call a "chunk". This "thin chunked" pattern results from using a block-at-a-time O-tree allocation mechanism. As a database is populated, blocks for each attribute, map, and set are allocated one at a time. A second pattern results when the two input sets are created by copying them from other sets. As a copy is performed atomically, all blocks for a new set are allocated consecutively in the file, avoiding the thin chunked pattern of Figure 3.1. This pattern is shown in Figure 3.2.

[Figure 3.1: Thin Chunked Set Pattern - single blocks of Set 1 and Set 2 interleaved throughout the file.]

[Figure 3.2: Completely Contiguous Set Pattern - all Set 1 blocks followed by all Set 2 blocks.]

A third pattern results from attribute or map retrievals. As attribute or map entries are retrieved, the associated oids are inserted into a set. Because this is also an atomic operation, the set's blocks are contiguous. But unlike the previous case, the oids are not inserted in order. As a result, the blocks are not in oid order. So we have a contiguous but internally disordered pattern.

[Figure 3.3: Internally Disordered Set Pattern - contiguous blocks whose contents are not in oid order.]

We evaluated the performance effects associated with different access patterns, and determined the relative performance of the patterns in an ADAMS application. Two important modifications resulted from this work. The first was to modify the block allocation system for O-trees in order to avoid the thin-chunked pattern. ADAMS now allocates blocks to O-trees in two steps. Initially, an O-tree is allocated 4 blocks. This should suffice for sets of up to four hundred elements. After the initial allotment is used up, 100 blocks at a time are allocated. Unused blocks are returned to the storage manager at the end of processing. We refer to the new set block pattern as "wide chunked".
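A sketch of this two-step allocation policy, with the 4-block and 100-block chunk sizes taken from the text; the free-block counter standing in for the storage manager is our own simplification.

#include <cstddef>
#include <vector>

// Two-step chunk allocation for one O-tree: a small initial chunk, then wide
// 100-block chunks, so that a tree's blocks form long contiguous runs on disk.
class ChunkAllocator {
    static constexpr std::size_t kInitialChunk = 4;   // covers sets up to ~400 oids
    static constexpr std::size_t kWideChunk    = 100;
    std::size_t next_free_block_;                     // next unallocated file block
    std::vector<std::size_t> blocks_;                 // blocks reserved for this tree
    std::size_t used_ = 0;

public:
    explicit ChunkAllocator(std::size_t first_free) : next_free_block_(first_free) {}

    std::size_t allocate_block() {
        if (used_ == blocks_.size()) {                // current chunk exhausted
            std::size_t sz = blocks_.empty() ? kInitialChunk : kWideChunk;
            for (std::size_t i = 0; i < sz; ++i)      // reserve one contiguous run
                blocks_.push_back(next_free_block_++);
        }
        return blocks_[used_++];
    }
    // Unused blocks of the final chunk would be returned to the storage
    // manager at the end of processing (not modeled here).
};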

[Figure 3.4: Wide-Chunked Set Pattern - alternating chunks of ordered Set 1 and Set 2 blocks.]

The second adjustment was to employ streams, which use large buffers (2,000 oids each) to better exploit opportunities for sequential file access. Streams were described in Section 2.3, and we only observe here that using streams with wide-chunked sets reduced the execution time of some operations by a factor of five, as shown in Figure 3.5, which illustrates the performance differences of these four representation schemes. Each entry denotes the time to access 30,000 oids.

    Thin-Chunked Blocks                  10.782 seconds
    Contiguous Blocks                     3.150 seconds
    Contiguous Disordered Blocks          4.683 seconds
    Wide-Chunked Blocks with Streams      2.06  seconds

    Figure 3.5: Block Pattern Timing Results

3.2.2 The O-tree Cache

Iterating through a set is a common activity. Both wide-chunking and streams improve the performance of programs which sequentially access set elements. But there are situations when unordered or random access is required. Obtaining good random performance is difficult. The access locality and predictability, exploited above, is missing. We have improved performance substantially through the use of variable-sized LRU caching. There is one cache associated with each O-tree, which is somewhat unusual. But each tree will exhibit different access patterns at different times, and our cache allocation schemes exploit these. Further, searching in one of several small caches is faster than searching in one large one.

When a set is being created from a source of unordered oids, its cache is expanded, up to as many as 5000 O-tree nodes. In the early stages, the entire tree will be in cache. In later stages (for extremely large trees), the cache may contain only the set's internal nodes. When set creation is complete the cache is flushed, and this is done in the order in which the blocks are to be stored in the file. For all other purposes each O-tree cache is maintained at a 200 block maximum. This is to enable reasonably quick lookup, while not overly taxing memory resources. With a 200 block cache we expect all the internal nodes of large trees to be cache resident.
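A compact sketch of such a per-tree LRU block cache; the 200-block default comes from the text, while the block and key types are placeholders.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using BlockId = std::uint64_t;
using Block   = std::vector<char>;   // an O-tree node of roughly 2048 bytes

// One LRU cache per O-tree. Its capacity can be expanded (toward 5000 nodes
// while a set is built from unordered oids) and later shrunk back to 200.
class TreeCache {
    std::size_t capacity_;
    std::list<std::pair<BlockId, Block>> lru_;   // front = most recently used
    std::unordered_map<BlockId, std::list<std::pair<BlockId, Block>>::iterator> index_;

public:
    explicit TreeCache(std::size_t capacity = 200) : capacity_(capacity) {}
    void set_capacity(std::size_t c) { capacity_ = c; }

    Block* get(BlockId id) {
        auto it = index_.find(id);
        if (it == index_.end()) return nullptr;       // miss: caller reads the disk
        lru_.splice(lru_.begin(), lru_, it->second);  // move block to the front
        return &lru_.front().second;
    }

    void put(BlockId id, Block b) {
        auto it = index_.find(id);
        if (it != index_.end()) { lru_.erase(it->second); index_.erase(it); }
        lru_.emplace_front(id, std::move(b));
        index_[id] = lru_.begin();
        while (lru_.size() > capacity_) {             // evict least recently used;
            index_.erase(lru_.back().first);          // a real cache writes it back
            lru_.pop_back();
        }
    }
};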

3.2.3 Sequential Insertions

ADAMS issues oids in lexicographic order. When processing queries ADAMS also manipulates oids in order. Consequently, oids are often inserted into a set in linear order. A literal interpretation of a sequence of set insertions would require repeated descents through the tree from the root to the leaf of insertion. These can be eliminated by maintaining a single flag bit in the set representation. For a newly created set a logical flag is established indicating that all insertions have thus far been done in order, and a copy of the highest oid inserted is maintained in memory. While newly inserted oids continue to be larger than the previous largest, the flag remains set and the search phase is skipped. We call these faster tree insertions "tree appends", as they are much like an append operation; a sketch of this fast path follows at the end of this section.

We have described three different enhancements we made to ADAMS: a) block allocation that yields wide-chunked sets, when possible; b) a flexible O-tree cache designed to expand and contract to accommodate short-lived applications, e.g. sorting, that can require very large temporary storage; and c) flagging sets that are being created through the insertion of ordered oids. This is the kind of attention to detail that is necessary in a successful system. These are important improvements, but they do not directly bear on parallelism in ADAMS. We mention them here because we will reference each technique in our detailed analysis of ADAMS in Chapter 8.
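To make the fast path concrete, a minimal sketch with hypothetical names follows; only the mechanism is taken from the text: a single flag plus the highest oid seen, with the root-to-leaf search skipped while insertions remain in ascending order.

#include <cstdint>

// Hypothetical oid type for illustration only.
struct Oid { std::uint64_t v; };
inline bool operator>(Oid a, Oid b) { return a.v > b.v; }

class OTree {
    bool ordered_so_far = true;  // the single flag bit from the text
    Oid  highest{0};             // highest oid inserted, kept in memory

    void append_to_rightmost_leaf(Oid) { /* update rightmost leaf; no descent */ }
    void descend_and_insert(Oid)       { /* ordinary root-to-leaf insertion   */ }

public:
    void insert(Oid x) {
        if (ordered_so_far && x > highest) {
            append_to_rightmost_leaf(x);   // "tree append" fast path
            highest = x;
        } else {
            ordered_so_far = false;        // flag cleared; normal inserts from here
            descend_and_insert(x);
        }
    }
};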

4 Client/Server Database Architectures

In this chapter we discuss the partitioning of database tasks between client and server. Initially we restrict the discussion to single server systems, as these have received significant attention in the literature.

4.1 Task Partitioning

In a database system shared over multiple computers there are a number of tasks to be performed and a number of processors available to perform them. In most systems these processes are divided into clients (each handling the vagaries of a user program and interactions with a specific user) and servers (providing a fixed set of operations in a specified manner, generally to multiple clients but also to other servers in some cases). In many cases a single processor in a system configuration acts as the server, while each of the other processors is dedicated to a single user as a client process. The design of the interface between client and server has great implications for performance, and as has been seen in this research, different designs are better adapted to different workloads.

The server designs we have seen have fallen into three categories: page server, object server, and query server. We will not discuss file servers, such as NFS, or hardware solutions such as RAID ([O'N94]). We are also not interested in other shared-disk or shared-memory configurations. According to DeWitt and Gray ([DG92]), "Shared-memory and shared-disk do not scale well on database applications." (p.88).

4.2 Page Server

According to Franklin [Fra96], "Virtually all commercial ODBMS products and recent research prototypes have adopted the data-shipping approach"([Fra96],p.10). In the data-shipping approach the focus is on the rapid transfer of data from a simple (and therefore fast) server to a complex client. Page server systems employ a data-shipping approach. In page server systems, the server only knows about pages of storage. The server can be asked to retrieve a page of data from a location, and can be asked to store a page at a location. In a pure page server system, those are the only server responsibilities, aside from concurrency control and transaction rollback/logging. The client process undertakes the rest of the responsibilities, from interpreting page contents to transaction and query processing. This architecture is illustrated in Figure 4.1. The client also frequently maintains a large page, or object, cache. Accessing this cache, which often includes the maintenance of a table for use in swizzling,1 can be much faster than requesting a page from a server. This can be a great benefit for applications having high locality of reference. Another benefit of such a system is that the server spends little time executing software for each page request, so that the requests of multiple clients can be handled quickly. Further, if the database attempts to cluster related objects, i.e. to place them on the same page when possible, multiple objects can be received as the result of a single server request. Figure 4.2 illustrates the architecture, as client processes are shown requesting and receiving pages of data from a page server process.

1 Swizzling is the conversion of an oid into a memory reference.

Figure 4.1: Page-Server Process Architecture (client process: user application program, transaction/query manager, object manager, cache manager, page-object translation layer, communication layer; page server process: communication layer, concurrency control layer, storage manager)

Franklin contends that in data-shipping systems "Scalability is improved because usable resources are added to the system as more clients are added."([Fra96],p.10). A similar statement is found in Carey, Franklin, and Zaharioudakis: "data-shipping offloads DBMS function from the server to the client workstations, enabling the plentiful and relatively inexpensive resources of the workstations to be exploited"([CFZ94],p.359). If each client relies very little on the server this will be true, and with high locality of reference and good clustering this is likely to be so. Industrial descriptions of page-server architectures are also enthusiastic. In Objectivity/DB's technical overview [Obj95] we have: "the distributed client/server architecture used by Objectivity/DB is [the] most advanced in the client/server evolution path"(p.15). Orenstein et al. [OHMS92] describe query processing in a page-server system. They indicate that "... queries execute on the client. This does not mean that the entire contents of a collection or index is sent to the client in order to evaluate a query. As with other objects, only those pages containing referenced addresses are fetched from the server."(p.403).

Figure 4.2: Page-Server Data Shipping

In query processing a great many pages may be accessed, with only a relatively few containing data ultimately found desirable. In some applications there may be a high percentage of querying clients, and this could cause the page server to become a bottleneck. We are convinced that page servers are inappropriate for complex query operations in large scientific collections.

4.3 Object Server

Object server systems may also employ a data-shipping approach. The server knows about objects or clusters of objects and can retrieve and store them. This means that the client can ask for needed objects and receive them without digging through a considerable amount of other (possibly irrelevant) data. Franklin ([Fra96],p.10) gives the following object server advantages:

1. Client buffer space can be used efficiently, as only requested objects are brought into the client's buffer pool.

2. Communication cost can be reduced by sending objects between clients and servers rather than entire pages if the locality of reference of objects is poor with respect to the contents of pages.

3. Object-level locking is easily implemented at the server since it knows which clients are accessing which objects.

Franklin quickly dismisses these advantages, concluding that the page server approach is a "more reasonable choice for implementation"(p.12). But an object server may be very different from a page server in an important way. According to DeWitt et al. [DFMV92], object methods can be applied at either the client or server. If methods are applied at the server, then data is not shipped from server to client and the approach can no longer be considered data-shipping. Joseph et al. [JTTW91] distinguish between typeless object servers, class-based object servers, and type-based object servers. Wilcox, in an article on object method execution, describes several variations of server design and concludes that "The variety displayed in ODB implementations is at once confusing and encouraging"([Wil94],p. 35). There are no hard and fast OODB object server implementation rules.

4.4 Query Server

As defined by Franklin, "The term query-shipping is widely used in the context of a client/server architecture with one server machine, and in which queries are completely evaluated at the server. There is, however, no recognized definition of query-shipping for systems with multiple servers."([Fra96],p.191). Versant [Vera] employs a query-server architecture. This implies that the server understands objects, giving it object-server capabilities when not processing queries. In fact the Versant server has many roles (as listed in [Verb],p.4):

1. manages shared page cache,
2. evaluates objects for client queries,
3. manages disk and storage,
4. manages classes and schema,
5. defines short transactions,
6. locks objects,
7. performs backup, logging, and recovery,
8. manages indices and object clustering,
9. performs queries,
10. manages events and triggers,
11. manages cursors, and
12. supports SMP architectures.

Despite the use of a query-shipping architecture,2 Versant appears to be doing well. In the 007 benchmark ([CDKN94], described briefly in Section 6.4) Versant produced 22 first place results (as compared to Objectivity's 3, although first place results may not be the best way to compare the systems). As noted in Section 1.3, Versant has yet to implement a parallel server architecture. But the questioning of traditional OODB server wisdom is worthwhile.

2 It should also be made clear that Versant employs a client-side cache.

Franklin ([Fra96],p.9) gives the following three query-shipping advantages, which we present in an abbreviated form:

1. Communication cost and buffer space requirement reduction.

2. Easy migration path from single-site system to client/server environment since the database engine (which resides solely on the server) can have a process structure similar to that of the single-site database system.

3. Interaction at the query level facilitates interaction among heterogeneous database systems using a standardized query language such as SQL.

We would add that processing queries at the server can allow for the exploitation of disk access patterns, as ADAMS has done with "wide chunked" sets (see Section 3.2.1) and streams (see Section 2.3). A page server beset by apparently random disk address read requests from several clients may not see, or not initially create, the opportunities for sequential access available when processing queries. While causing other clients to "starve" while such access patterns are exploited may seem cruel, if a query can be completed several times faster as a result then the wait may not be long.

4.5 ADAMS Parallel Query Server

The bottleneck limiting database system performance has traditionally been disk I/O. The main purpose of parallel server systems has been to address this problem, while enabling the storage of larger volumes of data. A difficulty we have seen, in working with sequential and parallel ADAMS, is that while the use of parallel servers removes the disk I/O burden from the client, it replaces it with a significant message processing load. This message processing cost at the client has dominated performance in some of our testing. A concern has been that with a page server architecture, which is prevalent in commercial object-oriented systems, the message burden on the client would become more of an issue as the number of servers and the amount of data maintained increased. While page server systems seek to exploit client resources, in our environment we expect the client to be the bottleneck. We expect a query rich environment, with more servers than active clients. This kind of configuration requires that traditional OODB thinking be turned on its head. A client is more likely to become a bottleneck than in a many client, one server environment. We do not expect that query results would be capable of fitting in any single processor's cache, so extensive client-side caching approaches may be hopeless. In our environment data is seen once and is not likely to be modified or viewed beyond that. This is very different from CAD applications, where a relatively small group of component objects could be manipulated by a single user for long periods. All of these factors, generally the reverse of those given as reasons for page server systems, leave us with a parallel query server architecture. Our use of the flexible Decomposed Storage Model, as discussed in Chapter 2, further favors the query server architecture. Tree traversal on a page server system (as would be required for assigning each attribute of a new object), likely without benefit of a large cache dedicated to the tree, could be particularly expensive.

5 Parallelism

In this chapter we discuss five different parallel execution paradigms: "data parallelism", "dataflow parallelism", "no return parallelism", "loop parallelism", and "simple I/O parallelism". It is not an exhaustive list. Kyung-Chang Kim, for example, in [Kim90a], describes "path parallelism", "node parallelism", and "class-hierarchy parallelism". Leung and Taniar [LT95] refer to "intra-class", "inter-class", and "hybrid-class" parallelism. We focus on methods likely to succeed in our expected application and processing environment,1 where the exploitation of data parallelism is critical. Data parallelism is central to our approach. No return and loop parallelism are also important components in our parallel processing strategy. We briefly discuss dataflow and simple I/O (page server) parallelism, although we will not use them in parallel ADAMS.

1 A shared-nothing architecture with general purpose processors where the processing of queries over very large data sets is a primary activity.

5.1 Data Parallelism

We typically see data parallelism as involving the execution of the same operation on multiple data items concurrently, employing a SIMD (Single Instruction Multiple Data) form of instruction processing. Per Lewis and El-Rewini [LER92], "the SIMD paradigm is often called data-parallel programming"(p.10). An example of data parallel execution would be a linear search through a relation horizontally partitioned over N nodes. A request can be sent to each node to perform a search of the relation segment, and the same search can thus be performed asynchronously on all nodes. We illustrate this in Figure 5.1. The client process sends a message to each server to scan a portion of horizontally partitioned relation R, where R has been divided into portions R1 through RN. The client then waits for responses as the N servers perform the scan operation in parallel.

Figure 5.1: Data Parallel Execution of Scan

When message passing is expensive some operations will not be efficiently supported by a data parallel approach, as they manipulate single elements or perform operations of small granularity. In such cases message passing can dominate performance. Also, by its structure, the approach is limited to horizontally partitioned databases, and in some cases it is hard to know how to properly partition horizontally [CABK88, TN92]. We note that granularity, large and small, is often left to the eye of the beholder. Bergsten, Couprie, and Valduriez's definition ([BCV93],p.735) generally serves: "coarse grain interaction - access to more than 10,000 data items", "fine grain interaction - access to less than 10 data items". But the gap between 10 and 10,000 is large.
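The fan-out of Figure 5.1 can be made concrete with a short sketch. The Server type, its scan stub, and the use of std::async are illustrative assumptions only; parallel ADAMS performs this messaging through Mentat rather than threads.

#include <future>
#include <string>
#include <vector>

struct Server {
    // Stub: would scan this server's partition Ri and return the
    // elements satisfying the predicate.
    std::vector<std::string> scan(const std::string& predicate) { return {}; }
};

// Data parallel scan: the identical request is sent to all N servers,
// which execute it concurrently; the client then gathers the replies.
std::vector<std::string>
parallel_scan(std::vector<Server>& servers, const std::string& predicate)
{
    std::vector<std::future<std::vector<std::string>>> replies;
    for (Server& s : servers)                        // broadcast the request
        replies.push_back(std::async(std::launch::async,
                                     [&s, predicate] { return s.scan(predicate); }));

    std::vector<std::string> result;                 // gather the N replies
    for (auto& r : replies) {
        auto part = r.get();
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}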

5.2 Dataflow Parallelism

Pure dataflow parallelism, as found in [WA91], involves analyzing a program (or query) for data dependencies, and scheduling multiple sections of the program (operators or groups of operators) for processing on different processors at the same time, as the dependencies permit. Teeuw and Blanken offer the following more general dataflow definition: "The term dataflow is used for those data driven models of computation in which the data are active and flow asynchronously through the program, activating an instruction when all the required input data have arrived"([TB93],p.1269). The pure form of dataflow parallelism (also called pipelined parallelism, [DG92], p.86) has seen success in computationally oriented systems, and apparently in main memory database systems. Reports of success with this approach in parallel or distributed database systems employing secondary storage are hard to find. Prisma [TB93] is one, although their focus is on a main memory architecture. The great interest in parallel database systems, and in dataflow computation, suggests that this dearth is not due to a lack of effort.

We illustrate dataflow parallelism with Figure 5.2. In this figure relation A is stored at server 1, relation B at server 2, and the result of a union of the two relations, relation C, is to be stored at server 3. The client sends the three servers messages, setting up parts of a short pipeline. Servers 1 and 2 can immediately begin accessing their relations as soon as instructed, while server 3 waits until data from servers 1 and 2 arrive. When that data does start arriving, server 3 will perform the union without further instructions from the client.

Figure 5.2: Parallel Dataflow Execution of Union

For database systems the analysis is complicated by data placement. Each node "owns" some segment of secondary storage, and each section of code must execute on a node. The message burden between nodes can be expensive. Further, if two program sections make use of substantial quantities of data from one particular node, they will be competing with each other for use of that processor's disk drive and interfering with the execution of the section based on that node as well (if there is one). Successful parallel database systems which label themselves as dataflow generally split all query operators in a horizontal manner in order to exploit data parallelism [AC88b, AC88a, WAF90, JWKL90, TB93]. Other systems occasionally go as far as to explicitly shun dataflow techniques, as with the Teradata [DG92]: "Hashing is used to split the outputs of relational operators into intermediate relations. Join operators are executed using a parallel sort-merge algorithm. Rather than using pipelined parallel execution, during the execution of a query, each operator is run to completion on all participating nodes before the next operator is initiated"(p. 94). As indicated by DeWitt and Gray, "Partitioned execution offers much better opportunities for speedup and scaleup" ([DG92],p.90). The use of streams in a dataflow way can dramatically reduce the need to use secondary storage for temporary results, though, and in that way it can also help to provide lower response times.

5.3 No Return Parallelism

When evaluating a database program at the client processor, it can be seen that many of the resulting operations impact only a single node, with no need for the return of a value to the user node for the database program to proceed properly. An example would be a program which populates the database using insert operations, as in Figure 5.3. In this example the client program specifies the insertion of four items into horizontally partitioned relation R. The client process inserts these items by sending messages to the server maintaining the appropriate subset of R. The client need not wait for a completion response, but can continue the execution of the client program, possibly sending many messages to each server before requesting any sort of information back.

Figure 5.3: No Return Parallelism - Insertion Example

This approach is similar to the data parallelism discussed above, in that a program's instructions can be executed in parallel on horizontally partitioned data structures. Union operations, for example, could be done in a "no-return" way. Yet in "no-return" parallelism one instruction is not necessarily executed by multiple servers at the same time. Server 1 may be processing an insertion while server 2 is processing a remove. What we have is a form of SPMD (Single Program Multiple Data) data parallelism. No return operations can be sent in batches from the client to the server nodes, and as long as each server executes its operations in order the program will execute properly. One can have, in the extreme case, operations which are performed by server processes long after the client process, or transaction, has terminated. As the operations generally consist of relatively slow disk updates, the burden on the client processor should be much smaller than that on any server node for many configurations. This is a control flow method, where "with control flow, there is, for each database query a single control node that controls all processes related to this query... The control node starts all these processes, takes care of their synchronization, notices their finishing, etc." ([TB93], p.1270). However, no return parallelism, as implemented in ADAMS, does not require the control node (a client process in ADAMS) to keep track of an operation once it has been shipped to a server.
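A minimal sketch of no-return insertion follows, under the same caveats as before: the ServerChannel type and its send interface are hypothetical, and ADAMS ships such operations through Mentat.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical messaging stub: send() queues a message for a server
// and returns immediately; no reply is ever awaited.
struct ServerChannel {
    void send(const std::string& op) { /* enqueue for asynchronous delivery */ }
};

// No-return insertion: route each item to the server owning its
// partition and continue without waiting for an acknowledgement.
// Correctness only requires that each server applies its own
// messages in the order they were sent.
void insert_no_return(std::vector<ServerChannel>& servers,
                      std::uint64_t oid, const std::string& item)
{
    std::size_t node = oid % servers.size();   // partition by oid
    servers[node].send("insert " + item);      // fire and forget
}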

5.4 Loop Parallelism

Many database programs conceptually involve sequential iteration over a collection of elements. For example, often the results of a query or the contents of a set or relation are printed, as in the client code of Figure 5.4, or otherwise sequentially processed. Often the data retrieval processing can actually be parallelized, exploiting the loop situation by fetching N data items (or N batches of data items) at once. One can regard this as a form of pre-fetching. Figure 5.4 illustrates this technique as used to retrieve data from horizontally partitioned relation R. We have broken up the data retrieval into two phases, phase 1 being the "request" phase, where the client sends a data request to each server, and phase 2, the "return" phase, where the client waits for responses from the servers. In between phase 1 and phase 2 the client may be printing (or performing a similar sequential activity) while the servers are preparing to return the requested information. Thus the data is being fetched before it is needed, or pre-fetched.2

Figure 5.4: Loop Parallelism - Print Loop Example

This pre-fetching technique is applicable in other situations, but we consider the loop situation to be significant enough to merit consideration on its own, rather than discussing pre-fetching in more general terms.

2 In the ADAMS implementation, after the first request phase the servers begin filling the next request prior to receiving it, an even more aggressive pre-fetch.
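The two-phase loop of Figure 5.4 might look as follows in outline. This is a sketch with hypothetical types: the Server stub, its batching, and the use of std::async are assumptions, and the ADAMS servers additionally pre-fill the next request before it arrives (see footnote 2).

#include <future>
#include <iostream>
#include <string>
#include <vector>

struct Server {
    // Stub: would return the next batch of name values from this
    // server's partition of R; empty means the partition is exhausted.
    std::vector<std::string> next_batch() { return {}; }
};

void print_all_names(std::vector<Server>& servers)
{
    auto request_all = [&servers] {          // phase 1: request from each server
        std::vector<std::future<std::vector<std::string>>> f;
        for (Server& s : servers)
            f.push_back(std::async(std::launch::async,
                                   [&s] { return s.next_batch(); }));
        return f;
    };

    auto pending = request_all();
    for (;;) {
        std::vector<std::vector<std::string>> batches;
        bool any = false;
        for (auto& fut : pending) {          // phase 2: collect the responses
            batches.push_back(fut.get());
            if (!batches.back().empty()) any = true;
        }
        if (!any) break;                     // every partition is exhausted
        pending = request_all();             // pre-fetch the next round...
        for (auto& b : batches)              // ...while doing the sequential
            for (auto& name : b)             // client work on the current one
                std::cout << name << '\n';
    }
}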

5.5 Simple I/O Parallelism

If I/O is seen as the major database bottleneck [DG90], a simple method of improving performance is to parallelize the I/O operations by off-loading them to parallel page server processors. Several recent object-oriented systems are based on page server architectures, such as EXODUS, Gemstone, O2, ObjectStore, and SHORE [CFZ94]. If several I/O requests can be handled at once then the I/O bottleneck is reduced and I/O performance may no longer be an issue. We illustrate this in Figure 5.5, where multiple clients are making page requests of multiple servers. With many more clients than servers, keeping the servers occupied is an easy chore. As the relative number of clients decreases, however, generating enough I/O requests to occupy the servers would depend on whether some form of aggressive pre-fetching was implemented and on the relative speeds of the clients and servers.


Figure 5.5: Simple I/O Parallelism

In our expected application environment such solutions do not scale. For ObjectStore queries, for example, index and data pages must be requested by the client system, where all query processing is performed [OHMS92, LLOW91]. The ObjectStore servers know nothing about the data in the pages they maintain. For a configuration with many more servers than clients, we expect there to be a bottleneck at the client processor, caused by message handling and query processing overhead. The object server approach has similar difficulties, unless the ability to apply large-grained set operations is available. An example is selection, as in [DFMV90]. If the servers can interact in order to resolve inter-object references without low-granularity client direction, that is also an improvement. Without these server features the client must remain a query bottleneck.

5.6 Parallelism

In designing a system it is difficult to choose among the above methods. Yet there are existing systems using varying permutations of these and claiming success [Men89, HHL+89]. Success is measured according to the existing hardware and software architectures, which, in the two papers above, were designed specifically for database applications, and often to a particular set of database activities. If the processing of many small transactions is the target of the architecture, then a benchmark akin to the Debit Credit Transaction [Gro88] makes good sense. For other targets, a variety of large grained activities may be more appropriate [DGS88], and may be particularly beneficial when making comparisons to commercial systems. Estimating how successful methods employed on one system for one group of applications would be in a different processing environment running another group of applications is not an easy task.

Many current parallel database systems restrict themselves to older data models. These were designed for conceptual simplicity and not necessarily for efficiency or to exploit parallelism. An inefficient process expressed in a sequential data model may yield near-linear speed-up in a corresponding parallel data model, while a more efficient algorithm might show definite sub-linear speed-up, but still be an improvement in all foreseeable cases. For the data model we have been considering, the model and the supported language had the exploitation of parallelism as a goal from an early stage.

6 Related Work

There has been little published work on parallel object-oriented database systems thus far. Object-oriented databases are still relatively new, and the work required to create a parallel implementation is substantial. One aid to keeping current in this area has been provided by Kjetil Norvag, a researcher at the Norwegian University of Science and Technology. His parallel object-oriented database system bibliography world wide web page1 lists 32 published papers and technical reports as of 4/97. That number may appear small, but the site is known to the research community and thus the list is likely to be fairly complete.2 In this chapter we briefly describe work (not all from the web site) most relevant to our research.

6.1 Implementations

While there are commercial distributed object-oriented database systems, allowing access to objects stored on multiple computers, there are no commercial systems capable of parallel object-oriented query execution as we envision it. There are, however, a small number of academic prototypes.

1 http://www.idt.ntnu.no/~noervaag/p_oodb_papers.html
2 Although it may not list all the papers associated with a particular research project, as with Bubba/FAD.

6.1.1 FAD and Bubba

The FAD object-based language and the Bubba parallel database machine prototype [Bor88, AC88b, CABK88, KV89, BBKV87, BAC+90, SAB+89, DV92, HDV88] were developed as part of the Database Program at Microelectronics and Computer Technology Corporation in the mid to late 1980's. They were an extraordinarily prolific research group. Their target application area included "'knowledge-based' transactions, which access and analyze large amounts of data"([BAC+90],p.5), which is similar to our own. The use of horizontal data partitioning, the processing strategy where "programs are sent to the nodes which contain data to execute locally"3 ([KV89],p.156), and the use of a shared-nothing architecture are common to both FAD/Bubba and ADAMS. But there are a number of important differences between the two systems. Bubba, FAD's parallel execution environment, is a special purpose parallel database machine. ADAMS is entirely a software entity, expecting a general purpose computing environment. PFAD, the parallel FAD language, is described as "an abstraction of a highly parallel, shared-nothing database system Bubba" ([HDV88],p.72). An important task, as outlined in [HDV88], is to optimally convert FAD into PFAD.4 This conversion is complicated by data placement issues [CABK88]. Bubba declusters by key.5 Each PFAD relation can be independently partitioned over a different number of nodes. In contrast, the ADAMS abstraction is the parallel set, and partitioning is always done by oid. There are other differences. The object model was better defined at the time of the parallel ADAMS design phase. The term "relation" appears throughout the MCC literature. Their interest in partitioning by hash value [KF88], rather than by oid, is puzzling. The degree to which what we now see as the object model was supported by FAD/Bubba is unclear. According to Heytens, "We believe that only a relational subset of FAD (no inter-object references) was actually implemented before the end of the project; we do not know what optimizations were implemented or what performance was achieved" ([Hey92], p. 215). We note that [BAC+90] provides some performance information, although a full specification of the queries and experimental database is not given. Partitioning by oid, as done by parallel ADAMS, seems to be an important step in providing efficient inter-object references or implicit join capability. Part of MCC's work describes the Decomposed Storage Model [CK85], which is used by ADAMS. In processing implicit join queries, they discuss the use of relational foreign key techniques supported by surrogate-based join indexes [KCJ+87], rather than direct inter-object references as provided in ADAMS.

3 They continue: "(contrast this, for example, with dataflow machines where the data is sent to the operation nodes)."
4 This task includes dataflow analysis.
5 "We decluster records into segments according to either the key value or its hash" (p.102, [CABK88]).

6.1.2 SHORE

Another prolific research group has been the SHORE research group at the University of Wisconsin-Madison. SHORE [DNSV, CDF+94, Gro95b] is a page-server based system; see Chapter 4. However, as mentioned in the introduction, they did experiment with a feature known as ParSets [DNSV], which effectively extended server functionality. The SHORE ParSet implementation provided a means of partitioning a set of objects over a number of processors, and executing object methods over set elements in parallel. A ParSet slave process was responsible for executing a selected method over all objects in a partition of the set. The cost of creating ParSet slave processes, and invoking these methods, was offset by the benefit of executing methods over entire set partitions in parallel. ParSets were divided into two types. Primary sets contained the object data and provided a decluster() method, used to determine how the set was to be partitioned. Standard partitioning methods included hash, modulo, and range. Secondary ParSets consisted of oids only. An object could reside in only one primary set.

The ParSet functionality, as it applied to queries, was never implemented (SHORE "doesn't do queries yet" - personal communication, Zwilling, 11/94), although iterative selection was apparently available. And SHORE has not been released in its multi-server form: "The release supports only single server operation." (http://www.cs.wisc.edu/shore/1.0/overview/footnode.html#52). But SHORE does currently support "value-added servers" (VASs), which allow a SHORE programmer to implement6 an arbitrary slave process to interact with a SHORE server. How the VAS feature will be supported in a parallel server environment is hard to determine. The SHORE research group is affiliated with their department's Paradise [DKL+94] research group, which is using SHORE for its extended relational GIS-targeted system. The goal of the Paradise project "is to apply the object-oriented and parallel database technology developed as part of the EXODUS CF+86 and Gamma projects to the task of implementing a parallel GIS system capable of managing extremely large (multi-terabyte) data sets such as those that will be produced by the upcoming NASA EOSDIS project"([DKL+94], p.2). The Paradise server is a VAS, and the expectation is that multiple Paradise servers will act in parallel. The details of the interaction design should be most interesting. SHORE (a page server system) and ADAMS (a query server system) have little in common architecturally. However, it is interesting to note how SHORE's architecture may be artificially extended to look more and more like a query server or sophisticated object server as the applications begin to resemble the expected ADAMS application area.

6 SUN remote procedure call expertise is apparently required.

6.1.3 ACOS

Wouter Teeuw's dissertation, "Parallel Management of Complex Objects" [Tee93], is an important cautionary tale. It describes the design and partial implementation of ACOS, a persistent object server for the Amoeba distributed operating system. ACOS was expected, in its final form, to support query processing on collections of persistent complex objects. The process of designing the system included the development of analytical models for disk I/O, network communication, and processor load. These models yielded the expected conclusion that disk I/O was, and would continue to be, the bottleneck. But when running a benchmark on a six processor system (i80386 processors) this turned out not to be the case. CPU appeared to be the limiting factor. One problem had to do with process creation. ACOS creates a DFP (Data Flow Processor) process for each algebraic operation per node. For small grained operations this was far too expensive. Other issues included remote procedure calls: "an RPC, i.e., sending the data from the server to a client or the other way round, is fast. But, at least with ACOS, the bottleneck is to get the data in and out of the network buffers, and not the act of sending itself"(p. 136). There were further performance incompatibilities related to Amoeba and its centralized "Bullet" file server. The overall design did not work well in the operating environment.

Teeuw did have some conclusions regarding modelling: "for many operating system functions, including, e.g., the RPC primitives, it is not quite clear what all happens in Amoeba. This is a problem with many systems. Often, the implementation platform cannot be modelled because either the system is too complex to model, or too many aspects of the system are unknown. Since knowledge of the system is missing, a bad model of the system may arise. In particular, the performance of a system may not be described in an adequate way. Also, not all parts of the system that can be modelled are included into a model of the system, because they are not considered as a bottleneck and left out. Consequently, models are inaccurate and only experiments show what is really going on." (p.139). We have had similar difficulties with our operating environment. However, as we will see in Chapter 7, we have been extremely careful with our process architecture, avoiding extraneous process creation and excessive message passing. We have also been careful with our performance model (given in Chapter 8). The model provides predictions in terms of operations within the scope of the ADAMS software. Determining the number of microseconds an ADAMS primitive operation takes to execute in a particular environment is not seen as the responsibility of the model; rather, it is an issue for experimental research.

6.1.4 AGNA

Heytens, in his dissertation [Hey92], describes the prototype AGNA system. AGNA is described as "an experimental persistent programming language that we have designed and implemented to investigate the use of parallelism in an information management system"(p.16). AGNA and ADAMS share similar goals, including providing support for scientific research (p.11). But there are a number of interesting differences in how these goals have been pursued. Parallelism in AGNA is sought at a very fine grain. The language is not embedded, and is declarative (much like LISP). This makes it relatively easy to employ techniques such as dataflow graphs in the compiler to extract fine-grained parallelism from within functions written in AGNA. As Heytens indicates, "Perhaps the most fundamental difference between AGNA and other third-generation database systems is AGNA is based on a language that is inherently parallel, while most of the others are based on languages that are largely sequential"(p.17). Sequential C++ or Fortran code will exist in any ADAMS program, but a goal in ADAMS has been to provide a tool which is easy to use. A programmer not familiar with LISP could have difficulty working with AGNA.

The physical storage design is also quite different. A major structure is the "persistent heap" (p. 24). An object is stored contiguously at its heap address, composed of three parts: a segment, a page, and an offset in the page. Inter-object references are implemented using heap addresses (not oids). But a concern not addressed in the implementation is object growth or shrinkage, and a reference mechanism employing a virtual address (or oid) rather than the physical address would slow processing. Query processing is also very different. AGNA employs inverse attribute indexing, and partitioning by heap address, which is similar to partitioning by oid. The query processing involves selecting a single index from those available, if any, and accessing the object values for further filtering. As the system stores objects contiguously, this may not be as bad as it sounds. But contiguous storage of objects raises issues regarding storage allocation and support for varying object sizes not discussed in the description of AGNA. AGNA was developed on a network of workstations, yet performance results are given for a single SUN workstation and for an Intel iPSC/2 Hypercube, and not for the local area network of workstations. It would not surprise us to find that message costs were a bottleneck in the LAN network, and that this limited the effectiveness of the fine-grained approach on that configuration. It is significant that good performance was obtained on an available hardware configuration, and it is likely that some applications would benefit more from the fine-grained approach taken by AGNA than from a coarse grained approach. It is an interesting contrast.

6.1.5 OSAM*.KBMS

Another system is the OSAM*.KBMS/P system [SJC+94] developed at the University of Florida. The OSAM*.KBMS project group is interested in active objects, with complex rule systems. They indicate that "a parallel implementation of an active knowledge base server is essential to achieve the needed efficiency in processing nested transactions and rules"(p.1). Data is vertically partitioned. Transaction processing performance results are given in [SJC+94].

6.1.6 PRIMA

PRIMA [Ges96] is a shared-memory, parallel object-oriented database system. With this hardware architecture the use of explicitly parallel data structures, such as the ADAMS parallel set or the SHORE ParSet, would not be as useful, as any processor can access any data. Results are given for some 007 benchmark activities run on a Sequent Symmetry processor over a database consisting of 5.7 MB of relevant data. It is unclear how the system would perform on a substantially larger database.

6.1.7 Main Memory Systems

Prisma [AHH+90] and Goblin [vdB94] are two main memory parallel systems. Goblin apparently is only a partial implementation. In general, we do not expect that such systems will have the capacity to support the databases in our application area in the near future, although we will not be disappointed when such a time finally does arrive.

6.2 Conceptual Models

6.2.1 Kilian's ParSETS

Kilian's dissertation [Kil92] rigorously defines the Parallel-Set, an extension of the object model. Parallel-Set methods include Apply (p.28), which applies a function to every object of a set, and Reduce, which provides the result of a function applied to all members of a Parallel-Set. Through these and related constructs it is expected that managing large sets in parallel should be an easier and more natural task. Kilian's work provided some of the insight for the SHORE ParSet discussed above.

6.3 Analytic Models

Leung and Taniar [LT95] discuss a system in which the data resides entirely in main memory. Chen and Su [CS96] and Thakore and Su [TS94] discuss an architecture similar to OSAM*.KBMS above, where the data is vertically partitioned.

6.4 Benchmarks

There are a number of object-oriented database benchmarks described in the literature. Of particular interest is Chaudhri's annotated bibliography [Cha95]. While the following does not represent a complete set, it does include what we believe to be the primary OODB benchmark candidates for our purposes.

We note that all of the benchmarks are single user, although some efforts are being made to create a multi-user 007 benchmark [CDKN94]. This is important, as we expect that a multi-user page server implementation would provide far less disk access locality than a multi-user query server implementation. What we had hoped to find were queries similar to:

result ← {x | x.y.z.attribute1 < 42 OR x.y.z.attribute2 = 3}

In ADAMS, the first steps would be to retrieve on attribute1 or attribute2 and traverse backwards through the inverse maps. We believe such queries are particularly useful, though not seen because they are rarely supported. And ADAMS supports them as set-level operations, rather than with a granularity of single objects. Instead we find forward traversals, where we are given the set of x above and need to find the z objects. The ADAMS language has no way of expressing that as a single query.

6.4.1 The 007 Benchmark

The 007 benchmark is intended to be "suggestive of many different CAD/CAM/CASE applications" ([CDN93], p.6). The 007 database is a complex web of objects. Object types include Modules, Complex Assemblies, Base Assemblies, Composite Parts, Atomic Parts, Connection objects, and a Document. The objects are linked in various one-to-one or one-to-many relationships. The benchmark specifies nine traversal operations. T6 (traversal 6) is: "Traverse the assembly hierarchy. As each base assembly is visited, visit each of the unshared composite parts. As each composite part is visited, visit the root atomic part. Return a count of the number of atomic parts visited when done."([CDN94], p.17). The benchmark specifies eight query operations. One of them is an exact match query on Atomic Part Id. Two of them are date range searches on Atomic Parts. None of them involve the complex conjunctive and disjunctive queries that we wish to use with ADAMS.

6.4.2 Earlier OODB Benchmarks

The 001 benchmark [Cat93, CS92] is intended to represent engineering applications, much like 007 which followed it. As with 007, the complex conjunctive and disjunctive queries that we wish to test are not to be found in 001. The suite of queries in 001 doesn't include range search operations either. The HyperModel benchmark [ABM+90] and the ACOB benchmark [DMFV90] also have a traversal focus, as that appears to be a common activity in CAD applications.

6.4.3 Relational Benchmarks

The Set Query Benchmark [O'N93] was designed to reflect the requirements of Strategic Data Access systems, where queries into large collections of data are used to gain insight. O'Neil gives direct marketing, document search, and decision support as three SDA application areas. Our experience has been that there are scientific and other application areas where research involving set queries can also be of great value. The benchmark consists of queries returning sets. No update operations are given. There are sixty-nine queries specified in the benchmark (for the default data configuration). Many of the queries result from applying the same query on different indices; however, the categories were designed to provide wide coverage of query operations in the SDA domain. Several of the queries given would be of value in benchmarking ADAMS. The difficulties we would have in selecting this benchmark for use with ADAMS are, in part, related to the differences between object-oriented and relational systems. Implicit joins are not tested. Nor can we expect them to be in a relational benchmark. Other relational benchmarks focus on explicit joins, something not available in ADAMS and often avoided in OODBs by improved object design.

6.4.4 Benchmark Summary

Object-oriented database benchmarks have had a CAD focus, where apparently traversals are a critical activity and set queries aren't as much of a priority. Relational benchmarks, on the other hand, cannot test implicit joins. We described our predicament to Akmal Chaudhri (see [CR94, ZC95]), whose research area is OODB benchmarks, and received: "I don't know of anything in the set query area for OODBs. Maybe your own work may develop such a benchmark?" (Chaudhri, personal communication, July, 1996).

7 The Parallel Implementation

The original premise of this research was that data parallelism (Section 5.1) could be exploited in ADAMS. In this chapter we describe the client/server architecture used to achieve this goal. We also discuss other forms of parallelism which contribute to our success. Some of these forms are more important than others, but we found that exploiting multiple forms of parallelism was critical to the success of our implementation.

7.1 Process Structure

Some of the justifications for a page server architecture (see Chapter 4) include the fact that in a configuration serving many client processors the message burden on any single client processor is likely to be small. A further justification is that the client processors are often powerful workstations, while the database server is likely to have substantially less capacity than the aggregate of the client processors. Both of these are true for many system configurations and many application environments. But our anticipated workload consists of only a few clients running large queries, in a configuration with more servers than active clients. We expect the database servers to have CPU cycles to spare, with disk I/O and client processing costs dominating performance. For that reason, one of our goals has been to move the processing burden to the servers. To accomplish this we have adopted a query server architecture.

Figure 7.1: ADAMS Process Architecture (client process: user application program, ADAMS interface layer, parallel set, map, and attribute functions, communication layer/Mentat; Butler process: communication layer/Mentat, registration, OID server, lock functions; ADAMS Server process: communication layer/Mentat, set, map, and attribute functions, O-tree layer, storage manager layer, Unix file system)

A running ADAMS program involves at least three executing processes: the process executing the client side of the application, a "Butler" process which the client process contacts to receive references to the appropriate "Server" process(es), and one or more ADAMS Server processes. Our expected environment is several general purpose processors in a shared-nothing (message passing) configuration. We use Mentat [Gro95a] to handle all interprocess communication.

7.1.1 Client Side - Application Program

An ADAMS application program initially consists of a program written in some host language, such as C++ or Fortran, with embedded ADAMS statements. The application code contains operations such as "insert X into S1", or "S3 ← {X | X in S1 or X in S2}", or "| text buffer char | ← X.name", where X is an object, S1, S2, and S3 are sets, name is an attribute, and "←" is the ADAMS assignment operator. This program is compiled by the ADAMS preprocessor, which translates named references (such as S1 or name) into oids and ADAMS code into calls to ADAMS interface routines. See [Had95] for a full list.

ADAMS data structures include sets, maps (inter-object references), and attributes. As discussed in Section 2.1, we use the Decomposition Storage Model. Rather than store an object's attribute and map values contiguously, all values for an attribute or map are stored in a tree structure associated with the name of the attribute or map. These tree structures are indexed by oid. We partition our database by oid; thus all the attribute and map values of a particular object exist in the same partition but in different trees.

class _A_para_oid {
public:
    _A_uid      self;                   // oid of the set, map, or attribute itself
    _A_NDX_TYPE type;
    _A_uid      sub_oids[MAX_NODES];    // oids of the per-partition O-trees
    _A_uid      storage_uid;
    int         Persistent;
    _A_para_oid(_A_NDX_TYPE new_type);
};

Figure 7.2: The Para_oid C++ Class Definition

In the single user ADAMS version each set, map, or attribute is represented by an oid which identifies the appropriate O-tree structure(s) in the database. In the multi-user/parallel version there is an additional translation step: a set, map, or attribute oid represents a para_oid (see Figure 7.2), which then points to the n oids identifying the appropriate O-tree structures in the database. For each operation requested by an application program, the client process will use the para_oid of the set, map, or attribute referenced1 to send the operation to the ADAMS Server(s) handling the relevant partition(s) of the database. The para_oids used by an application program are one of the few things the client process maintains in cache, as a para_oid will often be re-referenced many times during the course of a program's execution. As our database is partitioned by object identifier, a number of operations involved in query processing, such as set intersection, union, difference, and range search over an attribute, can be performed in a completely data parallel manner, as discussed in Section 5.1. For example, in order to execute an intersection operation all the client process need do is send a message to each of the n ADAMS servers. Each message will contain the intersection instruction and the appropriate operand sub_oids (see Figure 7.2) from the operand para_oids. The ADAMS Servers complete the operation without further interaction with the client process, and without any interaction with each other. Employing data parallel techniques when possible was an early priority in the design of ADAMS, as a query consisting only of data parallel operations would be very likely to exhibit good parallel performance. It was expected that a design allowing excessive message traffic would doom large system configurations.

1 Whether generated by the parallel or sequential ADAMS preprocessor, the application code manipulates sets, maps, and attributes and does not distinguish between sequential or parallel structures.

In Figure 7.3, we give an example of ADAMS union. First is the ADAMS statement A ← B union C, as it would appear in a user program. Next are the ADAMS system calls generated by the preprocessor. The set names A, B, and C, used to identify the operands in the user program, have been converted to oids denoted by _A_literal_uid[i]. Finally we schematically illustrate the ADAMS runtime system executing the instruction _A_set_union by distributing it to the n ADAMS servers. The set names, A, B, and C, are maintained in the dictionary along with the oids of the sets as represented in the database. When the names are used in an ADAMS program, the appropriate oids are retrieved from the dictionary and inserted into the generated code (as references into a generated table). As shown, the generated code for union simply manipulates the set oids. The underlying implementation, sequential or parallel, is buried in the layers below.

ADAMS Code

/* line 83 */
{
    _A_push (_A_stack, _A_literal_uid[3], _A_literal_uid[0]);    // 'B'
    _A_push (_A_stack, _A_literal_uid[4], _A_literal_uid[0]);    // 'C'
    _A_uidcpy (_A_temp_uid1, _A_pop(_A_stack));                  // operand set 1
    _A_uidcpy (_A_temp_uid2, _A_pop(_A_stack));                  // operand set 2

    if (_A_uidcmp(_A_temp_uid1, _A_literal_uid[5]) != 0 &&
        _A_uidcmp (_A_temp_uid2, _A_literal_uid[5]) != 0 )
        _A_newset (_A_literal_uid[5], _A_literal_uid[1], _A_NON_PERSIST);
    _A_set_union (_A_literal_uid[5], _A_temp_uid1, _A_temp_uid2);
    _A_push (_A_stack, _A_literal_uid[5], _A_nulluid);
    _A_uidcpy (_A_var[6], _A_pop(_A_stack));                     // 'A'
}
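As a sketch of how such an operation is distributed, the client need only extract the i-th sub_oid from each operand para_oid (Figure 7.2) and send server i one self-contained instruction. The message and channel types below are hypothetical; the real system sends these messages through Mentat.

// Hypothetical message and channel types for illustration; _A_uid and
// _A_para_oid are the types of Figure 7.2 (n <= MAX_NODES assumed).
struct _A_set_op_msg { _A_uid result, left, right; };
struct ServerChannel { void send(const _A_set_op_msg& m) { /* via Mentat */ } };

// Data parallel dispatch of a binary set operation (union, intersection,
// difference): one message per server, no reply required, and no
// server-to-server interaction.
void distribute_set_op(ServerChannel servers[], int n,
                       const _A_para_oid& result,
                       const _A_para_oid& a, const _A_para_oid& b)
{
    for (int i = 0; i < n; ++i) {
        _A_set_op_msg m;
        m.result = result.sub_oids[i];   // partition i of the result set
        m.left   = a.sub_oids[i];        // partition i of each operand
        m.right  = b.sub_oids[i];
        servers[i].send(m);              // server i works independently
    }
}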

Figure 8.4: Complex Attribute Retrieval Example

In this example there are four steps:

T1 ← {x | x.pop < 5000}
T2 ← {x | x.revenue > 500,000}
T3 ← T1 ∩ T2
S ← T3

The first step is exactly as in the previous example:

T1 ← {x | x.pop < 5000} = F(pop^inv) + ⌈|T1|/β⌉·R_R + C_R(T1) + L_R(T1)

The second step is similar to the first:

T2 ← {x | x.revenue > 500,000} = F(revenue^inv) + ⌈|T2|/β⌉·R_R + C_R(T2) + L_R(T2)

The third step performs an intersection of two streams:

T3 ← T1 ∩ T2 = (|T1| + |T2|)·N + |T1 ∩ T2|·A
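For concreteness, instantiating this formula with illustrative cardinalities (two 30,000-oid input streams whose intersection holds 14,991 oids, the sizes of the test sets described later in the dissertation) gives:

(|T1| + |T2|)·N + |T1 ∩ T2|·A = (30,000 + 30,000)·N + 14,991·A = 60,000·N + 14,991·A

Given measured times for the per-oid primitives N and A on a particular configuration, this expression yields a concrete predicted execution time; the 60,000 input-stream operations dominate whenever N and A are of comparable cost.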

And the final step is simply the conversion of a stream to a set:

S ← T3 = C_S(S) + |T3|·N

8.3.5.3 Simple Map Retrieval

Map inverse retrieval operations are similar to attribute inverse retrieval operations. An example is S ← {x | x.city = oid_Charlottesville}, where city is a map, or object reference, not an attribute or atomic value, and oid_Charlottesville denotes an object.5 A difference between attribute and map inverse queries is that the former include range searches, such as S ← {x | 100 ≤ x.temp ≤ 110}, whereas maps are indexed by the artificial identifier oid, so that only equality and inequality can be meaningful.

5 In ADAMS, actual oid values are hidden from the user; but ADAMS variables can be assigned oids.

8.3.6 Compound Map Queries

Given the structure of the ADAMS language, simple map retrievals are relatively rare. The language does not encourage designation of single objects, or oids, as in oid_Charlottesville. A compound retrieval expression of the form S ← {x | x.city.name = "Charlottesville"} would be more common. In this expression city.name implicitly denotes oid_Charlottesville. Often a map inverse is applied to an entire set of objects. In a query to find all corporations headquartered in cities whose population is less than 5000, one might have the following compound expression:

S ← {x | x.city.pop ≤ 5000}

In [Ber94] these compound map queries are called implicit joins. In the relational model this query would be implemented using a selection and a join. In this case the city^inv map is applied to the set resulting from the pop^inv inverse attribute search. There can be a performance gain associated with applying a map to a set which is ordered, as the index searches will be done in order and it is more likely that desired blocks will be in the inverse map tree cache. Sorting the result of the population inverse search comes at some expense, but optimizing the query to occasionally avoid the sorting step would be difficult. ADAMS always sorts. If the set has more than 10 elements, it always sorts by creating an O-tree representation.

Figure 8.5: A Complex Serial Query

There are three steps comprising the execution of the statement:

T1 ← {y | y.pop < 5000}
T2 ← {x | x.city ∈ T1}
S ← T2

The first step is very much like that of equation (8.12), and modifying that equation slightly we get

T1 ← {y | y.pop < 5000} = F(pop^inv) + ⌈|T1|/β⌉·R_R + C_R(T1) + L_R(T1)

The second step is more complicated, as it is not just a retrieval on one map value but a retrieval on several. For a single map value we would have something very similar to the above, but here we are retrieving for each element in a set.

    T2 ← { x | x.city ∈ T1 } = Σ_{i=1}^{|T1|} ( F(city^inv) + |T2_i| · R_R + |T2_i| · I_R(T2) ) + L_R(T2)

where |T2_i| denotes the number of objects to be inserted in T2 during iteration i. Note that care should be taken not to double count the random disk reads: a random read that retrieves a leaf block during a find operation should not also be counted among the |T2_i| random reads that access all the oids during an iteration.
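As a quick aid to the accounting, the summation above transcribes directly into code. This is a minimal sketch with symbolic inputs, not measured ADAMS values, and the per-iteration term follows the reconstruction given above.

    /* Cost of T2 <- { x | x.city in T1 } per the formula above.
     * t2_i[i] plays the role of |T2_i|; all primitive times are placeholders. */
    double map_over_set_cost(long t1_card, const long *t2_i,
                             double F_city_inv, /* find in the city^inv index */
                             double R_R,        /* one random disk read       */
                             double I_R,        /* one insertion into T2      */
                             double L_R_T2)     /* final linear pass over T2  */
    {
        double cost = 0.0;
        for (long i = 0; i < t1_card; i++)      /* iterations 1..|T1|         */
            cost += F_city_inv + t2_i[i] * (R_R + I_R);
        return cost + L_R_T2;
    }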

The third step is simply the conversion of a stream to a set:

    S ← T2 = C_S(S) + |T2| · N

For another example, we introduce a disjunction:

    S ← { x | x.pop ≤ 20000 OR x.smsa = oid_NorthernVirginia }

where smsa⁶ is an object reference, or map, and oid_NorthernVirginia denotes an object.

[Figure 8.6: Complex Disjunctive Query. Dataflow: a "Select and Sort" over pop^inv yields T1; a "Select and Sort" over smsa^inv yields T2; a Union of T1 and T2 yields T3.]

⁶ smsa is an abbreviation for "standard metropolitan statistical area".

In this example there are four steps:

    T1 ← { x | x.pop ≤ 20000 }
    T2 ← { x | x.smsa = oid_NorthernVirginia }
    T3 ← T1 ∪ T2
    S  ← T3                                                    (8.14)

The first step is exactly as in the previous example:

    T1 ← { x | x.pop ≤ 20000 } = F(pop^inv) + |T1| · R_R + C_R(T1) + L_R(T1)

The second step is similar to the first:

    T2 ← { x | x.smsa = oid_NorthernVirginia } = F(smsa^inv) + |T2| · R_R + C_R(T2) + L_R(T2)

The third step performs a union on two streams:

    T3 ← T1 ∪ T2 = |T1| · N + |T2| · N + |T1 ∪ T2| · A
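As a quick numeric illustration with made-up cardinalities (these are not measured values):

    |T1| = 1000, |T2| = 500, |T1 ∪ T2| = 1300
    ⟹  T3 costs 1000·N + 500·N + 1300·A = 1500·N + 1300·A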

And the final step is simply the conversion of a stream to a set:

    S ← T3 = C_S(S) + |T3| · N

The complete formula is then

    S ← { x | x.pop ≤ 20000 OR x.smsa = oid_NorthernVirginia }
      = (|T1| + |T2|) · (2R_R + W_S + A + N)
      + |S| · (W_C + N + A)
      + ( Σ_{i=1}^{|T1|} log i + Σ_{i=1}^{|T2|} log i ) · (R_C + W_C)
      + F(smsa^inv) + F(pop^inv)                               (8.15)

This expression reveals a great deal about the overall query process. The first term, involving expensive disk I/O, is a linear expression in the cardinalities of the argument subsets, T1 and T2, which are themselves determined by the parameters of the query. The last two terms are constants; they are determined by the size of the overall database, that is, the environment in which the query takes place. If the size of the database were allowed to grow without bound, the third term, which is of order n log n, would come to dominate the expression; it represents the cost of keeping oids in sort order. However, in reasonably sized databases, say of fewer than 10,000,000 objects, the effect of this term is marginal, because cache reads and writes tend to be cheap, and Σ_{i=1}^{|Ti|} log i
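A short, self-contained sketch of how equation (8.15) could be evaluated numerically is given below. The primitive times and cardinalities passed in are placeholders, not the measured ADAMS constants, and the logarithm base, unspecified in this excerpt, is taken as natural for illustration.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative evaluation of equation (8.15); all inputs are placeholders. */
    static double sum_log(long n)
    {
        double s = 0.0;
        for (long i = 1; i <= n; i++)
            s += log((double)i);
        return s;
    }

    static double predict_8_15(long T1, long T2, long S,
                               double R_R, double W_S, double W_C, double R_C,
                               double A, double N,
                               double F_pop_inv, double F_smsa_inv)
    {
        return (double)(T1 + T2) * (2.0 * R_R + W_S + A + N)
             + (double)S * (W_C + N + A)
             + (sum_log(T1) + sum_log(T2)) * (R_C + W_C)
             + F_smsa_inv + F_pop_inv;
    }

    int main(void)
    {
        /* Hypothetical cardinalities and per-operation times (seconds). */
        double t = predict_8_15(1000, 500, 1300,
                                1e-2, 1e-4, 1e-5, 1e-5,
                                1e-6, 1e-6,
                                3e-2, 3e-2);
        printf("predicted time: %.3f s\n", t);
        return 0;
    }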

Two additional sets were created, each containing approximately 30,000 objects. These were populated concurrently with the 150,000 object set; as an element was inserted into the 150,000 object set, its randomly assigned pressure value was interrogated. Objects having a "pressure" value between 50.0 and 70.0 were put in the first 30,000 object set; those having a value between 60.0 and 80.0 were put in the second. The result of the intersection of the two sets was a set of 14,991 objects.

The database described above took some 61 megabytes of disk space. For some of the tests an additional pair of 30,000 object sets was created, adding another 1.2 megabytes to the file space used. Over 450 megabytes remained free on the disk drive, limiting opportunities for disk fragmentation. The tests consisted of performing set intersections on the contents of the two 30,000 element sets mentioned above. The result of the intersection was stored in a temporary set (residing in primary storage only), as might be done in a number of applications, particularly those involving queries. The thin-chunked pattern of Figure 3.1 was created for the two input sets during the database population process. The contiguous pattern of Figure 3.2 was created by copying the two 30,000 element sets. The contiguous but internally disordered pattern of Figure 3.3 was created by performing attribute inverse retrievals on the 150,000 object set. The testing for the second and third cases thus required the creation of additional sets, but the slight variation in database sizes did not appear to be a factor in our testing. Six batches of 50 runs of each of the three storage variations were performed in a controlled environment, and the I/O timing result averages were:

    Thin-Chunked Blocks              10.782 seconds
    Contiguous Blocks                 3.150 seconds
    Contiguous Disordered Blocks      4.683 seconds

The results were examined, with particular attention given to the I/O traces showing the patterns above. A full description of that work is not presented here, but it yielded the result that the thin-chunked block pattern caused one random read from disk per block read. In addition, the block size read from disk was 4K, meaning that approximately 3/4 of the data transferred from the disk in the thin-chunked case was not used during the intersection operation. A reasonably efficient mechanism for accessing sets created during standard database population programs was clearly required.
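To spell out the arithmetic behind the 3/4 figure (the 1 KB useful payload per read is an assumption for illustration; the text states only the 4 KB physical read size):

    useful fraction = 1 KB / 4 KB = 1/4, so wasted fraction = 1 − 1/4 = 3/4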

We first changed the block allocation strategy, as described in section 3.2.1, then used streams (described in section 2.3). The combination of the block allocation changes and the use of streams cut deeply into the need for seeks and rotations, while exploiting the disk drive's read-ahead cache. The average result of 50 runs of an intersection operation using these adjustments was:

    Wide-Chunked Blocks with Streams    2.06 seconds

which was substantially faster than the three other cases, particularly the thin-chunked case.
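For reference, a measurement loop of the kind implied above can be sketched as follows. run_intersection() is a hypothetical stand-in for the set-intersection operation under test, not an actual ADAMS entry point.

    #include <stdio.h>
    #include <sys/time.h>

    /* Hypothetical stand-in for one timed intersection run. */
    extern void run_intersection(void);

    /* Average wall-clock seconds over a batch of runs. */
    double average_seconds(int runs)
    {
        struct timeval t0, t1;
        double total = 0.0;
        for (int i = 0; i < runs; i++) {
            gettimeofday(&t0, NULL);
            run_intersection();
            gettimeofday(&t1, NULL);
            total += (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
        }
        return total / runs;
    }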

C Parallel Query Processing Details

We have reviewed sources of parallelism in ADAMS operations, discussed the ADAMS client/server architecture, described the parallel data structures, and defined ADAMS streams (see section 2.3). In this appendix we tie these pieces together to provide an overview of ADAMS complex query processing. As an example, we use the query R ← { x | x.city.revenue > 500,000 AND x.status = 7 }. The query involves a range search over an attribute, an attribute exact match, a map retrieval, and an intersection. The pre-processed code for the query is shown in Figure C.1:


/* line 64 > */
{
    _A_push (_A_stack, _A_literal_uid[2], _A_literal_uid[0]);         // 'set1'
    _A_newset (_A_literal_uid[3], _A_literal_uid[0], _A_NON_PERSIST); // resultant set
    _A_push (_A_stack, _A_literal_uid[4], _A_literal_uid[5]);         // 'revenue'
    sscanf ("500000", "%lf", &_A_numeric_buf1);
    _A_host_var_to_codomain (_A_value_buf1, _A_nulluid, &_A_numeric_buf1, "double", 0);
    sscanf ("9999999.0", "%lf", &_A_numeric_buf1);                    // range search upper bound
    _A_host_var_to_codomain (_A_value_buf2, _A_nulluid, &_A_numeric_buf1, "double", 0);
    _A_uidcpy (_A_temp_uid1, _A_pop(_A_stack));                       // attr_uid
    /* 1 */
    _A_uidcpy (_A_temp_uid1,
               _A_attr_range_inverse (_A_temp_uid1,
                                      _A_value_buf2, _A_NUMBER_STORAGE_LEN, _A_CLOSED,
                                      _A_value_buf1, _A_NUMBER_STORAGE_LEN, _A_OPEN));
    _A_push (_A_stack, _A_nulluid, _A_nulluid);                       // stack MARKER
    _A_push (_A_stack, _A_literal_uid[6], _A_literal_uid[7]);         // 'city'
    _A_uid_getuid (_A_temp_uid3);
    _A_uid_getuid (_A_temp_uid5);
    _A_newset (_A_temp_uid3, _A_nulluid, _A_NON_PERSIST);             // _A_temp_uid3, _A_temp_uid5
    _A_newset (_A_temp_uid5, _A_nulluid, _A_NON_PERSIST);             // are temp sets

    /* MAP SECTION */
    _A_uidcpy (_A_temp_uid4, _A_pop(_A_stack));
    /* 2 */
    while (_A_uidcmp(_A_temp_uid4, _A_nulluid) != 0) {                // _A_temp_uid4 is 'map'
                                                                      // _A_temp_uid1 is element set
        _A_set_make_empty (_A_temp_uid5);                             // _A_temp_uid5 = 'union set'
        _A_set_first_element (_A_temp_uid2, _A_temp_uid1);
        if (_A_uidcmp(_A_temp_uid2, _A_nulluid) == 0) {               // empty element set
            while (_A_uidcmp(_A_temp_uid4, _A_nulluid) != 0)
                _A_uidcpy (_A_temp_uid4, _A_pop(_A_stack));
            break;
        }
        while (_A_uidcmp(_A_temp_uid2, _A_nulluid) != 0) {
            _A_set_make_empty (_A_temp_uid3);                         // 'element' inverse set
            _A_uidcpy (_A_temp_uid3, _A_map_inverse(_A_temp_uid4, _A_temp_uid2));
            _A_set_union (_A_temp_uid5, _A_temp_uid3, _A_temp_uid5);
            _A_set_next_element (_A_temp_uid2, _A_temp_uid1);
        }
        _A_set_make_empty (_A_temp_uid1);
        _A_set_copy (_A_temp_uid1, _A_temp_uid5);                     // union is new element set
        _A_uidcpy (_A_temp_uid4, _A_pop(_A_stack));                   // new 'map'
    }

(CONTINUED)

Figure C.1: ADAMS Query - source code

    _A_push (_A_stack, _A_temp_uid1, _A_nulluid);                     // push element set
    _A_push (_A_stack, _A_literal_uid[8], _A_literal_uid[5]);         // 'status'
    sscanf ("7", "%lf", &_A_numeric_buf1);
    _A_host_var_to_codomain (_A_value_buf1, _A_nulluid, &_A_numeric_buf1, "double", 0);
    _A_uidcpy (_A_temp_uid1, _A_pop(_A_stack));
    /* 3 */
    _A_uidcpy (_A_temp_uid1,
               _A_attr_inverse(_A_temp_uid1, _A_value_buf1, _A_NUMBER_STORAGE_LEN));
    _A_push (_A_stack, _A_temp_uid1, _A_nulluid);                     // push retrieved set
    _A_uidcpy (_A_temp_uid1, _A_pop(_A_stack));
    _A_uidcpy (_A_temp_uid2, _A_pop(_A_stack));
    _A_uid_getuid (_A_temp_uid3);
    /* 4 */
    _A_newset (_A_temp_uid3, _A_literal_uid[1], _A_NON_PERSIST);
    /* 5 */
    _A_set_intersect (_A_temp_uid3, _A_temp_uid1, _A_temp_uid2);      // AND
    _A_push (_A_stack, _A_temp_uid3, _A_nulluid);
    _A_set_delete (_A_temp_uid1);                                     // free temp sets
    _A_set_delete (_A_temp_uid2);

    /* RESTRICTION SET INTERSECTION SECTION */
    _A_uidcpy (_A_temp_uid1, _A_pop(_A_stack));
    _A_uidcpy (_A_temp_uid2, _A_pop(_A_stack));                       // intersect with
                                                                      // restriction set
    _A_set_intersect (_A_literal_uid[3], _A_temp_uid1, _A_temp_uid2);
    _A_push (_A_stack, _A_literal_uid[3], _A_literal_uid[0]);
    _A_set_delete (_A_temp_uid1);                                     // free temp set
    _A_uidcpy (_A_var[6], _A_pop(_A_stack));                          // 'R'
}
