Towards a Scalable Parallel Object Database: The Bulk Synchronous Parallel Approach

K. Ronald Sujithan
Wadham College, Oxford
Email: [email protected]

PRG-TR-17-96
Programming Research Group
Oxford University Computing Laboratory
11 Keble Road, Oxford OX1 3QD, United Kingdom
August 1996

Abstract

Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in non-numerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make general-purpose parallel computing cost-effective, the requirements of non-numerical (or symbolic) applications, and previous attempts to develop parallel databases. The central theme of the Bulk Synchronous Parallel model is to provide a high-level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture-independent programs to deliver scalable performance on diverse hardware platforms. The primary objective of this report is therefore to investigate the feasibility of developing a portable, scalable, parallel object database based on the Bulk Synchronous Parallel model of computation. In particular, we devise a way of providing high-level abstractions for efficient database programming by combining elements from data-parallel functional languages with the encapsulation and task-level features of object-oriented languages. Our work integrates collection types, which model a broad class of non-numerical applications, into the existing direct-mode BSP programming environment provided by BSPlib. We introduce collection types within a free-monoid framework, discuss the primitive operations on collections, and outline how the collection types have been built on top of BSPlib. To demonstrate the practical significance of our approach, we study the recently proposed ODMG-93 industry standard for object databases, including the collection types supported by the model, and the select-from-where construct of the companion Object Query Language (OQL).
Using our preliminary implementation of the BSP collection types, we give experimental results on a uniprocessor Sun workstation, an SGI PowerChallenge and an IBM SP2, using the native BSPlib implementations, and discuss how the BSP collection types we have built can help implement object databases. These preliminary results encouragingly indicate a close correspondence between the theoretical cost predictions under the BSP model and the actual performance obtained on real machines with BSPlib.

Contents

1 Introduction
  1.1 General-purpose architectures
  1.2 Models and algorithms
  1.3 Programming environments
  1.4 Parallel database architectures
    1.4.1 Parallel database models
    1.4.2 Parallel object databases
  1.5 Structure of the report

2 Object Databases
  2.1 A review of the relational model
    2.1.1 Structured Query Language
    2.1.2 Relational parallelism
  2.2 Object Model
    2.2.1 Object database standards
  2.3 Types
    2.3.1 Basic types
    2.3.2 Collection types
  2.4 Object database languages
    2.4.1 Object Definition Language
    2.4.2 Object Query Language

3 Scalable Parallel Programming
  3.1 Computation granularity
  3.2 Communication complexity
  3.3 Speedup, Efficiency and Scaleup
  3.4 The BSP model
    3.4.1 The BSP cost model
    3.4.2 BSP programming

4 Collection types
  4.1 Mathematical framework for collection types
  4.2 BSP realisation of collection types
    4.2.1 Data distribution
    4.2.2 Sorting and load balancing
  4.3 Collection operations
    4.3.1 Single collection operations
    4.3.2 Operations involving more than one collection

5 Parallel Query Evaluation
  5.1 Mathematical framework for queries
  5.2 BSP realisation of queries
  5.3 Experimental results
    5.3.1 Sun implementation
    5.3.2 SGI PowerChallenge implementation
    5.3.3 IBM SP2 implementation
  5.4 Towards parallel query optimisation

6 Conclusions and Further Work
  6.1 An I/O concept for BSP programming
  6.2 Final remarks

Chapter 1

Introduction

Many industrial and commercial applications require the creation, classification, storage, retrieval and dissemination of very large collections of complex structured data. Although parallel computers have been successfully deployed in many scientific and numerical application areas, their use in commercial applications, which are often non-numerical in nature, has been scarce. One of the impediments to the long-term commercial uptake of parallel computing has been the proliferation of differing machine architectures and corresponding programming models (e.g., see [92] for a survey). However, for several technological and economic reasons, the various classes of parallel computers, such as shared-memory machines, distributed-memory machines, and networks of workstations, are beginning to acquire a familiar appearance: a workstation-like processor-memory pair at the node level, and a fast, robust interconnection network that provides node-to-node communication [91]. The aim of recent research into the Bulk Synchronous Parallel model [96, 135] has been to take advantage of this architectural convergence. The central idea of BSP is to provide a high-level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture-independent programs to deliver scalable performance on diverse hardware platforms. Although BSP programming environments have been successfully utilised in several numerical applications [17, 29, 50, 70], they provide little help in non-numeric applications, as they are based upon an explicit distribution and manipulation of arrays among a number of processors. The GPL [93, 98] project goes some way towards addressing this problem by adding high-level programming language constructs to a BSP environment.
The goal of our research is to study the feasibility of scalable parallel object databases, and the type system needed to model a broad class of database applications, by considering a number of collection types that provide higher-level programming abstractions within a BSP programming environment [128]. Previous approaches to adding collection types to parallel languages were taken from a specific programming-model viewpoint. Data-parallel approaches [20, 72] achieve parallelism by the wholesale manipulation of collection types such as arrays, sets, or lists. On the other hand, concurrent object-oriented approaches [5] aim to support encapsulation, modularity and task-level parallelism. The work described in this report forms a synthesis of the functional style of data-parallelism and the task-level parallelism supported by concurrent object-oriented languages. We focus on the problem of uniting data-parallelism, data distribution and load balancing, and accurate cost prediction within the BSP model. Our main technical contributions, so far, are the identification of an adequate set of basic collection types and operations, a mathematical framework for the uniform treatment of such collections based on the works of [34, 53, 130], and some encouraging results from a preliminary BSP implementation of an object database query evaluator. In the rest of this chapter, we briefly introduce the architectural advancements beginning to

enable general-purpose parallel computing, review the work done on models of parallel computing, and introduce and place BSP in perspective in terms of architectural capabilities and portable-software requirements. Finally, a survey of previous approaches to designing parallel database architectures, and a description of our approach, conclude the chapter.
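The free-monoid treatment of collections mentioned above (developed formally in chapter 4) can be previewed with a small sketch. This is an illustrative Python simulation, not the report's BSPlib implementation: a collection type is viewed as a free monoid with an identity, a singleton injection and an associative merge, and bulk operations such as map and reduce are then monoid homomorphisms, so a collection split into blocks across processors can be processed blockwise and the partial results merged in any association order.

```python
from functools import reduce

# The list monoid: identity, singleton injection, associative merge.
empty = []
single = lambda x: [x]
merge = lambda xs, ys: xs + ys

def hom(f, op, unit, blocks):
    """Apply the homomorphism determined by (f, op, unit) to a partitioned
    collection: each block is reduced independently (in parallel, on a real
    machine), then the partial results are combined."""
    partials = [reduce(op, map(f, blk), unit) for blk in blocks]
    return reduce(op, partials, unit)

# A collection of 8 elements distributed over 2 simulated processors:
blocks = [[1, 2, 3, 4], [5, 6, 7, 8]]
total = hom(lambda x: x * x, lambda a, b: a + b, 0, blocks)  # sum of squares
```

Choosing `single`, `merge` and `empty` as the homomorphism's parameters reconstructs the collection itself, which is the sense in which the monoid is free.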

1.1 General-purpose architectures

In this section, we briefly review the architectural advancements that are beginning to make general-purpose parallel computing feasible. Several parallel processing techniques for performance enhancement, including instruction pipelining, replicated functional units, superscalar processors, vector processors and symmetric multiprocessors [75, 116], have been proposed and implemented since the early 1960s. To classify this diversity of parallel architectural models, Flynn [55] in 1966 introduced a taxonomy based on the multiplicity of the instruction stream (IS) and data stream (DS) in the computer: (1) SISD (Single IS, Single DS) involves instructions sequentially executing on data, and thus defines sequential computation; (2) MISD (Multiple IS, Single DS), so far not a widely practised scheme; (3) SIMD (Single IS, Multiple DS) involves multiple processors simultaneously executing the same instruction on different data sets; (4) MIMD (Multiple IS, Multiple DS) involves multiple processors autonomously executing diverse instructions on diverse data sets.

Active Memory Technology, DAP 600 Series
  Processor/Memory: up to 4096 processors; 1 Kbit local memory
  Peak performance: 560 Mflops
  Interconnection network: 2-D mesh with orthogonal and 4-way links
  Software/Languages: host VMS/UNIX; Fortran-Plus, APAL, C

Kendall Square Research, KSR-1
  Processor/Memory: up to 1088 processors; global memory space
  Peak performance: 43 Gflops
  Interconnection network: ring connected with ALLCACHE memory configuration
  Software/Languages: UNIX; Fortran and KSR C

Thinking Machines Corp., CM2
  Processor/Memory: up to 65,536 processors; 1 Mbit local memory
  Peak performance: 28 Gflops
  Interconnection network: 10-dimensional hypercube with a 4 x 4 mesh on each vertex
  Software/Languages: host VMS/UNIX; Fortran, Lisp, and C*

MasPar Computer Corp., MP1 Series
  Processor/Memory: 1024 to 16,384 RISC processors; 16 Kbytes local memory
  Peak performance: 1.3 Gflops
  Interconnection network: 2-D mesh with multistage X-bar interconnect
  Software/Languages: UNIX; Fortran, MasPar Parallel Application Language

Table 1.1: Some representative first-generation parallel computers

Many of the early parallel computers were SIMD systems [66], as is evident from the representative commercial machines shown in Table 1.1. However, [143] observes that, although the vendors of these early parallel systems projected very impressive peak performance figures (peak performance is obtained by choosing an architecture configuration that closely models the benchmark application, assuming ideal operating conditions, and ignoring overheads that arise in practice [65]), speedup studies showed that the sustained performance across a broad range of applications turned out to be far less than the quoted peak. When sustained performance rates are normalised by system prices (sustained performance per dollar), these first-generation parallel systems become considerably less cost-effective, particularly for many commercial


applications. MIMD systems, in contrast, are composed of multiple processors capable of independent operation, and are often classified as shared-memory multiprocessors or distributed-memory multiprocessors, based on their physical organisation. In a shared-memory multiprocessor, processors communicate with each other through shared variables in a common memory, whereas in a distributed-memory multiprocessor each node has a private memory, and inter-processor communication is achieved by sending messages over an interconnection network. Figure 1.1 distinguishes the shared-memory (left) and distributed-memory (right) architectures.

Figure 1.1: Shared-memory and distributed-memory architectures

The appeal of shared-memory multiprocessors lies primarily in ease of programming, due to the single address space. Such systems tend to have scalability problems, however, since many processors need to access the central memory simultaneously, and the time taken to access this common memory increases with the number of processors. This inherent problem in shared-memory multiprocessors is one of the factors leading to the emergence of distributed-memory multiprocessors as the preferred scalable architecture [96]. In addition, several technological and economic factors, such as the availability of fast, inexpensive microprocessors, high-capacity memory (DRAM) chips, and low-cost secondary-storage devices (i.e. optical and magnetic disks), are leading to the "next generation" of parallel computers based on a common architecture. This convergent architecture entails a collection of essentially complete computers, each consisting of a microprocessor, cache memory, a sizable DRAM memory, and very large secondary storage, connected by a robust inter-processor communication network [45]. Table 1.2 shows the characteristics of some representative "second generation" parallel systems (footnote 2). These "second generation" parallel systems are now well-established tools for scientific and numerical applications that require very high performance [18]. In addition to the scientific needs, there are many symbolic applications, such as databases, that can benefit from the advantages offered by present-day parallel computers. However, a number of impediments in massively parallel computing technology can be identified that have so far prevented these systems from becoming the standard for mainstream commercial computing.
The lack of (1) scalable parallel architecture-independent models that eciently bridge hardware and software [91, 117, 134, 135], and (2) adequate high-level programming constructs and programming environments that aid portable software development [37, 94, 95], can be cited as the most fundamental impediments. Clearly, these limitations must be overcome before we can fully harness the architectural capabilities for non-numerical and database applications. We brie y outline the areas of current research that address some of these issues next. Some of these systems may o er specialised software and/or hardware to support a shared-memory programming model (i.e. support a single address space) but the underlying physical architecture is based on distributed-memory. 2

5

System/ Model

IBM, Power Parallel Series (SP2) Cray Research, MPP Series (T3D) Intel Corp. Paragon Fujitsu VPP Series (VPP500)

Processor/ Memory

1024 PowerPC processors 4 Mbytes local memory 1024 DEC Alpha processors global memory space 256 Intel i860 processors 4{16 Mbytes local memory 222 Fujitsu vector processors shared distributed memory

Performance Interconnection (Peak) Network 50 G ops

2-D mesh connected

150 G ops

3-D torus with wormwhole routing

300 G ops

2-D mesh with wormwhole routing

355 G ops

Crossbar connected

Software/ Languages

UNIX, OSF, Fortran, C, C++, and Visual programming tools UNIX, Fortran, C, Programming tools UNIX, OSF, Fortran, C, Several program porting tools UNIX, OSF, Fortran, C, Vectorising tools

Table 1.2: Some representative second generation parallel computers

1.2 Models and algorithms

The Parallel Random Access Machine (PRAM), as an abstraction of the synchronous shared-memory architecture, has been the theoretical model for the design and analysis of parallel algorithms [56, 76, 80, 108]. In the most general PRAM model, all memory accesses complete in constant time, independent of the number of processors and inter-processor communication costs. Since parallel computers cannot achieve this in practice, the PRAM provides an unrealistic abstraction of real machines. One area of research has been to measure the efficiency of simulating a PRAM, and the slowdown incurred relative to the PRAM, on realistic machines, so that the PRAM style of shared-memory programming can be supported on diverse distributed-memory architectures [78, 79, 97, 106, 132]. A different approach is to develop "practical models" of parallel computing that provide realistic abstractions of parallel machines by incorporating resource parameters for asynchrony and communication costs [90, 120-122, 142]. The Bulk Synchronous Parallel (BSP) model, introduced in [135] and further refined in [59, 91], is a promising practical model of parallel computation for the architectural convergence described in section 1.1. Indeed, the results obtained to date with a number of applications, such as Computational Fluid Dynamics (CFD) [29] and Industrial Electromagnetics [50], are highly encouraging. A number of other such models have been proposed (see for example [90, 142] for comparative surveys), the LogP model being the prime alternative candidate [45]. Qualitative [90, 142] and quantitative [15] comparisons of these two models demonstrate that (1) they have very similar characteristics; (2) BSP can efficiently simulate LogP; and (3) BSP provides a better programming abstraction. These analytical results support the earlier claim that BSP is the model for the architectural convergence discussed in the last section.
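For reference, the cost model underlying these comparisons (treated in detail in section 3.4) can be stated compactly. In the standard BSP formulation, a computation is a sequence of supersteps; each superstep performs local computation, communicates an h-relation, and ends with a barrier synchronisation, where g and l are machine-dependent parameters:

```latex
% Cost of one superstep: w is the maximum local work on any processor,
% h the maximum number of words sent or received by any processor,
% g the per-word communication cost, and l the barrier latency.
T_{\mathrm{superstep}} = w + h \cdot g + l

% A program consisting of S supersteps then costs
T = \sum_{s=1}^{S} \bigl( w_s + h_s \, g + l \bigr)
```

It is this additive, parameterised cost that allows the architecture-independent performance prediction exploited later in the report.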
A stable, common and realistic model is also necessary to aid the development of fundamental algorithms that can be applied across applications. For example, consider the ubiquitous sorting problem, which is of great practical importance because of the number of sorting operations that occur in database manipulation. Recently, several efficient deterministic (and randomised) sorting algorithms have been proposed for the BSP model [58, 59].
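To give a flavour of such algorithms, the sketch below is a sequential Python simulation of a one-round, sample-sort-style BSP scheme, in the spirit of (but not taken from) the cited algorithms: local sort and regular sampling, splitter selection, routing, and a final local sort. In a real BSP program, the routing step would be a communication superstep followed by a barrier synchronisation.

```python
import bisect
import random

def regular_samples(block, p):
    """p-1 evenly spaced samples from a sorted, non-empty block."""
    n = len(block)
    return [block[(i * n) // p] for i in range(1, p)]

def bsp_sample_sort(blocks):
    """Simulate p processors (one list per processor); blocks must be non-empty."""
    p = len(blocks)
    # Superstep 1: each processor sorts locally and publishes its samples.
    blocks = [sorted(b) for b in blocks]
    gathered = sorted(s for b in blocks for s in regular_samples(b, p))
    splitters = [gathered[(i * len(gathered)) // p] for i in range(1, p)]
    # Superstep 2: route every element to the processor owning its interval.
    out = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            out[bisect.bisect_left(splitters, x)].append(x)
    # Final local sort; concatenating the partitions yields the sorted whole.
    return [sorted(part) for part in out]

random.seed(1)
data = list(range(32))
random.shuffle(data)
parts = bsp_sample_sort([data[i::4] for i in range(4)])
merged = [x for part in parts for x in part]
```

Regular sampling bounds the imbalance between partitions, which keeps the local work w and the h-relation of the routing superstep close to their optimal values.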

Sorting is just one example of the many problems for which researchers are seeking efficient parallel algorithms. Advancements in the (complexity) theory of parallel algorithms can also improve the performance and desirability of parallel computers. The consideration of the common characteristics of parallel algorithms led to a new complexity class, class NC, which includes all problems that can be efficiently solved by parallel computers. The study of P-completeness shows that there exists a class of problems that are inherently hard to parallelise [80, 85]. Together, they provide vital insight into the "parallel nature" of many important algorithms [91]. In a recent thesis [126], it was shown that many database query algorithms belong to class NC, and are therefore efficiently parallelisable.

1.3 Programming environments

The diverse architectural features of parallel computers, and the lack of commonly accepted programming language constructs, have meant that programming parallel computers is significantly more difficult than programming sequential computers. Until better approaches are developed, the programming environment will remain a serious obstacle to mainstream scalable parallel computing. Two distinct classes of parallel programming environments can be identified: (1) implicit (or automatic), and (2) explicit (or direct) [120]. The parallel languages that have been designed so far fit somewhere between the two ends of this spectrum, and, clearly, there is a trade-off between the amount of information the programmer has to provide and the amount of information the compiler has to extract to produce optimal parallel code. Explicit parallel programming languages are usually classified as either data-parallel (i.e. SIMD style) or control-parallel, i.e. message-passing (MIMD style). Since the explicit approach offers greater flexibility, it is the most favoured approach at present, whilst the implicit approach remains a long-term goal. Furthermore, several programming paradigms have been considered for parallel computers. Apart from the purely imperative approaches (mainly Fortran and C/C++ based), researchers have investigated applicative languages based on logic and functional paradigms. A general unifying applicative approach is exemplified by the work on data-parallel functional programming [11, 67-69] in terms of the Bird-Meertens Formalism [118, 119] or Skeletons [42, 46]. However, apart from a plethora of purely theoretical results, there have been few practical implementations of data-parallel functional languages other than the language NESL developed by Blelloch [21, 22]. Recently, a great deal of attention has been given to concurrent object-oriented programming and object-based parallelism.
Several languages in this category, including ABCL [141], Actors [4], Concurrent Aggregates (CA) [39], Mentat [64], CC++ [36], EC++ [31] and pC++ [25], have made important contributions in terms of supporting data-parallel operations, encapsulation, modularity and task-level parallelism. In particular, the CA and pC++ languages introduce the notion of distributed collections (or aggregates), which allow programmers to build distributed data structures with parallel execution semantics. The BSP model, as a programming abstraction, exposes only the significant features of any parallel computer, whilst hiding the architectural "idiosyncrasies". Therefore, integrating the above results into a general-purpose BSP programming environment would have wide applicability. The (explicit) BSP programming environment being pioneered at Oxford University [70], based on the recent world-wide BSP standard, provides a practical base for portable software development, and has been utilised successfully in a number of scientific and numerical applications [17, 29, 50]. However, the current environment provides distributed arrays as the main distributed data structure, and adding other data abstractions, particularly collection types that can aid non-numerical application development, will be a highly rewarding exercise.

1.4 Parallel database architectures

In this section, we briefly review recent advances in database technology, provide a historical perspective on the work done on parallel database architectures, and place our work in context. The relational model, originally proposed by Codd [41], with a firm basis in relational calculus and classical first-order logic, has provided the foundation for database systems for the past 25 years. The companion language, Structured Query Language (SQL), provides a syntactically convenient way of invoking the operators defined in the relational algebra [47, 131]. The elegance and simplicity of the model come from the table (i.e. relation) being the only data structure, allowing the operators to be composed into a dataflow graph by streaming the output of one operator into the input of another. Performance has been a major concern of almost all database applications. In particular, a common processing need of these applications is to provide real-time response to complex queries involving large volumes of data. As [102] notes, with the rapid growth in the volumes of data involved, from gigabytes to terabytes, queries have become highly "data-intensive". At the same time, there is a corresponding growth in the complexity of query expressions, to cope with the high volumes involved, making queries more and more "logic-intensive" [102]. Thus, efficient algorithms for the retrieval and manipulation of large collections of simple and complex data objects are essential to attain acceptable performance. Although fast sequential processors with large virtual memories have served databases in the past, the inherent limitations of the sequential architecture suggest that we may not be able to gain further improvements in response time this way: the sequential I/O bottleneck is considered the predominant limitation [28, 49, 102].
Exploiting parallelism appears to be the only way forward to provide real-time response to data- and logic-intensive queries that manipulate terabytes of data [49, 63]. The widespread adoption of the relational model, and of the companion language SQL, is cited as another reason for the interest in parallel database systems, and [77] observe that, because of the highly parallel nature of the dataflow graph, relational queries are ideally suited to parallel execution.
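As a concrete (if simplified) illustration of this dataflow view, relational operators can be written as stream transformers and composed; the schema and query below are invented for the example, and a parallel engine would run copies of each operator over partitioned streams.

```python
# Relational operators as Python generators: each consumes a stream of
# tuples (dicts) and produces another, so queries compose into a pipeline.

def select(pred, rows):
    for r in rows:
        if pred(r):
            yield r

def project(cols, rows):
    for r in rows:
        yield {c: r[c] for c in cols}

def hash_join(key, left, right):
    table = {}
    for r in left:                       # build phase
        table.setdefault(r[key], []).append(r)
    for s in right:                      # probe phase
        for r in table.get(s[key], []):
            yield {**r, **s}

emp = [{"name": "amy", "dept": 1}, {"name": "bob", "dept": 2}]
dept = [{"dept": 1, "floor": 3}, {"dept": 2, "floor": 5}]

# SELECT name FROM emp JOIN dept ON emp.dept = dept.dept WHERE floor > 4
result = list(project(["name"],
              select(lambda r: r["floor"] > 4,
                     hash_join("dept", emp, dept))))
```

Because each operator only consumes and produces streams, the edges of the resulting dataflow graph are natural places to insert partitioning and inter-processor communication.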

1.4.1 Parallel database models

Figure 1.2: Parallel database models

Stonebraker has offered the following simple taxonomy of parallel databases [125] (footnote 3).

Footnote 3: This taxonomy is analogous to Flynn's taxonomy introduced earlier, from two perspectives. Firstly, it is a hardware-based, coarse classification, and therefore does not capture all the salient points. Secondly, it has found widespread usage despite its limitations, since it is highly effective in classifying and comparing diverse database architectures.

Shared-Memory (SM): in this scheme, all disks (D) and all memory modules (M) are shared by the processors (P), as shown in Figure 1.2 (left); it is sometimes referred to as shared-everything

8

since there is a global address space for main memory and all the disks.

Shared-Disk (SD): in this scheme, each processor has its private memory, but all disks are shared by the processors, as shown in Figure 1.2 (centre); every page on every disk is accessible from any processor.

Shared-Nothing (SN): in this scheme, each processor has its private memory and at least one private disk attached; the processor acts as a server for the data on its disk(s), and independently manages the data held locally; Figure 1.2 (right) shows this type of architecture.

Teradata Database Computer (DBC)
  Commercial/Research: Commercial (Teradata)
  Host architecture: Interface Processors (IFPs) and Access Module Processors (AMPs) with Y-net (tree) interconnect

Oracle7 Parallel Server
  Commercial/Research: Commercial (Oracle)
  Host architecture: flexible, multithreaded architecture that maps to any MPP with disks attached to the nodes

DataBase2 Parallel Edition
  Commercial/Research: Commercial (IBM)
  Host architecture: IBM SP/2 with AIX environment and disks attached to each node

Gamma
  Commercial/Research: Research (U. Wisc.)
  Host architecture: 32-node Intel iPSC/2 hypercube with disks attached to each node

Bubba
  Commercial/Research: Research (MCC)
  Host architecture: 40-node FLEX/32 multiprocessor with disks attached to the nodes

Table 1.3: Some representative shared-nothing parallel databases

As stated in [49], although all three configurations have been built in the past, a parallel database architecture based on the shared-nothing hardware design has emerged as the most favoured approach [12, 27, 48]. It is noteworthy that the underlying technological and economic reasons for this convergence are exactly the same as those discussed in section 1.1. The advantages claimed for the SN architecture include: (1) partitioning allows multiple processors to scan large relations in parallel without any exotic I/O devices, overcoming the sequential I/O problem mentioned earlier; each memory and disk is owned by some processor, which acts as a server for its data; (2) the SN design moves only questions and answers (control) through the network, minimising communication costs; (3) SN architectures minimise interference by minimising resource sharing among the processors; (4) in practice, such systems have demonstrated near-linear speedup on relational queries with certain workload patterns; and (5) SN architectures tend to scale well with additional processors. A critique of these models is provided in [14], which also offers a hybrid model for parallel databases that combines (symmetric) SM nodes in an SN manner. Table 1.3 shows some representative research and commercial parallel SN databases.
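Advantage (1), declustering, can be sketched as follows. This is an illustrative simulation only: the node count and hash function are arbitrary choices, and in a real SN system each partition would live on a separate node's disk with the scan predicate shipped to the data.

```python
P = 4  # number of shared-nothing nodes (an assumption for the example)

def hash_partition(rows, key, p=P):
    """Decluster a relation: each tuple is assigned to a node by hashing
    its partitioning key, so a full scan touches all nodes in parallel."""
    parts = [[] for _ in range(p)]
    for r in rows:
        parts[hash(r[key]) % p].append(r)
    return parts

def parallel_scan(parts, pred):
    """Each node scans only its own partition; results are unioned."""
    return [r for part in parts for r in part if pred(r)]

accounts = [{"acct": i, "balance": 100 * i} for i in range(20)]
parts = hash_partition(accounts, "acct")
rich = parallel_scan(parts, lambda r: r["balance"] >= 1500)
```

With p nodes and a well-behaved hash function, each node scans roughly 1/p of the relation, which is the source of the near-linear speedup claimed above.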

1.4.2 Parallel object databases

The success obtained to date in building parallel database systems lies in the highly parallel nature of the operators in the relational algebra and SQL. However, restricting a query language to these operators circumscribes the number of applications that can be efficiently supported (see [127] for a discussion), and today's object databases, which cater for modern database applications, require more features than the relational model and SQL can offer. Among the features that need to be supported, the following are considered [9, 32] the most significant: (1) support for data abstraction in terms of a variety of collection types, such as sets, bags, lists and arrays; (2) nested

collections, that is, collections of collections; (3) more complex computations, such as transitive closure; (4) object identifiers (OIDs), which allow object sharing and object updates; and (5) object encapsulation, meaning that objects encapsulate both state (data) and behaviour (program). Several recent post-relational database standards propose some of the above-mentioned features, including several collection datatypes, for the design of future database systems. However, extending the relational model in this manner makes most of the implementation techniques developed for parallel relational database systems irrelevant, requiring new techniques to be developed [126]. The convergent MIMD, distributed-memory parallel architecture of section 1.1, and the portable programming environments afforded by the BSP model, offer fresh opportunities for implementing scalable parallel object databases. Furthermore, observe that the consensus SN parallel database architecture introduced in this section is essentially a refinement of the BSP model, since it exposes the secondary-storage and I/O aspects of database manipulation. Thus, a detailed study integrating the SN and BSP approaches shows considerable potential; we take the initial steps in evaluating this potential in this report. In the bulk of the report, we develop object database algorithms for collection types, based on the BSP model. We focus on issues related to (embarrassingly) parallel operations, communication and load balancing. Our work takes advantage of the unique aspects of the model, including accurate analysis of costs. Finally, we briefly look at integrating these ideas into a uniform query evaluation framework, and at BSP cost-based query optimisation. Our ultimate objective is to study the feasibility of scalable parallel object databases based on the BSP model.

1.5 Structure of the report

The rest of this report is organised as follows (the figure below shows the chapter dependencies):

1. Introduction
2. Object Databases
3. Scalable Parallel Programming
4. Collection Types
5. Parallel Query Evaluation
6. Conclusions and Further Work

In chapter 2, we introduce database operations and object database standards, and informally discuss the collection types. In chapter 3, we introduce the BSP model, performance prediction under the model, and the BSPlib programming environment. Chapter 4 brings the previous two chapters together: it formally introduces the BSP-Collections model and presents the BSP implementation of collections. In chapter 5 we discuss ways to implement object database queries on top of the BSP collection types we have built, and provide some preliminary experimental results. Finally, in chapter 6 we draw conclusions from our experiments and recommend promising areas to explore.


Chapter 2

Object Databases

In this chapter we informally introduce aspects of object databases relevant for subsequent study. We briefly review the relational model and the relational algebra, review the recent standards for object databases, namely the (de jure) SQL3 draft standard being developed by the ISO and the (de facto) ODMG-93 standard proposed by the OMG group [33], and provide an illustrative example. We find that, although both standards contain several common constituents, ODMG-93 is more concrete than the currently evolving SQL3 effort [82], and thus provides a better basis for our work. We focus on the main elements of the data model, particularly the data types supported, namely the generalised collection datatype and the specialised datatypes sets, bags, lists and arrays, for grouping the database objects, and the operations defined for these types [127]. The concepts introduced informally in this chapter will be specified precisely in chapter 4 within a free-monoid framework, before a BSP implementation is considered.

2.1 A review of the relational model

The ability to manipulate collections of persistent data, and the ability to store and access the data efficiently, are two fundamental qualities expected of a database system. These fundamental qualities need to be supported by at least one data model, defined as a mathematical formalism with a notation for describing the data, and a set of operations used to manipulate the data [47, 131]. This model, in essence, provides an abstract view of data, and allows the users to see information in a way that is meaningful to them [38]. The database schema is the metadata describing the structure of data, and is analogous to the type declarations in a programming language [30]. Schema definitions are based on the types supported by the database language; the details are traditionally part of type theory, and beyond the scope of our work. The data model can also be viewed as the database type system constraints, specifying the allowed range for the data and the allowed operations [10]. The data model is made available via interfaces provided by database languages that are embedded into common programming languages (such as COBOL, PL/1, C or C++). Data Definition Languages (DDL) allow the programming of the structure of the database, and Data Manipulation Languages (DML) allow the high-level expression of queries over the stored data objects. The (stored) data themselves make up the bulk of the database, and may run to terabytes; therefore, the efficiency of query evaluation algorithms is paramount [63, 77]. We first briefly introduce the elements of the relational model before discussing the object model.

Consider a relation R ⊆ D1 × D2 × ... × Dn, where D1, D2, ..., Dn are sets and D1 × D2 × ... × Dn is the cartesian product of the sets. R is an n-ary relation, or of arity n; that is¹,

    R = {(a1, a2, ..., an) | a1 ∈ D1, a2 ∈ D2, ..., an ∈ Dn}

Each element of R is called a tuple, and conventionally R can be written as a table: each Di is given a named column called an attribute, and each tuple of R is written as a row of the table. The schema of the relation R is R : (A1 : D1, A2 : D2, ..., An : Dn), and each attribute name Ai must be distinct. A subset of the attributes K ⊆ {A1, A2, ..., An} is a key if (1) R will never contain two tuples that agree on the attributes in K, and (2) no proper subset of K satisfies property (1) [131]. The relational model defines a number of algebraic operators, and the semantics of each of these operators, which serve to define the relational algebra (RA)².

Set Operations

Since a relation is a set of tuples, the following set operations are defined in the algebra. We may apply the set operators to relations of the same arity, and the order of the attributes in the operands is respected when performing the set operations. Given two relations R and S, the union, denoted R ∪ S, is the set of tuples that are in R or S or both; the intersection, denoted R ∩ S, is the set of tuples that are in both R and S; and the difference, denoted R − S, is the set of tuples that are in R but not in S.
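For concreteness, the three operations map directly onto set operations in any language with a set type. The following minimal Python sketch uses two invented example relations of arity 2:

```python
# Relations as sets of tuples; the set operators apply to relations of equal arity.
R = {(1, "a"), (2, "b"), (3, "c")}
S = {(2, "b"), (4, "d")}

union = R | S          # tuples in R or S (or both)
intersection = R & S   # tuples in both R and S
difference = R - S     # tuples in R but not in S

assert union == {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
assert intersection == {(2, "b")}
assert difference == {(1, "a"), (3, "c")}
```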

Cartesian Product

Let R and S be relations of arity n and m respectively. Then R × S, the cartesian product of R and S, is the set of all possible (n + m)-tuples whose first n attributes form a tuple in R and whose last m attributes form a tuple in S. The cartesian product forms the basis for combining several relations into one relation and performing more complex operations.

Projection

This operation allows us to take a relation R, remove some of the attributes (i.e., columns) and rearrange the remaining attributes. If R is a relation of arity n, then let π_{i1,i2,...,im}(R), where the ij's are distinct integers in the range 1 to n, denote the projection of R onto components i1, i2, ..., im; that is, the set of m-tuples (a1, a2, ..., am) such that there is some n-tuple (b1, b2, ..., bn) in R for which aj = b_{ij} for j = 1, 2, ..., m.

Selection

This operation allows us to take a relation R, remove some of the tuples (i.e., rows) and keep the remaining tuples. Let F be a first-order logic formula involving logical constants, arithmetic comparison operators, and the logical operators. Then σ_F(R) is the set of tuples t in R such that the formula F is true when t is substituted for its variables.

Join

The θ-join of R and S on columns i and j, written R ⋈_{iθj} S, where θ is an arithmetic comparison operator, is shorthand for σ_{i θ (n+j)}(R × S) if R is of arity n. In other words, the θ-join of R and S is those tuples in the cartesian product of R and S such that the ith component of R stands in relation θ to the jth component of S. If θ is the = comparison, then the above operation is called the equi-join, the most common form of join.

¹ We use the comprehension notation introduced in [130] in such expressions.
² In the following outline of the main relational algebraic operators, we discuss the operators that have become important in practice, and do not necessarily follow the original RA definition given in [41].
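The definition above reads directly as a selection over the cartesian product. A small Python sketch follows; the theta_join helper and the Emp/Dept relations are our own illustrations, not part of any standard:

```python
from itertools import product

def theta_join(R, S, i, j, theta):
    """theta-join of R and S on column i of R and column j of S,
    computed as a selection over the cartesian product."""
    return {r + s for r, s in product(R, S) if theta(r[i], s[j])}

Emp  = {("alice", 10), ("bob", 20)}        # (name, dept_no)
Dept = {(10, "sales"), (20, "research")}   # (dept_no, dept_name)

# equi-join: theta is equality, the most common form of join
joined = theta_join(Emp, Dept, 1, 0, lambda a, b: a == b)
assert joined == {("alice", 10, 10, "sales"), ("bob", 20, 20, "research")}
```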

2.1.1 Structured Query Language

Data manipulation languages generally have capabilities beyond those of relational algebra, such as arithmetic capability, assignments, print commands, and aggregate functions. The Structured Query Language (SQL) is the most commonly used DML, and in 1992 became an international standard. The most common form of a query in SQL is a select statement of the form:

    select R1.A1, ..., Ri.Ai
    from R1, ..., Rk
    where F

The semantics of this query can be expressed in the relational algebra, using the operators introduced thus far, as π_{R1.A1,...,Ri.Ai}(σ_F(R1 × ... × Rk)); that is, we take the product of all the relations in the from-clause, select tuples according to the predicate formula F given in the where-clause, and finally project onto the attributes of the select-clause. In addition to the select, project, join, sort and group constructs, SQL includes features for schema definition and update, indexing, aggregate operators, and views [47, 131]. The elegance and simplicity of the "tables" (i.e. relations) structure of the relational model, the very reasons underlying its success, are now considered to be its limitation for many present-day database applications (see [32] for examples). For instance, the simple but limited set of query operations leads to the so-called "transitive closure" problem [35], and the separation of database languages (DDL and DML) from the host language leads to the so-called "impedance mismatch" problem [123]. Further limitations of the relational model include the lack of support for data abstraction (only built-in datatypes are allowed in RA) and encapsulation (only data is stored, not behaviour). Ongoing standards work will add object-oriented features and computational completeness to SQL, though it will be years before the standard is settled and years more before it is widely implemented. We shall examine some of these extensions in the sequel.
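This algebraic reading of select-from-where can be made concrete with a toy evaluator. The select/project/cart helpers and the sample relations below are illustrative sketches only:

```python
from itertools import product

def select(F, R):           # sigma_F(R)
    return {t for t in R if F(t)}

def project(indices, R):    # pi_indices(R)
    return {tuple(t[i] for i in indices) for t in R}

def cart(*rels):            # R1 x ... x Rk, flattening tuples of tuples
    return {sum(ts, ()) for ts in product(*rels)}

Accounts = {(1, "alice", 500), (2, "bob", 50)}   # (no, name, balance)
Branches = {("oxford", 1), ("leeds", 2)}          # (branch, no)

# select a.name from Accounts a, Branches b where a.no = b.no and a.balance > 100
result = project([1], select(lambda t: t[0] == t[4] and t[2] > 100,
                             cart(Accounts, Branches)))
assert result == {("alice",)}
```

The evaluation order (product, then selection, then projection) mirrors the algebraic expression; a real query optimiser would, of course, push the selection and projection inside the product.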

2.1.2 Relational parallelism

Before leaving the relational model, the parallelism inherent in the model is worthy of consideration. Recall from section 1.4 that each relational operator produces a new relation, so that operators can be composed into highly parallel dataflow graphs. Two forms of parallelism can be observed: (1) by streaming the output of one operator into the input of another operator, pipelined parallelism can be employed [83]; (2) by partitioning the input data among multiple processors, an operator can be split into many independent operators, each working on a part of the data; this partitioned data and execution gives partitioned parallelism [19]. Figure 2.1 shows the parallel execution of a typical relational select-from-where query in an SN architecture [74, 105].
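Partitioned parallelism can be sketched in a few lines: split a relation across p workers, run the same selection operator on every partition, and merge the results. In the sketch below the thread pool merely stands in for the p processors of an SN machine, and round-robin partitioning is one of several schemes used in practice:

```python
from multiprocessing.dummy import Pool  # thread pool as a stand-in for p processors

def partition(R, p):
    """Round-robin partitioning of relation R over p workers."""
    parts = [set() for _ in range(p)]
    for k, t in enumerate(sorted(R)):
        parts[k % p].add(t)
    return parts

def parallel_select(F, R, p=3):
    """Run the same selection operator on every partition, then merge (union)."""
    with Pool(p) as pool:
        results = pool.map(lambda part: {t for t in part if F(t)}, partition(R, p))
    return set().union(*results)

R = {(i, i * 10) for i in range(10)}
assert parallel_select(lambda t: t[1] >= 50, R) == {(i, i * 10) for i in range(5, 10)}
```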

2.2 Object Model

The recent advances in object databases, resulting from the integration of object-oriented programming capabilities with database capabilities as a promising alternative to the relational approach, have stimulated a very high level of interest [9, 84]. Object-oriented programming is a disciplined programming style that incorporates four main Software Engineering principles:

[Figure 2.1: Relational parallelism. Select, project and sort operators run on each partition of the data (partitioned parallelism), and their outputs are streamed into a merge stage (pipelined parallelism).]

abstraction, encapsulation, inheritance and polymorphism (see the standard object-oriented programming literature, such as [26, 30, 84, 110], for a discussion of these concepts). Adding a rich set of type constructors, and a persistent storage mechanism, to an object-oriented language produces a database system that overcomes some of the limitations of the relational approach outlined previously [123].

2.2.1 Object database standards

Apart from the ability to manipulate flat tables as described in section 2.1, SQL-92, the current SQL standard, offers no facilities to define the complex data types required by object databases. SQL3, a new database language standard being developed jointly by the ANSI and ISO committees over the last three years, aims to address this requirement. SQL3 is upward compatible with, and extends, SQL-92 in many significant ways, one of the major extensions being the addition of an extensible object data model [13, 86]. This work is still evolving, with much work still to be done. On the other hand, the Object Database Management Group (ODMG), as part of the OMG organisation (an industrial consortium established to promote object technology), has proposed an industry standard for object databases [33]. The ODMG-93 standard defines an object model of data, object definition (ODL) and query (OQL) languages, and bindings for common object-oriented programming languages (e.g., C++, Smalltalk). As noted in [82], ODMG offers a concrete definition; thus, the work reported here is based on this standard. The main component of the object model is the type structure supported, including several abstract datatypes, which is described next.

2.3 Types

An ODMG built-in datatype has an interface, declared in terms of type signatures, and one or more implementations, where the interface defines the external behaviour supported by the instances of the type and an implementation defines the data structures that support the external behaviour. The set of all instances of a given type is termed its extent, which represents all current instantiations of this type in the database. The individual instances of a type can be uniquely identified using the keys declaration.

2.3.1 Basic types

[Figure 2.2: Characteristics hierarchy. A characteristic is either an operation or a property; properties are attributes or relationships.]

The ODMG built-in basic type hierarchy has two components: (1) characteristics, and (2) denotable objects. Characteristics refer to the operations and the properties of an object collectively, where properties are given by the attributes (internal state) and relationships (external links) declarations of that object. Figure 2.2 shows the type hierarchy for object characteristics. The hierarchy of object types is rooted at the type denotable object, and there are two orthogonal lines along which the set of denotable objects can be decomposed: (1) mutable (object) versus immutable (literal), and (2) atomic versus structured. All denotable objects have identity (OID); however, the OID representation for objects is different from that for literals. Figure 2.3 shows the denotable object type hierarchy.

[Figure 2.3: Denotable Objects hierarchy. A denotable object is either a literal or an object. Literals comprise atomic literals (integer, float, boolean, character, bit string, char string, enumeration) and structured literals (immutable collections, and immutable structures such as date, time, timestamp and interval). Objects comprise atomic objects and structured objects (structures, and the collections set, bag, list and array).]

Instances of type object are mutable (i.e. the values of their attributes may change, and the relationships in which they participate may change), and represent objects in the usual sense; the object type represents the dynamic objects found in object-oriented programming languages. (This confusing ODMG terminology, using the term "object" for a type as well as its instances, is rather unfortunate.)


Literals are objects whose instances are immutable, and are conceptually similar to the datatypes found in conventional programming languages; the simplest, the atomic literal, covers the familiar int, float, char and bool types. The type structured literal has two subtypes, immutable collection and immutable structure, and figure 2.3 shows these two built-in subtypes. Note that date, time, timestamp and interval are defined as in the ANSI SQL specification. The type structured object has two subtypes, structure and collection, which offer ways of defining aggregate datatypes. Structures (or tuples) have a fixed number of named slots, each of which contains an object or a literal. Collections, by contrast, contain an arbitrary number of elements, all of the same type. Insertion of elements is based on either absolute position within the collection, or a point established by the cursor. Retrieval is based on either absolute position, cursor-relative position, or a predicate that uniquely selects an element from the collection based on the value(s) that object carries for one or more of its properties.

2.3.2 Collection types

A collection is defined as an abstract object that groups other objects, and all of the elements of a collection must be of the same type. Collections may be defined over any instantiable subtype of denotable object, and the model supports both ordered and unordered collections (e.g. based on the sequence of objects present), and collections with or without duplication of elements. Individual collections are instances of a parameterised abstract collection type, collection<t>, where t is the element type. The ODMG object model defines a corresponding standard set of built-in collection type generators: set<t> (unordered collections that do not allow duplicates), bag<t> (unordered collections that allow duplicates), list<t> (ordered collections that allow duplicates) and array<t> (multi-dimensional arrays) are all subtypes of the type generator collection<t>. For example, the type generator list<t> can be instantiated to produce list<Account> by supplying the element type Account. The model defines several abstract operations over the generalised collection type, and the specific instances specialise these operations to provide the required semantics. The standard document describes these types and operations in detail [33].
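The behavioural differences between the collection generators (duplicates kept or discarded, ordering preserved or not) correspond to familiar container semantics; a Python analogy, using set, Counter as a bag, and list, makes the distinctions concrete (the element values are invented):

```python
from collections import Counter

elems = ["a", "b", "a", "c"]

as_set  = set(elems)       # unordered, duplicates discarded
as_bag  = Counter(elems)   # unordered, duplicates kept as multiplicities
as_list = list(elems)      # ordered, duplicates kept

assert as_set == {"a", "b", "c"}
assert as_bag["a"] == 2
assert as_list[0] == "a" and len(as_list) == 4
```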

[Figure 2.4: A "Bank-Customer-Account" database schema (OMT notation). Bank operates accounts (1+) and has customers (1+); Account, specialised into Savings Account and Current Account, is held at banks; Customer, specialised into Organisation and Person, has accounts (1+) and banks.]

2.4 Object database languages

2.4.1 Object Definition Language

The Object Definition Language is a specification language for defining object types, particularly the type interfaces, that conform to the ODMG object model. The primary objective of ODL is to facilitate portability of database schemas across a range of conforming databases. ODL is designed to be programming language independent, and therefore provides a degree of insulation for applications against variations in programming languages, and supports all semantic constructs of the ODMG model (ODL is essentially a DDL, as described in section 2.1, for object databases). Details of the abstract and concrete syntax of ODL can be found in the standard document; here we illustrate the main features of ODL via a "Bank-Customer-Account" example.

    interface Bank
      (extent Banks, keys Branch_name, Sort_code)
    {
      attribute String Branch_name;
      attribute Int Sort_code;
      ...
      relationship List<Customer> Has_customers;
      relationship Set<Account> Operates_accounts;
      ...
    };

The problem is to define a single database that integrates information about the bank branches, the customers of the bank, and the accounts operated by the bank. Figure 2.4 shows the schema of the database (we use the OMT notation of Rumbaugh et al. [110] to describe the object model of the database), with three main abstract datatypes, Bank, Account and Customer, and their interrelationships. First we define a type, using the ODL syntax, called Bank, with Branch_name

    interface Customer
      (extent Customers, keys Surname, Cus_number)
    {
      attribute String Surname;
      attribute String Address;
      attribute Int Cus_number;
      ...
      relationship List<Account> Has_accounts;
      relationship List<Bank> Banks_with;
      ...
    };

    interface Account
      (extent Accounts, keys Acc_number)
    {
      attribute String Acc_name;
      attribute Int Acc_number;
      attribute Int balance;
      ...
      relationship List<Customer> Has_customers;
      relationship Set<Bank> Held_at;
      ...
    };

and Sort_code as the key fields. That is, an instance of this type creates a branch, and any branch can be uniquely identified using either the Branch_name or the Sort_code assigned to the branch.

This type also contains two collection types as attributes, a list-collection of Customer and a set-collection of Account, signifying the fact that a customer may hold more than one account, but each account is unique. Likewise, the types Account and Customer can be defined, as shown above.

    interface Savings_account : Account
      (extent Savings_accounts)
    {
      attribute Short Interest;
      ...
    };

    interface Student : Customer
      (extent Students)
    {
      attribute Int OD_limit;
      ...
    };

Using the Account type definition and inheritance, new specialised types such as Savings_account, which provides a modified behaviour in terms of the Interest paid on the account, can be derived. Similarly, using the Customer type definition, specialised types such as Student, which offers a free OD_limit (overdraft) facility, and Business, which offers a Loan_limit, etc., can be derived. Continuing this process iteratively will yield a complete type definition for the database. An instantiation of this schema yields an object database for a particular banking organisation. This example also illustrates the use of data abstraction via various collection types, in addition to sets of tuples, indicating the advantages of object databases.
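As an illustration only (this mirrors, rather than implements, the ODL declarations above), the inheritance structure of the schema can be modelled with Python dataclasses; the default values are invented for the sketch:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Account:
    Acc_name: str
    Acc_number: int
    balance: int = 0

@dataclass
class Savings_account(Account):   # specialisation via inheritance
    Interest: float = 0.05        # modified behaviour: interest-bearing

@dataclass
class Customer:
    Surname: str
    Cus_number: int
    Has_accounts: List[Account] = field(default_factory=list)

@dataclass
class Student(Customer):          # specialisation: free overdraft facility
    OD_limit: int = 500

s = Student("Jones", 42)
s.Has_accounts.append(Savings_account("Jones savings", 1, balance=100))
# A Student is a Customer, and a Savings_account is an Account:
assert isinstance(s, Customer) and isinstance(s.Has_accounts[0], Account)
```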

2.4.2 Object Query Language

The Object Query Language defines the syntax of object queries for the ODMG data model, and is capable of dealing with complex objects and collections. OQL allows denotable objects to be queried starting from their names, where a name may denote any atomic, structured, collection or literal object, and acts as an entry point into the database. OQL provides high-level primitives to process collection constructs (i.e. sets, bags, lists and arrays). Further, OQL provides declarative access to objects, in a way similar to SQL and the relational calculus, and a query consists of a set of query definition expressions followed by the query specification (in terms of predicates defined over the data elements). Details of the abstract and concrete syntax of OQL can be found in the standard document [33]. An OQL query is a function which, when applied to its input, delivers an object whose type may be inferred from the operators contributing to the query expression. A Query Evaluator is that part of the database system which processes high-level queries to produce an execution plan. The execution of a query involves calling the operators that manipulate the stored data and also invoking other system calls, to retrieve the queried objects from the database. We shall examine parallel query evaluation under the BSP model later in some depth. This chapter has served to introduce the basic object database concepts required to follow the subsequent chapters.
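For example, an OQL-style query over the Customers extent of the banking schema, such as "select c.Surname from c in Customers where c.Cus_number > 100", has a direct reading as a comprehension over the collection. The data and field values below are invented for the sketch:

```python
# The Customers extent, an entry point into the database (sample data).
Customers = [
    {"Surname": "Jones", "Cus_number": 42},
    {"Surname": "Smith", "Cus_number": 101},
    {"Surname": "Patel", "Cus_number": 350},
]

# select c.Surname from c in Customers where c.Cus_number > 100
result = [c["Surname"] for c in Customers if c["Cus_number"] > 100]
assert result == ["Smith", "Patel"]
```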


Chapter 3

Scalable Parallel Programming

In this chapter we introduce the concepts of architecture-independent scalable parallel programming in the BSP model. We start with a discussion of problem parallelism, that is, the parallelism inherent to a class of problems, and the parallel programming techniques that allow the capture of this parallelism. We then study the ubiquitous PRAM model of parallel computation, where particular attention is paid to simulations on practical machines, communication complexity, and the definitions of speedup, efficiency and scaleup. The Bulk Synchronous Parallel (BSP) model is then introduced as the bridging model between parallel hardware and software; in that section we introduce the several unique aspects of this model which form the basis for our investigations. The last section discusses the new world-wide standard BSP programming environment, BSPlib.

3.1 Computation granularity

In this section we introduce the notion of problem (or natural) parallelism inherent to a specific class of application, and discuss its correspondence with the granularity of computation. The intuition here is that the maximum performance improvement we can hope to gain through parallel computing is limited by the degree of parallelism inherent to the problem. The last chapter introduced the selection of a suitable object model as the key step in database design. Taking databases as an example, analysing the elements of the data model yields the parallelism available within this class of problem. Fox [57] has studied the inherent parallelism present within many industrial applications, and has identified five levels of parallelism, given in column 1 of table 3.1, and a suitable mapping to a machine class, given in column 2 of the table. For database applications, as demonstrated in the example given in section 2.4.1, the data model emulates the high-level concepts of the problem domain, which can be abstract, complex and highly application dependent. This high-level data model is defined in terms of simpler intermediate data structures (e.g. sets, bags, lists, arrays, trees, tables and graphs), some of which were considered in the previous chapter. These intermediate structures are generic (i.e. application independent) in nature, and are supported by a suitable low-level implementation (e.g. structures and pointers). This problem decomposition into a number of smaller segments (or grains) reveals the dependencies among the segments, and the independent segments can be executed in parallel. Grain size or granularity is a measure of the amount of computation involved in a program segment (the simplest measure is to count the number of instructions in a grain), and the dependencies amongst the segments are represented by a dataflow (or computation) graph. The nodes of the graph represent some computation and the links represent the partial ordering of the computations. In addition, these segments can be further decomposed to produce finer and finer segments, the lowest unit being an individual statement or instruction.

Problem Class [57]      | Machine Class                | Program Constructs                                          | Execution Level | Typical Grain Size [75]
Embarrassingly Parallel | Networked workstations or clusters | Programs or jobs                                      | Program         | Large grain (> 10,000 instructions)
Compound (Metaproblems) | MIMD (Message Passing)       | Sub-programs, job-steps or parts of program                 | Task            | Large/Medium grain (> 2,000 instructions)
Asynchronous            | MIMD (Message Passing)       | Procedures, subroutines, or coroutines                      | Process         | Medium grain (< 2,000 instructions)
Loosely Synchronous     | SIMD or MIMD (Data Parallel) | Non-recursive loops, unfolded iterations or a block of code | Thread          | Medium/Fine grain (< 500 instructions)
Synchronous             | SIMD (Data Parallel)         | Instructions or statements                                  | Operation       | Fine grain (1 instruction)

Table 3.1: Levels of problem parallelism

In practice, capturing natural parallelism is a function of algorithm, programming skill, compiler optimisations and the underlying model of computation. Two types of natural parallelism are cited as important to parallel programming. The first is control parallelism, which allows multiple operations to be performed simultaneously, limited by the multiplicity of functional units [75]; in practice, control parallelism is supported by the message passing style of programming. The second type has been called data parallelism, where the same operation is performed over many data elements by many processors simultaneously; by exploiting parallelism in proportion to the quantity of data involved, data parallelism offers the highest potential for concurrency [20]. Table 3.1 shows Fox's classification [57] of the possible mapping of the classes of problem parallelism to machine classes and to corresponding programming constructs [75].

3.2 Communication complexity

Clearly, the chosen model of computation has a crucial role in facilitating the efficient capture of natural parallelism. An examination of the phenomenal success of sequential computing to date indicates that the stored-program model, commonly attributed to von Neumann, largely contributed to this success. This central unifying model enables a diversity of sequential programs to run efficiently on a diversity of sequential hardware architectures. Further, an abstraction of this model into the unit-cost Random Access Machine (RAM) model (which assumes a standard set of operations each taking unit time) facilitated architecture-independent design and analysis of sequential algorithms [6, 43]. Due to the lack of a consensus on parallel architectures, at least until recently, several models of parallel computation have appeared, each reflecting a particular type of architecture. A computational model [142] is an abstraction of a computing machine which is characterised by the choice of several resources and the corresponding resource metrics, enabling algorithm design and analysis based on these resource metrics. The PRAM model provides an idealistic abstraction for algorithm design and analysis, since it

ignores many important practical issues such as communication latency, routing strategy, bandwidth of interconnection networks, memory management, and processor synchronisation. In particular, the model defines a logical step in computation, assumed to execute in O(1) time. Due to this simplification, algorithm design in the model is considerably easier, and, as a consequence, a large number of PRAM algorithms have been discovered over the past several years. However, most of the complexity of parallel computing is due to the cost of inter-processor communication rather than local computation, a problem that grows as the number of available processors increases. As we discussed in section 1.1, a realistic parallel computer typically consists of a set of processor/memory pairs which are interconnected by a sparse network of links. The properties of this interconnection network have a profound effect on the performance of a parallel algorithm on the real machine. Table 3.2 shows the properties of some commonly used interconnection networks for parallel computers [54, 103, 113, 114, 124] (also, see tables 1.1 and 1.2 for example machines that use some of the interconnection networks given here).

Network Type               | Node Degree d   | Network Diameter D | No. of Links l | Bisection Bandwidth B | Remarks
Linear Array               | 2               | p - 1              | p - 1          | 1                     | p processors
Ring                       | 2               | floor(p/2)         | p              | 2                     | p processors
Completely Connected       | p - 1           | 1                  | p(p - 1)/2     | (p/2)^2               | p processors
Binary Tree                | 3               | 2(h - 1)           | p - 1          | 1                     | tree height h = ceil(log p)
2-D Mesh                   | 4               | 2(r - 1)           | 2(p - r)       | r                     | r x r mesh where r = sqrt(p)
Hypercube                  | n               | n                  | np/2           | p/2                   | dimension n = log p
Cube Connected Cycle (CCC) | 3               | 2k - 1 + floor(k/2)| 3p/2           | p/(2k)                | p = k * 2^k processors with cycle length k >= 3
Shuffle-Exchange           | 3               | 2k - 1             | <= 1.5 * 2^k   | p/(2k)                | p = 2^k

Table 3.2: Properties of some common interconnection networks

If a processor/memory pair represents a node, and a connection between two processor/memory pairs represents an edge, then the number of edges incident on a node is called the node degree d, and the maximum shortest path between any two nodes is called the network diameter D. When a given network is cut into two equal halves, the minimum number of edges along the cut is called the channel bisection width, b. If each edge corresponds to a channel with w bit wires, then the bisection bandwidth B = bw. When B is fixed, the channel width w = B/b (bits), which corresponds to the maximum communication bandwidth along the bisection of the network. Then, the communication complexity C(Π, p) of a p-processor PRAM for a particular problem Π is the worst-case traffic between the global shared memory and the local memories of the processors [3]. This is essentially the limiting factor on the speed at which inter-processor communication can be achieved.
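The tabulated formulas can be checked mechanically. The helper functions below encode three of the rows (d = node degree, D = diameter, l = link count, b = bisection width) under the stated assumptions, namely that p is a power of two for the hypercube and a perfect square for the mesh; the function names are our own:

```python
from math import log2, sqrt

def hypercube(p):
    """Properties of an n-dimensional hypercube with p = 2**n processors."""
    n = int(log2(p))
    return {"d": n, "D": n, "links": n * p // 2, "b": p // 2}

def ring(p):
    """Properties of a ring of p processors."""
    return {"d": 2, "D": p // 2, "links": p, "b": 2}

def mesh2d(p):
    """Properties of an r x r mesh, r = sqrt(p)."""
    r = int(sqrt(p))
    return {"d": 4, "D": 2 * (r - 1), "links": 2 * (p - r), "b": r}

assert hypercube(64) == {"d": 6, "D": 6, "links": 192, "b": 32}
assert ring(8) == {"d": 2, "D": 4, "links": 8, "b": 2}
assert mesh2d(16) == {"d": 4, "D": 6, "links": 24, "b": 4}
```

Given a channel width of w bit wires, the bisection bandwidth follows as B = b * w, as in the text.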


3.3 Speedup, Efficiency and Scaleup

As discussed in section 1.1, most existing parallel computers consist of a set of state-of-the-art processors, each with considerable local memory and computing power, connected to some interconnection network. Such machines support a substantially larger granularity of programs for problems with large inherent parallelism than the O(1) granularity of PRAM programs. For parallel algorithms to be of any practical relevance, they must be scalable, that is, they must be applicable and efficient for a wide range of granularity (for example, the five levels of granularity identified in table 3.1), algorithms and architectures. For a given architecture, algorithm, and problem size n, the asymptotic speedup Sp(n) is the best speedup that is attainable, relative to the best known sequential algorithm¹. Let T1(n) be the sequential execution time on a uniprocessor, and Tp(n) be the minimum parallel execution time on a p-processor machine. Further, let kp(n) be the total time spent on communication (including the associated overheads). Then, the asymptotic speedup is defined as

    Sp(n) = T1(n) / (Tp(n) + kp(n))

In this definition, the problem size is the independent parameter, upon which all other parameters are based. The system efficiency of using the machine to solve a given problem is given by the ratio

    Ep(n) = Sp(n) / n

In general, the best efficiency is one (Ep(n) = 1), and the best speedup is linear (Sp(n) = n). Therefore, a definition of scalability is: a system is scalable if the system efficiency is Ep(n) = 1 for all algorithms with any number of processors p and problem size n. However, as pointed out in [71], this definition is too restrictive, and a more practical definition is offered in [104] based on the PRAM model.
The scalability Φp(n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup Sp(n) on the real machine to the asymptotic speedup SI(n) on an ideal realisation of an Exclusive Read Exclusive Write (EREW) PRAM,

    SI(n) = T1(n) / TI(n)

where TI(n) is the parallel execution time on the PRAM, ignoring all communication overheads. Thus, scalability is defined as follows:

    Φp(n) = Sp(n) / SI(n) = TI(n) / Tp(n)

The larger the scalability, the better the performance the given architecture can yield running the given algorithm. In the ideal case, SI(n) = n, and Φp(n) becomes identical to the efficiency definition. In practice, however, scalability is limited by the communication overheads of the real machine, particularly the network diameter (D) and bisection bandwidth (B). This discussion indicates another reason for incorporating the communication complexity in the model.
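The relationships between speedup, efficiency and scalability above can be made concrete with a small calculation. The following sketch uses hypothetical timing numbers (they are not measurements from the report); the function names are ours, chosen to mirror the definitions.

```python
def speedup(t1, tp, kp=0.0):
    # Asymptotic speedup S_p(n) = T_1(n) / (T_p(n) + k_p(n))
    return t1 / (tp + kp)

def efficiency(sp, n):
    # System efficiency E_p(n) = S_p(n) / n, with n the problem size
    # (one PRAM processor per element in the ideal case)
    return sp / n

def scalability(sp, si):
    # Phi_p(n) = S_p(n) / S_I(n): real speedup relative to ideal EREW PRAM
    return sp / si

# Hypothetical example: T_1 = 1024, ideal PRAM time T_I = 10,
# real machine time T_p = 16 with k_p = 4 spent on communication.
s_real = speedup(1024.0, 16.0, 4.0)    # 51.2
s_ideal = 1024.0 / 10.0                # 102.4
print(scalability(s_real, s_ideal))    # 0.5 -- communication halves the ideal speedup
```

Note how the communication term kp(n) alone drops the scalability: with kp = 0 the same machine would achieve Φp(n) = 16/20 higher.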

3.4 The BSP model

The Bulk Synchronous Parallel (BSP) model has been introduced as a candidate bridging model for parallel computing, in the same way the stored-program model served sequential computing

¹ A meaningful measure of asymptotic speedup mandates the use of a good sequential algorithm, even if its structure is different from the corresponding parallel algorithm.


[135]. A key property of the von Neumann model is efficient universality, which means that it can simulate arbitrary programs written in appropriate high-level languages in time proportional to that of a special-purpose machine built for each program. The most fundamental requirement for a general-purpose parallel computing model is to harness this efficient universality [91]. If a machine architecture is denoted by M and the class of programs to be simulated by U, then the efficiency function ESIM is defined as the ratio of the total operation count of the original program to the total operation count of the simulation. Then, [133] argues that a universality statement is of the form: M can simulate any program P in U with efficiency at least ESIM. Optimal universality is achieved if the efficiency is a constant factor independent of the program P and the number of processors p. A number of universality results have been obtained in the parallel setting, based on the two-phase randomised routing results discussed in the last section, and the BSP model embodies these results by making no assumptions about the technology or the degree of parallelism [92,96,133,134]. Building on results obtained with PRAM programming, [136] argues quantitatively that these results, when incorporated in a practical setting, can form the basis of parallel programming languages and machine designs. Further, we wish to exploit the multiple levels of inherent parallelism present in the problem, where the lowest level can be efficiently implemented given some parallel slackness. In order to do so, we adopt the principle of bulk synchrony; that is, the processors are barrier-synchronised at regular intervals long enough for messages to be transmitted to their destinations [136]. The Bulk Synchronous Parallel Computer (BSPC) is defined below:

Definition 3.4.1 (BSP Computer) A BSP computer is an abstraction of any real parallel computer, and can be decomposed into three parts:

1. A number of processor/memory components; usually each consists of a sequential processor with a block of local memory.

2. An interconnection network (router) that delivers messages in a point-to-point manner among the processors.

3. A facility for globally synchronising all the processors by means of a barrier.

[Figure 3.1: Conceptual view of a superstep execution. A single process executing superstep i receives data items sent during superstep i − 1, and makes data items available to other processes in superstep i + 1.]

The BSP model defines the way in which programs are executed on the BSP computer. The model dictates that a program consists of a sequence of supersteps. During a superstep, each processor/memory pair can perform a number of local computations on values held in its local memory at the start of the superstep. During the superstep, each processor may also initiate communication with other processor/memory pairs. The model does not prescribe any particular style of communication, although it does require that at the end of a superstep there is a barrier synchronisation, at which point any pending communications must be completed. Figure 3.1 illustrates the superstep semantics with respect to inter-processor communication.

3.4.1 The BSP cost model

For each of the algorithms we describe in the sequel, we analyse their computational and communication costs. The superstep methodology that forms the heart of the BSP model facilitates cost analysis, because the cost of a BSP program running on a number of processors is simply the sum of the costs of each separate superstep executed by the program. In turn, for each superstep, the costs can be decomposed into those attributed to purely local computation, global exchange of data, and barrier synchronisation. To ensure that cost analysis can be performed in an architecture-independent way, cost formulas are parameterised by four architecture-dependent constants:

p  the number of processor/memory pairs;
s  the speed of computation of a process in some arbitrary unit such as flops or comparisons per second (it provides an architecture-independent way of calibrating BSP algorithms);
l  the minimum time (measured in basic local computation steps, s) for all the processors to barrier-synchronise;
g  the number of local operations required to transfer a single word when all processors are simultaneously communicating.

Machine                      Mflop/s           p   l               g (local)       g (all-to-all)   N_1/2
                             bsc   dse   s         flops   µs      flops  µs       flops  µs        words
Network of Sun Workstations  3.8   16.4  10.1  1   24      2.4     0.4    0.04     0.4    0.04      7
                                               2   54      5.3     3.0    0.29     3.4    0.34      7
                                               3   74      7.4     2.9    0.29     4.1    0.41      8
                                               4   118     11.7    3.3    0.32     4.1    0.41      11
SGI PowerChallenge           53    94    74    1   226     3.1     0.5    0.007    0.5    0.007     80
                                               2   1132    15.3    9.8    0.13     10.2   0.14      12
                                               3   1496    20.2    8.9    0.12     9.5    0.13      12
                                               4   1902    25.7    9.8    0.13     9.3    0.13      12
IBM SP2 (switch)             25    27    26    1   244     9.4     1.3    0.05     1.3    0.05      7
                                               2   1903    73.2    6.3    0.24     7.8    0.30      6
                                               4   3583    137.8   6.4    0.25     8.0    0.31      7
                                               8   5412    208.2   6.9    0.27     11.4   0.43      6

Table 3.3: Observed BSP parameters p, s, l and g for some real machines

Table 3.3 shows the actual (observed) s, p, l and g values for some real parallel computers². The BSP machine parameters l and g are constant for an ideal scalable parallel machine, and

² These values are based on the Oxford BSP toolset benchmarks, and are reproduced here with permission of the author, Jonathan M. D. Hill.


correspond to the network diameter (D) and bisection bandwidth (B) properties of the interconnection network (table 3.2). However, no assumptions about the scalability of real parallel machines can be made, and we therefore give parameter values for different numbers of processors in table 3.3. Some observations about the BSP parameters on real machines can be made. For the switching network of the IBM SP2, the table demonstrates that g remains fixed as the number of processors p is scaled. The efficiency of the communication network can be roughly estimated by comparing the value of g for one processor, which represents memory speed, with g for p > 1. This gives a ratio of inter-processor communication speed to memory speed, which is 6 for the IBM SP2 with switch communication and 20 for the SGI PowerChallenge.

Since the BSP model considers all the individual communications that occur within a superstep as a single monolithic unit, the cost of these communications is accurately modelled by analysing the process with the largest amount of data entering or leaving it. If h words is the largest accumulated size of all messages either entering or leaving a process within a superstep, and since g is defined to be the number of flops required for all processors to simultaneously communicate a single word, the communication cost is gh. Patterns of communication of this form are termed h-relations and form the basis of costing communication in the BSP model. This method of costing communication is accurate for suitably large values of h ≥ h0, for some h0. The standard BSP model makes no distinction between the costs of one process sending h messages of size one or a single message of size h; both communications realise an h-relation cost of hg. However, on a real parallel machine there is a start-up latency associated with every message, so the actual communication cost is dependent on message size.
Miller [100] refined the standard BSP cost model to include the effect of message granularity in the communication cost. In the refined BSP model, g is defined as a function of the message size x and the asymptotic g value, g∞:

    g(x) = (N_1/2 / x + 1) g∞    (3.1)

The value of N_1/2 in equation (3.1) is determined for each machine configuration by fitting a curve to measured values of g(x). Table 3.3 gives the actual g values and the corresponding N_1/2 values for the parallel machines of our implementation. Given the four BSP parameters, the cost of a superstep is captured by the formula [59],

    w + g·h + l    (3.2)

where w is an architecture-independent cost that models the maximum number of operations executed by any one process in the local computation phase of the superstep; h is the largest accumulated size of all messages either entering or leaving a process within the superstep; and l is the time for barrier synchronisation. The total computation cost is simply the sum of all the superstep costs.
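The cost formula and the refined g(x) model are easy to evaluate mechanically. The sketch below is illustrative: the function names are ours, and the worked example plugs in hypothetical workload figures (w and h) together with parameter values of the kind listed in table 3.3.

```python
def superstep_cost(w, h, g, l):
    # BSP cost of one superstep: w + g*h + l, all measured in flop units
    return w + g * h + l

def g_of_x(x, n_half, g_inf):
    # Refined per-word communication cost for messages of size x words:
    # g(x) = (N_1/2 / x + 1) * g_inf   (equation 3.1)
    return (n_half / x + 1.0) * g_inf

# Hypothetical superstep on an 8-processor machine with g ~ 11.4 flops/word
# and l ~ 5412 flops: each process does 10^6 flops of local work and
# exchanges an h-relation of 10^4 words.
print(superstep_cost(1e6, 1e4, 11.4, 5412))  # ~1.12e6 flops: compute-dominated

# Sanity check on (3.1): at message size x = N_1/2, g(x) is exactly 2*g_inf,
# which is why N_1/2 is called the half-performance message length.
print(g_of_x(12, 12, 1.0))  # 2.0
```

The second print illustrates the interpretation of N_1/2: messages much larger than N_1/2 approach the asymptotic rate g∞, while messages much smaller pay mostly start-up latency.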

3.4.2 BSP programming

3.4.2 BSP programming

Two modes of BSP programming are possible: (1) in automatic mode [135] the run-time system hides memory distribution and management from the user (i.e., a PRAM style of programming); and (2) in direct mode the programmer retains control over data distribution and manipulation among the processors. A number of researchers are currently forming a world-wide standard BSP library [61] by synthesising several low-level BSP programming approaches that have been pursued over the last few years [62, 99-101, 107]. They propose a library called BSPlib that provides a parallel communication library based around an SPMD model of computation. The main parts of BSPlib are:

(1) routines to spawn a number of processes; (2) an operation that synchronises all processes; (3) Direct Remote Memory Access (DRMA) communication facilities that allow a process to manipulate the address space of remote processes without the active participation of the remote process; (4) Bulk Synchronous Message Passing (BSMP) operations that provide a non-blocking send operation, delivering messages to a system buffer associated with the destination process at the end of a superstep.

BSP Processes: Multiple processes are created by the procedure bsp_begin and destroyed by bsp_end. This brackets a piece of code to be run in an SPMD manner among a number of processors.

BSP Supersteps: A BSPlib calculation consists of a sequence of supersteps. During a superstep each process can perform computations on data held locally at the start of the superstep, and may communicate data to other processes. Any communications within a superstep are guaranteed to be complete by the end of the superstep, when all the processes synchronise. The end of one superstep and the start of the next is marked by a call to the library procedure bsp_sync.

Direct Remote Memory Access: The library uses one-sided Direct Remote Memory Access (DRMA) communication facilities to perform data communication. In this mode of operation the local address space of each process can be manipulated by other processes using either bsp_get or bsp_put. The operation bsp_put stores locally held data into the local memory of a target process, without the active participation of the target process. The operation bsp_get reaches into the local memory of another process to copy data values held there into a data structure in its own local memory. Unlike communication based upon synchronous message passing, communications within the BSP superstep framework never deadlock. This is a major benefit of the BSP approach.
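The key semantic point above — that a one-sided put takes effect only at the barrier — can be mimicked in a few lines. The following is a toy sequential simulation for illustration only; it is not the real BSPlib API, and the class and method names are ours (the method names merely echo the BSPlib operations).

```python
class ToyBSP:
    """Toy sequential simulation of superstep semantics: puts are buffered
    and only become visible after the barrier, mirroring the rule that all
    communication completes at the end of the superstep."""

    def __init__(self, nprocs):
        self.mem = [dict() for _ in range(nprocs)]   # one local memory per process
        self.pending = []                            # puts buffered this superstep

    def bsp_put(self, dst_pid, name, value):
        # One-sided write: no effect on dst_pid's memory until the barrier
        self.pending.append((dst_pid, name, value))

    def bsp_sync(self):
        # Barrier: deliver every buffered put, then start a new superstep
        for pid, name, value in self.pending:
            self.mem[pid][name] = value
        self.pending = []

bsp = ToyBSP(2)
bsp.bsp_put(1, "x", 42)
print("x" in bsp.mem[1])   # False -- not visible before the barrier
bsp.bsp_sync()
print(bsp.mem[1]["x"])     # 42 -- visible in the next superstep
```

Because writes are deferred to the barrier, no process ever blocks waiting for a partner inside a superstep, which is the intuition behind the deadlock-freedom claim above.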

The results described in this report have been obtained using the Oxford BSP toolset implementation of BSPlib [61,70], which also contains a profiling tool to help visualise the inter-process communication patterns occurring in each superstep. The profiling tool graphically exposes three important pieces of information: (a) the elapsed time taken to perform communication; (b) the pattern of communication; and (c) the elapsed computation time. When discussing the results in chapter 5.3, we will highlight these three attributes.


Chapter 4

Collection types

Following the review of object databases in chapter 2 and of scalable parallel programming in chapter 3, the purpose of our work can be restated as demonstrating the feasibility of implementing the key elements of the ODMG standard in the BSP model. We aim to achieve this in this chapter by developing monolithic collection types, as defined by the ODMG standard and described in chapter 2, such as lists, bags, and sets, that are automatically distributed among the p processors of a BSP machine. We then define BSP-specific implementations of collection operations that efficiently manipulate these types.

The theory of collections is based on a calculus of total functions built from a small number of primitives and a hierarchy of types called the "Boom hierarchy". Essentially there are only three primitive operations in the theory: map, reduce and filter [51-53,129,130,140]. In extending this work to object queries, we wish to cover a wide range of possible collection types, not just sets, or sets and lists. To do so we abstract the important properties of collection types, and also highlight their differences. This can be done by focusing on how the collections are constructed, and on their commutativity and idempotence properties. Mathematically, collection types can be grouped into two forms: sets and sequences, and vectors and arrays. In this report we focus primarily on sets and sequences, although the theory naturally extends to incorporate vectors and arrays as well [51]. First we introduce the mathematical framework of collection types, and then discuss BSP-specific implementation issues.

4.1 Mathematical framework for collection types

A collection is a group of elements all of which have the same type¹, where collection⟨τ⟩ is used to identify a collection containing elements of type τ. The collections selected for our study are based on the ODMG-93 standard introduced in chapter 2, where we observed that the following fundamental collection types are adequate to model a broad class of database applications: (1) sets, unordered collections that contain no duplicates; (2) bags (or multisets), unordered collections that allow duplicates; and (3) lists (or sequences), ordered collections that allow duplicates. These collection types can be uniformly modelled as free monoids, given by definition 4.1.2.

Definition 4.1.1 (Monoid) The triple M = (τ, zero, ⊕) is a monoid of type τ if ⊕ is an associative binary operator (i.e., x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z) of type τ → τ → τ, and zero is the left and right identity of ⊕ (i.e., zero ⊕ x = x ⊕ zero = x).

¹ A type formula of the form x : τ defines the object x to be of type τ, where the type τ is given in curried form [16].


Definition 4.1.2 (Free Monoid) Let T⟨τ⟩ be a type determined by the type parameter τ, and let (T⟨τ⟩, zero, ⊕) be a monoid (where zero : T⟨τ⟩ and ⊕ : T⟨τ⟩ → T⟨τ⟩ → T⟨τ⟩). Then the quadruple (T⟨τ⟩, zero, unit, ⊕) is a free monoid if unit is a function with type τ → T⟨τ⟩.

We highlight the salient features of collections by considering the properties of commutativity (the binary operator ⊕ is commutative if x ⊕ y = y ⊕ x) and idempotence (⊕ is idempotent if x ⊕ x = x) on collections modelled as free monoids. That is, given a monoid (T⟨τ⟩, zero, ⊕), the free monoid (T⟨τ⟩, zero, unit, ⊕) extends the monoid with a singleton or unit function that lifts objects of type τ into the set of objects T⟨τ⟩. The free monoid can be understood as modelling collections of τ's as a type constructor T, where zero represents the empty collection, unit the collection containing a single element, and ⊕ the binary associative operator that merges two collections together. Table 4.1 gives the well-known example collection free monoids and their properties.

Monoid        type T    Zero  Unit  Merge (⊕)  Commutative  Idempotent
set           set⟨τ⟩    {}    {a}   ∪          yes          yes
bag           bag⟨τ⟩    {}    {a}   ⊎          yes          no
ordered set   oset⟨τ⟩   {}    {a}   ∪          no           yes
list          list⟨τ⟩   []    [a]   ++         no           no

Table 4.1: Collection free monoids
Table 4.1: Collection free-monoids When transforming one collection into another, we want to, where possible, preserve the characteristics of idempotency (I ) and commutativity (C ) during the transformation. More formally, given a mapping from the monoid M to the two-element set fC; I g, then C 2 (M) i M is commutative and I 2 (M) i M is idempotent. A partial order  between the two monoids is de ned as M  N  (M)  (N ). This partial order creates a hierarchy of collections, for instance, from those at the top of table 4.1, to those at the bottom. The transformation between the collections M and N is captured by a monoid homomorphism de ned below.

Definition 4.1.3 (Free-monoid homomorphism) The homomorphism between the free monoids M and N, written (hom_{M→N} f A), transforms a collection A : T⟨τ⟩ represented by the free monoid M into a collection of type S⟨σ⟩ modelled by the free monoid N, provided M ⪯ N (i.e., ψ(M) ⊆ ψ(N)), where f : τ → σ, M = (T⟨τ⟩, zero_M, unit_M, ⊕), and N = (S⟨σ⟩, zero_N, unit_N, ⊗). The transformation is defined inductively in terms of M and N as:

    hom_{M→N} f zero_M     = zero_N
    hom_{M→N} f (unit_M e) = unit_N (f e)
    hom_{M→N} f (x ⊕ y)    = (hom_{M→N} f x) ⊗ (hom_{M→N} f y)

Monoid homomorphisms provide a single basis for the parallelisation of operations that manipulate collections. In the context of converting one collection into another, for each element x : τ in the collection A, the monoid homomorphism does the following: (1) binds x to each member of the collection A of type M; (2) applies f to each such x, yielding a member of type N; (3) puts the results back together using the merge operator ⊗ of N. For example, if A is a list collection [a1, a2, ..., an], then hom_{list→set} f A computes

    {f a1} ∪ {f a2} ∪ ... ∪ {f an}

which is the set collection {f a1, f a2, ..., f an}. The partial ordering condition M ⪯ N in definition 4.1.3 is important. For example, list ⪯ bag ⪯ set, since set is commutative and idempotent, bag is commutative but not idempotent, and list is neither commutative nor idempotent. This restriction prohibits conversions that could produce incorrect results.
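Definition 4.1.3 can be rendered directly in code. The sketch below is ours, not from the report: Python lists stand in for the source free monoid list⟨τ⟩, and each target free monoid N is given as a (zero, unit, merge) triple, exactly the data the inductive definition needs.

```python
# Target free monoids as (zero, unit, merge) triples (table 4.1 style).
LIST = ([], lambda a: [a], lambda x, y: x + y)                    # ++ on lists
SET = (frozenset(), lambda a: frozenset([a]), lambda x, y: x | y) # union on sets

def hom(N, f, xs):
    # hom_{list->N} f xs: rebuild the list xs using N's zero/unit/merge,
    # applying f to each element. Well-defined whenever list <= N in the
    # partial order, which holds for every monoid in table 4.1.
    zero, unit, merge = N
    result = zero
    for x in xs:
        result = merge(result, unit(f(x)))
    return result

# hom_{list->set}: duplicates produced by f collapse, since set merge (union)
# is idempotent -- the {f a1} U {f a2} U ... U {f an} example in the text.
print(sorted(hom(SET, lambda a: a % 3, [1, 2, 3, 4, 5])))  # [0, 1, 2]
print(hom(LIST, lambda a: a * 2, [1, 2, 3]))               # [2, 4, 6]
```

The ⪯ restriction shows up concretely: going from list to set is safe (order and duplicates are simply forgotten), whereas a hom from set to list would have to invent an order, which is why the definition rules it out.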

4.2 BSP realisation of collection types

In this section we consider issues related to the BSP implementation of collections (as discussed in the previous section): in particular, exploiting locality of reference, reducing contention in the network, and making optimum use of the communication bandwidth. Our aim is to implement synchronous operations, on distributed data structures, that work across processor boundaries. Some candidate data structures for the physical implementation are discussed in the final section of this chapter. In addition, in our object model we are interested in object parallelism or collection parallelism: parallelism associated with a collection of objects in the application.

A collection collection⟨τ⟩ containing n elements can be naively distributed among the p processors of a BSP machine in blocks of n/p elements. We assume that the distribution is with respect to the outermost level of collection in a nested collection². For example, given a list of sets, only the size of the list, which constitutes the outermost collection, affects the data distribution; the distribution is irrespective of the sizes of the inner sets, although these do contribute to the cost of local computation.

More formally, monoids and monoid terms can capture the data types and operations in both the logical and physical layers. If a logical datatype S is mapped to a physical structure T, then there must be a homomorphism hom_{S→T} that maps any instance of the logical type S to an instance of the physical type T (that is, the mapping must always be one-to-one or one-to-many). In the preceding discussion, the physical structure T is understood as the BSP implementation of the types.

4.2.1 Data distribution

4.2.1 Data distribution

Our approach is based on the object-oriented languages Concurrent Aggregates [39] and pC++ [25], which integrate data parallelism within the object model. We define an abstract collection type, and implement abstract operations over this type. These operations can then be specialised to implement the required concrete collection types: the sets, bags, lists and arrays discussed in the last section. In this model, programmers keep a global view of the whole data structure, as an abstract data type, and can specify how the data is arranged or distributed, thus controlling the communication pattern. This leads to accurate cost prediction under the BSP model, and parallelism is invoked by all the processors in an SPMD manner. In addition to the collection/element type and the associated operations, we augment the data structure with processor and distribution information, as described next.

Elements are distributed across processors using the specified distribution method. To initiate a parallel action, an element method can be invoked on the collection. The corresponding method is applied to all elements of the collection simultaneously, which enables us to express massive parallelism. Several distribution techniques have been studied and evaluated in the literature, which can be grouped into the following forms: (1) Whole: all the collection elements are allocated to a single processor; in other words, every processor keeps a copy of the whole data structure; (2) Block: consecutive segments of the collection elements are allocated to different processors (this distribution scheme is shown in figure 4.1); (3) Cyclic: successive elements are mapped to different processors in a round-robin fashion; (4) Random: elements are randomly allocated to the

² Alternatively, see [24,115] for a discussion of flattening nested collections.


processors according to a probability distribution function.

[Figure 4.1: Block distribution of an n-element collection across p processor/memory components: processor i holds the elements from position in/p up to (i+1)n/p.]

Figure 4.1 shows the block distribution scheme used in our preliminary implementation. In deciding on a distribution method, our main objectives are minimising remote memory references (exploiting locality³), balancing processor speed against network speed, and using excess parallelism (i.e., slackness) to hide network latency. Latency and contention depend upon the relative placement of data elements and on the reference pattern, which is a characteristic of the query algorithm.

Three kinds of parallelism in the distributed-collections model have been identified. The first two involve only local computation and no communication, and are usually completed within a single superstep; they are therefore referred to as embarrassingly parallel computations. The third kind is an abstract collection-level operation which may involve communication, and is usually decomposed into a number of supersteps.

Elementwise parallelism inside a collection. We can apply a function f to every element ei of the collection C = collection⟨τ⟩ in parallel by invoking f(ei) (ei.f() in C++ notation); each processor carries out the computation for the elements which it owns. If an operation is to be applied to only a subset of the elements in the collection, then each processor selects the subset it needs to work with, and then applies the function f.

Parallelism between collections. If we have collections C1 = collection⟨τ⟩ and C2 = collection⟨σ⟩, and a merge operator ⊕ : τ → σ → υ that combines elements from the two collections to produce a new collection C3 = collection⟨υ⟩, then the independence of the elementwise merging operations can be exploited in parallel. Specifically, if the two collections are distributed using the same strategy, each processor can locally merge the elements which it owns, with all the participating processors working in parallel.

Parallelism with collection operations. Collection operations that do not fall into the simple "apply elementwise" category may require inter-processor communication. For instance, if we have a list collection which is a sequence of (unsorted) integers and a sort operation which sorts the sequence into descending order, then we can apply a sequential sort within each processor in parallel (a purely computational step), and then recursively merge the sorted sequences, which involves communication. Another example is the reduction operation, to

³ However, in BSP we do not exploit locality in terms of network proximity [134], which is dependent on the topology of the interconnection network.


be discussed later, which can be decomposed into local reductions performed in parallel, followed by a global reduction over several communication steps.

4.2.2 Sorting and load balancing

We wish to maintain a sorted list without duplicates across a p-processor BSP machine, initially with n/p elements per processor, as the underlying data structure for our collection types. However, intermediate steps may leave the sequence unsorted, or leave the distribution unbalanced. Thus sorting and load balancing are two fundamental operations for the collections implementation. We describe a practical sorting algorithm which works upon an extra tag that we associate with each element of type τ in the collection collection⟨τ⟩. We define a monoid sorted[f]⟨τ⟩, where f : τ → η and η is an ordered type (i.e., a type with a partial order ≤, such as the integers), which denotes a list ordered by f. For a list-based collection, the tag is a simple boolean-valued attribute that determines whether a collection element is "active". For a set-based collection, implemented as a sorted sequence containing no duplicates, sorted[f]⟨τ⟩, the tag is the sort key of type η, which maintains the data-structure invariant of sorted lists.

Sorting

Sorting a collection of elements into ascending or descending order is a fundamental operation in collection processing. It arises as a sub-step in many processing situations, as illustrated in the subsequent sections of this chapter. Because of the fundamental nature of this problem, several sorting methods (see [23,44] for comparative surveys) have been devised for a variety of parallel computers. Essentially there are two approaches to (deterministic) parallel sorting [88]. In the merge-based approach, each processor sorts the portion it holds locally using some sequential sorting algorithm, then all the sorted sublists are exchanged among the processors, and finally the sublists are locally merged. Alternatively, in the partition-based (or quicksort-based) approach, the unsorted list is divided into a number of progressively smaller sublists separated by selected pivots, and the sublists are sorted sequentially. Furthermore, the merging or partitioning can be achieved in a constant number of steps, or recursively.

After careful consideration of several sorting methods, we have selected and implemented a practical variant of the deterministic sample sort algorithm [58,59,89,109,112] with regular sampling for our BSP implementation. Our choice was guided by the considerations of minimising communication costs [1,60] and attaining load balance [112]. The constant-step algorithms move each data element only once, and thus minimise communication; recursive algorithms, on the other hand, typically incur a higher communication cost because elements may be moved several times between the processors. We attain the load-balancing property by utilising global information about the data distribution from the regular samples. Central to our algorithm is the fact that runs of elements containing the same key will be distributed to the same processor; this property is utilised in the algorithms that follow.

Sorting can be used as a good load-balancing heuristic, en route, or after an operation that reduces the size of the collection and introduces data skew. We first present the algorithm and then discuss its load-balancing properties. Our implementation:

[Step 1] Perform a local sort.

[Step 2] Each processor selects p regular samples, which are gathered onto processor zero (we choose a regular sample spacing of n/p²).

[Step 3] Processor zero performs a local sort of the p² samples and picks p − 1 regular pivots (k) from the sorted samples (pivots are picked at intervals ip + p/2, for 1 ≤ i ≤ p − 1); these pivot values are broadcast from processor zero to all the processors.

[Step 4] Each processor produces p partitions of its local block using the p − 1 pivots, and sends its partition i (bounded by pivots ki and ki+1) to processor i; all the received, sorted, partitions are then merged locally (producing the desired effect of a global sort).

Figure 4.2: Samplesort example with 4 processors

Since the complexity of an optimal comparison-based sequential sort is SeqSort(n) = n log n, the cost of locally sorting chunks of n/p data in step 1 is SeqSort(n/p). The cost of sorting the p² samples is likewise p² log p², and the final merging stage has an average cost of SeqSort(n/p). Thus, the total computation cost is⁴

    (n/p) log(n/p) + p² log p² + (n/p) log(n/p)
      = 2 (n/p) log(n/p) + p² log p²
      = 2 (n/p) log n − 2 (n/p) log p + 2 p² log p
      = 2 ((n/p) log n − ((n − p³)/p) log p)

Clearly, if n ≫ p³, then the computational cost is the optimal O((n/p) log n). The communication cost can be estimated as follows. The cost of collecting the p² samples is g(p − 1)², and the cost

⁴ We provide a straightforward average-case BSP cost analysis of our implementation, without resorting to a detailed probabilistic analysis, which is beyond the scope of this report. Instead, we refer to the recent results reported in [58-60] for tighter theoretical bounds.


of broadcasting the (p − 1) pivots is g(p − 1). The upper bound on the redistribution of data can be worked out by considering the data movement for partition i: there are at least

    Xi ≥ (ip + p/2)(n/p²) = ((2i + 1)/2)(n/p)

elements below partition i, and at least

    X′i+1 > (p² − ((i + 1)p + p/2))(n/p²) = n − ((2i + 3)/2)(n/p)

elements after partition i. Thus, the amount of data communicated by processor i is at most

    n − Xi − X′i+1 = n/p

and the BSP cost of redistributing the partitions is g(n/p). Therefore, the total communication cost is

    g(p − 1)² + g(p − 1) + g(n/p) = g (1/p)(p²(p − 1) + n)

The communication cost is g(n/p) for the permutation (i.e., at most an entire segment needs to be transferred to a different processor), and therefore the upper-bound BSP cost of our implementation of parallel sort is

    ParSort(n) = 2 ((n/p) log n − ((n − p³)/p) log p) + g (1/p)(p²(p − 1) + n) + 4l

This sorting procedure, and the associated cost ParSort(n) for n elements, forms the basis of

many BSP collection algorithms we develop in the following sections.
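The four steps above can be simulated sequentially in a few lines. This is an illustrative sketch, not the BSP implementation: the sampling and pivot-selection rules are simplified (first-p regular samples per block and evenly spaced pivots rather than the exact ip + p/2 positions), and the function names are ours.

```python
import bisect

def sample_sort(blocks):
    """Toy sequential simulation of the four-step sample sort: each inner
    list plays the role of one processor's n/p-element block."""
    p = len(blocks)
    blocks = [sorted(b) for b in blocks]                  # step 1: local sorts
    # step 2: up to p regular samples per block, gathered and sorted
    samples = sorted(s for b in blocks
                     for s in b[::max(1, len(b) // p)][:p])
    # step 3: p-1 evenly spaced pivots from the sorted samples
    pivots = [samples[i * len(samples) // p] for i in range(1, p)]
    # step 4: partition each block by the pivots, route partition i to
    # "processor" i, then merge (here: re-sort) locally
    dest = [[] for _ in range(p)]
    for b in blocks:
        lo = 0
        for i, k in enumerate(pivots):
            hi = bisect.bisect_left(b, k)
            dest[i].extend(b[lo:hi])
            lo = hi
        dest[p - 1].extend(b[lo:])
    return [sorted(d) for d in dest]

out = sample_sort([[5, 1, 9, 3], [8, 2, 7, 4], [6, 0, 11, 10]])
print([x for blk in out for x in blk])  # globally sorted: 0, 1, ..., 11
```

Concatenating the resulting blocks in processor order yields the globally sorted collection, and (per the load-balancing argument above) no destination block exceeds roughly 2n/p elements.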

Load Balancing

The problem to be addressed with the block distribution scheme we use is determining when the blocks are not load-balanced, that is, when the distribution deviates significantly from n/p elements per processor. We have implemented a scheme that implicitly re-balances collections after transformations which alter the load distribution; instances of this situation will be highlighted where appropriate. We balance a collection using the parallel sorting algorithm described in the last section. To show a bound on load balancing, we observe that the upper bound obtained for the redistribution of partitions, n/p, is independent of the initial distribution. Thus, during the final step, each processor will contain at most n/p elements, and receive at most n/p elements from other processors. Hence no processor will contain more than 2n/p elements after the final sort. Experimental results have shown that in practice this bound is within a few percent of optimal, rather than the factor of two derived here [89].

Clearly there is a trade-off between step 2 and step 4. If, instead of the p regular samples, oversampling is used (s > p), then an improvement in the load balance can be attained during step 4, at the expense of step 2. On the other hand, undersampling (s < p) results in a significantly poorer load balance in step 4, but decreases the cost of step 2. Further work is required to study the oversampling approach to attain tighter load-balance bounds [88].


4.3 Collection operations

We first describe the fundamental collection operations, expressed as the higher-order functions5 map, filter, and fold (in the database processing context, they are interpreted as selection, projection, and aggregate operations). The BSP implementation of each of these operations, in the context of the distributed collection types introduced above, is presented along with its BSP cost. A description of the set-theoretic operations union, intersection, difference, and cartesian product (interpreted as a database join operation) concludes this chapter.

4.3.1 Single collection operations

Map

We begin with the most basic collection operator, map, which applies a function f : α → β to every element of a collection⟨α⟩ to create a new collection⟨β⟩. This simple but important operator is essentially the homomorphism from the free monoid (collection⟨α⟩, zero, unit, ⊕) to (collection⟨β⟩, zero, unit ∘ f, ⊕). It is the essence of the data-parallel paradigm, as it models embarrassingly parallel computations: the application of f to each element of the collection can occur concurrently. If we ignore nesting, then the BSP cost is

    c · n/p

where c is the maximum cost of applying the function f to a single element of the collection.
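A sequential sketch of the distributed map, with the block layout modelled as a list of lists (the function name is ours):

```python
def dist_map(f, blocks):
    """Apply f to every element of every local block. No communication
    is needed, so the BSP cost is the pure computation term c * n/p."""
    return [[f(x) for x in block] for block in blocks]
```

Note that the block sizes are unchanged, so map never disturbs the load balance.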

Filter

Given a collection⟨α⟩, a filtered version of the collection, containing only the elements that satisfy a predicate f : α → B, can be obtained as the homomorphism from (collection⟨α⟩, zero, unit, ⊕) to (collection⟨α⟩, zero, λx. if (f x) then (unit x) else zero, ⊕). In the context of collections distributed among p processors in n/p chunks, a global filter can be implemented in terms of p local filters with cost c · n/p, but there are potential problems with data skew: the filter produces a new collection of n − k elements, where k is the number of elements eliminated during the filter operation, and this collection may be distributed very unevenly. The potential skewing of data can be addressed by sorting to regain load balance. The cost of the filter is therefore

    c · n/p + ParSort(n − k)
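The corresponding sketch for filter, again sequential; the re-balancing is modelled here by re-cutting the sorted survivors into even blocks, standing in for the ParSort(n − k) step:

```python
def dist_filter(pred, blocks):
    """Local filters followed by a (modelled) sort-based re-balance."""
    p = len(blocks)
    kept = sorted(x for b in blocks for x in b if pred(x))  # local filters
    n = len(kept)                                           # n - k survivors
    return [kept[(i * n) // p : ((i + 1) * n) // p] for i in range(p)]
```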

Fold

Fold (or reduce) is a higher-order function that, given an associative and commutative operator ⊕ and a collection [x1, x2, …, xn], computes the value x1 ⊕ x2 ⊕ … ⊕ xn. Like the previous operators, fold can be expressed as a homomorphism from the free monoid (collection⟨α⟩, zero, unit, ⊕) to (α, id⊕, λx. x, ⊕), where id⊕ is the identity of ⊕. The parallel interpretation of fold is that a sequential fold is performed, in parallel, on each segment of the collection of size n/p, in time proportional to n/p. The overall result is then obtained by using a divide-and-conquer technique to fold the p partial values, one held in each process, in log p supersteps. The BSP cost of the entire fold operation is

    c · n/p + log p · (l + g)
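The two-phase fold can be sketched as follows (a sequential simulation; the pairwise combining loop mirrors the log p supersteps of the divide-and-conquer phase):

```python
from functools import reduce

def dist_fold(op, identity, blocks):
    """Fold each n/p block locally, then combine the p partial results
    pairwise in ceil(log2 p) rounds (one BSP superstep per round)."""
    partials = [reduce(op, b, identity) for b in blocks]   # local phase
    while len(partials) > 1:                               # log p rounds
        partials = [reduce(op, partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0] if partials else identity
```

The operator must be associative (and, for arbitrary block layouts, commutative) for the parallel result to agree with the sequential fold.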

5 i.e., in this case, functions that take a function as an argument.


4.3.2 Operations involving more than one collection

By using the monoid homomorphism hom_{M→set}, any collection type can be converted into a set collection6, a valid transformation in the type hierarchy (see table 4.1). Sets can therefore provide the basis for our BSP implementation of multiple-collection operations, without resorting to tailored implementations for each collection type. Our task therefore reduces to one of implementing all the set-theoretic operations on set collections. As pointed out in [111], a fast sorting procedure can be used as the basis for the efficient processing of these operations. Given two sets S and T, we describe how to calculate the set-theoretic operations S − T (difference), S ∪ T (union), and S ∩ T (intersection) in BSP. We make a similar assumption to those in section 4.2.1 about the distribution of S and T: the n elements of S and the m elements of T are block-distributed across the p processors.

Difference

To compute S − T in BSP, we: (1) locally append the n/p and m/p list elements of the sorted-list implementations of the two sets; (2) locally sort the appended list; (3) locally eliminate duplicate pairs, k elements in total, noting that duplicates are only introduced when an element of T is also in S; (4) perform a global sort on the appended lists (the sorting algorithm we use never causes runs of equal-valued elements to straddle processors); then (5) a final local elimination phase removes any remaining duplicates, which completes the implementation of set difference. The BSP cost of the algorithm is

    2 · (n + m)/p + SeqSort((n + m)/p)

for steps 1, 2 and 3, ParSort(n + m − k) for step 4, and a local computation cost of (n + m − k)/p for the final step. Thus, the total cost is

    3 · (n + m)/p − k/p + SeqSort((n + m)/p) + ParSort(n + m − k)

Union

Union is implemented in a manner similar to set difference. That is, to compute S ∪ T: (1) the local blocks of S and T are appended, sorted, and all-but-one copy of each duplicate eliminated (k elements in total); (2) a global sort is performed; (3) since the sort does not cause "runs" to straddle processors, locally eliminating all-but-one copy of each duplicate completes set union. The BSP cost of this algorithm is also

    3 · (n + m)/p − k/p + SeqSort((n + m)/p) + ParSort(n + m − k)

Set intersection, S ∩ T, can be implemented in terms of set difference as S − (S − T).
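A sequential sketch of the sort-based set algebra described above. The source tags on the elements in the difference are an added bookkeeping detail of ours, so that surviving T-elements can be discarded; the parallel versions interleave the local and global sorts as in the step-by-step descriptions.

```python
def set_difference(S, T):
    """S - T via append, sort, and elimination of duplicate pairs.
    Elements are tagged with their source (0 = S, 1 = T), an assumed
    bookkeeping detail, so leftover T-elements can be dropped."""
    merged = sorted((x, src) for src, seq in ((0, S), (1, T)) for x in seq)
    out, i = [], 0
    while i < len(merged):
        if i + 1 < len(merged) and merged[i][0] == merged[i + 1][0]:
            i += 2                    # x occurs in both S and T: drop both
        else:
            if merged[i][1] == 0:     # keep survivors that came from S
                out.append(merged[i][0])
            i += 1
    return out

def set_union(S, T):
    """S union T via append, sort, and keeping one copy per duplicate."""
    merged = sorted(list(S) + list(T))
    return [x for i, x in enumerate(merged) if i == 0 or merged[i - 1] != x]
```

Intersection then follows from the identity above as S ∩ T = S − (S − T).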

Cartesian product

Given two collections S of type collection⟨α⟩ and T of type collection⟨β⟩, the cartesian product S × T is the collection of all pairs drawn from S and T, of type collection⟨α × β⟩. The BSP implementation works by cyclically rotating the data blocks of the smaller collection, say T, through the processors in successive supersteps, and computing a local cartesian product in each superstep. This brings every element s_i of S together with every element t_j of T on some processor, to form the pair (s_i, t_j), within p − 1 communication supersteps, as shown in figure 4.3.

6 Indeed, ODMG-93 explicitly defines a function listtoset for this purpose.


                 Processor 0    Processor 1    Processor 2    Processor 3
    Superstep 0    S0, T0         S1, T1         S2, T2         S3, T3
    Superstep 1    S0, T1         S1, T2         S2, T3         S3, T0
    Superstep 2    S0, T2         S1, T3         S2, T0         S3, T1
    Superstep 3    S0, T3         S1, T0         S2, T1         S3, T2

Figure 4.3: Cartesian product example with 4 processors

In the figure, the T_i chunks move cyclically from processor to processor between supersteps; after p − 1 communication supersteps, every remote T_i chunk has visited each processor, completing the local computation of the cartesian product. The BSP cost of the local computation is mn/p² and of the communication g · m/p, so the cost is (assuming m ≤ n)

    mn/p² + g · m/p + l

for a single superstep. Thus, the total cost over all supersteps is

    mn/p + (p − 1) · (g · m/p + l)
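The rotation scheme of figure 4.3 can be simulated sequentially as follows (chunk names as in the figure; each pass of the outer loop corresponds to one superstep):

```python
def cartesian(S_blocks, T_blocks):
    """Systolic cartesian product: the T-chunks rotate one processor per
    superstep, so every (s, t) pair meets on exactly one processor."""
    p = len(S_blocks)
    pairs = [[] for _ in range(p)]          # per-processor local output
    for step in range(p):                   # p supersteps, p - 1 rotations
        for i in range(p):
            t = T_blocks[(i + step) % p]    # T-chunk resident at processor i
            pairs[i].extend((s, x) for s in S_blocks[i] for x in t)
    return pairs
```

Since each processor only ever holds its own S-block plus one rotating T-chunk, the memory needed per processor stays at n/p + m/p plus the locally produced pairs.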


Chapter 5

Parallel Query Evaluation

Parallel execution offers a solution to the problem of reducing the response time of object database queries over large collections of objects. Chapter 2 introduced the essence of the ODMG-93 standard, which extends the set-based relational model with collection types, together with OQL, which allows high-level expression of database queries [82]. The BSP collection types and the associated operations developed in the last chapter provide both a practical framework for developing query algorithms and a theoretical framework for determining the cost of queries.

The purpose of this chapter is to bring together the concepts introduced in the previous chapters, in order to evaluate the feasibility of a scalable parallel object database based on the BSP model. We develop a parallel query evaluator on top of the BSP collections model and BSPlib. A query evaluator answers an OQL query by first finding a procedural plan, synthesised from the basic operations, and subsequently executing that plan to produce the query results [19,63,81,83]. The free-monoid framework introduced in section 4.1 is extended in section 5.1 to model OQL queries in terms of monoid comprehensions. This comprehension notation can be regarded as a convenient intermediate form for OQL query expressions, so that producing an efficient query plan is modelled as a sequence of transformations of monoid comprehensions. Our work is based on [53], where the transformation rules are termed the (free-)monoid comprehension calculus. We extend this work by developing parallel algorithms that bring parallelism to the free-monoid calculus, together with the scalability and accurate cost-prediction aspects of the BSP model. We take account of the inherent limits on available parallelism (see section 3.1) due to precedence (or dependency) constraints between operators, data placement constraints, and the constraints imposed by the underlying model.

The BSP cost model, together with the constraints on parallelism (in terms of bulk synchrony), provides us with novel ways to investigate parallel query optimisation, discussed in the final section of this chapter. The bulk of this chapter, however, is concerned with the experimental evaluation of a sample OQL query on a number of practical BSP machines (via the BSPlib environment).

5.1 Mathematical framework for queries

Extending the formalism introduced in the last chapter, queries in this calculus are expressed in terms of monoid comprehensions. Informally, a monoid comprehension over the monoid M takes the form M{ e | Q }, where the expression e is called the head of the comprehension and the term sequence Q = q1, q2, …, qn is called the qualifier. Each term qi is either

  - a generator of the form v ← e', where v is a variable and e' is an expression, or
  - a filter pred, where pred is a predicate formula.

In particular, all qualifiers in a comprehension are eliminated from left to right until no qualifier is left, and all pred terms are pulled to the outermost level. This transformation has the desirable effect of applying the filter early, reducing the size of the collection for subsequent processing. Formally, monoid comprehensions are defined in terms of monoid homomorphisms.

Definition 5.1.1 (Monoid Comprehensions) A monoid comprehension over a collection free monoid M is defined inductively as

    M{ e | }            = unit_M e
    M{ e | x ← u, Q }   = hom_{N→M} (λx. M{ e | Q }) u
    M{ e | pred, Q }    = if pred then M{ e | Q } else zero_M

where e is an expression, and u (in the generator x ← u) is an expression that computes an instance of a collection monoid N for which the homomorphism from N to M is well defined.
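For the list monoid, Definition 5.1.1 can be transcribed almost directly. In this illustrative sketch (all names are ours), qualifiers are tagged tuples, each generator expands into a flat-map over an environment of bound variables, and each filter into a conditional:

```python
def comprehend(head, qualifiers, env=None):
    """Evaluate M{ e | q1, ..., qn } for the list monoid by structural
    recursion on the qualifier sequence, as in Definition 5.1.1."""
    env = dict(env or {})
    if not qualifiers:                       # M{ e | } = unit(e)
        return [head(env)]
    q, rest = qualifiers[0], qualifiers[1:]
    if q[0] == 'gen':                        # generator v <- u: a flat-map
        _, var, source = q
        out = []
        for x in source(env):
            out.extend(comprehend(head, rest, {**env, var: x}))
        return out
    _, pred = q                              # filter: pred, Q
    return comprehend(head, rest, env) if pred(env) else []
```

For example, list{ x * y | x ← [1, 2, 3], x > 1, y ← [10, 20] } evaluates to [20, 40, 30, 60].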

Definition 5.1.2 (Monoid Comprehension Calculus) The monoid calculus [53] consists of the following syntactic forms:

    v                                   variable
    c                                   constant
    e.A                                 projection
    ⟨A1 = e1, A2 = e2, …, An = en⟩      object construction
    e1 θ e2                             comparison, θ ∈ {=, >, …
