Operating System Support for Parallel DBMS with Hierarchical Shared-Nothing Architecture¹
Leonid B. Sokolinsky
Chelyabinsk State University, Faculty of Mathematics, Br. Kashirinykh 129, Chelyabinsk, 454021, Russian Fed.
[email protected]
Abstract. The paper describes the structure of an operating system developed for the Omega parallel database management system. The Omega system is a prototype of a parallel DBMS designed for a massively parallel multiprocessor system with a hierarchical architecture. The Omega operating system (OOS) includes a thread manager, a message manager and a file manager. A new approach to process management and scheduling is presented. This approach is based on a consumer/supplier model. It increases the efficiency of complex query plan execution by using lightweight processes (threads) and a dataflow-driven model. A general approach to the implementation of the OOS subsystems is described. The described operating system was implemented on the MBC-100 multiprocessor computing system.
1 Introduction
To implement a parallel database management system (DBMS), we need some low-level operating services. These services include buffer pool management, the file system, process management, scheduling, and interprocess communication. There are two ways to provide these services. The first is to use the services provided by an existing general-purpose operating system. However, general-purpose operating system services usually turn out to be either too slow or ill-suited to supporting database management functions [1]. The other is to implement these services as a part of the DBMS. Current DBMSs usually provide their own operating services and make little or no use of those offered by the operating system.
This paper describes the Omega operating system (OOS), created as a part of the Omega parallel database machine [2]. The Omega parallel database machine is a prototype of a parallel DBMS with a hierarchical shared-nothing architecture designed for the MBC-100 multiprocessor computing system [3]. As an operating system, MBC-100 can use either the Helios operating system or the Router operating system.
¹ This work was supported by the Russian Foundation for Basic Research under Grant 97-07-90148.
Helios is a distributed operating system developed by Perihelion Distributed Software. Helios is UNIX source compatible. However, all services provided by Helios turn out to be too slow for DBMS needs. For example, the latent part of message passing initiation amounts to as much as 3000 bytes. This makes Helios of little use for parallel DBMS support.
Router is a toolset-class distributed operating system developed at the Applied Mathematics Institute of the Russian Academy of Sciences. The Router OS provides low-level message passing functions that are more efficient than those offered by Helios (in Router, the latent part of message passing initiation amounts to only 400 bytes). However, the Router OS does not provide such important services as buffer pool management, the file system, process management and scheduling. This motivated us to develop an operating system that can meet the requirements of the Omega parallel DBMS. The OOS is an extension of the Router OS that includes the following subsystems: a thread manager, a message passing subsystem and a file manager. Although the OOS was created specifically for the Omega project, it can be used with a wide variety of applications for MBC-100.
The remainder of this paper is organized as follows. Section 2 gives an overview of the Omega system hardware architecture. Section 3 discusses the principles of the thread manager organization and offers a new model for process management, scheduling and synchronization. In section 4, we describe the framework for operating subsystem development. The message passing system is described in section 5. Our concluding remarks are presented in section 6.
2 Hardware architecture of the Omega system
The Omega hardware architecture has three levels of hierarchy. The first level is the MBC processor module. The MBC processor module includes two processors: a computational processor and a communicational processor. The computational processor is an Intel i860XR at 40 MHz or an i860XP at 50 MHz. An SGS-Thomson T805 transputer acts as the communicational processor, handling I/O and board-to-board communications via four 20 Mbit/s serial links. The MBC processor module has a shared-memory architecture with 16/32 Mbytes of SRAM (static random access memory) coupled to both the computational processor and the communicational processor. The communicational processor has 4/8 Mbytes of private local fast DRAM (dynamic random access memory). The two processors exchange data through shared memory, using interrupts and full bus locking for synchronization.
The second level of the hardware hierarchy is the Ω-cluster [2]. The Ω-cluster is a shared-disk system whose nodes are MBC processor modules (see Fig. 1). Each Ω-cluster consists of four processor modules connected by links. Every processor module is connected to a disk subsystem module (DSM). The DSM has its own
communicational processor with private memory, connected to four disk devices by a SCSI bus. One link of each processor module remains free to connect the given Ω-cluster to others.
[Figure: four processor modules (PM) and a disk subsystem module (DSM) with a SCSI bus]
Fig. 1. Ω-cluster structure (PM – processor module; DSM – disk subsystem module)
On the third (outer) level of the hardware hierarchy, Ω-clusters are combined into an Ω-system in a shared-nothing manner. One Ω-system can be scaled up to several hundred Ω-clusters. There is no restriction on the interconnect topology of the Ω-system. It can vary from a simple rule box to a hypercube.
3 Omega thread manager
The Omega thread manager provides multiple lightweight processes (threads of execution) within a process, which allow the Omega system components to be dynamically activated and executed without excessive overhead. The thread manager uses a consumer/supplier model. This model allows efficient scheduling and provides a mechanism for thread synchronization.
3.1 Consumer/supplier model
The consumer/supplier model was developed to provide efficient scheduling and thread synchronization. According to this model, each process is treated as a root thread. The root thread can generate an arbitrary number of child threads. Each child thread can generate its own set of subordinate threads. Thus all generated threads form a hierarchy.
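The thread hierarchy described above can be pictured as an ordinary tree. The following Python fragment is only an illustrative sketch and not the OOS implementation: the class ThreadNode, its methods spawn and depth, the helper preorder, and the thread names are all hypothetical and were chosen for this example.

```python
class ThreadNode:
    """One thread in the hierarchy; the root node stands for the process itself."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent    # thread that created this one (None for the root)
        self.children = []      # threads created by this thread
        if parent is not None:
            parent.children.append(self)

    def spawn(self, name):
        """Create a child thread subordinate to this one."""
        return ThreadNode(name, parent=self)

    def depth(self):
        """Number of ancestors between this thread and the root."""
        node, levels = self, 0
        while node.parent is not None:
            node = node.parent
            levels += 1
        return levels


def preorder(node):
    """Yield thread names, each parent before its children."""
    yield node.name
    for child in node.children:
        yield from preorder(child)


# Hypothetical hierarchy: a root thread with two children, one of which
# has a subordinate thread of its own.
root = ThreadNode("root")
scan = root.spawn("scan")
join = root.spawn("join")
left = join.spawn("join.left")
```

Keeping a parent reference in every node makes it cheap to find the thread that a given thread reports to, while the children list gives the threads it depends on.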
We assume that each child thread produces data granules and supplies them to its parent thread. To produce a granule, the child thread needs granules from its own child threads. An output buffer ("output storage") is associated with each thread to support the consumption/supply cycle. Different threads can have output buffers of different sizes. The output buffer is organized as a queue. The supplier thread puts the granules it produces into its output buffer, while the consumer thread takes granules from this supplier's buffer (see Fig. 2). Each time a supplier thread produces a granule, it must execute a scheduling operation call. The time a thread needs to produce one granule serves as the quantum of system time slicing. Let τ_i be a number that measures the fullness of the output buffer of the thread T_i. We assume that ∀i: 0 ≤ τ_i ≤ 1. We call this measure the T-factor. If the T-factor τ_i = 0, the output buffer B_i is empty. If the T-factor τ_i = 1, the output buffer B_i is full. Define f_i(t) as a function that calculates the T-factor of the thread T_i at the point of time t. We call f_i(t) the factor-function of the thread T_i.
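To make the notion of the T-factor concrete, the following Python fragment sketches an output buffer as a bounded queue. It is a minimal illustration under stated assumptions, not the OOS code: the class OutputBuffer and its methods are hypothetical names, and the T-factor is assumed here to be the ratio of occupied slots to buffer capacity, which is one simple measure satisfying τ_i = 0 for an empty buffer and τ_i = 1 for a full one. The text above does not fix a particular formula.

```python
from collections import deque


class OutputBuffer:
    """Bounded FIFO queue of granules owned by one supplier thread (illustrative)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._queue = deque()

    def put(self, granule):
        """Supplier side: append a produced granule to the queue."""
        if len(self._queue) >= self.capacity:
            raise OverflowError("output buffer is full")
        self._queue.append(granule)

    def get(self):
        """Consumer side: remove the oldest granule (IndexError if empty)."""
        return self._queue.popleft()

    def t_factor(self):
        """Assumed T-factor: occupied slots divided by capacity, in [0, 1]."""
        return len(self._queue) / self.capacity


# The supplier produces two granules, then the consumer takes one.
buf = OutputBuffer(capacity=4)
buf.put("g1")
buf.put("g2")
after_supply = buf.t_factor()
first = buf.get()
after_consume = buf.t_factor()
```

In this sketch the factor-function f_i(t) corresponds to calling t_factor() at time t: the value rises as the supplier fills the buffer and falls as the consumer drains it.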