Data Streams Organization in Query Executor for Parallel DBMS ¹

Tatyana Y. Lymar, Leonid B. Sokolinsky
Computer Science Department, Chelyabinsk State University, Chelyabinsk, Russia
[email protected], [email protected]

Abstract. The paper describes the mechanism of data exchange between nodes of the query tree in the Omega parallel database system, designed for the MBC-100 multiprocessor system. This mechanism is based on the stream concept and the bracket model. A stream is the representation of a relation at the level of physical algebra. Streams are the arguments and results of physical algebra operations, that is, the input and output data of the query tree nodes. Operations are synchronized on the basis of the consumer/supplier model. The paper describes the synchronization principle designed for the DBMS Omega and concentrates on stream types and the framework for their implementation.

Keywords: parallel DBMS, query execution, data stream, workload management.
1. Introduction

The most significant problems in designing the query execution subsystem of a database management system are synchronization and data exchange between operations. Most DBMSs (System R, Ingres, Informix, etc.) use the iterator model for synchronization. This model represents a query as a tree whose nodes are physical operations and whose edges represent the data flow between operations [1]. The basic idea of the iterator model is to split intermediate relations into granules (a granule is usually one tuple of a relation). Every time an operator needs a new granule, it sends a request to the operator that supplies its input data and waits for the result. If the tree has more than two levels, the operator that received the request needs input data itself and, in turn, requests a granule from a node lower in the hierarchy. Thus the requests advance from the root of the tree until they reach the leaves. The operations are executed in the reverse order, i.e. from the leaves to the root.

For our DBMS Omega, designed for the MBC-100 system [2], we propose a different synchronization model. This decision is prompted by the peculiarities of the Omega hardware architecture [3], which has three levels of hierarchy. The lowest level is the MBC processor module, which consists of two processors: a computational processor and a communication processor. Processor modules are grouped by four into Omega clusters, which are shared-disk systems. Omega clusters are combined into a shared-nothing system. With the iterator model, load imbalance therefore occurs already at the lowest level of the architecture: the computational processor is idle for the whole time that the communication processor sends a request, waits for its execution and receives the result. This made it necessary to design our own synchronization concept that takes into account the specific features of the hardware and software architecture of the Omega system.
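The descending request flow of the iterator model described above can be illustrated with a minimal C sketch. All names here (Operator, scan_next, filter_next, the requests counter) are hypothetical and are not taken from any of the cited systems; a granule is assumed to be a single integer tuple.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator node of a query tree in the iterator model. */
typedef struct Operator Operator;
struct Operator {
    /* Returns 1 and stores the next granule in *out, or 0 when exhausted. */
    int (*next)(Operator *self, int *out);
    Operator *child;   /* supplier one level lower in the tree (NULL for a leaf) */
    const int *data;   /* leaf only: stored tuples */
    size_t len, pos;   /* leaf only: tuple count and scan position */
    int threshold;     /* filter only: keep tuples greater than this value */
    int requests;      /* number of granule requests this operator has received */
};

/* Leaf operator: hands out stored tuples one at a time. */
static int scan_next(Operator *self, int *out) {
    self->requests++;
    if (self->pos >= self->len) return 0;
    *out = self->data[self->pos++];
    return 1;
}

/* Inner operator: every request it receives is passed down to its child,
   and it waits for the child's answer before it can produce anything. */
static int filter_next(Operator *self, int *out) {
    int v;
    assert(self->child != NULL);
    self->requests++;
    while (self->child->next(self->child, &v)) {
        if (v > self->threshold) { *out = v; return 1; }
    }
    return 0;
}
```

In this sketch the consumer at the root drives all execution: nothing is computed in the leaf until a request has travelled down the tree, which is the waiting pattern the paragraph above identifies as the source of idle time.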
The remainder of the paper is organized as follows. Section 2 discusses the principles of operation synchronization in the query executor of the DBMS Omega. Section 3 describes the stream concept. Section 4 contains the conclusion.
¹ This work was supported by the Russian Foundation for Basic Research under Grant 00-07-90077.

2. The Stocks

The stock mechanism was created to synchronize operations in the query executor of the DBMS Omega. It is based on the consumer/supplier model [4]. Operators in this model are executed
before their output data is requested. While executing, the supplier operator places its results into the buffer used for data exchange with the consumer. When needed, the consumer obtains data from the buffer. Thus the descending flow of requests found in the iterator model is eliminated and only the second phase remains, i.e. the execution of the operators from the leaves to the root. As a result, both the computational processor, which processes the data, and the communication processor, which sends and receives the results, are kept busy. Internal buffering makes it possible to balance the load of the processors and thereby speed up query execution.

The nodes of the query tree are executed in parallel as lightweight processes (called threads) [3]. Each time a supplier thread produces a granule, it must execute a call. The time any thread needs to produce one granule is the quantum of system time slicing. The actual scheduling policy is based on a dynamic priority, a value calculated for every thread as a function of the stock filling factor. The scheduler transfers control to the thread with the maximum dynamic priority.

The Stock class is created for data exchange between threads. This class implements an abstract type for the output buffers of threads in the consumer/supplier model. A stock is a generalization of an iterator: an iterator can be represented as a stock of unit length. Instances of the Stock class have a queue structure. Elements, which are byte strings of fixed length, can be placed into the stock and removed from it. Figure 1 shows how processes access a stock.

[Figure 1. Processes access a stock. The diagram shows two threads, the stock, and its tail and head.]
Thread T1 (the supplier) writes its output data element by element to the end of the stock (the "tail" of the queue); thread T2 (the consumer) takes elements one by one from the beginning of the stock (the "head" of the queue) when it needs them.

Information about the stocks is kept in descriptors stored in a static array. A descriptor includes the following fields: the maximum number of elements in the stock, the length of an element, the address of the memory block allocated for the stock, and pointers to the current head and tail of the stock. A stock is created by the stock_new function, which dynamically allocates the required amount of memory and places a new record into the descriptor table.

Access to the contents of a stock is provided by the special functions stock_accept and stock_eject. The stock_accept function advances the pointer to the free space following the tail to the next memory block (if it is free) and returns a pointer to the preceding block, which can then be filled with a granule received from the supplier. Similarly, the stock_eject function advances the head pointer to the element that follows the head, which removes the head element from the stock, and returns a pointer to the previous head. Thus the queue travels in circles around the memory allocated for the stock. If needed, the stock can be cleared by the stock_reset function. The stock_release function frees the memory allocated for the stock and removes the corresponding record from the descriptor table.

Every existing stock is assigned a value called the stock filling factor. The value of the factor is calculated by the stock_factor function.
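The stock functions described above can be sketched in C as a circular queue. Only the function names and the descriptor fields come from the text; the signatures, the table size MAX_STOCKS, the extra count field (used here to tell a full stock from an empty one) and the definition of the filling factor as the occupied fraction are assumptions made for illustration. Synchronization between threads is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_STOCKS 16        /* assumed size of the descriptor table */

typedef struct {
    size_t capacity;         /* maximum number of elements in the stock */
    size_t elem_size;        /* length of an element in bytes */
    unsigned char *mem;      /* memory block of the stock; NULL marks a free descriptor */
    size_t head, tail;       /* index of the head and of the free slot after the tail */
    size_t count;            /* assumption: number of stored elements */
} StockDesc;

static StockDesc stock_table[MAX_STOCKS];   /* static array of descriptors */

/* Creates a stock and returns its index in the table, or -1 on failure. */
int stock_new(size_t capacity, size_t elem_size) {
    assert(capacity > 0 && elem_size > 0);
    for (int id = 0; id < MAX_STOCKS; id++) {
        if (stock_table[id].mem == NULL) {
            unsigned char *mem = malloc(capacity * elem_size);
            if (mem == NULL) return -1;
            stock_table[id] = (StockDesc){ capacity, elem_size, mem, 0, 0, 0 };
            return id;
        }
    }
    return -1;
}

/* Supplier side: reserves the slot after the tail and returns it for filling;
   returns NULL if the stock is full. */
void *stock_accept(int id) {
    StockDesc *s = &stock_table[id];
    if (s->count == s->capacity) return NULL;
    void *slot = s->mem + s->tail * s->elem_size;
    s->tail = (s->tail + 1) % s->capacity;   /* the queue wraps around the block */
    s->count++;
    return slot;
}

/* Consumer side: removes the head element and returns a pointer to it;
   returns NULL if the stock is empty. The slot may be reused by a later
   stock_accept, so the caller should consume the data before that. */
void *stock_eject(int id) {
    StockDesc *s = &stock_table[id];
    if (s->count == 0) return NULL;
    void *elem = s->mem + s->head * s->elem_size;
    s->head = (s->head + 1) % s->capacity;
    s->count--;
    return elem;
}

/* Clears the stock without freeing its memory. */
void stock_reset(int id) {
    StockDesc *s = &stock_table[id];
    s->head = s->tail = s->count = 0;
}

/* Frees the memory and removes the record from the descriptor table. */
void stock_release(int id) {
    free(stock_table[id].mem);
    stock_table[id] = (StockDesc){0};
}

/* Assumed definition: occupied fraction of the stock, between 0 and 1. */
double stock_factor(int id) {
    const StockDesc *s = &stock_table[id];
    return (double)s->count / (double)s->capacity;
}
```

A supplier would call stock_accept and write one granule into the returned slot; a consumer would call stock_eject and read the granule from the returned pointer. A scheduler could then rank threads by a function of stock_factor, although the paper does not specify that function.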
3. The Streams

The choice of the specific data transfer mechanism depends on the physical placement of the exchanging nodes. In the Omega system, three variants of placement of adjacent nodes are possible:
1. The nodes are placed on one processor.
2. The nodes are placed on different processors of one cluster.
3. The nodes are executed in different clusters.

These variants give rise to data transfers of different types. When adjacent nodes of the tree are on the same processor, they are implemented as threads and executed concurrently, so the problem is data exchange between threads. In the second variant, the data is transferred via the Omega-conductor channels. In the third variant, the data transfer is performed by the functions of the Omega-router. In addition, the input data of the query tree is stored on the hard disk, and the query executor has to obtain it from the File Management System.

Most authors treat these variants as separate problems and focus mainly on interprocess exchange [1, 2]. We have decided to combine them into one concept. To unify the data transfer interface we introduce the concept of a stream, which covers all of the above variants: streams provide a universal mechanism of data exchange between a process and a disk, between a process and a channel of a router or a conductor, and between two processes.

A physical stream is a virtual file of FIFO type. It can be opened, closed and reset to its initial state. Records are placed into a stream and taken from it by the FIFO rule (first in, first out).

The described concept of data exchange is implemented in the query executor by the Stream class. It provides basic methods for creating abstract objects of the Stream type, as well as service functions that implement the specific types of streams and access to them.
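The three placement variants can be read as a simple selection rule for the transfer mechanism. The following C sketch is only an illustration of that rule: the NodePlace record, the Transfer enumeration and the choose_transfer function are hypothetical names, and it is assumed that a node's position is fully described by a cluster number and a processor number within the cluster.

```c
#include <assert.h>

/* Hypothetical placement record of a query tree node. */
typedef struct {
    int cluster;     /* number of the Omega cluster */
    int processor;   /* number of the processor module inside the cluster */
} NodePlace;

/* Transfer mechanisms named in the text for the three placement variants. */
typedef enum { VIA_STOCK, VIA_CONDUCTOR, VIA_ROUTER } Transfer;

/* Picks the mechanism for an edge between two adjacent nodes of the tree. */
Transfer choose_transfer(NodePlace supplier, NodePlace consumer) {
    assert(supplier.cluster >= 0 && consumer.cluster >= 0);
    if (supplier.cluster != consumer.cluster)
        return VIA_ROUTER;       /* variant 3: different clusters */
    if (supplier.processor != consumer.processor)
        return VIA_CONDUCTOR;    /* variant 2: same cluster, different processors */
    return VIA_STOCK;            /* variant 1: threads on one processor */
}
```

With the stream concept, the result of such a decision matters only when a stream is created; the operators on both ends of the edge use the same interface in all three cases.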
There are the following types of streams:
- a stream of the File type;
- a stream of the Temporary file type;
- a stream of the Conductor channel type;
- a stream of the Router channel type;
- a stream of the Stock type.

The implementation of a File stream is an open stored file. Opening and closing a File stream does not affect the state of the file; the respective actions are performed on the file iterator, which is created when the stream is opened and destroyed when the stream is closed.

The implementation of a Temporary file stream is a temporary file. It is created and destroyed together with the corresponding stream.

The implementation of a Conductor channel stream is a channel of the Omega-conductor [5]. Creating and opening a Conductor channel stream does not create a channel; a channel is created only for a read or write operation and is destroyed as soon as the operation terminates.

The implementation of a Router channel stream is a channel of the Omega-router [5]. It works in the same way as a Conductor channel stream.

The implementation of a Stock stream is an instance of the Stock class (see Section 2). The stock is created and destroyed together with the corresponding stream.

Information about the existing streams is stored in a static array. The ordinal number of an array element is the identifier of the respective stream. Each instance of the Stream class has the following attributes:
- the stream type;
- an identifier of the virtual file;
- the size of the element in bytes;
- a pointer to the working buffer;
- a pointer to the stream access function.

The basic functions include accessors for the stream attributes. The basic functions that create and remove instances of the Stream class perform the respective actions on the stream descriptor array. Further service functions create streams of each of the types listed above. The interfaces of these functions are unified as far as possible, but the number and meaning of their parameters differ. All other service functions (opening and closing a stream, setting a stream to its initial position, reading from and writing to a stream) do not distinguish stream types; they take only a stream identifier and, where needed, a pointer to a data buffer as parameters. This unification allows a stream to be accessed without regard to its actual underlying mechanism.
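One way to obtain type-independent read and write calls from the attributes listed above is to dispatch through the access function pointer stored in the descriptor. The C sketch below shows this idea under stated assumptions: every identifier (stream_new, stream_read, stream_write, stream_delete, stream_elem_size, StreamOp, MAX_STREAMS) is hypothetical, the in_use flag is bookkeeping added for the example, and one_slot_access is a toy backend that stands in for the real file, channel and stock implementations, which the paper does not detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { STREAM_FILE, STREAM_TEMP_FILE, STREAM_CONDUCTOR,
               STREAM_ROUTER, STREAM_STOCK } StreamType;
typedef enum { STREAM_READ, STREAM_WRITE } StreamOp;

typedef struct Stream Stream;
/* Type-specific access function; returns 0 on success (assumed convention). */
typedef int (*StreamAccessFn)(Stream *s, StreamOp op, void *buf);

struct Stream {
    int in_use;             /* table bookkeeping, not an attribute from the text */
    StreamType type;        /* the stream type */
    int file_id;            /* identifier of the virtual file */
    size_t elem_size;       /* size of the element in bytes */
    void *work_buf;         /* pointer to the working buffer */
    StreamAccessFn access;  /* pointer to the stream access function */
};

#define MAX_STREAMS 32      /* assumed size of the descriptor array */
static Stream stream_table[MAX_STREAMS];

static Stream *stream_lookup(int id) {
    if (id < 0 || id >= MAX_STREAMS || !stream_table[id].in_use) return NULL;
    return &stream_table[id];
}

/* Registers a descriptor; the array index serves as the stream identifier. */
int stream_new(StreamType type, int file_id, size_t elem_size,
               void *work_buf, StreamAccessFn access) {
    assert(access != NULL && elem_size > 0);
    for (int id = 0; id < MAX_STREAMS; id++) {
        if (!stream_table[id].in_use) {
            stream_table[id] = (Stream){ 1, type, file_id, elem_size, work_buf, access };
            return id;
        }
    }
    return -1;
}

void stream_delete(int id) {
    Stream *s = stream_lookup(id);
    if (s != NULL) *s = (Stream){0};
}

/* Attribute accessor; returns 0 for an unknown identifier. */
size_t stream_elem_size(int id) {
    Stream *s = stream_lookup(id);
    return s ? s->elem_size : 0;
}

/* Type-independent calls: only an identifier and a data buffer are passed. */
int stream_read(int id, void *buf) {
    Stream *s = stream_lookup(id);
    return s ? s->access(s, STREAM_READ, buf) : -1;
}

int stream_write(int id, void *buf) {
    Stream *s = stream_lookup(id);
    return s ? s->access(s, STREAM_WRITE, buf) : -1;
}

/* Toy backend for the example: the working buffer holds exactly one element. */
int one_slot_access(Stream *s, StreamOp op, void *buf) {
    if (op == STREAM_WRITE) memcpy(s->work_buf, buf, s->elem_size);
    else                    memcpy(buf, s->work_buf, s->elem_size);
    return 0;
}
```

The caller of stream_read and stream_write never inspects the stream type; supporting another kind of stream in this sketch would only require another access function and a matching creation call.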
4. Conclusion

This paper presents the approach to synchronization and data exchange between operations that is used in the DBMS Omega. We propose a synchronization mechanism based on the consumer/supplier model. This mechanism uses the hardware resources of the Omega system more efficiently than the conventional iterator model. The paper also describes the stream concept designed to implement data exchange between the nodes of the query tree, covering all possible variants of mutual placement of adjacent nodes of the tree. The unification of the data transfer process considerably eases further development of the system as a whole and of the query executor in particular.

This paper is the result of preliminary research. The described method of exchange organization should provide efficient data exchange in the query executor of the DBMS Omega. In future work we plan to support this hypothesis with numerical experiments.
References

1. Graefe G. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 1993, pp. ...-169.
2. Zabrodin A.V., Levin V.K., Korneev V.V. The Massively Parallel Computer System MBC-100. Proceedings of PaCT-95, Lecture Notes in Computer Science, 1995, pp. ...-356.
3. Sokolinsky L.B., Axenov O., Gutova S. Omega: The Highly Parallel Database System Project. Proceedings of the First East-European Symposium on Advances in Database and Information Systems (ADBIS'97), St.-Petersburg, 1997, Vol. 2, pp. 88-90.
4. Sokolinsky L.B. Interprocessor Communication Support in the Omega Parallel Database System. Proceedings of the 1st International Workshop on Computer Science and Information Technologies (CSIT'99), Moscow, 1999, Vol. 2.
5. Sokolinsky L.B. Operating System Support for a Parallel DBMS with an Hierarchical Shared-Nothing Architecture. Proceedings of the Third East-European Conference on Advances in Databases and Information Systems (ADBIS'99), Maribor, Slovenia, 1999, pp. ...-45.