An Array Library for Microsoft SQL Server with Astrophysical Applications

4 downloads 3560 Views 65KB Size Report
Astronomical Data Analysis Software and Systems XXI ... Applications ... exists to extend the database server with custom-written code that enables us to ...
Astronomical Data Analysis Software and Systems XXI ASP Conference Series, Vol. 461 Pascal Ballester, Daniel Egret, and Nuria P. F. Lorente, eds. c 2012 Astronomical Society of the Pacific

An Array Library for Microsoft SQL Server with Astrophysical Applications L´aszl´o Dobos,1 Alexander S. Szalay,2 Jos´e Blakeley,3 Bridget Falck,2 Tam´as Budav´ari,2 and Istv´an Csabai1 1 E¨ otv¨os

Lor´and University, Department of Physics of Complex Systems, P´azm´any P´eter s´et´any 1/A, 1117 Budapest, Hungary 2 The

Johns Hopkins University, Department of Physics and Astronomy, 3800 San Martin Drive, Baltimore, MD 21218, USA 3 Microsoft

Corporation, Redmond, WA 98052, USA

Abstract. Today’s scientific simulations produce output on the 10-100 TB scale. This unprecedented amount of data requires data handling techniques that are beyond what is used for ordinary files. Relational database systems have been successfully used to store and process scientific data, but the new requirements constantly generate new challenges. Moving terabytes of data among servers on a timely basis is a tough problem, even with the newest high-throughput networks. Thus, moving the computations as close to the data as possible and minimizing the client–server overhead are absolutely necessary. At least data subsetting and preprocessing have to be done inside the server process. Out of the box commercial database systems perform very well in scientific applications from the prospective of data storage optimization, data retrieval, and memory management but lack basic functionality like handling scientific data structures or enabling advanced math inside the database server. The most important gap in Microsoft SQL Server is the lack of a native array data type. Fortunately, the technology exists to extend the database server with custom-written code that enables us to address these problems. We present the prototype of a custom-built extension to Microsoft SQL Server that adds array handling functionality to the database system. With our Array Library, fix-sized arrays of all basic numeric data types can be created and manipulated efficiently. Also, the library is designed to be able to be seamlessly integrated with the most common math libraries, such as BLAS, LAPACK, FFTW, etc. With the help of these libraries, complex operations, such as matrix inversions or Fourier transformations, can be done on-the-fly, from SQL code, inside the database server process. We are currently testing the prototype with two different scientific data sets: The Indra cosmological simulation will use it to store particle and density data from N-body simulations, and the Milky Way Laboratory project will use it to store galaxy simulation data.

1. Introduction With the exponential growth of scientific datasets, it is becoming even more important to store the data using database management systems (Szalay & Blakeley 2009). These databases must not just store the data but support the efficient processing of them as well. As most high-performance database systems are designed with commercial ap323

324

Dobos et al.

plications in mind, they usually lack native support for array data types and advanced mathematical functions; both are indispensable in scientific applications. Fortunately, these systems provide the necessary interfaces to extend their functionality by implementing user-defined data types (UDTs) and functions (UDFs), though all these extensions have to be developed for the certain product and are not portable. While Microsoft SQL Server is widely used in the astrophysics community (Szalay et al. 2001; Lemson & The Virgo Consortium 2006), it lacks native support for arrays. To address this issue, we developed a prototype of an array data type in the form of a library of user-defined functions acting on arrays stored as binary objects within the database. The arrays are stored such a way that integration with the most common mathematical libraries can be done seamlessly. 2. Library Details The Array Library supports arrays of all basic numeric data types and single and double precision complex numbers. It offers two storage classes depending on the sizes of the arrays. Arrays are stored in table columns of types varbinary(n) or varbinary(max). For a more detailed description of the library, see Dobos et al. (2011) and the SqlArray website at http://voservices.net/sqlarray. 2.1. Array Storage Classes Conventional relational database servers, such as Microsoft SQL Server, store tables in B-trees. In B-trees, records are stored in pages of 8 kB, on a row by row basis. Consequently, arrays smaller than the page size can be stored in-page. Binary objects, such as arrays larger than this limit, must be stored out of page and their retrieval from the background storage is about one order of magnitude slower than that of in-page data. Because of this, we decided to implement two different storage classes of arrays which are optimized to the peculiarities of the internal storage methods of the server. In-page arrays are termed small arrays and must be 8000 bytes or less. They are limited to six dimensions and stored along with a fix-sized header of 24 bytes. Array items are indexed using 16 bit integers. These arrays are appropriate for multidimensional point data, like outputs from N-body simulations. Small arrays are always read into the memory entirely. Out of page arrays are called max arrays and can have the maximum size of 2 GB. There is no limit on the number of dimensions, and array items are indexed using 32 bit integers. Access to out of page arrays requires traversing another B-tree, which has a significant overhead. Also, the API provides access to large binary data only through data streams (in contrast to simple memory blocks in the case of small arrays), which have their own overhead. On the other hand, the streaming interface supports partial loading of the binary objects which can be leveraged when only small parts of the arrays are needed. Max arrays are appropriate to store grid data. 2.2. Array Manipulation Functions Arrays are stored as binary objects in the server and all array manipulation is done by calling user-defined functions. Functions are written for all basic array manipulation tasks, such as to create empty arrays, access items, extract subarrays, etc. More advanced functions include: recast arrays to arrays of different dimensionality (but with

An Array Library for Microsoft SQL Server

325

the same number of items), permute indices, slice arrays along any dimensions, etc. Another set of functions was written to convert between ordinary tables and arrays. 2.3. Syntax Examples To date, there is no SQL language extension implemented for our library which would simplify the syntax of the array operations. We have plans to implement a pre-parser to the library, which takes queries written in an array-enabled flavour of SQL and converts them to the user-defined function calls. Right now all array manipulation has to be done via function calls. We demonstrate the current syntax below on some simple examples. The functions are strongly typed and are organized under different schemas according to the data type they act on. An array of five floats, a vector of five elements, is created the following way: DECLARE @a VARBINARY (100) = FloatArray . Vector_5 (1.0 , 5.0, 3.0, 4.0, 2.0) To extract an item of this vector the following should be executed. SELECT FloatArray . Item_1 (@a , 3) This returns the fourth (zero indexed) element of the array. The following query is used to convert an array into an ordinary table. DECLARE @a VARBINARY (100) = FloatArray .Parse (’[[1 ,2 ,3] ,[4 ,5 ,6]] ’) SELECT v FROM FloatArray . ToTable (@a)

2.4. Math Library Interfaces The Array Library was designed with efficient LAPACK and FFTW integration in mind. Array items are stored in column major order to match the requirements of math functions written in FORTRAN. The hardware specific optimized native libraries can be called from inside the SQL Server process via standard API calls. Since function parameters need to be marshalled between the managed environment of the user code runtime of SQL Server in which the arrays are implemented and the native environment of the external libraries, all data structures of the Array Library (especially the storage model of the arrays) were designed to minimize the overhead of data marshalling. 2.5. Performance Since the library is implemented as an extension to a closed-source product, its performance is largely determined by the speed of user-defined function calls. In case of small arrays, the time penalty of retrieving a single number stored in an array over retrieving a scalar value stored in a table column is about a factor of two. Of course, this is the worst case scenario, and this overhead becomes less significant with larger arrays. Calling into the native math libraries is very fast. The library functions are called directly from the server process by passing pointers only. This eliminates the need of moving the data through the client–server protocols.

326

Dobos et al.

3. Scientific Use Cases 3.1. The Indra Cosmological Simulations The Indra cosmological simulations (Falck et al. 2011) will be a set of 512 N-body simulation runs in 1 h−1 Gpc-sided boxes, each with 10243 dark matter particles of mass ∼ 1011 M . The high number of sightly different simulations will make it possible to efficiently characterize cosmic variance. Instead of storing 1 billion particles for each simulation row by row, we will use arrays to store particle positions and velocities in chunks. These chunks will be handled by the Array Library and will be distributed along space filling curves to enable fast spatial queries (Lemson et al. 2011). Additionally, complex Fourier modes of the density will be stored as 3-dimensional arrays, and halos will include arrays of their constituent particles. 3.2. The Milky Way Laboratory The Milky Way Laboratory will be based on a database of an N-body and hydrodynamical simulation of our galaxy. This will allow the detailed analysis of dark matter halo formation at several orders of magnitude higher resolution than in case of cosmological simulations. The system will be built on a new generation database server cluster optimized for high throughput I/O and will rely on Microsoft SQL Server and the Array Library to store particle data. 4. Summary We have presented our solution to add array support to Microsoft SQL Server. The Array Library contains functions to manipulate small arrays (for particle position data) and large arrays (for grid data). A few astrophysical applications have been introduced, like the Indra DM simulation database and the upcoming Milky Way Laboratory. The library is available at http://voservices.net/sqlarray. Acknowledgments. This work was supported by the following Hungarian grants: NKTH: Pol´anyi, OTKA-80177 and KCKHA005. The Project is supported by the European Union and co-financed by the European Social Fund (grant agreement no. ´ TAMOP 4.2.1./B-09/1/KMR-2010-0003). References Dobos, L., Szalay, A., Blakeley, J., Budavari, T., Csabai, I., Tomic, D., Milovanovic, M., Tintor, M., & Jovanovic, A. 2011, in EDBT Workshops. arXiv:1110.1729 Falck, B., Budavari, T., Cole, S., Crankshaw, D., Dobos, L., Lemson, G., Neyrinck, M., Szalay, A., & Wang, J. 2011, in American Astronomical Society Meeting Abstracts, vol. 218, 131.04 Lemson, G., Budavari, T., & Szalay, A. S. 2011, in SSDBM, 509 Lemson, G., & The Virgo Consortium 2006, ArXiv e-prints. astro-ph/0608019 Szalay, A., Gray, J., Thakar, A., Kunszt, P. Z., Malik, T., Raddick, J., Stoughton, C., & van den Berg, J. 2001, Public access to a terabyte of astronomical data. URL http://cas. sdss.org/dr6/en/skyserver/paper/ Szalay, A. S., & Blakeley, J. A. 2009, The Fourth Paradigm (Microsoft Research), chap. Gray’s Laws and Database-Centric Computing, 5