Astronomical Data Analysis Software and Systems XI
ASP Conference Series, Vol. 281, 2002
D. A. Bohlender, D. Durand, and T. H. Handley eds.
Query Performance in the SDSS Early Data Release

Aniruddha R. Thakar¹, Peter Z. Kunszt¹,², Alexander S. Szalay¹, Christopher Stoughton³, James Gray⁴, Jan V. vandenBerg¹

Abstract. The Early Data Release (EDR) is the first public data distribution from the Sloan Digital Sky Survey. Formally released in June 2001 and available through the STScI MAST website, the EDR was designed to support several different levels of users, from the general public to astronomers using the data as a primary source for their research. We discuss experiences from the first few months of the EDR in the context of the usability and performance of the available user interfaces and query engines. In particular, we focus on comparisons between the skyServer, an SQL Server (RDBMS) based query engine, and the Catalog Archive Server, an Objectivity (OODBMS) based query engine. The former is geared toward the casual user seeking a few objects at a time with simple SQL queries, while the latter is meant for advanced queries formulated with the SDSS Query Tool (sdssQT). We also briefly describe the interfaces for other data products available with the EDR.
1. Introduction - the SDSS Early Data Release
The Sloan Digital Sky Survey (SDSS, http://www.sdss.org) is a multi-institution project to build a map of a large part of the northern sky in 5 wavelength bands (Szalay 1999). The SDSS Science Archive is the science database that will result from the survey when it is completed, and is expected to be several TB in size and contain an object catalog of more than 200 million objects and 1 million spectra. The SDSS project made its first public release of data in June, 2001. The Early Data Release (EDR), formally announced at the Pasadena AAS Meeting, was released six months ahead of schedule. The EDR contains nearly 600 square degrees of data and constitutes about 5% of the total survey area. It contains several different data products: an object catalog consisting of 14 million photometric (image) objects and 55 thousand spectroscopic objects, raw atlas images, spectra, corrected and reconstructed frames, and binned images and mask images.

¹ Center for Astrophysical Sciences, The Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218-2695, USA
² IT Division Database Group, CERN, CH-1211 GENEVA 23, Switzerland
³ MS 127, Fermi National Accelerator Laboratory, P.O. Box 500 Batavia, IL 60510, USA
⁴ Microsoft Research, 301 Howard St #830, San Francisco CA 94105, USA
The EDR is available on the Internet via the STScI’s MAST (Multi-mission Archive of the Space Telescope) Web pages (http://archive.stsci.edu/mast/sdss). The data products are served through three different interfaces. The skyServer Web page provides access to the object catalog in tabular form via simple SQL queries. The Catalog Archive Server provides access to the catalog via a downloadable Tcl/Tk client that is meant for the power user and offers an advanced query interface. Finally, the Data Archive Server provides FTP and Web form access to the raw data (atlas images, spectra, corrected and reconstructed frames, and binned/mask images). A technical paper describing the EDR and all of its data and software products is currently in press (Stoughton et al. 2002).

Here we discuss the experience from the first few months since the release of the EDR, and focus in particular on comparing the query performance of the skyServer and the Catalog Archive Server. We include results from benchmarks with several test queries submitted to both servers. These are preliminary tests and benchmarks; we expect to report on more extensive tests soon.
2. Overview of skyServer and CAS Architectures
Although the skyServer and the Catalog Archive Server (CAS) are both client/server systems that provide portable access to the same object catalog, they are quite different in their underlying data organization.

The skyServer provides online Web access to an MS Windows-based relational database management system (RDBMS). The data is organized in tables and stored in a commercial RDBMS, Microsoft’s SQL Server. Queries can be submitted via Web forms or directly in SQL, but in general the emphasis with the skyServer is on the casual astronomy user and the general public. The scope of the submitted queries is restricted by limiting the execution time and the length of output generated by the queries. The skyServer has seen steady use in the first 3 months or so since it was launched. There have been more than 3 million hits on the main skyServer Web page, 500 thousand page views, and 30 thousand user sessions.

The CAS is geared toward the “power” astronomy user who wants to pursue serious astronomy research with SDSS data. An advanced query interface is provided by a client/server system built upon a commercial object-oriented DBMS, Objectivity. A portable Tcl/Tk GUI client, the SDSS Query Tool (sdssQT), can be downloaded and used to submit complex, sophisticated queries in a language very similar to SQL. We have taken the basic syntax from SQL and added OO extensions and astro/math macros to produce a query language called SXQL (SDSS eXtended Query Language). The server is optimized for parallel, distributed queries, and is scalable across SMP and cluster architectures. It is a multithreaded C++ application that supports multiple concurrent sessions and multiple queries per session. It uses a fast multi-dimensional indexing scheme, the Hierarchical Triangular Mesh (HTM; Kunszt et al. 2000), to facilitate data mining of large datasets.
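To make the idea of an HTM index concrete, the following is a minimal Python sketch of how a sky position might be mapped to a triangle ID by recursively subdividing an octahedron projected onto the sphere. It is not the SDSS implementation: the function names (`radec_to_xyz`, `htm_id`), the root numbering 8 to 15, and the child ordering are assumptions based on the commonly described form of the scheme, and the numerical tolerance is an arbitrary choice.

```python
import math

# Octahedron vertices that seed the mesh (unit vectors on the sphere).
VERTS = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0)]

# Eight root triangles, numbered 8..15 (assumed convention) so that every ID
# has a leading 1 bit. Corners are counterclockwise seen from outside.
ROOTS = [(8, (1, 5, 2)), (9, (2, 5, 3)), (10, (3, 5, 4)), (11, (4, 5, 1)),
         (12, (1, 0, 4)), (13, (4, 0, 3)), (14, (3, 0, 2)), (15, (2, 0, 1))]


def radec_to_xyz(ra_deg, dec_deg):
    """Convert right ascension and declination (degrees) to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def midpoint(a, b):
    """Midpoint of the great-circle arc a-b, pushed back onto the sphere."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)


def inside(p, tri, eps=1e-12):
    """True if p lies on the inner side of all three edges of tri."""
    a, b, c = tri
    return (dot(cross(a, b), p) >= -eps and
            dot(cross(b, c), p) >= -eps and
            dot(cross(c, a), p) >= -eps)


def htm_id(ra_deg, dec_deg, depth):
    """Return an integer triangle ID for a position at the given depth."""
    p = radec_to_xyz(ra_deg, dec_deg)
    for root_id, corners in ROOTS:
        tri = tuple(VERTS[i] for i in corners)
        if inside(p, tri):
            break
    else:
        raise ValueError("position did not fall in any root triangle")
    hid = root_id
    for _ in range(depth):
        v0, v1, v2 = tri
        w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        # First child containing p; fall back to the central child if
        # rounding leaves p marginally outside all four.
        k = next((i for i, ch in enumerate(children) if inside(p, ch)), 3)
        hid = hid * 4 + k  # each level appends two bits to the ID
        tri = children[k]
    return hid
```

Because each level appends two bits, the ID of a coarse triangle is a prefix of the IDs of all triangles nested inside it, so a spatial region can be expressed as a small set of integer ranges on a single indexed column. That property is what makes such an index attractive in both a relational and an object-oriented store.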
3. Performance Tests
The queries that we used to compare the performance of the skyServer and the CAS are listed in Table 1 below. Each entry gives the question posed by the query and the equivalent SQL/SXQL query. The times required to execute each query on each server are shown in Figure 1. For test queries 1-5, the skyServer outperforms the CAS by a factor of about 2 to 50. For the last query, which requires a join, the performance is comparable.
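The distinction between a join and a nested formulation of the last test query can be illustrated with a small, self-contained Python sketch. It uses an in-memory SQLite database as a stand-in, not SQL Server or Objectivity, and the data are synthetic values invented for the example. The table and column names follow the test query; the helper names (`build_toy_catalog`, `timed`) and the use of a standard SQL `IN` subquery to mimic the nested query are assumptions. The timings it prints say nothing about the benchmark results reported here.

```python
import sqlite3
import time


def build_toy_catalog(n_fields=100, objs_per_field=10):
    """Create a tiny synthetic Field/PhotoObj pair of tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Field (fieldID INTEGER PRIMARY KEY, psfWidth REAL)")
    conn.execute(
        "CREATE TABLE PhotoObj (objID INTEGER PRIMARY KEY, "
        "fieldID INTEGER, modelMag_u REAL)"
    )
    # PSF widths cycle through 0.85, 0.95, 1.05, 1.15, 1.25.
    fields = [(f, 0.85 + 0.1 * (f % 5)) for f in range(n_fields)]
    # Magnitudes run 15.5, 16.5, ... within each field.
    objs = [
        (f * objs_per_field + k, f, 15.5 + k)
        for f in range(n_fields)
        for k in range(objs_per_field)
    ]
    conn.executemany("INSERT INTO Field VALUES (?, ?)", fields)
    conn.executemany("INSERT INTO PhotoObj VALUES (?, ?, ?)", objs)
    conn.commit()
    return conn


# Relational form: an explicit join between the object and field tables.
JOIN_QUERY = """
SELECT p.modelMag_u
FROM PhotoObj p, Field f
WHERE p.fieldID = f.fieldID AND f.psfWidth < 1.1 AND p.modelMag_u < 19
"""

# Nested form: select the qualifying fields first, then their objects.
NESTED_QUERY = """
SELECT modelMag_u
FROM PhotoObj
WHERE modelMag_u < 19
  AND fieldID IN (SELECT fieldID FROM Field WHERE psfWidth < 1.1)
"""


def timed(conn, sql):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


conn = build_toy_catalog()
join_rows, join_seconds = timed(conn, JOIN_QUERY)
nested_rows, nested_seconds = timed(conn, NESTED_QUERY)
print(f"join:   {len(join_rows)} rows in {join_seconds:.6f} s")
print(f"nested: {len(nested_rows)} rows in {nested_seconds:.6f} s")
```

Both formulations return the same rows; what differs between engines is how the result is reached. A relational engine must match rows across two tables, whereas an object store can follow stored associations from each field to its objects.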
Table 1. Test queries for skyServer versus CAS performance comparison.

Qry# 1
  Description: Find all image objects that satisfy a simple constraint on a non-indexed quantity (r-band fiberMag).
  SQL Query: SELECT objID FROM PhotoObj WHERE fiberMag[2] < 20.0

Qry# 2
  Description: Find all galaxies with blue 23 < SB < 25 and within a coordinate cut.
  SQL Query: SELECT objID FROM Galaxies WHERE RA BETWEEN 160 AND 180 AND Dec < 0 AND g + rho BETWEEN 23 AND 25

Qry# 3
  Description: Create a count of galaxies for each of the HTM triangles in a color cut.
  SQL Query: SELECT htmID FROM Galaxies WHERE (0.7*u - 0.5*g - 0.2*i) < 1.25 AND r < 21.75

Qry# 4
  Description: Find galaxies with r-band isophotal SB > 24, ellipticity > 0.5, and major axis of ellipse having dec between 30 and 60.
  SQL Query: SELECT objID FROM Galaxies WHERE r + rho < 24 AND isoA_r BETWEEN 30 AND 60 AND (power(q_r,2) + power(u_r,2)) > 0.25

Qry# 5
  Description: Find all objects with colors of a quasar at 5.5 < z < 6.5.
  SQL Query: SELECT objID FROM PhotoPrimary WHERE ((u - g > 2.0) OR (u > 22.3)) AND (i BETWEEN 0 AND 19) AND (g - r > 1.0) AND ((r - i < 0.08 + 0.42 * (g - r - 0.96)) OR (g - r > 2.26)) AND (i - z < 0.25)

Qry# 6
  Description: Find objects satisfying a field (PSF width) as well as an object (model magnitude) constraint. This query requires an expensive join in MS-SQL but can be written as a fast association (nested) query. The SQL Server version is shown first and the Objectivity version second.
  SQL Query (SQL Server): SELECT p.modelMag_u FROM PhotoObj p, Field f WHERE p.FieldID = f.FieldID AND f.psfWidth < 1.1 AND p.modelMag_u < 19
  SXQL Query (Objectivity): SELECT modelCounts[0] FROM ( SELECT obj FROM field WHERE psfWidth < 1.1 ) WHERE modelCounts[0] < 19

4. Discussion
The superior performance of SQL Server over Objectivity in these tests is due mainly to disk I/O optimization. All of our queries, except for the last one, are in the I/O-bound regime: almost all of the time required to execute the query is spent fetching the data from disk. With its highly optimized I/O performance, extensive indexing and query optimization, the Microsoft SQL Server database in the skyServer has clearly outperformed the Objectivity database in the Catalog Archive Server in our preliminary performance tests. Microsoft has capitalized on the faster disks that are available now and optimized its query engine for them. Objectivity has not enhanced its performance in the last 2 years or so, and does not have a built-in query optimizer as SQL Server does. A large portion of the SQL Server speed is also due to the fact that we have ported our HTM index to it. Finally, this testing was performed on single-node, non-federated databases, whereas our Objectivity server is really optimized for a distributed database (federation).

Figure 1. Execution times for test queries.

However, we have not yet begun to extensively test the query regimes in which we expect Objectivity to have considerable advantages over the relational SQL Server. Already, in the tests reported here, we see that when a table join is required, Objectivity performance is comparable to SQL Server’s performance. As we test both systems with much longer, more intensive queries that contain searches on non-indexed quantities and require extensive table joins, we expect Objectivity performance to be superior to SQL Server’s, especially in the distributed cluster configuration for which our software is designed.

The EDR provides us with a unique laboratory containing two production databases, one relational and one object-oriented, being used by a large community of users day in and day out. Our duality of servers and interfaces also provides us with the option of choosing one over the other if its performance is consistently superior. We are well-positioned to make this switch if needed, although we expect that the best solution will probably entail a combination of the two approaches.

References

Kunszt, P. Z., Szalay, A. S., & Thakar, A. R. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, ed. N. Manset, C. Veillet, & D. Crabtree (San Francisco: ASP), 141
Stoughton, C., et al. 2002, AJ, 123, 485
Szalay, A. S. 1999, Computing in Science & Engineering, Mar/Apr 1999, 54
Szalay, A. S., et al. 2000, Proceedings of the 2000 ACM SIGMOD on Management of Data, 451
Thakar, A. R., Kunszt, P. Z., & Szalay, A. S. 2000, in ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, ed. N. Manset, C. Veillet, & D. Crabtree (San Francisco: ASP), 231