AstroSpark - Towards a Distributed Data Server for Big Data in Astronomy

Mariem Brahem
DAVID lab., Univ. Versailles St Quentin, Paris Saclay University, Versailles, France
[email protected]

Supervised by Stephane Lopes, Laurent Yeh, Karine Zeitouni
DAVID lab., Univ. Versailles St Quentin, Paris Saclay University, Versailles, France
[email protected], [email protected], [email protected]

ABSTRACT
Large amounts of astronomical data are continuously collected. As a result, support for scalable, high-performance query processing over such data has become increasingly necessary. Apache Spark has been widely adopted as a successor of Apache Hadoop MapReduce for analyzing Big Data on distributed frameworks. Despite its rich features, this framework cannot be directly exploited for processing astronomical data. In this work, we present AstroSpark, a distributed data server for astronomical data. AstroSpark extends Spark, a distributed in-memory computing framework, to analyze and query huge volumes of astronomical data. It supports astronomical operations such as cone search, cross-match and histogram. AstroSpark introduces data partitioning and optimization techniques to achieve high-performance query execution.

CCS Concepts
• Information systems → Parallel and distributed DBMSs; Database query processing;

Keywords
Astronomical Survey Data Management; Big Data; Query Processing; Spark Framework

1. INTRODUCTION

In recent years, there has been an accelerating explosion of astronomical data produced by advanced telescopes that can image enormous portions of the sky. For instance, Gaia, an ESA mission [2], faces the challenge of dealing with an end-of-mission volume of one Petabyte. Gaia is set to map our galaxy in three dimensions, and to locate and characterize more than a billion stars. Traditional applications running on a single machine cannot be used to query data at the scale produced by the Gaia mission.

Distributed systems like Spark [9] have become increasingly popular as cluster computing models for processing large amounts of data in many application domains. Spark performs in-memory computing, with the objective of outperforming disk-based frameworks such as Hadoop. However, these distributed frameworks do not provide efficient astronomical query processing capabilities, as they lack data access optimization for such compute-intensive workloads. Inspired by these observations, we propose AstroSpark, a system that extends Spark towards a scalable, low-latency, cost-effective and efficient astronomical query processing framework. The main contributions of this paper are as follows: (1) AstroSpark extends Apache Spark, a distributed in-memory computing engine, to process and analyze astronomical data. (2) AstroSpark supports data partitioning with Healpix [5], a structure for the pixelization of data on the sphere, to speed up query processing. (3) AstroSpark offers an expressive programming interface through a unified query language, ADQL [1], a SQL-like language enriched with geometrical functions. (4) AstroSpark implements a query optimizer and provides a cost-based optimization module to select the best query execution plans.

The rest of the paper is organized as follows. Section 2 introduces the necessary background, and section 3 provides a system architecture overview of AstroSpark. Section 4 gives details about partitioning support, section 5 reports preliminary experiments, section 6 discusses related work, and section 7 concludes the paper.
2. BACKGROUND

2.1 Healpix
In our context, Healpix is used as a linearization technique to transform a multi-dimensional space into a single dimension, called here the index. Healpix (Hierarchical Equal Area isoLatitude Pixelization) is a hierarchical sky partitioning scheme developed at NASA. It splits the sky into 12 base pixels (cells). In the Healpix nested numbering scheme, the base pixels are recursively subdivided over the spherical coordinate system into equal-area pixels. These subspaces are organized as a tree, and the amount of subdivision is given by the NSIDE parameter, which represents the desired resolution (Figure 1). The first benefit of using the Healpix library is that it is adapted to the spherical space and avoids space deformation. The second benefit is that neighbouring points in the multi-dimensional space are likely to be close in the corresponding one-dimensional space.
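To make the linearization step concrete, the following Scala sketch maps an equatorial position (alpha, delta, in degrees) to a nested Healpix index. It is a minimal sketch assuming the healpix.essentials Java library distributed with HEALPix (the HealpixBase, Pointing and Scheme names come from that library, but the exact signatures should be checked against the version in use); the chosen NSIDE is only an example.

import healpix.essentials.{HealpixBase, Pointing, Scheme}

// NSIDE controls the resolution: the sphere is covered by 12 * NSIDE^2 pixels.
val nside = 8L
val healpix = new HealpixBase(nside, Scheme.NESTED)

// Map (alpha, delta) in degrees to the nested pixel ID containing that position.
def toHealpixId(alphaDeg: Double, deltaDeg: Double): Long = {
  val theta = math.toRadians(90.0 - deltaDeg) // colatitude in radians
  val phi   = math.toRadians(alphaDeg)        // longitude in radians
  healpix.ang2pix(new Pointing(theta, phi))
}

val ipix = toHealpixId(266.0, -29.0) // example position near the Galactic centre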
Figure 2: AstroSpark Architecture (input data, data partitioning with the Healpix library and HDFS storage, and a querying system with the ADQL query language, query parser and query optimizer, on top of Spark).
Figure 1: Healpix partition of the sphere (NSIDE = 1, 2, 4, 8) [5].
2.2 Astronomical Data Query Language
The Astronomical Data Query Language (ADQL) [1] is used to query astronomical data. It is a SQL-like language enriched with geometrical functions, which allows users to express astronomical queries in a unified language. ADQL provides a set of geometrical functions such as AREA, BOX, CENTROID, CIRCLE and CONTAINS. For example, CIRCLE expresses a circular region of the sky that corresponds to a cone in space, and XMATCH allows finding objects in different datasets that are spatially coincident.
3. ASTROSPARK PROPOSAL
In a nutshell, AstroSpark relies on data partitioning (as shown in Figure 2) to efficiently process astronomical queries. To this end, we apply a spatially-aware data partitioning: we first use linearization with the Healpix library to transform two-dimensional data points into a single-dimension value represented by a pixel identifier (Healpix ID). With linearization, we can manage numerical range values, which many Spark functions handle natively. Each astronomical object is thus assigned an ID. In order to balance the partition sizes, we employ range partitioning and store the resulting partitions in HDFS. Queries are expressed in the Astronomical Data Query Language (ADQL). The query parser is extended to translate an ADQL query with astronomical functions and predicates into an internal algebraic representation. Then, the query optimizer adds prefiltering operators based on our spatial partitioning, so that a global filtering step prunes out irrelevant partitions. AstroSpark extends the Spark SQL optimizer, Catalyst [3], by integrating logical and physical optimization techniques specific to ADQL execution. AstroSpark focuses on three main basic astronomical operations:

• Cone Search is one of the most frequent queries in the astronomical domain; it returns the set of stars whose positions lie within a circular region of the sky. A cone is defined by a sky position and a radius around that position.

• Cross-Matching queries aim at identifying and comparing astronomical objects belonging to different observations of the same sky region, in order to study the temporal evolution of the sources.
• Histogram queries distribute the dataset into a specified number of groups and summarize astronomical information about each group.

Moreover, these queries can be combined with constraints on other attributes. For instance, a query may select the sources within a certain angular distance from a specified center position (cone search), restricted to a certain magnitude range (or a particular spectral type) and ordered by magnitude, as in the following example:

SELECT *
FROM gaia_catalogue
WHERE 1=CONTAINS(POINT('ICRS', alpha, delta),
                 CIRCLE('ICRS', 266., -29., 0.08))
  AND (magnitude BETWEEN 17 AND 18)
ORDER BY magnitude ASC
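For illustration only, the following Spark/Scala sketch evaluates the same predicate directly on a DataFrame, using the spherical law of cosines for the angular separation. The gaia DataFrame and its column names (alpha, delta, magnitude) are assumptions, and this is not AstroSpark's actual execution strategy, which relies on the partition pruning described in section 4.

import org.apache.spark.sql.functions.{col, lit, udf}

// Angular separation (in degrees) between two sky positions,
// computed with the spherical law of cosines.
val angSep = udf { (a1: Double, d1: Double, a2: Double, d2: Double) =>
  val (ra1, de1, ra2, de2) =
    (math.toRadians(a1), math.toRadians(d1), math.toRadians(a2), math.toRadians(d2))
  math.toDegrees(math.acos(
    math.sin(de1) * math.sin(de2) +
    math.cos(de1) * math.cos(de2) * math.cos(ra1 - ra2)))
}

// Cone search around (266, -29) with a 0.08 degree radius, plus the magnitude constraint.
val result = gaia
  .filter(angSep(col("alpha"), col("delta"), lit(266.0), lit(-29.0)) <= 0.08)
  .filter(col("magnitude").between(17, 18))
  .orderBy(col("magnitude"))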
4. PARTITIONING
Partitioning is a fundamental component for efficient, high-performance processing of astronomical queries. Data partitioning enables query processing in parallel; in addition, it makes it possible to prune out irrelevant partitions, which reduces resource consumption and improves query performance. Spark provides two predefined partitioning techniques: hash partitioning and range partitioning. HashPartitioner is the default partitioner in Spark; it derives the partition index from an element's Java hash code. Since this index is determined quasi-randomly, close objects end up on different machines. The range partitioner splits the data into roughly equal ranges. However, these methods are only applicable when the partition key is one-dimensional.
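As a small illustration (not AstroSpark code; sc is assumed to be an existing SparkContext, and the Long keys stand for Healpix IDs), the two built-in partitioners can be contrasted on a pair RDD:

import org.apache.spark.{HashPartitioner, RangePartitioner}

// Keys play the role of Healpix IDs; values are placeholder records.
val pairs = sc.parallelize(Seq((5L, "a"), (6L, "b"), (7L, "c"), (1000L, "d")))

// Hash partitioning: the partition is derived from the key's hash code,
// so neighbouring IDs (5, 6, 7) may land on different machines.
val hashed = pairs.partitionBy(new HashPartitioner(4))

// Range partitioning: keys are split into sorted ranges,
// so neighbouring IDs tend to end up in the same partition.
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))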
Our solution in AstroSpark uses linearization with Healpix to resolve this issue. It transforms the multi-dimensional information into a one-dimensional space and uses range partitioning to ensure two main requirements:

• Data locality: points that are located close to each other should be in the same partition.

• Load balancing: the partitions should be roughly of the same size.

To achieve the first requirement, a spatial grouping of the data is necessary. Nevertheless, a basic spatial partitioning may lead to imbalanced partitions due to the typical skewness of astronomical data. Therefore, the partitioning should also be adaptive to the data distribution.

In order to facilitate query execution, AstroSpark keeps track of the partition boundaries. For instance, to execute a cone search query, we use the Healpix library to return the indices of all pixels within an angular radius around a given center. AstroSpark then checks the boundaries of each partition to determine whether it overlaps with the cone search cells. AstroSpark performs a first-level filtering by skipping partitions that do not overlap with the query region: only cells that intersect the query cone are scanned, and the matching points within the scanned cells are then returned. This process reduces query processing time and cost.
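This first-level filtering can be sketched as follows. It is a minimal sketch under assumptions: boundaryList holds the (min, max) Healpix ID of each partition, and conePixelRanges stands for the pixel ranges returned by the Healpix library (e.g. its disc query routine) for the cone; neither name comes from AstroSpark's code.

// True if two inclusive ID ranges overlap.
def overlaps(a: (Long, Long), b: (Long, Long)): Boolean =
  a._1 <= b._2 && b._1 <= a._2

// Indices of the partitions that may contain points of the cone;
// all other partitions can be skipped without being scanned.
def prunePartitions(boundaryList: Seq[(Long, Long)],
                    conePixelRanges: Seq[(Long, Long)]): Seq[Int] =
  boundaryList.zipWithIndex.collect {
    case (bounds, idx) if conePixelRanges.exists(overlaps(bounds, _)) => idx
  }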
The number of partitions is an important parameter for the efficiency of data partitioning. It is determined from the number of workers in the cluster and the size of the input file. It should not be too large, since a large number of small files deteriorates the performance of the partitioning algorithm through a disproportionate scheduling overhead; conversely, if the partitions are too large, individual task execution takes too long. The data partitioning and indexing process is shown in Algorithm 1.

Algorithm 1 Data partitioning and indexing
Input: PF: Dataset, n: Desired number of partitions
Output: outputDF: Partitioned dataset, boundaryList
1: dataFrame = Read(PF)
2: /* Index creation using the Healpix library */
3: for row in dataFrame do
4:     ipix = ToHealpix(alpha, delta)
5:     row = row + ipix
6: end for
7: set("spark.sql.shuffle.partitions", n)
8: outputDF = Sort(dataFrame, ipix) /* sorts rows by ipix and creates the partitions */
9: save(outputDF, HDFS)
10: boundaryList = getBoundaries(outputDF) /* partition boundaries creation */
Input files are stored in Parquet format. We have chosen this format to boost AstroSpark performance: Parquet with compression reduces the data storage footprint and allows reading only the records of interest, through the selected columns only. Input files are converted to DataFrames (line 1), the equivalent of a relational table in Spark SQL. The two-dimensional coordinates are mapped to a single-dimensional ID using the Healpix library (lines 3-4); the library spares us from computing the correspondence between the spherical coordinates and the Healpix ID ourselves. A new Healpix column is then added to the input DataFrame (line 5). The Sort function in Spark returns a set of partitions (line 8): the records are partitioned by range into roughly equal ranges, and the number of partitions is specified through the Spark SQL configuration "spark.sql.shuffle.partitions" (line 7). Finally, a sorted list of the range boundaries of each partition is created (line 10).
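To make Algorithm 1 concrete, here is a minimal Spark/Scala sketch of the same pipeline, under stated assumptions: the input path, the column names alpha and delta, the number of partitions and the toPixelId stand-in are illustrative only. AstroSpark delegates the index computation to the Healpix library (the ToHealpix call of Algorithm 1, cf. the sketch in section 2.1); the stand-in below is a crude grid id used only so that the sketch runs end to end.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("AstroSparkPartitioning").getOrCreate()

// Crude stand-in for the Healpix index (NOT the real pixelization).
val nside = 8
def toPixelId(alpha: Double, delta: Double): Long = {
  val grid = 4L * nside
  val i = ((delta + 90.0) / 180.0 * (grid - 1)).toLong
  val j = (alpha / 360.0 * (grid - 1)).toLong
  i * grid + j
}
val pixelUdf = udf(toPixelId _)

val gaia = spark.read.parquet("hdfs:///gaia/catalogue")            // line 1 (illustrative path)
  .withColumn("ipix", pixelUdf(col("alpha"), col("delta")))        // lines 3-5: compute ipix and add the column

spark.conf.set("spark.sql.shuffle.partitions", "64")               // line 7: desired number of partitions
val outputDF = gaia.orderBy(col("ipix"))                           // line 8: global sort = range partitioning by ipix
outputDF.write.mode("overwrite").parquet("hdfs:///gaia/partitioned") // line 9: save the partitions to HDFS

// line 10: per-partition (min, max) ipix boundaries
val boundaryList = outputDF.select("ipix").rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    val ids = rows.map(_.getLong(0)).toArray
    if (ids.isEmpty) Iterator.empty else Iterator((idx, ids.min, ids.max))
  }
  .collect()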
Figure 3: Effect of data size.
5. PRELIMINARY EXPERIMENTS
This section provides a preliminary experimental evaluation of the partitioning and linearization algorithm; future experiments will study how it can be used to efficiently implement astronomical operations.

Experimental Setup. Experiments were performed on a distributed system composed of 8 nodes; each node has 5 GB of memory and 15 GB of virtual hard disk storage, and runs Ubuntu Server 14.04.3 LTS 64-bit with Hadoop 2.7.1 and Spark 2.0.

Datasets. We used the Gaia [2] dataset for our tests. Each record has a sourceID, a two-dimensional coordinate (alpha and delta) and 47 other attributes, including magnitude and metallicity. We took random samples from the dataset with sizes ranging from 5 GB to 38 GB (30 million to 300 million records).

Partitioning and indexing time vs. data size. For this test, we used NSIDE = 8. Figure 3 shows the scalability of partitioning Gaia datasets while varying the input size. For example, a 38 GB file with 300 million records is partitioned in about one hour and a half.

Partitioning and indexing time vs. NSIDE. We have also investigated in Figure 4 the impact of the NSIDE parameter on the partitioning and indexing process, given that NSIDE = 2^ORDER and that the maximum value of ORDER supported by the Healpix library is 29. The data size in this test is fixed to 20 GB. The cost stays roughly constant as the Healpix order increases, and the cost of computing and adding the Healpix identifier to the dataset is negligible. It takes about 40 minutes to partition a 20 GB file. Note that partition construction is a one-shot process, since we chose to store the partitioned files in HDFS and reuse them for future queries.
Figure 4: Effect of NSIDE (NSIDE = 2^ORDER).
6. RELATED WORK
Recent works have addressed the support of spatial data and queries in distributed data servers. SpatialHadoop [4] is an extension to Hadoop that supports spatial data types and operations. It enriches each Hadoop layer with spatial primitives, following a design composed of four layers: language, storage, MapReduce, and operations. In the language layer, SpatialHadoop adds an expressive, high-level SQL-like language called Pigeon for spatial data types and operations. In the storage layer, it adapts traditional spatial index structures (Grid, R-tree and R+-tree) to form a two-level structure of global and local indexes. SpatialHadoop enriches the MapReduce layer with two new components, SpatialFileSplitter and SpatialRecordReader. In the operations layer, it focuses on three basic operations: range query, spatial join, and k nearest neighbors (kNN).

MD-HBase [6] is a scalable multi-dimensional data store for Location Based Services (LBSs) built as an extension of HBase. It supports a multi-dimensional index structure over a range-partitioned key-value store and builds standard index structures such as k-d trees and Quad-trees to support range and kNN queries.

GeoSpark [8] extends the core of Apache Spark to support spatial data types, indexes, and operations. In other words, it extends the resilient distributed datasets (RDDs) concept to support spatial data. GeoSpark provides native support for spatial indexing (R-tree and Quad-tree) and query processing algorithms (range queries, kNN queries, and spatial joins over SRDDs) to analyze spatial data.

Simba [7] is an extension of Spark SQL that supports spatial queries and analytics over big spatial data. Simba builds spatial indexes over RDDs, offers a programming interface to execute spatial queries (range queries, circle range queries, kNN, distance joins, kNN joins), and uses cost-based optimization.

All these systems are designed for the geo-spatial context, which differs from the astronomical context in its data types and operations. They either do not provide a high-level query language adapted to the astronomical context, such as ADQL, or do not exploit astronomical libraries suited to the spherical coordinate system, and they do not support specific operations such as cone search, cross-match and histogram queries.
7. CONCLUSION

This paper describes AstroSpark, a distributed in-memory computing framework based on Spark for processing large-scale astronomical data. AstroSpark supports data partitioning with Healpix, offers an expressive query language with ADQL, and extends the Spark SQL optimizer Catalyst to optimize astronomical query processing (cone search, cross-matching and histogram). Our ongoing work focuses on data partitioning. For future work, we plan to add support for astronomical queries over the existing partitioned files, and we envision exploiting partition pruning and query optimization techniques to execute astronomical queries efficiently. Future experiments will evaluate the data partitioning algorithm while varying other parameters and demonstrate the importance of partitioning for the efficient processing of astronomical queries.
8. REFERENCES
[1] ADQL. http://www.ivoa.net/documents/latest/ADQL.html.
[2] GAIA. http://www.cosmos.esa.int/web/gaia.
[3] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD, pages 1383–1394. ACM, 2015.
[4] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In 2015 IEEE 31st International Conference on Data Engineering, pages 1352–1363. IEEE, 2015.
[5] K. M. Gorski, E. Hivon, A. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.
[6] S. Nishimura, S. Das, D. Agrawal, and A. El Abbadi. MD-HBase: Design and implementation of an elastic data infrastructure for cloud-scale location services. Distributed and Parallel Databases, 31(2):289–319, 2013.
[7] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 ACM SIGMOD, pages 1071–1085, 2016.
[8] J. Yu, J. Wu, and M. Sarwat. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 70. ACM, 2015.
[9] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.