data" into memory and to evict "cold data" out of memory. Index TermsâApache Spark, Big Data, Big Data Indexing,. Searching and Querying, Spark SQL, Spark ...
The Fourteenth IEEE International Conference on e-Business Engineering
Indexing for Large Scale Data Querying based on Spark SQL

Yi Cui*, Guoqiang Li*, Hao Cheng†, Daoyuan Wang†
*School of Software Engineering, Shanghai Jiao Tong University, Shanghai, China
{Marshall2013, li.g}@sjtu.edu.cn
†Intel APAC Corporation, Shanghai, China
{hao.cheng, daoyuan.wang}@intel.com

(This work was hosted by the Big Data Technology Group of SSG of Intel APAC R&D Center.)

Abstract— Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It gives programmers a convenient way to leverage the benefits of relational processing, and its underlying distributed RDD execution also accelerates queries on large data sets. However, Spark SQL is not designed for long-running services, and its built-in data sources load data from the storage system, such as HDFS or the local file system, on every table scan without any cache mechanism. Although users can keep data in memory explicitly with the "cache" command, the cached data is coarse grained. In this paper, we present an indexing structure, built on Apache Spark, which is a pluggable component of Spark SQL. Compared with plain Spark SQL it has two additional advantages. First, it allows users to create indexes on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load fine-grained pieces of structured data files into memory, which makes it flexible to load "hot data" into memory and to evict "cold data" out of memory.

Index Terms—Apache Spark, Big Data, Big Data Indexing, Searching and Querying, Spark SQL, Spark as a Service

I. INTRODUCTION

In recent years, big data applications have gained more and more popularity, both in academic communities and in business solutions. Collecting, integrating and analyzing large scale datasets is a major challenge. With the development of big data technologies, many frameworks for processing big data have emerged. Invented in 2004, Google MapReduce[1] was a revolutionary framework for processing large datasets, allowing users to handle terabytes or even petabytes of data conveniently. However, the IO performance of MapReduce is a major bottleneck, because every "map" and "reduce" task writes its output to disk. With the increasing demand for real-time and interactive data processing, this deficiency of MapReduce became apparent, and Apache Spark came into being. Apache Spark[2, 5] is a fast and general engine for large-scale data processing which provides high-level, expressive APIs in Java, Scala, Python and R. To make better use of memory, Spark introduces a new programming model, the RDD[4] (resilient distributed dataset), which differs from the way MapReduce processes data and improves the fault tolerance and scalability of applications. To achieve better performance, Spark keeps intermediate data in memory and optimizes physical execution. In particular, in Spark 2.x the second-generation Tungsten execution engine, with whole-stage code generation and vectorization, significantly improves the performance of the physical execution layer. A series of built-in libraries, including Spark SQL, MLlib, Spark Streaming and GraphX, are incorporated into Spark to provide the generality and scalability needed to build parallel applications. As the module for processing structured data inside Spark programs, Spark SQL[3, 6] is a built-in component that enables programmers to perform relational operations through the DataFrame API. It offers tight integration between relational and procedural processing, thus providing a seamless connection between queries on structured data and advanced analytics such as machine learning. Furthermore, Spark SQL's extensible optimizer, Catalyst, lets developers easily add optimization rules and features to Spark SQL. However, Spark SQL is not designed for long-running services. In general, data is loaded from the external storage system on every table scan, and the cache mechanism is so coarse-grained that raw data is either cached in memory entirely or not at all. Moreover, the in-memory eviction mechanism is also poor, and all data kept in memory will be lost
whenever the application shuts down or restarts. Furthermore, for some business analytics and ad-hoc queries, certain columns of the target table are used frequently. This "hot data" should be cached in memory to reduce the amount of disk IO, while rarely used data, the "cold data", is not expected to be cached. In this paper, a new optimized component for Spark SQL, based on a fine-grained data format and customized index support, is proposed. Building on ideas from state-of-the-art commercial databases and applying them to data processing queries, we put forward customized index support for Spark SQL queries. This new design allows users to create and drop index structures on specific columns that are used across multiple queries, which can significantly accelerate SQL queries. Besides, we split raw data tables into many small pieces called "fibers". In other words, each table may be composed of many fibers organized by column and row group, so we can load and cache fine-grained data in memory, and the eviction mechanism also handles hot and cold data more efficiently. We start this paper with background on Spark SQL in section 2. We then describe the design and implementation of the index component in section 3. We evaluate the system in section 4. Finally, section 5 covers related work, and the conclusion is presented in section 6.
II. BACKGROUND

Spark SQL is a module for working with structured data. Its predecessor, Shark[7], an interactive SQL engine on the Hadoop ecosystem, modified the Apache Hive system to execute queries on external data stored in Hive. However, Shark was tightly coupled with Hive, which restricted the development and performance of relational processing on large data sets. Due to this over-dependence on Hive, new optimization strategies were difficult to add to Shark. Moreover, Shark only supported querying external data, not data inside Spark programs (within RDDs). To fill this gap, Spark SQL was designed to cover a much wider range of data sources and to provide higher performance through automatic optimization rules. The superiority of Spark SQL comes from two aspects. First, the built-in DataFrame API integrates relational processing with procedural processing and covers more varieties of data sources. DataFrames offer higher query performance than plain RDDs, both on external data sources and on data inside Spark programs, and they unify relational processing with advanced analytics, including machine learning, graph processing and structured streaming. Second, Spark SQL incorporates an extensible optimizer called Catalyst. It provides high performance using established DBMS optimization techniques and enables programmers to add optimization rules and external data sources easily. However, for some cutting-edge business services, the speed of a full table scan in Spark SQL is still an I/O bottleneck, especially as datasets keep growing. Moreover, the cache mechanism of Spark SQL is so coarse-grained that whole tables are kept in memory after each cache, which is inappropriate for ad-hoc queries. Once the application is shut down or restarted, the table data cached in memory is lost. Furthermore, for ad-hoc queries, caching the whole table is a waste of space, since such queries may only involve a few columns or a few rows. Admittedly, in-memory computing makes data processing up to 100x faster than Hadoop MapReduce, but for I/O intensive applications and ad-hoc queries, the full table scan of Spark SQL is still a bottleneck. On top of Spark SQL, we therefore construct an index layer to speed up Spark SQL queries. This index layer lets users build customized indexes on certain column(s) for ad-hoc queries, to further accelerate queries on large scale data sets.

III. DESIGN AND IMPLEMENTATION

A. Design of Data Format

To ensure better cache efficiency and to increase space utilization, we split a complete table into many row groups which are stored in a data (table) file. Each row group is divided into several columns, and each column within a single row group is called a fiber. A fiber is the fundamental unit for caching and loading. Currently, each index file is cached as an index fiber, and each index fiber corresponds to a data file which contains multiple data fibers. Data files are distributed among all nodes in a cluster, so a create-index operation is parallelized among the worker nodes. In each worker node, we cache fibers in off-heap memory, because only in this way is there no GC overhead. Caching fibers in off-heap memory also avoids the Out-Of-Memory errors that are a common challenge for big data applications. Parquet[9], a columnar storage format, is known for its efficient compression and encoding schemes and is built for efficient columnar data representation; it is useful to almost all data processing frameworks. Although Parquet's columnar storage is naturally efficient for vectorization, it has three drawbacks. First, it is inefficient at accessing a row entry randomly. Second, Parquet does not support indexes customized to users' needs. Third, Parquet requires data copies in on-heap memory. Therefore, we present a new columnar storage format called "Spinach". The whole table is divided into several row groups (the row group size can be configured by the user), and within each row group, row entries are divided by column. Each column in a row group is a data fiber, and all fibers of a table are organized sequentially in a data file. We keep each row group's metadata, including the row count of the group, the group id, and the offset of the row group in the data file, in the file header. By means of this metadata, a read operation can locate each data fiber efficiently. All data fibers involved in a query are loaded from disk on first access and cached in off-heap memory under an LRU cache eviction policy. We assume that fibers involved in an ad-hoc query will be queried frequently and reused later, so caching these fibers in memory yields much better query performance.
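To make the fiber abstraction above concrete, the following Scala sketch models the row group metadata and a simple LRU fiber cache. It is only an illustration under stated assumptions, not the project's actual code: the names RowGroupMeta, FiberKey and FiberCache are invented, and an on-heap byte array stands in for the off-heap buffers described in the text.

    import java.util.{LinkedHashMap => JLinkedHashMap}
    import java.util.Map.Entry

    // Per-row-group metadata kept in the data file header.
    case class RowGroupMeta(groupId: Int, rowCount: Int, offset: Long)

    // A fiber is identified by (data file, column, row group).
    case class FiberKey(file: String, columnIndex: Int, groupId: Int)

    // Simplified LRU cache for loaded fibers; the paper caches fiber bytes
    // off-heap, here an on-heap Array[Byte] stands in for the off-heap buffer.
    class FiberCache(maxEntries: Int) {
      private val cache =
        new JLinkedHashMap[FiberKey, Array[Byte]](16, 0.75f, true) {
          override def removeEldestEntry(e: Entry[FiberKey, Array[Byte]]): Boolean =
            size() > maxEntries // evict the coldest fiber first
        }
      def getOrLoad(key: FiberKey, load: FiberKey => Array[Byte]): Array[Byte] =
        synchronized {
          Option(cache.get(key)).getOrElse {
            val bytes = load(key); cache.put(key, bytes); bytes
          }
        }
    }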
Furthermore, we extended the Spark SQL DDL (data definition language) to support creating and dropping an index on specific columns. Programmers can use the following statements to create and drop an index:

create sindex idx1 on tableName(columnName)
drop sindex idx1 on tableName(columnName)

We keep, as metadata in a meta file, the class paths used by the class loader to load the classes that parse the "spinach" file format, together with some statistics, including the partitioned file names, the index file names and the number of partitioned files. The relation between the three kinds of files is shown in Fig 1.

Fig 1. The relation between all kinds of files
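As a usage illustration only, the extended DDL could be driven from a Spark program roughly as follows. This assumes the Spinach data source and the DDL extension are available in the running Spark build; the format name "spinach", the paths and the table contents below are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spinach-index-demo").getOrCreate()

    // Write a table in the "spinach" format (format name assumed from the paper).
    spark.range(0, 1000000).selectExpr("id", "id % 100 as category")
      .write.format("spinach").save("/tmp/spinach/t1")

    // Register it and build/drop a B+ tree index via the extended DDL.
    spark.read.format("spinach").load("/tmp/spinach/t1").createOrReplaceTempView("t1")
    spark.sql("create sindex idx1 on t1(category)")
    spark.sql("select * from t1 where category = 42").show()
    spark.sql("drop sindex idx1 on t1(category)")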
B. B+ Tree Index Design

Spark SQL allows users to query and process structured data that resides in a DataFrame. In general, the data of a DataFrame is distributed among the worker nodes of a cluster, and users can execute queries against it using standard SQL-like syntax. Queries on a DataFrame are therefore delivered to the nodes where the DataFrame resides and executed concurrently, and the query result is eventually sent from the executors back to the driver. DataFrames can also be registered as table views, which makes the syntax even closer to SQL. However, when a SQL query on a DataFrame is triggered, it actually performs a full table scan, or skips parts of the scan using some statistics. Obviously, this traditional Spark SQL approach is not suited to random access. Following state-of-the-art commercial databases such as MySQL and Oracle, we design a B+ tree index structure similar to those of relational databases. When an index creation statement is triggered, B+ tree index files are created concurrently on all worker nodes where the partitioned data files reside; in other words, the number of B+ tree files equals the number of partitioned data files. In each index file we currently store the row keys and the tree structure sequentially, as shown in Fig 2. In the data segment we only store clusters of row ids (global row ids within each data file/table), because keeping whole row entries would make the index file so large that it would itself become an I/O bottleneck. Because indexed keys may not be unique, a key may correspond to multiple row entries, so we maintain a mapping from each row key to its row id(s), which are stored as longs. After one table scan, all row keys along with their row ids are held in a hash table. Each key corresponds to a row id cluster that may contain several row ids, and all row id clusters are written into the data segment sequentially using the hash table. Meanwhile, we maintain another hash table, prepared for the tree node segment, that holds each key and the start offset(s) of its row ids in the index file. The key segment is followed by the tree node segment. In the tree node segment, to represent the B+ tree structure, we use a method similar to a post-order traversal of the tree. The reason for using post-order is that a parent node does not know its children's offsets until they have been written into the index file. When we encounter a tree node, we first write its children in order from left to right, recursively, and then write the node itself together with its children's offsets. At the beginning of index building, the row keys are sorted in ascending/descending order along with their row ids in memory. Keeping the index keys in order facilitates efficient sequential access to row entries during a B+ tree range search. We use data structures such as a hash map to track the relation between indexed keys and the offsets of their row ids, and a sorted sequence to hold the sorted keys. Based on the number of distinct row keys, a B+ tree model is used to determine the overall shape of the index tree, including the number of layers and the number of nodes in each layer. After the model is built, the row keys in ascending/descending order, along with the start offsets of their row ids, are filled into the tree nodes sequentially according to the in-memory hash map. A tree node structure in an index file contains the following parts. First, we record the number of keys belonging to this node as a long. Second, we write the position offset of the node that follows the current node in the B+ tree. Then the position offsets of the children are written, followed by the corresponding positions of the row ids for each key, taken from the previous hash map. Finally, all keys belonging to this node are saved. Following the post-order traversal, the last node to be recorded is the root. After the data segment and the tree node segment are written, the data end (the end of the data segment) and the root offset are kept in the index file footer.

Fig 2. Structure of Index File
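The post-order layout can be sketched as follows. This Scala fragment is an assumption-based illustration rather than the actual index writer: TreeNode and IndexWriter are invented names, the data segment is not shown, and the "next node" pointer described above is omitted for brevity.

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    // keys: sorted keys held by this node; rowIdOffsets: start offsets of the
    // corresponding row id clusters in the data segment.
    case class TreeNode(keys: Seq[Long], rowIdOffsets: Seq[Long], children: Seq[TreeNode])

    class IndexWriter {
      private val buf = new ByteArrayOutputStream()
      private val out = new DataOutputStream(buf)

      // Returns the offset at which the given node was written.
      def writeNode(node: TreeNode): Long = {
        val childOffsets = node.children.map(writeNode) // children first (post-order)
        val offset = out.size().toLong                  // current position = this node's offset
        out.writeLong(node.keys.length.toLong)          // number of keys, as a long
        childOffsets.foreach(out.writeLong)             // children's offsets, now known
        node.rowIdOffsets.foreach(out.writeLong)        // positions of row id clusters
        node.keys.foreach(out.writeLong)                // the keys themselves
        offset
      }

      // The last node written is the root; its offset goes into the footer.
      def finish(root: TreeNode): (Array[Byte], Long) = {
        val rootOffset = writeNode(root)
        out.flush()
        (buf.toByteArray, rootOffset)
      }
    }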
C. Architecture

This index structure module is designed on top of Spark SQL to speed up query execution. It has three layers: a data layer, a control layer, and an index selection layer. The data layer contains the data fields of the B+ tree structure, the representation of range intervals, and some classes for locating the current position in the B+ tree. The control layer contains a range scanner that traverses the B+ tree to find the target key(s). On top of these two layers is the index selection layer. Its function is twofold: to check whether an index can be used to speed up the query at all, and to select the best index among all available indexes. Index selection is necessary because in some cases a query cannot use an index even though one exists on the table the SQL statement refers to. Furthermore, when two or more indexes are available for a query, rules are needed to select the index that maximizes query efficiency. The index selection layer also generates and executes the strategies for choosing the best index.
The three layers handle a query in coordination with each other. The index selection layer selects an appropriate index for the query statement; the control layer then uses a scanner to search the corresponding B+ tree for the qualifying keys. Finally, given those row keys, the resulting row id(s) stored in the data layer are obtained directly through the offsets of the row id(s).
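A minimal Scala sketch of this three-layer coordination is given below. The trait names (IndexSelector, RangeScanner, DataLayer) and the types are assumptions made for exposition; they are not the module's real interfaces.

    case class Interval(start: Long, end: Long)          // closed range of keys
    case class IndexMeta(name: String, columns: Seq[String])

    trait IndexSelector {                                 // index selection layer
      def select(predicates: Map[String, Seq[Interval]],
                 available: Seq[IndexMeta]): Option[IndexMeta]
    }

    trait RangeScanner {                                  // control layer
      def scan(index: IndexMeta, intervals: Seq[Interval]): Iterator[Long] // row ids
    }

    trait DataLayer {                                     // data layer
      def readRows(rowIds: Iterator[Long]): Iterator[Array[Any]]
    }

    // Query flow: select an index, scan the B+ tree for qualifying row ids,
    // then fetch the rows; fall back to a full scan when no index applies.
    def execute(preds: Map[String, Seq[Interval]], indexes: Seq[IndexMeta],
                selector: IndexSelector, scanner: RangeScanner, data: DataLayer,
                fullScan: () => Iterator[Array[Any]]): Iterator[Array[Any]] =
      selector.select(preds, indexes) match {
        case Some(idx) =>
          val intervals = preds.getOrElse(idx.columns.head, Seq.empty)
          data.readRows(scanner.scan(idx, intervals))
        case None => fullScan()
      }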
D. Supporting For Query Functionalities

As a pluggable module for Spark SQL, this index structure enables users to retrieve the required data faster. We extended Spark SQL's "FileFormat" and "DataSourceRegister" to implement our own file format, and modified "FileSourceStrategy" to embed it. Our format provides read and write methods for our own data format, which enable a series of functionalities to speed up queries. On the one hand, if an index exists on the table being queried, detecting and using it is automatic. Our index structure currently handles 7 types of filter predicate: "EqualTo", "LessThan", "LessThanOrEqual", "GreaterThan", "GreaterThanOrEqual", "And", and "Or". On the other hand, if the table has no index, or the query conditions cannot benefit from one, the original Spark SQL approach, a full table scan, is used. Currently, we use a scanner object to traverse all intervals of keys that match the query conditions. Generally speaking, the scanner is just an iterator over all qualifying row ids. Due to the nature of a B+ tree, all candidate keys are held in the leaf nodes, and each leaf node has a pointer to its next leaf node. The scanner therefore only needs to maintain the start and end keys, which determine where to start and where to terminate the traversal of tree nodes, as the sketch below illustrates.
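The scanner idea can be illustrated with a short Scala sketch. LeafNode and its fields are invented names, keys are assumed to be sorted in ascending order, and the real scanner works over the serialized index file rather than in-memory objects.

    case class LeafNode(keys: Seq[Long], rowIds: Seq[Seq[Long]], next: Option[LeafNode])

    // Iterate the row ids of keys in [startKey, endKey], walking the leaf chain.
    def scanRange(firstLeaf: LeafNode, startKey: Long, endKey: Long): Iterator[Long] = {
      // Walk the chained leaves; stop once a leaf starts beyond endKey.
      val leaves = Iterator.iterate(Option(firstLeaf))(_.flatMap(_.next))
        .takeWhile(_.exists(_.keys.headOption.exists(_ <= endKey)))
        .map(_.get)
      for {
        leaf     <- leaves
        (k, ids) <- leaf.keys.zip(leaf.rowIds).iterator
        if k >= startKey && k <= endKey
        id       <- ids
      } yield id
    }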
We also implemented a filter predicate optimizer to push down query filter predicates. Query conditions may be complicated mixtures of "AND" and "OR" conditions. After processing by the Spark SQL Catalyst optimizer, a series of filters is returned; these are effectively one or more binary trees whose internal nodes are "AND" or "OR" predicates. The goal of the optimizer is to simplify the filter predicates of the query conditions into their most compact form. The filter optimizer first maps each filter predicate to an attribute-interval pair stored in a hash map, and then compacts and combines the intervals of each attribute. After this optimization, all filter predicates are transformed into a hash map from the attributes involved in the query statement to their corresponding range intervals. The binary tree of query conditions has atomic, inseparable predicates in its leaf nodes, so we use a post-order traversal to process the leaf nodes first and then the parent nodes. For an "AND" node, we intersect the intervals of its left and right child for the same attribute. The time complexity of this intersection is O(m + n), where m and n are the numbers of intervals in the left and right child nodes, respectively. Computing the intersection of two interval sets is of great significance for query optimization: after the intersection, the resulting interval set is smaller, or even empty, which significantly reduces the number of intervals, and hence keys, that need to be scanned. For an "OR" node, we merge the intervals of its left and right child. The time complexity of this union is likewise O(m + n), with m and n again the numbers of intervals in the left and right child nodes. The result is a set of non-overlapping intervals ordered from small to large, and the union guarantees that no key satisfying the query conditions is repeated or missed. Eventually, the query conditions reduce to a series of non-overlapping intervals. After optimization by the filter optimizer, a SQL statement is transformed into the most compact set of intervals over the attributes in the query. When the query is executed, these intervals are used to locate the corresponding leaf nodes in the B+ tree, and the scanner then serves as an iterator over the corresponding row ids in order. This optimizer naturally supports multi-range search, which allows users to write SQL statements nested with many layers of "AND" and "OR" predicates. Through the optimization of "AND" and "OR" conditions, predicate pushdown is achieved, significantly reducing the amount of data that has to be scanned and thereby improving the efficiency of B+ tree index search (an interval-merge sketch is given below). Moreover, we support multi-column composite indexes, which let users build an index on multiple columns and execute ad-hoc queries efficiently. A multi-column composite index uses left-most prefix matching to check whether the attributes in the hash map produced by the filter predicate optimizer are covered by the composite index. If a composite index is available for an ad-hoc query, the query might only use the first few columns of the composite index; in that case we build a key schema for the matched prefix of columns. The key schema is used to compare the keys stored in the B+ tree with the target key. Because a schema is lightweight, comparing keys via a schema is much more efficient than constructing new internal rows. As a result, the number of columns in the key schema is dynamic for each input query, while the row keys in the B+ tree remain static, and it is unnecessary to create temporary row keys.
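The interval combination for "AND" and "OR" nodes can be sketched as follows, assuming both inputs are sorted lists of non-overlapping intervals so that a merge-style pass runs in O(m + n). The Interval class (redeclared here so the sketch stands alone) and the helper functions are illustrative, not the module's actual code.

    case class Interval(start: Long, end: Long)        // closed interval [start, end]

    // "AND" node: intersect two sorted, non-overlapping interval lists.
    def intersect(a: List[Interval], b: List[Interval]): List[Interval] = (a, b) match {
      case (x :: xs, y :: ys) =>
        val lo = math.max(x.start, y.start)
        val hi = math.min(x.end, y.end)
        val rest = if (x.end < y.end) intersect(xs, b) else intersect(a, ys)
        if (lo <= hi) Interval(lo, hi) :: rest else rest // keep the overlap, if any
      case _ => Nil
    }

    // "OR" node: merge two sorted lists, then coalesce overlapping intervals.
    def union(a: List[Interval], b: List[Interval]): List[Interval] = {
      def merge(x: List[Interval], y: List[Interval]): List[Interval] = (x, y) match {
        case (Nil, _) => y
        case (_, Nil) => x
        case (xh :: xt, yh :: yt) =>
          if (xh.start <= yh.start) xh :: merge(xt, y) else yh :: merge(x, yt)
      }
      merge(a, b).foldRight(List.empty[Interval]) {
        case (cur, next :: tail) if next.start <= cur.end =>
          Interval(cur.start, math.max(cur.end, next.end)) :: tail // coalesce
        case (cur, acc) => cur :: acc
      }
    }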
E. Index Selection

We implemented an index selector that chooses the most efficient index from all available candidates according to a set of rules. First, we scan the index metadata, which contains the information of all indexes on the current table, to check whether each index matches the attribute map of the query. For a single-column index, the index is available if its attribute matches an attribute in the query hash map. For a multi-column index, the index columns are scanned from left to right to check whether they match the attributes in the query hash map using left-most prefix matching. After obtaining the available indexes, the next step is to select the best index for the query. Since an index should be utilized as fully as possible, additional rules are needed to pick the most efficient one. To improve index utilization and reduce the amount of data that has to be scanned, we use the following criteria. We introduce three indicators: the number of attributes in the SQL query statement, the total number of attributes in the B+ tree index entries, and the number of attributes matched with the index. The best match is assumed to be the one with the highest proportion of matched attributes relative to both the query statement and the index entries. Consequently, we compute the sum of these two ratios for every available index and choose the index with the highest sum as the best index. In other words, the index whose entries the query statement matches in the highest proportion is treated as the best index. The overhead of this selection is negligible. In future work, we plan to implement cost-based rules to select among multiple indexes.
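The two-ratio selection rule can be written down as a small Scala sketch. IndexMeta (redeclared from the architecture sketch above) and the helper names are assumptions; the real selector works against Spark SQL metadata rather than plain strings.

    case class IndexMeta(name: String, columns: Seq[String])

    // Left-most prefix matching: count how many leading index columns appear
    // among the query's attributes.
    def matchedPrefix(index: IndexMeta, queryAttrs: Set[String]): Int =
      index.columns.takeWhile(queryAttrs.contains).length

    // Score = matched/queryAttrs + matched/indexColumns; pick the highest score.
    def selectBest(indexes: Seq[IndexMeta], queryAttrs: Set[String]): Option[IndexMeta] = {
      val scored = for {
        idx <- indexes
        m = matchedPrefix(idx, queryAttrs)
        if m > 0                                   // index unusable if nothing matches
      } yield {
        val score = m.toDouble / queryAttrs.size + m.toDouble / idx.columns.size
        (idx, score)
      }
      if (scored.isEmpty) None else Some(scored.maxBy(_._2)._1)
    }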
IV. EVALUATION

The B+ tree index structure of this ad-hoc query engine mainly focuses on improving the performance of ad-hoc queries. We evaluate the module within Spark SQL along two dimensions: query processing performance and query functionality. Comparisons are presented to demonstrate that this extensible, pluggable module substantially improves query performance and flexibly caches hot data in off-heap memory without any GC overhead. Spark version 2.0.2 is chosen as the base platform. We compare Parquet, the ad-hoc query engine Spinach, and Spark SQL along the dimensions shown in Table 1. Table 1 reflects the differences among the three types of data processing architecture. It can be seen from the table that our ad-hoc query engine, Spinach, combines the advantages of both Parquet and the Spark SQL in-memory cache, and provides rich support for query optimization, enabling users to build customized indexes to accelerate queries. In addition, its fine-grained cache mechanism in off-heap memory also demonstrates its superiority over Parquet and Spark SQL.

Table 1. Comparisons about some dimensions

A. Benchmarks

We compared ad-hoc query performance with and without an index. 150 GB of uncompressed data in Parquet format was queried on a Spark cluster containing 1 master and 3 workers. Each node has 256 GB of memory and 1 TB of disk capacity, with a 10 GB network connection. We tested and compared the performance of processing the 150 GB of data under 4 configurations: Parquet with and without caching, Spark SQL in-memory, and our ad-hoc query engine. The results are shown in Fig 3. Both a full table scan and a range query were tested. We bypassed the index to execute the full table scan; those results are shown in the left histogram of Fig 3. We then executed ad-hoc queries performing a range search; those results are shown in the right histogram of Fig 3.

Fig 3. The benchmarks

According to the full-table-scan chart, Parquet data without caching shows the worst performance, about 45.25s. Without caching, data is loaded into memory from disk on every read, which incurs a huge overhead. Our ad-hoc query engine took about 3.45s, which is acceptable for large scale data processing in industrial production, thanks to its fine-grained data loading and eviction mechanism. For the range search, the ad-hoc query engine also demonstrates high performance: it took about 1.27s to execute the query, mainly because it can randomly access the required row entries and reduce the amount of data scanned using the B+ tree index. The chart also shows that the other architectures and strategies took somewhat longer to execute the range search than the full table scan, because they have to traverse every row entry and filter out the ones that do not meet the query conditions.

V. RELATED WORK

Query optimization really matters for query performance[15]; optimizers can prominently reduce the frequency of full table scans, and for OLAP applications fewer full table scans significantly reduce disk reads. Unlike Spark SQL, several state-of-the-art commercial databases, including SQLite[12], MySQL and Oracle, support many different types of index that enable random block reads to retrieve the required row entries. Among the most commonly used indexes, we investigated a very mature type, the B+ tree[14], which is frequently used in relational databases. Our index component for Spark SQL extends the B+ tree index and supports single-column search, range search and multi-column search. Furthermore, although the B+ tree indexes of many current databases do not support the "OR" predicate in SQL query statements, we support "OR" as long as it connects predicates over one single attribute in its subtrees.
We also investigated the memory management schemes of other big data systems, such as Apache Flink[13] and the Alluxio/Tachyon project[8] (Tachyon is the predecessor of Alluxio). Currently, our data format supports a fine-grained memory cache in off-heap memory, which frees the cached data from JVM control and thus avoids Java GC overhead.

VI. CONCLUSION

In this paper, a new design for ad-hoc queries based on Spark SQL is presented, covering its background, structure, functionality, analysis and evaluation. This ad-hoc query engine extends Spark SQL as a pluggable module supporting customized B+ tree indexes, and provides unified data source interfaces for Spark SQL users to process structured data. It is designed to improve query execution performance and reduce application latency. The B+ tree index provides rich support for a range of functionalities, including index selection, single-column search, multi-column search and multi-range search. Our filter predicate optimizer also shows its power in optimizing complex query conditions. Finally, the evaluation shows that our ad-hoc query engine combines the advantages of Parquet and Spark SQL in-memory caching with respect to data caching and locality; moreover, it supports predicate pushdown to enhance query efficiency. Benchmarks show that the system delivers higher performance on large scale data than original Spark SQL and Parquet, which makes it more suitable for big data applications.

VII. ACKNOWLEDGEMENTS

We would like to give special thanks to the Big Data Technology Team of SSG in Intel and the other contributors to this project so far. This project is hosted by the SSG&BDT Department of the Intel APAC R&D Center. We would also like to thank other team members in Intel for early discussions on the design of the index structure and data format organization. This work is supported by the National Natural Science Foundation of China under Grant No. 61732013.

REFERENCES

[1] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI'04). USENIX Association, Berkeley, CA, USA.
[2] Apache Spark. http://spark.apache.org
[3] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383-1394.
[4] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA.
[5] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, Berkeley, CA, USA.
[6] Apache Spark SQL. http://spark.apache.org/sql
[7] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2013. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 13-24.
[8] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 6, 15 pages.
[9] Apache Parquet. http://parquet.apache.org
[10] Scala programming language. http://www.scala-lang.org
[11] Amélie Gheerbrant, Leonid Libkin, and Cristina Sirangelo. 2014. Naïve Evaluation of Queries over Incomplete Databases. ACM Trans. Database Syst. 39, 4, Article 31 (December 2014), 42 pages.
[12] SQLite version 3 overview. http://sqlite.org/version3.html
[13] Apache Flink. http://flink.apache.org/
[14] Douglas Comer. 1979. The Ubiquitous B-Tree. ACM Comput. Surv. 11, 2 (June 1979), 121-137.
[15] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proc. VLDB Endow. 9, 3 (November 2015), 204-215.