Constructing a Data Accessing Layer for In-memory Data Grid

Shuping Ji, Wei Wang, Chunyang Ye, Jun Wei, Zhaohui Liu
Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences
Beijing 100190, P.R. China

{jishuping10, wangwei, cyye, wj, liuzhaohui11}@otcaix.iscas.ac.cn

ABSTRACT
In-memory data grid (IMDG) is a novel data processing middleware for Internetware. It provides higher scalability and performance than a traditional relational database. However, because the data stored in an IMDG must follow the key/value data model, new challenges arise. One important aspect is that IMDG does not support standard data accessing languages such as JPA and SQL, so application developers must design their programs according to the peculiarities of an IMDG product. This results in complex and error-prone code, especially for programmers who have no deep understanding of IMDG. In this paper, we propose a data accessing reference architecture for IMDG and a methodology to design and implement its data accessing layer. The methodology covers data accessing engine construction, data model design and join operation support. Moreover, following this methodology, we develop and implement a JPA-compatible data accessing engine for Hazelcast as a case study, which demonstrates the feasibility of our approach.

Categories and Subject Descriptors D.2.12 [Software Engineering]: Interoperability – data mapping, interface definition languages.

General Terms Design, Standardization, Languages, Theory

Keywords
In-memory Data Grid (IMDG), Key/value Data Model, Data Accessing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’10, Month 1–2, 2010, City, State, Country. Copyright 2010 ACM 1-58113-000-0/00/0010 …$15.00.

1. INTRODUCTION
Relational databases have held a dominant position for the last several decades thanks to their rigorous data model, standard access interface, and strong transaction processing capability. However, their limited scalability and performance hinder their use in some areas of Internetware, such as large-scale online transaction processing, complex event processing and big data processing. As a substitute, the in-memory database (IMDB) was proposed. It has all the qualities of a traditional relational database, but resides in memory. An IMDB attempts to bring data closer to the application by holding an entire database in memory as a single entity. The application treats the IMDB layer as a database, while the IMDB is backed by a relational database. The advantage of this approach is faster data access. However, since it does not solve the scalability problem of traditional relational database systems and lacks persistence capability, it has not been widely used in practice [1].

Against this background, the in-memory data grid (IMDG) was proposed. As a novel data processing middleware, it makes some appealing promises. By focusing on provisioning and accessing data in a grid-style manner, namely using a large number of loosely coupled cooperative caches to store data, it provides extremely high scalability: higher workloads can be accommodated by simply adding servers. Moreover, by storing information in memory in a redundant and consistent manner, an IMDG can make the relational database unnecessary (or needed for reporting and backup purposes only). An IMDG can tolerate node failures or even disasters through data replication.

[Figure 1. IMDB versus IMDG. Labels in the figure: In-memory Database, In-memory Data Grid, Complex Query Support, Data Flexibility, Simple OLTP, Fast In-Memory Data Access, Access Data as POJOs, Linear Scalability.]

However, because the data stored in an IMDG must follow the key/value data model to guarantee high scalability, new challenges arise. As shown in Figure 1, compared with IMDB, IMDG does have some disadvantages. One important aspect is that standard data accessing languages are not

supported by IMDG. Application developers must therefore design their programs according to the peculiarities of different IMDG products rather than established software engineering practice. For example, IMDG only supports very simple types of queries that select data records from a single table. Complex operations such as join queries are not supported. In practice, a join query can often be rewritten into a sequence of primary-key queries. However, such a conversion (also called mapping) is not an easy task. Application developers must design data schemas carefully to allow such a query rewrite. They also need a sufficient understanding of subtle concurrency issues to recognize and handle the fact that a sequence of simple queries is equivalent to the original join query only when no update of the same data items is issued at the same time.

Solutions have been proposed to support SQL on top of MapReduce [2], such as Tenzing [3] and YSmart [4]. But because the interfaces of MapReduce differ greatly from those of IMDG, and MapReduce focuses on analytical tasks, it is hard to apply these solutions to IMDG. To the best of our knowledge, there are currently no mature solutions for accessing data in IMDG through advanced languages.

This paper proposes a data accessing reference architecture for IMDG and a methodology to implement its data accessing layer with existing standard data accessing languages such as the Java Persistence API (JPA) [5]. The reference architecture clarifies the location of the data accessing layer and its internal structure. The methodology covers data accessing engine construction, data model design and join operation support. We use a de-normalization approach to support and accelerate join operations. The cost is that redundant data is stored and consistency must be considered whenever data is updated. As a language tool, Antlr [6] is introduced to facilitate the construction of data accessing engines. It can automatically generate the recognizer for a specific data accessing language. As a case study, we present the implementation of a JPA-compatible data accessing engine for Hazelcast. A simplified TPC-W application is then used to demonstrate its feasibility.

This paper is organized as follows. Section 2 reviews related work. Section 3 proposes the reference architecture and the methodology to implement the data accessing layer for IMDG. Section 4 presents in detail the case study that implements the data accessing engine for Hazelcast with JPA, and Section 5 concludes the paper.

2. RELATED WORK
A number of recent commercial IMDG products are JPA compatible, such as Oracle Coherence [7], GigaSpaces XAP [8], and VMware GemFire [9]. However, they only support a small part of the JPA interfaces. SQL is partially supported by GigaSpaces XAP, but this support is also very limited. Open source products, such as Hazelcast [10] and Infinispan [11], do not fully support JPA or SQL. Moreover, join operations are supported by none of these systems.

Attempts have been made to create simpler interfaces on top of MapReduce. Tenzing [3], Sawzall [12], PIG [13], HIVE [14] and HadoopDB [15] are typical examples. Some of them, such as Tenzing, support a mostly complete SQL implementation with high performance and low latency. However, all of this work is focused on analytical tasks, which means that all the operations are queries and no data insertion or update is permitted. In addition, the interfaces of MapReduce differ greatly from those of IMDG. So these solutions cannot be applied to data accessing for IMDG. To the best of our knowledge, there are no mature solutions that implement data accessing for IMDG with existing standard languages such as JPA and SQL.

3. ARCHITECTURE & METHODOLOGY
In this section, we first describe the typical system architecture of IMDG, focusing on its data accessing layer and ignoring its internal implementation, and give an overview of the structure of a data accessing engine. Second, the generic workflow to construct data accessing engines is presented. Third, we review the typical key/value data models and propose some data model design principles for IMDG, in which the de-normalization technique [16, 17] is used to improve query performance. Finally, the algorithm to support join queries is presented.

3.1 Reference Architecture
In the system architecture shown in Figure 2 (left), the IMDG sits between the application layer and the relational database layer. It works as a data access accelerator, while the relational database provides data persistence. Initially, data can be loaded from the relational database into the IMDG automatically when the system is launched. Clients interact with the IMDG rather than the relational database to send data read and write requests. Read operations are executed only in the IMDG (unsupported read operations can be forwarded to the underlying relational database as an expedient), while write operations are propagated to the underlying relational database in write-through or write-behind [18] mode.

[Figure 2. Data accessing layer architecture. Labels in the figure: Client process, Clients; Interfaces: Memcache drive, Restful drive, Native drive, JPA drive, SQL drive; IMDG: Request distributor, Memcache engine, Restful engine, Native engine, JPA engine, SQL engine, Cache loader, IMDG core, Write queue; JDBC drive; Database system.]

The Memcache, RESTful, and native interfaces in Figure 2 (right) are popular interfaces that are completely supported in some existing IMDG products, while JPA and SQL are not yet supported. No matter which interface is chosen, there is a corresponding driver on the client side. Transaction requests are packed on the client side and sent to the IMDG. The request distributor distinguishes the requests and forwards them to the corresponding execution engines.
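The write-through and write-behind modes mentioned for the reference architecture can be illustrated with a minimal sketch. It uses plain Java collections rather than any real IMDG or JDBC API; the class names GridMap and BackingStore, the writeQueue field and the flush method are assumptions made only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Illustrative only: plain collections stand in for the IMDG and the database. */
class WriteModeSketch {

    /** Hypothetical stand-in for the relational database behind the grid. */
    static final class BackingStore {
        final Map<String, String> rows = new HashMap<>();

        void save(String key, String value) {
            rows.put(key, value);
        }
    }

    /** Hypothetical stand-in for one IMDG map with a configurable write mode. */
    static final class GridMap {
        private final Map<String, String> memory = new HashMap<>();
        private final Queue<String[]> writeQueue = new ArrayDeque<>();
        private final BackingStore store;
        private final boolean writeBehind;

        GridMap(BackingStore store, boolean writeBehind) {
            this.store = store;
            this.writeBehind = writeBehind;
        }

        /** Reads are served from memory only. */
        String get(String key) {
            return memory.get(key);
        }

        /** Writes update memory first, then reach the store now or later. */
        void put(String key, String value) {
            memory.put(key, value);
            if (writeBehind) {
                writeQueue.add(new String[] {key, value}); // deferred until flush()
            } else {
                store.save(key, value); // write-through: store updated immediately
            }
        }

        /** Drains the write queue into the store and returns the number of writes. */
        int flush() {
            int written = 0;
            String[] entry;
            while ((entry = writeQueue.poll()) != null) {
                store.save(entry[0], entry[1]);
                written++;
            }
            return written;
        }
    }

    public static void main(String[] args) {
        BackingStore database = new BackingStore();
        GridMap customers = new GridMap(database, true);
        customers.put("1", "Joe Smith");
        System.out.println("in store before flush: " + database.rows.containsKey("1"));
        customers.flush();
        System.out.println("in store after flush: " + database.rows.containsKey("1"));
    }
}
```

In this sketch the only difference between the two modes is when the backing store sees a write; a real write-behind queue would also need to handle batching, ordering and failures, which are left out here.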

In this paper, we focus only on the engines for advanced data accessing languages. As shown in Figure 3, a data accessing engine consists of a recognizer, a translator and an execution controller. The recognizer works as a parser. It recognizes legal sentences submitted by clients and generates the corresponding abstract syntax trees. For illegal sentences, an error is returned. Moreover, some complex query operations are not yet supported, such as aggregation operations, nested queries and advanced analytic functions in SQL. These operations are forwarded to the underlying relational database as an expedient. The translator converts the abstract syntax trees from the recognizer into operation sequences of the IMDG, and the execution controller takes charge of executing these operation sequences. On the whole, a data accessing engine can be regarded as a compiler whose inputs are the legal sentences of a specific data accessing language and whose outputs are operation sequences of the IMDG. In addition, it is in charge of executing the generated operation sequences.

[Figure 3. Data accessing engine architecture. Labels in the figure: Sentences of an advanced data accessing language; Data accessing engine: Recognizer, Abstract syntax tree, Translator, Execution sequence, Execution controller; Unsupported sentences, JDBC; IMDG core.]

3.2 Data Accessing Engine Construction
The workflow to construct a data accessing engine for IMDG is shown in Figure 4. First, an advanced data accessing language is analyzed and a corresponding semantic description file is written. This file should describe the lexical and syntactic structure of the source data accessing language precisely. In our approach, this file follows the specification of Antlr [6]. Antlr is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators. It also provides support for syntax tree construction, syntax tree walking, translation, error recovery, and error reporting. From a given semantic description file, Antlr can automatically generate a recognizer. Every legal sentence of that language can be recognized and parsed by the generated recognizer and translated into an abstract syntax tree. Second, the designer analyzes the interfaces of an IMDG product and designs a suitable internal data structure. Based on the structure of the abstract syntax tree generated by the recognizer, a translator can be designed and implemented. The execution controller takes charge of executing the operation sequences generated by the translator. Note that because a single sentence may be translated into a sequence of IMDG operations, its atomicity must be guaranteed. We take advantage of transactional mechanisms to achieve this.

[Figure 4. Workflow to construct a data accessing engine. Labels in the figure: Advanced language analysis, Semantic description file, Antlr toolkit, Recognizer; IMDG Interface analysis, IMDG data model design, Translator; Execution engine; Data accessing engine.]

Although we do not automate the process of constructing data accessing engines, following this workflow makes it much easier to produce a specific data accessing engine.

3.3 Data Model Design
The data model is the basis of a data storage system; it largely determines the working mode, the interface, and even all the main characteristics of the system. Relational databases can support the powerful SQL interface because their data model is rigorous and follows the 1NF, 2NF, 3NF or BCNF normal forms. Similarly, IMDG can only support a relatively weak interface because its data follows the key/value data models.

3.3.1 Key/value data models

According to the expressiveness in processing queries, the support for (semi-)structured data, and the application-specific characteristics, we divide the key/value data model into four categories: basic key-value, column, document and graph. The basic key-value data model only implements the key-to-value map, and the key is not interpreted. This simplicity makes the corresponding systems fast and easy to implement. The price is that only a hash-table-like PUT/GET/DELETE primary-key access interface is provided. The column-based data model is more structured and table-like. A single data entry is called a row and is addressed by its key. But unlike in relational databases, rows in a column-based database can have sets of columns that are not predefined and that can change dynamically over time. The keys are interpreted, for example derived from a common attribute of the data objects. Such a data structure results in a more powerful interface than the basic key-value data model. The document-based data model is semi-structured. The data are typically stored in JavaScript Object Notation (JSON) [19], which allows complex data structures to be represented. A document-based database typically provides complex operations, such as adding elements to or deleting elements from a document’s array. The graph-based data model uses nodes and edges as the main storage elements. It is designed to optimize the performance of associative accesses in order to support the efficient execution of typical graph algorithms, for example following or finding paths in a graph.

Each data model has different benefits and drawbacks, and one solution might fit a use case better than another. The access patterns and interfaces of these data models differ from one another. Currently, most IMDG products only support the basic key/value data model, so the interface is relatively weak, and it is harder to implement data accessing for IMDG products.

3.3.2 Entity Grid Mapping
In the design of relational databases, the entity-relationship (E-R) model [20] is used to facilitate the construction of tables. Here, we use the E-R model as the starting point of data model design for IMDG. In the E-R model, an entity-relationship diagram is made up of a collection of entities, attributes and relationships. Two or more entities can be connected by a relationship, and relationships can be divided into four categories: one-to-one, one-to-many, many-to-one and many-to-many. In our method, every entity is mapped to a collection (similar to the concept of a table in a relational database). Among the relationships, only a many-to-many relationship is mapped to a collection, while the other three kinds of relationships are not mapped or stored in the IMDG. Moreover, the data of a many-to-many relationship is stored redundantly to facilitate and accelerate complex queries such as joins. This approach of storing redundant data is called de-normalization: the same data is copied into multiple documents or tables in order to simplify or optimize query processing or to fit the user’s data into a particular data model. The method of storing redundant data is illustrated through an example in Section 4.4.

[Figure 5. Entity-relationship example. Labels in the figure: Customer, Have, Address; id, name, age; id, city, street; 1 or N, 1 or N.]

In Figure 5, we assume that the entities customer and address may have different relationships. If the relationship is one-to-one, one-to-many or many-to-one, only two collections are generated in the IMDG, representing the entities customer and address respectively. A one-to-many relationship means, for example, that a customer may have several addresses and that no address is shared by two customers. However, if the relationship is many-to-many, three collections are generated: two represent the two entities and the third represents the many-to-many relationship.

3.4 Join Operation Supporting
The join query is an essential feature of any database system, since it allows related information from several tables to be queried in a single atomic operation. In SQL, join queries can be categorized as left, inner, right, cross and full outer joins, or as equi, semi-equi and non-equi joins. In this paper, we only discuss data accessing support for inner and equi joins, because they are by far the most common join queries [21] and support for this kind of query is relatively easy. We use SQL to demonstrate the algorithm that implements join support for IMDG.

In our algorithm, a join query is expressed as a collection of “query predicates” and “join predicates”. A query predicate identifies a condition that a record must satisfy to be returned. A join predicate represents a join operation between two tables. It contains references to the two tables, and identifies the names and properties of the join attributes. Join predicates are used to connect all the tables together, while query predicates are used to filter the records. For example, in the following SQL query:

SELECT * FROM book, author, country
WHERE book.authorID = author.ID AND author.countryID = country.ID
AND book.title = 'computer science' AND country.name = 'China'

three tables (book, author, and country) are connected by two join predicates, “book.authorID = author.ID” and “author.countryID = country.ID”, and two query predicates are used to filter the records: book.title = 'computer science' and country.name = 'China'.

The algorithm to implement a join query in IMDG is as follows. First, the SQL join query is analyzed and the collection of query predicates and join predicates is identified. Second, a “query flow diagram” is built. In this diagram, ovals represent tables and edges represent join predicates. A query predicate is attached to a specific oval, indicating that the predicate is applied to the corresponding table, as shown in Figure 6 and Figure 7. Third, one of the tables is chosen as the “root table”. Two principles are used to select the root table: first, the table should have effective attached query predicates; second, the table should be the “positive edge” of the query flow diagram. We say a table is the positive edge when it is an edge point in the query flow diagram (except the ones without any query predicate) and its foreign key refers to another table’s primary key in the corresponding join predicate. Finally, an execution sequence is generated, which specifies the execution order of all the query and join predicates.

[Figure 6. Root table at the edge location. SQL join query: SELECT * FROM book, author, country WHERE book.authorID = author.ID AND author.countryID = country.ID AND book.title = 'computer science' AND country.name = 'China'. Equivalent IMDG query: ① QP: book.title = 'computer science' (Book); ② JP: book.authorID = author.ID (Author); ③ JP: author.countryID = country.ID (Country); ④ QP: country.name = 'China'.]

Two kinds of execution sequences can be constructed, depending on whether or not the selected root table is located at an edge point of the query flow diagram (including the ones without any query predicates). Figure 6 shows the case where the root table is located at an edge point. As shown in Figure 6 (left), table book is selected as the root table, because there is a query predicate book.title = 'computer science' for this table and it is the positive edge of this query flow diagram. The execution sequence indicates that all the predicates should be executed sequentially (in the order of the serial numbers) starting from the root table book.
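The execution sequence of Figure 6 can be sketched with plain Java maps standing in for IMDG collections. This is a minimal illustration, not the authors' implementation: the classes Book, Author and Country, the method join, and the sample data in main are assumptions. Step (1) is modeled as a scan over the root collection, whereas a real IMDG would evaluate that predicate with its own attribute query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: maps keyed by primary key stand in for IMDG collections. */
class EdgeRootJoinSketch {

    static final class Book {
        final int id; final String title; final int authorId;
        Book(int id, String title, int authorId) {
            this.id = id; this.title = title; this.authorId = authorId;
        }
    }

    static final class Author {
        final int id; final String name; final int countryId;
        Author(int id, String name, int countryId) {
            this.id = id; this.name = name; this.countryId = countryId;
        }
    }

    static final class Country {
        final int id; final String name;
        Country(int id, String name) {
            this.id = id; this.name = name;
        }
    }

    /** Returns the IDs of the books whose rows satisfy all four predicates. */
    static List<Integer> join(Map<Integer, Book> books,
                              Map<Integer, Author> authors,
                              Map<Integer, Country> countries,
                              String title, String countryName) {
        List<Integer> matches = new ArrayList<>();
        for (Book book : books.values()) {
            if (!book.title.equals(title)) continue;            // (1) QP on the root table
            Author author = authors.get(book.authorId);         // (2) JP as a primary-key get
            if (author == null) continue;
            Country country = countries.get(author.countryId);  // (3) JP as a primary-key get
            if (country == null) continue;
            if (country.name.equals(countryName)) {             // (4) QP on the last table
                matches.add(book.id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<Integer, Book> books = new HashMap<>();
        books.put(1, new Book(1, "computer science", 10));
        books.put(2, new Book(2, "poetry", 10));
        Map<Integer, Author> authors = new HashMap<>();
        authors.put(10, new Author(10, "David", 100));
        Map<Integer, Country> countries = new HashMap<>();
        countries.put(100, new Country(100, "China"));
        System.out.println(join(books, authors, countries, "computer science", "China"));
    }
}
```

Each join predicate becomes a single get on the referenced table's primary key, which is why the foreign key of the root table must point toward the next table in the sequence.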

Moreover, the execution of a later predicate is based on the results of the previous predicate. For example, after the predicate “book.title = 'computer science'” is executed, the matching records of table book are identified. Then the execution of the predicate “book.authorID = author.ID” is activated.

[Figure 7. Root table at the middle location. SQL join query: SELECT * FROM book, author, country WHERE book.authorID = author.ID AND author.countryID = country.ID AND author.name = 'David'. Equivalent IMDG query: ① QP: author.name = 'David' (Author); ② JP: book.authorID = author.ID (Book); ②’ JP: author.countryID = country.ID (Country).]

Figure 7 shows the case where the root table is located at a middle point. As shown in Figure 7 (left), table author is selected as the root table, because there is no query predicate attached to the table book. In this case, the query plan permits concurrent execution along two directions from the root table. This kind of query plan is more complex.

4. CASE STUDY
In this section we present a case study that implements data accessing for Hazelcast with JPA, and we use a simplified TPC-W benchmark [22] as an example to demonstrate its feasibility. This section has four parts. Section 4.1 gives an overview of the JPA standard. Section 4.2 briefly introduces Hazelcast and gives some examples of its interfaces. Section 4.3 describes the construction of the JPA-to-Hazelcast engine. Section 4.4 presents the simplified TPC-W application.

4.1 JPA
JPA stands for Java Persistence API [5]. It is a simple programming model for entity persistence, designed as an object-oriented data accessing interface to replace SQL. Through its object-relational mapping, software designers can store running objects directly in a relational database. JPA is supported by many implementations, such as Hibernate ORM [23], OpenJPA [24], and Oracle TopLink [25], all of which provide object-relational mapping. Apart from the insert, delete and update operations, the primary query language used in JPA is the Java Persistence Query Language, or JP-QL [26]. It is syntactically very similar to SQL, but is object-oriented rather than table-oriented. It is used to define searches against persistent entities independent of the mechanism used to store those entities, and it operates directly on entity objects rather than database tables.

4.2 Hazelcast
Hazelcast is an open source clustering and highly scalable data distribution platform for Java. It provides various distributed data structures, distributed caching capabilities, elasticity, and Memcache interface support. Moreover, it is a feature-rich, enterprise-ready and developer-friendly in-memory data grid solution. Hazelcast provides various kinds of data structures, such as distributed queue, distributed map, distributed multi-map, distributed set, and distributed list. Programmers can adopt suitable data structures for their applications. However, all these data structures are based on the key/value data model, which does not provide a powerful data accessing capability. Besides querying an entity by its key, a few SQL-like constructs are supported: AND/OR, BETWEEN, LIKE, IN and some relational operators such as =. These predicates can be applied to a specific attribute, which means that querying on attributes other than primary keys is supported. But they can only be applied to a single collection. Complex operations, such as joins, aggregations, and nested queries, are not supported. Table 1 shows some typical Hazelcast sentences.

1. mapCustomer.put("1", new Customer("Joe", "Smith"));
2. Customer customer = mapCustomer.get("1");
3. Set<Customer> customers = (Set<Customer>) mapCustomer.values(new SqlPredicate("name LIKE 'Jo%'"));
4. Set<Customer> customers = (Set<Customer>) mapCustomer.values(new SqlPredicate("BirthDate >= 1988"));

Table 1. Hazelcast sentence examples

4.3 JPA-to-Hazelcast Engine
In this section, we present the creation of a JPA-compatible data accessing engine for Hazelcast, following the workflow to construct a data accessing engine. First, we create a JPA grammatical description file according to the specification of Antlr’s LL(k) grammar. This file describes the lexical and syntactic structure of JPA precisely. Based on this file, Antlr generates a recognizer automatically. The recognizer takes legal JPA sentences as inputs and produces abstract syntax trees as outputs. Table 2 shows a fragment of our JPA description file.

jpa_statement : select_statement | insert_statement | update_statement | delete_statement;
select_statement : select_clause from_clause (where_clause)? (groupby_clause)? (orderby_clause)?;
select_clause : ('SELECT'|'select') ('DISTINCT'|'distinct')? select_expression (',' select_expression)*;
from_clause : ('FROM'|'from') identification_variable_declaration (',' identification_variable_declaration)*;
…

Table 2. JPA grammar description file for Antlr

The generated recognizer is the first component of the JPA-to-Hazelcast engine. The other two components are the translator and the execution controller. In the translator, an object tree is constructed from the abstract syntax tree. It is an intermediate code that represents a JPA sentence in a tree structure. This object tree can then be translated into a sequence of Hazelcast operations by the translation logic defined manually by the designer.

We take the JPA sentence “SELECT item FROM Item item WHERE item.ISubject = 'abc'” as an example. After lexical analysis, this JPA sentence is converted from a character stream into a token stream consisting of the tokens “SELECT”, “item”, “FROM”, and so on. Then the token stream is parsed and an abstract syntax tree is built. From the abstract syntax tree, an object tree is constructed as shown in Figure 8. Each node in this object tree is an instance of class JPANode. A JPANode has four attributes: type, value, childrenList, and numOfChildren. The attribute type represents the type of the current node, such as jpa_statement, select_statement, string_type, int_type and so on. The names of the other three attributes indicate their contents and functions. This object tree is then translated into “Set<Item> item = (Set<Item>) mapItem.values(new SqlPredicate("ISubject = 'abc'"))”. Its execution is completed by the execution controller.
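The translation step for this example can be sketched as a small tree walk. The sketch below is an assumption-laden illustration, not the engine described in the paper: JPANode keeps the attribute names type, value and childrenList from the text (numOfChildren is derived from the list), while the node types entity_name, alias, comparison and path, the helper methods, and the generated string are hypothetical. It only builds the Hazelcast call as text and does not depend on the Hazelcast library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hand-built object tree and a one-rule translator. */
class ObjectTreeSketch {

    static final class JPANode {
        final String type;
        final String value;
        final List<JPANode> childrenList = new ArrayList<>();

        JPANode(String type, String value) {
            this.type = type;
            this.value = value;
        }

        JPANode add(JPANode child) {
            childrenList.add(child);
            return this;
        }

        int numOfChildren() {
            return childrenList.size();
        }
    }

    /** Returns the first child of the given type or fails loudly. */
    static JPANode child(JPANode parent, String type) {
        for (JPANode candidate : parent.childrenList) {
            if (candidate.type.equals(type)) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("missing child of type " + type);
    }

    /** Renders "alias.Field op literal" as predicate text without the alias. */
    static String toPredicate(JPANode comparison) {
        JPANode path = comparison.childrenList.get(0);
        JPANode literal = comparison.childrenList.get(1);
        String field = path.value.substring(path.value.indexOf('.') + 1);
        String rendered = literal.type.equals("string_type")
                ? "'" + literal.value + "'"
                : literal.value;
        return field + " " + comparison.value + " " + rendered;
    }

    /** Translates a single-entity select_statement into one map query, as text. */
    static String translate(JPANode selectStatement) {
        String entity = child(child(selectStatement, "from_clause"), "entity_name").value;
        JPANode comparison = child(child(selectStatement, "where_clause"), "comparison");
        return "map" + entity + ".values(new SqlPredicate(\"" + toPredicate(comparison) + "\"))";
    }

    /** Object tree for: SELECT item FROM Item item WHERE item.ISubject = 'abc' */
    static JPANode example() {
        JPANode select = new JPANode("select_statement", null);
        select.add(new JPANode("select_clause", "item"));
        select.add(new JPANode("from_clause", null)
                .add(new JPANode("entity_name", "Item"))
                .add(new JPANode("alias", "item")));
        select.add(new JPANode("where_clause", null)
                .add(new JPANode("comparison", "=")
                        .add(new JPANode("path", "item.ISubject"))
                        .add(new JPANode("string_type", "abc"))));
        return select;
    }

    public static void main(String[] args) {
        System.out.println(translate(example()));
    }
}
```

A full translator would need one such rule per grammar production and would hand the resulting operations to the execution controller instead of returning text.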

SELECT

FROM

The principles to design data model for IMDG are already presented in section 3.3. Following these principles, the data designation for the simplified TPC-W application in Hazelcat is shown in table 3. It indicates the data of a many-to-many relationship, say customer_address, is redundantly stored. By doing so, it is possible to directly query a customer’s addresses or query the customers that share a specific address. Similarly, the simplified the database calls of TPC-W application are shown in table 4. The first SQL statement is a simple query that is applied to only one table. The second and the third statements are examples of complex join operations when the root table locates at a middle place or an edge place respectively. Collection

WHERE

Customer

item

Item

item

= Address

item.ISubject

“abc”

Figure 8. Object tree example

Customer_address

4.4 Simplified TPC-W Application TPC-W is an industry standard e-business benchmark [22], which is introduced by the Transaction Performance Council in Feb 2000 at the ecommerce environment. It specifies an ecommerce workload that simulates customers browsing and buying products from a website. Fourteen different web pages are presented to simulate the activities of an on-line retail bookstore website. In this paper, we don’t use it as a performance benchmark, but focus on its on-line bookstore website application. We will take it as an example to demonstrate how typical real world applications can be transferred to IMDG platform without any change including the data accessing interfaces. We simplify and modify the TPC-W application to include only six tables: customer, order, book, author, address and customer_address. The tables and their relationships are shown in Figure 9. We assume the table customer_address is mapped from a many-to-many relationship between the entity customer and address. It means a customer may have several addresses, and different customers may share the same address. Customer Customer

Customer:          ID (PK), Name, Password, AddressID (FK), PhoneNum, EmailAddr, BirthDate, LastVisit
Order:             ID (PK), CustomerID (FK), OrderNum, OrderDate, Status, BookID (FK)
Address:           ID (PK), Street, City, Country
Book:              ID (PK), Title, Subject, ISBN, PubDate, AuthorID (FK)
Author:            ID (PK), Name, Profile, PhoneNum, EmailAddr, BirthDate
Customer_address:  CustomerID (PK), AddressID (PK)

Figure 9. Tables of the simplified TPC-W application

Key       Value
cid, 1    {name="David", age="24"}
cid, 2    {name="Joy", age="28"}
aid, 1    {city="Beijing", street="Hai dian"}
aid, 2    {city="Beijing", street="Tsing Hua"}
cid, 1    {aid=1, aid=2}
cid, 2    {aid=1}
aid, 1    {cid=1, cid=2}
aid, 2    {cid=1}
…

Table 3. Data designation for the simplified TPC-W application in Hazelcast

1. SELECT customer FROM Customer customer WHERE customer.ID = 1000
2. SELECT customer, order, book, author FROM Customer customer, Order order, Book book, Author author WHERE customer.ID = order.CustomerID AND book.ID = order.BookID AND author.ID = book.AuthorID AND order.ID = 1000
3. SELECT customer, customer_address, address FROM Customer customer, Customer_address customer_address, Address address WHERE customer_address.CustomerID = customer.ID AND address.ID = customer_address.AddressID AND customer.ID = 1000

Table 4. JPA calls of the simplified TPC-W application

1. Customer customer = mapCustomer.get(1000);
2. (1)  Order order = mapOrder.get(1000);
   (2)  Customer customer = mapCustomer.get(order.CustomerID);
   (2)' Book book = mapBook.get(order.BookID);
   (3)  Author author = mapAuthor.get(book.AuthorID);
3. (1)  Customer customer = mapCustomer.get(1000);
   (2)  Customer_address customer_address = mapCustomer_address.get(customer.ID);
   (3)  List<Address> addressList = new ArrayList<Address>();
        for (String aid : customer_address.aidList) {
            addressList.add(mapAddress.get(aid));
        }

Table 5. Corresponding Hazelcast execution sequences

Taking the JPA statements in Table 4 as input to the JPA-to-Hazelcast compiler yields the execution sequences shown in Table 5. In the second generated sequence, steps (2) and (2)' can be executed concurrently, since neither depends on the result of the other; both need only the order fetched in step (1). In the third generated sequence, step (3) contains a loop, because the many-to-many relationship customer_address may return more than one aid in step (2).
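The second and third execution sequences can be sketched in plain Java as follows. This is only an illustration under stated assumptions: java.util.Map instances stand in for the Hazelcast maps, the Order and Book classes and the method names are hypothetical, customers, authors and addresses are reduced to strings, and missing keys are not handled. CompletableFuture is used here as one possible way to run the two independent lookups concurrently; the paper does not prescribe a mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Sketch only: plain java.util.Map instances stand in for the Hazelcast maps of Table 5.
class ExecutionSequenceSketch {

    // Hypothetical value types; only the foreign keys used by the joins are modelled.
    static final class Order {
        final int customerId;
        final int bookId;
        Order(int customerId, int bookId) {
            this.customerId = customerId;
            this.bookId = bookId;
        }
    }

    static final class Book {
        final String title;
        final int authorId;
        Book(String title, int authorId) {
            this.title = title;
            this.authorId = authorId;
        }
    }

    // Second sequence: returns [customer name, book title, author name] for one order.
    static List<String> secondSequence(int orderId,
                                       Map<Integer, Order> mapOrder,
                                       Map<Integer, String> mapCustomer,
                                       Map<Integer, Book> mapBook,
                                       Map<Integer, String> mapAuthor) {
        Order order = mapOrder.get(orderId); // step (1)
        // Steps (2) and (2)' depend only on step (1), so they are started together.
        CompletableFuture<String> customer =
                CompletableFuture.supplyAsync(() -> mapCustomer.get(order.customerId));
        CompletableFuture<Book> book =
                CompletableFuture.supplyAsync(() -> mapBook.get(order.bookId));
        Book fetchedBook = book.join();
        String author = mapAuthor.get(fetchedBook.authorId); // step (3) needs the book
        return List.of(customer.join(), fetchedBook.title, author);
    }

    // Third sequence: resolves every address of one customer through the relationship map.
    static List<String> thirdSequence(int customerId,
                                      Map<Integer, String> mapCustomer,
                                      Map<Integer, List<Integer>> mapCustomerAddress,
                                      Map<Integer, String> mapAddress) {
        List<String> addressList = new ArrayList<>();
        if (mapCustomer.get(customerId) == null) { // step (1)
            return addressList;
        }
        List<Integer> aidList =
                mapCustomerAddress.getOrDefault(customerId, List.of()); // step (2)
        for (Integer aid : aidList) { // step (3): one lookup per aid
            addressList.add(mapAddress.get(aid));
        }
        return addressList;
    }

    public static void main(String[] args) {
        // Customer and address values follow Table 3; the book and author are made up.
        Map<Integer, String> customers = Map.of(1, "David", 2, "Joy");
        Map<Integer, Order> orders = Map.of(1000, new Order(1, 7));
        Map<Integer, Book> books = Map.of(7, new Book("Sample Book", 3));
        Map<Integer, String> authors = Map.of(3, "Sample Author");
        Map<Integer, List<Integer>> customerAddress = Map.of(1, List.of(1, 2), 2, List.of(1));
        Map<Integer, String> addresses = Map.of(1, "Hai dian", 2, "Tsing Hua");

        System.out.println(secondSequence(1000, orders, customers, books, authors));
        System.out.println(thirdSequence(1, customers, customerAddress, addresses));
    }
}
```

In the second sequence, the author lookup must wait for the book because it needs book.AuthorID, while the customer lookup can complete at any time before the result is assembled.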

5. CONCLUSION AND FUTURE WORK

IMDG is still at a developing stage: its high scalability and performance make it attractive, but many challenges remain. This paper aims to construct a data accessing layer for IMDG. We investigated several related problems and proposed corresponding solutions: a data accessing reference architecture for IMDG, an algorithm for supporting join queries, and design principles for the IMDG data model. We showed that, to some degree, IMDG can be made compatible with standard data accessing languages such as JPA and SQL by constructing a corresponding data accessing engine. A single JPA query can be translated into a sequence of simple queries, and these simple queries may be executed in different orders. In future work, we will study the optimization of such query plans. In addition, concurrent execution makes it harder to guarantee the atomicity of a query plan while data is being updated, so transactional properties are another possible direction for future research.

6. ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61173003 and 61100068, the National Grand Fundamental Research 973 Program of China under Grant No. 2009CB320704, and the National High-Tech Research and Development Plan of China under Grant No. 2012AA011204.

7. REFERENCES

[1] Hasso Plattner. 2009. A common database approach for OLTP and OLAP using an in-memory column database. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09), Carsten Binnig and Benoit Dageville (Eds.). ACM, New York, NY, USA, 1-2.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[3] B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: A SQL Implementation on the MapReduce Framework. PVLDB, 4(12):1318-1327, 2011.
[4] R. Lee, et al. YSmart: Yet Another SQL-to-MapReduce Translator. 31st International Conference on Distributed Computing Systems (ICDCS 2011), pp. 25-36, 2011.
[5] JPA: http://www.oracle.com/technetwork/articles/javaee/jpa137156.html.
[6] Terence Parr and Russell Quong. ANTLR: A predicated-LL(k) parser generator. Journal of Software Practice and Experience, 25(7), 1995.
[7] Oracle Coherence: http://www.oracle.com/technetwork/middleware/coherence/overview/index.html.
[8] GigaSpaces XAP: http://www.gigaspaces.com/datagrid.
[9] VMware GemFire: http://www.vmware.com/products/applicationplatform/vfabric-gemfire/overview.html.
[10] Hazelcast: http://www.hazelcast.com/.
[11] Infinispan: http://www.jboss.org/infinispan/.
[12] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277-298, 2005.
[13] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110. ACM, 2008.
[14] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a Map-Reduce framework. Proceedings of the VLDB Endowment, 2(2):1626-1629, 2009.
[15] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2:922-933, August 2009.
[16] G. L. Sanders and S. K. Shin. Denormalization effects on performance of RDBMS. In Proceedings of the HICSS Conference, January 2001.
[17] S. K. Shin and G. L. Sanders. Denormalisation strategies for data retrieval from data warehouses. Decision Support Systems, 42(1):267-282, October 2006.
[18] Caching policy: http://en.wikipedia.org/wiki/Cache_(computing).
[19] Json: http://www.json.org/.
[20] P.P. Chen. The Entity-Relationship Model: Towards a unified view of Data. ACM Transactions on Database Systems, 1:9-36, Jan 1976.
[21] Z. Wei, G. Pierre, and C. H. Chi. Scalable Join Queries in Cloud Data Stores. 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. May 2012.
[22] TPC-W: http://www.tpc.org/tpcw/default.asp.
[23] Hibernate ORM: http://www.hibernate.org/.
[24] OpenJPA: http://openjpa.apache.org/.
[25] TopLink: http://www.oracle.com/technetwork/middleware/toplink/overview/index.html.
[26] M. Keith and M. Schincariol. Introduction. In Pro JPA 2. Apress, 2010, pp. 1-16.
