LNCS 7105 - Partial Aggregation Using Hibernate - Springer Link

Partial Aggregation Using Hibernate

Michał Gawarkiewicz and Piotr Wiśniewski

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
{garfi,pikonrad}@mat.umk.pl

Abstract. In this paper, the authors describe tools that make partial aggregation techniques available to analytical queries at the level of the Hibernate object-relational mapping. Grouping queries are quite burdensome for databases. Therefore, applications that process large amounts of data often rely on partial aggregation (not to be confused with the aggregation of the huge data sets handled by data warehouses). Unfortunately, applications written using object-relational mappings cannot use this type of solution directly. The authors decided to fill this gap by adding easy-to-use tools that automatically create aggregations, together with very simple mechanisms for retrieving the aggregated data.

1 Introduction

The use of database applications in business, in addition to making work easier, has the positive side effect of collecting large amounts of data. These data allow a variety of analyses to be carried out. Such analyses require grouping queries, which are much heavier to execute. In the case of huge data sets, this aspect led to the development of a large branch of science: data warehouses. This work applies to smaller collections, for which deploying a data warehouse appears pointless, but which are large enough that direct analytic queries would have unacceptable execution times. In practical applications this problem is easy to get around by preparing additional aggregation tables and setting triggers on the appropriate events in the database. Analytic queries then use the partially aggregated data rather than referring to the raw data directly.

Parallel to the development of relational databases, the field of object-oriented programming languages and data modeling methods evolved as well. Modeling of relational data structures increasingly resembles the modeling of objects and classes. This has inspired research on techniques for mapping objects to relations and relations to objects [1,2]. The data transformations required in such mappings can be very complex, especially since they often involve advanced joins, nested queries and support for data update operations. Object-relational mapping (ORM) tools allow the programmer to focus on code development without the need for advanced knowledge of SQL nuances. This results in a reduction of the time needed to develop software and in increased readability of the source code. An additional benefit of using object-relational mapping tools is increased source code maintainability and portability between different DBMSs. The number of supported DBMSs depends on the actual ORM framework used, ranging from one particular vendor to full support of all major DBMSs.

Nowadays object-relational mapping tools are available for most of the popular programming languages. Due to the popularity of their base languages, the most noteworthy are Hibernate and JDO for Java [3], LINQ and ADO.NET for the .NET platform, and SQLObject for Python. Because Java is nowadays the most popular language for software development, we restrict our discussion to it.

Unfortunately, application designers using object-relational mapping cannot directly use the technique of partial aggregation. To use it, they must manually create the appropriate tables and the triggers on the tables with raw data. As a result, the clarity of the solution is lost, and subsequent development and maintenance of such code becomes more expensive. These facts led the authors to attempt to integrate the mentioned techniques with object-relational mappings; the results of this research are presented in this paper. The presented work follows [4] and [5] in a series of works extending object-relational mapping with support for advanced techniques of modern object-relational systems.

T.-h. Kim et al. (Eds.): FGIT 2011, LNCS 7105, pp. 90–99, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Contribution

In this paper, automatic algorithms have been developed that define auxiliary tables for storing partially aggregated data and generate the triggers that fill these tables. In addition, an easy-to-use, annotation-based interface to these algorithms was designed in keeping with the JPA [6] standard. The third aspect is a simple built-in HQL parser that recognizes whether a submitted query is supported by the stored aggregations and, if so, takes over its execution.

3 Motivation Examples

Let us consider sales data stored in the schema presented in Fig. 1. An example of an analytical question is the quantity of each product sold in a selected time interval. For small amounts of data, we can express the query as follows:

Example 1.

    SELECT invoiceLines.product_id, invoice.date, sum(invoiceLines.quantity)
    FROM invoice JOIN invoiceLines USING (invoice_id)
    GROUP BY invoice.date, invoiceLines.product_id
    HAVING date BETWEEN '2011-07-16' AND '2011-07-22'


Fig. 1. Base Data Schema

For huge amounts of data, such an approach is highly inefficient, which led to the development of data warehouses. However, there is a wide range of large data sets for which warehouse-type solutions have no economic justification, while direct queries take unacceptably long. The solution is then partial aggregation of the data. The above schema should be extended with the dw_invoice_date_product table shown in Fig. 2:

Fig. 2. dw invoice table

Then triggers on insert, update and delete should be created for events modifying data in the invoiceLine table. Instead of the above query, the following can then be used:

    SELECT product_id, date, sum_quantity
    FROM dw_invoice_date_product
    WHERE function = 'sum' AND date BETWEEN '2011-07-16' AND '2011-07-22'

The time needed to complete this query is much shorter than the execution time of the query from Example 1, since it reads pre-aggregated rows instead of joining and grouping the raw data.

The second example is the sum of sales for selected groups of customers, and monthly sale totals for each customer. The relevant queries are as follows:

Example 2. Sum of sales for selected groups of customers:

    SELECT customer.group_id, invoice.date, sum(invoiceLine.quantity)
    FROM customer JOIN invoice USING (customer_id)
         JOIN invoiceLine USING (invoice_id)
    GROUP BY customer.group_id, invoice.date
    HAVING date BETWEEN '2011-07-16' AND '2011-07-22'

Monthly sale totals for each customer:

    SELECT invoice.customer_id, month(invoice.date), year(invoice.date),
           sum(invoiceLine.quantity * invoiceLine.price)
    FROM invoice JOIN invoiceLine USING (invoice_id)
    GROUP BY invoice.customer_id, month(invoice.date), year(invoice.date)

To deal with these aggregations, one can add a table aggregating sales information by customer and date, then derive from it separate aggregations for customer groups and for monthly sales, and place them in two additional tables. As a result, the schema should be extended to the structure shown in Fig. 3.

Fig. 3. Aggregation tables
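The trigger-based maintenance described in this section can be sketched in plain Java as an in-memory simulation. This is only an illustration under stated assumptions, not the authors' generated trigger code: the class `PartialAggregationSketch`, its method names and the use of hash maps in place of database tables are all hypothetical. It shows how insert, update and delete events on raw rows can be turned into deltas that update a day-level aggregate and cascade to a derived month-level aggregate.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for trigger-maintained aggregation tables. */
final class PartialAggregationSketch {
    // Day-level aggregate: one entry per (customer, date).
    private final Map<List<Object>, Long> sumByDay = new HashMap<>();
    // Derived month-level aggregate: one entry per (customer, year-month).
    private final Map<List<Object>, Long> sumByMonth = new HashMap<>();

    /** Builds a composite map key; lists compare by value. */
    private static List<Object> key(Object... parts) {
        return Arrays.asList(parts);
    }

    /** Plays the role of an INSERT trigger on the raw table. */
    void onInsert(long customerId, LocalDate date, long amount) {
        applyDelta(customerId, date, amount);
    }

    /** Plays the role of a DELETE trigger: the old value is subtracted. */
    void onDelete(long customerId, LocalDate date, long amount) {
        applyDelta(customerId, date, -amount);
    }

    /** Plays the role of an UPDATE trigger: only the difference is applied. */
    void onUpdate(long customerId, LocalDate date, long oldAmount, long newAmount) {
        applyDelta(customerId, date, newAmount - oldAmount);
    }

    private void applyDelta(long customerId, LocalDate date, long delta) {
        sumByDay.merge(key(customerId, date), delta, Long::sum);
        // Cascade to the derived table, as a trigger on the day-level table would.
        sumByMonth.merge(key(customerId, YearMonth.from(date)), delta, Long::sum);
    }

    long sumForDay(long customerId, LocalDate date) {
        return sumByDay.getOrDefault(key(customerId, date), 0L);
    }

    long sumForMonth(long customerId, YearMonth month) {
        return sumByMonth.getOrDefault(key(customerId, month), 0L);
    }

    public static void main(String[] args) {
        PartialAggregationSketch agg = new PartialAggregationSketch();
        LocalDate day = LocalDate.of(2011, 7, 16);
        agg.onInsert(1L, day, 5);
        agg.onUpdate(1L, day, 5, 7);
        System.out.println(agg.sumForDay(1L, day));
        System.out.println(agg.sumForMonth(1L, YearMonth.of(2011, 7)));
    }
}
```

Delta updates of this kind are sufficient for SUM and COUNT. MIN and MAX cannot in general be repaired from a delta after a delete or update, so a real trigger would have to re-scan the affected group in those cases.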

4 Hibernate

Object-relational mapping systems are the result of the intersection of two worlds: object-oriented programming languages and relational databases. They allow the business logic layer to be separated from the database layer, which increases the portability and maintainability of the software's source code. An additional feature of ORM frameworks, which has a big impact on their popularity, is the speed of source code development, during which only minimal knowledge of the advanced aspects of SQL is needed. There are many kinds of object-relational mapping tools currently available on the market. They differ in the programming language they are designed for and in the scope of supported database management systems. Some of them are available as commercial products, while others are fully open source.

One of the most popular ORM frameworks is Hibernate for the Java language. Hibernate supports most of the major relational DBMSs. In accordance with the JPA [6,7] standard, it offers two methods of defining an object-relational mapping. To map Java classes to database tables, a developer may choose to define the mapping configuration in an XML document or to define it using Java annotations with the Entity data model [8]. Sample objects generated using the Hibernate framework may resemble the classes from Listings 1.1, 1.2 and 1.3.

Listing 1.1. Invoice class representation

    @Entity
    public class Invoice {
        private Long id;
        private Date date;
        private Customer customer;
        private List<InvoiceLine> invoiceLines;
        ...

Listing 1.2. Customer class representation

    @Entity
    public class Customer {
        private List<Invoice> invoices;
        private Long id;
        private String firstName;
        private String lastName;
        private CustomerGroup customerGroup;
        ...

Listing 1.3. CustomerGroup class representation

    @Entity
    public class CustomerGroup {
        private List<Customer> customers;
        private Long id;
        private String name;
        ...

In Hibernate, the queries from Examples 1 and 2 can be expressed as follows:

Example 3. Amount of each product sold in the selected time interval:

    SELECT invoiceLines.product_id, invoice.date, sum(invoiceLines.quantity)
    FROM Invoice invoice JOIN invoice.invoiceLines invoiceLines
    GROUP BY invoice.date, invoiceLines.product_id
    HAVING date BETWEEN '2011-07-16' AND '2011-07-22'



Sum of sales for selected groups of customers:

    SELECT customer.customer_group, invoice.date, sum(invoiceLines.quantity)
    FROM Invoice invoice JOIN invoice.customer customer
         JOIN invoice.invoiceLines invoiceLines
    GROUP BY customer.customer_group, invoice.date
    HAVING date BETWEEN '2011-07-16' AND '2011-07-22'

Monthly sale totals for each customer:

    SELECT invoice.customer_id, month(invoice.date), year(invoice.date),
           sum(invoiceLines.quantity * invoiceLines.price)
    FROM Invoice invoice JOIN invoice.invoiceLines invoiceLines
    GROUP BY invoice.customer_id, month(invoice.date), year(invoice.date)

The grouping queries sent to the database will of course be similar to the queries presented in Examples 1 and 2; the differences are cosmetic and result from the automatic code generation. Hence, they are correspondingly slow on large data sets. Until now, using the solutions described in the previous section required the designer to give up the elegance of Hibernate, manually create the additional tables and write by hand the triggers that fill them. Retrieval of the aggregated information can be carried out semi-automatically, since these tables can be viewed as non-modifiable objects on the Hibernate side. Although this approach solves the problem of execution time, it significantly increases the cost of producing and maintaining software written this way. Any modification of the original schema requires an analysis of its impact on the additional tables, rebuilding the extended schema and rewriting the triggers. These problems led the authors to try to automate these tasks.

5 Proposed Solution

The presented solution involves several aspects. The first is a set of annotations that define which aggregations on the given classes are of interest. The generator implemented by the authors analyses these annotations, generates tables for storing the aggregations and creates triggers that fill them with data. The next aspect is the aggregated data retrieval interface, which parses a query in HQL and, when it detects that the query is covered by a supported aggregation, takes the information from the aggregated data. When there is no such support, the HQL query is executed by the native Hibernate engine.

Annotations

The proposed solution introduces annotations beginning with the prefix @DW. Aggregated fields get the annotation @DWAgr(function="...") and fields on which grouping is to be performed get the annotation @DWDim. The following listing illustrates their use for the first example:

Listing 1.4. InvoiceLine class representation

    @Entity
    public class InvoiceLine {
        ...
        @DWDim(Dim = "date")
        private Invoice invoice;
        private Long id;
        @DWDim
        private Product product;
        @DWAgr(function = "SUM")
        private Integer quantity;
        ...

The Dim parameter in the annotation @DWDim(Dim = "date") indicates that the grouping is done with the granularity of invoice dates, rather than of individual invoices. In the above example the generator implemented by the authors will create a table dw_invoiceline_by_product_date with the fields id_product, date and sum_quantity, together with triggers on events on the invoiceLine table that keep the data in the aggregate table up to date. The function parameter in the DWAgr annotation specifies which aggregate functions will be used. At this stage of our solution, the basic aggregate functions are supported, i.e. SUM, MIN, MAX, AVG and COUNT. For simple values, the relevant cell values in the tables are aggregated. The case of compound values will be discussed after the next example.

In the second example, where updates affect aggregations for a customer or a customer group, the annotations are attached to the Invoice class:

Listing 1.5. Invoice class representation

    @Entity
    public class Invoice {
        private Long id;
        @DWDim(Extention = "month()")
        private Date date;
        @DWDim(Extention = "group")
        private Customer customer;
        @DWAgr(function = "SUM(quantity * price)")
        private List<InvoiceLine> invoiceLines;
        ...

The above example introduces several enhancements compared with the previous one. First the dw_invoice_by_date_customer table is built, and then triggers are created on events on the invoice and invoiceLine tables. These triggers fill the aggregations in that table. In this example we need triggers on both tables, since modifications of the header data as well as of a single record in the invoiceLine table affect the aggregations.
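How a generator might read such annotations can be sketched with standard Java reflection. The paper does not show the definitions of @DWDim and @DWAgr or the generator's internals, so everything below is an assumption made for illustration: the annotation declarations, the class names `InvoiceLineSketch` and `AggregationMetadataSketch`, and the column-naming rule (`id_` plus the field name for plain dimensions, the lower-cased function name plus `_` plus the field name for aggregates).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical re-declaration of a grouping annotation, readable at run time. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DWDim {
    String Dim() default "";
}

/** Hypothetical re-declaration of an aggregation annotation. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DWAgr {
    String function();
}

/** Plain stand-in for the annotated class of Listing 1.4 (no JPA dependency). */
final class InvoiceLineSketch {
    @DWDim(Dim = "date") Object invoice;
    Long id;
    @DWDim Object product;
    @DWAgr(function = "SUM") Integer quantity;
}

final class AggregationMetadataSketch {
    /** Returns dimension columns followed by aggregate columns for an annotated class. */
    static List<String> columnsFor(Class<?> entity) {
        List<String> dims = new ArrayList<>();
        List<String> aggs = new ArrayList<>();
        for (Field field : entity.getDeclaredFields()) {
            DWDim dim = field.getAnnotation(DWDim.class);
            if (dim != null) {
                // Assumed rule: an explicit Dim names the column, otherwise use the key of the field.
                dims.add(dim.Dim().isEmpty() ? "id_" + field.getName() : dim.Dim());
            }
            DWAgr agr = field.getAnnotation(DWAgr.class);
            if (agr != null) {
                aggs.add(agr.function().toLowerCase(Locale.ROOT) + "_" + field.getName());
            }
        }
        // Reflection does not guarantee field order, so sort for a stable result.
        Collections.sort(dims);
        Collections.sort(aggs);
        List<String> columns = new ArrayList<>(dims);
        columns.addAll(aggs);
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(columnsFor(InvoiceLineSketch.class));
    }
}
```

A real generator would additionally consult the Hibernate mapping metadata to resolve table names, key columns and associations; this sketch only covers the annotation-scanning step.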



For the presented solution it is not a problem that the aggregate function is set on a field that is a list, which means that the data are taken from a separate table. Analysis of the Hibernate metadata is sufficient for the implemented automatic generator to create properly working triggers. What matters is that the relation is many-to-one: many invoiceline records belong to one invoice record. This way the grouping granularity described at the invoice level clearly determines which aggregation records receive data from which invoiceline records.

The invoiceLines field contains a list of objects, and this list relates to the fields of another table. In this situation aggregation can apply not only to specific columns but also, as in the presented case, to algebraic expressions on columns. In the current version only simple algebraic expressions without control instructions and nested parentheses are supported.

The second example also requires aggregation at a further level, i.e. collecting customer sale totals for each month. The parameter Extention="month()" of the DWDim annotation on the date field informs the generator about the need to create a dw_invoice_by_month_date_customer table that aggregates data from the dw_invoice_by_date_customer table. This aggregation is maintained by triggers generated for the dw_invoice_by_date_customer table. The parentheses in this parameter tell the generator that it is an SQL function calculated from the value of the field; date fields have a special procedure for the analysis of such functions. Specifying the month automatically adds a year column in which the year is saved, so that aggregations from the same month of different years are not merged. If a programmer wants to aggregate by month alone, regardless of the year, he or she should write @DWDim(Extention="month() only"). The second use of the Extention parameter concerns the customer field, which is a reference to a complex object and is represented on the database side by tuples of the customer table.
This extension tells the generator to create a dw_invoice_by_date_group_customer table in which the customer sums are aggregated by customer group. As before, this is done by triggers created on the dw_invoice_by_date_customer table.

Getting Aggregated Data

The presented extension of Hibernate introduces two methods of access to aggregated data. The first one allows for data retrieval according to the query-by-example (QBE) model. The ncu.Agregations class has a static method qbe, which returns aggregation objects. Its usage is illustrated below for the first example:

    // date1 is set to '2011-07-16', date2 is '2011-07-22'
    ncu.Agregations.qbe(InvoiceLine.class,
        "SUM(quantity)",
        new Object[] {Date.class, Product.class},
        new Object[] {ncu.Agregations.helpers.between(date1, date2)});

This call executes the query:

    SELECT SUM(sum_quantity), date, id_product
    FROM dw_invoiceline_by_product_date
    GROUP BY date, id_product
    HAVING date BETWEEN '2011-07-16' AND '2011-07-22'

This method has four parameters, the last of which is optional. The first one specifies the class that the aggregations relate to. This should be a class that contains the aggregation definitions discussed above. The second parameter specifies the aggregation functions we want to use. For the proposed solution to answer such a query, these functions must be declared in the class definition in the form of @DWAgr(function...) annotations. The third parameter defines the grouping as a list of objects. These can be classes or functions that were indicated in the @DWDim annotations of the base class. The fourth parameter is a list of restrictions. In the example we used a helper function which specifies that dates should be contained in a given range. The call for the second example is:

    ncu.Agregations.qbe(Invoice.class,
        "SUM(quantity*price)",
        new Object[] {"month(date)", CustomerGroup.class},
        new Object[] {});

This call executes the query:

    SELECT SUM(value), date, group_of_customer
    FROM dw_invoice_value_by_group_customer_date
    GROUP BY date, group_of_customer

The second method uses a simple HQL parser implemented by the authors in the ncu.Agregations class, as follows:

    ncu.Agregations.hql(
        "SELECT invoiceLines.product_id, invoice.date, sum(invoiceLines.quantity) "
      + "FROM Invoice invoice JOIN invoice.invoiceLines invoiceLines "
      + "GROUP BY invoice.date, invoiceLines.product_id "
      + "HAVING date BETWEEN '2011-07-16' AND '2011-07-22'");

The parser analyzes the HQL query. If it recognizes that the query can be answered using a prepared aggregation, the qbe method is called. If not, the original Hibernate HQL engine is called.
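The decision between answering from a prepared aggregation and falling back to the native engine can be illustrated with a small routing sketch. It does not parse HQL and is not the authors' implementation: the class `AggregationRouterSketch`, its methods and the idea of matching on a normalized signature of entity, function and dimensions are assumptions made only to show the dispatch logic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical registry deciding whether a request can use an aggregate table. */
final class AggregationRouterSketch {
    private final Map<String, String> tableBySignature = new HashMap<>();

    /** Normalizes a request so that case, spaces and dimension order do not matter. */
    private static String signature(String entity, String function, String... dims) {
        String normalizedFunction = function.replace(" ", "").toUpperCase(Locale.ROOT);
        return entity + "|" + normalizedFunction + "|" + new TreeSet<>(Arrays.asList(dims));
    }

    /** Records that the given aggregate table covers this entity, function and grouping. */
    void register(String table, String entity, String function, String... dims) {
        tableBySignature.put(signature(entity, function, dims), table);
    }

    /** Returns the covering aggregate table, if one was registered. */
    Optional<String> lookup(String entity, String function, String... dims) {
        return Optional.ofNullable(tableBySignature.get(signature(entity, function, dims)));
    }

    /** Describes where the request would be executed. */
    String route(String entity, String function, String... dims) {
        return lookup(entity, function, dims)
                .map(table -> "aggregate:" + table)
                .orElse("fallback:hibernate");
    }

    public static void main(String[] args) {
        AggregationRouterSketch router = new AggregationRouterSketch();
        router.register("dw_invoiceline_by_product_date",
                "InvoiceLine", "SUM(quantity)", "date", "product");
        System.out.println(router.route("InvoiceLine", "sum(quantity)", "product", "date"));
        System.out.println(router.route("InvoiceLine", "AVG(quantity)", "date"));
    }
}
```

An exact-match lookup like this is the simplest possible policy. A fuller implementation could also answer a coarser grouping from a finer aggregate, as the month-level tables in the paper do, but that is beyond this sketch.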

6 Future Work

In future work the authors plan to extend the presented algorithms to work with Oracle and IBM DB2 databases. It is also planned to support aggregation with more statistical functions and to integrate more closely with the HQL language. A major weakness at this stage is the handling of inherited classes. Algorithms supporting aggregations on more complex class diagrams that contain inheritance will be an important challenge for further work.

References

1. Melnik, S., Adya, A., Bernstein, P.A.: Compiling mappings to bridge applications and databases. ACM Transactions on Database Systems (TODS) 33(4), 1–50 (2008)
2. Keller, W.: Mapping objects to tables: A pattern language. In: EuroPLoP (2007)
3. Hibernate, http://www.hibernate.org
4. Burzańska, M., Stencel, K., Suchomska, P., Szumowska, A., Wiśniewski, P.: Recursive Queries Using Object Relational Mapping. In: Kim, T.-h., Lee, Y.-h., Kang, B.-H., Ślęzak, D. (eds.) FGIT 2010. LNCS, vol. 6485, pp. 42–50. Springer, Heidelberg (2010)
5. Szumowska, A., Boniewicz, A., Burzańska, M., Wiśniewski, P.: Hibernate the Recursive Queries - Defining the Recursive Queries Using Hibernate ORM. In: ADBIS 2011 (2011)
6. DeMichiel, L.: Java Specification Requests JSR 317: Java™ Persistence 2.0 (2009), http://jcp.org/en/jsr/detail?id=317
7. Bauer, C., King, G.: Java Persistence with Hibernate. Manning Publications Co., Greenwich (2006)
8. O'Neil, E.J.: Object/relational mapping 2008: Hibernate and the entity data model (EDM). In: Proc. ACM SIGMOD, pp. 1351–1356 (2008)
