Implementing a Query Language for Java Object Database
Emil Wcisło, Piotr Habela and Kazimierz Subieta
Polish-Japanese Institute of Information Technology, Warsaw, Poland
{ewcislo, habela, subieta}@pjwstk.edu.pl
Abstract. Extending the expression or query capability of programming languages is one of the few current directions of improvement that can be considered domain-neutral. When applied to an existing programming environment, such work affects rather foundational elements of the language and thus presents design and integration challenges. This paper describes the implementation of an object-oriented query language integrated as an expression sublanguage into Java and interfacing with the DB4o object-oriented database management system. The language developed offers promising performance, advantageous compared to Native Queries, and enhanced expressive capabilities of a query.
1 The Motivation for a Java-integrated Query Language
Software technologies are constantly facing the pressure for improved productivity and evolvability, so as to keep up with increasing demand for new applications and for their subsequent maintenance. A continuous trend in response to it is the development of gradually more abstract, high-level programming constructs. It might be perceived however that, with the development of contemporary object-oriented programming languages, the remaining potential for further improvements has been largely exploited, as far as general purpose solutions are concerned. Further productivity gains are hence rather expected from domain specific languages [1] backed with frameworks suited to them. However, realizing data-intensive functionality in general purpose programming languages is an example of a domain-neutral task that remains relatively laborious in many of the existing setups. In case of relational data sources this seems to be caused mainly by the heterogeneity between the data models and language constructs of the programming language and of the data storage. On the other hand, the data persistence solutions designed specifically for object-oriented languages make the data access very straightforward, but often lack the query language capability. Our goal is to introduce this capability in a possibly seamless, intuitive and efficient manner.

The benefits of a query language in a homogeneous environment can be observed in the area of relational database management systems (RDBMS). The way data retrieval logic is represented using SQL saves a programmer a significant number of lines of code, at the same time improving the readability. With a minimum programming effort the results of the queries can be consumed by the imperative programming constructs, since the data manipulation statements, local variables, procedure calls, conditional structures, iterations etc. become integrated with the query language [2, 3].

Another remarkable phenomenon is the popularity and industrial adoption of the query language solutions dealing with XML documents [4, 5]. Numerous application programming interfaces to XML support calls of such queries, significantly simplifying the job of retrieving and even transforming XML documents in applications. The emergence of such languages also proves that the concept of query languages is not inherently limited to the relational data model and its formal model, and that powerful query languages for other, more complex data models can be developed, optimized, taught, used etc.

However, the adoption of query language solutions for an object-oriented data model had for a long time been rather weak. This was to some extent caused by the fact that object-oriented database management systems (OODBMS) emerged as solutions providing persistence to existing programming languages, which had their own expression parts. This made a query language perhaps desirable but - at least for certain kinds of applications - not of primary importance. OODBMS such as [6] found their applications especially in areas using large amounts of interconnected data, where direct links between objects offered by the object-oriented model assured high performance of navigation thanks to avoiding the costly join operations imposed by the relational data model. In effect, the majority of such products did not offer a query language, or provided only very simple means of applying selection predicates against persistent object collections. The lack of a query language has been perceived as a weakness of the OODBMS compared to RDBMS, hence attempts were made to develop and standardize such a language.
Several versions of the Object Data Management Group (ODMG) standard have been developed, the final one in 2000 [7]. The standard introduced the Object Query Language (OQL), designed specifically for an object data model based on classes. Apart from some semantic ambiguities, the biggest weakness of that proposal seems to be OQL's attempt to mimic the SQL solutions. It not only repeats SQL's select ... from ... where ... pattern, which is to some extent counterintuitive, non-orthogonal and verbose. The key issue is its repetition of the embedded SQL approach, where the query is provided as a string to a respective programming interface operation of a general-purpose programming language. Hence, the infamous impedance mismatch effect, although much reduced compared to embedded SQL, is not eliminated. While avoided with respect to the data model (types), the impedance mismatch still affects several other aspects including syntax, binding stages, parameter passing mechanism etc. Hence, the amount of heterogeneity imposed by the ODMG standard is disappointingly high, given the fact that the most foundational aspect of the mismatch, that is, the data model difference, does not occur here.

The emergence of LINQ [8] and similar programming language query enhancements has made it possible to remove the key aspects of the impedance mismatch. Increased homogeneity of the syntax, elimination of the "boilerplate" code and, especially, covering the queries with compile-time type checking provide a promising foundation for productivity improvements. At the same time however, there are still many challenges for the query language design and, especially, for its actual realization against particular data persistence tools. A particular issue is performance, which demands elimination of unnecessary query processing steps and value retrievals, while assuring intuitive and consistent ways of updating objects.

The prototype described in this paper provides a Java integrated query language designed to process volatile data collections, as well as the persistent objects stored in the DB4o OODBMS [9]. Among the abovementioned technologies, the concept of the language is closest to LINQ. However, both the syntax and, especially, the underlying semantic formalism constitute a completely different, original concept based on the Stack Based Architecture (SBA) approach [10]. Although a single target platform is presented here, the pattern followed in this implementation, including the operational semantics, abstract storage model, syntactic integration and query optimization, is applicable also to other data source technologies and to other similar languages being integrated into a programming language environment.

The paper is organized as follows. Section 2 illustrates the core differences between various query language solutions from the application programmer's point of view. In section 3 the main externally observable features of the language are described. Section 4 presents the design solutions of the language. The features of the language, including the performance, are compared against other query languages in section 5. Section 6 states conclusions and outlines the future work.

2 Existing Solutions - Advantages and Issues
The integration of query languages with programming languages keeps evolving towards better productivity and intuitiveness. The notable steps on this way are briefly presented and compared in this section.

Java Database Connectivity (JDBC) is a mature standard of a programming interface representing the classical embedded SQL approach. Since it was developed to integrate, with simple means, two already established and radically different technologies, it is no surprise that it suffers from many aspects of impedance mismatch and leaves all the burden of relational-to-object mapping to the programmer. As a typical string-based query interface, it does not perform static syntactic or type checking of the query code. For parameterized queries, the query code needs to use special syntax to denote the place within it that will be filled with the actual parameter value. The value of the parameter is to be provided by a separate statement, while the validity of the actual mapping is not verified at compile time. Result instances can be retrieved by an iterator, while particular fields are retrieved individually and cast onto appropriate types.

A significant step towards improved productivity is represented by the Hibernate object-relational mapping framework, which uses a proprietary query language, HQL. Here, the transformation of the result from relational to object-oriented structures is performed automatically, based on the configuration given by the developer in XML or directly inside Java code, which saves many lines of code and makes the code easier to understand. However, the drawbacks of string-based query interfaces are still present here: compile time query validation is absent and an explicit binding of query parameters is necessary.

In case the persistence layer is realized by an ODBMS supporting exactly the same data model as the programming language under consideration, the problems of integration are reduced. DB4o is an example of such a system. Aware of the merits of query languages on the one hand, and the issues of the impedance mismatch on the other, the authors of the system designed a query interface that avoids embedding queries as strings [11]. As its name - Native Queries - suggests, the interface is arranged so as to allow building queries using the constructs of the host programming language exclusively. In case of the Java binding this is realized through creating anonymous classes for particular queries. Thanks to queries being the language's native construct, the type checking can be performed at compile time¹ and the parameters are directly consumed by the query - without the need of binding them with individual statements. However, due to a limited number of query operators available in that interface, it is often necessary to decompose a query into more than one database invocation. In the simple example below (return employees of the age higher than average and the salary lower than 2000), it was necessary to precede the main query with an earlier class extent retrieval and an iteration over it on the side of the client application, to calculate the average age that was subsequently used inside the query.
    // First database call: retrieve the whole extent of Emp.
    ObjectSet<Emp> res1 = dbConn.query(Emp.class);
    double ageSum = 0.0;
    for (Emp e : res1) {
        ageSum += e.getAge();
    }
    final double ageAvg = ageSum / res1.size();
    // Second database call: the actual selection, using the average computed above.
    ObjectSet<Emp> res2 = dbConn.query(new Predicate<Emp>() {
        public boolean match(Emp e) {
            return e.getSalary() < 2000 && e.getAge() > ageAvg;
        }
    });

3 The Concept and Features of the Integrated Query Language
Based on the above overview, it was possible to formulate the goal of our research, listing the features that a programming language integrated with a query language dealing with persistent object data should possess. Clearly, regarding the language interface, a homogeneous design that avoids the impedance mismatch, as exemplified by the Native Queries, is desirable. The set of operators supported needs to be rich enough to avoid redundant steps and intermediate retrievals of data that does not actually need to be consumed at the application side. Treating the queries as the host language's native element brings a very valuable feature: the ability to allow them in all the contexts where the traditional expressions of that language are applicable. This means the ability to use the queries to:

- initialize variables - both single-valued and collection-typed,
- specify return values,
- pass arguments to method calls,
- specify the values to be assigned,
- retrieve the data item to be the subject of an update (e.g. an assignment's l-value), etc.

¹ In consequence, the programming environment assistance for code completion, validation and refactoring becomes more feasible as well.
Contrary to traditional database query languages, such query expressions should be uniformly applicable both to persistent and volatile data. Another feature, rather natural for native expressions of an OO language but different from embedded SQL solutions, is the updateability of the query result. Queries need to be capable of returning references to objects rather than merely copies of their stored values, so as to make it possible to update the database content and invoke behavior against it.

The abstract data model for Java (JAS0) used for the query semantics definition has been formulated as a modification of the reference AS0 model of SBA [10]. Based on its constructs, the semantics of the respective query operators has been defined operationally. The model consists of the following entities:

- Object = ordered pair <F, M>, where F is the set of references to object fields (or sub-objects) and M is the set of methods of the object.
- Object reference = ordered pair <n, o>, where n is the external name and o is the object to which the reference points.
- Class = ordered triple <F, M, C>, where F is the set of references to static fields of the class, M is the set of static methods of the class and C is the set of class constructors (special methods used to create instances of the class).
- A method or class constructor is described by its name, return type and ordered set of parameters.
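The JAS0 entities listed above can be pictured with a small Java sketch. This is an illustration only, not the data structures of the prototype: the names Jas0Sketch, ObjectRef, Jas0Object, Jas0Class and the sample class Emp are hypothetical, and methods are represented merely by their names. Reflection is used to derive the <F, M> and <F, M, C> views from an ordinary Java object and class; only members declared directly in the class are considered.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class Jas0Sketch {
    /** Object reference = <n, o>: external name n and the object o it points to. */
    public record ObjectRef(String name, Object target) {}

    /** Object = <F, M>: references to fields and (names of) instance methods. */
    public record Jas0Object(List<ObjectRef> fields, List<String> methods) {}

    /** Class = <F, M, C>: static field references, static methods, constructors. */
    public record Jas0Class(List<ObjectRef> staticFields,
                            List<String> staticMethods,
                            List<String> constructors) {}

    /** Builds the <F, M> view of a single object (declared members only). */
    public static Jas0Object describeObject(Object o) {
        List<ObjectRef> fields = new ArrayList<>();
        List<String> methods = new ArrayList<>();
        for (Field f : o.getClass().getDeclaredFields()) {
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) continue;
            fields.add(new ObjectRef(f.getName(), read(f, o)));
        }
        for (Method m : o.getClass().getDeclaredMethods()) {
            if (m.isSynthetic() || Modifier.isStatic(m.getModifiers())) continue;
            methods.add(m.getName());
        }
        return new Jas0Object(fields, methods);
    }

    /** Builds the <F, M, C> view of a class (declared static members only). */
    public static Jas0Class describeClass(Class<?> c) {
        List<ObjectRef> staticFields = new ArrayList<>();
        List<String> staticMethods = new ArrayList<>();
        List<String> constructors = new ArrayList<>();
        for (Field f : c.getDeclaredFields()) {
            if (f.isSynthetic() || !Modifier.isStatic(f.getModifiers())) continue;
            staticFields.add(new ObjectRef(f.getName(), read(f, null)));
        }
        for (Method m : c.getDeclaredMethods()) {
            if (m.isSynthetic() || !Modifier.isStatic(m.getModifiers())) continue;
            staticMethods.add(m.getName());
        }
        for (Constructor<?> k : c.getDeclaredConstructors()) {
            if (!k.isSynthetic()) constructors.add(k.toGenericString());
        }
        return new Jas0Class(staticFields, staticMethods, constructors);
    }

    // Reads a field value; owner is null for static fields.
    private static Object read(Field f, Object owner) {
        try {
            f.setAccessible(true);
            return f.get(owner);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical sample class used only for this illustration. */
    public static class Emp {
        public static int created = 0;
        private final String name;
        private final int age;

        public Emp(String name, int age) {
            this.name = name;
            this.age = age;
            created++;
        }

        public int getAge() { return age; }

        public static int getCreated() { return created; }
    }

    public static void main(String[] args) {
        System.out.println(describeObject(new Emp("Ann", 41)));
        System.out.println(describeClass(Emp.class));
    }
}
```

Note that, in line with the model, the object itself carries no name: names appear only in the references that point to it.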
The main difference between the JAS0 and AS0 models is the lack of an object name in our approach. This is dictated by the referential data model of the Java language and other modern programming languages. This data model is universal both for Java objects stored in memory and in a DB4o database, and is expressive enough to cover other similar data models. In non-referential data models, such as the relational model, the lack of an object name in JAS0 is easy to work around by assuming the existence of a virtual root object that leads to all root objects in the given model (e.g. to tables in the relational model, or to elements in XML).

Query operators should be orthogonal to each other, allowing all the meaningful combinations and arbitrarily complex nesting of subqueries. As in the case of any high-level query language, query optimization is necessary. Static analysis makes it possible to apply, as the primary means, query rewriting (to eliminate redundant subexpression evaluations) and the use of indices (to speed up the retrieval).
It is necessary to emphasize the need for a two-way nature of the query-programming language integration. That means that not only is the host language to consume query results, but also queries should be capable of calling methods to consume their results, as well as of creating new instances. The proposed language respects the encapsulation implied by the class definition. Non-public members can be queried externally only if public getter operations are defined for them. In that case the query may refer simply to the member name rather than use an explicit operation call syntax.

Considering calls to other, arbitrary object operations inside queries brings up the issue of side effects. Including such method calls should be restricted for several reasons. Firstly, as observed in [11], updates to persistent objects caused by query evaluation complicate the transaction management. Secondly, it is useful to treat method calls and attribute reads uniformly inside a query as the subjects of optimization. Hence, the number of calls of a given method can be different than expected by a programmer, due to the use of query rewriting optimization techniques.

The complete integration also implies unified type checking, which makes it possible to validate the query and its surrounding application code against each other. That means that the validity of the query is checked at compile time against the environment it refers to, and - on the other hand - the type of the query result is checked for its compliance with the code consuming it on the programming language side.

Realizing those features required an unambiguous definition of the query operators' semantics, using a formalism that intuitively maps onto the implementation constructs and easily matches the context established by the surrounding programming language. For those reasons the Stack Based Architecture (SBA) [10] has been chosen.
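The encapsulation rule described above (a member name in a query binds either to a public field or to a public getter, and otherwise is not visible) can be illustrated with the following sketch. It is not the binding mechanism of the prototype, which resolves names statically at compile time; reflection is used here only to keep the example short. The class MemberBindingSketch, the method resolve and the sample class Emp are hypothetical, and the getX naming convention for getters is an assumption.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class MemberBindingSketch {
    /** Hypothetical sample class: one public field, one private field with a getter, one without. */
    public static class Emp {
        public String name;
        private final int age;
        private final double salary; // no public getter: must stay invisible to queries

        public Emp(String name, int age, double salary) {
            this.name = name;
            this.age = age;
            this.salary = salary;
        }

        public int getAge() { return age; }
    }

    /**
     * Resolves a member name used in a query against a target object:
     * first as a public field, then as a public no-argument getter.
     */
    public static Object resolve(Object target, String member) {
        Class<?> type = target.getClass();
        try {
            Field field = type.getField(member); // finds public fields only
            return field.get(target);
        } catch (NoSuchFieldException e) {
            // Not a public field: fall through and try a public getter.
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        String getter = "get" + Character.toUpperCase(member.charAt(0)) + member.substring(1);
        try {
            Method method = type.getMethod(getter); // finds public methods only
            return method.invoke(target);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Member not accessible: " + member);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Emp emp = new Emp("Ann", 41, 1500.0);
        System.out.println(resolve(emp, "name")); // bound to the public field
        System.out.println(resolve(emp, "age"));  // bound to getAge()
    }
}
```

In this sketch a reference to salary is rejected, since neither a public field nor a public getter exists for it.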
The object store model, an element of the SBA formalism, has been adapted to match the Java object model. Also the language's concrete syntax is based on the reference syntax of SBQL, which assures modularity, orthogonality and minimum syntactic sugar. It builds on the previous development, which provided similar functionality for Java volatile objects - the SBQL4J language [12]. Hence the new language was named SBQL4db4o. The language offers a considerable expressive power. The following operators are currently implemented and can be used inside the queries.

Arithmetic and algebraic: AND, NOT, instanceof, +, -, *, /, %, ==, !=, >, [...]

[...] avg(Emp.age)) };
An explicit marking of the query part inside the #{} delimiters makes the syntactic layer integration not fully seamless. However, it avoids compromises in the query syntax and allows the syntax to be kept minimal, relative and orthogonal. At the same time it does not seem to incur particular inconvenience, as the main aspects of impedance mismatch are avoided.
On the other hand, thanks to integration with the Java compiler, the static analysis of the query is performed with the full knowledge of the environment established by the enclosing Java code. The type checking of the query is performed in accordance with the SBA principles. Apart from detecting the type errors in a query, this step also makes it possible to perform optimizations, as described in the further part of this section. The queries referring to a DB4o database are evaluated in the context of the standard DB4o connection object:

    ObjectContainer db4oConn = getDb4oConnection();
    Collection result = #{ db4oConn.(...) };
It is also possible to make multiple calls to DB4o in a single SBQL4J query and to return a unified result, e.g.:

    ObjectContainer db4oConn1 = getDb4oConnection();
    ObjectContainer db4oConn2 = getDb4oConnectionToAnotherDB();
    Collection result = #{ db4oConn1.(...) union db4oConn2.(...) };
Before transforming the query code to respective Java statements performing data retrieval, optimization routines need to be performed. Primarily, the redundant data retrievals need to be eliminated. This involves the so-called dead subquery removal (to skip the evaluation of query parts irrelevant from the point of view of the final result) and the extraction of independent subqueries ahead of operators that involve iterations [10]. As an example of the latter let us consider the following query (enclosing markup skipped for brevity):

    db4oConn.(Emp where worksIn == (getDepts() where name=="Sales"))
In a straightforward realization of this query (that is, performed exactly as the operational semantics of the respective operators specifies), the getDepts() operation call and the selection clause following it would be evaluated multiple times, namely once per each Emp object being tested. In the course of the static analysis the query optimizer transforms the abstract syntax of this query to:

    (getDepts() where name=="Sales") group as aux0 .(db4oConn.(Emp where worksIn == aux0))
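The effect of extracting an independent subquery can be sketched in plain Java over in-memory collections. The code below is an illustration under stated assumptions, not the code generated by the optimizer: the class SubqueryHoistingSketch, the records Dept and Emp and the method names are hypothetical, the comparison worksIn == (...) is modelled as membership in the subquery result, and an explicit counter is added only to make the number of subquery evaluations observable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SubqueryHoistingSketch {
    public record Dept(String name) {}
    public record Emp(String name, Dept worksIn) {}

    /** Stand-in for the subquery (getDepts() where name == "Sales"); counts its evaluations. */
    public static List<Dept> salesDepts(List<Dept> depts, AtomicInteger evaluations) {
        evaluations.incrementAndGet();
        List<Dept> result = new ArrayList<>();
        for (Dept d : depts) {
            if ("Sales".equals(d.name())) {
                result.add(d);
            }
        }
        return result;
    }

    /** Straightforward evaluation: the subquery is re-evaluated for every Emp tested. */
    public static List<Emp> naive(List<Emp> emps, List<Dept> depts, AtomicInteger evaluations) {
        List<Emp> result = new ArrayList<>();
        for (Emp e : emps) {
            if (salesDepts(depts, evaluations).contains(e.worksIn())) {
                result.add(e);
            }
        }
        return result;
    }

    /** Rewritten evaluation: the independent subquery is computed once and named aux0. */
    public static List<Emp> hoisted(List<Emp> emps, List<Dept> depts, AtomicInteger evaluations) {
        List<Dept> aux0 = salesDepts(depts, evaluations);
        List<Emp> result = new ArrayList<>();
        for (Emp e : emps) {
            if (aux0.contains(e.worksIn())) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Dept sales = new Dept("Sales");
        Dept research = new Dept("Research");
        List<Dept> depts = List.of(sales, research);
        List<Emp> emps = List.of(new Emp("Ann", sales), new Emp("Bob", research), new Emp("Eve", sales));

        AtomicInteger naiveCount = new AtomicInteger();
        AtomicInteger hoistedCount = new AtomicInteger();
        System.out.println(naive(emps, depts, naiveCount) + " evaluations: " + naiveCount);
        System.out.println(hoisted(emps, depts, hoistedCount) + " evaluations: " + hoistedCount);
    }
}
```

Both variants return the same employees; they differ only in how many times the subquery is evaluated. This is also why a method called inside a query should be free of side effects: after rewriting, it may be invoked fewer times than the query text suggests.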
Here the subquery is evaluated once and its result is given an auxiliary name, subsequently used in the optimized query. Another kind of optimization realized in our prototype employs the mechanisms specific to the DB4o system - namely, the dense indices that may be created for selected fields in given classes. Retrieval of the information on the availability of the particular indices can be performed using the public programming interface of DB4o. The retrieved metadata is placed in an XML file that is subsequently used by the optimizer routine. Let us consider the following query involving a value equality predicate:

    db4oConn.(Emp where age == 30)
In case the index for the attribute Emp.age exists in the database, the optimizer would transform the query so as to use it:

    db4oConn.Emp_ByIndex[age](30)
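The difference between the two forms of the query can be pictured with the following sketch, in which a hash map stands in for the database index. This is an assumption made purely for illustration: it does not use or reproduce the DB4o index structures or API, and the names IndexLookupSketch, scanByAge, buildAgeIndex and byIndex are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexLookupSketch {
    public record Emp(String name, int age) {}

    /** Unoptimized form (Emp where age == 30): every object of the extent is tested. */
    public static List<Emp> scanByAge(List<Emp> extent, int age) {
        List<Emp> result = new ArrayList<>();
        for (Emp e : extent) {
            if (e.age() == age) {
                result.add(e);
            }
        }
        return result;
    }

    /** Builds a simple in-memory index: age value -> employees having that age. */
    public static Map<Integer, List<Emp>> buildAgeIndex(List<Emp> extent) {
        Map<Integer, List<Emp>> index = new HashMap<>();
        for (Emp e : extent) {
            index.computeIfAbsent(e.age(), k -> new ArrayList<>()).add(e);
        }
        return index;
    }

    /** Optimized form: a single lookup by key replaces the iteration over the extent. */
    public static List<Emp> byIndex(Map<Integer, List<Emp>> index, int age) {
        return index.getOrDefault(age, List.of());
    }

    public static void main(String[] args) {
        List<Emp> extent = List.of(new Emp("Ann", 30), new Emp("Bob", 45), new Emp("Eve", 30));
        Map<Integer, List<Emp>> ageIndex = buildAgeIndex(extent);
        System.out.println(scanByAge(extent, 30));
        System.out.println(byIndex(ageIndex, 30));
    }
}
```

Both calls yield the same objects; the indexed variant avoids visiting objects that cannot match, which matters most when the extent is large and the predicate is selective.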
After the type checking and optimization, the query may be finally transformed into the data retrieval statements using relatively simple code generation rules (as the query language implementation produces Java source code rather than e.g. bytecode). Each subquery that deals with the database (i.e. that is associated with the dbConn.(...) expression context) becomes transformed into a separate Java class implementing the interface Db4oSBQLQuery<R>, where R is the query result type determined in the course of the static query analysis. In this form the query is provided to the database, together with all the necessary parameters. A new method extending the DB4o interface:

    public <R> R query(Db4oSBQLQuery<R> query)

realizes that task and invokes the query processing. Creating and performing new queries does not require a restart of the database, thanks to the dynamic class loading mechanism in Java.

For performance reasons it is desirable to run the optimized query in a possibly direct fashion against the database's data store, to avoid the burden of additional transformations. In case of our prototype, it has been chosen to realize it by directly invoking the operations that manipulate the DB4o object store. This is not a part of the DB4o public interface. However, this solution provides the necessary flexibility and satisfactory performance. To illustrate the last step of the query generation process, and to show the differences in the level of abstraction, let us consider the following sample query:

    db4oConn.(Emp where getAge() > 30)
This would be transformed into the following Java method, working directly against the data store (it becomes lengthy because of the need of dealing with the lazy references mechanism of DB4o).

    public java.util.Collection<Emp> executeQuery(
            final ObjectContainerBase ocb, final Transaction t) {
        final LocalTransaction transLocal = (LocalTransaction) t;
        // Load the extent of class Emp through lazy references.
        final java.util.Collection<Emp> _ident_Emp = new java.util.ArrayList<Emp>();
        ClassMetadata _classMeta2 = ocb.classCollection()
                .getClassMetadata("Emp");
        long[] _ids2 = _classMeta2.getIDs(transLocal);
        for (long _id2 : _ids2) {
            LazyObjectReference _ref2 =
                    transLocal.lazyReferenceFor((int) _id2);
            _ident_Emp.add((Emp) _ref2.getObject());
        }
        // Evaluate the predicate "getAge() > 30" for each element.
        java.util.Collection<Emp> _whereResult = new java.util.ArrayList<Emp>();
        int _whereLoopIndex = 0;
        for (Emp _whereEl : _ident_Emp) {
            if (_whereEl == null) {
                continue;
            }
            if (_whereEl != null) {
                ocb.activate(_whereEl, 1);
            }
            java.lang.Integer _mth_getAgeResult = _whereEl.getAge();
            if (_mth_getAgeResult != null) {
                ocb.activate(_mth_getAgeResult, 1);
            }
            Boolean _moreResult = (_mth_getAgeResult == null)
                    ? false : (_mth_getAgeResult > 30);
            if (_moreResult) {
                _whereResult.add(_whereEl);
            }
            _whereLoopIndex++;
        }
        pl.wcislo.sbql4j.db4o.utils.DerefUtils.activateResult(
                _whereResult, ocb);
        return _whereResult;
    }
The code shows how important for the performance it may be to remove unnecessary actions from the iterations implied by query operators, as well as to reduce the initial object set that is the subject of the iteration.

5 Expressive Power and Performance Comparison
Compared to other technologies for querying persistent data from Java, outlined in section 2, our proposal assures the most concise syntax, avoidance of the impedance mismatch and good performance of query execution, at the same time offering strong static type checking. Among the other such solutions available in Java, the Native Queries seem to be the most advanced one. Its advantages include: strong static type checking, direct availability of externally defined parameters inside the query, and using native constructs and syntax of the host programming language.
On the other hand, the Native Queries have some drawbacks: only simple, selection-style kinds of queries are supported, composite operators (like avg, max etc.) are missing, and more complex queries require multiple calls to the database.
SBQL4db4o addresses those issues, which allows us to consider it advantageous compared to the Native Queries. The syntactic integration is performed at the cost of explicitly delimiting the query part. However, this does not entail any further limitations: the SBQL4db4o queries may occur in any place where a Java expression is allowed. At the same time it has given more flexibility in the query syntax design, allowing, among others, to make it more concise and orthogonal than e.g. LINQ.

Also the query evaluation performance results are encouraging. The following performance tests² have been performed in order to provide a realistic view of the practical usability of our prototype. We used the PolePosition benchmark [14] as the basis for our tests. They were made to compare the performance of SBQL4db4o queries with the native SODA queries to which Native Queries are translated. We have tested the queries both with and without the use of indices. The testing model is a simple class with the members _int and _string, of types Integer and String respectively. In the first test (Fig. 1) the time results are shown for querying 300, 1000 and 3000 randomly generated objects. In the tests not using indices, each query was repeated 1000 times in a single test in order to establish a reliable result. The figure provides the run time comparison.
Fig. 1. Performance comparisons - query involving strings, no indices used
The second test (Fig. 2) involves a similar query with the search based on an integer value. In the second part of the performance test we use indexed queries. Because indices improve performance very significantly, we have used more data (300000, 1000000 and 3000000 objects) and we have performed the tests more times (3000 times) in order to provide more reliable results.

² The test environment configuration was the following. CPU: Intel Core i7-2720QM @ 2.20 GHz. RAM: 8 GB DDR3 non-ECC. HDD: Seagate ST9500420 500 GB (no RAID installed). OS: Windows 7 Professional 64-bit.
The other pair of tests (Fig. 3, 4) involves the same queries, this time evaluated with the support of indices. In Fig. 3 the selection dealing with strings is used. The other test (Fig. 4) selects the objects based on their Integer type member value.

Fig. 2. Performance comparisons - query involving integers, no indices used

Fig. 3. Performance comparisons - query involving strings, using indices

As we can see, SBQL4db4o queries show a slight performance advantage (9% - 40%) over native SODA queries. The improvement was observed because SBQL4db4o queries are compiled to native Java code, in contrast to SODA queries, which are interpreted by the internal engine of the DB4o database. This approach spares the query engine costly and ineffective operations such as reading data with Java reflection. The approach of compiling queries integrated with the Java language was extensively described in [13]. The most significant speed improvement has been achieved in queries that find objects using indices based on an integer-type field. This is a very widely used type of query, due to the common use of identifier attributes of numeric type. The result is promising, especially considering that our solution is a prototype.
Fig. 4. Performance comparisons - query involving integers, using indices

6 Conclusions and Future Research
The paper presented a prototype of a Java integrated query language that extends the Java capability of expression construction, and focused on the issues of adapting such a language for supporting high-level, declarative, seamless and type safe queries to an OODBMS - namely DB4o. The functionality under consideration belongs to the foundational elements of both the technologies being extended: Java and DB4o. Hence it is not surprising that the integration had to be performed in a relatively tight fashion, in some places going beyond the public interfaces foreseen and provided by the tool vendors. The pattern followed in this development can be applied to other object database environments, provided that the respective metadata are made available by their interfaces.

With respect to most of the aspects, the integration of the proposed query language with Java can be considered seamless, as the particular kinds of impedance mismatch are eliminated. An exception to this is the concrete syntax, due to the inherent differences in its style, additionally augmented by the explicit markup delimiting the query parts inside the regular Java code. As we argue however, apart from perhaps some aesthetic concerns, this should not undermine the productivity and usability of the language. The prototype is a usable solution, made publicly available as open source software [12].

There are many directions of further research currently being explored. Other optimization techniques based on query transformations (e.g. dealing with extracting so-called weakly dependent subqueries [15]) have been designed and await their implementation in the prototype. Another topic refers to the concrete syntax - in terms of its improvement and possibly a smoother integration with Java on the one hand, and the extensibility mechanism for incorporating application specific functionality on the other. Apart from the persistent objects, integration with other data sources is under design. This includes dealing with the challenges of the XML data model and applying the SBQL language approach to it in the Java context. Finally, to provide the language with more abstraction, especially in the applications dealing with data integration, the work on providing the language with an updateable virtual views [16] mechanism has been initiated.

References
1. M. Fowler: Domain-Specific Languages. Addison-Wesley 2010
2. MSDN Library. Transact-SQL Reference (Database Engine). SQL Server 2008 R2. Microsoft 2010. http://msdn.microsoft.com/en-us/library/bb510741.aspx
3. Oracle. Oracle Database SQL Language Reference 11g Release 1 (11.1). B28286-06. August 2010.
4. World Wide Web Consortium. XML Path Language (XPath) 2.0. W3C Recommendation 14 December 2010. http://www.w3.org/TR/xpath20/
5. World Wide Web Consortium. XQuery 1.0: An XML Query Language. W3C Recommendation 14 December 2010. http://www.w3.org/TR/xquery/
6. Objectivity: Objectivity for Java Programmer's Guide, Release 7.0. Objectivity, Inc. 2001.
7. R. Cattell, D. Barry: The Object Data Standard: ODMG 3.0. Morgan Kaufmann 2000.
8. LINQ (Language-Integrated Query) website. Microsoft, 2011. http://msdn.microsoft.com/en-us/library/bb397926.aspx
9. DB4objects website. Versant, 2011. http://www.db4o.com/
10. Stack-Based Architecture (SBA) and Stack-Based Query Language (SBQL) website, Polish-Japanese Institute of Information Technology, 2011. http://www.sbql.pl
11. W.R. Cook, C. Rosenberger: Native Queries for Persistent Objects, A Design White Paper. Dr. Dobb's Journal (DDJ), February 2006. http://drdobbs.com/database/184406432
12. SBQL4J - Stack-Based Query Language for Java website. http://code.google.com/p/sbql4j/
13. E. Wcisło, P. Habela, K. Subieta: Stack Based Query Language for Java. A. Abd Manaf et al. (Eds.): ICIEIS 2011, Part I, CCIS 251, pp. 589-603. Springer 2011
14. PolePosition - the open source database benchmark. http://www.polepos.org
15. M. Bleja, T.M. Kowalski, K. Subieta: Optimization of Object-Oriented Queries through Rewriting Compound Weakly Dependent Subqueries. Database and Expert Systems Applications, 21st International Conference, DEXA 2010, Bilbao, Spain, August 30 - September 3, 2010, Proceedings, Part I. Lecture Notes in Computer Science 6261, Springer 2010, ISBN 978-3-642-15363-1, pp. 323-330
16. R. Adamus, K. Kaczmarski, K. Stencel, K. Subieta: SBQL Object Views - Unlimited Mapping and Updatability. Proceedings of the First International Conference on Object Databases, ICOODB 2008, Berlin 13-14 March 2008, ISBN 078-7399-412-9, pp. 119-140