
Jan 28, 2004
Generic Query Toolkit: A Query Interface Generator Integrating Data Mining Lichun Zhu, C.I. Ezeife, R.D. Kent [email protected], [email protected], [email protected] School of Computer Science, University of Windsor Windsor, Ontario, Canada N9B 3P4

Abstract
The construction of flexible query interfaces constitutes a very important part of the design of information systems. The major goal is that new queries can easily be built by either the developers or the end-users of information systems. Some information systems provide a list of predefined queries, and any additional queries must be constructed from scratch. Thus, the low degree of reusability of query modules is a limitation of the database query and report systems that these information systems are based on. This paper presents the Generic Query Toolkit, a software package that automates the query interface generation process. It consists of a parser and an interpreter for a newly defined Generic Query Script Language, a background query processing unit, a presentation layer service provider and the presentation layer component. A data mining querying feature has been integrated into this query language. Future work will integrate more data mining querying and other advanced features.

Keywords: Business Intelligence, Query Automation, Data Mining

1. Introduction
The original idea for developing the Generic Query Toolkit (GQT) came from projects building data marts and report systems for business clients. In these projects, users' requirements (business logic) change constantly, so there is a need to build fast prototypes to speed up the communication cycle between developers and end users. To meet these requirements, we developed a software solution that automates the query interface generation process, making prototyping more efficient. In this solution, we defined a SQL-like query language (called Generic Query Language, or GQL). A language parser parses the GQL script and generates inner object structures that represent the query interface, such as criteria input fields, display attributes and so on. These inner object structures are serialized into an XML schema and stored in the database. The query toolkit generates the query interface based on this schema, and then binds the end users' input to generate sequences of target GQL statements. These statements are then processed by an interpreter to generate the final query results. Finally, a set of presentation tools renders the result to the end user in an interactive way. The original version of GQL supported only SQL statements. Recently we have added control flow statements, variable declaration statements and others to make it a functional script language. We also added a set of language features to support XML-based dataset manipulation and data mining functionality. The ultimate goal of the GQT system is to provide solutions in the Business Intelligence area. Business Intelligence (BI) [1] software typically includes data warehousing, data mining, analysis, reporting and planning capabilities. BI helps decision makers utilize their data and make decisions more efficiently and accurately.
To construct a BI solution, various technologies can be integrated, such as Online Analytical Processing (OLAP), data warehousing, data mining, Decision Support Systems (DSS) and scheduling/workflow control. Data mining is a very important part of a BI solution because it turns data into knowledge by extracting useful patterns or rules from vast amounts of data, using various algorithms. The GQT system proposed in this paper provides solutions for quickly building queries for end-users. It also integrates data mining support at the language level. Compared with commercial BI solutions such as BusinessObjects [9] and Cognos [10], our method is fairly lightweight and can be integrated with software projects of various scales.

In fact, the original version of this toolkit has already been integrated with information systems ranging from small personal desktop Management Information Systems to large commercial distributed data marketing / data warehouse systems, and from fat-client applications to web-based applications. The rest of the paper is organized as follows: Section 2 summarizes relevant related work and the paper's contributions. Section 3 presents the design of the GQT and its script query language, GQL. Section 4 discusses the integration of data mining querying and algorithms into this toolkit. Section 5 presents the testing environment. Section 6 discusses future work and Section 7 presents conclusions.

2. Related Work
There are many commercial BI solutions available now, such as BusinessObjects, Cognos and the Oracle Business Intelligence Suite [11]. However, most current commercial BI tools have the following weak points:
1. They are highly complicated systems with a steep learning curve.
2. They are expensive choices for small projects, due to the high cost of the software and the expense of customization and training.
The open source resources related to our work are:
• The Pentaho Business Intelligence Project [2]: This system provides a complete open source BI solution that integrates other open source components within a process-centric, solution-oriented framework.
• Mondrian OLAP server [15]: an open source OLAP server written in Java. It serves as a component of the Pentaho project and supports the Multi-Dimensional Expressions (MDX) [14] query language for performing OLAP queries.
• JPivot project [16]: JPivot is a JSP custom tag library that renders an OLAP table and lets users perform typical OLAP navigations like slice and dice, drill down and roll up. It uses Mondrian as its OLAP server. It also supports the XML for Analysis (XMLA) standard [7], an open industry-standard web service interface designed specifically for OLAP and data mining functions.
• Weka Data Mining project [3]: WEKA is a collection of machine learning algorithms for data mining tasks. It provides a user interface that can be applied directly to a dataset for data analysis, as well as a Java library that can be called from our own Java code. Section 4 provides more discussion of WEKA.
These open source projects provide insights that may be useful in the development of the GQT. Another related work is the QURSED system [4], which uses an XML-based query set specification and XHTML to generate Web-based query forms and reports (QFRs) for semistructured XML data.
The difference is that QURSED uses a compiler to convert the XML schema into dynamic server pages and generates queries in XQuery syntax, while GQT uses an interpretive approach to render the dynamic user interface and maintains the target query language, GQL, as a script consisting of SQL and new GQL statements.

2.1. Contributions
This paper proposes GQT, a query automation tool, and its script language GQL. The GQT generates query forms and report pages and focuses on automating the whole query process, from user input and query submission to result presentation. Three contributions of the GQT system that distinguish it from existing work are:
1. Loose coupling of the logical and visual parts: By using the GQL script language to define the query process, users can focus on the logic and need not be concerned with how the query form is generated and how the result is displayed. The coupling between the logical and the visual part is loose, simple, and easy to build.
2. Powerful functionality provided by the script: The GQL script language provides easy-to-understand statements for building user query logic. It also provides powerful interfaces for users to manipulate datasets and invoke other services like data mining and OLAP. The features provided by GQL are still being expanded.
3. Background query model: GQT parses the user-defined query process, which is in GQL syntax, and transforms it into an XML schema that contains information about the query form, result form and user input. The XML schema is nested in queued tasks that are persistent during the whole life-cycle of a submitted query and can be shared by different query processing stages. This model allows submitted queries to run in the background, and different parts of the system to collaborate with each other using the GQL script and the generated XML schema. Thus, end-users can do other tasks while waiting for the results of their GQT queries. This feature is especially useful for long-running queries.
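The background query model above can be illustrated with a small sketch. This is not GQT's actual code: the class name QueryTask, its fields and its status transitions are hypothetical, showing only the idea of a persistent queued task that carries the GQL script and the generated XML schema through its life-cycle.

```java
// Hypothetical sketch of a queued query task; names are illustrative,
// not taken from the GQT implementation.
public class QueryTask {
    public enum Status { WAITING, RUNNING, DONE, FAILED }

    private final String gqlScript;  // the user-defined GQL query process
    private final String xmlSchema;  // serialized query-form/result schema
    private Status status = Status.WAITING;

    public QueryTask(String gqlScript, String xmlSchema) {
        this.gqlScript = gqlScript;
        this.xmlSchema = xmlSchema;
    }

    // The daemon picks up WAITING tasks and moves them through the life-cycle.
    public void start()  { if (status == Status.WAITING) status = Status.RUNNING; }
    public void finish() { if (status == Status.RUNNING) status = Status.DONE; }

    public Status getStatus()    { return status; }
    public String getGqlScript() { return gqlScript; }
    public String getXmlSchema() { return xmlSchema; }
}
```

Because the task object persists for the whole life-cycle, the interface generator, the background daemon and the presentation layer can all share the same script and schema.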


3. The Design of Generic Query Toolkit
The GQL script language syntax is presented in Section 3.1, while the GQT architecture and components are discussed in Section 3.2.

3.1. GQL Language Specification
By defining a script language, we let users customize their query process in that script. The GQL script language proposed in this paper, for integrating front-end data retrieval services like OLAP and data mining with back-end database systems, can specify two broad tasks:
1. User interface definition: Users can describe data presentation patterns as Field Attributes and define query criteria as Condition Attributes in the script.
2. Process specification: Users can control business workflow and invoke various services (e.g., data mining services like Classification and Association Rule mining using WEKA or other mining algorithms, and SQL statements) using the script.
The syntax to define a data presentation pattern, or display attribute list (Field Attribute), is a collection of semicolon-delimited fields enclosed in curly brackets "{}":
Field Attribute ::= {Field Name; Field Description; Field Type; Display Attribute [; [Aggregate Attribute]; [Key Attribute]]}
Field Name is the unique name of the column to be displayed. Field Description is the display label of the column. Field Type is the SQL datatype of the column. Display Attribute specifies the default display attribute of the column, which can be SHOW/HIDE. Aggregate Attribute can be used to define the aggregation method at the client side, which can be any SQL aggregation function like SUM, AVG, MAX or MIN. Key Attribute specifies whether the column is regarded as a key attribute or a group attribute, which can be KEY/GROUP; this is useful for OLAP analysis. If a column is marked as a GROUP column, we should specify a #sequence number in the group/order clause for the corresponding column, and the user will be able to select/unselect group columns from the generated query form to decide whether the column will be included as a dimension in the final query result. For example, we can use "{catelog; Category; STRING; SHOW; ; GROUP}" to specify the display attributes for the column catelog.
The display label is Category. The data type is String. The column will be displayed by default and will be treated as a selectable dimension column in a multiple-dimension query. To define the query criteria (Condition Attribute), we use a collection of semicolon-delimited fields enclosed in angle brackets "<>":
Condition Attribute ::= <Condition Expression; Condition Description; Condition Type; Value Domain; Required Attribute; Default Attribute; Hint>
Condition Expression is a unique name for the query criterion. Condition Description specifies the display label of the criterion. We can specify "#sequence number" in this field to tell the parser that the current criterion shares the same input value as the query criterion with that sequence number. Condition Type is the SQL datatype of the column. Value Domain is specified as comma-delimited "value|description" pairs or as "#select value, description from tablename where-clause" to generate a pick list for the input field. Required Attribute specifies whether the input attribute should be REQUIRED, FIXED or VALUEONLY. REQUIRED

means a value must be entered in the field before submitting the query. FIXED should be combined with the default attribute to make a read-only input field that can only take the default value. VALUEONLY serves the same purpose as REQUIRED; the difference is that the parser will use only the value itself to replace the macro, rather than the "attrname operator value" expression. Default Attribute can be used to specify the default value for the input field; it can be a string, a number or a "#single value select-statement". Hint can be used to specify a help message that is displayed when the mouse moves over the input field. For example, we can use "<id; Item; INTEGER; #select id,name from t_item where id between 500 and 999 order by id>" to specify the condition attributes for the condition field id. The display label is Item. The data type is integer. The selectable value domain is retrieved by sending the SQL statement "select id,name from t_item where id between 500 and 999 order by id" to the database server.
A query process written in GQL script contains a collection of semicolon-delimited GQL statements. Currently, the language consists of (a) eleven types of statements (SQL statement, Declare, Assignment, If-elif-else, While, Exit, Break, Continue, Call, Display and Mine), (b) two built-in functions (today(), substring()) and (c) one built-in object (DataSet). All statements are case insensitive.
(a) The Eleven GQL Statements
1. SQL statement: Ordinary SQL statements for a specific DBMS. We can place Field Attribute or Condition Attribute macros in the select-list or where/having clause.
2. DECLARE statement: We can define variables using the declare statement:
$declare varname1 Integer, varname2 Boolean ... ;
The supported data types include Integer, Numeric, Boolean, Date, Datetime, String and DataSet, corresponding to the Java classes Integer, Double, Boolean, Date and String, plus a self-defined class that communicates with the database server and manipulates the data result in XML format. e.g., $declare counter Integer;
3.
Assignment statement: We can assign a value to a variable using the $set statement:
$set varname = expression;
The expression can be any PASCAL-style arithmetic, relational or logical expression, or a function call. e.g., $set counter = counter + 1;
4. IF - ELIF - ELSE statement: The flow control feature can be used as:
$if (expression) $begin Statements; $end
$elif (expression) $begin Statements; $end
$else $begin Statements; $end;
The elif and else sub-clauses are optional. Currently all statements, even a single statement, must be enclosed within a $begin and $end block. e.g.,
$if (a >= 0) $begin select * from t_dace where income >= 0 into temp t1; $end
$else $begin select * from t_dace where outcome < 0 into temp t1; $end;
5. WHILE statement: We can define iteration using the $while statement:
$while (expression)

$begin Statements; $end;
All statements, even a single statement, must also be enclosed within the $begin and $end block. e.g.,
$while (counter < 10) $begin update t_item set balance=0.0 where id = :counter; $set counter = counter + 1; $end;
6. EXIT statement: We can use the $exit statement to quit the execution of the statements.
7. BREAK statement: We can use the $break statement to break out of a while-loop.
8. CONTINUE statement: We can use the $continue statement to skip the rest of the statements in a while-loop and jump directly to the condition test.
9. CALL statement: We can use the $call statement to invoke a built-in function or a method of a built-in object when we do not need the returned object:
$call function/method_name([parameters]);
e.g., $call ds.union(ds1); – merges dataset ds1 into dataset ds.
10. Display statement:
$display datasetvar using Field Attribute, Field Attribute, Field Attribute, ... ;
Sets datasetvar as the primary dataset and dumps it into the file system using the specified fields for display.
11. Mine statement:
$mine datasetvar classifier using attr1, attr2, attr3, ... class class_attr model 'model_file_name';
This uses a WEKA pre-generated classifier model to classify the data in datasetvar. The attributes specified as attr1, attr2, ... are predictive attributes. class_attr is the attribute used to classify each dataset tuple. The model stored in model_file_name contains the algorithm information and its tuned parameters. Section 4.2 has examples of using the Mine and Display statements.
(b) GQL built-in functions: Currently, GQL supports the today() function to get the current system date and the substring() function to retrieve a substring from a String variable.
1. Date Today() – gets the system date.
2(i). String .SubString(Integer start_pos) – retrieves the substring from the start position of the current variable string.
2(ii).
String .SubString(Integer start_pos, Integer len) – retrieves the substring from the start position of the current variable string, stopping after len characters. e.g., $set str2 = str1.substring(1,12);
(c) GQL built-in DataSet object: The DataSet object provides five methods (ExecQuery, ReadTable, ReadFile, DumpFile and Union):
1. DataSet.ExecQuery(String sql_statement)
This static function sends a SQL statement to the database, retrieves the query result and converts it into an XML document. It returns a DataSet object. e.g., $set ds = Dataset.ExecQuery('select * from t1');
2. DataSet.ReadTable(String db_table_name)
This is equivalent to the ExecQuery function with a "select * from <table>" statement as parameter. It retrieves the query result and converts it into an XML document. It returns a DataSet object. e.g., $set ds = Dataset.readtable('t1');
3. DataSet.ReadFile(String filename)
We can use this static method to load a Delphi ClientDataSet [20] compatible XML document from the cache directory (the location of the cache directory is configurable in the system property file; for how the query result is generated, see Section 3.2.3). The returned object is a DataSet object. e.g., $set ds = DataSet.readFile('t1_img'); – loads the cache image having the 't1_img'

postfix into memory. The cache file can have either a .xml or a .xml.z extension. The naming pattern for a cache file contains four parts: "g[query id]_[submit datetime in yyyyMMddHHmmss format]_[operator ID]_[optional postfix name string]", e.g., g5_060607194722_jack_t1_img.xml.
4. DataSet.DumpFile(Integer order, Boolean compress)
We can use this method to dump the memory image of the current DataSet object into the cache directory as a Delphi ClientDataSet compatible XML document. By default, order is zero and compress is true, which means the data will be saved as the primary data package in compressed XML format. A GQL query can return multiple results; each result data package should be assigned a unique number to indicate its order before being dumped into the cache directory. e.g., $call ds.DumpFile(0,true);
5. DataSet.Union(DataSet ds)
Merges the data in ds into the current dataset.
For backward compatibility, and because some database servers already support declaration and flow control statements (such as T-SQL for Sybase), all non-SQL statements are marked with a "$" prefix. This way, all statements without the "$" prefix are sent to the database server directly, and all statements with the "$" prefix are executed by the interpreter itself. The following sample script displays the tuples in table t_dace:
select {id;Item;INTEGER;SHOW;;GROUP},
{mark;Type;STRING;SHOW;;GROUP},
{catelog;Category;STRING;SHOW;;GROUP},
{cdate;Date;DATE;SHOW;;GROUP},
{sum(income) incom;Credit;MONEY;SHOW;SUM},
{sum(outcome) outcom;Debit;MONEY;SHOW;SUM},
{sum((income-outcome)) net;Net;MONEY;SHOW;SUM}
from t_dace
where <Condition Attribute> and <Condition Attribute> and <Condition Attribute> and <Condition Attribute> and <Condition Attribute> and <Condition Attribute> and <Condition Attribute>
group by #1, #2, #3, #4
order by #1, #2, #3, #4;
The references defined in the "group by" clause correspond to the columns with GROUP attributes. The user can decide whether these group columns will be included in the final data result. References can also be defined in the value domain part of the conditions.
For example, we can use "#select id, name from t_item where id between 500 and 999 order by id" to generate a dropdown list from the specified SQL statement. Figure 1 shows the generated user interface. After the user inputs the query criteria and submits the query, the parser generates the following target statement:
select mark, catelog, sum(income) incom, sum(outcome) outcom, sum((income-outcome)) net
from t_dace
where id between 501 and 512 and mark = 'P' and cdate >= '01-01-2006'
group by mark, catelog
order by mark, catelog
Note that input criteria whose values are empty are removed from the where clause of the final GQL statement.
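The removal of empty criteria during macro replacement can be sketched as follows. This is an illustrative reconstruction, not GQT's parser code: the class name WhereClauseBuilder and the map-based representation of condition macros are assumptions made for the example.

```java
import java.util.*;

// Illustrative sketch (not the actual GQT code) of the first-phase macro
// replacement: condition macros whose input values are empty are dropped,
// and the remaining "attrname operator value" expressions are joined
// into the final where clause.
public class WhereClauseBuilder {
    // Each entry maps an "attrname operator" expression to the
    // user-supplied value; null or empty means the user left it blank.
    public static String build(LinkedHashMap<String, String> criteria) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : criteria.entrySet()) {
            String value = e.getValue();
            if (value != null && !value.isEmpty()) {
                parts.add(e.getKey() + " " + value);
            }
        }
        return parts.isEmpty() ? "" : "where " + String.join(" and ", parts);
    }
}
```

With the inputs from the example above (id, mark and cdate filled in, catelog left blank), such a builder would produce exactly the three-condition where clause of the target statement, while an all-empty form would yield no where clause at all.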


Figure 1. Generated user interface

3.2. Architecture of the GQL Toolkit
The Java-based architecture of this toolkit is shown in Figure 2. The five major components of the toolkit are the Metadata Repository, GQL Parser, GQL Daemon, GQL Server and GQL Viewer.
3.2.1 Metadata Repository
All predefined query plans, which include the plan name, description, GQL script, access right information, etc., and all user-submitted query tasks are stored in the metadata repository. It also contains system information such as operators, the data dictionary and the system audit. We use Hibernate [12], a Java toolkit for mapping the Entity Relation Model to the Object Model, to map these tables into Java classes. In this way, manipulation of the database records is converted into manipulation of the generated mapping objects.
3.2.2 GQL Parser
The GQL Parser is the core component of the whole system. Its function is to parse the GQL script, look up display fields and condition fields, get their attributes, and then generate the internal object structures and syntax tree used by the GQL Daemon and the data presentation module. It is developed using the Java-based lexical analyzer generator JFlex and the Java-based LALR parser generator Java CUP [13]. The GQL Parser acts as a two-phase parser engine. The first phase happens when the user generates a new query task: it parses the GQL script to locate the Field Attributes and Condition Attributes and provides information for building the user interface, then uses a macro replacement mechanism to replace the Field Attributes and Condition Attributes with the real expressions, removes redundant query conditions and generates a collection of target GQL statements. The second phase happens when the GQL Daemon executes the submitted task in the background: it interprets the target GQL statements in sequential order, submits SQL statements to the database server and manipulates the data based on the user-defined logic.
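As a rough illustration of what the first-phase parse extracts from a Field Attribute macro, the following hand-rolled splitter mimics the {Field Name; Field Description; Field Type; Display Attribute [; [Aggregate Attribute]; [Key Attribute]]} grammar from Section 3.1. The real parser is generated with JFlex and Java CUP, so this class, its name and its error handling are purely a hypothetical sketch.

```java
// Minimal sketch of Field Attribute macro parsing; illustrative only,
// not GQT's generated JFlex/Java CUP parser.
public class FieldAttribute {
    public final String name, description, type, display, aggregate, key;

    private FieldAttribute(String[] f) {
        name = f[0]; description = f[1]; type = f[2]; display = f[3];
        aggregate = f.length > 4 ? f[4] : "";  // optional aggregation function
        key = f.length > 5 ? f[5] : "";        // optional KEY/GROUP marker
    }

    // Parses "{name; description; type; display [; aggregate [; key]]}".
    public static FieldAttribute parse(String macro) {
        String body = macro.trim();
        if (!body.startsWith("{") || !body.endsWith("}"))
            throw new IllegalArgumentException("not a Field Attribute: " + macro);
        // -1 keeps trailing empty fields, so "{a;b;c;SHOW;;GROUP}" yields 6 parts
        String[] fields = body.substring(1, body.length() - 1).split(";", -1);
        if (fields.length < 4)
            throw new IllegalArgumentException("at least 4 fields required");
        for (int i = 0; i < fields.length; i++) fields[i] = fields[i].trim();
        return new FieldAttribute(fields);
    }
}
```

Applied to the example macro "{catelog; Category; STRING; SHOW; ; GROUP}", this yields the column name catelog, the label Category, an empty aggregate attribute and the GROUP marker, which is the information the toolkit needs to build the query form and the result display.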
3.2.3 GQL Daemon
This module acts as the background query processing unit. It wakes every few seconds to scan the task queue, looking for tasks in waiting status. When a waiting task is detected, the daemon creates a thread to execute the task. The algorithm for the main process is:
Procedure callback_main()
Begin
1. Load parameters;
2. While Exists(thread(s) in the CriticalSection) Do wait;
3. Enter CriticalSection;


[Figure 2: architecture diagram. Client applications and web browsers access the server side over SOAP/WSDL and HTTP. The GQL App runs on the Tomcat application server and comprises the GQL Viewer (JSP, Servlet, Struts; planned: report tools, JPivot OLAP, data mining), the GQL Server (web services based on Axis), the GQL Parser and the GQL Daemon. The server and daemon access the Metadata Repository (tables p_query, p_queryq, ...) through Hibernate O-R mapping: the server gets the script, puts tasks into the queue and gets task info, while the daemon gets waiting tasks and sets task status. Query results are written to, and read for display from, a file-system cache directory in compressed XML format.]
Figure 2. GQT Architecture
4. If (At House-Cleaning Time) Then
4.1. Call house-cleaning procedure to clear expired tasks;
5. For each item in task queue with status as "Waiting" Begin
5.1. If num of threads