Enhancing the Query by Object Approach using Schema Summarization Techniques
Thesis submitted in partial fulfillment of the requirements for the degree of
MS by Research in Computer Science Engineering
by Ammar Yasir 200702005
[email protected]
Center of Data Engineering International Institute of Information Technology Hyderabad - 500 032, INDIA July 2015
Copyright © Ammar Yasir, 2015
All Rights Reserved
International Institute of Information Technology Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Enhancing the Query-By-Object approach using Schema Summarization techniques” by Ammar Yasir, has been carried out under my supervision and is not submitted elsewhere for a degree.
Date
Adviser: Prof. P. Krishna Reddy
Dedicated to my parents Mrs. Shahar Bano and Mr. Ziaul Hasan, and my sister Sara Hasan, for their everlasting love and support.
Acknowledgments
This dissertation would not have been written without the constant support and encouragement of many people. Firstly, I would like to express my deepest gratitude to Professor P. Krishna Reddy for his expert guidance. He has supported me throughout my thesis with invaluable discussions and feedback. He also encouraged me to take up challenging problems and gave me the freedom to explore my ideas. I would also like to thank my colleagues in the IT for Agriculture Lab and the Center for Data Engineering, especially M Kumara Swamy and R Uday Kiran sir, for their critical comments and constructive suggestions. I would also like to thank my labmates Gowtham Srinivas, Somya and Satheesh for their fruitful discussions. I am grateful to all my friends for providing constant support and motivation. Ashray, Abhinav, Rohit Nigam, Rohit Gautam, Romit, Shubhangi, Sankalp, Ankur Goel, Vinay, Shrikant, Rakshit, Siddharth and Ankit made my stay at IIIT one of the best experiences of my life. Lastly, I am forever indebted to my mother Mrs. Shahar Bano and my father Mr. Ziaul Hasan for their patience, understanding and encouragement.
Abstract
Modern day organizations use databases to manage information for their business operations. Since the introduction of DBMSs in the mid-1960s, database technology has made significant advances in terms of functionality and performance. As a result, modern day database systems can process a large number of complex queries on any database. An important area of database research focuses on improving the usability of databases. Research efforts are ongoing to develop efficient user interfaces to access information from databases, focusing not only on the design of user interfaces but, more importantly, on improving the process of user interaction and the underlying architecture. Information Requirement Elicitation (IRE) was proposed in the literature; it recommends a framework for developing interactive interfaces, allowing users to access database systems without having prior knowledge of a query language. An approach called 'Query-by-Object' (QBO) has been proposed in the literature for IRE by exploiting simple, calculator-like operations. In QBO, the database is represented with the help of objects, and operators are provided to relate information between objects. However, the QBO approach was proposed by assuming that the underlying database is simple and contains a few tables, each of small size. Large databases have complex database schemas. Given a large number of tables in a schema, the number of objects is also large, and locating the information of interest and how it is related to other objects becomes a challenging task for the user. Also, the number of possible operations between objects increases significantly. In this thesis, we investigate opportunities for a better organization of the options available to the user for interacting with the database, without making any changes to the organization of data at the physical layer. First, we try to determine entities in the schema that collectively represent a conceptual unit or topic in the database. Similarly, we try to organize the instances of an object into a hierarchy based on attribute values. The organization of objects into topics allows the user to relate information at a higher level of abstraction and reduces the number of operator pairs that need to be defined in QBO. We also evaluate the research decisions through system analysis and usability studies, which were conducted with the help of a fully functional prototype developed for a real, complex database.

An important process in the proposed approach is discovering topical structures in the database schema. This problem has gained attention recently in the database community as the problem of schema summarization. Schema summarization for a relational database schema is a challenge that involves identifying semantically correlated elements in the schema. Research efforts are being made to propose schema summarization approaches by exploiting the database schema and the data stored in the database. Existing efforts for schema summarization are data oriented; in scenarios where data is inconsistent or insufficient, these approaches suffer. In this thesis, we explore the database documentation as an information source. We aim to utilize the schema and the database documentation to provide an efficient schema summary. We propose a notion of table similarity by exploiting the referential relationships between tables and the similarity of the passages describing the corresponding tables in the database documentation. Using this notion of table similarity, we propose a clustering-based approach for schema summary generation. Experimental results on a benchmark database show that the proposed approach, although independent of the data stored in the database, is as efficient as the data-oriented approaches.
Contents

1 Introduction
  1.1 Overview of Existing Efforts for Access Methods in Database Systems
  1.2 Overview of Proposed Approach
    1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach
      1.2.1.1 Overview of Query-by-Object approach
      1.2.1.2 Issues with Query-by-Object approach
      1.2.1.3 Proposed Enhanced Query-by-Object Approach
    1.2.2 Overview of Proposed Schema Summarization Approach
      1.2.2.1 Overview of Schema Summarization
      1.2.2.2 Proposed Approach for Schema Summarization
  1.3 Thesis Contribution
  1.4 Thesis Organization

2 Related Work
  2.1 Innovative Query Interfaces
    2.1.1 Visual Interfaces
    2.1.2 Text Interfaces
    2.1.3 Other Works
  2.2 Schema Summarization
    2.2.1 Schema Matching
    2.2.2 Mining Database Structures
    2.2.3 Topical Structures in Databases
    2.2.4 Schema Summarization
  2.3 Discussion

3 Enhanced Query-by-Object Approach for Information Requirement Elicitation in Large Databases
  3.1 Background
    3.1.1 Query-by-Object Approach
    3.1.2 Example
    3.1.3 Discovering Topical Structures in Databases
  3.2 Proposed Approach
    3.2.1 Case Study
    3.2.2 Basic Idea
      3.2.2.1 Organization into topics
      3.2.2.2 Facilitating Instance Selection
      3.2.2.3 Defining Operations
    3.2.3 QBT protocols
      3.2.3.1 QBT Developer Protocol
      3.2.3.2 QBT User protocol
  3.3 System Prototype
    3.3.1 CONFIG-DB
    3.3.2 User Interface
  3.4 Experiments
    3.4.1 Experimental Methodology
    3.4.2 Performance Analysis
    3.4.3 Usability Study
      3.4.3.1 Experiment 1, Task Analysis
      3.4.3.2 Experiment 2, User Survey
      3.4.3.3 Limitations and possible improvements for the usability study
  3.5 Summary of the chapter

4 Exploiting Schema and Documentation for Summarizing Relational Databases
  4.1 Proposed Approach
    4.1.1 Basic Idea
    4.1.2 Schema based Table Similarity
    4.1.3 Documentation based Table Similarity
      4.1.3.1 Finding Relevant Text from the Documentation
      4.1.3.2 Similarity of passages
    4.1.4 Table Similarity Measure
    4.1.5 Clustering Algorithm
      4.1.5.1 Influential tables and Cluster Centers
      4.1.5.2 Clustering Objective Function
      4.1.5.3 Clustering Process
  4.2 Experimental Results
    4.2.1 Experimental Setup
    4.2.2 Evaluation Metric
    4.2.3 Effect of window function (f) on combined table similarity and clustering
    4.2.4 Effect of document similarity measure (S) on similarity metric and clustering accuracy
    4.2.5 Effect of contribution factor (α) on table similarity and clustering
    4.2.6 Comparison of Clustering Algorithm
  4.3 Summary of the chapter

5 Conclusion and Future work

Bibliography
List of Figures

1.1 The TPCE schema without table categories
3.1 QBO user protocol
3.2 The iDisc approach
3.3 Topical Structure for QBT
3.4 QBT user protocol
3.5 System Prototype Architecture
3.6 CONFIG-DB
3.7 Traditional calculator versus System Prototype UI
3.8 System Prototype UI
3.9 Treemap representation of user's selection (object and granularity)
3.10 QBO Approach Prototype
3.11 QBT Approach Prototype (with Topic modeling and binning)
3.12 Average ratings for questions from questionnaire
4.1 TPCE Schema
4.2 acc_sim and acc_clust values on varying window function, f
4.3 acc_sim and acc_clust values for document similarity functions, S
4.4 Accuracy of similarity metric on varying values of α
4.5 Accuracy of clustering on varying values of α
4.6 Clustering accuracy for different clustering algorithms
List of Tables

3.1 Operator Matrix for Example 1
3.2 QBO Developer and User Protocols
3.3 Within-Topic Matrix 1 (WT-I)
3.4 Between-Topic Matrix (BT)
3.5 Time taken and number of attempts for each task
3.6 Query building time results for QBT
3.7 Query building time results for QBO
4.1 Referential Similarity between tables security, daily_market and watch_item
Chapter 1
Introduction
A database is a well-organized collection of related data. For example, an address book that stores the names, phone numbers and addresses of people you know is a database. A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is a general-purpose software system that facilitates the processes of defining, constructing, manipulating, and sharing databases among various users and applications. Since their introduction in the mid-1960s, DBMSs have enjoyed enormous success. An important feature of a DBMS is that it offers data independence: application programs utilizing the database are insulated from changes in the way data is structured and stored. A DBMS provides a suite of sophisticated techniques to store and retrieve data efficiently. It can also enforce standards among database users in a large organization, for example, the names and formats of data elements, terminology, and display formats. A DBMS also ensures the security of the database by enforcing access controls for users, and it ensures durability, the recovery of the database in the face of failures, errors of many kinds, or intentional misuse. Overall, the prime selling feature of the database approach has been the reduced application development time: a DBMS provides support for important functions that are common to all applications accessing data in the DBMS, making application development less time consuming.

With the rapid increase of published information and the abundance of data, users require sophisticated tools to simplify the task of managing data and extracting useful information in a timely fashion. To deliver such sophisticated systems, database technology has made great strides in the areas of data storage, transaction management, concurrency control and query interfaces. As a result, modern day DBMSs can efficiently process a large number of complex queries on any database. Although advances in database technology have concentrated heavily on functionality and performance, the 'usability' of databases leaves a lot to be desired. The important aspect while discussing the usability of a database is not just the design of the user interface but, more importantly, the process of interaction and the underlying architecture.
In this thesis, we make two contributions. First, we propose enhancements to the Query-by-Object approach using schema summarization techniques. Second, we propose an efficient approach for generating a schema summary by utilizing the schema structure and database documentation. In the remaining part of this chapter, we first overview the existing efforts for providing access to data in database systems and the issues involved in providing efficient data access. Then we give an overview of the approach proposed in the thesis. Further, we discuss the issue of schema summarization, review the existing approaches for schema summarization and give an overview of the proposed approach for schema summarization. Finally, we mention the major contributions made in the thesis and the organization of the thesis.
1.1 Overview of Existing Efforts for Access Methods in Database Systems

In this section, we discuss some of the common approaches for providing access to data in a database system:

• Database Query Interfaces: Structured query models like SQL or XQuery are powerful means of interacting with a database. SQL is a textual language with a simple English-like syntax, and is widely implemented in most commercial database systems. Alternatively, users can use visual query systems (VQSs) [1]. VQSs are query systems for databases that use visual representations to locate information of interest and express related requests. VQSs can be seen as an evolution of query languages, aimed at improving the effectiveness of human-computer interaction. Query-by-Example [2], for example, allowed users to query a database by creating example tables in the query interface and has influenced many commercial products like Microsoft Access. Form-based interfaces are widely regarded as the most user-friendly querying method. A form is a named collection of objects having the same structure. The structured representation of a query form is an abstraction of conventional paper forms; therefore, users feel at ease with the system. The systems presented in [3, 4] provided visual tools for users to frame queries using forms.

• Keyword Search: Searches are a specialized class of queries [5]. A search consists of keywords representing the user's information requirements, and the underlying data is usually a collection of unstructured documents. A search engine retrieves the documents relevant to the query and ranks the retrieved documents. The keyword-search query mechanism allows users to freely express their query requirements and, coupled with instantaneous response times, makes it easier to refine queries. Although a mainstay of Information Retrieval (IR) systems, the keyword-search approach has been extended to the database domain as well [6]. Systems such as BANKS [7] and DBXplorer [8] provide an IR-style keyword-based search engine over relational data.
• Information Requirement Elicitation: In the m-commerce environment, the 'Information Requirement Elicitation' (IRE) approach and its conceptual design were proposed by Sun [9]. IRE describes an interactive communication in which information systems help users to specify their requirements through adaptive choice prompts. Users initiate IRE sessions by expressing their needs. In an IRE-enabled system, there is an IRE component, which gets triggered upon receiving a user's request. The IRE component checks whether the information requirement is specific enough. If not, the component generates choice prompts for the missing elements by utilizing user inputs, user context, and user preferences. The loop continues until the requested information can be provided to the user. A prototype of IRE in an imagined m-commerce scenario is demonstrated in [10].
1.2 Overview of Proposed Approach

Structured query models like SQL/XQuery are very efficient in the context of expressing a query. However, these models require a user to specify a query using a fixed syntax, to have prior knowledge of the database structure and model, and to express the query in terms of that particular structure. Novice users are not skilled at using SQL-like query languages, which have a complex structure. While VQSs offer a friendlier approach, systems like QBE do not perform well with large schemas. Secondly, a user needs to be aware of the values in the database to fill the example tables. Another challenge for the user is grasping the join relationships between data entities to express complex queries. Similarly, form-based interfaces, although convenient for users, pose a limitation on the number of queries that can be executed. The keyword-search approach is not entirely effective, as users express queries with complex semantics and expect precise, complete results. The IRE approach proposed a grand framework whose potential has not been fulfilled yet. Based on the notion of IRE, the Query-by-Object (QBO) approach was proposed for developing query interfaces. In this thesis, we propose enhancements to the existing QBO approach to design user interfaces efficiently for large databases. Another important area in the context of database usability is generating a summary for a complex database schema. As part of this thesis, we also propose techniques to generate an efficient schema summary.
1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach

In this section, we present a brief overview of the QBO approach and the challenges involved in developing user interfaces based on the QBO approach. We then give an overview of the proposed enhanced Query-by-Object approach.

1.2.1.1 Overview of Query-by-Object approach
IRE uses a series of steps to elicit information from users, where each step adds to the information about the user's intent. However, IRE does not permit users to utilize the results of intermediate queries to progressively build complex queries. Based on the notion of IRE, the Query-by-Object (QBO) interface was proposed by Bhalla et al. in [11, 12] for the m-commerce environment. In this system, users communicate with a DBMS through a web interface. The user's intent is captured via objects and path navigation through an option-based interface. In the end, a query is formulated and executed at the DBMS server by converting it into its SQL equivalent. Initially, the user is presented with an object menu. Users perform navigation operations and select one or two objects at the desired level of granularity. Unlike IRE, the QBO approach supports the closure property, which ensures that each step executes and its result can be used in the next step. It allows users to gather and combine query results, and to search for information in a logical way, whereby intermediate results are refined or combined to obtain the intended result. The Query-by-Object approach has been used to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems [15]. An empirical study was conducted in [14] to evaluate user interaction through the QBO interface. The study showed that the QBO approach is easy, intuitive and simple to use for common users.
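To make this flow concrete, consider how a short QBO session could translate at the DBMS server. The following SQL is an illustrative sketch only, written against the film/actor schema of the example in Section 3.1.2 (assuming underscored attribute names such as film_id); the actual translation produced by a QBO system may differ.

-- QBO steps: select object 'actor' at granularity actor_name = 'Jack';
-- select object 'film'; apply the RELATE operator.
SELECT f.film_name
FROM film f
JOIN film_actor fa ON fa.film_id = f.film_id
JOIN actor a ON a.actor_id = fa.actor_id
WHERE a.actor_name = 'Jack';
-- By the closure property, this result set is itself an object, so a further
-- RELATE with object 'actor' can reuse it in an extended query.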
1.2.1.2 Issues with Query-by-Object approach
Designing user interfaces based on the QBO approach to provide information access to a general-purpose database is a challenging issue [16]. The QBO approach uses a database to store objects and operations, where each object corresponds to a relation in the schema. Developing user interfaces based on the QBO approach becomes challenging when the complexity of the underlying database (schema and data) increases. A large number of tables in the schema makes it harder for the user to locate his/her information of interest and to understand how it is related to other elements in the schema. The issue is even more pronounced when the object instances are large in number. Hence, there is a need for a better organization of the options available to the user in the QBO interface. Also, with an increase in the number of tables, the number of pairwise operations between tables increases significantly.
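To make the growth concrete, here is a small illustrative calculation, assuming one operator definition per unordered pair of objects plus one per object paired with itself (as in the operator matrix of Table 3.1):

\[ \#\text{operator pairs} = \binom{n}{2} + n = \frac{n(n+1)}{2}. \]

For the three-object example of Chapter 3 this is only 6 definitions, but a schema with $n = 100$ tables would already require 5050.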
1.2.1.3 Proposed Enhanced Query-by-Object Approach
To address the issues of QBO, we first exploit the notion of detecting topical structures in databases to represent the schema at a higher level of abstraction. Identifying topical structures allows tables that are semantically correlated to be grouped together, which provides a better organization of the options presented to the users. We use an elaborate approach called iDisc [17], which utilizes the database schema structure and the data stored in the database to generate a clustering of schema entities representing the topical structure of the database. We discuss the iDisc approach in detail in Chapter 3. Secondly, instead of defining operations between each pair of tables, we define operations between topics and within topics, which reduces the number of pairs for which operators need to be defined (a small illustrative calculation follows below). Similarly, to facilitate easier instance selection, we allow the selection of instances based on the attribute values of a table; further, we organize the instances of an attribute into bins, providing a two-level hierarchy for instance selection. The developer protocol is modified to include the steps required to generate the abstract levels. Consequently, the user protocol is also modified for the proposed approach. We also discuss the engineering of a prototype based on the proposed approach.

Figure 1.1 The TPCE schema without table categories
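Continuing the calculation from Section 1.2.1.2 (same illustrative assumptions, and ignoring any per-table refinements inside a topic): with $k$ topics, operator definitions are needed only between topics and within topics, so

\[ \#\text{operator pairs} = \binom{k}{2} + k = \frac{k(k+1)}{2}. \]

For example, grouping 100 tables into $k = 10$ topics reduces the 5050 table-level definitions to 55 topic-level ones.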
1.2.2 Overview of Proposed Schema Summarization Approach

In this section, we briefly describe the issue of schema summarization, the approaches proposed in the literature and the proposed approach for schema summarization.
1.2.2.1 Overview of Schema Summarization
Detecting topical structures in a database schema is an interesting challenge. In the literature, the term schema summarization has been used interchangeably with the detection of topical structures. Modern enterprise databases consist of hundreds of interlinked tables. While users are accustomed to data being represented in two-dimensional tables, grasping the joins between tables is a challenge for general users. For example, Figure 1.1 shows the schema diagram of the TPCE benchmark database. The TPCE database simulates the working of an online brokerage firm. Although moderate in terms of schema size, the complex relationships in the schema make it difficult for users to familiarize themselves with the database schema. As the complexity of database schemas increases, the amount of time spent on understanding the metadata and schema structure becomes significant.

Database normalization is a process of analyzing the given relation schemas, based on their functional dependencies and primary keys, to achieve the desirable properties of (1) minimizing redundancy and (2) minimizing insertion, deletion and update anomalies. Unsatisfactory relation schemas that do not meet the normal form tests are decomposed into smaller relation schemas that meet the tests and hence achieve the desirable properties. However, through the process of normalization, what users perceive as a single independent unit of information is disintegrated into smaller relations (a small illustration follows at the end of this subsection). Coupled with odd naming conventions for tables, this makes it harder for a user to locate his information of interest easily.

Schema summarization has been proposed to assist users in understanding a complex database schema easily. A schema summary represents a higher level of abstraction of the database schema. A user is initially presented with a few important concepts from the database; subsequently, the user can zoom into the sections of the schema in which he is interested. Generating a schema summary involves identifying semantically correlated elements in the schema. Existing approaches [18, 19, 20, 21] exploit the schema structure and the data stored in the database to generate the schema summary with a clustering-based approach. In scenarios where the data stored in the database is insufficient, the existing approaches suffer. In this thesis, given a current snapshot of the database (schema), we investigate the database documentation as an additional source of information and propose an algorithm to generate the summary by exploiting the database schema structure and the documentation.
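As the promised illustration of how normalization disintegrates a conceptual unit, consider a hypothetical schema (invented for this example; it is not part of TPCE): what a user perceives as a single 'order' ends up spanning three relations connected by foreign keys.

-- Before normalization: one wide table mixing customer, product and order facts.
-- orders(order_id, customer_name, customer_city, product_name, price, quantity)
-- After normalization, the single 'order' concept spans three tables:
CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  name VARCHAR(50),
  city VARCHAR(50)
);
CREATE TABLE product (
  product_id INT PRIMARY KEY,
  name VARCHAR(50),
  price DECIMAL(8,2)
);
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT REFERENCES customer(customer_id),
  product_id INT REFERENCES product(product_id),
  quantity INT
);
-- To reconstruct a complete order, a user must now discover and join all three tables.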
1.2.2.2 Proposed Approach for Schema Summarization
A foreign-key relationship between two tables indicates that there exists a semantic relationship between them. However, referential relationships alone do not provide good results [53]. Hence, we attempt to supplement this referential similarity between tables with another notion of similarity, such that the tables belonging to one category attain higher intra-category similarity. The additional similarity criterion is based on the similarity between the passages of text describing the corresponding tables in the database documentation. The intuition behind this notion of similarity is that tables belonging to the same category should share some common terms about the category in the documentation. We combine the referential similarity and the document similarity using a weighted function and obtain a table similarity metric over the relational database schema. After the pairwise similarity between tables is identified, we use a Weighted K-Center clustering algorithm to partition the tables into k clusters. Experiments conducted on a benchmark database show that the proposed approach is as effective as the existing data-oriented approaches.
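In symbols, the combined metric has the general form of a weighted combination; this is a sketch anticipating Chapter 4, where $\alpha$ is called the contribution factor and the two component similarities are defined precisely:

\[ sim(t_i, t_j) = \alpha \cdot sim_{ref}(t_i, t_j) + (1-\alpha) \cdot sim_{doc}(t_i, t_j), \qquad 0 \le \alpha \le 1, \]

where $sim_{ref}$ is the referential similarity derived from foreign-key relationships and $sim_{doc}$ is the similarity of the documentation passages describing tables $t_i$ and $t_j$.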
1.3 Thesis Contribution

The major contributions of this thesis are:

• Proposed enhancements to the existing QBO approach by detecting topical structures in databases.
• Presented an advanced system to query relational databases, based on the enhanced QBO approach.
• Explored the database documentation as a source of information for generating a schema summary, and proposed an algorithm that exploits the database schema and documentation to generate an efficient schema summary.
1.4 Thesis Organization

In Chapter 2, we review the related work on query interfaces and schema summarization. In Chapter 3, we discuss the QBO approach and the issue of designing QBO interfaces for complex databases; we present the proposed enhancements over the existing QBO approach and discuss results from system-level and user-level evaluations of an advanced system for querying relational databases, the tool used in the usability evaluations. In Chapter 4, we present the problem of summarizing a relational database and propose an algorithm to generate the schema summary by utilizing the schema and documentation; we also discuss a thorough experimental evaluation of the proposed approach. Chapter 5 presents the summary of the work discussed in the thesis, conclusions, and future work.
Chapter 2
Related Work
One of the earliest works in the field of database usability [22] focused on analyzing the expressive power of the declarative query language SEQUEL in comparison to natural language. However, the importance of usability in database systems was first addressed in [23]. Since then, most of the research efforts in the context of database usability have focused on developing innovative query interfaces. In [24], the author describes the initial enthusiasm and the user-induced frustration in the building of interactive information systems. In 2007, Jagadish et al. [25] provided a second wind to the research domain, discussing a set of five 'pain points' on why databases are so difficult to use. The first pain point describes how a complex schema structure makes it hard for users to locate their information of interest and construct relevant queries. The authors propose that an abstraction of the presentation data model is needed to allow users to structure information in a natural way. As users have different views on the organization of data in a database, various personalized presentation models are developed for different classes of users. However, when users are presented with multiple views, they do not understand the underlying difference between the views and tend to become confused and lose trust in the system; this is discussed as the second pain point in the context of database usability. The third pain point deals with the issue of users getting an unexpected result, or being unable to query, without getting any explanation from the database system. The fourth pain point describes that the existing query interfaces are not modeled as WYSIWYG (What-You-See-Is-What-You-Get), which is a desired quality in any user interface. The last pain point discusses that the creation of a database is a challenging task for novice users and is a reason why a lot of modern day information is not present in databases. The authors later introduced a presentation data model for direct data manipulation with a schema-later approach.

An important aspect of usability in databases is to provide information access with minimal effort to database end-users. In the literature, various visual query systems and textual interfaces have been proposed to provide efficient data access. We review some of the prominent works in the field of query systems in Section 2.1. In the context of improving database usability, generating schema summaries for complex database schemas has also received attention of late. We review some of the proposed techniques for generating efficient schema summaries in Section 2.2.
2.1 Innovative Query Interfaces

2.1.1 Visual Interfaces

Using visual representations for query specification is perhaps the most researched field in the context of database usability. Query-by-Example (QBE) [2] was one of the first graphical query languages with minimal syntax developed for database systems. QBE and its successor Office-by-Example [26] were both based on domain relational calculus [27]. In QBE, rather than specifying a query using a fixed syntax, the query is formulated by filling templates of the relations, displayed visually on the computer screen. The inputs to the template can be translated into an SQL equivalent and executed on the database. Using QBE requires no knowledge of syntactic constructs or of the internal structure of the database, as users are presented only with the equivalent table skeletons. QBE is relationally complete: with some additional commands, condition boxes and other constructs, users can express all queries that belong to the class of relational algebra. It has influenced the visual querying facilities in products like Microsoft Access, IBM Visual XQuery Builder, Borland's Paradox and open-source tools like the query builder for phpMyAdmin.

Query by Templates (QBT) [28] was a generalization of the QBE language for databases modeled with SGML. QBT incorporates the structure of the documents for composing powerful queries by displaying a template for a representative entry in the database. The template describes the type of data expected in the database. The user specifies examples of data in the template, and the system retrieves the data matched by the user-specified template, similar to QBE. QBT allows various templates like flat templates, nested templates and structured templates, unlike QBE, where the table skeleton is the only available template.

Query-by-Diagram (QBD*) [29] is a visual query system that allows navigation based on abstractions of the E-R semantic model. QBD* allows users to extract information from the database without worrying about the logical model of the schema. The process of query formulation in QBD* is as follows: the query structure is based on the selection of a main concept, which is the first entity selected by the user. The user then performs navigation on the ER model to select paths starting from the main concept. Each path represents a subquery that selects a subset of instances of the main concept. Set-based operations like union, intersection and difference are available to combine the various subqueries. The main feature of QBD* is that it provides a graphical mechanism capable of expressing recursive queries (transitive closure).

Query-by-Icon (QBI) [30] provides an icon-based visual query system capable of querying and exploring databases. QBI provides an interface with pure iconic specifications, without the use of any diagrams. A user perceives the underlying database as a set of classes, each having several properties called generalized attributes (GAs). Generalized attributes encapsulate and hide from the user the details of specifying a query. To construct a query, users select compatible classes via their corresponding icons. When a user selects a class, its GAs are used to define the selection condition. Similarly, the user also selects the GAs that will be part of the output result. Query results are saved to be explored further in the construction of complex queries. A comparison study of QBD* and QBI [1] suggested that expert users perform better using the QBD* system, while QBI performed slightly better for non-expert users.

VISIONARY [31] is a visual query language based on a diagrammatic paradigm like QBD*. In VISIONARY, a vision represents the external data model that uses a combination of icons and text to provide visual primitives of concepts and associations, each represented by a name and a multiplicity. Users formulate queries by choosing a primary concept, the selection predicates and the attributes to be retrieved in the output. If the interpretation given to a query is not the one the user had in mind, the user can force a different interpretation by disabling some associations and enabling others. The internal data model is relational, using SQL as the query language. An intermediate graph-based model provides the mapping between the visual and the relational models.

Kaleidoquery [32] is a powerful visual query language for object databases, supporting the capabilities of the OQL language. Kaleidoquery uses a visual filter flow model, where filters are used to filter out the information of interest to users. The class instances are considered as an information flow, and the information is filtered using constraints on the class attributes. The output of the query is then examined, or it flows into other queries to be further refined. Kaleidoquery separates the tasks of writing the query constraints and organizing the structure and ordering of the results, providing a more dynamic evolution of queries than OQL.

Liu and Jagadish [33] designed a spreadsheet algebra for relational databases that continuously presents data to users in a WYSIWYG (What You See Is What You Get) manner. By dividing query specification into progressive refinement steps, users can extend intermediate results to construct complex queries. The data manipulation actions are reversible, and users can modify an operation specified earlier without redoing the later operations. Users can specify at least all single-block SQL queries without being exposed to complex database concepts. Non-technical users benefit from the direct manipulation interface, as it allows easier and more accurate specification of queries.

VISQUE [34] describes a visual interaction language that exploits End-User Development techniques, web-based user interface design and data models. VISQUE uses knowledge visualization techniques like a tree-based metaphor to represent a multidimensional database schema and also allows the construction of complex SQL-like queries using set-based, nested and aggregation constructs.

Due to the popularity of touch-based and motion-tracking devices, research efforts have been made to design user interfaces that allow gesture-based querying over relational databases [35]. The database query interface allows users to manipulate results directly by interacting with them in a sequence of gestures. Corresponding to each table, a view is created in the workspace that can be directly manipulated. Each gesture denotes a single manipulation action and impacts only the view. The user needs to learn only a few gestures, each of which corresponds to an action. Users can undo each action to return to the previous workspace state. Each action corresponds to the execution of a specific SQL query. Actions are stackable and can be performed in sequence, manipulating tables in the workspace till the desired result is achieved.
Application developers designing query interfaces for a specific purpose prefer to use form-based interfaces [4, 3]. In form-based interfaces, the user is presented with a list of searchable fields, each with an entry area that can be used to indicate the search string. To pose a query, the user fills in the areas of the form relevant to the search. The form-based approach is especially relevant as end-users are accustomed to manual form-based work. In [4], the authors study a simple form model that includes hierarchically structured forms with event-driven routing. To assist users in the creation of forms for view definitions, an inference component was provided to create view definitions consisting of the hierarchical structure and the functional dependencies among form fields. The inference component uses a collection of rules and heuristics along with a purposeful dialog. The Expert Database Design System [4] assists a designer in the view integration process. The system provides rules for incrementally integrating the form views and heuristics for mapping the form fields into entity types and relationships. Other form-based systems for databases are the GRIDS system [36], which allowed users to pose queries in a semi-IR fashion, and the Acuity Project [37], which used form generation techniques for data-entry operations such as updating tables in a relational database. In [38], the authors tried to automate the construction of query forms. With a limited number of forms, the system can express a wide range of queries, which helps relax the restriction on expressiveness posed by form-based querying mechanisms. Given a set of interesting queries, similar queries are identified and subsequently clustered so that each cluster of queries can be answered using a single form.
2.1.2 Text Interfaces

With the explosion of data availability on the web and the ease of access to data through search engines, we observe databases playing second fiddle in terms of popularity. Search engines, for example, Google [39], allow users to issue keyword-based queries freely and, coupled with instantaneous response times, provide a satisfactory experience for the user. While there is still room for improvement, the success story of web search engines suggests that any data management system is more useful if users can extract information from it with minimal effort. Keyword searches in databases [6] allow users to query databases using a set of keywords. The BANKS system [7] integrates keyword querying and interactive browsing of databases. BANKS models a database as a graph, where tuples correspond to nodes, and foreign-key and other links between tuples correspond to edges. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, the latter being inspired by techniques developed for web search. Another keyword-search system, DBXplorer [8], uses a symbol table to determine efficiently, at search time, the locations of the query keywords in the database. Given a set of keywords, the symbol table is looked up to identify the database tables, and all potential subsets of tables that might contain rows having all keywords are identified and enumerated. For each enumerated join tree, an SQL statement is constructed (and executed) that joins the tables in the tree and selects the rows that contain all keywords. The system then presents the final rows to the user.
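For illustration, given the keywords 'Jack' and 'Fargo' (a hypothetical film name) over the film/actor schema of Section 3.1.2, a DBXplorer-style system might enumerate the join tree actor–film_actor–film and construct a statement of roughly the following shape; the actual SQL produced by [8] may differ.

SELECT a.actor_name, f.film_name
FROM actor a
JOIN film_actor fa ON fa.actor_id = a.actor_id
JOIN film f ON f.film_id = fa.film_id
WHERE a.actor_name LIKE '%Jack%'   -- first keyword located in table actor
  AND f.film_name LIKE '%Fargo%';  -- second keyword located in table film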
Keyword search has also been extended to XML databases. The aim of such systems is to identify the smallest element that contains most of the keywords [40] or the smallest element that is meaningful [41]. In [42], the authors describe ObjectRank, which uses a metric of authority transfer on a data graph to improve result quality when ranking keyword-search results in databases. Ranking of SQL query results has been studied in [43] using probabilistic models. Research efforts have also focused on combining form-based approaches and keyword search: given a set of keywords, a system retrieves a set of forms instead of query results [44, 45]. These systems create inverted SQL queries from the SQL queries in the forms. Unlike traditional keyword search on databases, the techniques do not require any special-purpose indices and instead make use of the standard text indices supported by most database systems. Some information systems use a 'page-and-link' approach for accessing data resources, for example, a web directory. A web directory is a repository of web pages that are organized into a topic hierarchy. Typically, directory users locate the information sought simply by browsing through the topic hierarchy, identifying the relevant topics and finally examining the pages listed under the relevant topics. Some common web directories include [46, 47]. Users select a related link as per their needs; each link helps users in narrowing down their information needs.
2.1.3 Other Works

In [48], the authors proposed a new paradigm for data interaction called guided interaction, which uses interaction to guide users in the query construction, query execution and result examination processes. The authors mandate that databases should be responsive to the user and that all possible actions be enumerated so as to allow discovery and exploration. The database can also preemptively deliver insights to aid in query construction. The proposed paradigm applies to any general database interaction interface, whether SQL-writing, form-filling, keyword-typing or any other. The authors suggested how the information in the database could be leveraged to guide a user during query construction by following these core principles.

Query recommendation is a popular feature in modern systems, especially search engines. These recommendations are built by mining the search query logs of existing users [49, 50]. The method proposed in [49] is based on a query clustering process that identifies semantically similar queries by exploiting the historical preferences of registered users. The method also ranks the semantically correlated queries. In [50], the authors model a search engine user's sequential search behavior, representing it as a query refinement process. This model is combined with a traditional content-based similarity method to compensate for the high sparsity of real query log data. In [51], the concept of auto-completion was proposed to rapidly suggest predicates to the user to create conjunctive SELECT-PROJECT-JOIN queries. In [52], the authors proposed a method to mine SQL query logs and identify potential query templates. In complete-search [53], Bast et al. modify the inverted index data structure to provide incrementally changing search results for document search. TASTIER [54] provides find-as-you-type in relational databases by partitioning the relation graph. In the information retrieval area, Anick et al. [55] achieve interactive query refinement by extracting key concepts from the results and presenting them to the user. Faceted search [56] extends this to present the user with multiple facets of the results, allowing for a mixing of search and browse steps.
2.2 Schema Summarization

2.2.1 Schema Matching

Information integration is an important challenge in data management [57, 58]. Schema matching [59] involves identifying semantic correspondences or mappings among attributes from different databases. In [60, 61], the authors describe schema-oriented approaches for finding correlated schema elements using names, descriptions, relationships and constraints. In [62], the authors proposed an integrated approach, combining linguistic approaches with a structure matching process. In [63], a fragment-oriented approach was proposed for matching large schemas to reduce the matching complexity. Identifying mappings is analogous to finding the similarity between schema elements belonging to two different schemas.
2.2.2 Mining Database Structures

Mining database structure has received attention recently [64, 65, 66]. Bellman [64] performs data mining on the database structure, identifies attributes with similar values, and discovers join relationships among tables while also identifying their directions and sizes. Such analysis can help in preparing data for data mining or in identifying foreign keys for schema mapping. In [65], the authors addressed the problem of mining a data instance for structural clues. The structural clues help in identifying data instances that may contain errors, missing values, and duplicate records, and may ultimately be helpful in data design. The authors proposed a set of information-theoretic tools that identify structural summaries useful for characterizing the information content of the data.
2.2.3 Topical Structures in Databases

Wu et al. [21] proposed an elaborate approach, iDisc, to discover topical structures in relational databases. The approach first models the database in three representations: graph-based, vector-based and similarity-based. The graph-based representation models the database as a graph where the tables represent the nodes and the foreign-key relationships represent the edges. In the vector-based representation, each table is modeled as a document, and hence the database represents a collection of documents. The similarity-based representation computes a similarity matrix by considering the similarity of attributes between schema elements. iDisc then performs clustering on each of the database representations and combines the resulting clusterings using a voting scheme to generate the topical structures.
2.2.4 Schema Summarization

The problem of schema summarization was coined by Jagadish et al. in [18]. The proposed approach for generating a schema summary utilized abstract elements and abstract links. Each abstract element represents a cluster of original schema elements, and an abstract link represents one or more links between the schema elements within those abstract elements. The authors used the notions of summary importance and summary coverage to generate a schema summary representing important schema elements with a broader coverage. The approach in [18] was proposed in the context of XML schemas; the assumptions made there could not be applied to relational schemas. Yang et al. [20] proposed an improved algorithm for relational schema summarization. The authors proposed a new definition for the importance of tables in a relational database schema based on information theory and statistical models. They also described a novel distance function that quantifies the similarity between elements in the schema; based on this distance function, a clustering-based approach was proposed for generating the schema summary. In [19], the authors apply the technique of community detection in social networks to schema summarization. The approach partitions the database schema elements into subject groups by using modularity-based community detection. By utilizing the table importance measure proposed in [20], a hierarchical clustering algorithm was proposed to build a multi-level navigation structure in the schema summary. The schema summary describes foreign-key relationships, subclass relationships and overlaps of data instances.
2.3 Discussion

Although VQSs like QBE and its derivatives are relationally complete and friendlier to use than SQL/XQuery, they still require prior knowledge of the schema structure and, to some extent, a grasp of the join relationships between tables. Query interfaces like the form-based interfaces restrict the number of queries a user can construct for data access. In the proposed effort, we aim to provide an easy-to-use interface for novice users that does not restrict the number of queries a user can execute on the system. In keyword-search systems, although users are content with querying using keywords, they often need to express more complex query semantics. Also, users expect precise and complete answers to their queries, while keyword-search based systems may return many irrelevant results without any explanation. Query recommender systems face a similar limitation. In the proposed approach, we emphasize providing precise and complete answers like a structured query language. Schema matching involves identifying semantic correspondences or mappings among attributes from different databases, whereas the proposed approach identifies semantically correlated elements within a schema. In [64, 65], the aim was to identify semantic relationships (foreign keys) between tables; the proposed approach aims to identify clusters of strongly correlated schema elements. The existing schema summarization approaches [18, 20, 19, 21] are data oriented, utilizing the schema and the data available in the tables. In contrast, the proposed approach uses the schema information and the database documentation to generate the schema summary.
Chapter 3

Enhanced Query-by-Object Approach for Information Requirement Elicitation in Large Databases
Databases are more useful when users can extract information from them with minimal effort. Most database systems provide powerful, structured query models like SQL to query the database. However, these models require users to specify an unambiguous query explicitly using a fixed syntax and to have prior knowledge of the database structure, which is unfavorable for novice users. Hence, alternate query interfaces are required for information access that are more suited to the skills of a novice user yet still provide expressive power like SQL. Research efforts are ongoing to design efficient query interfaces that simplify the process of accessing information stored in a database. Information Requirement Elicitation (IRE) [9] proposes an interactive framework for accessing information: user interfaces should allow users to specify their information requirements using adaptive choice prompts. In the literature, the Query-By-Object (QBO) approach has been used to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems [15] based on the IRE framework. The QBO approach provides a web-based interface for building a query in multiple user-level steps. The main advantage of this approach is the simplicity of expressing a query. The QBO approach uses a database to store the objects and entities. However, for databases with a large number of tables and rows, the QBO approach does not scale well.

In this chapter, we propose an improved QBO approach, Query-by-Topics (QBT), to design user interfaces based on the IRE framework that work on large relational databases. In the proposed approach, we represent the objects at a higher level of abstraction by clustering database entities and representing each cluster as a topic. Similarly, we organize the instances of an entity into groups based on the values of a user-selected attribute. The aim of this chapter is not to propose an approach for detecting topical structures, but rather to show how such an approach can be applied in practical scenarios like information systems. Experiments were conducted at the system and user level on a real dataset using a QBT-based prototype, and the results obtained are encouraging.

The rest of the chapter is organized as follows. In Section 3.1, we explain the QBO approach and discovering topical structures in a database. In Section 3.2, we present the proposed framework. In Section 3.3, we discuss the prototype development based on the proposed approach. In Section 3.4, we present experiments and analysis of the proposed approach. The last section contains the summary and conclusions.

Figure 3.1 QBO user protocol
3.1 Background
In this section, we explain the Query-By-Object (QBO) approach in detail and describe the framework for discovering topical structures in databases.
3.1.1 Query-by-Object Approach
The 'Information Requirement Elicitation' [9] framework allows users to build their queries in a series of steps, where the result of each step is used to determine the user's intent. Based on the notion of IRE, the Query-By-Object (QBO) approach was proposed in [14]. In this approach, the user communicates with a database through a high-level interface. The initial intent of the user is captured via the selection of objects from an object menu. The user then navigates to select the granularity of these objects and the operators to apply between the selected objects. The user's actions are tracked in a query-bag, visible to the user at all stages. Finally, an SQL-equivalent query is formulated and executed at the DBMS server. In the IRE framework, intermediate queries cannot be utilized further and hence there is little support for complex queries. In QBO, the user is allowed to gather and combine query results. This is supported by the closure property of the interface, which states that the result of an operation on objects leads to the formation of another object. Hence, the result of a query can be used to answer an extended query. As the QBO interface involves multiple user-level steps, non-technical users can easily understand and use the system for retrieving information from databases. The developer protocol and user protocol (Figure 3.1) for the QBO approach are as follows:
3.1.2 Example
Consider an example where a developer builds a QBO-based system that users will query.
System development based on the QBO Developer Protocol: The following steps are taken by the developer:
           | film    | actor   | film actor
film       | U, I, C | R       | R
actor      | R       | U, I, C | R
film actor | R       | R       | U, I, C
Table 3.1 Operator Matrix for Example 1

QBO Developer Protocol
1. Store objects and entities in an RDBMS.
2. Define operators for each pair of objects.
3. Provide IRE-based object selection, operation selection and support for the closure property.

QBO User Protocol
1. Select an object.
2. Select the granularity of the object.
3. Select another object.
4. Select the operator.
5. Display the result.
6. If required, extend the query by selecting another object.

Table 3.2 QBO Developer and User Protocols

• Database:
  – film - (film id, film name, film rating)
  – actor - (actor id, actor name)
  – film actor - (film id, actor id, actor rating)
• In this approach, the relations in the entity-relationship (ER) data model are considered as objects. Next, the developer defines pairwise operations between these objects. Four types of operators were proposed: UNION (U), INTERSECT (I), COMPLEMENT (C) and RELATE (R). The 'RELATE' operator has different connotations depending on the objects it operates on. The pairwise operations are shown in Table 3.1.
• A web-based interface provides a list of objects, instances and operations the user can select from. The system also allows the user to combine query responses.
Steps taken by the user based on the QBO User Protocol: Consider an example query that the user is interested in: Find all actors who have worked with the actor 'Jack'. Such a query can be expressed in QBO as: find the names of films actor 'Jack' has worked in, then find the names of actors who worked in these films. The user-level steps are:
• Select object: actor
• Select granularity: actor-'Jack'
• Select another object: film
Figure 3.2 The iDisc approach

• Select operator: Relate
• Select another object: actor
• Select operator: Relate
• Display result
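To make the closure property concrete, the sketch below shows one SQL formulation the back-end could generate for this query chain. The exact SQL produced by the QBO system is not given in the text, so the query shape, the column naming and the database file are illustrative assumptions (SQLite stands in for the server).

import sqlite3

# A hypothetical SQL equivalent of the two chained RELATE steps above:
# actor 'Jack' RELATE film gives Jack's films; that intermediate result
# RELATE actor gives the co-actors (closure property).
QUERY = """
SELECT DISTINCT a2.actor_name
FROM actor a1
JOIN film_actor fa1 ON fa1.actor_id = a1.actor_id   -- films Jack acted in
JOIN film_actor fa2 ON fa2.film_id  = fa1.film_id   -- the same films
JOIN actor a2       ON a2.actor_id  = fa2.actor_id  -- their co-actors
WHERE a1.actor_name = 'Jack' AND a2.actor_id <> a1.actor_id;
"""

conn = sqlite3.connect("films.db")  # assumed database with the example schema
for (name,) in conn.execute(QUERY):
    print(name)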
3.1.3 Discovering Topical Structures in Databases
Discovering topical structures in databases allows us to group semantically related tables into a single group, helping to identify what users might perceive as a single unit of information in the database. Consider a database D consisting of a set of tables T = {T1, T2, ..., Tn}. The topical structure of D describes a partitioning C = {C1, C2, ..., Ck} of the tables in T such that the tables in the same partition have a semantic relationship and belong to one subject area. In [17], the authors proposed iDisc, a system which discovers the topical structure of a database by clustering tables into quality clusters. Clustering [67] is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity but are dissimilar to objects in other clusters. The iDisc approach is described in Figure 3.2. The input to iDisc is a database D consisting of a set of tables T, and it returns a clustering C of the tables in T. In the iDisc approach, a database is first modeled by various representations, namely vector-based, graph-based and similarity-based. In the vector-based model, each table is represented as a document in the bag-of-words model, and a database is hence represented as a set of documents. In the graph-based model, the database is represented as an undirected graph. The nodes in the graph are the tables in the database (T). Two tables Ti and Tj share an edge in the undirected graph if there exists a foreign key relationship between Ti and Tj. In the similarity-based representation, a database D is represented as an n × n similarity matrix M, where n = |T| and M[i, j] represents the similarity between tables Ti and Tj. The similarity between two tables is calculated by finding matching attributes based on a greedy matching strategy [68]. The table similarity is then averaged over the similarities of the matching attributes. In the next phase, clustering algorithms are applied to each database representation model. The vector-based and similarity-based models use a hierarchical agglomerative clustering approach, and a cluster quality metric is defined to measure cluster quality. For the graph-based representation, shortest-path betweenness and spectral graph partitioning techniques are used for partitioning the graph into connected components. Similar to the other representations, a cluster quality metric is used to measure the quality of the connected components. After the clustering process ends, the base-clusterer for each representation selects the clustering with the highest quality score, and a preliminary clustering for each representation is discovered. After identifying the preliminary clusterings, iDisc uses a multi-level aggregation approach to combine the results from each clusterer using a voting scheme and generate the final clusters. A clusterer boosting technique is also used in the aggregation approach, assigning weights to produce more accurate clustering representations. Finally, a representative for each cluster is discovered using an importance metric based on the centrality score of the tables in the graph-based representation. The output of iDisc is a clustering of the tables in the database, where each labeled cluster represents a topic.
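As an illustration of one base clusterer, the following is a minimal sketch of hierarchical agglomerative clustering over the similarity-based representation (the n × n matrix M). The quality-metric-driven selection of the best cut in iDisc is simplified here to a fixed cluster count, so this is an assumption-laden sketch inspired by [17], not a reimplementation.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_similarity_model(M, n_clusters):
    # M: symmetric n x n table-similarity matrix with 1.0 on the diagonal
    D = 1.0 - np.asarray(M, dtype=float)           # similarity -> distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D), method="average")   # agglomerative clustering
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy matrix: tables 0 and 1 are similar, tables 2 and 3 are similar.
M = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
print(cluster_similarity_model(M, 2))   # e.g. [1 1 2 2]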
3.2 Proposed Approach
In this section, we first present a case study of eSagu™, an IT-based personalized agro-advisory system. From the case study, we highlight our motivation and the problem we aim to solve. Later, we discuss the proposed approach in detail.
3.2.1 Case Study
The eSagu system aims to improve the productivity of farms by delivering high-quality, personalized (farm-specific) agro-expert advice in a timely manner to each farm at the farmer's doorstep, without the farmer asking a question. In eSagu, the agriculture scientist, rather than visiting the crop in person, delivers the expert advice by receiving the crop status in the form of digital photographs and related information. The eSagu system records data about the farmers, farm history, sowing details, soil details, crop details and information about problems/diseases observed by farmers. Agro-experts need to analyze the observation data from various perspectives to deliver personalized advice and have complex query requirements, and these requirements tend to change frequently. The agro-experts are familiar with the data domain but are not technical experts. Hence, there is a need for a higher-level interface and presentation model to access data in the eSagu system. The issue here is that the query interface proposed to elicit the information requirements of non-technical users should be easy to use while still allowing users to pose a wide range of queries. The QBO approach and its merits have been discussed in Section 3.1. To design a user interface based on QBO that provides information requirement elicitation for eSagu, we face the following scenarios:
• Implement the eSagu system in an RDBMS, where each table corresponds to an object. The eSagu database consists of 84 tables.
• Define operations between 84 × 84 object pairs.
• Provide a web-based interface listing 84 tables and their instances (some tables contain more than 10^4 rows).
Use Case: Consider the scenario where a user is trying to query the eSagu database using a web-based interface designed using the developer's protocol. The user protocol would include:
• Select an object: the user would have to scan a list of 84 objects to locate the object of interest.
• Select granularity or instance selection: even if instance selection is based on attribute values, attributes can have a large number of distinct values.
• Select operator: the user would have to grasp how each object relates to the other objects.
A complex database may contain a large number of tables in the schema due to conceptual design or schema normalization. In such cases, it is difficult for the user to locate the information of interest. A naive solution, organizing objects alphabetically, may not be efficient. For example, in the eSagu database, there are 35 tables for various crop observations: cotton observation, crossandra observation and 33 others. If a user wants to browse through all such observation tables, he would need to know all the crop names. An organized list where crop observation tables are grouped together and then sorted alphabetically would be more intuitive for the user. Hence, when the objects are large in number, there is a need to represent them at a higher level of abstraction. Similarly, there is a need for a better organization when the object instances are large in number. In general, we are faced with the following problems for QBO developers and users:
• A large number of tables in the schema makes it harder for the user to locate the information of interest.
• With a large number of instances in each table, selection of the desired instance becomes difficult.
• With a large number of tables, the number of pairwise operations between tables also increases. For n tables in the schema, in the worst case n × n operational pairs exist.
3.2.2 Basic Idea
In the proposed approach, we exploit the notion of detecting topical structures in databases to represent the schema at a higher level of abstraction. Identifying topical structures allows tables which are semantically correlated to be grouped together, which provides a better organization of the options presented to users. Secondly, instead of defining operations between each pair of tables, we can define operations between topics and within topics. Hence, the number of pairs for which operators have to be defined can be reduced significantly. Similarly, to facilitate easier instance selection, we organize the instances of an attribute into bins, providing a two-level hierarchy for instance selection. The developer protocol is modified to include the steps required to generate these abstract levels. Consequently, the user protocol is also modified. The proposed approach adds the following processes to QBO:
• Organizing objects into topical structures.
• Facilitating instance selection.
• Defining operators for the topical structure.
We discuss each of these processes in detail in the following subsections.

3.2.2.1 Organization into topics:
For organizing objects into topical structures, we use the iDisc approach described in Section 3.1.3. Given a database containing a set of tables T = (T1, T2, ..., Tn) as input, the iDisc framework generates a clustering C = (C1, C2, ..., Ck) of the tables in the schema along with a representative table for each cluster, L = (L1, L2, ..., Lk). Ci represents the set of tables belonging to the ith cluster, and Li represents the representative table of cluster Ci. The name of a cluster's representative table is used to assign a label to the cluster. Each labeled cluster collectively represents a topic in the database. In the QBO approach, the hierarchy of information organization is as follows:
Tables → Attributes → Attribute Instances
After generating the topical structure of the database, we make the following modification in the hierarchy of organization:
Topics → Tables → Attributes → Attribute Instances
In other words, we introduce topics and present the database tables belonging to a topic as its granularity. Hence, an object in QBT is a topic with three levels of granularity (tables, attributes and attribute instances), in contrast to QBO, which had only two levels of granularity (attributes and attribute instances). Our approach is also in accordance with the IRE framework. By introducing topics, users can browse the database contents semantically, which provides more intuitive options to the users.
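A tiny illustration of the modified hierarchy, using the film database of Section 3.1.2, is given below; the topic labels and groupings are hypothetical, since the actual labels come from the representative tables iDisc selects.

# Topics -> Tables -> Attributes (-> attribute instances fetched on demand)
hierarchy = {
    "Film": {                                  # topic (representative: film)
        "film":       ["film_id", "film_name", "film_rating"],
        "film_actor": ["film_id", "actor_id", "actor_rating"],
    },
    "Actor": {                                 # topic (representative: actor)
        "actor":      ["actor_id", "actor_name"],
    },
}

# Each user step descends one level of the hierarchy.
for topic, tables in hierarchy.items():
    print(topic, "->", sorted(tables))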
3.2.2.2 Facilitating Instance Selection:
For selecting instance(s) of an object, selection based on attribute values comes naturally to the user. Thus, we first ask the user to select an attribute and then select its instances. However, when the number of instances of an attribute is large, we need an efficient organization of the options. Here, two goals are in conflict: while we want to let the user drill down to his requirements in multiple steps, we may end up creating too many steps, which is unfavorable for the user. We therefore create a two-level hierarchy for attribute values, so that instance selection requires few steps while still providing a better organization. In the two-level hierarchy, we organize the attribute instances by grouping them into intervals. The first level represents the intervals and the second level represents the instances themselves. Considering the values of an attribute as a data distribution, creating intervals corresponds to determining the bins of a histogram for that distribution. Standard methods for calculating the number of bins (k) or the bin width (h) for a distribution of n values are as follows:
• Sturges' formula: k = ⌈log₂ n + 1⌉
• Square-root choice: k = √n
• Scott's choice (based on bin width): h = 3.5σ / n^(1/3), where h represents the bin width and σ the standard deviation
• Freedman-Diaconis' choice: h = 2 × IQR(x) / n^(1/3), where IQR is the interquartile range
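As a concrete illustration, the following sketch computes all four rules for a numeric attribute; the attribute values are synthetic and the helper name is ours.

import math
import numpy as np

def histogram_bins(values):
    x = np.asarray(values, dtype=float)
    n = len(x)
    sturges = math.ceil(math.log2(n) + 1)          # Sturges' formula (k)
    sqrt_choice = math.ceil(math.sqrt(n))          # square-root choice (k)
    scott_h = 3.5 * x.std() / n ** (1 / 3)         # Scott's choice (width h)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    fd_h = 2 * iqr / n ** (1 / 3)                  # Freedman-Diaconis (width h)
    return sturges, sqrt_choice, scott_h, fd_h

ages = np.random.randint(18, 80, size=1000)        # toy attribute distribution
print(histogram_bins(ages))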
We would like to point out that the aim of the proposed approach is to make it easier for the user to select instances. For example, for a textual attribute representing the names of people in a community, a simple alternative is to bin by the first letter of the name rather than by the distribution. For textual attributes, we additionally provide a search box that acts as a filter over the attribute's values. The usability of the search tool becomes even more prominent when the textual attributes contain long texts.
3.2.2.3 Defining Operations:
Next, we need to define the operators for QBT. Operators enable us to perform complex queries on databases involving one or more objects; the selected objects act as operands to the operators. We define two types of operator matrix:
i. Within-Topic Operator Matrix (WT): This matrix represents all possible operations within a topic. The WT matrix includes the operations between a topic's representative table and the other tables belonging to the topic, as well as between tables of the same topic.
Figure 3.3 Topical Structure for QBT

ii. Between-Topics Operator Matrix (BT): This matrix represents the possible operations between the representative tables of each topic. The diagonal elements represent the WT matrices of the topics, and the non-diagonal elements represent operations between two distinct topics.
By defining operational pairs between topics and within a topic, we reduce the number of pairs for which operations need to be defined. The reduction in operational pairs depends on the topical structure identified for the database. Figure 3.3 shows an example of the organization of tables into topical structures. A topic is represented by its representative table, and all other tables belonging to a topic are called its subordinate tables. The first subscript identifies the topic and the second indicates whether the table is a representative table or a subordinate table of the topic; the tables of each topic are further distinguished as a, b, and so on. Table 3.3 describes the Within-Topic matrix for the first topic (WT-I) and Table 3.4 describes the Between-Topic matrix (BT). The following scenarios come up in the context of Figure 3.3.
     | T11     | T12a    | T12b    | T12c
T11  | U, I, C | R       | R       | R
T12a | R       | U, I, C | R       | R
T12b | R       | R       | U, I, C | R
T12c | R       | R       | R       | U, I, C

Table 3.3 Within-Topic Matrix 1 (WT-I)

     | T11    | T21
T11  | [WT-I] | R
T21  | R      | [WT-II]

Table 3.4 Between-Topic Matrix (BT)
Scenario 1. The two selected objects belong to the same topic. It has three further possibilities:
• Both tables are representative tables {T11, T11}: As there is only one representative table per topic, this represents an operation between the same table. The possible operations are provided in the Within-Topic operator matrix (WT-I[1,1]).
• One table is the representative table and the other is a subordinate table {T11, T12a}: This case represents a RELATE operation between the two tables. The operations are defined in the Within-Topic operator matrix (WT-I[1,2]).
• Both tables are subordinate tables {T12a, T12b}: The two tables relate either directly or through the representative table of the corresponding topic, and the operations are performed at this level (WT-I[2,3]).
Scenario 2. The two selected objects belong to different topics. It has three further possibilities:
• Both selected tables are representative tables {T11, T21}: The possible operations are defined in the Between-Topics operator matrix (BT[1,2]).
• One table is a representative table and the other is a subordinate table {T11, T22a}: In this case, the tables can be related at the higher level via the representative tables of the two topics (BT[1,2]).
• Both tables are subordinate tables {T12a, T22a}: Similar to the above case, the two tables can be related through their representative tables. The possible operations are defined in the Between-Topics matrix (BT[1,2]).
A small sketch of the matrix lookup implied by these scenarios is given below.
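The scenario analysis reduces to a small lookup function. The data structures here (a topic map and matrices keyed by table or topic identifiers) are assumptions for illustration, not the prototype's actual representation.

def resolve_operations(t1, t2, topic, WT, BT):
    # topic[t] -> topic id of table t
    # WT[c]    -> within-topic operator matrix of topic c, keyed by table pair
    # BT       -> between-topics operator matrix, keyed by topic pair
    c1, c2 = topic[t1], topic[t2]
    if c1 == c2:                 # Scenario 1: both tables in the same topic
        return WT[c1][(t1, t2)]  # covers rep/rep, rep/sub and sub/sub cells
    return BT[(c1, c2)]          # Scenario 2: related via representative tables

Because subordinate tables of different topics fall back to a single BT cell, the number of cells to define grows with the number of topics rather than with the number of tables.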
3.2.3 QBT Protocols
In this section, we describe the QBT developer protocol and the QBT user protocol.

3.2.3.1 QBT Developer Protocol
• Store objects and entities in a database (RDBMS).
• Organize the tables in the schema by topic, as described in Section 3.2.2.1.
• Create a framework to organize attribute instances into the two-level hierarchy, as explained in Section 3.2.2.2.
• Define operations within each topic and between topics, as described in Section 3.2.2.3.
• Provide an interface based on QBT that allows object selection and instance selection, and supports the closure property.

3.2.3.2 QBT User Protocol
The user protocol for QBT is described in Figure 3.4. The main options in QBT are as follows:
• Select a topic
• Select granularity (a table, attribute and attribute values)
• Select another topic
• Select an operation
• Display the result
• Extend the query, if required
Figure 3.4 QBT user protocol
3.3 System Prototype
In this section, we discuss the prototypes developed for the QBO approach and the QBT approach based on the notion of IRE. As shown in Figure 3.5, the system prototype is based on a client-server architecture. Users interact with the system through a web-based user interface (EQBO client), which allows object selection and operator selection, and also displays query results. The back-end (EQBO server) consists of a system that processes the inputs given by a user and generates an SQL query that is executed on a relational database server (MySQL). The results of the SQL query are presented to the user at every stage of interaction. The user interface was implemented in PHP using open-source jQueryUI and visualization tools. The options available to the user are refined by means of AJAX calls to the system, using JSON objects for information transfer between client and server. The developer protocols were followed to define the objects and operations. For the QBO prototype, each table in the database corresponds to an object. The attributes of a table are considered as its granularity, based on which instances of the object can be selected. In the QBT prototype, we discovered topical structures in the database, with topics corresponding to objects.

Figure 3.5 System Prototype Architecture

Operators are required when a user wants to relate information from one object with another object. Analogous to a calculator, a query can be expressed as A op B = C, where A is the left operand, B is the right operand, op represents the operator and C represents the result. A and B represent objects defined in the database. Considering objects analogous to the numbers in a calculator, operators can be unary, requiring a single left operand (A) as an argument, or binary, requiring both a left operand (A) and a right operand (B). In a calculator, one or two objects of the same type (numbers) are operated on and result in an object of the same type (a number). In our system, however, objects are of different types, as each object corresponds to a table containing different attributes. Consequently, depending on the operator selected, the resulting object can be of type A, type B or type A join B. Four binary operators are defined for any general-purpose database: ADD (union), MINUS (complement), AND (intersect) and RELATE. For two objects of the same type, the binary operators ADD, MINUS and AND are defined. For two different objects, the binary operator RELATE provides a natural join between the objects. For each object, unary operators are defined corresponding to each direct join relationship it has with other objects. In addition to the default operators, the database administrator can define various domain-specific operators to provide more flexibility to end users.
3.3.1 CONFIG-DB
Configuration information corresponding to different databases is stored in the CONFIG-DB (Figure 3.6). The CONFIG-DB consists of an object table which stores the names of the objects identified from a database.

Figure 3.6 CONFIG-DB

The CONFIG-DB simply maintains an index of objects; object granularity and attribute values are accessed from the original database to which the object belongs. If the database is to be represented as topics, a topics table is defined similarly to the objects table. Each topic has a representative object. Objects belonging to the same topic have the same topic id; otherwise the topic id is null. In addition, the unary operators table and the binary operators table store operator details such as the left operand object, the right operand object, the SQL query for the operator, the resultant object and the location of the icon that visually represents the operator in the user interface. For any database, each table is by default indexed as an object, and default binary and unary operators are defined. However, the CONFIG-DB can be re-populated by the database administrator to allow topic representation or to define more operators.
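As an illustration, the CONFIG-DB tables described above might be declared as follows. The column names are assumptions inferred from the text (the exact DDL is not given), and SQLite stands in for MySQL here for self-containment.

import sqlite3

DDL = """
CREATE TABLE objects (
    object_id   INTEGER PRIMARY KEY,
    object_name TEXT NOT NULL,          -- table name in the source database
    topic_id    INTEGER                 -- NULL when topics are not used
);
CREATE TABLE topics (
    topic_id    INTEGER PRIMARY KEY,
    topic_name  TEXT NOT NULL,
    rep_object  INTEGER REFERENCES objects(object_id)
);
-- the unary operators table is analogous, without the right operand column
CREATE TABLE binary_operators (
    op_id       INTEGER PRIMARY KEY,
    left_obj    INTEGER REFERENCES objects(object_id),
    right_obj   INTEGER REFERENCES objects(object_id),
    sql_query   TEXT,                   -- SQL template executed for the operator
    result_obj  INTEGER,                -- type of the resulting object
    icon_path   TEXT                    -- icon shown in the operator grid
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)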
3.3.2 User Interface
The design of the user interface was motivated by the aim of providing an interface analogous to a traditional calculator. As presented in Figure 3.7, the selection of numbers is analogous to object selection (red), and operator selection for numbers is analogous to operator selection for objects (blue). The display section of a traditional calculator, which shows the result and the numbers or operators selected, is analogous to a query-bag that keeps track of the user's interactions and intermediate results (green).

Figure 3.7 Traditional calculator versus System Prototype UI

Figure 3.8 depicts the user interface, which consists of four sections (each indicated by a numbered arrow). The first section (1) of Figure 3.8 describes the interaction process of selecting objects, granularity and instances. In general, this section should provide an efficient representation of the database schema and data. It has been widely studied that visual representations of objects can be easily manipulated by the user. However, for large-scale databases that contain large numbers of tables, attributes and attribute values, visual representations become complex and are restricted by the screen size. Consider a database with 100 tables, each table consisting of 15 attributes on average and each attribute containing 1000 distinct values on average. Visual representations like treemaps, which provide a compact representation of hierarchical data, suffer badly as the screen becomes densely packed. Similarly, graph representations, which can represent both objects and the relationships between them, suffer as the network structure becomes dense and confusing to the user. To deal with large-scale databases, we use cascading menus to represent the database hierarchy. The leftmost menu represents objects, grouped by topics. On selection of an object, its attributes are presented in the second menu. Subsequently, attribute selection leads to a third menu consisting of attribute values. Since the attribute values are likely to be many, a search box is provided for locating the desired information. The second section (2) of Figure 3.8 represents operator selection. Operators are presented as a grid of buttons, similar to the grid of operators on a traditional calculator. The operator grid is updated based on the object selected. In addition to operators, the grid contains a 'backspace' and a 'calculate' button to undo previous selections and to evaluate an expression, respectively. An important design choice is the use of icons along with a textual representation of the operator's functionality. An icon provides a visual representation of the operator's functionality that can be easily grasped by the end user. For example, a + icon is displayed for an operation where the user wants to add more instances of an object to an existing object. The third section (3) of Figure 3.8 describes the representation of the query-bag, which keeps track of the user's selections (objects and operators), similar to the display section of a calculator. In general, the user's selections represent a very small subset of the available options. We can thus use visual tools to represent the user's selections and operation results.
Figure 3.8 System Prototype UI
Figure 3.9 Treemap representation of user's selection (object and granularity)
Object selection is represented through treemaps, as they provide a compact representation of hierarchical data, and operator selection is represented via icons. For example, Figure 3.9 shows the treemap representation of a user's selection of a farmer object with granularity gender = male and age = 20, 21 or 22. Note that the treemaps are not displayed until the user has made a selection on an object. The fourth section (4) of Figure 3.8 presents the results of the SQL query formulated from the user's interactions. Each selection made by the user updates the query; correspondingly, an SQL query is executed on the database server and its results are presented back to the user in real time. The SQL query results are displayed in the query-by-example (QBE) style. Real-time presentation of the SQL query results allows users to validate their selections at every stage and reduces the probability of formulating a wrong query.
3.4 Experiments
3.4.1 Experimental Methodology
To analyze the effectiveness of the proposed approach, we conducted system-level experiments as well as a usability study. The system-level experiments evaluate the reduction in navigation burden and the reduction in the number of operational pairs compared to the QBO approach. The usability study consists of a task analysis and an ease-of-use survey on a real database with real users. For the usability study, we developed two prototypes, one based on the QBO approach and one based on the QBT approach. The interfaces of the two prototypes are almost identical, except that the QBO prototype does not group objects by topics and does not provide bins for instances. First, we perform a task analysis of the QBT prototype and the QBO prototype to check whether the proposed approach is beneficial to the user. To compensate for the limitations of the task analysis (discussed later in Section 3.4.3.3), we ask the users to explore the database on their own and pose queries from their day-to-day requirements using both prototypes. After the exploration session, they fill out a questionnaire rating the prototypes. This may not be the most rigorous usability evaluation, but it reduces the bias of the task analysis.
3.4.2 Performance Analysis
We measure the effect of using topical structures at the system level through the reduction factor (RF) for operational pairs. The reduction factor compares the number of operational pairs in the QBT approach with that in the QBO approach. If the number of operational pairs in QBT is OP_qbt and in QBO is OP_qbo, the reduction factor RF_op is defined as:

RF_op = 1 − OP_qbt / OP_qbo    (3.1)
We illustrate the metric by referring to Figure 3.3, where the total number of tables is 8. When the tables are divided into two topics, the operational pairs are as follows: two 4 × 4 WT matrices and one diagonal BT matrix (2 pairs). Hence OP_qbt is 34, while OP_qbo is 64 (8 × 8), giving a reduction factor of 0.46 for operational pairs. For the eSagu database, after identifying topical structures, the operational pairs were calculated for the between-topics matrix (BT) and the within-topic matrices (WT). The observed reduction factor for operational pairs (RF_op) was 0.76.
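The computation for the Figure 3.3 example can be checked with a short sketch; the helper name is ours.

def reduction_factor(n_tables, topic_sizes):
    op_qbo = n_tables ** 2                                   # one cell per table pair
    k = len(topic_sizes)                                     # one WT matrix per topic
    op_qbt = sum(s ** 2 for s in topic_sizes) + k * (k - 1)  # plus off-diagonal BT pairs
    return 1 - op_qbt / op_qbo

print(reduction_factor(8, [4, 4]))   # 0.46875, reported as 0.46 above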
3.4.3 Usability Study
Usability tests were conducted with four real users who had computer experience but were not skilled in SQL or query languages. The users belonged to the age group 20-26 and were agriculture consultants at the IT for Agriculture lab, IIIT Hyderabad. The users were familiar with the database domain, mainly eSagu, and could validate the query results comfortably. Users were briefed about the QBT prototype for 15 minutes, along with a quick demonstration of a sample query. Before the experiments, users were allowed a 5-minute practice session to get acquainted with the tool. We performed two experiments: a task analysis and a USE survey [69].

Figure 3.10 QBO Approach Prototype
Figure 3.11 QBT Approach Prototype (with topic modeling and binning)

Task | User1   | User2   | User3   | User4
T1   | 21 (1)  | 16 (1)  | 41 (2)  | 22 (1)
T2   | 18 (2)  | 31 (2)  | 30 (1)  | 27 (1)
T3   | 170 (3) | 81 (2)  | 79 (1)  | 112 (2)
T4   | 17 (1)  | 18 (1)  | 22 (1)  | 25 (1)
T5   | 25 (1)  | 18 (1)  | 41 (2)  | 24 (1)
T6   | 140 (2) | 151 (2) | 110 (2) | 103 (2)

Table 3.5 Time taken in seconds (attempts taken) for each task

3.4.3.1 Experiment 1, Task Analysis:
After the initial interactive session, the users were given six tasks:
• T1: Find the details of the family members of the farmer D.Laxama Reddy.
• T2: Find all the farms owned by the farmer named Polepally Thirumalreddy.
• T3: Find all the observations given to farmers from Malkapur village who grow cotton crops.
• T4: Find the details of the livestock belonging to the farmer d.laxama reddy.
• T5: Find all the farmers belonging to the coordinator named k.s narayana.
• T6: Find all the advice given to farmers from Malkapur village.
Task | Min Time | Max Time | Average | Std. Deviation | Avg. time for query construction
T1   | 16       | 41       | 25      | 10.98          | 20
T2   | 18       | 31       | 26.5    | 5.91           | 17.66
T3   | 79       | 170      | 110.5   | 42.44          | 55.25

Table 3.6 Query building time results for QBT

Task | Min Time | Max Time | Average | Std. Deviation | Avg. time for query construction
T4   | 17       | 25       | 20.5    | 3.69           | 20.5
T5   | 18       | 41       | 27      | 10.29          | 21.6
T6   | 103      | 151      | 126     | 23.13          | 63

Table 3.7 Query building time results for QBO
Each task involved constructing a query corresponding to the task requirement and retrieving the correct result. Ideally, we would like to evaluate the two prototypes on the same tasks. However, if a user performs a task on one prototype and then performs the same task on the other, the second prototype is at an advantage because the user has already gained experience with the task. To address this issue, we instead divided the tasks into two groups of three tasks each. The first three tasks (T1, T2 and T3) were performed on the QBT prototype while the last three tasks (T4, T5 and T6) were performed on the QBO prototype. While different tasks are performed on the two prototypes, we ensured that the tasks are similar in nature and complexity. Task T1 is similar to task T4, differing only in the objects involved (family details versus livestock details). Similarly, T2 and T5 represent a join operation, differing only in the object involved. Tasks T3 and T6 represent a complex join involving three objects. Table 3.5 shows the time taken by each user to build the query for all six tasks, along with the total number of attempts taken to complete each task. Note that we only account for the time taken by the user to build the query, not the time taken by the system to execute it. The average time to complete all the tasks was 5 minutes and 36 seconds. The longest time to complete the six tasks was 6 minutes and 31 seconds, while the fastest was 5 minutes and 13 seconds. The standard deviation of the time to complete all six tasks was 37 seconds. Table 3.6 and Table 3.7 show the query building times for the two prototypes. For QBT, the average time to complete the first three tasks (T1, T2 and T3) successfully was about 2 minutes and 35 seconds. The longest time taken to complete the tasks on the QBT prototype was 3 minutes and 29 seconds and the fastest was 2 minutes and 3 seconds. The standard deviation of the time to complete the first three queries was 40 seconds. For QBO, the average time for the last three tasks (T4, T5 and T6) was 2 minutes and 54 seconds, the longest time was 3 minutes and 7 seconds and the fastest was 2 minutes and 32 seconds. The standard deviation of the time to complete the last three tasks was 16 seconds. The average number of attempts required by users to complete all six tasks was 9. The average number of attempts required to complete the first three tasks was 4.75, while the average number of attempts required to complete the last three tasks was 4.25. The maximum number of attempts required by a user for any single task was 3 (for T3). The average time for query construction for the first three tasks was 34 seconds, while the average time for query construction for the last three tasks was 41 seconds. As discussed in the experimental methodology, the tasks are performed first on the QBT prototype and then on the QBO prototype, so the QBO prototype has the advantage that the user is already accustomed to performing similar tasks. Nevertheless, the average time for query construction on QBT is lower than on QBO, which shows that users are able to locate their information more quickly in QBT than in QBO.

Figure 3.12 Average ratings for questions from the questionnaire

3.4.3.2 Experiment 2, Use Survey:
After the task evaluation, we conducted a survey to determine how the users felt about each prototype individually. Users were asked to explore the prototypes and pose various queries from their day-to-day requirements. After the users had explored the database using the two prototypes, they were asked to fill in a questionnaire based on the USE survey. The questionnaire asked the users to rate both prototypes on the following statements:
• Q1: The tool is easy to use.
• Q2: The tool is sufficient for my information requirements.
• Q3: The tool can be used with minimal effort.
• Q4: The tool requires minimal training and can be used without written instructions.
• Q5: I can locate my information easily.
• Q6: The tool requires minimal steps to formulate a query.
The users responded to each question on a scale ranging from 0 (completely disagree) to 10 (completely agree). Finally, each user was asked to give general feedback about the prototypes, to identify additional comments about strengths and weaknesses for improving the tool. Figure 3.12 shows the average ratings given by the users for each question. The mean rating for the QBT prototype was 6.95 with a standard deviation of 0.24; the mean rating for the QBO prototype was 6.33 with a standard deviation of 0.30. For Q1, the QBT prototype received an average rating of 7.25 while the QBO prototype received 6.5. For Q2, the QBT prototype received 6.75 while QBO received 6.25. For Q3, the QBT prototype received 7 while the QBO prototype received 6.5. For Q4, both prototypes received an average rating of 6.75; the two prototypes differ less in user interface design than in the process of interaction. For Q5, the QBT prototype received 7.25, whereas the QBO prototype received 6. In the QBT prototype, the introduction of topics to organize objects helps the user locate objects quickly. For Q6, the QBT prototype received 6.75, whereas the QBO prototype received 6. The highest ratings for the QBT prototype were received for Q1 and Q5, while the lowest were received for Q2, Q4 and Q6 alike. For the QBO prototype, the highest rating was received for Q4 and the lowest for Q5 and Q6. The high ratings for Q1 and Q5 show that, after exploring the data through the prototype, users feel the QBT prototype is easy to use and lets them locate their desired information quickly compared to QBO. On the other hand, the lower ratings for Q2, Q4 and Q6 for QBT show that there is still scope for improvement, as users feel they cannot express all their requirements. After users had been given the freedom to explore both prototypes, the QBT prototype in general received higher ratings than the QBO prototype. Although the difference in ratings is not large, it indicates a preference for QBT over QBO.
3.4.3.3 Limitations and possible improvements for the usability study
The users in our usability study were a group of agricultural experts working in the IT for Agriculture lab, IIIT Hyderabad. They matched our target audience: users who are unfamiliar with database systems but familiar with the data they want to query. However, the usability study could have been conducted iteratively with different groups of users rather than a single group of agricultural experts. An expert review of the prototypes would have been another useful addition. We used a limited set of six questions in the survey. In [68], the authors describe an array of questions that could be used for a detailed study of user behavior; using another popular questionnaire such as the System Usability Scale (SUS) was also an alternative. The questionnaire could also have included questions that directly compare the two prototypes. Measuring the internal consistency or a reliability score could have been used to validate our questionnaire. We used mean ratings to evaluate our study, while other measures such as the standard deviation and the correlation between questionnaire ratings could be studied for a more detailed analysis. For the task analysis, we made the users perform three similar tasks on each of the two prototypes. As users completed the first three tasks on one prototype, they gained experience that helped them complete the remaining tasks on the other prototype, creating a bias in favor of one of the prototypes. Similarly, we could not let the users complete the same task on both prototypes, as that too would have created a bias.
3.5 Summary of the chapter
Accessing a database requires the user to be familiar with query languages. The QBO approach, based on the IRE framework, provides an interface where a user progressively builds queries in multiple steps. This approach works well for small databases but does not perform well for a database consisting of a large number of tables and rows. In this chapter, we proposed Query-by-Topics (QBT), which enhances the existing QBO approach. We exploit topical structures in large databases to represent objects at a higher level of abstraction, and we organize the instances of an object into a two-level hierarchy based on a user-selected attribute. The advantages of this approach are a lower navigational burden for the user and a reduced number of operator definitions at the system level. The QBT prototype was implemented for a real database, and experiments were conducted at the system level and the user level to demonstrate these advantages.
Chapter 4 Exploiting Schema and Documentation for Summarizing Relational Databases
According to a recent study, users take more time to express and formulate their query requirements than it takes to execute the query and display the result [70]. With the increasing complexity of modern databases, users spend a considerable amount of time understanding a given schema in order to locate their information of interest. To address these issues, the notion of schema summarization was proposed in the literature [25, 18]. Schema summarization involves identifying semantically related schema elements that represent what users may perceive as a single unit of information in the schema. Identifying abstract representations of schema entities helps in efficient browsing and a better understanding of a complex database schema. Practical applications of schema summarization include the following:
• Schema matching [71, 59] is a well-researched problem that involves identifying mappings between attributes from different schemas. After identifying abstract representations of schema elements, we can reduce the number of mapping identification operations by identifying mappings at the abstract level rather than the schema level.
• In query interfaces, users construct their query by selecting tables from the schema. A quick schema summary lookup can help the user understand where the desired information is located and how it is related to other entities in the schema.
The problem of schema summarization has gained attention recently in the database community. Existing approaches [18, 19, 20] for generating a schema summary exploit two main sources of database information: the database schema and the data stored in the database. In a related work, Wu et al. [21] described an elaborate approach (iDisc) for clustering schema elements into topical structures by exploiting the schema and the data stored in the database. In this chapter, we propose an alternative approach to schema summarization that exploits the documentation of the database in addition to its schema; we investigate how the documentation of a database provides scope for efficient schema summarization. The database documentation contains domain-specific information about the database which can be used as an information source. For each table, we first identify the corresponding passages in the documentation. A table similarity metric is then defined by exploiting the similarity of the passages describing the schema elements in the documentation and the referential relationships between the tables. Using this similarity metric, a greedy weighted k-center clustering algorithm is used to cluster the tables and generate the schema summary. Experimental results on the TPCE [72] benchmark database show the effectiveness of the proposed approach. The rest of the chapter is organized as follows. In Section 4.1, we describe the proposed approach, including the basic idea, the table similarity measure and the clustering algorithm. In Section 4.2, we discuss the experimental results and analysis. Section 4.3 contains conclusions and future work.

Figure 4.1 TPCE Schema
4.1 Proposed Approach
We use the TPCE schema [72], shown in Figure 4.1, as the running example in this chapter. The TPCE schema consists of 33 tables grouped into four categories: Customer (blue), Market (green), Broker (red) and Dimension (yellow). This categorization is provided by the TPCE benchmark, and it also serves as the gold standard for the evaluation of our experiments.
Existing approaches for clustering database tables are data oriented, utilizing the schema and the data in the database to generate the schema summary. In scenarios where the data is insufficient, or where some tables contain no data, we have to look for alternative sources of information. For example, in the TPCE benchmark database, if no active transactions are considered, the table trade request is empty and hence cannot be considered for clustering by existing approaches. Thus, we investigate alternative sources of information for a database. Databases are accompanied by documentation or a requirements document. These documents contain domain-specific information about the database that can be exploited for generating a schema summary. Although one could read the documentation and infer the schema summary manually, this is not always feasible: documentation for an enterprise database is generally large, spanning hundreds of pages. The documentation for TPCE is 286 pages long, and going through it manually would be a tedious process for the user. In the proposed approach, we aim to generate the schema summary efficiently using only the schema and the documentation.
4.1.1 Basic Idea
A foreign key relationship between two tables indicates that a semantic relationship exists between them. However, referential relationships alone do not provide good results [20]. Hence, we supplement this referential similarity between tables with another notion of similarity, such that tables belonging to one category attain higher intra-category similarity. This additional similarity criterion is based on the similarity between the passages of text representing the tables in the database documentation. The intuition behind this notion of similarity is that tables belonging to the same category should share some common terms about the category in the documentation. We combine the referential similarity and the document similarity by means of a weighted function and obtain a table similarity metric over the relational database schema. Once the pairwise similarity between tables is computed, we use a weighted k-center clustering algorithm to partition the tables into k clusters. The proposed table similarity measure thus has two components: one based on referential relationships and the other based on the similarity of the corresponding passages in the documentation. We first explain the components and then present the table similarity measure.
4.1.2 Schema-based Table Similarity
In a relational database, foreign keys are used to implement referential constraints between two tables. The presence of a foreign key thus implies that the two tables have a semantic relationship. Such constraints are imposed by the database designer or administrator and form the basic ground truth for the similarity between tables. In our approach, the referential similarity between two tables R and S is expressed as RefSim(R, S).
             | Security | Daily Market | Watch Item
Security     | -        | 1            | 1
Daily Market | 1        | -            | 0
Watch Item   | 1        | 0            | -

Table 4.1 Referential Similarity between tables security, daily market and watch item

RefSim(R, S) = 1, if R and S have a foreign key constraint
               0, otherwise
Example 1: Consider the three tables security, daily market and watch item (S, D and W) in the TPCE schema. The table security has a foreign key relationship with daily market and with watch item; hence RefSim(S, D) = RefSim(D, S) = 1 and RefSim(S, W) = RefSim(W, S) = 1. The pairwise similarities are shown in Table 4.1.
4.1.3 Documentation-based Table Similarity
In addition to the referential similarity, we also infer the similarity between tables using the database documentation as an external source of information. First, we find the passage describing each table in the documentation using a passage retrieval approach. The similarity between two tables then corresponds to the similarity between the corresponding passages in the documentation. The passage from the documentation representing a table Ti is referred to as the table-document of Ti, TD(Ti). The first task is to identify the table-document of each table in the documentation. Then, we find the pairwise similarity between the table-documents.

4.1.3.1 Finding Relevant Text from the Documentation:
Passage retrieval [73, 74, 75, 76, 77] is a well-researched domain. Passage retrieval algorithms return the top-m passages that are most likely to answer an input query. We use a sliding-window-based passage retrieval approach similar to the one described in [78]. In this chapter, we focus on using a passage retrieval approach to evaluate table similarity from the database documentation, rather than on comparing different passage retrieval approaches. Consider a table Ti with a set of attributes Ai = (Ai1, Ai2, ..., Aik). Given the database documentation D, for each table Ti we construct a query Q(Ti) consisting of the table name and all its attribute names as keywords:

Q(Ti) = <Ti, Ai1, Ai2, ..., Aik>    (4.1)
In a sliding-window-based passage retrieval approach, given a window size wi for Ti, we sequentially search windows of wi consecutive sentences in the document for the keywords in Q(Ti). Whenever a window matches all the keywords in Q(Ti), the passage in the window is considered a potential table-document for Ti. In cases where multiple windows are identified, we apply a ranking function [79] to the retrieved passages and choose the passage with the highest ranking score. The selection of an appropriate window size is a crucial step, as the number of keywords in Q(Ti) varies with Ti. We propose two types of window functions f(Q(Ti)):
• Independent window function: f(Q(Ti)) = c, where c is a numeric constant.
• Linear window function: f(Q(Ti)) = a × |Q(Ti)| + c, where c is a numeric constant.
After the passage describing the table is identified, we store it in a separate document and refer to it as the table-document TD(Ti) of the table Ti.
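A minimal sketch of the window scan follows; the sentence splitting, the matching and the handling of competing windows are simplifying assumptions, and the documentation file name and keyword set are hypothetical.

import re

def find_table_documents(doc_text, keywords, w):
    # Return every window of w consecutive sentences containing all keywords.
    sentences = re.split(r"(?<=[.!?])\s+", doc_text)
    keys = [k.lower() for k in keywords]
    hits = []
    for i in range(len(sentences) - w + 1):
        window = " ".join(sentences[i:i + w])
        if all(k in window.lower() for k in keys):
            hits.append(window)
    return hits   # a ranking function [79] then picks one passage

doc = open("tpce_doc.txt").read()   # hypothetical documentation file
passages = find_table_documents(doc, ["security", "s_symb", "s_name"], w=5)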
4.1.3.2 Similarity of passages:
Once the table-documents have been identified, we have a corpus containing a table-document for each table. The table-documents are pre-processed by removing stop words and performing stemming using the Porter stemmer. A table-document can be modeled in two ways:
• TF-IDF vector: TD(i) = (w1, w2, ..., wd) is represented as a d-dimensional TF-IDF feature vector, where d is the number of distinct terms in the corpus and wj represents the TF-IDF score of the jth term in TD(i).
• Binary vector: TD(i) is represented as a d-dimensional binary vector TD(i) = (w1, w2, ..., wd), where d is the number of distinct terms in the corpus and wj is 1 if TD(i) contains the jth term and 0 otherwise.
We then calculate the pairwise similarity between table-documents using the cosine similarity measure or the Jaccard coefficient:

DocSim_cos(R, S) = DocSim(doc_R, doc_S) = (doc_R · doc_S) / (|doc_R| × |doc_S|)    (4.2)

DocSim_jacc(R, S) = DocSim(doc_R, doc_S) = |doc_R ∩ doc_S| / |doc_R ∪ doc_S|    (4.3)
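Both options can be sketched as follows, using scikit-learn for the TF-IDF/cosine variant and plain term sets for the binary/Jaccard variant; the toy table-documents are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

table_docs = {                         # stemmed, stop-word-free toy passages
    "security":     "secur symbol exchang list price trade",
    "daily_market": "daili market close price volum secur",
}

tfidf = TfidfVectorizer().fit_transform(table_docs.values())
doc_sim_cos = cosine_similarity(tfidf)               # Equation (4.2)

def doc_sim_jacc(a, b):                              # Equation (4.3)
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

print(doc_sim_cos[0, 1], doc_sim_jacc(*table_docs.values()))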
4.1.4 Table Similarity Measure
For two tables R and S, let RefSim(R, S) represent the referential similarity and DocSim(R, S) the document similarity between R and S. We combine the referential similarity and the document similarity using a weighting scheme:

Sim(R, S) = α × RefSim(R, S) + (1 − α) × DocSim(R, S)    (4.4)
where α is a user-specified parameter called the contribution factor, 0 ≤ α ≤ 1. It measures the contribution of the referential similarity to the table similarity. In some cases, two tables have a low value of (combined) similarity but high similarity to a common neighboring table. For example, in Figure 4.1, the tables account permission (AP) and customer (C) have no referential relationship, but both are similar to the table customer account (CA). In such cases, the two tables gain similarity because they have similar neighbors: the similarity between account permission and customer should be max(Sim(AP, C), Sim(AP, CA) × Sim(CA, C)).

Algorithm 1 Finding Table Similarity
Input: D: database schema, TD: set of table-document vectors, S: document similarity measure, α: contribution factor
Output: Sim: pairwise similarity between tables in the database
  RefSim ← REFERENCE-SIMILARITY(D)
  DocSim ← DOCUMENT-SIMILARITY(TD, S)
  Sim ← α × RefSim + (1 − α) × DocSim
  for all tables as k do
    for all tables as i do
      for all tables as j do
        if Sim(i, k) × Sim(k, j) > Sim(i, j) then
          Sim(i, j) ← Sim(i, k) × Sim(k, j)
        end if
      end for
    end for
  end for
  return Sim

procedure REFERENCE-SIMILARITY(D)
  for all tables as R do
    for all tables as S do
      if R, S have a foreign key relationship in D then
        RefSim(R, S) ← 1
      else
        RefSim(R, S) ← 0
      end if
    end for
  end for
  return RefSim
end procedure

procedure DOCUMENT-SIMILARITY(TD, S)
  for all tables as R do
    for all tables as S do
      DocSim(R, S) ← S(TD(R), TD(S))
    end for
  end for
  return DocSim
end procedure

We construct the undirected database graph G = (V, E), where the nodes (V) correspond to the tables in the database schema. For any two tables R and S, we define an edge representing the combined similarity Sim(R, S) between the tables; the database graph G is thus a complete graph. Consider a path p : R = Ti, Ti+1, Ti+2, ..., Tj = S between two tables Ti and Tj. The similarity between Ti and Tj along path p is
Sim_p(R, S) = ∏_{k=i}^{j−1} Sim(Tk, Tk+1)    (4.5)
The path with the maximum similarity between R and S then gives the overall similarity between R and S:

Sim(R, S) = max_p Sim_p(R, S)    (4.6)
As we construct a complete graph, we use the Floyd-Warshall algorithm to find the shortest paths in the weighted graph; in our case, the 'shortest' distance is the one with the maximum similarity. Since we construct a complete graph for finding the all-pairs maximum-similarity paths, this step takes O(n³) running time. Algorithm 1 describes the procedure for calculating the pairwise similarity between the tables in a schema. Taking the database schema, the set of extracted passages, a document similarity measure and the contribution factor as input, the algorithm returns the pairwise similarity between tables. First, we calculate the referential and document similarity for the O(n²) pairs and then combine them using the contribution factor. The procedure REFERENCE-SIMILARITY() takes the database schema as input and calculates the similarity between two tables based on their referential relationships. The procedure DOCUMENT-SIMILARITY() takes as input the passage corresponding to each table and a document similarity measure, and calculates the similarity between tables based on the similarity of their corresponding passages. Note that for every table, the passage is extracted using the passage retrieval approach described in Section 4.1.3.
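The max-product pass of Algorithm 1 corresponds to the following sketch, a Floyd-Warshall variant in which path weights multiply and larger is better:

import numpy as np

def max_similarity_paths(sim):
    # sim: n x n matrix of combined similarities (Equation 4.4)
    s = np.asarray(sim, dtype=float).copy()
    n = len(s)
    for k in range(n):                      # intermediate table Tk
        for i in range(n):
            for j in range(n):
                via_k = s[i, k] * s[k, j]   # path similarity (Equation 4.5)
                if via_k > s[i, j]:         # keep the best path (Equation 4.6)
                    s[i, j] = via_k
    return s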
4.1.5 Clustering Algorithm
For generating the summary, we use a greedy weighted k-center clustering algorithm. It solves a min-max optimization problem in which we want to minimize the maximum distance between a table and its cluster center.

4.1.5.1 Influential tables and Cluster Centers
In schema summarization, the notion of an influential table is used for clustering [20]: the most important tables should not be grouped into the same cluster. We measure the importance of a table by the influence it has on the other tables in the schema [80]. Specifically, if a table is closely related to a large number of tables in the database, it will have a high influence score. The influence score helps in identifying the cluster centers, as described in the clustering process. The influence of a table R on another table S in the database schema is defined as

f(R, S) = 1 − e^(−Sim(R,S)²)    (4.7)
The influence score of a table is thus defined as

f(R) = \sum_{t_i \in T} f(R, t_i)    (4.8)

where T represents the set of tables in the database.

4.1.5.2 Clustering Objective Function
The clustering objective function aims to minimize the following measure [20]:

Q = \max_{i=1}^{k} \max_{R \in C_i} f(R) \times (1 - Sim(R, Center(C_i)))    (4.9)

where k is the number of clusters, f(R) is the influence score of table R, and Center(C_i) represents the center of the i-th cluster C_i.

4.1.5.3 Clustering Process
We use a Weighted K-Center algorithm that takes the influence scores into account. The most influential table is selected as the first cluster center, and all tables are initially assigned to this cluster. In each subsequent iteration, the table with the lowest weighted similarity to its cluster center separates out to form a new cluster center, and the remaining tables are re-assigned to their closest cluster center. We repeat the process for k iterations, so that k clusters are identified for the database schema. The time complexity of the greedy clustering algorithm is O(kn^2) [81], where n is the number of tables in the schema.
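A minimal sketch of this greedy Weighted K-Center procedure is given below, reusing the sim matrix from the previous sketch; the influence score follows Equations (4.7) and (4.8), and the names are my own, not the thesis code.

import math

def influence(table, tables, sim):
    # Equations 4.7 and 4.8: f(R) = sum over t in T of 1 - exp(-Sim(R, t)^2).
    return sum(1 - math.exp(-sim[table, t] ** 2) for t in tables)

def weighted_k_center(tables, sim, k):
    f = {t: influence(t, tables, sim) for t in tables}
    centers = [max(tables, key=f.get)]            # most influential table first
    assign = {t: centers[0] for t in tables}      # all tables join this cluster
    for _ in range(k - 1):
        # The table with the largest weighted distance f(R) * (1 - Sim(R, center))
        # from its current center separates out as a new center (Equation 4.9).
        far = max(tables, key=lambda t: f[t] * (1 - sim[t, assign[t]]))
        centers.append(far)
        # Re-assign every table to its most similar center.
        assign = {t: max(centers, key=lambda c: sim[t, c]) for t in tables}
    return centers, assign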
4.2 Experimental Results

In this section, we present the results of experiments conducted on our proposed approach. The following variables are used at different stages of the approach:

• Window size function (f) for table-document discovery.
• Document similarity measure (S), for calculating the similarity of the passages describing the tables in the documentation.
• α, the contribution factor in the combined table similarity metric.
• k, the number of clusters determined by the clustering algorithm.

Varying any of these variables affects the table similarity metric and the clustering. We study the influence of each variable by varying it while keeping the others constant. Later, we conduct experiments on the clustering algorithm and compare our approach with existing approaches.
4.2.1 Experimental Setup

We used the TPCE database [72], provided by TPC. It is an online transaction processing benchmark, simulating the OLTP workload of a brokerage firm. TPC also provides a software package, EGen, to facilitate the implementation of the TPCE database. We used the following parameters to implement an instance of TPCE: number of customers = 5000, scale factor = 36000, initial trade days = 10. The TPCE schema consists of 33 tables, which are grouped into four categories: Customer, Market, Broker and Dimension. We use this categorization as the gold standard to measure the accuracy of our approach. The Dimension tables are not an explicit category; they are used as companion tables to the other fact tables and hence can be considered outliers to our clustering process. We thus aim to cluster the remaining 29 tables and measure their accuracy against the gold standard. In addition, TPC provides documentation for the TPCE benchmark: a 286-page document containing information about the TPCE business and application environment, the database and the database transactions involved. This document serves as the external source in the proposed schema summarization approach.
4.2.2 Evaluation Metric

The accuracy of the clustering and of the table similarity metric is evaluated by means of an accuracy score proposed in [20]. The accuracy score has different connotations for clustering evaluation and for table similarity evaluation. For the table similarity metric, we find the top-n neighbors of each table based on the Sim metric described in Equation (4.6). Unless specifically mentioned, we use the top-5 neighbors in our experiments. From the gold standard, if the category of table T_i is C_a and m_i is the count of tables in the top-n neighborhood of T_i that belong to category C_a, then the average accuracy of the similarity metric is defined as

acc_sim = \frac{\sum_{i \in T} \frac{m_i}{n}}{|T|}    (4.10)
Similarly, for clustering accuracy, consider a cluster i containing n_i tables. If the category of the cluster center of cluster i is C_a, let m_i denote the count of tables in the cluster that belong to category C_a. Then the accuracy of cluster i and the overall clustering accuracy are

acc_clust_i = \frac{m_i}{n_i}    (4.11)

acc_clust = \frac{\sum_{i=1}^{k} m_i}{|T|}    (4.12)
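The two scores can be computed directly from a clustering and the gold standard; the short sketch below is illustrative only, with gold mapping each table to its TPCE category and top_n mapping each table to its n most similar neighbors.

def acc_sim(tables, gold, top_n, n):
    # Equation 4.10: fraction of same-category tables in each top-n
    # neighborhood, averaged over all tables.
    return sum(sum(gold[x] == gold[t] for x in top_n[t]) / n for t in tables) / len(tables)

def acc_clust(clusters, centers, gold, total_tables):
    # Equations 4.11 and 4.12: m_i counts tables in cluster i that share the
    # category of its center; overall accuracy sums m_i and divides by |T|.
    m = [sum(gold[t] == gold[centers[i]] for t in cluster)
         for i, cluster in enumerate(clusters)]
    return sum(m) / total_tables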
Figure 4.2 acc_sim and acc_clust values on varying window function f
Figure 4.3 acc_sim and acc_clust values for document similarity functions S
4.2.3 Effect of window function (f) on combined table similarity and clustering

In this experiment, we measure the impact of varying the window function f for the window size (w) on the clustering accuracy and the table similarity metric. We fix α = 0.5, k = 3 and use the tf-idf based cosine similarity for table-document similarity. We conduct the experiment with the following window functions:

• w_i = f(Q(T_i)) = 10
• w_i = f(Q(T_i)) = 20
• w_i = f(Q(T_i)) = 2 × |Q(T_i)| + 1
• w_i = f(Q(T_i)) = 3 × |Q(T_i)| + 1

The results of this experiment are shown in Figure 4.2. We observe that although the function f = 20 gives respectable results, it is hard to determine the value of such a constant (f = 10 gives poor results): a constant window size can lose information in some cases and add noise in others. The linear window functions, which gave comparably good results, are therefore preferred. In further experiments, unless specified otherwise, we use the window function f(Q(T_i)) = 2 × |Q(T_i)| + 1. The sketch below illustrates these four functions.
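For reference, the four window-size functions can be written as simple callables over |Q(T_i)|, the size of the query term set for table T_i; this is only a sketch of the configuration, not code from the prototype.

window_functions = {
    "constant-10": lambda q_len: 10,
    "constant-20": lambda q_len: 20,
    "linear-2":    lambda q_len: 2 * q_len + 1,   # the default in later experiments
    "linear-3":    lambda q_len: 3 * q_len + 1,
}
# Example: window size for a table whose query term set Q(T_i) has 4 terms.
w = window_functions["linear-2"](4)   # -> 9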
Figure 4.4 Accuracy of similarity metric on varying values of α
Figure 4.5 Accuracy of clustering on varying values of α (with and without dimension tables)
4.2.4 Effect of document similarity measure (S) on similarity metric and clustering accuracy

The table-documents identified for the tables can be of variable length. We study the two similarity measures described in Equation (4.2) and Equation (4.3): cosine similarity and Jaccard similarity. We compare the accuracy of the similarity metric and of the clustering algorithm for the two measures, with k = 3, α = 0.5 and f(Q(T_i)) = 2 × |Q(T_i)| + 1. The results of the experiments are shown in Figure 4.3. We observe that the tf-idf based cosine similarity measure gives more consistent results. This can be attributed to the fact that the table-documents share many terms from the domain of the document, so term frequency and inverse document frequency play an important role in determining the weight of the terms in a document.
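For clarity, the two measures can be sketched as follows over whitespace-tokenized table-documents; this is a simplified illustration of the measures in Equations (4.2) and (4.3), not the thesis implementation, with idf computed over the corpus of table-documents.

import math
from collections import Counter

def jaccard(doc_a, doc_b):
    # Jaccard similarity over the token sets of two table-documents.
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def tfidf_cosine(doc_a, doc_b, corpus):
    # Cosine similarity over tf-idf vectors; corpus is the list of all
    # table-documents, used to compute document frequencies.
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc.split()))
    idf = {t: math.log(n / df[t]) for t in df}
    def vec(doc):
        tf = Counter(doc.split())
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}
    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = math.sqrt(sum(w * w for w in va.values())) * math.sqrt(sum(w * w for w in vb.values()))
    return dot / norm if norm else 0.0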
4.2.5 Effect of contribution factor (α) on table similarity and clustering

In this section, we measure the impact of varying α on the clustering accuracy and the table similarity metric. We fix w = 2 × |Q(T_i)| and k = 3, while varying α from 0 to 1. Figure 4.4 and Figure 4.5 show the results of varying α on the accuracy of the similarity metric and the clustering accuracy. One interesting observation is that we achieve the best clustering accuracy when the contributions of referential similarity and document similarity are almost equal (α = 0.4, 0.5, 0.6). This shows that rather than one notion of similarity supplementing the other, both similarities are equally important in generating the schema summary. Moreover, using either similarity measure alone (α = 0 or 1) produces low accuracy, which verifies the claims made in this chapter.
Figure 4.6 Clustering accuracy for different clustering algorithms
4.2.6 Comparison of Clustering Algorithms

In this section, we compare clustering algorithms for schema summarization. In addition to the proposed weighted k-center clustering algorithm using an influence function (Clust_s), we implement the following clustering algorithms:

• Clust_c, a community detection based schema summarization approach proposed in [19].
• Clust_d, the schema summarization approach proposed in [20]; it uses a table importance metric based weighted k-center clustering algorithm.
• Clust_v, which combines the results of clustering with reference similarity and clustering with document similarity using a voting scheme similar to [21]; this algorithm combines clusterings from different similarity models rather than combining the similarity models themselves (a sketch is given at the end of this section).

Figure 4.6 shows the clustering accuracy achieved for k = 2, 3, 4 for the various clustering algorithms. We observe that Clust_s and Clust_d achieve similar accuracy, with Clust_s scoring slightly higher because it successfully clusters the table trade request: if no active transactions are considered for the TPCE database, the table trade request is empty and data-oriented approaches are unable to classify it. For the Clust_v and Clust_c approaches, no specific patterns were observed. The low accuracy of Clust_v stems from the fact that referential similarity alone produces a very imbalanced and ineffective clustering, which significantly deters the overall accuracy of the voting scheme.
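The exact voting scheme of [21] is not reproduced here; as a rough stand-in, the sketch below combines two clusterings by intersection, so that two tables stay together only when both the reference-based and the document-based clusterings agree. This illustrates the "combine clusterings, not similarity models" idea behind Clust_v.

from collections import defaultdict

def intersect_clusterings(assign_ref, assign_doc):
    # assign_ref / assign_doc: dict mapping table -> cluster label under the
    # reference-based and document-based clusterings, respectively.
    merged = defaultdict(list)
    for table in assign_ref:
        merged[(assign_ref[table], assign_doc[table])].append(table)
    return list(merged.values())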
4.3 Summary of the chapter

Schema summarization has been proposed in the literature to help users explore complex database schemas. Existing approaches for schema summarization are data-oriented. In this chapter, we proposed
a schema summarization approach for relational databases using the database schema and the database documentation. We proposed a combined similarity measure to incorporate similarities from both sources and a framework for schema summary generation. Experiments were conducted on a benchmark database, and the results showed that the proposed approach is as effective as existing data-oriented approaches.
Chapter 5
Conclusion and Future work
With the rapid increase in the amount of published information and the explosion of data, users require sophisticated tools to simplify the task of managing data and extracting useful information in a timely fashion. Databases and database systems are thus essential to every organization's business operations. Accessing information stored in a database requires the user to be familiar with query languages. Naive users are not skilled at using a general purpose query language like SQL, which has a complex structure. As a result, research efforts are ongoing to provide easy-to-use query interfaces with expressive power comparable to SQL.

The QBO approach, based on the IRE framework, provides an interface where the user progressively builds a query over multiple steps. The QBO approach works well for small databases but does not perform well on a database consisting of a large number of tables and rows. In this thesis, we propose Query-by-Topics, which provides enhancements over the existing QBO approach. We exploit topical structures in large databases to represent objects at a higher level of abstraction. We also organize the instances of an object in a two-level hierarchy based on a user-selected attribute. The advantages of this approach are a lower navigational burden for the user and a reduced number of operations at the system level. We also implemented a system prototype for a real database and made efforts to extend it to any general purpose database. Experiments were conducted at the system level to estimate the reduction in navigational burden and the reduction in the number of operational pairs. A usability study was also conducted using the system prototype to evaluate our efforts against human factors.

A key step in the proposed approach was to represent schema elements at a higher level of abstraction. Schema summarization has been proposed in the literature to cluster database schema entities and present a high-level abstraction of the schema that helps users explore complex database schemas. Existing approaches for schema summarization are data-oriented. In this thesis, we proposed a schema summarization approach for relational databases utilizing the database schema and the database documentation. We proposed a combined similarity measure to incorporate similarities from both sources and a framework for schema summary generation. Experiments were conducted on a benchmark
database, and the results showed that the proposed approach is as effective as existing data-oriented approaches.

As part of future work, we would like to address the limitations of the usability study (mentioned in Section 3.4.3.3). For schema summarization, we would like to develop approaches for learning the values of the various parameters used in the proposed approach. Apart from the database documentation, other documents, such as the requirements document, could also be exploited for schema summarization. Lastly, another line of research could focus on developing a unified approach that combines the notions of similarity from the schema, the data and the database documentation.
Related Publications

1. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy. Exploiting Schema and Documentation for Summarizing Relational Databases. International Conference on Big Data Analytics, LNCS Volume 7678, 2012, pp. 77-99.

2. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy. Enhanced Query by Object Approach for Information Requirement Elicitation in Large Databases. International Conference on Big Data Analytics, LNCS Volume 7678, 2012, pp. 26-41.
Bibliography
[1] Tiziana Catarci, Maria Francesca Costabile, Stefano Levialdi, and Carlo Batini. Visual query systems for databases: A survey. Journal of Visual Languages and Computing, 8(2):215–260, 1997.
[2] Moshé M. Zloof. Query by example. In Proceedings of the May 19-22, 1975, National Computer Conference and Exposition, AFIPS '75, pages 431–438, New York, NY, USA, 1975. ACM.
[3] Joobin Choobineh. Human Factors in Management Information Systems. Ablex Publishing Corp., Norwood, NJ, USA, 1988.
[4] Joobin Choobineh, Michael V. Mannino, and Veronica P. Tseng. A form-based approach for database analysis and design. Communications of the ACM, 35(2):108–120, February 1992.
[5] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems (3rd ed.). McGraw-Hill, 2003.
[6] Lu Qin, Jeffrey Xu Yu, and Lijun Chang. Keyword search in databases: The power of RDBMS. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 681–694, New York, NY, USA, 2009. ACM.
[7] Arvind Hulgeri and Charuta Nakhe. Keyword searching and browsing in databases using BANKS. In Proceedings of the 18th International Conference on Data Engineering, ICDE '02, pages 431–, Washington, DC, USA, 2002. IEEE Computer Society.
[8] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: enabling keyword search over relational databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02, pages 627–627, New York, NY, USA, 2002. ACM.
[9] Jun Sun. Information requirement elicitation in mobile commerce. Communications of the ACM, 46(12):45–47, December 2003.
[10] Jun (John) Sun, Hoh Peter In, and Kuncara Aji Sukasdadi. A prototype of information requirement elicitation in m-commerce. In 2003 IEEE International Conference on Electronic Commerce (CEC 2003), 24-27 June 2003, Newport Beach, CA, USA, page 53, 2003.
[11] Subhash Bhalla, Masaki Hasegawa, Enrique Gutierrez, and Nadia Berthouze. Computational interface for web based access to dynamic contents. International Journal of Computational Science and Engineering, 2(5/6):302–306, August 2006.
[12] S. Bhalla and M. Hasegawa. Query-by-object interface for accessing dynamic contents on the web. In TENCON '02. Proceedings. 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering, volume 1, pages 310–313, 2002.
[13] Takatoshi Akiyama and Yutaka Watanobe. An advanced search interface for mobile devices. In Proceedings of the 2012 Joint International Conference on Human-Centered Computer Environments, HCCE '12, pages 230–235, New York, NY, USA, 2012. ACM.
[14] Shapiee Abd Rahman, Subhash Bhalla, and Tetsuya Hashimoto. Query-by-object interface for information requirement elicitation in m-commerce. International Journal of Human Computer Interaction, 20(2), 2006.
[15] Kazumi Nemoto and Yutaka Watanobe. An advanced search system for learning objects. In Proceedings of the 13th International Conference on Humans and Computers, HC '10, pages 94–99, Fukushima-ken, Japan, 2010. University of Aizu Press.
[16] M. Hasegawa, S. Bhalla, and T. Izumita. A high-level query interface for web user's access to data resources. In Frontier of Computer Science and Technology, 2007. FCST 2007. Japan-China Joint Workshop on, pages 98–105, 2007.
[17] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical structures of databases. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[18] Cong Yu and H. V. Jagadish. Schema summarization. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pages 319–330. VLDB Endowment, 2006.
[19] Xue Wang, Xuan Zhou, and Shan Wang. Summarizing large-scale database schema using community detection. Journal of Computer Science and Technology, 27:515–526, 2012.
[20] Xiaoyan Yang, Cecilia M. Procopiuc, and Divesh Srivastava. Summarizing relational databases. Proceedings of the VLDB Endowment, 2(1):634–645, August 2009.
[21] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical structures of databases. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[22] Ben Shneiderman. Improving the human factors aspect of database interactions. ACM Transactions on Database Systems, 3(4):417–439, December 1978.
[23] C. J. Date. Database usability. In Proceedings of the 1983 ACM SIGMOD International Conference on Management of Data, SIGMOD '83, pages 1–1, New York, NY, USA, 1983. ACM.
[24] Tiziana Catarci. What happened when database researchers met usability. Information Systems, 25(3):177–212, 2000.
[25] H. V. Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 13–24, New York, NY, USA, 2007. ACM.
[26] Kyu-Young Whang, Art Ammann, Anthony Bolmarcich, Maria Hanrahan, Guy Hochgesang, Kuan-Tsae Huang, Al Khorasani, Ravi Krishnamurthy, Gary Sockut, Paula Sweeney, Vance Waddle, and Moshé Zloof. Office-by-example: an integrated office system and database manager. ACM Transactions on Information Systems, 5(4):393–427, October 1987.
[27] E. F. Codd. Relational completeness of data base sublanguages. In: R. Rustin (ed.): Database Systems: 65–98, Prentice Hall and IBM Research Report RJ 987, San Jose, California, 1972.
[28] Arijit Sengupta and Andrew Dillon. Query by templates: a generalized approach for visual query formulation for text dominated databases. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries, IEEE ADL '97, pages 36–47, Washington, DC, USA, 1997. IEEE Computer Society.
[29] Michele Angelaccio, Tiziana Catarci, and Giuseppe Santucci. Query by diagram: A fully visual query system. Journal of Visual Languages and Computing, 1(3):255–273, September 1990.
[30] Antonio Massari, Stefano Pavani, Lorenzo Saladini, and Panos K. Chrysanthis. QBI: query by icons. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD '95, pages 477–, New York, NY, USA, 1995. ACM.
[31] Francesca Benzi, Dario Maio, and Stefano Rizzi. VISIONARY: a viewpoint-based visual language for querying relational databases. Journal of Visual Languages and Computing, 10(2):117–145, 1999.
[32] Norman Murray, Norman Paton, and Carole Goble. Kaleidoquery: A visual query language for object databases. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI '98, pages 247–257, New York, NY, USA, 1998. ACM.
[33] Bin Liu and H. V. Jagadish. A spreadsheet algebra for a direct data manipulation query interface. In Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on, pages 417–428, 2009.
[34] Clemente Rafael Borges and José Antonio Macías. Feasible database querying using a visual end-user approach. In Proceedings of the 2nd ACM SIGCHI Symposium on Engineering Interactive Computing Systems, EICS '10, pages 187–192, New York, NY, USA, 2010. ACM.
[35] Arnab Nandi and Michael Mandel. The interactive join: recognizing gestures for database queries. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, pages 1203–1208, New York, NY, USA, 2013. ACM.
[36] Roberta Evans Sabin and Tieng K. Yap. Integrating information retrieval techniques with traditional DB methods in a web-based database browser. In Proceedings of the 1998 ACM Symposium on Applied Computing, SAC '98, pages 760–766, New York, NY, USA, 1998. ACM.
[37] Saurabh Sinha, Kirk Bowers, Sandra A. Mamrak, and Ra A. Mamrak. Accessing a medical database using www-based user interfaces. Technical report, The Ohio State University, 1998.
[38] Magesh Jayapandian and H. V. Jagadish. Automating the design and construction of query forms. In Proceedings of the 22nd International Conference on Data Engineering, ICDE '06, pages 125–, Washington, DC, USA, 2006. IEEE Computer Society.
[39] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.
[40] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK: ranked keyword search over XML documents. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 16–27, New York, NY, USA, 2003. ACM.
[41] Yunyao Li, Cong Yu, and H. V. Jagadish. Schema-free XQuery. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 72–83. VLDB Endowment, 2004.
[42] Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 564–575, 2004.
[43] Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, and Aristides Gionis. Automated ranking of database query results. In CIDR, pages 888–899, 2003.
[44] Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, and Jeffrey Naughton. Combining keyword search and forms for ad hoc querying of databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 349–360, New York, NY, USA, 2009. ACM.
[45] Aditya Ramesh, S. Sudarshan, Purva Joshi, and Manisha Naik Gaonkar. Keyword search on form results. The VLDB Journal, 22(1):99–123, 2013.
[46] Google Directory, http://dir.google.com/.
[47] Open Web Directory, http://dmoz.com/.
[48] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm. PVLDB, 4(12):1466–1469, 2011.
[49] Ricardo Baeza-Yates, Carlos Hurtado, and Marcelo Mendoza. Query recommendation using query logs in search engines. In Proceedings of the 2004 International Conference on Current Trends in Database Technology, EDBT '04, pages 588–596, Berlin, Heidelberg, 2004. Springer-Verlag.
[50] Zhiyong Zhang and Olfa Nasraoui. Mining search engine query logs for query recommendation. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 1039–1040, New York, NY, USA, 2006. ACM.
[51] Arnab Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 1156–1158, New York, NY, USA, 2007. ACM.
[52] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. SnipSuggest: context-aware autocompletion for SQL. Proceedings of the VLDB Endowment, 4(1):22–33, October 2010.
[53] Holger Bast and Ingmar Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88–95, 2007.
[54] Guoliang Li, Shengyue Ji, Chen Li, and Jianhua Feng. Efficient type-ahead search on relational data: a TASTIER approach. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 695–706, New York, NY, USA, 2009. ACM.
[55] Peter Anick. Using terminological feedback for web search refinement: A log-based study. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR '03, pages 88–95, New York, NY, USA, 2003. ACM.
[56] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst. Faceted metadata for image search and browsing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '03, pages 401–408, New York, NY, USA, 2003. ACM.
[57] Paul Brown, Peter J. Haas, Jussi Myllymaki, Hamid Pirahesh, Berthold Reinwald, and Yannis Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, Essays Dedicated to Hartmut Wedekind on the Occasion of His 70th Birthday, pages 161–180, 2005.
[58] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI Magazine, 26(1):83–94, March 2005.
[59] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, December 2001.
[60] S. Bergamaschi, S. Castano, and M. Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Record, 28(1):54–59, March 1999.
[61] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Information Systems, 28(7):835–865, October 2003.
[62] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 49–58, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[63] Erhard Rahm, Hong-Hai Do, and Sabine Massmann. Matching large XML schemas. SIGMOD Record, 33(4):26–31, December 2004.
[64] Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02, pages 240–251, New York, NY, USA, 2002. ACM.
[65] Periklis Andritsos, Renée J. Miller, and Panayiotis Tsaparas. Information-theoretic tools for mining database structure from large data sets. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pages 731–742, New York, NY, USA, 2004. ACM.
[66] Yannis Sismanis, Paul Brown, Peter J. Haas, and Berthold Reinwald. GORDIAN: efficient and scalable discovery of composite keys. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 691–702, 2006.
[67] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[68] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, December 2001.
[69] A. Lund. Measuring usability with the USE questionnaire. Usability and User Experience Special Interest Group, Volume 8, Issue 2, October 2001.
[70] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm. PVLDB, 4(12):1466–1469, 2011.
[71] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI Magazine, 26(1):83–94, March 2005.
[72] TPCE, http://www.tpc.org/tpce/.
[73] Charles L. A. Clarke, Gordon V. Cormack, D. I. E. Kisman, and Thomas R. Lynam. Question answering by passage selection (MultiText experiments for TREC-9). In Proceedings of The Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000, 2000.
[74] Abraham Ittycheriah, Martin Franz, Wei-Jing Zhu, Adwait Ratnaparkhi, and Richard J. Mammone. IBM's statistical question answering system. In Proceedings of The Ninth Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000, 2000.
[75] Gerard Salton, J. Allan, and Chris Buckley. Approaches to passage retrieval in full text information systems. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '93, pages 49–58, New York, NY, USA, 1993. ACM.
[76] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR '03, pages 41–47, New York, NY, USA, 2003. ACM.
[77] Mengqiu Wang and Luo Si. Discriminative probabilistic models for passage based retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 419–426, New York, NY, USA, 2008. ACM.
[78] W. Xi, R. Xu-Rong, C. S. Khoo, and E. P. Lim. Incorporating window-based passage-level evidence in document retrieval. Journal of Information Science, 27:73–80, 2001.
[79] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, pages 109–126, 1994.
[80] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment, 2(1):718–729, August 2009.
[81] M. E. Dyer and A. M. Frieze. A simple heuristic for the p-centre problem. Operations Research Letters, 3(6):285–288, February 1985.