Information Extraction for SQL Query Generation in the Conversation-Based Interfaces to Relational Databases (C-BIRD)

Majdi Owda, Zuhair Bandar, and Keeley Crockett

The Intelligent Systems Group, School of Computing, Mathematics and Digital Technology, The Manchester Metropolitan University, Chester Street, Manchester, M1 5GD, UK
{m.owda, z.bandar, k.crockett}@mmu.ac.uk

Abstract. This paper presents a novel methodology for incorporating Information Extraction (IE) techniques into an enhanced Conversation-Based Interface to Relational Databases (C-BIRD) in order to generate dynamic SQL queries. Conversational Agents can converse with the user in natural language about a specific problem domain. In C-BIRD, such an agent allows a user to converse with a relational database and retrieve answers to queries without any knowledge of SQL. A knowledge tree is used to direct the Conversational Agent towards the goal, i.e. creating an SQL query that fits the user's natural language enquiry. IE techniques such as template filling help answer the user's queries by processing the user's dialogue, extracting understandable patterns, and filling the SQL templates with them. The developed prototype system increases the number of answered natural language queries in comparison to hardcoded decision paths in the knowledge trees.

Keywords: Information Extraction, Knowledge Trees, Conversational Agents, Conversation-Based Interfaces to Relational Databases.
1 Introduction

The development of reliable Natural Language Interfaces to Databases (NLIDBs) can accelerate the progress of interactive applications. A considerable amount of literature has been published in this area. Androutsopoulos [21] defined four main approaches to NLIDBs, each supported by a specific architecture. The first approach is based on Pattern-Matching, one of the first architectural techniques used for the design and development of NLIDBs. An example of such a system is SAVVY [22]. In order to make NLIDBs more reliable, this approach utilizes a highly structured semantic grammar, rapid authoring tools, and a query paraphraser. However, in large databases a great number of patterns have to be created. Furthermore, the process of identifying relevant patterns is critical and time consuming. The second approach is based on an Intermediate Language, which is used to represent an intermediate logical query generated from a natural language question. The intermediate logical query is then transformed into SQL. Two systems of this category are MASQUE/SQL [5] and Edite [3]. The main disadvantage of this approach is that failure to generate an SQL query will result in a response which is not informative to the user.
The third approach is based on a Syntax-Based Family of Architectures, where the natural language question is syntactically analyzed to create parse tree(s). The resulting parse tree(s) are then used directly to create database queries. An example of such a system is LUNAR [8]. However, a major difficulty of this approach is in generating the mapping rules from the parse tree to SQL. The fourth approach is based on the Semantic-Grammar Family of Architectures. This approach also involves the construction of a parse tree and the mapping of the parse tree to SQL. The difference between this approach and the syntax-based approach is the use of semantic categories instead of syntactic concepts, where non-leaf nodes in the parse tree are semantic categories. The semantic information about the knowledge domain is hard-wired into the semantic grammar. This family includes systems such as PLANES [7, 21] and LADDER [21]. This approach appears to work well in restricted domains (e.g. small databases). However, creating a new system involves creating a new semantic grammar, which is both time consuming and complex. Currently there is some ongoing work on NLIDBs which does not fall into the categories defined by Androutsopoulos, such as Precise [4], Step [6] and C-Phrase [19].

A reliable NLIDB system should overcome the problems associated with the above approaches. Firstly, any such system should provide adequate coverage of patterns to represent the domain. Secondly, the system should allow for interactivity with the user; the user must be able to understand the system responses in the case of failure to generate an SQL query. Thirdly, the system should overcome the problem of generating mapping rules from syntax-based parse trees to SQL. The Conversation-Based Interface to Relational Databases [1] is based on the pattern-matching architecture but has additional components: knowledge trees, a goal-oriented conversational agent, and an information extraction module. Knowledge trees are used to structure the domain knowledge to cover common queries in a particular domain. A goal-oriented conversational agent provides the natural language interface and can help to disambiguate the user's queries by allowing for dialogue interaction. The use of the knowledge trees helps direct the conversational agent towards the goal (i.e. query generation from natural language) through interaction with the user. The information extraction module helps extract understandable patterns from the relational database and project the created templates into SQL queries. The proposed approach reduces the complexity of building NLIDBs.

This paper is organized as follows: Sections 2, 3 and 4 introduce the main concepts behind Conversational Agents, Knowledge Trees, and Information Extraction. Section 5 introduces the Enhanced Conversation-Based Natural Language Interfaces to Relational Databases Framework. Section 6 presents the results. Section 7 includes the summary.

2 Conversational Agents

A conversational agent is a human-computer dialogue system that interacts with the user turn by turn using natural language. Conversational agents have been used in many areas (e.g. help and advice, tourism, information retrieval, and learning styles [15, 16, 17, 18]). The field was started by Alan Turing in 1950 with his famous paper Computing Machinery and Intelligence [2]. In the 1960s, work on conversational agents included ELIZA [16], a computer program that simulates a psychotherapist.
ELIZA is based on a simple stimulus-response architecture (i.e. patterns and their responses), where ELIZA turns the input patterns into questions as responses. A number of programs have been developed based on lessons learned from ELIZA, such as PARRY [9], ALICE [14], and ADAM [15]. PARRY was built for a specific task, simulating paranoid patients, which makes it hard to adopt as an autonomous component. ALICE and ADAM are general-purpose conversational agents; they can be trained and adopted as autonomous components. Allen [17] characterized conversational agents, based on the implementation technique used, into five different groups (ranging from least to most complex): Finite-state script, Frame-based, Set of Contexts, Plan-based models, and Agent-based models. The conversational agent in our framework must meet a set of requirements in order to develop a Conversation-Based NLIDB. Firstly, it should be an autonomous general-purpose conversational agent. Secondly, it should be capable of dealing with a set of contexts and of directing the user towards a goal, which gives the conversational agent flexibility and robustness. Thirdly, it should be able to converse freely within a specific domain. Finally, it should be able to extract, manipulate, and store information from the users. ALICE and ADAM meet the above requirements, but ADAM is also goal oriented, which makes it more suitable for developing a reliable Conversation-Based NLIDB. The proposed system uses ADAM and has the advantage of a user-friendly interface that speaks the user's language and allows for dialogue interaction.

3 Knowledge Trees

Knowledge Trees have been used for Knowledge Representation and maintenance in many systems (e.g. ESTA [10], AMRAPALIKA [6], HWYCON [20]). The main reason for using such trees for Knowledge Representation is that their structure is similar to the way in which humans represent knowledge. The domain expert is not required to have experience in any programming language. The tree structure also provides an easy way to revise and maintain the knowledge base. A variety of definitions of the term knowledge tree have been suggested. Yager [12] defines the idea of a knowledge tree as a mechanism providing a global framework for the mobilization of a knowledge base in response to a specific query. Our definition of a knowledge tree in the context of this paper is a tree in which the knowledge is organized in a hierarchical structure based on the expert knowledge, which has been extracted and developed by a knowledge engineer. Hence the knowledge tree serves as a road map for the conversation flow in a specific domain. For example, consider the knowledge tree in Figure 1. Assume T0 is the start node, T1, T2, T3 are sub-nodes of the start node, T4, T5 are sub-nodes of node T2, and T6, T7 are sub-nodes of node T3. T0, T2, T3 are direction nodes, which direct the conversation towards the formulation of an SQL query. T1, T4, T5, T6, T7 are report nodes, which provide the syntax of the SQL queries. The direction nodes include patterns to carry on a conversation and directions to sub-nodes. The report nodes carry the SQL queries and might include patterns in case further information is needed.
Fig. 1. A simple knowledge tree.
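To make the structure concrete, the following is a minimal sketch in Python of the knowledge tree in Fig. 1. It is illustrative only; the paper does not prescribe a concrete data structure, and the class name, field names, patterns and SQL placeholders below are assumptions. Direction nodes carry conversation patterns and pointers to sub-nodes, while report nodes carry SQL.

class KnowledgeTreeNode:
    # A minimal sketch, not the authors' implementation.
    def __init__(self, name, patterns=None, children=None, sql=None):
        self.name = name
        self.patterns = patterns or []   # utterance patterns recognised at this node
        self.children = children or []   # sub-nodes (empty for report nodes)
        self.sql = sql                    # only report (goal) nodes carry SQL

    def is_report_node(self):
        return self.sql is not None

# Fig. 1: T0, T2, T3 are direction nodes; T1, T4, T5, T6, T7 are report nodes.
# The patterns and SQL strings are illustrative placeholders only.
t1 = KnowledgeTreeNode("T1", sql="SELECT ...")
t4 = KnowledgeTreeNode("T4", sql="SELECT ...")
t5 = KnowledgeTreeNode("T5", sql="SELECT ...")
t6 = KnowledgeTreeNode("T6", sql="SELECT ...")
t7 = KnowledgeTreeNode("T7", sql="SELECT ...")
t2 = KnowledgeTreeNode("T2", patterns=["*customer*"], children=[t4, t5])
t3 = KnowledgeTreeNode("T3", patterns=["*account*"], children=[t6, t7])
t0 = KnowledgeTreeNode("T0", patterns=["*"], children=[t1, t2, t3])  # start node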
4 Information Extraction Techniques

Information Extraction (IE) is a cascade of steps; at each step the system adds structure and often loses information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically. There are two basic approaches to the design of IE systems. The first is the Knowledge Engineering Approach, which is based on having a knowledge engineer who writes and maintains the system rules. The second is the Automatic Training Approach, where there is no need for someone to write rules for the IE system; instead, someone who knows the domain well takes a corpus of texts and annotates it appropriately for the information being extracted. According to Appelt and Israel [13], the rule-based approach is best when resources such as lexicons and rule writers are available, training data is expensive to obtain, the extraction specification is likely to change, and the highest possible performance is critical. The rule-based approach helps with the problem of generating SQL queries from natural language. Relational databases have a rich semantic structure, since an engineering effort has already been invested during their development. This makes it possible to use rule-based Information Extraction to project a natural language query into an SQL statement.
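As an illustration only (not taken from the paper), a single hand-written extraction rule of this kind could be a regular expression that captures a column value from a user utterance such as the one later shown in Fig. 3. The rule name and the slot it fills are hypothetical.

import re

# A hypothetical hand-written extraction rule: capture the account name that follows
# the phrase "account:" in an utterance such as
# "Show customers belong to the following account: Jeffrey May".
ACCOUNT_RULE = re.compile(r"account\s*:\s*(?P<account_name>.+)$", re.IGNORECASE)

def apply_account_rule(utterance):
    match = ACCOUNT_RULE.search(utterance)
    # The captured value can later fill a column-value slot in an SQL template.
    return match.group("account_name").strip() if match else None

print(apply_account_rule("Show customers belong to the following account: Jeffrey May"))
# -> Jeffrey May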
5 Enhanced Conversation-Based Natural Language Interfaces to Relational Databases Framework

Figure 2 shows the proposed framework architecture, which is divided into three main levels; each level has a group of modules associated with it. The following sections explain the supportive and backend levels first and then the interactive level, since the interactive level depends on both the supportive level and the backend level.

5.1 The Supportive Level

This intermediate level contains modules that support the framework functionality. It contains the following components: the Knowledge Tree, the Conversational Agent, the Semantically Understandable Relational Database Patterns, the SQL Queries Templates, and the Tree for Dynamically Generated Queries.
Fig. 2. The Conversation-Based Interface to Relational Databases (C-BIRD) framework architecture.
5.1.1 Knowledge Tree

The role of the Knowledge Tree is mainly to structure knowledge elicited by knowledge engineers from the expert knowledge. In this enhanced framework the knowledge tree is not the only source for creating an SQL query, because an algorithmic solution supports the framework when the knowledge tree cannot support the conversation. The knowledge tree helps answer common queries and sets the ground for initial information extraction through the tree nodes, which later helps in formulating a dynamic SQL query.
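A minimal sketch of this fallback behaviour, assuming the trained tree traversal and the dynamic IE mode are exposed as two callable components, could look as follows. The function and method names are hypothetical and are not taken from the paper.

# Illustrative control flow only; the interfaces assumed here are not the authors' API.
def answer_user_query(user_inputs, knowledge_tree, ie_module):
    # First try to reach a report (goal) node in the trained knowledge tree.
    sql = knowledge_tree.traverse(user_inputs)            # assumed: returns SQL or None
    if sql is not None:
        return sql                                        # trained SQL statement response
    # Dynamic mode: fall back to Information Extraction over all the user's inputs.
    patterns = ie_module.extract_patterns(user_inputs)    # tables, columns, values, functions
    return ie_module.fill_template(patterns)              # nearest dynamic SQL query or an enquiry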
5.1.2 Conversational Agent

The Conversational Agent in this enhanced framework has been equipped, firstly, with more scripts to engage the user in more conversations and extract information. Secondly, the Conversational Agent benefits from the Information Extraction Module, which utilizes the Semantically Understandable Relational Database Patterns to fill the SQL templates in order to generate automated responses, i.e. responses based on untrained utterances.

5.1.3 Semantically Understandable Relational Database Patterns

Relational databases have a rich semantic structure because an engineering effort has already been invested during their development. The idea of introducing semantically understandable relational database patterns in order to build the Conversation-Based Interface to Relational Databases (C-BIRD) is of great value, and these patterns can be utilized by the Conversational Agent. The use of Conversational Agents enables us to converse with the user about a certain context. In the database area the context changes once you move from one table or database to another. Defining the main database elements to the Conversational Agent lays the ground for the Conversational Agent to carry out a conversation within and around an understandable pattern. For example, the table named "customer_dim" will be understood as the understandable pattern "customers". The following database elements have been identified as the major understandable patterns in the relational database to be annotated and used in the Information Extraction Module: table names, column names, column values, and function names. The Information Extraction Module extracts these understandable patterns and uses them to fill the SQL query templates and then generates responses accordingly. The current prototype has been developed for one database only, and therefore does not include the database name as a relational database understandable pattern. In the case of more than one database, the database name would be included. The Relational Database Meta Data & Annotation Module described in Section 5.2.1 introduces the annotations of the semantically understandable relational database patterns, which allows the Conversational Agent and the Information Extraction Module to extract these patterns from conversations.
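As an illustration only (the paper does not specify a concrete representation), such annotations could be held as a simple mapping from surface forms to database elements and matched against the user's utterance. The dictionary contents below are assumptions based on the examples in this paper; column values would additionally be matched against the annotated database content.

# Illustrative sketch of semantically understandable relational database patterns.
UNDERSTANDABLE_PATTERNS = {
    "tables":    {"customers": "customer_dim", "accounts": "accounts"},
    "columns":   {"account": "ACCOUNT_DSC"},
    "functions": {"how many": "COUNT", "total": "SUM"},
}

def extract_patterns(utterance):
    # Scan a user utterance for annotated table, column and function patterns.
    text = utterance.lower()
    found = {"tables": [], "columns": [], "functions": []}
    for kind, mapping in UNDERSTANDABLE_PATTERNS.items():
        for surface_form, db_element in mapping.items():
            if surface_form in text:
                found[kind].append(db_element)
    return found

print(extract_patterns("Show customers belong to the following account: Jeffrey May"))
# -> {'tables': ['customer_dim'], 'columns': ['ACCOUNT_DSC'], 'functions': []}
# A real module would also need to handle variants such as "account"/"accounts" and
# match column values (e.g. 'Jeffrey May') against the annotated database content.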
5.1.4 SQL Queries Templates

The SQL select statement structure is used to create a generalized set of SQL select statement templates. These templates are used by the Information Extraction Module in order to generate SQL select statements from the semantically understandable relational database patterns. The presentation of the templates here is based on the complexity of the SQL select statement which is going to be generated.

The first category includes templates that can create SQL select statements based on one table:
{table name} | {table name & column(s) name} | {table name & column(s) value} | {table name & column(s) name & column(s) value}

The second category includes templates that can create SQL statements based on two tables:
{2 table names} | {2 table names & column(s) names} | {2 table names & column(s) values} | {2 table names & column(s) names & column(s) values}

The third category includes templates that can create SQL statements based on three tables:
{3 table names} | {3 table names & column(s) names} | {3 table names & column(s) values} | {3 table names & column(s) names & column(s) values}

5.1.5 Tree for Dynamically Generated Queries

The Tree for Dynamically Generated Queries serves the purpose of saving automatically generated SQL queries from the Conversation-Based Interface to Relational Databases. The main purpose is to save dynamically generated SQL statements in a structured manner, since the SQL and the associated conversation can be used to grow the static knowledge tree built by the knowledge engineer.

5.2 The Back End Level

This level contains the main knowledge repositories; it mainly contains two components. The first component is the Relational Database Meta Data & Annotation Module, where the metadata is provided as part of the relational database management system. The second component is the Relational Database, which represents the dynamic knowledge base from which the system tries to answer queries.

5.2.1 Relational Database Meta Data & Annotation Module

This module is mainly for annotating the database table names, column names and function names, to be used in the Information Extraction Module.

5.2.2 The Relational Database

The relational database is an independent component, since the proposed architecture can work with any relational database management system to obtain a result set for any SQL statement. The domain relational database used is for a global computing company, which distributes computer hardware and software components to customers on a worldwide basis.

5.3 The Interactive Level

This level contains the modules which work very closely with the user, mainly the Information Extraction Module, the Response Generation Module, and the Conversation Manager.

5.3.1 Information Extraction Module

The Information Extraction (IE) Module is responsible for extracting useful information from the user input in order to formulate an SQL query.
The extraction of information is carried out at two levels. The first level is the Conversational Agent (CA) level, where the CA extracts information during the conversation; as the conversation passes through the nodes in the knowledge tree, the CA sets variables which are taken into account when formulating the SQL query. The second level is the dynamic mode level, reached when the conversation arrives at a point where not enough information has been supplied to the conversational agent to generate a query from the trained knowledge, i.e. the knowledge tree extracted from the expert knowledge. Knowledge extracted at this level is based on all the inputs entered by the user. These inputs are sent to the Information Extraction Module, which tries to extract Semantically Understandable Relational Database Patterns; these patterns form the basic building blocks of the general SQL templates, in order to dynamically generate the nearest SQL query from the user text. Three Information Extraction techniques are used: Text Cleaning, Semantic Interpretation, and Template Filling. In Text Cleaning, the system removes duplicated words and unnecessary tokens, e.g. extra spaces. What should be removed from the text as unnecessary is arguable; the system keeps all the words entered, because any piece of text could be useful at some stage. Through the Semantically Understandable Relational Database Patterns, the system extracts and interprets table names, column names, column values, and aggregation function names. Finally, the system tries to fill the extracted information into the available SQL templates to formulate an SQL statement.

5.3.2 Response Generation

The framework uses the Response Generation Module in order to create responses to be sent to the user. There are six types of responses. The first is the Initial Response, which comes at the start of each context, such as a welcome message or context identification. This response gives the user an idea of where he is in the conversation and offers options for directions. The second is the In-Context Conversational Response, which is a context-related dialogue response. This response allows further discussion within a context. The third is the Enquiry Response, which asks the user for further information in order to formulate an SQL statement. This response can appear when the conversation arrives at a leaf node, i.e. a goal node, and more information is needed to create the SQL statement. The fourth is the Trained SQL Statement and Result Set Response. This type of response comes directly from the knowledge tree when the user arrives at a leaf node and no further information is needed, i.e. the goal has been reached. The fifth is the Dynamic Enquiry Response: once in dynamic mode, the user is asked for further information. The last type is the Dynamic SQL Statement and Result Set Response. This response comes from the enhanced framework as a result of further processing of the inputted text.

5.3.3 Conversation Manager

The Conversation Manager in this framework manages more components than in the initial framework [1]. In this framework the goal is not known, and the system has to create an SQL query dynamically through the use of the Conversational Agent and Information Extraction in case the answer or the SQL statement is not in the trained knowledge tree.
The Conversation Manager manages all the modules in the interactive level in order to ensure the system's flexibility.

6 Results

Results on the initial prototype and example conversations are shown in [1]. In this paper the enhanced prototype has been developed, and testing the prototype has shown satisfactory results, apart from the need to improve the interactivity and continuation of the conversation between the user and the system after generating the dynamic SQL statement based on Information Extraction. Figure 3 is an example of the use of Information Extraction to project a simple natural language query into an SQL statement. The highlighted keywords have been used by the Information Extraction Module in order to fill an SQL template.

Show customers belong to the following account: Jeffrey May
Select * from customer_dim ,accounts where customer_dim.ACCOUNT_ID=accounts.ACCOUNT_ID and ACCOUNT_DSC='Jeffrey May'

Fig. 3. Projection from natural language into an SQL statement using Information Extraction (IE).
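As a minimal sketch (not the authors' code), the two-table template used for this example could be filled roughly as follows. The helper name and its parameters are assumptions; the join column is assumed to be obtained from the metadata and annotation module of Section 5.2.1, while the tables, column and value come from the extracted patterns.

# Illustrative sketch of filling a two-table SQL template with extracted patterns.
def fill_two_table_template(tables, join_column, value_column, value):
    join = "{0}.{2}={1}.{2}".format(tables[0], tables[1], join_column)
    return ("Select * from {0} ,{1} where {2} and {3}='{4}'"
            .format(tables[0], tables[1], join, value_column, value))

sql = fill_two_table_template(
    tables=["customer_dim", "accounts"],
    join_column="ACCOUNT_ID",      # assumed to come from database metadata (Section 5.2.1)
    value_column="ACCOUNT_DSC",    # extracted column pattern
    value="Jeffrey May",           # extracted column value
)
print(sql)
# -> Select * from customer_dim ,accounts where
#    customer_dim.ACCOUNT_ID=accounts.ACCOUNT_ID and ACCOUNT_DSC='Jeffrey May'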
7 Summary

The proposed enhanced framework introduces an easy and flexible way to develop Conversation-Based Interfaces to Relational Databases (C-BIRD), through the use of a goal-oriented conversational agent and a knowledge tree. A goal-oriented conversational agent provides the natural language interface and helps disambiguate the user's queries by utilizing dialogue interaction. The knowledge tree is easy to create, structure, update, revise, and maintain. Furthermore, the knowledge tree directs the conversational agent towards the goal (i.e. query generation from natural language). The enhanced framework utilizes Information Extraction in order to create dynamic SQL statements from natural language sentences, by extracting semantically understandable relational database patterns and filling them into SQL templates, which make up the answers to users' queries. Using Information Extraction, the prototype is now equipped with a further capability to generate SQL statements dynamically from natural language statements, which increases the accuracy of answering user queries.

References

1. Owda, M., Bandar, Z., Crockett, K.: Conversation-Based Natural Language Interface to Relational Databases, The 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology Workshops, California, pp. 363--367, (2007).
2. Turing, A.: Computing Machinery and Intelligence (1950), reprinted in: Computation & Intelligence: Collected Readings, American Association for Artificial Intelligence, Menlo Park, CA, pp. 23--46, (1995).
3. Reis, P., Mamede, N., Matias, J.: Edite – A Natural Language Interface to Databases: a New Dimension for an Old Approach, Proceedings of the Fourth International Conference on Information and Communication Technology in Tourism, Edinburgh, Scotland, (1997).
4. Popescu, A., Etzioni, O., Kautz, H.: Towards a Theory of Natural Language Interfaces to Databases, Proceedings of the 8th International Conference on Intelligent User Interfaces, ACM Press, Miami, Florida, USA, January 12-15, (2003).
5. Androutsopoulos, I., Ritchie, G., Thanisch, P.: MASQUE/SQL – An Efficient and Portable Natural Language Query Interface for Relational Databases, Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Gordon & Breach Science Publishers, Edinburgh, pp. 327--330, (1993).
6. Prasad, R., Ranjan, K., Sinha, A.: AMRAPALIKA: An Expert System for the Diagnosis of Pests, Diseases, and Disorders in Indian Mango, Knowl.-Based Syst., v. 19(1), pp. 9--21, (2006).
7. Hendrix, G., Sacerdoti, E., Sagalowicz, D., Slocum, J.: Developing a Natural Language Interface to Complex Data, ACM Trans. Database Syst., ACM Press, New York, pp. 105--147, (1978).
8. Woods, W., Kaplan, R., Webber, B.: The Lunar Sciences Natural Language Information System, Technical Report 2378, Bolt Beranek and Newman Inc., (1972).
9. Colby, K.: Simulation of Belief Systems, Computer Models of Thought and Language, pp. 251--286, (1973).
10. Introduction to ESTA (Expert System Shell for Text Animation), Prolog Development Center, Denmark. http://www.visual-prolog.com/vip6/Community/esta/esta.htm
11. Sammut, C.: Managing Context in a Conversational Agent, Electron. Trans. Artif. Intell., v. 5, pp. 189--202, (2001).
12. Yager, R.: Knowledge Trees and Protoforms in Question-Answering Systems: Special Topic Section on Soft Approaches to Information Retrieval and Information Access on the Web, J. Am. Soc. Inf. Sci. Technol., v. 57, pp. 550--563, (2006).
13. Appelt, D., Israel, D.: Introduction to Information Extraction, Tutorial at IJCAI-99, (1999).
14. Wallace, R.: The Anatomy of A.L.I.C.E., Artificial Intelligence Foundation Inc. Available at: http://www.alicebot.org/anatomy.html, (2004).
15. Adam: Student Debt Advisor, Convagent Ltd, Manchester, UK. Available at: http://www.convagent.com/convagent/adam3.aspx, (2001).
16. Weizenbaum, J.: ELIZA – A Computer Program for the Study of Natural Language Communication between Man and Machine, Communications of the ACM, ACM Press, New York, v. 9(1), pp. 36--45, (1966).
17. Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A.: Toward Conversational Human-Computer Interaction, AI Mag., American Association for Artificial Intelligence, Menlo Park, CA, v. 22, pp. 27--37, (2001).
18. Latham, A., Crockett, K., McLean, D., Edmonds, B., O'Shea, K.: Oscar: An Intelligent Conversational Agent Tutor to Estimate Learning Styles, 2010 IEEE International Conference on Fuzzy Systems, 978-1-4244-6919-2, (2010).
19. Minock, M.: C-Phrase: A System for Building Robust Natural Language Interfaces to Databases, Data & Knowledge Engineering (DKE), Elsevier, 69(3), pp. 290--302, (2010).
20. Kaetzel, L., Clifton, J., Klieger, P., Snyder, K.: Highway Concrete (HWYCON) Expert System User Reference and Enhancement Guide, Gaithersburg: National Institute of Standards and Technology (BFRL), (1993).
21. Androutsopoulos, I., Ritchie, G., Thanisch, P.: Natural Language Interfaces to Databases – An Introduction, Natural Language Engineering, v. 1(1), pp. 29--81, (1995).
22. Johnson, T.: Natural Language Computing: The Commercial Applications, Knowledge Engineering Review, v. 1, pp. 11--23, (1986).