
PROVIDING MULTILINGUAL NATURAL LANGUAGE ACCESS TO TOURISM INFORMATION

Helmut Berger(1), Michael Dittenbach(1), Dieter Merkl(2), Werner Winiwarter(1)

(1) Electronic Commerce Competence Center – EC3
Donau-City-Straße 1, A–1220 Wien, Austria
{helmut.berger, michael.dittenbach, werner.winiwarter}@ec3.at

(2) Institut für Softwaretechnik und Interaktive Systeme
Technische Universität Wien
Favoritenstraße 9–11/188, A–1040 Wien, Austria
[email protected]

Abstract. In this paper we describe an approach for a multilingual natural language query interface that allows the formulation of queries on tourism information like hotel availability. The interface is based on n-gram language identification and a keyword-based query interpretation where natural language analysis is performed for identification of the requested objects. The web-based interface is designed to allow for easy addition of alternative languages and for easy adaptation to other application domains.

1 Introduction

Natural language user interfaces have been a continuing research topic in computer science since the very first days of this scientific discipline. Natural language is especially appealing as an interface for database queries because the user is able to express her or his information request naturally, without the need to learn a formal query language such as SQL [4]. For concise overviews of the field, see for instance [1, 3].

Natural language technology is also a potential key to the success of e-commerce applications. In particular, the provision of multilingual access to information resources is crucial, all the more so in a multilingual environment such as Europe. We have developed an interface prototype called AD.M.IN for multilingual natural language access to tourism data. At present, the user may express her or his information requests for accommodation in English or German, and the language of the query is detected automatically by the system based on n–gram statistics of these languages. Because detection relies only on such per-language statistics, the inclusion of further languages should be fairly simple.

The remainder of the paper is organized as follows. In Section 2 we provide an overview of the language identification module, which is responsible for identifying the language in which the query is expressed. Section 3 gives an overview of the steps necessary to extract the relevant information from the query and convert it into proper SQL statements that retrieve the requested information. In Section 4 we give an example of the results of a particular query. Finally, some conclusions are given in Section 5.

2 Language Identification

To identify the language of a query, we use an n–gram-based text classification approach [2] where the classes represent the different languages. An n–gram is an n-character slice of a longer character string. As an example, for n = 3, the tri–grams of the string ’language’ are: { _la, lan, ang, ngu, gua, uag, age, ge_ }. When a string contains multiple words, the blank character is usually replaced by an underscore ’_’ and is also taken into account for the construction of an n–gram document representation.

Language classification using n–grams requires a sample text for each language in order to build statistical models, i.e. n–gram frequency profiles, of the languages. We used parts of sentences from a variety of topically different news stories in both English and German. The n–grams of these sample texts, with n ranging from 1 to 5, were analyzed and sorted in descending order of frequency, separately for each language. These sorted histograms are the n–gram frequency profiles of the respective languages. As an example, the top ten tri–gram occurrences in the German and English language texts are shown in Table 1. In the English text, it can be seen that {the, and, of, in} and the ending {ion} are the most frequent tri–grams. In contrast, in the German text, the most frequent tri–grams are endings like {en, er, ie, ch} and words like {der, ich, ein}.

Rank   German   Count   English   Count
1      en       1786    th        1333
2      er       1570    the       1142
3      de        949    he         928
4      der       880    of         592
5      ie        779    of         575
6      ich       763    an         439
7      ein       730    nd         407
8      sch       681    in         389
9      ch        642    ion        385
10     che       599    and        385

Table 1: Top ten tri–gram occurrences of German and English text with underscores representing blanks

To determine the language of a query, the n–gram profile, n = 1 . . . 5, of the query string is built as described above. The distance between two n–gram profiles is computed by a simple rank-order statistic. For each n–gram occurring in the query, the difference between the rank of the n–gram in the query profile and its rank in a language profile is calculated. For example, the tri–gram {the} might be at rank 5 in a hypothetical query but is at rank 2 in the English language profile. Hence, the difference in this example is 3. The sum of these differences over all n–grams of the query is the distance between the query and the language in question. Such a distance is computed for every available language, and the language whose profile has the smallest distance to the query is selected as the identified language, in other words, the language of the query. If the smallest distance is still above a certain threshold, it can be assumed that the language of the query is not identifiable with sufficient accuracy. In such a case the user will be asked to rephrase her or his query.
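The profile construction and rank-order distance described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the prototype: the function names, the toy sample texts and the tie-breaking rule are made up, and the penalty for n–grams that are missing from a language profile is an assumption borrowed from the out-of-place measure in [2], since the text above does not say how such n–grams are handled.

```python
from collections import Counter


def ngrams(text, n_min=1, n_max=5):
    """Return all character n-grams of text, with blanks written as '_'."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def profile(text):
    """Map each n-gram (n = 1..5) to its rank, 0 being the most frequent."""
    counts = Counter(ngrams(text))
    # Ties are broken alphabetically (an arbitrary, illustrative choice).
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked)}


def distance(query_profile, language_profile):
    """Sum of rank differences over all n-grams of the query."""
    # Assumption: an n-gram unknown to the language gets a fixed penalty.
    penalty = len(language_profile)
    return sum(
        abs(rank - language_profile[gram]) if gram in language_profile else penalty
        for gram, rank in query_profile.items()
    )


def identify(query, language_profiles, threshold=None):
    """Return the closest language, or None if even that one is too far."""
    query_profile = profile(query)
    best = min(language_profiles,
               key=lambda lang: distance(query_profile, language_profiles[lang]))
    if threshold is not None and distance(query_profile, language_profiles[best]) > threshold:
        return None  # the caller would ask the user to rephrase the query
    return best


# Toy sample texts, made up for this sketch (the paper used news stories).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist in dem haus",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}
```

Adding a further language in this sketch only means adding one more sample text to `SAMPLES`. With profiles built from such short samples the result for real queries is unreliable, so larger sample texts would be needed in practice.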

3 Query Interpretation

When the language of the query has been identified as outlined above, the query string is passed to the Query Interpretation Module. The general task of this module is to generate the SQL statements needed to process the query and retrieve the relevant information from the database. The general idea behind our Query Interpretation Module is as follows. The requested concepts of the application domain are extracted from the natural language query string. Based on these concepts, the final SQL query is composed of what we call SQL-fragments, i.e. SQL statements that are available for a wide range of different query patterns. In particular, the Query Interpretation Module is decomposed into several subtasks, which are performed in sequence as shown in Figure 1.

Input: query string
1. NumConverter: recognizes numerals, e.g. "eleven", and converts them to digits, e.g. "11"
2. QueryCleaner: discards terms which cannot be found in the ontology or in the database
3. QueryRewriter: replaces each word with its preferred term
4. Tagger: tags the information to add semantic information
5. SQLQueryGenerator: constructs the SQL query and retrieves the object IDs
Output: any desired output

Figure 1: Subtasks during query interpretation

First, the module NumConverter detects numerals and represents them in the form of the respective digits. In the next step, i.e. QueryCleaner, the query string is decomposed into its words. For each word, QueryCleaner checks whether it is contained either in the ontology of the application domain or in a domain-specific table of proper names, for example the names of cities, geographical regions and the like. The ontology of the application domain is represented in the form of XML descriptions where words and their synonyms are connected to semantic objects of the domain. As a result, QueryCleaner extracts each word that was thus found to be relevant for the application domain. These words are then replaced by their preferred terms as indicated in the ontology. This task is accomplished by the QueryRewriter module. The subsequent Tagger module adds semantic information to the remaining query words. These semantic tags are further used to determine which modifiers can be expected in the neighborhood of a word. Modifiers might be negation, conjunction, disjunction, or numerals that describe quantities and the like. This information is then used to determine elements of the query that provide a more detailed specification of the requested object.

In the final step, i.e. SQLQueryGenerator, the necessary SQL-fragments are selected based on the identified query concepts and their modifiers. Pragmatically speaking, the identified concepts are filled into the SQL-fragments, and all fragments together represent the final SQL query expression. As an example of the differences in SQL-fragment selection, consider the following parts of queries, all three describing possible geographical relations between the place of the accommodation and the location of the city of Imst in Tyrol, Austria.

1. . . . in Imst
2. . . . not in Imst
3. . . . close to Imst

The first two would result in the selection of the identical SQL-fragment, with negation in the second case, thus:

1. WHERE city like 'Imst'
2. WHERE city not like 'Imst'

However, the third, i.e. ’. . . close to Imst’, results in the selection of a different SQL-fragment. In particular, an SQL-fragment that incorporates the computation of the distance between geographical locations will be selected.
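The fragment selection for the three Imst examples can be illustrated with a small Python sketch that runs against an in-memory SQLite database. Everything concrete in it is an assumption made for illustration: the table and column names, the planar toy coordinates, the 10 km radius that stands in for "close to", and the way negation is inserted into a fragment. The paper does not specify the schema or the distance computation used by the prototype.

```python
import sqlite3

RADIUS_KM = 10  # hypothetical meaning of "close to"

# Hypothetical SQL-fragments: a template with a slot for negation, plus a
# function that supplies the bound parameters for a given city name.
FRAGMENTS = {
    "in": ("city {neg}LIKE ?", lambda city: [city]),
    "close to": (
        "city {neg}IN (SELECT c2.name FROM city c1, city c2 "
        "WHERE c1.name = ? "
        "AND (c1.x_km - c2.x_km) * (c1.x_km - c2.x_km) "
        "+ (c1.y_km - c2.y_km) * (c1.y_km - c2.y_km) <= ? * ?)",
        lambda city: [city, RADIUS_KM, RADIUS_KM],
    ),
}


def build_query(relation, city, negated=False):
    """Select the fragment for a relation and fill it into a full query."""
    template, make_params = FRAGMENTS[relation]
    where = template.format(neg="NOT " if negated else "")
    sql = f"SELECT name FROM accommodation WHERE {where} ORDER BY name"
    return sql, make_params(city)


def demo_db():
    """Create a toy database; names and coordinates are made up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE city (name TEXT, x_km REAL, y_km REAL);
        CREATE TABLE accommodation (name TEXT, city TEXT);
        INSERT INTO city VALUES ('Imst', 0, 0), ('Nearby', 3, 2), ('Faraway', 50, 5);
        INSERT INTO accommodation VALUES
            ('Hof A', 'Imst'), ('Hof B', 'Nearby'), ('Hof C', 'Faraway');
    """)
    return conn


def run(conn, relation, city, negated=False):
    """Build the query for one relation and return the matching names."""
    sql, params = build_query(relation, city, negated)
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `run(demo_db(), "in", "Imst")` and `run(demo_db(), "in", "Imst", negated=True)` uses the same fragment with and without negation, whereas `run(demo_db(), "close to", "Imst")` selects the distance-based fragment. Passing the city name as a bound parameter rather than pasting it into the SQL string is a design choice of this sketch.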

4 AD.M.IN.@work

In this section we provide an example of query processing with our multilingual natural language interface. Consider a situation where you are interested in a vacation in the countryside. The queries we show are:

• English: Show all farms close to Imst where pets are allowed
• German: Zeig mir alle Bauernhöfe in der Umgebung von Imst wo Haustiere erlaubt sind

For these two queries the following words are identified as being relevant for the application domain: {farms, Bauernhöfe}, {close to, in der Umgebung von}, {Imst, Imst}, {pets, Haustiere}. The first pair of words is identified as referring to a particular type of accommodation, i.e. typ Bauernhof in our ontology. The pair {pets, Haustiere} is identified as a further specification of the accommodation, namely that only the subset is requested where accompanying pets are allowed, i.e. only those accommodations where the ’einrichtung haustiere’ flag is set in their description. {Imst} is identified as the name of a city in Tyrol, Austria. Finally, {close to, in der Umgebung von} is identified as a further restriction on the location of the accommodation. Such a restriction is only valid for geographical locations such as, in our example, a particular city.

The results of a query are currently shown in the form of tables, as given in Figure 2 for the German query and in Figure 3 for the English query. Please note that both queries result in the same language-independent representation, as can be seen in the figures. This language-independent representation requires the retrieved objects to be of ’typ Bauernhof’, ’nahe’ to the ’city’ of ’Imst’ and allowing pets, i.e. ’einrichtung haustiere’. This is shown in the last line above the table containing the various accommodations. Obviously, the corresponding SQL-fragments are also the same for both queries.

Figure 2: Result for German query

Figure 3: Result for English query
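How two queries in different languages collapse to one language-independent representation can be sketched as follows. This is a simplified illustration of the NumConverter, QueryCleaner and QueryRewriter steps under several assumptions: plain Python dictionaries stand in for the XML ontology and the table of proper names, the preferred-term identifiers are written with underscores, multi-word expressions are matched greedily by longest phrase, and the Tagger step is omitted. None of these details are confirmed by the paper.

```python
import re

# Hypothetical numeral tables per language (NumConverter).
NUMERALS = {"en": {"eleven": "11"}, "de": {"elf": "11"}}

# Hypothetical ontology excerpt: word or phrase -> preferred term.
ONTOLOGY = {
    "en": {
        "farm": "typ_bauernhof",
        "farms": "typ_bauernhof",
        "close to": "nahe",
        "pets": "einrichtung_haustiere",
    },
    "de": {
        "bauernhof": "typ_bauernhof",
        "bauernhöfe": "typ_bauernhof",
        "in der umgebung von": "nahe",
        "haustiere": "einrichtung_haustiere",
    },
}

# Hypothetical table of proper names shared by all languages.
PROPER_NAMES = {"imst": "city:Imst"}


def interpret(query, lang):
    """Reduce a query to the preferred terms of the concepts it mentions."""
    tokens = re.findall(r"\w+", query.lower())
    tokens = [NUMERALS[lang].get(tok, tok) for tok in tokens]  # NumConverter
    lexicon = {**ONTOLOGY[lang], **PROPER_NAMES}
    max_len = max(len(phrase.split()) for phrase in lexicon)

    concepts, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first so "close to" wins over single words.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                concepts.append(lexicon[phrase])  # QueryRewriter
                i += n
                break
        else:
            if tokens[i].isdigit():
                concepts.append(tokens[i])  # keep quantities
            i += 1  # QueryCleaner: drop everything else
    return concepts


ENGLISH = "Show all farms close to Imst where pets are allowed"
GERMAN = "Zeig mir alle Bauernhöfe in der Umgebung von Imst wo Haustiere erlaubt sind"
```

In this sketch `interpret(ENGLISH, "en")` and `interpret(GERMAN, "de")` yield the same list of preferred terms, which mirrors the observation above that both queries lead to the same SQL-fragments.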

5 Conclusions

In this paper we have described a multilingual natural language database interface for tourism information. The major features of our approach are language detection based on n–grams, query identification based on keyword matching and natural language analysis, the inclusion of a domain ontology for easy adaptation to other application domains, and a web-based user interface for query formulation and result presentation.

Acknowledgments

We are grateful that TIScover provided us with the tourism data for the experiments described in this paper. Thanks are also due to Konrad Plankensteiner and Ferdinand Schinagl for their valuable comments and suggestions during the design and implementation of the interface.

References

[1] I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. Natural language interfaces to databases - An introduction. Research Paper no 709, Department of Artificial Intelligence, University of Edinburgh, Edinburgh, UK, 1994.
[2] W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of the 3rd Int’l Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, 1994.
[3] W. Winiwarter. Bewältigung der Informationsflut – Stand der Computerlinguistik. Nachrichten für Dokumentation 47(3), 1996 (in German).
[4] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press, Rockville, MD, 1988.
