Extracting Data from Personal Text Messages
Richard Cooper and Sajjad Ali1
Computing Science, University of Glasgow

Abstract
We present an approach to Information Extraction (IE) from short text and electronic mail messages in the restricted task of extracting data which can be added to a specified data repository. Whereas most IE systems start from a syntactic analysis of the message, our software generates possible sentence structures from the database metadata and then pattern matches these structures against the input text. This technique finds new data and generates update statements which can be used to add the new data to the repository. The paper describes an initial version of a component which handles a number of kinds of sentence, anaphoric references and synonyms.
1. Introduction
There are a great many mechanisms for acquiring the data necessary for an information system, from the fresh entry of data found in the non-computerised world to the automatic capture of data using devices such as bar code readers. Electronic interpersonal communication forms an important part of the modern ubiquitous and highly distributed information processing environment and lies somewhere between these two extremes. On the one hand, the messages contain data which is held in the computer and so is immediately accessible, but, on the other hand, this data is in a loosely structured form unlikely to fit with the repository managed by the information system. The work described here attempts to process short electronic free text communication (e-mail and SMS text messages) and extract structured data so that it can readily be added to the repository. In this case, we can exploit two features of the message – it will include terms from the domain that the repository is modelling, and it is probable that the language structure will be fairly simple. We are also quite content to ignore anything in the message which does not seem to have anything to do with the domain. We cannot, on the other hand, be quite as confident that syntax and spelling will be used accurately – indeed, in the case of text messaging, both spelling rules and syntax will almost always be ignored or transformed dramatically.
[Figure 1: architecture diagram showing public web pages generated by a middleware program from a live data repository, with visitor submissions passing through a transit data repository and a moderating program, driven by a set of templates each having a stylesheet, type and structure.]
Figure 1. A Collaborative Information Bearing Website Architecture

This project arose in the context of a generic system for the construction of collaboratively developed information bearing web sites [1]. The architecture of such sites is shown in Figure 1 and essentially consists of the presentation of a collection of hyperlinked pages generated from a data repository. Each page represents either summary information about a collection of items (catalogue pages) or the details of individual items (item pages), but with the additional aspect that each page visited also encourages the visitor to collaborate in the site development by submitting fresh information. For instance, if the web site describes a collection of books, there would be buttons on each catalogue page to request information about books not yet on the site, and on each book page there would be buttons to request more information about this book. The information submitted is held on the server in a transit repository from which it is processed by a moderator using a program that can be used to edit it and transfer the resulting data to the live repository. The software which generates the site takes in a high level schema of the data structure and automatically constructs the repositories, the middleware to generate the web pages and the moderating program.

1 Send to [email protected] for more information.
Information Extraction
1
01/02/2006
The system provides two mechanisms to allow the visitor to submit information. Firstly, and more simply, an editable form allows the entry of the data associated with an entity of a particular kind. Secondly, the visitor can send an e-mail message. The latter gives more flexibility, allowing the communication of any information desired. The first implementation of the system provided the moderator with tools to cut and paste text from the e-mail messages into the live repository, and this proved to be helpful but cumbersome [2]. Simplifying the moderator's task by automating the process was the initial stimulus to this project, which aims to produce a component which automatically extracts data from the messages in a form suitable for addition to the repository. Subsequently, it became clear that information bearing short messages, sent both as e-mails and text messages, are a common aspect of most modern distributed information systems. The task of identifying structured data in unstructured documents is usually referred to as Information Extraction. In our case, the data is to be found in natural language text2 and the structure is to be found in the schema of the data repository. The approach to be followed will be to identify linguistic structures in the message which can be transformed into updates on the repository. Thus, on encountering "The author of Emma was Jane Austen.", the program should extract the data and develop an update command such as the SQL-like command "update Book set Author = 'Jane Austen' where Title = 'Emma';". In proceeding, the metadata found in the schema is of paramount importance. The presence of the metadata name "Author" in the sentence will alert the processor to the presence of potentially useful data, and the metadata structure can be used to identify potential sentences which contain that data. Looking at a factual statement, we find that the purposes of the words in the sentence naturally fall into three categories.
Some words (articles, conjunctives, etc.) are there to provide syntactic structure. Other words ground the statement by identifying an information category from the domain of discourse. The final group of words provide data values drawn from the information category. In database terms, the second group are metadata and the third group data. For instance, in "The author was Jane Austen", "The" and "was" provide structure, "author" identifies a category from the domain and "Jane Austen" is a data value. Making this three way distinction between the words explicit permits us to attempt analysis of the text in a number of ways. Much IE work attempts to learn the structures given the collection of terms from the domain of discourse. Other work attempts to learn the terms used from the structures. In our work, we can mostly assume we know both of these and are only trying to find any data values we can. This seems sensible since a person interpreting the sentence above would have been taught the English language (including the concise structures used in e-mails and text messages) and also be expected to know something of the nature of the book domain. On encountering the sentence, the reader will be able to attach a name to the author of the book being discussed, which might mean adding a new name to the list of authors, or adding a new entry to the list of books known to be by Austen (or it might be redundantly restating information already known or, more problematically, conflict with values in the repository). Looking first at the structure, attempts to process a natural language text use two broad approaches. One possibility is to attempt to fully parse the text according to the language in which the text is supposed to be written. Having found the linguistic category of each part of the text, it should then be easier to extract information from it. The second approach is to try to match fragments of the text, such as sentences, to expected linguistic structures. 
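This three-way classification can be illustrated with a minimal sketch. The stop word and schema term lists below are illustrative assumptions, not the authors' implementation; given the schema terms, every remaining word becomes a candidate data value.

```python
# A minimal sketch (assumed word lists, not the authors' code) of the
# three-way word classification described above: structural words,
# metadata terms and candidate data values.

STRUCTURAL = {"the", "a", "an", "was", "is", "of", "and"}
METADATA = {"author", "title", "publisher"}   # terms drawn from the schema

def classify(sentence: str):
    """Label each word of a sentence as structure, metadata or data."""
    labels = []
    for word in sentence.rstrip(".").split():
        w = word.lower()
        if w in STRUCTURAL:
            labels.append((word, "structure"))
        elif w in METADATA:
            labels.append((word, "metadata"))
        else:
            labels.append((word, "data"))
    return labels

print(classify("The author was Jane Austen."))
# [('The', 'structure'), ('author', 'metadata'), ('was', 'structure'),
#  ('Jane', 'data'), ('Austen', 'data')]
```

In practice the component works the other way round, matching whole sentence patterns rather than labelling word by word, but the sketch shows why knowing both the structure and the domain terms leaves only the data to be found.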
Although the former is the usual (and more generally applicable) approach, the latter seems more promising in restricted contexts such as this one and is the approach we will describe here. We provide sentence pattern templates3 with placeholders for the domain terms and the data, match these against sentences from the message, and identify the text matching the placeholders as data which can be extracted. Turning to the domain terms, we believe that the IE process needs to be given these as well. Much recent IE activity has exploited the exploding interest in ontologies brought about by the need to add meaning to web structure [3]. An ontology describes the terms found in the domain of discourse and indicates the relationships between them. We could use a number of techniques for describing the ontology, which could be purpose built or could exploit a more general ontological system such as WordNet [4]. In our case, we are working in the context of the database and, if the metadata has been given reasonable names, we can use these names as a basis for our collection of terms. Our system uses the metadata combined with synonyms generated in a variety of ways in order to identify terms equivalent to those actually found in the database. In getting a system such as this working, it is vital to use a data model which is supportive of the aims of the system. In this case, we require the data model to fit with the expected use of language, which means it must be high level and must provide both an entity identification mechanism usable by humans and a means of dealing with gender and number. It is clear, for instance, that there is a significant discrepancy between the structure of a relational database and the way we describe information [5]. XML, on the other hand, imposes a hierarchical structure which lacks the generality found in the more generalised semantic network of human conceptions. It is also clear that the "key" values we use to identify things (names, titles, etc.)
do not always possess the features required of a relational database key. In particular, they are not unique.

2 In this case, we use the term "natural language text" to refer to text produced and consumed by humans and not just to text which accords with the syntax of any particular human language.
3 Note that in IE research, the term template is usually used to mean the structure in which the extracted information is returned. Here we mean an abstract form of the sentences of which the component can make sense.

Interpreting a sentence is also very much guided by the gender of the objects being discussed (which allows us to interpret pronouns) and by the number of objects expected, since sentence structures differ for single-valued and multi-valued objects. In order to make the information identification process easier, the metadata is described in a high level data model. We describe our model in Section 3 and note particular features which model gender, number and key terms in order to facilitate the extraction process. Two other aspects of interpreting a sentence are also crucial – context and synonyms. Which book are we talking about? The sentence previously discussed must take place in a larger linguistic structure – a conversation or a continuing text. Perhaps it occurs as an answer to "Who wrote Emma?" or follows "My favourite book is Persuasion". Interpretation of each sentence can only be achieved given a context which holds information which is understood, but unspoken. We will see in Section 5.6 how such a context can be modelled and can guide interpretation. Human language typically permits many terms to be used for the same concept. In our sentence, we might have encountered the term "writer" instead of "author". Thus we must be sure that the second category of words – terms from the universe of discourse – encompasses as many varieties of term as we can expect our message sender to use. Thus the metadata of the database can only be taken as a starting point. We also need to be able to find all potential synonyms of the terms and equivalent ways of expressing the same fact – e.g. "Austen wrote the book". Section 5.8 discusses how we manage this diversity using WordNet and other techniques. This technical report extends the work reported previously [6, 7] and is structured as follows. The second section reviews information extraction practice to position our work against other IE approaches and techniques.
The next two sections describe the underlying data model in which the information captured is described and the architecture of the overall system. The fifth section is the centrepiece and describes the linguistic structures and how they are used. The sixth section describes the prototype as currently functioning, while the seventh section has some implementation details. The concluding section identifies unsolved problems and approaches to their solution.
2. Approaches to Information Extraction
Information Extraction (IE) is the process of taking a piece of free text and locating structured information within it. There have been a number of approaches to achieving this, which have their basis in the more general endeavour of trying to get a program to understand natural language. Natural Language Processing has a long and rich history [8] and usually proceeds by parsing the text and using the syntactic structure to guide a semantic interpretation of the text. Context free grammars and augmented transition networks are the main tools used to achieve this. IE has the more modest aim of trying to find useful information without making a full attempt to understand the text as a whole. Mostly, this is carried out in the expectation that the information found will be concerned with specific subject matter and a specific information gathering exercise, such as the development of a database or the provision of automatic summarisation [9]. There is a wide range of approaches to this. At one extreme, many researchers have begun by producing a full syntactic and semantic analysis of a well formed natural language text from which the information can be identified. At the other extreme lies the use of keyword matching techniques which make little or no use of the linguistic structures in the text. Both of these are often facilitated by the fact that the texts being used conform to a regular structure – for instance, they might be insurance reports. The system described by Cardie [10] gives a typical sequence of phases. She starts by identifying the words and their part of speech (Tokenization and Tagging) and continues to identify the sentence structure (Sentence Analysis or Segmentation). The Extraction phase can then "identify domain-specific relations among relevant entities in the text". There follows a Merging phase which resolves anaphoric references and co-references (multiple terms for the same concept).
Finally, Template Generation identifies “events” in the text and adds fresh information to the environment which is being developed. LaSIE similarly divides the task into lexical pre-processing, parsing and semantic interpretation, and discourse interpretation [11]. GATE [12] is typical of the kind of system which supports this kind of approach. The process identifies the linguistic categories of all terms before trying to match them with domain information – e.g. find all the noun phrases and then check if they could be book titles. The end result is a template filled in with data extracted from the text. A second approach [13] is to effect a parsing of the text by tagging, often laboriously by hand, thus turning unstructured text into semi-structured data and giving a considerable assistance to the information detection process. Another common approach (e.g. [14]) takes the text and a series of terms, finds them and then analyses the text that surrounds them, constructing possible patterns from the sentences found. In terms of our tripartite distinction of word use, the system is given terms and can extract structure and then data. The issue which dominates IE research is the development of systems which learn how to deal with new linguistic forms. The Message Understanding Conferences (summarised in [15]) run by the US Navy provided researchers with a series of messages in order to challenge the ability of the contributing systems to learn to process novel syntactic structures. Adaptability is one of the key concepts behind a recent summer school in IE [16]. The aspiration is that systems will develop both the recognisable syntactic structures and the knowledge base as new messages are encountered.
The free text which our component is designed to handle is found in e-mail and SMS text messages. In this context, we feel that a full linguistic analysis is unnecessary since, out of the enormous number and variety of natural language sentences which occur, we are only interested in a tiny subset and are happy not to understand the rest. Most of the time spent on a full analysis would therefore be wasted. Moreover, full linguistic analysis is likely to be infeasible since the syntactic structure of the messages is unlikely to be as well structured or unchanging as can be expected from other texts. We carried out one experiment in trying to elicit information bearing sentences from subjects. Rather than send in a sentence such as "The author is Jane Austen.", respondents mostly sent "Author : Jane Austen.". On the other hand, keyword matching seems insufficient and, in any case, it would be foolish to throw away entirely the linguistic and domain structures in which the component is expected to work. Here, we take a much simpler approach which exploits the following assumptions about the restricted world within which the component must function:
i) The information to be found is restricted to that expressed by the schema of a database, and this is not expected to change, so adapting the target information structure is not an issue (although adapting the set of syntactic structures which can be managed is important).
ii) The schema structure is very simple, consisting of entities grouped into collections.
iii) The English used in each message conveying fresh information is likely to be fairly simple and direct.
iv) The syntax of a message may not accord with any fixed natural language syntax.
v) Spelling is also likely to vary considerably.
The approach we have taken is therefore largely different from the work described above. We tell the system the domain terms and the language structure in the form of expected sentence templates and then let it try to find information in the messages by matching input sentences to the patterns. The start and end of the process is clearly similar to previous systems – we must tokenise the text and find the sentences first and, at the end, produce a filled in "template" (in our case, SQL-like statements). In between, the process is much simpler – we merge the sentence templates with domain terms to produce sentence patterns and then match these against the input sentences. This is lightweight in the sense used in the work of Kang et al. [17], and is also similar to the work of Stratica and Desai [18], both of which use similar techniques to process natural language queries. This has the advantage that we do not need to carry out a syntactic analysis at extraction time and also that, by modifying the sentence templates, we can use the same component unchanged. In essence, we are using the syntactic structure of the language in the pattern generation process and thus are doing this once at start-up time for each database rather than once for each message. Given the fluidity of language in the modern world, this seems like a major benefit, since new modes of expression can easily be accommodated. Thus a sentence of the form "<property> : <value>" is as easily handled as one in good English. It is also valuable that many languages can be accommodated at the same time, which is useful given the global nature of information use.
This is to be contrasted with systems such as MUSE [19] which can build an IE tool in one of a number of different languages. Finally, the need to encompass SMS messages is catered for, since any novel syntax or spelling can be included. We are also quite happy to ignore sentences which appear to have nothing to do with the domain. Note that the approach does not avoid parsing. Providing the set of sentence templates is essentially providing a grammar for the language. It is just that the grammar is pitched at a very superficial level rather than the profound structures required for a general account of natural language. It should be noticed that superficiality in this context does not mean fuzziness, but it does mean reduced processing time. However, in achieving this we have a number of problems to overcome:
i) There are still a large number of sentence structures which might be used to express the same concept.
ii) Most of the sentences will include anaphoric references, such as pronouns.
iii) Although the database metadata is fixed, there are usually a number of alternative names and spellings that could have been chosen instead of the ones actually used.
The first version of the component takes a simplistic view of some of these problems in order to demonstrate that, even so, useful extraction can be achieved. The next four sections describe that first version and the concluding section discusses the outstanding problems and our proposals for solving them.
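The merge of sentence templates with domain terms at start-up time can be sketched as follows. The property list and placeholder syntax are illustrative assumptions (the report's own placeholder conventions are detailed in Section 5.1); this is not the authors' code.

```python
# A minimal sketch (assumed schema terms, not the authors' code) of
# start-up pattern generation: a sentence template's metadata placeholder
# is filled with each property name, yielding one concrete pattern per
# property.  <property1> is a metadata placeholder, <<value1>> a data one.

PROPERTIES = ["author", "title", "publisher"]   # illustrative schema terms

def generate_patterns(template: str):
    """Create one pattern per property by filling the metadata slot."""
    return [template.replace("<property1>", p) for p in PROPERTIES]

patterns = generate_patterns("The <property1> was <<value1>>.")
print(patterns)
# ['The author was <<value1>>.', 'The title was <<value1>>.',
#  'The publisher was <<value1>>.']
```

Because this expansion happens once per database rather than once per message, adding a new mode of expression is just a matter of adding a template and regenerating the patterns.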
3. The Data Model
As has already been emphasised, the nature of the data model used to describe the extracted data is of critical importance. We require a model which describes data at a level of abstraction which fits the level of human discourse. Thus records and foreign keys won't do, and neither will hierarchical XML structures. Object or entity-based models have long been recognised as being much closer to the level at which we discuss information, and so such a model is basic to our approach. Furthermore, much
linguistic interpretation revolves around the gendered nature of information. Finally, the model has to take account of a rather different use of key information between humans and database systems. The data model with which the schema structure is described is essentially an entity-based model with a few additions [20]. The model allows the specification of a number of entity types, each with a number of properties whose domain may be either a primitive type or another entity type and which may be single or multi-valued. The model also allows the specification of how each property will be implemented in a repository which consists of a mixture of relational, XML and multimedia file data. Of relevance to the current work is that one of the properties of each entity type can be given the primitive type Gender. Properties of this type can take one of three values: male, female or unknown. Entities whose type does not have a property of this type are neuter. This part of the data model is used to disambiguate pronouns of different genders, since a pronoun refers to an entity and each entity will now have a gender (the value of the gender property if the type has one, a neuter gender if the type has no such property). As an example, the schema shown in Figure 2 describes a data source for a web site describing published books, implemented as a mixture of relational tables (R), XML files (X) and multimedia files (F). Of most importance here, however, is the treatment of key properties. The model supports not only traditional database primary keys (we will call these Dkeys), but also humanly intelligible keys (Hkeys). These are properties whose values may appear as identifiers in the text as what the Natural Language Processing community call named values. This is vital since in many cases the database key will be an artificially produced number which the person sending the e-mail cannot be expected to know.
Thus we might create an artificial identifier to distinguish authors, but will expect site visitors to communicate in terms of the name – of course, in many cases, the Dkey may be an Hkey as well (e.g. the ISBN of a publication). Where they are different, a function in the system turns Hkeys into Dkeys by seeking the Hkey and returning the Dkey of that entity. Humanly intelligible keys differ from primary keys in two ways which might be expected to give rise to problems: they might not be unique; and there may be more than one of them. The latter point gives no problems since the purpose of an Hkey is to permit the component to search for an entity from the values of Hkey properties. The effect of identifying a number of properties as humanly intelligible keys is to make the search for the identified entity use all of those properties. Uniqueness is much more of a problem and means that either we would need to use context more effectively than we do or that the ambiguity must be resolved by the user of the system – see Section 8 for more on this.

Entity     Field        Domain       R/X/F  Dkey  Hkey  Cardinality  Inverse
Book       ISBN         Varchar(20)  R      √     √     single
Book       Title        Varchar(30)  R            √     single
Book       Date         Date         R                  single
Book       PageLength   Number2      R                  single
Book       Author       Author       R                  many         Writings
Book       Publisher    Publisher    R                  one          Publishes
Author     ID           Number2      R      √           single
Author     Name         Varchar(20)  R            √     single
Author     Nationality  Varchar(10)  R                  single
Author     BirthDate    Date         R                  single
Author     DeathDate    Date         R                  single
Author     Gender       Gender       R                  single
Author     Photograph   Image        F                  single
Author     Writings     Book         R                  many         Author
Publisher  ID           Number2      X      √           single
Publisher  Name         Varchar(20)  X            √     single
Publisher  Address      Varchar(50)  X                  single
Publisher  Phone        Varchar(20)  X                  many
Publisher  Publishes    Book         X                  many         Publisher
Figure 2. A Sample Data Repository Schema

Implementation of the data model may take many forms. It may be turned into a relational database or an XML file using well known techniques [21, 22]. [20] describes how a judicious mixture of relations and XML may be set up to exploit the benefits of each. In our prototype system, the database is implemented as a tab separated file of strings, which is sufficient for proof of concept, though not of course for either efficiency or for further use of the database. The software generates database updates and these are described in terms of an SQL-like query language which consists of the following update and insert command structures:
update <entity type> <Dkey value> set <property> = <value>
insert into <entity type> (<properties>) values (<values>)
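Producing these two command forms from extracted values is mechanical. The following sketch (function names are our own, not part of the report) shows the string assembly; the update example mirrors the command shown later in Section 4.

```python
# A minimal sketch (assumed helper names) of emitting the SQL-like
# update and insert commands of the query language described above.

def make_update(entity: str, dkey: str, prop: str, value: str) -> str:
    """Build an update for a known entity identified by its Dkey."""
    return f"update {entity} {dkey} set {prop} = '{value}'"

def make_insert(entity: str, columns, values) -> str:
    """Build an insert for a fresh entity; unnamed columns stay null."""
    cols = ", ".join(columns)
    vals = ", ".join(f"'{v}'" for v in values)
    return f"insert into {entity} ({cols}) values ({vals})"

print(make_update("Book", "0123456", "Author", "Jane Austen"))
# update Book 0123456 set Author = 'Jane Austen'
print(make_insert("Book", ["ISBN", "Title"], ["0123456", "Emma"]))
# insert into Book (ISBN, Title) values ('0123456', 'Emma')
```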
All of the information found in a sentence in the messages can be turned into one or more of these. The goal of the extraction process is to locate the entity type, property and value for updates, and the entity type and Dkey and Hkey values for inserts. The insert command can be reduced to inserting a value for one column if the Dkey and Hkey columns are the same. The other values for the new entity are set to nulls, and it is assumed that further data will appear later in the message.
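The Hkey-to-Dkey function mentioned in Section 3 can be sketched as a simple lookup. The toy repository below is illustrative; in the real system the lookup would run over the repository's Hkey properties, and a non-unique match would be passed to the moderator for resolution, as Section 8 discusses.

```python
# A minimal sketch (illustrative data, not the authors' code) of turning
# a humanly intelligible key (Hkey) into a database key (Dkey).

BOOKS = [  # toy repository: ISBN is the Dkey, Title an Hkey
    {"ISBN": "0123456", "Title": "Emma"},
    {"ISBN": "0654321", "Title": "Persuasion"},
]

def hkey_to_dkey(title: str):
    """Return the Dkey of the entity whose Hkey matches, if unique."""
    matches = [b["ISBN"] for b in BOOKS if b["Title"].lower() == title.lower()]
    # Ambiguous or missing Hkeys are left for the moderator to resolve.
    return matches[0] if len(matches) == 1 else None

print(hkey_to_dkey("Emma"))   # 0123456
```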
4. The Information Extraction Architecture
Given a database and a message, the use of the system comprises two phases – the generation of a collection of patterns holding the sentence structures which the component can recognise; and the use of that collection to derive database update statements which can store the information found in a message. This section describes the organisation of these activities and introduces the linguistic data structures they use.
[Figure 3: pattern generation takes pattern templates, template types, metadata and synonyms (produced by synonym generation) and creates patterns; information extraction then applies these patterns, together with hand crafted patterns, the context and update templates, to an e-mail message to produce updates.]
Figure 3. An Architecture for Information Extraction to a Database

The aim of the IE component is to recognise sentences such as "The author of 'Emma' was Jane Austen.", "The author was Jane Austen.", "The writer was Jane Austen.", "Jane Austen wrote 'Emma'." and "Jane Austen wrote it.", and, for each of these, display an executable command to the moderator with which the repository could be updated. All of the above should, for instance, generate the command (expressed in the query language described above) "update Book 0123456 set Author = 'Jane Austen'"4, given that 0123456 is the ISBN either of the explicitly mentioned book, "Emma", or of the most recently mentioned book in the conversation. These sentences express a single fact and are listed in order of complexity of handling. To this end, we start from a collection of sentence templates describing the structure of a group of sentences we expect to encounter. Each of the sentences listed above consists of a text string containing words which fulfil one of three functions. Firstly, there are words whose role is structural ("a", "the", "was", etc.). Secondly, there are words which are either meta-data ("author") or stand in for the meta-data ("writer" and "wrote"). Finally, there is data, either the fresh data that the visitor is communicating ("Jane Austen") or the Hkey data which the visitor sending the message expects to find in the database already ("Emma", the publisher Macmillan). We therefore have constructed a pattern template structure which distinguishes these three and a pattern generation mechanism which takes the template, leaves the first set of words unchanged, replaces the second set with meta-data and leaves the third as place holders. For instance, "The author was Jane Austen." matches the pattern "The author was <<value1>>.", which is formed from the template "The <property1> was <<value1>>." by identifying <property1> with the property author.
The templates are grouped into template types, each of which identifies a particular combination of information which can be found in sentences of that type. This template is of the type which expects to find one property and its value in the context of a known current entity. Each type is associated with an abstract form of the updates which are appropriate to add this information to the database. From these abstract forms, appropriate updates can be generated. This type generates just one update, which is of the form "update <current entity> set property1 = value1". The pattern and template structures are described in more detail in Section 5.1. Pattern generation consists of taking each template and replacing the placeholders with each valid combination of metadata terms, creating one pattern for each combination. The example above, for instance, creates one pattern for each property in the database. The complex template, "The <property1> was <<value1>>, while the <property2> was <<value2>>.", generates one pattern for each pair of different properties of the same entity type – e.g. "The author was <<value1>>

4 Note that all references to metadata, whether in the message or in the SQL, are case insensitive. In these examples the message contains "author" and yet the property name is "Author". The component resolves such differences satisfactorily.
, while the date was <<value2>>." Further patterns could be added by hand if the ones automatically generated proved to be insufficient. The actual metadata terms (entity type names and property names) can be replaced by synonyms in the patterns. Synonym generation takes three forms: the synonyms of a term can be added by hand, by the use of WordNet and by vowel stripping. The synonym generation interface also permits inappropriate automatically generated synonyms to be removed. Pattern generation uses each synonym in place of the term. Thus, if writer is regarded as a synonym of author, the pattern "The writer was <<value1>>." is also generated. Pronouns are also represented by placeholders. Once a set of patterns has been generated, the matching process can begin. The message is examined one sentence at a time. A match is detected if each precise word in the pattern matches and there is at least one word to match each placeholder. Thus "The author is Jane Austen." matches "The author is <<value1>>.", as the first three words match exactly while there are two words left over to fill the placeholder. "The author is.", on the other hand, would not match. When a match is found, the pattern is traced back to its template and the data found is remembered – in this case, <property1> = "author" and <<value1>> = "Jane Austen". The type of the template is then used to create the update, by filling in values from the context as well as using the values found by matching. Finally, the context is updated with any terms used in the sentence.
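The matching rule just described can be sketched with regular expressions. This is our own illustration, not the prototype's implementation: literal words in the pattern must match exactly, and each data placeholder must absorb at least one word.

```python
# A minimal regex-based sketch (not the authors' implementation) of the
# matching rule described above: every literal word in the pattern must
# match and each <<value>> placeholder must absorb at least one word.
import re

def match(pattern: str, sentence: str):
    """Return placeholder bindings if the sentence fits the pattern."""
    regex = ""
    for part in re.split(r"(<<\w+>>)", pattern):
        ph = re.fullmatch(r"<<(\w+)>>", part)
        if ph:
            # A data placeholder must absorb at least one word.
            regex += r"(?P<%s>\S+(?:\s+\S+)*)" % ph.group(1)
        else:
            regex += re.escape(part)   # literal words must match exactly
    m = re.fullmatch(regex, sentence, flags=re.IGNORECASE)
    return m.groupdict() if m else None

print(match("The author is <<value1>>.", "The author is Jane Austen."))
# {'value1': 'Jane Austen'}
print(match("The author is <<value1>>.", "The author is."))
# None
```

The bindings returned here are exactly what the template type needs, together with the context, to fill in an update command.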
5. The Linguistic Structures
In designing the linguistic structures which we use, the important notion was to keep in mind the task in hand and not be distracted by alternative and more powerful representations of natural languages. The goal can be summarised as turning sentences in the message into updates in the query language. As has already been discussed, the approach taken is to use pattern matching, and so the process consists of finding a matching pattern for the sentence, extracting any data and using this to fill in an update template appropriate for this pattern. For instance, "The author is Jane Austen." matches the pattern "The author is <value1>." and extracts the update "update Book set author = 'Jane Austen' where ISBN = <Dkey>". The patterns used for the extraction process are specific to the application, or more accurately the schema. However, patterns from different applications conform to application non-specific templates – in this case, "The <property1> is <<value1>>.". Furthermore, different templates can result in the same update. For instance, the previous template has the same effect as "The <property1> was <<value1>>.". Therefore, templates are grouped into template types, which identify the expected update. In this section, we describe these structures in more detail and then continue to discuss the use of context and synonyms.
5.1 Patterns and Templates
The fundamental linguistic structure employed by the extraction process is the template. A template describes the structure of a sentence from which the process can extract data and consists of a series of constant terms and placeholders. The template structure is a text string made up of terms including two kinds of placeholder - one for the metadata (surrounded by single angle brackets) and one for the data (double angle brackets)5. Each placeholder is identified by a placeholder kind and an index number, which is required as there may be more than one of each kind in a sentence. For instance, <property3> is a placeholder for the third property name found in the sentence. The number is used to connect two related terms in the sentence - <<value3>> would indicate a value for <property3>. The placeholder kinds are:
<EntityType> - a placeholder which will be replaced by an entity type name in the pattern creation process;
<property> - a placeholder which will be replaced by a property name in the pattern creation process;
<<Hkey>> - a placeholder which will be replaced by a placeholder for a specific Hkey property during the pattern creation process – the generated placeholder will be replaced by a data value expected to be found as a value of that property during the information extraction process;
<<value>> - a placeholder which will be replaced by a placeholder for a specific property value during the pattern creation process – the generated placeholder will be matched with a text string expected to be a value for that property during the information extraction process;
<pp> - a placeholder which will be replaced by a possessive pronoun ("his", "her", "its", "their") in the pattern creation process;
5 Thus moving from the general template to the more specific pattern and then to the exact sentence has the appearance of removing one set of brackets.
<sp> - a placeholder which will be replaced by a subject pronoun ("he", "she", "it", "they") in the pattern creation process;
<op> - a placeholder which will be replaced by an object pronoun ("him", "her", "it", "them") in the pattern creation process;
<opp> - a placeholder which will be replaced by an object possessive pronoun ("his", "hers", "its", "theirs") in the pattern creation process.
Here are two of the simplest examples:
"The of this is " and “The of is ” From these the pattern generator will create, among other patterns:
"The date of this book is " and “The date of is ” from which the information extraction process can recognise:
"The date of this book is 1816.” and “The date of Emma is 1816.” After the IE process, the component will now have a value for a particular property, in this case the date property. It can discover which entity the property is for, either by context in the first instance or by using the Hkey in the second instance – Section 5.6 has the details on this.
5.2 The Generation of Patterns
The first version of the IE component can only deal generically with very simple structures – essentially those in which the main verb, either explicitly or implicitly, is some version of “to be”. All such sentences make very simple statements relating data to metadata – “the value of this property is that data”. Sentences of this form are comparatively easy to deal with since they only require the straightforward replacement of placeholders with the available metadata. The steps for generating patterns are as follows: each entity type and each property of the entity type are taken in turn; all instances of <EntityTypen> and <propertyn> are replaced by the appropriate names, and all instances of <<Hkeyn>> and <<valuen>> are replaced by single-bracketed data placeholders. Should there be more than one Hkey property, then one set of patterns for each would be generated. Thus for the schema in Figure 2, given the template “The <property1> of <<Hkey1>> is <<value1>>”, the generator will create 24 patterns (12 for Book – 6 properties and 2 Hkeys; 7 for Author, since multimedia properties are excluded; and 5 for Publisher). One major benefit of using pattern matching is that it is possible to create distinguishable patterns which are complex. The concluding section describes in more generality what we are planning to achieve next, but in the first version we have included structures such as:
"<<value1>> was the <property1> of <<Hkey1>> and <<Hkey2>>." and
"<<value1>> was the <property1> and <<value2>> was the <property2>." From these templates we get:
"<value1> is the author of Emma and Persuasion." and
"<value1> was the author and <value2> was the publisher." The main problem with compound statements is that they exacerbate the problem of ambiguity. Should we choose to permit the syntactically unsound, but common, form “X was the Y of A and B and C and D and...”, then we would have a lot of trouble with “Jane Austen was the author of Emma and Pride and Prejudice”, for instance. We will return to ambiguity in the concluding section.
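The substitution step of pattern generation can be sketched as follows. This is an illustration under our own assumptions – the class and method names (PatternGenerator, expand) are invented, not part of the prototype – but it follows the bracket conventions of Section 5.1: metadata placeholders are replaced by names, and the data placeholder loses one set of brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pattern generation by placeholder substitution.
// Names (PatternGenerator, expand) are illustrative, not the prototype's.
public class PatternGenerator {

    // One pattern per property of the given entity type. Metadata
    // placeholders become concrete names (matching is case insensitive
    // elsewhere, so "book" is fine for "Book"); the template-level
    // <<value1>> is narrowed to the pattern-level <value1>.
    public static List<String> expand(String template, String entityType,
                                      List<String> properties) {
        List<String> patterns = new ArrayList<>();
        for (String prop : properties) {
            patterns.add(template
                    .replace("<EntityType1>", entityType)
                    .replace("<property1>", prop)
                    .replace("<<value1>>", "<value1>"));
        }
        return patterns;
    }
}
```

For example, expanding "The <property1> of this <EntityType1> is <<value1>>" over the properties of Book yields "The date of this book is <value1>", "The author of this book is <value1>", and so on, one pattern per property.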
5.3 Pattern Matching
The matching process takes the message one sentence at a time and compares each sentence with each pattern. One problematic issue concerns the identification of sentences. In the prototype, sentences are terminated by a full stop, but this is unlikely to be of sufficient generality, a point we will return to in the conclusions. Each term in the pattern is either a constant, a metadata term or a placeholder. The first two must match exactly (or match synonyms – see Section 5.8), while a placeholder freely matches one or more words in the sentence. The matching algorithm is as follows:
1. There is no match if the number of words in the sentence is less than the number of terms in the pattern.
2. Start with the first term in the pattern. If it is not a placeholder, check equality with the next word in the sentence. If it checks, record a match; if not, the pattern as a whole does not match.
3. If the term is a placeholder, remember the position of the next word in the sentence. Keep looking along the sentence to find a word which matches the next term in the pattern (which must be a non-placeholder). Record the position of the last non-matching word in the sentence. The placeholder has now been matched with the sentence words from the first one remembered to the last one. If no match is found, the pattern does not match the sentence. If the end of the sentence is reached, then this might still be a match as long as there are no more terms in the pattern.
4. Continue until there has been a match of all terms in the pattern or a match failure has been encountered.
Figure 4 shows an example of the matching of "The author is <value1> and the date is <value2>" with "The author is Jane Austen and the date is 1815.".
Pattern Term | Type | Sentence Word | Outcome
1. The | Constant | 1. The | match
2. author | Property | 2. author | match
3. is | Constant | 3. is | match
4. <value1> | Value placeholder | 4. Jane | remember sentence position 4
4. <value1> | Value placeholder | 5. Austen | continue
4. <value1> | Value placeholder | 6. and | matches next term, so this ends the words which might match value1, i.e. words 4 and 5 "Jane Austen" – don't move on in the sentence
5. and | Constant | 6. and | match
6. the | Constant | 7. the | match
7. date | Property | 8. date | match
8. is | Constant | 9. is | match
9. <value2> | Value placeholder | 10. 1815 | remember position 10
9. <value2> | Value placeholder | end of sentence | no more terms either, so this is also a match; value2 = word 10 = 1815
Figure 4. An Example of the Pattern Matching Process One last point to be made is that this technique places considerable pressure on the effective ordering of the templates (and therefore of the patterns generated). For instance, placing "The author is <value1>." before this example will match the sentence with value1, i.e. the author name, matching with "Jane Austen and the date is 1815". The ordering of templates will be returned to in the conclusions.
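The matching algorithm above can be sketched roughly as follows. This is a simplified illustration under our own assumptions (the class name Matcher, whitespace tokenisation and single-bracket value placeholders are ours); the prototype's actual matcher lives in its Pattern and PatternItem classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the left-to-right matcher: constant terms must
// match exactly (case-insensitively), while a placeholder such as <value1>
// greedily absorbs words until the next constant term is seen.
public class Matcher {

    // Returns placeholder bindings, or null if the pattern does not match.
    public static Map<String, String> match(String pattern, String sentence) {
        String[] terms = pattern.split("\\s+");
        String[] words = sentence.replace(".", "").split("\\s+");
        Map<String, String> bindings = new LinkedHashMap<>();
        int w = 0;
        for (int t = 0; t < terms.length; t++) {
            String term = terms[t];
            if (term.startsWith("<")) {                      // placeholder term
                String next = (t + 1 < terms.length) ? terms[t + 1] : null;
                int start = w;
                while (w < words.length
                        && (next == null || !words[w].equalsIgnoreCase(next))) {
                    w++;                                     // absorb words
                }
                if (w == start) return null;                 // matched no words
                bindings.put(term,
                        String.join(" ", Arrays.copyOfRange(words, start, w)));
            } else {                                         // constant term
                if (w >= words.length || !words[w].equalsIgnoreCase(term)) {
                    return null;
                }
                w++;
            }
        }
        return (w == words.length) ? bindings : null;        // sentence consumed
    }
}
```

The sketch also illustrates the ordering pressure discussed above: matched on its own against the longer sentence, the pattern "The author is <value1>" succeeds too, binding everything after "is" to value1.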
5.4 Updates and Template Types
In order to extract the data found when a pattern is matched, it is necessary to tie the pattern to an update. This is achieved by grouping the templates by the updates they should produce. This grouping is called a template type. A template type is essentially a set of parameterised updates, in which the parameters are the placeholders of the templates. Here is a simple example:
Template: "The <property1> is <<value1>>"
Template Type: parameters: <property1>, <<value1>>
number of updates: 1
update1: update Eco set <property1> = <<value1>>
This means that the effect of matching a pattern drawn from this template is to update the current entity (see the next section on Context for what this refers to) by changing the property to the value. Thus, if the current entity is a book with ISBN 0123456:
Pattern: "The date is <value1>"
Sentence: "The date is 1815"
Update: update Book 0123456 set date = 1815
Here is another example:
Template: "The <property1> of this <EntityType1> is <<value1>>"
Template Type: parameters: <EntityType1>, <property1>, <<value1>>
number of updates: 1
update1: update EntityType1 set <property1> = <<value1>>
This means that the entity to be updated is the last mentioned entity of the type matched; thus, if the last mentioned book has ISBN 0123456:
Pattern: "The date of this book is <value1>"
Sentence: "The date of this book is 1815"
Update: update Book 0123456 set date = 1815
Recall that updates are of the form:
update <EntityType> set <property> = <<value>> where <DkeyProperty> = <Dkey>
which requires the template type to specify which entity is to be updated, the property to be updated and the new value. From the entity to be updated, we can determine the type to be updated and the Dkey property and value. We will discuss the three required aspects one at a time. The entity parameter can take one of a number of values, as follows:
Eco – Use the most recently mentioned entity.
EHkey – Use the entity identified from an Hkey in the sentence.
EntityType – Use the most recently mentioned entity of a type picked up from a match with an <EntityType> placeholder.
Epp – Use the entity identified through a possessive pronoun. For instance, the "Its" in "Its title is Emma" refers to the most recently mentioned neuter entity.
The property to be updated is always identified through the matching of a <property> placeholder. The new value is usually identified through the matching of a <<value>> placeholder. There are two cases to consider, depending on the nature of the property type. For a base type property, the matched value can be used directly. If it is an entity type property, the matched value will be an Hkey, but the update requires the Dkey, so a translation is required. For example, given the sentence:
The author is Jane Austen and the date is 1815. The updates generated will be:
update Book 0123456 set author = 203
update Book 0123456 set date = 1815
in which 203 is the Dkey of an author entity with Hkey 'Jane Austen'. This is found by a function in the code which turns Hkeys into Dkeys, a potentially non-deterministic process which may require either user intervention or a much more sophisticated model of context which might indicate which of two keys is more likely. There are two other possibilities which can determine the updated value:
Esp – The Dkey of an entity identified through a subject pronoun. For instance, the "She" in "She was the author of Emma" indicates that the most recently used Author entity is to be the value of the property. The Dkey is found and the property value is changed to that.
null – This is a special case for negative statements. Clearly negative information is significantly more complex and problematic than positive information and will be returned to in the conclusions, but here we are interested in sentences such as "She was not the author of Emma", from which we can only infer the update of the property with a null value. Thus the template type for such patterns includes "set <property1> = null".
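The Hkey-to-Dkey translation mentioned above is where non-determinism enters. A minimal sketch, with an invented KeyTranslator class and the repository faked as (Dkey, Hkey) pairs, returns all candidate Dkeys so that a caller can detect the ambiguous case:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of turning Hkeys into Dkeys. The class name and the
// flat (Dkey, Hkey) representation are ours; the prototype uses its
// Database class for this lookup.
public class KeyTranslator {

    // All Dkeys whose entity carries the given Hkey (case insensitively).
    // Zero hits mean a new entity is needed; more than one hit is the
    // non-deterministic case requiring user intervention or richer context.
    public static List<String> dkeysFor(List<Map.Entry<String, String>> rows,
                                        String hkey) {
        return rows.stream()
                   .filter(r -> r.getValue().equalsIgnoreCase(hkey))
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toList());
    }
}
```

A caller would use the single-element result directly, create a new entity on an empty result, and fall back to the user or the context model when several Dkeys come back.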
5.5 Creating New Entities
As well as changing property values, we can expect the message to indicate the existence of entities of which we had no previous knowledge. This can arise in two ways: either explicitly, through a statement which is intended to convey the existence of a new entity, or implicitly, by adding information which implies the existence. "Persuasion is another book." is an example of an explicit statement, while "The author of Persuasion was Jane Austen." explicitly extends the number of books known to be written by Jane Austen but, if Persuasion is not already in the repository, implies the existence of a new book. In terms of our model, the two arise from different constructs. An explicit reference arises as an Hkey, while an implicit reference arises as a property value. Here are the two examples in full. Sentence:
Persuasion is another book.
Pattern: <title1> is another book
Template: <<Hkey1>> is another <EntityType1>
Template Type: no property update implied
The matching process will find EntityType1 = "Book", hkeyProperty1 = "title" and Hkey1 = "Persuasion", and so has enough data to create a new entity, as described below. It will, of course, first check in the repository that the book is not already there. Sentence:
The author of Persuasion was Jane Smith.
Pattern: The author of <title1> was <value1>
Template: The <property1> of <<Hkey1>> was <<value1>>
Template Type: update EHkey set <property1> = <<value1>>
The matching process will find Hkey1 = "Persuasion", EHkey1 = "Book", property1 = "author" and value1 = "Jane Smith", but if, when it looks in the repository for Jane Smith to get a Dkey value, it finds nothing, it must make a new entity of type Author with an Hkey of "Jane Smith". When creating a new entity, we have very limited information – the type of the entity and an Hkey value – but this is enough. We create an entity with only one or two columns filled in, leaving the rest as null values. We fill in only one column in the case that the Hkey is also the Dkey, and two columns if they are different. This seems reasonable, since we are only filling in the information we have. If we encounter a sentence which offers a third piece of information in the form of a property value, for example "Its date was 1820.", we can then create an update to set its value. In the case of a separate Dkey and Hkey, we have to generate a Dkey automatically, in a similar manner to the creation of entity identifiers in object oriented databases. If the Dkey is not an Hkey then it will never need to be human readable. Inserts are created with updates of the form:
insert into <EntityType> (<properties>) values (<values>)
In the first case, this becomes:
insert into <EntityType> (<HkeyProperty>) values (<Hkey>)
In the second case, it becomes:
insert into <EntityType> (<DkeyProperty>,<HkeyProperty>) values (<Dkey>,<Hkey>)
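The two insert forms can be sketched as a single hypothetical helper; InsertBuilder and its parameter names are our inventions, not the prototype's code.

```java
// Hypothetical sketch of building the insert statements shown above,
// covering both cases: the Hkey doubling as the Dkey, or a separate
// generated Dkey.
public class InsertBuilder {

    public static String forNewEntity(String entityType,
                                      String hkeyProp, String hkey,
                                      String dkeyProp, String dkey) {
        if (dkeyProp.equals(hkeyProp)) {
            // First case: the Hkey is also the Dkey, so one column suffices.
            return "insert into " + entityType
                 + " (" + hkeyProp + ") values (" + hkey + ")";
        }
        // Second case: a generated Dkey plus the Hkey found in the message.
        return "insert into " + entityType
             + " (" + dkeyProp + "," + hkeyProp + ") values ("
             + dkey + "," + hkey + ")";
    }
}
```

For the running example, a new Author with a generated Dkey of 204 would produce "insert into Author (id,name) values (204,'Jane Smith')", assuming id and name as the key property names.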
5.6 Context
One complicating factor is that each sentence is rarely complete in itself. Most sentences will have contextual references embedded in them – for example: “It was written by Jane Austen.”, “The book was written by Jane Austen.” or “The author was Jane Austen.” all refer to a particular book, but which one? In these cases, our component must discover which book is
being referred to before the sentence can be processed. To this end, we have constructed a context framework to which the component has continuous access. The context must deal with the following three cases:
Pronouns. Sentences such as “It was written by Jane Austen.” require the referent of a pronoun to be identified. Understanding the sentence depends upon the reader having in mind a recently mentioned neuter entity. Similarly, encountering masculine and feminine pronouns such as “his” or “he”, “her” or “she” indicates the presence of a particular masculine or feminine entity.
Definites. Sentences such as “The book was written by Jane Austen.” include definite references to entity type names ("The book") and require a particular entity of that type to be identified as the referent. Understanding this sentence depends upon the reader having in mind a recently mentioned Book entity.
Implicit Context. Sentences such as “The author was Jane Austen.” have no explicit reference to an entity at all and yet the reader understands the sentence, because it follows either a sentence such as “My favourite book is Emma.” or a question such as "Who wrote Persuasion?" – i.e., a sentence which discusses a particular entity. This most recently discussed entity is the starting point for the understanding. Alternatively, the reader would identify the use of the word “author” as establishing the context of the sentence as a Book entity and would use the most recently mentioned Book, as in the previous case.
To guide the information extraction process, therefore, the component maintains an object holding contextual information, which contains:
- one variable which holds a reference to the most recently mentioned entity of any type;
- three variables referring to the most recently mentioned masculine, feminine and neuter entities;
- a variable referring to the most recently mentioned entity which should have a gender, but whose gender is unknown;
- a variable for each entity type, referring to the most recently mentioned entity of that type;
- a variable referring to the most recently mentioned entity type – this is vital for start-up, as discussed next.
The system handles four kinds of pronoun, which each play a different role in the extraction process:
Possessive pronouns ("his", "her", "its", "their") stand for entities in sentences. For instance, the "Her" in "Her name was Jane Smith." identifies the author entity whose name property is to be set to "Jane Smith".
Subject pronouns ("he", "she", "it", "they") stand for values in sentences. For instance, the "He" in "He was the author." identifies a value of the author property. When a value is identified through a pronoun, the value will always be an entity and not a simple value.
Object pronouns ("him", "her", "it", "them") also stand for values in sentences. For instance, the "her" in "The author of Persuasion was her." also identifies a value for the author property.
Object possessive pronouns ("his", "hers", "its", "theirs") stand for entities in sentences. For instance, the "hers" in "The date of both was also hers." identifies an author to whom the date of both is assigned.
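The context variables and pronoun resolution described above might be sketched as follows; the field and method names here are illustrative assumptions, not the prototype's actual code, and entities are stood in for by strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the context object: one slot per kind of
// "most recently mentioned" reference described in the text.
public class Context {
    String lastEntity;                      // most recent entity of any type
    String lastMasculine, lastFeminine, lastNeuter, lastUnknownGender;
    final Map<String, String> lastOfType = new HashMap<>();  // per entity type
    String lastEntityType;                  // vital for start-up

    // Record a mention, updating every variable the entity is relevant to.
    void mention(String entityType, String entity, String gender) {
        lastEntity = entity;
        lastEntityType = entityType;
        lastOfType.put(entityType, entity);
        switch (gender) {
            case "m" -> lastMasculine = entity;
            case "f" -> lastFeminine = entity;
            case "n" -> lastNeuter = entity;
            default  -> lastUnknownGender = entity;
        }
    }

    // Resolve a pronoun to the appropriate recently mentioned entity.
    String resolvePronoun(String pronoun) {
        return switch (pronoun.toLowerCase()) {
            case "he", "him", "his"          -> lastMasculine;
            case "she", "her", "hers"        -> lastFeminine;
            case "it", "its"                 -> lastNeuter;
            default                          -> lastEntity;
        };
    }
}
```

After mentioning the author Jane Austen and then the book Emma, "She" resolves to Jane Austen while "its" resolves to Emma, which is the behaviour the examples above rely on.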
5.7 Maintaining the Context
When the IE component is passed a message, this might be in one of a number of situations. In our collaborative web site example, the component is summoned from a particular web page – either a catalogue page specific to an entity type or an item page specific to a particular entity. In either case, it can build the context appropriately. Alternatively, it might be used in finding the answer to a question – again the question will provide context information. On the other hand, it might just start from nothing, in which case the early sentences of the message must set the context clearly. Thus the component, when called, expects not just the message but also information from which it can build a starting context. In the current version of the system, this can either be an entity or an entity type. The process of extracting data from the sentence is now more complex than just pattern matching and proceeds as follows:
1. Identify any entity which is explicitly specified in the message.
2. If there is a possessive pronoun, use one of the three gender variables to identify the entity.
3. If the entity is identified using only the type name, then use the variable associated with that entity type.
4. If the reference is implicit, use the most recently mentioned entity. If this is the wrong type of entity, as indicated by the properties involved, use the context variable for the entity type associated with the property names identified.
5. Identify explicitly specified values.
6. If there is a subject or object pronoun, use the gender variables to identify the value as the Dkey of the entity.
When the sentence has been dealt with, it is necessary to update the context in order to prepare the component for the next sentence. Thus any entity encountered will be used to update the various context variables as follows:
7. The last entity which is updated is set as the most recently mentioned entity.
8. That entity and the last entity used as the updated value are examined and used to update the gender variables and the variables specific to the types. Thus “The author of Emma was Jane Austen.” updates the most recent book, the most recent author, the most recently used neuter entity and the most recently used female entity (or possibly the most recently used gendered entity of unknown gender).
5.8 Synonyms
Natural language supports the use of multiple terms for the same concept. This is particularly true for a language such as English, which draws on multiple roots – Latin and Germanic. Therefore, any system which is intended to extract information from messages sent from uncontrolled sources must cope with synonyms. In our model, there are three ways in which synonyms may arise – as equivalent names for entity types, as equivalent names for properties and as equivalent terms for data values. Our system copes with the first two of these, as shown in Figure 5. Each property and entity type name may be assigned a set of synonyms, as described below. The effect of this is that a synonym in a sentence will have the same effect as the original term. Thus "The writer was Jane Austen." will be interpreted in the same way as "The author was Jane Austen."
Figure 5. The Synonyms Pane. Some of the synonyms of our sample are shown in Figure 5. The interface for managing synonyms allows synonyms to be entered in a number of ways:
i) by loading from a synonyms file;
ii) by direct insertion of new synonym terms;
iii) by stripping vowels from a word;
iv) by the use of WordNet.
Vowel stripping is provided in four ways. Firstly, there is an option either to take all of the synonyms and strip out the vowels, or to do this only for the original word (i.e. the entity type or property name). Secondly, in either case, the user can choose to strip all of the vowels, generating one synonym for each word it strips, or to strip every combination of vowels, thus creating many synonyms. In the figure above, all of the ways vowels can be stripped from author and title can be seen. WordNet supports a variety of relationships between words. Our prototype supports the addition of all words in the same synonym set, all direct hypernyms, all direct hyponyms and all meronyms. The figure shows the terms added to title by the use of WordNet. When all synonyms have been entered, those which seem inappropriate (such as championship) can be removed.
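The two vowel-stripping modes can be sketched as follows; VowelStripper is an invented name, and the real facility sits inside the prototype's synonym maintenance interface.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the two vowel-stripping modes described above:
// strip every vowel at once, or generate one synonym per combination of
// stripped vowels.
public class VowelStripper {

    static boolean isVowel(char c) {
        return "aeiou".indexOf(Character.toLowerCase(c)) >= 0;
    }

    // Mode 1: strip all vowels ("author" -> "thr").
    public static String stripAll(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) if (!isVowel(c)) sb.append(c);
        return sb.toString();
    }

    // Mode 2: every combination of stripped vowels, excluding the
    // original word itself ("title" -> "titl", "ttle", "ttl").
    public static Set<String> allCombinations(String word) {
        Set<String> out = new LinkedHashSet<>();
        combine(word, 0, new StringBuilder(), out);
        out.remove(word);
        return out;
    }

    private static void combine(String w, int i, StringBuilder acc, Set<String> out) {
        if (i == w.length()) { out.add(acc.toString()); return; }
        char c = w.charAt(i);
        if (isVowel(c)) combine(w, i + 1, acc, out);   // branch: drop the vowel
        acc.append(c);                                  // branch: keep the char
        combine(w, i + 1, acc, out);
        acc.deleteCharAt(acc.length() - 1);
    }
}
```

Mode 2 grows exponentially with the number of vowels, which is presumably why the interface also offers the single-synonym mode and lets the user delete unwanted results afterwards.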
6. An Information Extraction Prototype
This section describes a software application which has been written to test the model described in this report. The application, written in Java, takes four input files containing the message, the template types, the templates and the context. The data repository must also be loaded and, for the prototype, this consists of two tab separated files containing the schema and the data. The section describes the facilities first and then gives an example of the program in use.
6.1 Using the Prototype
The interface to the program is shown in Figure 6. As well as a menu bar, there are three panels – a feedback panel at the bottom of the screen and two tabbed panes above. Distributed between the two tabbed panes are nine panels, one for each of the main data sets – the message, the template types, the templates, the context, the schema, the data, the patterns, the updates and the synonyms. Figure 6 shows the message and updates panels in view. Each of the nine panes can be switched between the two tabbed panes, so that any pair can be viewed at any time6.
Figure 6. A Prototype Application for Information Extraction. To start the system, the following set up process must be followed:
1. The message must be loaded.
2. The template types must be loaded, followed by the set of templates.
3. The schema must be loaded, followed by the data.
4. The context must be loaded – this must happen after the schema, since its structure is dependent on the schema.
5. The synonyms must be created – this must happen after the schema, since the synonyms maintained are only for metadata terms as described in the schema.
6. The patterns must be generated – this must happen after the schema, templates and synonyms have been loaded, since the patterns are formed from these.
6 This idea is due to Iain Clarke.
Synonym maintenance is shown in Figure 7. The frame contains a panel on the left listing the current synonym set for one of the metadata terms – in this case the title of a book. The top right panel contains buttons which activate the various synonym maintenance facilities, while the lower right panel gives feedback on synonym maintenance. The activation panel contains, on the top line, a combo box for choosing which metadata term should be treated, a button for loading synonyms from a file, a button for deleting synonyms selected in the left-hand panel and a combo box for choosing between the four vowel stripping operations. The lower line has a combo box for choosing which of the WordNet relationships should be used to generate synonyms, together with the quit button.
Figure 7. A Synonym Maintenance Frame. The system is now ready to extract information. A menu item starts the extraction process. The program analyses one sentence at a time and, if it finds a match, it generates the updates and executes them. After each sentence, it updates the context and data displays.
6.2 An Example
An example message and its output are shown in Figure 6. This section shows how one arises from the other – a fuller version appears in an appendix. Figure 8 shows the schema and database against which the message is being checked. This is a version of the database shown in Figure 2, stripped down for brevity. The figure also shows the initial context, which is based on a single entity, the book "Emma". Figure 9 shows the templates we have provided and their template types. The combination of the schema and the templates results in the set of patterns also shown in Figure 9. When the program encounters "The author is Jane Austen." in the context of Emma, the matched pattern extracts "Jane Austen" as a value for the author property. As this property takes entity values, the value found is interpreted as an Hkey and this is transformed into the Dkey for the author entity, which is assigned to the property. When the program encounters "Its date is 1815.", the matched pattern extracts the fact that a neuter entity needs its date property set to 1815. As the most recent neuter entity is the book "Emma", this is the entity changed. The program next encounters "She was also the writer of Pride and Prejudice." and now the matched pattern extracts the fact that the author (synonym "writer") of some entity called "Pride and Prejudice" is a feminine entity. "Pride and Prejudice" is found as a book title and the author field of this book entity is set to the Dkey of the most recent feminine entity, Jane Austen. The next sentence is "Her dob was 1780" and now we have a value for the dob property of the most recent feminine entity. "year : 1820" may not be good English but can be captured by the template "<property1> : <<value1>>". This tries first to set a year property of the most recently mentioned entity. As this is the author Jane Austen, and an author entity does not have a year property, the program backtracks and finds an entity type which does have such a property.
It finds that Book does and so sets the property for the most recently mentioned book, "Pride and Prejudice".
"She was the author of Persuasion." raises a new problem – that the book Persuasion cannot be found. It is therefore necessary to create a new book entity, with a generated ISBN and the title Persuasion. This new entity has its author field set to the Dkey of Jane Austen, since this is the most recently mentioned feminine entity. The new entity also becomes the most recent entity, Book and neuter entity. Then comes "The date is 1825, while the title is PPPP." which assigns a data to Persuasion and then retitles it. This requires no change to the context. The entity changes title again with "The ttl of the volume is The Big Q.". The property name has a vowel stripped synonym and volume is a synonym for Book derived from WordNet and so the most recent book is updated. "The author of The Big Q is Des Dillon." searches for a book called "The Big Q". Although this happens to be the most recent entity, it is actually retrieved from the database afresh. The author value is a new author and so a new Author entity is created holding a generated ID and the author name. Finally, "The author of Emma is not Jane Austen." provides some negative information. All the program can do with this is cancel the property value by assigning a null value. In the above everything happens automatically and the resulting database and context can be seen in the Appendix. To be useful, the updates should be presented for review and moderation before being executed.
7. Implementation Details
This section describes how the software is implemented in terms of the classes used.
7.1 The Data Model
The data model is implemented as a set of classes for holding each of the main constructs of the model. These are:
Schema – This class holds the schema name and the set of entity types. It has a constructor which reads the schema from a file, various get methods and a method to display the schema in a text pane.
EntityType – This class holds the entity type name and the set of properties. It also holds the set of entities of that type, identifies which property is the Dkey, and stores a set of synonyms for the entity type name. It has methods to add properties and to manipulate them in a variety of ways, methods for managing Dkeys, Hkeys and gender, and a method for creating a new Dkey.
Attribute – This class models properties and holds the property name and type, together with booleans recording whether the property is a Dkey, an Hkey, a primitive type, a gender property or multi-valued. It also stores a set of synonyms of the property name. There are a variety of get and set methods and a test for whether a string is a synonym.
Database – This class holds a name, a schema reference and the set of entities that comprise the database. The constructor reads the data from a tab separated file. It has a method to add an entity and another to display the database in a text pane. It also has methods to find an entity of a particular type by Hkey and by Dkey and, importantly, a method which returns the Dkey of an entity of a particular type, given an Hkey.
Entity – This class models one entity in the database and holds the array of property values which make up the entity. Entity type properties are held as foreign keys. There are methods for getting and setting values (important for the update command), and for getting the gender, Dkey and Hkeys of the entity.
7.2 Templates and Patterns
The classes which model the template and pattern system typically comprise one class each for the set of entities, for a single entity and for an item within the entity. Thus there is a connection between the items of a template, the items in a pattern and the items in a matching (see the next section for the last of these).
TemplateTypeSet – This class contains the set of template types. It has methods to read and display the template types.

TemplateType – This class models one template type. As well as methods for creating the template type from a stream tokenizer, displaying it and retrieving parameters, the principal method in the class is the one which generates updates from a match, as described in Section 5.2.

TemplateTypeParam – This class models one parameter of a template type. It has a constructor and get methods to retrieve the various aspects of a parameter.

TemplateSet – This class contains the set of templates. It has methods to read and display the templates and to generate the patterns.
Information Extraction
16
01/02/2006
Template – This class models one template. It has a constructor to read the template from a stream tokenizer, display methods and methods to check whether it has pronouns of each type. However, the main methods are the ones which generate patterns, as described in Section 5.

TemplateItem – This class models a term in a template. It has a constructor and various get methods, plus a method to generate a pattern item.

PatternSet – This class contains the set of patterns. It has methods to add and retrieve a pattern and to display the patterns. It also has a method which tries to extract data from one sentence by matching it against each pattern in turn.

Pattern – This class models one pattern. The constructor creates an empty pattern of the same size as its template. A set method adds the items and there are a variety of get methods. There is a display method, but most important is the match method, which takes a sentence and attempts to match it against the pattern. If it succeeds, it uses its template type to call the method which generates updates.

PatternItem – This class models a term in a pattern. The class has a number of constructors, since each type has different information associated with it. There is a display method, various get methods and a method which tries to match a string.
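As a rough illustration of the match method, the sketch below checks the structural terms of a pattern literally and captures the remaining words as the value. The names are assumptions and the sketch is simplified to a single trailing value slot, which is enough for patterns such as "The author is <value>":

```java
// Simplified sketch of pattern matching: the fixed structural terms must
// open the sentence exactly; whatever follows is captured as the value.
// Returns null when the sentence does not match the pattern.
public class PatternMatchSketch {
    public static String match(String[] structuralTerms, String sentence) {
        // Strip a trailing full stop before matching.
        String s = sentence.endsWith(".")
                 ? sentence.substring(0, sentence.length() - 1)
                 : sentence;
        String prefix = String.join(" ", structuralTerms);
        if (!s.startsWith(prefix + " ")) {
            return null;                         // structural terms absent
        }
        return s.substring(prefix.length() + 1); // the captured value
    }
}
```

A failed match simply causes PatternSet to try the next pattern in order, which is why the ordering of patterns matters, as discussed in the conclusions.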
7.3 Messages, Context, Matching and Results

Message – This class models a message from which information is to be extracted. It has methods to read and display a message. It also has the method which starts off the extraction process, passing each sentence in turn to the extraction method in PatternSet.

Context – This class models the context as described in Section 5.6. It has a constructor which builds the context from a file, using the schema to guide it. It has methods to display the context and to get and set each of the entities stored.

MatchSet – This class models the matching of one pattern with one sentence and has the same number of Match objects as the pattern has items. It has a constructor which creates an empty set of matches and a set method to populate it. As well as various get methods, it has a number of methods which return one piece of information from the match, such as the ith word matched, the gender of a pronoun or the entity associated with an Hkey.

Match – This class models the match of one pattern item with part of a sentence. It stores the type, number and gender of the pattern item, and the text. It has the usual get and set methods, and a display method.

UpdateSet – This class models a set of updates. It has a constructor to create an empty set and methods to add one or more updates. Otherwise, it just has get and display methods.

Update – This class models one update to the data repository. It stores the repository and context, and the pattern and template type from which it was created. It also stores the entity to be updated, the property and the new value to be set, as well as the Dkey and Hkey of the entity. It has a constructor to make an update from its constituents and display methods. It also has methods to report the effect of its changes and to change the data.

GUI – This class builds the GUI. It has the methods which create all of the GUI components and the action event listener method which distributes the user requests.

Main – This class starts the application off. It has the main method and calls the GUI class.
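The method by which an Update reports the effect of its changes could render the SQL-style statements used throughout the examples along these lines. This is a hedged sketch with assumed field names, not the actual class:

```java
// Sketch of an Update object rendering itself as the update statement
// shown in the worked examples. Field names are illustrative assumptions.
public class UpdateSketch {
    private final String entityType, property, value, dkeyName, dkey;

    public UpdateSketch(String entityType, String property, String value,
                        String dkeyName, String dkey) {
        this.entityType = entityType;
        this.property = property;
        this.value = value;
        this.dkeyName = dkeyName;
        this.dkey = dkey;
    }

    // Report the effect of the change as a textual update statement.
    public String toStatement() {
        return "update " + entityType + " set " + property + " = " + value
             + " where " + dkeyName + " = " + dkey;
    }
}
```

Presenting this string to the moderator before executing the change supports the review step recommended in Section 6.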
8. Conclusions
We have demonstrated a pattern matching approach to Information Extraction in which the patterns are automatically generated from the metadata to accord with whichever linguistic structures we choose to recognise. The pattern matching approach is unusual in the Information Extraction world, most of whose practitioners demand a syntactic analysis first and then use the linguistic structure to reveal the meaning. This is not surprising, since this is precisely what we as humans do to extract information from an utterance or a text. However, it seems reasonable in any enterprise to avoid, where possible, unnecessary steps in the process. In our case, we wish to avoid the significant amount of work required to discover the syntactic structure of even a simple message – we just want to find the data that is there. We can reasonably hope to be successful because we are working in an extremely restricted universe of discourse. We are only interested in sentences which tell us something that we can turn into the addition of data to our repository. If, in the book example, we encounter the sentence "A/C Acoustics recorded the album 'Understanding Music'.", we might think that a fine sentence, but we are happy for our component to make no sense of it and to ignore it. This does not mean that we do not value the mainstream of IE research as typified by the MUC conferences
mentioned above, since these deal with more general extraction tasks with which our approach could not possibly cope. It is just that we would like to take a parsimonious approach to our particular task.

Even so, our first version is limited in the range of sentence structures it can handle. We could compensate for this by instructing contributors on which sentence structures they may use, but this is unsatisfactory since it is little better than providing a form, and so our plan is to rectify the major deficiencies in subsequent versions. Essentially, we have started from a simple (and simplistic) account of language and intend to add complexity as the need arises. Here are some of the issues we need to tackle.

The first of these is the limited nomenclature provided by the metadata. We have so far included some useful, but limited, means of handling synonyms, as described in Section 5.8. We would like to include some form of spelling correction, since spelling will certainly be suspect. We would also like to add to vowel stripping some more of the standard techniques used in shortening text messages, such as the use of digits for syllables as in "dict8" or "4tune". The automatic translation of noun phrases would also be useful, so that the program could handle the common habit of authors using a foreign language term for a concept because it is the only one they know. Incorporating EuroWordNet [23] may be the way forward here. Even this would be far from enough, since the examples shown in Section 4 depart from the metadata not just in the words used, but also in their part of speech. Thus "Jane Austen wrote Emma." and "Emma was written by Jane Austen." should be acceptable alternatives to "Jane Austen is the author of Emma.". To achieve this, the metadata term author must be transformed not just into another noun such as writer, but into one or more tenses of the verb to write.
A thesaurus such as WordNet provides sufficient structure to achieve this and we plan to use it. At worst, we could include a mechanism for adding hand crafted patterns to the automatically generated set.

Another issue surrounding verbs concerns what we might call generic verbs, i.e. verbs which are neither the verb to be nor schema specific. Two examples are include and call, as illustrated by "Her novels also include Persuasion." and "There is one called Persuasion.". Inclusion of generic verbs seems only to be a matter of extending the template set by representing these verbs as structural terms in the template – e.g. "There is one called <value>.".

Plurality is handled in the data model and plural pronouns are recognised, but the extraction process is, as yet, incapable of making good use of messages including plurals. "She wrote both of them.", for instance, cannot be managed, partly because of the impoverished nature of the context, but also because template types cannot yet deal with plurals.

We also make no use of temporal and uncertain assertions in the message. Our component can only handle different tenses if they are all intended to have the same effect. Thus "The author is Jane Austen." and "The author was Jane Austen." are considered equivalent. That works in this case, but if the data is time varying, sentences such as "The Prime Minister was Clement Attlee." and "The Prime Minister is Tony Blair." convey very different information. To handle time, we need several additions to the capabilities of the program. Firstly, the context needs to be extended with a temporal aspect. Secondly, the data model needs to be able to handle the various aspects of time which relate to data storage [24]. Finally, there will need to be some way of putting temporal information into the template types, so that the extraction process can determine the temporal information it should be seeking. Uncertain information is even more problematic and has two aspects.
Firstly, there is correspondent reliability. Inevitably, messages holding definitely expressed information will include errors, and clashes will occur when two values are given for the same piece of information. Perhaps a record should be kept of the accuracy of statements sent by correspondents, and information from suspect correspondents could be held in a database of information of questionable validity. Alternatively, each piece of information could be decorated with a degree of uncertainty. Secondly, there are statements which explicitly express uncertainty, such as "Jane Austen may have written Persuasion.", and these can be handled in the same way. This will require extending the data model and the structure of the template types to alert the extraction process to the possibility of uncertainty.

Negative information is currently dealt with by nullifying property values. Returning the value to a previous version might be an improvement, but this requires storing all past states of the database.

Extending the context structure has already been shown to be necessary to capture plurality and temporal information. It also seems necessary to keep a historical record of the discourse in order to deal with references that go further back than the most recently mentioned entities. Once the historical record is available, however, with more than one entity of a given type being in context, ambiguity arises, as it does in so many ways. Ambiguity is not dealt with in our prototype. It arises in the following cases, among others:

i) An Hkey may match multiple entities. For instance, there are probably several books called "Persuasion".

ii) A sentence may match multiple patterns. The sentence "The author is Jane Austen and the date is 1815." has already been mentioned in this context in Section 5.3.

iii) Data values may be confused with structural words such as "and", as in "Jane Austen wrote Emma and Persuasion." and "Jane Austen wrote Pride and Prejudice.".
iv) More than one entity of a particular type has recently been referred to in the message.

We make no pretence that we can solve all of these problems, since natural language text abounds in ambiguity and there will always be a need to consult the moderator for a resolution at times. However, a number of approaches are possible. Firstly, the order in which the program tries alternatives is vital. Entities could be ranked by their degree of relevance to recent context: we can be fairly sure what "Persuasion" refers to in a message which has previously mentioned Jane Austen. The order of pattern templates is also vital, as has been discussed, although we have not yet proved to our satisfaction that one template order will resolve all ambiguity. Ultimately, ambiguity cannot be eliminated altogether and so we plan a module which displays the possible alternatives and asks the moderator to choose between them.

We can already deal with some sentence structures with subsidiary clauses, such as "The book has an author whose name is Jane Austen.", by adding more templates. However, more complex sentences in which two or more new facts need to be added to the repository, such as "The author was Jane Austen and she was born in 1775.", are beyond our capability at present, since this requires updates to two entity types. To achieve this we need to extend the pattern template structure to handle clausal structure so that we can systematically generate more complex sentences.

Although we specifically ruled out an interest in evolving the information structure managed by the component automatically, we would like to assist the moderator in the evolution of the schema of the repository as information comes in. To this end, we might permit the component to try to match the pattern templates following a failure to find the sentence in the patterns. Thus it could look directly for "The <property> of this <entity> is <value>" without filling in the placeholders.
This would then match "The subject of the book is Databases.", following which the component could suggest to the moderator the addition of a new property for books to hold the subject matter. This could then be added at the next revision of the web site. Learning to cope with new structures was one of the main issues tackled by the participants in the Message Understanding Conferences. Our system needs to be told the domain terms and the sentence structures, and we are currently examining techniques for learning new sentence structures. The simplest approach is to use sentences about known data. Such sentences will include known metadata and known data and, from this, it should be straightforward to generate new sentence templates.

We have reported this work abstractly as involving a live repository of data and concretely as a relational database. In fact, our repository structure is a judicious mix of relational databases and XML. We need to extend our work so that the output can be an update in an XML equivalent of the SQL we have used so far. We do not currently handle attachments. Our repository model is designed to describe multimedia properties held in files, and the MSc thesis by David Kerr has demonstrated how to download and store multimedia attachments in the context of this model [25]. The next version of the component will integrate David's work into the IE context.

Finally, there are some mundane issues which the current version does not handle well. The automatic generation of new Dkeys is very crude, but better techniques for this are well known. The identification of sentences, often called segmentation, is also problematic given that accurate punctuation cannot be guaranteed, so thought is needed here. However, we feel that, by taking a fairly simple approach to the problem, we have demonstrated that extraction from small loosely structured text messages can feasibly be achieved.
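The template-learning idea above – replacing known data and metadata in a sentence with placeholders – could start from something as simple as the following hypothetical sketch; the real mechanism would have to cope with tokenisation, synonyms and overlapping matches:

```java
// Hypothetical sketch: given a sentence about data already in the
// repository, substitute the known property name, Hkey and value with
// placeholders to propose a candidate template.
public class TemplateLearnerSketch {
    public static String learn(String sentence, String property,
                               String hkey, String value) {
        return sentence.replace(value, "<value>")
                       .replace(hkey, "<Hkey>")
                       .replace(property, "<property>");
    }
}
```

Applied to "The author of Emma is Jane Austen.", with author, Emma and Jane Austen all known, this yields the candidate template "The <property> of <Hkey> is <value>.".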
Acknowledgements
The authors would like to thank Anders Hermansen, Nicola Laciok, David Kerr and Chenlan Bi for early versions of this software and Rosemary McLeish for her comments on a draft of this paper.
Bibliography
1. R.L. Cooper, An Architecture for Collaboratively Assembled Moderated Information Bearing Web Sites, Proceedings of Web Based Collaboration, DEXA, 2002, pp 293-297, IEEE Computer Society Press.
2. N. Laciok, An XML Component for a Collaboratively Developed Website, MSc Dissertation, University of Glasgow, September 2000.
3. V. Lopez and E. Motta, Ontology-Driven Question Answering in AquaLog, NLDB 2004, LNCS 3136, pp 89-102, 2004.
4. WordNet, http://wordnet.princeton.edu/.
5. T. Connolly and C. Begg, Database Systems, pp 809-814, Addison Wesley, 2005.
6. R.L. Cooper and S. Ali, Extracting Database Information from E-mail Messages, 20th British National Conference on Databases, pp 271-279, LNCS 2712, Springer, July 2003.
7. R.L. Cooper and S. Ali, Extracting Information from Short Messages, Natural Language Processing and Information Systems, Montoyo, Munoz and Metais (eds), LNCS 3513, pp 388-391, 2005.
8. G.F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving (Third Edition), London: Addison-Wesley, 1997.
9. The Text Summarization Project, http://www.site.uottawa.ca/tanka/ts.html.
10. C. Cardie, Empirical Methods in Information Extraction, AI Magazine, 18:4, pp 65-79, 1997.
11. R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval, Journal of Documentation, 54(1):70-105, 1998.
12. The General Architecture for Text Engineering, http://gate.ac.uk/.
13. D. Fisher, S. Soderland, J. McCarthy, F. Feng and W. Lehnert, UMass System, MUC-6, 1995.
14. E. Agichtein and L. Gravano, Snowball: Extracting Relations from Large Plain-Text Collections, Proc. 5th ACM International Conference on Digital Libraries (DL), 2000.
15. R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval, Journal of Documentation, 54(1):70-105, 1998.
16. M.T. Pazienza, Information Extraction: Towards Scalable, Adaptable Systems, Lecture Notes in Artificial Intelligence 1714, Springer, 1999.
17. I-S. Kang, S-H. Na, J-H. Lee and G. Yang, Lightweight Natural Language Database Interfaces, NLDB 2004, LNCS 3136, pp 76-88, 2004.
18. N. Stratica and B.C. Desai, Schema-Based Natural Language Semantic Mapping, NLDB 2004, LNCS 3136, pp 103-113, 2004.
19. MUSE - MUlti Source Entity finder, http://www.dcs.shef.ac.uk/nlp/muse/.
20. R.L. Cooper and M. Davidson, Content Management for Declarative Web Site Design, DIWeb, 2004.
21. T. Teorey and J.P. Fry, A Logical Design Methodology for Relational Databases using the Extended Entity-Relationship Model, ACM Computing Surveys, June 1986.
22. C. Kleiner and U.W. Lipeck, Automatic Generation of XML DTDs from Conceptual Database Schemas, in K. Bauknecht, W. Brauer and T.A. Mück (eds), Informatik 2001: Wirtschaft und Wissenschaft in der Network Economy - Visionen und Wirklichkeit, Tagungsband der GI/OCG-Jahrestagung, 25.-28. September 2001, Universität Wien, ISBN 3-85403-157-2, Band 1, pp 396-405.
23. EuroWordNet, www.illc.uva.nl/EuroWordNet.
24. A.U. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev and R. Snodgrass, Temporal Databases, Benjamin/Cummings, 1993.
25. D. Kerr, Incorporating Multimedia Data into a Collaborative Web Site Design Tool, MScIT Dissertation, University of Glasgow, September 2001.
Appendix A. A Fully Worked Example
This appendix gives the full details of the example discussed in brief in Section 6.2. The inputs to the system are given, followed by a sentence by sentence analysis.
A.1 The Inputs
The system must be set up with a set of template types and templates. This provides the linguistic input, after which the specific input is a database and its schema, the initial context (in this case, the book "Emma") and the message. These are shown below for the example:
The Schema:

Schema Name: Library
Entity Types:
    Book(
        ISBN      string    Dkey
        title     string    Hkey sv
        author    Author    Hkey sv mv
        year      string    sv )
    Author(
        ID        number    Dkey sv
        name      string    Hkey sv
        dob       string    sv
        gender    gender    sv )

The Database:

Database Name: My Library
Entities:
    Book      1234    Dracula                  201     1889
    Book      1235    Falstaff                 202     1975
    Book      1236    Emma                     null    null
    Book      1237    Pride and Prejudice      null    null
    Author    201     Bram Stoker              1850    m
    Author    202     Robert Nye               1940    m
    Author    203     Jane Austen              null    f

The Context:

    Most Recent Type      Book
    Most Recent Entity    Book 1236
    Most Recent Neuter    Book 1236
    Most Recent Book      Book 1236

The Message:

The author is Jane Austen. Its date is 1815. She was also the writer of Pride and Prejudice. Her dob was 1780. year : 1820. She was the author of Persuasion. The date is 1825, while the title is PPPP. The ttl of the volume is The Big Q. The author of The Big Q is Des Dillon. The author of Emma is not Jane Austen.
A.2 The Analysis

Sentence 1: The author is Jane Austen.
Matched Pattern: The author is <value>
Template: The <property> is <value>
Template Type: update <entity> set <property> = <value>
Update: update Book set author = 203 where ISBN = 1236
Context changes: The new current entity is Jane Austen, who is also the new most recent Author and feminine entity.
Explanation: The current entity is Emma, whose Dkey is the ISBN, 1236. When a value is found for a property which is an entity type, the value found will be an Hkey and this must be transformed into a Dkey for the update. In this case, Jane Austen is the author with ID 203, and so the Hkey value, "Jane Austen", is transformed into its Dkey, 203.
Sentence 2: Its date is 1815.
Matched pattern: Its date is <value>
Template: <pronoun> <property> is <value>
Template Type: update <entity> set <property> = <value>
Update: update Book set year = 1815 where ISBN = 1236
Context changes: The book Emma becomes the current entity again.
Explanation: The possessive pronoun, "its", determines the gender to be neuter, so the most recent neuter entity is updated. Note that date is a synonym for year.
Sentence 3: She was also the writer of Pride and Prejudice.
Matched Pattern: She was also the author of <value>
Template: <pronoun> was also the <property> of <value>
Template Type: update <entity> set <property> = <value>
Update: update Book set author = 203 where ISBN = 1237
Context changes: The current entity, book and neuter entity is now Pride and Prejudice, ISBN 1237.
Explanation: The subject pronoun "she" causes the program to get the Dkey of the most recent feminine entity, Jane Austen, as the value to be set. The entity to be changed is found by searching for the title Pride and Prejudice and using its Dkey for the where test.
Sentence 4: Her dob was 1780.
Matched pattern: Her dob was <value>
Template: <pronoun> <property> was <value>
Template Type: update <entity> set <property> = <value>
Update: update Author set dob = 1780 where ID = 203
Context changes: The author, Jane Austen, becomes the current entity again.
Explanation: This is similar to Sentence 2. "Her" tells the program to update the most recent feminine entity.
Sentence 5: year : 1820.
Matched pattern: year : <value>
Template: <property> : <value>
Template Type: update <entity> set <property> = <value>
Update: update Book set year = 1820 where ISBN = 1237
Context changes: The book, Pride and Prejudice, becomes the current entity again.
Explanation: The program attempts to assign the property value for year to Jane Austen but, finding that there is no such property, locates year as a property of Book and so switches the update to the most recent book.
Sentence 6: She was the author of Persuasion.
Matched pattern: She was the author of <value>
Template: <pronoun> was also the <property> of <value>
Template Type: update <entity> set <property> = <value>
Updates:
insert into Book( ISBN, title) values (0, "Persuasion")
update Book set author = 203 where ISBN = 0
Context changes: The book Persuasion becomes the current entity and the most recent book.
Explanation: The sentence is analysed as for Sentence 3, but now the program cannot locate a book called "Persuasion" and so has to create a new Book entity. It creates a new Dkey (very crudely in this prototype), "0", and creates an entity with the two values it knows, the new Dkey and the title. Then it updates the new entity with the author information.
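The insert-then-update pair above could be generated along the following lines. This is a sketch: the class name, method and the crude key generation are assumptions mirroring the prototype's behaviour, not its actual code:

```java
// Sketch of generating the two statements for an unknown book: an insert
// holding the generated Dkey and the title, then the author update
// against that freshly generated key.
public class NewEntitySketch {
    public static String[] updatesFor(String title, int generatedKey,
                                      int authorDkey) {
        String insert = "insert into Book( ISBN, title) values ("
                      + generatedKey + ", \"" + title + "\")";
        String update = "update Book set author = " + authorDkey
                      + " where ISBN = " + generatedKey;
        return new String[]{insert, update};
    }
}
```

Splitting the work into an insert followed by an update keeps the update-generation path identical whether or not the entity already existed.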
Sentence 7: The date is 1825, while the title is PPPP.
Matched pattern: The date is <value1>, while the title is <value2>
Template: The <property1> is <value1>, while the <property2> is <value2>
Template Type: update <entity> set <property1> = <value1>; update <entity> set <property2> = <value2>
Updates:
update Book set year = 1825 where ISBN = 0
update Book set title = "PPPP" where ISBN = 0
Context changes: No change.
Explanation: This is a more complex form of the analysis of Sentence 2. Templates of this type hold the information for two properties and this is reflected in their template type. Note the use of the index numbers to keep property names and values together.
Sentence 8: The ttl of the volume is The Big Q.
Matched pattern: The ttl of the volume is <value>
Template: The <property> of the <entity> is <value>