USING PART-OF-SPEECH PATTERNS AND DOMAIN ONTOLOGY TO MINE IMPRECISE CONCEPTS FROM TEXT DOCUMENTS
Muhammad Abulaish1 & Dr. Lipika Dey2
Abstract
In the last few years, several works in the literature have addressed the problem of information extraction from text documents. The importance of this problem derives from the fact that, once extracted, the information can be handled in a way similar to instances of a traditional database. However, most information extraction systems assume that texts contain only precise concepts, which restricts information providers to precise concepts when representing their information. In this paper we present a system that uses part-of-speech patterns and a domain ontology to extract imprecise concepts from text documents. It thereby allows information providers to describe concepts using linguistic variables such as very, more, light, strong, slightly and quite, which are very common in natural languages. We have considered wine documents as a case study; however, the approach can be applied to any domain for which an ontology exists. We also show how a structured knowledgebase can be designed to hold imprecise concept descriptions extracted from text documents. The structured knowledgebase can then be searched efficiently for required information.
Keywords: Information extraction, Ontology, Fuzzy ontology model, Tagger, Knowledgebase.
1. Introduction
Information extraction from unstructured text documents is an open research problem. The problem focuses on extracting entities and the relations among them from text documents [9]. In all applications of information extraction systems that we are aware of, concepts are viewed as collections of crisp unary relations. On the other hand, in many real-world applications such as intelligent e-commerce, users may provide information that is vague and imprecise. A customer may be interested in buying a “very light flavored and less strong bodied wine” or a “low priced share”. Here the concepts “very light”, “less strong” and “low” are not precisely defined. Moreover, information is not always presented in the same way. As a result, data extraction and exchange are not easy if the different actors (producers or consumers of information) have not agreed on the semantics of the data.

1 Department of Mathematics, Jamia Millia Islamia (A Central University), Jamia Nagar, New Delhi – 25, India. E-mail: [email protected]
2 Department of Mathematics, Indian Institute of Technology, Hauz Khas, New Delhi – 16, India. E-mail: [email protected]

By using a common model, an ontology provides a way to ensure consistent interpretation and exchange of information across different documents. An ontology specifies the key concepts in a domain and their inter-relationships to provide an abstract view of an application domain [4]. Along with concept descriptions, it provides a taxonomic classification of concepts in the world to be used as semantic primitives. With the support of an ontology, user and system can communicate through a shared, common understanding of a domain. Usually, however, a concept in an ontology is defined in terms of its mandatory and optional properties along with the value restrictions on those properties. In general there is no framework for qualifying a property. One of the chief problems hindering the extraction of information from text documents with such an ontology is that information providers use imprecise concepts to express their information. By an imprecise concept we mean a concept associated with a varying degree of precision that cannot be expressed quantitatively.

Let us consider as an example the wine ontology developed by the World Wide Web Consortium (W3C) group3. This ontology uses the properties flavor, color, body, etc. to describe any class of wine. The value restriction on the property flavor is the set {delicate, moderate, strong}, on the property color it is {red, white, rose}, and on the property body it is {light, medium, full}. The ontology also describes a set of instances of wines of various types in terms of the above-mentioned properties. However, when we actually look at Web documents on wine, the descriptions in those documents rarely contain the above values. Here are two sample descriptions extracted from Web documents:
• Dolcetto is a red table wine, which is quite dry and has a slightly fruity flavor.
• Syrah has a full body, less fruity flavor and dark red color that grow originally in California's coastal areas.
If the user is looking for a wine that is “mild fruity flavored”, then although neither description is an exact match for the user's requirement, both wines are good matches for the query. In this paper we address the issue of handling imprecise concepts through the use of a fuzzy ontology model. Specifically, we show:
(i) How a fuzzy ontology structure can be created with the help of an existing domain ontology and resource qualifiers.
(ii) How the schema of a knowledgebase can be created from this structure and how, using the two together, information about instances can be extracted from unstructured texts.
(iii) How the extracted information is stored in a structured knowledgebase, which can then be searched for matches.
The rest of the paper is organized as follows. Section 2 reviews related work on ontology-based information processing. Section 3 presents the details of our system. Finally, section 4 concludes the paper and discusses future work.
2. Related Work
To manage the deluge of information, many Information Extraction (IE) systems have been developed to automatically extract relevant information from text documents. Andreasen et al. have described an approach to querying text sources based on extracting and evaluating semantic content, given a formal ontology for the text domain [2]. The method was developed in the ONTOlogy QUERYing (ONTOQUERY) project [1,3]. Traditional search engines depend more or less exclusively on recognition of keywords or patterns of keywords in the text material. By contrast, ONTOQUERY addresses retrieval of pertinent text segments based on the conceptual content of the text. Like ONTOQUERY, the project ONTOSEEK [5] addresses content-based search by incorporating an ontology into the system. In ONTOSEEK, the ontology is used to help users interactively construct precise and unambiguous descriptions of resource texts and formulate unambiguous queries, which may subsequently be generalized or specialized. Snoussi et al. have proposed an ontology-based approach that facilitates the formalization and extraction of data from different sources [6]. The extracted data are converted into a coherent structure so that users and agents can query them regardless of their origin. Similarly, Jiann-Jyh Lu and Chun-Nan Hsu have combined wrapper agent technologies and ontologies of molecular biology to enable biologists to issue concept-to-concept queries [7]. Liddle et al. have developed a Java-based tool that helps domain experts by providing a graphical interface for domain ontology creation and testing. They in turn use the created ontology to extract data from Web documents and store them in structured form [8].

3 www.w3c.org

All the IE systems mentioned above assume that text documents contain precise concept descriptions, and consequently they attempt to extract these to answer precise queries only. Moreover, although much attention is currently given to standardizing ontology representation languages for general domain representation, most of these assume that the concept world is precisely defined in terms of properties and values. One of the central problems we face today is how to use an ontology directly to extract imprecise concepts from unrestricted texts in order to answer users' imprecise queries.
Therefore, we have created a fuzzy ontology structure by embedding fuzzy concept modifiers in the existing domain ontology to ease the extraction of imprecise concepts from text documents.
3. Proposed Ontology-based Information Extraction System
Figure 1 presents a schematic view of our system and highlights the relationship among the various modules. The Ontology Editor creates the fuzzy ontology structure by incorporating fuzzy concept modifiers in the existing domain ontology.
[Figure 1: Proposed system. Components shown: Text Documents, Document Parser, POS Tagger, Term Filter, Tree Structured Documents, Domain Ontology, Resource Qualifiers, Ontology Editor, Fuzzy Ontology Structure, KIGA, Knowledge base]
The Document Processing Agent (DPA) consists of a document parser that divides an unstructured text document into individual record-size chunks and presents them as individual unstructured record documents for further processing. It also consists of a part-of-speech (POS) tagger that assigns parts of speech to individual words in the documents. The DPA uses a term filter, which filters unwanted POS tags. Finally, it converts the filtered document into a ternary tree structure. The Knowledgebase Instance Generation Agent (KIGA) is responsible for populating the knowledgebase. It uses the ternary tree structures and the fuzzy ontology structure as inputs and generates the instances of the knowledgebase. The working principle of the different modules is explained in the following subsections.
3.1. Ontology Editor
This module creates the fuzzy ontology structure by incorporating fuzzy concept modifiers in the existing domain ontology. For this we have used Protégé 2.0 Beta4, an integrated software tool used by system developers and domain experts to develop knowledge-based systems. Protégé is currently used in fields, including clinical medicine and the biomedical sciences, in which concepts can be modeled as a class hierarchy. To accommodate fuzzy concept modifiers in the existing domain ontology, we first propose a new resource of type qualifier, which is defined as a collection of related terms. In English, adverbs are most often used to qualify adjectives, particularly where the user wants to specify a property with a varying degree of precision that cannot be expressed quantitatively. Hence, in general, the resource qualifier models the adverbs of the English language. For example, when we talk of “very cold water”, we mean almost chilled water without specifying the actual temperature. The collection is designed as a graded set, so that the distance between two values in the collection reflects their degree of dissimilarity. Let P_s(x, y) denote the degree of similarity between terms x and y in the value set. One way to fix the degree of similarity between the value v_i at position i and the value v_j at position j of a graded value set is
\[
P_s(v_i, v_j) =
\begin{cases}
1, & \text{for } i = j \\
\dfrac{1}{|i - j| + 1}, & \text{for } i \neq j
\end{cases}
\]
[Figure 2: Taxonomic structure of wine classes. Nodes shown: Wine, Name, Color, Body, Taste, Flavor, Resource Qualifier, Color Qualifier, Body Qualifier, Taste Qualifier, Flavor Qualifier, Fuzzy Color, Fuzzy Body, Fuzzy Taste, Fuzzy Flavor; links are labelled “has” and “has value”]
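The graded similarity defined above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name graded_similarity is hypothetical, and the example set is the body value set {light, medium, full} of the W3C wine ontology, taken in that order.

```python
def graded_similarity(graded_set, x, y):
    """Similarity of two terms of a graded set: 1 / (|i - j| + 1).

    i and j are the positions of x and y in the ordered set, so identical
    terms score 1 and the score falls as the terms move further apart.
    """
    i = graded_set.index(x)
    j = graded_set.index(y)
    return 1.0 / (abs(i - j) + 1)


# Assumption: the body value set of the wine ontology, ordered as in the text.
BODY_VALUES = ["light", "medium", "full"]

print(graded_similarity(BODY_VALUES, "light", "light"))   # 1.0
print(graded_similarity(BODY_VALUES, "light", "medium"))  # 0.5
print(graded_similarity(BODY_VALUES, "light", "full"))    # one third
```

Because the score depends only on the positional distance |i - j|, the ordering chosen for a graded set fully determines how similar any two of its terms are taken to be.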
We now present the design principles for building a fuzzy ontology structure. To accommodate imprecise descriptions, we add a new resource class, called the Resource qualifier class, to the ontology structure. A Resource qualifier class represents a graded set of qualifiers that, along with a value set, can be applied to describe a property of a concept with varying degrees of precision.

4 http://protégé.stanford.edu

As an example, we have considered the wine ontology developed by W3C. The wine ontology uses the value set {red, rose, white} for the color property, {delicate, moderate, strong} for the flavor property, {light, medium, full} for the body property, and {dry, off-dry, sweet} for the taste property. For these property-value sets, the qualifiers considered in our model are: {null, light, pale, bright, dark, deep} for the color property, {null, slightly, very} for the flavor property, {null, very} for the body property, and {null, slightly, semi, medium, very} for the taste property. In the qualifier sets, null is included to allow the extraction of property values that appear in text documents without qualifiers.

Using the concept of multiple inheritance from object-oriented design, a new class called a fuzzy class is created, which is a subclass of both a value class and a qualifier class. In our case the defined fuzzy classes are: FuzzyColor, a subclass of the Color and ColorQualifier classes; FuzzyFlavor, a subclass of the Flavor and FlavorQualifier classes; FuzzyBody, a subclass of the Body and BodyQualifier classes; and FuzzyTaste, a subclass of the Taste and TasteQualifier classes. The redefined taxonomic structure of the wine classes is shown in figure 2. To describe a wine, the “allValuesFrom” constraint is used to constrain property values to be either null or an instance of the corresponding fuzzy class. The modified constraints on wine properties and the partial Web Ontology Language (OWL) code generated by Protégé are shown in figures 3 and 4 respectively.

Template Slots of wine class
Slot name    Type       Allowed Values/Classes   Cardinality   Default
Hasbody      Instance   Fuzzybody                0:1
Hastaste     Instance   Fuzzytaste               0:1
Hascolor     Instance   Fuzzycolor               0:1
Hasflavor    Instance   Fuzzyflavor              0:1
Figure 3: Template slots of wine class

[Figure 4: OWL codes generated by Protégé 2.0 Beta]
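The multiple-inheritance design described above can be sketched in Python for the color property. This is a minimal illustration, not the OWL generated by Protégé: the class attribute names VALUES and QUALIFIERS, the Wine constructor and the validation logic are all assumptions made for the example, and None stands in for the null qualifier.

```python
class Color:
    """Value class: the crisp color values of the wine ontology."""
    VALUES = ("red", "rose", "white")


class ColorQualifier:
    """Qualifier class: graded color qualifiers; None plays the role of null."""
    QUALIFIERS = (None, "light", "pale", "bright", "dark", "deep")


class FuzzyColor(Color, ColorQualifier):
    """Fuzzy class: a subclass of both the value class and the qualifier class."""

    def __init__(self, value, qualifier=None):
        if value not in self.VALUES:
            raise ValueError(f"unknown color value: {value!r}")
        if qualifier not in self.QUALIFIERS:
            raise ValueError(f"unknown color qualifier: {qualifier!r}")
        self.value = value
        self.qualifier = qualifier

    def __repr__(self):
        return f"FuzzyColor({self.qualifier!r}, {self.value!r})"


class Wine:
    """Hypothetical wine class whose hascolor slot has cardinality 0:1."""

    def __init__(self, name, hascolor=None):
        # The slot is either empty (None) or an instance of the fuzzy class.
        if hascolor is not None and not isinstance(hascolor, FuzzyColor):
            raise TypeError("hascolor must be None or a FuzzyColor instance")
        self.name = name
        self.hascolor = hascolor


syrah = Wine("Syrah", FuzzyColor("red", qualifier="dark"))
print(syrah.hascolor)  # FuzzyColor('dark', 'red')
```

The other three properties (flavor, body, taste) would follow the same pattern with their own value and qualifier sets.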
3.2. Document Processing Agent (DPA)
The Document Processing Agent (DPA) consists of a document parser, which divides an unstructured text document into individual record-size chunks and presents them as individual unstructured record documents for further processing. It also includes a part-of-speech (POS) tagger, a program that assigns parts of speech to English words based on the context in which they appear. We have used a Web-based tagger developed by the Specialized Information Services Division (SIS) of the National Library of Medicine (NLM), which is freely available at http://tamas.nlm.nih.gov/taggercgi.html.
Some of the common part-of-speech tags and their corresponding parts of speech are given in figure 5. Some sample tagged documents are depicted in figure 6. The document processor uses a term filter, which in our case is guided by the fuzzy ontology structure, to filter unwanted tags from the tagged documents. Since we look for concept descriptions only, we have filtered out the X (auxiliary verb) and T (article) tags. This increases the processing speed of the DPA.

Tag   Part of speech
T     Article
X     Auxiliary verb
A     Adverb
N     Noun
V     Verb
J     Adjective
R     Preposition
P     Pronoun
C     Conjunction
D     Determiner
I     Interjection
i     “to” as an infinitive marker
Figure 5: Part of speech tags and their corresponding parts of speech

Dolcetto/N is/X a/T red/J table/N wine/N which/P is/X quite/A dry/J and/C has/X a/T slightly/A fruity/J flavor/N.
Syrah/N has/X a/T full/J body/N, less/A fruity/J flavor/N and/C dark/N red/J color/N that/C grow/V originally/A in/R California's/N coastal/J areas/N.
Figure 6: Tagged documents
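The term-filtering step can be sketched in a few lines of Python. The representation of a tagged document as a list of (word, tag) pairs and the names UNWANTED_TAGS and term_filter are assumptions made for illustration; the tags themselves are those of the first tagged sentence in figure 6.

```python
# Tags removed by the term filter: X (auxiliary verb) and T (article).
UNWANTED_TAGS = {"X", "T"}


def term_filter(tagged_words):
    """Keep only the (word, tag) pairs whose tag is not filtered out."""
    return [(word, tag) for word, tag in tagged_words if tag not in UNWANTED_TAGS]


# Assumed representation of the first tagged document of figure 6.
DOLCETTO = [
    ("Dolcetto", "N"), ("is", "X"), ("a", "T"), ("red", "J"),
    ("table", "N"), ("wine", "N"), ("which", "P"), ("is", "X"),
    ("quite", "A"), ("dry", "J"), ("and", "C"), ("has", "X"),
    ("a", "T"), ("slightly", "A"), ("fruity", "J"), ("flavor", "N"),
]

filtered = term_filter(DOLCETTO)
print(" ".join(word for word, _ in filtered))
# Dolcetto red table wine which quite dry and slightly fruity flavor
```

Of the 16 tagged words, the three auxiliary verbs and two articles are dropped, leaving 11 words for the tree-construction step.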
Finally, a tree structure is created for every record document. To create the tree structure, the document is divided into segments at commas (,), semicolons (;), the preposition “with”, conjunctions (tag C) and full stops (.). Only those segments having at least one adjective tag (and therefore likely to contain property values) are considered. The template of the tree structure is defined as follows:

    struct Tree {
        char *Value;            /* tagged word(s) stored at this node */
        struct Tree *Lchild;    /* left child */
        struct Tree *Mchild;    /* middle child */
        struct Tree *Rchild;    /* right child: sub-tree of the next segment */
    };

Every segment is converted into an instance of the tree structure by distributing its tags in the following way:
Root (R): A node that contains the rightmost adjective tag.
Lchild (L): A node that contains all tags that are to the left of the tag considered at R.
Mchild (M): A node that contains all tags that are to the right of the tag considered at R.
Rchild: Points to the root of the sub-tree constructed from the next segment.
[Figure 7: Ternary tree structure. Nodes hold word/tag pairs such as “Red, J”, “Dolcetto, N”, “Quite, A”, “Full, J”, “Syrah, N” and “Coastal, J”; Null marks an empty child]
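The segmentation and tree-construction rules can be sketched in Python as follows. The sketch applies the textual rules literally to the second tagged document of figure 6 after term filtering; the authors' implementation may segment differently. The names Tree, split_segments and build_tree, the list-based children, and the representation of punctuation as tokens whose tag equals the punctuation mark are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag)


@dataclass
class Tree:
    value: Token                     # root: the rightmost adjective of the segment
    lchild: List[Token]              # all tokens to the left of the root
    mchild: List[Token]              # all tokens to the right of the root
    rchild: Optional["Tree"] = None  # sub-tree built from the next segment


def split_segments(tokens: List[Token]) -> List[List[Token]]:
    """Split at commas, semicolons, full stops, 'with' and conjunctions (tag C)."""
    segments, current = [], []
    for word, tag in tokens:
        if word in {",", ";", "."} or word.lower() == "with" or tag == "C":
            if current:
                segments.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        segments.append(current)
    # Only segments with at least one adjective (tag J) are kept.
    return [seg for seg in segments if any(tag == "J" for _, tag in seg)]


def build_tree(tokens: List[Token]) -> Optional[Tree]:
    """Chain one node per segment through the Rchild pointer."""
    root = None
    for seg in reversed(split_segments(tokens)):
        idx = max(i for i, (_, tag) in enumerate(seg) if tag == "J")
        root = Tree(seg[idx], seg[:idx], seg[idx + 1:], root)
    return root


# Second tagged document of figure 6 with the X and T tags already removed.
SYRAH = [
    ("Syrah", "N"), ("full", "J"), ("body", "N"), (",", ","),
    ("less", "A"), ("fruity", "J"), ("flavor", "N"), ("and", "C"),
    ("dark", "N"), ("red", "J"), ("color", "N"), ("that", "C"),
    ("grow", "V"), ("originally", "A"), ("in", "R"),
    ("California's", "N"), ("coastal", "J"), ("areas", "N"), (".", "."),
]

node = build_tree(SYRAH)
while node is not None:
    print(node.value, node.lchild, node.mchild)
    node = node.rchild
```

Building the chain from the last segment backwards lets each node receive its Rchild at construction time, so no second pass is needed to link the sub-trees.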
The equivalent context-free grammar for the tree structure may be given as follows:
Document (D) → L R M D | ε
L → (N+P+A+V+J)*
R → J
M → (N+P+V)*
where ε denotes the empty string and N, P, A, J, and V have the meanings given in figure 5. Figure 7 shows the resulting tree structure after the above-mentioned procedure is applied to the tagged documents shown in figure 6.
3.3. Knowledgebase Instance Generating Agent (KIGA)
First, the Knowledgebase Instance Generating Agent parses the fuzzy ontology structure and creates an SQL schema for the knowledgebase. This is implemented as a sequence of create-table statements, whose attributes are all those object-set names derived from the fuzzy ontology structure that have only atomic values, and whose types are varchar for lexical object sets and real for nonlexical object sets. It then uses the output of the document processor along with the fuzzy ontology structure to populate the knowledgebase with information extracted from the text documents.

Algorithm Instance_Generator (ROOT)
Input: Ternary tree structure generated from Web documents (PTR is the pointer to a tree structure); list of objects, relationships, and constraints.
Output: Instance of the knowledgebase.
Data structure: Ternary tree structure; relational database schema structure.
Steps:
1. Ptr = ROOT    // Start from the root node
2. If (Ptr ≠ Null)    // If the tree is non-empty
   a. SEARCH_PROPERTY_NAME (Property_name_list, Ptr -> Mchild)    // Search property name in the middle child node
      If Property name found    // Property name is explicitly mentioned in the document
         Assign value at root node to Property_Value    // Extract property value
         Go to step 2 (b)    // Proceed to search qualifier value
      Else    // Property name is not explicitly mentioned in the document
         SEARCH_PROPERTY_VALUE (Property_Value_Lists, Ptr -> Value)    // Assume the value of the root node is a property value and search for it in the property value sets
         If any Property Value matched with Ptr -> Value    // The value at the root node is a valid property value
            Assign it to Property_Value    // Extract property value and block the corresponding property value list for the next search
            Go to step 2 (b)    // Proceed to search qualifier value
         Else    // The value at the root node is not a valid property value
            SEARCH_PROPERTY_VALUE (Property_Value_Lists, Ptr -> Lchild)    // Search the property value in the left child of the root node
            If no valid property value found    // The sub-tree has no property value to populate the knowledgebase
               Go to step 2 (c)    // Proceed to search the next sub-tree
            End if
         End if
      End if
   b. SEARCH_QUALIFIER_VALUE (Qualifier_Value_Lists, Ptr -> Lchild)    // Search qualifier value in the left child of the root node
      If qualifier value found    // The extracted property value is associated with a qualifier
         Assign it to Qualifier_Value    // Extract qualifier value and associate it with the property value found in the earlier steps
      End if
      If it is the first invocation of the algorithm    // Object name is only in the left child of the root node of the input tree
         SEARCH_OBJECT_NAME (Ptr -> Lchild)    // Object with which the found property value and qualifier will be associated in the knowledgebase
      End if
   c. Ptr = Ptr -> Rchild    // Proceed to the next sub-tree
      Instance_Generator (Ptr)    // Repeat the above process for the next sub-tree
3. Insert Object_Name, Property_Value, Qualifier_Value into Knowledgebase    // If this is the first occurrence of the object name, generate SQL INSERT statements; otherwise generate SQL UPDATE statements to accommodate the extracted values in the knowledgebase
4. End if
5. Stop
Figure 8: Instance generation algorithm
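The core of the instance generation procedure of figure 8 can be sketched in Python. This is a simplified illustration, not the authors' implementation: it processes the chain of sub-trees iteratively as a list of (left words, root word, middle words) triples, takes the object name to be the first word of the first left child, and returns a dictionary instead of issuing SQL. The function names and the key names such as FlavorQualifier are assumptions; the value and qualifier sets are those listed in section 3.1, with null represented by None.

```python
# Value and qualifier sets of the fuzzy wine ontology (section 3.1).
PROPERTY_VALUES = {
    "color": ["red", "rose", "white"],
    "flavor": ["delicate", "moderate", "strong"],
    "body": ["light", "medium", "full"],
    "taste": ["dry", "off-dry", "sweet"],
}
QUALIFIERS = {
    "color": ["light", "pale", "bright", "dark", "deep"],
    "flavor": ["slightly", "very"],
    "body": ["very"],
    "taste": ["slightly", "semi", "medium", "very"],
}


def find_property_by_value(words, used):
    """Return (property, value) for the first word found in an unused value set."""
    for word in words:
        for prop, values in PROPERTY_VALUES.items():
            if prop not in used and word in values:
                return prop, word
    return None, None


def generate_instance(segments):
    """segments: list of (left_words, root_word, middle_words), one per sub-tree."""
    record, used = {}, set()
    for n, (left, root, middle) in enumerate(segments):
        if n == 0:
            # Assumption: the object name is the first word of the first left child.
            record["WineName"] = left[0] if left else None
        # Step 2(a): property name in the middle child, else match by value.
        prop = next((p for p in PROPERTY_VALUES if p in middle), None)
        value = root if prop else None
        if prop is None:
            prop, value = find_property_by_value([root], used)
        if prop is None:
            prop, value = find_property_by_value(left, used)
        if prop is None:
            continue  # Step 2(c): nothing to extract from this sub-tree
        used.add(prop)  # block this value list for later searches
        # Step 2(b): at most one qualifier, searched in the left child.
        qualifier = next((w for w in left if w in QUALIFIERS[prop]), None)
        record[prop.capitalize() + "Value"] = value
        record[prop.capitalize() + "Qualifier"] = qualifier
    return record


# Sub-trees assumed for the Dolcetto description (words lower-cased).
DOLCETTO_SEGMENTS = [
    (["dolcetto"], "red", ["table", "wine"]),
    (["quite"], "dry", []),
    (["slightly"], "fruity", ["flavor"]),
]
print(generate_instance(DOLCETTO_SEGMENTS))
```

In this run "fruity" is accepted as a flavor value only because the property name flavor appears in the middle child, while "quite" is not stored because it is not in the taste qualifier set.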
Some of the key behavioral features of the instance generation mechanism are:
(i) Since a particular object may have been described in a document by using some or all of the properties mentioned in the ontology structure, or by some other features not mentioned in the ontology structure, several fields may have null values.
(ii) Moreover, a document may or may not use a property name in conjunction with the property values for describing a concept. For example, in the document “Roussanne is a light bodied, light red and very sweet wine from France's Loire Valley, often blended with Merlot”, the property name body is mentioned explicitly; the property descriptors of color and taste appear only implicitly through their values.
Guided by these observations, we have employed a two-way approach to populate the knowledgebase. Given a property name, our instance generator looks for values to fill up the object description. This method allows the knowledgebase to accommodate object descriptions with property values that are not present in the underlying ontology. In the absence of property names, property values from the ontology are used as pointers to fill up the particular attribute slot. Further, our assumptions are: (i) a property has no qualifier unless it has a value, and (ii) at most one qualifier is associated with a particular property value. Algorithm Instance_Generator(), shown in figure 8, outlines the basic procedure for generating instances for the knowledgebase. This algorithm accepts as input the ternary tree structures generated from parsed documents and the list of objects, relationships, and constraints. The outputs are used to fill up the knowledgebase, again through a series of embedded SQL statements.
3.4. Results
Figure 9 shows the knowledgebase generated from a corpus of 50 wine documents collected from the Web. The goal of our knowledgebase instance generator is to extract as many relevant concepts as possible and to ignore as many non-relevant concepts as possible. A concept that is related to an object, either explicitly or implicitly, is said to be relevant for the object under consideration; otherwise it is non-relevant. For example, consider the following text collected from the Internet.
Johannisberg Riesling: a dry, delicate, white color wine ... Enjoy well with fish or red meats ...
Here the concepts dry and delicate are implicitly relevant, and the concept white is explicitly relevant (its property name, color, is mentioned), for describing the wine Johannisberg Riesling, whereas the concept red is non-relevant. So the instance generator should be able to extract the concepts dry, delicate and white, and to ignore the concept red. The Precision and Recall of the knowledgebase instance generator are computed as follows:
Precision = No. of relevant concepts extracted / Total number of relevant and non-relevant concepts extracted
Recall = No. of relevant concepts extracted / Total number of relevant concepts
Despite the diversity of the collection, the system performs well: the extraction procedure achieves a Precision of 92.89% and a Recall of 80.69%. The results are summarized in Table-I.
Table – I
                 Relevant concepts   Non-relevant concepts   Precision   Recall
Extracted        209 (True +ve)      16 (False +ve)          92.89%      80.69%
Not Extracted    50 (False –ve)      231 (True –ve)
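The reported Precision and Recall follow directly from the counts in Table-I, as the short Python check below shows. The variable names are chosen for illustration only.

```python
# Counts taken from Table-I.
true_positive = 209   # relevant concepts extracted
false_positive = 16   # non-relevant concepts extracted
false_negative = 50   # relevant concepts not extracted

precision = true_positive / (true_positive + false_positive)  # 209 / 225
recall = true_positive / (true_positive + false_negative)     # 209 / 259

print(f"Precision = {precision:.2%}")  # Precision = 92.89%
print(f"Recall    = {recall:.2%}")     # Recall    = 80.69%
```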
Figure 9: Sample Knowledgebase
Once the information is stored in a structured form, the user can search it efficiently for required information using Structured Query Language (SQL). For example, to find the name(s) of the wine(s) that are slightly bitter in taste and light bodied, we may use the following SQL statement:
SQL> SELECT WineName FROM Knowledgebase WHERE TasteQualifier='slightly' AND TasteValue='bitter' AND BodyValue='light';
Executing the above SQL statement returns Grignolino, a wine having a slightly bitter taste and a light body.
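The query above can be tried against a small in-memory database using Python's standard sqlite3 module. This is a sketch under stated assumptions: the paper does not give the schema, so the column list is inferred from the column names used in the queries (ColorQualifier and BodyQualifier are added by analogy), and only two rows are inserted, filled in from the query results reported in this section, with unknown fields left null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: one value column and one qualifier column per property.
conn.execute(
    """
    CREATE TABLE Knowledgebase (
        WineName        VARCHAR PRIMARY KEY,
        ColorQualifier  VARCHAR, ColorValue  VARCHAR,
        TasteQualifier  VARCHAR, TasteValue  VARCHAR,
        BodyQualifier   VARCHAR, BodyValue   VARCHAR,
        FlavorQualifier VARCHAR, FlavorValue VARCHAR
    )
    """
)

# Rows reconstructed from the query results reported in the text.
rows = [
    ("Grignolino", None, "rust", "slightly", "bitter", None, "light", None, "delicate"),
    ("Valpolicella", None, "red", "slightly", "sweet", None, None, None, None),
]
conn.executemany("INSERT INTO Knowledgebase VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

query = (
    "SELECT WineName FROM Knowledgebase "
    "WHERE TasteQualifier='slightly' AND TasteValue='bitter' AND BodyValue='light'"
)
result = [name for (name,) in conn.execute(query)]
print(result)  # ['Grignolino']
```

Storing the qualifier and the value in separate columns is what allows a query to constrain either one independently, for example asking for any slightly qualified taste regardless of its value.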
Similarly, users may use the following SQL statements to extract other information.
• List the name of red and slightly sweet wine(s)
SQL> SELECT WineName FROM Knowledgebase WHERE ColorValue='red' AND TasteQualifier='slightly' AND TasteValue='sweet';
Result: Valpolicella
• List the name of medium sweet or fresh fruity wine(s)
SQL> SELECT WineName FROM Knowledgebase WHERE (TasteQualifier='medium' AND TasteValue='sweet') OR (FlavorQualifier='fresh' AND FlavorValue='fruity');
Result: Niagara, Pink Catawba, Blush Niagara
• List the name of rust colored, slightly bitter, light bodied and delicate flavored wine(s)
SQL> SELECT WineName FROM Knowledgebase WHERE ColorValue='rust' AND TasteQualifier='slightly' AND TasteValue='bitter' AND BodyValue='light' AND FlavorValue='delicate';
Result: Grignolino
4. Conclusions and Future Work
In this paper we have proposed creating a fuzzy ontology structure by embedding fuzzy concept modifiers in the existing domain ontology. The fuzzy ontology structure is then used to extract imprecise concept descriptions from unstructured text documents and to store them in structured form. The main advantage of our model is that, instead of creating a new fuzzy ontology structure, it redefines the existing domain ontology and incorporates fuzzy concept modifiers into it. Furthermore, it uses the existing Structured Query Language for relational databases instead of designing a document-specific query language to extract concepts from Web documents. The model is easily adaptable. For a new domain, one has to define a new set of qualifiers for the slots of the domain ontology under consideration, because these are context-dependent. Currently, we are enhancing the model to learn from the extracted information. This can help incorporate new qualifiers, values or even properties as resources in the ontology.
5. References
[1] Andreasen, T., Fischer Nilsson, J., Erdman Thomsen, H., Ontology-based Querying, in: H. L. Larsen et al. (Eds.), Flexible Query Answering Systems, Recent Advances, Physica-Verlag, Springer, 15–26, 2000.
[2] Andreasen, T., Jensen, P. A., Fischer Nilsson, J., Paggio, P., Pedersen, B. S., Erdman Thomsen, H., Content-based Text Querying with Ontological Descriptors, Data & Knowledge Engineering, 48(2), 199–219, 2004.
[3] Andreasen, T., Jensen, P. A., Fischer Nilsson, J., Paggio, P., Pedersen, B. S., Erdman Thomsen, H., ONTOQUERY: Ontology-based Querying of Texts, AAAI 2002 Spring Symposium, Stanford, California, 2002.
[4] Broekstra, J., Klein, M., Decker, S., Fensel, D., Van Harmelen, F., Horrocks, I., Enabling Knowledge Representation on the Web by Extending RDF Schema, Proc. 10th Int’l World Wide Web Conference, Hong Kong, 2001.
[5] Guarino, N., Masolo, C., Vetere, G., OntoSeek: Content-based Access to the Web, IEEE Intelligent Systems, 14(3), 70–80, 1999.
[6] Snoussi, H., Magnin, L., Nie, J.-Y., Toward an Ontology-based Web Data Extraction, in: Workshop on Business Agents and the Semantic Web, Proceedings of the Fifteenth Canadian Conference on Artificial Intelligence (AI’2002), Calgary, Alberta, Canada, May 26, 2002.
[7] Lu, J.-J., Hsu, C.-N., Query Answering using Ontologies in Agent-based Resource Sharing Environment for Biological Web Information Integrating, in: Proceedings of IJCAI-2003 Workshop on Information Integration on the Web, Menlo Park, CA, AAAI Press, 2003.
[8] Liddle, S. W., Hewett, K. A., Embley, D. W., An Integrated Ontology Development Environment for Data Extraction, Proc. ISTA’03, June 2003.
[9] Zelenko, D., Aone, C., Richardella, A., Kernel Methods for Relation Extraction, Journal of Machine Learning Research, 3, 1083–1106, 2003.