12/04/2003

Ontology Construction for Structured Textual Data

Faisal I. Bashir, Hassan Mansoor
Dept. of Electrical & Computer Engineering, University of Illinois at Chicago, 851 S. Morgan St, SEO 1020, Chicago, IL, 60607.
{fbashi1, hmanso3}@uic.edu

Abstract. This paper explores the automatic extraction of semantic information from a given database schema and structured textual data for the purpose of ontology construction. We examine the advantages of ontology-based information search mechanisms and survey the different techniques currently employed for automated ontology construction from a given text corpus and from unstructured multimedia data, such as images, audio and video. Finally, we propose an algorithm for ontology construction from given structured textual data with the help of the database schema. The approach that we discuss in detail comes from the domain of Formal Concept Analysis, which is based on Lattice Theory. We use the generic lexical ontology WordNet to generate word senses and to label the nodes of the ontology.

1 Introduction

The art of ranking things in genera and species is of no small importance and very much assists our judgment as well as our memory. For centuries, philosophers have sought universal categories for classifying everything that exists, lexicographers have sought universal terminologies for defining everything that can be said, and librarians have sought universal headings for storing and retrieving everything that has been written. The last decade has seen exponential development in storage, processing and sharing technologies. This has led to an explosion of information stored in both structured and non-structured form, shared mainly through the World Wide Web. Today, the semantic web has enlarged the task to classifying, labeling, defining, finding, integrating, and using everything on the World Wide Web, which is rapidly becoming the universal repository for all the accumulated knowledge, information, and data of mankind.

Apart from the WWW, corporations maintain their own internal databases and information warehouses. Database technologies have matured and are present wherever information is stored. Efficient search mechanisms for these huge databases are an active area of interest. The most widely used method for information retrieval in database systems is Query By Keyword (QBK), used for the retrieval of both structured textual and non-structured multimedia data. It works as follows: the user submits a query, preferably in natural language, describing his information needs. The system then searches indexed documents based on the keywords specified in the query. The goal is to achieve efficient search with a minimum amount of irrelevant information (high precision) while ensuring that relevant information is not overlooked (high recall).
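To make the two measures concrete, the following minimal sketch computes precision and recall for a single query. The document identifiers and the two result lists are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
narrow = ["d1", "d2"]                          # a short, exact result list
broad = ["d1", "d2", "d3", "d7", "d8", "d9"]   # a longer, looser result list

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

In this toy case the longer result list finds more of the relevant documents (higher recall) but also returns more irrelevant ones (lower precision).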
The documents that contain the same or similar keywords are ranked by the occurrence of those keywords and their relative importance in the overall query. One problem with this approach is that documents which convey the message the user is looking for, but do not contain the exact keywords, are ranked lower than the user expects. This problem can be addressed by query expansion methods: additional search terms are added to the original query based on the statistical co-occurrence of terms. Recall improves, but at the expense of precision [5]. It has been claimed that concept-based models using ontologies of the underlying data do a much better job of indexing information and make semantic search easier [2].

There are two distinct problems for an ontology-based model: one is the extraction of semantic concepts from the database schema and underlying data, and the other is the construction of relationships among these semantic concepts to generate the actual ontology. With respect to the former problem, the key issue is to identify appropriate concepts that describe and identify document contents at an abstract level. It is important to make sure that irrelevant concepts are not associated and that relevant concepts are not discarded. The second problem deals with the automatic construction of relationships among these nodes to generate ontologies. In this paper, we present our approach to tackle both problems in a unified system.

The rest of the paper is organized as follows: Section 2 presents an overview of related work in the domain of automated ontology construction for information retrieval tasks; Section 3 presents a specification of the ontology construction problem and the mathematical foundations of the FCA based approach; Section 4 presents details of our ontology construction algorithm using FCA; Section 5 presents the Concept Labeling portion of ontology construction; Section 6 rounds up the conclusions of this research effort.

2 Related Work

In this section, we first state the definition of ontology and the concepts associated with building one. Then we present a brief overview of the different approaches that have been used to solve the automated ontology construction problem. An ontology has been defined as “an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them”. The subject of ontology is the study of the categories of things that exist or may exist in some domain. Ontologies are normally represented by tree diagrams such as the one shown in Figure 1.

Figure 1: Tree of Porphyry.

It shows a version of the Tree of Porphyry, as it was drawn by the 13th century logician Peter of Spain. It illustrates the subcategories under Substance, which is called the supreme genus or the most general supertype. It also highlights how a categorization can be achieved by recursively defining and identifying Genus and Differentiae [7].

2.1 Ontology Construction from Text Corpus Analysis

Tariq et al. [8] address the problem of ontology construction from text corpus analysis in specialist text. The specialist text of a domain is a text repository that is peer-reviewed and reflects the opinion of the majority of the enterprise. They present three case studies of specialist text corpora corresponding to the Forensic Science, Breast Cancer Research, and Finance and Business domains. Specialist text is often heavily populated with special domain terms and their relationships as signaled in the text. The interrelation of terms in a domain, when made explicit using graphs, represents the ontology of the domain under consideration. Open Class Words (OCWs) dominate specialist language texts, particularly noun phrases (NPs): phrases used to name objects, events, actions and states relevant to the domain. The OCWs reflect the lexical choice of the domain, measured by frequency of occurrence. It has been claimed that specialist texts contain a so-called lexical signature that distinguishes them from general language text, such as the British National Corpus (BNC). They measure the frequency of occurrence of OCWs in specialist language text and compare it against their frequency in general language texts to obtain the weirdness coefficient (WC). Terms with high frequency and high weirdness are good candidate terms. Once the single and compound words have been identified automatically by comparing their distributions in the specialist corpus with a representative general language text, one can extract the semantic relationships between terms.
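A minimal sketch of how such a weirdness score could be computed, assuming it is the ratio of a term's relative frequency in the specialist corpus to its relative frequency in the general corpus. That ratio, the add-one smoothing for unseen terms, and the two toy "corpora" are assumptions made for illustration; they are not taken from [8].

```python
from collections import Counter


def weirdness(term, special, general):
    """Ratio of relative frequencies: specialist corpus over general corpus."""
    rel_special = special[term] / sum(special.values())
    # Add-one smoothing (an assumption) keeps the ratio finite for terms
    # that never occur in the general corpus.
    rel_general = (general[term] + 1) / (sum(general.values()) + 1)
    return rel_special / rel_general


# Toy stand-ins for a specialist corpus and a general-language corpus.
special = Counter("the fingerprint matched the fingerprint on the glass".split())
general = Counter("the cat sat on the mat and the dog saw the glass".split())

for term in ("fingerprint", "the"):
    print(term, weirdness(term, special, general))
```

In this toy data the domain word scores well above 1 while the function word scores below 1, which is the kind of contrast the candidate-term selection relies on.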
They outline Finite State Graphs for sentences which link terms based on Is-A, Is-Part-Of, Is-Kind-Of, etc. An example is a triple of phrases “X REL Y”, where X and Y are noun phrases (NPs) and REL is a phrase generally expressing IS-A. They identify many such cues and signals, along with other heuristics, for locating relations between synonyms and hyponyms. They extract the sentences containing lexico-syntactic cues and tag them to indicate the grammatical category of each word. All correct sentences are then parsed to extract hypernym-hyponym pairs. These pairs are merged using a tree data structure that finally constitutes a forest of subtrees representing the domain ontology. This model, in XML format, is then formalized into an ontology of the domain.

2.2 Hierarchy generation and Concept Labeling using SOMs and WordNet

A clustering-based approach is suggested in [4], in which hierarchy construction is achieved for given text documents by using Self Organizing Maps (SOMs). This method constructs the hierarchy from top to bottom. In order to tag each node in the hierarchy with an appropriate concept, they use an automatic concept selection algorithm based on WordNet. Their approach can be broken down into two major areas:

- Automatic Hierarchy Construction: For this they rely on a self-organizing tree clustering algorithm which constructs a hierarchy from top to bottom. They propose two algorithms, namely Hierarchical Growing Self Organizing Tree (HGSOT) and Modified Self Organizing Tree (MSOT). Automatic ontology construction using both algorithms is compared against a hierarchical agglomerative clustering algorithm, and both perform better in terms of precision and recall. Furthermore, HGSOT outperforms MSOT in their experiments.
- Automatic Concept Labeling: Once the clusters are formed, they assign a concept to each node in the hierarchy. For this, they propose two types of strategies and adopt a bottom-up concept assignment mechanism. First, for each cluster consisting of a set of documents, they assign a topic based on a topic tracking algorithm. If multiple concepts are candidates for a topic, they propose an intelligent method of arbitration. Next, to assign a concept to an interior node in the hierarchy, they use WordNet. Descendant concepts of the internal node are also identified in WordNet. From these identified concepts and their hypernyms they identify a more generic concept that is assigned as the concept for the interior node.
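The last step, picking a more generic concept for an interior node from the hypernyms of its descendants, can be sketched as a search for the most specific shared hypernym. The small child-to-parent table below is hand-written for illustration and stands in for WordNet lookups; it is not WordNet data, and the function is not claimed to be the procedure used in [4].

```python
# Hand-written child -> parent (hypernym) links; a stand-in for WordNet.
HYPERNYM = {
    "beer": "alcohol",
    "wine": "alcohol",
    "alcohol": "beverage",
    "tea": "beverage",
    "beverage": "food",
    "food": "entity",
}


def ancestors(concept):
    """Return the concept followed by its hypernyms, most specific first."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def most_specific_common_hypernym(concepts):
    """Label for an interior node whose children carry the given concepts."""
    chains = [ancestors(c) for c in concepts]
    common = set(chains[0]).intersection(*chains[1:])
    # Walking up from the first concept, the first shared ancestor
    # is the most specific one.
    for candidate in chains[0]:
        if candidate in common:
            return candidate
    return None


print(most_specific_common_hypernym(["beer", "wine"]))  # alcohol
print(most_specific_common_hypernym(["beer", "tea"]))   # beverage
```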

2.3 Stanford KSL - Ontolingua

The Stanford Knowledge Systems Laboratory (KSL) has developed a software tool, “Ontolingua”, which provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies [5]. The service is available over the WWW, and a user can log in to create his own ontologies. The software does not create a hierarchy from domain knowledge, or assign concepts to nodes to create an ontology of the domain. Instead, the user is expected to model the ontology himself. Once the user has modeled the ontology manually, he can enter it into Ontolingua, which provides a convenient interface for entering classes and their parent-child relations. The interface lets the user create a new ontology, enter a description of the ontology and its intended use, insert new classes which can be children of existing classes (like “Thing”), and insert new subclasses of classes in the ontology.

For example, to create an ontology of Movies, the user first studies the movie domain on his own and creates a model of the domain. He can then enter this model into Ontolingua to formalize it. A movie will have a lead artist, so a class named “Artist” can be defined. The domain of this class will be string, for the name of the artist. This class can be a child of more general classes in the Ontolingua environment. So, we can model it as a child of the generic class “Person” (Artist is-a Person), which is a child of class “Agent”, which is a child of class “Individual”, which is a child of class “Individual Thing”, which is a child of the uppermost class “Thing”. Similarly, the class “Awards” can be created as a subclass of “Thing”, and a subclass of Awards named “Oscar” can be defined. A class “Genre” is defined which has the subclasses “Action”, “Comedy”, “Drama”, etc. Relations between classes can be declared too.
For example, one relation can be “Granted To”, which relates the class Awards to the class Movies according to whether the corresponding award has been won by the movie. Functions on classes can also be defined; for instance, the function “Year_Of_Release” returns the year the movie was released. Based on these definitions, Ontolingua then summarizes the ontology in terms of classes and their relationships. One such output is:

************************ Ontolingua Output ***************************

Ontology documentation: This is an Ontology about movies.

Summary of Movies:

Movies includes the following ontologies: Agents

No ontologies include Movies.

Class hierarchy (22 classes defined):
  Artist
  Awards
    Academy
    Emmy
    Golden_Globe
    Oscar
  Genre
    Action
    Comedy
    Documentary
    Drama
    Sci-Fi
  Producer
  Soundtrack
  Year

4 relations defined: Granted_To, Plays_In, Staring, Won

3 functions defined: Has_Producer, Has_Soundtrack, Year_Of_Release

No individuals defined.

33 unnamed axioms defined.

No named axioms defined.

22 classes defined: Academy, Action, Artist, Awards, Comedy, Documentory, Drama, Emmy, Golden_Globe, Horror, Movie, Oscar, Producer, Sci-Fi, Soundtrack, Year

************************ Ontolingua Output ***************************

3 Problem Specification and Strategy

As stated in the previous sections, an ontology is a collection of concepts and their interrelationships which provides an abstract view of an application domain. We propose to use the technique of Formal Concept Analysis (FCA) [1] to construct a hierarchy for a given domain modeled by the database. FCA is a way to find, structure, and display relationships between concepts, which consist of attributes and objects. It helps in understanding a given domain and in building a domain model for it. FCA derives conceptual structures out of data. These structures can be represented graphically as conceptual hierarchies, allowing the analysis of complex structures and the discovery of dependencies within the data. FCA is increasingly applied in conceptual clustering, data analysis, information retrieval, knowledge discovery, and ontology engineering. Formal Concept Analysis is based on the philosophical understanding that a concept is constituted by two parts: its extension, which consists of all objects belonging to the concept, and its intension, which comprises


all attributes shared by those objects. This understanding makes it possible to derive all concepts from a given context (data table) and to introduce a subsumption hierarchy. Formal Concept Analysis arose twenty years ago as a mathematical theory. Its focus has shifted in recent years: nowadays FCA papers are presented almost exclusively at computer science conferences. Some of the recent books, tutorials and resources on FCA are mentioned in the Bibliography section.

In this section, we first introduce the mathematical foundation of FCA, presenting a concrete example of the end-to-end process of generating an ontology in a hypothetical context of beverages. In the next section, we show how FCA can be adapted to construct ontologies from a given database, casting our specific problem in the domain of FCA.

3.1 Formal Concept Analysis – Mathematical Foundation

To introduce the FCA method, we first have to define the term context, or formal context. A formal context is a triple $(G, M, I)$ which consists of a set $G$ of objects, a set $M$ of attributes and a binary (incidence) relation $I \subseteq G \times M$ between objects and attributes. A context is typically represented as a cross table whose rows are the objects, whose columns are the attributes, and whose cells are marked iff the incidence relation holds for the corresponding pair of object and attribute. As an example we present a context of beverages. The different drinks form the set $G$ of objects, and some possible features of drinks are collected in the set $M$ of attributes. The incidence relation $I$ is given by the cross table.

Objects \ Attributes | non-alcoholic | hot | caffeine | sparkling | alcoholic
Tea                  | x             | x   |          |           |
Coffee               | x             | x   | x        |           |
Mineralwater         | x             |     |          | x         |
Wine                 |               |     |          |           | x
Beer                 |               |     |          | x         | x
Cola                 | x             |     | x        | x         |
Champagne            |               |     |          | x         | x

Table 1: Cross Table for the context of Beverages.

The table should be read in the following way: each x marks a pair that is an element of the incidence relation $I$; e.g. (coffee, hot) is marked because the object coffee carries the attribute hot, whereas (mineralwater, hot) is not marked because mineral water is normally not hot. Thus $(g, m) \in I$ should be interpreted as "the object $g$ carries the attribute $m$".

The central notion of FCA is the formal concept. A concept $(A, B)$ is defined as a pair of a set of objects $A \subseteq G$ and a set of attributes $B \subseteq M$ which fulfils certain conditions. $A$ is called the extent and $B$ the intent of the concept. To state the necessary and sufficient conditions for a formal concept, we introduce two derivation operators. Given $A \subseteq G$ we define $A' := \{ m \in M \mid \forall g \in A : (g, m) \in I \}$, and dually for $B \subseteq M$, $B' := \{ g \in G \mid \forall m \in B : (g, m) \in I \}$. $A'$ contains all attributes that are common to all objects in $A$, and $B'$ is the set of all objects that carry all the attributes of $B$.
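The two derivation operators can be sketched in a few lines. The context below is an assumption: it encodes the beverage cross table as it can be read from Table 1 and the statements made about it in the text, and the function names are chosen for illustration.

```python
# Beverage context: object -> set of attributes it carries (assumed reading
# of the cross table).
CONTEXT = {
    "tea": {"non-alcoholic", "hot"},
    "coffee": {"non-alcoholic", "hot", "caffeine"},
    "mineralwater": {"non-alcoholic", "sparkling"},
    "wine": {"alcoholic"},
    "beer": {"alcoholic", "sparkling"},
    "cola": {"non-alcoholic", "caffeine", "sparkling"},
    "champagne": {"alcoholic", "sparkling"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """A': attributes shared by every object in A (all of M if A is empty)."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def common_objects(attributes):
    """B': objects that carry every attribute in B."""
    return {g for g, own in CONTEXT.items() if set(attributes) <= own}


print(common_attributes({"tea", "coffee"}))         # the set {non-alcoholic, hot}
print(common_objects({"alcoholic", "sparkling"}))   # the set {beer, champagne}
```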

With that, the pair $(A, B)$ is a formal concept iff $A' = B$ and $A = B'$. This property says that all objects of the concept carry all its attributes and that there is no other object in $G$ carrying all attributes of the concept. In the cross table, this property shows up as a maximal rectangle completely covered with crosses; e.g. the four cells associated with tea, coffee, non-alcoholic, and hot constitute such a rectangle. If we ignore the order of the rows and columns we can identify even more concepts; e.g. ignoring the row cola and the column caffeine (or moving them to another place) we obtain another rectangle/concept, namely the cells associated with the objects beer and champagne and the attributes alcoholic and sparkling. From the definition of a formal concept one can easily see that for all $A \subseteq G$ the pair $(A'', A')$ is a formal concept. The dual holds for all $B \subseteq M$, i.e. $(B', B'')$ is always a formal concept, too. The sets of concepts obtained in these two ways are equal and contain exactly the concepts existing in the given context.

For formal concepts a natural subconcept/superconcept relationship can then be defined as follows:

$(A_1, B_1) \le (A_2, B_2) :\Leftrightarrow A_1 \subseteq A_2 \ (\Leftrightarrow B_2 \subseteq B_1)$

This relationship shows the dualism that exists between attributes and objects of concepts. A concept $C_1 = (A_1, B_1)$ is a subconcept of a concept $C_2 = (A_2, B_2)$ iff the set of its objects is a subset of the objects of $C_2$ or, equivalently, iff the set of its attributes is a superset of the attributes of $C_2$. The set of all formal concepts of a context forms a so-called concept lattice. For a context $(G, M, I)$, the infimum of this lattice is $(\emptyset, M)$ and its supremum is $(G, \emptyset)$, provided that no object carries all attributes and no attribute is carried by all objects (in general they are $(M', M)$ and $(G, G')$).
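Since $(B', B'')$ is a formal concept for every attribute set $B$, a small context can be explored by brute force over all attribute subsets. The sketch below does this for the beverage example; the context encoding is an assumed reading of the cross table, the function names are illustrative, and the exhaustive loop is only practical for a handful of attributes.

```python
from itertools import combinations

# Assumed reading of the beverage cross table: object -> attributes.
CONTEXT = {
    "tea": {"non-alcoholic", "hot"},
    "coffee": {"non-alcoholic", "hot", "caffeine"},
    "mineralwater": {"non-alcoholic", "sparkling"},
    "wine": {"alcoholic"},
    "beer": {"alcoholic", "sparkling"},
    "cola": {"non-alcoholic", "caffeine", "sparkling"},
    "champagne": {"alcoholic", "sparkling"},
}
ATTRIBUTES = sorted(set().union(*CONTEXT.values()))


def extent(attrs):
    """B': objects carrying every attribute in attrs."""
    return frozenset(g for g, own in CONTEXT.items() if set(attrs) <= own)


def intent(objs):
    """A': attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for g in objs:
        shared &= CONTEXT[g]
    return frozenset(shared)


def all_concepts():
    """Collect (B', B'') for every attribute subset B; duplicates collapse."""
    found = set()
    for size in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            objs = extent(attrs)
            found.add((objs, intent(objs)))
    return found


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2."""
    return c1[0] <= c2[0]


concepts = all_concepts()
print(len(concepts))  # 11 for this context
```

The dual characterization of the order can be checked on the result: for any two concepts, the extent of one is contained in the extent of the other exactly when the intents are contained the other way round.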
Because of the dualism between objects and attributes, and because data analysts and other users of FCA are interested in investigating structures and relationships, we need a representation of concepts that treats objects and attributes alike. This representation is realized in a line diagram, which is presented in the next subsection.

3.2 Ontology Visualization – Line Diagrams

A line diagram is a graphical visualization of the concept lattice. It allows the investigation and interpretation of relationships between concepts, objects and attributes. This includes object hierarchies, if they exist in the given context. A line diagram contains the relationships between objects and attributes and is thus an equivalent representation of a context, i.e. it contains exactly the same information as the cross table. Dependencies and relationships between attributes can also be easily detected in a line diagram. Figure 2 shows the line diagram for the context of beverages presented above.


Figure 2: Line diagram for the context of beverages.

The graph consists of nodes that represent the concepts and edges connecting these nodes. Two nodes $C_1$ and $C_2$ are connected iff $C_1 < C_2$ and there is no concept $C_3$ with $C_1 < C_3 < C_2$. Although the concept lattice is a directed acyclic graph (DAG), the edges are not drawn with arrowheads; instead, the convention holds that a superconcept always appears above all of its subconcepts. For example, the line diagram shows that the nodes annotated with coffee and with cola are both subconcepts of the node annotated with caffeine. In contrast to usual lattice diagrams, the labeling in line diagrams is reduced, i.e. each object and each attribute is entered only once. So the nodes are not annotated with their complete extents and intents. Rather, attributes and objects propagate along the edges, as a kind of inheritance: attributes propagate along the edges to the bottom of the diagram and, dually, objects propagate to the top of the diagram. Thus the top element of a line diagram (the supremum of the lattice) is actually $(G, \emptyset)$ if $G$ is the set of objects, and the bottom element (the infimum) is $(\emptyset, M)$ if $M$ is the set of attributes. Attribute names are always displayed slightly above the node and object names slightly below the respective node.

To read a line diagram, start at the object, attribute, or concept you are interested in, e.g. the node marked with cola. Following all paths from this node to the top element, one visits all superconcepts of the selected concept. Collecting the attributes displayed along these paths, one finds all attributes that the selected concept or object carries. Selecting a node and following all paths from this node to the infimum of the lattice, one finds all of its subconcepts. If the selected node displays an attribute name, all objects along these paths form the set of objects carrying this attribute.
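The edges and the reduced labels of a line diagram can be derived mechanically from the set of concepts. The following sketch does so for the beverage example; the context encoding is an assumed reading of the cross table, and the brute-force enumeration is only meant for small contexts.

```python
from itertools import combinations

# Assumed reading of the beverage cross table: object -> attributes.
CONTEXT = {
    "tea": {"non-alcoholic", "hot"},
    "coffee": {"non-alcoholic", "hot", "caffeine"},
    "mineralwater": {"non-alcoholic", "sparkling"},
    "wine": {"alcoholic"},
    "beer": {"alcoholic", "sparkling"},
    "cola": {"non-alcoholic", "caffeine", "sparkling"},
    "champagne": {"alcoholic", "sparkling"},
}
ATTRIBUTES = sorted(set().union(*CONTEXT.values()))


def extent(attrs):
    """Objects carrying every attribute in attrs."""
    return frozenset(g for g, own in CONTEXT.items() if set(attrs) <= own)


def intent(objs):
    """Attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for g in objs:
        shared &= CONTEXT[g]
    return frozenset(shared)


# All concepts as (extent, intent) pairs, found by closing every attribute subset.
concepts = {(extent(b), intent(extent(b)))
            for k in range(len(ATTRIBUTES) + 1)
            for b in combinations(ATTRIBUTES, k)}

# Edge (lower, upper): lower < upper with no concept strictly in between.
edges = {(lo, hi) for lo in concepts for hi in concepts
         if lo[0] < hi[0]
         and not any(lo[0] < mid[0] < hi[0] for mid in concepts)}

# Reduced labeling: each object is written at its object concept,
# each attribute at its attribute concept.
object_node = {g: (extent(own), frozenset(own)) for g, own in CONTEXT.items()}
attribute_node = {m: (extent({m}), intent(extent({m}))) for m in ATTRIBUTES}

print(len(edges))                                      # 16 edges in the diagram
print(object_node["beer"] == object_node["champagne"]) # True: same node
```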
Thus, line diagrams display relationships between objects, attributes and concepts in an easily perceivable way. For example, the line diagram above reveals that beer and champagne are equivalent objects. Of course, one has to pay attention to the context (in the colloquial as well as the formal sense): with respect to the given formal context, beer and champagne are equal because they carry exactly the same attributes, namely sparkling and alcoholic. Their equivalence can be seen in the line diagram by their appearance at the same concept node. Line diagrams also display object hierarchies and explicitly show why some concepts are specializations of others. For example, the line diagram shows that cola is a subconcept of mineralwater. In effect it says "cola is mineralwater with caffeine", which is very close to the truth. Line diagrams also reveal implications; e.g. the example line diagram shows that any object that is hot also carries the attribute non-alcoholic. That is because all subconcepts of the concept annotated with hot are also subconcepts of the node annotated with non-alcoholic.
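Implications between single attributes can also be checked directly on the cross table: an attribute p implies an attribute q when every object carrying p also carries q. The sketch below lists all such pairs for the beverage example; the context encoding is an assumed reading of the cross table, and the function names are illustrative.

```python
# Assumed reading of the beverage cross table: object -> attributes.
CONTEXT = {
    "tea": {"non-alcoholic", "hot"},
    "coffee": {"non-alcoholic", "hot", "caffeine"},
    "mineralwater": {"non-alcoholic", "sparkling"},
    "wine": {"alcoholic"},
    "beer": {"alcoholic", "sparkling"},
    "cola": {"non-alcoholic", "caffeine", "sparkling"},
    "champagne": {"alcoholic", "sparkling"},
}
ATTRIBUTES = sorted(set().union(*CONTEXT.values()))


def carriers(attribute):
    """Objects that carry the given attribute."""
    return {g for g, own in CONTEXT.items() if attribute in own}


def implications():
    """Pairs (p, q), p != q, such that every object carrying p also carries q."""
    return {(p, q) for p in ATTRIBUTES for q in ATTRIBUTES
            if p != q and carriers(p) <= carriers(q)}


print(sorted(implications()))
# [('caffeine', 'non-alcoholic'), ('hot', 'non-alcoholic')]
```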

4 FCA based Ontology Construction

The target domain for which we have chosen to construct an ontology is Motion Picture databases. The database is represented in the form of a modified cross table. As usual, the table has attributes as columns and objects as rows, but the cells can hold specific terms instead of check marks. This means that the attributes represent m-ary rather than binary characteristics of objects. An example is shown in Table 2. An ‘x’ placed under an attribute means that the value of the attribute is not available for that particular object. Based on the database schema information (attributes) and the underlying data (objects), we generate a hierarchy that encapsulates the knowledge represented by the domain of movie databases.

Objects \ Attributes | Genre            | MPAA Rating | Awards
Brave Heart          | Drama            | R           | Oscar
English Patient      | Drama            | R           | Oscar
Lion King            | Action/Adventure | PG          | Golden Globe
Jurassic Park        | Action/Adventure | x           | Oscar

Table 2: Cross table corresponding to a movie database.

In the framework of FCA, a formal context $(G, M, I)$ represents movies, their attributes and the relations between them. The set $G$ of objects contains the movie titles, say, {Brave Heart, English Patient, …}. The set $M$ contains the attributes corresponding to each tuple, say, {Genre, MPAA Rating, Awards, …}. The incidence relation $I \subseteq G \times M$ is not binary in our case, because the cross table does not contain only crosses marking the presence or absence of an attribute. Rather, it carries specific information about the attribute value for each object. One way around the problem of a non-binary incidence relation is to split the attributes into atomic classes and use the new set of expanded attributes to create a cross table. For instance, the Genre attribute in Table 2 can be split into three attributes, Gen-D, Gen-C and Gen-A. Similarly, the MPAA Rating attribute can be broken down into Rat-R, Rat-PG13 and Rat-PG. The Awards attribute is replaced by Awards-Oscar and Awards-GG, since this attribute has two instances, namely Oscar and Golden Globe. Now the relation $I$ between the set of objects $G$ and the set of attributes $M$ is purely binary, and the classic FCA theory outlined in section 3 can be applied. Table 3 shows an expanded cross table (with a reasonably large number of objects and attributes) whose binary incidence relation is marked by crosses. We did not include more than 8 attributes because the line diagram then becomes too cluttered to draw and understand.

Objects \ Attributes | GenD | GenC | GenA | RatR | RatPG13 | RatedPG | AwardsOscar | AwardsGG
Brave Heart          | x    |      |      | x    |         |         | x           |
Forrest Gump         | x    |      |      |      | x       |         | x           | x
[Further rows: English Patient, Titanic, Chicago, Monsters Inc, Ice Age, Finding Nemo, Lion King, Harry Potter I, Jurassic Park, Starwars, E.T., Starwars Ep. I, Spiderman, Independence Day, Lord of the Rings. The column positions of their cross marks are not recoverable.]

Table 3: Cross Table for Movie database problem showing binary relation set I.
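The splitting of many-valued attributes into binary ones can be sketched as follows. The rows are those of Table 2; the flag naming scheme "Attribute=value" and the function name are choices made for this illustration and differ from the abbreviated names (Gen-D, Rat-R, ...) used in the paper.

```python
def scale(rows, missing="x"):
    """Turn {object: {attribute: value}} into {object: set of binary flags}."""
    scaled = {}
    for obj, values in rows.items():
        flags = set()
        for attribute, value in values.items():
            # A cell may hold one value or several (e.g. two awards).
            for v in ([value] if isinstance(value, str) else value):
                if v != missing:          # 'x' marks an unavailable value
                    flags.add(f"{attribute}={v}")
        scaled[obj] = flags
    return scaled


# Rows of Table 2.
MOVIES = {
    "Brave Heart": {"Genre": "Drama", "MPAA Rating": "R", "Awards": "Oscar"},
    "English Patient": {"Genre": "Drama", "MPAA Rating": "R", "Awards": "Oscar"},
    "Lion King": {"Genre": "Action/Adventure", "MPAA Rating": "PG",
                  "Awards": "Golden Globe"},
    "Jurassic Park": {"Genre": "Action/Adventure", "MPAA Rating": "x",
                      "Awards": "Oscar"},
}

for title, flags in scale(MOVIES).items():
    print(title, sorted(flags))
```

Each resulting flag is either present or absent for an object, so the scaled table is an ordinary binary context to which the operators of section 3 apply.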

Based on the FCA theory, we output the results of ontology generation in the form of a line diagram. As stated in section 3.2, a line diagram contains the relationships between objects and attributes and is thus a representation equivalent to the cross table. Figure 5 shows the line diagram corresponding to the context of movies in Table 3. Our approach to the ontology creation system is outlined in Figure 3, which identifies the different modules of the system. The ‘Attribute/Data Analysis’ module parses the attributes from the database schema information and the structured data in the tables of the database to generate candidate terms. These candidate terms are used to label the nodes and edges of the final ontology DAG. The inter-relationships between these terms are generated using WordNet for conceptual modeling. ‘Conceptual Modeling’ is performed using the FCA based approach given by the algorithm in the next subsection. After this phase, the intermediate results are formalized into an ontology in the form of a line diagram.

[Figure 3 diagram. Labeled components: Other Knowledge Source, Database Schema/Data, WordNet, Attribute/Data Analysis, Candidate Terms, Term Base, Conceptual Modeling, Formalization. An example output graph shows the movies Brave Heart, Ice Age, English Patient and Jurassic Park together with the labels Genre, Rating and Awards.]

Figure 3: Proposed architecture for ontology generation given a movie database schema and data.

4.1 FCA Based Ontology – Algorithm:

In this section we describe an algorithm based on Formal Concept Analysis which constructs a hierarchy given the set of objects ($G$) and the set of attributes ($M$). This algorithm does not use WordNet and does not label the nodes it creates. It creates a hierarchy of nodes and constructs the basic ontology of the domain modeled by the given data. The algorithm generates the ontology in a top-down fashion. The root of the resulting line diagram represents the supremum $(G, \emptyset)$, while the bottom node represents the infimum $(\emptyset, M)$. We start at the top level of the hierarchy with nodes that model a single attribute and multiple objects, and we end at the bottom level with nodes that represent multiple attributes and single objects.

Parameters:
B: The set of attribute sets drawn from the database schema. Its members may be single attributes or (non-repetitive) combinations of them. B_ij denotes the j-th attribute set at the i-th level of the hierarchy, which contains i atomic attributes (attributes from the original schema).
n: The total number of attributes in the database schema.
N: The maximum number of x's in any tuple. This is the total number of levels in the hierarchy.
Node: The set containing the ontology. Node(i,j) represents the j-th node at the i-th level of the hierarchy.

Algorithm:
1: Compute N, the maximum number of x's in any tuple.
2: Start with the set B containing single attributes from the database schema: B = {B_ij | i = 1, j = 1, …, n}.
3: For i = 1 to N {
     For j = 1 to sizeof(B) {
       a = B_ij'
       b = B_ij
       If (a' == b and a == b') {
         Node(i,j).AttributeList = b
         Num = total number of objects that have x's in all attributes of the set b
         Node(i,j).ObjectSize = Num
       }
     }
     // Make attribute sets of length i+1 from combinations of B_i and repeat.
   }

Figure 4: Algorithm for the generation of Ontology from database schema and data.
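A runnable sketch of the level-wise procedure in Figure 4 is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, attribute sets with an empty extent are skipped as a simplification, and the context contains only the six movies whose full attribute sets can be read off Figure 5, so the resulting levels differ from those of the full Table 3.

```python
from itertools import combinations

# Six rows whose attribute sets appear in Figure 5 (a subset of Table 3).
CONTEXT = {
    "Brave Heart":     {"Gen-D", "Rat-R", "Aw-Osc"},
    "English Patient": {"Gen-D", "Rat-R", "Aw-Osc"},
    "Forrest Gump":    {"Gen-D", "Rat-13", "Aw-Osc", "Aw-GG"},
    "Chicago":         {"Gen-C", "Rat-13", "Aw-Osc", "Aw-GG"},
    "Lion King":       {"Gen-A", "Rat-PG", "Aw-GG"},
    "E.T.":            {"Gen-A", "Rat-PG", "Aw-GG"},
}


def build_levels(context):
    """Return {level i: nodes whose attribute set has size i and is a concept intent}."""
    attributes = sorted(set().union(*context.values()))
    n_levels = max(len(own) for own in context.values())  # N: most x's in a tuple
    levels = {}
    for i in range(1, n_levels + 1):
        nodes = []
        for candidate in combinations(attributes, i):      # attribute sets of size i
            b = frozenset(candidate)
            a = frozenset(g for g, own in context.items() if b <= own)  # a = b'
            if not a:
                continue  # simplification: skip sets that no object carries
            # a' = attributes common to all objects in a
            a_prime = frozenset.intersection(*(frozenset(context[g]) for g in a))
            if a_prime == b:                               # (a, b) is a formal concept
                nodes.append({"attributes": b, "objects": a})
        levels[i] = nodes
    return levels


levels = build_levels(CONTEXT)
for i, nodes in levels.items():
    print(i, [sorted(node["attributes"]) for node in nodes])
```

On this reduced context, {Rat-R} is rejected at level 1 for the same reason as in the paper's counterexample: every movie rated R here is also a drama with an Oscar.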

4.2 FCA Based Ontology – Example:

In this section we describe how the algorithm in Fig. 4 works in the context of a movie database as in Table 3.

- Step 1: N = 4. So, the maximum number of levels in the hierarchy is 4.
- Step 2: B = {Gen-D, Gen-C, Gen-A, Rat-R, Rat-13, Rat-PG, Aw-Osc, Aw-GG}.
- Step 3: Let b = {Gen-C} (say); then a = b' = {Chicago, Monsters Inc, Ice Age, Finding Nemo}, and a' = the set of attributes common to all objects in a = {Gen-C} = b. So (a, b) is a formal concept, and b = {Gen-C} is added as a node at the 1st level of the hierarchy. As a counterexample, let b = {Rat-R}; then a = {Brave Heart, English Patient} and a' = {Rat-R, Gen-D, Aw-Osc} != b. So (a, b) is not a formal concept, and b = {Rat-R} is not added as a node at level 1 of the final ontology. The process is then repeated for attribute sets containing multiple attributes.

Supremum

Level 1: {Gen-D} | {Gen-C} | {Gen-A} | {Rat-13} | {Rat-PG} | {Aw-Osc} | {Aw-GG}
Level 2: {Gen-D, Rat-13} | {Gen-D, Aw-Osc} | {Gen-C, Rat-PG} | {Gen-A, Rat-13} | {Gen-A, Rat-PG}
Level 3: {Gen-D, Rat-R, Aw-Osc}: BraveHeart, English Patient | {Gen-A, Rat-PG, Aw-GG}: Lion King, E.T. | {Rat-13, Aw-Osc, Aw-GG}: Chicago, Forrest Gump
Level 4: {Gen-D, Rat-13, Aw-Osc, Aw-GG}: Forrest Gump | {Gen-C, Rat-13, Aw-Osc, Aw-GG}: Chicago

Infimum

Figure 5: Example Ontology constructed from Movie Database in Table 3. Object Sets (A’s) are not listed at level 1 and 2 because the list is long at those levels.

4.3 FCA Based Ontology – Implementation:

The implementation of the project is under way using the following components and tools.

Database                               | Internet Movie Database (IMDB), http://imdb.com/Charts/usatopmovies
DBMS                                   | MS Access
Universal Data Access (UDA) Technology | ODBC
Platform                               | Win 2000 Professional
Language                               | MS Visual C++ 6.0
Library                                | Microsoft Foundation Classes (MFC)
Library-specific Classes               | CDatabase, CRecordset

Table 4: Tools and Technologies involved in implementation.
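The stack in Table 4 is tied to Windows and ODBC. For readers who want to experiment with the same idea elsewhere, the sketch below uses Python's built-in sqlite3 module as a stand-in: it counts the titles that carry an attribute and then checks whether those titles all share some other attribute. The table name Movies and the column Gen_D come from the paper; the remaining column names, the helper function and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movies (Title TEXT, Gen_D INTEGER, Rat_R INTEGER, "
    "Aw_Osc INTEGER, Aw_GG INTEGER)"
)
# Hypothetical rows: 1 marks a cross in the binary cross table.
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?, ?, ?)",
    [
        ("Brave Heart", 1, 1, 1, 0),
        ("English Patient", 1, 1, 1, 0),
        ("Forrest Gump", 1, 0, 1, 1),
        ("Lion King", 0, 0, 0, 1),
    ],
)

ATTRIBUTES = ["Gen_D", "Rat_R", "Aw_Osc", "Aw_GG"]


def is_formal_concept(attribute):
    """True if no other attribute is shared by all titles carrying `attribute`."""
    if attribute not in ATTRIBUTES:   # column names come from a fixed list only
        raise ValueError(attribute)
    total = conn.execute(
        f"SELECT COUNT(*) FROM Movies WHERE {attribute} = 1"
    ).fetchone()[0]
    if total == 0:
        return False
    for other in ATTRIBUTES:
        if other == attribute:
            continue
        shared = conn.execute(
            f"SELECT COUNT(*) FROM Movies WHERE {attribute} = 1 AND {other} = 1"
        ).fetchone()[0]
        if shared == total:           # every selected title also carries `other`
            return False
    return True


print(is_formal_concept("Aw_GG"))  # True for these rows
print(is_formal_concept("Rat_R"))  # False: both R-rated titles are also dramas
```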

The implementation is centered on two MFC classes that interact with the ODBC driver, namely CDatabase and CRecordset. All the member functions of the two classes throw a CDBException if they fail, which we catch in order to display an appropriate message box to the user. The program flow, which follows the algorithm in Fig. 4, goes like this:

• Create an object of class CDatabase. This object interacts with the underlying database and connects to it using the installed ODBC driver. The database schema can be changed using CDatabase::ExecuteSQL(), but we do not need this for our project. CDatabase::Open() opens a connection to the database using the Data Source Name (DSN) as parameter.
• If CDatabase::Open() succeeds, create an object of CRecordset and call CRecordset::Open() to open a record set. An SQL statement can be passed in the Open() call. We first use “SELECT * FROM Movies”, where Movies is the name of the table (relation) in the database. If this call succeeds, we traverse the recordset by calling CRecordset::MoveNext() and checking CRecordset::IsEOF(), counting the number of checks in each tuple (record). The maximum count gives the total number of levels of hierarchy in the resulting ontology.
• Next we execute CRecordset::Requery() to requery the CRecordset object with an SQL statement of the form “SELECT * FROM Movies WHERE Gen_D = 1”. This selects all the tuples which fall under the category Genre-Drama. We traverse the CRecordset again using CRecordset::MoveNext() and put the titles in a separate array called Title-Gen-D. Note that CRecordset::MoveNext() traverses a different list this time, now that the SQL query is different.
• We then test whether Gen-D is a formal concept (a’ == b and b’ == a) by issuing a query that uses Title-Gen-D in the WHERE clause. If Gen-D is the only attribute that has a 1 in all tuples in the result of this query, we say that Gen-D is a formal concept and add this node to the first level of hierarchy. We repeat the same process for all attributes of the relation. This gives us the nodes of the ontology at the first level of hierarchy.
• The above two steps are then performed recursively for the next levels of hierarchy to yield the final ontology.
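The query flow above can be sketched in Python. Note the substitutions: the standard-library `sqlite3` module with an in-memory database stands in for MS Access, ODBC and the MFC classes, so none of the calls below are MFC calls. The table name `Movies` and the column `Gen_D` come from the text; the `Title` column, the other column names and all row values are assumptions made for this example.

```python
import sqlite3

ATTRIBUTES = ["Gen_D", "Gen_C", "Rat_R", "Rat_13", "Rat_PG", "Aw_Osc", "Aw_GG"]
ROWS = [  # hypothetical data: Title, then one 0/1 flag per attribute
    ("Brave Heart",     1, 0, 1, 0, 0, 1, 0),
    ("English Patient", 1, 0, 1, 0, 0, 1, 0),
    ("Forrest Gump",    1, 0, 0, 1, 0, 1, 1),
    ("Titanic",         1, 0, 0, 1, 0, 0, 0),
    ("Chicago",         0, 1, 0, 1, 0, 1, 1),
    ("Monsters Inc",    0, 1, 0, 0, 1, 0, 0),
    ("Ice Age",         0, 1, 0, 0, 1, 0, 0),
    ("Finding Nemo",    0, 1, 0, 0, 1, 0, 0),
]

conn = sqlite3.connect(":memory:")  # stands in for opening the DSN
columns = ", ".join(f"{a} INTEGER" for a in ATTRIBUTES)
conn.execute(f"CREATE TABLE Movies (Title TEXT, {columns})")
conn.executemany("INSERT INTO Movies VALUES (?, ?, ?, ?, ?, ?, ?, ?)", ROWS)

# First pass: the largest number of checks in any tuple gives the level count.
all_rows = conn.execute("SELECT * FROM Movies").fetchall()
max_levels = max(sum(row[1:]) for row in all_rows)


def titles_with(attribute):
    """Titles of all tuples that have a 1 in `attribute`."""
    if attribute not in ATTRIBUTES:  # column names cannot be bound as parameters
        raise ValueError(attribute)
    query = f"SELECT Title FROM Movies WHERE {attribute} = 1"
    return [title for (title,) in conn.execute(query)]


def is_first_level_node(attribute):
    """True if `attribute` is the only column set to 1 in all of its tuples."""
    titles = titles_with(attribute)
    if not titles:
        return False
    marks = ", ".join("?" for _ in titles)
    query = f"SELECT {', '.join(ATTRIBUTES)} FROM Movies WHERE Title IN ({marks})"
    rows = conn.execute(query, titles).fetchall()
    shared = {a for i, a in enumerate(ATTRIBUTES) if all(r[i] == 1 for r in rows)}
    return shared == {attribute}  # a' == b for b = {attribute}


first_level = [a for a in ATTRIBUTES if is_first_level_node(a)]
```

With the assumed rows, `first_level` contains Gen_D, Gen_C, Rat_13 and Aw_Osc. Checking the attribute name against a fixed list before building the query is a design choice of this sketch, since column names cannot be passed as bound parameters.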

5 Concept Labeling

Once the formal ontology showing the hierarchy of concepts has been created, as in Fig. 5, the next step is to label the individual nodes in the hierarchy. This step can be carried out based on the dense labels generated in the hierarchy construction step and/or with the help of a generic lexical ontology like WordNet. The approaches we have studied for this part are presented in this section.

5.1 FCA Based Ontology – Reduced Labeling
This is an extension of the FCA based concept lattice generation phase described in detail in the last section. If each node in the hierarchy is labeled with its extent as well as its intent, the resulting ontology becomes too cluttered to read and understand. Ganter et al [2] present the technique of reduced labeling, in which each attribute and each object is entered only once in the concept lattice; this keeps the ontology readable while leaving the Is-A relationships intact. To explain reduced labeling with a concrete example from our movie database, we model a smaller context of movies, represented by the following table, which contains the first half of the tuples of the original table.

Objects (rows): Brave Heart, Forrest Gump, English Patient, Titanic, Chicago, Monsters Inc, Ice Age, Finding Nemo
Attributes (columns): GenD, GenC, RatR, RatPG13, RatedPG, AwardsOscar, AwardsGG
[Cross table: an x marks each (object, attribute) pair in the relation I.]

Table 5: Smaller Cross Table for Movie database problem showing binary relation set I.

We apply the algorithm in Fig. 4 to this context to generate nodes at different levels of the hierarchy. The subconcept-superconcept relations between these nodes are given by the following definitions. Let (A1, B1) and (A2, B2) be formal concepts of (G, M, I). We say that (A1, B1) is a “Proper Subconcept” of (A2, B2) if (A1, B1) ≤ (A2, B2) and, in addition, (A1, B1) ≠ (A2, B2) holds. This is abbreviated as (A1, B1) < (A2, B2). Also, (A1, B1) is a “Lower Neighbor” of (A2, B2) if (A1, B1) < (A2, B2), but no formal concept (A, B) of (G, M, I) exists with (A1, B1) < (A, B) < (A2, B2). This is abbreviated as (A1, B1) ≺ (A2, B2).

Formal concepts are identified at each level of the hierarchy by the algorithm presented in Fig. 4. The corresponding concept lattice is generated in the form of a line diagram, which is then labeled. We create a node for each formal concept at every level of the hierarchy. Each node is then connected to its lower neighbors, as defined above. After that, we label the nodes. The nodes are first labeled with attribute names: we attach the attribute m to the node representing the concept (m’, m’’). The remaining nodes are labeled with object names: we attach each object g to the node representing the concept (g’’, g’). The final labeled ontology is shown in Fig. 6. Here, a concept node is labeled with an attribute m ∈ M if it is the largest concept (at the highest level of the hierarchy, closest to the Supremum) with m in its intent, and it is labeled with an object g ∈ G if it is the smallest concept with g in its extent.
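The definitions above translate fairly directly into code. The following is a minimal sketch under stated assumptions: it uses the standard FCA ordering in which (A1, B1) ≤ (A2, B2) holds when A1 ⊆ A2, it finds all concepts by brute force rather than by the algorithm of Fig. 4, and the table `context` is hypothetical (the rows for Titanic, Monsters Inc, Ice Age and Finding Nemo are assumed, so the result does not reproduce Fig. 6). All function names are illustrative.

```python
from itertools import combinations

# Hypothetical cross table: object -> set of attributes it has.
context = {
    "Brave Heart":     {"Gen-D", "Rat-R", "Aw-Osc"},
    "English Patient": {"Gen-D", "Rat-R", "Aw-Osc"},
    "Forrest Gump":    {"Gen-D", "Rat-13", "Aw-Osc", "Aw-GG"},
    "Titanic":         {"Gen-D", "Rat-13"},
    "Chicago":         {"Gen-C", "Rat-13", "Aw-Osc", "Aw-GG"},
    "Monsters Inc":    {"Gen-C", "Rat-PG"},
    "Ice Age":         {"Gen-C", "Rat-PG"},
    "Finding Nemo":    {"Gen-C", "Rat-PG"},
}
ALL_ATTRIBUTES = frozenset().union(*context.values())


def extent(attributes):
    """B': objects that have every attribute in `attributes`."""
    return frozenset(g for g, has in context.items() if set(attributes) <= has)


def intent(objects):
    """A': attributes shared by every object in `objects`."""
    shared = set(ALL_ATTRIBUTES)
    for g in objects:
        shared &= context[g]
    return frozenset(shared)


def all_concepts():
    """Brute force: close every attribute subset (fine for a few attributes)."""
    found = set()
    for k in range(len(ALL_ATTRIBUTES) + 1):
        for subset in combinations(sorted(ALL_ATTRIBUTES), k):
            a = extent(subset)
            found.add((a, intent(a)))
    return found


def is_proper_subconcept(c1, c2):
    """(A1, B1) < (A2, B2): A1 is a proper subset of A2."""
    return c1[0] < c2[0]


def lower_neighbors(c, concepts):
    """Concepts directly below c, with no concept strictly in between."""
    below = [d for d in concepts if is_proper_subconcept(d, c)]
    return [d for d in below
            if not any(is_proper_subconcept(d, e) for e in below)]


def reduced_labels(concepts):
    """Attach each attribute m to (m', m'') and each object g to (g'', g')."""
    labels = {c: {"attributes": set(), "objects": set()} for c in concepts}
    for m in ALL_ATTRIBUTES:
        a = extent({m})
        labels[(a, intent(a))]["attributes"].add(m)
    for g in context:
        b = intent({g})
        labels[(extent(b), b)]["objects"].add(g)
    return labels


concepts = all_concepts()
labels = reduced_labels(concepts)
```

In the result, every attribute and every object appears in exactly one label set, which is the property reduced labeling is meant to guarantee.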

[Line diagram from Supremum (top) through Levels 1 to 4 to Infimum (bottom). Attribute labels: Gen-D, Gen-C, Rat-13, Aw-Osc, Rat-PG, Aw-GG, Rat-R. Object labels: Titanic, Brave Heart, Forrest Gump.]

Figure 6: Example Ontology constructed from Movie Database in Table 5.

The manual construction of the ontology in Fig. 6 is compared against a standard implementation, the freeware tool ConceptExplorer (Bibliography [8]). The output of ConceptExplorer is displayed in Fig. 7.

5.2 Concept Labeling using WordNet
A similar two-level approach is taken by Khan et al [4], as already pointed out in the related work section. They first generate the hierarchy using clustering techniques modified from Self Organizing Maps (SOMs). Once the hierarchy is constructed, the concept assignment phase follows. Concept assignment proceeds in a bottom-up fashion. Each leaf node in the generated hierarchy corresponds to a document, or a small set of documents, on one topic, such as “Gold” or “Copper”. A parent node is assigned a concept based on the concepts of its child nodes and on their hypernyms and synonyms, which are obtained from WordNet. For example, “Asset” is assigned to the parent node of the Gold and Copper nodes.


Figure 7: Output of ConceptExplorer for Ontology construction from Movie Database in Table 5.

6 Conclusions

In this paper, we have studied the domain of ontology construction in general and ontology construction from a structured database in particular. We have outlined some approaches already used in the literature for the problem of automated ontology derivation. We have argued that ontology generation is in fact a two-step process. In the first step, a hierarchy of the domain is constructed based on domain knowledge inferred from a text corpus in the case of an unstructured document repository, or from database relations in the case of structured data. This hierarchy models the explicit relations between the attributes and objects represented in the database, based on its schema and embedded data. Once this hierarchy has been generated, the second phase, Concept Labeling, follows. In this phase, individual concepts are assigned to the nodes of the ontology based on the information corresponding to each node and on its hypernyms and synonyms generated from WordNet.

References
[1] Erdman Michael, “Formal Concept Analysis to learn from Sisyphus-III Material”, In: Brian R. Gaines, Mark Musen (eds.): Proceedings of the 11th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'98), Banff, Canada, 1998. pp. SIS-2. http://www.aifb.uni-karlsruhe.de/~mer/Pubs/Sisy_FCA/
[2] Ganter Bernhard, Wille Rudolf, “Applied Lattice Theory: Formal Concept Analysis”. http://citeseer.nj.nec.com/ganter97applied.html
[3] Khan L, “Ontology-based Information Selection”, Ph.D. Thesis, University of Southern California, 2000. http://citeseer.nj.nec.com/khan00ontologybased.html
[4] Khan L, Luo F, Yen I, “Automatic ontology derivation from documents”, Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE), August, 2002. http://www.utdallas.edu/~lkhan/papers/AODD_IEEETKDE2002.pdf
[5] Knowledge Systems Laboratory (KSL), Stanford University. “Ontolingua”. http://www.ksl.stanford.edu/software/ontolingua/
[6] Peat H.J., Willett P., “The Limitations of Term Co-Occurrence Data for Query Expansion in Document Retrieval Systems”, Journal of ASIS, vol. 42, no. 5, pp. 378-383, 1991.
[7] Sowa J. F., “Building, Sharing, and Merging Ontologies”. http://users.bestweb.net/~sowa/ontology/ontoshar.htm
[8] Tariq M et al, “Experiments in ontology construction from specialist text”, EuroLan 2003, The Semantic web and language technology: Its potentials and practicalities. July 28-Aug. 8, 2003, Bucharest, Romania. http://www.racai.ro/EUROLAN2003/html/workshop/TariqManumaisupat/TariqManumaisupat.pdf

Bibliography
[1] Stumme Gerd, Tutorial Formal Concept Analysis, EKAW 2002. http://www.aifb.uni-karlsruhe.de/WBS/gst/FBA03/tutorial_ecml_pkdd_2002.pdf
[2] Stumme Gerd, Course of FCA, Summer 2003, Magdeburg. http://www.aifb.uni-karlsruhe.de/WBS/gst/FBA03.shtml
[3] G. Stumme, R. Wille, U. Wille: Conceptual Knowledge Discovery in Databases Using Formal Concept Analysis Methods. In: J. M. Zytkow, M. Quafofou (Eds.): Principles of Data Mining and Knowledge Discovery. Proc. 2nd European Symposium on PKDD'98, LNAI 1510, Springer, Heidelberg 1998, 450-458 (part of [19]). http://www.aifb.uni-karlsruhe.de/WBS/gst/papers/1998/P1993-PKDD98.ps
[4] Ganter Bernhard, Wille Rudolf, “Formal Concept Analysis -- Mathematical Foundations”. http://www.math.tu-dresden.de/~ganter/FCAbooks.html
[5] Darmstadt Research Group in FCA. http://www.mathematik.tu-darmstadt.de/ags/ag1/Sekretariat/sekretariat_en.html
[6] Uta Priss, A Formal Concept Analysis Homepage. http://www.upriss.org.uk/fca/fca.html
[7] Uta Priss, Indiana State University, Course, “Advanced Topics in Information Systems: Formal and Relational Concept Analysis”. http://www.upriss.org.uk/teaching/697-Su97-syllabus.html
[8] ConceptExplorer, a freeware project available from SourceForge. http://sourceforge.net/projects/conexp
[9] First Conference on Formal Concept Analysis, Feb. 27 – March 1, 2003. France. http://fzbw.de/icfca03
