An Intelligent Database Application for the Semantic Web
Amr F. El-Helw    Hussien H. Aly
[email protected]    [email protected]
Department of Computer Science & Automatic Control, Faculty of Engineering, Alexandria University, EGYPT

Abstract
The Semantic Web is one of the important research fields that has emerged recently. Its major concern is to convert the World Wide Web from just a huge repository of unrelated text into useful, linked pieces of information. Linking the information is based not only on text similarity, but mainly on the meanings and real-world relations between items. In this research, we use the techniques of the Semantic Web to create formal definition rules for a certain kind of unstructured data, and to query this data for information that might not be explicitly stated and that would otherwise be very difficult to extract. For this purpose, we develop an integrated system that can extract data from both unstructured and structured documents, and then answer users' queries using the extracted data together with inference rules that help to deduce more information from this data. As an example of such data, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships.

Keywords: Semantic Web, Intelligent Databases, Deductive Databases, Information Retrieval.

1. Introduction

1.1. Overview
The World-Wide Web (WWW) is a huge repository of information. It contains documents and multimedia resources concerning almost every imaginable subject, in human-usable format. Documents on the Web have cross-references known as links. These links represent some sort of relationship between the documents. The majority of these documents are written using the Hyper-Text Markup Language (HTML), which is a text stream with embedded special tags. Most of the tags are concerned with the organization and presentation of the document. Due to the huge volume of available data, it is becoming increasingly difficult to locate useful information using the current technology of search engines. Also, users often want to use the Web to do more than just locate a document; they want to perform some task, or draw some inference about a document's relationships with other documents. Completing these tasks often involves visiting a series of pages, integrating their content and reasoning about them in
some way. This is far beyond the capabilities of directories and search engines. The main problem is that the Web was not initially designed to be processed by machines. Documents on the Web do not provide any information that helps a machine determine what the text means.
The Semantic Web is not a technology by itself. In fact, it is a group of inter-related technologies that may be used (individually or together) to produce the desired outcome. Among these technologies and concepts are ontologies, data mining, deductive databases, artificial intelligence, man-machine interfaces and others. The Semantic Web may allow users to organize and browse the Web in ways more suitable to the problems they have at hand. It could be used to impose a conceptual filter on a set of web pages and display their relationships based on such a filter. This may also allow visualization of complex content. With HTML, such interfaces are virtually impossible, since it is difficult to extract meaning from the text. The major concern of the Semantic Web is to convert the World Wide Web from just a huge repository of unrelated text into useful, linked pieces of information. Linking the information is based not only on text similarity, but mainly on the meanings and real-world relations between items.

1.2. Objective
Currently, there is no universal application of the Semantic Web, and it seems that applications will be domain dependent. As stated in [12], “the Semantic Web currently lacks a business case. Even if one were to emerge, there is no guarantee that the Semantic Web will become universal”. In this research, we aim to use an Intelligent Database approach to develop a real-life application that applies the techniques and structures of the Semantic Web and explores its potential use. We consider the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships. While this may seem to be a classic problem in the deductive-database area of research, the problem is not the deduction process but how to collect the data from scattered, unstructured Web pages into the extensional part of the deductive database.
In order to achieve this goal, we have to translate the existing documents on the Web into a proper model that represents the semantics of, and relationships between, these documents, design a suitable inference engine to navigate through the documents, and provide a query-language interface to drive this engine.

2. System Architecture
The system developed throughout this research can be described as an integrated Semantic Web system. By the word “integrated”, we mean that the system is comprised of more than one component, each with a specific task. These components are: the Data Transformer, the Data Collector, the Inference Rules Editor, and the Query Tool, in addition to the Intelligent Database System itself, which can be considered the core of the system. Figure 1 represents a schematic of these components and their relationships.
Fig. 1 – System Architecture

2.1. Data Transformer
This component is responsible for transforming the data on the web from its unstructured HTML format into a semi-structured format (e.g. XML) that can be easily processed by other system components. The transformation process will depend on the application domain to assign the target meaning to the document content. Obviously, if the data already exists in a structured format, then it can be passed directly to the next component in the system.
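To make the first stage of this component concrete, the following is a minimal sketch in Python (only one possible implementation language; the sample markup is hypothetical) of how the visible text of an HTML page could be pulled out of the markup before any domain-specific processing.

from html.parser import HTMLParser

class AdTextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring tags, scripts and styles."""
    def __init__(self):
        super().__init__()
        self._skip = 0            # depth inside <script>/<style> elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = AdTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Hypothetical fragment of an announcements page:
print(extract_text("<html><body><p>We announce the death of Mrs. Amina ...</p></body></html>"))

The plain text produced this way is what the pattern-identification step of Section 3.1 operates on.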
2.2. Data Collector
The Data Collector is responsible for collecting the data from the semi-structured XML documents and storing it in the database, to form the extensional database of the deductive system.

2.3. Inference Rules Editor
Using this component, the domain expert can add new inference rules to the knowledge base of the system at any time. Of course, newly added rules only take effect from the moment of their addition.

2.4. Query Tool
This component receives queries built by the user interface and converts them into a logic program that uses the inference rules (intensional database) together with the stored data (extensional database) to determine the result of each query, and passes this result back to the user.

3. Design & Implementation
In this research, we use the techniques and structures of the Semantic Web to create formal definition rules and collect the relevant facts for a kind of unstructured data scattered across Web pages. A user can then query this data for information that might not be explicitly stated and that would otherwise be very difficult to extract. The first step needed to accomplish this goal is to convert the unstructured data found on the Web into some structured or semi-structured format, so that we can later manipulate this data. We chose XML as an example of a standard semi-structured format that can be used for this purpose. Once the unstructured data is converted into XML, it is easy to deal with it using any programming language or technique. This step needs some knowledge representation to encode the expertise of the application domain. The second step is to parse the resulting XML document(s) and export their data into a structured database that is used as the extensional part (EDB) of the deductive database system. This step is not a problem, since XML parsers are widely available. We also had to enumerate a variety of possible inference rules about human relations and store these rules in the intensional database (IDB). These rules are used to drive the deduction process. Once the user enters a query, it is translated into some kind of logic program (using Prolog, for example) and answered from the facts and rules stored earlier in the database.

3.1. Structuring the Data
As shown in the system architecture, the first step to get the data is to convert the unstructured text data found on the Web into semi-structured XML documents.
In fact, the technique used to do this can be applied to any other type of document, after making the necessary changes. The main difficulty here was to find some pattern in the documents that can be followed to determine the context of each token (word or phrase). This pattern is domain dependent, i.e. it differs from one area to another. Sometimes this pattern is not easy to discover. In our chosen domain of application, the ads are usually semi-structured. To find this pattern, we had to group synonyms (words and expressions with the same meaning) together, and replace them with a unified semantic element. The algorithm for identifying the pattern is summarized in Fig. 2.

Given: A document that can be viewed as a set of phrases P. A phrase is any word or expression that has a meaning relative to the context of the application domain.
Output: A tokenized document that contains only the main semantic elements.
Algorithm:
1. Let Pi ⊂ P be all the phrases that belong to the same token type Ti; 1 ≤ i ≤ n.
2. For i = 1 to n:
   2.a) Let Si be a semantic element that represents the token type Ti.
   2.b) ∀ p ∈ Pi, replace p with Si in the document.
3. The resulting document contains only the semantic elements Si; 1 ≤ i ≤ n.
Fig. 2 – Pattern Identification Algorithm
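A minimal sketch of steps 1–2 of this algorithm, in Python, is shown below. The synonym table is hypothetical; in the actual domain it would be built by a domain expert and would contain Arabic phrases.

# Hypothetical synonym table: token type -> phrases that express it.
SYNONYMS = {
    "DIE": ["passed away", "we announce the death of", "died"],
    "WIFE_OF": ["wife of", "spouse of"],
    "MOTHER_OF": ["mother of", "the mother of"],
}

def tokenize(document: str) -> str:
    """Replace every known phrase with its semantic element (Fig. 2, steps 1-2).
    Longer phrases are matched first so that 'the mother of' wins over 'mother of'."""
    pairs = [(p, token) for token, phrases in SYNONYMS.items() for p in phrases]
    pairs.sort(key=lambda pt: len(pt[0]), reverse=True)
    text = document.lower()
    for phrase, token in pairs:
        text = text.replace(phrase, f" {token} ")
    return " ".join(text.split())

print(tokenize("We announce the death of Mrs. Amina wife of Eng. Hamed, the mother of Aisha"))
# -> DIE mrs. amina WIFE_OF eng. hamed, MOTHER_OF aisha

The sequence of semantic elements left in the text is what we examine, over many documents, to deduce the pattern discussed next.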
Of course, this algorithm has to be applied to a wide variety of documents that belong to the required knowledge domain. The resulting pattern is revised and refined with every added document, until we get the most general possible pattern that covers all (or at least the majority) of the cases. Note that some phrases can actually belong to more than one token type. Sometimes this can be resolved from the context, but it can also lead to unexpected results. For this reason, the output of this phase is not always 100% correct. Once we replace each phrase with its token type, we can deduce the pattern of tokens in the document. Then, we parse the selected document and process it according to this pattern. This way, we can correctly extract most (though not necessarily all) of the data in the document. To convert the extracted data to an XML document, we can use the Document Object Model (DOM) of XML documents. The DOM deals with an XML document as a tree, with each element in the document as a node in the tree, thus preserving the hierarchical structure of XML documents. Many programming languages support the XML DOM and provide means to work with it.
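As an illustration, the following sketch uses Python's xml.dom.minidom to build such a document; the element and attribute names are illustrative and are not the exact schema used by the system.

from xml.dom.minidom import Document

def build_ad_xml(persons, relations):
    """Build an XML document grouping persons and the relations between them."""
    doc = Document()
    root = doc.createElement("deathAd")
    doc.appendChild(root)

    persons_el = doc.createElement("persons")
    root.appendChild(persons_el)
    for p in persons:                      # e.g. {"id": 1, "name": "Amina", "gender": "F", "job": ""}
        el = doc.createElement("person")
        for attr, value in p.items():
            el.setAttribute(attr, str(value))
        persons_el.appendChild(el)

    relations_el = doc.createElement("relations")
    root.appendChild(relations_el)
    for rel_type, src, dst in relations:   # e.g. ("wife", 1, 2): person 1 is the wife of person 2
        el = doc.createElement("relation")
        el.setAttribute("type", rel_type)
        el.setAttribute("from", str(src))
        el.setAttribute("to", str(dst))
        relations_el.appendChild(el)

    return doc.toprettyxml(indent="  ")

print(build_ad_xml([{"id": 1, "name": "Amina", "gender": "F", "job": ""},
                    {"id": 2, "name": "Hamed", "gender": "M", "job": "engineer"}],
                   [("wife", 1, 2)]))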
Note that structuring the data is only necessary because, currently, most of the Web's data is not available in XML format. However, many data providers now tend to provide their data as XML documents as well as HTML pages (as shown in the system architecture above). So, in the near future, it is expected that most (if not all) data will be available in XML format, and no structuring will be needed.

3.2. Grouping the Data
Given an XML document that contains the data, we scan the document and insert the data into the database repository after converting it to the suitable form. Again, in this stage, we parse the XML document using a DOM parser, in order to easily extract the data and store it in the database. The question is: when should this step be carried out? It should be invoked for every new XML document that is added to the application domain. So, as we initialize the system, it is invoked for all the documents that we initially have (or those that have been created from unstructured Web documents). Later on, it should be invoked whenever a new document is submitted to the system, in order to consolidate the data in this document with the knowledge base.

3.3. Knowledge Representation: Inference Rules
This part represents the "knowledge base" of the system. It should contain all the rules that define the relationships between the various entities involved in our application domain. An inference rule is a statement of the form:

Result ← Fact1, Fact2, …, Factn

The facts on the right side of the rule are called the antecedents, and the result on the left side is called the consequent. This means that the consequent can be inferred from the antecedents. In other words, if the antecedents are all true, then the consequent is also true and can be used as an antecedent in other inference rules. Note that both the antecedents and the consequent are predicates that can take arguments. For example, consider the following inference rule:

father(x, y) ← male(x), child(y, x)

In this rule, we have the predicates father, male and child, and two arguments (or parameters) x and y. The rule means that if x is male and y is the child of x, then x is the father of y. These inference rules are needed in order to deduce hidden relationships and to extract implicit information from the stored data. The system gives the ability to add more inference rules later; they take effect from the moment of their addition into the system.
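The rules and facts lend themselves to a very direct representation. The toy Python sketch below shows how a consequent follows from its antecedents by naive forward chaining; it is only a stand-in for the logic engine actually used (Section 3.4), and the facts shown are hypothetical.

# Facts (extensional database) and rules (intensional database) as tuples.
# A rule is (consequent, [antecedents]); upper-case strings are variables.
facts = {("male", "Hamed"), ("child", "Aisha", "Hamed")}
rules = [(("father", "X", "Y"), [("male", "X"), ("child", "Y", "X")])]

def match(pattern, fact, binding):
    """Try to unify a pattern with a ground fact under the current binding."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.isupper():                           # variable
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:                              # constant mismatch
            return None
    return binding

def forward_chain(facts, rules):
    """Repeatedly add consequents whose antecedents all hold, until nothing changes."""
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            bindings = [{}]
            for antecedent in body:
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(antecedent, f, b)) is not None]
            for b in bindings:
                derived = (head[0],) + tuple(b[v] for v in head[1:])
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts

print(forward_chain(set(facts), rules))   # now also contains ('father', 'Hamed', 'Aisha')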
3.4. Creating and Answering Queries
This step involves translating the user's query into some kind of logic program (using Prolog, for example) that can answer the query from the facts and rules stored earlier. To accomplish this step, we use a Prolog engine that we can call and pass our query to. For this purpose we use the XSB engine [13]. It can access the stored data, process the query, and return all possible results of the query, so that we can present these results to the user in the appropriate way.

4. The Social Relationship Miner
As an example of a real-life application that can make use of the concepts and techniques discussed earlier, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships that are otherwise difficult to extract.

4.1. The Data Transformer
As mentioned earlier, the first step to get the data is to convert the unstructured text of the ads into semi-structured XML documents. To do this, we had to identify the various semantic elements that appear in the type of documents relevant to our application. For example, the phrases (توفي – توفى – انتقل إلى رحمة الله) all have the same meaning of "die". So whenever we encounter any of these phrases, we determine that we have the DIE element. Other token types include family relations, job descriptions, etc. Naturally, the tokens and semantic elements differ from one language to another. In our system, we work with documents written in the Arabic language. For the sake of explanation, we shall translate some of the terms relevant to our topic. After replacing each phrase with its representative semantic element, and after discarding insignificant phrases, one can deduce the pattern of tokens in the document. The state diagram is defined in the application as a set of nodes (states) and an action to take at each state for each encountered token type. This gives the system more flexibility, since it is possible to add more states and define the type of action to be taken in these states for different kinds of tokens. In fact, a domain expert can define a whole new state diagram with a completely new set of nodes and actions. This way, the application can work on different kinds of documents. Next, we parse the selected document and process it according to our state diagram. This way, we can correctly extract most (though not necessarily all) of the data in the document and convert it into XML format.
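One possible, purely illustrative way to represent such a state diagram is sketched below in Python; the states, token types and actions are hypothetical and would, in the real system, be defined by the domain expert.

# (current state, token type) -> (next state, action applied to the token's value)
def record_deceased(ctx, name): ctx["deceased"] = name
def record_spouse(ctx, name): ctx.setdefault("spouses", []).append(name)
def record_child(ctx, name): ctx.setdefault("children", []).append(name)

STATE_DIAGRAM = {
    ("START", "DIE"): ("EXPECT_DECEASED", None),
    ("EXPECT_DECEASED", "NAME"): ("AFTER_DECEASED", record_deceased),
    ("AFTER_DECEASED", "WIFE_OF"): ("EXPECT_SPOUSE", None),
    ("EXPECT_SPOUSE", "NAME"): ("AFTER_DECEASED", record_spouse),
    ("AFTER_DECEASED", "MOTHER_OF"): ("EXPECT_CHILD", None),
    ("EXPECT_CHILD", "NAME"): ("EXPECT_CHILD", record_child),
}

def run(tokens):
    """Walk the token stream produced by the pattern-identification step."""
    state, ctx = "START", {}
    for token_type, value in tokens:
        next_state, action = STATE_DIAGRAM.get((state, token_type), (state, None))
        if action:
            action(ctx, value)
        state = next_state
    return ctx

tokens = [("DIE", None), ("NAME", "Amina"), ("WIFE_OF", None), ("NAME", "Hamed"),
          ("MOTHER_OF", None), ("NAME", "Aisha"), ("NAME", "Mohamed")]
print(run(tokens))
# -> {'deceased': 'Amina', 'spouses': ['Hamed'], 'children': ['Aisha', 'Mohamed']}

Adding a state or supporting a new token type then amounts to adding entries to this table, which is what gives the approach its flexibility.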
At this point, there has to be some interaction between the system and an expert in the application domain, to review the generated XML document and possibly refine and correct any incorrect data that might have been generated due to a linguistic problem (for example, a word that has two different meanings and thus might be considered as either of two different token types). The resulting XML file consists of two main sections. The first section contains the data of every person mentioned in the document. The second section contains the relations between the persons in the first section. As an example of transforming an unstructured document into XML format, consider the following death-ad:

انتقلت إلى رحمة الله السيدة أمينة حرم المهندس حامد البرجي و والدة المهندسة عائشة حرم المستشار محمود فكري و الدكتور مهندس محمد زوج وفاء بالتليفزيون و المهندس احمد و زينب حرم الدكتور محمد موسي و رشيدة حرم المحاسب فاروق و صفاء حرم المرحوم عمر و السيد صفوت زوج منال و السيد محمود زوج شيرين

Fig. 3 – Sample death-ad

An approximate English translation of this document (given only for the sake of explanation) is shown in Fig. 4 below.

We announce the death of Mrs. Amina, the wife of Eng. Hamed El-Borgy, the mother of: Eng. Aisha (wife of Judge Mahmoud Fekry), Dr. Eng. Mohamed (husband of Wafaa, who works in the TV), Eng. Ahmed, Zeinab (wife of Dr. Mohamed Moussa), Rashida (wife of Accountant Farouk) and Safaa (wife of the deceased Omar), Mr. Safwat (husband of Manal), and Mr. Mahmoud (husband of Sherine).

Fig. 4 – English Translation of Figure 3

Also, as we said earlier, this step can convert any type of document, not just death-ads. All we need is to determine the different token types and the state diagram that represents the token pattern of the document; the rest is straightforward. Figure 5 shows a portion of the XML document that results from structuring the document in the above example.

4.2. The Data Collector
The Data Collector is the component responsible for collecting the data from the XML documents that have been created by the Data Transformer (or that are provided directly by data providers) and storing this data in the deductive database system. Given an XML document that contains the data, the Data Collector starts by scanning the persons section. For every person in this section, the system checks whether or not this person already exists in the database, and inserts the person if necessary. The comparison is done based on the name, gender, and job,
i.e. if a person with the same data exists in the database, it is assumed to be the same person. We have no other way to differentiate between people who might accidentally have similar data. Next, the system parses the relations section of the XML document and inserts these relations into the database (again, only if they do not already exist), with the respective IDs of the persons involved in each relation.
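A minimal sketch of this insert-if-absent logic is given below, assuming Python's sqlite3 and xml.dom.minidom; the table, element and attribute names are illustrative rather than the system's actual schema.

import sqlite3
from xml.dom.minidom import parseString

def collect(xml_text, conn):
    """Store the persons and relations of one transformed XML document.
    A person is inserted only if nobody with the same name, gender and job
    is already present (the comparison described above)."""
    doc = parseString(xml_text)
    db_id = {}                                    # XML person id -> database id
    for p in doc.getElementsByTagName("person"):
        name, gender, job = (p.getAttribute(a) for a in ("name", "gender", "job"))
        row = conn.execute("SELECT id FROM persons WHERE name=? AND gender=? AND job=?",
                           (name, gender, job)).fetchone()
        if row:
            db_id[p.getAttribute("id")] = row[0]
        else:
            cur = conn.execute("INSERT INTO persons(name, gender, job) VALUES (?,?,?)",
                               (name, gender, job))
            db_id[p.getAttribute("id")] = cur.lastrowid
    for r in doc.getElementsByTagName("relation"):
        rel = (r.getAttribute("type"),
               db_id[r.getAttribute("from")], db_id[r.getAttribute("to")])
        if not conn.execute("SELECT 1 FROM relations WHERE type=? AND p1=? AND p2=?",
                            rel).fetchone():
            conn.execute("INSERT INTO relations(type, p1, p2) VALUES (?,?,?)", rel)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons(id INTEGER PRIMARY KEY, name TEXT, gender TEXT, job TEXT)")
conn.execute("CREATE TABLE relations(type TEXT, p1 INTEGER, p2 INTEGER)")
# collect(xml_text, conn) would then be called for every newly transformed document.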
Fig. 5 – Portion of the Generated XML file

4.3. The Inference Rules
Here, we included as many inference rules as possible. These rules define the human relationships and the inter-relations between these relationships. As pointed out in the system architecture, these inference rules are considered as the knowledge base of the system. Domain experts can add rules at any time. Examples of these relationships might include:

father(x, y) ← child(y, x), male(x).
uncle(x, y) ← father(z, y), brother(x, z).
ancestor(x, y) ← ancestor(x, z), parent(z, y).
sibling(x, y) ← child(x, z), child(y, z), x ≠ y.
uncle(x, y) ← brother(x, z), parent(z, y).
The above relationships represent a small portion of what we can call "family relationships". However, the system does not only support family relationships. There are also other kinds of relationships, like:

superior(x, y) ← superior(x, z), superior(z, y).
knows(x, y) ← relative(x, y).
knows(x, y) ← coworker(x, y).
These are only a sample of the inference rules that are included in the system. Although they might look simple, some relations actually have a lot of cases to consider. For example, if we look at the nephew relation, it can be inferred from any of the following:

nephew(x, y) ← son(x, z), sibling(z, y).
nephew(x, y) ← uncle(y, x), male(x).
nephew(x, y) ← aunt(y, x), male(x).
These inference rules are needed in order to deduce any hidden relationships and to extract implicit information from the stored data. They represent the intensional database of the deductive database system. The system gives the ability to add more inference rules later, and they will take effect from the moment of their addition into the system.

4.4. Issuing Queries
It is clear from the name of the application, “Social Relationship Miner”, that its main concern is to extract relationships between people. The user who issues queries to the system is mainly interested in one of the following:
• Given two persons, the user might want to know whether a certain relation holds. For example, given the above document, a possible query might be: Is Sherine the wife of Mahmoud? The answer to this kind of query is simply yes or no.
• Given one person, the user might want to know all persons who satisfy a certain relation with the given person, e.g. List all children of Amina. The answer to this kind of query is a list of one or more person names.
• Given two persons, the user might want to determine the relation(s) between them (if any), e.g. What is the relation between Safwat and Manal? The answer to this query is a list of one or more relations, ordered by significance (closeness of the relation).

To issue a query, the user selects the query type (any of the three categories above) and supplies the parameters (relation types or person names). The system constructs a logic query and passes it to the XSB engine, which executes the query and returns its results back to the user.
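A sketch of how these three categories could be turned into logic goals is shown below, in Python; the relation names, their ordering by closeness, and the goal syntax (which follows the rule notation of Section 3.3) are illustrative, and the actual call into the XSB engine is omitted since it depends on how the engine is embedded.

# Hypothetical list of relation names, ordered by closeness of the relation.
RELATIONS_BY_CLOSENESS = ["wife", "husband", "father", "mother", "sibling",
                          "uncle", "aunt", "nephew", "ancestor", "knows"]

def build_goals(query_type, relation=None, person1=None, person2=None):
    """Translate a user query of one of the three categories into logic goals."""
    if query_type == "holds":        # Is Sherine the wife of Mahmoud?
        return [f"{relation}('{person1}', '{person2}')"]
    if query_type == "list":         # List all children of Amina.
        return [f"{relation}(X, '{person1}')"]          # X ranges over the answers
    if query_type == "between":      # What is the relation between Safwat and Manal?
        return [f"{r}('{person1}', '{person2}')" for r in RELATIONS_BY_CLOSENESS]
    raise ValueError("unknown query type")

# The third category yields one goal per candidate relation, tried in order of
# closeness; every goal that succeeds would be reported back to the user.
for goal in build_goals("between", person1="Safwat", person2="Manal"):
    print(goal)                      # each goal is then handed to the logic engine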
5. Conclusion
In this research, we used an Intelligent Database approach to develop an application that applies the techniques and structures of the Semantic Web and makes use of its benefits. The system was tested through a prototype example of a real-life application using death-advertising documents published on the Internet by Arabic newspapers.
We implemented a system that integrates a number of Semantic Web technologies. The inputs to this system are Internet documents in a specific application domain. These documents can be either unstructured HTML and text documents, or semi-structured XML documents. In the case of unstructured documents, domain experts can define parsing rules for the documents in the form of state diagrams. The system then parses the documents, extracts data items and relations from them, and converts them into semi-structured XML documents. The parsing is carried out according to the user-defined state diagram, which increases system flexibility. In the case of Internet documents that are already available in XML format, the system can transform these documents from one format to another; this is accomplished using a user-defined XSLT stylesheet. Next, once all the data is available in XML format with a defined structure for a given application domain, the system reads the data from the resulting XML files and stores it in the database repository. This repository is structured in such a way as to enable intelligent processing and querying of the data and its relationships. The Inference Rules Editor allows domain experts to define inference rules (the intensional database) that define the relationships between data items in the required application domain. These rules are also stored in the database. We also provide a user interface with which a user can query the system. The user's queries are translated into a logic program that is executed by a logic engine in the intelligent database system. It uses the stored data and inference rules to produce the result of the query and return it back to the user.
As a practical example for this system, we introduced a real-life application. We considered the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships.

6. References
[1] J. Heflin, "Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment", Ph.D. Thesis, University of Maryland, College Park, 2001.
[2] E. Bertino, B. Catania and G. P. Zarri, "Intelligent Database Systems", Addison-Wesley, 2001.
[3] A. Deutsch, M. Fernandez, D. Florescu, A. Levy and D. Suciu, "XML-QL: A Query Language for XML", Proc. 8th Int'l WWW Conference, 1999.
[4] T. T. Chinenyanga and N. Kushmerick, "Expressive Retrieval from XML Documents", Proc. 24th Annual Int'l ACM SIGIR Conf. on Research & Development in Information Retrieval, 2001.
[5] S. Decker, D. Fensel, F. van Harmelen, I. Horrocks, S. Melnik, M. Klein and J. Broekstra, "Knowledge Representation on the Web", Proc. of the 2000 Int'l Workshop on Description Logics, 2000.
[6] S. Decker, F. van Harmelen, J. Broekstra, M. Erdmann, D. Fensel, I. Horrocks, M. Klein, and S. Melnik, "The Semantic Web – on the Respective Roles of XML and RDF", IEEE Internet Computing, September/October 2000.
[7] World Wide Web Consortium HTML Specification, http://www.w3.org/TR/REC-html40
[8] World Wide Web Consortium XML Specification, http://www.w3.org/TR/REC-xml
[9] World Wide Web Consortium RDF Specification, http://www.w3.org/TR/REC-rdf-syntax
[10] B. N. Grosof, I. Horrocks, R. Volz and S. Decker, "Description Logic Programs: Combining Logic Programs with Description Logic", Proc. of the 12th International Conference on World Wide Web, May 2003.
[11] G. F. Luger and W. A. Stubblefield, "Artificial Intelligence: Structures and Strategies for Complex Problem Solving", 3rd edition, Addison-Wesley, 1998.
[12] S. Bowness, "Information Highways", April 2004, Vol. 11, No. 3, p. 16.
[13] XSB Engine, http://xsb.sourceforge.net
[14] J. Heflin, J. Hendler, and S. Luke, "SHOE: A Knowledge Representation Language for Internet Applications", Technical Report CS-TR-4078 (UMIACS TR-99-71), Dept. of Computer Science, University of Maryland at College Park, 1999.
[15] D. Fensel, S. Decker, M. Erdmann, and R. Studer, "Ontobroker: The Very High Idea", Proc. of the 11th International Flairs Conference (FLAIRS-98), Sanibel Island, Florida, May 1998.
[16] T. R. Gruber, "A Translation Approach to Portable Ontology Specifications", Knowledge Acquisition, 5(2):199-220, 1993.
[17] D. Fensel, I. Horrocks, F. van Harmelen, D. McGuinness, and P. F. Patel-Schneider, "OIL: Ontology Infrastructure to Enable the Semantic Web", IEEE Intelligent Systems, 16(2), 2001.
[18] A. Farquhar, R. Fikes and J. Rice, "The Ontolingua Server: A Tool for Collaborative Ontology Construction", Knowledge Systems Laboratory, 1996.
[19] H. P. Luhn, "The automatic creation of literature abstracts", IBM Journal of Research and Development, 2, pp. 159-165, 1958.
[20] M. Sharp, "Text mining", Rutgers University, Communications, Information and Library Science, Seminar in Information Studies, Dec 11, 2001.