Making the Web More Semantic: A Methodology for Context-Aware Query Processing
Veda C. Storey, Computer Information Systems Department, J. Mack Robinson College of Business, Georgia State University. Email: [email protected]
Andrew Burton-Jones, Computer Information Systems Department, J. Mack Robinson College of Business, Georgia State University. Email: [email protected]
Vijayan Sugumaran, Department of Decision & Information Sciences, School of Business Administration, Oakland University. Email: [email protected]
Sandeep Purao, School of Information Sciences & Technology, Pennsylvania State University. Email: [email protected]
Abstract

The continued growth of the World Wide Web has made the retrieval of relevant information for a user's query increasingly difficult. A major obstacle to more accurate and semantically sound retrieval is the lack of intelligence in web search systems. This research presents a methodology to increase the semantic content of web query results by building context-aware queries. The methodology contains heuristic mechanisms that use lexical sources such as WordNet and ontologies such as the DAML library to augment a query. A semantic net representation facilitates the process. The methodology has been implemented in a research prototype that connects to search engines (Google and AlltheWeb) to execute the augmented query. An empirical test of the methodology and comparison of results against those directly obtained from the search engines demonstrates that the proposed methodology provides more relevant results to users.

Keywords: query augmentation, semantic retrieval, ontology, context, query, Semantic Web
Acknowledgments. This research was supported by the J. Mack Robinson College of Business, Georgia State University, and the Office of Research & Graduate Study, Oakland University. Earlier versions of this research were presented at the Twenty-Third International Conference on Information Systems (ICIS 2002), the Twenty-Second International Conference on Conceptual Modeling (ER 2003), and research workshops at the University of Houston and the University of Washington. We thank the doctoral students at Georgia State University, particularly Cecil Chua and Punit Ahluwalia, for their assistance.
1. Introduction

The World Wide Web is one of the world's most valuable information resources, but it is increasingly difficult to retrieve relevant information from the Web due to its rapid growth and lack of structure (Kobayashi and Takeda, 2000; Lawrence, 2000; Spink and Ozmultu, 2002). A major impediment to information retrieval from the Web is the inability of Web pages and query systems to understand the context of natural language terms. Understanding more of the semantics in Web pages and queries would help process users' queries more effectively, but it is especially difficult to capture and represent meaning in machine-readable form. One response to these problems has been the proposed Semantic Web, which is intended to extend the World Wide Web by infusing it with semantics (Berners-Lee, et al., 2001; Ding, et al., 2002). The vision of the Semantic Web is an extension of the current web in which information is given well-defined meaning (Berners-Lee, et al., 2001). Terms on web pages will be marked up using ontologies to provide domain-specific terms and inference rules to serve as surrogates for the semantics of each term's meaning (Berners-Lee, et al., 2001; Gruninger and Lee, 2002; IEEE, 2001). It will be a long time, however, before markup that adds semantics to data presented on pages becomes the norm. Libraries of ontologies containing domain-specific knowledge are being developed for the Semantic Web and other uses. The most well-known library of ontologies is the DAML (DARPA Agent Markup Language) library, containing approximately 280 ontologies and 56,000 classes (http://www.daml.org/ontologies/). Significant effort has gone into building these ontologies, yet there has been little research on developing methodologies for retrieving information using them. A methodology for doing so would greatly assist in obtaining more relevant query results until the full potential of the Semantic Web can be realized.
The objective of this research, therefore, is to develop a methodology for processing queries on the Web that takes into account the semantics of the user's request and, by doing so, obtains more relevant query results. The following tasks are carried out:

• A heuristic-based methodology is developed that expands and shrinks a set of query terms using lexical and ontological sources to achieve a more precise, context-specific query.
• The methodology is implemented in a prototype system called the Semantic Retrieval System.
• The effectiveness of the methodology is assessed by analyzing the results obtained from processing queries using the prototype.

The contribution of this research is to develop a methodology that more effectively processes queries by capturing and augmenting the semantics of a user's query. The results of the research should move us closer to realizing the potential of the Semantic Web, while demonstrating useful applications of lexicons and ontology libraries. This paper is divided into six sections. Section 2 reviews relevant prior research. Section 3 presents the heuristic-based methodology for enabling context-aware query processing. Section 4 describes the prototype that implements the methodology. Section 5 reports the results of an empirical test of the methodology. Section 6 concludes the paper.
2. Related Research

This research lies at the intersection of three areas: 1) knowledge representation, 2) Web semantics, and 3) information retrieval, as shown in Figure 1. Examples of related research topics in these areas are summarized in Table 1.

2.1 Knowledge Representation

The ability of information systems to capture and represent knowledge, and to use semantic
[Figure 1. Related Research for Context-Aware Query Processing. The figure situates this research at the intersection of three areas (Web Semantics, Information Retrieval, and Knowledge Representation), with topics including languages for the Semantic Web, Semantic Web applications, web information retrieval, algorithms for information retrieval, processing natural language, ontologies for the Semantic Web, building domain ontologies, expanding queries using lexical sources, and refining lexical sources.]
Table 1: Research in Knowledge Representation, Web Semantics, & Information Retrieval

- Languages for the Semantic Web: Languages to specify semantic markup on the web, e.g., ontology inference layer (OIL); DARPA agent markup language (DAML). (Fensel, et al., 2001; McIlraith, et al., 2001)
- Semantic Web applications: Application of Semantic Web technology to diverse areas including knowledge management, military defense, and distance education. (Fensel, et al., 2001; Hendler, 2001)
- Web information retrieval: Information retrieval tools and approaches for web searching, including indexing, clustering, ranking, and web-page annotation, to be used and extended on the Semantic Web. (Kobayashi and Takeda, 2000; Raghavan, 1997)
- Algorithms for information retrieval: Information retrieval algorithms to improve precision, recall, and speed of information retrieval from large document repositories. (Xu and Croft, 1998)
- Processing natural language: Methodologies for effectively processing natural language queries using procedures such as stop-word removal, word stemming, and noun parsing. (Lewis and Jones, 1996)
- Ontologies for the Semantic Web: Ontologies developed to define terms and relationships used in specific domains on the Semantic Web; methodologies developed for integrating multiple ontologies. (Maedche and Staab, 2001; Stephens and Huhns, 2001)
- Building domain ontologies: Prototypes and methodologies for building domain-specific and overarching (upper-level) ontologies; methodologies proposed for evaluating the quality of domain ontologies. (Kayed and Colomb, 2002; Guarino and Welty, 2002)
- Refining lexical sources: Refinements to WordNet and applications developed that use it. (Fellbaum, 1998; Miller, et al., 1990)
- Expanding queries using lexical sources: Query expansion used to improve completeness of document searches; automated and semi-automated techniques attempted. (Voorhees, 1994)
information effectively, remains limited (Berners-Lee, et al., 2001; Minsky, 1994). The querying process, for example, is inherently constrained by imprecision and ambiguity in search terms (Miller, 1996; Towell and Voorhees, 1998). Imprecision occurs when a user's search terms do not exactly match the concepts they intend to retrieve; ambiguity occurs when search terms have multiple meanings. Knowledge representation is intended to alleviate these problems by providing stocks of domain-specific knowledge and mechanisms to make inferences based on the semantics of these pieces of knowledge.

Two important sources of semantics are lexicons and ontologies. Lexicons comprise the general vocabulary of a language. Lexicons help resolve imprecision and ambiguity by identifying the meaning of a term by its connection to other terms in a language. A well-known example is WordNet, a comprehensive on-line catalog of English terms (http://www.cogsci.princeton.edu/~wn) (Fellbaum, 1998; Miller, et al., 1990). WordNet stores, categorizes, and relates nouns, verbs, adjectives, and adverbs, organizing them into synonym sets with underlying word senses. For example, WordNet describes four word senses for the term 'chair' as a noun and two for it as a verb, as well as superclasses and subclasses of the term. Lexicons such as WordNet have been used to assist in query processing in prior research (Moldovan and Mihalcea, 2000; Voorhees, 1994).

Whereas lexicons describe a language, ontologies are a way of describing one's world (Weber, 1997). Ontologies generally consist of terms, their definitions, and axioms relating them (Gruber, 1993). There are many different types of ontologies (Guarino, 1998; Mylopoulos, 1998; Noy and Hafner, 1997; Weber, 2002), with the most well known ones including Ontolingua (Farquhar, et al., 1996), SHOE (Heflin, et al., 1999), Cyc (Guha and Lenat, 1994), and the XML-based schemes such as OIL (Fensel, et al., 2001) and DAML (Hendler and McGuinness, 2000).
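To make the synset structure concrete, the following sketch models WordNet-style entries in plain Python with toy data based on the 'chair' example above. The `Synset` fields, the `LEXICON` entries, and the `senses` helper are our illustration, not WordNet's actual interface or contents.

```python
# Minimal sketch of WordNet-style synsets (toy data, not the real lexicon).
# Each sense of a word is a synset: a set of synonyms plus a gloss,
# optionally linked to broader (hypernym) and narrower (hyponym) terms.
from dataclasses import dataclass, field

@dataclass
class Synset:
    synonyms: list                                 # terms sharing this word sense
    gloss: str                                     # short definition of the sense
    hypernyms: list = field(default_factory=list)  # superclasses
    hyponyms: list = field(default_factory=list)   # subclasses

# 'chair' as a noun: four senses, mirroring the example in the text.
LEXICON = {
    "chair": [
        Synset(["chair"], "a seat for one person", ["furniture"], ["armchair"]),
        Synset(["chair", "professorship"], "the position of professor"),
        Synset(["chair", "chairman", "chairperson"], "the presiding officer"),
        Synset(["electric chair", "chair"], "an instrument of execution"),
    ]
}

def senses(term):
    """Return the list of word senses (synsets) recorded for a term."""
    return LEXICON.get(term, [])

print(len(senses("chair")))  # four noun senses, as in the example
```

A lexicon of this shape is enough to drive the sense-selection and query-expansion steps described in Section 3.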
5
Ontologies can be characterized as either formal/top-level ontologies that describe the world in general, or material/domain ontologies that describe specific domains (Guarino, 1998; Weber, 2002). Although significant effort has gone into building ontologies, there has been little research on retrieving information using them.

2.2 Web Semantics

Markup languages such as hypertext markup language (HTML) are the primary mode of structuring documents on the web. These languages do not enable web page designers to specify the meaning of terms. Search technologies for the web, therefore, remain primarily keyword-based (e.g., AltaVista) or link-based (e.g., Google), and do not take into account the context of users' queries or the context of terms on the web pages being searched (Lawrence, 2000). The intention of the Semantic Web is that these problems will be solved by 'marking up' terms on web pages with links to online ontologies that provide machine-readable definitions of the terms and their relationships with other terms (Berners-Lee, et al., 2001). This is supposed to make the web easier to process for both humans and their intelligent agents (Fensel, et al., 2001; Hendler, 2001). The proliferation of ontologies is crucial for the Semantic Web; hence the creation of large ontology libraries (Ding, et al., 2002; Stephens and Huhns, 2001). Ontologies themselves, however, can suffer from imprecision and ambiguity. For example, a search for 'chair' on the DAML ontology library returns 20 classes that include the string 'chair' with varied meanings. The Semantic Web will probably not consist of neat ontologies, but a "complex Web of semantics ruled by the same sort of anarchy that rules the rest of the Web." Therefore, research is needed to evaluate how meaning can be represented and used to assist users and their queries.
2.3 Information Retrieval

Research on information retrieval (IR) has developed procedures and methodologies that can support querying on the Web (Raghavan, 1997). A major problem in information retrieval is word-sense disambiguation: a word may have several meanings (homonymy), yet several words may have the same meaning (synonymy) (Ide and Veronis, 1998; Miller, 1996). The goals of information retrieval are to: 1) increase the relevance of the results returned (precision) by eliminating those that have the wrong sense (homonyms); and 2) increase the proportion of relevant results in the collection returned (recall) by including terms that have the same meaning (synonyms). Word-sense disambiguation requires two steps: 1) identifying the user's intended meaning of query terms, and 2) altering the query so that it achieves high precision and recall. The first step is usually achieved by automatically deducing a term's meaning from other terms in the query (Allan and Raghavan, 2002). On the Web, this appears infeasible because most web queries are only two words long (Spink and Ozmultu, 2002; Spink, et al., 2001), which is too short to identify context (de Lima and Pedersen, 1999; Voorhees, 1994). Thus, some user interaction is inevitable to accurately identify the intended sense of terms in web queries (Allan and Raghavan, 2002). The second step (altering a query) is generally achieved through a combination of:

• query constraints, such as using Boolean operators to require pages to include all query terms (Hearst, 1996; Mitra, et al., 1998);
• query expansion with 'local context,' in which terms are added to the query based on a subset of documents the user has identified as relevant (Mitra, et al., 1998; Salton and Buckley, 1990; Xu and Croft, 1998); and/or
• query expansion with 'global context,' in which additional terms are added to the query from thesauri, terms in the document collection, or past queries (de Lima and Pedersen, 1999; Greenberg, 2001; Qiu and Frei, 1993; Voorhees, 1994).

Of these three methods to improve a query, query constraints are known to improve Web search (Moldovan and Mihalcea, 2000). On the other hand, query expansion with local context is less likely to be effective, because empirical studies show that Web users rarely provide relevance feedback (Cui, et al., 2002; Spink, et al., 2001). Finally, although query expansion with global context should be useful on the traditional Web, it should not be required (theoretically) on the Semantic Web, because terms on Semantic Web pages would be linked to ontologies that explain each term's meaning within a web of related semantics (Berners-Lee, et al., 2001). In other words, rather than having to add global context to the query, the global context would be defined already on the Semantic Web pages and associated ontologies. Unfortunately, ontologies for the Semantic Web cannot be depended upon fully because they are often of poor quality (Hendler, 2001). Consequently, existing information retrieval procedures for expanding queries with global context should remain useful on both the traditional Web and the Semantic Web.

To add global context to queries, it has been suggested that two types of thesauri should be used, ideally in combination (Mandala, et al., 1999; Moldovan and Mihalcea, 2000; Voorhees, 1994): (1) general lexical thesauri that provide lexically related terms (e.g., synonyms), and (2) domain-specific thesauri that provide related terms in a specific domain. Although WordNet is the preferred lexical thesaurus (Fellbaum, 1998), it is difficult to identify domain-specific thesauri for all of the Web's domains (Greenberg, 2001). One solution is to use large ontology libraries, such as the DAML library, which can provide useful knowledge even in the presence of individual ontologies that are incomplete or inaccurate (Stephens and Huhns, 2001). As a result,
ontology libraries can have a dual role on the Web and the Semantic Web: 1) as a source of definitions of terms on specific web pages, and 2) as a collective source of semantics that can assist query expansion. Although lexicons have been used in information retrieval for some time, the joint application of lexicons and ontologies has not been explored, especially for searching the web. We next present such an application.
3. The Semantic Retrieval System (SRS)

This section proposes a Semantic Retrieval System (SRS) for supporting context-aware query processing on the web. The architecture is shown in Figure 2, and its underlying methodology is presented below.

[Figure 2. Semantic Retrieval System Architecture. The interface passes the original query terms to the inference engine, which gathers word senses, synonyms, hypernyms, and hyponyms from lexical and ontological sources into a local knowledge base; the query constructor combines the original terms and this new knowledge (with user feedback) into a formatted query for a search engine, whose results are returned through the interface.]

3.1 Architecture for Context-Aware Query Processing

The Semantic Retrieval System consists of six components: interface, inference engine, remote knowledge sources (lexical and ontological sources), local knowledge base, query constructor, and search engines.
Interface. The interface captures the user's query and transforms it for processing. Queries are accepted in natural language. They require no special syntax because empirical studies report that users often use advanced syntax incorrectly and find it difficult to understand (Lewis and Sparck Jones, 1996; Spink, et al., 2001). After the query has been processed, the interface presents the results to the user.

Inference engine. For each query term, the inference engine extracts related terms from remote lexical and ontological knowledge sources and returns them to its local knowledge base. The inference engine uses this local knowledge, in conjunction with user input, to determine the meaning of a query's terms. The methodology it uses is outlined in Section 3.2.

Remote knowledge sources. Lexical and ontological sources of knowledge are used in combination. Figure 3 shows the kinds of information that can be gathered from WordNet and the DAML ontology library for the term "professor". In this example, the ontology provides more knowledge than WordNet. However, terms in WordNet are not always in the ontology library, and vice versa.

[Figure 3: Search for "Professor" in WordNet vs. DARPA Ontology Library.
WordNet 1.7.1: Synonyms (professor means the same as x): null. Hypernyms (professor is a kind of x): academician, academic, faculty member. Hyponyms (x is a kind of professor): assistant professor, associate professor, full professor, Regius professor, visiting professor. Holonyms (professor is a part of x): member of faculty, member of staff. Familiarity (professor has x word senses): one.
DARPA Ontology Library: Six ontologies found with class "Professor." Extract from www.daml.org/ontologies/64: Faculty (teacherOf). Professor (age, doctoralDegreeFrom, emailAddress, mastersDegreeFrom, researchInterest, undergraduateDegreeFrom, tenured). AssistantProfessor (). AssociateProfessor (). Chair (). Dean (). FullProfessor (). VisitingProfessor ().]

Local Knowledge Base. The local knowledge base is conceived as a semantic network, as in prior retrieval systems (Chen and Dhar, 1990; Lee and Baik, 1999). The network is
comprised of terms (ellipses) and arcs specifying relationships between the terms, as shown in Figure 4. The network can specify five relationships for a query term 'X,' which serve as a semantic foundation for query processing: synonym (X is the same as Y), hypernym (X is a superclass of Y), hyponym (X is a subclass of Y), negation (X is not Y), and candidate (X is related to Y). The semantic network grows through the introduction of knowledge from the inference engine and shrinks by pruning excess knowledge.

[Figure 4: Example Semantic Network for a Two-Word Query. Nodes for the query terms 'Chair' and 'Law' carry, e.g., a negation arc between 'Chair' and 'Electric Chair', a hypernym arc between 'Chair' and 'Professor', a candidate arc between the two query terms, a synonym arc between 'Law' and 'Jurisprudence', and a hyponym arc between 'Law' and 'Patent Law'. Key: shaded ellipses are original terms from the query; unshaded ellipses are new knowledge from lexical and ontological sources.]

Query Constructor. The query constructor module processes the original terms from the query and new knowledge from the local knowledge base to construct context-aware web queries. It uses a set of heuristics and an interface tool to construct queries in the syntax required by a given search engine.

Search engines. The system executes queries and retrieves pages using a common web search engine. It can work with search engines such as Google, AlltheWeb, and AltaVista.
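The local knowledge base just described can be sketched as a small labeled graph restricted to the five relationship types. The class, method names, and example arcs (taken from the medical example used in Section 3.2) are our illustration, not the SRS implementation.

```python
# Sketch of the local knowledge base as a semantic network (illustrative only).
# Nodes are terms; labeled arcs use the five relationship types from the text.
RELATIONS = {"synonym", "hypernym", "hyponym", "negation", "candidate"}

class SemanticNet:
    def __init__(self):
        self.edges = []  # (term, relation, related_term) triples

    def add(self, term, relation, related):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((term, relation, related))

    def related(self, term, relation):
        """All terms linked to `term` by the given relation."""
        return [r for t, rel, r in self.edges if t == term and rel == relation]

# Building the network for the running medical example of Section 3.2.
net = SemanticNet()
net.add("doctor", "candidate", "physical therapy")  # proximity in the query
net.add("doctor", "synonym", "doc")
net.add("doctor", "hypernym", "medical practitioner")
net.add("doctor", "hyponym", "specialist")
net.add("doctor", "negation", "doctor of the church")

print(net.related("doctor", "hyponym"))  # ['specialist']
```

Growing the network is then a matter of calling `add` as the inference engine returns knowledge, and pruning is removing triples from `edges`.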
3.2 A Methodology for Context-Aware Query Processing

Figure 5 presents our methodology for context-aware query processing. Phase 1 identifies the query context. Phase 2 builds up the knowledge base by expanding and shrinking the query so that it captures more context. Phase 3 executes the query using an available search engine.
[Figure 5: Methodology for Disambiguating, Augmenting, and Executing a Query.
Identify Query Context (Step 1: Identify noun phrases; Step 2: Build synonym sets; Step 3: Select word sense) → Build Knowledge Base (Step 4: Exclude incorrect sense; Step 5: Include hypernyms & hyponyms; Step 6: Resolve inconsistencies) → Execute Query (Step 7: Construct Boolean query; Step 8: Run query on search engine; Step 9: Retrieve and present results)]

Although the methodology in Figure 5 is procedural, and in parts algorithmic, most steps rely on heuristics to achieve the desired output because no fully algorithmic approach has been devised for resolving problems of context and word polysemy (i.e., multiple word senses) (Miller, 1996). Table 2 describes each heuristic, its assumption, and its support in the literature. The following sections describe each phase and its corresponding steps, using a sample query.

Table 2: Heuristics in Methodology
Step 1. Heuristic: identify noun phrases by querying successive nouns in WordNet. Assumption: WordNet contains common noun phrases. Support: good (size of WordNet; Fellbaum, 1998).
Step 2. Heuristic: find relevant synsets for query terms in WordNet. Assumption: synsets for the query terms are in WordNet. Support: good (size of WordNet; Fellbaum, 1998).
Step 3. Heuristic: allow one synset to be selected for each noun or noun phrase. Assumption: there are not multiple equally relevant synsets. Support: moderate (quality of WordNet, but inherent ambiguity of language; Fellbaum, 1998).
Step 4. Heuristic: the first synonym from the highest-ordered synset not selected by the user has a word sense that the user wants excluded from the query. Assumption: WordNet's synsets are ordered by frequency of usage; they do not highly overlap; excluding one term is sufficient to help the query. Support: moderate (quality of WordNet, but inherent ambiguity of language; Fellbaum, 1998).
Step 5. Heuristic: hypernyms and hyponyms are obtained from WordNet and DAML. Assumption: hypernyms and hyponyms exist in WordNet and DAML. Support: good (size of WordNet; Fellbaum, 1998).
Step 6. Heuristic: find matches between WordNet synonyms and the DAML and WordNet hypernyms and hyponyms. Assumption: there are sufficient synonyms in the synsets provided by WordNet to enable a match. Support: good (size of WordNet; Fellbaum, 1998).
Step 7. Heuristic: add synonyms, hypernyms, and excluded terms to the query using Boolean operators. Assumption: co-occurrence of added and/or excluded terms with query terms on web pages is a reliable basis for accepting or excluding pages. Support: good (usefulness of co-occurrence in information retrieval; Voorhees, 1994; Mandala et al., 1999).
Step 8. Heuristic: obtain results through a web search engine. Assumption: the web search engine's own methodologies do not conflict with or override SRS's methodology. Support: good (web search engines are more effective for the web than traditional ones; Singhal and Kaszkiel, 2001).
Step 9. Heuristic: none.
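At a high level, the three phases can be viewed as a pipeline. The following skeleton is only a scaffold for the discussion that follows: the function names are ours, and the bodies are placeholders for the real steps, which involve WordNet, the DAML library, and user interaction.

```python
# Skeleton of the three-phase methodology (function names and the simplified
# term handling are ours; they are not part of the SRS implementation).
def identify_query_context(query):
    """Phase 1 (steps 1-3): noun phrases, synonym sets, word-sense choice."""
    terms = query.lower().split()          # stand-in for noun-phrase parsing
    return {t: {"sense": 0} for t in terms}

def build_knowledge_base(context):
    """Phase 2 (steps 4-6): exclusions, hypernyms/hyponyms, consistency."""
    return {t: dict(info, expanded=True) for t, info in context.items()}

def execute_query(kb):
    """Phase 3 (steps 7-9): Boolean query handed to a search engine."""
    return " AND ".join(sorted(kb))

def process(query):
    return execute_query(build_knowledge_base(identify_query_context(query)))

print(process("doctor physical therapy"))
```

Each phase consumes the previous phase's output, which mirrors how the knowledge base mediates between context identification and query construction.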
We describe our methodology with the help of a running example. Consider Berners-Lee et al.'s (2001) query: Find Mom a specialist who can provide a series of bi-weekly physical therapy sessions. The need for sense disambiguation is clear if the query is transformed into a shorter, more ambiguous query that more closely approximates queries on the web (Spink and Ozmultu, 2002; Spink, et al., 2001): Find doctors providing physical therapy.

3.2.1 Phase 1: Identify Query Context

Step 1: Identify noun phrases. A user enters his or her query using natural language. The query is first parsed for nouns. Although other parts of speech could be useful, the methodology is initially restricted to nouns because these make up the majority of terms in common ontologies. Furthermore, identifying noun phrases can significantly improve query precision (de Lima and Pedersen, 1999). Noun phrases are identified by querying each consecutive word pair in WordNet. For example, "physical therapy" is a phrase in WordNet, so this step would find one noun and one noun phrase ('doctor' and 'physical therapy'). These form the base nodes for expanding the query. They are represented in semantic network form, as shown in Figure 6. Initially, the terms are linked by a 'candidate' relationship based on their proximity in the query, indicating that the terms have not yet been identified as being related in a lexical or domain-specific way. The semantic network is then augmented with synonyms, hypernyms, hyponyms, and negations.
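The pair-lookup of Step 1 can be sketched as follows. The `PHRASES` set is a toy stand-in for WordNet's phrase entries, and the part-of-speech filtering that keeps only nouns is omitted for brevity.

```python
# Sketch of Step 1: detect noun phrases by probing consecutive word pairs
# against a lexicon (a stub set here; the system queries WordNet).
PHRASES = {"physical therapy"}  # toy stand-in for WordNet's phrase entries

def find_base_terms(query):
    """Split a query into single words and known two-word phrases."""
    words = query.lower().split()
    terms, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in PHRASES:        # consecutive word pair is a known phrase
            terms.append(pair)
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms

print(find_base_terms("find doctors providing physical therapy"))
# ['find', 'doctors', 'providing', 'physical therapy']
```

In the full system, only the nouns among these terms ('doctors', 'physical therapy') would become base nodes of the semantic network.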
[Figure 6: Semantic Network of Query Terms after Step 1. 'Doctor' is linked to 'Physical therapy' by a candidate arc.]

Step 2: Build synonym sets. When the user provides a term, the first task is to identify the proper semantics of the term, given the user's context. To do so, the word senses from
WordNet are used. For each noun or noun phrase, the synonym sets for each context stored in WordNet are extracted. For example, "doctor" has four senses in WordNet: 1) a medical practitioner, 2) a theologian, 3) a game played by children, and 4) an academician. The noun phrase "physical therapy" has just one sense in WordNet. Each synonym set comprises one to many synonyms. These synonyms and synonym sets for the query terms are added to the knowledge base. As a result, the knowledge base is expanded to include contexts.

Step 3: Select word sense. If a query does not contain enough terms to automatically deduce the word sense for the user's context, user interaction is required (Allan and Raghavan, 2002; Voorhees, 1994). The user is presented with the synsets for each noun and noun phrase in the knowledge base that has multiple senses (e.g., "doctor"), from which the user selects the most appropriate sense. For example, the user could select the 'medical practitioner' sense of 'doctor' rather than the 'theologian' sense. Once the desired word sense for each noun and noun phrase has been identified, steps 4-6 add terms to the knowledge base to expand and constrain the query based on the identified contexts. As a result, the Semantic Retrieval System builds a precision-biased query, giving greater emphasis to precision than recall (de Lima and Pedersen, 1999; Hearst, 1996).

3.2.2 Phase 2: Build Knowledge Base

Step 4: Exclude incorrect sense. To ensure precise query results, it is important to filter out pages that contain incorrect senses of each term. Traditional and web-based query expansion techniques augment queries with additional mandatory or weighted terms but do not include negative terms (i.e., terms of an incorrect sense) as filters (Moldovan and Mihalcea, 2000; Efthimiadis, 2000; Greenberg, 2001; Jansen, 2000; Voorhees, 1994). Excluding incorrect senses
is extremely important on the web because of the vast number of results that can be returned. The challenge lies in determining which senses to exclude without increasing user interaction. Because WordNet can return multiple synsets for each query term, unwanted senses can be inferred from the user's chosen word sense. For example, if a user selects the "medical practitioner" sense of 'doctor,' then a correct inference (Table 3) is that the user does not want pages associated with the other senses of the term (theologian, children's game, or scholar). WordNet orders its synsets by estimated frequency of usage, so the system assumes (Table 2) that the most useful term to exclude is the most commonly used sense of the term not selected by the user. Thus, in the medical query, the system would exclude the sense of doctor defined by 'doctor of the church' by including it as negative knowledge in its knowledge base (Figure 7) and by constructing its query with the clause: doctor AND NOT "doctor of the church".

Table 3: Extraction of Hypernyms from Lexical and Ontological Sources

Knowledge from Lexical Sources (WordNet)
- Word sense: Doctor (synset: Doc, Physician, MD, Dr). Hypernym (superclass): medical practitioner, medical man. Hyponym (subclass): GP, specialist, surgeon, intern, extern, allergist, veterinarian.
- Word sense: Doctor of the Church. Hypernym: theologian, Roman Catholic. Hyponym: none listed.
- Word sense: Doctor (child's game). Hypernym: play, child's play. Hyponym: none listed.
- Word sense: Dr. Hypernym: scholar, scholarly person, student. Hyponym: none listed.
- Word sense: Physical Therapy (synset: physiotherapy, physiatrics). Hypernym: therapy. Hyponym: none listed.

Knowledge from Ontological Sources (DAML Library)
- Term: Doctor. Hypernym: Qualification, Medical care professional, Health professional. Hyponym: PhD.
- Term: Physical therapy. Hypernym: Rehabilitation, Medical practice. Hyponym: none listed.
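Steps 3 and 4 together can be sketched with toy synsets based on Table 3. The synset contents are simplified, and `excluded_term` is our name for the heuristic of taking the first synonym of the highest-ordered synset the user did not select.

```python
# Sketch of Steps 3-4: the user picks one sense; the highest-frequency
# unselected sense supplies a term to exclude (toy synsets from Table 3).
SYNSETS = {  # ordered by WordNet's estimated frequency of usage
    "doctor": [
        ["doctor", "doc", "physician", "MD"],  # medical practitioner
        ["doctor of the church"],              # theologian
        ["doctor"],                            # children's game
        ["Dr", "scholar"],                     # academician
    ]
}

def excluded_term(term, chosen_index):
    """First synonym of the highest-ordered synset the user did not select."""
    for i, synset in enumerate(SYNSETS[term]):
        if i != chosen_index:
            return synset[0]
    return None

# The user selects sense 0 ('medical practitioner'), so the theologian
# sense is excluded: the query gains AND NOT "doctor of the church".
print(excluded_term("doctor", 0))  # doctor of the church
```

Because synsets are frequency-ordered, a single pass suffices: the first unselected synset is by assumption the most commonly used unwanted sense.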
Step 5: Include hypernyms and hyponyms. Prior research has expanded queries with hypernyms and hyponyms to enhance recall. In such studies, query terms are typically optional, so the system returns the query term (e.g., doctor) or the superclass (e.g., medical practitioner) or the subclass (e.g., GP) (Greenberg, 2001; Voorhees, 1994). For web-based queries, precision is preferred over recall (de Lima and Pedersen, 1999; Hearst, 1996). When terms are optional, however, including hypernyms and hyponyms often reduces precision (Greenberg, 2001). In contrast, the Semantic Retrieval System includes hypernyms and hyponyms as mandatory terms to enhance precision.
[Figure 7: Semantic Network of Query Terms after Building the Knowledge Base. 'Doctor' has a synonym arc to 'Doc'; hypernym arcs to 'Medical practitioner' and 'Qualification'; hyponym arcs to 'GP', 'Specialist', and 'PhD'; and a negation arc to 'Doctor of the Church' (itself with hypernym 'Theologian'). 'Physical therapy' has a synonym arc to 'Physiotherapy' and hypernym arcs to 'Therapy', 'Rehabilitation', and 'Medical practice'. A candidate arc links the two query terms. Key: he: hypernym; ho: hyponym; n: negation; c: candidate; s: synonym. Shaded ellipses: original terms from the query; unshaded ellipses: new knowledge from lexical and ontological sources; dashed lines: additional knowledge in Table 3 not shown to conserve space.]

For each query term, the Semantic Retrieval System automatically obtains the hyponyms and hypernyms from WordNet for the user's chosen synset. The terms from the DAML ontology library are obtained by querying the library for each noun and noun phrase. Table 3 shows the terms extracted from WordNet and the DAML ontology library for the example. The Semantic Retrieval System allows the user to select up to two hypernyms (one from DAML and one from WordNet), or up to two hyponyms (one from DAML and one from WordNet), for each query term. The system then includes the selected hypernym(s) or hyponym(s) as mandatory terms in the query. For example, if the user selected the hypernym "medical practitioner" in Table 3, the Semantic Retrieval System would search for pages that contain doctor AND "medical practitioner." Similarly, if the user selected the hyponym "specialist," the Semantic Retrieval
System would search for doctor AND specialist. Clearly, some 'doctor' pages that are relevant may not include "medical practitioner" or "specialist." Nevertheless, pages that contain 'doctor' and either 'medical practitioner' or 'specialist' are more likely to reflect the medical sense of the term 'doctor' than pages that contain 'doctor' alone. In these situations, recall is sacrificed for precision. Note also that, when querying for the subclass (e.g., specialist), the Semantic Retrieval System still includes the original query term because the narrower term can also be polysemous (e.g., specialist has two senses in WordNet). Figure 7 shows the resulting expanded semantic network of terms after completing this step.

Step 6: Resolve inconsistencies. The ability of query expansion to improve a query depends upon the usefulness of the terms added. As acknowledged in Table 2, the previous steps are fallible because: (a) the synonym sets in WordNet are not necessarily orthogonal, so a partially relevant word sense may be excluded by the methodology in step 4; (b) the DAML ontologies are of mixed quality and might contain inaccurate information (Hendler, 2001); and (c) WordNet and the domain ontologies are contributed by many individuals who may have differing views of a domain. To identify inconsistencies, the Semantic Retrieval System checks the hyponyms and hypernyms of the query terms (from DAML and WordNet) against the synonyms of the query term (from WordNet) that the user did not select as the desired word sense. Upon finding a match, it asks the user if the term is desired, and adjusts the knowledge base accordingly. For example, if a user selects the medical sense of 'doctor' from the synsets returned from WordNet, but then selects 'theologian' or 'scholar' as his or her preferred DAML hypernym, the Semantic Retrieval System would identify this as an inconsistency and ask the user to verify his or her
selections. This process reduces the risk that the system will make an incorrect inference prior to executing the query.
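The Step 6 consistency check can be sketched as follows. This is a minimal Python sketch: the function name and data structures are illustrative assumptions, not the prototype's actual API.

```python
# Hedged sketch of the Step 6 consistency check: flag any chosen
# expansion term that matches a synonym of a rejected word sense.
# All names and data here are illustrative, not the prototype's API.

def find_inconsistencies(chosen_expansions, unselected_synsets):
    """Return expansion terms that collide with a synonym of a
    word sense the user rejected in Step 4."""
    rejected = {syn.lower() for synset in unselected_synsets for syn in synset}
    return [term for term in chosen_expansions if term.lower() in rejected]

# 'doctor': the user kept the medical sense; these senses were rejected.
unselected = [["Doctor of the Church", "theologian"], ["scholar"]]
chosen = ["medical practitioner", "theologian"]
print(find_inconsistencies(chosen, unselected))  # ['theologian']
```

A hit such as 'theologian' here would trigger the verification prompt described above rather than an automatic exclusion.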
3.2.3 Phase 3: Execute Query

Step 7: Construct Boolean query: The augmented query is transformed into the syntax of a search engine using Boolean constraints to improve precision (Hearst, 1996; Moldovan and Mihalcea, 2000). The query expansion techniques depend upon the type of term added:

(a) Synonym: For each query term, the Semantic Retrieval System automatically adds the first synonym provided by WordNet from the synset that the user selected with an OR; e.g., (query term OR synonym).

(b) Hypernyms and hyponyms: For each query term, the Semantic Retrieval System includes the hypernyms or hyponyms chosen by the user with an OR. For example, if hypernyms are selected, the system builds the query: query term AND (WordNet hypernym OR DAML hypernym). If hyponyms are selected, the system builds the query: query term AND (WordNet hyponym OR DAML hyponym).

(c) Negation: For each query term, the Semantic Retrieval System automatically adds the first synonym from the first remaining synset in WordNet not selected by the user with a Boolean NOT; e.g., (query term AND NOT synonym).

Each heuristic for adding terms attempts to improve precision while minimizing user interaction. Adding synonyms increases precision because the user may not have selected the most precise query term for that word sense. Although adding synonyms on its own could increase recall at the cost of precision (Greenberg, 2001), the use of a synonym in combination with a hypernym should improve precision. Because WordNet lists terms in their estimated order of frequency of use, the first synonym is likely the best alternative for that term.
Adding hypernyms or hyponyms increases the likelihood that a returned page contains the sense of the term the user desires. The Semantic Retrieval System requires a user to: (a) select WordNet hyponyms, because it cannot automatically infer which subclass the user desires; and (b) select DAML hyponyms, because the ontologies can be of mixed quality and the word-senses of ontological terms are not listed. To reduce user interaction, the Semantic Retrieval System automatically selects a WordNet hypernym by choosing the first hypernym returned for the synset selected by the user (because terms are ordered in estimated frequency of usage). However, users can reselect another term. Applying these heuristics to the medical example, and assuming that the user chooses the 'specialist' hyponym for 'doctor' and the 'medical practice' and 'therapy' hypernyms for 'physical therapy,' the following query would be constructed: (Doctor OR Doc) AND "Specialist" AND NOT "Doctor of the church" AND ("Physical therapy" OR physiotherapy) AND (therapy OR "medical practice").

Step 8: Run query on search engine: Although a number of information retrieval search engines have been developed, such as those evaluated at TREC (the Text Retrieval Conference; Harman, 1993), web search engines appear to be more effective for web querying (Singhal and Kaszkiel, 2001). This methodology, therefore, submits the query to one or more web search engines (in their required syntax) for processing. The query construction heuristics work with most search engines. For example, AltaVista allows queries to use a NEAR constraint, but because other search engines such as Google and AlltheWeb do not, it is not used. Likewise, query expansion techniques in traditional information retrieval systems can add up to 800 terms to a query with varying weights (Qiu and Frei, 1993). This approach is not used in our methodology because web search engines limit the number of query terms (e.g., Google has a limit of ten terms).
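The Step 7 heuristics can be sketched as a small query builder. This is a hedged Python sketch: `build_query` and its parameters are illustrative assumptions, not the prototype's actual interface.

```python
# Illustrative sketch of the Step 7 Boolean-query heuristics:
# (term OR synonym) AND (hypernym/hyponym alternatives) AND NOT negation.

def build_query(term, synonym=None, hyper_or_hypo=(), negation=None):
    """Assemble the Boolean fragment for one query term."""
    parts = []
    # (a) Synonym: first WordNet synonym from the selected synset, OR'd in.
    parts.append(f'({term} OR {synonym})' if synonym else term)
    # (b) Hypernyms/hyponyms chosen by the user, OR'd together.
    alts = [t for t in hyper_or_hypo if t]
    if alts:
        parts.append('(' + ' OR '.join(alts) + ')' if len(alts) > 1 else alts[0])
    # (c) Negation: first synonym from the first unselected synset.
    if negation:
        parts.append(f'NOT {negation}')
    return ' AND '.join(parts)

q = build_query('doctor', synonym='doc',
                hyper_or_hypo=('specialist',),
                negation='"Doctor of the church"')
print(q)  # (doctor OR doc) AND specialist AND NOT "Doctor of the church"
```

Joining such fragments for each query term with AND yields a query like the medical example above.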
Step 9: Retrieve and present results: In the final step, the results from the search engine (URLs and ‘snippets’ provided from the web pages) are retrieved and presented to the user. The user can either accept the results or rewrite and resubmit the query to get more relevant results. The end result of the nine-step methodology is a more context-specific and precise query. To test the methodology, a manual simulation was first conducted which indicated that SRS could increase query precision, especially when queries were short or contained ambiguous terms. A prototype was then developed for more rigorous testing.
4. Implementation

The prototype of the Semantic Retrieval System uses J2EE technologies and follows a traditional client-server architecture, as shown in Figure 8. The user specifies search queries in natural language. The server contains the Java application code and the WordNet database. The prototype provides an interface to Google (www.google.com) and AlltheWeb (www.alltheweb.com). The Semantic Retrieval System contains three modules: (a) the Interface & Parser Module, (b) the Local Knowledge Base & Inference Engine, and (c) the Query Constructor. The Interface & Parser Module is responsible for capturing the user's natural language input and parsing it for nouns and noun phrases. Queries are parsed using QTAG (Mason, 2003), a probabilistic part-of-speech tagger that returns the part of speech for each word in the text. Based on the nouns and noun phrases identified in the query, an initial, baseline query is created. The Local Knowledge Base & Inference Engine interfaces with the WordNet lexical database via JWordNet (http://sourceforge.net/projects/jwn/). Synonyms are first obtained for each noun and noun phrase in the initial query. Based on the user's selected synset, the hypernyms and hyponyms for the selected sense of the term are obtained from the lexical database. Hypernyms and hyponyms from the DAML library are obtained using Teknowledge's DAML
Semantic Search Service (http://plucky.teknowledge.com/daml/damlquery.jsp). The Query Constructor then expands the query, adding synonyms, negative knowledge, hypernyms, and/or hyponyms, following the heuristics outlined in Section 3.2. It also keeps track of the number of terms currently in a search query in case the maximum number of terms for a search engine (e.g., 10 terms for Google) is reached. The expanded query is then submitted to the search engine using appropriate syntax and constraints, and the results are returned. The Query Constructor interacts with Google through its Web API service and with AlltheWeb through URI encoding. It requests the first twenty pages from both search engines in the format they support. In both cases, the title of the page, the snippet, and the URL for each page are returned.
Figure 8. Semantic Retrieval System Design
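The noun and noun-phrase extraction performed by the Interface & Parser Module can be illustrated with a toy stand-in. The prototype uses the QTAG tagger; the hard-coded tag table below is purely illustrative.

```python
# Toy stand-in for the POS-tagging step (the prototype uses QTAG).
# The tag table is hard-coded for illustration only.
TAGS = {'find': 'VB', 'a': 'DT', 'department': 'NN', 'chair': 'NN',
        'in': 'IN', 'atlanta': 'NNP'}

def extract_nouns(query):
    """Keep nouns and merge adjacent nouns into noun phrases."""
    tokens = [w.strip('.?,').lower() for w in query.split()]
    tagged = [(w, TAGS.get(w, 'NN')) for w in tokens]  # unknown words default to noun
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith('NN'):
            current.append(word)
        elif current:
            phrases.append(' '.join(current))
            current = []
    if current:
        phrases.append(' '.join(current))
    return phrases

print(extract_nouns('Find a department chair in Atlanta'))
# ['department chair', 'atlanta']
```

These extracted phrases would then form the initial, baseline query that the later steps refine.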
4.1 An Illustrative Scenario

The initial user interface is shown in Figure 9. Assume the user enters the query: "Find a department chair in Atlanta." The interface & parser module extracts the user's input and parses it. The noun phrases are displayed to the user (Figure 9). The user can uncheck nouns; by default, all are selected. The selected terms form the initial search query, which the user can then refine.
Figure 9. Parsing Natural Language Query and Word Sense Selection
Figure 10. Selecting Hypernyms and Hyponyms from WordNet and DAML
Figure 11. Presentation of Results to User

To refine the query, the query constructor retrieves the word senses for each query term and displays them to the user, as illustrated in Figure 9. The user selects the appropriate sense. If none is selected, the module uses the query's base term. In Figure 9, "chair" has four senses. After selecting the appropriate word sense, the user initiates query refinement. The module then gathers the user selections and executes additional steps (steps 4-7) to expand the query. For example, based on the selected word sense for each term, a synonym is identified and added to the query with an OR condition. "Chair" is expanded with the first term in the user's selected synset (i.e., chair OR professorship). In addition, a synonym from the next-highest unselected synset is added as negative knowledge to exclude unrelated hits (i.e., chair -president). For each term, the hypernyms and hyponyms (superclass and subclass) corresponding to the selected word sense are retrieved from WordNet and the DAML ontology library and displayed to the user
(Figure 10). The user-selected hypernyms and hyponyms are added to the query. The final refined query is submitted to Google and the first 20 hits returned (Figure 11).
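Submitting the refined query via URI encoding (as done for AlltheWeb) can be sketched as follows. The base URL and parameter names are illustrative conventions, not the engines' actual APIs.

```python
from urllib.parse import urlencode

# Illustrative only: 'q' and 'num' follow common search-engine
# query-string conventions, not the exact APIs used by the prototype.
def result_url(base, query, num=20):
    """Build a GET URL requesting the first `num` hits for `query`."""
    return base + '?' + urlencode({'q': query, 'num': num})

url = result_url('https://www.example-search.com/search',
                 '(chair OR professorship) -president')
print(url)
```

The prototype similarly requests the first twenty hits and then parses the title, snippet, and URL out of each result.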
5. Empirical Evaluation

To assess the methodology's effectiveness, a laboratory study was carried out in which the results obtained using the Semantic Retrieval System were compared to those obtained using a web search engine alone. The control group (search engine only) and the experimental group (search engine plus the Semantic Retrieval System) used the same search engine. Thus, the experiment directly tests the benefit of the methodology and its heuristics.
5.1 Treatment and Dependent Variables

The experiment's proposition was that queries sent to Google and AlltheWeb would be more precise when users used the Semantic Retrieval System. The dependent variable was the relevance of the results obtained. Following Buckley and Voorhees (2000), two measures of relevance were used: the number of relevant pages in the first 10 and the first 20 pages (Relevance(10) and Relevance(20)). These are appropriate in a web context because users do not view many results (Spink et al., 2001). To assess a page's relevance, each subject reviewed the page's title and snippet and indicated whether the page was relevant, partially relevant, or not relevant for his or her query. Following Efthimiadis (2000), this allowed two further measures of relevance: R1 (the number of relevant pages) and R2 (the number of relevant or partially relevant pages). Combining the cutoffs and relevance criteria gave four measures of the dependent variable: R1(10), R1(20), R2(10), and R2(20). We expected the same results for each measure. Recall was not tested because it is less relevant and not strictly measurable on the web (Efthimiadis, 2000). To increase generalizability, the experiment included two control variables: query term ambiguity and search engine (see Figure 12). Query
term ambiguity refers to the degree to which query terms are ambiguous or clear. Query term ambiguity should have a negative impact on the precision of results. However, because the Semantic Retrieval System works to increase the clarity of search terms, we expected an interaction effect such that the system would have a greater benefit when search terms were ambiguous (see Figure 12). Finally, we controlled for search engines by testing the methodology across two search engines (Google and AlltheWeb). The Semantic Retrieval System was expected to have a significant effect on query precision irrespective of the search engine used (Figure 12). Table 4 details our hypotheses.
Figure 12. Empirical Test (independent variable: SRS plus search engine, with a positive effect on the dependent variable, relevance of results; control variables: query term ambiguity (high vs. low), also with a positive effect, and search engine (Google/AlltheWeb), controlled but with no a priori hypothesis)

Table 4: Hypotheses

1. Users will judge query results obtained from the Semantic Retrieval System to be more relevant than query results obtained from a search engine alone.
2. The Semantic Retrieval System will have a greater positive effect on the relevance of query results when query terms are more ambiguous.
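The four relevance measures defined in Section 5.1 can be computed directly from a subject's ranked judgments. A minimal sketch; the judgment codes 'R', 'P', and 'N' are illustrative labels for relevant, partially relevant, and not relevant.

```python
# Sketch of the four dependent-variable measures: R1/R2 at cutoffs 10 and 20.
# 'R' = relevant, 'P' = partially relevant, 'N' = not relevant (illustrative codes).

def relevance_measures(judgments):
    """judgments: list of 'R'/'P'/'N' for each result in rank order."""
    def count(labels, k):
        return sum(1 for j in judgments[:k] if j in labels)
    return {'R1(10)': count({'R'}, 10), 'R1(20)': count({'R'}, 20),
            'R2(10)': count({'R', 'P'}, 10), 'R2(20)': count({'R', 'P'}, 20)}

judged = ['R', 'P', 'N', 'R', 'R', 'N', 'P', 'R', 'N', 'R',
          'R', 'N', 'P', 'R', 'N', 'R', 'N', 'P', 'R', 'N']
print(relevance_measures(judged))
# {'R1(10)': 5, 'R1(20)': 9, 'R2(10)': 7, 'R2(20)': 13}
```

R1 counts only strictly relevant pages, while R2 also credits partially relevant ones, which is why R2 is never smaller than R1 at the same cutoff.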
5.2 Query Sample

A large body of diverse queries was tested, as recommended by Buckley and Voorhees (2000). Clearly, for SRS's methodology to operate, a user's query must contain terms that exist in WordNet or DAML. Of course, many queries could use terms that do not exist in either WordNet
or DAML; for example, 'instance' information such as names of products (Spink et al., 2001). To ensure a test of the study's proposition, the query sample was constructed as follows. First, all single-word classes from the DAML library were extracted. These were then pruned by excluding terms that (a) had no superclasses in the DAML library, or (b) would be unfamiliar to subjects (e.g., highly specialized terms). A total of 280 terms were selected. These were subdivided into two groups based on their word-senses in WordNet: 139 terms (the clear group) had 0-2 word-senses in WordNet (where 0 represents a domain-specific term not in WordNet), and 141 terms (the ambiguous group) had 3 or more word-senses. Subjects were then asked to perform four queries. Two of these queries were required to contain terms from a random list of 26 terms from DAML. The other two queries were unconstrained. The subjects were randomly assigned either 26 clear or 26 unclear terms (26 terms being found to be a reasonable number in a pilot test). The result of the random assignment was that 38 subjects received clear terms, and 33 received unclear terms. All queries were formed in natural language and, consistent with most web queries, were required to be short, including approximately two nouns each (Spink et al., 2001; Spink and Ozmultu, 2002).

5.3 Subjects, Materials and Procedure

Seventy-one students from two universities participated voluntarily. Each subject contributed four queries. Subjects were experienced search engine users, recording a mean of 6 on a 7-point agree/disagree Likert-type scale that asked whether they frequently used search engines. The subjects were required to build their own queries and evaluate their own results since experimenters are unable to create as diverse a set of queries as users, can bias results, and cannot objectively determine whether a result is relevant to a user (Gordon and Pathak, 1999).
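The clear/ambiguous partition used in the term-sampling procedure above amounts to a simple threshold on WordNet sense counts. The sense counts below are made up for illustration.

```python
# Sketch of the clear/ambiguous split: 0-2 WordNet senses = clear,
# 3 or more = ambiguous. Sense counts here are illustrative only.

def split_by_ambiguity(sense_counts, threshold=3):
    """Partition candidate terms by their number of WordNet senses."""
    clear = [t for t, n in sense_counts.items() if n < threshold]
    ambiguous = [t for t, n in sense_counts.items() if n >= threshold]
    return clear, ambiguous

counts = {'physiotherapy': 1, 'chair': 4, 'doctor': 5, 'ontology': 0}
clear, ambiguous = split_by_ambiguity(counts)
print(clear, ambiguous)  # ['physiotherapy', 'ontology'] ['chair', 'doctor']
```

A count of 0 lands in the clear group, matching the study's treatment of domain-specific terms absent from WordNet.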
Each subject received materials explaining how the Semantic Retrieval System operates,
how to construct queries, and how to grade the relevance of the results. Each student’s materials stipulated the order of queries he or she was required to run and the 26 randomly selected terms based on the group to which they had been assigned (ambiguous or clear). Subjects were first given a 10-minute training exercise to introduce them to the Semantic Retrieval System and to practice developing a query and ranking the results. Next, each participant developed his or her four queries. Query 1 and Query 3 required the user to use the list of terms provided (i.e., ambiguous or clear). Query 2 and Query 4 allowed the user to use any terms they wished. Because AlltheWeb was expected to be less familiar to subjects, the training session and the first two queries used AlltheWeb; Google was used for Queries 3 and 4 (see Figure 13). The experiment took approximately 45 minutes. Subjects were given three minutes to perform each query. For each query, they were given four minutes to rank the relevance of the first twenty pages returned from each system (two minutes to rank the twenty pages from the Semantic Retrieval System and two minutes to rank the twenty pages from the search engine alone). The subjects were given a short time (approximately six seconds) to assess the relevance of each page because this was considered to conform to typical web-user behavior. In addition, a pre-test with four graduate students and a pilot test with 49 undergraduate students indicated that it gave sufficient time for accurate responses without burdening the students in terms of the total experimental time.
| | AlltheWeb Search Engine | Google Search Engine |
|---|---|---|
| Terms from the List (Clear or Ambiguous) | Query 1. Part A: SRS plus AlltheWeb; Part B: AlltheWeb only | Query 3. Part A: SRS plus Google; Part B: Google only |
| Students' own Terms (Unconstrained) | Query 2. Part A: SRS plus AlltheWeb; Part B: AlltheWeb only | Query 4. Part A: SRS plus Google; Part B: Google only |
Figure 13. Experimental Procedure
5.4 Summary of Results

Table 5 details the results. For each query, the methodology was expected to increase the number of relevant pages and reduce the number of hits. To provide a detailed analysis, Table 5 gives a separate analysis for each subsample. Table 6 summarizes the results of a MANOVA to test the main effects and interaction effects for each control variable.

Table 5: Results from Paired Sample t-tests†

| Dep. Var | Subsample | N | Mean with SRS | Mean without SRS | Std. Error for Diff. | t-statistic | Sig. 1-tailed | Result as Predicted |
|---|---|---|---|---|---|---|---|---|
| R1(10) | Unconstrained | 127 | 5.5 | 4.4 | .39 | 2.66 | .005* | Yes |
| R1(10) | Constrained – clear | 72 | 5.4 | 3.5 | .42 | 4.34 | .000** | Yes |
| R1(10) | Constrained – unclear | 62 | 5.2 | 3.0 | .47 | 4.70 | .000** | Yes |
| R1(10) | AlltheWeb | 135 | 4.4 | 3.7 | .36 | 2.09 | .020* | Yes |
| R1(10) | Google | 126 | 6.3 | 4.0 | .33 | 7.13 | .000** | Yes |
| R2(10) | Unconstrained | 127 | 7.4 | 6.5 | .40 | 2.23 | .014* | Yes |
| R2(10) | Constrained – clear | 72 | 7.4 | 6.4 | .40 | 2.49 | .008* | Yes |
| R2(10) | Constrained – unclear | 62 | 7.1 | 4.8 | .51 | 4.57 | .000** | Yes |
| R2(10) | AlltheWeb | 135 | 6.6 | 5.8 | .40 | 2.12 | .018* | Yes |
| R2(10) | Google | 126 | 8.1 | 6.4 | .30 | 5.61 | .000** | Yes |
| R1(20) | Unconstrained | 125 | 4.3 | 3.4 | .33 | 2.76 | .004* | Yes |
| R1(20) | Constrained – clear | 67 | 4.5 | 3.1 | .40 | 3.55 | .001** | Yes |
| R1(20) | Constrained – unclear | 60 | 4.4 | 2.2 | .44 | 5.04 | .000** | Yes |
| R1(20) | AlltheWeb | 129 | 3.5 | 2.8 | .34 | 1.92 | .029* | Yes |
| R1(20) | Google | 123 | 5.3 | 3.2 | .27 | 7.70 | .000** | Yes |
| R2(20) | Unconstrained | 125 | 6.7 | 5.7 | .37 | 2.79 | .003* | Yes |
| R2(20) | Constrained – clear | 67 | 6.9 | 5.7 | .40 | 3.01 | .002** | Yes |
| R2(20) | Constrained – unclear | 60 | 6.2 | 4.2 | .44 | 4.58 | .000** | Yes |
| R2(20) | AlltheWeb | 129 | 5.7 | 4.9 | .39 | 2.10 | .019* | Yes |
| R2(20) | Google | 123 | 7.6 | 5.8 | .26 | 7.11 | .000** | Yes |
| # Hits | Unconstrained | 115 | 580,618 | 3,218,232 | 469,513 | -5.62 | .000** | Yes |
| # Hits | Constrained – clear | 69 | 140,548 | 1,026,502 | 194,550 | -4.55 | .000** | Yes |
| # Hits | Constrained – unclear | 58 | 963,611 | 3,481,781 | 454,781 | -5.54 | .000** | Yes |
| # Hits | AlltheWeb | 122 | 822,148 | 3,971,170 | 482,118 | -6.53 | .000** | Yes |
| # Hits | Google | 120 | 267,135 | 1,319,883 | 118,145 | -8.91 | .000** | Yes |

Key: † Maximum N = 142 (unconstrained), 76 (constrained – clear), 66 (constrained – unclear), 142 (AlltheWeb), 142 (Google); N is less than the maximum due to missing values. * sig. at α < 0.05 1-tailed; ** sig. at α < 0.05 1-tailed with Bonferroni adjustment (adjusted alpha = .05/20 = .0025). {R1, R2}{10, 20} = number of pages in the top {10; 20} that are {only relevant; relevant or partially relevant}.
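The paired-sample t statistics reported in Table 5 follow the standard formula t = mean(d) / (s_d / sqrt(n)), where d is the per-subject difference between the SRS and search-engine-only conditions. A sketch with made-up data, not the study's raw data:

```python
import math

def paired_t(with_srs, without):
    """Paired-sample t statistic for two matched score lists (illustrative data)."""
    diffs = [a - b for a, b in zip(with_srs, without)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of diffs
    se = math.sqrt(var / n)                              # std. error of mean diff
    return mean / se

# Hypothetical relevant-page counts for six subjects, with and without SRS.
with_srs = [6, 5, 7, 4, 6, 5]
without = [4, 4, 5, 3, 5, 3]
t = paired_t(with_srs, without)
print(round(t, 2))  # 6.71
```

The "Std. Error for Diff." column in Table 5 corresponds to the `se` term above; each reported t is the mean difference divided by that standard error.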
Table 6: Summary of Tests of Control Variables and Interactions in MANOVA†

| Factor | F-statistic | Significance (2-tailed) | Effect Size (eta squared)‡ |
|---|---|---|---|
| Main effect: SRS | 12.47 | .000*** | .09 |
| Main effect: Ambiguity | 2.81 | .004*** | .02 |
| Main effect: Search Engine | 7.48 | .000*** | .06 |
| Interaction: SRS*Ambiguity | 1.47 | .166 | .01 |
| Interaction: SRS*Search Engine | 2.18 | .070* | .02 |

Key: † MANOVA statistics calculated using Pillai's trace including all DVs (i.e., R1(10), R2(10), R1(20), R2(20)). ‡ Eta squared is the proportion of variability in the DV accounted for by variation in the IV. * significant at α < 0.10, *** significant at α < 0.01 one-tailed.
The results support the methodology. For all subsamples, using the Semantic Retrieval System significantly reduced the number of pages returned but significantly increased the proportion of relevant ones. The results were consistent across all measures of relevance. This was confirmed in a post-hoc analysis, shown in Table 7, which indicates that the Semantic Retrieval System increased the number of relevant results and decreased the number of irrelevant results for every subsample. In 8 of 10 subsamples, there was no effect on the number of partially relevant pages, and in 2 of 10 subsamples, the Semantic Retrieval System retrieved fewer partially relevant pages. This confirmed the methodology's intention to produce precision-biased queries.

Table 7: Paired t-tests for Impact of SRS in each Subsample

| Subsample | R(10) | R(20) | P(10) | P(20) | N(10) | N(20) |
|---|---|---|---|---|---|---|
| Unconstrained | t = 2.6, df = 126, p = .005* | t = 2.8, df = 124, p = .004* | t = -.75, df = 126, p = .227 | t = .55, df = 124, p = .298 | t = -2.5, df = 126, p = .008* | t = -3.4, df = 124, p = .001** |
| Constrained – Clear | t = 4.3, df = 71, p = .000** | t = 3.6, df = 66, p = .001** | t = -2.7, df = 71, p = .005* | t = -.77, df = 66, p = .222 | t = -3.5, df = 71, p = .001** | t = -3.9, df = 66, p = .000** |
| Constrained – Unclear | t = 4.7, df = 61, p = .000** | t = 5.0, df = 59, p = .000** | t = .31, df = 61, p = .381 | t = -.96, df = 59, p = .172 | t = -5.3, df = 61, p = .000** | t = -4.4, df = 59, p = .000** |
| All-the-Web | t = 2.1, df = 134, p = .020* | t = 1.9, df = 128, p = .029* | t = .46, df = 134, p = .324 | t = .75, df = 128, p = .227 | t = -2.8, df = 134, p = .003* | t = -2.8, df = 128, p = .005* |
| Google | t = 7.13, df = 125, p = .000** | t = 7.7, df = 122, p = .000** | t = -3.9, df = 125, p = .000** | t = -1.5, df = 122, p = .072 | t = -6.1, df = 125, p = .000** | t = -7.1, df = 122, p = .000** |

Key: {R, P, N}(k) = number of {relevant; partially relevant; not relevant} pages in the top k. * significant at p < .05 1-tailed; ** significant with Bonferroni adjustment, as in the Table 5 key.