Extracting Semantics with Lexical and Ontological Knowledge Sources: A Methodology for Context-Aware Information Retrieval from the World Wide Web Veda C. Storey Computer Information Systems Department J. Mack Robinson College of Business Georgia State University Email:
[email protected]
Andrew Burton–Jones Computer Information Systems Department J. Mack Robinson College of Business Georgia State University Email:
[email protected]
Vijayan Sugumaran Department of Decision & Information Sciences School of Business Administration Oakland University Email:
[email protected]
Sandeep Purao School of Information Sciences & Technology Pennsylvania State University Email:
[email protected]
Abstract The continued growth of the World Wide Web has made the retrieval of relevant information for a user’s query increasingly difficult. A major obstacle to more accurate and semantically sound retrieval is the inability of web search systems to incorporate context in the retrieval process. This research presents a methodology to increase the semantic content of web query results by building context-aware queries. The methodology contains mechanisms that use two knowledge sources to augment a query: lexicons such as WordNet and ontologies such as the DAML library. A semantic net representation facilitates the process. The methodology has been implemented in a research prototype that can connect to publicly available search engines to execute the augmented query. An empirical test of the methodology and comparison of results against those directly obtained from the search engines shows that the proposed methodology provides more relevant results to users. The results demonstrate that lexicons and ontologies act as complementary sources of knowledge to construct the context necessary for obtaining more relevant results. Keywords: query augmentation, semantic retrieval, ontology, lexicon, context, query, Semantic Web, Semantic Retrieval System.
This research was supported by the J. Mack Robinson College of Business, Georgia State University, and the Office of Research & Graduate Study, Oakland University. Earlier versions of this research were presented at the Twenty-Third International Conference on Information Systems, the Twenty-Second International Conference on Conceptual Modeling, and research workshops at the University of Houston and the University of Washington. We thank the doctoral students at Georgia State University, particularly Cecil Chua and Punit Ahluwalia, for their assistance.
1. Introduction The World Wide Web is one of the world’s most valuable information resources because it continues to grow with contributions from many people. Yet, its rapid growth and lack of structure make it difficult to retrieve information from it (Kobayashi and Takeda, 2000; Lawrence, 2000; Spink and Ozmultu, 2002). A major impediment to information retrieval from the Web is the inability of Web pages and query systems to store or interpret context for the purpose of enhancing query processing and outcomes. One response to these problems is the vision of a ‘Semantic Web,’ which infuses content on the web with semantics (Berners-Lee et al., 2001; Ding et al., 2002). This includes marking up terms on web pages using ontologies that define domain-specific terms and provide inference rules, which act as surrogates for the semantics of each term’s meaning (Berners-Lee et al., 2001; Gruninger and Lee, 2002; IEEE, 2001). This vision of the semantic web, however, is likely to be many years away because it requires, among other things, retrofitting existing content with markup that adds semantics. The alternatives available to inject context and semantics into the query processes include libraries of ontologies that contain a significant amount of domain-specific knowledge (Ding and Fensel, 2001). The most well-known and well-populated of these is the DAML (DARPA Agent Markup Language) library with approximately 280 ontologies and 56,000 classes (http://www.daml.org/ontologies/). Significant effort has gone into building these ontologies, but few information retrieval methodologies leverage them. A second alternative for injecting semantics into the query process is lexical knowledge bases (lexicons) such as WordNet (Miller et al. 1990) that have been developed independently of the ontologies and applied to enhance retrieval separately. A methodology that uses both of these knowledge sources in an integrated manner is likely to take advantage of the synergies they provide, resulting in more relevant query results.
The objective of this research, therefore, is to develop a methodology for retrieving information from the World Wide Web that takes into account the semantics of the user’s query through the use of lexicons and ontologies and, by doing so, obtains more relevant results than currently possible. A second objective is to improve our understanding of how these knowledge sources contribute to more effective information retrieval. To achieve these objectives, we perform the following tasks. First, we develop a methodology for context-aware information retrieval from the web that refines a user’s request using lexical and ontological sources. Next, we implement the methodology as a proof-of-concept prototype called the Semantic Retrieval System. Finally, we assess the effectiveness of the methodology and the contribution of the lexical and ontological knowledge sources by analyzing the results of processing queries with the prototype. The contribution of the research is, first, to provide a methodology for processing queries that takes into account the context of the query. In doing so, it assists in realizing the potential of the Semantic Web. Second, it provides a tangible artifact (Hevner et al., 2004) that we use for evaluating the contribution of lexicons and ontologies in context-aware query processing. This paper is divided into six sections. Section 2 reviews relevant prior research. Section 3 presents the heuristic-based methodology for enabling context-aware query processing. Section 4 describes the prototype that implements the methodology. Section 5 reports the results of an empirical test of the methodology. Section 6 concludes the paper.
2. Related Research This research lies at the intersection of three areas: 1) knowledge representation, 2) Web semantics, and 3) information retrieval, as shown in Figure 1. Table 1 shows illustrative examples of research from these areas.
[Figure 1 depicts this research at the intersection of three areas (Web Semantics, Information Retrieval, and Knowledge Representation), with the individual topics listed in Table 1.]
Figure 1: Related Research for Context-Aware Query Processing

Table 1: Research in Knowledge Representation, Web Semantics, & Information Retrieval
Category | Description | References
Languages for Semantic Web | Languages to specify semantic markup on the web, e.g., ontology inference layer (OIL); DARPA agent markup language (DAML). | Fensel, et al. 2001; McIlraith, et al. 2001
Semantic Web applications | Application of the Semantic Web to diverse applications including knowledge management, military defense, and distance education. | Fensel, et al., 2001; Hendler, 2001
Web information retrieval | Information retrieval tools and approaches for web searching, including indexing, clustering, ranking, and web-page annotating, to be used and extended on the Semantic Web. | Kobayashi and Takeda, 2000
Algorithms for information retrieval | Information retrieval algorithms to improve precision, recall, and speed of information retrieval from large document repositories. | Raghavan, 1997
Processing natural language | Methodologies for effectively processing natural language queries using procedures such as stop-word removal, word stemming, and noun parsing. | Lewis and Jones, 1996; Xu and Croft, 1998
Ontologies for Semantic Web | Ontologies for defining terms and relationships used in specific domains on the Semantic Web; methodologies developed for integrating multiple ontologies. | Maedche and Staab, 2001; Stephens and Huhns, 2001
Building domain ontologies | Prototypes and methodologies for building domain-specific and overarching (upper) level ontologies; methodologies proposed for evaluating the quality of domain ontologies. | Kayed and Colomb, 2002; Guarino and Welty, 2002
Refining lexical sources | Refinements to WordNet and development of applications that use it. | Fellbaum, 1998; Miller, et al., 1990
Expanding queries using lexical sources | Query expansion to improve completeness of document searches; automated and semi-automated techniques attempted. | Voorhees, 1994
2.1 Knowledge Representation The ability of information systems to capture and represent knowledge, and to use semantic information effectively, remains limited (Berners-Lee et al., 2001; Minsky, 1994). The
querying process, for example, is inherently constrained by imprecision and ambiguity in search terms (Miller, 1996; Towell and Voorhees, 1998). Imprecision occurs when a user’s search terms do not exactly match the concepts they intend to retrieve; ambiguity occurs when search terms have multiple meanings. The aim of knowledge representation, as shown in Figure 1, is to ease these problems by providing stocks of domain knowledge and mechanisms to make inferences based on the semantics of these pieces of knowledge (Guarino, 1995). Two important sources of semantics are lexicons and ontologies. Lexicons comprise the general vocabulary of a language and can help resolve imprecision and ambiguity in terms by identifying the meaning of a term by its connection to other terms in a language. A well-known example is WordNet, a comprehensive on-line catalog of English terms (http://www.cogsci.princeton.edu/~wn) (Fellbaum, 1998; Miller, et al. 1990). WordNet stores, categorizes, and relates nouns, verbs, adjectives, and adverbs, organizing them into sets of synonyms (or “synsets”) that share the same underlying word senses. For example, WordNet describes four word-senses for the term ‘chair’ as a noun (seat, professor, chairperson, and electric chair) and two for it as a verb (preside and moderate), as well as superclasses (e.g., furniture) and subclasses (e.g., armchair). Lexicons such as WordNet have been used to assist query processing in prior research, but there is no consensus regarding their relative benefit or the best way to use them (Moldovan and Mihalcea, 2000; Voorhees, 1994). Whereas lexicons describe a language, ontologies are a way of describing one’s world (Weber, 1997). Ontologies generally consist of terms, their definitions, and axioms relating them (Gruber, 1993). Although there are many types of ontologies (Guarino, 1998; Mylopoulos, 1998; Noy and Hafner, 1997; Weber, 2002), they can be characterized as either formal/top-level ontologies that describe reality in general (e.g., things and properties) or material/domain ontologies that describe reality in one or more specific domains (e.g., key things and properties
in the auction domain) (Guarino, 1998; Weber, 2002). Much effort has gone into creating ontologies (Guha and Lenat, 1994, Chan, 2004), ontology development languages (e.g., Fensel et al. 2001; Hendler and McGuinness, 2000; McGuinness et al. 2002), ontology development environments (e.g., Corcho et al. 2003; Kishore et al. 2004), and ontology libraries (Ding and Fensel, 2001); however, there has been little research on retrieving information using them. 2.2 Web Semantics Markup languages such as hypertext markup language (HTML) are the primary mode of structuring documents on the web. These languages do not enable web page designers to specify the meaning of terms. Therefore, search technologies for the web remain primarily keyword-based (e.g., AltaVista) or link-based (e.g., Google), and do not take into account the context of users’ queries or the context of terms on web pages being searched (Lawrence, 2000, Leroy et al. 2003). The intent of the Semantic Web is to solve these problems by ‘marking-up’ terms on web pages with links to online ontologies that provide machine-readable definitions of terms and their relationships with other terms (Berners-Lee, et al., 2001). This should make the web easier to process for humans and their intelligent agents (Fensel, et al., 2001; Hendler, 2001). The proliferation of ontologies is crucial for the Semantic Web, hence, the creation of ontology libraries (Ding et al. 2002). However, ontologies can suffer from imprecision and ambiguity themselves. For example, a recent (December 2004) search for ‘chair’ on the DAML ontology library returned 20 classes that include the string ‘chair’ with varied meanings. The Semantic Web will probably never consist of neat ontologies, but rather a “complex Web of semantics ruled by the same sort of anarchy that rules the rest of the Web” (Hendler, 2001). Thus, research is needed to evaluate how meaning can be represented and used to assist users and their queries. 2.3 Information Retrieval Research on information retrieval (IR) has developed procedures and methodologies that
can support querying on the Web (Raghavan, 1997). A key problem in information retrieval is word-sense disambiguation: a word may have several meanings (homonymy), yet several words may have the same meaning (synonymy) (Sanderson, 2000; Ide and Veronis, 1998; Miller, 1996). The goals of IR are to: 1) increase the relevance of the results returned (precision) by eliminating those that have the wrong sense (homonyms); and 2) increase the proportion of the relevant results in the collection that are returned (recall) by including terms that have the same meaning (synonyms). Word-sense disambiguation requires: 1) identifying the intended meaning of query terms, and 2) altering the query so that it achieves high precision and recall. The first step is usually achieved by deducing a term’s meaning from the other terms in the query (Allan and Raghavan, 2002). This appears infeasible on the Web since most web queries are typically only two words (Spink and Ozmultu, 2002; Spink, et al. 2001), which is too short to identify context (de Lima and Pedersen, 1999; Voorhees, 1994). Thus, some user interaction is inevitable to accurately identify the intended sense of terms in web queries (Allan and Raghavan, 2002). The second step (altering a query) is generally achieved through a combination of:
• query constraints, such as using Boolean operators to require pages to include all query terms (Hearst, 1996; Mitra, et al., 1998; Eastman and Jansen, 2003);
• query expansion with ‘local context,’ in which terms are added to the query based on a subset of documents the user has identified as relevant (Mitra, et al., 1998; Salton and Buckley, 1990; Xu and Croft, 1998); and/or
• query expansion with ‘global context,’ in which additional terms are added to the query from thesauri, terms in the document collection, or past queries (de Lima and Pedersen, 1999; Greenberg, 2001; Qiu and Frei, 1993; Voorhees, 1994).
Of these three, query constraints have been shown to improve Web search, but only if used wisely (Moldovan and Mihalcea, 2000; Eastman and Jansen, 2003). Query expansion with
local context is unlikely to be effective; studies show that Web users rarely provide relevance feedback (Cui, et al., 2002; Spink, et al., 2001). Query expansion with global context should be useful on the traditional Web. Theoretically, it should not be required on the Semantic Web, because terms on Semantic Web pages will be linked to ontologies that explain each term’s meaning within a web of related semantics (Berners-Lee et al. 2001). That is, rather than having to add global context to the query, global context will already be defined on the Semantic Web pages and associated ontologies. Unfortunately, ontologies for the Semantic Web cannot be depended upon fully because they are often of poor quality (Hendler, 2001). Consequently, existing information retrieval procedures for expanding queries with global context should remain useful on both the traditional, and Semantic, Web. To add global context, research suggests that two thesauri should be used, ideally in combination (Mandala, et al., 1999; Moldovan and Mihalcea, 2000): (1) general lexical thesauri that provide lexically related terms (e.g., synonyms), and (2) domain-specific thesauri that provide related terms in a specific domain. Although WordNet is the preferred lexical thesaurus (Fellbaum, 1998) it is very difficult to identify domain-specific thesauri for all of the Web’s domains (Greenberg, 2001). One solution, we propose, is to use libraries of domain ontologies, such as the DAML library, which can provide useful domain-specific knowledge even in the presence of individual ontologies that are incomplete or inaccurate (Stephens and Huhns 2001). We, therefore, suggest that ontology libraries can have a dual role on the traditional Web and the Semantic Web: 1) as a source of definitions of terms on specific web pages, and 2) as a source of domain-specific semantics that, together with lexical knowledge, can provide global context for query expansion. Although lexicons have been used in information retrieval for some time, there is no accepted approach to doing so, nor has the joint application of lexicons and
ontologies been specifically explored for retrieving information from the Web. A methodology for doing so is presented next.
3. A Methodology for Context-Aware Information Retrieval The objective of the methodology is to improve the relevance of information retrieved by identifying the intended sense of query terms. This intended sense of the query terms is then reflected in the query by expanding it to incorporate context and constraints, following the theoretical rationale for the correspondence between term co-occurrence and relevance (van Rijsbergen, 1977). van Rijsbergen suggests that a user is likely to find a page relevant if it contains the query term (e.g., chair) as well as additional terms that are related to the desired sense (e.g., professor). Although intuitive, this has proven remarkably hard to achieve in practice because it is difficult to identify the correct context and to expand the query with the right terms and constraints (Peat and Willett, 1991; Sanderson, 2000). The methodology we propose to address this challenge shares its goals with some prior approaches (Ide and Veronis, 1998). It differs in its use of lexicons and ontologies as dual sources of knowledge that contribute, together with a coordinated set of query constraints, to constructing the context. Figure 2 shows the proposed methodology.

Identify Query Context (Step 1: Identify noun phrases; Step 2: Build synonym sets; Step 3: Select word sense) → Refine Query Context (Step 4: Exclude incorrect sense; Step 5: Include hypernyms & hyponyms; Step 6: Resolve inconsistencies) → Execute Query (Step 7: Construct Boolean query; Step 8: Run query on search engine; Step 9: Retrieve and present results)

Figure 2: Methodology for context-aware information retrieval

Following our intent to understand how the two sources of knowledge, ontologies and lexicons, contribute to retrieving more relevant information from the web, our methodology is developed to leverage specific knowledge elements available in lexicons and ontologies. These
knowledge elements differ across the two sources. For example, the form of knowledge available in a lexicon such as WordNet includes synonyms, hypernyms (superclasses), hyponyms (subclasses), meronyms (parts), and holonyms (wholes). These can be complemented by the forms available in ontologies, which typically do not provide synonyms, but often provide hypernyms, hyponyms, meronyms, and holonyms, and can also include properties and axioms. These sources also differ in function. Lexicons document generally accepted meanings of words in a language, whereas ontologies document the meaning of terms in a specific domain. As domain-specific terms can become generally accepted over time, terms in lexicons and ontologies can overlap. However, at any time, lexicons can include terms for which a corresponding ontological term has not been created, and ontologies can offer domain-specific terms not yet provided by lexicons.[1] The methodology we propose leverages these complementary sources. Two scoping decisions on the development of the methodology were: 1) the use of a subset of knowledge forms, namely, synonyms, hypernyms, and hyponyms (not meronyms or holonyms), and 2) the use of noun phrases (e.g., ‘doctor’ or ‘physical therapy’) for search terms. This is based upon prior research which shows that these dominate search and make up the majority of terms in common ontologies and in query approaches that use lexicons (de Lima and Pedersen, 1999). 3.1 Supporting Architecture The methodology is supported by a logical architecture, shown in Figure 3. It comprises internal and external components that support execution of the methodology.

[1] For example, the term “work” appears in nine ontologies in the DAML library, reflecting a range of domains (e.g., biology, computer science, and life). In each one, “work” is described in the way that the creator of that ontology views work in that domain. For example, in the computer science ontology (http://www.daml.org/ontologies/64), “work” has two subclasses: “course” and “research.” In contrast, WordNet describes six generally accepted senses of “work” in English (including purposeful activity, output, employment, and workplace) and lists 86 subclasses of “work” across these senses. Although WordNet lists several subclasses that are similar to those in the computer science ontology (e.g., coursework and publication), it does not include the specific terms “course” and “research.”
[Figure 3 depicts the logical architecture: internal components (Interface, Inference Engine, Local Knowledge Base, Query Constructor) exchange the original query terms, new knowledge (word senses, synonyms, hypernyms, hyponyms), the formatted query, search-engine results, and user feedback with external components (remote lexical and ontological knowledge sources and search engines).]
Figure 3: Logical Architecture Supporting the Methodology The internal components include an interface, an inference engine that uses a local knowledge base, and a query constructor. The "interface" captures the user’s query in natural language (because users often use search syntax incorrectly (Lewis and Jones, 1996; Spink, et al. 2001)), transforms it for processing, and presents the results to the user after processing. The "local knowledge base," conceived as a semantic network, similar to prior retrieval systems (Chen and Dhar 1990; Lee and Baik 1999), allows representation of the query as made up of terms (nodes) and semantic relationships between terms (arcs) (see Figure 4). It offers five types of relationships, which serve as a semantic basis for constructing context-aware queries: synonym, hypernym, hyponym, negation (X is not Y), and candidate (X is related to Y). The "inference engine" grows the semantic network by adding terms from the remote lexical and ontological knowledge sources and shrinks the network by pruning knowledge that is not related to the intended sense of the query terms. The "query constructor" uses the local knowledge base to construct context-aware web queries using the syntax and Boolean constraints required by a given search engine. The final two components include "remote knowledge sources" for lexical
(WordNet) and ontological knowledge (the DAML ontology library); and "search engines" (such as Google, AlltheWeb and AltaVista) that are invoked to execute the enhanced query.

[Figure 4: shaded nodes are the original terms from the query ('Chair' and 'Law', linked by a candidate relationship); unshaded nodes are new knowledge from the lexical and ontological sources. Relationships shown include hypernym (Professor–Chair), synonym (Law–Jurisprudence), and hyponym (Law–Patent Law); 'Electric Chair' appears as a node from an unwanted sense.]

Figure 4: Example Semantic Network for a Two-Word Query
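To make the structure of the local knowledge base concrete, the following is a minimal sketch of such a semantic network, assuming a simple in-memory adjacency-list representation; the class and method names are illustrative and not the SRS implementation, and the specific edges simply rebuild the two-word example of Figure 4.

```java
import java.util.*;

public final class SemanticNetwork {

    enum Relation { SYNONYM, HYPERNYM, HYPONYM, NEGATION, CANDIDATE }

    // Adjacency list: term -> (related term -> relationship type).
    private final Map<String, Map<String, Relation>> arcs = new HashMap<>();

    // Grow the network with knowledge from a lexical or ontological source.
    void relate(String from, String to, Relation rel) {
        arcs.computeIfAbsent(from, t -> new HashMap<>()).put(to, rel);
    }

    // Shrink the network by pruning knowledge unrelated to the intended sense.
    void prune(String from, String to) {
        Map<String, Relation> out = arcs.get(from);
        if (out != null) out.remove(to);
    }

    Map<String, Relation> related(String term) {
        return arcs.getOrDefault(term, Collections.emptyMap());
    }

    public static void main(String[] args) {
        SemanticNetwork net = new SemanticNetwork();
        net.relate("chair", "law", Relation.CANDIDATE);            // original terms co-occur in the query
        net.relate("chair", "professor", Relation.HYPERNYM);       // added from WordNet
        net.relate("chair", "electric chair", Relation.NEGATION);  // unwanted sense (illustrative)
        net.relate("law", "jurisprudence", Relation.SYNONYM);
        net.relate("law", "patent law", Relation.HYPONYM);
        System.out.println(net.related("chair"));
        System.out.println(net.related("law"));
    }
}
```

The inference engine described above would call relate() as it grows the network from the remote sources and prune() as it discards knowledge unrelated to the chosen word sense.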
The methodology makes minimal assumptions about the corpus/document collection (only that it contains English terms), the web search engine (only that it allows Boolean operators), and the user (only that s/he queries in English).
This contrasts with prior query
expansion methodologies that have augmented the corpus/index being queried (with data to support disambiguation), tailored the search algorithm (so it can utilize the data added to the corpus/index), and made assumptions about the user (such as assuming that s/he is an expert in information retrieval) (Hearst, 1996, Sanderson 2000). Second, the methodology is designed to connect to publicly available tools (e.g. Google) to perform the search instead of simulated ones. 3.2 Algorithms The methodology is described below by simulating the phases for the query posited by Berners-Lee et al. (2001): Find Mom a specialist who can provide a series of bi-weekly physical therapy sessions, shortened to more closely approximate queries on the web (Spink and Ozmultu 2002; Spink et al. 2001): Find doctors providing physical therapy. 3.2.1 Phase 1: Identify Query Context The first phase in the methodology (see Figure 2 earlier) is necessary to build the context following the rationale suggested by van Rijsbergen (1977). We use two knowledge sources,
ontologies and lexicons, to support this phase, which consists of three steps that incrementally build the context after parsing the natural language query from the user (Table 2).

Table 2. Algorithms for phase 1 of the methodology (see Figure 2)
Step | Assumption | Support
1. Identify noun-phrases by querying successive query terms in WordNet. | WordNet contains most common noun phrases. | Good support: WordNet is the highest-quality publicly available lexicon.
2. Find relevant synsets for query terms in WordNet. | For each noun phrase, WordNet contains most common synsets. | Good support: WordNet contains common synsets for most noun phrases (Ide and Veronis, 1998).
3. Select one synset for each noun-phrase. | One synset reflects the desired word sense. | Moderate support: quality of WordNet, but inherent ambiguity of language (Fellbaum, 1998).
Step 1: Identify noun phrases: The user’s natural language query is parsed to identify nouns that are matched against each consecutive word-pair in WordNet. WordNet lists ‘physical therapy’ as a noun-phrase, so, for our example, this step would find one noun (‘doctor’) and one noun-phrase (‘physical therapy’). These form the base nodes for expanding the query, as shown in Figure 5. The terms are linked by a ‘candidate’ relationship based on their proximity in the query, indicating that the terms have not yet been identified as being related in a lexical or domain-specific way.
[Figure 5: the nodes 'Doctor' and 'Physical therapy' linked by a candidate relationship.]
Figure 5: Semantic Network of Query Terms after Step 1 Step 2: Build synonym sets: When the user provides a term, the first task is to identify the user’s intended sense of the term. To do so, word senses from WordNet are used. For each noun or noun-phrase, synsets for each context are extracted from WordNet. For our example, four senses are found in WordNet for the noun ‘doctor’: 1) medical practitioner, 2) theologian, 3) game played by children, and 4) academic, and one for the noun phrase, ‘physical therapy.’ Each synset comprises one to many synonyms. The synsets for the query terms are added to the knowledge base, thereby expanding the network with potential contexts for each term.
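A minimal sketch of Steps 1 and 2 follows, assuming a hypothetical Lexicon interface standing in for the WordNet lookups (the prototype uses JWordNet; the interface, method names, and the tiny in-memory stub below are illustrative only, not the actual library API).

```java
import java.util.*;

public final class ContextIdentifier {

    // Hypothetical wrapper around WordNet lookups; not the JWordNet API.
    interface Lexicon {
        boolean isNounPhrase(String phrase);           // does WordNet list this noun (phrase)?
        List<List<String>> synsets(String nounPhrase); // one synonym list per word sense
    }

    private final Lexicon lexicon;
    ContextIdentifier(Lexicon lexicon) { this.lexicon = lexicon; }

    // Step 1: check consecutive word pairs against the lexicon first, then single nouns.
    List<String> identifyNounPhrases(List<String> nouns) {
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i < nouns.size(); i++) {
            if (i + 1 < nouns.size() && lexicon.isNounPhrase(nouns.get(i) + " " + nouns.get(i + 1))) {
                phrases.add(nouns.get(i) + " " + nouns.get(i + 1)); // e.g., "physical therapy"
                i++;                                                // consume both words
            } else {
                phrases.add(nouns.get(i));                          // e.g., "doctor"
            }
        }
        return phrases;
    }

    // Step 2: attach every candidate synset so the user can select a sense in Step 3.
    Map<String, List<List<String>>> buildSynonymSets(List<String> phrases) {
        Map<String, List<List<String>>> contexts = new LinkedHashMap<>();
        for (String p : phrases) contexts.put(p, lexicon.synsets(p));
        return contexts;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for WordNet, just enough to run the example query.
        Map<String, List<List<String>>> wn = Map.of(
                "doctor", List.of(List.of("doctor", "doc", "physician", "MD"),
                                  List.of("doctor of the church")),
                "physical therapy", List.of(List.of("physical therapy", "physiotherapy", "physiatrics")));
        Lexicon stub = new Lexicon() {
            public boolean isNounPhrase(String p) { return wn.containsKey(p); }
            public List<List<String>> synsets(String p) { return wn.getOrDefault(p, List.of()); }
        };
        ContextIdentifier ci = new ContextIdentifier(stub);
        List<String> phrases = ci.identifyNounPhrases(List.of("doctor", "physical", "therapy"));
        System.out.println(phrases);                       // [doctor, physical therapy]
        System.out.println(ci.buildSynonymSets(phrases));  // candidate contexts for each term
    }
}
```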
Step 3: Select word sense: If a query does not contain enough terms to automatically deduce a user’s intended word sense, user interaction is required (Allan and Raghavan, 2002). The user is presented with the synsets for each term in the knowledge base that has multiple senses (e.g., ‘doctor,’ ‘physical therapy’) from which the user selects the intended sense. For example, the user could select the ‘medical practitioner’ sense rather than the ‘theologian’ sense of ‘doctor’. Once the desired sense of each term has been identified, steps 4-6 add additional terms to the knowledge base to expand and constrain the query based on the identified senses. 3.2.2 Phase 2: Refine Query Context The second phase (Figure 2) is necessary to refine the context by excluding terms that suggest unwanted word senses. The same knowledge sources support this phase, which consists of three steps that incrementally refine the context identified in the previous phase (Table 3).

Table 3. Algorithms for phase 2 of the methodology (see Figure 2)
Step | Assumption | Support
4. Exclude the first synonym from the highest-ordered synset that the user did not select (i.e., a sense the user wants excluded). | It is sufficient to exclude just one term, rather than all irrelevant senses. | Moderate support: quality of WordNet, but inherent ambiguity of language (Fellbaum, 1998).
5. Include hypernyms and hyponyms from WordNet and DAML. | WordNet and DAML contain a sufficient number of hypernyms and hyponyms. | Good support: size of WordNet (Fellbaum, 1998) and DAML (Ding and Fensel, 2001).
6. Resolve inconsistencies by checking if the DAML terms to be added exist among the terms in the unwanted WordNet synsets. | WordNet synsets provide a sufficient number of synonyms to enable a match if it exists. | Good support: size of WordNet and availability of synsets (Fellbaum, 1998).
Step 4: Exclude incorrect sense: To ensure precise query results, it is important to filter out pages that contain incorrect senses of each term. Traditional and web-based query expansion techniques augment queries with additional mandatory- or weighted-terms but tend not to filter out (i.e., exclude) terms of an incorrect sense (see Moldovan and Mihalcea, 2000; Efthimiadis, 2000; Greenberg, 2001; Jansen, 2000; Voorhees, 1994). Excluding incorrect senses is crucial on the Web because of the vast number of results that can be returned. The difficulty is determining which senses to exclude without increasing user-interaction. As WordNet can return multiple
synsets for each query term, unwanted senses can be inferred from the user’s chosen word sense. For example, if a user selects the ‘medical practitioner’ sense of ‘Doctor,’ then, a correct inference is that the user does not want pages associated with its other senses, e.g., theologian, children’s game, or scholar (Table 4). WordNet orders its synsets by estimated frequency of usage, so it is assumed that the most useful term to exclude is the most commonly used sense of the term not selected by the user. Thus, for our sample query, the methodology excludes the sense of doctor defined by ‘doctor of the church’ by including it as negative knowledge in its knowledge base and refining the query as: “doctor” AND NOT “doctor of the church”. Step 5: Including hypernyms and hyponyms: Prior research has expanded queries with hypernyms and hyponyms to enhance recall. In these studies, query terms are typically optional, so the systems would return the query term (e.g., doctor) or its superclass (medical practitioner) or its subclass (general practitioner) (Greenberg, 2001; Voorhees, 1994). These studies show that, when terms are optional, including hypernyms and hyponyms can enhance recall but it often reduces precision (Greenberg, 2001). For web-based queries, precision is preferred over recall (de Lima and Pedersen, 1999; Hearst, 1996). Thus, in contrast to prior approaches, our methodology includes hypernyms and hyponyms as mandatory (rather than optional) terms to enhance precision. For each query term, hyponyms and hypernyms for the user’s chosen synset are automatically obtained from WordNet. Terms from the DAML library are obtained by querying the library for each noun and noun phrase. Table 4 shows the terms extracted from WordNet and the DAML library for the example. The methodology allows the user to select up to two hypernyms (one from DAML and one from WordNet), and up to two hyponyms (one from DAML and one from WordNet) for each query term. The selected hypernym(s) or hyponym(s) are included as mandatory terms in the query. For example, if the user selects the hypernym ‘medical practitioner,’ the query would
be refined to doctor AND ‘medical practitioner.’ Likewise, if the user selects the hyponym ‘specialist,’ the query would be refined to doctor AND specialist. In these situations, recall is sacrificed for precision. Note also that, when querying for the subclass (e.g., specialist), the query refinement would still include the original query term because the narrower term can also be polysemous (e.g., ‘specialist’ has two senses in WordNet). Table 4 shows examples of hypernyms and hyponyms from the two sources. Figure 6 shows a partial view of the extended semantic network that underlies the refined query.

Table 4. Including Hypernyms and Hyponyms from Lexical and Ontological Sources
Term | Word-sense (defined by synset) | Hypernym (superclass) | Hyponym (subclass) | Source
Doctor | Doc, Physician, MD, Dr | Medical practitioner, medical man | GP, specialist, surgeon, intern, extern, allergist, veterinarian | Lexicon (WordNet)
Doctor | Doctor of the Church | Theologian, Roman Catholic | No hyponym listed | Lexicon (WordNet)
Doctor | Doctor | Play, child’s play | No hyponym listed | Lexicon (WordNet)
Doctor | Dr | Scholar, scholarly person, student | No hyponym listed | Lexicon (WordNet)
Physical Therapy | Physiotherapy, physiatrics | Therapy | No hyponym listed | Lexicon (WordNet)
Doctor | Not specified in the DAML library | Qualification, Medical care professional, Health professional | PhD | Ontology (DAML)
Physical therapy | Not specified in the DAML library | Rehabilitation, Medical practice | No hyponym listed | Ontology (DAML)
[Figure 6: the original query terms 'Doctor' and 'Physical therapy' (shaded, linked by a candidate relationship) are connected to new knowledge from the lexical and ontological sources (unshaded), including the synonyms 'Doc' and 'Physiotherapy,' the hypernyms 'Medical practitioner,' 'Qualification,' 'Therapy,' 'Rehabilitation,' and 'Medical practice,' the hyponyms 'GP,' 'Specialist,' and 'PhD,' and the negated sense 'Doctor of the Church' with its hypernym 'Theologian.' Key: he: hypernym; ho: hyponym; n: negation; c: candidate; s: synonym. Dashed lines indicate additional knowledge from Table 4 not shown to conserve space.]

Figure 6: Semantic Network of Query-Terms after Building Knowledge-Base
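As a concrete illustration of Steps 4 and 5, the sketch below selects the exclusion term from the highest-ordered rejected synset and gathers one hypernym (or hyponym) candidate from each source for the user to confirm; the class, method names, and data values are illustrative, not the SRS code.

```java
import java.util.*;

public final class ContextRefiner {

    // Step 4: the first synonym of the highest-ordered synset the user did NOT select
    // becomes negative knowledge (added to the query with NOT).
    static Optional<String> termToExclude(List<List<String>> orderedSynsets, int chosenIndex) {
        for (int i = 0; i < orderedSynsets.size(); i++) {
            if (i != chosenIndex && !orderedSynsets.get(i).isEmpty()) {
                return Optional.of(orderedSynsets.get(i).get(0));
            }
        }
        return Optional.empty();
    }

    // Step 5: offer up to one hypernym (or hyponym) from each source as mandatory expansion terms.
    static List<String> expansionCandidates(List<String> fromWordNet, List<String> fromDaml) {
        List<String> candidates = new ArrayList<>();
        if (!fromWordNet.isEmpty()) candidates.add(fromWordNet.get(0)); // WordNet lists terms by commonality
        if (!fromDaml.isEmpty()) candidates.add(fromDaml.get(0));       // DAML term requires user confirmation
        return candidates;
    }

    public static void main(String[] args) {
        // WordNet synsets for 'doctor', ordered by estimated frequency of use (illustrative values).
        List<List<String>> doctorSynsets = List.of(
                List.of("doctor", "doc", "physician", "MD"), // medical practitioner (chosen by user)
                List.of("doctor of the church"),             // theologian
                List.of("doctor"),                           // children's game
                List.of("Dr", "doctor"));                    // academic
        System.out.println(termToExclude(doctorSynsets, 0)); // Optional[doctor of the church]

        // Hypernym candidates for the chosen sense, per Table 4 (illustrative values).
        System.out.println(expansionCandidates(
                List.of("medical practitioner", "medical man"),
                List.of("qualification", "medical care professional")));
    }
}
```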
Step 6: Resolve inconsistencies: The ability of query expansion to improve a query depends upon the usefulness of the terms added. The steps outlined above may fail because: (a) the synonym sets in WordNet are not necessarily orthogonal, so a partially relevant word-sense may be excluded by the methodology in step 4, (b) the DAML ontologies are of mixed quality and might contain inaccurate information (Hendler, 2001), and (c) WordNet and the domain ontologies are contributed by many individuals who may have differing views of a domain. To identify inconsistencies, the hyponyms and hypernyms of the query terms selected from DAML are checked against the synonyms of the query term (in WordNet) that the user did not select as the desired word sense. If a match is found, the user is asked if the term is desired and the knowledge base is adjusted accordingly. For example, if a user selects the medical sense of ‘doctor’ from the synsets returned from WordNet, but then selects ‘theologian’ or ‘scholar’ as his or her preferred DAML hypernym, this is identified as an inconsistency and the user is asked to verify his/her selections. This process reduces the risk that an incorrect inference will be made prior to executing the query. 3.2.3 Phase 3: Execute Query The final phase in the methodology (Figure 2) executes the query by invoking publicly available search engines, and presents the results to the user. It consists of three steps that execute the refined, context-aware query and present the results (Table 5).

Table 5. Algorithms for phase 3 of the methodology (see Figure 2)
Step | Assumption | Support
7. Construct the query by adding synonyms, hypernyms, and excluded terms to the query using Boolean operators. | Using Boolean operators to ensure term co-occurrence is a reliable basis for accepting pages. | Good support: usefulness of term co-occurrence if Boolean operators are used properly (Mandala et al. 1999).
8. Execute query by invoking a web search engine. | Web search engines do not conflict with or override the methodology. | Good support: web search engines (e.g., Google) do not use explicit query expansion; thus, the methodology appears to add to, rather than conflict with, existing web search engines.
9. Present results to the user. | --- | ---
Step 7: Construct Boolean query: The augmented query is transformed into the syntax required by a search engine using Boolean constraints (Hearst, 1996; Moldovan and Mihalcea, 2000). This step, thus, simply automates query construction with Boolean operators to compensate for the possible lack of this skill on the part of web-users (Eastman and Jansen, 2003). The Boolean constraints depend upon the type of term. Following the previous steps in the methodology, the first synonym provided by WordNet from the synset that the user selected is added to the query with the OR operator; e.g., (query term OR synonym). The hypernyms or hyponyms chosen by the user are also added with the OR operator. For example, if hypernyms are selected, this results in the query: query term AND (WordNet hypernym OR DAML hypernym). If hyponyms are selected, this results in the query: query term AND (WordNet hyponym OR DAML hyponym). Finally, the first synonym from the first remaining synset in WordNet not selected by the user is added with the negation operator, i.e., NOT; e.g., (query term AND NOT synonym). Each transformation attempts to improve query precision without requiring user interaction. Adding synonyms increases precision because the user may not have selected the most precise query term for that word sense. Although adding synonyms on its own could increase recall at the cost of precision (Greenberg, 2001), the use of a synonym in combination with a hypernym should improve precision. As WordNet lists terms in their estimated order of frequency of use, the first synonym is likely the best alternative for that term. Adding hypernyms or hyponyms increases the likelihood that a page returned contains the user’s desired sense. The methodology requires the user to: (a) select WordNet hyponyms because it cannot automatically infer a user’s desired subclass; and (b) select DAML hyponyms because the ontologies can be of mixed quality and the word-senses of ontological terms are not listed. To reduce user interaction, the methodology automatically selects a WordNet hypernym based on it being the first
hypernym returned from the user’s selected synset (as terms are ordered by commonality). The user can override these selections. Applying these transformations to our example and assuming the user chooses the ‘specialist’ hyponym for ‘doctor’ and the ‘medical practice’ and ‘therapy’ hypernyms for ‘physical therapy,’ the following query is constructed: (Doctor OR Doc) AND “Specialist” AND NOT “Doctor of the church” AND (“Physical therapy” OR physiotherapy) AND (therapy OR “medical practice”). Step 8: Run query on search engine: The query construction transformation described in the previous step uses common Boolean operators to ensure that the resulting query is not dependent on particular search engines. For example, an operator such as NEAR is not part of the query construction process because AltaVista allows it, but others, such as Google and AlltheWeb, do not. Likewise, query expansion techniques in traditional information retrieval systems (such as in TREC, Text Retrieval Conference; Harman, 1993) can add up to 800 terms to the query with varying weights (Qiu and Frei, 1993), whereas web search engines place limits on query length (e.g., Google has a limit of ten terms). Consequently, the methodology only adds terms up to the limit of the search engine. The query, expanded, transformed, and constructed with appropriate Boolean operators, is then used to invoke publicly available search engine(s). If the engines require different syntax, the appropriate syntax is generated for this purpose. Step 9: Retrieve and present results: The results from the search engine (URLs and ‘snippets’ provided from the web pages) are retrieved and presented to the user. The user can either accept the results or rewrite and resubmit the query to obtain different results. The outcome of the nine-step methodology is a more context-specific and precise query.
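A minimal sketch of Step 7 follows, assuming each query term already carries its chosen synonym, its selected hypernyms/hyponyms, and its exclusion term; the record and method names are illustrative, and the engine term limit is passed in rather than hard-coded.

```java
import java.util.*;
import java.util.stream.Collectors;

public final class BooleanQueryBuilder {

    record ExpandedTerm(String term, String synonym,   // synonym may be null
                        List<String> expansions,       // chosen hypernyms or hyponyms
                        String excluded) {}            // sense to negate, may be null

    // Quote multi-word terms so the search engine treats them as phrases.
    private static String quote(String t) {
        return t.contains(" ") ? "\"" + t + "\"" : t;
    }

    static String build(List<ExpandedTerm> terms, int maxTerms) {
        List<String> clauses = new ArrayList<>();
        int used = 0;
        for (ExpandedTerm t : terms) {
            if (used >= maxTerms) break;               // respect engine limits (e.g., 10 terms for Google)
            // (query term OR synonym)
            clauses.add(t.synonym() == null
                    ? quote(t.term())
                    : "(" + quote(t.term()) + " OR " + quote(t.synonym()) + ")");
            used += t.synonym() == null ? 1 : 2;
            // AND (WordNet hypernym/hyponym OR DAML hypernym/hyponym)
            if (!t.expansions().isEmpty() && used + t.expansions().size() <= maxTerms) {
                clauses.add("(" + t.expansions().stream().map(BooleanQueryBuilder::quote)
                        .collect(Collectors.joining(" OR ")) + ")");
                used += t.expansions().size();
            }
            // AND NOT <first synonym of the highest-ordered rejected synset>
            if (t.excluded() != null && used < maxTerms) {
                clauses.add("NOT " + quote(t.excluded()));
                used++;
            }
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        String query = build(List.of(
                new ExpandedTerm("Doctor", "Doc", List.of("Specialist"), "Doctor of the church"),
                new ExpandedTerm("Physical therapy", "physiotherapy",
                        List.of("therapy", "medical practice"), null)), 10);
        System.out.println(query);
        // (Doctor OR Doc) AND (Specialist) AND NOT "Doctor of the church" AND
        // ("Physical therapy" OR physiotherapy) AND (therapy OR "medical practice")
    }
}
```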
4. The Semantic Retrieval System (SRS) The methodology for context-aware information retrieval was first simulated manually and then implemented in a prototype, the Semantic Retrieval System (SRS), which was
constructed using off-the-shelf components, custom programming, and web services. The implementation uses J2EE technologies, and follows a client-server architecture with external components that are invoked as needed. Figure 7 shows the implementation choices for the different elements, and corresponds closely to the logical architecture in Figure 3.
[Figure 7 shows the client-server design: an SRS client; server-side Interface and Parser, Inference Engine with Local Knowledge Base, and Query Constructor; and external components comprising the search engines (Google and AlltheWeb), the lexical source (WordNet), and the ontological source (DAML).]
Figure 7: Semantic Retrieval System Design The “interface & parser module” captures the user’s natural language input and parses it for nouns and noun phrases using the QTAG (Mason, 2003) part-of-speech tagger. It returns the part-of-speech for each input term. The “local knowledge base & inference engine” interfaces with the WordNet lexical database via JWordNet (http://sourceforge.net/projects/jwn/). Synonyms are first obtained for each noun and noun phrase in the initial query. Based on the user’s selected synset, the hypernyms and hyponyms for the selected sense of the term are obtained from the lexical database. Hypernyms and hyponyms from the DAML library are obtained using Teknowledge’s (http://plucky.teknowledge.com/daml/damlquery.jsp) DAML Semantic Search Service. The “query constructor” then expands the query, adding synonyms, negative knowledge, hypernyms, and/or hyponyms, based on the algorithms outlined earlier. It keeps track of the number of terms currently in a search query in case the maximum number of terms
for a search engine (e.g., 10 terms for Google) is reached. The expanded query is then submitted to the search engine using appropriate syntax and constraints, and the results returned. The query constructor interacts with Google through its Web API service and AlltheWeb through URI encoding. It requests the first twenty pages from both search engines, returning the title of the page, the snippet, and the URL for each page. 4.1 Illustrative Scenario The user interface is shown in Figure 8. The user has entered the query: “Find a department chair in Atlanta.” The interface & parser module now extracts the user’s input and parses it. The noun phrases are displayed (Figure 8) and the user can uncheck nouns that are not relevant or appropriate. The selected terms form the initial query, which the user can then refine.
[Figure 8 screenshot: the nouns identified in the query and the word senses identified for each noun from WordNet are displayed to the user.]

Figure 8. Parsing Natural Language Query and Word Sense Selection
To refine the query, the query constructor retrieves the word senses for each query term and displays them to the user, as illustrated in Figure 9, and the user selects the appropriate sense. If none is selected, the module uses the query’s base term. For example, in Figure 8, “chair” has four senses. After selecting the appropriate word sense, the user initiates query refinement. The module then gathers the user selections and executes additional steps (steps 4-7) to expand the query. For example, based on the selected word sense for each term, a synonym is identified and added to the query with an OR condition. “Chair” is expanded with the first term in the user’s selected synset (i.e., chair OR professorship). A synonym of chair is also added as negative knowledge from the next highest synset that has a synonym to exclude unrelated hits (i.e., chair –president). For each term, the hypernyms and hyponyms (superclass and subclass) corresponding to the selected word sense are retrieved from WordNet and the DAML ontology and displayed to the user (Figure 9). The user-selected hypernyms and hyponyms are added to the query. The final refined query is submitted to Google and the first 20 hits returned (Figure 10).
[Figure 9 screenshot: the hypernyms and hyponyms found for each term are displayed for user selection.]
Figure 9. Selecting Hypernyms and Hyponyms from WordNet and DAML
[Figure 10 screenshot: the constructed query and the results from executing the search are displayed.]
Figure 10. Presentation of Results to User
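The illustrative scenario ends with the refined query being submitted to the search engine (Step 8). Below is a minimal sketch of such a submission for an engine reached through URI encoding (as the prototype does with AlltheWeb); the endpoint URL and parameter names are placeholders, not any engine's actual API, and the Google Web API service the prototype uses would require its own client.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public final class QuerySubmitter {

    // Submit the expanded Boolean query and return the raw response body, which would
    // then be parsed for the title, snippet, and URL of each of the first twenty results.
    static String submit(String booleanQuery) throws Exception {
        String url = "https://search.example.com/search"            // placeholder endpoint
                + "?q=" + URLEncoder.encode(booleanQuery, StandardCharsets.UTF_8)
                + "&num=20";                                         // request the first twenty results
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String query = "(Doctor OR Doc) AND Specialist AND NOT \"Doctor of the church\"";
        String body = submit(query);
        System.out.println(body.substring(0, Math.min(200, body.length())));
    }
}
```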
5. Empirical Evaluation To assess the methodology’s effectiveness, a laboratory study was carried out in which the results obtained using the Semantic Retrieval System were compared to those obtained using a web search engine alone. The control group (search-engine) and the experimental group (search-engine plus the Semantic Retrieval System) used the same search engine. Thus, the experiment directly tests the incremental benefit of the methodology. 5.1 Treatment and Dependent Variables The central hypothesis for the experiment was that users would judge query results returned by the Semantic Retrieval System as more relevant than results obtained from a search engine alone (e.g., Google or AlltheWeb) (Figure 11). Because users do not view many web page results (Spink et al., 2001), the dependent variable was the number of relevant pages in the first 10 pages returned (R1[10]). Several variations of this variable were created for sensitivity
analysis. First, we collected data on the number of relevant pages in the first 10 and first 20 pages (R[10] and R[20]) (Buckley and Voorhees, 2000). Second, we asked each subject to review the title and snippet of each page returned and indicate whether the page was relevant, partially relevant, or not relevant for his/her query. This allowed another two measures of relevance: R1 (the number of relevant pages) and R2 (the number of relevant or partially relevant pages) (Efthimiadis, 2000). This allowed us to have one primary dependent variable, R1[10], and three variations that were used for sensitivity analysis: R2[10], R1[20], and R2[20]. Recall was not tested because it is less relevant and not strictly measurable on the web (Efthimiadis, 2000). To increase generalizability, the experiment included two control variables: query term ambiguity and search engines (see Figure 11). Query term ambiguity refers to the degree to which query terms are ambiguous or clear. Although we make no specific hypothesis regarding query term ambiguity, it intuitively should have a negative impact on the precision of results. As the Semantic Retrieval System aims to increase the clarity of search terms, it is also possible that it could have a greater benefit when search terms are more ambiguous, i.e., an interaction effect (Figure 11). In addition to controlling for query term ambiguity, we also controlled for search engines by testing the methodology across two search engines (Google and AlltheWeb). The different search engines could have their own independent effects on query relevance. Nevertheless, we expected the Semantic Retrieval System to have a significant effect on query precision irrespective of the search engine used (per Figure 11).
[Figure 11 shows the evaluation model: the independent variable, search engine with SRS (lexical and ontological knowledge), is hypothesized (H1) to positively affect the dependent variable, relevance of results (R1[10]); query term ambiguity (high vs. low) and search engine (Google/AlltheWeb) are control variables with no a priori hypotheses.]

Figure 11: Empirical Evaluation
5.2 Query Sample A large body of diverse queries was tested (per Buckley and Voorhees, 2000). Clearly, for the Semantic Retrieval System’s methodology to operate, a user’s query must contain terms that exist in WordNet or DAML. Of course, many queries could use terms that do not exist in either WordNet or DAML; for example, ‘instance’ information such as names of specific cities (Spink et al. 2001). To ensure a test of the study’s hypothesis, the query sample was constructed as follows. First, all single-word classes from the DAML library were extracted. These were then pruned by excluding terms that either (a) had no superclasses in the DAML library, or (b) would be unfamiliar to subjects (e.g., highly specialized terms). A total of 280 terms were selected. These were subdivided into two groups based upon their word-senses in WordNet. There were 139 terms (the clear group) that had 0-2 word-senses in WordNet, where 0 represented a domain-specific term not in WordNet, and 141 terms (the unclear group) that had 3 or more word-senses. Subjects were then asked to perform four queries. Two of these queries were required to contain terms from a random list of 26 terms from DAML. The other two queries were unconstrained. The subjects were randomly assigned a list of either 26 clear or 26 unclear terms. (A pilot test found that 26 terms was a reasonable number to use.) This enabled us to include ‘query term ambiguity’ as a between-groups control in the experiment. The result of the random assignment was that 38 subjects received a list of clear terms, and 33 received a list of unclear terms. All queries were formed in natural language and, consistent with web queries in practice, were required to be short, including approximately two nouns each (Spink et al. 2001, Spink and Ozmultu 2002). Example queries constructed by the subjects included the following: Find a school in Georgia; Find soul food recipes; Find research on alga; and Find wood furniture for office.
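The split into 'clear' and 'ambiguous' terms can be expressed as a simple rule over WordNet sense counts; the sketch below uses illustrative counts, not the study's actual 280-term data.

```java
import java.util.*;
import java.util.stream.Collectors;

public final class TermClassifier {

    // true -> ambiguous (3 or more WordNet senses), false -> clear (0-2 senses).
    static Map<Boolean, List<String>> split(Map<String, Integer> senseCounts) {
        return senseCounts.entrySet().stream()
                .collect(Collectors.partitioningBy(e -> e.getValue() >= 3,
                        Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
    }

    public static void main(String[] args) {
        // Illustrative sense counts; 0 would mark a domain-specific term not found in WordNet.
        Map<String, Integer> counts = Map.of("chair", 4, "doctor", 4, "physiotherapy", 1, "allergist", 1);
        Map<Boolean, List<String>> groups = split(counts);
        System.out.println("Clear (0-2 senses): " + groups.get(false));
        System.out.println("Ambiguous (3+ senses): " + groups.get(true));
    }
}
```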
5.3 Experimental Design and Procedure Seventy-one students from two universities participated voluntarily in the experiment. Each subject contributed four queries. Subjects were experienced search engine users, recording a mean of 6 on a 7-point Likert-type scale that asked whether the subjects frequently used search engines. The subjects were required to build their own queries and evaluate their own results since experimenters are unable to create as diverse a set of queries as users, can bias results, and cannot objectively determine whether a result is relevant to a user (Gordon and Pathak, 1999). Figure 12 summarizes the experimental design. Each subject performed each query with and without SRS, and used two search engines. Thus, SRS and ‘Search Engine’ were within-groups factors in the experiment and ‘query-term ambiguity’ was the between-groups factor.
[Figure 12 crosses the two search engines (AlltheWeb, Google) with constrained terms (clear vs. ambiguous) and unconstrained terms; every cell is performed both with and without SRS.]
Figure 12. Experimental Design Each subject received materials explaining how the Semantic Retrieval System operates, how to construct queries, and how to grade the relevance of the results. Each subject’s materials stipulated the order of queries he or she was required to run and the 26 randomly selected terms based upon the group to which the subject had been assigned (clear or ambiguous). The subjects were first given a 10-minute training exercise to introduce them to the Semantic Retrieval System and to practice developing a query and ranking the results. Next, each participant developed his or her four queries. Query 1 and Query 3 required the user to use the list of terms provided (i.e., clear or ambiguous). Query 2 and Query 4 allowed the user to use any terms she wished. Because AlltheWeb was expected to be less familiar to subjects, the training session and
the first two queries used AlltheWeb; Queries 3 and 4 used Google. Figure 13 shows the procedure. The cells in this figure correspond to the experimental design shown in Figure 12.
[Figure 13: Query 1 (AlltheWeb, constrained: clear or ambiguous terms), Query 2 (AlltheWeb, unconstrained), Query 3 (Google, constrained: clear or ambiguous terms), and Query 4 (Google, unconstrained) were each performed with and without SRS (N=142 per query; constrained queries: N(Clear)=76, N(Ambiguous)=66; unconstrained queries: N=284 in total).]
Figure 13. Experimental Procedure The experiment took approximately 45 minutes. Subjects were given three minutes to perform each query. For each query, they were given four minutes to rank the relevance of the first 20 pages returned from each system (two minutes to rank the 20 pages from the Semantic Retrieval System and two minutes to rank the 20 pages from the search engine alone). The subjects were given a short time (averaging approximately six seconds per page) to assess the relevance of the pages returned because this was considered typical of web-user behavior (Spink et al. 2001). A pre-test and a pilot test with 49 undergraduate students indicated that these times were sufficient without burdening the subjects in terms of the total experimental time. 5.4 Results Table 6 presents the results, which show that the hypothesis was supported. On average, users received 1.5 additional relevant pages (5.38 - 3.83) in their first 10 pages when the search was performed “with SRS” (i.e., a 15% increase). To demonstrate the robustness of this result, Table 6 also shows the results for each sub-sample. The results are consistent across all of the sub-samples. As an additional test of robustness, we tested the results for each sub-sample across all four measures of the dependent variable (i.e., R1[10], R2[10], R1[20], and R2[20], not shown here to conserve space). They all consistently supported the hypothesis.
Table 6. Test of Differences in Relevance due to SRS
Dependent Variable: R(10): Number of relevant pages in the first 10 pages returned
Sample | Mean With SRS | Mean Without SRS | Std. Error for Diff. | t-statistic | Sig. 1-tailed | Result as Expected?
Full sample (N=261)† | 5.38 | 3.83 | .252 | 6.13 | .000 | Yes
Unconstrained terms (N=127) | 5.5 | 4.4 | .39 | 2.66 | .005** | Yes
Constrained terms, clear (N=72) | 5.4 | 3.5 | .42 | 4.34 | .000** | Yes
Constrained terms, ambiguous (N=62) | 5.2 | 3.0 | .47 | 4.70 | .000** | Yes
AlltheWeb (N=135) | 4.4 | 3.7 | .36 | 2.09 | .020* | Yes
Google (N=126) | 6.3 | 4.0 | .33 | 7.13 | .000** | Yes
Key: † N is less than the maximum due to missing values. * sig. at α < 0.05 1-tailed, ** at α < 0.05 1-tailed (Bonferroni adjusted). The adjusted alpha = .05/5 = .01.
Table 7 summarizes the results of a MANOVA to test the main effects and interaction effects for each control variable. The results show that both control variables had a significant main effect on relevance. The results were more relevant when the queries were clear and/or when Google was used. The table also indicates that the benefit of the Semantic Retrieval System does not appear to be moderated by the ambiguity of query terms (i.e., the interaction term for SRS*Term Ambiguity is insignificant). Overall, it appears that queries do not have to be highly ambiguous for the Semantic Retrieval System to be useful; even marginally ambiguous queries can benefit from the system. The results also show that the benefit of using SRS was significantly greater when used with Google. This was unexpected. This may indicate the ability of some search engines to more effectively utilize the expanded query constructed by SRS.

Table 7. Effects of Control Variables (MANOVA†)
Factor | F-statistic | Significance (2-tailed) | Effect Size (eta squared)‡
Main effects: SRS | 12.47 | .000*** | .09
Main effects: Search term ambiguity | 2.81 | .004*** | .02
Main effects: Search engine | 7.48 | .000*** | .06
Interaction effects: SRS*Term ambiguity | 1.47 | .166 | .01
Interaction effects: SRS*Search Engine | 2.18 | .070* | .02
Key: † MANOVA statistics calculated using Pillai’s trace including all DVs (i.e., R1(10), R2(10), R1(20), R2(20)). ‡ Eta squared is the proportion of variability in the DV accounted for by variation in the IV. * significant at α < 0.10 two-tailed, *** significant at α < 0.01 two-tailed.
Overall, the results in Tables 4 and 5 are very consistent in supporting the study's hypothesis. The effects, while statistically significant, are not large. Averaging across all subsamples, the Semantic Retrieval System increased the number of relevant pages from approximately 4.5 out of 10 to 6 out of 10. This benefit ranged from 2 additional relevant pages (out of 10) when query terms were ambiguous or when Google was used, to approximately 1 additional relevant page (out of 10) when query terms were clear or when AlltheWeb was used. To understand the contribution of different sources of knowledge (lexical or ontological) to the outcomes, an exploratory analysis was performed. For this analysis, we examined the queries with constrained terms (comprising 72 with clear terms and 62 with ambiguous terms, per Table 6) to ascertain how the results were affected by the source of knowledge. We found that subjects added terms to 122 of these 134 queries. For each of these queries, we coded the source of each term added by the subject, i.e., lexical (WordNet) or ontological (the DAML library). It was found that 100% of these queries included terms from WordNet, whereas only 30% included terms from DAML. In other words, subjects chose not to add ontological terms for 70% of these queries. However, in the 30% of cases where subjects added terms from the DAML ontology library, these terms did assist the query (t = 6.3, p = .00). Table 8 summarizes these results.
Table 8: Exploratory Results of Contribution of Different Sources
Dependent Variable: R(10): Number of relevant pages in the first 10 pages returned
Subsample (queries by source of added terms) | N | Mean With SRS | Mean Without SRS | Std. Error for Diff. | t-statistic | Sig. (1-tailed) | Result as Expected?
Lexical | 122 | 5.48 | 3.25 | .322 | 6.92 | .000** | Yes
Ontological and Lexical | 36 | 5.72 | 2.53 | .504 | 6.34 | .000** | Yes
Key: * sig. at α < 0.05 1-tailed, ** sig. at α < 0.05 1-tailed (Bonferroni adjusted; .05/6 = .008).
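To show how the coding of term sources could be operationalized, the following sketch checks whether each term a subject added appears in WordNet (via NLTK) or in a set of terms drawn from the DAML ontologies. The `daml_terms` set and the added terms are placeholders, not the study's actual materials.

```python
# Sketch: classify each added query term as lexical (WordNet) and/or ontological (DAML).
# daml_terms is a placeholder; in the study it would be built from the DAML library.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

daml_terms = {"aircraft", "carrier", "vessel"}        # hypothetical ontology terms
added_terms = ["carrier", "transporter", "vessel"]    # hypothetical terms a subject added

for term in added_terms:
    in_wordnet = len(wn.synsets(term)) > 0
    in_ontology = term.lower() in daml_terms
    print(f"{term}: lexical={in_wordnet}, ontological={in_ontology}")
```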
Further tests showed that these results were unaffected by the ambiguity of query terms. In the ‘ambiguous’ group, subjects chose ontological terms in 28% of the queries, and in the ‘clear’ group, subjects chose ontological terms in 31% of the queries. Likewise, as in Table 7,
term ambiguity did not moderate the benefit of SRS in either the Lexical subsample (F = 1.4, df = 241, p = .24) or the Ontological and Lexical subsample (F = 1.6, df = 75, p = .20). A final analysis was performed to evaluate whether users perceived the benefits of SRS to be worth the effort. At the end of the experiment, subjects completed a questionnaire comparing the usefulness of the Semantic Retrieval System to that of a publicly available search engine (Google) alone. The questionnaire used a 7-point Likert scale, was reliable (Cronbach's alpha = .92), and contained four items asking users to evaluate whether SRS, compared to Google alone: (1) enables me to find good information faster, (2) is useful for searching information on the web, (3) improves my performance in web querying, and (4) increases my productivity in searching the web. The mean response for the scale was 4.83 out of 7. The difference between subjects' responses and a baseline score of 4.0 (the expected score if the system were no more useful than Google) was statistically significant (t = 6.2, df = 67, p = .00). This indicates that subjects rated the Semantic Retrieval System somewhat more useful than Google, in spite of the additional interaction SRS required. Given that Google is the world's most popular search engine (Sullivan, 2005), this baseline makes the result a strong one. Consistent with the results in Table 7, subjects' perception of the usefulness of the Semantic Retrieval System compared to Google was consistent across queries; that is, it did not depend on the ambiguity of query terms (F = .20, df = 67, p = .66). Overall, the results indicate that, first, the benefit of the Semantic Retrieval System does not stem from its use of one particular source of knowledge; rather, the different sources are complementary, enabling users to choose the appropriate piece of knowledge for a given query; and second, users value the Semantic Retrieval System despite the additional effort required to operate it.
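The reliability and baseline comparison reported above can be reproduced with standard formulas. The sketch below computes Cronbach's alpha for a four-item scale and a one-sample t-test of the scale mean against the neutral score of 4.0; the responses shown are made-up values, not the study's data.

```python
# Sketch: Cronbach's alpha for the four-item usefulness scale and a one-sample
# t-test of the scale mean against the neutral baseline of 4.0. Data are made up.
import numpy as np
from scipy import stats

# rows = subjects, columns = the four questionnaire items (1-7 Likert responses)
items = np.array([
    [5, 6, 5, 5],
    [4, 4, 5, 4],
    [6, 6, 6, 7],
    [5, 5, 4, 5],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)   # Cronbach's alpha

scale_means = items.mean(axis=1)                            # per-subject scale score
t_stat, p_value = stats.ttest_1samp(scale_means, 4.0)       # test against baseline 4.0
print(f"alpha = {alpha:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```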
5.5 Discussion Humans are rarely troubled by the ambiguity of natural language because they can use elements of context to understand it (Miller, 1996). Intelligent systems, however, remain generally unable to interpret natural language except in highly constrained domains. No general algorithmic approach has yet been devised for interpreting natural language. The current state-of-the-art suggests that little progress in this direction is likely until systems start to use large, common-sense stocks of real-world knowledge (Ide and Veronis, 1998; Miller, 1996; Minsky, 1994). This general problem of interpreting natural language is especially salient within the context of searching the World Wide Web. The Web is the world's largest and potentially most valuable information resource, but it is highly unstructured, contains a diverse and dynamic collection of contexts, and is used by diverse individuals who have little knowledge of search techniques or syntax (Spink et al., 2001). This research was motivated by the practical problem of improving Web searching. Following Minsky's (1994) analysis of intelligent systems, we developed a methodology to improve information retrieval from the World Wide Web by drawing on sources of large stocks of knowledge about the real world. The methodology makes minimal assumptions about web pages, search engines, or users, and, thus, should be widely applicable. The methodology differs from prior approaches in its joint use of two types of knowledge sources, lexicons and ontologies, combined with Boolean operators. Past attempts at query expansion often yielded negative results (Voorhees, 1994; Sanderson, 2000). Our methodology, on the other hand, demonstrates that query expansion can be useful, even with minimal assumptions. The contributions of this research can be assessed in terms of the system's effectiveness and efficiency. In terms of effectiveness, the results suggest that the Semantic Retrieval System provides more relevant pages to users, increasing the relevance of results by about 15%.
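As a simple illustration of the kind of query expansion the methodology performs, the sketch below combines a query term with the WordNet synonyms of one sense and joins them with Boolean operators. The sense selection and output format are simplified assumptions for illustration, not the exact algorithm of the Semantic Retrieval System.

```python
# Sketch: expand a query term with synonyms of one WordNet sense and join them
# with Boolean OR, then AND the expansion with the rest of the query.
# This is a simplified illustration, not the SRS algorithm itself.
from nltk.corpus import wordnet as wn

def expand_term(term: str, sense_index: int = 0) -> str:
    """Return a parenthesized OR-group of synonyms for one WordNet sense of `term`."""
    synsets = wn.synsets(term)
    if not synsets:
        return f'"{term}"'
    if sense_index >= len(synsets):
        sense_index = 0
    synonyms = {lemma.name().replace("_", " ") for lemma in synsets[sense_index].lemmas()}
    synonyms.add(term)
    return "(" + " OR ".join(f'"{s}"' for s in sorted(synonyms)) + ")"

# Example: expand one ambiguous term and AND it with a clear term from the same query.
query = expand_term("plant", sense_index=1) + " AND " + '"pollution"'
print(query)
```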
Although the results are encouraging, further research is needed to characterize situations that are more apt candidates for infusion of semantics. For example, approaches such as ours may be particularly effective when a user has some knowledge of the domain being queried, so that he or she can participate in expanding and refining the query by effectively using the information provided by the knowledge sources. In terms of efficiency, the user interaction required by the Semantic Retrieval System is not insignificant. Nevertheless, the results suggest that users are willing to undertake such interaction to improve their queries. It may be possible to minimize this interaction through user profiles (e.g., Tanudjaja and Mui, 2002). The methodology can also be judged in terms of traditional criteria for validity. For internal validity, the controlled experiment increases confidence in our results. Additional variables could, however, be tested. For example, the users' knowledge of the search domain may significantly influence the usefulness of the query expansion (Leroy et al., 2003). The experiment obtained significant results despite a small sample of users, providing adequate statistical conclusion validity. Nevertheless, the effect sizes of our results were small, and further research is needed to test the system across a larger body of users. In terms of external validity, the Semantic Retrieval System was employed across a range of queries (ambiguous/clear, constrained/unconstrained) and search engines (Google and AlltheWeb), although the experiment was limited to student subjects. Additional tests across a broader spectrum of users would be valuable. Construct validity could be improved by refining the constructs used in the methodology. The Semantic Retrieval System relies on existing knowledge sources (WordNet and the DAML library) and Web search systems (Google and AlltheWeb), which were directly incorporated. The strength of this approach is that the results are applicable to real-world systems. The weakness is that the results are affected by the particular implementations of these systems (i.e., WordNet, the DAML library, Google, and
AlltheWeb). For example, a disappointing result was that the use of ontological terms from the DAML ontology library was less than anticipated. Although the results in Table 8 indicate that terms from ontologies can be beneficial, they do not explain why users chose not to include terms from the ontologies in the 70% of the queries where the ontologies provided terms. Because the Semantic Retrieval System used real-world ontology libraries, rather than simulated ones, either of the following two factors could underlie our results. First, sources of domain-specific global context (i.e., ontologies) may, in general, be less useful for query expansion than sources of non-domain-specific global context (i.e., lexicons). Second, the ontologies in the DAML library may need to be improved before they can meet their potential as a source of domain-specific global context for query expansion. To investigate these two factors, we can draw on a metrics suite for assessing the quality of the ontologies in the DAML library (Burton-Jones et al., 2005), which suggests that many of the ontologies are small (i.e., fewer than 20 terms), use ambiguous terms (i.e., terms with more than four senses), and contain inaccuracies (e.g., mistaking subclass for part-of relationships). This provides some evidence to support the second scenario above: one reason the ontologies were rarely used by subjects in this experiment is that the DAML ontologies were not of sufficient quality. Notwithstanding this finding, the results in Table 8 show that knowledge from the ontology library can improve queries when used. Together, these two pieces of evidence suggest that the improvement of ontologies could greatly assist query expansion systems. Since the DAML library is the largest and most well-known ontology library, this result is particularly striking, highlighting the need for continued research on methodologies for developing useful and practical domain ontologies. The results of our research suggest that ontologies will be most effective as sources of global context for query expansion when: (1) they display deep domain-specific knowledge (rather than merely generally accepted knowledge
provided by lexicons such as WordNet), and (2) the knowledge the ontologies provide is accurate and specified in clear (not ambiguous) terminology.
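The kinds of checks just described (small ontologies and ambiguous class names) can be approximated with very simple code. The sketch below counts the classes in an ontology's term list and flags class names with more than four WordNet senses, assuming the class names have already been extracted from the ontology file; the names shown are hypothetical.

```python
# Sketch: two simple ontology quality indicators in the spirit of the metrics suite:
# (1) size (number of classes) and (2) ambiguity (class names with more than four
# WordNet senses). Class names are assumed to be pre-extracted from the ontology.
from nltk.corpus import wordnet as wn

class_names = ["plant", "reactor", "turbine", "output"]   # hypothetical ontology classes

too_small = len(class_names) < 20
ambiguous = [c for c in class_names if len(wn.synsets(c)) > 4]

print(f"classes: {len(class_names)} (flagged as small: {too_small})")
print(f"ambiguous class names (>4 senses): {ambiguous}")
```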
6. Conclusion This research presented a methodology that leverages existing ontological and lexical sources as a way of incorporating context into the process of information retrieval from the Web. The methodology for context-aware information retrieval and the experiments indicate that the sources of knowledge (lexical and ontological) contribute to improving information retrieval in a complementary manner. This is likely to be a consequence of the present state of knowledge codified in these sources, which is at best incomplete, tends to be overlapping, and may lack the precision needed to further improve retrieval results. Nevertheless, these sources represent the best codified stocks of common-sense knowledge and, therefore, have the best chance of improving retrieval results. An effort such as the one we have reported is necessary to assess how these sources can contribute to better retrieval results. The methodology we have outlined leverages and extends, in a concerted manner, prior research on natural language processing, knowledge representation, and information retrieval to develop the algorithms incorporated in the methodology. The methodology has been implemented in a research prototype, the Semantic Retrieval System (SRS), which allows execution of queries by invoking commercially available search engines such as Google and AlltheWeb. An empirical study shows that the methodology provides more relevant results than those obtained directly from the search engines, with minimal user intervention. An analysis of the contribution of the knowledge sources used confirms their complementary nature and provides guidelines for future ontology development
research. In the commercial domain, new search engines are beginning to appear (e.g., www.clusty.com, which appeared in late 2004) that help users disambiguate their queries by allowing search results to reflect different word senses. The algorithms and implementations underlying these commercial efforts are, however, not disseminated to researchers, making accumulation of knowledge in this direction difficult. The work we have reported is, therefore, an important step in the accumulation of such knowledge. Further research can be directed at improving the methodology itself, for example, its flexibility, scalability, and personalizability, as well as at more extensive testing to assess the contributions of the knowledge sources, to examine how users employ the methodology, and to evaluate how changes in the knowledge sources result in improvements in information retrieval.
References Allan, J., and Raghavan, H. "Using Part-of-Speech Patterns to Reduce Query Ambiguity," Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 307-314. Berners-Lee, T., Hendler, J., and Lassila, O. "The Semantic Web," Scientific American, May 2001, pp. 1-19. Buckley, C., and Voorhees, E.M. "Evaluating Evaluation Measure Stability," Proceedings of the SIGIR, Athens, Greece, 2000, pp. 33-40. Burton-Jones, A., Storey, V.C., Sugumaran, V., and Ahluwalia, P. "A Semiotic Metrics Suite for Assessing the Quality of Ontologies," Data & Knowledge Engineering (in press) 2005, 26pp. Chan, C. W. “From Knowledge Modeling to Ontology Construction,” International Journal of Software Engineering and Knowledge Engineering, (14:6), 2004, pp. 603 – 624. Chen, H., and Dhar, V. "A Knowledge-Based Approach to the Design of Document-Based Retrieval Systems," Conference on Office Information Systems, ACM SIGOIS Bulletin (11), 1990, p. 281-290. Corcho, O., Fernandez-Lopez, M., and Gomez-Perez, A. "Methodologies, Tools and Languages for Building Ontologies. Where is their Meeting Point?," Data & Knowledge Engineering (46) 2003, pp 41-64. Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y. "Probabilistic Query Expansion Using Query Logs," Proceedings of the Eleventh World Wide Web Conference (WWW 2002), Hawaii, 2002, p. 325-332. de Lima, E.F., and Pedersen, J.O. "Phrase Recognition and Expansion for Short, Precision-biased Queries based on a Query Log," Proceedings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Berkeley, 1999, pp. 145-152. Ding, Y., and Fensel, D. "Ontology Library Systems: The Key to Successful Ontology Re-Use," First International Semantic Web Working Symposium (SWWS'01), Stanford University, California, USA, 2001, pp. 93-112. Ding, Y., Fensel, D., Klein, M., and Omelayenko, B. "The Semantic Web: Yet Another Hip?," Data & Knowledge Engineering (41), 2002, pp. 205-227. Eastman, C.M., and Jansen, B.J. "Coverage, Relevance, and Ranking: The Impact of Query Operators on Web Search Engine Results," ACM Transactions on Information Systems (21:4), October 2003, pp 383-411. Efthimiadis, E.N. "Interactive Query Expansion: A User-Based Evaluation in a Relevance Feedback Environment," Journal of the American Society for Information Science (51), 2000, pp. 989-1003. Fellbaum, C. (ed.). WordNet: An Electronic Lexical Database, MIT Press, MA, 1998. Fensel, D., Harmelen, F., Horrocks, I., McGuinness, D., & Patel-Schneider, P. "OIL: An Ontology Infrastructure for the Semantic Web," IEEE Intelligent Systems (March/April), 2001, pp. 38-45. Gordon, M., and Pathak, P. "Finding Information on the World Wide Web: The Retrieval Effectiveness of Search Engines," Information Processing and Management (35), 1999, p. 141-180. Greenberg, J. "Automatic Query Expansion via Lexical-Semantic Relationships," Journal of the American Society for Information Science (52:5), 2001, pp. 402-415. Gruber, T.R. "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition (5), 1993, pp. 199-220. Gruninger, M. and Lee, J. "Special Issue: Ontology Applications and Design," Communications of the ACM, (45:2), 2002, pp. 39-41. Guarino, N. "Formal Ontology, Conceptual Analysis and Knowledge Representation," International Journal of Human-Computer Studies (43:5-6), Nov/Dec 1995, pp 625-640. Guarino, N. "Formal Ontology and Information Systems," N. Guarino, Proceedings of the 1st International Conference on Formal Ontology in Information Systems, Italy, 1998, pp. 3-15. Guarino, N., and Welty, C. "Evaluating Ontological Decisions with OntoClean," Communications of the ACM (45:2), 2002, pp. 61-65. Guha, R.V., and Lenat, D.B. "Enabling Agents to Work Together," Communications of the ACM (37:7), 1994, pp. 127-142. Harman, D. "Overview of the First TREC Conference," R. Korfhage, E. M. Rasmussen and P. Willett, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, June 27 - July 1, 1993, pp. 36-47. Hearst, M.A. "Improving Full-Text Precision on Short Queries using Simple Constraints," Proceedings of the SDAIR, Las Vegas, NV, 1996, pp. 1-16. Hendler, J. "Agents and the Semantic Web," IEEE Intelligent Systems (Mar/Apr), 2001, p. 30-36. Hendler, J., and McGuinness, D.L. "The DARPA Agent Markup Language," IEEE Intelligent Systems (15:6), 2000, pp. 67-73. Hevner, A., March, S., Park, J., and Ram, S. "Design Science in Information Systems Research," MIS Quarterly (28:1) 2004, pp 75-105. Ide, N., and Veronis, J. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art," Computational Linguistics (24:1), 1998, pp. 1-40.
IEEE "Special issue on the Semantic Web," IEEE Intelligent Systems, Mar/Apr 2001, pp. 32-79. Jansen, B.J. "An Investigation Into the Use of Simple Queries on Web IR Systems," Information Research: An Electronic Journal (6:1), 2000, pp. 1-13. Kayed, A., and Colomb, R.M. "Extracting Ontological Concepts for Tendering Conceptual Structures," Data & Knowledge Engineering (40), 2002, pp. 71-89. Kishore, R., Zhang, H., and Ramesh, R. "A Helix-Spindle Model for Ontological Engineering," Communications of the ACM (47:2) 2004, pp 69-75. Kobayashi, M., and Takeda, K. "Information Retrieval on the Web," ACM Computing Surveys (32:2), 2000, pp. 144-173. Lawrence, S. "Context in Web Search," IEEE Data Engineering Bulletin (23:3), 2000, pp. 25-32. Lee, J.-O., and Baik, D.-K. "SemQL: A Semantic Query Language for Multidatabase Systems," Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, November 2-6, Kansas City, Missouri, 1999. Leroy, G., Lally, A.M., and Chen, H. "The Use of Dynamic Context to Improve Casual Internet Searching," ACM Transactions on Information Systems, (21:3), 2003, pp. 229-253. Lewis, D.D., and Jones, S. K. "Natural Language Processing for Information Retrieval," Communications of the ACM (39:1), 1996, pp. 92-101. Maedche, A., and Staab, S. "Ontology Learning for the Semantic Web," IEEE Intelligent Systems (March/April), 2001, pp. 72-79. Mandala, R., Tokunaga, T., and Tanaka, H. "Combining Multiple Evidence from Different Types of Thesaurus for Query Expansion," Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, 1999, pp. 191-197. Mason, O. "Qtag -- a Portable POS Tagger,", obtained from http://web.bham.ac.uk/o.mason/software/tagger/index.html, March 2003. McGuinness, D.L., Fikes, R., Hendler, J., and Stein, L.A. "DAML+OIL: An Ontology Language for the Semantic Web," IEEE Intelligent Systems (17:5), Sep/Oct 2002, pp 72-80. McIlraith, S.A., Son, T.C., and Zeng, H. "Semantic Web Services," IEEE Intelligent Systems (March/April), 2001, pp. 46-53. Miller, G.A. "Contextuality," In Mental Models in Cognitive Science, J. Oakhill and A. Garnham (eds.), Psychology Press, East Sussex, UK, 1996, pp. 1-18. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K.J. "Introduction to WordNet: An On-line Lexical Database," International Journal of Lexicography (3:4), 1990, pp. 235-244. Minsky, M. "A Conversation with Marvin Minsky About Agents," Communications of the ACM (37:7), 1994, pp. 22-29. Mitra, M., Singhal, A., and Buckley, C. "Improving Automated Query Expansion," Proceedings of the 21st Annual International Conference on Research and Development on Information Retrieval, Melbourne, Australia, 1998, pp. 1-15. Moldovan, D.L., and Mihalcea, R. "Improving the Search on the Internet by using WordNet and Lexical Operators," IEEE Internet Computing (4:1), 2000, pp. 34-43. Mylopoulos, J. "Information Modeling in the Time of the Revolution," Information Systems (23:3-4), 1998, pp. 127-155. Noy, N.F., and Hafner, C.D. "The State of the Art in Ontology Design: A Survey and Comparative Review," AI Magazine (18:3/Fall), 1997, pp. 53-74.
Peat, H.J., and Willett, P. "The Limitations of Term Co-Occurrence Data for Query Expansion in Document Retrieval Systems," Journal of the American Society for Information Science (50:1) 1991, pp 49-64. Qiu, Y., and Frei, H.-P. "Concept Based Query Expansion," Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, Pennsylvania, 1993, pp. 160-169. Raghavan, P. "Information Retrieval Algorithms: A Survey," Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, 1997, pp. 11-18. Salton, G., and Buckley, C. "Improving Retrieval Performance by Relevance Feedback," Journal of the American Society for Information Science (41:4), 1990, pp. 288-297. Sanderson, M. "Retrieving with Good Sense," Information Retrieval (2:1) 2000, pp 49-69. Spink, A., and Ozmultu, H.C. "Characteristics of Question Format Web Queries: An Exploratory Study," Information Processing and Management (38), 2002, pp. 453-471. Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. "Searching the Web: The Public and their Queries," Journal of the American Society for Information Science (52:3), 2001, pp. 226-234. Stephens, L.M., and Huhns, M.N. "Consensus Ontologies: Reconciling the Semantics of Web Pages and Agents," IEEE Internet Computing (September-October), 2001, pp. 92-95. Sullivan, D. "Nielsen NetRatings: Search Engine Ratings," in: Search Engine Watch, (published Feb 5, 2005, online: http://searchenginewatch.com/reports/article.php/2156451), 2005. Tanudjaja, F., and Mui, L. "Persona: A Contextualized and Personalized Web Search," Proceedings of the 35th Hawaii International Conference on System Sciences, Hawaii, 2002. Towell, G., and Voorhees, E.M. "Disambiguating Highly Ambiguous Words," Computational Linguistics (24:1), 1998, pp. 125-145. van Rijsbergen, C.J. "A Theoretical Basis for the Use of Cooccurrence Data in Information Retrieval," Journal of Documentation (33) 1977, pp 106-119. Voorhees, E.M. "Query Expansion Using Lexical-Semantic Relations," Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 61-69. Weber, R. Ontological Foundations of Information Systems, Coopers & Lybrand, Melbourne, 1997. Weber, R. "Ontological Issues in Accounting Information Systems," In Researching Accounting as an Information Systems Discipline, S. Sutton and V. Arnold (eds.), American Accounting Association, Sarasota, FL, 2002. Xu, J., and Croft, W.B. "Corpus-Based Stemming using Co-occurrence of Word Variants," ACM Transactions on Information Systems (16:1), 1998, pp. 61-81.