Finding More Useful Information Faster from Web Search Results

Yi-fang Brook Wu, Latha Shankar, Xin Chen
Information Systems Department, New Jersey Institute of Technology
ABSTRACT
In this paper, we propose a prototype system for automatic generation of concept hierarchies to be used as an overview of search results. The system sends a user’s query to five search engines and receives a list of relevant web pages in return. The system then extracts query-oriented concept terms from the snippets that come with the returned hits. Concept terms are organized into a concept hierarchy using a co-occurrence-based classification technique. Finally, concepts in returned documents are dynamically highlighted according to the terms in the selected concept branch that lead to the chosen document. The user study shows that concept hierarchies do provide easy navigation and browsing of returned web documents. The results also show that users can find a document of interest no matter how low it is ranked in the retrieved list.
Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Clustering, Search Process
General Terms: Algorithms, Human Factors
Keywords: Concept Hierarchy, hierarchical summarization

1. INTRODUCTION
The system interface can be a major barrier that prevents users from accessing the information they need. Due to the poor ranking mechanisms of some search engines, users usually need to examine several documents before finding enough useful documents to fulfill their needs. This problem is worsened by the large number of hits returned from a search session. Since most search systems list returned documents in linear order, users must browse through each returned hit sequentially to find relevant documents. This makes searching time-consuming, and users are less likely to obtain articles of interest if they are ranked lower. Spink et al. [6] also find that users rarely examine documents beyond the first 20 retrieved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’03, November 3-8, 2003, New Orleans, Louisiana, USA. Copyright 2003 ACM 1-58113-723-0/03/0011…$5.00.
Our study focuses on helping users find more useful and relevant information faster from the search results. Most of the time, users can determine the relevance of a search hit by looking only at the concept terms appearing in the snippet or in the title. Automatic concept hierarchies displaying extracted noun phrases would allow users to glance at the contents of search results without reading through all documents, and thus reduce their effort in locating relevant information [1, 3]. We propose a system that uses a hierarchical representation of concept terms extracted from the result set to facilitate easier navigation, browsing, and location of relevant parts in the documents. Our tree-like hierarchy is constructed using a term association technique based essentially on the co-occurrences of the extracted concept terms.
2. HIGHLIGHT: AN AUTOMATIC CONCEPT EXTRACTION AND ORGANIZATION SYSTEM
2.1 Systems Design
Our system (http://highlight.njit.edu) provides a tool for browsing the results of a search query, in addition to automatically extracting and indexing concepts from text collections. Its output is organized in a hierarchical structure, which is similar to web directories. According to Nielsen [4], users usually scan the text for important excerpts and read selected paragraphs rather than the complete texts. Therefore, we add a function to our concept hierarchy interface that highlights relevant portions of text in the retrieved documents to capture users’ attention. The system architecture consists of a graphical user interface, a document acquisition component that routes queries to search engines and retrieves search results, a Noun Phrase Extractor (NPE) for concept term extraction, a co-occurrence analysis component that organizes terms in the hierarchy, and a temporary storage space for downloaded documents received from search engines.
2.2 Query-Oriented Concept Extraction
Given the number of hits returned by search engines, analyzing the full text of every document to extract concept terms is not an option. To render a concept hierarchy effectively and efficiently, we decided to analyze the short text descriptions that come with search hits: snippets. Snippets, usually 2-3 lines or shorter, are extracted by search engines, and they contain query terms to show the context. Users are expected to judge the relevance of a document by scanning the snippet and then decide whether the full text is desirable.
We define “concept terms” as noun phrases (NPs) that appear in the snippets, because we consider noun phrases to be the concept terms that people use in writing. For this study, we extracted NPs that were at least two words long and appeared in more than two different documents. On average, 200 snippets generate about 45 to 70 concept terms. We call this technique “query-oriented concept extraction,” because it extracts terms from snippets, which are parts of documents that are closely related to the query.
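The selection rule above (at least two words, more than two different documents) can be illustrated with a minimal sketch. It assumes the noun phrases have already been extracted per snippet; the function name `select_concept_terms`, the lowercase normalization, and the sample data are hypothetical and are not taken from the Highlight system.

```python
from collections import defaultdict


def select_concept_terms(phrases_by_doc, min_words=2, min_docs=3):
    """Keep noun phrases with at least `min_words` words that occur in
    at least `min_docs` different documents (i.e. more than two)."""
    docs_containing = defaultdict(set)
    for doc_id, phrases in phrases_by_doc.items():
        for phrase in phrases:
            # Assumption: case and extra whitespace are ignored when matching.
            term = " ".join(phrase.lower().split())
            if len(term.split()) >= min_words:
                docs_containing[term].add(doc_id)
    return {
        term: len(docs)
        for term, docs in docs_containing.items()
        if len(docs) >= min_docs
    }


# Hypothetical noun phrases, one list per snippet.
phrases_by_doc = {
    "d1": ["data mining", "knowledge discovery", "software"],
    "d2": ["Data Mining", "knowledge discovery"],
    "d3": ["data mining", "knowledge discovery", "machine learning"],
    "d4": ["data mining", "machine learning"],
}
concept_terms = select_concept_terms(phrases_by_doc)
print(concept_terms)
```

In this toy input, “software” is dropped because it is a single word and “machine learning” is dropped because it appears in only two documents.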
2.3 Organizing Terms in the Hierarchy
The relationship between higher- and lower-level terms is established by examining their co-occurrences in the documents. A concept hierarchy, by design, shows concept terms from broader to narrower ones. The Probability of Co-occurrence Analysis (POCA) [7] is a subsumption-based [5] technique, and it is defined as follows:

P(X|Y) > P(Y|X) and P(X|Y) ≥ N, where 0 < N < 1

If a term pair (X, Y) fulfills the above set of inequalities, X occurs more frequently than Y in the document collection, and is therefore broader than Y. Note that the threshold, N, affects the number of term pairs derived; namely, a larger N results in a smaller number of term pairs. In our study, we used 0.8, the same as in [5].

3. EXAMPLE RESULTS
3.1 Search Results
Figure 1 shows a concept hierarchy consisting of topics in the returned hits that are closely related to query terms.

Figure 1. A Concept Hierarchy and a Set of Returned Hits for “Data Mining”

There are more than 20 first-level concept terms. The top 10 first-level terms are: “data mining” (root, 189), “knowledge discovery (28),” “data mining software (12),” “data analysis (12),” “data mining solution (7),” “data mining tool (6),” “data mining group (5),” “international conference (5),” “data mining product (5),” “machine learning (5),” and “system software (5).” The top branch is: “data mining (root, 189)” → “knowledge discovery (28)” → {“mining suite (3),” “data mining center concentrate (2),” “coincidence search capability (2),” “knowledge discovery journal (2),” and “phenomenon detection (2)”}. If a second-level term is selected, only documents containing both the first- and second-level terms will be selected.

3.2 Query-Oriented Dynamic Document Highlight
Figures 2 and 3 show that the hierarchy branch “data mining” → “knowledge discovery” → “phenomenon detection” is selected and the document “ITSC Data Mining Center” is chosen to be displayed in both full text and relevant paragraph modes. All topic terms along this branch are highlighted with different colors. Both designs are used to attract users’ attention to extracted query-oriented concept terms. With relevant paragraph mode, all paragraphs containing terms in the selected branch that led to the chosen document are extracted to save users’ reading time by reducing the amount of text.

Figure 2. Query-Oriented Dynamic Document Highlight, Full Text Mode

Figure 3. Query-Oriented Dynamic Document Highlight, Relevant Paragraph Mode

We call this “dynamic topical highlight,” because the highlighting of text can be varied for the same document. Depending on the query used and the tree branch selected, the document will be highlighted differently each time, with only the terms in the selected branch. We believe that query-oriented concept extraction and dynamic topical highlight are both useful for online search. The concept hierarchy helps users to understand the topics covered in the returned set, and the highlighting helps in capturing users’ attention.
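The steps described in Sections 2.3, 3.1, and 3.2 can be tied together in a small sketch: deriving broader/narrower pairs with the POCA inequalities, narrowing the documents to those containing every term on a selected branch, and extracting and marking the relevant paragraphs. This is an illustration under stated assumptions, not the Highlight implementation. All function names and the sample data are hypothetical, P(X|Y) is estimated from document counts, and `[[...]]` markers stand in for the colored highlighting. How the pairs are assembled into a single tree is not specified in this section and is left out.

```python
import re


def poca_pairs(docs_by_term, threshold=0.8):
    """Return (broader, narrower) pairs satisfying
    P(X|Y) > P(Y|X) and P(X|Y) >= threshold."""
    pairs = []
    for x, docs_x in docs_by_term.items():
        for y, docs_y in docs_by_term.items():
            if x == y:
                continue
            both = len(docs_x & docs_y)
            if both == 0:
                continue
            p_x_given_y = both / len(docs_y)  # share of Y's documents containing X
            p_y_given_x = both / len(docs_x)  # share of X's documents containing Y
            if p_x_given_y > p_y_given_x and p_x_given_y >= threshold:
                pairs.append((x, y))
    return pairs


def documents_for_branch(branch, docs_by_term):
    """Documents that contain every term on the selected branch."""
    return set.intersection(*(docs_by_term[term] for term in branch))


def relevant_paragraphs(text, branch):
    """Paragraphs (split on blank lines) mentioning at least one branch term."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs
            if any(term.lower() in p.lower() for term in branch)]


def highlight(text, branch):
    """Wrap every branch term in [[...]] (placeholder for color highlighting)."""
    # Longest terms first so a longer term wins over any term it contains.
    ordered = sorted(branch, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", text)


# Hypothetical term -> document-id index and a made-up document text.
docs_by_term = {
    "data mining": {1, 2, 3, 4, 5},
    "knowledge discovery": {1, 2, 3},
    "phenomenon detection": {1, 2},
}
branch = ["data mining", "knowledge discovery", "phenomenon detection"]
text = (
    "The center studies data mining for Earth science.\n\n"
    "Staff directory and contact details.\n\n"
    "Phenomenon detection is one application of knowledge discovery."
)

print(poca_pairs(docs_by_term))
print(documents_for_branch(branch, docs_by_term))
for paragraph in relevant_paragraphs(text, branch):
    print(highlight(paragraph, branch))
```

Because P(X|Y) and P(Y|X) share the same numerator (the number of documents containing both terms), the first inequality holds exactly when X appears in more documents than Y, which is why X is treated as the broader term.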
It is worth mentioning that the document shown in Figures 2 and 3 is ranked 118th in the Lycos returned list. Using the regular linear list interface, a user might have already given up the search before reaching this page.

4. EVALUATION
4.1 User Study I
In order to conduct the user study before Highlight was released, we developed a Windows-based version of it, called Concept Hierarchy Developer (CHD). Nineteen undergraduate students were chosen as participants for the study. They were asked to take part in the study for course credit. The participants were given a demonstration of the functionality of CHD. They were asked to use it as they would use any search engine. The task was to run a few queries using the tool. No limit was set on search time or number of queries. The intent was to get them to use the tool in a non-experimental context. At the end, the participants were asked to fill out a questionnaire. After completion, the participants were asked to upload the log file generated by the tool and the completed questionnaire for analysis. On average, each subject ran 8 queries. For each query, the participants clicked 12 terms on the CH, followed 5 documents, and spent about 9 minutes on the search. These statistics established that the participants had spent sufficient time on the tool to qualify for the analysis.

4.1.1 Usefulness of the CH
We chose 4 key items from the questionnaire that compared CHD’s hierarchical interface with a linear interface that the subjects were familiar with (all of them had used at least Google before).
• Familiarity – Were the participants as familiar with the hierarchical interface as with the linear interface?
• Understanding concepts – Did the summarization hierarchy presented by the system provide an overview of the document collection and help in understanding the concepts better?
• Finding new topics – Did the CH help users find new topics related to the query? This relates to the novelty ratio factor described by Korfhage [2].
• Satisfaction – Overall, were the participants satisfied with CHD’s hierarchical interface?

The average responses for the above questionnaire items were analyzed and the results are given below. From Figure 4, it is interesting to note that although the participants were less familiar with the hierarchical interface, they were more satisfied with it. Tests of significance were conducted on the above results. With p [...] CHD were significantly better than that of a linear interface. The system, therefore, performed better than a linear search interface in helping users understand concepts and find new topics related to the query.

Figure 4. Linear vs. Hierarchical Interface (chart of average responses on a scale from 1.00 to 7.00 for the Linear and CHD interfaces on four items: Familiar, Understand concepts, Find topics, Satisfied)

4.1.2 User Interface - Ease of Use
An easy-to-use interface is very important to a system’s success. Responses to the open-ended survey questions indicated that 88% of the users categorized the user interface as user friendly/easy to navigate. The system has features such as text highlighting and relevant paragraph display to facilitate easier browsing of documents. We analyzed the questionnaire responses to evaluate these 2 key features. The questionnaire responses are on a Likert scale of 1-7. A response of 4 is considered neutral, and scores above 4 are considered favorable. The responses from the questionnaires indicate that:
• 95% of the participants found the text highlighting to be useful.
• 74% of the participants found the relevant paragraph mode to be useful.

Overall, the questionnaire results indicated that the participants were satisfied with the way the system highlighted key concepts in a document and displayed paragraphs relevant to the query.

4.2 User Study II – Comparison of Linear Interface vs. Hierarchical Interface
The purpose of this survey was two-fold:
• To observe and compare the organization of retrieved documents in a linear interface and a hierarchical interface.
• To compare the search efficiencies of the 2 interfaces.