Helping Identify When Users Find Useful Documents - Semantic Scholar

Liu, C., Gwizdka, J., Liu, J. (2010). Helping identify when users find useful documents: Examination of query reformulation interval. Proceedings of the 3rd Information Interaction in Context Symposium (IIiX’2010).

Helping Identify When Users Find Useful Documents: Examination of Query Reformulation Intervals

Chang Liu, Jacek Gwizdka, Jingjing Liu
School of Communication and Information, Rutgers University
4 Huntington Street, New Brunswick, NJ 08901, USA

ABSTRACT

We explore search behaviors during a new kind of search unit – the query reformulation interval (QRI). The QRI is defined as an interval between two consecutive queries in one search session that contains at least two queries. Our controlled, web-based study focused on examining behaviors associated with querying and useful document saving. We compared behavioral variables that characterized QRIs during which useful pages were found with those during which no useful pages were found. Our results demonstrated that the QRI duration and the total time spent on content pages during QRIs with useful pages were significantly longer than during QRIs with no useful pages. Users viewed more content pages and spent more time on content pages than on search result pages during QRIs with useful pages. The findings suggest that user behavior during QRIs can be used as an indicator of QRIs containing useful documents.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]

General Terms
Measurement, Performance, Experimentation, Human Factors.

Keywords
Information retrieval, query reformulations, user behavior, task type, search stage, implicit relevance feedback

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IIiX 2010, August 18-21, 2010, New Brunswick, New Jersey, USA. Copyright 2010 ACM 978-1-4503-0247-0/10/08...$10.00

1. INTRODUCTION

The notion of document usefulness with respect to the user and her task is a key concept in information retrieval (IR). The ability to predict document usefulness is important for IR systems, not only because it can help improve document ranking, but also because knowledge of document usefulness can be used in relevance feedback. Explicit elicitation of document usefulness interrupts users and requires additional effort from them. In contrast, implicit learning of document usefulness is unobtrusive; hence its superiority to explicit elicitation techniques.

User behavior provides a rich source of information for systems to implicitly learn document usefulness. Previous studies have looked at various user behaviors that can indicate document usefulness or relevance. These included clickthrough behaviors (e.g., [17], [6], [1]), time that users spend on documents (e.g., [11], [15], [6], [1]), and saving and printing [11]. While each of these types of behaviors has been found to help systems learn document usefulness individually, there is evidence that combining multiple behaviors improves the accuracy of predicting document usefulness (e.g., [6], [18]). Previous studies mainly focused on behaviors on individual documents and search result pages. There is a need to expand the examined behaviors to other aspects of the search process.

In this study, we examined users' behaviors during query reformulation intervals (QRIs), that is, behaviors that take place between two queries issued in the course of one search session. Entering a query represents a user's cognitive decision to modify some aspect of the current search tactic [3]. One can hypothesize that these cognitive decisions have an effect on users' search behaviors and thus that they group search behaviors into meaningful units. Examining search behaviors within these units may indicate during which intervals users view useful documents. This information, taken together with other behaviors indicating usefulness of specific documents, is expected to improve the accuracy of predicting document usefulness.

This paper reports our findings on several features of the observed users' search behaviors in QRIs and their relationship with the type of QRI and other contextual factors such as task type and search stage. Our results indicated that some search behaviors within such a unit could indicate when users viewed useful documents.

2. RELATED WORK AND MOTIVATION

Dwell time, also referred to as display time, is the time a user spends on a page; it has been investigated in many studies. For example, Morita and Shinoda [15] found the reading time for articles rated as interesting was longer than for articles rated as uninteresting. In contrast, Kelly and Belkin [12] did not find a significant relationship between display time and usefulness judgments in their naturalistic studies. They found, however, that display time differed significantly depending on specific tasks and users. White and Kelly [20] further found that the performance of dwell time as implicit relevance feedback could be improved when the task information is considered. Kellar et al. [10] examined the relationship between task, reading time and relevance, and found that the performance of reading time as an indicator of relevance varied across types of tasks. In particular, they found that reading time is a reliable indicator of interest on information gathering tasks, when users only judge general relevance, and is not a good indicator on fact finding tasks, when a user wants to find a specific answer. Liu and Belkin [15] examined the effects of search stage and usefulness on decision time. Their results showed that users spent the longest time on very useful pages in early search stages, but spent the shortest time in later stages. These studies demonstrated that understanding how behaviors change with respect to contextual factors can improve the effectiveness of using search behaviors as implicit feedback.

In addition to dwell time on documents and pages, the duration and behaviors in other kinds of search phases, e.g. on the search result list, have also been found to be an effective indicator of document usefulness when combined with other behaviors on content pages, e.g. Fox et al. [6], Hassan, Jones and Klinkner [7] and Agichtein et al. [1]. Fox et al. [6] analyzed the association between implicit measures of user interest and their explicit ratings using a Bayesian modeling method. They found that combining the time a user spent on the result list and the exit type (e.g. kill browser window, new query, URL entry, etc.) with clickthrough behaviors can improve the prediction of users' preferences. Hassan, Jones and Klinkner [7] examined the relationship between users' sequential search behaviors and their search success, and found that the transition time from a search result click to a query submission and the number of clicks on the search result page were related to search success. Their results indicated that in a successful search, users tended to spend a longer time between clicking a search result and submitting a query than in an unsuccessful search. Unsuccessful sessions were more likely to end with an abandoned query, that is, a query followed by no clicks. When using behavioral implicit feedback to predict users' preferences, Agichtein et al. [1] found that considering the number of clicks on the result list and the position of clicks in the result list can improve the prediction. These studies demonstrated that combining measures of behaviors on the result list with measures of behaviors on content pages can enhance the prediction of users' preferences compared to measures of behaviors on content pages alone. However, some of these studies (e.g. [1], [6]) only examined behaviors during single queries rather than complete search sessions. In a complete search session, people often reformulate their queries to find information that satisfies their needs.

In another area of related work, Huang and Efthimiadis [8] used time and the click pattern between queries to measure the effectiveness of query reformulation types. They suggested that a reformulation type was effective when it led to clicking search results rather than to skipping the results, and that a longer transition time was associated with more complex query reformulations. However, the focus of [8] was on the effectiveness of query reformulation, and the authors did not examine the relationship between behaviors between queries and the usefulness of documents.

Previous studies have mainly focused on examining behaviors on single documents and search result lists. None of them has examined users' behaviors during query reformulation intervals, and none of them has checked whether behaviors during this type of interval could indicate when useful documents are viewed. We believe that it is worth examining whether behaviors between cognitive decisions expressed as query reformulations can predict if any useful documents are viewed, and if combining measures of these behaviors with other search behaviors can improve the prediction of useful pages.

Most of the behavioral measures that were used as implicit feedback in previous studies belonged to the “examine” type of behaviors according to Kelly and Teevan's [14] classification. This category includes behaviors such as viewing, listening, scrolling, finding, querying, selecting and browsing. Compared with “examine” behaviors, “annotate” (e.g. save) and “create” (write) behaviors provide stronger evidence since they occur less frequently and are executed more deliberately [12]. “Save” is a type of retain behavior; it suggests an intention of future use of an information object. In this study, we used individuals' “save” behaviors as an explicit indicator of when useful documents were found. We compared searchers' behaviors during QRIs when useful documents were found with those during QRIs when no useful documents were found.

3. RESEARCH OBJECTIVES

Work presented in this paper makes a contribution by analyzing search behaviors within a new type of search unit. The behaviors and the associated measures are considered within query reformulation intervals. A query reformulation interval (QRI) is defined as an interval from the point when a user submits a query to the point when the user starts to enter a subsequent query during the same search session. The aim is to use behavioral signals as evidence in support of implicit feedback. To our knowledge, this type of unit has not been examined in previous studies.

The following measures were used to characterize QRIs: duration of QRI, visited page type (content and search engine results list pages), number of visited pages by their type, and average dwell time on web pages by their type (these variables are explained in Section 4.3).

Our focus in the work reported here was on examining behaviors associated with querying and saving useful documents. We compared users' behaviors within two types of QRIs: 1) QRIs during which useful pages were saved, and 2) QRIs during which no pages were saved. In addition, we considered how task features (task type and task structure) and their interactions with QRI type influenced users' behaviors during each type of QRI. We also considered the search stage as a factor that might influence users' behaviors during QRIs; search stage was categorized according to whether users had saved any useful pages or not.

Our research objectives were as follows:

1) Examine whether behaviors during QRIs when useful documents are found are different from behaviors when useful documents are not found;
2) Examine whether search stage influences QRI behaviors;
3) Examine whether task type or task structure influences QRI behaviors.
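The QRI definition in this section can be illustrated with a minimal sketch that cuts one session's event log into intervals running from a query submission to the start of the next query. The event kinds (`query_start`, `query_submit`, `serp`, `content`, `save`), the `Event` and `QRI` classes, and the function names are hypothetical and chosen only for illustration; they are not the logging format used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    time: float  # seconds since the session started
    kind: str    # assumed kinds: query_start, query_submit, serp, content, save


@dataclass
class QRI:
    start: float         # time the query was submitted
    end: float           # time the user began entering the next query
    events: List[Event]  # page views and saves inside the interval


def split_into_qris(events: List[Event]) -> List[QRI]:
    """Split one session's time-ordered events into QRIs.

    A session with n queries yields n - 1 intervals; whatever follows the
    last submitted query is not closed by a new query and is dropped.
    """
    qris: List[QRI] = []
    start = None
    inside: List[Event] = []
    for e in events:
        if e.kind == "query_submit":
            start, inside = e.time, []      # open a new interval
        elif e.kind == "query_start" and start is not None:
            qris.append(QRI(start, e.time, inside))
            start = None                    # typing time is outside any QRI
        elif start is not None:
            inside.append(e)
    return qris


def contains_save(qri: QRI) -> bool:
    """True if at least one page was saved (bookmarked) during the QRI."""
    return any(e.kind == "save" for e in qri.events)
```

Under these assumptions, `contains_save` separates the two QRI types compared in the study, and a session with a single query produces no QRI at all.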

4. METHOD

Forty-eight subjects (Table 1) participated in a question-driven, web-based information search study conducted in a controlled experimental setting. Participants were university students from undergraduate and graduate programs. Most participants were very frequent Web searchers and only one person searched the Web relatively infrequently: once or twice a week.

Table 1. Participant Profile.

Age                    Mean 27 years; median 23 years; range 20-51
Gender                 17 females and 31 males
Level of study         65% undergraduate; 6% Master; 23% PhD; 6% other (just graduated)
English language       First language 56%; spoken at home 65%
Web search frequency   35% almost constantly; 46% several times a day; 17% once a day; 2% once or twice a week

4.1 Procedure

Each study session took an hour and a half to two hours and was conducted in a university lab on a personal desktop computer running the Microsoft Windows XP operating system. Each session consisted of the following steps: an introduction to the study, consent form, search task practice, background questionnaire, six search tasks, and post-session questionnaire. The searchers bookmarked and tagged the web pages that they considered most useful in helping accomplish their tasks. Bookmarking provided an explicit indicator of a document's usefulness and was a saving action. Participants were asked to continue the search until they had gathered enough information to accomplish the task. User interaction with the computer (visited and bookmarked URLs, mouse and keyboard events, and video from a screen cam) was recorded using Morae software (footnote 1). The start and end of each search task were controlled by an external program that was used to start and end a Web browser session (Internet Explorer).

Footnote 1: Morae is a product of TechSmith Inc., http://www.techsmith.com/morae.asp

4.2 Tasks

The study search tasks were designed as questions that described what information needed to be found and provided a context for the search. The tasks were designed to differ in terms of their difficulty and structure. A total of twelve questions were used in the study. Four tasks were created by us, while eight were created by Toms and her colleagues [19].

Two types of search tasks were used: Fact Finding (FF) and Information Gathering (IG) [11]. The goal of a fact finding task is to find one or more specific pieces of information (e.g., name of a person or an organization, product information, a numerical value, a date). The goal of an information gathering task is to collect several pieces of information about a given topic. This type of task is also referred to as a topical search.

The tasks were also divided into three categories according to the structure of the underlying information need [20]:

1) Simple (S), where the information need is satisfied by a single, independent piece of information (by definition, a simple task is of the fact finding type);
2) Hierarchical (H), where the information need is satisfied by finding multiple characteristics of a single concept; this is a depth search, where a single topic is explored;
3) Parallel (P), where the information need is satisfied by finding multiple concepts that exist at the same level in a conceptual hierarchy; this is a breadth search.

Task types and structures are listed in Table 2. The tasks were constructed as Simulated Work Task Situations [3]. The simulated situations were created by using task scenarios that provided participants with the search context and the basis for relevance judgments. Sample tasks are shown in Table 11.

Table 2. Task Types and Structures.

Task Acronym   Task Structure and Type
FF-S           Simple fact finding task (known item search)
FF-P           Parallel fact finding task (known item search)
FF-H           Hierarchical fact finding task (known item search)
IG-P           Parallel information gathering task (topical search)
IG-H           Hierarchical information gathering task (topical search)

During the course of an individual study session, each participant performed six tasks of differing type and structure (Table 3). Thus, the forty-eight participants performed a total of 288 searches. In each search, the participant was able to choose between two questions of the same type and structure but on different topics. We offered the choice of topics to increase the likelihood of a participant's interest in the question topic. The order of tasks was balanced with respect to the objective task difficulty. Two task orders were repeated: cases where the difficulty increased from low to high, and cases where the difficulty decreased from high to low (Table 3).

The search tasks were performed on the English version of Wikipedia by using two different search engines: Google Wikipedia search and ALVIS Wikipedia search. Participants were instructed not to use Wikipedia's own search. There were no effects of user interface on the QRI duration, and, thus, we will not discuss this factor any further.

Table 3. Task Rotations.

       1       2      3       4       5      6
QR1    FF-S1   FF-P   IG-H    FF-S2   FF-H   IG-P
QR2    IG-H    FF-P   FF-S1   IG-P    FF-H   FF-S2
QR3    FF-S1   FF-P   IG-H    IG-P    FF-H   FF-S2
QR4    IG-H    FF-P   FF-S1   FF-S2   FF-H   IG-P

4.3 Variables

The independent variables included task type and structure, QRI type, and search stage. The task type and structure were described in Section 4.2.

QRI type: two types were defined with respect to the saving behavior: 1) QRIs that contain no saved pages, and 2) QRIs that contain at least one saved page.

Search stage: the search process was divided into four stages with respect to whether any pages had been saved before each QRI and the QRI type. The rationale for this division is that users' knowledge about the search topic might change after they save something. The four search stages are as follows:

Stage 1 included QRIs before users had saved any pages;
Stage 2 included QRIs during which users saved pages and which immediately followed stage 1;
Stage 3 included QRIs without any saving activity after stage 2, that is, after at least one saving activity;
Stage 4 included QRIs during which users saved pages and which followed stage 3.

The sequence of stages 1, 2, 3, 4 reflects an order of search stages (shown in Figure 1).
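The four search stages defined in Section 4.3 can be read as a small state machine over the sequence of QRIs in a session. The sketch below is one possible reading, not the authors' procedure: it takes a hypothetical list of booleans (one per QRI, true when the QRI contains a saved page) and returns a stage number for each. Two cases are not spelled out in the text and are handled here by assumption: a session whose very first QRI contains a save is labeled stage 2, and consecutive saving QRIs keep the same stage until a non-saving QRI intervenes.

```python
from typing import List


def assign_stages(saved_flags: List[bool]) -> List[int]:
    """Label each QRI of one session with a search stage (1-4).

    saved_flags[i] is True when the i-th QRI contains at least one save.
    """
    stages: List[int] = []
    any_saved = False    # has any earlier QRI contained a save?
    seen_stage3 = False  # has a no-save QRI occurred after a save?
    for saved in saved_flags:
        if saved:
            # Assumption: saving QRIs count as stage 2 until a stage 3
            # QRI has been seen, and as stage 4 afterwards.
            stage = 4 if seen_stage3 else 2
            any_saved = True
        elif any_saved:
            stage = 3
            seen_stage3 = True
        else:
            stage = 1
        stages.append(stage)
    return stages
```

For example, a session whose QRIs are no-save, no-save, save, no-save, save would be labeled 1, 1, 2, 3, 4 under this reading.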

Figure 1. Transitions between the Four Search Stages Defined with Respect to QRI and Saving Useful Pages.

The total QRI duration was the time between two successive queries and was calculated as the time from the end of entry of one query to the point when searchers began to enter a subsequent query. During such time, searchers might examine search result pages (SERPs), read content pages, or save useful pages (by bookmarking). In addition, the bookmarked content pages were always shown to the user immediately after they were bookmarked. In order to compare the interval duration between two QRI types, we calculated the QRI duration by subtracting the time spent on bookmarking and the display time of the content page immediately after the bookmark from the total QRI duration. The QRI duration was the main dependent variable in this study.

Other dependent variables included the total time on content pages, the total time on SERPs, the number of content pages and the number of SERPs viewed during QRIs, the mean time on content pages and SERPs, and the percentage of time on content pages and on SERPs. The total time on content pages was calculated as the total time users spent on content pages minus the time on content pages displayed immediately after bookmarking during QRIs. The total time on SERPs was the same as the total time users spent on SERPs during QRIs. With respect to the number of content pages and SERPs, only unique content pages and SERPs were counted; that is, if a user read a content page more than once, it was only counted as one unique content page. The percentage of time on content pages (or on SERPs) was calculated by dividing the total time on content pages (or on SERPs) by the QRI duration.

In the statistical analysis, all temporal variables were log transformed, and the reported mean and standard deviation (SD) were back-transformed from the mean and SD of the log-transformed variables. The back-transformed values were similar to the median of the original data. For other variables that were not normally distributed, non-parametric tests were conducted, and the reported mean and SD were the original mean and SD.

5. RESULTS

Among the 288 search sessions, 98 searches contained only one query, that is, they did not contain any query reformulation. The data on saving behaviors was missing for 5 search sessions during the experiment. Since the current paper focuses on the behaviors during QRIs, only valid search sessions that contained at least two queries are considered (N=185). These sessions contained 684 QRIs.

5.1 Effects of the Two QRI Types
5.1.1 Comparison of behaviors in two types of QRIs

QRIs could be divided into two types according to whether they contained saved pages (refer to Section 4.3 for a detailed explanation). Out of the total 684 QRIs, 428 did not contain saved pages, while 256 intervals contained saved pages.

A t-test was conducted to examine the difference in QRI duration between the two types of QRIs. The duration of QRIs with useful pages (M=67.83 seconds) was significantly longer than that of QRIs without useful pages (M=21.38 seconds). The QRI duration included time on content pages and time on search result pages (SERPs). We compared the total and the mean time on content pages and SERPs, and the number of content pages and SERPs users viewed, during the intervals of the two types. The results of t-tests showed that both the total time on content pages and the total time on SERPs were significantly longer within QRIs that contained useful pages than within QRIs that did not contain useful pages. The results of Mann-Whitney U tests showed that users viewed more content pages and SERPs in QRIs with useful pages than in QRIs without useful pages. However, there was no significant difference in the mean time on content pages and SERPs between the two QRI types (Table 4).

Table 4. Comparison of Interval Behaviors during Two Types of QRIs

Behaviors within QRIs
QRI duration (seconds)
Total time on content pages (seconds)
Total time on SERPs (seconds)
Number of content pages
Number of SERPs
Number of total pages
Mean time on content pages (seconds)
Mean time on SERPs (seconds)
Percentage of time on content pages
Percentage of time on SERPs
*p
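The duration adjustment, the log transformation with back-transformed means, and the comparison of the two QRI types can be sketched with the Python standard library alone. Everything here is illustrative: the function names are hypothetical, natural logarithms are an assumption (the paper does not state the base), and Welch's unequal-variance t statistic is used only as one common variant, since the paper says "t-test" without naming which form was applied.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def adjusted_duration(total: float, bookmarking: float,
                      post_bookmark_display: float) -> float:
    """QRI duration with bookmarking time and the display time of the
    page shown right after bookmarking removed (all in seconds)."""
    return total - bookmarking - post_bookmark_display


def back_transformed_mean(seconds: List[float]) -> float:
    """Mean of the log values mapped back to seconds (a geometric mean).
    All inputs must be positive."""
    return math.exp(mean(math.log(s) for s in seconds))


def welch_t_on_logs(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Welch t statistic and degrees of freedom for log-transformed
    samples a and b (each needs at least two positive values)."""
    la = [math.log(x) for x in a]
    lb = [math.log(x) for x in b]
    va = variance(la) / len(la)  # squared standard error of mean(la)
    vb = variance(lb) / len(lb)  # squared standard error of mean(lb)
    t = (mean(la) - mean(lb)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(la) - 1) + vb ** 2 / (len(lb) - 1))
    return t, df
```

A positive t from `welch_t_on_logs(with_saves, without_saves)` would point in the same direction as the reported result, longer durations in QRIs with useful pages; turning it into a p-value needs a t-distribution, which is left out of this sketch.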
