PROFILE-BASED SUMMARISATION FOR WEB SITE NAVIGATION

By

Azhar Hasan Alhindi

A thesis submitted for the degree of Doctor of Philosophy

School of Computer Science and Electronic Engineering

University of Essex

April 2015

Declaration

This thesis is the result of my own work, except where explicit reference is made to the work of others, and has not been submitted for another qualification to this or any other university. This thesis does not exceed the word limit for the respective Degree Committee.

Azhar Hasan Alhindi


To my loving mother and father, for their endless love, support and encouragement


Acknowledgements

Thanks to the Almighty Allah, Who always helps me. This doctoral thesis is only the beginning of my wonderful journey. The experiences I had while writing it were invaluable and very educational, in both academic and personal contexts. First and foremost, I have to thank my parents, Safiyya Al-Edrisy and Hasan Alhindi, for their love and support throughout my life. Thank you both for giving me strength to reach for the stars and chase my dreams, and for believing in me and my goals. My wise brother Ahmad and my beautiful sisters, Abrar, Afnan, and Alaa deserve special thanks, too, simply for their presence in my life since infancy. I would like to sincerely thank my supervisors, Dr Udo Kruschwitz and Dr Chris Fox, for their guidance, patience, and endless support throughout my doctoral studies at the University of Essex, and especially for their confidence in me. Working with them has given me a disposition for research that I simply did not have when I began my PhD study. From them, I have learnt thinking skills, academic rigour, and how to research effectively. My gratitude goes as well to Prof. Massimo Poesio (internal examiner) from the University of Essex and Prof. Joemon Jose (external examiner) from the University of Glasgow. Their valuable comments and suggestions during my PhD examination have been included in the final version of this thesis. I would also like to thank the anonymous reviewers and the editors of the ACM TOIS special issue on contextual search and recommendation for the substantial, insightful, and very constructive feedback they provided on the journal paper which represents a substantial part of this thesis. Thanks to the University of Essex for providing me with all the required resources to successfully carry out this research programme. Working in such a friendly and harmonious environment has contributed to making this a gratifying experience.


Abstract

Compared to information systems that work the same for all users and contexts, systems that utilise contextual information have greater potential to help a user identify relevant information more quickly and more accurately. Contextual information comes in a variety of flavours, often derived from records of past interactions between a user and the information system. It can be individual- or group-based. The motivation for our work is as follows. First, instead of looking at Web searching or browsing, which has been studied extensively, we focus our attention on Web sites. Such collections can be notoriously difficult to search or explore. If we could learn from past user interactions what information needs can be satisfied by which documents, we would be in a position to help a new user get to the required information much more rapidly. Hence, we harness the search behaviour of cohorts of users instead of individual users, turning it automatically into a profile which can then be used to assist other users of the same cohort. Finally, we are interested in exploring how such a profile is best utilised for profile-based summarisation of the collection at hand in a navigation scenario in which such summaries can be displayed as hover text as a user moves the mouse over a link. The process of acquiring the profile is not a research interest here; we simply adopt a biologically inspired method that resembles the idea of ant colony optimisation (ACO). This has been shown to work well in a variety of application areas. The model can be built in a continuous learning cycle that exploits search patterns as recorded in typical query log files. The main focus of this thesis will be on using the model in profile-based summarisation to generate summaries of documents for navigation support. Our research explores different single-document and multi-document summarisation techniques, some of which use the profile and some of which do not. We perform task-based evaluations of these different techniques – and hence of the impact of the profile and profile-based summarisation – in the context of Web site navigation. The experimental results demonstrate that profile-based summarisation to assist users in navigation tasks can significantly outperform generic summarisation as well as a standard Web site without such assistance.

Contents

List of Figures
List of Tables
List of Algorithms
List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Contributions
  1.4 Thesis Structure

I Theory

2 Related Work
  2.1 Paradigms of Information Access
    2.1.1 Search and Navigation
    2.1.2 Recommendation-Based Systems
  2.2 Towards Contextualised Information Retrieval
  2.3 Web Search versus Enterprise Search
  2.4 Personalisation
  2.5 User Profiles
  2.6 Automatic Text Summarisation
    2.6.1 Application Areas
    2.6.2 Characteristics of Summaries
    2.6.3 Approaches to Summarisation
    2.6.4 Generic Summaries versus Personalised Summaries
    2.6.5 Single-Document Summarisation versus Multi-Document Summarisation
  2.7 Evaluation
    2.7.1 Evaluating Automatic Summarisation
    2.7.2 Evaluating Interactive Information Retrieval
  2.8 Key Differentiators

3 A Framework for Profile-Based Summarisation
  3.1 General Architecture
  3.2 Profile-Based Summarisation
    3.2.1 Profile-Based Single-Document Summarisation
    3.2.2 Profile-Based Multi-Document Summarisation
  3.3 Concluding Remarks

4 Building a Log-Based Profile
  4.1 Log-Based Approaches
  4.2 Query Logs
  4.3 Ant Colony Optimisation Model
  4.4 Deriving Related Terms
  4.5 Ant Colony Optimisation Trimmed Model
  4.6 Concluding Remarks

II Evaluation

5 General Experimental Setup
  5.1 Data Collection Pre-processing
  5.2 Log Analysis
  5.3 Profile Construction
  5.4 Noun Phrase Extraction
  5.5 Extracting a Summary
  5.6 Concluding Remarks

6 A Scoping Study on Profile-Based Summarisation
  6.1 Experimental Setup
  6.2 Results and Discussion
    6.2.1 Overall Performance Comparison
    6.2.2 User Feedback
  6.3 Concluding Remarks

7 A Pilot Study on Profile-Based Summarisation
  7.1 Experimental Setup
  7.2 Results and Discussion
    7.2.1 Overall Performance Comparison
    7.2.2 User Feedback
  7.3 Concluding Remarks

8 Task-Based Evaluations
  8.1 Overview
  8.2 Datasets
  8.3 Questionnaires
  8.4 Protocol and Search Tasks
  8.5 Subjects
  8.6 Significance Tests
  8.7 Standard Web Site versus Single- and Multi-Document Profile-Based Summarisation
    8.7.1 Subjects
    8.7.2 Average Completion Time
    8.7.3 Average Number of Turns to Finish a Task
    8.7.4 Task Success
    8.7.5 Post-Search Questionnaire
    8.7.6 Post-System Questionnaire
    8.7.7 Exit Questionnaire
    8.7.8 Concluding Remarks
  8.8 Generic versus Single- and Multi-Document Profile-Based Summarisation
    8.8.1 Subjects
    8.8.2 Average Completion Time and Number of Turns
    8.8.3 Post-Search Questionnaire
    8.8.4 Post-System Questionnaire
    8.8.5 Exit Questionnaire
    8.8.6 Concluding Remarks
  8.9 Discussion
    8.9.1 General Observations
    8.9.2 Comparison with Related Work
    8.9.3 User Feedback
    8.9.4 Summary

9 Conclusions and Limitations

10 Future Directions

References

Appendices
  Appendix A: Entry Questionnaire
  Appendix B: Post-Search Questionnaire
  Appendix C: Post-System Questionnaire
  Appendix D: Exit Questionnaire

List of Figures

1.1 A Partial Domain Model Learnt from Query Logs.
1.2 Hover Text Presenting a Page Summary of a Linked Page.
2.1 Research Areas of Study.
2.2 Classification of Different Evaluation Methods Used in Automatic Summarisation.
3.1 Profile-Based Summarisation General Architecture.
3.2 Architecture of a Profile-Based Single-Document Summariser.
3.3 Architecture of a Profile-Based Multi-Document Summariser.
4.1 Acquiring a Profile from Query Logs.
4.2 A Profile Acquired Using a Shorter Query Log (1 Month).
5.1 Preprocessing Steps Required on HTML Documents.
6.1 Box Plot of Overall Assessment of Summary Quality.
7.1 Box Plot of Overall Assessment of Summary Quality.
8.1 Summary of the Statistical Tests Used in the Experiments.
8.2 System B (Applying MDS).
8.3 System C (Applying SDS).
8.4 Results of Questionnaire Regarding Ease of Learning How to Use Systems, Experiment 1.
8.5 Results of Questionnaire Regarding Ease of Use of Systems, Experiment 1.
8.6 Results of Questionnaire Regarding Understanding How to Use Systems, Experiment 1.
8.7 Results of Questionnaire Regarding Ease of Learning How to Use Systems, Experiment 2.
8.8 Results of Questionnaire Regarding Ease of Use of Systems, Experiment 2.
8.9 Results of Questionnaire Regarding Understanding How to Use Systems, Experiment 2.

List of Tables

5.1 Noun Phrase Patterns.
6.1 Overall Performance Comparison on Document "library".
6.2 Overall Ratings on 10 Documents. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.
6.3 p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.
7.1 Overall Ratings on 10 Documents. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.
7.2 Local Users: p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.
7.3 Web Users: p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.
8.1 A Basic Design with Graeco-Latin Square Rotation for Topic and Interface [Kelly, 2009].
8.2 Experiment 1: Average Completion Time of Task (in Seconds).
8.3 Experiment 1: Average Number of Turns to Complete a Task.
8.4 Experiment 1: Post-Search Questionnaire: User Familiarity with a Search Topic, Mean Scores.
8.5 Experiment 1: Post-Search Questionnaire: Ease of Getting Started, Mean Scores.
8.6 Experiment 1: Post-Search Questionnaire: Ease of Performing Task, Mean Scores.
8.7 Experiment 1: Post-Search Questionnaire: Satisfaction with Results, Mean Scores.
8.8 Experiment 1: Post-Search Questionnaire: Adequate Time, Mean Scores.
8.9 Experiment 1: Post-Search Questionnaire: Mean Scores, by Task.
8.10 Experiment 1: Post-System Questionnaire, Mean Scores. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.
8.11 Experiment 1: Exit Questionnaire (System Preference).
8.12 Experiment 1: Exit Questionnaire (Search Experience).
8.13 Experiment 2: Average Completion Time of Tasks (in Seconds).
8.14 Experiment 2: Average Number of Turns to Complete a Task.
8.15 Experiment 2: Post-Search Questionnaire: Ease of Getting Started, Mean Scores.
8.16 Experiment 2: Post-Search Questionnaire: Ease of Performing Task, Mean Scores.
8.17 Experiment 2: Post-Search Questionnaire: Satisfaction with Results, Mean Scores.
8.18 Experiment 2: Post-Search Questionnaire: Adequate Time, Mean Scores.
8.19 Experiment 2: Post-Search Questionnaire: Mean Scores by Task.
8.20 Experiment 2: Post-System Questionnaire, Mean Scores. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.
8.21 Experiment 2: Exit Questionnaire (System Preference).
8.22 Experiment 2: Exit Questionnaire (Search Experience).

List of Algorithms

3.1 Profile-Based Single-Document Summarisation.
3.2 Profile-Based Multi-Document Summarisation.
4.1 The ACO-Based Algorithm to Build and Evolve the Domain Model [Albakour et al., 2011].
4.2 Related Terms Extraction Algorithm to Extract Related Terms for a Pre-Defined Term from the Domain Model.
4.3 ACO Trimming Algorithm to Remove Outdated Terms from the Domain Model.
5.1 Algorithm to Create an Extractive Summary.

Acronyms

ACO    Ant Colony Optimisation
AI     Artificial Intelligence
AQE    Automatic Query Expansion
AR     Association Rules
BOW    Bag of Words/Terms
CIR    Contextual Information Retrieval
CIRSs  Contextual Information Retrieval Systems
DUC    Document Understanding Conference (http://duc.nist.gov/)
HMMs   Hidden Markov Models
HTML   HyperText Markup Language
IQE    Interactive Query Expansion
IR     Information Retrieval
IRS    Information Retrieval System
IRSs   Information Retrieval Systems
IIR    Interactive Information Retrieval
IIRSs  Interactive Information Retrieval Systems
MDS    Multi-Document Summarisation
MLE    Maximum Likelihood Estimation
MMR    Maximal Marginal Relevance
MTurk  Amazon Mechanical Turk (https://www.mturk.com/)
MUC    Message Understanding Conference
NLP    Natural Language Processing
ODP    Open Directory Project
OPP    Optimal Position Policy
QFG    Query Flow Graphs
SDS    Single-Document Summarisation
TAC    Text Analysis Conference (http://www.nist.gov/tac/)
TF.IDF Term Frequency–Inverse Document Frequency
THIC   Term Hits In Context
TREC   Text REtrieval Conference
VSM    Vector Space Model
WUM    Web Usage Mining
WWW    World Wide Web

1 Introduction

This chapter discusses the motivation for the work presented in this thesis in Section 1.1 and the research questions in Section 1.2. The primary contributions of the thesis are discussed in Section 1.3. Finally, the organisation of the remaining parts of the thesis, including the list of publications, is detailed in Section 1.4.

1.1 Motivation

"The easier access to information becomes, the greater become our expectations for ubiquitous access in all kinds of situations." [Marchionini and White, 2009]

The continuous growth of document collections on local Web sites[1] makes it desirable to develop techniques that assist users not just in the search process, but also in navigating the collection. It can be surprisingly difficult to track down a specific document or a specific piece of information on an intranet or a university Web site; even if the information is there, it is difficult to find. There are a number of reasons for this, one being that such collections are different to the Web in many respects [Hawking, 2011]. For example, there is much less redundancy than on the Web in general; only a single document might exist in a large collection that satisfies a specific user need. Another reason is the mismatch of terminology between what a searcher is after and what the documents are about, sometimes referred to as the "vocabulary gap" [Smyth, 2007]. Using a university site example, a new first-year student might not actually know where to register, how to find the accommodation office, how to obtain a parking permit, or whether to enrol for a "module", a "course" or a "unit". In such cases, it is possible that the local search engine will not be of much help either [Hawking, 2011].

Web search algorithms have matured over the past years and become more reliable, so that a query submitted to Google, for example, typically returns excellent matches. However, this means that people expect a comparable experience with search tools in other environments, such as intranets, digital libraries, and email systems, which contain far fewer documents and where the search algorithms have not necessarily been as effective as on the Web [Hawking, 2011]. Harnessing contextual information offers great opportunities to address these problems.

[1] When we talk about "local Web sites" we are referring to document collections that represent the intranet or the Web presence of a university, a company, or some other organisation. To simplify matters and to make the approaches more generalisable, we ignore the internal organisational structures that are common features of such collections. This also allows us to treat other collections, such as digital libraries, in the same way.


Contextual search and recommendation refers to a diverse set of techniques that are all aimed at moving away from a one-size-fits-all approach and making a system more effective by incorporating contextual information derived from a wide range of variables, such as content, geographical, interaction and social variables [Melucci, 2012] or simply the users' search histories [Smyth et al., 2005]. While many contextual systems attempt to personalise a system for individual users, contextual approaches can also be group based; contextualisation should not be equated to personalisation [Ruthven, 2011]. Group-based or cohort-based information appears to be a promising route for a community of users with common concerns. Such communities are formed of individuals — for example, employees of a company or members of a university — that, over time, collectively acquire knowledge about a resource such as a local Web site. The idea is to tap into this knowledge and facilitate the sharing of search and navigation experiences among community members [Smyth, 2007]. We can therefore characterise our context fairly generally as the environment in which a cohort[2] of users operates. This bears some resemblance to the idea of "trait-based groups" as people who "may be highly likely to repeat or augment tasks already accomplished by other group members, have interests in the same queries and results as other group members" [Teevan et al., 2009]. The theory is that learning from one user should benefit future users with similar information needs, an idea that our work shares with other approaches that assist users in navigating a collection [Kantor et al., 2000; Wexelblat and Maes, 1999, for example].

This suggests that utilising the accumulated search histories could be beneficial, but it does not tell us how to build usable knowledge structures or how to apply them. To address the first point, we note that query logs have emerged as a very valuable resource that can be mined to derive useful knowledge, such as query suggestions. In fact, search trails and associated click-through information are used in many modern search engines to help in the search process [White and Huang, 2010]. Our approach utilises a profile acquired from log data that contains information relating to previous search trails involving a local search engine. We do not propose a new paradigm for the profile. Instead, we adopt a state-of-the-art approach from the literature that can easily be applied to typical query logs. Many methods can be applied to build different types of profiles. Here we adopt a term-association network in which the nodes represent queries and links between queries reflect (unspecified) relations between these nodes. We acquire our profile by applying an Ant Colony Optimisation (ACO) analogy. This offers an additional benefit over more static approaches by embodying a temporal context: it learns from user interactions with an information system, but it is also able to forget. For example, the concept "timetable" in a model derived from a university Web site will reflect a stronger association with the concept "teaching timetable" at the start of the academic year and with "exam timetable" towards the end of the year. Figure 1.1 illustrates part of the domain model, represented as a network of correlated queries inferred from search logs of the University of Essex Web site, covering a cohort of users' interactions over a certain period of time. In summary, we turn the search history of a local Web site into a profile that represents the search interests and interactions of the population of searchers. This is different from individual user profiles.

[2] We will use the terms "cohort", "community", "group" and "population" interchangeably. In addition, we also use "domain model", "model", "graph" and "profile" as synonymous terms in this thesis (i.e. we generalise the conceptual structure by not distinguishing between a model that is derived from a document collection and one that represents a user's interests). Finally, we make the simplifying assumption of referring to "browsing" and "navigation" as the same thing; in fact, any reference to "search" in our studies is to be interpreted as searching for information in a navigation context.


Figure 1.1: A Partial Domain Model Learnt from Query Logs.
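To make the idea concrete, the sketch below shows one way such a query-association profile could be represented and updated in code. It is only an illustrative sketch inspired by the ACO analogy described above, not the algorithm adopted in Chapter 4; the class, its parameters (reinforcement and evaporation rates) and the example sessions are hypothetical.

```python
from collections import defaultdict

class QueryAssociationProfile:
    """Minimal sketch of a cohort profile: nodes are queries, weighted edges
    link queries that followed each other in past search sessions. The
    reinforcement and evaporation steps loosely mimic pheromone updates in
    an ACO-style model."""

    def __init__(self, reinforcement=0.01, evaporation=0.9):
        self.weights = defaultdict(float)   # (query_a, query_b) -> association weight
        self.reinforcement = reinforcement  # added each time a query pair is observed
        self.evaporation = evaporation      # multiplicative decay applied per time step

    def observe_session(self, queries):
        """Strengthen links between consecutive query reformulations in one session."""
        for a, b in zip(queries, queries[1:]):
            self.weights[(a, b)] += self.reinforcement

    def decay(self):
        """Let unused associations fade, so the profile can also 'forget'."""
        for edge in list(self.weights):
            self.weights[edge] *= self.evaporation
            if self.weights[edge] < 1e-6:
                del self.weights[edge]

    def related(self, query, top_n=5):
        """Return the most strongly associated queries for a given query."""
        candidates = [(b, w) for (a, b), w in self.weights.items() if a == query]
        return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_n]

# Example: learning from two hypothetical logged sessions on a university site.
profile = QueryAssociationProfile()
profile.observe_session(["timetable", "exam timetable", "examination location"])
profile.observe_session(["timetable", "teaching timetable"])
profile.decay()
print(profile.related("timetable"))
```

The evaporation step is what gives the profile its temporal character: associations that are no longer reinforced by fresh sessions gradually fade, so the model can track seasonal shifts such as the timetable example above.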


That leaves us with the question as to how to apply the profile. Interactive search support has attracted a lot of attention, for example, Dumais et al. [2001]; Paek et al. [2004]; White et al. [2002b], and cohort modelling has been shown to be effective for Web search [Yan et al., 2014], but not much has been reported to support users in a navigation context. Our aim is to explore the benefit of profile-based summarisation for navigation, providing "tool tips" that give Web-site users a summary of a document as they hover the mouse over a link. This allows them to assess whether it is worth following the link or not (as illustrated in Figure 1.2). One of the main objectives is to cut down the number of steps and the time taken to get from the user's entry page to the desired document.

Figure 1.2: Hover Text Presenting a Page Summary of a Linked Page.

Summarisation appears relevant to the navigation problem as it helps present salient information to a user by condensing a document's content and extracting the most relevant facts or topics included in it [Lloret and Palomar, 2012]. Query-biased summarisation has been shown to be highly effective in a search context [White et al., 2003, for example]. We hypothesise that it is also useful in Web-site navigation. Both Single-Document Summarisation (SDS) and Multi-Document Summarisation (MDS) have been investigated extensively and independently within the Natural Language Processing (NLP) and Information Retrieval (IR) communities [Wan, 2010]. SDS aims to produce a concise and fluent summary of a single document, whereas MDS summarises multiple related documents. The two tasks are very closely related in both task definition and solution methods. The major difference between SDS and MDS is the greater amount of redundancy when we start with multiple documents [Jurafsky and Martin, 2009, Ch. 23, p. 831]. Both approaches are useful and commonly applied [Nenkova and McKeown, 2011]. For example, given a cluster of news articles, a multi-document summary can be employed to help users understand the whole cluster, while a single summary for each article can be employed to help users know the content of that specific article. We are interested in both SDS and MDS.

In this thesis, we report on a number of experiments and studies aimed at addressing our research questions (presented in Section 1.2). The core of the experimental work consists of task-based evaluations in a Web-site navigation context. We use a specific university Web site to conduct our studies, but would argue that the methods are applicable to a wide range of Web sites and intranets. The caveat of such a study is that it is limited to a single Web site and the findings may or may not be transferable to other document collections (see also Kruschwitz et al. [2013]). Despite this limitation, we argue that our results could provide insights and serve as a baseline for future studies on different Web sites. The major bottleneck in conducting research into using any form of query logs is the difficulty in getting hold of realistic and large-scale log data, both of which we managed to address in this study.
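As a rough illustration of how a profile might bias an extractive summary, the following sketch scores sentences by their overlap with weighted profile terms and keeps the top-scoring sentences in document order. It is a simplified stand-in for the summarisers developed later in the thesis, not their actual implementation; the sentence splitting, the scoring function and the example profile weights are assumptions made purely for the example.

```python
import re

def profile_biased_summary(text, profile_terms, max_sentences=2):
    """Naive extractive summariser: rank sentences by the total weight of
    profile terms they contain, then return the top-ranked sentences in
    their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    def score(sentence):
        words = set(re.findall(r'\w+', sentence.lower()))
        return sum(weight for term, weight in profile_terms.items() if term in words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])          # preserve document order
    return ' '.join(sentences[i] for i in chosen)

# Hypothetical profile weights learnt from the query log of a university site.
profile = {"timetable": 0.3, "exam": 0.2, "registration": 0.1}

page = ("The department offers a range of undergraduate modules. "
        "Exam timetables are published at the end of each term. "
        "Students can collect parking permits from reception.")
print(profile_biased_summary(page, profile, max_sentences=1))
```

A generic summariser would pick sentences by document-internal evidence alone; here the cohort profile shifts the selection towards what past searchers of the site actually looked for.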

1.2 Research Questions

The purpose of the research conducted is to build profile-based summaries and apply them for IR tasks. In particular, the research questions we aim to answer with our work are as follows:

1. Can Web site navigation benefit from the automated summarisation of results?

2. Will a domain model/profile capturing the search behaviour of a group of users be beneficial for the summarisation process?

3. Will such methods result in measurable (quantifiable) benefits such as shorter sessions, fewer interactions, and the like?

1.3 Contributions

Hypothesis: profile-based summarisation can help a user in the navigation process on a Web site and guide the user to the right documents more easily.

The main aim of this thesis is to investigate the use of profile-based summarisation to provide contextualisation and interactive support for navigating Web sites. The thesis makes the following contributions:

• We provide a general framework for turning log data into a cohort-based model that can be used for navigation.

• We propose profile-based single-document and multi-document summaries.

• We conduct an extensive evaluation to assess the quality of the proposed work.

1.4 Thesis Structure

In general, we split the thesis into two main parts: a theory part, which contains the main principles and the basics of our research area, and an evaluation part, which contains all our experimental work. The thesis is structured as follows:

• Chapter 2: This chapter gives a comprehensive discussion of related work, describes key terms and research areas used throughout the thesis, and introduces several preliminary notions which are necessary to properly understand the goals of this research. We present detailed background information for the areas of profiling and summarisation and show the different methods and techniques used in those areas. We also provide background information on IR and NLP.

• Chapter 3: This chapter presents a framework for profile-based summarisation. It describes the general architecture of profile-based summarisation and illustrates the methods we have applied and the algorithms we have adopted to create single- and multi-document profile-based summaries.

• Chapter 4: This chapter presents the methodology we have utilised and the algorithms we have adopted to build our adaptive log-based profile, a biologically inspired model built from query logs.

• Chapter 5: This chapter illustrates how we applied the specified methods in our experiments. It includes all the technical parts of our practical work.

• Chapter 6: This chapter presents the initial experiment we conducted, a scoping study on profile-based summarisation intended to investigate the potential of cohort-profile-based SDS for typical information needs. We produce a number of single-document extractive summaries using several methods, some using the profile and some not. We also discuss the experimental setup and present the evaluation results.

• Chapter 7: This chapter reports on a pilot study we conducted to assess the potential that profile-based SDS and MDS might have in a Web-site navigation context. We explore a wider range of single- and multi-document extractive summaries using a number of methods, some using the profile and some not. The experimental setup and evaluation results are also discussed.

• Chapter 8: This chapter presents two task-based evaluations. In particular, it presents the principles behind the task-based evaluation, illustrates how we used profile-based summaries to assist users in a Web site navigation task, and finally discusses the experiments' evaluation results.


• We finish the thesis by outlining its conclusions and limitations in Chapter 9 and delineating opportunities for future work in Chapter 10.

Some of the original material used in this thesis has been published in the following peer-reviewed papers:

Journal Publications

1. A. Alhindi, U. Kruschwitz, C. Fox, M-D. Albakour. Profile-Based Summarisation for Web Site Navigation. In Paul N. Bennett, Kevyn Collins-Thompson, Diane Kelly, Ryen W. White, Yi Zhang, editors, ACM Trans. Inf. Syst. 33, 1, Article 4 (February 2015). (Chapter 1, Chapter 3, Chapter 4, Chapter 5, Chapter 7, Chapter 8, Chapter 9 and Chapter 10)

As the main author I designed and performed the experiments, analysed and interpreted the data, and wrote the paper. Udo Kruschwitz and Chris Fox both contributed to drafting the paper and revising it critically for important intellectual content. M-Dyaa Albakour contributed by guiding the data analysis process, and together we worked collaboratively.

Conference Publications

1. A. Alhindi, U. Kruschwitz, and C. Fox. A Pilot Study on Using Profile-Based Summarisation for Interactive Search Assistance. Advances in Information Retrieval, volume 7814 of Lecture Notes in Computer Science, pages 672-675. Springer Berlin Heidelberg, 2013. (Chapter 6)

2. A. Alhindi, U. Kruschwitz, and C. Fox. Site Search Using Profile-Based Document Summarisation. DIR, pages 62-63. Citeseer, 2013. (Chapter 6)

Part I

Theory


2 Related Work

Information retrieval is a daily activity for most people. When searchers have well-defined needs, a one-shot query might be sufficient for the retrieval of documents, but when they are seeking information for complex mental activities – such as decision-making or learning – retrieval is necessary but may not be sufficient, because the searchers may need additional support which extends beyond the provision of search results [Marchionini and White, 2009]. In this case, tools and support services that assist users in managing, analysing, and sharing sets of retrieved information may be helpful. The problems addressed in this thesis touch on a number of different research areas. We will discuss each of them in turn. Figure 2.1 presents the research areas of our interest in a simplified way.

Figure 2.1: Research Areas of Study.

In our discussion of related work, we will focus on profiling and summarisation, but we will also cover the broader areas of IR and NLP. NLP and IR are altogether different areas of research. However, NLP techniques have been utilised in IR and are considered helpful both for representing the user's query and for describing document content. These techniques are used to compare the two descriptions and to present the user with the documents that best fulfil his or her information needs [Allan, 2004].

This chapter starts with an explanation of the different paradigms of information access in Section 2.1. Then, in Section 2.2, we review the specific areas of IR related to the topics covered in this thesis and position our work by outlining the current direction of research in this area. We also provide background on Web search, discuss the enterprise search domain, outline the challenges and gaps in this area, and outline how it differs from the Web in general in Section 2.3. After that, in Section 2.4, we illustrate the main principles behind the concept of personalisation and show the difference between personalised and non-personalised search engines. This is followed by a discussion of the various approaches to building the user profile in Section 2.5. In Section 2.6, we describe research and development on the automated creation of summaries of one or more texts. This section presents the application areas and characteristics of summaries, along with an overview of the principal approaches in summarisation. It also shows the difference between generic and personalised summaries and between single- and multi-document summarisation. In Section 2.7, we review the methods of evaluating Information Retrieval Systems (IRSs) and summaries. The chapter concludes by discussing the key differentiators in Section 2.8.

2.1 Paradigms of Information Access

Finding specific information on the World Wide Web (WWW) can be a tedious process due to the enormous amount of information available. To satisfy particular information needs on the Web, three paradigms of information access have been used [Micarelli et al., 2007]: (1) searching by query, (2) searching by surfing/browsing, and (3) recommendation. We will discuss searching by query and searching by browsing in the next section and discuss recommendation later on. Browsing and searching are the two predominant interaction modes for locating information. These can be characterised as searching by navigation and searching by query [Furnas, 1997; Jul and Furnas, 1997].

2.1.1 Search and Navigation

Searching by query is accomplished by submitting a search query — typically a list of keywords or terms — to a search engine. The search engine then returns ranked links to pages that match the query. Search by query might return inappropriate results due to polysemy and synonymy, for example, but it is very common, and it is able to identify the pages that contain relevant terms quickly. In contrast, searching by navigation, or "browsing", is accomplished by the user examining Web pages sequentially to explore the content of each page. This approach is poorly suited to pinpointing a particular piece of information when the user cannot quickly locate the pages of interest. It is useful when knowledge of the required information is more vague, when the user is unable to formulate a conventional search query due to a lack of knowledge of the appropriate terminology, or when a user is simply exploring a Web site or information resource. In this scenario, providing contextual support may help bridge the "vocabulary gap" and reduce the frustration of navigating Web sites that can arise when using the static, general-purpose link structure


that has been set in place by the Web site administrators and content editors [Karim et al., 2009]. It might also help those users that have no specific knowledge of the content of a collection [Joachims et al., 1997]. We also note that use of contextual information for navigation has not received the same attention as contextual information for searching, which has been explored extensively (at least in the case of general Web searching). We do know, however, that employing users’ interaction histories in a browsing context has the potential to significantly reduce a user’s effort in finding the right information [Wexelblat and Maes, 1999]. With search by query on the Web, typically, the user’s query is entered into a text box. The search engine then produces one or more pages of ranked links to documents. Each link is typically accompanied by a query-based summary of the link’s destination (a snippet). The results may include links to different kinds of information, such as documents in various formats, videos, images, news articles, and encyclopaedia entries. When the user tries to find documents related to his or her request, the user may not select the optimal query terms – even if the user has search experience. When the query is short, this leads to a large number of returned documents, many of which may not even match the query. This also occurs when the query terms are too general or incorrectly formulated. Research has been conducted to discuss these problems and to find solutions for them [Kruschwitz, 2005]. To improve the quality of the retrieval performance in the personalisation domain, Micarelli et al. [2007] suggest two techniques: query expansion and relevance feedback. Expanding the user query with specific words which are extracted from previously retrieved pages of ranked documents will improve ranking quality. Gooda Sahib et al. [2010] show the possible impact of query expansion in the IR process and how it supports the searchers’ information-seeking tasks. The technique of query expansion helps to improve the effectiveness of searches by adding a short list of terms to a query. Query expansion can be applied in two different ways based on Gooda Sahib et al. [2010], either as Automatic Query Expansion (AQE), in which the system will automatically add


additional terms to the initial query, or as Interactive Query Expansion (IQE), in which a list of proposed terms will be presented to the searchers so that they can decide to add terms to their initial queries. Various studies have been conducted to contrast the two methods, and it seems that the potential advantages of IQE are not being fully exploited. IQE seems to be preferred by researchers because they have the ability to distinguish between relevant terms and irrelevant ones; even if the retrieval process’s performance is not enhanced, these experienced users feel more in control. A variety of improvements for Web search has been suggested and adopted. These include, for example, query-term highlighting within the results [Wilson, 2011] and query suggestions based on previous user interactions [Jones et al., 2006]. These improvements can lead to a more interactive search process. Olston and Chi [2003] build an intelligent system which combines the strengths of searching and browsing in a single interface. This system guides users towards search results by highlighting relevant hyperlinks on the pages that they are browsing. Another approach to combining the two interaction modes is proposed by Freyne et al. [2007] in an attempt to harness and harvest community wisdom by incorporating social search and social browsing. White et al. [2007] enhance Web searching by suggesting, in addition to the regular search results, links to Web sites frequently visited by other users with similar information needs. This exploits the searching and browsing behaviour of previous users. Our work inherits ideas from these approaches. Instead of proposing links or queries, however, we aim to help Web site users by applying text summarisation to hyperlinked documents in order to assist their navigation. In order to fully satisfy users’ search needs, a search engine should take users past the result list and aid them towards their information goals as they browse [Coyle and Smyth, 2007]. The act of browsing can assist users in formulating more focussed goals; quite often, users’ information needs are ill-defined at query time. The SearchGuide browsing assistant [Coyle and Smyth, 2007] helps users both in navigating to pages that are linked from the selected page and in browsing through the contents of a selected


result. The content of the 'anchor text' associated with links, and the surrounding text, often provides insufficient information for users to make reliable decisions about whether to open a linked page [Jones and Li, 2008]. However, users often entirely depend on this information when browsing hypertextual documents returned by a regular IR search engine [Jones and Li, 2008]. As a consequence, users may be obliged to follow a link in order to determine whether it leads to useful information and return to the previous page when it does not. With this in mind, the question is whether there is some scope for providing browsing assistance by summarising the documents that can be found by following a given hyperlink. There may be a trade-off between such help and the required additional reading effort. It is also unclear how long such a summary needs to be. By way of support, in the case of Web search, automatically generated summaries have been shown to allow users to gauge document relevance more effectively than they can through the standard ranked title/snippet approach [White et al., 2003]. Overall search time can be significantly reduced, as short summaries can be read more quickly than full pages of text [Chen and Dumais, 2000].

There are many other ways of improving general Web search and navigation. Carbonaro [2010] presents a summarisation process through exploring concept-based search retrieval for enabling user-friendly and intelligent content exploration. Jones and Li [2008] provide topical feedback for link selection in hypertext browsing and a term cloud preview of the contents of each linked page. However, there appears to have been little progress in making navigating a Web site more adaptive to users' needs (without requiring the users to explicitly express what they are after [Joachims et al., 1997, for example]). Web sites and intranets can be difficult to navigate, with a static and possibly idiosyncratic organisation [Berendt and Spiliopoulou, 2000; Karim et al., 2009]. In addition, searching can be difficult on such sites [Hawking, 2011, Ch. 15, p. 641]. A common approach to adding assistance to a Web site is to use an overlay window or hover text, essentially adding a "layer" on top of an existing site. This can be used for presenting search results [Dumais et al., 2001; White et al., 2002b, for example], or for navigation by introducing links and suggestions to commonly visited pages, taking advantage of the collective search and navigation effort of other users [Karim et al., 2009; Saad and Kruschwitz, 2011]. We will adopt the idea of hover text by creating a pop-up "tool-tip" that contains a summary of the target of a link whenever users hover their mouse cursor over that link. The summary is generated from the document, or documents, that can be reached by following the link. It uses information such as the title of the document and relevant nodes in the profile. The actual content of the Web site itself is left unchanged.[1]

[1] There may be legal and moral issues when wrapping other people's content inside an interface that, in some sense, changes the "content" and appearance of a Web site. But we believe the beneficiaries of our work will normally be Web site managers, as with other work that aims at supporting the designers and owners of a Web site [Chi et al., 2000, for example].

2.1.2 Recommendation-Based Systems

Recommender systems [Sugiyama et al., 2004] are considered one of the most promising approaches to the problem of information overload on the Web. These systems facilitate the search for useful information on the Web, but only for specific scenarios in which a user's behaviour can be compared to other users' behaviours and mapped to the best-matching one – for example, an online book shop. Recommendation is accomplished by analysing the suggested items to identify the items which have been chosen by the user in the past. These systems serve different domains, such as knowledge management, digital libraries, and e-commerce, by utilising user preferences to provide personalised suggestions. User preferences [Sivapalan et al., 2014] are represented either as a preference score, such as 'likes' or 'dislikes', or as a 'numerical score' indicating how much the user likes the product. After gathering user preferences, items are recommended by identifying differences and similarities among several users' profiles. Recommender systems learn a profile of what the user likes over a period of time without placing any burden on the user, so these systems help when there are no search queries or when those queries are difficult to formulate; recommender systems also deal easily with the mass of information on the WWW [Middleton et al., 2003].


In modern recommendation systems [Pazzani and Billsus, 2007], after a user interacts with a Web application, a common scenario will start in which a list of items is presented to the user and he or she then selects an item, either to interact with it in some way or to obtain more details about it. For example, an online news site may present headlines and story summaries so that the underlying pages are accessible to the user, who just selects the required headline to read the whole story. On an e-commerce site, by contrast, a list of individual products will be provided to the user, who is then able to choose which product to purchase or learn more about. Typically, these items are stored in a database on a Web server, and this Web server dynamically creates Web pages with a list of items. The huge number of stored items in the database demands a specific order in which to present the items, or a subset of items to present to the user. Recommender systems can be personalised, non-personalised, attribute-based, people-to-people correlated, and item-to-item correlated [Sivapalan et al., 2014]. Many e-commerce sites utilise recommender systems in order to propose items to their customers according to the customer's demographics, the site's best-selling items, or an analysis of the customer's previous buying behaviour (which helps to predict buying behaviour in the future). In general, these techniques are utilised to adapt the site to each customer, providing personalisation of the site for each customer. On the other hand, in non-personalised recommender systems [Schafer et al., 2001], each customer obtains the same recommendations, which are not based on the customer but on the average opinions of different customers about the products. Personalised recommender systems – such as those on Amazon and eBay – could use social or collaborative filtering, or content-based filtering. Collaborative filtering systems [Sivapalan et al., 2014; Su and Khoshgoftaar, 2009; Terveen and Hill, 2001], such as those on Last.fm, Facebook, MySpace, LinkedIn, and GroupLens [Resnick et al., 1994], rely on the 'neighbourhood' of the active user, taking into account the behaviours of other people with similar interests during the item-recommendation process. Conversely, content-based filtering is used to predict future interactions with


each active individual user through building an individual model that shows user likes and dislikes; however, this requires the presence of content descriptions for the items. Intuition is the main property that the content-based recommenders focus on: ‘finds me things like I have liked in the past’. Content-based filtering recommender systems such as Pickbooks [Petrovic et al., 2015], Personal Web-Watcher [Mladenic, 1996], InfoFinder [Krulwich and Burkey, 1996], NewsWeeder [Lang, 1995], and Letizia [Lieberman et al., 1995], utilise the profile of the user’s preferences (user history, ratings, personal information, etc.) to generate recommendations. This profile includes content descriptions of the items and utilises past user ratings to predict a customised user rating for unseen products. Collaborative and content-based filtering may be automatically combined in many ways, such as in Basilico and Hofmann [2004]; Cappella et al. [2015]; Lu et al. [2015]; Miranda et al. [1999]; Popescul et al. [2001]; Yang et al. [2014], to provide a new approach called a hybrid approach [Adomavicius and Tuzhilin, 2005; Sivapalan et al., 2014] to avoid problems that exist in both content-based and collaborative filtering systems. Middleton et al. [2003] generate a recommendation system for online research papers at the University of Southampton; this system supports both content-based and collaborative recommendations, so it is a hybrid recommender system and a searchable paper database. It acts as a pool to share knowledge among all users via recommendation and search. As another example, Netflix is a popular service which combines collaborative and content-based filtering to recommend movies [Petrovic et al., 2015]. In our work we will not employ the techniques used by recommender systems but rather focus on IR and NLP methods.

2.2 Towards Contextualised Information Retrieval

IR is an extensive field in the science of document searching; it includes storing and returning a wide range of media – for example, image files and text documents – which help people to discover the desired information. Today, many applications use IR [Sharma and Patel, 2013], including digital libraries, recommender systems, search engines, and media searches. A search engine is considered one of the most functional applications of IR techniques; it deals with large-scale text collections. The best known examples of search engines are Web search engines, which are designed to search for information on the WWW; however, many other types are available, such as enterprise, desktop, mobile, federated, and social searches.

An Information Retrieval System (IRS) is normally composed of three essential subsystems [Baeza-Yates and Ribeiro-Neto, 2011]: (1) the document collection that the user is searching (the set of logical representations of the documents indexed by the system is referred to as D); (2) the users' information needs, which are called queries (the set of logical representations of the user's interests is denoted by Q); and (3) the algorithms used to match user requirements (queries) with the documents which fulfil the user's information needs. The last of these is realised by a ranking function F(d_i, q_j) defined on D × Q, which assigns a real number to each pair of a document representation d_i ∈ D and a query representation q_j ∈ Q; this aims to arrange the documents in the collection according to their relevance to the user's information needs.

Accurately modelling a user's information needs using simple models is not easy. For example, many of the existing state-of-the-art IR models utilise a simplified method of representing documents and queries as bags of words. These Bag Of Words (BOW) models, such as Amati and Van Rijsbergen [2002]; Fang and Zhai [2005]; Ponte and Croft [1998], employ various sorts of document representations, query representations, and ranking functions. However, these models have one thing in common, which is that term order is disregarded when building the document and query representations. In such models, it is either difficult or impossible to represent many types of user preferences because of their varying natures. It should be noted that BOW models often include term co-occurrence statistics in the ranking function, thus modelling one very basic form of term dependence [Van Rijsbergen, 1977; Wei and Croft, 2007].
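To ground this abstract definition, here is a minimal bag-of-words ranking function in the spirit of such models: documents and queries are reduced to term counts and scored with a simple TF-IDF dot product, ignoring word order entirely. It is an illustrative sketch of a ranking function F over D × Q under these simplifying assumptions, not an implementation of any of the specific models cited above.

```python
import math
from collections import Counter

def tfidf_score(query, document, collection):
    """Score one document against a query with a simple bag-of-words
    TF-IDF dot product; word order is ignored entirely."""
    n_docs = len(collection)
    doc_terms = Counter(document.lower().split())
    score = 0.0
    for term in query.lower().split():
        tf = doc_terms[term]                                        # term frequency in the document
        df = sum(1 for d in collection if term in d.lower().split())  # document frequency in the collection
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

# Hypothetical three-document collection and query, for illustration only.
docs = ["exam timetable for the spring term",
        "campus parking permit application",
        "teaching timetable and room allocation"]
query = "exam timetable"
ranking = sorted(docs, key=lambda d: tfidf_score(query, d, docs), reverse=True)
print(ranking[0])   # the document judged most relevant to the query
```

The point of the sketch is the limitation discussed above: because every document and query is reduced to a multiset of terms, the same ranking is produced regardless of who issued the query or in what context.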

Theoretical models, such as the Vector Space Model (VSM) [Salton, 1971] and the probabilistic model [Robertson and Jones, 1976], have been suggested to determine query and document representations and query–document matching. In any case, it is obvious that these models concentrate on the search topic and overlook the particular user's search context. The many available IRSs which rely on these models consider the submitted query to be the main knowledge about the information need; accordingly, they retrieve the same documents without giving any attention to who entered the query. The growing importance of information emerging from an extensive variety of electronic media sources has made conventional IRSs less efficient. Typically, conventional IR models view the retrieval issue as matching a query with a group of documents [Singhal, 2001], and they do not model personalised and contextual search. Such frameworks are static and not mindful of the context in which they work; the numerical models that underlie them require abstraction. This is more commonly known as the black-box [White et al., 2002a] or one-size-fits-all approach [Allan et al., 2003]; most Web search engines could be characterised in this way. In practice, users are often not satisfied by the information returned by these systems, especially if the results are ambiguous. To handle this issue, there is increasing interest in Contextual Information Retrieval (CIR), which relies on different sources of evidence (derived from the user's search background and environment, such as preferences, interests, location, and time) to enhance retrieval accuracy [Tamine-Lechani et al., 2010]. Contextual Information Retrieval Systems (CIRSs) depend on diverse approaches to modelling the user's context and various methods of document relevance measurement, but all share the objective of giving the most valuable information to users according to their context and of improving retrieval accuracy through two steps: appropriately characterising the context of the user's information needs, often known as the search context, and subsequently adapting the search by taking that context into account in the

information acquisition process. Accordingly, the main challenge in IR is as follows: how can contextual information be captured and incorporated in the retrieval process to improve search performance? Allan et al. [2003] define contextual retrieval as 'combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for user's information needs'. According to Dourish [2004], context is seen as a type of information that comprises implicit attributes that portray the user and the environment in which information activities happen. Lopes [2009] presents more detail about context definitions, context features, and their uses in IR. Recently, IRSs have started to utilise contextual approaches for a variety of different purposes [Ruthven, 2011], such as realising what information users need by tracking their responses to information, predicting what information they need, figuring out how information is presented to users, showing how information is connected with other information, and choosing who else should be informed about new information. Many researchers with different backgrounds believe that improving the user's experience and enhancing the system's search effectiveness can be achieved using context. Different contexts may require search systems to behave differently or to offer different responses. Melucci [2012] presents contextual search within a computational framework according to contextual variables, contextual factors, and statistical models. A potentially promising solution for reacting to users' needs is to bring context into the search process [Freyne et al., 2004; Smyth et al., 2003]. This allows a generic search engine to customise its results for the needs of specialist user cohorts. In order to determine the general search context, Smyth et al. [2003, 2005] attempt to discover patterns in the past actions and behaviours of a community of searchers, which can be used when responding to future searchers; the aim is to re-rank search results to ease information access. This can provide substantial enhancements in search performance, particularly when a cohort of searchers shares the same information needs and utilises

the same queries to express these needs [Smyth et al., 2005]. Freyne et al. [2004] show that the cohort-based search method can bring valuable search-performance advantages to end-users and can avoid a large portion of the privacy and security concerns that are generally associated with related personalisation research. In summary, the contextual aspect of IR systems is important. Nowadays, IRSs are investing more in contextual features to support users during their information-seeking process. Our work aims to contribute to CIR research by building profile-based summaries which can help users to easily navigate through the information space.
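To make the cohort-based idea concrete, here is a minimal sketch in the spirit of the community-based re-ranking described above (it is our own illustration, not the actual algorithm of Smyth et al. or Freyne et al.; the hit counts, weighting, and document names are invented): results previously selected by the cohort for a query are promoted when the same query is seen again.

```python
# Sketch of cohort-based re-ranking: blend the engine's own ranking with the
# popularity of each result among past searchers of the same community.
community_hits = {("library", "d_library_home"): 40,   # hypothetical hit table
                  ("library", "d_campus_map"): 3}

def rerank(query, engine_results, alpha=0.7):
    total = sum(community_hits.get((query, d), 0) for d in engine_results) or 1
    def combined(doc):
        relevance = 1.0 / (engine_results.index(doc) + 1)           # engine signal
        popularity = community_hits.get((query, doc), 0) / total    # cohort signal
        return alpha * popularity + (1 - alpha) * relevance
    return sorted(engine_results, key=combined, reverse=True)

# The cohort's favourite document is promoted above the engine's top result.
print(rerank("library", ["d_campus_map", "d_library_home", "d_other"]))
```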

2.3 Web Search versus Enterprise Search

The most conspicuous application of IR is Web search. For any search engine, IR consists mainly of two components: indexing and ranking technology. The Web has many characteristics that have made it so successful [Hawking, 2011, Ch. 1, p. 9], including the simple HyperText Markup Language (HTML), the low access cost, the widespread reach of the Internet, the interactive browser interface, the search engines, and especially the freedom to publish ideas which can reach millions of people for free and without having to go through the editorial board of a large publishing company. The Web includes an almost unlimited number of Web pages, which are created in a specific manner, as they contain more structure than typical document collections. There are more than one trillion distinct URLs on the Web2, even if many of them are pointers to dynamic pages, not static HTML pages. Static Web pages are different from dynamic ones [Manning et al., 2008] in that the former's content does not change from request to request. For this reason, a professor's manually updated home page is a static Web page, whereas an airport terminal's ever-changing flight status page is dynamic. The latter are usually generated by an application server in response to a database query.

2 Google blog, 2008: http://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html

Furthermore, Web pages can incorporate anchor text, which is utilised to portray the documents; hyperlinks to other Web documents; and different tags used to distinguish among the diverse parts of the Web page contents. It is valuable to recognise these structures in order to identify their semantic content [Kruschwitz, 2005]; this involves investigating the internal structure of a typical HTML page, which includes hyperlinks, mark-up, or both. The plain content of a Web page does not usually give an adequate description of the page, which is one of the reasons to exploit such structure. Conceptual information, such as link text, mark-up that signals a page's semantic content, or a meta tag, is identified by investigating the mark-up structure of Web documents, with the end goal of determining which documents match the query. Search engines utilise distinct mechanisms with the aim of assessing the relevance of documents [Alhalabi et al., 2009]. Typically, search engines rely on different combinations of these mechanisms, which include link analysis, keyword frequency, and page usage. These approaches surface interesting Web pages which reflect the user's preferences. A more recent methodology is collaborative filtering, which likewise retrieves the most interesting documents for the user. Building a user profile that allows the system to identify the user's areas of interest and to sort users and their needs is one of the most serious issues in implementing collaborative filtering. We explain user profiles in more detail in Section 2.5. In addition, with the aim of obtaining the most interesting documents for the user, we will examine some reference tools: the time taken to explore Web page contents, mouse movement, page scrolling, and mouse clicks. Effective techniques to retrieve relevant documents from homogeneous collections of text – for example, scientific abstracts – were already available in the 1950s. However, one event that radically changed IR in the 1990s is that the Web turned into a huge success. The Web was so successful because some issues had been addressed, including its enormous scale, its heterogeneity, and most of all, the infinite variety of purposes for

which Web searches may be utilised. Bringing exceptionally powerful search to complex information spaces within enterprises is the new challenge that IR faces [Hawking, 2004]. Enterprise search [Baeza-Yates and Ribeiro-Neto, 2011] is the application of IR technology to information finding within organisations. It searches through an enterprise's content, which is the digital textual material owned by an organisation and located in its public Web site(s), intranet, emails, databases, records, etc. The ultimate goal of an enterprise IRS is to search all the documents that may contain a useful answer and to present the search results in a form that is of maximum utility to the searcher. The key IR problems in the enterprise search area [Hawking, 2004] are defining an appropriate test collection for enterprise search, finding an effective ranking mechanism for the heterogeneous collection, effectively searching collections of emails, building employee portals as a distributed IR problem, exploiting search contexts within enterprise searches, estimating the importance of the documents that are not part of the Web, and so on. Moreover, integrating enterprise search within the software applications used by employees proves to be of great value. Building up an effective and viable search engine for a modern enterprise can have a substantial payoff. Feldman and Sherman [2001] emphasise the importance of information access in the enterprise. They found that access to information is critical to enterprises' activities. In addition, they estimate the costs for an enterprise of not finding information to be in the region of $2.5 to $3.5 million per year. Developing efficient and effective tools for enterprise search has therefore attracted researchers in academic and industrial IR alike. Mukherjee and Mao [2004] outlined the most important factors involved in an enterprise search system. Besides the various approaches to designing effective ranking models for enterprise search, they propose that contextualisation and recommendation can be particularly helpful in enterprise search. Li et al. [2005] structure intranet search in their framework around four sorts of information:

home pages of groups or topics, experts on topics, employees' personal information, and term definitions. For every sort of search, information extraction methodologies are utilised to extract, merge, and summarise information in advance. Finding specific types of information cannot be achieved using traditional IR, which relies only on the retrieval of relevant documents. Thus, the method that Li et al. [2005] propose is to search for information of each specific type in addition to performing the conventional relevance search. Many commercial systems [Li et al., 2005] have been built for intranet search; many of them treat intranet search as a traditional relevance search problem and return a set of ranked documents with the most relevant documents at the top. Helping individuals access information on an intranet is a huge challenge in the IR field, and this is the topic that we are interested in. Enterprise search differs from desktop search, which searches the content of a single computer, and from Web search, which searches documents on the open Web. Enterprise search and Web search both essentially aim to provide the user with desired information. However, their search strategies are different since these domains' users, information needs, environments, security, knowledge, and goals are different [Hawking, 2004; Mukherjee and Mao, 2004]. They differ in many ways [Mukherjee and Mao, 2004]. First, on the Internet, a large number of returned documents are typically related to the query, and the user is searching for the best or the most relevant documents, so searching is the process of finding the best of many answers. On the other hand, on the intranet, the methodology of discovering the right answer is frequently more difficult because a user may know or have already seen the desired documents; most of these queries have a specific right answer, but that answer is not necessarily the most popular document, as it would often be on the Internet. Second, intranet content commonly reflects the perspective of the entity it serves, and it is generated to convey information instead of to attract the attention of a specific group of users. This content is gathered from heterogeneous repositories that commonly do not cross-reference each other using hyperlinks. The diverse attributes of enterprise content and processes make IR within the enterprise differ strongly from

Web searches. We focus on enterprise search in the wider sense; more specifically, we are primarily interested in facilitating the navigation of the collection at hand on Web sites and intranets by providing profile-based summaries of links to improve retrieval in these environments. For example, in a university Web site, a number of different potential user groups can benefit from each other's information; instead of improving the value of an individual search, knowing the total context of search behaviour helps the search engine to understand people and their online behaviour in general [Baeza-Yates and Ribeiro-Neto, 2011]. This can be done by capturing past users' interactions in order to assist new users in finding the right documents for their information needs.

2.4 Personalisation

Although many IRSs – for instance, digital library systems and Web search engines – succeed in overcoming information overload, they are far from optimal in that they normally lack user modelling and are not ready to adjust to individual users [Nunberg, 2003]. A typical IRS provides the same list of results for different users who search for the same query. Suppose we have a programmer and a tourist who both use the same word 'Java'; although each is looking for different information, the system would return the same results [Shen et al., 2005a]. In addition, it can be difficult to recognise the correct sense when a user searches for the word 'Java': sometimes the user means Java (the island), and sometimes the user means Java (the programming language); this is because the user's information needs change over time. Thus, it is important to model a user's information needs by personalising the search to enhance retrieval accuracy [Shen et al., 2005a]. The search process can be simplified by applying personalisation techniques, which reduce the time the user spends navigating through the query results and thereby increase the effective accuracy of the search engine. Nowadays, several Web search engines, such as Google and Yahoo, use lots

of personalisation techniques and integrate contextual search tools in order to adapt the results to each user's information needs [Micarelli et al., 2007]. The best known personalised search tools are Google's Alerts (http://www.google.fr/alerts) and Personalised Search [Micarelli et al., 2007]. Web personalisation aims to improve user satisfaction by inferring the users' needs according to either preceding or current interactions with the users, without requiring them to request this service explicitly [Mulvenna et al., 2000]. Once the data has been obtained, a user model is built. This user model, which is considered the main part of the personalisation process, is used to predict the user's future interests in a specific representation, and it typically reflects the structure of the domain in the presence of a domain ontology [Anand and Mobasher, 2005]. Unlike non-personalised search engines, which rely on traditional IR techniques, personalised search engines better satisfy the user by assessing the user's needs in short-term and long-term models, based on past queries, browsed documents, and user actions. We discuss user models in more detail in Section 2.5. Anand and Mobasher [2005] point out that personalisation approaches might be developed to run on the server side or on the client side; the main difference between them is the amount of data accessible to the personalisation system. On the server side, collaborative and individual approaches can be applied, and while the system will be able to collect data from all users, it only accesses users' interactions with content on a given Web site. A server-based search engine, such as Google, can keep track of a user's past queries and chosen results, utilising this information to derive user interests. Client-based search, on the other hand, can keep track of all documents viewed or edited by a user so as to obtain a superior model of the user's interests; client-based data are only available through the individual user's interactions with multiple Web sites. Studies such as Jansen et al. [2007a, 2008] concentrate on identifying the user intent behind the query to create query-specific IR. Identifying the user's intent is normally
independent of the user’s search history or communications, but it depends on query features, such as clarity and usage rate in anchor texts. Zhou and Croft [2008] suggest measuring the ranked list robustness as a marker of topic difficulty for content-based queries. The study conducted by Teevan et al. [2005] indicates that users’ opinions about the relevance of the returned search results for the same query contrast significantly and that the users are interested in obtaining personalised results rather than handling all users in the same manner. Song et al. [2010] divide the approaches for personalised search into three main approaches: (1) taxonomy-based methods that map user interests to an existing taxonomy, such as Open Directory Project (ODP); (2) content-based implicit measures that use traditional text presentation models, such as language mode and VSM, to identify user preferences; and (3) rank methods that learn retrieval functions with the aim of satisfying the majority of users rather than individuals. Ferragina and Gulli [2008] proposed an innovative approach based on Web-snippet hierarchical clustering to achieve personalisation to assist users in searching the Web through the generated system, which is called ‘Snaket’4 . This is an open-source and complete system which offers both hierarchical clustering of the snippets retrieved by the search engine and folder labelling with variable-length sentences. The main idea is that the user submits a query to the Snaket engine through its Web interface, which is used for Web search, news, blog domains and books. Next, a labelled folder hierarchy is returned and presented to the user; afterward, the user can select a group of folder labels (themes) most relevant to his or her query needs. As soon as the selection occurs, the original ranked list is personalised to the chosen themes by filtering out the snippets that do not relate to those with the chosen themes. The innovative feature here is that the Snaket engine adapts the ranked list of results dynamically to the local selections made by any user without requiring tracking of the user’s past search behaviour, an explicit login by the user, or application of user profiles. In addition, helping users 4

http://snaket.di.unipi.it/

Chapter 2. Related Work

32

to navigate through the hierarchy returned according to their search needs will reduce problems related to poorly formulated, polysemous, or insufficiently informative queries.
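As a minimal illustration of the theme-filtering idea behind systems such as Snaket (our own simplified sketch, not the actual implementation; the snippets and labels are invented), retrieved snippets are grouped under labelled themes and the ranked list is filtered once the user picks a theme:

```python
# Sketch of result-list personalisation by theme selection: keep the original
# order, drop snippets whose theme the user did not select.
results = [
    {"title": "Java programming tutorial", "theme": "programming"},
    {"title": "Flights to Java, Indonesia", "theme": "travel"},
    {"title": "Java concurrency explained", "theme": "programming"},
]

def personalise(ranked_results, chosen_themes):
    return [r for r in ranked_results if r["theme"] in chosen_themes]

print(personalise(results, {"programming"}))  # the travel result is filtered out
```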

2.5 User Profiles

In order to capture a user's or user group's interests, a model must first be built [Teevan and Dumais, 2011]. The acquisition of user knowledge and preferences to build a user profile is one of the most important problems that must be tackled in order to provide effective personalised assistance [Micarelli and Sciarrone, 2004]. Models can be built from queries that users submit to search the collection by building Query Flow Graphs (QFG) [Boldi et al., 2009; Deng et al., 2009, for example], from anchor text [Kraft and Zien, 2004], from mining term Association Rules (AR) [Fonseca et al., 2003], or by extracting term relations from documents [Kruschwitz, 2005; Sanderson and Croft, 1999, for example]. More precisely, various works, such as Tamine-Lechani et al. [2008], employ one of the following sources of evidence to construct user profiles:

• User behaviour, viewed as a collection of implicit feedback indicators [Kelly and Fu, 2007], such as past search history [Qiu and Cho, 2006; Song et al., 2010], browsed documents (previously visited Web pages) and previously issued queries [Teevan et al., 2005], browsing history [Matthijs and Radlinski, 2011], Web Usage Mining (WUM), clickthrough data [Joachims et al., 2005; Sun et al., 2005], browsing features [Shen et al., 2005a; Teevan et al., 2005], eye-tracking [Joachims et al., 2007], social tags/annotations manually provided by users [Vallet et al., 2010], users' social interactions with friends in social networks, user-generated content (e.g., a blog), snippets of results (titles and summaries), users' browsing sessions (which use collaborative filtering techniques) [Shmueli-Scheuer et al., 2010], indexed content [Teevan et al., 2010], or modified collaborative filtering of pure browsing history [Sugiyama et al., 2004].

• Bookmarks as indications to predict user preferences [Mc Gowan, 2003].

• Desktop information [Dumais et al., 2003] and contextual information sources, for example, news sources (e.g., Reuters or the New York Times), blog sites, and e-commerce sites [Budzik and Hammond, 2000] utilised for deducing user interests.

User profiles can be utilised to clarify queries by supplying information relating to the query which, at the same time, reflects the users' interests. The user profile aims to either narrow down the number of search results or increase the probability of obtaining the most interesting results [Speretta and Gauch, 2005]. The main reason to exploit user profiles is personalisation. This method can be used in different areas [Matthijs and Radlinski, 2011], such as suggesting news articles to read and interesting social networking groups to join, or personalising advertisements. It offers two types of service: personalisation services, as used in a user's search results, or targeted services, such as recommendations and advertisements. Ideally, according to Nanas et al. [2004], the user profile will have the ability to: (a) adjust to both long-term and short-term interests according to user feedback, (b) distinguish irrelevant and relevant information items, and (c) represent multiple topics of interest. The user's profile-based personalisation consists of three stages [Gauch et al., 2007]: (1) collecting user information, (2) constructing the user profile from the data, and (3) exploiting the user profile information by applying technology to provide personalised services. In personalised IR [Liu et al., 2004; Mc Gowan, 2003; Shen et al., 2005b; Sieg et al., 2004; Speretta and Gauch, 2005; Tamine-Lechani et al., 2008], retrieval is accomplished in practice by incorporating the user's profile in one of three fundamental retrieval stages, as proposed by Gauch et al. [2007]: (1) document retrieval, (2) document re-ranking, and (3) query reformulation. The last two stages have been enhanced in most of the related works by exploiting evidence extracted from the user profile. In the first phase, the user profile is exploited as a part of the retrieval process; this provides both a traditional ranking system which can be personalised and a quick response to the query. In the second phase, an external system on the client side uses the user profile to perform re-ranking: the query results are returned first;

then, the system re-ranks the top-ranked results locally using only the snippets of these results to avoid spending too much time. In the third phase, the user profile affects the representation of the submitted query by modifying or augmenting it to better represent the user's needs in the current profile; then, the retrieval process takes place. This helps reduce the vocabulary problem, and it works as quickly as non-personalised search systems even though it is less likely to affect the result list. To be able to construct profiles that represent individual users, the system should employ an accurate user identification method. Gauch et al. [2007] describe five techniques for user identification: software agents, logins, enhanced proxy servers, cookies, and sessions. The first three methodologies are more precise; however, they require the users' participation, a burden which the last two options do not impose. There are a number of common methods of structuring such models [Gauch et al., 2007], such as representing the user profile as a set of weighted keywords, as in Chen and Sycara [1998]; Chesnais et al. [1995]; Mc Gowan [2003]; Widyantoro et al. [1999]; a rich, semantic concept-based structure enhanced with the use of ontologies, as in Liu et al. [2004]; Sieg et al. [2007]; Speretta and Gauch [2005]; an instance of a predefined ontology (such as ODP), as in Daoud et al. [2009]; Yu [2004]; semantic networks, as in Asnicar and Tasso [1997]; Gentili et al. [2003]; Micarelli and Sciarrone [2004]; Stefani and Strappavara [1998]; Tan and Teo [1998]; or AR. A fuzzy user-profile model was proposed in John and Mooney [2001], in which the user's interests are represented by a set of categories for which the weights provide the knowledge about the user, indicating the degree of membership of a category. The methods for structuring profiles can aim to model individual users' interests [Teevan et al., 2010] or cohorts of users [Yan et al., 2014]. Mobasher et al. [2002] are interested in producing aggregate usage profiles (which reflect the behaviour of a group of users) and individual user profiles (which reflect a single user's behaviour). The information gathered to create the profiles can be adjusted or enlarged [Anand and Mobasher, 2005]; in this situation, the profile is dynamic. Alternatively, it can be maintained unchanged over time, in

which case, the profile is static. Dynamic profiles, which we have applied in our work, take time into account as a way to identify the user's short-term and long-term interests. Gauch et al. [2007] give a good example. Suppose a musician usually uses the Web to search for music-related information. One day, she decides to take a vacation and to search for airplane tickets, hotels, and so on. In this case, her profile contains the music information as a long-term interest and the vacation information as a short-term interest. The vacation interests should be forgotten as soon as she returns from the vacation, and the prominence of the music interests should then be resumed. It can be difficult to identify and manage short-term interests because users' interests change rapidly as they change their tasks, but the main idea is still that user profiling can be utilised to present information about the users' interests, and the time dimension can be used to enhance the quality of information access. Static user profiles have been used by most of the current personalisation systems; although user interests are dynamic, few systems use dynamic user profiles. Examples include NewsDude [Billsus and Pazzani, 2000] as well as Web sites such as Amazon (http://www.amazon.co.uk/) and Booking.com (http://www.booking.com/). The most prominent profiles are those which have the potential to best reflect users' interests as these interests change. Models can be explicit, in which users input topics of interest, or implicit, in which those interests are inferred from actions [Teevan and Dumais, 2011]. The type of data collected in explicit models may be personal or demographic information such as jobs, birthday, etc. (useful applications of such static profiles include online dating and scientific repositories like ResearchGate). However, explicit models have several drawbacks, such as the time it takes to build them and their static nature. As such, using explicit relevance feedback cannot always improve the user model, especially if a good interface is not present to help the user manage the profile [Micarelli et al., 2007]. Users are usually not able to apply these techniques effectively, and they do not have extra time to explicitly specify or refine their needs. To overcome the drawbacks of explicit models, implicit models do not require the user to participate in collecting information
during the profile construction process, which places less burden on the user. However, implicit input may lead to much noisier models, and there is no guarantee that they will properly represent a user's interests.8 The types of implicit data used to construct profiles can vary [Teevan and Dumais, 2011]; for example, the analysis of log records has been shown to be effective at approximating explicit feedback, and query log analysis has developed into a very active research area [Bernard et al., 2009; Silvestri, 2010]. It has been widely recognised that query log files represent a good source for capturing implicit user feedback. This feedback can then be exploited to build knowledge structures that can assist in interactive search – for example, to derive query substitutions [Jones et al., 2006] or to extract meaningful knowledge [Baeza-Yates and Tiberi, 2007]. Query logs have been used to build adaptive domain models based on biologically inspired algorithms, such as ACO and Nootropia, the latter of which is inspired by the immune system and has been successful in information filtering problems involving personalised news recommendation [Nanas and De Roeck, 2009]. There are obvious drawbacks when using log data. These include the fact that they can be noisy and patchy in their coverage. Both of these issues are particularly prevalent when it comes to records that represent the 'long tail'. Ways to mitigate these problems include the use of robust methods to build models, a focus on frequent queries, or 'backing off' to generic knowledge sources such as Wikipedia if necessary.9 Another approach relies on the user's activities on the search site itself, for example by implementing a GoogleWrapper which enables each individual's search activities to be collected [Speretta and Gauch, 2005]. Lv et al. [2006] utilise two types of implicit feedback information: clicked results for the same query (the immediately viewed documents) and users' query logs.

8 It is impossible to make a very strong statement about which model is best, as each type of model is appropriate for some applications.
9 We do not make use of external knowledge sources due to the problem of adapting them to domain-specific applications [Clark et al., 2012]. Let us consider the example used in this study, Figure 4.1. The query 'library' can be expanded to include 'library opening times', 'catalogue', 'moodle', and so on. This makes sense for this particular Web site, but such an extended query might not be suitable for a different Web site. For example, a user with the query 'library' on Wikipedia would have very different intent, and using the same profile would not work well. In short, we believe that neither a domain model acquired for a specific site nor generic sources will be universally applicable. The main idea is to have a methodology that is easily transferable and then to run the model acquisition process on a new Web site without any major customisation effort.

Both these general methods to personalise search can be utilised to perform results re-ranking and query expansion. The study conducted by Shen et al. [2005a] is closely related to that of Lv et al. [2006], but the differences between the two studies are that Shen et al. [2005a] utilise different algorithms to implement query expansion and to re-rank the search results. Furthermore, they utilise past queries to perform query expansion and apply it before retrieval; this does not allow the query expansion to benefit from the user's clickthrough data. In addition, their methodology does not distinguish between search results and expanded terms, among expanded terms, or among search results. Thus, the approach of Lv et al. [2006] makes the entire system more concise and lets results re-ranking and query expansion work together; their experiments demonstrate that the proposed methodology is both effective and efficient. Clark et al. [2012] provide a broad overview of methods to turn document collections and query logs into structured knowledge. As pointed out earlier, we are interested in profiles representing a population of Web site users rather than the construction of individual profiles (we are not addressing 'collaborative search', in which several people contribute to a shared task [Morris et al., 2008, for example]). To achieve this, we adopt an approach that uses an ACO analogy for building adaptive community profiles, a biologically inspired model applied to query logs. This has been shown to be effective for generating query suggestions in intranets (such as a local Web site, where suitable knowledge structures are typically not readily available) and is easy to replicate [Albakour et al., 2011]. ACO is a type of swarm intelligence technique. It has been studied extensively in the context of solving problems in domains which include scheduling [Socha et al., 2003], classification [Martens et al., 2007], and telecommunications routing [Di Caro and Dorigo, 1998]. It is inspired by the behaviour of ant colonies, such as the laying and following of pheromone trails, which can dissipate over time. Abraham et al. [2006] explain this behaviour as follows: (1) ants wander randomly and, after discovering food, come back to their colony while laying down pheromone trails; (2) these trails are then followed by other ants and reinforced if they too eventually discover food; (3) however, pheromone trails also evaporate over time, so paths

to diminishing food resources become less popular and less visited. A fundamental strength of ACO is its ability to adapt to emerging situations automatically, which in our case includes new search trends due to temporal or seasonal changes; furthermore, ACO has the ability to gradually forget less popular, outdated search routes. This approach builds up a network of query terms which reflects the accumulated search knowledge of a cohort of users and which can be continuously updated. We will focus on a broad cohort by building a single profile and not distinguishing between individual user groups accessing the Web site. It is, however, a simple step from the approach we take to building profiles that make finer-grained distinctions based on a user's login credentials, as long as that information is present in the log files (e.g., models that reflect students, administrative staff, or first-year undergraduate biology students). While the profile we use is built using logs of queries submitted to a local search engine, we assume that such a profile is also relevant for Web site navigation.
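To illustrate the reinforcement-and-evaporation mechanism just described, the following is a minimal sketch under our own simplifying assumptions (it is not the exact model of Albakour et al. [2011]; the deposit and evaporation constants and the toy query sessions are invented): each observed query refinement deposits 'pheromone' on a directed term edge, and all edges evaporate a little at every update cycle, so outdated routes fade away.

```python
# Sketch of growing an ACO-style community profile from query logs:
# reinforcement of edges that searchers keep following, evaporation everywhere.
from collections import defaultdict

profile = defaultdict(float)          # (term, related_term) -> edge weight

def update(query_sessions, deposit=1.0, evaporation=0.1):
    # evaporation: every existing edge loses a fraction of its weight
    for edge in list(profile):
        profile[edge] *= (1.0 - evaporation)
        if profile[edge] < 1e-3:
            del profile[edge]         # forget routes that are no longer used
    # reinforcement: consecutive queries in a session strengthen an edge
    for session in query_sessions:
        for q1, q2 in zip(session, session[1:]):
            profile[(q1, q2)] += deposit

update([["library", "library opening times"],
        ["library", "catalogue"]])
update([["library", "library opening times"]])
print(sorted(profile.items(), key=lambda kv: -kv[1]))  # strongest routes first
```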

2.6 Automatic Text Summarisation

Text summarisation has been an active research area for more than 50 years [Luhn, 1958], but the increasing availability of large volumes of digital information makes such techniques ever more important. In particular, modern IR engines make substantial use of automatic summarisation when displaying retrieval results, whereby the user can rapidly and precisely judge the importance of texts retrieved as the result of a query [Mochizuki and Okumura, 2000]. Ruthven et al. [2001] show that summarisation methods can assist users of conventional IRSs to filter the most essential or relevant documents from a list of returned documents. The aim is to provide an extract of one or more texts which contains the most important or salient ideas [Nenkova and McKeown, 2011]. In the 1950s, Luhn [1958] conducted the very first work on automatic summarisation

and specified rules for sentence extraction. He worked on summarising magazine articles and technical papers. He first formulated the idea on which much of the later research is based: that a few words in a document are expressive of its content and that the sentences that contain the most significant information in the document are those that contain numerous expressive words close to one another. In addition, he proposed utilising frequency of occurrence to identify the words that are expressive of the document's topic. Stated differently, the words that occur frequently in a document represent that document's essential topic. When we need to decide whether reading an entire document is worthwhile, text summaries are essential. The summarised documents could be in any form of text, such as news stories and Web pages. Writing a summary of a text is non-trivial; two essential factors should be considered: the need to concentrate on the most central ideas from the source text and the need to take the reader's past knowledge and probable specific interests into account. Many types of documents are available today in digital form, but the majority need summaries. As a result of the abundant information available in these documents, they are difficult to manually search and filter to determine which knowledge should be obtained. As such, the ability to automatically filter and extract this information is required in order to avoid drowning in it. However, the quality achieved with manual summarisation is still higher than that obtained with automatic text summarisation. Despite its disadvantages, automatic text summarisation is consistent, always accessible, and not exhausting. A good summary [Dalal and Zaveri, 2011] is characterised by soft metrics like good topic coverage, high information content, low redundancy (or high novelty), compactness (brevity), and relative coherence. Hovy and Lin [1998] divide summarisation approaches into two main groups based on their output types: abstractive summaries and extractive summaries. Here, we focus on extractive summarisation (a possible alternative for our work would be to use a term cloud view, as in Jones and Li [2008]). Producing abstractive summaries is a more challenging task, as they are written to convey the main information in the input, which requires a
deep linguistic approach to parse the entered text. Such a formal representation of the interpreted text identifies generic concepts with the aim of generating a new summary consisting of the same information as the original text in a brief, coherent form. By contrast, extractive summaries apply heuristics and varying amounts of NLP to extract the most relevant sentences and then concatenate these sentences, which are taken exactly as they appear in the original document(s). The units that are extracted differ from one summariser to another. Most summarisers use sentences rather than larger units, such as paragraphs. Most existing automatic text summarisation systems are extraction systems. These systems extract parts of source documents and produce the results as summaries [Lin and Hovy, 2003b]. The most popular systems are those that utilise sentence extraction [Edmundson, 1969; Goldstein et al., 1999; Hovy and Lin, 1998; Kupiec et al., 1995; Luhn, 1958]. Many of the systems that participated in the Document Understanding Conference (DUC) 2002, a large-scale summarisation assessment effort supported by the US government, are based on extraction techniques [Lin and Hovy, 2003b]. In some cases, the extracted units are post-edited, such as by omitting subordinate sentences or combining incomplete sentences to construct complete sentences [Jing, 2000; Jing and McKeown, 2000]. The essential benefit of sentence extraction is that a sentence is a minimally structured argumentation unit [Jorge et al., 2008]. Moreover, extracting sentences permits better control over the summary's compression level. A potential lack of coherence is the main disadvantage of sentence extraction techniques [Hassel, 2004]. One approach to tackle this issue is to extract paragraphs instead of sentences, which provides a wider context. However, if a lack of coherence occurs as a result of unresolved anaphoric references, the problem may be addressed by one of the following methods: (1) eliminating the summary sentences that contain these references [Brandow et al., 1995], (2) including the sentences immediately preceding the problematic ones [Nanba and Okumura, 2000], or (3) including the important sentences that help to resolve the anaphora if these precede the anaphoric sentences [Paice, 1990]. This choice will degrade the quality of the

summary, however, because it does not respect the restrictions on size. There is another way to address the coherence problem, which is to eliminate sentences that begin with connecting words such as 'however', 'thus', and so on, when the preceding sentence is not included in the summary. Steinberger et al. [2007] use anaphoric information in order to check the coherence of the summary produced by the summariser. There is no fixed length for a produced summary; its length depends on the purpose of the summarisation process. The summarisation tracks organised by DUC and later by the Text Analysis Conference (TAC) provided task descriptions that specified summary lengths. Based on this, the length is supposed to be within a particular cut-off of words (e.g., between 240 and 250 words, inclusive, with white-space-delimited tokens). Summaries under this size limit were penalised by decreasing their scores during the evaluation process, and summaries above the size limit were truncated. The compression ratio (the summary's length compared with the original's) varies for different summarisers; summaries can be short (100 words) [Angheluta et al., 2004] or very short (10 words) [Douzidia and Lapalme, 2004]. In fact, a short summary of around 10 words (e.g., a headline) has replaced the single-document, 100-word summary task from DUC 2003 [Lin and Hovy, 2003b] as the standard. In our work, we want the summary to be concise and to present the most interesting topics of HTML pages to users. In Section 5.5, we present the method we followed to identify the summary length in our work. Evaluation is one of the most important issues that influences the future of text summarisation; most of the time, human judgements differ about what constitutes a 'good' summary, making the automatic evaluation process particularly difficult [Lin and Hovy, 2002a]. The text summarisation field faces other difficulties. Utilising abstraction to create the summary is difficult, as it requires deep analysis and an understanding of the original text even when employing machine learning, and the results do not match what can be achieved with man-made summaries. Another difficulty is that generating summaries

for shorter documents is not practical, as there is a lack of material. We will discuss these evaluation issues in detail in Section 2.7.

2.6.1 Application Areas

There are a number of applications for text summarisation [Hassel, 2004; Nenkova and McKeown, 2011]. Many systems have used text summarisation to improve document search and browsing. Jing [2001] presents some examples as follows:

• Over the Internet: browsing and retrieving relevant documents in an effective way has become extremely important to Internet users, especially given the massive amount of information on the Internet. Summaries help users avoid wasting time accessing and browsing documents that are not interesting to them, and in this manner can benefit users who want to rapidly judge the importance of documents.

• Web browsing: Web site summaries could change our browsing habits and enable us to filter irrelevant Web pages.

• Digital libraries: the effort of storing and indexing the huge number of accessible documents which are found in electronic form has led to the construction of digital libraries. Automatic techniques are required to generate summaries for documents in digital libraries with the aim of efficiently organising and retrieving them, as the quantity of documents to be incorporated in digital libraries is enormous, and manually creating summaries for all such documents is impossible.

• Hand-held devices: document condensation is particularly important because of the limited display space on hand-held devices such as personal digital assistants.

Numerous uses of summarisation take place in everyday activities [Jurafsky and Martin,

2009; Mani and Maybury, 1999]. Summarisation is used to create document outlines, scientific article abstracts, news article headlines, snippets summarising a Web page on a search engine results page, summaries of a business meeting or email thread, answers to complicated questions (created by summarising more than one document), minutes, biographies, abbreviations, movie summaries, and so on. Therefore, a summarisation system usually falls into more than one of the fundamental classifications above and thus should be assessed along more than one dimension, utilising various measures.

2.6.2 Characteristics of Summaries

Text summarisation commonly branches out in different dimensions. Summarisation systems fall into a variety of groups. Hassel [2004] distinguishes the following:

Characteristics of the source text(s) (Input):
• Source: single-document vs. multi-document
• Language: monolingual vs. multilingual
• Genre: news vs. technical paper
• Specificity: domain-specific vs. general
• Length: short (1-2 page docs) vs. long (> 50 page docs)
• Media: text, graphics, audio, video, multi-media

Characteristics of the summary usage (Purpose):
• Use: generic vs. query-oriented (informative vs. indicative)
• Purpose: what is the summary used for (e.g., alert, preview, inform, digest, provide biographical information)?
• Audience: untargeted vs. targeted (slanted)

Characteristics of the summary as a text (Output):
• Derivation: extract vs. abstract
• Format: running text, tables, geographical displays, timelines, charts, etc.
• Partiality: neutral vs. evaluative

2.6.3 Approaches to Summarisation

Summaries (in the form of snippets) produced by Web search engines have a (relatively) long history [Wolf et al., 2004]. Sometimes the user's search terms found in these summaries are highlighted by search engines; this method is sometimes called Term Hits In Context (THIC). However, some enterprise sites have utilised human-authored document summaries [Wolf et al., 2004]. This type of summary is typically included as metadata in the HTML encoding of a document, such as in the document's description meta tag, and search engines have used this metadata as document summaries. Other search engines have simply used the initial 'n' characters or words of a document's main text as the document's summary, and still others use a hybrid approach that combines several of these techniques; for instance, a document's summary may be extracted from the description metadata field, but if that does not exist in the source document, the first 255 characters of the page are used instead. Google utilises the title of the page, the content attribute of the description meta tag, and a thumbnail image for the snippet description (see the Google developer documentation, https://developers.google.com/+/web/snippet/, and the Google SERP Snippet Optimisation Tool, http://www.seomofo.com/snippet-optimizer.html). Examples of text summarisation implementations vary; they include SweSum [Dalianis et al., 2003], EstSum [Müürisep and Mutso, 2005], and SUMMARIST [Hovy and Lin, 1998]. Two analysis approaches have been utilised with the purpose of decreasing the size of the original text in text summarisation: the linguistic and the statistical approach [McCargar, 2004]. The linguistic approach applies abstraction to create the summary.
This approach also employs Artificial Intelligence (AI) concepts for various levels of linguistic analysis in NLP [Jurafsky and Martin, 2009] and implements knowledge representation and reasoning approaches which make the summary easier to create [Brachman and Levesque, 2004]. The generated summary corresponds to a man-made one and includes generalised terms to cover the knowledge in the information source, so it is brief, knowledge-rich, accurate, and coherent. Unlike the linguistic approach, the statistical approach [McCargar, 2004] utilises extraction techniques to obtain the most important sentences directly from the source text. Statistical approaches have been used by many summarisation systems in order to extract relevant sentences [Berger and Mittal, 2000a; Galanis and Malakasiotis, 2008; Knight and Marcu, 2000]. Some of these statistical techniques are Katz's K-mixture Model [Katz, 1996], the VSM [Salton et al., 1975], and Expectation Maximisation [Knight and Marcu, 2000]. The VSM, an algebraic model for representing text documents in the form of vectors of identifiers such as index terms [Salton et al., 1975], uses Term Frequency.Inverse Document Frequency (TF.IDF) weighting. TF.IDF [Salton and McGill, 1983], which we have applied in our work, is a basic numerical measurement which shows how significant a word is to a document in a collection or corpus. It is regularly utilised as a weighting element in IR and text mining, and it is employed as a weighting scheme in numerous summarisation tasks [Mochizuki and Okumura, 2000; Wolf et al., 2004]. It consists of two elements, the term frequency tf and the inverse document frequency idf, and it is amongst the most well-known models for computing term weights [Salton and McGill, 1983]. Gotti et al. [2007] introduced GOFAISUM as a topic answering and summarisation system. They worked on a set of news articles identified with a certain topic to create 250-word-or-less summaries, and they utilised TF.IDF scores to discover sentences related to the topic. Other techniques for automatic text summarisation have been developed by researchers, and these can be broadly classified into four groups based on Dalal and Zaveri [2011]: (A) heuristic/statistical techniques, (B) semantics-based techniques, (C) query-oriented

techniques, and (D) cluster-based techniques. The most generally utilised heuristics for text summarisation, as described in Hahn and Mani [2000] and Dalal and Zaveri [2011], are lexical cues, keywords, thematic words, cue words, title words, and location/position. In Hovy and Lin [1998] and Luhn [1958], sentences' significance factors were calculated according to the presence of high-frequency words or keywords, and the sentences with higher significance factors were used to create an auto-abstract. In Baxendale [1958] and Díaz and Gervás [2007], positional heuristics were used, including the first and last sentence of each paragraph in the summary. These heuristics consider the sentence's position in the document, in that the first five sentences of the text are assigned the highest value and the rest are assigned the value 0. Edmundson [1969] suggests further heuristics such as cue words and title words. The presence of cue words, such as 'important', 'significant', or 'hardly', in a sentence affects its importance. The cue-word approach involves using a pre-stored cue dictionary that consists of bonus words (which are positively relevant), stigma words (which are negatively relevant) and null words (which are irrelevant) [Edmundson, 1969; Hovy and Lin, 1998]. Title words are found in the title of a document (not including empty words and stop words) and are highly relevant to the subject of the text. In the title method, the sentences that include title words are selected for the summary [Edmundson, 1969; Teufel and Moens, 1997]. By contrast, the thematic words method [Díaz and Gervás, 2007] considers the eight most frequently occurring non-empty words (those not on the stop word list), which are extracted using the TF.IDF approach [Salton and McGill, 1983]; the sentences which contain most of these thematic words are assigned high values and selected as summary sentences. A serious improvement to the location heuristic was the Optimal Position Policy (OPP) method [Lin and Hovy, 1997], which can be utilised to learn the location of important sentences by identifying the genre-specific regularities of a class of text documents. The main advantage of the heuristic approach is that heuristics (which serve to evaluate the significance of extracted units by giving each unit a score) are exceptionally valuable in recognising essential sentences for extracts; they are also easy to implement and

ensure good overall coverage. Combinations of heuristics are regularly employed to score sentences [Edmundson, 1969; Lin, 1999; Uddin, 2007]. The main disadvantage is that heuristics do not consider semantics; for this reason, heuristic-generated summaries sometimes lack cohesion. Besides heuristics, semantics-based approaches like lexical chains [Barzilay et al., 1997; Chen et al., 2005] and rhetorical parsing [Marcu, 1998] can be utilised to create generic summaries. Lexical chains are perhaps the most interesting of these methods in that they involve two primary stages to create a summary: constructing a source representation for the original text and forming a summary representation from it. Another approach is query-oriented summarisation [Carbonell and Goldstein, 1998; ElHaj and Hammo, 2008], which is used to generate user-preference summaries for applications such as question-answering systems and browsing on search engines. Query-oriented methods enable different summaries to be generated from the same input text. By using query-oriented summarisation, the search engines' document rankings can be improved and customised as per user requirements [Agichtein et al., 2006; Carbonell and Goldstein, 1998]. Cluster-based evaluation and summarisation methodologies are relatively new [Pei-ying and Cun-he, 2009; Wang et al., 2009]. In cluster-based summarisation, clusters of sentences are formed on the basis of different sentence similarity measures [Pei-ying and Cun-he, 2009]. The number of clusters is generally equal to the number of topics covered in the text. In the cluster-based technique, to guarantee good topic coverage, representative sentences are extracted from every cluster. However, successful implementation of cluster-based summarisation relies on the clustering algorithm. The knowledge-based approach is considered an alternative to the statistical approach. This methodology is regularly adopted to summarise a text in a particular area. Rich knowledge about the domain is required to understand the text and choose what must be incorporated in the summaries. This approach acquires domain knowledge using two

routes: manual encoding and automatic training. For instance, Paice and Jones [1993] used stylistic clues and constructs to recognise important concepts in very structured technical papers. On the other hand, McKeown and Radev [1995]; Radev and McKeown [1998] used the results of information extraction systems as inputs to create fluent summaries for a cluster of documents utilising natural language generation methodologies. Domain knowledge has been effectively exploited in the knowledge-based approach, which has been adapted to the special requirements of the application. The knowledge-based approach has numerous weaknesses: it is unscalable and knowledge-intensive, it is expensive to port to another domain, it may require huge training corpora, and it only summarises predefined interests. Our approach can be seen as a knowledge-based approach which collects users' interests in a specific domain and then extracts summaries which reflect those interests to serve a defined audience (in our work, university students). Other methods apply more complicated techniques to decide which sentences to extract. These techniques include machine learning [Leite and Rino, 2008, for example] to specify important features and various NLP techniques to determine the most important passages and relationships between words. In addition, Bayesian classifiers have been utilised [Kupiec et al., 1995]. The work of Fung and Ngai [2006] used Hidden Markov Models (HMMs) to model the fact that the probability of a sentence being included depends on whether the previous sentence has also been included. Other methods of programmatically generating summaries of documents are discussed by Hassel [2004] and Ježek and Steinberger [2008].
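As a concrete illustration of how the extraction heuristics surveyed above can be combined, the following is a minimal sketch (the weights, stop-word list, and toy document are our own assumptions, not a reproduction of any cited system) that scores sentences by word frequency, position, and title-word overlap and extracts the top-scoring ones:

```python
# Sketch of heuristic extractive summarisation: combine frequency (thematic
# words), position, and title-word overlap into one sentence score.
from collections import Counter
import re

STOP = {"the", "a", "of", "and", "to", "in", "is", "are", "for"}

def tokens(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def summarise(title, sentences, n=1, w_freq=0.5, w_pos=0.3, w_title=0.2):
    freq = Counter(t for s in sentences for t in tokens(s))       # thematic words
    title_words = set(tokens(title))
    scored = []
    for i, s in enumerate(sentences):
        toks = tokens(s) or ["_"]
        f = sum(freq[t] for t in toks) / len(toks)                # frequency heuristic
        p = 1.0 / (i + 1)                                         # position heuristic
        t = len(title_words & set(toks)) / (len(title_words) or 1)  # title heuristic
        scored.append((w_freq * f + w_pos * p + w_title * t, i, s))
    top = sorted(scored, reverse=True)[:n]
    return [s for _, i, s in sorted(top, key=lambda x: x[1])]     # keep original order

title = "Library services"
sents = ["The library provides borrowing and study spaces.",
         "Opening times vary during vacations.",
         "Coffee is available nearby."]
print(summarise(title, sents))
```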

2.6.4

Generic Summaries versus Personalised Summaries

The technique used in the summarisation process indicates the kind of information that a summary could contain. A summary may highlight the basic idea, in which case it is called generic or traditional summarisation, or it may highlight the specific user’s individual area of interest, in which case it is known as personalised summarisation.


Summaries have also been distinguished by their content [Nenkova and McKeown, 2011], such as an indicative summary or an informative summary. The former enables the reader to identify the document's topics while the latter elaborates on some of these topics based on the reader's interest [Saggion and Lapalme, 2002]. A large part of the work to date has been in the context of generic summarisation [Nenkova and McKeown, 2011]. This type of summarisation makes few assumptions about the audience or the summary's aim. The audience is typically considered to be a general one (i.e., anyone may read the summary). Moreover, no considerations are made regarding the genre or domain of the summarised materials. In this setting, the significance of information is specified only with respect to the input content. The summary will help the reader rapidly determine what the document is about, possibly allowing him or her to avoid reading the document itself. However, in generic summaries, the impact of the reader's interest has seldom been considered. Traditional summarisation generates the same summary for different users by utilising the same methodology regardless of who is reading. As most of the current summarisation systems treat their outputs as static and plain texts, these traditional summarisation methods fail to capture user interests during summarisation. Users need personalisation because they have individual preferences with regard to a particular source document collection; in other words, each user has different perspectives on the same text. Thus, traditional summarisation methods are to some extent insufficient because, obviously, a universal summary for all users will not always be satisfactory [Yan et al., 2011]. As such, personalised text summarisation could help to present different summaries that better correspond to reader interests and preferences [Zhang et al., 2003]. The potential of personalised summarisation over generic summaries has already been demonstrated [Díaz and Gervás, 2007, for example]. According to Díaz and Gervás [2007], experiments have shown that personalised summarisation is important, as summary sentences can match the user's interests, whereas generic summarisation may be of little use in deciding whether a document is relevant to the user's needs. We are interested in personalised summarisation but at the cohort level.

Examples of personalised summaries vary; they include the aspect-based method [Berkovsky et al., 2008], the non-negative matrix factorisation method [Park, 2008], and the annotation-based method [Zhang et al., 2003]. In addition, summaries that are tailored to the user's query (query-biased summaries) can prove more effective than other representations of a document [Tombros and Sanderson, 1998]. In query-focussed summarisation (which is also called focussed summarisation, topic-based summarisation, and user-focussed summarisation), the goal is to summarise only the information relevant to a specific user query [Nenkova and McKeown, 2011]. Such summaries rely on the specification of a user's information need [Mani and Maybury, 1999], such as an area of interest, topic, or query. Producing snippets for search engines is a particularly useful query-focussed application [Turpin et al., 2007; Varadarajan and Hristidis, 2006]. Wang et al. [2007] focus on query-biased summarisation for Web pages based on extracting and ranking sentences. Stokes et al. [2007], on the other hand, propose an extractive summarisation approach that uses query expansion to choose sentences to generate a 250-word summary. White et al. [2001] present a query-biased summarisation interface named WebDocSum for Web searching. The summarisation system has been specifically developed to act as a component in existing Web search interfaces. An attempt is also made to incorporate Web page media, such as tables and images, into the summary if a document contains insufficient text. The summaries allow the user to more effectively assess the content of Web pages. Their experimental results show that the system appears to be more useful and effective in helping users gauge document relevance than the traditional ranked titles/snippets approach. However, such scenarios may not be sufficient in that they depend only on the currently submitted query, which might not contain information that accurately describes the user's interests in the generated summaries. The aim of our approach is to implicitly learn a profile which captures users' interests, to keep it up to date, and to generate cohort-based summaries that reflect those interests in a wider, more practical scenario.


2.6.5


Single-Document Summarisation versus Multi-Document Summarisation

A summary can be generated for multiple related source documents – for example, a cluster of news stories on the same topic – which is known as Multi-Document Summarisation (MDS) [Mani and Maybury, 1999; Rosner and Camilleri, 2008]; in Single-Document Summarisation (SDS), on the other hand, summaries are generated for individual documents. We are interested in exploring both types. Researchers have attempted both SDS and MDS [Chali et al., 2009; Liu and Zhao, 2009; McKeown and Radev, 1995; McKeown et al., 1999] in the past. Efforts have also been made to incorporate some form of NLP to generate abstracts [Hovy and Lin, 1998; McKeown et al., 1999], and sophisticated commercial abstract-generating tools are available, e.g., those employed by the Joint Research Center (JRC) [Kabadjov et al., 2011]. Radev et al. [2002] use abstraction and extraction methods to apply the text summarisation process to a single document or to multiple documents. As mentioned earlier, most of the work on SDS and MDS is done using extractive summarisation. Early work in summarisation dealt with SDS [Nenkova and McKeown, 2011], in which systems produced a summary of one document, whether a news story, scientific article, broadcast show, or lecture. Single-document summarisation is thus used in situations in which the final goal is to characterise the content of a single document, such as producing a headline or an outline [Jurafsky and Martin, 2009]. SDS typically orders the extracted sentences as they appeared in the original document [Jurafsky and Martin, 2009]. There is no standard length for the generated summary, as it varies based on the implementation guidelines.

MDS has drawn attention in recent years, as it can be used to summarise an entire corpus, enabling the user to obtain an overview [Yeloglu et al., 2011]; it is particularly appropriate for Web-based applications [Jurafsky and Martin, 2009]. Wang and Zhou [2012] address the two major issues in MDS: which information should be included in the summary and how to eliminate redundancy in the final summary. Starting with a cluster of documents to summarise, Jurafsky and Martin [2009] describe three stages for generating an extractive multi-document summary of the cluster. Redundancy elimination is one of the main distinguishing factors between SDS and MDS (apart from the input sources): selecting sentences from a set of related articles can result in overlapping information, which must be eliminated. There are various approaches to extractive MDS. For example, common topics can be identified through clustering, followed by selecting one sentence to represent each (sub-)cluster [McKeown et al., 1999]. Alternatively, a composite sentence using information from each cluster can make up the summary [Barzilay et al., 1999]. Another approach is first to create single-document summaries; these summaries can then be grouped into clusters, and representative passages from the clusters extracted [Stein et al., 2000]. Other approaches are described in Armano et al. [2012]; Lin and Hovy [2002b]; Radev et al. [2004]; Stokes et al. [2007]. Fukumoto et al. [2010] apply a clustering technique in which they classify the extracted sentences into groups of semantically related sentences with the aim of eliminating redundancy; their summarisation approach focuses on detecting key sentences that contain crucial information from related documents. Newman et al. [2004] describe an experiment to determine the quality of different similarity metrics with regard to redundancy elimination. The three metrics are WordNet distance (using a semantic lexicon for the English language), Cosine Similarity (which measures the cosine of the angle between two term vectors), and latent semantic indexing/analysis [Deerwester et al., 1990]. Another method to avoid redundancy is Maximal Marginal Relevance (MMR) [Carbonell and Goldstein, 1998; Goldstein et al., 2000], in which a clustering algorithm is applied to all the sentences in the documents to be summarised, generating a set of clusters of related sentences from which a single (centroid) sentence is chosen for each cluster to enter into the summary.
This method selects only one (optimally long) version of each original sentence. In recent years, there has been growing interest in topic-focussed MDS, as in Wan [2009], and there are many approaches to it [Daumé III and Marcu, 2006; Wenjie et al., 2008, for example]. This approach adapts traditional summarisation methods so that the information conveyed in the summary is biased towards the given topic. Graph-based methods have also been exploited for topic-focussed MDS and have drawn more and more attention in recent years [Wan et al., 2007; Wei et al., 2008]. Generally speaking, the extracted sentences in the summary should be representative or salient, capturing the important content related to the queries with minimal redundancy [Shen and Li, 2011]. The two fundamental questions that need to be addressed in MDS are how to solve the redundancy problem without eliminating crucial sentences and how to determine the order in which the extracted sentences should appear. The extracted sentences can be ordered, for example, by the chronological order of the events they describe, by their position in the documents, or according to a similarity measure (with the most similar sentences appearing first) [Barzilay et al., 2001]. Summarisation techniques have also been used in IR to summarise search results. Query-biased summarisation, when applied to Web search, was found to be useful and effective, and users preferred it over a standard search engine presentation format [White et al., 2002b, 2003]. Summaries in a search context tend to be extractive, but this is not always the case [Berger and Mittal, 2000b, for example]. Our approach can be seen as topic-focussed summarisation in which the topic is a representation of a profile. Summarisation of Web documents is typically based on the query and an individual profile (if any), rather than a full profile [Park, 2008; Wang et al., 2007, for example]. In our case, we are interested in using the profile of a cohort of users, rather than of individual users.


2.7


Evaluation

IRSs use a variety of measures to assess the quality of the returned results, reflecting the degree to which the retrieved results satisfy the user's information need [Jurafsky and Martin, 2009]. Two of the most popular measures, which are used by a large number of retrieval systems, are precision and recall. Recall estimates the system's ability to return the results that are relevant to the user's query; on its own, however, it is insufficient to characterise the quality of a retrieval system, since a system can achieve 100% recall simply by retrieving every document in the database. It therefore needs to be complemented by a measure of how many of the returned results are actually relevant to the user's need. Generally, in CIR we cannot define a single measure of success; we might measure success by, for example, convenience, accuracy of retrieval results, end-user satisfaction, increased reliability, or trust that the information is helpful and suitable [Ruthven, 2011]. The most widely recognised IR assessment tool is the test collection, described in detail in Sanderson [2010]. Depending on what we are testing, test collections can also be used in CIR evaluation; however, test collections frequently support only a restricted range of contextual factors. Evaluation of personalisation systems is still considered a challenging process due to the difficulty of determining the main factors that affect user satisfaction with a system that is expected to accurately predict and fulfil the user's needs. Thus, personalisation systems are usually evaluated based on the accuracy of the algorithms they use. A number of key dimensions have been used to evaluate the quality of personalisation systems in Anand and Mobasher [2005], including accuracy, user satisfaction, utility, coverage, performance, explainability, scalability, and robustness.
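For reference, the two measures are standardly defined over the set of retrieved documents and the set of relevant documents for a query:

Precision = |relevant ∩ retrieved| / |retrieved|    Recall = |relevant ∩ retrieved| / |relevant|

so a system that simply returns every document in the collection achieves perfect recall but, in general, very low precision.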


2.7.1


Evaluating Automatic Summarisation

It has been shown that evaluating the consistency and quality of a generated summary is difficult [Fiszman et al., 2009, for example]. The primary issue is that there is no obvious notion of what makes a good summary. Two classes of metrics have been employed: form and content. Form metrics are typically measured on a point scale [Brandow et al., 1995] and concentrate in particular on grammaticality, general text coherence, and organisation. Content metrics, on the other hand, are harder to measure. Usually, the output of the system is compared to one or more human-generated ideal summaries, either sentence-by-sentence or unit-by-unit. As with the assessment of IR [Croft et al., 2009], both 'precision' – the proportion of the information presented in the system's summary that is actually important – and 'recall' – the proportion of the important information that is retained in the summary – can be evaluated. Evaluation approaches utilised in automatic summarisation can be classified (Goldstein et al. [1999]; Jones and Galliers [1995]; Mani et al. [1999]) as (1) indirect or extrinsic evaluation (also called the task-based approach), which measures a system's performance in a given task, or (2) direct (or intrinsic) evaluation, which measures a system's quality. Many evaluations of summarisation systems are intrinsic: the quality of summaries is judged by direct human judgement of, for example, informativeness, fluency, and coverage, or by a comparison with an 'ideal' summary [Jing et al., 1998]. Methodologies for building ideal summaries vary. They could be generated by professional abstractors, or multiple human subjects could provide a number of summaries which are then merged using methods such as majority opinion, union, or intersection. Automatic summaries could then be compared with the ideal summary [Jing et al., 1998]. Comparison against an ideal summary could be done manually or automatically using intrinsic approaches, but doing it automatically is desirable, as it further reduces the need for human involvement [Nenkova and McKeown, 2011]. One intrinsic approach to assessing the relationship between the ideal summary and the automatic summary is to apply precision and recall metrics (Salton and McGill [1983]): the number of sentences appearing in both the ideal and the automatic summary is divided by the number of sentences in the automatic summary (precision) or by the number of sentences in the ideal summary (recall). The main problems with these measures are that they are not capable of distinguishing between many possible, equally good summaries and that summaries which differ considerably in content may receive very similar scores [Hassel, 2004]. In addition, Hassel [2004]; Nenkova and McKeown [2011] present other main intrinsic approaches used in summarisation evaluation: relative utility, DUC manual evaluation, automatic evaluation with ROUGE, and the pyramid method (which is also a manual approach). Manual evaluation requires much time and effort, and it is also expensive; manually assessing the large number of summaries produced in an evaluation such as DUC [Over and Yen, 2004] has been estimated to take over 3,000 hours of human effort [Lin, 2004], which is not sustainable as a routine procedure. Therefore, competitions such as DUC and, more recently, TAC have made broad use of automatic evaluation metrics. Moreover, evaluating a generated summary is difficult even when it is compared with a human-made ideal summary, since this still relies on human judges, and during the evaluation process judges must be aware of aspects that go beyond the readability of the summary, including to what degree the summary satisfies the user's needs and how relevant the obtained information is. Automatic evaluation metrics, such as those of the ROUGE family [Lin, 2004], have been demonstrated to not always correlate well with human evaluations of content matching in text summarisation. Pyramid evaluation has emerged as a standard approach for manual evaluation of summaries [Nenkova and Passonneau, 2004]. Jing et al. [1998] study two methods for the evaluation of summarisation systems: an evaluation of how well summaries assist a user in accomplishing a task such as IR, and an evaluation of the created summaries compared to an 'ideal' summary. The Mechanical Turk (MTurk) system has been utilised for a diversity of labelling and annotation tasks [Snow et al., 2008]; however, such crowd-sourcing
has not been tried for summarisation. However, recently, a few extrinsic methods (e.g., task-based evaluation) have been presented. Extrinsic evaluations include question-answering and comprehension tasks and tasks that measure the impact of summarisation on determining a document's relevance to a topic [Wolf et al., 2004]. While evaluation forums such as DUC and TAC enable experimental setups through comparison to an ideal summary, a definitive objective in the development of a summarisation system is to assist the end user in performing a task better [Nenkova and McKeown, 2011]. A task-based evaluation scheme has been adopted as a new approach to evaluating summaries by measuring a system's qualitative features [Jing et al., 1998; Mani et al., 1999; Mochizuki and Okumura, 2000; Tombros and Sanderson, 1998]. Task-based summarisation evaluation methods [Wolf et al., 2004, for example] do not analyse the sentences in the summary; they try to measure to what degree the produced summaries are beneficial for the completion of a particular task, such as IR or text categorisation. A task represents the goal or purpose of the search [Kelly, 2009]. Kellar et al. [2006] demonstrate task categorisation in order to investigate information-seeking behaviour on the Web. Tasks were categorised as follows: Fact-Finding – searching for specific facts, files, or pieces of information; Information-Gathering – collecting information, typically from different sources, to make a decision, write a report, complete a project, etc.; Just Browsing – viewing Web pages with no specific goal in mind, usually for entertainment; and Conducting Transactions – engaging in online actions such as email or banking. Fact-Finding covers the type of tasks we are interested in for use in task-based evaluation. Byström [2002], Byström and Hansen [2005], and Byström and Järvelin [1995] examine task complexity and how tasks can be defined, measured, and studied. In particular, Byström and Hansen [2005] identify and distinguish between three different types of tasks: work tasks, information-seeking tasks, and information-retrieval tasks. Orasan [2006] provides two evaluation approaches that differ in the participation of humans in the evaluation process: off-line (or automatic) and on-line evaluation methods.
The former does not need any human involvement. Automatic evaluation approaches are favoured because they are not directly affected by human subjectivity, which means that they can be repeated by different researchers; a further advantage of automatic evaluations is their speed. In on-line evaluation, by contrast, human judges read each answer produced by the system and determine whether it is correct in light of particular criteria. As opposed to off-line evaluations, this sort of evaluation is time-consuming and can be much more easily affected by human subjectivity. Tucker [2000] proposed a more detailed classification of evaluation methodologies. In particular, he identifies four main approaches for automatic summarisation evaluation (see Figure 2.2):

• Direct evaluation utilises guidelines in order to help humans directly evaluate summaries.

• Target-based evaluation compares an ideal summary to the system's output.

• Manual task-based evaluation requires people to perform tasks on the basis of information from the automatic summaries.

• Automatic task-based evaluation automatically assesses the efficiency of systems that use computer-generated summaries rather than full texts or summaries produced by humans.

[Figure 2.2 shows a tree of summary evaluation methods: intrinsic methods comprise direct evaluation (on-line) and target-based evaluation (off-line); extrinsic methods comprise manual task-based evaluation (on-line) and automatic task-based evaluation (off-line).]

Figure 2.2: Classification of Different Evaluation Methods Used in Automatic Summarisation.


The advantages of this classification are that it is more refined than the ones typically utilised in automatic summarisation and that it covers many different aspects which have to be considered when a method is evaluated. Many evaluation techniques have been proposed, yet there is no accord as to which is best. The reality of the situation is that there is no single best strategy for summarisation evaluation. Every strategy has its advantages and limitations, and we need to make sensible decisions as to which strategy to utilise, taking into account the characteristics of each summarisation system. In our work, we evaluate the quality of summaries using human ratings and (implicitly) as part of task-based evaluations.

2.7.2

Evaluating Interactive Information Retrieval

IR evaluation studies abstract users out of the evaluation model, but in Interactive Information Retrieval (IIR), users’ behaviours and experiences are studied in addition to their interactions with systems and information [Kelly, 2009]. Much work has contributed to the development of IIR evaluation and experimentation [Kelly, 2009]. Borlund [2000, 2003] study simulated work tasks which go beyond simple topic-based descriptions of needs by giving more contextual information, which is customised to target users in the form of a short story that describes the problem to be solved and the information needed. These tasks are assigned to subjects with the aims of providing better control over the search situation and generating conditions that allow for comparison. We are interested in simulated work tasks. These are not simulations but tasks which are as close as possible to a realistic IR process. This is different from simulated search evaluations (or Wizard of Oz studies), which do not recruit and study real users, as in Lin and Smucker [2008]; White et al. [2005]. In Wizard of Oz studies, one or more researchers are ‘behind the curtain’ making the system work while users think they are interacting with a real system. One of the advantages of simulated users is that IIR evaluations can be conducted more rapidly, with less cost, and with more users. Another type of information need that is utilised in IIR evaluations is natural information needs. These
tasks are typically conducted by users daily, but with this type of task it can be difficult to generalise and compare findings across users. Multi-tasking and multiple search episodes are another type of task; this type is concerned with creating sets of tasks that can be carried out simultaneously [Lin and Belkin, 2000; Spink et al., 2002]. The TREC Interactive Track has made significant contributions to the development of an IIR evaluation framework and has experimented with different types of tasks to evaluate IIRSs [Dumais and Belkin, 2005; Lagergren and Over, 1998], such as an ad hoc search task and a routing task. In the former, subjects were asked to find and save many relevant documents for a number of topics and also to create a best query for each topic; in the latter, participating groups recruited subjects who were tasked with creating an optimal routing query.

2.8

Key Differentiators

We have found that the greater part of the work conducted in IR focuses on the Web, with little work done on specific Web sites or intranets, which is critical because these systems do not behave quite the same as the Web. Furthermore, the majority of these studies are static and do not incorporate changes that may happen. We have also found that there has been much prior work on assisting users in finding information in a hyperlinked environment. Our work is novel in targeting a cohort of users and in combining summarisation techniques and profiling (from query logs) to assist a user from the cohort in finding information when navigating a Web site. While summarisation has been used for searching, there is very little work on assisting a user in navigating and browsing without altering the actual content of the Web site. We identify a single Web site on which to conduct our studies. We argue that the site chosen is a good example of the type of site for which these methods can be applied. We hope that our work will serve as a benchmark for future studies.

3 A Framework for Profile-Based Summarisation

This chapter describes the general abstract architecture in Section 3.1 and outlines how we apply the acquired model to construct single-document and multi-document profile-based summaries in Section 3.2. We describe in more detail how we integrate the profile into the summarisation process to extract summary sentences which reflect the users' interests.



3.1


General Architecture

Figure 3.1 presents a framework for profile-based summarisation in Web site navigation. It is suitable for both single-document and multi-document summarisation approaches. The architecture displayed here is an abstract illustration of many summarisation frameworks; it ignores the language, approaches, tools, and techniques utilised. In this research, we are most interested in the summarisation process; however, the structure differs according to the aim and the techniques applied.

[Figure 3.1 depicts the components of the architecture: the Web site user, browsing activities, the selected document, the cohort-based profile and a matching component, together with a search server and search results, organised into two phases.]

Figure 3.1: Profile-Based Summarisation General Architecture.

To provide an abstract explanation, suppose we have a user browsing the University of Essex Web site to find information about a timetable (see Figure 1.1). The user starts to interact with the Web site's interface and then navigates the site through its links, moving from one page to another in an attempt to find his or her desired information. When the user hovers the mouse over any link, the selected (i.e., linked-to) document undergoes a process in which it is split into sentences. The sentences then enter a sentence-matching process, where they are matched against other information – in our case, the user profile. Sentence selection depends on ranking the sentences after this matching process: the top-ranked sentences are selected, ordered, and combined to create a single summary which represents a shorter version of the selected document. Finally, the generated summary is displayed in a tool-tip window which appears as soon as the user hovers his or her mouse over a link on the site, as illustrated in Figure 1.2.

3.2

Profile-Based Summarisation

To generate a profile-based summary of a document (or a set of documents) we can, in the simplest case: (1) locate a suitable term as a node in the domain model (e.g. "timetable", see Figure 1.1); (2) extract all directly connected nodes to generate a BOW that is related in the model;1 (3) generate a summary by extracting sentences that are related to these terms. The hope is that a summary generated in this way will be helpful in a search-and-navigation task. We propose using a form of query-based summarisation in our experiments. In a search context this is quite straightforward: we can employ the user's query as our starting point in the model. In a navigation context, however, there is no query.2 In this case, we assume instead that the title of a hyperlinked document can play the role of a "query" and provide a suitable starting point in the domain model. This seems justified in a Web site context that is virtually spam-free and where content policies are in place; we are aware that such assumptions do not always hold [Hawking and Zobel, 2007, for example]. We then use those terms in the domain model that are related to the title in the "query-based" summarisation process. This approach is adopted for both SDS and MDS (as described in the following subsections).

If the resulting summary is empty, no summary will be displayed. An empty summary may be due to the nature of the document: for example, there may be no text in the document, or the type of the document might not be supported. In the case of profile-based summarisation, there might also be no matching concept in the domain model. Not displaying a summary appears to be a sensible option if no matching concepts can be extracted from the domain model.

1 Intuitively, at least, these terms are then related in a way that appears appropriate for a search task.
2 Although there might be a relevant query if the user is interacting with a search engine while navigating.

3.2.1

Profile-Based Single-Document Summarisation

Algorithm 3.1 illustrates generic pseudo-code of the profile-based SDS process.3 The algorithm assumes a navigation context. If we are in a search context, we do not need a document title but simply use the current query (as illustrated in Figure 3.2, the query-based SDS process using a profile). More specifically, in the experiments described here, we perform the following steps:

• First, the title is extracted, normalised and parsed, identifying the patterns used in terminological feedback extraction, namely nouns and noun phrases. This results in a (possibly empty) set of terms.

• For each term, we check whether it is represented as a node in the profile (see Chapter 1, and in more detail Chapter 4). If it is, we extract all directly connected nodes (i.e., related "queries") and construct the union of all these terms. For example, in Figure 4.1, if the document title is The Library, then the only term considered is "library". This then results in the collection of terms {"library opening times", "catalogue", "moodle", "cmr", "albert sloman"} being generated (Line 4 of Algorithm 3.1).4

• The document is preprocessed and segmented into sentences (see Section 5.1).

• All sentences are rank-ordered according to some similarity metric when compared to the terms extracted from the domain model (see Section 5.5 for more detail).

• The summary sentences are generated by trimming the resulting ranked sentences to a specified length. To determine the summary length, we count the number of words in each sentence until the required length is reached (for more detail see Section 5.5).

• Finally, the candidate sentences (summary sentences) are sorted according to their order in the original document.

3 The method ExtractRelatedTerms simply extracts all terms from the profile that have a direct link to any of the terms extracted from the document title. More sophisticated methods such as random walk-based approaches are possible alternatives.

4 By ignoring the model weights and treating the terms as a bag of terms, there is potentially a risk of query drift when expanding by many weakly connected nodes. We deliberately went for a simple summarisation step and will explore more sophisticated approaches in future work.

[Figure 3.2 depicts the single-document summariser: query logs feed the domain model (profile); the title extracted from the selected document yields query refinements from the profile, the document is segmented into sentences, the sentences are matched against the refinements, and the extracted sentences form the summary.]

Figure 3.2: Architecture of a Profile-Based Single-Document Summariser.


Algorithm 3.1 Profile-Based Single-Document Summarisation.

Input: document D, profile P
Output: a summary for a document
 1  Begin
 2    T ← GetDocumentTitle(D)
 3    /* Extract related terms from the profile / domain model, see Algorithm 4.2 */
 4    RelatedTerms ← ExtractRelatedTerms(T, P)
 5    if RelatedTerms ≠ NULL then
 6      Split D into sentences, D = {S1, S2, ..., Sn}
 7      /* Find similarity values between RelatedTerms and each sentence, and store them in an array */
 8      foreach Si ∈ D do
 9        SM[i] ← Similarity(Si, RelatedTerms)
10      RankOrder SM[] descending
11      SummarySet ← ExtractTopSentencesAccordingToSpecifiedSummaryLength(SM[]) /* see Algorithm 5.1 to specify summary length */
12      Summary ← SortSummaryAccordingToOriginalDoc(SummarySet)
13      return Summary
    End
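To make the pseudo-code concrete, the following is a minimal Python sketch of the same steps, assuming the profile is held as a dictionary that maps each node (query phrase) to its weighted neighbours; the helper names, the simple word-overlap similarity and the 50-word limit are illustrative assumptions rather than the implementation used in our experiments.

import re

def extract_related_terms(title, profile):
    """Collect the neighbours of every title term that is a node in the profile graph."""
    related = set()
    for term in re.findall(r"[a-z]+", title.lower()):
        for neighbour in profile.get(term, {}):
            related.update(neighbour.split())
    return related

def summarise(document, title, profile, max_words=50):
    """Profile-based single-document summary (cf. Algorithm 3.1)."""
    related = extract_related_terms(title, profile)
    if not related:
        return ""                      # no matching concept: display no summary
    sentences = re.split(r"(?<=[.!?])\s+", document)
    # rank sentences by their word overlap with the related terms
    ranked = sorted(enumerate(sentences),
                    key=lambda s: len(related & set(s[1].lower().split())),
                    reverse=True)
    chosen, length = [], 0
    for index, sentence in ranked:     # trim to the specified summary length
        if length >= max_words:
            break
        chosen.append((index, sentence))
        length += len(sentence.split())
    # restore the original document order
    return " ".join(sentence for _, sentence in sorted(chosen))

profile = {"library": {"library opening times": 0.0048, "catalogue": 0.0051,
                       "albert sloman": 0.0044}}
print(summarise("The Albert Sloman Library is open daily. Opening times vary. "
                "Visitors may park on campus.", "The Library", profile))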

3.2.2

Profile-Based Multi-Document Summarisation

For profile-based MDS, we apply a similar sequence of steps as in Algorithm 3.1, but in this case the “document” that is summarised is generated from a collection of related documents. Normally, MDS is applied to a collection of documents that are related to each other. In our application, the collection of related documents is generated from a root hyperlinked document. This is done by extracting all outgoing links from this document, and retrieving the corresponding documents. These documents are then concatenated to create the “meta-document” that will be summarised. Following the extraction of candidate sentences for the summary, the sentences are ordered according to their similarity to relevant terms in the domain model. To avoid duplicate sentences appearing in a summary, we also apply a simple redundancy elimination step. This process is illustrated in Figure 3.3 and also in Algorithm 3.2 which illustrates generic pseudo-code of the profile-based MDS process in navigation.


Algorithm 3.2 Profile-Based Multi-Document Summarisation.

Input: a set of related documents RD, profile P
Output: a summary for the set of related documents
 1  Begin
 2    T ← GetDocumentTitle(D1)
 3    /* Find related terms from the profile / domain model, see Algorithm 4.2 */
 4    RelatedTerms ← FindRelatedTerms(T)
 5    if RelatedTerms ≠ NULL then
 6      foreach D ∈ RD do
 7        Append all sentences of D to form meta-document C
 8      Split C into sentences, C = {S1, S2, ..., Sn}
 9      /* Find similarity values between RelatedTerms and each sentence, and store them in an array */
10      foreach Si ∈ C do
11        SM[i] ← Similarity(Si, RelatedTerms)
12      RankOrder SM[] descending
13      ExtractTopSentencesAccordingToSpecifiedSummaryLength(SM[]) /* see Algorithm 5.1 to specify summary length */
14      InformationOrderingForCoherentReformulation
15      return Summary
    End

[Figure 3.3 depicts the multi-document summariser: query logs feed the domain model (profile); the title extracted from the selected document yields query refinements, the documents reached via the outgoing links are segmented into sentences, the sentence matcher performs content selection against the refinements, and the extracted sentences pass through redundancy elimination and information ordering to produce the summary.]

Figure 3.3: Architecture of a Profile-Based Multi-Document Summariser.
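To give a concrete flavour of the redundancy-elimination step, the sketch below (in the same illustrative Python style) filters near-duplicate sentences with a simple word-overlap (Jaccard) test before they enter the multi-document summary; the 0.5 threshold and the function names are assumptions for illustration, not the settings used in our experiments.

def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def remove_redundant(ranked_sentences, threshold=0.5):
    """Keep a ranked sentence only if it is not too similar to one already kept."""
    kept = []
    for sentence in ranked_sentences:
        if all(jaccard(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept

ranked = ["The library opens at 8am on weekdays.",
          "The library opens at 8am each weekday.",
          "The catalogue can be searched online."]
print(remove_redundant(ranked))
# the second, near-duplicate sentence is dropped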


3.3


Concluding Remarks

This chapter provides a general framework for profile-based summarisation in the context of navigating a Web site. It assumes a profile structured as a term association graph but makes very few assumptions apart from that. In the next chapter, we will discuss the method we applied to build the profile, in addition to presenting other approaches.

4 Building a Log-Based Profile

As we are proposing to build adaptive domain knowledge for navigation, this chapter presents the biologically inspired model we used for building adaptive models from query logs. Note that this is not the actual focus of our work. A number of different log-based methods are presented in Section 4.1. This chapter also includes an example of the logs we have used and lists all the pre-processing steps that have been applied to those logs, resulting in refined logs which can then be transformed into structured knowledge using the candidate approach described in Section 4.2. We then describe the methodology of constructing the log-based profile, including the ACO model and the ACO-trimmed model in Section 4.3 and Section 4.5, respectively. Furthermore, the method used to derive related terms from the model is discussed in Section 4.4.



4.1


Log-Based Approaches

Log-based approaches for search and navigation exploit past user interactions with the search engine as recorded in the logs, including the queries submitted and documents displayed to the user as results and those inspected by the user (clicked). They are usually maintained by search engine providers for a wide range of applications. Query log analysis has become one of the most promising research areas for the automatic derivation of knowledge structures for search assistance [Bernard et al., 2009; Silvestri, 2010]. This motivates our use of log data in generating cohort-personalised summaries to assist in navigation-based search. Query log analysis aims to reasonably and non-intrusively capture the user’s behaviour in interacting with a search system. It provides insight into how users formulate their queries (e.g., the average number of terms in a query), the characteristics of their search sessions (e.g., the average length of a search session or the average number of queries in a session), the topics searched, and the trend of queries or topics over time [Jansen and Spink, 2004]. There are many different ways of extracting information from the logs, and here, we are looking at the sort of model that was introduced in Chapter 1. Different methods for exploiting query logs to derive new query modification suggestions for Web site searches have been explored by Kruschwitz et al. [2013]. The methods can be classified as either adaptive, such as ACO, or non-adaptive, such as Query Flow Graphs (QFG), Maximum Likelihood Estimation (MLE), and Association Rules (AR). Adaptive methods are able to learn over time in a continuous learning cycle to build and adapt a domain model which represents queries and query suggestions. Changing the update frequency is then likely to result in a different model. By contrast, non-adaptive methods take the entire log and derive suggestions from the aggregated data; building such models incrementally would not result in a different model. Among the nonadaptive methods, Kruschwitz et al. [2013] find (1) that QFGs are the best-performing method and can lead to suggestions when MLE fails to do so, and (2) that some of these

Chapter 4. Building a Log-Based Profile

71

suggestions may be unrelated or not very useful. They also find that the AR approach is less effective than the other log-based methods (and significantly worse than QFG, partly because it ignores the order in which two queries were submitted). The non-log-based methods extract query suggestions using snippet text (SNIPPET) or word n-grams (NGRAMS). In other words, the baseline approaches exploit a document collection rather than log data: the SNIPPET approach selects suggestions from best-matching documents, whereas NGRAMS utilises n-grams derived from an entire document collection. Kruschwitz et al. [2013] demonstrate that log-based methods (even without incorporating clickthrough information) outperform non-log-based approaches in terms of query suggestion quality. Of the methods considered, ACO performs well overall and is simple to implement. It is particularly appealing due to its adaptive nature and its capability to capture some temporal context.1 For these reasons, we adopted ACO to construct a domain model for our experiments, although the details of the specific model are not under investigation in the work described here. Obviously, there are many other approaches, such as random walk-based models, that could have been adopted, but we leave that as a possible future research direction.

Definition 1. (Domain Model): "A domain model M is an abstract understanding of the document collection in a domain." [Albakour, 2012]

Definition 2. (A Log-based Domain Model): "A log-based domain model ML is a model which has been constructed entirely from a query log L." [Albakour, 2012]

Remember that we treat the terms 'profile' and 'domain model' as synonyms in this thesis. A major bottleneck in conducting research into query logs is the difficulty of obtaining realistic, large-scale log data. In our experiments, we use the logs of a local Web site search engine (the University of Essex Web site). These have been collected over a 3-year period. An inherent feature of the ACO paradigm is its ability to adapt to changes and to capture seasonality to a certain extent. In fact, it does not represent seasonality explicitly but rather incorporates recurring search patterns just like any other search requests submitted; in other words, it "re-learns" these search patterns. Dignum et al. [2010] studied the effect of this process (on the same Web site that we use as an example) through a user study which looked into query suggestions derived from the model. The aim was to evaluate how quickly the model learns suggestions and how it performs over a number of different seasons, in comparison to a number of non-seasonal baselines (Google and the meta-search engine Clusty). In particular, they investigate the effect of seasonality on ACO over three different academic seasons: Winter 2008, Summer 2009, and Winter 2009. Dignum et al. [2010] used the top 20 queries from the logs of the university search engine, and for each of these queries they selected the three best (i.e., highest-weighted) refinements from each trimester's domain model. Participants were asked to determine whether queries and their refinements were relevant for the associated trimester. The results indicate that the domain model is a quick learner, derives more relevant query suggestions than those presented by some standard search engines (the baselines), is able to suggest seasonally relevant terms over three different academic trimesters, has the potential to improve performance over time, and keeps pace with seasonal changes in user interests, as shown in the gradual improvement of First Relevant term suggestions.

1 Regarding the adaptive nature of ACO, previous studies applying ACO to query logs have demonstrated that the ACO paradigm can learn useful query suggestions and improve over time [Albakour et al., 2011; Kruschwitz et al., 2011, for example].

4.2

Query Logs

The query logs we are interested in come in a fairly standard format; nothing special is assumed about the structure. Here is an extract from the actual (pre-processed) log files (“xxx” is a field separator):


...
1657406 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx librarary xxx librarary xxx\
wed nov 17 11:41:43 gmt 2010 xxx 0 xxx 0 xxx 0 xxx librarary
...
1657409 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx albert salmon xxx albert salmon xxx\
wed nov 17 11:41:58 gmt 2010 xxx 0 xxx 0 xxx 0 xxx albert salmon
...
1657412 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx albert salmon library xxx albert salmon library xxx\
wed nov 17 11:42:23 gmt 2010 xxx 0 xxx 0 xxx 0 xxx albert salmon library
...
1657415 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx albert sloman library xxx albert sloman library xxx\
wed nov 17 11:42:37 gmt 2010 xxx 0 xxx 0 xxx 0 xxx albert sloman library
...

The logs record (1) a query identifier, (2) a session identifier, (3) the submission time, (4) the submitted query, and (5) some other additional information (see also Kruschwitz et al. [2013]). This extract shows four interactions submitted within the same session. In a sequence of steps, the user replaces the original query “librarary” by a new query “albert salmon”, then by “albert salmon library” and the final query “albert sloman library”. As can be seen from the sample log entries, we do not identify individual users, nor do we associate IP addresses with sessions. This is to comply with data protection issues and to avoid potential privacy problems. But it is also consistent with the idea of treating all users as part of the same cohort. This fits with the aims of the current study, and is in line with alternative approaches to address privacy and security concerns when building personalised search systems, such as I-SPY, which are aimed at the needs of a community without the need to store individual search histories [Freyne et al., 2004; Smyth et al., 2003].
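To illustrate how such records can be consumed, the following sketch splits each record on the "xxx" separator and groups the queries by session identifier. The assumed field positions (identifier, session, query) are read off the example above; they are an illustrative assumption about the pre-processed format rather than a specification of it.

from collections import defaultdict

def parse_record(line):
    """Split a pre-processed log record on the 'xxx' field separator."""
    fields = [f.strip() for f in line.split("xxx")]
    query_id, session_id, query = fields[0], fields[1], fields[2]
    return query_id, session_id, query

def group_sessions(lines):
    """Collect the time-ordered queries of each session."""
    sessions = defaultdict(list)
    for line in lines:
        _, session_id, query = parse_record(line)
        sessions[session_id].append(query)
    return sessions

log = ["1657406 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx librarary xxx ...",
       "1657409 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx albert salmon xxx ...",
       "1657415 xxx 02f4e527278fbe909aedb97cea27f8d9 xxx albert sloman library xxx ..."]
for session, queries in group_sessions(log).items():
    print(session, queries)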


4.3


Ant Colony Optimisation Model

Our profile is constructed using an ACO algorithm, but it could equally have been constructed using alternative approaches such as MLE or QFG. For the actual summarisation step, it is not important whether the model was built using ACO, MLE, QFG, etc.; we simply assume a model in which concepts are connected by edges with weights associated with them. However, following Albakour et al. [2011]; Kruschwitz et al. [2011], we decided to use the ACO model because it has been shown to work well and is adaptive over time. We will now introduce the ACO approach in more detail. In the current work, we use ACO to build an adaptive domain model, as we aim to construct a model that is dynamic and evolves over time to adapt to changes in the domain. The ACO analogy is used to first populate and then adapt a directional graph similar to a QFG. In this analogy, the edges in the graph are weighted with the pheromone levels that the ants, in this case users, leave when they traverse the graph.2 This is a fully automated process relying entirely on implicit cues. This is different from the idea of AntWorld, which also aims at assisting users in navigating the Web but requires some explicit user judgements [Kantor et al., 2000]. In effect, the user traverses a portion of the graph by using query refinements (analogous to the ant's journey). The weights of the edges on this route are reinforced (increasing the pheromone levels). Over time, all weights (pheromone levels) are reduced by introducing an evaporation co-efficient factor. This notion of evaporation captures the reduced popularity of an edge that has not been used recently by reducing the weight of untraversed edges over time. This then penalises inappropriate or less relevant query modifications. In this way, we expect outdated terms effectively to be removed from the model, i.e., the refinement weight will become so low that the phrase will never be recommended to the user.

2 The actual model acquisition process is not the focus of the thesis; the description given here is largely reproduced from Kruschwitz et al. [2013].


While ACO is an adaptive technique that can be continuously updated, for our experiments we will take a snapshot of the model; we do not continuously update it during the experiments. This helps to simplify the experimental design, and focus on assessing the impact of using cohort-personalised summarisation to guide navigation. Let us assume that we update the pheromone levels on a daily basis. For the edge qi → qj the pheromone level wij is updated using Equation 4.1.

wij = N ∗ ((1 − ρ)wij + ∆wij )

(4.1)

where:

1. N is a normalisation factor, as all pheromone trail levels are normalised to sum to 1;

2. ρ is an evaporation co-efficient factor;

3. ∆wij is the amount of pheromone deposited at the end of the day for the edge qi → qj.

The amount of pheromone deposited should correspond to ant moves on the graph; in our case, this can be the frequency of query refinements corresponding to the edge. The cost of ant moves can also be taken into account when calculating the amount of pheromone deposited. Generally, it can be calculated using Equation 4.2 [Dorigo et al., 2006].

∆wij = Σk Q/Ck ; For all ant moves on edge qi → qj

(4.2)

where: (a) Q is a constant, and (b) Ck is the cost of ant k's journey when using the edge qi → qj. Previous works [Albakour, 2012; Albakour et al., 2011] report on experiments with a number of evaporation co-efficient factors for Equation 4.1 and pheromone deposition calculation schemes for Equation 4.2 (see Section 5.3 for more detail).

Let us describe the process of evolving the graph with ACO in more detail. We start with an empty graph. Then, for every batch of a continuously updated log, we extract all the individual user sessions in that batch. We build the edges of the graph by considering only subsequent queries in the same session ("immediate refinements"). In other words, we use the session boundaries of individual users and treat each session as a transaction. Then, for each session that contains at least two queries, we identify the sequence of queries, update the pheromone level on the edges of the graph, and add nodes or edges to the graph if necessary. In other words, the queries are time-ordered, and for each pair of subsequent query phrases in the session an edge is created, or updated if it already exists, using the mean association weight of the previous day. A nominal update value of 1 is used for the first day; however, any positive real number could have been chosen without affecting the outcome of normalisation. Each consecutive query pair within a session is processed as outlined in Algorithm 4.1, which illustrates the ACO procedure. For example, if we have the following query chain in a session: query k, query l, query m, the edges considered as ant movements are query k → query l and query l → query m (the use of "immediate refinements" has been shown to outperform transitive relationships [Albakour, 2012; Albakour et al., 2011]). At the end of each day, all edge weights are normalised to sum to 1, and the mean weight of all edges is then calculated. By normalising the weights, all weights (pheromone levels) of non-traversed edges are reduced over time, hence penalising seasonally incorrect or less relevant phrase refinements [Dignum et al., 2010]. In practical applications, the weights can be updated offline on a periodic basis, e.g. hourly, daily, weekly, or even whenever a certain number of user sessions have completed. Furthermore, it is possible to run the algorithm from any point in the user log to any other; this allows us to compare how the model performs for particular time periods, as we did for Figure 4.1 and Figure 4.2. Note that this process of adaptive domain modelling can be seen as building a profile for a community of users, as opposed to building a profile for individual users.


Algorithm 4.1 The ACO-Based Algorithm to Build and Evolve the Domain Model [Albakour et al., 2011].

Input: domain model as a graph G, daily associations Ad, number of days DAY_NUMS
Output: G
 1  Begin
 2    for d ← 1 to DAY_NUMS do
 3      Ad ← FindAllAssociations(d)
 4      /* update weights of traversed edges */
 5      foreach (q, q′) ∈ Ad do
 6        /* Query q is associated to q′ in a session on day d */
 7        n ← FindNode(G, q)
 8        if n = NULL then
 9          n ← AddNode(G, q)
10        n′ ← FindNode(G, q′)
11        if n′ = NULL then
12          n′ ← AddNode(G, q′)
13        e ← FindEdge(G, n, n′)
14        t ← CalculateDepositedPheromone(q, q′)
15        if e = NULL then
16          e ← AddEdge(G, n, n′)
17          SetWeight(G, e, t)
18        else
19          SetWeight(G, e, t + GetWeight(G, e))
20      NormaliseAllWeights(G)
21  End
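The daily update of Equation 4.1 can be sketched in a few lines of Python. The graph is represented as a nested dictionary of edge weights; the evaporation co-efficient ρ = 0.1 and the unit pheromone deposit per refinement (i.e., Q/Ck = 1) are illustrative choices, and the function names are not those of the actual implementation.

def daily_update(graph, day_refinements, rho=0.1):
    """One day of the ACO update: deposit pheromone, evaporate, then normalise (Eq. 4.1)."""
    deposits = {}
    for q, q_next in day_refinements:              # each refinement is one ant move
        deposits[(q, q_next)] = deposits.get((q, q_next), 0.0) + 1.0
        graph.setdefault(q, {}).setdefault(q_next, 0.0)
    total = 0.0
    for q, edges in graph.items():                 # evaporate and add today's deposits
        for q_next in edges:
            edges[q_next] = (1 - rho) * edges[q_next] + deposits.get((q, q_next), 0.0)
            total += edges[q_next]
    if total:                                      # normalise all weights to sum to 1
        for edges in graph.values():
            for q_next in edges:
                edges[q_next] /= total
    return graph

g = {}
daily_update(g, [("librarary", "albert salmon"),
                 ("albert salmon", "albert sloman library")])
print(g)   # two edges, each with normalised weight 0.5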

To pre-process the log files, the logs are segmented into sessions. For each session, the queries are time-ordered. We process the logs by keeping only those sessions that contain more than one search query. To reduce noise, only sessions with ten or fewer queries are considered.3 We perform case folding — that is, all capital letters are transformed into small letters — in order to normalise the query corpus. We also replace all punctuation marks such as colons, semicolons, and dashes by white space. We then use the processed log file to build the profile. Figure 4.1 illustrates a snapshot of node “library” in the profile acquired using the 3-year query logs (only the highest-weighted links are displayed). For comparison, Figure 4.2 shows what the corresponding part of the profile looks like after starting the construction at the same point of time, but running the model acquisition process on the first month of the query logs only. 3

Longer sessions are potentially better to model individual searcher profiles, e.g., Bennett et al. [2012], but for a group profile we assume shorter sessions to be the main building blocks of the model. In any case, only 0.31% of all our sessions are longer than ten queries.


[Figure 4.1 illustrates how sessions (each a sequence of timestamped queries) are extracted from the log files and turned into the profile graph; in the resulting snapshot, the node "library" is linked to "library opening times", "catalogue", "moodle", "timetable", "cmr" and "albert sloman", with edge weights in the region of 0.004–0.01.]

Figure 4.1: Acquiring a Profile from Query Logs.

[Figure 4.2 shows the corresponding snapshot built from one month of logs: "library" is linked to "catalogue", "campus log account", "library campus log" and "dissertations", with edge weights of around 0.10.]

Figure 4.2: A Profile Acquired Using a Shorter Query Log (1 Month).

4.4

Deriving Related Terms

The ACO model can be employed to serve IR tasks such as recommendation, navigation or query expansion. Query expansion is a common technique in IR in which terms related to a query are added to the original query to improve retrieval performance [Hawking, 2011]. In this work, however, we use the model to recommend, from the graph, alternative queries for a given query in order to support navigation.


The process of recommending queries with the ACO model works as follows. We first identify the original query as a node in the graph, and then traverse the graph edges to identify and list all directly associated nodes, ranked by edge weight. For the submitted query "library" in Figure 4.1, the list of possible query modification suggestions would then be "library opening times", "catalogue", "moodle", "cmr" and "albert sloman". The list of query suggestions is then used to select the summary sentences that aid a user in exploring the document collection, as explained in more detail in Chapter 3.

Algorithm 4.2 Related Terms Extraction Algorithm to Extract Related Terms for the Pre-Defined Term from the Domain Model

Input: domain model as a graph G, pre-defined term PT
Output: list of related terms
 1  Begin
 2    /* Extract list of patterns Pi, e.g., nouns and noun phrases, from PT */
 3    P ← GetPatterns(PT) /* see Section 5.4 */
 4    foreach Pi ∈ P do
 5      /* Search for the noun or noun phrase in the graph */
 6      n ← FindNode(G, Pi)
 7      if n ≠ NULL then
 8        /* Extract related terms RelatedTerms = {T1, T2, ..., Tn} that are directly connected to the original term in the graph */
 9        RelatedTerms ← GetRelatedNodes(G, n)
10        RankRelatedTerms(RelatedTerms) descending
11        if |RelatedTerms| > MAX then
12          RelatedTerms ← GetTopMAXRankedTerms(RelatedTerms)
13        U ← ConstructUnionOfAllRelatedTerms(RelatedTerms)
14      else
15        /* In case n is a multi-word noun phrase */
16        RemoveStopWords(n)
17        ST ← SplitIntoSubTerms(n)
18        foreach STi ∈ ST do
19          DoTheStepsFrom6To13(STi)
20    RankTheUnionOfAllRelatedTerms(U) descending
21    Refinements ← GetTopMAXOfTheUnionOfAllRankedTerms(U)
22    return Refinements
23  End
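In code, the core of this procedure is a ranked neighbourhood lookup. The sketch below returns the top-MAX neighbours of a term and falls back to the individual sub-terms when the full phrase is not a node in the graph; the graph representation, the stop-word list and the constant MAX = 5 are assumptions for illustration rather than the actual implementation.

def related_terms(term, graph, max_terms=5, stopwords=frozenset({"the", "of", "for"})):
    """Rank the neighbours of a term in the profile graph; fall back to sub-terms."""
    term = term.lower()
    if term in graph:
        neighbours = list(graph[term].items())
    else:
        # multi-word phrase not in the graph: query each content word separately
        neighbours = []
        for sub in term.split():
            if sub not in stopwords:
                neighbours.extend(graph.get(sub, {}).items())
    ranked = sorted(neighbours, key=lambda kv: kv[1], reverse=True)
    return [t for t, _ in ranked[:max_terms]]

graph = {"library": {"library opening times": 0.0048, "catalogue": 0.0051,
                     "moodle": 0.0040, "cmr": 0.0045, "albert sloman": 0.0044,
                     "timetable": 0.0041}}
print(related_terms("the library", graph))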

In a navigation context, we have no query and therefore need to identify suitable terms that represent the document at hand, and then perform the same steps. We use the document title for this, i.e., the title of the document a user has opened is treated as a representation of the user’s information need (the “query”). This is the underlying principle of how the model can be used for profile-based summarisation. Alternatively, one might consider using the anchor text on which a user clicks, or possibly the whole item containing the anchor text (this will be explored in future work). Algorithm 4.2 illustrates the procedure followed to extract the query recommendations (related terms) from the graph.

4.5 Ant Colony Optimisation Trimmed Model

As a result of applying ACO on the logs to build a term association network, over time we expect outdated terms to be effectively removed from the model, i.e., the refinement weight will become so low that the term will never be recommended to the user. The trimming process ensures that the graph contains only the most valuable edges, those with the highest weights. This substantially reduces the number of edges and makes the model more efficient to use. The trimming functionality is not part of Algorithm 4.1, which was proposed by Albakour et al. [2011].

Algorithm 4.3 ACO Trimmed Algorithm to Effectively Remove the Outdated Terms from the Domain Model

Input: domain model as a graph G, nodes in the graph ni, edges in the graph ej
Output: trimmed graph TrimmedGraph
Begin
  foreach n ∈ G do
    e[] ← GetAllDirectlyAssociatedEdges(G, n)
    avg ← GetAverageWeights(G, n, e[])
    foreach e[i] ∈ e[] do
      w ← GetWeight(G, e[i])
      if w ≤ avg then
        DeleteEdge(G, e[i])
  NormaliseAllWeights(TrimmedGraph)
  return TrimmedGraph
End

To address this, we updated the algorithm and added a sub-algorithm that provides this functionality, as shown in Algorithm 4.3. Starting with ACO, we take one node at a time from the graph, calculate the average weight of that node's outgoing edges, and finally trim all those outgoing edges whose weights fall below this average. At the end of each day (after all sessions for that day have been extracted from the log files, represented in the graph, and the weights normalised), we look for non-traversed edges with low weights, remove them from the graph, and apply the normalisation process again. The remaining model can then be used.
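A minimal sketch of the trimming step of Algorithm 4.3 is given below; the per-node re-normalisation at the end is an assumption, since the exact normalisation procedure belongs to the ACO model itself.

```python
def trim_and_normalise(graph):
    """graph: dict mapping node -> {neighbour: edge weight}."""
    trimmed = {}
    for node, edges in graph.items():
        if not edges:
            trimmed[node] = {}
            continue
        avg = sum(edges.values()) / len(edges)
        # Keep only edges strictly above the node's average outgoing weight
        # (edges with weight <= avg are deleted, as in Algorithm 4.3).
        trimmed[node] = {nb: w for nb, w in edges.items() if w > avg}
    # Re-normalise the surviving outgoing weights of each node (assumption).
    for node, edges in trimmed.items():
        total = sum(edges.values())
        if total > 0:
            trimmed[node] = {nb: w / total for nb, w in edges.items()}
    return trimmed

profile = {"library": {"catalogue": 0.0102, "moodle": 0.0040, "timetable": 0.0051,
                       "library opening times": 0.0048, "albert sloman": 0.0044}}
print(trim_and_normalise(profile))
```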

4.6 Concluding Remarks

In this chapter, we introduced the log-based profile in the context of navigating a Web site: a biologically inspired paradigm for turning search logs into useful structures that can be utilised for query recommendation. We have used only the queries submitted within user sessions as implicit feedback to create the structure; we have not incorporated any clickthrough data or other feedback signals in the model. In the next chapter, we will discuss the general experimental setup applied to our work.

Part II

Evaluation


5 General Experimental Setup

This chapter describes the main experimental setup used in our work. We start by illustrating the preprocessing steps we followed to generate the summaries in Section 5.1. We show how we analysed the log data in Section 5.2. We also specify the method and implementation used to construct the log-based profile in Section 5.3. Then, in Section 5.4, we explain the method used to identify and extract the technical terminological nouns and noun phrases from the issued queries or from the titles extracted from the Web pages, supported by some examples. Finally, we explain how the length of the generated summaries is determined in Section 5.5.


5.1 Data Collection Pre-processing

Generating a summary for HTML documents involves a number of preprocessing steps [Silva and Ribeiro, 2003; Smucker, 2011; Stokes et al., 2007; William, 1992]; Figure 5.1 illustrates the sequence of these steps. First, we perform some document cleaning, for example, removing tags and tables. This is followed by a pipeline of NLP tools¹: sentence splitting, in which the text is broken up into sentences by detecting whether or not a punctuation character marks the end of a sentence, thereby identifying the sentence boundaries; tokenisation, in which each sentence generated in the previous step is segmented into tokens (words, punctuation, numbers, etc.); stop-word elimination, in which punctuation marks and frequent words such as ‘the’, ‘of’, ‘and’, ‘a’, ‘in’, ‘to’, ‘is’, ‘for’, etc., are omitted because they contribute little to the informational content of a document and carry little weight in statistical analysis; and stemming, which reduces a word to its stem or root (e.g., the words “connect, connected, connecting, connection, connections” have the same stem “connect”).
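A minimal sketch of this pipeline is shown below; it uses NLTK rather than the OpenNLP tools used in our implementation, so the exact tokenisation, stop-word list and stemming behaviour will differ.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# assumes the NLTK 'punkt' and 'stopwords' resources have been downloaded

def preprocess(text):
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    processed = []
    for sentence in nltk.sent_tokenize(text):               # sentence splitting
        tokens = nltk.word_tokenize(sentence)                # tokenisation
        tokens = [t.lower() for t in tokens
                  if t.isalnum() and t.lower() not in stop]  # stop-word elimination
        processed.append([stemmer.stem(t) for t in tokens])  # stemming
    return processed

print(preprocess("The Library provides ideal conditions for quiet study. "
                 "It is connected to all campus networks."))
```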

5.2 Log Analysis

The second field in the log record (see Section 4.2 in the previous chapter) is an automatically generated session identifier which can be seen as an anonymous user ID. It facilitates session detection, as the default server time-out is 30 minutes. In the literature on Web search log analysis, a commonly used time-out threshold is 30 minutes [Jansen et al., 2007b]. Automatically identifying the boundaries of sessions is a difficult task [Göker and He, 2000; Jansen et al., 2007b]. One of the reasons is that a session can consist of a number of search goals and search missions [Jones and Klinkner, 2008]. Identifying topically related chains in user query sessions has been studied extensively [Gayo-Avello, 2009].

¹ We use OpenNLP for some of these steps.


Figure 5.1: Preprocessing Steps Required on HTML Documents.

We use the default server timeout, that is, a session expires after 30 minutes of inactivity. This method has been shown to give highly accurate session boundaries [Jansen et al., 2007b]. To test the applicability of this approach to the current work, we randomly sampled 50 sessions from the log files. Three of the authors independently assessed whether each of those sessions was concerned with a single topic. They then compared their judgements. There was agreement that all sessions were about a single topic. In addition, we found that there was no session longer than 30 minutes. However, given that sessions in our sample domain tend to be short, with only 1.53 queries per session on average [Kruschwitz et al., 2013], we randomly sampled another set of 50 sessions containing at least two queries. Using the same manual assessment, no single session was identified that was clearly and unambiguously about more than one topic, although there were six sessions that potentially fell into that category (e.g., a query “study abroad” followed by “psychology”). Again there was no session longer than 30 minutes. We conclude that applying the standard time-out approach appears sensible in this study.
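Although our sessions are already delimited by the server-generated identifier, the same 30-minute time-out rule could be applied directly to a time-ordered stream of queries. The sketch below is illustrative only and assumes (user, ISO timestamp, query) records.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def split_into_sessions(records):
    """records: iterable of (user_id, iso_timestamp, query) tuples."""
    sessions = []       # each session is a list of queries
    last_time = {}      # user_id -> time of that user's most recent query
    open_session = {}   # user_id -> index of that user's current session
    for user, ts, query in sorted(records, key=lambda r: r[1]):
        t = datetime.fromisoformat(ts)
        if user not in last_time or t - last_time[user] > TIMEOUT:
            open_session[user] = len(sessions)   # start a new session
            sessions.append([])
        sessions[open_session[user]].append(query)
        last_time[user] = t
    return sessions
```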

5.3 Profile Construction

Running ACO involves a number of parameters. Based on the findings of previous studies, Albakour et al. [2011] and Albakour [2012], we used the best-performing combination of parameters for Equation 4.1, in which the evaporation coefficient ρ = 0.1 and only immediate refinements in the sessions are considered to update the pheromone level on the graph (as described earlier in Section 4.3). As mentioned earlier, the evaporation factor decides how quickly the model forgets links. A small value for ρ means gradual and slow forgetting, whereas a high value strongly reinforces recent observations. Albakour et al. [2011] and Albakour [2012] experimented with varying values to test the impact of the evaporation coefficient (ρ = 0, 0.05, 0.1, 0.15, 0.3, 0.5, 0.7, 0.9), the default being 0. They found that performance improves with small values of ρ as opposed to no evaporation. In other words, forgetting less commonly used and possibly seasonally incorrect reformulations has a positive impact. The increase in performance with small values is modest but statistically significant. By contrast, with large values of the evaporation coefficient, performance degrades significantly. Thus, the optimum value found for the evaporation factor is 0.1. They also experimented with three different pheromone calculation schemes to build the edges on the graph (‘subsequent reformulations/immediate refinements’, ‘linking all’, ‘link to last’), the default being ‘immediate refinements’. They found that the ‘immediate refinements’ scheme significantly outperforms the other two. The ‘link to last’ scheme suffers the most because a lot of useful links within the session are lost. Furthermore, the constant Q in Equation 4.2 is chosen to be the average weight of all edges in the graph on the previous day. The cost Ck is taken to be the distance between the queries in the session (for immediate refinements C = 1). All these settings have been shown to work well for log data of this domain.
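For illustration only, the sketch below shows an ACO-style daily update of this general form: evaporation by a factor of 1 − ρ followed by a deposit of Q/C on each traversed edge for immediate refinements (with C = 1). The precise formulation is that of Equations 4.1 and 4.2 given earlier, and the subsequent normalisation step is omitted here.

```python
RHO = 0.1  # evaporation coefficient (the best-performing value reported above)

def daily_update(graph, day_sessions, q):
    """graph: dict of dicts {query: {refinement: weight}}.
    day_sessions: list of query lists observed on one day.
    q: the constant Q, e.g. the average edge weight of the previous day."""
    # Evaporation: every existing edge loses a fraction RHO of its weight.
    for refinements in graph.values():
        for refinement in refinements:
            refinements[refinement] *= (1.0 - RHO)
    # Reinforcement: each immediate refinement observed today deposits Q / C (C = 1).
    for session in day_sessions:
        for original, refinement in zip(session, session[1:]):
            edges = graph.setdefault(original, {})
            edges[refinement] = edges.get(refinement, 0.0) + q
    return graph
```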


In order to divide the query log into batches of a continuously updated log, we utilised daily batches. This results in 1,096 different batches of logs, one per day over the three years. The total number of distinct sessions in this (pre-processed) log data is 299,938, with an average of 279 sessions per day. The number of sessions per day varies between 36 and 988, with a standard deviation of σ = 134.032. This variation between batches is the result of busy periods throughout the academic year. For instance, the first days of the academic years starting in October 2008, October 2009, and October 2010 contain 668, 827, and 988 sessions, respectively, all well above the average. By contrast, Christmas Day on 25 Dec 2007, 25 Dec 2008, and 25 Dec 2009 contains only 36, 45, and 56 user sessions, respectively.

5.4 Noun Phrase Extraction

Caropreso et al. [2000] define a phrase as ‘a textual unit usually larger than a word but smaller than a full sentence’. In linguistic terminology, a phrase plays a stronger role in the text than a single word [Caropreso et al., 2000]. Moreover, the syntactic nature of a sentence is best described in terms of noun phrases, as they play the functions of subjects, objects, and presentational phrases. Accordingly, they are called concepts in AI [Arampatzis et al., 1998]. Many researchers [Buckley et al., 1996; Strzalkowski et al., 1996] who participated in the Text REtrieval Conference (TREC) have found phrases to be valuable for indexing purposes. Linguistic analysis of the text is required in order to identify these phrases. Zhang et al. [2007] divide noun phrases into several kinds, such as proper nouns, simple phrases, dictionary phrases, and complex phrases. The correct application of those phrases has resulted in better retrieval performance than utilising words alone. A noun phrase consists of constituents in which the head functions as a noun [Vadas and Curran, 2011] and forms the key part [Quirk et al., 1987].


In our work, we are interested in identifying and extracting nouns and noun phrases up to three words long from the title of the page or from the submitted query, based on the method presented in Justeson and Katz [1995], as these are the most useful phrases for the summary generation task. The longest possible phrases are then kept (e.g., if the set contains both ‘albert sloman’ and ‘albert sloman library’, the first term is removed in this step). For each identified term, we checked whether it is represented as a node in the profile² and (if so) extracted all directly connected nodes (i.e., related ‘queries’) and constructed the union of all these terms, as described in Section 4.4. We used the chunker in the OpenNLP library, which identifies noun and verb phrases (either single-word or multi-word phrases). However, the noun phrases extracted by the chunker are not terminological phrases, which is what we aim to extract. Moreover, the chunker does not apply any constraints (e.g., grammatical structure) on the input strings, so the extracted noun phrases can contain useless words, e.g., He, The xxx, Its xxx.³ Like Justeson and Katz [1995], we applied a number of patterns to identify the technical terminological noun phrases. A candidate term in Justeson and Katz [1995] is a multi-word noun phrase (we are also interested in one-word nouns); it is either a string of nouns and/or adjectives ending in a noun, or it consists of two such strings separated by a single preposition. There are two admissible patterns of length two and five of length three. The following examples of each permitted pattern (as can be seen in Table 5.1) are taken from the University of Essex sample domain and converted into a normalised form (where A is an adjective, N is a noun, and P is a preposition).

² To test how common this might be, we randomly sampled 50 HTML pages from the document collection and did not find a single case where there were no terms extracted from the title of the page matching a node in the profile. For the purpose of this analysis, we treated the terms essex, university and university of essex as stop words.
³ We use ‘xxx’ to refer to any word, just to clarify the example.


Table 5.1: Noun Phrase Patterns.

AN   disciplinary regulations
NN   module enrolment
AAN  third international conference
ANN  international application form
NAN  management short courses
NNN  brain computer interfaces
NPN  excellence in education, department of law

However, we use the chunker to find the beginning and end of each noun-phrase chunk, after which we extract the noun phrases based on the patterns of Justeson and Katz [1995]. The steps are as follows:

1. Tokenise the input sentence (using the tokeniser in the OpenNLP library); the result is a String array in which each String object is one token.

2. Associate each token with a specific tag (using the Part-of-Speech tagger in the OpenNLP library).

3. We now have two input arrays: one for the sentence’s tokens, the other for the Part-of-Speech tags.

4. The chunker takes the two arrays as input and returns the chunk tag array for the sentence. The tag array contains one chunk tag for each token in the input array; the corresponding tag can be found at the same index that the token has in the input array.

5. The current word is located in the first column, its Part-of-Speech tag in the second column, and its chunk tag in the third. The chunk tags include the name of the chunk type. The chunks can be classified into noun phrases (NP), verb phrases (VP), and prepositions (PP). ‘B-...’ marks the first word of a chunk, and ‘I-...’ every other word contained in the chunk. For example, ‘B-NP’ marks the first word of a noun phrase in a chunk and ‘I-NP’ the following noun-phrase words in the same chunk; similarly,


‘B-VP’ and ‘I-VP’ mark verb-phrase words. ‘B-PP’ is the chunk tag for prepositions. The ‘O’ chunk tag is used for tokens that are not part of any chunk. For example:

Title: “University of Essex :: Business :: Introduction”

Output:
Word          POS tag  Chunk tag
University    NNP      B-NP
of            IN       B-PP
Essex         NNP      B-NP
:             :        O
Business      NN       B-NP
:             :        O
Introduction  NN       B-NP

6. Based on the constraints applied to the candidate strings by Justeson and Katz [1995], determiners such as articles, demonstratives, possessive pronouns, and quantifiers are removed.

7. Apply the patterns of Justeson and Katz [1995] to identify and extract the technical terminological noun phrases. Those patterns are: AN, NN, AAN, ANN, NAN, NNN, and NPN. The equivalent patterns based on chunk tags are: B-NP, B-NP/(I-NP)*, B-NP/B-PP/B-NP, and B-NP/B-PP/B-NP/(I-NP)*.

The possible terminological noun phrases extracted from the above example are:
(a) University of Essex (B-NP/B-PP/B-NP) = (NPN)
(b) Business (B-NP) = (N)
(c) Introduction (B-NP) = (N)


Example

Original Text: “The Special Collections Room in the Albert Sloman Library currently holds some 60 named collections of books, manuscripts and contemporary archives, together with a general sequence which is arranged by subject”.⁴

⁴ This paragraph is extracted from the University of Essex Library Web page, http://libwww.essex.ac.uk/speccol.htm

Text after applying tokenisation and the POS tagger:
The/DT Special/JJ Collections/NNP Room/NNP in/IN the/DT Albert/NNP Sloman/NNP Library/NNP currently/RB holds/VBZ some/DT 60/CD named/VBN collections/NNS of/IN books,/NN manuscripts/NNS and/CC contemporary/JJ archives,/NNS together/RB with/IN a/DT general/JJ sequence/NN which/WDT is/VBZ arranged/VBN by/IN subject./NN

Output of the chunker:
[NP The DT(B-NP) Special JJ(I-NP) Collections NNP(I-NP) Room NNP(I-NP)] [PP in IN(B-PP)] [NP the DT(B-NP) Albert NNP(I-NP) Sloman NNP(I-NP) Library NNP(I-NP)] [ADVP currently RB(B-ADVP)] [VP holds VBZ(B-VP)] [NP some DT(B-NP) 60 CD(I-NP) named VBN(I-NP) collections NNS(I-NP)] [PP of IN(B-PP)] [NP books, NN(B-NP) manuscripts NNS(I-NP)] [O and CC(O)] [NP contemporary JJ(B-NP) archives, NNS(I-NP)] [ADVP together RB(B-ADVP)] [PP with IN(B-PP)] [NP a DT(B-NP) general JJ(I-NP) sequence NN(I-NP)] [NP which WDT(B-NP)] [VP is VBZ(B-VP) arranged VBN(I-VP)] [PP by IN(B-PP)] [NP subject. NN(B-NP)]

List of nouns and noun phrases extracted by the chunker:
1. The Special Collections Room (B-NP/I-NP/I-NP/I-NP)
2. the Albert Sloman Library (B-NP/I-NP/I-NP/I-NP)
3. some 60 named collections (B-NP/I-NP/I-NP/I-NP)
4. books, manuscripts (B-NP/I-NP)
5. contemporary archives, (B-NP/I-NP)
6. a general sequence (B-NP/I-NP/I-NP)
7. which (B-NP)
8. subject. (B-NP)

List of nouns and noun phrases extracted using the terminological patterns of Justeson and Katz [1995] on the output of the chunker:
1. Special Collections Room (B-NP/I-NP/I-NP) = (ANN)
2. Albert Sloman Library (B-NP/I-NP/I-NP) = (NNN)
3. named collections (B-NP/I-NP) = (NN)
4. books (B-NP) = (N)
5. manuscripts (B-NP) = (N)
6. contemporary archives (B-NP/I-NP) = (AN)
7. general sequence (B-NP/I-NP) = (AN)
8. subject (B-NP) = (N)
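The sketch below reproduces the gist of this procedure in Python using NLTK; it applies the Justeson and Katz patterns directly to simplified POS tags instead of going through a chunker, so its output may differ slightly from the OpenNLP-based pipeline described above.

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are available

# Justeson & Katz patterns over simplified tag classes
# (A = adjective, N = noun, P = preposition), plus single nouns.
JK_PATTERNS = {"N", "AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}

def simplify(tag):
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "O"

def extract_terms(text, max_len=3):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    simplified = [(word, simplify(tag)) for word, tag in tagged]
    terms, i = [], 0
    while i < len(simplified):
        matched = False
        for n in range(max_len, 0, -1):          # prefer the longest match at each position
            window = simplified[i:i + n]
            pattern = "".join(t for _, t in window)
            if len(window) == n and pattern in JK_PATTERNS:
                terms.append(" ".join(w for w, _ in window))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return terms

print(extract_terms("The Special Collections Room in the Albert Sloman Library "
                    "holds collections of books and manuscripts."))
```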

5.5 Extracting a Summary

Many different similarity measures could be used when running SDS and MDS to identify the sentences to be included in the summary. The similarity metric used in our work to rank-order the sentences is the standard TF.IDF cosine similarity. The idea is that all the sentences extracted from the original document(s) are matched against the terms obtained from the profile (the query refinements) using TF.IDF cosine similarity. This results in a ranked list of sentences (in descending order) in which each sentence is associated with a score indicating the strength of the relationship between the sentence and the profile. The most strongly matching sentences appear at the top of the


list. Then we specify the length of the summary. Following DUC 2002 conventions, we generate summaries of at most 100 words [Lin and Hovy, 2003a]. The process we followed, illustrated in Algorithm 5.1, is as follows: given the list of ranked sentences, we take one sentence at a time. The top-matching sentence is added to the summary in all cases, and the initial value of the counter is the length of this top-matching sentence. If the length of this sentence is greater than or equal to 100 words, it is the only one in the summary; this is the only case where the summary length exceeds 100 words. Otherwise, if the length of this sentence is less than 100 words, we start a loop from the second sentence in the list of ranked sentences. Before we add any further sentence to the summary, we count the number of words in the sentence and add it to the counter; if the counter is less than or equal to 100, the sentence is added to the summary, otherwise it is not added and we break the loop. In this case, the extracted summary is at most 100 words long. There are many approaches to summarisation that could be used, but here we focus on one (a ranked list of sentences) because this approach is simple to implement and performs well compared to other methods [Hong et al., 2014]. Future work could look at alternative summarisers, such as word probability (FreqSum), topic words (TsSum), LexRank, and Greedy-KL [Hong et al., 2014].


Algorithm 5.1 Algorithm to Create an Extractive Summary.

Input: list of ranked sentences RS
Output: a summary
Begin
  T ← GetTopMatchingSentence(RS)
  Append T to summary
  /* Initialise counter with the length of the top-matching sentence */
  Split T into words
  counter ← GetLength(T)
  if counter < n then
    /* Start from the second sentence in the list */
    foreach S ∈ RS do
      Split S into words
      counter += GetLength(S)
      if counter ≤ n then
        Append S to summary
      else if counter > n then
        Break
End
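A minimal sketch of the whole extraction step (TF.IDF cosine ranking against the profile terms followed by the greedy length cut-off of Algorithm 5.1) might look as follows; it uses scikit-learn and is not the thesis implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_summary(sentences, profile_terms, max_words=100):
    """Rank sentences against the profile terms and apply the length cut-off."""
    vectoriser = TfidfVectorizer(stop_words="english")
    matrix = vectoriser.fit_transform(sentences + [" ".join(profile_terms)])
    # Cosine similarity of every sentence against the pseudo-query built from the profile.
    scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
    ranked = [s for _, s in sorted(zip(scores, sentences), key=lambda p: -p[0])]

    # Greedy selection: always keep the top sentence, then add further sentences
    # while the running word count stays within max_words.
    summary = [ranked[0]]
    count = len(ranked[0].split())
    for sentence in ranked[1:]:
        count += len(sentence.split())
        if count <= max_words:
            summary.append(sentence)
        else:
            break
    return " ".join(summary)
```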

5.6 Concluding Remarks

This chapter discusses the general experimental setup and the different parameters we applied to build our domain model and generate profile-based summaries. In the next chapter, we will present the first study we conducted on profile-based summarisation.

6 A Scoping Study on Profile-Based Summarisation

We started by exploring the usefulness of the cohort-profile domain model in generating profile-based summaries. As our first experiment, we conducted a simple scoping study exploring the potential of profile-based single-document summarisation for typical information needs, i.e., assuming that we do have a query [Alhindi et al., 2013]. We compared the summaries generated using a profile with different baselines and assessed them by comparing human judgements of the summaries.



6.1 Experimental Setup

We applied five SDS methods. The first two algorithms we implemented were designed for traditional (generic) summarisation, and they represent widely used baselines that are also employed by Yan et al. [2011]. The other three are all variations of the ACO approach. Note that methods 3 and 4 are query-independent as they use the entire model to generate the summary, whereas method 5 is query-specific. To be clear, there is only one query-biased ACO method, namely method 5; the others (methods 3 and 4) take the entire model and turn it into a bag of words. The evaporation factor of the ACO algorithm ensures that not all of the queries ever submitted are included in such a ‘bag’, and the pruning approach, as in method 4, aims to cut the model down even further. The five summarisation approaches are:

1. Random: (“Baseline-1”) Selects sentences from the document randomly.

2. Centroid: (“Baseline-2”) This is a centroid-based approach to summarisation [Radev et al., 2004]. This algorithm takes into account the following parameters: centroid value, positional value, and first-sentence overlap, in order to generate a summary.¹

3. ACO: A query graph is built by processing the log data according to our ACO model (Section 4.3). The entire model is turned into a flat list of terms by extracting all the terms/queries in all the nodes in the graph. Then, a specific number of sentences that are strongly matched to those terms are extracted from a document as summary sentences.

4. ACO trimmed: Starting with ACO, we trim all those edges whose weights fall below the overall average weight of an edge. The remaining model is turned into a flat list of terms for summarisation using the same mechanism as in method 3, as described in Section 4.5.²

5. ACO query refinements: The list of terms used for summarisation comprises all those that are directly linked to the query node in the ACO model (see Section 3.2 for more detail), assuming a search context.

The Random method is a very poor baseline, and was simply chosen because it has been used in other work. Centroid can be surprisingly strong, as is the case with many sensible baselines. For example, in a recent study conducted by El-Haj et al. [2014], they concluded that “the results of the evaluation suggest that it is appropriate to include centroid-based summaries as a sensible baseline; in the experiments their quality appears to be on a par with manual summarisation”.

In order to generate our sample summaries, we use documents that correspond with frequently submitted queries, as commonly done in other studies [Paek et al., 2004, for example]. We would expect that the biggest potential for profile-based summarisation lies in applying this approach to those queries which can be associated with a rich profile, which is the reason why we selected ten of the most frequently submitted queries in the query logs (“timetable”, “courses”, “moodle”, “accommodation”, “library”, “fees”, “law”, “enrol”, “psychology”, and “graduation”), along with a corresponding document (an HTML page selected from the top ten results returned by the existing search engine). We also argue that these top queries represent a large proportion of all queries submitted to the search engine, as query frequencies approximately follow a power-law distribution [Kruschwitz et al., 2013]. We selected a page from the top-ranked results because users are much more likely to click on the top results of a ranked list than to select something further down [Joachims et al., 2005]. There is also some qualitative feedback we received on all of the experiments that supports this hypothesis. For each document, we applied the five different summarisation algorithms for comparison, as explained above.

¹ Radev’s centroid-based approach completely ignores the query terms and selects the centroid of the document without reference to a query.
² While the original ACO model has 669,699 nodes, the ACO trimmed model has 70,594.


We adopted an existing evaluation framework to assess the quality of summarised documents [Yan et al., 2011]. The idea is that a number of randomly sampled users are asked to assess different summaries of a document that were each generated using different techniques (the user has no idea which underlying method is being used in each case). Given the Web site context of this study, we recruited a sample of five members of our institution to do the assessment and represent our target users. Each evaluator was requested to compare the generated summaries (provided in a random order) and express his or her opinion about them using a rating mechanism. In line with Yan et al. [2011], we used a 5-point Likert scale³, where 5 = excellent, 4 = good, 3 = average, 2 = bad, and 1 = terrible. This rating system aimed to evaluate the quality of the generated summaries and allow a comparison between the methods. After the rating process was completed, the evaluators were asked to fill out an exit questionnaire, providing general feedback on the generated summaries. In this questionnaire, the evaluators were asked to answer the following questions:

1. What are the special characteristics that have been found in the summaries that have been evaluated as the best among others? What did you like about them? What did you dislike about them?
2. What did you dislike about the summaries that have been evaluated the worst among others?
3. What are the main criteria that you have considered during the rating process? In other words, what are the things you were seeking to obtain in the summary during the rating process?
4. What do you think about the size of the generated summaries? What size do you prefer?
5. Was it easy to do the rating process? Have you found any difficulties? If yes, please present them.
6. Please list any other comments or suggestions that you have about the topics and the different forms of summaries generated for each topic. THANK YOU!!!!

³ Strictly speaking, we employ Likert-type scales but refer to Likert scales throughout the study.


Evaluation forms were handed out in hard copy with all summaries on a page with a link to the original document. We provide an example of the generated summaries given the query “library”, with the first two being treated as baselines:

Random: We offer outstanding courses at undergraduate and postgraduate level plus innovative courses to improve your language and study skills. Our graduate profiles keep you up-to-date on what your friends are doing. The Library provides ideal conditions for quiet study with 1,070 reader places. In addition, 110 networked PCs and terminals provide access to over 47,000 online journals, databases, e-books and library catalogues. South Essex College opening hours are from 8am to 9.30pm on Mondays to Thursdays, with a 5pm closure on Fridays and 9am to 1pm opening on Saturdays.

Centroid: Our range of undergraduate courses allow you to demonstrate your academic potential in a research-intensive environment. Undertake your research at the heart of our internationally-acknowledged and well-connected research community. We are one of the leading research-intensive universities in the UK, offering world-class teaching in a friendly, inclusive and international atmosphere. We offer outstanding courses at undergraduate and postgraduate level plus innovative courses to improve your language and study skills. Our centres provide a focus for cutting-edge research and innovative teaching by bringing our academics together to share their expertise and knowledge.

ACO: Extra copies of set texts and books we expect heavy demand for are available in a short loan collection and you can check the location of books on our online catalogue online. This includes the papers of the Social Democrat Party (SDP), QUALIDATA and the Boundary Commission for England and Wales. Recent additions include extensive papers relating to the work of Sigmund Freud, the historical collection of the Royal Statistical Society Library and the Library of the Essex Society for Archaeology and History.

ACO trimmed: Extra copies of set texts and books we expect heavy demand for are available in a short loan collection and you can check the location of books on our online catalogue online. The South Essex College library gives you access to a wide range of learning resources in a pleasant and safe working environment and includes specialised collections linked to your course. Recent additions include extensive papers relating to the work of Sigmund Freud, the historical collection of the Royal Statistical Society Library and the Library of the Essex Society for Archaeology and History.

ACO query refinements: We are in the top 10 for research in the UK and are number one for Social Sciences. Our courses cover everything from business management through to the social sciences and health care. In addition, 110 networked PCs and terminals provide access to over 47,000 online journals, databases, e-books and library catalogues. Recent additions include extensive papers relating to the work of Sigmund Freud, the historical collection of the Royal Statistical Society Library and the Library of the Essex Society for Archaeology and History.

Note that, in our studies, we decided not to work with gold-standard summaries (which would have allowed us to apply MUC/DUC metrics), as in the given context it is unclear what such a gold standard would be.

6.2 Results and Discussion

6.2.1 Overall Performance Comparison

Table 6.1 shows an example of the actual ratings by individual assessors, who evaluated the different summaries generated for the above example given the query “library”. We refer to individual people in the table as “H-1”, “H-2”, “H-3”, “H-4” and “H-5”, meaning “human-1” to “human-5”, respectively.

Table 6.1: Overall performance comparison on document “library”.

Systems              H-1  H-2  H-3  H-4  H-5
Random                1    2    1    2    2
Centroid              1    2    4    1    3
ACO                   3    3    2    3    2
ACO trimmed           3    4    4    2    4
Query refinements     4    5    4    4    3

In Table 6.2, for each summarisation method, we report the average ratings obtained from users. We visualise these results using the box plot in Figure 6.1. We can see that all variations of the ACO-based algorithm outperform the other alternatives in achieving a higher average rating. We employ a non-parametric measure (Friedman) to assess the variance by ranks for the ratings across the different summarisation methods. The Friedman test compares


the mean ranks for each of our five summarisation methods against the mean ranks for all of the remaining four methods. In addition, we report the z-score for the rating ranks. The z-score represents how much the rating ranks of a method deviate from the rating ranks of all other methods; its polarity indicates whether the difference is positive or negative, and its value indicates significance. The z-scores for each method are also included in Table 6.2 to indicate levels of significance.

Table 6.2: Overall Ratings on 10 Documents. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.

System                    Mean  (z)
Random                    1.56  (-8.52)
Centroid                  2.12  (-3.55)
ACO                       2.56  (0)
ACO trimmed               2.98  (3.28)
ACO query refinements     3.72  (8.78)

The results of our statistical analysis are as follows (Friedman χ² = 31.2821, df = 4, p ≪ 0.001). The small p-value (p ≪ 0.001) suggests that the choice of summarisation method has an effect on the users’ ratings. The post-hoc tests address the pairwise comparisons. Wilcoxon Signed Rank tests reveal that only two pairs of methods do not show significant differences, namely Centroid vs. ACO and ACO vs. ACO trimmed. We note that Random and Centroid are significantly worse than the average at a 95% confidence level. ACO is worse than the average but not significantly so at a 95% confidence level (z > −1.96). ACO trimmed and ACO query refinements perform significantly better than the average at a 95% confidence level.⁴ Table 6.3 has more details on the post-hoc tests. The box plot in Figure 6.1 summarises these results. We see that the scores of Centroid and Random do not overlap with ACO trimmed and ACO query refinements. We can also see clearly that the first two are significantly worse but the last two are significantly better.

⁴ Applying the rather conservative Bonferroni adjustment suggests that there are no significant differences between any pair of the methods.


Table 6.3: p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.

                    Centroid  ACO     ACO trimmed  ACO query refinements
Random              0.0149    0.0079  0.0058       0.0058
Centroid                      0.2930  0.0282       0.0057
ACO                                   0.0742       0.0057
ACO trimmed                                        0.0141

Figure 6.1: Box Plot of Overall Assessment of Summary Quality (average ratings per method: Random, Centroid, ACO, ACO trimmed, ACO query refinements).
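For reference, the testing procedure (a Friedman test across methods followed by pairwise Wilcoxon signed-rank post-hoc tests) can be reproduced along the following lines using SciPy; the rating arrays shown are placeholders, not our experimental data.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# One list of per-document ratings per summarisation method (placeholder values only).
ratings = {
    "Random":            [1.6, 1.4, 1.8, 1.5, 1.7, 1.4, 1.6, 1.5, 1.7, 1.4],
    "Centroid":          [2.0, 2.2, 2.1, 2.3, 2.0, 2.1, 2.2, 2.0, 2.3, 2.0],
    "ACO":               [2.5, 2.6, 2.4, 2.7, 2.5, 2.6, 2.5, 2.7, 2.6, 2.5],
    "ACO trimmed":       [3.0, 2.9, 3.1, 2.8, 3.0, 3.1, 2.9, 3.0, 3.1, 2.9],
    "Query refinements": [3.7, 3.8, 3.6, 3.9, 3.7, 3.6, 3.8, 3.7, 3.6, 3.8],
}

stat, p = friedmanchisquare(*ratings.values())
print(f"Friedman chi-square = {stat:.4f}, p = {p:.4g}")

# Post-hoc pairwise comparisons.
for a, b in combinations(ratings, 2):
    w_stat, w_p = wilcoxon(ratings[a], ratings[b])
    print(f"{a} vs. {b}: p = {w_p:.4f}")
```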

6.2.2 User Feedback

In addition to answering specific questions, users were encouraged to leave feedback. Most provided comments that will be beneficial for future work. When asked about the special characteristics of the summaries that they rated the best, one of the users wrote: “Because they gave sufficient information about the topic and the ideas are connected somehow”. Another user commented: “What I most like is that some of the summaries are related to the main topic and provide good information. On the other hand, some of the summaries are far away and disconnected from the main topic”. Somebody else mentioned: “They contain short sentences, they


make sense, they are easy to understand, they contain some useful examples and links that might help if people want rapid access to contacts addresses”. When asked why they disliked the summaries that were rated the worst, the comments were as follows: “Some of the information is not related to the topic, incomplete sentences, illogical sequences of the sentences”, “They are not related to the topic. They are talking about another topic. Sometimes, there are more than one topic in the summary. Sometimes, the sentences are not related to each other; each sentence stands alone with no relation to the prior or after sentence. Sometimes, there is no logic in the sequence of the sentences”, and “The first thing is those summaries are not connected to the main topic. The second thing is even if they related to the main topic but the flow or the structure are not well”. The evaluators were also asked about the main criteria that they considered during the rating process. Some of them were looking for linked sentences and summaries that had enough information. One wrote: “The most important thing to me was the flow of the sentences (Are they logically order? Does each sentence lead to and introduce the following one? Or are they scattered)”. Another commented: “The sentences should be complete, containing only useful and related text, the sentences containing in summary should also be coherent, the sentences should be short and useful, not containing a lot of repeated information”. When asked about the size of the generated summaries, most of the evaluators were satisfied with the current summary length; some of them suggested having at most five to seven short sentences in each summary. The evaluators reported that the rating process itself was not particularly difficult. Finally, the evaluators were asked to leave any comments or suggestions on the overall summaries. One user mentioned: “Some of the summaries contain repetition for sentences. And some other has no linkage between sentences. Some of the summaries start talking about a certain topic and suddenly mention something completely different”.


Having reviewed the evaluators’ feedback, we can use it to improve our work in further experiments.

6.3 Concluding Remarks

This initial experiment with profile-based summarisation gives a clear indication of the potential that ACO-based models offer. We will now move on to the next chapter, which presents a pilot study on profile-based summarisation.

7 A Pilot Study on Profile-Based Summarisation

Based on the findings of the previous scoping study, we further explore the usefulness of the acquired profile to generate profile-based summaries. In this pilot study, we create summaries as if they were generated in a navigation context; more specifically, we generate ‘query-based’ summaries using the title of the relevant page (see Section 3.2). We evaluate a wider range of profile-based techniques using both single-document and multi-document summarisation.



7.1 Experimental Setup

The experimental setup is similar to the scoping study. However, while the scoping study was only meant to give indicative results, we now apply stricter selection criteria for the initial dataset to make the experiment objective and replicable. As before, we expect that the biggest potential for profile-based summarisation lies in applying the technique when the context provides a rich profile. We start again with frequently submitted queries and then identify the top matching document for each query, as returned by the Google search engine (for example, by submitting the query ‘timetable site:essex.ac.uk’ to ‘http://www.google.com/’). We then recruited human subjects to evaluate the summaries. The evaluators were in two groups: local users and remote users, specifically Amazon Mechanical Turk (MTurk) workers.

1. Local Users: Given the site-specific focus of this work, we recruited subjects to represent our target users: students of our institution. These represent typical users of the local Web site. To recruit a range of different types of student users, and to help avoid bias in the selection process, we sent an e-mail to the local university mailing list (a list that is used to send out “small ads” to students) and selected the first ten volunteers who replied. They were informed in advance that they would receive £10 for participating. Users were provided with hard copies of all of the original documents (the ones for which we generated summaries) and asked to read them before they rated the summaries on an evaluation form.

2. Web Users: We also recruited ten subjects from an online workforce service, namely, MTurk. We gave MTurk workers an electronic version of the evaluation form and links to the original documents. They were asked to read the documents first and then complete the form to provide us with their ratings. Subjects were paid $10 each. For the local users, the evaluations typically took 30–40 minutes.


We use ACO models based on the same general principles as those used in Chapter 6 but suited to a navigation context, and we use them both in an SDS and in an MDS setting. We use the best-performing ACO method, which is why we did not use the ACO trimmed model here. We used the following six methods to create the summaries, with the first three being treated as baselines:

1. Random SDS: (“Baseline-1”), a random selection of sentences from the document (a baseline that is also employed by Yan et al. [2011]).

2. Centroid SDS: (“Baseline-2”), a centroid-based approach to summarisation [Radev et al., 2004]. This algorithm takes into account the following parameters: centroid value, positional value, and first-sentence overlap, in order to generate a summary.

3. Centroid MDS (all documents): (“Baseline-3”), a centroid-based approach for summarising a collection of related documents. We retrieve all documents to which the document at hand has links, create a meta-document by concatenating all these documents, and then apply “Baseline-2” to this meta-document.

4. ACO title-based SDS: ACO-based, single-document summarisation, using the document title, as described in Section 3.2.1.

5. ACO title-based MDS (first-five documents): ACO-based, multi-document summarisation as described in Section 3.2.2 but only considering the first five outgoing links in the document.

6. ACO title-based MDS (all documents): ACO-based, multi-document summarisation as described in Section 3.2.2, considering all outgoing links in the document.

Obviously, there are many alternative summarisation approaches that could have been chosen. It could be argued that other query-dependent summarisation methods should have been included in the methods under investigation, for example using a raw query (the document title as a query without the expansion made by ACO), or comparing our summaries with snippets


generated by a search engine (as the documents we collected are returned by Google, these would be Google snippets). This is a valid argument, as anything we conclude from the experiment could be due to the specific query-dependent summarisation method rather than the actual profile. We decided to apply ACO for the reasons given in Chapter 4 and leave the exploration of alternative models for future studies. We adopted the same evaluation framework used in the scoping study, which has been taken from Yan et al. [2011], to assess the quality of summarised documents and allow a comparison between methods. Again, subjects were asked to assess the summaries on a 5-point Likert scale, where 5 was ‘excellent’; 4, ‘good’; 3, ‘average’; 2, ‘bad’; and 1, ‘terrible’. The total number of generated summaries for the 10 documents was 59, because there was an empty summary for one document (ACO title-based SDS); the summary was empty because there were no matching sentences. We excluded this summary from the evaluation form and assigned it a rating of 1 (‘terrible’). Hence, although each subject rated 59 summaries, we have ratings for all 60 summaries in the follow-up analysis. The results of the different methods when applied to the University of Essex Accommodation homepage¹ are presented below:

Random: Online applications are now open for all new undergraduate students for 2014-15. We offer a range of support services for students to help with everyday worries about your accommodation or for practical or personal issues. Where you live is a very important part of your life at university. Applications are now open for all returning undergraduate and postgraduate students for 2014-15. Applications have now re-opened, we apologise for any inconvenience caused. Construction work is underway near the residences and may cause some noise disruption to residents during 2013-14.

Centroid SDS: Where you live is a very important part of your life at university. Our accommodation provides the perfect base from which to start this new and exciting period in your life. Online

¹ http://www.essex.ac.uk/accommodation/


applications are now open for all new undergraduate students for 2014-15. Applications are now open for all returning undergraduate and postgraduate students for 2014-15. Applications have now re-opened, we apologise for any inconvenience caused. We are building exciting new facilities for our students at our Colchester campus. Construction work is underway near the residences and may cause some noise disruption to residents during 2013-14.

Centroid MDS (all documents): Where you live is a very important part of your life at university. Our accommodation provides the perfect base from which to start this new and exciting period in your life. If you have withdrawn temporarily from the University (intermitted) during the first term of the first year of study, and decide to return to the University the following academic year, you may re-apply for accommodation. Being an Essex student offers a unique experience of living and learning, sending you on an unforgettable journey that will shape your future.

ACO title-based SDS: Where you live is a very important part of your life at university. Online applications are now open for all new undergraduate students for 2014-15. Applications are now open for all returning undergraduate and postgraduate students for 2014-15. We are building exciting new facilities for our students at our Colchester campus. Keep up to date with the latest accommodation updates and offers from across our Colchester and Southend Campuses. We offer a range of support services for students to help with everyday worries about your accommodation or for practical or personal issues.

ACO title-based MDS (first-five documents): All full-time, fully registered, postgraduate students studying at Colchester or Southend campuses are guaranteed single accommodation for their first year of study, provided they return their application and GBP250 prepayment deposit after applications open in April 2014. All full-time fully-registered undergraduate students studying at either our Colchester or Southend Campuses are entitled to single accommodation for the first year of study as long as they return their application by the deadline date, Friday 22 August 2014. If you have a sports bursary and want to apply for accommodation, please contact the Sports Centre direct.

ACO title-based MDS (all documents): All full-time, fully registered, postgraduate students studying at Colchester or Southend campuses are guaranteed single accommodation for their first year of study, provided they return their application and GBP250 prepayment deposit after applications open in April 2014. All full-time fully-registered undergraduate students studying at either our Colchester or Southend Campuses are entitled to single accommodation for the first year of study as long as they return their application by the deadline date, Friday 22 August 2014. This is normally available through the myEssex Applicant Portal.


7.2 Results and Discussion

7.2.1 Overall Performance Comparison

In Table 7.1, for each summarisation method, we report the average ratings obtained from each group of users (local users and Web users). We visualise these results using the box plot in Figure 7.1. In both user groups, we can observe that all variations of the ACO-based algorithm outperform the other alternatives in achieving a higher average rating. We apply the same statistical tests used in the scoping study. We employ a non-parametric measure, the Friedman test, to assess the variance by ranks for the ratings across the different summarisation methods in each group of users. The Friedman test compares the mean ranks for each of our six summarisation methods against the mean ranks for all of the remaining five methods. In addition, we report the z-score for the rating ranks. The z-score represents how much the rating ranks of a method deviate from the rating ranks of all other methods; its polarity indicates whether the difference is positive or negative, and its value indicates significance. The z-scores for each method are also included in Table 7.1 to indicate levels of significance.

Table 7.1: Overall Ratings on 10 Documents. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.

                                Local Users      Web Users
System                          Mean  (z)        Mean  (z)
Random                          1.86  (-9.39)    1.71  (-9.89)
Centroid SDS                    2.51  (-5.44)    2.62  (-5.31)
Centroid MDS (All)              3.04  (-0.89)    3.06  (-2.58)
ACO title-based SDS             3.47  (2.69)     3.43  (1.13)
ACO title-based MDS (First 5)   4.52  (9.88)     4.49  (9.83)
ACO title-based MDS (All)       3.62  (3.15)     4.16  (6.81)

In the following, we detail the results for each group of users:


1. Local Users (Friedman χ² = 33.7391, df = 5, p ≪ 0.001): The small p-value (p ≪ 0.001) suggests that the choice of summarisation method has an effect on the users’ ratings. The post-hoc tests address the pairwise comparisons. The Wilcoxon Signed Rank tests reveal that only five pairs of methods do not show significant differences, namely Centroid SDS vs. Centroid MDS (All), Centroid SDS vs. ACO title-based SDS, Centroid MDS (All) vs. ACO title-based SDS, Centroid MDS (All) vs. ACO title-based MDS (All), and ACO title-based SDS vs. ACO title-based MDS (All). We note that Random and Centroid SDS are significantly worse than the average at a 95% confidence level. Centroid MDS (All) is worse than the average but not significantly so at a 95% confidence level (z > −1.96). All ACO variations perform significantly better than the average at a 95% confidence level.² Table 7.2 has more details on the post-hoc tests.

2. Web Users (Friedman χ² = 43.3871, df = 5, p ≪ 0.001): All differences are significant except Centroid SDS vs. Centroid MDS (All), and Centroid MDS (All) vs. ACO title-based SDS. Random, Centroid SDS and Centroid MDS (All) are significantly worse than the average at a 95% confidence level. In line with the results for Local Users, all ACO variations are significantly better than the average at a 95% confidence level.³ Table 7.3 gives detailed post-hoc test results.

Figure 7.1 illustrates the “missing summary”: under ACO title-based SDS we see an outlier for both Local and Web Users, representing the document for which no summary was generated and which we therefore penalised in order to have a fair comparison. The summaries actually generated by this method were judged much more positively. We also compare the ratings of the two user samples. Mann-Whitney U tests between each pair of corresponding methods applied to the two user groups do not result in any significant difference.

² Applying Bonferroni adjustment suggests that there are only two significant differences, namely between Random vs. ACO title-based MDS (First 5), and Centroid SDS vs. ACO title-based MDS (First 5).
³ A Bonferroni adjustment results in only one significant difference, between Centroid MDS (All) and ACO title-based MDS (First 5).


Table 7.2: Local Users: p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.

                     Centroid SDS  Centroid MDS (All)  ACO SDS  ACO MDS (First 5)  ACO MDS (All)
Random               0.0127        0.0092              0.0108   0.0019             0.0059
Centroid SDS                       0.0526              0.0664   0.0019             0.0365
Centroid MDS (All)                                     0.2616   0.0059             0.1055
ACO SDS                                                         0.0107             0.7211
ACO MDS (First 5)                                                                  0.0140

Table 7.3: Web Users: p-Values of Wilcoxon Signed Rank Post-Hoc Pairwise Tests.

                     Centroid SDS  Centroid MDS (All)  ACO SDS  ACO MDS (First 5)  ACO MDS (All)
Random               0.0142        0.0059              0.0059   0.0059             0.0059
Centroid SDS                       0.0502              0.0488   0.0059             0.0059
Centroid MDS (All)                                     0.1522   0.0019             0.0089
ACO SDS                                                         0.0092             0.0091
ACO MDS (First 5)                                                                  0.0142

We can conclude that general Web users’ assessments of the quality of summaries are consistent with the assessments of local users. This is interesting, given the domain-specific nature of the documents. It might indicate that the algorithm, that is, incorporating usage data, is what makes the difference, rather than the cohort. Looking at finer-grained classes of cohorts than just local vs. non-local users should uncover which groups (if any) benefit most from the proposed approach, or whether it is really the usage data as such that makes the difference.

Figure 7.1: Box Plot of Overall Assessment of Summary Quality (ratings per method, shown separately for Local Users and Web Users; methods: Random, Centroid SDS, Centroid MDS (All), ACO SDS, ACO MDS (First 5), ACO MDS (All)).


The main conclusions we can draw from this pilot study are that all variations of the ACO-based algorithm outperform the other methods, and that multi-document summarisation offers the biggest potential, in particular when only the first five outgoing links are used to generate the summary. In other words, not simply relying on the target document but also incorporating “related” documents can improve the quality of the summary. Future work could look into employing incoming rather than outgoing links. As we have seen, summaries provided by ACO title-based MDS are rated higher. Our intuition is that MDS draws in additional information that is complementary to the content of the target Web page, as illustrated by the accommodation examples. We do not, however, commit to a hypothesis about this without stronger justification, and simply use the results to inform us about which algorithms to choose in the task-based evaluations (i.e., MDS (first five documents) rather than MDS (all documents)). A different reading would be to conclude that query- or profile-based summaries are better than query-independent summaries. This does not contradict our findings about the potential of applying ACO for profile-based summarisation, although further experiments could explore how well this methodology scores against alternative query- or profile-based approaches.

7.2.2 User Feedback

Besides answering particular questions (the same as in the previous scoping study), users were encouraged to leave feedback that would be helpful for future work. When asked about the special characteristics of the summaries that they rated the best, one of the users wrote: “A detailed information are given, important links are provided , and important information are highlighted”. Another user commented: “I


preferred the ones which actually summarized the page (obviously) but I most valued those that didn’t seem to hack together bits and pieces into one incoherent jumble, and those that didn’t include bits that referred to other bits that were NOT included in the summary. I liked the ones that correctly included description text often included near the ‘headers’, but understandably, not all websites use this space for descriptions, so it may not be viable to use in such a tool”. Somebody else said: “The best summaries clearly explain, what is there in the main content. No other additional words will be there. From the summary itself, one can easily understand the information given in the main page. I like the way of expressing the summaries of the main content”. The other users mentioned: “Some of them just extracts the content what they mean to deliver, this is what I like, but in some summaries apart from the content unnecessary things are given”, and “Many of my ratings were based on how cohesive the sentence structures were and the word choice. The best ones were to the point and didn’t attempt to convey information through a first person perspective. The ones that I didn’t like were the ones that were too short or had too much unnecessary information”. When asked why they disliked the summaries that were rated the worst, the comments were as follows: “It does not provide sufficient amount of information. In some cases, there are abrupt, missing details that I think should be added to the summary”, “Summaries that hacked together unrelated bits and pieces were among the worst, particularly when they lead to lines similar to ‘look here for so and so:’ and then skipped right to the next topic. Also, the summaries which seemed to only include the footer copyright information”, “Those summaries will not be much related to the other summaries, and it will not be related to the main content. Unnecessary contents will be there which was not needed for the summary”, “The ones that I believed were the worst had poor word choice and sentence structure. The overall theme and messages conveyed did not reflect the authority of a professional institution”, and “They contained a lot of irrelevant info, mostly useless”. The evaluators were also asked about the main criteria that they considered during the

Chapter 7. A Pilot Study on Profile-Based Summarisation

116

rating process. One of them commented: “Providing most important information given in the website”. Another wrote: “In order of importance: whether or not it actually summarized what the page was about, coherence, and flow. It had to have accurate describe [sic] the page’s contents, it had to make sense, and it had to flow in a way that didn’t feel like an ADHD trip”. The others commented: “The criteria were that the summaries should contain the descriptions clearly and if any other additional things are there in summaries, that will not be considered”, “Summary is a precise content extracted from a explanation. That’s what I expected. The exact piece of information extracted from the details”, “Short summary; well formed; main information emphasized;”, “Straight and to the point messages. Summaries that did not leave out important information such as navigation, or registration links. Ones that did not try too hard to make it seem as if it was written through a first person perspective”, and “I was looking for a fluent text, which had all the relevant and useful info. I would try to avoid copyright stuff if possible”. When they asked about the size of the generated summaries, one said: “To be honest , some of the summaries are very detailed and have large amount of information. I prefer summaries which include concise information , thus , size of summaries might be less important”. Another wrote: “I would prefer the shorter summaries, but it seemed that very few (if any) of them grabbed an accurate summary. Longer ones also seemed to toss in questionably relevant material. I’d aim for something in the middle to be safe from both ends”. The others said: “The size was little complicated. It should be nearly 4-5 lines, which will be sufficient for the summaries”, “Size of the generated summaries is quite good, it can be 3-5 sentences”, and “The sizes were adequate, but some were a bit lengthy and some were way too short. I personally preferred the larger summaries. The ones that were longest however did not capture my attention as much as medium sized summaries”. However, when they asked about the rating process, the comments were as follows: “It was pretty easy, but a few of them I couldn’t decide if the rating was right or not but

Chapter 7. A Pilot Study on Profile-Based Summarisation

117

mostly because it was the actual source page that was ambiguous”, “No it was not easy to do the rating process, because two or more sentences resemble like the same sentences and some sentences were not related to the main content, so it was bit difficult”, “The rating process is easy. I didnt found [sic] any difficulties”, and “Some of the summaries were a bit hard to understand because of the formatting of the text, otherwise it was fairly easy to rate the summaries”. At the end, the evaluators were asked to leave any comments or suggestions on the overall summaries. One user mentioned: “My only comment would be that I noticed that one, and only one, summary grabbed the bolded text that was directly onto a bolded title on the page. That was one of the better summaries, so whatever it did there, it should do some more. Document 9, summary 2, I believe”.

7.3 Concluding Remarks

The pilot experiment on profile-based summarisation reinforces the results obtained from the initial experiment, which showed that the ACO-based approaches outperform alternative approaches. In the next chapter, we discuss the second set of experiments, which investigate whether the results obtained in this pilot study can also be demonstrated in actual navigation applications to assist users in navigation tasks. These experiments are assessed using task-based evaluations instead of human ratings.

8 Task-Based Evaluations

The findings of the scoping and pilot studies suggest that profile-based summarisation has the potential to generate document summaries that are better than summaries that do not use a profile, or, more conservatively, that incorporating usage data from a Web site helps make better summaries than not using any usage data. We wish to determine whether the apparent improvement in summarisation quality has a measurable effect in the context of a navigation task. More specifically, when using summaries to assist in a navigation task, we wish to know whether profile-based summaries (applying an ant colony optimisation approach) allow users to find information more easily and quickly with improved levels of satisfaction.

This chapter gives a description of our data sets in Section 8.2. Then, we discuss the task-based evaluation framework we used to assess whether such summaries (the best-performing profile-based summarisation methods in Chapter 7) can be demonstrated in actual navigation applications and could be helpful in a realistic navigation setting. We present the main components of the task-based evaluations, including the questionnaires we used in Section 8.3, the protocol we followed and the search tasks we used in Section 8.4, and how we recruited our subjects in Section 8.5. The statistical tests applied are presented in Section 8.6. This chapter also includes two main sections that present our two task-based experiments in detail: ‘Standard Web site versus Single and Multi-Document Profile-Based Summarisation’ in Section 8.7 and ‘Generic versus Single and Multi-Document Profile-Based Summarisation’ in Section 8.8. This is followed by a detailed discussion in Section 8.9.

8.1 Overview

Task-based evaluations aim to simulate real tasks as closely as possible while assessing various aspects of users’ performance. Here, we evaluate a navigation task to determine the utility of different summarisation methods. We adopt a standard approach to task-based evaluation [Diriye et al., 2010; Kelly, 2009; Yuan and Belkin, 2010] using laboratory experiments and a within-subjects design to compare three systems. We broke the evaluations into two experiments involving different baselines, because it is difficult to compare more than three systems with the approach that we adopted. The experiments were primarily concerned with a navigation task, but subjects were also free to use a local search engine that was part of the Web site. In line with Kelly [2009], we chose 18 subjects, each of whom attempted to complete six search tasks. These tasks required the subjects to find documents that are relevant to predetermined topics. Each subject was expected to complete two different tasks on each system. In accordance with TREC-9 Interactive Track guidelines [Hersh, 2002; Hersh and Over, 2001], subjects had 10 minutes for each task. They were asked to complete standardised questionnaires. Based on the discussion of possible evaluation methods in Section 2.7, we decided on extrinsic (task-based) evaluation using simulated work tasks with real users. To determine whether any effects stem from the mere presence of summaries rather than from the particular algorithms used to generate them, we abstract from the intrinsic quality of a summary in the task-based evaluations and instead focus on the overall utility of the different summarisation methods, measuring completion time among other metrics as the main benchmarks. In addition, we have compiled and included qualitative feedback from users that reflects some of the issues that people had with the summarisation methods applied.

8.2 Datasets

A copy of the existing Web site was created by crawling the main Web site of the University of Essex with Nutch (http://nutch.apache.org/). We limited the crawl to a maximum depth of 15. The crawl for Essex was performed in November 2013. The resulting dataset contains 110,369 documents. We also simulated the intranet search engine of the institution on our machines using Apache Solr (http://lucene.apache.org/solr/) so that if users decided to use search rather than navigation we would still be able to track their steps. A total of more than 1.5 million queries, described in more detail elsewhere [Kruschwitz et al., 2013], were extracted from the logs of the existing site search engine over a period of 3 years (20 November 2007 till 19 November 2010) to build our model. No click-through information was exploited; this makes the techniques simpler and easier to replicate. In any case, click-through information was not available for much of the log data collected for the model.
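To make the preprocessing step concrete, the sketch below shows how such a query log could be grouped into sessions before a profile is built from it. The log format (tab-separated timestamp, session identifier and query string) and the function names are assumptions made purely for illustration; the actual log format differs, and the construction of the ACO-based profile itself is described elsewhere in this thesis.

```python
from collections import defaultdict
import csv


def load_sessions(log_path, delimiter="\t"):
    """Group logged queries into per-session lists.

    Assumes a simple tab-separated log with one query per line:
    timestamp, session_id, query. This is an illustrative format,
    not the format of the log files actually used.
    """
    sessions = defaultdict(list)
    with open(log_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) < 3:
                continue  # skip malformed lines
            timestamp, session_id, query = row[0], row[1], row[2].strip().lower()
            if query:
                sessions[session_id].append((timestamp, query))
    return sessions


def session_statistics(sessions):
    """Average queries per session and average query length in words."""
    n_queries = sum(len(qs) for qs in sessions.values())
    avg_per_session = n_queries / len(sessions) if sessions else 0.0
    lengths = [len(q.split()) for qs in sessions.values() for _, q in qs]
    avg_query_len = sum(lengths) / len(lengths) if lengths else 0.0
    return avg_per_session, avg_query_len
```

Simple statistics such as the average number of queries per session and the average query length, which we return to when discussing the characteristics of site logs in Chapter 9, can then be read directly off the grouped sessions.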

8.3 Questionnaires

We used questionnaires based on those suggested by the TREC-9 Interactive Track guidelines (http://www-nlpir.nist.gov/projects/t9i/qforms.html). We made some very minor modifications to the entry questionnaire to reflect the terminology of the British education system, and included Google as an example of a Web search engine (see the appendices). A 5-point Likert scale was used where appropriate. Note that for the experiments reported here, the term “search” on these standardised questionnaires is intended to be interpreted as “search by navigation”. We used the following questionnaires:

1. Entry questionnaire, which collects demographic and Internet usage information.

2. Post-search questionnaire, which is used to assess a user’s perspectives on the systems and the tasks.

3. Post-system questionnaire, which is used to capture the user’s perceptions of each of the systems.

4. Exit questionnaire, which focuses on a comparison between the three systems.

8.4 Protocol and Search Tasks

We followed the procedure adopted in Craswell et al. [2003] to guide the subjects during the task-based evaluation. The experiments were conducted in an office in a one-on-one setting. We had three systems and six tasks, and their orders were rotated and counterbalanced. At the start of each session, subjects were asked first to complete an entry questionnaire. After that, subjects were given 5 minutes’ introduction to the three systems — without being told anything about the technology behind them. Then, each subject was expected to perform two different tasks on each system. The interactions with the system were logged electronically. To finish a task, subjects were instructed to click on a link that was provided. This would record the page they found in the log of the experiment, together with a time stamp. After completing each task, subjects were asked to complete a post-search questionnaire. When they had finished both tasks on one system, they were asked to complete a post-system questionnaire. Finally, when subjects finished all the tasks, they were asked to complete an exit questionnaire.

To make the tasks realistic, they were designed using information about commonly submitted queries, as recorded in the search logs of the existing Web site. Constructing tasks using query logs is a common approach [Kelly, 2009]. The tasks had actually been constructed for a previous experiment within the same domain [Adindla, 2014; Adindla and Kruschwitz, 2013] to allow cross-study comparisons. They have some similarity with Borlund’s “simulated work tasks” [Borlund, 2000], and also take into account the guidelines suggested by Kules and Capra [2008].

We assume that users with common, frequently occurring information needs will benefit most from the framework proposed here: frequent queries submitted on a Web site tend to make up a large proportion of all submitted queries [Kruschwitz et al., 2013, for example]. Given the “long tail” of rare queries in a live system, we might wish to ignore such queries in any domain model. Here, this is achieved as a side-effect of the “evaporation factor” in the ACO model.

Table 8.1: A Basic Design with Graeco-Latin Square Rotation for Topic and Interface [Kelly, 2009].

Subjects   Time 1      Time 2      Time 3
S1         I1: 1, 2    I2: 3, 4    I3: 5, 6
S2         I1: 2, 3    I2: 4, 5    I3: 6, 1
S3         I1: 3, 4    I2: 5, 6    I3: 1, 2
S4         I1: 4, 5    I2: 6, 1    I3: 2, 3
S5         I1: 5, 6    I2: 1, 2    I3: 3, 4
S6         I1: 6, 1    I2: 2, 3    I3: 4, 5
S7         I2: 1, 2    I3: 3, 4    I1: 5, 6
S8         I2: 2, 3    I3: 4, 5    I1: 6, 1
S9         I2: 3, 4    I3: 5, 6    I1: 1, 2
S10        I2: 4, 5    I3: 6, 1    I1: 2, 3
S11        I2: 5, 6    I3: 1, 2    I1: 3, 4
S12        I2: 6, 1    I3: 2, 3    I1: 4, 5
S13        I3: 1, 2    I1: 3, 4    I2: 5, 6
S14        I3: 2, 3    I1: 4, 5    I2: 6, 1
S15        I3: 3, 4    I1: 5, 6    I2: 1, 2
S16        I3: 4, 5    I1: 6, 1    I2: 2, 3
S17        I3: 5, 6    I1: 1, 2    I2: 3, 4
S18        I3: 6, 1    I1: 2, 3    I2: 4, 5

The tasks assigned were randomised using a Graeco-Latin square design [Kelly, 2009] to avoid task bias and potential learning effects (see Table 8.1).4 The tasks were constructed based on the following queries: “accommodation”, “undergraduate”, “funding”, “postgraduate”, “staff” and “international”:

Task-1 You have been accepted for a place at the University of Essex at the Colchester campus. Find information on the residences, accommodation information for freshers, contact details and other useful information.

Task-2 You are interested in applying for an undergraduate course in the School of Computer Science and Electronic Engineering (CSEE). Find information for prospective students and the courses being offered within the school.

Task-3 You are going to be a new postgraduate student at the University of Essex. You need to locate a page with useful information about tuition fees and possible funding offered by the university.

Task-4 Find documents that allow you to download the postgraduate prospectus and view maps of the University of Essex campuses.

Task-5 You need to find out who is the head of the Department of Philosophy at the University of Essex, and a list of people or office holders within the school.

Task-6 You have registered as an international student at the University of Essex. You would like to locate information regarding new and current students, freshers information and administration information. Locate documents to help you look for the right information.

4 We adopted a previously proposed Graeco-Latin square design but note that the design does not actually completely balance the interface orders.
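As an illustration of the rotation used above, the following sketch generates an assignment equivalent to Table 8.1. The function name and the block-wise construction are illustrative choices; the design itself was taken from Kelly [2009] rather than generated programmatically.

```python
def rotation_schedule(n_subjects=18, n_interfaces=3, n_tasks=6, tasks_per_slot=2):
    """Generate a rotation equivalent to Table 8.1.

    Interfaces rotate in blocks of six subjects; the task index advances
    by one per subject and by two per time slot, so every subject sees
    two distinct tasks per interface. This reproduces the specific design
    shown above (which, as noted, does not fully counterbalance interface
    order); it is not a general Graeco-Latin square constructor.
    """
    schedule = []
    for s in range(n_subjects):
        block = s // (n_subjects // n_interfaces)          # 0, 1 or 2
        row = []
        for slot in range(n_interfaces):
            interface = (block + slot) % n_interfaces + 1
            start = s % n_tasks + slot * tasks_per_slot
            tasks = [(start + k) % n_tasks + 1 for k in range(tasks_per_slot)]
            row.append((f"I{interface}", tasks))
        schedule.append((f"S{s + 1}", row))
    return schedule


if __name__ == "__main__":
    for subject, slots in rotation_schedule():
        print(subject, " ".join(f"{i}:{t}" for i, t in slots))
```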

8.5 Subjects

In order to obtain a good selection of different types of users, and to avoid bias in the selection process, we sent an e-mail to the same local mailing list (i.e., “small ads” for students) and, for each experiment, we selected the first 18 volunteers who replied. None of the subjects took part in more than one of our studies. The background of the subjects was mixed, both in the level of their degree (Bachelor, Master and PhD) and discipline (Law, History, Computer Science, Linguistics, Electronic Engineering and Economics). Subjects were informed in advance that they would be reimbursed for participation.

8.6 Significance Tests

Figure 8.1 shows a summary of the different statistical tests we have used, including the ones used in the scoping and pilot studies. The main idea we followed is that when comparing algorithms/systems we first check for a main effect and then apply appropriate post-hoc tests. In the following task-based experiments, unless otherwise specified, we apply one-way ANOVA (at significance level 0.05) as a parametric measure to test for statistical significance (main effects). Where appropriate, we use Tukey’s HSD tests for post-hoc analysis (at significance level 0.05). For Likert-scale data, we apply Kruskal-Wallis tests with Mann-Whitney U tests for post-hoc analysis also at significance level 0.05.

[Figure 8.1 shows a decision tree (all tests at significance level 0.05): for interval/ratio data in the task-based experiments, parametric one-way ANOVA with Tukey’s HSD post-hoc tests; for ordinal Likert-type data in the task-based experiments, non-parametric Kruskal-Wallis tests with Mann-Whitney U post-hoc tests; for the repeated-measures scoping and pilot experiments, Friedman tests with Wilcoxon Signed-Ranks post-hoc tests.]

Figure 8.1: Summary of The Statistical Tests Used on the Experiments.
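The following is a minimal sketch of how this test battery could be applied with SciPy and statsmodels. The sample data, function names and printing are illustrative only; they are not the analysis scripts used for the results reported below.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd


def compare_completion_times(times_a, times_b, times_c, alpha=0.05):
    """Interval data (e.g., completion times in seconds per task):
    one-way ANOVA for the main effect, Tukey's HSD for post-hoc pairs."""
    f_stat, p_main = stats.f_oneway(times_a, times_b, times_c)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_main:.4f}")
    if p_main < alpha:
        values = np.concatenate([times_a, times_b, times_c])
        groups = ["A"] * len(times_a) + ["B"] * len(times_b) + ["C"] * len(times_c)
        print(pairwise_tukeyhsd(values, groups, alpha=alpha))


def compare_likert(ratings_a, ratings_b, ratings_c, alpha=0.05):
    """Ordinal (Likert) data: Kruskal-Wallis, then pairwise Mann-Whitney U
    (a Bonferroni correction of the resulting p-values would be applied
    in practice, as described in the text)."""
    h_stat, p_main = stats.kruskal(ratings_a, ratings_b, ratings_c)
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_main:.4f}")
    if p_main < alpha:
        pairs = {"A vs B": (ratings_a, ratings_b),
                 "A vs C": (ratings_a, ratings_c),
                 "B vs C": (ratings_b, ratings_c)}
        for name, (x, y) in pairs.items():
            u_stat, p = stats.mannwhitneyu(x, y, alternative="two-sided")
            print(f"{name}: U = {u_stat:.1f}, p = {p:.4f}")
```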

8.7 Standard Web site versus Single and Multi-Document Profile-Based Summarisation

We compare three different systems, two of them offering summarised documents as additional navigation support (profile-based single- and multi-document summarisation, respectively), the baseline being the existing Web site without alteration (as that is the system that users access at the moment). System A is a baseline system that does not use summarisation. System B and System C both use pop-up windows to display profile-based summaries; Figures 8.2 and 8.3 show example screenshots of System B and System C. Summaries are presented as soon as a user hovers the mouse over a link and disappear when the mouse is moved away. In this experiment, we compare the following three systems:

1. System A: A snapshot copy of the existing site, with no alterations. This is the system that users normally use to find information about the university. It serves as the baseline.

2. System B: This system adds a layer of multi-document summarisation (MDS — see Section 3.2.2) on top of System A. The summaries are built using the first five outgoing links of the hyperlinked documents. We use ACO title-based MDS (first five documents), the best-performing MDS method in our pilot study.

3. System C: This is similar to System B but uses single-document summarisation (SDS — see Section 3.2.1) of the hyperlinked documents. We use ACO title-based SDS, the best-performing SDS method in our pilot study.

Both System B and System C use pop-up tool-tips to display their summaries of hyperlinked documents (as illustrated in Figures 8.2 and 8.3). Summaries for the hyperlinked document (either MDS or SDS) are presented as soon as a user hovers the mouse over the link to that document. The summaries disappear when a user moves the mouse away from the link. If a query has been submitted, query-term highlighting is applied to the summaries in Systems B and C, in line with White et al. [2003], as are the terms extracted from the document title. Note that the university style template is removed before any summarisation is performed.

Figure 8.2: System B (Applying MDS).

Figure 8.3: System C (Applying SDS).

In presenting our results, we start with statistics derived from the logs and then look at the questionnaires that our subjects completed.
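As a concrete illustration of the last point, the sketch below removes a shared page template before extracting the text that is passed to a summariser. The selectors listed are placeholders; the actual template elements of the crawled site would have to be identified by inspection, and the implementation used in this thesis may have done this differently.

```python
from bs4 import BeautifulSoup

# Element names/classes below are illustrative placeholders; the real
# template of the crawled site would need to be inspected to pick the
# right selectors.
TEMPLATE_SELECTORS = ["header", "footer", "nav",
                      "#global-navigation", ".breadcrumbs", ".site-footer"]


def strip_site_template(html):
    """Return the page-specific text of a document with the shared site
    template (navigation, banners, footer) removed, so that the summariser
    only sees content that distinguishes this page from others."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in TEMPLATE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```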

8.7.1 Subjects

Out of 18 participants, 8 were male and 10 were female. Their ages ranged from 19 to 42 (average age: 25.78). Participants were a mix of undergraduate and postgraduate students with different disciplines; six subjects had taken part in previous studies on online searching (but not in any of our studies). All subjects declared that they use the Web on a regular basis. The average time subjects have been performing online search is 9.61 years (10 of them between 3 and 20 years). When asked how often they searched for information online, 16 of the participants selected “daily”. Note that our users (who we would consider typical target users of the system) tend to have a lot of experience using Web search systems (mean: 5.00 — on a 5-point Likert scale, where 1 means “none” and 5 means “a great deal”) but much less experience using commercial search engines (mean: 2.89).5

5 This may look like an anomaly; one explanation is the fact that the relevant question in the TREC-9 Interactive Track entry questionnaire reads as follows: “How much experience have you had searching on commercial online systems (e.g., BRS Afterdark, Dialog, Lexis-Nexis)?”

8.7.2 Average Completion Time

Average completion time and number of interactive actions have been commonly used as metrics to compare different interactive information systems [Kelly, 2009; Pitkow et al., 2002, for example]. Table 8.2 reports the average completion time (and standard deviation) broken down for each task. We measured the time between presenting the task to the users and the submission of the result. There were no cases in which users did not submit an answer. An analysis of variance showed that the effect of completion time per task was significant (ANOVA F = 8.01, df = 2, p < 0.01). Pairwise post-hoc Tukey tests reveal that two of the comparisons are significant at p < 0.05, namely, the average time spent on a task on System B and C was significantly shorter than on the baseline.

Table 8.2: Experiment 1: Average Completion Time of Task (in Seconds). Values are Mean (SD).

System   Task 1        Task 2        Task 3       Task 4         Task 5        Task 6         Overall
A        166 (73.29)   193 (113.26)  144 (26.31)  255 (137.30)   277 (61.05)   317 (105.85)   225 (112.35)
B        161 (62.40)   181 (93.41)   128 (42.09)  209 (65.92)    211 (36.80)   237 (63.38)    188 (72.82)
C        164 (35.24)   184 (45.05)   132 (38.85)  203 (90.37)    220 (31.57)   239 (66.34)    190 (65.76)
Overall  164 (59.21)   186 (88.81)   135 (37.03)  222 (104.81)   236 (53.53)   264 (89.03)    201 (87.79)

8.7.3 Average Number of Turns to Finish a Task

We also investigated the number of turns, that is, the number of steps required to conduct a task (see Table 8.3). A turn could be viewing a document or inputting a query. On average, users needed 11.51 turns on System A, 9.55 turns on System B and 10.11 turns on System C. There is a main effect in terms of number of turns per task (ANOVA F = 8.34, df = 2, p < 0.01), and pairwise post-hoc Tukey tests reveal the same significant differences as for completion time, that is, users needed significantly more turns on the baseline system and the fewest number of turns on average on the multi-document summarisation system (marginally6 fewer than using single-document summarisation).

6 We use the term marginally if there is no significant difference but a view on the data with a naked eye might suggest there could be such a difference. Obviously this is most likely to be noise (but does not have to be).

Table 8.3: Experiment 1: Average Number of Turns to Complete a Task. Values are Mean (SD).

System   Task 1        Task 2        Task 3       Task 4        Task 5        Task 6        Overall
A        7.23 (1.67)   11.66 (5.73)  6.66 (2.05)  11.33 (1.59)  15.33 (2.28)  16.83 (3.02)  11.51 (4.73)
B        6.51 (1.59)   8.83 (2.26)   5.00 (1.00)  8.33 (2.62)   14.66 (4.98)  14.00 (3.46)  9.55 (4.58)
C        7.83 (1.77)   11.00 (1.63)  5.00 (1.52)  8.00 (1.00)   14.83 (2.40)  14.00 (3.16)  10.11 (4.06)
Overall  7.19 (1.71)   10.50 (3.87)  5.55 (1.77)  9.22 (2.39)   14.94 (3.47)  14.94 (3.48)  10.49 (4.54)

8.7.4 Task Success

Prior to the evaluation, matching documents had been identified by one of the authors. These documents were taken as the gold standard to measure task success. In the post-search questionnaire (discussed later) users were asked to state whether they were able to complete their search task successfully. For System A, three answered with ‘No’; for System B and System C, none of them answered with ‘No’. We also checked task success. The success rate was comparable across all systems. Almost all submitted documents exactly matched the information request as specified by the task (36 on System C, 36 on System B and 33 on System A). Only three of the 108 search tasks did not result in exact matches. There were two particularly difficult tasks: Tasks 5 and 6 (clearly reflected by the user satisfaction values in Table 8.7). Only 16 of the 18 users found a correct document for Task 6, and 17 were correctly submitted for Task 5. If we look at those two tasks in detail and compare them to the results reported earlier, we find that they have a higher number of turns on average (see Table 8.3). We also find a higher average completion time for those two tasks (236 seconds for Task 5 and 264 seconds for Task 6; see Table 8.2).

8.7.5 Post-Search Questionnaire

After participants finished each task, they had to fill in a post-search questionnaire. This included the following questions, with answers given on a 5-point Likert scale (where 1 indicates “not at all” and 5 indicates “extremely”):

1. “Are you familiar with the search topic?”

2. “Was it easy to get started on this search?”

3. “Was it easy to do the search on this topic?”

4. “Are you satisfied with your search results?”

5. “Did you have enough time to do an effective search?”

Table 8.4 presents the results for the question “Are you familiar with the search topic?” Overall, users provided fairly average ratings on the three systems with no significant difference between the systems.

Table 8.4: Experiment 1: Post-Search Questionnaire: User Familiarity with a Search Topic, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.16 (0.68)  3.33 (1.10)  3.83 (0.89)  2.50 (1.11)  2.50 (1.38)  4.00 (0.81)  3.38 (1.23)
B        4.16 (1.06)  4.83 (0.37)  3.50 (1.38)  3.50 (0.76)  2.33 (0.94)  2.50 (1.25)  3.47 (1.34)
C        2.66 (1.24)  3.16 (1.34)  4.33 (0.74)  3.83 (1.46)  3.00 (1.73)  2.66 (1.24)  3.27 (1.46)
Overall  3.66 (1.24)  3.77 (1.27)  3.88 (1.09)  3.27 (1.33)  2.61 (1.41)  3.05 (1.31)  3.37 (1.35)

Table 8.5 presents the results for the question “Was it easy to get started on this search?” Overall, users indicated that it was easy to get started when doing the search, with no significant difference (though it was marginally easier to use the baseline, as one would perhaps expect given the absence of what some might consider to be “distracting” tool-tips).

Table 8.5: Experiment 1: Post-Search Questionnaire: Ease of Getting Started, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.33 (0.74)  3.83 (0.89)  3.83 (0.89)  3.33 (1.10)  3.66 (1.37)  4.33 (0.47)  3.88 (1.02)
B        4.00 (0.81)  4.00 (1.15)  4.00 (0.81)  4.16 (0.68)  3.33 (0.47)  3.00 (0.81)  3.75 (0.92)
C        3.16 (0.89)  3.50 (0.76)  4.33 (0.74)  4.33 (1.10)  3.33 (1.37)  3.66 (0.94)  3.72 (1.09)
Overall  3.83 (0.95)  3.77 (0.97)  4.05 (0.84)  3.94 (1.07)  3.44 (1.16)  3.66 (0.94)  3.78 (1.01)

Table 8.6 presents the results for the question “Was it easy to do the search on this topic?” Overall, users indicated that they found it easy doing the search on the topics on the three systems, with no significant difference between them.

Table 8.6: Experiment 1: Post-Search Questionnaire: Ease of Performing Task, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.66 (0.47)  3.33 (1.37)  3.66 (0.47)  3.16 (0.68)  2.83 (1.21)  3.83 (0.68)  3.58 (1.06)
B        3.83 (0.89)  4.33 (0.74)  4.33 (0.74)  3.83 (0.89)  2.50 (0.95)  3.33 (0.74)  3.69 (1.05)
C        3.66 (0.94)  3.66 (0.74)  4.50 (0.50)  4.66 (0.47)  3.00 (1.29)  3.00 (0.81)  3.75 (1.06)
Overall  4.05 (0.91)  3.77 (1.08)  4.16 (0.68)  3.88 (0.93)  2.77 (1.18)  3.38 (0.82)  3.67 (1.06)

Table 8.7 gives the results for the question “Are you satisfied with your search results?” Overall, subjects appear to be satisfied with their results (with no significant differences between the systems).

Table 8.7: Experiment 1: Post-Search Questionnaire: Satisfaction with Results, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.50 (0.50)  3.83 (1.34)  4.00 (0.00)  3.66 (1.10)  3.00 (0.81)  2.83 (0.89)  3.63 (1.06)
B        5.00 (0.00)  5.00 (0.00)  4.16 (0.68)  4.66 (0.47)  3.16 (0.89)  3.16 (1.06)  4.19 (1.02)
C        5.00 (0.57)  4.00 (1.15)  4.33 (0.74)  4.66 (0.47)  3.50 (1.25)  3.00 (1.15)  4.08 (1.16)
Overall  4.83 (0.50)  4.27 (1.14)  4.16 (0.60)  4.33 (0.88)  3.22 (1.03)  3.00 (1.05)  3.97 (1.10)

We had given subjects at most 10 minutes to conduct one task. We wanted to see if the time allocated was sufficient. Table 8.8 presents the results for the question “Did you have enough time to do an effective search?” Overall, users indicated that they had enough time when using the three systems, with no significant difference between them.

Table 8.8: Experiment 1: Post-Search Questionnaire: Adequate Time, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.83 (0.37)  4.33 (0.47)  4.33 (0.47)  3.50 (1.11)  3.83 (1.34)  4.33 (1.10)  4.19 (0.99)
B        4.33 (1.10)  5.00 (0.00)  4.66 (0.47)  4.83 (0.37)  4.00 (0.57)  3.83 (0.89)  4.44 (0.79)
C        4.50 (0.50)  4.33 (0.74)  4.66 (0.47)  4.83 (0.37)  4.33 (0.74)  4.00 (1.00)  4.44 (0.72)
Overall  4.55 (0.76)  4.55 (0.59)  4.55 (0.49)  4.38 (0.95)  4.05 (0.97)  4.05 (1.03)  4.36 (0.85)

We also get a picture of the users’ perceptions of the difficulty of the tasks in general without differentiating between the three systems; the results are shown in Table 8.9.

Table 8.9: Experiment 1: Post-Search Questionnaire: Mean (SD) Scores, by Task.

Criterion     Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
Familiarity   3.66 (1.24)  3.77 (1.27)  3.88 (1.09)  3.27 (1.33)  2.61 (1.41)  3.05 (1.31)  3.37 (1.35)
Start         3.83 (0.95)  3.77 (0.97)  4.05 (0.84)  3.94 (1.07)  3.44 (1.16)  3.66 (0.94)  3.78 (1.01)
Search        4.05 (0.91)  3.77 (1.08)  4.16 (0.68)  3.88 (0.93)  2.77 (1.18)  3.38 (0.82)  3.67 (1.06)
Satisfied     4.83 (0.50)  4.27 (1.14)  4.16 (0.60)  4.33 (0.88)  3.22 (1.03)  3.00 (1.05)  3.97 (1.10)
Time          4.55 (0.76)  4.55 (0.59)  4.55 (0.49)  4.38 (0.95)  4.05 (0.97)  4.05 (1.03)  4.36 (0.85)
Overall       4.18 (1.00)  4.03 (1.08)  4.16 (0.80)  3.96 (1.16)  3.22 (1.27)  3.43 (1.11)  3.82 (1.14)

8.7.6 Post-System Questionnaire

After two search tasks were performed on one system, participants filled in a post-system questionnaire. Table 8.10 gives a breakdown of the results when asked about the system they had just used. A Kruskal-Wallis test suggests one significant finding only, namely for the question “How easy was it to use this information system?” (χ2 = 10.8941, df = 2, p < 0.01). System A is significantly worse than the average at the 95% confidence level. System B and System C are significantly better than average. Post-hoc Mann-Whitney U tests with Bonferroni adjustment show that both System B and System C are significantly easier to use than System A (at p < 0.05), but there is no difference between System B and System C. This finding is a bit surprising and certainly worth noting as System A is the standard Web site.

Figure 8.4: Results of Questionnaire Regarding Ease of Learning How to Use Systems, Experiment 1.

Figure 8.5: Results of Questionnaire Regarding Ease of Use Systems, Experiment 1.

Figure 8.6: Results of Questionnaire Regarding Understanding How to Use Systems, Experiment 1.

The box plots in Figures 8.4, 8.5 and 8.6 provide a different representation of these results.

Table 8.10: Experiment 1: Post-System Questionnaire, Mean Scores. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.

Criterion                                                    System A       System B      System C
                                                             Mean (z)       Mean (z)      Mean (z)
How easy was it to learn to use this information system?     4.05 (-0.52)   4.11 (0.22)   4.11 (0.29)
How easy was it to use this information system?              3.44 (-4.65)   4.27 (2.59)   4.22 (2.06)
How well did you understand how to use the inf. sys.?        4.16 (-1.22)   4.33 (0.88)   4.22 (0.33)

8.7.7 Exit Questionnaire

In the exit questionnaire, users were asked to answer questions concerning the search experience they had in the experiment. Table 8.12 summarises the answers users gave. Overall, users did understand the nature of the tasks (mean: 4.00, on a scale where 1 represents “not at all”, and 5 represents “completely”); to some extent they found this task similar to other tasks that they typically performed (mean: 3.22); and they found the systems to be quite different from one another (mean: 4.00).

For the question “Which of the three systems did you like the best overall?”, users expressed a strong preference for System B and System C: 7 users preferred System C, 10 users preferred System B, 1 user preferred System A, and none found no difference. Furthermore, when asked “Which of the two systems did you find easier to use?”, a large majority of users judged that System B and System C were easier to use than the baseline system (System A: 1; System B: 9; System C: 7 users; no difference: 1). Users were also asked “Which of the three systems did you find easier to learn to use?”; the results show no clear preference (System A: 4; System B: 6; System C: 5 users; no difference: 3). Table 8.11 summarises these results and displays the number of users who selected each of the choices.

Table 8.11: Experiment 1: Exit Questionnaire (System Preference).

Criterion                System A   System B   System C   No difference
Easier to learn to use   4          6          5          3
Easier to use            1          9          7          1
Best overall             1          10         7          0

Table 8.12: Experiment 1: Exit Questionnaire (Search Experience).

Question                                                                                              Mean
To what extent did you understand the nature of the searching task?                                   4.00
To what extent did you find this task similar to other searching tasks that you typically perform?    3.22
How different did you find the systems from one another?                                              4.00

8.7.8 Concluding Remarks

To summarise the results of this experiment: applying profile-based summarisation to assist users in navigation tasks can significantly outperform a standard Web site without such assistance in terms of the time and turns needed to conduct a task. Multi-document summaries are marginally better than single-document summaries. We need to note, however, that the results might be distorted by the fact that the baseline system looks different from the other two systems (which are indistinguishable). Nevertheless, we deliberately used the existing Web site because that is the system in actual use, and it represents the most natural/common way of navigating a Web site. In order to address the issue of the variable presentation format, and to assess the impact of the personalisation of the summaries, we conducted an additional user study.

8.8 Generic versus Single and Multi-Document Profile-Based Summarisation

We adopted the same experimental setup as in the first task-based evaluation study in Section 8.7, but with a different baseline System A. In this experiment, all three systems looked almost identical to the user; more specifically:

1. System A: Adds a layer of centroid-based, single-document summarisation of hyperlinked documents, presented as pop-up tool-tips over the existing site. This algorithm is designed for traditional summarisation. It is a widely used baseline [Radev et al., 2004, for example].

2. System B: As before (ACO title-based MDS (first five documents)).

3. System C: As before (ACO title-based SDS).

Figure 1.2 is a screenshot of System A. While there is still query-term highlighting if a query has been submitted, unlike System B and System C (Figures 8.2 and 8.3) there is no highlighting of terms extracted from the document title.7

7 This difference stemmed from the nature of the summarisation algorithms.
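For illustration, a minimal sketch of a centroid-style single-document extractive summariser of the kind typically used as such a baseline is given below. The tokenisation, sentence splitting, scoring details and function name are simplifications assumed for the example; they should not be read as the exact baseline implementation evaluated here, which follows Radev et al. [2004].

```python
import math
import re
from collections import Counter


def centroid_summary(document, background_docs, max_sentences=4):
    """Minimal centroid-style extractive summary (in the spirit of MEAD).

    Sentences are scored by the summed TF-IDF weight of their words, where
    IDF is estimated from a background collection; the top-scoring sentences
    are returned in document order. Tokenisation is deliberately naive.
    """
    tokenize = lambda text: re.findall(r"[a-z0-9]+", text.lower())

    # Document frequencies from the background collection.
    n_docs = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(tokenize(doc)))
    idf = {w: math.log((n_docs + 1) / (df[w] + 1)) for w in df}

    # Centroid: TF-IDF weights of the terms in the target document.
    tf = Counter(tokenize(document))
    centroid = {w: tf[w] * idf.get(w, math.log(n_docs + 1)) for w in tf}

    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    scored = [(sum(centroid.get(w, 0.0) for w in set(tokenize(s))), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:max_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```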

8.8.1 Subjects

Out of 18 participants, 13 were male and 5 were female. All subjects declared that they use the Internet on a regular basis. The average time subjects have been doing online searching is 12.33 years (9 of them between 3 and 17 years). When asked how often they searched for information online, all of the participants selected “daily”.

8.8.2 Average Completion Time and Number of Turns

Table 8.13 gives a picture of the average completion time of tasks. As in the previous study, it takes significantly longer on the baseline system (System A) to conduct a task than on either of the other two systems, despite all three systems providing pop-up summaries (ANOVA F = 10.31, df = 2, p < 0.01, followed by pairwise post-hoc Tukey tests at p < 0.05).

Table 8.13: Experiment 2: Average Completion Time of Tasks (in Seconds). Values are Mean (SD).

System   Task 1        Task 2        Task 3       Task 4        Task 5        Task 6        Overall
A        157 (35.84)   223 (71.89)   164 (26.94)  240 (72.04)   337 (73.26)   395 (85.04)   253 (108.33)
B        131 (34.85)   191 (82.68)   123 (37.89)  199 (49.92)   236 (49.54)   257 (50.57)   189 (72.54)
C        132 (36.17)   194 (32.60)   120 (26.62)  223 (70.37)   237 (66.70)   269 (79.38)   196 (77.95)
Overall  140 (37.53)   203 (67.57)   136 (36.88)  221 (67.00)   270 (79.45)   307 (96.32)   213 (92.17)

Table 8.14: Experiment 2: Average Number of Turns to Complete a Task. Values are Mean (SD).

System   Task 1       Task 2        Task 3       Task 4        Task 5        Task 6        Overall
A        6.78 (1.49)  10.83 (3.38)  8.33 (1.97)  11.83 (0.68)  13.00 (1.91)  16.00 (2.88)  11.13 (3.59)
B        5.64 (1.49)  9.00 (2.30)   5.50 (1.11)  10.16 (1.77)  11.50 (1.60)  13.50 (4.03)  9.21 (3.59)
C        6.01 (2.13)  10.16 (2.11)  5.66 (2.28)  10.00 (2.08)  11.50 (2.21)  13.50 (3.50)  9.47 (3.63)
Overall  6.14 (1.82)  10.00 (2.76)  6.50 (2.26)  10.66 (1.82)  12.00 (2.05)  14.33 (3.69)  10.06 (3.70)

Similar to the previous task-based evaluation study in Section 8.7, a significantly shorter completion time goes hand in hand with significantly fewer turns (ANOVA F = 28.88, df = 2, p < 0.01, followed by pairwise post-hoc Tukey tests at p < 0.05). Task success is identical in all three systems (with correct answers in all cases). It looks like System A in this study (centroid SDS) had consistently worse completion times than System A in the first study, the standard Web site without pop-up summarisation support (cf. Tables 8.2 and 8.13). The centroid summaries may have required more effort to parse due to the summarisation method used. It could also be that users could not scan them easily for the keywords since there was no highlighting. Since this difference is not present in the number of turns (cf. Tables 8.3 and 8.14), this adds evidence to the theory that participants had to spend time trying to read and understand the centroid-based summaries.

8.8.3 Post-Search Questionnaire

Table 8.15 presents the results for the question “Was it easy to get started on this search?”. The lower average ratings for each task on the baseline system suggest that it was slightly easier to get started on System B and System C, but there was no main effect.

Table 8.15: Experiment 2: Post-Search Questionnaire: Ease of Getting Started, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.50 (0.50)  3.66 (0.74)  3.83 (0.89)  3.66 (0.94)  3.50 (0.76)  2.83 (0.68)  3.66 (0.91)
B        4.66 (0.47)  4.00 (1.15)  4.33 (0.47)  4.33 (0.47)  3.83 (0.37)  3.16 (0.68)  4.05 (0.81)
C        4.66 (0.47)  3.83 (0.68)  4.50 (0.50)  4.33 (1.10)  3.66 (0.94)  3.16 (0.68)  4.02 (0.92)
Overall  4.61 (0.48)  3.83 (0.89)  4.22 (0.71)  4.11 (0.93)  3.66 (0.75)  3.05 (0.70)  3.91 (0.90)

Table 8.16 includes the results for the question “Was it easy to do the search on this topic?”. The tendency is the same (slightly lower ratings for System A), but again there is no statistical significance.

Table 8.16: Experiment 2: Post-Search Questionnaire: Ease of Performing Task, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.00 (0.57)  3.16 (1.06)  3.83 (0.37)  3.00 (0.57)  2.33 (0.74)  2.00 (0.00)  3.05 (0.97)
B        4.66 (0.47)  3.66 (0.47)  4.83 (0.37)  3.50 (0.50)  3.33 (0.47)  3.16 (0.68)  3.86 (0.82)
C        4.33 (0.47)  3.66 (0.74)  4.66 (0.47)  3.66 (0.74)  3.00 (1.29)  3.16 (0.68)  3.75 (0.98)
Overall  4.33 (0.57)  3.50 (0.83)  4.44 (0.59)  3.38 (0.67)  2.88 (0.99)  2.77 (0.78)  3.55 (0.99)

Table 8.17 contains the results for the question “Are you satisfied with your search results?” with similar results (no significance). As before, users indicated that they had enough time to conduct the tasks (Table 8.18), no matter which system they used. These and other results are included in Table 8.19, which presents the results by task.

Table 8.17: Experiment 2: Post-Search Questionnaire: Satisfaction with Results, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        4.50 (0.50)  3.66 (0.94)  4.00 (0.00)  3.66 (1.10)  3.33 (0.74)  2.16 (0.68)  3.55 (1.03)
B        5.00 (0.00)  4.83 (0.37)  5.00 (0.00)  4.50 (0.50)  3.83 (0.68)  3.33 (0.74)  4.41 (0.79)
C        5.00 (0.57)  4.83 (0.37)  4.83 (0.37)  4.66 (0.47)  3.66 (1.10)  3.00 (1.15)  4.33 (1.05)
Overall  4.83 (0.50)  4.44 (0.83)  4.61 (0.48)  4.27 (0.86)  3.61 (0.89)  2.83 (1.01)  4.10 (1.04)

Table 8.18: Experiment 2: Post-Search Questionnaire: Adequate Time, Mean (SD) Scores.

System   Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
A        5.00 (0.00)  4.50 (0.50)  4.66 (0.47)  4.50 (0.76)  4.00 (1.00)  3.33 (0.94)  4.33 (0.88)
B        5.00 (0.00)  4.83 (0.37)  5.00 (0.00)  4.66 (0.47)  4.00 (0.57)  3.50 (0.95)  4.50 (0.76)
C        5.00 (0.00)  4.83 (0.37)  4.83 (0.37)  4.66 (0.47)  4.00 (0.81)  3.33 (0.94)  4.44 (0.83)
Overall  5.00 (0.00)  4.72 (0.44)  4.83 (0.37)  4.61 (0.59)  4.00 (0.81)  3.38 (0.95)  4.42 (0.83)

Table 8.19: Experiment 2: Post-Search Questionnaire: Mean (SD) Scores by Task.

Criterion     Task 1       Task 2       Task 3       Task 4       Task 5       Task 6       Overall
Familiarity   4.33 (0.57)  3.77 (0.91)  4.38 (0.75)  3.44 (0.83)  3.27 (1.09)  2.55 (1.06)  3.62 (1.09)
Start         4.61 (0.48)  3.83 (0.89)  4.22 (0.71)  4.11 (0.93)  3.66 (0.75)  3.05 (0.70)  3.91 (0.90)
Search        4.33 (0.57)  3.50 (0.83)  4.44 (0.59)  3.38 (0.67)  2.88 (0.99)  2.77 (0.78)  3.55 (0.99)
Satisfied     4.83 (0.50)  4.44 (0.83)  4.61 (0.48)  4.27 (0.86)  3.61 (0.89)  2.83 (1.01)  4.10 (1.04)
Time          5.00 (0.00)  4.72 (0.44)  4.83 (0.37)  4.61 (0.59)  4.00 (0.81)  3.38 (0.95)  4.42 (0.83)
Overall       4.62 (0.54)  4.05 (0.92)  4.50 (0.63)  3.96 (0.92)  3.48 (0.99)  2.92 (0.95)  3.92 (1.02)

8.8.4 Post-System Questionnaire

In the post-system questionnaire, a Kruskal-Wallis test again indicated a significant result for the question “How easy was it to use this information system?” (χ2 = 18.6279, df = 2, p < 0.001). Post-hoc Mann-Whitney U tests with Bonferroni adjustment show that users found System B and System C significantly easier to use than System A (at p < 0.05); no other differences are significant (see Table 8.20 and box plots in Figures 8.7, 8.8 and 8.9). This confirms the finding of the first study in that profile-based summarisation outperforms a sensible baseline.

Table 8.20: Experiment 2: Post-System Questionnaire, Mean Scores. Bold values are significantly better and underlined values are significantly worse than the average according to the z-score with 95% confidence level.

Criterion                                                    System A       System B      System C
                                                             Mean (z)       Mean (z)      Mean (z)
How easy was it to learn to use this information system?     4.00 (-1.36)   4.16 (0.68)   4.16 (0.68)
How easy was it to use this information system?              3.50 (-5.96)   4.66 (4.11)   4.44 (1.84)
How well did you understand how to use the inf. sys.?        4.50 (-1.22)   4.56 (1.21)   4.56 (0.01)

Figure 8.7: Results of Questionnaire Regarding Ease of Learning How to Use Systems, Experiment 2.

Figure 8.8: Results of Questionnaire Regarding Ease of Use Systems, Experiment 2.

Figure 8.9: Results of Questionnaire Regarding Understanding How to Use Systems, Experiment 2.

8.8.5 Exit Questionnaire

In the exit questionnaire, users overall strongly preferred System B and System C (see Table 8.21): no user preferred System A, 10 users preferred System B, 6 users preferred System C, and 2 found no difference. Furthermore, a large majority of users judged that System B and System C were easier to use than the baseline system (System A: none; System B: 8; System C: 7 users; no difference: 3). There was no difference between the three systems in the ease of learning to use (System A: none; System B: 2; System C: 1 user; no difference: 15). Overall, users understood the nature of the tasks (mean: 3.88); to some extent they found this task similar to other tasks that they typically performed (mean: 2.94); and they found only a marginal difference between the systems (mean: 3.72). Table 8.22 summarises these results.

Table 8.21: Experiment 2: Exit Questionnaire (System Preference).

Criterion                System A   System B   System C   No difference
Easier to learn to use   0          3          1          14
Easier to use            0          8          7          3
Best overall             0          10         6          2

Table 8.22: Experiment 2: Exit Questionnaire (Search Experience).

Question                                                                                              Mean
To what extent did you understand the nature of the searching task?                                   3.88
To what extent did you find this task similar to other searching tasks that you typically perform?    2.94
How different did you find the systems from one another?                                              3.72

8.8.6 Concluding Remarks

The second task-based evaluation study on navigation support largely confirms the results of the first study: we have a measurable benefit of using cohort-personalised summaries when using a baseline that is visually almost identical except for the content of the summaries. Furthermore, the results indicate that using the cohort-profile to generate “personalised” summaries has a measurable benefit over using a sensible summarisation baseline when assessed in terms of the time taken and the users’ overall preference.

8.9 Discussion

8.9.1 General Observations

We looked at two evaluations in navigation support in order to first see how pop-up boxes compare against a standard Web site, and then, in a second experiment, how profile-based summarisation (using pop-ups) compares against pop-up boxes that do not use such a profile. However, the reason we compare three systems in each task-based evaluation is to be able to compare SDS and MDS with the same baseline(s) and at the same time be able to compare these two types of summarisation with each other. The results of the two studies indicate that applying profile-based summarisation to assist users in navigation tasks can significantly outperform generic summarisation as well as a standard Web site without such assistance. The results of the task-based evaluations indicate that multi-document summaries are marginally better than single-document summaries, although the difference was more marked in the pilot study. In the first experiment, we used three systems that are perhaps not directly comparable; the baseline system looks slightly different to the other two systems (which are indistinguishable). That was the reason to apply an alternative baseline in the second experiment, namely, centroid-based summarisation.8 To check for consistency across the two studies, we conducted paired t-tests comparing average completion time and average number of turns for System B and System C (the systems used in both studies) and found no significant differences (at p < 0.05) when comparing the overall results for the corresponding systems.

8 Unlike in the first study, in which only three subjects found no difference in learning to use the different systems, in the second study a large majority of 15 subjects found no difference between the systems.

8.9.2 Comparison with Related Work

It is difficult to make direct comparisons between our findings and those of previous studies. To the best of our knowledge, there has been no comparable study that addresses profile-based navigation on a Web site. In a Web search context, there have been many studies (see Chapter 2) but the setup differs in at least two important dimensions: document collection and mode of search. However, one interesting finding in a Web search study aimed at showing results in context concludes that inline summaries were more effective than summaries presented as hover text [Dumais et al., 2001]. One of our basic assumptions has always been to not touch the structure of the site and simply add a layer on top of it. However, perhaps inline summaries could be investigated as an alternative in future, also because even when adding an overlay it is not necessarily guaranteed that one is not interfering with the content owner’s JavaScript and altering the presentation to a certain degree anyway.

There have been other studies on the same Web site. In a navigation context, the suggestion of commonly visited pages (derived from query and click logs) as an overlay window was shown to cut down task completion time and is preferred by users over the unaltered Web site [Saad and Kruschwitz, 2011]. A follow-up study applied an ACO approach (to click graphs rather than query graphs as we do), which was shown to outperform a baseline that suggests links simply based on popularity [Saad and Kruschwitz, 2013]. A different study on the same Web site adopts a search context [Adindla and Kruschwitz, 2013] whereby query support is offered through a conceptual graph derived from the syntactic relationships extracted from the document collection. That study applied exactly the same tasks we employed; apart from the finding that search assistance did help overall, it is worth noting that the average completion time of the baseline system is virtually identical to ours and that Tasks 5 and 6 were also found to be the most difficult ones.

8.9.3 User Feedback

We primarily focussed on a statistical analysis but would like to report some feedback received from users in the exit questionnaires. We observe that there was a common theme in that many users found the summarisation systems to be very similar; for example, in the first experiment “there should be no big difference between the two” and “I found System C very similar to System B”, and when compared against a centroid-based baseline “I think that all the systems were the same” and “I did’t notice a very big difference between these systems”.

Regarding the content and presentation format of the summarisation-based systems, users liked the highlighting of keywords, liked the overall idea (“[..] provides better information when moving the mouse over any link. This helps me to decide on clicking the link or not.”), the ease of use and the potential to save time. They also noted problems; the potential for confusion was mentioned several times, for example, “additional information [..] could be off-putting and confusing.” and “The page contains a lot of links which creates a lot of ambiguity and misconfusion [sic]”. One user noticed “Sometimes in Systems B & C the help box did not provide useful information”. One user suggested: “It’s more fun if we have colorful windows popping up without so many words crammed together. These words are too small and boring sometimes”. Perhaps future work in this direction should explore moving away from extractive summaries and, for example, use term clouds instead [Jones and Li, 2008]. Technical issues included “The clouds with the comments could load faster”.

Finally, we also received some feedback that supports the observation that Web sites can be difficult to navigate, for example, “I found parts of the university Web site difficult to search for specific information [..] I found it easiest just to type the information [..] rather than through the menues and sub-menues”. In line with this, someone suggested to include more direct links for new students on the university home page, which fits very much the overall idea of our work. Regarding the overall experience, one comment reads “nice systems making search on the uni website easier”. We also received some evidence that our tasks were realistic and targeted at the right user group, for example, “The tasks are well chosen, because they are close to our life as students. It’s an interesting experience” and somebody else commenting “I learned some new information about university funding”.

8.9.4 Summary

The main conclusion that we derive from the statistical evidence is that profile-based document summarisation can lead to significant measurable benefits. Our explanation, supported by user feedback, is that a tool-tip window summarising what a user can expect from following a link will allow the user to make a more informed decision as to whether to follow that link or not, in particular if the summary is based on the search behaviour of a cohort of users the searcher belongs to. We do need to be cautious with our conclusions, however. An issue raised earlier is the fact that in this study we did not consider any alternative query- or profile-based summarisation methods other than the ones discussed here. It would therefore be safer to conclude that employing usage data for profile-based summarisation certainly does outperform sensible baseline approaches, but the findings merit further investigation to drill down further. This is also true when it comes to the issue of different cohorts of users: in this study, we provided a framework for cohort-based summarisation to assist in navigation scenarios, but have not evaluated the impact of using models derived from different cohorts.

9 Conclusions and Limitations

In this chapter, we provide a summary of the materials introduced, answer our research questions, and discuss the limitations of our work.


This thesis hypothesises that profile-based summaries can help navigation. A series of user studies have been carried out to confirm the hypothesis. We presented the use of profile-based summarisation to help users navigate a local Web site. Several algorithm variations are evaluated in the context of navigation. The profile was acquired by exploiting past users’ searching patterns in an attempt to capture the ‘knowledge’ of previous users in order to help a new one who might have the same information need, a form of group-based context we try to exploit. The idea is to use prior searchers paths to help construct more contextually relevant summaries. The technique is applied to a real-life scenario, the University of Essex Web site, and is implemented as a layer on top of the existing Web site to support browsing activities. We can now return to the initial research questions and answer them using the evidence from the studies we conducted. 1. Can Web site navigation benefit from the automated summarisation of results? We found evidence that summarisation does have potential benefits when applied to a Web site. We also found that profile-based summarisation outperformed a range of alternative approaches. We find that the profile-based summarisation approach outperformed the baseline systems, which included the Web site without any support (the existing default Web site) as well as a site that employed generic summarisation, that is, one that did not incorporate information about past user search patterns. 2. Will a domain model/profile capturing the search behaviour of a group of users be beneficial for the summarisation process? In all our studies, we found that a model that employs past users’ search paths and patterns significantly improves the quality of the summarisation process for the given scenarios. We have employed fairly simple methods both for acquiring the profile and profile-biased summarisation. There is much scope for further exploration, including the use of clickthrough data in the model construction process, the utilisation of the adaptive nature of the model, as well as the characterisation and modelling of more finer-grained user

Chapter 9. Conclusions and Limitations

147

groups. 3. Will such methods result in measurable (quantifiable) benefits such as shorter sessions, fewer interactions, and the like? Based on two task-based evaluations, we can conclude that we can assist users in finding information more quickly and in fewer interaction steps. However, to validate the findings in a realistic context, one needs to test the methods on real users in real time, for example, via A/B testing where we have two systems but, a user never knows which systems is being used there [Kohavi et al., 2007], which will then also test less frequent queries. By analysing the lessons, we learned from each experiment and from the user feedback that we can see the potential that ACO-based models offer in which all variations of the ACO-based algorithm outperform the other methods and achieving a higher average rating, and that multi-document summarisation offers the biggest potential, in particular when only choosing the first five outgoing links to generate the summary. This is also supported by the user feedback; for example, when asked about the features they liked in the summaries generated based on the ACO model, they said: ‘Because they gave sufficient information about the topic and the ideas are connected somehow’, ‘The best summaries clearly explain, what is there in the main content. No other additional words will be there. From the summary itself, one can easily understand the information given in the main page. I like the way of expressing the summaries of the main content’. We can also see that applying profile-based summarisation to assist users in navigation tasks can significantly outperform generic summarisation as well as a standard Web site without such assistance. Moreover, regarding the content and presentation format of the summarisation-based systems, users liked the highlighting of keywords, the overall idea (‘[..] provides better information when moving the mouse over any link. This helps me to decide on clicking the link or not’ ), the ease of use, and potential to save time. We also received some feedback that supports the observation that Web sites can be difficult to navigate; for example, ‘I found parts of the university Web site difficult to search for specific information [..] I found it easiest just to type the information [..]


We also received some feedback that supports the observation that Web sites can be difficult to navigate; for example, ‘I found parts of the university Web site difficult to search for specific information [..] I found it easiest just to type the information [..] rather than through the menus and sub-menus’. Regarding the evaluation setup, subjects provided evidence that our tasks were realistic and targeted at the right user group; for example, ‘The tasks are well chosen because they are close to our life as students. It’s an interesting experience’. Regarding the overall experience, one comment reads ‘nice systems making search on the uni website easier’.

We conclude that, overall, there is much potential in mining past users’ query logs to build a model that reflects not individual user needs but collective needs, by way of some form of community profile. Using an ACO analogy allows the model to learn patterns over time but also to forget them. The adaptive nature of the model has not been investigated in this thesis, but it opens interesting avenues for future work. We hope this study will serve as a benchmark for future investigations into Web site navigation.

There are a number of limitations. First of all, in an area such as Web search, validating findings is comparatively easy: well-known test collections exist, and any implementation can be tested against such a collection to give, for example, a measure of precision. For individual Web sites and intranets this is more difficult, and considerable care is needed to define an appropriate evaluation methodology. Task-based evaluations aim to model realistic tasks as closely as possible, but they remain somewhat artificial. In our work, we have conducted a number of tests with real users, but more extensive tests with real users in realistic contexts, such as A/B testing, will have to be carried out to validate the findings of this study. An obvious limitation of such a study is that the results are based on data from a single Web site, and the findings may or may not be transferable to other document collections. The same caveat was raised when assessing the quality of query modification suggestions on the same Web site [Kruschwitz et al., 2013]. This is also true for any comparison of the results with studies conducted on Web logs.


Site logs may have very different characteristics. For example, Web queries tend to be around 2.35 words long on average [e.g., Beitzel et al., 2004, 2007; Jansen et al., 1998; Silverstein et al., 1998]; our queries are shorter, at 1.81 query terms on average [Kruschwitz et al., 2013], which is consistent with other results reported for Web sites and intranets [Stenmark and Jadaan, 2006]. As a result, the query logs collected on a Web site will likely be very different from Web query logs. This also affects the average session length (and hence the profile acquisition process): in our case 1.53 queries per session, compared with 1.73 on a different Web site (the Utah government Web site) [Chau et al., 2005] and more than two queries on average for Web logs [Jansen et al., 1998, 2007b; Silverstein et al., 1998]. The major bottleneck in conducting research into using any form of query logs is the difficulty of obtaining realistic and large-scale log data, which is also the reason why it is nearly impossible to conduct and report studies on a selection of large-scale logs collected on different sites. For this study, we used the actual log files of a reasonably sized university Web site collected over three years. We hope that our results can serve as a benchmark for future studies on different Web sites and using different profiling approaches.
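To make such comparisons easy to reproduce on other logs, the sketch below shows how the two statistics discussed here, average query length and average queries per session, can be computed. It assumes a simple tab-separated log with one query per line (a session identifier followed by the query string); this format and the file name are assumptions made for illustration and do not reflect the exact log format used in this study.

    import csv
    from collections import defaultdict

    def log_statistics(path):
        """Return (average query length in terms, average queries per session).

        Assumes a tab-separated file with two columns per line:
        session_id <TAB> query. This format is an illustrative assumption,
        not the actual log format used in this study.
        """
        term_counts = []
        queries_per_session = defaultdict(int)
        with open(path, newline="", encoding="utf-8") as handle:
            for session_id, query in csv.reader(handle, delimiter="\t"):
                term_counts.append(len(query.split()))
                queries_per_session[session_id] += 1
        avg_query_length = sum(term_counts) / len(term_counts)
        avg_session_length = sum(queries_per_session.values()) / len(queries_per_session)
        return avg_query_length, avg_session_length

    # Example usage (hypothetical file name):
    # print("%.2f terms per query, %.2f queries per session" % log_statistics("site_queries.tsv"))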

10 Future Directions

In this chapter, we conclude the thesis by setting out our future research directions.


There is much scope for future work. One major route is to exploit the adaptive nature of the ACO-based profile. In the presented work, we took a snapshot of the profile and did not try to exploit the benefits of the ACO model as a continuously updated representation of the cohort’s collected search knowledge. This will need to be done in longitudinal studies that capture an element of time. It has already been demonstrated that the ACO-based model has the potential to improve over time [e.g., Albakour, 2012]. There is certainly a need to adapt, as illustrated by rebranding exercises and the introduction of new terms.[1] Note also that the outlined approach should need no customisation to new Web sites and can be applied out of the box. This avoids the problem of having to manually or semi-automatically adapt general-purpose knowledge structures to a new collection or domain. It would be interesting to see the approach applied to sites of different types and sizes. Two other major directions would be to include clickthrough information in the model acquisition process and to apply the profile to finer-grained[2] cohorts of users. Both of these aspects are straightforward and simply assume that clickthrough information is available and that an organisational structure is in place, respectively. Alternatives to an ACO approach for the construction of the profile are also possible, for example the use of query flow graphs (QFGs), which have been shown to be effective in deriving query suggestions in a Web site context [Kruschwitz et al., 2013]. Random walk models could also be explored in place of the simplistic selection of related concepts used in our summarisation approaches.

[1] Two examples of commonly used concepts on the Web site used in this study are the term Freshers’ Week, which has been replaced by Welcome Week, and the term FASer (Feedback Assessment Submission electronic repository), which has replaced OCS (Online Coursework Submission system).
[2] By identifying different granularities of profiles (where the extreme ends are either the individual user or the entire user community), Web site users could, for example, be classified into different (possibly overlapping) groups, such as first-year biology students.
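As a simplified illustration of this adaptive behaviour, the sketch below shows an ACO-flavoured update step in which term associations observed in a new search session are reinforced while all existing association weights slowly evaporate, so that the profile can both learn and forget. The dictionary representation and the constants are assumptions made for this sketch; they are not the exact model used in this work or in the systems cited above.

    def update_profile(profile, session_pairs, reinforcement=1.0, evaporation=0.1):
        """One ACO-style learning step over a profile of term-association weights.

        profile maps (term_a, term_b) pairs to a pheromone-like weight, and
        session_pairs holds the associations observed in one search session
        (for example, consecutive query terms or query refinements).
        Evaporation lets the profile gradually forget associations that are
        no longer reinforced; all constants here are illustrative only.
        """
        for pair in list(profile):              # evaporate existing associations
            profile[pair] *= (1.0 - evaporation)
            if profile[pair] < 1e-6:            # prune weights that have faded away
                del profile[pair]
        for pair in session_pairs:              # reinforce what was just observed
            profile[pair] = profile.get(pair, 0.0) + reinforcement
        return profile

    profile = {}
    update_profile(profile, [("welcome", "week"), ("week", "timetable")])
    update_profile(profile, [("welcome", "week")])
    print(profile)  # ('welcome', 'week') reinforced twice, ('week', 'timetable') once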


In this thesis we made a clear distinction between Web search and Web site search. As such, it remains an open question whether the proposed methods will also be beneficial for Web search. Given the key differences discussed in Section 1.1, this cannot be answered easily and would require experiments in a Web context, something that could be explored in future studies. Generally speaking, the research area offers many directions for future work and the potential to make Web site navigation more user-focussed. This thesis is just a stepping stone within the wider research context.

References

A. Abraham, H. Guo, and H. Liu. Swarm Intelligence: Foundations, Perspectives and Applications. In N. Nedjah and L. Mourelle, editors, Swarm Intelligent Systems, volume 26 of Studies in Computational Intelligence, pages 3–25. Springe, 2006. S. Adindla. Navigating the Knowledge Graph: Automatically Acquiring and Utilizing a Domain Model for Intranet Search. PhD thesis, University of Essex, 2014. S. Adindla and U. Kruschwitz. Using Predicate-Argument Structure to Bootstrap a Domain Model for Site Search: Results of a Task-Based Evaluation. In Proceedings of OAIR/RIAO (10 th International Conference in the RIAO Series), pages 29–32, Lisbon, 2013. G. Adomavicius and A. Tuzhilin. Toward The Next Generation of Recommender Systems: A Survey of The State-of-The-Art and Possible Extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005. E. Agichtein, E. Brill, and S. Dumais. Improving Web Search Ranking by Incorporating User Behavior Information. In Proceedings of the 29 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19– 26. ACM, 2006. M-D. Albakour. Adaptive Domain Modelling for Information Retrieval. PhD thesis, University of Essex, 2012.


M-D. Albakour, U. Kruschwitz, N. Nanas, D. Song, M. Fasli, and A. De Roeck. Exploring Ant Colony Optimisation for Adaptive Interactive Search. In Proceedings of the 3 rd International Conference on the Theory of Information Retrieval (ICTIR), Lecture Notes in Computer Science, pages 213–224, Bertinoro, 2011. Springer. W. S. Alhalabi, K. Miroslav, and T. Moiez. Induction-Based Approach to Personalized Search Engines. Vdm Verlag, 2009. A. Alhindi, U. Kruschwitz, and C. Fox. A Pilot Study on Using Profile-Based Summarisation for Interactive Search Assistance. In Advances in Information Retrieval, volume 7814 of Lecture Notes in Computer Science, pages 672–675. Springer Berlin Heidelberg, 2013. J. Allan. NLP for IR-Natural Language Processing for Information Retrieval. In Proceedings of the Twelfth Text Retrieval Conference, pages 18–21, 2004. J. Allan, J. Aslam, N. Belkin, C. Buckley, J. Callan, B. Croft, S. Dumais, N. Fuhr, D. Harman, D. J. Harper, et al. Challenges in Information Retrieval and Language Modeling: Report of a Workshop Held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002. SIGIR Forum, 37: 31–47, 2003. G. Amati and C. J. Van Rijsbergen. Probabilistic Models of Information Retrieval based on Measuring the Divergence from Randomness. ACM Transactions on Information Systems (TOIS), 20:357–389, 2002. S. Anand and B. Mobasher. Intelligent Techniques for Web Personalization. Intelligent Techniques for Web Personalization, pages 1–36, 2005. R. Angheluta, R. Mitra, X. Jing, and M. Moens. KU Leuven Summarization System at DUC 2004. In Document Understanding Conference, 2004. A. T. Arampatzis, T. Tsoris, C. H. A. Koster, and T. P. Van Der Weide. Phase-Based


Information Retrieval. Information Processing & Management, 34:693–707, 1998. G. Armano, A. Giuliani, and E. Vargiu. Using Snippets in Text Summarization: a Comparative Study and an Application. In IIR, pages 121–132. Citeseer, 2012. F. A. Asnicar and C. Tasso. ifWeb: a Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web. In Proceedings of WorkshopAdaptive Systems and User Modeling on the World Wide Web’at 6 th International Conference on User Modeling, UM97, Chia Laguna, Sardinia, Italy, pages 3–11, 1997. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology Behind Search. Pearson Higher Education, 2011. R. Baeza-Yates and A. Tiberi. Extracting Semantic Relations from Query Logs. In Proceedings of the 13 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 76–85. ACM, 2007. R. Barzilay, M. Elhadad, et al. Using Lexical Chains for Text Summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, volume 17, pages 10–17. Madrid, Spain, 1997. R. Barzilay, K. R. McKeown, and M. Elhadad. Information Fusion in the Context of Multi-Document Summarization. In Proceedings of ACL, pages 550–557. Association for Computational Linguistics, 1999. R. Barzilay, N. Elhadad, and K. R. McKeown. Sentence Ordering in Multidocument Summarization. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–7. Association for Computational Linguistics, 2001. J. Basilico and T. Hofmann. Unifying Collaborative and Content-Based Filtering. In


Proceedings of the twenty-first international conference on Machine learning, page 9. ACM, 2004. P. B. Baxendale. Machine-Made Index for Technical Literature: an Experiment. IBM Journal of Research and Development, 2:354–361, 1958. S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly Analysis of a Very Large Topically Categorized Web Query Log. In Proceedings of the 27 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 321–328. ACM, 2004. S. M. Beitzel, E. C. Jensen, A. Chowdhury, O. Frieder, and D. Grossman. Temporal Analysis of a Very Large Topically Categorized Web Query Log. Journal of the American Society for Information Science and Technology (JASIST), 58:166–178, 2007. P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui. Modeling the Impact of Short- and Long-term Behavior on Search Personalization. In Proceedings of the 35 th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pages 185–194. ACM, 2012. B. Berendt and M. Spiliopoulou. Analysis of Navigation Behaviour in Web Sites Integrating Multiple Information Systems. The VLDB Journal, 9:56–75, 2000. A. Berger and V. O. Mittal. Query-Relevant Summarization Using FAQs. In Proceedings of the 38 th Annual Meeting on Association for Computational Linguistics, pages 294– 301. Association for Computational Linguistics, 2000a. A. L. Berger and V. O. Mittal. OCELOT: A System for Summarizing Web Pages. In Proceedings of the 23 rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, pages 144–151. ACM, 2000b. S. Berkovsky, T. Baldwin, and I. Zukerman. Aspect-Based Personalized Text Summa-


rization. In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 267–270. Springer, 2008. J. J. Bernard, S. Amanda, and T. Isak. Handbook of Research on Web Log Analysis. Information Science Reference - Hershey, PA, 2009. D. Billsus and M. J. Pazzani. User Modeling for Adaptive News Access. User Modeling and User-Adapted Interaction, 10:147–180, 2000. P. Boldi, F. Bonchi, C. Castillo, D. Donato, and S. Vigna. Query Suggestions Using Query-Fow Graphs. In Proceedings of the 2009 Workshop on Web Search Click Data, pages 56–63. ACM, 2009. P. Borlund. Experimental Components for the Evaluation of Interactive Information Retrieval Systems. Journal of Documentation, 56:71–90, 2000. P. Borlund. The IIR Evaluation Model: a Framework for Evaluation of Interactive Information Retrieval Systems. Information research, 8(3), 2003. R. J. Brachman and H. J. Levesque. Knowledge Representation and Reasoning. Elsevier, 2004. R. Brandow, K. Mitze, and L. F. Rau. Automatic Condensation of Electronic Publications by Sentence Selection. Information Processing and Management, 31:675–685, 1995. C. Buckley, A. Singhal, and M. Mitra. Using Query Zoning and Correlation Within SMART: TREC 5. In TREC, 1996. J. Budzik and K. J. Hammond. User Interactions with Everyday Applications as Context for Just-in-Time Information Access. In Proceedings of the 5 th International Conference on Intelligent User Interfaces, pages 44–51. ACM, 2000. K. Bystr¨ om. Information and Information Sources in Tasks of Varying Complexity.


Journal of the American Society for information Science and Technology, 53(7):581– 591, 2002. K. Bystr¨ om and P. Hansen. Conceptual Framework for Tasks in Information Studies. Journal of the American Society for Information Science and Technology, 56(10): 1050–1061, 2005. K. Bystr¨ om and K. J¨ arvelin. Task Complexity Affects Information Seeking and Use. Information processing & management, 31(2):191–213, 1995. J. N. Cappella, S. Yang, and S. Lee. Constructing Recommendation Systems for Effective Health Messages Using Content, Collaborative, and Hybrid Algorithms. The ANNALS of the American Academy of Political and Social Science, 659(1):290–306, 2015. A. Carbonaro. Improving Web Search and Navigation Using Summarization Process. In Knowledge Management, Information Systems, E-Learning, and Sustainability Research, pages 131–138. Springer, 2010. J. Carbonell and J. Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21 st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM, 1998. M. F. Caropreso, S. Matwin, and F. Sebastiani. Statistical Phrases in Automated Text Categorization. Centre National de la Recherche Scientifique, Paris, France, 47, 2000. Y. Chali, S. A. Hasan, and S. R. Joty. A SVM-Based Ensemble Approach to MultiDocument Summarization. In Advances in Artificial Intelligence, pages 199–202. Springer, 2009. M. Chau, X. Fang, and O. R. Liu Sheng. Analysis of the Query Logs of a Web Site Search


Engine. Journal of the American Society for Information Science and Technology (JASIST), 56:1363–1376, 2005. H. Chen and S. Dumais. Bringing Order to the Web: Automatically Categorizing Search Results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’00, 2000. L. Chen and K. Sycara. WebMate: a Personal Agent for Browsing and Searching. In Proceedings of the Second International Conference on Autonomous Agents, pages 132–139. ACM, 1998. Y. Chen, X. Wang, and Y. Guan. Automatic Text Summarization Based on Lexical Chains. In Advances in Natural Computation, pages 947–951. Springer, 2005. P. R. Chesnais, M. J. Mucklo, and J. A. Sheena. The Fishwrap Personalized News System. In Community Networking, 1995. Integrated Multimedia Services to the Home, Proceedings of the Second International Workshop on, pages 275–282. IEEE, 1995. E. H. Chi, P. Pirolli, and J. Pitkow. The Scent of a Site: A System for Analyzing and Predicting Information Scent, Usage, and Usability of a Web Site. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’00, pages 161–168. ACM, 2000. M. Clark, Y. Kim, U. Kruschwitz, D. Song, D. Albakour, S. Dignum, U. C. Beresi, M. Fasli, and A. De Roeck. Automatically Structuring Domain Knowledge from Text: an Overview of Current Research. Information Processing and Management, 48:552–568, 2012. M. Coyle and B. Smyth. Supporting Intelligent Web Search. ACM Transactions on Internet Technology (TOIT), 7:20, 2007. N. Craswell, D. Hawking, T. Upstill, A. McLean, R. Wilkinson, and M. Wu. TREC11


Web and Interactive Tracks at CSIRO. In Proceedings of TREC-12, pages 193–203, 2003. W. B. Croft, D. Metzler, and T. Strohman. Search Engines - Information Retrieval in Practice. Pearson Education, 2009. ISBN 978-0-13-136489-9. M. K. Dalal and M. A. Zaveri. Heuristics Based Automatic Text Summarization of Unstructured Text. In Proceedings of the International Conference & Workshop on Emerging Trends in Technology, pages 690–693. ACM, 2011. H. Dalianis, M. Hassel, K. de Smedt, A. Liseth, T. C. Lech, and J. Wedekind. Porting and Evaluation of Automatic Summarization. Nordisk Sprogteknologi, pages 2000– 2004, 2003. M. Daoud, L. Tamine-Lechani, M. Boughanem, and B. Chebaro. A Session Based Personalized Search Using an Ontological User Profile. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1732–1736. ACM, 2009. H. Daum´e III and D. L. Marcu. Bayesian Query-Focused Summarization. In Proceedings of COLING/ACL, pages 305–312. Association for Computational Linguistics, 2006. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. JASIS, 41:391–407, 1990. H. Deng, I. King, and M. R. Lyu. Entropy-Biased Models for Query Representation on the Click Qraph. In Proceedings of the 32 nd International ACM SIGIR Conference on Research and development in Information Retrieval, pages 339–346. ACM, 2009. G. Di Caro and M. Dorigo. AntNet: Distributed Stigmergetic Control for Communications Networks. Journal of Artificial Intelligence Research, 9:317–365, 1998. A. D´ıaz and P. Gerv´ as. User-Model Based Personalized Summarization. Information Processing & Management, 43:1715–1734, 2007.


S. Dignum, U. Kruschwitz, M. Fasli, Y. Kim, D. Song, U. C. Beresi, and A. De Roeck. Incorporating Seasonality into Search Suggestions Derived from Intranet Query Logs. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 425–430. IEEE, 2010. A. Diriye, A. Blandford, and A. Tombros. When is System Support Effective?

In Proceedings of IiiX, pages 55–64. ACM, 2010. M. Dorigo, M. Birattari, and T. Stutzle. Ant Colony Optimization. Computational Intelligence Magazine, IEEE, 1:28–39, 2006. P. Dourish. What we Talk about when we Talk about Context. Personal and Ubiquitous Computing, 8(1):19–30, 2004. F. S. Douzidia and G. Lapalme. Lakhas, an Arabic Summarization System. In Proceedings of DUC, pages 128–135. Citeseer, 2004. S. Dumais, E. Cutrell, and H. Chen. Optimizing Search by Showing Results in Context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’01, pages 277–284. ACM, 2001. S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I’ve Seen: a System for Personal Information Retrieval and Re-use. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72–79. ACM, 2003. S. T. Dumais and N. J. Belkin. The TREC Interactive Tracks: Putting the User into Search. TREC: Experiment and Evaluation in Information Retrieval, pages 123–152, 2005. H. P. Edmundson. New Methods in Automatic Extracting. Journal of the ACM (JACM), 16:264–285, 1969. M. El-Haj, U. Kruschwitz, and C. Fox. Creating Language Resources for Under-

Resourced Languages: Methodologies, and Experiments with Arabic. Language Resources and Evaluation, pages 1–32, 2014. M. O. El-Haj and B. H. Hammo. Evaluation of Query-Based Arabic Text Summarization System. In Natural Language Processing and Knowledge Engineering, 2008. NLPKE’08. International Conference on, pages 1–7. IEEE, 2008. H. Fang and C. Zhai. An Exploration of Axiomatic Approaches to Information Retrieval. In Proceedings of the 28 th annual international ACM SIGIR conference on Research and development in information retrieval, pages 480–487. ACM, 2005. S. Feldman and C. Sherman. The High Cost of not Finding Information. Technical Report 29127, IDC, 2001. P. Ferragina and A. Gulli. A Personalized Search Engine Based on Web-Snippet Hierarchical Clustering. Software: Practice and Experience, 38:189–225, 2008. M. Fiszman, D. Demner-Fushman, H. Kilicoglu, and T. C. Rindflesch. Automatic Summarization of MEDLINE Citations for Evidence-Based Medical Treatment: A Topic-Oriented Evaluation. Journal of Biomedical Informatics, 42:801–813, 2009. B. M. Fonseca, P. B. Golgher, E. S. De Moura, B. Pˆossas, and N. Ziviani. Discovering Search Engine Related Queries Using Association Rules. Journal of Web Engineering, 2:215–227, 2003. J. Freyne, B. Smyth, M. Coyle, E. Balfe, and P. Briggs. Further Experiments on Collaborative Ranking in Community-Based Web Search. Artificial Intelligence Review, 21(3-4):229–252, 2004. J. Freyne, R. Farzan, P. Brusilovsky, B. Smyth, and M. Coyle. Collecting Community Wisdom: Integrating Social Search & Social Navigation. In Proceedings of the 12 th International Conference on Intelligent User Interfaces, pages 52–61. ACM, 2007. F. Fukumoto, A. Sakai, and Y. Suzuki. Eliminating Redundancy by Spectral Relax-


ation for Multi-Document Summarization. In Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing, pages 98–102. Association for Computational Linguistics, 2010. P. Fung and G. Ngai. One Story, One Flow: Hidden Markov Story Models for Multilingual Multidocument Summarization. ACM Transactions on Speech and Language Processing (TSLP), 3:1–16, 2006. G. W. Furnas. Effective View Navigation. Proceedings of the SIGCHI conference on Human factors in computing systems - CHI ’97, pages 367–374, 1997. D. Galanis and P. Malakasiotis. Aueb at TAC 2008. In Proceedings of the TAC 2008 Workshop. Citeseer, 2008. S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli. User Profiles for Personalized Information Access. The Adaptive Web, pages 54–89, 2007. D. Gayo-Avello. A Survey on Session Detection Methods in Query Logs and a Proposal for Future Evaluation. Information Sciences, 179:1822–1843, 2009. G. Gentili, A. Micarelli, and F. Sciarrone. Infoweb: An Adaptive Information Filtering System for the Cultural Heritage Domain. Applied Artificial Intelligence, 17, 8:715– 744, 2003. A. G¨oker and D. He. Analysing Web Search Logs to Determine Session Boundaries for User-Oriented Learning. In P. Brusilovsky, O. Stock, and C. Strapparava, editors, Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Lecture Notes in Computer Science, pages 319–322. Springer Berlin Heidelberg, 2000. J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In Proceedings of the 22 nd annual


international ACM SIGIR conference on Research and development in information retrieval, pages 121–128. ACM, 1999. J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-Document Summarization by Sentence Extraction. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic summarization-Volume 4, pages 40–48. Association for Computational Linguistics, 2000. N. Gooda Sahib, A. Tombros, and I. Ruthven. Enabling Interactive Query Expansion through Eliciting the Potential Effect of Expansion Terms. Advances in Information Retrieval, pages 532–543, 2010. F. Gotti, G. Lapalme, L. Nerima, E. Wehrli, and T. du Langage. GOFAIsum: a Symbolic Summarizer for DUC. In Proc. of DUC, volume 7, 2007. U. Hahn and I. Mani. The Challenges of Automatic Summarization. Computer, pages 29–36, 2000. M. Hassel. Evaluation of Automatic Text Summarization. Licentiate Thesis, Stockholm, Sweden, pages 1–75, 2004. D. Hawking. Challenges in Enterprise Search. In Proceedings of the 15 th Australasian Database Conference-Volume 27, pages 15–24. Australian Computer Society, Inc., 2004. D. Hawking. Enterprise Search. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, pages 641–683. Addison-Wesley, 2nd edition, 2011. D. Hawking and J. Zobel. Does Topic Metadata Help with Web Search? JASIST, 58: 613–628, 2007. W. Hersh. TREC 2002 Interactive Track Report. In Proceedings of TREC, 2002.


W. Hersh and P. Over. The TREC-9 Interactive Track Report. NIST Special Publication, pages 41–50, 2001. K. Hong, J. M. Conroy, B. Favre, A. Kulesza, H. Lin, and A. Nenkova. A Repositary of State of The Art and Competitive Baseline Summaries for Generic News Summarization. Proceedings of LREC, May, 2014. E. Hovy and C. Y. Lin. Automated Text Summarization and the SUMMARIST System. In Proceedings of a Workshop on Held at Baltimore, pages 197–214. Association for Computational Linguistics, 1998. B. J. Jansen and A. Spink. Web Search: Public Searching of the Web, volume 6. Springer Science & Business Media, 2004. B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real Life Information Retrieval: a Study of User Queries on the Web. SIGIR Forum, 32:5–17, 1998. B. J. Jansen, D. L. Booth, and A. Spink. Determining the User Intent of Web Search Engine Queries. In Proceedings of the 16 th International Conference on World Wide Web, pages 1149–1150. ACM, 2007a. B. J. Jansen, A. Spink, C. Blakely, and S. Koshman. Defining a Session on Web Search Engines. Journal of the American Society for Information Science and Technology (JASIST), 58:862–871, 2007b. B. J. Jansen, D. L. Booth, and A. Spink. Determining the Informational, Navigational, and Transactional Intent of Web Queries. Information Processing & Management, 44:1251–1266, 2008. K. Jeˇzek and J. Steinberger. Automatic Text Summarization (The State of the Art ˇ 2007 and New Challenges). Katedra informatiky a v`ypoˇcetn´ı techniky, FAV, ZCUZ´ apadoˇcesk´ a Univerzita v Plzni, Univerzitn´ı, 22:14, 2008. H. Jing. Sentence Reduction for Automatic Text Summarization. In Proceedings of the


Sixth Conference on Applied Natural Language Processing, pages 310–315. Association for Computational Linguistics, 2000. H. Jing. Cut-and-Paste Text Summarization. PhD thesis, Columbia University, 2001. H. Jing and K. R. McKeown. Cut and Paste Based Text Summarization. In Proceedings of the 1 st North American Chapter of the Association for Computational Linguistics Conference, pages 178–185. Association for Computational Linguistics, 2000. H. Jing, R. Barzilay, K. McKeown, and M. Elhadad. Summarization Evaluation Methods: Experiments and Analysis. In AAAI Symposium on Intelligent Summarization, pages 51–59, 1998. T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a Tour Guide for The World Wide Web. In In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 770–775. Morgan Kaufmann, 1997. T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately Interpreting Clickthrough Data as Implicit Feedback. In Proceedings of the 28 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 154–161. ACM, 2005. T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search. ACM Transactions on Information Systems (TOIS), 25, 2007. R. I. John and G. J. Mooney. Fuzzy User Modeling for Information Retrieval on the World Wide Web. Knowledge and Information Systems, 3:81–95, 2001. G. JF. Jones and Q. Li. Focused Browsing: Providing Topical Feedback for Link Selection in Hypertext Browsing. In Advances in Information Retrieval, pages 700–704. Springer, 2008.


K. S. Jones and J. R. Galliers. Evaluating Natural Language Processing Systems: An Analysis and Review, volume 1083. Springer Verlag, 1995. R. Jones and K. L. Klinkner. Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs. In Proceeding of the 17 th ACM conference on Information and knowledge management (CIKM’08), pages 699–708. ACM, 2008. R. Jones, B. Rey, O. Madani, and W. Greiner. Generating Query Substitutions. In Proceedings of the 15 th International Conference on World Wide Web, pages 387– 396. ACM, 2006. G. F. Jorge, G. Laurent, F. Olivier, and d. C. Gal. Bag-of-Senses Versus Bag-of-Words: Comparing Semantic and Lexical Approaches on Sentence Extraction. In The Proceedings of the Text Analysis Conference Workshop. TAC, 2008. S. Jul and G. W. Furnas. Navigation in Electronic Worlds: A CHI 97 Workshop. SIGCHI Bull., 29:44–49, 1997. D. Jurafsky and J. H. Martin. Speech and Language Processing: an Introduction to Natural Language Processing , Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Prentice Hall, 2009. J. S. Justeson and S. M. Katz. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural Language Engineering, 1:9–27, 1995. M Kabadjov, R. Steinberger, H. Tanev, M. Turchi, and V. Zavarella. JRC’s Participation at TAC 2011: Guided and Multilingual Summarization Tasks. In In Proceedings of the Text Analysis Conference (TAC, 2011. P. B. Kantor, E. Boros, B. Melamed, V. Me˜ nkov, B. Shapira, and D. J. Neu. Capturing Human Intelligence in the Net. Communications of the ACM, 43(8):112–115, 2000. J. Karim, I. Antonellis, V. Ganapathi, and H. Garcia-Molina. A Dynamic Navigation Guide for Webpages. CHI 2009, pages 1–4. Stanford InfoLab, 2009.


S. M. Katz. Distribution of Content Words and Phrases in Text and Language Modelling. Natural Language Engineering, 2:15–59, 1996. M. Kellar, C. Watters, and M. Shepherd. The Impact of Task on the Usage of Web Browser Navigation Mechanisms. In Proceedings of Graphics interface 2006, pages 235–242. Canadian Information Processing Society, 2006. D. Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3:1–224, 2009. D. Kelly and X. Fu. Eliciting Better Information Need Descriptions from Users of Information Search Systems. Information Processing & Management, 43:30–46, 2007. K. Knight and D. Marcu. Statistics-Based Summarization-Step One: Sentence Compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 703–710. AAAI Press, 2000. R. Kohavi, R. M. Henne, and D. Sommerfield. Practical Guide to Controlled Experiments on the Web: Listen to your Customers Not to the Hippo. In Proceedings of the 13 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 959–967. ACM, 2007. R. Kraft and J. Zien. Mining Anchor Text for Query Refinement. In Proceedings of the 13 th International Conference on World Wide Web, pages 666–674. ACM, 2004. B. Krulwich and C. Burkey. Learning User Information Interests Through Extraction of Semantically Significant Phrases. In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pages 100–112, 1996. U. Kruschwitz. Intelligent Document Retrieval: Exploiting Markup Structure, volume 17 of The Information Retrieval Series. Springer, 2005. U. Kruschwitz, D. Lungley, M-D. Albakour, and D. Song. Deriving Query Sugges-


tions for Site Search. Journal of the American Society for Information Science and Technology (JASIST), 64:1975–1994, 2013. U. Kruschwitz, M. D. Albakour, J. Niu, J. Leveling, N. Nanas, Y. Kim, D. Song, M. Fasli, and A. De Roeck. Moving Towards Adaptive Search in Digital Libraries. Advanced Language Technologies for Digital Libraries, pages 41–60, 2011. B. Kules and R. Capra. Creating eExploratory Tasks for a Faceted Search Interface. In Second Workshop on Human-Computer Interaction (HCIR 2008), 2008. J. Kupiec, J. Pedersen, and F. Chen. A Trainable Document Summarizer. In Proceedings of the 18 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68–73. ACM, 1995. E. Lagergren and P. Over. Comparing Interactive Information Retrieval Systems Across Sites: The TREC-6 Interactive Track Matrix Experiment. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 164–172. ACM, 1998. K. Lang. Newsweeder: Learning to Filter Netnews. In In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339. Morgan Kaufmann, 1995. D. S. Leite and L. H. M. Rino. Combining Multiple Features for Automatic Text Summarization Through Machine Learning. In Computational Processing of the Portuguese Language, pages 122–132. Springer, 2008. H. Li, Y. Cao, J. Xu, Y. Hu, S. Li, and D. Meyerzon. A New Approach to Intranet Search Based on Information Extraction. In Proceedings of the 14 th ACM International Conference on Information and Knowledge Management, pages 460–468. ACM, 2005. H. Lieberman et al. Letizia: An Agent that Assists Web Browsing. In International


Joint Conference on Artificial Intelligence, volume 14, pages 924–929. LAWRENCE ERLBAUM ASSOCIATES LTD, 1995. C. Lin. Training a Selection Function for Extraction. In Proceedings of the Eighth International Conference on Information and Knowledge Management, pages 55–62. ACM, 1999. C. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pages 25–26, 2004. C. Y. Lin and E. Hovy. Identifying Topics by Position. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 283–290. Association for Computational Linguistics, 1997. C. Y. Lin and E. Hovy. Manual and Automatic Evaluation of Summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization-Volume 4, pages 45–51. Association for Computational Linguistics, 2002a. C. Y. Lin and E. Hovy. From Single to Multi-Document Summarization: A Prototype System and its Evaluation. In Proceedings of the 40 th Annual Meeting on Association for Computational Linguistics, pages 457–464. Association for Computational Linguistics, 2002b. C. Y. Lin and E. Hovy.

Automatic Evaluation of Summaries Using N-Gram Co-

Occurrence Statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 71–78. Association for Computational Linguistics, 2003a. C. Y. Lin and E. Hovy. The Potential and Limitations of Automatic Sentence Extraction for Summarization. In Proceedings of the HLT-NAACL 03 on Text Summarization Workshop-Volume 5, pages 73–80. Association for Computational Linguistics, 2003b. J. Lin and M. D. Smucker. How do Users Find Things with Pubmed?: Towards Au-


tomatic Utility Evaluation with User Simulations. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–26. ACM, 2008. S. Lin and N. J. Belkin. Modeling Multiple Information Seeking Episodes. In Proceedings of the ASIS Annual Meeting, volume 37, pages 133–47. ERIC, 2000. F. Liu, C. Yu, and W. Meng. Personalized Web Search for Improving Retrieval Effectiveness. Knowledge and Data Engineering, IEEE transactions on, 16:28–40, 2004. M. L. Liu and T. J. Zhao. Chinese Multi-Document Summarization Based on Topic Detection Technology. In Computational Intelligence and Industrial Applications, 2009. PACIIA 2009. Asia-Pacific Conference on, volume 1, pages 233–236. IEEE, 2009. E. Lloret and M. Palomar. Text Summarisation in Progress: a Literature Review. Artificial Intelligence Review, 37:1–41, 2012. C. T. Lopes. Context Features and their Use in Information Retrieval. In Proceedings of the Third BCS-IRSG Conference on Future Directions in Information Access, FDIA’09, pages 36–42. British Computer Society, 2009. Z. Lu, Z. Dou, J. Lian, X. Xie, and Q. Yang. Content-Based Collaborative Filtering for News Topic Recommendation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. H. P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of research and development, 2:159–165, 1958. Y. Lv, L. Sun, J. Zhang, J. Y. Nie, W. Chen, and W. Zhang. An Iterative Implicit Feedback Approach to Personalized Search. In Proceedings of the 21 st International Conference on Computational Linguistics and the 44 th Annual Meeting of the Asso-


ciation for Computational Linguistics, ACL-44, pages 585–592. Association for Computational Linguistics, 2006. I. Mani and M. T. Maybury. Advances in Automatic Text Summarization, volume 293. MIT Press, 1999. I. Mani, D. House, G. Klein, L. Hirschman, T. Firmin, and B. Sundheim. The TIPSTER SUMMAC Text Summarization Evaluation. In Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics, pages 77–85. Association for Computational Linguistics, 1999. C. D. Manning, P. Raghavan, and H. Sch¨ utze. Web Search Basics. In Introduction to Information Retrieval, chapter 19, pages 421–442. Cambridge University Press Cambridge, 2008. G. Marchionini and R.W. White. Information-Seeking Support Systems. IEEE Computer, 42:30–32, 2009. D. Marcu. Improving Summarization through Rhetorical Parsing Tuning. In The 6 th Workshop on Very Large Corpora, pages 206–215, 1998. D. Martens, M. De Backer, J. Vanthienen, M. Snoeck, and B. Baesens. Classification with Ant Colony Optimization. Evolutionary Computation, IEEE Transactions on, 11:651–665, 2007. N. Matthijs and F. Radlinski. Personalizing Web Search Using Long Term Browsing History. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 25–34. ACM, 2011. J. P. Mc Gowan. A Multiple Model Approach to Personalised Iinformation Access. PhD thesis, University College Dublin, 2003. V. McCargar. Statistical Approaches to Automatic Text Summarization. Bulletin of the American Society for Information Science and Technology, 30:21–25, 2004.


K. McKeown and D. R. Radev. Generating Summaries of Multiple News Articles. In Proceedings of the 18 th annual international ACM SIGIR conference on Research and development in information retrieval, pages 74–82. ACM, 1995. K. R. McKeown, J. L. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. Towards Multidocument Dummarization by Reformulation: Progress and Prospects. In Proceedings of AAAI, pages 453–460, 1999. M. Melucci. Contextual Search: A Computational Framework. Foundations and Trends in Information Retrieval, 6:257–405, 2012. A. Micarelli and F. Sciarrone. Anatomy and Empirical Evaluation of an Adaptive WebBased Information Filtering System. User Modeling and User-Adapted Interaction, 14:159–200, 2004. A. Micarelli, F. Gasparetti, F. Sciarrone, and S. Gauch. Personalized Search on the World Wide Web. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web, pages 195–230. Springer-Verlag, 2007. S. E. Middleton, N. R. Shadbolt, and D. C. D. Roure. Capturing Interest Through Inference and Visualization : Ontological User Profiling in Recommender Systems. In K-CAP2003, 2003. T. Miranda, M. Claypool, A. Gokhale, T. Mir, P. Murnikov, D. Netes, and M. Sartin. Combining Content-Based and Collaborative Filters in an Online Newspaper. In In Proceedings of ACM SIGIR Workshop on Recommender Systems, 1999. D. Mladenic. Personal WebWatcher: Design and Implementation. 1996. B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization. Data Mining and Knowledge Discovery, pages 61–82, 2002.


H. Mochizuki and M. Okumura. A Comparison of Summarization Methods Based on Task-based Evaluation. In LREC, 2000. M. R. Morris, J. Teevan, and S. Bush. Enhancing Collaborative Web Search with Personalization: Groupization, Smart Splitting, and Group Hit-highlighting. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, CSCW ’08, pages 481–484. ACM, 2008. R. Mukherjee and J. Mao. Enterprise Search: Tough Stuff. Queue, 2, 2004. M. D. Mulvenna, S. S. Anand, and A. G. B¨ uchner. Personalization on the Net using Web mining: Introduction. Communications of the ACM, 43:122–125, 2000. K. M¨ uu ¨risep and P. Mutso. ESTSUM-Estonian Newspaper Texts Summarizer. In Proceedings of The Second Baltic Conference on Human Language Technologies, pages 311–316. Citeseer, 2005. N. Nanas and A. De Roeck. Autopoiesis, the Immune System, and Adaptive Information Filtering. Natural Computing, 8:387–427, 2009. N. Nanas, V. S. Uren, and A. De Roeck. Nootropia: a User Profiling Model Based on a Self-Organising Term Network. Artificial Immune Systems, pages 146–160, 2004. H. Nanba and M. Okumura. Producing More Readable Extracts by Revising Them. In Proceedings of the 18 th Conference on Computational Linguistics-Volume 2, pages 1071–1075. Association for Computational Linguistics, 2000. A. Nenkova and K. McKeown. Automatic summarization. Now Publishers, 2011. A. Nenkova and R. Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. In HLT-NAACL, pages 145–152, 2004. E. Newman, W. Doran, N. Stokes, J. Carthy, and J. Dunnion. Comparing Redundancy


Removal Techniques for Multi-Document Summarisation. In Proceedings of STAIRS, pages 223–228, 2004. G. Nunberg. As Google Goes, so Goes the Nation. New York Times, 5, 2003. C. Olston and E. H. Chi. ScentTrails: Integrating Browsing and Searching on the Web. ACM Trans. Comput.-Hum. Interact., 10(3):177–197, 2003. C. Orasan. Comparative Evaluation of Modular Automatic Summarisation Systems Using CAST. PhD thesis, University of Wolverhampton, 2006. P. Over and J. Yen. Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems. In Proceedings of DUC 2004 Document Understanding Workshop, Boston, 2004. T. Paek, S. Dumais, and R. Logan. WaveLens: A New View Onto Internet Search Results. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’04, pages 727–734. ACM, 2004. C. D. Paice.

Constructing Literature Abstracts by Computer: Techniques and

Prospects. Information Processing & Management, 26:171–186, 1990. C. D. Paice and P. A. Jones. The Identification of Important Concepts in Highly Structured Technical Papers. In Proceedings of the 16 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 69– 78. ACM, 1993. S. Park. Personalized Summarization Agent Using Non-negative Matrix Factorization. PRICAI 2008: Trends in Artificial Intelligence, pages 1034–1038, 2008. M. J. Pazzani and D. Billsus. Content-Based Recommendation Systems. In The adaptive web, pages 325–341. Springer, 2007. Z. Pei-ying and L. Cun-he. Automatic Text Summarization Based on Sentences Cluster-


ing and Extraction. In Computer Science and Information Technology, 2009. ICCSIT 2009. 2 nd IEEE International Conference on, pages 167–170. IEEE, 2009. I. Petrovic, P. Perkovic, and I. Stajduhar. A Profile-and Community-Driven Book Recommender System. In Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015 38th International Convention on, pages 631– 635. IEEE, 2015. J. Pitkow, H. Sch¨ utze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel. Personalized Search. Communications of the ACM, 45:50–55, 2002. J. M. Ponte and W. B. Croft. A Language Modeling Approach to Iinformation Retrieval. In Proceedings of the 21 st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281. ACM, 1998. A. Popescul, D. M. Pennock, and S. Lawrence. Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments. In Proceedings of the 17 th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 437–444. Morgan Kaufmann Publishers Inc., 2001. F. Qiu and J. Cho. Automatic Identification of User Interest for Personalized Search. In Proceedings of the 15 th International Conference on World Wide Web, WWW ’06, pages 727–736. ACM, 2006. R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik. A Comprehensive Grammar of the English Language. Studies in Second Language Acquisition, 9:109–111, 1987. D. R. Radev and K. R. McKeown. Generating Natural Language Summaries from Multiple on-Line Sources. Computational Linguistics, 24(3):470–500, 1998. D. R. Radev, E. Hovy, and K. McKeown. Introduction to the Special Issue on Summarization. Computational linguistics, 28:399–408, 2002.


D. R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-Based Summarization of Multiple Documents. Information Processing & Management, 40:919–938, 2004. P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pages 175–186. ACM, 1994. S. E. Robertson and K. S. Jones. Relevance Weighting of Search Terms. Journal of the American Society for Information science, 27(3):129–146, 1976. M. Rosner and C. L. Camilleri. Query-Based Multi-Document Summarisation. In 22 nd International Conference on Computational Linguistics, page 25, 2008. I. Ruthven. Information Retrieval in Context, chapter 8, pages 195–216. Springer, 2011. I. Ruthven, A. Tombros, and J. M. Jose. A Study on the Use of Summaries and Summary-Based Query Expansion for a Question-Answering Task. In 23 rd BCS European Annual Colloquium on Information Retrieval Research (ECIR 2001), pages 1–14. Electronic Workshops in Computing, 2001. S. Z. Saad and U. Kruschwitz. Applying Web Usage Mining for Aadaptive Intranet Navigation. In Multidisciplinary Information Retrieval, volume 6653, pages 118–133. Springer, 2011. S. Z. Saad and U. Kruschwitz. Exploiting Click Logs for Adaptive Intranet Navigation. In Proceedings of the 35 th European Conference on Information Retrieval (ECIR’13), volume 7814 of Lecture Notes in Computer Science, pages 793–796. Springer, 2013. H. Saggion and G. Lapalme. Generating Indicative-informative Summaries with sumUM. Comput. Linguist., 28:497–526, 2002. G. Salton. The SMART Retrieval SystemExperiments in Automatic Document Processing. Prentice-Hall, Inc., 1971.


G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGrawHill computer science series. McGraw-Hill, Inc., 1983. G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18:613–620, 1975. M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4):247–375, 2010. M. Sanderson and B. Croft. Deriving Concept Hierarchies from Text. In Proceedings of the 22 nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 206–213. ACM, 1999. J. B. Schafer, J. A. Konstan, and J. Riedl. E-Commerce Recommendation Applications. In Applications of Data Mining to Electronic Commerce, pages 115–153. Springer, 2001. M. Sharma and R. Patel. A Survey on Information Retrieval Models, Techniques And Applications. International Journal of Emerging Technology and Advanced Engineering, ISSN, pages 2250–2459, 2013. C. Shen and T. Li. Learning to Rank for Query-Focused Multi-document Summarization. In Data Mining (ICDM), 2011 IEEE 11 th International Conference on, pages 626–634. IEEE, 2011. X. Shen, B. Tan, and C. Zhai. Implicit User Modeling for Personalized Search. In Proceedings of the 14 th ACM International Conference on Information and Knowledge Management, pages 824–831. ACM, 2005a. X. Shen, B. Tan, and C. Zhai. Context-Sensitive Information Retrieval Using Implicit Feedback. In Proceedings of the 28 th Annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–50. ACM, 2005b. M. Shmueli-Scheuer, H. Roitman, D. Carmel, Y. Mass, and D. Konopnicki. Extracting


User Profiles from Large Scale Data. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud. ACM, 2010. A. Sieg, B. Mobasher, and R. Burke. Inferring Users Information Context from User Profiles and Concept Hierarchies. In Classification, Clustering, and Data Mining Applications, pages 563–573. Springer Berlin Heidelberg, 2004. A. Sieg, B. Mobasher, and R. Burke. Web Search Personalization with Ontological User Profiles. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pages 525–534. ACM, 2007. C. Silva and B. Ribeiro. The Importance of Stop Word Removal on Recall Values in Text Categorization. In Neural Networks, 2003. Proceedings of the International Joint Conference on, volume 3, pages 1661–1666. IEEE, 2003. C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a Very Large Altavista Query Log. Technical report, Technical Report 1998-014, Systems Research Center, Compaq Computer Corporation, 1998. F. Silvestri. Mining Query Logs: Turning Search Usage Data into Knowledge, volume 4 of Foundations and Trends in Information Retrieval. Now Publisher, 2010. A. Singhal. Modern Information Retrieval: A Brief Overview. IEEE Data Eng. Bull., 24:35–43, 2001. S. Sivapalan, A. Sadeghian, H. Rahnama, and A.M. Madni. Recommender Systems in E-Commerce. In World Automation Congress (WAC), pages 179–184. IEEE, 2014. M. D. Smucker. Information Representation. In I. Ruthven and D. Kelly, editors, Interactive Information Seeking, Behaviour and Retrieval, pages 77–93. Facet Publishing, 2011. B. Smyth. A Community-Based Approach to Personalizing Web Search. Computer, 40: 42–50, 2007. ISSN 0018-9162.

B. Smyth, J. Freyne, M. Coyle, P. Briggs, and E. Balfe. I-SPY - Anonymous,

Community-Based Personalization by Collaborative Meta-Search. In In Proceedings of the 23 rd SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pages 367–380. Springer-Verlag, 2003. B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. Exploiting Query Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User Modeling and User-Adapted Interaction, 14(5):383–423, 2005. R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008. K. Socha, M. Sampels, and M. Manfrin. Ant Algorithms for the University Course Timetabling Problem with Regard to the State-of-the-Art. In Applications of Evolutionary Computing, volume 2611 of Lecture Notes in Computer Science, pages 334– 345. Springer Berlin Heidelberg, 2003. W. Song, Y. Zhang, T. Liu, and S. Li. Bridging Topic Modeling and Personalized Search. In Proceedings of the 23 rd International Conference on Computational Linguistics: Posters, pages 1167–1175. Association for Computational Linguistics, 2010. M. Speretta and S. Gauch. Personalized Search Based on User Search Histories. In Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on, pages 622–628. IEEE, 2005. A. Spink, H. C. Ozmutlu, and S. Ozmutlu. Multitasking Information Seeking and Searching Processes. Journal of the american society for information science and technology, 53(8):639–652, 2002. A. Stefani and C. Strappavara. Personalizing aAccess to Web Sites: The SiteIF Project.


In Proceedings of the 2 nd Workshop on Adaptive Hypertext and Hypermedia HYPERTEXT, volume 98, pages 20–24, 1998. G. C. Stein, A. Bagga, and G. B. Wise. Multi-Document Summarization: Methodologies and Evaluations. In Proceedings of TALN00, pages 337–346, 2000. J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jeˇzek. Two Uses of Anaphora Resolution in Summarization. Information Processing & Management, pages 1663– 1680, 2007. D. Stenmark and T. Jadaan. Intranet Users’ Information-Seeking Behaviour: A Longitudinal Study of Search Engine Logs. In Proceedings of ASIS&T, Austin, TX, 2006. N. Stokes, J. Rong, and L. Cavedon. NICTAs Update and Question-Based Summarisation Systems at DUC 2007. In Proceedings of the Document Understanding Conference Workshop, 2007. T. Strzalkowski, L. Guthrie, J. Karlgren, J. Leistensnider, F. Lin, J. P. Carballo, T. Straszheim, J. Wang, and J. Wilding. Natural Language Information Retrieval: TREC-5 Report. In TREC, 1996. X. Su and T. M. Khoshgoftaar. A Survey of Collaborative Filtering Techniques. Advances in artificial intelligence, 2009:4, 2009. K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive Web Search Based on User Profile Constructed without any Effort from Users. In Proceedings of the 13 th International Conference on World Wide Web, pages 675–684. ACM, 2004. J. T. Sun, H. J. Zeng, H. Liu, Y. Lu, and Z. Chen. CubeSVD: a Novel Approach to Personalized Web Search. In Proceedings of the 14 th International Conference on World Wide Web, pages 382–390. ACM, 2005. L. Tamine-Lechani, M. Boughanem, and N. Zemirli. Personalized Document Ranking:


Exploiting Evidence from Multiple User Interests for Profiling and Retrieval. Journal of Digital Information Management, 6:354–365, 2008. L. Tamine-Lechani, M. Boughanem, and M. Daoud. Evaluation of Contextual Information Retrieval Effectiveness: Overview of Issues and Research. Knowledge and Information Systems, 24:1–34, 2010. A. H. Tan and C. Teo. Learning User Profiles for Personalized Information Dissemination. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 1, pages 183– 188. IEEE, 1998. J. Teevan and S. Dumais. Web Retrieval, Ranking and Personalization. In I. Ruthven and D. Kelly, editors, Interactive Information Seeking, Behaviour and Retrieval, pages 189–203. Facet Publishing, 2011. J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing Search via Automated Analysis of Interests and Activities. In Proceedings of the 28 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 449–456. ACM, 2005. J. Teevan, M. Ringel Morris, and S. Bush. Discovering and Using Groups to Improve Personalized Search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, pages 15–24. ACM, 2009. J. Teevan, S. T. Dumais, and E. Horvitz. Potential for Personalization. ACM Transactions on Computer-Human Interaction, 17:4:1–4:31, 2010. L. Terveen and W. Hill. Beyond Recommender Systems: Helping People Help Each Other. HCI in the New Millennium. Addison Wesley, pages 487–509, 2001. S. Teufel and M. Moens. Sentence Extraction as a Classification Task. In Proceedings of the ACL, volume 97, pages 58–65, 1997.


A. Tombros and M. Sanderson. Advantages of Query Biased Summaries in Information Retrieval. In Proceedings of the 21 st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2–10. ACM, 1998. R. Tucker. Automatic Summarising and the CLASP System. PhD thesis, University of Cambridge, Computer Laboratory, 2000. A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast Generation of Result Snippets in Web Search. In Proceedings of the 30 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. ACM, 2007. S. A. Uddin, M. N.and Khan. A Study on Text Summarization Techniques and Implement Few of them for Bangla Language. In Computer and Information Technology, 2007. iccit 2007. 10 th International Conference on, pages 1–4. IEEE, 2007. D. Vadas and J. R. Curran. Parsing Noun Phrases in the Penn Treebank. Computational Linguistics, 37:753–809, 2011. D. Vallet, I. Cantador, and J. Jose. Personalizing Web Search with Folksonomy-Based User and Document Profiles. Advances in Information Retrieval, pages 420–431, 2010. C. J. Van Rijsbergen. A Theoretical Basis for the Use of Co-Occurrence Data in Information Retrieval. Journal of documentation, 33:106–119, 1977. R. Varadarajan and V. Hristidis. A System for Query-Specific Document Summarization. In Proceedings of the 15 th ACM International Conference on Information and Knowledge Management, pages 622–631. ACM, 2006. X. Wan. Topic Analysis for Topic-Focused Multi-Document Summarization. In Proceedings of CIKM, pages 1609–1612. ACM, 2009. X. Wan. Towards a Unified Approach to Simultaneous Single-Document and MultiDocument Summarizations. In Proceedings of the 23 rd International Conference on

References

184

Computational Linguistics, pages 1137–1145. Association for Computational Linguistics, 2010. X. Wan, J. Yang, and J. Xiao. Manifold-Ranking Based Topic-Focused Multi-Document Summarization. In Proceedings of the 20 th International Joint Conference on Artifical Intelligence, pages 2903–2908. Morgan Kaufmann Publishers Inc., 2007. C. Wang, F. Jing, L. Zhang, and H. J. Zhang. Learning Query-Biased Web Page Summarization. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pages 555–562. ACM, 2007. C. Wang, L. Li, and Y. Zhong. A New Evaluating Method for Chinese Text Summarization not Requiring. In Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on, pages 1–7, 2009. H. Wang and G. Zhou. Toward a Unified Framework for Standard and Update MultiDocument Summarization. ACM Transactions on Asian Language Information Processing (TALIP), 11:5, 2012. F. Wei, W. Li, Q. Lu, and Y. He. Query-Sensitive Mutual Reinforcement Chain and its Application in Query-Oriented Multi-Document Summarization. In Proceedings of the 31 st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283–290. ACM, 2008. X. Wei and W. B. Croft. Modeling Term Associations for Ad-Hoc Retrieval Performance Within Language Modeling Framework. In Advances in Information Retrieval, volume 4425 of Lecture Notes in Computer Science, pages 52–63. Springer Berlin Heidelberg, 2007. L. Wenjie, W. Furu, L. Qin, and H. Yanxiang. PNR 2: Ranking Sentences with Positive and Negative Reinforcement for Query-Oriented Update Summarization. In Proceedings of COLING, pages 489–496. Association for Computational Linguistics, 2008.

References

185

A. Wexelblat and P. Maes. Footprints: History-Rich Tools for Information Foraging. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 270–277. ACM, 1999. R. White, I. Ruthven, and J. M. Jose. Web Document Summarisation: a Task-Oriented Evaluation. In Database and Expert Systems Applications, 2001. Proceedings. 12 th International Workshop on, pages 951–955. IEEE, 2001. R. W. White and J. Huang. Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs. In Proceedings of the 33 rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 587–594. ACM, 2010. R. W. White, J. M. Jose, and I. Ruthven. Implicit Contextual Modelling for Information Seeking. In Proceedings of the Glasgow Context Group 1 st Colloquium: Building Bridges: Interdisciplinary Context-Sensitive Computing, 2002a. R. W. White, I. Ruthven, and J. M. Jose. Finding Relevant Documents Using Top Ranking Sentences: An Evaluation of Two Alternative Schemes. In Proceedings of the 25 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’02, pages 57–64. ACM, 2002b. R. W. White, J. M. Jose, and I. Ruthven. A Task-oriented Study on the Influencing Effects of Query-biased Summarisation in Web Searching. Information Processing and Management, 39:707–733, 2003. R. W. White, I. Ruthven, J. M. Jose, and CJ. Van Rijsbergen. Evaluating Implicit Feedback Models Using Searcher Simulations. ACM Transactions on Information Systems (TOIS), 23(3):325–361, 2005. R. W. White, M. Bilenko, and S. Cucerzan. Studying the Use of Popular Destinations to Enhance Web Search Interaction. In Proceedings of the 30 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 159–166. ACM, 2007.

References

186

D. H. Widyantoro, J. Yin, M. El Nasr, L. Yang, A. Zacchi, and J. Yen. Alipes: A Swift Messenger in Cyberspace. In Proceedings of Spring Symposium Workshop on Intelligent Agents in Cyberspace, pages 62–67, 1999. B. F. William. Stemming Algorithms. In B. F. William and B-Y. Ricardo, editors, Information Retrieval: Data Structures & Algorithms, chapter 8, pages 131–160. PrenticeHall, Inc., 1992. M. Wilson. Interfaces for Information Retrieval. In I. Ruthven and D. Kelly, editors, Interactive Information Seeking, Behaviour and Retrieval, pages 139–170. Facet Publishing, 2011. C. G. Wolf, S. R. Alpert, J. G. Vergo, L. Kozakov, and Y. Doganata. Summarizing Technical Support Documents for Search: Expert and User Studies. IBM Systems Journal, 43:564–586, 2004. J. Yan, W. Chu, and R. W. White. Cohort Modeling for Enhanced Personalized Search. In Proceedings of the 37 th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 505–514. ACM, 2014. R. Yan, J. Y. Nie, and X. Li. Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization. In In Proceedings of EMNLP, pages 1342–1351, 2011. X. Yang, Y. Guo, Y. Liu, and H. Steck. A Survey of Collaborative Filtering Based Social Recommender Systems. Computer Communications, 41:1–10, 2014. O. Yeloglu, E. Milios, and N. Zincir-Heywood. Multi-Document Summarization of Scientific Corpora. In Proceedings of the 2011 ACM Symposium on Applied Computing, pages 252–258. ACM, 2011. C. Yu. Personalized Web Search for Improving Retrieval Effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16:28–40, 2004.

References

187

X. Yuan and N. J. Belkin. Investigating Information Retrieval Support Techniques for Different Information-Seeking Strategies. JASIST, 61:1543–1563, 2010. H. Zhang, Z. C. W. Y. Ma, and Q. Cai. A Study for Documents Summarization Based on Personal Annotation. In Proceedings of the HLT-NAACL 03 on Text Summarization Workshop-Volume 5, pages 41–48. Association for Computational Linguistics, 2003. W. Zhang, S. Liu, C. Yu, C. Sun, F. Liu, and W. Meng. Recognition and Classification of Noun Phrases in Queries for Effective Retrieval. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pages 711–720. ACM, 2007. Y. Zhou and W. B. Croft. Measuring Ranked List Robustness for Query Performance Prediction. Knowledge and Information Systems, 16:155–171, 2008.

Appendices


Appendix A: Entry Questionnaire


Searcher # ___________        Condition ___________

TREC-9 INTERACTIVE SEARCHING STUDY
ENTRY QUESTIONNAIRE

What [high] school/college/university degrees/diplomas do you have (or expect to have)?

    degree ____________    subject ____________    date ____________

What is your occupation?    ___________________________________

What is your gender?    o Female    o Male

What is your age?    ________ years

Have you participated in previous searching studies?    o Yes    o No

Overall, for how many years have you been doing online searching?    _________ years

Please circle the number closest to your experience.
How much experience have you had...
(1 = No experience, 3 = Some experience, 5 = A great deal of experience)

1. Using a point-and-click interface (e.g., Macintosh, Windows)                          1  2  3  4  5
2. Searching on computerized library catalogs, either locally (e.g., your library)
   or remotely (e.g., Library of Congress)                                               1  2  3  4  5
3. Searching on commercial online systems (e.g., BRS Afterdark, Dialog, Lexis-Nexis)     1  2  3  4  5
4. Searching on world wide web search services (e.g., Alta Vista, Excite, Yahoo,
   HotBot, WebCrawler, Google, Bing)                                                     1  2  3  4  5

Please circle the number that is closest to your searching behaviour.

5. How often do you conduct a search on any kind of system?                              1  2  3  4  5
   (1 = Never, 2 = Once or twice a year, 3 = Once or twice a month,
    4 = Once or twice a week, 5 = Once or twice a day)

Please circle the number that indicates to what extent you agree with the following statement.

6. I enjoy carrying out information searches.                                            1  2  3  4  5
   (1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree)

Appendix B: Post-Search Questionnaire


Searcher # ___________    Condition ___________    Topic # ____________

TREC-9 INTERACTIVE SEARCHING STUDY
POST-SEARCH QUESTIONNAIRE

Please answer the following questions as they relate to this specific topic.
(1 = Not at all, 3 = Somewhat, 5 = Extremely)

1. Are you familiar with this topic?                        1  2  3  4  5
2. Was it easy to get started on this search?               1  2  3  4  5
3. Was it easy to do the search on this topic?              1  2  3  4  5
4. Are you satisfied with your search results?              1  2  3  4  5
5. Did you have enough time to do an effective search?      1  2  3  4  5

Appendix C: Post-System Questionnaire


Searcher # ___________    Condition ___________

TREC-9 INTERACTIVE SEARCHING STUDY
POST-SYSTEM QUESTIONNAIRE

Now, please consider the searching experience that you just had.
(1 = Not at all, 3 = Somewhat, 5 = Extremely)

1. How easy was it to learn to use this information system?          1  2  3  4  5
2. How easy was it to use this information system?                   1  2  3  4  5
3. How well did you understand how to use the information system?    1  2  3  4  5

Please write down any other comments that you have about your searching experience with this
information retrieval system here. Thank you!

Appendix D: Exit Questionnaire


TREC-9 INTERACTIVE SEARCHING STUDY
EXIT QUESTIONNAIRE

Please consider the entire search experience that you just had when you respond to the
following questions.
(1 = Not at all, 3 = Somewhat, 5 = Completely)

1. To what extent did you understand the nature of the searching task?               1  2  3  4  5
2. To what extent did you find this task similar to other searching tasks that you
   typically perform?                                                                 1  2  3  4  5
3. How different did you find the systems from one another?                          1  2  3  4  5

4. Which of the three systems did you find easier to learn to use?
   SYSTEM A        SYSTEM B        SYSTEM C        No difference

5. Which of the three systems did you find easier to use?
   SYSTEM A        SYSTEM B        SYSTEM C        No difference

6. Which of the three systems did you like the best overall?
   SYSTEM A        SYSTEM B        SYSTEM C        No difference

7. What did you like about each of the systems?

8. What did you dislike about each of the systems?

9. Please list any other comments that you have about your overall search experience.

THANK YOU!