Detection of Shifts in User Interests for Personalized Information Filtering

W. Lam*, S. Mukhopadhyay, J. Mostafa**, and M. Palakal

Department of Computer and Information Science, Purdue University School of Science at Indianapolis, 723 W. Michigan St. SL280, Indianapolis, IN 46202

*Department of Management Sciences, S306 Pappajohn Building, The University of Iowa, Iowa City, Iowa 52242-1000

**School of Library and Information Science, Indiana University, Bloomington, IN 47405-1801
Abstract

Several machine learning approaches have been proposed in the literature to automatically learn user interests for information filtering. However, many of them are ill-equipped to deal with changes in user interests that may occur due to changes in the user's personal and professional situations. If undetected over a long time, such changes may cause significant degradation in the filtering performance and user satisfaction during the period of non-detection. In this paper, we present a two-level learning approach to cope with such non-stationary user interests. While the lower level consists of a standard convergence-type machine learning algorithm, the higher level uses Bayesian analysis of the user-provided relevance feedback to detect shifts in user interests. Once such a shift is detected, the lower-level learning algorithm is suitably reinitialized to quickly adapt to the new user profile. Experimental results with simulated users are presented to demonstrate the feasibility of the approach.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. SIGIR'96, Zurich, Switzerland. ©1996 ACM 0-89791-792-8/96/08 $3.50

1 Introduction

Information filtering is concerned with the problem of delivering useful information to a user while preventing an overload of irrelevant information. Information selected for presentation is commonly based on descriptions of user preferences called profiles [Belkin and Croft, 1992]. Typically, the user profile is not known in advance, and can also change with time. The user may choose to provide a limited amount of feedback concerning the relevance of specific information items. The objective is to estimate the user profile from the feedback data so that the filtering system can effectively choose and present information as relevant to the user as possible. This clearly requires adaptive capability on the part of the filtering system so that the performance of the system gradually improves during the course of interaction with the user.

Thus, in the case of text-based document filtering, the overall problem of information filtering may be broadly posed as learning a map from a space of documents to the space of real-valued user relevance factors. Denoting the space of documents as $D$, the objective is to find a map $f : D \to \mathbb{R}$ such that $f(x)$ corresponds to the relevance of a document $x$. Given that such a map is known for all points in $D$, a finite set of documents can always be rank-ordered and presented in a prioritized fashion to the user. As a consequence, several information filtering systems have been proposed in the literature based on different machine learning paradigms; some examples include the Stanford Information Filtering Tool (SIFT) [Yan and Garcia-Molina, 1995], NewsWeeder [Lang, 1995], Browse [Jennings and Higuchi, 1992], and NewT [Seth, 1994].

We have developed a document filtering system, called SIFTER (Smart Information Filtering Technology for Electronic Resources) [Mukhopadhyay et al, 1996], that consists of the following three major components: (i) a document representation module that employs a content-based vector-space document indexing scheme using a predefined thesaurus, (ii) a clustering module that performs two functions: determination of categories in an off-line and unsupervised fashion based on similarities found in a representative set of documents, and on-line classification of incoming vectors to categories during actual operation, and (iii) a user profile learning module that learns user interests over the document categories, based on on-line user relevance feedback and a reinforcement machine learning algorithm. The core of the SIFTER system, consisting of the above three modules, has been applied to filtering LISTSERV mails as well as filtering academic research reports in the domain of computer science.

1.1 Related Work

Information filtering approaches can be characterized according to the amount of user involvement and degree of adaptivity (capability of the system to automatically adjust the filter based on changes in users' interests). In this section we review some filtering research, mainly focusing on the user's role and adaptivity. A more detailed comparative study of some of the existing filtering systems can be found in [Kilander, 1995].

The Information Lens system described by Malone et al [1987] requires direct and explicit user input in filter generation and maintenance. In this system, users must create rules that prescribe appropriate actions with tests on factors such as message type, date, and the sender. Hence, it requires significant user involvement to assure effective filtering and does not provide automatic adaptation capability. Fischer and Stevens [1991] also described a rule-based USENET news filtering system, named Infoscope. Infoscope uses heuristic rules associating common patterns of usage (e.g., number of sessions, newsgroups read, frequencies of relevant terms in articles, etc.) to appropriate actions. In Infoscope, to refine filters, users must add or remove terms from the filter and they must also set appropriate thresholds for rule triggering.

Recently, some filtering approaches have been proposed that attempt to reduce user involvement in filter maintenance and refinement. NewsWeeder [Lang, 1995], a USENET news filtering tool, asks users to rate news articles they read (on a scale of five values). The rated articles and ratings are used as training examples for a machine learning algorithm that is executed nightly to generate the interest profiles for the next day. By limiting the user input to only rating of articles, NewsWeeder is successful in reducing user involvement. However, NewsWeeder's inability to adapt the filter in an on-line fashion limits its utility.

SIFT (Stanford Information Filtering Tool) has also been developed to filter USENET news [Yan and Garcia-Molina, 1995]. SIFT requires users to specify keywords to generate the initial filter. Depending on a user's choice, the filter may be represented using the vector-space model or as a boolean statement. If a vector profile is used, SIFT can provide some adaptivity in the form of filter refinement. In this mode SIFT requires users to provide relevance feedback (by pointing out documents of interest), based on which weights present in the profile are adjusted accordingly. SIFT unfortunately may suffer from similar deficiencies as the Infoscope system, because it assumes users can assess how and when their interests change and that they would be sufficiently motivated to re-configure the filter.

Finally, NewT (news tailor) [Seth, 1994] offers the user the option of using multiple filters, and users can select from pre-designed filters that cover common topical areas. NewT relies on relevance feedback to add and initialize new filters and refine existing filters. NewT reduces user involvement in filter refinement further by utilizing a genetic algorithm to evolve filters toward increased fitness.

Two main conclusions can be drawn from the review presented in this section. First, relevance feedback and machine learning techniques show promise in reducing user's involvement in filter building and refinement. Second, in our view, the existing systems do not adequately address the problem of changing user interests. This has also been pointed out in [Kilander, 1995]. We refer to the phenomenon of changing user interest as user non-stationarity and discuss it in more detail in the next section.

1.2 Problem Description

There are two major sources of non-stationarity that can arise in an information filtering system operating in a dynamic environment. Firstly, the nature and the domain of information may change, which naturally calls for adaptation of the information representation scheme (e.g., switching of thesaurus in a document-based system) as well as the classification scheme. Secondly, even assuming that the overall domain of information is unchanged, a given user's interests in different categories of information may change. Such changes can occur in a professional environment due to a change in the job assignment or the initiation of a new research project on the part of the user. In this paper, our focus is on the latter kind of non-stationarity, which can be handled without any change in the representation or classification modules.

To our knowledge, the existing information filtering approaches do not explicitly account for a non-stationary user interest profile.

However, several such shifts, if unnoticed, can significantly deteriorate the overall filtering performance. Left to themselves, many of the learning approaches require a long time to erase their learned profile (represented in some form of a memory) and relearn the new profile. This is particularly true for algorithms that are designed to optimize long-term filtering performance while coping with randomness in the document stream and user relevance feedback. In general, optimality is a long-term objective where the learning has to be made relatively insensitive to noise. On the other hand, the ability to react quickly to changes requires low inertia and short-term analysis of the feedback signals. These two desirable features are mutually conflicting, and cannot be easily accomplished by a single algorithm, since the realization of the two objectives calls for the use of different techniques. A good compromise can be obtained by using an optimal learning scheme at the lower level while monitoring the user relevance feedback for short-term changes at a higher level. The latter, in turn, suitably reinitializes the lower level upon the detection of a change.

In this paper, we investigate such a two-level system for learning user interests in a document filtering system in a non-stationary context. While the lower level uses a reinforcement learning algorithm to learn the user profile assuming a stationary user, a user interest tracking algorithm using Bayesian decision theory (based on [Zacks and Barzily, 1981]) is employed at the higher level to detect shifts in user interests and reinitialize the learning process.

2 Overview of SIFTER

In this section, we present a brief overview of the three core components of the information filtering system called SIFTER and their functions in order to provide the reader with a broad understanding of its operation. The three core components are the document representation module, the document classification module and the user profile learning module. A brief description of SIFTER is presented here and a more detailed description of these three modules can be found in [Mukhopadhyay et al, 1996].

The approach used in the design of the overall filtering system SIFTER is to decompose the problem of information filtering into learning two maps $f_1$ (from the document space to a finite set of document categories) and $f_2$ (from the set of categories to user relevance values); while $f_1$ is determined in an off-line manner, $f_2$ is learned through interaction with the user. The objective of such decomposition is to reduce the complexity (i.e., the amount of user feedback necessary) of learning the high dimensional relevance map from the document space to relevance values. Since $f_1$ is learned in an a priori unsupervised manner, considerably less user feedback is required to estimate the map $f_2$. The cost is the possibility of sub-optimal performance since user interests are assumed to be constant over each of the categories of documents.

There are two possible ways such sub-optimality can be overcome if the resulting performance is deemed to be inadequate: (i) further adapt the document clusters (e.g., partition large clusters into smaller ones) on the basis of uncertainty of user feedback over a cluster, and (ii) use the overall map in the form of $f_2 \circ f_1$ as an initial condition to learn a more general parametrized map (e.g., a neural network) from the document space to user relevance values. In the latter method, the objective is to quickly arrive at a reasonable performance (as provided by $f_1$ and $f_2$), and then optimize the performance, possibly over a much longer period, by means of the general map. In the limited number of experiments performed with the decomposition approach, however, the filtering performances with $f_1$ and $f_2$ were found to be quite adequate. The incorporation of hierarchical clustering (method (i)) or multi-stage learning (method (ii)) at the present time constitutes future work.

The document representation module determines a finite-dimensional vector description of a document (the input space for $f_1$), and the classifier module finds the category to which a document belongs (the output for $f_1$). The user profile learning module is concerned with the on-line learning of $f_2$, i.e., a map from the set of categories to the relevance values (assuming stationary user).

2.1 Document Representation Using Vector-Space Model

The main purpose of the document representation component is to convert documents arriving in the DOCBOX into numeric structures that are representative of original documents and are easily parsable by other filtering modules. Several methods are available in the classical IR literature for converting textual documents to representative structures. Salton, in [Salton and McGill, 1983], has described a widely used methodology, known as the vector-space model. The current implementation of SIFTER uses the popular tf-idf (term-frequency inverse-document-frequency) technique to generate vector representations of documents.

For generating the weights, an online thesaurus was used with some specific constraints on its content and structure. The thesaurus contains keywords drawn from authoritative sources for controlled vocabulary (for example, ACM Computing Reviews Classification Scheme for documents in the domain of computer science). Using a sufficiently representative document base, a table is generated that contains the total frequencies of all unique terms in the thesaurus.

Next, for each new document to be represented, another table is generated that contains the frequencies of terms in the document. Using these two tables, the following equation is used to compute the elements of the vectors:

$$W_{ik} = T_{ik} \times I_k$$

where $T_{ik}$ is the number of occurrences of term $T_k$ in document $i$, $I_k = \log(N/n_k)$ is the inverse document frequency of the term $T_k$ in the document base, $N$ is the total number of documents in the document base, and $n_k$ is the number of documents in the base that contain the given term $T_k$.

2.2 Document Classification Module

The classification module consists of two processing stages: an unsupervised cluster learning stage and a vector classification stage. During the learning stage, initial cluster hypotheses $[C_1, \ldots, C_k]$ are generated from an initial set of sample test document vectors $[S_1, \ldots, S_N]$. Each cluster $C_i$ is then represented by its centroid, $Z_i$. In SIFTER, each cluster is treated as a specific document category. During the classification stage, an incoming document vector $V_i$ is classified into a particular category $C_k$ using the learned centroids from stage 1. The learning of cluster centroids is done in an off-line batch mode while classification is carried out continuously as documents arrive.

A clustering algorithm that is similar to the Maximin-Distance classification technique [Tou and Gonzalez, 1974] with the cosine similarity measure [Salton and McGill, 1983] is used to generate the centroids. During the on-line operation of SIFTER, this module merely classifies an incoming document vector $V_i$ as belonging to one of the learned cluster categories based on the cosine similarity of $V_i$ with the centroids. The resulting category corresponding to each vector is then passed on to the user profile learning module.

2.3 User Profile Learning Module

The user profile learning module consists of a learning agent that interacts directly with the user and sorts the incoming documents according to its belief of the user preferences for the various categories of documents. To accomplish this task, the learning agent maintains and updates a simplified model of the user. The algorithm currently used to learn the user model is based on a reinforcement learning algorithm studied in the Artificial Intelligence and Mathematical Psychology communities [Narendra and Thathachar, 1989].

Denoting the categories of documents by $C_1, \ldots, C_n$, $d_i$ is used to denote the expected relevance of a document belonging to the category $C_i$. The learning agent maintains and updates two vectors of dimensions equal to the number of categories. The first is the estimated relevance vector, with elements $\hat{d}_i$ $(i = 1, \ldots, n)$, which is an estimate of $d_i$. The second is an action probability vector $p = [p_i]$, such that $p_i$ represents the probability of the category $C_i$ being selected by the filter as the most relevant category. Both the $p$ and $\hat{d}$ vectors are continuously updated during the learning process on the basis of user relevance feedback.

The learning algorithm at every iteration (i.e., presentation of documents to the user) sorts the incoming documents by first sampling the $p$ vector to select the category to be presented at the top. The rest of the categories are sorted according to the corresponding $\hat{d}$ values. The learning algorithm (i.e., the algorithm for updating $p(k)$ and $\hat{d}(k)$) is as follows. $\hat{d}_i(k)$ $(i = 1, \ldots, n)$ at any instant is the running average of the relevance values given by the user for documents belonging to category $i$. Denoting the current maximum element of the $\hat{d}$ vector as having the index $l$, a unit vector $E(k)$ is created of dimension $n$ whose $l$th element is 1, and all other elements are zero. Then $p_i(k)$ $(i = 1, \ldots, n)$ is updated as

$$p_i(k+1) = p_i(k) + \eta\,(E_i(k) - p_i(k))$$

where $0 < \eta < 1$ is a suitably chosen step-size. Thus, the $p$ vector is moved by a small distance towards the optimal unit vector. Asymptotically, one element of the $p$ vector converges to one, while the other elements converge to zero. However, during the convergence process, all categories are probabilistically given the chance to be ranked at the top. This allows the user model, in the form of the $\hat{d}$ vector, to be learned sufficiently accurately.

While in theory the learning algorithm used has the capability to come out of the converged state and relearn in the presence of user shifts, in practice such relearning requires a very long time. During the intermediate period the filtering performance is poor since the ranking of the documents is carried out by means of an invalid user model. It is to cope with such nonstationary users that we propose the shift detection module in this paper. The method used to accomplish this is described in the following section.
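The three steps of Section 2 (term weighting, nearest-centroid classification, and the profile update) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the SIFTER implementation: the names `tfidf_vector`, `classify`, and `ProfileLearner`, the default step size `eta=0.1`, and the uniform initial action probabilities are hypothetical choices made for the example. The term weight is assumed to be the product of the in-document count and `log(N / n_k)`, and relevance feedback is assumed to be a number in [0, 1].

```python
import math
import random


def tfidf_vector(term_counts, doc_freq, n_docs, thesaurus):
    """Weight each thesaurus term k by (count in document) * log(N / n_k)."""
    vec = []
    for term in thesaurus:
        n_k = doc_freq.get(term, 0)
        if n_k == 0:
            vec.append(0.0)  # term never seen in the document base
        else:
            vec.append(term_counts.get(term, 0) * math.log(n_docs / n_k))
    return vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(vec, centroids):
    """Return the index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


class ProfileLearner:
    """Keeps the estimated relevance vector d_hat and action probabilities p."""

    def __init__(self, n_categories, eta=0.1):
        self.eta = eta                                # step size, 0 < eta < 1
        self.p = [1.0 / n_categories] * n_categories  # assumed uniform start
        self.d_hat = [0.0] * n_categories
        self.counts = [0] * n_categories

    def rank_categories(self, rng=random):
        """Sample the top category from p; order the rest by d_hat."""
        n = len(self.p)
        top = rng.choices(range(n), weights=self.p)[0]
        rest = sorted((i for i in range(n) if i != top),
                      key=lambda i: self.d_hat[i], reverse=True)
        return [top] + rest

    def update(self, category, relevance):
        """Fold one feedback value into d_hat, then move p toward E(k)."""
        self.counts[category] += 1
        # Running average of the relevance values seen for this category.
        self.d_hat[category] += (
            (relevance - self.d_hat[category]) / self.counts[category]
        )
        best = max(range(len(self.d_hat)), key=lambda i: self.d_hat[i])
        # p_i(k+1) = p_i(k) + eta * (E_i(k) - p_i(k)), E(k) a unit vector.
        self.p = [p_i + self.eta * ((1.0 if i == best else 0.0) - p_i)
                  for i, p_i in enumerate(self.p)]
```

Because each update is a convex combination of `p` and a unit vector, `p` stays a probability vector. The same property makes recovery slow once `p` has nearly converged, which is the behaviour that motivates a separate shift detector.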

3 User Interest Tracking Scheme

In this section, we describe the higher-level user interest tracking scheme in more detail. Section 3.1 provides a qualitative overview of the method, while 3.2 describes the technical details. The various underlying assumptions are mentioned and justified in the present context.

3.1 Outline of the Tracking Scheme

For nonstationary users, the relevance probability of a category varies with time. We present a tracking scheme capable of detecting a shift in the relevance probability of a category. The tracking is performed on each category separately. For each category, a history of relevance feedback data is collected. The tracking scheme employs a Bayesian framework to detect a shift in the relevance probability based on the relevance feedback data. Intuitively, the shift detection algorithm relies on a finite window of possibly noisy relevance feedbacks collected for each category and analyzes the collected feedback to decide whether the user's interest in the given category has genuinely changed or the variations (if any) in the feedback are due only to inherent noisiness. An attractive feature of this tracking scheme is that it can be applied on top of any learning scheme. In particular, it is especially suitable for convergence-type schemes such as that described in section 2.3.

For simplicity of analysis, the method presented here assumes a scenario in which a single shift has occurred at some point of time between $t = 0$ and $t = \infty$ with a known prior probability distribution of the time of shift. When multiple shifts occur in the user interest, the time interval between two successive shifts is assumed to be sufficiently large. This allows each shift detection problem to be treated independently, and the time window over which a shift can occur to be idealized as an infinite horizon window. The infinite shift horizon assumption permits the use of certain mathematical identities which greatly simplify the computations.

Let $\beta_1, \beta_2, \ldots, \beta_n$ be a sequence of the relevance feedback data collected for a particular category. $\beta_i$ is either 1 or 0, governed by the underlying [...]

[...] compute the posterior probabilities $h^{(u)}(B_n)$ and $h^{(d)}(B_n)$ of an upward shift and a downward shift, respectively. As proved in [Zacks and Barzily, 1981], the sequence $h^{(u)}(B_n)$ (and $h^{(d)}(B_n)$) is a submartingale, which implies that, as $n \to \infty$, an actual shift will eventually be detected. In practice, because of the finite choice of $n$, there will always be a nonzero probability of making a wrong detection, namely ignoring a shift (missed detection) and declaring an incorrect shift (false alarm). A decision function is formulated based on the shift probability and two cost quantities associated with misses and false alarms. This decision function is used to make a decision regarding whether a shift has occurred. If a shift is determined, the tracking system informs the learning agent and an appropriate reinitialization of the latter's states takes place.

3.2 Some Practical Considerations: Use of Decision Functions

The posterior [...] the learning agent should reinitialize the action probability vector in response to the shift declaration.

c&red for the category i. To formulate this decision process, we introduce

shift

parameter

h(d)

Since each relevance

(Bn)

is high,

downward

interest)

respectively,

the assumptions and h(d) (BJ are provided

(i. e., a decrease given

B~.

probaQual-

shift is likely

to have oc-

Let kl be the cost of ignoring

shti

two

a shift

respectively

for each category

i. These

are given as follows:

1981]. Our

If k >1,

=

[h\ ’’)(BJ]k

+

=

[h\d)(BJ]k

(BJ

the decision function

(1)

Fi grows in a slower k