Detection of Shifts in User Interests for Personalized Information Filtering

W. Lam*, S. Mukhopadhyay, J. Mostafa**, M. Palakal

Department of Computer and Information Science, Purdue School of Science at Indianapolis, 723 W. Michigan St. SL280, Indianapolis, IN 46202

*Department of Management Sciences, The University of Iowa, S306 Pappajohn Building, Iowa City, Iowa 52242-1000

**School of Library and Information Science, Indiana University, Bloomington, IN 47405-1801

Abstract

Several machine learning approaches have been proposed in the literature to automatically learn user interests for information filtering. However, many of them are ill-equipped to deal with changes in user interests that may occur due to changes in the user's personal and professional situations. If undetected over a long time, such changes may cause significant degradation in the filtering performance and user satisfaction during the period of non-detection. In this paper, we present a two-level learning approach to cope with such non-stationary user interests. While the lower level consists of a standard convergence-type machine learning algorithm, the higher level uses Bayesian analysis of the user-provided relevance feedback to detect shifts in user interests. Once such a shift is detected, the lower-level learning algorithm is suitably reinitialized to quickly adapt to the new user profile. Experimental results with simulated users are presented to demonstrate the feasibility of the approach.

1 Introduction

Information filtering is concerned with the problem of delivering useful information to a user while preventing an overload of irrelevant information. Information selected for presentation is commonly based on descriptions of user preferences called profiles [Belkin and Croft, 1992]. Typically, the user profile is not known in advance, and can also change with time. The user may choose to provide a limited amount of feedback concerning the relevance of specific information items. The objective is to estimate the user profile from the feedback data so that the filtering system can effectively choose and present information as relevant to the user as possible. This clearly requires adaptive capability on the part of the filtering system so that the performance of the system gradually improves during the course of interaction with the user. Thus, in the case of text-based document filtering, the overall problem of information filtering may be broadly posed as learning a map from a space of documents to the space of real-valued user relevance factors. Denoting the space of documents as D, the objective is to find a map f : D -> R such that f(x) corresponds to the relevance of a document x. Given that such a map is known for all points in D, a finite set of documents can always be rank-ordered and presented in a prioritized fashion to the user. As a consequence, several information filtering systems have been proposed in the literature based on different machine learning paradigms; some examples include the Stanford Information Filtering Tool (SIFT) [Yan and Garcia-Molina, 1995], NewsWeeder [Lang, 1995], Browse [Jennings and Higuchi, 1992], and NewT [Seth, 1994].

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. SIGIR'96, Zurich, Switzerland. © 1996 ACM 0-89791-792-8/96/08...$3.50
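The rank-ordering step just described can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in (the toy relevance map `f_hat` simply counts an interest term), not part of any actual filtering system:

```python
# Sketch: given any estimate f_hat of the relevance map f: D -> R,
# a finite batch of documents can be rank-ordered and presented
# in a prioritized fashion.

def rank_documents(documents, relevance_map):
    """Sort documents by estimated relevance, highest first."""
    return sorted(documents, key=relevance_map, reverse=True)

# Toy relevance map over whitespace-tokenized documents: the count
# of a single interest term stands in for a learned relevance score.
def f_hat(doc):
    return doc.lower().split().count("retrieval")

docs = ["information retrieval systems",
        "cooking recipes",
        "retrieval and retrieval models"]
ranked = rank_documents(docs, f_hat)
```

Any learned map with the same signature could be substituted for `f_hat`; the presentation logic does not change.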
We have developed a document filtering system, called SIFTER (Smart Information Filtering Technology for Electronic Resources) [Mukhopadhyay et al., 1996], that consists of the following three major components: (i) a document representation module that employs a content-based vector-space document indexing scheme using a predefined thesaurus, (ii) a clustering module that performs two functions: off-line and unsupervised determination of categories based on similarities found in a representative set of documents, and on-line classification of incoming vectors to categories during actual operation, and (iii) a user profile learning module that learns user interests over the document categories, based on on-line user relevance feedback and a reinforcement machine learning algorithm. The core of the SIFTER system, consisting of the above three modules, has been applied to filtering LISTSERV mails as well as filtering academic research reports in the domain of computer science.

1.1 Related Work

Information filtering approaches can be characterized according to the amount of user involvement and degree of adaptivity (capability of the system to automatically adjust the filter based on changes in users' interests). In this section we review some filtering research, mainly focusing on the user's role and adaptivity. A more detailed comparative study of some of the existing filtering systems can be found in [Kilander, 1995].

The Information Lens system described by Malone et al. [1987] requires direct and explicit user input in filter generation and maintenance. In this system, users must create rules that prescribe appropriate actions with tests on factors such as message type, date, and the sender. Hence, it requires significant user involvement to assure effective filtering and does not provide automatic adaptation capability. Fischer and Stevens [1991] also described a rule-based USENET news filtering system, named Infoscope. Infoscope uses heuristic rules associating common patterns of usage (e.g., number of sessions, newsgroups read, frequencies of relevant terms in articles, etc.) to appropriate actions. In Infoscope, to refine filters, users must add or remove terms from the filter and they must also set appropriate thresholds for rule triggering.

Recently, some filtering approaches have been proposed that attempt to reduce user involvement in filter maintenance and refinement. NewsWeeder [Lang, 1995], a USENET news filtering tool, asks users to rate news articles they read (on a scale of five values). The rated articles and ratings are used as training examples for a machine learning algorithm that is executed nightly to generate the interest profiles for the next day. By limiting the user input to only rating of articles, NewsWeeder is successful in reducing user involvement. However, NewsWeeder's inability to adapt the filter in an on-line fashion limits its utility. SIFT (Stanford Information Filtering Tool) has also been developed to filter USENET news [Yan and Garcia-Molina, 1995]. SIFT requires users to specify keywords to generate the initial filter. Depending on a user's choice, the filter may be represented using the vector-space model or as a boolean statement. If a vector profile is used, SIFT can provide some adaptivity in the form of filter refinement. In this mode SIFT requires users to provide relevance feedback (by pointing out documents of interest), based on which weights present in the profile are adjusted accordingly. SIFT unfortunately may suffer from similar deficiencies as the Infoscope system, because it assumes users can assess how and when their interests change and that they would be sufficiently motivated to re-configure the filter. Finally, NewT (news tailor) [Seth, 1994] offers the user the option of using multiple filters, and users can select from pre-designed filters that cover common topical areas. NewT relies on relevance feedback to add and initialize new filters and refine existing filters. NewT reduces user involvement in filter refinement further by utilizing a genetic algorithm to evolve filters toward increased fitness.

Two main conclusions can be drawn from the review presented in this section. First, relevance feedback and machine learning techniques show promise in reducing the user's involvement in filter building and refinement. Second, in our view, the existing systems do not adequately address the problem of changing user interests. This has also been pointed out in [Kilander, 1995]. We refer to the phenomenon of changing user interest as user non-stationarity and discuss it in more detail in the next section.

1.2 Problem Description

There are two major sources of non-stationarity that can arise in an information filtering system operating in a dynamic environment. Firstly, the nature and the domain of information may change, which naturally calls for adaptation of the information representation scheme (e.g., switching of thesaurus in a document-based system) as well as the classification scheme. Secondly, even assuming that the overall domain of information is unchanged, a given user's interests in the different categories of information may change. Such changes can occur in a professional environment due to a change in the job assignment or the initiation of a new research project on the part of the user. In this paper, our focus is on the latter kind of non-stationarity, which can be handled without any change in the representation or classification modules. To our knowledge, the existing information filtering approaches do not explicitly account for a non-stationary user interest profile.
However, such shifts, if unnoticed, can significantly deteriorate the overall filtering performance. Left to themselves, many of the learning approaches require a long time to erase their learned profile (represented in some form of a memory) and relearn the new profile. This is particularly true for algorithms that are designed to optimize long-term filtering performance while coping with randomness in the document stream and user relevance feedback.

In general, learning the user profile in a document filtering system is a long-term process, where optimality has to be made relative to the whole stream of feedback, and the learning algorithm has to be relatively insensitive to noise. On the other hand, the ability to react quickly to changes requires low inertia and short-term analysis of the feedback signals. These two desirable features are mutually conflicting and cannot be easily accomplished by a single algorithm, since the realization of the two objectives calls for the use of different techniques. A good compromise can be obtained by using an optimal learning scheme at the lower level while monitoring the user relevance feedback for short-term changes at a higher level. The latter, in turn, suitably reinitializes the lower level upon the detection of a change. In this paper, we investigate such a two-level system for learning user interests in a non-stationary document filtering context. While the lower level uses a reinforcement learning algorithm to learn the user profile assuming a stationary user, a user interest tracking algorithm using Bayesian decision theory (based on [Zacks and Barzily, 1981]) is employed at the higher level to detect shifts in user interests and reinitialize the lower-level learning.

2 Overview of SIFTER

In this section, we present a brief overview of the three core components of the information filtering system called SIFTER and their functions in order to provide the reader with a broad understanding of its operation. The three core components are the document representation module, the document classification module and the user profile learning module. A brief description of SIFTER is presented here and a more detailed description of these three modules can be found in [Mukhopadhyay et al., 1996].

The approach used in the design of the overall filtering system SIFTER is to decompose the problem of information filtering into learning two maps: f1 (from the document space to a finite set of document categories) and f2 (from the set of categories to user relevance values); while f1 is determined in an off-line manner, f2 is learned through interaction with the user. The objective of such decomposition is to reduce the complexity (i.e., the amount of user feedback necessary) of learning the high-dimensional relevance map from the document space to relevance values. Since f1 is learned in an a priori unsupervised manner, a considerably smaller amount of user feedback is required to estimate the map f2. The cost is the possibility of sub-optimal performance, since user interests are assumed to be constant over each of the categories of documents. There are two possible ways such sub-optimality can be overcome if the resulting performance is deemed to be inadequate: (i) adapt the document clusters (e.g., partition large clusters into smaller ones) on the basis of uncertainty of user feedback over a cluster, and (ii) use the overall map in the form of f2 o f1 as an initial condition to learn a more general parametrized map (e.g., a neural network) from the document space to user relevance values. In the latter method, the objective is to quickly arrive at a reasonable performance and then optimize the performance, possibly over a much longer period, by means of the general map. In the limited number of experiments performed with the decomposition approach, however, the filtering performances with f1 and f2 were found to be quite adequate. The incorporation of hierarchical clustering (method (i)) or multi-stage learning (method (ii)) at the present time constitutes future work.

The document representation module determines a finite-dimensional vector description of a document (the input space for f1), and the classifier module finds the category to which a document belongs (the output for f1). The user profile learning module is concerned with the on-line learning of f2, i.e., a map from the set of categories to the relevance values (assuming a stationary user).

2.1 Document Representation Using Vector-Space Model

The main purpose of the document representation component is to convert documents arriving in the DOCBOX into numeric structures that are representative of the original documents and are easily parsable by other filtering modules. Several methods are available in the classical IR literature for converting textual documents to representative structures. Salton [Salton and McGill, 1983] has described a widely used methodology, known as the vector-space model. The current implementation of SIFTER uses the popular tf-idf (term-frequency inverse-document-frequency) technique to generate vector representations of documents. For generating the weights, an online thesaurus was used with some specific constraints on its content and structure. The thesaurus contains keywords drawn from authoritative sources for controlled vocabulary (for example, the ACM Computing Reviews Classification Scheme for documents in the domain of computer science). Using a sufficiently representative document base, a table is generated that contains the total frequencies of all unique terms in the thesaurus.
Next, for each new document to be represented, another table is generated that contains the frequencies of terms in the document. Using these two tables, the following equation is used to compute the elements of the vectors:

    d_ik = t_ik * I_k

where t_ik is the number of occurrences of term T_k in document i, I_k = log(N/n_k) is the inverse document frequency of the term T_k in the document base, N is the total number of documents in the document base, and n_k is the number of documents in the base that contain the given term T_k.
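The weighting equation above can be sketched in a few lines. The thesaurus terms, document frequencies, and corpus size below are invented for illustration and are not drawn from the actual system:

```python
import math

# Sketch of the tf-idf weighting d_ik = t_ik * log(N / n_k):
# one vector element per thesaurus term.

def tfidf_vector(doc_tokens, thesaurus, doc_freq, n_docs):
    """Weight of term k = (occurrences t_ik in this document) * idf I_k."""
    return [doc_tokens.count(term) * math.log(n_docs / doc_freq[term])
            for term in thesaurus]

thesaurus = ["filtering", "retrieval", "clustering"]
doc_freq = {"filtering": 10, "retrieval": 50, "clustering": 100}  # n_k
v = tfidf_vector("filtering filtering retrieval".split(),
                 thesaurus, doc_freq, n_docs=100)
```

A term occurring in every document (n_k = N) gets weight zero, so ubiquitous thesaurus terms carry no discriminating power in the resulting vectors.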
2.2 Document Classification Module

The classification module consists of two processing stages: an unsupervised cluster learning stage and a vector classification stage. During the learning stage, initial cluster hypotheses [C1, ..., Ck] are generated from an initial set of sample test document vectors [S1, ..., SN]. Each cluster Ci is then represented by its centroid, Zi. In SIFTER, each cluster is treated as a specific document category. During the classification stage, an incoming document vector Vi is classified into a particular category Ck using the learned centroids from stage 1. The learning of cluster centroids is done in an off-line batch mode, while classification is carried out continuously as documents arrive. A clustering algorithm that is similar to the Maximin-Distance classification technique [Tou and Gonzalez, 1974] with the cosine similarity measure [Salton and McGill, 1983] is used to generate the centroids. During the on-line operation of SIFTER, this module merely classifies an incoming document vector Vi as belonging to one of the learned cluster categories based on the cosine similarity of Vi with the category centroids. The resulting category corresponding to each vector is then passed on to the user profile learning module.
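A minimal sketch of the on-line classification step follows. The centroid values are made up for illustration; in the system they would come from the off-line clustering stage:

```python
import math

# Sketch: assign an incoming document vector to the category whose
# centroid has the highest cosine similarity.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(vector, centroids):
    """Return the index of the most similar category centroid."""
    return max(range(len(centroids)),
               key=lambda i: cosine(vector, centroids[i]))

centroids = [[1.0, 0.1, 0.0],   # centroid of category 0 (illustrative)
             [0.0, 0.9, 0.2]]   # centroid of category 1 (illustrative)
category = classify([0.1, 1.0, 0.3], centroids)
```

Because cosine similarity is length-invariant, documents of very different sizes but similar term distributions are assigned to the same category.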
2.3 User Profile Learning Module

The user profile learning module consists of a learning agent that interacts directly with the user and sorts the incoming documents according to its belief of the user preferences for the various categories of documents. To accomplish this task, the learning agent maintains and updates a simplified model of the user. The algorithm currently used to learn the user model is based on a reinforcement learning algorithm studied in the Artificial Intelligence and Mathematical Psychology communities [Narendra and Thathachar, 1989]. Denoting the categories of documents by C1, ..., Cn, d_i is used to denote the expected relevance of a document belonging to the category Ci. The learning agent maintains and updates two vectors of dimension equal to the number of categories. The first is the estimated relevance vector, with elements d̂_i (i = 1, ..., n), where d̂_i is an estimate of d_i. The second is an action probability vector p = [p_i], such that p_i represents the probability of the category Ci being selected by the filter as the most relevant category. Both the p and d̂ vectors are continuously updated during the learning process on the basis of user relevance feedback.

At every iteration (i.e., presentation of documents to the user), the learning agent sorts the incoming documents by first sampling the p vector to select the category to be presented at the top. The rest of the categories are sorted according to the corresponding d̂ values. The learning algorithm (i.e., the algorithm for updating p(k) and d̂(k)) is as follows. d̂_i(k) (i = 1, ..., n) at any instant is the running average of the relevance values given by the user for documents belonging to category i. Denoting the current maximum element of the d̂ vector as having the index l, a unit vector E(k) is created of dimension n whose lth element is 1 and whose all other elements are zero. Then p_i(k) (i = 1, ..., n) is updated as

    p_i(k + 1) = p_i(k) + η (E_i(k) - p_i(k))

where 0 < η < 1 is a suitably chosen step-size. Thus, during the updates, the p vector is moved by a small distance towards the optimal unit vector. Asymptotically, one element of the p vector converges to one, while the other elements converge to zero. During the convergence process, all categories are probabilistically given the chance to be ranked at the top. This allows the user model, in the form of the d̂ vector, to be learned sufficiently accurately. While in theory the learning algorithm used has the capability to come out of the converged state and relearn in the presence of user shifts, in practice such relearning requires a very long time. During the intermediate period the filtering performance is poor, since the ranking of the documents is carried out by means of an invalid user model. It is to cope with such nonstationary users that we propose the shift detection module in this paper. The method used to accomplish this is described in the following section.
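The update rule above can be sketched directly. The step size and the two-category example are illustrative choices, not values from the paper:

```python
# Sketch of the profile-update step: the action probability vector p
# is moved a small step (size eta) toward the unit vector E(k) that
# points at the category with the largest running-average relevance.

def update_p(p, d_hat, eta=0.1):
    best = max(range(len(d_hat)), key=lambda i: d_hat[i])   # index l
    e = [1.0 if i == best else 0.0 for i in range(len(d_hat))]  # E(k)
    # p_i(k+1) = p_i(k) + eta * (E_i(k) - p_i(k))
    return [pi + eta * (ei - pi) for pi, ei in zip(p, e)]

p = [0.5, 0.5]
d_hat = [0.2, 0.9]       # category 1 currently looks most relevant
p = update_p(p, d_hat)   # p moves toward [0, 1]
```

Note that the update preserves the property that the p_i sum to one, and repeated application drives p toward a unit vector, which is exactly the convergence behavior (and the slow-relearning problem) described above.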
3 User Interest Tracking Scheme

In this section, we describe the higher-level user interest tracking scheme in more detail. Section 3.1 provides a qualitative overview of the method, while 3.2 describes the technical details. The various underlying assumptions are mentioned and justified in the present context.

3.1 Outline of the Tracking Scheme

For nonstationary users, the relevance probability of a category varies with time. We present a tracking scheme
capable of detecting a shift in the relevance probability of a category. The tracking is performed on each category separately. For each category, a history of relevance feedback data is collected, and the tracking scheme employs a Bayesian framework to compute the posterior probability of a shift in the relevance probability based on the relevance feedback data. Intuitively, the shift detection algorithm relies on a window of finite, possibly noisy relevance feedbacks collected in the given category and analyzes the collected feedback to decide whether the user's interest has genuinely changed or the variation is due only to inherent noisiness. As proved in [Zacks and Barzily, 1981], each of the sequences h_n^(u)(B_n) and h_n^(d)(B_n), representing the posterior probability of an upward and a downward shift respectively, is a submartingale, which implies that, as n -> infinity, an actual shift will be eventually detected. In practice, because of the finite choice of n, there will always be a nonzero probability of making a wrong decision, namely ignoring a shift (missed detection) and declaring an incorrect shift (false alarm). A decision function is formulated based on the shift probabilities and two cost quantities associated with misses and false alarms. This decision function is used to make a decision regarding whether a shift has occurred. If a shift is determined, the tracking system informs the learning agent and an appropriate reinitialization of the latter's states takes place. An attractive feature of this tracking scheme is that it can be applied on top of any learning scheme. In particular, it is especially suitable for convergence-type schemes such as that described in section 2.3.

For simplicity of analysis, the method presented here assumes a scenario in which a single shift has occurred at some point of time between t = 0 and t = infinity with a known prior probability distribution of the time of shift. When multiple shifts occur in the user interest, the time interval between two successive shifts is assumed to be sufficiently large. This allows each shift detection problem to be treated independently, and the time window over which a shift can occur to be idealized as an infinite horizon window. The infinite shift horizon assumption permits the use of certain mathematical identities which greatly simplifies the computations.

Let β1, β2, ..., βn be a sequence of the relevance feedback data collected for a particular category. βi is either 1 or 0, governed by the underlying shift parameter.

3.2 Some Practical Considerations: Use of Decision Functions

The posterior probabilities h^(u)(B_n) and h^(d)(B_n) of an upward and a downward shift, respectively, are computed given B_n; the assumptions under which they are derived are provided in [Zacks and Barzily, 1981]. Qualitatively, if h^(d)(B_n) is high, a downward shift (i.e., a decrease in user interest) is likely to have occurred for the category i, and the learning agent should reinitialize the action probability vector in response to the shift declaration. Let k1 be the cost of ignoring a shift. To formulate this decision process, we introduce two decision functions for each category i. These are given as follows:

    F_i^(u)(B_n) = [h^(u)(B_n)]^k
    F_i^(d)(B_n) = [h^(d)(B_n)]^k        (1)

If k > 1, the decision function F_i grows in a slower
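The window-based detection idea can be sketched under simplifying assumptions. Note that the sketch below is a generic Bayesian change-point posterior for 0/1 feedback, not the exact recursion of [Zacks and Barzily, 1981]: it hypothetically assumes known pre- and post-shift relevance probabilities p0 and p1, a uniform prior over the shift time within the window, and a prior probability of 0.5 that any shift occurred at all.

```python
# Illustrative sketch of downward-shift detection on a window of 0/1
# relevance feedback. All parameter values are invented for illustration.

def bernoulli_lik(data, p):
    """Likelihood of a 0/1 feedback sequence under relevance probability p."""
    lik = 1.0
    for b in data:
        lik *= p if b == 1 else (1 - p)
    return lik

def downward_shift_posterior(feedback, p0, p1, prior_shift=0.5):
    """Posterior probability that the rate dropped from p0 to p1 in-window."""
    n = len(feedback)
    no_shift = bernoulli_lik(feedback, p0)
    # Average likelihood over equally likely shift times t = 0..n-1
    # (t = number of pre-shift observations).
    shift = sum(bernoulli_lik(feedback[:t], p0) * bernoulli_lik(feedback[t:], p1)
                for t in range(n)) / n
    num = prior_shift * shift
    return num / (num + (1 - prior_shift) * no_shift)

# Feedback that starts mostly relevant and turns mostly irrelevant:
h_d = downward_shift_posterior([1, 1, 1, 1, 0, 0, 0, 0], p0=0.8, p1=0.2)
k = 2.0                          # cost exponent, as in equation (1)
declare_shift = h_d ** k > 0.5   # compare decision function to a threshold
```

Raising the posterior to a power k > 1, as in equation (1), shrinks moderate posteriors much more than posteriors near one, so a larger k makes the detector more conservative about declaring a shift.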