Learning User Interests through Positive Examples Using Content Analysis and Collaborative Filtering

Ingo Schwab (1), Alfred Kobsa (2), Ivan Koychev (1)

1) GMD FIT.MMK, D-53754 Sankt Augustin, Germany
   {Ingo.Schwab, Ivan.Koychev}@gmd.de
2) University of California, Irvine, Dept. of Information and Computer Science, Irvine, CA 92697-3425, U.S.A.
   [email protected]

Abstract: An approach to learning user interests is presented that borrows from findings in the areas of human-computer interaction, user modeling and information retrieval. It relies on positive evidence only, in consideration of the fact that users rarely supply the ratings needed by traditional learning algorithms, and specifically not negative examples. Learning results are explicitly represented to account for the fact that in the area of user modeling, explicit representations are known to be considerably more useful than purely implicit representations. A content-based recommendation approach is complemented by recommendation based on community membership, thus avoiding getting trapped in possible "local optima" of the content-based approaches. Finally, gradual forgetting of older observations has been introduced to better account for drifting user interests. The described framework has been extensively tested in a web-based information system for research funding opportunities.

Keywords: User modeling, learning user preferences, recommender systems, feature selection, nearest neighbor, tracking drifting interests

1 Introduction

The field of adaptive interfaces is multi-disciplinary. Besides machine learning, it includes at least the areas of human-computer interaction, user modeling, intelligent agents, information retrieval, knowledge-based systems, and natural-language systems. Research in machine learning algorithms for adaptive interfaces should take results and paradigms from these other disciplines into account. Such work in a neighboring discipline may set research goals for machine learning, guide its quest for new or better algorithms, or restrain this search by explaining why certain assumptions or results are unacceptable from the viewpoint of the neighboring discipline. Our research in machine learning was inspired and guided by findings in the areas of user modeling, human-computer interaction and information retrieval. From the first of these areas, we borrowed the idea that learning results about the user must be explicitly represented in order to be useful for comprehensive adaptation. An explicit representation of the system's assumptions about the user considerably facilitates:

• the re-use of these assumptions for purposes other than those for which they were originally collected,
• the collaboration of different agents in the construction and exploitation of user models,
• the explanation of these assumptions to the user, and
• the physical ownership of these assumptions by the user, e.g. in a portable user model.

The adoption of this paradigm of explicit representation from the area of user modeling allowed us to better address three other lessons from human-computer interaction, information retrieval and machine learning:

Users hardly give explicit feedback. It has long been known in human-computer interaction that users are extremely reluctant to perform actions that are not directed towards their immediate goal if they do not receive immediate benefits (e.g., Carroll and Rosson 1987). Many machine learning algorithms, however, rely on so-called positive and negative labels (for instance, positive and negative ratings), and in many adaptive interfaces labels are expected to be supplied by the user. Reports that such ratings (particularly negative ones) are hardly ever given by users are therefore not surprising. These issues will be discussed in more detail in Section 3.1. Sections 3.2 and 3.3 describe our efforts to develop and experimentally assess learning algorithms that rely on positive examples only, which can be obtained much more easily implicitly by monitoring users' selective actions.

Users' interests change over time. From research in human information retrieval (Allan, 1996; Lam et al. 1996) we know that users' interests change over time. Interest indicators that lie far in the past are therefore less reliable than those that are more recent. Our approach also includes a method for tracking user interest drifts by weighting the observations over time.
Thus, the most recent observations become more "important" for the learning algorithms, on the assumption that they represent the current user interests better than older ones. This approach will be presented in Section 3.4, together with an analysis of its impact on predictive accuracy.

Recommendations based on user similarity can fruitfully complement recommendations based on object similarity. Two different approaches currently exist for recommending to users which objects they should select: content-based approaches take similarities between objects into account, while clique-based (aka "collaborative") approaches take similarities between users into account. Several researchers (e.g., Balabanovic and Shoham, 1997; Basu et al., 1998; Delgado et al., 1998; Pazzani, 1999; Billsus, 2000) have already demonstrated the usefulness of combining both approaches. In Section 3.5 we show that collaborative recommendation can be made considerably more efficient on the basis of an explicit user model.

Our solutions to these problems all leverage the power of explicit representations for user models. In Section 3, a framework is presented that unites explicit models of users and user groups, algorithms from the area of machine learning, and acquisition components from the area of user modeling. This concept has been followed in the

design of the adaptive interface of ELFI, a grants information system that is used by a large number of users on a daily basis.

2 User Modeling and Machine Learning

Traditional user modeling systems often make use of knowledge representation (KR) techniques. KR formalisms offer facilities for maintaining knowledge bases, and for reasoning on the basis of this knowledge (using the inference procedures of representation formalisms). For user modeling purposes, these facilities are typically employed as follows (Wahlster & Kobsa 1989): assumptions about individual characteristics of the user are maintained in a knowledge base, together with system knowledge about the application domain or meta-knowledge for inferring additional assumptions about the user based on her current model (including predefined group models, the so-called stereotypes). If available, inference procedures of the representation formalism or meta-level inferences (e.g., heuristics) can be used to expand the user model. This knowledge base has been called the user modeling knowledge base (UMKB; Pohl 1998, 1999).

Figure 1. Using a knowledge representation system for user-adapted interaction.

There are four main tasks for a system that caters to users based on assumptions about the user in a user model:

Acquisition of assumptions: From the user's interactions with the application system, assumptions concerning user characteristics must be made and entered into the user model.

Representation: The user model contents need to be organized in a way that they can be easily retrieved and compared.

Reasoning: From existing user model contents (and perhaps other knowledge, like domain knowledge or domain-independent user modeling inference rules), further assumptions about the user can be derived. Reasoning may also be used to detect and handle conflicts in the UMKB.

Decision: A user-adaptive application exploits user model contents to make decisions about how to adapt its behavior appropriately.

Figure 1 illustrates the application of KR methods to user modeling. Acquisition and decision are performed outside of the KR system, which is responsible for representation and reasoning only. Other typical characteristics of KR-based user modeling include the fact that acquisition components often employ procedures or rules (which are triggered by one or only a few observations) to construct assumptions about users that are to be entered into the UMKB. Such an acquisition process is not aware of the observation history, which may lead to conflicts in the user model. Such conflicts are resolved by giving preference to the assumption that was most recently made, or by including a truth-maintenance system that monitors the logical dependencies between assumptions (Brajnik and Tasso, 1994). Second, KR-based user models mostly contain assumptions that are related to mental notions like knowledge, belief, goals, and interests. This approach to user modeling has therefore been called "mentalistic" by Pohl (1997).


Figure 2. Using machine learning for user-adapted interaction.

In the early Nineties, the use of learning techniques in user-adaptive systems became more and more popular. Almost at the same time, "interface agents" and "personal assistants" were introduced. Kozierok and Maes (1993) and Mitchell et al. (1994) describe software assistants for scheduling meetings that employ machine learning methods (memory-based learning and decision tree induction, respectively) to acquire assumptions about individual habits when arranging meetings. More recently, a fairly large number of systems using machine learning for personalized information filtering have been described in the literature, like Letizia (Lieberman, 1995), Amalthaea (Moukas, 1996), Syskill&Webert (Pazzani and Billsus, 1997) and others. In general, machine learning (ML) methods process training input and offer support for decision problems (mainly classification) based on this input. Hence, ML-based user-adaptive systems work quite differently from KR-based ones. The central source of information about the user is learning results rather than knowledge bases. Observations of user behavior (e.g., reactions to meeting proposals or document ratings)

are used as training examples. Learning components acquire user models by running their algorithms on these examples. Learning results (e.g., "the user is interested in X") are not explicitly represented but "hidden" in formalisms that are specific to each learning algorithm used (decision trees, networks with probabilities, etc.). This makes it difficult to reuse these results for purposes other than those for which they were originally learned. It is also not possible to employ them in so-called user modeling servers, where different applications store and retrieve assumptions about the user from a central network repository (Kobsa 2000, Fink and Kobsa 2001), or in portable user models that can be exploited by different kiosk applications (Fink et al. 1997). Due to the lack of an independent representation formalism, no further reasoning based on already acquired data is possible. On the positive side, though, decisions are directly supported. For example, the meeting-scheduling assistants let their learning components predict the user's reaction to new meeting proposals, and use this prediction for their individualized suggestions. Figure 2 illustrates the above discussion of the use of machine learning for user-adapted interaction. "Representation" is set in gray to indicate that it is implicit. Learning results typically serve one specific decision process only. For different types of adaptation, different learning processes have to be implemented. In contrast to KR-based user modeling, acquisition in ML-based user modeling systems is history-aware, i.e., it takes the history of interactions into account when processing a set of training examples, either one by one or all at once.1 Hence, learning results (i.e., the user model of ML-based systems) are steadily revised; there is no need for special revision mechanisms.
Moreover, the "user model" of ML-based systems often actually is a usage profile, i.e., it contains behavior-related information about the user rather than assumptions about the user’s mental attitudes.2 Currently, ML techniques are widely used in user-adaptive systems. Their main advantage is their ability to support (history-aware) acquisition and decision dynamically. However, they also pose two major problems:

• It is not easy and sometimes nearly impossible for several different decision processes to take advantage of learning results that only reflect usage regularities but do not explicitly represent individual user characteristics.

• In this case, it is furthermore difficult to communicate learning results to the user for inspection and explanation (as is required by many privacy laws, such as EU 1995). In KR-based user modeling systems with their explicitly represented user models, this can be done more easily.

1 In the latter case, a learning method is called non-incremental from a technical point of view; nevertheless, it can be used for history-aware acquisition if new observations are processed together with old ones.
2 In the case of information filtering systems, though, learning results often do indicate user interests in specific information content and can be regarded as (implicitly represented) mentalistic assumptions.

We propose to integrate elements of KR-based and ML-based user modeling. A first step in this direction was taken by the user model server Doppelgänger (Orwant, 1995), which uses learning methods to process information from several sources. Learning results are represented explicitly in a standardized format (all assumptions are stored using a symbolic notation with associated numerical confidence values), so that all Doppelgänger clients can take advantage of them. With acquisition and representation de-coupled, several learning components can work on the acquisition of the same kind of data. For instance, Doppelgänger uses both hidden Markov models and linear prediction to acquire temporal patterns of user behavior, which are employed to predict future user activities.

Figure 3. The LaboUr architecture.

Figure 3 (Pohl, 1997) gives an overview of LaboUr, our user modeling architecture that reconciles the KR and ML approaches to user modeling. A LaboUr system accepts observations about the user and lets learning components (LCs) and acquisition components (ACs) select appropriate observations for their input. LCs (which are ML-based) internally generate usage-related results that will be transformed into explicit assumptions, if possible. These assumptions are passed to a KR-based user modeling subsystem. ACs directly generate user model contents (which may be behavior-related or mentalistic) and do not support decisions. They can implement heuristic acquisition methods like those often used in KR-based user modeling systems (cf. Section 2.1). In contrast to LCs, which typically need a significant number of observations to produce learning results with sufficient confidence, ACs can allow for "quick-and-dirty" acquisition from a small number of observations. This is useful for adaptive systems with short periods of user interaction, like information kiosks and web sites. LCs can be consulted for decision support based on learning results.

In addition, there may be other decision components (DCs) that directly refer to user model contents. Besides supporting acquisition and decision processes, a LaboUr system can also offer direct access (input and output) to user models, due to its use of explicit representation facilities. A LaboUr system may maintain several user models. In this case, ML techniques can additionally be used for group modeling, i.e., clustering user models into user group models (Paliouras et al. 1999, Fink 2001). If group models exist in a system, individual user models may be complemented by suitable group information. LaboUr is an open user modeling architecture: several sources of information about the user may contribute to the user model, which again can support several applications and types of adaptation.

3 Learning in LaboUr

3.1 Overview

In order to validate the framework described above, we developed a user modeling subcomponent and integrated it into the real-world application ELFI (ELectronic Funding Information; Nick et al. 1998). ELFI is a WWW-based information system that provides information about research grants and is used by more than 1000 people in German research organizations who regularly monitor and/or advise researchers on extra-mural funding opportunities. In this system, calls for proposals are recommended for review based on those that the user has already browsed so far. The user modeling subcomponent of ELFI, which is an instantiation of the LaboUr framework, was again dubbed LaboUr. Apart from integrating KR-based and ML-based user modeling in the LaboUr system, we also aimed at respecting additional findings from the areas of human-computer interaction and machine learning in the design of LaboUr, particularly the following ones:

Lack of user feedback: Computer users are known to provide only little feedback when they are supposed to rate the quality of items recommended to them, and specifically hardly ever negative ratings. Traditional learning algorithms, however, rely very much on explicit positive and negative examples. We therefore focused on algorithms that would operate in the presence of implicit positive examples only, and additionally be able to take negative examples into account in case these are available.

Interest drift: Users' interests may change over time, e.g. to account for different search goals (Allan, 1996; Lam et al. 1996; Crabtree and Soltysiak, 1998). We therefore enhanced the learning algorithm in such a way that users' interest drifts are taken into account.

Combination of content-based and collaborative recommendation: Recommendations based on the similarity of the recommended items with already seen items may be unable to cover the full spectrum of the user's interests and get stuck in "local optima".
The combination with collaborative methods allows for recommendations that go beyond object similarity and take into account the demonstrated interests of users who are comparable to the current user. We will see in the remainder of this article that for all three goals, the explicit representation of learning results as advocated in the LaboUr approach turns out to be extremely useful.

Figure 4. Learning mechanisms in LaboUr.

Figure 4 shows the structural design of LaboUr, gives an overview of the learning methods used, and describes how they collaborate to recommend new or interesting objects to the user. There are three main stages in the LaboUr recommendation process. In the acquisition stage, observations about a user are processed with statistical methods to extract those features of the objects that represent the user's interests viz. disinterests. This will be described in more detail in Section 3.2.3. The learning mechanism includes a gradual forgetting mechanism that handles drifting user interests (see Section 3.4). In the second stage, the selected features are explicitly represented in the user model. In the third stage, learning algorithms utilize the model to recommend new relevant objects to the user (see Section 3.3). Additionally, a collaborative approach compares the content-based user models and recommends objects that similar users frequently selected in the past.
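The three-stage process just described can be sketched as follows; this is a minimal outline under our own naming assumptions (class and method names are not from the paper), with the feature selector and the learner passed in as plain callables:

```python
# Minimal sketch of the three-stage LaboUr recommendation process
# (hypothetical names; the paper does not give an implementation).
class LabourPipeline:
    def __init__(self, feature_selector, learner):
        self.feature_selector = feature_selector  # stage 1: statistical acquisition
        self.learner = learner                    # stage 3: recommendation
        self.user_model = {}                      # stage 2: explicit user model

    def observe(self, selected_docs):
        # Stage 1: extract significant features from the user's selections,
        # Stage 2: store them explicitly in the user model.
        self.user_model = self.feature_selector(selected_docs)

    def recommend(self, candidate_docs, n=5):
        # Stage 3: rank candidate objects against the explicit user model
        # and propose the n best ones.
        ranked = sorted(candidate_docs,
                        key=lambda d: self.learner(d, self.user_model),
                        reverse=True)
        return ranked[:n]
```

A collaborative component, as described below, would additionally compare the explicit user models of different users; it is omitted here.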

3.2 Learning from Positive Examples

3.2.1 Observations about Users

A standard approach when using machine learning for acquiring interest profiles is to assume that the set of information objects can be divided into classes (e.g., objects that are "interesting" and "not interesting") and to apply an inductive classification algorithm to these examples. In many systems, users must provide examples for both classes in an initial training phase, on the basis of which a classification algorithm is learned inductively. Thereafter, the classification algorithm can determine whether new information objects belong to the class "interesting" or to the class "not interesting". However, such explicit rating requires additional user effort and keeps users from performing their actual tasks, both of which are undesirable. As has been observed by Carroll and Rosson (1987), users are unlikely to engage in such additional efforts even when they know that they would profit in the long run. Users are extremely reluctant to perform any actions that are not directed towards their immediate goal if they do not receive immediate benefits.3 Additionally, motivating web visitors to provide personal data is proving very difficult. Internet users normally avoid engaging in a relationship with Internet sites. This is mostly due to a lack of faith in the privacy of today's web sites. Users often withhold personal data or provide false data (GVU 1998, Standard 1998, Forrester 1999).

Conclusions about user interests should therefore not rely very much on user ratings, but rather take passive observations about users into account as far as possible. However, obtaining an appropriate set of negative examples (i.e., examples of the "not interesting" class) is then problematic, since the central source of information about the user is his or her navigation behavior and especially the set of selected objects. Several systems use as negative examples those objects that have been presented to the user but were not selected (e.g. Lieberman, 1995 and Mladenic, 1996). We claim in contrast that in a general approach this assumption is not valid. It is a common situation that objects are overlooked, and it is impossible to have an overview of all relevant objects. Sometimes objects that are not visited at the moment are visited at a later point, and sometimes they are ignored forever even when the user is interested in them, since it is too time-consuming or simply not possible to follow every interesting link. Therefore, classifying the objects not visited as negative examples seems to be a dangerous assumption, since wrong decision borders may be calculated between the classification regions. It is more appropriate to only take selected objects as examples for the class "interesting".

On the other hand, the selection of a document by the user also does not necessarily mean that it is interesting for him. Therefore, some "noise" reduction of the selected objects is necessary for extracting only those that are really interesting. The results from experiments by Morita and Shinoda (1994) and Konstan et al. (1997) show that the time spent on reading is highly correlated with users' explicit ratings of the interestingness of articles. This correlation may not be very strong for some domains. Nevertheless, a suitable threshold for the time spent on reading will be able to set apart the truly interesting documents. Observations about other, more domain-specific actions like printing, saving, forwarding, book-marking etc. can also be used to help distinguish the really interesting documents. As a result, we obtain a set of positive examples of users' interests. However, if only examples from one class are available, standard classification methods are not applicable. Thus, we had to invent new methods for learning interest profiles, or revise existing ones.

3 As a matter of fact, Mike Pazzani (personal communication) reports that only 15% of the users of Syskill & Webert (Pazzani et al. 1996, Pazzani and Billsus 1997) would supply interest ratings even though they were encouraged to do so.

3.2.2 Defining the Feature Space

As in many other approaches to learning interest profiles, LaboUr needs representations of the (information) objects the user is dealing with as input for learning. For this representation, crucial features need to be identified and appropriately coded. ELFI documents are described by a number of attributes, each of which contains a set of features. They are visible to the user, and she makes her document selections based on them. Examples of attributes and their features are the research topic (features are the individual areas), funding type (features: individual types), cross-section topic, abstract (features: individual words), recipient, deadline and the amount of the award. We use Boolean vectors for representing sets, i.e. one bit for every element of the base set. The vectors for the selected attributes are concatenated to produce a semi-structured document representation. Additionally, the document abstracts are represented as a set of words.

In a domain-modeling phase, we identified those attributes that are best suited for modeling user interests. This process is executed only once, and the resulting learning vectors are the same for all users. The suitability of the selected features was tested using a log file containing several months of system usage. The frequencies of occurrence of features were calculated for those documents that users selected and for those they did not select. If users select documents on the basis of attributes in which they are interested, rather than reading documents randomly, the frequencies of the corresponding features will be higher among the selected documents than among the unselected documents. Through this kind of analysis, we found that some features (e.g., "research topic" and "grant type") are better indicators of interest than others. In a further effort to avoid unusably large document vectors, we reduced the set of words observed in the abstracts to the 189 most discriminating ones. A TFIDF measure was applied to determine these words (for the sake of brevity, we will not go into details). All in all, we were able to reduce the dimension of the learning task from several thousand to only 420 features.

3.2.3 Determining the Significance of Features for Acquiring Explicit Information about User Preferences

In this section, a statistical approach is described to generate explicit assumptions about a user from positive examples only. This process is applied after every selective action of a user. It uses a uni-variate significance analysis to determine whether a user is interested in specific values of the document features. It is based on the idea that feature values in random samples are normally distributed. If a value appears in the set of selected documents significantly more often than in a random sample, the user is interested in it. If, on the other hand, the selection frequency is significantly lower, the user is not interested in that value. To explain this idea, we take a typical example from ELFI. We want to determine whether a user is interested in the feature "project grant". First, we calculate the probability of this feature in all documents. Let us assume that 815 documents are available in ELFI (which is a typical number), and that 316 documents contain this feature. Thus, the probability to randomly select a document with this feature is

p_project grant = 316 / 815 = 0.39.

For random document selections from the overall set there will however be a mean error, so that a confidence interval around the actual frequency needs to be determined. If the actual frequency lies outside this interval, it can be assumed with certain confidence that the user has not made a random choice, but that there is a strategy involved in the user's selection. The confidence interval for feature j is given by the following formula:

c^j_{1,2} = µ ± z √( p_j (1 − p_j) n )    (1)

µ is the mean of the distribution and is equal to the above overall probability multiplied by the number of selections, while z is the critical value. It determines the area under the standard normal curve; for a confidence rate of 95% the value is 1.96. This means that 95% of random samples fall within this interval and 5% are outside. Thus there is a probability of 5% that a user becomes misclassified. For a greater confidence, the critical value is higher (e.g., for a 99% confidence, it is 2.576).

User is interested in:
  Research Topics: Mathematics (0.85), Aeronautics (0.88), Control engineering (0.88), Traffic research (0.52)
  Funding Type: Printing grant (0.56)
  Receivers: World (0.38), Developing country (0.46)
  Abstract: Image processing (0.39), Financing (0.52), Frequencies (0.56), Live (0.64), Aviation research (0.99), Multimedia (0.66), Scientific (0.39)

User is not interested in:
  Research topics: Phytotherapy (-0.84), Physical geography (-0.76), Theoretical medicine (-0.79)
  Funding Type: Scholarship (-0.55)
  Abstract: Part-time employment (-0.49), Dissertation (-0.41)

Figure 5. An explicit user profile.

Let us now assume that a user selects 30 documents. Then the 95% confidence interval for the grant type "project" can be calculated as

c1_project grant = 6.4  and  c2_project grant = 16.86.

These numbers give rise to the following algorithm for acquiring explicit assumptions with respect to the feature "project grant": if it appears in 6 or fewer of the selected documents, the user is not interested in documents with this value. If 17 or more documents contain this feature, the user can be regarded as interested in documents with this value. Otherwise the user's selection is neither a positive nor a negative indicator of his interest in this feature. Thus an explicit user profile can be constructed by applying the uni-variate significance analysis to every feature of the object set.
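The significance test of formula (1) and the resulting three-way decision can be sketched as follows; the function names are our own, and the worked example reproduces the "project grant" numbers from the text:

```python
# Sketch of the uni-variate significance analysis of Section 3.2.3
# (hypothetical helper names; the paper gives only the formula).
import math

def confidence_interval(p_overall, n_selected, z=1.96):
    """Bounds (c1, c2) on the expected count of a feature among
    n_selected documents if the user had selected at random."""
    mu = p_overall * n_selected
    err = z * math.sqrt(p_overall * (1 - p_overall) * n_selected)
    return mu - err, mu + err

def classify_interest(observed_count, p_overall, n_selected, z=1.96):
    """+1 = interested, -1 = not interested, 0 = no significant evidence."""
    c1, c2 = confidence_interval(p_overall, n_selected, z)
    if observed_count > c2:
        return 1
    if observed_count < c1:
        return -1
    return 0

# Worked example from the text: 316 of 815 ELFI documents contain
# "project grant"; the user has selected 30 documents.
p = 316 / 815                       # ≈ 0.39
c1, c2 = confidence_interval(p, 30)
print(round(c1, 1), round(c2, 2))   # 6.4 16.86
```

Applying `classify_interest` to every feature yields the explicit profile of Figure 5.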

Figure 5 shows a partly fictitious example output of this procedure. Only the features that have been found to be important to the user are listed. The value in brackets is the normalized value of interest. It lies between -1 (totally uninteresting) and 1 (totally interesting). In this example, 14 features are of positive and 6 of negative importance to the user.

3.2.4 Using Feature Selection for the Reduction of Dimensionality

A problem with observation data from ELFI is that the dimensionality of the vectors describing the document still is rather large. We have still many features (namely 420), while the amount of training data is very limited (between one and 65 selected objects per user). This problem is quite typical for adaptive interactive systems that use learning methods. Learning under these conditions is not practical, because the amount of data needed to approximate a concept in d dimensions grows exponentially with d , a phenomenon commonly referred to as the curse of dimensionality (Bellman, 1961). Hence, there is a need for dimensionality reduction. In Section 3.2.2, we described the usage of domain modeling for pruning down the relevant features, which from the point of view of machine learning falls under the category of feature selection by a priori information. Furthermore however we claim that every user has different interests and therefore also different features that are important to her. Feature selection should be individualized and be performed individually for each user. Uni-variate significance analysis considerably reduces the dimension of the learning task while the significantly uninteresting and interesting features for each user still remain. In the example shown in Figure 5, the algorithm is able to reduce the candidate features from 420 to 20. Feature selection can be easily combined with our existing learning algorithms. Whenever a user selects an object the significant features are determined based on the objects that were selected so far. This set of features can change quite dynamically: features can be included in the set of significant features whenever they are identified as a significant interest or disinterest, but also omitted from this set if they are not considered as significant for the user any more. 
The learning task is then performed with this current subset of features. As will be shown in the next section, this combined learning approach is much more noise resistant and learns much faster than traditional methods; the performance of the learning algorithms that we studied can be significantly improved.
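The per-user selection step can be sketched as follows. This is a minimal illustration, not the original LaboUr implementation: all names are invented, and a normal approximation to a binomial significance test is assumed, since the paper does not specify the exact test used.

```python
import math

def select_features(selected, base_rates, z_crit=2.58):
    """Illustrative univariate significance analysis: keep features whose
    frequency in the user's selections deviates significantly from the
    corpus-wide base rate.

    selected: list of Boolean feature vectors (the user's chosen documents)
    base_rates: per-feature frequency across the whole document corpus
    z_crit: critical z value (2.58 roughly corresponds to 99% confidence,
            the level mentioned in the paper)

    Returns {feature_index: normalized_weight}, weight in [-1, 1]:
    positive = significant interest, negative = significant disinterest.
    """
    n = len(selected)
    weights = {}
    for j, p0 in enumerate(base_rates):
        count = sum(doc[j] for doc in selected)
        p = count / n
        # normal approximation to the binomial test of p against p0
        se = math.sqrt(p0 * (1 - p0) / n) or 1e-9
        z = (p - p0) / se
        if abs(z) >= z_crit:
            # normalize the deviation into [-1, 1] (an assumed scheme)
            weights[j] = max(-1.0, min(1.0, (p - p0) / max(p0, 1 - p0)))
    return weights
```

Re-running this after every selection yields the dynamically changing set of significant features described above.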

3.3 Content-based Recommendations

In this section we will describe two different approaches to learning from positive examples only that we investigated, namely a probabilistic and an instance-based method (the latter with two different distance measures). We will then compare the learning results of these algorithms when combined with the user-specific feature selection described above.

3.3.1 Probabilistic Approach

We used Bayes’ theorem to calculate the probability of user interest in a given document; that is, we applied a simple Bayes classifier to positive examples only. For the vector representation of a new document, a product is computed of the probabilities that, in previously selected documents, each bit has the same value as in the current vector. Like the simple Bayes classifier, this approach assumes that the bits are mutually independent. The algorithm computes a single value for a given document, the idea being that interesting documents receive higher values and uninteresting documents lower values. When applying this algorithm to the available logfile data for each user, unselected documents turned out to generally have lower values than selected ones. This suggests that the approach can be used for interest prediction if a reasonable threshold value is chosen: grants with probability values greater than the selected threshold would be assumed to belong to the class “interesting” and would be recommended to the user. Alternatively, one can rank the documents by the calculated probability value and propose the n best documents to the user.
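A compact sketch of this per-bit score, with illustrative names; Laplace smoothing is an assumption added here so that a single unseen bit value does not zero the whole product (the paper does not state how this case is handled):

```python
def interest_score(new_doc, selected):
    """Probabilistic interest score from positive examples only: for each bit
    of the new document's Boolean vector, take the fraction of previously
    selected documents that share that bit value, and multiply these
    probabilities (bits are assumed mutually independent, as in a simple
    Bayes classifier)."""
    n = len(selected)
    score = 1.0
    for j, bit in enumerate(new_doc):
        matches = sum(1 for doc in selected if doc[j] == bit)
        # Laplace smoothing (assumed): avoids a zero product for unseen values
        score *= (matches + 1) / (n + 2)
    return score
```

Documents can then be ranked by this score and the n best proposed, or a threshold applied as described above.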

3.3.2 Instance-Based Approach

One of the most popular machine learning algorithms is the k-Nearest Neighbor (kNN) approach (Mitchell, 1997). For this algorithm, learning means remembering previously classified experiences. Each experience (called an instance) is represented as a point in a Euclidean space. Classifying a new instance then means searching for its nearest neighbors in this space; the class of the majority of these neighbors determines the result of the classification. Since the selected documents are considered positive evidence only, we have a single class, and a standard kNN procedure would always classify a new document as positive. We therefore modified the nearest-neighbor idea by examining a space of fixed size around each previously selected document: a new document is considered interesting if its distance to at least one previously selected document is less than this radius (see Figure 6).

Figure 6. Instance-based learning approach.
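This one-class variant can be sketched in a few lines (function names are illustrative; the radius is a hand-tuned parameter, and the distance function is pluggable so that the weighted measure introduced below can be substituted later):

```python
def hamming(a, b):
    """Number of differing bits between two Boolean vectors."""
    return sum(x != y for x, y in zip(a, b))

def is_interesting(new_doc, selected, distance, radius):
    """Modified nearest-neighbor rule sketched above: a new document counts
    as interesting if it lies within a fixed radius of at least one
    previously selected document."""
    return any(distance(new_doc, doc) <= radius for doc in selected)
```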

To operationalize this idea, a good distance measure must be found and the size of the examined space around the previously selected documents must be determined. In a first approach, we used the Hamming distance, i.e. the number of differing bits between two compared vectors. Experiments showed, however, that the quality of the Hamming distance is insufficient: since every bit in the representation vector is assumed to be equally important to every user, this approach does not take individual user interests into account. Therefore, a weighted distance measure is needed that is computed individually for each user. The idea is that a large weight for an attribute that is very important to a user leads to larger distance values between documents that differ in this feature. We obtained such distance weights from the feature analysis mechanism used for learning explicitly represented user preference information. The next section describes the determination of distance weights in more detail.

3.3.3 Obtaining Weights for Distance Measuring

As mentioned before, the results of the univariate significance analysis can be used to obtain feature weights for the distance measure needed by the instance-based learning approach: the weights are the normalized values (see Figure 5) from the significance analysis. The effect of the weighted distance measure can be seen in Figure 7. This visualization uses a technique called multi-dimensional scaling (Kruskal, 1964), which allows us to show the relationships between selected documents in two dimensions. The documents are numbered from 1 to 21 in chronological order of selection. In the left picture, a simple Hamming distance is used; the user’s selection behavior and the resulting preferences for certain features have not been taken into account and do not become visible. The right picture visualizes the same document selection using a weighted distance measure, with weights obtained from feature selection. Here, the user’s behavior is clearly visible. In the beginning (documents 1 through 8) the user tries to find the interesting documents; perhaps she is just playing or experimenting, trying to figure out what kind of information or which interaction features ELFI offers. After this exploratory period she finds the information she was looking for. From this point onwards, the selected documents are very similar and therefore form a cluster in the right picture. New or overlooked documents similar to the documents of this cluster could be recommended to the user.

Figure 7. Selected documents, displayed using an unweighted (left) and a weighted (right) distance measure.
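The weighted distance just described can be sketched as follows (names are illustrative; it is assumed here that features without a significant weight contribute a configurable default, since the paper does not specify how unweighted features are treated):

```python
def weighted_hamming(a, b, weights, default=0.0):
    """Weighted Hamming distance: each differing bit contributes the absolute
    significance weight of its feature, so differences on features the user
    cares about count more. `weights` maps feature index -> normalized weight
    in [-1, 1] from the per-user significance analysis; features without a
    significant weight contribute `default` (an assumption)."""
    return sum(abs(weights.get(j, default))
               for j, (x, y) in enumerate(zip(a, b)) if x != y)
```

With all weights equal to 1 this reduces to the plain Hamming distance used in the first approach.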

3.3.4 Evaluating Learning Methods

The observations about users are sequences of user actions. Usually such sequences are regarded as time series. In the ELFI environment, however, the relationships between selections are not causal: later selections do not depend on earlier ones, but are primarily goal driven, as users aim at selecting documents that are “interesting” to them. Therefore we use a standard cross-validation technique for determining the prediction accuracy of the learning algorithms. Furthermore, ELFI users selected a relatively small number of documents only (between one and 56). We therefore decided to use the leave-one-out cross-validation technique, which works as follows: remove one document from the n documents that a user selected during a session and use the remaining n-1 selections as a training set for the learning algorithm. Have each document of the ELFI database ranked by the learning mechanism and determine the priority with which the removed document would be proposed to the user. Then remove a different document from the set of user selections and again determine the recommendation priority for each document in the database. In the end, the n results are averaged, and the probability is determined that the n documents that would have been proposed with highest priority were indeed among those that the user actually selected. This result expresses the average prediction accuracy of the tested learning algorithm. We evaluated the different learning methods described above with a set of 220 users who selected at least two documents. The users were grouped based on the number of document selections: the users in the first group selected 2 to 8 documents, those in the second group 9 to 16 documents, etc. (altogether they selected 1886 documents, i.e. a mean of 8.5 documents per user). The horizontal axis in Figure 8 presents these groups, and the vertical axis the recommendation accuracy.
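The leave-one-out procedure can be sketched as follows. This is one possible reading of the evaluation described above, with invented names: each selected document is held out in turn, the scorer is trained on the rest, the whole database is ranked, and we check whether the held-out document would have appeared among the n highest-ranked recommendations.

```python
def leave_one_out_accuracy(user_docs, all_docs, score):
    """Illustrative leave-one-out evaluation. `user_docs` are the documents
    the user selected (the same objects must also appear in `all_docs`,
    the full database); `score(doc, training)` is any of the learning
    methods above, higher meaning more likely to be recommended."""
    n = len(user_docs)
    hits = 0
    for held_out in user_docs:
        training = [d for d in user_docs if d is not held_out]
        ranked = sorted(all_docs, key=lambda d: score(d, training),
                        reverse=True)
        # was the held-out document among the top-n recommendations?
        if any(d is held_out for d in ranked[:n]):
            hits += 1
    return hits / n
```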
Our experiment shows that the combination of instance-based learning with a weighted distance measure and feature selection performs best. On average, only three percent of the documents that would have been recommended were not selected by the user, and one may speculate that they would have been interesting to the user had she seen them. All learning algorithms significantly outperformed randomly generated advice, which would be rated at the 50 percent level. The unusually high prediction accuracy (normally 80% seems to be a "practical ceiling" in the prediction of object selections) may partially be attributed to the fact that documents were structured by visible categories; user behavior may therefore be more "orderly" and hence more easily predictable. A perfect fit would mean that every selected document would also be one of the favorites of the learning algorithm. Achieving this perfect fit is, however, nearly impossible

Figure 8. Prediction accuracy of the compared approaches according to the number of selected documents. Groups: 1 (2-8), 2 (9-16), 3 (17-24), 4 (25-32), 5 (33-40) selections; accuracy shown between 88% and 98%. Compared approaches: feature selection combined with instance-based learning using the Hamming distance; instance-based learning using a weighted distance measure; feature selection combined with instance-based learning using a weighted distance measure; and feature selection combined with the probabilistic approach.

with real user data. During the experiments we discovered that there were a few users who had no recognizable selection strategy and were indeed unpredictable. Possibly, they were not familiar with computer systems or just wanted to “play” with the system. The data of these users were not removed from the test sets. The conclusions of our experiments are twofold. First, the use of feature selection significantly improves the performance of the learning algorithms. Instance-based learning plus feature selection works well for small training sets even with a simple Hamming distance; with growing training sets, however, it becomes apparent that instance-based learning with a weighted distance measure learns much faster. Second, the instance-based learning approach performs better than the probabilistic approach.

3.4 Tracking Drifting User Interests

Traditionally, users' selections over a period of time from a menu of items are regarded as relatively independent trials. Often, however, users' interests drift with time (Allan, 1996; Lam et al., 1996; Crabtree and Soltysiak, 1998). This normally means that the most recent observations represent the user's current interests better than older ones. From a more general perspective, the task of catering to drifting user interests corresponds to learning drifting (also called changing, evolving or shifting) concepts (e.g. Schlimmer and Granger, 1986; Widmer and Kubat, 1996; Widmer, 1997; Maloof and Michalski, 2000).

The most frequently used approach to dealing with this problem is to learn the description of the user's interests from the newest observations only. The training examples are selected from a so-called time window, i.e. only the last l examples are used for training (Mitchell et al., 1994; Crabtree and Soltysiak, 1998). An improvement of this approach is the use of heuristics to adjust the size of the window according to the current predictive accuracy of the learning algorithm (Widmer and Kubat, 1996). In a similar way, the time-based forgetting mechanism of Maloof and Michalski (2000) uses a time function for aging the examples: instances that are older than a certain age are deleted from the partial memory. However, these approaches totally “forget” the observations that are outside the given window or older than a certain age, although those observations may still be valuable. To avoid losing useful knowledge learned from old examples, some systems keep old rules as long as they are competitive with new ones (Mitchell et al., 1994). Another approach is to use a dual user model (Chiu and Webb, 1998; Billsus and Pazzani, 1999). For instance, Billsus and Pazzani (1999) use a hybrid user model consisting of both a short-term and a long-term model of the user’s interests. The method employs the short-term model first, which is based on the most recent observations; if a story cannot be classified with the short-term model, the long-term model is used. This hybrid user model is useful in domains where the user's long-term interests are quite broad and short-term interests change fast, as is the case for news stories.

3.4.1 Gradual Forgetting for Adaptation to Drifting User Interests

To cope with the problem of interest drift, this paper suggests another approach. The main idea is that the new interests that users pursue are not completely independent from previously pursued interests; therefore it is worthwhile to take even very old examples into account. The most recent observations should however be more important than the old ones, and the importance of an observation should decrease over time. A gradual forgetting function w = f(t) can be defined which produces a weight for each observation based on its time of occurrence. The idea of weighting (aging) observations about the user over time is not new in user modeling. Webb and Kuzmycz (1996) suggest a data aging mechanism that places an initial weight of 1 on each observation and discounts the weight of every observation by a set proportion each time another relevant observation is incorporated into the model. A simple kind of attribute-value learning procedure is able to take the weights of these observations into account. A further development of this approach (Webb et al., 1997; Chiu and Webb, 1998) investigates the utilization of an algorithm based on decision-tree induction, called FBM-C4.5. However, FBM-C4.5 does not use any aging, i.e. it treats all observations as equally important. Hence, this aging mechanism cannot be used for more complicated learning algorithms, like Top-Down Induction of Decision Trees (TDIDT), or for the significance tests for feature selection employed in LaboUr. A method that can learn and track changing contexts by using meta-learning is presented in Widmer (1997). The assumption is that the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). A two-level learning algorithm is presented that effectively adjusts to changing contexts by trying to detect contextual clues (via meta-learning) and focusing the learning process using this

information. Two systems based on this model are presented: METAL(B) combines meta-learning with a Naive Bayesian Classifier (NBC), while METAL(IB) is based on an instance-based learning algorithm. An “exemplar weighting” mechanism is applied in METAL(IB), but it is not used in METAL(B), and this weighting schema is not useful for our feature selection. Therefore, the problem of defining a suitable example-weighting schema for more complicated learning algorithms like TDIDT, NBC and our significance analysis remains open. We found, however, that weights that obey the following constraints are suitable for the latter algorithms:

w_i ≥ 0  and  (1/n) ∑_{i=1}^{n} w_i = 1      (2)

Here, w_i are the observation weights and n is the number of observations. Various functions that model the process of forgetting and satisfy these constraints can be defined. For example, a linear gradual forgetting function is defined as follows:

w_i = − (2k / (n − 1)) (i − 1) + 1 + k      (3)

Here, i = 1, 2, …, n is a counter over observations that starts from the most recent selection and goes back in time to the first selection; k ∈ [0,1] is a parameter that determines the slope of the forgetting function (if k=0 there is no forgetting; if k=1 the oldest observation is totally forgotten). The “speed of forgetting” can thus be adjusted by varying k. Other types of forgetting functions can be defined (e.g. logarithmic, exponential, etc.). We should also note that the presented gradual forgetting mechanism can be combined with a time window by weighting the examples in it (i.e. n = l). In the LaboUr approach, feature selection is used for creating the explicit user profiles (see Section 3.2.3). Hence, applying gradual forgetting to feature selection will affect the explicit user profiles and consequently the generated recommendations. In particular, the forgetting weights are utilized when counting the appearances of a feature j in the user’s selections, by multiplying each occurrence of the feature in an observation by the observation's weight: c_j = ∑_{i=1}^{n} w_i v_{ij}, where v_{ij} ∈ {0,1} are the feature values in the Boolean vectors that represent the observed selections, and w_i are the weights calculated by the forgetting function.
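The linear forgetting function (3) and the weighted feature counts c_j can be sketched directly (function names are illustrative):

```python
def linear_forgetting_weights(n, k):
    """Linear gradual forgetting function (3): observation i = 1 is the most
    recent selection, i = n the oldest. k in [0, 1] sets the slope: k = 0
    means no forgetting, k = 1 forgets the oldest observation entirely.
    The weights average to 1, satisfying constraint (2)."""
    if n == 1:
        return [1.0]
    return [-(2 * k / (n - 1)) * (i - 1) + 1 + k for i in range(1, n + 1)]

def weighted_feature_counts(selections, weights):
    """Weighted occurrence counts c_j = sum_i w_i * v_ij, where v_ij is the
    Boolean value of feature j in observation i (most recent first)."""
    return [sum(w * doc[j] for w, doc in zip(weights, selections))
            for j in range(len(selections[0]))]
```

These weighted counts replace the raw frequencies in the significance test, which is how the forgetting mechanism reaches the explicit user profile.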

Table 1. The influence of the gradual forgetting mechanism on the generated user profile, for observations 11 to 25 and the features control engineering, mechanical engineering, multimedia, developing countries, electrical engineering, Europe, numerics, and mathematics. The gray lines represent the weights generated by pure feature selection; the white lines present the changes in the user profile when forgetting function (3) is used with k=80%. The confidence interval for the feature selection is set to 99%. [The numeric weight columns of the original table cannot be reliably reconstructed and are omitted here.]

To illustrate how the proposed forgetting mechanism affects user profiles, let us study the changes in a real user profile in more detail. Table 1 compares excerpts of two profiles of a user, one generated without and one with a forgetting function. The rows represent some of the features that have been selected as the user’s interests. The gray lines represent the changes in the user profile generated by pure feature selection; the white lines represent the changes when the forgetting function (3) is utilized (k=80%). Each column presents a subset of the features selected for the user profile after each selection. The period presented in the table ranges from selection 11 to 25. The numbers in the cells are the calculated normalized weights for each selected feature; an empty cell means that the feature was not selected as interesting at this stage. The confidence interval for feature selection was 99%, which leads to the selection of only the most significant user interests. It can be seen that user interests change over time, and that the user profile generated with the forgetting function is more sensitive and responds faster to changes in the user’s interests. Interests that do not occur in recent observations are forgotten faster: the features "control engineering" and "mechanical engineering", for example, drop out of the current user interests earlier when the forgetting mechanism is employed. In a similar way, new interests are recognized earlier (e.g., the algorithm that employs the forgetting mechanism recognizes the feature "multimedia" as a user interest five steps earlier than the approach without forgetting). Since the presented aging mechanism does not totally forget old observations, a reoccurring user interest will also be recognized more easily. Moreover, features that appear in the user’s selections regularly, i.e. represent stable user interests, are left almost unchanged (e.g. the feature "electrical engineering"). And finally, when the method is applied to a larger set of

training examples, it becomes more noise resistant without losing its sensitivity to real changes in interest. We can conclude that the user profile created with this forgetting mechanism includes mostly those features that represent interests occurring in recent observations, as well as those that are stable over time.

3.4.2 Experiment

In this experiment our goal was to investigate how the weights for gradual forgetting influence the predictive accuracy of the recommendations.

Figure 9. The correlation between the “forgetting value” k and the quality of prediction for groups 2-5 (k ranging from 0 to 1; accuracy between 94% and 97%). k=0 means no forgetting.

The experiments were conducted using the same approach as reported in Section 3.3.4, and the users were divided into the same five groups depending on their total number of document selections. We omitted Group 1, however, since we did not expect to be able to detect interest drifts in the first five user selections (this was also confirmed in informal trials). The first experiment explores the relationship between the value of k and the prediction accuracy of the recommendations. For each of the groups 2-5 there exists a value of k, called the optimal group average, for which the average prediction accuracy is greater than that for k=0. The optimal group average value of k tends to increase slowly with the size of the training set: Group 2: k=0.2; Groups 3 and 4: k=0.3; Group 5: k=0.4. k=0.3 is an "optimal average" for these groups of users (see Figure 9). To be even more precise, one could use an empirical function k = f_k(n), where n is the position of the observation. Figure 10 compares the best learning algorithm for LaboUr (namely instance-based learning with a weighted distance measure and feature selection) with supplementary interest drift monitoring using the optimal average k and the optimal group average. The improvement of the predictive accuracy for the optimal average (k=0.3) is significant for

α = 0.05, but it is very close to the critical value. When we use the optimal group average, the improvement becomes clearly significant. The improvement in average accuracy largely depends on the behavior of the users: when a user's interests change, the prediction accuracy can decrease considerably, and after such shifts the presented approach is able to adapt faster to the new user interests. The current approach could therefore be improved by dynamically adapting the value of k to each user; similar approaches have been used for adapting the size of a time window individually (Widmer and Kubat, 1996; Klingenberg and Renz, 1998). This is however left to future research.

Figure 10. The improvement in accuracy of recommendations when gradual forgetting is used (groups 2-5; accuracy between 93% and 100%). Compared approaches: 1) feature selection in combination with instance-based learning using a weighted distance measure; 2) approach 1 with forgetting, using k=0.3 for each group; 3) approach 1 with forgetting, using the best k for each group.

The above experiments show that the presented forgetting mechanism can be successfully used for learning drifting user interests. Experiments that we conducted with a data set from an artificial learning problem by Schlimmer and Granger (1986) furthermore demonstrate that the presented approach is more generally able to significantly improve the predictive accuracy of inductive learning algorithms (TDIDT, NBC and IBL) on drifting concepts (Koychev and Schwab, 2000; Koychev, 2000). They also show that the presented forgetting method can work in cooperation with other forgetting mechanisms like time windows, by weighting the observations inside the window. However, since ELFI users are relatively short-term users and since the average prediction accuracy increases with the number of selections, we do not consider combining gradual forgetting with a time window for this application.

3.5 Content-based Collaborative Recommendations

3.5.1 Forming User Groups

User-adaptive systems typically learn about individual users by processing observations of their individual behavior. However, it may take a significant amount of time and a large number of observations to construct a reliable model of a user's interests, preferences and other characteristics. It is therefore useful to also take advantage of the behavior of similar users. This problem is often solved by abandoning content-based user modeling in favor of the collaborative filtering approach (Konstan et al., 1997). Earlier, the user modeling community provided a different answer, namely the stereotype approach (Rich, 1979, 1989): at development time, user subgroups are identified and the typical characteristics of their members determined; at runtime, users are assigned to one or more of these predefined user groups and the groups' characteristics are attributed to the user. The need for an (empirically based) pre-definition of these stereotypes is an evident disadvantage. As an alternative, the system Doppelgänger used clustering mechanisms to find user groups dynamically, based on all available individual user models (Orwant, 1995). In LaboUr, we pursue a similar approach to group modeling: explicitly represented user models can be clustered, and the descriptions of the clusters can be used like predefined stereotypes. In contrast to “real” stereotypes, clusters are acquired dynamically and can be revised whenever needed, so the dynamic evolution of user groups can be accounted for. However, in some domains it can be difficult to find well-distinguished groups of users: sometimes users’ interests are very idiosyncratic, and sometimes users fit into more than one group. Users can also be grouped along different dimensions, so the system should be able to manage many different groupings of users. Additionally, in some cases the current user’s interests will change frequently and rapidly.
As a result, the dynamics of user group formation can be very fluid, which requires high computational costs to keep the users’ group memberships up to date, especially with large numbers of users. A possible solution is the Nearest Neighbor method, which defines a small group of those users who are most similar to the current user; this group can then be used for collaborative recommendations. When grouping users, the key issue is to define an appropriate similarity measure between users. In our approach, we employ the Pearson correlation for measuring the similarity between users’ explicit profiles. The correlation between two users u_x and u_y is measured over the feature set defined as the union of the features selected for each user: F_{x,y} = F_x ∪ F_y. This avoids considering two users similar merely because they have many irrelevant features in common. The values in the user vectors are the weights of the features calculated by the significance test, i.e. the normalized values (see Figure 5) from the significance analysis. The correlation between two users can thus be calculated as follows:

r(u_x, u_y) = ∑_{i∈F_{x,y}} (w_{x,i} − w̄_x)(w_{y,i} − w̄_y) / √( ∑_{i∈F_{x,y}} (w_{x,i} − w̄_x)² · ∑_{i∈F_{x,y}} (w_{y,i} − w̄_y)² )      (4)

where w_{j,i} is the weight of feature i for user j, calculated by the feature significance test (see Section 3.2.3), and w̄_j is the average weight for user j. ELFI users usually are short-term users or users with changing interests, and it is difficult to distinguish well-defined groups among them. Hence, we decided to use a Nearest Neighbor approach for collaborative recommendations. Let D_{rec,x} be the set of documents that consists of the union of the sets of documents that have been seen by the neighbor users but not by the current user x:
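The profile correlation (4) can be sketched as follows. Names are illustrative, and one assumption is added: features in the union F_{x,y} that are missing from one profile are treated as weight 0, which the paper does not state explicitly.

```python
import math

def profile_similarity(fx, fy):
    """Pearson correlation between two users' explicit profiles, computed
    over the union of their significant features. fx and fy map feature ->
    normalized significance weight; features absent from a profile are
    treated as weight 0 (an assumption)."""
    features = sorted(set(fx) | set(fy))
    xs = [fx.get(f, 0.0) for f in features]
    ys = [fy.get(f, 0.0) for f in features]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0
```

Restricting the computation to the union of selected features, rather than all 420, is what prevents shared irrelevant features from inflating the similarity.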

D_{rec,x} = ∪_{u_i ∈ U_{sim}} D_{u_i} − D_x      (5)

In the next step, the documents from D_{rec,x} are rated using weighted voting as follows:

R_{d_i ∈ D_{rec,x}} = ∑_{j=1}^{k} r(x, u_j) v_{j,i}      (6)

where v_{j,i} ∈ {0,1} and v_{j,i} = 1 iff d_i ∈ D_{u_j}. In some cases the current user can have very specific interests, and the k-NN method can include users who actually have quite different interests; weighted voting prevents such neighbors from having a big influence on the recommendations. Finally, the documents that received the highest ratings are recommended to user x. Collaborative recommendations can be provided to users independently from content-based recommendations, which allows users to choose the information source for achieving their goals. Another option is a weighted voting method that merges both sources of recommendations, where the weight of each source can be dynamically adjusted by observing the user’s actions on the list of recommended documents.
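Steps (5) and (6) together can be sketched compactly (all names are illustrative, and the top-n cutoff is an assumed presentation choice):

```python
def recommend(neighbors, seen, similarity, top_n=5):
    """Collaborative step sketched in (5)-(6): collect documents seen by the
    nearest-neighbor users but not by the current user, rate each candidate
    by the similarity-weighted vote of the neighbors who saw it, and return
    the highest-rated ones.

    neighbors: {user_id: set of document ids seen by that neighbor}
    seen: document ids the current user has already seen (D_x)
    similarity: {user_id: correlation r(x, u_j) with the current user}
    """
    # (5): union of neighbor documents minus the current user's documents
    candidates = set().union(*neighbors.values()) - set(seen)
    # (6): weighted vote; a neighbor contributes its similarity if it saw d
    ratings = {d: sum(sim for u, sim in similarity.items()
                      if d in neighbors.get(u, set()))
               for d in candidates}
    return sorted(ratings, key=ratings.get, reverse=True)[:top_n]
```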

3.5.2 Experiments

To show the relevance of the explained algorithm we conducted several experiments. We calculated the correlation between each possible pair of users, and then tested how the correlation between two users relates to the prediction accuracy. We were able to show that there is a strong correlation (0.98) between the accuracy of collaborative recommendation as described above and the correlation (i.e. similarity) between users (see Figure 11). As mentioned before (Section 3.3.4), a prediction quality of 50% represents a random recommendation strategy. If the

correlation between two users is nearly 1, the average prediction quality is also very high (i.e. it approaches 100%). If the correlation is approximately 0, then the prediction quality is equal to that of a random recommender. We can conclude that the ELFI domain is amenable to collaborative recommendation, i.e. it is possible to give accurate recommendations if users are similar.

Figure 11. The relationship between the similarity of two users (based on the correlation between their content-based profiles) and the quality of the collaborative recommendations (correlation ranging from -0.2 to 1; accuracy from 40% to 100%).

This result can be used for defining a lower threshold on the similarity of two users for collaborative recommendation that guarantees a certain level of recommendation accuracy. For example, a threshold of 0.95 for the correlation between users will ensure that the average prediction accuracy of collaborative recommendation stays above 90%. Additionally, our experiments suggest that collaborative recommendation based on similarity of content-based profiles achieves its high prediction accuracy with considerably less computational effort than traditional approaches, which compare items.

4 Related Work

In the past, several systems have been developed that employ learning procedures to identify individual users’ interests with respect to information objects and their contents, and that make use of the learned interest profiles to generate personalized recommendations. Lieberman (1995) developed the system Letizia, which assists a user in Web browsing. Letizia tries to anticipate interesting items on the Web that are related to the user’s current navigation context (i.e., the current Web page, a search query, etc.). For a set of links it computes a preference ranking based on a user profile. This profile is a list of

weighted keywords, which is obtained by aggregating the results of TFIDF analyses of pages. Letizia uses heuristics to determine positive and negative evidence of the user’s information interest: viewing a page indicates interest in that page, bookmarking a page indicates even stronger interest, while “passing over” links (i.e., selecting a link below and/or to the right of other links) indicates disinterest in these links. A classification approach is taken by Syskill&Webert (Pazzani, Muramatsu and Billsus, 1996). The user rates a number of Web documents from some content domain on a binary “hot” and “cold” scale, so that positive and negative learning examples become available to the system. Based on these ratings, the system computes the probabilities of words occurring in hot or cold documents. A set of word probability triplets is formed for each user, which can be regarded as an interest profile that characterizes the average hot and cold documents of this user. Based on this profile, the Naive Bayes Classifier method is used to classify further documents as hot or cold. The system Personalized WebWatcher (Mladenic, 1996) also uses the Naive Bayes Classifier. It observes an individual user’s choices of links on Web pages in order to recommend links on other Web pages that may be visited later. The user does not have to provide explicit ratings; instead, visited links are taken as positive examples and non-visited links as negative ones. The Naive Bayes Classifier is again used in the system NewsDude (Billsus and Pazzani, 1999), similarly as in Syskill&Webert, to recommend news articles to users. In NewsDude, the probabilities are used to characterize the long-term interests of a user. To avoid recommending too many similar documents to a user, an additional short-term profile is built by memorizing currently read articles.
New articles are then compared to the memorized ones; if they are too similar, they are not recommended even when they would match the long-term interest profile. This procedure corresponds to the nearest-neighbor classification algorithm. Note that for the short-term profile, only positive examples are needed (albeit to produce “negative” recommendations).

Fab (Balabanovic and Shoham 1997) is a hybrid content-based collaborative system that maintains user profiles based on content analysis and directly compares these profiles to determine similar users for collaborative recommendations. Items are recommended to a user both when they score highly against the user’s own interest profile and when they are rated highly by a user with a similar interest profile. Fab has three main components: collection agents, which find pages for a specific topic; selection agents, which find pages for a specific user; and the central router. Every agent maintains a profile: a collection agent’s profile represents its current topic, while a selection agent’s profile represents a single user’s interests. Pages found by collection agents are sent to the central router, which forwards them to those users whose profiles match above some threshold.

RAAP (Research Assistant Agent Project; Delgado et al. 1998) is a system that assists users in classifying documents (bookmarks) retrieved from the WWW and automatically recommends them to other users of the system with similar interests. Once the user has bookmarked an interesting page, her agent suggests classifying it under predefined categories, based on the content of the document and the user’s interest profile. The user then has the opportunity to confirm the suggestion or to change the classification to the one that she considers best for the given document. In addition, the agent recommends newly classified bookmarks to other similar users.

Pazzani (1999) discusses learning user interest profiles for recommending information sources. Different approaches for generating recommendations are explored, compared and combined. The experiments are conducted with restaurant data that contains users’ explicit ratings of different restaurants. The initial experiments are performed with classical collaborative and content-based approaches. Furthermore, a demographics-based approach is suggested, which uses descriptions of the raters to learn the relationship between a single item and the type of people that like that item; the demographic information (e.g., age, gender, education) is obtained from the users’ home pages. Another approach called “collaboration via content” is also investigated, which compares the content-based user profiles to detect similarities among users. Finally, a way of combining the recommendations obtained from the above approaches by consensus among them is suggested.
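Fab and the “collaboration via content” approach both compare content-based user profiles directly to find similar users. One common way to do this is cosine similarity between keyword-weight vectors; the following is only a sketch with made-up profiles and vocabulary, as the cited papers do not prescribe this exact measure:

```python
import math

# Hypothetical weighted-keyword profiles for two users, of the kind
# produced by content analysis. Vocabulary and weights are illustrative only.
profile_a = {"funding": 0.8, "deadline": 0.5, "grants": 0.3}
profile_b = {"funding": 0.6, "grants": 0.4, "travel": 0.2}

def cosine_similarity(p, q):
    """Cosine of the angle between two sparse keyword-weight vectors."""
    dot = sum(w * q.get(k, 0.0) for k, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q)

print(round(cosine_similarity(profile_a, profile_b), 3))
```

Users whose profile similarity exceeds some threshold can then serve as recommendation sources for one another, which is essentially the matching step that Fab’s central router performs.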

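Several of the systems discussed in this section (Syskill & Webert, Personal WebWatcher, NewsDude) rely on the Naive Bayes classifier over word occurrences. A minimal sketch of the classification step, with purely illustrative word probabilities that are not taken from any of the cited systems:

```python
import math

# Hypothetical per-word probabilities P(word | hot) and P(word | cold),
# as would be estimated from a user's rated documents. Illustrative only.
word_probs = {
    "funding":  {"hot": 0.30, "cold": 0.05},
    "deadline": {"hot": 0.20, "cold": 0.10},
    "recipe":   {"hot": 0.02, "cold": 0.25},
}
priors = {"hot": 0.5, "cold": 0.5}

def classify(words):
    """Return the more probable class for a document, via log-probabilities."""
    scores = {}
    for label in ("hot", "cold"):
        score = math.log(priors[label])
        for w in words:
            if w in word_probs:  # unknown words are simply ignored in this sketch
                score += math.log(word_probs[w][label])
        scores[label] = score
    return max(scores, key=scores.get)

print(classify(["funding", "deadline"]))  # prints "hot"
```

The “naive” independence assumption lets the per-word probabilities simply be multiplied (added in log space), which is what makes the approach cheap enough to run incrementally as new ratings arrive.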
5 Conclusion

The paper first of all argues for the need for an explicit representation of learning results, a requirement that we borrowed from user modeling. Such an explicit user profile not only facilitates the exchange of user information and its use for purposes other than those for which it was originally collected; it is also valuable for comparing and clustering users, and for providing recommendations on this basis.

The other theoretical contributions of this paper address issues in human-computer interaction that have hitherto proved to be major hurdles for user-adaptive interaction. A gradual forgetting mechanism is presented to cope with the phenomenon of drifting user interests. This mechanism enhances the prediction accuracy significantly, since it allows the learning algorithm to re-calibrate more quickly when users switch to new topics. In future work we will extend this mechanism with the ability to dynamically adapt the “speed of forgetting” for each individual user.

The second critical problem addressed in this paper is that users hardly ever give explicit feedback, and particularly not negative feedback. It remains a challenge to design interfaces that acquire user feedback in an unobtrusive way, so that implicit positive and negative user ratings become available more easily (some special cases where this seems possible are discussed in Kobsa et al., 2001). However, we think that in general negative evidence will always be difficult to obtain. In these cases, the methods presented in this paper can be fruitfully employed.

The proposed methods have been implemented in the ELFI environment. We anticipate that our findings can be transferred to other areas, such as Web browsing and news groups, without major problems. Even though these are different application areas, the main problem remains the same: to extract a user’s interests from the content of the visited Web pages in order to recommend other pages that are relevant to her current interests.
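The gradual forgetting mechanism can be illustrated with a linear weighting scheme in the spirit of Koychev (2000): older observations receive smaller weights, while the mean weight stays 1 so the overall influence of the training set is unchanged. The function name and parameterization below are our own simplification, not the exact formula used in the system:

```python
def forgetting_weights(n, k=0.5):
    """Linearly decreasing weights for n observations, oldest first.

    k in [0, 1] controls the speed of forgetting: the newest observation
    gets weight 1 + k, the oldest 1 - k, and the mean weight remains 1.
    With k = 0 all observations count equally (no forgetting).
    """
    if n == 1:
        return [1.0]
    return [1.0 - k + 2.0 * k * i / (n - 1) for i in range(n)]

print(forgetting_weights(5, k=0.5))  # [0.5, 0.75, 1.0, 1.25, 1.5]
```

These weights can then multiply the contribution of each training example in the learning algorithm, so that a user who switches topics sees the old topic’s influence fade gradually rather than abruptly.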

6 Acknowledgement

The research described here has been partially supported by the German Science Foundation through Grant No. Ko 1044/4-5 (project LaboUr).

References

Allan, J. (1996). Incremental Relevance Feedback for Information Retrieval. In Proceedings of SIGIR’96, Zurich, Switzerland, ACM Press, pp. 270-278.
Balabanovic, M. and Shoham, Y. (1997). Fab: Content-Based, Collaborative Recommendation. Communications of the ACM 40(3), 77-87.
Basu, C., Hirsh, H. and Cohen, W. (1998). Recommendation as Classification: Using Social and Content-Based Information in Recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, pp. 714-720.
Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
Billsus, D. and Pazzani, M. J. (1999). A Hybrid User Model for News Classification. In Kay, J. (ed.), UM99 User Modeling: Proceedings of the Seventh International Conference. Wien, New York: Springer-Verlag, pp. 99-108.
Billsus, D. (2000). User Model Induction for Intelligent Information Access. Ph.D. Thesis, Dept. of Information and Computer Science, University of California, Irvine, CA.
Brajnik, G. and Tasso, C. (1994). A Shell for Developing Non-Monotonic User Modeling Systems. International Journal of Human-Computer Studies 40, 31-62.
Carroll, J. and Rosson, M. B. (1987). The Paradox of the Active User. In J. M. Carroll (ed.), Interfacing Thought: Cognitive Aspects of Human-Computer Interaction. Cambridge, MA: MIT Press.
Chiu, B. and Webb, G. (1998). Using Decision Trees for Agent Modeling: Improving Prediction Performance. User Modeling and User-Adapted Interaction 8(1/2), 131-152.
Crabtree, I. and Soltysiak, S. (1998). Identifying and Tracking Changing Interests. International Journal on Digital Libraries 2, Springer-Verlag, 38-53.
Delgado, J., Ishii, N. and Ura, T. (1998). Content-based Collaborative Information Filtering: Actively Learning to Classify and Recommend Documents. In Proceedings of CIA’98, Paris, France, July 1998, LNCS 1435, Springer-Verlag.
EU (1995). European Union: Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of such Data. Official Journal of the European Communities (23 November 1995, No. L 281), 31ff. Available from http://158.169.50.95:10080/legal/en/dataprot/directiv/directiv.html
Fink, J., Kobsa, A. and Jaceniak, I. (1997). Individualisierung von Benutzerschnittstellen mit Hilfe von Datenchips für Personalisierungsinformation. GMD-Spiegel 1/1997, pp. 16-17. Available from http://www.ics.uci.edu/~kobsa/papers/1997-GMDkobsa.ps
Fink, J. (2001). User Modeling Servers: Requirements, Design, and Implementation. Ph.D. Thesis, Dept. of Mathematics and Computer Science, University of Essen, Germany (forthcoming).
Fink, J. and Kobsa, A. (2001). A Review and Analysis of Commercial User Modeling Servers for Personalization on the World Wide Web. To appear in User Modeling and User-Adapted Interaction, Special Issue on Deployed User Modeling.
Forrester (1999). The Privacy Best Practise. Forrester Research, Cambridge, MA, September 1999.
GVU (1998). GVU’s 10th WWW User Survey. Available from http://www.gvu.gatech.edu/user_surveys/survey-1998-10/
Klinkenberg, R. and Renz, I. (1998). Adaptive Information Filtering: Learning in the Presence of Concept Drift. AAAI/ICML-98 Workshop on Learning for Text Categorization, TR WS-98-05, Madison, WI.
Kobsa, A. (2000). Generic User Modeling Systems. To appear in User Modeling and User-Adapted Interaction, Ten Year Anniversary Issue.
Kobsa, A., Koenemann, J. and Pohl, W. (2001). Personalized Hypermedia Presentation Techniques for Improving Online Customer Relationships. The Knowledge Engineering Review, forthcoming.
Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997). GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM 40(3), 77-87.
Koychev, I. (2000). Gradual Forgetting for Adaptation to Concept Drift. In Proceedings of the ECAI 2000 Workshop “Current Issues in Spatio-Temporal Reasoning”, Berlin, Germany, pp. 101-106.
Koychev, I. and Schwab, I. (2000). Adaptation to Drifting User’s Interests. In Proceedings of the ECML 2000/MLnet Workshop “ML in the New Information Age”, Barcelona, Spain, pp. 39-45.
Kozierok, R. and Maes, P. (1993). A Learning Interface Agent for Scheduling Meetings. In W. D. Gray, W. E. Hefley and D. Murray (eds.), Proceedings of the International Workshop on Intelligent User Interfaces, Orlando, FL, pp. 81-88. New York: ACM Press.
Kruskal, J. (1964). Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika 29, 1-27.
Lam, W., Mukhopadhyay, S., Mostafa, J. and Palakal, M. (1996). Detection of Shifts in User Interests for Personalized Information Filtering. In Proceedings of SIGIR’96, Zurich, Switzerland, ACM Press, pp. 317-325.
Lieberman, H. (1995). Letizia: An Agent That Assists Web Browsing. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal.
Maloof, M. and Michalski, R. (2000). Selecting Examples for Partial Memory Learning. Machine Learning 41, 27-52.
Mitchell, T. (1997). Instance-Based Learning. Chapter 8 of Machine Learning. McGraw-Hill.
Mitchell, T., Caruana, R., Freitag, D., McDermott, J. and Zabowski, D. (1994). Experience with a Learning Personal Assistant. Communications of the ACM 37(7), 81-91.
Mladenic, D. (1996). Personal WebWatcher: Implementation and Design. Technical Report IJS-DP-7472, Department of Intelligent Systems, J. Stefan Institute, Slovenia.
Morita, M. and Shinoda, Y. (1994). Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval. In Proceedings of SIGIR’94, ACM, New York.
Moukas, A. (1996). Amalthaea: Information Discovery and Filtering Using a Multi-Agent Evolving Ecosystem. In Proceedings of the Conference on Practical Application of Intelligent Agents and Multi-Agent Technology.
Nick, A., Koenemann, J. and Schaluck, E. (1998). ELFI: Information Brokering for the Domain of Research Funding. Computer Networks and ISDN Systems 30, 1491-1500.
Orwant, J. (1995). Heterogeneous Learning in the Doppelgänger User Modeling System. User Modeling and User-Adapted Interaction 4(2), 107-130.
Paliouras, G., Karkaletsis, V., Papatheodorou, C. and Spyropoulos, C. (1999). Exploiting Learning Techniques for the Acquisition of User Stereotypes and Communities. In Kay, J. (ed.), UM99 User Modeling: Proceedings of the Seventh International Conference. Wien, New York: Springer-Verlag, pp. 169-178.
Pazzani, M., Muramatsu, J. and Billsus, D. (1996). Syskill & Webert: Identifying Interesting Web Sites. In Proceedings of the National Conference on Artificial Intelligence, Portland, OR, pp. 54-61.
Pazzani, M. J. and Billsus, D. (1997). Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning 27, 313-331.
Pazzani, M. (1999). A Framework for Collaborative, Content-Based and Demographic Filtering. Artificial Intelligence Review 13(5), 393-408.
Pohl, W. (1997). LaboUr: Machine Learning for User Modeling. In M. J. Smith, G. Salvendy and R. J. Koubek (eds.), Design of Computing Systems: Social and Ergonomic Considerations (Proceedings of the Seventh International Conference on Human-Computer Interaction), vol. B, pp. 27-30. Amsterdam: Elsevier Science.
Pohl, W. (1998). Logic-Based Representation and Reasoning for User Modeling Shell Systems. St. Augustin, Germany: Infix.
Pohl, W. (1999). Logic-Based Representation and Reasoning for User Modeling Shell Systems. User Modeling and User-Adapted Interaction 9(3), 217-283.
Resnick, P. and Varian, H. (1997). Recommender Systems. Communications of the ACM 40(3), 56-58.
Rich, E. (1979). User Modeling via Stereotypes. Cognitive Science 3, 329-354.
Rich, E. (1989). Stereotypes and User Modeling. In Kobsa, A. and Wahlster, W. (eds.), User Models in Dialog Systems. Berlin, Heidelberg: Springer, 35-51.
Schlimmer, J. and Granger, R. (1986). Incremental Learning from Noisy Data. Machine Learning 1(3), 317-357.
Schwab, I., Pohl, W. and Koychev, I. (2000). Learning to Recommend from Positive Evidence. In Proceedings of Intelligent User Interfaces 2000, ACM Press, pp. 241-247.
TheStandard (1998). Trustbuilders vs. Trustbusters. TheStandard, May 11, 1998. Available from http://www.thestandard.com/article/display/0,1151,235,00.html
Wahlster, W. and Kobsa, A. (1989). User Models in Dialog Systems. In Kobsa, A. and Wahlster, W. (eds.), User Models in Dialog Systems. Berlin: Springer-Verlag.
Webb, G., Chiu, B. and Kuzmycz, M. (1997). Comparative Evaluation of Alternative Induction Engines for Feature-Based Modelling. International Journal of Artificial Intelligence in Education 8, 97-115.
Webb, G. and Kuzmycz, M. (1996). Feature-Based Modelling: A Methodology for Producing Coherent, Consistent, Dynamically Changing Models of Agents’ Competencies. User Modeling and User-Adapted Interaction 5(2), 117-150.
Widmer, G. (1997). Tracking Context Changes through Meta-Learning. Machine Learning 27, 256-286.
Widmer, G. and Kubat, M. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23, 69-101.
