Automatic Generation of Content-Based User Profiles Compared to Rule-Based Profiles for Information Filtering

Tsvi Kuflik
Dept. of Information Systems Engineering, Ben-Gurion University of the Negev
P.O.Box 653, Beer Sheva 84105, Israel
(972) 8 6479431
tsvikak@bgumail.bgu.ac.il

Peretz Shoval
Dept. of Information Systems Engineering, Ben-Gurion University of the Negev
P.O.Box 653, Beer Sheva 84105, Israel
(972) 8 6472221
[email protected]

ABSTRACT

In this paper the content-based filtering approach is compared with the rule-based filtering approach. The research goals are to study the effectiveness of automatic generation of content-based user profiles as a function of the number of examples given to the system, and to compare it with the effectiveness of a rule-based user profile. The study shows that rule-based profiles tend to outperform content-based profiles when the latter are generated from a small number of positive examples, but as the number of examples increases, content-based profiles become more advantageous. We conclude that the two approaches can be combined, thus minimizing the user's initial effort of building filtering profiles while enabling incremental improvement based on user feedback later on.

Keywords

Information filtering; User modeling; Content-based filtering; Rule-based filtering.

1. INTRODUCTION

User profiles for information filtering systems can be generated "manually" by users or automatically by a system. Previous research has shown that users fail to define their information needs accurately. For example, [1] showed that query terms provided by users are poor predictors of the relevancy of email messages, in comparison to terms identified automatically by an artificial neural network (ANN). Filtering results based on filters generated automatically from a trained ANN yielded a correlation of 92% with users' ratings of the same messages, while filtering results based on "manual" user profiles (namely, terms defined explicitly by the users) yielded only 48% correlation with the users' ratings.

The domain of machine learning provides a variety of tools for user profile generation and adaptation. Many general machine learning techniques have been adopted for user profile generation in various applications of information retrieval and filtering, e.g. text categorization [5]. However, machine-learning techniques require a large set of training examples, which may be a problem in practical implementations of machine learning for user profile generation. For a system to learn and generate a user profile automatically, a large set of examples is needed. Users may have only a few good examples of data items ("positive" examples) to represent what they need, and possibly also some bad ("negative") examples of irrelevant data items. Hence training an automatic system to generate a reliable user profile is problematic. A partial solution to this problem could be incremental buildup and improvement of user profiles based on relevance feedback. Given an initial user profile that is "good enough," the user's feedback on the filtering system's recommendations may generate a growing set of training examples that can be used to further retrain the system so as to build a more accurate user profile. However, automatic generation of a "good enough" initial user profile based on index terms (keywords) is itself a problematic task.

An alternative solution to this problem may be to use filtering rules that represent user information needs. A filtering rule can be defined in the form: If <condition> then <action>. For example: If <title> contains "Machine Learning" and <author> = "John Smith" then <relevance> = high. In spite of the major role that rules play in Artificial Intelligence in general and expert systems in particular [4], they are hardly used in practice in information filtering, presumably due to the initial effort required to define a comprehensive set of filtering rules. (Rules are used mainly in email filtering [2, 3].) However, there may be simple rules that are easy to define and good enough to be used at the initial stages of filtering.
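A rule of this kind is straightforward to express in code. The sketch below is only a hypothetical illustration of such a title-based filtering rule; the `Document` structure and the keyword list are assumptions for the example, not part of any system described in the paper:

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    abstract: str

def title_rule(doc: Document, relevant_terms: list[str]) -> bool:
    """Filtering rule: accept the document if any relevant
    keyword appears in its title (case-insensitive)."""
    title = doc.title.lower()
    return any(term.lower() in title for term in relevant_terms)

# Hypothetical usage with user-supplied keywords
doc = Document(title="Machine Learning for Information Filtering", abstract="...")
print(title_rule(doc, ["machine learning"]))  # True
```

Such a rule requires no training examples at all, which is exactly what makes it attractive as an initial profile.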
In this study we concentrate on the initial generation of user profiles, and we show two things: (a) how the quality of automatically generated content-based profiles improves as the set of training examples increases, and (b) how simple filtering rules can "compensate" for poor initial content-based profiles that are based on only a small number of examples. Consequently, we show that the two approaches can be combined, so that at the initial stages of filtering the user profile is based on such simple rules and then, once sufficient examples are accumulated, is replaced by a more effective content-based profile.

The rest of the paper is organized as follows: Section 2 describes the research; Section 3 presents the results; and Section 4 concludes and provides future research directions.

2. RESEARCH DESCRIPTION

Researchers from the academic realm were asked to define search queries in any area of their research interests. 28 search queries were defined. The queries were submitted to database search engines (INSPEC, ECONLIT and others), which retrieved the respective abstracts of publications. Then each researcher was asked to read the abstracts retrieved for his query and to rate the relevancy of each of them. After the participants rated the relevancy of their abstracts, we discussed with them the reasons for the specific ratings and the "rules of thumb" they used to determine the relevancy of each abstract. Surprisingly, most users relied on one general filtering rule. The rule was related to the title of the paper and can be defined as follows: "If the title contains a relevant keyword, then it is most likely that the abstract is relevant." As a result, filtering rules regarding document titles were defined for all users (as described later). We measured the precision of the retrieved abstracts within each of the 28 data sets. Precision is defined as the ratio between the number of relevant abstracts and the total number of abstracts retrieved for that query. Figure 1 presents the precision values of the data sets. These values show that most users had difficulty defining their area of interest accurately. Except for Set 1, with a precision of 0.94, all other sets have precision below 0.6. In 80% of the sets precision is below 0.4, and in about half of the sets it is below 0.2. These results show again that user-defined terms are not good enough to predict relevancy, and that there must be better ways to generate user queries or profiles for IF systems. Out of the 28 data sets, only the 16 sets having precision above 10% and more than 45 abstracts were selected for further analysis.
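The precision and recall measures used throughout the results can be computed directly from the relevance judgments. A minimal sketch (the document identifiers are hypothetical):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision: fraction of the retrieved documents that are relevant.
    Recall: fraction of the relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 5 abstracts retrieved, 4 of them relevant, 8 relevant abstracts in total
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 4, 5, 6, 7, 8, 9})
print(p, r)  # 0.8 0.5
```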

Fig. 1 Precision of retrieved sets

3. RESULTS

3.1 The Impact of Training Set Size on Filtering Effectiveness

For each of the data sets the following process was performed: the filtering system generated up to 10 content-based user profiles; the first profile was generated from the first 10 examples of the set, the second profile from the first 20 examples, and so on.

Training set size | Set 1   | Set 4   | Set 5   | Set 6   | Set 7 | Set 8
(P/R)             |         |         |         |         |       |
Orig. P           | .94     | .49     | .46     | .45     | .36   | .35
10                | 1/1     | 1/.14   | .89/.73 | 1/.27   | 0/0   | 1/.06
20                | 1/1     | .67/.29 | .78/.78 | 1/.27   | 0/0   | 1/.12
30                | 1/1     | .67/.29 | .78/.78 | 1/.36   | 0/0   | .5/.06
40                | no data | .75/.43 | .89/.89 | 1/.45   | 0/0   | .67/.12
50                | no data | .67/.57 | .89/.80 | 1/.36   | 0/0   | no data
60                | no data | .86/.63 | 1/.82   | 1/.45   | 0/0   | no data
70                | no data | .82/.65 | 1/.75   | 1/.36   | 0/0   | no data
80                | no data | .83/.61 | 1/.90   | 1/.45   | .5/.8 | no data
Rules             | 1/.50   | .8/.57  | 1/.56   | .86/.55 | 1/.80 | 1/.86

Table 1: Precision and Recall for sets having original precision > 0.3

Set No. | Orig. P | 20 (P/R) | 40 (P/R) | 60 (P/R) | 80 (P/R) | Rules (P/R)
10      | .29     | 0/0      | 0/0      | 0/0      | 0/0      | 1/.63
12      | .27     | 0/0      | 0/0      | 0/0      | 0/0      | .67/.67
13      | .26     | 0/0      | 0/0      | 0/0      | 0/0      | .6/.43
16      | .19     | 0/0      | 0/0      | 0/0      | 0/0      | .5/.4
17      | .19     | 0/0      | 0/0      | 0/0      | 0/0      | .75/.5
18      | .18     | 0/0      | 0/0      | 0/0      | 0/0      | .5/.33
19      | .16     | 0/0      | 0/0      | no data  | no data  | 1/.5
20      | .12     | 0/0      | 0/0      | 0/0      | 1/.2     | 1/1
21      | .11     | 0/0      | 0/0      | 0/0      | 0/0      | .8/.25
22      | .09     | 0/0      | 0/0      | 0/0      | 0/0      | .5/.25

Table 2: Precision and Recall for sets having original precision < 0.3

The last 20 examples were set aside as a test set. For sets having fewer than 100 examples the process stopped at the last possible tenth. This process simulated incremental

content-based user profile buildup using user relevance feedback, since at every step the existing profile was updated with a batch of 10 more examples. The filtering results are presented in Tables 1 and 2. Looking at Table 1, it is apparent that when the initial precision is high (above 40% in our case), that is, when there are enough positive examples to learn from, the system was able to generate effective user profiles. As the number of examples grows, precision and recall improve. When we look at the remaining two sets in Table 1, having precision below 0.4 (Sets 7 and 8), we see that the results are very poor. The results for the sets having original precision < 0.3 emphasize that fact; they were a total failure: 0 recall for all sets but one (Table 2).

3.2 The Impact of Filtering Rules on Filtering Effectiveness

When users were interviewed about the "rules of thumb" they use in evaluating document relevancy, all the participants reported one general rule: "If a relevant term appears in the title of the document, then there is a good chance that this document is relevant." The phrase "relevant term" refers to any keyword that seems relevant to the user in the context of his area of interest. Based on manual analysis of the first 80 examples or fewer (leaving the last 20 documents as a test set), a "Title" filtering rule was defined for each set of documents. Using this filtering rule, the system filtered the same documents within each of the above data sets. The results appear in the "Rules" row or column of Tables 1 and 2. It is noticeable that for the data sets where the content-based profile exhibited reasonable performance, it outperformed the rule-based filter at the end of the learning process. For example, Set 4 in Table 1 shows that when the learning set includes 50 examples, the rule-based profile has precision of 0.8 and recall of 0.57, which is higher than the content-based profile with precision of 0.67 and recall of 0.57. However, at the next learning step, with 60 examples, the content-based profile performs better, with precision of 0.86 and recall of 0.63. Similar behavior can be seen with Sets 1, 5 and 6. However, for the data sets having original precision less than 40%, the content-based profile completely fails, while the rule-based profiles succeed, with an average precision of 0.73 and average recall of 0.52 (the average values in Table 2).

4. CONCLUSIONS AND FUTURE WORK

We showed that an information filtering system can generate content-based profiles from short texts (abstracts) and use them successfully when there are enough training examples (positive and negative). We succeeded in generating content-based profiles for data sets having about 100 examples and original precision above 40%. The system failed to do so for data sets having lower precision. The ability of an automatic system to build an accurate user profile from a small number of short text examples depends heavily on the number of examples and on the nature of the text itself. Usually not many examples are available, and even when examples are collected, getting enough of them with high enough precision cannot be guaranteed.

We showed that rule-based filtering, based on one simple rule, performs better than content-based filtering when the number of positive examples is small (precision < 0.4, in our case). On the other hand, when there are enough positive examples, the automatically generated content-based profile performs better.

We conclude that a rule-based profile can serve as an initial user profile for a filtering system until an accurate content-based profile is learned from enough examples. The two profiles can be combined in the following way: the system starts filtering using the "simple" rule-based profile, applying filtering rules that are explicitly defined by the user. The content-based profile is then built and improved, based on user relevance feedback on the filtered data items. Initially, the content-based profile is used "offline," in parallel with the rule-based profile that is used "online," until it provides similar filtering results. At that point the content-based profile becomes operational, replaces the rule-based profile, and continues to improve. The rule-based profile can also be updated by the user (e.g., by adding more rules).

How to improve filtering rules and how to combine rule-based filtering with content-based filtering are issues for future research. Another future research issue is how to automatically learn filtering rules from users or from examples, and how to adapt general, well-known filtering rules to specific user profiles.

REFERENCES

1. Boger, Z., Kuflik, T., Shapira, B. and Shoval, P. (2001). Automatic keyword identification by artificial neural networks compared to manual identification by users of filtering systems. Information Processing & Management, Vol. 37 (2), pp. 187-198.
2. Boone, G. (1998). Concept features in Re:Agent, an intelligent Email agent. Proceedings of the Second International Conference on Autonomous Agents, Minneapolis, MN, USA.
3. Hannani, U., Shapira, B. and Shoval, P. (2001). Information Filtering: Overview of Issues, Research and Systems. User Modeling and User-Adapted Interaction, Vol. 11, No. 3, pp. 203-259.
4. Jackson, P. (1998). Introduction to Expert Systems, 3rd edition, Addison-Wesley.
5. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys. Accepted for publication.
