Comparison of Interestingness Functions for Learning Web Usage Patterns

Xiangji Huang
School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Aijun An
Department of Computer Science, York University, Toronto, Ontario, Canada

Nick Cercone
School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

ABSTRACT

Livelink is a collaborative intranet, extranet and e-business application that enables employees and business partners of an organization to capture, share and reuse business information and knowledge. The usage of the Livelink software has been recorded by the Livelink Web server in its log files. We present an application of data mining techniques to the Livelink Web usage data. In particular, we focus on how to find interesting association rules and sequential patterns from the Livelink log files. A number of interestingness measures are used in our application to identify interesting rules and patterns. We present a comparison of these measures based on the feedback from domain experts. Some of the interestingness measures are found to be better than others.

Categories and Subject Descriptors
H.2 [Database Management]: Data Mining

General Terms
Measurement, Experimentation

Keywords
Data mining, Web usage mining, Interestingness measures, Association rules, Sequential patterns

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’02, November 4–9, 2002, McLean, Virginia, USA. Copyright 2002 ACM 1-58113-492-4/02/0011 ...$5.00.

1. INTRODUCTION

The enormous growth of information within an organization has pushed the development of techniques for automatic management and retrieval of information over an intranet, an extranet or the Internet. Livelink is a Web-based product of Open Text Corporation, which is designed to facilitate the storage, sharing, management and retrieval of critical information and processes for organizations, teams and individuals. Despite Livelink providing automatic management and retrieval of documents over an intranet or extranet, as more information becomes available over the net, it becomes more difficult for people to find useful information. Disparate types of data, such as Web pages, images, visiting history, newsgroup documents and e-mails, can be collected from information sources. However, few tools have been developed to help users browse such heterogeneous information, with the result that Web users browse by intuition or luck. Tools that identify user needs and recommend relevant information are in demand.

Data mining research makes it possible for people to find useful knowledge from a large amount of data. With the emergence of large amounts of Web data, it is a natural step to apply data mining techniques to the Web to find useful patterns and use them to help the user navigate the Web. We present our experience in applying data mining techniques to the large data repository maintained by Livelink. In this paper, we concentrate on Web usage mining that discovers user navigation patterns from a large collection of Livelink Web logs. The discovered patterns reflect the users’ browsing behavior, which can be used to identify user needs, re-organize the information on the net, and facilitate retrieval of relevant information for users.

We report our work on finding two types of Web usage patterns: association rules and sequential patterns. Informally speaking, an association rule states that a conjunction of conditions implies a consequence. For example, the rule Project Description 1, Task Description 2 → Document 3 induced from the Livelink logs tells that a person looking at Project Description 1 and Task Description 2 often looks at Document 3 as well. A sequential pattern specifies an ordered sequence of objects that occurs frequently in a sequence database. For example, the sequence ⟨Human Resources, Benefits, Dental Expense Claim Form⟩ induced from the Livelink log file tells that the pages for Human Resources, Benefits and Dental Expense Claim Form are visited frequently and in the specified order.

We used Apriori [1] and AprioriAll [2] to discover association rules and sequential patterns, respectively. Depending on the support threshold, a great number of rules or patterns may be generated. To identify interesting rules or patterns, we apply a number of interestingness measures to rank the discovered rules or patterns. Two of the measures have not been used to evaluate association rules and most of them have not been used for sequential patterns. We present a comparison of these measures on the Livelink log data. To compare these measures, we presented the top-ranking rules or patterns generated from each interestingness measure to our domain experts, who evaluated these patterns according to their unexpectedness and actionability. We report our findings from the evaluation results.

2. THE DATA SET

The log files used in our experiments contain Livelink access data for a period of two months (April and May 2002). The size of the raw data is 7GB. The data describe more than 3,000,000 requests made to a Livelink server from around 5,000 users. Each request corresponds to an entry in the log files. The entry contains:

1. the IP address the user is making the request from;
2. the cookie of the browser the user is making the request from, which can be as long as 5,000 bytes;
3. the time the request is made and the time the required page is presented to the user;
4. the name of the request handler in the Livelink program;
5. the name of the method within the handler that is used to handle the request;
6. the query strings that can be used to identify the page and the objects being requested, and some other information that is irrelevant to our task, such as the URL addresses that are useful for error-handling.

To learn association rules and sequential patterns, data preprocessing was applied to the raw log files. For each request in the raw log files, we identified the user that made the request and the information objects that were requested. Identifying information objects instead of Web pages is a novel aspect of our project. The information object could be a document (such as a PDF file, an Excel file or a Word file), a semi-structured project description, a task description, a news group message, a picture and so on. Most requests contain only one object, but some requests contain more than one object. The total number of objects identified from our two-month data is 38,679, which is part of the objects maintained by the Livelink server. After users and objects were identified from log entries, the requests were grouped into sessions. A session is a time-ordered sequence of sets of objects that a user requests during a single visit to Livelink.

3. LEARNING INTERESTING PATTERNS

We implemented the Apriori algorithm [1] to learn association rules and the AprioriAll algorithm [2] to learn sequential patterns from the session file. The number of association rules that are discovered depends on the support and confidence thresholds. The number of discovered sequential patterns depends on the support threshold. For our data set, we found that the number of generated rules is not affected much by changing the confidence threshold. However, the number of rules or patterns greatly depends on the support threshold. Table 1 shows how the number of rules or patterns varies with the support threshold. From the table, we can see that a great number of rules or patterns can be discovered if the support threshold is set very low. In order to find interesting patterns among a great number of discovered patterns, we rank the rules or patterns according to interestingness measures and prune out some redundant rules or patterns based on the structural relationship between rules or patterns.

3.1 Interestingness Measures

The following interestingness measures are used in our application to measure the interestingness of an association rule A → B or a sequential pattern AB. In association rule A → B, A and B are sets of objects. In sequential pattern AB, A is a sequence of sets of objects and B is a set of objects. For example, in sequence {o1}{o2, o3}{o4}, A is {o1}{o2, o3} and B is {o4}, where o1, o2, o3 and o4 are objects.

1. Support and confidence (SC). The support of a rule or pattern is defined as P(AB) and its confidence is defined as P(B|A), where P denotes probability. Rules or patterns are ranked according to their support value as the main key and their confidence value as the secondary key.

2. Confidence and support (CS). Rules or patterns are ranked according to their confidence value as the main key and their support value as the secondary key.

3. RI [7]. This Rule-Interest measure is defined as

$$RI = P(AB) - P(A)P(B).$$

4. IS [10]. Derived from statistical correlation, the IS measure is defined as

$$IS = \sqrt{\frac{P(AB)\,P(AB)}{P(A)\,P(B)}}.$$

5. MD [3]. The MD measure was inspired by a query term weighting formula used in information retrieval and has been used to measure the quality of classification rules [3]. We adopt the formula to measure the extent to which an association rule A → B or a sequential pattern AB can discriminate between B and $\overline{B}$ (the absence of B):

$$MD = \log \frac{P(A|B)\,(1 - P(A|\overline{B}))}{P(A|\overline{B})\,(1 - P(A|B))}.$$

6. C2 [5]. The C2 formula measures the agreement between A and B. It has been evaluated as a good rule quality measure for learning classification rules [3]. It can be defined as

$$C2 = \frac{P(B|A) - P(B)}{1 - P(B)} \times \frac{1 + P(A|B)}{2}.$$

7. Conviction (CV) [4]. Conviction tests the independence between A and B. It is defined as

$$Conviction = \frac{P(A)\,P(\overline{B})}{P(A\overline{B})}.$$

The values of some of these measures (such as RI, MD and C2) can be zero or negative, indicating that A and B are not correlated or are negatively correlated, respectively. For the Conviction measure, a value equal to 1 indicates that A and B are not correlated, and a value less than 1 indicates that they are negatively correlated. In our learning programs, rules or patterns with such interestingness values are considered uninteresting and are pruned. All measures except MD and C2 have been used to measure the interestingness of association rules. MD and C2 have only been used to measure classification rules. To our knowledge, among the measures listed above, only support and confidence have been used for evaluating sequential patterns.

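As an illustration of the measures in Section 3.1, the following Python sketch computes them from the three probabilities P(A), P(B) and P(AB). It is not the authors' implementation. The function names, the dictionary keys, the use of the natural logarithm for MD (the paper does not state the base) and the handling of degenerate cases with infinite values are all assumptions made for this example.

```python
import math


def interestingness(p_a, p_b, p_ab):
    """Return the Section 3.1 measures for a rule A -> B (or pattern AB).

    Inputs are P(A), P(B) and P(AB). This sketch assumes
    0 < p_a < 1, 0 < p_b < 1 and 0 < p_ab <= min(p_a, p_b).
    """
    confidence = p_ab / p_a                      # P(B|A)
    p_a_given_b = p_ab / p_b                     # P(A|B)
    p_a_and_not_b = p_a - p_ab                   # P(A and not B)
    p_a_given_not_b = p_a_and_not_b / (1 - p_b)  # P(A|not B)

    # MD is a log odds ratio; the natural logarithm is an assumption.
    num = p_a_given_b * (1 - p_a_given_not_b)
    den = p_a_given_not_b * (1 - p_a_given_b)
    if den == 0:
        md = math.inf       # degenerate case, handled by assumption
    elif num == 0:
        md = -math.inf      # degenerate case, handled by assumption
    else:
        md = math.log(num / den)

    return {
        "support": p_ab,
        "confidence": confidence,
        "RI": p_ab - p_a * p_b,
        "IS": math.sqrt(p_ab * p_ab / (p_a * p_b)),
        "MD": md,
        "C2": (confidence - p_b) / (1 - p_b) * (1 + p_a_given_b) / 2,
        # Conviction is unbounded when the rule never fails (confidence 1).
        "CV": math.inf if p_a_and_not_b == 0
              else p_a * (1 - p_b) / p_a_and_not_b,
    }


def is_uninteresting(m):
    """Pruning test from the text: RI, MD or C2 <= 0, or Conviction <= 1."""
    return m["RI"] <= 0 or m["MD"] <= 0 or m["C2"] <= 0 or m["CV"] <= 1


def rank_cs(rules):
    """Rank measure dictionaries by confidence, then support (the CS order)."""
    return sorted(rules, key=lambda m: (m["confidence"], m["support"]),
                  reverse=True)
```

For example, `interestingness(0.4, 0.5, 0.3)` gives RI = 0.1, C2 = 0.4 and Conviction = 2.0 (up to floating-point rounding), while the independent case `interestingness(0.5, 0.5, 0.25)` gives RI = 0 and Conviction = 1 and is therefore flagged by `is_uninteresting`. All four sign tests are algebraically equivalent to P(AB) ≤ P(A)P(B), so combining them with `or` does not change which rules are flagged.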

Support threshold   Number of assoc. rules   Number of seq. patterns
0.02                2                        8
0.01                14                       32
0.008               39                       55
0.005               88                       109
0.003               723                      357
0.0028              4,556                    409
0.0025              74,565                   651
0.002               4,800,070                2,834
0.001               >1,000,000,000           609,453

Table 1: Number of generated rules and patterns (confidence threshold = 0.5 for association rules)
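The dependence on the support threshold shown in Table 1 can be reproduced in miniature. The sketch below is a simplified level-wise (Apriori-style) search for frequent itemsets, not the authors' implementation of Apriori [1]. It counts frequent itemsets rather than rules, it ignores the ordering of requests within a session, and the toy sessions and object names are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_support):
    """Map each itemset with support >= min_support to its support."""
    sessions = [frozenset(s) for s in sessions]
    n = len(sessions)
    items = sorted({obj for s in sessions for obj in s})
    current = [frozenset([obj]) for obj in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count how many sessions contain each candidate.
        level = {}
        for cand in current:
            support = sum(1 for s in sessions if cand <= s) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-itemset candidates and keep
        # only those whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(sub) in level
                          for sub in combinations(c, k - 1))]
    return frequent


# Hypothetical sessions, each reduced to the set of objects requested.
toy_sessions = [
    {"hr", "benefits", "dental"},
    {"hr", "benefits"},
    {"hr", "news"},
    {"project", "task", "doc"},
    {"project", "task"},
    {"news"},
]

for threshold in (0.5, 0.3, 0.15):
    print(threshold, len(frequent_itemsets(toy_sessions, threshold)))
```

On these six toy sessions the number of frequent itemsets grows from 1 to 7 to 16 as the threshold drops from 0.5 to 0.3 to 0.15, the same qualitative trend that Table 1 reports at a much larger scale.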

3.2 Pruning Rules and Patterns

The use of an interestingness measure can help identify interesting association rules or sequential patterns by ranking the discovered rules or patterns. However, it cannot be used to identify redundant rules or patterns. By redundant rules or patterns we mean that the same semantic information is captured by multiple rules or patterns and hence some of them are considered redundant. Shah et al. [8] discuss some pruning techniques for detecting redundant association rules. We adopt two of their pruning rules and adapt them for use with interestingness measures. Our method for pruning association rules is as follows.

• Pruning Rule 1: If there are two rules of the form A → C and A ∧ B → C, and the interestingness value of rule A ∧ B → C is not significantly better than that of rule A → C, then rule A ∧ B → C is redundant and is pruned. A rule R1 is significantly better than rule R2 if

$$\frac{IV(R_1) - IV(R_2)}{IV(R_2)} > 5\%,$$

where IV(R1) and IV(R2) are the interestingness values for R1 and R2, respectively.

• Pruning Rule 2: If there are two rules of the form A → C1 and A → C1 ∧ C2, and the interestingness value of rule A → C1 is not significantly better than that of rule A → C1 ∧ C2, then rule A → C1 is redundant and is pruned.

Sequential patterns are pruned using the following rules:

• Pruning Rule 3: If a sequential pattern S1 contains another pattern S2, and the interestingness value of S2 is not significantly better than that of S1, then S2 is redundant and is pruned. A sequential pattern A1 A2 ... Am contains another sequential pattern B1 B2 ... Bn if there exist integers i1 < i2 < ... < in such that B1 ⊆ A_{i1}, B2 ⊆ A_{i2}, ..., Bn ⊆ A_{in}, where Ai (i = 1, ..., m) and Bi (i = 1, ..., n) are sets of objects.

• Pruning Rule 4: If a sequential pattern consists of repeated occurrences of the same set of objects, the pattern is pruned. For example, pattern AAAAA contains the same set of objects A and is thus pruned according to this rule.

Pruning Rule 4 is based on the feedback from our domain experts, who consider patterns with repeated single elements not useful and recommend that they be pruned.¹

¹ This rule may not be suitable for other applications, such as DNA sequence analysis.

4. RESULTS AND FINDINGS

For each support threshold, the generated association rules or sequential patterns are ranked according to each of the 7 interestingness measures. We presented the top 10 rules or patterns for each of the 7 interestingness measures to our domain experts, who evaluated the interestingness of these rules or patterns according to their unexpectedness and actionability. Table 2 shows the evaluation results for association rules for four different support thresholds. Each number in the table is the number of interesting rules among the top 10 rules for the corresponding interestingness measure and support threshold. Table 3 shows the results for sequential patterns.

Support threshold   0.01   0.005   0.0028   0.0025
SC                  0      0       0        0
CS                  0      0       1        3
IS                  0      0       5        5
RI                  0      0       0        0
CV                  0      0       1        1
MD                  0      0       7        7
C2                  0      0       6        6

Table 2: Number of Interesting Association Rules in the Top-10 Lists

Support threshold   0.01   0.005   0.002   0.001
SC                  0      0       0       0
CS                  1      3       10      9
IS                  1      2       5       2
RI                  1      2       2       0
CV                  1      2       2       2
MD                  1      2       8       4
C2                  1      2       6       5

Table 3: Number of Interesting Sequential Patterns in the Top-10 Lists

The following observations can be made from the two tables.

1. Interesting rules or patterns can only be found in the top 10 lists if the support threshold is reduced to a certain level. This means that if we set the support threshold too high, interesting rules can be missed.

2. According to the results for sequential patterns, a support threshold that is too low (such as 0.001 in our case) may bring more uninteresting patterns into the top-ranking lists even if new interesting patterns are found at the same time. This means that the gain of more interesting patterns may be overshadowed by the loss of interesting patterns in the top 10 lists if the support threshold is too low.

3. Support is not a good interestingness measure for either association rules or sequential patterns. The top 10 lists generated using the SC measure contain no interesting rules or patterns. This indicates that the most frequently accessed patterns are not interesting in our application. Interestingness measures other than support should be used.

4. MD, C2 and IS are good measures for evaluating both association rules and sequential patterns.

5. CS is a very good measure for sequential patterns. For the support threshold of 0.002 in Table 3, all the patterns in the top 10 list are interesting. This indicates that if we rank sequential patterns according to confidence as the main key and support as the secondary key, the best results may be obtained.

We also observed that almost half of the discovered rules and patterns have a confidence value of 1. For example, among the 609,453 sequential patterns discovered for support threshold 0.001, 295,801 have a confidence value of 1. If we rank the rules or patterns by confidence only, the result is not stable. This is the reason why we use CS instead of confidence as a measure. To our surprise, CS turns out to be the best measure for ranking sequential patterns. However, by looking at the details of the interesting patterns, we found that almost all the interesting patterns ranked high by CS are on the same subject: the pictures of a hockey game organized by the company. On the other hand, the interesting patterns ranked high by MD and C2 show more variety. If the hockey game pictures are not considered to be interesting, CS may not be the best measure.

5. CONCLUSIONS

We have reported our experience in mining interesting Web usage patterns from the Livelink log data. We compared 7 interestingness measures for ranking both association rules and sequential patterns. We found that the MD and C2 measures work well for both association rules and sequential patterns. IS is also a good measure even though it is not as good as MD and C2. CS seems to be the best measure for sequential patterns if we do not consider the variety of interesting patterns. But CS does not work well for association rules. SC is among the worst measures, which implies that patterns with high support are usually not interesting. To our surprise, RI does not work well in finding interesting rules or patterns from our data set even though it has been said to be a good measure for ranking rules. Evaluating interestingness measures on a real data set by domain experts and analyzing the evaluation results are a unique part of this work.

We also found that the number of discovered association rules or sequential patterns can increase super-exponentially when we reduce the support threshold. This can be observed from Table 1. For example, when the support threshold is reduced from 0.0028 to 0.002 (a very small reduction), the number of association rules increases from 4,556 to nearly 5 million. This observation is consistent with the findings reported in [11] for real data sets.

A comment from our domain expert about the discovered interesting rules is that some of the interesting rules are closely related and can be further generalized into a single association rule. For example, the three rules A → BC, B → AC and C → AB can be generalized to one association rule: if any of the objects in {A, B, C} is requested, then the two other objects will also be requested. We will look at this problem in the future. We are also working on making use of background knowledge to prune out some uninteresting rules. Part of the background knowledge is the hierarchy of the objects maintained by Livelink. We are using this hierarchy information to prune out some types of uninteresting patterns, such as those that describe a relationship between a folder and a banner image used for the presentation of the folder.
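The generalization of closely related rules mentioned above, where A → BC, B → AC and C → AB collapse into one statement about {A, B, C}, is left as future work in the paper. Purely as an illustration of one possible approach, and not the authors' method, the sketch below groups rules by the set of objects they cover and reports a set when every one of its members appears as a single-object antecedent implying all the others. The function name and the rule representation (pairs of antecedent and consequent collections) are assumptions.

```python
from collections import defaultdict


def generalizable_itemsets(rules):
    """Return itemsets S such that x -> S - {x} is a rule for every x in S.

    `rules` is an iterable of (antecedent, consequent) pairs, each a
    collection of objects. Only single-object antecedents are considered.
    """
    heads_by_itemset = defaultdict(set)
    for antecedent, consequent in rules:
        a, c = frozenset(antecedent), frozenset(consequent)
        if len(a) == 1 and not (a & c):
            # Record which single object acts as the antecedent for a | c.
            heads_by_itemset[a | c].add(next(iter(a)))
    # An itemset qualifies when each of its members has served as antecedent.
    return [items for items, heads in heads_by_itemset.items()
            if heads == items]


rules = [
    ({"A"}, {"B", "C"}),
    ({"B"}, {"A", "C"}),
    ({"C"}, {"A", "B"}),
    ({"D"}, {"E"}),        # E -> D is missing, so {D, E} is not reported
]
print(generalizable_itemsets(rules))
```

With the four example rules, only the set {A, B, C} is reported. A practical version would presumably also compare the interestingness values of the grouped rules before merging them, which this sketch does not attempt.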

6. ACKNOWLEDGMENT

This research is supported in part by the Open Text Corporation and the Natural Sciences and Engineering Research Council of Canada (NSERC).

7. ADDITIONAL AUTHORS

Additional author: Gary Promhouse (Open Text Corporation).

8. REFERENCES

[1] Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
[2] Agrawal, R. and Srikant, R. "Mining Sequential Patterns", Proceedings of International Conference on Data Engineering, Taipei, Taiwan, March 1995, pp. 3-14.
[3] An, A. and Cercone, N. 2001. "Rule Quality Measures for Rule Induction Systems: Description and Evaluation", Computational Intelligence, Vol. 17 No. 3.
[4] Brin, S., Motwani, R., Ullman, J. and Tsur, S. 1997. "Dynamic Itemset Counting and Implication Rules for Market Basket Data", Proceedings of 1997 ACM-SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 255-264.
[5] Bruha, I. 1996. "Quality of Decision Rules: Definitions and Classification Schemes for Multiple Rules", in Nakhaeizadeh, G. and Taylor, C. C. (eds.): Machine Learning and Statistics, The Interface. John Wiley & Sons Inc.
[6] Hilderman, R.J. and Hamilton, H.J. "Evaluation of Interestingness Measures for Ranking Discovered Knowledge", in Cheung, D., Williams, G.J., and Li, Q. (eds.), Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’01), Lecture Notes in Computer Science, Springer-Verlag, Hong Kong, April 2001, pp. 247-259.
[7] Piatetsky-Shapiro, G. "Discovery, Analysis and Presentation of Strong Rules", Knowledge Discovery in Databases, AAAI, 1991, pp. 229.
[8] Shah, D., Lakshmanan, L.V.S., Ramamritham, K. and Sudarshan, S. "Interestingness and Pruning of Mined Patterns", ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 1999.
[9] Silberschatz, A. and Tuzhilin, A. "What makes patterns interesting in knowledge discovery systems", IEEE Transactions on Knowledge and Data Eng., 8(6):970-974, 1996.
[10] Tan, P. and Kumar, V. "Interestingness Measures for Association Patterns: A Perspective", Technical Report TR00-036, Department of Computer Science, University of Minnesota, 2000.
[11] Zheng, Z., Kohavi, R. and Mason, L. "Real World Performance of Association Rule Algorithms", Proc. of the 7th ACM-SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, New York, NY, 2001.
