Computers in Industry 57 (2006) 622–630 www.elsevier.com/locate/compind
Mining web browsing patterns for E-commerce Qinbao Song a,*, Martin Shepperd b a
Xi’an Jiaotong University, China b Brunel University, UK
Received 27 January 2005; accepted 22 November 2005 Available online 30 June 2006
Abstract
Web user clustering, Web page clustering, and frequent access path recognition are important issues in E-commerce. They can be used for the purposes of marketing strategies and product offerings, mass customization and personalization, and Web site adaptation. In this paper, we view the topology of a Web site as a directed graph, and use a user’s access information on all URLs of a Web site as features to characterize the user and use all users’ access information on a URL as features to characterize the URL. The user clusters and Web page clusters are discovered by both vector analysis and fuzzy set theory based methods. The frequent access paths are recognized based on Web page clusters and take into account the underlying structure of a Web site. Our method does not require the identification of user sessions from Web server logs, and both a user and a page can be assigned to more than one cluster. Our frequent access path identification algorithm is not based on sequential pattern mining, so it avoids the performance difficulties of the latter. We applied our algorithms to five real world data sets of different sizes. Our results show the effectiveness of the proposed algorithms with the fuzzy set theory based methods being slightly more accurate.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Web usage mining; Web user clustering; Web page clustering; Frequent access path recognition; E-commerce
1. Introduction
The Web provides a direct communication medium between business organizations’ services and their customers at very low cost. It is revolutionizing the traditional way of doing commerce, and is becoming more and more popular in the business community. At the same time, it is important for business organizations to assess whether their Web-based services are fulfilling the intended purpose, personalize their services, evaluate the effectiveness of promotional campaigns and build competitive advantage based on the understanding of users’ access behavior. Web mining is the intelligent analysis of Web data. With Web mining techniques, business organizations are able to gain a better understanding of both the Web and Web users’ preferences to help them run their businesses more efficiently. One kind of outcome of Web mining is Web browsing patterns. By the use of Web browsing patterns, business organizations can perform mass
* Corresponding author. Tel.: +86 29 82668645; fax: +86 29 82668971. E-mail addresses:
[email protected] (Q. Song),
[email protected] (M. Shepperd).
0166-3615/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.compind.2005.11.006
customization and personalization, adapt their Web sites, and further improve their marketing strategies, product offerings, and promotional campaigns. Web browsing pattern mining is therefore of particular value to business organizations, and it has attracted much attention [29,8,32,18] from the data mining, machine learning, and other research communities for many years. Of the existing methods, some are non-sequential, such as association rule mining and clustering, and some are sequential, such as sequential or navigational pattern mining. Both approaches ignore the site topology and need to identify user sessions. We explore the problem of user clustering and Web page clustering based on vector analysis and fuzzy set theory. Our solution does not require the identification of user sessions from Web server logs, and a user or a Web page can be assigned to more than one cluster. We take into account the underlying structure of a Web site when investigating the problem of frequent access path identification. Furthermore, the approach is not based on sequential pattern mining, so it avoids the associated performance and scalability difficulties. The rest of the paper is organized as follows. In Section 2, we summarize the work related to Web user and Web page clustering, and frequent access path recognition. In Section
Q. Song, M. Shepperd / Computers in Industry 57 (2006) 622–630
3, we provide the formal definitions of the problem and introduce general information on Web server log files. In Section 4, we present two types of browsing pattern mining algorithms, based on vector analysis and on fuzzy set theory, respectively. In Section 5, we give the experimental results on real Web usage data. Finally, in Section 6, we summarize our work.
2. Related work
The term Web mining was first proposed by Etzioni [9] in 1996. He claimed that Web mining is the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services.
2.1. Taxonomy
Web mining includes Web content mining [6], Web structure mining [19,14], and Web usage mining [3,20,18]. Web content mining concentrates on the structure within the documents of a Web site; it is the process of extracting knowledge from the content of Web documents [6]. Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level; it is the process of inferring knowledge from the structure of data [19,14]. Web usage mining, the concept of applying data mining techniques to Web server logs and discovering user navigation patterns, was first proposed by Chen et al. [4,5], Mannila and Toivonen [22], and Yan et al. [35].
2.2. Web usage mining and frequent access discovery
Mannila and Toivonen [22] use page accesses from Web server logs as events for discovering frequent episodes [21]. Zaiane et al. [37] load Web server logs into a data cube in order to perform data mining as well as OLAP. While they acknowledge the difficulties involved in identifying users and user sessions, no specific methods are presented for solving the problem. Huang et al. [15] also propose using a cube model that maintains the order of the session’s components and using multiple attributes to describe visited Web pages. Pei et al.
[28] propose an efficient complete Web access patterns mining algorithm based on highly compressed WAP-tree structure. Chen et al. [4,5] introduce the concept of using the maximal forward references in order to break down user sessions into transactions for the mining of traversal patterns. Their work is based on statistically dominant paths and association rules discovery. Heer and Chi [13] use multimodal clustering and information scent algorithms to extract significant user paths from the Web server logs. Nanopoulos and Manolopoulos [25] argue that access should not be consecutive. They present a general definition of traversal pattern and the corresponding level-wise algorithm for the determination of traversal patterns. All these methods are based on sequential patterns mining [1,2], and because the number of candidate sequences generated may grow quickly when the length of sequence increases, all these methods encounter scalability problems.
2.3. User clustering
Nasraoui et al. [26] and Krishnapuram et al. [17] use fuzzy algorithms to discover clusters of user session profiles, and the former take the hierarchical structure of URLs into consideration. In contrast to Yan et al. [35], Cooley [7] introduces an algorithm that classifies users using a hypergraph partitioning technique. Cooley’s method is used to identify particularly interesting and similar path histories, but it cannot be used to gain an overall picture of all usage of a Web site. Nasraoui and Krishnapuram [27] use unsupervised robust multi-resolution clustering techniques to discover Web user groups. Xie and Phoha [34] use belief functions to cluster Web site users. They separate users into different groups and find a common access pattern for each group of users. Unfortunately, the approach still needs to identify sessions.
2.4. Web page clustering
Flake et al. [12] use only link information to discover Web communities (groups of URLs). The Web communities they discover, however, merely reflect the viewpoint of Web site developers. Mobasher et al. [24] propose a technique of usage-based clustering of URLs for the purpose of creating adaptive Web sites. They directly compute overlapping clusters of URL references based on their co-occurrence patterns across user transactions. However, their URL clustering is still based on frequent items and needs user session identification. Selamat and Omatu [30] propose a neural network based method to classify Web pages and use principal component analysis to select the most relevant features for the classification. It is a content-based method, and it needs a class profile that contains the most regular words in each class, so properly defining the class profile and weighting each word are problematic. For a more detailed review see Facca and Lanzi [10].
3. Problem statement
In this section, we first give the formal definitions of the problem and then introduce some general information on Web server log files.
Definition 3.1. For a Web site, if one Web page has a link pointing to another Web page, we say that these two Web pages are adjacent. Adjacency is one kind of mutual relationship.
Definition 3.2. For a Web site, a path p is a sequence of Web pages $\{url_1, url_2, \ldots, url_i, url_j, \ldots, url_{n-1}, url_n\}$ for $n \ge 2$, where any ordered pair $(url_i, url_j) \in p$ ($1 \le i, j \le n$) represents adjacent Web pages. This path is said to be from $url_1$ to $url_n$; $url_1$ and $url_n$ are referred to as the head-node and the tail-node, respectively. The length of the path p is $n \ge 2$.
Definition 3.3. For a Web site, we use hits(user, url) to denote the frequency with which a user browses Web page url of the Web site during a period of time.
Definition 3.4. A Web site W is a directed graph G where the set of nodes V represents the set of Web pages URL, the set of
directed edges E represents the set of links LNK between any two adjacent Web pages, and the root node represents the home page. The Web site W is denoted as W = (URL, LNK). Each Web page $url \in URL$ has a state $State(url) = \{url, \{UserID, hits(UserID, url)\}_m, \{PathID\}_n\}$, where $m, n \ge 1$, UserID is the index of a user who accesses the Web page url, and PathID is the index of a path that passes through node url.
Example. Suppose a.html is a Web page which is involved in the paths Path12 and Path34 of a given Web site W, and this page has been visited 20 times by user User5 and 30 times by user User6 during a given period of time. The state of the Web page a.html is then State(a.html) = {a.html, {User5, 20}, {User6, 30}, {Path12}, {Path34}}.
In this paper, we limit Web browsing patterns to three aspects: user clusters, Web page clusters, and frequent access paths. User clusters are the groups of users who have similar navigation patterns. This is very useful for market segmentation or mass personalization in E-commerce. Web page clusters are the groups of Web pages having related content, which could be useful for mass personalization and Web site adaptation. Frequent access paths are the most popular paths of a Web site, knowledge of which can be very useful in Web site adaptation, real-time advertising and promotion. The related definitions are presented as follows:
Definition 3.5. A user cluster CoU is a group of users that seem to behave similarly when navigating through a Web site; specifically, they access conceptually related Web pages of a Web site during a given period of time.
Definition 3.6. A Web page cluster CoP is a group of Web pages that seem to be conceptually related according to users’ perception; specifically, they are accessed by similar users during a given period of time.
Definition 3.7. A frequent access path PoF is a path that is frequently accessed by users during a given period of time.
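As an illustration of Definition 3.4, the state of a Web page can be sketched as a small record that holds the per-user hit counts and the identifiers of the paths passing through the page. The class name PageState, its field names, and its helper methods are assumptions made for this sketch; they are not part of the original formulation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PageState:
    """Hypothetical container for State(url) from Definition 3.4."""

    url: str
    # Maps a user identifier to hits(UserID, url) for this page.
    user_hits: Dict[str, int] = field(default_factory=dict)
    # Identifiers of the paths that pass through this page.
    path_ids: Set[str] = field(default_factory=set)

    def record_hit(self, user_id: str, count: int = 1) -> None:
        """Add `count` visits by `user_id` to this page."""
        self.user_hits[user_id] = self.user_hits.get(user_id, 0) + count

    def total_hits(self) -> int:
        """Return the visits to this page summed over all users."""
        return sum(self.user_hits.values())


# The a.html example from the text.
state = PageState(url="a.html", path_ids={"Path12", "Path34"})
state.record_hit("User5", 20)
state.record_hit("User6", 30)
```

Keeping the hit counts in a dictionary keyed by user makes both hits(UserID, url) and the all-user total cheap to read, which is what the later frequency measure needs.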
We base frequent access path mining on the Web user and Web page clusters, and the discovery of Web user and Web page clusters on user navigational behavior, without assuming predefined groups. For a Web site, user access information is generally gathered automatically by the Web server and recorded in the server logs. There are four different log files: the access log, agent log, error log, and referer log. These log files are text files, and their sizes depend on the traffic at a particular site. Recorded in these files are the volume of activity at each page on a Web site, the type of browser used to access each page, any errors that users may have experienced downloading pages from the Web site, and where users were referred from when they accessed pages at the Web site. By default a Web server log file is in the NCSA Common Log File Format [33]. The fields of a Common Log File Format log file are host, rfc931, username, etc.; see ref. [33] for further details. From Web server logs we can easily obtain the information concerning which users access which Web pages of a Web site during a specified period of time.
4. Proposed algorithms
In this section, we present two Web browsing pattern mining methods: vector analysis based and fuzzy set theory based.
4.1. Vector analysis based method
We suppose that:
Users with similar interests should have similar browsing patterns.
Associated Web pages should be browsed by users with similar interests.
The general browsing patterns do not change during a given period of time for a given user, although different users’ browsing patterns may be different during the specified period of time.
Based on the above assumptions, we can draw user clusters and Web page clusters from Web server logs by analyzing users’ browsing information during the period of time for a given Web site. Hits is one kind of user browsing information. We can directly extract the hits of all users who access the Web pages of a Web site during a given period of time. We organize this information in the form of a matrix. That is, we view users as rows, Web pages as columns, and the counts of hits as the values of the elements of this matrix. The matrix is defined as follows:
Definition 4.1. A URL-User associated matrix R is used to describe the relationship between Web pages and the users who access these Web pages. Let n be the number of Web pages and let m be the number of users; the matrix can be denoted as
$$
R_{m \times n} =
\begin{pmatrix}
hits(1,1) & hits(1,2) & \cdots & hits(1,j) & \cdots & hits(1,n) \\
hits(2,1) & hits(2,2) & \cdots & hits(2,j) & \cdots & hits(2,n) \\
\vdots & \vdots & & \vdots & & \vdots \\
hits(i,1) & hits(i,2) & \cdots & hits(i,j) & \cdots & hits(i,n) \\
\vdots & \vdots & & \vdots & & \vdots \\
hits(m,1) & hits(m,2) & \cdots & hits(m,j) & \cdots & hits(m,n)
\end{pmatrix}
\tag{1}
$$
where hits(i, j) is the number of times user i accesses Web page j during a defined period of time (see Definition 3.3). The ith row vector R[i, ] records the counts of the ith user’s accesses to all the Web pages during the specified period of time, and the jth column vector R[, j] records the counts of all users’ accesses to the jth Web page during the same period of time. We use the cosine similarity, which is also used by others [31,23,16], as the similarity measure. It is defined as follows:
Definition 4.2. The similarity between vector $V_1 = \{y_{1,1}, y_{1,2}, \ldots, y_{1,n}\}$ and vector $V_2 = \{y_{2,1}, y_{2,2}, \ldots, y_{2,n}\}$ is defined as a cosine similarity measure:
$$
Sim(V_1, V_2) = \sum_{i=1}^{n} y_{1,i}\, y_{2,i}. \tag{2}
$$
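The construction of the matrix R (Definition 4.1) and the similarity of Definition 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the requesting host of a Common Log Format line stands in for the user identity, the function names are hypothetical, and the similarity divides the inner product of Eq. (2) by the two vector norms, which is the usual cosine similarity and reduces to the plain inner product when both vectors have unit length.

```python
import math
import re
from collections import Counter

# Common Log Format: host rfc931 username [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<username>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)$'
)


def parse_clf_line(line):
    """Return (user, url) for one log line, or None if it does not parse.

    Assumption: the requesting host stands in for the user identity.
    """
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    parts = match.group("request").split()
    if len(parts) < 2:
        return None
    return match.group("host"), parts[1]


def build_hits_matrix(records):
    """Build R (users as rows, pages as columns) from (user, url) pairs."""
    counts = Counter(records)
    users = sorted({user for user, _ in counts})
    urls = sorted({url for _, url in counts})
    matrix = [[counts[(user, url)] for url in urls] for user in users]
    return users, urls, matrix


def cosine_similarity(v1, v2):
    """Inner product of v1 and v2 divided by the product of their norms."""
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)


def column(matrix, j):
    """Return the jth column vector R[, j]."""
    return [row[j] for row in matrix]


# Made-up log lines, used only to exercise the functions above.
sample_lines = [
    '10.0.0.1 - - [10/Oct/2005:13:55:36 +0800] "GET /a.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2005:13:56:01 +0800] "GET /a.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2005:13:57:12 +0800] "GET /b.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2005:14:02:45 +0800] "GET /a.html HTTP/1.0" 200 2326',
]
records = [r for r in map(parse_clf_line, sample_lines) if r is not None]
users, urls, R = build_hits_matrix(records)
user_sim = cosine_similarity(R[0], R[1])  # rows: user clustering
page_sim = cosine_similarity(column(R, 0), column(R, 1))  # columns: page clustering
```

Row vectors are compared for user clustering and column vectors for Web page clustering, so the same similarity function serves both cases.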
With the matrix R, we can easily discover the user clusters and Web page clusters by measuring the similarities among row and column vectors, respectively. Specifically, we first compute the similarities among different vectors (row vectors for user clustering and column vectors for Web page clustering) and obtain the similarity matrix $Sim_{k \times k}$ ($k \in \{m, n\}$) (see Eq. (3)). Then we evaluate the similarity values row by row: if a similarity value is greater than the given threshold, the corresponding row number and column numbers, which represent the corresponding users or pages, fall into one class (see the Appendix for details). For example, suppose the bold elements in the similarity matrix $Sim_{k \times k}$ denote the similarities whose values are greater than the given threshold; we then obtain three classes: {1, i, i + 1, k}, {2, i + 1}, and {i, i + 1, k}.
$$
Sim_{k \times k} =
\begin{pmatrix}
1 & Sim(1,2) & \cdots & \mathbf{Sim(1,i)} & \mathbf{Sim(1,i+1)} & \cdots & \mathbf{Sim(1,k)} \\
 & 1 & \cdots & Sim(2,i) & \mathbf{Sim(2,i+1)} & \cdots & Sim(2,k) \\
 & & \ddots & \vdots & \vdots & & \vdots \\
 & & & 1 & \mathbf{Sim(i,i+1)} & \cdots & \mathbf{Sim(i,k)} \\
 & & & & \ddots & & \vdots \\
 & & & & & 1 & Sim(k-1,k) \\
 & & & & & & 1
\end{pmatrix}
\tag{3}
$$
The threshold can be calculated according to the following empirical formula:
$$
T = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i}^{k} Sim(i,j). \tag{4}
$$
Of course, other values could be chosen if desired.
On the other hand, we think that the reasons that Web pages are associated are: (1) these pages are accessed by similar users; and (2) there is a link between two Web pages because the Web site developers believe that the pages are related in some way, in other words, because of some constraint imposed by the topology of the Web site. Therefore, for clusters of Web pages, at least one portion of them falls into one cluster because of the site topology. So, if we know the state of each Web page, by analyzing the results of Web page clustering we can obtain the frequent access paths. At the same time, for a specified Web site, we can easily obtain its topology. By fusing this information with users’ browsing information, we can obtain the state information of each Web page of the site. We discover frequent access paths by further processing the Web page clusters and the corresponding Web page state information. That is, first, for each Web page cluster, we generate candidate frequent paths by gathering the Web pages according to the corresponding state information State(url). Then, we check whether or not the same paths exist among the candidate frequent paths and, if so, merge them. After that, we measure the frequency of the modified candidate frequent paths to obtain the frequent access paths (see the Appendix for details). The measure used to decide the frequency of a path is defined as follows:
Definition 4.3. The frequency $f_p$ of a path $p = \{url_1, url_2, \ldots, url_{n'}\}$ ($n' \ge 2$) of a Web site is defined as the ratio of the summation of the hits of the users on each of its 2-length sub-paths’ head-nodes, where these users also accessed the corresponding tail-node, to the summation of the hits of all users on all nodes of the Web site, that is:
$$
f_p = \frac{\sum_{i=1}^{n'} hits(ALL, url_i)}{\sum_{j=1}^{m'} hits(ALL, url_j)}, \tag{5}
$$
where $m'$ is the number of pages of a Web site.
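The threshold of Eq. (4), the row-by-row class formation, and the path frequency of Eq. (5) can be sketched as below. This is one reading of the textual description, not the authors' Appendix pseudocode: the function names and the include_diagonal switch are assumptions of the sketch. The double sum as printed starts at j = i and therefore includes the diagonal of ones; the switch lets the diagonal be left out, which may be preferable for small k.

```python
def empirical_threshold(sim, include_diagonal=True):
    """Eq. (4): 2 / (k (k - 1)) times the sum of Sim(i, j) over the upper triangle.

    With include_diagonal=True the inner sum starts at j = i, as printed.
    """
    k = len(sim)
    start = 0 if include_diagonal else 1
    total = sum(sim[i][j] for i in range(k) for j in range(i + start, k))
    return 2.0 * total / (k * (k - 1))


def threshold_clusters(sim, threshold):
    """Scan the upper triangle row by row and collect one class per row.

    Row i yields the class made of i plus every column j > i whose
    similarity exceeds the threshold; rows with no such column yield no
    class. Classes are built per row, so an item may appear in several.
    """
    k = len(sim)
    clusters = []
    for i in range(k):
        members = [j for j in range(i + 1, k) if sim[i][j] > threshold]
        if members:
            clusters.append([i] + members)
    return clusters


def path_frequency(path, total_hits):
    """Eq. (5): hits on the nodes of the path over hits on all nodes.

    `total_hits` maps each url to hits(ALL, url).
    """
    site_total = sum(total_hits.values())
    if site_total == 0:
        return 0.0
    return sum(total_hits.get(url, 0) for url in path) / site_total


# Indices 0..4 play the roles of items 1, 2, i, i+1 and k in the example.
example_sim = [
    [1.0, 0.1, 0.9, 0.8, 0.7],
    [0.0, 1.0, 0.2, 0.9, 0.1],
    [0.0, 0.0, 1.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
example_clusters = threshold_clusters(example_sim, threshold=0.5)
example_frequency = path_frequency(["a", "b"], {"a": 50, "b": 30, "c": 20})
```

With the made-up similarity values above, the three classes of the textual example are recovered, and an item such as index 3 belongs to all of them, which illustrates that the clusters may overlap.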
4.2. Fuzzy set theory based method
In real-world problems, Web user and Web page clusters not only lack crisp boundaries, but can also overlap considerably. At the same time, incomplete data can easily occur in the data set, due to a variety of reasons inherent in Web browsing and logging. Thus, both user clustering and Web page clustering require modeling of an unknown number of overlapping sets in the presence of significant noise and outliers. Fuzzy set theory provides the ability to deal with incomplete information and overlapping clusters. Since the pioneering work by Zadeh [36] in 1965, fuzzy set theory has received wide attention both in its theory development and in its applications. In fuzzy set theory, a
fuzzy subset F of a set S is defined by a membership function which gives the degree of membership of each element of S belonging to F. Mathematically, F is written as:
$$
F = \{(s_i, \mu_F(s_i)) \mid s_i \in S \wedge 1 \le i \le n\}, \tag{6}
$$
where $s_i \in s \subseteq S$; n is the number of members of set s; $\mu_F(s)$, $\mu_F : S \to [0,1]$, is the membership function of fuzzy subset F; and [0,1] denotes the infinitely many real values from 0 to 1 inclusive.
We classify users and Web pages based on fuzzy set theory. First, we create user fuzzy subsets and Web page fuzzy subsets according to fuzzy set theory and Web server logs. Let U be the set of users who access the Web site W during a period of time; U is denoted as $U = \{u_1, u_2, \ldots, u_i, \ldots, u_m\}$, where $1 \le i \le m$ and m is the number of users. Let URL be the set of Web pages of the Web site W; URL is denoted as $URL = \{url_1, url_2, \ldots, url_i, \ldots, url_n\}$, where $1 \le i \le n$ and n is the number of Web pages of the Web site W. The user fuzzy subset and Web page fuzzy subset are defined as follows:
Definition 4.4. For each user $u_j \in U$ ($1 \le j \le m$), we use the user’s access information on each Web page $url \in URL$ to describe the browsing behavior. The user fuzzy subset $\mu_{u_j}$ of the jth user $u_j$, which reflects the user’s browsing behavior, is defined as:
$$
\mu_{u_j} = \{(url_i, f_{\mu_{u_j}}(url_i)) \mid url_i \in URL \wedge 1 \le i \le n\}, \tag{7}
$$
where $f_{\mu_{u_j}}(url_i)$, $f_{\mu_{u_j}} : URL \to [0,1]$, is the membership function defined as $f_{\mu_{u_j}}(url_i) = hits(u_j, url_i) / \sum_{k=1}^{n} hits(u_j, url_k)$, and $hits(u_j, url_x)$ is the number of times user $u_j$ accesses Web page $url_x$, as defined in Definition 3.3.
Definition 4.5. For each Web page $url_i \in URL$ ($1 \le i \le n$), we use all users’ access information on this Web page to describe it. The Web page fuzzy subset $\mu_{url_i}$, which reflects all users’ browsing behavior on the ith Web page $url_i$, is denoted as:
$$
\mu_{url_i} = \{(u_j, f_{\mu_{url_i}}(u_j)) \mid u_j \in U \wedge 1 \le j \le m\}, \tag{8}
$$
where $f_{\mu_{url_i}}(u_j)$, $f_{\mu_{url_i}} : U \to [0,1]$, is the membership function defined as $f_{\mu_{url_i}}(u_j) = hits(u_j, url_i) / \sum_{k=1}^{m} hits(u_k, url_i)$, and $hits(u_x, url_i)$ is the number of times user $u_x$ accesses Web page $url_i$, as defined in Definition 3.3.
Then, we classify users and Web pages by measuring the similarities among fuzzy subsets, respectively. The similarity measure is defined as follows:
Definition 4.6. Suppose X is a set of fuzzy subsets, denoted as $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ ($1 \le i \le n$). Each fuzzy subset $x_i \in X$ can be characterized by a set of fuzzy subset elements $(x_{i,1}, x_{i,2}, \ldots, x_{i,m})$. The fuzzy similarity between two fuzzy subsets $x_i$ and $x_j$ is defined as:
$$
Sim_f(x_i, x_j) = \frac{\sum_{k=1}^{m} x_{i,k} \wedge x_{j,k}}{\sum_{k=1}^{m} x_{i,k} \vee x_{j,k}}. \tag{9}
$$
The implementation of the fuzzy clustering algorithms for users and Web pages is the same as that of the crisp methods (see the Appendix for details).
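The membership functions of Definitions 4.4 and 4.5 and the fuzzy similarity of Definition 4.6 can be sketched as follows. The sketch assumes the conventional fuzzy-set reading in which the meet is the minimum and the join is the maximum of two membership degrees; the function names and the small hits matrix are hypothetical.

```python
def user_membership(row):
    """Eq. (7): a user's hits on each page divided by the user's total hits."""
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)
    return [h / total for h in row]


def page_membership(matrix, j):
    """Eq. (8): each user's hits on page j divided by the page's total hits."""
    col = [row[j] for row in matrix]
    total = sum(col)
    if total == 0:
        return [0.0] * len(col)
    return [h / total for h in col]


def fuzzy_similarity(x, y):
    """Eq. (9), taking the meet as min and the join as max (an assumption)."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


# Made-up hits matrix: two users (rows) and three pages (columns).
hits = [
    [1, 1, 2],
    [2, 2, 0],
]
mu_u1 = user_membership(hits[0])
mu_u2 = user_membership(hits[1])
sim_users = fuzzy_similarity(mu_u1, mu_u2)
mu_page0 = page_membership(hits, 0)
```

Because each membership vector sums to 1, two users with very different activity levels but the same relative interests receive a similarity of 1, which is the normalizing effect of Eqs. (7) and (8).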
5. Experimental results
5.1. Experimental setup
For the purpose of evaluating the performance and the effectiveness of the proposed algorithms and verifying whether or not the algorithms are potentially useful in practice, experiments were conducted with five real-world data sets. The five data sets, with sizes of 200 KB, 400 KB, 600 KB, 800 KB and 1000 KB, respectively, were randomly extracted from an 1800 KB database that was randomly drawn from the log data of the Web site of Xi’an Jiaotong University. For each of the five real-world data sets, we used 3-fold cross-validation as the validation approach. In 3-fold cross-validation, the data set D is randomly split into three mutually exclusive subsets $D_1$, $D_2$, and $D_3$ of equal size. The inducer is trained and tested three times. Each time $t \in \{1, 2, 3\}$, it is trained on $D - D_t$ (that is, set D minus set $D_t$) and tested on $D_t$. We designated 2/3 of the data as the training set and the remaining 1/3 as the test set. We compared the precision of the proposed algorithms with the partitioning method K-Means, the model based incremental method COBWEB [11], and the classical frequent access path recognition algorithm FS [5], respectively. For each algorithm, we took the average precision over the corresponding three test sets as the final result.
5.2. Experimental evaluation
First we discovered user clusters, Web page clusters, and frequent access paths from the training data sets. Then we evaluated and compared the proposed methods with other methods using the following strategies:
For Web user clustering, by scanning the test data sets, we obtain a user i and the corresponding Web pages $URLs_i^r$. Then we decide which user cluster the user i belongs to according to the discovered user clusters, and obtain the predicted $URLs_i^p$. By comparing $URLs_i^p$ and $URLs_i^r$, we obtain the precision metric of Web user clustering as follows:
$$
\mathrm{Precision}(\text{user clustering}) = \frac{1}{m} \sum_{i=1}^{m} \frac{\| URLs_i^p \cap URLs_i^r \|}{\| URLs_i^r \|}. \tag{10}
$$
For Web page clustering, by scanning the test data sets, we obtain a page j and the corresponding users $USRs_j^r$. Then we decide which Web page cluster the page j belongs to according to the discovered Web page clusters, and obtain the predicted $USRs_j^p$. By comparing $USRs_j^p$ and $USRs_j^r$, we obtain the precision metric of Web page clustering as follows:
$$
\mathrm{Precision}(\text{page clustering}) = \frac{1}{n} \sum_{j=1}^{n} \frac{\| USRs_j^p \cap USRs_j^r \|}{\| USRs_j^r \|}. \tag{11}
$$
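The precision metrics of Eqs. (10) and (11) share one shape: for every test item, the overlap between the predicted set and the observed set is divided by the size of the observed set, and the ratios are averaged. A minimal sketch follows, assuming the overlap is a set intersection; the function name is hypothetical, and skipping items with an empty observed set is a choice of the sketch rather than something stated in the text.

```python
def clustering_precision(predicted, actual):
    """Eqs. (10)/(11): mean of |predicted & actual| / |actual| over items.

    Both arguments map an item (a user or a page) to a set (of pages or
    of users). Items whose actual set is empty are skipped to avoid a
    division by zero; that handling is an assumption of this sketch.
    """
    ratios = []
    for item, actual_set in actual.items():
        if not actual_set:
            continue
        predicted_set = predicted.get(item, set())
        ratios.append(len(predicted_set & actual_set) / len(actual_set))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Made-up test data: pages actually visited versus pages predicted
# from the cluster each user was assigned to.
actual_pages = {"u1": {"a", "b"}, "u2": {"a", "c", "d", "e"}}
predicted_pages = {"u1": {"a", "b", "c"}, "u2": {"a"}}
user_precision = clustering_precision(predicted_pages, actual_pages)
```

The same function serves Eq. (11) when the keys are pages and the values are sets of users.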
For frequent access path identification, suppose $p_i^{m_1}$ and $p_i^{m_2}$ are the ith frequent paths discovered by the methods $m_1$ and $m_2$, respectively. We use the following precision metric to evaluate the frequent access path identification method $m \in \{m_1, m_2\}$:
$$
\mathrm{Precision}(\text{path identification}) = \frac{1}{k} \sum_{i=1}^{k} \frac{\| p_i^{m} \|}{\| \max\{p_i^{m_1}, p_i^{m_2}\} \|}, \tag{12}
$$
where $k = \max\{k_1, k_2\}$, and $k_1$ and $k_2$ are the numbers of frequent paths discovered by the methods $m_1$ and $m_2$, respectively. If $k_1 > k_2$ (or $k_1 < k_2$), the precision of the $m_1$ (or $m_2$) method for the $|k_1 - k_2|$ paths is always 1, while the precision of the $m_2$ (or $m_1$) method for the $|k_1 - k_2|$ paths is always 0. Note that we did not use recall as a measure. Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database, and the latter is never known for the given database.
5.3. Results
Tables 1–3 give the results of Web user clustering, Web page clustering, and frequent access path recognition with different methods, respectively. In these tables, UC is the user clustering algorithm, FUC the fuzzy set theory based user clustering algorithm, PC the Web page clustering algorithm, FPC the fuzzy set theory based Web page clustering algorithm, and FAP the frequent access path identification algorithm. The results reveal that our proposed algorithms outperformed all other methods, and that the precision of the fuzzy set theory based methods is slightly better than that of the crisp methods.

Table 1
Precision of Web user clustering

Data set   K-Means   COBWEB   UC      FUC
1          78.32     82.73    86.14   88.14
2          86.45     73.12    91.11   92.59
3          87.65     86.78    92.39   93.02
4          83.67     84.71    88.09   90.57
5          85.24     85.32    88.76   90.09
Mean       84.27     82.53    89.30   91.08

Table 2
Precision of Web page clustering

Data set   K-Means   COBWEB   PC      FPC
1          86.85     82.92    89.86   92.52
2          81.99     87.63    90.47   91.34
3          84.42     78.26    90.03   92.45
4          88.27     86.20    92.41   93.20
5          87.26     82.96    93.03   92.26
Mean       85.76     83.59    91.16   92.35

Table 3
Precision of frequent accessing path recognition

Data set   FS      FAP
1          83.45   85.13
2          79.02   87.40
3          85.08   84.99
4          82.41   81.88
5          89.69   90.43
Mean       83.93   85.97

Table 4
Mann–Whitney test of the precision differences between the proposed crisp methods and other methods

Pair of methods                   Median          n    p
User clustering / K-Means         88.76 / 85.24   15   0.0165
User clustering / COBWEB          88.76 / 84.71   15   0.0083
Web page clustering / K-Means     90.47 / 86.85   15   0.0041
Web page clustering / COBWEB      90.47 / 82.96   15   0.0041

Table 5
Mann–Whitney test of the precision differences between proposed fuzzy methods and other methods

Pair of methods                         Median          n    p
Fuzzy user clustering / K-Means         91.09 / 85.24   15   0.0041
Fuzzy user clustering / COBWEB          91.09 / 84.71   15   0.0041
Fuzzy web page clustering / K-Means     92.45 / 86.85   15   0.0041
Fuzzy web page clustering / COBWEB      92.45 / 82.96   15   0.0041

Table 6
Mann–Whitney test of the precision differences between crisp methods and fuzzy methods

Pair of methods                                    Median          n    p
Frequent path recognition / FS                     85.13 / 83.45   15   0.2149
User clustering / Fuzzy user clustering            88.76 / 91.09   15   0.1570
Web page clustering / Fuzzy web page clustering    90.47 / 92.45   15   0.1116

Fig. 1. CPU time of different methods as data set size increases.

To determine the significance of the results more formally, we used a Mann–Whitney test to compare sample medians of the precision values. We have the following alternate hypotheses:
Both FUC and UC methods are more accurate than K-Means and COBWEB methods.
Both FPC and PC methods are more accurate than K-Means and COBWEB methods.
FAP is more accurate than FS.
FUC method is more accurate than UC method.
FPC method is more accurate than PC method.
The null hypotheses are that there is no difference (α = 0.05). From Tables 4 and 5 we see that the first and second alternate hypotheses are accepted for both user clustering and Web page clustering. In other words, the improvements from the proposed methods are statistically significant. From Table 6 we observe that the third, fourth, and fifth alternate hypotheses are rejected for frequent access path identification, user clustering, and Web page clustering. In other words, the corresponding improvements are not statistically significant. It seems the fuzzy set theory based methods may offer some small improvement over the corresponding crisp methods (in the sense that the results all point in the same direction). However, if the effect size is small and n is low, the test is unable to reject the null hypothesis. Therefore, further work to confirm this finding would be valuable, since even small improvements may have considerable commercial significance.
Fig. 1 shows the CPU time of the different methods. It shows that: (1) the CPU time of each clustering algorithm increases almost linearly with data set size, but the CPU times of our methods are all smaller than those of the other methods; (2) the CPU time of the proposed frequent access path recognition algorithm increases almost linearly with data set size, but it is smaller than that of the classic approach FS. These results show that the scalability of the proposed algorithms is better. The reason is that the frequent access paths were obtained by the analysis of Web page clusters and the topology of the Web site, thus avoiding the need to determine large reference sequences.
6. Conclusions
In this paper, we have presented vector analysis and fuzzy set theory based algorithms for mining user clusters, Web page clusters, and frequent access paths. 
We have also applied the proposed algorithms to real-world data, and our experimental results show that the proposed algorithms have higher precision and better scalability. At the same time, the experimental results also show that the precision of the fuzzy method is slightly better than that of the crisp method, but the results of the Mann–Whitney test reveal that the difference is not statistically significant. Therefore, it cannot be claimed that the fuzzy method is better than the crisp method, but it is still a potentially useful alternative Web usage mining method. E-commerce Web sites can generate great quantities of data every day, but traditional Web log analysis methods are insufficient for business questions. Web usage mining can discover Web site usage patterns that can be used to better understand user interests and requirements. Furthermore, this information is especially valuable for business organizations to achieve improved customer satisfaction, retain customer loyalty, and build and keep competitive advantage.
Acknowledgment
The authors thank the anonymous reviewers for their insightful and helpful comments, which resulted in substantial improvements to this work.
Appendix A. The Pseudocode
Algorithm 1. User clustering
Algorithm 2. Web page clustering
Algorithm 3. Frequent path extraction
Algorithm 4. Fuzzy user clustering
Algorithm 5. Fuzzy web page clustering
Qinbao Song received a PhD in computer science from the Xi’an Jiaotong University, China in 2001. He is professor of software technology at Xi’an Jiaotong University, China. He has published more than 50 refereed papers in the area of data mining, machine learning, and software engineering. His research interests include intelligent computing, machine learning for software engineering, and trustworthy software.
Martin Shepperd received a PhD in computer science from the Open University in 1991 for his work in measurement theory and its application to software engineering. He is professor of software technology at Brunel University, London, UK and director of the Brunel Software Engineering Research Centre (B-SERC). He has published more than 90 refereed papers and three books in the area of empirical software engineering, machine learning and statistics. He is editor-in-chief of the journal Information & Software Technology and was Associate Editor of IEEE Transactions on Software Engineering (2000–2004). He has also previously worked for a number of years as a software developer for a major bank.