A Survey in Web Page Clustering Techniques
Antonio LaTorre, José M. Peña, Víctor Robles, María S. Pérez
Department of Computer Architecture and Technology, Technical University of Madrid, Madrid, Spain, {atorre, jmpena, vrobles, mperez}@fi.upm.es
Abstract. Web page clustering is one of the major preprocessing steps in web mining analysis. In data mining, preprocessing is a key task to ensure the reliability and quality of the knowledge extracted by the whole mining process. As the amount of data to process is potentially infinite if dynamic web pages are considered, preprocessing this information is necessary to keep the problem computationally tractable. Moreover, considering pages individually may not provide additional information. To deal with this issue, this contribution proposes the use of web clustering techniques as a first step of the web usage mining process. We provide an overview of graph partitioning techniques and their application to the problem of web page clustering, considering the taxonomy and characteristics of different web site structures and features (e.g. frames, dynamic code, CMS-based sites). This study demonstrates that the benefit of using these techniques depends considerably on the input data and on users' navigational habits.
Keywords. Web usage/structure mining, Web page clustering, Graph partitioning.
1
Introduction
Web page clustering is one of the most important preprocessing steps in web mining analysis. In this context (Web Usage/Context Mining) the items to be studied are web pages. Web page clustering groups web pages together based on similarity or other relationship measures. Tightly coupled pages, that is, pages in the same cluster, are treated as a single item in the subsequent data analysis steps. A complete data mining analysis could be performed using web page information as it appears in web logs, but when the number of pages increases (e.g., in a large-scale corporate web server or a server using dynamic web pages) this process becomes very hard or even infeasible. Web page clustering appears as a reasonable solution to this problem.
Recently, a number of approaches have been developed dealing with specific aspects of Web usage mining for the purpose of automatically discovering user profiles. For example, Perkowitz and Etzioni [18] proposed the idea of optimizing the structure of Web sites based on co-occurrence patterns of pages within usage data for the site. Schechter et al. [20] have developed techniques for using path profiles of users to predict future HTTP requests, which can be used for network and proxy caching. Spiliopoulou et al. [22], Cooley et al. [19], and Buchner and Mulvenna [5] have applied data mining techniques to extract usage patterns from Web logs, for the purpose of deriving marketing intelligence. Shahabi et al. [8], Yan et al. [24], and Nasraoui et al. [17] have proposed clustering of user sessions to predict future user behavior.
Several different techniques will be discussed in the next section, but within this study we have considered the graph partitioning approach. These techniques are based on the concept of weighted graphs. B. Hendrickson and R. Leland [12] define the graph partitioning problem as:
Given: a graph G = (V, E) with (possibly unitary) weights on the edges and/or vertices, and a parameter p.
Find: a partitioning of the vertices of G into p sets in such a way that the sums of the vertex weights in each set are as equal as possible, and the sum of the weights of edges crossing between sets is minimized.
Many techniques and algorithms have been proposed, using different similarity measures and strategies that have been shown to influence the quality of the results obtained [23]. Some of them have been implemented by B. Hendrickson and R. Leland. The most important are the inertial method, spectral partitioning, the Kernighan-Lin method and a multilevel variant of the Kernighan-Lin algorithm. A detailed description of them can be found in [13]. Other approaches to graph partitioning, like the Expectation-Maximization algorithm or the one proposed by Peña et al. [16], which uses optimization methods to obtain the best collection of clusters according to a cluster quality metric, are out of the scope of this study.
The main motivation for this contribution has already been mentioned: the need for techniques able to treat, in a reasonable way, the huge amount of data coming from web server logs. As some implementations of these techniques already exist, we wanted to test them with real data to obtain some preliminary conclusions about their effect on the web mining result. The following section gives an overview of the most important page clustering techniques. Section 3 presents the study we have performed, while section 4 presents the main results obtained. Section 5 discusses these results and section 6 presents the main conclusions.
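The two quantities in the definition by Hendrickson and Leland quoted above can be made concrete with a short sketch. The function names, the dictionary-based graph representation and the toy graph are illustrative assumptions, not part of the original study or of any partitioning library.

```python
def cut_weight(edges, assignment):
    """Sum the weights of edges whose endpoints lie in different sets."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])


def set_weights(vertex_weights, assignment, p):
    """Total vertex weight assigned to each of the p sets."""
    totals = [0] * p
    for vertex, weight in vertex_weights.items():
        totals[assignment[vertex]] += weight
    return totals


# Hypothetical toy graph: four pages with unit vertex weights.
edges = {("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 1}
vertex_weights = {"a": 1, "b": 1, "c": 1, "d": 1}

# Two candidate bisections (p = 2), both perfectly balanced.
good = {"a": 0, "b": 0, "c": 1, "d": 1}
bad = {"a": 0, "b": 1, "c": 0, "d": 1}

print(set_weights(vertex_weights, good, 2), cut_weight(edges, good))
print(set_weights(vertex_weights, bad, 2), cut_weight(edges, bad))
```

Both assignments are equally balanced, so the partitioning problem prefers the first one only because it cuts less edge weight.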
2
Most important Web Page Clustering techniques
Web page clustering deals with a set of web pages hosted on a web server to obtain a collection of web page sets (clusters). These clusters are used in the following steps of the mining process instead of the original pages. There are three web clustering criteria: semantic, structure-based, and usage-based.
2.1
Semantic Clustering
Cooley [9] suggests using semantic hints to benefit the mining of web data, arguing that several analyses cannot be achieved without additional meta-information on structure.
Semantic web page clustering is based on the concept of web page hierarchies. The lowest-level leaves in these hierarchies are web pages, which are grouped into higher-level nodes based on semantic affinities. For example, product web pages are clustered into several product families that are later grouped into a cluster for all products; besides these, other clusters for corporate or support information can also be defined. Semantic hierarchies can be defined following many different criteria, depending on the objectives and strategies of the analysis, and hence many different collections of clusters can be provided. These web page clustering techniques require, in any case, some domain information, either from domain experts or retrieved from a semantic repository. In the latter case, there is a range of possible paths, from META-like information provided in the page contents to Semantic Web principles, including also CMS-based web sites.
2.2
Graph Partitioning for Web Page Clustering
Structure-based and usage-based page clustering are very similar. Both approaches build a web page graph, in which nodes are the different web pages and arcs are the links among these pages. These links can be defined by the actual web links, when only the web structure is considered, or they may be weighted by the usage of these transitions. In the latter case, the web log file is scanned to analyze the frequency of the transitions. In all these cases the web clustering problem is translated into what is called graph partitioning.
The graph partitioning problem is NP-hard, and it remains NP-hard even when the number of subsets is 2 or when some imbalance is allowed [6]. For large graphs (with more than 100 vertices), heuristic algorithms that find suboptimal solutions are the only viable option. Proposed strategies can be classified into combinatorial approaches [15, 10], approaches based on geometric representations [21, 11], multilevel schemes [25, 14], evolutionary optimization [3] and genetic algorithms [7]. There are also hybrid schemes [2] that combine different approaches.
There are many graph partitioning algorithms and, as we cannot describe every single one, we have selected those that we consider most relevant. We start with the four graph partitioning heuristics we have used in this study.
– Simple partitioning methods. These techniques are intended to generate initial partition sets to be refined by a local optimization strategy. The quality of these initial sets is very problem dependent, but they are often surprisingly good because data locality is often implicit in the vertex numbering [13]. In this category we find three main methods: (i) the linear scheme, which assigns vertices to sets in order (if we have n vertices and p sets, the first n/p vertices are assigned to set 1, and so on), (ii) the random scheme, where vertices are randomly assigned to sets while preserving balance, and (iii) the scattered method, where vertices are processed in order, with the next vertex being assigned to the smallest set.
– Spectral partitioning. Spectral methods use eigenvectors of a matrix constructed from the graph to decide how to partition the graph. The connection between eigenvectors and partitions may seem surprising, but these techniques have been shown to be quite good at finding the right general area of the graph where cuts should be made. However, they often do poorly at the fine details. It is therefore advisable to use a local refinement algorithm to improve their results, like the generalized Kernighan-Lin algorithm described next.
– Kernighan-Lin. This is one of the most popular graph partitioning techniques. As it dates back to the 1970s, many extensions and improvements have been proposed. The linear-time implementation by Fiduccia and Mattheyses is probably the best known of these improvements, so much so that they are often credited with the original algorithm [13]. KL usually does not find good partitions unless it is given a good initial one. This is why it is generally used as a local optimization technique, where it performs considerably better.
– Multilevel-KL. This method was originally described in [12]. It is the most suitable option for very large problems where high-quality partitions are needed. It works by creating a sequence of increasingly smaller graphs approximating the original one, partitioning the smallest graph, and projecting this partition back through the intermediate levels. Kernighan-Lin is invoked every few levels of projection to refine the partition [13].
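The three simple schemes, followed by a local refinement, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the Chaco implementation: all function names are hypothetical, vertices have unit weight, and `greedy_swap_refine` is a simplified stand-in for Kernighan-Lin (real KL also locks vertices and accepts temporarily worsening swaps within a pass).

```python
import random


def linear_partition(vertices, p):
    """Assign consecutive blocks of about n/p vertices to sets 0..p-1."""
    n = len(vertices)
    return {v: i * p // n for i, v in enumerate(vertices)}


def random_partition(vertices, p, seed=0):
    """Shuffle the vertices, then deal them out so set sizes differ by at most one."""
    shuffled = list(vertices)
    random.Random(seed).shuffle(shuffled)
    return {v: i % p for i, v in enumerate(shuffled)}


def scattered_partition(vertices, p):
    """Process vertices in order, giving each one to the currently smallest set."""
    sizes = [0] * p
    assignment = {}
    for v in vertices:
        target = sizes.index(min(sizes))  # ties go to the lowest set index
        assignment[v] = target
        sizes[target] += 1
    return assignment


def cut_weight(edges, assignment):
    """Sum the weights of edges whose endpoints lie in different sets."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])


def greedy_swap_refine(edges, assignment):
    """Refine a bisection by repeatedly applying the best cut-reducing swap.

    Swapping one vertex from each side keeps the balance unchanged. The loop
    stops when no single swap lowers the cut, so it ends in a local optimum.
    """
    assignment = dict(assignment)
    while True:
        best_cost, best_trial = cut_weight(edges, assignment), None
        left = [v for v, s in assignment.items() if s == 0]
        right = [v for v, s in assignment.items() if s == 1]
        for a in left:
            for b in right:
                trial = dict(assignment)
                trial[a], trial[b] = 1, 0
                cost = cut_weight(edges, trial)
                if cost < best_cost:
                    best_cost, best_trial = cost, trial
        if best_trial is None:
            return assignment
        assignment = best_trial


# Hypothetical toy graph in which the vertex order hides the good bisection.
vertices = ["a", "b", "c", "d"]
edges = {("a", "c"): 5, ("b", "d"): 5, ("a", "b"): 1, ("c", "d"): 1}

initial = linear_partition(vertices, 2)
refined = greedy_swap_refine(edges, initial)
print(cut_weight(edges, initial), cut_weight(edges, refined))
```

On this toy graph the linear scheme separates the two heavy edges' endpoints, and a single swap repairs it, which mirrors the division of labour described above: a cheap initial partition followed by local improvement.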
3
Experimentation
To evaluate these different page clustering techniques we have selected a very well-known web mining problem, web page associations. Figure 1 shows the overall experimental scenario. Web page association analysis tries to extract the sets of pages that usually appear together in the same web session. Sessions are extracted from the web log by using a referrer-based technique [4]. The web page association analysis could be done at the web page level, but it would lead to a large set of association rules with very low support. Such a model is, in most cases, useless. Instead, we have performed a preprocessing step consisting of web page clustering. We construct a graph from the session information and this graph is partitioned by applying the algorithms described in the previous section. This is the step that we want to study, and therefore these are the algorithms that have been evaluated. The final result of the web page clustering step is a set of clusters, each containing the pages that belong to it. With this information, the original sessions are rewritten by translating each page into the cluster it belongs to. These rewritten sessions are then processed by the Apriori algorithm [1] to extract common associations.
The data sets we have used come from the log files of www.universia.es. Universia is a community portal about Spanish-speaking higher education that involves 379 universities worldwide. It contains information about education, organized in different groups depending on the user profile. We have used the log files from four different days. The total size of the log files is 3 GBytes and they contain a total of 25000 user clicks on approx. 5000 different pages.
Fig. 1. Experimental scenario
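The graph construction and session rewriting steps of this scenario can be sketched as follows. The exact edge weighting used in the study is not specified, so counting consecutive page transitions without regard to direction is an assumption, as are the function names, the example URLs and the cluster assignment.

```python
from collections import Counter


def build_transition_graph(sessions):
    """Count how often two pages are visited consecutively, ignoring direction."""
    weights = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            if src != dst:  # skip page reloads
                weights[tuple(sorted((src, dst)))] += 1
    return weights


def rewrite_sessions(sessions, page_to_cluster):
    """Replace each page by its cluster id, keeping each cluster once per session."""
    return [sorted({page_to_cluster[page] for page in session}) for session in sessions]


# Hypothetical sessions, already extracted from a web log.
sessions = [
    ["/home", "/courses", "/courses/math"],
    ["/home", "/courses"],
    ["/home"],
]

graph = build_transition_graph(sessions)

# Hypothetical output of a partitioning algorithm applied to the graph.
page_to_cluster = {"/home": 0, "/courses": 1, "/courses/math": 1}
transactions = rewrite_sessions(sessions, page_to_cluster)

print(dict(graph))
print(transactions)
```

The rewritten sessions are the transactions that an association rule miner such as Apriori would receive as input.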
4
Results
This section presents the results of our experiment. Table 1 shows the number of discovered association rules and their support values for the clustered web page sets obtained with the graph partitioning algorithms described earlier in this paper. These results are not as satisfactory as we initially expected. The number of rules discovered using clustered data is even lower than the number obtained with raw data from the web logs. Most of the algorithms yield only a single rule that, as we will discuss later, is the obvious one in every web usage analysis. The support of this rule is always 100%, as it represents the arrival of users at the web site. We do not find important differences between algorithms with and without local improvement. This can only be explained by the kind of data we are working with. In the next section we discuss why these results are not as good as we expected and try to find an explanation for this behavior.
Algorithm        Local Improvement   Rules Discovered   Best Support   Mean Support
None             No                  4                  100%           29%
Linear           No                  1                  100%           100%
Linear           K-L                 1                  100%           100%
Spectral         No                  1                  100%           100%
Spectral         K-L                 1                  100%           100%
Random           No                  2                  100%           57.65%
Random           K-L                 1                  100%           100%
Scattered        No                  2                  100%           59.2%
Scattered        K-L                 1                  100%           100%
Multilevel K-L   No                  1                  100%           100%
Table 1. Clustering algorithms applied and rules discovered
5
Discussion
With this study we wanted to show that a preliminary preprocessing step is necessary in the field of web usage mining. The huge amount of data to process makes it impossible to use all the information in web logs, so it seems very useful to group related data together. We have found that, even with this preprocessing step, the results obtained were not as satisfactory as one might think. The reason for these poor results becomes clear after a deeper analysis of the input data. We have observed that most visits to this web site come directly from popular search engines (Google and Yahoo, for example), and that most of them are single-page visits. That is, users search for the information they need, access it and then leave the site. This makes it impossible to obtain relevant results, and it could explain why the Apriori algorithm performed worse when we applied graph partitioning techniques. These algorithms group tightly coupled pages together, which means that, if most people access the same pages (the main page, for example) and then leave, those accesses will fall in the same cluster. Other clusters then do not add relevant information, because all the important data appear in the first one. As a result, the Apriori algorithm is only able to infer a single rule, the obvious one: every visit to the site accesses one page in the main cluster (we call main cluster the one containing the single-page visits). It would be interesting to do some research on other sites to find out whether this is the general situation in web servers worldwide or a particular case of Universia. Some research on isolated web sites (corporate webs that can only be reached from inside the institution) would be interesting too.
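The effect described in this discussion can be reproduced on toy data. The sketch below only counts itemset support, which is the first step of Apriori rather than the full algorithm; the transactions are hypothetical and are not taken from the Universia logs, and the 50% threshold is an arbitrary assumption.

```python
from itertools import combinations


def itemset_support(transactions, size):
    """Fraction of transactions that contain each itemset of the given size."""
    counts = {}
    for transaction in transactions:
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] = counts.get(itemset, 0) + 1
    return {itemset: c / len(transactions) for itemset, c in counts.items()}


# Hypothetical rewritten sessions: cluster 0 plays the role of the main cluster.
# Eight of ten visits touch only that cluster; two continue to another cluster.
transactions = [[0]] * 8 + [[0, 1], [0, 2]]

singles = itemset_support(transactions, 1)
pairs = itemset_support(transactions, 2)

min_support = 0.5  # assumed threshold, chosen only for illustration
frequent_singles = {s for s, sup in singles.items() if sup >= min_support}
frequent_pairs = {s for s, sup in pairs.items() if sup >= min_support}

print(singles)
print(frequent_singles, frequent_pairs)
```

Only the main cluster clears the threshold, with 100% support, and no pair of clusters is frequent, so no association between clusters can be reported.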
6
Conclusions
In this paper we have discussed the importance of a preliminary data preprocessing step in every data mining process. In this particular case, we have used graph partitioning techniques with the aim of speeding up the process and obtaining better association rules by using only relevant information. We have then presented some of the most important graph partitioning algorithms and the experimental scenario we have worked with. The results of this experiment show that the improvement introduced by this preprocessing step depends dramatically on the quality of the input data. In this case, our study led us to the conclusion that our input data was not good enough. However, we cannot hold Universia responsible for this problem, as we have said before: search engines and users' navigational habits have a lot to do with these results. For future research, it would be interesting to pursue the two lines mentioned before, to determine whether this is the normal way web sites behave and nothing can be done about it or whether, on the contrary, there are situations where this preprocessing step is useful.
7
Acknowledgements
We would like to thank www.universia.es for the possibility of working with their log files.
References
1. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499, 1994.
2. R. Baños, C. Gil, J. Ortega, and F.G. Montoya. Multilevel heuristic algorithm for graph partitioning. In Proceedings of the 3rd European Workshop on Evolutionary Computation in Combinatorial Optimization. LNCS 2611, pages 143–153, 2003.
3. R. Baños, C. Gil, J. Ortega, and F.G. Montoya. Partición de grafos mediante optimización evolutiva paralela. In Proceedings de las XIV Jornadas de Paralelismo, pages 245–250, 2003.
4. B. Berendt, B. Mobasher, M. Spiliopoulou, and J. Wiltshire. Measuring the accuracy of sessionizers for web usage analysis. In Workshop on Web Mining at the First SIAM International Conference on Data Mining, pages 7–14, 2001.
5. A. Buchner and M. D. Mulvenna. Discovering internet marketing intelligence through online analytical Web usage mining. SIGMOD Record, 4(27), 1999.
6. T.N. Bui and C. Jones. Finding good approximate vertex and edge partitions is NP-hard. Information Processing Letters, 42:153–159, 1992.
7. T.N. Bui and B. Moon. Genetic algorithms and graph partitioning. IEEE Transactions on Computers, 45(7):841–855, 1996.
8. C. Shahabi, A. M. Zarkesh, J. Adibi, and V. Shah. Knowledge discovery from users Web-page navigation. In Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.
9. Robert Cooley. The use of web structure and content to identify subjectively interesting web usage patterns. ACM Transactions on Internet Technology, 3(2), 2003.
10. C. Fiduccia and R. Mattheyses. A linear time heuristic for improving network partitions. In Proceedings of the 19th IEEE Design Automation Conference, pages 175–181, 1982.
11. J. Gilbert, G. Miller, and S. Teng. Geometric mesh partitioning: Implementation and experiments. In Proceedings of the 9th International Parallel Processing Symposium, pages 418–427, 1995.
12. B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proc. Supercomputing '95, December 1995.
13. B. Hendrickson and R. Leland. The Chaco User's Guide Version 2.0. pages 1–44, 1995.
14. G. Karypis and V. Kumar. Multilevel K-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
15. B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, pages 291–307, 1970.
16. José M. Peña, Víctor Robles, Oscar Marbán, and María S. Pérez. Bayesian methods to estimate future load in web farms. In AWIC 2004, pages 217–226, 2004.
17. O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram. Mining Web access logs using relational competitive fuzzy clustering. In Eighth International Fuzzy Systems Association World Congress, August 1999.
18. M. Perkowitz and O. Etzioni. Adaptive Web sites: automatically synthesizing Web pages. In Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
19. R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1), 1999.
20. S. Schechter, M. Krishnan, and M. D. Smith. Using path profiles to predict HTTP requests. In 7th International World Wide Web Conference, Brisbane, Australia, 1998.
21. H.D. Simon and S. Teng. How good is recursive bisection? SIAM Journal on Scientific Computing, 18(5):1436–1445, 1997.
22. M. Spiliopoulou and L. C. Faulstich. WUM: A Web Utilization Miner. In EDBT Workshop WebDB98, Valencia, Spain, LNCS 1590, Springer Verlag, 1999.
23. A. Strehl, J. Ghosh, and R. Mooney. Impact of Similarity Measures on Web-Page Clustering. In Workshop for Artificial Intelligence for Web Search, July 2000.
24. T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. In 5th International World Wide Web Conference, Paris, France, 1996.
25. C. Walshaw and M. Cross. Mesh partitioning: a multilevel balancing and refinement algorithm. SIAM Journal on Scientific Computing, 22(1):63–80, 2000.