ISBN: 972-8924-16-X © 2006 IADIS
USER PROFILING FROM CITIZEN WEB PORTAL ACCESSES USING THE ADAPTIVE RESONANCE THEORY NEURAL NETWORK
José D. Martín-Guerrero and Emilio Soria-Olivas, Department of Electronic Engineering, University of Valencia, Spain. C/ Dr. Moliner, 50. 46100 Burjassot (Valencia) - Spain
Paulo J.G. Lisboa School of Computing and Mathematical Sciences, Liverpool John Moores University, United Kingdom
Alberto Palomares and Emili Balaguer-Ballester Research & Development Department, Tissat, P.L.C., Valencia, Spain
ABSTRACT
In this paper, we propose the use of Adaptive Resonance Theory, and more specifically the ART2 neural network, to cluster web users, because it can find clusters regardless of their size or shape. Moreover, this algorithm does not need to know the number of clusters in advance. To evaluate the goodness of the modelling, we created six artificial data sets covering situations commonly found in real sites; the clusterings achieved by ART2 are benchmarked against those of the classical K-means algorithm. Afterwards, we apply the clustering algorithms to a real data set that records user accesses to the citizen web portal Infoville XXI. The clusters found by ART2 in the real data set are straightforward and easy to interpret, providing a useful basis for the design of a personalized recommender system.
KEYWORDS
Adaptive Resonance Theory, Citizen Web Portal, User Profiling, Clustering.
1. INTRODUCTION
One of the most important aims of web mining is to find characteristics and patterns of web users and of their usage of the web. This makes web sites more effective, because it allows site administrators or webmasters to understand their users. A natural way to do so is to cluster users according to the services they access. However, this approach is difficult to apply in high-dimensional spaces. Some web portals contain thousands of services, so the number of services available at the site may even exceed the number of users over any reasonable time window. Useful conclusions cannot be drawn by clustering in such a high-dimensional space, nor is it easy to find any kind of inter-user similarity. The proposal is to cluster in a lower-dimensional space (Cadez, 2001; Martín, 2003), where each dimension has a meaning in itself. The components of this new space are called “page categories” or “descriptors”.
A widely used classical algorithm such as K-means (Duda, 2000), which is a useful tool for some data sets, is not well suited to this application: it tends to force clusters of similar size, whereas clusters of web-based user profiles vary significantly in size. In this paper, we use a neural network based on Adaptive Resonance Theory (ART), in particular the ART2 network (Carpenter, 1991), because we are working with continuous data (the ART1 network is analogous but works with binary data). This network can circumvent some of the usual drawbacks of classical algorithms, since it was designed to solve the stability-plasticity dilemma, namely, to adapt clusters to new data patterns without disrupting the clusters already established. Moreover, an important advantage for our case comes
IADIS International Conference e-Society 2006
from the fact that it is not necessary to know the number of clusters in advance; once the degree of similarity is chosen, the algorithm finds the number of clusters corresponding to this choice.
2. DATA SETS
Web mining tools must be applicable to real data sets if their practical value is to be realized. However, before turning to real data, it is important to characterize the performance of these tools, and this is best done initially with artificial data sets: first, because they allow the conclusions to generalize to web sites with different characteristics; and second, because the true clusters are known, which makes it possible to evaluate algorithm performance. Six artificial data sets, whose clusters in the space defined by the probabilities of the descriptors were known, were selected to test the clustering algorithms. The most relevant characteristics of these data sets are shown in Table 1.
Table 1. Characteristics of the artificial data sets in terms of: number of descriptors (ND), i.e. dimensionality of the space in which the clustering is carried out, number of clusters (NC), and overlap among clusters.
Data set       ND    NC    Overlap¹
Data set #1     2     2    NO
Data set #2     3     4    HIGH
Data set #3     5     8    SLIGHT
Data set #4     5     8    HIGH
Data set #5     8    12    HIGH
Data set #6     8    12    SLIGHT
Simulated data sets are very useful for analysing algorithm performance in different situations, but real data are necessary as a final test. In this work, we focus on a citizen portal, an interactive gateway between citizens and administration. Specifically, we profiled user accesses to the regional web portal Infoville XXI (http://www.infoville.es/). This is an official web site supported by the Valencian Government, which provides citizens of Valencia, in Spain, with more than 2,000 services grouped into 21 descriptors. We used the accesses recorded over three months, from November 2002 through January 2003. Before searching for clusters, we carried out a four-step preprocessing of these data: removing outliers from the database, finding the most informative descriptors (five of the 21 descriptors were found to be the most relevant), anonymizing the data, and encoding the frequency of accesses as probabilities². The final data set used for clustering consisted of 1,000 patterns and five descriptors, namely, “public administration”, “town councils”, “channels” (a quite heterogeneous descriptor, which contains different services, such as setting up a business, looking for housing, etc.), “shopping” and “entertainment”.
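The last preprocessing step, encoding access frequencies as values in [0, 1], could look roughly like the sketch below. It follows one possible reading of the footnote on this step, namely dividing each descriptor's access count within a session by the largest count in that session. The function name `accesses_to_probabilities` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
def accesses_to_probabilities(counts):
    """Scale per-descriptor access counts of one session to [0, 1].

    Assumption: "normalizing with respect to its maximum number" is read
    here as dividing every count by the largest count in the session.
    """
    peak = max(counts.values(), default=0)
    if peak == 0:
        # A session without accesses maps to all zeros.
        return {name: 0.0 for name in counts}
    return {name: value / peak for name, value in counts.items()}


# Hypothetical session: access counts for three of the five descriptors.
session = {"shopping": 4, "entertainment": 2, "town councils": 0}
encoded = accesses_to_probabilities(session)
```

Under this reading, the values of one session do not necessarily sum to one; they only express each descriptor's weight relative to the most visited descriptor.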
3. ADAPTIVE RESONANCE THEORY
The ART2 network clusters inputs by unsupervised learning. In practice, it identifies the most appropriate cluster for a given user pattern, tests whether the cluster prototype is a good enough representation of that pattern and, depending on the outcome, either adapts that cluster or starts a new one. As a computational tool, ART networks enable the user to control the degree of similarity of the patterns placed in the same cluster; once this choice is made, it is not necessary to choose the number of clusters in advance, since the network finds the number corresponding to the chosen degree of similarity (Carpenter, 1991).
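The choose, test and adapt-or-create cycle described above can be illustrated with a minimal sketch. This is a simplified ART-style procedure, not the full ART2 architecture of (Carpenter, 1991): the use of cosine similarity as both choice and match function, the `learning_rate` parameter and the function names are assumptions made for illustration only.

```python
import math


def _normalize(vector):
    """Return the vector scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def art_like_clustering(patterns, vigilance=0.95, learning_rate=0.5,
                        max_epochs=20):
    """Cluster patterns with a vigilance test (simplified, ART-inspired)."""
    prototypes = []
    labels = [None] * len(patterns)
    for _ in range(max_epochs):
        changed = False
        for idx, pattern in enumerate(patterns):
            x = _normalize(pattern)
            # Choice: find the prototype most similar to the pattern.
            best, best_sim = None, -1.0
            for k, w in enumerate(prototypes):
                sim = sum(a * b for a, b in zip(x, w))
                if sim > best_sim:
                    best, best_sim = k, sim
            if best is not None and best_sim >= vigilance:
                # Match accepted: move the prototype towards the pattern.
                w = prototypes[best]
                prototypes[best] = _normalize(
                    [(1 - learning_rate) * wi + learning_rate * xi
                     for wi, xi in zip(w, x)])
                assigned = best
            else:
                # Match rejected: the pattern starts a new cluster.
                prototypes.append(x)
                assigned = len(prototypes) - 1
            if labels[idx] != assigned:
                labels[idx] = assigned
                changed = True
        if not changed:
            # Stop once a full pass leaves every assignment unchanged.
            break
    return labels, prototypes
```

Note that the number of clusters is never passed in: raising `vigilance` demands closer matches and therefore tends to produce more, tighter clusters, while lowering it merges patterns into fewer clusters.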
¹ We consider a slight overlap when less than 20% of the patterns are overlapped, whereas a high overlap means that more than 20% of the patterns are overlapped among different clusters.
² Frequency of accesses was transformed into a priori probabilities by normalizing the number of accesses within a session with respect to its maximum number.
4. RESULTS
We benchmarked the clusters achieved by the ART2 network against those found by the familiar K-means algorithm (Duda, 2000) in order to have a valid and well-accepted reference for evaluating our results. Model development for K-means was carried out by choosing the desired number of clusters, which is known in the case of the artificial data sets; in the case of real data, several feasible numbers of clusters were tried. For the ART2 network, the vigilance parameter was varied between 0.9 and 0.99. The number of iterations was chosen to ensure the stability of the network, i.e., that no updates occur in the last iterations.
To evaluate the clusters achieved for the artificial data sets, two criteria were taken into account: on the one hand, whether the number of clusters found by the algorithm was correct and, on the other, how good these clusters were. The latter was measured by the Mahalanobis distance from each cluster found by the algorithm to the nearest known cluster centre:
$$ D = \frac{1}{N} \sum_{i=1}^{M} N_i \cdot d_i \tag{1} $$
In (1), D measures the distance from the clustering found to the correct one; the smaller the value of D, the closer the match to the actual situation. N is the total number of patterns, M the number of correct clusters found, N_i the number of patterns belonging to the i-th cluster found, and d_i the Mahalanobis distance from the i-th cluster to the corresponding centre. A cluster found by the algorithm is considered correct if its distance to the nearest actual centre is lower than a predefined threshold; heuristically, we found that a suitable choice for this threshold was both a Euclidean distance of 0.2 and D=1 (distances measured in the descriptors’ probability space used for clustering).
Results are shown in Table 2. Regarding the percentage of correct clusters found, both algorithms achieved similar results when the dimensionality was low, but ART2 behaved much better when dealing with a high number of clusters in a high-dimensional space. The values of D also favour ART2, since its D was no larger than that of K-means except for data sets #3 and #4; nevertheless, the percentage of correct clusters found by ART2 for these two data sets was higher than that found by K-means. Therefore, the overall behaviour for these sets is also better when using ART2.
Table 2. Success rate (SR) [%] of correct clusters found and normalized Mahalanobis distance (D) between the known centres and the correct clusters found by the K-means algorithm and the ART2 network. Distances are measured in the descriptors' probability space.
               K-means              ART2
Data set       SR [%]    D          SR [%]    D
Data set #1    100       0.0330     100       0.0330
Data set #2    25.0      0.6818     25.0      0.2492
Data set #3    37.5      0.1498     50.0      0.2314
Data set #4    37.5      0.2227     50.0      0.2747
Data set #5    33.3      0.2858     66.7      0.2701
Data set #6    0         --         50.0      0.2411
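As an illustration of how the success rate SR and the measure D of equation (1) fit together, the following sketch computes both for a set of found clusters. Several simplifications are assumed here and are not taken from the paper: the covariance of each known cluster is diagonal (one variance per descriptor), a found cluster counts as correct when its Euclidean distance to the nearest known centre is below 0.2 (the additional D=1 condition is not applied), and SR is the share of distinct known clusters matched in this way. All function and parameter names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis_diag(point, centre, variances):
    """Mahalanobis distance assuming a diagonal covariance matrix."""
    return math.sqrt(sum((p - c) ** 2 / v
                         for p, c, v in zip(point, centre, variances)))


def evaluate_clustering(found_centres, found_sizes,
                        true_centres, true_variances, threshold=0.2):
    """Return (SR in percent, D as in equation (1))."""
    n_total = sum(found_sizes)  # N: total number of patterns
    weighted = 0.0              # running sum of N_i * d_i
    matched = set()             # indices of known clusters recovered
    for centre, size in zip(found_centres, found_sizes):
        # Nearest known centre in the Euclidean sense.
        j = min(range(len(true_centres)),
                key=lambda k: euclidean(centre, true_centres[k]))
        if euclidean(centre, true_centres[j]) < threshold:
            matched.add(j)
            weighted += size * mahalanobis_diag(
                centre, true_centres[j], true_variances[j])
    success_rate = 100.0 * len(matched) / len(true_centres)
    d = weighted / n_total if n_total else float("nan")
    return success_rate, d


# Hypothetical example: two known clusters, only one of them recovered.
sr, d = evaluate_clustering(
    found_centres=[[0.25, 0.8], [0.5, 0.5]],
    found_sizes=[60, 40],
    true_centres=[[0.2, 0.8], [0.8, 0.2]],
    true_variances=[[0.01, 0.01], [0.01, 0.01]],
)
```

In this toy case only the first found cluster lies within the threshold, so only its 60 patterns contribute to the sum in (1), while the division is still by all 100 patterns.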
In the case of the real data set, the desired clusters were not known in advance. The values of SR and D could therefore not be calculated, and our analysis had to rely on the meaning of the clusters, based on our understanding of the descriptors that define the data space. Different clusterings were obtained using K-means (setting the number of clusters) and ART2 (setting the vigilance parameter to typical values). We focus on the results achieved by the ART2 network, since its clusters were much easier to interpret. Among the different clusterings obtained by ART2, the most readily interpretable consisted of 7 clusters. Five of the clusters were clearly centred on the five descriptors taken into account; hence, the users in each of these clusters have a clearly defined preference for a single web category. In addition, we found two interesting groups formed by a large number of users. One of these groups clustered people whose interests were mainly focused on “shopping” and “entertainment”, i.e., people who accessed the portal for leisure rather than for paperwork. The other group showed the opposite behaviour, since it clustered people who accessed the portal to deal with bureaucratic matters and official enquiries (“public administration”, “town councils” and “channels” were the descriptors with the highest prevalence of accesses).
5. CONCLUSION
In this paper, we focused on clustering algorithms as generic methodologies for user modelling and characterized the performance of the Adaptive Resonance Theory algorithm, benchmarking it against the popular K-means algorithm. On the artificial data sets, the ART2 model recovered the underlying clusters more accurately than classical K-means. Regarding the real data from the citizen portal Infoville XXI, the clusters obtained by ART2 were straightforward to interpret, and the interpretation was consistent with the meaning of the descriptors. Furthermore, having benchmarked the method on artificial data with known characteristics and found that it gave good results increases our confidence in its use for clustering real data sets with similar dimensionalities and numbers of users.
Future work will mainly address how to take advantage of the information provided by clustering algorithms. Two directions are clear: first, using this method as the basis of a personalized recommender system based on collaborative filtering, to make navigation easier and more useful to users; and second, using it to improve the design of the web portal by reflecting the navigational needs of the users and by providing gateways between similar descriptors.
ACKNOWLEDGEMENT
This work has been partially supported by the project entitled “NDPG: Neuro-Dynamic Programming Group: Aplicaciones practicas de programación neuro-dinámica y aprendizaje reforzado en minería web y marketing”, with reference number GVA05/009.
REFERENCES
Cadez, I. et al., 2001. Model-based clustering and visualization of navigation patterns on a Web site. Technical Report MSR-TR-0018. Microsoft Corporation.
Carpenter, G. and Grossberg, S., 1991. Pattern recognition by Self-Organizing Neural Networks. MIT Press, Cambridge, MA, USA.
Duda, R. et al., 2000. Pattern classification. John Wiley & Sons, New York, NY, USA.
Martín, J.D., 2003. A pseudo-supervised approach to improve a recommender based on collaborative filtering. Proceedings of UM2003. Johnstown, PA, USA, pp. 429-431.