Data Mining and Knowledge Discovery, 5, 183–196, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Association Models for Web Mining

PAOLO GIUDICI∗ ([email protected])
Dipartimento di Economia Politica e Metodi Quantitativi, University of Pavia, Via San Felice 5, 27100 Pavia, Italy

ROBERT CASTELO ([email protected])
University of Utrecht, The Netherlands

∗ To whom correspondence should be addressed.


Editors: Paolo Giudici, David Heckerman and Joe Whittaker
Received September 30, 2000; Revised March 15, 2001

Abstract. We describe how statistical association models and, specifically, graphical models can be usefully employed to model web mining data. We discuss some methodological problems related to the implementation of discrete graphical models for web mining data, in particular model selection procedures.

Keywords: Bayesian inference, data mining, graphical models, Markov chain Monte Carlo methods, model selection

1. Introduction



The aim of this paper is to illustrate concepts and techniques involved in the statistical analysis of web data. More specifically, we consider a model-based approach to web mining, in order to carry out an explorative study that will help explain the way people interact with a web site. With the exception of web crawlers, there is usually a person involved, so we can assume that there is a purpose behind the visit to the web site, and it makes sense to believe that there is a correlation between visits and the hits registered. An analysis of pairwise correlations, as done in a typical association mining analysis, provides some information about how people access the web site. Still more information may be gathered by examining higher order correlations and conditional independencies, which are the cornerstone of the models we propose in this article on mining web data.

A conditional independence statement is a triplet in which any marginal relationship (dependence) between two elements vanishes when the third element is taken into consideration. For instance, in healthcare, there is statistical evidence that people who have heart attacks have a larger cholesterol intake than those who do not. However, cholesterol intake (CI) is conditionally independent of the chance of heart attack (CA) given the cholesterol blood level (CB), because when the cholesterol blood level is known, cholesterol intake becomes irrelevant. Given that we know the cholesterol blood level (the conditioning fact), information about cholesterol intake does not alter the chance of heart attack.

Conditional independencies are restrictions of the model. A model that contains no restrictions is said to be unrestricted, or saturated; when it contains all possible restrictions, it is said to be fully restricted, or empty. Conditional independence statements are written using the symbol ⊥. For the preceding example: CI ⊥ CA | CB.

Conditional independencies may be encoded in graphs. Formally, the conditional independencies are read off an undirected graph g = (V, E), where V is a set of vertices and E is a set of edges, by noting whether a subset S ⊂ V separates two non-empty subsets A, B ⊂ V: if so, the conditional independence statement A ⊥ B | S holds (a computational sketch of this separation check is given at the end of this section). We will demonstrate how to use these graphs in our data mining application by fitting so-called graphical models. These are usually referred to as graphical Markov models, but we will drop the adjective Markov to avoid confusion with Markov chains.

In the next section, we illustrate the data set considered. In Section 3, we analyze the available data from a frequentist statistics viewpoint. In Section 4, we illustrate and apply a more coherent and structured approach to the data, based on Bayesian graphical model selection, and compare results with the frequentist approach. Finally, Section 5 contains concluding remarks and comments, as well as suggestions for future research.
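The separation criterion just described is easy to mechanize. The following is a minimal Python sketch of our own (not part of the original analysis; all names are illustrative): it tests whether a set S separates A from B in an undirected graph, which is exactly the condition under which A ⊥ B | S can be read off the graph.

```python
from collections import deque

def separates(adj, A, B, S):
    # breadth-first search from A that may not enter S; if it reaches B,
    # then S does not separate A from B (A, B, S are disjoint vertex sets)
    blocked = set(S)
    seen = set(A) - blocked
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        if v in B:
            return False        # found a path from A to B avoiding S
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# the cholesterol example: the chain CI - CB - CA, so CB separates CI from CA
adj = {"CI": {"CB"}, "CB": {"CI", "CA"}, "CA": {"CB"}}
print(separates(adj, {"CI"}, {"CA"}, {"CB"}))   # True: CI is independent of CA given CB
```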

2. The available data

We carried out our exploratory analysis on a database that records the visits to a set of areas (vroots) of the Microsoft corporate web site. This database is publicly available through the UCI KDD Archive at the University of California, Irvine (http://kdd.ics.uci.edu/databases/msweb/msweb.html). In particular, the data we considered correspond to all visited pages on the web site www.microsoft.com by a total population of 5,000 anonymous visitors, randomly chosen. Each visitor is identified by a number from 10,000 to 15,000, and no personal information is given. For each visitor, we recorded the number of visits to each page of the site in a week of February 1998. Site pages are identified by a number, which corresponds to an address (e.g., 1057 corresponds to "MS Power Point News"). This data does not contain information about the click order of the visitor. The following are the records for three visitors or clients (denoted by C) with vroots (denoted by V):

C, 10908
V, 1108
V, 1017
C, 10909
V, 1113
V, 1009
V, 1034
C, 10910
V, 1026
V, 1017
A client typically visits only a few different pages; the average is four visited pages per client. This leads to a rather sparse contingency table when the data is arranged according to all the categorical variables corresponding to the pages (vroots). To alleviate the problem, we have eliminated from the data set all clients who visited only one page (a parsing sketch is given at the end of this section).

This data was first analyzed by Breese et al. (1998) with the purpose of establishing a comparison among different predictive algorithms for collaborative filtering. As stated in that paper, collaborative filtering uses previously stored information about user preferences to predict possible future user behavior. It is not our goal here to mine data from the web in order to predict web usage, but rather to gather insight into the way users access the web site. This will be achieved by means of two types of exploratory graphical modeling strategies, based respectively on the frequentist and the Bayesian statistical approaches. The use of graphical models for web mining has also been illustrated by Heckerman et al. (2000), but in the framework of dependency networks and greedy search.
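To make the preprocessing concrete, here is a minimal Python sketch of how records of the form shown above can be parsed into per-client sets of vroots. It is our illustration, under the simplifying assumption that each line carries exactly the two fields shown (the full UCI file carries additional fields), and the function name is ours.

```python
from collections import defaultdict

def read_visits(path):
    # parse 'C, <client id>' / 'V, <vroot id>' records into a dict
    # mapping each client to the set of vroots he or she visited
    visits = defaultdict(set)
    client = None
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split(",")]
            if fields[0] == "C":
                client = int(fields[1])
            elif fields[0] == "V" and client is not None:
                visits[client].add(int(fields[1]))
    # drop clients who visited only one page, as described above
    return {c: v for c, v in visits.items() if len(v) > 1}
```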

3. Frequentist web mining


If pages are ranked by absolute frequency of visits, their distribution is heavily asymmetric. In figure 1, the most visited page is number 1008, Free downloads.

To understand the associations between different pages determined by the visits of clients, we build up a contingency table that has one classifying random variable for each considered page. However, given the very high number of pages in the site, and the low average number of different pages visited, we need to group pages into homogeneous categories in order to avoid sparse tables. Grouping may be done in several ways, for instance statistically (see Breese et al., 1998). Instead, we considered a grouping that is logically sensible, and we identified eight groups: Programs, Catalog, Internet, Entertainment, Office, Development, Windows, Initials. We also binarized the variable corresponding to each grouped page, with level 1 indicating that the group was visited at least once, and level 0 indicating that the group of pages was not visited during the week. Of course, in so doing there is a loss of information, but there is an advantage in ease of analysis and understanding.

The cross-classification of the 5,000 clients into the eight groups of pages, with two levels per group (visited/not visited), produces a 2^8 contingency table.

Figure 1. Frequency distribution of the pages.

The marginal association between groups is measured by the marginal pairwise odds ratio. For each odds ratio we have calculated, besides the maximum likelihood estimate, an approximate 95% confidence interval (a computational sketch is given after figure 2). Two variables are declared significantly associated if the value one of the odds ratio, which corresponds to marginal independence, falls outside the confidence interval.

To illustrate the marginal associations, one can draw a graph whose nodes correspond to the groups and where an edge is inserted between a pair of nodes if the two corresponding groups are significantly associated. Such a graph is not a conditional independence graph, such as those employed in the graphical modeling literature (see, e.g., Whittaker, 1990, or Lauritzen, 1996), but a marginal independence graph, with no separation properties. However, it sets a framework for the subsequent analysis. Such a marginal independence graph is depicted in figure 2. In figure 2, solid lines represent positive associations, and dashed lines represent negative associations. Positive associations with an odds ratio greater than three, describing a strong association, are highlighted with a thick solid line. The total number of edges is estimated to be 14.

The previous exploratory analysis is based on marginal independencies, as it considers separately all marginal two-way contingency tables corresponding to each pair of variables. A more correct study of the associations must consider the 2^8 joint contingency table directly, with all variables being simultaneously analyzed.


Figure 2. The exploratory graph: positive and negative associations are represented respectively by solid and dashed lines.
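The odds-ratio screening described above is easy to sketch in code. The following Python fragment is our illustration, not the original implementation: it computes the marginal pairwise odds ratios with approximate (Wald) 95% confidence intervals from a 0/1 matrix X of clients by page groups, which is assumed to come from the preprocessing step, together with hypothetical group labels.

```python
import numpy as np
from itertools import combinations

def pairwise_odds_ratios(X, labels, z=1.96):
    # marginal pairwise odds ratios with approximate Wald confidence
    # intervals from an N x p binary matrix; in sparse tables a 0.5
    # continuity correction should be added to the four counts
    results = []
    for i, j in combinations(range(X.shape[1]), 2):
        n11 = np.sum((X[:, i] == 1) & (X[:, j] == 1))
        n10 = np.sum((X[:, i] == 1) & (X[:, j] == 0))
        n01 = np.sum((X[:, i] == 0) & (X[:, j] == 1))
        n00 = np.sum((X[:, i] == 0) & (X[:, j] == 0))
        or_hat = (n11 * n00) / (n10 * n01)
        se = np.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)  # SE of log OR
        lo, hi = or_hat * np.exp(-z * se), or_hat * np.exp(z * se)
        significant = not (lo <= 1.0 <= hi)   # one falls outside the interval
        results.append((labels[i], labels[j], or_hat, lo, hi, significant))
    return results
```

An edge is drawn in the exploratory graph exactly for those pairs flagged as significant, with solid or dashed style according to whether the estimated odds ratio lies above or below one.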

To achieve this, we need a more structured, model-based approach. Important insights can be gained by basing the analysis on graphical modeling theory. The contingency table of the visited pages of a web site summarizes user behavior. The analysis of the conditional independencies contained in that table provides, as we shall see, valuable knowledge about how the web site is accessed.

In this context, we define a graph g = (V, E) as a pair of sets: a finite set of vertices, V, in one-to-one correspondence with a set of random variables (in our case the groups of pages), and a finite set of edges, E, that join pairs of the vertices. All the random variables occur on an equal footing. Under this assumption, the appropriate type of graphical model for discovering conditional independencies is undirected, that is, the edges in E are undirected. More precisely, we use decomposable graphical models, which are determined by chordal graphs. A chordal graph is an undirected graph in which every cycle of length greater than three has a chord; equivalently, it contains no chordless cycles involving more than three vertices (a sketch of a chordality test is given at the end of this section). A graphical model is a probability distribution, P, which is Markov with respect to a graph, g. This means that every (conditional) independency represented in g is satisfied under P. For the data at hand, we shall concentrate on discrete (binary) graphical models. Let θ(g) indicate the vector of the 2^8 unknown cell probabilities. Consider a complete sample from P. Our aim is to choose, on the basis of the observed sample, the most likely conditional independence graph g (or, equivalently, the most likely graphical model). Again, a conditional independence graph describes relationships between arbitrary subsets of variables, while marginal independence graphs, such as that in figure 2, describe marginal pairwise relationships, which do not take the other variables into account.

In data mining, there is typically little a priori knowledge, so one may want to compare all possible graphs for a given set of random variables. Let L(g) be the likelihood function of a graph g, having observed the data. To carry out model selection, we need to attach a score to each considered model. The classical frequentist score is obtained by maximizing the likelihood in θ(g), for any given graph g. Selection is then typically carried out by stepwise comparisons of models, comparing the corresponding maximized likelihoods.

For the data at hand, we consider discrete graphical models, which are a subset of the class of loglinear models, with the generators of the models corresponding to the cliques of the graph. Model selection has been carried out by means of an edge elimination procedure, with a significance level of 0.05, which has led to a final graphical model coinciding with that in figure 2. Therefore, if we stopped our analysis here, we would conclude that the exploratory marginal graph may be interpreted as a conditional independence graph.
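Since the search is restricted to chordal graphs, candidate graphs must be tested for chordality. The sketch below is our illustration (brute force is perfectly adequate for eight vertices) of the classical maximum cardinality search criterion: a graph is chordal exactly when, for every vertex, its already numbered neighbors form a clique.

```python
from itertools import combinations

def is_chordal(vertices, edges):
    # maximum cardinality search: repeatedly number the vertex with the
    # most already numbered neighbors; the graph is chordal iff those
    # earlier neighbors always form a clique
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    numbered = []
    unnumbered = set(vertices)
    while unnumbered:
        v = max(unnumbered, key=lambda u: len(adj[u] & set(numbered)))
        earlier = adj[v] & set(numbered)
        if any(y not in adj[x] for x, y in combinations(earlier, 2)):
            return False
        numbered.append(v)
        unnumbered.remove(v)
    return True

# a four-cycle has a chordless cycle of length four, hence is not chordal
print(is_chordal(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)]))          # False
print(is_chordal(range(4), [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))  # True
```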

4. Bayesian web mining

We now consider a Bayesian web mining analysis which, as we shall show, turns out to be quite advantageous. The technical details of the concrete implementation of the approach are contained in Giudici and Castelo (2001), to which we refer for further details and discussion.

The problem with employing classical scores for model selection in data mining is that, as already observed, with even a moderate number of variables, stepwise procedures are often somewhat unstable. Furthermore, in data mining there is typically little subject-matter knowledge on which models are substantially important and, therefore, it is advisable to report conclusions from more than one model. Hence, a model averaging procedure is needed, and classical methods do not provide an easy solution to this. The Bayesian approach gives a solution to this problem, as it is based on the comparison of probabilities. The Bayesian model score is a probability, obtained by applying Bayes' theorem to the marginal likelihood of a model. The latter is obtained by integrating the likelihood over all admissible values of θ(g).

On the negative side, the Bayesian approach needs to specify prior distributions. As a prior distribution on the unknown cell probabilities, we propose to assign an uninformative Dirichlet distribution to each clique of the graph, similar to the Dirichlet priors used for Bayesian networks described in Heckerman et al. (1995), as typically done in the probabilistic expert systems literature. As a prior on the model space, we have taken a uniform distribution over the considered graphs, so that the Bayesian model score can be obtained simply by normalizing the marginal likelihood.

A difficulty with the Bayesian approach is the need to evaluate the high-dimensional integrals involved in the derivation of the model scores. Furthermore, for high-dimensional graphical models, such as those considered here, the set of all possible models is large, and a full comparison of all the posterior probabilities associated with the competing models becomes infeasible. These problems can be solved, at least approximately, by means of Markov chain Monte Carlo methods (MCMC; for a review, see Brooks, 1998). In particular, for model search purposes, a successful MCMC algorithm is the Markov chain Monte Carlo model composition algorithm (MC^3), proposed by Madigan and York (1995) and adapted by Giudici and Castelo (2001) to data mining problems (a simplified sketch is given below).

Consider the application of an MCMC model search to our web mining data. Our aim is to approximate the posterior distribution over the space of discrete undirected graphical models. This will enable us to identify which model is the most likely association structure describing conditional independencies among the eight groups of pages of the web site. Importantly, the posterior distribution on the space of models allows us to see how frequently the most likely model beats the rest of the models. Also, it is interesting to see which parts of the model remain unchanged across those models that account for the largest portion of the distribution. Logically, these unchanged parts deserve a higher degree of confidence in the information they provide.
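To fix ideas, the following Python sketch conveys the flavor of such an MC^3 sampler over decomposable models. It is our simplified illustration, not the implementation of Giudici and Castelo (2001): the hyper-Dirichlet prior is approximated by independent Dirichlet–multinomial terms on the clique and separator margins, each with total precision alpha, and the chordality test is the `is_chordal` function sketched in Section 3. X is the binary client-by-group matrix assumed earlier.

```python
import numpy as np
from itertools import combinations
from scipy.special import gammaln

def maximal_cliques(vertices, edges):
    # brute force is adequate for eight vertices (255 candidate subsets)
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    cliques = [frozenset(s) for r in range(1, len(vertices) + 1)
               for s in combinations(vertices, r)
               if all(y in adj[x] for x, y in combinations(s, 2))]
    return [c for c in cliques if not any(c < d for d in cliques)]

def separators(cliques):
    # junction tree via a maximum-weight spanning tree on clique
    # intersections; its edges give the separators of a chordal graph
    in_tree, seps = {0}, []
    while len(in_tree) < len(cliques):
        i, j = max(((i, j) for i in in_tree
                    for j in range(len(cliques)) if j not in in_tree),
                   key=lambda p: len(cliques[p[0]] & cliques[p[1]]))
        in_tree.add(j)
        seps.append(cliques[i] & cliques[j])
    return seps

def log_dm(X, margin, alpha=1.0):
    # Dirichlet-multinomial log marginal likelihood of one binary margin,
    # with total prior precision alpha split evenly over its cells
    cols = sorted(margin)
    k = 2 ** len(cols)
    cells = (X[:, cols] @ (2 ** np.arange(len(cols)))).astype(int)
    counts = np.bincount(cells, minlength=k)
    a = alpha / k
    return (gammaln(alpha) - gammaln(alpha + len(X))
            + np.sum(gammaln(a + counts) - gammaln(a)))

def log_score(X, vertices, edges):
    # decomposable marginal likelihood: clique terms minus separator terms
    cliques = maximal_cliques(vertices, edges)
    return (sum(log_dm(X, c) for c in cliques)
            - sum(log_dm(X, s) for s in separators(cliques)))

def mc3(X, n_iter=100_000, seed=0):
    rng = np.random.default_rng(seed)
    vertices = list(range(X.shape[1]))
    pairs = list(combinations(vertices, 2))
    edges = frozenset()                    # start from the empty graph
    score = log_score(X, vertices, edges)
    trace = []
    for _ in range(n_iter):
        prop = edges ^ {pairs[rng.integers(len(pairs))]}  # toggle one edge
        # toggling a uniformly chosen pair is symmetric, so no Hastings
        # correction is needed; non-chordal proposals are simply rejected,
        # which keeps the chain inside the decomposable family
        if is_chordal(vertices, prop):     # from the Section 3 sketch
            new = log_score(X, vertices, prop)
            if np.log(rng.random()) < new - score:
                edges, score = prop, new
        trace.append(edges)
    return trace
```

A real implementation would update cliques and scores locally after each edge change rather than recomputing them from scratch, but the accept/reject logic is the essence of the method.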


The space of possible models is rather large; the number of all possible chordal graphs on eight vertices is about 31 million. Therefore, in all our experiments on this data set, we have run a long Markov chain of 100,000 iterations in order to achieve practical convergence to the probability distribution we are approximating.

Figure 3 contains two different diagnostics of convergence. In both, the x axis represents the running iteration on a log-scale. After the first 10,000 iterations, the Markov chain starts to converge. Figure 3(a) shows the convergence of the average number of edges present in the model; in particular, the number of edges seems to converge around the value of 16, two more than in the frequentist case. Figure 3(b) shows the convergence of the ratio between the number of accepted and the number of rejected models (a minimal sketch of the diagnostic in figure 3(a) is given below).

In figures 4 and 5, we find three different types of information produced by the MCMC process. In figure 4(a), the x axis represents the different models visited by the Markov chain. There are a total of 52, which means that each of the remaining, more than 30 million, models is estimated to have a probability of less than 1/100,000. In figure 4(a), the numbering of the models corresponds to the sequence in which the Markov chain encounters them for the first time, starting from the empty (fully restricted) model. Figure 4(a) shows the posterior distribution of the models given the data. Next to it, in figure 4(b), we find this distribution accumulated, ordering the models from larger to smaller probabilities. Just three models account for more than 90% of the distribution. Finally, figure 5 has on its x axis the different cardinalities of the sets of edges seen during the run of the Markov chain, and depicts the posterior distribution of the total number of edges given the data. As expected, this distribution takes an approximately normal shape, with the mean around the most probable number of edges.

The information gathered so far is twofold. On the one hand, it shows that the Markov chain converges, indicating that the search process is sound and that the conclusions can be trusted; on the other hand, it provides insight into the shape of the search space, giving a rough idea of how difficult it might be to devise the mechanism that generates the data. The assumption here is that the type of model that generates the data is an undirected graphical model. Under this assumption, the joint probability distribution of the variables decomposes into a product of potentials matching the conditional independence structure that the graph at hand encodes. Therefore, the conditional independencies we read off the graph are the building blocks of the mechanism that generates the data.

Figure 6 shows, from (a) to (c) in increasing probability, the three undirected graphical models that account for more than 90% of the probability distribution. These three models show that the page groups Internet (INT), Windows (WIN), and Office (OFF) separate Catalog (CAT) and Entertainment (ENT) from everything else in each of the graphs.
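For illustration, the diagnostic in figure 3(a) can be reproduced from the trace of the sketched sampler as follows (X and mc3 are the assumed matrix and sketch from above):

```python
import numpy as np
import matplotlib.pyplot as plt

trace = mc3(X, n_iter=100_000)          # edge sets visited by the chain
sizes = np.array([len(g) for g in trace])
running_mean = np.cumsum(sizes) / np.arange(1, len(sizes) + 1)

plt.semilogx(running_mean)              # running iteration on a log-scale
plt.xlabel("iteration")
plt.ylabel("average number of edges")
plt.show()
```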

Figure 3. Convergence diagnostics.

Figure 4. Posterior and cumulative posterior distribution of the models.

Figure 5. Probability distribution of the cardinalities of the edges.

Figure 6. Three most probable models given the data. They differ in how hits on pages under Initials and Entertainment interact with hits on pages under Windows and Office.


Thus

{CAT, ENT} ⊥ V \ {CAT, ENT, INT, WIN, OFF} | {INT, WIN, OFF}.

This means that, to understand the behavior of the users in the pages under Catalog and Entertainment, it suffices to consider what the users do in the pages under these groups plus the pages under Internet, Windows, and Office. Nothing else influences their behavior in Catalog and Entertainment. This does not mean that there is no correlation between, for instance, Entertainment and Development, but it does mean that this correlation becomes irrelevant in light of what happens on the pages under Internet, Windows, and Office.

To fully understand why this is so, it would be necessary to know the exact layout of all the pages of the web site, since this might be an effect of the structure of the web site itself. Let us assume, however, that it has to do with user tastes and preferences. For instance, assume that we would like to increase the hits on certain pages under the group Entertainment. Since what users do in Entertainment is related to everything else through their behavior in Windows, Internet, and Office, it makes sense to concentrate efforts on increasing the hits to Entertainment by making design decisions in the pages belonging to the three groups Windows, Internet, and Office. Of course, major global changes in the structure of the web site will change the way users navigate, but we are discussing the implementation of less drastic options.

The conclusions we draw are not, in any case, the result of a causal interpretation of the associations. They are rather the result of using the notion of conditional independence to make reasonable assumptions about how our data is generated, in order to shed some light on decisions we may need to make. The scope of this data limits the conclusions we can draw. Nevertheless, we believe that this methodology could be applied more profitably to more complete data, where hits are linked to user profiles or marketing campaigns on the web.

Since the MCMC Bayesian model search approximates the posterior of the model given the data, it is possible, at the same time, to evaluate the probability of a particular edge given the data, by averaging over those models from the posterior distribution that contain the edge (a computational sketch follows). That probability indicates how likely the edge is to be present in the conditional independence graph. It will mostly agree with the ranking of the odds ratios; however, because an edge plays a global role as part of a path, its probability may disagree in some cases. In Table 1, we find the approximation to the posterior distribution of the edges given the data. The edges clearly fall into two groups, one above the threshold of 0.8 and another below. The significant pairwise associations detected in the frequentist analysis of the previous section basically agree with the edges in the higher-probability group.
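A sketch of this computation from the trace of the sampler above (the burn-in of 10,000 iterations is chosen in line with figure 3; names are ours):

```python
from collections import Counter
from itertools import combinations

kept = trace[10_000:]                       # discard burn-in
# posterior model probabilities: visit frequencies of each graph
p_model = {g: n / len(kept) for g, n in Counter(kept).items()}
# posterior edge probabilities: share of sampled graphs containing the edge
edge_counts = Counter(e for g in kept for e in g)
p_edge = {e: edge_counts[e] / len(kept)
          for e in combinations(range(8), 2)}
```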

5. Concluding remarks and discussion

In this paper, we have shown the use and advantages of an MCMC learning method in the context of undirected graphical models, and we have described its application to modeling association structures in web mining. We looked at the estimated pairwise odds ratios and the associated confidence intervals. The results were confirmed in a more formal analysis, using graphical association models. We also demonstrated that Bayesian Markov chain Monte Carlo techniques can be a useful and valid tool to identify association rules in web mining.

Table 1. Probability distribution of the edges given the data.

Edge                           p(edge | data)
INTERNET−CATALOG               9.999600e-01
INTERNET−ENTERTAINMENT         9.999100e-01
DEVELOPMENT−INTERNET           9.998800e-01
INTERNET−OFFICE                9.998600e-01
INITIALS−DEVELOPMENT           9.998200e-01
OFFICE−PROGRAM                 9.993100e-01
INTERNET−WINDOWS               9.990900e-01
INTERNET−PROGRAM               9.990800e-01
CATALOG−WINDOWS                9.990700e-01
OFFICE−WINDOWS                 9.990500e-01
WINDOWS−PROGRAM                9.989600e-01
WINDOWS−DEVELOPMENT            9.951900e-01
OFFICE−CATALOG                 9.616900e-01
ENTERTAINMENT−OFFICE           9.154200e-01
WINDOWS−ENTERTAINMENT          8.925100e-01
WINDOWS−INITIALS               8.634300e-01
CATALOG−DEVELOPMENT            1.079600e-01
CATALOG−PROGRAM                3.790000e-02
DEVELOPMENT−ENTERTAINMENT      4.780000e-03
OFFICE−DEVELOPMENT             3.880000e-03
ENTERTAINMENT−INITIALS         2.990000e-03
INTERNET−INITIALS              5.600000e-04
PROGRAM−DEVELOPMENT            5.200000e-04

Although computationally more intensive to implement and test, the Bayesian analysis is advantageous in the interpretation of the results. The comparison of the different methodologies on our data shows many similarities, and this brings more confidence in the results. From a subject-matter viewpoint, most of the associations we found were not known a priori and, indeed, may not seem very obvious. However, the problem of interest to the providers of this data set is exactly how to find such unknown associations, and this is what makes it a difficult data mining problem.

If the aim of the inference is heavily focused on quantitative learning aspects, we suggest considering the reversible jump approach of Giudici and Green (1999). It performs MCMC model determination jointly over the model and the parameter space and, therefore, makes it possible to draw more inferences on parameters of interest, such as odds ratios. On the other hand, such an algorithm is certainly more difficult to implement and test. Of course, a more computationally efficient approach would be to use some form of greedy search.


In that setting, one does not approximate the posterior distribution on the space of models, but rather performs a search for those models that fulfill certain heuristic criteria. Then, in order to compare competing models, one should compute some sort of utility, such as divergence measures or prediction accuracy, which may prove useful in some situations. Such procedures, besides being heuristic, are not as intuitive as a probability distribution.

In addition, we believe that our work shows that the associations present between visits to web pages can be a useful tool for understanding consumer behavior on the web. They may be used, for instance, to classify clients into homogeneous groups and to optimize the design of the web site.

We now turn our attention to the discussion of the paper at the Pavia conference. Both discussants, Alessandro Zanasi (IBM KDD Centre, Bologna) and Tomas Kocka (University of Economics, Prague), highlighted the drawbacks of the decisions taken during our preprocessing step. We believe that treating web pages and user hits as a contingency table is the key step that allows us to use all the existing statistical machinery for the analysis of contingency tables. The number of web pages over which we would like to draw conclusions grows dramatically, because the web itself is an unbounded source of information. This translates into a huge degree of sparseness in the contingency tables we build. In such a situation, two solutions are possible: the one we applied, namely grouping pages; or developing computational techniques that can cope with sparse contingency tables and still produce reliable analyses. We adopted the first solution, but we fully share the interest in the second. One of the main reasons, in the context of graphical models, to try to work with individual pages is that they more closely model the mechanism that generates the data.

A second point raised by the discussants was that we lose the information about the order in which pages have been visited. This is a consequence of our decision to arrange the web information into a contingency table. No doubt the order in which pages are visited is highly valuable information, but mining such knowledge would require a completely different approach. That is a topic for further research.

Further, we agree with the discussants that MCMC techniques are computationally demanding and that speeding up convergence would alleviate this aspect. However, from the point of view of the decision maker who would use the methodology we propose, getting the right answer is more important than speed. Therefore, we suggest that developing more accurate convergence diagnostics should take precedence over speeding up convergence.

As pointed out by Zanasi, the explosion of commercial transactions through the internet, in business-to-business rather than business-to-client settings, increases the relevance of adapting existing data analysis tools for web data and of creating new ones. In this virtual setting, where switching providers is as easy as clicking on a different corner of the web browser, customer relationship management (CRM) becomes crucial to business activity. Thus, we agree with the view of the discussant that research should be focused on delivering web mining technology for CRM purposes.

Finally, our methodology could easily be applied to similar customer-behavior problems, such as forecasting TV audiences and market basket analysis. Papers on these issues can be found at http://www.baystat.it/giudici/index.html.
Further methodological research is needed, particularly to find ways to speed up practical convergence of the MCMC algorithm. On the other hand, an important direction of applied research is the development of appropriate software code to implement graphical models and MCMC model searches, so that they become routinely available.


Acknowledgments

We thank the discussants, Alessandro Zanasi and Tomas Kocka, as well as the referees, for their very useful comments and suggestions. We also thank Janet Hansen for the final editing of the paper. Finally, we would like to thank all participants at the Pavia conference for having contributed to a highly stimulating learning forum.

References

Breese, J., Heckerman, D., and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. New York: Morgan Kaufmann.

Brooks, S. 1998. Markov chain Monte Carlo method and its application. The Statistician, 47:69–100.

Giudici, P. and Castelo, R. 2001. Improving Markov chain Monte Carlo model search for data mining. Manuscript revised for publication.

Giudici, P. and Green, P. 1999. Decomposable graphical Gaussian model determination. Biometrika, 86:785–801.

Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R., and Kadie, C. 2000. Dependency networks for inference, collaborative filtering and data visualization. Journal of Machine Learning Research, 1:49–75.

Heckerman, D., Geiger, D., and Chickering, D. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243.

Lauritzen, S. 1996. Graphical Models. Oxford: Oxford University Press.

Madigan, D. and York, J. 1995. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York: Wiley.

Paolo Giudici is Associate Professor of Statistics at the Faculty of Economics of the University of Pavia. He received his Ph.D. in statistics at the University of Trento (Italy) in 1994. His research interests include Markov chain Monte Carlo methods, graphical Markov models, and statistical models for data mining.

Robert Castelo is a Ph.D. student in computer science at the University of Utrecht, The Netherlands.