Discovering Rich Navigation Patterns on a Web Site
Karine Chevalier 1,2, Cécile Bothorel 1, and Vincent Corruble 2
1 France Telecom R&D (Lannion), France {karine.chevalier, cecile.bothorel}@rd.francetelecom.com
2 LIP6, Pole IA, Université Pierre et Marie Curie (Paris VI), France {Karine.Chevalier, Vincent.Corruble}@lip6.fr
Abstract. In this paper, we describe a method for discovering knowledge about users on a web site from data composed of demographic descriptions and site navigations. The goal is to obtain knowledge that is useful to answer two types of questions: (1) how do site users visit a web site? (2) Who are these users? Our approach is based on the following idea: the set of all site users can be divided into several coherent subgroups; each subgroup shows both distinct personal characteristics, and a distinct browsing behaviour. We aim at obtaining associations between site usage patterns and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. This knowledge characterizes a precise web site usage and can be used in several applications: prediction of site navigation, recommendations or improvement in site design.
1 Introduction
The World Wide Web is a powerful medium through which individuals or organizations can convey all sorts of information. Many attempts have been made to find ways to automatically describe web users (or, more generally, Internet users) and how they use the Internet. This paper focuses on the study of web users at the level of a given web site: are there several consistent groups of site users based on demographic descriptions? If so, does each group show a distinct way of visiting the web site? These questions are important for site owners and advertisers, but also from a social research perspective: it is interesting to test whether there is some dependence between demographic descriptions and ways of navigating a site.
Our project addresses the discovery of knowledge about users and their different site usage patterns for a given site. We aim at obtaining associations between site usage patterns (through navigation patterns) and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. These patterns highlight, on a given site, the different ways in which specific groups of users (users that share similar personal descriptions) visit the site. Our aim is to test the assumption that there are links between navigations on a site and users’ characteristics, and to study the relevance of correlating these two very different types of data. If our results confirm that there are relations between users’ personal characteristics and site navigations, this knowledge would help to describe site visitors in a rich manner and open avenues for assisting site navigation. For instance, it can be helpful when we want to assist a user about whom no information (i.e. personal information or previous visits to the site) is known. Based on his current navigation on a web site and a set of rich navigation patterns obtained beforehand, we can infer some personal information about him. We can then recommend documents adapted to his inferred profile.
This paper is organized as follows: the second section presents some tools and methods to understand and describe site users, the third section describes rich navigation patterns and a method to discover them, and the fourth section presents the evaluations performed on the rich navigation patterns extracted from several web sites.
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 62–75, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Knowledge Acquisition about Site Users
There are many methods to measure and describe the audience and the traffic on a site. A first way to know the users who access a given site is to use surveys produced by organizations like NetValue [13], Media Metrix [12] and Nielsen//NetRating [14]. Their methods consist in analyzing the Internet activities of a panel over a long period of time and inferring knowledge about the entire population. This is a user-centric approach. Panels are built to represent the current population of Internet users as well as possible. Some demographic data on each user of the panel (such as age, gender or level of Internet practice) are collected, and all their activities on the Internet are recorded. The analysis of these data provides a qualification and quantification of Internet usage. We will consider here only the information related to users and web usage. This approach gives general trends: it indicates, for instance, who uses the web and what sort of sites they visit, but no processing is performed to capture precise usage patterns on a given site. In the best case, site owners can get a description of their site users, but this holds only for sites with a large audience; other sites have little chance of having enough of their typical users within the panel to obtain a meaningful description of them. We can point out several interesting aspects of the user-centric approach. Firstly, these methods are based on the extrapolation of observations made on a panel of users to the entire set of users. This means that it can be sufficient to analyse only a subset of users. Secondly, the approach relies on the assumption that there are links between some features of a user's profile and their Internet usage. The second way to know the users who access a given site is to perform an analysis at the site level. This is the site-centric approach.
It consists in collecting all site navigations and then analysing these data in order to obtain traffic measures for the site and to retrieve the statistically dominant paths or usage patterns from the set of site sessions. A session corresponds to one visit of a user to the site; it can be considered as a sequence of pages in chronological order. Users' sessions are extracted from log files that contain all HTTP requests made on the site. Further information on problems and techniques for retrieving sessions from log files can be found in [5]. Many industrial tools (WebTrends [16], Cybermetrie [7]) implement the site-centric approach.
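To make the notion of a session concrete, the following minimal Python sketch groups already-parsed requests into chronological page sequences. It is only an illustration: the paper defers session reconstruction to [5], and the inactivity threshold `SESSION_GAP_SECONDS`, the function name `build_sessions` and the `(user_id, timestamp, page)` input format are all assumptions made here, not part of the described method.

```python
from collections import defaultdict

# Assumption: a pause longer than this (in seconds) starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def build_sessions(requests, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, page) requests into per-user page sequences."""
    by_user = defaultdict(list)
    for user_id, timestamp, page in requests:
        by_user[user_id].append((timestamp, page))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()  # chronological order within one user
        current, last_time = [], None
        for timestamp, page in hits:
            # Close the running session when the pause exceeds the gap.
            if last_time is not None and timestamp - last_time > gap:
                sessions.append((user_id, current))
                current = []
            current.append(page)
            last_time = timestamp
        if current:
            sessions.append((user_id, current))
    return sessions


# Hypothetical requests: timestamps are seconds since an arbitrary origin.
requests = [
    ("u1", 0, "/home"),
    ("u1", 60, "/quotes"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/advice"),
]
sessions = build_sessions(requests)
```

Here the third request of `u1` arrives more than 30 minutes after the second, so it opens a second session for that user.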
Here, we focus our attention on methods that automatically retrieve site usage patterns. Most of these methods are based on frequency measures: a navigation path is retrieved from the set of site sessions because it has a high probability of being followed [3][4]; a sequence of pages is selected because it appears frequently in the set of site sessions (the WebSPADE algorithm [8], an adaptation of the SPADE algorithm [17]); or a site usage pattern is revealed because it is extracted from a group of sessions that were brought together by a clustering method [11]. Cooley et al. suggest filtering the frequent page sets in order to keep the most relevant ones [6]. They consider a page set interesting if it contains pages that are not directly connected (there is no link between them and no similarity between their content). These methods capture precise site usage patterns in terms of the pages visited, but only common ones (site usage patterns shared by the greatest number of users). If a particular group of users shows a specific usage pattern of the web site but is not composed of enough users, its specific usage will not be highlighted. In that case, important information can be missed: particular (and significant) behaviours could be lost among all navigations. One way to overcome this limitation is to rely on some assumptions and methodologies of the user-centric approach described above. Firstly, we can assume that there are correlations between the users’ personal descriptions and the way they visit a given site. We can then build groups of users based on personal characteristics and apply the site usage pattern extraction to smaller sets of sessions in order to capture navigation patterns specific to subgroups of users. This strategy reveals site usage patterns that are less frequent but associated with a specific group of users who share similar personal descriptions.
We can then answer questions such as: “Do young people visit the same pages on a given site?” Secondly, in the same way that the user-centric approach extrapolates knowledge learned on a panel of Internet users to the entire set of Internet users, we could restrict our search to data coming from a subset of site users and interpret the results obtained on this subset as valid for all site users.
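The first idea above, grouping users by a personal characteristic before extracting usage patterns, can be sketched as follows. This is a simplified illustration rather than the extraction method of the paper: it only counts single frequent pages per group, and the names `frequent_pages_by_group`, `profiles`, `attribute` and `min_support`, as well as the sample data, are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_pages_by_group(sessions, profiles, attribute, min_support):
    """Return, for each value of `attribute`, the pages frequent in that group.

    sessions: list of (user_id, pages); profiles: user_id -> characteristics.
    A page is kept when it occurs in at least `min_support` (a fraction)
    of the sessions of the group.
    """
    grouped = defaultdict(list)
    for user_id, pages in sessions:
        grouped[profiles[user_id][attribute]].append(set(pages))

    result = {}
    for value, group_sessions in grouped.items():
        counts = Counter(page for pages in group_sessions for page in pages)
        threshold = min_support * len(group_sessions)
        result[value] = {p for p, c in counts.items() if c >= threshold}
    return result


# Hypothetical reference users and sessions.
profiles = {
    "u1": {"age": "15-25"}, "u2": {"age": "15-25"},
    "u3": {"age": "26-40"}, "u4": {"age": "26-40"},
    "u5": {"age": "26-40"}, "u6": {"age": "26-40"},
}
sessions = [
    ("u1", ["/home", "/games"]), ("u2", ["/games"]),
    ("u3", ["/home", "/quotes"]), ("u4", ["/home", "/quotes"]),
    ("u5", ["/home"]), ("u6", ["/quotes", "/home"]),
]
by_age = frequent_pages_by_group(sessions, profiles, "age", min_support=0.6)
```

In this toy data set `/games` occurs in only two of the six sessions, so a global frequency threshold of 0.6 would discard it, yet it is present in every session of the 15-25 group and is therefore retained for that group.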
3 Discovering Rich Navigation Patterns
Our research project addresses the problem of knowledge discovery about a set of web site users and their site uses. We explore the possibility of correlating users' personal characteristics with their site navigation. Our objective is to provide a rich usage analysis of a site, i.e. usage patterns that are associated with personal characteristics and so offer a different, deeper understanding of the site usage. This has the following benefits:
• It provides the site manager with the means to understand his/her site users.
• It lets us envisage applications to personalization, such as navigation assistance to help new visitors.
We want to add meaning to site usage patterns, and find site usage patterns that are specific to a subgroup of site users. We explore the possibility of correlating user descriptions and site usage patterns. Our work relies on the assumption that navigation "behaviors" and users’ personal descriptions are correlated. If valid, this assumption has two consequences: (1) two users who are similar in socio-demographic terms navigate similarly on a web site; (2) two users who navigate similarly on a web site have similar personal descriptions.
Our approach supposes the availability of data that are richer than classical site logs. They are composed of site sessions and personal descriptions of reference users. Reference users form a subset of users who have agreed to provide a list of personal characteristics (such as age, job…) and some navigation sessions on the web site, i.e. they are used as a reference panel for the entire population of the web site. From these data, we wish to obtain knowledge that is specific to the web site from which the data are obtained. In an application, by using this knowledge, we can infer some personal information about a new visitor and propose page recommendations based on his navigation, even if he gives no personal information. We choose to build this knowledge around two distinct elements:
- A personal user characteristic is an element that describes a user in a personal way, for instance: age is between 15 and 25 years old, gender is male…
- A navigation pattern represents a site usage pattern. Navigation patterns are sequences or sets of web pages that occur frequently in users' sessions. For instance, our data show, on the boursorama.com site (a French stock market site), the following frequent sequence of pages: access to a page about quoted shares and, later on, consultation of a page that contains investment advice.
We call the association of both elements of knowledge a 'rich navigation pattern', i.e. a navigation pattern associated with personal user characteristics.
After describing our way of discovering navigation patterns in the next subsection, we detail the different rich navigation patterns that we want to learn, and finally we present a way to extract them from a data set composed of reference users' descriptions and their site sessions.
3.1 Discovering Navigation Patterns
Navigation patterns are sequences or sets of pages that occur frequently in session sets. We used an algorithm to retrieve frequent sets of pages that takes into account principles of algorithms such as FreeSpan [10] (PrefixSpan [15], WebSPADE [8] and SPADE [19]) that improve on Apriori [1]. These algorithms are based on the following idea: "a frequent set is composed of frequent subsets". Here, a session is considered as a set of pages. We chose to associate with each pageset the list of ids of the sessions in which the pageset occurs, in order to avoid scanning the whole set of sessions each time the support of a pageset has to be calculated [10][15][8][19].
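The benefit of attaching session ids to pagesets can be sketched in a few lines of Python: once each page carries the set of sessions in which it occurs, the sessions containing a larger pageset are obtained by intersecting these sets, without rescanning the sessions. The function names `build_session_index` and `pageset_sessions` and the sample sessions are assumptions made for this illustration.

```python
def build_session_index(sessions):
    """Map each page to the set of ids of the sessions in which it occurs."""
    index = {}
    for session_id, pages in sessions.items():
        for page in pages:
            index.setdefault(page, set()).add(session_id)
    return index


def pageset_sessions(pageset, index):
    """Ids of the sessions containing every page of the pageset."""
    id_sets = [index.get(page, set()) for page in pageset]
    return set.intersection(*id_sets) if id_sets else set()


# Hypothetical sessions, each considered as a set of pages.
sessions = {
    1: {"/home", "/quotes"},
    2: {"/quotes", "/advice"},
    3: {"/home", "/quotes", "/advice"},
}
index = build_session_index(sessions)
ids = pageset_sessions({"/quotes", "/advice"}, index)
support = len(ids)  # number of sessions containing both pages
```

The single pass in `build_session_index` is the only scan of the session set; every later support computation works on the id sets alone.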
Table 1. Initialisation phase
for each session s in S do
    for each page pg ∈ s do
        Add s to the session set of the page pg.
L1 = { }
for each page pg do
    if numberUser(pg) > minOccurrence then
        L1 = L1 ∪ {pg}
return L1
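The initialisation phase of Table 1 can be transcribed into runnable Python roughly as follows. This is a sketch under assumptions: Table 1 does not say how numberUser is computed, so the code counts the distinct users owning the sessions of a page through a hypothetical `session_user` mapping, and the function also returns the per-page session sets for later reuse.

```python
def initialise(sessions, session_user, min_occurrence):
    """Build L1, the set of frequent single pages.

    sessions: session_id -> iterable of pages
    session_user: session_id -> user_id (assumed mapping, not in Table 1)
    """
    # First loop of Table 1: attach to each page the sessions containing it.
    page_sessions = {}
    for session_id, pages in sessions.items():
        for page in pages:
            page_sessions.setdefault(page, set()).add(session_id)

    def number_user(page):
        # Assumed reading of numberUser: distinct users who visited the page.
        return len({session_user[s] for s in page_sessions[page]})

    # Second loop of Table 1: keep pages seen by more than min_occurrence users.
    l1 = {page for page in page_sessions if number_user(page) > min_occurrence}
    return l1, page_sessions


# Hypothetical data: user u1 owns sessions 1 and 2, user u2 owns session 3.
sessions = {1: ["/home", "/quotes"], 2: ["/home"], 3: ["/home", "/advice"]}
session_user = {1: "u1", 2: "u1", 3: "u2"}
l1, page_sessions = initialise(sessions, session_user, min_occurrence=1)
```

In this example only `/home` is visited by more than one distinct user, so it is the only page kept in `l1`, even though `/home` appears in three sessions.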
An initialisation phase (Table 1) creates the frequent sets composed of one page. The session set S is scanned in order to build for each page pg a set that contains all sessions in which pg occurs. Then, only the web pages that appear in the navigations of more than minOccurrence users are kept in L1 (set of large 1-pagesets). A session is
Table 2. Building (k+1) pagesets
// Main loop
k = 1
while (|Lk| > 1) do
    Lk+1 = BuildNext(Lk)
    k++
end_while
// BuildNext(Lk):
Lk+1 = { }
i = 1
while (i