Mining Web Access Logs of an On-line Newspaper

Paulo Batista and Mário J. Silva
Departamento de Informática, Faculdade de Ciências – Universidade de Lisboa
Campo Grande, 1749-016 Lisboa, Portugal
{pb,mjs}@di.fc.ul.pt

Abstract. With the explosive growth of data available on the Internet, personalization of this information space has become a necessity. An important component of web personalization is the automatic extraction of knowledge from web log files. However, the analysis of large web log files is a complex task that existing web access analyzers do not fully address. Using commercial software, we applied well-known data mining techniques (association rules and clustering) to analyze access log records collected on a web newspaper. This paper identifies several reading patterns and discusses approaches for mining this data.

1 Introduction

The evolution of the Internet has led to an enormous proliferation of available information, and the personalization of this information space has become a necessity. The knowledge obtained by learning web users' preferences can be used to improve the effectiveness of web sites by adapting their information structure to the users' behavior. Automatic knowledge extraction from web log files can be useful for identifying such reading patterns and inferring user profiles. However, it is hard to find appropriate tools for analyzing raw web log data to retrieve significant and useful information. There are several commercially available web log analysis tools [1], but most of them are disliked by their users and considered too slow, inflexible, expensive, difficult to maintain or very limited in the results they can provide [18].

Recently, the advent of data mining techniques for discovering usage patterns from web data (a.k.a. web log mining or web usage mining) has made it possible to mine typical user profiles from the vast amount of access logs. Web usage mining can be viewed as the extraction of usage patterns from access log data containing the behavior characteristics of users. It can help address some of the shortcomings of the standard approaches to web personalization. The discovery of patterns from usage data is not by itself sufficient for performing the personalization tasks, but it is an important component for the effective derivation of "actionable" user profiles from web access patterns [11]. The learned knowledge can also be used for other applications, such as improving site usability, business intelligence, and usage characterization.

In this paper, we present initial work on web usage mining, describing the use of data mining techniques to analyze web log records collected from Publico On-Line [15], a web newspaper. Using commercial data mining software (SPSS Clementine [7] and IBM Intelligent Miner [9]), we have identified several web access patterns by applying association rule induction and clustering techniques to the access logs of this digital publication. The remainder of this paper is organized as follows: in section 2, we present the web usage mining concept. Our access log processing architecture is presented in section 3. Then, in section 4, we present and analyze the results of the data mining work on Publico On-Line access data. Finally, section 5 summarizes our conclusions and presents directions for future work.

2 Mining Web Usage Data

Data mining efforts associated with the Web, called Web Mining, can be broadly categorized into three areas of interest based on which part of the Web is mined: web content mining, web structure mining, and web usage mining [10]. Web content mining focuses on techniques for searching the web for documents whose content meets web users' queries. Web structure mining is related to the analysis of the link structure of the web, in order to identify relevant documents. Web usage mining is defined as the process of applying data mining techniques to the discovery of usage patterns from web log data, to identify web users' behavior [16]. Web content and structure mining are beyond the scope of this work, although we used some form of structure mining in our preprocessing phase.

In Web mining, data can be collected at the server side, at the client side, at proxy servers, or from a consolidated web/business database. In [16], the authors present a more detailed description of these data sources. To summarize, (i) Web server logs explicitly record the browsing behavior of site visitors, (ii) client-side data collection can be implemented by using a remote agent or by modifying the source code of an existing browser, and (iii) Web proxies act as an intermediate level of caching between client browsers and Web servers.

The information provided by the data sources described above can be used to construct several data abstractions, namely users, page-views, click-streams, and server sessions [17]. A user is defined as a single individual who is accessing files from Web servers through a browser. In practice, it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one browser at a time. A page-view consists of every file that contributes to the display on a user's browser at one time and is usually associated with a single user action such as a mouse-click. A click-stream is a sequential series of page-view requests.
Note that any page-view accessed through a client or proxy-level cache will not be recorded on the server side. A server session (or visit) is the click-stream for a single user for a particular Web site. The end of a server session is defined as the point when the user's browsing session at that site has ended.

The process of Web usage mining can be divided into three phases: preprocessing, pattern discovery, and pattern analysis [16]. Preprocessing consists of converting usage information contained in the various available data sources into the data abstractions necessary for pattern discovery. Another task is the treatment of outliers, errors, and incomplete data that can easily occur due to reasons inherent to web browsing. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users, and only the IP address, agent, and server-side click-stream are available to identify users and server sessions. However, it is important to notice that the data collected by server logs may not be entirely reliable because some page-views may be cached by the user's browser or by a proxy server. In a Web server log, all requests from a proxy server have the same identifier, even though the requests potentially represent more than one user. Also, due to proxy server level caching, the response to a single request to the server could actually be viewed by multiple users over an extended period of time. The Web server can also store other kinds of usage information such as cookies, which are markers generated by the Web server for individual client browsers to automatically track the site visitors. After each user has been identified (through cookies, logins, or IP/agent analysis), the click-stream for each user must be divided into sessions. As we cannot know when the user has left the Web site, a timeout is often used as the default method of breaking a user's click-stream into sessions.

The next phase is the pattern discovery phase. Methods and algorithms used in this phase have been developed in several fields such as statistics, machine learning, and databases.
This phase of Web usage mining has three main operations of interest: association (i.e. which pages tend to be accessed together), clustering (i.e. finding groups of users, transactions, pages, etc.), and sequential analysis (the order in which web pages tend to be accessed). The first two are the focus of our ongoing work.

Pattern analysis is the last phase in the overall process of Web usage mining. In this phase the motivation is to filter out uninteresting rules or patterns found in the previous phase. Visualization techniques are useful to help application domain experts analyze the discovered patterns.
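The preprocessing step described above (identifying users from IP address and agent, then cutting each click-stream with a timeout) can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(ip, user_agent, timestamp, url)` and the function name `split_sessions` are hypothetical, users are approximated by the (IP, agent) pair, and the 30-minute default is simply a commonly used value.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Group requests by (ip, agent) and cut each click-stream on idle gaps.

    records: iterable of (ip, user_agent, timestamp, url) tuples
             (hypothetical layout, not taken from the paper).
    Returns a list of (user_key, [(timestamp, url), ...]) sessions.
    """
    # Assumption: the (ip, agent) pair stands in for a user identity.
    streams = defaultdict(list)
    for ip, agent, ts, url in records:
        streams[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in streams.items():
        hits.sort(key=lambda h: h[0])  # order the click-stream by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            # An idle gap of at least `timeout` closes the current session.
            if hit[0] - prev[0] >= timeout:
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions
```

Because proxies and caches hide some requests, sessions obtained this way are an approximation; cookies or logins, where available, would give a more reliable user key.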

3 Access Logs Processing Architecture

Publico On-Line is a daily online newspaper. Each edition is built by a generation program that collects all articles, applies formats and constructs a navigable Web structure with articles grouped in thematic sections. We have defined a general architecture for web access mining (see Figure 1), using the site's Web server logs as data source. The preprocessing phase includes initial preparation tasks that are carried out by a processing agent system [13]. This system performs the following tasks: noise filtering (i.e. removing irrelevant data such as access errors or image requests), session identification, and storage in a repository.

[Figure 1 components: Site Files; Server Logs; Preparation tasks (Data Cleaning, User and Session Identification); access repository; Data Preprocessing (Session Identification, Format transformation, Usage Statistics); Session Files; Usage Mining (Statistical Analysis: data distribution estimation; Frequent Item Sets Discovery; Clustering: session clusters); Web Access Patterns & User Profiles]

Fig. 1. Overview of a general architecture for Web Access Mining.

Session identification consists of grouping all page-view records from a given IP address collected during user activity periods (we define inactivity as a period of 30 minutes or longer for which we have no registered accesses to the Web server). For each valid page-view (a news article) the agent assigns the corresponding news section based on site structure information present in the page's URL. The conceptual schema of the repository is illustrated in Figure 2. Each article (artigo) is associated with one section (seccao) and a reader (cliente) accesses one or more articles during a session. The sessions identified by the processing agent as described above are called short sessions. When allowed by the user agent, the web server also registers a cookie that is accepted by that agent. We call the set of short sessions that share the same cookie a long session (the accumulation of the user access transactions grouped by cookie).

Fig. 2. Conceptual schema of the access record repository: readers (cliente) access articles (artigo) during a session (sessao); each article belongs to a section (seccao).

To adapt this data to the data structures of the data mining algorithms used, we transformed the access log tables into numerical and Boolean matrices, where each column corresponds to a newspaper section and each row represents a session. In numerical matrices, each cell contains the number of articles accessed for that (session, section) pair; in Boolean matrices a cell is True when at least one article is accessed in that (session, section) pair. We examined the aggregated data matrices through a set of basic statistical functions that help in obtaining a preliminary view of the data. For numeric variables we observed the maximum, minimum, mean, and standard deviation; for Boolean variables we obtained the frequencies (see Figure 3). These statistics show that the matrices are very sparse, that is, each session contains a small number of articles and a small number of sections accessed. For example, 82.8% of the sessions contain no articles from the Science section, and the remaining 17.2% contain an average of 2.3 Science articles.
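The transformation of session records into numerical and Boolean matrices can be illustrated with a short sketch. The input layout (one list of section names per session, with one entry per article read) and the function name `build_matrices` are assumptions made for this example; the repository schema used in the paper is relational and is not reproduced here.

```python
from collections import Counter

# Section names as they appear in Figure 3.
SECTIONS = [
    "CIENCIAS", "CULTURA", "DESPORTO", "ECONOMIA", "EDUCACAO",
    "INTERNACIONAL", "LOCAL_LISBOA", "LOCAL_PORTO", "POLITICA", "SOCIEDADE",
]


def build_matrices(sessions, sections=SECTIONS):
    """Return (numerical, boolean) matrices, one row per session.

    sessions: list of lists of section names, one entry per article accessed
              (hypothetical input layout).
    """
    numerical = []
    for accessed in sessions:
        counts = Counter(accessed)
        # Counter returns 0 for sections the session never touched.
        numerical.append([counts[s] for s in sections])
    # A Boolean cell is True when at least one article was accessed.
    boolean = [[n > 0 for n in row] for row in numerical]
    return numerical, boolean
```

With ten sections and sessions that typically touch only one or two of them, most cells are zero (or False), which is the sparsity noted above.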

Numerical matrix:

Name            Minimum Value  Maximum Value  Mean     Standard Deviation
CIENCIAS        1              97             2.30343  2.81841
CULTURA         1              208            3.78779  5.97421
DESPORTO        1              318            5.69846  10.836
ECONOMIA        1              258            3.93347  7.23418
INTERNACIONAL   1              208            3.38226  5.55397
LOCAL_LISBOA    1              460            5.68833  11.5647
LOCAL_PORTO     1              256            7.59835  13.2351
POLITICA        1              208            3.35767  5.41012
SOCIEDADE       1              367            4.26733  7.9853
EDUCACAO        1              90             2.64958  3.29088

Boolean matrix:

Name            Modal Value  Modal Frequency (%)
CIENCIAS        F            82.80
CULTURA         F            83.12
DESPORTO        F            67.87
ECONOMIA        F            76.84
EDUCACAO        F            84.98
INTERNACIONAL   F            69.59
LOCAL_LISBOA    F            77.81
LOCAL_PORTO     F            86.47
POLITICA        F            70.86
SOCIEDADE       F            71.70

Fig. 3. Analysis of short sessions. The dominant values of both the numerical matrix (top) and the Boolean matrix (bottom) show that most users access a very small number of articles. Identical results were obtained for long sessions.
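Summary statistics of the kind reported in Figure 3 could be computed along the following lines. This sketch rests on two assumptions that the paper does not state explicitly: that the numerical statistics are taken only over sessions with at least one access to the section (suggested by the minimum value of 1 in every row), and that the standard deviation is the population form. The function names are hypothetical.

```python
from statistics import mean, pstdev


def numeric_summary(numerical, col):
    """Min, max, mean and standard deviation for one section column.

    Assumption: only sessions that accessed the section (count > 0) are used.
    Returns None when no session accessed the section.
    """
    values = [row[col] for row in numerical if row[col] > 0]
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),  # population form; the paper does not say which
    }


def modal_frequency(boolean, col):
    """Most frequent Boolean value in a column and its share in percent.

    Ties are reported as False. Returns None for an empty matrix.
    """
    if not boolean:
        return None
    true_count = sum(1 for row in boolean if row[col])
    false_count = len(boolean) - true_count
    if true_count > false_count:
        return True, 100.0 * true_count / len(boolean)
    return False, 100.0 * false_count / len(boolean)
```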

4 Mining Publico On-Line Access Data

To study the identification of associations between sections, we used typical data mining modelling operations. We view our problem of analyzing patterns of access to groups of news sections as a Market Basket Analysis problem [4].

Discovery of frequent itemsets is one of the techniques used in this kind of problem. Its aim is to find groups of items that are frequently referred to together in transactions. In our problem, the transactions are the web accesses and the items are the news sections.

4.1 Discovering Frequent Itemsets

Groups of items occurring frequently together in many transactions are referred to as frequent itemsets [2]. Generally, a support threshold is specified before mining and is used by the algorithm for pruning the search space. The itemsets returned by the algorithm satisfy this minimum support threshold. We have identified frequent sets on Boolean data, defining weak associations as those below 5% of the total number of occurrences, and strong associations as those above 10%. We have chosen these values based on a previous study [6].
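To make the role of the support threshold concrete, the following is a minimal level-wise (Apriori-style) sketch in the spirit of [2]. It is not the implementation inside the commercial tools used in this work; the function name and the representation of each session as a set of section names are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets with support >= min_support.

    transactions: list of sets of items (here, sections seen in a session).
    min_support:  fraction of transactions, e.g. 0.05 for 5%.
    Returns a dict mapping frozenset -> support.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    result = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: cnt / n for c, cnt in counts.items()
                    if cnt / n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))]
    return result
```

The pruning step is what the threshold buys: an itemset can only be frequent if all of its subsets are, so raising `min_support` shrinks the candidate set at every level.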

Fig. 4. Discovery of frequent sets applied to short sessions. Strong associations have a heavy line.

Analysis of the results shows that strong associations on short sessions also exist on long sessions. This is an expected result, as long sessions accumulate accesses made in short sessions. For example, we have identified strong associations between Politics (Politica) and Society (Sociedade), Politics (Politica) and International News (Internacional), and between Society (Sociedade) and International News (Internacional), among other strong associations (see Figure 4).

4.2 Clusters Identification

Groups of sections obtained by frequent itemset analysis give us some interesting associations. However, they show "dependencies" among news sections independently of the type of user preferences. Identifying groups of users with identical preferences requires the extraction of a different kind of access pattern. We searched for groups of sessions (clusters) that were similar in the sections accessed. The data mining tools we used offered two approaches for clustering: demographic clustering (based on Euclidean similarity metrics) and neural clustering (namely Kohonen self-organizing maps) [9]. Figure 5 shows the largest clusters obtained in the analysis of numeric and Boolean short and long sessions, using both approaches.

[Figure 5 table. Columns: Session type (short, long); Measure data type (numerical, boolean); Demographic Clustering; Neural Clustering. Cluster values in source order:
Demographic Clustering: 61% all sections except International; 14% Sports; 13% Internat.; 75% all sections except Science; 13% Internat.; 12% Sports.
Neural Clustering: 15% Economy, Science, LocalPorto, Educat.; 15% Sports; 12% Internat.; 14% Sports; 13% Internat.; 12% Sports; 13% Internat.]

Fig. 5. Largest clusters for short and long sessions in Publico access log data.

Clustering on numerical data shows no evidence of clear reading patterns. We suspect that we have a very significant number of irregular sessions (outliers) with sporadic accesses without a defined pattern. This issue will be studied in future research. Both approaches for clustering Boolean data show similar reading patterns in short and long sessions. The most frequent clusters are those that group the accesses to Sports and International sections.

5 Conclusions and Future Work

In this paper we discussed the application of data mining technology to the analysis of access log records collected from a newspaper web site. Using commercial data mining software systems, we have identified and characterized several reading patterns within the news site. These patterns will define user profiles to be integrated into a news recommendation system based on web user preferences.

Frequent sets and clustering produce different patterns. Frequent sets show groups of sections that frequently occur together, independently of the user profiles, and clustering shows groups of sections that define similar web usage. Clustering of Boolean and numerical data leads to different results. While for Boolean data results are similar in both kinds of sessions and clustering approaches, we obtained different reading patterns for numerical data, or no reading patterns at all. We detected that a very significant number of sessions consist of a single page-view referred from a site external to the online newspaper. This may explain why we were able to identify patterns in Boolean data and had more difficulty when dealing with numerical data. This suggests that to find more interesting patterns it is necessary to remove these sessions from the repository.

The clustering results on numerical data may also be an outcome of Euclidean distance-based similarity measures that are not adequate for mining our web access data. Previous research indicated that access data to digital libraries follows a Zipf-like distribution [14]. Commonly used clustering algorithms, such as K-means, were developed for data samples from Gaussian populations [3]. As future work, we plan to study more appropriate methods for analyzing web log data, using different similarity metrics (Minkowski distances, cosine measure and extended Jaccard similarity), and taking into account the data distribution function.
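For reference, the three families of measures named above can be written down in a few lines using their standard textbook definitions. The function names are hypothetical, session rows are taken to be equal-length numeric vectors, and returning 0.0 when a vector is all zeros is a convention chosen for this sketch.

```python
import math


def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; insensitive to vector length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # convention for all-zero vectors


def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0  # convention for all-zero vectors
```

Two sessions that read the same sections in the same proportions but in different volumes are far apart under Euclidean distance yet have cosine similarity 1, which illustrates why the choice of measure matters for heavily skewed count data.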

References

1. Access Log Analyzers, http://www.uu.se/Software/Analyzers/Accessanalyzers.html
2. R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, Proc. of the 20th VLDB Conference, 1994.
3. P. S. Bradley, U. M. Fayyad, Refining Initial Points for K-Means Clustering, Proc. of the 15th International Conference on Machine Learning, Morgan Kaufmann, 1998.
4. M. Berry, G. Linoff, Data Mining Techniques - For Marketing, Sales and Customer Support, John Wiley & Sons, 1997.
5. B. F. J. Manly, Multivariate Statistical Methods, Chapman & Hall, 1986.
6. P. Batista, M. Silva, Prospecção dos Dados de Acesso a um Servidor de Notícias na Web, 2ª Conferência sobre Redes de Computadores, Portugal, Outubro 1999.
7. Clementine User Guide, Version 5, Integral Solutions Limited, 1998.
8. R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, 1(1), 1999.
9. Using the Intelligent Miner for Data, IBM Corporation, 1998.
10. R. Kosala, H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations, 2(1), July 2000.
11. B. Mobasher, H. Dai, T. Luo, N. Nakagawa, Y. Sun, J. Wiltshire, Discovery of Aggregate Usage Profiles for Web Personalization, Proc. of the Web Mining for E-Commerce Workshop (WebKDD'2000), August 2000.
12. B. Mobasher, R. Cooley, J. Srivastava, Automatic Personalization Based on Web Usage Mining, Communications of the ACM, 43(8), 2000.
13. N. Maria, P. Gaspar, N. Grilo, A. Ferreira, M. Silva, ARIADNE - Digital Library Architecture, Proc. of the Second European Conference on Research and Advanced Technology for Digital Libraries, Springer, 1998.
14. J. E. Pitkow, Summary of WWW Characterization, WWW Journal, 2(1), 1999.
15. Publico On-Line, http://www.publico.pt
16. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, 1(2), Jan 2000.
17. WWW Committee Web Usage Characterization Activity, http://www.w3.org/WCA, Web Characterization Terminology & Definitions Sheet, W3C Working Draft, May 1999.
18. O. R. Zaiane, M. Xin, J. Han, Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs, Proc. of Advances in Digital Libraries Conference (ADL98), April 1998.