IEICE TRANS. INF. & SYST., VOL.E88–D, NO.5 MAY 2005
PAPER
Acquisition and Maintenance of Knowledge for Online Navigation Suggestions

Juan D. VELÁSQUEZ†a), Richard WEBER††b), Nonmembers, Hiroshi YASUDA†, and Terumasa AOKI†, Members
SUMMARY The Internet has become an important medium for effective marketing and efficient operations for many institutions. Visitors to a particular web site leave behind valuable information on their preferences, requirements, and demands regarding the offered products and/or services. Understanding these requirements online, i.e., during a particular visit, is both a difficult technical challenge and a tremendous business opportunity. Web sites that can provide effective online navigation suggestions to their visitors can exploit the potential inherent in the data such visits generate every day. However, identifying, collecting, and maintaining the necessary knowledge that navigation suggestions are based on is far from trivial. We propose a methodology for acquiring and maintaining this knowledge efficiently using data mart and web mining technology. Its effectiveness has been shown in an application for a bank's web site.
key words: knowledge discovery in databases, web mining, online suggestions, data mart
Manuscript received November 6, 2003.
Manuscript revised May 13, 2004.
† The authors are with the Research Center for Advanced Science and Technology, University of Tokyo, Tokyo, 153–8904 Japan.
†† The author is with the Center for Collaborative Research, University of Tokyo, Tokyo, 153–8904 Japan. On leave from the Department of Industrial Engineering, University of Chile.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1093/ietisy/e88–d.5.993
Copyright © 2005 The Institute of Electronics, Information and Communication Engineers

1. Introduction

During the last years, we have witnessed tremendous growth of business performed via the Internet. This has led to an explosion in the number of Internet sites and in the number of visits to these sites. Web Usage Mining (WUM) has contributed to the analysis and exploration of the information visitors leave behind when navigating through a particular web site. Many algorithms and systems have already been proposed for WUM [16], [18], [21], [24], [27], [30]. However, an adequate way of acquiring and maintaining the knowledge revealed by web mining applications still remains to be fully developed.

We present a methodology for continuous analysis of visits to a particular web site and for efficient maintenance of the obtained results. We propose a data mart [2], [4], [17] with two repositories in order to support our methodology. In the first one (Information Repository) we store web usage data and web site information, which serve as input for web mining analyses. In the second one (Pattern Repository) we store the results, which, together with domain-specific knowledge, constitute a knowledge base for online navigation suggestions. The suggested system allows efficient maintenance of the generated insights and consequently provides a platform for effective online navigation suggestions. While data marts have already been used to store web access data, the proposed Pattern Repository, which contains the knowledge discovered by web mining, and the domain-specific rule base are the main contributions of the present paper.

The remainder of this paper is structured as follows. Section 2 provides an overview of related work. Section 3 introduces the preprocessing techniques used. In Sect. 4 we describe the proposed system. Section 5 shows how data mining techniques can be applied to the Information Repository. Section 6 presents the application of our methodology to the analysis of visitor behavior in a bank's web site. Section 7 concludes this paper and points out further developments.

2. Related Work

The extraction of knowledge from the web and from its usage is a nontrivial problem, since it deals with a huge collection of heterogeneous, unlabelled, distributed, time-variant, semistructured and high-dimensional data [21]. To support the various tasks, new models to represent the data and to store the extracted information have been developed [17], together with new algorithms for knowledge discovery. These are called web mining techniques. Since the data sources to be analyzed can be very heterogeneous in nature, data standardization becomes necessary. This is typically done in an Information Repository of a data mart [17]. The data stored in this repository can then be analyzed by applying web mining techniques. In the following subsections we briefly review the current status of existing work related to the above mentioned steps.

2.1 Web Mining

In web mining, we apply data mining techniques in order to discover patterns from World Wide Web (WWW) data following the KDD (Knowledge Discovery in Databases) process [11], [26]. Consequently, we must consider three important steps: preprocessing, pattern discovery, and pattern analysis and interpretation [24].
The following common classification of web data is given in [16]:

Structure. Data that describes the internal web page structure. In general, these are HTML or XML tags, some of which contain information about hyperlink connections to other web pages.

Content. The web page content, i.e., pictures, free text, sounds, etc.

Usage. Data that describes a visitor's browsing behavior in a web site. Such data is stored in web log files.

User profile. Complementary data providing demographic information about the users.

Following the above categorization, web mining techniques can be grouped into three areas: Web Structure Mining (WSM), Web Content Mining (WCM), and Web Usage Mining (WUM). In the next sections, a short description of each area will be provided.

2.1.1 Web Structure Mining

By WSM, we mine the web's hyperlink structure (interdocument structure). A web site is represented by a graph of the links within it or between sites. This allows us to study facts like the popularity of a web page. For instance, a page is very important if it is referred to by numerous other pages on the web. The respective approaches are comparable to bibliographic citation systems: an important publication is cited often. A model of the web link structure provides a notion of hyperlinked communities [7]. It can be used, for example, in search engines, like Google or Altavista, in order to retrieve the most cited pages for a particular topic. Algorithms such as HITS and PageRank have been proposed for such systems [18]. An interesting application of WSM is in the context of a web warehouse [17], [19], where the completeness of web sites is studied in order to determine the access frequency of local links in the same server. The web warehouse is an information repository containing web page replications, allowing the identification of mirror sites.

2.1.2 Web Content Mining

The goal of WCM is to find useful information in the web content. In this sense, WCM is similar to Information Retrieval (IR) [1]. However, web content comprises not only free text, but also other objects such as pictures, sound, and movies. There are two main areas of WCM [18], [21]: the mining of document contents (web page content mining) and the improvement of content search in tools like search engines (search result mining).

2.1.3 Web Usage Mining

WUM is a rapidly growing field in terms of scientific as well as commercial aspects, due to its applicability to web personalization and the improvement of web sites. In general, WUM applies traditional data mining methods in order to analyze usage data. However, the special nature of web data makes modifications necessary. As the first step, special attention must be paid to data preprocessing, since, in many applications, the available data is incomplete [24]. Applying the sessionization process [9] corrects, in part, the problems detected and identifies the real sessions of a visitor. This preprocessing step returns the basic input data for the subsequent mining process. The goal of WUM is to discover patterns in usage data by applying different kinds of data mining techniques, such as statistical analysis, association rules, clustering, classification, sequential patterns, and dependency modeling [18], [24]. Applications of WUM can be grouped into two main categories: user modeling in adaptive interfaces, known as personalization, and the analysis of user navigation patterns in order to improve the web site structure. After performing WUM, an analysis of the discovered patterns becomes necessary. Here, the participation of an expert in the respective application domain is important. Data visualization tools can facilitate such analyses.

2.2 Data Mart Technology

The construction of a data mart consists of several steps [12], [17] to build the final information repository. The first step is the selection of the respective data sources. Next, the data repository model is created. There are two ways to create such a model: Cube and Star schemes. In both schemes, historical information can be loaded and consolidated. The ETL process (extraction, transformation and load) is applied for cleaning and processing the data [25]. It defines the strategy for data extraction from operational systems (creation of program code), data transformation (cleaning, summarization and consolidation), and data loading into the data mart. In the transformation step, the Data Staging Area [17], which can be created in the database system and which uses the engine's proprietary capabilities, is defined.

In the literature, data marts have already been proposed to analyze web data. For example, in [13] a three-dimensional Cube Model to store web access sessions was suggested. The authors used "Session", "Component", and "Attribute" as dimensions; this simplifies the representation of sequential session data and allows the explicit identification of web access sessions while maintaining the order of session components (i.e., web pages). Another approach proposes to improve least recently used (LRU)-based web server caching [2]. The authors applied data mining techniques to a specially designed data mart that stores web access data and reported enhanced web caching strategies based on their analysis.
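To make the three ETL stages concrete, the following minimal Python sketch illustrates extraction from a raw log file, transformation (cleaning and type conversion), and loading into a staging structure. It is only an illustration under an assumed log layout; all file, field, and function names are hypothetical and not taken from the cited works.

from datetime import datetime

def extract(path):
    # Extraction: read raw registers from the operational source.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(" ")

def transform(fields):
    # Transformation: cleaning, type conversion, consolidation.
    if len(fields) < 6:
        return None                      # malformed register: discard
    ip, timestamp, method, url, status, size = fields[:6]
    if not status.startswith("2"):
        return None                      # register with error code: discard
    return {
        "ip": ip,
        "time": datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S"),
        "method": method,
        "url": url,
        "bytes": int(size) if size.isdigit() else 0,
    }

def load(rows, staging_area):
    # Load: insert the consolidated registers into the staging area.
    staging_area.extend(r for r in rows if r is not None)

staging_area = []
load((transform(f) for f in extract("access.log")), staging_area)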
2.3 Online Suggestions

Making online suggestions [29] during a web site visit is related to various areas of current research, which will be described briefly.

An online recommendation system, which suggests commodities a visitor is likely to purchase, is presented in [14]. This application focuses on online shopping, where high-quality data concerning each visitor is available.

A real-time evolutionary algorithm for predicting the most probable page a current visitor is going to look at with his/her next click is described in [3]. While this approach is important for web prefetching (see, e.g., [8]) in order to improve web site performance, it is passive in the sense that it does not try to understand visitor requirements when making the next-page suggestions.

In [5], an intelligent approach was used to maintain rules for an adaptive system that provides navigation recommendations for users based on a short questionnaire. The user's answers and browsing behavior are compared with the patterns embedded in the rules, based on which the navigation recommendation is created.

In [22], a system that provides navigation suggestions is proposed based on conceptual clustering. The main algorithm used is called "Index Finder". It suggests new "index pages", i.e., pages with links to other pages. Based on these suggestions, the web master has the opportunity to improve the web site by adding index pages, which helps the users to find what they are looking for. A variation of the Index Finder is an approach that can be seen as a static (every user sees the same pages) and semiautomatic navigation suggestion system.

The research field of Adaptive Web Sites (see, e.g., [23]) seeks to automatically improve a site's organization and presentation by analyzing visitor access patterns. The underlying problem is that good web design is hindered by several facts, such as

• different visitors have distinct goals;
• the behavior of one visitor changes over time;
• a site grows over time, accumulating many pages and links without being restructured in a manner appropriate to current needs.

The respective research activities try to generate adaptive web sites in order to overcome the mentioned problems, for example by analyzing past visitor behavior and recommending new pages. Current systems make their suggestions off-line, i.e., they do not intervene in a current visit. In this sense, our proposal is different from existing approaches, since we make online suggestions of web pages that the current visitor can look at next.

3. Data Preprocessing

The data we use as input for the web mining step must be transformed beforehand. The respective operations depend on the nature of each of the data sources. Accordingly, we present, in the following sections, the transformation of web log files and web pages.

3.1 Web Logs Transformation

A Data Staging Area (DSA) [2], [17] was defined in order to process web log data. Logically, it is composed of two tables: Weblogs and Logclean (see Fig. 1).

Fig. 1 Data staging area model.

The first table contains data about the web log registers. It is loaded directly from the web log file using a Perl script. The second one contains the registers after the cleaning and transformation processes. The Weblogs table is loaded with the data on a visitor's session from the respective web logs, particularly IP address and agent, time stamp, embedded session IDs, method, status, software agents, bytes transmitted, and objects required (pages, pictures, movies, etc.). We only consider registers whose status code does not indicate an error and whose URL parameters link to web page objects. Other objects, such as pictures, sounds, and links, are not considered. From the web log files we must determine the sessions of each visitor by applying the sessionization process [9] (a sketch follows at the end of this subsection). Finally, the Logclean table contains the log registers grouped by IP and agent. The URL parameter is a number, because it uses the same codification as the dimension table "Pages" in the Data Mart. If the URL already exists in the Data Mart, its identifier number is inserted into the Logclean table. Otherwise, a new register is created in the dimension table "Pages" and its identifier is inserted into the Logclean table.
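The sessionization process itself is specified in [9]; the Python sketch below shows one common timeout-based reconstruction, grouping registers by (IP, agent) and closing a session once it would exceed the 30-minute limit used in Sect. 6.1. All names are hypothetical, not the authors' implementation.

from collections import defaultdict
from datetime import timedelta

MAX_SESSION = timedelta(minutes=30)       # longest visitor session (Sect. 6.1)

def sessionize(registers):
    # Group cleaned log registers (dicts with 'ip', 'agent', 'time', 'url')
    # into real sessions, one visitor being identified by (IP, agent).
    by_visitor = defaultdict(list)
    for r in sorted(registers, key=lambda r: (r["ip"], r["agent"], r["time"])):
        by_visitor[(r["ip"], r["agent"])].append(r)

    sessions = []
    for visits in by_visitor.values():
        current = [visits[0]]
        for r in visits[1:]:
            if r["time"] - current[0]["time"] > MAX_SESSION:
                sessions.append(current)  # session reached the 30-minute limit
                current = [r]
            else:
                current.append(r)
        sessions.append(current)
    return sessions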
• Word stems. Stems generated by removing suffixes.

After such filtering, we represent the web site by a vector space model [1], in particular, by vectors of words. Let R be the number of different words in a web site and Q the number of web pages. A vectorial representation of the web site is then a matrix M of dimension R × Q that contains the vectors of words in its columns:

M = (m_{ij}), \quad i = 1, \dots, R, \; j = 1, \dots, Q. \qquad (1)

Here, m_{ij} is the weight of word i on web page j. In order to estimate these weights, we use a variation of tf*idf-weighting, defined by

m_{ij} = f_{ij} \left( 1 + \frac{sw(i)}{TR} \right) \log \frac{Q}{n_i}, \qquad (2)

where f_{ij} is the number of occurrences of word i on web page j and n_i is the number of web pages that contain word i. Additionally, we propose to augment a word's importance if visitors search for that specific word. This is done by using sw (special words), an array of dimension R. Component i contains the number of times that a visitor asked for word i in the search process during a given period (e.g., one month). TR is the total number of times that visitors searched for words in the web site during the same period. If TR is zero, sw(i)/TR is defined as zero, i.e., if there has not been any word search, the weight m_{ij} depends only on the number of occurrences of words.

Using the above definitions, a web page vector is introduced, by which a page is represented by the weights of its words.

Definition 1 (Page Vector): WP^k = (wp_1^k, \dots, wp_R^k) = (m_{1k}, \dots, m_{Rk}) with k = 1, \dots, Q.

Based on this definition, we use the angle's cosine as a similarity measure between two page vectors [1]:

dp(WP^i, WP^j) = \frac{\sum_{k=1}^{R} wp_k^i \, wp_k^j}{\sqrt{\sum_{k=1}^{R} (wp_k^i)^2} \, \sqrt{\sum_{k=1}^{R} (wp_k^j)^2}}. \qquad (3)

We define dp_{ij} = dp(WP^i, WP^j) as the similarity between page i and page j of the web site.
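Equations (2) and (3) translate directly into code. The sketch below computes a single weight m_{ij} and the page similarity dp for plain Python lists; the argument names mirror the symbols above and are otherwise hypothetical.

import math

def weight(f_ij, n_i, Q, sw_i, TR):
    # Eq. (2): tf*idf variant; the special-words factor vanishes when TR == 0.
    boost = 1.0 + (sw_i / TR if TR > 0 else 0.0)
    return f_ij * boost * math.log(Q / n_i)

def dp(wp_i, wp_j):
    # Eq. (3): cosine of the angle between two page vectors WP^i and WP^j.
    dot = sum(a * b for a, b in zip(wp_i, wp_j))
    norm = math.sqrt(sum(a * a for a in wp_i)) * math.sqrt(sum(b * b for b in wp_j))
    return dot / norm if norm > 0 else 0.0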
4. The Proposed System

In this section we present the proposed system for online navigation suggestions. First we provide a general overview of our proposition. Then we explain each of its elements in detail.

4.1 Overview

The main idea of our proposition is to use an Information Repository, where we store the data to be analyzed, and a Knowledge Base containing the results of these analyses and additional domain knowledge from an expert, in order to suggest online navigation steps.

The Information Repository contains data describing the behavior of the web site visitors [2], [26] and the content of each web page. The respective data sources are the web log registers (dynamic information) and the text contained in the web pages (static information). The former data source is based on the structure given by the W3C†. The latter one is the web site itself, i.e., a set of web pages with its logical structure. Applying the ETL process to these data sources, the relevant information is loaded into the proposed Information Repository.

Then, data mining tools are applied to the data stored in the Information Repository in order to recognize visitor behavior. In the application presented below, we used Self-Organizing Feature Maps, but any other data mining tool could be employed as well. The proposed system is independent of the method used and offers the flexibility to work with different data mining techniques and their respective results. The generated patterns are stored in the "Patterns Repository", see Fig. 2.

Fig. 2 Discovery and maintenance of knowledge from the web.

With the collaboration of business experts, these patterns are analyzed and interpreted in order to derive online navigation suggestions. The result is expressed as "if-then-else" rules and stored in a "Rule Repository". The Patterns Repository together with the Rule Repository make up the Knowledge Base (KB) of the proposed architecture for online navigation suggestions.

The system shown in Fig. 2 has been implemented on a workstation with a Linux Red Hat 9 operating system and a relational database engine to support the proposed data marts (see Sect. 6 of this paper). Below, we describe each element of the proposed system in detail.

† Web logs are delimited text files as specified by RFC 2616, "Hypertext Transfer Protocol – HTTP/1.1", http://www.rfc-editor.org/rfc/rfc2616.txt
Fig. 3 Data mart model.

Fig. 4 A generic knowledge base structure.
4.2 Information Repository

The star query model [17] was selected to store the information generated by the transformation step. Figure 3 shows the composition of the tables. In the fact table, three measures are stored: the time spent on each page, the number of real sessions that visit a page, and the total bytes transmitted during the visitor session. These are additive measures, and using OLAP techniques [2], [4], [13] we can obtain statistics about visitor behavior. However, such analyses are only complementary to the use of the data mining tools proposed in this paper.

This star model is composed of six dimension tables. Time and Calendar contain the time stamp and date, respectively. Pages contains the URL and the text of each page. The table Pagedistance efficiently stores the distances between web pages. The Session table shows the IP and agent that identify a real session, Referrer contains the object that refers to the page, and finally, Method lists the access methods used. A sketch of how such a schema could be declared follows.
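Figure 3 is described only at the schema level; as an illustration, the star model could be declared as in the SQLite sketch below. All table and column names are hypothetical; only the three measures and the six dimensions are taken from the text.

import sqlite3

ddl = """
CREATE TABLE time_dim      (time_id INTEGER PRIMARY KEY, hh_mm_ss TEXT);
CREATE TABLE calendar_dim  (date_id INTEGER PRIMARY KEY, day DATE);
CREATE TABLE pages_dim     (page_id INTEGER PRIMARY KEY, url TEXT, page_text TEXT);
CREATE TABLE session_dim   (session_id INTEGER PRIMARY KEY, ip TEXT, agent TEXT);
CREATE TABLE referrer_dim  (referrer_id INTEGER PRIMARY KEY, object TEXT);
CREATE TABLE method_dim    (method_id INTEGER PRIMARY KEY, name TEXT);
-- Pagedistance stores the similarity dp(i, j) of Eq. (3) for pairs of pages.
CREATE TABLE page_distance (page_i INTEGER, page_j INTEGER, dp REAL);
-- Fact table: one row per page request, with the three additive measures.
CREATE TABLE visit_fact (
    time_id     INTEGER REFERENCES time_dim,
    date_id     INTEGER REFERENCES calendar_dim,
    page_id     INTEGER REFERENCES pages_dim,
    session_id  INTEGER REFERENCES session_dim,
    referrer_id INTEGER REFERENCES referrer_dim,
    method_id   INTEGER REFERENCES method_dim,
    seconds_spent     REAL,
    real_sessions     INTEGER,
    bytes_transmitted INTEGER
);
"""
sqlite3.connect(":memory:").executescript(ddl)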
4.3 Knowledge Base

Web mining techniques have the potential to reveal interesting patterns about visitor behavior in a certain web site through the analysis of the data stored in the Information Repository [26]. By applying their knowledge, experts in the respective domain can interpret these patterns and derive rules for a given task, in our case for online navigation suggestions. The proposed Knowledge Base [6] represents this knowledge through "if-then-else" rules based on the discovered patterns. Figure 4 shows the general structure of the suggested Knowledge Base (KB). The "Patterns Repository" contains the results of the data mining step, and the "Rule Repository" is the set of rules provided by the business expert. Below, both will be described in more detail. Potential users of this Knowledge Base are persons and automated inference engines that use expert knowledge in their inference process. In this work, the knowledge is used to suggest online navigation steps during a particular visit.

4.3.1 Patterns Repository

This repository goes beyond existing propositions, in which only web access data is stored, see e.g. [2], [13]. It serves to store the patterns revealed from the Information Repository by applying web mining techniques. Figure 5 shows a generic model of the Patterns Repository, which is based on data mart technology.

Fig. 5 A generic patterns repository model generated by web visitor behavior analysis.

In the fact table, the measures are navigation and statistics. Both are nonadditive [17] and contain the web page navigation suggestions and a related statistic, for example the percentage of visits in the studied period. The dimension table Time contains the date when the web mining technique was applied to the Information Repository. Next, the table Criteria contains the comparison patterns and the formula for a specific web mining technique, explained in the Description column. The column Period is the period in which the data to be analyzed were generated, e.g., "01-Jan-2003 to 31-Mar-2003". Finally, the table Wmt (web mining technique) stores the applied mining technique, e.g., classical SOFM, SOFM with toroidal architecture, or K-means.

When the KB is consulted, the Patterns Repository provides, as an answer, a set of possible web pages to be suggested. Based on this set and additional information, such as statistical information on web page visits, the Rule Repository provides the final suggestions.

4.3.2 Rule Repository

The goal of applying the Rule Repository is to recommend a page of the current web site, i.e., to make a navigation suggestion. Using a visitor online detection mechanism such as a cookie, we can identify the visited pages and the time spent on each page during one session. Using this data, we first match the current visit with the visit patterns stored in the Patterns Repository. This requires a minimum number of visited pages in order to understand the current visitor's behavior. Then we apply the Rule Repository to the matching results in order to give an online navigation suggestion. This is a candidate for the next page to be visited, based on the history of previous navigation patterns. If the suggested page is not connected directly with the current page but the suggestion is accepted by the visitor, new knowledge can be generated about his/her preferences. This can be used to reconfigure the web site, i.e., to reorganize the links among the pages.

Table 1 shows part of the rule set.

Table 1 Rule repository's pseudocode.

select navigation, statistics, formula(pattern,n) into S
from pr_fact, time, criteria, wmt
where "star join" and "fix technique" and "fix time"
  and formula(pattern,n) > ε;
...
if S is empty then send("no suggestion");
...
while S not empty loop
  if S.navigation ∉ actual web site then
    S.navigation = compare_page(ws, S.navigation);
  ...
  if S.navigation ≠ last page visited and S.statistic > γ then
    send(S.navigation);
  ...
end loop

We use the parameter ε to identify only those patterns that are "close enough" to the current visit. γ filters the suggestions, keeping those whose statistics lie above a given threshold, e.g., pages with a minimum percentage of visits. Since the Patterns Repository contains historical information, a suggested page may no longer appear in the current web site. In such a case, the function "compare_page" determines the page of the current web site whose content is most similar to that of the suggested page (using Eq. (3)).
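A Python rendering of the rule logic in Table 1 might look as follows. It is only a sketch under the thresholds ε and γ described above; all names are hypothetical, and compare_page is assumed to fall back to the most similar page of the current site via Eq. (3).

def suggest(current_visit, matches, site_pages, eps, gamma, compare_page):
    # Sketch of the Rule Repository logic of Table 1. `matches` holds
    # (navigation, statistic, closeness) triples obtained by comparing the
    # current visit with the patterns in the Patterns Repository.
    suggestions = []
    last_visited = current_visit[-1][0]       # page id of the last visited page
    for navigation, statistic, closeness in matches:
        if closeness <= eps:                  # pattern not "close enough"
            continue
        if navigation not in site_pages:      # page no longer in the site:
            navigation = compare_page(site_pages, navigation)
        if navigation != last_visited and statistic > gamma:
            suggestions.append(navigation)
    return suggestions or ["no suggestion"]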
5. Mining the Information Repository

Recent studies [15], [16], [27] have demonstrated the potential of clustering for visitor behavior analysis. In those works, visitor activities were modeled based on the sequence of visited pages and, in all cases, their content. It will be shown how the proposed Knowledge Base facilitates the analysis of web site visits and, as a consequence, the effective provision of online navigation suggestions. Next, we describe a measure for determining the similarity between two sessions. Then we show how this similarity can be used for the clustering of sessions.

5.1 Similarity between Sessions

In a previous study [30], we developed a similarity measure based on the following three elements: the sequence of visited pages, their content, and the time spent on each of them. We require the following definition of a visitor behavior vector in order to introduce this measure.

Definition 2 (Visitor Behavior Vector): υ = [(p_1, t_1), \dots, (p_n, t_n)], with (p_i, t_i) being parameters that represent the page of the i-th visit and the time spent on it, respectively. In this expression, p_i is the page identifier.

Let α and β be two visitor behavior vectors with cardinality C^α and C^β, respectively, and Γ(·) a function that, when applied to α or β, returns the respective navigation sequence. The proposed similarity measure is

sm(\alpha, \beta) = dG(\Gamma(\alpha), \Gamma(\beta)) \, \frac{1}{\eta} \sum_{k=1}^{\eta} \tau_k \, dp(p_{\alpha,k}^{h}, p_{\beta,k}^{h}), \qquad (4)

where η = min{C^α, C^β} and dp(p_{\alpha,k}^{h}, p_{\beta,k}^{h}) is the similarity between the k-th page of vector α and the k-th page of vector β. The first term on the right-hand side of Eq. (4) compares the sequences of the two visits. The second term, \tau_k = \min\{ t_k^{\alpha} / t_k^{\beta}, \, t_k^{\beta} / t_k^{\alpha} \}, indicates the visitor's relative interest in the visited pages, assuming that the time spent on a page is proportional to the interest the visitor has in its content. The third term, dp, is the similarity between pages in the vectorial representation introduced in Eq. (3).

5.2 Clustering Visitor Sessions

For the clustering of visitor sessions, we applied the proposed similarity measure and a Self-Organizing Feature Map (SOFM) to the data in the Information Repository. Schematically, a SOFM is a two-dimensional array in whose nodes the neurons are located [28]. Each neuron is represented by an n-dimensional vector whose components are the synaptic weights. By construction, all neurons receive the same input at a given moment.

The operation of a SOFM requires vectors of the same cardinality, which is a configurable parameter in the creation of the visitor behavior vectors. Let H be the number of elements of such a vector. If a visitor session does not
have at least H elements, we fill the missing components up to H with time zero. Otherwise, we only consider up to the H-th component of a session in which more pages were visited. A sketch of Eq. (4) together with this padding rule is given below.
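The following Python sketch renders Eq. (4) and the padding rule of this section; dG, dp, and the page representation are assumed given, and all names are hypothetical.

def pad(v, H):
    # Fill a visitor behavior vector with blank pages and time zero up to H,
    # or cut it down to its first H components (Sect. 5.2).
    v = list(v[:H])
    return v + [("blank", 0.0)] * (H - len(v))   # "blank" is a placeholder page

def sm(alpha, beta, dG, dp, gamma_seq):
    # Eq. (4): similarity of two visitor behavior vectors. dG compares the
    # navigation sequences (gamma_seq plays the role of Gamma), and dp
    # compares two pages through their vector representations (Eq. (3)).
    eta = min(len(alpha), len(beta))
    total = 0.0
    for k in range(eta):
        p_a, t_a = alpha[k]
        p_b, t_b = beta[k]
        tau = min(t_a / t_b, t_b / t_a) if t_a > 0 and t_b > 0 else 0.0
        total += tau * dp(p_a, p_b)
    return dG(gamma_seq(alpha), gamma_seq(beta)) * total / eta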
6. Application of the Proposed Methodology

In order to prove the effectiveness of the tools developed in this work, a web site was selected†. It belongs to the first Chilean virtual bank, i.e., a bank that does not have physical branches and performs all transactions by electronic means, for example, using e-mails and portals. The site is written in Spanish, has 217 static web pages, and generated approximately eight million web log registers in the period January to March, 2003.

† http://www.tbanc.cl/

6.1 Loading Data into the Information Repository

The data mart model introduced in Fig. 3 has been implemented in the Oracle 9i Relational Database Management System (RDBMS) [20], on the basis of the following specific features.

Partition tables. A table can be divided physically into several partitions, each of which can be stored on a different disk.

Bitmap index. A data structure used to efficiently access large data sets described by low-cardinality attributes.

Materialized views. A data structure that maintains a query result in memory, in order to accelerate the response to the next query on the same topic.

The web page transformation step was executed by a Perl script, and the matrix with the distances among page vectors was loaded. The dimension of this matrix is 217 × 217, according to Q = 217 pages. These data were stored in the Pagedistance table of the data mart.

The data staging area used for web log processing, shown in Fig. 1, has been implemented in two tables inside the database. The transformation step ends with the sessionization process. This task is implemented in the engine's language and considers 30 minutes as the longest visitor session. Only 16% of visitors visit 10 or more pages and 18% visit fewer than 4. The average number of visited pages is 6. On the basis of this information, we fix 6 as the maximum number of components in a visitor behavior vector, i.e., H = 6. Vectors with more than 6 pages are considered only up to their 6th component. Vectors with fewer than 3 pages are not considered in our analysis, and vectors with only 3, 4 or 5 pages are filled up with blank pages and time 0 up to the 6th component. Applying the above-described filters, we could identify approximately 300,000 vectors. Finally, loading is executed by a code in the engine's language, whereby the data in the staging area is transferred for storage in the Information Repository.

6.2 Methodology to Validate the Online Navigation Suggestions

Before showing the results of the proposed methodology, we present, in this section, a way of validating the online navigation suggestions. Since the web site represents the core business of a virtual bank, we had to test the effectiveness of our suggestions offline, using the available web logs. We selected the first 70% of the log files of the 3-month period (January to March, 2003) in order to generate rules providing online navigation suggestions based on the proposed system (see Sect. 4) and the respective domain knowledge. The respective visits corresponded to the period from January 1 to March 10. We applied these rules to the remaining 30% of the web log files (from March 11 to March 30) and determined the effectiveness according to the following procedure.

An online navigation suggestion is a set of pages of the same web site recommended to a current visitor of the site. In order to know whether this visitor would have accepted or rejected the suggestion, we compare the content of the suggested pages with the page actually selected by the visitor.

Let α = [(p_1, t_1), \dots, (p_H, t_H)] be a visitor behavior vector from the test set. Based on the first i pages actually visited, the proposed system recommends several possibilities for the following page i + 1, i.e., possible pages to be visited. Let S_{i+1}(α) be the online navigation suggestion for the (i+1)-th page to be visited by visitor α, where δ < i < H and δ is the minimum number of visited pages used to prepare the suggestion. Then, we can write S_{i+1}(\alpha) = \{ l_{i+1,0}^{\alpha}, \dots, l_{i+1,j}^{\alpha}, \dots, l_{i+1,k}^{\alpha} \}, with l_{i+1,j}^{\alpha} being the j-th link (page) suggested for the (i+1)-th page to be visited by visitor α, and k the maximum number of pages in each suggestion. In this notation, l_{i+1,0}^{\alpha} represents the "no suggestion" state.

We calculate the effectiveness of the suggestions made for the (i+1)-th page to be visited by visitor α as

E_{i+1}(\alpha) = \max_{j} \{ dp(p_{i+1}, l_{i+1,j}^{\alpha}) \}, \qquad (5)

with dp being the page distance introduced in (3). If the set of suggested pages for the (i+1)-th page contains the page p_{i+1} the visitor actually visited, E_{i+1}(α) = 1. On the other hand, when the suggested pages are very different from the page actually visited, i.e., page p_{i+1}, E_{i+1}(α) will be close to 0. A particular case is S_{i+1}(\alpha) = \{ l_{i+1,0}^{\alpha} \}, i.e., no suggestion is proposed.
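The validation measure of Eq. (5) is straightforward to compute; a minimal sketch follows, with hypothetical names and dp being the page similarity of Eq. (3).

def effectiveness(actual_page, suggested_pages, dp):
    # Eq. (5): best similarity between the page actually visited and the
    # suggested pages; it is 1 when the visited page itself was suggested.
    if not suggested_pages:       # the "no suggestion" state (treated as 0 here)
        return 0.0
    return max(dp(actual_page, l) for l in suggested_pages)

# Sect. 6.6 counts a suggestion as successful when effectiveness(...) >= 0.7.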
6.3 Applying SOFM to the Information Repository

We applied a Self-Organizing Feature Map (SOFM) with 6 input variables and 32 × 32 output neurons. Figure 6 shows the result, in which four main clusters can be identified.

Fig. 6 Clusters among visitor behavior vectors.

The toroidal topology of the SOFM maintains the continuity of the clusters, which allows us to study the transition of visitor preferences from one cluster to another. The data mining tool interacts with the data mart through a stream, created by the prior execution of a Perl script that receives as input the time range to be analyzed.

6.4 Results from the Clustering Step

The clusters shown in Fig. 6 are presented in more detail in Table 2. The second and third columns of this table contain the center neuron (winner neuron) of each cluster, which represents the visited pages and the time spent on each of them.

Table 2 Visitor behavior clusters.

Cluster   Pages Visited              Time spent in seconds
1         (2,11,25,33,136,205)       (10,50,120,130,150,58)
2         (120,117,128,126,40,62)    (30,41,52,68,101,18)
3         (81,86,148,190,147,83)     (8,55,40,11,72,101)
4         (161,172,180,99,108,1)     (8,71,115,97,35,9)

The pages in the web site were labeled with numbers to facilitate the analysis. Table 3 shows the main content of each page.

Table 3 Web pages and their content.

Pages           Content
1               Home page
2, ..., 65      Products and Services
66, ..., 98     Agreements with other institutions
99, ..., 115    Remote services
116, ..., 130   Credit cards
131, ..., 155   Promotions
156, ..., 184   Investments
185, ..., 217   Different kinds of credit

A detailed cluster analysis together with an interpretation of the particular web pages visited yielded the following results.

• Cluster 1. Visitors are interested in products and services offered by the bank.
• Cluster 2. Visitors search for information about credit cards.
• Cluster 3. Visitors are interested in agreements between the bank and other institutions.
• Cluster 4. Visitors are interested in investments and remote services offered by the bank.

Since our data mart structure contains consolidated historical information, it is possible to analyze the variation of visitor behavior permanently.

6.5 Knowledge Base
The proposed generic Knowledge Base has been adapted to this particular application. It provides online navigation suggestions for each visitor to the selected web site.

6.5.1 Patterns Repository

Figure 5 presents the Patterns Repository's general structure. In the case of the described application, it contains the following specific information:

Time. (2003,Sep,3,20,16), i.e., "16:00 hours, September 20th, third week, year 2003".

Criteria. The cluster centroids identified by the web mining process and shown in Table 2, as well as the formula given in Eq. (4).

WMT. Self-Organizing Feature Map with toroidal architecture, 32 × 32 neurons.

6.5.2 Rules for Navigation Suggestions

First, we must identify the current visitor session. Since the selected web site uses cookies, we could use this tool for online session identification. In order to prepare the online suggestion for the (i+1)-th page to visit, we compare the current session with the patterns in the Patterns Repository. The comparison requires a minimum of three visited pages (δ = 3) to determine the cluster centroid most similar to the current visit, which allows the preparation of a suggestion for the fourth page to visit. This process can be repeated after the visitor has visited more than three pages, i.e., to suggest the fifth page, we use the four pages visited in the current session.

The final online suggestion is made using the rule base developed with the domain expert. For the four clusters found, sixteen rules were created. We suggest, at most, three pages for each of the fourth, fifth, and sixth pages to visit, i.e., k = 3. A rule example is shown in Table 4. It belongs to the first cluster and concerns the suggestion for the fourth page. In the example, "S" is a data structure with the result of the star query shown in Table 1. From the final set of pages in "L", the function Extracted_Three_Links extracts a subset with a maximum of three links, based on the associated statistic. The default is the "no suggestion" state.
Table 4 Example of rule operation.

C1 → [(2,10),(11,50),(25,120),(33,130),(136,150),(205,58)]
α → [(1,15),(3,61),(8,180)]            % current visitor
ws → {p1, ..., p217}                   % current web site pages
S.navigation → {p33, p38, p41, p118, p157, p201}
S.statistic → {1.2, 2.1, 1.8, 0.9, 0.8, 0.1}

Case C1 and SuggestionPage = 4:
  Prepare_suggestion(p0, 0, L);        % default "no suggestion"
                                       % L: link page suggestions, 0: associated statistic
  while S not null loop
    if (S.navigation not in ws) then
      S.navigation = compare_page(ws, S.navigation);
    elsif ((S.navigation ∉ {α.p1, ..., α.p3}) and (S.statistic > γ)) then
      Prepare_suggestion(S.navigation, S.statistic, L);
    end if;
    Pop(S);                            % next element in S
  end loop;
  send(Extracted_Three_Links(L));      % L → {p38, p41, p33}
6.6 Validating the Online Navigation Suggestions

From the 30% of visitor behavior vectors that belong to the test set, we selected only those with six real components, i.e., those for which it was not necessary to fill the vector with zeros to obtain the six components. By this selection, 11,532 vectors were obtained to test the effectiveness of the online navigation suggestions. We consider an online navigation suggestion to be successful when E_{i+1}(·) ≥ 0.7, i.e., when the best similarity between the actually visited page and the suggested pages, computed using Eq. (5), is at least 0.7.

Figure 7 shows a histogram with the percentage of accepted suggestions calculated using our validation method.

Fig. 7 Percentage of acceptance of online suggestions.

As can be seen, acceptance increases when more pages are suggested for each page visited. Had the proposed methodology been used to suggest only one page, that page would have been accepted in slightly more than 50% of the cases. This was considered a very successful result by the business expert, since we were dealing with a complex web site with many pages, many links between pages, and a high rate of visitors who leave the site after a few clicks. Furthermore, it should be noted that the percentage of acceptance would probably have been even higher had we actually suggested the pages during the online sessions. Since we were comparing past visits stored in log files, we could only analyze the behavior of visitors who did not actually receive any of our proposed suggestions.

6.7 Comparison with Other Navigation Suggestion Approaches

In Sect. 2.3, two approaches to online suggestions were introduced. The first one (see [5]) is a system based on rules, the number of which increases with time, making their administration difficult. Also, some rules could become obsolete. Our approach has a structure wherein the rules and patterns are stored separately. The rule repository is a parametric set of rules, i.e., the rules receive the visitor's navigation patterns not as a numerical value, but as the result of a pattern comparison using the Patterns Repository. Thereafter, the recommendation is generated. Compared with the first approach, this scheme allows us to maintain a small set of rules and a large number of patterns. This approach is convenient when the rules are expected to be highly variant with time, but the parameters show a certain regularity. This is the case in visitor behavior analysis, where the continuous changes in web site structure and content require new rules over time.

The second approach (see [22]) is a variation of the Index Finder implementation and is based on conceptual clustering. This system was tested in a virtual music shop called Music Machines†. In this web site, the effectiveness of Index Finder was calculated using real session values before and after the changes. It is a semiautomatic navigation suggestion system, and the same pages are suggested to all users. Our system prepares personalized suggestions for each visitor, considering his/her particular browsing behavior in the web site.

† http://machines.hyperreal.com/manufactures/

6.8 Expected Real-Time Computational Problems

An important aspect concerning visitor relationships is the time necessary to prepare suggestions. If the visitor is subjected to a long wait, common sense and empirical experiments show that he/she might not be willing to continue the visit. The most time-consuming task in the preparation of an online suggestion is the search for the browsing pattern in the Patterns Repository, where the search is always a full scan. It can be improved by defining a "time to live" for the patterns, i.e., a period of time during which they are considered useful for the current web site version. However, the full scan
problem prevails, and a huge number of visitors interacting with the system could lead to unacceptable response times.

7. Conclusions

A generic framework for the discovery and maintenance of knowledge from web visits has been proposed for online navigation suggestions. Based on an Information Repository that contains visitor access data and web page descriptions, web mining techniques are applied in order to discover patterns of typical visitor behavior. These are stored in a Patterns Repository, which is the first element of the proposed Knowledge Base. The second element is a Rule Repository containing rules that are based on the patterns found and on expert knowledge.

We applied this framework to the case of a bank's web site and identified patterns of visitor behavior that support an expert in formulating knowledge regarding the particular application domain. The proposed system was validated using web log registers instead of real online sessions. We showed that a high percentage of visitors selected pages similar to the ones that were suggested. It will be necessary to continue applying our methodology to other web sites in order to gain new hints for future developments. Finally, it is possible to continually review whether the online navigation suggestions proposed by our system are accepted by the visitors. This provides us with new knowledge for web site reconfiguration, i.e., the reorganization of links among web pages.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, chapter 2, Addison-Wesley, 1999.
[2] F. Bonchi, F. Giannotti, C. Gozzi, G. Manco, M. Nanni, D. Pedreschi, C. Renso, and S. Ruggieri, "Web log data warehousing and mining for intelligent web caching," Data and Knowledge Engineering, vol.32, no.2, pp.165–189, 2001.
[3] D. Bonino, F. Corno, and G. Squillero, "A real-time evolutionary algorithm for web prediction," Proc. IEEE/WIC Int. Conf. on Web Intelligence, pp.139–145, Halifax, Canada, Oct. 2003.
[4] A. Bonifati, F. Cattaneo, S. Ceri, A. Fuggetta, and S. Paraboschi, "Designing data marts for data warehouses," ACM Trans. Softw. Eng. Methodol., vol.4, pp.452–483, 2001.
[5] P. Brusilovsky, "Methods and techniques of adaptive hypermedia," User Modeling and User-Adapted Interaction, vol.6, no.2-3, pp.87–129, 1996.
[6] M. Cadoli and F.M. Donini, "A survey on knowledge compilation," AI Communications, vol.10, no.3-4, pp.137–150, 1997.
[7] S. Chakrabarti, B.E. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, "Mining the web's link structure," Computer, vol.32, no.8, pp.60–67, 1999.
[8] X. Chen and X. Zhang, "Web document prefetching on the Internet," in Web Intelligence, eds. N. Zhong, J. Liu, and Y. Yao, pp.345–363, Springer-Verlag, Berlin, 2003.
[9] R. Cooley, B. Mobasher, and J. Srivastava, "Data preparation for mining world wide web browsing patterns," J. Knowledge and Information Systems, vol.1, pp.5–32, 1999.
[10] J.K. Debenham, "Knowledge base maintenance through knowledge representation," Proc. 12th Int. Conf. on Database and Expert Systems Applications, pp.599–608, München, Germany, Sept. 2001.
[11] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: An overview," AI Magazine, vol.17, pp.37–54, 1996.
[12] M. Golfarelli and S. Rizzi, "Designing the data warehouse: Key steps and crucial issues," J. Computer Science and Management, vol.2, no.1, pp.1–14, 1999.
[13] Z. Huang, J. Ng, D.W. Cheung, M.K. Ng, and W. Ching, "A cube model for web access sessions and cluster analysis," Proc. WEBKDD, pp.47–57, San Francisco, CA, Aug. 2001.
[14] J. Ji, Z. Sha, C. Liu, and N. Zhong, "Online recommendation based on customer shopping model in e-commerce," Proc. IEEE/WIC Int. Conf. on Web Intelligence, pp.68–74, Halifax, Canada, Oct. 2003.
[15] A. Joshi and R. Krishnapuram, "On mining web access logs," Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp.63–69, 2000.
[16] Z. Lu, Y.Y. Yao, and N. Zhong, Web Intelligence, Part II, Sect. 9: Web Log Mining, pp.173–194, Springer-Verlag, Berlin, 2003.
[17] R. Kimball and R. Merx, The Data Webhouse Toolkit, Wiley Computer Publisher, 2000.
[18] R. Kosala and H. Blockeel, "Web mining research: A survey," SIGKDD Explorations, vol.2, no.1, pp.1–15, 2000.
[19] S.K. Madria, S.S. Bhowmick, W.K. Ng, and E.-P. Lim, "Research issues in web data mining," Data Warehousing and Knowledge Discovery, pp.303–312, 1999.
[20] Oracle 9i Administration Guide, Oracle Corporation, 2002.
[21] S.K. Pal, V. Talwar, and P. Mitra, "Web mining in soft computing framework: Relevance, state of the art and future directions," IEEE Trans. Neural Netw., vol.13, no.5, pp.1163–1177, 2002.
[22] M. Perkowitz, Adaptive Web Sites: Cluster Mining and Conceptual Clustering for Index Page Synthesis, Ph.D. dissertation, University of Washington, 2001.
[23] M. Perkowitz and O. Etzioni, "Adaptive web sites: Automatically synthesizing web pages," Proc. 15th Nat. Conf. on AI, pp.727–732, Wisconsin, USA, July 1998.
[24] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan, "Web usage mining: Discovery and applications of usage patterns from web data," SIGKDD Explorations, vol.1, no.2, pp.12–23, 2000.
[25] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos, "On the logical modeling of ETL processes," Proc. 14th Int. Conf. on Advanced Information Systems Engineering, pp.782–786, London, UK, 2002.
[26] J. Velásquez, H. Yasuda, T. Aoki, and R. Weber, "Using the KDD process to support the web site reconfiguration," Proc. IEEE/WIC Int. Conf. on Web Intelligence, pp.511–515, Halifax, Canada, Oct. 2003.
[27] J. Velásquez, H. Yasuda, T. Aoki, and R. Weber, "Acquiring knowledge about users' preferences in a web site," Proc. 1st IEEE Int. Conf. on Information Technology: Research and Education, pp.375–379, Newark, New Jersey, USA, Aug. 2003.
[28] J. Velásquez, H. Yasuda, T. Aoki, and R. Weber, "Using self organizing feature maps to acquire knowledge about visitor behavior in a web site," Proc. Knowledge-Based Intelligent Information & Engineering Systems, pp.951–958, Oxford, UK, Sept. 2003.
[29] J.D. Velásquez, A. Bassi, H. Yasuda, and T. Aoki, "Mining web data to create online navigation recommendations," Proc. 4th IEEE Int. Conf. on Data Mining, pp.551–554, Brighton, UK, Nov. 2004.
[30] J. Velásquez, H. Yasuda, T. Aoki, and R. Weber, "A new similarity measure to understand visitor behavior in a web site," IEICE Trans. Inf. & Syst., vol.E87-D, no.2, pp.389–396, Feb. 2004.
Juan D. Velásquez is a doctoral student at RCAST, University of Tokyo. He received his B.E. in electrical engineering and B.E. in computer science in 1995, P.E. in electrical engineering and P.E. in computer engineering in 1996, and Masters degrees in computer science and in industrial engineering in 2001 and 2002, respectively, from the University of Chile. In 1996, he joined the academic staff of the Physics and Mathematics Sciences Faculty at the University of Chile as a part-time lecturer. In 2004 he joined the Industrial Engineering Department (DIE) as a full-time academic staff member and was promoted to Instructor Professor. He also served as Executive Director Manager of the Informatics Management and e-Business professional engineering programs from 1997 to 2001. In the professional area, he has been a consultant for several ministries of the Republic of Chile and for software companies in Latin America. His research interests include data mining, web mining, and very large databases.
Richard Weber is an Associate Professor at the Department of Industrial Engineering of the University of Chile and academic director of the Masters program in Operations Management. He received a B.S. in mathematics, and an M.S. and a Ph.D. in operations research from Aachen Institute of Technology, Germany. His research interests include data mining, dynamic data mining, and web mining. He is a member of INFORMS, IEEE, ACM, and of the editorial board of the international journal IDA–Intelligent Data Analysis. He has also been a visiting professor at the Center for Collaborative Research (CCR) at the University of Tokyo.
Hiroshi Yasuda received his B.E., M.E. and Dr.E. from the University of Tokyo, Japan, in 1967, 1969, and 1972, respectively. Since joining the Electrical Communication Laboratories of NTT in 1972, he has been involved in work on video coding, image processing, telepresence, B-ISDN networks and services, the Internet, and computer communication applications. After serving twenty-five years (1972–1997), his final position being Vice President, Director of NTT Information and Communication Systems Laboratories at Yokosuka, he left NTT and joined the University of Tokyo. He is now Director of the Center for Collaborative Research (CCR). He served as Chairman of ISO/IEC JTC1/SC29 (JPEG/MPEG standardization) from 1991 to 1999, as well as President of DAVIC (Digital Audio Video Council) from September 1996 to September 1998. He received the 1987 Takayanagi Award, the 1995 Achievement Award of EICEJ, the 1995–1996 EMMY award from The National Academy of Television Arts and Sciences, and the 2000 Charles Proteus Steinmetz Award from IEEE. He is a Fellow of IEEE, EICEJ, and IPSJ, and a member of the Television Institute.
Terumasa Aoki is a lecturer at the Research Center for Advanced Science and Technology, the University of Tokyo. He received his B.S., M.E. and Ph.D. in information and communication from the University of Tokyo, Japan, in 1993, 1995, and 1998, respectively. His current research interests are in the fields of terabit IP routers, access control of gigabit LAN/WAN, next-generation video conferencing systems, high-efficiency image coding, and the management of digital content copyrights. He has received various academic awards, such as the 2001 IPSJ Yamashita award, the 1994 FEEICP Inose award, and four other awards.