An Algorithm to Calculate the Expected Value of an Ongoing User Session

Socorro Millán (2), Ernestina Menasalvas (1), Michael Hadjimichael (4), and Esther Hochsztain (3)

(1) Departamento de Lenguajes y Sistemas Informáticos, Facultad de Informática, U.P.M., Madrid, Spain. [email protected]
(2) Universidad del Valle, Cali, Colombia. [email protected]
(3) Facultad de Ingeniería, Universidad ORT Uruguay. [email protected]
(4) Naval Research Laboratory, Monterey, CA, USA. [email protected]

Summary. The fiercely competitive web-based electronic commerce environment has made it necessary to apply intelligent methods to gather and analyze information collected from consumer web sessions. Knowledge about user behavior and session goals can be discovered from the information gathered about user activities, as tracked by web clicks. Most current approaches to customer behavior analysis study the user session by examining each web page access. Knowledge of web navigator behavior is crucial for web site sponsors to evaluate the performance of their sites. Nevertheless, knowing the behavior is not always enough. Very often it is also necessary to measure session value from the perspective of business goals. In this paper an algorithm is given that, at each point of an ongoing navigation, calculates not only the possible paths a visitor may follow but also the potential value of each possible navigation.

1 Introduction

The continuous growth of the World Wide Web, together with the competitive business environment in which organizations operate, has made it necessary to know how users use web sites in order to decide on their design and content. Nevertheless, knowing the most frequent user paths is not enough; it is necessary to integrate web mining with the organization's site goals in order to make sites more competitive. The electronic nature of customer interaction negates many of the features that enable a small business to develop a close human relationship with its customers. For example, when purchasing from the web, a customer will not accept an unreasonable wait for web pages to be delivered to the browser. One of the reasons for failure in web mining is that most web mining firms concentrate exclusively on the analysis of clickstream data. Clickstream


data contains information about users, their page views, and the timing of those page views. Intelligent Web mining can harness the huge potential of clickstream data and support business-critical decisions together with personalized Web interactions [6]. Web mining is a broad term that refers to the process of information discovery from sources on the Web (Web content), discovery of the structure of Web servers (Web structure), and mining for user browsing and access patterns through log analysis (Web usage) [8]. In particular, research in Web usage mining has focused on discovering access patterns from log files. A Web access pattern is a recurring sequential pattern within Web logs. Commercial software packages for Web log analysis, such as WUSAGE [1], Analog [3], and Count Your Blessings [2], have been applied to many Web servers. Common reports are lists of the most requested URLs, a summary report, and a list of the browsers used. These tools, whether commercial or research-based, in most cases offer only summary statistics and frequency counts associated with page visits. New models and techniques are consequently under research.

At present, one of the main problems in Web usage mining has to do with the pre-processing of the data before the application of any data mining technique. Web servers commonly record an entry in a Web log file for every access. Common components of a log file include: IP address, access time, request method, URL of the page accessed, data transmission protocol, return code, and number of bytes transmitted. The server log files contain many entries that are irrelevant or redundant for the data mining tasks and need to be cleaned before processing. After cleaning, data transactions have to be identified and grouped into meaningful sessions. As the HTTP protocol is stateless, it is impossible to know when a user leaves the server. Consequently, some assumptions are made in order to identify sessions.
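The cleaning-and-sessionization step described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the log-entry format and the 30-minute inactivity timeout are assumptions commonly used to delimit sessions under the stateless HTTP protocol.

```python
from datetime import datetime, timedelta

def sessionize(entries, timeout_minutes=30):
    """Group (ip, timestamp, url) log entries into sessions per IP,
    splitting whenever the gap between consecutive requests from the
    same IP exceeds the inactivity timeout."""
    entries = sorted(entries, key=lambda e: (e[0], e[1]))
    sessions, current = [], []
    for ip, ts, url in entries:
        if current and (ip != current[-1][0]
                        or ts - current[-1][1] > timedelta(minutes=timeout_minutes)):
            sessions.append(current)  # close the previous session
            current = []
        current.append((ip, ts, url))
    if current:
        sessions.append(current)
    return sessions

# Hypothetical cleaned log entries
log = [
    ("1.2.3.4", datetime(2002, 5, 1, 10, 0), "/index"),
    ("1.2.3.4", datetime(2002, 5, 1, 10, 5), "/products"),
    ("1.2.3.4", datetime(2002, 5, 1, 11, 0), "/index"),   # > 30 min gap: new session
    ("5.6.7.8", datetime(2002, 5, 1, 10, 2), "/index"),
]
print(len(sessionize(log)))  # 3
```

The timeout is one of the "assumptions made in order to identify sessions"; a different threshold yields a different partition of the same log.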
Once logs have been preprocessed and sessions have been obtained, several kinds of access pattern mining can be performed depending on the needs of the analyst (e.g. path analysis, discovery of association rules, sequential patterns, clustering and classification) [4], [13], [9], [7], [11]. Nevertheless, this data has to be enhanced with domain knowledge about the business if useful patterns are to be extracted that provide organizations with knowledge about their users' activity. According to [14], without demonstrated profits a business is unlikely to survive.

An algorithm that takes into account both the information in the server logs and the business goals to improve traditional web analysis was proposed in [10]. In this proposal the focus is on the business and its goals, and this is reflected in the computation of web link values. The authors integrated the server log analysis with both the background that comes from the business


goals and the available knowledge about the business area. Although this approach makes it possible to know the session value, it cannot be used to predict the future value of the complete session at a given page during a navigation. It can only compute the value of the path traversed up to a given page.

On the other hand, an algorithm to study the session (sequence of clicks) of the user in order to find subsessions or subsequences of clicks that are semantically related, and that reflect a particular behavior of the user even within the same session, was proposed in [?]. This algorithm enables the calculation of rules that, given a certain path, can predict with a certain level of confidence the future set of pages (subsession) that the user will visit. Nevertheless, rules alone are not enough: we can estimate the future pages the user will visit, but not how valuable this session will be according to the site goals. In this sense it is important to note that not all subsessions are equally desirable.

In this paper we integrate both approaches. We present an algorithm to be run on-line each time a user visits a page. Based on the behavior rules obtained by the subsession algorithm presented in [?], our algorithm calculates the possible paths that the user can take, as well as the probability of each possible path. Once these possible paths are calculated, and based on the algorithm to calculate the value of a session, the expected value of each possible navigation is computed. We need these values to be able to predict the most valuable (in terms of the site goal) navigation the user can follow in an ongoing session. The different paths themselves are not of interest – we are concerned with discovering, given a visited sequence of pages, the value of the page sequences. This is the aim of the proposed algorithm.
With these values we provide the site administrator with enhanced information on which action to perform next in order to provide the best service to navigators and to be more competitive.

The remainder of the paper is organized as follows: in section 2, the algorithms to compute subsessions and to calculate the value of a session are presented. In section 3, we introduce the new approach that integrates the previous algorithms. In section 4, an example of the application of the algorithm, as well as the advantages and disadvantages of the approach, is presented. Section 5 presents the main conclusions and future work.


2 Preliminaries

In this section we first present some definitions that are needed in order to understand the proposed algorithm, and then we briefly describe the algorithms on which our approach is based.

2.1 Definitions

Web site: As in [5] we define a web site as a finite set of web pages. Let W be a web site and let Ω be a finite set representing the set of pages contained in W. We assign a unique identifier αi to each page, so that a site containing m pages is represented as Ω = {α1, . . . , αm}. Ω(i) represents the ith element or page of Ω, 1 ≤ i ≤ m. Two special pages, denoted by α0 and α∞, are defined to refer to the page from which the user enters the web site and the page that the user visits after ending the session, respectively [12].

Web site representation: We consider a web site as a directed graph. A directed graph is defined by (N, E), where N is the set of nodes and E is the set of edges. A node corresponds to the concept of a web page and an edge to the concept of a hyperlink.

Link: A link is an edge with origin in page αi and endpoint in page αj. It is represented by the ordered pair (αi, αj).

Link value: The main user action is to select a link to get to the next page (or finish the session). This action takes different values depending on the nearness or distance of the target page or set of target pages. The value of the link (αi, αj) is represented by a real number vij:

• If vij > 0, we consider that the user navigating from node i to node j is getting closer to the target pages. The larger the link value, the greater the link's effect in bringing the navigator to the target. (If vij > 0, vil > 0 and vij > vil, then it is better to go from page αi to page αj than from page αi to page αl.)
• If vij < 0, we consider that the navigator, in going from page αi to page αj, is moving away from the target pages. (If vij < 0, vil < 0 and vij < vil, then it is worse to go from page αi to page αj than from page αi to page αl.)
• If vij = 0, we consider that the link represents neither an advantage nor a disadvantage in the search for the objective.

Path: A path αp(0), αp(1), . . . , αp(k) is a nonempty sequence of visited pages that occurs in one or more sessions. We write Path[i] = αpath(i). A path Path = (αpath(0), αpath(1), . . . , αpath(n−1)) is said to be frequent if sup(Path) > ε. A path Path = (αpath(0), αpath(1), . . . , αpath(n−1)) is said to be behavior-frequent if the probability of reaching the page αpath(n−1) having visited αpath(0), αpath(1), . . . , αpath(n−2) is higher than an established threshold. This means that ∀i, 0 ≤ i < n: P(αpath(i) | αpath(0), . . . , αpath(i−1)) > δ.


A sequence of pages visited by the user will be denoted by S, with |S| the length of the sequence (number of pages visited). Sequences will be represented as vectors, so that S[i] ∈ Ω (1 ≤ i ≤ n) represents the ith page visited. In this paper, path and sequence will be interchangeable terms.

Added Value of a k-length sequence S[1], . . . , S[k]: the sum of the link values up to page S[k], to which the user arrives traversing links (S[1],S[2]), (S[2],S[3]), . . . , (S[k−1],S[k]). It is denoted by AV(k) and computed as AV(k) = vS[1],S[2] + vS[2],S[3] + . . . + vS[k−1],S[k], 2 ≤ k ≤ n.

Sequence value: the sum of the link values traversed during the sequence (visiting n pages). Thus it is the added value of the links visited by a user until he reaches the last page in the sequence, S[n]. It is denoted by AV(n).

Average Added Value of a k-length sequence S[1], S[2], . . . , S[k]: the accumulated value per traversed link up to page k, for 2 ≤ k ≤ n. It is denoted by AAV(k) and is computed as the accumulated value up to page k divided by the number of links traversed up to page k (k−1): AAV(k) = AV(k) / (k−1).

Session: We define a session as the complete sequence of pages from the first site page viewed by the user until the last.

2.2 Session Value Computation Algorithm

In [10], an algorithm was presented to compute the value of a session according both to the user navigation and to the web site goals. The algorithm makes it possible to calculate how close a navigator of a site is to the targets of the organization. This can be translated into how well the behavior of a navigator in a web site corresponds to the business goals. The distance from the goals is measured using the value of the traversed paths. The input of the algorithm is a value matrix V[m, m] that represents the value of each link that a user can traverse in the web site. These values are defined according to the web site organization's business processes and goals.
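The definitions of AV(k) and AAV(k) above can be sketched directly in code. The link-value matrix below is purely illustrative; in practice its entries are assigned by business managers, as described in section 2.2.

```python
def added_value(V, seq):
    """AV(k): sum of link values v[seq[i-1]][seq[i]] along the sequence."""
    return sum(V[a][b] for a, b in zip(seq, seq[1:]))

def average_added_value(V, seq):
    """AAV(k) = AV(k) / (k-1): accumulated value per traversed link."""
    return added_value(V, seq) / (len(seq) - 1)

# Illustrative link-value matrix for a 3-page site (pages indexed 0..2)
V = [[ 0, 2, -1],
     [ 1, 0,  3],
     [-2, 0,  0]]
seq = [0, 1, 2]                      # traverses links (0,1) and (1,2)
print(added_value(V, seq))           # 2 + 3 = 5
print(average_added_value(V, seq))   # 5 / 2 = 2.5
```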
The organization's business processes give a conceptual frame in which to compute link values, and make it possible to know whether a user is approaching or moving away from the pages considered goals of the site. Thus, the values included in the value matrix V must be assigned by business managers. Different matrices can be defined for each user profile, making it possible to adapt the business goals to the user behavior. The original algorithm outputs are the evolution of the added accumulated value and of the average accumulated value during the session. Here we will only consider the added accumulated value.

2.3 Pseudocode of the Sequence Value algorithm

Input: link value matrix V[m,m]


Initialization:
  AV = 0     // added value
  AAV = 0    // average added value
  k = 1      // number of nodes
  read S[k]  // read the first traversed page in the sequence
Output: sequence accumulated value
Pseudocode:
  While new pages are traversed
    k = k + 1   // sequential number of the traversed page
    read S[k]   // read the next traversed page
    /* the selected link is (S[k-1], S[k]), 1 ≤ S[k-1] ≤ m-1, 1 ≤ S[k] ≤ m, 2 ≤ k ≤ n */
    AV = AV + V(S[k-1], S[k])  // add the traversed link value to the accumulated value

2.4 Subsession calculation

An approach to studying the session (sequence of clicks) of the user is proposed in [?]. The purpose is to find subsessions or subsequences of clicks that are semantically related and that reflect a particular behavior of the user even within the same session. This enables examination of the session data at different levels of granularity. In this work the authors propose to compute frequent paths that are then used to calculate subsessions within a session. The algorithm is based on a structure called the FBP-tree (Frequent Behavior Paths tree). The FBP-tree represents paths through the web site. After building this tree, frequent behavior rules are obtained that are used to analyze subsessions within a user session. The discovery of these subsessions makes it possible to analyse, at different granularity levels, the behavior of the user based on the pages visited and the subsessions or subpaths traversed. Thus, upon arriving at an identifiable subsession, the path the user will take from the current page can be stated with a degree of certainty.

To this end, frequent paths that reveal a change in the behavior of the user within an ongoing session are calculated. The first step in obtaining frequent paths is discovering frequent behavior paths – paths which indicate frequent user behavior. Given two paths, Path IND and Path DEP, a frequent-behavior rule is a rule of the form: Path IND → Path DEP


where Path IND is a frequent path, called the independent part of the rule, and Path DEP is a behavior-frequent path, called the dependent part. Frequent-behavior rules must satisfy the following property: P(Path DEP | Path IND) > δ. The rule indicates that if a user traverses path Path IND, then with a certain probability the user will continue visiting the set of pages in Path DEP. The confidence of a frequent-behavior rule is denoted conf(Path IND → Path DEP) and is defined as the probability of reaching Path DEP once Path IND has been visited.

Pseudocode of the FBP-tree Algorithm

Input:
  FTM: frequent transition matrix (N×N)
  L: list of sessions
Output:
  FBP-tree: frequent behavior path tree (each FBP-tree node or leaf has a hit counter)
Pseudocode:
  For each s in L {
    for i in N {
      j = i + 1
      while j ...

3 The Integrated Approach

3.1 Definitions

FRS (Frequent Rule Set): the set of frequent-behavior rules obtained by the algorithm, FRS = {ri : P(Path DEP | Path IND) > δ}, where


ri = (Path IND, Path DEP, P(Path DEP | Path IND)).

Equivalent Sequences (Q ≈ P): Let P, Q be two paths, Q = (αq(0), αq(1), . . . , αq(m)) and P = (αp(0), αp(1), . . . , αp(m)). Q ≈ P if and only if ∀j (0 ≤ j < m), Q[j] = P[j].

Decision Page (αD): Let αD ∈ Ω be a page and Path = (αp(0), αp(1), . . . , αp(k)) be a path. αD is a decision page in Path if αD = αp(k) and ∃ri ∈ FRS such that Path ≈ ri.Path IND. Otherwise αD is called a non-decision page.

Predicted Paths: Predicted paths are all possible dependent subsequences with their associated probabilities: (Path DEP, P(Path DEP | Path IND)). When the user has arrived at a decision page there may be several different inferred paths which follow, or none. The paths that can follow have different consequences for the web site's organization. These consequences can be measured using each path's value. A path value shows the degree to which the user's behavior is bringing him or her closer to the site's goals (target pages).

Predicted Subsession Value: the value of each possible dependent path. It is computed using the function Sequence Value, which returns the value of the sequence using the sequence value algorithm (see section 2.2). The subsequence value measures the global accordance between the subsequence and the site goals according to the given link-value matrix.

Possible Dependent Sequences(IND, FRS) function: Based on the frequent-behavior rule algorithm, in our approach we have defined the Frequent Rule Set (FRS), which represents the set of behavioral rules obtained by the algorithm at a certain moment. We also define a function Possible Dependent Subsequences(IND, FRS) that, given a sequence of pages IND and a set of behavioral rules, gives all the possible sequences that can be visited after visiting IND, with their associated probabilities. This function simply scans the FRS and looks for the rules that match IND in order to obtain the mentioned result.
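A minimal sketch of the Possible Dependent Subsequences(IND, FRS) function described above: it scans the rule set and returns the dependent paths, with probabilities, of every rule whose independent part matches the tail of the visited sequence. The tuple-based rule representation and the page labels are assumptions for illustration, not the paper's data structures.

```python
def possible_dependent_subsequences(ind, frs):
    """Return [(dep_path, probability), ...] for every rule in frs whose
    independent part equals the last pages of the visited sequence ind."""
    matches = []
    for rule_ind, rule_dep, prob in frs:
        # a rule fires when the most recently visited pages equal its independent part
        if tuple(ind[-len(rule_ind):]) == tuple(rule_ind):
            matches.append((rule_dep, prob))
    return matches

# Illustrative frequent rule set: (independent path, dependent path, P(dep|ind))
FRS = [
    (("a3", "a4"), ("a7",),      0.25),
    (("a3", "a4"), ("a9", "a8"), 0.30),
    (("a1", "a2"), ("a5",),      0.50),
]
visited = ("a5", "a2", "a1", "a6", "a3", "a4")
print(possible_dependent_subsequences(visited, FRS))
# [(('a7',), 0.25), (('a9', 'a8'), 0.3)]
```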
Sequence Value(seq) function: Based on the algorithm presented in section 2.2, a function is defined that, given a sequence, computes its value.

Subsession Expected Value E(V): Let V be a random variable representing the value of each possible sequence.

Let P(Path DEPi | Path IND) be the conditional probability that a user will begin visiting Path DEPi after having visited the sequence of pages represented by IND. These probabilities satisfy the condition:

Σi P(Path DEPi | Path IND) = 1

The conditional subsession expected value, given a sequence of pages IND already visited, is defined as follows:

E(V | IND) = Σi Vi · P(Path DEPi | Path IND)

where Path DEPi represents the path that has the value Vi.


3.2 Expected path value algorithm

For each page in a sequence, the algorithm checks whether it is a decision page with respect to the current path traversed. If it is a non-decision page, we have no knowledge related to the future behavior (no rule can be activated from the FRS) and no action is taken. If it is a decision page, then at least one behavioral rule applies, so we can calculate the possible values of each path the user can follow from this point. These possible values can be considered a random variable, so we can calculate its expected value. This gives us an objective measure of the behavior of the user up to a certain moment. If the expected value is positive, then no matter which possible path the user takes, on average the resulting navigation will be profitable for the web site. If, on the other hand, the expected value is negative, then in the long run the user's navigation will end in an undesirable effect for the web site. The algorithm provides the site with added-value information which can be used, for example, to dynamically modify the web site page content to meet the projected user needs.

Pseudocode of the expected path value algorithm

Input: Seq // sequence of pages visited by a user
Output: value of each possible path and expected added value over all possibilities
Pseudocode:
  V = 0 // value of the sequence
  FRS = Frequent Behavior Rules();
  If the last page in Seq is a decision page
    PathDEP = Possible Dependent Subsequences(Seq, FRS);
    For each path Pa in PathDEP compute:
      Vi = Sequence Value(Pa);
      V = V + Vi * P(Pa | Seq);

Note that if we had additional information about the effect that certain web site actions have on the probabilities of the activated rules, we could calculate the expected value both before and after a certain action.
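The algorithm can be sketched as follows, under stated assumptions: rules are (IND, DEP, probability) triples, the link-value matrix is a nested dict, and a predicted path is valued starting from the current (decision) page. The names and data layout are illustrative, not the paper's implementation.

```python
def sequence_value(V, seq):
    """Sum of link values along consecutive pages of seq (section 2.2)."""
    return sum(V[a][b] for a, b in zip(seq, seq[1:]))

def expected_path_value(V, visited, frs):
    """E(V | visited): sum over fired rules of P(Dep|Ind) * value(Dep),
    or None when the last page is a non-decision page (no rule fires)."""
    fired = [(dep, p) for ind, dep, p in frs
             if tuple(visited[-len(ind):]) == tuple(ind)]
    if not fired:
        return None  # non-decision page: no action taken
    expected = 0.0
    for dep, p in fired:
        path = (visited[-1],) + tuple(dep)  # predicted path from the current page
        expected += p * sequence_value(V, path)
    return expected

# Illustrative link values and rules (pages are numbered)
V = {1: {2: 10, 3: -4}}
frs = [((1,), (2,), 0.5),
       ((1,), (3,), 0.5)]
print(expected_path_value(V, (0, 1), frs))  # 0.5*10 + 0.5*(-4) = 3.0
print(expected_path_value(V, (0, 2), frs))  # None: non-decision page
```

A positive result means that, averaged over the frequent continuations, the navigation is moving toward the target pages; a negative one means it is moving away.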

4 Example

Let us suppose that during a user session in the web site (see Table 1), there are no rules to activate until the sixth page encountered in the navigation. This


means that there are no rules that satisfy the property P(PDEP | PIND) > δ. Let us assume that δ = 0.24. When the user arrives at the 6th traversed page, three frequent-behavior rules are activated, so that 3 sequences, (α7), (α9 α8), (α10 α11), are frequently followed from that point (see Table 2).

    Sequence              Decision page?
    α5                    No
    α5 α2                 No
    α5 α2 α1              No
    α5 α2 α1 α6           No
    α5 α2 α1 α6 α3        No
    α5 α2 α1 α6 α3 α4     Yes

  Table 1. Sequence visited by a user

    Activated Rule        Prob(Dep|Ind)   Value
    α3 α4 → α7            0.25            -15
    α3 α4 → α9 α8         0.30             16
    α3 α4 → α10 α11       0.35             -3
    Dummy rules           0.10              0
    Total                 1.00

  Table 2. Activated FRS for α3 α4, conditional probabilities and values of the consequent paths

The probabilities and values associated with each activated rule are presented in Table 2. Once this information is available, the algorithm computes the subsession expected value. In our example, this subsession expected value is 0:

E(V|IND) = 0.25*(-15) + 0.3*16 + 0.35*(-3) + 0.1*0 = 0 (see Table 3).

This means that up to this moment, on average, further navigation by the user along any frequent path will neither bring the user closer to, nor farther from, the target pages.

Another example is presented in Table 4. In this example, before knowing which Path DEP the user will follow, the subsession expected value can be calculated. Here, the expected subsession value at this point in the navigation is 6.75. Thus, as the expected value is positive, the average final result at this point is profitable for the site. So, in this case we can estimate that the user will act according to the web site goals.

Figure 1 illustrates the advantages of the algorithm. In the example represented there, one can see (given that the y-axis shows the value of the ongoing session) that up to the decision page the value of the session is

    Sequence          Prob(Depi|Ind)   Value(Vi)   Prob(Depi|Ind) * Vi
    Dep1: α7          0.25             -15         -3.75
    Dep2: α9 α8       0.30              16          4.80
    Dep3: α10 α11     0.35              -3         -1.05
    Dummy rules       0.10               0          0.00
    Total             1.00                          0.00

  Table 3. Example 1: Subsession expected value calculation

    Sequence    Prob(Depi|Ind)   Value(Vi)   Prob(Depi|Ind) * Vi
    Depi        0.75              10          7.50
    Depj        0.25              -3         -0.75
    Total       1.00                          6.75

  Table 4. Example 2: Expected subsession value calculation

positive. At this point, there are three possibilities (according to the FRS). If the user follows path Dep2 or Dep3 the result will be positive, while following path Dep1 leads to a decreased session value. We cannot say for sure which path will be followed, but the algorithm tells us the expected value (in this case positive) of the future behavior. This means that in some cases (if Dep1 is followed) the behavior will not be positive for the web site, but on average we can say that the result will be positive, and we can act taking advantage of this knowledge.

Fig. 1. Expected value of sessions


Figure 1 shows the change in session value (y-axis) with path progression (x-axis), depending on which of the dependent paths (Dep1, Dep2, Dep3) is followed. Note that additional information about past actions could yield more information indicating which paths might be followed.
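The Table 3 computation above can be verified mechanically. Using exact fractions avoids floating-point noise in the zero-valued total; the figures are taken from Table 2.

```python
from fractions import Fraction

# (dependent path, P(Dep|Ind), path value Vi) — figures from Table 2
predicted = [
    ("a7",      Fraction(25, 100), -15),
    ("a9 a8",   Fraction(30, 100),  16),
    ("a10 a11", Fraction(35, 100),  -3),
    ("dummy",   Fraction(10, 100),   0),
]
assert sum(p for _, p, _ in predicted) == 1   # probabilities sum to 1

expected = sum(p * v for _, p, v in predicted)
print(expected)  # 0: on average the frequent paths neither approach nor leave the targets
```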

5 Conclusions and Future Work

An integrated approach to calculating the expected value of a user session has been presented. The algorithm makes it possible to know, at any point, whether the ongoing navigation is likely to lead the user to the desired target pages. This knowledge can be used to dynamically modify the site according to the user's actions. The main contribution of the paper is that we can quantify the value of a user session while the user is navigating. In a sense, this makes the relationship of the user with the site closer to real-life relationships. It is important to note that the algorithm can be applied recursively to all the possible branches in a subsession in order to refine the calculation of the expected value. We are currently working on an extension of the algorithm in which the impact of actions performed by the site in the past is evaluated, in order to include this knowledge in the algorithm.

6 Acknowledgments

The research has been partially supported by Universidad Politécnica de Madrid under Project WEB-RT and Programa de Desarrollo Tecnológico (Uruguay).

References

1. http://www.boutell.com/wusage.
2. http://www.internetworld.com/print/monthly/1997/06/iwlabs.html.
3. http://www.statlab.cam.ac.uk/~sret1/analog.
4. B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web mining: Pattern discovery from WWW transactions. In Int. Conference on Tools with Artificial Intelligence, pages 558–567, 1997.
5. C. Shahabi, A. M. Zarkesh, J. Adibi, and V. Shah. Knowledge discovery from users web-page navigation. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering: High Performance Database Management for Large-Scale Applications (RIDE'97), pages 20–31, 1997.
6. Oren Etzioni. The world-wide web: Quagmire or gold mine? Communications of the ACM, 39(11):65–77, November 1996.
7. Daniela Florescu, Alon Y. Levy, and Alberto O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74, 1998.
8. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
9. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12–23, 2000.
10. E. Hochsztain and E. Menasalvas. Sessions value as measure of web site goal achievement. In SNPD'2002, 2002.
11. Myra Spiliopoulou, Carsten Pohle, and Lukas Faulstich. Improving the effectiveness of a web site with web usage mining. In Web Usage Analysis and User Profiling, Masand and Spiliopoulou (Eds.), Springer Verlag, Berlin, pages 142–162, 1999.
12. O. Nasraoui, R. Krishnapuram, and A. Joshi. Mining web access logs using a fuzzy relational clustering algorithm based on a robust estimator. In 8th International World Wide Web Conference, Toronto, Canada, pages 40–41, May 1999.
13. M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence (AAAI/IAAI'98), Madison, Wisconsin, pages 727–732, July 1998.
14. Gregory Piatetsky-Shapiro. Interview with Jesus Mena, CEO of WebMiner, author of "Data Mining Your Website". http://www.kdnuggets.com/news/2001/n13/13i.html, 2001.
