Identifying User Clicks Based on Dependency Graph

Jun Liu, Cheng Fang
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China, 100876. Email: liujun, [email protected]

Nirwan Ansari
Electrical and Computer Engineering Department, New Jersey Institute of Technology, New Jersey, US, 07102. Email: [email protected]
Abstract—Identifying user clicks from a large number of measured HTTP requests is a fundamental task for web usage mining, which is important for web administrators and developers. Nowadays, the prevalent parallel web browsing behavior enabled by multi-tab web browsers makes accurate user click identification from massive requests a great challenge. In this paper, we propose a dependency graph model to describe this complicated web browsing behavior. Based on this model, we develop two algorithms to establish the dependency graph for measured requests and to identify user clicks by comparing their probabilities of being primary requests with a self-learned threshold. We evaluate our method on a large dataset collected from a real-world mobile core network. The experimental results show that our method achieves highly accurate user click identification.
I. INTRODUCTION

With the continuous growth and abundance of information available on the World Wide Web (WWW), websites are playing a crucial role for people to exchange ideas, conduct business, and consume entertainment. Providing administrators with hidden and meaningful information about users' behaviors and interests is critical to improving the performance and quality of websites. The utility of this function significantly depends on the outcome of an important preprocessing task on web traffic records: user click identification. User click identification is the process of obtaining the set of requests triggered by users' click actions from a large number of measured HTTP requests. It is an important preprocessing step on measured HTTP request records for various web applications, such as web search optimization [6][15], web usage mining [7][9], and anomaly detection [8][13].

In recent years, several methods have been proposed to better understand user click behavior, and they can be categorized into three types. The first type is the data cleaning based approach, which filters requests by the type of accessed file and removes requests for unwanted file types, such as images and multimedia files, which are considered non-click requests [12][11]. Although such methods are simple to implement, they can hardly achieve accurate identification and suffer greatly from the increasing complexity of file types. The second type identifies user clicks by means of clustering techniques [1][3]. These methods represent requests as vectors based on their attributes or access times, and apply clustering techniques to identify
the user click behavior. Owing to the sensitivity of the key parameter (the number of clusters k) and the complex iterative computational model, clustering based methods are not applicable to situations with massive request records. The third type is the model-based approach [2][13][14], which abstracts user click behavior into a mathematical model, such as dynamic Bayesian networks or hidden semi-Markov models, and uses a set of training data to learn the key parameters of the model. The trained model is then used to identify user clicks. These methods usually make assumptions to establish the models; for example, that a page click is determined only by the last click. However, the prevalent parallel browsing behavior enabled by multi-tab browsers breaks these assumptions and renders these methods ineffective.

As websites move from relatively static pages to functional carriers with embedded multimedia and dynamic contents, the nature of the interaction between clients and websites changes as well. This poses great challenges to user click identification in the following respects. (1) Web pages have become increasingly complex in that the number of embedded objects has increased [5]. Many requests occur not as a result of a page click, but as a result of interactions between the browser and the website after the initial request. (2) The traditional browsing paradigm of visiting a sequence of web pages one by one in the same browser window has greatly changed. Parallel browsing is prevalent because almost all browsers are equipped with a multi-tab function to support opening more than one page from a single page [4]. (3) Advanced web technologies, such as AJAX and Flash, enable website developers to put personalized advertising scripts into web pages. Dynamic requests triggered by these scripts are hard to distinguish from normal user clicks and break the traffic patterns of the original embedded objects in the page.
Therefore, user click identification is still an open issue and has become even more difficult nowadays. To this end, we propose a dependency graph model to describe the complex user browsing behavior in this paper. Based on this model, we develop a novel two-step method to identify user clicks from a huge number of HTTP requests for web pages and embedded objects. The first step establishes the dependency graph for measured HTTP requests. The second step identifies user clicks by a
Fig. 1. Web Browsing Behavior
Fig. 2. Dependency Graph
statistical inference approach. User clicks are identified by comparing their probabilities of being the primary request with a self-learned threshold. Experiments on real-world data demonstrate that our method achieves highly accurate identification.

The remainder of this paper is organized as follows. Section 2 provides a brief review of web browsing behavior. Section 3 details our proposed method to identify user clicks based on the dependency graph model. We then describe how we evaluate the effectiveness and accuracy of our method and discuss the results in Section 4. Finally, we conclude the paper in Section 5.

II. WEB BROWSING BEHAVIOR

We now look into the interactions between web clients and websites as reflected in the network. Fig. 1 depicts a sample of web browsing behavior, in which time flows from left to right. For simplicity, we only illustrate the process of three users (User 1−3) accessing three pages (pi) of a website. In general, a website comprises two types of elements: web pages and the embedded objects in each page. All pages and embedded objects are identified by Uniform Resource Locators (URLs), which are represented as hyperlinks in the web pages. For example, requests r1 and r9 in Fig. 1 have the same URL because they are produced by clicking hyperlinks pointing to the same web page p1. The structure of a website is determined by these hyperlinks among the web pages. When a user clicks a hyperlink to open a page, the browser sends an HTTP request containing the URL of this page to the website. The page content returned for this initial request usually contains many hyperlinks to embedded objects. After parsing these hyperlinks, the browser produces a set of requests to retrieve the embedded objects from web servers in a multi-threaded manner. While a web page is loading, users may exhibit different browsing behaviors.
Most users wait until the full page shows up and click another hyperlink in the page after viewing the page content (Users 1 and 2). Some impatient users, however, open some pages in new tabs through the right-click menu or by holding the CTRL key while clicking a link. This results in overlapping request sequences of multiple pages, such as pages p1 and p2
opened by User 3. Although we can describe such a simple process based on human perception, the requests of user clicks cannot be determined directly at the network side or server side, where only each user's request sequence (r1, r2, r3, ...) along with the corresponding access times (t1, t2, t3, ...) is observed. Therefore, the user click identification problem can be defined as identifying the set of requests associated with user clicks C = {c1, c2, c3, ...} (e.g., r1, r4, r6, ... in Fig. 1) from the measured HTTP request sequence R = {r1, r2, r3, ...}.

III. METHODOLOGY

A. Modeling Browsing Behavior by Dependency Graph

Considering the process described above, we introduce a dependency graph model to depict the dynamic web browsing behavior. Formally, we model the browsing behavior as a directed and weighted graph G = (O, S, E, W), referred to as a dependency graph. O = {o1, o2, ..., on} is the set of nodes representing accessed objects, which are identified by URLs. Each node oi is assigned an occurrence count S[oi] ∈ S of the accessed object. E is the set of directed edges with weights W. There is an edge from node oi to oj if and only if the following conditions are met: (i) for a request sequence υ = {ri−1, ri, ..., rj} produced by the same user, in which the accessed objects are {oi−1, oi, ..., oj}, the interval between the access times of ri−1 and ri is larger than τ, where τ is called the lookahead time window; (ii) in the sub-sequence υ′ = {ri, ..., rj} of υ, the interval between each pair of adjacent requests is smaller than τ. The weight of a directed edge is the number of times the pair <oi, oj> appears in the measured HTTP request sequence R. Ideally, if all users open and view web pages one by one with a given viewing time and τ equals that viewing time, an edge from oi to oj indicates that oj is an embedded object in web page oi.
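The four components of G = (O, S, E, W) map naturally onto nested dictionaries, with E kept implicit as the set of pairs whose weight is positive. A minimal sketch (the class and method names below are ours, not from the paper):

```python
from collections import defaultdict

class DependencyGraph:
    """G = (O, S, E, W): nodes are URLs, S counts accesses of each object,
    W[oi][oj] counts how often oj followed oi within the lookahead window.
    The node set O is the key set of `counts`; E is implicit in `weights`."""

    def __init__(self):
        self.counts = defaultdict(int)                        # S[o]
        self.weights = defaultdict(lambda: defaultdict(int))  # W[oi][oj]

    def add_node(self, url):
        self.counts[url] += 1

    def add_edge(self, primary, secondary):
        self.weights[primary][secondary] += 1

# Toy usage: page p1 accessed once, embedded object o2 accessed right after.
g = DependencyGraph()
g.add_node("p1")
g.add_node("o2")
g.add_edge("p1", "o2")
```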
Unfortunately, this assumption is often violated by complicated browsing behaviors in the real world, such as the behavior of User 3 in Fig. 1. Fig. 2 shows the dependency graph derived from the browsing behavior example in Fig. 1. In the example, the same URL o1 points to web page p1 in requests r1, r9, and r17. Similarly, we obtain the other nodes, o2 to o8, and their occurrence counts, shown as the bold numbers next to the nodes, such as 3 for o1. By choosing an appropriate lookahead time window τ, which will be described in detail later, a dependency graph can be built as shown in Fig. 2. Note that
there are some edges starting from o1 and ending at o4 to o8, shown as dashed lines. These edges are caused by the parallel browsing behavior of User 3, which results in short time intervals between all pairs of adjacent requests in the sequence from r17 to r24. We next introduce how we establish the dependency graph to represent the dynamic behaviors and identify user clicks from massive HTTP requests.

B. Establishing Dependency Graph

The dependency graph is initially empty and is established through a learning process, summarized as Algorithm I. The input of the algorithm is a set of HTTP requests R. Each request ri carries the user identification ui, the access time ti, and the URL of the accessed object oi. The first part of Algorithm I, lines 1 to 4, divides all requests into a set of sequences; each sequence is made up of the requests of one user, ordered by access time. The second part, lines 5 to 22, processes the sequence of each user one by one. If the time interval between a request and the last request (t − tlast) is larger than the lookahead time window τ, the request is regarded as a primary request (line 15). A primary request is a candidate to be identified as a user click in a later step. Otherwise, the request is regarded as a secondary request, i.e., a request triggered by a primary request. For each secondary request, a directed edge is added from the current primary request to this request (line 17), and the occurrence count of this edge is incremented (line 18). After iterating over the request sequences of all users, the dependency graph is established.
Algorithm I: Establish Dependency Graph
Input:  R = {r1, r2, ..., rn} → A set of HTTP requests
        ri = {ui, ti, oi} → Data structure of each HTTP request
Output: G = {O, S, E, W} → The dependency graph
 1: H = ∅  // The set of request sequences of all users
 2: for each ri in R do
 3:   push(H[ui], ri)
 4: end for
 5: for each hi in H do
 6:   oprimary = null
 7:   tlast = 0
 8:   for i = 1 to length(hi) do
 9:     ri = pop(hi)
10:     t = getAccessingTime(ri)
11:     oi = getURL(ri)
12:     O[oi] = 1
13:     S[oi]++
14:     if t − tlast > τ then
15:       oprimary = oi
16:     else
17:       E[oprimary][oi] = 1
18:       W[oprimary][oi]++
19:     end if
20:     tlast = t
21:   end for
22: end for
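A direct transcription of Algorithm I into Python might look as follows; the `(user, time, url)` request format and the function name are our own assumptions, and the edge set E is kept implicit as the non-zero entries of W:

```python
from collections import defaultdict

def establish_dependency_graph(requests, tau):
    """requests: iterable of (user, time, url) tuples.
    Returns (S, W): occurrence counts per URL and edge weights,
    where an edge oi -> oj exists iff W[oi][oj] > 0."""
    # Part 1 (lines 1-4): group requests per user.
    sequences = defaultdict(list)
    for user, t, url in requests:
        sequences[user].append((t, url))

    S = defaultdict(int)                          # S[o]
    W = defaultdict(lambda: defaultdict(int))     # W[oi][oj]

    # Part 2 (lines 5-22): walk each user's sequence in time order.
    for seq in sequences.values():
        seq.sort()                                # order by accessing time
        primary, t_last = None, float("-inf")
        for t, url in seq:
            S[url] += 1
            if t - t_last > tau:
                primary = url                     # primary request (candidate click)
            elif primary is not None:
                W[primary][url] += 1              # secondary request: add/weight edge
            t_last = t
    return S, W
```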
C. Identifying User Clicks

In the dependency graph, a node oi represents an accessed object, and the occurrence count S[oi] is the number of times this object has been accessed by users. A directed edge from oi to oj indicates that oj was accessed right after oi. Obviously, we cannot conclude whether a request is a click merely from the existence of an edge from other nodes to it. For example, in Fig. 2, o4 cannot be identified as a non-click request just because of the edge from o1 to o4, which is caused by the parallel browsing behavior of User 3. We have to use additional information: the weights of the edges into a node. The sum of these weights represents how many times an object was accessed following other objects. Therefore, whether a request is a user click can be inferred from the probability p of a request oi being a primary request, which is expressed as follows:

p = 1 − d_in(oi) / S[oi] = 1 − (Σ_{oj ∈ O} W[oj][oi]) / S[oi]    (1)

The basic idea of inferring whether a request is a user click is thresholding, i.e., comparing p with a threshold ρ. If p is larger than ρ, implying that the request seldom occurs following other requests, the request is identified as a user click. Otherwise, it tends to be the request of an embedded object. Obviously, the number of requests identified as user clicks depends on the value of ρ. A bigger ρ results in fewer requests being identified as user clicks, but they are more likely to be true user clicks. To achieve a reasonable trade-off between the quantity and quality of identified user clicks, we develop a self-learning process to determine the value of ρ, shown as Algorithm II. We set an initial threshold ρ0 to a large value, for example, 0.9. Nodes identified as user clicks by this large threshold are most likely true. We put these primary nodes into a set Oprimary. If a node has a directed edge from a node in Oprimary, it is placed into another set Osecondary.
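Given the S and W produced by Algorithm I, Equation (1) is a straightforward per-node computation. A sketch under the dict-of-dicts representation used above (the function name is our own):

```python
def click_probability(url, S, W):
    """p = 1 - d_in(o) / S[o], Eq. (1): the fraction of accesses to `url`
    that did NOT follow another object within the lookahead window."""
    d_in = sum(row.get(url, 0) for row in W.values())   # total incoming weight
    return 1.0 - d_in / S[url]

# An object accessed 3 times, only once as a secondary request of o1:
S = {"o4": 3}
W = {"o1": {"o4": 1}}
p = click_probability("o4", S, W)   # 1 - 1/3
```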
After this, we can calculate the expected ratio ϕ0 of user clicks among all requests, i.e., the number of actual user clicks over the total number of requests. Then, an iterative process is executed, varying the threshold ρ from 1 to 0 with a step size of 0.01. For each ρ, we obtain an identification result Oprimary and the corresponding ratio ϕ. Obviously, we have 0 ≤ ϕ ≤ 1 for 1 ≥ ρ ≥ 0. The iterative process stops when ϕ > ϕ0. Then, the current Oprimary is identified as the set of user clicks C.

IV. EXPERIMENTAL EVALUATION

A. Network Environment and Dataset

To evaluate the proposed method, we collected a test dataset with a high-performance traffic monitoring system (TMS) placed in the core network of a leading mobile operator in China. The network environment is shown in Fig. 3. The network, serving both 2G and 3G subscribers, consists of three major parts: mobile clients, the access network, and the core network. A mobile client communicates with a cell tower in
Algorithm II: Identify User Clicks
Input:  G = {O, S, E, W} → The dependency graph
Output: C = {c1, c2, c3, ...} → The set of user clicks
 1: for each oi in O do
 2:   din = 0
 3:   for each oj in O do
 4:     din += W[oj][oi]
 5:   end for
 6:   p = 1 − din / S[oi]
 7:   if p > ρ0 then
 8:     push(Oprimary, oi)
 9:     sprimary += S[oi]
10:   end if
11: end for
12: for each oi in Oprimary do
13:   for each oj in O do
14:     if E[oi][oj] > 0 and oj ∉ Oprimary then
15:       push(Osecondary, oj)
16:     end if
17:   end for
18: end for
19: for each oj in Osecondary do
20:   ssecondary += S[oj]
21: end for
22: ϕ0 = sprimary / (sprimary + ssecondary)
23: for ρ = 1 to 0 step 0.01 do
24:   for each oi in O do
25:     din = 0
26:     for each oj in O do
27:       din += W[oj][oi]
28:     end for
29:     p = 1 − din / S[oi]
30:     if p > ρ then
31:       push(Oprimary, oi)
32:     else
33:       push(Osecondary, oi)
34:     end if
35:   end for
36:   if (ϕ = sprimary / (sprimary + ssecondary)) > ϕ0 then
37:     break
38:   end if
39: end for
40: C = Oprimary
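One possible realization of Algorithm II in Python is sketched below. Two points are our interpretation rather than the paper's: the candidate sets are recomputed from scratch for each ρ (the pseudocode leaves the reset implicit), and the loop stops on ϕ ≥ ϕ0 so that it can terminate at equality:

```python
def identify_user_clicks(S, W, rho0=0.9, step=0.01):
    """Sketch of Algorithm II: self-learned threshold selection.
    S: occurrence count per URL; W: W[oi][oj] edge weights."""
    def prob(url):
        # p = 1 - d_in / S[url], Eq. (1)
        d_in = sum(row.get(url, 0) for row in W.values())
        return 1.0 - d_in / S[url]

    p = {o: prob(o) for o in S}

    # phi0: expected click ratio at the conservative initial threshold;
    # here, secondary nodes are only those reached by an edge from a primary.
    primary0 = {o for o in S if p[o] > rho0}
    secondary0 = {oj for oi in primary0 for oj, w in W.get(oi, {}).items()
                  if w > 0 and oj not in primary0}
    s_p = sum(S[o] for o in primary0)
    s_s = sum(S[o] for o in secondary0)
    phi0 = s_p / (s_p + s_s)

    # Lower rho from 1 until the ratio of primary accesses reaches phi0
    # (lines 23-39); in the loop, every non-primary node counts as secondary.
    s_total = sum(S.values())
    rho, primary = 1.0, set()
    while rho >= 0.0:
        primary = {o for o in S if p[o] > rho}
        if sum(S[o] for o in primary) / s_total >= phi0:
            break
        rho -= step
    return primary
```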
the access network, which forwards its data service traffic to an SGSN. The SGSN establishes a tunnel over the Gn interface with a GGSN, which provides connectivity to external networks. Through this path, the request messages of a mobile client enter the IP network and reach the serving server. Data returned from the server to the client traverses the reverse path. Therefore, with the TMS deployed on the Gn interface, we can capture all HTTP requests and responses between clients and web servers, along with user identities and access times. Table I summarizes the dataset we acquired.

B. Identification Results

To evaluate the effectiveness of the proposed method, we compare its identification accuracy with that of the data cleaning method, which is widely used by existing commercial user click identification systems. We execute our proposed method and the data cleaning method on the dataset to identify user clicks. The identified user clicks
Fig. 3. Network Architecture
produced by the two methods are compared against a benchmark of real clicks derived by human inspection to verify the correctness of the results. For each method, we evaluate the identified clicks of 100 random users on three well-known news portal websites with rich embedded objects in their pages, namely sina.com.cn, sohu.com, and ifeng.com. Identified user clicks in six sample time slots, 9-10, 11-12, 13-14, 15-16, 17-18, and 19-20, are verified. We use the F1 score [10], which combines precision and recall with equal weight, to measure the accuracy of the identification results. The F1 score can be calculated using Equation 2, in which the precision (P) is the number of correctly identified clicks divided by the number of all identified clicks, and the recall (R) is the number of correctly identified clicks divided by the number of requests that should be identified as clicks. The F1 score reaches its best value at 1 and its worst value at 0.

F1 = 2·P·R / (P + R)    (2)
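Equation (2) can be computed over sets of identified and true click requests; a short sketch (the set-based interface is our assumption):

```python
def f1_score(true_clicks, identified):
    """F1 = 2*P*R / (P + R), Eq. (2), over sets of requests."""
    correct = len(true_clicks & identified)
    if correct == 0:
        return 0.0
    precision = correct / len(identified)   # correct / all identified
    recall = correct / len(true_clicks)     # correct / all true clicks
    return 2 * precision * recall / (precision + recall)
```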
F1 scores of the experimental results are shown in Table II, in which DC represents the data cleaning method and DG represents our dependency graph based method. For the three websites, the average F1 scores of our method are 0.9185, 0.8987, and 0.8180, respectively (rather close to the best value of 1), while those of the data cleaning method are only 0.3219, 0.2815, and 0.3430, respectively. The results clearly demonstrate that our method is more accurate than the data cleaning method.

TABLE I
SUMMARY OF DATASET

Factor                     Value
File Size                  339.2 GB
Number of Unique Users     49,103
Number of Hosts            26,059
Number of HTTP Requests    2,025,994
Identified User Clicks     70,124
TABLE II
F1 SCORES OF IDENTIFICATION RESULTS

Website  Method  9-10    11-12   13-14   15-16   17-18   19-20   Avg
sina     DC      0.3245  0.3643  0.3095  0.3685  0.3149  0.2496  0.3219
sina     DG      0.9362  0.9063  0.9655  0.9167  0.8541  0.9322  0.9185
sohu     DC      0.2832  0.2667  0.3314  0.3278  0.1874  0.2922  0.2815
sohu     DG      0.9579  0.8944  0.9490  0.7886  0.9007  0.9014  0.8987
ifeng    DC      0.3429  0.3689  0.3818  0.2882  0.3453  0.3311  0.3430
ifeng    DG      0.8498  0.7315  0.8763  0.7895  0.7723  0.8889  0.8180

C. Discussion on Parameters

In Algorithm I, the lookahead time window τ is used to determine whether a request is a primary or a secondary request when establishing the dependency graph. In our method, whether a request is a user click is inferred by comparing the statistical probability p of being the primary request with a threshold ρ. To evaluate the impact of τ on the accuracy of the identification results, we conducted a set of experiments studying the F1 score for various values of τ. The results are shown in Table III. The F1 scores of the three websites remain relatively stable across the different values of τ, which demonstrates that the accuracy of our method is not sensitive to this parameter. In our experiments, we choose τ = 3s.

TABLE III
F1 SCORES FOR VARIOUS LOOKAHEAD TIME WINDOWS

Website  τ=1s    τ=2s    τ=3s    τ=4s    τ=5s
sina     0.8982  0.9306  0.9362  0.9283  0.9197
sohu     0.9579  0.9498  0.9579  0.9241  0.9242
ifeng    0.8456  0.8437  0.8498  0.7813  0.7570

To illustrate the self-learning process of selecting the optimal threshold ρ in Algorithm II, we plot the values of ϕ against the threshold ρ in Fig. 4. In our experiment, we set the initial threshold ρ0 to 0.9. The expected ratio ϕ0 of user clicks among all requests corresponding to ρ0 is 0.0337 (the red dotted line in Fig. 4). The ratio ϕ grows as ρ decreases, and reaches 0.0360 when ρ is 0.72, which is the optimal threshold for our experiment. In Table II, the F1 scores indicate that ρ = 0.72 leads to a reasonable trade-off between precision and recall.
V. CONCLUSION

In this paper, we have proposed a novel method to identify user clicks based on a dependency graph model. By studying the parallel web browsing behavior supported by modern browsers, we model the sequence of HTTP requests as a dependency graph. We then develop two algorithms for establishing the dependency graph and identifying user clicks from requests through a self-learning process. We have evaluated the proposed method on a dataset collected from a real-world mobile network. Experimental results substantiate that our method achieves higher accuracy than the widely used data cleaning method.

User click identification is fundamentally critical for subsequent web usage mining. We expect our work to help enhance the quality of user click identification and to benefit the analysis of user behaviors and interests so as to improve network and website performance. In our proposed method, we set the initial threshold in a practical way. In future work, we plan to investigate how the initial threshold impacts the identification results and to study whether there is a mathematically
Fig. 4. ϕ values for various ρ
optimal value of the initial threshold. Another plan is to study the distribution of embedded objects in web pages based on the identification results.

ACKNOWLEDGEMENT

This work has been partially supported by the National Natural Science Foundation of China (61072061) and the 111 Project of China (B08004).

REFERENCES

[1] A. Bianco, G. Mardente, M. Mellia, M. Munaf, and L. Muscariello, "Web user-session inference by means of clustering techniques," IEEE/ACM Transactions on Networking, 2009, 17(2), pp. 405-416.
[2] O. Chapelle and Y. Zhang, "A dynamic bayesian network click model for web search ranking," in Proceedings of the 18th International Conference on World Wide Web, ACM, 2009, pp. 1-10.
[3] Y. Fu, K. Sandhu, and M. Y. Shih, "A generalization-based approach to clustering of web usage sessions," Web Usage Analysis and User Profiling, Springer Berlin Heidelberg, 2000, pp. 21-38.
[4] J. Huang and R. W. White, "Parallel browsing behavior on the web," in Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, ACM, 2010, pp. 13-18.
[5] S. Ihm and V. S. Pai, "Towards understanding modern web traffic," in Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference, ACM, 2011, pp. 295-312.
[6] U. Lee, Z. Liu, and J. Cho, "Automatic identification of user goals in web search," in Proceedings of the 14th International Conference on World Wide Web, ACM, 2005, pp. 391-400.
[7] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, "A web usage mining framework for mining evolving user profiles in dynamic web sites," IEEE Transactions on Knowledge and Data Engineering, 2008, 20(2), pp. 202-215.
[8] G. Oikonomou and J. Mirkovic, "Modeling human behavior for defense against flash-crowd attacks," in IEEE International Conference on Communications, IEEE, 2009, pp. 1-6.
[9] T. Pamutha, S. Chimphlee, C. Kimpan, and P. Sanguansat, "Data preprocessing on web server log files for mining users access patterns," International Journal of Research and Reviews in Wireless Communications, vol. 2, 2012.
[10] C. J. van Rijsbergen, Information Retrieval, 1979.
[11] K. R. Suneetha and R. Krishnamoorthi, "Identifying user behavior by analyzing web server access log file," International Journal of Computer Science and Network Security, 2009, 9(4), pp. 327-332.
[12] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites web usage mining," IEEE Intelligent Systems, 2004, 19(2), pp. 59-65.
[13] Y. Xie and S. Z. Yu, "A large-scale hidden semi-Markov model for anomaly detection on user browsing behaviors," IEEE/ACM Transactions on Networking, 2009, 17(1), pp. 54-65.
[14] C. Xu, C. Du, G. F. Zhao, and S. Yu, "A novel model for user clicks identification based on hidden semi-Markov," Journal of Network and Computer Applications, 2012.
[15] Y. Zhang, W. Chen, D. Wang, and Q. Yang, "User-click modeling for understanding and predicting search-behavior," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1388-1396.