Building Web Personalization System with Time-Driven Web Usage Mining

G P Sajeev
Department of Computer Science, Amrita School of Engineering, Kollam 690525, Kerala, India

Ramya P T
Department of Computer Science, Amrita School of Engineering, Kollam 690525, Kerala, India
ABSTRACT
Web personalization is a powerful tool for tailoring Websites to their users. A personalization system suggests Web pages to users based on their navigational patterns. Using attributes such as time and the popularity of Web objects makes the model more efficient. This paper proposes a novel Web personalization model that utilizes time attributes, namely duration of visit, inter-visiting time and burst of visit, together with the user's navigational pattern. Test results indicate that the proposed model captures users' behaviour and interests.

Keywords
Web Personalization, Web Usage Mining, Navigational Pattern, Pattern Discovery, Pattern Classification.

1. INTRODUCTION
The development of the Internet and the widespread use of Web services have resulted in vast and growing amounts of data on the Web. Web data mining crawls through various Web resources to collect required information, which enables individuals and organizations to promote their business, understand marketing dynamics and float new promotions on the Internet. There is a growing trend among companies, organizations and individuals to gather information through Web mining and use it to the best of their interests [10]. According to the target of analysis, Web mining is classified into three categories: Web content mining, Web structure mining and Web usage mining [12]. Web content mining is the process of extracting useful information from Web documents. In Web structure mining, the analysis is done with the help of a graph model, the Web graph. The Web graph is a directed graph whose vertices correspond to Web pages and whose directed edges are the hyperlinks connecting the pages. Web usage mining is an application of data mining for discovering usage patterns from Web data, in order to understand and better serve the needs of Web-based applications [3]. Web usage mining has many applications, such as personalization of Web content, prefetching and caching of Web objects, which help improve the system, and implementing recommendation systems.

Web users are always eager to access content of interest quickly. To fulfil this demand, companies and organizations select and suggest Web pages to their users based on the characteristics of an individual or a community, which is termed Web personalization. Automatically suggesting Web pages to users based on their past navigational patterns is the common method adopted in Web personalization systems [9]. However, it is difficult to determine a user's interest, since it may change over time. Hence it is necessary to consider timing information when building a Web personalization model. Time is a powerful attribute that can improve efficiency and accuracy in different ways. When the time attribute is taken under different interpretations, each of them contributes significantly to personalization. In this paper we focus on time-related attributes such as average time duration, inter-visiting time and burst of visit, along with the navigational pattern.

The rest of the paper is organized as follows. Section 2 discusses related research. We introduce the Web personalization model with time-driven Web usage mining in Section 3. Section 4 presents the performance evaluation of the proposed model. We conclude the paper in Section 5 with suggestions for future work.
2. RELATED WORK
Numerous studies have addressed Web personalization. A method for Web personalization using a navigational pattern tree structure is proposed in [5]. This work also develops a Navigational Pattern mining (NPminer) algorithm for discovering frequent sequential patterns on the proposed Navigational Pattern Tree; the mining results are expressed as association rules that serve as navigational knowledge. High data complexity and the maintenance overhead of the tree structure are the major drawbacks of the method.

Using an adaptive data structure, the authors of [13] address the case of bursts of visits in the personalization of Websites. An unexpectedly large number of events occurring within a certain time period is called a burst, suggesting unusual actions or processes. In bursty Web search pattern cases, the user attempts to find specific results that belong to limited categories of interest within a short time period. As a consequence, an efficient retrieval and storage mechanism is needed to keep the user's personalized categories and frequent results. The main advantage of this method is its low space complexity, and it works well with high-traffic Websites. Its limitation is poor performance on less populated Websites.

The work in [8] deals with the problem of building an effective collaborative filtering-based recommender system for an e-commerce environment without using explicit feedback data. The main idea of this approach is to construct pseudo rating data from implicit feedback data. It can only be applied to some selective recommendation systems.

[16] proposes a method which computes the similarities of preferred paths using navigational path and semantic information. When users access a Website mined by this method, the system automatically recommends the preferred paths that have high similarity with the field selected by the user. This method improves accuracy with the help of semantic information, but at the cost of high computational complexity.

A personalization model using the popularity of Web items is proposed in [9]. The authors study temporal patterns in the popularity of Web items by finding popularity trends over time, and cluster Web items having similar popularity trends. They assume that the popularity trends of Web items are not necessarily periodic. An item is represented as a time series of popularity, whose length is the length of the entire time period during which the popularity is obtained. The main drawback of this method is that it may generate some unexpected patterns, which affect the accuracy of the result.

Online navigational pattern identification and prediction [4] implements efficient Web log mining and online navigational pattern prediction. That paper introduces new concepts such as shared patterns and a scalable kNN-based approach with an inverted index table. The results show that the use of shared patterns improves accuracy and that the inverted index yields a large gain in running time.

Another mechanism generates a personalization model based on tag and time information [15]. The authors investigate the importance and usefulness of tag and time information when predicting users' preferences and examine how to exploit such information to build an effective resource-recommendation model. Both tag and time information are crucial in providing personalized recommendations, but the system works well only with time-sensitive data.

We observe that the effect of time attributes on Web personalization is a relatively unexplored area. We believe that time could be used as a powerful attribute. Though there is some related work based on time attributes, it has not used them in an efficient way. We therefore propose a Web personalization model which personalizes Websites based on the average duration of visit, inter-visiting time and burst of visit, along with the user's navigational pattern.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WCI '15, August 10-13, 2015, Kochi, India. © 2015 ACM. ISBN 978-1-4503-3361-0/15/08 ... $15.00. DOI: http://dx.doi.org/10.1145/2791405.2791481
3. THE PROPOSED WEB PERSONALIZATION MODEL
The proposed system focuses on the effect of time attributes in the Web personalization model. The proposed model comprises three stages, namely log mining, pattern discovery and the prediction model. In this section, we present how we obtain good-quality sessions, which mining technique we use to obtain all navigational patterns, and how we perform navigational pattern prediction along with the time attributes. All the components of the proposed framework are depicted in Figure 1. The notations used are listed in Table 1.

Table 1: Notations
Notation   Description
S          Set of sessions
X          Dataset generated from Web log data
Y          Unique vector of page sequences
Pi         ith page in a session
α          Visit parameter for finding burst of visit
count      Number of visits on a page
Di         Time interval for finding burst of visit
μ          Threshold parameter for similarity score
P          Set of patterns

3.1 Proposed System
The proposed system has three stages. Web log data and cookie information are the main sources of input data used in our empirical study. The log contains information such as user IP, requested page URL, method of access, time, date and user agent. Our preprocessing involves four major steps: cleaning, user identification, session identification and attribute identification. In the cleaning process we remove irrelevant entries from the log, such as error requests and robot requests. Requests for images and graphics are also removed, since they have little significance for our model. Users are identified based on the IP address in the log file. For session identification we apply a 30-minute time-out threshold. After sessions are identified, we compute the time attributes (average time duration, inter-visiting time and burst of visit) along with the navigational pattern. Next, we cluster the data using these attributes. Finally, we generate a classifier based on the clustering result. This classifier acts as the model that predicts the characteristics of the user.
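The cleaning, user identification and session identification steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation (the paper reports using R). The record fields (`ip`, `time`, `url`, `status`, `agent`), the list of static-file suffixes and the robot keywords are all assumptions made for the example; only the 30-minute time-out and the use of the IP address as the user key come from the text.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)  # time-out threshold stated in the text
# Hypothetical filters: these suffixes and keywords are assumptions for the sketch.
STATIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".xbm", ".css", ".js")
ROBOT_KEYWORDS = ("bot", "crawler", "spider")


def clean(records):
    """Drop error responses, robot requests and image/graphics requests."""
    kept = []
    for r in records:
        if r["status"] >= 400:
            continue
        if any(k in r.get("agent", "").lower() for k in ROBOT_KEYWORDS):
            continue
        if r["url"].lower().endswith(STATIC_SUFFIXES):
            continue
        kept.append(r)
    return kept


def sessionize(records):
    """Group requests per IP and start a new session after a 30-minute gap."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r["ip"], r["time"]))
    for ip, reqs in groupby(ordered, key=lambda r: r["ip"]):
        current = []
        for r in reqs:
            if current and r["time"] - current[-1]["time"] > SESSION_TIMEOUT:
                sessions.append({"user": ip, "requests": current})
                current = []
            current.append(r)
        sessions.append({"user": ip, "requests": current})
    return sessions
```

Calling `sessionize(clean(records))` on a list of parsed log records returns one dictionary per session, whose position in the list can serve as a session ID.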
3.1.1 Data Preprocessing
The data preprocessing module consists of four stages, viz. log data cleaning, user identification, session identification and attribute selection. The cleaning step consists mainly of removing implicit requests and requests made by robots. A robot (i.e., a Web crawler) is a software program that performs automatic information retrieval from the Web, taking advantage of its structure to move from page to page and from site to site [4]. The data pertaining to crawler traversals is filtered out from the dataset, since it may adversely affect the analysis of results.

Figure 1: Web Personalization System

Users are identified based on the IP addresses in log entries. Here we do not consider the individual users behind a proxy server; rather, the IP address of the proxy server is treated as a single user who represents the community. With this assumption we analyse the character of the user. After identifying the users, we generate the sessions pertaining to them. Time-out oriented techniques are widely used by commercial tools to detect sessions in Web log data. In this technique a session is identified as the set of pages requested during a predefined time-out interval. This work uses a time-out threshold of 30 minutes to start a new session, which is a standard value [4]. The data preprocessing method is depicted in Algorithm 1.

Algorithm 1: Algorithm for data preprocessing
Input: userIp, timestamp, pageurl, access method, page size
Output: sessionid, userid, pageseq
1 Remove the irrelevant requests, robot requests and error requests
2 Identify each user based on IP address and assign a user ID
3 Sort the sequences according to user ID and timestamp
4 for all sequences do
5     Divide the current sequence according to the page-stay threshold and userid
6     Combine page sequences of the same session and assign each user session a unique session ID

After generating sessions, we identify the time attributes. Three time-related attributes, namely average time duration, burst of visit and inter-visiting time, are calculated for each user's sessions. The average time duration is calculated as the mean of the time durations between consecutive page visits. The count of Web accesses to a page is calculated for a particular time interval. If this count is sufficient to mark the Web page as the most preferred one during the time interval in which the accesses are performed, then this access pattern is considered a burst of visits. A burst of visits reflects requests that come into the site due to some popular event occurring on the site; such requests arise because of the particular event, not because of the user's interest. The burst of visit parameter can be used to reduce the effect of such requests [13]. The inter-visiting time is calculated as the time difference between sessions. The algorithm for finding the attributes is given in Algorithm 2.
Algorithm 2: Algorithm for finding attributes
Input: sessionid, userid, timestamp, pageseq
Output: sessiondate, pageseq, avgtime, intervisit, burstofvisit
1 for sequences of each user from X do
2     Divide the sequences based on sessions S = {s1, s2, ..., sn}
3     Compute the time differences between consecutive pages in S and find the mean as average time duration
4     Find time difference between each session as inter-visiting time
5 Extract time interval, say Di, from timestamps in X
6 Set a threshold for times, visit parameter α
7 for each Di do
8     Find the number of times each page pi is visited
9     if count(pi) > times then
10        Select sessions with pi on Di
11        Set burstOfVisit = α ∗ count
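The three time attributes of Algorithm 2 can be illustrated with a short Python sketch. It rests on several assumptions that the paper does not state: timestamps are `datetime` objects, the inter-visiting time is measured from the last request of one session to the first request of the next, and the time interval Di is a calendar day. The function and parameter names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def average_duration(times):
    """Mean gap in seconds between consecutive page requests of one session."""
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def inter_visit_times(sessions):
    """Seconds from the end of each session to the start of the next one.

    `sessions` is a list of timestamp lists for a single user, ordered in time.
    Measuring end-to-start is an assumption of this sketch.
    """
    return [
        (nxt[0] - prev[-1]).total_seconds()
        for prev, nxt in zip(sessions, sessions[1:])
    ]


def burst_of_visit(requests, alpha, times_threshold):
    """Map (day, page) to alpha * count for pages whose daily request count
    exceeds the threshold; `requests` is a list of (timestamp, page) pairs."""
    counts = Counter((t.date(), page) for t, page in requests)
    return {
        key: alpha * n for key, n in counts.items() if n > times_threshold
    }
```

As a purely hypothetical illustration of the last step, a page requested 2,500 times in one day with α = 0.005 and a threshold of 2,000 would receive a burst-of-visit value of 0.005 × 2500 = 12.5, while a page requested 1,500 times that day would receive none.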
Finally, we extract the navigational pattern from the user's behaviour. The navigational pattern is simply the sequence of Web pages that the user has visited during a period. We translate each long sequence of URLs into a pattern ID. A unique vector of page sequences is first generated from the data. Then we compare the URL sequence of each session with the entries in the unique vector using a string similarity function, and replace the sequence with the most similar entry. The algorithm for finding the navigational pattern is given in Algorithm 3.
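The mapping from URL sequences to pattern IDs described above can be sketched as follows. The paper does not name its string similarity function, so this example substitutes the ratio computed by `difflib.SequenceMatcher` from the Python standard library, and it uses a simplified greedy reading of the thresholding step. Both choices, and the names `build_patterns` and `pattern_id`, are assumptions of the sketch.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two page sequences (tuples of URLs)."""
    return SequenceMatcher(None, a, b).ratio()


def build_patterns(page_sequences, mu):
    """Greedily keep one representative per group of sequences that are
    at least `mu`-similar; the list index serves as the pattern ID."""
    patterns = []
    for seq in dict.fromkeys(page_sequences):  # unique, order-preserving
        if not any(similarity(seq, p) >= mu for p in patterns):
            patterns.append(seq)
    return patterns


def pattern_id(seq, patterns):
    """Return the index of the representative most similar to `seq`."""
    return max(range(len(patterns)), key=lambda i: similarity(seq, patterns[i]))
```

Representing a session as a tuple of URLs rather than one concatenated string makes the comparison operate on whole pages, which is one possible design choice; a character-level comparison of concatenated URLs would be another.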
Algorithm 3: Algorithm for Navigational Pattern
Input: sessionid, userid, sessiondate, pageseq, avgtime, intervisit, burstofvisit
Output: pattern set generated
1 Remove users with only one session from X
2 Create a unique vector of pageseq in X as Y
3 for each yi in Y do
4     Create similarity vector for yi
5     Remove vectors which have a similarity score less than threshold μ
6 Create a pattern list P = {p1, p2, ..., pn} from Y
7 for each xi in X do
8     Compute similarity score of each pageseq in X with Y and generate similarity vector
9     Choose most similar yi from Y by finding similarity score and map to pi in P

3.1.2 Pattern Discovery
The log mining stage generates a dataset with different attributes such as average time duration, inter-visiting time and burst of visit, along with the navigational pattern. We use clustering as the method for pattern discovery. Clustering, or cluster analysis, is the task of grouping a set of objects into different clusters in such a way that objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. There exist different types of clustering models, viz. the connectivity model, the centroid model and the distribution model. This study employs the centroid-based K-means clustering algorithm. The K-means clustering algorithm is considered simple to use and computationally fast when the number of clusters is kept small. We classify users into different clusters based on proximity measures. The result of clustering is a set of groups of user sessions with similar characteristics.

3.1.3 Building the Prediction Model
The pattern discovery stage produces different groups of users. Next we build the prediction model for classifying new requests into these categories. The classifier is trained using the results of pattern discovery in order to build the personalization model. We choose SVM (Support Vector Machine) as the learning model, since it offers high accuracy and theoretical guarantees [1, 14]. Moreover, SVM models also perform well on non-linear data. The proposed model is capable of predicting the user's behaviour by placing the user into a designated class. Since our model makes use of timing attributes along with the navigational pattern, it is presumed that the model works well as a personalization system.

4. PERFORMANCE EVALUATION
Our experiments include processing of the Web log data, identifying the attributes, and training and verification of the prediction model. Here we briefly describe the dataset used to conduct the experiments and the testing environment.

4.1 Dataset and Tool Used
The dataset is collected from the NASA space center Web server [2]. It is a well-known dataset used by many researchers. The log data was recorded according to the common log format and spans the whole month of July 1995. A sample of the Web log data is shown in Figure 2. We use the statistical computing tool R [11, 6] for validating our model.

4.2 Preprocessing
The original log file contains around 300,000 entries, from which erroneous requests, robot requests and requests for images and graphics are removed. After the preprocessing stage the log size reduces to 24,000 entries. Next we identify the users based on their IP addresses and generate 3000 sessions for inclusion in the dataset. We then compute the parameters average time duration, burst of visit and inter-visiting time [13]. For computing the burst of visit, the visit parameter and the time interval need to be selected. The visit parameter and the time interval are arbitrarily chosen as 0.005 and 24 hours, respectively. A very long time interval (days or months) may lose its significance, and a very short time interval (seconds or minutes) makes the computation complex. Hence, we collect the requests on each page in a day. If the number of requests is greater than the threshold value, then we calculate the value of burst of visit as visit parameter × number of requests on the page. The threshold value for the number of requests (times in Algorithm 2) is set to 2000. Then we select the sessions that contain pages whose request count is greater than 2000, and update their burst of visit value.

4.3 Results
After generating the dataset we apply the k-means clustering algorithm [7]. For experimentation purposes we arbitrarily choose the number of clusters as 5. Hence the sessions are divided into five categories. After clustering we analyse the clusters and identify each group. It is observed that the clusters have the following characteristics.

– Frequency of visit is high: Interested users.
– Frequency of visit is very low: Less interested users.
– Average visiting time is high: Interested but infrequent users.
– Inter-visiting time is high: Very little interest.
– Frequency of visit and duration are very low: Not interested.

The graphical representation of the different clusters is depicted in Figure 3. Finally we build the prediction model based on the clustering result, using the aforementioned SVM classifier. We observed that the model shows a good accuracy of 96% on the training dataset. The model was then tested with a new dataset, on which it yields about 94% accuracy. The SVM classifier is also compared with the Naive Bayes (NB) classifier; the SVM classifier is observed to perform slightly better than the NB classifier. The comparison results are depicted in Figure 4.
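The clustering step used in the evaluation can be illustrated with a small, dependency-free K-means sketch (Lloyd's algorithm). This is not the authors' R code: the feature layout, the seeded random initialisation and the iteration cap are assumptions. In the pipeline described above, each point would be a session's feature vector (for example average duration, inter-visiting time, burst of visit and pattern ID), k would be 5, and the resulting labels would serve as training targets for the classifier.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, labels). Initial centroids are sampled with a fixed
    seed so that runs are repeatable.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels
```

Because the attributes live on very different scales (seconds versus counts), standardising each feature before clustering is a common precaution; the paper does not say whether this was done.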
Figure 2: Sample Web log data
Figure 3: Pie diagram for clustering result
Figure 4: Prediction Accuracy of SVM and NB Classifiers with dataset size of (a) 5000 records (b) 7000 records
5. CONCLUSION
This study has proposed a novel Web personalization model using time attributes. The model is trained with a dataset generated by K-means clustering. It is observed that using time attributes along with the navigational pattern yields better results in terms of prediction accuracy. The proposed model also allows users to be classified not only on the basis of navigational pattern but also on the basis of their behaviour over time. We identified that the time attributes time duration, burst of visit and inter-visiting time contribute significantly to characterizing users' Web browsing behaviour. It will be interesting to perform a further analysis of users' access patterns and behaviour within a group. Also, this study does not consider semantic information for analysing users' behaviour. The results of this study could be used for building powerful recommendation systems when combined with semantic relatedness. These are suggested as directions for future research.
6. REFERENCES
[1] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. The Journal of Machine Learning Research, 2:125–137, 2002.
[2] P. Danzig, J. Mogul, V. Paxson, and M. Schwartz. The internet traffic archive. Available at URL http://ita.ee.lbl.gov, 2000.
[3] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons, 2nd edition, 2012.
[4] A. Guerbas, O. Addam, O. Zaarour, M. Nagi, A. Elhajj, M. Ridley, and R. Alhajj. Effective web log mining and online navigational pattern prediction. Knowledge-Based Systems, 49:50–62, 2013.
[5] Y.-M. Huang, Y.-H. Kuo, J.-N. Chen, and Y.-L. Jeng. Np-miner: A real-time recommendation algorithm by using web usage mining. Knowledge-Based Systems, 19(4):272–286, 2006.
[6] R. Ihaka and R. Gentleman. R: a language for data analysis and graphics. Journal of computational and graphical statistics, 5(3):299–314, 1996.
[7] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7):881–892, 2002.
[8] T. Q. Lee, Y. Park, and Y.-T. Park. A time-based approach to effective recommender systems using implicit feedback. Expert systems with applications, 34(4):3055–3062, 2008.
[9] W.-K. Loh, S. Mane, and J. Srivastava. Mining temporal patterns in popularity of web items. Information Sciences, 181(22):5010–5028, 2011.
[10] S. K. Pani, L. Panigrahy, V. Sankar, B. K. Ratha, A. Mandal, and S. Padhi. Web usage mining: a survey on pattern extraction from web logs. International Journal of Instrumentation, Control & Automation, 1(1):15–23, 2011.
[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
[12] Z. Rong, Y. Tang, and S. Liu. Research on web log mining. In Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012, pages 849–856. Springer, 2013.
[13] E. Sakkopoulos, D. Antoniou, P. Adamopoulou, N. Tsirakis, and A. Tsakalidis. A web personalizing technique using adaptive data structures: The case of bursts in web visits. Journal of Systems and Software, 83(11):2200–2210, 2010.
[14] K. Soman, R. Loganathan, and V. Ajay. Machine learning with SVM and other kernel methods. PHI Learning Pvt. Ltd., 2009.
[15] N. Zheng and Q. Li. A recommender system based on tag and time information for social tagging systems. Expert Systems with Applications, 38(4):4575–4587, 2011.
[16] Z. Zhou and D. Yang. Personalized recommendation of preferred paths based on web log. Journal of Software, 9(3):684–688, 2014.