A New Approach for Clustering of Navigation Patterns of Online Users

MS. Dipa Dixit et al. / International Journal of Engineering Science and Technology Vol. 2(6), 2010, 1670-1676

MS DIPA DIXIT Lecturer, Computer Engineering Dept., Fr. C.R.I.T., Vashi, Navi Mumbai, India E-mail: [email protected]

MR JAYANT GADGE Assistant Professor, Thadomal Shahani Engineering College, Bandra, Mumbai, India E-mail: [email protected]

Abstract: We present a study of web-based user navigation pattern mining and propose a new approach for clustering user navigation patterns. The approach is based on graph partitioning for modeling user navigation patterns. To mine user navigation patterns, we establish an undirected graph based on the connectivity between Referrer and URI pages. Moreover, we present an algorithm for preprocessing the raw web log file and a formula for assigning weights to the edges of the undirected graph. The experimental results show the clusters of user navigation patterns and indicate that the approach can improve the quality of clustering for user navigation patterns in web usage mining systems. These results can be used for predicting users' intentions in large web sites.

Keywords: Web Usage Mining; User Navigation Pattern; Clustering; Graph Partitioning

1. INTRODUCTION
Modeling user web navigation data is a challenging task that continues to gain importance as the size of the web and its user base increase. Web navigation behavior is helpful in understanding what information online users demand. The analyzed results can then be used as knowledge in intelligent online applications, in refining web site maps, in web-based personalization systems and in improving search accuracy when seeking information. Nevertheless, the volume of online navigation data grows with each passing day, and thus extracting information intelligently from it is a difficult issue.

Web Usage Mining (WUM) is the process of extracting knowledge from Web users' access data by exploiting data mining technologies [1]. It can be used for different purposes such as personalization, system improvement and site modification. In our system, user navigation patterns are described as the common browsing behaviors among a group of users.
Since many users may have common interests up to a point during their navigation, navigation patterns should capture the overlapping interests or information needs of these users. In addition, navigation patterns should be able to distinguish among web pages based on their different significance to each pattern.

In this paper, we present an algorithm for preprocessing the web log file and a new approach for clustering user navigation patterns based on graph partitioning. For the clustering of user navigation patterns, we create an undirected graph based on the connectivity between referrer and URI pages, and we propose a formula for assigning weights to the edges of this graph. The results indicate that our approach can improve the quality of clustering for user navigation patterns in web usage mining systems. These results can be used for predicting a user's next request in large web sites.

The rest of this paper is organized as follows. In Section 2, we review related research on navigation pattern mining. Section 3 describes the design considerations and implementation for the clustering of navigation patterns. Results are shown in Section 4. Finally, Section 5 summarizes the paper and introduces future work.

ISSN: 0975-5462

2. RELATED WORK
Recently, several Web Usage Mining algorithms have been proposed for mining user navigation behavior. In the following, we review some of the most significant navigation pattern mining systems and algorithms in the web usage mining area that can be compared with our system.

A partitioning method was one of the earliest clustering methods to be used in Web usage mining, by Yan et al. [2]. They used an incremental algorithm that produces high quality clusters. Each user session is represented by an n-dimensional feature vector, where n is the number of Web pages in the session. The value of each feature is a weight measuring the degree of interest of the user in the particular Web page. The calculation of this figure is based on a number of parameters, such as the number of times the page has been accessed and the amount of time the user spent on the page. Based on these vectors, clusters of similar sessions are produced and characterized by the Web pages with the highest associated weights. The characterized sessions are the patterns discovered by the algorithm. One problem with this approach is the calculation of the feature weights: the choice of the right parameter mix for these weights is not straightforward and depends on the modeling abilities of a human expert.

A graph-theoretic partitioning approach is presented by Perkowitz and Etzioni [3], who have developed a system that helps in making Web sites adaptive, i.e., automatically improving their organization and presentation by mining usage logs. The core element of this system is a new clustering method, called cluster mining, which is implemented in the PageGather algorithm.
PageGather receives user sessions as input, represented as sets of pages that have been visited. Using these data, the algorithm creates a graph, assigning pages to nodes. An edge is added between two nodes if the corresponding pages co-occur in more than a certain number of sessions. Clusters are defined either in terms of cliques or in terms of connected components. Clusters defined as cliques prove to be more coherent, while connected component clusters are larger, but faster to compute and easier to find. A new index page is created from each cluster with hyperlinks to all the pages in the cluster. The main advantage of PageGather is that it creates overlapping clusters. Furthermore, in contrast to the other clustering methods, the clusters generated by this method group together characteristic features of the users directly. Thus, each cluster is a behavioral pattern, associating pages in a Web site. However, being a graph-based algorithm, it is rather computationally expensive, especially in the case where cliques are computed.

Another partitioning clustering method is employed by Cadez et al. [4] in the WebCANVAS tool, which visualizes user navigation paths in each cluster. In this system, user sessions are represented using categories of general topics for Web pages. A number of predefined categories are used as a bias, and URLs from the Web server log files are assigned to them, constructing the user sessions.

3. DESIGN CONSIDERATION AND IMPLEMENTATION
The steps involved in identifying the clusters of user navigation patterns are shown below:

Fig. 1. Steps for Navigation Pattern Mining (blocks: Web Logs, Data Preprocessing, User Navigation Mining, Navigation Profile Output, Knowledge Base)


Step 1: Data sets consisting of web log records for 5446 users were collected from the DePaul University website. The web log is an unprocessed text file recorded by the IIS Web Server. It consists of 17 attributes, with the data values stored as records. Fields: date, time, c-ip, cs-username, s-sitename, s-computername, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, sc-status, time-taken, cs-version, cs-host, cs(User-Agent), cs(Referer).

Step 2: Generally, several preprocessing tasks need to be done before performing web mining algorithms on the Web server logs. Data preprocessing in a web usage mining model (web log preprocessing) aims to reformat the original web logs in order to identify users' access sessions. The Web server usually registers all users' access activities on the website as Web server logs. Due to different server setting parameters, there are many types of web logs, but typically the log files share the same basic information, such as client IP address, request time, requested URL, HTTP status code and referrer. Data preprocessing is done using the following steps.

Fig. 2. Block diagram for Pre-processing (blocks: Data Cleansing, Session Identification, User Identification, Content Retrieval, Path Completion)

1. Data Cleansing: Irrelevant records are eliminated during data cleansing. Since the target of web usage mining is to obtain traversal patterns, the following two kinds of records are unnecessary and should be removed:
a. Records of graphics, video and format information, i.e. records whose file name suffixes are GIF, JPEG, CSS and so on, which can be found in the cs_uri_stem field of the record.
b. Records with failed HTTP status codes. By examining the status field of every record in the web log, records with a status code over 299 or under 200 are removed.

2. User and Session Identification: The task of user and session identification is to find the different user sessions in the original web access log. A referrer-based method is used for identifying sessions. Different IP addresses distinguish different users.
a. If the IP addresses are the same, different browsers and operating systems indicate different users. These can be found from the client IP address and the user agent, which gives information about the user's browser and operating system.
b. If the IP address, browser and operating system are all the same, the referrer information should be taken into account. The ReferURI (cs_referer) field is checked, and a new user session is identified if the URL in the ReferURI field has not been accessed previously, or if there is a large interval (usually more than 30 minutes) between the access time of this record and that of the previous one.

3. Content Retrieval: Content retrieval extracts content from the user's query request, i.e. cs_uri_query. E.g.: Query: http/1www.cs.depaul.edu /courses/syllabus.asp?course=323-21-603&q=3&y=2002&id=671. The content /courses/syllabus.asp is retrieved, which helps in quickly searching for unvisited pages of other users that are similar to the user's interests.

4. Path Completion: Path completion should be used to acquire the complete user access path. The incomplete access path of every user session is recognized based on user session identification. If, at the start of a user session, both the Referrer and the URI have data values, the value of the Referrer is deleted by replacing it with '-'.

Web log preprocessing helps in removing unwanted click-streams from the log file and also reduces the size of the original file by 40-50%.
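The cleansing and session identification rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based record layout, the extra ".jpg" suffix and the choice to compare the referrer's path against the URI stems already seen in the same session are all hypothetical.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# The paper names GIF, JPEG and CSS "and so on"; ".jpg" is an assumed addition.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".css")
SESSION_GAP = timedelta(minutes=30)


def timestamp(rec):
    """Combine the date and time fields of a record into a datetime."""
    return datetime.strptime(rec["date"] + " " + rec["time"], "%Y-%m-%d %H:%M:%S")


def is_relevant(rec):
    """Data cleansing: drop resource files and statuses outside 200-299."""
    if rec["cs-uri-stem"].lower().endswith(IGNORED_SUFFIXES):
        return False
    return 200 <= int(rec["sc-status"]) <= 299


def identify_sessions(records):
    """Group cleaned records into sessions.

    A user is keyed by (client IP, user agent). A new session starts when the
    referrer page was not visited in the user's open session, or when more
    than 30 minutes separate the record from the user's previous one.
    """
    sessions = []      # all sessions, in order of creation
    open_session = {}  # user key -> that user's most recent session
    for rec in sorted(filter(is_relevant, records), key=timestamp):
        user = (rec["c-ip"], rec["cs(User-Agent)"])
        session = open_session.get(user)
        if session is not None:
            visited = {r["cs-uri-stem"] for r in session}
            referrer = urlparse(rec["cs(Referer)"]).path
            gap = timestamp(rec) - timestamp(session[-1])
            if referrer not in visited or gap > SESSION_GAP:
                session = None
        if session is None:
            session = []
            open_session[user] = session
            sessions.append(session)
        session.append(rec)
    return sessions
```

Keying users by IP address and user agent follows items 2a and 2b; a production system would likely need additional heuristics for proxies and shared machines, which the paper does not discuss.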

Step 3: Generation of PageId: PageIds are sequentially generated identifiers such as p1, p2, p3, ... created for user queries/pages/page views.

Step 4: User Navigation Mining: The web pages accessed are modeled as an undirected graph G = (V, E). The set V of vertices contains the identifiers of the different pages hosted on the Web server.
a. An undirected graph is created for a single user session using a hash map.
b. The hash map data structure stores the referrer-URI pairs and their corresponding weights.
c. The weight of an edge is 1 or 0: if a link between the page and the referrer exists, the weight is 1, otherwise 0.
d. The weight of pages is given by the frequency of pages in the graph, i.e. Weight of pages (W) = Frequency (F) of the referrer-URI pair (occurrences in the user session).
e. The Depth First Search algorithm is applied to the graph to obtain all possible navigation patterns.
f. Path length is calculated by considering the total weight of the edges in the graph.
g. If the navigation pattern weight/path length is less than or equal to two, the pattern is not considered for analysis (Minpathlength = 3).

E.g.: UserId = 1, Navigation Pattern = p1, p2, p3, p5 or 1, 2, 3, 5.

Hence, clusters of patterns for user sessions are obtained and fed into the Knowledge base for further analysis.
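Steps a to g can be illustrated with a short Python sketch. It is one possible reading of the description rather than the authors' code: the function names, the use of a frozenset as the hash map key for an undirected referrer-URI pair, and the decision to report only maximal simple paths from a given start page are assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(session):
    """Build the undirected graph for one user session.

    `session` is a list of (referrer, uri) page-ID pairs. Returns the hash map
    of referrer-URI pair frequencies and an adjacency map of the graph.
    """
    pair_weight = defaultdict(int)  # frozenset({referrer, uri}) -> frequency
    adjacency = defaultdict(set)
    for referrer, uri in session:
        if referrer == "-" or referrer == uri:
            continue  # no referrer, or a self-link: no edge to add
        pair_weight[frozenset((referrer, uri))] += 1
        adjacency[referrer].add(uri)
        adjacency[uri].add(referrer)
    return pair_weight, adjacency


def navigation_patterns(adjacency, start, min_path_length=3):
    """Enumerate maximal simple paths from `start` by depth-first search.

    Every existing edge has weight 1, so the path length equals the number of
    edges; paths shorter than `min_path_length` are discarded.
    """
    patterns = []

    def dfs(node, path):
        extended = False
        for nxt in sorted(adjacency[node]):
            if nxt not in path:
                extended = True
                dfs(nxt, path + [nxt])
        if not extended and len(path) - 1 >= min_path_length:
            patterns.append(path)

    dfs(start, [start])
    return patterns


# Hypothetical session that reproduces the example pattern p1, p2, p3, p5.
session = [("-", "p1"), ("p1", "p2"), ("p2", "p3"), ("p3", "p5"), ("p1", "p4")]
weights, graph = build_graph(session)
patterns = navigation_patterns(graph, "p1")
```

With this input, the branch p1, p4 has path length 1 and is dropped by the Minpathlength = 3 rule, while p1, p2, p3, p5 has three edges and is kept.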

4. RESULTS AND DISCUSSIONS
Step-wise results are shown below for 5000 records.

Step 1: Collection of web logs, which are in raw or unprocessed form. The 17 attributes are shown below:

#Fields: date time c-ip cs-username s-sitename s-computername s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status time-taken cs-version cs-host cs(User-Agent) cs(Referer)

2002-04-01 00:00:10 1cust62.tnt40.chi5.da.uu.net - w3svc3 bach bach.cs.depaul.edu 80 get /courses/syllabus.asp course=323-21-603&q=3&y=2002&id=671 200 156 http/1.1 www.cs.depaul.edu mozilla/4.0+(compatible;+msie+5.5;+windows+98;+win+9x+4.90;+msn+6.1;+msnbmsft;+msnmen-us;+msnc21) http://www.cs.depaul.edu/courses/syllabilist.asp

2002-04-01 00:00:26 ac9781e5.ipt.aol.com - w3svc3 bach bach.cs.depaul.edu 80 get /advising/default.asp - 200 16 http/1.1 www.cs.depaul.edu mozilla/4.0+(compatible;+msie+5.0;+msnia;+windows+98;+digext) http://www.cs.depaul.edu/news/news.asp?theid=573
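A raw record of this kind can be split into its 17 named attributes with a few lines of Python. The sketch below is illustrative only; `parse_record` and `FIELDS` are hypothetical names, and it assumes that fields are separated by single spaces and contain no embedded spaces, as in the sample records (where spaces in the user agent are encoded as '+').

```python
FIELDS = [
    "date", "time", "c-ip", "cs-username", "s-sitename", "s-computername",
    "s-ip", "s-port", "cs-method", "cs-uri-stem", "cs-uri-query", "sc-status",
    "time-taken", "cs-version", "cs-host", "cs(User-Agent)", "cs(Referer)",
]


def parse_record(line):
    """Split one space-delimited log line into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Second sample record from the log excerpt.
sample = (
    "2002-04-01 00:00:26 ac9781e5.ipt.aol.com - w3svc3 bach "
    "bach.cs.depaul.edu 80 get /advising/default.asp - 200 16 http/1.1 "
    "www.cs.depaul.edu "
    "mozilla/4.0+(compatible;+msie+5.0;+msnia;+windows+98;+digext) "
    "http://www.cs.depaul.edu/news/news.asp?theid=573"
)
record = parse_record(sample)
```

Rejecting lines with an unexpected field count is a design choice of this sketch; it makes malformed or wrapped records visible instead of silently misaligning the columns.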


Step 2: Preprocessing is performed on 5000 web log records from the CTI data set. Cleansing, User Identification, Session Identification and Content Retrieval are applied to the records.

Fig. 3. Processed file with required attributes

After Path Completion is performed on the web access logs, only a few attributes, i.e. 6 out of 17, are used for further analysis. Preprocessing of 5000 records took 14 seconds.

Fig. 4. Processed file after path completion (full preprocessing) with required attributes


Step 3: A PageId is generated for each URI/page/page view accessed by the user.

Fig. 5. List of Page id and corresponding pages/uri
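One simple way to generate such identifiers is sketched below in Python. The function name `assign_page_ids` is hypothetical, and numbering pages in order of first appearance is an assumption; the paper only states that the IDs are sequentially generated.

```python
def assign_page_ids(uris):
    """Map each distinct URI to p1, p2, ... in order of first appearance."""
    page_ids = {}
    for uri in uris:
        if uri not in page_ids:
            page_ids[uri] = f"p{len(page_ids) + 1}"
    return page_ids


# Illustrative URIs taken from the sample log records.
ids = assign_page_ids([
    "/courses/syllabus.asp",
    "/advising/default.asp",
    "/courses/syllabus.asp",
])
```

Repeated requests for the same URI reuse the ID assigned on first sight, so each page corresponds to a single vertex in the session graph.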

Step 4: In User Navigation Mining, undirected graphs are created and clusters of all possible patterns are generated for a user session.

Fig. 6. Cluster of patterns for user id=9

When the clusters of navigation patterns were compared to the original user navigation patterns, almost all page views/URIs (99%) were covered by the clusters. Very few outliers (1%) were obtained.


5. CONCLUSION
In this paper, we presented an algorithm for preprocessing web logs and proposed a novel approach to clustering user navigation patterns. The experimental results indicate that our approach can improve the quality of clustering for user navigation patterns in web usage mining systems by covering almost all page views (99%) in the clusters. In future work, these clusters can be used with classification methods to classify user requests, which in turn can be used in WUM-based recommendation systems.

6. REFERENCES
[1] B. Mobasher, R. Cooley, and J. Srivastava, "Automatic personalization based on Web usage mining," Communications of the ACM, vol. 43, pp. 142-151, 2000.
[2] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, "Visualization of navigation patterns on a Web site using model-based clustering," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 280-284, 2000.
[3] C. R. Anderson, P. Domingos, and D. S. Weld, "Adaptive Web Navigation for Wireless Devices," Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 879-884, 2001.
[4] J. Ben Schafer, Joseph Konstan, and John Riedl, "Recommender Systems in E-Commerce," GroupLens Research Project, Department of Computer Science and Engineering, University of Minnesota.
[5] M. Perkowitz and O. Etzioni, "Towards adaptive Web sites: Conceptual framework and case study," Artificial Intelligence, vol. 118, pp. 245-275, 2000.
[6] R. Baraglia and F. Silvestri, "Dynamic personalization of web sites without user intervention," Communications of the ACM, vol. 50, pp. 63-67, 2007.
[7] R. Baraglia and F. Silvestri, "An Online Recommender System for Large Web Sites," Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), pp. 199-205, 2004.
[8] T. W. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal, "From user access patterns to dynamic hypertext linking," Computer Networks and ISDN Systems, vol. 28, pp. 1007-1014, 1996.
