Web Usage Mining: A Review on Process, Methods and Techniques

1Chintan R. Varnagar, 2Nirali N. Madhak, 3Trupti M. Kodinariya, 4Jayesh N. Rathod
[email protected], [email protected], [email protected], [email protected]
1,2 Post Graduate Student, 3,4 Assistant Professor, Department of Computer Engineering, Atmiya Institute of Technology and Science, Rajkot, Gujarat, India.
Abstract: In the current era, the internet plays such a vital role in our everyday life that it is difficult to survive without it. The World Wide Web (WWW) has greatly influenced both users (visitors) and web site owners. The enormous growth of the World Wide Web increases the complexity for users to browse effectively. To improve the performance of web sites, web site design and web server activities are changed according to users' interests. To achieve this, site owners have to analyze user access patterns, which are captured in the form of log files. Web usage mining is the process of analyzing the interaction of users with different web applications. It can be seen as a three-step process: data preprocessing, pattern discovery and pattern analysis. Due to the tremendous use of the web, web log files grow at a fast rate and reach huge sizes. These data are usually noisy and ambiguous, hence the preprocessing step is essential in the mining process. The main preprocessing techniques are data cleaning, user identification, session identification and transaction identification. In this paper, we provide a detailed survey of the work done so far on the data collection and preprocessing stages of web usage mining. Keywords: Data mining, Web usage mining, Web log mining, Preprocessing.
I. INTRODUCTION The World Wide Web (WWW) holds a tremendous amount of data, and data on the WWW is growing exponentially in both size and usage over time. In contrast to standard data mining methods, web data mining methods need to deal with heterogeneous, semi-structured or unstructured data [1]. In web data mining, various core or applied data mining techniques are applied to obtain interesting knowledge from the data available on the WWW. The resources (web pages) on the WWW also undergo frequent updates to their content and structure over time. Web data mining can be categorized based on the interest and/or final objective of what kind of knowledge is to be mined from web data [2]. 1) Web Content Mining refers to the discovery of useful information or knowledge from web page contents, i.e. text or multimedia data such as images, audio and video. 2) Web Structure Mining aims at analyzing, discovering and modeling the link structure of web pages and/or web sites to generate a structural summary, on which various techniques are applied; the outcomes of these techniques can be utilized to
recreate and redesign the web site, which ultimately improves its structural quality [3]. 3) Web Usage Mining deals with understanding user behavior while interacting with a web site, by using various log files to extract knowledge from them. This extracted knowledge can be applied for efficient reorganization of the web site, better personalization and recommendation, improvement in links and navigation, and attracting more advertisement. As a result more users are attracted to the web site, which can then generate more revenue [2, 3, 5]. In this paper we give an overview of web data mining, the process of Web Usage Mining (WUM) and an in-depth review of the work done so far on data preprocessing methods for WUM. When users interact with a web site, they generate and leave behind traces in different places and in different formats. Section II discusses the possible sources from which logs for web usage mining can be obtained. These traces are captured and recorded in an appropriate way, but the collected logs may suffer from impurities and noise, so various data mining techniques cannot be applied directly on them. Section III therefore discusses the requirements, steps and methods for data preprocessing. Section IV discusses various pattern discovery techniques that can be applied on the preprocessed logs gathered in the previous step to mine knowledge from them. Section V, Pattern Analysis, discusses the ways such generated results can be represented, interpreted and analyzed. Section VI gives the conclusion and Section VII directs towards future work. II. DATA COLLECTION There are three main sources from which to obtain the raw log data, namely 1) Client Log File, 2) Proxy Log File, and 3) Web Server Log File [6]. A. Web Server Log File: The most significant and frequently used source for web usage mining is web server log data. This web log data is generated automatically by the web server when it services user requests, and it contains all information about a visitor's activity [2].
The common server log file types are the access log, agent log, error log and referrer log [6]; Table-1 summarizes each. Depending on the web server, web log file data varies in the number and type of attributes and in the format of the log file [7]. The W3C maintains a standard log file format; however, custom log file formats can also be configured. Many varied formats are available, such as 1) Common log format, 2) Extended common log format, 3) Centralized log format, 4) NCSA common log format, 5) ODBC logging, and 6) Centralized binary logging [8]. Among all of these, the common and extended formats are the ones mainly implemented by web servers.

TABLE-1: WEB SERVER LOG FILE TYPES AND CONTENT
Log File Type | What it records
Access Log    | All resource access requests sent by users
Agent Log     | User's browser, version, OS, etc.
Error Log     | Details of errors that occurred while processing user access requests
Referrer Log  | Information about the referrer page

The Common Log Format (CLF) may contain the following fields: [host/IP rfcname logname [DD/MMM/YYYY:HH:MM:SS +0000] "METHOD /PATH HTTP/1.0" bytes] [7].

Fig. 1: W3C Extended Common Log Format (ECLF) file [20].

The W3C Extended Log File Format (Figure-1) is very valuable in web usage mining, as it can be customized. It contains some additional attributes beyond the CLF [7, 20]. These are: i) REFERER_URL, the URL the visitor came from; ii) HTTP_USER_AGENT, which reflects the visitor's browser version; and iii) HTTP_COOKIE, a persistent token sent to the visitor and used to identify the user uniquely.

Fig. 2: Explanation of the additional attributes of the ECLF [7].
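As a concrete illustration of the CLF fields described above, the sketch below parses one Common Log Format line with a regular expression. The sample entry and the helper name are illustrative assumptions, not taken from any particular server:

```python
import re

# Regular expression for a Common Log Format (CLF) entry:
# host rfcname logname [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Parse one CLF line into a dict, or return None if malformed."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Split the request field into method, path and protocol.
    parts = entry["request"].split()
    if len(parts) == 3:
        entry["method"], entry["path"], entry["protocol"] = parts
    return entry

# A hypothetical sample entry.
sample = '192.168.1.5 - - [10/Oct/2013:13:55:36 +0530] "GET /index.html HTTP/1.0" 200 2326'
print(parse_clf_line(sample)["path"])  # -> /index.html
```

A parser for the extended format would differ only in reading the field list declared by the log's `#Fields:` directive instead of assuming a fixed layout.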
A web server may use caching for efficiency. So if a request comes from a user for a particular page and that page is in the cache, it will be delivered to the user without an entry being made in the web server log file.

The entire process of web usage mining can be logically divided into three significant and co-related steps, as shown in Figure-3: Data Preprocessing (Data Preparation), Pattern Discovery (Knowledge Discovery) and Pattern Analysis (Knowledge Analysis & Presentation).

Fig. 3: Web Usage Mining Process

B. Client Side Log File: This refers to the recording of activities and events that happen within the premises of the client machine, such as mouse wheel rotation, scrolling within a particular page, mouse clicks and content selection [9]. In some cases it is advantageous, as it eliminates the need for session identification and sidesteps caching issues [11]. Client-side activity can be recorded in a number of ways: 1) By integrating a Java applet with the web site, which records each activity of the user. However, the Java plug-in needs to be installed in each client-side browser, and the user may experience a delay in page loading time when the applet is loaded for the first time [11]. 2) By writing JavaScript in almost every page of the web site, which records the interaction of the user with the web page and reports it to the server when the transaction is complete. This approach requires each page to be re-created and re-designed, which can be time consuming, cumbersome, and in some cases not technically feasible because of the limitations of the web hosting and allied server-side software/hardware components. 3) By developing a browser plug-in, which needs to be installed only once, records the interaction, and sends the records at finite intervals of time, or just before the user closes the connection with the web site or quits the browser. This can be done without changing the underlying design, architecture or technology of the web site. However, the user's collaboration is required, and compatible plug-ins need to be developed per browser type. [9, 10] demonstrated how client-side public or private data, such as the contents of My Documents, calendar, browser history, favorites and bookmarks, can be used for WUM applications like user profiling and content-based recommendation. [10] suggested a recommendation system consisting of three tiers (layers): Layer-1 is a raw information collection agent, which collects data from the client machine; Layer-2, a logic layer, uses this data to create a Dynamic User Profile (DUP); and Layer-3 is responsible for presentation and a customized UI. [9] suggested building such a dynamic profile from various hardware-level events (keyboard, mouse, etc.).

C. Proxy Server Log File: At many places, network traffic is routed through a dedicated machine known as a proxy server; all requests and responses are serviced through it. Study of proxy server log files, whose format is the same as that of the web log file, may reveal the actual HTTP requests coming from multiple clients to multiple web servers, and characterizes the browsing behavior of a group of anonymous users sharing a common proxy server [11]. Some web sites use an n-tier architecture to obtain reliable, efficient and secure web applications. Log data gathered at the application server while servicing user requests can also be used for web usage mining. Such logs show in detail how user requests are serviced, and may assist in identifying and understanding the internal calls, i.e. the page accesses that result from fulfilling a single request.
III. DATA PREPROCESSING Due to the diversity of sources, the individual or combined log file containing raw log data is unformatted and may contain noise and impurities, so data mining techniques cannot be applied directly on it [5]. The raw log data therefore undergoes a complex process, consisting of a series of steps/stages, called data preprocessing. It removes such impurities and/or converts the data into a format on which data mining techniques can be applied [7]. It aims to build a reliable, robust structural foundation on which the success of the later stage, the application of various data mining techniques (pattern discovery), relies [12]. Data preparation is the most complicated and time-consuming task: about 80 percent [13] of the time is spent on this process to strengthen the quality of the data, because the more qualitative the data, the better the results. The data preparation task mainly includes the sub-tasks of data cleaning, user identification, session identification, path completion and transaction identification [12]. Plenty of algorithms and heuristic techniques have been developed and suggested for this, using which a robust, reliable and integrated data source can be created, on which various data mining techniques can later be applied efficiently. Depending on what is to be mined, any of the above listed sub-tasks can be repeated or eliminated altogether. Here we provide an in-depth review of the work done on data preprocessing methods. A. Data Cleaning & Feature Selection: This is the process of identifying, selecting and removing unnecessary or irrelevant fields and/or rows from the raw log data. A web log file contains many attributes (fields); only the necessary fields are selected and the rest are dropped. Firstly, entries for accesses to JPEG and GIF files, JavaScript and other audio/video files need to be removed, as they are executed or downloaded not on the basis of a user's explicit request and hence might be redundantly recorded in log files.
Secondly, if a user requests a page or resource which is not available on the web server, those entries are marked with a different (error) status code and also need to be discarded. Thirdly, entries originating from crawlers or spiders need to be eliminated, because they do not reflect the way a human visitor navigates the site. Many crawlers declare themselves in the agent field and hence can be detected easily by simple string matching. [14] employs various heuristics by which non-human behavior can be detected. [7] suggests that records which are too rare or too frequent will not constitute any meaningful or important knowledge; for example, records pertaining to access of index.html or home.html are not of much interest and hence can be dropped. Table-2 summarizes the data cleaning step.
TABLE-2: SUMMARY OF TASKS PERFORMED IN DATA CLEANING
Step No | Task performed (removal of records from web log) | How to detect?
1       | Multimedia file entries, script entries          | File extensions
2       | Error entries                                    | HTTP status code
3       | Crawler and spider entries                       | Host name, agent field
4       | Non-human behavior entries                       | Heuristic techniques [14]
5       | Too rare or too frequent entries                 | Entry or exit point of web site
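A minimal sketch of the first three cleaning rules summarized above. The extension list and agent keywords are illustrative assumptions, not a complete policy:

```python
# Illustrative suffixes and agent keywords; a real deployment would extend these.
IRRELEVANT_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".mp3", ".avi")
ROBOT_KEYWORDS = ("bot", "crawler", "spider")

def keep_entry(path, status, agent):
    """Return True if a log record survives the cleaning rules of Table-2."""
    # Step 1: drop multimedia/script requests by file extension.
    if path.lower().endswith(IRRELEVANT_SUFFIXES):
        return False
    # Step 2: drop error responses (keep only 2xx/3xx status codes).
    if not (200 <= status < 400):
        return False
    # Step 3: drop self-declared crawlers and spiders via the agent field.
    if any(k in agent.lower() for k in ROBOT_KEYWORDS):
        return False
    return True

print(keep_entry("/products.html", 200, "Mozilla/5.0"))   # True
print(keep_entry("/logo.gif", 200, "Mozilla/5.0"))        # False (step 1)
print(keep_entry("/missing.html", 404, "Mozilla/5.0"))    # False (step 2)
print(keep_entry("/index.html", 200, "Googlebot/2.1"))    # False (step 3)
```

Steps 4 and 5 of the table require session-level context (timing patterns, entry/exit points) and so operate after these record-level filters.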
B. User Identification: User identification refers to the identification of unique users. If requests are routed through a proxy server, the web server log shows a single IP address [7], although there are actually a number of users who initiated those requests. Caching at various levels (places) and bookmarked page accesses introduce further challenges in detecting uniqueness among users. Uniqueness can be detected by client type (user agent), site topology and cookies [7, 12]. 1) Based on Client Type: One possible heuristic is to look at the agent field to identify differences in OS or browser. If either parameter differs for records having the same IP address, it indicates a different user. This can lead to a misconception when a user intentionally behaves this way: e.g., a user who wants to test a web page for certain parameters (access time, orientation, look and feel) across various browsers may enter the same URL from different browsers, posing as different users while actually being one. However, this kind of access is made quite often. 2) Based on Site Topology: If a user requests a page which is not reachable from the previously visited pages and the IP address is the same, it represents a different user. [12] explained the use of the referrer attribute of the W3C extended common log format to detect uniqueness. If the analyst is aware of the site topology, this can be detected easily [7]. Let the site topology be P→Q→R→S→T and P→V→W→X→Y, and let the user browsing pattern be [P Q R S X]; then it is assumed that page X was accessed by a different user. Let the topology be M→N→O→P and A→B→C→D, and say the browsing path is [M N O P N O P]; this will be detected as two unique users, the first with path [M N O P] and the second with path [N O P], as P was accessed twice. This situation can also arise with the same user, if the user types the URL in the address bar of the browser or invokes a page via a bookmark to reach pages not connected via links, and hence the heuristic may lead to misconception. The best way to detect the uniqueness of a user is the cookie [7].
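The agent- and topology-based heuristics above can be sketched as follows. The record and topology representations are simplifying assumptions for illustration:

```python
def split_users(records, topology):
    """Partition records sharing one IP address into distinct users.

    records  -- list of (agent, page) tuples in arrival order
    topology -- dict mapping each page to the set of pages linked from it
    A new user is assumed when the agent string changes, or when the
    requested page is unreachable from any page this user has seen.
    """
    users = []  # each user is a list of (agent, page) records
    for agent, page in records:
        placed = False
        for user in users:
            last_agent = user[-1][0]
            seen_pages = {p for _, p in user}
            reachable = any(page in topology.get(p, set()) for p in seen_pages)
            if agent == last_agent and reachable:
                user.append((agent, page))
                placed = True
                break
        if not placed:
            users.append([(agent, page)])  # start a new user
    return users

# Topology P->Q->R->S->T and P->V->W->X->Y, as in the example above.
topo = {"P": {"Q", "V"}, "Q": {"R"}, "R": {"S"}, "S": {"T"},
        "V": {"W"}, "W": {"X"}, "X": {"Y"}}
recs = [("Moz", "P"), ("Moz", "Q"), ("Moz", "R"), ("Moz", "S"), ("Moz", "X")]
print(len(split_users(recs, topo)))  # -> 2: X is unreachable from {P, Q, R, S}
```

As the text notes, bookmark access or typed URLs will fool this heuristic, which is why cookies remain the most reliable signal.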
3) By Using Cookies: A cookie is a small variable which stores some parameter values at the client side. The cookie is created by the web server and sent to the user for storage at the client side when the user requests a web page for the very first time. For any subsequent request to the same web server, the browser sends the cookie information along with the request; the web server recognizes that it is the same user and delivers the requested page without creating the cookie again. Cookies are often not logged by the web server, can be destroyed automatically after some time (finite lifetime), and can be turned on and off by the user. A better and more efficient technique can be implemented by combining one or more of the above listed approaches. C. Session Identification: A user session is considered to be the set of consecutive pages visited (requests made) by a single user during a certain time period on the same web site. A session S is a set of entries s made by a user while browsing a web site: S = {s1, s2, ..., sn}, where each s ∈ S is a visitor's entry containing s.ip (IP address), s.wp (web page) and s.t (time of entry), and n is the number of transactions in the session. [14] introduces two methods: 1) proactive, which constructs sessions using session ids gathered from cookies, and 2) reactive, which creates sessions from the web log by applying various heuristics. 1) Session Identification by Time-Oriented Heuristic: This uses the time gap between entries; if it exceeds a certain threshold, a new session is created: if s.t(n+1) - s.t(n) >= time_threshold, then start a new session. Various researchers report typical threshold values varying from 10 minutes to 2 hours [7]. This value is affected by the application, the site topology and many other parameters, so ideally it should be determined dynamically. According to [15], web access patterns result from differences in site topology, users' habits, users' interest in topics, and varied associations between topics.
Hence a fixed threshold is not appropriate and adequate for all types of application, so [15] introduced the concept of a dynamic threshold, suggesting two fixed thresholds of 30 min and 10 min for the maximum and minimum time respectively, and a dynamic threshold on top of each of the maximum and minimum static thresholds. 2) Session Identification by Duration Spent Observing a Page: Based on the time spent on each page, pages can be categorized into two groups: navigational pages and informative pages. Informative pages are the visitor's ultimate destination, and users spend more time studying the content of informative pages than of navigational pages. Together with the site topology, this information can be used to define sessions. If we know the percentage of navigational pages in the web log file, the maximal duration of such a page can be determined by the formula
q = -ln(1 - p) / λ, where q is the duration threshold for navigational pages, p is the percentage of navigational pages, and λ is estimated from the observed mean duration time of all pages in the log [12].
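The time-oriented heuristic and the navigational-page cutoff above can be sketched as follows. The 30-minute gap is one commonly cited threshold, not a universal value, and λ is estimated here as the reciprocal of the mean observed duration (the usual estimate for an exponential model):

```python
import math

def sessionize(entries, gap_threshold=1800):
    """Split one user's (timestamp, page) entries into sessions.

    A new session starts whenever the gap between consecutive requests
    exceeds gap_threshold seconds (30 min here, a common default).
    """
    sessions = []
    for t, page in sorted(entries):
        if not sessions or t - sessions[-1][-1][0] > gap_threshold:
            sessions.append([])  # start a new session
        sessions[-1].append((t, page))
    return sessions

def navigational_cutoff(p, mean_duration):
    """Duration threshold q = -ln(1 - p) / lambda from [12],
    with lambda estimated as 1 / (mean observed page duration)."""
    lam = 1.0 / mean_duration
    return -math.log(1.0 - p) / lam

# Two sessions: the 3700 s gap between t=300 and t=4000 exceeds 1800 s.
entries = [(0, "P"), (120, "Q"), (300, "R"), (4000, "S"), (4100, "T")]
print(len(sessionize(entries)))              # -> 2
# If 80% of pages are navigational and the mean duration is 60 s,
# pages viewed for less than ~97 s are classed as navigational.
print(round(navigational_cutoff(0.8, 60)))   # -> 97
```

A dynamic-threshold variant in the spirit of [15] would adjust `gap_threshold` per user rather than fixing it globally.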
3) Session Identification by Referrer: The W3C extended log format has a referrer URL attribute; the referrer of a page should exist in the same session, and if no referrer is found, the page is the first page of a new session. Let there be two consecutive requests p and q, where p ∈ S (p is a page and S is a session). If the referrer r of page q was invoked within session S (r ∈ S), then q is added to S; otherwise q starts a new session [7]. [16] proposed another approach based on integer programming. Unlike heuristic methods, which create one session at a time, this method constructs sessions simultaneously; additionally, the generated sessions better match the expected empirical distribution, at the cost of increased running time. [12] proposed the reference length method and the maximal forward reference method, which formulate a session as the set of pages from the first page in a request sequence to the final page before a backward reference is made. In this approach, the tree structure of the server pages needs to be searched multiple times. [17] suggested an algorithm which does not require searching the whole tree representing the server pages; it relies on efficient use of data structures: an array list to represent web logs and the user access list, a hash table for storing server pages, and a two-way hashed structure for the Access History List, which represents user-accessed page sequences. Experiments reveal lower time complexity and good accuracy of the generated sessions compared to the results of [12]. [18] introduced graphs to identify sessions under complex browsing practices at the client side, where a user has many choices for requesting a web page: in a new window, in a new tab, or by switching tabs. In this approach, phase-1 creates an AJAX interface to record such activity, phase-2 constructs a graph structure from the web usage data obtained in phase-1, and phase-3 applies graph mining methods on the constructed graph structure to discover weighted frequent patterns.
TABLE-3: SUMMARY OF APPROACHES FOR SESSION IDENTIFICATION
Author             | Approach                                              | Session identification by                          | Remarks
Cooley et al.      | Time-oriented (time gap between entries of same user) | Fixed (static) threshold                           | Simplicity
Zhang et al.       | Time-oriented (time gap between entries of same user) | Dynamic threshold                                  | Varied user activity modeled better
Cooley et al.      | Time spent on page (navigational data)                | Knowledge of navigational & informative pages      | Site topology needs to be defined
Cooley et al.      | Referrer field                                        | Presence of a value for the referrer field         | Only extended log file format required
Cooley et al.      | Referrer field                                        | Reference Length, Maximal Forward Reference method | Site topology must be searched multiple times
G. Arumugam et al. | Referrer field, advanced data structures              | Referrer field, RL, MFR, advanced data structures  | No multiple scanning; better results than [12]
R. F. Dell et al.  | Integer programming                                   | Integer programming                                | Simultaneous session creation, better session quality
M. Heydari et al.  | Graph-based approach                                  | Application of graph mining methods                | Use of client-side logs

Session clustering approaches:
S. Alam et al.     | Particle Swarm Optimization                           | Euclidean distance                                 | Good for numerical attributes
T. Hussain et al.  | Particle Swarm Optimization & agglomerative           | Angular separation, Canberra distance              | Suited for non-numerical attributes; better structured result representation
Z. Ansari et al.   | Fuzzy C-means clustering                              | Fuzzy membership function                          | Better results even for ill-defined and overlapping boundaries
3) Session Clustering: Clustering is a technique which groups similar objects based on certain common attributes (properties) that they share. Web session clustering is an emerging technique in WUM. [19] explains particle swarm based clustering of web usage data using Euclidean distance (ED), which is suited to numerical data. [20] introduced two different similarity measures, angular separation and Canberra distance, and applied particle swarm optimization and agglomerative clustering to achieve hierarchical sessionization, which improves visualization and represents the result in a better structured way. When the ultimate data mining task is clustering, the session files are filtered to remove very small sessions, as they may be noise. But direct removal of these small sessions may result in the loss of a significant amount of information, especially when the number of small sessions is large. K-means can be applied, which initializes cluster centers randomly and updates them by taking the weighted average of all data points in each cluster; this recalculation results in a better set of cluster centers. K-means handles crisp data sets with clear-cut boundaries, but in the real world boundaries are often ill defined and may even overlap. [21] suggests a fuzzy set theoretic approach: defining a "fuzzy membership function" based on the number of URLs accessed by sessions and then applying fuzzy c-means clustering. This demonstrated better results compared with the traditional hard-computing approach of small session elimination. D. Path Completion: Another critical issue that needs to be resolved is path completion. Sometimes a user's action does not get recorded in the access log: if the user clicks the back button in the browser and a local copy is present in the client cache or at a proxy server, the browser serves it directly to the user, without this access being recorded in the server's web log. Because of this, the number of page access requests present in the web log can be smaller than the number actually made. Such missing entries leave the user's path incomplete, and hence the need arises to detect these missing page sequences from web logs; this is called path completion. Missing pages should be mended in the log file before pattern discovery [22]. To achieve this we need to consult the referrer log and the site topology. If the referrer URL of a requested page does not match the last directly requested page, it indicates that the recorded path is not complete. Further, if the referrer page URL is in the user's recent request history, we can assume that the user clicked the "back" button to visit the page; but if the referrer page is not in the history, it means that a new user session begins, just as stated above. We can mend the incomplete path using the heuristics provided by the referrer field and the site topology.
[22] proposed an approach in which the first step uses the identified user sessions. Secondly it uses the Reference Length (RL) algorithm, which uses the time spent to decide whether a page is informative or auxiliary, and Maximal Forward Reference (MFR), which uses the page sequence in the user access path. Both have their own limitations, so Yan combined the two algorithms: first MFR is used to identify content pages, then a cut-off time is determined, and finally RL is applied to identify auxiliary pages. In the third step the complete path is built from the referrer field, and where required the reference length of some pages can be modified by the proposed algorithm. In recent times many web sites are developed by integrating various technologies and components which work collaboratively, and much of the content displayed at a particular point in time is dynamic in nature. As the content is dynamic, a fresh request must be submitted to the server to get the latest data or information; e.g., a site dealing with selling/buying of commodities or stocks, and many other real-time applications, update their content frequently within a short time. For such a site, even in the case of a back-button (onload) event, the content is fetched from the server, so such entries invariably get recorded at the web server, and path completion techniques are not required.
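A simplified sketch of the referrer-based path completion heuristic described above. The (page, referrer) representation and the mending strategy are assumptions for illustration, not the exact algorithm of [22]:

```python
def complete_path(requests):
    """Mend missing back-button pages in one session.

    requests -- ordered list of (page, referrer) pairs from the log.
    If a request's referrer is not the previous page but appears
    earlier in the history, back-button navigation through cached
    copies is assumed, and the skipped pages are re-inserted.
    """
    path = []
    for page, referrer in requests:
        if path and referrer is not None and referrer != path[-1] and referrer in path:
            # Back-button: replay the history from the current page
            # back to (and including) the referrer page.
            idx = len(path) - 1 - path[::-1].index(referrer)
            path.extend(reversed(path[idx:-1]))
        # (If the referrer is absent from the history, a new session
        # would begin instead; not shown here.)
        path.append(page)
    return path

# User browsed P -> Q -> R, pressed "back" twice (served from cache,
# hence unlogged), then followed a link from P to S.
reqs = [("P", None), ("Q", "P"), ("R", "Q"), ("S", "P")]
print(complete_path(reqs))  # -> ['P', 'Q', 'R', 'Q', 'P', 'S']
```

The mended path now reflects the two back-button steps that never reached the server log.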
E. Transaction Identification: A transaction refers to a grouping of a set of operations which are atomic and logically identical, and which are performed and recorded over a certain period of time. Whether this step is required depends on what kind of knowledge we want to mine from the web log data [15, 22, 23]. [23] defined and categorized two types of transactions that can be formed from sessions. 1) A travel-path transaction consists of both content and auxiliary pages and represents the sequence of pages accessed by a user; mining such transactions reveals the common traversal paths of users. 2) A content-only transaction is defined as all the content pages of a user session; mining content-only transactions discovers users' interests and clusters the users visiting the same web site. Both use the RL and MFR algorithms discussed earlier. IV. PATTERN DISCOVERY This is the stage where useful knowledge is derived by applying various statistical and/or data mining techniques drawn from research areas such as data mining, machine learning, statistical methods and pattern recognition. Frequently used techniques are classification, clustering, association rules and sequential patterns [4, 5, 24]. Clustering aims to build clusters and categorize users into groups (clusters) who demonstrate similar browsing behavior, also known as user clustering [7]. Page clustering techniques identify groups of pages which are conceptually related. Clustering can be performed by measuring similarities between two entities; some commonly used techniques are Euclidean distance, PSO and fuzzy C-means [19, 21]. Clustering forms the basis for web personalization, the adaptation to an individual user's needs. Based on clustering, user demographic behavior, market segmentation for an e-commerce site and recommendations can be planned and delivered in a personalized way [11]. Classification is considered supervised learning.
It is an automated process of assigning a class label, i.e. mapping a user, based on browsing history or some other attributes, to one of the existing classes. It can be done by various inductive learning algorithms such as decision tree classifiers, naïve Bayes classifiers and support vector machines. It forms the basis for WUM applications like profile building; later, based on the classified user profile, efficient personalization and recommendation can be made [7, 9, 10]. Association rules discover related items occurring together in the same transaction, and are used to find interdependency and correlation among pages. The number of rules generated can be very large, so two measures, support and confidence, are employed, which determine the importance and quality of the rules [7, 11]. A-Priori and its many variants have been developed to mine association rules. Sequential patterns (rules) are formed when we attach a time domain to some other attribute of interest; the problem of mining sequential patterns is to find the maximal frequent sequences among all sequences that have a certain user-specified minimum support [7]. Using these, a web marketer can better match advertisements with targeted user groups [11]. V. PATTERN ANALYSIS The result of the pattern discovery phase might not be in a form suitable for interpretation or for deriving conclusions. Pattern analysis provides ways to compare the results and to extract interesting
rules or patterns from the output of the previous step [25]. Various visualization and presentation tools are used which represent data in 2D or 3D pictorial form. These tools provide interactive ways of representing, comparing and characterizing results in terms of charts, graphs, tables, Venn diagrams and many other visual presentations [25]. Often the generated results, or the data itself, are stored in data cubes or in a data warehouse, on which various OLAP operations such as roll-up, drill-down and slice can be performed; these provide multiple views of the same data to the analyzer in a logical and hierarchical structure. Knowledge query mechanisms such as SQL facilitate retrieving data in a way controlled by the analyzer, generally statistical data in text format. VI. CONCLUSION Web sites are of much use to users; they are built, deployed and maintained to serve various functions to the user. The extent to which the intended functions and features have been implemented can be identified and verified by careful inspection of the log data, and based on this result further corrective measures and actions can be planned and executed. Achieving this knowledge is accomplished through the application of various subjective and/or objective, procedural, algorithmic or heuristic processes, methods and techniques. VII. FUTURE WORK Web log data preprocessing is a very important and crucial task in the entire process. This phase can be strengthened by choosing and carefully applying various heuristic techniques. Most of the systems and architectures that have been implemented or proposed consider either client-side or server-side log data. In future, a system could be built that considers and exploits the usefulness of both client-side and server-side log data, to produce results that are more efficient and better match empirical observations. REFERENCES [1] Qingyu Zhang, Richard Segall, "Web mining: a
survey of current research, techniques and software", International Journal of Information Technology & Decision Making, Vol. 7, No. 4, 2008. [2] R. Kosala, H. Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, Newsletter of the SIG on Knowledge Discovery and Data Mining, ACM, Vol. 2, 2000. [3] B. Singh, H. K. Singh, "Web Data Mining Research: A Survey", IEEE, 2010. [4] R. Cooley, B. Mobasher, J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web", Ninth IEEE International Conference on Tools with Artificial Intelligence, November 1997. [5] J. Srivastava, R. Cooley, M. Deshpande, P. Tan, "Web usage mining: discovery and applications of usage patterns from Web data", ACM SIGKDD Explorations, Vol. 1, No. 2, Jan. 2000. [6] K. R. Suneetha, D. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File", International Journal of Computer Science and Network Security (IJCSNS), Vol. 9, No. 4, April 2009. [7] Zidrina Pabarskaite, Aistis Raudys, "A process of knowledge discovery from web log data: Systemization and critical review", Journal of Intelligent Information Systems, Springer, 2007. [8] S. K. Pani et al., "Web Usage Mining: A survey on pattern extraction from web logs", International Journal of Instrumentation, Control & Automation, Vol. 1, Issue 1, 2011.
[9] Jinhyuk Choi, G. Lee, "New Techniques for Data Preprocessing Based on Usage Logs for Efficient Web User Profiling at Client Side", International Conference on Web Intelligence & Intelligent Agent Technology, IEEE/ACM/WIC, 2009. [10] Ting Chen et al., "Content Recommendation System based on Private Dynamic User Profile", 6th International Conference on Machine Learning and Cybernetics, IEEE, August 2007. [11] V. Chitra, A. S. Davamani, "A survey on preprocessing methods for web usage data", International Journal of Computer Science & Information Security, Vol. 7, No. 3, 2010. [12] R. Cooley, B. Mobasher, J. Srivastava, "Data preparation for mining World Wide Web browsing patterns", Knowledge and Information Systems, Vol. 1, No. 1, 1999. [13] S. Ansari et al., "Integrating e-commerce and data mining: Architecture and challenges", IEEE, 2001. [14] B. Berendt, M. Spiliopoulou, "Analyzing navigation behavior in web sites integrating multiple information systems", VLDB Journal, special issue on databases and the web, 2000. [15] J. Zhang, Ali A. Ghorbani, "The reconstruction of user sessions from a server log using improved time-oriented heuristics", 2nd Annual Conference on Communication Networks and Services Research, IEEE, 2004. [16] R. F. Dell et al., "Web user session reconstruction using integer programming", International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/ACM/WIC, 2008. [17] G. Arumugam, S. Sugana, "Optimum algorithm for generation of user session sequences using server side web user logs", IEEE, 2009. [18] M. Heydari et al., "A graph based web usage mining method considering client side data", International Conference on Electrical Engineering and Informatics, IEEE, 2009. [19] S. Alam, G. Dobbie, et al., "Particle Swarm Optimization Based Clustering of Web Usage Data", International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/ACM/WIC, 2008.
[20] Tasawar Hussain et al., "Hierarchical sessionization at preprocessing level of WUM based on swarm intelligence", 6th International Conference on Emerging Technologies, IEEE, 2010. [21] Zahi Ansari et al., "A fuzzy set theoretic approach to discover user sessions from web navigational data", IEEE, 2011. [22] Yan Li et al., "Research on path completion technique in web usage mining", International Symposium on Computer Science and Computational Technology, IEEE, 2008. [23] Yan Li, Bo-qin Feng, et al., "The construction of transactions for web usage mining", International Conference on Computational Intelligence and Natural Computing, IEEE, 2009. [24] Jose M. Domenech and Javier Lorenzo, "A Tool for Web Usage Mining", 8th International Conference on Intelligent Data Engineering and Automated Learning, 2007. [25] Liu Kewen, "Analysis of preprocessing methods for web usage mining", International Conference on Measurement, Information and Control, IEEE, 2012.