An Ameliorated Methodology for Preprocessing Web Log Data using Data Warehousing and Data Mining Framework

B. N. Shankargowda (Bangalore Institute of Technology, Bangalore), Vibha L (BNM Institute of Technology, Bangalore), Venugopal K R (University Visvesvaraya College of Engineering, Bangalore), L M Patnaik (Indian Institute of Science, Bangalore)
[email protected]
Abstract: A customer-oriented business organization can survive and sustain itself by providing accurate and personalized services to its customers in real time. This requires understanding individual users' perceptions and activities on the World Wide Web and then building a customized user model that supplies a wealth of information about each customer. The strategic use of historical data confers a competitive advantage in intelligence gathering and aids the decision-making process, so rationalizing, automating, consolidating, transforming and integrating the operational data sources (ODS) is important. The accuracy and accessibility of data are critical when uncovering patterns from Web log data; it is needless to state that the quality of the information resulting from data analysis is only as good as its underlying data. Preprocessing is therefore critical to any data querying, analysis, mining and reporting application. This paper proposes an ameliorated methodology to preprocess terabytes or petabytes of raw server log files by building a data warehouse (DWH) and data mining framework, so as to enhance performance, throughput, scalability and multi-dimensional analysis economically.

Keywords: EWIC: Enterprise Wide Information Center, ODS: Operational Data Source, SOM: Self-Organizing Maps, TOAD: Tool for Oracle Application Developers, WUM: Web Usage Mining.
Preprint submitted to ICCN-2013/ICDMW-2013/ICISP-2013, June 11, 2013

1. Introduction

A data warehouse and data mining framework provides knowledge workers with critical and accurate decision-support information that is very hard to access and present in a traditional ODS. It gives an organization the ability to process enterprise-wide information effectively and to uncover potential business trends and dependencies that would otherwise go unnoticed and unexploited. A DWH can rationalize, automate, consolidate, transform and integrate the process of building an Enterprise Wide Information Center (EWIC), rather than maintaining many individual DSS/EIS systems, each with its own infrastructure. A further contributing factor is that the cost of state-of-the-art computer systems is declining by the day, which makes building an enterprise-wide data warehouse that can store large amounts of data a viable option compared to the traditional ODS.

The use of computers and the number of customers for businesses have grown rapidly over the years, and the amount of stored information is growing exponentially. This affects both response time and the ability to understand the data's content. A DWH provides the ability to extract, analyze and manage both macro and micro perspectives of the organization; this saves the effort and long hours of manual work and avoids the costly mistakes caused by incorrect data or irrational assumptions. It supports rational decisions based on an understanding of the entire system and its business processes, rather than rough estimates from incomplete or incorrect data. Hence, there is a need for data warehouse technology in data mining applications.

1.1. Motivation

The ultimate objective of building a data warehouse and data mining framework is to help corporate organizations develop a data-driven model for analyzing the behavior of users. For customer-oriented business organizations, it also helps to serve web users better, understand the needs and perspective of individual customers, attract and retain customers, improve sales and provide better customer service, segment and profile customers, design marketing strategies across the organization's products and services, assess risk and detect fraud, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, and build the most effective user-based model for their Web space. This paper proposes an ameliorated methodology for analyzing user behavior on the internet by building a DWH, integrating it with a data mining framework, and following a set of guidelines at the preprocessing stage of data analysis.
The most important prerequisite for analyzing the behavior of users on the WWW is a good data set in which the data are clean, accurate, predictive, timely, accessible and complete. The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 provides background; Section 4 defines the problem; Section 5 describes the modeling; Section 6 describes the experimental setup and implementation; Section 7 discusses the performance analysis; and the last section concludes.

2. Related Work

This section discusses some of the work on tools, techniques and methodologies for preprocessing server log data in the data mining process, and highlights their drawbacks. Much of the research and practice in data preparation has focused on preprocessing and integrating data sources for different kinds of analysis. Qiang Yang et al. [1] introduced a data-cube model to hold the original access sessions when mining Web logs, so as to support different mining tasks effectively. Hussain et al. [2] surveyed preprocessing techniques to identify open issues and analyzed WUM preprocessing in order to improve pattern mining and analysis. Vijayakumar
et al. [3] proposed a framework for extracting useful information from Web usage data that uses SOM in the preprocessing phase, discovering user clusters and page clusters from the sessions. Kewem Liu [4] analyzed the preprocessing of web usage mining so as to find user access models automatically and quickly from vast web log data. Milija et al. [5] show the design and implementation of a data warehouse and the use of data mining algorithms for knowledge discovery in the business decision-making process. Jiang et al. [6] emphasize the study of Web log data preprocessing and collaborative filtering, aiming at user session identification during data preprocessing and analyzing existing algorithms so as to provide high-quality data for subsequent log mining. Choi et al. [7] analyzed user profiling at the client side, an important task for intelligent information delivery in the Web environment, and designed three important preprocessing techniques. Ling et al. [8] presented a comprehensive analysis of the quantitative relation between user browsing time and user interest rate, based on a combination of five minimum user browsing behaviors. Tanasa et al. [9] focused on data preprocessing for WUM, applying data procedures to analyze user access to Web sites; they tried to determine the exact list of users who accessed a Web site and to reconstitute user sessions. Olivia et al. [10] emphasize the planning and development of a data warehouse, covering a number of aspects and uses of the warehouse. Huang et al. [11] proposed a new approach for separating interleaved sessions from Web server logs, using an m-order Markov model combined with a competitive algorithm to reconstruct interleaved sessions. Cooley et al. [12] describe a detailed taxonomy of the work in preprocessing, pattern discovery, and pattern analysis, and give a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system. Natheer et al. [13] proposed fast active user-based user identification with time complexity O(n), and session identification using an ontology-based method that exploits the Web site's structure and functionality to identify different sessions. Raju et al. [14] focused on grouping individual data-collection events into groups, called Web transactions, before feeding them to the mining system.

3. Background

The activities of individual users and their perception of the WWW need to be understood in order to build a customized user model that provides a wealth of information about the customer. The strategic use of historical data confers a competitive advantage in intelligence gathering and aids the decision-making process; it is therefore important to rationalize, automate, consolidate, transform and integrate transaction data. The traditional preprocessing techniques used to create user models are normally too rigid to capture the inherent uncertainty of human behavior. In this context, a DWH and data mining framework can be used to handle and process human uncertainty and to simulate human decision making. The more accurate the information a user model has, the better the content and presentation can be customized. A user model is created through a user modeling process in which unobservable information about a user is inferred from observable information, namely the user's interactions with the WWW.
3.1. Data Sources for Web Usage Mining

The primary source of data used in Web usage mining is the server log files. Server logs include Web server access logs and application server logs. Apart from these sources, the site files and meta-data (including content features and structural elements of pages), operational databases, application templates, users' client-side cookies, and external clickstream or demographic data sources can also be used.

3.2. Data Preparation or Preprocessing Stage

The three phases of Web usage mining (WUM) are data preprocessing, pattern discovery, and pattern analysis. The main objective of log data preprocessing is to create user profiles that represent individual users' interests and activities on the WWW. The user profile can then serve as input for analyzing the behavior of users on the internet; understanding user behavior or interest is the first step in providing customized Web services [15]. The block diagram in Figure 1 shows the stages of data preprocessing. Real-time web log data needs to be preprocessed because the data-gathering methods are loosely controlled, which may result in out-of-range values (e.g., salary: -1000), impossible data combinations (e.g., male: pregnant), or garbage values. The raw server log file may exhibit the following anomalies: data that is useless or unwanted for the analysis at hand; incomplete data, i.e., records lacking attribute values or certain attributes of interest, or containing only aggregate data; and missing values, particularly for tuples of some attributes, that may need to be inferred or set to default values.
Figure 1: Stages of Data Preprocessing.
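To make this screening step concrete, the following minimal sketch rejects out-of-range values and fills missing attribute values with defaults where available. The field names, the negative-salary check and the default-filling policy are our own illustrative assumptions, not prescriptions from the paper.

```python
def screen_record(record, defaults=None):
    """Return a cleaned copy of `record`, or None if it is unusable.

    - missing attribute values are filled with defaults where given,
      otherwise the record is dropped as incomplete;
    - out-of-range values (e.g. a negative salary) are rejected.
    """
    defaults = defaults or {}
    cleaned = dict(record)
    for field, value in record.items():
        if value is None or value == "":
            if field in defaults:
                cleaned[field] = defaults[field]   # infer/set default value
            else:
                return None                        # incomplete record, drop
    # hypothetical out-of-range rule, mirroring the salary: -1000 example
    if "salary" in cleaned and cleaned["salary"] < 0:
        return None
    return cleaned
```

Records that survive this screen can then enter the staging area described later.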
3.3. Approaches for Web Log Data Preprocessing

Traditionally, data preprocessing is done by writing programs in a high-level language or in MVS Job Control Language, by writing scripts under Unix/Linux, or by using query languages such as SQL [16].

4. Problem Definition

The raw server log file may contain requests for image, video and multimedia files; requests made by automated indexers such as Web crawlers, web spiders, web robots or bots, which browse the entire web in an organized fashion to index sites; and entries with failed HTTP status codes, i.e., codes indicating that a requested event did not succeed. The objectives of the data sourcing, acquisition, cleanup and transformation component are: to provide accurate and personalized services to customers in real time; to remove redundant data that would slow down or confuse the knowledge discovery process; to reduce the size of the server log file considerably and arrive at a good dataset that enhances the performance of data analysis; and to improve data quality so as to improve the accuracy and efficiency of the subsequent mining process.

5. Modeling

The Web log dataset includes the requested URLs, the users' IP addresses, timestamps, the resources requested, any parameters used in invoking a Web application, the status of each request, the HTTP method used, the user agent (browser, OS, version), and the referring Web resource; together these provide much of the potential information about user access and navigational behavior on a Web site. Based on this user model, a practical solution for preprocessing the raw server log file is proposed by building a data warehouse and data mining framework.

5.1. DWH and Data Mining Framework

A DWH is an environment, or an architectural construct of an information system, that functions as a central repository for storing enterprise-wide information. A DWH is complex software composed of seven critical components designed to make the entire architecture functional, manageable, and accessible both by the ODS that source data into the DWH and by the OLAP tools. The seven important components of the DWH architecture, shown in Figure 2, are: the data sourcing, acquisition, cleanup, transformation, and migration component; the metadata repository; the central data warehouse database; data marts; the data querying, reporting, analysis, and mining component; the DWH administration and management component; and the information delivery/visualization component.
Figure 2: Block diagram of Data Warehouse Architecture.
6. Implementation

Data preparation is often the most time-consuming and computationally intensive step in the Web usage mining process, and it often requires special algorithms and heuristics not commonly employed in other domains. This process is critical to the successful extraction of useful patterns from the data. The data sourcing, acquisition, cleanup, transformation, and migration tools, i.e., data preprocessing in general, perform all the conversions, summarizations, key changes, structural changes, and condensations needed to transform disparate data into information that can be effectively used by DSS or EIS tools [17].

6.1. Experimental Setup

The sample raw server log file used for this analysis was obtained from an educational portal, http://www.enggresources.com. The quantity and quality of the dataset available from this portal are well suited to our analysis.

6.2. Raw Server Log File

122.166.134.209 - - [19/Jul/2009:00:38:47 +0530] "GET /ra/rank.php?sem=8&course=B.E/B.Tech&deptid=1RVIS HTTP/1.1" 200 5220 "http://www.enggresources.com/ra/college.php?sem=8&course=B.E%2FB.Tech&fwdto=&cid=RV" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"

59.92.172.67 - - [19/Jul/2009:00:40:07 +0530] "GET /ra/college.php?fwd=rank.php HTTP/1.1" 200 6141 "http://www.enggresources.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB5; InfoPath.3)"
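A line in this Combined Log Format can be split into its fields with a regular expression. The sketch below is our own illustration, not part of the paper's implementation: the regex and field names are assumptions, and the filtering rules reproduce the cleanup criteria formalized later in Section 6.3 (image files, non-GET/POST methods, robot user agents, failed status codes).

```python
import re

# One entry of the Combined Log Format shown above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

IMAGE_EXTENSIONS = (".jpeg", ".gif", ".tif", ".bmp", ".jpg")
ROBOT_MARKERS = ("crawler", "spider", "robot", "bot")

def parse_entry(line):
    """Parse one log line into a dict of named fields, or None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def keep_entry(entry):
    """Apply the cleanup rules: drop image requests, non-GET/POST methods,
    robot user agents, and failed requests (status codes > 300)."""
    if entry["url"].lower().endswith(IMAGE_EXTENSIONS):
        return False
    if entry["method"] not in ("GET", "POST"):
        return False
    if any(marker in entry["agent"].lower() for marker in ROBOT_MARKERS):
        return False
    if int(entry["status"]) > 300:
        return False
    return True
```

Entries for which `keep_entry` returns True would be the ones loaded into the staging table.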
6.3. Procedure for the Data Sourcing, Acquisition, Cleanup, and Transformation Process

Step 1: Read the raw server log file from input.
Step 2: If the log file entry refers to an image file (.jpeg/.gif/.tif/.bmp/.jpg), or uses any method other than GET and POST, or carries a user agent such as a web crawler, web spider or web robot, or has a status code > 300, then discard the entry; otherwise add it to the staging table.
Step 3: Repeat Step 2 until EOF.
Output: preprocessed server log file.

6.4. Preprocessed Server Log File

59.92.172.67 - - [19/Jul/2009:00:40:07 +0530] "GET /ra/college.php?fwd=rank.php HTTP/1.1" 200 6141 "http://www.enggresources.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB5; InfoPath.3)"

6.5. Pseudo Code for Data Sourcing, Acquisition, Cleanup, Transformation, and Migration using a Unix/Linux Shell Script and a MySQL Database

    a=$(echo "$arr" | cut -f1 -d' ')
    adate=$(echo "$d" | cut -c2-12)
    echo "insert into logfilenew values ('$a','$b','$c', ......., '$ssec', ..., '$x')" | mysql --host=localhost --database=mysql
done < logfilenew.txt

6.6. Procedure for the Data Sourcing, Acquisition, Cleanup, Migration and Transformation Process by Building a DWH using TERADATA or TOAD

Loading voluminous data directly into an operational database has a bad impact on performance; the same task can instead be carried out by building a DWH. A staging table is created in which the field values of the cleansed server log file correspond to the individual fields/column heads of the newly created table. An ETL tool is preferred when the same table is loaded frequently; for a one-time load it is better to use the Oracle SQL loader.

6.7. User Identification

A user may visit a site more than once, and the server log records multiple sessions for each user.
The term user activity record refers to the sequence of logged activities belonging to the same user. The IP address alone is not sufficient for mapping log entries onto the set of unique visitors, owing to the proliferation of ISP proxy servers that assign rotating IP addresses to clients as they browse the Web. It is nevertheless possible to identify unique users fairly accurately through a combination of IP address and other information such as the user agent and referrer.
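The user-identification heuristic described above can be sketched as a grouping by the (IP address, user agent) pair; the field names follow the earlier parsing sketch and are our own assumption.

```python
from collections import defaultdict

def identify_users(entries):
    """Group cleaned log entries into per-user activity records.

    The IP address alone is ambiguous behind ISP proxies, so entries
    are keyed by the (IP address, user agent) combination instead.
    """
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry)
    return dict(users)
```

Each resulting list is one user activity record, ready for sessionization.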
6.8. Sessionization

Sessionization is the process of segmenting the user activity record of each user into sessions, each representing a single visit to the site. A session is an ordered set of pages viewed in one visit by the same visitor, identified using the IP address, user agent, operating system and referrals. With the aid of the timestamp, we take the number of pages viewed in a session as the length of the session. Each page, identified by its URL, is described by many attributes, including a page ID, a classification of the page (based on the context of its contents within the Web site), the total time spent at the page, and the total number of hits on the page per session. The main approaches for creating sessions are the time-oriented approach, the navigation-oriented approach, the concept-matching approach and SOM.

7. Performance Evaluation

The experimental results of data preprocessing confirm a considerable reduction in the size of the server log file and yield a good-quality dataset with enhanced performance, without compromising throughput, scalability or multi-dimensional analysis, when compared to the traditional approaches. The sample dataset was processed with all of the above approaches to extract, transform and load the variables in the server log file. The results are tabulated in Tables 1 and 2 and plotted in Figures 3, 4, 5 and 6.

No. of Lines | Scripts | Database | DWH & DM Framework
26           | 2       | 1        | 1
1009         | 66      | 4        | 2
5280         | 401     | 20       | 7
95191        | 12668   | 364      | 62
116160       | 15568   | 431      | 98
232320       | 33498   | 851      | 223

Table 1: Extracting all the variables from the log file using shell script, DB and DWH
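The time-oriented approach to sessionization mentioned in Section 6.8 can be sketched as follows. The 30-minute timeout and the field names are our own assumptions, not values prescribed by the paper: within one user activity record, a gap longer than the timeout between consecutive requests starts a new session.

```python
from datetime import datetime, timedelta

def sessionize(activity_record, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered entries into sessions (lists).

    The length of each returned session is its number of page views,
    matching the definition of session length used above.
    """
    sessions = []
    previous = None
    for entry in sorted(activity_record, key=lambda e: e["time"]):
        if previous is None or entry["time"] - previous > timeout:
            sessions.append([])          # gap too large: a new visit begins
        sessions[-1].append(entry)
        previous = entry["time"]
    return sessions
```

A navigation-oriented variant would instead check the referrer of each request against the pages already in the session.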
Figure 3: Extracting all the variables from log file using Shell script, DB and DWH
Figure 4: Extracting all the variables using DB and DWH
No. of Lines | Scripts | Database | DWH & DM Framework
26           | 1       | 1        | 1
1009         | 12      | 2        | 2
5280         | 64      | 7        | 5
95191        | 3656    | 62       | 44
116160       | 4645    | 331      | 98
232320       | 10616   | 665      | 193

Table 2: Extracting 4 variables from the log file using shell script, DB and DWH
Figure 5: Extracting 4 variables from log file using Shell script, DB and DWH
7.1. Advantages of the Proposed Methodology

The advantages of the DWH approach over operational databases queried with a language like SQL are many. The DWH and data mining framework provides several optimization techniques for computing multiple aggregates that are not supported by TPS or operational databases, where complex procedures must be written to perform aggregate operations on the data for analysis. The advantages of the proposed methodology are summarized in Table 3.
Figure 6: Extracting 4 variables using DB and DWH
Parameter              | Scripts | HLL     | Database | DWH & DM
Multidimensional data  | X       | X       | X        | Easy
CUBE()                 | X       | X       | X        | Easy
RollUp()/Drilldown()   | Tedious | Tedious | Tedious  | Easy
Rank()/Dense Rank()    | Tedious | Tedious | Tedious  | Easy
Slice()/Dice()/Pivot() | X       | X       | X        | Easy
Aggregation()          | Tedious | Tedious | Tedious  | Easy
Summarization()        | Tedious | Tedious | Tedious  | Easy

Table 3: Advantages of the DWH and DM framework
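As a small illustration of what operators like ROLLUP() automate, the sketch below computes hit counts at three levels of abstraction over invented data (the field names and data are ours, not the paper's); an operational database without such operators would require one hand-written query or procedure per level.

```python
from collections import Counter

def rollup_hits(entries):
    """ROLLUP-style hit counts: per (page, date), per page, and overall."""
    by_page_date = Counter((e["url"], e["date"]) for e in entries)
    by_page = Counter(e["url"] for e in entries)
    total = len(entries)
    return by_page_date, by_page, total
```

In a DWH the same three result sets come from a single GROUP BY ... ROLLUP query over the fact table.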
8. Conclusions

The experimental results show that building a DWH and integrating it with data mining tools yields better performance than the traditional approaches, in which the data set is prepared according to the goals of the analysis and must be transformed and aggregated at different levels of abstraction. The results of the data sourcing, acquisition, cleanup and transformation process confirm a considerable reduction in the size of the server log file and produce a good dataset that enhances the performance of data analysis.

9. References

[1] Qiang Yang, Joshua Zhexue Huang and Michael Ng, "A Data Cube Model for Prediction-Based Web Prefetching", Journal of Intelligent Information Systems, vol. 20, pp. 11-30, 2003.
[2] Hussain T, Asghar S and Masood N, "Web Usage Mining: A Survey on Preprocessing of Web Log File", International Conference on Information and Emerging Technologies (ICIET), Karachi, Pakistan, pp. 1-6, 2010.
[3] Vijaya Kumar T and Guruprasad H S, "Clustering Web Usage Data Using Concept Hierarchy and Self Organizing Map", International Journal of Computer Application, vol. 56, pp. 38-44, 2012.
[4] Kewem Liu, "Analysis of Preprocessing Methods for Web Usage Data", International Conference on Measurement, Information and Control (MIC), pp. 383-386, 2012.
[5] Milija Suknovic, Milutin Cupic and Milan Martic, "Data Warehousing and Data Mining - A Case Study", Yugoslav Journal of Operations Research, vol. 15, pp. 125-145, 2005.
[6] Jiang Changbin and Chen Li, "Web Log Data Preprocessing Based on Collaborative Filtering", Second International Workshop on Education Technology and Computer Science (ETCS), Wuhan, China, pp. 118-121, 2010.
[7] Choi Jinhyuk and Lee Geehyuk, "New Techniques for Data Preprocessing Based on Usage Logs for Efficient Web User Profiling at Client Side", IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, Italy, pp. 54-57, 2009.
[8] Ling Zheng, Shuo Cui, Dong Yue and Xinyu Zhao, "User Interest Modeling Based on Browsing Behavior", 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), Chengdu, China, vol. 5, pp. 455-458, 2010.
[9] Tanasa D and Trousse B, "Data Preprocessing for WUM", IEEE Potentials, pp. 22-25, 2004.
[10] Olivia Rud C, "Data Warehousing for Data Mining: A Case Study", SAS Users Group International Conference, SUGI-25, Indiana, USA, pp. 119-25, 2000.
[11] Huang Hao, Jiang Dan and Huang Jianqing, "Separating Interleaved User Sessions from Web Log", International Conference on Network Computing and Information Security (NCIS), Guilin, China, pp. 152-156, 2011.
[12] Cooley R, Mobasher B and Srivastava J, "Data Preparation for Mining World Wide Web Browsing Patterns", Journal of Knowledge and Information Systems, vol. 1, pp. 1-24, 1999.
[13] Natheer Khasawneh and Hien Chung Chan, "Active User-Based and Ontology-Based Web Log Data Preprocessing for Web Usage Mining", IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, Hong Kong, pp. 325-328, 2006.
[14] Raju G T, Yogish and Manjunath T N, "The Descriptive Study of Knowledge Discovery from Web Usage Mining", IJCSI International Journal of Computer Science Issues, vol. 8, pp. 225-230, 2011.
[15] Mobasher B, "Web Usage Mining", Springer, Germany, pp. 449-483, 2007.
[16] Bamshad Mobasher, "The Adaptive Web", Springer, Germany, pp. 90-135, 2007.
[17] Alex Berson and Stephen J Smith, "Data Warehousing, Data Mining and OLAP", Tata McGraw Hill, New Delhi, 2004.