An Overview on Web Usage Mining

G. Neelima 1,* and Sireesha Rodda 2

1 GMRIT, Rajam, Srikakulam, A.P., India
2 GITAM University, Visakhapatnam, A.P., India
[email protected]

Abstract. The prolific growth of web-based applications and the enormous amount of data involved therein has led to the development of techniques for identifying patterns in web data. Web mining refers to the application of data mining techniques to the World Wide Web. Web usage mining is the process of extracting useful information from web server logs based on the browsing and access patterns of users. This information is especially valuable for business sites seeking to improve customer satisfaction. Web usage mining discovers interesting usage patterns, often hidden in web logs, in order to understand and better serve the needs of web-based applications. It consists of three phases: preprocessing, pattern discovery, and pattern analysis. In this paper, we present each phase in detail, describe the process of extracting useful information from server log files, and survey some application areas of web usage mining, such as education, health, human-computer interaction, and social media.

Keywords: Web Usage Mining, Web server logs, Data Preprocessing, Pattern discovery.

1

Introduction

The World Wide Web is a large and growing collection of information, and finding the appropriate information in it usually takes a great deal of time, so techniques are needed to analyze the data. One such technique is Web mining, which lets us analyze the web and discover useful information in it. Web Usage Mining (WUM) extracts useful information from web server logs based on users' needs and preferences. To extract and process the information, web usage mining follows two main steps [1][2]: data preprocessing and pattern discovery. The data on the web is a huge collection of raw data, so it must be preprocessed before the information a user needs can be obtained. The phases of web usage mining include data cleaning, data preparation, user identification, session identification, data integration, data transformation, pattern discovery, and pattern analysis. Data preprocessing is the most critical phase of WUM. It can be performed on the original data or on data integrated from multiple sources. The purpose of web usage mining is to discover hidden information in web log data, so the data must be mined from log files. Log files provide information about the activity of a user, viz., which web sites he or she visits, with whom he or she exchanges e-mail, etc. These files are maintained by the system administrator.

This paper provides a comprehensive survey of web usage mining. Section 2 describes the various kinds of Web server log files available for web usage mining. Section 3 details the different phases of web usage mining. Section 4 summarizes various applications in the domain of web usage mining. Section 5 discusses the challenges and concerns arising from the application of web usage mining, and Section 6 concludes the paper.

* Corresponding author.

© Springer International Publishing Switzerland 2015 S.C. Satapathy et al. (eds.), Emerging ICT for Bridging the Future − Volume 2, Advances in Intelligent Systems and Computing 338, DOI: 10.1007/978-3-319-13731-5_70

2

Web Server Logs

A web server log is a log file, usually a simple text file, that is maintained by the server and stores the activities performed by users, i.e., it keeps a history of web page requests. Generally these log files are not accessible to users; only the web administrator can handle them. The log files of different web servers maintain different types of information. Consider the example log file in Fig. 1, which contains:

• The IP address of the computer making the request (i.e., the visitor)
• The identity of the computer making the request
• The login ID of the visitor
• The date and time of the hit
• The request method
• The location and name of the requested file
• The HTTP status code (e.g., file sent successfully, file not found, etc.)
• The size of the requested file
• The web page which referred the hit (e.g., a web page containing a hyperlink which the visitor clicked to get here)

151.44.15.252 - - [25/May/2004:00:17:39 +1200] "GET /data/zookeeper/status.html HTTP/1.1" 200 4195 "http://www.mediacollege.com/cgibin/forum/commentary.pl/noframes/read/209" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Hotbar 4.4.7.0)"

Fig. 1. Example log file
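As an illustration, the fields listed above can be pulled out of the entry in Fig. 1 with a regular expression. The following is a minimal Python sketch, assuming the entry follows the widely used combined log layout; the names LOG_PATTERN and parse_log_line are hypothetical and not part of any cited tool.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# The entry shown in Fig. 1.
SAMPLE = (
    '151.44.15.252 - - [25/May/2004:00:17:39 +1200] '
    '"GET /data/zookeeper/status.html HTTP/1.1" 200 4195 '
    '"http://www.mediacollege.com/cgibin/forum/commentary.pl/noframes/read/209" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Hotbar 4.4.7.0)"'
)

entry = parse_log_line(SAMPLE)
print(entry["ip"], entry["method"], entry["path"], entry["status"], entry["size"])
# prints: 151.44.15.252 GET /data/zookeeper/status.html 200 4195
```

Returning None for lines that do not match, rather than raising an error, lets a later cleaning step simply skip malformed entries.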

These details of the log file are then used in the web usage mining process. Web usage mining can be applied to identify highly utilized web sites, where utilization means either how frequently a site is visited or how long it is used. The quantitative usage of a web site can therefore be found by analyzing the log file.
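Both notions of utilization, visit frequency and time spent, can be estimated from parsed log entries. The sketch below is only an illustration: the requests list is hypothetical data, and approximating the time spent on a page by the gap until the same visitor's next request is an assumption of this example, not a method prescribed by the paper.

```python
from collections import Counter, defaultdict

# Hypothetical parsed entries: (visitor IP, request time in seconds, requested page).
requests = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 40, "/products.html"),
    ("10.0.0.1", 100, "/index.html"),
    ("10.0.0.2", 10, "/index.html"),
    ("10.0.0.2", 25, "/contact.html"),
]

# Frequency of visits: number of hits per page.
hits = Counter(page for _, _, page in requests)

# Duration of use: time until the same visitor's next request (assumed heuristic).
by_visitor = defaultdict(list)
for ip, timestamp, page in requests:
    by_visitor[ip].append((timestamp, page))

time_on_page = Counter()
for visits in by_visitor.values():
    visits.sort()
    for (start, page), (next_start, _) in zip(visits, visits[1:]):
        time_on_page[page] += next_start - start

print(hits.most_common(1))          # [('/index.html', 3)]
print(time_on_page["/index.html"])  # 55
```

The last request of each visitor has no successor, so its viewing time cannot be estimated this way and is left out of the totals.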


A Web log is a file to which the Web server writes information each time a user requests a resource from that server. A log file can be located in three different places:

1) Web server log files: The log file that resides on the web server records the activity of clients that access the web server through a browser. Its contents are as described above. A server that collects personal information from users should transfer it securely.
2) Web proxy server log files: A proxy server is an intermediate server that sits between the client and the Web server. If the Web server receives a client's request via a proxy server, the entries in the Web server's log file describe the proxy server and not the original user. Proxy servers therefore maintain a separate log file that records information about the user.
3) Client browser log files: This kind of log file resides in the client's browser itself. Special software exists that users can download and install in their browser for this purpose. Even though the log file is kept on the client side, its entries are written only by the Web server.

3

Web Usage Mining Process

Web usage mining, from the data mining perspective, is the task of applying data mining techniques to discover usage patterns from Web data in order to understand and better serve the needs of users navigating the Web. Like every data mining task, the process of Web usage mining consists of two main steps [3][4]: (1) the data preprocessing phase and (2) the pattern discovery phase.

3.1

Phase 1: Data Preprocessing

The first issue in the preprocessing phase is data preparation [1]. Data preparation is often the most time-consuming and computationally intensive step in the Web usage mining process. It may involve preprocessing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data mining operations. Chaudhary and Gupta [5] refer to this process as data preparation.


Fig. 2. Web Usage Mining Process

1. Data preparation: Web data can be collected and used in the context of Web personalization [6][7]. These data are classified into four categories according to [8]:

1.1 Content data are presented to the end user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases.
1.2 Structure data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.
1.3 Usage data represent a Web site's usage, such as a visitor's IP address, time and date of access, complete path (files or directories) accessed, referrer's address, and other attributes that can be included in a Web access log.
1.4 User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about the user's interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.

2. Preprocessing: The information available on the web is heterogeneous and unstructured. Therefore, the preprocessing phase is a prerequisite for discovering patterns. The goal of preprocessing is to transform the raw click stream data into a set of user profiles. Data preprocessing presents a number of unique challenges, which have led to a variety of algorithms and heuristic techniques for preprocessing tasks such as merging and cleaning, user identification, and session identification. Various research works in this area address the grouping of sessions and transactions, which is used to discover user behavior patterns [1][7].

2.1 Data Cleaning: Data cleaning is the process of removing irrelevant items, such as JPEG, GIF, or sound files, and references caused by spider navigation. Improving the quality of the data improves the analysis performed on it.
The HTTP protocol issues a separate request for every file retrieved from the web server. When a user requests a particular page, the graphics and scripts it references are downloaded in addition to the HTML file, and each of them produces its own server log entry. Such entries are usually removed; an exception is a site such as an art gallery, where the images are more important. The status codes in the log entries are also checked: entries with a status code less than 200 or greater than 299 are removed, so that only successful requests remain.

2.2 User Identification: Identifying the individual users who access a web site is an important step in web usage mining, and various methods can be used for it. The simplest method is to assign a different user ID to each IP address. However, behind a proxy server many users share the same address, and a single user may use several browsers. The Extended Log Format overcomes this problem by recording referrer information and the user agent. If the IP address of an entry is the same as that of the previous entry but the user agent is different, then the user is assumed to be a new user. If both the IP address and the user agent are the same, then the referrer URL and the site topology are checked. If the requested page is not directly reachable from any of the pages already visited by the user, then the request is attributed to a new user at the same address. The caching problem can be mitigated by assigning a short expiration time to HTML pages, forcing the browser to retrieve every page from the server [9].

2.3 Session Identification: A user session can be defined as the set of pages visited by the same user within the duration of one particular visit to a web site. A user may have a single session or multiple sessions during a period. Once a user has been identified, the click stream of that user is partitioned into logical clusters. The method of partitioning into sessions is called sessionization or session reconstruction. A transaction is defined as a subset of user session having
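The three preprocessing steps described above (data cleaning, user identification, and session identification) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the entry layout, the list of ignored file extensions, the robot markers, and the 30-minute session timeout are all assumptions of this example (the timeout is a commonly used heuristic that the text does not specify), and the referrer and site topology check from user identification is omitted for brevity.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only.
IGNORED_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".wav", ".mp3")
ROBOT_MARKERS = ("bot", "spider", "crawler")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(entries):
    """Data cleaning: drop embedded files, spider requests, and non-2xx responses."""
    kept = []
    for e in entries:
        if e["path"].lower().endswith(IGNORED_EXTENSIONS):
            continue
        if any(marker in e["agent"].lower() for marker in ROBOT_MARKERS):
            continue
        if not 200 <= e["status"] <= 299:
            continue
        kept.append(e)
    return kept


def identify_users(entries):
    """User identification: same IP with a different user agent counts as a new user."""
    users = {}
    for e in entries:
        users.setdefault((e["ip"], e["agent"]), []).append(e)
    return users


def sessionize(user_entries, timeout=SESSION_TIMEOUT):
    """Session identification: start a new session after a gap longer than the timeout."""
    sessions = []
    for e in sorted(user_entries, key=lambda item: item["time"]):
        if sessions and e["time"] - sessions[-1][-1]["time"] <= timeout:
            sessions[-1].append(e)
        else:
            sessions.append([e])
    return sessions


def make_entry(ip, agent, minute, path, status=200):
    """Build a hypothetical parsed log entry for the demonstration below."""
    time = datetime(2004, 5, 25) + timedelta(minutes=minute)
    return {"ip": ip, "agent": agent, "time": time, "path": path, "status": status}


log = [
    make_entry("1.1.1.1", "Mozilla/4.0", 0, "/index.html"),
    make_entry("1.1.1.1", "Mozilla/4.0", 0, "/logo.gif"),
    make_entry("1.1.1.1", "Mozilla/4.0", 5, "/a.html"),
    make_entry("1.1.1.1", "Opera/7.0", 6, "/index.html"),
    make_entry("2.2.2.2", "ExampleBot/1.0", 7, "/index.html"),
    make_entry("1.1.1.1", "Mozilla/4.0", 8, "/missing.html", status=404),
    make_entry("1.1.1.1", "Mozilla/4.0", 90, "/b.html"),
]

cleaned = clean(log)
users = identify_users(cleaned)
sessions = {user: sessionize(entries) for user, entries in users.items()}

for user, user_sessions in sessions.items():
    print(user, [[e["path"] for e in s] for s in user_sessions])
```

In this example the image request, the robot request, and the failed request are removed, the two browsers at the same address are treated as two users, and the 85-minute gap splits the first user's click stream into two sessions.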

3.2

Phase 2: Pattern Discovery

1. Pattern Discovery: Once user transactions have been identified, a variety of data mining techniques are applied for pattern discovery in web usage mining. These methods represent approaches that often appear in the data mining literature, such as the discovery of association rules and sequential patterns, clustering, and classification.

Classification is a supervised learning process, because learning is driven by the assignment of instances to classes in the training data. It maps a data item into one of several predefined classes. It can be done using inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, and Support Vector Machines.

Association rule discovery techniques are applied to databases of transactions in which each transaction consists of a set of items. Using the Apriori algorithm [4], the largest frequent access itemsets, which represent user access patterns, are discovered from the transaction database.

Clustering is a technique for grouping users who exhibit similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in e-commerce applications or to provide personalized web content to users.

Sequential patterns are used to find inter-session patterns, such as the presence of a set of items followed by another item in a time-ordered set of sessions. Using this approach, web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups.

2. Pattern Analysis: Pattern analysis is the final stage of web usage mining. Mined patterns are not directly suitable for interpretation and judgment, so it is important to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. In this stage, tools are provided to facilitate the transformation of information into knowledge. The exact analysis methodology is usually governed by the application for which Web mining is done. A knowledge query mechanism such as SQL is the most common method of pattern analysis. Another method is to load usage data into a data cube in order to perform OLAP operations [4].
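To make the association rule step and the subsequent filtering of uninteresting rules more concrete, the following Python sketch runs a small level-wise Apriori search over a handful of transactions and keeps only rules above a confidence threshold. It is a simplified illustration, not the implementation used in [4]; the transactions and the thresholds min_support = 0.5 and min_confidence = 0.8 are assumed values chosen for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search. Returns {itemset: number of supporting transactions}."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules antecedent -> consequent and keep those meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules


# Hypothetical transactions: the set of pages visited in each one.
transactions = [
    {"/index", "/products", "/cart"},
    {"/index", "/products"},
    {"/index", "/contact"},
    {"/products", "/cart"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
print(rules)  # [({'/cart'}, {'/products'}, 1.0)]
```

Here the only rule that survives the confidence filter says that visitors who view the cart page also view the products page. The confidence threshold plays the role of a simple interestingness filter in the pattern analysis stage.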

4

Applications

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining has seen a rapid increase in interest from both the research and practice communities. Some of its applications follow.

1. Education: According to [10], the application of web usage mining in educational systems is increasing rapidly. An interesting application of Web usage mining is link recommender systems (LRS). Their purpose is to facilitate the navigation of users on a Web site and to help them not to get lost while browsing through hypertext documents. At the same time, there is increasing interest in applying data mining to educational systems, making educational data mining (EDM) a new and growing research area. EDM is an emerging discipline concerned with developing methods for exploring the unique types of data that are obtained from different types of educational contexts. On the one hand, there are traditional face-to-face classroom environments, such as special education and higher education. On the other hand, there are computer-based and Web-based education systems, such as intelligent tutoring systems and well-known learning management systems like WebCT, BlackBoard, and Moodle. The web usage mining approach of [10] uses all the available usage information about students (profile and log information) in order to learn user routes or browsing pathways for personalized link recommendation. In this setting, web usage mining generally consists of three phases: data preparation, pattern discovery, and recommendation. The first two phases are performed offline and the last phase is performed online. In education, data preparation transforms Web log files and profiles into data in the appropriate format. Pattern discovery uses a data mining technique, such as clustering, sequential pattern mining, or association rule mining.
Finally, recommendation uses the discovered patterns to provide personalized links or content.

2. Health informatics: The Internet can serve as the backbone for implementing supply chain solutions that add value for health care providers, their suppliers, and their patients. According to [11], even though health care information systems in some hospitals and clinics have been linked together with a local area network or a wide area network, network-based health care systems were not popular until the advent of the Internet. The three primary network applications that the healthcare industry uses, to varying degrees, are the Internet, intranets, and extranets. Doctors can use the Internet to do more than download information and communicate with other providers; it can also be used to send complex medical files across the Web.

3. Human-computer interaction: Web usage mining is a kind of web mining that applies data mining techniques to discover interesting patterns in the navigation behavior of World Wide Web users. It provides information for better understanding server needs and the web domain design requirements of web-based applications. Web usage data contains information about the identity or origin of web users together with their browsing behavior in a web domain. Web prefetching, link prediction, site reorganization, and web personalization are common applications of WUM. According to [12], the three phases of web usage mining together yield a log file that is free of inconsistent and useless data, which helps in filtering out unwanted access patterns and web pages. Web structure mining also plays an important role, with benefits that include quick responses to web users, fewer HTTP transactions between users and servers (which saves server memory), and better utilization of bandwidth and server processor time.

4. Social media: Web usage mining also plays an important role in social network analysis, where it is useful for social network extraction [13]. The usage data and user communications on an online social networking website can be transformed into relational data for social network construction. In addition, web usage mining is a tool for measuring degree centrality. Another task in social network analysis is finding the communities embedded in social network datasets and, moreover, analyzing the evolution of those communities in dynamic networks.
The evolution pattern, as one kind of temporal analysis, can sometimes provide interesting insight from the perspective of social behavior. Recently, a considerable amount of research has been done on this topic. In the field of social networks, the term Web community is also used to mean a set of users having similar interests. As noted in [14], social networking products are flourishing. Sites such as MySpace, Facebook, and Orkut attract millions of visitors a day, approaching the traffic of Web search sites. These social networking sites provide tools for individuals to establish communities, to upload and share user-generated content, and to interact with other users. In recent articles, users complained that they would soon require a full-time employee to manage their sizable social networks. Take Orkut as an example [15]: it has more than 100 million communities and users, with hundreds of communities created each day. A user cannot possibly view all communities to select the relevant ones.
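The social media item above mentions transforming usage data into relational data and measuring degree centrality. A minimal Python sketch of that idea follows; the interactions list is hypothetical data, and normalizing the degree by the number of other users is one common definition, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical usage data: pairs of users who communicated on a social networking site.
interactions = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("bob", "cat"),
    ("dan", "ann"),
]


def build_graph(pairs):
    """Turn interaction records into relational data: an undirected neighbor map."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other users that each user is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


graph = build_graph(interactions)
centrality = degree_centrality(graph)
print(max(centrality, key=centrality.get))  # ann
```

In this toy network, ann has interacted with every other user and therefore has the highest degree centrality.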

5

Challenges

Web Usage Mining is the automatic discovery of patterns in user interactions with a web server, drawn from sources that include web logs. The web log files are collected from the web server. WUM raises privacy concerns, which are currently the topic of extensive debate. The knowledge gathered from WUM can be very useful in many Web applications, such as Web caching, Web prefetching, and intelligent online advertising, as well as in constructing Web personalization. Most research efforts on modeling personalization systems face challenges in clustering pages or user sessions, association rule generation, and sequential pattern generation. These challenges are the most common ones encountered in almost all web usage mining research, and they have a huge impact on its success or failure.

6

Conclusion

This paper has attempted to provide a review of the rapidly growing area of web usage mining. Web usage mining makes use of web log information in order to mine the desired information and make it available to the user efficiently and effectively. Content and structure preprocessing allows raw data to be preprocessed along these dimensions as well. The involvement of intelligent agents and knowledge query mechanisms improves the efficiency of pattern analysis. The process and applications of web usage mining have been discussed in this paper. The paper has also described the challenges in the area, in the hope that the research community will take up the task of addressing them.

References

[1] Patel, K.B., Patel, A.R.: Process of Web Usage Mining to find Interesting Patterns from Web Usage Data. International Journal of Computers & Technology 3(1) (August 2012)
[2] Langhnoja, S., Barot, M.: Pre-Processing: Procedure on Web Log File for Web Usage Mining. International Journal of Emerging Technology and Advanced Engineering 2(12) (December 2012) Website: http://www.ijetae.com, ISSN 2250-2459, ISO 9001:2008 Certified Journal
[3] Mitharam, M.D.: Preprocessing in Web Usage mining. International Journal of Scientific & Engineering Research 3(2), 1 (2012) ISSN 2229-5518
[4] Sharma, A.: Web Usage Mining: Data Preprocessing, Pattern Discovery and Pattern Analysis on the RIT Web Data
[5] Chaudhary, K., Gupta, S.K.: Web Usage Mining Tools & Techniques: A Survey. International Journal of Scientific & Engineering Research 4(6), 1762 (2013) ISSN 2229-5518
[6] Srivastava, J., Cooley, R.: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations (January 2000)
[7] Chitraa, V., Davamani, A.S.: A Survey on Preprocessing Methods for Web Usage Data. International Journal of Computer Science and Information Security (IJCSIS) 7(3) (2010)
[8] Langhnoja, S., Barot, M.: Pre-Processing: Procedure on Web Log File for Web Usage Mining. International Journal of Emerging Technology and Advanced Engineering 2(12) (December 2012) Website: http://www.ijetae.com, ISSN 2250-2459, ISO 9001:2008 Certified Journal
[9] Pani, S.K., Panigrahy, L.: Web Usage Mining: A Survey on Pattern Extraction from Web Logs. International Journal of Instrumentation, Control & Automation (IJICA) 1(1) (2011)
[10] Romero, C., Ventura, S., Zafra, A., de Bra, P.: Applying Web usage mining for personalizing hyperlinks in Web-based adaptive educational systems (received January 8, 2009) (received in revised form May 4, 2009) (accepted May 4, 2009)
[11] Siau, K.: Health Care Informatics. IEEE Transactions on Information Technology in Biomedicine 7(1) (March 2003)
[12] Geeta, R.B., Totad, S.G., Reddy, P.: Amalgamation of Web Usage Mining and Web Structure Mining. International Journal of Recent Trends in Engineering 1(2) (May 2009)
[13] Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing Social Media Messages in Mass Emergency: A Survey (August 3, 2014), arXiv:1407.7071v2 [cs.SI]
[14] Raju, E., Sravanthi, K.: Analysis of Social Networks Using the Techniques of Web Mining 2(10) (October 2012)
[15] Zhang, Y.: Web Information Systems Engineering and Internet Technologies. Springer Science+Business Media, LLC (2011)