Pattern discovery techniques in Web mining

Mirela Pater, Daniela E. Popescu and Daniela Maştei
Department of Computer Science, University of Oradea, Faculty of Electric Engineering and Information Technology, University Str. no.1, Oradea, Romania
Phone: +40 (0)259 408-250


Abstract. With the huge amount of information available online, the World Wide Web is a fertile area for data mining. Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. In this paper we define Web mining and present an overview of the various research issues, techniques and development efforts. We briefly describe the strategies for pattern discovery techniques in Web mining.


2. information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources
3. generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites
4. analysis: validation and/or interpretation of the mined patterns

Web mining techniques could be used to solve the information overload problems above directly or indirectly. Web mining uses techniques from different research areas, such as database (DB), information retrieval (IR), natural language processing (NLP), and the Web document community. Figure 1 presents the taxonomy of Web mining, which includes Web content mining and Web usage mining [11].

Keywords: Web, data mining, information retrieval, information extraction, pattern discovery

I. INTRODUCTION
The World Wide Web (Web) is a popular and interactive medium to disseminate information today. The Web is huge, diverse, and dynamic and thus raises the scalability, multimedia data and temporal issues respectively [9]. With the explosive growth of information sources available on the Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of information resources available on-line, i.e. Web content mining, and the discovery of user access patterns from Web servers, i.e. Web usage mining.

[Figure 1: Taxonomy of Web Mining. Web mining divides into Web content mining (agent-based approach, database approach) and Web usage mining.]

Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data. It implicitly covers the standard process of knowledge discovery in databases (KDD) [13]. We could simply view Web mining as an extension of the KDD process that is applied to the Web data. Web mining is often associated with IR (Information Retrieval) or IE (Information Extraction). As we pointed out previously, IR on the Web is an instance of Web (content) mining. Actually, IR is the automatic retrieval of all relevant documents while at the same time retrieving as

II. WEB MINING
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [9]. This area of research is so large today partly because of the interest of various research communities, the tremendous growth of information sources available on the Web and the recent interest in e-commerce. Pazzani [15] decomposed Web mining into several subtasks:
1. resource finding: the task of retrieving intended Web documents


provide some comfort to users, but do not generally provide structural information nor categorize, filter or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as Web agents and to extend data mining techniques to provide a higher level of organization for semi-structured data available on the Web.

few of the non-relevant documents as possible [13]. IR has the primary goals of indexing text and searching for useful documents in a collection, and research in IR now includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. [14]. The task that can be considered to be an instance of Web mining is Web document classification or categorization, which could be used for indexing. We can say that all of the indexing tasks use data mining techniques.
Information extraction (IE) has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed [14]. IE aims to extract relevant information from the documents while IR aims to select relevant documents [11]. While IE is interested in the structure or representation of a document, IR views the text in a document just as a bag of unordered words [20]. There are basically two types of IE: IE from unstructured texts and IE from semi-structured data. There are considerable differences between the IE systems that are used for unstructured documents and those that are used for semi-structured or even structured documents. Most IE systems focus on extracting information from specific Web sites. The result of the IE process could be a compression or summary of the original text or documents.
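The bag-of-words view attributed to IR above can be illustrated with a minimal Python sketch. The tokenizer (lowercased alphanumeric runs) and the two sample sentences are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case."""
    # Assumed tokenization: runs of lowercase letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Two hypothetical documents with the same words in a different order.
doc_a = "Web mining extracts patterns from Web data"
doc_b = "patterns from Web data: Web mining extracts"

# Under the bag-of-words view the two documents are indistinguishable.
same = bag_of_words(doc_a) == bag_of_words(doc_b)
```

An IE system, by contrast, would care about where each word occurs and how the document is structured, which this representation discards.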

A.1.1. Agent-based Approach
Web mining is often viewed from or implemented within an agent paradigm. Thus, Web mining has a close relationship with software agents or intelligent agents. Some of these agents perform data mining tasks to achieve their goals. We can place the agent-based approaches into three categories:
– Intelligent search agents – search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information.
– Information filtering/categorization – use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter and categorize them.
– Personalized Web agents – learn user preferences and discover Web information sources based on these preferences and those of other individuals with similar interests.

A. Web content mining
Web content mining describes the discovery of useful information from the Web contents, data and documents. Most data are now either ported to or accessible from the Web. There are many Digital Libraries that are accessible from the Web. We also see that many companies are moving their business and services to electronic form. As a consequence many of the company databases that previously resided in legacy systems are being ported to or made accessible from the Web. Thus employees, partners or even customers could access some of the company databases directly from Web-based interfaces. Another consequence of this transformation is the existence of Web applications, so that users could access the applications through Web interfaces. Many applications and systems are being migrated to Web environments. Some of the Web content data are hidden data, which cannot be indexed. These data are either generated dynamically as a result of queries and reside in the DBMSs or are private.
Basically, the Web content consists of several types of data such as textual, image, audio, video and metadata, as well as hyperlinks. Recent research on mining multiple types of data is termed multimedia data mining [13]. Thus we consider multimedia data mining as an instance of Web content mining.
The lack of structure that permeates the information sources on the Web makes automated discovery of Web-based information difficult. Traditional search engines such as Alta Vista, Lycos, ALIWEB [15], and others

Table 1 presents the association between the categories of Web mining and the agent paradigm [14].

TABLE 1. Association between the categories of Web mining and the agent paradigm
Content based filters ↔ Content mining
Reputation based filters ↔ Structure (and content) mining
Collaborative or social based filters ↔ Usage mining
Event based filters ↔ Usage mining
Hybrid filters ↔ Combination of the categories

A.1.2. Database Approach
Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources and using standard database querying mechanisms and data mining techniques to analyze them. As mentioned in [14], the database techniques on the Web are related to the problems of managing and querying the information on the Web. There are three classes of tasks related to those problems: modeling and querying the Web, information extraction and integration


is performed. The second approach uses the log data directly by utilizing special pre-processing techniques. The applications of Web usage mining could be classified into two main categories: learning a user profile or user modeling in adaptive interfaces, and learning user navigation patterns. Web users would be interested in techniques that could learn their information needs and preferences, which is user modeling possibly combined with Web content mining. On the other hand, information providers would be interested in techniques that could improve the effectiveness of the information on their Web sites by adapting the Web site design or by biasing the user’s behavior towards satisfying the goals of the site. In other words, they are interested in learning user navigation patterns.
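The first approach, mapping Web server usage data into relational tables before mining, can be illustrated with a small sketch. It assumes log records have already been parsed into (host, timestamp, URL) tuples. The table name `access_log`, its columns and the sample records are hypothetical, and the query is only one of many analyses that such a table would allow.

```python
import sqlite3

# Hypothetical, already-parsed log records: (host, timestamp, url).
records = [
    ("10.0.0.1", 100, "/index.html"),
    ("10.0.0.1", 130, "/products.html"),
    ("10.0.0.2", 140, "/index.html"),
    ("10.0.0.2", 150, "/contact.html"),
    ("10.0.0.1", 160, "/index.html"),
]

# Map the usage data into a relational table (in-memory database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (host TEXT, ts INTEGER, url TEXT)")
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", records)

# Standard SQL can then summarize the usage data, here hits per page.
page_counts = conn.execute(
    "SELECT url, COUNT(*) AS hits FROM access_log "
    "GROUP BY url ORDER BY hits DESC, url"
).fetchall()
```

Once the data sits in relational form, an adapted mining technique can be run over the table instead of over the raw log file.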

and Web site construction and restructuring. The first two tasks are related to the Web content mining applications. The database view tries to infer the structure of the Web site or to transform a Web site into a database so that better information management and querying on the Web become possible. Many applications use multilevel databases (MLDB), in which each level is obtained by generalization of the lower level, together with a special-purpose query language for Web mining to extract knowledge from the MLDB of Web documents. In multilevel databases the main idea is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases. Most Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents and even natural language processing for the queries that are used in Web searches.

III. PATTERN DISCOVERY FROM WEB TRANSACTIONS
Analysis of how users are accessing a site is critical for determining effective marketing strategies and optimizing the logical structure of the Web site. Because of many unique characteristics of the client-server model in the World Wide Web, including differences between the physical topology of Web repositories and user access paths, and the difficulty in identifying unique users as well as user sessions or transactions, it is necessary to develop a new framework to enable the mining process. There are a number of issues in pre-processing data for mining that must be addressed before the mining algorithms can be run. These include developing a model of access log data, developing techniques to clean/filter the raw data to eliminate outliers and/or irrelevant items, grouping individual page accesses into semantic units (transactions), integrating various data sources such as user registration information, and specializing generic data mining algorithms to take advantage of the specific nature of access log data.
The first pre-processing task is data cleaning. Techniques to clean a server log to eliminate irrelevant items are of importance for any type of Web log analysis, not just data mining. The discovered associations or reported statistics are only useful if the data represented in the server log gives an accurate picture of the user accesses to the Web site. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name [14]. A major problem associated with proxy servers is that of user identification. Use of a machine name to uniquely identify users can result in several users being erroneously grouped together as one user.
The second major pre-processing task is transaction identification. Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions or user sessions.
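The two pre-processing tasks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any cited system: the list of suffixes treated as irrelevant, the 30-minute timeout used to split sessions, the (host, timestamp, URL) record layout and the sample log are all hypothetical. Identifying users by host name is used here only for simplicity; as noted above, it can merge several users behind one proxy.

```python
from collections import defaultdict

# Assumption: requests with these suffixes are treated as irrelevant items.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
# Assumption: a gap longer than 30 minutes (in seconds) starts a new session.
SESSION_TIMEOUT = 30 * 60

def clean(entries):
    """Data cleaning: drop entries whose URL ends with an irrelevant suffix."""
    return [e for e in entries if not e[2].lower().endswith(IRRELEVANT_SUFFIXES)]

def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Transaction identification: group each host's page references into
    sessions, splitting whenever two consecutive requests are more than
    `timeout` seconds apart."""
    by_host = defaultdict(list)
    for host, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        by_host[host].append((ts, url))
    sessions = []
    for host, hits in by_host.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions

# Hypothetical log entries: (host, timestamp in seconds, requested URL).
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 5, "/logo.gif"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.1", 4000, "/index.html"),
    ("10.0.0.2", 10, "/index.html"),
]
sessions = sessionize(clean(log))
```

With the sample log, the image request is removed during cleaning and the first host's accesses are split into two sessions because of the long gap before its last request.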

B. Web usage mining
Web usage mining is the automatic discovery of user access patterns from Web servers. Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs. Web usage mining focuses on techniques that predict user behavior while the user interacts with the Web. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data. Analyzing such data can help organizations determine the lifetime value of customers, cross-marketing strategies across products and the effectiveness of promotional campaigns. It can also provide information on how to restructure a Web site to create a more effective organizational presence and shed light on more effective management of workgroup communication and organizational infrastructure.
Most existing Web analysis tools provide mechanisms for reporting user activity in the servers and various forms of data filtering. Using such tools it is possible to determine the number of accesses to the server and to individual files, the times of visits and the domain names and URLs of users. These tools are designed to handle low to moderate traffic servers and usually provide little or no analysis of data relationships among the accessed files and directories within the Web space.
The Web usage mining process could be classified into two commonly used approaches [11]. The first approach maps the usage data of the Web server into relational tables before an adapted data mining technique


discovered access patterns, the topology of the Web locality, and certain heuristics derived from user behavior models, could give recommendations about changing the physical link structure of a particular site.

A user session is all of the page references made by a user during a single visit to a site. A transaction differs from a user session in that the size of a transaction can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions.
Once user transactions or sessions have been identified, there are several kinds of access pattern mining that can be performed depending on the needs of the analyst, such as path analysis, discovery of association rules and sequential patterns, clustering, and classification.
Association rule discovery techniques [2, 14] are generally applied to databases of transactions where each transaction consists of a set of items. In such a framework the problem is to discover all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. In the context of Web usage mining, this problem amounts to discovering the correlations among references to various files available on the server by a given client. Each transaction comprises a set of URLs accessed by a client in one visit to the server. Since such transaction databases usually contain extremely large amounts of data, current association rule discovery techniques try to prune the search space according to the support for the items under consideration. Support is a measure based on the number of occurrences of user transactions within transaction logs.
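The notions of support and confidence used above can be made concrete with a short Python sketch. Here support is computed as the fraction of transactions that contain an itemset, and confidence of a rule as the support of the combined itemset divided by the support of its antecedent. The transactions, the URL names and the minimum support threshold are hypothetical, and the exhaustive enumeration of pairs merely stands in for the more efficient candidate generation of algorithms such as the one in [2].

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent.
    Assumes the antecedent occurs in at least one transaction."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical transactions: the set of URLs a client accessed in one visit.
transactions = [
    frozenset({"/index.html", "/products.html"}),
    frozenset({"/index.html", "/products.html", "/order.html"}),
    frozenset({"/index.html", "/contact.html"}),
    frozenset({"/products.html", "/order.html"}),
]

MIN_SUPPORT = 0.5  # assumed threshold, chosen only for this example

# Prune the search space: keep only URL pairs that meet the minimum support.
urls = sorted(set().union(*transactions))
frequent_pairs = [
    pair for pair in combinations(urls, 2)
    if support(pair, transactions) >= MIN_SUPPORT
]

# Confidence of the rule "/order.html implies /products.html".
rule_confidence = confidence({"/order.html"}, {"/products.html"}, transactions)
```

In this toy data every visit that reaches the order page also includes the products page, so that rule holds with full confidence, while the reverse rule is weaker.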

V. CONCLUSIONS
In conclusion, the key component of Web mining is the mining process itself. Web mining has adapted techniques from the fields of data mining, database mining and information retrieval, as well as developing some techniques of its own, e.g. path analysis. A lot of work still remains to be done in adapting known mining techniques as well as developing new ones. Web usage mining studies reported to date have mined for association rules, temporal sequences, clusters and path expressions. As the manner in which the Web is used continues to expand, there is a continual need to identify new kinds of knowledge about user behavior that need to be mined.
The quality of a mining algorithm can be measured both in terms of how effective it is in mining for knowledge and how efficient it is in computational terms. There will always be a need to improve the performance of mining algorithms along both these dimensions.
The term Web mining has been used to refer to techniques that encompass a broad range of issues. The Web presents new challenges to the traditional data mining algorithms that work on flat data. We have seen that some of the traditional data mining algorithms have been extended, or new algorithms have been developed, to work on the Web data.

IV. ANALYSIS OF DISCOVERED PATTERNS
The term Web mining has been used to refer to techniques that encompass a broad range of issues. The discovery of Web usage patterns, carried out by the techniques described earlier, would not be very useful unless there were mechanisms and tools to help an analyst better understand them. Hence, in addition to developing techniques for mining usage patterns from Web logs, there is a need to develop techniques and tools for enabling the analysis of discovered patterns. These techniques are expected to draw from a number of fields including statistics, graphics and visualization, usability analysis and database querying.
Visualization has been used very successfully in helping people understand various kinds of phenomena, both real and abstract. Hence, it is a natural choice for understanding the behavior of Web users. The Web is visualized as a directed graph with cycles, where nodes are pages and edges are (inter-page) hyperlinks.
One of the open issues in data mining, in general, and Web mining, in particular, is the creation of intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these tools need to have specific knowledge about the particular problem domain to do any more than filtering based on statistical attributes of the discovered rules or patterns. In Web mining, for example, intelligent agents could be developed that, based on

REFERENCES
[1] S. Abiteboul. Querying semi-structured data. In F.N. Afrati and P. Kolaitis, editors, Database Theory – ICDT’97, 6th International Conference, Delphi, Greece, pp. 1-18, January 8-10, 1997
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, pp. 487-499, 1994
[3] S. Agrawal et al. On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conference, Mumbai, India, pp. 506-521, 1996
[4] D.E. Appelt and D. Israel. Introduction to information extraction technology. In Proc. of 16th International Joint Conference on Artificial Intelligence IJCAI-99, Tutorial, 1999
[5] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval, Addison-Wesley Longman Publishing Company, 1999
[6] M. Balabanovic et al. An adaptive agent for automated Web browsing. Journal of Visual Communication and Image Representation, 6(4), 1995
[7] P. Buneman et al. A query language and optimization techniques for unstructured data. In H.V. Jagadish and I.S. Mumick, editors, Proc. of 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, pp. 505-516, ACM Press, 1996
[8] M.S. Chen et al. Data mining for path traversal patterns in web environment. In Proc. of the 16th International Conference on Distributed Computing Systems, pp. 385-392, 1996
[9] R. Cooley et al. Grouping web page references into transactions for mining world wide web browsing patterns. Technical report TR 97-021, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997
[10] R. Cooley et al. Web mining: Information and pattern discovery on the world wide web. Technical report TR 97-027, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997
[11] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, University of Minnesota, May 2000
[12] J. Cowie and W. Lehnert. Information extraction. Communications of ACM, 39(1):60-91, 1996
[13] U. Fayyad et al. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI Press, 1996
[14] R. Kosala and H. Blockeel. Web mining research: A survey. In ACM SIGKDD, Vol. 2, Issue 1, July 2000
[15] P. Maes. Agents that reduce work and information overload. Communications of ACM, 37(7):30-40, 1994
[16] M. Pazzani et al. Identifying interesting web sites. In Proc. AAAI Spring Symposium on Machine Learning in Information Access, Portland, Oregon, 1998
[17] M.T. Pazienza, editor. Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, volume 1299 of Lecture Notes in Computer Science, International Summer School, SCIE-97, Frascati (Rome), Springer, 1997
[18] G. Piatetsky-Shapiro et al. An overview of issues in developing industrial data mining and knowledge discovery applications. In Proc. of The Second Int. Conference on Knowledge Discovery and Data Mining, pp. 89-95, 1996
[19] P. Pirolli et al. Extracting usable structures from the web. In Proc. of 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996
[20] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, 1999
[21] S. Vaithyanathan. Introduction: Data mining on the internet. Artificial Intelligence Review, 13(5/6):343-344, 1999
[22] M.R. Wulfekuhler and W.F. Punch. Finding salient features for personal Web page categorization. In Proc. of 6th International World Wide Web Conference, 1997

