December 19, 2008 13:57 WSPC/173-IJITDM
00315
International Journal of Information Technology & Decision Making Vol. 7, No. 4 (2008) 683–720 © World Scientific Publishing Company
WEB MINING: A SURVEY OF CURRENT RESEARCH, TECHNIQUES, AND SOFTWARE
QINGYU ZHANG∗ and RICHARD S. SEGALL
Department of Computer & Information Technology, Arkansas State University, State University, Arkansas 72467-0130, USA
∗[email protected]
The purpose of this paper is to provide a more current evaluation and update of the web mining research and techniques available. Current advances in each of the three types of web mining are reviewed in the categories of web content mining, web usage mining, and web structure mining. For each tabulated research work, we examine such key issues as web mining process, methods/techniques, applications, data sources, and software used. Unlike previous investigators, we divide the web mining process into the following five subtasks: (1) resource finding and retrieving, (2) information selection and preprocessing, (3) patterns analysis and recognition, (4) validation and interpretation, and (5) visualization. This paper also reports comparisons and summaries of selected software for web mining. The web mining software packages selected for discussion and comparison in this paper are SPSS Clementine, Megaputer PolyAnalyst, ClickTracks by web analytics, and QL2 by QL2 Software Inc. Applications of these software packages to available data sets are discussed together with abundant screen shots, as well as conclusions and future directions of the research.
Keywords: Web mining; web content mining; web usage mining; web structure mining; web mining software.
∗Corresponding author.
1. Introduction
In the data mining communities, there are three types of mining: data mining, web mining, and text mining.25 There are many challenging problems in data/web/text mining research.53 Data mining mainly deals with structured data organized in a database (DB), while text mining mainly handles unstructured data/text. Web mining lies in between and copes with semi-structured and/or unstructured data. Web mining calls for creative use of data mining and/or text mining techniques as well as its own distinctive approaches. Mining web data is one of the most challenging tasks for data mining and data management scholars because huge amounts of heterogeneous, less structured data are available on the web and one can easily be overwhelmed by them. In the literature, the terms web mining, web data
mining, and web data extraction mining are used interchangeably. In this paper, we use the term web mining. According to Wikipedia,51 web mining is the application of data mining techniques to discover patterns from the web and can be classified into three types: web usage mining, web content mining, and web structure mining. The taxonomy of web mining has grown from only web content mining and web usage mining, as considered by Cooley et al.,8 to include web structure mining, as elaborated by Liang.27 Web content mining is the process of discovering useful information from the content of web pages, which may consist of text, image, audio, or video data; web usage mining is the application of data mining to analyze and discover interesting patterns in users’ usage data on the web; and web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.51 An example of the latter would be discovering the authorities and hubs of any web document, e.g. identifying the most appropriate web links for a web page. According to Kosala and Blockeel,25 “In practice, the three web mining tasks above could be used in isolation or combined in an application, especially in web content and structure mining since the web document might also contain links.” For example, Zhong56 studies brain informatics (i.e. a combination of content and structure) from a web intelligence perspective. Kosala and Blockeel25 present a survey of web mining research for each of the three web mining categories presented above, and distinguish web mining from information retrieval (IR) and information extraction (IE). They hold that web mining techniques are not the only tools to solve information overload problems either directly or indirectly.
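The authorities-and-hubs example of web structure mining mentioned above can be made concrete with a small sketch. The code below is a minimal illustration of the standard hub/authority iteration on a link graph; it is not taken from any of the surveyed systems, and the four-page site, the function name `hits`, and the iteration count are assumptions chosen only for illustration.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical four-page site used only for illustration.
example_links = {
    "index": ["admissions", "courses"],
    "news": ["admissions"],
    "admissions": [],
    "courses": ["admissions"],
}
hub_scores, auth_scores = hits(example_links)
top_authority = max(auth_scores, key=auth_scores.get)
top_hub = max(hub_scores, key=hub_scores.get)
```

In this toy graph the page that most other pages point to receives the highest authority score, and the page that points to the best authorities receives the highest hub score.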
Kosala and Blockeel25 claim that “Other techniques and works from different research areas, such as database (DB), information retrieval (IR), natural language processing (NLP), and the web document community, could also be used. ... By the direct approach we mean that the application of the web mining techniques directly addresses the above problems. ... By the indirect approach we mean that the web mining techniques are used as a part of a bigger application that addresses the above problem.” Kosala and Blockeel25 also claim that web mining is a converging research area drawing on several research communities, such as DB, IR, and artificial intelligence (AI), the latter contributing machine learning and natural language processing (NLP).
The purpose of this paper is to provide a more current evaluation and update of the web mining research and techniques available. This paper also presents comparisons and summaries of selected software for web mining. The web mining software packages selected for discussion and comparison in this paper are SPSS Clementine, Megaputer PolyAnalyst, ClickTracks by web analytics, and QL2 by QL2 Software Inc.
Applications of these selected web mining software packages to available data sets are discussed together with abundant screen shots, as well as conclusions and future directions of the research.
2. Literature Review
Both Etzioni12 and Kosala and Blockeel25 decompose web mining into four subtasks: (a) resource finding; (b) information selection and preprocessing; (c) generalization; and (d) analysis. Kosala and Blockeel25 use this criterion to select literature up to the year 2000, the year of their publication. This paper uses both new and expanded criteria for a more current selection of literature, specifically from the date of their publication to the present date of 2007. In this paper, the web mining process is divided into the following five subtasks:
(1) resource finding and retrieving;
(2) information selection and preprocessing;
(3) patterns analysis and recognition;
(4) validation and interpretation;
(5) visualization.
The literature in this paper is classified into the three types of web mining: web content mining, web usage mining, and web structure mining. We organize the literature into four sections: (2.1) Literature review for web content mining; (2.2) Literature review for web usage mining; (2.3) Literature review for web structure mining; and (2.4) Literature review for web mining survey. For each section below, we summarize the literature with respect to the following issues: web mining process, methods/techniques, applications, data sources, and software used. We also review the current topic of the semantic web in Sec. 2.5.
2.1. Literature review for web content mining
Web content mining is performed by extracting useful information from the content of a web page/site. It includes extraction of structured data/information from web pages; identification, matching, and integration of semantically similar data; opinion extraction from online sources; and concept hierarchy, ontology, or knowledge integration (see Table 1). To reduce the gap between the low-level image features used to index images and the high-level semantic contents of images in content-based image retrieval (CBIR) systems or search engines, Zhang et al.55 suggest applying relevance feedback to refine the query or similarity measures in the image search process. They present a framework of relevance feedback and semantic learning in which low-level features and keyword annotations are integrated in the image retrieval and feedback processes to improve retrieval performance. They develop a prototype system that performs better than traditional approaches.
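The idea of refining a query through relevance feedback, described above for image retrieval, can be sketched with the classic Rocchio update. This is a generic textbook formulation and is not claimed to be the algorithm of Zhang et al.; the three-dimensional feature vectors and the weights `alpha`, `beta`, and `gamma` are hypothetical values chosen for illustration.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move a query vector toward relevant examples and away from non-relevant ones."""
    def centroid(vectors):
        # Mean vector of the examples; a zero vector if there are none.
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    pos = centroid(relevant)
    neg = centroid(non_relevant)
    updated = [alpha * q + beta * p - gamma * n
               for q, p, n in zip(query, pos, neg)]
    # Negative feature weights are clipped to zero.
    return [max(0.0, w) for w in updated]


# Hypothetical feature vectors for one query and three judged results.
query = [1.0, 0.0, 0.0]
relevant = [[0.8, 0.6, 0.0], [0.6, 0.8, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, relevant, non_relevant)
```

After one round of feedback the query gains weight on the second feature, which the user's relevant results share, while the feature that appears only in the rejected result stays at zero.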
Table 1. Review table for web content mining.
Author | Process | Method/techniques | Applications | Data sources | Software
Zhang et al.55 | 1–2–3–4 | Relevance feedback algorithm | Content-based image retrieval | | A prototype system of content-based web image search
Lau et al.26 | 1–2–3–4–5 | Keywords search | Homepage analysis | 6173 students’ homepages | Search engine
Chen et al.5 | 1–3–5 | Clustering, categorization, web structure, and summarization techniques | Semantic Virtual Document (SVD) | | Prototype intelligent Search And Review of Cluster Hierarchy (iSEARCH)
Liu29 | 1–2–3–4 | Correlation mining, clustering, machine learning, partial tree alignment | Web query interface integration (e.g. travelocity.com); opinion mining | Structured and unstructured; CompUSA.com |
Darmont et al.11 | 1–2 | Transforming multiform data into a unified format | Warehousing web data | | Java prototype
Graves et al.16 | 1–2 | Earth Science Markup Language (ESML) | Satellite imagery | NASA Goddard Earth Sciences Data | Algorithm Development and Mining Toolkit (ADaM)
Lau et al.26 use powerful search engines to mine 6173 students’ personal homepages and convert unstructured, self-revealed text information into business intelligence stored in a DB. They construct a dictionary of 80,750 keywords/phrases for the web mining analysis in order to identify key characteristics of customers. Chen et al.5 propose the semantic virtual document (SVD) technique, which makes use of the web structure together with clustering to represent knowledge in web documents. This technique allows automatic content-based categorization of web documents as well as a tree-like graphical user interface for browsing post-retrieval documents. They also introduce a cluster-biased automatic query expansion technique to interpret short queries accurately. They present a prototype of Intelligent Search and Review of Cluster Hierarchy (iSEARCH) for web content mining. Liu28,29 distinguishes simply between the three types of web mining by noting that web usage mining discovers user access patterns from usage logs; web structure mining discovers knowledge from hyperlinks; and web content mining mines knowledge from page contents. Liu29 presents a webcast solely on web content mining in which he focuses on structured data extraction, information integration, and IE from unstructured text, such as “opinion mining” of customer-written comments. Darmont et al.11 propose a modeling process for warehousing heterogeneous web data and design a Java prototype that transforms multiform data into Extensible Markup Language (XML) documents. Good data preparation improves the performance of data mining algorithms. Graves et al.16 undertake a project to create data mining web services designed specifically for science data using the Algorithm Development and Mining (ADaM) toolkit. They use earth data from the National Aeronautics and Space Administration (NASA) Goddard Research Center to explore pattern recognition, image processing, and data preparation algorithms.
These algorithms are geared toward satellite imagery data. They use the Earth Science Markup Language (ESML) to handle heterogeneous data formats seamlessly.
2.2. Literature review for web usage mining
Web usage mining discovers user access patterns from web usage logs. All web site visitor actions can be logged as web log files on web servers for user behavior analysis (see Table 2). Many web log analysis tools provide statistical information such as page popularity (the number of times a page has been visited). Web usage mining helps in reorganizing the web site for fast and easy customer access, improving links and navigation, attracting more advertisement capital through intelligent adverts, turning viewers into customers through better site architecture, and monitoring the efficiency of the web site.37
Table 2. Review table for web usage mining.
Author | Process | Method/techniques | Applications | Data sources | Software
Mobasher et al.35,36 | 1–2–3–4 | Association rule hypergraph partitioning; clustering | Web usage mining for automatic personalization | Acr-news.org | WebPersonalizer
Spiliopoulou46 | 1–2–3–4–5 | Sequence mining | Web usage mining for better web evaluation and design | | Web utilization miner (WUM), Mining internet data for associative sequences (MiDAS)
Srivastava et al.48 | 1–2–3 | Association rules, classification, clustering, sequential patterns, dependency modeling | Personalization, system improvement, site modification, business intelligence, usage characterization | Server log, proxy log, client log | WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi
Joshi21,22 | 1–2–3–4 | Fuzzy clustering techniques | Personalization | Weblog data |
Clendaniel6 | | | Analyze user behavior and respond better for higher profit, e.g. sweepstakes or web promotions | |
Abraham and Ramos1 | 1–2–3–4 | Ant clustering algorithm, linear genetic programming approach | Web usage patterns for e-commerce | Weblog data |
Cooley9 | 1–2–3 | Pattern analysis | Identify subjectively interesting web usage patterns | Web structure and content from a large e-commerce site |
Fenstermacher and Ginsburg14 | 1–2–3–4 | Clustering, nearest neighbor | Client-side monitoring for web mining | Client-side data |
Pierrakos et al.39 | 1–2–3–4–5 | Clustering, classification, sequential patterns | Web usage mining for personalization | Server log files, cookies, explicit user input, client-side data | SETA, Tellim, Oracle9iAS, Netmind, SiteHelper, WUM
Song and Shepperd45 | 1–2–3–4–5 | Vector analysis and fuzzy set theory | Web browsing patterns (e.g. web user clustering, web page clustering, and frequent access path recognition) for e-Commerce | Log data of the web site for Xi’an Jiaotong University |
Pabarskaite and Raudys37 | 1–2–3–4–5 | Clustering, classification, association, sequential rules, OLAP | Web log and customer data mining; large-scale web log mining | Web log files; and log formats of URLs obtained using yahoo.com searches | List of web mining commercial software and free ware of WUM and Analog
Mobasher et al.35,36 discuss automatic web personalization based on web usage mining. The general architecture includes data preparation, usage mining, and online recommendation processes. In the data preparation phase, site files and server logs are cleaned, and sessions, pageviews, and episodes are identified with filtering support. In the usage mining phase, association rule and clustering algorithms are used to discover usage profiles and frequent itemsets. Finally, personalized recommendations are generated. They experiment with the WebPersonalizer system using the site for the newsletter of the Association for Consumer Research (acrnews.org). Spiliopoulou46 discusses web usage mining for web site evaluation and design. To evaluate a site, she suggests that four steps be followed: (1) formulate the problem, (2) prepare the web log for analysis, (3) discover navigation patterns using a sequence miner, and (4) visualize the results. She compares two software systems, Web Utilization Miner (WUM) and Mining Internet Data for Associative Sequences (MiDAS). The major difference between these two systems is that, in MiDAS, a navigation pattern is a sequence of events, while WUM has been extended to include both the sequence of events and a tree composed of the routes connecting those events. Srivastava et al.48 define web usage mining as the application of data mining techniques to discover usage patterns from web data to better serve the needs of web applications; it includes three phases: preprocessing, pattern discovery, and pattern analysis. The usage data can be collected from different sources such as web server logs, client-side data, and web proxy caches. Data mining analyses such as association rules, classification, clustering, sequential patterns, and dependency modeling can be used for personalization, system improvement, site modification, business intelligence, and usage characterization. A prototypical web usage mining system, the Web Site Information Filter System (WebSIFT), is introduced. Joshi21,22 claims that web mining can be said to have three data mining operations: clustering, associations, and sequential analysis.
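The data preparation and usage mining phases described above can be illustrated with a small sketch that groups log requests into sessions and then derives simple page-to-page association rules. This is not the algorithm of WebPersonalizer or of any other surveyed system; the 30-minute idle timeout, the support threshold, the function names, and the sample requests are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations


def sessionize(requests, timeout=1800):
    """Group (user, timestamp, url) requests into per-user sessions.

    A new session starts when a user is idle for more than `timeout` seconds.
    """
    sessions = []
    last_seen = {}  # user -> timestamp of that user's previous request
    current = {}    # user -> index of that user's open session
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions.append([])
            current[user] = len(sessions) - 1
        sessions[current[user]].append(url)
        last_seen[user] = ts
    return sessions


def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): confidence of a -> b} for page pairs meeting min_support."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for s in sessions:
        pages = sorted(set(s))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = {}
    for (a, b), c in pair_count.items():
        if c / n >= min_support:
            rules[(a, b)] = c / page_count[a]
            rules[(b, a)] = c / page_count[b]
    return rules


# Hypothetical cleaned log entries: (user, seconds since start, requested page).
requests = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 5000, "/home"),
    ("u2", 10, "/home"), ("u2", 50, "/products"), ("u2", 90, "/cart"),
]
sessions = sessionize(requests)
rules = pair_rules(sessions)
```

Here the long gap in the first user's activity splits it into two sessions, and only the page pair that co-occurs in at least half of all sessions yields rules.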
Joshi21,22 states that, for web mining, clustering means finding natural groups of users, pages, and so on; associations concern which URLs tend to be requested together; and sequential analysis concerns the order in which URLs tend to be accessed. Such analysis can be used for personalization. Clendaniel6 holds that some companies are generating about 1 gigabyte per day of customer behavior data, including every page the visitor viewed, the viewing sequence, and how long the visitor stayed. However, many companies use only the metrics of page hits and click-through rates, and the data are stored offline, unused, in massive system backups. Mining such web data can improve a firm’s profit. Abraham and Ramos1 propose using ant clustering algorithms to discover web usage patterns and a linear genetic programming approach to analyze visitors’ trends. Their results show that ant colony clustering performed well when compared with self-organizing maps (SOM) but was less efficient when compared with an evolutionary-fuzzy clustering approach. Cooley7 contends that web usage mining has grown in the past years in spite of the crash of the e-commerce bubble. He defines web usage mining as the application of data mining techniques to web click stream data to extract usage patterns. Based on a large e-commerce web site, he finds that the use of web structure and content
can help identify subjectively interesting web usage patterns. Cooley et al.9 describe WEBMINER, a system for web usage mining, and hold that web usage mining can help organizations determine the lifetime value of customers, marketing strategies across products, and the effectiveness of promotional campaigns. Fenstermacher and Ginsburg14 discuss client-side monitoring for web usage mining. They hold that only a fraction of users’ actions reach the web server, so that analysis and inference of client-side behavior from server-side data are inaccurate. They suggest that client-side data can be collected not only from web browsers but also from standard office productivity tools. As such, a much richer and more accurate picture of user behavior on the web can be painted. They suggest that clustering and nearest neighbor approaches can be used to analyze client-side data. Pierrakos et al.39 survey recent work in web usage mining as a tool for personalization. The data sources include server log files, cookies, explicit user input, and client-side data. Their mining techniques include clustering, classification, and sequential patterns. They describe web usage mining software such as SETA, Tellim, Oracle9iAS, Netmind, SiteHelper, and WUM. Cyrus and Banaei-Kashani10 argue that web usage mining is a crucial component of any efficacious web personalization system, and that to make online and anonymous web personalization effective, web usage mining must be realized in real time as accurately as possible. They propose a feature-matrices model to track and accurately classify users’ access patterns in real time. Song and Shepperd45 demonstrate that web user clustering, web page clustering, and frequent access path recognition can be used for marketing strategies, personalization, and web site adaptation. They view the topology of a web site as a directed graph and mine web browsing patterns for e-commerce.
They use vector analysis and fuzzy set theory to cluster users and URLs. Their frequent access path identification algorithm is not based on sequence mining. Pabarskaite and Raudys37 present a comprehensive overview of web log/usage mining based on over 100 research works. Their paper covers the description of web log data, web protocols, web servers, the most popular log formats, web log data preprocessing, web log mining/analysis, visualization and results, and web log mining software.
2.3. Literature review for web structure mining
Web structure mining uses the hyperlink structure of the web as an information source. The web may be viewed as a graph with the documents as nodes and the hyperlinks between them as edges. The graph view can be used for effective retrieval and classification. Furnkranz15 discusses exploiting the graph structure of the World Wide Web for improved retrieval performance and classification accuracy. Many search engines use graph properties in ranking their query results. He shows that the information
of predecessor pages (i.e. pages that have a hyperlink pointing to the target page) can be used for enhancing text classification performance. Chakrabarti3 authors a text on mining the web by discovering knowledge from hypertext data that includes techniques such as network analysis and machine learning. To help users search for information and organize information layout, Smith and Ng44 suggest using a SOM to mine web data and provide a visual tool to assist user navigation. Based on users’ navigation behavior, they develop LOGSOM, a system that utilizes a SOM to organize web pages into a two-dimensional map. The map provides a meaningful navigation tool and serves as a visual tool for better understanding the structure of the web site and the navigation behaviors of web users. Fang and Sheng13 address the design of the portal page of a web site. They try to maximize the efficiency, effectiveness, and usage of a web site’s portal page by selecting a limited number of hyperlinks from a large set for inclusion in the portal page. Based on relationships among hyperlinks (i.e. structural relationships that can be extracted from a web site and access relationships that can be discovered from a web log), they propose a heuristic approach to hyperlink selection called LinkSelector (Table 3). Instead of clustering user navigation patterns by means of a Euclidean distance measure, Hay et al.20 use the Sequence Alignment Method (SAM) to partition users into clusters according to the order in which web pages are requested and the different lengths of the clustering sequences. They validate SAM by means of user-traffic data from two different web sites, and the results show that SAM identifies sequences with similar behavioral patterns.
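The contrast drawn above between order-aware comparison and a Euclidean distance measure can be illustrated with a plain edit distance over page-request sequences. This is a simplification used only to show why request order matters; it is not the scoring scheme of Hay et al., and the page names and the function name `alignment_distance` are hypothetical.

```python
def alignment_distance(a, b):
    """Levenshtein distance between two page-request sequences."""
    prev = list(range(len(b) + 1))
    for i, page_a in enumerate(a, start=1):
        curr = [i]
        for j, page_b in enumerate(b, start=1):
            cost = 0 if page_a == page_b else 1
            curr.append(min(prev[j] + 1,          # delete page_a
                            curr[j - 1] + 1,      # insert page_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical navigation sequences of three visitors.
s1 = ["home", "products", "cart"]
s2 = ["home", "cart"]
s3 = ["cart", "products", "home"]

d12 = alignment_distance(s1, s2)
d13 = alignment_distance(s1, s3)
```

The first and third visitors request exactly the same set of pages, so a comparison based only on which pages were visited would treat them as identical, whereas the order-aware distance places the first visitor closer to the second.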
To meet the need for an evolving and organized method to store references to web objects, Guan and McMullen18 design a new bookmark structure that allows individuals or groups to access bookmarks from anywhere on the Internet using a Java-enabled web browser. They propose a prototype that includes more features, such as the URL, the document type, the document title, keywords, date added, date last visited, and date last modified, as bookmarks are shared among groups of users. Song and Shepperd45 view the topology of a web site as a directed graph and mine web browsing patterns for e-commerce. They use vector analysis and fuzzy set theory to cluster users and URLs. Their frequent access path identification algorithm is not based on sequence mining.
Table 3. Review table for web structure mining.
Author | Process | Method/techniques | Applications | Data sources | Software
Furnkranz15 | 1–3–4 | Graph theory | Hyperlink in WWW | |
Chakrabarti3 | 1–3–5 | Classification, clustering, social network analysis, latent semantic indexing, machine learning | Hypertext link | |
Smith and Ng44 | 1–3–4–5 | Clustering, self-organized map | Mapping user navigation patterns | | LOGSOM
Fang and Sheng13 | 1–3–4–5 | Heuristic approach | Hyperlink selection for portal page | Data from University of Arizona web site | LinkSelector
Hay et al.20 | 1–2–3–4 | Sequence Alignment Method (SAM) | Mining navigation patterns | User-traffic data of two different web sites |
Guan and McMullen18 | 1–2–3–4 | Design bookmark structure | Bookmark | Group of users | Prototype BookMark system
Song and Shepperd45 | 1–3–4 | Frequent access path identification algorithm, fuzzy set theory | Mining web browsing patterns for e-commerce | Five real-world data sets |
2.4. Literature review for web mining survey
As stated previously, the starting point of our research is a paper by Kosala and Blockeel,25 who survey research in the area of web mining and suggest the three web mining categories of web content, web structure, and web usage that are also used in this paper. Han and Chang19 author a paper on data mining for web intelligence that claims that “incorporating data semantics could substantially enhance the quality of keyword-based searches,” and indicate research problems that must be solved to use data mining effectively in developing web intelligence. These problems include mining web search-engine data, analyzing the web’s link structure, classifying web documents automatically, mining web page semantic structures and page contents, and mining web dynamics. Web dynamics is the study of how the web changes in the context of its contents, structure, and access patterns (Table 4). Barsagade2 provides a survey paper on web mining usage and pattern discovery. Chau et al.4 discuss personalized multilingual web content mining. Kolari and Joshi24 provide an overview of past and current work in the three main areas of web mining research (content, structure, and usage) as well as emerging work in semantic web mining. Scime41 edits a “Special Issue on Web Content Mining” of the Journal of Intelligent Information Systems (JIIS). Scime42 also authors a book on web mining applications and techniques that include web content, structure, and usage mining such as personalization, e-mail, and usenet. Zanasi et al.54 discuss data, text, and web mining and their business applications. According to Greening,17 “data-mining algorithms overlap in the problems they can solve, but for a given problem there’s usually a ‘best algorithm.’” Greening17 also claims, “The world of web data mining is simultaneously a minefield and a gold mine. By saving data associated with visitors, content, and interactions, you can at least ensure you’ll be able to use it later. Despite the difficulties, you might consider evaluating and incorporating data-mining applications now. The sooner you start learning from your data, the sooner you can leave your competitors in the dust.” Liu30 authors a text on web mining that includes a substantial part on web mining foundations. Liu30 discusses IR and web search, link analysis, web crawling, structured data extraction, information integration, opinion mining, and web usage mining. Markov and Larose32 author a report for Wiley & Sons Publishers on uncovering patterns in web content, web structure, and web usage by data mining the web.
Table 4. Review table for web mining survey.
Author | Method/techniques | Applications
Kosala and Blockeel25 | Machine learning, information retrieval, natural language processing, information extraction | Web content, structure, and usage mining; information integration, web warehouse
Han and Chang19 | | Mining web search engine data; analyzing web’s link and semantic structures
Barsagade2 | Web usage survey |
Chau et al.4 | Personalized multilingual web content mining |
Kolari and Joshi24 | Semantic web mining | Web content, structure, and usage mining
Liu and Chang31 | | Web information integration; concept hierarchies, segmenting web pages; opinion mining
Scime41 | | JIIS (Special Issue on web content mining), Vol. 22, No. 3, May 2004
Scime42 | | Web content, structure, and usage mining; personalization; e-mail and usenet
Zanasi et al.54 | | Data, text, and web mining, and their business applications
Greening17 | Personalization, association, clustering, decision trees |
Liu30 | Association rules and sequential patterns, link analysis, wrapper generation, web crawling | Web content, structure, and usage mining; opinion mining, information integration
Markov and Larose32 | | A report on web content, structure, and usage mining
2.5. Literature review for semantic web
The semantic web is an extension of the web in which information is given precise meaning and can be understood and processed by machines.23 It operates on the principle of shared data. According to Palmer,38 one can think of the semantic web as being an efficient way of representing data on the World Wide Web (WWW) or as a globally linked DB.
According to Wikipedia,52 the semantic web makes “it possible for the web to understand and satisfy the requests of people and machines to use the web content.” The definition of the semantic web has been formulated by the Semantic Web Agreement Group (SWAG)43 as follows: “The Semantic Web is a web that includes documents, or portions of documents, describing explicit relationships between things and containing semantic information intended for automated processing by our machines.” In essence, the semantic web is a web with a meaning. According to W3Schools,49 “the Semantic Web is a web that is able to describe things in a way that computers can understand.” Essential to the semantic web are ontologies, which give a shared understanding of a domain to facilitate communication among humans and software agents (Kanellopoulos and Kotsiantis, 2007). As W3Schools49 explains, “The Semantic Web is not about links between web pages”, but rather about “describing the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price).”
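The notion of describing relationships between things and properties of things, as quoted above, can be sketched as a small set of subject-predicate-object statements that a program can query. This is only an illustrative data structure and not an implementation of any semantic web standard; every entity name, predicate name, and value below is hypothetical.

```python
# Each statement is a (subject, predicate, object) triple; all names are hypothetical.
triples = {
    ("wheel", "part_of", "car"),
    ("engine", "part_of", "car"),
    ("car", "has_price", 20000),
    ("alice", "member_of", "staff"),
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """Return every subject s such that (s, predicate, obj) is stated."""
    return {s for s, p, o in triples if p == predicate and o == obj}


car_parts = subjects("part_of", "car")
car_price = objects("car", "has_price")
```

Because each statement names its relationship explicitly, a program can answer a question such as "what is part of the car?" without interpreting free text.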
3. Web Mining Software and Demonstrations
Below are selected software packages that use some of the types of web mining. They are discussed in the same order as the respective tables above for the corresponding type of web mining demonstrated. The software discussed below was selected because its vendors were willing to provide the authors with free trial versions and generous professional guidance through conference calls.
3.1. Megaputer PolyAnalyst
Megaputer PolyAnalyst is an enterprise analytical system that integrates web mining together with data and text mining because it does not have a separate module for web mining. Web pages or sites can be input directly to Megaputer PolyAnalyst as data source nodes. Megaputer PolyAnalyst has the standard data and text mining functionalities such as categorization, clustering, prediction, link analysis, keyword and entity extraction, pattern discovery, and anomaly detection. These functional nodes can be directly connected to the web data source node to perform web mining analysis. The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis scenarios without loading data into the system, thus saving the analyst’s time. According to Megaputer,33 whatever data sources are used, PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can load data from disparate data sources including all popular DBs, statistical systems, and spreadsheet systems. In addition, it can load collections of documents in html, doc, pdf, and txt formats, as well as load data from an Internet web source. PolyAnalyst offers visual “on-the-fly integration” and merging of data coming from disparate sources to create data marts for further analysis. It supports incremental data appending and referencing of data sets in previously created PolyAnalyst projects.
Figures 1–12 are screen shots illustrating applications of Megaputer PolyAnalyst for web mining to available data sets. Figure 1 shows the PolyAnalyst workspace for a hypothetical company. Figure 2 shows the customer feedback DB of Megaputer Inc., the manufacturer of PolyAnalyst, which is the basis of the web mining performed in Figs. 3–6. Figure 3 shows an on-line analytical processing (OLAP) table with multiple dimensions. Figure 4 shows a dimension matrix with multiple dimensions for structured and unstructured data. Figure 5 shows a PolyAnalyst link diagram of these data, whose major nodes include “performance,” “work,” “group,” and “customer,” and Fig. 6 shows text clustering of the same data. Based on the undergraduate admission web page of Arkansas State University (ASU), Fig. 7 shows a keyword extraction report, Fig. 8 shows a web page taxonomy, Fig. 9 shows data merging of two web sites, one from ASU and one from Indiana University (IU) in Bloomington, Indiana,
Fig. 1. PolyAnalyst workspace with Internet data source.
Fig. 2. PolyAnalyst with customer feedback data of Megaputer Inc.
Fig. 3. OLAP table with multiple dimensions.
Fig. 4. Dimension matrix with multiple dimensions for structured and unstructured data.
Fig. 5. Link diagram.
Fig. 6. Text clustering.
Fig. 7. Keyword extraction report.
Fig. 8. Web page taxonomy.
Fig. 9. Data merging of two web sites.
Fig. 10. Usage data from firewall log.
Fig. 11. Dimension matrix of firewall log data.
Fig. 12. Histogram of subnet A.
and Fig. 10 shows usage data from a firewall log. Figure 11 shows a dimension matrix of the firewall log data, and Fig. 12 shows a histogram of a subnet.

3.2. SPSS Clementine

“Web Mining for Clementine is an add-on module that makes it easy for analysts to perform ad hoc predictive Web analysis within Clementine’s intuitive visual workflow interface.” Web Mining for Clementine combines web analytics and data mining with SPSS analytical capabilities to transform raw web data into “actionable insights.” It enables business decision makers to take more effective actions in real time. As examples, SPSS [47] cites automatically discovering user segments, detecting the most significant sequences, understanding product and content affinities, and predicting user intention to convert, buy, or churn. SPSS [47] claims four key data mining capabilities: segmentation, sequence detection, affinity analysis, and propensity modeling. Specifically, SPSS [47] lists six web analysis application modules within SPSS Clementine: search engine optimization, automated user and visit segmentation, web site activity and user behavior analysis, home page activity, activity sequence analysis, and propensity analysis. Unlike other web mining platforms, which provide only simple frequency counts (e.g. number of visits, ad hits, top pages, total purchase visits, and top click streams), SPSS Clementine [47] provides more meaningful customer intelligence, such as likelihood to convert by individual visitor, likelihood to respond by individual prospect, content clusters by customer value, missed cross-sell opportunities, and event sequences by outcome.

Figures 13–20 are screen shots illustrating applications of SPSS Clementine for web mining to available data sets. Figure 13 shows the SPSS Clementine workspace with 251,998 records and seven fields extracted from a web log file. Figure 14 demonstrates the defining window of user modes and the user modes for field clusters of web data.
The user modes include research mode, shopping mode, search mode, evaluation mode, and so on. Visit segments for web data are shown in Fig. 15. Figure 16 shows a link diagram for web data by campaign, gender, age, and income. Figures 17 and 18 show web data for different campaigns and classifier results using different model types (e.g. CHAID, logistic, neural net). Figure 19 shows decision tree results and decision rules for determining clusters of web data. The lift diagrams for the training and testing data sets are compared in Fig. 20.

3.3. ClickTracks by web analytics

ClickTracks by web analytics is a web metrics tool that makes online behavior visible. Unlike other web statistics tools, ClickTracks shows information to the user in context. ClickTracks shows where visitors go and what motivates them to take the paths they take. According to the ClickTracks web site, ClickTracks unites
Fig. 13. SPSS Clementine workspace with web data extracted with 251,998 records and seven fields.
Fig. 14. User modes for field clusters of web data.
Fig. 15. Visit segments for web data.
Fig. 16. Link diagram for web data by campaign, gender, age, and income.
Fig. 17. Web data for Campaign C.
Fig. 18. Classifier results.
Fig. 19. Decision tree results and decision rules for determining clusters.
Fig. 20. Comparison of lift diagrams for training and testing data sets.
visitor information with the web site and lets web site owners know how people get to their sites, where they click, and where they exit. According to the “Navigation Report” movie posted on the WebAnalytics web site, “Using the ClickTracks Navigation report, you can easily see how visitors move through your site. ... ClickTracks ... shows the percentage of visitors who clicked on each link. At a glance you can see how visitors react to each page, what interests them the most, and what links are not attracting any attention at all. ... The Page Analysis frame gives detailed information about visits to each page you view, such as the average time spent on the page, and the exit rate. ... The Path View at the bottom shows, for each page, what internal and external pages visitors came from and also where they click next in order of popularity.”

According to WebAnalytics [50], ClickTracks lets users learn more about buyers and thus gives insight into how to turn more web site visitors into buyers. ClickTracks lets the user view buyers from many different aspects, such as identifying their entry points, the paths they take, and the things they do on the way to the checkout. ClickTracks thus gives the user valuable information that he or she can put into action.

Figures 21–26 are screen shots illustrating applications of ClickTracks for web mining to available data sets. Figure 21 shows Bob’s Fruitsite used as the data source. Figure 22 shows visitor statistics and the plot of visitors and visitors cost. Top referrers and pages with most visitors are shown in Fig. 23. Figure 24 shows search keywords and the popular search engines used, and statistical results for the Google search engine are shown in Fig. 25. Figure 26 shows the path view for the item “Apple” and the user’s click sequence results.

3.4. QL2 by QL2 Software, Inc.

QL2 is web data extraction software.
According to QL2 [40], the software completely automates the process of extracting information from any web site, even if the data are behind a firewall, a subscription log-in, or a search form. It deploys intelligent agents to fetch information from the web automatically. These intelligent agents navigate complex web sites, log in to subscription and password-protected sites, fill out forms, and input specific criteria to generate dynamic web pages. Intelligent agents can reach any content that a web browser can reach, in a fully automated fashion. QL2 software extracts the data regardless of format: Word documents, e-mails, spreadsheets, DBs, PowerPoint files, HTML, images, and PDFs. QL2 can also add structure to the data it collects and output the information in an actionable format such as a spreadsheet, DB, or XML feed; hence, data can be sorted, filtered, and queried with ease. QL2 software can extract data/information from both the World Wide Web and unstructured documents, and integrate it into business intelligence in real
Fig. 21. Bob’s Fruitsite as data source.
Fig. 22. Visitor statistics and plot of visitors and visitors cost.
Fig. 23. Top referrers and pages with most visitors.
Fig. 24. Search keywords vs popular search engines used.
Fig. 25. Statistics results for using Google search engine.
Fig. 26. Path view for item “Apple” and click sequence results.
Fig. 27. Workspace for QL2.
time for a 360° view of the business and the market. QL2 can be used to automatically mine competitor web sites, online catalogs, news feeds, and regulatory filings, or to extract data from PDFs, PowerPoint presentations, Word docs, and e-mail archives.

Figures 27–30 are screen shots illustrating applications of QL2 for web mining to available data sets. Figure 27 shows a screen shot of the QL2 workspace, and Fig. 28 shows QL2 with the finance web link of the Yahoo web site. Figure 29 shows a screen shot of the expanded inner queries for Best Buy data. Figure 30 shows data extracted from Best Buy and output as a Microsoft Excel file.

4. Conclusions

This paper has provided a more current evaluation and update of web mining research and techniques available. Extensive literature has been reviewed based on three types of web mining, namely web content mining, web usage mining, and
Fig. 28. QL2 with www.yahoo.com data.
web structure mining. For each tabulated research work, we have examined such key issues as web mining process, methods/techniques, applications, data sources, and software used. Unlike previous investigators, we divide the web mining process into the following five subtasks: (1) resource finding and retrieving, (2) information selection and preprocessing, (3) patterns analysis and recognition, (4) validation and interpretation, and (5) visualization. This paper helps researchers and practitioners accumulate knowledge in the field of web mining effectively and speed its further development. This paper has also reported comparisons and summaries of selected software for web mining. The web mining software packages selected for discussion and comparison in this paper are SPSS Clementine, Megaputer PolyAnalyst, ClickTracks by web analytics, and QL2 by QL2 Software Inc. Applications of these packages to available data sets are discussed together with numerous screen shots.
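The five subtasks can be read as stages of a pipeline, each consuming the output of the previous one. The sketch below is a toy illustration of that reading and not an implementation taken from any surveyed work or software package; the in-memory pages, the term-frequency "pattern," the `min_support` threshold, and all function names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical in-memory "web": URL -> page text (stands in for real retrieval).
PAGES = {
    "http://example.org/a": "Web mining finds patterns in web data",
    "http://example.org/b": "Usage mining studies web logs",
    "http://example.org/c": "",
}


def find_resources(pages):
    """(1) Resource finding and retrieving: fetch the raw documents."""
    return list(pages.items())


def select_and_preprocess(documents):
    """(2) Information selection and preprocessing: drop empty pages, tokenize."""
    return {url: text.lower().split() for url, text in documents if text.strip()}


def analyze_patterns(tokenized):
    """(3) Pattern analysis and recognition: count term frequencies."""
    counts = Counter()
    for tokens in tokenized.values():
        counts.update(tokens)
    return counts


def validate(counts, min_support=2):
    """(4) Validation and interpretation: keep terms seen at least min_support times."""
    return {term: n for term, n in counts.items() if n >= min_support}


def visualize(patterns):
    """(5) Visualization: render a text bar chart, most frequent term first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{term:<8}{'#' * n}" for term, n in ordered)


def run_pipeline(pages):
    """Chain the five stages and return the kept patterns and their chart."""
    docs = find_resources(pages)
    tokens = select_and_preprocess(docs)
    counts = analyze_patterns(tokens)
    patterns = validate(counts)
    return patterns, visualize(patterns)


if __name__ == "__main__":
    kept, chart = run_pipeline(PAGES)
    print(chart)
```

Keeping each subtask as a separate function means a stage can be swapped, for example replacing term counting with clustering or sequence analysis in stage (3), without touching the others.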
Fig. 29. Inner query for Best Buy data.
The future directions of our research include a more in-depth analysis of the applications of both heuristic and analytical algorithms, and the investigation of other web mining software and of additional data sets available on the web or elsewhere. The techniques would not be limited to clustering, classification, association, and sequence analysis but would also include others as applied to IR. Another future direction would be a review or comparison of current research in which web mining overlaps with text and/or data mining. We also plan to investigate new or additional web mining software, as well as software available for Semantic Web applications, such as those in bioinformatics, in which biological data and knowledge bases are interconnected. Our future research would also include applications of intelligent personal assistants or intelligent software agents that automatically accumulate and classify suitable information based on user preferences.
Fig. 30. Extracted data from BestBuy.com outputted as Excel file.
Acknowledgments

The authors would like to acknowledge funding for this research from the 2007 Summer Faculty Research Grant awarded to both authors by the College of Business at Arkansas State University (ASU). The authors would also like to acknowledge the generosity and kindness of the personnel who graciously provided software and technical assistance for SPSS Clementine, Megaputer PolyAnalyst, Web Analytics ClickTracks, and QL2 by QL2 Software Inc.

References

1. L. Abraham and V. Ramos, Web usage mining using artificial ant colony clustering and genetic programming, in CEC’03: Congress on Evolutionary Computation (IEEE Press, Canberra, Australia, 2003), pp. 1384–1391. ISBN 078-0378-04-0.
2. N. Barsagade, Web usage mining and pattern discovery: A survey paper, Computer Science and Engineering Dept., CSE Tech Report 8331 (Southern Methodist University, Dallas, Texas, USA, 2003).
3. S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data (Elsevier Science & Technology Books, 2003), ISBN-13: 9781558607545.
4. R. Chau, C. Yeh and K. Smith, Personalized multilingual web content mining, KES (2004), pp. 155–163.
5. L. Chen, W. Lian and W. Chue, Using web structure and summarization techniques for web content mining, Inform. Process. Management: Int. J. 41(5) (2005) 1225–1242.
6. T. S. Clendaniel, Profitability and mining web data: Avoiding the path to red ink, The Data Administration Newsletter (R.S. Seiner, Publisher, 2002), http://www.tdan.com/i019fe02.htm.
7. R. Cooley, The use of web structure and content to identify subjectively interesting web usage patterns, ACM Trans. Internet Tech. 3(2) (2003) 93–116.
8. R. Cooley, B. Mobasher and J. Srivastava, Web mining: Information and pattern discovery on the world wide web, ICTAI (1997).
9. R. Cooley, B. Mobasher and J. Srivastava, Data preparation for mining worldwide browsing patterns (2003), http://maya.cs.depaul.edu/~classes/ect584/papers/cmskais.pdf.
10. S. Cyrus and F. Banaei-Kashani, Efficient and anonymous web usage mining for web personalization, INFORMS J. Comput. Special Issue on Data Mining 15(2) (2003) 123–147.
11. J. Darmont, O. Boussaid and F. Bentayeb, Warehousing Web Data (2007), http://www.arxiv.org/ftp/arxiv/papers/0705/0705.1456.pdf.
12. O. Etzioni, The World Wide Web: Quagmire or gold mine, Commun. ACM 39(11) (1996) 65–68.
13. X. Fang and O. Sheng, LinkSelector: A web mining approach to hyperlink selection for web portals, ACM Trans. Internet Tech. 4(2) (2004) 209–237.
14. K. Fenstermacher and M. Ginsburg, Client-side monitoring for web mining, J. Am. Soc. Inform. Sci. Tech. 54(7) (2003) 625–637.
15. J. Fürnkranz, Web structure mining: Exploiting the graph structure of the world-wide web, ÖGAI-J. 21(2) (2002) 17–26.
16. S. Graves, R. Ramachandran, K. Keiser, M. Maskey and C. Lynnes, Deployable suite of data mining web services for online science data repositories, in 23rd Conf. IIPS, 87th American Meteorological Society Annual Meeting (San Antonio, TX, 13–18 January 2007).
17. D. Greening, Data mining on the web: There’s gold in that mountain of data (2006), http://www.webtechniques.com/archives/2000/01/greening/.
18. S. Guan and P. McMullen, Organizing information on the next generation web: design and implementation of a new bookmark structure, Int. J. Inform. Technol. Decision Making 4(1) (2005) 97–115.
19. J. Han and C. Chang, Data mining for web intelligence, Computer (November 2002), pp. 54–60, http://www-faculty.cs.uiuc.edu/~hanj/pdf/computer02.pdf.
20. B. Hay, G. Wets and K. Vanhoof, Mining navigation patterns using a sequence alignment method, Knowledge Inform. Syst. 6(2) (2004) 150–163.
21. A. Joshi, Web mining (2001), www.cs.umbc.edu/~ajoshi/web mine.
22. A. Joshi, Web/data mining and personalization, University of Maryland Baltimore County (UMBC) eBiquity Research Area (2001), http://ebiquity.umbc.edu/project/html/id/17/Web-Data-Mining-and-Personalization.
23. D. Kanellopoulos and S. Kotsiantis, Semantic web: A state of the art survey, Int. Rev. Comput. Software 3(1) (2001) 428–442.
24. P. Kolari and A. Joshi, Web mining: Research and practice, Comput. Sci. Eng. July/August (2004) 42–53.
25. R. Kosala and H. Blockeel, Web mining research: A survey, ACM SIGKDD Explor. 2 (2000) 1–15.
26. K. Lau, K. Lee, Y. Ho and P. Lam, Mining the web for business intelligence; homepage analysis in the Internet era, J. Database Marketing Customer Strategy Management 12(1) (2004) 32–54.
27. J. W. Liang, Introduction to text and web mining, Seminar at North Carolina Technical University (2003), www.database.cis.nctu.edu.tw/seminars/2003F/TWM/slides/p.ppt.
28. B. Liu, Web content mining (2005), http://www.cs.uic.edu/~liub/WebContentMining.html.
29. C. Liu, Web content mining (29 November 2006), ACM SIGKDD Webcast.
30. B. Liu, Web Data Mining: Exploring Hyperlinks, Contents and Usage Data (Springer Verlag Press, 2007), ISBN-13: 978-3-540-37881-5.
31. B. Liu and K. Chang, Editorial: Special issue on web content mining, SIGKDD Explorations 6(2) (2004) 1–4.
32. Z. Markov and D. T. Larose, New Report “Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage” Features Web Structure Mining, Web Content Mining and Web Usage Mining (John Wiley and Sons, 25 May 2007), http://biz.yahoo.com/bw/070525/20070525005213.html?.v=1.
33. Megaputer Intelligence Inc., WebAnalyst architecture (2007), http://www.megputer.com/products/wa/architechture.php3.
34. B. Mobasher, Web data mining for business intelligence, ECT 584, DePaul University, Chicago, IL (2007), http://maya.cs.depaul.edu/~classes/ect584/papers/mobasher.pdf.
35. B. Mobasher, R. Cooley and J. Srivastava, Automatic personalization based on web usage mining, Commun. ACM 43(8) (2000) 142–151.
36. B. Mobasher, H. Dai, T. Luo, S. Yuqing and J. Zhu, Integrating web usage and content mining for more effective personalization, EC-Web (2000).
37. Z. Pabarskaite and A. Raudys, A process of knowledge discovery from web log data: Systematization and critical review, J. Intell. Inform. Syst. 28(1) (2007) 79–104.
38. S. Palmer, The semantic web: An introduction (2001), http://infomesh.net/2001/swintro/.
39. D. Pierrakos, G. Paliouras, C. Papatheodorou and C. Spyropoulos, Web usage mining as a tool for personalization: A survey, User Model. User-Adapt. Interact. 13(4) (2003) 311–372.
40. QL2 software (2007), http://www.ql2.com, viewed 5 June 2007.
41. A. Scime, Guest Editor’s Introduction: Special Issue on Web Content Mining, J. Intell. Inform. Syst. 22(3) (2004) 211–213.
42. A. Scime, Web Mining: Applications and Techniques (Idea Group Publishing, Hershey, PA, 2005), ISBN: 1591404142.
43. Semantic Web Agreement Group, What is the semantic web? (2001), http://swag.webns.net/WhatIsSW.
44. K. A. Smith and A. Ng, Web page clustering using a self-organizing map of user navigation patterns, Decision Support Syst. 35(2) (2003) 245–256.
45. Q. Song and M. Shepperd, Mining web browsing patterns for e-commerce, Comput. Indus. 57(7) (2006) 622–630.
46. M. Spiliopoulou, Web usage mining for web site evaluation, Commun. ACM 43(8) (2000) 127–134.
47. SPSS, Web mining for Clementine (2007), http://www.spss.com/web mining for clementine, viewed 16 May 2007.
48. J. Srivastava, R. Cooley, M. Deshpande and P. Tan, Web usage mining: Discovery and applications of usage patterns from web data, SIGKDD Explor. 1(2) (2000) 12–23.
49. W3Schools, Semantic web tutorial (2008), http://www.w3schools.com/semweb/default.asp.
50. Web Analytics (2007), http://www.clicktracks.com/, viewed 25 October 2007.
51. Wikipedia, Web mining (2007), http://en.wikipedia.org/wiki/Web_mining.
52. Wikipedia, Semantic web (2008), http://en.wikipedia.org/wiki/Semantic_Web.
53. Q. Yang and X. Wu, 10 challenging problems in data mining research, Int. J. Inform. Technol. Decision Making 5(4) (2006) 597–604.
54. A. Zanasi, S. Temis, C. A. Brebbia and N. F. Ebecken, Data mining VII: Data, text and web mining and their business applications (Data Mining and Information Engineering 2006), Transactions on Information and Communication Technologies, Vol. 37 (2006), ISBN 1-84564-178-7.
55. H. Zhang, Z. Chen, M. Li and Z. Su, Relevance feedback and learning in content-based image search, World Wide Web 6(2) (2003) 131–155.
56. N. Zhong, Impending brain informatics research from web intelligence, Int. J. Inform. Technol. Decision Making 5(4) (2006) 713–727.
Additional Readings

57. G. Adomavicius and A. Tuzhilin, Expert-driven validation of rule-based user models in personalization applications, Data Mining Knowledge Discov. 5 (2001) 33–58.
58. R. Amarasiri and D. Alahakoon, Building a cluster of intelligent, adaptive web sites, Neural Comput. Appl. 13 (2004) 149–156.
59. R. Boncella, Competitive intelligence and the web, Commun. Assoc. Inform. Syst. 12 (2005) 327–340.
60. M. Ceci and D. Malerba, Classifying web documents in a hierarchy of categories: A comprehensive study, J. Intell. Inform. Syst. 28 (2007) 37–78.
61. M. Chau, D. Zeng, H. Chen, M. Huang and D. Hendriawan, Design and evaluation of a multi-agent collaborative web mining system, Decision Support Syst. 35 (2003) 167–183.
62. R. Chau and C. Yeh, Filtering multilingual web content using fuzzy logic and self-organizing maps, Neural Comput. Appl. 13 (2004) 140–148.
63. Z. Chen, A. Fu and F. Tong, Optimal algorithms for finding user access sessions from very large web logs, World Wide Web 6 (2003) 259–279.
64. R. Chen, K. Sivakumar and H. Kargupta, Collective mining of bayesian networks from distributed heterogeneous data, Knowledge Inform. Syst. 6 (2004) 164–187.
65. S. Dustdar and R. Gombotz, Discovering web service workflows using web services interaction mining, Int. J. Business Process Integrat. Management 1 (2007) 256–266.
66. M. Eirinaki and M. Vazirgiannis, Web mining for web personalization, ACM Trans. Internet Tech. 3 (2003) 1–27.
67. B. Ezeife and Y. Lu, Mining web log sequential patterns with position coded pre-order linked wap-tree, Data Mining Knowledge Discov. 10 (2005) 5–38.
68. F. Facca and P. Lanzi, Mining interesting knowledge from weblogs: A survey, Data Knowledge Eng. 53 (2005) 225–241.
69. S. Flesca, S. Greco, A. Tagarelli and E. Zumpano, Mining user preferences, page content and usage to personalize website navigation, World Wide Web 8 (2005) 317–345.
70. P. Giudici and R. Castelo, Association models for web mining, Data Mining Knowledge Discov. 5 (2001) 183–196.
71. B. Gregg and S. Walczak, Adaptive web information extraction, Commun. ACM 49 (2006) 78–84.
72. W. Grossmann, M. Hudec and R. Kurzawa, Web usage mining in e-commerce, Int. J. Electron. Business 2 (2004) 480–492.
73. H. Han and R. Elmasri, Learning rules for conceptual structure on the web: Special issue on web content mining, J. Intell. Inform. Syst. 22 (2004) 237–256.
74. B. Haruechaiyasak and M. Shyu, A web-page recommender system via a data mining framework and the semantic web concept, Int. J. Comput. Appl. Tech. 27 (2007) 298–311.
75. B. Huang and T. Chou, Factors for web mining adoption of B2C firms: Taiwan experience, Electron. Commerce Res. Appl. 3 (2004) 266–279.
76. X. Huang, F. Peng, A. An and D. Schuurmans, Dynamic web log session identification with statistical language models, J. Am. Soc. Inform. Sci. Tech. 56 (2004) 1290–1303.
77. X. Jiang, Efficient data mining for web navigation patterns, Inform. Software Tech. 46 (2004) 55–63.
78. K. Joshi, A. Joshi and Y. Yesha, On using a warehouse to analyze web logs, Distributed Parallel Databases 13 (2003) 161–180.
79. Kdnuggets, Software: Web Mining and Web Usage Mining, www.kdnuggets.com/software/web-mining.html.
80. D. Kim, H. Jung and L. Geunbae, Unsupervised learning of mDTD extraction patterns for Web text mining, Inform. Process. Manag. 39 (2003) 623–637.
81. Y. Kotb, K. Gondow and T. Katayama, Optimizing the execution time for checking the consistency of xml documents: Special issue on web content mining, J. Intell. Inform. Syst. 22 (2004) 257–279.
82. Y. Kuo and L. Chen, Personalization technology application to Internet content provider, Expert Syst. Appl. 21 (2001) 203–215.
83. C. Lee, Y. Kim and P. Rhee, Web personalization expert with combining collaborative filtering and association rule mining technique, Expert Syst. Appl. 21 (2001) 131–137.
84. A. Maguitman, F. Menczer, E. Fulya, H. Roinestad and A. Vespignani, Algorithmic computation and approximation of semantic similarity, World Wide Web 9 (2006) 431–456.
85. A. Nanopoulos and Y. Manolopoulos, Mining patterns from graph traversals, Data Knowledge Eng. 37 (2001) 243–266.
86. Z. Pabarskaite, Decision trees for web log mining, Intell. Data Anal. 7 (2003) 141–154.
87. D. Roussinov and F. Zhao, Text clustering and summary techniques for CRM message management, J. Enterprise Inform. Management 17 (2004) 424–429.
88. B. Sakkopoulos, D. Kanellopoulos and A. Tsakalidis, Semantic mining and web service discovery techniques for media resources management, Int. J. Metadata Semant. Ontol. 1 (2006) 66–75.
89. M. Shyu, C. Haruechaiyasak and S. Chen, Mining user access patterns with traversal constraint for predicting web page requests, Knowledge Inform. Syst. 10 (2006) 515–528.
90. K. Smith and A. Ng, Web page clustering using a self-organizing map of user navigation patterns, Decision Support Syst. 35 (2003) 245–256.
91. P. Smyth, D. Pregibon and C. Faloutsos, Data-driven evolution of data mining algorithms, Commun. ACM 45 (2002) 33–37.
92. A. Thuraisingham, Web Data Mining and Applications in Business Intelligence and Counter-Terrorism (CRC Press, 2004), ISBN-13: 978-0849314605.
93. L. van Wel and R. Lambèr, Ethical issues in web data mining, Ethics Inform. Tech. 6 (2004) 129–140.
94. Y. Wu and A. Chen, Prediction of web page accesses by proxy server log, World Wide Web 5 (2002) 67–88.
95. Q. Yang, T. Li and K. Wang, Building association-rule based sequential classifiers for web-document prediction, Data Mining Knowledge Discov. 8 (2004) 253–273.
96. Q. Yang, T. Li and K. Wang, Web-log cleaning for constructing sequential classifiers, Appl. Artific. Intell. 17 (2003) 431–441.
97. Q. Yang and H. Zhang, Integrating web prefetching and caching using prediction models, World Wide Web 4 (2001) 299–321.
98. A. Zhang and Y. Dong, A novel web usage mining approach for search engines, Comput. Networks 39 (2002) 303–310.