Chapter VIII
A Multi-Agent Neural Network System for Web Text Mining
Lean Yu, Chinese Academy of Sciences and City University of Hong Kong
Shouyang Wang, Chinese Academy of Sciences
Kin Keung Lai, City University of Hong Kong
Abstract
With the rapid growth of online information, there is strong demand for Web text mining, which helps people discover useful knowledge in Web documents. To this end, this chapter first proposes a back-propagation neural network (BPNN)-based Web text mining system for decision support. The system involves four main processes: Web document search, Web text processing, text feature conversion, and BPNN-based knowledge discovery. In particular, the BPNN is used as an intelligent learning agent that learns from the underlying Web documents. To scale the individual intelligent agent to large numbers of Web documents, we then provide a multi-agent-based neural network system that performs Web text mining in parallel. For illustration, a simulated experiment is performed. The experimental results indicate that the proposed multi-agent neural network system is an effective solution for large-scale Web text mining.
Introduction
Web text mining is the process of examining unstructured Web text documents in an attempt to find implicit patterns hidden in them. With the amount of online Web text information growing rapidly, the need for a powerful Web text mining method that can analyze Web text information and infer useful patterns for prediction and decision purposes has increased. To cope with the abundance of available Web text information,
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Web users need the assistance of software tools and software agents (often called “softbots”) for exploring, sorting, and filtering the available Web text information (Etzioni, 1996; Kozierok & Maes, 1993). Much effort has been devoted to Web text mining and important progress has been made. For example, Joachims (1996) used a probabilistic TFIDF method and the naïve Bayes method to perform text categorization, one subtask of text mining. Feldman and Dagan (1998) proposed a keyword-frequency approach to explore unstructured text collections. Tan (1999) presented a two-phase text mining framework for knowledge discovery in text. Lee and Yang (1999) used a self-organizing map (SOM) method to perform Web text mining. Chen and Nie (2000) proposed a parallel Web text mining approach for cross-language information retrieval. Choi and Yoo (2001) used a neural network approach for text database discovery on the Web. More recently, Yu, Wang, and Lai (2005) used rough set theory to refine text mining for prediction purposes. Generally speaking, existing research has concentrated on the development of agents that are high-level interfaces to the Web (Etzioni & Weld, 1994; Furnkranz, Holzbaur & Temel, 2002), programs for filtering and sorting e-mail messages (Maes, 1994; Payne & Edwards, 1997), or Usenet news categorization (Joachims, 1996; Lang, 1995; Lashkari, Metral & Maes, 1994; Mock, 1996; Sheth & Maes, 1993). More examples of Web text mining can be found in two recent surveys (Chakrabarti, 2000; Kosala & Blockeel, 2000). Meanwhile, a number of Web text mining systems, such as IBM Intelligent Miner (http://www-306.ibm.com/software/data/iminer/) and SAS Text Miner (http://www.sas.com/), have already been developed. Although some progress in Web text mining has been made, several important issues remain to be addressed.
First of all, most text mining tasks focus on text categorization/classification, text clustering, concept extraction, and document summarization (Yu, Wang & Lai, 2006). Text content and entity relation modeling (i.e., the causal relationships between entities) is less explored in Web text documents, so the essential goal of text mining is often neglected. The basic motivation of text mining is to find and explore useful knowledge and hidden relationships in unstructured text data to support decision-making, similar to data mining in structured data. In the existing literature, these text mining tasks rarely involve hidden entity relationship modeling. For example, the main function of text categorization is to classify documents into prelabeled classes (e.g., Joachims, 1996), but how documents in different categories support decision-making is not clear. Unlike previous Web text mining tasks, this chapter attempts to explore implied knowledge hidden in Web text documents to support business prediction and decision-making (Yu et al., 2006).
Second, most text mining models use well-known tools such as the vector space model (VSM) (e.g., the TFIDF algorithm (Salton, 1971, 1991; Salton & Yang, 1973; Sparck Jones, 1972)) and traditional statistical models, for example, the naïve Bayes algorithm (Feldman & Dagan, 1998; Joachims, 1996; Katz, 1995). Nevertheless, a distinct shortcoming of these models is that their extrapolation and generalization capabilities are often weak (Yu et al., 2006). To remedy this drawback, this chapter adopts an intelligent learning agent instead of traditional statistical models to perform the Web text mining tasks. Because the learning agent has good learning ability and a flexible mapping capability between inputs and outputs, its generalization capability may be much stronger than that of the traditional models (Yu et al., 2006).
Third, even though some researchers have used intelligent techniques such as the self-organizing map (SOM) (Lee & Yang, 1999) and rough sets (Yu et al., 2005) to perform Web text mining, an individual intelligent technique may
have difficulty in training when the number of available Web text documents exceeds a tolerable level. Furthermore, when new Web text documents are found, the single intelligent agent must be retrained in order to learn from the new ones. Thus, the redundant training cost becomes quite burdensome when a designer scales up an existing intelligent agent into a larger one with more Web text documents. In this chapter, we use a multi-agent technique to eliminate the redundant computational load. Finally, most existing Web text mining methods focus on traditional text clustering and categorization tasks rather than on text knowledge discovery; that is, their capability for knowledge discovery can be improved further. In this chapter, knowledge discovery from unstructured Web text documents is emphasized in addition to traditional text processing tasks, which distinguishes it from previous studies (Yu et al., 2006). In view of the above issues, this chapter proposes a multi-agent neural network system to perform Web text mining for prediction and decision-making purposes. The generic framework of the multi-agent neural network system for Web text mining is composed of four main processes: Web document search, Web text processing, text feature conversion, and BPNN-based knowledge discovery, which are described in the next section. The main objective of this chapter is to develop a multi-agent intelligent Web text mining system that offers an effective and scalable Web text mining solution to support decision-making from Web text documents. The rest of this chapter is organized as follows. In the section on The Framework of BPNN-based Intelligent Web Text Mining, we present a generic framework of the BPNN-based intelligent learning agent for Web text mining, and then describe its main text mining processes and the limitation of the single intelligent agent approach.
To overcome the inadequacy of a single intelligent agent, a multi-agent-based intelligent Web text mining system is proposed in the section on Multi-Agent-Based Web Text Mining System. For illustration, an experiment is performed in the section entitled Experiment Study. Finally, some concluding remarks and future directions are given in Conclusions and Future Research Directions.
The Framework of the BPNN-Based Intelligent Web Text Mining
In this section, we first present a generic framework of the BPNN-based Web text mining system and then describe its main processes. Finally, the limitation of the single intelligent agent approach to Web text mining is illustrated.
The Framework of BPNN-based Intelligent Web Text Mining
Web text mining, an emerging issue in the data mining and knowledge discovery research field, has evolved from text mining. According to Webster’s online dictionary (http://www.Webster-dictionary.org), text mining, also known as intelligent text analysis (Gelbukh, 2002), text data mining (Hearst, 1999), or knowledge discovery in text (KDT) (Feldman & Dagan, 1995; Karanikas & Theodoulidis, 2002), refers generally to the process of extracting interesting and non-trivial patterns and knowledge from unstructured text documents. Similarly, Web text mining examines Web text documents in an attempt to discover the inherent structure and implicit meanings “hidden” in them, using interdisciplinary techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), information extraction (IE), and knowledge management (Karanikas & Theodoulidis, 2002; Yu et al., 2005, 2006). In this sense, Web text mining is a subset of text mining. The only difference is that Web text mining focuses on knowledge discovery from hypertext (e.g., HTML and XML) published on the Web, while text mining covers a broader range of text documents besides hypertext (e.g., Word files and ASCII text files). One main objective of Web text mining is to help people discover knowledge for decision support from large quantities of semi-structured or unstructured Web text documents. Like data mining approaches, Web text mining needs the assistance of an intelligent software agent that can automatically analyze Web text data and extract implied relationships or patterns from Web text documents. Therefore, it can be viewed as an extension of data mining or knowledge discovery from (structured) databases (Fayyad, Piatetsky-Shapiro & Smyth, 1996). With the rapid increase in the number and diversity of distributed text sources on the Web, Web text mining has become one of the most promising research fields. Web text mining consists of a series of tasks and procedures that involve many interdisciplinary fields such as data mining, machine learning, and NLP. Because the final goal of Web text mining is to support prediction and decision-making with the knowledge it discovers, Web text mining must adapt to change over time as the number of Web text documents increases rapidly. Therefore, Web text mining must have learning and generalization capability. In this chapter, the BPNN is used as an intelligent computational agent for extrapolation and generalization purposes. In our proposed approach, the BPNN agent is first trained with many related Web documents, and the trained BPNN agent can then be applied to new Web documents for prediction and decision-making when they arrive. Figure 1 illustrates the main components of a BPNN agent for Web text mining and the control flows among them.
Note that the control flows drawn with thin arrows represent the training phase with in-sample data and
Figure 1. The generic framework of BPNN-based Web text mining system
the control flows drawn with thick arrows represent the out-of-sample testing phase of the BPNN-based Web text mining agent. As can be seen from Figure 1, the generic framework of the BPNN-based Web text mining system consists of four main components: Web document search, Web text processing, feature vector conversion, and the BPNN-based learning mechanism, which are described in detail in the next section.
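The train-then-project behaviour of a BPNN agent can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the chapter's implementation: the class name `BPNN`, the single hidden layer, the sigmoid activations, the learning rate, and the two-element feature vectors (imagined here as scaled frequencies of two hypothetical keywords) are all chosen for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BPNN:
    """One-hidden-layer network trained by back-propagation (illustrative only)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def predict(self, x):
        return self._forward(x)[1]

    def train(self, samples, epochs=2000, lr=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self._forward(x)
                # Output delta for squared error with a sigmoid output unit.
                d_out = (out - target) * out * (1.0 - out)
                # Hidden deltas use the output weights before they are updated.
                d_hidden = [d_out * self.w_out[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
                for j, v in enumerate(hidden + [1.0]):
                    self.w_out[j] -= lr * d_out * v
                for j, row in enumerate(self.w_hidden):
                    for i, v in enumerate(x + [1.0]):
                        row[i] -= lr * d_hidden[j] * v

def mse(net, samples):
    return sum((net.predict(x) - t) ** 2 for x, t in samples) / len(samples)

# Hypothetical in-sample data: [freq of "rise", freq of "fall"] -> 1.0 if up.
train_set = [([1.0, 0.0], 1.0), ([0.8, 0.1], 1.0),
             ([0.0, 1.0], 0.0), ([0.1, 0.9], 0.0)]

net = BPNN(n_in=2, n_hidden=3)
error_before = mse(net, train_set)
net.train(train_set)                  # training phase (in-sample)
error_after = mse(net, train_set)
score = net.predict([0.9, 0.2])       # testing phase (out-of-sample document)
```

In this sketch the training loop plays the role of the in-sample phase and the final `predict` call plays the role of the out-of-sample phase described above.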
The Main Processes of the BPNN-Based Web Text Mining System
As the previous subsection showed, BPNN-based Web text mining is composed of four main processes: Web document search, Web text processing, feature vector conversion, and the BPNN-based learning mechanism.
Web Document Search
The first step in Web text mining is to collect the Web text data (i.e., the relevant documents). The main work at this stage is to search for the text documents that a specified project needs. In general, the first task of document collection is to identify which documents are to be retrieved. Once a subject is determined, some keywords should be
selected, and text documents can then be searched and queried with search engines, Web crawlers, or other information retrieval (IR) tools. The Internet contains enormous, heterogeneously structured, and widely distributed information bases in which the amount of information grows geometrically. From these information bases, text sets that satisfy given conditions can be obtained by using a search engine. A user comes to the search engine (e.g., Google, http://www.google.com/) and makes a query, typically by giving keywords. Sometimes advanced search options such as Boolean search, relevancy-ranked search, fuzzy search, and concept search (Berry & Browne, 2002; Karanikas & Theodoulidis, 2002) are specified. The engine looks up the index and provides a listing of the best-matching documents from internal file systems or the Internet according to its criteria, usually with a short summary containing the document’s title and sometimes parts of the text. Thus, a large number of text documents can be obtained for the specific project.
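The index lookup described above can be illustrated with a toy inverted index. The sketch below is an assumption-laden simplification rather than how any real search engine works: the function names `build_index` and `search`, the sample documents, and the ranking rule (number of distinct query keywords matched) are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query keywords they contain."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Best match first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical document collection.
docs = {
    "d1": "Crude oil price rises on supply fears",
    "d2": "Stock markets fall as oil price climbs",
    "d3": "New football season begins",
}
index = build_index(docs)
results = search(index, "oil price supply")
print(results)
```

Documents that share no keyword with the query are simply absent from the result list, which mirrors the idea of retrieving only the text sets that satisfy the query conditions.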
Web Text Processing
Once Web text documents are collected, they are mainly represented by Web pages, which are tagged with hypertext markup language (HTML) or extensible markup language (XML). Thus the collected documents or texts are
Figure 2. The three main procedures of Web text processing
Figure 3. A typical word division processing example using a bag-of-words representation
mostly semi-structured or unstructured information. The aim of Web text processing is to extract features that represent the text content from these collected texts for further analysis. This component includes three important procedures, word division processing, text feature generation, and typical feature selection, as illustrated in Figure 2.
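Before any of these procedures can run, the markup has to be separated from the text. The chapter does not say how this is done; the following sketch simply assumes Python's standard `html.parser` module, and the class name `TextExtractor` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical Web page.
page = ("<html><head><title>Oil news</title><style>p {color: red}</style>"
        "</head><body><p>The price of <b>oil</b> rises.</p></body></html>")
text = html_to_text(page)
print(text)
```

The resulting plain string is what the word division step would then operate on.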
Word Division Processing
For word division processing, the full text division approach is used. The full text division approach treats each document as a bag-of-words representation (Berry & Browne, 2002; Joachims, 1996). For example, “the price of oil rises” can be divided into the word set “the”, “price”, “of”, “oil”, “rises”. More generally, we have the following definition of full text division for the word division process (Yu et al., 2006).
[Full text division approach]: Let T = {T1, T2, …, Tm} be a collection of documents. For each text document, assume that there are n words. Then each document Ti can be represented as a bag-of-words W = {W1, …, Wn}.
Word division processing is very simple because it only treats the text documents as bags of words. A typical example is illustrated in Figure 3.
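As a rough illustration of full text division, the sketch below splits each document into words and counts them. The tokenization rule (lowercasing and keeping alphanumeric runs) and the function names `divide_words` and `bag_of_words` are assumptions made for this example; the chapter does not prescribe them.

```python
import re
from collections import Counter

def divide_words(text):
    """Split a document into lowercase word tokens (full text division)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Return the bag-of-words of one document as a word -> count mapping."""
    return Counter(divide_words(text))

# Hypothetical document collection T = {T1, T2}.
documents = ["The price of oil rises", "Oil supply falls as the price rises"]
bags = [bag_of_words(doc) for doc in documents]

print(divide_words(documents[0]))
print(bags[1])
```

Using a `Counter` keeps the word counts alongside the words, which is convenient when term frequencies are needed in later steps.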
Text Feature Representation
Generally, the representation of a problem has a strong effect on the generalization accuracy of a learning system. For Web text mining, a Web text document, which is typically a string of characters, has to be transformed into a representation that is suitable for the learning algorithm and the mining tasks. For the processing convenience of learning algorithms, simple word division processing is not enough and a text feature representation is required. In this chapter, we use an attribute-value style to format the text feature representation based on the previous bag-of-words representation. In particular, we have the following definition for the attribute-value feature representation. Note that the terms “feature” and “word” will be used interchangeably in the following.
Figure 4. The attribute-value representation of the Web text documents
[Attribute-value text feature representation]: For a text document T with n words, its bag-of-words representation is W = {w1, …, wn} according to the word division processing. Suppose I = {I1, I2, …, In} is a set of interesting words appearing in T. Then the text document T can be represented as W = {wj, V(wj), …, wk, V(wk)}, where V(ws) is the term frequency of word (ws) and 1