A Framework of Web-based Text Mining on the Grid
Lean Yu and Shouyang Wang
Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, Beijing, China
Kin Keung Lai
City University of Hong Kong, Hong Kong, China; School of Business Administration, Hunan University, Changsha, China
technologies to discover new knowledge from text documents on the web. Against this background, we propose a novel WTM system framework on the grid to address this problem. Grids are geographically distributed platforms for computation, composed of a set of heterogeneous machines accessible to their users via a single interface. Grid computing has been proposed as an important computational model, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation [5]. Grid computing has recently emerged as an effective paradigm for coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations operating in industry and business [5]. Thus, grids can today serve as effective infrastructures for distributed high-performance computing and data processing. In this study, grid technology is applied to a WTM system for the first time, with the goal of improving the mining efficiency of WTM.
The rest of the paper is organized as follows. A general framework of grid-based WTM is proposed in Section 2. The main process of WTM on the grid is described in Section 3. The implementation of the proposed grid-based WTM framework is given in Section 4. Section 5 concludes the paper.
Abstract
This study proposes a novel web-based text mining (WTM) framework on the grid, which applies grid technology to a text mining system for the first time in order to improve the performance of WTM. The WTM system implements a high-performance text mining process on the grid and provides users with efficient text mining and knowledge discovery services. We focus our discussion on the formulation of this framework for a WTM system on the grid. First, a general framework of grid-based WTM is proposed; then the main process of WTM on the grid is described; and subsequently the implementation of the proposed grid-based WTM framework is given. Finally, some conclusions are drawn.
1. Introduction
With the rapid increase in the number and diversity of distributed text sources on the web, there has been a strong demand for web-based text mining (WTM) systems that help people discover useful knowledge from large numbers of text documents. However, most existing text mining systems build a text document index on a single platform for all available static text documents and use a well-known mining model such as the vector space model (VSM) based on TF × IDF [1-4]. It is almost impossible to build a single index for all text documents on the web, because the number of text documents increases very rapidly. Furthermore, the effectiveness of these approaches, measured by precision and recall, is low even for limited collections of text documents. It is, therefore, imperative to introduce some new
0-7695-2452-4/05 $20.00 © 2005 IEEE
Yue Wu
University of Southampton, Southampton, UK
2. General framework of Grid-based WTM
Text mining is a newly emerging research area that has become popular in recent years. According to Webster’s online dictionary [6], text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.
Web-based text mining takes unstructured web textual documents and examines them in an attempt to discover structure and implicit meanings “hidden” within the web documents, using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), information extraction (IE), and knowledge management [7]. One main objective of web-based text mining (WTM) is to support the knowledge discovery process in large web document collections. With the rapid increase in the number and diversity of distributed text sources on the web, WTM has increasingly become a research focus and a field with great potential. With the exponentially growing demand for processing web textual documents, grid infrastructures are foreseen to be one of the most critical yet challenging technologies for meeting the practical demands for high-performance and high-efficiency text mining across a large variety of web documents and databases. In the past few years, many software environments for gaining access to very large distributed computing resources have been made available, such as Condor [8] and Globus [9]. Based on this previous work, a grid-based WTM (GWTM) system is proposed. Figure 1 shows the general framework architecture of GWTM with three grids. For simplicity, each access network is represented by a single router, although in practice such networks would contain several routers.
(1) Core network
In the general framework of grid-based WTM, the computing resources are distributed across an Internet-based network with a high-speed backbone network in the core, typically one provided by the China Network Communication (CNC) companies, and several lower-speed access networks at the edge. In the grid infrastructure, the core network is usually a reliable, very high-speed network (several Gb/s). In the core network, the distributed computing resources can handle different text information according to the network protocol and users’ requests.
(2) Router/firewall
As shown in Figure 1, grids are typically attached to the edges of the core network via routers. Today’s computational grids use the standard IP routing protocol, which has remained basically unchanged for two decades. Routers therefore play an important role in the grid-based WTM system. In addition, a firewall is necessary when constructing a grid-based infrastructure and practical systems, in order to protect the data of every grid. In Figure 1, the routers of Grid 1 and Grid 2 also function as firewalls; in Grid 3, the server serves as a firewall to protect the data.
(3) Grid member
A grid member is any networked computer registered with the grid, as illustrated in Figure 3. Each grid member operates independently and asynchronously of all other members. Some grid members may have more dependencies on other members due to special relationships. Grid members spontaneously discover each other on the grid to form transient or persistent relationships called grid groups. A grid group is a collection of grid members that share some common motivations. Grid members that provide the same set of services tend to be interchangeable. Any grid member can join or leave the grid at any time. A grid member provides services that can be used by other grid members. Generally, communication between two grid members is carried out using specific protocols.
For example, two grid members can establish a direct point-to-point connection through the IP control protocol. Of course, grid members may not have direct point-to-point communication between themselves, due either to a lack of physical network connections or to network configuration [10]. In such a situation, a grid member may use one or more intermediary grid members to route a message to another grid member.
(4) Grid group
Grid members can self-organize into grid groups. As previously mentioned, a grid group is a collection of grid members. Each grid group is uniquely identified by a
Figure 1 The framework architecture of GWTM
As can be seen from Figure 1, the GWTM contains five main components: core network, router/firewall, grid member, grid group, and grid service.
grid group PIN or ID number. To create a secure environment, grid group boundaries permit grid members to access and view some protected contents. Grid groups form logical regions whose boundaries limit access to the grid group resources. A grid group does not necessarily reflect the underlying physical network boundaries, such as those imposed by routers and firewalls. A grid group virtualizes the notion of routers and firewalls, subdividing the grid into secure regions without respect to actual physical grid boundaries [10].
(5) Grid service
Grid services are infrastructure services that facilitate the grid-based WTM system. The core grid services are described as follows.
a) Grid registration service. It provides a mechanism to register the services of various resources.
b) Grid view service. It lists all registered grid services in order to support queries by grid users.
c) Grid query service. The query service is used by grid members to search for resources on the grid. Its purpose is to select the “best” service, allowing the grid user to specify only the minimum parameters of interest. The query service first searches the view service for a set of services that satisfy the grid user’s request. It then filters this set of services according to the criteria in the request.
d) Grid access service. The grid access service is used to validate requests made by one grid member to another. The grid service administrator receiving the request provides the requestor’s credentials, and information about the request is then passed to the access service to determine whether the access is permitted.
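The interplay of the four core grid services can be illustrated with a small in-memory sketch. It is not part of the proposed system: the class `GridServices`, the record type `ServiceRecord`, all field names, and the rule that access is permitted only within the same grid group are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """One registered service (all field names are hypothetical)."""
    name: str
    member: str                      # grid member offering the service
    group: str                       # grid group the member belongs to
    attributes: dict = field(default_factory=dict)


class GridServices:
    """Toy stand-in for the registration, view, query and access services."""

    def __init__(self):
        self._registry = []          # records added by the registration service
        self._group_members = {}     # group id -> set of member names

    def join_group(self, member, group):
        self._group_members.setdefault(group, set()).add(member)

    def register(self, record):
        # a) registration service: record a service offered by a member
        self._registry.append(record)

    def view(self):
        # b) view service: list every registered service
        return list(self._registry)

    def query(self, name, score=None, **criteria):
        # c) query service: search the view, filter by the request criteria,
        #    then pick the "best" match according to an optional score
        candidates = [r for r in self.view() if r.name == name]
        matches = [r for r in candidates
                   if all(r.attributes.get(k) == v for k, v in criteria.items())]
        if not matches:
            return None
        return matches[0] if score is None else max(matches, key=score)

    def access_permitted(self, requester, record):
        # d) access service (assumed rule): only members of the record's
        #    grid group may use the service
        return requester in self._group_members.get(record.group, set())


grid = GridServices()
grid.join_group("member-a", "g1")
grid.join_group("member-b", "g1")
grid.join_group("member-c", "g2")
grid.register(ServiceRecord("text-classify", "member-a", "g1", {"lang": "en", "cpus": 4}))
grid.register(ServiceRecord("text-classify", "member-c", "g2", {"lang": "en", "cpus": 16}))
best = grid.query("text-classify", score=lambda r: r.attributes["cpus"], lang="en")
```

Here the user specifies only the service name and one criterion; the scoring function that defines the "best" service (most CPUs in this sketch) is supplied separately, which mirrors the idea of keeping the user-facing parameters minimal.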
Figure 2 The main process of GWTM
The Internet contains an enormous, heterogeneous and widely distributed information base in which the amount of information grows geometrically. Within this information base, text sets that satisfy given conditions can be obtained using a search engine. When a user visits a search engine (e.g., Google, http://www.google.com) and makes a query, typically by giving keywords (sometimes specifying advanced search options such as Boolean, relevancy-ranked, fuzzy and concept search [7, 11]), the engine looks up its index and provides a listing of the best-matching documents from internal file systems or the Internet according to its criteria, usually with a short summary containing each document’s title and sometimes parts of the text. Thus, large text sets can be obtained according to the specific tasks.
(2) Feature extraction phase
Once collected, the text sets are mainly represented by web pages, which are tagged with HyperText Markup Language (HTML) or Extensible Markup Language (XML). The collected documents or texts are thus mostly semi-structured or unstructured information. Our task is to extract certain features that represent the text contents from these collected texts for further analysis and application. The primary objective of the feature extraction operation is to identify facts and relations in text sets. In general, text feature extraction problems are handled by the text weighting approach or by the semantic-analysis-based approach. However, conventional text weighting schemes do not reveal the text characteristics in the related documents satisfactorily. Here a novel probability-based approach is introduced for feature extraction with the help of the K-mixture
3. The main process of GWTM
In general, the GWTM process can be viewed as a series of tasks involving the many interdisciplinary techniques mentioned above. The main process of GWTM contains four main phases, as illustrated in Figure 2: text collection, feature extraction, structure analysis and text classification.
(1) Text collection phase
This is the first stage of GWTM. The main work at this stage is to search for the text documents that the tasks need. In general, the first step of document collection is to identify which documents are to be retrieved. Once a subject is determined, some keywords are selected, and text documents can then be searched and queried with the aid of a search engine.
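The keyword lookup performed during text collection can be illustrated with a toy inverted index. This is a minimal sketch, assuming a small in-memory corpus and Boolean AND semantics only; the function names `tokenize`, `build_index` and `boolean_and_search` are hypothetical, and the code does not describe how any particular search engine works.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def boolean_and_search(index, query):
    """Return the ids of documents containing every keyword in the query."""
    keywords = tokenize(query)
    if not keywords:
        return set()
    result = set(index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        result &= index.get(keyword, set())
    return result


docs = {
    "d1": "Grid computing for text mining",
    "d2": "Text mining on the web",
    "d3": "Grid resource sharing",
}
index = build_index(docs)
hits = boolean_and_search(index, "text mining")
```

The returned set of document ids plays the role of the collected text set that the later phases operate on.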
model [12]. In the K-mixture model, the probability $P_i(k)$ of the word $w_i$ appearing $k$ times in a document is given by:

\[ P_i(k) = (1 - x)\,\delta_{k,0} + \frac{x}{y + 1} \left( \frac{y}{y + 1} \right)^{k} \tag{1} \]

where $\delta_{k,0} = 1$ if and only if $k = 0$, and $\delta_{k,0} = 0$ otherwise.
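As a worked illustration of Equation (1), the sketch below fits x and y from collection statistics using the parameter fit of Equation (2) and then evaluates the probability. It assumes a base-2 logarithm for the IDF (which is what makes the two expressions for y agree) and requires DF < CF so that y is positive; the function names `fit_k_mixture` and `k_mixture_prob` and the sample counts are hypothetical.

```python
import math


def fit_k_mixture(cf, df, n_docs):
    """Fit (x, y) from collection frequency, document frequency and corpus size."""
    if not 0 < df < cf:
        raise ValueError("need 0 < df < cf so that y > 0")
    t = cf / n_docs                 # observed mean occurrences per document
    idf = math.log2(n_docs / df)    # base-2 inverse document frequency (assumed base)
    y = t * 2 ** idf - 1            # algebraically equal to (cf - df) / df
    x = t / y
    return x, y


def k_mixture_prob(k, x, y):
    """Probability that the term occurs exactly k times in a document."""
    delta = 1.0 if k == 0 else 0.0
    return (1 - x) * delta + (x / (y + 1)) * (y / (y + 1)) ** k


# Hypothetical term: 30 occurrences spread over 20 of 100 documents.
x, y = fit_k_mixture(cf=30, df=20, n_docs=100)
p_zero = k_mixture_prob(0, x, y)
```

With these counts the fit gives x = 0.6 and y = 0.5, and the probability of zero occurrences equals 1 - DF/N = 0.8, which is a quick sanity check that the fitted model reproduces the observed document frequency.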
weights of these thematic sentences should therefore be larger than those of other sentences when constructing the weight function for a sentence.
(4) Text classification phase
Classification is one of the most important tasks in data mining [1]. The main goal of classification is to make retrieval or querying faster, more efficient and more precise. Here a VSM-based text categorization algorithm [1] is introduced. The core idea of the algorithm is to judge the category of a testing text by calculating the similarity of the eigenvectors of text features. The algorithm’s basic process is divided into two stages. The first stage is the sample training and learning stage. In this stage, a basic text categorization set C = {c1, c2, …, cm} and its eigenvectors V(ci) are given as targets in advance. In order to verify the algorithm’s classification capability, some training text sets S = {s1, s2, …, sr} and their eigenvectors V(si) are chosen. By calculating the similarity of texts, we can determine the category of the training texts. Here the similarity is calculated as
The parameters x and y can be fitted using the observed mean t and the observed inverse document frequency (IDF) as follows:

\[ t = \frac{CF_i}{N}; \quad IDF = \log_2\!\left( \frac{N}{DF_i} \right); \quad y = t \times 2^{IDF} - 1 = \frac{CF_i - DF_i}{DF_i}; \quad x = \frac{t}{y} \tag{2} \]
where the collection frequency (CFi) is the total number of occurrences of the ith term in the collection; the document frequency (DFi) is the number of documents in the collection in which the ith term occurs; and N is the total number of documents in the collection. The parameter x used in the formula denotes the absolute frequency of the term, and y can be used to calculate the extra terms per document in which the term occurs (compared with the case where a term has only one occurrence per document). Using the probability-based approach, words/terms with high probability are used as text features, also called “metadata”. This makes the construction of a metadata repository possible. We use the extracted features as attributes of the metadata base. Based on these features or attributes, we may find useful patterns in the metadata repository using text classification algorithms.
(3) Structure analyzing phase
In this phase, based on the results of the text structure analyzer, text abstracts can be generated using a text abstract builder. In the text sets, web texts contain both pure text and all kinds of hyperlinks that reflect relationships between different web pages. It is therefore necessary to analyze the text structure. By analyzing the linkage of web texts, we can judge the relationships between different documents. This is useful in finding new knowledge. In the same way, by analyzing the web linkage and the number of hyperlinks, we can obtain similar and interconnected material in different web texts, further increasing the efficiency of information retrieval. Text abstracts can be generated by analyzing text structure. Text abstracts are very important in text mining because they help us understand the whole document. Some thematic sentences reflect a paper’s core ideas; these are usually located at the beginning or end of papers or paragraphs. The
\[ Sim_{ik} = \frac{V_{s_i} \cdot V_{c_k}}{\lVert V_{s_i} \rVert \times \lVert V_{c_k} \rVert} \quad (1 \le i \le r,\ 1 \le k \le m) \tag{3} \]
where $Sim_{ik}$ denotes the similarity between the ith training text and the kth category, and $V_{s_i}$ and $V_{c_k}$ represent the eigenvectors of the training text and the basic text category, respectively. The second stage is the testing sample-identifying stage. Here we present some testing text sets T = {t1, t2, …, tn} that are to be classified. We use a similarity matrix to handle the testing text sets. That is,

\[ \text{Similarity matrix} = [V_{t_1}, V_{t_2}, \ldots, V_{t_n}]^{T} \otimes [V_{c_1}, V_{c_2}, \ldots, V_{c_m}] = \begin{bmatrix} Sim_{11} & Sim_{12} & \cdots & Sim_{1m} \\ Sim_{21} & Sim_{22} & \cdots & Sim_{2m} \\ \vdots & \vdots & Sim_{ij} & \vdots \\ Sim_{n1} & Sim_{n2} & \cdots & Sim_{nm} \end{bmatrix} \tag{4} \]

where

\[ Sim_{ij} = \frac{V_{t_i} \cdot V_{c_j}}{\lVert V_{t_i} \rVert \times \lVert V_{c_j} \rVert} \quad (1 \le i \le n,\ 1 \le j \le m). \]
In the matrix, the maximal value of every row can be obtained by $MaxSim_i = \max_{j=1}^{m}(Sim_{ij})$, which identifies the category cj that is most similar to ti. So text ti will be categorized into the class cj. Through the above processes of GWTM, some interesting “hidden” patterns can be mined and explored from large web textual documents. As an emerging research area, GWTM is still in progress. It is worth noting that the main process of web-based text mining is built on the grid; the detailed implementation is described in the following section.
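The classification stage can be sketched in a few lines of plain Python. The sketch assumes that each text and each category is already represented by a numeric feature vector of the same length, and that the most similar category is the one with the largest cosine similarity; the function names and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(text_vectors, category_vectors):
    """Row i, column j holds the similarity between text i and category j."""
    return [[cosine(t, c) for c in category_vectors] for t in text_vectors]


def classify(text_vectors, category_vectors):
    """Assign each text the index of its most similar category."""
    sims = similarity_matrix(text_vectors, category_vectors)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Hypothetical three-term vocabulary, two categories, two testing texts.
categories = [[1, 0, 0], [0, 1, 1]]
texts = [[2, 0, 1], [0, 3, 1]]
labels = classify(texts, categories)
```

Each row of the similarity matrix corresponds to one testing text, so the per-row selection is independent across texts and can be split among grid members without coordination.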
on the grid. Such integration environments usually rely on thousands of connected machines, and the integration approach is shown in Figure 5. Most of them are based on computer cycle stealing, like Condor [8], Entropia [15] or XtremWeb [16].
Based on these two grid integration approaches, in connection with the text mining procedure, a new web-based text mining system in the grid environment can be formulated. For reasons of space, the experiments are omitted here.
4. The implementations of GWTM
In order to support grid-based WTM applications, the grid-based WTM system utilizes parallel processing techniques to perform specific text mining tasks. In other words, for a given text mining task, every grid member in Figure 1 performs the mining process shown in Figure 2 on the grid within a parallel computing environment. Figure 3 illustrates the parallel computing environment on the grid for GWTM.
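The pattern of running the same mining task on every grid member and then integrating the partial results can be imitated on a single machine. The sketch below is only an analogy: local threads stand in for grid members, a simple term count stands in for the full mining process, and no grid middleware is involved. The function names `mine_partition` and `mine_on_grid` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def mine_partition(documents):
    """Stand-in for one grid member's mining task: count term occurrences."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def mine_on_grid(partitions, max_workers=4):
    """Run mine_partition on every partition concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(mine_partition, partitions))
    merged = Counter()
    for partial in partial_results:
        merged.update(partial)      # integration step: sum the partial counts
    return merged


# Each inner list plays the role of the documents held by one grid member.
partitions = [["grid text mining", "text"], ["grid computing"]]
merged = mine_on_grid(partitions)
```

The merge step works here because term counts are additive; a mining task whose partial results are not additive would need its own integration rule.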
Figure 4 Meta-cluster computing
Figure 3 Parallel computing environments for GWTM
For practical applications, the implementation of the GWTM architecture must deal with the following two main grid configurations [13] in order to integrate the results of all grid computing.
(1) Meta-cluster computing
Meta-cluster computing provides an efficient way to integrate the text mining results of every grid member in the grid environment. This grid configuration comprises a set of parallel machines or clusters that are linked together via the Internet to provide a very large parallel computing resource, as illustrated in Figure 4. Grid environments such as Globus [9] and Netsolve [14] are well designed to handle meta-cluster computing sessions that execute long-distance parallel applications.
(2) Global or mega-computing
This grid configuration also provides a solution for integrating multiple results for a web-based text mining task
Figure 5 Global computing
5. Conclusions
This study briefly proposed a novel framework for a web-based text mining (WTM) system in the grid environment to improve the efficiency and performance of text mining. We first proposed a general framework of GWTM. Based on this framework, four main procedures of web-based text mining on the grid were presented. In order to integrate the results of GWTM, two main integration approaches were introduced. Relying on the framework and the main procedures of web-based text mining, a novel web-based text mining system on the grid can be formulated.
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China; the Chinese Academy of Sciences; the Key Laboratory of Management, Decision and Information Systems; and an SRG of City University of Hong Kong.

References
[1] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
[2] K. Sparck Jones, “A statistical interpretation of term specificity and its application to retrieval”, Journal of Documentation, 1972, 28(1), pp. 11-21.
[3] G. Salton and C.S. Yang, “On the specification of term values in automatic indexing”, Journal of Documentation, 1973, 29(4), pp. 351-372.
[4] G. Salton, “Developments in automatic text retrieval”, Science, 1991, 253, pp. 974-979.
[5] I. Foster, C. Kesselman, and S. Tuecke, “The anatomy of the grid: enabling scalable virtual organizations”, International Journal of High Performance Computing Applications, 2001, 15(3), pp. 200-223.
[6] Online Webster-dictionary. http://www.websterdictionary.org/.
[7] H. Karanikas and B. Theodoulidis, “Knowledge discovery in text and text mining software”, Technical Report, UMIST Department of Computation, 2002.
[8] M. Litzkow and M. Livny, “Experience with the condor distributed batch system”, Proceedings of the IEEE Workshop on Experimental Distributed Systems, 1990.
[9] I. Foster, C. Kesselman, and S. Tuecke, “Globus: a metacomputing infrastructure toolkit”, International Journal of Supercomputer Applications, 1997, 11(2), pp. 115-128.
[10] C. Li and L. Li, “Agent framework to support the computational grid”, Journal of Systems and Software, 2004, 70, pp. 177-187.
[11] M.W. Berry and M. Browne, “Understanding search engines: mathematical modeling and text retrieval”, SIAM, Philadelphia, 2002.
[12] S.M. Katz, “Distribution of content words and phrases in text and language modeling”, Natural Language Engineering, 1995, 2(1), pp. 15-59.
[13] F. Bouhafs, J.P. Gelas, L. Lefevre, M. Maimour, C. Pham, P.V. Primet, and B. Tourancheau, “Designing and evaluating an active grid architecture”, Future Generation Computer Systems, 2005, 21, pp. 315-330.
[14] M. Beck, H. Casanova, J. Dongarra, T. Moore, J.S. Plank, F. Berman, and R. Wolski, “Logistical quality of service in netsolve”, Computer Communications, 1999, 22(11), pp. 1034-1044.
[15] Entropia: high performance Internet computing. http://www.entropia.com.
[16] Xtremweb: a global computing experimental platform. http://www.xtremweb.net.