Clustering Deep Web Databases Semantically

Ling Song 1,2, Jun Ma 1, Po Yan 1, Li Lian 1, and Dongmei Zhang 1,2

1 School of Computer Science & Technology, Shandong University, 250061, China
2 School of Computer Science & Technology, Shandong Jianzhu University, 250101, China

[email protected]
Abstract. Deep Web database clustering is a key operation in organizing Deep Web resources. Traditionally, the cosine similarity of the Vector Space Model (VSM) is used as the similarity measure; however, it cannot capture the semantic similarity between the contents of two databases. This paper discusses how to cluster Deep Web databases semantically. First, a fuzzy semantic measure is proposed that integrates ontology and fuzzy set theory to compute the semantic similarity between the visible features of two Deep Web forms; then a hybrid Particle Swarm Optimization (PSO) algorithm is presented for Deep Web database clustering. Finally, the clustering results are evaluated by the Average Similarity of Document to the Cluster Centroid (ASDC) and the Rand Index (RI). Experiments show that: 1) the hybrid PSO approach yields higher ASDC values than the PSO and K-Means approaches, i.e., higher intra-cluster similarity and lower inter-cluster similarity; 2) the clustering results based on fuzzy semantic similarity have higher ASDC and RI values than those based on cosine similarity, which indicates that the fuzzy semantic similarity approach can exploit latent semantics. Keywords: Semantic Deep Web clustering, Fuzzy set, Ontology, PSO, K-Means.
1 Introduction

The Deep Web usually refers to the databases that are accessible through HTML pages on the Internet. Unlike the surface web, the Deep Web is the collection of web data that is reached by interacting with a web-based query form rather than by traversing hyperlinks. Data mining over Deep Web sources has attracted considerable interest in recent research, and approaches have been proposed for both clustering and classifying Deep Web sources [1-3]. As an important data mining technique, document clustering is the operation of grouping similar documents into clusters that are coherent internally but clearly different from each other. Much research has focused on clustering algorithms, such as the K-Means algorithm [4], Particle Swarm Optimization (PSO), and hybrid PSO approaches [5-9]. Compared with ordinary data clustering, Deep Web clustering faces additional challenges: web-site designers model their forms in widely varying ways, so no standard form field names or structures can be assumed [10-11].
Given a large collection of searchable forms of Deep Web databases, pre-query and post-query approaches can be used to group together forms that correspond to similar databases [12]. Post-query techniques issue probing queries and use the retrieved results for clustering [3]. Pre-query techniques rely on visible features of forms, such as attribute labels and page contents [13-14]. In these approaches, the visible features are represented as vectors and cosine similarity is used as the basis for similarity comparison. In the classical VSM, the cosine similarity between two term-weight vectors d_j and d_k, where w_ji and w_ki denote the weight of the ith term in d_j and d_k respectively, is measured as follows:
sim(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j| \times |\vec{d}_k|} = \frac{\sum_{i=1}^{n} w_{ji} \, w_{ki}}{\sqrt{\sum_{i=1}^{n} w_{ji}^2} \times \sqrt{\sum_{i=1}^{n} w_{ki}^2}}    (1)
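As an illustration only (not from the paper), the following is a minimal Python sketch of Eq. (1), where each document is represented as a dictionary mapping terms to weights; the example forms and weights are made up:

```python
import math

def cosine_similarity(d_j, d_k):
    """Cosine similarity of two documents given as {term: weight} dictionaries (Eq. 1)."""
    # Dot product over the terms shared by both documents.
    dot = sum(w * d_k[t] for t, w in d_j.items() if t in d_k)
    # Euclidean norms of the two weight vectors.
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    norm_k = math.sqrt(sum(w * w for w in d_k.values()))
    if norm_j == 0 or norm_k == 0:
        return 0.0
    return dot / (norm_j * norm_k)

# Illustrative term weights for two search-form feature vectors.
form_a = {"title": 0.8, "author": 0.6, "isbn": 0.3}
form_b = {"title": 0.7, "writer": 0.5, "price": 0.4}
print(cosine_similarity(form_a, form_b))  # only "title" matches lexically
```

In this toy example, "author" and "writer" contribute nothing to the similarity even though they are semantically close, which is exactly the limitation discussed below.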
However, representing a document as a vector of a bag of words suffers from a well-known limitation: its inability to represent semantics. The main reason is that the VSM is based on lexicographic term matching, whereas two terms can be semantically similar although they are lexicographically different. Such lexicographic term matching therefore fails to exploit the semantic similarity between two Deep Web features, which ultimately degrades the quality of clustering.

An ontology is a specification of a conceptualization of a knowledge domain: a controlled vocabulary that describes concepts and the relations between them in a formal way, together with a grammar for using the concepts to express something meaningful within a specified domain of interest. Researchers have studied how to integrate domain ontologies as background knowledge into the document clustering process and have shown that ontologies can improve document clustering performance [15-18]. The basic idea of these measures is to re-weight terms, assigning more weight to terms that are semantically similar to each other, and then to apply the cosine measure to document similarity. However, such term re-weighting approaches ignore some of the terms and may therefore cause serious information loss. Zhang's experiments showed that term re-weighting might not be an effective approach when most of the terms in a document are distinct core terms [19].

Facing the above problems, in this paper we first present a fuzzy semantic measure that integrates the semantics of an ontology with fuzzy set theory to compute the similarity between visible features of Deep Web forms. We then present a hybrid PSO algorithm for Deep Web clustering. The main contributions of this paper are:

1. A semi-automatic approach to building a domain ontology is proposed.
2. A similarity matrix of concepts based on the domain ontology is defined.
3. A fuzzy set representing the conceptual vector of a Deep Web form is defined, which can explore the semantics of the domain ontology.
4. The necessity degree of matching between fuzzy sets is used to compare Deep Web forms.
5. A hybrid PSO algorithm is proposed for Deep Web database clustering.
The organization of this paper is as follows. After giving an approach to building the domain ontology in Sec. 2, a fuzzy semantic similarity measure with respect to the ontology is proposed in Sec. 3, which is necessary for the work of Sec. 4, where it is
applied to Deep Web database clustering with the hybrid PSO. In Sec. 5, the efficacy of our approach is demonstrated through experiments. Conclusions are given in Sec. 6.
2 Core Ontology and Domain Ontology

A frame system for the core ontology is given in [20]:

Definition 1 (Core Ontology). A core ontology is a sign system O := (L; F; C; H; ROOT), which consists of:
- a lexicon L, which consists of a set of terms;
- a set of concepts C; for each c ∈ C there exists at least one statement in the ontology;
- a reference function F, with F: 2^L → 2^C. F links sets of lexical entries {Li} ⊂ L to the sets of concepts they refer to. The inverse of F is F^-1;
- a concept hierarchy H: concepts are taxonomically related by the directed, acyclic, transitive relation H (H ⊂ C × C). For c1, c2 ∈ C, H(c1, c2) means that c1 is a subconcept of c2;
- a top concept ROOT ∈ C. For all c ∈ C it holds that H(c, ROOT).
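As an illustration only (not part of the paper), the components of Definition 1 can be sketched as a small data structure; the class, field, and example names below are hypothetical:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class CoreOntology:
    """Minimal sketch of the sign system O := (L; F; C; H; ROOT) from Definition 1."""
    lexicon: set[str]                   # L: set of lexical terms
    concepts: set[str]                  # C: set of concepts
    reference: dict[str, set[str]]      # F: maps lexical entries to the concepts they refer to
    hierarchy: set[tuple[str, str]]     # H: (c1, c2) pairs, c1 taxonomically below c2
    root: str = "ROOT"                  # top concept; H(c, ROOT) holds for all c

    def concepts_of(self, term: str) -> set[str]:
        """Reference function F applied to a single lexical entry."""
        return self.reference.get(term, set())

    def terms_of(self, concept: str) -> set[str]:
        """Inverse reference function F^-1: lexical entries referring to a concept."""
        return {t for t, cs in self.reference.items() if concept in cs}

# Hypothetical toy example for a book-search domain.
onto = CoreOntology(
    lexicon={"author", "writer", "title"},
    concepts={"Author", "Title"},
    reference={"author": {"Author"}, "writer": {"Author"}, "title": {"Title"}},
    hierarchy={("Author", "ROOT"), ("Title", "ROOT")},
)
print(onto.terms_of("Author"))  # {'author', 'writer'}
```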
Based on the frame system of the above core ontology, a semi-automatic approach to building a domain ontology from a given set of query forms is proposed. First, the attribute features of the Deep Web forms are parsed. The OntoBuilder project supports the extraction of attributes from web search forms and saves them in XML format [21]. From these XML files, concepts and instances of concepts are then extracted, with which we build the domain ontology. Fig. 1 and Fig. 2 show the extracted attributes of a search form and the corresponding XML file, respectively.
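As a rough sketch of this extraction step (the XML tag and attribute names 'term', 'name', and 'value' are assumptions, since the exact OntoBuilder output schema is not reproduced in this excerpt), concepts and their instances could be collected from such a file as follows:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def extract_concepts(xml_path):
    """Collect concept names and their instance values from a form-attribute XML file.

    The tag/attribute names used here are hypothetical placeholders for whatever
    the extracted XML files actually contain.
    """
    tree = ET.parse(xml_path)
    concepts = defaultdict(set)
    for term in tree.getroot().iter("term"):   # one element per extracted form attribute
        name = term.get("name")                # concept label, e.g. "departure city"
        if not name:
            continue
        # Drop-down options, defaults, etc. become instances of the concept.
        values = {v.text.strip() for v in term.iter("value") if v.text}
        concepts[name] |= values
    return dict(concepts)

# Hypothetical usage: build the concept/instance dictionary for one search form.
# print(extract_concepts("flight_search_form.xml"))
```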