Kou Y, Shen DR, Yu G et al. Combining local scoring and global aggregation to rank entities for deep Web queries. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 24(4): 626–637 July 2009
Combining Local Scoring and Global Aggregation to Rank Entities for Deep Web Queries

Yue Kou (寇月), Student Member, CCF, De-Rong Shen (申德荣), Senior Member, CCF, Ge Yu (于戈), Senior Member, CCF, Member, ACM, IEEE, and Tie-Zheng Nie (聂铁铮), Student Member, CCF
College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
E-mail: {kouyue, shenderong, yuge, nietiezheng}@ise.neu.edu.cn
Revised March 13, 2009.

Abstract    With the rapid growth of Web databases, it is necessary to extract and integrate large-scale data available in the Deep Web automatically. However, current Web search engines conduct page-level ranking, which is becoming inadequate for entity-oriented vertical search. In this paper, we present an entity-level ranking mechanism called LG-ERM for Deep Web queries based on local scoring and global aggregation. Unlike traditional approaches, LG-ERM considers more rank influencing factors, including the uncertainty of entity extraction, the style information of the entities, the importance of the Web sources, and the entity relationships. By combining local scoring and global aggregation in ranking, the query result can be more accurate and effective in meeting users' needs. The experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.

Keywords    Deep Web, entity-level ranking, local scoring, global aggregation

1 Introduction
The immense scale and wide spread of the Web have rendered it the largest data repository. With the rapid growth of Web databases, it is necessary to extract and integrate large-scale data available in the Deep Web automatically[1]. In many domains like e-commerce, customer support and digital libraries, unstructured documents from Web sources are often related to real-world entities. Typical entities are products, papers, persons or publishers, etc., which are embedded in Web pages or data sources. With page-level retrieval, users have to read an entire page to determine whether it contains the desired entities. Compared with page-level search over static Web pages, entity-level search is much more appealing to users. For instance, users often search for a specific product, hoping to acquire a list of relevant entities with clear information such as name, price quotation, performance, and so on. If these domain-specific target entities can be extracted and ranked accurately, the quality of query results will improve remarkably. Most current Web search engines conduct page-level ranking, which is becoming inadequate for entity-oriented vertical search, especially for Deep Web queries. For example, in a digital library application, one might want to find the top "paper" entities in the area of "Web search". As shown in Fig.1, because they apply only page-level retrieval and ranking, current Web search engines may generate some irrelevant query answers. At the same time, many good pages listing relevant papers may be missed. The goal of entity-level ranking is to return the best target entities that meet the user's need.
Fig.1. Results for the query: “Web Search Paper” using Web search engines.
The high importance of entity-level ranking has triggered a huge amount of research. However, most
Regular Paper Supported by the National Natural Science Foundation of China under Grant No. 60673139 and the National High Technology Development and Research 863 Program of China under Grant No. 2008AA01Z146.
have considered only the local characteristics of entities within their Web pages (such as the uncertainty of entity extraction, the confidence of Web sources and so on). We call them ranking approaches based on local scoring (e.g., [2–4]). At the same time, many researchers have proposed ranking approaches that focus on ranking and aggregating structured entities based on the global frequency of relevant documents or Web pages (e.g., [5, 6]). Correspondingly, we call them ranking approaches based on global aggregation. However, relying only on local scoring or global aggregation, we cannot acquire an overall ranking result. Therefore several studies proposed mixed ranking approaches that combine the techniques of local scoring and global aggregation (e.g., [7–10]). But most current mixed ranking approaches come at the cost of complicating users' input or dynamic data maintenance. So in this paper, we present an entity-level ranking mechanism called LG-ERM for Deep Web queries based on local scoring and global aggregation. Our challenge includes two major aspects: 1) How to determine the entity type that the user wants, and how to match it with the entity type offered by each Web source? In the example given in Fig.1, from the query keywords we can deduce that the target entity type is paper. If the identification of the entity type can be processed automatically, the search range can be limited intelligently to the relevant entity set. With the entity type determined, entity type matching between the query and each Web source can be done. 2) How to rank the entities to return the most relevant ones? Current Web search engines conduct page-level search, which makes users scrutinize the returned pages to dig for the entities they want. We need an efficient method to perform entity-level ranking accurately. The primary contributions of this paper are as follows.
Firstly, a new mechanism that enables entity-level ranking for Deep Web queries is constructed, which, unlike traditional approaches, considers more rank influencing factors. Secondly, by combining local scoring and global aggregation in ranking, the query result supplied by LG-ERM can be more accurate in meeting the user's need. Thirdly, an experimental study is conducted to demonstrate the feasibility and effectiveness of the key techniques of LG-ERM. The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the model of LG-ERM and Section 4 introduces query pre-processing. Sections 5 and 6 introduce the processes of local scoring and global aggregation, respectively. Section 7 discusses our ranking algorithm. Section 8 presents the experimental results and Section 9 concludes the paper.
2 Related Work
Our work is related to the existing literature in two major categories: entity extraction and entity ranking. We now review these categories respectively.

2.1 Entity Extraction
First, our mechanism relies on entity extraction techniques to extract entities. Much work has been done to address the problem of entity extraction. For example, Etzioni et al.[11] proposed an autonomous system, KnowItAll, to extract facts, concepts and relationships from single documents on the Web. Cai et al.[12] presented block-based Web search via segmenting Web pages into blocks. Cheng et al.[8] presented MetaQuerier for Deep Web queries and used an extractor during the extraction process. Nie et al.[9,13] explored effective models to retrieve Web objects by integrating record extraction and attribute extraction. In our previous work, we presented a DOM-tree-based Deep Web entity extraction model called D-EEM[14] to handle the schema extraction problem. Via D-EEM, we can extract the entities from Web pages and acquire the uncertainty about attribute-level extraction, which will be considered as one of the rank influencing factors.

2.2 Entity Ranking
As described in Section 1, entity-level ranking approaches can be classified into three major categories: ranking approaches based on local scoring, ranking approaches based on global aggregation, and mixed ranking approaches combining local scoring and global aggregation. Below we introduce related work in each category. As for the ranking approaches based on local scoring, some researchers stressed the ranking factors about the local uncertainty of entities. For instance, Dong et al.[2] analysed different levels of uncertainty during data integration to decide approximate schema mappings and compute the Top-K answers. Also, some optimization techniques were proposed to improve the accuracy of existing ranking algorithms. For example, Jin et al.[3] presented some ranking optimization functions and considered users' feedback during ranking. Many researchers emphasized ranking structured entities based on global aggregation. Intuitively, if an entity is returned by many Web sources, it will probably be the more desired one. So many researchers scored the entities based on their occurrence frequency in the result set. For example, Chakrabarti et al.[6] exploited the relationships between the documents
containing the keywords and the target objects related to those documents. Top-K objects are calculated based on the frequency of entities and the aggregated scores of documents. There is also some work that applies mixed ranking approaches by combining local scoring and global aggregation. For example, Cheng et al.[7,8] built entity search engines by requiring users to indicate clearly the entity types (such as paper, author or publisher) and context patterns in their requests. According to the input information, a series of matching steps is performed to select the best entities from Web sources. These techniques mainly aimed at the surface Web rather than the Deep Web, and they relied on query requests in specific patterns, which possibly complicates users' operations. Also, some researchers proposed a domain-independent entity-level link-analysis model to rank objects based on a Web data warehouse with the capability of handling structured queries. For example, Nie et al.[9,10] utilized a Web data warehouse to pre-store all of the extracted entities. They focused on data extraction and schema matching. But the rank influencing factors were not given clearly. In addition, because the entities were pre-stored, it was difficult to maintain the validity of the dynamic data.

3 Overview of LG-ERM Model
Let D be the Web data from the relevant Web sources collected by a Web crawler. We use E = {e_1, e_2, ..., e_n} to denote the set of entities extracted from D and use R = {⟨e_1, score(e_1)⟩, ⟨e_2, score(e_2)⟩, ..., ⟨e_k, score(e_k)⟩} to indicate the Top-K answers relevant to the user's query Q. Here e_i means the i-th entity in the result list with the score score(e_i). Our goal is to rank the given E and acquire the ranked list R accurately. We take into account the following rank influencing factors. 1) Extraction uncertainty: on one hand, the query needs to be reformulated and mapped to a predefined entity type (e.g., a product type or a paper type) with a global schema, which results in uncertainty about query reformulation. On the other hand, wrongly labelling an attribute of an entity (e.g., wrongly labelling the title of a paper) results in uncertainty about attribute-level extraction, which further results in uncertainty of entity extraction. 2) Style information: the relevance between keywords and entities often depends on how distinctly the keywords appear in the Web source. 3) Importance of the Web sources: a Web source ranked highly by a search engine often garners better popularity and more interest from users. 4) Entity relationship: duplicate entities should
be identified and their frequency will be considered as the main evidence in global aggregation. As can be seen in Fig.2, the LG-ERM model consists of three layers: the extraction layer, the local scoring layer and the global aggregation layer. Firstly, via query reformulation, the input query is pre-processed to determine the entity type that the user wants and is mapped to a global schema. Secondly, in the extraction layer, the entities along with their schemas, styles and source information are extracted from the Web sources by entity extractors (i.e., components of D-EEM). Thirdly, in the local scoring layer, entity type matching between the query and the entities is performed by local matchers, which are also responsible for quantifying the relevance between them. The local score of each entity is computed as a combination of the degree of entity type matching, the relevance and the importance of the Web sources. Fourthly, in the global aggregation layer, duplicate entities from different Web sources are identified. The global aggregator collects these local entities and ranks them by aggregating their local scores. Finally, a ranked list of entities along with their global scores is returned to the user. As for entity extraction, we use D-EEM to extract the entities from Web pages. In this paper, we focus only on the ranking strategy, with the assumption that these entities have been extracted in advance.
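The layered flow just described can be summarized in a short sketch. The `Entity` record, the scorer callbacks, and the summation used for aggregation are all our own simplifications for illustration; the paper's actual scoring and aggregation formulas are developed in Sections 5–7.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    # Hypothetical record produced by the extraction layer.
    values: dict          # attribute name -> extracted value
    source: str           # identifier of the originating Web source
    local_score: float = 0.0

def rank(entities, local_scorer, find_duplicates, top_k):
    """Sketch of LG-ERM's two ranking layers: local scoring within each
    source, then global aggregation over sets of duplicate entities."""
    for e in entities:                            # local scoring layer
        e.local_score = local_scorer(e)
    ranked = []
    for dup_set in find_duplicates(entities):     # global aggregation layer
        # One simple aggregation (summation of local scores) as an assumption.
        global_score = sum(e.local_score for e in dup_set)
        best = max(dup_set, key=lambda e: e.local_score)
        ranked.append((best, global_score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

An entity returned by many sources accumulates a larger aggregated score, which matches the intuition stated for global aggregation.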
Fig.2. Overview of LG-ERM model.
4 Query Pre-Processing
In this section, we discuss the details of query pre-processing, which includes two aspects: source data pre-processing and query request pre-processing. Correspondingly, the adopted techniques of entity type mapping and query reformulation will be introduced.
4.1 Entity Type Mapping
In this paper, the Web can be considered as a repository containing all kinds of entities. For a specific domain, the entity types may be limited to a specific entity type set. For example, for the domain of paper publication, the entity types include paper, publisher, author, and so on. Furthermore, the entities of the same entity type can be described by similar attributes. Therefore, for each domain-specific entity type, domain experts can pre-define a global schema to describe its structure, attributes, and features. For instance, some global schemas such as paper(title, author-name, conference-name, date, ...) and publisher(publisher-name, address, ISSN, ...) can be pre-defined, each of which is used to describe a specific entity type. Compared with these pre-defined global schemas, the schemas supplied by Web sources may not strictly conform to the global schemas. The aim of entity type mapping is to determine which entity type a Web source can supply, that is, to map from each source schema to one of the global schemas. Thus, the similarity between the source schema and the global schema should be quantified. Since entity type mapping is query-independent, it can be pre-processed. We use S_global to represent a pre-defined global schema which contains the attributes {a_1, ..., a_m} to describe a certain entity type. Similarly, we refer to each entity type in a Web source as S_source, which contains the attributes {b_1, ..., b_n}. During entity type mapping, the effect of different attributes may differ, hence we should assign different weights to different attributes. Similar to existing approaches on attribute ordering (e.g., [15]), we use the approximate functional dependencies among the attributes {a_1, ..., a_m} to quantify their importance and assign weights {w_1, ..., w_m} to them automatically. The weight of a_i is calculated as (1), where a_i ∈ â ⊆ S_global and j ≠ i. Here error(â → a_j) represents the minimum fraction of entities that violate the functional dependency â → a_j. If an attribute highly influences other attributes, its weight will be higher.

    w_i = Σ_{j=1}^{m} (1 − error(â → a_j)) / |â|.    (1)
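The weight computation of (1) can be sketched as follows, assuming the error fractions error(â → a_j) have already been estimated from the data. For simplicity the determining set â is taken to be the single attribute a_i, so |â| = 1; that simplification is ours, not the paper's.

```python
def attribute_weights(errors):
    """Compute w_i = sum_j (1 - error(a_hat -> a_j)) / |a_hat| per (1).

    `errors` maps each attribute a_i to the list of error fractions
    error(a_hat -> a_j) for every other attribute a_j. We simplify
    a_hat to the singleton {a_i}, so |a_hat| = 1 (illustrative choice).
    """
    weights = {}
    for a_i, errs in errors.items():
        a_hat_size = 1  # simplified: a_hat = {a_i}
        weights[a_i] = sum((1 - e) / a_hat_size for e in errs)
    return weights
```

An attribute whose functional dependencies are rarely violated (small error fractions) receives a higher weight, matching the intuition that it strongly determines the other attributes.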
Based on the above concepts, we can formally define the semantics of entity type mapping as follows.
Definition 1 (Entity Type Mapping). An entity type mapping M is a quadruple (S_global, S_source, W, AP), where S_global and S_source are the entity type schemas to be mapped, W is the set of weights {w_1, ..., w_m} assigned to the attributes {a_1, ..., a_m}, and AP (composed of {⟨Pair(a_i, b_j), P_j⟩}, P_j > 0) is a set of attribute pairs between S_global and S_source with an attribute-level uncertainty P_j which is acquired during entity extraction.
Definition 2 (Matched Attribute). A matched attribute is an attribute contained in S_source that corresponds to at least one attribute of S_global (as defined in (2)).

    A_matched = {b_j | b_j ∈ S_source ∧ ∃a_i (a_i ∈ S_global ∧ ⟨Pair(a_i, b_j), P_j⟩ ∈ AP)}.    (2)
Given an entity type mapping M and the matched attributes Amatched , we can compute the similarity between Sglobal and Ssource with the following steps (as shown in Fig.3).
Fig.3. Process of entity type mapping.
Step 1. Construct the entity type vectors S_global(w_1, ..., w_l) and S_source(s_1, ..., s_l), where w_i is the weight assigned to a_i, and s_i is computed based on the attribute pairs and the attribute-level uncertainty (as defined in (3)). The uncertainty about attribute-level extraction may result from wrongly labeling the attributes of an entity during extraction. Here we use our entity extraction mechanism to handle the attribute identification problem and acquire the attribute-level uncertainty values.

    s_i = { w_i × max{P_j},  if ∃b_j (b_j ∈ A_matched ∧ ⟨Pair(a_i, b_j), P_j⟩ ∈ AP);
          { 0,               otherwise.                                            (3)

Step 2. Compute the similarity between S_global(w_1, ..., w_l) and S_source(s_1, ..., s_l) based on the cosine similarity function Sim, which can better measure their differences (as described in (4)). The ideal value of Sim is 1, reached when each element in S_source(s_1, ..., s_l) is equal to the corresponding element in S_global(w_1, ..., w_l). The higher the value of Sim, the more similar
S_global(w_1, ..., w_l) and S_source(s_1, ..., s_l) will be. Note that during the similarity computation we only consider the matched attributes (whose number is l), while the other attributes of S_source (called the unmatched attributes) are temporarily eliminated.

    Sim(S_global, S_source) = Σ_{i=1}^{l} (w_i × s_i) / (√(Σ_{i=1}^{l} w_i²) × √(Σ_{i=1}^{l} s_i²)).    (4)
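A minimal sketch of the vector construction in (3) and the weighted cosine similarity of (4). The dictionary-based representation of schemas and certainty values is our own illustrative choice, not a data structure from the paper.

```python
import math

def source_vector(weights, attr_pairs):
    """Build s_i per (3): w_i * max{P_j} over the matched source
    attributes, else 0. `weights` maps a global attribute a_i to w_i;
    `attr_pairs` maps a_i to the list of attribute-level certainties
    P_j of its matched source attributes (empty/missing if unmatched)."""
    return [w * max(attr_pairs[a]) if attr_pairs.get(a) else 0.0
            for a, w in weights.items()]

def cosine(w_vec, s_vec):
    """Sim per (4): cosine similarity over the matched attributes."""
    num = sum(w * s for w, s in zip(w_vec, s_vec))
    den = (math.sqrt(sum(w * w for w in w_vec)) *
           math.sqrt(sum(s * s for s in s_vec)))
    return num / den if den else 0.0
```

The revision step of (5) would then scale this value by l / MAX{m, l + u} to penalize unmatched attributes.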
Step 3. Revise the similarity between S_global(w_1, ..., w_l) and S_source(s_1, ..., s_l) obtained in Step 2. During the above steps we only consider the similarity among the matched attributes. But the unmatched attributes are also important for evaluating the mapping degree between the source schema and the global schema. Intuitively, the fewer the unmatched attributes, the higher the mapping degree should be. So the similarity is revised as (5), where l and u represent the numbers of the matched and unmatched attributes in S_source, respectively, and m is the number of attributes in S_global.

    Sim′(S_global, S_source) = Sim(S_global, S_source) × l / MAX{m, l + u}.    (5)

4.2 Query Reformulation
Users always want to find information about domain-specific entities, such as in paper search or product search. But their queries are possibly represented as keywords rather than as structured queries against some pre-defined schemas. We need to reformulate the user's input into two parts: an entity type description and a semantic description. The former describes the target entity's type, while the latter describes the semantics of the query.
Definition 3 (Query Reformulation). Query reformulation means to reformulate a query as a triple (S_global, P_Q, Desc), where S_global means the global schema that corresponds to the target entity type with a reformulation uncertainty P_Q, and Desc, which consists of the query keywords {k_1, ..., k_n}, represents the semantic description of the target entity.
As mentioned in Subsection 4.1, the goal of entity type mapping is to determine the entity types that Web sources can supply. Different from entity type mapping, the goal of query reformulation is to determine the entity type that the user wants according to the query request and to represent it as one of the pre-defined global schemas.
Fig.4. Process of query reformulation.
In order to determine the target entity type, first the characteristic words of each predefined entity type should be extracted and described as a characteristic vector. Then the user's query request is compared with each characteristic vector to select the best-matching one as the target entity type. The above process can be implemented via text categorization techniques, which are supported by different models including the probabilistic model, the example-based model, the linear model and so on (e.g., [16, 17]). So, via existing text categorization techniques, we can analyse and classify the user's request according to its type or theme. The primary steps are as follows (as shown in Fig.4). Firstly, we consider the Web data as the training set and manually classify it into some entity types. For each predefined entity type, the characteristic words are extracted and described as a characteristic vector. Secondly, the query vector consisting of the query keywords is constructed. Finally, the similarity (denoted as P_Q) between the query vector and each characteristic vector is computed via the cosine similarity function, and the global schema with the most similar characteristic vector is selected as the entity type description of the query. As can be seen in Fig.5, the effects of query reformulation consist of two aspects. On one hand, the entity type description of the query is determined to provide a basis for the subsequent entity type matching. We can use the mapping relations to further match the entity type between the query request and each Web source (as described in Subsection 5.1). On the other hand, the semantic description Desc of the query can be used to select the candidate entities and to make
further style-based scoring possible (as described in Subsection 5.2). Here the candidate entities mean the entities containing the keywords in Desc. For example, as illustrated in Fig.1, via query reformulation we can determine that the global schema of the target entity type is paper(title, author-name, conference-name, date, ...) with the semantic description "Web search". So we should search for entities within the Web sources that can supply the entity type "paper". Then the entities containing the keywords "Web search" will be returned as candidate entities.
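The categorization step can be sketched as a cosine match between the query vector and per-type characteristic vectors. The characteristic vectors here are hypothetical stand-ins for those learned from the training set; the function names are ours.

```python
import math
from collections import Counter

def reformulate(query_keywords, char_vectors):
    """Pick the entity type whose characteristic vector is most similar
    to the query vector; return (entity_type, P_Q) per the reformulation
    step. `char_vectors` maps an entity type to a Counter of its
    characteristic words (assumed to come from offline training)."""
    q = Counter(query_keywords)
    best, best_sim = None, 0.0
    for etype, cv in char_vectors.items():
        num = sum(q[w] * cv[w] for w in q)            # dot product
        den = (math.sqrt(sum(v * v for v in q.values())) *
               math.sqrt(sum(v * v for v in cv.values())))
        sim = num / den if den else 0.0               # cosine similarity
        if sim > best_sim:
            best, best_sim = etype, sim
    return best, best_sim
```

For a query like "Web Search Paper" matched against characteristic vectors for "paper" and "publisher", the "paper" type would win, and the returned similarity plays the role of the reformulation uncertainty P_Q.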
Fig.5. Effects of query reformulation.
5 Local Scoring
The ranking rules adopted by different Web sources may be inconsistent. For example, each paper retrieval site may sort results by title, year or popularity. Therefore, given a series of candidate entities extracted from different Web sources, we should use a uniform ranking rule to rearrange them. In this section we present the local scoring method, i.e., how to score the candidate entities within their local Web sources.

5.1 Probabilistic Entity Type Matching
The schemas supplied by some Web sources may be incomplete and may not conform to the user's request, which influences the quality of the entities contained in them. The aim of entity type matching is to measure the similarity of entity types between the Web sources and the query request. That is, given a Web source, we should determine whether it can supply answers of the entity type that the user wants, which needs three steps. Firstly, we should determine the entity type each Web source supplies. Secondly, we should determine the entity type that the user wants. Finally, the similarity between them should be quantified. Based on query pre-processing, we have acquired some valuable data which can facilitate the process of entity type matching. On one hand, during query reformulation, the target entity type has been determined with an uncertainty P_Q. On the other hand,
during entity type mapping, each source schema has been mapped to a global schema with a mapping similarity Sim′(S_global, S_source). By adding the query reformulation uncertainty P_Q to the entity type mapping M, we can directly compute the probabilistic similarity between the target entity type and each source's entity type (as described in (6)).

    M.similarity = P_Q × Sim′(S_global, S_source).    (6)

5.2 Style-Based Scoring
Besides the degree of entity type matching, the relevance between the query and candidate entities is also considered. Web page designers always give prominence to important things and deemphasize the unimportant parts. They often use effects such as bold, hyperlinks or bright colors to emphasize the main subjects they provide. Our solution is based on the assumption that relevant entities often include the keywords in emphasized styles. For example, for most paper retrieval sites, it is general practice to use the bold effect on a paper's title. Then the papers with the query keywords appearing in their titles are more relevant. During entity extraction, we analyzed the characteristics of Web pages including the style information. Based on the style information, the relevance between the query and candidate entities can be measured effectively. The relevance is computed as follows.
Step 1. Divide the content of each entity into different information chunks V = {v_1, ..., v_|V|} according to its style information. Here a chunk means the content of an entity which is in the same style. For example, all the words with the bold effect within an entity constitute a chunk.
Step 2. Assign a weight F(v_i) to each v_i according to its importance. There are two possible ways to deduce a chunk's importance. The weights can be assigned by a domain analyst. Alternatively, the weights can be learned from a training set using a supervised learning algorithm such as SVM. The first way is straightforward, but it is often desirable to minimize the participation of the analyst. Therefore we address the second case, which is to learn the weights automatically. Firstly the chunks are manually pre-labeled and each labeled chunk can be represented as (v_i, y), where y is the pre-defined weight of the chunk v_i. The set of labeled chunks is referred to as the training set T. Thus the problem becomes finding a function F such that Σ_{(v_i,y)∈T} |F(v_i) − y|² is minimized.
The procedure of function determination has been implemented by some techniques such as neural network or SVM (e.g., [18]). Thus we do not go into details here. The determined function can be used directly to map from a style to a
weight for each chunk.
Step 3. For each v_i in an entity, compute its keyword occurrence ratio R(v_i). R(v_i) quantifies the fraction of the query's description keywords (Desc: {k_1, ..., k_n}) affiliated with v_i (as described in (7)).

    R(v_i) = |v_i ∩ Desc| / |Desc|.    (7)
Step 4. For each entity e, compute the distinction degree D(e) by combining its F(v_i) and R(v_i) (as described in (8)). The higher the value of D(e), the more relevant e is to the query subject.

    D(e) = Σ_{i=1}^{|V|} R(v_i) × F(v_i).    (8)
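The chunk-based relevance computation of (7) and (8) can be sketched as follows. The style names and weight values are hypothetical, and the learned weight function F is assumed to be given as a simple lookup table.

```python
def distinction_degree(chunks, weights, desc):
    """D(e) per (7)-(8): sum over style chunks of R(v_i) * F(v_i).

    `chunks` maps a style name (e.g. 'bold') to the set of words of the
    entity rendered in that style; `weights` gives the learned chunk
    importance F per style (assumed pre-trained); `desc` is the set of
    query description keywords Desc."""
    d = 0.0
    for style, words in chunks.items():
        r = len(words & desc) / len(desc)   # keyword occurrence ratio R(v_i)
        d += r * weights.get(style, 0.0)    # weighted by chunk importance F(v_i)
    return d
```

An entity whose bold chunk contains all the query keywords thus scores higher than one where the keywords appear only in plain text, which is exactly the style assumption behind (8).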
Step 5. By considering the distinction degrees of all the entities appearing in one Web source, we can compute the relevance between the query and each candidate entity (as defined in (9)). For each Web source, we first pick the entity that has the highest distinction degree. Then each candidate entity in this Web source is compared with the picked entity to compute its relevance.

    e_i.relevance = D(e_i) / MAX{D(e_1), ..., D(e_|E|)}.    (9)

5.3 Web Sources Scoring

With many Web sources providing similar entity information on the Internet, users often have to scrutinize a large number of different sources to look for the entities they want. Unfortunately, owing to misleading, erroneous and outdated data, some Web sources are not trustworthy. Users expect to search the Web sources with higher quality and confidence. Intuitively, the more authoritative the Web sources are, the higher quality and confidence they possess. For example, the information provided by official Web sources is more reliable than that provided by personal Web sources. Thus the importance of Web sources should be measured. Many search engines sort the retrieved pages by different rank criteria. One popular way is to exploit the additional information inherent in the Web's hyperlink structure. One successful link-based ranking system is PageRank[19−21], which is independent of the query subject and is used by the Google search engine. This method considers the whole Web as a hyperlinked graph. We use the basic idea of PageRank to measure the importance of Web sources. If a Web source WS′ has a link to another Web source WS, then the author of WS′ is implicitly endorsing WS, that is, WS′ gives some importance to WS. How much WS′ contributes to the importance of WS is proportional to the importance of WS′ itself (as defined in (10)). Here N is the total number of Web sources in the link graph, out(WS′) is the out-degree of WS′ (which has a link to WS), α is the random jump probability, and I(WS) is the score of the Web source WS.

    I(WS) = α × 1/N + (1 − α) × Σ_{WS′ | WS′→WS} I(WS′) / out(WS′).    (10)

The importance measuring can be done offline. Thus a sorted source list can be generated, and we can focus the entity search only on the important Web sources. Therefore, not only can the process of entity search be sped up, but also some potential noise contained in bad sources can be better isolated. Taking into account the above rank influencing factors, we assign a local score to each entity e_i within its Web source WS (as described in (11)).

    LocalScore(e_i) = M.similarity × e_i.relevance × I(e_i.WS).    (11)

6 Global Aggregation
Intuitively, if an entity appears in multiple Web sources, it is more likely to be the accurate answer. Thus the frequency of answers in Web sources is considered as the main evidence to score the candidate entities globally. But because the same entity may be represented in multiple ways within different Web sources, the frequency of answers may be calculated inaccurately. Thus we must identify the duplicate entities in the candidate entity set. Then for each entity, not only its exact equivalents but also its potential duplicates are considered as the same entity. For the process of duplicates identification, we aim at supporting both offline knowledge construction and online entity matching. The former is applied to the initial clusters generation, while the latter is used to further identify the final duplicate sets. Based on these duplicate sets, the global score of each candidate entity will be computed.

6.1 Initial Clusters Generation
In our previous work, we presented a Deep Web Entity Identification Mechanism based on Semantics and Statistical analysis (SS-EIM)[22]. This mechanism can identify the duplicate entities in large data sets offline. Through text initial matching, semantic relationship abstraction and group statistics analysis, the entity relationship knowledge consisting of a
series of non-overlapping clusters KC = {kc_1, ..., kc_|KC|} is constructed. Based on the entity relationship knowledge, the initial clusters can be generated, each of which corresponds to the same entity. The process of initial clusters generation includes the following steps.
Step 1. For each candidate entity e_i, its affiliated cluster will be identified. If e_i is contained within the entity relationship knowledge, it will be clustered according to KC = {kc_1, ..., kc_|KC|}. Otherwise, the entity will constitute a new cluster (as described in (12)). Then the initial cluster set C = {c_1, ..., c_|C|} consisting of such e_i.c is generated.

    e_i.c = { kc_j,   if e_i ∈ kc_j;
            { c_new,  otherwise.        (12)

Step 2. Select the pivot entity e_p for each initial cluster. The higher an entity's local score, the more representative it is within its cluster. So, for each cluster, we select the entity e_p with the highest local score as the pivot entity.

6.2 Online Duplicates Identification
The initial clusters mainly depend on offline statistical analysis, which may rest on incomplete statistical data, so some duplicate entities may be assigned to different clusters. The initial clusters should therefore be revised further by performing online duplicates identification. Here we present an online identification strategy that overcomes the limitations of the entity relationship knowledge. Through online identification, an undirected entity relationship graph ERG = (V, E) is constructed, where V is the node (entity) set and E is the edge set. As the objects of identification, the pivot entities from the initial clusters are gradually added to ERG. Finally, a series of duplicate sets is generated according to the unconnected components of ERG. The detailed steps are as follows.

Step 1. Compute the similarities among the pivot entities. For each attribute pair e_i.a_k and e_j.a_k (k = 1 ∼ l), the attribute-level similarity is calculated by the function Sim(e_i.a_k, e_j.a_k). In our current implementation, we provide a variety of similarity
functions (e.g., a string matching function, a date matching function, a price matching function, etc.) that can be selected according to attribute types[22]. The similarities of multiple attribute pairs are then combined into the entity-level similarity (13), where s_k is the weight of a_k (as introduced in Subsection 4.1). Finally we acquire a list of entity pairs with their similarities (denoted as EP = {⟨Pair(e_i, e_j), Sim(e_i, e_j)⟩}).

    Sim(e_i, e_j) = Σ_{k=1}^{l} s_k × Sim(e_i.a_k, e_j.a_k).        (13)
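As a concrete illustration of (13), the sketch below combines per-attribute similarities with weights s_k. The token-overlap measure and the example attributes are illustrative stand-ins for the type-specific similarity functions of [22]:

```python
def token_overlap(x, y):
    # Illustrative attribute-level similarity: Jaccard overlap of word tokens.
    a, b = set(str(x).lower().split()), set(str(y).lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def entity_similarity(ei, ej, weights, sim_fns=None):
    # Entity-level similarity (13): weighted sum of attribute-level similarities.
    return sum(
        weights[k] * (sim_fns[k] if sim_fns else token_overlap)(ei[k], ej[k])
        for k in weights
    )

e1 = {"title": "web object retrieval", "year": "2007"}
e2 = {"title": "web object retrieval", "year": "2008"}
score = entity_similarity(e1, e2, {"title": 0.8, "year": 0.2})  # titles match, years differ
```

Passing a `sim_fns` dictionary allows a different matcher per attribute type, mirroring the selectable similarity functions mentioned above.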
Step 2. Construct ERG based on the similarities among entities. If two pivot entities from EP are similar enough, that is, their similarity exceeds the predefined similarity threshold, they are connected in ERG by an edge weighted with their similarity. Otherwise, they are assigned to different unconnected components.

Step 3. Combine the initial clusters according to ERG. If pivot entities from different initial clusters lie within the same unconnected component, these initial clusters are combined into one duplicate set.

Consider the following example. Suppose the initial clusters are {c_1, ..., c_7} with seven pivot entities {c_1.e_p, ..., c_7.e_p} and the predefined similarity threshold is 0.5. The goal is to group them correctly. Assume the similarities among them have been computed (as shown in Fig.6(a)). The process of ERG construction is as follows. Firstly, because the similarity of the entity pair (c_1.e_p, c_5.e_p) exceeds the threshold, they are connected (as shown in Fig.6(b)). The entity pair (c_1.e_p, c_2.e_p) is also similar enough, so c_2.e_p is added to the unconnected component containing c_1.e_p (as shown in Fig.6(c)). Analogously, (c_3.e_p, c_4.e_p) constitutes a new unconnected component and c_2.e_p is connected with c_5.e_p by an edge weighted 0.6 (as shown in Figs. 6(d) and 6(e)). The remaining entities, whose similarities do not exceed the threshold, remain as separate unconnected components (as shown in Fig.6(e)). Consequently, all the pivot entities have been added to ERG and the final duplicate sets are {c_1 ∪ c_2 ∪ c_5, c_3 ∪ c_4, c_6, c_7}.
Fig.6. Demonstration of duplicates identification.
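The thresholded merging in Steps 2 and 3 amounts to computing connected components over the pivot entities. The sketch below replays the example with a small union-find structure; the pairwise similarity values are illustrative, chosen to be consistent with the description above (Fig.6(a) is not reproduced here):

```python
def duplicate_sets(entities, similarities, threshold=0.5):
    # Connect every pair whose similarity exceeds the threshold,
    # then read the duplicate sets off the connected components.
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        if sim > threshold:
            parent[find(a)] = find(b)      # union the two components

    groups = {}
    for e in entities:
        groups.setdefault(find(e), set()).add(e)
    return sorted(groups.values(), key=min)

pivots = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
sims = {("c1", "c5"): 0.7, ("c1", "c2"): 0.8,   # exceed the 0.5 threshold
        ("c3", "c4"): 0.6, ("c2", "c5"): 0.6,
        ("c6", "c7"): 0.2}                      # below the threshold
groups = duplicate_sets(pivots, sims)
# → [{"c1", "c2", "c5"}, {"c3", "c4"}, {"c6"}, {"c7"}]
```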
6.3 Global Score Computing
Based on the generated duplicate sets, the frequency of answers can be computed more accurately and used to score the entities globally. The process of global score computing includes two steps (as shown in Fig.7).
Fig.7. Process of global score computing.
Step 1. For each final duplicate set, select a new pivot entity from the entities contained in the set. Here the entity with the highest local score in the final duplicate set is selected as the new pivot entity.

Step 2. For each duplicate set c_i, compute the global score of its pivot entity e_p (as defined in (14)). The global score of each pivot entity is corroborated by the local scores of the entities in its cluster: the higher the frequency of entities, the higher the global score of the corresponding pivot entity.

    GlobalScore(c_i.e_p) = Σ_{j=1}^{|c_i|} LocalScore(c_i.e_j).        (14)
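A minimal sketch of the two steps, with hypothetical local scores:

```python
def global_score(cluster):
    # Global score (14): sum of the local scores of all entities in the duplicate set.
    return sum(e["localScore"] for e in cluster)

def select_pivot(cluster):
    # The new pivot entity is the member with the highest local score.
    return max(cluster, key=lambda e: e["localScore"])

c = [{"id": "e1", "localScore": 0.75},
     {"id": "e2", "localScore": 0.25},
     {"id": "e3", "localScore": 0.5}]
# select_pivot(c)["id"] → "e1"; global_score(c) → 1.5
```

Summing rather than averaging means that an entity appearing in many sources (a large duplicate set) accumulates a higher global score, which is exactly the frequency evidence motivated at the start of this section.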
7 Ranking Algorithm of LG-ERM
In this section, we present the ranking algorithm of LG-ERM and some optimization measures. The steps of entity ranking are listed in Fig.8. We assume that the entities have already been extracted from the Web sources, that is, we know their relevant information including their schemas, styles and affiliated Web sources (denoted as e.S_source, e.style and e.WS, respectively). Firstly, the query Q is reformulated to determine the target entity type S_global and the semantic description Desc. Secondly, for each entity e in each Web source s, the similarity of entity type between Q.S_global and e.S_source is computed in line 4. In lines 5∼7, the relevance of the current entity is computed based on its distinction and the maximum distinction of the entities seen in s. Then the importance of the Web source supplying the current entity is quantified by calling the function rankWebSource. In line 9, by combining the similarity, relevance and source importance, the local score of each entity is computed. Thirdly, via the entity relationship knowledge and online duplicates identification, the entity set E is transformed into the duplicate sets C. For each cluster c in C, the pivot entity selection procedure determines its e_p. In line 14, the global score of e_p is aggregated from the local scores of the entities in e_p's cluster. Finally, the resultList constructed from each cluster's e_p is sorted in descending order of globalScore and returned.

Ranking Algorithm of LG-ERM
Input: query keywords Q
       entity set E extracted from D
       the pre-processing result of entity type mapping M
Output: the ranked result list resultList related to Q
1:  Reformulate Q into (S_global, P_Q, Desc);
2:  for each Web source s in D do
3:    for each entity e in s do
4:      e.similarity = mapType(Q.S_global, e.S_source, M, P_Q);
5:      e.distinction = computeDist(Q, Desc, e.style);
6:      maxDistinction = s.getMaxDistinction();
7:      Update the relevance of each e in s;
8:      e.sourceRank = rankWebSource(e.WS);
9:      e.localScore = e.similarity × e.relevance × e.sourceRank;
10: Generate the cluster set C from E;
11: for each cluster c in C do
12:   Select the pivot entity e_p from c;
13:   for each entity e in c do
14:     e_p.globalScore = e_p.globalScore + e.localScore;
15:   resultList.add(⟨e_p, e_p.globalScore⟩);
16: Sort resultList in descending order of globalScore;
17: Return resultList;
Fig.8. Ranking algorithm of LG-ERM.
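For illustration, the pipeline of Fig.8 can be sketched as follows. Here `local_score`, `cluster` and the toy `by_title` grouping are simplified stand-ins for mapType/computeDist/rankWebSource and the duplicate identification of Section 6, not the paper's actual implementations:

```python
def lg_erm_rank(sources, local_score, cluster, top_k=None):
    # Skeleton of the LG-ERM pipeline (Fig.8): local scoring per source,
    # duplicate clustering, global aggregation, then sorting by global score.
    entities = []
    for s in sources:                                  # lines 2∼9: local scoring
        for e in s["entities"]:
            e["localScore"] = local_score(e, s)
            entities.append(e)
    results = []
    for c in cluster(entities):                        # lines 10∼15: aggregation
        ep = max(c, key=lambda e: e["localScore"])     # pivot entity of the cluster
        results.append((ep["id"], sum(e["localScore"] for e in c)))
    results.sort(key=lambda r: r[1], reverse=True)     # lines 16∼17
    return results[:top_k] if top_k else results

def by_title(entities):
    # Toy duplicate identification: treat identical titles as the same entity.
    groups = {}
    for e in entities:
        groups.setdefault(e["title"], []).append(e)
    return list(groups.values())

sources = [
    {"rank": 1.0, "entities": [{"id": "p1", "title": "a", "sim": 0.75}]},
    {"rank": 0.5, "entities": [{"id": "p1b", "title": "a", "sim": 0.75},
                               {"id": "p2", "title": "b", "sim": 0.9}]},
]
ranked = lg_erm_rank(sources, lambda e, s: e["sim"] * s["rank"], by_title)
# → [("p1", 1.125), ("p2", 0.45)]
```

Note how the duplicated paper "a" outranks "b" despite a lower per-source score, because its local scores are aggregated across both sources.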
Based on the above algorithm, we can adopt some optimization measures to further improve the performance of LG-ERM. Firstly, in order to reflect the current state of the data, we adopt online processing for entity extraction and ranking, but this may slow down query processing. Caching is an important measure for enhancing query processing efficiency, using the results of previous queries to answer the current one; thus some caching techniques (e.g., [23, 24]) can be applied in our ranking mechanism. Secondly, because local scoring and global aggregation are each performed within a single source or cluster, we can use a multithreading mechanism to run these operations in parallel, which effectively reduces the query response time.
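The second measure can be sketched with a thread pool; `score_source` here is a hypothetical per-source scoring routine standing in for lines 3∼9 of the algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

def score_source(source):
    # Hypothetical per-source local scoring; sources are mutually independent,
    # so each one can be handled by a separate worker thread.
    return [(e["id"], e["sim"] * source["rank"]) for e in source["entities"]]

def parallel_local_scoring(sources, workers=4):
    # One task per Web source; pool.map preserves the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_source = list(pool.map(score_source, sources))
    return [item for chunk in per_source for item in chunk]

scores = parallel_local_scoring([
    {"rank": 1.0, "entities": [{"id": "a", "sim": 0.5}]},
    {"rank": 2.0, "entities": [{"id": "b", "sim": 0.25}]},
])
# → [("a", 0.5), ("b", 0.5)]
```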
8 Experiments

In this section, we first briefly describe the setup of our system, and then present the results of the experiments evaluating the proposed approaches.

8.1
Dataset and Environment

We focus on ranking entities in the digital libraries domain; that is, via the ranking method adopted in LG-ERM, the Top-K papers with higher scores are selected. The dataset was acquired as follows. Firstly, we selected some Web sources as providers of entities: on one hand, official Web sources (ACM and DBLP) with higher confidence; on the other hand, some personal Web sources with lower confidence. Secondly, we manually selected and submitted queries of 2, 4 and 6 keywords to the selected Web sources to acquire result pages. To rank entities effectively, the submitted queries should satisfy two conditions. 1) The query must be general enough that its result set contains as many entities as possible. 2) In order to illustrate the benefit of de-duplication during global aggregation, the results returned from different Web sources should have much overlapping information represented in different formats. Thirdly, the entities embedded in the result pages were extracted via D-EEM, and a series of attribute-level uncertainties P_j was acquired during entity extraction. Finally, a dataset containing 1 million paper records (30% from ACM, 40% from DBLP and 30% from personal Web sources) was obtained to be ranked. Based on this dataset and the corresponding queries, we can estimate the performance of the ranking method presented in this paper, including the time cost and the precision. We conducted the experiments on a 2.8GHz Pentium 4 machine with 1GB RAM and a 160GB disk.

8.2 Time Cost

We report the average time cost needed to rank the entities over 100 queries, divided into the time for local scoring and the time for global aggregation. As illustrated in Fig.9, the average time for local scoring far exceeds the global aggregation time. This offers a large optimization space: for example, we can process local scoring within different Web sources in parallel to further reduce the ranking time. Fig.10 shows the overall time cost of our ranking method on candidate entity sets of different data amounts and query lengths, measured by varying the number of keywords in the query request. As illustrated in Fig.10, the overall time cost is reasonable. Users may find it acceptable to wait for the ranked entities, because it saves them the hassle of manually selecting the relevant entities from the Web sources themselves.
Fig.9. Average time cost of local scoring and global aggregation acting on different candidate data amounts.
Fig.10. Overall time cost of LG-ERM acting on different candidate data amounts for different numbers of query keywords.
8.3 Precision Comparison

We regard the query result obtained through manual sorting as the standard result set. By comparing the qualified Top-K answers returned by our ranking mechanism (denoted as Ranking Set) with the Top-K ones in the corresponding standard set (denoted as Standard Set), the ranking precision for Top-K answers can be calculated (as defined in (15)).

    Precision = |Ranking Set.Top-K ∩ Standard Set.Top-K| / K.        (15)

The purpose of the experiment is to compare the precision of the following five ranking strategies.

1) P (page-level ranking strategy): the most matching Web pages, rather than entities, are ranked and returned. We consider the Top-K answers within each page as a Ranking Set to compute the average precision.

2) TS (entity type matching and style based ranking strategy): the entities are ranked only according to probabilistic entity type matching and style based scoring locally.
3) L (local scoring based ranking strategy): besides the factors considered in TS, the importance of Web sources is also considered for local scoring.

4) LK (local scoring and knowledge based ranking strategy): not only the factors considered in L but also the entity relationship knowledge is considered during the global aggregation.

5) LG (local scoring and global aggregation based ranking strategy): not only the factors considered in LK but also online duplicates identification is adopted during the global aggregation. This is the strategy performed by LG-ERM.
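The precision measure (15) reduces to a set intersection over the Top-K prefixes; a minimal sketch with hypothetical result lists:

```python
def precision_at_k(ranking, standard, k):
    # Precision (15): fraction of the Top-K ranked answers that also
    # appear in the Top-K of the manually sorted standard list.
    return len(set(ranking[:k]) & set(standard[:k])) / k

ranking = ["e1", "e3", "e2", "e7"]   # answers returned by the ranking mechanism
standard = ["e1", "e2", "e3", "e4"]  # manually sorted standard result set
# precision_at_k(ranking, standard, 3) → 1.0; at k=4 → 0.75
```

Note that the measure is order-insensitive within the Top-K window: at k = 3 the two lists agree as sets even though e2 and e3 are swapped.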
Fig.11. Comparison of precision among P, TS, L, LK, LG at K = 10, K = 20 and K = 30.
In Fig.11, the precisions at K = 10, 20 and 30 of the results of the five ranking strategies are shown. As we can see, the strategies that focus on entity-level search (TS, L, LK and LG) achieve better precision than page-level search (P), and the strategies combining local scoring and global aggregation (LK and LG) outperform the other strategies (P, TS and L). Furthermore, with offline knowledge construction and online entity matching, the ranking result generated by LG is more precise than that of LK.

8.4 Parameter Setting
During online duplicates identification, the similarity between two entities is compared with the predefined threshold. In order to determine the threshold, we randomly selected 200 duplicate pairs from the dataset to test the characteristics of duplicates. As shown in Fig.12, the similarities of most duplicate pairs (about 92.5%) fall within the interval [0.4, 0.7]. Therefore the similarity threshold can be set based on this interval to identify duplicates accurately.

Fig.12. Parameter setting about the similarity threshold of duplicates.

9 Conclusion
In this paper, we have presented a new mechanism called LG-ERM to enable entity-level ranking for Deep Web queries. Unlike traditional approaches, LG-ERM considers more rank-influencing factors, including the uncertainty of entity extraction, the style information of entities and the importance of Web sources, as well as the entity relationship. By combining local scoring and global aggregation, the ranked results can be more accurate than those of other ranking strategies. The experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.

Currently the style based scoring depends on the assumption that all the candidate entities contain the query keywords. Sometimes the search result does not satisfy this assumption, which may lead to an incomplete query result. In addition, during Web source scoring, not all Web sources provide hyperlinks to other ones, and in some domains there are almost no hyperlinks among Web sources at all. To handle these cases and improve the availability of our mechanism, we plan to develop new scoring methods that consider more influencing factors (e.g., users' feedback). We will also further study the efficiency of entity ranking and the optimization of the relevant parameter settings.

Acknowledgements We are grateful to Prof. Wei-Yi Meng for his encouragement. We also thank Gao-Shang Sun, Zhen-Dong Qu, Zhen-Hua Wang and Ming-Dong Zhu for their efforts on this work.

References
[1] Chang K C, He B, Li C, Patel M, Zhang Z. Structured databases on the Web: Observations and implications. SIGMOD Record, 2004, 33(3): 61–70.
[2] Dong X, Halevy A Y, Yu C. Data integration with uncertainty. In Proc. the 33rd VLDB, Vienna, Austria, September 23–27, 2007, pp.687–698.
[3] Jin R, Valizadegan H, Li H. Ranking refinement and its application to information retrieval. In Proc. the 17th WWW, Beijing, China, April 21–25, 2008, pp.397–406.
[4] Qin T, Liu T, Zhang X, Wang D, Xiong W, Li H. Learning to rank relational objects and its application to Web search. In Proc. the 17th WWW, Beijing, China, April 21–25, 2008, pp.407–416.
[5] Chaudhuri S, Ramakrishnan R, Weikum G. Integrating DB and IR technologies: What is the sound of one hand clapping? In Proc. the 2nd CIDR, CA, USA, January 4–7, 2005, pp.1–12.
[6] Chakrabarti K, Ganti V, Han J W, Xin D. Ranking objects by exploiting relationships: Computing top-k over aggregation. In Proc. the 25th SIGMOD, Illinois, USA, June 27–29, 2006, pp.371–382.
[7] Cheng T, Yan X, Chang K C C. EntityRank: Searching entities directly and holistically. In Proc. the 33rd VLDB, Vienna, Austria, September 23–27, 2007, pp.387–398.
[8] Cheng T, Chang K C C. Entity search engine: Towards agile best-effort information integration over the Web. In Proc. the 3rd CIDR, USA, January 7–10, 2007, pp.108–113.
[9] Nie Z, Ma Y, Shi S, Wen J, Ma W. Web object retrieval. In Proc. the 16th WWW, Alberta, Canada, May 8–12, 2007, pp.81–90.
[10] Nie Z, Wen J, Ma W. Object-level vertical search. In Proc. the 3rd CIDR, CA, USA, January 7–10, 2007, pp.235–246.
[11] Etzioni O, Cafarella M, Downey D. Web-scale information extraction in KnowItAll. In Proc. the 13th WWW, NY, USA, May 17–20, 2004, pp.100–110.
[12] Cai D, Yu S, Wen J, Ma W. Block-based Web search. In Proc. the 27th SIGIR, Sheffield, UK, July 25–29, 2004, pp.456–463.
[13] Zhu J, Nie Z, Wen J, Zhang B, Ma W. Simultaneous record detection and attribute labeling in Web data extraction. In Proc. the 12th KDD, PA, USA, August 20–23, 2006, pp.494–503.
[14] Kou Y, Li D, Shen D, Yu G, Nie T. D-EEM: A DOM-tree based entity extraction mechanism for deep Web. In Proc. the 5th CNCC, Xian, China, September 25–27, 2008, p.21.
[15] Nambiar U, Kambhampati S. Mining approximate functional dependencies and concept similarities to answer imprecise queries. In Proc. the 7th WebDB, Paris, France, June 17–18, 2004, pp.73–78.
[16] Nigam K, McCallum A K, Thrun S. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000, 39(2): 103–134.
[17] Lertnattee V, Theeramunkong T. Effect of term distributions on centroid-based text categorization. Information Sciences, 2004, 158(1): 89–115.
[18] Song R, Liu H, Wen J. Learning block importance models for Web pages. In Proc. the 13th WWW, NY, USA, May 17–20, 2004, pp.203–211.
[19] Bianchini M, Gori M, Scarselli F. Inside PageRank. ACM Transactions on Internet Technology, 2005, 5(1): 92–128.
[20] Parreira J X, Weikum G. JXP: Global authority scores in a P2P network. In Proc. the 8th WebDB, Maryland, USA, June 16–17, 2005, pp.31–36.
[21] Vazirgiannis M, Drosos D, Senellart P, Vlachou A. Web page rank prediction with Markov models. In Proc. the 17th WWW, Beijing, China, April 21–25, 2008, pp.1075–1076.
[22] Kou Y, Shen D, Li D, Nie T. A deep Web entity identification mechanism based on semantics and statistical analysis. Journal of Software, 2008, 19(2): 194–208.
[23] Yagoub K, Florescu D, Issarny V. Caching strategies for data-intensive Web sites. In Proc. the 26th VLDB, Cairo, Egypt, September 10–14, 2000, pp.188–199.
[24] Shi L, Han Y, Ding X, Wei L. An SPN-based integrated model for Web prefetching and caching. J. Comput. Sci. & Technol., 2006, 21(4): 482–489.
Yue Kou is a lecturer at College of Information Science and Engineering, Northeastern University. She received her Ph.D. degree in computer software and theory from Northeastern University of China in 2009. She is a student member of CCF. Her main research interests include Web data management and information retrieval.

De-Rong Shen is a professor at College of Information Science and Engineering, Northeastern University. She received her Ph.D. degree in computer software and theory from Northeastern University of China in 2004. She is a senior member of CCF. Her main research interests include distributed and parallel systems, Web data management and data grid.

Ge Yu is a professor at College of Information Science and Engineering, Northeastern University. He received his Ph.D. degree in computer science from Kyushu University of Japan in 1996. He is a member of IEEE, ACM, and a senior member of CCF. His research interests include database theory and technology, distributed and parallel systems, embedded software, and network information security.

Tie-Zheng Nie is a lecturer at College of Information Science and Engineering, Northeastern University. He received his Ph.D. degree in computer software and theory from Northeastern University of China in 2009. He is a student member of CCF. His main research interests include Web data management and information integration.