Fast and Accurate Incremental Entity Resolution Relative to an Entity Knowledge Base

Michael Welch∗
Barnes & Noble, Palo Alto, CA 94301
[email protected]

Aamod Sane
Yahoo! Inc., Sunnyvale, CA 94089
[email protected]

Chris Drome
Yahoo! Inc., Sunnyvale, CA 94089
[email protected]

ABSTRACT
User-facing topical web applications such as events or shopping sites rely on large collections of data records about real world entities that are updated at varying latencies, ranging from days to seconds. For example, event venue details change relatively infrequently, whereas ticket pricing and availability for an event are often updated in near-realtime. Users regard these sites as high quality if they seldom show duplicates, the URLs are stable, and the content is fresh, so it is important to resolve duplicate entity records with high quality and low latency. High quality entity resolution typically evaluates the entire record corpus for clusters of similar records at the cost of latency, while low latency resolution examines the fewest possible entities to keep time to a minimum, even at the cost of quality. In this paper we show how to keep latency low while achieving high quality, combining the best of both approaches: given an entity to be resolved, our incremental Fastpath system, in a matter of milliseconds, makes approximately the same decisions that the underlying batch system would have made. Our experiments show that the Fastpath system makes matching decisions for previously unseen entities with 90% precision and 98% recall relative to batch decisions, with latencies under 20ms on commodity hardware.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Entity Resolution, Knowledge Base
1. INTRODUCTION
The problem of deduplication and linking data records that refer to real world entities arises in several contexts. For

∗Work completed while at Yahoo!
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’12, October 29–November 2, 2012, Maui, HI, USA. Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$10.00.
example, topical sites that specialize in travel, real estate, or hyperlocal news collect and organize information of interest so that users can browse as well as search. Users are more likely to frequent sites that offer a wide variety of fresh information in a well organized fashion. The objectives of information providers, however, do not always coincide with these goals. In addition to inadvertent errors and noisy data, a business owner, for example, might prefer to have their listing shown multiple times and thus submit several variations of their information. The onus is on the sites to deduplicate, enrich, and link information from multiple such data sources for the users.

In principle, entity resolution over a corpus of n entities is possible using O(n²) pairwise comparisons, but since this becomes impractical as n grows, blocking techniques [9, 11, 8] are used to partition entities into blocks of similar records using hashes called blocking keys. Pairwise comparison [3] is then limited to blocks of size b such that O(b²) ≪ O(n²). Most such research has been directed toward improved resolution quality, without considering overall entity resolution latency. Low latency resolution using search techniques has recently been studied by [4], and researchers in computational linguistics [13] have considered streaming deduplication. However, this research has focused on per-entity resolution latency, without considering resolution quality relative to full corpus processing.

We employ entity resolution to synthesize a knowledge base that is being used in several of Yahoo!'s topical web sites. The knowledge base is created from data sources of varying quality, ranging from high quality proprietary data to very noisy sources such as user supplied information. In this setting, to achieve the best quality, we need full corpus entity resolution that can account for effects such as resolution cascades.
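The blocking scheme described above can be sketched as follows. This is a minimal illustration, not the production code: the record fields and the `blocking_keys` function are hypothetical stand-ins for the real, data-dependent blocking functions.

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(record):
    # Hypothetical blocking function: hash each record to coarse keys,
    # here the first name token plus the zip code.
    return {record["name"].split()[0].lower(), record["zip"]}

def candidate_pairs(records):
    """Partition records into blocks by key, then compare only within
    blocks: O(b^2) work per block instead of O(n^2) overall."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec)
    pairs = set()
    for block in blocks.values():
        for a, b in combinations(block, 2):
            pair = (a["id"], b["id"]) if a["id"] < b["id"] else (b["id"], a["id"])
            pairs.add(pair)
    return pairs
```

Records that share no blocking key are never compared, which is how blocking keeps the comparison count far below the quadratic worst case.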
On the other hand, sites such as event listings and shopping deals are most useful when the data they provide is also fresh. We therefore also need low latency deduplication of new records. To achieve low latency, we may opt to examine only a portion of the entity graph. In this paper we describe the Fastpath system, which leverages the high quality deduplication decisions of a batch system as a basis for deduplicating records incrementally in real-time with high quality. To the best of our knowledge, ours is the first entity resolution system that achieves both high quality and low latency.
2. THE FASTPATH SYSTEM

The Batchpath is a Hadoop based system built at Yahoo!, designed to ingest records in different formats from multiple sources and generate a single, unified semantic graph of deduplicated records called an entity knowledge base.

The key stage of the Batchpath system is the Builder, which deduplicates the input records in a two step process. The Builder first partitions the data set into blocks of similar records, so that instead of comparing O(n²) objects, we need only compare O(b²) objects [7], provided the blocking functions separate dissimilar records. Blocking functions are data dependent, for instance using n-grams for large text fields, numeric comparisons for telephone numbers, and so on.

In the second Builder step, within each block we perform a more thorough similarity comparison between all pairs of objects. This pairwise similarity measure operates on the same parcel used for blocking, though it need not use the same set of attributes. The similarity scores between pairs of objects in a block create a connected component graph with weighted edges; pairs whose similarity score is above a certain threshold are considered similar. Subsequent stages of the Builder refine these connected components, splitting or merging them to form a final set of distinct entities. Full details of the Batchpath system can be found in another paper [2].

Fastpath uses a search engine to select matching records for a deduplication decision. We allow for two common incremental operations: either (1) the query object is added to an existing cluster, or (2) it forms a new cluster. This eliminates some of the complexity seen in the full graph, such as a merge or split operation potentially cascading changes to neighboring objects, while still providing high accuracy deduplication in most cases.
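The Builder's within-block step, scoring all pairs and linking those above the threshold into connected components, can be sketched as follows. This is an illustrative sketch assuming a generic `similarity` function, not the Builder's actual comparators.

```python
from itertools import combinations

class UnionFind:
    """Minimal union-find to track connected components."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def connected_components(block, similarity, threshold):
    """Score all pairs in a block; link pairs scoring at or above the
    threshold, and return the resulting components as candidate entities."""
    uf = UnionFind()
    for rec in block:
        uf.find(rec)  # register singletons
    for a, b in combinations(block, 2):
        if similarity(a, b) >= threshold:
            uf.union(a, b)
    comps = {}
    for rec in block:
        comps.setdefault(uf.find(rec), set()).add(rec)
    return list(comps.values())
```

The components returned here correspond to the weighted-edge clusters that later Builder stages split or merge into final entities.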
Fastpath reuses several components of Batchpath: (1) the same blocking functions to generate blocking keys for the query record, (2) the same entity blocking keys as terms in the search engine, (3) the same pairwise comparison functions, and (4) the same matching thresholds to select the best matching entity.

We first populate a search index for the knowledge base created by Batchpath, where each object is indexed by its blocking keys. At runtime, we apply the same blocking functions to the incoming query record and find candidate matching entities in the index. We use multiple blocking functions in the Batchpath to ensure that we do not miss any entities that we should have compared [9], thus improving recall. Using these same keys as terms for a query record allows the search engine to use standard text ranking approaches to provide an initial candidate ranking.

We compute the pairwise similarity score between the query record and each of the top-k candidate entities, and select the entity with the highest score S_max. If S_max is above a configurable threshold (typically the same threshold value used during entity cluster refinement in the Batchpath), we consider the entity a match and return it as the result.
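Put together, the Fastpath decision for one query record can be sketched as follows. This is a simplified flow under assumed interfaces: `search`, `blocking_keys`, and `similarity` stand in for the search engine, the shared blocking functions, and the shared pairwise comparator, which the paper does not specify in code.

```python
def resolve(query, search, blocking_keys, similarity, k=10, threshold=0.8):
    """Return (matched_entity, score) if the best of the top-k candidates
    clears the matching threshold shared with Batchpath; otherwise return
    (None, score), meaning the query record forms a new cluster."""
    terms = blocking_keys(query)            # same keys Batchpath would emit
    candidates = search(terms, limit=k)     # text-ranked candidate retrieval
    best, best_score = None, float("-inf")
    for entity in candidates:               # re-rank with the shared
        score = similarity(query, entity)   # pairwise comparator
        if score > best_score:
            best, best_score = entity, score
    if best is not None and best_score >= threshold:
        return best, best_score             # add to the existing cluster
    return None, best_score                 # start a new cluster
```

Note how the two allowed incremental operations, joining an existing cluster or creating a new one, fall directly out of the threshold test.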
3. EXPERIMENTAL RESULTS
In our experiments we use a subset of data currently serving the local event listings¹ for Yahoo. Our experiments were performed on a dual-core 2.4 GHz Intel Core 2 machine with 4GB of memory running Linux. This machine hosts both the Fastpath logic and the text retrieval engine, using a proprietary text search engine framework. The search engine allows us to plug in ranking and matching components written in Java that we share with the Hadoop-based Batchpath system.

¹An event may be a professional sports game, a theatrical performance, a city council meeting, a farmers market, etc.

Figure 1: Performance

In our first set of experiments we construct and index a knowledge base using the complete set of input data. We then query the Fastpath system using the same input records to analyze and tune the performance of the system. This experiment measures the ability to identify near or exact duplicates, and gives us a baseline for system latency.

In our second set of experiments we evaluate the ability of the Fastpath to match updates to existing entities or ingest new, previously unseen data, which occurs regularly when the system ingests new or updated feeds. We build a knowledge base from the full corpus and then randomly select a subset of entities for which we artificially remove the attributes and blocking keys in the knowledge base. This ensures the same clustering decisions are kept when building the knowledge base, but simulates the arrival of as-yet-unseen entities. The subset of entities is then used to query the knowledge base. We compare the mapping decisions suggested by the Fastpath for the query entities with the clusters generated by Batchpath.
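The removal step of this simulation can be sketched as follows. The `kb` and record structures here, with per-record `attrs` and `blocking_keys` sets, are hypothetical and not the production schema; the point is that the sampled records' contributions are stripped while the clustering itself is left intact.

```python
import random

def simulate_unseen(kb, records, fraction=0.01, seed=7):
    """Simulate previously unseen records: keep Batchpath's clustering
    decisions, but remove each sampled record's attribute values and
    blocking keys from its entity so they cannot influence candidate
    retrieval or pairwise matching. Returns the sampled records, which
    are then used to query the Fastpath."""
    random.seed(seed)
    sample = random.sample(records, max(1, int(len(records) * fraction)))
    for rec in sample:
        ent = kb[rec["entity_id"]]
        ent["attrs"] -= rec["attrs"]                  # drop contributed values
        ent["blocking_keys"] -= rec["blocking_keys"]  # drop contributed keys
    return sample
```

Because only the record's contributions are removed, the number of entity clusters stays exactly the same as in the full build, as the experiment requires.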
3.1 Experimental dataset

Our input data consists of approximately 750k records with attributes and relationships that represent knowledge about local events, such as details of the event venues, performers or participants, event genre, ticket prices, starting and ending times, and so on. The data is collected from curated data providers as well as crawled and extracted from the Web using modern information extraction techniques.
3.2 Querying a complete knowledge base

In our first experiment, we evaluate the effect of varying the number of candidate objects k to retrieve using the search engine. We use a weighted term matching function similar to TF-IDF, commonly found in text retrieval applications, as the candidate ranking strategy. After constructing a knowledge base from the full 750,000 record input data set, we randomly selected 20,000 of those records to submit as queries and evaluate the accuracy of the top result after re-ranking. The result is considered a true positive match if the query object is one of the objects in the top ranked object cluster returned. The other two possibilities are a false positive match (matching the incorrect
Figure 2: Query Time
cluster) and a false negative match (returning no result). There are no "true negative" results, as each query should match an entity. We expect high but not necessarily perfect precision and recall, as the Fastpath system only has a partial view of the entity graph.

Figures 1 and 2 show how varying values of k affect precision, recall, F-measure, and average query execution time. We use the standard definitions of precision and recall for classification tasks. As we consider additional candidate entities for deeper evaluation, the recall and query execution times increase, while precision remains effectively constant. This suggests that the re-ranking applied to the retrieved entities nearly always finds the "right" candidate among the available choices.

Retrieving additional candidates generally improves the accuracy of the system up to approximately k = 10, above which we see only insignificant improvements in recall and F-measure. This shows the initial search engine ranking generally captures the correct result in the top 10. When considering the top k = 10 candidate entities selected by term ranking, false positives, i.e., queries where the Fastpath matches an object to the incorrect cluster, occur less than 1% of the time. The false negative rate is also low at 1.33%. Collectively, this means customers of the Fastpath system will experience minimal overhead as they process additional full knowledge bases produced by the grid system.
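The standard definitions used throughout these experiments reduce to simple counts over the match decisions (with no true negatives in this first experiment, since every query should match):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard classification metrics over match decisions:
    tp = matched to the correct cluster, fp = matched to the wrong
    cluster, fn = no result returned although a match existed."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```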
3.3 Querying with new records

Next, we evaluate the ability to match updates to existing entities or previously unseen records to the appropriate entity cluster, as would be the case when new records are discovered by the search engine but have not yet been seen by the Batchpath. As we have noted earlier, when examined by Fastpath alone, such records cannot cause existing clusters to split or merge.

We simulate this behavior by first constructing a complete knowledge base as before. We sample, uniformly at random, 1% of the input records into a subset S_1%. For each record O_i in S_1%, we identify the cluster in the knowledge base which contains O_i as a source and remove all attribute values, relations, and blocking keys provided to that entity by O_i. We then create an inverted index from the modified knowledge base and use the records in S_1% to query the Fastpath. For comparison, we have the same number of entity clusters as the previous experiment,
Figure 3: Performance
Figure 4: Query Time
but now the attribute values from those excluded records are not present in any of the clusters and therefore cannot influence candidate retrieval or pairwise matching. We evaluate the precision, recall, and F-measure for the top 1 ≤ k ≤ 10 entities. We consider the returned result a positive match if the query record was matched with the suggested cluster in the full knowledge base. False positives and false negatives are defined as before, and true negatives are now possible, as some of the excluded records were the only source record for their particular cluster.

Figures 3 and 4 show the accuracy and running time characteristics of introducing new records to the system via the Fastpath. Figure 3 shows that precision decreases slightly from 0.904 to 0.896 as we increase the number of candidate entities retrieved from 1 to 10, while at the same time recall improves from 0.951 to 0.979. For k ≥ 3, the F-measure remains effectively constant. Average running time for a query with k ≤ 10 remains at approximately 20ms on a single machine.

These results show that the Fastpath system enables a natural tradeoff between precision and recall. Applications that prefer more aggressive deduplication, at the potential cost of incorrectly merging records, can retrieve more candidate entities. Applications which require more certainty in disambiguation decisions, possibly at the expense of maintaining duplicate entities, can opt to retrieve fewer candidate entities. Table 1 shows the breakdown of misclassification rates for k = 1, k = 5, and k = 10,
highlighting this tradeoff. In all cases, the Fastpath achieves approximately 90% precision while also finding over 95% of the matching entities in the knowledge base.

Error type                   k=1      k=5      k=10
Incorrectly resolved         1.19%    1.31%    1.34%
Incorrectly non-resolved     0.57%    0.31%    0.25%

Table 1: New record misclassification rates
4. RELATED WORK
Developments in entity resolution have been surveyed recently in [7, 17, 5]. Since the early work on census records [12], researchers have sought to reduce the number of entities that need comparison by improving blocking techniques. Improvements include the development of effective blocking functions [6], minimizing the number of comparisons [3], scalable blocking approaches such as Canopy Clustering [11], MultiBlock, a way to avoid missing any comparisons [9], and frameworks for scaling blocking [14]. We have adopted the Canopy Clustering and MultiBlock approaches in our batch system, and adapted the ideas of [1] for better quality.

Two papers propose algorithms that trade off execution time for quality of results to complete computations in a reasonable amount of time. [18] approaches the problem from a database view and introduces an anytime algorithm capable of clustering massive data sets with a large number of features. Similarly, [16] uses "hints" to better guide a pay-as-you-go clustering algorithm. Both approaches generate high quality results without performing a complete comparison.

Closer to our approach, [4] proposes combining blocking and information retrieval techniques to first select similar entities at low latency, followed by further analysis of those entities in ways that avoid exhaustively comparing all their features. A similar approach is taken in [13] for discovering co-occurrences of entities (phrases) in a document collection. Word sets, called "entity chains", in input documents are compared with existing clusters of entity chains using several similarity scores, and the top-k clusters are returned as likely matches. New clusters are created if no existing cluster matches. Our indexing system is similar to that of [4, 13], but we use blocking keys as terms in our search engine. Like [13], our system can also create new entities.
However, to the best of our knowledge, our approach of combining the indexing and batch approaches, with continuity between the two systems, is unique.
5. CONCLUSION
In this paper we presented the unique requirements and design of an entity resolution system suitable for topical Web sites. Our system began as a purely batch entity resolution platform, Batchpath, focused on building a high quality knowledge base. Additional requirements of topical Web applications, such as latency, necessitated designing an alternative path which supports fast entity resolution while maintaining accuracy. The Fastpath system achieves these goals, operating as a low latency, incremental knowledge base, seeded by the high quality Batchpath output.

Our experiments on large scale, real-world data have demonstrated that the Fastpath meets the requirements of a topical Web site: it performs deduplication operations for entities in under 20ms on commodity hardware, and it achieves nearly 90% precision and 98% recall when resolving newly discovered records or updates to existing entities. The experiments have also shown that customers of the Fastpath can adjust the system's bias between conservative (more duplicates, fewer incorrectly resolved entities) and aggressive (fewer duplicates, but possibly more incorrectly merged entities) resolution by simply tuning how many of the top-k candidate entities to consider. Additional details about the system and experiments can be found in our full paper [15].
6. REFERENCES
[1] A. Arasu, C. Ré, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, pages 952–963, 2009.
[2] K. Bellare, C. Curino, A. Machanavajjhala, P. Mika, M. Rahurkar, A. Sane, R. Aronson, P. Bohannon, L. Chitnis, C. Drome, Z. Gu, B. Kannan, V. Rastogi, N. Torzec, and M. Welch. WOO: A scalable and multi-tenant platform for continuous knowledge base synthesis, 2012.
[3] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. VLDB J., 18(1):255–276, 2009.
[4] P. Christen and R. Gayler. Towards scalable real-time entity resolution using a similarity-aware inverted index approach. In AusDM, volume 87 of CRPIT, pages 51–60, 2008.
[5] A. Cuzzocrea and P. L. Puglisi. Record linkage in data warehousing: State-of-the-art analysis and research perspectives. In DEXA Workshops, pages 121–125, 2011.
[6] J. de Freitas, G. L. Pappa, A. S. da Silva, M. A. Gonçalves, E. S. de Moura, A. Veloso, A. H. F. Laender, and M. G. de Carvalho. Active learning genetic programming for record deduplication. In IEEE Congress on Evolutionary Computation, pages 1–8, 2010.
[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19(1):1–16, Jan. 2007.
[8] L. O. Evangelista, E. Cortez, A. S. da Silva, and W. Meira. Adaptive and flexible blocking for record linkage tasks. JIDM, 1(2):167–182, 2010.
[9] R. Isele, A. Jentzsch, and C. Bizer. Efficient multidimensional blocking for link discovery without losing recall. In WebDB, 2011.
[10] S. Kataria, K. S. Kumar, R. Rastogi, P. Sen, and S. H. Sengamedu. Entity disambiguation with hierarchical topic models. In KDD, pages 1037–1045, 2011.
[11] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In SIGKDD, pages 169–178, 2000.
[12] H. B. Newcombe, J. M. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 130(3381):954–959, October 1959.
[13] D. Rao, P. McNamee, and M. Dredze. Streaming cross document entity coreference resolution. In COLING (Posters), pages 1050–1058, 2010.
[14] V. Rastogi, N. Dalvi, and M. Garofalakis. Large-scale collective entity matching. Proc. VLDB Endow., 4(4):208–218, Jan. 2011.
[15] M. Welch, A. Sane, and C. Drome. High quality real-time incremental entity resolution in a knowledge base, 2012.
[16] S. E. Whang, D. Marmaros, and H. Garcia-Molina. Pay-as-you-go entity resolution. IEEE Trans. Knowledge and Data Engineering, 2012.
[17] W. E. Winkler. Overview of record linkage and current research directions. Technical report, Statistical Research Division, U.S. Census Bureau, 2006.
[18] L. Ye, X. Wang, D. Yankov, and E. J. Keogh. The asymmetric approximate anytime join: A new primitive with applications to data mining. In SDM, 2008.