UNIVERSITY OF CALIFORNIA, SAN DIEGO

Adaptive detection of approximately duplicate database records and the database integration approach to information discovery

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science and Engineering

by

Alvaro Edmundo Monge

Committee in charge:

Professor Charles P. Elkan, Chairperson
Professor Samuel Buss
Professor David Kirsh
Professor Keith Marzullo
Professor Mohan Paturi

1997

Copyright Alvaro Edmundo Monge, 1997 All rights reserved.

The dissertation of Alvaro Edmundo Monge is approved, and it is acceptable in quality and form for publication on microfilm:

Chair

University of California, San Diego

1997

TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
List of Tables
Acknowledgments
Vita, Publications, and Fields of Study
Abstract

I   Introduction
    A. Overview of the thesis

II  Domain-independent record matching
    A. Related work
    B. Record matching algorithms
       1. A recursive record matching algorithm
       2. The Smith-Waterman algorithm
       3. The hybrid algorithm
    C. Experimental evaluation: matching affiliation fields
       1. The WebFind information discovery tool
       2. Experiments matching inspec and netfind records
       3. Recall and precision experimental results
    D. Matching mailing addresses
       1. Generating mailing address records
       2. Experimental results
    E. Discussion

III Adaptive and scalable detection of approximately duplicate database records
    A. Previous work
    B. Pairwise record matching algorithms
    C. Data structure for maintaining the clusters of duplicate records
       1. The Union-Find data structure
    D. Improved simple algorithm
    E. The overall priority queue algorithm
       1. Computational complexity analysis

IV  Experimental evaluation
    A. Choosing thresholds
    B. Measuring accuracy
    C. Algorithms tested
    D. Databases of mailing addresses
       1. Varying the size of the window and the priority queue
       2. Varying the number of duplicates per record
       3. Varying the size of the database
       4. Varying the level of noise in the database
    E. Detecting approximate duplicate records in a real bibliographic database

V   The WebFind information discovery tool
    A. Approaches to searching on the WWW
    B. Resources used by WebFind
       1. Library resources
       2. White pages resources
       3. Article archives resource
    C. inspec and netfind integration
       1. Bibliographic record retrieval
       2. Determining WWW starting points
    D. Discovery phase
    E. Experimental results
       1. Mapping affiliations to internet hosts
       2. Discovery of worldwide web servers
       3. Discovery of home pages
       4. Paper discovery
    F. Related work
       1. Ahoy! The Homepage Finder
       2. CiFi: Citation Finder

VI  Conclusion
    A. Future work
       1. Domain-independent record matching
       2. Detecting approximate duplicate records
       3. WebFind

Bibliography

LIST OF FIGURES

II.1  Optimal record alignments produced by the Smith-Waterman algorithm.
II.2  Recursive algorithm with abbreviation matching: average recall versus precision.
II.3  Recursive algorithm without abbreviation matching: average recall versus precision.
II.4  Smith-Waterman algorithm (L = 0): average recall versus precision.
II.5  Subrecord-level Smith-Waterman algorithm (L = 1): average recall versus precision.
II.6  Word-level Smith-Waterman algorithm (L = 2): average recall versus precision.
II.7  Average recall versus precision for all algorithms on the mixed dataset.
IV.1  Accuracy results using the same database and varying the size of the window and priority queue.
IV.2  Number of record comparisons using the same database and varying the size of the window and priority queue.
IV.3  Accuracy results for varying the number of duplicates per original record using a uniform distribution.
IV.4  Accuracy results for varying the number of duplicates per original record using a Zipf distribution.
IV.5  Accuracy results for varying database sizes (log-log plot).
IV.6  Number of comparisons performed by the algorithms (log-log plot).
IV.7  Number of record comparisons made when mixing the strategies for applying the record matching algorithm. See Tables IV.6 and IV.7 (log-log plot).
IV.8  Accuracy results for varying the noise level.
V.1   WebFind query form.
V.2   Papers found in inspec from the query issued by WebFind.
V.3   Results from a WebFind discovery session.

LIST OF TABLES

II.1  Examples of netfind and inspec records.
II.2  Numbers of equivalent groups in the three datasets.
II.3  Pairs of records used to evaluate the matching algorithms.
II.4  Scores assigned by six matching algorithms to pairs of records.
IV.1  Number of pure clusters detected and the number of record comparisons performed by the duplicate detection algorithms when varying the size of the structure storing records for possible comparisons.
IV.2  Experiments on five databases of varying size.
IV.3  Number of pure and impure clusters detected by the Smith-Waterman-based algorithms on databases of varying size.
IV.4  Number of clusters detected by the algorithms using the matching algorithm of Hernandez and Stolfo (1995) on databases of varying size.
IV.5  Number of comparisons performed by the algorithms.
IV.6  Number of pure clusters detected when different strategies are used in each of the two passes. The size of the window is fixed at 10 records, and the size of the queue is 4.
IV.7  Total number of comparisons made to detect the pure clusters from Table IV.6.
IV.8  Results on a database of bibliographic records.
V.1   Sample inspec author affiliations.

ACKNOWLEDGMENTS

People have many memorable moments in their lives. The most memorable moment for me was the first day that I set foot in the United States back in August of 1983. Leaving El Salvador as a thirteen-year-old was the hardest thing I have ever done. It was also a necessary thing to do given the political, social, and economic instability of my birth country. It has been a long road to another memorable moment in my life, the conclusion of my graduate studies. This road was made easier by all the friendships I made and all of the people who helped me along the way. I dedicate this thesis to everyone who has had such an influence in my life.

From the first day I arrived, my primary goal has been to work hard in all of my endeavors. Early on, my efforts concentrated on learning English, of which I didn't speak a word. Within a year of enrolling in English as a second language courses in high school, I had the knowledge necessary to enroll in a regular high school schedule of classes. The first of my accomplishments had been completed.

While in El Salvador, Mathematics was my subject of interest. This interest continued in my first year of high school, until I was introduced to computer programming. The earliest experience I had with computers was in an after-school program in my first year of high school. My Algebra teacher introduced me and others to programming BASIC on a TRS-80. This small beginning is all it took to spark in me an interest in computer science. My next goal would be to become a computer scientist.

My education has been shaped by many people. The first person I learned from and continue to learn from is my father. He is the one person who has shaped my life the most, and for that I am forever thankful. I want to thank all of the many wonderful and dedicated teachers that I had while in El Salvador. My high school years in Los Angeles were also very important, and the dedication and care of all my teachers helped and encouraged me to do the best job I could do. They are also the reason for my quest to become an educator.

My four years at U.C. Riverside have been some of the best in my life

so far. In high school, I had gained the confidence that I lacked after leaving El Salvador. I carried this through to my years as an undergraduate. As is evident in this acknowledgment, success is not achieved alone, and my stay at U.C. Riverside is not an exception. The small size of the computer science department allowed me to work side-by-side with professors and gave me a first look at life in academia. It is here where I realized my final goal: I wanted to become a university professor. For three and a half years there, I worked as a math and computer science tutor, something that I had also done in high school. This pleasurable and satisfying experience reaffirmed the goal that I had set out for myself. I would like to thank all of my professors at U.C. Riverside for all their time and inspiration. There are many others to whom I would like to extend my gratitude. In particular, I would like to thank Sarah Wall for her nurturing, encouragement, counsel, guidance, and her concern for my success while at U.C. Riverside and beyond.

There are many other people to whom I want to extend my thanks. There are many individuals whom I have learned from, and while not all are listed here by name, they have been very important. Following is a list of people who have made graduate school a very rewarding and enjoyable experience.

Thanks to my advisor Charles Elkan, who is a collaborator on all of this work. Charles has been a great influence in sharpening my academic skills. I am appreciative of all his time, patience, and encouragement in each of my years at U.C. San Diego.

Thanks to Filippo Menczer, Curtis Padgett, Victor Vianu, Michael Merritt, Keith Marzullo, and Yannis Papakonstantinou for helpful discussions about this work. Special appreciation goes to Filippo, with whom I have shared many ideas about school and life in general. He has been very supportive and I am very fortunate to have him as one of my best friends.

My deepest thanks to the AT&T fellowship program, whose financial support has been critical in my studies. On many occasions the moral support from my fellowship mentor Michael Merritt has been more valuable than the financial support. His


belief in me has kept me going, and my success could not have come without his and AT&T's support. The existence of fellowship programs like AT&T's is of great importance; without such programs many students like myself would not pursue graduate studies.

Thanks to everyone in my family. They have been supportive of my studies from the first day I opened a book. My father, Edmundo A. Monge, has been a great inspiration, and his encouragement, advice, and sincere thoughts about academics and life in general have always been very important to me. I am forever thankful. My mother, Aura Marina, has been a role model for hard work and dedication. Both of my parents' suffering has paid off, and this thesis is only small evidence of that. I have received much support from all my brothers, Eric Max and Tito Renan, and my sisters Dixie Duran, Zoila Mirian, and Aura Vicky. Each of them has helped shape my life in one way or another. Last and by no means least in my family, I would like to thank my cousins and nephews. They have reminded me never to lose that child inside of me. All of them have brought smiles to me and always will. I will always be there for each of you. Special thanks go to Michael Anthony, Eric Rafael, Efren Daniel, Emilia Cristina, and Mia Elizabeth. You are all very wonderful and special.

I reserve my most sincere and heartfelt thanks for Sara Miner. Sara's encouragement, support, and friendship in my last year of graduate school have been vital to the completion of my PhD. Sara has made a great positive impact on my life, and has brightened my life in immeasurable ways. I have learned more about myself in the short time that I have known Sara than I have in a long time. Sara has brought out so many wonderful things inside me. Sara has always been a great listener and is a very special, wonderful, giving person. I am forever thankful for all that you have given to me. In me, you will always have a friend to count on. There will forever be a place in my heart for you. You are a very important part of my life ... essentially!!


VITA

December 28, 1969   Born, Ojos de Agua, Chalatenango, El Salvador
1983                Immigrated to USA
1987                Woodrow Wilson High School, East Los Angeles
1987–1989           Summer Internship, Jet Propulsion Laboratory, Pasadena, CA
1988–1990           National Hispanic Scholarship Fund (NHSF) scholar, University of California, Riverside
1988–1991           Mathematics and Computer Science Tutor, University of California, Riverside
1990–1991           Research internship, University of California, Riverside
1990                Minority Summer Research Internship Program, University of California Graduate Division, Riverside
1991                B.S., University of California, Riverside
1991                Summer Internship, AT&T Bell Labs, Murray Hill, NJ
1991–1997           AT&T Cooperative Research Fellow
1992–1996           Teaching Assistant, Department of Computer Science and Engineering, University of California, San Diego
1993                M.S., Department of Computer Science and Engineering, University of California, San Diego
1993                Summer Internship, AT&T Bell Labs, Murray Hill, NJ
1994                Summer Internship, Los Alamos National Laboratory
1996                Summer Instructor, "Database systems principles", Department of Computer Science and Engineering, University of California, San Diego
1997                Doctor of Philosophy, Department of Computer Science and Engineering, University of California, San Diego

PUBLICATIONS

"Domain-independent record matching algorithms for knowledge discovery applications." Alvaro E. Monge and Charles P. Elkan. (To be submitted.)

"An efficient domain-independent algorithm for detecting approximately duplicate database records." Alvaro E. Monge and Charles P. Elkan. SIGMOD 1997 data mining and knowledge discovery workshop. Tucson, Arizona. May 1997.

"The WebFind tool for finding scientific papers over the world wide web." Alvaro E. Monge and Charles P. Elkan. Proceedings of the 3rd International Congress on Computer Science Research. Tijuana, Mexico. November 1996. pp. 41–46.

"WebFind: Mining external sources to guide WWW discovery (software demo)." Alvaro E. Monge and Charles P. Elkan. The Second International Conference on Knowledge Discovery and Data Mining. August 1996.

"The field matching problem: algorithms and applications." Alvaro E. Monge and Charles P. Elkan. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. August 1996. pp. 267–270.

"Integrating external information sources to guide worldwide web information retrieval." Alvaro E. Monge and Charles P. Elkan. Technical Report CS96-474, University of California, San Diego, January 1996.

"WebFind: automatic retrieval of scientific papers over the worldwide web." Alvaro E. Monge and Charles P. Elkan. Working notes of the AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval. AAAI Press. November 1995. Page 151.

"A survey of concurrency control in multidatabase systems." Charles P. Elkan and Alvaro E. Monge. Technical Report CS94-391, University of California, San Diego, October 1994.

"The Design and Implementation of Version 0 of the LoGOS Database System." Charles P. Elkan, Gary Jerep, and Alvaro E. Monge. Technical Report CS92-260, University of California, San Diego, October 1992.


ABSTRACT OF THE DISSERTATION

Adaptive detection of approximately duplicate database records and the database integration approach to information discovery

by

Alvaro Edmundo Monge

Doctor of Philosophy in Computer Science

University of California, San Diego, 1997

Professor Charles Elkan, Chair

The integration of information is an important area of research in databases. By combining multiple information sources, a more complete view of the world is attained, and additional knowledge gained. This is a non-trivial task, however. Often there are many sources which contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. This thesis provides solutions to this data cleansing problem. The integration of information sources is also proposed as an approach for information retrieval over the worldwide web.

Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is manipulated into a form which is useful for other tasks, such as data mining. This thesis addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit-distance is used as a domain-independent method to recognize pairs of approximate duplicates. Second, the union-find data structure is used to maintain the clusters of duplicate records incrementally, as pairwise duplicate relationships are discovered. Third, the algorithm uses a priority queue

of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This results in significant savings in the number of times that a pairwise record matching algorithm is applied, without impairing accuracy. Comprehensive experiments on synthetic databases and on a real-world database confirm the effectiveness of all three ideas.

This thesis also presents WebFind, an application that discovers scientific papers online over the worldwide web. The size of the web makes online information retrieval difficult. In such a setting, it is critical to know where to concentrate the search for information. WebFind uses a domain-independent algorithm to match records from different sources for the purpose of integrating the information in those sources. We describe the design of WebFind, the integration process, and the discovery phase.


Chapter I

Introduction

Research in the areas of knowledge discovery and data cleansing has seen recent growth. The growth is due to a number of reasons. The most obvious one is the exponential growth of information available online. In particular, the biggest impact comes from the popularity of the Internet and the World Wide Web (WWW or web). In addition to the web, there are many traditional sources of information, like relational databases, which have impacted the growth of the information available online. The availability of these sources increases not only the amount of data, but also the variety of forms and levels of quality in which such data appears. These factors create a number of problems. The work in this thesis concentrates on two such problems. The first problem is the detection of multiple representations of a single entity. The second is the discovery of particular pieces of information from the web.

Data cleansing is the process of cleaning up databases containing inaccurate or inconsistent data. One inconsistency is the existence of multiple different representations of the same real-world entity. The task is to detect such duplication and reconcile the differences into a single representation. The differences may be due to data entry errors such as typographical mistakes, or to unstandardized abbreviations, or to differences in detailed schemas of records from multiple databases, among other reasons. As the information in multiple sources is integrated, the same real-world entity is duplicated. The detection of records that are approximate duplicates, but

not exact duplicates, in databases is an important task. This thesis presents solutions to this problem. First, Chapter II introduces algorithms used to determine if two records represent the same entity. Such record matching algorithms are used in database-level duplicate detection algorithms, which are presented in Chapter III. Finally, Chapter IV provides an empirical evaluation of the duplicate detection algorithms, including a comparison with previous work.

Along with the increase of information available comes an increase in the complexity of discovering information that is relevant to a particular user. There are a number of research efforts that have addressed this problem. Many of these efforts concentrate on indexing as much of the information that is available online as possible and providing a search engine that accepts queries against the index. This thesis proposes an alternative approach: to integrate external information sources to guide the information discovery process. The integration of information sources requires the identification of equivalent records from the sources being integrated; hence, we study this problem as well.

There are three important contributions that make up this thesis. The first is new solutions to the pairwise record matching problem. Given two records, several domain-independent algorithms are given which determine whether the records are approximate duplicates of each other. The second contribution is the development of a scalable and adaptive algorithm to detect approximate duplicate database records. The algorithm presented responds adaptively to the size and homogeneity of the approximately equivalent groups of records discovered as the database is scanned. Finally, this thesis also presents a solution to the problem of discovering information on the worldwide web. The approach taken here is to use a combination of external information sources as a guide for locating where to look for information on the web.


I.A Overview of the thesis

Integration of information sources is a key process in many fields. This process is made difficult by the form and quality of data at the sources being integrated. To detect imperfections and consolidate multiple forms of information in the data is the process of data cleansing, a preprocessing step in data warehousing. The databases used in data warehousing contain a very large number of records. The data cleansing step must therefore be efficient.

Chapter II through Chapter IV describe and evaluate the different components of the duplicate detection system proposed in this thesis. Chapter II formally defines the record matching problem and describes domain-independent algorithms to solve it. It gives experimental results on synthetic and real data which show the effectiveness of the proposed domain-independent algorithm over an algorithm that depends on domain-specific knowledge to be provided. The domain-independent record matching algorithm becomes the principal operation in detecting approximate duplicate records in an entire database, the topic of Chapter III. The chapter discusses the options for a duplicate detection algorithm. The different features are combined and a number of duplicate detection algorithms are presented. Chapter IV provides an empirical evaluation of the duplicate detection algorithms proposed, including a comparison to previous duplicate detection software.

The importance of integration of information is explored in Chapter V, where WebFind is presented. WebFind is an application that discovers scientific papers made available by their authors on the web. WebFind uses a novel approach to performing information retrieval on the worldwide web. The approach is to use a combination of external information sources as a guide for locating where to look for information on the web. The chapter describes the design of WebFind and discusses its performance in answering queries from a user.

Finally, Chapter VI gives a summary and directions for future research. Future research includes extensions of the work on domain-independent record matching algorithms, and on database-level approximate duplicate record detection. Possible

extensions and future features to WebFind are also described.

Chapter II

Domain-independent record matching

Many knowledge discovery and database mining applications need to combine information from heterogeneous sources. These information sources, such as relational databases or worldwide web pages, provide information about the same real-world entities, but describe these entities differently. Resolving discrepancies in how entities are described is the problem studied in this chapter. Specifically, the record matching problem is to determine whether or not two syntactically different record values describe the same semantic entity, i.e. real-world object.

Solving the record matching problem is vital in three major knowledge discovery tasks.

• First, the ability to perform record matching allows one to identify corresponding information in different information sources. This allows one to navigate from one source to another, and to combine information from the sources. In relational databases, navigating from one relation to another is called a "join." Record matching allows one to do joins on information sources that are not relations in the strict sense. A worldwide web knowledge discovery application that uses record matching to join separate Internet information sources, called WebFind, is described in Chapter V below.


• Second, the ability to do record matching allows one to detect duplicate records, whether in one database or in multiple related databases. Duplicate detection is the central issue in the so-called "Merge/Purge" task [Hernandez and Stolfo, 1995; Hylton, 1996; Monge and Elkan, 1997], which is to identify and combine multiple records, from one database or many, that concern the same entity but are distinct because of data entry errors. This task is also called "data scrubbing" or "data cleaning" or "data cleansing" [Silberschatz et al., 1995]. This problem is studied in more detail in Chapter III.

• Third, doing record matching is one way to solve the database schema matching problem [Batini et al., 1986; Kim et al., 1993; Madhavaram et al., 1996; Song et al., 1996]. This problem is to infer which attributes in two different databases (i.e. which columns of which relations for relational databases) denote the same real-world properties or objects. If several values of one attribute can be matched pairwise with values of another attribute, then one can infer inductively that the two attributes correspond. This technique is used to do schema matching for Internet information sources by the "information learning agent" (ILA) of Etzioni and Perkowitz (1995), for example.

This chapter is organized as follows. The next section discusses the related work in this area. Section II.B first states the record matching problem precisely and then presents the domain-independent record matching algorithms proposed in this thesis. The performance of the algorithms is analyzed in Section II.C using real-world examples of records that must be matched by WebFind, and also in Section II.D using randomly created mailing addresses.

II.A Related work

The record matching problem has been recognized as important for at least 50 years. Since the 1950s over 100 papers have studied matching for medical records

under the name "record linkage." These papers are concerned with identifying medical records for the same individual in different databases, for the purpose of performing epidemiological studies [Newcombe, 1988].

Record matching has also been recognized as important in business for decades. For example, tax agencies must do record matching to correlate different pieces of information about the same taxpayer when social security numbers are missing or incorrect. The earliest paper on duplicate detection in a business database is by Yampolskii and Gorbonosov (1973). The "record linkage" problem in business has been the focus of workshops sponsored by the US Census Bureau [Kilss and Alvey, 1985; Bureau, 1997; Cox, 1995; Winkler, 1995]. Record matching is also useful for detecting fraud and money laundering [Senator et al., 1995].

Almost all published previous work on record matching is for specific application domains, and hence gives domain-specific algorithms. For example, three recent papers discuss record matching for customer addresses [Ace et al., 1992], census records [Slaven, 1992], or variant entries in a lexicon [Jacquemin and Royaute, 1994]. Some recent work on record matching is not domain-specific, but assumes that domain-specific knowledge will be supplied by a human for each application domain [Wang et al., 1989; Hernandez and Stolfo, 1995].

One important area of research that is relevant to approximate record matching is approximate string matching. String matching has been one of the most studied problems in computer science [Boyer and Moore, 1977; Knuth et al., 1977; Hall and Dowling, 1980; Galil and Giancarlo, 1988; Chang and Lampe, 1992; Du and Chang, 1994]. The main approach is based on edit distance [Levenshtein, 1966]. Edit distance is the minimum number of operations on individual characters (e.g. substitutions, insertions, and deletions) needed to transform one string of symbols to another [Peterson, 1980; Hall and Dowling, 1980; Kukich, 1992]. In the survey by Hall and Dowling (1980), the authors consider two different problems, one under the definition of equivalence and a second using similarity. Their definition of equivalence allows only small differences in the two strings. For example, they allow alternate spellings

of the same word, and ignore the case of letters. The similarity problem allows for more errors, such as those due to typing: transposed letters, missing letters, etc. The equivalence of strings is the same as the mathematical notion of equivalence: it always respects the reflexivity, symmetry, and transitivity properties. The similarity problem, on the other hand, is the more difficult problem, where any typing and spelling errors are allowed. The similarity problem then is not necessarily transitive, although it still respects the reflexivity and symmetry properties.

The main contributions in this chapter are to give three new record matching algorithms that are domain-independent, and to show experimentally that these algorithms perform well for integrating real Internet information sources.

II.B Record matching algorithms

The word record is used to mean a syntactic designator of some real-world object, such as a tuple in a relational database. The record matching problem arises whenever records that are not identical, in a bit-by-bit sense, may still refer to the same object. For example, one database may store the first name and last name of a person (e.g. "Jane Doe"), while another database may store only the initials and the last name of the person (e.g. "J. B. Doe"). Table II.1 in Section II.C contains examples of records designating academic institutions. These examples show that records can be made up of subrecords (also called fields) delimited by separators such as newlines, commas, or spaces. A subrecord is itself a record and may be made up of subsubrecords, and so on.

In this thesis, we say that two records are equivalent if they are equal semantically, that is, if they both designate the same real-world entity. Semantically, this problem respects the reflexivity, symmetry, and transitivity properties. The record matching algorithms which solve this problem depend on the syntax of the records. These syntactic calculations are approximations of what we really want, semantic equivalence. In such calculations, errors are bound to occur and thus the semantic

equivalence will not be properly calculated. However, our claim is that there are few errors and that the approximation is good. The experiments in Chapter IV will provide evidence for this claim.

Equivalence may sometimes be a question of degree, so a function solving the record matching problem returns a value between 0.0 and 1.0, where 1.0 means certain equivalence and 0.0 means certain non-equivalence. This study assumes that these scores are ordinal, but not that they have any particular scalar meaning. Degree of match scores are not necessarily probabilities or fuzzy degrees of truth. An application will typically just compare scores to a threshold that depends on the domain and the particular record matching algorithm in use.

Record matching algorithms vary by the amount of domain-specific knowledge that they use. The pairwise record matching algorithms used in most previous work have been application-specific. For example, Hernandez and Stolfo (1995) use production rules based on domain-specific knowledge, which are first written in OPS5 and then translated by hand into C. This section presents algorithms for pairwise record matching which are relatively domain independent. In particular, this work proposes to use a generalized edit-distance algorithm. This domain-independent algorithm is a variant of the well-known Smith-Waterman algorithm [Smith and Waterman, 1981], which was originally developed for finding evolutionary relationships between biological protein or DNA sequences.

A record matching algorithm is domain-independent if it can be used without any modifications in a range of applications. By this definition, the Smith-Waterman algorithm is domain-independent under the assumptions that records have similar schemas and that records are made up of alphanumeric characters. The first assumption is needed because the Smith-Waterman algorithm does not address the problem of duplicate records containing fields which are transposed.¹ The second assumption is needed because any edit-distance algorithm assumes that records are strings over some fixed alphabet of symbols. Naturally this assumption is true for a wide range of databases, including those with numerical fields such as social security numbers that are represented in decimal notation.

We illustrate domain-independence by using the same Smith-Waterman algorithm in three different domains. Section II.C contains results of experiments on records containing the address of an institution. The task here is to detect the best matching record in one database given a record from a different database. Domain independence is achieved since there is no information about mailing addresses used by the algorithms. In addition, the algorithms work on different types of addresses. In Section II.D and Section IV.D, the same Smith-Waterman algorithm is used to detect duplicate records in randomly created mailing lists. Finally, in Section IV.E, the algorithm is used to cluster a database of bibliographic records into groups of records which refer to the same publication. While the records in these domains are very different, exactly the same Smith-Waterman algorithm is used in all applications.

The remainder of this section covers the algorithms developed for matching database records. The first algorithm is based on the observation that database records contain some structure which can be used to match them. Section II.B.2 presents the Smith-Waterman algorithm used in this work. These two algorithms are then combined into a hybrid algorithm which addresses the problem of transposed fields.

¹ Technically, a variant of the Needleman-Wunsch [Needleman and Wunsch, 1970] algorithm is actually used, which calculates the minimum weighted edit-distance between two entire strings. Given two strings, the better-known Smith-Waterman algorithm finds a substring in each string such that the pair of substrings has minimum weighted edit-distance.

II.B.1 A recursive record matching algorithm

We first present an algorithm that uses the recursive structure of typical textual records. The base case is that A and B match with degree 1.0 if they are the same atomic string or one abbreviates the other; otherwise their degree of match is 0.0. An atomic string is a sequence of alphanumeric characters delimited by punctuation characters. Each subrecord $A_i$ of A is assumed to correspond to the subrecord $B_j$ of B with which it has highest score. The score of matching A and B then equals the

mean of these maximum scores:

$$\mathrm{score}(A, B) \;=\; \frac{1}{|A|} \sum_{i=1}^{|A|} \max_{j=1}^{|B|} \mathrm{score}(A_i, B_j).$$

Atomic strings can take the form of abbreviations. If one record contains an abbreviation and the second record contains its expansion, then the atomic string matching must be able to detect them as a match. Thus, when matching atomic strings, the algorithm applies heuristics for matching abbreviations. Heuristic matching of abbreviations uses four patterns:

(i) the abbreviation is a prefix of its expansion, e.g. "Univ" abbreviates "University", or
(ii) the abbreviation combines a prefix and a suffix of its expansion, e.g. "Dept" matches "Department", or
(iii) the abbreviation is an acronym for its expansion, e.g. "UCSD" abbreviates "University of California, San Diego", or
(iv) the abbreviation is a concatenation of prefixes from its expansion, e.g. "Caltech" matches "California Institute of Technology".

Note that the first and third cases are special cases of the fourth case.

The implementation of the recursive matching algorithm is straightforward. No special data structures are necessary. The algorithm uses the text strings to store and access the records, subrecords, etc. First, the subrecords of each record A and B being compared are determined. The algorithm requires a list of delimiters, one delimiter for each level of nesting. Each record is then split into its subrecords wherever the delimiter corresponding to the current nesting level is found. Next, the record matching function is called recursively on every pair of subrecords, made up of one subrecord from A and a subrecord from B. All these comparisons are necessary in order to determine the subrecord in B which best matches each subrecord of A. The recursion stops when at least one of the records cannot be decomposed into

subrecords (i.e. there are no identifiable delimiters). At this point, the heuristics for matching atomic strings that contain abbreviations are applied. The first two cases have been implemented in the algorithm used in the experiments of Section II.C, and used in WebFind. The implementation of the last two heuristics is more complex because the abbreviation expands to a string that spans a collection of subrecords and subsubrecords. The score calculated by these heuristics is returned. The score of matching the two records A and B is then calculated as described by the equation above.

The recursive record matching algorithm has quadratic time complexity. Given A and B, every subrecord in A must be compared with every subrecord in B. At the lowest level, each atomic string of A is compared with each atomic string of B. An important optimization is to apply memoization to remember the results of recursive calls that have already been made. Whenever the algorithm is called with the same arguments as on a previous call, the result saved previously is returned instead of being recomputed.

This algorithm has two main advantages: first, its recursive nature allows it to cope with out-of-order subrecords, subsubrecords, and so on; and second, its heuristics allow it to cope with typical abbreviations. In order to study the usefulness of each of these features separately, the experiments reported in Section II.C also use a version of the algorithm that does not have the heuristics for matching abbreviations.

The same heuristics for matching abbreviations are used when the recursive matching algorithm is used in different domains. That is, whether the algorithm is used in matching addresses of institutions, or mailing addresses, or bibliographic records, the heuristics are unchanged. Domain independence is achieved since there is no knowledge of the domain that is built into the heuristics. Finally, as the experiments show, the heuristics actually work in the different domains.
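To make the recursion, the memoization, and the abbreviation base case concrete, here is a short Python sketch written for this presentation; it is not the dissertation's implementation. It covers only the first two abbreviation patterns, and the specific delimiters, the regular expression used to extract atomic strings, and the example records are illustrative assumptions.

import re
from functools import lru_cache


def abbrev_match(a: str, b: str) -> bool:
    # Abbreviation patterns (i) and (ii): "Univ"/"University", "Dept"/"Department".
    a, b = a.lower(), b.lower()
    short, full = (a, b) if len(a) <= len(b) else (b, a)
    if full.startswith(short):                     # pattern (i): prefix
        return True
    for k in range(1, len(short)):                 # pattern (ii): prefix plus suffix
        if full.startswith(short[:k]) and full.endswith(short[k:]):
            return True
    return False


@lru_cache(maxsize=None)                           # memoization of repeated recursive calls
def score(a: str, b: str, level: int = 0) -> float:
    # Degree of match in [0.0, 1.0]; asymmetric between its two arguments.
    if level == 0:                                 # split records into comma-separated subrecords
        subs_a = [s.strip() for s in a.split(",") if s.strip()]
        subs_b = [s.strip() for s in b.split(",") if s.strip()]
    elif level == 1:                               # split subrecords into atomic strings
        subs_a = re.findall(r"[A-Za-z0-9]+", a)
        subs_b = re.findall(r"[A-Za-z0-9]+", b)
    else:                                          # base case: atomic strings
        return 1.0 if a.lower() == b.lower() or abbrev_match(a, b) else 0.0
    if not subs_a or not subs_b:
        return 0.0
    # Each subrecord of A is paired with the best-matching subrecord of B;
    # the overall score is the mean of these maxima.
    return sum(max(score(sa, sb, level + 1) for sb in subs_b)
               for sa in subs_a) / len(subs_a)


print(score("Dept. of Comput. Sci., Stanford Univ.",
            "Computer Science Department, Stanford University"))   # prints 0.875

Stop word removal and patterns (iii) and (iv) are omitted from this sketch; in a fuller implementation they would slot into abbrev_match and the tokenization step respectively.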


II.B.2 The Smith-Waterman algorithm

Given two strings of characters, the Smith-Waterman algorithm [Smith and Waterman, 1981] uses dynamic programming to find the lowest cost series of changes that converts one string into the other, i.e. the minimum "edit distance" weighted by cost between the strings. Costs for individual changes, which are mutations, insertions, or deletions, are parameters of the algorithm. Although edit-distance algorithms have been used for spelling correction and other text applications before, this work is the first to show how to use an edit-distance method effectively for general textual record matching.

For matching textual records, we define the alphabet to be the lower case and upper case alphabetic characters, the ten digits, and three punctuation symbols: space, comma, and period. All other characters are removed before applying the algorithm. This particular choice of alphabet is not critical.

The Smith-Waterman algorithm has three parameters m, s, and c. Given the alphabet $\Sigma$, m is a $|\Sigma| \times |\Sigma|$ matrix of match scores for each pair of symbols in the alphabet. The matrix m has entries for exact matches, for approximate matches, as well as for non-matches of two symbols in the alphabet. In the original Smith-Waterman algorithm, this matrix models the mutations that occur in nature. In this work, the matrix tries to account for typical phoneme and typing errors that occur when a record is entered into a database.

Much of the power of the Smith-Waterman algorithm is due to its ability to introduce gaps in the records. A gap is a sequence of non-matching symbols; these are seen as dashes in the example alignments of Figure II.1. The Smith-Waterman algorithm has two parameters which affect the start and length of the gaps. The scalar s is the cost of starting a gap in an alignment, while c is the cost of continuing a gap. The ratios of these parameters strongly affect the behavior of the algorithm. For example, if the gap penalties are such that it is relatively inexpensive to continue a gap (c < s) then the Smith-Waterman algorithm prefers a single long gap over many short gaps. Intuitively, since the Smith-Waterman algorithm allows for gaps

of unmatched characters, it should cope well with many abbreviations. It should also perform well when records have small pieces of missing information or minor syntactical differences, including typographical mistakes.

The Smith-Waterman algorithm works by computing a score matrix E. One of the strings is placed along the horizontal axis of the matrix, while the second string goes along the vertical axis. An entry E(i, j) in this matrix is the best possible matching score between the prefix 1...i of one string and the prefix 1...j of the second string. When the prefixes (or the entire strings) match exactly, then the optimal alignment can be found along the main diagonal. For approximate matches, the optimal alignment is within a small distance of the diagonal. Formally, the value of E(i, j) is

$$
E(i, j) = \max
\begin{cases}
E(i-1,\, j-1) + m(\mathrm{letter}(i), \mathrm{letter}(j)) & \\
E(i-1,\, j) + c & \text{if } \mathrm{align}(i-1,\, j) \text{ ends in a gap} \\
E(i-1,\, j) + s & \text{if } \mathrm{align}(i-1,\, j) \text{ ends in a match} \\
E(i,\, j-1) + c & \text{if } \mathrm{align}(i,\, j-1) \text{ ends in a gap} \\
E(i,\, j-1) + s & \text{if } \mathrm{align}(i,\, j-1) \text{ ends in a match}
\end{cases}
$$

All experiments reported in this thesis use the same Smith-Waterman algorithm with the same gap penalties and match matrix. The parameter values were determined using a small set of affiliation records. The experiments showed that the values chosen were intuitively reasonable and provided good results. The match score matrix is symmetric with all entries −3, except that an exact match scores 5 (regardless of case) and approximate matches score 3. An approximate match occurs between two characters if they are both in one of the sets {d t}, {g j}, {l r}, {m n}, {b p v}, {a e i o u}, {, .}. The penalties for starting and continuing a gap are 5 and 1 respectively. The informal experiments just mentioned show that the penalty to start a gap should be similar in absolute magnitude to the score of an exact match between two letters, while the penalty to continue a gap should be smaller than the score of an approximate match. If these conditions are met, the accuracy of the Smith-Waterman algorithm is

nearly unaffected by the precise values of the gap penalties. The experiments varied the penalty for starting gaps by considering values smaller and greater than an exact match. Similarly, the penalty to continue a gap was varied by considering values greater than 0.0. The final score calculated by the algorithm is normalized to range between 0.0 and 1.0 by dividing by 5 times the length of the smaller of the two records being compared.

Figure II.1 shows two typical optimal alignments produced by the Smith-Waterman algorithm with the choice of parameter values described. The records shown are taken from datasets used in the experiments of Section II.C for measuring the accuracy of the record matching algorithms in this section. These examples show that with the chosen values for the gap penalties, the algorithm detects abbreviations by introducing gaps where appropriate. The second pair of records also shows the inability of the Smith-Waterman algorithm to match out-of-order subrecords.

department- of chemical engineering, stanford university, ca------lifornia
Dep------t. of Chem---. Eng-------., Stanford Univ-----., CA, USA.

psychology department, stanford univ-----------ersity, palo alto, calif
Dept. of Psychol-------------., Stanford Univ., CA, USA.

Figure II.1: Optimal record alignments produced by the Smith-Waterman algorithm.

The Smith-Waterman algorithm uses dynamic programming and its running time is proportional to the product of the lengths of its input strings. This quadratic time complexity is similar to that of our recursive record matching algorithm. The two algorithms are different in that the Smith-Waterman algorithm is symmetric: the score of matching record A to B is the same as the score of matching B to A. The recursive algorithm is asymmetric when two records have different numbers of subrecords. Symmetry may be a natural requirement for some applications of record matching but not for others. For example, the name "Alvaro E. Monge" matches "A. E. Monge" while the reverse is not necessarily true.
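The following Python sketch, written for this rewrite rather than taken from the dissertation, fills the score matrix E using the recurrence and the parameter values above: +5 for an exact match, +3 for an approximate match, −3 for a mismatch, a penalty of 5 to open a gap and 1 to extend it, with the final score divided by 5 times the length of the shorter record. The boundary treatment, tie-breaking, and example records are implementation assumptions.

# Character-level match scores: exact 5 (case-insensitive), approximate 3, mismatch -3.
APPROX_SETS = [set("dt"), set("gj"), set("lr"), set("mn"),
               set("bpv"), set("aeiou"), set(",.")]
GAP_START = -5.0      # penalty of 5 for starting a gap
GAP_CONTINUE = -1.0   # penalty of 1 for continuing a gap


def char_score(x: str, y: str) -> float:
    x, y = x.lower(), y.lower()
    if x == y:
        return 5.0
    if any(x in s and y in s for s in APPROX_SETS):
        return 3.0
    return -3.0


def sw_score(a: str, b: str) -> float:
    # Global (Needleman-Wunsch style) alignment score, normalized by 5 * min length.
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # E[i][j] is the best score aligning a[:i] with b[:j]; gap[i][j] records whether
    # that alignment ends in a gap, so that extending it costs GAP_CONTINUE.
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        E[i][0], gap[i][0] = GAP_START + GAP_CONTINUE * (i - 1), True
    for j in range(1, m + 1):
        E[0][j], gap[0][j] = GAP_START + GAP_CONTINUE * (j - 1), True
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = E[i - 1][j - 1] + char_score(a[i - 1], b[j - 1])
            up = E[i - 1][j] + (GAP_CONTINUE if gap[i - 1][j] else GAP_START)
            left = E[i][j - 1] + (GAP_CONTINUE if gap[i][j - 1] else GAP_START)
            E[i][j] = max(diag, up, left)
            gap[i][j] = E[i][j] != diag
    return E[n][m] / (5.0 * min(n, m))


print(sw_score("Dept. of Chem. Eng., Stanford Univ., CA, USA.",
               "department of chemical engineering, stanford university, california"))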


Optimization

The Smith-Waterman algorithm is an expensive approximate record matching algorithm. Its running time is quadratic in the length of the records being compared, and in our test databases records are strings of at least 50 characters. Therefore, we have modified the algorithm to run in linear time. In order to describe this modification, we must explain at an intuitive level how the algorithm works.

Given two strings of length n and m, the Smith-Waterman algorithm computes an n × m matrix. As mentioned previously, when the strings match, the optimal alignment and best possible matching score are along the main diagonal. For approximate matches, the optimal alignment is within a small distance of the diagonal. Since we are interested only in the precise distances between strings that are similar, the entire matrix does not need to be computed. Instead, only the entries in a strip along the diagonal plus or minus some width need to be computed. Experiments show that with a width of 15 characters, the accuracy of the Smith-Waterman algorithm is practically unchanged. Intuitively, 15 characters allow for approximately two missing words or two abbreviations in succession in one of the records. If there are missing words in both records, then alternative pairs of missing words keep the computation along the diagonal. This limit makes the theoretical time complexity linear, while empirically, the running time of the algorithm decreases by over 50%. Additional experiments and analysis are needed to fully assess the effects on loss of accuracy and decrease in running time due to not computing the entire matrix.
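A minimal way to realize this banded computation, reusing char_score, GAP_START, and GAP_CONTINUE from the sketch above, is to fill only the cells within the chosen width of the diagonal and treat everything outside the strip as unreachable. This is again an illustrative sketch, not the dissertation's code, and the handling of records whose lengths differ by more than the band width is an assumption.

NEG_INF = float("-inf")   # cells outside the band are treated as unreachable


def sw_score_banded(a: str, b: str, width: int = 15) -> float:
    # Banded variant of sw_score: only cells with |i - j| <= width are computed,
    # so the running time is O(width * max(n, m)) rather than O(n * m).
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap = [[False] * (m + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for i in range(1, min(n, width) + 1):
        E[i][0], gap[i][0] = GAP_START + GAP_CONTINUE * (i - 1), True
    for j in range(1, min(m, width) + 1):
        E[0][j], gap[0][j] = GAP_START + GAP_CONTINUE * (j - 1), True
    for i in range(1, n + 1):
        lo, hi = max(1, i - width), min(m, i + width)
        for j in range(lo, hi + 1):
            diag = E[i - 1][j - 1] + char_score(a[i - 1], b[j - 1])
            up = E[i - 1][j] + (GAP_CONTINUE if gap[i - 1][j] else GAP_START)
            left = E[i][j - 1] + (GAP_CONTINUE if gap[i][j - 1] else GAP_START)
            E[i][j] = max(diag, up, left)
            gap[i][j] = E[i][j] != diag
    if E[n][m] == NEG_INF:   # lengths differ by more than the band width
        return 0.0
    return E[n][m] / (5.0 * min(n, m))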

II.B.3 The hybrid algorithm

The two algorithms described previously have some strengths and weaknesses. The recursive record matching algorithm has the advantage of being able to detect matches in the presence of out-of-order strings. However, it is unable to handle records which contain errors. One of the strengths of the Smith-Waterman algorithm, on the other hand, is its resilience to errors in the records. Unfortunately,

it cannot cope with records whose strings are out of order. These strengths and weaknesses suggested the combination of both algorithms. The hybrid algorithm combines the Smith-Waterman algorithm and the recursive matching algorithm from above. At shallower levels of subrecord nesting the hybrid algorithm follows the recursive algorithm, while at deeper levels of nesting the hybrid algorithm uses the Smith-Waterman algorithm.

Specifically, the hybrid algorithm has an additional "level of nesting" parameter L. Say that the entire records being matched are at level 0, their subrecords are at level 1, and so on. Under the hybrid algorithm with parameter L, if subrecords A and B are at level L then their degree of match is defined to be the score calculated by the Smith-Waterman algorithm. Otherwise, if both A and B are at a shallower nesting level, then each subrecord $A_i$ of A is assumed to correspond to the subrecord $B_j$ of B with which it has highest matching score. As with the recursive algorithm, the score of matching A and B then equals the mean of these maximum scores.

The records considered in the experiments of Section II.C have three levels of nesting. The entire records (level 0) are complete institutional addresses. Setting L = 0 makes the hybrid algorithm the same as the Smith-Waterman algorithm. Subrecords at level 1 are separated by commas and are address "lines" such as department names. If we set L = 1 then the hybrid algorithm applies the Smith-Waterman algorithm at this level and can cope with out-of-order address lines. For example, the algorithm can match two records where the department name and the university name appear in different orders. Finally, with L = 2 the hybrid algorithm applies the Smith-Waterman algorithm only to the individual words in address records. In this case the algorithm can match "Department of Psychology" and "Psych. Dept." for example.

In this hybrid algorithm, the Smith-Waterman algorithm is used as the base case score function. The parameter values of the Smith-Waterman algorithm are exactly the same as those described in the previous section. The implementation of the hybrid algorithm follows the recursive algorithm until the base case is considered. There are various hybrid algorithms, one per number of nesting levels in the records. This allows for additional flexibility over the pure recursive algorithm or the Smith-Waterman algorithm. Instead of applying the heuristics for matching records based on abbreviations, the Smith-Waterman algorithm is invoked on the subrecords that are the base case of the recursion. The subrecords that become the base case are determined by the parameter L to the algorithm.

Internet host     institution
cs.ucsd.edu       computer science department, university of california, san diego
cs.stanford.edu   computer science department, stanford university, palo alto, california
(inspec)          Dept. of Comput. Sci., California Univ., San Diego, La Jolla, CA, USA.
(inspec)          Dept. of Comput. Sci. Stanford Univ., CA, USA.

Table II.1: Examples of netfind and inspec records.
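Returning to the hybrid algorithm, the sketch below (written for this rewrite, reusing the sw_score function from the Section II.B.2 sketch and the same style of recursion as the earlier recursive sketch) switches to the Smith-Waterman score once the nesting level reaches the parameter L. The delimiters for the two deeper levels are assumptions chosen to mirror the institutional address records described above.

def split_level(record: str, level: int):
    # Level 1: comma-separated address "lines"; level 2: individual words.
    # These delimiters are illustrative assumptions for address-like records.
    if level == 1:
        return [s.strip() for s in record.split(",") if s.strip()]
    return record.split()


def hybrid_score(a: str, b: str, L: int, level: int = 0) -> float:
    # Recursive matching down to nesting level L, Smith-Waterman at and below L.
    if level >= L:
        return sw_score(a, b)
    subs_a = split_level(a, level + 1)
    subs_b = split_level(b, level + 1)
    if not subs_a or not subs_b:
        return sw_score(a, b)
    # Each subrecord of A is paired with its best-matching subrecord of B,
    # and the overall score is the mean of those maxima.
    return sum(max(hybrid_score(sa, sb, L, level + 1) for sb in subs_b)
               for sa in subs_a) / len(subs_a)


# L = 0 reduces to plain Smith-Waterman on the whole records; L = 1 tolerates
# out-of-order address lines; L = 2 applies Smith-Waterman only to single words.
print(hybrid_score("Psychology Department, Stanford University",
                   "Dept. of Psychol., Stanford Univ., CA, USA.", L=1))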

II.C Experimental evaluation: matching affiliation fields

This section evaluates the performance of the algorithms for record matching introduced in this chapter on real-world datasets of records as used by our worldwide web information discovery tool for finding scientific papers, called WebFind.

II.C.1 The WebFind information discovery tool

Information discovery over the web is a difficult problem. There are many

approaches to the problem. The approach taken by WebFind is to integrate two information sources that complement each other. The two information sources are melvyl and netfind. melvyl is a University of California library service that includes comprehensive databases of bibliographic records, including a science and engineering database called inspec [University of California, 1996]. netfind is a white pages service that gives Internet host addresses and people's e-mail addresses [Schwartz and Pu, 1994a]. Both of these information sources have a field which designates institutional addresses. To integrate inspec and netfind, equivalence of these fields must be determined. WebFind determines this equivalence by using a record matching algorithm. Since institutions are designated very differently in inspec and netfind, it is non-trivial to decide when an inspec institution corresponds to a netfind institution. The current implementation of WebFind uses the recursive record matching algorithm with abbreviation matching to determine this. Chapter V describes WebFind in detail. The remaining sections describe the experiments which were performed in order to evaluate the accuracy of the matching algorithms in detecting matching institutional designators in this domain.

II.C.2 Experiments matching inspec and netfind records

Each record matching algorithm was tested using three datasets. Two of the datasets contain records from the inspec bibliographic database. These records are the result of queries for institutional addresses, performed in November 1996. The first dataset is a collection of affiliation records returned by an inspec query that extracted 2824 bibliographic records containing the keywords "San Diego" and "Univ." in their affiliation field. The dataset was created by saving only the affiliation field of these citations. This dataset contains 273 affiliations after exact duplicate affiliations are removed. It contains mostly records describing various academic departments at the University of California, San Diego (UCSD), but it also includes a number of affiliations from other universities in San Diego, such as San Diego State University. We refer to this dataset as the UCSD dataset.

group size      1    2    3    4    5    6    7    8    9   10+
UCSD           79   22   13    8    2    0    2    1    2    2
Stanford       60   30   11   12    5    2    2    3    0    9
mixed         148   59   24   16   12    3    3    4    1   12

Table II.2: Numbers of equivalent groups in the three datasets.

The so-called Stanford dataset contains 379 distinct affiliation records, the result of an inspec query that retrieved 5417 total records containing the keywords "Stanford" and "Univ." in their affiliation field. In both cases, there may be UCSD and Stanford records in inspec which are not in the datasets, because typographical errors in those records place them outside the scope of the queries given to inspec. The third dataset is the union of the UCSD and Stanford datasets with an additional 46 UCSD and Stanford records from netfind. Of the 46 netfind records, 21 are from Stanford and 25 from UCSD. In all records, the stop words in the set {and, in, for, the, of, on, &, ?, /} are removed before matching, although it is not critical to do so. Stop words are strings which, whether matched or not, do not add to the matching of two records.

Each dataset contains duplicate records. Duplicates exist in the results gathered from the inspec database: the records in this database are entered manually, and thus there are data entry errors. In addition, there is no strict format for entering the affiliation information about an author. There is only an informal schema, so the same affiliation information can be entered in different formats, thereby introducing duplicate records. The mixed dataset contains another type of duplicate record: the same affiliation can be present in the inspec database and in the netfind service. Since these two information sources use very different formats to represent affiliations, the mixed dataset contains duplicate records for the same affiliation, making it a harder test set for the algorithms.


Figure II.2: Recursive algorithm with abbreviation matching: average recall versus precision.

For each dataset we identified groups of truly equivalent records. For example, the two Stanford records in Table II.1 form a group. Table II.2 shows the number of groups of each size in each dataset. A group of size 1 is a single record for which there is no match in a dataset. The process of identifying these equivalence classes of records was manual. During this process, no special attention was given to how the algorithms actually match records.

Examining the groups of equivalent records suggests that all algorithms will have difficulty matching some records. In particular, in each dataset some records mention more than one academic department. Consider for example the record

    Dept. of Electr. Eng. & Stat., Stanford Univ., CA, USA.

This record is given in inspec as the address of the author of a single-author conference paper. Presumably the author is affiliated with both the Electrical Engineering and Statistics departments at Stanford. Intuitively, this record belongs to two groups of equivalent records, but other members of these groups are not equivalent to each other. The existence of overlapping but distinct groups of matching records will be difficult for any matching algorithm to deal with.


Figure II.3: Recursive algorithm without abbreviation matching: average recall versus precision.

II.C.3 Recall and precision experimental results

The performance of a record matching algorithm can be evaluated by viewing the problem in information retrieval terms [Salton and McGill, 1983]. Given a set of possibly equivalent records, consider each record in turn to be a query, and rank all other records according to their degree of match as computed by the record matching algorithm.

The accuracy of the algorithm then corresponds to retrieval effectiveness as measured by precision and recall. Recall is the proportion of relevant information (i.e., truly matching records) actually retrieved (i.e., detected), while precision is the proportion of retrieved information that is relevant. The amount of retrieved information varies depending on what threshold is chosen for match scores. Typically, as the threshold is increased, recall decreases from 100% to 0%, while precision increases from 0% to 100%.

Each experiment reported below concerns one record matching algorithm and one dataset. Each record in the dataset is treated in turn as a "query." The record matching algorithm is applied to this "query" and to every other "document" record in the dataset. With a fixed threshold for declaring a match between the "query" and a "document" record, we obtain a set of retrieved "document" records. Recall and precision are calculated by comparing the retrieved set with the previously identified set of relevant records for the "query" record. By varying the threshold for declaring a match, precision is calculated as a function of recall. The average precision for a given recall level is calculated by arithmetic averaging, over all "query" records, of the observed precision at that recall level. This average precision is then plotted as a function of recall.

Five algorithms were tested: the recursive algorithm from Section II.B.1 with and without the heuristics for matching abbreviations, the Smith-Waterman algorithm from Section II.B.2, and the hybrid algorithm from Section II.B.3 with L = 1 and with L = 2. Figures II.2 to II.6 show average recall versus precision for each algorithm tested on the three datasets. Figure II.7 shows average recall versus precision for all the algorithms on the mixed dataset. In summary, all algorithms perform worst on the mixed dataset, which is the most diverse. The expected tradeoff between recall and precision is visible. For all levels of recall, the simple Smith-Waterman algorithm has lower precision than the other two algorithms. The recursive and hybrid algorithms have similar accuracy.
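The per-query evaluation described at the beginning of this section can be sketched as follows. The function is illustrative: match_score stands for any of the record matching algorithms, and the interpolation of the best precision at or beyond each recall level is one common information-retrieval convention, not necessarily the exact averaging procedure used to produce the figures.

    def precision_versus_recall(query, dataset, relevant, match_score, recall_levels):
        """Rank all other records by match score against the query and report,
        for each recall level, the best precision achievable at that recall or beyond."""
        if not relevant:
            return {}
        ranked = sorted((d for d in dataset if d is not query),
                        key=lambda d: match_score(query, d), reverse=True)
        best = [0.0] * len(recall_levels)
        hits = 0
        for k, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
            recall, precision = hits / len(relevant), hits / k
            for i, level in enumerate(recall_levels):
                if recall >= level:
                    best[i] = max(best[i], precision)
        return dict(zip(recall_levels, best))

Averaging these per-query values arithmetically over all query records gives the curves plotted in the figures.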


Figure II.4: Smith-Waterman algorithm (L = 0): average recall versus precision.

Two properties of the records account for these results. First, the inspec records contain many abbreviations but few typographical errors, unlike the mailing list records discussed in Section II.D. Hence, our heuristics for recognizing abbreviations are useful, but the generality of the Smith-Waterman algorithm is less important. When an abbreviation-matching heuristic is successful, the match is considered exact and given a score of 1.0. The Smith-Waterman algorithm matches the same abbreviations, but by introducing a gap in the non-abbreviated string. This leads to a match score strictly less than 1.0. Therefore, the Smith-Waterman-based methods tend to give lower scores overall for matching pairs of inspec and netfind records.

The second important property of these records is that they often have out-of-order address lines. The simple Smith-Waterman algorithm is less accurate here because it implicitly assumes a fixed order for the components of records. Whenever components appear in a different order in two records, the Smith-Waterman algorithm cannot match some of these components.


Figure II.5: Subrecord-level Smith-Waterman algorithm (L = 1): average recall versus precision.

The recursive algorithm performs well because it tolerates transpositions down to the lowest level (i.e., out-of-order individual words). The accuracy of the hybrid algorithm approaches that of the recursive algorithm when the hybrid algorithm applies the Smith-Waterman algorithm at the subrecord level, thus tolerating the transpositions that are common in these records. Applying the Smith-Waterman algorithm at the level of individual words gives slightly worse accuracy because it fails to take into account useful ordering information inside subrecords.

II.D Matching mailing addresses

In this section, we show how the proposed record matching algorithms perform on some difficult pairs of records chosen as especially informative by Hernandez (1996). The algorithms from the previous section are shown to perform better than the application-specific, handcrafted record matching algorithm created for this domain by Hernandez and Stolfo (1995).


Figure II.6: Word-level Smith-Waterman algorithm (L = 2): average recall versus precision.

II.D.1 Generating mailing address records

Records in this domain are typical mailing list entries corrupted by typographical errors and other mistakes. Each record contains nine fields: social security number, first name, middle initial, last name, address, apartment, city, state, and zip code. All field values are chosen randomly and independently. Personal names are chosen from a list of 63000 real names. Address fields are chosen from lists of 50 state abbreviations, 18670 city names, and 42115 zip codes.²

² These lists are available at ftp://ftp.denet.dk/pub/wordlists and ftp://cdrom.com/pub/FreeBSD/FreeBSD-current/src/share/misc/zipcodes.


Figure II.7: Average recall versus precision for all algorithms on the mixed dataset.

Once the database generator creates a random true record, it creates a random number of duplicate records according to a fixed probability distribution. When it creates a duplicate record, the generator introduces errors (i.e., noise) into the true record. Possible errors range from small typographical slips to complete name and address changes. The database generator allows one to choose the probabilities with which errors of each type are introduced. The typographical errors introduced by the generator occur with relative frequencies known from previous research on spelling correction algorithms [Pollock and Zamora, 1987; Church and Gale, 1991; Kukich, 1992]. Some of these errors change a record field by inserting, deleting, and transposing characters. These are errors of the type that an edit-distance algorithm such as the variant of the Smith-Waterman algorithm from Section II.B.2 is designed to detect. However, the Smith-Waterman algorithm was developed, and its parameter values were chosen, without knowledge of the particular error probabilities used by the database generator.

Moreover, the database generator introduces errors of other types, including transpositions of entire words, complete changes in names and zip codes, and social security number omissions. The pairwise record matching algorithm of Hernandez and Stolfo (1995) has special rules to cope with each of these error types, while the Smith-Waterman algorithm variant does not.
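A minimal sketch of this kind of noise injection is shown below. The field names, the error probabilities, and the restriction to character-level edits plus an occasional field omission are illustrative assumptions; the actual generator supports a richer set of errors, as described above.

    import random
    import string

    def corrupt_string(value, p_edit=0.05):
        """Introduce character-level insertions, deletions, and transpositions."""
        chars = list(value)
        i = 0
        while i < len(chars):
            if random.random() < p_edit:
                op = random.choice(["insert", "delete", "transpose"])
                if op == "insert":
                    chars.insert(i, random.choice(string.ascii_lowercase))
                elif op == "delete":
                    del chars[i]
                elif op == "transpose" and i + 1 < len(chars):
                    chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 1
        return "".join(chars)

    def make_duplicate(record, p_omit_ssn=0.1):
        """Create one noisy duplicate of a record given as a dict of fields."""
        duplicate = {field: corrupt_string(value) for field, value in record.items()}
        if random.random() < p_omit_ssn:
            duplicate["ssn"] = "missing"   # hypothetical field name
        return duplicate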

II.D.2 Experimental results

Table II.4 shows the status determined by the Hernandez and Stolfo (1995) algorithm and the scores assigned by the algorithms of this chapter to the six pairs of records shown in Table II.3. The Truth column indicates that the first four pairs are "true" matches while the last two pairs do not describe the same individuals. The third column shows that the algorithm of Hernandez and Stolfo (1995) performs correctly only on the first two pairs. The fourth through eighth columns show that, with appropriate thresholds, each of the algorithms of this chapter can discriminate the true matches from the false matches correctly. For both variants of the recursive algorithm, an appropriate threshold for declaring a match is 0.25. For the Smith-Waterman-based algorithms, appropriate thresholds are 0.35, 0.55, and 0.75, depending on the level of nesting at which the Smith-Waterman algorithm is applied. With the hybrid algorithm, scores are higher when the Smith-Waterman algorithm is applied at a deeper level, so the threshold for declaring a match must be higher.

We can draw further conclusions from looking at how close individual scores in Table II.4 are to the corresponding "true" 0.0 or 1.0 scores. Notably, we see that the hybrid algorithm gives better separation between matching pairs and non-matching pairs when the Smith-Waterman algorithm is applied at deeper levels. The second and fourth pairs of records are particularly interesting. On the second pair, the word-level hybrid algorithm performs better than the other hybrid algorithms because of a word-level transposition ("Colette Johnen" versus "John Colette"). The recursive algorithm performs better with heuristics for abbreviation matching because "Johnen" matches "John" with these heuristics. On the fourth pair, the Smith-Waterman-based algorithms do quite well, while the other algorithms can only match the city, state, and zip code fields; they cannot tolerate the typographical errors in the other fields of this pair of records.

Pair   Soc. Sec. number   Name                 Address                    City, State, Zip code
1      850982319          Ivette A Keegan      23 Florida Av.             missing
       950982319          Yvette A Kegan       23 Florida St.             missing
2      missing            Colette Johnen       600 113th St. apt. 5a5     missing
       missing            John Colette         600 113th St. ap. 585      missing
3      missing            Brottier Grawey      PO gBox 2761               Clovis NM 88102
       missing            B baermel            O Box 2761                 Clovis NM 88102
4      152014425          Bahadir T Bihsya     220 Jubin 8s3              Toledo OH 43619
       152014423          Bishya T ulik        318 Arpin St 1p2           Toledo OH 43619
5      274158217          Frankie Y Gittler    PO Box 3628                Gresham, OR 97080
       267415817          Erlan W Giudici      PO Box 2664                Walton, OR 97490
6      760652621          Arseneau N Brought   949 Corson Ave 515         Blanco NM 87412
       765625631          Bogner A Kuxhausen   212 Corson Road 0o3        Raton, NM 87740

Table II.3: Pairs of records used to evaluate the matching algorithms.

II.E Discussion

This chapter addressed the problem of reconciling information from heterogeneous sources in an application-independent manner. Different information sources may represent the same real-world objects differently, so it is difficult to recognize when information is about the same objects. We call this problem "record matching." The same problem appears in the literature under the name "record linkage." Previous work has attempted to solve the problem using very application-specific approaches. This chapter presents algorithms which are domain independent.

The recursive matching algorithm is based on the typical structure of textual records. At the base case, a number of heuristics for matching abbreviations are applied to the atomic strings. While this algorithm can match records which have some of their strings out of order, it cannot cope with errors. The Smith-Waterman algorithm is also proposed to compare records. The parameters of the Smith-Waterman algorithm make it well suited for records which contain different kinds of errors; however, it does not handle the problem of out-of-order strings. Thus, the hybrid algorithm combines the recursive and Smith-Waterman algorithms to give a record matching algorithm that copes with out-of-order strings and is resilient to errors.

These record matching algorithms should be useful for typical alphanumeric records. Experiments were performed on records that contain fields such as names, addresses, titles, dates, identification numbers, and so on. Experimental results from two different application domains show that the algorithms are highly effective.

Record Pair          Hernandez-    Recursive             Smith-Waterman
number      Truth    Stolfo        exact     abbrev.     record    subrecord    word
1           1        1             0.3333    0.4444      0.8486    0.8534       0.9185
2           1        1             0.7000    0.8667      0.6708    0.8611       0.9422
3           1        0             0.5833    0.5833      0.7256    0.7922       0.9500
4           1        0             0.3333    0.3333      0.4189    0.6030       0.8017
5           0        1             0.1667    0.1667      0.3191    0.4901       0.5503
6           0        1             0.1458    0.1458      0.1633    0.3556       0.6458

Table II.4: Scores assigned by six matching algorithms to pairs of records.

Chapter III

Adaptive and scalable detection of approximately duplicate database records

This chapter considers the problem of detecting when records in a database are duplicates of each other, even if they are not textually identical. If multiple duplicate records concern the same real-world entity, they must be detected in order to have a consistent database. Multiple records for a single entity may exist because of typographical data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. The problem is thus one of consolidating the records in these databases, so that each entity is represented by a single record. This is a necessary and crucial preprocessing step in data warehousing and data mining applications, where data is collected from many different sources and inconsistencies can lead to erroneous results. Before performing data analysis operations, the data must be preprocessed and organized into a consistent form.

The duplicate detection problem is different from, but related to, the schema matching problem [Batini et al., 1986; Kim et al., 1993; Madhavaram et al., 1996; Song et al., 1996].

That problem is to find the correspondence between the structure of records in one database and the structure of records in a different database. The problem of actually detecting matching records still exists even when the schema matching problem has been solved. For example, consider records from different databases that include personal names. The fact that there are personal name attributes in each record is detected by schema matching. However, record-level approximate duplicate detection is still needed in order to combine different records concerning the same person. Record-level duplicate detection, or record matching, may be needed because of typographical errors or varying abbreviations in related records. Record matching may also be used as a substitute for detailed schema matching, which may be impossible for semi-structured data. For example, records often differ in the detailed format of personal names or addresses. Even if records follow a fixed high-level schema, some of their fields may not follow a fixed low-level schema, i.e., the division of fields into subfields may not be standardized.

A set of records that refer to the same entity can be interpreted in two ways. One way is to view one of the records as correct and the other records as duplicates containing erroneous information. The task then is to cleanse the database of the duplicate records [Silberschatz et al., 1995; Hernandez and Stolfo, 1995]. Another interpretation is to consider each matching record as a partial source of information. The aim is then to merge the duplicate records, yielding one record with more complete information [Hylton, 1996].

This chapter is organized into the following sections. The next section describes previous methods used to detect duplicate records in a database. Section III.B reviews the first contribution, a linear-time variant of the Smith-Waterman algorithm that provides a domain-independent method of matching pairs of records. Section III.C explains the second contribution, the use of the union-find data structure to keep track efficiently of clusters of duplicate records as they are discovered.

In Section III.E, the overall algorithm for detecting approximate duplicates is described. Here, the contribution is a method which, by adapting to the size and homogeneity of the discovered clusters, minimizes the number of expensive pairwise record matching operations that must be performed. Section III.E.1 analyzes the time and space complexity of the duplicate detection algorithm. The next chapter presents experimental results for the algorithm, first on randomly created synthetic databases and then on a real bibliographic database.

III.A Previous work

The standard method of detecting exact duplicates in a table is to sort the table and then to check whether neighboring tuples are identical. Optimizations of this approach are described by Bitton and DeWitt (1983). The approach can be extended to detect approximate duplicates. The idea is to use sorting to achieve a preliminary clustering, and then to do pairwise comparisons of nearby records [Newcombe et al., 1959; Giles et al., 1976]. Sorting is typically based on an application-specific key chosen to make duplicate records likely to appear near each other. The key is also known as the "blocking criterion" [Winkler, 1985; Winkler, 1995]. In Hernandez and Stolfo (1995), the authors compare nearby records by sliding a window of fixed size over the sorted database. If the window has size W, then record i is compared with records i − W + 1 through i − 1 if i ≥ W, and with records 1 through i − 1 otherwise. The number of comparisons performed is O(TW), where T is the total number of records in the database.

In order to improve accuracy, the results of several passes of duplicate detection can be combined [Newcombe et al., 1959; Kilss and Alvey, 1985]. Typically, combining the results of several passes over the database with small window sizes yields better accuracy for the same cost than one pass over the database with a large window size. Hernandez and Stolfo (1995) combine the results of multiple passes by explicitly computing the transitive closure of all discovered pairwise "is a duplicate of" relationships.

If record R1 is a duplicate of record R2, and record R2 is a duplicate of record R3, then by transitivity R1 is a duplicate of record R3. Transitivity is true by definition if duplicate records concern the same real-world entity, but in practice there will always be errors in computing pairwise "is a duplicate of" relationships, and transitivity will propagate these errors. However, in typical databases, sets of duplicate records tend to be distributed sparsely over the space of possible records, and the propagation of errors is rare. The experimental results of Hernandez and Stolfo (1995), Hernandez (1996), and Chapter IV of this thesis confirm this claim.

Hylton (1996) uses a different, more expensive, method to do a preliminary grouping of records. Each record is considered separately as a "source record" and used to query the remaining records in order to create a group of potentially matching records. Then each record in the group is compared with the source record using his pairwise matching procedure.
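For reference, one pass of the standard sorted-neighborhood scheme described above can be sketched as follows; sort_key (the blocking criterion) and is_duplicate (the pairwise matcher) are placeholders.

    def windowed_pass(records, sort_key, is_duplicate, window_size):
        """Sort on the key, then compare each record with the previous
        window_size - 1 records; returns the detected duplicate pairs
        as indices into the sorted order."""
        ordered = sorted(records, key=sort_key)
        pairs = set()
        for i in range(len(ordered)):
            for j in range(max(0, i - window_size + 1), i):
                if is_duplicate(ordered[j], ordered[i]):
                    pairs.add((j, i))
        return pairs

The number of pairwise comparisons is O(TW), and a separate transitive-closure step (or the union-find structure introduced below) is still needed to turn the detected pairs into clusters.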

III.B Pairwise record matching algorithms

Every duplicate detection method proposed to date, including ours, requires an algorithm for detecting "is a duplicate of" relationships between pairs of records. Typically this algorithm is relatively expensive computationally, and grouping methods like those described in the previous section reduce the number of times that it must be applied. The pairwise record matching algorithms used in most previous work have been application-specific. The system proposed here uses a linear-time generalized edit-distance algorithm for pairwise record matching, a variant of the Smith-Waterman algorithm. Section II.B.2 provides a thorough presentation of the Smith-Waterman algorithm and of the optimized version used in detecting approximate duplicate records. The algorithm evaluated in that section is used here without any modifications; in particular, the parameter values are unchanged. This achieves the kind of domain independence we are interested in.


III.C Data structure for maintaining the clusters of duplicate records

Under the assumption of transitivity, the problem of detecting duplicates in a database can be described in terms of keeping track of the connected components of an undirected graph. Let the vertices of a graph G represent the records in a database of size T. Initially, G contains T unconnected vertices, one for each record in the database. There is an undirected edge between two vertices if and only if the records corresponding to the pair of vertices are found to match, according to the pairwise record matching algorithm.

When considering whether to apply the expensive pairwise record matching algorithm to two records, we can query the graph G. If both records are in the same connected component, then it has been determined previously that they are approximate duplicates, and the comparison is not needed. If they belong to different components, then it is not known whether they match or not. If comparing two records results in a match, their respective components should be combined to create a single new component. This is done by inserting an edge between the vertices that correspond to the records compared. At any time, the connected components of the graph G correspond to the transitive closure of the "is a duplicate of" relationships discovered so far.

Consider three records Ru, Rv, and Rw and their corresponding nodes u, v, and w. When the fact that Ru is a duplicate of Rv is detected, an edge is inserted between the nodes u and v, thus putting both nodes in the same connected component. Then, when the fact that Rv is a duplicate of Rw is detected, an edge is inserted between nodes v and w. Transitivity of the "is a duplicate of" relation is equivalent to reachability in the graph. Since w is reachable from u (and vice versa), the corresponding records Ru and Rw are duplicates. This "is a duplicate of" relationship is detected automatically by maintaining the graph G, without comparing Ru and Rw.


III.C.1 The Union-Find data structure

There is a well-known data structure, called the union-find data structure, that efficiently solves the problem of incrementally maintaining the connected components of an undirected graph [Hopcroft and Ullman, 1973; Cormen et al., 1990]. This data structure keeps a collection of disjoint updatable sets, where each set is identified by a representative member of the set. Each set corresponds to a connected component of the graph. The data structure has two operations:

• Union(x, y) combines the sets that contain node x and node y, say Sx and Sy, into a new set that is their union Sx ∪ Sy. A representative for the union is chosen, and the new set replaces Sx and Sy in the collection of disjoint sets.

• Find(x) returns the representative of the unique set containing x. If Find(x) is invoked twice without modifying the set between the requests, the answer is the same.

To find the connected components of a graph G, we first create |G| singleton sets, each containing a single node from G. Then, for each edge (u, v) ∈ E(G), if Find(u) ≠ Find(v) then we perform Union(u, v). At any time, two nodes u and v are in the same connected component if and only if their sets have the same representative, that is, if and only if Find(u) = Find(v). Note that the problem of incrementally computing the connected components of a graph is harder than just finding the connected components. There are linear-time algorithms for finding the connected components of a graph; however, here we require the union-find data structure because we need to find the connected components incrementally, as duplicate records are detected.

The implementation of the disjoint-set data structure that is used includes two heuristics known as union by rank, also referred to as weighted union, and path compression. With these heuristics the worst-case running time of any series of m Union and Find operations over n elements is O((m + n) α(m, n)) [Tarjan, 1975; Tarjan, 1983]. The function α(m, n) is the inverse of Ackermann's function, and grows extremely slowly.

For all practical cases α(m, n) ≤ 4, and the running time is linear in the number of operations and elements. The textbook of Cormen et al. (1990) has a detailed description of the union-find data structure, the two heuristics, and an analysis of its running time (pages 440–458).
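A standard implementation of this data structure with both heuristics is sketched below; it is a textbook version, not necessarily identical in detail to the implementation used in the thesis.

    class UnionFind:
        """Disjoint sets with union by rank and path compression."""
        def __init__(self, elements):
            self.parent = {x: x for x in elements}
            self.rank = {x: 0 for x in elements}

        def find(self, x):
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            # Path compression: point every node on the path directly at the root.
            while self.parent[x] != root:
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, x, y):
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return
            # Union by rank: attach the shallower tree under the deeper one.
            if self.rank[rx] < self.rank[ry]:
                rx, ry = ry, rx
            self.parent[ry] = rx
            if self.rank[rx] == self.rank[ry]:
                self.rank[rx] += 1

Two records are then in the same detected cluster exactly when find returns the same representative for both.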

III.D Improved simple algorithm

The previous section described a way to maintain the clusters of duplicate records and to compute the transitive closure of "is a duplicate of" relationships incrementally. This section uses the union-find data structure to improve the standard method for detecting approximate duplicate records.

As in other algorithms, the algorithm performs multiple passes of sorting and scanning. Whereas previous algorithms sort the records in each pass according to domain-specific criteria, this work proposes to use domain-independent sorting criteria. Specifically, the algorithm uses two passes. The first pass treats each record as one long string and sorts these lexicographically, reading from left to right. The second pass does the same reading from right to left. After sorting, the algorithm scans the database with a fixed size window. Initially, the union-find data structure (i.e., the collection of dynamic sets) contains one set per record in the database. The window slides through the records in the sorted database one record at a time (i.e., windows overlap). In the standard window method, the new record that enters the window is compared with all other records in the window. The same is done in this algorithm, with the exception that some of these comparisons are recognized as unnecessary: a comparison is not performed if the two records are already in the same cluster. This can be determined easily by querying the union-find data structure. When considering the new record Rj in the window and some record Ri already in the window, the algorithm first tests whether they are in the same cluster. This involves comparing their respective cluster representatives, that is, comparing the values of Find(Rj) and Find(Ri).

If both these values are the same, then no comparison is needed, because both records already belong to the same cluster or connected component. Otherwise the two records are compared. When the comparison is successful, a new "is a duplicate of" relationship is established. To reflect this in the union-find data structure, the algorithm combines the clusters corresponding to Rj and Ri by making the function call Union(Rj, Ri).

Chapter IV has the results of experiments comparing this improved algorithm to the standard method. We expect that the improved algorithm will perform fewer comparisons. Fewer comparisons usually translate into decreased accuracy. However, similar accuracy is expected here, because the comparisons which are not performed correspond to records which are already members of a cluster, most likely due to the transitive closure of the "is a duplicate of" relationships. In fact, all experiments show that the improved algorithm is as accurate as the standard method while performing many fewer record comparisons.
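A sketch of one sorting-and-scanning pass of this improved algorithm is shown below, using the UnionFind sketch above. Records are assumed to be hashable identifiers, and match stands for the Smith-Waterman comparison against the matching threshold; both are assumptions for exposition.

    def improved_window_pass(sorted_records, uf, match, window_size):
        """Fixed size window scan that skips comparisons between records
        already known (via union-find) to be in the same cluster."""
        for i, new_rec in enumerate(sorted_records):
            for j in range(max(0, i - window_size + 1), i):
                old_rec = sorted_records[j]
                if uf.find(old_rec) == uf.find(new_rec):
                    continue                      # already in the same cluster
                if match(old_rec, new_rec):
                    uf.union(old_rec, new_rec)    # record the new relationship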

III.E The overall priority queue algorithm

The algorithm described in the previous section has the weakness that the window used for scanning the database records is of fixed size. If a cluster in the database has more duplicate records than the size of the window, then it is possible that some of these duplicates will not be detected, because not enough comparisons are being made. Furthermore, if a cluster has very few duplicates or none at all, then comparisons may be performed that are not needed. An algorithm is needed which responds adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This section describes such a strategy, which is the high-level strategy adopted in the duplicate detection algorithm proposed in this work.

Before describing the algorithm, we need to analyze the fixed size window method. The fixed size window algorithm effectively saves the last |W| − 1 records for possible comparisons with the new record that enters the window as it slides by one record.

The key observation is that in most cases it is unnecessary to save all these records. The reason is that sorting has already placed approximate duplicate records near each other, so most of the |W| − 1 records in the window already belong to the same cluster. The new record will either become a member of that cluster, if it is not already a member of it, or it will be a member of an entirely different cluster. In either case, exactly one comparison per cluster represented in the window is needed. Since in most cases all the records in the window belong to the same cluster, only one comparison will usually be needed. Thus, instead of saving individual records in a window, the algorithm saves clusters. This leads to the use of a priority queue, in place of a window, to save cluster records. The rest of this section describes this strategy as it is embedded in the duplicate detection system.

First, like the algorithm described in the previous section, two passes of sorting and scanning are performed. The algorithm scans the sorted database with a priority queue of sets of records belonging to the last few clusters detected. The priority queue contains a fixed number of sets of records; in all the experiments reported below this number is 4. Each set contains one or more records from a detected cluster. For efficiency reasons, entire clusters should not always be saved, since they may contain many records. On the other hand, a single record may be insufficient to represent all the variability present in a cluster. Records of a cluster are saved in the priority queue only if they add to the variability of the cluster being represented. The set representing the cluster with the most recently detected cluster member has highest priority in the queue, and so on.

The algorithm scans through the sorted database sequentially. Suppose that record Rj is the record currently being considered. The algorithm first tests whether Rj is already known to be a member of one of the clusters represented in the priority queue. This test is done by comparing the cluster representative of Rj to the representative of each cluster present in the priority queue. If one of these comparisons is successful, then Rj is already known to be a member of the cluster represented by the corresponding set in the priority queue. We move this set to the head of the priority queue and continue with the next record, Rj+1.

Whatever their result, these comparisons are computationally inexpensive because they are done just with Find operations. In the first pass, these Find comparisons are guaranteed to fail, since the algorithm scans the records in the sorted database sequentially and this is the first time each record is encountered; therefore these tests are skipped in the first pass.

Next, in the case where Rj is not a known member of an existing priority queue cluster, the algorithm uses the Smith-Waterman algorithm to compare Rj with records in the priority queue. The algorithm iterates through each set in the priority queue, starting with the highest priority set. For each set, the algorithm scans through the members Ri of the set, comparing Rj to Ri using the Smith-Waterman algorithm. If a match is found, then Rj's cluster is combined with Ri's cluster using a Union(Ri, Rj) operation. In addition, Rj may also be included in the priority queue set that represents Ri's cluster. Specifically, Rj is included if its Smith-Waterman matching score is below a certain "strong match" threshold. This priority queue cluster inclusion threshold is higher than the threshold for declaring a match, but lower than 1.0. Intuitively, if Rj is very similar to Ri, it is not necessary to include it in the subset representing the cluster, but if Rj is only somewhat similar, i.e., its degree of match is below the inclusion threshold, then including Rj in the subset will help in detecting future members of the cluster.

On the other hand, if the Smith-Waterman comparison between Ri and Rj yields a very low score, below a certain "bad miss" threshold, then the algorithm continues directly with the next set in the priority queue. The intuition here is that if Ri and Rj have no similarity at all, then comparisons of Rj with other members of the cluster containing Ri will likely also fail. If the score is close to the matching threshold, then it is worthwhile to compare Rj with the remaining members of the cluster. The "strong match" and "bad miss" thresholds are used to counter the errors which are propagated when computing pairwise "is a duplicate of" relationships.

Finally, if Rj is compared with members of each set in the priority queue without being detected as a duplicate of any of them, then Rj must be a member of a cluster not currently represented in the priority queue. In this case Rj is saved as a singleton set in the priority queue, with the highest priority. If this action causes the size of the priority queue to exceed its limit, then the lowest priority set is removed from the priority queue.
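The scanning strategy just described is sketched below. The deque stands in for the priority queue of cluster subsets, records are assumed to be hashable identifiers, sw_score is a placeholder for the Smith-Waterman score in [0, 1], and the threshold values shown are the ones chosen in Chapter IV; moving a matched set to the front of the queue reflects the most-recently-detected-member priority rule.

    from collections import deque

    def priority_queue_pass(sorted_records, uf, sw_score, queue_size=4,
                            match_t=0.50, strong_t=0.95, bad_miss_t=0.25,
                            first_pass=True):
        """One scanning pass of the adaptive duplicate detection strategy."""
        queue = deque()                           # each element: list of cluster members
        for rj in sorted_records:
            if not first_pass:
                # Cheap membership test using only Find operations.
                hit = next((s for s in queue if uf.find(s[0]) == uf.find(rj)), None)
                if hit is not None:
                    queue.remove(hit)
                    queue.appendleft(hit)
                    continue
            matched = False
            for cluster_set in list(queue):       # highest priority first
                for ri in cluster_set:
                    score = sw_score(ri, rj)
                    if score >= match_t:
                        uf.union(ri, rj)
                        if score < strong_t:      # rj adds variability: keep it
                            cluster_set.append(rj)
                        queue.remove(cluster_set)
                        queue.appendleft(cluster_set)
                        matched = True
                        break
                    if score <= bad_miss_t:       # hopeless: try the next set
                        break
                if matched:
                    break
            if not matched:
                queue.appendleft([rj])            # start a new cluster subset
                if len(queue) > queue_size:
                    queue.pop()                   # drop the lowest priority set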

III.E.1 Computational complexity analysis

Assume that the database contains T records, each of length L, so the database has size O(TL). The union-find data structure uses O(T) space with a small constant. The series of operations performed by the priority queue strategy on this data structure has high locality of access, so standard caching and paging methods work well for accelerating these operations. The priority queue data structure uses O(KML) space, where K is the maximum number of sets in the queue and M is the maximum size of each set. Experiments showed that in practice it is not necessary to fix M, but it is easy to do so, either for theoretical purposes or out of aversion to risk. With small K and M the priority queue data structure easily fits in typical caches.

For fixed T, L, K, and M, all union-find and priority queue operations take small constant amounts of time. For one pass over the database, the number of invocations of the Smith-Waterman algorithm is bounded above by TKM. The overall time complexity of one pass is therefore the time taken by sorting plus the time taken by the duplicate detection scanning scheme, or O(TL log TL + TKML) = O(TL log TL). Since a small fixed number of passes is performed, the total time complexity is the same. The total number of disk accesses required by the algorithm is CTL log TL for some small constant C, since we can use any standard sorting algorithm for large databases that minimizes the number of disk accesses, and the database scanning part of the priority queue strategy simply reads the sorted database sequentially. As this analysis indicates, the complexity is bounded by the clustering algorithm used; since sorting is used in this thesis, we obtain the O(TL log TL) complexity.

Because invocations of the Smith-Waterman algorithm are more expensive than all other operations performed by the priority queue strategy, in practice its time complexity is O(ATL), where A is the average number of invocations of the Smith-Waterman algorithm per record in one pass through the database. The union-find data structure allows the Smith-Waterman algorithm not to be called at all for many records. For other records, the heuristics described in the previous section successfully minimize the number of calls of the Smith-Waterman algorithm. The theoretical minimum is A = 1, since each record must be involved in at least one comparison. In practice, we claim that A ≈ 2.0 for typical databases; in the databases tested in the next chapter, A < 3.5.

The following chapter evaluates the effects that replacing the fixed size window strategy with the priority queue strategy has on accuracy and performance. The experiments show that accuracy remains very similar while the number of comparisons performed decreases sharply.

Chapter IV

Experimental evaluation

This chapter provides an empirical evaluation of the duplicate detection algorithms discussed in the last chapter. There are two main sets of experiments. In the first set, the algorithms are tested on randomly created lists of mailing addresses. In the second set, the algorithms are evaluated on a real database containing bibliographic records.

IV.A Choosing thresholds

The majority of the algorithms evaluated use the Smith-Waterman algorithm (without the optimization from Section II.B.2) to compare records and determine whether they are duplicates. Since the Smith-Waterman algorithm returns a score between 0.0 and 1.0, a threshold must be selected such that scores above the threshold are considered a match. To determine this threshold, a number of small experiments were performed. Table II.3 contains example pairs of records chosen as especially instructive by Hernandez (1996), and Table II.4 shows the pairwise scores assigned by the different algorithms to these records. Column six of the table gives the score assigned by the Smith-Waterman algorithm. The first two pairs are correctly detected to be duplicates by the equational theory rules of Hernandez and Stolfo (1995).


The Smith-Waterman algorithm classifies these as duplicates given any threshold below 0.67. The third and fourth pairs are pairs that the equational theory does not detect as duplicates; the Smith-Waterman algorithm performs correctly on these pairs when the duplicate detection threshold is set at 0.41 or lower. Finally, the equational theory finds the fifth and sixth pairs to be duplicates, when in fact they are not. The Smith-Waterman algorithm correctly assigns these pairs low scores, and performs correctly on them with a threshold of 0.32 or higher.

The matching threshold has to be chosen so as to detect most real duplicates while keeping the number of false positives negligible. These considerations lead to our choice of 0.50 for the matching threshold. Thresholds lower than 0.50 are too aggressive and start to increase the number of false positives, while higher thresholds do not detect all duplicates. All the experiments described below use this fixed, conservative threshold for declaring that a pair of records are approximate duplicates.

The duplicate detection algorithms have an option for which strategy to use when storing records for possible comparisons. Two options have been discussed, the fixed size window and the priority queue strategies. Two passes are performed by the duplicate detection algorithm, and each pass can use either of the two strategies. If the fixed size window is selected, then the size of the window must be chosen. When using the priority queue strategy, the size of the priority queue must be selected, and appropriate values must be chosen for the "bad miss" and "strong match" cutoff scores. Again, small experiments were performed in order to determine these values. The algorithms which use the priority queue strategy use a "bad miss" cutoff score of 0.25 and a "strong match" cutoff score of 0.95. These values were determined to provide a good tradeoff between the number of record comparisons made and the number of clusters detected using those comparisons.


IV.B Measuring accuracy

The measure of accuracy used in this thesis is the number of detected clusters that are "pure". A cluster is pure if and only if it contains only records that belong to the same true cluster of duplicates. This accuracy measure considers entire clusters, not individual records. This is intuitive, since a cluster corresponds to a real-world entity, while individual records do not. A cluster detected by a duplicate detection algorithm falls into one of the following cases:

1. the cluster is equal to a true cluster, or
2. the cluster is a subset of a true cluster, or
3. the cluster contains records from two or more true clusters.

By this definition, a pure cluster falls in either of the first two cases above. Clusters that fall in the last case are referred to as "impure" clusters. A good duplicate detection algorithm will have 100% of the detected clusters pure and 0% impure.

Now consider the ratio of pure clusters to true clusters. If this quantity is greater than 100%, then some of the detected clusters should be unioned to get closer to the number of true clusters; this could be the result of a conservative threshold. On the other hand, an aggressive threshold would cause the ratio to be smaller than 100%, which would suggest breaking clusters up into smaller ones.
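Given the true cluster label of each record, counting pure and impure clusters is straightforward; the following sketch is illustrative (the data representation is an assumption).

    def count_pure_and_impure(detected_clusters, true_label):
        """detected_clusters: iterable of collections of record identifiers.
        true_label: maps a record identifier to its true cluster identifier."""
        pure = impure = 0
        for cluster in detected_clusters:
            labels = {true_label[r] for r in cluster}
            if len(labels) == 1:
                pure += 1       # equal to, or a subset of, one true cluster
            else:
                impure += 1     # mixes records from two or more true clusters
        return pure, impure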

IV.C Algorithms tested

The sections that follow provide the results of the experiments performed in this study. Several algorithms are compared, each made up of different features. The main features that make up an algorithm are the pairwise record matching algorithm used and the structure used for storing records for possible comparisons. The three main algorithms compared are the so-called "Merge/Purge" algorithm [Hernandez and Stolfo, 1995; Hernandez, 1996] and two versions of the algorithm from Section III.E.

One version, "PQS w/SW", uses the Smith-Waterman algorithm to match records, while the second version, "PQS w/HS", uses the equational theory described in Hernandez and Stolfo (1995). For short, the priority queue strategy is abbreviated PQS. Both PQS-based algorithms use the union-find data structure described in Section III.C.1, and both use the priority queue of cluster subsets discussed in Section III.E. The equational theory matcher returns only 0 or 1, so unlike the Smith-Waterman algorithm it does not estimate degrees of pairwise matching. Thus, the strategy of keeping in the priority queue a set of representative records for a cluster must be modified: in the current implementation of the PQS w/HS algorithm, only the most recently detected member of each cluster is kept.

For both PQS-based algorithms, the figures show the number of pure and impure clusters that were detected. The figures also show the number of true clusters in the database, and the number of clusters detected by merge-purge. Unfortunately the merge-purge software does not distinguish between pure and impure clusters. The accuracy results reported for it therefore include both pure and impure clusters, thus slightly overstating its accuracy.

IV.D Databases of mailing addresses

The first experiments reported here use databases that are mailing lists generated randomly by software designed and implemented by Hernandez (1996). The generator is described in Section II.D.1.

IV.D.1 Varying the size of the window and the priority queue

This section presents the results of experiments which vary the size of the structure keeping the records for possible comparison, be it the fixed size window or the priority queue.

            fixed size window                           PQS
size of     pure       comparisons    size of           pure       comparisons
window      clusters                  priority queue    clusters
2           57868      436062         1                 57272      488321
4           54916      791125         3                 53965      839624
8           53880      1469854        7                 53120      1476351
10          53605      1801590        9                 52967      1790079
16          52890      2773420        15                52741      2714561

Table IV.1: Number of pure clusters detected and the number of record comparisons performed by the duplicate detection algorithms when varying the size of the structure storing records for possible comparisons.

The database with 40000 true clusters mentioned in Table IV.2 was chosen to test the algorithms. Section IV.D.3 describes how this database was created. Both the fixed size window and the priority queue strategies were tested. The same strategy was used in both passes of the duplicate detection algorithm. The Smith-Waterman algorithm and the union-find data structure were used with both strategies. The window size was varied over the values 2, 4, 8, 10, and 16 records. A window of size W makes at most W − 1 comparisons per record, so in order to compare the strategies fairly, the priority queue size was varied over 1, 3, 7, 9, and 15.

The accuracy results are shown in Table IV.1 and its corresponding figure, Figure IV.1. Table IV.1 also gives the number of record comparisons, which are plotted in Figure IV.2. Both algorithms make almost the same number of record comparisons, with the priority queue strategy behaving better as the size of its priority queue is increased. The priority queue strategy, however, is more accurate than the fixed size window strategy, as can be seen by comparing the numbers of pure clusters detected: the priority queue strategy is closer to the number of true clusters. A window of size 10 is required by the fixed size window strategy in order for it to detect a number of clusters similar to the number that the priority queue strategy detects with a priority queue of size 4.


Figure IV.1: Accuracy results using the same database and varying the size of the window and priority queue.

With a window of size 10, the fixed size window algorithm detects 53605 pure clusters; the priority queue strategy detects 53614 pure clusters with a priority queue of size 4. In all experiments that follow, the merge-purge algorithm uses a fixed window of size 10, as in most experiments performed by Hernandez and Stolfo (1996). The PQS-based algorithms always use a priority queue containing at most 4 sets of records. This choice makes the accuracy of both algorithms approximately the same. Of course, it is easy to run a PQS-based algorithm with a larger priority queue in order to obtain greater accuracy.

IV.D.2 Varying the number of duplicates per record

A good approximate duplicate detection algorithm should be almost unaffected by changes in the number of duplicates that each record has. To study the effect of increasing this number, the number of duplicates per record was varied using a uniform and a Zipf distribution. For each original record in the database, a uniform distribution with parameter m gives equal probability to each possible number of duplicates between 0 and m. Zipf distributions give high probability to small numbers of duplicates, but still give non-trivial probability to large numbers of duplicates. A Zipf distribution has two parameters, 0 ≤ θ ≤ 1 and D ≥ 1. For 1 ≤ i ≤ D the probability of i duplicates is c i^{θ−1}, where the normalization constant is c = 1 / ∑_{i=1}^{D} i^{θ−1}. A maximum number of duplicates D is necessary because ∑_{i=1}^{∞} i^{θ−1} diverges for θ ≥ 0.
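A sketch of sampling the number of duplicates per original record under the two distributions just defined is shown below; theta and D correspond to the Zipf parameters θ and the maximum number of duplicates, and the code is illustrative rather than the generator actually used.

    import random

    def zipf_duplicate_count(theta, D):
        """Sample i in {1, ..., D} with probability proportional to i**(theta - 1)."""
        weights = [i ** (theta - 1) for i in range(1, D + 1)]
        r = random.random() * sum(weights)
        for i, w in enumerate(weights, start=1):
            r -= w
            if r <= 0:
                return i
        return D

    def uniform_duplicate_count(m):
        """Each count in {0, ..., m} is equally likely."""
        return random.randint(0, m)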


Figure IV.2: Number of record comparisons using the same database and varying the size of the window and priority queue.

50 4

7

x 10

6

Number of clusters

5

4 true clusters 3

PQS w/SW (pure) merge/purge (pure and impure) PQS w/HS matcher: (pure)

2

impure clusters

1

0 0

1

2

3 4 5 6 Average number of duplicates per record

7

8

Figure IV.3: Accuracy results for varying the number of duplicates per original record using a uniform distribution. distribution. The maximum number of duplicates per original record was varied in the values 1, 2, 4, 8, or 16. The noise level was maintained constant in all databases. The smallest database, with zero or one duplicate per original record, had a total of 75162 records, while the largest database had a total of 451314 records. Figure IV.3 shows the results of the rst experiment. As the average number of duplicates per original record increases from 0.5 to 8 under a uniform distribution, the number of pure clusters detected by the PQS w/SW algorithm increases by 22%. For the PQS w/HS algorithm, the increase is 25%. This increase constitutes a decrease in accuracy since we want to get as close to the number of true clusters as possible. The second experiment used databases where the records were duplicated according to the Zipf distribution, with  values in the set f0.1,0.2,0.4,0.8g. The maximum number of duplicates per original record was kept constant at 20. Here

51 4

7

x 10

6

Number of clusters

5

4 true clusters 3

PQS w/SW (pure clusters) merge/purge (pure and impure clusters) PQS w/HS: (pure clusters)

2

impure clusters

1

0 5

5.5

6 6.5 7 7.5 Average number of duplicates per original record

8

8.5

Figure IV.4: Accuracy results for varying the number of duplicates per original record using a Zipf distribution.

The results of this experiment are shown in Figure IV.4. As in the first experiment, both PQS-based algorithms have performance very similar to that of the merge-purge algorithm, with PQS w/SW performing slightly better. In both of these experiments that vary the average number of duplicates per original record, the number of impure clusters remains very small throughout. The fact that nearly 100% of the detected clusters are pure suggests that we could relax various parameters of the algorithm in order to combine more clusters, without erroneously creating too many impure clusters.


Figure IV.5: Accuracy results for varying database sizes (log-log plot).

IV.D.3 Varying the size of the database

This section looks at how the size of the database affects the accuracy of duplicate detection. Consider databases with respectively 10000, 20000, 40000, 80000, and 120000 original records. In each case, duplicates are generated using a Zipf distribution with a high noise level, Zipf parameter θ = 0.4, and a maximum of 20 duplicates per original record. The largest database considered here contains over 900000 records in total. Table IV.2 shows the exact number of records for each of the five databases tested.

Table IV.3 and Table IV.4 show the accuracy results for these experiments; the same data is plotted in Figure IV.5. In order to determine the effect of using the priority queue strategy instead of the fixed size window strategy, another algorithm was also tested. This algorithm, UF + Fix W + SW, uses the union-find data structure from Section III.C.1 with a window of fixed size, 10 records, and uses the Smith-Waterman algorithm to determine "is a duplicate of" relationships.


Figure IV.6: Number of comparisons performed by the algorithms (log-log plot).

Figures IV.5 and IV.6 show the performance of the PQS-based algorithms and the merge-purge algorithm on these databases. Again, the same algorithm parameters as before are used. The figures clearly display the benefits of the PQS-based algorithms. Figure IV.5 shows that the number of clusters detected by all three strategies is similar, with PQS w/SW having slightly better accuracy. The Smith-Waterman-based algorithms had very similar accuracy; this appears as an overlap in the figure. The tables show that the number of impure clusters for the algorithms is small (overall, less than 0.70% of the number of true clusters), as desired.

While all three algorithms detect nearly the same number of clusters, they do not achieve this accuracy with similar numbers of record comparisons. Table IV.5 and its accompanying figure, Figure IV.6, show that the merge-purge algorithm does many more pairwise record comparisons. For the largest database tested, PQS w/SW performs about 3.4 million comparisons and PQS w/HS performs about 3.0 million comparisons, while the merge-purge algorithm performs about 18.8 million comparisons.


Experiment    Database    True
name          size        clusters
10k           75181       10000
20k           150546      20000
40k           300388      40000
80k           604294      80000
120k          907189      120000

Table IV.2: Experiments on five databases of varying size.

             Duplicate detection algorithm
Experiment   UF+Fix W+SW          UF+PQS+SW
name         pure      % impure   pure      % impure
10k          13097     0.19       13170     0.17
20k          26484     0.27       26617     0.24
40k          53605     0.43       53614     0.39
80k          108877    0.60       108491    0.57
120k         165096    0.67       164387    0.67

Table IV.3: Number of pure and impure clusters detected by the Smith-Waterman-based algorithms on databases of varying size.


             Duplicate detection algorithm
Experiment   UF+PQS+HS            M/P
name         pure      % impure   pure + impure
10k          14314     0.04       14217
20k          28836     0.05       28590
40k          58233     0.06       57801
80k          118430    0.10       117617
120k         179909    0.11       179025

Table IV.4: Number of clusters detected by the algorithms using the matching algorithm of Hernandez and Stolfo (1995) on databases of varying size.

             Duplicate detection algorithm
Experiment   UF+Fix W+SW   UF+PQS+SW   UF+PQS+HS   M/P
name
10k          417473        230324      228185      1489017
20k          865876        478073      465060      3007210
40k          1801590       1001045     952767      6070416
80k          3782611       2149907     1977700     12366235
120k         5831469       3379585     3035910     18753559

Table IV.5: Number of comparisons performed by the algorithms.

comparisons, while the merge-purge algorithm performs about 18.8 million comparisons. This is six times as many as the PQS-based algorithm that uses the same pairwise matching method and achieves essentially the same accuracy. When comparing the fixed size window strategies, the experiments again show that while achieving nearly the same accuracy, the algorithm using the union-find data structure performs many fewer record comparisons. In particular, the UF + Fix W + SW algorithm performs about 70% fewer record comparisons than the merge-purge algorithm. When this algorithm is compared with the algorithm that uses both the union-find data structure and the priority queue strategy, nearly the same accuracy is achieved. The experiments show that using the priority queue strategy instead of a fixed size window results in more than 42% fewer record comparisons. Overall, these experiments show the significant improvement that the union-find data structure and the priority queue of cluster subsets strategy have over the merge-purge algorithm. The best depiction of this is in comparing the merge-purge algorithm with the PQS w/HS algorithm. In both of these cases, the exact same record matching function is used. The difference is in the number of times this function is applied. The merge-purge algorithm applies the record matching function on records that fall within a fixed size window, thus making unnecessary record comparisons. The PQS w/HS and the PQS w/SW algorithms apply the record matching function more effectively through the use of the union-find data structure and priority queue of sets.
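The effect of the union-find data structure on a windowed scan can be illustrated with a short sketch. This is not the thesis implementation; the assumption made here is that the saving comes from skipping any window pair whose records the union-find already places in the same cluster, and match(a, b) stands in for whatever pairwise matcher is plugged in (for example the Smith-Waterman-based one).

```python
class UnionFind:
    """Union-find with path compression and union by rank, used to maintain
    the transitive closure of "is a duplicate of" incrementally."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1


def windowed_pass(order, records, uf, match, window=10):
    """One pass over the records in the given sort order: compare each record
    with its window predecessors, but skip pairs that the union-find already
    places in the same cluster.  Returns the number of comparisons made."""
    comparisons = 0
    for i in range(len(order)):
        for j in range(max(0, i - window + 1), i):
            a, b = order[i], order[j]
            if uf.find(a) == uf.find(b):
                continue  # already known to be duplicates: no comparison needed
            comparisons += 1
            if match(records[a], records[b]):
                uf.union(a, b)
    return comparisons
```

In a pure merge-purge style scan every pair inside the window is compared, so under this reading the skip in the inner loop is where the reduction in comparisons reported above would come from.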

Combining the fixed size window and priority queue strategies

The duplicate detection system proposed in this thesis uses two passes over the database. After the first pass finishes, the "is a duplicate of" relationships are passed on to subsequent passes; this is done through the union-find data structure. Since these passes are otherwise independent in their processing, any strategy can be used in order to detect duplicate records. This thesis has discussed two strategies, the fixed size window and the priority queue strategy. It is not clear whether using

the same strategy on both passes is a definite advantage over using a mix of them. However, as the previous experiments show, the priority queue strategy does positively affect the performance of the system. In particular, when compared to using only the fixed size window strategy, similar accuracy is achieved while performing fewer record comparisons.

To analyze the effect of combining the two strategies in the duplicate detection system, the following experiments were performed. All four combinations of the two strategies were tested:

1. Fix W x2: the fixed size window strategy is used by both passes.

2. Fix W & PQS: the first pass uses the fixed size window strategy, while the second pass uses the priority queue strategy.

3. PQS & Fix W: the first pass uses the priority queue strategy, while the second pass uses the fixed size window strategy.

4. PQS x2: the priority queue strategy is used by both passes.

As in previous experiments, the size of the window is fixed at 10 records and the size of the priority queue in the priority queue strategy is 4.

Experiment   Fix W x2    Fix W & PQS   PQS & Fix W   PQS x2
name
10k          13097       13177         13095         13170
20k          26484       26631         26469         26617
40k          53605       53634         53580         53614
80k          108877      108603        108743        108489
120k         165096      164586        164871        164388

Table IV.6: Number of pure clusters detected when different strategies are used in each of the two passes. The size of the window is fixed at 10 records, and the size of the queue is 4.

The union-find data structure is used in all combinations of the algorithms to keep track of the clusters incrementally. Table IV.6 has the number of pure clusters detected by each combination of strategies on databases of varying size. Table IV.7 contains the number of record comparisons that each of the algorithms makes. This data is plotted in Figure IV.7. With respect to the accuracy of the algorithms, the experiments show that for the smaller databases, using the fixed size window strategy in the second pass results in better accuracy. However, for databases of at least 40000 original records, it is better to use the priority queue strategy in both of the passes. When the number of record comparisons is considered, the benefit of the priority queue strategy is readily seen. As with previous experiments, use of the priority queue strategy results in high accuracy with fewer record comparisons than the fixed size window strategy. Table IV.7 shows the decrease in the number of record comparisons. Of interest are the results for the algorithms which use different strategies in the passes. These results show that it is better (both in terms of accuracy and in number of record comparisons performed) to use the priority queue strategy in the first pass and then the fixed size window strategy in the second pass than vice versa. As expected, the combination that uses the priority queue strategy in both passes outperforms the other combinations. It has very similar accuracy while making fewer record comparisons. We conclude that the priority queue strategy is more adaptive than the fixed size window strategy.

Experiment   Fix W x2    Fix W & PQS   PQS & Fix W   PQS x2
name
10k          417473      338483        309371        230324
20k          865876      697596        646573        478074
40k          1801590     1446037       1357436       1001048
80k          3782611     3041135       2893944       2149911
120k         5831469     4711042       4505233       3379589

Table IV.7: Total number of comparisons made to detect the pure clusters from Table IV.6.

Figure IV.7: Number of record comparisons made when mixing the strategies for applying the record matching algorithm, plotted against the total number of records in the database for fix window x2, fix window & PQS, PQS & fix window, and PQS x2. See Table IV.6 and Table IV.7 (log-log plot).
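Building on the UnionFind and windowed_pass sketches given earlier in this chapter, the two-pass design just evaluated can be outlined as follows. The sort keys and the strategies are placeholders: a priority-queue pass would be passed in with the same signature, and match is any pairwise matcher.

```python
def two_pass_detect(records, match, strategies, sort_keys):
    """Sketch of the two-pass design: each pass sorts the records by its own
    key and applies its own scanning strategy, but all passes share a single
    union-find structure, so clusters found in the first pass never have to be
    re-verified in the second."""
    uf = UnionFind(len(records))
    for strategy, sort_key in zip(strategies, sort_keys):
        order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
        strategy(order, records, uf, match)
    # Collect the final clusters from the shared union-find structure.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())

# Example: the "Fix W x2" combination, with a left-to-right and a
# right-to-left lexicographic sort of the record strings.
# clusters = two_pass_detect(records, match,
#                            [windowed_pass, windowed_pass],
#                            [lambda r: r, lambda r: r[::-1]])
```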

IV.D.4 Varying the level of noise in the database

The final experiment studied the effect that the amount of noise in the duplicate records has on accuracy. As Hernandez (1996) explains, his generator has tunable parameters for introducing noise in a record. These include the probability that the social security number is omitted (i.e. set to 000-00-0000), the probability of a typographical error in a name or address, the probability of a complete change in name or zip code, and more. Hernandez (1996) chose various parameter settings to represent "low", "medium", and "high" noise. Roughly, low noise has these probabilities at 5%,

while high noise has them at 25% to 30%. Figure IV.8 shows accuracy results at five different levels of noise. As in other experiments, the PQS-based algorithm used a priority queue size of 4, while the merge-purge algorithm was run with a window of size 10. Each database had 50000 original records that were duplicated according to a Zipf distribution with θ = 0.4 and 20 maximum duplicates per original record, as before. As expected, the accuracy with which duplicates are identified decreases with increased noise. It is interesting to note that additional noise decreases the accuracy of the algorithms using the equational theory matcher more severely than it decreases the accuracy of the algorithm using the Smith-Waterman algorithm. This phenomenon is visible in the smaller slope of the accuracy function for the PQS w/SW algorithm. At the highest level of noise tested, PQS w/SW performs better than the other methods.

Figure IV.8: Accuracy results for varying the noise level. The figure plots the number of clusters against the noise level (low, medium, high) for the true clusters, PQS w/SW (pure clusters), PQS w/HS (pure clusters), merge/purge (pure and impure clusters), and the impure clusters.


IV.E Detecting approximate duplicate records in a real bibliographic database

This section looks at the effectiveness of the algorithm on a real database of bibliographic records describing documents in various fields of computer science published in several sources. The database is a slightly larger version of one used by Hylton (1996). He presents an algorithm for detecting bibliographic records which refer to the same work. The task here is not to purge the database of duplicate records, but instead to create clusters that contain all records about the same entity. An entity in this context is one document, called a "work", that may exist in several versions. A work may be a technical report which later appears in the form of a conference paper, and still later as a journal paper. While the bibliographic records contain many fields, the algorithm of Hylton (1996) considers only the author and the title of each document, as do the methods tested here. The bibliographic records were gathered from two major collections available over the Internet. The primary source is A Collection of Computer Science Bibliographies assembled by Alf-Christian Achilles (1996). Over 200000 records were taken from this collection, which currently contains over 600000 BibTeX records. Since the records come from a collection of bibliographies, the database contains multiple BibTeX records for the same document. In addition, due to the different sources, the records are subject to typographical errors, errors in the accuracy of the information they provide, variation in how they abbreviate author names, and more. The secondary source is a collection of computer science technical reports produced by five major universities in the CS-TR project [Yan and Garcia-Molina, 1995; Kahn, 1995]. This contains approximately 6000 records. In total, the database used contains 254618 records. To apply the duplicate detection algorithm to the database, simple records are created from the complete BibTeX records. Each record contains the author names and the document title from one BibTeX record. As in all other experiments,

the PQS-based algorithm used two passes over the database, the first pass based on sorting the records lexicographically from left to right, and the second pass based on a lexicographic right to left sort. Small experiments allowed us to determine that the best Smith-Waterman algorithm threshold for a match in this database was 0.65. The threshold is higher for this database because it has less noise than the synthetic databases used in other experiments. Results for the PQS-based algorithm using a priority queue size of 4 are presented in Table IV.8. The algorithm detected a total of 163867 clusters, with an average of 1.60 records per cluster. The true number of duplicate records in this database is not known. However, based on visual inspection, the great majority of detected clusters are pure.

Cluster size     Number of clusters   Number of records   % of all records
1                118149               118149              46.40%
2                28323                56646               22.25%
3                8403                 25209               9.90%
4                3936                 15744               6.18%
5                1875                 9375                3.68%
6                1033                 6198                2.43%
7                626                  4382                1.72%
8                445                  3560                1.40%
9                294                  2646                1.04%
10               190                  1900                0.75%
11+              593                  10809               4.25%
total            163867               254618              100.00%
[Hylton, 1996]   162535               242705              100.00%

Table IV.8: Results on a database of bibliographic records.

The number of clusters detected by the PQS-based algorithm is comparable to the results of Hylton (1996) on almost the same database. However, Hylton reports making 7.5 million comparisons to determine the clusters whereas the PQS-based algorithm performs just over 1.6 million comparisons. This savings of over 80% is comparable to the savings observed on synthetic databases. The PQS-based algorithm cannot be compared on this database with the algorithms which use the equational theory because the equational theory only applies to mailing list records. An entirely new equational theory for bibliographic records would have to be written in order to make these comparisons. Since the Smith-Waterman algorithm is domain independent, we successfully used the same algorithm as in the experiments of the previous section without any modifications.
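As an illustration of this setup, the sketch below builds the author-plus-title strings and the two pass orders used here. It assumes each BibTeX entry has already been parsed into a dictionary with 'author' and 'title' fields; the example entries and the threshold constant are shown only for context and are not taken from the actual database.

```python
SMITH_WATERMAN_MATCH_THRESHOLD = 0.65  # value found for this database (see text)

def simple_record(entry):
    """Collapse a parsed BibTeX entry (assumed to be a dict with 'author' and
    'title' keys) into the single string used for matching."""
    return "%s %s" % (entry.get("author", ""), entry.get("title", ""))

def two_pass_orders(records):
    """Return the two sort orders used in the experiments: a left-to-right
    lexicographic sort and a right-to-left one (records compared reversed)."""
    forward = sorted(range(len(records)), key=lambda i: records[i])
    backward = sorted(range(len(records)), key=lambda i: records[i][::-1])
    return forward, backward

# Example with two hypothetical entries describing the same work:
entries = [
    {"author": "J. Hylton", "title": "Identifying and merging related bibliographic records"},
    {"author": "Hylton, J.", "title": "Identifying and Merging Related Bibliographic Records"},
]
records = [simple_record(e) for e in entries]
print(two_pass_orders(records))
```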

Chapter V

The WebFind information discovery tool

Information retrieval in the worldwide web environment poses unique challenges. The web is a distributed, always changing, and ever expanding collection of documents. These features of the web make it difficult to find information about a specific topic. The most common approaches involve indexing, but indexes introduce centralization and can never be up-to-date. Available information retrieval software has been designed for very different environments with typical tools [Salton and McGill, 1983; Salton and Buckley, 1988] working on an unchanging corpus, with the entire corpus available for direct access. This chapter describes WebFind, an application that discovers scientific papers made available by their authors on the web. WebFind uses a novel approach to performing information retrieval on the worldwide web. The approach is to use a combination of external information sources as a guide for locating where to look for information on the web. The external information sources used by WebFind are inspec and netfind. inspec is a science and engineering abstracts database that is part of melvyl. melvyl is a University of California library service that includes comprehensive databases of bibliographic records [University of California, 1996]. netfind is a white pages service that gives internet host addresses and people's e-mail addresses

[Schwartz and Pu, 1994b]. Separately, these services do not provide enough information to locate papers on the web. WebFind integrates the information provided by each in order to find a path for discovering on the web the information actually wanted by a user.

V.A Approaches to searching on the WWW

The most common approach to resource discovery over the web is to use an index to store information about web documents. This approach involves periodic automatic off-line searching of the web and gathering of information about all the documents found in these searches. AltaVista [Digital Equipment Corporation, 1997], WebCrawler [Pinkerton, 1994], Lycos [Mauldin and Leavitt, 1994], Infoseek [Randall, 1995; InfoSeek Corporation, 1997], and Inktomi [Brewer and Gauthier, 1997] are the most important examples of applications which use this indexing approach. Most will retrieve the entire contents of a document and index a portion of it. This centralization introduces a bottleneck in the query answering capabilities of a server. This is usually countered by assigning a large amount of resources to the server. Another disadvantage is due to the nature of the web itself. The web is very dynamic: existing documents are always being updated or removed, while new documents keep coming online. Thus, no indexing approach can realistically index all of its contents. Inktomi is based on the network of workstations technology [Anderson et al., 1995] and is a distributed system in three ways: any one of the system components can be crawling the worldwide web for documents, the index is distributed, and any query can be answered by any of the workstations [Inktomi Corporation, 1996]. The main alternative to resource discovery based on off-line indexing is to perform automated online searching. Such online searching requires sophisticated heuristic reasoning to be sufficiently focused. The most developed example of this approach is the so-called Internet Softbot [Etzioni and Weld, 1994]. The Softbot is a software agent that transforms a user's query into a goal and applies a planning

algorithm to generate a sequence of actions that should achieve the goal. The Softbot planner possesses extensive knowledge (some acquired through learning) about the information sources available to it. WebCrawler can use its own index to suggest starting points for online searches. Having good starting points is important in online searches because otherwise it takes longer than users are willing to wait to find relevant information. The WebCrawler approach may solve the problem of keeping up with the dynamism of the web, but it does not solve the other problems associated with indexing approaches. Recent research has produced relevant projects which address the outdated index problem faced by the centralized search engines. These research projects consider a collection of dedicated, distributed software agents to perform the query. One such project is arachnid [Menczer, 1997]. arachnid is based on an artificial life model in which each agent performs a local search. The collection of "spiders" is situated in some starting points in the web. From there, each spider performs its own search, learning to choose links based on previously retrieved documents. The search process of arachnid is notably different from that of the typical search engines. arachnid uses the hypertext structure in order to explore the information space. When a document is deemed by the agent, or assessed by the user, as relevant, the agent gains energy. Agents who consistently accumulate energy will be better fit than others who retrieved less relevant information, and spawn offspring that are similarly fit. Thus the local search concentrates only on the web sources (i.e. pages) where relevant information has been found. The WebFind approach to resource discovery is similar to the WebCrawler and Softbot approaches, in that WebFind performs online searching of the web. However, unlike the WebCrawler, the starting points used by WebFind are suggested by inference from information provided by reliable external sources, not by a precomputed index of the web. Unlike the Softbot, WebFind uses application-specific algorithms for reasoning with its information sources. In principle a planning algorithm could generate the reasoning algorithm used by WebFind, but in practice

the WebFind algorithms are more sophisticated than it is feasible to synthesize automatically.

V.B Resources used by WebFind

There are three types of resources used in WebFind. The first is a resource containing bibliographic records of published scientific papers, i.e. a library system like melvyl. These resources are consulted for information about articles written by a specific author and/or about a specified topic. The information gathered from the library resources is used to initiate further searches for the authors and an online copy of the article. The second type of resource is an Internet white pages directory for locating authors and organizations on the Internet. The results of a white pages directory search significantly bound the WWW search space. They do so by providing Internet locations where the author is likely to be found. This prevents unnecessary searching of Internet locations where the author cannot be found. The third type of resource is one that contains the full text of papers. This is our final target in the WebFind application and we want to have as many of these full text article resources as possible in order to increase our success rate.

V.B.1 Library resources

A library resource provides detailed information about published material. For a library resource, WebFind uses melvyl [University of California, 1996], the University of California library system's on-line software which allows a user to query the many databases of the university. melvyl is comprehensive and systematic but it has limited content. The web, on the other hand, can have unlimited content, but it is chaotic. These are both characteristics of the web's continuous growth. While such growth increases the amount of information available, it also has a negative impact on its organization. WebFind is a unique combination of these complementary information sources.

melvyl has a client-server architecture. Clients connect to the melvyl central

server located in Northern California via telnet connections. The server provides a command line interface to its databases from which the user enters melvyl search commands. melvyl's most important commands are find and display. The find command is used to query a database for papers whose melvyl records satisfy some criteria given by the user. The criteria are usually a combination of keywords and/or author names. Once melvyl executes the query, it lets the user know the number of retrieved records. A user then issues the display command to display the records. WebFind has an interface to melvyl which removes the need for user interaction and allows queries to be issued through function calls. melvyl includes comprehensive databases of bibliographic records, including a science and engineering database called inspec. The inspec records provide the following information about a paper: title, authors, affiliation of the principal author, source where published, publisher, abstract, and others. In the present version of WebFind, only the inspec database is queried. However, there are no obvious obstacles to incorporating the other databases.
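A minimal sketch of such a non-interactive wrapper is shown below. The command strings, the 'tw'/'pa' index codes, and the connection object's send and receive methods are all assumptions made for illustration; they are not the actual WebFind interface or the exact melvyl syntax.

```python
def melvyl_find_command(title_words=None, authors=None):
    """Compose a melvyl 'find' command from title keywords and author names.
    The index codes used here (tw for title words, pa for personal author) are
    assumptions for illustration only."""
    clauses = []
    if title_words:
        clauses.append("tw " + " ".join(title_words))
    if authors:
        clauses.append("pa " + " ".join(authors))
    return "find " + " and ".join(clauses)

def query_inspec(connection, title_words=None, authors=None):
    """Issue a query without user interaction: send the find command, read the
    record count, then ask for the records.  `connection` is assumed to expose
    send(line) and receive() methods wrapping the telnet session."""
    connection.send(melvyl_find_command(title_words, authors))
    count_message = connection.receive()        # e.g. how many records were found
    connection.send("display")
    return count_message, connection.receive()  # the formatted records
```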

V.B.2 White pages resources

netfind is a service which attempts to locate the e-mail address of a person

on the Internet given her name and a description of where she works [Schwartz and Pu, 1994b]. A netfind request consists of a set of keys. One of them is the individual's first name, last name, or login name. The other keys make up a description of the organization that the person is affiliated with. The keys can be the name of the institution and/or the city/state/country where the institution is located. Other keys in the set may describe organizational units of the institution (e.g. departments, divisions, laboratories, etc.). netfind searches its seed database for Internet hosts which match all the keywords in the query. The seed database is an index of all the Internet hosts known to netfind. No abbreviations are allowed in the request and there is no substring matching done. If there are two or more matching hosts, the user

is asked to select up to three of them to search. If there are more than 100 domains, then a partial list is displayed and the user is asked to write a more restrictive query. To find an e-mail address, netfind considers each of the candidate domains, and performs a search within each one. The algorithm and heuristics involved are described in Schwartz and Pu (1994b). Upon completion of the search, netfind summarizes the finger search results. When the search is successful, netfind provides the most promising e-mail address for the person being sought, and the most recent login information. White pages directories are essential resources for reducing the size of the search space. WebFind uses the information collected from netfind to form a set of candidate WWW locations where the author may have a home page. This set of candidates is much smaller than the entire WWW. The set also compares well with the WebCrawler approach of using its precomputed index to find starting points. Our approach is different in that we use credible resources in inspec and netfind whereas the WebCrawler uses only those resources it has found during its indexing phase [Pinkerton, 1994]. Note that the seed database is maintained and updated manually for reasons of quality assurance.

V.B.3 Article archives resource

Two types of article archive resources have been identified. The first is sites which are known to keep collections of the full text of published scientific papers. For example, many computer science departments have an anonymous FTP server where they make available their technical reports to the public. Article archive resources may not necessarily be a designated collection of published articles (of which there are many). A site is considered a resource as long as it contains the full text of some article(s). The popularity of the WWW has led to the emergence of a second type of resource which is more prevalent: people's home pages. Since the WWW lends itself well to the task of providing public information to many users, an author's WWW home page is a clear choice for a resource. Authors can easily provide hypertext links

to their published work through their home page. It is for this reason that WebFind uses an author's WWW home page (and the WWW page for the author's affiliation) as the main resource in the process of retrieving the full contents of a paper published by that author.

V.C inspec and netfind integration

A WebFind search starts with the user providing keywords to identify the paper, exactly as he or she would in searching inspec directly. WebFind's interface is shown in Figure V.1. A paper can be identified using any combination of the names of its authors, words from its title or abstract, or other bibliographic information. The values provided by the user are used by WebFind to query inspec through melvyl. WebFind sends the query to the melvyl server and once the results are received, WebFind saves and formats them. The user is presented with a list of each article's title and authors. Hyperlinks are provided which the user can follow to retrieve an article's abstract (if available) and additional library information. After the user confirms that the right paper has been identified, WebFind queries inspec to find the institutional affiliation of the principal author of the paper. Then, WebFind uses netfind to provide the internet address of a host computer with the same institutional affiliation.

V.C.1 Bibliographic record retrieval

WebFind first queries inspec for bibliographic records which satisfy the user's inquiry. To query inspec, WebFind establishes a telnet connection to the melvyl server every time the user issues a new request. The request is sent to and processed by the melvyl server. Once the query executes, WebFind retrieves the

resulting records, formats them for display and also stores the records for future reference. The result is stored as a list of records; each record consists of the article's title, authors, principal author's affiliation, and subject keywords. The list is stored

in a file for future use by other modules. WebFind does this in order to bypass the WWW server's inability to save state information between WWW server requests; that is, there is no information saved between client and server connections (e.g. following hyperlinks in a document) in the WWW architecture. To pass on information to forthcoming connections, we store it in files which are available during those connections.
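A minimal sketch of this file-based state passing is given below, under the assumption that each user session is identified by some session_id that reappears in later requests; the JSON encoding and the temporary directory are illustrative choices, not the original implementation.

```python
import json
import os
import tempfile

SESSION_DIR = tempfile.gettempdir()  # illustrative location only

def save_session_records(session_id, records):
    """Write the retrieved records to a per-session file so that a later,
    independent HTTP request (e.g. the user following a hyperlink) can see them."""
    path = os.path.join(SESSION_DIR, "webfind-%s.json" % session_id)
    with open(path, "w") as f:
        json.dump(records, f)

def load_session_records(session_id):
    """Read back the records stored by an earlier request in the same session."""
    path = os.path.join(SESSION_DIR, "webfind-%s.json" % session_id)
    with open(path) as f:
        return json.load(f)
```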

V.C.2 Determining WWW starting points

WebFind uses the candidate domains given by netfind to limit the WWW space which is searched. WebFind first constructs a netfind query from the keywords in the affiliation field of a principal author. The inspec affiliation is a comma-separated

text string whose only structure is that of having the most specific information at the start of the string (e.g. a department of a university or company), continuing with more general information and finally ending with the country where the organization is located. This structure, however, is quite informal and there are many exceptions throughout inspec. A query to netfind is a set of keywords describing an affiliation. Useful keywords are typically words in the name of the institution or in the name of the city, state, and/or country where it is located [Schwartz and Pu, 1994b]. Noise words like "of", "and", "for", etc., found in the inspec affiliation are not considered during the construction of the query. Since the netfind query engine is incapable of processing abbreviations, WebFind chooses only full words found in the affiliations given by inspec. Instead of only sending full keywords, another alternative is to expand abbreviations to their full word. This can be done when WebFind has a very high probability of doing the correct expansion. Those abbreviations which are harder to expand are saved for later processing. There are a few keywords which are searched for explicitly in the affiliation. The keyword "univ" is one of the most important, and is expanded to "university." When this keyword is found, it is used along with any neighboring keywords in the netfind query. In addition, when considering American

institutions, WebFind also uses the city; for foreign institutions the country is used. WebFind sends the query to the dblookup netfind utility. dblookup is a query engine which retrieves records from netfind's seed database. In general, the result of the netfind query is all hosts whose affiliation contains the keywords in the query. There can be many such hosts, and WebFind must determine which of them is best. Since institutions are designated very differently in inspec and netfind, it is non-trivial to decide when an inspec institution corresponds to a netfind institution. WebFind uses the recursive record matching algorithm described in Section II.B.1 to do this. The Internet host selected is the one whose netfind affiliation has the highest matching score with the inspec affiliation.
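The sketch below illustrates the kind of keyword extraction described above. The noise-word list, the abbreviation test, and the choice of which comma-separated part is treated as the city are simplified assumptions, not the exact WebFind rules.

```python
NOISE_WORDS = {"of", "and", "for", "the"}   # partial, illustrative list

def netfind_query_from_affiliation(affiliation):
    """Sketch of turning an inspec affiliation string into netfind keywords.
    Assumptions: the second comma-separated part is the institution name, the
    third part is the city for American institutions, abbreviations are words
    containing '.' or '&', and only 'univ' is expanded explicitly."""
    parts = [p.strip() for p in affiliation.split(",")]
    is_american = parts[-1].upper() == "USA"
    keywords = []
    for word in parts[1].split():          # institution name, e.g. 'California Univ.'
        w = word.strip(".").lower()
        if w == "univ":
            keywords.append("university")  # expand the one well-understood abbreviation
        elif w in NOISE_WORDS or "." in word or "&" in word:
            continue                       # netfind cannot match abbreviations or noise
        else:
            keywords.append(w)
    location = parts[2] if is_american else parts[-1]
    keywords.extend(w.lower() for w in location.split() if w.lower() not in NOISE_WORDS)
    return keywords

print(netfind_query_from_affiliation(
    "Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA"))
# ['california', 'university', 'san', 'diego']
```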

V.D Discovery phase

The searching of the worldwide web done by WebFind is real-time in two senses. First, the search takes place while the user is waiting for a response to his or her query. Second, information gathered from one retrieved document is analyzed and used to guide what documents are retrieved next. WebFind executes the discovery phase on the Internet host selected by the recursive record matching algorithm. The first step in the discovery phase is to find a worldwide web server on the chosen internet host. This step uses heuristics based on common patterns for naming servers. The most widely used convention is to use the prefix www. or www-. WebFind tests the existence of a server named with either of these prefixes by calling the Unix ping utility. If either prefix yields a server, then WebFind continues with the next step of the discovery phase. Otherwise, WebFind strips off the first segment of the internet host name and applies the same heuristics again. For example, given cs.ucsd.edu as the internet host found in netfind, WebFind first attempts to contact a worldwide web server at www.cs.ucsd.edu and www-cs.ucsd.edu. If no servers are contacted, then WebFind repeats this process by considering the internet host ucsd.edu. Thus, it looks for potential servers at www.ucsd.edu and www-ucsd.edu.
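A minimal sketch of this naming heuristic is given below, under the assumption that a one-packet ping with a short timeout is an acceptable reachability test; the exact ping flags are an implementation assumption.

```python
import subprocess

def find_web_server(internet_host):
    """Sketch of the server-naming heuristic: try the www. and www- prefixes,
    stripping the leading host-name segment and retrying until something
    answers.  The ping flags (one packet, two-second timeout) are assumptions."""
    segments = internet_host.split(".")
    while len(segments) >= 2:
        base = ".".join(segments)
        for candidate in ("www." + base, "www-" + base):
            reachable = subprocess.call(
                ["ping", "-c", "1", "-W", "2", candidate],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
            if reachable:
                return candidate
        segments = segments[1:]   # e.g. cs.ucsd.edu -> ucsd.edu
    return None

# find_web_server("cs.ucsd.edu") would probe www.cs.ucsd.edu, www-cs.ucsd.edu,
# then www.ucsd.edu and www-ucsd.edu, in that order.
```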

Once a worldwide web server has been identified, WebFind follows links until the wanted article is found. This search proceeds in two stages: find a web page for the principal author, and find a web page that is the wanted article. Each stage of the search uses a priority queue whose entries are candidate links to follow. The priority of each link in the queue is equal to the estimated relevance of the link. For the first stage, the priority queue initially has a single link, the link for the main page of the server. For the second stage, the priority queue initially contains just the result of the first stage. When a link is added to the priority queue, its relevance is estimated using the recursive record matching algorithm applied to the context of the link, and each of two sets of keywords, a primary set and a secondary set. The context of a link is its anchor text and the two lines before and two after the line containing the link, provided no other link appears in those lines. Links are ranked lexicographically, first using degree of match to the primary set, and then using degree of match to the secondary set. In the first stage of search, the primary set of keywords is the name of the principal author, while the secondary set is {staff, people, faculty}. Intuitively, the main objective is to find a home page for the author, while the fall-back objective is to find a page with a list of people at the institution. In the second stage of search, the primary set of keywords is the title of the wanted article, while the secondary set has keywords {publications, papers, reports}. Here, the main objective is to find the actual wanted paper, while the fall-back objective is to find a page with pointers to papers in general. At each stage, the search procedure is to repeatedly remove the first link from the priority queue, and to retrieve the pointed-to web page. The search succeeds when this page is the wanted page. The search fails when the queue becomes empty. If the page is not the wanted page, all links on it are added to the priority queue with their relevance estimated as just described.
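One search stage can be sketched as follows. Everything supplied as a parameter is a placeholder: fetch, extract_links, is_wanted, and relevance (which in WebFind is the recursive record matching score) are not defined here, and the visited set is an addition to keep the sketch from revisiting pages.

```python
import heapq

def ranked_link_search(start_url, fetch, extract_links, relevance,
                       primary_keys, secondary_keys, is_wanted):
    """One search stage.  Links are ranked lexicographically: score against the
    primary keyword set first, then against the secondary set.  heapq is a
    min-heap, so scores are negated to pop the most relevant link first."""
    queue = [((0.0, 0.0), start_url)]
    visited = {start_url}
    while queue:
        _, url = heapq.heappop(queue)
        page = fetch(url)
        if is_wanted(page):
            return url                      # success: this page is the wanted one
        for link, context in extract_links(page):
            if link in visited:
                continue
            visited.add(link)
            score = (-relevance(context, primary_keys),
                     -relevance(context, secondary_keys))
            heapq.heappush(queue, (score, link))
    return None                             # the queue is empty: this stage fails

# Stage 1 would use the author's name as primary_keys and
# ["staff", "people", "faculty"] as secondary_keys; stage 2 would use the paper
# title and ["publications", "papers", "reports"], starting from stage 1's result.
```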


Even if either stage of search fails, the user still receives useful information. If the first stage fails, the user is given the web page of the author's institution. If the second stage fails, the user is given the web page of the author's institution and the author's own home page.

V.E Experimental results

This section reports on experiments performed with the latest implementation of WebFind. The aim of the experiments was to identify which aspects of this version of WebFind are the limiting factors in its ability to locate authors and their papers on the worldwide web. Figure V.1, Figure V.2 and Figure V.3 show an example of a WebFind discovery session. The experiments discussed here used queries in different areas of computer science concerning papers by authors at ten different institutions. Table V.1 contains the inspec record representations for each of the ten institutions. WebFind is evaluated on its ability to map affiliations to internet hosts, to discover worldwide web servers, to discover home pages for authors, and finally to discover the wanted paper.

V.E.1 Mapping affiliations to internet hosts

WebFind correctly associated eight of the ten inspec affiliations to Internet hosts in netfind. The first affiliation that WebFind did not correctly identify was

"Dept. of Cognitive Sci., California Univ., San Diego." The reason for this failure was that netfind does not have an entry for this department, although its internet host is cogsci.ucsd.edu. This weakness can be overcome by incorporating additional information sources that complement netfind. The other affiliation that WebFind did not find a correct host for was "Lab. for Comput. Sci., MIT." The reason here is that fifteen different internet hosts all have a netfind description equivalent to "Lab. for Comput. Sci., MIT." Each of


these hosts corresponds to a different research group (e.g. cag.lcs.mit.edu belongs to the computer architecture group), but this information is not available in either the inspec or netfind affiliation descriptions. The next version of WebFind will overcome this problem by adding keywords to the inspec and/or the netfind affiliations, if necessary. Added inspec affiliation keywords will be subject keywords, while added netfind affiliation keywords will be host name segments. For example, adding the subject keywords "computer architecture" to "Lab. for Comput. Sci., MIT" would give a specific match to the host name cag.lcs.mit.edu. Note that this will often involve matching of abbreviations, e.g. of "cag" and "computer architecture."

Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA
Dept. of Cognitive Sci., California Univ., San Diego, La Jolla, CA, USA
Dept. of Electr. Eng. & Comput. Sci., California Univ., Berkeley, CA, USA
Dept. Comput. Sci. Eng., Washington Univ., Seattle, WA, USA
Lab. for Comput. Sci., MIT, Cambridge, MA, USA
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA
Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
Dept. of Electr. Eng. & Comput. Sci., Illinois Univ., Chicago, IL, USA
Dept. of Comput. Sci., Waterloo Univ., Ont., Canada
Dept. of Comput. Sci., Columbia Univ., New York, NY, USA

Table V.1: Sample inspec author affiliations.

V.E.2 Discovery of worldwide web servers

Of the eight internet hosts that WebFind found correctly, there was only one that it could not find a worldwide web server for. This is encouraging given the

simple heuristic used for finding a server, discussed in Section V.D.

V.E.3 Discovery of home pages

WebFind found the home page for five principal authors on the seven worldwide

web servers it searched. The other two principal authors did not have home pages of any kind on the servers found by WebFind. In these two cases, the authors were no longer affiliated with the institution that inspec provided. This problem can be solved by using the most recent information that inspec can provide. This can be achieved by having the most recent reference for the author in question. Thus, WebFind would formulate additional queries to send to inspec. These queries can be created from the results of the initial query.

V.E.4 Paper discovery

Finally, WebFind successfully discovered two papers starting from the five authors' home pages found. The low rate is due to the type of author pages which were discovered. Two of the five pages were not personal home pages, but rather they were "annual reports" or "research statements" which did not provide any outgoing links, so the wanted papers were not in fact available through their authors' home pages. In summary, the experiments show that WebFind is successful at finding worldwide web servers and finding web pages designated for authors. WebFind is less successful at finding actual papers, most of all because many authors have not yet published their papers on the worldwide web.

V.F Related work

This section reviews the work most related to WebFind. Two projects are described: Ahoy! The Homepage Finder [Shakes et al., 1997a; Shakes et al., 1997b] and CiFi Citation Finder [Loke et al., 1996; Han et al., 1997].


V.F.1 Ahoy! The Homepage Finder

Ahoy! is a web service which finds homepages given an individual's name plus possibly other information such as their country or the individual's institution. Ahoy! is one in a line of software agents for the Internet from the Computer Science Department at the University of Washington. Its direct predecessor is MetaCrawler [Selberg and Etzioni, 1997; Go2Net, 1997], which is a search engine. MetaCrawler does not have any centralized index itself. Instead, it relies on a number of indices. MetaCrawler sends a user's request to several search engines in parallel. The search engines it relies on are AltaVista [Digital Equipment Corporation, 1997], Excite [Excite Inc., 1997], Infoseek [Randall, 1995; InfoSeek Corporation, 1997], Lycos [Mauldin and Leavitt, 1994], WebCrawler [Pinkerton, 1994], and Yahoo [Yahoo! Inc., 1997]. When it receives all the results, MetaCrawler organizes them, ranks them, and returns them to the user. MetaCrawler has several advantages; first, by sending a query to several search engines, there are more opportunities to answer the query. One of the disadvantages discussed in the introduction of this chapter is the inability of an index-based search engine to capture all of the contents of the Internet. MetaCrawler tries to address this issue by querying several index-based search engines. Another advantage is that MetaCrawler provides one query interface for multiple search engines. Finally, MetaCrawler combines the results of all the search engines, computing a rank for a document which includes the ranking assigned by the search engines indexing that document. Similar to MetaCrawler, Ahoy! also relies on a set of sources to find an individual's homepage. First, Ahoy! queries MetaCrawler and receives a list of hyperlinks pointing to candidate home pages for the individual. Simultaneously, Ahoy! also queries two e-mail services [WhoWhere? Inc., 1997; DoubleClick Inc., 1997]. Finally, it also queries an internal database of institutions for the individual's institution. After Ahoy! has the results from all the sources, it then goes through a filtering step which identifies and removes irrelevant references. There are two types of filters, one based on the information provided in the query and the second based on

heuristics about people's names and common structure of homepages. The references which remain after the filtering step are then ranked and placed into categories which the authors call buckets. References are placed in the buckets according to three attributes: the degree of match between the name specified in the query and the name in the reference; whether there is a match of the reference's URL and the institution and country specified in the query; and whether the reference appears to be a homepage or not. The use of several search engines results in good recall results [Shakes et al., 1997a], which are higher than the recall of search engines because Ahoy! can also automatically generate references. Ahoy! generates URLs based on previous patterns which are learned from successful homepage searches. Furthermore, precision is shown to be higher than for other methods because Ahoy! concentrates purely on homepages whereas search engines are used for querying general information. While WebFind and Ahoy! search for different types of information, the approaches used are somewhat similar. WebFind's approach is to integrate the information stored in reliable sources. Similarly, Ahoy! relies on the services provided by other search engines in order to answer a query. However, there are some differences. First, WebFind performs an online search, retrieving and evaluating the entire contents of web pages. Ahoy! bases its decisions on the information provided by the sources it queries, which contain only a summary of the contents of a web page. As discussed earlier in this chapter, the sources used by Ahoy! are not as reliable as the sources used by WebFind. Finding a homepage is one of the intermediate tasks that WebFind performs in searching for a paper. Section V.D describes WebFind's approach for finding an author's homepage. Briefly, it relies on the ranking of hyperlinks found in the web pages it retrieves, following the highest ranked one. Overall, Ahoy!'s homepage-finding approach is more complex than WebFind's. Ahoy! has a learning component which makes it more accurate in future queries. Furthermore, the URL generator can generate URLs which may not be indexed by search engines. Such a learning component

could make WebFind more accurate. However, because WebFind performs the search online starting from references suggested by reliable sources, it considers all the hyperlinks that it discovers and does not rely on whether the hyperlinks appear in some index.

V.F.2 CiFi: Citation Finder

The most closely related work to WebFind is CiFi [Loke et al., 1996; Han et al., 1997]. CiFi is an agent that finds citations in the worldwide web. A citation in CiFi is a mention of the paper in a web page; there may or may not be a link associated with the citation. WebFind on the other hand considers only hyperlinks whose neighborhood contains the title of the paper. CiFi builds on the approaches introduced in WebFind. One of the extensions provided in CiFi is the use of several WWW sources, similar to MetaCrawler [Selberg and Etzioni, 1997; Go2Net, 1997]. It uses a number of search strategies. CiFi sends queries to AltaVista [Digital Equipment Corporation, 1997] and Ahoy! [Shakes et al., 1997b]. The AltaVista search returns candidate web pages where the paper is cited, and the Ahoy! search returns possibly a homepage for the author and/or a web page for the author's institution. When a homepage for the author is found, then CiFi performs like WebFind, searching for a reference to the paper from the author's homepage. CiFi also queries two databases which store computer science bibliographies. These databases are similar in nature and content to the databases used in Section IV.E to evaluate the duplicate detection algorithms. A citation is also searched for in the web page of the author's institution. CiFi first tries to find a search service for technical reports provided by the institution and uses it if found. There may be several citations for a paper. CiFi searches for the best citation of the paper. If a technical report is found, CiFi continues its search, ultimately stopping when a conference or journal citation is found. The authors of CiFi [Loke et al., 1996] suggest that WebFind is limited by its use of external information sources and by the contents of the inspec database

in the melvyl system. These critiques are addressed in this section. First, one of the goals of WebFind was to provide a tool that is based on reputable sources of information. Early exploration of the field concluded that lack of data quality is one of the unfortunate side effects of information growth on the Internet. To address this issue, WebFind was designed to integrate sources which provided information of high quality. This is the reason for using the inspec database available in melvyl. Furthermore, the design of WebFind does not prevent it from using other information sources. The current implementation achieves these goals, and more is planned for the future. Much has been learned from the success of WebFind. Future versions will take the approach of integrating information sources further, as MetaCrawler, Ahoy!, and CiFi have done. These and other features, as well as research issues to be addressed in future work, are discussed in Section VI.A.3.


Figure V.1: WebFind query form.


Figure V.2: Papers found in inspec from the query issued by WebFind.


Figure V.3: Results from a WebFind discovery session.

Chapter VI

Conclusion

The integration of information sources is an important area of research. As the performance of WebFind shows, there is much to be gained from integrating multiple information sources. This thesis has explored and provided solutions to some of the problems to be overcome in this area. In particular, to integrate data from multiple sources, one must first identify the information which is common in these sources. Different record matching algorithms were presented that determine the equivalence of records from these sources. Section II.B gives record matching algorithms that should be useful for typical alphanumeric records that contain fields such as names, addresses, titles, dates, identification numbers, and so on. Experimental results from two different application domains show that the algorithms are highly effective. For general use, we recommend the hybrid algorithm, with the Smith-Waterman algorithm applied at the subrecord level. The recursive component of this algorithm allows it to handle out-of-order and missing subrecords, while its Smith-Waterman component allows it to handle typographical errors, abbreviations, and missing words. This hybrid algorithm is domain-independent and can be used in new applications with no reprogramming. The hybrid algorithm was successfully applied to the problem of detecting duplicate records in databases of mailing addresses and of bibliographic records without any changes to the algorithm. Although the Smith-Waterman component

does have several tunable parameters, in typical alphanumeric domains we are confident that the numerical parameters suggested in Section II.B.2 can be used without change. The one parameter that should be changed for different applications is the threshold for declaring a match. This threshold is easy to set by examining a small number of pairs of records whose true matching status is known. Future work should investigate automated methods for learning optimal values for this parameter and the Smith-Waterman algorithm parameters. The duplicate detection methods described in Chapter III improve previous related work in three ways. The first contribution is an approximate record matching algorithm that is relatively domain-independent. However, this algorithm, an adaptation of the Smith-Waterman algorithm, does have parameters that can in principle be optimized (perhaps automatically) to provide better accuracy in specific applications. The second contribution is to show how to compute the transitive closure of "is a duplicate of" relationships incrementally, using the union-find data structure. The third contribution is a heuristic method for minimizing the number of expensive pairwise record comparisons that must be performed while comparing individual records with potential duplicates. It is important to note that the second and third contributions can be combined with any pairwise record matching algorithm. In particular, we performed experiments on two algorithms that contained these contributions, but that used different record matching algorithms. The experiments resulted in high duplicate detection accuracy while performing many fewer record comparisons than previous related work.

VI.A Future work

VI.A.1 Domain-independent record matching

In the future, additional improvements to domain-independent record matching algorithms will be explored. The algorithms proposed in this thesis can be improved by automatically choosing the different parameter values. In particular, a

record matching algorithm based on the Smith-Waterman algorithm can have higher accuracy from one domain to the next if the different parameters can be dynamically adjusted as the algorithm is used on more records. As proposed in this thesis, domain-independence should remain as the main goal, so the adjustments to the parameters should be done through the collection of statistics. The record matching algorithms need to be compared with other approximate string matching algorithms such as those proposed by Buss and Yianilos (1995) and Damashek (1995). In addition, other tools such as agrep [Wu and Manber, 1992] for approximate pattern matching and diff [Simon, 1989] for finding differences in files are relevant work which will be compared to the algorithms stated here.

VI.A.2 Detecting approximate duplicate records

Future work on detecting approximately duplicate database records includes the following:

- Improving accuracy by performing more comparisons among records belonging to clusters which are suspiciously larger than average.

- Extending the algorithms to be able to detect approximately duplicate fragments. Fragments are parts of a document which are larger than the records considered in this thesis. A fragment could also be an entire document, in which case a collection of documents is given and the problem is to detect those which are approximate duplicates. This problem comes up in detecting plagiarism. Also, search engines usually index multiple copies of the same web page without detecting that it has already been indexed. This occurs because the URL is used as a unique identifier. The problem is also seen in news-wire services, where the same story is sent multiple times, each time with only minor syntactic changes to it.

- Studying the issues involved with the implementation of these ideas on a real database management system. The experiments presented in this thesis were performed on flat files, where the fields of a record were thought of as being concatenated into a consecutive string of characters. Implementing the algorithms inside a database management system is not a trivial task. The performance of the system must be kept in mind. As an example of the difficulty, efficient sorting algorithms must be employed. It may not be feasible to copy the entire contents of a database to a flat file in order to detect duplicates.

- Exploring clustering algorithms other than sorting. Due to the number of errors occurring in a database, there is no guarantee that a number of sorting passes will bring all candidate duplicate records close enough for the records to be compared. Thus, additional clustering techniques need to be explored which are less expensive than sorting and can achieve better clustering. One possible data structure is the vantage point tree (vp-tree) introduced by Yianilos (1993). The vp-tree can be constructed in O(n log n) time and can be searched in O(log n) expected time. The vp-tree is of importance in spaces where the distance between members cannot be approximated by a Euclidean distance. Thus, the vp-tree allows searching for nearest neighbors in general metric spaces. This could possibly replace the multiple sorting passes with a single construction of a vp-tree, with appropriate vantage points chosen.

VI.A.3 WebFind

Integrating additional white page resources

This section describes a number of features which can improve WebFind. The experiments evaluating WebFind and its regular use have shown that the main weakness is that netfind does not cover all of the institutions on the web. Thus, the first improvement would be to integrate additional white pages services which complement netfind. The following is a non-exhaustive list of possible candidate sources:

- University WWW servers [Universities.Com, 1997],


- Internet address finder [DoubleClick Inc., 1997],
- WhoWhere? E-mail Addresses and Homepages [WhoWhere? Inc., 1997],
- BigBook [BigBook Inc., 1997],
- BigFoot [BigFoot Partners L.P., 1997],
- BigYellow [NYNEX Information Technologies Company, 1997],
- Search engines: Yahoo [Yahoo! Inc., 1997], AltaVista [Digital Equipment Corporation, 1997], Lycos [Mauldin and Leavitt, 1994], Inktomi [Brewer and Gauthier, 1997], etc.

While there are many sources which can be of high relevance to WebFind, care must be taken to include only sources of high quality. A major part of WebFind's success is that both netfind and inspec are sources of high quality. inspec in particular is a very reputable source that is also comprehensive in its field. Many sources on a certain field are available through the web. However, only those which are reputable and are kept up to date should be considered. Furthermore, some sources may be duplicated by other sources completely, and thus need not be integrated. The duplicate detection algorithm gives us a way to detect those cases where a source spans only part of another source. After the different methods for discovering a paper have been exhausted, the user can be referred to other services on the web. In particular, links should be provided with the properly formatted query to the service in order to save the user from having to come up with queries to each of the different services.

Integrating additional paper resources

In those cases where authors do not provide links to their published work in their home pages, we can search for the paper of interest in archives which store the full text of papers. These archives are usually specific to a field of interest. Following is a sample of some of the WWW sites that have collections of technical

reports and other published papers in the field of computer science. There are other sites which have analogous information in other scientific fields.

- Computer Science Technical Report Harvest
  http://rd.cs.colorado.edu/brokers/cstech/query.html
- Unified Computer Science Technical Reports
  http://www.cs.indiana.edu/cstr/search
- WWW Database Article Archive
  http://www.lpac.ac.uk/SEL-HPC/Articles/DBArchive.html
- Network Bibliography
  http://www.research.att.com/~hgs/bib/biblio.html
- GLIMPSE - search for articles in journals/proceedings
  http://glimpse.cs.arizona.edu:1994/bib
- Search text of technical reports
  http://www.cs.indiana.edu/ucstri/info.html
- Theoretical CS journals/bibliographies (via MIT)
  http://theory.lcs.mit.edu/iandc/related.html

Full use of the bibliographic record

A feature that can improve WebFind's success rate is to use more information that is available from bibliographic records. There are many opportunities for WebFind to improve its effectiveness by using as much as possible all the information contained in the bibliographic record. The record provides the full title of a paper, the authors, subject keywords, an abstract in most cases, the source of publication, the name and location of the publisher, and other fields. Applications similar to WebFind (e.g. search engines, CiFi, Ahoy!) rely on the user being able

to properly state the information they are interested in. Unfortunately, typical users cannot state their query exactly. When a paper has multiple authors and WebFind is unsuccessful at finding the paper from the principal author, WebFind can attempt to discover the paper through the other co-authors. There are at least three ways in which a co-author can be used. First, if a principal author's home page is not found, then WebFind can assume that the other co-authors have the same affiliation and instead look for home pages for all the authors of the paper. If at least one of these is successful, then discovery continues through that author's home page. Second, if a paper is not found at an author's home page, WebFind can consider visiting the hyperlinks at the home page which point to other authors. The heuristic here is that if the author points to other authors, it is likely that there will be relevant hyperlinks. Finally, since the key to WebFind is the effective "join" that it performs between inspec and netfind, WebFind must first get the affiliation field for each of the co-authors. The first task, then, is to get a co-author to appear as the principal author for some paper. This involves WebFind composing different queries from the set of records it already has and issuing the query to inspec. Once the affiliation is extracted, the remaining discovery process goes on as before, but this time with the co-author's affiliation. In addition, inspec's bibliographic records contain subject keywords, etc., which can be used to augment the search phase. For example, when ranking hyperlinks WebFind could also weigh them according to a third (or even more) set of keywords, where each set is made up of keywords that come from the bibliographic record. A problem that comes up when submitting queries for affiliations of an author is that the author may have multiple affiliations. This may be due to the author having moved from one research institution to another, for example. WebFind can be improved by using only the most recent and complete affiliation available for that author. When the user selects a paper, the bibliographic record for that paper may not contain the most recent affiliation for the author. WebFind can perform additional inspec queries in order to be certain that it uses the most recent affiliation.


Access multiple melvyl databases  The current WebFind implementation can only query inspec. A more robust WebFind would include all the other melvyl databases and would also allow queries across multiple databases. The goal is to provide a larger set of sources of bibliographic records and to support querying more than one melvyl database at a time, which is not possible even with the current melvyl software.
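A minimal sketch of what cross-database querying might look like is given below, assuming a hypothetical query_database callable that submits a query to one named melvyl database and returns its bibliographic records; it is not part of any real melvyl interface. The pooled records would then be passed to the approximate duplicate detection algorithms developed in this dissertation to collapse records that describe the same paper in different databases.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

def query_all(query: str,
              databases: Iterable[str],
              query_database: Callable[[str, str], List[Dict]]) -> List[Dict]:
    # Issue the same query to every database concurrently and pool the results.
    records: List[Dict] = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_database, name, query) for name in databases]
        for future in futures:
            records.extend(future.result())
    return records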

Bibliography

[Ace et al., 1992] J. Ace, B. Marvel, and B. Richer. Matchmaker ... matchmaker ... find me the address (exact address match processing). Telephone Engineer and Management, 96(8):50, 52–53, April 1992.
[Achilles, 1996] Alf-Christian Achilles. A collection of computer science bibliographies. URL, 1996. http://liinwww.ira.uka.de/bibliography/index.html.
[Anderson et al., 1995] Thomas E. Anderson, David E. Culler, and David A. Patterson. A case for NOW (networks of workstations). IEEE Micro, 15(1):54–64, February 1995.
[Batini et al., 1986] C. Batini, M. Lenzerini, and S. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364, December 1986.
[BigBook Inc., 1997] BigBook Inc. BigBook. URL, 1997. http://www.bigbook.com/.
[BigFoot Partners L.P., 1997] BigFoot Partners L.P. BigFoot. URL, 1997. http://www.bigfoot.com/.
[Bitton and DeWitt, 1983] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–65, 1983.
[Boyer and Moore, 1977] Robert S. Boyer and J. Strother Moore. A fast string-searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
[Brewer and Gauthier, 1997] Eric Brewer and Paul Gauthier. Inktomi search engine. URL, 1997. http://www.inktomi.com.
[Bureau, 1997] U.S. Census Bureau, editor. U.S. Census Bureau's 1997 Record Linkage Workshop, Arlington, Virginia, 1997. Statistical Research Division, U.S. Census Bureau.
[Buss and Yianilos, 1995] Sam R. Buss and Peter N. Yianilos. A bipartite matching approach to approximate string comparison and search. Technical report, NEC Research Institute, 4 Independence Way, Princeton, NJ, 1995.

[Chang and Lampe, 1992] W. I. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In CPM: 3rd Symposium on Combinatorial Pattern Matching, pages 175–84, 1992.
[Church and Gale, 1991] K. W. Church and W. A. Gale. Probability scoring for spelling correction. Statistics and Computing, 1:93–103, 1991.
[Cormen et al., 1990] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[Cox, 1995] Brenda G. Cox. Business survey methods. John Wiley & Sons, Inc., 1995. Wiley series in probability and mathematical statistics.
[Damashek, 1995] M. Damashek. Gauging similarity with n-grams: language-independent categorization of text. Science, 267(5199):843–848, 1995.
[Digital Equipment Corporation, 1997] Digital Equipment Corporation. AltaVista search engine. URL, 1997. http://www.altavista.digital.com.
[DoubleClick Inc., 1997] DoubleClick Inc. Internet Address Finder. URL, 1997. http://www.iaf.com/.
[Du and Chang, 1994] M.-W. Du and S. C. Chang. Approach to designing very fast approximate string matching algorithms. IEEE Transactions on Knowledge and Data Engineering, 6(4):620–633, August 1994.
[Etzioni and Perkowitz, 1995] Oren Etzioni and Mike Perkowitz. Category translation: learning to understand information on the Internet. In Proceedings of the International Joint Conference on AI, pages 930–936, 1995.
[Etzioni and Weld, 1994] Oren Etzioni and Daniel Weld. A softbot-based interface to the Internet. Communications of the ACM, 37(7):72–76, July 1994.
[Excite Inc., 1997] Excite Inc. Excite Search Engine. URL, 1997. http://www.excite.com.
[Galil and Giancarlo, 1988] Z. Galil and R. Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4:33–72, 1988.
[Giles et al., 1976] C. A. Giles, A. A. Brooks, T. Doszkocs, and D. J. Hummel. An experiment in computer-assisted duplicate checking. In Proceedings of the ASIS Annual Meeting, page 108, 1976.
[Go2Net, 1997] Go2Net. MetaCrawler. URL, 1997. http://www.metacrawler.com.
[Hall and Dowling, 1980] Patrick A. V. Hall and Geoff R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381–402, 1980.

[Han et al., 1997] Y. Han, S. W. Loke, and L. Sterling. Agents for citation finding on the world wide web. In Proceedings of the Second International Conference and Exhibition on the Practical Application of Intelligent Agents and Multi-Agents, April 1997. Also in http://www.cs.mu.oz.au/tr_db/mu_96_40.ps.gz.
[Hernandez and Stolfo, 1995] M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 127–138, May 1995.
[Hernandez and Stolfo, 1996] M. Hernandez and S. Stolfo. A generalization of band joins and the merge/purge problem. IEEE Transactions on Knowledge and Data Engineering, 1996. To appear.
[Hernandez, 1996] Mauricio Hernandez. A Generalization of Band Joins and the Merge/Purge Problem. Ph.D. thesis, Columbia University, 1996.
[Hopcroft and Ullman, 1973] J. E. Hopcroft and J. D. Ullman. Set merging algorithms. SIAM Journal on Computing, 2(4):294–303, December 1973.
[Hylton, 1996] Jeremy A. Hylton. Identifying and merging related bibliographic records. M.S. thesis, MIT, 1996. Published as MIT Laboratory for Computer Science Technical Report 678.
[InfoSeek Corporation, 1997] InfoSeek Corporation. InfoSeek home page. URL, 1997. http://www.infoseek.com.
[Inktomi Corporation, 1996] Inktomi Corporation. The Inktomi technology behind HotBot (a white paper). URL, 1996. http://www.inktomi.com/CoupClustWhitePap.html.
[Jacquemin and Royaute, 1994] C. Jacquemin and J. Royaute. Retrieving terms and their variants in a lexicalized unification-based framework. In Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 132–141, July 1994.
[Kahn, 1995] Robert E. Kahn. An introduction to the CS-TR project [WWW document]. URL, December 1995. http://www.cnri.reston.va.us/home/cstr.html.
[Kilss and Alvey, 1985] Beth Kilss and Wendy Alvey, editors. Record linkage techniques, 1985: Proceedings of the Workshop on Exact Matching Methodologies, Arlington, Virginia, 1985. Internal Revenue Service, Statistics of Income Division. U.S. Internal Revenue Service, Publication 1299 (2-86).
[Kim et al., 1993] W. Kim, I. Choi, S. Gala, and M. Scheevel. On resolving schematic heterogeneity in multidatabase systems. Distributed and Parallel Databases, 1(3):251–279, July 1993.

[Knuth et al., 1977] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
[Kukich, 1992] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, December 1992.
[Levenshtein, 1966] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics – Doklady, 10:707–710, 1966.
[Loke et al., 1996] S. W. Loke, A. Davison, and L. Sterling. CiFi: An intelligent agent for citation finding on the world-wide web. In Proceedings of the 4th Pacific Rim International Conference on Artificial Intelligence, pages 580–591, August 1996. Also in http://www.cs.mu.oz.au/tr_db/mu_96_04.ps.gz.
[Madhavaram et al., 1996] M. Madhavaram, D. L. Ali, and Ming Zhou. Integrating heterogeneous distributed database systems. Computers & Industrial Engineering, 31(1–2):315–318, October 1996.
[Mauldin and Leavitt, 1994] Michael Mauldin and John Leavitt. Web-agent related research at the center for machine translation. In Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval, McLean, VA, August 1994.
[Menczer, 1997] Filippo Menczer. ARACHNID: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery. In Machine Learning: Proceedings of the 14th International Conference (ICML97). Morgan Kaufmann Publishers, Inc., 1997.
[Monge and Elkan, 1997] Alvaro E. Monge and Charles P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
[Needleman and Wunsch, 1970] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[Newcombe et al., 1959] Howard B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954–959, October 1959. Reprinted in [Kilss and Alvey, 1985].
[Newcombe, 1988] Howard B. Newcombe. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press, 1988.
[NYNEX Information Technologies Company, 1997] NYNEX Information Technologies Company. BigYellow. URL, 1997. http://www.bigyellow.com/.

[Peterson, 1980] J. Peterson. Computer programs for detecting and correcting spelling errors. Communications of the ACM, 23(12):676–687, 1980.
[Pinkerton, 1994] Brian Pinkerton. Finding what people want: Experiences with the WebCrawler. In Electronic Proceedings of the Second International Conference on the World Wide Web, Chicago, October 1994. Elsevier Science BV. http://webcrawler.cs.washington.edu/WebCrawler/WWW94.html.
[Pollock and Zamora, 1987] J. J. Pollock and A. Zamora. Automatic spelling correction in scientific and scholarly text. ACM Computing Surveys, 27(4):358–368, 1987.
[Randall, 1995] Neil Randall. The search engine that could (locating world wide web sites through search engines). PC/Computing, 8(9):165 (4 pages), September 1995.
[Salton and Buckley, 1988] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[Salton and McGill, 1983] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill, 1983.
[Schwartz and Pu, 1994a] Michael Schwartz and Calton Pu. Applying an information gathering architecture to Netfind: a white pages tool for a changing and growing Internet. IEEE/ACM Transactions on Networking, 2(5):426–439, October 1994.
[Schwartz and Pu, 1994b] Michael Schwartz and Calton Pu. Applying an information gathering architecture to Netfind: a white pages tool for a changing and growing Internet. Technical Report 5, Department of Computer Science, University of Colorado, October 1994.
[Selberg and Etzioni, 1997] Erik Selberg and Oren Etzioni. The metacrawler architecture for resource aggregation on the web. IEEE Expert, 12(1):8–14, 1997.
[Senator et al., 1995] T. E. Senator, H. G. Goldberg, J. Wooton, M. A. Cottini, et al. The financial crimes enforcement network AI system (FAIS): identifying potential money laundering from reports of large cash transactions. AI Magazine, 16(4):21–39, 1995.
[Shakes et al., 1997a] Jonathan Shakes, Marc Langheinrich, and Oren Etzioni. Dynamic reference sifting: a case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, pages 189–200, 1997. Also in http://www.cs.washington.edu/research/ahoy/doc/paper.html.
[Shakes et al., 1997b] Jonathan Shakes, Marc Langheinrich, and Oren Etzioni. Ahoy! The Homepage Finder. URL, 1997. http://www.cs.washington.edu/research/ahoy.

[Silberschatz et al., 1995] A. Silberschatz, M. Stonebraker, and J. D. Ullman. Database research: achievements and opportunities into the 21st century. A report of an NSF workshop on the future of database research, May 1995.
[Simon, 1989] I. Simon. Sequence comparison: some theory and some practice. In M. Gross and D. Perrin, editors, Proceedings of Electronic Dictionaries and Automata in Computational Linguistics, LITP spring school on theoretical computer science, pages 79–92, Berlin, West Germany, 1989. Springer Verlag.
[Slaven, 1992] B. E. Slaven. The set theory matching system: an application to ethnographic research. Social Science Computer Review, 10(2):215–229, Summer 1992.
[Smith and Waterman, 1981] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[Song et al., 1996] W. W. Song, P. Johannesson, and J. A. Bubenko Jr. Semantic similarity relations and computation in schema integration. Data & Knowledge Engineering, 19(1):65–97, May 1996.
[Tarjan, 1975] Robert E. Tarjan. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 22(2):215–225, 1975.
[Tarjan, 1983] Robert E. Tarjan. Data structures and network algorithms. Society for Industrial and Applied Mathematics, 1983.
[Universities.Com, 1997] Universities.Com. Universities.com. URL, 1997. http://www.universities.com/.
[University of California, 1996] Division of Library Automation, University of California. Melvyl system welcome page. URL, May 1996. http://www.dla.ucop.edu/.
[Wang et al., 1989] Y. R. Wang, S. E. Madnick, and D. C. Horton. Inter-database instance identification in composite information systems. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, pages 677–84, January 1989.
[WhoWhere? Inc., 1997] WhoWhere? Inc. WhoWhere? URL, 1997. http://www.whowhere.com/.
[Winkler, 1985] William E. Winkler. Exact matching lists of businesses: Blocking, subfield identification, information theory. In Kilss and Alvey [1985], pages 227–241. U.S. Internal Revenue Service, Publication 1299 (2-86).

[Winkler, 1995] William E. Winkler. Matching and Record Linkage, pages 355–384. In Brenda G. Cox [1995], 1995. Wiley series in probability and mathematical statistics.
[Wu and Manber, 1992] Sun Wu and Udi Manber. agrep – a fast approximate pattern-matching tool. In Proceedings of the Winter 1992 USENIX Conference, pages 153–162, Berkeley, CA, 1992. USENIX.
[Yahoo! Inc., 1997] Yahoo! Inc. Yahoo! URL, 1997. http://www.yahoo.com.
[Yampolskii and Gorbonosov, 1973] M. I. Yampolskii and A. E. Gorbonosov. Detection of duplicate secondary documents. Nauchno-Tekhnicheskaya Informatsiya, 1(8):3–6, 1973.
[Yan and Garcia-Molina, 1995] T. W. Yan and H. Garcia-Molina. Information finding in a digital library: the Stanford perspective. SIGMOD Record, 24(3):62–70, September 1995.
[Yianilos, 1993] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1993.
