
Information Systems 29 (2004) 531–550

Methods for evaluating and creating data quality

William E. Winkler
US Bureau of the Census, Statistical Research, Room 3000-4, Washington, DC 20233-9100, USA
Tel.: +1-301-763-4729; fax: +1-301-457-2299. E-mail address: [email protected]

Received 30 May 2003; received in revised form 1 November 2003; accepted 15 December 2003

Expanded version of a talk given at the ICDT Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January 2003.

doi:10.1016/j.is.2003.12.003

Abstract

This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files.

Keywords: Integer programming; Set covering; Data cleaning; Approximate string comparison; Unsupervised and supervised learning

1. Introduction

Data is a valuable resource. Proper use of suitably high-quality data can yield quantitative measurements that allow the evaluation of processes and the improvement of operational efficiencies. If data are of poor quality, then they may not be suitable for their intended purpose. If data are created for one purpose and are used for another purpose, then the data may not be of sufficient quality for the second purpose. In this paper, we provide an overview of two aspects of improving data quality that have been considered in the statistical literature since the 1950s. The first method is data editing [1], which verifies that data values satisfy predetermined restraints. These restraints are also called business rules. Fellegi and Holt [2] defined a formal mathematical model for data editing that is intended to minimize changes in data records and to assure that the substituted data values pass the edit rules. The means of implementing the model of Fellegi and Holt have primarily involved operations research [3]. In straightforward situations, the edit rules are built into easily modified tables. In many situations, the systems can assure that all records satisfy edits without human intervention [3]. The second method is record linkage, which is based on a statistical model due to Fellegi and Sunter [4]. Record linkage generalizes methods for Bayesian networks [5,6]. The methods have been rediscovered in the computer science literature [7,8] but without full mathematical proofs of the optimality of the classification rules. The methods are often referred to as data cleaning [9] or object identification [10,11]. Additionally, Fellegi and Sunter provided means of unsupervised learning for automatically determining optimal parameters in simple situations that have been extended to many practical situations [9,12–15].


Although the methods of data editing and record linkage have primarily been applied to individual files, newer methods [16–20] are intended for linking and cleaning groups of files. The former methods (particularly [17]) create additional information during the file-linkage process that improves the linkages. In some situations, the methods allow improved statistical analyses across files even in the presence of linkage error. The methods of Koller et al. [19,20] are called Probabilistic Relational Models. The outline of this paper is as follows. In Section 2, we provide some examples of typical situations where data quality might be poor and the possible costs and benefits of improving the quality of the data. In Section 3, we give an overview of statistical data editing and imputation that might be efficiently used in assuring that business rules are satisfied and that missing data are filled in for some of the fields. In Section 4, we describe methods of data cleaning for removing duplicates within and across files. The methods are often referred to as object or entity identification, data fusion or record linkage. In Section 5, we describe some advanced methods and current research problems. The final section consists of concluding remarks.

2. Background

In this section, we give an example of quality issues in a pharmaceutical data warehouse and in a hospital data system that is used for billing patients.

2.1. Examples

A pharmaceutical company is interested in creating a data warehouse that links reports from doctors, information from laboratory reports and internal company data. Initially, the analysts know that they wish to quantify better the effects of certain drugs and possible side effects of drugs. They hope that they will be able to mine the warehouse for potential new areas of research and to determine existing avenues of research that are less likely to succeed than others. After creating an initial list of source files, the data warehouse team determines that the documentation for some files is inaccurate as to what fields are present in the files and where the fields are located. Although names, social security numbers (SSNs), and dates-of-birth are needed for connecting information associated with persons, these identifiers are sometimes missing from some records. Missing identifiers and erroneous information are particularly a problem with information from a small proportion of the laboratories. The team needs to determine how much work is needed to connect information if identifiers are missing. If SSN is missing but name and date-of-birth are present, is it possible to connect individual records? Should some of the missing information be put in the original files? If identifying information and other information are in error from some of the laboratories, what quality control procedures and financial incentives can be put in place to assure that the information from all of the laboratories is of reasonable quality and comparable? If the hardware, the software development and personnel for the project cost $30 million US, is it worthwhile to spend $2 million US on improving the quality of the information from the laboratories?

A hospital has several different databases and cannot connect patient information associated with treatments and laboratory fees for accurate and timely billing. Some individuals are billed late or never billed. Other individuals who have paid their bills do not have the record of bill payment placed in computer files and are repeatedly billed. Some difficulties arise because names, dates-of-birth, or insurance identification numbers are entered in error in some files. Should the individuals (sometimes doctors or nurses) be given better training on data entry? Should the individual systems be made more user friendly? Should better ways be developed for overcoming errors in identifying fields such as name, date-of-birth or insurance identification number? What is the size of the problem? If two percent of the information from a billion-dollar hospital operation has errors, how much will it cost to improve the information systems and data quality? How much money and time will be saved with the improved systems and data quality?


In the following sections, we will consider methods that have been developed for assuring that business rules are satisfied and for identifying duplicates within or across files. If we know that only certain drugs are used in treating a particular type of cancer, we may have a business (i.e., edit) rule that states which (codes for) drugs can be associated with particular types of cancer. If two contradictory codes appear together in the database, then an error message must occur or an override flag must be present. If an individual works for a given industry in a certain type of job, then their wage (salary) should not be too high or too low in comparison to similar types of jobs in the same industry. If two records partially agree on name information and on date-of-birth information, when are the two records likely to be associated with the same person? If the two records have poor agreement on name and date-of-birth information and good agreement on other information, when should the pair of records be reviewed to determine if they are the same person?

2.2. Conceptual model and related issues

In building a large data warehouse of cooperating information systems, we need to consider a number of issues. The first issue is the quality of the information in each individual file. We will only consider the three quality issues of duplicates, inconsistent information, and missing information in individual fields. If there are duplicates in a name and address list, we may unnecessarily contact individual entities two or more times. The first cost is the extra time and expenses of multiple contacts to the same individual or business. The second cost is the irritation of the entity being contacted. The benefit of removal of duplicates can be significant cost savings. For instance, a hospital may bill someone twice for the same procedure if two different patient-event accounts have been set up. It is quite possible that the first bill will properly be credited as having been paid while the second will not be corrected. Although the customer's contact in the hospital's billing office may indicate that the erroneous bill will be taken care of, it may not be corrected. The lack of correction may be due to not doing the correction, doing the change incorrectly in the computer, or not having access to a method of correcting the computer files. The expenses to the hospital and the customer are considerable if several iterations take place until a more senior person is contacted in the hospital who has the authority or knowledge to correct the computer files.

Inconsistent or missing data in a patient record can be caused by a variety of factors. A code or an amount of a drug may be erroneously entered in a computer file. For instance, if we know that a drug is only given in certain amounts to different classes of patients according to age and sex, then a straightforward edit of the amount might eliminate an error in the computer file or, more importantly, an error in the amount or type of a drug. If we are integrating several databases into a data warehouse, we need to assure that the individual databases cooperate in the sense of having compatible definitions and concepts for common terms (Fig. 1). For instance, if one historical database has amounts in Liras and a current database has amounts in Euros, then the historical data might be converted to Euros. The conversion might only be done in the software that analyzes data taken from the historical database and the current one.

Fig. 1. Data warehouse. [Files 1–4 feed an integrating file: the data warehouse.]


If several different databases do not have common identifying codes such as a tax identifier, then linkages must be done using non-unique identifiers such as name and address information [21–23]. A combination of methods may be needed for the linkages to overcome incompatible formatting of individual names and addresses and moderate typographical error [24]. In a single administrative list, it may be useful to identify duplicates that arise because of typographical error in the unique identifiers. In merging 20 years of California State Employment quarterly data, we observed 2–3 percent errors per year in the SSN that was used as the unique identifier. Each time series associated with an individual in the one billion record file might have two breaks because of errors in the SSNs. To improve the quality of the file, we used name, address, and other information to link data and correct the erroneous SSNs. A much more complete development of the automatic schema matching needed for warehouse development is given in [25]. In the following sections we will cover methods of improving the consistency of information in a set of files and identifying duplicates.

3. Statistical data editing

In a database, we would like to correct inconsistent information and fill in missing information in an efficient, cost-effective manner. For editing, we are concerned with single fields and multiple fields. Edits for single fields are relatively straightforward. For instance, we may use a lookup table to determine correct diagnostic codes. If the code in the field in the database is not in the table, then we provide a message indicating the code must be changed. For multiple fields, we might want an individual of less than 15 years to always have marital status of unmarried. If a record fails this edit, then we would need to change either the age or the marital status. Edit or business rules are closely related to integrity constraints. In simple situations, they can be considered the same because they can include entity integrity and other basic integrity forms. They do not include referential integrity, which is better dealt with by duplicate-detection methods.
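The two kinds of checks just described can be sketched as table-driven rules. The following minimal example is not from the paper; the diagnostic codes, field names, and messages are illustrative assumptions only.

```python
# Illustrative, table-driven edit checks; codes and field names are hypothetical.
VALID_DIAGNOSTIC_CODES = {"C18.9", "C50.9", "J45.909"}   # single-field lookup table

def check_record(record):
    """Return a list of edit-failure messages for one record."""
    failures = []
    # Single-field edit: the diagnostic code must appear in the lookup table.
    if record.get("diagnostic_code") not in VALID_DIAGNOSTIC_CODES:
        failures.append("diagnostic_code not in lookup table; code must be changed")
    # Multi-field edit: a person under 15 should have marital status 'unmarried'.
    if record.get("age", 0) < 15 and record.get("marital_status") != "unmarried":
        failures.append("age < 15 with marital status other than 'unmarried'")
    return failures

print(check_record({"diagnostic_code": "C18.9", "age": 14, "marital_status": "married"}))
```

Keeping the rules in small tables of this kind, rather than in hard-coded branches, is what allows them to be maintained and updated easily.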

In more general situations, edit or business rules include assertions such as: if an individual is in a certain salary group, then their salary should not be greater than amount $X$. They can include triggers, which are statements that can be automatically implemented. The sets of edits are in formats in computer files that are easily maintained and updated. The methods of implementing edit rules go further because they check the logical consistency of a set of edit rules. They can automatically determine a minimal set of fields whose values must be changed and the values to which they should be changed so that a record satisfies edits. Until very recently, implementing a set of edits was computationally intractable [26]. Further, it is possible to put probability distributions on the 'corrected' records that are consistent with the subset of records that satisfy the edit rules or have only missing data [27].

Data editing has been done extensively in the national statistical agencies since the 1950s. Early work was primarily done clerically. Later editing methods applied computer programs that incorporated if-then-else rules with logic similar to the logic applied by the clerks in their manual reviews. There were two main disadvantages of these methods. The first disadvantage is that there is still a significant amount of clerical review. Although most records are only edited by the computer software, a significant proportion is also edited by the clerks as an additional quality step. In the largest situations, the additional review associated with the largest businesses or a sample of individuals took months by dozens of clerks. The second disadvantage was that the edit procedures could not assure that the record that had been changed (i.e., 'corrected') by the clerks would pass edits after one review by the clerks. Typically, a number of iterations were needed to assure that a record that initially failed edits would be changed to a record that satisfies edits. As fields were changed to correct for edit failures, additional edits would fail that did not fail originally. The sequential, local nature of the edit corrections needed multiple iterations to assure that the final 'corrected' record would pass. During each iteration, more and more data fields would be changed in the edit-failing records.


In a dramatic comparison, Garcia and Thompson [28] showed that a group of 10 analysts took 6 months to review and correct a moderate size survey that had complicated edit patterns. An automated method based on the model of Fellegi and Holt (see below) needed only 24 h to edit/impute the survey and changed 1/3 as many fields as the clerical review.

Fellegi and Holt ([2], hereafter FH) provided a theoretical model for editing. In providing their model, they had three goals: (1) the data in each record should be made to satisfy all edits by changing the fewest possible items of data (variables or fields); (2) imputation rules should be derived automatically from edit rules; (3) when imputation is necessary, it is desirable to maintain the marginal and joint frequency distributions of variables. FH (Theorem 1) proved that implicit edits are needed for solving the problem of goal 1. Implicit edits are those that can be logically derived from explicitly defined edits. Implicit edits contain information about edits that do not fail initially for a record but may fail as values in fields associated with failing edits are changed. By 'correcting' a record, we mean changing values in fields associated with failing edits so that the modified record no longer fails edits. The fields that an edit places restrictions on are referred to as the entering fields of the edits. Goal 1 is referred to as the error localization (EL) problem. FH provided an inductive, existence-type proof of their Theorem 1. Their solution, however, did not deal with many of the practical computational aspects of the problem that, in the case of discrete data, were considered by Garfinkel et al. ([29], hereafter GKL), Winkler [30,31], and Chen [32]. Because the error localization problem is NP-complete (GKL), reducing computation is the most important aspect in implementing an FH-based edit system. More formally, we can describe the EL problem as follows: record $r = (a_1, a_2, \ldots, a_n)$ has $n$ fields. For each $i$, $a_i \in A_i$, $1 \le i \le n$, where $A_i$ is the set of possible values or code values which may be recorded in field $i$; $|A_i| = n_i$. The objective of error localization is to find the minimum number of fields to change if a record fails some of the edits. It can be formulated as a set covering problem.


Let $\bar{E} = \{E_1, E_2, \ldots, E_m\}$ be the entire set of (explicit and implicit) edits failed by a record $r$ with $n$ fields. A field $j$ enters edit $E_i$ if the edit places restrictions on field $j$. Consider the set covering problem:

$$\text{Minimize } \sum_{j=1}^{n} c_j x_j \qquad (1)$$

$$\text{subject to } \sum_{j=1}^{n} a_{ij} x_j \ge 1, \quad i = 1, 2, \ldots, m,$$

where

$$x_j = \begin{cases} 1 & \text{if field } j \text{ is to be changed,} \\ 0 & \text{otherwise,} \end{cases} \qquad a_{ij} = \begin{cases} 1 & \text{if field } j \text{ enters } E_i, \\ 0 & \text{otherwise,} \end{cases}$$

and $c_j$ is a measure of confidence in field $j$. A small value of $c_j$ indicates that the corresponding field $j$ is considered more likely to be in error. If the implicit edits are available, then FH (Theorem 1, [2]) showed that solving the EL problem is equivalent to solving Eq. (1). More explicitly, FH proved that any cover $C$ of the fields in the failing (explicit and implicit) edits associated with an edit-failing record $r$ always yields an edit-passing record $r_1$ from record $r$ by finding new values of the fields in $C$. If the cover $C$ is prime (i.e., has no subsets that are also covers), then it is always necessary to change values in each field of the cover. Alternatively, if implicit edits are not available, then the EL problem can be solved by direct integer programming methods such as branch-and-bound that are much slower.

The following example illustrates some of the computational issues. An edit can be considered as a set of points. Let edit $E = \{\text{married}, \text{age} < 15\}$. Let $r$ be a data record. Then $r \in E \Rightarrow r$ fails the edit. This formulation is equivalent to 'if age $< 15$, then not married'. We note that if a record $r$ fails a set of edits, then one field in each of the failing edits must be changed. Now consider an implicit edit $E_3$ that can be implied from two explicitly defined edits $E_1$ and $E_2$; i.e., $E_1$ and $E_2 \Rightarrow E_3$ (Table 1). The fields age, marital status and relationship to head of household are entering fields because they are restricted by the edits. Implicit edit $E_3$ can be logically derived from $E_1$ and $E_2$. If $E_3$ fails for a record $r = \{\text{age} < 15, \text{not married}, \text{spouse}\}$, then necessarily either $E_1$ or $E_2$ fails.


Table 1
Two explicit edits and one implicit edit

E1 = {age < 15, married, *}
E2 = {*, not married, spouse}
E3 = {age < 15, *, spouse}

(An asterisk indicates that the edit places no restriction on that field.)

Assume that the implicit edit $E_3$ is unobserved. If edit $E_2$ fails for record $r$, then we may change the marital status field in record $r$ to married to obtain a new record $r_1$. Record $r_1$ does not fail for $E_2$ but now fails for $E_1$. If the implicit edit $E_3$ were observed, then we would know to change at least one additional field in record $r$. For much larger data situations having more edits and more fields, the number of possibilities increases at a very high exponential rate. If implicit edits are generated, then standard branch-and-bound integer programs are very fast for the main edit program [30,33,34]. The speed of the edit program is due to the fact that implicit edits almost perfectly summarize information that is needed for solving the EL problem (1). Winkler [30] introduced a greedy heuristic that is more than 100 times as fast as branch-and-bound for error localization. The greedy algorithm was tested with actual survey data representing real-world edit situations. The greedy algorithm obtained solutions that were equivalently minimal to branch-and-bound for better than 99.9% of the records. If the complete set of implicit edits is available prior to editing, then the speed of the main edit program is no longer an issue. The heuristic algorithm processes on the order of 1000 records per second.

Many organizations have done research and developed FH systems. Rather than generate implicit edits prior to editing, many of the systems solve the EL problem directly. An early success was Statistics Canada's Generalized Edit and Imputation System (GEIS) for sets of linear inequality edits, which improved computational speed for editing economic data [35] by efficiently applying a cardinality-constrained variant of Chernikova's algorithm. The GEIS system does not generate implicit edits prior to error localization. IBM and the Italian National Statistical Institute (ISTAT) developed the SCIA (also named DAISY) system for discrete, demographic data.

The SCIA system used algorithms for edit generation and error localization that were introduced by Garfinkel et al. [29]. Winkler [31] provided much faster set covering algorithms than those of GKL. Current work on generalized systems is promising because of its potential generality and ease of use. In the LEO system, De Waal and Quere [36] apply Fourier–Motzkin ideas to combinations of linear-inequality edits for continuous data and general edits for discrete data. With LEO, the EL problem is solved directly by generating failing implicit edits only for the records being processed. Computation is limited by bounding the number of fields that are changed. Bruni and Sassano [37] applied satisfiability concepts to discrete data. They generate implicit edits prior to error localization. The ideas of Bruni and Sassano are a central part of the DIESIS system being developed by ISTAT [38,39]. Because GEIS and LEO do not generate implicit edits prior to error localization, they typically solve the EL problem directly and are slower (1 s per record to several minutes per record). The SPEER system [3,30] and the DISCRETE system [31,32] are much faster (1000 records per second and 100+ records per second, respectively) because they generate implicit edits prior to editing. Although SPEER and DISCRETE are not as general as GEIS, LEO, or DIESIS, they are likely to effectively handle all the data and edit situations needed in constructing data warehouses with millions of records. In data-warehouse situations, the ease of implementing FH ideas using established edit software can be dramatic. An analyst who has knowledge of the edit situations might put together the edit tables in a relatively short time. The edit tables might be all that is needed for the production edit system. Kovar and Winkler [40] compared the GEIS and SPEER systems for a set of linear-inequality edits on continuous economic data. In the comparison, both systems were installed and run in less than one day. In another situation with discrete data, Winkler and Petkunas [41] were able to develop a production edit system for a small survey in less than one day. In many business situations, only a few simple edit rules are needed. In these cases, we might easily apply an FH system.
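To make the error-localization step concrete, here is a minimal sketch, not drawn from any of the systems above, that treats the EL problem for the edits of Table 1 as a greedy weighted set cover over the failing explicit and implicit edits. The edit encoding, the field names, and the confidence weights $c_j$ are illustrative assumptions, and the particular greedy rule (confidence divided by coverage) is only one plausible heuristic.

```python
# A sketch of Fellegi-Holt error localization as greedy weighted set cover,
# assuming the full set of explicit and implicit edits is already available.
# Edits are written as "failure sets": a record fails an edit when, for every
# entering field, its value lies in the edit's restriction for that field.
EDITS = [
    {"age": {"<15"}, "marital": {"married"}},              # E1
    {"marital": {"not married"}, "relation": {"spouse"}},  # E2
    {"age": {"<15"}, "relation": {"spouse"}},              # E3 (implicit)
]
CONFIDENCE = {"age": 2.0, "marital": 1.0, "relation": 1.5}  # c_j; lower = more suspect

def fails(record, edit):
    return all(record.get(field) in values for field, values in edit.items())

def error_localization(record):
    """Greedily choose fields that cover every failing edit."""
    failing = [e for e in EDITS if fails(record, e)]
    to_change = set()
    while failing:
        # Pick the field covering the most remaining failing edits per unit cost.
        best = min(
            {f for e in failing for f in e},
            key=lambda f: CONFIDENCE[f] / sum(1 for e in failing if f in e),
        )
        to_change.add(best)
        failing = [e for e in failing if best not in e]
    return to_change

record = {"age": "<15", "marital": "not married", "relation": "spouse"}
print(error_localization(record))   # {'relation'} with these illustrative weights
```

Changing the relationship field resolves both failing edits here; the production systems described above use branch-and-bound over formulation (1) or far more refined heuristics and then impute new values for the chosen fields.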


Editing files to assure that business rules are satisfied has been noted in the data quality literature by Redman [1], English [42], and Loshin [43]. None of the authors has noted the difficulties of developing and applying a set of consistent hardcoded if-then-else rules and the relative ease of applying FH methods.

4. Duplicate identification

Data cleaning is used to identify duplicates when unique identifiers are unavailable. It relies primarily on matching of names, addresses, and other non-unique identifiers. Matching businesses using business names and other information can be particularly difficult [44]. For the last few years, there has been considerable interest and work on data cleaning [45]. Data cleaning is also called object identification [10,11] or record linkage [4]. Rahm and Do [45] provided an overview of data cleaning and some research problems. Tejada et al. [11] showed how to define linkage rules in a database environment. Sarawagi and Bhamidipaty [46] and Winkler [6] demonstrated how machine learning methods could be applied in record linkage situations where training data were available. Ananthakrishna et al. [47] and Liang et al. [48] provided methods for linkage in very large files in the database environment. Yancey and Winkler [49] developed BigMatch technology for matching moderate size lists of 100 million records against large administrative files having upwards of 4 billion records. Cohen and Richman [50] showed how to cluster and match entity names using methods that are scalable and adaptive.

We illustrate some of the data cleaning issues with a straightforward example. In Table 2, the three pairs represent three individuals. In the first two cases, a human being could generally determine that the pairs are the same. In both situations, the individuals have reasonably similar names, addresses, and ages. We would like software that automates the determination of match status.

Table 2
Elementary examples

Name                 Address                   Age
John A Smith         16 Main Street            16
J H Smith            16 Main St                17

Javier Martinez      49 E Applecross Road      33
Haveir Marteenez     49 Aplecross Raod         36

Gillian Jones        645 Reading Aev           24
Jilliam Brown        123 Norcross Blvd         43

In the third situation, we may know that the first record of the pair was a medical student at the university 20 years ago. The second record is a doctor in a different city who is known to have attended medical school at the university. With good automatic methods, we could determine that the first two pairs represent the same person. With a combination of automatic methods and human understanding, we might determine that the third pair is the same person.

In this section, we describe the formal model of Fellegi and Sunter ([4], hereafter FS) for record linkage classification rules. Cooper and Maron [7] and Yu et al. [8] rediscovered the methods. Only FS provided complete, formal proofs of the validity of the methods. In Section 4.2, we give a description of methods of preprocessing files prior to matching. The preprocessing is generally based on fixed rule-based logic that is used in many commercial systems. Current research [46,51,52] applies hidden Markov models to obtain preprocessing that is much more adaptive to new data situations. The adaptive methods require moderate or small amounts of training data. The methods have been particularly successful with Asian addresses, where some of the existing rule-based methods perform poorly. In Section 4.3, we provide an overview of methods of approximate string comparison for dealing with typographical variation (or error) in pairs of strings for fields such as first name, last name, or street name. In Section 4.4, we describe efficient methods of 1-1 matching. In Section 4.5, we describe the methods of parameter estimation that are based on unsupervised learning [12,13], semi-supervised learning [6,9,53], and supervised learning [10,11,46]. Winkler [13] showed that optimal matching parameters can vary significantly across geography, even with files of similar types having the same sets of identifying fields.


4.1. Record linkage model of Fellegi and Sunter

Fellegi and Sunter [4] provided a formal mathematical model for ideas that had been introduced by Newcombe [54–56]. They provided ways of estimating key parameters. To begin, notation is needed. Two files $A$ and $B$ are matched. The idea is to classify pairs in a product space $A \times B$ from two files $A$ and $B$ into $M$, the set of true matches, and $U$, the set of true nonmatches. Fellegi and Sunter, making rigorous concepts introduced by Newcombe [54], considered ratios of probabilities of the form

$$R = P(\gamma \in \Gamma \mid M) / P(\gamma \in \Gamma \mid U), \qquad (2)$$

where $\gamma$ is an arbitrary agreement pattern in a comparison space $\Gamma$. For instance, $\Gamma$ might consist of eight patterns representing simple agreement or not on the largest name component, street name, and street number. Alternatively, each $\gamma \in \Gamma$ might additionally account for the relative frequency with which specific values of name components such as 'Smith', 'Zabrinsky', 'AAA', and 'Capitol' occur. The ratio $R$, or any monotonely increasing function of it such as the natural log, is referred to as a matching weight (or score). The decision rule is given by

If $R > T_\mu$, then designate the pair as a match.
If $T_\lambda \le R \le T_\mu$, then designate the pair as a possible match and hold for clerical review.
If $R < T_\lambda$, then designate the pair as a nonmatch.   (3)

The cutoff thresholds $T_\lambda$ and $T_\mu$ are determined by a priori error bounds on false matches and false nonmatches. The thresholds are often called the lower and upper cutoffs. Rule (3) agrees with intuition. If $\gamma \in \Gamma$ consists primarily of agreements, then it is intuitive that $\gamma$ would be more likely to occur among matches than nonmatches and ratio (2) would be large. On the other hand, if $\gamma \in \Gamma$ consists primarily of disagreements, then ratio (2) would be small. Rule (3) partitions the set $\Gamma$ into three disjoint subregions. The region $\{T_\lambda \le R \le T_\mu\}$ is referred to as the no-decision region or clerical review region. In some situations, resources are available to review pairs clerically.
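The following minimal sketch shows how rule (3) is typically applied under a conditional independence assumption; the three comparison fields, the m- and u-probabilities, and the cutoffs are illustrative values rather than figures from the paper.

```python
import math

# Illustrative m- and u-probabilities for three compared fields; values are made up.
m = {"last": 0.95, "street": 0.90, "house": 0.85}   # P(agree on field | M)
u = {"last": 0.10, "street": 0.15, "house": 0.05}   # P(agree on field | U)

def match_weight(agreement):
    """Log2 of the likelihood ratio R in Eq. (2) under conditional independence."""
    w = 0.0
    for field, agrees in agreement.items():
        if agrees:
            w += math.log2(m[field] / u[field])
        else:
            w += math.log2((1 - m[field]) / (1 - u[field]))
    return w

def classify(weight, lower=0.0, upper=6.0):
    """Decision rule (3) with illustrative lower and upper cutoffs."""
    if weight > upper:
        return "match"
    if weight >= lower:
        return "possible match (clerical review)"
    return "nonmatch"

pattern = {"last": True, "street": True, "house": False}
w = match_weight(pattern)
print(round(w, 2), classify(w))
```

Because the log is a monotonely increasing function, working with log weights instead of $R$ itself leaves the classification unchanged.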

Fig. 2 provides an illustration of the curves of log frequency versus log weight for matches and nonmatches, respectively. Fig. 2 shows hypothetical cutoff thresholds that we denote with the symbols L and U, respectively. The data used in Fig. 2 are based on information obtained while matching name and address files from one of the sites for the 1988 U.S. Dress Rehearsal Census. The clerical review region consists of individuals within the same household that are missing both name and age.

4.2. Name and address standardization and parsing

Standardization consists of replacing the various spellings of words with a single spelling. For instance, different spellings and abbreviations of 'Incorporated' might be replaced with the single standardized spelling 'Inc'. The standardization component of the software might separate a general string such as a complete name or address into words (i.e., sets of characters that are separated by spaces and other delimiters). Each word is then compared against lookup tables to get a standard spelling. Table 3 shows various commonly occurring words that are replaced by standardized spellings (given in capital letters). After standardization, the name string is parsed into components that can be compared (Table 4). The examples are produced by general name standardization software that I wrote for the US Census of Agriculture matching system. Because the software does well with business lists and person matching, it has been used for other matching applications at the Census Bureau and other agencies. At present, I am unaware of any commercial software for name standardization. Promising new methods based on Hidden Markov models [52] may improve over the rule-based name standardization. The Hidden Markov models require training data. There are many excellent commercial address standardization packages. The following tables illustrate address standardization with a proprietary package developed by the Geography Division at the US Census Bureau. In testing in 1994, the software significantly outperformed the best US commercial packages in terms of standardization rates while producing comparably accurate standardizations. Table 5 shows a few addresses that have been standardized. Table 6 represents components of addresses produced by the parsing. The general Geography Division software produces approximately 50 components of addresses. The driver routine for the name and address standardization software that we make available with the matching software only outputs the most important components of the addresses.
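A minimal sketch of the table-driven standardization and parsing steps described above follows; the abbreviation table, the title lists, and the very crude parsing rule are toy assumptions and are far simpler than the rule-based Census software.

```python
# Illustrative standardization and parsing; the lookup tables and the parsing
# rule are toy assumptions, not the Census Bureau's rules.
STANDARD = {
    "DOCTOR": "DR", "INCORPORATED": "INC", "CORPORATION": "CORP",
    "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD",
}
PREFIX_TITLES = {"DR", "MR", "MRS", "MS"}
SUFFIX_TITLES = {"MD", "JR", "SR", "INC", "CORP"}

def standardize(raw):
    """Split into words and replace each word by its standardized spelling."""
    words = raw.upper().replace(",", " ").replace(".", " ").split()
    return [STANDARD.get(w, w) for w in words]

def parse_person_name(words):
    """Crude parse: optional prefix, first/middle names, last name, optional suffix."""
    parts = {"PRE": "", "FIRST": "", "MID": "", "LAST": "", "POST": ""}
    if words and words[0] in PREFIX_TITLES:
        parts["PRE"], words = words[0], words[1:]
    if words and words[-1] in SUFFIX_TITLES:
        parts["POST"], words = words[-1], words[:-1]
    if words:
        parts["LAST"] = words[-1]
        parts["FIRST"] = words[0] if len(words) > 1 else ""
        parts["MID"] = " ".join(words[1:-1]) if len(words) > 2 else ""
    return parts

print(parse_person_name(standardize("Doctor John J. Smith, MD")))
# {'PRE': 'DR', 'FIRST': 'JOHN', 'MID': 'J', 'LAST': 'SMITH', 'POST': 'MD'}
```

Real name and address standardizers carry far larger lookup tables and many more output components, as Tables 3–6 illustrate.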

Fig. 2. Log frequency versus weight, matches and nonmatches combined. [Curves of log frequency against matching weight, with 'o' marking nonmatches, '*' marking matches, and the cutoffs L = 0 and U = 6 indicated.]

Table 3
Examples of name parsing: standardized

1. DR John J Smith MD
2. Smith DRY FRM
3. Smith & Son ENTP

4.3. String comparators and likelihood ratios

In many matching situations, it is not possible to compare two strings exactly (character-by-character) because of typographical error. Dealing with typographical error via approximate string comparison has been a major research area in computer science (see e.g., [57,58]). In record linkage, we need a function that represents approximate agreement, with agreement being represented by 1 and degrees of partial agreement being represented by numbers between 0 and 1. We also need to adjust the likelihood ratios (2) according to the partial agreement values. Having such methods is crucial to matching. Three geographic regions are considered in Table 7.



Table 4
Examples of name parsing: parsed

     PRE   FIRST   MID   LAST    POST1   POST2   BUS1   BUS2
1.   DR    John    J     Smith   MD
2.                       Smith                   DRY    FRM
3.                       Smith           Son     ENTP

Table 5
Examples of address parsing: standardized

1. 16 W Main ST APT 16
2. RR 2 BX 215
3. Fuller BLDG SUITE 405
4. 14588 HWY 16 W

The function $\Phi$ represents exact agreement when it takes value one and represents partial agreement when it takes values less than one. In the St Louis region, for instance, 25% of first names and 15% of last names did not agree character-by-character among pairs that are matches. For the Columbia, Missouri and rural Washington areas, typographical error rates were similar. If there had been no method for dealing with typographical error, then more than 25% of the true matches would not have been located by the software.

Jaro [59] introduced a string comparator that accounts for insertions, deletions, and transpositions. The basic Jaro algorithm has three components: (1) compute the string lengths, (2) find the number of common characters in the two strings, and (3) find the number of transpositions. The definition of common is that the agreeing character must be within half the length of the shorter string. The definition of transposition is roughly that consecutive characters from one string are found in the second string in a different order. The pair of characters from the second string must be less than half the distance of the shorter of the two strings apart. The string comparator value (rescaled for consistency with the practice in computer science) is

$$\Phi_J(s_1, s_2) = \frac{1}{3}\left(\frac{N_c}{\mathrm{len}_{s_1}} + \frac{N_c}{\mathrm{len}_{s_2}} + \frac{N_c - 0.5\,N_t}{N_c}\right),$$

where $s_1$ and $s_2$ are the strings with lengths $\mathrm{len}_{s_1}$ and $\mathrm{len}_{s_2}$, respectively, $N_c$ is the number of common characters between the strings $s_1$ and $s_2$ (where the distance for common is half of the minimum length of $s_1$ and $s_2$), and $N_t$ is the number of transpositions. The number of transpositions $N_t$ is computed somewhat differently from the obvious manner. Using truth data sets, Winkler [60] introduced methods for modeling how the different values of the string comparator affect the likelihood in the Fellegi–Sunter decision rule. The truth data sets were only needed during development and are not needed for production matching situations. Winkler [60] also showed how a variant of the Jaro string comparator $\Phi$ dramatically improves matching efficacy in comparison to situations when string comparators are not used. The variant applies some ideas of Pollock and Zamora [61] from a large study for the Chemical Abstracts Service. They provided empirical evidence about how the probability of keypunch errors increased as the character position in a string moved to the right. More recent work by Sarawagi and Bhamidipaty [46] and Borthwick [62] provides empirical evidence that the new string comparators can perform favorably in comparison to Bigrams and Edit Distance. Table 8 compares the Jaro, Winkler, Bigram, and Edit-Distance values for selected first names and last names. The basic Bigram distance counts the consecutive pairs of characters that are in two strings. The basic Edit distance counts the minimum number of insertions, deletions, and substitutions to go from one string to another. The Bigram and Edit Distance are normalized to be between 0 and 1 by dividing the distance by the length of the longest string and then subtracting the resultant ratio from one. All string comparators take value 1 when the strings agree character-by-character. In all situations, the denominator is the maximum of the two string lengths.
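For readers who want to experiment, here is a compact sketch of the Jaro comparator and the Winkler prefix adjustment in their common textbook form; the matching-window convention and the prefix scale of 0.1 are standard defaults and may differ in detail from the production comparator discussed above.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1          # positions within which characters may match
    match1, match2 = [False] * len1, [False] * len2
    common = 0
    for i, c in enumerate(s1):                 # count common characters N_c
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    k = transposed = 0                         # count transposed common characters N_t
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transposed += 1
            k += 1
    t = transposed / 2.0                       # 0.5 * N_t in the notation above
    return (common / len1 + common / len2 + (common - t) / common) / 3.0

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro value for an agreeing prefix of up to four characters."""
    j = jaro(s1, s2)
    p = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        p += 1
    return j + p * prefix_scale * (1.0 - j)

print(round(jaro("MARHTA", "MARTHA"), 3), round(jaro_winkler("MARHTA", "MARTHA"), 3))
# 0.944 0.961
```

This sketch reproduces many, though not all, of the Jaro and Winkler values in Table 8; the production comparator applies further adjustments, including the different counting of transpositions and the Pollock–Zamora position-dependent ideas mentioned above.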


Table 6
Examples of address parsing: parsed

     Pre2   Hsnm    Stnm    RR   Box   Post1   Post2   Unit1   Unit2   Bldg
1.   W      16      Main               ST              16
2.                          2    215
3.                                                     405             Fuller
4.          14588   HWY                16      W

Table 7
Proportional agreement by string comparator values among matches, key fields by geography

                   StL     Col     Wash
First
  Φ = 1.0          0.75    0.82    0.75
  Φ > 0.6          0.93    0.94    0.93
Last
  Φ = 1.0          0.85    0.88    0.86
  Φ > 0.6          0.95    0.96    0.96

Both Bigram and Edit Distance have difficulty with transpositions, as illustrated by the first entry among the last names and the second entry among the first names. The Edit-Distance conversion to the scale between zero and one is the same as that used by Bertolazzi et al. [21].

4.4. 1-1 matching

Many applications consist of matching two files that have few internal duplicates. In these situations, it is often efficient to force 1-1 matching because many of the second and third best matches for pairs might have matching weights that are sufficiently high to necessitate clerical review and not be true matches. Forcing 1-1 matching in an efficient manner can greatly reduce clerical review in these situations. Jaro [59] provided a linear sum assignment procedure (lsap) [63] to force 1-1 matching. He observed that 1-1 matching via greedy algorithms in earlier record linkage systems could make a higher proportion of erroneous assignments. A greedy algorithm is one in which a record is always associated with the corresponding available record having the highest agreement weight.

Subsequent records are only compared with available remaining records that have not been assigned. In the following example (Table 9), the two households are assumed to be the same, individuals have substantial identifying information, and the ordering is as shown. An lsap algorithm (applied to the matching weights represented in Table 10) makes the wife–wife, son–son, and daughter–daughter assignments correctly because it optimizes the set of assignments globally over the household. Other algorithms, such as some types of greedy algorithms, can make erroneous assignments such as husband–wife, wife–daughter, and daughter–son. The value $c_{ij}$ is the (total agreement) weight from matching the $i$th person from the first file with the $j$th person in the second file. Winkler [64] introduced a modified assignment algorithm that uses 1/500 as much storage as the original algorithm and is of equivalent speed. The modified assignment algorithm does not induce the very small proportion of matching error (0.1–0.2%) that is caused by the original assignment algorithm. Winkler [64] provides more details of the modified assignment algorithm. Fig. 3 illustrates the effect of 1-1 matching in contrast to non-1-1 matching.
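A minimal sketch of forcing 1-1 assignments within a block such as a household is given below; it uses SciPy's linear sum assignment solver on an illustrative weight matrix rather than the storage-reduced algorithm of Winkler [64], and all the weights are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative total-agreement weights c_ij: rows are HouseH1 (husband, wife,
# daughter, son); columns are HouseH2 (wife, daughter, son).
weights = np.array([
    [12.0,  3.0,  2.0],   # husband vs (wife, daughter, son)
    [14.0,  6.0,  1.0],   # wife
    [ 5.0, 13.0,  2.0],   # daughter
    [ 1.0,  2.0, 11.0],   # son
])

# Maximize total weight with at most one assignment per row and per column.
rows, cols = linear_sum_assignment(weights, maximize=True)
for i, j in zip(rows, cols):
    print(f"HouseH1 record {i} -> HouseH2 record {j} (weight {weights[i, j]})")
```

With these weights the global optimum makes the wife-wife, daughter-daughter, and son-son assignments and leaves the husband unmatched, whereas a greedy pass that handles the husband first would take the husband-wife pair and then be forced into erroneous assignments.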

4.5. Automatic and semi-automatic parameter and error-rate estimation

Fellegi and Sunter [4] introduced methods for estimating optimal parameters (probabilities) in the likelihood ratio (2). They observed that

$$P(\gamma) = P(\gamma \mid M)\,P(M) + P(\gamma \mid U)\,P(U), \qquad (4)$$

where $\gamma \in \Gamma$ is an arbitrary agreement pattern and $M$ and $U$ are the two classes of matches and nonmatches.


Table 8
Comparison of string comparators using last names and first names

Names                          Jaro    Wink    Bigr    Edit
SHACKLEFORD   SHACKELFORD      0.970   0.982   0.925   0.818
DUNNINGHAM    CUNNIGHAM        0.896   0.896   0.917   0.889
NICHLESON     NICHULSON        0.926   0.956   0.906   0.889
JONES         JOHNSON          0.790   0.832   0.000   0.667
MASSEY        MASSIE           0.889   0.933   0.845   0.667
ABROMS        ABRAMS           0.889   0.922   0.906   0.833
HARDIN        MARTINEZ         0.000   0.000   0.000   0.143
ITMAN         SMITH            0.000   0.000   0.000   0.000
JERALDINE     GERALDINE        0.926   0.926   0.972   0.889
MARHTA        MARTHA           0.944   0.961   0.845   0.667
MICHELLE      MICHAEL          0.869   0.921   0.845   0.625
JULIES        JULIUS           0.889   0.933   0.906   0.833
TANYA         TONYA            0.867   0.880   0.883   0.800
DWAYNE        DUANE            0.822   0.840   0.000   0.500
SEAN          SUSAN            0.783   0.805   0.800   0.400
JON           JOHN             0.917   0.933   0.847   0.750
JON           JAN              0.000   0.000   0.000   0.667

Table 9
Representation of a household

HouseH1      HouseH2
Husband      Wife
Wife         Daughter
Daughter     Son
Son

Table 10
Weights associated with individuals across two households

c11   c12   c13
c21   c22   c23
c31   c32   c33
c41   c42   c43

(4 rows, 3 columns; take at most one in each row and column.)

If the agreement pattern $\gamma \in \Gamma$ is from three fields that satisfy a conditional independence assumption, then the system of seven equations and seven unknowns can be used to estimate the m-probabilities $P(\gamma \mid M)$, the u-probabilities $P(\gamma \mid U)$, and the proportion $P(M)$. We note that the observed proportions $P(\gamma)$ can be computed for the eight patterns $\gamma \in \Gamma$. There are $7 = 2^3 - 1$ free values $P(\gamma)$ because the probabilities must add to 1. The conditional independence assumption corresponds exactly to the naïve Bayes assumption in machine learning [5]. Winkler [12] showed how to estimate the probabilities using the EM algorithm ([65], also [66]).

Fig. 3. 1-1 Matching versus non-1-1 matching. [Two panels of log frequency versus matching weight.]

Although this is a method of unsupervised learning (e.g., [14,67]) that will not generally find two classes $C_1$ and $C_2$ that correspond to $M$ and $U$, Winkler [13] demonstrated that the EM algorithm estimates optimal parameters in a few suitable situations.


The best situations are when the observed proportions $P(\gamma)$ are computed over suitably chosen sets of pairs. Recent extensions of the methods for choosing the pairs are due to Yancey [15] and Elfekey et al. [9]. Furthermore, Winkler [13] showed that the optimal m- and u-probabilities could vary significantly from one region of the US to another. In particular, the conditional probability $P(\text{agreement on first name} \mid M)$ can differ significantly from an urban region to an adjacent suburban region. Belin and Rubin [68] introduced methods of automatic error-rate estimation that used information from the basic matching situations of Winkler [13]. Their error rate estimates were sufficiently accurate that Scheuren and Winkler [16] could use the estimated error rates in a statistical model that adjusts regression analyses for linkage errors. Winkler [69] observed, however, that the methods only worked well in a narrow range of situations where the curves associated with matches $M$ and nonmatches $U$ were well separated and had several other desirable characteristics. With many administrative lists, business lists, and agriculture lists, the methods of Belin and Rubin were unsuitable. Extensions of the basic parameter estimation methods have been to situations where the different fields used in the EM algorithm can have dependencies upon one another and where various convex constraints force the parameters into subregions of the parameter space [14]. The general fitting algorithm generalizes the iterative scaling algorithm of Della Pietra et al. [70]. Additional extensions are where small amounts of labeled data are combined with the unlabeled data used in the original algorithms [5,6,9,53]. The general methods [6] can be used for data mining to determine what interaction patterns and variants of string comparators and other metrics affect the decision rules. The variant that uses unlabeled data with small amounts of labelled training data can yield semi-automatic estimates of classification error rates. The general problem of error rate estimation is very difficult. It is known as the regression problem [71,72].
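The following sketch illustrates the unsupervised EM estimation of the m-probabilities, u-probabilities, and $P(M)$ under conditional independence (the two-class naïve Bayes mixture described above). The agreement patterns are simulated here, so all numbers are illustrative; in practice the patterns come from the candidate pairs of an actual matching run.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 0/1 agreement patterns over three fields for candidate pairs.
# The "true" parameters below are used only to generate illustrative data.
true_m, true_u, true_pM = np.array([0.95, 0.90, 0.85]), np.array([0.10, 0.15, 0.05]), 0.2
n = 20000
is_match = rng.random(n) < true_pM
gamma = (rng.random((n, 3)) < np.where(is_match[:, None], true_m, true_u)).astype(float)

# EM for a two-class mixture of conditionally independent binary agreements.
m, u, pM = np.full(3, 0.8), np.full(3, 0.2), 0.1          # starting values
for _ in range(200):
    # E-step: posterior probability that each pair is a match.
    lm = pM * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
    lu = (1 - pM) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
    w = lm / (lm + lu)
    # M-step: re-estimate P(M) and the m- and u-probabilities.
    pM = w.mean()
    m = (w[:, None] * gamma).sum(axis=0) / w.sum()
    u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()

print("estimated P(M):", round(float(pM), 3))
print("estimated m-probabilities:", m.round(3))
print("estimated u-probabilities:", u.round(3))
```

As noted above, unsupervised EM of this kind is only guaranteed to find two latent classes, not necessarily $M$ and $U$; computing the observed patterns over suitably chosen sets of pairs is what makes the estimates usable in practice.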

5. Advanced methods and research problems


In this section, we describe some basic scenarios in which we interconnect individual records from two or more files. The merging is intended to create a data warehouse. We assume that the individual files have already been cleaned of duplicates. We assume that fields such as sex are given a common set of codes so that the values can be compared. We assume that numeric values are converted to common units. For instance, monetary units might all be converted to Euros. In all situations, we assume that linkage of individual records across files is by common identifying information such as name, address, telephone, and date-of-birth (when available). We assume that two records that correspond to the same entity (either a person or business) may have typographical variation in the identifiers. By typographical variation, we mean that corresponding fields such as name or date-of-birth do not necessarily agree character-by-character. In extreme situations, these identifiers may be missing or completely wrong.

This section is divided into several subsections. In Section 5.1, we consider a straightforward method that shows how a large administrative file can be used for improving the linkages of two files that are subsets of the administrative file. In Section 5.2, we describe a method of analytic linking that creates additional information during the matching process and allows adjustments of statistical analyses that can sometimes account for linkage error. In Section 5.3, we describe some research problems.

5.1. Bridging file

A bridging file is a file that can be used in improving the linkages between two other files. Typically, a bridging file might be an administrative file that is maintained by a governmental unit. We begin by describing two basic situations where individuals might wish to analyze data from two files. Tables 11 and 12 illustrate the situation. In the first case, economists might wish to analyze the energy inputs and outputs of a set of companies by building an econometric model. Two different government agencies have the files. The first file has the energy inputs for companies that use the most fuels such as petroleum, natural gas, or coal as feedstocks.


Table 11
Need for data warehouse, economics

Economics—Companies
  Fuel feedstocks: Agency A
  Outputs produced: Agency B

Table 12
Need for data warehouse, health

Health—Individuals
  Receiving social benefits: Agencies B1, B2, B3
  Incomes: Agency I
  Use of health services: Agencies H1, H2

The second file has the goods that are produced by the companies. The records associated with the companies must be linked primarily using fields such as name, address, and telephone. In the second situation, health professionals wish to create a model that connects the benefits, hospital costs, doctor costs, incomes, and other variables associated with individuals. A goal might be to determine whether certain government policies and support payments are helpful. If the policies are helpful, then the professionals wish to quantify how helpful the policies are. We assume that the linkages are done in a secure location, that the identifying information is only used for the linkages, and that the linked files have the personal identifiers removed (if necessary) prior to use in the analyses.

A basic representation (Table 13) is where name, address, and other information is common across the files. The A-variables from the first (A) file and the B-variables from the second (B) file are what are primarily needed for the analyses. We assume that a record $r_0$ in the A-file might be linked to between 3 and 20 records in the B-file using the common identifying information. At this point, there is at most one correct linkage and between 2 and 19 false linkages.

Table 13
Representation of matching situation

File A                 Common                          File B
A11, ..., A1n          Name1, Addr1, DOB1              B11, ..., B1m
A21, ..., A2n          Name2, Addr2, DOB2              B21, ..., B2m
...                    ...                             ...
AN1, ..., ANn          NameN, AddrN, DOBN              BN1, ..., BNm

A bridging file C might be a large administrative file that is maintained by a government agency that has the resources and skills to assure that the file is reasonably free of duplicates and has current, accurate information in most fields. If the C file has one or two of the A-variables, the record $r_0$ might only be linked to between 1 and 8 of the records in the C file. If the C file has one or two B-variables, then we might further reduce the number of records in the B-file that record $r_0$ can be linked to. The reduction might be to one or zero records in the B-file with which record $r_0$ can be linked. Each of the linkages and reductions in the number of B-file records that $r_0$ can be linked to depends on both the extra A-variables and the extra B-variables that are in file C. If there are moderately high error rates in the A-variables or B-variables, then we may erroneously assume that record $r_0$ may not be linked from file A to file B. Having extra resources to assure that the A-variables and B-variables in the large administrative file C have very minimal error rates is crucial to successfully using the C file as a bridging file.

5.2. Analytic linking

Many of the linkages of sets of files in cooperating systems will be for the purposes of allowing analyses that have not previously been possible or to increase the accuracy of analyses. For instance, economists and demographers may wish to analyze microdata containing A-variables from an A-file and B-variables taken from a B-file [73]. The linkages are primarily done using name and address information. In some situations, it may be possible to determine how much an analysis such as a regression analysis can be improved using a theoretical model of how the linkage error may affect the analysis [16,18].

Fig. 4. Good, mediocre, poor, and very poor matching scenarios. [Four panels (good, mediocre, 1st poor, and 2nd poor matching scenarios), each plotting log frequency versus matching weight for matches and nonmatches.]

Fig. 4 illustrates some matching scenarios that we refer to as good, mediocre, poor, and very poor, depending on how much the curves of matches and nonmatches overlap when matching is performed using name and address information. The good scenario, in which the curves are quite separated, can refer to high-quality person lists. The mediocre scenario refers to the situations of moderately high-quality person lists where name, address, and other information used for linkages is not well maintained. The mediocre scenario is fairly typical when two survey files without common unique identifying codes are merged. The poor scenario refers to some administrative lists of persons and reasonably high-quality lists of businesses. The very poor scenario refers to many administrative lists of persons and to most administrative lists of businesses.

Scheuren and Winkler [16] concluded that most statistical analyses could be performed without adjustment in the good matching scenario and that some statistical analyses could be performed in the mediocre scenario provided suitable adjustments were made. The suitable adjustments involved special purpose statistical estimation software that accounted for some of the bias in the regression analysis due to linkage error. The most interesting situation for improving matching and statistical analyses is in the poor and very poor matching scenarios. Sometimes, economists and demographers will have a good idea of the relationship of the A-variables from the A-file and the B-variables from the B-file (Table 13). In these situations, we might use the A-variables to predict some of the B-variables. That is, $B_{ij} = \mathrm{Pred}_j(A_{k1}, A_{k2}, \ldots, A_{km})$, where $j$ is the $j$th variable in the B-file and $\mathrm{Pred}_j$ is a suitable predictor function. Alternatively, crude predictor functions $\mathrm{Pred}_j$ might be determined during iterations of the linkage process.


After an initial stage of linkage using only name and address information, a crude predictor function might be constructed using only those pairs having high matching weight. Scheuren and Winkler [17] conjecture that at most 200 pairs having high matching weight and a false match rate of at most 10% might be needed in simple situations with only one A-variable and one B-variable for a very poor matching scenario. Although the details are beyond the scope of this overview article, we provide two figures that illustrate the simplest situations. The basic strategy consists of four stages that can be iterated:

RL ⇒ RA ⇒ EI ⇒ RA (and back to RL for the next iteration).

We begin with record linkage (stage RL) using name and address information only.


RA-EI m k RL’RA information only. With the high matching weight pairs, we perform a regression analysis (stage RA) to get a crude predictor of, say, a y-variable in one file using an x-variable in another file. We edit outliers (stage EI) from the regression of all the linked pairs (above a suitable cutoff weight) to replace the outliers with predicted values (stage RA). With the new variable predðyÞ in the first file, we have additional information that allows us to improve the extra information by matching pred(y) from the first file with y-variable from the second file (2nd iteration of stage RL). Figs. 5 and 6 illustrate the basic situation. The upper left corner of Fig. 5 shows the scatterplot of a regression situation in which the records and their x- and y-variables are correctly linked. The regression analysis gives an R2 of approximately 0.4. The observed situation with linkages with name and address information only and matching weights above a suitably low is shown in the second panel on the left. Matching error which might represent a typical situation with name and address information in business lists is high. The overall false match rate is well above 0.50 and the majority of x-variables are paired with the wrong y-variables. The resultant R2 from the regression is close to 0.0.


Fig. 5. First stage of analytic linking, very simple situation of two variables only, ‘o’—false match and ‘x’—true match.

With 100–200 pairs above a reasonably high matching weight, corresponding to a false-match rate of 0.10, we can model a predicted y-value using the observed x-values. In other situations, economists might already have developed a model of the relationship between the x- and y-variables that is better than the crude, but straightforward, model fit to the pairs having the highest matching weight. The plot with the outliers from the regression is shown in the bottom plot of Fig. 5. For each x-variable we create a new variable pred(y) in the first file. For each outlier from the regression, pred(y) is set equal to the value predicted by the regression; for the remaining x-variables, pred(y) is set equal to the value of the y-variable obtained from the initial linkages. The new variable pred(y) provides an additional matching variable that can be compared with the y-variable in the second file.
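The construction of pred(y) can be sketched concretely as follows, assuming a single x- and y-variable, a linear predictor fit by least squares, and outliers flagged by a simple residual threshold; the function, its arguments, and the cutoffs are illustrative assumptions, not the estimation software used by Scheuren and Winkler.

import numpy as np

def build_pred_y(pairs, x, y, weight, high_cutoff, low_cutoff, n_sigma=2.5):
    """Construct pred(y) for file A from scored candidate pairs.
    pairs  : (n, 2) integer array of (file-A index, file-B index)
    x, y   : x-variable of file A, y-variable of file B
    weight : matching weight of each pair from the name/address linkage"""
    xa, yb = x[pairs[:, 0]], y[pairs[:, 1]]
    trusted = weight >= high_cutoff

    # Crude predictor fit only to the pairs with high matching weight (stage RA).
    slope, intercept = np.polyfit(xa[trusted], yb[trusted], deg=1)

    # Flag regression outliers among all pairs above the lower weight cutoff (stage EI).
    kept = weight >= low_cutoff
    resid = yb - (intercept + slope * xa)
    outlier = kept & (np.abs(resid) > n_sigma * resid[trusted].std())

    # pred(y): regression prediction for outliers, linked y-value otherwise (stage RA).
    pred_y = np.full(x.shape, np.nan)            # NaN where a file-A record has no kept pair
    pred_y[pairs[kept, 0]] = yb[kept]
    pred_y[pairs[outlier, 0]] = intercept + slope * xa[outlier]
    return pred_y

Because a file-A record may appear in several candidate pairs, a production system would also resolve such ties, for example by the 1-1 matching discussed below.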

Fig. 6. Second stage of analytic linking, very simple situation of two variables only, ‘o’—false match, ‘x’—true match.

After the matching with name information, address information, and the extra variable pred(y), the plot from the observed linkages is shown in the middle plot of Fig. 6. The reason that there are fewer points on the right-hand side is that pred(y) greatly reduces the number of false matches. False matches are shown with the symbol ‘o’ and true matches with the symbol ‘x’. We can think of the extra information (i.e., pred(y)) as providing a means of clustering more efficiently. In the initial linkage using name and address information only, a record from the first file might be linked to 3–15 other records. Each linkage might have a relatively low matching weight that (crudely) corresponds to a low probability of a correct match. The extra information due to pred(y) causes most records from the first file to be linked with fewer records from the second file. The 1-1 matching further reduces the number of false linkages (in this example). In more complicated situations with more variables and more files, we can successively reduce the number of records in the B-file with which each A-file record can be linked (clustered).
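The 1-1 matching can be viewed as a linear assignment problem on the matrix of matching weights. The sketch below uses scipy's linear_sum_assignment on a small illustrative weight matrix; the matrix entries and the positivity threshold are assumptions for the example only.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative matching weights: rows are file-A records, columns are file-B records.
weights = np.array([[12.0,  3.5, -2.0],
                    [ 4.0, 10.5,  1.0],
                    [-1.0,  2.0,  9.0]])
row, col = linear_sum_assignment(-weights)            # maximize the total matching weight
matches = [(int(a), int(b)) for a, b in zip(row, col)
           if weights[a, b] > 0.0]                    # keep only pairs with a plausible weight
print(matches)                                        # [(0, 0), (1, 1), (2, 2)]

Forcing a 1-1 assignment discards the weaker of two candidate links that compete for the same record, which is one reason it reduces the number of false linkages in the example above.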

5.3. Research problems

In Section 5.1, we showed how linkages could be improved by using a bridging file to obtain extra information. In Section 5.2, we created information that allowed us to improve linkages. In general, we could use one or more bridging files and a set of predicted variables to improve linkages. Such combinations of procedures might best be used in constructing a main data warehouse of cooperating systems. The research questions are: (1) ‘What are the best ways of automating many of the linkage steps?’ and (2) ‘Are these procedures suitable for the most advanced clean-up of the administrative files maintained by government agencies?’

Some files contain moderate amounts of typographical error in the key identifying fields that are needed for linkages. For instance, in some person files, there is a 5% error rate in each of the fields first name, last name, year-of-birth, month-of-birth, and day-of-birth. In matching two files, we only consider pairs of records that agree character-by-character on a few characteristics and use the remaining characteristics to compute matching scores. This procedure is called blocking in record linkage [56] or clustering [74]. These methods reduce the number of pairs that are considered during the matching process. If we only consider pairs that agree character-by-character on the combination of first name, last name, year-of-birth, month-of-birth, and day-of-birth, then we could miss more than 20% of the matches (with independent 5% error rates, only about 0.95^5, or roughly 77%, of the true matches agree exactly on all five fields). A partial solution to the difficulty with typographical error in key identifying fields is to have multiple blocking passes; a small sketch of this idea appears at the end of this subsection. Yancey and Winkler [49] have developed BigMatch technology for creating multiple indexes associated with different sets of blocking criteria to match large administrative files having 100 million or more records in each file being matched. The BigMatch methods of fast indexing, retrieval, and comparison should scale much better than indexing methods such as those of Ferragina and Grossi [75]. The research questions are: (1) ‘What are the best methods of blocking (clustering)?’ and (2) ‘What are the computationally most efficient methods?’
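The sketch below illustrates multiple blocking passes: each pass indexes both files on a different combination of exact-agreement fields, and the candidate pairs from all passes are pooled, so a match missed by one pass because of a typographical error may still be caught by another. The field names and the two passes are assumptions for the example and are not the blocking criteria used with BigMatch.

from collections import defaultdict

def blocking_pass(file_a, file_b, key_fields):
    """Candidate pairs (i, j) that agree character-by-character on key_fields."""
    index = defaultdict(list)
    for j, rec in enumerate(file_b):
        index[tuple(rec[f] for f in key_fields)].append(j)
    pairs = set()
    for i, rec in enumerate(file_a):
        for j in index.get(tuple(rec[f] for f in key_fields), []):
            pairs.add((i, j))
    return pairs

def multi_pass_blocking(file_a, file_b, passes):
    """Pool the candidate pairs produced by several blocking passes."""
    candidates = set()
    for key_fields in passes:
        candidates |= blocking_pass(file_a, file_b, key_fields)
    return candidates

# Two illustrative passes: surname + year of birth, then first name + postal code.
passes = [("last_name", "birth_year"), ("first_name", "zip")]
file_a = [{"first_name": "JOHN", "last_name": "SMITH", "birth_year": "1960", "zip": "20233"}]
file_b = [{"first_name": "JON",  "last_name": "SMITH", "birth_year": "1960", "zip": "20233"}]
print(multi_pass_blocking(file_a, file_b, passes))   # {(0, 0)}: caught by the first pass despite the typo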

6. Concluding remarks

This paper describes methods for statistical data editing and imputation and for data cleaning to remove duplicates. The editing model is based on the work of Fellegi and Holt [2]. The data cleaning model is based on the record linkage work of Fellegi and Sunter [4]. Extensions of the record linkage methods can allow accurate interconnection of the corresponding records associated with individual entities. Such interconnections are one of the main steps in creating a data warehouse of cooperating systems.

Acknowledgements

This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a Census Bureau review more limited in scope than that given to official Census Bureau publications. This report is released to inform interested parties of research and to encourage discussion. The author thanks three referees for comments related to some additional references and improvement in the clarity of the presentation.

References

[1] T.C. Redman, Data Quality in the Information Age, Artech, Boston, 1996. [2] I.P. Fellegi, D. Holt, A systematic approach to automatic edit and imputation, J. Amer. Statist. Assoc. 71 (1976) 17–35. [3] W.E. Winkler, The state of statistical data editing, in: Statistical Data Editing, ISTAT–The Italian National Statistical Institute, Rome, Italy, 1999, pp. 169–187 (available as Report rr99/01 at http://www.census.gov/srd/www/byyear.html). [4] I.P. Fellegi, A.B. Sunter, A theory for record linkage, J. Amer. Statist. Assoc. 64 (1969) 1183–1210.

[5] W.E. Winkler, Machine learning, information retrieval, and record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 2000, pp. 20–29 (also available at http://www.niss.org/ affiliates/dqworkshop/papers/winkler.pdf). [6] W.E. Winkler, Methods for record linkage and Bayesian networks, Proceedings of the Section on Survey Research Methods, American Statistical Association, 2002, CD-ROM, Alexandria, Virginia, USA (report RRS2002/05 available at http://www.census.gov/srd/ www/byyear.html). [7] W.S. Cooper, M.E. Maron, Foundations of probabilistic and utility-theoretic indexing, J. Assoc. Comput. Mach. 25 (1978) 67–80. [8] C.T. Yu, K. Lam, G. Salton, Term weighting in information retrieval using the term precision model, J. Assoc. Comput. Mach. 29 (1979) 152–170. [9] M. Elfekey, V. Vassilios, A. Elmagarmid, TAILOR: a record linkage toolbox, IEEE International Conference on Data Engineering ’02, 2002. [10] S. Tejada, C. Knoblock, S. Minton, Learning object identification rules for information extraction, Inf. Systems 26 (8) (2001) 607–633. [11] S. Tejada, C. Knoblock, S. Minton, Learning domainindependent string transformation for high accuracy object identification, ACM SIGKDD’02, 2002. [12] W.E. Winkler, Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, 1988, pp. 667–671, available as Report rr00/05 at http://www.census.gov/srd/ www/byyear.html [13] W.E. Winkler, Near automatic weight computation in the Fellegi-Sunter model of record linkage, Proceedings of the Fifth Census Bureau Annual Research Conference, 1988, pp. 145–155. [14] W.E. Winkler, Improved decision rules in the FellegiSunter model of record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 1993, pp. 274–279. [15] W.E. Yancey, Improving EM algorithm estimates for record linkage parameters, Proceedings of the Section on Survey Research Methods, American Statistical Association, 2002, to appear. [16] F. Scheuren, W.E. Winkler, Regression analysis of data files that are computer matched, Survey Methodol. 19 (1993) 39–58. [17] F. Scheuren, W.E. Winkler, Regression analysis of data files that are computer matched II, Survey Methodol. 23 (1997) 157–165. [18] P.A. Lahiri, M.D. Larsen, Regression analysis with linked data, J. Amer. Statist. Assoc. 81 (2003) CD-ROM, Alexandria, Virginia, USA. [19] D. Koller, A. Pfeffer, Probabilistic frame-based systems, Proceedings of the 15th National Conference of Artificial Intelligence (AAAI), July 1998, Madison, Wisconsin, 1998, pp. 157–164.

[20] L. Getoor, N. Friedman, D. Koller, A. Pfeffer, Learning probabilistic relational models, in: S. Dzeroski, N. Lavrac (Eds.), Relational Data Mining, Springer, New York, 2001. [21] P. Bertolazzi, L. De Santis, M. Scannapieco, Automatic record matching in cooperative information systems, IEEE Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January 2003. [22] T.E. Ohanekwu, C.I. Ezeife, A token-based data cleaning technique for data warehouse systems, IEEE Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January 2003. [23] M. Neiling, S. Jurk, H.-J. Lenz, F. Naumann, Object identification quality, IEEE Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January 2003. [24] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Cartarci, C. Batini, Managing data quality in cooperative information systems, Very Large Data Bases ’02, 2002. [25] E. Rahm, P.A. Berstein, A survey of approaches to automatic schema matching, VLDB J. 10 (2001) 334–350. [26] W.E. Winkler, The quality of very large databases, Proceedings of Quality in Official Statistics ’2001, CD-ROM, May 15–17, 2001, Stockholm, Sweden, 2001 (also available at http://www.census.gov/srd/www/byyear.html as report rr01/04). [27] W.E. Winkler, A contingency table model for imputing data satisfying analytic constraints, Proceedings of the Section on Survey Research Methods, American Statistical Association, 2003, CD-ROM, Alexandria, Virginia, USA. [28] M. Garcia, K.J. Thompson, Applying the generalized edit/imputation system AGGIES to the annual capital expenditures survey, Proceedings of the International Conference on Establishment Surveys, II (2000) 777–789. [29] R.S. Garfinkel, A.S. Kunnathur, G.E. Liepins, Optimal imputation of erroneous data: categorical data, general edits, Oper. Res. 34 (1969) 744–751. [30] W.E. Winkler, Editing discrete data, Proceedings of the Section on Survey Research Methods, American Statistical Association, 1995, pp. 108–113 (report rr97/04 available at http://www.census.gov/srd/www/byyear.html). [31] W.E. Winkler, Set-covering and editing discrete data, Proceedings of the Section on Survey Research Methods, American Statistical Association, 1997, pp. 564–569 (report rr98/01 available at http://www.census.gov/srd/www/byyear.html). [32] B.-C. Chen, Set covering algorithms in edit generation, (American Statistical Association, Proceedings of the Section on Statistical Computing), 1998, available as Report rr98/06 at http://www.census.gov/srd/www/byyear.html [33] G. Barcaroli, M. Venturi, DAISY (Design, Analysis and Imputation System): structure, methodology, and first applications, in: J. Kovar, L. Granquist (Eds.), Statistical Data Editing, Vol. II, U.N. Economic Commission for Europe–ISTAT, Rome, Italy, 1997.


[34] G. Barcaroli, M. Venturi, An integrated system for edit and imputation of data: an application to the italian labour force survey, Proceedings of the 49th Session of the International Statistical Institute, Florence, Italy, 1993. [35] I. Schopiu-Kratina, J. Kovar, Use of Chernikova’s algorithm in the generalized edit and imputation system, Statistics Canada, Methodology Branch Working Paper BSMD 89-001E, 1989. [36] T. De Waal, R. Quere, A fast and simple algorithm for automatic editing of mixed data, J. Official Statist. 19 (2003) to appear. [37] R. Bruni, A. Sassano, Logic and optimization techniques for an error free data collecting, Dipartimento di Informatica e Sistemistica, Universita di Roma ‘‘La Sapienza’’, 2001. [38] A. Manzari, A. Reale, Towards a new method of edit and imputation of the Italian Census: a Comparison with the Canadian Nearest-Neighbour Methodology, Presented at the International Statistical Institute Meeting in Seoul, Korea, September 2001. [39] A. Manzari, A. Reale, Towards a new method of edit and imputation of the Italian Census: a Comparison with the Canadian Nearest-Neighbour Methodology, Paper presented at the U.N. Economic Commission for Europe worksession, Helsinki, Finland, May 27–29, 2002 (available at http://www.unece.org/stats/documents/2002.05. sde.htm). [40] J.G. Kovar, W.E. Winkler, Editing economic data, (American Statistical Association, Proceedings of the Section on Survey Research Methods), 1996, available as Report rr00/04 at http://www.census.gov/srd/www/ byyear.html [41] W.E. Winkler, T. Petkunas, The DISCRETE Edit System, in: J. Kovar, L. Granquist (Eds.), Statistical Data Editing, Vol. II, U.N. Economic Commission for Europe, Geneva, Switzerland, 1997, pp. 56–62. [42] L.P. English, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits, Wiley, New York, 1999. [43] D. Loshin, Enterprise Knowledge Management: The Data Quality Approach, Morgan Kaufmann, San Diego, 2001. [44] W.E. Winkler, Matching and record linkage, in: B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, P.S. Kott (Eds.), Business Survey Methods, Wiley, New York, NY, 1995. [45] E. Rahm, H.-H. Do, Data cleaning: problems and current approaches, Bull. IEEE Tech. Committee on Data Eng. 23 (4) (2000) 3–13. [46] S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, Very Large Data Bases ’02, 2002. [47] R. Ananthakrishna, S. Chaudhuri, V. Ganti, Eliminating fuzzy duplicates in data warehouses, Very Large Data Bases ’02, 2002. [48] J. Liang, C. Li, S. Mehrotra, Efficient record linkage in large data sets, 8th Annual International Conference on Database Systems for Advanced Applications, DASFAA 2003, 26–28 March, Kyoto, Japan, 2003.



[49] W.E. Yancey, W.E. Winkler, BigMatch software, computer system, 2002 (documentation is in research report RRC2002/01 at http://www.census.gov/srd/www/byyear. html). [50] W.W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for data integration, Association of Computing Machinery SIGKDD ’02, 2002. [51] P. Christen, T. Churches, J.X. Zhu, Probabilistic name and address cleaning and standardization (The Australian Data Mining Workshop, November, 2002), available at http://datamining.anu.edu.au/projects/linkage.html [52] T. Churches, P. Christen, J. Lu, J.X. Zhu, Preparation of name and address data for record linkage using hidden Markov models, BioMed. Central Med. Inform. Decision Making 2 (9) (2002) available at http://www. biomedcentral.com/1472-6947/2/9/. [53] M.D. Larsen, D.B. Rubin, A iterative automated record linkage using mixture models, J. Amer. Statist. Assoc. 79 (1989) 32–41. [54] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records, Science 130 (1959) 954–959. [55] H.B. Newcombe, J.M. Kennedy, Record linkage: making maximum use of the discriminating power of information, Comm. Assoc. Comput. Mach. 5 (1962) 563–567. [56] H.B. Newcombe, Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford University Press, Oxford, UK (out of print), 1988. [57] P.A.V. Hall, G.R. Dowling, Approximate string comparison, Assoc. Comput. Mach. Comput. Surveys 12 (1980) 381–402. [58] G. Navarro, A guided tour of approximate string matching, Assoc. Comput. Mach. Comput. Surveys 31 (1) (2001) 31–88. [59] M.A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, J. Amer. Statist. Assoc. 89 (1989) 414–420. [60] W.E. Winkler, String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 1990, pp. 778–783. [61] J. Pollock, A. Zamora, Automatic spelling correction in scientific and scholarly text, Comm. Assoc. Comput. Mach. 27 (1984) 358–368.

[62] A. Borthwick, MEDD 2.0, Conference Presentation, New York City, NY, USA, February, 2002, available at http:// www.choicemaker.com [63] R.E. Burkard, U. Derigs, Assignment and Matching Algorithms: Solution Methods with FORTRAN-Programs, Springer, New York, 1980. [64] W.E. Winkler, Advanced methods for record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 1994, pp. 467–472, (longer version report rr94/05 available at http://www. census.gov/srd/www/byyear.html). [65] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B 19 (1977) 380–393. [66] G.J. McLachlan, T. Krisnan, The EM Algorithm and Extensions, Wiley, New York, 1997. [67] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997. [68] T.R. Belin, D.B. Rubin, A method for calibrating falsematch rates in record linkage, J. Amer. Statist. Assoc. 90 (1995) 694–707. [69] W.E. Winkler, The state of record linkage and current research problems, Statistical Society of Canada, Proceedings of the Survey Methods Section, 1999, pp. 73–80 (longer report rr99/03 available at http://www.census.gov/ srd/www/byyear.html). [70] S. Della Pietra, V. Della Pietra, J. Lafferty, Inducing features of random fields, IEEE Trans. Pattern Anal. Mach. Intell. 39 (1977) 1–38. [71] V. Vapnik, The Nature of Statistical Learning Theory, 2nd Edition, Springer, Berlin, 2000. [72] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001. [73] W.E. Winkler, Issues with linking files and performing analyses on the merged files, Proceedings of the Section on Government Statistics, American Statistical Association, 1999, pp. 262–265. [74] A. McCallum, K. Nigam, L.H. Unger, Efficient clustering of high-dimensional data sets with application to reference matching, Association of Computing Machinery SIGKDD ’00, 2000. [75] P. Ferragina, R. Grossi, The string B-tree: a new data structure for string search in external memory and its applications, J. Assoc. Comput. Mach. 46 (1999) 236–280.