A machine learning approach to create blocking criteria for record linkage

Phan H. Giang
George Mason University
4400 University Dr., Fairfax, VA 22030

June 27, 2013

Abstract

Record linkage, a part of data cleaning, is commonly recognized as one of the most expensive tasks in data warehousing. Most RL systems employ blocking strategies to reduce the number of pairs to be matched. Until recently, blocking criteria were selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking filters for record linkage. Our method solves the bottleneck highlighted in recent efforts to apply a machine learning approach: the lack of sufficient labeled data for training. Unlike previous works, we do not consider a blocking filter in isolation but in the context of the matcher with which it is going to be used. We show that, given such a matcher, the labels that are relevant for learning a blocking filter are the labels assigned by the matcher (link/nonlink), not the true labels (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a DNF learning problem and use PAC learning theory to guide the development of an algorithm to search for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but have throughput reduced by an order-of-magnitude factor.

1 Introduction

Data records represent real world entities. Ideally, there should be a one-to-one correspondence between the records and the entities, but in reality this condition is often violated. A record may have no corresponding real world entity, and at the same time, an entity may have more than one record in the database. Periodically, databases need to be restored to their clean state by removing the ghost records and the duplicate records. In the age of Big Data, important insights are gained by analyzing enriched data sets obtained by integrating data collected for different purposes.

In healthcare, for example, master patient files often contain records that represent no real patient as well as multiple records that actually belong to a single patient. Poor quality of data entry is responsible for some of these problems, but in many situations they are the result of unavoidable circumstances. Normally, at the point a patient enters a healthcare system, his/her master record is located, pulled out and verified so that information on the new episode can be added. When this step fails for any reason, such as an emergency situation, patient incapacitation or access failure, business rules stipulate the creation of a new master record with the expectation that it will be merged with the patient's existing record at some later time. At the organization level, the structural evolution of the healthcare industry also leads to the aggregation of databases that cover overlapping segments of the population. The presence of duplicates in databases has many undesirable consequences for both providers and patients. Providers suffer from incorrect billing, and patients suffer from sub-optimal care if important pieces of information are not accessible because they are scattered across different records that are not linked together. Most health research has to integrate data from different sources to obtain richer data sets. In [8], Fleming et al. describe the effort by the NHS National Services Scotland to link medical records nationwide to create high-volume, high-quality data sets for public health and epidemiological research. In the US, Bouhaddou et al. [5] describe a joint effort by the Department of Veterans Affairs, the Department of Defense and Kaiser Permanente to create the Nationwide Health Information Network exchange in San Diego by linking patient records from their organizations.

Record linkage (RL), in the broader sense, is the problem of consolidating information concerning the same entity from different sources. In a more specific sense, RL can be formulated as the problem of linking records about the same entities from two different files. An excellent introduction to the methodological and practical issues of RL and its role in data quality assurance can be found in [11]. A survey of more recent developments is found in [23]. The core step of RL that receives most attention from researchers is the decision whether or not to link a pair of records on the basis of their data. The model that is the basis of many recent developments was proposed in the 1960s by Newcombe and colleagues [18] and was formally analyzed by Fellegi and Sunter (FS) [7]:

$$S = \ln\left(\frac{P(\gamma \in \Gamma \mid M)}{P(\gamma \in \Gamma \mid U)}\right) \qquad (1)$$

where $S$ is the matching score; $\gamma$ is the agreement pattern computed from the pair of records and $\Gamma$ is the set of such patterns; $M$ is the proposition that the two records in the pair are actually matched; $U$ is the proposition that the two records in the pair are actually unmatched; $P(\gamma \mid M)$ is the probability of observing pattern $\gamma$ if the two records are matched and $P(\gamma \mid U)$ is the probability of observing $\gamma$ if the two records are unmatched. In theory, the score in the FS model is nothing but the log likelihood ratio between the two competing hypotheses $M$ and $U$. The optimality property of the FS model is derived from that of statistical hypothesis testing theory.


The linkage is then decided by the rule:

$$\text{The pair is designated as} \quad \begin{cases} \text{matched} & \text{if } S \geq T_m \\ \text{needs review} & \text{if } T_m \geq S \geq T_u \\ \text{unmatched} & \text{if } S \leq T_u \end{cases} \qquad (2)$$

where $T_m$ and $T_u$ are the match and unmatch thresholds, which are selected in such a way as to optimize the performance characteristics.
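To make the FS score and decision rule concrete, the following is a minimal sketch (our illustration, not the paper's implementation). It computes the score of Eq. (1) for a binary agreement pattern under a field-wise conditional independence assumption and applies the thresholds of Eq. (2); the per-field probabilities and the thresholds are made-up values.

    import math

    # Illustrative m- and u-probabilities per field: P(field agrees | M) and P(field agrees | U).
    # These numbers are assumptions for demonstration, not estimates from any real file.
    M_PROB = {"last": 0.95, "first": 0.90, "dob": 0.85, "ssn": 0.80}
    U_PROB = {"last": 0.01, "first": 0.02, "dob": 0.005, "ssn": 0.0001}

    T_MATCH, T_UNMATCH = 8.0, 0.0  # hypothetical thresholds

    def fs_score(agreement):
        """Log-likelihood ratio score (Eq. 1) for a binary agreement pattern gamma,
        assuming fields agree/disagree independently given the match status."""
        score = 0.0
        for field, agrees in agreement.items():
            m, u = M_PROB[field], U_PROB[field]
            score += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
        return score

    def decide(score):
        """Three-way decision of Eq. (2)."""
        if score >= T_MATCH:
            return "matched"
        if score <= T_UNMATCH:
            return "unmatched"
        return "needs review"

    gamma = {"last": True, "first": True, "dob": True, "ssn": False}
    s = fs_score(gamma)
    print(round(s, 2), decide(s))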

An RL application has to answer two basic questions: (1) what are the agreement patterns $\gamma$, and (2) how to estimate the distributions $P(\gamma \in \Gamma \mid M)$ and $P(\gamma \in \Gamma \mid U)$ from records. Because most of the data fields subject to comparison are textual in nature (e.g., names, addresses, SSN), there is no perfect definition of agreement that can capture the intuition of similarity. The simplest way is a binary classification: two fields either agree (1) if they coincide or disagree (0) otherwise. More sophisticated measures of agreement must account for factors such as partial agreement, error patterns, frequency of the strings and the different ways that fields such as names and addresses can be written down. Once the agreement space $\Gamma$ is defined, the (conditional) probability measures $P(\cdot \mid M)$ and $P(\cdot \mid U)$ can be estimated. Recent advances allow the measures to be estimated automatically with or without training data. What makes this challenging is that RL has to deal with records reflecting the attributes of entities at different times, during which the attributes may change. For example, among the commonly recorded personal attributes only date of birth is immutable; all other attributes may change at different rates. The first name attribute is more stable than the last name (especially for women), which in turn is more stable than the address attribute.

Besides the FS model, other models such as Bayesian networks or deterministic rule-based scoring are also used for matching records. Experiments show that, despite their differences, when tuned carefully the matching models can deliver scores that reflect basic human intuition for linkage. For example, Joffe et al. [13] benchmark deterministic, probabilistic and fuzzy methods of matching with algorithmic parameter fine tuning. Testing on a sample of 20,000 record pairs, they find that the methods have precision in the range [88.7%, 95.6%] and recall in the range [88.7%, 98.5%]. Unexpectedly, they also find that deterministic matching algorithms with optimized parameters outperform the probabilistic method for record linkage. In [20], the authors survey 33 record linkage studies and find that the sensitivity of probabilistic record linkage ranged from 74% to 98%.

The major problem that any practical RL software has to solve is the highly demanding computational requirement. There are a couple of observations. First, as the number of pairs of records is the product of the numbers of records in each file, a straightforward application of pair matching to all of them would not be feasible even for files of moderate size. For the sake of simplicity, let us assume that the task is to match records in two files A and B of moderate size ($N \sim 10^6$). The time needed to match all the pairs ($N^2 \sim 10^{12}$) would exceed any reasonable expectation (for $N = 3 \times 10^6$ records, it would take 2.85 years to match all $9 \times 10^{12}$ pairs assuming a matching rate of $10^5$ pairs per second).


Figure 1: Two-stage record linkage

Second, suppose that among the 2N records in A and B there are m pairs that need to be linked. Those m pairs must be identified among the $N^2$ possible pairs. Even if the duplication rate (m/N) is significant, the probability of getting a matched pair on average is only $m/N^2$, which becomes very small as N increases (at $N = 3 \times 10^6$ and a 10% duplication rate, $m = 3 \times 10^5$, the probability of a matched pair is only $3.3 \times 10^{-8}$). The implication of these observations is that most of the matching, if applied to pairs without discrimination, is a waste of computational power.

Blocking refers to the family of techniques used to reduce the number of pairs that get matched. In an early application of RL, Newcombe and Kennedy [18] imposed the condition that a pair of records must have the same Soundex code for last or maiden names, or the same date or place of birth, before it is actually scored by a matching algorithm. Their basic two-stage architecture for RL (Fig. 1) is still at work today, even though more efficient blocking strategies and better scoring methods have been developed. Basically, in the first stage, an inexpensive filter is used to screen out pairs that are unlikely to be matched (duplicates). In the second stage, a more expensive scoring mechanism is applied to those pairs that passed the first screening.


Most blocking methods use more than one blocking criterion, each of which can be computed cheaply from individual records: $K(r) = [k_r^1, k_r^2, \ldots, k_r^\ell]$, where $r$ is a record in file A or B, $\ell$ is the number of blocking criteria, and $k_r^i$ is the key for criterion $i$ of record $r$. The RL is made in $\ell$ passes (one for each criterion). In pass $i$, the records in the files are sorted according to their keys. A pair of records $(r, r')$ passes through the filter if they have the same blocking key, $k_r^i = k_{r'}^i$. Thus, the complexity reduces to $O(N \log N)$ instead of $O(N^2)$. This RL architecture is implementable in the MapReduce framework of Apache Hadoop (available at http://hadoop.apache.org). In each pass, the Mapper takes as input a blocking criterion and a record file and emits for each record a pair consisting of the key and the record itself. The Reducer collects records with the same key and performs pairwise matching, which results in a score for each pair.

Following the information retrieval literature [2], the quality of an RL algorithm is often measured by two quantities: recall (aka sensitivity) and precision (aka positive predictive power). The recall measure is the ratio of the number of pairs linked by the algorithm that are truly matched to the number of truly matched pairs. The precision measure is the ratio of the number of pairs linked by the algorithm that are truly matched to the number of pairs linked by the algorithm.

Between the two stages of RL, the matching algorithm gets most of the attention of researchers, who seek to take advantage of new methods and information to improve the accuracy of matching. For example, Winkler [23] lists thirteen areas of current research, including methods for estimating error rates, using additional information to improve the matching of two files A and B, methods to use multiple addresses tracked over time to improve matching, and so on. Not listed among those topics is research on efficient blocking criteria, even though the efficiency of the entire RL process depends on it, especially when dealing with large data files. Winkler [23] provides a dramatic real-world example of RL for the 2000 Decennial Census. The data file contains about 300 million records. It is also known that about 600,000 pairs are truly matched. Winkler shows that 99.5% of the true matches are found in a subset of $10^{12}$ pairs determined by 11 blocking criteria, out of the total of $10^{17}$ pairs (300 million x 300 million). In other words, a reduction of workload by a factor of $10^5$ is achieved at the cost of missing 0.5% of the truly matched pairs.

In this paper, we focus on the blocking phase, and our goal is to apply a machine learning approach to learn efficient blocking criteria. We make the assumption that the matching algorithm is given and denote it by S (for the scorer).
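The multi-pass blocking just described can be sketched as follows (our own illustration; the field names, key lengths and toy records are hypothetical). Each pass groups records by the key of one blocking criterion, and only pairs within the same block become candidates for matching.

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record, criterion):
        """Concatenate the field prefixes named by the criterion into one key."""
        return "".join(record[field][:length] for field, length in criterion)

    def candidate_pairs(records, criteria):
        """Multi-pass blocking: a pair becomes a candidate if it shares a key in at least one pass."""
        candidates = set()
        for criterion in criteria:          # one pass per blocking criterion
            blocks = defaultdict(list)
            for rid, rec in records.items():
                blocks[blocking_key(rec, criterion)].append(rid)
            for ids in blocks.values():     # records with the same key are compared pairwise
                candidates.update(combinations(sorted(ids), 2))
        return candidates

    # Hypothetical example: two criteria, first 4 characters of last name, first 4 of first name.
    records = {
        1: {"last": "SMITH", "first": "JOHN"},
        2: {"last": "SMITHE", "first": "JON"},
        3: {"last": "BLACK", "first": "JANE"},
    }
    criteria = [[("last", 4)], [("first", 4)]]
    print(candidate_pairs(records, criteria))   # {(1, 2)}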

2 Identification of Blocking Criteria

2.1 String representation of records

The structure of actual file records can vary greatly, but for our purpose we use an abstract and generic representation of records. We assume that records are strings of fixed length and that the meaning of characters or segments of characters in the strings is described by a data dictionary.

Fld[O]   Name                        DOB
R1[O]    "Dr. John J. Smith, MD"     "01/23/1973"
R2[O]    "Ms Jane Black Esq, MHA"    "Apr 30, 1960"

Fld[P]   PRE   FIRST   MID   LAST    POST1   POST2   DD   MM   YYYY
Len      5     15      2     15      5       5       2    2    4
Offset   5     20      22    37      42      47      49   51   55
R1[P]    DR    JOHN    J     SMITH   MD              23   01   1973
R2[P]    MS    JANE          BLACK   ESQ     MHA     30   04   1960

Table 1: An example of data standardization and parsing.

In some cases, the string representation of records is obtained by converting each record field into a string of a suitable length, with space padding if necessary, and then concatenating those strings together. In most cases, however, it is necessary to apply preprocessing steps before the data fields can be converted into strings. Typically, that includes standardization (e.g., converting all characters to upper (or lower) case, standardizing abbreviations and spelling) and parsing (e.g., breaking a full personal name into first name, middle name and last name components). For example, standardization software developed for the US Census of Agriculture matching system [23] would first convert the various textual versions in which the name and title of Doctor Smith are written into the standardized version "DR John J Smith MD". This version then gets parsed into various pre-defined subfields such as [PRE = "DR"], [FIRST = "John"], [MID = "J"], [LAST = "Smith"], [POST1 = "MD"], [POST2 = ""], [BUS1 = ""] and [BUS2 = ""]. Similar procedures can be applied to standardize and parse other types of data: addresses, dates, phone numbers and even Social Security Numbers. An example is given in Table 1. The original (before standardization and parsing) field names and data of two records are given in the top three rows. The next three rows give the names of the sub-fields, their lengths and their offsets. The last two rows contain the data after standardization and parsing.

An important issue for RL, but inconsequential otherwise, is the decision on alignment when converting field data into strings. Given an alignment choice, typographical errors have a different impact depending on where they occur. For example, with left alignment, the insertion of a character at the beginning of a string of 10 characters would shift the positions of all of them from their correct ones; on the other hand, if the error occurs at the end, the positions of all characters remain correct. (A string comparison method for which alignment is irrelevant is the edit distance, which measures the difference between two strings by the minimal number of operations needed to transform one string into the other [15]; but the edit distance, determined by dynamic programming, is expensive to compute.) To avoid this problem, we propose a simple solution: use both left and right alignment to convert a data field to a string. In other words, for a field whose data can vary in length, its string representation contains two copies, one left justified and the other right justified.


For example, assuming that first names are allowed to have a maximum length of 15, "William" would be converted into a string of length 30 as follows:

    1             15              30
    +--------------+--------------+
    WILLIAM                 WILLIAM

Under this representation, intuitive concepts such as the "first character" or "the two last characters" of the first name can be identified easily as the first character and the 29th and 30th characters of the string. We would like to emphasize that this representation is used specifically for learning blocking criteria. A matching algorithm can use this or any other suitable representation of records.
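A minimal sketch of this double-alignment representation (our illustration; the maximum field length of 15 follows the example above):

    def left_right_justified(value, width=15, pad=" "):
        """Represent a variable-length field as a fixed string of 2*width characters:
        a left-justified copy followed by a right-justified copy."""
        value = value.upper()[:width]        # truncate if the value exceeds the field width
        return value.ljust(width, pad) + value.rjust(width, pad)

    s = left_right_justified("William")
    print(repr(s))          # 30 characters: left-justified copy, then right-justified copy
    print(s[0], s[28:30])   # "first character" -> 'W'; "two last characters" -> 'AM'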

2.2 Relationship between blocking criteria and matcher

From now on, we assume that records are strings of length n. For a record r, r[i] is the character at position i. This convention allows us to formally analyze blocking criteria identification.

Definition 1 A blocking criterion is a subset of index positions $K \subset \{1, 2, \ldots, n\}$. The key for criterion K of a record s is the string formed from the characters of s at the positions in K (denoted by s[K]).

Blocking criteria are cheap mechanisms to screen out pairs, but using a single criterion is often not satisfactory because the occurrence of a small error at the criterion positions leads to the pair not being matched and wrongly sent to the unmatched bin. For that reason, in practice several blocking criteria are used to complement each other. In the cited example of the 2000 Decennial Census, Winkler used 11 criteria to capture 99.5% of the matched pairs in 11 passes. The chance that a truly matched pair is wrongly prevented from matching is so low because that occurs only if the pair fails all 11 criteria. We call a set of blocking criteria that are intended to be used together a filter. Note that while the concept of blocking criteria is useful to describe the RL algorithm (how it runs), the concept of a filter is more relevant in the analysis of RL outcome and efficiency (how it works).

Definition 2 A record pair $\pi = (s_1, s_2)$ is said to pass through filter $F = \{K_i \mid 1 \leq i \leq m\}$ (notation $F(\pi) = 1$) if $s_1[K_i] = s_2[K_i]$ is true for some i.

To analyze the properties of a filter, we use the following notation. $\pi$ denotes a pair of records; $M(\pi) = 1$ if $\pi$ is a (truly) matched pair and $M(\pi) = 0$ if the pair is unmatched; $F(\pi) = 1$ if $\pi$ passes through the filter F and $F(\pi) = 0$ if $\pi$ fails the filter, i.e., is blocked by it; $S(\pi)$ is the numerical score given by matcher S, and $L(\pi) = 1$ if $S(\pi)$ is above the match threshold, $L(\pi) = 0$ otherwise. Without reference to a specific record pair, M, F and L are Boolean propositions, defined on the space of pairs, denoting the set of pairs that are matched, the set of pairs that pass through the filter, and the set of pairs that are linked, respectively. Their negations $\bar{M}$, $\bar{F}$ and $\bar{L}$ are defined accordingly.

Let us assume for a moment that we have a perfect matcher, i.e., the score for a truly matched pair is higher than the match threshold $T_m$ and the score for an unmatched pair is lower than the unmatch threshold. Referring to Fig. 1, given a perfect matcher, the precision measure is 100%. The recall of the RL system is the same as the recall of the filter, which is the fraction of the number of truly matched pairs that pass through the filter over the number of truly matched pairs:

$$r_F = \Pr(F \mid M) = \frac{|\{\pi \mid F(\pi) = 1 \wedge M(\pi) = 1\}|}{|\{\pi \mid M(\pi) = 1\}|} \qquad (3)$$

The purpose of using a filter is to screen out the unmatched pairs. Its efficiency can be measured by the rejection rate, calculated as the fraction of the number of (unmatched) pairs that fail the filter over the number of (unmatched) pairs. Notice that the qualification "unmatched" is not important, as the following approximation shows:

$$b_F = \Pr(\bar{F} \mid \bar{M}) = \frac{|\{\pi \mid F(\pi) = 0 \wedge M(\pi) = 0\}|}{|\{\pi \mid M(\pi) = 0\}|} \approx \frac{|\{\pi \mid F(\pi) = 0\}|}{N^2} = \Pr(\bar{F}) \qquad (4)$$

The approximation is possible because, on the space of pairs, the proportion of matched pairs $\Pr(M)$ is extremely small, always smaller than the reciprocal of the number of records ($1/N$), and $\Pr(\bar{M})$ is close to 1:

$$\Pr(\bar{F}) = \Pr(\bar{F} \mid M)\Pr(M) + \Pr(\bar{F} \mid \bar{M})\Pr(\bar{M}) \approx \Pr(\bar{F} \mid \bar{M})\Pr(\bar{M}) \approx \Pr(\bar{F} \mid \bar{M})$$

We also use the throughput rate, the probability of passing through the filter ($\Pr(F) = 1 -$ rejection rate). For a filter, the higher the recall ($r_F$) the better, and the higher the rejection rate ($b_F$) the better.

Normally, however, matcher S is not perfect. Suppose the recall and precision of S, evaluated independently without the filter on the entire space of pairs, are $r_S = \Pr_S(L \mid M)$ and $p_S = \Pr_S(M \mid L)$. In a two-stage F-S system, a pair is linked only if it passes through filter F and is linked by matcher S. So $F \wedge L$ is the set of pairs of records linked by the system. Thus, for the entire RL system,

$$\text{recall}_{RL} = \Pr(F \wedge L \mid M) = \frac{\Pr(M \wedge F \wedge L)}{\Pr(M)}, \qquad \text{precision}_{RL} = \Pr(M \mid F, L) = \frac{\Pr(M \wedge F \wedge L)}{\Pr(F \wedge L)} \qquad (5)$$

The point of a filter is to (cheaply) screen out unmatched pairs. In terms of probability, for a pair of records, the conditional probability of being matched given that it passes the filter should be higher than the unconditional probability, i.e., $\Pr(M(\pi) = 1 \mid F(\pi) = 1) > \Pr(M(\pi) = 1)$ and $\Pr(M(\pi) = 0 \mid F(\pi) = 0) > \Pr(M(\pi) = 0)$. Concerning the matcher, obviously the knowledge of the match score $S(\pi)$ changes the probability that the pair is matched.

The difference is that the score, being the result of a more expensive and more reliable mechanism than the filter, produces much more reliable information about the match status of a pair. So, given the score, our judgment about the true match status can be made without regard to the filter result $F(\pi)$. In fact, that is how the decision on whether to link a pair is made (based exclusively on the value of the score, as seen in Fig. 1). We conclude that M and F are independent given L (notation $M \perp F \mid L$). In terms of probability,

$$\Pr(M \mid L, F) = \Pr(M \mid L) \qquad (6)$$

Under conditional independence,

$$\Pr(M \wedge F \wedge L) = \Pr(L)\Pr(F \mid L)\Pr(M \mid F, L) = \Pr(L)\Pr(F \mid L)\Pr(M \mid L) \qquad (7)$$

Thus, the precision measure $\Pr(M \mid L)$ does not depend on the filter; only the recall does:

$$\text{recall}_{RL} = \frac{\Pr(L)\Pr(F \mid L)\Pr(M \mid L)}{\Pr(M)} \qquad (8)$$

Thus, given a matcher S, the higher the conditional probability $\Pr(F \mid L)$ (of a pair passing through the filter given that it is linked by the matcher), the higher the recall of the RL system. For example, the recall is maximized when the filter does not exclude any pair that would be scored high by the matcher ($\Pr(F \mid L) = 1$). Combining this observation with (4), we see that the filter identification problem has two conflicting criteria:

$$\begin{cases} \text{maximize} & \Pr(F \mid L) \\ \text{minimize} & \Pr(F) \end{cases} \qquad (9)$$

On the one hand, allowing through all the pairs that are scored high by the matcher may require more blocking criteria (passes) or more relaxed criteria. On the other hand, from the efficiency point of view, one needs as low a throughput rate as possible, which normally implies a few blocking criteria that are cheap to compute. To avoid the dilemma, it is necessary to fix one condition as a goal and use the other as a constraint. There are two ways to transform (9) into an optimization problem:

$$\text{(a)} \quad \begin{cases} \text{maximize} & \Pr(F \mid L) \\ \text{s.t.} & \Pr(F) \leq \alpha \end{cases} \qquad \text{or} \qquad \text{(b)} \quad \begin{cases} \text{minimize} & \Pr(F) \\ \text{s.t.} & \Pr(F \mid L) \geq \beta \end{cases} \qquad (10)$$
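Both quantities in (9) and (10) can be estimated from a sample of comparison vectors labeled by the matcher. The following is a minimal sketch of such an estimate (our illustration; here a filter is represented as a set of criteria, each criterion being a set of agreeing positions, and the sample is a toy one):

    def passes(filter_criteria, v):
        """F(pi) = 1 iff the comparison vector v agrees on every position of some criterion."""
        return any(all(v[i] for i in criterion) for criterion in filter_criteria)

    def filter_stats(filter_criteria, sample):
        """Estimate Pr(F) and Pr(F | L) from labeled comparison vectors (v, link_label)."""
        n_pass = sum(passes(filter_criteria, v) for v, _ in sample)
        linked = [(v, lab) for v, lab in sample if lab == 1]
        n_pass_linked = sum(passes(filter_criteria, v) for v, _ in linked)
        return n_pass / len(sample), n_pass_linked / max(len(linked), 1)

    # Toy sample: comparison vectors over 4 positions with matcher labels (1 = link).
    sample = [((1, 1, 0, 0), 1), ((1, 1, 1, 1), 1), ((0, 0, 1, 0), 0), ((0, 0, 0, 0), 0)]
    F = [{0, 1}, {2, 3}]            # two criteria: positions {0,1} or positions {2,3}
    throughput, recall_wrt_matcher = filter_stats(F, sample)
    print(throughput, recall_wrt_matcher)   # 0.5 and 1.0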

3 Blocking filter as Disjunctive Normal Form

In this section, we formulate the task of learning blocking filters as a DNF-learning problem. DNF learning is a fundamental learning problem introduced by Valiant [21] that has been investigated extensively in the computer science literature [14, 1], with applications in different fields from genomics [24] to cryptography [22]. Let us review the basic definitions of this theory.

There is a set of binary variables $\{v_1, v_2, \ldots, v_n\}$. A literal is either a variable $v_i$ or its negation $\bar{v}_i$. An assignment is a mapping $a: \{v_1, v_2, \ldots, v_n\} \to \{0, 1\}$ and is extended to the negations by $a(\bar{v}_i) = 1 - a(v_i)$. A term or minterm is a conjunction (denoted by $\cdot$) of literals. A DNF formula is a disjunction (notation: $+$) of terms. A term is also viewed as a set of literals and a DNF as a set of terms. Assignment a satisfies term t (notation: $t(a) = 1$) if $\forall \ell \in t$, $a(\ell) = 1$. Assignment a satisfies DNF f (notation: $f(a) = 1$) if $\exists t \in f$, $t(a) = 1$. A term t is an implicant of DNF f if $t \Rightarrow f$, i.e., for any assignment a, if $t(a) = 1$ then $f(a) = 1$. An implicant t of f is prime if no proper subset of t is also an implicant of f. If the terms of a DNF formula are all different and are prime implicants, then the DNF formula is called a reduced formula or a sum of irredundant prime implicants. If no literal of a DNF appears negated, the DNF is called monotone.

The DNF learning task involves an (unknown) formula f, called the target formula, and a training data set, which is a set of labeled assignments $\{\langle a_1, b_1\rangle, \langle a_2, b_2\rangle, \ldots, \langle a_k, b_k\rangle\}$ where the label is the value of the target formula under the assignment, i.e., $b_i = f(a_i)$. The goal of learning is to construct a DNF formula $\hat{f}$ that approximates f. Valiant [21] formalizes the notion of $\epsilon$-approximation in the Probably Approximately Correct (PAC) learning model as follows: $\hat{f}$ is said to $\epsilon$-approximate f if the probability that an assignment satisfies $\hat{f}$ or f but not both is at most $\epsilon$. PAC learning in the general case is hard: it requires exponential time in the worst case, i.e., the formulas are not learnable in polynomial time. A more positive result by Aizenstein and Pitt [1] shows an algorithm that can learn most DNF formulas in polynomial time under the assumption of uniformity. By translating the blocking filter learning problem into a DNF learning problem, we can leverage the results and algorithms developed for the latter toward our goal.

In Section 2.1 we made the assumption that records are strings of length n. For each of the positions there is a binary variable $v_i$. To each pair of records we associate a binary vector V of length n given by character-by-character comparison. Given a pair of records $\pi = (s_1, s_2)$ and for $1 \leq i \leq n$,

$$V_\pi[i] = \begin{cases} 1 & \text{if } s_1[i] = s_2[i] \text{ and } s_1[i] \text{ is not a blank} \\ 0 & \text{if } s_1[i] \neq s_2[i] \text{ or } s_1[i], s_2[i] \text{ are blank} \end{cases} \qquad (11)$$

Each comparison vector is an assignment. To each blocking criterion $K = \{k_1, k_2, \ldots, k_m\}$, where the $k_i$ are position indices, we associate the term $t_K = v_{k_1} \cdot v_{k_2} \cdots v_{k_m}$. To each filter, a collection of blocking criteria $F = \{K_1, K_2, \ldots, K_h\}$, we associate the DNF $f_F = t_{K_1} + t_{K_2} + \ldots + t_{K_h}$. Because $t_K$ involves only positive literals, the DNF $f_F$ induced by a filter is monotone. Thus, a pair of records can pass through a filter if and only if its comparison vector satisfies the DNF induced by the filter. The matcher S can also be associated with a propositional formula $f_S$ of the variables $v_i$ via the comparison vector $V_\pi$ as follows:

$$f_S(V_\pi) = L(\pi) \qquad (12)$$

where $L(\pi)$ is the proposition $(S(\pi) \geq T_m)$. In other words, even though S is not a propositional formula, its information is encapsulated by a formula via comparison vectors.

The problem of learning a blocking filter F given matcher S is thus translated into the PAC problem of learning a DNF formula $f_F$ that approximates the formula $f_S$. The training data (positive and negative examples) are provided by taking a random sample of the space of possible record pairs. Each pair $\pi$ is made into a training data point $\langle V_\pi, L(\pi)\rangle$ where $V_\pi$ is the comparison vector and $L(\pi)$ is the link label given by matcher S.
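A minimal sketch of this translation (our illustration): a record pair is turned into a comparison vector as in Eq. (11) and labeled by the matcher as in Eq. (12). The toy matcher below is only a stand-in for the given scorer S.

    def comparison_vector(s1, s2):
        """Eq. (11): position-wise agreement, with blanks never counted as agreement."""
        assert len(s1) == len(s2), "records are assumed to be strings of the same fixed length"
        return tuple(int(a == b and a != " ") for a, b in zip(s1, s2))

    def training_example(pair, matcher, t_match):
        """One labeled assignment <V_pi, L(pi)>: the label comes from the matcher, not from truth."""
        s1, s2 = pair
        v = comparison_vector(s1, s2)
        label = int(matcher(s1, s2) >= t_match)
        return v, label

    # Toy matcher: fraction of agreeing, non-blank positions (an assumption for illustration only).
    def toy_matcher(s1, s2):
        v = comparison_vector(s1, s2)
        return sum(v) / len(v)

    pair = ("JOHN  SMITH", "JON   SMITH")
    print(training_example(pair, toy_matcher, t_match=0.6))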

4 Algorithm

The PAC learning problem has been investigated extensively in the literature [21, 14, 4, 19]. It is known that three factors are crucial to the PAC learning problem: consistency, monotonicity and uniformity.

Consistency basically means that the target function is a function on binary vectors. Strictly speaking, the target $f_S$ may be inconsistent if matcher S classifies two pairs $(\pi_1, \pi_2)$ that have the same comparison vectors ($V_{\pi_1} = V_{\pi_2}$) in different ways ($L(\pi_1) \neq L(\pi_2)$). However, we can use a simple technique to massage the training data and remove the inconsistency, as follows. Suppose $\{\langle V, L_i\rangle\}_{i=1}^m$ is the set of training examples that have the same comparison vector V but whose labels are not all the same. We can remove the inconsistency by substituting for each $L_i$ the majority label $\hat{L}$.

Monotonicity requires that flipping a bit from 0 to 1 does not switch the value of the target function from 1 to 0. In our case, this means it is impossible that, as the agreement between two records increases, the matching score becomes lower. For our problem, we can reasonably assume that $f_S$ satisfies consistency and monotonicity. Finally, uniformity is a property of the distribution on assignments. In our case, uniformity is not satisfied because the distribution on the comparison vector space derived from record pairs is uneven. In fact, real data show that the distribution is very unbalanced (see the discussion in Section 5, Fig. 3). Under an unrestricted distribution, Kearns et al. [14] show that DNF formulas are not learnable in polynomial time. We will need some simplifications to make learning feasible. Our algorithm uses ideas developed by Jackson et al. [12], with suitable modifications to address the violation of the uniformity assumption in our problem.

We consider problem (13), reformulated from (b) in (10), where $\alpha$ is interpreted as a pre-defined acceptable level of recall of the filter with respect to the matcher. A filter is acceptable if its recall is at least $\alpha$. Among acceptable filters, the lower the throughput rate the better (equivalently, the higher the rejection rate the better).

$$\begin{cases} \text{minimize} & \Pr(F) \\ \text{s.t.} & \Pr(F \mid L) \geq \alpha \end{cases} \qquad (13)$$

This optimization problem is solved in two steps. In the first step, the collection of acceptable filters is generated. In the second step, an acceptable filter is selected based on its throughput rate. The process of learning blocking criteria is described in Fig. 2. A sample of pairs is made from the product space of records from files A and B.

Matcher S is used to label the pairs, which are then converted into labeled comparison vectors. A DNF learning algorithm is used to determine the set of acceptable filters. Optimal filters are selected among the acceptable filters based on their throughput rates.

Figure 2: Process of learning blocking criteria
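One preparation step mentioned above, removing label inconsistency by giving all examples that share a comparison vector the majority label, can be sketched as follows (our illustration):

    from collections import Counter, defaultdict

    def remove_inconsistency(examples):
        """Replace the labels of examples sharing a comparison vector by the majority label."""
        by_vector = defaultdict(list)
        for v, label in examples:
            by_vector[v].append(label)
        cleaned = []
        for v, labels in by_vector.items():
            majority = Counter(labels).most_common(1)[0][0]
            cleaned.extend((v, majority) for _ in labels)   # keep multiplicity, unify the label
        return cleaned

    examples = [((1, 0, 1), 1), ((1, 0, 1), 1), ((1, 0, 1), 0), ((0, 1, 1), 0)]
    print(remove_inconsistency(examples))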

4.1 Generating acceptable filters

A nice property of (13) is that the generation of acceptable filters needs only the positive examples, which are a small fraction of the training data. Only in the throughput-rate minimization phase is the whole training data set needed. This turns out to be a major advantage: the computationally intensive filter generation works with data small enough to fit into internal memory. We denote the set of training examples by $X = \{\langle V_i, L_i\rangle\}_{i=1}^n$, and the sets of positive and negative examples by $X^+ = \{\langle V_i, L_i\rangle \mid L_i = 1\}$ and $X^- = \{\langle V_i, L_i\rangle \mid L_i = 0\}$ respectively. For a formula $\varphi$, its (empirical) probability calculated on the training data is the proportion of examples that satisfy $\varphi$ over the total number of examples:

$$\Pr(\varphi, X) = \frac{|\{\langle V_i, L_i\rangle \mid \varphi(V_i) = 1\}|}{|X|} \qquad (14)$$

Finding all DNFs with probability higher than a threshold is a combinatorial search problem in an exponential search space.

To make it feasible, we restrict our search to DNFs that have at most m terms, where each term has at most k variables. Denote the set of such DNFs by $\text{DNF}(m, k, p_{1:m})$. The main algorithm M1 finds $\text{DNF}(m, k, p_{1:m})$. M1 calls the routine M0, which finds the set of terms of length k that have probability at least p. M0 exploits the fact that the probability of any subterm is greater than or equal to the probability of the whole term. For a set of formulas M, the set of variables present in M is denoted by Var(M).

Algorithm M0: Find terms of length k having probability at least p.
Input: The set of positive examples $X^+$, $0 < p < 1$ and a positive integer k.
Output: $M(k, p, X^+)$, the set of terms of length k having probability at least p on the set of positive examples.

1. Initiate $M(1, p, X^+)$ with the set of variables whose probability, calculated by formula (14), is at least p.
2. If $i \leq k$, initiate $M(i, p, X^+) = \emptyset$. For each term $t \in M(i-1, p, X^+)$, form a new term $t' = t \cdot v$ by extending t with a variable v which is not in t but is in the collection of variables in $M(i-1, p, X^+)$, i.e., $v \in \text{Var}(M(i-1, p, X^+)) - \text{Var}(t)$. Compute $\Pr(t', X^+)$. If $\Pr(t', X^+) \geq p$, add $t'$ to $M(i, p, X^+)$.
3. If $i < k$, increase i and go to 2; else stop and output $M(k, p, X^+)$.

Algorithm M1: Find the set of acceptable DNFs.
Input: The set of positive examples $X^+$; m, the number of terms in a DNF; an integer vector $k_{1:m}$ where $k_i$ is the length of the i-th term; a real vector $p_{1:m}$ where $p_i \in (0, 1)$ is the threshold for the conditional probability of term $t_i$ given the falsification of all previous terms: $\Pr(t_i \mid \overline{t_1 + t_2 + \ldots + t_{i-1}}) \geq p_i$.
Output: $\text{DNF}(m, k_{1:m}, p_{1:m})$, the set of DNFs that have m terms.

1. $\text{DNF}(1, k_1, p_1)$ is initiated with the set $M(k_1, p_1, X^+)$ of terms determined by calling routine M0.
2. $\text{DNF}(i, k_{1:i}, p_{1:i})$ is the set of DNF formulas of length i.
   2.1 For each $F \in \text{DNF}(i, k_{1:i}, p_{1:i})$, find the set of (i+1)-th terms by calling M0: $M(k_{i+1}, p_{i+1}, X^+_{i+1})$, where $X^+_{i+1} = \{x \in X^+ \mid F(x) = 0\}$ is the subset of $X^+$ on which F evaluates to false.
   2.2 Expand F with each term in $M(k_{i+1}, p_{i+1}, X^+_{i+1})$ and add the newly created DNFs to $\text{DNF}(i+1, k_{1:i+1}, p_{1:i+1})$.

3. Increase i by 1. If $i < m$, go to 2; else output $\text{DNF}(m, k_{1:m}, p_{1:m})$.

The algorithm is based on a chain formula for the probability of a DNF:

$$\Pr(t_1 + t_2 + \ldots + t_m) = \Pr(t_1) + \Pr(t_2 \mid \bar{t}_1)\Pr(\bar{t}_1) + \ldots + \Pr(t_m \mid \overline{t_1 + \ldots + t_{m-1}})\Pr(\overline{t_1 + \ldots + t_{m-1}}) \qquad (15)$$

For example, with three terms (m = 3) and $p_i = 0.7$, $\Pr(t_1 + t_2 + t_3) = 0.7 + 0.7 \times 0.3 + 0.7 \times 0.09 = 0.973$. As noted, the algorithms M0 and M1 only need the positive examples, which for RL problems of moderate size can be loaded into memory. To deal with very large data files ($\sim 10^8$ or more records), algorithms M0 and M1 can be implemented in the MapReduce framework. The training data set $X^+$ can be divided and distributed. M1 calls M0 for the bulk of its work. For M0, a term and its frequency constitute a key-value pair. M0 is implemented as k iterations of MapReduce. In each iteration, the Mapper produces a list of pairs $\langle t, c\rangle$ where t is a term of length i and c is its local count. The Reducer consolidates all the pairs with the same key to produce a pair $\langle t, C\rangle$ where C is the global count, the sum of the local counts for the term. The terms whose global count is lower than the threshold $p \times N$, where N is the size of $X^+$, are removed. The set of remaining variables is used to form the pool of terms of length (i + 1). In the next iteration, this new list of terms is transformed by the Mapper into a list of term-frequency pairs.
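A minimal in-memory sketch of M0 and M1 (single machine, no MapReduce; the data structures and the toy input are our own simplification of the pseudocode above):

    def prob(term, examples):
        """Empirical probability (Eq. 14) that a term (set of positions) is satisfied."""
        return sum(all(v[i] for i in term) for v in examples) / len(examples)

    def m0(positive, k, p):
        """M0: grow terms one position at a time, keeping only terms with probability >= p."""
        frontier = [frozenset([i]) for i in range(len(positive[0])) if prob({i}, positive) >= p]
        for _ in range(k - 1):
            pool = set().union(*frontier) if frontier else set()
            frontier = list({t | {v} for t in frontier for v in pool - t
                             if prob(t | {v}, positive) >= p})
        return frontier

    def m1(positive, lengths, thresholds):
        """M1: build DNFs term by term; term i must cover a fraction p_i of the positive
        examples missed by the previous terms (the conditional probabilities of Eq. 15)."""
        filters = [([t], [v for v in positive if not all(v[i] for i in t)])
                   for t in m0(positive, lengths[0], thresholds[0])]
        for k, p in zip(lengths[1:], thresholds[1:]):
            grown = []
            for terms, missed in filters:
                if not missed:                  # this DNF already covers every positive example
                    grown.append((terms, missed))
                    continue
                for t in m0(missed, k, p):
                    still = [v for v in missed if not all(v[i] for i in t)]
                    grown.append((terms + [t], still))
            filters = grown
        return [terms for terms, _ in filters]

    # Toy run: 4 positive comparison vectors over 4 positions, two terms of length 2 each.
    positives = [(1, 1, 0, 1), (1, 1, 1, 0), (0, 1, 1, 1), (1, 1, 1, 1)]
    for dnf in m1(positives, lengths=[2, 2], thresholds=[0.7, 0.5]):
        print([sorted(t) for t in dnf])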

4.2 Minimizing throughput rate

The minimization of the throughput rate among acceptable filters is straightforward. Essentially, $\Pr(F)$ is estimated by the proportion of data points in a random sample that are satisfied by F. Specifically, if X is the training data and A is the set of acceptable DNFs determined in the previous step, the optimal filter is selected by

$$F^* = \arg\min_{F \in A} \Pr(F, X) \qquad (16)$$
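A sketch of this selection step (our illustration; the acceptable filters and sample vectors are toy values, and the passes helper re-implements F(pi) from Definition 2 on comparison vectors):

    def passes(filter_criteria, v):
        """F(pi) = 1 iff v satisfies some criterion (a set of positions) of the filter."""
        return any(all(v[i] for i in c) for c in filter_criteria)

    def best_filter(acceptable, vectors):
        """Eq. (16): among acceptable filters, pick the one with minimal empirical throughput."""
        return min(acceptable, key=lambda F: sum(passes(F, v) for v in vectors) / len(vectors))

    acceptable = [[{0, 1}], [{2}], [{0}, {3}]]                     # toy acceptable filters
    vectors = [(1, 1, 0, 0), (1, 0, 1, 0), (0, 0, 1, 1), (0, 0, 0, 0)]
    print(best_filter(acceptable, vectors))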

5 Experiments

This technique of learning blocking criteria was applied to duplicate detection on a patient master file of 2.18 million records. Referring to Fig. 1, duplicate detection is a record linkage task in which files A and B are the same. A minor change is necessary to avoid the trivial linkage of a record with itself: the space of possible pairs is $N(N-1)$ rather than $N^2$. Records have ten demographic data fields: LastName, FirstName, MiddleName, BirthDate, Gender, SSN, HomeAddress, City, Zip and PrimaryPhone. After a preliminary analysis to remove obvious errors, we consider in total 78 character positions across those fields. In other words, records are "standardized" to strings of length 78. Data cleaning experts provided an ex-ante estimate of the duplication rate (10%) and a rule-based record matcher S.

An initial safe but inefficient filter used to find positive examples consists of four criteria. The first criterion uses the first four characters of LastName. The second criterion uses the first four characters of FirstName. The third criterion consists of the unit digit of the month, the unit digit of the day, and the decade and year digits of BirthDate in the format mm/dd/yyyy; for example, the key for "10/05/1978" is "0578". The last criterion uses the last four digits of the SSN.
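A sketch of the four blocking keys of this initial filter (our illustration; the field names and the record layout are assumptions, since the actual master-file schema is not reproduced here):

    def initial_filter_keys(rec):
        """The four blocking keys of the initial 'safe' filter described above."""
        mm, dd, yyyy = rec["birthdate"].split("/")        # assumed mm/dd/yyyy format
        return [
            rec["lastname"][:4],                          # first four characters of LastName
            rec["firstname"][:4],                         # first four characters of FirstName
            mm[1] + dd[1] + yyyy[2] + yyyy[3],            # unit digits of month and day, decade and year digits
            rec["ssn"][-4:],                              # last four digits of SSN
        ]

    rec = {"lastname": "SMITH", "firstname": "JOHN", "birthdate": "10/05/1978", "ssn": "123456789"}
    print(initial_filter_keys(rec))   # ['SMIT', 'JOHN', '0578', '6789']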



The purpose of this filter is to capture as many potential duplicates as possible. Using this filter, the system finds 418,037 pairs of records that it recommends to be linked. This number includes pairs from clusters of up to tens of records that belong to the same entity (patient). The comparison vectors made for the 418,037 pairs of records have 55,308 distinct elements.

Figure 3: Frequency of unique comparison vectors

In Fig. 3, we plot the frequency chart of those unique vectors ranked by their frequencies (higher frequency, higher rank). The x-axis represents the rank and the y-axis the natural logarithm of the frequencies. As the figure shows, about 37,000 comparison vectors are unique while about 18,000 have multiple occurrences. The extreme case is a vector that corresponds to about 21,000 pairs. It turns out that this is due to a cluster of a few hundred patients from a prison: the data for those patients are replaced by generic values, and the address and phone number listed are not personal but belong to the institution. It is clear that, even after excluding that cluster, the distribution on the comparison vector space violates the uniformity assumption used in PAC learning (see the remark in Section 4).

We have tried different configurations of the parameters: the number of terms in filters (m), the length of each term ($k_i$) and the probability threshold for each term ($p_i$). In one experiment, m = 2, $k_1 = k_2 = 6$ and $p_1 = p_2 = 0.95$. With this configuration, the filters generated have at least 0.9975 recall (see Eq. (15)). Among those, about 211,000 have recall higher than 0.9982. For this group, sampling is used to estimate throughput rates. Among the acceptable filters, the lower the throughput the better. To compare the relative efficiency of a filter, we compute the ratio of its frequency relative to that of the filter that has minimal throughput. Figure 4 shows the chart of the log throughput ratio for the 211,000 acceptable filters. On the x-axis is the rank (by frequency) of a filter; on the y-axis is the log of the frequency ratio.



Figure 4: Throughputs of acceptable filters

The frequency ratio for the best (most efficient) filter is, by definition, 1 (log frequency ratio 0). At the other extreme, the worst filter has a log frequency ratio of 7, which means that its throughput is about 1,000 times higher than that of the best one. Most filters (about 170,000 of them) are found in the relatively flat middle segment of the curve, with log frequency ratio between 2.0 and 4.5 (frequency ratio between 8 and 100). Suppose that a filter is considered "good" if it has a 90% chance of beating a randomly selected acceptable filter. Then, even if filters obtained by educated guess are good, according to the data plotted in Figure 4 they can still be some 7 times less efficient than the best one determined by this algorithmic method.

While it is not surprising that a learned filter delivers very high recall when applied to the database from which it has been learned, we want to find out its generalizability, i.e., how a filter learned on one database performs on another database. We were given an opportunity to probe that question. In addition to the master file (A) of 2.18 million records that is used for learning the blocking criteria, we were allowed access to two other master files, named B and C for convenience. B has 2.07 million records and C has 4.31 million records. The files come from three different organizations, but they share the basic demographic fields that are used for duplicate detection. The optimal blocking filter $F_A^*$, learned on A as described above, is used for duplicate detection in files B and C. We compare its performance against the educated-guess filters provided by data cleaning experts. In both cases, $F_A^*$ captures about 97%-98% of the pairs that are allowed to pass through the expert filters and then linked by matcher S. However, run time using $F_A^*$ is reduced by 8-10 times.


We speculate that this generalizability is possible because, although the master files are different, the corresponding distributions on the comparison vector spaces are less sensitive to differences in the textual data. The optimal filter is actually learned on the comparison vector space and is therefore less sensitive to the textual data.

6 Discussion and related works

Record linkage, an essential part of data cleaning, is commonly recognized as the most expensive work in data warehousing (Winkler [23] cites estimates that up to 90% of the work in warehousing data from multiple sources is spent on removing duplicates). It has been investigated extensively (see [23] and the references therein). Because it would be wasteful, and from a pragmatic point of view even impossible, to match every possible pair, most practical RL systems employ a Blocking-Matching architecture. Most recent research, however, is devoted to improving the accuracy of the matcher.

According to [3, 9], four types of blocking are found in the literature on RL. The standard blocking method computes blocking keys directly from data fields. The sorted neighborhood method [10] sorts records by a sorting key and then moves a window of a fixed size w along the database; records that appear in the same window are scored. The bigram index method breaks a key into a list of two-character substrings (bigrams) and, from that, all possible sub-lists of length above a threshold; those sub-lists are indexed and used to retrieve records for scoring. The goal of the exercise is to tolerate minor errors in the key. Finally, canopy clustering [16] uses a cheap distance measure to divide records into overlapping subsets called canopies. This is also the basic idea behind any blocking method; the distinction of the canopy technique, however, is in the procedure for creating canopies. Assume a distance measure $\mu$ and two thresholds $0 < r_{tight} < r_{loose}$. The first canopy is created by picking a data point (record) at random as a center and calculating the distances from it to the other points. Those that lie inside the circle of radius $r_{loose}$ form the first canopy, and those that lie inside the circle of radius $r_{tight}$ are excluded from the pool of points. The construction of the next canopy begins with picking a center at random from the remaining points, and the process continues until all points are canopied. The cheap distance normally used is the TF-IDF (term frequency / inverted document frequency) cosine distance.

The methods are compared in an experimental study [3] based on two metrics: the rejection rate and the filter recall, defined as the ratio of the number of duplicate pairs passing through the filter to the number of duplicate pairs ($\Pr(F \mid M)$). F-scores (the harmonic mean of the two metrics) are calculated. The experiment is run on small databases of generated data (from 1,000 to 10,000 records). The data showed that the worst performance in terms of F-score (0.87) belongs to the canopy method with "wrong" thresholds. With the "right" thresholds, the canopy technique can achieve superior performance (0.97).


In contrast, standard blocking and sorted neighborhood methods are less sensitive to their parameter (the length of the blocking key) and achieve scores between 0.90 and 0.95.

More recently, researchers have used machine learning approaches to learn blocking criteria, providing alternatives to the traditional method of selecting blocking criteria manually. In [17], Michelson and Knoblock report that the learned blocking scheme (filter) is better than blocking schemes produced by non-domain experts and is comparable to those made by domain experts. The main limitation, the authors note, is that the quality of the blocking scheme learned depends on the amount of training data provided. In their experiments, the amounts of (manually) labeled pairs provided for training range from 10% to 50% of the pairs. This requirement may be feasible for files of relatively small size but is unrealistic for files of moderate size ($\sim 10^5$ records) or more. Indeed, all their experiments are on files of 5,000 records or less. Recognizing that labeled training data are hard to come by, Cao et al. [6] propose a method of learning a blocking filter that also makes use of unlabeled data. In their proposal, they use unlabeled data in combination with (manually) labeled data in the maximization of the rejection rate, and use the labeled data for controlling recall to an acceptable level.

Our model makes two contributions to the learning of blocking filters. First of all, it solves a bottleneck problem highlighted in [17, 6], namely, the lack of sufficient labeled data. The key is to consider the blocking filter problem not in isolation but in the context of a matcher. We show (Section 2.2, Eq. (8)) that for an RL system of the Blocking-Matching architecture, given a matcher S, a blocking filter can affect the recall of the RL system only via its sensitivity with respect to the matcher ($\Pr(F \mid L)$). The sensitivity of the filter with respect to the true match status ($\Pr(F \mid M)$) is not directly relevant to the recall of the RL system. Consequently, the relevant labels in learning blocking criteria are the labels assigned to pairs by the matcher (L: link/nonlink), not the labels assigned by human experts (M: match/unmatch). Thus, using the given matcher, one can generate an unlimited amount of labeled data for learning blocking criteria. It is necessary to note that this conclusion applies not only to our method of learning but to any learning method, including those proposed in [17, 6], as long as the RL system has the two-stage architecture.

The second contribution is the formulation of learning blocking criteria as a DNF learning problem. Even though one of the key assumptions (the uniform distribution of examples) is not satisfied, the results of DNF learning theory are useful in designing algorithms and shed light on what is possible and what is not. For example, the negative result of Kearns et al. [14] implies that learning ideal filters with recall and rejection rate arbitrarily close to one is theoretically impossible. Thus, the claim that "our blocking scheme fulfills the two main goals of blocking: maximizing both PC [Pairs Completeness or recall] and RR [Reduction Ratio or rejection rate]" as stated in [17] is theoretically impossible. Another example is the heuristic advice on selecting blocking criteria that "the blocking strategies for each pass should be independent to the maximum extent possible" [11] (p. 125). Actually, it follows from Eq. (15) that the terms in a DNF are not equal in terms (pun intended) of their contribution to the recall of the RL system.

The first term is the most important, and subsequent terms make progressively marginal contributions. They should not be independent but complementary. Translated into advice, it should read: "focus your attention on the blocking strategy used for the first pass; at subsequent passes try to capture those pairs that were rejected in the previous passes".

Although we only consider the blocking that is obtained from character-by-character comparison (the standard blocking, according to the classification by Baxter et al. [3]), DNF learning also applies to more general blocking schemes. In [17], a blocking scheme is defined as a couple of a blocking attribute and a comparison method. Examples of blocking schemes are the couples {"first name", "exact match"} and {"address", "have the same first three characters"}. In this case, Eq. (11) cannot be used to define the comparison vector; instead the following rules must be used: $V_\pi[1] = 1$ if the first names in the two records of $\pi$ are the same and $V_\pi[1] = 0$ otherwise; $V_\pi[2] = 1$ if the first three characters of the standardized addresses in the two records are the same and $V_\pi[2] = 0$ otherwise (a small sketch is given at the end of this section).

This work still has some limitations, which are the subject of future work. In DNF learning theory, most positive results on learning in polynomial time use the uniform distribution assumption, which does not hold in our problem. We made several simplifying assumptions to limit the search space, such as fixing the number of terms in DNFs and the number of literals in terms. This restriction could exclude optimal DNFs from being considered. Second, we used standard sampling to generate training data from the product space of records (see Fig. 2). Because the probability of positive (match) pairs is very low, the samples have to be very large to capture a sufficient number of positive examples. We had to use some heuristics to capture positive examples separately. The effects of these heuristics should be systematically analyzed.
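A minimal sketch of such a generalized comparison vector, where each position corresponds to an (attribute, comparison method) pair rather than a character position (our illustration; the attributes and comparison functions are hypothetical):

    def general_comparison_vector(r1, r2, schemes):
        """Each scheme is an (attribute, comparison function); the vector records which schemes agree."""
        return tuple(int(compare(r1[attr], r2[attr])) for attr, compare in schemes)

    # Hypothetical blocking schemes, following the two examples above.
    schemes = [
        ("first", lambda a, b: a == b),                  # {"first name", "exact match"}
        ("address", lambda a, b: a[:3] == b[:3]),        # {"address", "same first three characters"}
    ]
    r1 = {"first": "JANE", "address": "12 MAIN ST"}
    r2 = {"first": "JANE", "address": "12 MAPLE AVE"}
    print(general_comparison_vector(r1, r2, schemes))    # (1, 1)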

7 Conclusion

In this paper, we propose a new method to automatically learn efficient blocking filters for record linkage. Our method solves the bottleneck problem, highlighted in recent efforts to apply a machine learning approach, of lacking sufficient labeled data for training. The key to our solution is to consider the blocking filter problem not in isolation but in the context of the matcher with which the filter is going to be used. We show that, given such a matcher, the labels that matter for learning a blocking filter are the labels assigned by the matcher (link/nonlink), not the true labels (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a DNF learning problem. PAC learning theory sheds light on the challenges of building ideal blocking filters. To make the problem feasible, we adopt two pragmatic simplifications: (a) we optimize the rejection rate over filters whose recall is higher than a pre-set acceptable threshold, and (b) we restrict the search for acceptable filters to a well-structured subset of all DNFs. We develop an algorithm and test it on a real patient master file of 2.18 million records. The results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but have throughput reduced by an order-of-magnitude factor. This approach is both practical and helpful for very large data files.

References

[1] Aizenstein, H., and Pitt, L. On the learnability of disjunctive normal form formulas. Machine Learning 19 (1995), 183-208.

[2] Baeza-Yates, R., and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.

[3] Baxter, R., Christen, P., and Churches, T. A comparison of fast blocking methods for record linkage. In Proc. of the ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003), pp. 25-27.

[4] Blum, A., Burch, C., and Langford, J. On learning monotone boolean functions. In IEEE Symposium on Foundations of Computer Science (1998), pp. 408-415.

[5] Bouhaddou, O., Bennett, J., Cromwell, T., Nixon, G., Teal, J., Davis, M., Smith, R., Fischetti, L., Parker, D., Gillen, Z., and Mattison, J. The Department of Veterans Affairs, Department of Defense, and Kaiser Permanente Nationwide Health Information Network exchange in San Diego: patient selection, consent, and identity matching. In AMIA Annual Symposium Proceedings 2011 (2011), pp. 135-143.

[6] Cao, Y., Chen, Z., Zhu, J., Yue, P., Lin, C.-Y., and Yu, Y. Leveraging unlabeled data to scale blocking for record linkage. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI) (2011).

[7] Fellegi, I. P., and Sunter, A. B. A theory for record linkage. Journal of the American Statistical Association 64 (1969), 1183-1210.

[8] Fleming, M., Kirby, B., and Penny, K. I. Record linkage in Scotland and its applications to health research. Journal of Clinical Nursing 21, 19-20 (2012), 2711-2721. doi:10.1111/j.1365-2702.2011.04021.x.

[9] Gu, L., and Baxter, R. A. Adaptive filtering for efficient record linkage. In Proc. of the Section on Survey Research Methods, American Statistical Association (2004), M. W. Berry, U. Dayal, C. Kamath, and D. Skillicorn, Eds., SIAM.

[10] Hernandez, M. A., and Stolfo, S. J. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery 1, 2 (1998).


[11] Herzog, T. N., Scheuren, F. J., and Winkler, W. E. Data Quality and Record Linkage Techniques. Springer, 2007.

[12] Jackson, J., Lee, H., Servedio, R., and Wan, A. Learning random monotone DNF. Discrete Applied Mathematics 159, 5 (2011), 259-271.

[13] Joffe, E., Byrne, M. J., Reeder, P., Herskovic, J. R., Johnson, C. W., McCoy, A. B., Sittig, D. F., and Bernstam, E. V. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. Journal of the American Medical Informatics Association (2013). doi:10.1136/amiajnl-2013-001744.

[14] Kearns, M., Li, M., Pitt, L., and Valiant, L. G. On the learnability of Boolean formulae. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, New York City, May 25-27, 1987 (New York, 1987), A. V. Aho, Ed., ACM Press, pp. 285-295.

[15] Levenshtein, V. I. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, 8 (1966), 707-710. Translated from Russian: Doklady Akademii Nauk SSSR 163(4), pp. 845-848, 1965.

[16] McCallum, A., Nigam, K., and Ungar, L. Learning to match and cluster large high-dimensional data sets for data integration. In Proc. of the Sixth International Conference on KDD (2000), pp. 169-170.

[17] Michelson, M., and Knoblock, C. A. Learning blocking schemes for record linkage. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-2006), vol. 1 (2006), pp. 440-445.

[18] Newcombe, H. B., and Kennedy, J. M. Record linkage: Making maximum use of the discriminating power of identifying information. Communications of the ACM 5 (1962), 563-567.

[19] Servedio, R. A. On learning monotone DNF under product distributions. In 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 2001, Proceedings (2001), vol. 2111, Springer, Berlin, pp. 558-573.

[20] Silveira, D. P., and Artmann, E. Accuracy of probabilistic record linkage applied to health databases: systematic review. Rev Saude Publica 43, 5 (2009), 875-882.

[21] Valiant, L. G. A theory of the learnable. Communications of the ACM 27, 11 (1984), 1134-1142.

[22] Wan, A. Learning, Cryptography, and the Average Case. PhD thesis, Columbia University, New York, 2010.


[23] Winkler, W. E. Overview of record linkage and current research directions. Tech. Rep. 2006-2, Statistical Research Division, U.S. Census Bureau, Washington, 2006.

[24] Wu, C., Walsh, A. S., and Rosenfeld, R. Genotype phenotype mapping in RNA viruses: disjunctive normal form learning. In Pacific Symposium on Biocomputing 16 (2011), pp. 62-73.
