Using Occupancy Models to Estimate the Number of Duplicate Cases ...

Journal of Data Science 5(2007), 53-66

Using Occupancy Models to Estimate the Number of Duplicate Cases in a Data System without Unique Identifiers

Ruiguang Song, Timothy Green, Matthew McKenna, and M. Kathleen Glynn
Centers for Disease Control and Prevention

Abstract: Data systems collecting information from different sources or over long periods of time can receive multiple reports from the same individual. An important example is public health surveillance systems that monitor conditions with long natural histories. Several state-level systems for surveillance of one such condition, the human immunodeficiency virus (HIV), use codes composed of combinations of non-unique personal characteristics such as birth date, soundex (a code based on last name), and sex as patient identifiers. As a result, these systems cannot distinguish between several different individuals having identical codes and a unique individual erroneously represented several times. We applied results for occupancy models to estimate the potential magnitude of duplicate case counting for AIDS cases reported to the Centers for Disease Control and Prevention with only non-unique partial personal identifiers. Occupancy models with equal and unequal occupancy probabilities are considered. Unbiased estimators for the numbers of true duplicates within and between case reporting areas are provided. Formulas to calculate estimators’ variances are also provided. These results can be applied to evaluating duplicate reporting in other data systems that have no unique identifier for each individual.

Key words: Case surveillance, duplicate reporting, occupancy model, soundex.

1. Introduction

Public health surveillance systems that monitor conditions with long natural histories can receive multiple reports from different sources regarding the same affected individual. For example, an individual may change his/her place of residence and seek care for the disease under surveillance, likely resulting in case reports from both places. If there is a unique identifier for each individual submitted with surveillance reports, then duplicate reports can be easily identified and removed from the surveillance system. However, because of confidentiality concerns, national surveillance systems do not collect information on variables that can uniquely identify a person. For example, name and social security number may be reported to a state surveillance system as a part of routine reporting, but are not reported to CDC (Centers for Disease Control and Prevention) for national HIV/AIDS surveillance purposes. Instead, an identifier is often created based on several descriptors. This identifier will not be unique.

When information submitted to a surveillance system cannot uniquely identify an individual, and the potential for duplicate reports being submitted to the system exists, the system must use additional information to determine if cases with the same non-unique identifiers represent the same person. For this discussion, we call reports with the same partial personal identifiers “potential duplicates”. Among these, we classify reports representing the same person as “true duplicates” and those representing different persons as “non-duplicates”.

National AIDS surveillance data in the United States have the potential for duplicate reporting and do not have unique identifiers to identify and remove duplicate reports. Using the data available at the national level (cases reported to CDC), one cannot determine whether cases with the same partial personal identifiers represent the same person and therefore are true duplicates. However, it is possible to estimate the expected number of non-duplicates from the potential duplicates in a surveillance system based on the probability of matching on these partial personal identifiers.

Larsen (1994) considered this problem in a register of HIV infected persons, using a method to estimate the number of distinct individuals in the register based on the date of birth of each entry and classical occupancy theory, where each ball has the same chance of falling into any one of the cells. While this method may be applicable to the date of birth in a given year, it cannot be applied to identifiers where individuals have an unequal chance to take each possible value of the identifier, e.g., the soundex (a code based on last name using a method of encryption, see Fenna, 1984).

Under the classical occupancy model where each ball has the same chance of falling into each cell, the explicit formula for the expected number of empty cells is available. However, the explicit formula for the variance associated with the observed number of empty cells is not available. A similar situation occurs under the model with unequal occupancy probabilities. Only approximate formulas for the variance are available; see Chistyakov (1967), Holst (1971), and Sevastyanov (1972).

In this paper, we provide exact variance formulas for the observed number of empty cells under the two occupancy models. They are presented in Sections 2 and 3, respectively. In Section 4, we consider a model with cells filled by colored balls. All of these results can be applied to evaluating duplicates in a data system. As an example, we use occupancy models to evaluate duplication in AIDS case reporting. Results are presented in Section 5. Finally, some concerns and recommendations are presented in the discussion section.

2. Occupancy Model with Equal Occupancy Probabilities

Suppose that $r$ balls are randomly distributed to $n$ cells. Assume that each ball has an equal chance of being distributed to each cell. Let $M_{r,n}$ be the number of cells remaining empty. According to occupancy theory (see Feller, 1968, page 102), the probability distribution of $M_{r,n}$ is given by

$$\Pr(M_{r,n} = m) = \binom{n}{m} \sum_{i=0}^{n-m} (-1)^i \binom{n-m}{i} \left(1 - \frac{m+i}{n}\right)^r \tag{2.1}$$

where $\binom{x}{k}$ is the binomial coefficient equal to the number of combinations of $k$ items selected from $x$ items. Note that this formula is difficult to handle numerically because the alternating sum is prone to rounding error. A useful recursive formula is available (see Feller, 1968, page 60):

$$\Pr(M_{r+1,n} = m) = \Pr(M_{r,n} = m)\,\frac{n-m}{n} + \Pr(M_{r,n} = m+1)\,\frac{m+1}{n} \tag{2.2}$$
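
The recursion (2.2) can be sketched in a few lines of Python. This is only an illustration, not part of the original analysis: the function name `empty_cell_distribution` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is an assumption chosen here to sidestep the rounding issue mentioned above.

```python
from fractions import Fraction


def empty_cell_distribution(r, n):
    """Return [Pr(M_{r,n} = m) for m = 0..n] using the recursion (2.2).

    Hypothetical helper: starts from r = 0 (all n cells empty) and adds
    one ball at a time.
    """
    # With zero balls, all n cells are empty with probability 1.
    probs = [Fraction(0)] * n + [Fraction(1)]
    for _ in range(r):
        nxt = []
        for m in range(n + 1):
            # New ball lands in an already occupied cell: m is unchanged.
            stay = probs[m] * Fraction(n - m, n)
            # New ball lands in one of m + 1 empty cells: m + 1 drops to m.
            drop = probs[m + 1] * Fraction(m + 1, n) if m < n else Fraction(0)
            nxt.append(stay + drop)
        probs = nxt
    return probs


dist = empty_cell_distribution(r=3, n=3)
# Pr(M=0) = 2/9, Pr(M=1) = 2/3, Pr(M=2) = 1/9, Pr(M=3) = 0
print(dist)
```

For large n and r, floating-point numbers could replace `Fraction` at some cost in accuracy; because every term in the recursion is non-negative, it avoids the cancellation that affects the alternating sum in (2.1).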

Based on this recursive formula, one can derive the mean and variance of $M_{r,n}$. An alternative but simpler way to derive the mean and variance is presented in Section 3. As a special case of equations (3.1) and (3.6), the mean and variance of $M_{r,n}$ defined in (2.1) are:

$$E(M_{r,n}) = n \left(1 - \frac{1}{n}\right)^r \tag{2.3}$$

and

$$\operatorname{Var}(M_{r,n}) = n^2 \left[ \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right)^r + \frac{1}{n} \left(1 - \frac{1}{n}\right)^r - \left(1 - \frac{1}{n}\right)^{2r} \right] \tag{2.4}$$
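
As a quick numerical illustration of (2.3) and (2.4), the following sketch evaluates both formulas directly. The function names and the example values (365 cells, as for days of birth within a year, and 23 balls) are assumptions made here for illustration only.

```python
def mean_empty(r, n):
    """E(M_{r,n}) from (2.3): expected number of empty cells."""
    return n * (1 - 1 / n) ** r


def var_empty(r, n):
    """Var(M_{r,n}) from (2.4), written in expanded form:
    n(n-1)(1-2/n)^r + n(1-1/n)^r - n^2 (1-1/n)^(2r).
    """
    a = (1 - 1 / n) ** r
    b = (1 - 2 / n) ** r
    return n * (n - 1) * b + n * a - (n * a) ** 2


# Hypothetical example: 23 reports spread over 365 equally likely birthdays.
r, n = 23, 365
expected_duplicates = r - (n - mean_empty(r, n))  # E(D_{r,n}) = r - [n - E(M_{r,n})]
print(mean_empty(r, n), var_empty(r, n), expected_duplicates)
```

The variance is a difference of large, nearly equal quantities, so for very large n plain floating point may lose precision; exact or higher-precision arithmetic would be a reasonable substitute in that case.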

Since we are interested in the number of occupied cells and the number of balls that exceed the minimum necessary to fill the occupied cells, we consider the variables $K_{r,n}$, the number of occupied cells, and $D_{r,n}$, the number of balls $r$ minus the number of cells occupied by the $r$ balls: $D_{r,n} = r - K_{r,n} = r - (n - M_{r,n})$. Therefore, we have

$$E(D_{r,n}) = r - E(K_{r,n}) = r - [n - E(M_{r,n})] \tag{2.5}$$

and, given $n$ and $r$,

$$\operatorname{Var}(M_{r,n}) = \operatorname{Var}(K_{r,n}) = \operatorname{Var}(D_{r,n}) \tag{2.6}$$

If $K_{r,n} = K$ is observed but $r$ is unknown, then we can estimate $r$ by setting $E(M_{r,n}) = n - K$ in equation (2.3) and solving for $r$:

$$\hat{r} = \frac{\log(1 - K/n)}{\log(1 - 1/n)} \tag{2.7}$$

Using the delta method, the variance of the above estimator is

$$\operatorname{Var}(\hat{r}) \approx \left[ \frac{1}{(n - K) \log(1 - 1/n)} \right]^2 \operatorname{Var}(K_{r,n}) \tag{2.8}$$
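
The estimator (2.7) and its approximate variance (2.8) can be sketched as follows. The function names are hypothetical, and evaluating Var(K_{r,n}) at the estimated value of r (a plug-in step) is an assumption of this sketch rather than something stated in the text.

```python
import math


def estimate_balls(k, n):
    """r-hat from (2.7), given k occupied cells out of n (requires 0 <= k < n)."""
    return math.log(1 - k / n) / math.log(1 - 1 / n)


def var_estimate_balls(k, n):
    """Delta-method variance (2.8).

    Assumption: Var(K_{r,n}) is taken from (2.4) with r replaced by r-hat.
    """
    r_hat = estimate_balls(k, n)
    a = (1 - 1 / n) ** r_hat
    b = (1 - 2 / n) ** r_hat
    var_k = n * (n - 1) * b + n * a - (n * a) ** 2  # (2.4) at r = r-hat
    slope = 1 / ((n - k) * math.log(1 - 1 / n))     # derivative of r-hat in K, up to sign
    return slope ** 2 * var_k


# Hypothetical example: 80 distinct birthdays observed among 365 possible days.
print(estimate_balls(80, 365), var_estimate_balls(80, 365))
```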

The above results can be applied to situations where cells do not all have the same probability of being occupied, but can be divided into subgroups such that within each subgroup each cell has an equal probability of being occupied by a ball.

3. Occupancy Model with Unequal Occupancy Probabilities

In this section, we assume that the occupancy probabilities differ from cell to cell. Let $M_{r,n}$ denote the number of empty cells after $r$ balls have been placed into $n$ cells with occupancy probabilities $p_1, \ldots, p_n$, where $\sum_{i=1}^{n} p_i = 1$. Then, the expected number of empty cells can be expressed as

$$E(M_{r,n}) = \sum_{i=1}^{n} \Pr(\text{the } i\text{-th cell is empty}) = \sum_{i=1}^{n} (1 - p_i)^r \tag{3.1}$$
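
Equation (3.1) has no closed-form inverse in r when the p_i differ, but since each term decreases in r, a simple bisection can recover r from a given number of empty cells. The sketch below is one possible approach under that assumption; the function names, the tolerance, and the example probabilities are all hypothetical.

```python
def expected_empty(r, probs):
    """E(M_{r,n}) from (3.1) for occupancy probabilities probs."""
    return sum((1 - p) ** r for p in probs)


def solve_balls(m, probs, tol=1e-9):
    """Find real r >= 0 with expected_empty(r, probs) = m by bisection.

    Assumes every p is strictly between 0 and 1, so that (3.1) decreases
    from n at r = 0 toward 0 as r grows.
    """
    if not 0 < m <= len(probs):
        raise ValueError("m must satisfy 0 < m <= n")
    lo, hi = 0.0, 1.0
    while expected_empty(hi, probs) > m:  # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_empty(mid, probs) > m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


probs = [0.5, 0.3, 0.2]              # hypothetical unequal cell probabilities
m = expected_empty(4, probs)         # expected empties after 4 balls
print(solve_balls(m, probs))         # recovers a value close to 4
```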

Given $m$, the number of empty cells, we can estimate $r$, the number of balls, by solving the above equation for $r$ with $E(M_{r,n}) = m$. Using the binomial expansion followed by interchanging the order of summation in (3.1), we have:

$$E(M_{r,n}) = n - r + \sum_{t=2}^{r} (-1)^t \binom{r}{t} q_t \tag{3.2}$$

where

$$q_t = \sum_{i=1}^{n} p_i^t \tag{3.3}$$

Therefore, the expected number of excess balls is given by

$$E(D_{r,n}) = r - [n - E(M_{r,n})] = \sum_{t=2}^{r} (-1)^t \binom{r}{t} q_t \tag{3.4}$$
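
To make the two expressions in (3.4) concrete, the following sketch evaluates the expected number of excess balls both directly from (3.1) and through the series in q_t, and shows that they agree. The function names and example probabilities are hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def power_sum(t, probs):
    """q_t from (3.3): sum of the t-th powers of the cell probabilities."""
    return sum(p ** t for p in probs)


def expected_excess_series(r, probs):
    """E(D_{r,n}) via the alternating series in (3.4)."""
    return sum((-1) ** t * comb(r, t) * power_sum(t, probs)
               for t in range(2, r + 1))


def expected_excess_direct(r, probs):
    """E(D_{r,n}) = r - [n - E(M_{r,n})], with E(M_{r,n}) taken from (3.1)."""
    n = len(probs)
    return r - n + sum((1 - p) ** r for p in probs)


probs = [0.5, 0.3, 0.2]  # hypothetical cell probabilities
print(expected_excess_series(4, probs), expected_excess_direct(4, probs))
```

Both routes give 1.7122 for this small example, up to floating-point rounding.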

Given $s$ ($2 \le s < r$), the above formula can be approximated by

$$E(D_{r,n}) \approx \sum_{t=2}^{s} (-1)^t \binom{r}{t} q_t \tag{3.5}$$
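
A small numerical sketch may help show how quickly the truncated sum (3.5) settles. The function name `truncated_excess` and the example (30 balls over 365 equally likely cells) are assumptions chosen for illustration; the exact value is computed from E(D_{r,n}) = r - [n - E(M_{r,n})].

```python
from math import comb


def truncated_excess(r, probs, s):
    """Partial sum (3.5): terms t = 2..s of the series for E(D_{r,n})."""
    total = 0.0
    for t in range(2, s + 1):
        q_t = sum(p ** t for p in probs)  # q_t as defined in (3.3)
        total += (-1) ** t * comb(r, t) * q_t
    return total


probs = [1 / 365] * 365  # hypothetical: equally likely birthdays
r = 30
exact = r - len(probs) + sum((1 - p) ** r for p in probs)
for s in (2, 3, 4):
    approx = truncated_excess(r, probs, s)
    print(s, approx, abs(approx - exact))
```

In this example the terms shrink rapidly, so a handful of terms already gives a close approximation; with a few dominant cell probabilities or a much larger r, more terms may be needed.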

The difference between (3.4) and (3.5) is the sum of the smaller terms $\sum_{t=s+1}^{r} (-1)^t \binom{r}{t} q_t$. Because of the cancellation of positive and negative terms, the difference can be small. If $\binom{r}{t} q_t$ decreases with $t$ when $t \ge s$, then the above approximation has a maximum error less than $\binom{r}{s+1} q_{s+1}$. Since $q_{t+1} \le p_{\max} q_t$, where $p_{\max} = \max\{p_1, \ldots, p_n\}$, it follows that $\binom{r}{t+1} q_{t+1} \le \binom{r}{t} q_t$ if $t \ge (r p_{\max} - 1)/(1 + p_{\max})$.

Similar to the occupancy model with equal occupancy probabilities, the variances of $M_{r,n}$, $K_{r,n}$ and $D_{r,n}$ are all the same for given $n$ and $r$. The common variance is:

$$\operatorname{Var}(M_{r,n}) = \sum_{i=1}^{n} (1 - p_i)^r \left[ 1 - (1 - p_i)^r \right] + 2 \sum_{i<j} \left\{ \left[ 1 - (p_i + p_j) \right]^r - (1 - p_i)^r (1 - p_j)^r \right\} \tag{3.6}$$
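
The variance formula for unequal probabilities, a sum over single cells plus twice a sum over pairs of cells i < j, can be evaluated directly as in the sketch below. The function name is hypothetical, and the double loop over pairs costs on the order of n squared operations, which is an implementation choice made here for clarity rather than efficiency.

```python
from itertools import combinations


def var_empty_unequal(r, probs):
    """Variance of the number of empty cells under unequal probabilities (3.6)."""
    # Variance contribution of each cell's "is empty" indicator.
    single = sum((1 - p) ** r * (1 - (1 - p) ** r) for p in probs)
    # Covariance between the "is empty" indicators of each pair i < j.
    pairs = sum((1 - pi - pj) ** r - (1 - pi) ** r * (1 - pj) ** r
                for pi, pj in combinations(probs, 2))
    return single + 2 * pairs


# With equal probabilities the result should match the equal-probability case,
# e.g. 26/81 for r = 3 balls and n = 3 cells.
print(var_empty_unequal(3, [1 / 3] * 3))
print(var_empty_unequal(4, [0.5, 0.3, 0.2]))  # hypothetical unequal case
```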
