Transitive Record Linkage in SAS® using SET / KEY= and MODIFY / POINT=
Denis Hulett, University of California, San Francisco, Sacramento, CA
Glenn Wright, University of California, San Francisco, Richmond, CA
Jaycee Karl, Modern Analytics, San Diego, CA

ABSTRACT
Several approaches exist for linking or unduplicating records that have no common identifier. Many of these approaches link together pairs of records, but linked data are most useful when a new common identifier is assigned to each "cluster" of records that are linked together directly or indirectly. This paper explains the challenges of transforming a file of linked pairs into a file of records clustered into groups termed "cliques" with new common identifiers, proposes an algorithm for performing the transformation, and provides SAS code that implements this algorithm for an example data set of linked pairs, using SET statements with the KEY= option, MODIFY statements with the POINT= option, and several other SAS tools. A companion paper (Wright & Hulett, 2010) demonstrates how to implement the same algorithm using SAS hash objects.
INTRODUCTION
As part of the evaluation of California's publicly funded family planning program, Family PACT (Planning, Access, Care and Treatment), the number of births to mothers who are Family PACT clients is counted. To identify these births, Medi-Cal enrollment records are linked to California's Birth Statistical Master File (BSMF). Before this linking could be accomplished, the BSMF records had to be unduplicated. In other words, all the birth records that belong to the same mother must be identified and given a common identifier. There are many approaches for linking records, including the popular Fellegi-Sunter (1969) model that we employed. However, most such linking processes share a common limitation: they produce a file of linked pairs. This is insufficient in cases where more than two records in the file belong to the same entity (as is the case where one mother has more than two infant records in the birth file). This paper describes our process and details the SAS code used to cluster common records together to form cliques.
CLIQUES
Unduplicating a file is much the same as linking records between two files, except that the unduplication process links a file to itself. Conceptually, each record is compared with every other record in the file. Each pair of records that is determined to belong to the same person is output together, creating a file of linked pairs. However, unduplication of the BSMF is a many-to-many situation; an individual woman may have multiple births over the seven-year time span examined. As such, her record pairs must be grouped together. A clique is formed when all her births are grouped and given a common identifier. Forming a clique requires the identification of trans-associations. These are instances where multiple record pairs are chained together by common records. For example,
If records A & B belong to the same person, and
Records B & C belong to the same person,
Then records A, B, & C belong to the same person.
FILE OF LINKED PAIRS
File A in Table 1 contains fictitious demographics for two mothers found in the birth records. As with real data, the women's names may change between birth records and data points may be missing. To create a link between all births to an individual mother, linking is performed on multiple criteria. In the following demonstration, records are linked if they share a Social Security number (SSN) or share both a first and a last name. File A is linked to a copy of itself named File B, which produces the following links: record A links to record B, B to D, B to E, C to F, and D to E. Jane has four birth records (sometimes under different names) and Maria has two.
Table 1. Birth Mothers

Birth Mothers File A                          Birth Mothers File B
Birth Rec  First  Last       SSN              Birth Rec  First  Last       SSN
A          JANE   DOE                         A          JANE   DOE
B          JANE   DOE        111-11-1111      B          JANE   DOE        111-11-1111
C          MARIA  GONZALES   222-22-2222      C          MARIA  GONZALES   222-22-2222
D          JANIE  JONES      111-11-1111      D          JANIE  JONES      111-11-1111
E          JANEY  JONES-DOE  111-11-1111      E          JANEY  JONES-DOE  111-11-1111
F          MARIA  LOPEZ      222-22-2222      F          MARIA  LOPEZ      222-22-2222

The SAS code below creates the records shown in Table 1.
DATA BIRTH_MOTHERS;
  FORMAT CHILD_DOB DATE9.;
  INPUT BIRTH_REC $ CHILD_DOB DATE9. FNAME $13-17 LNAME $19-27 SSN $29-40;
  DATALINES;
A 09SEP2009 JANE  DOE
B 08AUG2008 JANE  DOE       111-11-1111
C 02FEB2002 MARIA GONZALES  222-22-2222
D 01JAN2001 JANIE JONES     111-11-1111
E 03MAR2003 JANEY JONES-DOE 111-11-1111
F 05MAY2005 MARIA LOPEZ     222-22-2222
;
RUN;

Using either matching SSNs or first and last name agreement as linking criteria, the SQL code below creates the file of linked pairs from our BIRTH_MOTHERS file. The reader may modify the linking criteria as needed. However, if a file is large, the Cartesian product produced by the SQL procedure below will likely exceed your processing resources. For example, if you have 100,000 records to compare, SAS has to evaluate a 100,000 x 100,000 product, which is 10 billion comparisons. There are nearly 4 million birth records in our process. To accommodate this large file, a "blocking process" was employed to compare only records that show some evidence of belonging to the same woman. That process is outside the scope of this paper (see Newcombe, 1988, for an explanation).

/* CREATE FILE OF LINKED PAIRS */
PROC SQL;
  CREATE TABLE FILE_OF_LINKED_PAIRS AS
  SELECT MONOTONIC() AS REC_NUM,
         A.BIRTH_REC AS ID1,
         B.BIRTH_REC AS ID2
    FROM BIRTH_MOTHERS A, BIRTH_MOTHERS B
   WHERE A.BIRTH_REC < B.BIRTH_REC
     AND (A.SSN=B.SSN OR (A.FNAME=B.FNAME AND A.LNAME=B.LNAME));
QUIT;
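To make the cost argument concrete, the following is a minimal sketch of one way to avoid the full Cartesian comparison: join only on exact equality of a key, once per linking criterion, and combine the results. This is not the blocking process the authors used (which the paper leaves out of scope); the table name BLOCKED_PAIRS and the extra guard against blank SSNs are assumptions made for illustration. It relies on the BIRTH_MOTHERS data set created earlier.

```sas
/* Hypothetical sketch: build candidate pairs with equality joins only. */
PROC SQL;
  CREATE TABLE BLOCKED_PAIRS AS
  /* Pairs that agree on a non-blank SSN */
  SELECT A.BIRTH_REC AS ID1, B.BIRTH_REC AS ID2
    FROM BIRTH_MOTHERS A INNER JOIN BIRTH_MOTHERS B
      ON A.SSN = B.SSN
   WHERE A.BIRTH_REC < B.BIRTH_REC
     AND A.SSN NE ' '
  UNION
  /* Pairs that agree on first and last name; UNION drops duplicates */
  SELECT A.BIRTH_REC, B.BIRTH_REC
    FROM BIRTH_MOTHERS A INNER JOIN BIRTH_MOTHERS B
      ON A.FNAME = B.FNAME AND A.LNAME = B.LNAME
   WHERE A.BIRTH_REC < B.BIRTH_REC;
QUIT;

/* Number the pairs so that REC_NUM equals the observation number */
DATA BLOCKED_PAIRS;
  SET BLOCKED_PAIRS;
  REC_NUM = _N_;
RUN;
```

Because each join condition is an equality, SAS can choose a join method other than a full Cartesian product. For the six example records this should yield the same five pairs as the query above.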
TRANS-ASSOCIATION LOGIC
Before we introduce the SAS features used to efficiently gather these pairs into cliques, let's think through what we want to do. Table 2 holds the file of linked pairs resulting from the linking process. Record number (1) tells us that ID A is linked to ID B. Let's put these IDs in a list of IDs we'll call a search buffer (Table 3). We want to search the remaining record pairs in Table 2 for any records that contain one of the IDs in our search buffer. Neither column ID1 nor column ID2 contains another instance of record A, so we search these fields for the next value in the search buffer, B. Looking at record number (2), we see that ID1 contains the value B, so we add its mate (ID2), D, to the search buffer. In record (3) we find another B, so we add its companion, E, to the search buffer. Continuing with the next value in the search buffer, D, we find that record (5) contains a D. Its companion, E, is already in the search buffer, so nothing new is added. In more complex linking processes we may find additional common records with a search of column ID2 in the record pairs. In this case, however, we have completed our search for Jane's deliveries; the search buffer holds the members of her clique, records A, B, D, and E.

We are now ready to output her records as a group, clear our search buffer, and begin to search for the next clique. To avoid searching for IDs that have already been identified as members of a clique, we create a Found List (Table 4). The row numbers in the found list mirror the row numbers of the linked pairs. As rows are used in a clique, we check them off so that we know to pass them over as we proceed.
Table 2. Linked Pairs

REC NUM  ID1  ID2
1        A    B
2        B    D
3        B    E
4        C    F
5        D    E

Table 3. Search Buffer

Buf 1  Buf 2  Buf 3  Buf 4
A      B      D      E

Table 4. Found List (1 = row checked off, 0 = row not yet used)

Row    1  2  3  4  5
Found  1  1  1  0  1

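The search-buffer logic just described can be sketched in a few lines of plain DATA step code before any of the indexed-access features are introduced. The sketch below is illustrative only and is not the authors' program: it holds the pairs in temporary arrays, and the data set names PAIRS_MEM and CLIQUES_MEM, the array names, and the fixed array sizes (100 pairs, 10 buffer slots) are assumptions.

```sas
/* Linked pairs from Table 2 (illustrative copy) */
DATA PAIRS_MEM;
  LENGTH ID1 ID2 $ 1;
  INPUT ID1 $ ID2 $;
  DATALINES;
A B
B D
B E
C F
D E
;
RUN;

DATA CLIQUES_MEM (KEEP=CLIQ_ID _ID);
  ARRAY P1{100} $ 1 _TEMPORARY_;   /* ID1 of each pair              */
  ARRAY P2{100} $ 1 _TEMPORARY_;   /* ID2 of each pair              */
  ARRAY FOUND{100} _TEMPORARY_;    /* found list, one flag per pair */
  ARRAY BUF{10} $ 1 _TEMPORARY_;   /* search buffer                 */
  LENGTH _ID OTHER $ 1;

  /* Load every pair into memory (assumes at most 100 pairs) */
  DO R = 1 TO N;
    SET PAIRS_MEM NOBS=N;
    P1{R} = ID1;
    P2{R} = ID2;
    FOUND{R} = 0;
  END;

  DO R = 1 TO N;
    IF FOUND{R} = 0 THEN DO;
      /* Start a new clique: clear the buffer, seed it with this pair */
      DO J = 1 TO DIM(BUF);
        BUF{J} = ' ';
      END;
      BUF{1} = P1{R};
      BUF{2} = P2{R};
      NBUF = 2;
      FOUND{R} = 1;

      /* Search the unused pairs for each buffer entry in turn */
      I = 1;
      DO WHILE (I <= NBUF);
        DO S = 1 TO N;
          IF FOUND{S} = 0 AND (P1{S} = BUF{I} OR P2{S} = BUF{I}) THEN DO;
            IF P1{S} = BUF{I} THEN OTHER = P2{S};
            ELSE OTHER = P1{S};
            /* Add the companion ID only if it is not in the buffer yet */
            IN_BUFFER = 0;
            DO J = 1 TO NBUF;
              IF BUF{J} = OTHER THEN IN_BUFFER = 1;
            END;
            IF IN_BUFFER = 0 AND NBUF < DIM(BUF) THEN DO;
              NBUF = NBUF + 1;
              BUF{NBUF} = OTHER;
            END;
            FOUND{S} = 1;   /* check the row off in the found list */
          END;
        END;
        I = I + 1;
      END;

      /* Output the finished clique under a new identifier */
      CLIQ_ID + 1;
      DO J = 1 TO NBUF;
        _ID = BUF{J};
        OUTPUT;
      END;
    END;
  END;
  STOP;
RUN;
```

This version rescans every pair for every buffer entry, so it is only practical for small files; the indexed lookups described next exist to avoid exactly that rescanning.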
TRANS-ASSOCIATION PROCESS
SAS® FEATURES
Table 5 provides an overview of the main SAS features we need to efficiently accomplish the trans-association process just described. Most importantly, the KEY= and POINT= options allow us to access observations in a SAS dataset nonsequentially. When a variable is indexed within a SAS data set, the KEY= option allows us to read and write to only observations containing values that match those supplied to the KEY= option. A numeric value supplied to the POINT= option directs SAS to the indicated row number.

The CNTLLEV= dataset option specifies the level of concurrent access to a dataset that is allowed. Before SAS can update a dataset, it places a lock on the entire dataset or on the row to be updated, depending upon the statement or procedure being executed. When a SET or MODIFY statement is executed, a lock is placed upon the row currently being read or modified. However, when a SET or MODIFY statement with a POINT= or KEY= statement option is executed, SAS places a lock on the entire dataset being referenced. This prevents SAS from gaining update access to the dataset from another concurrently executed SET statement, which is the case in our code. We reference FOUND_LST from both a SET statement and a MODIFY statement with POINT= options within a single DATA step. The overall lock on the FOUND_LST dataset is overcome by specifying record-level control (CNTLLEV=REC) on the MODIFY statement. Thus, only the row being accessed at any given time is locked for updating and SAS has concurrent update access to the other rows as needed. This capability is essential to our trans-association code.

The use of SAS return codes is also vital to the process. SAS provides an automatic variable named _IORC_ (which we believe stands for input/output return code). The values of this variable can tell us that SAS has found a match to the value provided by the KEY= option, or that no more matches are in the data set. SAS also provides a macro function, %SYSRC( ), which translates return codes into standardized mnemonics that are helpful for coding. We will discuss the SAS features listed in Table 5 as we step through the code for the trans-association process.
Table 5. SAS Features Used in Clique Program

SAS Language    Definition

Statements
  ARRAY         Defines a set of variables that you plan to process as a group.
  MODIFY        Replaces, deletes, and appends observations in an existing SAS data set in place but does not create an additional copy.
  RETURN        Instructs SAS to move to the next iteration of the DATA step.

Options with SET, MODIFY, or UPDATE statements
  KEY=          Option enables you to access observations nonsequentially in a SAS data set according to a value.
  POINT=        Option enables you to access observations nonsequentially in a SAS data set according to the row number.

Dataset Options
  CNTLLEV=      Specifies the level of concurrent access to a dataset. Record-level control (CNTLLEV=REC) locks the row being accessed, but allows other rows in that dataset to be accessed by other concurrently executed SAS statements or procedures.
  INDEX=        Creates a simple or a composite index. An index is an optional file that you can create for a SAS data file in order to provide direct access to specific observations located by an observation value.

I/O (Input/Output) Processing
  _IORC_        Automatic variable that contains the return code for each I/O operation that the MODIFY or SET statement attempts to perform.
  %SYSRC macro  Enables you to test for return codes produced by the MODIFY statement and by SET statements with the KEY= option.
  _SOK          Return code mnemonic indicating the function was successful.
  _DSENOM       Return code mnemonic indicating no matching observation was found in the MASTER data set.

Source: Excerpts from SAS® 9.2 Language Reference: Dictionary. SAS Institute Inc. 2010
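As a small, self-contained illustration of the two access options and the return-code test, the sketch below looks up all pairs whose ID1 is B through an index, then reads one row directly by number. The data set names PAIRS_IDX, B_PAIRS, and ROW4 and the variable ROW are hypothetical and are used only for this demonstration.

```sas
/* Demo copy of the linked pairs, indexed on ID1 */
DATA PAIRS_IDX (INDEX=(ID1));
  LENGTH ID1 ID2 $ 1;
  INPUT REC_NUM ID1 $ ID2 $;
  DATALINES;
1 A B
2 B D
3 B E
4 C F
5 D E
;
RUN;

/* KEY=: fetch every pair whose ID1 equals 'B' */
DATA B_PAIRS;
  LENGTH ID1 $ 1;
  ID1 = 'B';
  DO UNTIL (_IORC_ NE %SYSRC(_SOK));
    SET PAIRS_IDX KEY=ID1;                   /* next match for the key */
    IF _IORC_ = %SYSRC(_SOK) THEN OUTPUT;    /* keep successful reads  */
  END;
  _ERROR_ = 0;   /* the final, unsuccessful lookup sets _ERROR_ */
  STOP;          /* no sequential read ends this step, so stop explicitly */
RUN;

/* POINT=: read observation 4 directly by row number */
DATA ROW4;
  ROW = 4;
  SET PAIRS_IDX POINT=ROW;
  OUTPUT;
  STOP;          /* POINT= never reaches end of file */
RUN;
```

Repeating the same SET statement with an unchanged key value walks through the matching observations one at a time, which is why the loop runs until _IORC_ stops reporting success.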
TRANS-ASSOCIATION CODE
In preparation for the trans-association process, indexes for both ID variables in the file of linked pairs are created. This is accomplished via the INDEX= option in the code below at marker 01. In addition, the dataset used to flag record pairs already matched to a clique is created at marker 02. The values of FOUND in this FOUND_LST dataset are initialized to 0 at marker 03.

/* Create two indexes on the FILE_OF_LINKED_PAIRS, one for ID1 and one for ID2 */
DATA FILE_OF_LINKED_PAIRS (INDEX=(ID1 ID2) KEEP = REC_NUM ID1 ID2)   /* Marker 01 */
     FOUND_LST (KEEP = FOUND);                                       /* Marker 02 */
  /* Create list for marking rows containing IDs that are already in a clique */
  SET FILE_OF_LINKED_PAIRS;
  FOUND = 0;                                                         /* Marker 03 */
RUN;
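If a file of linked pairs already exists without indexes, the same two simple indexes can be added in place rather than by rewriting the data set. The following is a hedged sketch of that alternative using PROC DATASETS; the data set name PAIRS_NOINDEX is hypothetical, and this is an illustration rather than the step the authors ran.

```sas
/* Demo pairs created without any index */
DATA PAIRS_NOINDEX;
  LENGTH ID1 ID2 $ 1;
  INPUT REC_NUM ID1 $ ID2 $;
  DATALINES;
1 A B
2 B D
3 B E
4 C F
5 D E
;
RUN;

/* Add one simple index per ID variable to the existing data set */
PROC DATASETS LIBRARY=WORK NOLIST;
  MODIFY PAIRS_NOINDEX;
  INDEX CREATE ID1;
  INDEX CREATE ID2;
QUIT;

/* The PROC CONTENTS listing reports the indexes that now exist */
PROC CONTENTS DATA=PAIRS_NOINDEX;
RUN;
```

Either route should leave the KEY= lookups with an index to use; creating an index that already exists raises an error, so only one of the two approaches should be applied to a given data set.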
Two macro variables are set at marker 04. SEARCH_BUF_SIZE contains the number of variables in the buffer array needed to accommodate the largest clique plus one. This value is most often unknown and therefore must be overestimated to ensure that the buffer array has a sufficient number of variables allocated to handle the largest clique you expect to find. The ID_LEN macro variable provides the length of the ID variable. In this example it is 1.

The trans-association code begins at marker 05, reading in the first observation of the FILE_OF_LINKED_PAIRS. At marker 06 the POINT= option instructs SAS to read the observation of FOUND_LST indicated by the variable REC_NUM (in this case observation 1). This observation has not previously been found to be part of a clique, so the IF condition at marker 07 is false and SAS continues to marker 08. An array named SEARCH_BUF is created. The loop ensures that all the values in the SEARCH_BUF array are set to missing at the beginning of each new clique formation. The values from the current pair of IDs in the FILE_OF_LINKED_PAIRS are placed in the first two variables in the buffer at marker 09.

The DO loop at marker 10 increments the index variable "i" from 1 to the size of the SEARCH_BUF array until a missing value is encountered. The DO loop at marker 11 instructs SAS to first search the values in the ID1 column of the FILE_OF_LINKED_PAIRS and then search the ID2 column. When i=1 and k=1, the value from SEARCH_BUF1 (A) is used to search the ID1 column. When i=1 and k=2, the value from SEARCH_BUF1 (A) is used to search the ID2 column. Whenever a match is found, the expression _IORC_ = %SYSRC(_SOK) at marker 12 is true and SAS enters the DO group at marker 13. SAS searches through the search buffer to see whether the companion value from the pair in the row just matched is already in the SEARCH_BUF array. If it is already in the buffer, the IN_BUFFER flag is set to 1.

Next, at marker 14, SAS checks to ensure that we have not overflowed our buffer and warns us accordingly in the log. If we have not filled the search buffer and our IN_BUFFER flag has not been set to 1, the companion value from the pair in the row just matched is placed in the next empty variable in the SEARCH_BUF array at marker 15. The subsequent MODIFY statement points to the observation in the FOUND_LST indicated by the REC_NUM variable from the FILE_OF_LINKED_PAIRS row just matched. The CNTLLEV= option allows SAS to access the FOUND_LST dataset at the same time it is being accessed by the SET statement at marker 06. The variable FOUND is set to 1. The REPLACE statement writes the value of the variable FOUND to the FOUND_LST.

The code from markers 12 through 15 repeats until the expression _IORC_ = %SYSRC(_SOK) is not true, meaning that no more matches to the current search value from the SEARCH_BUF array can be found in the FILE_OF_LINKED_PAIRS. Then SAS loops back to marker 10 and increments the value of the index variable i to begin the search again for the next value in the buffer. The code from markers 10 through 15 repeats until there are no values in the SEARCH_BUF array that have not been used in a search.

At marker 16 the automatic _ERROR_ variable is set to 0 whenever _IORC_ equals %SYSRC(_DSENOM), meaning that the last lookup found no further match. Setting _ERROR_=0 prevents SAS from printing the values of the current observation to the log. At marker 17, the variable CLIQ_ID is incremented every time a clique has been completely formed. This serves as a unique identifier for the clique. The DO loop at marker 18 outputs an observation for each ID in the SEARCH_BUF array.

After each clique is output, SAS returns to marker 05 to read in the next row from the FILE_OF_LINKED_PAIRS. At marker 06 SAS reads in the corresponding row from the FOUND_LST file. If that row is already a member of a clique, the RETURN statement at marker 07 instructs SAS to go back to the top of the DATA step and read in the next observation from the FILE_OF_LINKED_PAIRS. When an observation that is not already part of a clique is read in, SAS clears the SEARCH_BUF array at marker 08 and begins the process again.
/* Marker 04: set value for maximum expected clique size + 1 */
%LET SEARCH_BUF_SIZE=5;
/* Enter the length of the ID var. Later used to create array of ID vars in a clique */
%LET ID_LEN = 1;

DATA CLIQUES (KEEP=CLIQ_ID _ID) FOUND_LST;
  LENGTH ID1 ID2 _ID OTHER_ID $&ID_LEN.;
  SET FILE_OF_LINKED_PAIRS;                      /* Marker 05 */
  SET FOUND_LST POINT = REC_NUM;                 /* Marker 06 */
  IF FOUND = 1 THEN RETURN;                      /* Marker 07 */

  /* Marker 08: define the search buffer and clear every element.       */
  /* The original UNTIL clause ended this loop after the first element. */
  ARRAY SEARCH_BUF{&SEARCH_BUF_SIZE.} $&ID_LEN.;
  DO i = 1 TO &SEARCH_BUF_SIZE.;
    SEARCH_BUF{i} = ' ';
  END;

  /* Marker 09: ID1 and ID2 are the first elements in the buffer */
  SEARCH_BUF{1}=ID1;
  SEARCH_BUF{2}=ID2;

  /* Marker 10: work through the buffer until a blank element is reached */
  DO i = 1 TO &SEARCH_BUF_SIZE. UNTIL (SEARCH_BUF{i} = ' ');
    /* Marker 11: look for matches to each buffer item first in ID1, then in ID2 */
    DO k = 1, 2;
      DO UNTIL (_IORC_ =(%SYSRC(_DSENOM)));
        IF K = 1 THEN DO;
          ID1 = SEARCH_BUF{i};
          SET FILE_OF_LINKED_PAIRS KEY = ID1;
          OTHER_ID = ID2;
        END;
        ELSE IF K = 2 THEN DO;
          ID2 = SEARCH_BUF{i};
          SET FILE_OF_LINKED_PAIRS KEY = ID2;
          OTHER_ID = ID1;
        END;
        /* Marker 12: whenever a match is found, check whether it is in the buffer already */
        IF _IORC_ = %SYSRC(_SOK) THEN DO;
          IN_BUFFER = 0;
          /* Marker 13: scan the filled part of the buffer for the companion ID */
          DO J = 1 TO &SEARCH_BUF_SIZE. WHILE (SEARCH_BUF{j} ^= ' ');
            IF SEARCH_BUF{j} = OTHER_ID THEN IN_BUFFER = 1;
          END;
          /* Marker 14: warn if the buffer is full.                        */
          /* Marker 15: if not already in buffer, put it in the buffer.    */
          IF J > &SEARCH_BUF_SIZE. THEN PUT 'Warning: Buffer too small!';
          ELSE IF IN_BUFFER = 0 THEN SEARCH_BUF{j} = OTHER_ID;
          /* Flag the matched row in the found list */
          MODIFY FOUND_LST (CNTLLEV = REC) POINT = REC_NUM;
          FOUND = 1;
          REPLACE FOUND_LST;
        END;
      END;
    END;
  END;

  /* Marker 16: prevents observations from being printed to the log */
  /* when no match for the key variable is found                    */
  IF _IORC_ =%SYSRC(_DSENOM) THEN _ERROR_ = 0;

  CLIQ_ID+1;                                     /* Marker 17 */
  /* Marker 18: output one observation per ID in the buffer */
  DO I=1 TO &SEARCH_BUF_SIZE. WHILE (SEARCH_BUF{I}^=' ');
    _ID=SEARCH_BUF{I};
    OUTPUT CLIQUES;
  END;
RUN;
RESULTS
UNDUPLICATION OF EXAMPLE DATA
Table 6 provides the output from the trans-association process in our example. The five rows of linked pairs form two cliques, with four IDs in the first and two IDs in the second.

Table 6. Output from Trans-Association Process Example

_ID  CLIQ_ID
A    1
B    1
D    1
E    1
C    2
F    2

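Once the cliques exist, the new identifier usually has to be carried back to the original records. A minimal sketch of that step, assuming the BIRTH_MOTHERS and CLIQUES data sets created earlier, follows; the output name BIRTH_MOTHERS_CLIQ is hypothetical. Records that never appear in a linked pair belong to no clique and receive a missing CLIQ_ID here, so a separate rule for assigning them identifiers would still be needed.

```sas
/* Attach each record's clique identifier to the original birth records */
PROC SQL;
  CREATE TABLE BIRTH_MOTHERS_CLIQ AS
  SELECT M.*, C.CLIQ_ID
    FROM BIRTH_MOTHERS M LEFT JOIN CLIQUES C
      ON M.BIRTH_REC = C._ID
   ORDER BY M.BIRTH_REC;
QUIT;
```

The left join keeps every birth record, including any that were not linked to another record.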
UNDUPLICATION OF CALIFORNIA BIRTH RECORDS
After using a probabilistic record linking process to create a file of linked record pairs, each joining two birth records that belong to the same mother, the trans-association process just described was employed to create cliques from those record pairs. Figure 1 provides a bar chart of the number of mothers who have child birth records in the California Birth Statistical Master File between 2000 and 2007, by the number of births they had during that eight-year period. Of the nearly 3.4 million mothers, 64% had only one delivery, 28% had two, 6% had three, 1% had four, and fewer than 1% had five or more.
[Figure 1: bar chart. Vertical axis: Nbr of Mothers (0 to 2,500,000). Horizontal axis: Nbr of Births (1, 2, 3, 4, 5+). Source: California Birth Statistical Master File]

Figure 1. Number of Mothers in the California Birth Statistical Master File by the Number of Births between 2000 and 2007.
CONCLUSION
The KEY= and POINT= options provide efficient ways to access observations in a SAS dataset nonsequentially. Return codes provided by the _IORC_ automatic variable can tell us when and if the observations we are looking for have been found. The CNTLLEV= dataset option allows SAS to set, modify, or update a single dataset from multiple statements within a single DATA step. Used together, these features provide a highly efficient means of searching, modifying, or retrieving data from SAS datasets.
The main limitation to this process is that a reasonable estimate of the size of the maximum clique must be known in advance. This is problematic because errors in a linking process that create dubious links can result in some unexpectedly long chains of trans-associations among record pairs. Setting the SEARCH_BUF_SIZE macro variable to extremely high numbers to accommodate these large chains will necessarily cost computing resources. Setting the SEARCH_BUF_SIZE too low will cause unpredictable breaks in perhaps legitimate chains of trans-associations. A companion paper to this paper entitled “Transitive Record Linkage in SAS using Hash Objects” (Wright & Hulett, 2010) provides a treatment of this same process utilizing hash objects and features available in SAS 9.2 that does not require that the size of the maximum clique be known in advance. After the trans-association process has been run successfully, it is recommended that the largest cliques be examined. Routinely, post-processing is required to break up some of the unrealistically large and dubious trans-associations.
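A simple way to begin examining the largest cliques is to count the IDs in each clique and list the biggest first. The sketch below assumes the CLIQUES data set produced by the trans-association step; the names CLIQ_SIZES, CLIQ_REVIEW, and N_IDS and the MAX_PLAUSIBLE threshold are hypothetical and would need to reflect what is plausible for the data at hand.

```sas
/* Hypothetical review threshold: cliques larger than this get inspected */
%LET MAX_PLAUSIBLE = 3;

PROC SQL;
  /* Number of IDs in each clique, largest cliques first */
  CREATE TABLE CLIQ_SIZES AS
  SELECT CLIQ_ID, COUNT(*) AS N_IDS
    FROM CLIQUES
   GROUP BY CLIQ_ID
   ORDER BY N_IDS DESC;

  /* Cliques that exceed the threshold and deserve a manual look */
  CREATE TABLE CLIQ_REVIEW AS
  SELECT *
    FROM CLIQ_SIZES
   WHERE N_IDS > &MAX_PLAUSIBLE.;
QUIT;
```

The same size distribution can also guide the choice of SEARCH_BUF_SIZE on a rerun, since it shows how close the largest clique came to the buffer limit.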
REFERENCES
Fellegi, I. P. & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Newcombe, H. B. (1988). Handbook of record linkage. New York: Oxford University Press.
SAS Institute Inc. (2010). SAS 9.2 Language Reference: Dictionary, Third Edition. Cary, NC: SAS Institute Inc.
Wright, G. & Hulett, D. (2010). Transitive Record Linkage in SAS using Hash Objects. Western Users of SAS Software 2010 Conference Proceedings.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Denis Hulett
Bixby Center for Global Reproductive Health
University of California, San Francisco
1615 Capitol Ave.
Sacramento, CA 95899-7420
(916) 650-0432
[email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.