Data Merge and Modification: Lessons Learned
Patricia Rodríguez de Gil
Rheta E. Lanehart
American Statistical Association Conference on Statistical Practice
February 21-23, 2013, New Orleans, LA
Introduction
Data warehouses are primary entities that, among other functions, collect data over time.
Unlike data collected for specific research purposes, opportunistic data are in most cases passively observed and are often massive, complex, and messy.
Thus, analysts might encounter major issues when working with operational data, such as:
storage
amount and type of data
more importantly, having to deal with different data formats
Conference on Statistical Practice 2013
Research Problem
When all input data come from a single sample, the quality of the data depends, for example, on accurate data recording.
However, when a single data source is insufficient, multiple sources of data are used to provide the necessary input to the model.
In such situations, researchers apply appropriate methodology to create a comprehensive, composite data file using effective data merge techniques.
Purpose
Because data are only as valuable as their level of quality, data merging procedures must integrate the data efficiently and in such a way that the desired information is not lost or inaccurately represented.
Using data from an education data warehouse in a southeastern state, this presentation shows efficient ways to merge and modify data of different formats and lengths.
Define Analytic Objectives
Research question
What are the student and school factors that influence enrollment in different accelerated curricular programs?
Sample Definition
College-bound 12th-grade students attending regular high schools and taking at least one course in accelerated curricular programs
Accessing Data Sources
Common Data Formats
Fixed-width and delimited text files
Microsoft Excel spreadsheets
dBASE files
ODBC-compliant data
SAS tables
OLE DB provider’s files
HTML tables
Microsoft Access tables
Initial Data Exploration
Selected Student Level Variables
Transcripts file
Student ID, Course Number, Course Name, Grade level, School ID
Demographic file
Student ID, Gender, Race/Ethnicity, LEP, Lunch status, Exceptionality, School ID
Enrollment file
Student ID, Language code, Migrant status, School ID
Initial Data Exploration (cont'd)
Selected School Level Variables
Common Core of Data (CCD)
School ID, School name, Demographic composition, Location, Size, Free/reduced lunch, Full-time teachers
Data Processing Summary
Start
Read Files into SAS Dataset
Filter and Merge
Create New Variables
Summary of Data by Student
Data Analysis/Modeling
End
Read Files into SAS Dataset
Step 1: CONTENTS (Transcripts - Contents); Transcripts = 5,347,993
Step 2: FREQ (Transcripts - Frequencies); Transcripts = 3,623,328
DATA step output A: Courses = 1,829,514
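The two exploration steps named on this slide can be sketched as follows. This is a minimal illustration and not the authors' code: the toy data set `transcripts` and its variable names are assumptions made for the example.

```sas
/* Hypothetical toy transcripts data; names and values are illustrative only */
data transcripts;
    input student_id $ course_name :$20. grade_lvl_cd $;
    datalines;
S001 Algebra_II 11
S001 AP_Calculus 12
S002 AP_Biology 12
S003 English_IV 12
;
run;

/* Step 1: list variable names, types, and lengths before any merging */
proc contents data=transcripts;
run;

/* Step 2: tabulate a coded variable, counting missing values as a category */
proc freq data=transcripts;
    tables grade_lvl_cd / missing;
run;
```

Checking types and lengths first matters for merging: a key stored as character in one file and numeric in another, or with different lengths, is a common source of failed or truncated matches.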
Filter and Merge
A: Courses = 1,829,514
SORT by student ID, Course_Name; DATA with Nondup key
B: Courses, unique (students) = 1,381,651
CCD = 3,824; filter if Level = 3 & Type = 1
C: CCD, regular schools, N = 385
MERGE B and C
D: Courses in regular high schools
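The sort, filter, and merge sequence on this slide might look like the sketch below. All data set names, variable names, and values are hypothetical; only the conditions `Level = 3` and `Type = 1` and the sort keys come from the slide.

```sas
/* Hypothetical course records with one exact duplicate */
data courses;
    input student_id $ course_name :$20. school_id $;
    datalines;
S001 AP_Calculus H01
S001 AP_Calculus H01
S001 AP_Biology H01
S002 AP_Biology H02
;
run;

/* NODUPKEY keeps the first record for each student_id/course_name pair */
proc sort data=courses out=courses_u nodupkey;
    by student_id course_name;
run;

/* Hypothetical school file with level and type codes */
data ccd;
    input school_id $ level type;
    datalines;
H01 3 1
H02 3 4
H03 1 1
;
run;

/* A subsetting IF keeps only the rows that satisfy the condition */
data ccd_regular;
    set ccd;
    if level = 3 and type = 1;
run;

/* A match-merge requires both inputs sorted by the BY variable */
proc sort data=courses_u;   by school_id; run;
proc sort data=ccd_regular; by school_id; run;

/* Keep course records only for schools present in both files */
data courses_regular;
    merge courses_u (in=in_courses) ccd_regular (in=in_ccd);
    by school_id;
    if in_courses and in_ccd;
run;
```

With these toy data, `courses_u` has three records and `courses_regular` keeps the two records from school H01.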
Create New Variables: GPA_OVERALL
Start
COURSES 2001-2005, DATA step: Grade_Score
GRADES 2001-2005, MEANS: GPA_0105
GPA_ALL YEARS
End
Summary of Data by Student
[Flowchart: the Courses, GPA, and DEMO files are combined through DATA step UPDATE into a synthetic file, which feeds REPORTS, CORR, and MEANS]
Merge/Matching Problems
Exact Matching
Same individuals are in the two files
Record linkage: identifiers (e.g., student ID) are available to perform the matching of records
Efficient software to sort individuals by their identifiers
Resulting merged file (synthetic file)
More comprehensive data than the two separate files; however, two kinds of error can occur:
False match (records of different individuals are linked)
False non-match (records of the same individual are not linked)
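One way to inspect the outcome of an exact match is to route matched and unmatched records to separate data sets using the `IN=` flags. The sketch below uses hypothetical data sets `demo` and `enroll`; it is an illustration, not the procedure the authors report.

```sas
/* Hypothetical demographic and enrollment files keyed by student_id */
data demo;
    input student_id $ gender $;
    datalines;
S001 F
S002 M
S003 F
;
run;

data enroll;
    input student_id $ migrant $;
    datalines;
S001 N
S003 N
S004 Y
;
run;

/* Both inputs are already in student_id order here; otherwise sort first */
data matched only_demo only_enroll;
    merge demo (in=in_demo) enroll (in=in_enroll);
    by student_id;
    if in_demo and in_enroll then output matched;   /* ID found in both files */
    else if in_demo then output only_demo;          /* ID missing from enroll */
    else output only_enroll;                        /* ID missing from demo   */
run;
```

Here `matched` holds S001 and S003, `only_demo` holds S002, and `only_enroll` holds S004. The unmatched sets are candidates for false non-matches (for example, a mistyped ID); false matches cannot be detected from the key alone and call for checks on other variables.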
Required Expertise
Domain: The expert understands the particulars of the business or scientific problem
Data: The expert understands the structure, size, and format of the data
Analytical Methods: The expert understands the capabilities and limitations of the methods that might be relevant to the problem
SAS code for Merging Files
* combine two files into one, keeping 12th-grade records;
data courses_hs2005;
    set hs_transcripts_10 hs_transcripts_11;
    if GRADE_LVL_CD = '12';
run;

* keep only records where the same school_enrolled exists on both files;
* both input files must already be sorted by school_enrolled;
data courses_hs2005Reg;
    merge courses_hs2005_U (in=hs2005) hs_ccd_regular (in=hsreg);
    by school_enrolled;
    if hs2005 and hsreg;
run;
SAS code for Updating Files
data students2005;
    update rigor_hs2005_enroll_tot2 (in=demo) rigor89012all (in=gpa);
    by k20_edw_id;
    if demo=1; * update all records in the demo file;
run;
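A small sketch of why UPDATE can be preferable to MERGE when the second file is incomplete: by default, missing values in the transaction data set do not overwrite values in the master. The data sets `master` and `trans` and their values are hypothetical.

```sas
/* Hypothetical master file: one record per student, some GPA values missing */
data master;
    input student_id $ gender $ gpa;
    datalines;
S001 F .
S002 M 2.5
;
run;

/* Hypothetical transaction file: supplies GPA where known; "." means missing */
data trans;
    input student_id $ gender $ gpa;
    datalines;
S001 . 3.2
S002 . .
;
run;

/* UPDATE requires unique BY values in the master and sorted inputs */
data updated;
    update master trans;
    by student_id;
run;
```

The result keeps gender from the master and fills in only the non-missing GPA: S001 has gender F and GPA 3.2, and S002 keeps gender M and GPA 2.5. A MERGE of the same two files would instead replace those master values with the missing values from `trans`.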
SAS code for Creating new variables
data rigor8;
    set courses0105;
    if GRADE_LVL_CD = '08'; * keep 8th-grade records only;
    length grade_score 3;
    grade_score = 0;
    * letter grades map to grade points, with plus and minus not distinguished;
    if GRADE_EARNED = '' then grade_score = .;
    else if GRADE_EARNED in ('A' 'A+' 'A-') then grade_score = 4.0;
    else if GRADE_EARNED in ('B' 'B+' 'B-') then grade_score = 3.0;
    else if GRADE_EARNED in ('C' 'C+' 'C-') then grade_score = 2.0;
    else if GRADE_EARNED in ('D' 'D+' 'D-') then grade_score = 1.0;
    else if GRADE_EARNED in ('E' 'F') then grade_score = 0.0;
    * non-graded codes are set to missing so they do not count toward the GPA;
    else if GRADE_EARNED in ('I' 'N' 'NG' 'P' 'S' 'U' 'WF' 'WP') then grade_score = .;
run;
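Once each course record carries a numeric `grade_score`, a per-student GPA can be obtained with PROC MEANS, as the "Create New Variables" slide suggests. The sketch below assumes an unweighted mean (no credit hours) and uses a hypothetical data set `grades`; the authors' actual computation is not shown in the slides.

```sas
/* Hypothetical course-level grade scores; "." marks a non-graded course */
data grades;
    input student_id $ grade_score;
    datalines;
S001 4
S001 3
S001 .
S002 2
S002 3
;
run;

/* One output row per student; missing scores are excluded from the mean */
proc means data=grades noprint nway;
    class student_id;
    var grade_score;
    output out=gpa (drop=_type_ _freq_) mean=gpa_overall;
run;
```

With these toy data, `gpa_overall` is 3.5 for S001 and 2.5 for S002. Setting non-graded codes to missing, rather than 0, is what keeps them from pulling the average down.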
Recommended Reading
Wainer, H. (2000). Drawing inferences from self-selected samples. Mahwah, NJ: Lawrence Erlbaum Associates.
Contact Information
Patricia Rodríguez de Gil, M.A.T., ABD
University of South Florida
College of Education
Department of Measurement & Evaluation
4202 E. Fowler Avenue, EDU 105
Tampa, FL 33620-5650
United States
[email protected]
Work on this presentation was supported by a grant provided by the National Science Foundation (NSF ITEST #0833503 awarded to Dr. Kathryn M. Borman, P.I.).