Data Merge and Modification: Lessons Learned

Patricia Rodríguez de Gil
Rheta E. Lanehart

American Statistical Association Conference on Statistical Practice
February 21-23, 2013, New Orleans, LA

Introduction 





Data warehouses are primary entities that, among other functions, collect data over time

Unlike data collected for specific research purposes, opportunistic data are in most cases passively observed and are often massive, complex, and messy

Thus, analysts might encounter major issues when working with operational data, such as storage, the amount and type of data and, more importantly, having to deal with different data formats

Conference on Statistical Practice 2013


Research Problem

When all input data come from a single sample, the quality of the data depends, for example, on accurate data recording

However, when a single data source is insufficient, multiple sources of data are used to provide the necessary input to the model 

In such situations, researchers apply appropriate methodology to create a comprehensive, composite data file, using effective data merge techniques


Purpose

Because data are only as valuable as their level of quality, efficient data merging procedures are needed to integrate the data so that the desired information is neither lost nor inaccurately represented

Using data from an education data warehouse in a southeastern state, this presentation will show efficient ways to merge and modify data of different formats and lengths


Define Analytic Objectives 

Research question 



What are the student and school factors that influence enrollment in different accelerated curricular programs?

Sample Definition 

College-bound 12th-grade students attending a regular high school and taking at least one course in an accelerated curricular program


Accessing Data Sources 

Common Data Formats

Fixed-width and delimited text files

Microsoft Excel spreadsheets

dBASE files

ODBC-compliant data

SAS tables

OLE DB provider’s files

HTML tables

Microsoft Access tables


Initial Data Exploration

Selected Student Level Variables

Transcripts file: Student ID, Course Number, Course Name, Grade level, School ID

Demographic file: Student ID, Gender, Race/Ethnicity, LEP, Lunch status, Exceptionality, School ID

Enrollment file: Student ID, Language code, Migrant status, School ID


Initial Data Exploration (cont’d)

Selected School Level Variables

Common Core of Data (CCD): School ID, School Name, Demographic composition, Location, Size, Lunch (free/reduced), Full-time teachers


Data Processing Summary

Start
Read files into SAS dataset
Filter and merge
Create new variables
Summary of data by student
Data analysis/modeling
End

Read Files into SAS Dataset

Input files: 1, Transcripts = 5,347,993; 2, Transcripts = 3,623,328

CONTENTS: Transcripts - Contents

FREQ: Transcripts - Frequencies

DATA: A, Courses = 1,829,514
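The read-and-inspect step could be sketched as follows. This is only an illustration: the file path, the comma-delimited layout, and the variable names and lengths are assumptions, not the layout of the actual transcript files.

```sas
/* Hypothetical path and layout: a comma-delimited file with a header row. */
data transcripts;
    infile '/path/to/transcripts.csv' dsd dlm=',' firstobs=2 truncover;
    length student_id $10 course_number $8 course_name $40
           grade_lvl_cd $2 school_enrolled $6;
    input student_id course_number course_name grade_lvl_cd school_enrolled;
run;

/* CONTENTS: variable names, types, and lengths */
proc contents data=transcripts;
run;

/* FREQ: distribution of a coded variable, with missing values counted */
proc freq data=transcripts;
    tables grade_lvl_cd / missing;
run;
```

Checking types and lengths before merging matters because key variables that differ in type or length across files are a common cause of failed matches.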

Filter and Merge

A: Courses = 1,829,514

SORT (Nondup key) by student ID, Course_Name

B: Courses, Unique (students) = 1,381,651

CCD = 3824

DATA: Filter If Level = 3 & Type = 1

C: CCD, Regular Schools, N = 385

MERGE B and C

D: Courses In Regular High Schools
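A minimal sketch of the sort and filter steps that precede the merge. The data set names courses_hs2005, courses_hs2005_U, and hs_ccd_regular and the variables k20_edw_id and school_enrolled appear in the SAS code later in the presentation, but how they relate to steps A through C is an assumption here. The input name hs_ccd is hypothetical, and Level and Type are assumed to be numeric.

```sas
/* B: one record per student and course. NODUPKEY keeps the first
   record for each combination of BY values and drops the rest. */
proc sort data=courses_hs2005 out=courses_hs2005_U nodupkey;
    by k20_edw_id course_name;
run;

/* C: keep schools with Level = 3 and Type = 1.
   hs_ccd is a hypothetical name for the full CCD file. */
data hs_ccd_regular;
    set hs_ccd;
    if level = 3 and type = 1;
run;

/* A match-merge needs both inputs sorted by the merge key. */
proc sort data=courses_hs2005_U;
    by school_enrolled;
run;

proc sort data=hs_ccd_regular;
    by school_enrolled;
run;
```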

Create New Variables

[Flowchart: Start; COURSES 2001-2005; DATA; Grade_Score; GRADES 2001-2005; MEANS; GPA_0105; GPA_ALL YEARS; GPA_OVERALL; End]

Summary of Data by Student

[Flowchart: Courses, GPA, and DEMO files; DATA UPDATE; Synthetic File; REPORTS: CORR, MEANS]

Merge/Matching Problems

Exact Matching

Same individuals are in the two files
Record linkage: identifiers (e.g., student ID) are available to perform the matching of records
Efficient software to sort individuals by their identifiers

Resulting merged file (synthetic file)

More comprehensive data than the two separate files; however, two kinds of error can occur:
False match
False non-match
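One way to look for possible non-matches is to route records by the IN= flags during the merge. The sketch below reuses data set names from the merge code in this presentation; the output names matched, only_courses, and only_ccd are hypothetical, and both inputs are assumed to be sorted by school_enrolled.

```sas
/* Split the merge result by where each school_enrolled value was found. */
data matched only_courses only_ccd;
    merge courses_hs2005_U (in=in_courses)
          hs_ccd_regular   (in=in_ccd);
    by school_enrolled;
    if in_courses and in_ccd then output matched;
    else if in_courses then output only_courses;   /* no school record */
    else output only_ccd;                          /* no course records */
run;
```

Records in only_courses and only_ccd are candidates for review. Some may be legitimate exclusions, while others may be false non-matches caused by inconsistent identifiers. False matches cannot be found this way and call for checks on other variables.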


Required Expertise

Domain: The expert understands the particulars of the business or scientific problem

Data: The expert understands the structure, size, and format of the data

Analytical Methods: The expert understands the capabilities and limitations of the methods that might be relevant to the problem


SAS code for Merging Files

** combine two files into one, keeping 12th-grade records only;
data courses_hs2005;
    set hs_transcripts_10 hs_transcripts_11;
    if GRADE_LVL_CD = '12';
run;

** keep only records where same school_enrolled exists on both files;
** both inputs must already be sorted by school_enrolled;
data courses_hs2005Reg;
    merge courses_hs2005_U (in=hs2005)
          hs_ccd_regular   (in=hsreg);
    by school_enrolled;
    if hs2005 and hsreg;
run;


SAS code for Updating Files

* update all records in the demo file;
data students2005;
    update rigor_hs2005_enroll_tot2 (in=demo)
           rigor89012all            (in=gpa);
    by k20_edw_id;
    if demo = 1;
run;
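UPDATE differs from MERGE in how it treats missing values: by default, a missing value in the transaction data set does not overwrite a nonmissing value in the master. The toy example below illustrates this. The data sets master, trans, and updated and their values are invented for illustration only.

```sas
/* Toy master file: one record per id (UPDATE expects unique BY values here). */
data master;
    input id gpa lunch $;
    datalines;
1 3.2 F
2 . R
;
run;

/* Toy transaction file: a period marks a missing value. */
data trans;
    input id gpa lunch $;
    datalines;
1 . R
2 2.8 .
;
run;

/* Nonmissing transaction values replace master values;
   missing transaction values leave the master values unchanged. */
data updated;
    update master trans;
    by id;
run;

proc print data=updated;
run;
```

Under the default behavior, id 1 keeps gpa 3.2 and takes lunch R, and id 2 takes gpa 2.8 and keeps lunch R.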


SAS code for Creating new variables

data rigor8;
    set courses0105;
    if GRADE_LVL_CD = '08';    * keep 8th-grade records only;
    length grade_score 3;
    grade_score = 0;           * grade codes not listed below keep this value;
    if GRADE_EARNED = '' then grade_score = .;
    else if GRADE_EARNED in ('A' 'A+' 'A-') then grade_score = 4.0;
    else if GRADE_EARNED in ('B' 'B+' 'B-') then grade_score = 3.0;
    else if GRADE_EARNED in ('C' 'C+' 'C-') then grade_score = 2.0;
    else if GRADE_EARNED in ('D' 'D+' 'D-') then grade_score = 1.0;
    else if GRADE_EARNED in ('E' 'F') then grade_score = 0.0;
    else if GRADE_EARNED in ('I' 'N' 'NG' 'P' 'S' 'U' 'WF' 'WP') then grade_score = .;
run;
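Once grade_score exists, a per-student GPA can be obtained with PROC MEANS. This is a sketch under assumptions: the student identifier k20_edw_id is assumed to be present in rigor8, and the output names gpa8 and gpa_08 are hypothetical. The mean is unweighted, so it does not account for course credits.

```sas
/* One output record per student; missing grade_score values are
   left out of the mean automatically. */
proc means data=rigor8 noprint nway;
    class k20_edw_id;
    var grade_score;
    output out=gpa8 (drop=_type_ _freq_) mean=gpa_08;
run;
```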


Recommended Reading 

Wainer, H. (2000). Drawing inferences from self-selected samples. Mahwah, NJ: Lawrence Erlbaum Associates.


Contact Information

Patricia Rodríguez de Gil, M.A.T., ABD
University of South Florida
College of Education
Department of Measurement & Evaluation
4202 E. Fowler Avenue, EDU 105
Tampa, FL 33620-5650
United States

Work on this presentation was supported by a grant provided by the National Science Foundation (NSF ITEST #0833503 awarded to Dr. Kathryn M. Borman, P.I.).
