A Partitioning Method for Processing Large Data Files

NESUG 16

Pharmaceuticals

PH001

Anshuman Panigrahi, marketRx; Bridgewater, NJ
John R. Gerlach, PRA International; Horsham, PA

Abstract

Assume that you have a very large data file (over 40 million records) of physician-level prescription data that records the number of times a physician writes a prescription belonging to a drug class (e.g., pain medications). In this cohort, there are over 500 drugs, spanning several markets and involving over 200 drug classes, each denoted by a USC code. For each drug class, you wish to rank every physician by the number of prescriptions written in that class. Unfortunately, the volume of data prohibits the usual by-group processing.

This paper explains a sequential-partitioning method that facilitates the processing of large data files, such as the aforementioned file. It also explains a method for assigning a rank value to any physician for a given drug class.

Introduction

Processing large data files becomes prohibitive with limited computer resources, especially when the data must be stored as a SAS® data set in order to do by-group processing. With limited disk space or memory, sorting a large SAS data set or processing it by group becomes a challenge.

One obvious solution is to split the large data file into smaller SAS data sets. That is, when reading the data file, you distribute the ith observation into its appropriate data set as determined by the by-group variable (in this case, the USC code). However, this solution is extremely tedious, since you would need to know every unique value of the by-group variable in order to create its respective data set. Even worse, it more than doubles the disk space requirements by storing the data in many, albeit more manageable, SAS data sets.

Keep in mind that the primary objective is to perform by-group processing, not to manage data. Given that the data file is already sorted, it is possible to read the large data file for a specific by-group, perform the analysis, then store (append) the results in a SAS data set. But you don't know the first and last record of the data file pertaining to a given by-group. Moreover, you don't even know how many groups there are.

The Problem

Given a very large data file containing physician-level prescription data, and very limited computer resources, impute the rank of a physician's prescription-writing behavior, specific to a USC group.

The By-Group Boundaries

The problem is two-fold. First, you must process a very large raw data file with limited computer resources and compute the decile ranges for each USC group. Then, based on that analysis, you must be able to impute the rank value of any physician based on the number of prescriptions written by the physician.

Given that the large data file is sorted by USC code, we exploit that fact by determining the record boundaries of the n drug classes in the data file. That is, for each group of records, we need to know the starting record number and the ending record number.

To obtain this information, you must read the data file, albeit the USC codes only. This necessary task is the only serious hurdle regarding the limited computer resources, notably disk space.

    data usc;
       infile phys_rx;
       input @10 usc $char5.;
    run;

Next, the FREQ procedure processes the USC data set and creates an output data set that identifies all the drug classes, along with their frequencies.

    proc freq data=usc;
       tables usc / out=usc_freq(drop=percent);
    run;

Now, you have the information needed to compute the start / end boundary points of each USC group in the data file, which is accomplished in the subsequent Data step. Note the use of the automatic variable _N_ that begins the process of determining the boundaries.


    data usc_bound;
       set usc_freq;
       if _n_ = 1 then do;
          f_recno = 1;
          l_recno = count;
       end;
       else do;
          f_recno = l_recno + 1;
          l_recno + count;
       end;
       keep usc f_recno l_recno;
    run;
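To make the boundary arithmetic concrete, the following sketch applies the same running-total logic to three made-up frequencies. The DEMO_ data set names and the counts (100, 325 and 185) are illustrative only. It is a compact variant that relies on the sum statement initializing L_RECNO to zero, not a copy of the authors' step.

```sas
/* Hypothetical frequencies for three USC groups, in file order */
data demo_freq;
   input usc $ count;
   datalines;
31151 100
31152 325
31212 185
;
run;

data demo_bound;
   set demo_freq;
   f_recno = l_recno + 1;   /* l_recno still holds the previous group's last record */
   l_recno + count;         /* sum statement: retained across rows, starts at 0     */
   keep usc f_recno l_recno;
run;

/* Prints 31151 1 100, 31152 101 425, 31212 426 610 */
proc print data=demo_bound noobs;
run;
```

Each group's first record is one past the previous group's last record, so a single pass over the frequency table is enough to delimit every group.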

The data set USC_BOUND contains three variables: USC, F_RECNO (the first record number of a USC group) and L_RECNO (the last record number of a USC group), as shown below.

    USC       F_RECNO     L_RECNO
    31151            1         100
    31152          101         425
    31212          426         610
    32312          611        1023
      :              :           :
    21300     40587001    41000000

Now, you know the first and last record for each USC group in the data file. But you're not done yet. You need to create pair-wise macro variables that contain the start / end points in order to read the data file in a partitioned manner, one USC group at a time. First, you must determine how many USC groups exist in the data file. Using the USC_BOUND data set created by the previous Data step, the SQL procedure queries the SAS Dictionary Tables to define the N_USC macro variable, which denotes the number of USC groups, as follows.

    proc sql noprint;
       select left(put(nobs,4.)) into :n_usc
          from dictionary.tables
          where libname eq 'WORK' and memname eq 'USC_BOUND';
    quit;

Then, another SQL step employs the N_USC macro variable in order to create the pair-wise macro variables denoting the start / end boundaries in the data file, thereby delimiting each USC group in the data file.

    proc sql noprint;
       select f_recno into :first1-:first&n_usc. from usc_bound;
       select l_recno into :last1-:last&n_usc. from usc_bound;
       /* Assumed addition: USC1-USCn are referenced by later steps */
       /* but are not created anywhere in the original listing.     */
       select usc into :usc1-:usc&n_usc. from usc_bound;
    quit;

The result of this process is a pair of macro variables, FIRSTi and LASTi, for each USC group. Notice that the number of USC groups and the respective pair-wise start / end points are data dependent. You need not know a priori anything about the USC groups. The only requirement is that the original data file is ordered by USC group.

The Partitioning Process

By defining several pertinent macro variables, you are able to read the original data file despite limited computer resources, i.e., disk space. Using the %DO loop in the Macro Language, the ith iteration performs two steps: reading the portion of the data file that represents a specific USC group, and processing the data for that group. To read a raw data file sequentially, in a partitioned manner, such that only those records belonging to a specific USC group are read, the Data step inside the %DO loop uses the FIRSTOBS and OBS options of the INFILE statement. The values for these two options are supplied by the pair-wise macro variables, FIRSTi and LASTi, which tell the SAS System which records to read for the ith USC group.

    %macro part_proc;
       %do i = 1 %to &n_usc.;

       %* Read partitioned portion of data file ;
       data usc_group;
          infile phys_rx firstobs=&&first&i.. obs=&&last&i..;
          input @1  doc_id $char9.
                @10 usc    $char5.
                @15 trx    7.;
       run;

Notice that the code above uses the deferred (indirect) macro variable references available in the Macro Language. Thus, for the ith iteration of the %DO loop in the macro %part_proc (partition process), the macro variables &&first&i.. and &&last&i.. resolve to their appropriate pair-wise values, denoting the first and last records for a given USC group. Consequently, the data set USC_GROUP contains only those records that represent a particular USC.

Upon creating the data set USC_GROUP that contains only those records pertaining to a particular USC, the following proprietary code (not shown) efficiently generates a ranking of the analysis variable (i.e., TRX), representing the total number of prescriptions written by a physician, and includes low and high range values for ten rank levels.

       %* Perform ranking process ;

       < Proprietary Code >
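The proprietary ranking step in %part_proc is not shown in this paper. Purely as a stand-in, and not as the authors' method, one way to obtain ten rank levels with low and high values is PROC RANK with GROUPS=10 followed by a summary query. The sketch below is self-contained: the DEMO_ data set names and the toy data are hypothetical, and the layout of CRANGE (low and high joined by a hyphen) is inferred from the SCAN calls in the Data step that builds the Control Input data set.

```sas
/* Toy stand-in for USC_GROUP: 100 physicians with TRX = 1..100 (hypothetical) */
data demo_group;
   usc = '31212';
   do trx = 1 to 100;
      doc_id = put(trx, z9.);
      output;
   end;
run;

/* Assign each physician to one of ten groups, 0 (lowest TRX) to 9 (highest) */
proc rank data=demo_group out=demo_ranked groups=10 ties=low;
   var trx;
   ranks decile;
run;

/* One row per rank level, with the observed low and high TRX values */
proc sql;
   create table demo_ranges as
   select decile,
          left(put(decile + 1, 2.)) as rank,
          catx('-', min(trx), max(trx)) as crange
      from demo_ranked
      group by decile
      order by decile;
quit;
```

Inside the macro, the input would be USC_GROUP and the output RANGES. Heavy ties in TRX can leave some groups empty, in which case fewer than ten rows come back, which is the kind of condition the QC report is meant to catch.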

Results of the ranking process (partial listing):

    USC      RANK      LOW      HIGH
    31151       1        1        58
      :         :        :         :
    31151      10      100       141
      :         :        :         :
    31212       1        1        17
    31212       2       18        35
    31212       3       36        56
      :         :        :         :
    31212       9      463       962
    31212      10      963     1,200
      :         :        :         :
    82300       1        1        43
      :         :        :         :
    82300      10      777    18,540

Recall the objective: to impute the rank value that statistically represents how often a physician writes a prescription belonging to a USC group, compared to other physicians. That is, to assign a rank value for any physician in the large data file, with respect to a USC group. The task becomes almost trivial by creating a collection of Control Input data sets that contain the rank values for all the USC groups found in the data file.

Creating the Formats

For a given USC group, that is, for the ith iteration of the %DO loop, the process continues with another Data step that takes the results of the proprietary ranking code and converts them into a bona fide Control Input data set. Notice that each format in the Control data set is named according to the USC group, following a suitable naming convention: $uNNNNNf, where NNNNN is the USC code.

Thus, for any physician who has written n prescriptions specific to a USC group, the respective character format maps the number of prescriptions written to a rank value. Consider the following Data step. Note that the variable CRANGE contains the LOW and HIGH values depicted in the above table.

       %* Create Control Input data set ;
       data ranges;
          retain fmtname "u&&usc&i..f" type 'C';
          set ranges(keep=crange rank);
          start = input(compress(scan(crange,1,'- ')),8.);
          end   = input(compress(scan(crange,2,'- ')),8.);
          rename rank = label;
       run;

Upon creating the Control Input data set that represents a specific USC group, the following APPEND procedure adds that information to a single Control data set that will contain all the pertinent formats.

       %* Append Control data set ;
       proc append base=rankfmts data=ranges;
       run;

       %end;
    %mend part_proc;

Finally, the FORMAT procedure processes the Control Input data set and creates all the pertinent formats, as follows.

    proc format library=work cntlin=rankfmts;
    run;

QC Report of Results

It's always a good idea to check the results of a process that becomes a black box to a larger process. In this case, each value of FMTNAME, which indicates a USC group, should have a frequency of ten, which denotes the rank levels, the deciles. The following FREQ procedure affords this simple validation report. Other validation checks should be considered, as well.

    proc freq data=rankfmts;
       tables fmtname / list missing;
    run;

Given that there are ten rank levels for each USC group, the Control data set contains ten observations for each unique USC code, which makes it small by comparison with the original data file. How much smaller? In this study, twenty thousand times smaller.

Assigning Rank Values

After creating the collection of formats that assigns a rank value based on a physician's total prescriptions specific to a USC group, the next step becomes obvious: use these formats. And, typical of using the SAS System, there are at least two ways of employing them.

One method uses the Macro Language to generate the nested IF statements that afford the logic needed to select the appropriate assignment statement based on the USC code. Each assignment statement will contain the appropriate character format that maps a physician's total prescriptions (TRX) to its rank value. Thus, for example, if a physician wrote twenty-five prescriptions of Celebrex, having USC code 09100 (in the Cox II drug class), the nested IF statements would branch to the assignment statement that would use the format $u09100f in order to assign the rank value.
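As a small illustration of the lookup described for the Celebrex case, the sketch below builds one range format from a Control Input data set and applies it to a physician with twenty-five prescriptions. It is a sketch under stated assumptions: the three ranges are invented, the data set name CTL_DEMO is hypothetical, and the format is created as a numeric format (TYPE 'N') so that it can be applied directly to a numeric TRX, which differs from the character type used in the listings.

```sas
/* Hypothetical Control Input data set for a single USC group (09100) */
data ctl_demo;
   retain fmtname 'u09100f' type 'N';
   length label $2;
   input start end label $;
   datalines;
1 17 1
18 35 2
36 56 3
;
run;

proc format library=work cntlin=ctl_demo;
run;

data _null_;
   usc  = '09100';
   trx  = 25;
   rank = input(put(trx, u09100f.), best.);   /* 25 falls in the range 18-35 */
   put usc= trx= rank=;                       /* writes usc=09100 trx=25 rank=2 to the log */
run;
```

The format does the range search, so the Data step needs only one assignment per USC group, whichever way the format name is selected.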

The following macro generates the nested IF statements, along with their respective assignment statements. Notice that the %DO loop begins with the number two in order to generate the ELSE/IF statements.


    %macro nestedif;
       if usc eq "&usc1." then
          rank = input(put(usc,$u&usc1.f.),best.);
       %do i = 2 %to &n_usc.;
       else if usc eq "&&usc&i.." then
          rank = input(put(usc,$u&&usc&i..f.),best.);
       %end;
    %mend nestedif;

Now, assume that you wish to process only twenty-five percent of the very large data file. After a random sample of the original data file has been read into a subset data set called SAMPLE, the following Data step uses the macro %NESTEDIF to impute the rank value for the physicians in the sample.

    data ranks;
       set sample;
       %nestedif;
    run;

To develop a better understanding of this solution, consider a partial listing of the following SAS log.

    1040  data ranks;
    1041     set sample;
    1042     %nestedif;
    MPRINT(NESTEDIF):   if usc eq "31151" then rank = input( put(usc,$u31151f.), best.);
    MPRINT(NESTEDIF):   else if usc eq "31152" then rank = input( put(usc,$u31152f.),best.);
    MPRINT(NESTEDIF):   else if usc eq "32211" then rank = input( put(usc,$u32211f.),best.);
       :
    MPRINT(NESTEDIF):   else if usc eq "82300" then rank = input( put(usc,$u82300f.),best.);
    1250  run;

An alternative to using the Macro Language would be to use dynamic formats in SAS. In order to use this newer technique, however, you need one more format, called $USCF, that maps each USC code to the name of its respective format. Recall the data set USC_FREQ that contains the unique USC groups, which the SQL step reads to build the Control Input data set for the extra format.

    proc sql noprint;
       /* Assumed: the table name USCFMT and the PROC FORMAT step below */
       /* are not in the original listing, which shows only the SELECT. */
       create table uscfmt as
       select "USCF" as fmtname,
              "C" as type,
              usc as start,
              "u"||usc||"f" as label
          from usc_freq;
    quit;

    proc format library=work cntlin=uscfmt;
    run;

Then, you would use the PUTC function in the following Data step, in lieu of the large nested IF statement created by the %NESTEDIF macro.

    data ranks;
       set sample;
       uscf = put(usc,$uscf.);
       rank = putc(trx,uscf);
       keep doc_id usc trx rank;
    run;

Caveats

What if the large raw data file is not already sorted? Assuming that the data file resides on a high-end machine, you could use a sorting routine, such as Syncsort®, then proceed with the proposed partitioning method.

This method addresses the problem of doing by-group processing with a very large data file with limited computer resources. It therefore assumes that every by-group (USC group) is itself of manageable size.

Another problem concerns the ranking process that generates the data sets needed to produce the formats. What if the ranking process is such that the START / END values overlap?

Conclusion

Using a sequential-partitioning method, it's possible to perform by-group processing on a large data file, despite limited computer resources. By taking advantage of common options of the INFILE statement and making good use of the Macro Language, this method is a natural solution to a rather nasty data management problem.

By creating user-defined formats specific to a USC group, the task of imputing any physician's rank value becomes easy and efficient. Moreover, the ranking process is independent of the task of imputing a physician's rank with respect to a USC group.

Author Information

Anshuman Panigrahi
marketRx
Bridgewater, NJ
267.242.0207

John R. Gerlach
PRA International
Horsham, PA
215.284.2176

SAS® is a registered trademark of SAS Institute.

Syncsort® is a registered trademark of Syncsort, Inc.
