Using PROC SQL In Data Management Applications

Merry G. Rabb, SAS Consulting Services

Abstract

The introduction of PROC SQL in Release 6.06 provides the SAS® applications developer with a powerful new data management tool. When determining what coding techniques are to be used in a production application, you may want to consider PROC SQL as an alternative to traditional SAS Data and PROC step methods. Using examples based on real applications, this paper compares using PROC SQL with other techniques and presents guidelines for the use of PROC SQL and SQL views.
Introduction

For those of us already familiar with SQL terminology and syntax, and even for those of us who are just beginning to learn SQL, PROC SQL provides a very quick and convenient way to extract subsets of data or combine data sets. SAS data views, which are SAS data sets that do not contain data but rather contain instructions on how to access the data, can also be created by PROC SQL. This paper will not discuss the many ways in which PROC SQL can and will be used for ad hoc reporting or other non-production jobs. Instead we will look at why PROC SQL may be useful in a production applications development environment.

In developing a production application, many factors go into the process of determining what coding techniques to use. Among the factors to consider are:

1) Computer resource efficiency. A technique may be selected because it uses less CPU time or less I/O than other techniques. This is typically a factor with applications that process large amounts of data relative to the computer system capacity, and with organizations that have real computable costs associated with their computer resource usage. When efficiency is a consideration, the decision as to which technique to use should be made after testing alternative techniques and collecting and comparing information on the resources used by each technique.

2) Ease of system maintenance. Sometimes a simpler or more readable coding approach is preferable to one that may actually be more computer efficient, because maintainability is a primary consideration. This may be true in a case where the people who maintain and modify the system are not the same people who developed it, or because the system must be changed frequently.

3) Ease of integration with the rest of the application. An example of this type of concern might be a case where SAS code is being generated by SAS macro logic, or by Screen Control Language (SCL) in a SAS/AF® system. It may be significantly easier to generate a PROC PRINT step, for example, than a customized Data step report.

4) Providing a unique function. Sometimes a particular coding technique is the only practical way to implement a requirement.
We will look at some applications, and compare PROC SQL with other techniques in light of these considerations.
Computer Efficiency

The decision to use a technique such as PROC SQL because of efficiency considerations should only be made after doing benchmarking runs where you compare one technique versus another in as realistic and controlled an environment as possible. For this reason, this paper cannot make any definitive claims concerning when PROC SQL should be used for efficiency reasons.
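As a minimal sketch of how such benchmarking statistics might be gathered (this option is not part of the paper's figures), the FULLSTIMER system option asks SAS to write detailed per-step resource statistics to the log; each candidate technique can then be run under it and the log notes compared.

options fullstimer;   /* write CPU time (and, under MVS, EXCP counts) for every step to the log */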
While comparing PROC SQL with other techniques, some general observations about the comparative efficiency of PROC SQL were made. In general it was found that Data steps can be written to be more efficient than PROC SQL for most simple queries. For example, Table 1 shows CPU time and EXCP count statistics (collected under MVS) for programs that count the observations and create a new data set for a particular value of DEPT, representing about 12 percent of a 200,000 observation data set. A Data step approach was always faster than the SQL approach. Indexing the data set on DEPT improved the performance of PROC SQL, but it also improved the performance of the Data step.
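As a rough sketch of the kind of comparison behind Table 1 (the library name, data set name, and DEPT value below are placeholders, not the author's code), the two approaches might look like this; the index is created once and is then available to WHERE processing in either approach.

/* Data step approach: count the rows and create the subset in one pass */
data subset;
   set mylib.general;       /* hypothetical 200,000-observation data set           */
   where dept = 'D01';      /* hypothetical DEPT value, about 12 percent of the file */
   count + 1;               /* running count of selected observations              */
run;

/* PROC SQL approach: one query counts, a second creates the table */
proc sql;
   select count(*) as nrows
      from mylib.general
      where dept = 'D01';
   create table subset as
      select *
      from mylib.general
      where dept = 'D01';
quit;

/* index DEPT so that WHERE processing can use it in either approach */
proc datasets library=mylib nolist;
   modify general;
   index create dept;
quit;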
PROC SQL is sometimes more computer efficient for combining SAS data sets. This is especially true when you can avoid sorting by using PROC SQL. In our example, data in the GENERAL data set (200,000 observations) is grouped by the variable EMPLOYEE, but not sorted. Because the NOTSORTED option cannot be used on a MERGE statement, the GENERAL data set must be sorted (see Figure 1). Note that this very lengthy code could be replaced by a single PROC SQL SELECT statement (see Figure 2 and Table 2). Since PROC SQL does not require the sort step, it runs faster than the Data step/PROC SORT approach.

PROC SQL can also be more efficient for deleting or updating records in a SAS data set. The operation is performed in place, as opposed to a Data step approach, which always involves reading and writing the entire data set. PROC SQL is even more efficient in these cases if the WHERE clause can take advantage of an index and the number of observations meeting the WHERE condition is relatively small in comparison to the size of the data set.

PROC SQL is almost always more efficient for combining two SAS data sets when one is a large, indexed data file and the second is a smaller data file being used to select observations from the larger file. An example would be a large data file indexed on Part Number, containing pricing information for a company's entire inventory of parts. The smaller file represents an order that needs to be processed. Only the parts that were ordered need to be read from the large file. In our example (Figure 3), the PARTS file has 200,000 observations, and the ORDER file has 279 observations. The Data step approach, using a MERGE statement with an IN= variable, took 11.65 seconds CPU time and had an EXCP count of 1528. The PROC SQL approach took .35 seconds CPU time and had an EXCP count of 576. This happens because in performing the join, PROC SQL uses the index to read only those observations from the large file that are actually needed. The Data step cannot do this because the observations must actually be read and the merge performed before the value of the IN= variable can be checked.

Using an SQL view may be more efficient than actually recreating a SAS data set each time the data changes. This is very much dependent on the specific application. For example, how often would the SAS data set actually have to be recreated if you did not use a view? Also, how much code is required to determine whether or not the SAS data set actually needs to be recreated? Examples of creating and using SQL views are presented in later sections.
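The in-place DELETE and UPDATE mentioned above have no figure of their own; a minimal sketch (the table, columns, and values here are assumptions, not the author's code) would be:

/* rows are removed or changed in place; no copy of the data set is written */
proc sql;
   delete from mylib.general
      where dept = 'D99';            /* hypothetical value matching relatively few rows */

   update mylib.general
      set salary = salary * 1.05     /* hypothetical column and adjustment              */
      where dept = 'D01';
quit;

If DEPT is indexed, PROC SQL can use the index to locate the qualifying rows directly.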
Ease Of Maintenance

In general, once you are familiar with PROC SQL syntax, the code is easy to write, read, and modify. For example, in the three-way join code above (Figures 1 and 2), the PROC SQL code is easily interpreted. We can see immediately that we are selecting information from three tables at once, how the three tables relate to one another in terms of matching key variables, which columns on the report are computed in the code versus which ones are stored, and so on. There is much less code to deal with, and maintenance should be relatively simple.

There are very few examples where PROC SQL code is NOT simpler and more compact than traditional coding solutions. One example can be found when you want to concatenate two SAS data sets (see Figure 4). In a non-SQL coding solution you could use PROC APPEND, which would be essentially one line of code. One SQL approach would be to use the SQL INSERT statement to add rows. If you also need to do some other type of processing at the same time, or if the data sets to be combined have dissimilar structures, the traditional SAS approach is a Data step using a SET statement without a BY statement. In PROC SQL, however, you would use the OUTER UNION CORRESPONDING operator. Since this feature involves longer and more complex code than the Data step approach, you may only want to use it when you are creating an SQL view and must use PROC SQL.
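For reference, a hedged sketch of the kind of single SELECT that Figure 2 describes, reusing the data set and variable names from Figure 1 (the join conditions and computed column are assumptions, not the author's exact code), might look like this:

proc sql;
   create table report as
   select g.*, m.*,
          g.sales * g.unitprce * g.numord as comminc   /* computed on the fly, not stored  */
     from q.general g, q.employ e, q.monthly m
    where g.employee = e.name                          /* keys need not have the same name */
      and e.salary < 20000
      and g.month = m.month;
quit;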
Ease Of Integration

In terms of building production applications for clients, ease of integration is probably the most common deciding factor in cases where we have used PROC SQL. When you are generating submitted SAS code from a menu-driven AF system based on user selections from multiple screens, or when your SAS code is generated using macro logic, the simplicity of SQL syntax is a big advantage. Figure 5 shows part of a Screen Control Language program that submits code to create a view that is later used to generate a report. The program is part of a SAS/AF system for a membership organization. In the display portion of the program, the user is asked to choose such options as summary versus detail report, general versus custom report, whether to select based on a date range, and whether to select based on the type of member. The PROC SQL step is generated in pieces (submit blocks) as the SCL variables that represent the user choices are evaluated.
Unique Features

PROC SQL has a number of unique features that make it easier or more efficient to use in some cases (such as the ability to merge without sorting, or the ability to merge on different keys without renaming). There is also the ability to create SQL views. A view allows us to reference a SAS data set name in our code, and instead of actually referencing data, we are referencing instructions on how and where to obtain the data.

One reason to create a view rather than a SAS data file is illustrated in the SAS/AF example above. The SQL code was created and submitted to SAS from an interactive menu-driven system that may have had many other features and options available to the user. If we had coded "create table" instead of "create view" in the program above, the final submit block would have submitted SQL code that would actually perform the data extraction while the user waited for processing to complete. Instead we simply created a view, which takes very little time. When the user exits the application, we then submit the reporting code that references the view. The actual data are extracted and printed after the user is done, so he or she does not have to sit staring at the screen while the report executes.

Referencing a view in our programs also enables us to be sure that the data we are reading are always up to date. Since the view we reference represents instructions on how to obtain the current data, rather than the data itself, we don't have to depend on another process to create the current version of the data set we need. A variation on this idea is an application that might change the data, then use it. For example, what if your application updates a master file, and then in a different process you need a unique list of ID numbers? You might want to use a view to obtain the unique list of ID numbers so that you don't have to first determine whether the file has been updated and then recreate the list of ID numbers if necessary. An example of why you might use a view in an application that modifies the data is found in the case study below.
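As a sketch of the unique-ID-list idea above (the master file and variable names are placeholders, not from the paper), the view is defined once and every later reference to it reads the data as it currently stands:

proc sql;
   create view idlist as
      select distinct id
      from mylib.master;    /* hypothetical master file that the application updates */
quit;

/* any step that uses the view sees the current contents of the master file */
proc print data=idlist;
run;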
A Case Study In Data Management

For this application, the users wanted an interactive menu-driven system that would execute a PROC NETFLOW model. PROC NETFLOW is a feature of SAS/OR® software that solves network optimization problems such as, in this case, a minimum cost model for product distribution. In addition to executing the model, the users also wanted features that would allow them to edit or browse the data that represented the input to the model, and they wanted the application to compute final costs for the model based on some preliminary costs and various factors. The model represented several processing plants that produced several products and shipped them to several sales territories. Except for the product supply and demand data, which were provided in a form that could be directly read into PROC NETFLOW with the NODEDATA= option, the primary input data structures that were provided, and that the users wanted to be able to edit, were:

PLNTPROD: Some data values, such as manufacturing costs, were unique for a combination of Plant and Product.

PLNTTERR: Some data values, such as shipping costs, were defined by a Plant and Sales Territory combination.
For the ARCDATA= input to PROC NETFLOW, however, we needed a data set that had one observation for each Product/Territory combination within each plant. Also, final costs had to be computed at that level, since they were a function of both manufacturing and shipping costs. Creating this third Plant/Product/Territory data set, which we called NETWORK, was not completely straightforward. We could not just merge the two data sets, since what we wanted was all possible combinations of PRODUCT and TERRITORY within each value of PLANT. The prototype application was developed in Version 6.04, where PROC SQL is not available, so Data step code was used (see Figure 6). As part of evaluating the efficiency of PROC SQL for this type of application, the application was uploaded to the mainframe and tested there. The SQL approach involved creating a SAS data view that represents instructions on how to obtain the NETWORK data (see Figure 7). PROC SQL took only .02 CPU seconds to create the view, as opposed to a total of .39 CPU seconds to create the SAS data file, but PROC NETFLOW ran for .49 more CPU seconds when the view was referenced in the ARCDATA= option instead of the SAS data file.
There is a major advantage to using a view instead of a data file in this application. In the interactive system, any time the user edits one of the two SAS data sets used as input, the NETWORK data set must be recreated before the model can be run. Initially we took the approach of executing appropriate pieces of code when the edit option was selected for one of the two data sets. The users then made the point that they would most likely be editing a lot more often than they would be running the model, so it would be more efficient to recreate the data set right before running the model. The final complication was that the users might edit one of the input data sets, then select to browse the NETWORK data set, and they would expect the changes they just made to be reflected. Without the availability of PROC SQL and SAS data views, there is no ideal solution to this problem. Programming logic could be devised to determine whether or not the NETWORK data set actually needed to be recreated, but this would be cumbersome and difficult to maintain. If the final version of the system is implemented in Release 6.06, we will use the SQL approach of creating a SAS data view. This view could be referenced in the browse step as well as the PROC NETFLOW step. This way we do not need the logic in the interactive system to determine whether or not the data set must be recreated.
Conclusions

As with any other programming technique or tool, it is up to the individual programmer to determine whether PROC SQL meets his or her needs for a specific application. As applications developers, we find that PROC SQL gives us added capability and flexibility, particularly in the areas of user-generated SAS code, code readability, and the ability to create SAS data views instead of SAS data files.

The author may be contacted at:
SAS Consulting Services Inc.
1700 Rockville Pike, Suite 330
Rockville, MD 20852
SAS, SAS/AF, and SAS/OR are registered trademarks of SAS Institute Inc., Cary, NC, USA.
FIGURE 1: THREE WAY JOIN DATA STEP EXAMPLE

/* SORT BY EMPLOYEE NAME - 200,000 OBS */
proc sort data=q.general out=gen;
   by employee;
run;

/* SORT BY EMPLOYEE NAME - ONLY ONE OBS PER EMPLOYEE */
proc sort data=q.employ out=emp;
   by name;
run;

/* CREATE TEMPORARY FILE OF MERGED EMPLOYEE DATA */
data temp;
   merge gen(rename=(employee=name)) emp;
   by name;
   if salary < 20000 and dept='JEWELRY';
run;

/* RE-SORT TEMPORARY FILE BY MONTH IN PREPARATION */
/* FOR MERGING WITH THIRD FILE                    */
proc sort data=temp;
   by month;
run;

/* MONTH IS A CHARACTER FIELD AND IS STORED GROUPED IN    */
/* LOGICAL (NOT ALPHABETICAL) ORDER, SO MONTHLY DATA SET  */
/* MUST ALSO BE SORTED                                    */
proc sort data=q.monthly out=months;
   by month;
run;

/* FINAL MERGE AND COMPUTE COMMISSION INCOME */
data report;
   merge temp months;
   by month;
   comminc = sales * unitprce * numord;
run;
FIGURE 3: PROC SQL Join With Indexed Table

/* proc sql to perform join with indexed table */
proc sql;
   create table test as
   select P.part_id, unitprce, numord, (unitprce*numord) as totamt
     from n.parts P, n.orders O
     where P.part_id=O.part_id;
quit;

FIGURE 4: Concatenating Two SAS Data Sets

A) proc append base=save.info data=new.info;
   run;

B) proc sql;
   insert into save.info select * from new.info;
   quit;

C) data save.info;
   set save.info new.info;
   run;

D) proc sql;
   create table info as
   select * from save.info
   outer union corresponding
   select * from new.info;
   quit;
FIGURE 5: Using SCL to Generate PROC SQL Code
(Note: These are code fragments of a program entry)

/* sample SCL code fragments */
replace begdate "and &begdate le item.date";
replace enddate "and item.date le &enddate";
replace memtype 'and members.type = "&memtype"';

submit;
   proc sql;
   create view selected as select
endsubmit;

if rpttype = 'SUM' then do;   /* collapse amounts into one total */
   submit;
      distinct sum(amount) as total
   endsubmit;
end;                          /* collapse amounts into one total */

if rptform = 'GEN' then do;   /* standard report */
   submit;
      memb_id amount date          /* from ITEM data set    */
      addr1 addr2 city state zip   /* from MEMBERS data set */
   endsubmit;
end;                          /* standard report */
else do;                      /* custom report */
   /* use varlist function to prompt for variables from */
   /* each data set ... SCL not shown                    */
   /* then submit variable lists                         */
   submit;
      &itemvars
      &memvars
   endsubmit;
end;                          /* custom report */

submit;
   from ITEM, MEMBERS
   where memb_id=id
   &begdate &enddate           /* replacement strings */
   &memtype
endsubmit;

if rpttype = 'SUM' then do;   /* summary report */
   submit;
      group by memb_id
   endsubmit;
end;                          /* summary report */

/* final submit with continue option actually submits */
/* all code generated so far                          */
submit continue;
   order by memb_id;
endsubmit;
FIGURE 6: Use Data Steps to Create NETWORK Data Set

/* Create SAS data sets to use as PROC FORMAT input  */
/* Formats will be used to get pointer values for    */
/* the PLNTPROD data set in the final Data step      */
data begin end;
   drop begin end;
   retain begin 0 type 'C';
   set test.plntprod;
   by plant;
   if first.plant then begin=_n_;
   if last.plant then do;
      start=plant;
      fmtname='$begin'; label=put(begin,4.); output begin;
      end=_n_;
      fmtname='$end';   label=put(end,4.);   output end;
   end;
run;

proc format cntlin=begin;
run;
proc format cntlin=end;
run;

data network(drop=start stop);
   set test.plntterr;
   /* obtain pointers from format */
   start=input(put(plant,$begin.),4.);
   stop=input(put(plant,$end.),4.);
   do i=start to stop;
      /* read corresponding product data */
      set test.plntprod(drop=plant) point=i;
      _cost_=prodcost+shipcost;
      _tail_=plant;
      _head_=terr || '_' || product;
      output;
   end;   /* read corresponding product data */
run;
FIGURE 7: PROC SQL To Combine Two Data Sets

proc sql;
   create view netview as
   select *,
          (prodcost+shipcost) as _cost_,
          T.plant as _tail_,
          terr || '_' || product as _head_
     from test.plntterr T, test.plntprod P
     where T.plant=P.plant;
quit;

proc netflow arcdata=netview arcout=arcout
             nodedata=test.supdem nodeout=nodeout;
run;
TABLE 1
Count rows for a specific value of DEPT / Create table selecting rows for one value of DEPT
(statistics are CPU Time / EXCP Count)

No Index:        5.44 / 652     5.73 / 745     5.33 / 747     3.07 / 749
Index on DEPT:   1.91 / 1959    2.22 / 2042    1.79 / 2055    3.08 / 753

NOTE: Data step approaches perform both "COUNT" and "CREATE" functions.
TABLE 2: Three Way Join Example
(statistics are CPU Time / EXCP Count)

PROC SQL:                                        10.28 /  864
DATA STEP APPROACH (including sort):             15.07 / 2363
DATA STEPS APPROACH (factoring out first sort):   8.94 /  953