I thank the many R developers for providing such wonderful tools for free and .....
statistics book on R is Modern Applied Statistics in S, by Venables and Ripley.
R FOR SAS AND SPSS USERS
Bob Muenchen
I thank the many R developers for providing such wonderful tools for free and all the r‐help participants who have kindly answered so many questions. I'm especially grateful to the people who provided advice, caught typos and suggested improvements including: Patrick Burns, Peter Flom, Martin Gregory, Charilaos Skiadas and Michael Wexler. SAS® is a registered trademark of SAS Institute. SPSS® is a trademark of SPSS Inc. MATLAB® is a trademark of The Mathworks, Inc. Copyright © 2006, 2007, Robert A. Muenchen. A license is granted for personal study and classroom use. Redistribution in any other form is prohibited. 1
Introduction ..................................................................................................................................... 4 The Five Main Parts of SAS and SPSS ............................................................................................... 4 Typographic & Programming Conventions ..................................................................................... 5 Help and Documentation ................................................................................................................ 6 Graphical User Interfaces ................................................................................................................ 7 Easing Into R .................................................................................................................................... 7 A Few R Basics ................................................................................................................................. 7 Installing Add‐on Packages .............................................................................................................. 9 Data Acquisition ............................................................................................................................ 10 Example Text Files ..................................................................................................................... 10 The R Data Editor ...................................................................................................................... 10 Reading Delimited Text Files ..................................................................................................... 11 Reading Text Data within a Program (Datalines, Cards, Begin Data…) .................................... 13 Reading Fixed Width Text Files, 1 Record per Case .................................................................. 14 Reading Fixed Width Text Files, 2 Records per Case ................................................................. 15 Importing Data from SAS .......................................................................................................... 17 Importing Data from SPSS ......................................................................................................... 18 Exporting Data to SAS & SPSS Data Sets ................................................................................... 18 Selecting Variables and Observations ........................................................................................... 19 Selecting Variables – Var, Variables= ........................................................................................ 19 Selecting Observations – Where, If, Select If ............................................................................ 26 Selecting Both Variables and Observations .............................................................................. 32 Converting Data Structures ....................................................................................................... 32 Data Conversion Functions ....................................................................................................... 33 2
Data Management ......................................................................................................................... 33 Transforming Variables ............................................................................................................. 33 Conditional Transformations .................................................................................................... 36 Logical Operators .................................................................................................................. 36 Conditional Transformations to Assign Missing Values ............................................................ 38 Multiple Conditional Transformations ...................................................................................... 41 Renaming Variables (…and Observations) ................................................................................ 42 Recoding Variables .................................................................................................................... 45 Keeping and Dropping Variables ............................................................................................... 48 By or Split File Processing .......................................................................................................... 48 Stacking / Concatenating / Adding Data Sets ........................................................................... 50 Joining / Merging Data Frames ................................................................................................. 50 Aggregating or Summarizing Data ............................................................................................ 52 Reshaping Variables to Observations and Back ........................................................................ 55 Sorting Data Frames .................................................................................................................. 57 Value Labels or Formats (& Measurement Level) ......................................................................... 58 Variable Labels .............................................................................................................................. 63 Workspace Management .............................................................................................................. 65 Workspace Management Functions ......................................................................................... 66 Graphics ......................................................................................................................................... 67 Analysis .......................................................................................................................................... 71 Summary........................................................................................................................................ 78 Is R Harder to Use? ........................................................................................................................ 79 Conclusion ..................................................................................................................................... 80 3
INTRODUCTION The goal of this document is to provide an introduction to R that that is tailored to people who already know either SAS or SPSS. For each of 27 fundamental topics, we will compare programs written in SAS, SPSS and the R language. Since its release in 1996, R has dramatically changed the landscape of research software. There are very few things that SAS or SPSS will do that R cannot, while R can do a wide range of things that the others cannot. Given that R is free and the others quite expensive, R is definitely worth investigating. It takes most statistics packages at least five years to add a major new analytic method. Statisticians who develop new methods often work in R, so R users often get to use them immediately. There are now over 800 add‐on packages available for R. R also has full matrix capabilities that are quite similar to MATLAB, and it even offers a MATLAB emulation package. For a comparison of R and MATLAB, see http://wiki.r‐project.org/rwiki/doku.php?id=getting‐started:translations:octave2r. SAS and SPSS are so similar to each other that moving from one to the other is fairly straightforward. R however is totally different, making the transition confusing at first. I hope to ease that confusion by focusing on the similarities and differences in this document. It may then be easier to follow a more comprehensive introduction to R. I introduce topics in a carefully chosen order so it is best to read this from beginning to end the first time through, even if you think you don't need to know a particular topic. Later you can skip directly to the section you need.
THE FIVE MAIN PARTS OF SAS AND SPSS While SAS and SPSS offer many hundreds of functions and procedures, these fall into five main categories: 1. Data input and management statements that help you read, transform and organize your data. 2. Statistical and graphical procedures to help you analyze data. 3. An output management system to help you extract output from statistical procedures for processing in other procedures, or to let you customize printed output. SAS calls this the Output Delivery System (ODS), SPSS calls it the Output Management System (OMS). 4. A macro language to help you use sets of the above commands repeatedly. 5. A matrix language to add new algorithms (SAS/IML and SPSS Matrix). 4
SAS and SPSS handle each with different systems that follow different rules. For simplicity’s sake, introductory training in SAS or SPSS typically focus on topics 1 and 2. Perhaps the majority of users never learn the more advanced topics. However, R performs these five functions in a way that completely integrates them all. So while we’ll focus on topics 1 and 2 with when discussing SAS and SPSS, we’ll discuss some of all five regarding R. Other introductory guides in R cover these topics in a much more balanced manner. When you finish with this document, you will want to read one of these; see the section Help and Documentation for recommendations. The integration of these five areas gives R a significant advantage in power. This advantage is demonstrated by the fact that most R procedures are written using the R language. SAS and SPSS procedures are not written using their languages. R’s procedures are also available for you to see and modify in any way you like. While only a small percent of SAS and SPSS users take advantage of their output management systems, virtually all R users do. That is because R's is dramatically easier to use. For example, you can create and store a regression model with myModel= 5). LIST. EXECUTE. TEMPORARY. SELECT IF(gender = "m"). SAVE OUTFILE='C:\males.sav'. EXECUTE .
28
R
TEMPORARY. SELECT IF(gender = "f"). SAVE OUTFILE='C:\females.sav'. EXECUTE . # R Program to Select Observations. load(file="c:\\mydata.Rdata") attach(mydata) print(mydata) #---SELECTING OBSERVATIONS BY INDEX # Print all rows. print( mydata[1:8, ] ) # Just the males: print( mydata[5:8, ] ) # Negative numbers exclude rows. # So this excludes the females in rows 1 through 4. # The isolate function is used to apply the minus # to 1,2,3,4 and prevent -1,0,1,2,3,4. print( mydata[-I(1:4), ] ) # The which function can find the index numbers # of the true condition. which(gender=="m") # You can use those index numbers like this. print( mydata[which(gender=="m"), ] ) # You can make the logic as complex as you like: print( mydata[ which(gender=="m" & q4==5), ] ) # You can save the indices to a numeric vector stored OUTSIDE the # original data frame. Otherwise how would you store the 5,6,7,8 # values in a data frame that has 8 rows? happyGuys