
An Introduction to R for Beginners

Sasha Hafner and Adam Ryan
February 15, 2017


Contents

1 Introduction to R
  1.1 R overview and history
  1.2 Finding and installing R
  1.3 Using R: GUI & scripts
  1.4 Commands, assignment, and objects
2 Operators and functions
  2.1 Operators
  2.2 Functions
3 Getting help
  3.1 Help in general
  3.2 Help on functions
  3.3 Finding new functions
4 Data types and data objects in R
  4.1 Data types
  4.2 Overview of R data structures
  4.3 Vectors
  4.4 Matrices, arrays, and lists
5 Data frames, data import, and data export
  5.1 Reading data from files (and setting your working directory)
  5.2 Reading data from spreadsheet files
  5.3 Creating data frames manually
  5.4 Working with data frames
  5.5 Writing data to files
6 Working with vectors
  6.1 Vector arithmetic and vectorized functions
  6.2 Working with character data
  6.3 Factors
  6.4 Dates and times
    6.4.1 The lubridate package
    6.4.2 Dates and times in base R
7 Basic data manipulation
  7.1 Indexing and subsetting
  7.2 Sorting data and locating observations
  7.3 Combining data frames and matrices
8 Wickham’s packages for data manipulation
  8.1 Alternatives for subsetting, indexing, and renaming
  8.2 Grouped operations and the %>% operator
9 Aggregating and summarizing data
10 Base graphics
  10.1 Introduction to the plot function
  10.2 Adding data to plots
  10.3 Exporting graphics
11 Packages
12 Exploratory data analysis
  12.1 Summary statistics
  12.2 Counts and contingency tables
  12.3 Histograms and other summary plots
  12.4 Normal quantile and cumulative probability plots
13 t tests
14 Linear regression
  14.1 The lm function, model formulas, and statistical output
  14.2 Linear regression
15 Analysis of variance and analysis of covariance
  15.1 Analysis of variance
  15.2 Analysis of covariance
16 Generalized linear models
  16.1 Introduction to glm
  16.2 Binary responses
  16.3 Count data
17 Random and mixed effects models
  17.1 Introduction to the lme4 package
  17.2 Random effects models
  17.3 Mixed effects models
18 Nonlinear regression
  18.1 A little bit on optimization
  18.2 The nls function
  18.3 The nls.lm() function
19 Basic R programming
  19.1 Loops and grouping
  19.2 Conditional statements
  19.3 Writing simple functions
20 Base graphics, part II
  20.1 Arranging multiple plots per page
  20.2 More on the plot function: arguments and values
21 Common mistakes
22 Where to go next
23 References

1 Introduction to R

1.1 R overview and history

R is a programming language and a software system for computations and graphics. According to the R FAQ¹, “[i]t consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.” R was originally developed in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. The R language is a “dialect” of the S language², which was developed (mainly) by John Chambers at Bell Laboratories. This software is currently maintained by the R Core Team, which consists of more than a dozen people, and includes Ihaka, Gentleman, and Chambers. Many other people have contributed code to R since it was first released. R is open source; the source code for R is available under the GNU General Public License, meaning that users can modify, copy, and redistribute the software or derivatives, as long as the modified source code is made available. The software is regularly updated, but changes are usually not major.

1.2 Finding and installing R

The R Core Team maintains a network of servers that contains installation files and documentation on R, called the Comprehensive R Archive Network, or CRAN. You can access it through http://cran.r-project.org/, or a Google search for CRAN R. R is available for Windows, Mac, and Unix-like operating systems. Installation files and instructions can be downloaded from the CRAN site by selecting one of the download links at the top. Although the graphical user interfaces (GUIs) and their menus differ across systems (if present at all), the R commands do not.

1.3 Using R: GUI & scripts

There are two basic ways to use R on your machine: interactively through a graphical user interface (GUI) or shell, where R evaluates your code and returns results as you work, or by writing, saving, and then running R script files. Note that even if you use a GUI, R is not like typical “selection-and-response” software that you might be used to. Instead, you have to compose expressions (or commands) that R interprets³. This approach may make R more difficult to learn than a typical Windows or Mac program, but it comes with many advantages, including flexibility, efficiency, and repeatability.

We will work directly in a GUI for most of this workshop. If you are working in Windows, your R GUI will look something like the screenshot below (Fig. 1). The left window is the R console, where you enter commands. I’ll use the term “console” in this workbook to refer to this window or a shell that is used for R. The middle window is a simple “script editor”; more on that below.

Figure 1: An R GUI in Windows 7.

The R GUI for Mac is a bit nicer; a recent example is shown below (Fig. 2). Again, the left window is the console, where you enter commands, and the middle window is a script editor.

Figure 2: An R GUI in Mac OS X.

For Unix-like operating systems, there is no R GUI. Instead, you access R through a command-line shell, such as Bash, which is the purple window in the screenshot shown below (Fig. 3)⁴. In this case, you might use a stand-alone text editor, like GVIM shown to the left in the screenshot below, to work with your scripts.

Figure 3: Running R through a command-line shell in Ubuntu Linux.

¹ http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Basics
² The S language is also used in the commercial software S-PLUS, which is very similar to R.
³ To get around writing code altogether there are some icon-driven programs that interface with R, e.g., R Commander (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/). But I recommend you stick to writing R commands yourself; otherwise you will miss out on some of the major advantages of R.

R script files (or scripts) are just text files that contain the same types of R commands that you can submit to a GUI or shell. Because everything you do in R can be recorded as a text command, script files provide an accurate way to record exactly what steps you carry out for a particular analysis. Once written, scripts can be submitted to R using an R GUI or a shell. All the code covered in this workbook will work if directly typed into the GUI, or it can be saved in a script file which can then be submitted to R⁵.

Working only interactively with R usually doesn’t make sense. If your code is worth writing, it is probably worth saving. An advantage of R over software that requires numerous mouse clicks to run an analysis is that you can save an exact record of your analysis in a script file. You can then repeat the same analysis in the future, perhaps with an updated data set, or you can modify your script to add new results or change some of the steps.

Most R users probably work simultaneously with an interactive interface (a GUI or command-line shell) and a script file editor. For example, I generally write code in a text editor and test and tweak it using a command-line shell as I work.

So, how do you work with scripts? Any simple text editor works; you just need to be able to edit and save text files. But if you want more features, like syntax highlighting, autocomplete, or send-to-R functionality, there are other options. The Windows and Mac OS X versions of the R GUI come with script editors, shown as the middle windows in the screenshots (Figs. 1 and 2). Both editors allow you to edit and create scripts, and also submit commands with the click of a button. The Windows version is pretty basic. The Mac OS X version, however, has several useful features, including syntax highlighting.

⁴ You can also run R through a command-line shell in Windows or Mac OS X operating systems.
⁵ There is at least one difference between scripts and the GUI: with scripts, the results are not automatically printed. To manually print to the output file, use the function print.

There are some useful (in some cases free) text editors or even “integrated development environments” (IDEs) available that can be set up with R syntax highlighting and other features. I recommend RStudio⁶, which is available for Windows, Mac OS X, and Linux operating systems. It is now much more than a script editor, and includes tools for building packages and writing dynamic reports, among others. For a simpler option for Windows, Notepad++ is a good choice. It is a general purpose text editor, and includes syntax highlighting for R, and the ability to send code directly to R with the NppToR plugin⁷. Several other options are available for Windows and other operating systems; see http://www.sciviews.org/_rgui/projects/Editors.html for more information.

Once you have a script file that you would like to run, you can send it to R by entering the command source("filename.R") in the R GUI console or a command-line shell that is running R, or by entering the command R CMD BATCH filename.R⁸ directly in a command-line shell (e.g., a Windows Command Prompt console).
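As a rough sketch of running a script with source(), the example below writes a small throwaway script to a temporary file and then runs it. The script contents and the variable name x are made up for illustration; in practice you would pass the path of your own script file. It also illustrates that results in a sourced script are not printed automatically unless print is called.

```r
# Write a tiny three-line script to a temporary file (hypothetical contents).
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- sqrt(10)", "x", "print(x)"), script_path)

# source() evaluates each line of the file in turn. The bare `x` on the
# second line is not printed automatically, but the explicit print(x) is.
source(script_path)

# With echo = TRUE, each command and its result are shown, much as they
# would be if the lines were typed into the console one at a time.
source(script_path, echo = TRUE)
```

After either call, the object x created by the script is available in the workspace, so you can keep working with it interactively.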

1.4 Commands, assignment, and objects

In order to learn R, you need to learn the R language, and really, not much more. The R language is a mix of functional and object-oriented style [2]. For our purposes here, this means a few things. First, most (technically all) of the operations carried out in R are done by calling functions, e.g., sqrt(10) will calculate the square root of 10. And the values that are stored in symbolic variables (e.g., you might store the value 10 in a variable called x) are not changed unless there is some kind of assignment involved; more on assignment in a bit. Both of these traits are characteristics of functional programming languages.

The variables you create and work with in R are called objects. Objects are really anything that can be assigned to a symbolic variable; data structures (e.g., a matrix) and functions (e.g., sqrt) are examples of objects. Data structures and other objects in R have what are called classes, which are really just a description of the type of object, such as “numeric vector” for a collection of numbers⁹. Many functions in R operate differently on different types of objects; these functions are called generic. These traits are characteristics of object-oriented languages. Advantages of the approach used for the R language should become clear as we begin to work with R. In general, it means less work for the user than languages without these characteristics.

The instructions you give R are called commands. The basic approach to using R interactively is to type a command and hit Enter; R then evaluates what you typed and prints the result. For example, to calculate 10 − 6, enter 10 - 6 in your console

    > 10 - 6

and you’ll get this result:

    [1] 4

In your console, the > character is the prompt character, which indicates that R is ready to accept input. The result that R returns, 4, has [1] at the left side of the line. This number simply indicates the position of the adjacent element in the output; this will make more sense later when the output has more elements. Here, it isn’t needed.

⁶ Information and download available here: http://rstudio.org/.
⁷ Notepad++ can be downloaded here: http://notepad-plus-plus.org/download. NppToR is available here: http://sourceforge.net/projects/npptor/. And you can find instructions on using NppToR here: http://jekyll.math.byuh.edu/other/howto/notepadpp/using.shtml.
⁸ To execute R scripts in batch mode, your operating system needs to know where to find the R executable, and you may need to manually add the file location to the PATH environment variable.
⁹ The class of numeric vectors is actually "numeric".

For multi-line commands, an R console will display a “continuation symbol” + instead of the prompt >. So, for example, if we entered the same command as above, but hit Enter before typing the 6, this is what we would see:

    > 10 -
    + 6
    [1] 4

The continuation character indicates that R is waiting for something else, i.e., that you haven’t supplied a complete command.

The way commands are displayed in this workbook looks a little different from what you’ll see in a console:

    10 - 6
    # [1] 4

First, the prompt character > is left out, to make it easy to copy and paste code directly into R. The continuation symbol is also omitted. And all the output lines are preceded with #, which is a comment character in R, in order to facilitate copying and pasting.

For the above command, the result is printed to the screen and lost; there is no assignment involved. In order to do anything other than the simplest analyses, you must be able to store and recall data. In R, you can assign the results of a command to symbolic variables (as in other computer languages) using the assignment operator

[...]

    4)
    # [1] 9

But the table function is a better approach when dealing with multiple factor levels for individual variables, or even multiple variables. For example, let’s take a look at some data on soil respiration in Alaskan ecosystems.

gas
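The contrast drawn above, between counting one condition at a time and using table, can be sketched as follows. This is only an illustration under stated assumptions: the vector x and the factor ecosystem are invented values, not the soil respiration data, and counting with sum on a logical vector is one common way to count a single condition rather than necessarily the approach used in the omitted code.

```r
# Hypothetical data: a numeric vector and a factor of ecosystem labels.
x <- c(2, 7, 5, 1, 9, 4, 6)
ecosystem <- factor(c("tundra", "forest", "forest", "tundra",
                      "forest", "tundra", "forest"))

# A comparison returns a logical vector; sum() counts its TRUE values,
# which answers one question at a time.
n_large <- sum(x > 4)
n_large

# table() counts the observations in every factor level at once.
counts <- table(ecosystem)
counts

# Given two arguments, table() builds a two-way contingency table.
two_way <- table(ecosystem, x > 4)
two_way
```

With a single condition the two approaches give the same information, but table scales more naturally once a variable has several levels or when two variables are cross-classified.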
