A short Introduction to statistics using R
N.E. Zimmermann and K. Steinmann
April 6, 2006
http://www.r-project.org
Contents
1 What is R
2 How to get started with R
  2.1 Command line basics
  2.2 Basic commands
3 What object and data types does R distinguish?
  3.1 Scalars and vectors
  3.2 Matrices and data frames
  3.3 List
4 Dealing with data, objects and workspaces
5 Data handling
  5.1 How to prepare your data
  5.2 Import and Export of data
  5.3 Extracting information
6 Linear models
  6.1 Testing the significance of the coefficients
  6.2 Testing the significance of the model
  6.3 Multivariate linear models
  6.4 Prediction
7 Data distribution
  7.1 Goodness of fit test
8 Generalized linear models (glm)
  8.1 Generalized linear model of the family Poisson: an example
  8.2 Generalized linear model of the family binomial (logistic regression)
  8.3 Predictions with glms
9 Implementing functions in R
10 References
1
What is R
The software R is primarily a statistical package that allows the user to analyze and graphically display a wide array of data using an almost unlimited number of statistical techniques. At the same time it is an almost complete programming language, which enables the user to code and implement complex operations, data manipulations and statistical simulations. R is also well suited to storing moderately complex data structures, and it allows querying large databases using SQL. The graphics capabilities are outstanding, although partly demanding with respect to programming. Finally, R can be considered a flexible "environment for data analysis, statistics and graphics", where the term "environment" addresses the fact that R is a fully designed and coherent system rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. This allows the system to be adapted and extended very efficiently.

R is completely free of charge and available under the GNU General Public License. It is usually installed as a basic package that enables the user to perform the most important statistical analyses and that contains the full depth of the programming capabilities. Thus, every possible extension is in principle doable, given the necessary skills. However, most users will download add-on packages that provide specifically designed statistical analyses, data manipulation or display tasks as a fully developed set of commands. As an example: if we are to classify a set of objects based on measured or observed variables, we either code a classification routine ourselves (and only the highly skilled would attempt to do this), or we download an add-on package that contains all necessary commands to do so. The packages are all available at a centrally maintained web portal, and download and installation are easily done from within the R environment.
An add-on package usually contains a set of compiled functions, plus an update of the R help environment. Users are always free to code their own functions and to add these to the collection of commands available at the R prompt. The fact that R is open source software means that the code of each command can be made visible. This is an advantage for more advanced R users who would like to learn from existing code.

R basically represents a re-design of the commercially available S language. The vast majority of commands are the same, although small but important differences exist. These differences reflect attempts to make the R environment even more flexible, adaptable and consistent. S+, the most advanced commercial software package based on the S language, includes a full menu system, but also a command prompt. R, in contrast, can only be used at the command prompt. While this is a disadvantage for beginners, it is a huge advantage for more experienced users, since any command can be processed much faster by simply typing it than by working through a huge and complex graphical user interface with basically thousands of options (as are available at the level of R). Also, the command prompt makes it easier to run R in batch mode. Any set of commands that has been performed to reach a certain result can easily be copied and pasted to a log file, documented and archived. In this way, a user can always reconstruct what has been done to attain a specific result. This is close to impossible with graphical user interfaces.

R is available for all major operating systems, including versions of Windows, Macintosh, UNIX and Linux. The environment can be run within a simple window system that contains e.g. windows for a command prompt or for graphics. However, R can also be run from the command prompt of an operating system (terminal windows in UNIX, Linux or Mac OS X, or a DOS window under Windows/DOS). For practical reasons, most users will use the window system, and this introduction only addresses this mode of use.
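As a small illustration of the statement that the code of each command can be made visible, the following sketch prints the source of the built-in function sd and captures it as text. The package name in the commented lines is only a placeholder, not a package referred to in this manual, and installation would require a network connection.

```r
# Typing the name of a function without brackets prints its source code.
sd

# The same source can be captured as text, e.g. to inspect it line by line.
sd.source <- deparse(sd)
length(sd.source)

# Add-on packages are installed and loaded from within R.
# "packagename" is a placeholder; the calls are left commented out
# because they need a network connection and a real package name.
# install.packages("packagename")
# library(packagename)
```

For batch use, a file of such commands can be run from the operating system prompt with R CMD BATCH followed by the file name.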
2
How to get started with R
Assuming that the R basic package is installed, we now explain the most important commands and essentials for using R. All examples of commands are set in Courier type. Commands are entered as text at the command line prompt, represented by the character ‘>’. A command is submitted to R by hitting the carriage return or enter key, which we do not show in this manual in order to avoid confusion. A few simple command examples are included directly in the text and are not put on separate lines. A good way to get started and to find general help is to type the following command on the command line:
> help.start()
It leads you to a web page that contains a lot of necessary information. The manual below is a simplified introduction only. Alternatively, you can click the “Help” button in the main menu of the R environment. It offers some of the resources that the help.start() site provides.
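Besides help.start(), a few other standard help commands can be useful when getting started. The sketch below is only a suggestion of what to try; the search term "regression" and the pattern "summary" are arbitrary examples, not taken from this manual.

```r
# Browser-based and keyword help only make sense in an interactive session.
if (interactive()) {
  help.start()               # open the HTML version of the help
  help.search("regression")  # search the help pages for a keyword
}

# List all visible objects whose name contains "summary".
hits <- apropos("summary")
hits
```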
2.1
Command line basics
Each command is usually followed by brackets containing the arguments to the command. E.g. if we want to ask for help on the term “summary”, or if we want to quit R, we type:
help(summary)
quit()
The first means that the command help is to be performed for the term summary, while the second command has no arguments. Yet, the brackets are still necessary. You can also use R like a calculator: typing 3 + 4 * 3 (enter) performs the mathematical operations in the right order, interpreting the sequence as 3 + (4 * 3) = 15. R is case sensitive, meaning that the terms “A” and “a” address two different things.

Commands can be performed in two different ways. First, you can type the command and all its options; the result is then printed to the screen. Let us assume we want to get the numbers from 1 through 50 to the screen; we can e.g. type:
seq(1,50)
This prints a sequence of 50 numbers to the screen. However, if we wish to store these numbers in an object (in a variable, if you like), we “pipe” the seq() command into an object, e.g. into an object called x:
x <- seq(1,50)
[...]
points(lon[spc7>420],lat[spc7>420],pch=15,cex=1.1,col="violetred3")
points(lon[spc7>480],lat[spc7>480],pch=15,cex=1.1,col="darkmagenta")
points(lon[spc7>560],lat[spc7>560],pch=15,cex=1.1,col="darkblue")
legend(4.5,50.1,
  c("0-60","60-120","120-180","180-240","240-300","300-360",
    "360-420","420-480","480-560",">560"),
  pch=c(1,15,15,15,15,15,15,15,15,15),
  cex=c(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0),bty="n",
  col=c("black","grey70","yellow","gold2","orange","orangered","red",
    "violetred3","darkmagenta","darkblue"))
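Two idioms used in this part of the text, storing a sequence in an object and selecting elements with a logical condition as in the points() calls above, can be tried on small made-up data. The values generated below for lon, lat and spc7 are random stand-ins chosen for illustration only; they are not the data set used in this manual.

```r
# Store a sequence in an object instead of printing it.
x <- seq(1, 50)
length(x)      # 50 values
x[x > 45]      # a logical condition selects 46, 47, 48, 49, 50

# Made-up coordinates and values standing in for lon, lat and spc7.
set.seed(1)
lon  <- runif(30, 5, 10)
lat  <- runif(30, 45, 50)
spc7 <- runif(30, 0, 600)

# Plot all points, then overplot those above a threshold in another colour.
plot(lon, lat, pch = 1, col = "black")
high <- spc7 > 420
points(lon[high], lat[high], pch = 15, cex = 1.1, col = "violetred3")
```

Repeating the points() call with increasing thresholds, as in the commands above, colours each point by the highest class it reaches, because later calls draw over earlier ones.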
3.3
List
The most complex object is the list. It consists of a concatenation of the various object types which we have learned so far. Its basic characteristic is that no restriction on dimensionality is imposed any longer. Scalars, vectors, matrices or data frames of any length can be combined into a list. Usually, list objects are generated as the result of more complex statistical operations.
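A minimal sketch of such a list, built by hand from the object types named above. All object names and values are invented for illustration.

```r
# Objects of different type and dimension.
a.scalar <- 3.5
a.vector <- c(2, 4, 6, 8)
a.matrix <- matrix(1:6, nrow = 2)   # filled column-wise
a.frame  <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Combine them into one list with named components.
my.list <- list(s = a.scalar, v = a.vector, m = a.matrix, d = a.frame)

str(my.list)           # compact overview of the structure
names(my.list)         # component names
my.list$v              # access a component by name
my.list[["m"]][2, 3]   # row 2, column 3 of the matrix component: 6
```

Components keep their own type and dimension inside the list, so each can be extracted and used as if it had never been combined.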
If we generate a simple linear regression (see chapter 6 for more details), we can store the results of the regression into a list object. The command y.lm