An Introduction to R

The R Programming Language for Statistical Computing and Graphics

Introduction 1 / 136


R is the lingua franca of data science
Source: http://blog.revolutionanalytics.com/2013/11/the-rise-of-r-as-the-language-of-analytics.html

R is
• a programming environment
• for statistical computing and graphics,
• an open source platform,
• working on most platforms (GNU/Linux, OS X, Windows),
• extendable via packages, and
• running all procedures in the workspace (RAM).

“We could have chosen to be commercial, and we would have sold five copies of the software.”

Ross Ihaka, R developer
The New York Times, 9 January 2009

A short history...

1976  John Chambers, Rick Becker and colleagues begin developing “The System” at Bell Labs
1984  First version of S
1993  First public appearance of R, on the “S-news” mailing list
1995  First release of R under the GNU General Public License
1996  The “R paper” by Ihaka & Gentleman (Google Scholar: 8070 citations)
1997  R “core group” established; R version 0.49 released in April 1997
2000  R version 1.0.0
2009  Article in The New York Times
2014  R most likely became the top statistics package in use during the summer of this year (Muenchen, cited in Nature 517, 109-110)

Bell Labs/AT&T

Researchers working at Bell Labs are credited with the development of radio astronomy, the transistor, the laser, the charge-coupled device (CCD), information theory, the UNIX operating system, the C programming language, the S programming language and the C++ programming language. Eight Nobel Prizes have been awarded for work completed at Bell Laboratories.

Source: https://en.wikipedia.org/wiki/Bell_Labs


Ross Ihaka & Robert Gentleman, New Zealand

“We wanted a command driven interface and, since we were both very familiar with S, it seemed natural to use an S-like syntax. This decision, more than anything else, has driven the direction that R development has taken. [...] the adoption of the S syntax for our interpreter produced something which felt remarkably close to S. Having taken this first step we found ourselves adopting more and more features from S.”

Ross Ihaka, 1998

“The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language ‘S’.”

Hornik (2015), “The R FAQ”, https://CRAN.R-project.org/doc/FAQ/R-FAQ.html

The R Core Group

Doug Bates, John Chambers, Peter Dalgaard, Seth Falcon, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Uwe Ligges, Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul Murrell, Martyn Plummer, Brian Ripley, Deepayan Sarkar, Duncan Temple Lang, Luke Tierney, and Simon Urbanek.

Hornik (2015), “The R FAQ”, https://CRAN.R-project.org/doc/FAQ/R-FAQ.html

The initial work [...] produced what looked like a potentially useful piece of software and we began preparing it for use in our teaching laboratory.

Ross Ihaka, 1998


Academic success

• Teaching
  • R is free and works on many platforms
• Methods development
  • Development, implementation and application go hand-in-hand
• Methods access
  • How easy is it to run a simple IRT model in SPSS?
• Reproducible Research
  • Open Data
  • Review scripts, not just manuscripts

Commercial success

Open Source: Can be “task-tailored” (cf. PHP for Facebook)

Base R or a version of it is used at Facebook, Twitter, Foursquare, Microsoft, Google, Ford, Bank of America, NASA, Bayer, Pfizer, Roche, The New York Times and other newspapers...

Developer success

http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages
Institute of Electrical and Electronics Engineers

Extendability: Packages

Functions can be combined into packages. Packages structure R code in a systematic manner. A package can be loaded into the workspace, and the corresponding functions are then ready for use.
• Packages expand R’s base capabilities.
• R provides a common environment/framework; specialists implement specific solutions.
• Tailored functions for routine tasks (a package does not need to be published)
• October 2015: 7395 packages on CRAN
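
Loading a package can be sketched as follows. This is a minimal illustration, not part of the original slides; the package splines is chosen only because it ships with every standard R installation, and the object names are made up for the example.

```r
# Load a package that ships with R ('splines' is used only for illustration).
library(splines)

# The package now appears on the search path ...
loaded <- "package:splines" %in% search()
print(loaded)

# ... and its exported functions can be called directly.
# bs() builds a B-spline basis matrix for a numeric vector.
basis <- bs(1:10, df = 4)
print(dim(basis))
```

Packages that do not ship with R have to be installed once before library() can load them.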


Comprehensive R Archive Network (CRAN):

• “Master site”: https://cran.r-project.org/
• Hosts the R software, packages etc. for the available platforms, both pre-compiled and as source code.

Task Views
Collections of packages for specific purposes (e.g. Natural Language Processing, Medical Imaging, or Psychometrics)

Examples:
• https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
• https://cran.r-project.org/web/packages/qdap/index.html
• https://cran.r-project.org/web/packages/RQDA/index.html
• https://cran.r-project.org/web/views/MedicalImaging.html
• https://cran.r-project.org/web/views/Psychometrics.html

Packages

lavaan    Structural Equation Models that mimic e.g. MPlus
foreign   Read SAS/SPSS/... data
xlsx      Read/write MS Excel
dplyr     Easier reshaping/aggregation of data
ggplot2   Graphics & plots
ggvis     Interactive, web based graphics
car       Regression tools, incl. useful ‘recode’ function
CTT       For Classical Test Theory (cf. also irtoys)

What is a ’good’ package?

In general, R packages are routinely tested under Debian/Linux. Packages must stick to the “CRAN Repository Policy”.
• Who are the authors?
• Is it under continuing revision?
• How old are the last versions?
• How good is the documentation?
• Are there resources (books, tutorials, etc.) using this software?
• Is the package used in scientific articles?
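
Some of these questions can be checked from within R. The following is a rough sketch, assuming the package of interest is already installed; stats (which ships with R) is used here only as a stand-in.

```r
# Inspect the metadata of an installed package
# ('stats' is only a stand-in for the package you want to check).
desc <- packageDescription("stats")

# Who are the authors and who maintains it?
print(desc$Author)
print(maintainer("stats"))

# Which version is installed, and when was it built?
print(packageVersion("stats"))
print(desc$Built)

# Which reference do the authors ask you to cite?
print(citation("stats"))
```

Documentation quality and use in published articles still have to be judged by reading the help pages and the literature.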


http://paulbutler.org/archives/visualizing-facebook-friends/


Graphics

• https://gjabel.wordpress.com/2014/06/05/world-cup-players-representation-by-league-system/
• http://motioninsocial.com/tufte/
• http://flowingdata.com/2014/10/23/moving-past-default-charts/
• http://www.htmlwidgets.org/

The competition...

R vs. Python

R vs. Matlab/STATA/SAS/SPSS/

....Julia?


Where (for me) the effort of learning R paid off...
• Routine reports: For individuals and institutions
• Research: Integration of various data sources (from cleaning to scale analyses)
• R works always and everywhere
• You are often expected to speak R as naturally as English

• Everything is a script
• Never touch your raw data (again)
• Work with batch-scripts and GUIs
• Example: OSCE reports (incl. ID-check and Generalizability ‘Study’)
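
The "everything is a script, never touch your raw data" idea might look like the sketch below. File names, column names and the cleaning rule are hypothetical; a temporary file stands in for the raw data so that the example runs on its own.

```r
# Stand-in for a raw data file; in practice this file already exists
# and is never edited by hand. Names and values are hypothetical.
raw_file <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,12", "2,NA", "3,15"), raw_file)

# Step 1: read the raw data (the original file is only read, never changed).
raw <- read.csv(raw_file)

# Step 2: all cleaning happens in code, so every change is documented.
clean <- raw[!is.na(raw$score), ]

# Step 3: write the result to a separate file; the raw file stays untouched.
clean_file <- tempfile(fileext = ".csv")
write.csv(clean, clean_file, row.names = FALSE)

print(clean)
```

Saved as a script file, such a workflow can be rerun unchanged whenever new raw data arrive.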


What R does not do

• It’s a pain to enter data manually in R
  Rather use Excel and save as *.csv
• If you WANT point-and-click, other options are better
• Open Source means: Validation and reliability checks of the software are done by “the community” (usually stressed by commercial companies)
• R is not “indictable”
• The PC’s working memory is the bottleneck (usually not relevant in “our” scenarios)
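
Reading a *.csv export and checking its memory footprint can be sketched as follows. The data are made up and passed as text so that the example is self-contained; with a real file, the file path would take the place of the text argument.

```r
# Text standing in for a file exported from a spreadsheet program;
# the column names are hypothetical.
csv_text <- "id,weight\n1,70.5\n2,82.1"

# read.csv() expects comma-separated values with '.' as decimal mark.
d1 <- read.csv(text = csv_text)

# Some locales export ';' as separator and ',' as decimal mark;
# read.csv2() uses those defaults.
csv2_text <- "id;weight\n1;70,5\n2;82,1"
d2 <- read.csv2(text = csv2_text)

# Both calls yield the same data frame.
print(all.equal(d1, d2))

# The data live in working memory; object.size() gives an estimate
# of how much memory an object occupies.
print(object.size(d1), units = "Kb")
```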


“We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

Anne H. Milley, Director of technology product marketing at SAS
The New York Times, 9 January 2009

Pro & Con

Pro: R is from statisticians for statisticians

Con: R is from statisticians for statisticians

Further info and help

https://cran.r-project.org/faqs.html
https://www.r-project.org/
http://stackoverflow.com/questions/tagged/r
https://stat.ethz.ch/mailman/listinfo/r-help
https://journal.r-project.org/
http://www.statmethods.net/
https://www.rstudio.com/resources/cheatsheets/

Getting started...

CRAN: The Comprehensive R Archive Network
• Search for “cran” in your preferred search engine...
• Download for Linux/MacOS/Windows
• Select “Install R for the first time”
• Select “Download R”

Install R (without admin rights)

1. Create a new folder (e.g. c:/RTest/) & install to this directory
2. Do not install to “Program Files”
3. Customise installation
4. A matter of taste: Separate windows
5. Plain text help is faster
6. Start menu?
7. Important: No Registry entries!
8. Install...
9. Find R as desktop icon or RGui.exe in either the 32bit or the 64bit folder

If this doesn’t work...
• If your (IT’s) security policy is too rigid, you may need admin rights to even start the installer.
• If so, you can run the steps described above on your private PC.
• The folder “RTest” can then be copied (e.g. via a USB drive) to any other PC.
• If the security policy is still too rigid. Well...

If you work on Windows and your connection goes through a proxy server, let R know via:

# Important for installing packages
# Type into script/console
# (Windows only; deprecated in recent R versions)
setInternet2(use = TRUE)

R as a calculator

# Addition and subtraction
1 + 1 - 5 + 3
## [1] 0

# Multiplication
2 * 13
## [1] 26

# Square root and exponentiation
sqrt(3)
## [1] 1.732051
sqrt(3)^2
## [1] 3
3^(1/2)
## [1] 1.732051

# Fraction
3/4
## [1] 0.75

# Brackets...
2/(4 + 4) * 2
## [1] 0.5

# ...no brackets
2/4 + 4 * 2
## [1] 8.5
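
Why the two results differ can be made explicit by writing out the order of evaluation. The sketch below is an added illustration; the object names are made up, and intermediate results are stored with the assignment arrow `<-`.

```r
# With brackets: the sum is evaluated first, then / and * from left to right.
with_brackets <- 2 / (4 + 4) * 2
step_by_step  <- (2 / 8) * 2
print(with_brackets == step_by_step)

# Without brackets: / and * bind more tightly than +.
without_brackets <- 2 / 4 + 4 * 2
explicit         <- (2 / 4) + (4 * 2)
print(without_brackets == explicit)
```

When in doubt, adding brackets costs nothing and makes the intended order visible.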


Integrated Development Environments (IDEs) (For more convenience...)


Some options...

• Stick to base R (Mac has syntax highlighting)
• RStudio (all platforms; very convenient, but slower than base R)
• Notepad++ & NppToR (Win only, but fast)
• Eclipse/Vim/Emacs (Win/Mac/Linux)
• ...

RStudio


Download at: https://www.rstudio.com/products/rstudio/download/
• Use installers, when possible
• In institutional context: ZIP file (no installer, no admin)

You can select the R installation that RStudio works with by hand:

(Tools → Global Options → General → Change...)

Packages! Packages! Packages!


Packages and Repositories

Packages extend R’s capabilities. They add functions or routines for specific, tailored tasks. There are packages for almost every purpose, such as psychometrics, machine learning, natural language processing, genetics, medical image processing, or sending emails. Some of the packages used in this work are:
• foreign for handling data from e.g. SPSS
• ggplot2 for plotting data
• TAM for IRT purposes
• lme4 for estimating variance components used in Generalizability Theory

Packages can be retrieved from repositories (i.e. ‘online archives’).

Most popular archives:
• CRAN (Comprehensive R Archive Network)
• R-Forge (often ‘Beta’ software)
• Bioconductor (bioinformatics)

Installing manually or via code:
→ Packages → Install Package(s) → Select Mirror → Select Package

# Example: CRAN packages from the RStudio mirror.
# (The package name goes between the quotes.)
install.packages("", repos = "https://cran.rstudio.com/")
# Example: R-Forge
install.packages("", repos = "http://R-Forge.R-project.org")
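
In scripts it can be convenient to install a package only when it is missing. The helper below is a hypothetical sketch (ensure_package is not an R function); it assumes the RStudio CRAN mirror from the example above as the default repository.

```r
# Hypothetical helper (not part of R): install a package only if it is missing.
ensure_package <- function(pkg, repos = "https://cran.rstudio.com/") {
  # requireNamespace() returns FALSE instead of failing when the
  # package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report whether the package is available afterwards.
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so no download is triggered here.
available <- ensure_package("stats")
print(available)
```

After a successful check, the package still has to be attached with library() before its functions can be called by name.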


Data Representation in R


Principles

• In R, everything is an object (incl. paths, variable names, statistical procedures, operators...)
• R is vector-based
• Every object is stored and manipulated in the workspace
• You can mess up literally everything, but only as long as the session lasts.
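
These principles can be tried out directly in the console. The lines below are an added sketch with made-up object names, using the assignment arrow `<-` to create objects.

```r
# Functions and operators are objects too.
print(class(sqrt))        # sqrt is stored as a function object
print(is.function(`+`))   # the operator + is itself a function
print(`+`(1, 2))          # and can be called like one

# R is vector-based: operations apply to whole vectors at once.
x <- c(1, 2, 3)
doubled <- x * 2
print(doubled)

# Objects live in the workspace of the current session.
print(ls())
```

Objects created this way disappear when the session ends, unless the workspace is saved.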


Objects are created with a left arrow (