An Introduction to Mapping and Spatial Modelling in R

An Introduction to Mapping and Spatial Modelling in R. By and © Richard Harris, School of Geographical Sciences, University of Bristol

An Introduction to Mapping and Spatial Modelling in R by Richard Harris is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at www.social-statistics.org.

You are free:
• to Share — to copy, distribute and transmit the work
• to Remix — to adapt the work

Under the following conditions:
• Attribution — You must attribute the work in the following manner: Based on An Introduction to Mapping and Spatial Modelling R by Richard Harris (www.social-statistics.org).
• Noncommercial — You may not use this work for commercial purposes. Use for education in a recognised higher education institution (a College or University) is permissible.
• Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

With the understanding that:
• Waiver — Any of the above conditions can be waived if you get permission from the copyright holder (Richard Harris, [email protected])
• Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
• Other Rights — In no way are any of the following rights affected by the license: your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; the author's moral rights; rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
• Notice — For any reuse or distribution, you must make clear to others the license terms of this work, which apply also to derivatives.

(Document version 0.1, November, 2013. Draft version.)

An Introduction to Mapping and Spatial Modelling in R. © Richard Harris, 2013


Introduction and contents

This document presents a short introduction to R, highlighting some geographical functionality. Specifically, it provides:

• A basic introduction to R (Session 1)
• A short 'showcase' of using R for data analysis and mapping (Session 2)
• Further information about how R works (Session 3)
• Guidance on how to use R as a simple GIS (Session 4)
• Details on how to create a spatial weights matrix (Session 5)
• An introduction to spatial regression modelling including Geographically Weighted Regression (Session 6)

Further sessions will be added in the months (more likely, years) ahead.

The document is provided in good faith and the contents have been tested by the author. However, use is entirely at the user's risk. Absolutely no responsibility or liability is accepted by the author for consequences arising from this document howsoever it is used. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (see above).

Before starting, the following should be considered.

First, you will notice that in this document the pages and, more unusually, the lines are numbered. The reason is educational: it makes directing a class to a specific part of a page easier and faster. For other readers, the line numbers can be ignored.

Second, the sessions presume that, as well as R, a number of additional R packages (libraries) have been installed and are available to use. You can install them by following the 'Before you begin' instructions below.

Third, each session is written to be completed in a single sitting. If that is not possible, then it would normally be possible to stop at a convenient point, save the workspace before quitting R, then reload the saved workspace when you wish to continue. Note, however, that whereas the additional packages (libraries) need be installed only once, they must be loaded each time you open R and require them. Any objects that were attached before quitting R also need to be attached again to take you back to the point at which you left off. See the sections entitled 'Saving and loading workspaces', 'Attaching a data frame' and 'Installing and loading one or more of the packages (libraries)' on pages 10, 31 and 37 for further information.
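The save, quit and reload cycle described above can be sketched as follows. This is only a minimal illustration and not part of the original sessions: the file name my_session.RData and the object x are hypothetical, and a temporary directory is used so that nothing is written to your working directory.

```r
# Hypothetical file location; in practice the working directory would be used
workspace.file <- file.path(tempdir(), "my_session.RData")

x <- c(10, 20, 30)                # An object created during a session
save(x, file = workspace.file)    # save.image(workspace.file) would save every object

rm(x)                             # Stands in for quitting R
load(workspace.file)              # Restores x from the saved file
print(x)

# Packages are installed once but must be loaded in every new session, e.g.
# library(sp)
```

Objects come back when the file is loaded, but loaded packages and attached data frames do not, which is why they have to be loaded and attached again.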


Before you begin

Install R. It can be downloaded from http://cran.r-project.org/. I am currently using version 3.0.2.

Start R. Use the drop-down menus to change your working directory to somewhere you are happy to download all the files you need for this tutorial. At the > prompt type,

download.file("http://dl.dropboxusercontent.com/u/214159700/RIntro.zip", "Rintro.zip")

and press return. Next, type

unzip("Rintro.zip")

All the data you need for the sessions are now available in the working directory. If you would like to install all the libraries (packages) you need for these practicals, type

load("begin.RData")

and then

install.libs()

You are advised to read Installing and loading one or more of the packages (libraries) on p. 37 before doing so.


Please note: this is a draft version of the document and has not as yet been thoroughly checked for typos and other errors.


Session 1: Getting Started with R

This session provides a brief introduction to how R works and introduces some of the more common commands and procedures. Don't worry if not everything is clear at this stage. The purpose is to get you started, not to make you an expert user. If you would prefer to jump straight to seeing R in action, then move on to Session 2 (p.13) and come back to this introduction later.

1.1 About R

R is an open source software package, licensed under the GNU General Public Licence. You can obtain and install it for free, with versions available for PCs, Macs and Linux. To find out what is available, go to the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/

Being free is not necessarily a good reason to use R. However, R is also well developed, well documented, widely used and well supported by an extensive user community. It is not just software for 'hobbyists'. It is widely used in research, both academic and commercial. It has well developed capabilities for mapping and spatial analysis. In his book R in a Nutshell (O'Reilly, 2010), Joseph Adler writes, “R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer's memory.” Nevertheless, no software provides the perfect tool for every job and Adler adds that “it's not good at storing data in complicated structures, efficiently querying data, or working with data that doesn't fit in the computer's memory.”

To these caveats it should be added that R does not offer spreadsheet editing of data of the type found, for example, in Microsoft Excel. Consequently, it is often easier to prepare and 'clean' data prior to loading them into R. There is an add-in to R that provides some integration with Excel. Go to http://rcom.univie.ac.at/ and look for RExcel.

A possible barrier to learning R is that it is generally command-line driven. That is, the user types a command that the software interprets and responds to. This can be daunting for those who are used to extensive graphical user interfaces (GUIs) with drop-down menus, tabs, pop-up menus, left or right-clicking and other navigational tools to steer you through a process. It may mean that R takes a while longer to learn; however, that time is well spent. Once you know the commands it is usually much faster to type them than to work through a series of menu options. They can be easily edited to change things such as the size or colour of symbols on a graph, and a log or script of the commands can be saved for use on another occasion or for sharing with others.

That said, a fairly simple and platform independent GUI called R Commander can be installed (see http://cran.r-project.org/web/packages/Rcmdr/index.html). Field et al.'s book Discovering Statistics Using R provides a comprehensive introduction to statistical analysis in R using both command-lines and R Commander.

1.2 Getting Started


Assuming R has been installed in the normal way on your computer, clicking on the link/shortcut to R on the desktop will open the RGui, offering some drop-down menu options, and also the R Console, within which R commands are typed and executed. The appearance of the RGui differs a little depending upon the operating system being used (Windows, Mac or Linux) but having used one it should be fairly straightforward to navigate around another.


Figure 1.1. Screen shot of the R Gui for Windows

1.2.1 Using R as a calculator

At its simplest, R can be used as a calculator. Typing 1 + 1 after the prompt > will (after pressing the return/enter key, ↵) produce the result 2, as in the following example:

> 1 + 1
[1] 2

Comments can be indicated with a hash symbol (#) and will be ignored

> # This is a comment, no need to type it

Some other simple mathematical expressions are given below.

> 10 - 5
[1] 5
> 10 * 2
[1] 20
> 10 - 5 * 2        # The order of operations gives priority to multiplication
[1] 0
> (10 - 5) * 2      # The use of brackets changes the order
[1] 10
> sqrt(100)         # Uses the function that calculates the square root
[1] 10
> 10^2              # 10 squared
[1] 100
> 100^0.5           # 100 to the power 0.5, i.e. the square root again
[1] 10
> 10^3
[1] 1000
> log10(100)        # Uses the function that calculates the common log
[1] 2
> log10(1000)
[1] 3
> 100 / 5
[1] 20
> 100^0.5 / 5
[1] 2


1.2.2 Incomplete commands

If you see the + symbol instead of the usual (>) prompt it is because what has been typed is incomplete. Often there is a missing bracket. For example,

> sqrt(             # The + symbol indicates that the command is incomplete
+ 100
+ )
[1] 10
> (1 + 2) * (5 - 1
+ )
[1] 12

Commands broken over multiple lines can be easier to read.

> for (i in 1:10) {     # This is a simple loop
+ print(i)              # printing the numbers 1 to 10 on-screen
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

1.2.3 Repeating or modifying a previous command

If there is a mistake in a line of code that needs to be corrected, or if some previously typed commands are to be repeated, then the ↑ and ↓ keys on the keyboard can be used to scroll between previous entries in the R Console. Try it!

1.3 Scripting and Logging in R

1.3.1 Scripting

You can create a new script file from the drop down menu File → New script (in Windows) or File → New Document (Mac OS). It is basically a text file in which you could write, for example, a ...

> ?ls()
> help(ls)

This will provide details about the function, including examples of its use. It will also list the arguments required to run the function, some of which may be optional and some of which may have default values which can be changed as required. Consider, for example,

> ?log()

A required argument is x, which is the data value or values. Typing log() omits any data and generates an error. However, log(100) works just fine. The argument base takes a default value of e^1, which is approximately 2.72, and means the natural logarithm is calculated. Because the default is assumed unless otherwise stated, log(100) gives the same answer as log(100, base=exp(1)). Using log(100, base=10) gives the common logarithm, which can also be calculated using the convenience function log10(100).
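The behaviour of the default base can be checked directly. The short sketch below uses only the calls named in the paragraph above and compares them with all.equal(), which allows for small rounding differences.

```r
natural.default  <- log(100)                  # Default base, e
natural.explicit <- log(100, base = exp(1))   # The same base stated explicitly
common.log       <- log(100, base = 10)       # Common (base 10) logarithm

all.equal(natural.default, natural.explicit)  # The two natural logs agree
all.equal(common.log, log10(100))             # As do the two common logs
```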

The results of mathematical expressions can be assigned to objects, as can the outcome of many commands executed in the R Console. When the object is given a name different to other objects within the current workspace, a new object will be created. Where the name and object already exist, the previous contents of the object will be over-written, without warning – so be careful!

> a <- 10 - 5
> print(a)
[1] 5
> b <- 10 * 2
> print(b)
[1] 20
> print(a * b)
[1] 100
> a <- a * b
> print(a)
[1] 100

In these examples the assignment is achieved using the combination of < and -, as in a <- 100. Alternatively, 100 -> a could be used or, more simply, a = 100. The print(...) command can often be omitted, though it is useful, and sometimes necessary (for example, when what you hope should appear on-screen doesn't).

> f = a * b
> print(f)
[1] 2000
> f
[1] 2000
> sqrt(b)
[1] 4.472136
> print(sqrt(b), digits=3)         # The additional parameter now specifies
[1] 4.47                           # the number of significant figures
> c(a,b)                           # The c(...) function combines its arguments
[1] 100 20
> c(a,sqrt(b))
[1] 100.000000 4.472136
> print(c(a,sqrt(b)), digits=3)
[1] 100.00 4.47
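The three forms of assignment can be compared side by side. The sketch below is illustrative only; the object names d and values are not used elsewhere in this document.

```r
a <- 100      # Leftward assignment, the most common form
100 -> b      # Rightward assignment
d = 100       # Assignment with the equals sign

identical(a, b) && identical(b, d)   # All three objects hold the same value

values <- c(a, sqrt(20))             # Combine two values into one vector
print(values, digits = 3)            # Print to three significant figures
```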

1.4.2 Naming objects in the workspace

Although the naming of objects is flexible, there are some exceptions,

> _a 2a a > A > a [1]
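A minimal sketch of such exceptions, assuming they concern R's standard naming rules: a name cannot begin with an underscore or a digit, and names are case sensitive. The helper is.valid.name() is hypothetical and written only for this illustration; it relies on make.names(), which leaves a name unchanged only when it is already syntactically valid.

```r
# Hypothetical helper: TRUE if x is already a syntactically valid name
is.valid.name <- function(x) make.names(x) == x

is.valid.name("_a")    # Begins with an underscore
is.valid.name("2a")    # Begins with a digit
is.valid.name("a2")    # A digit after the first character

# Names are case sensitive, so a and A are different objects
a <- 10
A <- 20
a == A
```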

row.names(installed.packages())

If the packages cannot be found then they can be installed using install.packages(c("RgoogleMaps","png","sp","spdep")). Note that you may need administrative rights on your computer to install the package (see Section 3.5.1, Installing and loading one or more of the packages (libraries), p.37).
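The check can also be done programmatically. The following sketch compares the four package names given above with the installed list and reports any that are missing; the object names needed, installed and missing.packages are illustrative, and the installation line is left commented out so that nothing is installed without you choosing to do so.

```r
needed <- c("RgoogleMaps", "png", "sp", "spdep")
installed <- row.names(installed.packages())
missing.packages <- setdiff(needed, installed)   # Needed but not yet installed

if (length(missing.packages) > 0) {
  cat("Not yet installed:", paste(missing.packages, collapse = ", "), "\n")
  # install.packages(missing.packages)           # Uncomment to install them
} else {
  cat("All required packages are installed\n")
}
```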

2.1 Getting Started

As the focus of this session is on showing what R can do rather than teaching you how to do it, the commands do not need to be typed. Instead, they can be executed automatically from a previously written source file (a script: see Section 1.3.1, page 7). As the commands are executed we will ask R to echo (print) them to the screen so you can follow what is going on. At regular intervals you will be prompted to press return before the script continues. To begin, type,

> source(file.choose(), echo=T)

and load the source file session2.R. After some comments that you should ignore, you will be prompted to load the .csv file schools.csv:

> ## Read in the file schools.csv file
> wait()
Please presss return
schools.data <- ...

> head(schools.data)        # Shows the first few rows of the data
> tail(schools.data)        # Shows the bottom few rows of the data

We can produce a summary of each column in the data table using

> summary(schools.data)

In this instance, each column is a continuous variable so we obtain a six-number summary of the centre and spread of each variable.

The names of the variables are

> names(schools.data)

Next the number of columns and rows; and a check – row-by-row – to see if the data are complete (have no missing data).

> ncol(schools.data)
> nrow(schools.data)
> complete.cases(schools.data)

It is not the most comprehensive check but everything appears to be in order.
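For a table with several hundred rows, the row-by-row output of complete.cases() is long to read, so it can help to summarise it. The sketch below uses a small invented data frame, toy.data, in place of schools.data; the values are made up purely for illustration.

```r
# Toy data standing in for schools.data (values are invented)
toy.data <- data.frame(fsm = c(0.10, 0.25, NA, 0.40),
                       attainment = c(29.1, 27.8, 28.3, 26.5))

complete.cases(toy.data)         # One TRUE or FALSE per row
all(complete.cases(toy.data))    # TRUE only if no row has missing data
sum(!complete.cases(toy.data))   # The number of incomplete rows
colSums(is.na(toy.data))         # Missing values counted by column
```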

2.3 Some simple graphics

The file schools.csv contains information about the location and some attributes of schools in Greater London (in 2008). The locations are given as a grid reference (Easting, Northing). The information is not real but is realistic. It should not, however, be used to make inferences about real schools in London. Of particular interest is the average attainment on leaving primary school (elementary school) of pupils entering their first year of secondary school. Do some schools in London attract higher attaining pupils more than others? The variable attainment contains this information. A stripchart and then a histogram will show that (not surprisingly) there is variation in the average prior attainment by school.

> attach(schools.data)
> stripchart(attainment, method="stack", xlab="Mean Prior Attainment by School")
> hist(attainment, col="light blue", border="dark blue", freq=F, ylim=c(0,0.30),
+ xlab="Mean attainment")

Here the histogram is scaled so the total area sums to one. To this we can add a rug plot,

> rug(attainment)

also a density curve, a Normal curve for comparison and a legend.

> lines(density(sort(attainment)))
> xx <- ...
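One way of adding a density curve, a Normal curve and a legend to a histogram is sketched below. It is not the original script: the attainment values are simulated because schools.csv is not reproduced here, and the use of seq() and dnorm() to draw the Normal curve is an assumption about how such a curve might be produced.

```r
set.seed(1)
attainment <- rnorm(367, mean = 27.5, sd = 1.6)    # Simulated, not the real data

hist(attainment, col = "light blue", border = "dark blue", freq = FALSE,
     xlab = "Mean attainment")
rug(attainment)                                    # Adds the rug plot
lines(density(attainment))                         # Adds the density curve

# A Normal curve with the same mean and standard deviation as the data
xx <- seq(min(attainment), max(attainment), length.out = 200)
yy <- dnorm(xx, mean = mean(attainment), sd = sd(attainment))
lines(xx, yy, lty = "dotted")

legend("topright", legend = c("density curve", "Normal curve"),
       lty = c("solid", "dotted"), cex = 0.8)
```

Because freq = FALSE puts the histogram on the density scale, the two curves can be drawn on the same axes without rescaling.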

> par(mai=c(1,1.4,0.5,0.5))                 # Changes the graphic margins
> boxplot(attainment ~ school.type, horizontal=T, xlab="Mean attainment", las=1,
+ cex.axis=0.8)                             # Includes options to draw the boxes and labels horizontally
> abline(v=mean(attainment), lty="dashed")  # Adds the mean value to the plot
> legend("topright", legend="Grand Mean", lty="dashed")

Not surprisingly, the selective schools (those with an entrance exam) recruit the pupils with highest average prior attainment.

Figure 2.1. A histogram with annotation in R


Figure 2.2. Mean prior attainment by school type

2.4 Some simple statistics

It appears (in Figure 2.2) that there are differences in the levels of prior attainment of pupils in different school types. We can test whether the variation is significant using an analysis of variance.

> summary(aov(attainment ~ school.type))
             Df Sum Sq Mean Sq F value Pr(>F)
school.type   5  479.8   95.95   71.42 ...

> attainment.high.fsm.schools <- attainment[fsm > quantile(fsm, probs=0.75)]
# Finds the attainment scores for schools with the highest proportions of FSM pupils
> attainment.low.fsm.schools <- ...

> round(cor(fsm, attainment),3)
> cor.test(fsm, attainment)

        Pearson's product-moment correlation

data:  fsm and attainment
t = -18.1731, df = 365, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7394165 -0.6313939
sample estimates:
       cor
-0.6892159
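The comparison of schools with high and low proportions of free school meal (FSM) pupils can be sketched with simulated data. The values below are invented, the definition of the low group as the bottom quarter is an assumption, and the use of t.test() to compare the two groups is also an assumption rather than something confirmed by the text.

```r
set.seed(42)
fsm <- runif(367, min = 0, max = 0.6)                 # Simulated proportions
attainment <- 29.6 - 7 * fsm + rnorm(367, sd = 1.2)   # Simulated attainment

# Attainment in the top and bottom quarters of schools by FSM proportion
high <- attainment[fsm > quantile(fsm, probs = 0.75)]
low  <- attainment[fsm < quantile(fsm, probs = 0.25)]

t.test(high, low)                  # Assumed test of the difference in means
round(cor(fsm, attainment), 3)     # Pearson correlation
cor.test(fsm, attainment)          # With n = 367 schools, df = n - 2 = 365
```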

Of course, the use of the Pearson correlation assumes that the relationship is linear, so let's check:

> plot(attainment ~ fsm)
> abline(lm(attainment ~ fsm))      # Adds a line of best fit (a regression line)

There is some suggestion the relationship might be curvilinear. However, we will ignore that here.

Finally, some regression models. The first seeks to explain the mean prior attainment scores for the schools in London by the proportion of their intake who are free school meal eligible. (The result is the line of best fit added to the scatterplot above). The second model adds a variable giving the proportion of the intake of a white ethnic group. The third adds a dummy variable indicating whether the school is selective or not.

> model1 <- lm(attainment ~ fsm, data=schools.data)
> summary(model1)

Call:
lm(formula = attainment ~ fsm, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8871 -0.7413 -0.1186  0.5487  3.6681

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.6190     0.1148  258.12 ...

...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.1250     0.1979  152.21  < 2e-16 ***
fsm          -7.2502     0.4214  -17.20  < 2e-16 ***
white        -0.8722     0.2796   -3.12  0.00196 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.164 on 364 degrees of freedom
Multiple R-squared: 0.4887, Adjusted R-squared: 0.4859
F-statistic: 173.9 on 2 and 364 DF, p-value: < 2.2e-16
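The idea of building up a model one variable at a time can be sketched as follows. The data are simulated (the data frame sim.data and its coefficients are invented), so the estimates will not match those reported above; update() is used here simply as a convenient way of adding a term to an existing model.

```r
set.seed(2013)
n <- 367
fsm <- runif(n, 0, 0.6)              # Simulated proportion FSM eligible
white <- runif(n, 0, 1)              # Simulated proportion white
selective <- rbinom(n, 1, 0.05)      # Simulated dummy: 1 = selective school
attainment <- 29 - 7 * fsm - 1 * white + 3 * selective + rnorm(n, sd = 1)
sim.data <- data.frame(attainment, fsm, white, selective)

model1 <- lm(attainment ~ fsm, data = sim.data)
model2 <- update(model1, . ~ . + white)        # Adds the second variable
model3 <- update(model2, . ~ . + selective)    # Adds the dummy variable

# Adjusted R-squared for each model, then an F-test of each added term
sapply(list(model1, model2, model3), function(m) summary(m)$adj.r.squared)
anova(model1, model2, model3)
```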

> model3 <- lm(attainment ~ fsm + white + selective, data=schools.data)
> summary(model3)

Call:
lm(formula = attainment ~ fsm + white + selective, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.6262 -0.5620  0.0537  0.5607  3.6215

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.1706     0.1689 172.712 ...

...

> plot(Easting, Northing, asp=1, main="Map of London schools")
# The argument asp=1 fixes the aspect ratio correctly

Amongst the attribute data for the schools, the variable esl gives the proportion of pupils who speak English as an additional language. It would be interesting for the size of the symbol on the map to be proportional to it.

> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5))

It would also be nice to add a little colour to the map. We might, for example, change the default plotting 'character' to a filled circle with a yellow background.

> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg="yellow")

A more interesting option would be to have the circles filled with a colour gradient that is related to a second variable in the data – the proportion of pupils eligible for free school meals for example. To achieve this, we can begin by creating a simple colour palette:

> palette <- ...
> map.class <- ...
> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg=palette[map.class])
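One possible way of building such a palette and class variable is sketched below. The four colours, the use of quartile breaks and the cut() function are all assumptions made for illustration, and the coordinates and attributes are simulated rather than taken from schools.csv.

```r
set.seed(7)
Easting  <- runif(100, 505000, 555000)    # Simulated grid references
Northing <- runif(100, 160000, 200000)
esl <- runif(100, 0, 1)                   # Simulated proportion ESL
fsm <- runif(100, 0, 0.6)                 # Simulated proportion FSM eligible

palette <- c("yellow", "orange", "red", "purple")   # Assumed colours

# Assign each school to one of four classes using the quartiles of fsm
breaks <- quantile(fsm, probs = seq(0, 1, 0.25))
map.class <- cut(fsm, breaks = breaks, include.lowest = TRUE, labels = FALSE)

plot(Easting, Northing, asp = 1, main = "Map of simulated schools",
     cex = sqrt(esl * 5), pch = 21, bg = palette[map.class])

class.labels <- paste(round(breaks[1:4], 2), "to", round(breaks[2:5], 2))
legend("topright", legend = class.labels, pch = 21, pt.bg = palette,
       title = "Proportion FSM", cex = 0.8)
```

Here map.class holds an integer from 1 to 4 for each school, so palette[map.class] picks out one colour per point.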


It would be good to add a legend, and perhaps a scale bar and North arrow. Nevertheless, as a first map in R this isn't too bad!


Figure 2.3. A simple point map in R

Why don't we be a bit more ambitious and overlay the map on a Google Maps tile, adding a legend as we do so? This requires us to load an additional library for R and to have an active Internet connection.

> library(RgoogleMaps)

If you get an error such as the following

Error in library(RgoogleMaps) : there is no package called ‘RgoogleMaps’

it is because the library has not been installed.

Assuming that the data frame, schools.data, remains in the workspace and attached (it will be if you have followed the instructions above), and that the colour palette created above has not been deleted, then the map shown in Figure 2.4 is created with the following code:

> MyMap <- ...
> PlotOnStaticMap(MyMap, Lat, Long, cex=sqrt(esl*5), pch=21, bg=palette[map.class])
> legend("topleft", legend=paste("
