
Exploring Linguistic Data with R

Gard Jenset
Thursday, July 23, 2015

Table of Contents

1 Exploring R
  1.1 Example vectors
  1.2 Find length of a vector
  1.3 Extract a subset
  1.4 Combine into a data frame
  1.5 Inspect the data frame
  1.6 Data frame subset
  1.7 Column operator
  1.8 Column operator subsetting
  1.9 Basic plots
  1.10 Saving plots
2 Saving and reading data
  2.1 Saving data
  2.2 Reading data frames
3 Data in depth
  3.1 Hierarchical clustering of speakers based on pronouns
  3.2 Inference tree of the number of words spoken
4 Further reading

1 Exploring R

The present document is intended to accompany an introductory session on how to use the R platform for exploratory statistics in linguistics. However, it can also be used as a self-contained tutorial, e.g. for refreshing the material covered in the session. As a short introduction, the tutorial does not claim to be comprehensive, and for further study books such as "Quantitative Corpus Linguistics with R" (Gries 2009) or "Analyzing Linguistic Data: A Practical Introduction to Statistics using R" (Baayen 2008) are recommended.

The tutorial assumes that R has been installed on your computer and that you are able to find file paths to folders and save files, but it assumes no particular computer expertise beyond that.

Most of the code below will work fine irrespective of your operating system. However, there is a fundamental difference between Windows and Mac OS / Linux with respect to how file paths are formatted. On Windows, folders and files are separated by backslashes, and each backslash must be doubled inside an R string (R on Windows also accepts forward slashes). On Mac OS/Linux, a single forward slash is used. Compare these schematic examples:

• Windows: "C:\\Users\\yourUserName\\Documents\\thefile.txt"
• Mac OS: "/Users/yourUserName/Documents/thefile.txt"
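One way to avoid typing separators by hand is to let R assemble the path. The sketch below uses file.path() from base R, which joins its arguments with forward slashes. The folder and file names are the schematic placeholders from the examples above, not real locations on any computer.

```r
# Build a path from its parts; file.path() inserts "/" between them.
# The names below are placeholders taken from the schematic examples.
the_file = file.path("Users", "yourUserName", "Documents", "thefile.txt")
the_file

# basename() and dirname() split a path back into file name and folder.
basename(the_file)
dirname(the_file)
```

Because the result contains only forward slashes, the same line of code can be shared between Windows and Mac OS / Linux users.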

The code below was created on a Windows 7 computer using R version 3.1.2 (2014-10-31).

1.1 Example vectors

We start by typing data into R. Recall that a vector is a sequence of values, all of the same data type. Create some example vectors, containing characters/strings, numbers and boolean values. Give them informative names. Use the c() function to concatenate individual data points into vectors.

affix = c("ly", "in", "re", "ity", "ation", "ee", "ism")
length = c(2, 2, 2, 3, 5, 2, 3)
native = c(TRUE, T, FALSE, F, F, F, F)
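To see what "all of the same data type" means in practice, the following sketch checks the type of each vector with class() and shows what happens when types are mixed. The vector named mixed is made up for this illustration.

```r
affix = c("ly", "in", "re", "ity", "ation", "ee", "ism")
length = c(2, 2, 2, 3, 5, 2, 3)
native = c(TRUE, T, FALSE, F, F, F, F)

# class() reports the data type of each vector.
class(affix)   # character
class(length)  # numeric
class(native)  # logical

# Mixing types in c() does not fail: R silently converts every value
# to a single common type (here: character strings).
mixed = c(1, "two", TRUE)
class(mixed)
```

Silent conversion is a common source of surprises, so it is worth checking class() whenever a numeric column unexpectedly behaves like text.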

1.2 Find length of a vector

length(affix)

## [1] 7

1.3 Extract a subset

We can extract a subset of the vector with [] combined with numbers referring to the index of the element (numbered from 1 to the end of the vector). The first 4 affixes are extracted like this:

affix[1:4]

## [1] "ly"  "in"  "re"  "ity"

The last two are extracted like this:

affix[6:7]

## [1] "ee"  "ism"

We can also use the generic functions head() and tail() for this. By default they return 6 values, but we can use the n argument to specify how many:

head(affix)

## [1] "ly"    "in"    "re"    "ity"   "ation" "ee"

tail(affix, n = 2)

## [1] "ee"  "ism"

We can also extract data based on a condition. The example below extracts the affixes whose length (in characters) is greater than 2:

affix[ length > 2 ]

## [1] "ity"   "ation" "ism"

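The conditional extraction above works in two steps: the comparison produces one TRUE/FALSE value per element, and that logical vector then selects the matching elements. The sketch below makes both steps visible. It uses nchar() to compute the lengths instead of the hand-typed length vector; the names affix_nchar and is_long are introduced here for illustration only.

```r
affix = c("ly", "in", "re", "ity", "ation", "ee", "ism")

# nchar() counts the characters in each string, so the lengths
# do not have to be typed in by hand.
affix_nchar = nchar(affix)

# The comparison returns one TRUE/FALSE per element ...
is_long = affix_nchar > 2

# ... and that logical vector picks out the matching affixes.
affix[is_long]

# which() gives the positions of the TRUE values instead.
which(is_long)
```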
1.4 Combine into a data frame

Recall that a data frame is a multidimensional data structure that can hold different data types in columns. Each column corresponds to a vector.

affix.df = data.frame(af = affix, len = length, native = native)

1.5 Inspect the data frame

This data frame has 7 rows. We can view it by typing the name:

affix.df

##      af len native
## 1    ly   2   TRUE
## 2    in   2   TRUE
## 3    re   2  FALSE
## 4   ity   3  FALSE
## 5 ation   5  FALSE
## 6    ee   2  FALSE
## 7   ism   3  FALSE

We can summarize it using the summary() function:

summary(affix.df)

##      af         len          native
##  ation:1   Min.   :2.000   Mode :logical
##  ee   :1   1st Qu.:2.000   FALSE:5
##  in   :1   Median :2.000   TRUE :2
##  ism  :1   Mean   :2.714   NA's :0
##  ity  :1   3rd Qu.:3.000
##  ly   :1   Max.   :5.000
##  re   :1

To inspect the first 3 rows, we can use head() with n = 3 (default value is 6):

head(affix.df, n = 3)

##   af len native
## 1 ly   2   TRUE
## 2 in   2   TRUE
## 3 re   2  FALSE

1.6 Data frame subset

We can also subset data using [] as in the example above. Since the data frame is multidimensional we can specify rows, columns, or a combination. The syntax is: [row index , column index]

# first row
affix.df[ 1, ]

##   af len native
## 1 ly   2   TRUE

# the first 3 rows of the first column
affix.df[ 1:3, 1 ]

## [1] ly in re
## Levels: ation ee in ism ity ly re

1.7 Column operator

The data frame has a special syntax for accessing the column variables: the dollar sign $ followed by the name of the column. If we want to refer to the length column by name (rather than by index) we can write:

affix.df$len

## [1] 2 2 2 3 5 2 3

1.8 Column operator subsetting

We can combine subsetting with the column operator to extract a subset of the data matching a particular condition:

affix.df[ affix.df$len > 2, ]

##      af len native
## 4   ity   3  FALSE
## 5 ation   5  FALSE
## 7   ism   3  FALSE
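The same kind of selection can also be written with the base R function subset(), which lets you use column names directly inside the condition. This is offered as an alternative sketch, not as the approach used in the rest of the tutorial; the object names long_affixes and short_borrowed are made up here.

```r
affix.df = data.frame(
  af = c("ly", "in", "re", "ity", "ation", "ee", "ism"),
  len = c(2, 2, 2, 3, 5, 2, 3),
  native = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Same rows as affix.df[ affix.df$len > 2, ], written with subset().
long_affixes = subset(affix.df, len > 2)
long_affixes

# Conditions can be combined with & (and) and | (or):
# non-native affixes that are exactly 2 characters long.
short_borrowed = subset(affix.df, !native & len == 2)
short_borrowed$af
```

Note that the comma before the closing bracket in the [] version is what tells R to select rows; leaving it out is a frequent beginner error.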

1.9 Basic plots

The advanced functionality for generating production quality plots is one of the strengths of R. Here are some basic types:

A simple scatter plot

This basic plot has lots of scope for tweaking and modification (?plot). We start by plotting the length data from above:

plot(length)

Note how the values (y-axis) are plotted against the index number (x-axis). The reason is that we only provided one variable to plot.

A box and whisker plot

Let's plot the length of the affix against its etymological history (native vs. non-native), using a box and whisker plot. We get the nicest results with the bwplot() function from the lattice package:

library(lattice)
bwplot(native ~ length)

The tilde operator (~) specifies that native should be plotted as a function of length.

Plotting data from a data frame

We can pick data to plot from a data frame. Most R plot functions that accept a formula have an optional data argument where you can specify where to find the variables to be plotted (data frame column names). We can re-write the code above as:

bwplot(native ~ len, data = affix.df)

A conditional density plot

Another way of plotting the relationship between nativeness and length is to use a conditional density plot, which shows the relationship as changing probabilities:

cdplot( as.factor(native) ~ len, data = affix.df)

Note that the as.factor() function is necessary, since native holds logical (boolean) values, whereas cdplot() expects a factor.
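The plots above can be complemented with a quick numerical check of the same relationship. The sketch below cross-tabulates the two vectors and computes the mean length per group with tapply(); the name mean_len is introduced here for illustration.

```r
length = c(2, 2, 2, 3, 5, 2, 3)
native = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Cross-tabulate nativeness against length (counts per combination).
table(native, length)

# Mean affix length within each group (FALSE = non-native, TRUE = native).
mean_len = tapply(length, native, mean)
mean_len
```

With only seven affixes these numbers are purely illustrative, but the same two calls scale to real data sets.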

1.10 Saving plots

There are 3 basic ways to save a plot. Each allows you to control the format (e.g. PDF, PNG, JPG).

In R console

• Create the plot
• Click on the graphics device (window with plot)
• Go to File (upper left corner)
• Go to Save as and choose format

In RStudio

• Create the plot
• Go to the Plots tab on the right hand pane
• Click on the Export button and choose format

With R code

Recommended for reproducibility, but make sure you don't overwrite files you want to keep!

# specify where you want to save the plot, and the format
pdf("C:\\Users\\gbj\\Documents\\DH summer school\\Code\\R\\img\\cdplot.pdf")
cdplot( as.factor(native) ~ len, data = affix.df)
dev.off()

## pdf
##   2

Notes:

• dev.off() is required to re-activate the "normal" plotting device in R.
• Make sure to change the file path to match your computer's folder structure.
• The file path above is for Windows. If you are using a Mac computer, your file path will look something like this: "/Users/yourUserName/folderWithYourData/"
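To try the code-based approach without touching your own folders, the plot can be written to R's temporary session folder first. This is a minimal sketch under that assumption; plot_file is a name made up here, and tempdir() should be replaced with a real folder if you want to keep the file.

```r
affix.df = data.frame(
  len = c(2, 2, 2, 3, 5, 2, 3),
  native = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# tempdir() is a scratch folder that R creates for the current session.
plot_file = file.path(tempdir(), "cdplot.pdf")

pdf(plot_file)                                    # open the PDF device
cdplot(as.factor(native) ~ len, data = affix.df)  # draw into the file
dev.off()                                         # close the device again

# Check that the file was actually written.
file.exists(plot_file)
```

Other formats work the same way: png() and jpeg() open the corresponding devices and are likewise closed with dev.off().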

2 Saving and reading data

The sections below show you how to save a data set, as well as how to read a previously saved data set into R.

2.1 Saving data

Workspace

When you close the R application (either in the console or in RStudio), R will ask if you want to save your workspace. The workspace is saved as a .RData file containing all the objects you have created. This is useful, and can save some time when you work concurrently with many objects.

Data frame

Most of the time it is recommended to save your data as a regular file, e.g. as a .txt or .csv file. This is also a better format for sharing data. To save the data frame you created, use write.table(). Only two arguments are required: what to save, and where to save it.

write.table(affix.df, "C:\\Users\\gbj\\Documents\\DH summer school\\Code\\R\\data\\data_frame_with_default_settings.txt")

However, we can control the output in more detail:

write.table(affix.df,
  "C:\\Users\\gbj\\Documents\\DH summer school\\Code\\R\\data\\data_frame_with_custom_settings.txt",
  quote = FALSE,      # switch off quotes around strings
  row.names = FALSE,  # don't save row numbers
  sep = "\t"          # specify tabulator separated fields
)

Compare the two output files to see the difference!

Notes:

• Make sure to change the file paths to match your computer's folder structure.
• The file paths above are for Windows. If you are using a Mac computer, your file path will look something like this: "/Users/yourUserName/folderWithYourData/"

2.2 Reading data frames

Most of the time we want to read data that has been saved to disk, rather than typing it into R. There are several ways to read data into R, but here we will focus on one particular function, read.table(), for tables / data frames. A saved data frame can be read into R using read.table() like this:

new.df = read.table(
  "C:\\Users\\gbj\\Documents\\DH summer school\\Code\\R\\data\\data_frame_with_custom_settings.txt",
  header = TRUE,
  sep = "\t"
)

As you can see, the syntax closely matches that of write.table(). header = TRUE specifies that the first row is to be treated as column names, and the sep argument tells R to split each line into columns at tabulators.

Notes:

• Make sure to change the file paths to match your computer's folder structure.
• The file path above is for Windows. If you are using a Mac computer, your file path will look something like this: "/Users/yourUserName/folderWithYourData/"

Some extra arguments you might need

These arguments are needed for some data:

• dec = "," for setting the decimal separator in non-English environments
• quote = "\"" whenever your data contains strings with apostrophes

In general, it's good to look closely at the documentation (?read.table).
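A quick way to check that write.table() and read.table() are used with matching settings is a round trip: save a data frame, read it back, and compare. The sketch below does this with a temporary file so that nothing on your disk is overwritten; tmp_file is a name made up for this example.

```r
affix.df = data.frame(
  af = c("ly", "in", "re", "ity", "ation", "ee", "ism"),
  len = c(2, 2, 2, 3, 5, 2, 3),
  native = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# tempfile() returns a fresh file name inside R's session scratch folder.
tmp_file = tempfile(fileext = ".txt")
write.table(affix.df, tmp_file, quote = FALSE, row.names = FALSE, sep = "\t")

# Read it back with matching header and separator settings.
new.df = read.table(tmp_file, header = TRUE, sep = "\t")

# Same shape and column names as before?
dim(new.df)
colnames(new.df)

# Same values in the numeric column?
all(new.df$len == affix.df$len)
```

If header = TRUE is left out when reading, the column names end up as the first data row, which is usually the first thing to check when a file "looks wrong" after loading.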

3 Data in depth

The sections below illustrate how we can explore linguistic data in more depth using R. The example data are based on the Bodleian First Folio version of Shakespeare's Hamlet.

3.1 Hierarchical clustering of speakers based on pronouns

This section explores how we can find patterns of similarity between characters in Hamlet based on the pronouns that they use. The data frame has two columns matching characters with the personal pronouns (subject and object forms) they use.

Read data

Use read.table() to load the data into R. Modify the file path to match the location on your computer:

hamlet_prn = read.table("C:\\Users\\gbj\\Documents\\DH summer school\\Code\\Python\\hamlet_pronouns.txt", header = FALSE, sep = "\t")

Inspect data frame

Inspect the resulting data frame to ensure there were no problems in loading the data into R:

dim(hamlet_prn)

## [1] 429   2

head(hamlet_prn)

##     V1  V2
## 1  you fra
## 2 thee ber
## 3  you ber
## 4  you ber
## 5    i fra
## 6    i fra

Make new column names

We can use the colnames() function to give the columns more informative names:

colnames(hamlet_prn) = c("prn", "character")

Calculate the number of times each character uses a pronoun

To create a contingency table of the number of times each character uses a particular pronoun we use the xtabs() function. Note how there is nothing in front of the tilde (~):

hamlet_prn.counts = xtabs(~ character + prn, data = hamlet_prn)

The result should look like this, with each row corresponding to a character and each column corresponding to a pronoun:

##           prn
## character  he her him  i it me she thee them they thou vs we ye you
##   all       0   0   0  0  0  0   0    0    0    0    0  0  2  0   0
##   ber       0   0   0  1  4  0   0    1    0    0    0  0  0  0   2
##   cap       0   0   0  1  0  0   0    0    0    0    0  0  0  0   0
##   cla       2   2   4 13 10  4   3    1    5    1    2  1  4  0  11
##   clo.1     0   0   0  1  0  0   0    0    0    0    0  0  0  0   0
##   for       0   0   0  0  0  1   0    0    0    0    0  1  0  0   0
##   fra       0   0   0  2  0  0   0    0    0    0    0  0  0  0   2
##   ger       4   1   1  9  3  3   2    0    0    1    5  0  0  0   6
##   gho       0   0   0  3  1  1   0    1    0    0    2  0  0  0   0
##   gui       0   0   1  0  0  0   0    0    0    0    0  0  3  0   0
##   ham       5   0   2 32 17 12   0    2    0    3    4  0  0  0  27
##   hor       3   1   4 16 18  1   2    1    1    2    2  0  2  1   4
##   hor, mar  0   0   0  0  0  0   0    0    0    0    0  0  4  0   0
##   lae       3   1   3  7  5  4   1    0    0    0    1  0  0  0   9
##   mar       1   0   0  4  6  1   0    1    0    0    1  0  2  0   1
##   mes       0   0   0  1  0  0   0    0    1    1    0  0  0  0   0
##   oph       4   0   1 12  1  3   0    0    0    1    0  0  0  0   5
##   osr       0   0   0  1  0  0   0    0    0    0    0  0  0  0   0
##   pla.1     1   0   0  0  0  0   0    0    0    0    0  0  0  0   0
##   plq       0   0   0  0  0  1   0    0    0    0    0  0  0  0   0
##   pol       3   0   4  8  2  0   0    0    0    0    0  0  0  0  10
##   pri       0   1   0  0  0  0   0    0    0    0    0  0  0  0   0
##   rey       0   0   0  6  0  0   0    0    0    0    0  0  0  0   0
##   ros       2   0   0  1  1  1   0    0    0    0    0  1  2  0   3
##   sai       0   0   0  1  0  0   0    0    0    0    0  0  0  0   2
##   ser       0   0   0  0  0  0   0    0    0    2    0  0  0  0   0
##   vol       0   0   0  0  0  0   0    0    0    0    0  0  1  0   0
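To see how xtabs() builds such a table, it helps to run it on a handful of rows. The sketch below types in the six rows shown by head(hamlet_prn) above as a small stand-in for the full file; toy_prn and toy_counts are names made up for this illustration.

```r
# The six rows shown by head(hamlet_prn), typed in by hand.
toy_prn = data.frame(
  prn = c("you", "thee", "you", "you", "i", "i"),
  character = c("fra", "ber", "ber", "ber", "fra", "fra")
)

# One row per character, one column per pronoun, cells hold the counts.
toy_counts = xtabs(~ character + prn, data = toy_prn)
toy_counts

# Individual cells can be looked up by row and column name.
toy_counts["ber", "you"]

# Row totals: how many pronoun tokens each character contributes.
rowSums(toy_counts)
```

The formula has nothing on the left of the tilde because each row of the input already represents one observation to be counted.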

Calculate distances

To transform the counts into distances we use the dist() function. The default distance is Euclidean but see ?dist for more options:

hamlet_prn.dist = dist(hamlet_prn.counts)

Perform hierarchical agglomerative clustering and plot

The function hclust() lets us perform a hierarchical agglomerative clustering of the data. The default clustering method is complete linkage, but other options are available (see ?hclust). Note that there are other tools available for doing hierarchical clustering in R, see e.g. the diana() function in the cluster package and the pvclust package.

hamlet_prn.clust = hclust(hamlet_prn.dist)
plot(hamlet_prn.clust)
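Besides reading groups off the plot, cluster membership can be extracted with cutree() from base R. The sketch below runs the same dist() and hclust() steps on a small made-up matrix (the counts are invented, not taken from Hamlet) and cuts the tree into two groups.

```r
# Made-up counts for four speakers (rows) and two pronouns (columns).
toy_counts = matrix(
  c(10,  0,
     9,  1,
     0, 10,
     1,  9),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), c("thou", "you"))
)

# Euclidean distances, then complete-linkage clustering (the defaults).
toy_clust = hclust(dist(toy_counts))

# cutree() cuts the dendrogram into k groups and returns the group
# number of each row.
toy_groups = cutree(toy_clust, k = 2)
toy_groups
```

Speakers a and b end up in one group and c and d in the other, because their pronoun profiles are closest to each other.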

Reflections

First: According to one heuristic, the expected number of clusters in the cluster dendrogram is the square root of half the number of observations, i.e. sqrt(n / 2). If we take the characters as our observation points (what we want to cluster), how well does this expected number of clusters correspond to the plot created above? (Hint: the characters are the rows of the contingency table, so use the code sqrt(nrow(hamlet_prn.counts)/2) to calculate the expected number of substantial clusters.)

Second: Looking back at the contingency table of characters and pronouns, can you spot the pronouns that are most likely to be responsible for singling out Hamlet in one node, and Claudius and Horatio in another?

3.2 Inference tree of the number of words spoken

This section explores how we can find associations between variables in a data set. The data frame used has two columns holding the characters and the number of words spoken by them in the play. Additional variables are added to the data frame in R.

Read data

Use read.table() to load the data into R. Modify the file path to match the location on your computer:

hamlet_part_size = read.table("C:\\Users\\gbj\\Documents\\DH summer school\\Code\\Python\\hamlet_part_size.txt", header = FALSE, sep = "\t")

Inspect and name columns

Refer back to the code examples in section 3.1 above, and inspect the data set. Give the columns informative names:

colnames(hamlet_part_size) = c("character", "wcount")

Visualize data distribution

We can combine the functions sort() and barplot() to visualize how the sizes of the parts are distributed. Each column in the barplot is a part (character). To label the bars, we can order the data frame according to the number of words spoken, and use the character names from the data frame as labels:

hamlet_part_size_ordered = hamlet_part_size[ order(hamlet_part_size$wcount, decreasing = T), ]
barplot(hamlet_part_size_ordered$wcount,
  names.arg = hamlet_part_size_ordered$character,
  las = 2, cex.names = 0.8)

What does the distribution of words / parts among the characters tell us in the case of Hamlet?

Add more variables

To see if there is an interaction between the number of words spoken by the character and that character's social position or gender, we add two columns holding this information.

First, gender with one row per character. FALSE indicates male, TRUE indicates female (note the use of NA for "all" and "plp"):

hamlet_part_size$female = c(NA, F, F, F, F, F, F, F, F, F, F, F, NA, F, F, F, F,
  T, F, T, F, F, F, F, T, F, F, F, F, F, F, F, F)

Next, we create a factor variable with the levels royal, court (for courtiers), and low for everyone else:

hamlet_part_size$status = as.factor(c(NA, "low", "court", "court", "court", "low",
  "court", "court", "low", "low", "royal", "royal", NA, "royal", "court", "low",
  "court", "royal", "low", "low", "low", "court", "court", "low", "court", "low",
  "low", "court", "court", "court", "low", "low", "royal"))

The output should look like this:

##    character wcount female status
##  1       all     17     NA   <NA>
##  2       ser      9  FALSE    low
##  3       vol     17  FALSE  court
##  4       osr     30  FALSE  court
##  5       pri     11  FALSE  court
##  6       sai     14  FALSE    low
##  7       pol    300  FALSE  court
##  8       ros     88  FALSE  court
##  9       fra     37  FALSE    low
## 10       rey     59  FALSE    low
## 11       ham    932  FALSE  royal
## 12       for     32  FALSE  royal
## 13       plp      6     NA   <NA>
## 14       cla    457  FALSE  royal
## 15       gen      5  FALSE  court
## 16       luc      4  FALSE    low
## 17       hor    501  FALSE  court
## 18       ger    343   TRUE  royal
## 19       mar    160  FALSE    low
## 20       plq     27   TRUE    low
## 21     pla.1     18  FALSE    low
## 22       lae    277  FALSE  court
## 23       amb      4  FALSE  court
## 24       mes     14  FALSE    low
## 25       oph    206   TRUE  court
## 26     clo.2      4  FALSE    low
## 27     clo.1     30  FALSE    low
## 28       gmn      1  FALSE  court
## 29  hor, mar     35  FALSE  court
## 30       gui     31  FALSE  court
## 31       cap      5  FALSE    low
## 32       ber     74  FALSE    low
## 33       gho     62  FALSE  royal
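Before fitting a model, a grouped summary gives a first impression of how the word counts differ between status levels. The sketch below uses seven rows copied from the table above rather than the full file, so the numbers are only indicative; parts and median_words are names made up here.

```r
# Seven rows copied from the table above (not the full data set).
parts = data.frame(
  character = c("ham", "cla", "ger", "pol", "hor", "mar", "ber"),
  wcount = c(932, 457, 343, 300, 501, 160, 74),
  status = as.factor(c("royal", "royal", "royal", "court", "court", "low", "low"))
)

# Median number of words per status level.
median_words = tapply(parts$wcount, parts$status, median)
median_words
```

The median is used instead of the mean because the word counts are heavily skewed by a few very large parts.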

Exercise: visualize using cdplot()

Try visualizing the effect of the binary variable female on the number of words spoken. Refer back to the plotting example in section 1 above, and use cdplot().

Create conditional inference tree

Conditional inference trees are a simple but effective way of determining the effects that some (predictor) variables have on a single (outcome or response) variable. It is a non-parametric technique based on permutation tests, so it makes few assumptions about how the data are distributed. Conditional inference trees are particularly good at finding interactions. In this case, we use the function ctree() in the party package. This is an add-on package that is not part of the base R distribution; it can be installed from CRAN with install.packages("party"). ctree() requires a formula of the same type seen in some of the plot functions in section 1 above. In this case, we use the number of words as the response, and gender and social status as predictors:

# load 'party' package:
library(party)

## Loading required package: grid
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: strucchange
## Loading required package: modeltools
## Loading required package: stats4

# create tree (note formula):
hamlet.tree = ctree(wcount ~ female + status, data = hamlet_part_size)

We can plot the tree like this:

plot(hamlet.tree)
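On many installations the party package has to be installed from CRAN before library(party) works. A minimal sketch of a check, using the base R function requireNamespace(); the variable has_party is made up for this illustration.

```r
# requireNamespace() returns TRUE if the package is installed,
# without attaching it to the session.
has_party = requireNamespace("party", quietly = TRUE)

if (!has_party) {
  # Installation needs an internet connection and is only done once.
  message('Package "party" is not installed; run install.packages("party") first.')
}
```

Once the package is installed, the library(party) and ctree() calls above should run as shown.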

Reflections

The resulting tree diagram can be read as a flow chart. In this case, we get a very simple tree with only two end-nodes. The key variable is (not surprisingly) the social status of the character. Higher social status is associated with a higher number of spoken words. There is no node representing gender, since ctree() will automatically disregard variables that are redundant or not significant. Can you think of a reason why gender did not play a role in this analysis? (Hint: look at the output of cdplot() in the exercise above.)

4 Further reading

There are numerous resources available for linguists interested in learning R and statistics. Books such as Stefan Gries' Quantitative Corpus Linguistics with R (2009) and Statistics for Linguistics with R (2009) are good starting points. Harald Baayen's Analyzing Linguistic Data (2008) is more challenging because of the slightly more terse format, but goes further than Gries' books and is well worth working through once a basic familiarity with R has been attained. Quantitative Methods in Corpus-Based Translation Studies (2012), edited by Michael Oakes and Meng Ji, offers a wealth of interesting examples of how R and statistics combine with applied linguistics topics that are also relevant beyond translation studies.

For improving R programming skills, there are a number of good MOOCs available online, not least from Coursera (www.coursera.org), where many courses are constantly available and can be studied at your own pace. Finally, when troubleshooting R scripts, the large and active R community means that it is easy to find help online, particularly on websites such as www.stackoverflow.com.