TEXT MINING FOR LINGUISTICS: A brief introduction using R
Jeroen Claes
KU Leuven, Quantitative Lexicology and Variational Linguistics
[email protected]

We've all been there: you discover some wildly interesting phenomenon in your favorite language of study and you want to examine it as soon as you can. However, before you can get to the interesting bit, you have to plow through tons and tons of interview transcripts and withstand a tedious data annotation process that is even more boring than your average English Literature class. This tutorial is intended to encourage you to do it differently, using text mining techniques. The advantages of such techniques are:

1. They enable you to process vast amounts of data in a relatively short time, allowing you to take advantage of Big Data. For instance, processing a corpus of some 100,000 tweets will normally take less than half an hour. Try doing that by hand!
2. Once you get your algorithm right, it is much more consistent than you could ever be. Computers don't suffer from attention lapses; that defect is limited to humans.
3. Programming is vastly more fun than annotating!

I will show you how to apply functions from the R language (R Core Team, 2016) in your linguistics research projects to structure text and turn it into information. Importantly, though, this tutorial will not provide an introduction to R. Rather, I will assume that you have a basic understanding of the R language (e.g., working with functions, loading and saving data, subsetting objects). For those of you not familiar with R, Natalia Levshina's book How to Do Linguistics with R (Levshina, 2015) or Gries's (2013) Quantitative Corpus Linguistics with R are recommended entry points. Assuming that you have a basic-to-intermediate understanding of R, you'll learn:

1. How to scrape content from a website using the rvest package in R (Wickham, 2016b)
2. How to collect responses from web forms using the httr package in R (Wickham, 2016a)
3. How to gather a collection of tweets using the twitteR package in R (Gentry, 2016)
4. How to use a part-of-speech tagger from within R. Two alternatives are discussed: the Stanford Part-of-Speech tagger (Toutanova et al., 2003) and TreeTagger (Schmidt, 2016)
5. How to use regular expressions and the R package stringi (Gagolewski & Tartanus, 2016) to transform the output of a POS tagger into linguistic annotations
6. How to do all of this in a fast, clean, readable, error-proof way by exploiting R's strengths while avoiding its weaknesses

Does that seem like a whole lot to you? Don’t worry, with this hands-on tutorial, you’ll get the hang of it soon enough. Let's get started!

Table of Contents

1. Scraping content off a website
   1.1 Grabbing a page
       Exercise
   1.2 Selecting content from a page
       Exercise
   1.3 Extracting text from content
       Exercise
   1.4 Extracting attributes
   1.5 A note on elegant programming
       Exercise
   1.6 A note on fast and R-style programming
   1.7 A completely worked-out example
2. Collecting web form responses
   2.1 GET requests
   2.2 POST requests
3. Gather Tweets with the twitteR package
   3.1 Introduction
   3.2 Getting started
   3.3 Querying the Search API with the twitteR package
   3.4 Gaining access to Twitter users' home locations and languages
       3.4.1 Exercise
4. Use a Part-Of-Speech tagger to POS-tag your data
   4.1 Stanford POS tagger
   4.2 TreeTagger
5. Parsing the output of a POS tagger into linguistic annotations
   5.1 Basic string operations: find/replace and transforming to lowercase
       5.1.1 Converting text to lowercase
       5.1.2 Stripping off double spaces, line endings and other nasties
   5.2 A note on regular expressions
       5.2.1 The basics of the basics: character classes
       5.2.2 The basics of the basics: restraining the number of occurrences that are matched
       5.2.3 The not-so-basics: negative lookahead statements
   5.3 Making a concordance
       5.3.1 Splitting a text into sentences
       5.3.2 Building a concordance from sentences
   5.4 Worked-out example: annotating a concordance with regular expressions
       5.4.1 Selecting data
       5.4.2 Cleaning up data
       5.4.3 POS-tagging data
       5.4.4 Constructing a concordance for POS-tagged data
       5.4.5 Annotating a concordance
6. Next steps and further exploring
7. References

1. Scraping content off a website

Websites are a great source of data. With a list of URLs, you can easily gather a large collection of texts. Magazines and major newspapers, for example, increasingly publish content online. On top of that, many interesting resources (e.g., certain types of WordNet) are published as XML files, documents that are structured in a way similar to webpages. Thus, mastering the art of downloading webpages and decoding them into useful data should be high on any linguist's agenda. Before you get tempted into opening your browser and copy-pasting material, know that there is actually a very convenient way to do this in R. Open an R session and type:

# Just the first time
install.packages("rvest")

# Load the package
library(rvest)

1.1 Grabbing a page

The first step is to download a web page into R's working memory. This is what the read_html function does. Some websites don't like algorithms scraping off their content. Therefore, it is usually a good idea to trick the website into believing that a regular browser such as Mozilla Firefox is asking for the content. This can be done by setting the user_agent argument correctly.
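A minimal sketch of this idea is shown below, using httr's GET() and user_agent() together with read_html(). The URL and the user-agent string are placeholders, and the exact call used elsewhere in this tutorial may differ.

library(httr)
library(rvest)

# Hypothetical example URL; replace it with the page you want to scrape
url <- "http://www.example.com"

# Request the page while identifying ourselves as a regular Firefox browser
response <- GET(url, user_agent("Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"))

# Parse the raw HTML into an object that rvest can work with
page <- read_html(content(response, as = "text", encoding = "UTF-8"))

From here on, the page object can be queried with rvest's selector functions.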

4.2 TreeTagger

To install TreeTagger, open a command-line window. On Windows, go to Search and type cmd.exe; on Mac/Linux, search for a program called Terminal in your applications folder. Type in the following lines of code. The first line will change your working directory to the folder to which you have downloaded the files.

Mac/Linux code:
cd ~/TreeTagger
sh install-tagger.sh

Windows code:
cd c:\TreeTagger
sh install-tagger.sh

Now, to use TreeTagger, all we need to do is call it. As we did for the Stanford POS tagger, I like to use variables in my commands so I can adapt my code easily when the need arises. It is usually a good idea to put the paths to the tagger and the language model in variables; this makes your code easier to read.
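As an illustration, a call along these lines is possible. The installation directory, the wrapper script (here the English one that install-tagger.sh sets up when the English parameter files are installed), and the input file are assumptions; adapt them to your own setup. The original command may instead pass the tagger binary and the language model as separate paths.

# A sketch only: all paths below are placeholders that depend on where you
# installed TreeTagger and where your corpus files live.
parserPath <- "~/TreeTagger/cmd/tree-tagger-english"  # wrapper script for English
inputFile  <- "~/corpus/myText.txt"                   # hypothetical plain-text file

# Build the shell command from the variables and run it, capturing
# TreeTagger's output (one "word<TAB>POS<TAB>lemma" line per token) in R
command <- paste(parserPath, inputFile)
tagged  <- system(command, intern = TRUE)
head(tagged)

Because the paths live in variables, pointing the code at another language model or another corpus file only requires changing parserPath or inputFile.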