Distributed Computing in R using the Segue Package
Author: Johnathan Mercer
[email protected]
This paper examines the functionality of the R package ‘segue’ for distributed computing. To this end, I will first provide an introduction to R and then apply segue to a canonical example. The introduction covers the history of R, the language, and its environments, with some examples to get you acclimated. After this foundation we will discuss rJava. This project depends heavily on rJava because it provides the interface for R to work with Java and, therefore, for R to interface with the AWS Java SDK. We can then move on to stochastically estimating Pi. We will walk through the entire process and, at each step, look under the hood of the segue functions to better understand how segue hides all of the R-to-Java and Java-to-AWS functionality from the user. In the end, you will be equipped to use distributed computing in R for embarrassingly parallel problems and will also have the foundational knowledge to build your own Java interface.
The outline of this document is as follows:
Part 1. R Tutorial
  a. History
  b. RStudio
  c. R language
  d. lapply function
  e. rJava
Part 2. Estimating Pi
  a. Installation and setup of segue
  b. createCluster
  c. emrlapply
  d. emptyS3Bucket
Part 3. References and Code
  a. References
  b. Project Code on AWS
Part 1. R Tutorial
History of R
R is a descendant of the S language. Dr. John M. Chambers, of Bell Labs, received the ACM’s Software System Award in 1998 for the development of the S language. The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers." [http://www.acm.org/announcements/ss99.html]
As listed in the PowerPoint, you may download R from http://cran.r-project.org/ and you can download a powerful and popular IDE called RStudio from http://www.rstudio.com/. I will be using RStudio for this entire tutorial and my system specifications are the following:
I list these here and in the PowerPoint because anyone involved in cloud computing, or in other technologies on the “bleeding edge”, learns that much of the work involves searching online for information and for anyone who has already made advances. Many posts fail to state the particulars of the system on which the author was working. This can cause failures when you try to reproduce the work and makes the process much harder. I have learned that debugging with information found online, and building up intuition as to why failures may be occurring, is an acquired skill.
RStudio
RStudio looks like this when opened:
The upper left is where you open R scripts (you give them a .r extension); multiple scripts can be open at once in tabs. The lower left is the console, where you can type commands interactively and get immediate results. R is very much a scripting language, and you interact with its interpreter much as you would when programming in Python. You will notice I typed 2+2 and the interpreter responded with 4, so that is reassuring. The upper right is your workspace, where you can inspect the objects you have created. This is very helpful because in standard R you have to use the console to print out the contents of objects such as data frames. In my opinion this feature brings R one step closer to competing with commercial environments like SAS. The lower right provides real estate to browse files, look at output (plots), load other packages, and search and display help topics. For example, an important R function we will look at is the lapply function. If I were to type
> help(lapply)
I would then see the help topic on the lapply function, which includes a description and examples. The heading lapply {base} indicates that it is a base function in R and not provided by an additional package.
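As a brief illustration of what that help page describes, here is a minimal sketch of an lapply call. The object names (inputs, squares) are my own, chosen only for this example.

```r
# To open the help page interactively, type help(lapply) or ?lapply
# at the console.

# lapply applies a function to each element of a list (or vector)
# and always returns a list of the same length.
inputs <- list(1, 2, 3)
squares <- lapply(inputs, function(v) v^2)

# unlist flattens the resulting list into a plain numeric vector.
print(unlist(squares))
# [1] 1 4 9
```

Because lapply treats each element independently, it is a natural fit for the embarrassingly parallel problems discussed later in this paper.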
One last useful note for those using R on a Mac: if you want to write code in the editor and submit it without pasting it into the console, just highlight the code and press Command+Return to send it to the console.
The R Language
In R you can do simple operations such as addition, which you already saw. You can assign values to objects: Here I assigned 2 to the object x and then printed out the value just by typing the name of the object in the console (and pressing Enter). Notice we use the “