Distributed Computing in R using the Segue Package

Author: Johnathan Mercer

This paper examines the functionality of the R package ‘Segue’ for distributed computing. To this end I will first provide an introduction to the R framework, then apply segue to a canonical example. I will discuss the history, the language, environments, and provide some examples to get you acclimated. After this foundation we will discuss rJava. This project is intimately dependent on rJava because it provides the interface for R to work with Java, and therefore, for R to interface with the AWS Java SDK. We can then move on to stochastically estimating Pi. We will walk through the entire process and, at each step, look underneath the hood of the segue functions to understand better how segue hides all of the R to Java and Java to AWS functionality from the user. In the end, you will be equipped to utilize distributed computing in R for embarrassingly parallel problems and also have the foundational knowledge to build your own Java interface. So an outline of this document is the following:

Part 1. R Tutorial
  a. History
  b. RStudio
  c. R language
  d. lapply function
  e. rJava
Part 2. Estimating Pi
  a. Installation and setup of segue
  b. createCluster
  c. emrlapply
  d. emptyS3Bucket
Part 3. References and Code
  a. References
  b. Project Code on AWS

Part 1. R Tutorial

History of R

R is a descendant of the S language. Dr. John M. Chambers, of Bell Labs, received the ACM's Software System Award in 1998 for the development of the S language. The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers." [http://www.acm.org/announcements/ss99.html]

As listed in the PowerPoint, you may download R from http://cran.r-project.org/ and you can download a powerful and popular IDE called RStudio from http://www.rstudio.com/. I will be using RStudio for this entire tutorial, and my system specifications are the following:
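As a minimal sketch, details of this kind can also be printed from within R itself using base functions. Which fields are worth reporting is an assumption made for illustration here, not something taken from the PowerPoint.

```r
# Collect a few basic facts about the current R session.
specs <- list(
  r_version = R.version.string,        # full R version string
  platform  = R.version$platform,      # platform R was built for
  os        = Sys.info()[["sysname"]]  # operating system name
)
print(specs)

# sessionInfo() additionally lists the locale and the versions of
# attached packages, which is handy to include in a forum post.
print(sessionInfo())
```

Pasting this output alongside a question or a blog post gives readers the context they need to reproduce the work.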

I list these here and in the PowerPoint because anyone involved in cloud computing, or in other technologies on the “bleeding edge”, spends much of their time searching online for information and for others who have made progress. Many posts fail to state the details of the system on which the author was working. This makes the work much harder to reproduce and can introduce failures. I have learned that it is an acquired skill to learn and debug using information found online, and to build up intuition as to why failures may be occurring.

RStudio

RStudio looks like this when opened:

The upper left is where you can open R scripts (you give them .r extensions), and it allows multiple scripts to be open in tabs. The lower left is the console, where you can run code interactively and get immediate results. R is very much a scripting language, and you interact with its interpreter much as you would with Python's. You will notice I typed 2+2 and the interpreter responded with 4, so that is reassuring.

The upper right is your workspace, where you can inspect the objects you have created. This is very helpful because in standard R you have to use the console to print out the contents of objects such as data frames. So in my opinion this feature really brings R one step closer to competing with commercial environments like SAS.

The lower right provides real estate to search for files, look at output (plots), include other packages, and search and display help topics. For example, an important R function we will look at is the lapply function. If I were to type

> help(lapply)

I would then see the help topic on the lapply function, which includes a description and examples. The heading lapply {base} indicates that it is a base function in R and not provided by an additional package.
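To give a feel for what that help topic describes, here is a minimal sketch of lapply in use. The squaring function and the variable name squares are illustrative choices, not taken from the help page.

```r
# lapply() applies a function to each element of a vector or list
# and always returns a list of the same length as its input.
squares <- lapply(1:3, function(i) i^2)
print(squares)

# unlist() flattens the resulting list into a plain numeric vector.
print(unlist(squares))
```

Because each element is processed independently of the others, this pattern is a natural fit for embarrassingly parallel problems.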

One last useful note for those using R on a Mac: if you want to write code in the editor and submit it without pasting it into the console, just highlight the code and press Command+Return to send it to the console.

The R Language

In R you can do simple operations such as addition, which you already saw. You can assign values to objects:

Here I assigned 2 to the object x and then printed out the value just by typing the name of the object in the console (and pressing Enter). Notice we use the “
