Optimizing R VM: Allocation Removal and Path Length Reduction via Interpreter-level Specialization
Haichuan Wang, University of Illinois at Urbana-Champaign, [email protected]
Peng Wu, IBM T.J. Watson Research Center, [email protected]
David Padua, University of Illinois at Urbana-Champaign, [email protected]
Abstract
The performance of R, a popular data analysis language, has never been properly understood. Some claim that their R codes run as efficiently as native code, while others quote orders-of-magnitude slowdowns of R codes relative to equivalent C implementations. We found both claims to be true, depending on how an R code is written. This paper introduces a first classification of R programming styles into Type I (looping over data), Type II (vector programming), and Type III (glue codes). The most serious overheads of R are mostly manifested in Type I R codes, whereas many Type III R codes can be quite fast. This paper focuses on improving the performance of Type I R codes. We propose the ORBIT VM, an extension of the GNU R VM, which performs aggressive removal of allocated objects and reduction of instruction path lengths in the GNU R VM via profile-driven specialization techniques. The ORBIT VM is fully compatible with the R language and is purely based on interpreted execution. It is a specialization JIT and runtime focusing on data representation specialization and operation specialization. For our benchmarks of Type I R codes, ORBIT achieves an average speedup of 3.5X over the current release of the GNU R VM and outperforms most other R optimization projects that are currently available.
Categories and Subject Descriptors D.3.4 [Processors]: Compilers, Interpreters, Run-time environments
General Terms Languages, Performance
Keywords R, Specialization, Dynamic Scripting Language
1. Introduction
In the age of big data, R is a tremendously important language and is considered the lingua franca for data analysis [18, 28]. There are more than two million users of R today, and the user base is rapidly expanding. The popularity of R is mainly due to the productivity benefits it brings to data analysis. R contributes to programmer productivity in several ways, including the following two: the availability of extensive data analysis packages that can be easily incorporated into an R script, and the interpreted environment that allows for interactive programming and easy debugging. Like many other interpreted and dynamically typed languages, R suffers from a critical limitation: it is very slow. Figure 1 compares the performance of a set of algorithms [3] implemented in different languages and shows that the GNU R VM, the most widely used R implementation today, is more than two orders of magnitude slower than C and twenty times slower than CPython, the Python interpreter.¹
Figure 1. Slowdown of R on the shootout benchmarks relative to C and CPython. (Bar chart with a logarithmic slowdown axis from 1 to 100000; series: Slowdown to C, Slowdown to Python.)
1.1 Three R Programming Styles
To better understand the landscape of existing R optimization projects, one has to consider the different ways in which a programmer may use the language. Figure 2 shows three different R programming styles.
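As a rough illustration of the three styles, the following sketch writes the same element-wise addition first as a Type I loop and then as a Type II vector expression, and adds a Type III fragment that only chains native functions. It is a minimal example in plain R, not code from the paper's benchmarks; the variable names (n, x, y, a1, a2, a3) and the input sizes are assumptions made for illustration.

```r
# Hypothetical inputs; names and sizes are illustrative, not from the paper.
n <- 1000
x <- as.numeric(1:n)
y <- as.numeric(n:1)

# Type I: looping over data. The interpreter handles one scalar
# element per iteration, so interpretation overhead dominates.
a1 <- numeric(n)
for (i in 1:n) {
  a1[i] <- x[i] + y[i]
}

# Type II: vector programming. A single operation covers the whole
# vectors, and the element loop runs inside the VM's native code.
a2 <- x + y

# Type III: glue code. The script only wires together native
# functions (here rnorm and fft), where most of the time is spent.
set.seed(1)
a3 <- fft(rnorm(n))

# The loop and the vector expression compute the same result.
stopifnot(identical(a1, a2))
```

The Type I and Type II versions produce identical vectors, so any difference in running time between them comes from how the work is expressed rather than from what is computed.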
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CGO ’14, February 15 - 19, 2014, Orlando, FL, USA.
Copyright © 2014 ACM 978-1-4503-2670-4/14/02...$15.00.
http://dx.doi.org/10.1145/2544137.2544153
¹ For pidigits, the steep performance gap also comes from algorithm differences. For instance, the Python version uses the built-in big number support whereas the R version uses vectors to represent big numbers.
Listing 2. Type II: Vector programming
# a, b, and c are vectors
a = b + c
Listing 3. Type III: Glue codes
# rnorm and fft are native functions
a