Journal of Statistical Software
November 2013, Volume 55, Issue 14. http://www.jstatsoft.org/
Scalable Strategies for Computing with Massive Data

Michael J. Kane
Yale University

John W. Emerson
Yale University

Stephen Weston
Yale University
Abstract

This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
Keywords: concurrent programming, memory-mapping, parallel programming, shared memory, statistical computing.
1. Introduction

The analysis of increasingly large data sets and the use of parallel processing are active areas of research in statistics and machine learning. Examples of these data sets include the Netflix Prize competition (Bennett and Lanning 2007); next-generation genome sequencing and analysis; and the American Statistical Association's 2009 Data Expo involving the airline on-time performance data set (RITA 2009). Many statisticians are left behind when confronted with massive data challenges because their two most widely-used software packages, SAS (SAS Institute Inc. 2011) and R (R Core Team 2013a), are ill-equipped to handle this class of problem. SAS supports the analysis of large data with an impressive number of standard methods, but the Netflix Prize competition required the development and implementation of new
methodologies. On the other hand, R is very well-suited for the development of new data analysis and statistical techniques but does not seamlessly handle massive data sets. These barriers to entry have presented significant obstacles to statisticians interested in engaging in such massive data challenges.

The root of the problem is the current inability of modern high-level programming environments like R to exploit specialized computing capabilities. Package bigmemory (Kane, Emerson, and Haverty 2013a) leverages low-level operating system features to provide data structures capable of supporting massive data, potentially larger than random access memory (RAM). Unlike database solutions and other alternative approaches, the data structures provided by bigmemory are compatible with standard basic linear algebra subroutines (BLAS), linear algebra package (LAPACK) subroutines, and any algorithms which rely upon column-major matrices. The data structures are available in shared memory for use by multiple processes in parallel programming applications and can be shared across machines using supported cluster filesystems.

The design of bigmemory addresses two interrelated challenges in computing with massive data: data management and statistical analysis. Section 2 describes these challenges further and presents solutions for managing and exploring massive data. When the calculations required by an exploration or analysis become overwhelming, parallel computing techniques can be used to decrease execution time. Package foreach (Revolution Analytics and Weston 2013b) provides a general, technology-agnostic framework for implementing parallel algorithms and can exploit the shared memory capabilities of bigmemory. Section 4 considers a broad class of statistical analyses well-suited to foreach's parallel computing capabilities. Thus, bigmemory and foreach can be used together to provide a software framework for computing with massive data (demonstrated in Section 3) that includes shared-memory parallel computing capabilities (demonstrated in Section 5). Section 6 examines the performance of bigmemory and foreach compared to standard R data structures and parallel programming capabilities in a small data setting. Section 7 concludes with a discussion of the future of massive data and parallel computing in the R programming environment.
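As a brief preview of the foreach idiom, the following sketch shows the same loop running under a registered parallel backend; the doParallel backend and the toy computation are illustrative choices rather than requirements.

library(foreach)
library(doParallel)

# Register an SMP backend with two workers; a cluster backend could be
# registered instead without changing the loop itself.
registerDoParallel(cores = 2)

# The same loop runs sequentially with %do% or in parallel with %dopar%,
# depending only on the registered backend.
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2   # stand-in for a more substantial per-iteration computation
}
squares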
2. Big data challenges and bigmemory

High-level programming environments such as R and MATLAB (The MathWorks, Inc. 2011) allow statisticians to easily import, visualize, manipulate, and model data as well as develop new techniques. However, this convenience comes at a cost because even simple analyses can incur significant memory overhead. Lower-level programming languages sacrifice this convenience but often can reduce the overhead by referencing existing data structures instead of creating unnecessary temporary copies. When the data are small, the overhead of creating copies in high-level environments is negligible and generally goes unnoticed. However, as data grow in size the memory overhead can become prohibitively expensive. Programming environments like R and MATLAB have some referencing capabilities, but their existing functions generally do not take advantage of them. Uninformed use of these features can lead to unanticipated results.

According to the R Installation and Administration manual (R Core Team 2013b), R is not well-suited for working with data structures larger than about 10–20% of a computer's RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM.
Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%. Although the notion of size varies with hardware capability, the challenges and solutions of computing with massive data scale to the statistician's problem of interest and computing resources. We are particularly interested in cases where massive data or the memory overhead of an analysis exceed the limits of available RAM. In such cases, computing requires the use of fixed storage (disk space) in combination with RAM.

Historically, size limitations of high-level programming languages have resulted in the use of a database. A database can provide convenient access to subsets of large and massive data structures and is particularly efficient for certain types of summaries and tabulations. However, reliance on a database has several drawbacks. First, it can be relatively slow in executing many of the numerical calculations upon which statistical algorithms rely. Second, calculations not supported by the database require copying subsets of the data into objects of the high-level language stored in RAM. This copying can be slow, and the subsequent analysis may require customized implementations of algorithms for cases where the overhead of standard implementations (used for smaller data) exceeds the capacity of RAM. Finally, the use of a database requires the installation and maintenance of a software package separate from the high-level programming environment and may not be available to all researchers.

Customized extraction of subsets of data from files resident on disk provides an alternative to databases. However, this approach suffers from two drawbacks. First, it has to be done manually, requiring extra time to develop specialized code and, as a result, proportionally more time for debugging. Second, custom extractions are often coupled to specific calculations and cannot be implemented as part of a general solution. For example, the calculation of the mean of a column of a matrix requires an extraction scheme that loads only elements from the column of interest. However, a column-wise extraction will not work well for calculating matrix row means. Furthermore, the modes of extraction are specific to the chosen file format and data structures. As a result, different extraction schemes may need to be implemented over the course of a data exploration depending on how the data are stored. This may further increase development and debugging time. In the extreme case, some calculations may require sophisticated extraction schemes that are prohibitively difficult to implement; in such cases the statistician is effectively precluded from performing these types of calculations.

Both the database and custom extraction approaches are limited because their data structures on disk are not numerical objects that can be used directly in the implementation of a statistical analysis in the high-level language. They require loading small portions of the data from disk into data structures of the high-level language in RAM, completing some partial analysis, and then moving on to other subsets of the data. As a result, existing code designed to work on entire data structures native to the language and within RAM is generally incompatible with analyses relying upon database or customized extractions of data from files. Fundamentally, these approaches are limited by their reliance on technologies that evolved decades ago and lack the flexibility provided by modern computing systems and languages.

Some algorithms have been designed specifically for use with massive data.
For example, the R package biglm (Lumley 2013) implements an incremental algorithm for linear regression (Miller 1992) that processes the data in chunks, avoiding the memory overhead of R’s native lm function for fitting linear models. However, such solutions are not always possible. The prospect of implementing different algorithms for a certain type of analysis simply to support different data sizes seems grossly inefficient.
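To illustrate the incremental approach, the following sketch fits a linear model in chunks with biglm; the file name, column names, and chunk size are assumptions for illustration.

library(biglm)

chunk.size <- 100000
con <- file("flights.csv", open = "r")

# The first chunk (with header) initializes the model; subsequent chunks
# update the fit incrementally, so the full data set never resides in RAM.
chunk <- read.csv(con, nrows = chunk.size)
col.names <- names(chunk)
fit <- biglm(ArrDelay ~ DepDelay + Distance, data = chunk)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk.size, header = FALSE, col.names = col.names),
    error = function(e) NULL)   # NULL signals that the input is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)     # incremental update of the regression fit
}
close(con)
coef(fit)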
2.1. Underlying technology

Modern operating systems allow files resident on disk to be memory-mapped, associating a segment of virtual memory in a one-to-one correspondence with the contents of a file. The C function mmap provides this memory-mapping on POSIX-compliant operating systems (including UNIX, Linux, and Mac OS X, in particular); Microsoft Windows offers similar functionality. The Boost C++ Libraries (Dawes et al. 2013) provide an application programming interface for the use of memory-mapped files which allows portable solutions across Microsoft Windows and POSIX-compliant systems (made available through the BH package; Emerson, Kane, Eddelbuettel, Allaire, and François 2013). Memory-mapped files provide faster access to data than standard read and write operations for a variety of reasons beyond the scope of this paper. Most importantly, the task of moving data between disk and RAM (called caching) is handled at the operating-system level and avoids the inevitable costs associated with an intermediary such as a database or a customized extraction solution. Interested readers should consult one of the many web pages describing memory-mapping, such as Kath (1993).

Memory-mapping is the cornerstone of a scalable solution for working with massive data. Size limitations become associated with available file resources (disk space) rather than RAM. From the perspective of both the developer and the end-user, only one type of data structure is needed to support data sets of all sizes, from minuscule to massive. The resulting programming environment is thus both efficient and scalable, allowing single implementations of algorithms for statistical analyses to be used regardless of the size of the problem. When an algorithm requires data not yet analyzed, the operating system automatically caches the data in RAM. This caching happens at a speed faster than any general data management alternative and only slower than customized solutions designed for very specific purposes. Once in RAM, calculations proceed at standard in-memory speeds. Once memory is exhausted, the operating system handles the caching of new data and the displacement of older data (which are written back to disk if modified). The end-user is insulated from the details of this mechanism, which is certainly not the case with either database or customized extraction approaches.
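To make this concrete, here is a minimal sketch using the bigmemory package introduced in the next section; the file names and dimensions are illustrative assumptions, and the same call scales to matrices far larger than available RAM.

library(bigmemory)

# Create a filebacked (memory-mapped) matrix; the data live in the file
# 'example.bin' on disk, and the operating system caches pages into RAM
# only as they are touched.
x <- filebacked.big.matrix(nrow = 1e8, ncol = 5, type = "double",
                           backingfile = "example.bin",
                           descriptorfile = "example.desc")

x[1, ] <- rnorm(5)   # writes go through the memory mapping to the file
x[1, ]               # reads cache only the relevant pages into RAM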
2.2. The bigmemory family of packages

We offer a family of packages for the R statistical programming environment for computing with massive data on POSIX-compliant operating systems. Windows is not currently supported, although support could be provided through the POSIX-compatible environments used by R. This family of packages is intended for data that can be represented as a matrix and for computers with 64-bit operating systems.

The main contribution is a new data structure providing a dense, numeric matrix called a big.matrix which exploits memory-mapping for several purposes. First, the use of a memory-mapped file (called a filebacking) allows matrices to exceed available RAM in size, up to the limitations of available file system resources. Second, the matrices support the use of shared memory for efficiencies in parallel computing. A big.matrix can be created on any file system that supports mmap, including cluster file systems. As a result, bigmemory is an option for large-scale statistical computing, both on a single machine and on a cluster of machines with the appropriate configuration. The support for shared-memory matrices and a new framework for portable parallel programming will be discussed in Section 4. Third, the data structure provides reference behavior, helping to avoid the creation of unnecessary temporary copies of massive objects.
Finally, the underlying matrix data structure is in standard column-major format and is thus compatible with existing BLAS and LAPACK libraries as well as other legacy code for statistical analysis (primarily implemented in C, C++, and Fortran).

These packages are immediately useful in R, supporting the basic exploration and analysis of massive data in a manner that is accessible to non-expert R users. Typical uses involve the extraction of subsets of data into native R objects in RAM rather than the application of existing functions (such as lm) for analysis of the entire data set. Some of these features can be used independently of R by expert developers in other environments having a C++ interface. Although existing algorithms could be modified specifically for use with big.matrix objects, this opens a Pandora's box of recoding which is not a long-term solution for scalable statistical analyses. Instead, we support the position taken by Ihaka and Temple Lang (2008) in planning for the next-generation statistical programming environment (likely a new major release of R). At the most basic level, a new environment could provide seamless scalability through filebacked memory-mapping for large data objects within the native memory allocator. This would help avoid the need for specialized tools for managing massive data, allowing statisticians and developers to focus on new methodology and algorithms for analyses.

The matrices provided by bigmemory offer several advantages over standard R matrices, but these advantages come with tradeoffs. Here, we highlight the two most important qualifications. First, a big.matrix can rarely be used directly with existing R functions. A big.matrix has its own class and is nothing more than a reference to a data structure, which cannot be used by functions such as base::summary, for example. However, many analogous big.matrix operations are included in package biganalytics (Emerson and Kane 2013a), which implements apply, biglm, bigglm, bigkmeans, colmax, colmin, colmean, colprod, colrange, colvar, summary, etc. Tabulation operations are included in package bigtabulate (Kane and Emerson 2013b), which includes the functions bigsplit, bigtabulate, bigtable, and bigtsummary. Other analyses will need to be conducted on subsets of the big.matrix (which must fit into available RAM as R matrices). Similarly, the extraction of a larger-than-RAM subset of a big.matrix into a new big.matrix must be done manually by the user, a simple two-step process of creating the new object and then conducting the copy in chunks. The one exception to this is the sub.big.matrix class, which creates "windows" into contiguous, rectangular blocks of a big.matrix and is beyond the scope of this article.

Second, bigmemory supports numeric matrices (including NA values) but not character strings or anything like a big.data.frame. Package ff (Adler, Gläser, Nenadic, Oehlschlägel, and Zucchini 2013) offers a wide range of advanced functionality but at the cost of BLAS and LAPACK compatibility; a full comparison of bigmemory and ff is beyond the scope of this paper.
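For instance, a brief sketch of how the analogous functions and subset extraction look in practice, reusing the illustrative descriptor file from the example in Section 2.1:

library(bigmemory)
library(biganalytics)

# Attach an existing filebacked big.matrix via its descriptor file (the
# file name is the illustrative one used earlier); no data are read into
# RAM at this point.
x <- attach.big.matrix("example.desc")

colmean(x, na.rm = TRUE)   # analogue of colMeans, computed on the big.matrix
summary(x)                 # column-wise summaries provided by biganalytics

# Subsets extracted with "[" become ordinary R objects in RAM:
y <- x[1:1000, ]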
3. Application: Airline data management

Data analysis usually begins with the importation of data into native data structures of a statistical computing environment. In R, this is generally accomplished with functions such as read.table, which is very flexible: it reads a text file into a data.frame and can do so without information regarding the column types. To remain robust and to read these files correctly, many checks must be performed while a file is being imported, and this process carries associated overhead. This is particularly true when the colClasses parameter is not specified and read.table is required to infer the column types while scanning through the rows of a data file.
In this case, an intermediate data.frame is created with intermediate column vectors. If the data in a subsequent scan correspond to a different type than an intermediate vector, then a new vector is created with the updated type, and the contents of the intermediate vector are copied to the new vector. The process continues until all rows of the data file have been scanned. As a result, these importing functions can incur significant memory overhead, resulting in long load times for large data sets. Furthermore, native R data structures are limited to RAM. Together, these limitations have made it difficult, or even impossible, for many statisticians to explore large data using R's native data structures.

The bigmemory package offers a useful alternative for massive numeric data matrices. The following code creates a filebacking for the airline on-time performance data. This data set contains 29 variables for each of approximately 120 million commercial domestic flights in the United States from October 1987 to April 2008, and consumes approximately 12 GB of memory. A script to convert the data set into integer values is available at the Bigmemory Project website (Emerson and Kane 2013b). A compressed version of the preprocessed data set, along with all of the examples that appear in this article, can be downloaded from the authors' website (Kane and Emerson 2013a). The reader should note that the compressed file is approximately 1.7 GB and requires approximately 12 GB of disk space when uncompressed.

The creation of the filebacking avoids significant memory overhead and, as discussed in Section 2, the resulting big.matrix may be larger than available RAM. Double-precision floating point values are used for indexing, so a big.matrix may contain up to 2^53 - 1 elements on 64-bit architectures such as x86_64 (1 petabyte of 4-byte integer data, for example). These values are type-cast to 64-bit integers. This approach will be used for all numerical, vector-based data structures as of R version 3.0.0, available beginning in April 2013. The creation of the filebacking for the airline on-time performance data takes about 15 minutes. In contrast, the use of R's native read.csv would require about 32 GB of RAM, beyond the resources of common hardware circa 2013. The creation of an SQLite database (SQLite Development Team 2013) takes more than an hour and requires the availability and use of separate database software.

At the most basic level, a big.matrix object is used in exactly the same way as a native R matrix via the bracket operators ("[" and "[<-").
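A sketch of the filebacking step referenced above might look as follows; the input and backing file names (airline.csv, airline.bin, airline.desc) are assumptions about how the preprocessed data are stored locally.

library(bigmemory)

# Scan the preprocessed CSV once, writing the values directly into a
# memory-mapped integer matrix backed by 'airline.bin'; the descriptor
# file allows the matrix to be re-attached later (or by other processes).
x <- read.big.matrix("airline.csv", header = TRUE, type = "integer",
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc")
dim(x)   # roughly 120 million rows and 29 columns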