Fast and Parallel Data Access with bigmemory and Rserve

by Ulf Schepsmeier

Abstract Operationalisation of prediction models using big data faces several challenges in practice. In this article we address some of them: fast and parallel access to a huge data set in an efficient and stable way. These data are needed to add information to given data before applying a prediction in R (in an on-demand web-service). We use shared memory technologies and ideas from data warehousing. The R-package bigmemory and the Rserve technology of Urbanek (2003) are the key ideas here. With them we were able to fetch a specific row out of 60 million lines in less than 1/100 of a second in a highly parallel setting. Handling memory limitations as well as fast and efficient data loading are also addressed.
Introduction

In times of cheap computer memory storage, more and more data is collected and made available for “big data”, statistical analytics and artificial intelligence solutions. The big challenge in the past was to collect and store these data and make them available for analysis. Nowadays modern databases can store huge amounts of data, while solutions like Hadoop or Spark allow for distributed computing. Such approaches may solve the analytics part, but when it comes to operationalization of the fitted models, other topics pop up. Two of these topics we want to address here: fast and parallel access to the data in order to add specific information to the current data needed for prediction. To be more precise, we need fast access to a specific line in a huge reference data set for many parallel calls. We think that R is the perfect tool to solve this problem, given its rich class of models and tools for analysis and modeling on the one hand and up-to-date technologies for fast and parallel computing on the other.

The problem stems from an insurance context where lots of customer data is available, stored in an Oracle (ORACLE, 2016) database (DB). The specific problem has about 60 million lines and 200 columns representing customers with their attributes. The predictive model is assumed to run on each new customer claim and is handled on-demand in a web-service. Therefore, fast and highly parallel access to the historical data in R is needed. On top of that we have, of course, limited memory on the server, and an uninterrupted exchange of the data set is needed to update the data. Approaches such as SQL requests to the DB using R-Oracle connections via packages like ROracle (Mukhin et al., 2014) are not fast enough to be considered for productive on-demand services. Therefore, new solutions have to be found - in R - applying shared memory methods. Figure 1 summarizes the main problems. The fourth point “Analysis” is not part of this manuscript, but the proffered solution may also be of use for an analysis setup.

The solution we present is divided into three parts. It all starts with data preparation in the database, followed by a fast and efficient extract, transform and load (ETL) process to provide the data for the R-sessions. Here we will present two approaches.
Figure 1: Overview of the addressed problems with big data, fast and parallel data access
Figure 2: Solution with bigmemory and Rserve: process diagram
The main part of our solution is the use of the in-memory functionality offered by the R-package bigmemory (Kane et al., 2013) in combination with indexing to speed up the row selection. We will show that this approach is several orders of magnitude faster than the other solutions investigated. Besides the fast access to the data, we also optimize the memory consumption by an intelligent data type definition for each variable. A reduction of 83% (180 GB to 30 GB) was achieved in our application. The usage of the deepcopy() function allows uninterrupted exchange of the data set, which was a requirement, too. In order to use this solution in the required setting of highly parallel on-demand access for independent customer claims, we make use of the Rserve technology offered by Urbanek (2003). Rserve acts as a socket server (TCP/IP or local sockets) which allows binary requests to be sent to R. Every connection has a separate workspace and working directory. The client-side implementation is done in Java, allowing our application to use facilities of R without the need to link to R code.

The proffered solution may also act as an environment for analysis on the business side, making use of R-packages like biganalytics (Emerson and Kane, 2016) or biglm (Lumley, 2013). It can also be combined with parallel computing, applying the foreach package of Analytics and Weston (2014) or mclapply of the parallel package which is part of R (R Core Team, 2015). The process diagram in Figure 2 illustrates our solution with Rserve and bigmemory.

The run time environment for our application is a 64-bit Linux Red Hat 6 server with 24 Intel(R) Xeon(R) CPU @ 2.27 GHz cores and a total memory (RAM) of 378 GB. On this machine we run R in version 3.2.3. The most important packages are bigmemory (4.4.6), ROracle (1.2-2) and RSclient (0.7-3), a client for Rserve which allows connecting to Rserve instances and issuing commands. The data is stored on an Oracle Exadata cluster (ORACLE, 2016) with Oracle 11g.
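To illustrate how these two building blocks interact, the following minimal sketch shows how a client session could ask a running Rserve instance to attach the shared reference data and return a single row. The host, port, descriptor file and row index are hypothetical and not taken from our application.

library(RSclient)

con <- RS.connect(host = "localhost", port = 6311)    # default Rserve port
row <- RS.eval(con, {
    library(bigmemory)
    ref <- attach.big.matrix("/data/refdata.desc")    # attach shared matrix, no copy
    ref[12345, ]                                      # fetch one specific row
})
RS.close(con)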
Data preparation

We consider a data set of historical customer transactions and call these data reference data - in contrast to the active data of a customer to which we want to join a specific row of the reference data. The reference data consists of 200 different variables storing historical data of customers over the last two years. This results in about 60 million rows. The data contains customer numbers for identification, dates, variables describing the customer such as age or sex, and variables aggregating historical transactions. To stress this point again: all data has to be delivered to the shared memory to be able to fetch a specific row of the data in the on-demand service. The information which rows are really needed is not available beforehand, thus pre-filtering is not possible.

We assume that all the necessary data is stored in one single data table in an Oracle DB. Other SQL DBs such as PostgreSQL, Microsoft SQL Server or others are of course possible, too. The difference in the subsequent steps will be the connection to the DB in R.

The main data preparation is to make all data variables numeric, since big.matrix objects of the bigmemory package, which we will use later on, only allow numeric values (see Kane et al., 2013). Thus categorical variables such as sex have to be converted to numeric codes. This should be done beforehand when creating and populating the data table in the DB. It is also possible in the subsequent ETL process, but it is less clean and reduces the performance of the data delivery process. Alternatives to the bigmemory package such as ff of Adler et al. (2014) may offer in-memory solutions for strings but have
other drawbacks. In our application bigmemory was the best solution, thus we will concentrate on it here. For an overview of both packages and a comparison see, for example, the presentation of Rosario (2010).
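To give a flavor of this preparation step, the sketch below recodes a categorical variable and a date column as numbers; the data frame ref_data and its column names are hypothetical and only serve as an illustration.

# All columns must be numeric before they fit into a big.matrix:
# categorical variables become integer codes, dates become day counts.
ref_data$sex      <- as.integer(factor(ref_data$sex))        # e.g. 1 = female, 2 = male
ref_data$claim_dt <- as.numeric(as.Date(ref_data$claim_dt))  # days since 1970-01-01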
ETL process

In this section we describe two possible ETL processes to extract data from a DB and provide the data to R. One uses the ROracle package for direct access from R to the (Oracle) DB, while the second method uses SQL*Plus to extract the data to CSV files. The section also describes our development steps towards the - we think - best solution.

The general idea of an ETL process is to Extract data from homogeneous or heterogeneous data sources, Transform the data for storing it in the proper format or structure for the purposes of querying and analysis, and to Load it into the final target (usually a data warehouse) (Wikipedia, 2016). In our case we want to extract the data from the data warehouse, transform it into the proper (R-)format, split the data according to its data type (double, integer, short or char) and load it into R. The split of the data according to its data type will become relevant in the load step; at this point we assume just one data type to simplify this step. Note that the data type ‘double’ is the most general data type here, and ‘integer’, ‘short’ and ‘char’ variables can be handled as ‘double’ variables.
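To anticipate the load step, one possible way to exploit this split is to hold the columns of each data type in a separate file-backed big.matrix, as sketched below; the dimensions, file names and path are hypothetical and serve only as an illustration.

library(bigmemory)

# Columns of type 'short' or 'char' occupy 2 or 1 bytes per cell instead of
# the 8 bytes needed for 'double', which reduces the memory footprint.
ref_double <- filebacked.big.matrix(nrow = 6e7, ncol = 20, type = "double",
                                    backingfile    = "ref_double.bin",
                                    descriptorfile = "ref_double.desc",
                                    backingpath    = "/data")
ref_short  <- filebacked.big.matrix(nrow = 6e7, ncol = 150, type = "short",
                                    backingfile    = "ref_short.bin",
                                    descriptorfile = "ref_short.desc",
                                    backingpath    = "/data")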
Using ROracle

The first idea to provide the data is with R itself. With their package ROracle, Mukhin et al. (2014) offer an easy way to connect R to an Oracle DB. They make use of the popular package DBI (R Special Interest Group on Databases, 2013). Since the data is already in the correct format in the DB, a simple select fetches the data and stores it in an RData-file. Storing the data in .rda or CSV files is of course possible too, which will become a topic later on. For the description of the following functions and their arguments please consult the manual of ROracle.
library(ROracle)

# driver
drv <- dbDriver("Oracle")