MANIPULATION OF LARGE DATABASES WITH "R"

Ana Maria DOBRE, Andreea GAGIU
National Institute of Statistics, Bucharest

Abstract
Nowadays, knowledge is power. In the information era, the ability to manipulate large datasets is essential for building long-term strategies. More and more companies and official statistics offices need to use large databases, but as the volume of data grows, so does the complexity of manipulating it. A basic rule of economics is that supply follows demand; accordingly, much software for manipulating big data has been created to meet this need. The aim of this paper is to highlight the performance of R in manipulating large databases. The paper is therefore primarily intended for readers already familiar with common database and statistical concepts. It illustrates how simply R - in both its open-source and commercial versions - can be used to manipulate large databases.

Keywords: R statistical software, data manipulation, big data, large databases, statistics
JEL Classification: C44, C61, C82, C87

Introduction

By big data we mean the huge volumes of information that new technologies and companies collect and register about individuals or processes. In practice, big data is any dataset or data application that does not fit into available RAM. Analysing data is certainly fun. Analysing big data is full of challenges, almost fascinating, but not quite as much fun: everything depends on how well the software meets the requirements. In this paper we show that R - in both its free open-source and its commercial versions - can handle large databases with ease. R is widely used in every field where there is data: academia, business, official statistics and so on. Since many R users have very large computational needs on big data, various tools for the manipulation of large databases have been developed. By manipulation we mean binding together analysis, data mining, computation, visualization and much more.

1. Literature review

The importance of using a single piece of software able to perform all the stages of data analysis was first shown by Hodgess (2004), for models that then required SAS and FORTRAN programming, or a combination of Excel, FORTRAN and SAS. Currently, R packages can perform almost every type of data analysis: plots, cluster analysis, decomposition, sampling analysis, mapping, statistical regression and forecasts. In recent years, one of the major problems has been manipulating data from large datasets. Initially, computers could barely read in a large dataset, let alone display it. Gradually, computers have become able to handle larger and larger datasets. The book Graphics of Large Datasets: Visualizing a Million (Unwin, Theus, Hofmann, 2006) contributes an overview of graphics and data analysis for large databases. Revolution R Enterprise is built upon the powerful open-source R statistics language.
Its message is "100% R and More". Revolution R Enterprise is Revolution Analytics' commercial version of R for organizations and large-scale research; it is nevertheless free for students, professors, researchers and open-source users. With commercial enhancements and professional support for real-world use, it brings higher performance, greater scalability and stronger reliability to R, at a fraction of the cost of other commercial products such as SPSS or SAS. R's popularity has grown in recent years and the trend is favourable: estimates suggest that in about three years the number of R users will exceed the number of SAS and SPSS users. For the period May 2010-May 2012, R was ranked first by 30% of respondents (Muenchen, 2012).

2. Tools for manipulation of large databases in R

First of all, we introduce some basic concepts about packages (Adler, 2010). Packages in R are collections of previously programmed functions, bundling functions for specific tasks. There are two types of packages: those that come with the base installation of R and those that must be manually downloaded and installed. The base installation refers to the big executable file that we download and install; it contains the most common packages. To see which packages are already installed, we can click Packages -> Load package. There are hundreds of user-contributed packages that are not part of the base installation, most of which are available on the R website. Many of them allow the user to execute the same statistical calculations as commercial software.

2.1. Packages included with the base installation

Loading a package that came with the base installation may be done either by a mouse click or by entering a specific command. The user can click Packages -> Load package and select a package. The other method is to use the library command. For instance, to load the MASS package, the user should type:

> library(MASS)

This command gives access to all functions in the MASS package.

2.2. Packages not included with the base installation

Sometimes the process of loading a package is slightly more complicated. For instance, consider a paper in which data are plotted against their spatial locations (latitude and longitude) and the size of the dots is proportional to the data values. The text states that the graph was created with the bubble function from the gstat package. If we click Packages -> Load package, we will not find gstat: if a package does not appear in the list, it has not been installed. Hence this method can also be used to determine whether a package is part of the base installation. To obtain and install gstat, or any other available package, we can download the zipped package from the R website and ask R to install it, or we can install it from within R.

2.3. Loading the package

There is a difference between installing and loading. Installing means adding the package to the base version of R; loading gives full access to all the functions in the package. A package cannot be loaded if it has not been installed. To load the gstat package, either of the two methods described above can be used. Once it has been loaded, ?bubble will give instructions for using the function. To summarise the process of installing and loading packages: if a package is part of the base installation, or has previously been installed, we should use the library function. If a package's operation depends upon other packages, they will be loaded automatically, provided they have been installed; if not, they can be installed manually. The complete sequence is illustrated below.
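As a minimal sketch (assuming an Internet connection and a default CRAN mirror), the whole install-and-load sequence for gstat looks like this:

> install.packages("gstat")   # needed only once per R installation
> library(gstat)              # needed in every new session
> ?bubble                     # help page for the bubble function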
2.4. The quality of the packages

Some packages contain hundreds of functions written by leading scientists in their field, who have often written a book in which the methods are described. Other packages contain only a few functions that may have been used in a single published paper. Hence, contributors range from the enthusiastic PhD student to the professor who has published ten books. Every package is a research project reviewed at academic level (Caragea, Alexandru, Dobre, 2012).

There are plenty of packages for handling data when the data are small enough. Things get complicated when big data is involved, and there are several approaches to huge amounts of data. Below we explain how R handles data. With the usual read.table function, R reads data into RAM all at once, and objects in R live entirely in memory. Keeping unnecessary data in RAM will eventually cause R to choke. Specifically, on most systems it is not possible to use more than 2 GB of memory; the range of indexes that can be used is limited by the lack of a 64-bit integer data type in R and R64; and on 32-bit systems the maximum amount of virtual memory space is limited to between 2 and 4 GB. (A chunked alternative to reading everything at once is sketched after the list below.) There are three major solutions in R:
- bigmemory: ideal for problems involving the analysis in R of manageable subsets of the data, or when an analysis is conducted mostly in C++. It is part of the "big" family, some of which we discuss in this study (the bigmemory, biglm, snow and bigdata packages);
- ff: file-based access to datasets that cannot fit in memory (the ff package);
- databases which provide fast read/write access for data analysis (the RODBC and DBI packages).
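A minimal sketch of that chunked alternative, using only base R (the file name "big.csv" and the per-chunk processing step are hypothetical):

con <- file("big.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]
repeat {
  chunk <- tryCatch(read.table(con, nrows = 10000, sep = ",", col.names = header),
                    error = function(e) NULL)  # read.table signals an error at end of file
  if (is.null(chunk)) break
  # ... update running totals or other chunk-wise statistics here ...
  if (nrow(chunk) < 10000) break               # last, partial chunk
}
close(con)

Only one chunk is ever held in RAM, at the cost of writing the analysis in an updating, chunk-by-chunk style - the same idea that the biglm package (below) automates for regression models.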
2.5. Packages that handle big data

1. The bigmemory package is part of the Bigmemory Project. bigmemory and the related packages biganalytics, synchronicity, bigtabulate and bigalgebra bridge the gap between R and massive data, implementing massive matrices and supporting their manipulation and exploration. bigmemory implements several matrix objects: big.matrix (an object that simply points to a data structure in C++), shared.big.matrix (similar to big.matrix, but sharable among multiple R processes) and filebacked.big.matrix (it does not point to an in-memory data structure; rather, it points to a file on disk containing the matrix, and the file can be shared across a cluster). Shared memory allows us to store data in RAM and share it among multiple processes: suppose we want to store some data in shared memory so that it can be read by multiple instances of R; this gives the user the ability to run multiple instances of R performing different analytics simultaneously. The data structures may be allocated to shared memory, allowing separate processes on the same computer to share access to a single copy of the data set. The data structures may also be filebacked, allowing users to easily manage and analyze data sets larger than available RAM and to share them across the nodes of a cluster. Among the advantages of the bigmemory package, we mention the possibility to store a matrix in memory, restart R and regain access to the matrix without reloading the data, and to share the matrix among multiple R instances or sessions. There are also several disadvantages:
- no communication among instances of R;
- limited by available RAM, unless filebacked.big.matrix is used;
- the matrix disappears on reboot, unless filebacked.big.matrix is used.
A minimal sketch follows this list.
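A minimal sketch of a file-backed big.matrix (the file names are illustrative; the calls follow the bigmemory interface described above):

library(bigmemory)
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "big.bin", descriptorfile = "big.desc")
x[1:2, ] <- rnorm(6)   # the data live in the backing file, not in R's heap
# A second R session (or another cluster node) can attach the same matrix:
# y <- attach.big.matrix("big.desc")

Because only the small descriptor file is passed around, several R processes can work on the same on-disk matrix without each holding a copy in RAM.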
2. The ff package is a solution based on files. It provides data structures that are stored on disk but act as if they were in memory: only the necessary, active parts of the data on disk are mapped into main memory. The package supports R's standard atomic types - double, logical, raw, integer - as well as non-standard atomic types and non-atomic types. ff has some special features that differentiate it from bigmemory: support for arrays and data frames, not "just" matrices; good indexing for improved random-access performance; and fast filtering of data frames via the bit package. A short sketch follows.
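A minimal sketch with ff (the vector length is illustrative):

library(ff)
v <- ff(vmode = "double", length = 1e8)  # stored in a disk file, not in RAM
v[1:5] <- runif(5)                       # only the touched pages are mapped into memory
d <- ffdf(value = v)                     # a data-frame-like wrapper around ff vectors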
3. The biglm package is a great alternative for performing the usual (logistic) regression analyses on big data. Its general approach to building generalized linear models on big data is the following: load the data into memory in chunks; process the current chunk and update the sufficient statistics required for the model; dispose of the chunk and load the next one; repeat until the end of the file. A sketch of this chunked updating follows.
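A minimal sketch of the chunk-and-update scheme (the CSV files and the model formula are hypothetical):

library(biglm)
chunk <- read.csv("data_part1.csv")       # first chunk
fit <- biglm(y ~ x1 + x2, data = chunk)   # fit on the first chunk
chunk <- read.csv("data_part2.csv")       # next chunk
fit <- update(fit, chunk)                 # update the sufficient statistics
summary(fit)

The model object stores only the sufficient statistics, so its size does not grow with the number of rows processed.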
Revolution R Enterprise with RevoScaleR. The main embarrassment is that R is a memory-bound language: all data used in calculations - vectors, matrices, lists and so on - need to be held in memory. Even for modern computers with 64-bit address spaces and huge amounts of RAM, dealing with data sets of tens of gigabytes and hundreds of millions of rows (or larger) can present a significant challenge. The problem is not only capacity, but also accommodating the data in memory for the analysis. Revolution Analytics addresses this with its initiative to extend the reach of R into production data analysis with terabyte-class data sets. With the integrated RevoScaleR package, R users can process, visualize and model their largest data sets in a fraction of the time of legacy systems, without needing to deploy expensive or specialized hardware. RevoScaleR provides a new data file type, with extension .xdf, that has been optimized for accessing parts of an Xdf file for independent processing. RevoScaleR also provides a new R class, RxDataSource, designed to support the use of external-memory algorithms with .xdf files (Rickert, 2011).
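An illustrative sketch of this workflow (a hedged example: RevoScaleR is a proprietary package, exact signatures may differ between versions, and the airline file and variable names come from Revolution Analytics' standard demos rather than from this paper):

library(RevoScaleR)
rxImport(inData = "airline.csv", outFile = "airline.xdf")    # convert the CSV to .xdf
rxSummary(~ ArrDelay, data = "airline.xdf")                  # chunked summary statistics
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = "airline.xdf")  # external-memory regression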
Oracle R Enterprise. Oracle R Enterprise, a component of the Oracle Advanced Analytics Option, enables users to run R commands on database-resident data, to develop and refine R scripts, and to leverage the parallelism and scalability of the database. Data analysts can run the latest open-source R packages and develop without needing SQL knowledge. Oracle R Enterprise can be used from the R console or from any R GUI/IDE.

The authors recommend installing packages only as they are needed, in order to avoid the error "Error: cannot allocate vector of size X Mb". Otherwise, if the user installs all the packages immediately after starting R, RAM will be too busy and R will not be able to allocate memory when running the commands in a script.

3. SWOT analysis for R in large data manipulation

In the following we outline a SWOT analysis for R regarding large data manipulation, in order to make clear what R can and cannot do.

3.1. Strengths

For R in general:
- Open-source program with an improved commercial version
- The costs of using R are related only to the training of users
- Works on various operating systems: Windows, Linux, Mac OS X
- Easy to install and configure
- Mature, state-of-the-art programming language: mix-and-match models, scripts and packages for the best results
- Has several packages for big data support

For Revolution Analytics in particular:
- The cost of using Revolution Analytics is very small compared with other similar software (see Table 1)
- Processes, visualizes and models terabyte-class data sets in a fraction of the time of legacy products, without requiring expensive or specialized hardware
- Uses R on multiple cores on a single computer
- Can process more data than fits into memory, by creating XDF files
- Optimizes the streaming of data from disk to memory, dramatically reducing the time needed for statistical analysis of large data sets
- Using only a commodity multi-processor computer with modest amounts of RAM, data processing and predictive modelling can easily be performed on data sets with hundreds of millions of rows and hundreds of variables
- Makes it easy to cut down the computation time for big data analytics simply by scaling with compute nodes
- Reduces processing time commensurately by extending the system to a small cluster of similar computers
- Minimizes the amount of data conversion and copying, saving time and speed
- New variables and rows can be added without needing to rewrite the entire file
- Efficient parallelization of statistical and data-mining algorithms
- Multiple models can be analyzed jointly
- Descriptive statistics and cross-tabs on very large data sets
- Statistical modelling on very large data sets
- Can detect collinearities in models, which can lead to wasted computations or even computational failures, and remove them prior to doing any computations
- Support for relational databases
- Efficient object indexing
- Oracle integration via Oracle R Enterprise
- Added support for native SAS file formats and conversion to XDF

3.2. Weaknesses
- R reads data into memory by default
- R is not clever enough to use more memory than is available
- Regardless of the number of CPU cores, a default build of R will only use one
- The range of indexes that can be used is limited by the lack of a 64-bit integer data type in R

3.3. Opportunities
- The RevoScaleR package provides external-memory algorithms that help R break through the memory/performance barrier
- RevoScaleR functions are fast and efficient, enabling real data analysis on a 120+ million row, 13 GB data set on a common dual-core laptop
- All of the RevoScaleR statistical functions produce objects that may be used as input to standard R functions
- Extensible programming framework: advanced R users can write their own functions to exploit the capabilities of XDF files and RxDataSource objects
- Re-use and reproduction of newly discovered techniques in the analytic operations the user is going to perform, which is difficult in SAS or SPSS
- No one has commercialised the PSPP open-source alternative to SPSS the way Revolution Analytics did with open-source R
- Revolution Analytics added support for native SAS file formats and conversion to XDF


3.4. Threats
- Other packages and programming languages (SAS in particular) can read data from files on demand
- SAS is based almost entirely on external-memory algorithms
- Even without a capacity limit, computations may be too slow to be useful

Table 1. Comparison of license prices for commonly used software for analysing big data

Software                        License price
Revolution Analytics            $1,000
Oracle Advanced Analytics       $23,000/CPU and $460/Named User Plus
SAS Analytics Pro               EUR 5,640 (commercial/individual use, 1-user license)
IBM SPSS Statistics Premium     $15,800

Source: processing by the authors based on pricing available on the Internet

4. Case study in R

The case study is based on a company personnel database. First we describe how we connected the database to R; then we show some examples of using R to analyse the data.

4.1. Connecting with the database

There are essentially two ways to communicate with databases in R: one based on the ODBC protocol, and one based on the general interface provided by the DBI package (R Special Interest Group on Databases, 2009) together with specific packages for each database management system (DBMS). If the user decides to use the ODBC protocol, it is necessary to ensure that the DBMS can communicate via this protocol, which may involve installing some drivers on the DBMS side. On the R side, it is only necessary to install the RODBC package. The DBI package implements a series of database interface functions that are independent of the database server actually used to store the data. The user only needs to indicate which communication interface to use at the first step, when establishing a connection to the database. This means that if the DBMS is changed, only a single instruction needs to change (the one that specifies the DBMS to communicate with). To achieve this independence, the user also needs to install other packages that take care of the communication details for each different DBMS. R has DBMS-specific packages for the major DBMSs; specifically, for communication with a MySQL database stored on a server, R provides the RMySQL package.
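A minimal sketch of the DBI route with RMySQL (the connection details are illustrative):

library(DBI)
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "employees", host = "localhost",
                 user = "user", password = "pass")
dbListTables(con)                                    # list the available tables
res <- dbGetQuery(con, "SELECT * FROM departments")  # run an SQL query, get a data frame
dbDisconnect(con)

Switching to another DBMS would only require changing the driver object passed to dbConnect(); the remaining calls stay the same.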
4.2. Loading the Data into R Running on Windows

If R is running on Windows, independently of whether the MySQL database server resides on the same PC or on another computer (possibly running another operating system), the simplest way to connect to the database from R is through the ODBC protocol. To use this protocol in R, the RODBC package needs to be installed. Before connecting to any MySQL database for the first time using ODBC, a few extra steps are necessary: the MySQL ODBC driver, called myodbc, must be installed; it can be downloaded from the MySQL site. This only needs to be done the first time ODBC is used to connect to MySQL. After installing this driver, we can create ODBC connections to MySQL databases residing on the computer.
According to the ODBC protocol, every database connection created has a name (the Data Source Name, or DSN in ODBC jargon). This name will be used to access the MySQL database from R. To create an ODBC connection on a Windows PC, we must use a program called "ODBC Data Sources", available from the Windows control panel. After running this program, we have to create a new User Data Source using the MySQL ODBC driver (myodbc) that we have previously installed. During this creation process, we will be asked several things, such as the MySQL server address, the name of the database to which we want to establish a connection, and the name we give to this connection (the DSN). Once we have completed this process, which we only have to do the first time, we are ready to connect to this MySQL database from R. After loading the RODBC package, we establish a connection with our database using the previously created DSN, with the function odbcConnect(). We then use one of the functions available to query a table - in this case the sqlFetch() function, which obtains all rows of a table and returns them as a data frame object. The next step is to create an xts object from this data frame using the date information and the quotes. Finally, we close the connection to the database with the odbcClose() function.

The following R code establishes a connection to the "employees" database and loads information about the contained tables:

> library(RODBC)
> ch <- odbcConnect("employees")
> sqlTables(ch)
  TABLE_CAT TABLE_SCHEM  TABLE_NAME TABLE_TYPE REMARKS
1 employees             countries        TABLE
2 employees             departments      TABLE
3 employees             employees        TABLE
4 employees             job_history      TABLE
5 employees             jobs             TABLE
6 employees             locations        TABLE
7 employees             regions          TABLE

In the following we present some examples of how to extract data using SQL queries (the query texts themselves are elided):

> # the geographical location of the departments
> qry_dep_country <- "..."                    # query text elided
> dep_country <- sqlQuery(ch, qry_dep_country)
> print(dep_country)
   DepartmentID   DepartmentName CountryISOCode
1            10   Administration             US
2            20        Marketing             CA
3            30       Purchasing             US
4            40  Human Resources             UK
5            50         Shipping             US
6            60               IT             US
7            70 Public Relations             DE
8            90        Executive             US
9           100          Finance             US
10          110       Accounting             US
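For illustration, a query of the shape used above might look as follows (hypothetical: apart from the table names shown by sqlTables(), the column names are guesses about the sample schema):

> qry <- "SELECT d.DepartmentID, d.DepartmentName, l.CountryISOCode
          FROM departments d JOIN locations l ON d.LocationID = l.LocationID"
> dep_country <- sqlQuery(ch, qry)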
> # the number of employees grouped by departments and countries
> specialist_country <- sqlQuery(ch, "...")   # query text elided
> specialist_country
   DepartmentID NoEmployees   DepartmentName CountryISOCode
1            10           1   Administration             US
2            20           2        Marketing             CA
3            30           6       Purchasing             US
4            40           1  Human Resources             UK
5            50          23         Shipping             US
6            60           5               IT             US
7            70           1 Public Relations             DE
8            90           3        Executive             US
9           100           6          Finance             US
10          110           2       Accounting             US
> # the list of jobs and the average payment for each job
> qry_jobs_avg_pay <- "..."                   # query text elided
> jobs_avg_pay <- sqlQuery(ch, qry_jobs_avg_pay)
> print(jobs_avg_pay)
   NoEmployees avg_salary                        JobTitle
1            5       7920                      Accountant
2            1      12000              Accounting Manager
3            1       4400        Administration Assistant
4            2      17000   Administration Vice President
5            1      12000                 Finance Manager
6            1       6500  Human Resources Representative
7            1      13000               Marketing Manager
8            1       6000        Marketing Representative
9            1      24000                       President
10           5       5760                      Programmer
11           1       8300               Public Accountant
12           1      10000 Public Relations Representative
13           5       2780                Purchasing Clerk
14           1      11000              Purchasing Manager
15           2       2600                  Shipping Clerk
16          16       2750                     Stock Clerk
17           5       7280                   Stock Manager
> # linear regression model of how the number of employees varies with salary and job title
> qry_jobs <- "..."                           # query text elided
> jobs <- sqlQuery(ch, qry_jobs)
> print(jobs)
   NoEmployees Salary                      JobTitle
1            1   6900                    Accountant
2            1   7700                    Accountant
3            1   7800                    Accountant
4            1   8200                    Accountant
5            1   9000                    Accountant
6            1  12000            Accounting Manager
7            1   4400      Administration Assistant
8            2  17000 Administration Vice President
9            1  12000               Finance Manager
10           1   6500 Human Resources Representative
11           1  13000             Marketing Manager
12           1   6000      Marketing Representative
13           1  24000                     President
14           1   4200                    Programmer
15           2   4800                    Programmer
16           1   6000                    Programmer
17           1   9000                    Programmer
18           1   8300             Public Accountant
19           1  10000 Public Relations Representative
20           1   2500              Purchasing Clerk
21           1   2600              Purchasing Clerk
22           1   2800              Purchasing Clerk
23           1   2900              Purchasing Clerk
24           1   3100              Purchasing Clerk
25           1  11000            Purchasing Manager
26           2   2600                Shipping Clerk
27           1   2100                   Stock Clerk
28           2   2200                   Stock Clerk
29           2   2400                   Stock Clerk
30           2   2500                   Stock Clerk
31           2   2700                   Stock Clerk
32           1   2800                   Stock Clerk
33           1   2900                   Stock Clerk
34           2   3200                   Stock Clerk
35           2   3300                   Stock Clerk
36           1   3600                   Stock Clerk
37           1   5800                 Stock Manager
38           1   6500                 Stock Manager
39           1   7900                 Stock Manager
40           1   8000                 Stock Manager
41           1   8200                 Stock Manager
> summary(biglm(NoEmployees ~ Salary + JobTitle, jobs))
Large data regression model: biglm(NoEmployees ~ Salary + JobTitle, jobs)
Sample size = 41
                                           Coef    (95%     CI)     SE      p
(Intercept)                              1.5242  0.2774  2.7710 0.6234 0.0145
Salary                                  -0.0001 -0.0002  0.0001 0.0001 0.3837
JobTitleAccounting Manager               0.2700 -0.7403  1.2804 0.5052 0.5930
JobTitleAdministration Assistant        -0.2330 -1.1935  0.7275 0.4802 0.6276
JobTitleAdministration Vice President    1.6010  0.0071  3.1948 0.7969 0.0445
JobTitleFinance Manager                  0.2700 -0.7403  1.2804 0.5052 0.5930
JobTitleHuman Resources Representative  -0.0940 -0.9204  0.7324 0.4132 0.8201
JobTitleMarketing Manager                0.3362 -0.7739  1.4463 0.5551 0.5447
JobTitleMarketing Representative        -0.1271 -0.9765  0.7223 0.4247 0.7648
JobTitlePresident                        1.0643 -1.5062  3.6348 1.2852 0.4076
JobTitleProgrammer                       0.1229 -0.4461  0.6919 0.2845 0.6657
JobTitlePublic Accountant                0.0252 -0.7747  0.8250 0.3999 0.9499
JobTitlePublic Relations Representative  0.1377 -0.7204  0.9958 0.4290 0.7483
JobTitlePurchasing Clerk                -0.3402 -1.2470  0.5666 0.4534 0.4530
JobTitlePurchasing Manager               0.2039 -0.7211  1.1288 0.4625 0.6593
JobTitleShipping Clerk                   0.6479 -0.4879  1.7837 0.5679 0.2539
JobTitleStock Clerk                      0.2591 -0.6193  1.1375 0.4392 0.5552
JobTitleStock Manager                   -0.0424 -0.5131  0.4284 0.2354 0.8572

Conclusion

In conclusion, R is the most powerful software for handling datasets as well as big data. R can perform almost every type of data analysis on big data: plots, cluster analysis, statistical regression, forecasts. The case study presented here covers only a very small part of what R is able to do. Among its biggest advantages are its flexibility and the low price of its commercial version compared with other similar software. Further research on R and big data should be pursued, because both the possibilities of the software and the requirements of companies and official statistics offices keep growing.

Acknowledgement

The present paper is part of a research project of the Romanian R-useRs Team (http://www.r-project.ro/). The authors would like to express special gratitude to Nicoleta Caragea and Ciprian Antoniade Alexandru, who provided support and guidance for this project.

References

Adler J., 2010, R in a Nutshell, O'Reilly.
Caragea N., Alexandru A.C., Dobre A.M., 2012, "Bringing New Opportunities to Develop Statistical Software and Data Analysis Tools in Romania", The Proceedings of the VIth International Conference on Globalization and Higher Education in Economics and Business Administration, pp. 450-456.
Edlefsen L., 2011, "RevoScaleR Speed and Scalability", Revolution Analytics.
Hodgess E., 2004, "A Computer Evolution in Teaching Undergraduate Time Series", Journal of Statistics Education, 12 (3), www.amstat.org/publications/jse/v12n3/hodgess.html
Muenchen R., 2012, The Popularity of Data Analysis Software, http://r4stats.com/articles/popularity/
R Development Core Team, 2005, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org
Rickert J., 2011, "Big Data Analysis with Revolution R Enterprise", Revolution Analytics.
Ripley B., Lapsley M., 2012, "RODBC: ODBC Database Access. R package version 1.3-6", http://cran.r-project.org/package=RODBC
Rosario R., 2010, "Taking R to the Limit, Part II: Working with Large Datasets", http://www.bytemining.com/wp-content/uploads/2010/08/r_hpc_II.pdf

Unwin A., Theus M., Hofmann H., 2006, Graphics of Large Datasets: Visualizing a Million, Springer Science, Singapore.
* * * Big Memory Project, http://bigmemory.org/
* * * Comprehensive R Archive Network, http://cran.r-project.org
* * * Database source, http://www.php-example.com/2010/07/php-mysql-sample-database-script-create.html
* * * MySQL connector download, http://www.mysql.com/downloads/connector/odbc/
* * * MySQL DSN configuration on Windows, http://dev.mysql.com/doc/refman/5.0/en/connector-odbc-configuration-dsn-windows.htm
* * * Oracle R Enterprise, http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
* * * Revolution Analytics, http://www.revolutionanalytics.com
