Insights into Research Computing Operations using Big Data-Powered Log Analysis

Fang (Cherry) Liu
Partnership for an Advanced Computing Environment (PACE), Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

Weijia Xu
Texas Advanced Computing Center, University of Texas, Austin, TX 78758
[email protected]

Mehmet Belgin
Partnership for an Advanced Computing Environment (PACE), Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

Ruizhu Huang
Texas Advanced Computing Center, University of Texas, Austin, TX 78758
[email protected]

Blake C. Fleischer
Partnership for an Advanced Computing Environment (PACE), Georgia Institute of Technology, Atlanta, GA 30332
[email protected]
ABSTRACT
Research computing centers provide a wide variety of services, including large-scale computing resources, data storage, high-speed interconnects and scientific software repositories, to facilitate continuous competitive research. Efficient management of these complex resources and services, as well as ensuring their fair use by a large number of researchers from different scientific domains, is key to a center's success. Almost all research centers use monitoring services based on real-time data gathered from systems and services, but often lack tools to perform a deeper analysis on large volumes of historical logs for identifying insightful trends from recurring events. The size of the collected data can be massive, posing significant challenges for the use of conventional tools for this kind of analysis. This paper describes a big data pipeline based on Hadoop and Spark technologies, developed in close collaboration between TACC and Georgia Tech. This data pipeline is capable of processing large volumes of data collected from schedulers using PBSTools, making it possible to run a deep analysis in minutes as opposed to hours with conventional tools. Our component-based pipeline design adds the flexibility of plugging in different components, as well as promoting data reuse. Using this data pipeline, we demonstrate the process of formulating several critical operational questions around researcher behavior, systems health, operational aspects and software usage trends, all of which are critical factors in determining solutions and strategies for efficient management of research computing centers.

KEYWORDS
big data, log analysis, hadoop, spark
CCS CONCEPTS
• Information systems → Data mining;
1 INTRODUCTION
The batch job scheduling system is ubiquitous in large research computing environments, including the Georgia Institute of Technology's (Georgia Tech) Partnership for an Advanced Computing Environment (PACE) center. PACE uses the Torque [9] resource manager and the MOAB scheduler [6] to manage approximately 1,400 nodes and 44,000 CPU cores, on which an average of 5,000 jobs are running at any given time. Job schedulers are a key component of scalable computing infrastructures. They orchestrate all of the work executed on the computing infrastructure and directly impact the effectiveness and throughput of the system. Job submissions to MOAB and Torque are done via user-created scripts that include a list of resource requests and the commands to run. When a job starts running on the cluster, the scheduler stores the job script and other collected information only temporarily and discards them after job completion. This lack of long-term retention of job details makes it very difficult to identify usage patterns and trends based on historical data. PBSTools [7] is an open source tool developed by the Ohio Supercomputer Center that provides a mechanism for long-term retention of job scripts and job details in a MySQL database, allowing analysis of historical usage on demand. This tool comes with a web GUI allowing some basic reporting, such as aggregating software usage statistics based on users, groups, job counts, and node counts. However, we found that PBSTools can become nearly unresponsive when a large number of jobs from multiple schedulers push data into the database concurrently. Further, PBSTools does not offer a visualization interface or allow more in-depth data
processing (e.g. host- or module-based indexing). Therefore, in lieu of the PBSTools web GUI, we integrate monthly PBSTools SQL queries into a data pipeline for fast, interactive analysis via a variety of postprocessing tools. This project is motivated by PACE's need for a deeper understanding of operational patterns and trends. PACE is tasked with supporting a diverse array of services on a large collection of heterogeneous resources, serving approximately 1,500 active users from hundreds of research groups across different domains with limited staff. This challenging role requires efficient and scalable solutions in every aspect, which are extremely difficult to achieve without a deep understanding of users, systems and software. Before this study, our decisions and strategies were mostly based on intuition and assumptions rather than solid data, which is counterproductive and often a recipe for failure. This tool now helps us identify which groups are causing the majority of job crashes, which software packages are loaded most frequently, which departments are our top users, and how some of our clusters have been misused by certain research groups, as outlined in Section 5. In this paper, we present a new big data pipeline for analyzing historical scheduler data, and share new insights into PACE's operations gained through the use of this tool. In Section 2, we provide background and related work. Section 3 explains the core system architecture in detail, which was developed in close collaboration with the Texas Advanced Computing Center (TACC). Section 4 describes the hardware setup and tests of the preliminary web GUI. Section 5 provides the results of the data analysis and gives examples of how this analysis can be used to identify patterns and convert them into meaningful action items. We conclude and discuss future work in Section 6.
2 BACKGROUND AND RELATED WORK
Most data center log analysis focuses on computer-system-level information, such as time-series performance and monitoring data related to CPU, memory, and network bandwidth usage. Frequently used tools include Splunk [8], Ganglia [4] and Nagios [5]. Splunk tackles large volumes of machine-generated data and turns them into valuable insights, including customer behavior, machine behavior, security threats, fraudulent activity and more. A recent Hadoop-based Splunk development, namely Hunk, provides a way to analyze large volumes of data on Hadoop. Ganglia is a scalable distributed monitoring system; its data structures and algorithms impose low per-node overhead while achieving high concurrency, and it can scale up to handle clusters with 2,000 nodes. Nagios provides monitoring of mission-critical infrastructure components including applications, services, operating systems, network protocols, system metrics, and network infrastructure. Although these tools provide basic monitoring functions for a data center such as PACE, a tool for analyzing historical scheduler data has become essential to our ability to provide sustainable and scalable solutions that meet the demands of the modern, ever-changing research computing environment.

PBSTools [7], developed at the Ohio Supercomputer Center, is deployed as an extension to PACE's Torque resource manager. PBSTools deposits both job scripts and Torque accounting records into a MySQL database, and provides a web front end for a variety of SQL queries. However, the slow performance of the web GUI provided by PBSTools, aggravated by the large volume of users, resources and jobs at Georgia Tech, prevents it from being used for even the most basic reporting functions. A static dump of the database would be faster to query, but still lacks the flexible design we need for a deeper analysis. TACC developed an application to support distribution analysis, association analysis and sequential pattern mining, with many options for users to interactively adjust, explore and visualize a large volume of log data generated from XALT [12]. Although we do not use XALT, we identified this tool as a good match for our needs and invited TACC researchers to collaborate on adapting it to process our scheduler logs.

To meet our generic analysis needs and the increasing volume of log data, we utilize Spark for efficient large-scale data processing. Spark is an open-source cluster computing framework originally developed by AMPLab at UC Berkeley [15]. One of the big advantages of Spark is that it runs in-memory on the cluster and does not rely on Hadoop's MapReduce two-stage paradigm, which makes repeated access to the same data much faster [11, 14]. We are considering a future integration of Apache Zeppelin [3], which provides a multi-purpose notebook for data ingestion, discovery, analytics, visualization and collaboration, with its built-in integration with Spark [1]. Zeppelin provides automatic SparkContext and SQLContext injection, runtime jar dependency loading from a local filesystem or Maven repository, and can display Spark job progress and cancel jobs. This makes it a useful front end to big data analytics platforms such as Spark for improved data visualization capabilities.
3 ARCHITECTURE AND DATA PIPELINE
We consider a dataset acquired from the PBSTools MySQL database between March 2016 and February 2017, amounting to a total size of 19GB composed of 55M records. 17 of the standard 38 fields were not necessary for our analysis and were thus removed, reducing the dataset to 6GB. These data were collected from 5 scheduler servers supporting 150 queues, which executed more than 55M jobs in total over the one-year period. Figure 1 shows the overall architecture with four major components:
• The data collector component pulls the data from the PBSTools database on a monthly basis and pushes it into the Hadoop Distributed File System (HDFS) [2] as a separate JavaScript Object Notation (JSON) file per month. Each month's JSON file contains about 5M new job records and is 300MB to 800MB in size, depending on cluster usage. All of the JSON files are kept in HDFS for further processing.
• The preprocessing component reads the JSON files produced by the data collector component, further parses the PBS job script entries, extracts hierarchical data indexed by modules and hostnames, and saves the intermediate data in Parquet format on HDFS.
• The analyzer component further processes the data based on the questions we want to answer, such as aggregating CPU time usage per software package or per host, job wait time per queue, software usage associations, etc.
• The postprocessing component has two independent pieces. Any tool of choice, e.g. a Python Jupyter Notebook (with Pandas) or the R Shiny Web GUI tool, can be used for data processing and visualization. We chose the Python Jupyter notebook for the analysis described in Section 5, and also present some preliminary results from the R Shiny Web GUI tool in Section 4.
Figure 1: Overview of the 4-component data reduction pipeline using a scalable Spark-based data processing engine
To efficiently analyze the large volume of log data, TACC had previously developed scalable tools for parsing and analyzing the log data captured by XALT [12]. We extended this tool to analyze logs generated directly by PBSTools for this project. The analysis tool leverages the Spark programming framework for efficient large-scale data processing and includes two components: a data preprocessing component and a data analysis component.
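As a rough illustration of the data collector component listed above, the following PySpark sketch pulls one month of job records from the PBSTools database and stores them as a JSON file on HDFS. The JDBC URL, credentials, table layout, and HDFS paths are illustrative assumptions rather than our production configuration; the actual collector could equally be built around plain SQL queries and an HDFS client.

```python
# Minimal sketch of a monthly data collector run (PySpark 2.x).
# Table and column names, the JDBC URL, and HDFS paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pbstools-collector").getOrCreate()

# Pull one month of job records from the PBSTools MySQL database over JDBC.
jobs = (spark.read.format("jdbc")
        .option("url", "jdbc:mysql://pbstools-db.example.edu:3306/pbsacct")
        .option("dbtable",
                "(SELECT jobid, system, groupname, nproc, queue, submit_ts, "
                " start_ts, end_ts, cput, walltime, hostlist, exit_status, script, sw_app "
                " FROM jobs WHERE start_date >= '2016-03-01' AND start_date < '2016-04-01') AS t")
        .option("user", "reader").option("password", "********")
        .load())

# Keep one JSON file per month on HDFS for the downstream preprocessing step.
jobs.write.mode("overwrite").json("hdfs:///pbstools/raw/2016-03.json")

spark.stop()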
3.1 Spark-based Preprocessing
The preprocessing component transforms the data into a more efficient format for scalable computation. Our tools can read log data in various formats from structured data sources such as relational databases, JSON, and structured text with a schema description. For this project, queries are made to the PBSTools database and stored as JSON files. Each JSON file contains a list of JSON objects, each representing one data record using a pre-defined schema. The list of JSON objects is then stored in Parquet format, a binary, self-describing data format suitable for subsequent data processing. During the data transformation step, users can also select only a subset of the raw data set. The tool directly supports subsetting data based on values specified for a field through the command line options (-filter, -beg and -end). A customized data parser can be used to implement more complicated data transformations. In this project, we implemented functions that extract information of interest from unstructured text fields based on regular expression matches. Two examples used here are parsing the software modules used by each job from the user-submitted script, and parsing the string representation of the resources used by each job into a list of hostnames for further processing. The extracted information forms two new columns of Array type. The result of the data transformation is serialized directly to HDFS in Parquet format for further analysis.
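The module-extraction step can be sketched as follows. This is a minimal illustration assuming that job scripts load software with "module load ..." lines and using hypothetical HDFS paths; it is not the exact regular expression or schema used in the tool.

```python
# Minimal sketch of the module-extraction step of preprocessing (PySpark 2.x).
# Paths, the regular expression, and column names are illustrative assumptions.
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("pbstools-preprocess").getOrCreate()

MODULE_RE = re.compile(r"^\s*module\s+(?:load|add)\s+(.+)$", re.MULTILINE)

def extract_modules(script):
    """Collect module names from 'module load ...' lines in a job script."""
    if not script:
        return []
    mods = []
    for line in MODULE_RE.findall(script):
        mods.extend(line.split())
    return mods

extract_modules_udf = udf(extract_modules, ArrayType(StringType()))

raw = spark.read.json("hdfs:///pbstools/raw/*.json")
enriched = raw.withColumn("module_array", extract_modules_udf(raw["script"]))

# Intermediate results are kept in Parquet for the analyzer component.
enriched.write.mode("overwrite").parquet("hdfs:///pbstools/parquet/jobs")
spark.stop()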
3.2 Analyzer
The second step performs three types of analysis, namely cross distribution analysis, association analysis and aggregated distribution analysis. The cross distribution analysis (via the -d flag) takes any two fields and computes how the values in one field are distributed in the data set according to the values of the second field. The result is a co-occurrence frequency matrix for the two fields, visualized through the web user interface. Given two data fields, the distribution analysis proceeds as follows:
(1) The values of each data field are first identified.
(2) The unique values are used as labels to construct the distribution matrix between the two fields.
(3) The occurrences of all unique value combinations between the two fields are aggregated and used to populate the distribution matrix.
The association analysis (via the -a flag) allows a user to infer relationships between sets of values from one or more fields. Through command line options, a user can also select another field to define the scope of the inference. For example, a user may wish to infer potential software usage based on usage activities per user or per user group, specified using the corresponding field names. The association analysis is implemented using the FPGrowth algorithm, and its results are stored using the Predictive Model Markup Language (PMML) to ensure interoperability for further post-processing and visualization needs.
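A minimal sketch of these two analyses is shown below, using Spark's built-in crosstab for the co-occurrence matrix and the RDD-based FPGrowth implementation available in Spark 2.1 for the association mining. Field names, thresholds and paths are illustrative assumptions, and the PMML export step is omitted.

```python
# Minimal sketch of the distribution and association analyses (Spark 2.1-era APIs).
# Field names, thresholds, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.mllib.fpm import FPGrowth

spark = SparkSession.builder.appName("pbstools-analyzer").getOrCreate()
jobs = spark.read.parquet("hdfs:///pbstools/parquet/jobs")

# Cross distribution analysis: co-occurrence counts of two fields
# (e.g. how job submissions per group are distributed over queues).
dist = jobs.stat.crosstab("groupname", "queue")
dist.show(10, truncate=False)

# Association analysis: frequent software combinations per job,
# mined with FPGrowth over the module_array column (items deduplicated per job).
transactions = (jobs.select("module_array")
                .rdd.map(lambda row: list(set(row.module_array or []))))
model = FPGrowth.train(transactions, minSupport=0.02, numPartitions=8)
for itemset in model.freqItemsets().collect()[:20]:
    print(itemset.items, itemset.freq)

spark.stop()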
The aggregated distribution analysis (via the -agg flag) is designed to aggregate values from multiple fields based on another set of fields used as "keys". The result is a data cube whose dimensions correspond to the set of "key" fields and whose values are aggregated vectors of values from the specified fields. All command line switches are also integrated into the R Shiny Web GUI, so the tool can be used through either the CLI or the GUI.
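The aggregated distribution analysis is essentially a group-by over the chosen "key" fields. A minimal sketch, assuming hypothetical field names and output paths:

```python
# Minimal sketch of an aggregated distribution query: total CPU seconds and job
# counts keyed by (groupname, queue). Field names and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum, count

spark = SparkSession.builder.appName("pbstools-aggregate").getOrCreate()
jobs = spark.read.parquet("hdfs:///pbstools/parquet/jobs")

cube = (jobs.groupBy("groupname", "queue")
        .agg(spark_sum("cput").alias("total_cput_seconds"),
             count("jobid").alias("job_count")))

# The reduced result is small enough to hand off to postprocessing as CSV.
cube.coalesce(1).write.mode("overwrite").csv(
    "hdfs:///pbstools/reduced/group_queue_usage", header=True)
spark.stop()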
3.3 Data Format and Transformation
We use a loosely-coupled design that links the pipeline components through files. The benefit of this approach is that each component can be used in conjunction with other data pipelines and developed independently in a modular fashion, and intermediate results can be reused by other applications in the future. As shown in Figure 1, the data is transformed as it moves along the pipeline; Table 1 shows the data schema with the selected fields from the PBSTools MySQL database. When retrieved by the data collector component, the two fields cput and walltime are converted from time format to seconds for easy aggregation later, and the script field is converted to unicode to avoid decoding errors when saving as JSON files.

Field          Original Type
jobid          varchar(32)
system         varchar(32)
groupname      varchar(32)
nproc          int(10) unsigned
nodes          text
queue          tinytext
submit_ts      int(11)
submit_date    date
start_ts       int(11)
start_date     date
end_ts         int(11)
end_date       date
cput_req       time
cput           time
walltime_req   time
walltime       time
mem_req        tinytext
pmem_req       tinytext
mem_kb         int(10) unsigned
vmem_kb        int(10) unsigned
hostlist       text
exit_status    int(11)
script         mediumtext
sw_app         tinytext
Table 1: PBSTools Data Schema
The preprocessing component converts the hostlist field in Table 1 into a hostlist_array column. The original format is <hostname1>/<cores1>+<hostname2>/<cores2>, and the output is a string array in which each string represents one hostname. The <cores> part is ignored because the total number of cores used by each job is already captured by the nproc field.
The analysis we present in Section 5 uses both hostlist_array for a hostname-indexed analysis and module_array, which is generated using the script field, for a software-indexed analysis.
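For illustration, a minimal pure-Python sketch of the hostlist conversion described above; the input string is a hypothetical example, and per-host core counts are dropped since nproc already records the total core count.

```python
# Minimal sketch of the hostlist -> hostlist_array conversion described above.
# The input string is a hypothetical example in "+"-joined <hostname>/<cores> form.
def parse_hostlist(hostlist):
    """Return the list of distinct hostnames, dropping the per-host core counts."""
    hosts = []
    for chunk in (hostlist or "").split("+"):
        hostname = chunk.split("/")[0].strip()
        if hostname and hostname not in hosts:
            hosts.append(hostname)
    return hosts

print(parse_hostlist("iw-h43-5/16+iw-h29-10/8"))
# ['iw-h43-5', 'iw-h29-10']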
4 HARDWARE AND TESTING
PACE has a dedicated five-node Hadoop/Spark cluster (AMD nodes with 24 cores, 128GB memory and 3TB local storage each) running Red Hat Enterprise Linux 6.5. One node is set up as the name node, while the other four nodes are used as data nodes. The disk is partitioned with 2.3TB as HDFS storage and 20GB as a tmp directory. Spark is configured to run on Hadoop YARN, with HADOOP_CONF_DIR pointing to the directory that contains the (client side) configuration files for the Hadoop cluster. These configurations are used to write to HDFS and connect to the YARN ResourceManager, and they are distributed to the YARN cluster so that all containers used by the application use the same configuration. Our data pipeline currently uses Python 2.7.9, Java 1.8, Hadoop 2.7.2 and Spark 2.1.0 for the experiments.

Based on our experiments, we identified the preprocessing component as the most time-consuming; time spent in the other components is insignificant, typically less than a minute. We first implemented the preprocessing component in pure sequential Python code, but it took about 6 hours to process one year's worth of data. With our Spark-based approach, the same processing took only 11 minutes, a significant increase in performance. Figure 5 shows the runtimes for different numbers of executors; even with a single executor, processing takes less than an hour. The algorithm scales with an increasing number of executors, and the execution time can be as low as 11 minutes for a year's worth of data. Figure 6 demonstrates the tool's linear scaling with respect to problem size: the execution time increases as the problem size grows. We do not observe much difference between one month and three months of data, likely because the data size is too small to offset the overhead of Spark starting containers on the remote hosts.

We also tested the preliminary web GUI written using the R Shiny package. Figure 2 shows the interface, in which distribution analysis and association analysis can be chosen for the various fields in Table 1. The date range panel allows users to select a subset of data in both the distribution and association analysis by setting start and end dates; by default, all records are used. The distribution analysis settings expose nominal variables to group by and numeric variables to total. The association panel provides the flexibility of adjusting the confidence and support levels and selecting the aggregation variable and the fields involved in the association analysis. Finally, there are two separate panels for distribution and association plot settings. For the distribution plot settings, users can set how many top categories appear in the legend and how many top groups are shown on the x axis; the categories on the legend and the groups on the x axis are selectable as well. For the association plot settings, users can limit how many rules are displayed on the association plot. Finally, users can switch between the bar plot and the association plot. One important feature of the web application is that the figures update immediately if only the plot setting panels are modified (after the first initiation of the Spark processing).
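As a rough sketch of how a PySpark session for this pipeline can be started on the YARN setup described earlier in this section (executor counts and sizes are illustrative assumptions, not our production settings):

```python
# Rough sketch of starting the pipeline's Spark session on YARN.
# HADOOP_CONF_DIR must point at the Hadoop client configuration before launch;
# executor sizing values below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pbstools-pipeline")
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

print(spark.sparkContext.applicationId)
spark.stop()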
Figure 2: GUI for PostProcessing
Figure 3: Resource Usage Per Group and Queue
Figure 4: Software Association Analysis
The Spark processing only needs to be repeated if users modify the analysis setting panels. Figure 3 shows the resource usage of the top 9 groups, along with the corresponding queues their jobs ran in, for the one-year data set.
Figure 5: Running times (minutes) vs number of executors (1,2,4,8) used for one year data analysis
Figure 4 shows the association analysis on software usage. As this tool is still under development, we used a Jupyter Notebook with Python's Pandas and Matplotlib modules to analyze the data presented in the next section.
Figure 6: Running times (minutes) vs data size for one month, three months, six months and one year
5 DATA ANALYSIS
The data pipeline in Figure 1 produces a reduced dataset in CSV format containing only the data required for the analysis, which can then be explored with any desired tool. Due to the compact size of the input file (700MB at most), it takes only seconds to a minute to load the data and run a full analysis on a year's worth of scheduler data, including the preprocessing needed to remove some unwanted special characters and strings.
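A minimal sketch of this postprocessing step in a Jupyter notebook follows; the CSV file name, the cleanup rules, and the derived runtime column are illustrative assumptions.

```python
# Minimal sketch of loading the reduced CSV output in a Jupyter notebook.
# The file name and the cleanup rules are illustrative assumptions.
import pandas as pd

jobs = pd.read_csv("reduced_jobs_2016-03_2017-02.csv")

# Light cleanup of stray characters left over from the job scripts.
jobs["queue"] = (jobs["queue"].astype(str).str.strip()
                 .str.replace(r"[^\w.-]", "", regex=True))

# Derive a wallclock runtime column (seconds) for the zero-runtime analysis in Section 5.1.
jobs["runtime_s"] = jobs["end_ts"] - jobs["start_ts"]
print(jobs.shape)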
5.1 Jobs Crashing Immediately after Allocation
We first look into jobs with zero runtime, which crashed immediately after allocation. Common reasons for immediate crashes are errors in submission scripts and hardware problems. Figure 7a shows the hostnames of the top 20 nodes with the most crashes in the past year. The node with the most frequent crashes, namely iw-h43-5, has not experienced many hardware problems in the past year, so we attribute this large number of crashes to problematic job submission scripts used by the group with exclusive access to this node. This finding prompted an outreach effort to help this particular group prevent more job submission failures in the future. In Figure 7a, the majority of crashes happened on nodes inside the 'h29' rack (hosts are named according to their rack). Unlike iw-h43-5, the nodes in this rack are shared by a large number of researchers from different domains. We also found some evidence of recurring problems in the history of these nodes, so it is possible that these crashes indicate hardware or network problems that may be impacting the entire h29 rack. This finding directed our efforts towards system administration-related troubleshooting for this particular rack. As a next step, we looked at the distribution of crashed jobs per department and found that the Biology department caused the most crashes by a large margin (Figure 7b). We then looked into the distribution of crashes within the Biology department (Figure 7c) and identified a particular group in Biology (group names de-identified) that experienced the majority of the crashes. This simple analysis allowed us to narrow down the root cause of a large fraction of job failures and address them by reaching out to only a handful of researchers in one group.
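A minimal Pandas sketch of this crash analysis, continuing from the loading sketch above; the 'department' column and a list-valued 'hostlist_array' column are assumptions about the reduced dataset, not its exact layout.

```python
# Minimal sketch of the crash analysis: jobs with zero runtime are grouped by
# node, department, and research group. Column names are assumptions; the
# hostlist_array column is assumed to hold Python lists of hostnames.
crashed = jobs[jobs["runtime_s"] == 0]

# Top 20 nodes by number of immediately crashed jobs (a Figure 7a-style view).
node_counts = (crashed.explode("hostlist_array")["hostlist_array"]
               .value_counts().head(20))

# Crash distribution per department, then within the top department (Figures 7b, 7c).
dept_counts = crashed["department"].value_counts()
top_dept = dept_counts.idxmax()
group_counts = crashed.loc[crashed["department"] == top_dept, "groupname"].value_counts()

print(node_counts, dept_counts.head(10), group_counts.head(10), sep="\n\n")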
5.2 Software Maintenance
PACE maintains a very large repository of scientific applications consisting of approximately 4,000 installations belonging to 400 packages compiled with different combinations of release versions, compilers, MPI stacks and Python versions. Supporting this large repository has been a significant challenge for our limited staff. We have long planned to narrow the scope by eliminating old and unused packages and versions, but our team previously lacked the visibility into user preferences needed to achieve this. The scheduler data captured in this analysis records the software modules loaded by each group. The names of these modules include software names and versions that can be used for identifying software usage patterns. Figures 8a, 8b, 8c and 8d show the versions used by researchers for four software packages we are interested in, namely Matlab, Python, Mvapich2 [13] and OpenMPI [10]. We found that most users have already made the switch to more recent Matlab versions (r2015a and later), which aligns well with our expectations. We support Matlab versions as old as r2007, and most of these old versions can now be retired. Maintaining a robust and fast Python stack has also proven to be a very challenging task for the PACE team. As the number of installed Python modules grows, we started seeing more internal version incompatibilities threatening the overall integrity of the entire Python stack. The Anaconda distribution provides many optimized and reliable Python modules with an easy installation mechanism. We made Anaconda available to our users some time ago, and are planning to decommission our internally compiled Python stack in the future. The Python usage patterns depicted in Figure 8b show a considerable adoption rate of 38% for Anaconda, but our internally compiled Python stack still has the lion's share with 55% (42% for the default version, which is v2.7, and 13% for explicit loads of v2.7). This chart distinguishes between python/2.7 and the default python version, which is also v2.7, to provide insight into how many users will be impacted if we change the default version in the future. These findings tell us that additional engagement and education is needed for Anaconda, as we are not quite ready to make the switch yet. Finally, we analyzed user preferences for the two main MPI distributions, namely Mvapich2 and OpenMPI. Despite its age, Mvapich2 v2.0 has an overwhelming share of 89%, even though v2.1 is available (19%). User preference for older versions is more pronounced for OpenMPI, as shown in Figure 8d, with 38% of OpenMPI jobs running with the previous 1.6 version.
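Version breakdowns like those above can be reproduced with a short Pandas sketch. It assumes module names follow a hypothetical "name/version" convention (e.g. "matlab/r2015a") and that module_array holds Python lists, which may differ from the actual module naming on our systems.

```python
# Minimal sketch of the version-share analysis, assuming a hypothetical
# "name/version" module naming convention and list-valued module_array entries.
loads = jobs.explode("module_array")["module_array"].dropna()

def version_share(package):
    """Percentage share of each version among loads of one package."""
    selected = loads[loads.str.startswith(package + "/")]
    versions = selected.str.split("/", n=1).str[1]
    return (versions.value_counts(normalize=True) * 100).round(1)

for pkg in ["matlab", "python", "mvapich2", "openmpi"]:
    print(pkg)
    print(version_share(pkg), end="\n\n")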
5.3 Cluster Utilization
Next, we look broadly into cluster utilization. Two available metrics are CPU time per job and the number of job submissions, excluding the immediately crashed jobs. While not exhaustive, these two metrics provide sufficient insight into the usage of the clusters by different departments and groups. Figure 9a depicts the top 10 departments with the largest number of job submissions. The Biology department has the most submissions, followed by tests run by PACE (PACE test). PACE stress tests the schedulers during maintenance by submitting large numbers of jobs with very short run times.
Figure 7: Analysis of Crashing Jobs
Figure 8: Analysis of Software Usage
As a result, we came in second in the number of submitted jobs, but most of these jobs complete almost instantly and we are nowhere close to being one of the top consumers of these resources. This is an example of how the number of job submissions can be a misleading metric, as it does not necessarily capture the utilization of the resources. Perhaps a better metric of utilization is CPU time, as it is not as easily biased by benchmarking. Figure 9b depicts the top 10 departments with the most (aggregate) CPU time utilization. These results show that the Mechanical Engineering department
is the number one consumer of CPU cycles, followed closely by Biology. This analysis also revealed unexpectedly high activity on our 'testflight' cluster, which is used to test experimental OS, scheduler, and software versions planned for future deployment. This cluster gives us the opportunity to identify and resolve problems in advance, before deploying changes everywhere. Figure 9c shows its aggregate CPU time utilization, broken down by the top 5 groups. While one group in particular stood out above the rest, several other groups may also be using the cluster for production runs, consuming a significant amount of cycles. High utilization aside, we are particularly concerned about the validity of research data obtained on experimental and unsupported systems and hardware. These results play a crucial role in identifying the groups that need to be contacted so that fair usage can be established on the cluster. We will contact these groups and apply further usage restrictions to discourage the use of this cluster for production runs in the future. Even with this limited initial analysis, we gain invaluable insights into job failures, software maintenance, and resource utilization. Each of these insights translates into meaningful action items for efficiently addressing common problems. The presented pipeline provides a proof of concept for deeper PBSTools database analyses, with an emphasis on much faster processing than would otherwise be possible with a non-distributed pure Python approach. The real power of the presented data pipeline is the ability to formulate relevant questions on the fly using the parameters listed in Table 1 and get almost instant answers, which would otherwise require tremendous effort and processing time when using conventional tools on a large collection of scheduler logs.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we present a streamlined data processing pipeline capable of analyzing large volumes of data obtained from job schedulers. Using this framework, we obtain faster results for gaining new insights into researcher behavior, systems health, operational aspects and software usage trends.
Figure 9: Analysis of Resource Utilization
Our initial analysis using this data pipeline yielded several meaningful action items that target specific user groups, software, and queues for efficient and focused resolution of common issues. In the future, we plan to improve the preprocessing component with more comprehensive data cleaning features. Some of the steps currently in postprocessing will be moved to the preprocessing component to improve reusability through command-line switches. In addition, we plan to integrate Apache Zeppelin with our backend data processing engine for more interactive data visualization and queries. Overall, this data pipeline is a fast, flexible and modular tool that can benefit any research computing center using schedulers that PBSTools supports, such as Slurm and Torque/MOAB.
7 ACKNOWLEDGEMENT
We would like to thank Andre McNeill, Dan Zhou and other PACE team members for their support on the hardware setup, and would like to thank Neil Bright and Paul Manno for their administrative support.
REFERENCES
[1] 2016. Apache Spark: Lightning-fast cluster computing. http://spark.apache.org/. (2016).
[2] 2016. Welcome to Apache Hadoop. http://hadoop.apache.org. (2016).
[3] 2017. Apache Zeppelin. https://zeppelin.apache.org. (2017).
[4] 2017. Ganglia Monitoring System. http://ganglia.sourceforge.net/. (2017).
[5] 2017. Nagios: The Industry Standard in IT Infrastructure Monitoring. https://www.nagios.org/. (2017).
[6] 2017. Moab HPC Suite: Adaptive Computing. http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-basic-edition/. (2017).
[7] 2017. PBS Tools: Ohio Supercomputer Center. https://www.osc.edu/~troy/pbstools. (2017).
[8] 2017. Splunk: Make Machine Data Accessible, Usable and Valuable to Everyone. https://www.splunk.com/. (2017).
[9] 2017. Torque Resource Manager: Adaptive Computing. http://www.adaptivecomputing.com/products/open-source/torque/. (2017).
[10] Richard L. Graham, Galen M. Shipman, Brian W. Barrett, Ralph H. Castain, George Bosilca, and Andrew Lumsdaine. 2006. Open MPI: A High-Performance, Heterogeneous MPI. In Proceedings of the Fifth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks. Barcelona, Spain.
[11] Ruizhu Huang and Weijia Xu. 2015. Performance Evaluation of Enabling Logistic Regression for Big Data with R. In 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2517–2524.
[12] Ruizhu Huang, Weijia Xu, and Robert McLay. 2016. A Web Interface for XALT Log Data Analysis. In Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale (XSEDE16). ACM, New York, NY, USA, Article 31, 8 pages. DOI: http://dx.doi.org/10.1145/2949550.2949560
[13] W. Huang, G. Santhanaraman, H. Jin, Q. Gao, and D. Panda. 2007. Design and Implementation of High Performance MVAPICH2: MPI2 over InfiniBand.
[14] Weijia Xu, Ruizhu Huang, Hui Zhang, Yaakoub El-Khamra, and David Walling. 2016. Empowering R with High Performance Computing Resources for Big Data Analytics. In Conquering Big Data with High Performance Computing. Springer, 191–217.
[15] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. HotCloud 10, 10-10 (2010), 95.