Interface for Performance Environment Autoconfiguration frameworK

Liang Men
Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72710

Bilel Hadri and Haihang You
National Institute for Computational Science, University of Tennessee, Knoxville, TN 37996
Abstract—The Performance Environment Autoconfiguration frameworK (PEAK) is presented to help developers and users of scientific applications find the optimal configurations for their applications on a given platform with rich computational resources and complicated options. The choices to be made include the compiler and its compilation options, the numerical libraries and their parameter settings, and other environment variables that take advantage of NUMA systems. A website-based interface is developed for users' convenience in choosing the optimal configuration, yielding significant speedups for several scientific applications executed on different systems.

Keywords—HPC, Autoconfiguration, Numerical Library, Compiler

I. INTRODUCTION
A systematic way of selecting and configuring the resources on High Performance Computing (HPC) systems is desired by scientific application developers in order to obtain optimal application performance. For most application developers, an exhaustive search of all possible configurations with their many parameters is beyond their time and effort. For example, on several HPC supercomputers such as Kraken and Jaguar, the Cray XT5 systems operated by the National Institute for Computational Science (NICS) and Oak Ridge National Laboratory (ORNL) respectively, the numerical libraries are among the most used software [1]. They are provided by several vendors and supply optimized BLAS (Basic Linear Algebra Subprograms) [2], LAPACK (the Linear Algebra PACKage) [3], FFT and ScaLAPACK [4] functions. These libraries, such as LibSci [5] from Cray, ACML [6] from AMD and MKL [7] from Intel, also support the different compilers available on the systems, such as PGI, Cray, GNU, Intel and PathScale. A default configuration is usually provided on a supercomputer with the expectation of handling the majority of scientific applications. However, our preliminary exploration indicates that the default environment is not the best configuration for many applications. Although the library and compiler documentation is well maintained at the supercomputing centers, such information is overwhelming for users, and it is difficult to find the optimal options. An environmental auto-configuration framework is therefore developed to help users of scientific applications select the
optimal configuration for their applications on supercomputing platforms with abundant computational resources. It starts with benchmarks of popular numerical routines in the optimized vendor libraries. The benchmarks are compiled with the available compilers, linking to the available numerical libraries on the supercomputing systems. With a wide range of vector or matrix sizes as input parameters, the compiled programs are executed in different environments to populate a knowledge database preserving the performance data. The database provides a valuable reference for users to find the most beneficial environments and configurations of computational resources for their applications. The framework interface provides several distinctive functions. First, it provides scripts to automatically compile and execute the benchmarks of the available numerical functions in the different environments. With the performance data of the benchmarks, developers can easily discover the best configurations for their scientific applications. In addition, developers are encouraged to check performance data from the existing database to inform their choices among the available computational resources. The interface provides performance diagrams based on the user's parameters and updates its database when new configurations are executed. Furthermore, the interface can advise users on static or dynamic linking paths, compiler options, and other configuration tools based on the user's choice of platform. The auto-configuration feature is essential for developers who port their codes between different supercomputing environments for compilation and execution.
II. THE DEVELOPMENT OF THE FRAMEWORK
A. Experimental environment

The performance comparison is based on two high performance platforms at NICS, Kraken and Nautilus. Kraken is a Cray XT5 platform with a peak performance of 1.17 Pflops/s over 112,896 compute cores. Each node is composed of two 2.6 GHz six-core AMD Opteron processors (Istanbul) with 16 GB of memory. All the results in Section II.C were obtained on one node with twelve cores, leading to a theoretical peak performance of 124.8 Gflops/s. Three numerical libraries have been studied: LibSci (10.4.5), ACML (4.4.0) and Intel MKL (10.2). Each numerical library is built with the following compilers: PGI (10.6.0), GNU (4.4.3) and Intel (11.1.038).
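For reference, the quoted node peak follows from the core count, the clock rate, and the four double-precision floating-point operations each Istanbul core can retire per cycle (one 2-wide SSE add and one 2-wide SSE multiply):

12 cores × 2.6 GHz × 4 flops/cycle = 124.8 Gflops/s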
Nautilus is an SGI Altix UV 1000 (UltraViolet) shared-memory machine featuring 1,024 cores (Intel Nehalem-EX processors running at 2.0 GHz) and 4 terabytes of memory within a single system image. Three compilers, Intel (12.1.233), PGI (10.3) and GNU (4.3.4), and two numerical vendor libraries, MKL (10.3.6) and ACML (4.4.0), are considered in the framework.
[Fig. 1 flowchart: the user interface (input: size, function, ...; output: compiler, library, environment, ...) feeds an auto-configuration generator, which produces the test driver code and kernel code; a job script generator emits PBS scripts for platform execution; the resulting performance data populates the performance database that serves user inquiries.]

Fig. 1. Design Flow of the Autoconfiguration Framework
B. Design Flow of the Framework

The framework is built on the performance data of commonly used numerical routines, which are compiled with various compilers linking to the available libraries on the platforms. As shown in Fig. 1, a batch of the most used scientific functions is selected for benchmarking. The auto-configuration generator produces the test driver code and kernel code from the preserved test bench model and the user configuration, such as matrix sizes, timing functions, and performance evaluation functions. The compilation process, which generates applications with the various combinations of compilers and libraries, is performed automatically by a job script linking with optimized flags and options. Performance data is generated by running an application on the platform with a wide range of input parameters, scheduled by scripts for the Portable Batch System (PBS). A performance database is built, indexed by external variables such as matrix size, function name, compiler, library, number of cores, etc.

Based on this initial setup, an inquiry interface is developed for users' access to the framework. It provides suggestions on using the recommended library and compiler for better performance in scientific applications. The performance data in the database is plotted as diagrams for reference. If the inquired information does not exist, the framework is adaptive: new benchmark functions can be inserted, and the new results are preserved for future reference. More options for new compilers and libraries will be explored and added to the interface.
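As a concrete illustration of the job script generation step, the following Python sketch writes a PBS file for one compiler/library combination on a Cray XT5 system. All names here (module names, the aprun launch line, file names) are hypothetical placeholders, not the actual PEAK scripts:

# peak_jobgen.py -- minimal sketch of a PBS job script generator
# (illustrative only; module names, paths and launch options are hypothetical)

PBS_TEMPLATE = """#!/bin/bash
#PBS -N {name}
#PBS -l size=12,walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
module swap PrgEnv-pgi PrgEnv-{compiler}   # select the compiler environment
{module_lines}
aprun -n 1 -d 12 ./{binary} {sizes}        # run the benchmark on one node
"""

def make_job(name, compiler, library, binary, sizes):
    # Extra module loads for non-default libraries (assumption: LibSci is
    # the default programming environment and needs no explicit module).
    modules = {"libsci": "", "acml": "module load acml", "mkl": "module load mkl"}
    script = PBS_TEMPLATE.format(name=name, compiler=compiler,
                                 module_lines=modules[library],
                                 binary=binary, sizes=" ".join(map(str, sizes)))
    fname = name + ".pbs"
    with open(fname, "w") as f:
        f.write(script)
    return fname

if __name__ == "__main__":
    # Generate one job per compiler/library combination for a DGEMM benchmark.
    for compiler in ("pgi", "gnu", "intel"):
        for library in ("libsci", "acml", "mkl"):
            name = "dgemm_%s_%s" % (compiler, library)
            make_job(name, compiler, library, name + ".x", range(1000, 9000, 1000))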
C. Performance Comparison for BLAS/LAPACK Libraries

In previous work [8], nine popular subroutines from BLAS and LAPACK were benchmarked with three numerical vendor libraries (LibSci, ACML and MKL) and three compilers (PGI, GNU and Intel). The default programming environment on Kraken, LibSci with the PGI compiler, provides in most cases the fastest implementation for the BLAS subroutines, or comes very close to the peak performance of ACML. However, for the computation of eigenvalues and eigenvectors with DSYGV and the computation of the QR factorization with DGELS in LAPACK, the default programming environment is not a decent choice and can dramatically slow down a scientific application.
Fig. 2. DSYGV Performance on Kraken with 12 Cores
As the detailed performance in Fig. 2 shows, when calling the DSYGV function to compute eigenvalues and eigenvectors, linking against LibSci is not recommended because of its poor performance. ACML with PGI performs well for problems of small sizes, and MKL with PGI performs better at larger sizes. Another example is shown in Fig. 3: the DGELS routine solves a system of equations using the QR factorization. While MKL and ACML obtain better performance, LibSci does not perform well, especially when compiled with Intel. According to the Cray scientific library developers, this function has not been perfected, and they are in the process of improving its performance for future releases.
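A measurement of this kind can be reproduced with a small driver. The following sketch, assuming a SciPy stack whose LAPACK comes from the library under test, times the generalized symmetric eigensolver (the DSYGV family) over a range of matrix sizes; it is an illustration, not the PEAK test driver itself:

# dsygv_bench.py -- timing sketch for the DSYGV-style generalized
# symmetric eigenproblem A x = lambda B x (illustrative only)
import time
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
for n in (1000, 2000, 4000, 8000):
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                  # symmetric A
    B = rng.standard_normal((n, n))
    B = B @ B.T + n * np.eye(n)        # symmetric positive definite B
    t0 = time.perf_counter()
    w, v = eigh(A, B)                  # dispatches to LAPACK xSYGV/xSYGVD
    dt = time.perf_counter() - t0
    print("n=%5d  time=%8.2f s" % (n, dt))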
Fig. 3. DGELS Performance on Kraken with 12 Cores
On Nautilus, the MKL library outperforms ACML by almost a factor of two for the LAPACK functions when compiled with Intel in general cases. One exception is the DSYGV routine when the number of cores is greater than 16. Fig. 4 shows the performance of this function with 64 cores: for matrix sizes below 8000, ACML is the fastest implementation for solving the eigenvalue problems.
Fig. 4. DSYGV Performance on Nautilus with 64 Cores
Fig. 5. FFT Performance on Kraken with Different 2D Matrix Sizes
D. Performance Comparison for FFT Libraries

The Fast Fourier Transform (FFT) provides the Discrete Fourier Transform (DFT) computation at the basis of many scientific applications. FFTW is a popular FFT library with a comprehensive collection of fast C routines for computing the DFT and related cases [9]. The latest version, 3.x, features a brand-new design offering better support for SIMD instructions on modern CPUs and a distributed-memory implementation on top of MPI. CRAFFT (Cray Adaptive FFT) [10], part of the LibSci library, is available on Cray XT systems. CRAFFT provides a simplified interface that delegates computations to other FFT kernels (including FFTW) and can dynamically select among the available FFT kernels through the CRAFFT_PLANNER setting. Intel MKL has provided the Discrete Fourier Transform Interface (DFTI) since MKL 6.0. Since version 10.2, MKL has fully integrated the FFTW3.x interface without any extra effort for building wrappers, so it shares the same benchmark as FFTW in the framework exploration. With a highly efficient FFT algorithm to compute the DFT, ACML supports FFT routines for complex and real-to-complex data, handling single- and multi-dimensional transforms. The multidimensional routines benefit from the use of OpenMP for good scalability on SMP machines.
The FFT routines in the different libraries do not share a common interface. Three test benches, developed for ACML, LibSci, and MKL/FFTW (MKL and FFTW share the FFTW3.x interface), are linked to the corresponding libraries and compiled with the different compilers for benchmarking on Kraken. The transform is performed on 2D matrices, from real to complex, with sizes from 4 to 4096. Each computation is performed 100 times on randomly generated matrices with the cache cleared. The performance results for the PGI, Intel and GNU compilers are close, so the PGI results are selected for Fig. 5, normalized with respect to the CRAFFT_PLANNER = 0 group, which is comparable to FFTW 3.3. Setting CRAFFT_PLANNER to 0 indicates that no online planning is done and a default FFT kernel is used at all times. If the value is 1, some planning is attempted to find a faster kernel than the default. If the value is 2, planning is extensive and attempts to use the fastest kernel available to CRAFFT. For most of the small matrix sizes, MKL has better performance than the other libraries.
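A simplified version of this 2D real-to-complex measurement can be sketched as follows, assuming a NumPy build linked against the FFT implementation under test; the actual PEAK drivers are library-specific and are not reproduced here:

# fft2d_bench.py -- 2D real-to-complex FFT timing sketch (illustrative;
# PEAK uses separate drivers for ACML, LibSci/CRAFFT and MKL/FFTW)
import time
import numpy as np

rng = np.random.default_rng(0)
REPS = 100                              # each size is timed over 100 runs
for n in (4, 16, 64, 256, 1024, 4096):
    total = 0.0
    for _ in range(REPS):
        a = rng.standard_normal((n, n)) # fresh random matrix each run,
                                        # which also displaces cached data
        t0 = time.perf_counter()
        np.fft.rfft2(a)                 # real-to-complex 2D transform
        total += time.perf_counter() - t0
    print("%5d x %-5d avg %9.3f ms" % (n, n, total / REPS * 1e3))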
III. USER INTERFACE FOR ENVIRONMENTAL CONFIGURATION

A user interface is developed for the framework to simplify the environment variable settings through automatic configuration. The automated data generation tool partitions the process of constructing the framework database into four steps: compiling the benchmarks with different compilers and environmental settings, executing the applications with various parameters, extracting data from the performance results, and building a database for further inquiry. The interface hides the details of test bench compilation and execution on HPC platforms. Another advantage of employing the configuration tools is that users are relieved of configuring complicated linking flags and compiling options. The interface provides users with a selection of attributes in the drop-down menus shown in Fig. 6 to guide the creation of a batch of Python scripts for the workflow.
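The last two steps, extracting the results and building the database for inquiry, can be pictured with a small sqlite3 sketch; the schema, file name, and example row below are hypothetical:

# peak_db.py -- sketch of the performance database behind the interface
# (schema, file name and the sample row are hypothetical)
import sqlite3

conn = sqlite3.connect("peak_perf.db")
conn.execute("""CREATE TABLE IF NOT EXISTS perf (
    platform TEXT, function TEXT, compiler TEXT, library TEXT,
    cores INTEGER, size INTEGER, gflops REAL)""")

def record(row):
    # Insert one extracted benchmark result.
    conn.execute("INSERT INTO perf VALUES (?, ?, ?, ?, ?, ?, ?)", row)
    conn.commit()

def best_config(platform, function, cores, size):
    # Answer the basic user inquiry: which compiler/library pair is fastest?
    cur = conn.execute("""SELECT compiler, library, MAX(gflops) FROM perf
        WHERE platform=? AND function=? AND cores=? AND size=?""",
        (platform, function, cores, size))
    return cur.fetchone()

record(("kraken", "DGEMM", "pgi", "libsci", 12, 4000, 110.0))  # illustrative row
print(best_config("kraken", "DGEMM", 12, 4000))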
Fig. 6. Performance Data Generation Interface
Fig. 7. Performance of DGEMM with Different Environmental Variables and Tools Configuration
A. Autoconfiguration for Linking Flags and Compiler Options

The linking flags and compiler options for the available resources on Kraken and Nautilus are abundant and intricate for non-expert users. For instance, LibSci is the default programming environment, requiring no additional flags for most compilers. When linking with ACML or MKL, the '_mp' flag is necessary for optimized performance, taking full advantage of the multithreaded BLAS. After successful compilation of the source code, configuration files are required to execute the user's application on the HPC systems. In the job script files, the environment variable OMP_NUM_THREADS is set by the user at runtime to the desired number of threads. In addition, the environment variable MKL_NUM_THREADS must be set to the maximum number of cores in one node when using the MKL libraries.

On Nautilus, memory placement management and thread affinity are important for optimizing multithreaded as well as OpenMP applications. For optimized memory placement, the numactl tool [11] can schedule processes on a specific NUMA architecture, for example specifying a round-robin placement policy across the nodes. Besides memory locality, the dplace tool [12] is used to bind a related set of processes to specific CPUs to prevent process migration. As mentioned in the dplace manual, the option "-x 2" is used with Intel MKL to skip placement of the second thread, since Intel OpenMP jobs use an extra lightweight shepherd thread, unknown to the user, that should not be pinned.
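Putting these settings together, a launch on Nautilus for a 16-thread MKL run might be assembled as in the sketch below; the binary name is a placeholder, and combining numactl and dplace on a single command line is one plausible arrangement rather than the paper's exact script:

# launch_nautilus.py -- sketch of a recommended launch command on Nautilus
# (binary name is a placeholder; options follow the settings discussed above)
import os
import subprocess

env = dict(os.environ)
env["OMP_NUM_THREADS"] = "16"   # number of threads requested at runtime
env["MKL_NUM_THREADS"] = "16"   # set to the core count per the MKL guidance

# numactl interleaves memory pages round-robin across the NUMA nodes of the
# job; dplace pins threads to CPUs, and "-x 2" skips the extra lightweight
# shepherd thread that the Intel OpenMP runtime creates.
cmd = ["numactl", "--interleave=all", "dplace", "-x", "2", "./dgemm_bench.x"]
subprocess.run(cmd, env=env, check=True)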
Besides the configuration tools, more environment variables are available for custom configuration to avoid performance degradation on NUMA systems. Fig. 7 shows an example of the varying performance of the DGEMM function from MKL, compiled with Intel and executed on 16 cores with different environmental configurations on Nautilus. Two more environment variables, MKL_DYNAMIC and KMP_AFFINITY, are recommended for optimized performance. When MKL_DYNAMIC is TRUE, MKL is allowed to dynamically set the number of threads; otherwise, Intel MKL uses the number of threads set by the user. The other environment variable, KMP_AFFINITY, provides a thread affinity mechanism for Intel OpenMP programs; if it is DISABLED, the OpenMP runtime library does not make any affinity-related system calls [13]. When the option "-x 2" is passed to dplace and the other environment variables, KMP_AFFINITY and MKL_DYNAMIC, are set appropriately, the best performance, shown in the blue curve, reaches 121 Gflops/s. If KMP_AFFINITY is disabled, as the bottom curve in green shows, the peak performance of DGEMM stays at the single-core performance of 7.66 Gflops/s, which is 15.8 times slower than the best performance. The performance reaches only 90% of the best without setting MKL_DYNAMIC to FALSE, and drops by more than 60% if dplace is not invoked, regardless of whether the numactl option is set.
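The effect of these variables can be probed with a small driver. The sketch below assumes a NumPy build linked against Intel MKL and sets the variables before NumPy is imported, since the MKL and OpenMP runtimes read them once at initialization:

# dgemm_env.py -- probe the effect of affinity settings on DGEMM
# (assumes a NumPy build linked against Intel MKL)
import os

# These must be set before NumPy (and thus MKL/OpenMP) is first imported,
# because the runtimes read them once at initialization.
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["MKL_DYNAMIC"] = "FALSE"        # use exactly the thread count set
os.environ["KMP_AFFINITY"] = "granularity=fine,compact"

import time
import numpy as np

n = 8000
a = np.random.standard_normal((n, n))
b = np.random.standard_normal((n, n))
t0 = time.perf_counter()
c = a @ b                                  # dispatches to MKL DGEMM
dt = time.perf_counter() - t0
print("DGEMM n=%d: %.1f Gflops/s" % (n, 2 * n**3 / dt / 1e9))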
B. Framework Interface Implementation

The framework interface is implemented as a webpage for universal access. Users choose the specified tasks, platform, libraries and compilers for their applications. After a computing platform is selected, such as Kraken or Nautilus on the current interface, the linking flags for compilation and the environment variable settings for optimized performance are loaded automatically. The auto-configured files and scripts relieve the users of the complex details of program execution. Furthermore, the framework makes it possible for users to port applications from one computing system to another, avoiding restrictions at the execution level.
[Fig. 8 flowchart: the user interface (select tasks: cleanse, compilation, execution; choose platform, library, compiler and system architecture; submit user selections) passes parameters to the backup server, which calls the CGI program, reconfigures the user interface for user access, reconfigures the PBS file and Python script for execution, and uploads the scripts to the configured platform for execution.]
Fig. 8. Framework Interface Implementation in Python CGI

The implementation of the interface is shown in Fig. 8. The configuration information is transferred to the backup server, where a Python CGI program is invoked. PBS script templates are preserved on the server with all possible and recommended configurations for the available platforms. The background program loads a PBS template based on the platform selection and tailors it with the optimized environment variables and recommended tool settings. A new PBS file is produced along with a Python script for program compilation and application execution. Through the Python CGI support, the files can be conveniently accessed from the interface or uploaded to the selected platform.
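A minimal sketch of such a CGI handler is shown below; the form fields, template paths, and output locations are hypothetical, and the production program is more elaborate:

#!/usr/bin/env python
# peak_cgi.py -- minimal sketch of the backup-server CGI program
# (form fields, template paths and platform names are hypothetical)
import cgi

form = cgi.FieldStorage()
platform = form.getfirst("platform", "kraken")
compiler = form.getfirst("compiler", "pgi")
library = form.getfirst("library", "libsci")

# Load the preserved PBS template for the selected platform and tailor it
# with the recommended settings for the chosen configuration.
with open("templates/%s.pbs" % platform) as f:
    template = f.read()
script = template.format(compiler=compiler, library=library)
with open("generated/%s_%s_%s.pbs" % (platform, compiler, library), "w") as f:
    f.write(script)

print("Content-Type: text/plain\n")
print("PBS script generated for %s / %s / %s" % (platform, compiler, library))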
IV. CONCLUSIONS

The framework is built to study the performance of the available computing resources on high performance computing platforms, helping researchers determine the best-performing combination of vendor libraries and compilers, which is essential to achieve better utilization efficiency for the execution of time-consuming scientific applications. The common functions in BLAS and LAPACK have been benchmarked and the performance data is well maintained. The new development of the framework is meeting its goal on current and emerging extreme-scale systems, as well as for parallel and distributed computation. The performance of the different FFT implementations in the vendor libraries is benchmarked in the framework with regard to the accuracy of the implementations as well as the distributed memory resources. A website interface is developed to help researchers determine the fastest library choices for their applications, saving them the effort of experimenting with different libraries. A knowledge database storing previous performance data on the website server provides great convenience for user inquiries about better performance for their applications.
REFERENCES

[1] B. Hadri, T. Robinson, M. Fahey, and W. Renaud, "Software Usage on Cray Systems across Three Centers (NICS, ORNL and CSCS)," CUG 2012, Stuttgart, Germany, 2012.
[2] BLAS, "Basic Linear Algebra Subprograms," http://www.netlib.org/blas/.
[3] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide. No. 8, Society for Industrial and Applied Mathematics, 1987.
[4] ScaLAPACK, "Scalable Linear Algebra PACKage," http://www.netlib.org/scalapack/.
[5] Cray, "LibSci," http://docs.cray.com.
[6] AMD, "Core Math Library," http://www.amd.com/acml.
[7] Intel, "Math Kernel Library (MKL)," http://www.intel.com/software/products/mkl/.
[8] B. Hadri and H. You, "A Performance Comparison Framework for Numerical Libraries on Cray XT5 System," CUG 2011, Fairbanks, Alaska, 2011.
[9] FFTW, "Fastest Fourier Transform in the West," http://www.fftw.org/.
[10] J. Bentz, "FFT Libraries on Cray XT: Current Performance and Future Plans for Adaptive FFT Libraries," CUG 2008, Helsinki, Finland, 2008.
[11] SGI, "numactl man page," http://techpubs.sgi.com/library/.
[12] SGI, "dplace man page," http://techpubs.sgi.com/library/.
[13] B. Hadri, H. You, and S. Moore, "Achieve Better Performance with PEAK on XSEDE Resources," in Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment (XSEDE '12), p. 10, ACM, 2012.