Enhancing Productivity in High Performance Computing Through Systematic Conditioning¹

Magdalena Sławińska, Jarosław Sławiński, and Vaidy Sunderam
Dept. of Math and Computer Science, Emory University
400 Dowman Drive, Atlanta, GA 30322, USA
{magg,jaross,vss}@mathcs.emory.edu

Abstract. In order to take full advantage of high-end computing platforms, scientific applications often require modifications to source codes, and to their build systems that generate executable files. The ever-increasing emphasis on productivity in HPC makes the efficiency of the build process an extremely important issue. Our work is motivated by the need for a systematic approach to the HPC lifecycle that encompasses build and porting tasks. In this paper we briefly present the design of the Harness Workbench Toolkit (HWT) that addresses the porting issue through a user-assisted conversion process. We briefly describe our adaptation capability model that includes routine conversions (string substitutions) as well as more advanced transformations such as 32-64 bit changes. Next, we show and discuss results of an experiment with a production source code (the CPMD application) that examines the effort of adapting the baseline code (the Linux distribution) to specific high-end machines (IBM SP4, Cray X1E, Cray XT3/4) in terms of the number of necessary conversions. Based on the conversion capability model, we have implemented conversion assistant modules that were used in the experiment. The experimental results are promising and demonstrate that our approach takes a step towards improving the overall productivity of scientific applications on high-end machines.

Key words: porting parallel programs, productivity, legacy codes, conditioning, automatic code transformations, software tools

1 Introduction

High performance computing is well-established as a mainstream methodology for conducting science. The efficacy of applications, and the resulting scientific advances, are directly related to "productivity" in HPC – traditionally measured by raw performance, problem size, and machine efficiency. This and related notions of productivity form the focus of several recent efforts, including the DARPA High Productivity Computing Systems (HPCS) [1] program and Cray's adaptive supercomputing [2] vision. Whereas the HPCS initiative focuses on developing a new generation of economically viable systems that follow the rate of underlying technology improvement in delivering increased value to end-users [3], Cray aims to provide the next productivity breakthrough by developing a single adaptable HPC platform that integrates multiple processing technologies.

¹ Research, publication, and presentation supported in part by DOE grant DE-FG02-06ER25729, NSF grant CNS-720761, and a faculty travel grant from The Institute of Comparative and International Studies, Emory University.

However, HPC productivity is also greatly influenced by several other, less tangible factors, the most significant among them being the preparatory stages of application execution. Thus another strategy to improve HPC productivity is to reduce the effort involved in scientific application development. Research shows that in terms of productivity, the build process, i.e., compilation, linking, installation, deployment, and staging, may consume up to 30% of the development effort [4]. This large effort results mainly from the variety of software packages, program-building tools, and hardware architectures, together termed the build problem [5]. Adaptation and deployment of scientific codes on high-performance platforms is particularly challenging due to new and unique architectures and the legacy nature of applications that have evolved over several decades with continual tweaking. Modern scientific codes are usually developed on common Unix-like architectures and need to be ported to run on specific HPC machines that are controlled by specialized (often lightweight) kernels. In order to obtain efficient executables, the necessary modifications depend on vendors' compilers and libraries, and may be relatively straightforward, such as function renaming (e.g., POSIX functions), removing unsupported system calls and references to signals, taking care of pointer sizes, or handling the absence of the TCP protocol (sockets). In addition to such more or less straightforward source code changes, more challenging modifications, such as fundamental algorithm reimplementation (e.g., vectorization) or changing communication patterns to reflect specific hardware, might be necessary.

Our project aims to partially relieve site administrators, porting specialists, and computational scientists from the burdens of the build process by providing a systematic, integrated, and streamlined approach, in contrast to current ad-hoc practices. In order to accomplish this we propose a toolkit that facilitates conditioning, i.e., adaptation, of HPC applications to target platforms. In this paper we focus on the source code adaptation aspect. Our methodology is based on a toolkit-assisted approach that can facilitate routine tasks. Discussions with application scientists and experts at Oak Ridge National Laboratory (ORNL), summarized in Section 3, show that conditioning is often tedious and requires cross-domain expertise. Furthermore, there is an acute lack of dedicated toolkits that can support this large and complex process. Although it is likely that conditioning HPC applications cannot be fully automated and requires human intervention, the results of our research and experiments show that certain patterns can be observed. In Section 4 we define the adaptation capability model based on the analysis of a few ORNL application source codes and demonstrate the productivity improvement with the developed conversion assistants (Section 5). Finally, Section 6 concludes the paper.

2 Related Work

As mentioned, execution of scientific codes on HPC platforms requires dealing with adaptation of source codes and build systems, and conditioning of the target environment. There are many ways of tackling different aspects of these phases.

One way to approach conditioning issues is to learn from HPC community experiences that are often documented as guidelines or case studies. For instance, HPC vendors or compiler providers support their users with porting guidelines, compiler optimization flags, optimized libraries [6, 7], or even with documentation on how to build popular scientific applications [8]. Clearly, this approach, albeit valuable, is very general, and in a specific situation a well-documented use case may be more desirable. Recent porting cases concern technologies that have gained popularity, such as multi-core processing or reconfigurable supercomputers based on Field Programmable Gate Array (FPGA) technology. They report on portable frameworks [9, 10], guidelines for specific applications [11–13], or algorithms [14, 15]. Although addressing porting issues on multi-core or high-performance reconfigurable computing platforms is not our project's main focus, our methodology can be applied to facilitate conditioning on such machines. We propose to encapsulate system- and application-specific knowledge into reusable, community-shared descriptions called profiles. Profiles instruct our conditioning assistant modules to retrieve appropriate parameters for conditioning tasks such as environment variable settings, compilation suites (compilers, libraries, optimization flags, etc.), or code snippets.

Another way to deal with some conditioning issues is to utilize standards. For instance, portable run-time environments such as the Open Run-Time Environment (ORTE) [16] provide a unified interface to interprocess communication, resource discovery and allocation, and process launching across heterogeneous platforms. Our project intends to benefit from the unification provided by ORTE, e.g., with respect to application launching. Another notable example is GNU Autotools, which helps overcome the build problem by standardizing the compilation, linking, and installation process. However, GNU Autotools does not fully address all the aspects characteristic of the HPC domain, such as cross-compilation, use of restricted microkernels, and tuning to the hardware environment. In addition, it introduces its own compatibility issues, such as the requirement for compatible versions on the user's and developer's sides [17]. Our approach supports different build systems (Makefile-based as well as GNU Autotools-based) by providing a unified interface and encapsulating specific features into profiles.

Providing tools that facilitate porting or building processes takes conditioning a step further. For instance, Environment Modules facilitates management of configuration information regarding software packages and libraries installed on a given computer system [18]. Although in the past there have been efforts to develop tools for automatic porting [19], currently a supportive approach is preferred due to application complexity. The most advanced software solutions provide integrated programming environments that facilitate conditioning [20, 21]. However, they are usually restricted to popular operating systems or very specific architectures. Instead of developing a new IDE for conditioning, our project aims to provide uniform access to build tools through a tool-virtualization methodology and by storing expert knowledge regarding specific tool configuration requirements, parameters, or options in profiles. This knowledge is described once and repeatedly applied.
In the HPC arena, Eclipse PTP [22] provides an interesting option for GUI-oriented development and run-time interface to heterogeneous systems. Our project can be perceived as complementary to Eclipse PTP in terms of supplying conditioning Eclipse plugins.

3 A New Approach to Conditioning

Fig. 1. Traditional and proposed conditioning model (traditional: raw user access over ssh sessions to machine-specific build tools on the front-end node; proposed: uniform user access through virtualized conditioning tools)

The traditional conditioning model is presented in Figure 1.

End-users connect from their client systems to designated front-end nodes through remote (secure) connections. The security policy depends on the high-end system and may vary, e.g., static passwords, private-key authentication, or one-time passwords. In practice, the manner of user-supercomputer interaction is dictated by the front-end node environment. Fortunately, front-end nodes are often controlled by the Linux operating system, albeit usually modified by the vendor. Despite the popularity of Linux, and given that scientists often need to utilize several computational centers, adjusting to each new environment can be inconvenient for researchers and may distract them from their actual work, especially in the context of the build problem.

Analysis of current practices at large scientific computing establishments (e.g., DOE labs) suggests that application building is usually accomplished via GNU Autotools and Makefile-based systems, sometimes through proprietary scripts, and occasionally without any provided build system. But even when standard tools are employed, user modifications are often needed due to, e.g., cross-compilation issues. These factors contribute to lost productivity and, worse, inconsistent or even incorrect results due to misuse of compiler flags or wrong library versions.

Moreover, executing applications on different high-end platforms usually involves different preparatory steps that lead to further inefficiencies or inaccuracies.

In order to address the above issues, we propose a shift in the conditioning model, as depicted in Figure 1. In our model the user works locally, instead of working directly on a remote front-end node. We propose the concept of build-tool virtualization to provide a common and unified interface to conditioning tools. In our approach the end-users interact with the Harness Workbench Toolkit (HWT) and issue a virtualized build command. The build assistant (Figure 2), based on information stored in declarative profiles, orchestrates and runs relevant plugin modules (e.g., for environment preconditioning, staging, and compiling) and passes the command to the execution assistant. The execution assistant (also steered by profiles) generates and executes the actual target-specific commands.
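To make the flow concrete, the sketch below outlines how a virtualized build command could be expanded into target-specific commands by profile-driven assistants. This is only an illustrative sketch under our own assumptions; the class and function names (BuildAssistant, ExecutionAssistant, the plugin step names) and the profile contents are hypothetical and do not correspond to an actual HWT API.

```python
# Illustrative sketch (not the actual HWT API): a virtualized "build" request
# is expanded into target-specific shell commands using a declarative profile.
import subprocess

# A hypothetical profile for one target machine (normally read from a file);
# the commands mirror the Jaguar (Cray XT4) example in Figure 2.
JAGUAR_PROFILE = {
    "modules": ["FFTW.3.1.2"],
    "stage":   "cp {src} /dest",
    "compile": "pgCC -c -fastsse {src}",
}

class BuildAssistant:
    """Orchestrates conditioning plugins (precondition, stage, compile)."""
    def __init__(self, profile):
        self.profile = profile

    def plan(self, src):
        # Each step is produced by a plugin steered by the profile.
        steps = ["module load " + m for m in self.profile["modules"]]  # precondition
        steps.append(self.profile["stage"].format(src=src))            # stage
        steps.append(self.profile["compile"].format(src=src))          # compile
        return steps

class ExecutionAssistant:
    """Executes the target-specific commands generated for the front-end node."""
    def run(self, steps, dry_run=True):
        for cmd in steps:
            print(cmd)
            if not dry_run:
                subprocess.run(cmd, shell=True, check=True)

if __name__ == "__main__":
    steps = BuildAssistant(JAGUAR_PROFILE).plan("file.cc")
    ExecutionAssistant().run(steps)  # prints the module load / cp / pgCC commands
```

Pointing the same plan at a different profile (e.g., one holding the Cheetah/IBM p690 commands) would yield the corresponding xlC-based command sequence without changing the user-facing command.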

Fig. 2. Harness Workbench Toolkit: users, developers, administrators, and vendors interact through command-line or GUI front-ends with three pluggable layers, all steered by profiles that encapsulate expert knowledge: the Porting Assistant (source code transformation plugins, e.g., substitution and tracking), the Build Assistant (conditioning plugins for preconditioning, staging, and compiling), and the Execution Assistant (target-specific plugins, e.g., Cray XT4, IBM p690). For example, the same build request expands to "module load FFTW.3.1.2; cp file.cc /dest; pgCC -c -fastsse file.cc" on Jaguar (Cray XT4) and to "FFTW=$lib/FFTW3.1.2; scp file.cc user@host:/dest; xlC -c -O3 -qmaxmem=-1 -qstrict file.cc" on Cheetah (IBM p690).

The HWT architecture is presented in Figure 2. The HWT consists of three layers designed to be pluggable in order to support toolkit extensibility. The HWT behavior is configurable and tunable through declarative profiles that incorporate expert knowledge. Profiles, described in more detail in our previous paper [23], embody target-platform-specific knowledge, and may inherit from or override other descriptions. In addition, dynamic recursive resolution allows profile elements to cross-refer, and thus enables switching among predefined settings to select an appropriate suite of, e.g., compiler flags provided by a vendor.
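As an illustration of how profile inheritance and dynamic recursive resolution could behave, the following sketch resolves cross-referring elements of nested profiles. The profile contents and the ${...} reference syntax are hypothetical; they merely mimic the mechanism described above and are not the concrete HWT profile format.

```python
import re

# Hypothetical profiles: a generic base and a machine profile that inherits
# from it and overrides selected entries (e.g., a vendor-suggested flag suite).
BASE = {
    "cc": "cc",
    "flags": "-O2",
    "compile": "${cc} ${flags} -c ${file}",
}
CRAY_X1E = {
    "inherits": BASE,
    "cc": "ftn",
    "flags": "-O3 -sdefault64",  # example 64-bit flag suite from the text
}

def lookup(profile, key):
    """Walk the inheritance chain until the key is found."""
    while profile is not None:
        if key in profile:
            return profile[key]
        profile = profile.get("inherits")
    raise KeyError(key)

def resolve(profile, key, extra=None):
    """Recursively expand ${name} references, allowing elements to cross-refer."""
    value = (extra or {}).get(key) or lookup(profile, key)
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: resolve(profile, m.group(1), extra), value)

print(resolve(CRAY_X1E, "compile", extra={"file": "wave.f90"}))
# -> ftn -O3 -sdefault64 -c wave.f90
```

Switching the first argument to a different machine profile selects another predefined suite of flags without touching the generic "compile" template.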

Another very important aspect of conditioning, as mentioned earlier, is adaptation of the application source code to the target platform. In the HWT, the porting assistant layer is dedicated to source code adaptation tasks. It is based on the adaptation (conversion) capability model, which aims to systematize required and commonly performed application source code conversions.

4 Adaptation Capability Model

Porting can be defined as the set of tasks required to launch and correctly execute an application on a target machine. While it usually involves source code modifications, the application semantics must be preserved. In particular, the uniqueness of high-end machines and the pursuit of peak performance make this process exceptionally challenging in the HPC domain. We propose a toolkit-assisted approach that can facilitate routine tasks. In order to determine the routine activities that porting specialists deal with, we examined eight scientific codes in ORNL production use from a wide spectrum of computational science (chemistry, biology, fusion, computer science, and climate). We focused on those applications with available baseline and ported source codes, relevant to ORNL computing systems (Cray X1E (vector machine), Cray XT3/4, IBM SP4 (PowerPC processors)). A PC Linux distribution of a scientific application served as the baseline code. The results of our analysis are presented in the form of porting conversion categories shown in Figure 3.

Fig. 3. Conversion capability model: conversions are divided into automatic and manual; automatic conversions comprise mapping (substitution; basic templates, e.g., POSIX and name-mangling conventions, language incompatibilities, time functions; tracking templates, e.g., loop unrolling, 32->64 bit changes) and detection (e.g., signals, threads); manual conversions cover algorithm reimplementation (e.g., SSE, DMA, vectorization)

In general we can distinguish between two main code conversion categories, namely automatic and manual. The former refers to the set of conversions that may to some extent be automated, although user steering and input are still necessary. The simplest automatic conversions are substitutions, which play a role similar to name refactoring (e.g., adding the prefix PXF for POSIX functions on Cray machines, or identifier mangling conventions). More advanced conversions concern pattern mapping. As an illustration, consider the different parameter passing conventions, e.g., the PGI Fortran compiler CALL FREE(PTR) and the IBM Fortran compiler CALL FREE(%VAL(PTR)), or time functions that may differ in semantics on various high-end machines. Another example relates to library incompatibilities, such as a new version of a library that has not yet been ported to a machine (e.g., FFTW 3.x -> 2.x), or highly vendor-optimized library counterparts. In addition to converting patterns into patterns, there are conversions that track dependencies outside the matched patterns. For instance, consider the code adaptation from the 32-bit to the 64-bit compilation model for Fortran on the Cray X1E. It involves tracking variable declarations and the presence of the compilation parameter '-sdefault64' in the build system. Another very important example of a tracking template conversion is loop unrolling. This technique attempts to optimize loop execution time (e.g., by utilizing the maximum number of available CPU registers) and is commonly used for fast computations.

Apart from mapping conversions, there are cases where a given HPC system does not support certain features such as signals, threads, some system calls, sockets, or a synthetic file system (e.g., /proc). Detection conversions are intended to deal with such situations and inform the user about non-portable constructs. In general, detections trigger manual code adaptations that usually require expert knowledge of the hardware (architecture), of the system software in terms of compiler switches and usage of relevant libraries or versions, and of application algorithms. For instance, in order to utilize a streaming feature such as SSE or 3DNow, the algorithm must be implemented in assembler code. Other examples concern code vectorization to fully exploit vector processors, manual loop unrolling, or performance optimization and tuning.

Based on the described conversion capability model, we have developed conversion assistant modules for the HWT porting assistant layer in order to perform the porting experiment with a production scientific code.
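Since our conversion assistants are currently implemented as regular expressions (see Section 6), the sketch below illustrates, under simplified assumptions, what a substitution, a basic template, and a detection rule could look like. The rules shown are deliberately naive examples operating on a toy Fortran fragment; they are not the production modules, and real PXF interfaces also require adjusting call arguments.

```python
import re

fortran_src = """\
      CALL GETENV('HOME', DIR)
      CALL FREE(PTR)
      CALL SIGNAL(2, HANDLER)
"""

# Substitution: prefix a selected POSIX routine call with PXF (Cray convention).
substituted = re.sub(r"\bCALL\s+(GETENV)\b", r"CALL PXF\1", fortran_src)

# Basic template: map the PGI calling convention CALL FREE(PTR)
# to the IBM convention CALL FREE(%VAL(PTR)).
templated = re.sub(r"\bCALL\s+FREE\((\w+)\)", r"CALL FREE(%VAL(\1))", substituted)

# Detection: flag constructs the target system may not support (e.g., signals),
# leaving the actual adaptation to the user.
for lineno, line in enumerate(templated.splitlines(), start=1):
    if re.search(r"\bCALL\s+SIGNAL\b", line):
        print(f"line {lineno}: signals may be unsupported on this target")

print(templated)
```

A tracking conversion would extend such rules with state carried across matches, e.g., remembering which variables' declarations must change when the 64-bit compilation model is selected.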

5 CPMD Experiment

We have implemented examples of substitution, template, and detection conversions from the conversion capability model (Figure 3) as appropriate porting assistant modules (Figure 2). The goal of this experiment was to examine the number of modifications of the baseline source code that can be supported by our conversion modules in comparison to the total number of modifications necessary to successfully build and execute an application on a given high-end machine. As target platforms we chose HPC systems relevant to ORNL, i.e., IBM SP4, Cray X1E, and Cray XT3/4. From the analysis standpoint, the most convenient approach would be a comparison of original source codes with their ported counterparts. Unfortunately, obtaining original source codes can be difficult due to their constant adaptation over time. To overcome this difficulty, a preprocessor can be used to generate architecture-specific versions of application source codes. We therefore chose the CPMD application [24], since it supports hundreds of configuration sets, including the architectures of our interest.

We assumed the baseline code is the CPMD PC Linux distribution. CPMD is a molecular dynamics scientific code consisting of about 700 files implemented in Fortran (and an occasional C file). The CPMD build system is based on a shell script that generates the appropriate Makefile. The results of our experiment are presented in Table 1.

Table 1. CPMD experimental results

Conversion category  Specific conversion       IBM SP4  Cray X1E  Cray XT3/4
manual               changing FFT → ESSL FFT        12         7         N/A
                     optimization & tuning           4        95           9
                     unknown                         2         0           7
                     Total                          18       102          16
automatic            substitution                   11        11           0
                     detection                       8        21           6
                     basic template                  4         1           2
                     tracking template               1        43          18
                     Total                          24        76          26

The Cray X1E turned out to be the most challenging of all of the examined high-end machines in terms of the number of required source code modifications. This includes manual as well as automatic conversions and results from the vector architecture of the Cray X1E and the non-vectorized CPMD baseline code. The CPMD application contains an intrinsic FFT library; however, in the case of the IBM SP4 and Cray X1E, the vendor provides an optimized version of FFT. Therefore, we distinguished a dedicated optimization subgroup in Table 1. Other optimization and tuning conversions we identified concern manual loop unrolling (often combined with pragmas to enable aggressive code optimization, such as indicating which variables can be shared), changing the 32-bit to the 64-bit compilation model, and zeroing memory (on the IBM SP4, for performance reasons). Finally, there are cases that require expert knowledge to explain; we classified them into the 'unknown' category. With regard to automatic conversions, we did not identify any substitutions for Cray XT3/4. This results from the fact that Cray XT3/4 uses the same compiler as Linux and the CPMD distribution provides appropriate configuration files, so there is no need for substitutions. In general, the number of required substitutions measures the portability of a given routine. The number of substitutions would be greater in the case of changing calculation precision. This usually implies exchanging the function suite and, in consequence, often many substitutions and tracking conversions (the number of arguments and their types often needs to be adjusted to the counterpart function calls). For instance, changing calculation precision for the CPMD code with regard to the BLAS library requires exchanging a suite of 58 function names (a suite substitution of this kind is sketched below). Not surprisingly, due to its vector nature, Cray X1E leads in the detection conversion group, which included detecting unsupported signals and the synthetic file system.
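For illustration, such a suite exchange could itself be expressed as a larger substitution map applied by the substitution assistant. The three single-to-double precision BLAS name pairs below are a hypothetical fragment, not the actual 58-name suite used for CPMD.

```python
import re

# Hypothetical fragment of a precision-change suite (single -> double precision BLAS).
SUITE = {"SGEMM": "DGEMM", "SDOT": "DDOT", "SAXPY": "DAXPY"}

def exchange_suite(source, suite):
    # One pass replacing any whole-word occurrence of a mapped routine name.
    pattern = re.compile(r"\b(" + "|".join(suite) + r")\b")
    return pattern.sub(lambda m: suite[m.group(1)], source)

src = "      CALL SGEMM('N','N',M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)"
print(exchange_suite(src, SUITE))
# Tracking conversions would additionally adjust argument kinds (e.g., REAL -> REAL*8).
```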

The results shown in Table 1 indicate that tracking conversions dominate over basic templates (for Cray X1E and Cray XT3/4) and suggest that more advanced conversions may reduce the porting effort to a greater degree. The obtained results demonstrate that our methodology is promising and may contribute even more to porting applications that are not as well prepared for this process as CPMD (CPMD is portability-oriented; e.g., porting-sensitive routines or functions such as malloc() or open() are wrapped in proprietary functions). The outcome of this experiment shows that although manual conversions are much more cumbersome, require substantial effort and knowledge, and cannot be eliminated, we identified conversions that can be supported by a tool and thereby improve the productivity of scientists involved in the porting process.

6 Conclusions and Future Work

Despite maturing in many ways, high-end computing systems continue to require substantial effort in terms of application and environment adaptation to execute a scientific code on a target HPC platform. Conditioning needs will intensify in the future as new technologies emerge and gain popularity (e.g., multicore processors, reconfigurable supercomputers). Utilization of legacy codes is inevitable due to their tested reliability and demonstrated performance. In this paper we introduce the Harness Workbench Toolkit that supports conditioning of the environment and adaptation of source code to a particular high-end machine. We focus on the porting aspect of conditioning, namely the HWT porting assistant layer. We describe the conversion capability model that classifies source code adaptation conversions. Our experiment with the developed conversion assistants, performed on a production code (the CPMD application), demonstrates that our approach is feasible and may improve the productivity of computational scientists by limiting their involvement to only those porting tasks that cannot be supported by software (e.g., optimization and tuning). In addition, the results of the CPMD experiment show that conversion assistants can be parametrized according to the machine architecture. Therefore, our future work will concentrate on including architecture-specific information in profiles. For instance, the set of detection conversion assistants for a given high-end machine would be determined by the machine's profile. We also plan to develop conversion assistants as lexical analyzers; currently, they are implemented as regular expressions, which were sufficient for the experiment. In addition, we intend to perform similar experiments with scientific applications implemented in C and mixed (Fortran/C) programming languages.

References

1. Feldman, M.: HPC, Thy Name is Productivity. HPCwire (2007) http://www.hpcwire.com/
2. Snell, A., Wu, J., Willard, C.G., Joseph, E.: Bridging the Capability Gap: Cray Pursues "Adaptive Supercomputing" Vision. White Paper (2007) http://www.cray.com/downloads/IDC-AdaptiveSC.pdf
3. Kepner, J.: HPC Productivity: An Overarching View. The International Journal of High Performance Computing Applications 18(4) (2004) 393–397
4. Kumfert, G.K., Epperly, T.G.W.: Software in the DOE: The hidden overhead of "the build". Technical Report UCRL-ID-147343, Lawrence Livermore National Laboratory (2002)
5. Dubois, P.F., Kumfert, G.K., Epperly, T.G.W.: Why Johnny can't build. Computing in Science and Engineering 5(5) (2003) 83–88
6. Cray, Inc.: CrayDoc. http://docs.cray.com/ (2007)
7. Sun Microsystems, Inc.: Porting UNIX Applications to the Solaris Operating Environment. http://developers.sun.com/solaris/articles/portingUNIXapps.html (1999)
8. PathScale, LLC: Building popular codes with PathScale. https://www.pathscale.com/building_code/index.html (2007)
9. Saunders, R., Jeffery, C., Jones, D.T.: A Portable Framework for High-Speed Parallel Producer/Consumers on Real CMP, SMT and SMP Architectures. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
10. Bader, D., Kanade, V., Madduri, K.: SWARM: A Parallel Programming Framework for Multicore Processors. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
11. Olivier, S., Prins, J., Derby, J., Vu, K.: Porting the GROMACS Molecular Dynamics Code to the Cell Processor. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
12. Petrini, F., Fossum, G., Fernandez, J., Varbanescu, A.L., Kistler, M., Perrone, M.: Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
13. Kindratenko, V., Pointer, D.: A Case Study in Porting a Production Scientific Supercomputing Application to a Reconfigurable Computer. In: IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 06), IEEE CS Press (2006) 13–22
14. Villa, O., Scarpazza, D.P., Petrini, F., Peinador, J.F.: Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
15. Brunner, R., Kindratenko, V., Myers, A.: Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware. In: NASA Science Technology Conference – NSTC'07 (2007)
16. Castain, R.H., Woodall, T.S., Daniel, D.J., Squyres, J.M., Barrett, B., Fagg, G.E.: The Open Run-Time Environment (OpenRTE): A transparent multi-cluster environment for high-performance computing. In: Proc. 12th European PVM/MPI Users' Group Meeting, Sorrento, Italy (2005)
17. Doar, M.B.: Practical Development Environments, Chapter 5. O'Reilly (2005)
18. Furlani, J.L., Osel, P.W.: Environment Modules Project. http://modules.sourceforge.net/ (2005)
19. Muppidi, S., Krawetz, N., Beedubail, G., Marti, W., Pooch, U.: Distributed Computing Environment (DCE) porting tool. In: Proceedings of the IFIP/IEEE International Conference on Distributed Platforms: Client/Server and Beyond: DCE, CORBA, ODP and Advanced Distributed Applications (1996) 115–129
20. Sun Microsystems, Inc.: Sun Studio 12 C, C++ & Fortran Compilers and Tools. http://developers.sun.com/sunstudio/ (2007)
21. SRC Computers, Inc.: SRC's Carte Programming Environment. http://www.srccomp.com/SoftwareElements.htm (2007)
22. The Eclipse Foundation: Parallel Tools Platform (2007) http://www.eclipse.org/ptp
23. Sławińska, M., Sławiński, J., Kurzyniec, D., Sunderam, V.: Enhancing Portability of HPC Applications across High-end Computing Platforms. In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), Long Beach, CA (2007)
24. CPMD Consortium: Car-Parrinello Molecular Dynamics – CPMD 3.11. http://www.cpmd.org/ (2006)