Compiler technologies for understanding legacy ...

Available online at www.sciencedirect.com

ScienceDirect

This space is reserved for the Procedia header, This Procedia space isComputer reserved for 108C the Procedia header, Science (2017) 2418–2422 This space is reserved for the Procedia header, This space is reserved for the Procedia header,

do do do do

not not not not

use use use use

it it it it

International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

Compiler technologies for understanding legacy scientific Compiler technologies for legacy scientific Compiler technologies foronunderstanding understanding legacy scientific code: A case study an ACME land module Compiler technologies for understanding legacy scientific code: A case study on an ACME land module code: A case study on an ACME land module 1∗ 1 2 Dali Wang , Yu Pei22 , Oscar Hernandez , ACME Wei Wu22 , Zhou Yao , Youngsung code: study on an land module 1∗ A case 1 4 1 , Wei Wu2 , Zhou 4 Dali Wang1∗, Yu Pei23, Oscar Hernandez Yao22 , Youngsung Michael Wolfe4 , and Ryan Dali Wang , Yu Kim Pei 3, ,Oscar Hernandez , Wei Wu Kitchen , Zhou 4Yao , Youngsung Kim 23 , Michael Wolfe4 , 1and Ryan 2Kitchen4 Dali Wang1∗, Yu Pei , Oscar Hernandez , Wei Wu , TN, Zhou Yao2 , Youngsung 1 Kim , Michael Wolfe , and Ryan Kitchen Oak Ridge National Laboratory, Oak Ridge, U.S.A. 1 Kim3 , Michael Wolfe4 , and Ryan Kitchen4 Laboratory, Oak Ridge, TN, U.S.A. wangd,[email protected] 1 Oak Ridge National Oak Laboratory, Oak Ridge, TN, U.S.A. 2 Ridge National University wangd,[email protected] of Tennessee, Knoxville, TN, U.S.A. wangd,[email protected] 2 Ridge National Oak Laboratory, Oak Ridge, U.S.A. of Tennessee, Knoxville, TN, TN, U.S.A. ypei2,wwu12,[email protected] 2 University University wangd,[email protected] of Tennessee, Knoxville, TN, U.S.A. ypei2,wwu12,[email protected] National Center for Atmospheric Research, Boulder, CO, U.S.A. 2 ypei2,wwu12,[email protected] Tennessee, Knoxville, TN, U.S.A. National University Center for of Atmospheric Research, Boulder, CO, U.S.A. [email protected] National Center for Atmospheric Research, Boulder, CO, U.S.A. ypei2,wwu12,[email protected] 4 [email protected] The Portland Group (PGI), Portland, OR, U.S.A. [email protected] 4 National Center for Atmospheric Research, Boulder, CO, U.S.A. Portland Group (PGI), Portland, OR, U.S.A. ryan.kitchen,[email protected] 4 The The Portland [email protected] Group (PGI), Portland, OR, U.S.A. ryan.kitchen,[email protected] 4 ryan.kitchen,[email protected] The Portland Group (PGI), Portland, OR, U.S.A. ryan.kitchen,[email protected] 1

3 3 3 3

Abstract Abstract The complexity of software systems have become a barrier for scientific model development Abstract The complexity of software systems have become a barrier for scientific model development and software modernization. In this study, we present a procedure to use compiler-based techThe complexity of software systems have become a barrier for scientific model development Abstract and software modernization. In this study, we present a procedure to use compiler-based technologies to better understand complex scientific code. The approach requires no extra software and software modernization. In this study, we present a procedure to use compiler-based techThe complexity ofunderstand software systems have become a barrier for scientific model development nologies to better complex scientific code. The approach requires no extra software installation and configuration and its scientific software code. analysis can be transparent to developer and nologies to better understand In complex The approach requires no extra software and software modernization. this study, we present a procedure to use compiler-based installation and configuration andtoitsillustrate softwarethe analysis can be transparent toprocedure developertechand users. We designed a sample code data collection and analysis from installation and configuration and its scientific software code. analysis can be transparent to developer and nologies todesigned better understand complex The approach requires no extra software users. We a sample code to illustrate the data collection and analysis procedure from compiler and showed a case studythe thatdata used the information from procedure interprocedure users. Wetechnologies designed a sample code to illustrate collection and analysis from installation and configuration and its software analysis can be transparent to developer and compiler technologies and showed a case module study that used the information interprocedure analysis to analyze a scientific function extracted from an Earth from System Model. We compiler technologies and showed a case studythe that used the information from interprocedure users. We designed a sample code to illustrate data collection and analysis procedure from analysis to analyze a scientific function extracted from an scientific Earth System We believe study provides a new path tomodule better understand legacy code. Model. analysisthis to analyze a scientific function module extracted from an Earth from System Model. We compiler technologies and showed a case used the information interprocedure believe this study provides a new path tostudy betterthat understand legacy scientific code. believe this study Published provides a Elsevier new path tomodule betterCode, understand legacy code. Keywords: Compiler Technology, Legacy Scientific Functional Unit Test, Kernel Extraction analysis toAuthors. analyze a scientific function extracted from an scientific Earth System Model. We © 2017 The by B.V. Keywords: Compiler Technology, Legacy Scientific Code, Functional Unitscientific Test, on Kernel Extraction Peer-review under responsibility ofathe scientific committee ofunderstand the International Conference Computational Science believe this study provides new path to better legacy code. Keywords: Compiler Technology, Legacy Scientific Code, Functional Unit Test, Kernel Extraction Keywords: Compiler Technology, Legacy Scientific Code, Functional Unit Test, Kernel Extraction

1 Introduction 1 Introduction 1 Introduction Along with the evolution of computational sciences, large scale scientific codes have been devel1 Along withsoftware the evolution of computational sciences, largesignificantly scale scientific have determined been developed. Introduction The systems of these scientific code differ and codes are largely

Along with the evolution of computational sciences, large scale scientific codes have been developed. softwarescientific systems problems of these scientific code differ significantly anddeveloper are largely determined by theThe underlying and modeling conventions among communities. oped. The software systems of these scientific code differ and codes are largely determined Along with the evolution of problems computational sciences, largesignificantly scale scientific have been develby the underlying scientific and modeling conventions amongmodel developer communities. The complexity of software systems become a barrier for scientific development and by theThe underlying scientific problems and modeling conventions among developer communities. oped. software systems of these scientific code differ significantly and are largely determined The complexity of software systems become a barrier for developed scientific model development and software modernization. Although Several tools have been to facilitate source code Thethe complexity ofscientific softwareproblems systems and become a barrier for scientific development and by underlying modeling conventions amongmodel developer software modernization. Although (such Several tools have beenperformance developed to facilitatecommunities. source code understanding and documentation as Doxygen) and improvements (such as software modernization. Although Several tools have been developed to facilitate source code The complexity of documentation software systems become a barrier forperformance scientific model development understanding and (such as Doxygen) and improvements (suchand as SCOREP), adopting these tools to understand large scale scientific code remains challenging. understanding and documentation (such as tools Doxygen) and performance (such as software modernization. Although havescale beenscientific developed toimprovements facilitatechallenging. source code SCOREP), adopting these tools to Several understand large code remains SCOREP), adopting these tools to understand large scale scientific code remains challenging. ∗ This research understanding and documentation (such as Doxygen) and performance improvements (such as was funded by the DOE Biological and Environmental Research (Accelerated Climate Mod∗ This eling for Energy and Terrestrial Ecosystem Science) and LDRD 8277 from Oak Ridge National Lab. research was funded by the DOE Biological and Environmental Research (Accelerated Climate ModSCOREP), adopting these tools to understand large scale scientific code remains challenging. ∗ researchand wasTerrestrial funded byEcosystem the DOE Biological andLDRD Environmental (Accelerated Climate ModelingThis for Energy Science) and 8277 fromResearch Oak Ridge National Lab. eling for Energy Science) and 8277 fromResearch Oak Ridge National Lab. ∗ This researchand wasTerrestrial funded byEcosystem the DOE Biological andLDRD Environmental (Accelerated Climate Mod1 eling for Energy and Terrestrial Ecosystem Science) and LDRD 8277 from Oak Ridge National Lab.

1877-0509 © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the International Conference on Computational Science 10.1016/j.procs.2017.05.264

1 1 1

Compiler technology for understanding scientific code Computer Science 108C (2017) Wang, Pei, Hernandez, et al. Dali Wang et al. / Procedia 2418–2422

In the paper, we present our approach to use compiler technologies to better understand large-scale, legacy scientific code. This approach presents several major advantages: 1) no extra software installation and configuration are required; 2) all the necessary information on the software system have been collected at certain phases of compiling and linking process; and 3) compiler-supported software analysis are transparent to software developer and users. We first describe several key concepts and phases of traditional compiler frontend and layout a general procedure to harvest data for program understanding using a mini app. Then we present a case study that uses compiler technology to analyze a scientific function module from an Earth System Model.

2

Compiler information collection for program understanding

Major components and functions of a modern compiler can be categorized as front-end (including lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator) and back-end (code optimizer and code generator). Compiler collects and stores extensive information generated by these major components at each of the compiling phases. In this study, we collect information from compiler and present them in an appropriate format for scientists (modeler and application users), thefore, we are more interested in the first few parts of the compiler (referred to as compiler frontend), specifically, language parser and interprocedural analysis. Then we transform (or reverse engineering) these information for scientific modelers and developers to better understand the overall software system. First, We design a mini Fortran app reflect many of the design patterns in a legacy code base. It contains three source files: a program driver, a module file with derived type definitions and a module file containing compute kernels. The driver is the main entrance of the program. It initializes all the data structures then sequentially call the compute kernels. Notable features in the mini-app include: an OOP style data packaging with type bounded initialization subroutines; Compute kernels operate and updates the global derived data types which have pointers to arrays and nested level of indexing to represent complex variables; and name aliasing in the kernels using ASSOCIATE blocks. Then, we use the Function Parser (such as the FParser within F2PY [1]) to parse the mini-app. The information are stored in Abstract Syntax Tree (AST) statement and Symbol tables. Since many compilers (such as GCC (version 4.5 and above) and Clang/LLVM) offer plugin functionality, which allows developers to build compiler-based tools into compilers without having to modify their binaries, we also use a gfortran compiler plugin [2] to extract program information from our mini-app. The plugin access the Fortran front-end intermediate representation and traverses together with its symbol tables (e.g. Symbol trees). In a compiler, interprocedural analysis (IPA) is typically designed to analyze and optimize an application and provides a powerful set of language independent information on the whole application for better program understanding. By using the PGIPA library and following the instructions provided by PGI [3], we successfully processed our mini Fortran app and produced the analyzed result. The output is in Javascript Object Notation (JSON) format. For our miniapp, IPA analysis identified the derived type and their size and structure, and the remaining information are nested in the three file objects that contain data type definition, file location, function, variables, binding records, call records and attributes. Therefore, we can easily identify the parameters and variables that are used and insert the data packaging calls accordingly. Similarly we have sufficient data for overall program understanding and transformations. 2

2419

2420

Compiler technology for understanding scientific code Computer Science 108C (2017) Wang, Pei, Hernandez, et al. Dali Wang et al. / Procedia 2418–2422

3

Case study on a scientific function module

In the section we use an earth system model (Accelerated Climate model for Energy (ACME)) to demonstrate the usefulness of our methods. Considering the interests of general audience of this conference and the page limitation, we focus on a functional unit within ACME, instead of the whole simulation system. The detailed analysis of the whole simulation system will be presented in a later publication. ACME contains several key components of earth systems, include the ACME Land Model (ALM). ALM simulates several aspects of the land surface, including heterogeneity, biogeochemistry, biogeophysics, hydrologic cycle and ecosystem dynamics. The latest version of ALM contains several submodels, (e.g.,Canopy Fluxes, Ecosystem Dynamics, Hydrology, Urban, Fire, Dust), that are organized based on carbon-nitrogen cycles, water and energy balance. Researchers are looking into innovative ways to expedite the model development via modularization and facilitate model validation via direct model-data comparison. Two excellent examples of these efforts are kernel extraction [4] and functional unit testing platform [5,6]. In this study we use both approaches to generate an ecosystem function module out the ALM and then use the IPA analysis to summarize the differences between these two module systems. The module we studied is called Maintenance Respiration (CNMResp), which is designed to update all maintenance respiration fluxes (prognostic carbon state variables) every half hour.

3.1

Information on CNMResp module generated by KGEN approach

In this step, we use a Fortran Kernel Generator (KGEN) to extract the CNMResp module out of the ALM simulation. KGEN, developed at National Center for Atmospheric Research [13], is written in Python as an extension of F2PY package to extract a Fortran subprogram as a standalone software out of a large software application. We then use the PGI compiler to recreate the CNMResp module and collect information from IPA analysis. The result of IPA analysis of the CNMResp module, generated via KGEN approach, is visualized in Figure 1. The left graph shows the connections between the KGEN driver (green), KGEN utility (yellow), KGEN datatype (blue) and ALM functions (red). KGEN utilities are designed to initialize, checksum and create new unit for the ALM function (i.e., CNMResp). There is another ALM function (i.e., CNEcosystemDynNoleaching) included in the system. It is the parent of the target function, therefore, the workflow of KGEN is that: KGEN driver to CNEcosystemDynNoLeading and to CNMResp). A variety of new derived datatypes are created by KGEN (blue nodes) to map the original ALM datatype. It is a convenient design that to reduce the data dependency in ALM. However, it introduces new datatypes that are not part of the original software. The right graph presents the circular layout of all the subroutine with the CNMResp module. It is obvious that the major parts of this module are new derived datatypes, generated by KGEN itself.

3.2

Information on CNMResp module generated by FUT approach

In the past several years, modular design and function test of environmental models have gained attention within the Biological and Environmental Research Program of the U.S. Department of Energy. We have developed a functional unit testing (FUT) platform [14]. Evolved from the traditional concepts on software unit testing, FUT provides convenient (piece-by-piece) ways for process-based multiscale model verification and validation, covering both model structure and scientific workflow. FUT expedites model modification and enhancement, enables environmental model reconfiguration, reuse and reassembly and facilitate collaboration among field 3

Dali Wangscientific et al. / Procedia 2418–2422 Compiler technology for understanding code Computer Science 108C (2017) Wang, Pei, Hernandez, et al.

Figure 1: The call graph of CNMResp module generated by KGEN. Data collected via PGIPA analysis.

scientists, observation dataset providers, modelers, and computer scientists. We use FUT to extract the same CNMResp module out of the ALM simulation and then use the PGI compiler to recreate the CNMResp module and collection the information from IPA analysis. The result of IPA analysis of the CNMResp module, generated via FUT approach, is visualized in Figure 2. The left graph shows the connections between the FUT driver (green), KGEN utility (yellow), ALM datatype (blue), ALM data utility (light blue), and ALM functions (red). FUT utilities are designed to run, load variables and create output for the ALM function (i.e., CNMResp). There is another ALM function (i.e., readCNMrespParams) included in the system. It is the read function included in the target function (CNMResp.F90). A variety of derived datatypes of ALM are used by FUT (blue nodes) along with ALM data utilities (light blue nodes). FUT is designed to reuse all the original ALM datatype. This approach preserved the implicit connections among ALM datatypes. The right graph present the circular layout of all the subroutines with the CNMResp module. It is obvious that the major parts of this module are the original ALM derived datatypes and data utilities.

Figure 2: The call graph of CNMResp module generated by FUT. Data collected via PGIPA analysis.

4

2421

2422

Dali Wangscientific et al. / Procedia 2418–2422 Compiler technology for understanding code Computer Science 108C (2017) Wang, Pei, Hernandez, et al.

3.3

Comparison of these two versions of functional module generation

We have used the information provided by PGI compiler to analyze the call structure of an ALM module extracted by two approaches, KGEN and FUT. KGEN uses the information from F2PY to understand data connections between subroutines, and create its own data structure to hold the input and output results of the target function (i.e., CNMResp). It is a convenient solution considering the information collected via F2PY, very similar to the front-end of a compiler (such as PGI and GCC). On the other hand the FUT approach is designed to reuse the original data structures, utilities and functions to generate a standalone module. This approach can better preserve the original software system structure but at the cost of large overhead and complicated internal connectivity. It is also worthy to mention that the preservation of original software system structure is one of the design considerations of FUT implementation, since we would like to generate standalone functional modules as building blocks that can be assembled and customized later to create our own computational models to match the field experiment or real world observation system. In this section, we only illustrate the call functions. More information, such as library linkage, associate rules, as well as control flow can be extracted and analyzed to improve the program understanding of legacy code.

4

Conclusion

In this study, we have introduced a procedure to use compiler-based technologies to better understand complex scientific code. The approach requires no extra software installation and configuration and its software analysis can be transparent to developer and users. We designed a sample code to illustrate the data collection and analysis procedure and also present a case study and use the IPA information from PGI compiler to analyze a scientific function module extracted from an Earth System Model. We believe the compiler-based analysis provides a very generic, transparent and practical approach to improve program understanding of legacy scientific code. In this study we only investigate the call graph function, future work will focus on library usage and control flow. Since huge information can be harvested and transformed from many components of a modern compiler, advanced data visualization and analysis will also be important parts of our future effort.

References [1] P. Peterson, F2PY: a tool for connecting Fortran and Python programs, International Journal of Computational Science and Engineering 2009 - Vol. 4, No.4 pp. 296 - 305 [2] O. Hernandez, A GCC compiler plug-in for program understanding , in-house, pre-release software. [3] R. Kitch and M. Wolfe, PGIPA Export Documentation, (personal communication). [4] Y. Kim, J. Dennis, C. Kerr, R. R. P. Kumar, A. Simha, A. Baker, and S. Mickelson, KGEN: A Python Tool for Automated Fortran Kernel Generation and Verification, in Proc. of the Int. Conf. on Computational Science (ICCS 2016), 2016. [5] D. Wang, Y. Xu, P. Thornton, A. King, L. Gu, C. Steed, J. Schuchart, A Functional Testing Platform for the Community Land Model, Environmental Modeling and Software, 2014, Volume 55, Pages 25-31, 10.1016/j.envsoft.2014.01.015 [6] D. Wang, W. Wu, T. Janjusic, Y. Xu, C. Iversen, P. Thornton, M. Krassovski, Scientific Functional Testing Platform for Environmental Models: An Application to Community Land Model, International Workshop on Software Engineering for High Performance Computing in Science, May 16-24, 2015, Florence, Italy. DOI 10.1109/SE4HPCS.2015.10

5

Compiler technologies for understanding legacy ...

Compiler technologies for understanding legacy ...

Suggest Documents

ACOTES: Advanced Compiler Technologies for

ACOTES Project: Advanced Compiler Technologies for Embedded ...

Beyond Olympic Legacy: Understanding Paralympic Legacy Through ...

Understanding Deep Web Technologies

Systems Biology Strategies and Technologies for Understanding ...

Using Grid Technologies for Web-enabling Legacy Systems - CiteSeerX

Understanding Plyler's Legacy: Voices from Border Schools

Understanding Legacy Features with Featureous - IEEE Computer ...

understanding the organizational context of 'legacy' - CiteSeerX

Understanding Plyler's Legacy: Voices from Border Schools

Understanding the Legacy - Pi Beta Phi International

Understanding international crime trends: The legacy ... - Precaution.org

Understanding WSIS - Information Technologies & International ...

Understanding WSIS - Information Technologies & International ...

The PRECC Compiler Compiler

Extracting ontologies from legacy systems for understanding ... - DMU

Compiler Application (COMPILER) - Erlang

Compiler

A Pattern Enforcing Compiler (PEC) for Java: Using the Compiler

COCOVILA â Compiler-Compiler for Visual Languages - Ando Saabas

Understanding Technologies for Creating High-Security ID Cards

The HipHop Compiler for PHP

Keysight Technologies Understanding GSM/EDGE Transmitter and ...

Understanding the Impact of Emerging Technologies ...

Compiler technologies for understanding legacy ...