Proceedings of the Second International Workshop On Software Engineering for High Performance Computing System Applications

May 15, 2005 St. Louis, Missouri, USA


Proceedings of the Second International Workshop On Software Engineering For High Performance Computing System Applications St. Louis, Missouri, USA May 15, 2005 Co-located with the International Conference on Software Engineering (ICSE 2005)

Edited by Philip M. Johnson


Contents

Message from the Program Committee.....................................................................................................5

Program.....................................................................................................................................................6

Supercomputing and Systems Engineering...............................................................................................7 John Grosh

Application of a development time productivity metric to parallel software development ......................8 Andrew Funk, Victor Basili, Lorin Hochstein, Jeremy Kepner

A metric space for productivity measurement in software development..................................................13 Robert Numrich, Lorin Hochstein, Victor Basili

Generating testable hypotheses from tacit knowledge for high productivity computing .........................17 Sima Asgari, Lorin Hochstein, Victor Basili, Marvin Zelkowitz, Jeff Hollingsworth, Jeff Carver, Forrest Shull

Case study of the Falcon code project ......................................................................................................22 D.E. Post, R.P. Kendall, E.M. Whitney

Can software engineering solve the HPCS problem? ...............................................................................27 Eugene Loh, Michael Van De Vanter, Lawrence Votta

P3I: The Delaware programmability, productivity, and proficiency inquiry............................................32 Joseph Manzano, Yuan Zhang, Guang Gao

Refactorings for Fortran and high performance computing......................................................................37 Jeffrey Overbey, Spiros Xanthos, Ralph Johnson, Brian Foote

Building a software infrastructure for computational science applications: lessons and solutions ..........40 Osni Marques, Tony Drummond

And away we go: Understanding the complexity of launching complex HPC applications ....................45 Il-Chul Yoon, Alan Sussman, Adam Porter

Automating the development of scientific applications using domain-specific modeling .......................50 Francisco Hernandez, Purushotham Bangalore, Kevin Reilly

HPC needs a tool strategy .........................................................................................................................55 Michael Van De Vanter, D.E. Post, Mary E. Zosel

Predicting risky modules in open-source software for high performance computing ..............................60 Amit Phadke, Edward Allen

Towards a timed markov process model of software development ..........................................................65 Burton Smith, David Mizell, John Gilbert, Viral Shah

Finite-state verification for high performance computing ........................................................................66 George Avrunin, Stephen Siegel, Andrew Siegel

Improving scientific software component quality through assertions ......................................................73 Tamara Dahlgren, Premkumar Devanbu

Automated, scalable debugging of MPI programs with Intel(R) message checker ..................................78 Jayant DeSouza, Bob Kuhn, Bronis de Supinski


Message from the Program Committee

Welcome to the Second International Workshop on Software Engineering for High Performance Computing System Applications.

High performance computing systems are used to develop software for a wide variety of domains, including nuclear physics, crash simulation, satellite data processing, fluid dynamics, climate modeling, bioinformatics, and financial modeling. The top500.org website lists the 500 highest-performance computing systems along with their specifications and owners. The diversity of government, scientific, and commercial organizations present on this list illustrates the growing prevalence and impact of HPCS applications on modern society.

Recent initiatives in the HPCS community, such as the DARPA High Productivity Computing Systems program and the Workshop on the Roadmap for the Revitalization of High-End Computing, recognize that dramatic increases in low-level HPCS benchmarks of processor speed and memory access times do not necessarily translate into high-level increases in actual development productivity. While the machines are getting faster, the developer effort required to fully exploit these advances can be prohibitive. There is an emerging movement within the HPCS community to define new ways of measuring high performance computing systems, ways which take into account not only the low-level hardware components, but also the higher-level productivity costs associated with producing usable HPCS applications. This movement creates an opportunity for the software engineering community to apply our techniques and knowledge to a new and important application domain.

In this workshop, we bring together researchers and practitioners from both the SE and HPCS communities to continue articulating the challenges and opportunities of software engineering for high performance computing system applications that we began at the first workshop in Edinburgh, Scotland in 2004.

The Program Committee would like to especially thank the ICSE organizers for their logistical support in the development of the workshop.

Philip Johnson, University of Hawaii
Stuart Faulk, University of Oregon
Adam Porter, University of Maryland
Walter Tichy, University of Karlsruhe
Larry Votta, Sun Microsystems
Douglass Post, Los Alamos National Laboratory
Jeremy Kepner, MIT Lincoln Laboratory
Daniel Reed, University of North Carolina


Program

9:00-9:45 Keynote Presentation (Chair: Philip Johnson): Supercomputing and Systems Engineering, John Grosh

9:45-10:30 Panel: Experimental Methods and Metrics (Chair: Stuart Faulk)
• Application of a Development Time Productivity Metric to Parallel Software Development
• A Metric Space for Productivity Measurement in Software Development
• Generating Testable Hypotheses from Tacit Knowledge for High Productivity Computing

10:30-11:00 Break

11:00-11:45 Panel: Case Studies (Chair: Walter Tichy)
• Case Study of the Falcon Code Project
• Can Software Engineering Solve the HPCS Problem?
• P3I: The Delaware Programmability, Productivity, and Proficiency Inquiry

11:45-12:30 Open Discussion (Chair: Larry Votta): Experimental investigation of software engineering of high performance computing applications: what is our five year plan?

12:30-1:30 Lunch

1:30-3:00 Panel: Tools (Chair: Douglass Post)
• Refactorings for Fortran and High-Performance Computing
• Building a Software Infrastructure for Computational Science Applications: Lessons and Solutions
• And Away We Go: Understanding the Complexity of Launching Complex HPC Applications
• Automating the Development of Scientific Applications using Domain-Specific Modeling
• HPC Needs a Tool Strategy

3:00-3:30 Panel: Modeling and Prediction (Chair: Adam Porter)
• Predicting Risky Modules in Open-Source Software for High-Performance Computing
• Towards a Timed Markov Process Model of Software Development

3:30-4:00 Break

4:00-4:45 Panel: Verification (Chair: Larry Votta)
• Finite-State Verification for High Performance Computing
• Improving Scientific Software Component Quality Through Assertions
• Automated, Scalable Debugging of MPI Programs with Intel(R) Message Checker

4:45-5:15 Small group breakout sessions (Chair: Philip Johnson): Back to the future! Each group will generate a vision of the future of SE-HPCS by inventing a title and abstract for a paper to be presented at SE-HPCS 2006, SE-HPCS 2008, or SE-HPCS 2010!

5:15-5:30 Small group presentations and wrapup


Supercomputing and Systems Engineering John Grosh Office of the Deputy Under Secretary of Defense for Science and Technology 1777 North Kent Street Rosslyn, Virginia 22209

[email protected]

ABSTRACT

This paper outlines the keynote presentation for the Second International Workshop on Software Engineering for High Performance Computing System Applications. The talk will describe and highlight the growth of systems engineering in high performance computing. I will also provide an overview of trends in Defense applications, performance assessment, software-intensive systems, and challenges in application software security. I will describe ongoing programs and challenges in these areas as they relate to high performance computing.

OVERVIEW

Since the early 1980s, high performance computing has been defined by heroic feats of maximizing performance, frequently without significant thought to context and utility. Single applications in physics and engineering were developed, rewritten, tuned, and optimized with considerable effort to run at maximal execution performance. Most aspects of systems and software engineering were considered secondary to the pursuit of performance.

Today, performance is still important, but as we move high performance computing into the mainstream, such disciplines as software and systems engineering will grow in relevance. Future trends point to scalable applications serving as components within larger information technology systems. An example is the integration of aircraft testing with computational fluid dynamic simulations. This evolution in computer modeling and simulation dictates a shift in thinking on how we view high performance computing systems, both hardware and software, in a manner that improves our ability to integrate and build large and complex software-intensive systems.

In 2001, the Department of Defense viewed high performance computing as at a critical juncture. Systems of higher peak performance were being manufactured. Unfortunately, there continued to be a National Security requirement for computing systems that fell outside the commercial mainstream: specifically, high performance computer systems with higher bandwidth, certain architecture features, and greater programmability. As a result, the DARPA High Productivity Computing Systems Program was initiated to develop a new generation of high performance computing systems and architectures that improve user productivity and mission effectiveness.

Most view this program as a systems development program. It is that, but it is also an agenda to change the way we view, value, measure, and assess high performance computing systems. What is important to note is that this program marks the emergence of a systems engineering agenda as well, which is necessary if we are to build large software-intensive systems that utilize supercomputers.

Thus, in my talk, I will discuss several major trends, programs, and emerging needs relevant to Defense supercomputing and systems engineering. The first will outline the gradual shift from single-purpose physics applications to more complex and comprehensive modeling systems. I will use applications taken from the DoD High Performance Computing Modernization Program as examples and trace the growth of computing requirements over the last decade.

The second is the challenge of producing large software-intensive systems. I will provide an outline of a proposed DoD program in this area, which is mainly focused on developing embedded and real-time systems. Since high performance computing applications tend to lag embedded and real-time systems in complexity, a discussion of such issues serves as a portal into future challenges for supercomputing.

The third area is security. With the geometric growth in security attacks, we face intense challenges in assuring the operation of systems and the protection of data. Since most are familiar with issues related to computer and network security, I will focus the discussion on an emerging area: application software security. This new area attempts to address a fundamental question: how does an organization protect application software from theft and reverse engineering?

Throughout these discussions, I will tie in the need for systems engineering tools, techniques, and algorithms in order to build applications that meet various mission requirements. Finally, I will highlight ongoing DoD efforts related to establishing a systems engineering culture in the high performance computing community.



Application of a Development Time Productivity Metric to Parallel Software Development

Andrew Funk¹ ([email protected]), Victor Basili² ([email protected]), Lorin Hochstein² ([email protected]), Jeremy Kepner¹ ([email protected])

ABSTRACT

Evaluation of High Performance Computing (HPC) systems should take into account software development time productivity in addition to hardware performance, cost, and other factors. We propose a new metric for HPC software development time productivity, defined as the ratio of relative runtime performance to relative programmer effort. This formula has been used to analyze several HPC benchmark codes and classroom programming assignments. The results of this analysis show consistent trends for various programming models. This method enables a high-level evaluation of development time productivity for a given code implementation, which is essential to the task of estimating cost associated with HPC software development.

1. INTRODUCTION

One of the main goals of the DARPA High Productivity Computing Systems (HPCS) program [1] is to develop a method of quantifying and measuring the productivity of High Performance Computing (HPC) systems. At a high level, HPCS Productivity, Ψ, has been defined as utility over cost:

    Ψ = U(T) / (C_S + C_O + C_M)   [2],

where utility, U(T), is a function of time. Generally speaking, the longer the time to solution, the lower the utility of that solution will be. The denominator of the formula is a sum of costs: software (C_S), operator (C_O), and machine (C_M). A higher utility and lower overall cost lead to a greater productivity for a given system.

Efforts to quantify time to solution in the HPC community have traditionally focused on measuring execution time and developing benchmarks and metrics to evaluate computational throughput. However, few metrics exist for evaluating development time, which is increasingly being recognized as a significant component of overall time to solution.

Herein we define a new metric which we call development time productivity. In this case the utility is measured by how much faster a parallel code can find a solution, relative to a baseline serial code. The cost is measured by how much effort the programmer must put in to writing the parallel code, relative to the serial code. In other words,

    Development Time Productivity = Speedup / Relative Effort ,

where

    Speedup = Serial Runtime / Parallel Runtime ,

and

    Relative Effort = Parallel Effort / Serial Effort .

Effort may be measured in various ways. The most direct way is to measure the actual time spent programming both the serial and parallel code. This is perhaps the most accurate measure, but it can also be problematic to obtain accurate time logs, and often this data is simply not available. In this case other software metrics, such as Source Lines of Code (SLOC), may be used instead.

Some studies have shown SLOC to correlate well with developer effort [3], though this is still open to debate. In any case, we are working in relative, not absolute terms. For example, if parallel code A requires 2x the SLOC of a baseline serial code, and parallel code B requires only 1.5x the SLOC of the baseline serial code, then it is reasonable to assume that code A required a larger amount of effort than code B to develop.

This work is sponsored by the Defense Advanced Research Projects Agency, under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.



¹ MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420
² University of Maryland, Computer Science Department, A.V. Williams Building, Room 4111, College Park, MD 20742

We have applied this development time productivity formula to performance and effort data collected from two HPC benchmark suites, and from a series of graduate student parallel programming assignments. Section 2 describes in detail how the data were collected and analyzed in each case. The results of this analysis are presented in Section 3, and discussed further in Section 4.

2. ANALYSIS

2.1. NAS PARALLEL BENCHMARKS

The NAS Parallel Benchmark (NPB) [4] suite consists of five kernel benchmarks and three pseudo-applications from the field of computational fluid dynamics. The NPB presents an excellent resource for this study, in that it provides multiple language implementations of each benchmark. The exact codes used were the C/Fortran (serial, OpenMP) and Java implementations from NPB-3.0, and the C/Fortran (MPI) implementations from NPB-2.4. In addition, a parallel ZPL [5] implementation and two serial Matlab implementations were also included in the study.

These codes were all run on an IBM p655 multiprocessor computer using the Class A problem size. The parallel codes were run using four processors. The runtimes used were those reported by the benchmark codes. As discussed in the previous section, the speedup for each parallel code (and the serial Matlab codes) was calculated by dividing the baseline serial C/Fortran runtime by the parallel runtime. The SLOC for each benchmark code was counted automatically using the SLOCcount tool [6]. For each implementation, the relative SLOC was calculated by dividing the parallel SLOC by the baseline serial C/Fortran SLOC. For each benchmark implementation, a development time productivity value was calculated by dividing the speedup by the relative SLOC. The results of this analysis are presented in Section 3.1.

2.2. HPC CHALLENGE

The HPC Challenge suite [7] consists of several activity-based benchmarks designed to test various aspects of a computing platform. The four benchmarks used in this study were FFT (v0.6a), High Performance Linpack (HPL, v0.6a), RandomAccess (v0.5b), and Stream (v0.6a).

These codes were run on the Lincoln Laboratory Grid (LLGrid), a cluster composed of 80 dual-processor nodes connected by Gigabit Ethernet [8]. The parallel codes were run using 64 of these dual-processor nodes, for a total of 128 CPUs. The speedup for each parallel code was determined by dividing the runtime for a baseline serial C/Fortran code by the runtime for the parallel code (the serial Matlab code was treated the same as the parallel codes for purposes of comparison).

The relative SLOC (again counted using SLOCcount) was calculated by dividing the SLOC for each parallel code by the SLOC for a baseline serial C/Fortran code.

The development time productivity for each benchmark implementation was determined by dividing the speedup by the relative SLOC. The results of this analysis are presented in Section 3.2.

2.3. CLASSROOM ASSIGNMENTS

A series of classroom experiments were conducted for the HPCS program, in which students from several different classes were asked to produce parallel programming solutions to a variety of textbook problems (see Table 1). In most cases the students first created a serial program to solve the problem, and this was used as the baseline for comparison with their parallel solution. The students used C, Fortran, and Matlab for their serial codes, and created parallel versions using MPI, OpenMP, and Matlab*P (aka StarP, a parallel extension to Matlab) [9].

Table 1. Classroom Assignments

Class | Problem | Programming Task | Students reporting
P0A1, P1A1 | Game of Life | Create serial and parallel versions using C and MPI | 16, 11
P0A2 | Weather Sim | Add OpenMP directives to existing serial Fortran code | 17
P2A1 | Buffon-Laplace Needle | Create serial versions using C and Matlab, and parallel versions using MPI, OpenMP, and StarP | 11
P2A2 | Grid of Resistors | Create serial versions using C and Matlab, and parallel versions using MPI, OpenMP, and StarP | 11
P3A1 | Buffon-Laplace Needle | Create serial versions using C and Matlab, and parallel versions using MPI, OpenMP, and StarP | 17
P3A2 | Parallel Sorting | Create serial and parallel versions using C and StarP | 13
P3A3 | Game of Life | Create serial and parallel versions using C, MPI, OpenMP | 8

The students ran their programs on a variety of computing platforms, and reported their own timings. For purposes of comparison, all speedups were calculated using eight processors for the parallel case.

Although the students did report development effort in hours, this data was sometimes spotty and inconsistent. For the sake of consistency with the procedure used for the NPB, effort was again measured in terms of SLOC, which were reported by an automated tool other than SLOCcount. The relative SLOC was calculated by dividing the SLOC for each student's parallel code by the SLOC for that same student's serial code.

The development time productivity was calculated by dividing the speedup by the relative SLOC for each student's code submission. The results of this analysis are presented in Section 3.3.
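To make the calculation described above concrete, here is a minimal sketch (not code from the paper) of how speedup, relative SLOC, and development time productivity could be computed for a set of implementations. The runtimes, SLOC counts, and implementation names are hypothetical; in the study, runtimes came from the benchmarks themselves and SLOC from an automated counter such as SLOCcount.

```python
# Minimal, illustrative sketch of the development time productivity metric.
# All numbers below are hypothetical.

def development_time_productivity(serial_runtime, parallel_runtime,
                                  serial_sloc, parallel_sloc):
    """Return (speedup, relative_sloc, productivity) for one implementation."""
    speedup = serial_runtime / parallel_runtime      # Speedup = serial runtime / parallel runtime
    relative_sloc = parallel_sloc / serial_sloc      # Relative Effort, approximated by SLOC
    return speedup, relative_sloc, speedup / relative_sloc

baseline = {"runtime": 100.0, "sloc": 600}           # serial C/Fortran reference
implementations = {
    "C/Fortran + MPI":    {"runtime": 30.0,  "sloc": 1200},
    "C/Fortran + OpenMP": {"runtime": 32.0,  "sloc": 650},
    "Serial Matlab":      {"runtime": 400.0, "sloc": 150},
}

for name, m in implementations.items():
    s, rel, prod = development_time_productivity(
        baseline["runtime"], m["runtime"], baseline["sloc"], m["sloc"])
    print(f"{name:22s} speedup={s:6.2f}  relative SLOC={rel:5.2f}  productivity={prod:6.2f}")
```

In this toy data the MPI version gains the most speedup but pays for it in SLOC, while the OpenMP version achieves similar speedup with far fewer added lines, which is the general pattern reported in Section 3.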


[Figure 1. Speedup vs. Relative SLOC for the NPB: a log-log plot (axes 0.1 to 10) of speedup versus relative SLOC for the Fortran/C + MPI, Fortran/C + OpenMP, Java, Serial Matlab, and ZPL implementations, with the ideal speedup of 4 marked.]

[Figure 2. Development Time Productivity for the NPB: productivity values for each implementation of the eight benchmarks (BT, CG, EP, FT, IS, LU, MG, SP).]

3. RESULTS

3.1. NAS PARALLEL BENCHMARKS

Figure 1 presents a log-log plot of speedup vs. relative SLOC for the NPB. Each data point corresponds to one of the eight benchmarks included in the NPB suite, and the results are grouped by language (performance data was not available for some of the implementations). The speedup and relative SLOC for each benchmark implementation are calculated with respect to a reference serial code implemented in Fortran or C.

Each parallel code was run using four processors, setting an upper bound for speedup as indicated on the graph. For this study, no attempt was made to optimize or configure these benchmarks for the computing platform used. The goal of this study was not to judge suitability of one language over another for a given benchmark, but to observe general trends for a given language. For example, the OpenMP implementations tend to yield parallel speedup comparable to MPI, while requiring less relative SLOC (Figure 1). This is reflected in the higher development time productivity values for OpenMP (Figure 2). As a general rule, we expect to see traditional parallel languages and libraries such as MPI and OpenMP fall in the upper-right quadrant of the graph. This reinforces our intuition that parallel performance is achieved at the cost of additional effort (over serial implementation).

The lone ZPL implementation falls in the upper-left quadrant of the graph, having a relatively high speedup and low SLOC count, as compared to the serial Fortran implementation. Although a single data point does not constitute a trend, this result was included as an example of an implementation that falls in this region of the graph. Accordingly, this ZPL implementation has a high development time productivity value (see Figure 2).

The Matlab results provide an example of an implementation that falls in the lower-left quadrant of the graph, meaning that its runtime is slower than serial Fortran, but it also requires fewer SLOC than Fortran (Figure 1). In fact, because of its low SLOC relative to serial Fortran, the serial Matlab manages to have a development time productivity value comparable to parallel Java (Figure 2).

A few of the Java implementations are in or near the lower-right quadrant of the graph. This indicates that, although additional lines of code were added to create the parallel implementation, little if any parallel speedup was realized. There may be any number of reasons why these implementations did not fare well; it should be noted that the Java implementations were generated via a semi-automated translation from the serial Fortran. In any case, those implementations that are in or near the lower-right quadrant will have development time productivity values at or below the baseline established by the serial implementation.

3.2. HPC CHALLENGE

Figure 3 presents the results for the HPC Challenge benchmarks. The speedup and relative SLOC for each implementation were calculated with respect to a serial C/Fortran implementation. The parallel codes were all run using 64 dual-processor nodes, for a total of 128 CPUs. The implementations used for the Random Access benchmark (designated as RA in Figure 3) require a great deal of inter-processor communication, and so actually run slower as more processors are involved in a network cluster.

With the exception of Random Access, the MPI implementations all fall into the upper-right quadrant of the graph, indicating that they deliver some level of parallel speedup, while requiring more SLOC than the serial code. As expected, the serial Matlab implementations do not deliver any speedup, but do all require fewer SLOC than the serial code.


[Figure 3. Speedup vs. Relative SLOC, HPC Challenge: a log-log plot (ideal speedup = 128) of speedup versus relative SLOC for the C+MPI, pMatlab, and Serial Matlab implementations of Stream, FFT, HPL, and Random Access (RA).]

[Figure 4. Development Time Productivity, HPC Challenge: productivity values for each implementation of Stream, FFT, HPL, and Random Access.]

The pMatlab implementations (except Random Access) fall into the upper-left quadrant of the graph, delivering parallel speedup while requiring fewer SLOC.

The combination of parallel speedup and reduced SLOC means that the pMatlab implementations have higher development time productivity values (Figure 4). On average the serial Matlab implementations come in second, due to their low SLOC.

The MPI implementations, while delivering better speedup, have much higher SLOC, leading to lower development time productivity values.

3.3. CLASSROOM ASSIGNMENTS

Figure 5 presents speedup vs. relative SLOC results for a series of classroom assignments. The speedup and relative SLOC were collected for each student, and the median values for each assignment are plotted on the graph. As indicated on the graph, the ideal speedup in this case is eight, based on the use of eight processors for the classroom assignments.

Some of the assignments had median speedup values outside the range of 0.1 – 10. It is assumed that such outlier data is erroneous, and not representative of actual achieved performance with correct implementations. For the sake of clarity and comparison with the NPB results, the axes ranges are limited to 0.1 – 10, excluding some data points. Error bars indicate one standard deviation from the median value.

The MPI data points for the most part fall in the upper-right quadrant of the graph, resulting in development time productivity values at or above one (Figure 6). There was one MPI assignment in which most of the students were not able to achieve speedup, and this resulted in a median development time productivity less than one.

The OpenMP data points indicate a higher achieved speedup compared to MPI, while also requiring fewer lines of code (Figure 5). This leads to higher development time productivity values for OpenMP (Figure 6). Actually, the OpenMP assignment with a median SLOC less than the serial SLOC resulted in a development time productivity value greater than 10 (not shown).

The lone StarP data point has a median speedup value below one (slower than serial C), while requiring fewer lines of code than MPI (Figure 5). Due to the low speedup value, the StarP assignment has a development time productivity value less than one (Figure 6).

4. CONCLUSIONS

We have introduced a common metric for measuring development time productivity of HPC software development. The development time productivity formula has been applied to data from benchmark codes and classroom experiments, with consistent results. In general the data supports the theory that MPI implementations yield good speedup but have a higher relative SLOC than other implementations. OpenMP generally provides speedup comparable to MPI, but requires fewer SLOC. This leads to higher development time productivity values. There are questions of scalability with regard to OpenMP that are not addressed by this study.

The pMatlab implementations of HPC Challenge provide an example of a language that can yield good speedup for some problems, while requiring fewer relative SLOC, again leading to higher values of the development time productivity metric.

In addition to discovering general trends for a given language, in practice this technique could be used to evaluate the productivity of programming models provided with two or more HPC systems, as part of the decision process associated with procurement. Consideration of the development time productivity metric, along with hardware performance and other factors, would give a more complete picture of overall system productivity.


[Figure 5. Speedup vs. Relative SLOC, Classroom assignments: median speedup versus relative SLOC (axes 0.1 – 10, ideal speedup = 8) for assignments P0A2, P1A1, P2A1, P3A1, and P3A3, grouped by MPI, OpenMP, and StarP.]

[Figure 6. Development Time Productivity, Classroom assignments: median productivity values by assignment (P0A2, P1A1, P2A1, P3A1, P3A3).]

Follow-on studies will examine more benchmark codes and language implementations. In the HPCS program there is an ongoing effort to collect a wide variety of HPC benchmarks implemented in as many languages as possible. Having this range of data will enable us to make more thorough comparisons between languages.

Further classroom experiments are planned, and as more data is collected it will be analyzed in the same manner, to see if other trends emerge. In addition to SLOC, effort data will be collected both automatically and by student reporting. Having two effort data sources will allow us to judge the accuracy of student reporting, as well as to fine-tune the automated collection process. This data will also enable us to further explore the relationship between effort and SLOC.

5. ACKNOWLEDGMENTS

We wish to thank all of the professors whose students participated in this study, including Jeff Hollingsworth, Alan Sussman, and Uzi Vishkin of the University of Maryland, Alan Edelman of MIT, John Gilbert of UCSB, Mary Hall of USC, and Allan Snavely of UCSD.

6. REFERENCES

[1] High Productivity Computer Systems. http://www.HighProductivity.org
[2] Kepner, J. "HPC Productivity Model Synthesis." IJHPCA Special Issue on HPC Productivity, Vol. 18, No. 4, SAGE 2004.
[3] Humphrey, W. S. A Discipline for Software Engineering. Addison-Wesley, USA, 1995.
[4] NAS Parallel Benchmarks. http://www.nas.nasa.gov/Software/NPB/
[5] ZPL. http://www.cs.washington.edu/research/zpl/home/
[6] Wheeler, D. SLOCcount. http://www.dwheeler.com/sloccount/
[7] HPC Challenge. http://icl.cs.utk.edu/hpcc/
[8] Haney, R. et al. "pMatlab Takes the HPC Challenge." Poster presented at the High Performance Embedded Computing (HPEC) workshop, Lexington, MA, 28-30 Sept. 2004.
[9] Choy, R. and Edelman, A. MATLAB*P 2.0: A unified parallel MATLAB. MIT DSpace, Computer Science collection, Jan. 2003. http://hdl.handle.net/1721.1/3687


A Metric Space for Productivity Measurement in Software Development

Robert W. Numrich (Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN)
Lorin Hochstein (Department of Computer Science, University of Maryland, College Park, MD)
Victor R. Basili (Department of Computer Science, University of Maryland, College Park, MD)

ABSTRACT

We define a metric space to measure the contributions of individual programmers to a software development project. It allows us to measure the distance between the contributions of two different programmers as well as the absolute contribution of each individual programmer. Our metric is based on an action function that provides a picture of how one programmer's approach differs from another at each instance of time during the project. We apply our metric to data we collected from students taking a course in parallel programming. We display the pictures for two students who showed approximately equal contributions but who followed very different paths through the course.

1. INTRODUCTION

We define a metric space that measures the contributions of individual programmers to a software development project. This space satisfies all the mathematical requirements for a metric space and allows us to measure the distance between programmers and the absolute size of each programmer's contribution. We assign a power function, the rate of work production, for each activity performed by a programmer. The integral of the power function over time yields the work done by each programmer, and the integral of the work function yields a quantity called action in the physical sciences. After scaling and shifting each programmer's action function to the same time interval, we define a metric space with a distance function equal to the integral of the absolute difference between two action functions.

We apply our metric to data we collected observing students taking a course in parallel programming. By monitoring them during the course, we determined a set of activities performed at different times by each student and assigned a value to each activity in terms of how well it advanced the student toward a solution of the problem. In the following sections we describe how we used that data to calculate a numerical value for each student's contribution over the lifetime of the course and to calculate the difference between each pair of students.

2. DATA COLLECTION

We collected data from students as they worked on a programming assignment for a graduate-level course on Grid Computing at the University of Maryland [2]. The assignment was to implement Conway's Game of Life [4] to run in parallel on a Beowulf Linux cluster [1]. The students used the MPI library [3] to implement the parallel program. We collected data by instrumenting the compiler. Each time a student compiled a program, we asked two questions. First, how long have you been working before the compilation? A blank response indicated that they had been working continuously since the previous compilation. Second, what kind of work were you doing? The student selected the kind of work from a list of seven activities, which are listed in the first column of Table 1.

Table 1: Activities and Power Ratings

Activity | Power Rating
Tuning | 0.9
Parallelizing | 0.7
Functionality | 0.6
Learning | 0.5
Compile-Time Error | 0.2
Run-Time Error | 0.2
Other | 0.1

The instrumented compiler recorded the responses along with a time stamp indicating when the compilation occurred. From this data, we computed a set of time intervals for each student along with the activity associated with that interval. A fundamental problem in trying to define a productivity metric in software development is the definition of work. Each kind of activity in each time interval corresponds to some work that advances the student toward the solution of a problem. Some activities advance the student more quickly than others, that is, they produce work at a higher rate. It is sufficient, for our purposes, to know the rate at which work accumulates, the power rating, without defining the actual unit of work itself. One unit of work can be converted to any other unit of work through a suitable conversion factor. The work associated with each activity can be converted into whatever unit of work we want without changing our results.
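The paper does not show the instrumentation itself; the following is a minimal, hypothetical sketch of the general idea, a wrapper placed in front of the compiler that asks the two questions at each compile and appends a timestamped record to a log. The file name, prompts, and log format are assumptions made for illustration only.

```python
#!/usr/bin/env python3
# Hypothetical compiler-wrapper sketch in the spirit of the data collection
# described above. Invoke as, e.g.:  wrapper.py mpicc -o life life.c
import csv, subprocess, sys, time

ACTIVITIES = ["Tuning", "Parallelizing", "Functionality", "Learning",
              "Compile-Time Error", "Run-Time Error", "Other"]
LOG_FILE = "compile_log.csv"   # hypothetical log location

def main():
    minutes = input("How long have you been working before this compilation "
                    "(minutes, blank = since last compile)? ").strip()
    for i, name in enumerate(ACTIVITIES, 1):
        print(f"  {i}. {name}")
    choice = int(input("What kind of work were you doing (1-7)? "))

    # Run the real compiler with whatever arguments the student supplied.
    result = subprocess.run(sys.argv[1:])

    # Append (timestamp, self-reported minutes, activity, compiler exit code).
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([time.time(), minutes,
                                ACTIVITIES[choice - 1], result.returncode])
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```

From a log of this form, the per-student time intervals and their associated activities used in the model below can be recovered directly.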



The important quantity for our analysis is the unit of power, ρ, which we define as the maximum rate at which any programmer can perform work to finish the project. In each interval of time, each programmer, denoted by superscript i, performs some activity, denoted by subscript j, at some fraction of peak power,

    ρ_j^i = α_j^i ρ .    (1)

The dimensionless parameters 0 ≤ α_j^i ≤ 1 characterize the behavior of each programmer. Table 1, in its second column, shows the power ratings, α_j^i, assigned to each activity. We have given all programmers the same power rating for the same activity although we could, with more information, assign different ratings to each one. These power ratings are the input parameters to our model. Their values are purely subjective at this point, and we claim no profound meaning to them. They are dimensionless quantities that represent the fraction of peak power for each activity.

3. THE SET OF PROGRAMMERS

Consider a specific software development project with some (finite) set of programmers,

    P = {P^1, P^2, ...} ,    (2)

assigned to the project. Let t ≥ 0 represent time measured from the beginning of the project at time t = 0. Let T^i be the time spent on the project by programmer P^i and let

    T^i = [0, T^i]    (3)

be the corresponding time interval. Programmers spend their time doing different things at different times during the project. To reflect this changing activity, we divide each time interval T^i into subintervals,

    T_j^i = [t_{j-1}^i, t_j^i] ,   j = 1, n^i .    (4)

Each programmer starts at

    t_0^i = 0 ,    (5)

and finishes at

    t_{n^i}^i = T^i .    (6)

The number of intervals n^i is different for each programmer, and the total time spent T^i is different for each programmer. The width of each time interval is

    σ_j^i = t_j^i − t_{j-1}^i ,    (7)

and the activity performed in each interval is different for each programmer as represented by the constants α_j^i from equation (1) and Table 1.

4. WORK AND ACTION

In each time interval T_j^i, programmer P^i is involved in some activity that contributes some work, W_j^i(t), toward finishing the project. Some activities advance the project more than others. For each activity, the power function of equation (1) is the derivative of the work function,

    ρ_j^i(t) = dW_j^i/dt ,    (8)

the rate of work production for programmer P^i in time interval T_j^i. For simplicity, we assume that the power function ρ_j^i is constant in each interval so that

    W_j^i(t) = ∫_{t_{j-1}^i}^{t} ρ_j^i ds ,    (9)

and hence work accumulates linearly in each interval,

    W_j^i(t) = ρ_j^i (t − t_{j-1}^i) .    (10)

At the end of each time interval, the work accumulated over that interval is

    W_j^i(t_j^i) = ρ_j^i σ_j^i ,    (11)

where we have used the width of the interval from equation (7). As time increases from one interval to the next, work accumulates at different rates at different times. At time t ∈ T_k^i the total accumulated work,

    W^i(t) = W_k^i(t) + Σ_{j=1}^{k-1} ρ_j^i σ_j^i ,    (12)

is the sum of the work done during all the intervals preceding interval T_k^i plus the additional work done so far in interval T_k^i. The action generated in each time interval is the integral,

    S_j^i(t) = 2 ∫_{t_{j-1}^i}^{t} W_j^i(s) ds ,    (13)

where we inserted the factor of two for convenience. Substituting the work function from equation (10) into the integral and evaluating the integral, we find

    S_j^i(t) = ρ_j^i (t − t_{j-1}^i)^2 .    (14)

At the end of each interval, the action accumulated over that interval is

    S_j^i(t_j^i) = ρ_j^i (σ_j^i)^2 .    (15)

The total accumulated action in interval T_k^i at time t is the sum,

    S^i(t) = S_k^i(t) + Σ_{j=1}^{k-1} ρ_j^i (σ_j^i)^2 .    (16)

5. A METRIC SPACE FOR PROGRAMMERS

We make the set of programmers P a metric space [5] by defining a distance function based on the difference in how each programmer generates action during the project. We want this function to be a dimensionless function of a dimensionless variable such that the distance between programmers is a pure number. We also want the measure of a programmer's individual contribution to be the distance from the null programmer, a laggard that spends time on the project but produces nothing. First we define a set of units. The unit of time, T, is the maximum time spent by any programmer in the set,

    T = max_i (T^i) .    (17)

The unit of power is ρ and the unit of action is

    Ŝ = ρ T^2 .    (18)
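As a brief numerical illustration (not from the paper; the numbers are hypothetical), consider a programmer i whose first two intervals are one hour of Learning (α = 0.5) followed by one hour of Parallelizing (α = 0.7), with time measured in hours:

    W_1^i(t_1^i) = 0.5 ρ · 1 = 0.5 ρ ,        S_1^i(t_1^i) = 0.5 ρ · 1^2 = 0.5 ρ ,
    W^i(t_2^i) = 0.7 ρ + 0.5 ρ = 1.2 ρ ,      S^i(t_2^i) = 0.7 ρ + 0.5 ρ = 1.2 ρ ,

by equations (11)-(12) and (15)-(16). With unit interval widths the work and action totals happen to coincide numerically (their units differ: ρ·hours versus ρ·hours²); for interval widths other than one, the action contributions grow quadratically with the width while the work contributions grow only linearly.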

To put each programmer onto the same time scale [7], we define the dimensionless time variable,

    z = 1 + (t − T^i)/T .    (19)

We define a dimensionless action function s^i(z) in interval T_k^i from the sum in equation (16) evaluated at time

    t = Tz + T^i − T ,    (20)

and scaled by the unit of action Ŝ,

    s^i(z) = (1/Ŝ) · [ S_k^i(Tz + T^i − T) + Σ_{j=1}^{k-1} ρ_j^i (σ_j^i)^2 ] .    (21)

The dimensionless time variable z spans the interval

    1 − T^i/T ≤ z ≤ 1 ,    (22)

and the first time interval for each programmer shifts to a new starting point,

    z = 1 − T^i/T .    (23)

At this value of z, from definition (19), the time, t = 0, corresponds to the left end of the first interval where the action is zero. We extend the action function continuously to z = 0 by defining

    s^i(z) = 0 ,   0 ≤ z ≤ 1 − T^i/T .    (24)

In the variable z, every programmer ends activity at the same time,

    z = 1 .    (25)

The programmer spending the longest time spans the whole interval from z = 0 to z = 1. With these definitions, we define the distance between two programmers as the integral of the difference of the two action functions,

    dist(P^i, P^j) = ∫_0^1 |s^i(z) − s^j(z)| dz .    (26)

The size of each programmer's contribution is the distance to the null contributor, always assumed to be in the set of programmers, such that

    dist(P^i, 0) = ∫_0^1 |s^i(z)| dz .    (27)

6. APPLICATION TO EMPIRICAL DATA

[Figure 1: Action as a function of time for ten different programmers. Each programmer is assigned a symbol that marks the beginning and ending of each curve. Time has been scaled so that the unit of time equals the longest time spent by any programmer in the set. The times for the other members of the set are shifted to the right so that each programmer starts work at a different time but ends work at the same time.]

Figure 1 shows the action functions defined by equation (21) for the set of ten programmers we considered. Each programmer is marked by a symbol at the beginning and end of the corresponding interval in the dimensionless time variable z. The size of each programmer's contribution is the area under the action function. The distance between programmers is the area under the absolute difference between action functions. We can approximate the area under each curve by the area of the triangle determined by the end points of each curve [6]. Table 2 lists the values obtained this way in milli-action units, ρT^2 × 10^-3. To interpret the information measured by our metric space, we isolate two programmers, number three and number eight in Table 2, whose contributions are approximately equal.

Figure 2 shows the action curves for these two programmers along with two dotted triangles, determined by the end points of each curve, which we use to approximate the area under each curve. Although the area under the two curves is about the same, indicating that the two programmers contributed about the same amount to the project, the way they contributed is quite different. One programmer took a long but steady approach while the other took a short but steep approach. The quantitative measure of the difference between the two approaches is the area between the two curves, which, from the corresponding entry in the table, equals 5.9 milli-action units.
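The following is a minimal sketch, not the authors' code, of how the scaled action functions and distances of equations (19)-(27) could be computed from per-interval activity data. The two activity logs, the programmer labels, and the choice ρ = 1 are hypothetical, and the integrals in (26)-(27) are approximated numerically on a fixed grid.

```python
# Hypothetical sketch of the metric-space computation described above.
import numpy as np

POWER = {"Tuning": 0.9, "Parallelizing": 0.7, "Functionality": 0.6,
         "Learning": 0.5, "Compile-Time Error": 0.2, "Run-Time Error": 0.2,
         "Other": 0.1}

# Each log is a list of (activity, interval width sigma_j) pairs, in hours.
logs = {
    "P3": [("Learning", 2.0), ("Functionality", 3.0), ("Parallelizing", 4.0)],
    "P8": [("Functionality", 1.0), ("Parallelizing", 2.0), ("Tuning", 1.5)],
}

T = max(sum(w for _, w in log) for log in logs.values())   # unit of time, eq. (17)
S_hat = 1.0 * T**2                                         # unit of action, eq. (18), with rho = 1

def scaled_action(log, z):
    """Dimensionless action s^i(z) of eqs. (19)-(24) for one programmer."""
    T_i = sum(w for _, w in log)
    t = T * z + T_i - T                 # eq. (20): map z back to project time
    if t <= 0.0:
        return 0.0                      # eq. (24): zero before this programmer starts
    s, elapsed = 0.0, 0.0
    for activity, width in log:
        rho_j = POWER[activity]
        dt = min(max(t - elapsed, 0.0), width)
        s += rho_j * dt**2              # eqs. (14)-(16): quadratic growth per interval
        elapsed += width
    return s / S_hat                    # scale by the unit of action

zs = np.linspace(0.0, 1.0, 2001)
curves = {name: np.array([scaled_action(log, z) for z in zs])
          for name, log in logs.items()}

# Individual contributions: distance to the null programmer, eq. (27).
for name, s in curves.items():
    print(name, "contribution:", np.trapz(np.abs(s), zs))

# Distance between the two programmers, eq. (26).
print("distance:", np.trapz(np.abs(curves["P3"] - curves["P8"]), zs))
```

Because the action functions are piecewise quadratic, the triangle approximation mentioned above could replace the numerical integration here at the cost of some accuracy.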

7. SUMMARY We have defined a metric space for a set of programmers. The distance function for that space allows us to measure the contribution of an individual programmer and to measure the difference in contributions between pairs of programmers. The metric is based on the action function for each programmer as it evolves in time. This function provides not only a measure for the total contribution, the area under the curve, but also a picture of how the approach of one programmer differs from another at each instance of time during the project. We illustrated this property by displaying the action functions for two programmers who contributed equally but followed very different paths through the project.


Table 2: Individual contributions in milli-action units, ρT^2 × 10^-3. The values on the diagonal are the individual contributions from equation (27). The values below the diagonal are the distances between contributions from equation (26).

      1    2    3    4    5    6    7    8    9    10
1   1.2
2   0.8  1.9
3   5.3  4.7  6.6
4   3.3  3.3  5.4  3.0
5   1.4  0.9  4.4  2.8  2.2
6   3.6  3.4  4.8  0.8  2.8  3.5
7   1.6  1.6  5.2  1.7  1.2  2.2  1.4
8   6.3  5.9  5.9  3.6  5.1  3.0  5.1  6.5
9   5.2  5.0  6.2  2.1  4.4  1.6  3.7  1.6  5.0
10  1.3  1.8  6.3  2.7  2.0  3.3  1.1  6.3  4.8  0.3

[Figure 2: Action as a function of time for programmers three and eight from Table 2. The area under each curve is approximated by the triangle determined by the end points of each curve. The two programmers contributed about the same to the project (the area under the two curves is about the same), but they worked in two quite different ways to the same end.]

Our model depends on several assumptions, which can be changed. We assumed that the power function is constant for each activity, independent of the time interval and independent of the programmer. This assumption results in a simple linear increase in work and a quadratic increase in action over each time interval. It is not clear that letting the power function vary over the time interval would add much to the analysis. But it might add something if we assign different constants to different programmers for the same activity. After all, some programmers are more productive than others, for example, while writing parallel code.

In the end, the distance function for our metric space takes experimental measurements of programmer activity as input and returns a dimensionless, pure number as output. By shifting and scaling the time, it accounts for the disparities in the start and stop times for different programmers. We never needed to specify the unit of work. Each activity produces work of some kind that advances the project. We subjectively judged the effectiveness of each activity by assigning a higher or a lower power rating to it. These power ratings are crucial input to the model, and the determination of what these ratings should be is the next important step we need to take to judge the utility of this model.

8. ACKNOWLEDGEMENTS

This research was supported in part by Department of Energy contract DE-FG02-04ER25633 to the University of Maryland. We thank Alan Sussman for allowing us to collect data from his class on grid computing at the University of Maryland. It was also supported in part by two Department of Energy contracts to the University of Minnesota: contract DE-FC02-01ER25505, as part of the Center for Programming Models for Scalable Parallel Computing, and contract DE-FG02-04ER25629, as part of the Petascale Application Development Analysis Project, both sponsored by the Office of Science.

9. REFERENCES

[1] D. J. Becker, T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the 1995 International Conference on Parallel Processing (ICPP), 1995.
[2] J. Carver, S. Asgari, V. R. Basili, L. Hochstein, J. Hollingsworth, F. Shull, and M. V. Zelkowitz. Studying code development for high performance computing: The HPCS program. In Workshop on High Productivity Computing, Edinburgh, Scotland, pages 32–36, May 2004.
[3] J. Dongarra, S. Otto, M. Snir, and D. Walker. A message-passing standard for MPP and workstations. Communications of the ACM, 39(7):84–90, 1996.
[4] M. Gardner. The fantastic combinations of John Conway's new solitaire game "Life". Scientific American, 223:120–123, 1970.
[5] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. Dover, revised English edition, 1970.
[6] R. W. Numrich. Performance metrics based on computational action. International Journal of High Performance Computing Applications, 18(4):449–458, 2004.
[7] R. W. Numrich. A dynamical approach to computer performance analysis. Submitted, 2005.

Generating Testable Hypotheses from Tacit Knowledge for High Productivity Computing

Sima Asgari¹, Lorin Hochstein¹, Victor Basili¹,², Marvin Zelkowitz¹,², Jeff Hollingsworth¹, Jeff Carver³, Forrest Shull²

¹ Computer Science Department, University of Maryland, College Park, MD 20742, USA
² Fraunhofer Center for Experimental Software Engineering, College Park, MD 20742, USA
³ Mississippi State University, Mississippi State, MS 39762, USA

{sima,lorin,basili,hollings,mvz}@cs.umd.edu, [email protected], [email protected]

ABSTRACT

In this research, we are developing our understanding of how the high performance computing community develops effective parallel implementations of programs by collecting the folklore within the community. We use this folklore as the basis for a series of experiments which, we expect, will validate or negate these assumptions.

Categories and Subject Descriptors
D.2 [Software engineering]: Empirical Studies, Folklore

General Terms
High Productivity Development Time Experimental Studies, Tacit Knowledge Solicitation, Testable Hypotheses

Keywords
Folklore Elicitation, Hypothesis Generation

1. INTRODUCTION

The DARPA High Productivity Computing Systems (HPCS) project has goals of "providing a new generation of economically viable high productivity computing systems for national security and for the industrial user community," and initiating "a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and ultimately, productivity in the HPC domain" (http://www.highproductivity.org).

In order to reassess the definitions and measures in a scientific domain it is necessary to study the basis and source of those definitions and measures. These sources are usually found in the related literature and various documentation existent in the community. However, the large amount of tacit information that is merely in people's minds often remains neglected.

Historically, there has been little interaction between the HPC and the software engineering communities. The Development Time Working Group of the HPCS project is focused on development time issues. The group has both software engineering researchers as well as HPC researchers. The strategy of the working group is to apply empirical methods to study parallel programming issues. We have applied similar methods in the past to researching development time issues in other software domains [7].

Because of little interaction between the HPC and SE communities in the past, those of us on the SE side have very little knowledge about the nature of software development in the HPC domain. While the HPC community has not focused on development time issues in the sense of generating publications on these subjects, it has assuredly accumulated a wealth of experience about such matters, leading some HPC practitioners to refer to the field as a "black art". Indeed, those in the community tend to harbor strong (and sometimes contradictory) beliefs about development time issues. It would be inappropriate to disregard this body of knowledge simply because it has not been packaged in a suitable format. Unfortunately, since it currently exists only as tacit knowledge, it is not obvious how to best leverage this expertise. While there has been previous research in trying to capture the needs of HPC programmers as they relate to software development issues [2, 3], there has been little research in trying to capture the knowledge of HPC programmers on software development issues, with a notable exception [5]. In this paper, we describe the initial stages of our work to collect this knowledge, which we refer to as "tribal lore" or "folklore".

By tribal lore or folklore we mean the common beliefs about the interaction between variables such as code development effort, development activities such as debugging, programming models, languages, execution time, etc.

We conducted two separate studies to solicit HPCS folklore and the types of defects common to high-end programming. This paper discusses the process of knowledge solicitation, some initial analysis of the collected information, and the hypotheses created.

An initial conclusion from the folklore is that debugging parallel code is a particularly difficult task. In order to quantify the debugging difficulty we need to analyze the defects (bugs) in the code to find out the types of defects that programmers encounter when writing parallel code, to understand how common these defects are, and to specify how difficult they are to fix.

2. KNOWLEDGE SOLICITATION PROCESS

The development time working group of HPCS is responsible for investigating issues concerning development time within the HPCS framework. We conduct experimental studies by collecting various data during the code development phase of high productivity computing by novices (university students working on class assignments) and professionals working on real projects (case studies) or small sample problems (observational studies).

gather a sense of the type of information that a beginning HPC programmer might find. This initial list of 10 ideas (the left column of the table in Appendix 1) was recorded and used as the basis for our first survey.

As the initial set of hypotheses that should be investigated using the collected data, we generate hypotheses from the tacit knowledge collected from the HPC community members. After capturing this knowledge, several testable hypotheses are generated around each issue and we investigate them using the development data that we’ve collected.

We then asked 7 HPC specialists and professors who regularly teach HPC classes to comment on the initial list. They were asked to give an “agree”, “disagree” or “don’t know” answer to each lore, give their comments or change suggestions and add any folk lore that they are aware of but is not on the list.

Figure 1 shows the process of knowledge solicitation and analysis. The area inside the dotted rectangle in figure 1 is the current part of the study that we discuss in this paper.

Figure 2 shows the answers. The folklore number 11 in Appendix 1 was added by one of the participants at this stage. Generally, the comments revolved around clarifying the domain to which the bit of lore applied: for example, was the bit of lore talking about a user programming model such as OpenMP, or hardware architecture such as a multi-threaded machine?

2.1 HPCS Folklore One of the main goals of the development time working group of HPCS project is to leverage HPC community’s knowledge of development time issues. In order to do so, we are soliciting expert opinion on issues related to HPC programming by collecting elements of folklore through surveys, generating discussion among experts on these elements of the lore to increase precision of statements and to measure degree of consensus and finally generate testable hypotheses based on the lore that can be evaluated in empirical studies.

In order to clarify the questionable points we scheduled a discussion session among the participants. This discussion resulted in some modifications in the way folklore sentences were phrased. The right column of the table in Appendix 1 is the result of this modification. 7

6

Survey participants

5

4

3

2

1

0 1

2

3

4

5

6

7

8

9

10

HPCS Tribal Lore agree

Agree, but

Disagree

Disagree, but

Don't know

Figure 2: Responses to the initial list of HPC folklore At some point during the discussion, the participants agreed that “MPI programs don't run well when you use lots of small messages because you get latency-limited”. In order to include this in the folklore list, the lore number 12 was added to the list. At the next step of the study, a survey form was compiled from the current list of 12 folklore and distributed to the participants at the “High Productivity Computing Systems, Productivity Team Meeting” held in January 2005. In order to avoid any bias, some of the randomly selected lore were rephrased to imply the logically inverse sentence. Two sets of survey forms were compiled and distributed randomly.

Figure 1: Folklore and defect solicitation process Before starting the exploratory experiment of collecting peoples’ anecdotal beliefs through surveys, we needed an initial set of such anecdotes to both encourage thinking and also use as examples of what we are interested in.

In Figure 3 (the survey results), the numbers on the x-axis represent the folklore numbers, where the numbers marked with * show that the altered version of the lore was used. In version 1 of the survey the altered phrases of the folklore 1,3,4,6,7,8 and 11 and the original version of the folklore 2,5,9,10 and 12 were used, and in the second version of the survey they were switched. In Figure 3, the third column for each folklore, marked as ‘mean”, represents the mean value from the two surveys.

To gather the folklore in HPC, a member of the study group, who is an HPC professor, conducted an informal scan of several sources including lecture notes used in introductory HPC classes at the University of Maryland as well as scanning the Internet for related keywords (including "HPC folklore” and "HPC folklore"). The goal of this process was not to be exhaustive, but instead to

The total number of respondents was 10 for the first version and 18 for the second version of the survey. In most cases more than

18

50% of the participants agreed with the positive lore and disagreed with the altered ones. It seems that folklore numbers 5 and 11 need further investigation since there is less than 30% agreement on them. This emphasizes the fact that there could be large inconsistency between experts’ viewpoints and also that the phrasing of the folklore is a very important factor. Before trying to create testable hypotheses based on the folklore, we are updating the folklore phrasing to make sure to collect proper data for testing those hypotheses. Also in order to avoid misinterpretation,

the folklore should be phrased as clear and unambiguous as possible, starting from the most controversial ones. The list of HPC folklore is still in primary stage and needs further refinement. We are classifying and analyzing the comments given to the survey by participants. We are also conducting the survey in our upcoming classroom studies to see how comparable students’ and professionals’ knowledge is.

Figure 3: Folklore survey results (for each folklore item 1-12, the percentage of responses to the original and altered (*) phrasings and their mean, in the categories "agree positive / disagree inverse", "disagree positive / agree inverse", "don't know", and blank)

2.2 Testable Hypotheses

We use the revised folklore to produce testable hypotheses and investigate the hypotheses using the collected data. An example is lore number 4 in the updated list of Appendix 1:

Folklore 4: Debugging race conditions in shared memory programs is harder than debugging race conditions in message passing programs.

At the first discussion session, the following points were brought up:
• When working in the shared memory model, either it works right away or you will never figure out why.
• Bugs in shared memory are hard to deal with because they can be non-deterministic, more subtle, and harder to track down.
• Shared memory programs are far easier to develop because:
  o They provide a global address space.
  o You do not have to think about the details that you do in message passing.
  o You can incrementally develop shared memory programs.
• In some cases, it may be harder to debug shared memory programs.

The following hypotheses were created from the above:

Hypothesis 1: The average time to fix a defect due to race conditions will be longer in a shared memory program than in a message-passing program. To test this hypothesis we measure the time to fix defects due to race conditions.

Hypothesis 2: On average, shared memory programs will require less effort than message passing programs, but the shared memory outliers will be greater than the message passing outliers. To test this hypothesis we measure the total development time.

Hypothesis 3: There will be more students who submit incorrect shared memory programs than incorrect message-passing programs. To test this hypothesis we measure the number of students who submit incorrect solutions.
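To make the kind of defect behind Folklore 4 and Hypothesis 1 concrete, the following minimal OpenMP sketch in Fortran (our illustration, not code from the study) contains a classic shared memory race condition: a work variable that should be thread private is shared by default, so the program's output can vary from run to run.

   program race_example
     implicit none
     integer, parameter :: n = 100000
     double precision :: x(n), tmp
     integer :: i

     ! tmp is shared by default inside the parallel region, so two threads
     ! can interleave the write to tmp and the read of tmp.  The fix is
     ! simply "!$omp parallel do private(tmp)".
     !$omp parallel do
     do i = 1, n
        tmp  = 2.0d0 * i
        x(i) = tmp + 1.0d0
     end do
     !$omp end parallel do

     print *, x(1), x(n)
   end program race_example

Because the wrong answers appear only under particular thread interleavings, such a defect is exactly the kind of non-deterministic, hard-to-reproduce bug the discussion participants described.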

3. DEFECTS (BUGS) IN HPC CODE

The types of defects that occur in code, their frequency of occurrence, and the effort required to fix them have an impact on productivity. In order to be able to test the defect-related folklore, such as folklore 4 discussed above, we need to analyze defects. We have started a study to analyze defects by classifying defect types, how common they are, and how difficult they are to fix. The process is similar to the one used for the folklore. We asked an HPC specialist to compile an initial list of defects from the literature and his own experience. Table 1 shows this initial list.

At the next step of the study, a survey form was compiled from the initial list of defects and distributed to the participants at the "High Productivity Computing Systems, Productivity Team Meeting" held in January 2005. In this defect survey we asked the participants to identify the frequency of each defect on a 1 to 5 scale, where 1 is the lowest and 5 is the highest frequency. They were also asked to identify the severity of each defect as low, medium or high, and finally to add any defects that they had experienced that were not on the list. The initial list of Table 1 was used for the survey, and Table 2 is the list of defects added by the respondents. These new defects will be added to the list for the next round of surveys.

Table 1: Initial Defects List

Message Passing
  M1  Deadlock: sender and receiver waiting for each other
  M2  Async Send/Recv and updating variables before the send completes
  M3  Async Send/Recv and reading variables before they arrive
  M4  Not all processes call a collective communication operation
  M5  Process tries to send a message to itself
  M6  Type inconsistencies in Send/Recv
Shared Memory
  S1  Synchronization bugs
  S2  Variables that should be thread private are shared
  S3  Variables that should be shared are private
  S4  Different locks used for the same variable (i.e. one shared object and a reader lock and a writer lock)
  S5  Program tries to acquire a lock it already holds
Decomposition
  D1  Same work done on more than one node (when not intended)
  D2  Some work not done

Table 2: Added defects
  - MPI sends never received; the code runs, but resources are never reclaimed
  - Message failure
  - Message reordering (forgetting that messages can be reordered)
  - Bookkeeping errors in domain decomposition (indexing errors)
  - Loop with data dependencies gets parallelized
  - Loop without data dependencies does not get parallelized
  - Pointer problems
  - Thread stack overflow
  - Using any distributed memory machine
  - I/O related defects

The initial analysis of the survey results shows that for 11 out of the 13 defects, more than 60% of respondents believe that the frequency is low or medium; the exceptions are the two defects S1 and S2. An initial conclusion from this observation could be that "shared memory defects are more frequent than other types of defects", which is a hypothesis generated from the folklore analysis.

We were also able to sort the defects based on their severity. The ascending order of severity based on the survey results is: M5, D1, S5, M4, M6, S3, D2, M1, S4, S2, M3, S1, M2, where M5 is the least and M2 the most severe defect. Investigating the validity of these conclusions, as well as drawing further conclusions, relies on the results from our ongoing and upcoming survey studies.

3.1 Empirical Defect Study

We are gathering empirical defect data from our HPC development time studies. In a pilot study, students who were developing a program for a 1D quantum dynamics simulation in C (approximately 150 SLOC) were asked to track the time to fix defects while parallelizing the code in MPI. As seen in Figure 4, in this study "the defects related to I/O activities are the most time consuming to fix". This is another generated hypothesis that is being investigated in our current studies.

Figure 4: Time to fix defects
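As an illustration of what the most common message passing entries in Table 1 look like in practice, the following minimal MPI sketch in Fortran (our illustration, not code from the study) exhibits defect M1: both ranks issue a blocking send before either posts a receive, so for sufficiently large messages neither call can complete.

   program deadlock_example
     use mpi
     implicit none
     integer, parameter :: n = 1000000
     integer :: ierr, rank, nprocs, partner
     integer :: status(MPI_STATUS_SIZE)
     double precision :: sendbuf(n), recvbuf(n)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
     partner = 1 - rank               ! assumes the program is run on 2 ranks

     sendbuf = dble(rank)
     ! Both ranks send first.  Once the message is too large to be buffered
     ! internally, each MPI_Send blocks until the matching receive is
     ! posted, and that receive is never reached: defect M1 (deadlock).
     call MPI_Send(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                   MPI_COMM_WORLD, ierr)
     call MPI_Recv(recvbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                   MPI_COMM_WORLD, status, ierr)

     call MPI_Finalize(ierr)
   end program deadlock_example

A common repair is to replace the paired calls with MPI_Sendrecv, or to have one rank receive before it sends; the related defects M2 and M3 arise when the blocking calls are replaced with MPI_Isend/MPI_Irecv and the buffers are touched before the corresponding MPI_Wait.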

4. CONCLUSION AND FUTURE WORK

In this paper we have described our efforts in collecting elements of the collective knowledge of the HPC community, or "folklore", that relate to issues of development time. We have employed methods traditionally used in the social sciences, such as focus groups and surveys [6]. This work is complementary to our other research in the area, where we are conducting experimental studies to collect development time data and analyze this data by searching for empirical relations between variables such as activity, effort, workflow, performance, and code size.

To run good experiments, we need to develop relevant testable hypotheses. To this end we have tried to understand what the community believes to be true about high end computing and to make explicit the tacit assumptions about a number of issues.

We have been soliciting expert opinion on issues related to HPC programming by collecting elements of folklore through surveys, generating discussion among experts on these elements of the lore to increase the precision of the statements and to measure the degree of consensus, and finally generating testable hypotheses based on the lore that can be evaluated in empirical studies. In some cases we were also able to generate new hypotheses based on the logical relationships between the collected lore.

It is important to note that, in order to keep the survey questions simple and not confusing, we had to use short and pithy statements of the lore, although these usually do not reflect people's full understanding of the lore. Therefore the survey respondents may think we are oversimplifying the statements. This is an issue that needs further consideration.

The results so far indicate that there is a large variation in beliefs among experts. For 10 items out of a total of 12 folklore items, the results show agreement among 46% to 65% of the respondents (the maximum agreement being 64%). Two items of lore had less than 30% agreement; these items clearly need more clarification.

There are several explanations for this variation. First, it is possible that there is not a wealth of common beliefs in the community about high end computing. Second, it is possible that most beliefs are bound to a context, so that each individual brings to the table a variation of a common belief based upon their own specialized experiences. This could either mean that, if we could define the context variables surrounding each item of lore, we might find small common sets of lore, or it could mean that the contexts are so diverse that each individual represents his or her own lore. It is also possible that we have not sufficiently characterized the folklore in our statements, causing confusion in the answers. This could be in the original statements themselves (e.g., not providing sufficient context) or in the negation of the statements (not truly capturing the inverse of the original statement). In any case, it is clear that in some cases we have not captured verifiable folklore and thus need to work on better formulating our hypotheses.

What would be a reasonable percentage of agreement? Can the hypotheses be clearly stated to minimize the variation and offer empirical support for the folklore? These are the issues we are currently working on. We believe continuing to develop the folklore is of value. Evaluation of the testable hypotheses generated from the folklore could lead to a higher degree of consensus and to the creation of a set of empirically supported measures of productivity in the HPC domain. We have also begun to try to understand the nature of defects in high end computing and to apply some of the methods we used for generating folklore about development in general to defects in particular. Results at this writing are preliminary, but we do have some agreement that shared memory defects are more frequent than any other type of defect. This is the kind of hypothesis we can test in case studies.

ACKNOWLEDGEMENTS

This research was supported in part by Department of Energy contract DEFG0204ER25633 to the University of Maryland.

5. REFERENCES

[1] Kontio, J., Lehtola, L., and Bragge, J. "Using the Focus Group Method in Software Engineering: Obtaining Practitioner and User Experiences." Proceedings of the 2004 International Symposium on Empirical Software Engineering (ISESE'04), Redondo Beach, CA, 19-20 Aug. 2004, 271-280.
[2] Shull, F., Basili, V. R., Boehm, B., Brown, A. W., Costa, P., Lindvall, M., Port, D., Rus, I., Tesoriero, R., and Zelkowitz, M. V. "What We Have Learned About Fighting Defects." Proceedings of the 8th International Software Metrics Symposium, Ottawa, Canada, IEEE, June 2002, 249-258.
[3] Pancake, C.M. "Establishing standards for HPC system software and tools." NHSE Review, Nov. 1997.
[4] Squires, S., Tichy, W., and Votta, L. "What Do Programmers of Parallel Machines Need? A Survey." Second Workshop on Productivity and Performance in High-End Computing (P-PHEC), 2005.
[5] Dongarra, J., et al., eds. Sourcebook of Parallel Computing. Morgan Kaufmann, 2003.
[6] Robson, C. Real World Research: A Resource for Social Scientists and Practitioner-Researchers, 2nd ed. Blackwell Publishers, 2002.
[7] Basili, V., McGarry, F., Pajerski, R., and Zelkowitz, M. "Lessons learned from 25 years of process improvement: The rise and fall of the NASA Software Engineering Laboratory." International Conference on Software Engineering (ICSE 2002), Orlando, FL, May 2002, 69-79.

Appendix 1: List of HPC folklore (initial and updated phrasings)

[1] Initial: Use of parallel machines is not just for more CPU power, but also for more total memory or total cache (at a given level).
    Updated: Many people use parallel machines primarily for the large amount of memory available (cache or main).

[2] Initial: It's hard to create a parallel language that provides good performance across multiple platforms.
    Updated: It's hard to create a parallel language that provides good performance across multiple platforms.

[3] Initial: It's easier to get something working in using a shared memory model than message passing.
    Updated: It's easier to get something working using a shared memory model than message passing.

[4] Initial: It's harder to debug shared memory programs due to race conditions involving shared regions.
    Updated: Debugging race conditions in shared memory programs is harder than debugging race conditions in message passing programs.

[5] Initial: Explicit distributed memory programming results in programs that run faster since programmers are forced to think about data distribution (and thus locality) issues.
    Updated: Explicit distributed memory programming results in programs that run faster than shared memory programs since programmers are forced to think about data distribution (and thus locality) issues.

[6] Initial: In master/worker parallelism, the master soon becomes the bottleneck and thus systems with a single master will not scale.
    Updated: In master/worker parallelism, a system with a single master has limited scalability because the master becomes a bottleneck.

[7] Initial: Overlapping computation and communication can result in at most a 2x speedup in a program.
    Updated: In MPI programs, overlapping computation and communication (non-blocking) can result in at most a 2x speedup in a program.

[8] Initial: HPF's data distribution process is also useful for SMP systems since it makes programmers think about locality issues.
    Updated: For large-scale shared memory systems, you can achieve better performance using global arrays with explicit distribution operations than using OpenMP.

[9] Initial: Parallelization is easy, performance is hard. For example, identifying parallel tasks in a computation tends to be a lot easier than getting the data decomposition and load balancing right for efficiency and scalability.
    Updated: Identifying parallelism is hard, but achieving performance is easy.

[10] Initial: It's easy to write slow code on fast machines.
     Updated: It's easy to write slow code on fast machines. Generally, the first parallel implementation of a code is slower than its serial counterpart.

[11] Initial: Experts often start with incorrect programs that capture the core computations and data movements. They get these working at high performance first, and then they make the code functionally correct later.
     Updated: Sometimes, a good approach for developing parallel programs is to program for performance before programming for correctness.

[12] Initial: N/A
     Updated: Given a choice, it's better to write a program with fewer large messages than many small messages.

Case Study of the Falcon Code Project D.E. Post

R.P. Kendall

E.M. Whitney

Los Alamos National Laboratory P.O. Box 1663, MS E526 Los Alamos, NM 87544 1-505-665-7680

Los Alamos National Laboratory P.O. Box 1663, MS B260 Los Alamos, NM 87544 1-505-665-0356

Los Alamos National Laboratory P.O. Box 1663, MS C920 Los Alamos, NM 87544 1-505-667-3595

[email protected]

[email protected]

[email protected]

ABSTRACT

The field of computational science is growing rapidly. Yet there have been few detailed studies of the development processes for high performance computing applications. As part of the High Productivity Computing Systems (HPCS) program we are conducting a series of case studies of representative computational science projects to identify the steps involved in developing such applications, including the life cycle, workflows and tasks, and technical and organizational challenges. We are seeking to identify how software development tools are used and the enhancements that would increase the productivity of code developers. The studies are also designed to develop a set of "lessons learned" that can be transferred to the general computational science community to improve the code development process. We have carried out a detailed study of the Falcon (Fig. 1) code project. That project is located at a large institution under contract to a national sponsor. The project team consisted of about 15 scientists charged with developing a multiphysics simulation that would utilize large-scale supercomputers with thousands of processors. The expected lifetime of the code project is about 30 years. The case study findings reinforced the importance of sound software project management and the challenges associated with verification and validation.

Figure 1. Falcon in flight (Lanner, Falco biarmicus).

Categories and Subject Descriptors
D.2.0 [Software Engineering]. D.2.9 [Management]: Life cycle, Productivity.

General Terms
Management, Verification

Keywords
High Performance Computing, Verification and Validation, Software Project Management, Case Studies

1. INTRODUCTION

Computational science potentially offers an unprecedented ability to predict the behavior of complex systems and thus address many important societal issues. It has sometimes been described as the third leg of the triad of theory, experiment and simulation. However, it is still relatively immature as a problem solving methodology compared to theory and experiment. Other problem solving procedures have matured as a result of learning from their past experiences, and by identifying and applying "lessons learned" from their successes and failures [1].

Through the DARPA High Productivity Computing Systems (HPCS) program, we are doing a set of case studies of large-scale high performance computing application code projects to develop these "lessons learned." Additionally, these case studies will be used to develop an understanding of the scale of these projects and the challenges and tasks that code developers and code users face. This information will be used to help computer and software vendors focus on issues such as the required improvements to software development tools that must be addressed if computational science is to become more productive. As is common in social science case studies [2], we maintain the anonymity of the code project, the institution, and the sponsoring organization. Falcon is a pseudonym. We have found that anonymity is crucial if we are to obtain accurate information from the code project teams.

Section 2 of this paper provides an overview of the characteristics of the Falcon project. Section 3 focuses on the project team, the structure of the project, and the organization. Section 4 discusses the life cycle. Section 5 describes the workflow and the development tasks, who accomplishes them and how long they take; this section includes an assessment of software development tool use and experience. Section 6 summarizes the "lessons learned".

2. Code Characteristics

The goal of the Falcon code project is to develop a predictive capability for a product whose performance involves the trade-off of many strongly coupled physical effects spanning at least ten orders of magnitude in each of the temporal and spatial scales. An accurate predictive capability is needed to reduce the dependence of the sponsoring institution on large, expensive and potentially dangerous empirical tests to certify the product.


The Falcon code project is based on an innovative and potentially very powerful method for solving a set of initial value partial differential equations for the conservation of particles, momentum and energy. These equations are non-linear and have non-linear source terms and coefficients that are calculated with analytic, computational and table look-up schemes. The coupled set of equations is solved with operator splitting, and some degree of time and spatial error correction. A mixture of explicit and implicit techniques is used. The Falcon code was designed to run on massively parallel SMP platforms. The product to be simulated is a multi-material object with a complicated geometry. The equations are solved on an unstructured 2 or 3 dimensional mesh that represents the major features of the object. Generating a reliable mesh from CAD-CAM files and other descriptions of the problem is a highly challenging task, often consuming several months of an expert’s time to set up each new type of problem. The unstructured mesh allows flexibility for incorporating adaptive mesh refinement and adding resolution for capturing fine scale features where necessary.
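The solution strategy just described (operator splitting with a mixture of explicit and implicit updates) can be illustrated with a deliberately tiny sketch of our own; Falcon's coupled, nonlinear equations and unstructured meshes are far more elaborate than this one-dimensional advection-diffusion toy, which only shows the general pattern of splitting a time step into an explicit piece and an implicit piece.

   subroutine split_step(n, u, dt, dx, vel, kappa)
     ! Toy operator splitting for du/dt + vel*du/dx = kappa*d2u/dx2 on a
     ! periodic 1D grid: an explicit upwind advection step followed by an
     ! implicit (backward Euler) diffusion step solved with Jacobi sweeps.
     implicit none
     integer, intent(in) :: n
     double precision, intent(inout) :: u(n)
     double precision, intent(in) :: dt, dx, vel, kappa
     double precision :: unew(n), rhs(n), r
     integer :: i, im, ip, iter

     ! --- explicit advection (first-order upwind, assumes vel >= 0) ---
     do i = 1, n
        im = merge(n, i-1, i == 1)
        unew(i) = u(i) - vel * dt / dx * (u(i) - u(im))
     end do

     ! --- implicit diffusion: (I - r*L) u = unew, solved approximately ---
     r = kappa * dt / dx**2
     rhs = unew
     u = unew                        ! initial guess
     do iter = 1, 50                 ! crude fixed number of Jacobi sweeps
        do i = 1, n
           im = merge(n, i-1, i == 1)
           ip = merge(1, i+1, i == n)
           unew(i) = (rhs(i) + r * (u(im) + u(ip))) / (1.0d0 + 2.0d0 * r)
        end do
        u = unew
     end do
   end subroutine split_step

Each piece can then be treated with the method best suited to it, for example the stiff diffusion-like terms implicitly for stability and the advection-like terms explicitly for simplicity.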


Parallelization for computation with SMP architectures is accomplished with domain decomposition of the mesh using ParMetis[3]. The parallel programming model is MPI[3]. The target platforms are a SMP LINUX cluster with ~1000 nodes and a large vendor-specific SMP cluster with approximately 2000 nodes.
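As a generic illustration of the domain decomposition and MPI programming model mentioned above (and not of Falcon's actual ParMetis-based partitioning of an unstructured mesh), the following sketch exchanges one layer of ghost cells between neighboring ranks of a simple one-dimensional block decomposition.

   subroutine exchange_halo(nlocal, u, left, right, comm)
     ! u(0) and u(nlocal+1) are ghost cells; u(1:nlocal) is the owned block.
     ! left and right are the neighbor ranks (MPI_PROC_NULL at the ends).
     use mpi
     implicit none
     integer, intent(in) :: nlocal, left, right, comm
     double precision, intent(inout) :: u(0:nlocal+1)
     integer :: ierr, status(MPI_STATUS_SIZE)

     ! send rightmost owned value to the right neighbor, receive left ghost
     call MPI_Sendrecv(u(nlocal), 1, MPI_DOUBLE_PRECISION, right, 0, &
                       u(0),      1, MPI_DOUBLE_PRECISION, left,  0, &
                       comm, status, ierr)
     ! send leftmost owned value to the left neighbor, receive right ghost
     call MPI_Sendrecv(u(1),        1, MPI_DOUBLE_PRECISION, left,  1, &
                       u(nlocal+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                       comm, status, ierr)
   end subroutine exchange_halo

In a real unstructured-mesh code the neighbor lists and exchanged data are irregular, but the structure (partition, exchange halos, compute locally) is the same.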


The approach to performance optimization is pragmatic. The team uses several optimization tools (e.g. PIXIE, DCPI, SpeedShop and prof[3]) to identify roadblocks. The team then works on minimizing the impact of the roadblocks. The emphasis during the early stages of development was to maximize performance through reasonable choices for the code architecture, and then to work on optimization after the basic capability of the code had been established. This approach was a response to the pressures to develop the basic capability necessary for demonstrating the actual and potential utility of the code as soon as possible, even if the initial performance efficiency is low. If the required capability had not been demonstrated in a timely fashion, the project would have been canceled. A substantial investment in optimization was thus a luxury to be addressed at a later time.


The code project uses nine different languages and a set of external libraries; the languages include Fortran, C, Perl, Python, Unix shells, SCHEME, and MAKE. Most of the code is an object-oriented instantiation of Fortran. The team has successfully captured many of the advantages of low level object-oriented capability, such as polymorphism and inheritance, while avoiding the pitfalls of many levels of inheritance and excessive use of templating. The major blocks of code are about 410,000 Fortran SLOC, 50,000 SLOC of C, 200,000 SLOC of library code, and about 30,000 SLOC of Perl, Python and Unix scripts. Perl and Python are primarily used for build and test scripts.
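The phrase "object-oriented instantiation of Fortran" usually refers to the idiom sketched below in a minimal hypothetical module (ours, not Falcon code), in which a derived type holds an object's data and the module's procedures act as its methods; inheritance and polymorphism are then emulated by composing types and by generic interfaces.

   module field_mod
     ! A derived type plus module procedures emulate a simple "class" in
     ! Fortran 90: the type holds the data, the procedures are its methods.
     implicit none
     private
     public :: field, field_create, field_scale

     type field
        double precision, pointer :: values(:)
        character(len=32) :: name
     end type field

   contains

     subroutine field_create(f, n, name)
       type(field), intent(out) :: f
       integer, intent(in) :: n
       character(len=*), intent(in) :: name
       allocate(f%values(n))
       f%values = 0.0d0
       f%name = name
     end subroutine field_create

     subroutine field_scale(f, alpha)
       type(field), intent(inout) :: f
       double precision, intent(in) :: alpha
       f%values = alpha * f%values
     end subroutine field_scale

   end module field_mod

Used consistently across a large code, this style gives much of the encapsulation benefit of a class-based language while staying within standard Fortran.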


The Falcon project computational tools are used by a team of approximately 50 engineers to assess the behavior of new and existing product designs. The users are highly knowledgeable and experienced. They do most of the validation of the code by comparing the code results with data from past experiments and a few new experiments. Their level of experience and expertise is sufficiently high that they can not only identify when bugs and model deficiencies are present but can often identify the source of the bug or the needed model improvements. The users participate very constructively and effectively in the development, verification and validation of the code. The code is extensively documented on an internal web site (approximately 400 Mbytes of HTML files). The documentation consists of descriptions of the physics in the code, the algorithms and models in the code, the input and output, and instructions for how to run the code. This has proved highly useful for the users.

3. Code Project and Team Characteristics

As the Falcon project has grown, the formality of the management approach has increased. The project leader is in charge of general coordination of planning, monitoring and directing the work. He has delegated direct oversight of the work to "activity coordinators" who are responsible for the technical sub-tasks, including not only physics and algorithm issues, but also software quality assurance, testing, configuration management and documentation. These tasks are monitored by informal reports at weekly team coordination meetings and are tracked formally with quarterly reports. This structure allows detailed tracking and monitoring of tasks, while encouraging staff development. The whole team has access to the plans and status.

The Falcon project team consists of about twelve physicists and five computer scientists. It is highly trained, cohesive and competent. Team building has been emphasized. Attention has been given to succession planning and recruiting, both by the team and by management. Mentoring of new staff by experienced staff is emphasized. Many of the project team will typically work on a single project like Falcon for 10 to 20 years or more.

The Falcon project had a difficult initial phase (Fig. 3). The sponsor and senior institutional managers specified the original requirements and the project schedule. The initial schedule was too ambitious and the original requirements didn't meet the needs of the users. Neither the sponsor nor institutional management had much experience in the area being addressed by the Falcon project, and they didn't rely heavily on the substantial body of experience that had been developed in earlier projects similar to Falcon. The original requirements and schedule called for the development of an initial capability three years after project start, but the historical record and quantitative estimates [4] indicate that it normally takes approximately eight years to develop this level of initial capability. As a result, the team didn't complete its first milestone on schedule (i.e., it would have had to accomplish eight years of work in three years). A further complication was that the milestones were not aligned with the requirements and needs of the users. The users viewed the milestones as "stunts" that had little to do with addressing the issues they judged important. The team staffing level was reduced from fifteen to about four immediately following the missed milestone (Fig. 2).

While this was painful for the code team, it had the advantage of taking the code out of the management spotlight. The code team was then able to concentrate on addressing the needs and interests of the user community, and was able to develop a strong connection to the users. This led to initial acceptance and later to strong advocacy of the code project by the user community. With this advocacy, senior management then began to support the code project. New staff were recruited and the project is now a key part of the institution's and the sponsoring agency's program.


As a result of the customer advocacy and an analysis by middle management, the project requirements were modified to closely match the requirements of the users, and a more realistic schedule was developed.

We judged the level of maturity of the code project to be somewhere between CMM Level 2 and Level 3 in terms of the processes and practices the Falcon code group is following [5]. The team has not had a formal CMM assessment, but has had several internal and external audits.

Figure 2. FALCON code project staffing and release schedule: planned and actual staffing over calendar time (years), with milestones and major product releases, from the requirements set by the sponsor and institutional management through initial development, serious testing by customers, and product improvement and development.

4. Falcon Life Cycle

The FALCON code project lifetime is expected to be on the order of 30 years (Figure 3). This is based on the experience with similar projects at this institution. Indeed, some projects like Falcon have had lifetimes of up to 45 years. The first part of the life cycle was dedicated to development of the preliminary, initial capability to solve the conservation equations without accurate source terms or coefficients. This took about five years. Now that capability is being tested. Further development will continue until a production capability has been achieved with more accurate source terms and coefficients. The production phase involves heavy use and testing by the user community. During the production phase, the code team will support the use of the code, maintain the code, port it to new platforms, and develop and add new capability as required by the product engineers. For similar projects at this institution, the ultimate life span of the code is determined by user demand and the difficulty of successively porting the code to new platforms. When a successor can replace the older code, and the product engineers have made the transition from the older code to the successor code, support for the older code is stopped and it is "retired". The development of new capability then shifts to the successor code. The Falcon project is in the process of displacing an older project with less capability. The lifetime of these projects is much longer than the time between new platforms. Thus porting to a new platform becomes much more important than extensive performance optimization for a particular platform.

Figure 3. Falcon Project Life Cycle: initial development, serious testing by customers, the production, product development and user support phase (continued product testing (V&V) and application by users), and retirement (user support, minimal development, minimal porting), over roughly 35 calendar years. The small tic marks denote 6-month release dates.

Like many computational simulations, the FALCON code project has a strong element of research and development to ensure that new algorithms are developed and successfully implemented. The users also have needs that must be met if the code project is to be successful. The adequacy of the models in the code can only be determined as part of an intensive validation program. It was difficult to draft a detailed list of requirements before the project was begun or to specify a detailed schedule.

In the case of the Falcon project, senior institutional management and the sponsor specified a set of requirements that would allow them to "sell" the program to the funding sources. This is similar to experiences in the Information Technology (IT) industry, where a marketing department identifies market opportunities and then signs up customers by promising a level of code capability that outbids the competition. Then the software engineers must deliver the promised capability. This contributes to over-promising the capability that can be delivered within the defined schedule and resource level [6, 7].

In the case of the Falcon project, the detailed schedule initially specified by the sponsor and senior institutional management was not based on prior experience with similar codes or on quantitative estimates. Instead the schedule was based on when the capability was desired. In addition, the sponsor and institutional management chose a set of goals that appealed to the funding agency but were not the highest priority for the ultimate customers, the product engineers. The customers needed and wanted a different set of capabilities. They thus had little interest in the initial code project. Once it became clear that the schedule was almost a factor of three too optimistic and that the initial goals were not appropriate, the project goals were changed to match the needs of the customers and a more realistic schedule was developed.

5. Workflows and Tasks

The institution that managed the Falcon project has had decades of experience developing and using similar (but less capable) simulations. However, that experience was in serial development (i.e., develop one capability and test it, then develop a second capability and add it to the first, etc.). Serial code development would have taken 20 years or more to achieve the desired capability. The Falcon code project and others begun at the same time planned to develop the major components in parallel to speed up the overall development process (Figure 4). Component development in parallel placed new and much greater demands on project management, since the code teams were four to five times larger than in the past. It also called for better risk management techniques. If many components are needed for the full capability, one failure can double the overall development time. This risk was realized for the Falcon project: a contract support group did not deliver a key component, and the Falcon team has had to develop it. This has subtracted from the resources available for other tasks and has delayed realization of the full project capability. The institution has had to learn how to organize and manage this new kind of code development process.

The tasks that the Falcon code developers and their users carry out can be grouped into seven categories (Figure 5, Table 1).

Figure 4. Parallel code development plan.

Figure 5. Code development and application task categories.

Table 1. Seven Categories of Tasks for Computational Science
1. Formulate questions and issues: identify high level goals, customers and the general approach.
2. Develop computational and project approach: define detailed goals and requirements, seek input from customers, select numerical algorithms and the programming model, design the project, recruit the team, get the resources, identify the expected computing environment.
3. Develop code: write and debug code, including code modules, input and output, code controllers, etc.
4. Perform V&V: define verification tests and methodology, utilize regression test suites, define unit tests and execute them, define useful validation experiments, design validation experiments, get validation results and compare them with code results, etc.
5. Make production runs: set up problems, schedule runs, execute runs, store results.
6. Analyze computational results: begin analysis during the run to optimize the run, store and visualize/analyze results, document results, develop hypotheses, test hypotheses with further runs.
7. Make decisions: make decisions based on results, document and justify decisions, develop a plan to reduce uncertainties and resolve open questions, identify further questions and issues.

The tasks are not carried out linearly as in a "waterfall" model. They are nested and iterative. For instance, a candidate solver might be selected during the design phase. Then it might be discovered during the testing phase or during production runs that it does not provide the needed capability. Then the team has to go back, identify a new candidate solver, develop it, test it, and so on, until a satisfactory solver has been found. Or one might discover in the V&V phase that the models miss an important effect that has to be included, and so on. Nonetheless, the use of these categories has been useful for ensuring that all of the tasks are identified for the hardware and software vendors (Table 2). Improved tools to accomplish these tasks would improve the ability of code teams like the Falcon code team to develop scientific codes more quickly, with fewer defects and better performance.

The Falcon project also focused on verification and validation. The team found that verification and validation have been very challenging. None of the existing techniques for verification have proved to be satisfactory for a complex, multi-physics code. Validation has been similarly challenging. There is little data available, and usually it includes many effects. Identification of the role of specific individual effects has proved to be difficult.

A key observation by the Falcon team is that debugging massively parallel programs is hard. The worst debugging situations included: bugs that are not consistently reproducible; very subtle errors that build up from the least significant digit over many calculation cycles; bugs that vanish when the debugger is used; bugs that are only reproducible after a very long run (restarting near the bug makes it go away); bugs that are not reproducible in a debug version (so one must search in an optimized version where variables have been optimized away and cannot be viewed); bugs that only show up in a huge problem (gigabytes of state data); and bugs that occur when the mesh data migrates between processors/nodes (during remaps).

6. "Lessons Learned"

The "lessons learned" include knowledge of the characteristics of a working high performance computing application code project that will allow hardware and software vendors to develop computers and tools that can improve the ability to develop and utilize the applications, and the steps and procedures that the institution and code team can follow to improve their products and time to solution. Some specific Falcon-related opportunities for improvement are listed in Table 2. It is also worthwhile to list some explicit "lessons learned" that emerged from the experiences of the Falcon team and pertain to team and institutional dynamics. Many of the lessons are emphasized in the standard software project management literature [8, 9]. The case study identified the importance of a competent, well-led, and cohesive code development team. An appropriate skill mix (computer scientists, scientists with domain knowledge, computational mathematicians, code librarians and documentation staff, etc.) was crucial. It is difficult to program the current generation of massively parallel computers. The team members must develop trust in each other and their leadership.

Detailed requirements are difficult to develop for scientific codes, primarily because they often involve research, with its attendant uncertainties. This reality sets scientific code development apart from more conventional deterministic software development projects (where requirements rigor is the norm). The Falcon code effort requires research and development. However, it is crucial to develop some requirements, because it is important that the schedule and resources be consistent with the requirements. A development schedule that was a factor of three too optimistic nearly killed the Falcon project. Thus, in spite of the difficulties and challenges, it is important to make estimates for the expected schedule. While sponsors and institutional management should determine the programmatic direction of a project, the people that set most of the requirements must have considerable domain knowledge, e.g. the users and developers. Inadequate domain knowledge on the part of the sponsors and institutional management contributed to the specification of an overly ambitious schedule and requirements that didn't meet the needs of the customers.

Table 2. Opportunities for improved development tools and development environments.
• Problem set-up tools (mesh generation, etc.)
• Data storage and retrieval, especially over distributed networks
• Smoother upgrades for operating systems and tools
• Better and easier to use compilers and parallel programming models for massively parallel computers (now Fortran with MPI)
• Linkers and loaders with the ability to link many languages
• Better parallel debuggers
• Performance analysis tools (hardware and software)
• Better run schedulers
• Visualization (office, small workroom, theater)
• Data analysis tools (V&V and analysis of runs)
• Testing tools (coverage analysis, software quality, ...)
• Production run configuration and problem logs

Customer focus is important. The users determine whether the code is successful or not. If they can use it to solve their problems, they will be strong supporters and an important ingredient in the success of the code. If the code cannot help them, then they will either ignore the code or be vocal detractors.

A stable and mature development environment is important. Stable development tools and platforms allow the code development team to concentrate on the code development tasks.

These "lessons learned" can be summarized in the statement that attention to sound software project management is important.

The Falcon project also found verification and validation difficult because methods for verification were inadequate, and because there was a paucity of useful validation data.

7. Summary and Conclusions

In order to characterize a large-scale computational science project, we conducted a detailed case study of the Falcon code project. We found that it was essential to maintain complete anonymity to ensure that the team would allow us access to a full and accurate set of information.

We drew three major specific conclusions, three general conclusions, and many minor conclusions from this case study. The first major specific point is that the lifetime of this project is expected to be around 30 years, much longer than smaller computational science projects such as are found in academia, and much longer than most projects in the Information Technology industry. This ensures that the project team is very conservative in its approach and emphasizes minimizing risks. The team has avoided using new and untried computer languages, compilers, code development methodologies, libraries, etc., especially those targeted to a single platform. Performance optimization has been much less important than being able to port the code to successive generations of platforms. The demonstration that many projects have life cycles that span many machine generations has had an impact on the DARPA HPCS vendors: the vendors now have a better understanding of the need for stability and incremental steps in the software development infrastructure and tools. Second, the specification of the workflow steps has been useful for identifying the areas where hardware and software vendors can improve productivity by eliminating bottlenecks and improving programming efficiency. Specifying the development steps has helped them focus on the most productive areas for improvement. Third, as noted in the prior section, this study demonstrated that it is impossible to set down specific detailed requirements for a scientific code project.

The case study illustrated the importance of sound project management, not just by the team but also by the institution and the sponsor. Overly ambitious schedules set by the sponsor and institution almost destroyed the project and nearly deprived the program of a very promising and important new computational tool. Support by knowledgeable senior management is essential for success. Verification and validation are an essential element of the development and application of a computational science project. Yet the intellectual basis for verification and validation is insufficient at present for projects like Falcon to verify and validate their code at the level necessary for success.

8. ACKNOWLEDGMENTS

The authors are grateful to the members of the Falcon code team for their participation, their patience and forbearance, and for allowing the computational science community to learn and benefit from their experiences. This research was supported by USDOE contract W-7405-ENG-36.

9. REFERENCES

[1] Petroski, H. Design Paradigms: Case Histories of Error and Judgment in Engineering. Cambridge University Press, New York, 1994.
[2] Yin, R.K. Case Study Research: Design and Methods, 3rd ed. Applied Social Research Methods Series, Vol. 5, L. Bickman and D.J. Rog, eds. Sage Publications, Thousand Oaks, 2003.
[3] Dongarra, J., et al. Sourcebook of Parallel Computing. Morgan Kaufmann Publishers, Amsterdam, 2003.
[4] Capers-Jones, T. Estimating Software Costs. McGraw-Hill, New York, 1998.
[5] Paulk, M. The Capability Maturity Model. Addison-Wesley, New York, 1994.
[6] Glass, R.L. Software Runaways: Monumental Software Disasters. Prentice Hall PTR, New York, 1998.
[7] Ewusi-Mensah, K. Software Development Failures: Anatomy of Abandoned Projects. MIT Press, Cambridge, Massachusetts, 2003.
[8] DeMarco, T. The Deadline. Dorset House Publishing, New York, 1997.
[9] Thomsett, R. Radical Project Management. Prentice Hall, Upper Saddle River, NJ, 2002.

Can Software Engineering Solve the HPCS Problem? Eugene Loh

Michael L. Van De Vanter

Lawrence G. Votta

Sun Microsystems Inc. 16 Network Circle, UMPK16-303 Menlo Park, CA 94025 USA +1 831 655-2883

Sun Microsystems Inc. 16 Network Circle, UMPK16-304 Menlo Park, CA 94025 USA +1 650 786-8864

Sun Microsystems Inc. 18 Network Circle, UMPK18-216 Menlo Park, CA 94025 USA +1 650 786-7514

[email protected]

[email protected]

[email protected]

ABSTRACT

The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity improvement. Software Engineering has addressed this goal in other domains and identified many important principles that, when aligned with hardware and computer science technologies, do make dramatic improvements in productivity. Do these principles work for the HPC domain?

This case study collects data on the potential benefits of perfective maintenance in which human productivity (programmability, readability, verifiability, maintainability) is paramount. An HPC professional rewrote four FORTRAN 77/MPI benchmarks in Fortran 90, removing optimizations (many of which improve distributed memory performance) and emphasizing clarity.

The code shrank by 5-10x and is significantly easier to read and relate to specifications. Run time performance slowed by about 2x. More studies are needed to confirm that the resulting code is easy to maintain and that the lost performance can be recovered with compiler optimization technologies, run time management techniques and scalable shared memory hardware.

Categories and Subject Descriptors
D.2.0 [Software Engineering]. D.1.3 [Programming Techniques]: Concurrent Programming - parallel programming

Keywords
High Performance Computing, Software Productivity, Software Maintenance, HPCS

1. INTRODUCTION

The High Performance Computing (HPC) community faces a crisis on two fronts: concerns about correctness demand increased verification [16], and programming itself is becoming more complex due to growing problem sizes and the need for massive parallelism. The HPC productivity bottleneck is moving from machines to people, much as it has in other domains, but Software Engineering has had little impact, in no small part by failing to address traditional HPC priorities. Successful application of Software Engineering principles and practices must be grounded in the particular practices and priorities of HPC, and must be substantiated with cost-benefit data.

The case study presented here starts collecting that data. Based on well-known HPC benchmark codes, the study establishes a baseline evaluation of the benefits of a perfective maintenance exercise in which manual optimizations are discarded and modern programming language abstractions are exploited for readability.

The outcome is dramatic. Code shrank, in most cases by a factor of 10, and the relationship between code and specification, previously inaccessible, became evident. The former is known to reduce software cost, and the latter is an essential step toward verification. Run time performance typically suffered by a factor of 2, a penalty that may be neutralized by automatic optimization and parallelization at both compile and run time.

These results emphasize the need for empirical studies and for a data-driven evaluation of solutions in the context of HPC. Related benefits such as requirements tracing and portability must also be quantified and brought into the cost-benefit equation. The study also casts light on the HPC community's pursuit of performance at the expense of human effort. A credible cost-benefit analysis starts with "modern" implementations and explicitly includes human productivity [6].

This research is part of the High Productivity Computing Systems (HPCS) program, funded by the Defense Advanced Research Projects Agency (DARPA) to "create new generations of high end programming environments, software tools, architectures, and hardware components" from the perspective of overall productivity rather than unrealistic benchmarks [5].

Section 2 describes the study design. Section 3 presents the results, followed by further analysis in Section 4. Section 5 discusses the findings in the broader context of the program, and related work appears in Section 6.

2. THE EXPERIMENT

This is an exercise in perfective maintenance: changing programs in ways that do not affect essential functionality but which improve maintainability [12][17].

2.1 Hypothesis

3. DATA

This empirical study has elements of both a quasi-experiment [8] and a case study [19] – the quasi-experiment due to the repeated test (treatment of reordering maintenance and performance goals) of different benchmarks and a case study due to the single professional subject. The use of hypotheses allow us to test and explore the amount of code size reduction and the amount of performance expense. For completeness there are two hypotheses.

Table 1 summarizes quantitative results of the experiment, expressed both as reduction in lines of code (LOC) and run time performance penalty; measurement details appear in Appendix A. Each benchmark is discussed separately below, followed by comments on qualitative results. The experiments appear in the order they were conducted. Some learning about Fortran 90 and computational fluid dynamics occurred during the experiments.

H1: Perfective maintenance will not reduce the code size.

Table 1: Summary of Results

H2: Perfective maintenance will not change execution performance.

Benchmark

LOC LOC Size Slowdown before after reduction NAS BT 3687 484 8x 3x NAS MG 1701 150 11x 2x?/6x? NAS CG 839 81 4x/10x ≥2x ASCI sPPM 13606 1358 10x 2x Approximately 2x of the size reduction was due to removal of explicit parallelism, a result consistent with other studies of the size contribution of MPI [4][9].

2.2 Experimental Setting The subject is a computational physicist with fifteen years of experience in the computing industry, conventionally trained to prioritize performance over maintainability. The experiment was performed using a modern software engineering development environment and workstation for development and a target supercomputer for benchmark execution.

2.3 Experimental Design

The subject revised four parallel benchmarks relevant to the HPC community (external validity): three of the NAS Parallel Benchmarks (NPB) [13][15] and the ASCI sPPM benchmark [10], all written in FORTRAN 77 [2]. This report refers to both the NPB benchmark specifications, which date back to the NPB1 release, and the NPB serial and parallel (MPI) implementations, which are part of the NPB2.3 release [13].

The goal of the perfective maintenance task was to prioritize maintainability over performance:

•  remove specialized code for distributed memory;

•  remove source-level optimizations;

•  use abstractions provided by Fortran 90 [7], a modern superset of FORTRAN 77; and

•  remove code known to be not portable.

The dependent variables are code size and execution performance on the target system. The construct validity threat can be summarized as: do the changes and removals we make to the code actually accomplish the intended priority shift? Clearly, the treatment accomplishes much of the priority shift; the open issue is how much more could be accomplished, and the results in the next section allay much of this concern. Internal validity concerns undetected influences on the dependent variables, code size and execution performance; inspection of the code and execution of the benchmark self tests limit the possibility and magnitude of any such influences. Finally, the external validity threats are effects that limit our ability to generalize the experiment. Because we are studying benchmarks, that is, kernels of computation, expecting the ratio of size reduction to execution performance to carry over directly to full applications is not relevant and thus not a threat; we would like to generalize to the kernels of major computational codes, and how the benchmarks are created, calibrated and maintained addresses this validity threat.

3.1 NAS BT Benchmark

BT is a "synthetic application," representative of applications involving computational fluid dynamics. The subject had general familiarity with the NAS Parallel Benchmarks, but none with BT. The experiment took approximately two weeks of full-time effort.

Table 2 summarizes code size changes between the reference implementation and the revised code. These summaries are approximate, given the difficulty of precise accounting.

Table 2: NAS BT Lines of Code

  Description                NPB (MPI)   NPB (serial)   Revised
  Global declarations              269            246        19
  Main program                     105             74        54
  Initialize                       201            148        19
  Time step & solve               1165            596        62
  Exact solution                    13             13        35
  Compute dU/dt                    625            596        82
  Compute left-hand sides          717            716       102
  Self test                        247            228       111
  Inter-process comm.              345              0         0
  Total                           3687           2617       484

Reductions varied considerably, for example:



•  Global declarations shrank for three reasons: removal of confusing intermediate constants (manual optimizations); adoption of Fortran 90 array syntax and module support; and removal of variables that did not need to be global.

•  The main program and self test contain code for which little improvement was possible.

•  Initialize was reduced by relocating some code, removing unnecessary code (previously hard to detect because of code complexity), and adopting Fortran 90 array syntax.

•  Time step & solve showed the greatest reduction: removing hand optimizations and exploiting Fortran 90 array syntax and MATMUL (see the sketch after this list). As the code clarified, comments (not counted in the LOC results) also shrank.

•  Exact solution acquired related bits of code from elsewhere.

•  Compute dU/dt shrank, not only by exploiting Fortran 90 features (bringing the code into much closer alignment with the specification), but also by detection and removal of approximately 300 lines of redundant code.
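The following fragment is a schematic illustration only; the array names and the loop are invented for this sketch and are not taken from the BT source. It shows the kind of rewrite involved: an explicitly coded matrix product in the FORTRAN 77 style collapses to a single MATMUL call with Fortran 90 array syntax.

      program matmul_sketch
        implicit none
        integer, parameter :: n = 4
        double precision :: a(n,n), b(n,n), c77(n,n), c90(n,n), s
        integer :: i, j, k

        call random_number(a)
        call random_number(b)

        ! FORTRAN 77 style: explicit triple loop with a scalar accumulator
        do j = 1, n
          do i = 1, n
            s = 0.0d0
            do k = 1, n
              s = s + a(i,k) * b(k,j)
            end do
            c77(i,j) = s
          end do
        end do

        ! Fortran 90 style: array syntax and the MATMUL intrinsic
        c90 = matmul(a, b)

        print *, 'max difference:', maxval(abs(c77 - c90))
      end program matmul_sketch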

The revised code exhibited a 2.7x slowdown on the class W data set (one of several specified as part of the NAS Parallel Benchmarks) when compared to the original implementation.

The revised code autoparallelizes well. More details on scalability appear in the full report [11].

One of the most striking results was a dramatic increase in correspondence between specification and code (section 3.6).

3.2 NAS MG Benchmark

MG is a "kernel" benchmark, intended to characterize multigrid computations. It is much shorter than BT: the specification is 2.5 pages of PDF, in contrast to 30 pages for BT. The subject, who had prior implementation experience with the benchmark, set out to simplify the NPB reference code, but eventually rewrote it from scratch in a few hours. The experiment took about one working day over a period of a week. Table 3 summarizes changes in code size between the reference implementation and the revised code.

Table 3: NAS MG Lines of Code

  Description            NPB (MPI)   NPB (serial)   Revised
  Global declarations           80             89         0
  Main program                 201            160        35
  Initialize v                 202            213        35
  Operators                    281            277        80
  Communications               665            144         0
  Other                        272            124         0
  Total                       1701           1007       150

Code reductions were similar to those for the BT benchmark, and the result was similarly more concise and readable.

The revised code performs poorly, largely due to a compiler problem (believed fixable) with stencil performance for array syntax. It appears that the code autoparallelizes reasonably well, at least to conventional scales. Details appear in the full report [11].

3.3 NAS CG Benchmark

CG is another "kernel" benchmark, approximately as complex as MG and much simpler than BT. It is intended to characterize conjugate gradient computations. The subject had prior implementation experience with the benchmark. The experiment took place over two days, with most of the effort going to simplifying the NPB code and working with sparse matrices. Table 4 summarizes changes in code size between two of the reference implementations and the revised code.

Table 4: NAS CG Lines of Code

  Description           NPB (MPI)   NPB (serial)   Revised
  Core functionality          839            309        81
  Data fabrication            197            197       158
  Total                      1036            506       239

The relative compactness of the revised code derives from much the same phenomena reported for the BT and MG benchmarks.

A significant portion of the perfective maintenance effort (and of the resulting code) for CG involved reverse engineering an arcane data fabrication algorithm used in the original code. The algorithm, which is not fully determined in the specification, must be reproduced exactly in order to satisfy the correctness checks. Using another reasonable algorithm for fabricating matrix data would permit an overall code reduction closer to 10x.

3.4 ASCI sPPM Benchmark

sPPM is a computational fluid dynamics (CFD) benchmark that uses the "simplified" Piecewise Parabolic Method (sPPM). Its performance targets were 1 teraflops and beyond for procurements within the Accelerated Strategic Computing Initiative (ASCI) [1]. Although not a "real application," sPPM is the most complex of the four codes studied. The shock-wave physics is handled with fairly involved numerical algorithms. Performance optimizations include cache blocking, vectorization, multi-thread and multi-process parallelism, overlapping communication and computation, and dynamic scheduling of work.

The subject was initially unfamiliar with the sPPM code and with CFD. The experiment was part-time over less than one month. Table 5 summarizes changes in code size between the reference implementation and the revised code. These line counts include makefiles, input decks, and so on, but these are relatively small.

Table 5: sPPM Lines of Code

  Description                        ASCI   Revised
  Main program                       2486       484
  Shock dynamics                     2742       616
  Boundaries                         3431        22
  Time stepping                      2312        75
  Global declarations                 238        25
  Some C I/O functions                366        54
  Timers                               49        17
  Multi-platform threads support     1500         0
  Makefile                            407        21
  Run script                           57        23
  Input deck                            7        10
  Reference output                     11        11
  Total                             13606      1358

Some reasons for code reduction have already been observed for the other benchmarks. Additionally:



•  sPPM strives for functional and performance portability, for example targeting superscalar and vector processors.

•  sPPM suffers complexity from trying to maintain small memory access strides for best performance.

•  A great deal of the I/O is unnecessarily complicated.

•  There was a great deal of support for legacy threads, which was made obsolete by the adoption of OpenMP [14].

•  Sophisticated schemes overlapped communication and computation, something that would, in an idealized world, be left to the platform infrastructure.

Remarkably, given the extensive simplification of the code, the revised version ran only 2.1x slower on a single CPU for a test problem than the reference implementation. The cause of the slowdown requires investigation, but is likely due, at least in part, to the large memory strides in the revised version.

3.5 Manual Optimizations

The subject reported severe difficulty understanding some of the manually optimized code:

•  Routines were replicated with special-case optimizations.

•  Many intermediate constants were defined globally in an attempt to identify common subexpressions.

•  Loops were manually unrolled.

•  Functions were expanded in-line, for example for derivatives.

Some of these optimizations could actually be counterproductive in today's computational environments. All of them confounded the relationship between specification and code.

The challenge for computing systems is to achieve automatically the impressive scalability obtained by the extensive manual techniques used in this reference implementation. There are reasons to be hopeful, including the large problem sizes needed to study multi-scale physics in 3D CFD and the higher-level expression that results when the source code is improved, but demonstration of such automatic scalability remains the subject of future work.

3.6 Expressiveness and Fortran 90

Language features new to Fortran 90 enable code that is both shorter and expressed more directly in terms of the problem. For example, the timed portion of the NAS MG reference implementation appears in Figure 1:

      call resid(u,v,r,n1,n2,n3,a,k)
      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
      old2 = rnm2
      oldu = rnmu
      do it=1,nit
         call mg3P(u,v,r,a,c,n1,n2,n3,k)
         call resid(u,v,r,n1,n2,n3,a,k)
      enddo
      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))

Figure 1: NAS MG timed portion (original FORTRAN 77)

This code excerpt implements the portion of the specification appearing in Figure 2, and the revised version appears in Figure 3.

   Each of the four iterations consists of the following two steps,

      r = v - A u      (evaluate residual)
      u = u + Mk r     (apply correction)

   ... Start the clock before evaluating the residual for the first time,
   ... Stop the clock after evaluating the norm of the final residual.

Figure 2: NAS MG timed portion (spec.)

      do iter = 1, niter
         r = v - A(u)      ! evaluate residual
         u = u + M(r)      ! apply correction
      enddo
      r = v - A(u)         ! evaluate residual
      L2norm = sqrt(sum(r*r)/size(r))

Figure 3: NAS MG timed portion (revised Fortran 90)

The evident correspondence between the two promises greater success and lower cost for code verification and maintenance.

4. ANALYSIS

The benefit (average reduction of LOC) and the cost (the average increase in execution time) are calculated simply as the ratio of the pre-treatment quantity (LOC, execution time) to the post-treatment quantity. Thus for the NAS BT benchmark we have 3687/484 = 7.62, or roughly 8x. For execution time, we use the convention that the factor is always greater than 1, with the description "slower" or "faster" indicating more or less execution time.

5. DISCUSSION

The central result of this study, namely that HPC code can be both shrunk tenfold and dramatically clarified at the expense of a twofold performance penalty, suggests rethinking the balance between man and machine. A number of technologies offer the prospect of recovering some of the cost of the performance loss.

There are many benefits. The lifetime costs of software are known to correlate highly, and linearly, with code size. Also, the kind of code produced in this case study is much more likely to be portable, a significant factor in the lifetime cost of HPC software. Finally, the increasing importance of verification in HPC software will be well served by code that is not only smaller, but also dramatically easier to understand in relation to the problem specifications. Experiments are needed to assess these effects more precisely.


The study validates some of the design goals for Fortran 90: high quality results were possible, levels of effort were moderate, and maintenance could often be done gradually with frequent regression testing. Other modern languages for HPC might offer similar, or better, benefits, but the costs of learning and conversion must be part of the analysis.

6. RELATED WORK

Other studies have shown significant code reduction from distributed to shared memory implementations: approximately 2x for the NAS MG benchmark [4] and an average of 1.77x from MPI to serial implementations over eight benchmarks [9].

These results are broadly consistent with Weinberg's studies on the cost of multiple goals: "Optimization goals tend to be highly conflicting with other goals" [18]. Goals are also constraints; this case study can be seen as removing constraints on software that derive from the "accidental" rather than the "essential" nature of the task, in the terminology of Fred Brooks (following Aristotle) [3]. The "accidental" in this case includes the limitations of FORTRAN 77, the demand for utmost performance, and confounding platform architectures.

7. CONCLUSIONS

At the cost of a relatively modest runtime performance penalty, HPC software written in FORTRAN 77 can be improved through perfective maintenance, with a dramatic reduction in human cost across the entire software life cycle and in the growing cost of verification.

This cost-benefit equation must be explored further, not only with investigation into performance improvements, but also by expanding the scope of the data across more HPC professionals and more kinds of HPC code. This empirical data is needed to support the kind of credible analysis the HPC community will expect in order to evaluate solutions to the HPCS problem.

8. ACKNOWLEDGMENTS

We would like to thank our HPCS colleagues at Sun Microsystems and elsewhere in the HPC community for their helpful discussions and comments.

This material is based upon work supported by DARPA under Contract No. NBCH3039002.

9. REFERENCES

[1] The Accelerated Strategic Computing Initiative (ASCI), now known as Advanced Simulation and Computing (ASC).
[2] American National Standards Institute. American National Standard Programming Language FORTRAN. ANSI X3.9-1978, New York, NY, 1978.
[3] Brooks, F. P. Jr. No silver bullet: essence and accidents of software engineering. Computer 20,4 (April 1987) 10-19.
[4] Chamberlain, B. L., Deitz, S. J., and Snyder, L. A comparative study of the NAS MG benchmark across parallel languages and architectures. Proceedings of the ACM Conference on Supercomputing (2000).
[5] Defense Advanced Research Projects Agency (DARPA) Information Processing Technology Office, High Productivity Computing Systems (HPCS) Program.
[6] Gustafson, J. Purpose-Based Benchmarks. International Journal of High Performance Computing Applications: Special Issue on HPC Productivity 18,4 (November 2004).
[7] ISO/IEC International Standard ISO/IEC 1539-1:1997(E) Information Technology - Programming Languages - Fortran. Geneva, Switzerland, 1997.
[8] Judd, C. M., Smith, E. R., and Kidder, L. H. Research Methods in Social Relations. Holt, Rinehart and Winston, Inc., sixth edition, 1991.
[9] Kepner, J. HPC Productivity Model Synthesis. International Journal of High Performance Computing Applications: Special Issue on HPC Productivity 18,4 (November 2004).
[10] LLNL. The ASCI sPPM Benchmark Code. Lawrence Livermore National Laboratory.
[11] Loh, E., Van De Vanter, M. L., and Votta, L. G. Can Software Engineering Solve the HPCS Problem? Sun Microsystems Laboratories Technical Report, 2005 (in preparation).
[12] Mockus, A., and Votta, L. G. Identifying Reasons for Software Changes Using Historical Databases. Proceedings of the International Conference on Software Maintenance (ICSM 2000), San Jose, California (October 2000) 120-130.
[13] NASA. The NAS Parallel Benchmarks (NPB). NASA Advanced Supercomputing Division.
[14] OpenMP.
[15] Saphir, W. C., et al. New implementations and results for the NAS Parallel Benchmarks 2. 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN (March 14-17, 1997).
[16] Post, D. E., and Votta, L. G. Computational Science Requires a New Paradigm. Physics Today 58(1): 35-41.
[17] Swanson, E. B. The Dimensions of Maintenance. Proceedings of the 2nd International Conference on Software Engineering, San Francisco, California (1976) 492-497.
[18] Weinberg, G. M., and Schulman, E. L. Goals and Performance in Computer Programming. Human Factors 16,1 (1974) 70-77.
[19] Yin, R. K. Case Study Research: Design and Methods. Sage Publications, second edition, 1994.

APPENDIX A

Comments and blank lines are excluded from line counts (LOC). Performance was measured on a Sun Fire 6800 server with 48 Gbytes of memory and 24 UltraSPARC III+ CPUs at 900 MHz, each with 8 Mbytes of L2 cache. The Fortran 95 compiler from the Sun ONE Studio 8 Compiler Suite was used with typical switches:

•  -fast (common performance-oriented switches)

•  -xarch=v9b (UltraSPARC-III settings, for 64-bit binaries)

•  -parallel -reduction (in cases where autoparallelization was used)

P3I The Delaware Programmability, Productivity and Proficiency Inquiry Joseph B. Manzano Yuan Zhang Guang R. Gao Department of Electrical & Computer Engineering, University of Delaware Newark, Delaware 19716, U.S.A {jmanzano,zhangy,ggao}@capsl.udel.edu

ABSTRACT New advancements in high-productivity computing systems have exposed the weaknesses of existing parallel programming models and languages. To address such weaknesses, a number of researchers have proposed new parallel programming models and powerful programming language features that can meet the challenges of the emerging HPC systems. However, the success or failure of a new programming model and its accompanying language features needs to be evaluated in the context of its productivity impact. In this paper, we report a productivity study that was conducted at the University of Delaware in the context of the IBM PERCS project, funded via the DARPA HPCS enterprise. In particular, our study is centered on the productivity impact of a new and key programming construct called Atomic Sections, which was jointly proposed by our group and our colleagues at IBM.

Figure 1: A New Approach to Language Design

1. INTRODUCTION

New advancements in High Productivity Computing (HPC) systems have exposed the weaknesses of existing parallel programming models and languages. To address these issues, a number of researchers have proposed new programming models and powerful programming language features that can meet the challenges of the emerging HPC systems. However, the success or failure of such programming models and their accompanying language features needs to be evaluated in the context of their productivity impact. A small overview of the new approach to language design is depicted in Figure 1. A key source of complexity (and a productivity deterrent) in parallel programming arises from fine-grained synchronization, which appears in the form of lock/unlock operations or critical sections. One effect of these constructs on productivity is to put excessive resource management on the programmer, so the probability of programmer errors increases. Another side effect is the hidden overhead of the underlying synchronization actions and their accompanying data consistency operations; this overhead reduces the amount of scalable parallelism that could otherwise be achieved. Atomic Sections have been proposed as a parallel programming construct that can simplify the use of fine-grained synchronization while delivering scalable parallelism by using a weak memory consistency model. The construct has been implemented by the authors of this paper under OpenMP XN, an extension of OpenMP developed in the context of the PERCS project. Based on this prototype implementation, a productivity study (the Delaware Programmability, Productivity and Proficiency Inquiry, or P3I) was designed and carried out. The main focus of P3I is to test the productivity of Atomic Sections overall and in each phase of application development (design, parallelization and debugging). P3I, in its first implementation, was conceived not to measure performance but to measure programmability and debug-ability on a set of short programming exercises. Another feature that makes this study unique is its weighting factor, a function that adjusts the contribution (time) of each participant according to his or her expertise in the HPC domain. An overview of the P3I weights and weighting scheme is given in the subsequent sections. The purpose of this paper is to provide an outline of P3I, its framework and its methodology. Section 2 provides a brief overview of the new construct, Atomic Sections. Sections 3 and 4 present a high-level overview of P3I and a deeper view of its internals, data collection methods, and parts, respectively. Section 5 explains the weighting procedures and their overall purpose. Sections 6 and 7 present results and conclusions based on the first implementation of P3I. Finally, Section 8 presents related and future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SE-HPCS '05, St. Louis, Missouri, USA. Copyright 2005 ACM 1-59593-117-1/05/05 ...$5.00.

2. ATOMIC SECTIONS: OVERVIEW

The Atomic Section construct was jointly proposed by the authors of this paper and their colleagues at IBM, and it has been implemented in two ways. One implementation is in IBM's X10 language [5], where it is used for local synchronization. The second implementation is in the OpenMP XN programming model. OpenMP, a parallel programming extension for C/C++ and FORTRAN based on the fork-and-join model [2], is the standard programming model for shared-memory multiprocessor (SMP) machines; because of its status and the plethora of resources available for it, OpenMP is an excellent platform on which to test new parallel programming constructs. The Delaware group, which comprises the authors of this paper, took OpenMP and extended it with Atomic Sections; the resulting programming model is called OpenMP with extensions, or OpenMP XN for short. From now on, whenever the term Atomic Sections is used, the implementation under OpenMP XN should be assumed unless stated otherwise. An Atomic Section is defined as a section of code that is intended to be executed atomically and to be mutually exclusive with other conflicting atomic operations. The word "conflicting" in the previous sentence is one of the main reasons that Atomic Sections differ from other synchronization constructs in OpenMP: Atomic Sections that conflict should be guarded against each other, but they should not interfere with those that do not conflict with them. Standard OpenMP offers two extreme cases when it comes to interference and synchronization (see the sketch below). Its critical section construct provides a global lock that ensures that only one critical section is running while the others wait, regardless of the effects of their execution. On the other hand, OpenMP's lock functions put the responsibility on the programmer to detect whether the code interferes (data races) with other protected blocks of code and to lock it according to his or her judgment. (OpenMP also offers the atomic directive, but that construct is extremely restricted, being limited to read-modify-write cycles on a simple group of operations.) Atomic Sections provide the middle ground, in which the programmer is free of lock management and the compiler takes care of running the sections in a non-interfering manner. The syntax and a code example of Atomic Sections are presented in Figures 2 and 3. Each Atomic Section instance is associated with a structured block of code, defined as a piece of code that has only one entry point and one exit point. This is a brief introduction to the construct; for a more in-depth explanation, please refer to [7]. During the conception of this construct, another aspect became apparent: how will programmers react to it? The Delaware group therefore developed P3I to gauge the construct's impact on programmers and to create a methodology and framework template for future studies.
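The two standard OpenMP extremes can be sketched as follows. The study's exercises are written in C, but the contrast is the same in any OpenMP base language; this Fortran fragment uses invented variable names, is not taken from the study materials, and does not show the Atomic Section syntax itself (that appears in Figure 2).

      program sync_extremes
        use omp_lib
        implicit none
        integer, parameter :: nacct = 4, ntrans = 100000
        double precision :: balance(nacct)
        integer(kind=omp_lock_kind) :: lck(nacct)
        integer :: i, k

        balance = 0.0d0
        do i = 1, nacct
          call omp_init_lock(lck(i))
        end do

        ! Extreme 1: a single global critical section serializes every
        ! update, even updates to different accounts that cannot conflict.
        !$omp parallel do private(k)
        do i = 1, ntrans
          k = mod(i, nacct) + 1
          !$omp critical
          balance(k) = balance(k) + 1.0d0
          !$omp end critical
        end do
        !$omp end parallel do

        ! Extreme 2: per-account locks avoid needless serialization, but
        ! the programmer must decide which updates conflict and manage
        ! the locks correctly (the productivity cost discussed above).
        !$omp parallel do private(k)
        do i = 1, ntrans
          k = mod(i, nacct) + 1
          call omp_set_lock(lck(k))
          balance(k) = balance(k) + 1.0d0
          call omp_unset_lock(lck(k))
        end do
        !$omp end parallel do

        do i = 1, nacct
          call omp_destroy_lock(lck(i))
        end do
        print *, 'balances:', balance
      end program sync_extremes

An Atomic Section would let the programmer mark each update as atomic and leave the conflict analysis to the compiler and runtime.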

Figure 2: Atomic Section's Syntax

Figure 3: Atomic Section's Example

3. P3I: OVERVIEW

The main objective of P3I is to measure the productivity impact that a new construct has. From this main objective, many smaller questions arise. How much impact will the new construct have on the programmer? How will the construct affect each phase of application development? Other questions concern the study itself: is there a way to ensure a correct distribution of our sample, or to pre-filter the results so that the data we consider more relevant (i.e., from novice and average programmers) carries more weight than the rest? This productivity study tries to answer these questions and to provide a solid platform for future work and iterations. P3I had a total of fifteen participants, drawn from a pool of graduate and undergraduate students in the Department of Electrical and Computer Engineering at the University of Delaware. The participants attended two meetings in which the study was presented to them. The study itself is accessible through a group of web pages called the Web Hub, where the participants can review, download and fill in the different parts of the study. The Web Hub is divided into phases, which each participant should take in order. The first phase is dubbed Phase 0 and consists of a Web Survey, a Web Log and a programming exercise; all the other phases contain only a Web Log and a programming exercise. The purpose of Phase 0 is to get the participants familiar with the study's format and infrastructure. After Phase 0, there are three more phases, which represent the core of the study. For a complete description of all phases and the data collection infrastructure, refer to the next section. A code excerpt from Phase 0 is presented in Figure 4; it shows the code for a bank transaction using the OpenMP lock/unlock constructs and the Atomic Section construct. The participants use a modified version of a free-source OpenMP compiler called Omni [1], which has been extended to support OpenMP XN. The study's main metric is the time to correct solution, which is calculated using the data collection infrastructure explained in Section 4. As stated before, this study is not designed to measure performance (even though that will be one of its objectives in future iterations). The Web Survey is used to calculate the weights used in the study; its structure, purpose and results are presented in Sections 5 and 6. The programming exercises in each of the phases are designed to put programmers in different contexts that appear in HPC domains. In a nutshell, Phase 0 is the "Acclimation Phase", which makes all participants familiar with the study structure; Phase 1a is the "Complete Development Phase", which asks them to develop a complete parallel application; Phase 1b is the "Parallelization Phase", in which a serial code is parallelized; and Phase 1c is the "Parallel Debugging Phase", which asks the participants to debug a code. For a more in-depth look at each phase, please refer to the next section.

Figure 4: Code Excerpt from Phase 0

4. P3I INFRASTRUCTURE AND PROCEDURES

The main study is composed of four parts, or phases. The main webpage can be found at [6]; there, the participants can learn about the study itself, its main metric, and each phase's time limit. Moreover, extra materials can be accessed through this website, including tutorials on OpenMP and POSIX threads, a brief explanation of Atomic Sections and the OpenMP XN compiler, and instructions on how to install it. Figure 5 provides a picture depicting the structure of the Web Hub.

Figure 5: Web Hub Structure

The first phase is dubbed Phase 0: The Unfortunate Banker, also known as the "Acclimation Phase", and its main objective is to show the procedures for running all the subsequent phases. The phase itself is subdivided into a Web Survey, a Web Log and the programming exercise: The Banker. The Web Survey is the first step of the study and is comprised of 24 questions that test the participant's knowledge of parallel programming extensions, parallel execution models and hardware support for parallel models. More about the survey scoring system and its importance is discussed in the next section. The Web Log is designed to capture subjective data about the starting and stopping time of each phase. It also has a special section for participants' comments and answers to some questions raised by each phase. The questions are about aspects of the programming exercises, and their answers are simple but require a certain understanding of the exercise; most of these questions are optional. Every phase requires the participant to record data in the Web Log at the beginning and at the end of the phase. The programming exercise consists of three files: a makefile used to build the source files, a bash script that collects time stamps, user information, compiler information and program output (participants are required to use this script to run their application), and a skeleton C source file for the participants to use. In Phase 0, the programming exercise is a simple simulation of bank transactions between four branches scattered across the country. Each branch is simulated as a thread that receives transactions and synchronizes them with locks or Atomic Sections. There is a different group of files for each required construct; the participant therefore runs this phase twice, once for locks and once for Atomic Sections. In this phase, the source files are complete applications that are ready to run "out of the box". All the phases have the same structure as this phase, so participants can learn through it how to run the phases and their parts.

Phase 1 is the core of the study and consists of three sub-phases and several exercises. The first sub-phase is called Phase 1a: the GSO exercise, also known as the "Complete Development Phase". This exercise presents a hypothetical case in which a Gram-Schmidt ortho-normalization is required to create an ortho-normal basis. The exercise should be completed from scratch, and the final code should be parallelized with OpenMP XN C function calls and pragmas. The participants are only given helper functions to use in their code. One requirement for a successful run is that a provided check function return true when testing the basis. Another requirement for completion is that two versions are created (one for locks and one for Atomic Sections) and that each version runs successfully. Some extra information is provided on the phase's webpage, which is accessible through the Web Hub. This exercise was developed to test the programmer's abilities in designing, parallelizing and debugging an application.

Phase 1b is dubbed The Random Access Program, also known as the "Parallelization Phase". It is based on the Random Access exercise, which is used to test a system's memory bandwidth. The program contains a huge table that is randomly accessed and updated. At the end of the execution the reverse process is applied to the same table and the table is checked for consistency; if the number of errors surpasses one percent, the test has failed. The synchronization in this case applies to each random access of the elements in the table (a schematic of this kind of update loop appears at the end of this section). Intuitively, the number of errors that might appear in the final table drops considerably when the table is made large enough, which makes the use of synchronization constructs unnecessary for the program; this is actually one of the questions asked on the webpage of this phase. In this phase, the subjects are given a serial version of the program and are asked to parallelize it with OpenMP XN. As before, two versions are required to complete this phase (locks and Atomic Sections), and an extra version without any synchronization constructs can be completed. This exercise simulates the scenario in which programmers need to change serial codes into parallel implementations.

Phase 1c is called The Radix Sort Algorithm and is an implementation of this well-known algorithm, which is explained on the sub-phase's webpage. The participants are given a buggy parallel implementation of the algorithm. There are three bugs, relating to general programming, parallelization and deadlocks. All three bugs are highly dependent, and when one is found and solved, the others become apparent. As before, a lock version and an Atomic Section version are required. The extra questions in this section involve identifying the bugs, explaining why each is a problem, and proposing possible solutions. The main objective of this section is to measure the debug-ability of a parallel code that involves synchronization constructs.

A summary of the methodology and framework is given by Figure 6.

Figure 6: P3I Infrastructure and Methodology

All data collected from the phases is saved to a file that is in the possession of the organizers of the study. The data collected are ensured to be private and are made available to only one of the organizers. The identity of the participants and their results are kept secret so that no possible repercussion of their participation can arise. The participants cannot access their results, and the rest of the Delaware group does not know who participated or for how long they stayed in the study.
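As with the other exercises, the handout code for Phase 1b is in C; the following Fortran fragment is only a schematic of the kind of update loop involved. The table size, update count, and update operation are invented for this sketch and are not the study's actual parameters.

      program random_access_sketch
        implicit none
        integer, parameter :: i8 = selected_int_kind(18)
        integer, parameter :: tsize = 2**20          ! table of 2**20 entries
        integer, parameter :: nupdate = 4 * tsize
        integer(kind=i8) :: table(tsize)
        integer :: i, idx
        real :: r

        do i = 1, tsize
          table(i) = int(i, kind=i8)
        end do

        !$omp parallel do private(idx, r)
        do i = 1, nupdate
          call random_number(r)
          idx = int(r * tsize) + 1
          ! Each update touches one randomly chosen entry; protecting it
          ! (with a lock, a critical section, or an Atomic Section) is what
          ! the exercise asks for.  With a large table, unsynchronized
          ! updates rarely collide, which is why the pass/fail threshold
          ! is a one percent error rate.
          !$omp critical
          table(idx) = ieor(table(idx), int(i, kind=i8))
          !$omp end critical
        end do
        !$omp end parallel do

        print *, 'first entry after updates:', table(1)
      end program random_access_sketch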

5. THE WEB SURVEY: PURPOSE

The web survey is the first part of P3I and is mandatory for all participants. It consists of 24 questions that range from a simple "With which programming languages are you familiar?" to more involved questions such as "How much do you know about the fork-and-join parallel programming model?" In the web survey, participants check boxes or radio buttons to indicate their level of expertise on a scale from 1 (least expert / never heard of it) to 5 (expert / use frequently). Each question has a maximum score of 5, except for the first one, which has a value of 6, and some questions are left out of the final computation since they deal with hardware support. An expert score on the web survey is 106. When a participant finishes the web survey, his or her score is calculated and a ratio is taken with respect to the expert score; this ratio is called the expertise level. These calculations are called "Weight Calculations" and will be kept intact in future iterations. Finally, the expertise level is subtracted from one to produce the participant's weight. This weight is used to filter the data by multiplying it with each participant's time, which amplifies the contribution of less expert programmers to the study. These final steps are named "The Weighting Function" and will be modified in future iterations of the study. That being said, P3I, in its first iteration, targets novice and average programmers. It also has the ability to "weed out" high-expertise participants: this prevents skewing of the data from the high-expertise end while amplifying the low end. This scheme can be considered a "Low Expertise Weighting Function". Two other schemes have been considered, and they will be applied in the next iterations of P3I; for an explanation of these future schemes, please refer to Section 8.
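In symbols (the notation is introduced here for clarity and is not from the original paper), with $s_i$ the survey score of participant $i$ and 106 the expert score:

\[
  e_i = \frac{s_i}{106}, \qquad w_i = 1 - e_i, \qquad \tilde{t}_i = w_i \, t_i ,
\]

where $e_i$ is the expertise level, $w_i$ the weight, $t_i$ the measured time to correct solution, and $\tilde{t}_i$ the weighted time that enters the averages reported in Section 6.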

6. P3I RESULTS

The results of the study consist of the average time of all participants in each sub-phase and sub-section. The data consist of the time (in seconds) from the first attempt to run the program up to the first correct result. Each time datum is weighted with the participant's weight before the average is calculated. A total of 15 participants took the Acclimation Phase. From this group, six completed the "Complete Development Phase", and the same six participants completed the "Parallelization Phase". Finally, only two participants completed the last phase. Each participant had to run the experiments on a Sun SMP machine with 4 processors and a modified version of the Omni compiler. The results for the weights are presented in Figures 7 and 8. They show that the distribution of expertise among the participants approaches a normal distribution. This result can be used in future iterations to break the population into samples, which will also allow researchers to test several hypotheses about productivity; more about these future schemes is presented in Section 8. Figure 9 provides the final results, which have been adjusted by the weight data. A complete discussion of the results and the weights of the participants is given in the next section.

Figure 7: Weight of each Participant

Figure 8: Histogram of the Weights

Figure 9: Weighted Average Time Data for Each Phase

7. ANALYSIS AND CONCLUSIONS

As shown by Tables 1a and 1b, the weights in this study formed a slightly skewed normal distribution, which ensures that most of the results are weighted against a close range of values. Moreover, the actual weights observed in the study fall in the range of 0.42 to 0.55, which means the data were not affected much by the weights; it also means that the population is not suitable for sampling, since all the groups have about the same level of expertise. As Table 2 shows, there is a considerable reduction of time to solution in all the phases; however, the small sample space keeps this study from making a stronger case. Overall, there is a dramatic reduction of the time to correct solution, and each phase shows this reduction; in Phases 1a and 1c the reduction is roughly a factor of five. In interpreting these data, the sequencing information should be considered. This information is obtained by recording the order in which the sub-sections were taken within each sub-phase; it matters because there is always a learning period in which the participants become familiar with the code. Even taking this into account, the reduction of time to solution is still present. Moreover, this study can be augmented and redesigned thanks to the data gained from this study and many others. Therefore, the first iteration of P3I serves as a solid foundation for further iterations of this study or others.

8. FUTURE AND RELATED WORK

Based on several new studies and interactions with other productivity groups, the Delaware group has already formed a base for the next iteration of P3I. The first improvement will be to greatly reduce the human factor in data collection and to extend the data collection infrastructure for both objective and subjective data. Currently, the data are subject to human interaction; to reduce this effect, a modified shell and an instrumented compiler will be used in the next iteration. The shell will collect all data concerning the user's activity in it, including editor history, and the compiler will silently collect warnings and errors for the application being compiled. The web log should be enhanced with more parts, such as estimates of compilation, debugging and design time, and more relevance should be given to comments. Human factors and behavior should be considered more. A larger population should be considered and sampling of the population should take place; samples should eliminate the sequencing problem by creating control and experimental groups. A broader area should be considered, as computer scientists must be included in the sample. The web survey should be extended to include questions about compiler specifics and multithreaded knowledge. Versions of POSIX threads, MPI exercises and OpenMP programs should be created as familiarization exercises for novice programmers (i.e., extending Phase 0). Finally, a new incentive system should be instituted (e.g., exercises can be homework or projects in college-level multithreaded classes). These suggestions come from work done at the University of Maryland under the direction of Vic Basili [4][3]. Special thanks to Vivek Sarkar, Kemal Ebcioglu, Vic Basili and all the Delaware group's collaborators from IBM Watson Research Center for all their help in bringing this study to completion.

9. ACKNOWLEDGMENT

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

10. REFERENCES

[1] Omni OpenMP Compiler Project. http://phase.hpcc.jp/Omni/home.html. Parallel and High Performance Applicational Software Exchange (PHASE).
[2] OpenMP: Simple, portable, scalable SMP programming. http://www.openmp.org/drupal. OpenMP Architecture Review Board.
[3] S. Asgari, V. Basili, J. Carver, L. Hochstein, J. K. Hollingsworth, F. Shull, and M. Zelkowitz. Challenges in measuring HPCS learner productivity in an age of ubiquitous computing. In Proceedings of the Workshop on Software Engineering and High Performance Computing Applications (held at ICSE), 2004.
[4] J. Carver, S. Asgari, V. Basili, L. Hochstein, J. K. Hollingsworth, F. Shull, and M. Zelkowitz. Studying code development for high performance computing: The HPCS program. In Proceedings of the Workshop on Software Engineering and High Performance Computing Applications (held at ICSE), 2004.
[5] K. Ebcioglu, V. Saraswat, and V. Sarkar. X10: an experimental language for high productivity programming of scalable systems. In Proceedings of the Second Workshop on Productivity and Performance in High-End Computing, pages 45-51, February 2005.
[6] J. B. Manzano, Y. Zhang, and G. Gao. Productivity study. http://www.capsl.udel.edu/courses/eleg652/2004/productive January 2005. Computer Architecture and Parallel System Laboratory (CAPSL).
[7] Y. Zhang, J. B. Manzano, and G. Gao. Atomic section: Concept and implementation. In Mid-Atlantic Student Workshop on Programming Languages and Systems (MASPLAS), 2005.

Refactorings for Fortran and High-Performance Computing Jeffrey Overbey, Spiros Xanthos, Ralph Johnson and Brian Foote University of Illinois at Urbana-Champaign MC 258 201 North Goodwin Urbana, IL 61801 {overbey2,xanthos2,johnson,foote}@cs.uiuc.edu

ABSTRACT

Not since the advent of the integrated development environment has a development tool had the impact on programmer productivity that refactoring tools have had for object-oriented developers. However, at the present time, such tools do not exist for high-performance languages such as C and Fortran; moreover, refactorings specific to high-performance and parallel computing have not yet been adequately examined. We observe that many common refactorings for object-oriented systems have clear analogs in procedural Fortran. The Fortran language itself and the introduction of object orientation in Fortran 2003 give rise to several additional refactorings. Moreover, we conjecture that many hand optimizations common in supercomputer programming can be automated by a refactoring engine but deferred until build time in order to preserve the maintainability of the original code base. Finally, we introduce Photran, an integrated development environment that will be used to implement these transformations, and discuss the impact of such a tool on legacy code reengineering.

Categories and Subject Descriptors

D.2.3 [Software Engineering]: Coding Tools and Techniques - program editors; D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement - restructuring

General Terms

Design, Languages.

Keywords

Refactoring. Fortran programming.

This work is being funded by IBM under the PERCS project.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICSE'05, May 15-21, 2005, St. Louis, Missouri, USA. Copyright 2005 ACM 1-59593-117-1/05/0005 ...$5.00.

1. THE NEED FOR REFACTORING

The quality of any long-lived code base tends to degrade over time. Otherwise stable architectures take on a certain malleable quality as programmers attempt to adapt systems to meet unforeseen new requirements. Moreover, many coding best practices, such as small methods and concise, descriptive names, fall by the wayside when deadlines and functionality are in jeopardy. Although these issues are negligible in isolation, their cumulative effect is often an erosion of the system's architecture [2]. Such systems tend to have high entropy and to exhibit code duplication and global information sharing [5]; maintenance and expansion can become tedious, costly, and time-consuming work [1]. One solution to this gradual software decay is refactoring [6, 10]. Refactorings are source-level program transformations that preserve the observable behavior of a system while improving its source code. Often, refactorings aim to eliminate code duplication or poor design decisions. Common refactorings include renaming variables or functions to be more descriptive, breaking a large subroutine into several smaller ones, or substituting one algorithm for another. In object-oriented systems, common refactorings include replacing case statements with polymorphism, introducing Method Objects, and moving methods between classes. Martin Fowler's Refactoring: Improving the Design of Existing Code [6] gives a more extensive catalog of refactorings. One particularly interesting quality of refactorings is that many of them are algorithmic in nature. In essence, renaming a function amounts to a textual change of its declaration and of all invocations. While this is certainly not an easy task (due to complications such as preprocessing and function overloading), it can, in fact, be automated. The Eclipse Java Development Tools [9] have arguably brought automated refactorings to the widest audience, although they were not the first tool to implement them. Much of the initial work on automated refactoring was done by Ralph Johnson's research group at the University of Illinois; William Opdyke's Ph.D. thesis [10] is often cited as the pioneering work in this area. John Brant and Don Roberts' Smalltalk Refactoring Browser [12] was the first implementation, introducing automated refactorings to the Smalltalk community, and has since been integrated into VisualWorks Smalltalk. More recently, Alejandra Garrido has been working on preprocessing issues in C refactoring [7][8], and Photran will introduce refactorings specific to Fortran and high-performance computing.

2. REFACTORING FORTRAN

The classical use of refactoring as a form of retroactive engineering is particularly applicable to Fortran. The amount of Fortran 77 code that remains in production is a clear indication that software maintenance issues apply equally to HPC. However, there are many refactorings that are specific to the Fortran language. In her M.S. thesis [4], Vaishali De identifies many possible Fortran refactorings. Many of the standard refactorings for fields and methods (described in [6]) apply equally to variables and subroutines in Fortran: Extract Method (removing a section of code into its own subroutine), Decompose Conditional (replacing a complex boolean expression with a more descriptive function call), Rename Variable, and Reorder Procedure Arguments are several examples. Some refactorings typically applied to classes in an object-oriented system can be applied equally well to Fortran modules, such as Encapsulate Field (where accesses of and assignments to a variable are replaced with function calls) and Move Method (moving a subroutine between modules). Fortran also brings several unique refactorings. One example is transforming code from fixed format to free format, which is just one part of migrating code from Fortran 77 to Fortran 90/95. Another example is migrating parallel arrays to derived types. Similarly, once Fortran 2003 compilers materialize, existing Fortran programs will need to be migrated to that standard as well. One aspect of this will be especially challenging: Fortran 2003 introduces object orientation to the language, and transforming a procedural code base into an object-oriented one is certainly not a process that can be automated completely, since a great deal of domain knowledge is required and a number of design decisions must be made along the way. However, we anticipate that many of the steps can be automated, such as migrating module procedures to type-bound procedures.
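As a concrete illustration of one of these refactorings, the following sketch shows Encapsulate Field applied to a module variable; the module and variable names are invented here and are not drawn from [4] or [6].

      ! Before: any code that USEs the module may read or write the
      ! variable directly.
      module particle_data_v1
        implicit none
        real :: total_mass = 0.0
      end module particle_data_v1

      ! After: the variable is private and all access goes through
      ! well-named procedures, so its representation can change later
      ! without touching client code.
      module particle_data_v2
        implicit none
        private
        real :: total_mass = 0.0
        public :: add_mass, get_total_mass
      contains
        subroutine add_mass(dm)
          real, intent(in) :: dm
          total_mass = total_mass + dm
        end subroutine add_mass

        function get_total_mass() result(m)
          real :: m
          m = total_mass
        end function get_total_mass
      end module particle_data_v2

A refactoring tool automates the mechanical part of such a change: finding every direct reference to the variable and rewriting it as the corresponding call.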

3. REFACTORINGS FOR HIGH-PERFORMANCE COMPUTING

In the high-performance world, a very important issue is optimizing software for particular architectures. Many of the refactorings described previously can be applied to a large number of programs; however, there is an entirely different class of refactorings that is unique to the domain of supercomputing: performance refactorings. It is well known that, despite the best efforts of compiler vendors, code intended to run on a specific supercomputer must undergo many hand optimizations. Examples include manual unrolling of loops and optimizing data structures based on the machine's cache size. Applying these tweaks by hand is a tedious, error-prone process, so a tool able to automate it would be very useful. However, while refactorings are typically used to improve the design and readability of code, many of these performance optimizations actually decrease readability. Loop unrolling is an excellent example: an unrolled loop is far more difficult to comprehend than one that has not been unrolled. In these cases, rather than transforming the code in place, we propose the notion of deferred transformations (we use the term "transformations" rather than "refactorings" since the transformations are not made to the working code). In the same way that a breakpoint can be set on a specific line, programmers could tag a section of code to indicate that a given transformation (e.g., "unroll this loop five times") should be applied immediately before the code is compiled. This would allow programmers to maintain the more readable version of their code while compiling a performance-optimized version.
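To make the readability cost concrete, here is a schematic example of manual loop unrolling; the routine and its names are invented for this sketch. A deferred transformation would keep only the first form in the maintained source and generate the second at build time.

      ! Maintained (readable) form
      subroutine daxpy_clear(n, a, x, y)
        implicit none
        integer, intent(in) :: n
        double precision, intent(in) :: a, x(n)
        double precision, intent(inout) :: y(n)
        integer :: i
        do i = 1, n
          y(i) = a * x(i) + y(i)
        end do
      end subroutine daxpy_clear

      ! Hand-unrolled form (assumes n is a multiple of 4); faster on some
      ! machines, but the intent is harder to see and the assumption on n
      ! is easy to violate during later maintenance.
      subroutine daxpy_unrolled(n, a, x, y)
        implicit none
        integer, intent(in) :: n
        double precision, intent(in) :: a, x(n)
        double precision, intent(inout) :: y(n)
        integer :: i
        do i = 1, n, 4
          y(i)   = a * x(i)   + y(i)
          y(i+1) = a * x(i+1) + y(i+1)
          y(i+2) = a * x(i+2) + y(i+2)
          y(i+3) = a * x(i+3) + y(i+3)
        end do
      end subroutine daxpy_unrolled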

4. REENGINEERING AND REFACTORING

Demeyer et al. define reengineering as "the examination and alteration of a subject system to reconstitute it in a new form and the subsequent implementation of the new form" [13]. When we talk about reengineering, then, we usually refer to a legacy software system that has to be altered to meet changing requirements or to support extensions and additions. Legacy code is a critical part of many systems and the source of many problems. Software systems must constantly adapt to changes in the environment in which they operate. In most cases, legacy software was developed in obsolete languages using old-fashioned development practices, which complicates upgrading the legacy parts. One option is to discard or replace them, but this is not always feasible because of the cost; in that case, the most viable solution is to reengineer them. During the reengineering process, we want to improve the design of the system in order to make it more maintainable and upgradable. This involves altering its structure while at the same time preserving its behavior. The main reason that programmers do not attempt to improve the design of a system is their fear of breaking its behavior in the process [14]. When we alter the structure of a system in order to improve its design without changing its behavior, we are, by definition, refactoring. A tool that provides automated refactorings can eliminate most of the common mistakes that programmers make when they attempt to refactor a piece of software manually. (Proper testing is still essential, however.) This shows the importance of a tool like Photran when someone is working with legacy code.

5. PHOTRAN

Identification of refactorings for Fortran and high-performance computing is an essential step, but even more important is the development of a tool that can automate them. Photran [11] is an Eclipse-based Fortran IDE being developed at the University of Illinois that will implement many of these transformations. The current version of Photran is based on Eclipse's C Development Tool [3] and runs on Eclipse 3.0 under Linux, Windows, Solaris, and Mac OS X. It includes a keyword-highlighting Fortran editor, CVS support, debugging via a GUI interface to GNU gdb, Makefile-based compilation, and error extraction for several popular Fortran compilers. A sophisticated refactoring infrastructure is under development, although it is not visible in the current public release; a pretty printer and a Rename refactoring have been demonstrated internally and will be included in the near future. While we have identified many refactorings that will be essential for Fortran programmers, there are many we are not yet aware of. Receiving input from Fortran programmers, especially those in the high-performance arena, will be essential. Our aim is to provide Fortran programmers with a state-of-the-art tool that can increase productivity and allow them to adapt their code to changing requirements and modern software engineering practices. Such a tool will be essential in reengineering efforts. Since Fortran has been used for more than fifty years, a vast amount of active Fortran code is decades old, making a tool like Photran a necessity for Fortran programmers.

Figure 1: A Screenshot of Photran 2.1

6. REFERENCES

[1] Boehm, B., and Horowitz, E. (editors). The High Cost of Software: Practical Strategies for Developing Large Software Systems. Addison-Wesley, 1975.
[2] Brooks, F. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 1995.
[3] C Development Tool. http://www.eclipse.org/cdt/
[4] De, V. A Foundation for Refactoring Fortran 90 in Eclipse. M.S. Thesis, University of Illinois at Urbana-Champaign, 2004.
[5] Foote, B., and Yoder, J. Big Ball of Mud. Fourth Conference on Pattern Languages of Programs (PLoP'97/EuroPLoP'97), Monticello, Illinois, September 1997.
[6] Fowler, M. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
[7] Garrido, A., and Johnson, R. "Challenges of Refactoring C Programs." Proceedings of IWPSE 2002: International Workshop on Principles of Software Evolution. Orlando, Florida, May 19-20, 2002.
[8] Garrido, A., and Johnson, R. "Refactoring C with Conditional Compilation." Proceedings of the 18th IEEE International Conference on Automated Software Engineering (ASE 2003). Montreal, Canada, October 6-10, 2003, 323-326.
[9] Java Development Tools for Eclipse. http://www.eclipse.org/jdt/
[10] Opdyke, W. Refactoring Object-Oriented Frameworks. Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1992.
[11] Photran, an Eclipse plugin for Fortran Development. http://www.photran.org/
[12] Roberts, D., Brant, J., and Johnson, R. "A Refactoring Tool for Smalltalk." Theory and Practice of Object Systems 3(4), 1997.
[13] Demeyer, S., Ducasse, S., and Nierstrasz, O. Object-Oriented Reengineering Patterns. Morgan Kaufmann, 2003.
[14] Feathers, M. Working Effectively with Legacy Code. Prentice Hall, 2004.


Building A Software Infrastructure for Computational Science Applications: Lessons and Solutions Osni Marques

Tony Drummond

Lawrence Berkeley National Laboratory 1 Cyclotron Road, MS 50F-1650 Berkeley, CA 94720-8139 (+1 510) 486-5290

Lawrence Berkeley National Laboratory 1 Cyclotron Road, MS 50F-1650 Berkeley, CA 94720-8139 (+1 510) 486-5290

[email protected]

[email protected]

ABSTRACT

The development of high performance engineering and scientific applications is an expensive process that often requires specialized support and adequate information about the available computational resources and software development tools. The development effort is increased by the complexity of the phenomena that can be addressed by numerical simulation, along with the growth and evolution of computing resources. In this paper we discuss mechanisms implemented by the DOE Advanced Computational Software (ACTS) Collection Project to mitigate that effort. ACTS is a set of software tools, developed within DOE and in some cases in collaboration with other funding agencies, that make it easier to write high performance codes for computational science applications. The paper discusses the categories of problems that the tools solve, the functionalities that they provide, applications that have benefited from their use, and lessons learned through interactions with users possessing different levels of expertise.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Distributed Programming and Parallel Programming; D.2.2 [Software Engineering]: Design Tools and Techniques - Software Libraries and User Interfaces; D.2.13 [Software Engineering]: Reusable Software; K.3.2 [Computers and Education]: Computer and Information Science Education - Computer Science Education.

General Terms
Algorithms, Performance, Design, Reliability, Languages.

Keywords
Computational Sciences, High Performance Computing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SE-HPCS'05, May 15, 2005, St. Louis, Missouri, USA. Copyright 2005 ACM 1-59593-117-1/05/0005...$5.00.

1. INTRODUCTION

A number of recent reports produced by the engineering and scientific communities have given recommendations for improving the nation's high-end computing capabilities and for achieving an optimal use of computing resources [1][2]. It has been recognized that mechanisms to reduce the effort required for software development, testing and evolution are imperative to attaining many of these goals. This is particularly important in the realm of applications that require large-scale computer simulations. In fact, the lifespan of software and applications is usually much longer than that of the hardware. Nonetheless, recent studies of the high performance computing landscape indicate a continuous increase in the number of cluster-based systems, as well as new architectures that will make the 1 Pflop/s mark attainable by 2008-2009 [3][4].

Large complex application codes usually have several software constituents. Here, a software constituent indicates a collection of subroutines, programs, sections of programs, or combinations of these that implement a computational service. Without loss of generality, we classify these constituents according to their functionalities into three classes: algorithmic, data manipulation, and I/O implementations. In the first class we find software implementations of numerical algorithms, ranging from basic linear algebra to complex iterative schemes in which one solves several systems of equations by combining different techniques. The second class groups those constituents that implement a strategy for managing data in memory for processing or storage. Finally, the I/O class includes implementations of software strategies that involve a transfer of data from a process or task memory to either disk or the memory of one or more other processes or tasks.

In this software development model, a developer or a group of developers realizes all the different software constituents and creates control procedures in a main program to link them. Frequently, the next step for application developers is to devise mechanisms to use the often-limited computational resources more efficiently. This task recurs many times in the life of an application code, and it is mostly driven by changes in computational resources.
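To make the three constituent classes concrete, the following minimal sketch (in Python, with all function and file names invented for illustration; this is not code from any ACTS tool) shows an in-house layout in which a main driver links one constituent of each class.

import numpy as np

def assemble_system(n):
    # Data-manipulation constituent: lay out the problem data in memory.
    a = (np.diag(2.0 * np.ones(n))
         - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1))
    b = np.ones(n)
    return a, b

def solve_system(a, b):
    # Algorithmic constituent: the numerical kernel (a dense solve here;
    # a real code might use an iterative scheme instead).
    return np.linalg.solve(a, b)

def write_solution(x, path):
    # I/O constituent: move results from memory to disk.
    np.savetxt(path, x)

def main():
    # Control procedure in the main program linking the constituents.
    a, b = assemble_system(100)
    x = solve_system(a, b)
    write_solution(x, "solution.txt")

if __name__ == "__main__":
    main()

A production code would contain many constituents of each class, and the driver logic that links them typically grows into a substantial piece of software in its own right.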

Application developers use combinations of compiler directives, language extensions, specialized library calls and even code rewrites to optimize the performance of their applications. Overall, this part of the application development is operationally very expensive and has a very short-term impact on the application. The cost of optimizing a code renders this kind of development almost impractical for current and future computational challenges, in particular when this cost is added to the costs of the initial development, debugging, synchronization of source code and consecutive version upgrades. All these costs increase as the complexity of the application increases and the nature of high-end computing continues to evolve.

In contrast, the use of software tools greatly reduces software development and maintenance costs, since other specialized software development teams have taken on the laborious task of guaranteeing portability, scalability and robustness. In addition, software tools offer a variety of solutions to a problem through their interfaces. Nowadays, there are several approaches for making tool interfaces interoperable [5]. Interoperable tools facilitate code changes that are due to an increase in the complexity of the problem being addressed or in its algorithmic formulation. In recent years, a number of important scientific and engineering problems have been successfully studied and solved by means of computational modeling and simulation [6]. Many of these computational models and simulations benefited from the use of available software tools and libraries to achieve high performance. These tools have facilitated code portability across many computational platforms while guaranteeing the robustness and correctness of the algorithmic implementations.

As a result of all these advancements in tool development and deployment, there is nowadays a growing need and interest among computational scientists in identifying software tools that have been successfully used in the solution of computational problems similar to theirs. In this article, we discuss our experience with the US Department of Energy (DOE) Advanced Computational Software (ACTS) Collection Project towards the creation of a reliable software infrastructure for computational science. The paper deals with issues related to the ACTS tools, in particular the categories of problems that they solve, the functionalities that they provide, applications that have benefited from their use, lessons that we have learned by interacting with computer vendors and users with different levels of expertise, and mechanisms to minimize the usually costly application development effort.

2. ACTS: OVERVIEW AND GOALS

The ACTS Collection is a set of computational software tools that aim at simplifying the solution of common and important computational problems [7]. ACTS tools are targeted at distributed computing and use different mechanisms for interprocess communication. Some of the tools also provide implementations for sequential architectures. The Collection evolved from the former DOE 2000 Project, which had two main components, the Advanced Computational Testing and Simulation Toolkit and the National Collaboratory Project. One of the goals of the project was to change the way scientists work together and to address major challenges of scientific computation by developing and exploring new computational tools and libraries. Thus, most of the tools currently in the ACTS Collection were primarily developed at DOE laboratories, in some cases in collaboration with universities. In addition, some tools have been co-funded by DOE and other agencies such as the US National Science Foundation (NSF) and the US Defense Advanced Research Projects Agency (DARPA).

The ACTS Project gathers years of uncoordinated software development and makes it available, at no direct cost, to a wide international community of scientists and engineers through the ACTS Collection. By no direct cost we mean that users do not pay for downloading and using the tools; however, users may have to devote resources when prototyping their applications with the tools. Pointers and more information on how to download the tools can be found at the ACTS Information Center [8]. The ACTS Project complements the library research and development efforts by adding technical support, quality assurance and outreach. The technical support provided by the ACTS Project ensures that development efforts are better employed. As a result of this effort, ACTS libraries have reached higher acceptance levels among computational scientists and institutions. In turn, this outreach provides the necessary means for individual tool projects to interact with more users, so the tools can gradually mature, becoming both more robust and more portable to state-of-the-art high performance computing environments.

Software libraries in the ACTS Collection fall into one of four categories: numerical tools, tools for code development, tools for code execution, and tools for library development [8]. The numerical tools implement numerical methods and algorithms, and include sparse linear system solvers, ODE solvers, optimization solvers, etc. The tools in the code development category provide infrastructure that manages some of the complexity of distributed programming (such as distributing arrays, communicating boundary information, etc.) but do not actually implement numerical methods. Execution support is a category for application-level tools; these tools include performance analysis and remote visualization support. Library support tools provide an infrastructure for tool developers and probably will not be used or seen directly in scientific applications.

The tools are selected, considered for inclusion and evaluated by taking into account efficiency, scalability, reliability, portability, flexibility, and ease of use [9]. Efficiency refers to the optimal use of the computational resources in the system. Scalability is the ability to increase the number of processes and processors as the size and complexity of the problem being solved increases, without compromising efficiency. Reliability refers to the failure-free operation of the library and the proper handling of error bounds. Portability refers to the adaptability of the libraries to a wide variety of computational environments. Flexibility is the feature that allows users to construct new routines, libraries and codes from well-defined tool modules; the use of flexible software automatically leads to extensible software. Ease of use delivers interfaces that users outside the community of developers can adopt and become familiar with. All these goals make ACTS tools an important resource for a wide spectrum of applications.

The ACTS Project is also actively engaged in education activities, by organizing and participating in workshops, tutorials and other events related to computational sciences [10]. Recently, we have introduced a matrix of software tools and their most relevant applications in science and engineering, with comments on tool performance and functionality and with references to other available similar tools [6][11]. The applications come from an international base of users (some without any links to DOE) with whom we have interacted, and currently include data analysis of the cosmic microwave background, collisional breakup in a quantum system of three charged particles, electronic structure calculations by means of a full-potential linearized augmented plane-wave method, modeling and analysis of accelerator cavities, the study of steady-state neutron flux distributions inside a reactor core, cardiac simulation, etc. We plan to regularly update this table with feedback and results from the international community of computational scientists, as well as with software developments from other teams. Our goal is to promote the reuse of robust software tools and at the same time provide guidance on their use. In addition, this reference matrix can be instrumental in the dissemination of state-of-the-art software technology, while providing feedback not only to users but also to tool developers, software research groups and even computer vendors.
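As an illustration of the distributed-programming detail that the code development tools described above are meant to manage on behalf of the application developer, the sketch below writes a one-dimensional ghost-cell (boundary) exchange by hand. mpi4py and NumPy are used here purely as stand-ins for illustration; they are not part of the ACTS Collection, and the variable layout is invented.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process owns a 1-D slab of a distributed array, padded with one
# ghost cell on each side to hold the neighbors' boundary values.
n_local = 8
u = np.full(n_local + 2, float(rank))

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my rightmost interior value to the right neighbor,
# receive my left ghost cell from the left neighbor.
comm.Sendrecv(sendbuf=u[n_local:n_local + 1], dest=right,
              recvbuf=u[0:1], source=left)
# Send my leftmost interior value to the left neighbor,
# receive my right ghost cell from the right neighbor.
comm.Sendrecv(sendbuf=u[1:2], dest=left,
              recvbuf=u[n_local + 1:n_local + 2], source=right)

if rank == 0:
    print("ghost cells updated on", size, "processes")

Multiplied across dimensions, data structures and platforms, this is the kind of code that the array-distribution and boundary-communication infrastructure in the Collection spares the application developer from writing and maintaining.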





3. LESSONS LEARNED

Important lessons have emerged from the activities described in the previous section. While some of the lessons are perhaps specific to DOE communities, they are nonetheless valuable to the computational science and engineering communities as a whole, and in particular to software development and support projects:

• There is still a gap between tool developers and application developers, which leads to duplication of efforts. Without projects like ACTS, application developers will continue to design and implement codes using techniques that are already available from other sources. Quite often these in-house implementations are not optimal because of the application developers' inexperience with all the different issues that lead to optimal performance.

• The tools currently included in the ACTS Collection should be seen as dynamically configurable collections and should be grouped into collections upon user/application demand. Based on the particular needs of an application, users generally benefit from only a subset of the functionality available in ACTS; therefore, they may only need to install a subset of the ACTS tools in their computing environments.

• Users demand long-term support of the tools. One of the main concerns that users have expressed is the longevity of support from tool developers and the required evolution of the software as the hardware technology continues to evolve and the complexity of the application continues to grow.

• Applications and users play an important role in hardening tools. The main parameters for software maturity are portability, robustness, acceptance, and long-term support. It is particularly the interactions with real users and real applications that have made the software mature, portable, robust and better documented. In turn, mature software will be widely accepted inside a given scientific community.

• Tools evolve or are superseded by other tools. As technology continues to advance, some tool functionalities are either no longer needed or are improved as direct consequences of user demands.

• There is a demand for tool interoperability and more uniformity in documentation and user interfaces. Users want to experiment with functionalities available in a subset of tools. Finding similar user interfaces and comparable levels of support and documentation makes this task simpler and less risky. The computational challenges at hand also demand new software developments that interact with legacy code practices, data and computer languages.

• There is a need for an intelligent and dynamic catalog of high performance tools. A centralized software and reference repository is essential for preventing the duplication of efforts. Currently, the ACTS Information Center provides pointers to tools funded by DOE, together with the accumulated expertise from tool users and the scientific domains served by the tools, as well as pointers to tools offering functionality not currently available in the ACTS Collection or tools whose functionality overlaps with ACTS tools.

• Collaborative software development. Recent issues of computer vulnerability have raised many concerns regarding collaborative software development platforms. There is an urgent need to document the existing open software development activity supported by DOE and at the same time collect requirements and implementation plans from other DOE institutions to ensure the expansion of these collaborative software development environments in the future.

• Software licensing for experimental software. Software licensing issues need to be addressed through a DOE-wide tool certification process that backs up the development and distribution of experimental software based on the results of a peer software review and feedback from the computational science community.

4. PROPOSED SOLUTIONS

The ACTS Collection Project has been serving as a mechanism for the development, support, and promotion of quality high performance software tools. The goals and first successes of this project have been reinforced by the increasing demand for complex, high-performing computer simulations, closer interactions between computer scientists and other domain scientists, less duplication of efforts, and interoperability to pick-and-play with kernels from a variety of computational services. We have been working to continuously expand the scope of the ACTS Collection Project based on the lessons summarized in the previous section. The continuation and enhancement of the ACTS Project towards the creation of a reliable software infrastructure for scientific computing also seeks to take into consideration the following items:

• A Solid Base for the ACTS Collection. The matured tools in ACTS form a solid base that has contributed to the acceptance of the tools in the computational sciences. We have proposed and defined mechanisms for the inclusion of new tools in the collection through a yearly peer-reviewed process that certifies the tools as part of the collection.

• ACTS High Quality Software Certification. We have proposed the creation of a software certification process for tools in the collection, to be defined in terms of software availability, robustness, functionality, portability, documentation, and interoperability [5].

• Inclusion of New Solutions to Computational Problems. The development of good quality complex parallel codes is usually very expensive. The ACTS Collection aims at promoting code reusability for the solution of common and important computational problems as a means of accelerating scientific discoveries. It also aims at fostering the development of tools that are not available and at recommending good quality tools developed by parties not necessarily funded by DOE.

• Interoperability and Software Distribution. Interoperability may affect performance in some cases, but it potentially reduces time to solution and, most importantly, it assures the longevity of the software. Moreover, language choices made by the tool developers must not dictate the choice of language used by the application developers.

• Active Collaboration with DOE Computer Facilities. The spectrum of users in the DOE computer facilities consists of scientific challenge teams (multidisciplinary and multi-institutional teams engaged in research, development, and deployment of scientific codes, mathematical models, and computational methods to exploit the capabilities of terascale computers), high-end capability users (single-PI teams and their groups of collaborators or students), and new users transitioning from midrange computers.

• Working in Collaboration with Other Software Initiatives. There are several high-end computing programs that interact with ACTS at different levels, in particular the SciDAC Program [13]. This program has been developing software technology to address complex scientific problems on today's and future high-end hardware. In turn, there are several tools in the current ACTS framework that have benefited projects funded by other programs.

• Encourage Feedback from Users. The ultimate measure of success for the high performance computing tools provided by ACTS is their role in the production of high quality scientific results. This can be assessed, for instance, by encouraging feedback from current and potential users as well as from participants in the workshops and other activities related to ACTS, and by giving presentations and fostering discussions in major events covering computational sciences topics.

• Further Development of the Tools. When appropriate and based on user-specific needs, we employ our expertise to improve the features provided by a particular tool. This kind of activity happens in collaboration with the tool developers or independently, when funding for the development of a certain tool has ceased.

• Expertise Tracking. The ACTS Information Center attempts to collect and make available all information from tool developers, the ACTS support team, and users' experiences with the tools. In some cases, tool developers keep their own e-mail lists of users, by means of which questions and problems are posed and later either the developers or other experienced users propose answers and solutions. We have been expanding this capability in the ACTS Information Center by continuing to participate in these e-mail lists and by looking into ways to employ more specialized search tools to retrieve this information from the feedback, evaluations, and reports collected on a particular tool.

• Increasing the Visibility of the ACTS Information Center. Good quality software is usually labor-intensive to develop, test, maintain, and evolve. We envision ACTS as a mechanism that helps improve the life cycle of scientific computing software and also as a delivery vehicle for software developed with DOE funding.

• Dissemination Plans. From our experience with the ACTS Project, we have learned that the education of graduate students and postdoctoral fellows provides a viable resource for building a bridge between computer scientists and domain scientists. By educating other scientists in the use of a set of tools, they become familiar with the technology, accept it faster, are able to develop codes using state-of-the-art tools, and minimize the tool selection effort.

In Figure 1, we show how we envision the multiple interactions and roles played by the ACTS Project within the computational science community. ACTS has been working with a basic set of reliable tools and is paying close attention to tools being developed by other initiatives, in order to incorporate them into the Collection. In the figure, the tools to be tested and eventually accepted are produced by developers working with and in the User Community and in the ACTS Project. The strong components of the ACTS Collection and its infrastructure are the coordination of high-level support to the User Community, the independent Testing and Acceptance of software tools, and the constant interactions with Scientific Computer Centers and Computer Vendors. The high-level support is geared towards minimizing the application development time, from the first prototype code to the production code. The testing and acceptance of new tools into the ACTS Collection ensure tool quality and the expansion of the available functionality. In the inclusion of tools into the collection, interoperability plays a major role in guaranteeing software reusability, incremental development, and a ready-to-use, wide variety of services for the end users.

The key component of this long-term user support is the coordination of efforts between software and hardware vendors, tool developers, users and ACTS. We have realized that the main parameters for software maturity are portability, robustness, acceptance, and long-term support, and it is in turn the interactions between application and tool developers that have made the software tools more mature, portable, robust and better documented. Therefore, we are working with commercial software developers and computer vendors to guarantee the long-term support of the tools and to effectively reach out to more user communities.

To complement the above activities, among others, we have also implemented a quarterly electronic newsletter, made available Access Grid-based consulting services, worked on tentative collaboration agendas with various research centers and computer facilities, collaborated with tool developers towards the publication of a special journal issue featuring ACTS tools, responded to requests for ideas on how to efficiently deploy future high-end computing technologies [12], and worked on the development of a Python interface to the numerical libraries in the ACTS Collection [14].
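The Python interface mentioned above is described in [14] and is not reproduced here. As a rough analogy only, and not the PyACTS API, the sketch below uses SciPy as a stand-in to suggest the level of abstraction such a layer aims for: the user hands over a matrix and a right-hand side and obtains a solution from a single call, with the underlying tuned kernels hidden behind it.

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 1000
# 1-D Poisson-type test matrix (tridiagonal).
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x = spsolve(A, b)  # a single high-level call replaces an in-house solver
print("max residual:", float(np.abs(A @ x - b).max()))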


Figure 1 - The multiple interactions played by ACTS.

5. CONCLUSIONS

The issues that we address in this paper respond to the needs of the computational science community at large, where significant efforts are focused on the development of complex parallel codes and their optimization. This expensive process often requires specialized support and information about software tools, and becomes crucial if we consider that more complex physical and societal phenomena, along with the growth of computing resources, are driving the continuous growth of the range of computational science applications. Therefore, we foresee a great need not only for a state-of-the-art software repository, where tools and their ongoing developments are available and documented, but also for a collaborative infrastructure in which knowledge and expertise are captured and shared. A high-end software infrastructure will produce substantial performance information from interactions between algorithm, tool and application developers. We envision the creation of a database of performance data at the algorithm, tool and application levels. As a result, we will obtain better characterizations of the behavior of today's and future software technologies on HEC systems.

The benefits of deploying such an infrastructure can be measured in many ways. A wide range of scientific code developers and users benefit from a) information and education about state-of-the-art, high-end computational tools; b) the development and promotion of robust, effective, portable, usable, and durable software; c) the increased interoperability of tools, which promotes the evolution and adoption of current software development projects into future software technologies; d) multidisciplinary collaborations and the consequent accumulation of expertise; and e) spending less time on code development and having more time to devote directly to scientific discovery.

As we discussed in Section 3, users demand long-term support of the tools, which implies not only bug fixes, but also the addition of new features, porting, and documentation updates. Ideally, the tool developers should be in charge of these activities. It turns out that this model raises important issues about funding opportunities, since software maintenance has not been properly addressed in the various funding agencies' agendas. Nonetheless, the ACTS Project proposes a viable solution for bringing all these efforts to the computational science community while exploring mechanisms for extending the longevity of the software beyond its development phase. Therefore, projects like ACTS are important for maintaining a high quality software collection by not only attesting to the quality of the tools but also ensuring that users select the most suitable tools and tool functionalities, and make proper use of them. Clearly, this higher level of support distinguishes this type of project from other major software repositories.

6. REFERENCES

[1] National Coordination Office for Information Technology Research and Development, http://www.itrd.gov.
[2] S. Graham and M. Snir. The NRC Report on the Future of Supercomputing. CTWatch Quarterly, http://www.ctwatch.org, February 2005.
[3] D. Feitelson. The Supercomputer Industry in Light of the Top500 Data. Computing in Science & Engineering, Vol. 7, Jan/Feb 2005.
[4] E. Strohmaier, J. Dongarra, H. Meuer and H. Simon. Recent Trends in the Marketplace of High Performance Computing. CTWatch Quarterly, http://www.ctwatch.org, February 2005.
[5] CCA Forum, http://www.cca-forum.org.
[6] T. Drummond, O. Marques, J. Roman and V. Vidal. A Study of Robust Scientific Libraries for the Advancement of Sciences and Engineering. To appear in Proc. of the VECPAR 2004 Conference, Springer-Verlag.
[7] T. Drummond and O. Marques. The Advanced Computational Testing and Simulation Toolkit (ACTS): What can ACTS do for you? Technical Report LBNL-50414, May 2002.
[8] ACTS Information Center, http://acts.nersc.gov.
[9] T. Drummond and O. Marques. The ACTS Collection, Robust and High-Performance Tools for Scientific Computing: Guidelines for Tool Inclusion and Retirement. Technical Report LBNL-PUB-3175, November 2002.
[10] ACTS Events, http://acts.nersc.gov/events.
[11] Matrix of Applications, http://acts.nersc.gov/MatApps.
[12] T. Drummond, O. Marques and G. Wilson. An Infrastructure for the Creation of High End Scientific and Engineering Software Tools and Applications. Technical Report LBNL-PUB-3176, April 2003.
[13] DOE/SciDAC, http://www.er.doe.gov/scidac.
[14] N. Kang and L.A. Drummond. A First Prototype of PyACTS. Technical Report LBNL-53849, August 2003.


And Away We Go: Understanding The Complexity of Launching Complex HPC Applications

Il-Chul Yoon

Alan Sussman

Adam Porter

Dept. of Computer Science University of Maryland College Park, MD 20742

Dept. of Computer Science University of Maryland College Park, MD 20742

Dept. of Computer Science University of Maryland College Park, MD 20742

[email protected]

[email protected]

[email protected]

Keywords

1. INTRODUCTION
5 #/ 43$"/EO".I5I ˆ0".† -H= h y L™35B"#$ 5B"/MN N"/ "B ,FI ="Y50643#jR6P " "( († 4  64QL( K"LX7 M:*< {y$y>= h y 8o: K5B".#2 5B"/MS63"/44   "D›/  8@: xL™/35B". @4,>#†II  š 2 4 3#=vN,>#†#-". .N?"/ " V^ š v-N;`2."/ ˆ  š v-N0ž36† - =0 = ,B/L?"QL™/35B". B463PN#(6LX/3"/  #2  O  , 3/ /3$"5W^_$`@".5B"MS= ,0 @›/