Parallel Programming: Can we PLEASE get it right this time?
Tim Mattson
Intel Corporation, Dupont, WA
[email protected]
Michael Wrinn
Intel Corporation, Hillsboro, OR
[email protected]
ABSTRACT
The computer industry has a problem. As Moore’s law marches on, it will be exploited to double cores, not frequencies. But all those cores, growing to 8, 16 and beyond over the next several years, are of little value without parallel software. Where will this come from? With few exceptions, only graduate students and other strange people write parallel software. Even for numerically intensive applications, where parallel algorithms are well understood, professional software engineers almost never write parallel software. Somehow we need to (1) design manycore systems that programmers can actually use and (2) provide programmers with parallel programming environments that work. The good news is we have 25+ years of history in the HPC space to guide us. The bad news is that few people are paying attention to this experience. This talk looks at the history of parallel computing to develop a set of anecdotal rules to follow as we create manycore systems and their programming environments. A common theme is that just about every mistake we could make has already been made by someone. So rather than reinvent these mistakes, let’s learn from the past and “do it right this time”.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming – parallel programming
General Terms
Algorithms, Performance, Design, Experimentation, Human Factors, Languages
Keywords
Parallel computing, Design patterns
1. INTRODUCTION
The computer industry has enjoyed a 30-year “value ramp”, where Moore’s law drove processor improvements, which in turn drove upgrade cycles of ever-increasing capabilities. Moore’s law continues, with several generations of transistor doubling anticipated. However, single-thread performance, as measured by SPECint 2000, has begun to level off, and no longer tracks the increase in transistor count; this is at least partially attributable to design elements (out-of-order execution, prefetching, etc.) reaching points of diminishing return. Also, power consumption limitations have curbed further increases in processor frequency. In response to this situation, the industry is shifting to multi-core designs, where, in principle, greater performance can be realized at lower power levels. Achieving this performance, however, requires concurrent software, until now relegated to a small niche requiring specialist, even heroic, efforts. The challenge, then: how best to shift software design from its historic, inherently serial approach to one which incorporates concurrency?
2. PARALLEL PROGRAMMING: LESSONS FROM HISTORY
An ideal solution would automatically exploit concurrency through techniques such as speculative multithreading or automatic parallelization of loops. As yet, such “implicit” approaches have not proved encouraging; a study [1] of the SPECint benchmarks found gains in the range of 8-15% for a dual core system. Research continues, but does not suggest a near-term solution. Thus, the job of realizing concurrency falls to software developers. The “parallel programming problem” has been addressed in high performance computing for at least 25 years, with the result that, still, only a small number of specialized developers write parallel code. With multicore systems becoming ubiquitous, there is some hope for an “if you build it, they will come” effect: more systems give more opportunity for development, thus creating a demand for better tools, making the task more manageable, in turn drawing in more developers, all in a virtuous feedback cycle. While this may happen, it rests on the interesting premise that 25 years of PhD-level work on parallel systems was insufficiently diligent. We propose to understand this preceding “massively parallel programming” era as one of exploration and trial-and-error, replete with insights as to what works and, especially, what does not. Following are some lessons from history, to guide work as the industry shifts to the “manycore parallel” era.
2.1 Rely on a Small Number of Good Technologies Built on Current Usage.
Throughout the 1990s, hundreds of parallel programming technologies were created and pushed into the programming community. The adoption rate was very low. Why? We believe part of the problem was choice overload: the tendency of a consumer, when presented with too many choices, to walk away without making a choice. This phenomenon was shown in the “Draeger Grocery Store” study [2], in which researchers set up two displays of gourmet jams, one with 24 jars, the other with 6. Each display invited people to try the jams and offered a discount coupon for purchase. The researchers alternated the displays and tracked how many people passed them, how many stopped and sampled the jams, and how many subsequently used the coupon to buy jam. The results were surprising:
• 24 jar display: 60% of the people passing the display sampled the jam, 3% purchased it.
• 6 jar display: 40% of the people passing the display sampled the jam, 30% purchased it.
The larger display was better at getting people’s attention, but did not lead to a purchase; too much choice, it appears, is demotivating. Selecting a gourmet jam is a trivial decision; perhaps “choice overload” does not apply to more important ones? Further studies considered weightier choices such as 401(k) plans [3], and the phenomenon of choice overload persisted. Choice overload is real. When we present application programmers with a myriad of parallel programming environments, we overwhelm them and decrease the adoption of the technology. We may indeed need new languages for parallel programming, but we must be very careful to create them only when absolutely necessary, when existing languages cannot address the problems. Even then, it is better to fix the languages we have before creating new ones. For example, OpenMP [4] is a well-known API for programming shared memory machines. It was created primarily to address the needs of scientific programming applications dominated by straightforward iterative loop structures. OpenMP, however, was unable to directly handle more general loop cases, such as this pointer-following loop:
nodeptrlist list, p;
for (p = list; p != NULL; p = p->next)
    process(p->date);
Rather than abandon OpenMP for a new API, we extended it, adding a task construct in OpenMP 3.0:
nodeptrlist list, p;
#pragma omp parallel
{
    #pragma omp single
    {
        for (p = list; p != NULL; p = p->next)
            #pragma omp task firstprivate(p)
            process(p->date);
    }
}
2.2 In Developing Parallel Program Technologies, Target Production Level Applications and Their Developers.
A great deal of work in computer science is done on reduced-size, or toy, problems. This makes sense, since the goal is to understand the principles and models behind the computing, not to produce complex full-featured applications. When making the transition from research to production programming environments, however, a reliance on toy problems can be dangerous and lead to erroneous conclusions. An unfortunate example of this sort arose with HPF, High Performance Fortran [5], developed in the early 90s to be the common language supporting parallel programming for scientific applications. The HPF Forum developed the language using numerous toy problems and academic benchmarks, but when the language was completed, few if any software vendors would use it. Part of the problem was delay in shipping compilers conforming to the standard. A bigger problem was that the model was wrong: HPF was based on a strict data parallel programming model common to the large SIMD supercomputers of the era. Production level applications, even when largely data parallel, don’t map well onto this model. These applications invariably include computations that are fundamentally task parallel. As a result, few software vendors adopted HPF. The HPF community in Japan later amended the standard by adding task parallel constructs [6], but it proved too late for the larger international community, and HPF basically died.
2.3 To Achieve Commercial Impact, Enlist Industry Stakeholders from Inception to Deployment.
The computer industry is interested in solutions. Academia is interested in research agendas. The goal in academia is to get funding and publish papers. We are not denigrating the academic community, as they play a vital role in developing the foundational ideas behind successful technology. As the focus moves from development to deployment, however, the solution-oriented focus of industry is vital. The best way to assure that focus is to keep the industry team actively involved from the beginning. A case in point in the history of parallel computing comes from the Message Passing Forum and MPI [7] (the Message Passing Interface). MPI was created by a consortium of national labs, universities and computing industry representatives in the early to mid 90’s, with first release in 1994. Implementations of MPI 1.0 were available immediately in the public domain and shortly thereafter from all major parallel computer vendors. The MPI forum continued their work refining MPI specifications, releasing MPI 2.0 in 1997. Industry representatives were involved, but the leadership in MPI 2.0 emphasized research over industrial concerns. The result: although the MPI 2.0 standard appeared in 1997, it wasn’t until 2004 that a full implementation was generally available. Large portions of the MPI 2.0 standard, such as one-sided communication, dynamic process models, and parallel I/O, were poorly understood and, in retrospect, were research agendas rather than established technologies ready for standardization.
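The task-based list traversal from Section 2.1 can be turned into a complete program. The following is a minimal sketch, not the authors' code: the paper does not define its node type or process(), so node_t, link_nodes, process, and sum_with_tasks are hypothetical stand-ins chosen for illustration. Each node's work is assumed to be independent, and the per-node results are combined with an atomic update.

```c
#include <stddef.h>

/* Hypothetical node type: the paper's nodeptrlist and process() are not
   defined in the text, so everything named here is an assumption. */
typedef struct node {
    long value;
    struct node *next;
} node_t;

/* Link an array of n nodes into a singly linked list holding 1..n.
   Returns the head of the list, or NULL when n is 0. */
node_t *link_nodes(node_t *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        nodes[i].value = (long)(i + 1);
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    return n > 0 ? &nodes[0] : NULL;
}

/* Stand-in for process(): any independent per-node computation. */
long process(const node_t *p)
{
    return p->value * p->value;
}

/* Sum process(p) over the list, creating one OpenMP task per node. */
long sum_with_tasks(node_t *list)
{
    long total = 0;
    #pragma omp parallel
    {
        /* One thread walks the list and creates the tasks;
           the other threads in the team execute them. */
        #pragma omp single
        {
            for (node_t *p = list; p != NULL; p = p->next) {
                /* firstprivate(p) captures the current node pointer
                   at the moment the task is created. */
                #pragma omp task firstprivate(p) shared(total)
                {
                    long r = process(p);
                    /* Tasks may run concurrently, so the shared
                       accumulator is updated atomically. */
                    #pragma omp atomic
                    total += r;
                }
            }
        }
        /* All tasks have completed by the barrier that ends the region. */
    }
    return total;
}
```

A caller would declare an array such as node_t nodes[4], link it with link_nodes(nodes, 4), and pass the returned head to sum_with_tasks. With GCC, building with -fopenmp enables the pragmas; without OpenMP support the pragmas are ignored and the same loop runs serially, which is the incremental-parallelism property that made extending OpenMP attractive.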
2.4 Work on the Important Problems, Not Your Favorite Problems.
Judging by the technical programs at major computer science conferences over the last few years, one would conclude that the single most important technology for parallel programming is transactional memory. Examine that conclusion, however, in light of the pressing issues in the exploitation of concurrency, listed here as the “Top 10 issues in parallel computing.” (This list was produced by a group of experienced parallel programmers at Intel, and validated in conversations with parallel programmers across industry and academia.)
1. Finding concurrent tasks in a program. How to help programmers “think parallel”?
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks and doing it in a way that our target audience will be able to use correctly.
4. Supporting scalability, hardware: bandwidth and latencies to memory plus interconnects between processors to help applications scale.
5. Supporting scalability, software: libraries, scalable algorithms, and adaptive runtimes to map high level software onto platform details.
6. Synchronization constructs (and protocols) that let programmers write programs free from deadlock and race conditions.
7. Tools, APIs and methodologies to support the debugging process.
8. Error recovery and support for fault tolerance.
9. Support for good software engineering practices: composability, incremental parallelism, and code reuse.
10. Support for portable performance. What are the right models (or abstractions) so programmers can write code once and expect it to execute well on the important parallel platforms?
Transactional memory (TM) addresses only items 6, 8 and 9. Item 6, synchronization, can be addressed by fine-grained locks, and it is debatable whether TM will really address item 9, composability. We need more focus on the top of the list (finding concurrency and managing data locality) and less on fashionable topics such as TM.
2.5 Attack the Parallel Programming Problem by Systematic, Scientific Methodologies.
Parallel programming research is dominated by an engineering perspective: build it, show that “it” works, and move on. We contrast this approach with a more deliberative scientific approach, where a hypothesis is proposed, experiments are conducted to test the hypothesis, predictive theories are developed, and results are peer reviewed. This scientific approach is largely absent in parallel programming. We lack a body of theory to drive research in parallel programming, as programming is a human endeavor that involves cognition and psychology more than physics and mathematics. Given the inherently subjective nature of programmability, it’s easier to fall back on counting lines of code or measuring benchmark performance than to grapple with the effectiveness of different programming constructs for solving different programming problems. Even without a theoretical foundation, we can still replace our “engineering” perspective with a more scientific approach. How to accomplish this will be described in the following section.
3. DEVELOPING A SYSTEMATIC APPROACH TO PARALLEL COMPUTING: HUMAN-CENTERED MODELS, LANGUAGES, AND METRICS
Parallel programming has developed along informal, empirical lines. Given the shortage of parallel programmers, however, we need to accept the fact that these informal approaches are not working. To succeed with parallel programming in the multi-core era, we must adopt a systematic, measured approach informed by insight into how programmers think: proceed from a human-centered model, develop a language for evaluating programming techniques in terms of that model, and define metrics to evaluate progress.
3.1 Design Patterns: a Model of How Domain Experts Think.
Research on parallel programming languages is important, but more important is to understand how programmers conceive of parallel algorithms and express them in a working parallel program. This process is only loosely related to the language used. Design patterns have been used as a notation to capture how experts in a given domain think about and approach their work. In [8], a design pattern language is presented as a representation of the thought process used in developing parallel programs. The pattern language is based on four design spaces, each a progressively deeper level of detail: Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms. The Finding Concurrency design space addresses the central problem in parallel programming: where is the concurrency, how is it to be exposed, and what are the dependencies that must be managed when exploiting concurrency? The idea is that parallel programming is fundamentally a task-oriented problem. Even when working with a data parallel notation as the final target, the parallel algorithm is still fundamentally understood in terms of a collection of concurrent tasks. Dual to the tasks are the data they work on, which must be decomposed into largely independent units that can be updated in parallel. In a sense, the task decomposition and data decomposition become a dual representation of the problem. With the decompositions in hand, the programmer can group tasks into logical collections and then identify dependencies arising from the tasks or the data. The Algorithm Structure design space takes the tasks, data, groups and dependencies and provides a framework for organizing them into an effective parallel algorithm. It employs a decision tree (Figure 1) to guide the programmer to the appropriate structure based on the dominant features of the problem. For example, are computations best understood in terms of the decompositions of the data, or as an ensemble of tasks, or by the flow of data between groups of tasks? Approaches are further refined based on, for example, whether decompositions proceed from linear or recursive structures. The decision sequence leads to a detailed choice of programming strategy.
effort to change the scheduling of parallel loop iterations, in the OpenMP API, this is accomplished with a single schedule clause:
#pragma omp parallel for reduction(+:sum) private(x) schedule(dynamic)
for (i=1; i