study because it's a computing-intensive, widely used, and .... had no time left for fine-tuning the parallel ver- ... 10 days, they spent two to three hours a day on.
feature
parallel programming
Parallelizing Bzip2:
A Case Study in Multicore Software Engineering Victor Pankratius, Ali Jannesari, and Walter F. Tichy, University of Karlsruhe
As multicore computers become mainstream, developers need to know which approaches to parallelism work. Four teams competitively parallelized the Bzip2 compression algorithm. The authors report lessons learned.
70
IEEE SOFT WARE
M
ulticore chips integrate several processors on a single die, and they’re quickly becoming widespread. Being affordable, they make it possible for every PC user to own a truly parallel computer, but they also make parallel programming a concern for more software developers than ever before. Not only is parallel programming considered difficult, but experience with parallel software is limited to a few areas, such as scientific computing, operating systems, and databases. Now that parallelism is within reach for new application classes, new software engineering questions arise. In the young field of multicore software engineering, many fundamental questions are still open, such as what language constructs are useful, which parallelization strategies work best, and how existing sequential applications can be reengineered for parallelism. At this point, there is no substitute for answering these questions than to try various approaches and evaluate their effectiveness. Previous empirical studies focused on either numeric applications or computers with distributed memory,1–3 but the resulting observations don’t necessarily carry over to nonnumeric applications and shared-memory multicore computers. We conducted a case study of parallelizing a real program for multicore computers using currently available libraries and tools. We selected the sequential Bzip2 compression program for the study because it’s a computing-intensive, widely used, and relevant application in everyday life. Its source code is available, and its algorithm is welldocumented (see the sidebar “Bzip Compression Fundamentals”). In addition, the algorithm is non-
Published by the IEEE Computer Society
trivial, but, with 8,000 LOC, the application is small enough to manage in a course. The study occurred during the last three weeks of a multicore software engineering course. Eight graduate computer science students participated, working in independent teams of two to parallelize Bzip2 in a team competition. The winning team received a special certificate of achievement.
Competing Team Strategies
Prior to the study, all students had three months’ extensive training in parallelization with Posix Threads (PThreads) and OpenMP (see the sidebar, “Parallel Programming with PThreads and OpenMP”) and in profiling strategies and tools. The teams received no hints for the Bzip2 parallelization task. They could try anything, as long as they preserved compatibility with the sequential version. They could reuse any code—even from existing Bzip2 parallel implementations,4–6 although these implementations were based on older versions of the sequential program and weren’t fully compatible with the current version. 0 74 0 -74 5 9 / 0 9 / $ 2 6 . 0 0 © 2 0 0 9 I E E E
Bzip Compression Fundamentals Bzip uses a combination of techniques to compress data in a lossless way. It divides an input file into fixed-sized blocks that are compressed independently. It feeds each block into a pipeline of algorithms, as depicted in Figure A. An output file stores the compressed blocks at the pipeline’s end in the original order. All transformations are reversible, and the stages are passed in the opposite direction for decompression. ■■ Pipeline stage 1. A Burrows-Wheeler transformation (BWT) reorders the characters on a block in such a way that similar characters have a higher probability of being closer to one another.1 BWT changes neither the length of the block nor the characters. ■■ Pipeline stage 2. A move-to-front (MTF) coding applies a locally adaptive algorithm to assign low integer values to symbols that reappear more frequently.2 The resulting vector can be compressed efficiently. ■■ Pipeline stage 3. The well-known Huffman compression
Burrows-Wheeler Input transformation (BWT) file Stage 1
Move-to-front (MTF) coding
Huffman compression
Stage 2
Stage 3
We asked the teams to document their work from the beginning—including their initial strategies and expectations, the difficulties they encountered during parallelization, their approach, and their effort. In addition to these reports, we collected evidence from personal observations, the submitted code, the final presentations, and interviews with the students after their presentations.7 Because of space limitations, we omit a number of details here, but more information (including threats to validity) is available elsewhere.8
Team 1 The first team tried several strategies. They started with a low-level approach, using a mixture of OpenMP and PThreads. Then they restructured the code by introducing classes. As the submission deadline approached, they reverted to an earlier snapshot and applied some ideas from the BzipSMP parallelization.5 Team 1’s plan was to understand the code base (one week), parallelize it (one week), and test and debug the parallel version (one week). Actual work quickly diverged from the original plan. At the
technique is applied to the vector obtained in the previous stage. Julian Seward developed the open source implementation of Bzip2 that we used in our case study.3 It lets block sizes vary in a range of 100 to 900 Mbytes. A low-level library comprises functions that compress and decompress data in main memory. The sorting algorithm that’s part of the BWT includes a sophisticated fallback mechanism to improve performance. The high-level interface provides wrappers for the low-level functions and adds functionality for dealing with I/O. References
1. M. Burrows and D.J. Wheeler, A Block-Sorting Lossless Data Compression Algorithm, tech. report 124, Digital Equipment Corp., 10 May 1994. 2. J.L. Bentley et al., “A Locally Adaptive Data Compression Scheme,” Comm. ACM, vol. 29, no. 4, 1986, pp. 320–330. 3. J. Seward, Bzip2 v. 1.0.4, 20 Dec. 2006; www.bzip.org.
Compressed output file
Figure A. The Bzip2 stages. The input file is divided into fixed-block sizes that are compressed independently in a pipeline of techniques.
beginning, the team invested two hours to get a code overview and find the files that were relevant for parallelization. They spent another three to four hours to create execution profiles with gprof (www.gnu.org/software/binutils), KProf (http:// kprof.sourceforge.net), and Valgrind (http:// valgrind.org). The team realized that they had to choose input data carefully to find the critical path and keep the data sizes manageable. They invested another two hours in understanding code along the critical path. Understanding the code generally and studying the algorithm took another six hours.9 Thereafter, they decided that parallel processing of data blocks was the most promising approach, but they had problems unraveling existing data dependencies. The team continued with a parallelization at a low abstraction level, taking about 12 hours. In particular, they parallelized frequently called code fragments with OpenMP and exchanged a sorting routine for a parallel Quicksort implementation using PThreads. However, the speedup was disappointing. The team decided to refactor the code and November/December 2009 I E E E S O F T W A R E
71
Parallel Programming with PThreads and OpenMP PThreads and OpenMP add parallelism to C in two different ways. PThreads is a thread library, while OpenMP extends the language. PThreads Posix Threads (PThreads) is a threading library with an interface specified by an IEEE standard. PThreads programming is quite low level. For example, pthread_create(...) creates a thread that executes a function; pthread_mutex_lock(l) blocks lock l. For details, David Butenhof has written a good text.1 OpenMP OpenMP defines pragmas—that is, annotations—for insertion in a host language to indicate the code segments that might be executed in parallel. Effectively, OpenMP thus extends the host language. In contrast to PThreads, OpenMP abstracts away details such as the explicit creation of threads. However, the developer is still responsible for correctly handling locking and synchronization. With OpenMP, you parallelize a loop with independent iterations by inserting a pragma before the loop. The following example illustrates a parallel vector addition:
#pragma omp parallel for for(i=0; i