Optimizing Performance of C++ Threading Libraries

András Fekete
Department of Computer Science
University of New Hampshire
Durham, NH 03824, USA
[email protected]

John M. Weiss
Department of Math and Computer Science
South Dakota School of Mines and Technology
Rapid City, SD 57701-3995, USA
[email protected]

Abstract

Multi-core architectures are now the standard on high-end computing devices, from desktop computers to tablets to cell phones. Multi-threaded applications can take full advantage of parallel hardware. This paper compares the performance of several C++ threading interfaces, and presents a theoretical framework for optimal utilization of parallel hardware.
1 Introduction

The Von Neumann computing model [17] dominated computer architecture for over half a century. In this serial processing model, program instructions are transferred over the computer bus, one at a time, to be decoded and executed in the CPU. This "Von Neumann bottleneck" limits the speed at which a program may be executed.

The serial processing bottleneck was not an issue for many years, due to ongoing improvements in computer hardware. This is quantified by an observation known as Moore's Law: computing performance doubled roughly every 18 months for over 50 years [6]. Exponential increases in performance cannot be sustained indefinitely, however. Eventually, fundamental physical limits associated with speed, size, power consumption, and heat dissipation impact the performance of single processors. The most straightforward way to increase performance is to increase clock speed, and clock speeds increased a thousand-fold between 1980 and 2000. But clock speeds of 4 GHz were achieved on the Pentium 4 processor in 2004, and have not increased much since then. As a result, chip manufacturers have turned to hardware parallelism to increase computing performance.

There are several different types of parallel architectures, but in the past decade, multi-core processors [14] have dominated computing platforms. In Flynn's taxonomy [13], multi-core architectures are classified as MIMD (multiple instruction, multiple data stream) systems, with each core capable of independent processing. Multi-core architectures aim to increase computing performance by taking advantage of parallelism [15], rather than by increasing the performance of a single CPU. These systems feature more than one "core" (CPU) on a
chip. Each core may have its own cache memory, but the cores typically share system memory (RAM). Dual-core and quad-core processor chips dominate the market today, with more cores gradually appearing on servers and high-performance computer systems. This "quiet revolution" has dramatically altered the computing landscape, but not all software fully utilizes multi-core hardware. For example, C++ only recently added concurrency support, in the 2011 standard [12]. Prior to this time, programmers were forced to rely upon external concurrency libraries such as POSIX Threads (Pthreads) [9], Boost Threads, and Open Multi-Processing (OpenMP) [8] for parallel processing. External libraries offer the advantage of a coherent concurrency model across different platforms and languages (such as C and Fortran), but incorporating concurrency features directly into the language core adds stability, portability, and optimization opportunities.

With a variety of different concurrency frameworks available, software developers may have difficulty choosing the best threading library for their application. In this study, we compare the performance of two C++11 concurrency frameworks (async and threads) with that of three external threading libraries (Pthreads, Boost Threads, and OpenMP). We also provide a theoretical framework for examining optimal utilization of parallel hardware.

Other researchers have attempted to understand and simplify the process of writing multi-threaded applications by creating models [7], creating an abstraction of the architecture [18], or examining the underlying virtual machine provided by the language [10]. This research takes a different approach: the overall goal is a simple metric on which to base decisions about how to split a program into smaller tasks so as to gain the largest speedup.
2 Concurrency in C++

Concurrency was added to the C++ language standard in 2011 [12]. Both the core language and its standard library are guaranteed to support multiple threads of control, and several different approaches to fundamental concurrency issues (synchronization, mutual exclusion, etc.) are provided. Concurrency interfaces in C++11 are implemented by the async() function and the methods of the thread class. These approaches are largely interchangeable, and differ primarily in that the async() function provides a simple mechanism for returning function results.

Prior to the C++11 standard, software developers were forced to rely upon external libraries for multi-threaded applications. Among these external libraries, POSIX threads, Boost threads, OpenMP, and MPI are the most widely used. MPI's message passing model is geared towards distributed (rather than multi-core) platforms, so MPI was not considered in this study.

POSIX threads (Pthreads) [9] were introduced in the 1980s, in an early effort to implement a portable interface for parallel programming in C. Boost is a set of libraries for C++ that provide support for a wide range of tasks, including concurrency, linear algebra, random numbers, regular expressions, and unit testing. Many of the newer C++ features have relied upon Boost as a test bed prior to formal adoption as part of the language standard. OpenMP is a multi-platform library that supports multiprocessing on many different processor architectures and operating systems, with language bindings to C, C++, and Fortran [8]. Multithreading support in OpenMP is implemented via #pragma directives to the preprocessor.
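To make these differences concrete, the following sketch (ours, for illustration; the work() routine and its argument values are hypothetical stand-ins for a benchmark function) launches the same computation via std::async, std::thread, and an OpenMP pragma. Note how std::async returns a future that carries the function's result back directly, whereas std::thread requires the caller to arrange result storage.

    #include <future>
    #include <thread>
    #include <cstdio>

    bool work( unsigned n )                  // hypothetical stand-in for a benchmark routine
    {
        unsigned acc = 0;
        for ( unsigned i = 0; i < n; i++ ) acc += i;
        return acc % 2 == 0;
    }

    int main()
    {
        // C++11 async: the future returns the function result directly
        std::future<bool> f = std::async( std::launch::async, work, 1000000u );
        printf( "async result: %d\n", (int)f.get() );

        // C++11 thread: results must be passed back through shared storage
        bool result = false;
        std::thread t( [&result]{ result = work( 1000000u ); } );
        t.join();
        printf( "thread result: %d\n", (int)result );

        // OpenMP: a preprocessor directive parallelizes the loop (compile with -fopenmp)
        bool results[4];
        #pragma omp parallel for
        for ( int i = 0; i < 4; i++ )
            results[i] = work( 1000000u + i );
        printf( "omp results: %d %d %d %d\n",
                results[0], results[1], results[2], results[3] );
    }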
3 Concurrency performance

In this study [2], two benchmarks were used to compare the performance of five concurrency frameworks (C++ async, C++ threads, Pthreads, Boost Threads, and OpenMP): primality testing by trial division [16] (prime), and a long series of assembly instructions (asmProc).

The prime benchmark was used in a previous study by one of the authors [11]. When testing a set of numbers for primality, almost all of the computation is parallelizable. In other words, N processors should provide close to the theoretical maximum N-fold speedup. Potential concurrency issues such as race conditions, deadlock, etc. [3][4][9] do not occur, since each primality test is independent of all others.

Code for primality testing by trial division is listed in Figure 1. The input value n is tested for primality by dividing by all possible odd factors between 3 and n/2. The routine does not short-circuit the primality test; it continues to divide by all factors, regardless of whether the number has already been determined to be non-prime.

    bool is_prime( unsigned n )
    {
        if ( n == 2 ) return true;               // 2 is the only even prime
        if ( n < 2 || n % 2 == 0 ) return false;

        // test all odd factors up to n / 2
        bool prime = true;
        for ( unsigned i = 3; i < n / 2; i += 2 )
            if ( n % i == 0 )
                prime = false;
        return prime;
    }
Figure 1: Primality testing by trial division
This benchmark, while simple and robust, does not necessarily reflect all types of real-world computations. Typical multi-threaded applications often involve longer serial instruction sequences. Such longer sequences of operations allow the processor to make better use of instruction prefetch, which in turn decreases the need for the kernel to swap the process out of the CPU for a different task. The assembly instruction benchmark (Figure 2) had roughly an order of magnitude more operations than the primality test benchmark.

    bool asmProc( unsigned n )
    {
        bool retval = false;
        for ( unsigned i = 0; i < n; i++ )
        {
            // long series of add, multiply, shift instructions
        }
        return retval;
    }
Figure 2: Assembly instruction benchmark

Both benchmarks were executed with an input number N proportional to the number of iterations in the benchmark loop. (For the prime benchmark, N was the number to be checked for primality.) These benchmarks are purely CPU bound, rather than memory or I/O bound.

Benchmark code was executed on a variety of hardware platforms, including older Intel quad-core CPUs, newer Intel dual-core i5 and quad-core i7 CPUs, and 16-core and 256-core Xeon CPUs [5]. The newer CPUs are all hyper-threaded, doubling the reported number of hardware threads. Software platforms included Windows and Linux. The GNU g++ compiler was used to compile C++ code, with various levels of optimization. Other than for Pthreads, optimization had little impact on these benchmarks.
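To illustrate the shape of such a measurement (a minimal sketch of ours, not the authors' actual harness, which is available at [2]; the input value is hypothetical), the following code times the prime benchmark of Figure 1 as the number of concurrently launched threads varies:

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    bool is_prime( unsigned n )              // the Figure 1 routine
    {
        if ( n == 2 ) return true;
        if ( n < 2 || n % 2 == 0 ) return false;
        bool prime = true;
        for ( unsigned i = 3; i < n / 2; i += 2 )
            if ( n % i == 0 ) prime = false;
        return prime;
    }

    int main()
    {
        const unsigned N = 100003;           // hypothetical input value
        for ( int p = 1; p <= 8; p++ )       // vary the number of threads
        {
            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for ( int i = 0; i < p; i++ )
                threads.emplace_back( is_prime, N );
            for ( auto &t : threads )
                t.join();
            double msec = std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - start ).count();
            printf( "threads=%d  time=%.1f msec\n", p, msec );
        }
    }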
4 Theoretical speedup

According to Amdahl's Law [1], the speedup gains from concurrency are a function of both the serial processing time and the parallelizable processing time:

    tp = tser + tpar / c                                    (1)
where tser is the serial processing time, tpar is the parallelizable processing time, and tp is the processing time to run on c cores (processors). In the real world, however, things do not work out quite so neatly.

Let us consider a serial process that takes t1 time to complete on one processor. Adding more processors will not speed up this process, since it is inherently serial in nature. However, multiple processes may be run concurrently on parallel hardware. For example, we may have a process that tests numbers for primality. On a 4-core machine, assuming complete utilization of the parallel hardware, up to 4 processes (primality tests) may execute simultaneously, in the same time it takes to run one process.
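As a worked illustration of Equation (1) (with numbers chosen by us for clarity, not drawn from the benchmarks): a job with tser = 2 sec of serial work and tpar = 8 sec of parallelizable work takes 10 sec on one core, but on c = 4 cores takes

    tp = 2 + 8 / 4 = 4 sec

for a 2.5× speedup. Even with unlimited cores, tp can never drop below the 2 sec serial portion, bounding the speedup at 5×.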
What happens when the number of threads exceeds the number of available processors? Depending on how concurrency is implemented, there are several possible outcomes, exemplified by two extremes. In the first scenario, concurrent execution is coarsely quantized. The scheduler allows the first 4 processes to run to completion in time t1, at which point a processor becomes available to run process 5. This process also takes time t1 to complete, so the total time is 2×t1. Alternatively, the scheduler may switch between tasks with finer granularity, using time slices too short for a process to complete before task switching. In this case we might expect a completion time of 5/4×t1. These are the extreme cases, and the actual completion time might lie anywhere between 5/4×t1 and 2×t1.

As illustrated in Figure 3, the red line shows serial processing time, which increases linearly with the number of processes (threads). The stair-step green line illustrates the coarse quantization scenario, and the blue line illustrates fine quantization.
Figure 3: Theoretical serial and parallel times for a 4-core processor running multiple threads
Figure 4: Expected speedup of a 4-core system
5 Results

Figures 5 and 6 in this section show the results of benchmarking the prime and asmProc routines, plotting speedup vs. number of processes for a 4-core system.
Figure 5: Measured speedup: asmProc benchmark
This analysis yields the following mathematical model:

    tser    = t1 × p                                        (2)
    tfine   = max( t1 × p / c, t1 )                         (3)
    tcoarse = t1 × floor( ( p + c - 1 ) / c )               (4)
where t1 is the serial time for one process to complete, p is the number of processes (threads), and c is the number of cores (processors). tser, tfine, and tcoarse are the serial, fine-grained parallel, and coarse-grained parallel processing times for p processes, respectively. From Equations (3) and (4), we can predict the expected speedup of a 4-core system for fine- and coarse-grained thread scheduling, respectively. This is illustrated in Figure 4.

Figure 6: Measured speedup: prime benchmark
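These predictions are easy to tabulate. The short program below (ours, for illustration) evaluates Equations (2)-(4) and prints the predicted fine- and coarse-grained speedups for a 4-core system:

    #include <algorithm>
    #include <cstdio>

    // Predicted processing times from Equations (2)-(4):
    // t1 = time for one process, p = processes (threads), c = cores.
    double t_ser( double t1, int p )           { return t1 * p; }
    double t_fine( double t1, int p, int c )   { return std::max( t1 * p / c, t1 ); }
    double t_coarse( double t1, int p, int c ) { return t1 * ( ( p + c - 1 ) / c ); }  // integer ceiling of p/c

    int main()
    {
        const double t1 = 1.0;                 // normalized single-process time
        const int c = 4;                       // number of cores
        for ( int p = 1; p <= 12; p++ )
            printf( "p=%2d  fine speedup=%.2f  coarse speedup=%.2f\n",
                    p,
                    t_ser( t1, p ) / t_fine( t1, p, c ),
                    t_ser( t1, p ) / t_coarse( t1, p, c ) );
    }

For example, at p = 5 and c = 4 this prints a fine-grained speedup of 4.0 (completion time 5/4×t1) and a coarse-grained speedup of 2.5 (completion time 2×t1), matching the two extremes discussed in Section 4.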
The measured timings match the predicted curves quite well. OpenMP appears to use coarse-grained scheduling on this system, whereas the other multithreading libraries allow fine-grained scheduling. The speedup drop at c+1 processes is small but reproducible, and is likely due to scheduler overhead.

Figures 7-10 show the impact of problem size (i.e., the time required to complete processing of the concurrent routine). Note that these curves are consistent across different benchmarks and thread counts. This indicates that, to achieve optimal speedup, a minimal amount of time must be spent in the concurrent routine (otherwise scheduler overhead reduces the concurrency gains). The exception is OpenMP, which is sensitive to the number of threads launched, as described earlier.
Figure 7: Impact of problem size: prime benchmark, 12 cores, 4 threads
Figure 8: Impact of problem size: prime benchmark, 12 cores, 32 threads

Figure 9: Impact of problem size: asmProc benchmark, 12 cores, 4 threads

Figure 10: Impact of problem size: asmProc benchmark, 12 cores, 32 threads

To further investigate the impact of scheduler overhead, we ran tests using a "no-op" benchmark: a routine which was called and simply returned immediately. Timing this routine gives a measure of the multithreading overhead for each of the concurrency libraries. Table 1 lists the results for different numbers of threads, ranging from 10^2 to 10^9 threads.

Table 1: Threadpool overhead (times in msec).

    log N    serial    async     omp       pthread   thread    boost
      2       0.001     0.571     1.445     1.583     0.444     0.542
      3       0.012     0.649     1.788     1.214     0.959     0.439
      4       0.125     0.647     0.234     1.534     0.536     0.580
      5       1.009     0.712     2.211     1.223     0.604     0.506
      6      10.73      1.753     2.534     3.565     1.592     1.944
      7     100.9      13.05     12.22     16.99     11.81     10.37
      8     988.0     103.2     100.9     101.9     101.0     103.3
      9    9878      1002      1004      1003      1001      1001
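The no-op measurement is straightforward to reproduce in spirit (a minimal sketch of ours using the C++11 thread interface, not the authors' threadpool harness [2]; here each thread is launched and joined serially, so the time measured is pure launch overhead):

    #include <chrono>
    #include <cstdio>
    #include <thread>

    void noop() { }                          // returns immediately; no useful work

    int main()
    {
        const int N = 100000;                // 10^5 thread launches
        auto start = std::chrono::steady_clock::now();
        for ( int i = 0; i < N; i++ )
        {
            std::thread t( noop );           // launch a thread that does nothing
            t.join();                        // and wait for it to finish
        }
        double msec = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - start ).count();
        printf( "%d no-op threads in %.3f msec (%.4f msec/thread)\n",
                N, msec, msec / N );
    }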
The time to execute the thread significantly exceeds the threadpool instantiation time beyond log(N) of approximately 7, which is consistent with the previous benchmark observations on optimal concurrency performance. A simple rule of thumb is that if each thread takes at least 10 msec to execute, substantially more time is spent processing data than on the 1-2 msec instantiation of the thread data structures. The reduction in expected speedup is governed by the ratio of instantiation time to thread execution time.
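To put numbers on this rule of thumb (our arithmetic, using the 1-2 msec overhead observed above): a thread that performs 10 msec of useful work but costs 1.5 msec to instantiate achieves roughly 10 / (10 + 1.5) ≈ 87% of the ideal speedup, whereas a 1 msec task achieves only about 1 / (1 + 1.5) ≈ 40%.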
To examine the impact of scalability, we ran benchmarks on a 256-core system. As shown in the following figures, the results resemble those expected of a 40-core system, since a machine of this nature rarely has only a single user running programs at a given time; at the measured 85% system load, only about 40 cores were available. As can be clearly seen from Figures 11 and 12, speedup is linearly related to the number of threads until the system is 100% utilized. At full utilization, it is up to the scheduler to allocate resources to each thread based on the other processes running on the machine.

Figure 11: Measured speedup: prime benchmark, 256 cores, 85% system load

Figure 12: Measured speedup: asmProc benchmark, 256 cores, 85% system load

The impact of hyper-threading depends on a number of factors, including compiler, degree of optimization, CPU model, and instruction sequence. On a 16-core Xeon system, the asmProc benchmark, with its longer instruction sequence, achieves higher speedups. This is shown in Figures 13 and 14.

Figure 13: Impact of hyper-threading: prime benchmark, 16 cores (32 hardware threads)

Figure 14: Impact of hyper-threading: asmProc benchmark, 16 cores (32 hardware threads)

6 Conclusions

In this study, we present a refined version of Amdahl's Law that takes scheduler granularity into account. Although fine-grained scheduler quantization appears to be the rule, our benchmarks demonstrate that coarse-grained scheduler quantization can be observed in some concurrency frameworks (notably OpenMP).
As expected, more cores equate to a greater concurrency speedup. But for schedulers with coarse granularity, optimal utilization of parallel hardware takes place when either the number of threads is an exact multiple of the number of cores, or when there are significantly more threads than processor cores. This conclusion is borne out in both theory and practice. The drop in performance when the number of threads first exceeds the number of cores is notable. As a result, it is generally best to maximize the number of threads in an application, in order to maximize the concurrency speedup. This reduces the impact of both scheduler granularity and stalled (e.g., I/O bound) threads.

We also examined the impact of scheduler overhead on concurrency optimization. Longer processes generally benefit more from parallelization, since the impact of concurrency overhead is reduced. In our tests, thread execution times of 100 msec or more were sufficient: they gave speedups close to the theoretical maximum while still splitting the serial task into smaller sub-tasks.

Finally, we tested two different benchmarks with five different concurrency frameworks on a variety of systems. Scalability was excellent for both benchmarks, with more processors giving an (expected) greater speedup. System load impacts performance as expected: only available processors yield concurrency speedups. Hyper-threading is not equivalent to doubling the number of processors, but benchmarks with longer instruction sequences performed as much as 30% better on hyper-threaded processors.

The concurrency framework (async, C++11 threads, Pthreads, Boost Threads, OpenMP) had relatively little impact on performance. This was somewhat surprising, since the low-level Pthread approach is closest to the hardware, and might be expected to execute most efficiently. In reality, Pthread code seldom ran faster than the other threading libraries, and in some cases actually performed worse. OpenMP has the highest level of abstraction, and its performance might be expected to suffer accordingly. The major performance impact observed in this study appears to be due to scheduler granularity. Hence the choice of concurrency framework should be based more on usability than on performance.

From a usability standpoint, Pthreads require low-level C code (function callbacks with void pointers) and a high degree of manual resource management. The C++11 concurrency interface offers type safety, elegant syntax, and more features, and is clearly superior for most tasks. The same is true of Boost Threads. OpenMP is remarkably easy to use, and offers Fortran bindings in addition to C/C++. However, it relies upon preprocessor decorations (#pragma omp) which hide the parallel implementation from the user, making it more difficult to achieve the same degree of concurrency control provided by the other multithreading libraries.

The performance increase from concurrent processing on multi-core processors seems well worth the relatively small coding effort. Barring unforeseen technological breakthroughs, it seems evident that concurrency frameworks will become increasingly important in the near future.
References

[1] Gene M. Amdahl, Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities, AFIPS Conference Proceedings (30), pp. 483-485, 1967.
[2] Andras Fekete, CPP Thread Tester, https://github.com/bandi13/CPP-Thread-tester, 2015.
[3] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, 2012.
[4] W. Hwu and D. Kirk, Programming Massively Parallel Processors: A Hands-on Approach, Elsevier Science, 2010.
[5] Intel processors, http://ark.intel.com, Dec 2014.
[6] G. E. Moore, Cramming More Components onto Integrated Circuits, Electronics, pp. 114-117, 1965.
[7] Geoffrey Nelissen, Vandy Berten, Joel Goossens, and Dragomir Milojevic, Techniques Optimizing the Number of Processors to Schedule Multi-Threaded Tasks, Proc. Euromicro Conf. Real-Time Syst., pp. 321-330, 2012.
[8] OpenMP, http://openmp.org/wp, Dec 2014.
[9] POSIX threads, http://en.wikipedia.org/wiki/POSIX_Threads, Dec 2014.
[10] Jennifer B. Sartor and Lieven Eeckhout, Exploring Multi-Threaded Java Application Performance on Multicore Hardware, SIGPLAN, vol. 47, no. 10, pp. 281-296, 2012.
[11] John Weiss, Comparison of POSIX Threads, OpenMP and C++11 Concurrency Frameworks, 30th International Conference on Computers and Their Applications, pp. 251-256, 2015.
[12] Wikipedia: C++11, http://en.wikipedia.org/wiki/C++11, Dec 2014.
[13] Wikipedia: Flynn's taxonomy, http://en.wikipedia.org/wiki/Flynn%27s_taxonomy, Dec 2014.
[14] Wikipedia: Multi-core processors, http://en.wikipedia.org/wiki/Multi-core_processor, Dec 2014.
[15] Wikipedia: Parallel computing, http://en.wikipedia.org/wiki/Parallel_computing, Dec 2014.
[16] Wikipedia: Primality testing, http://en.wikipedia.org/wiki/Primality_test, Dec 2014.
[17] Wikipedia: Von Neumann architecture, http://en.wikipedia.org/wiki/Von_Neumann_architecture, Dec 2014.
[18] Ruken Zilan, Javier Verdu, Jorge Garcia, Mario Nemirovsky, Rodolfo Milito, and Mateo Valero, An Abstraction Methodology for the Evaluation of Multicore Multi-threaded Architectures, IEEE Int. Work. Model. Anal. Simul. Comput. Telecommun. Syst. Proc., pp. 478-481, 2011.