Optimizing Matrix Transposes Using a POWER7 Cache Model and Explicit Prefetching

Gabriel Mateescu
École Polytechnique Fédérale de Lausanne, Blue Brain Project, 1015 Lausanne, Switzerland
[email protected]

Gregory H. Bauer and Robert A. Fiedler
National Center for Supercomputing Applications, 1205 W. Clark St., Urbana, IL 61801, USA
{gbauer, rfiedler}@ncsa.illinois.edu

ABSTRACT

We consider the problem of efficiently computing matrix transposes on the POWER7 architecture. We develop a matrix transpose algorithm that uses cache blocking, cache prefetching, and data alignment. We model the POWER7 data cache and memory concurrency and use the model to predict the memory throughput of the proposed matrix transpose algorithm. The performance of our matrix transpose algorithm is up to five times higher than that of the dgetmo routine of the Engineering and Scientific Subroutine Library and is 2.5 times higher than that of the code generated by compiler-inserted prefetching. Numerical experiments indicate a good agreement between the predicted and the measured memory throughput.
Keywords
Matrix Transpose, Cache, Prefetching, POWER7

Copyright is held by author/owner(s).
1. INTRODUCTION
Exploiting the memory bandwidth of a processor architecture is essential for achieving high performance of all but CPU-bound scientific codes. Matrix transposition is a common kernel of scientific codes that utilize methods such as multidimensional fast Fourier transforms. Efficiently using the cache and hiding the memory latency in matrix transposition requires reorganizing the straightforward implementation of transposition. We consider the problem of transposing a matrix out-of-place (a memory-bound operation) [3] and apply tiling (blocking) and prefetching to improve the performance of matrix transpose operations by harnessing the characteristics of the processor architecture. Blocking improves the cache utilization and prefetching hides the memory latency. We model the time taken by matrix transpose in terms of the block size, the cache line size, the cache access time, and the time to issue prefetch instructions.

Three approaches to prefetching are compared: (1) manual prefetching; (2) hardware streams; and (3) compiler-inserted prefetching. We also compare the performance of our approach with that of the out-of-place, double-precision matrix transpose routine dgetmo of the IBM Engineering and Scientific Subroutine Library (ESSL).

This work was performed on the IBM POWER7 processor [4, 9], which contains a number of innovations that enable delivering exceptionally high memory bandwidth to applications, even when all processor cores are utilized. The POWER7 chip is the processor used for the IBM Power Enterprise series high-performance computing systems [8].
2. PROBLEM FORMULATION
Let A be a real matrix of order n, where A = {a_{i,j} | 1 ≤ i, j ≤ n, a_{i,j} ∈ R}. We consider the out-of-place matrix transpose problem: construct the matrix A^T = {a^T_{j,i} | 1 ≤ i, j ≤ n} at memory locations disjoint from A such that a^T_{j,i} = a_{i,j} for 1 ≤ i, j ≤ n. We assume that A and A^T are stored in row-major order, and that the machine's RAM size is large enough for A and A^T to fit simultaneously in memory.

The basic out-of-place matrix transpose procedure is shown in Algorithm 1, where the elements of A are copied to A^T by traversing A in row-major order and A^T in column-major order.

Algorithm 1 Matrix Transpose Operation
1. for (i = 0; i < n; i = i + 1) do
2.   for (j = 0; j < n; j = j + 1) do
3.     aT[j][i] = a[i][j]
4.   end for
5. end for

We employ tiled copy, whereby A is divided into square blocks (tiles) of order B, as shown in Algorithm 2. B is the number of elements in a row or a column of a square tile in A or A^T. Assuming that B evenly divides n, the number of blocks in A and A^T is n_B^2 = (n/B)^2, where n_B = n/B.

Algorithm 2 Tiled Matrix Transpose Operation
1. for (ib = 0; ib < n; ib = ib + B) do
2.   for (jb = 0; jb < n; jb = jb + B) do
3.     for (i = ib; i < ib + B; i = i + 1) do
4.       for (j = jb; j < jb + B; j = j + 1) do
5.         aT[j][i] = a[i][j]
6.       end for
7.     end for
8.   end for
9. end for

We design our matrix transpose algorithm such that we achieve high memory throughput by harnessing features of the POWER7 micro-architecture.
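As a concrete illustration, a minimal C rendering of Algorithm 2 is given below. The function name, the flat row-major layout, and the row-stride parameter lda are our own choices (the paper does not list its source code); lda is kept separate from n so that the padded rows introduced in subsection 3.3 can be accommodated.

/* Minimal sketch of the tiled (blocked) transpose of Algorithm 2.
 * 'lda' is the leading dimension (row stride, in elements) of both arrays;
 * B is assumed to divide n evenly, as in the text. */
#include <stddef.h>

static void transpose_tiled(const double *a, double *at,
                            int n, int lda, int B)
{
    for (int ib = 0; ib < n; ib += B)
        for (int jb = 0; jb < n; jb += B)
            for (int i = ib; i < ib + B; i++)
                for (int j = jb; j < jb + B; j++)
                    at[(size_t)j * lda + i] = a[(size_t)i * lda + j];
}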
3. APPROACH AND MODEL
We model and implement three widely used techniques for improving the performance of the matrix transpose operation: (1) cache prefetching, to avoid compulsory cache misses; (2) cache blocking, to increase the cache hit rate by improving temporal locality; and (3) data alignment, to avoid cache conflict misses. We use the model to derive an expression for the time to perform the transpose.
3.1 Tiled Transpose with Prefetch
Prefetching can hide the latency associated with loading (or storing) data from (or to) memory. Prefetching works by launching the loading of the data from memory into the cache at a point in the program execution that is ahead of the point where the data is used, so that the data has been loaded into the cache by the time the program execution reaches the point where it needs the data. In loop-nest computations, prefetching launches the loading of (or storing into) a cache line a number of iterations in advance of the iteration in which the data in that cache line is used.

We improve Algorithm 2 by adding instructions to prefetch the cache lines that will be needed in subsequent iterations. Algorithm 3 performs the transpose operation using both tiling and prefetching.

Algorithm 3 Tiled Matrix Transpose with Prefetch
1. for (ib = 0; ib < n; ib = ib + B) do
2.   for (jb = 0; jb < n; jb = jb + B) do
3.     for (i = ib; i < ib + B; i = i + 1) do
4.       jr = jb + Sr;
5.       jc = jb + Sc + i;
6.       Prefetch(a[i][jr : jr + B])
7.       Prefetch(aT[jc][i : i + B])
8.       for (j = jb; j < jb + B; j = j + 1) do
9.         aT[j][i] = a[i][j]
10.       end for
11.     end for
12.   end for
13. end for

We denote by block-row and block-column, respectively, the segment of a row and of a column of A (or A^T) that belongs to a block. A block-row has B elements that are double precision numbers, so a block-row has 8 B bytes, that is, N_PF = 8 B / L cache lines. Algorithm 3 traverses the blocks of A in row-major order and the blocks of A^T in column-major order. We number the blocks of A and A^T in the order in which we traverse them.

Consider the loop at lines 3-11 in Algorithm 3, henceforth called the i-loop. The i-loop is executed once for each block. Let b denote the block of A that is currently being transposed to A^T, where b = (j_b + n_B × i_b)/B and 0 ≤ b ≤ n_B^2 − 1. In each iteration of the i-loop, the algorithm concurrently transposes one block-row of block b of A and prefetches one block-row, i.e., N_PF cache lines, of block b+1 of A and of block b+1 of A^T. Prefetching is initiated by way of two prefetch instructions located at lines 6 and 7 in Algorithm 3:

1. Prefetch(a[i][jr : jr + B]): prefetch one block-row in the next block of A;
2. Prefetch(aT[jc][i : i + B]): prefetch one block-row in the next block of A^T.

Define the prefetch stride to be the distance between the matrix element being copied from the current block of A (or A^T) and the matrix element being prefetched from the next block of A (or A^T). Denote the prefetch stride for A and A^T by S_r and S_c, respectively, where the subscripts r and c indicate that A is traversed in row-major order and A^T is traversed in column-major order. Let R be the number of matrix elements reserved for storing a row of A. R includes the n elements of a row as well as padding elements that avoid conflict cache misses; the computation of R is given in subsection 3.3. The prefetch stride S_r is

S_r = B              if j_b < J_b
S_r = B × R − j_b    otherwise                                  (1)

where j_b is the column index of the first column in block b of A, J_b = max{j_b}, and j_b = J_b means that the block anchored at (i_b, j_b) is the last block along a row of A. The prefetch stride S_c is

S_c = B × R          if j_b < J_b
S_c = B − j_b × R    otherwise                                  (2)

where j_b is the row index of the first row in block b of A^T, J_b = max{j_b}, and j_b = J_b means that the block anchored at (j_b, i_b) is the last block along a column of A^T.

In the implementation of Algorithm 3, the inner loop j will be unrolled, so that the inner-most loop will be the i-loop. The unrolled j loop will be replaced with B load instructions and B store instructions.
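To show how the pieces of Algorithm 3 fit together in code, the sketch below renders one iteration of the i-loop in C. It is an approximation of our own: __builtin_prefetch stands in for the DCBT prefetch-for-load, the DCBZ prefetch-for-store is only indicated in a comment, and the j loop is left rolled, whereas the paper's implementation unrolls it into B loads and B stores. The stride lda is assumed to equal the padded row length R of subsection 3.3.

/* Hedged sketch of one iteration of the i-loop of Algorithm 3. */
#include <stddef.h>

static void transpose_block_row(const double *a, double *at,
                                int i, int jb, int B, int lda,
                                long Sr, long Sc)
{
    const int LINE_DOUBLES = 128 / 8;   /* cache line L = 128 bytes = 16 doubles */
    long jr = jb + Sr;                  /* line 4 of Algorithm 3 */
    long jc = jb + Sc + i;              /* line 5 of Algorithm 3 */

    /* Line 6: prefetch-for-load of one block-row of the next block of A.
     * The paper issues DCBT; __builtin_prefetch is a compiler-builtin stand-in. */
    for (int k = 0; k < B; k += LINE_DOUBLES)
        __builtin_prefetch(&a[(size_t)i * lda + jr + k], 0);

    /* Line 7: prefetch-for-store of one block-row of the next block of AT.
     * The paper issues DCBZ (zero the target lines); with GCC-style inline
     * assembly that would look like
     *   __asm__ volatile("dcbz 0,%0" :: "r"(addr) : "memory");
     * here we fall back to a write prefetch for portability. */
    for (int k = 0; k < B; k += LINE_DOUBLES)
        __builtin_prefetch(&at[(size_t)jc * lda + i + k], 1);

    /* Lines 8-10: copy one block-row (unrolled into B loads and B stores
     * in the paper's implementation). */
    for (int j = jb; j < jb + B; j++)
        at[(size_t)j * lda + i] = a[(size_t)i * lda + j];
}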
3.2 Block Size
The POWER7 processor [4, 9] has an eight-way set associative level-one (L1) cache of size S = 32 KiB, where 1 Ki = 2^10, and with cache line size L = 128 bytes. The unit for prefetching from memory into the L1 cache is a cache line of length L. A block-row of B double precision elements takes 8 B bytes of storage (a double precision number has eight bytes), and that storage must be a multiple of the cache line size:

8 × B = k × L, where k is a positive integer.

Using L = 128 bytes, we get

B = 16 × k, where k is a positive integer.                      (3)

To avoid capacity misses in the L1 cache, four B × B blocks must fit simultaneously in cache: the source and target blocks currently being transposed, and the source and target blocks being prefetched. The size of a block is 8 B^2 bytes, so the four blocks fit in cache if

4 × (8 B^2) ≤ 32 × 2^10, that is, B ≤ 32.                       (4)

From (3) and (4) we get that B can have only two values:

B ∈ {16, 32}.                                                   (5)
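The two constraints above are easy to check mechanically; the small helper below (ours, purely illustrative) encodes relations (3) and (4).

/* Illustrative check of relations (3)-(5): a block-row must be a whole
 * number of 128-byte cache lines, and four B x B blocks of doubles must
 * fit in the 32 KiB L1 cache. */
#include <assert.h>

static void check_block_size(int B)
{
    assert((8 * B) % 128 == 0);             /* relation (3): B is a multiple of 16 */
    assert(4 * (8 * B * B) <= 32 * 1024);   /* relation (4): B <= 32               */
}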
3.3 Data Alignment
Alignment of the rows of A and AT with respect to the sets of L1 cache lines is needed to avoid conflict misses, because the L1 cache is not fully associative.
We can think of the L1 cache, which is organized as an eight-way set associative memory, as a two-dimensional array with n_A = 8 columns and n_S = S/(n_A L) = 32 rows, where n_A is the number of cache lines in a set and n_S is the number of sets. During the transpose operations for a block b, four blocks need to be present in cache: (1) the block b in A from which data is copied to A^T; (2) the block b in A^T to which data is copied; (3) the block b+1 in A into which data is prefetched from memory; and (4) the block b+1 in A^T into which data is set to zero. We design the data alignment such that these four blocks fit in the L1 cache without causing conflict misses.

The key observation is that, in order to avoid conflict misses, consecutive block-rows of a block of A and A^T should be located in consecutive sets of the L1 cache. If the above rule is observed, there is room in the cache for four blocks of size B ≤ 32. For example, a block of size B = 32 double precision elements will take 32 × 2 cache lines: (1) one block-row takes two lines from one set: 32 × 8 = 256 = 2 × L; (2) one block has 32 block-rows, which take 32 × 2 cache lines, i.e., one fourth of the cache lines in the L1 cache. The L1 cache has 32 × 8 lines, so there is room for four blocks.

We achieve the goal of having two consecutive block-rows mapped to consecutive sets in the cache by reserving for each row of A and A^T storage of size

R_B = L × (n_S × ⌈8n / (L × n_S)⌉ + 1) bytes                    (6)

where n_S = 32 and L = 128. The size in elements of a row is R = R_B / 8, which can be used to compute the prefetch strides S_r (relation (1)) and S_c (relation (2)). Data alignment is implemented by allocating memory with the POSIX function posix_memalign() and allocating R_B × (n + B) bytes aligned on a 32 KiB boundary (the L1 cache size) for each of the arrays A and A^T.
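A minimal sketch of this allocation scheme is shown below; the identifiers are our own, and the padding follows relation (6).

/* Hedged sketch of the padded, 32 KiB-aligned allocation of subsection 3.3:
 * each row is padded to R_B bytes (relation (6)) and the arrays are aligned
 * on the L1 cache size with posix_memalign(). */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stddef.h>

enum { LINE = 128, SETS = 32, L1_BYTES = 32 * 1024 };

/* Padded row size in bytes: R_B = L * (n_S * ceil(8n / (L * n_S)) + 1). */
static size_t padded_row_bytes(size_t n)
{
    size_t way = (size_t)LINE * SETS;               /* 4096 bytes */
    size_t ways_per_row = (8 * n + way - 1) / way;  /* ceil(8n / (L * n_S)) */
    return (size_t)LINE * (SETS * ways_per_row + 1);
}

/* Allocate an n x n matrix with B rows of slack, aligned on 32 KiB. */
static double *alloc_aligned_matrix(size_t n, size_t B)
{
    void *p = NULL;
    size_t bytes = padded_row_bytes(n) * (n + B);
    if (posix_memalign(&p, L1_BYTES, bytes) != 0)
        return NULL;
    return (double *)p;
}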
3.4 Prefetch Distance
Let b be the block of A that is currently being transposed by the i-loop of Algorithm 3. The i-loop has B iterations, one for each block-row of b. In each iteration of the i-loop, three actions are performed: (1) block-row (i − i_b) of block b of A is being transposed; (2) prefetch-for-load is issued for block-row (i − i_b) of block b+1 of A to prefetch N_PF = 8 B / L cache lines; and (3) prefetch-for-store is issued for block-row (i − i_b) of block b+1 of A^T to prefetch N_PF cache lines. Because prefetching is a non-blocking operation, it overlaps with transposing the blocks of A. We implement prefetch-for-load with the data cache block touch (DCBT) POWER7 instruction and prefetch-for-store with the data cache block set to zero (DCBZ) POWER7 instruction [7]. Both instructions write into the L1 data cache.

The prefetch distance, denoted by D, is the number of iterations between the iteration in which a cache line is used and the iteration in which the prefetch of that cache line was launched. Next we derive the prefetch distance for DCBT and DCBZ. Prefetching-for-load of block-row (i − i_b) of block b+1 is initiated at iteration i of the i-loop for block b. This block-row will be used at iteration i of the i-loop for block b+1.
So the prefetch distance for DCBT is

D_DCBT = B                                                      (7)

and is independent of the i-loop variable. In contrast, the prefetch distance for prefetch-for-store depends on i:

D_DCBZ(i) = B − (i − i_b), where 0 ≤ i − i_b < B.               (8)
In the next subsection, we derive the time per iteration of the i-loop, assuming that there are no memory stalls, and then we show that indeed there are no memory stalls, thanks to the prefetch distance of our algorithm and to the advanced features of the POWER7 micro-architecture.
3.5 Transpose Time Model
First, we model the time taken by one iteration of the i-loop when there are no memory stalls. Denote this time by T_work. The expression of T_work is

T_work = 3 B + 2 B T_L1 + N_PF T_PF cycles                      (9)

where 3 is the number of cycles taken by the index computations at lines 4 and 5 in the i-loop, T_L1 is the number of clock cycles needed to load (store) a floating point register from (to) the L1 cache, and T_PF is the accumulated time of executing a DCBT and a DCBZ instruction. (DCBT and DCBZ are non-blocking, thus they return before the data is prefetched.) For the POWER7 processor, L = 128 bytes, T_L1 = 2 cycles, the memory latency is λ = 336 cycles, and T_PF ≈ 36 cycles, so N_PF = 8B/L = B/16 and from (9) we get

T_work = 7 B + 36 (B/16) = (37/4) B cycles.                     (10)

Prefetching in the i-loop occurs concurrently with copying a[i][j] to aT[j][i]. If prefetching is fast enough to hide the memory latency, then there is no memory stall and the actual time taken by an i-loop iteration is T_work. The time interval elapsed from the time when prefetching is initiated in an i-loop iteration to the time when prefetching completes is

T_prefetch = N_PF T_PF + λ cycles.

Using N_PF = B/16, T_PF = 36 cycles, and λ = 336 cycles, we get

T_prefetch = 336 + (9/4) B cycles.                              (11)
To hide the memory latency, a prefetch launched in an iteration must complete by the time code execution reaches the iteration where the prefetched data are needed. A sufficient condition to guarantee this is

T_prefetch ≤ D × T_work.                                        (12)
From (10) and (11) we get

D ≥ T_prefetch / T_work = (9 B + 1344)/(37 B).

Given that B ∈ {16, 32}, we get D ≥ 3 for B = 16 and D ≥ 2 for B = 32. For prefetch-for-load, D_DCBT = B (from (7)), so no memory stall occurs. For prefetch-for-store, the minimum prefetch distance is D_DCBZ = 1 (from (8)), so the sufficient condition to avoid memory stalls is not met. However, we show next that there are no memory stalls caused by prefetch-for-store.

The POWER7 processor has a Store Request Queue (SRQ) [9] which, for our prefetch-for-store approach, avoids stalls. Indeed, by Little's Law, the SRQ does not fill up if

T_work > N_PF × λ / C_ST                                        (13)
where C_ST is the maximum number of outstanding stores, which for POWER7 [9] has the value C_ST = 32. Substituting T_work = (37/4) B, N_PF = B/16, and λ = 336 in (13), we get that prefetch-for-store does not cause a stall if

1 > 336/(128 × 37) = 21/296,

which always holds. We conclude that the DCBZ latency can be hidden and all the iterations of the i-loop take T_work. This means that the time to transpose a block b of A, denoted by T_block, is

T_block(B) = B × T_work = (37/4) B^2 cycles

and the time to transpose A is

T_tr = n_B^2 × T_block(B) = n_B^2 × B × T_work = (37/4) n^2 cycles.   (14)

The throughput of the transpose operation is the size of A plus A^T, which is 16 n^2 bytes, divided by T_tr, that is, 16 n^2 / T_tr bytes/cycle. The clock frequency of POWER7 is about 4 GHz, so the memory throughput, measured in gigabytes per second (GB/sec), is BW = 64 n^2 / T_tr, that is

BW = 64 n^2 / ((37/4) n^2) = 256/37 = 6.92 GB/sec.              (15)
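To make the model concrete, the short program below (our own illustration) evaluates relations (9)-(12) and (15) for both admissible block sizes, using the POWER7 parameters quoted above.

/* Illustrative evaluation of the timing model of subsection 3.5 with
 * T_L1 = 2 cycles, T_PF = 36 cycles, lambda = 336 cycles, L = 128 bytes,
 * and a ~4 GHz clock.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double T_L1 = 2.0, T_PF = 36.0, lambda = 336.0, clock_ghz = 4.0;

    for (int B = 16; B <= 32; B += 16) {
        double N_PF   = 8.0 * B / 128.0;                 /* lines per block-row */
        double T_work = 3.0*B + 2.0*B*T_L1 + N_PF*T_PF;  /* relation (9)/(10)   */
        double T_pref = N_PF*T_PF + lambda;              /* relation (11)       */
        double D_min  = ceil(T_pref / T_work);           /* relation (12)       */
        double bw     = 16.0 * B / T_work * clock_ghz;   /* relation (15), GB/s */
        printf("B=%2d: T_work=%6.1f  T_prefetch=%6.1f  D>=%g  BW=%4.2f GB/s\n",
               B, T_work, T_pref, D_min, bw);
    }
    return 0;
}

For B = 16 this reports T_work = 148 cycles, T_prefetch = 372 cycles, D ≥ 3, and BW ≈ 6.92 GB/sec, matching the numbers derived above.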
4. NUMERICAL EXPERIMENTS

4.1 Execution environment
Our experiments were run on an IBM POWER 780 machine with four POWER7 processors clocked at 3.86 GHz. The data caches have the sizes: 32 KiB level-one, 256 KiB level-two, and 4 MiB level-three, where 1Mi = 220 . The operating system was SuSE Linux Enterprise Server 11.1 with version 2.6.35 of the Linux kernel. We used the IBM XLC compiler version 11.1.0.0 and the IBM Engineering and Scientific Subroutine Library (ESSL) version 5.1.0. The machine supports large page sizes and multiple data streams, both of which can improve memory throughput. Two page sizes are supported: 64 KiB (default) and 16 MiB (large pages). Large pages help reduce the translation lookaside buffer (TLB) miss rate for large matrices; we utilize large pages via the hugetlbfs library.
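For completeness, one way to obtain large pages on Linux is sketched below. This is our own illustration using anonymous mmap with MAP_HUGETLB; the experiments in this paper used the hugetlbfs library instead, and the sketch assumes that 16 MiB huge pages have been reserved by the administrator (e.g., via vm.nr_hugepages).

/* Hedged sketch: back an allocation with huge pages on Linux. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
#include <stdio.h>

static void *alloc_huge(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return NULL;
    }
    return p;
}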
4.2 Hardware streams
The POWER7 architecture supports hardware-managed data streams, whereby the processor logic detects the regularly strided memory access patterns of a program and initiates prefetching of data according to these patterns. The POWER7 data streaming control register (DSCR) [4] can be used to control the type of streaming done by the processor. In the experiments, we select three values for DSCR: (1) DSCR=1, hardware streams are disabled; (2) DSCR=0, hardware streams are enabled for load operations only; and (3) DSCR=15, hardware streams are enabled for load and store operations.
4.3 Manually- vs. compiler-inserted prefetch
The first experiment determines whether manually inserting prefetch instructions, as shown in Algorithm 3 in subsection 3.1, gives better memory bandwidth than using compiler-based prefetching. Compiler-based prefetching is generated by compiling the code with the command-line option -qprefetch of the IBM XLC compiler.

The results suggest that for matrix transpose one should use either manually inserted prefetching or hardware streams; the latter is done by setting DSCR to an appropriate value such as 1 or 15.

Figure 1: Manually-inserted prefetch vs. compiler-inserted prefetch

Figure 1 shows that: (1) the compiler-inserted prefetching gives a marginal improvement over the case when no software or hardware prefetching occurs; and (2) the manually-inserted prefetch improves the bandwidth by about 250%.
4.4 Tiled transpose with prefetch
Figures 2 and 3 show the memory bandwidth for tiled transpose with prefetching for B = 16 and B = 32, respectively. Each plot shows the bandwidth for the three values of DSCR discussed in subsection 4.2. The baseline bandwidth is that without hardware streams, i.e., DSCR = 1.

Figure 2: Memory bandwidth for B = 16 and page sizes of 64 KiB (top) and 16 MiB (bottom)

Figure 3: Memory bandwidth for B = 32 and page sizes of 64 KiB (top) and 16 MiB (bottom)

Enabling hardware streams improves performance as follows: (1) for B = 16, prefetch for loads (DSCR = 0) gives a marginal improvement (meaning that our DCBT prefetch does an excellent job), while prefetch for loads and stores (DSCR = 15) gives a noticeable improvement for large pages; (2) for B = 32, prefetch for loads (DSCR = 0) gives a better improvement than prefetch for loads and stores (DSCR = 15), meaning that DSCR = 15 causes excessive prefetching.

Large page sizes (the bottom plot in each figure) help reduce TLB misses, and the positive impact of large page sizes is clear for n > 4000. For the default page size (the top plot in each figure), the impact of TLB misses becomes significant for n > 4000.

The effect of the block size B on the bandwidth depends on n and on the page size. For the page size of 64 KiB, B = 16 is better for n < 2000, and B = 32 with prefetch-for-load is better than B = 16 for n > 2000. For the large page size of 16 MiB, B = 16 gives better bandwidth than B = 32.
4.5 Predicted and observed throughput
In the range of values of n that is not affected by level-two or level-three cache effects (i.e., n > 2000) or TLB misses (i.e., large pages and n < 16000), or hardware streams (i.e., DSCR = 1), the memory throughput is: (1) 6 GB/sec for B = 16 (see Figure 2); and (2) 5.5 GB/sec for B = 32 (see Figure 3). The throughput predicted by relation (15) in subsection 3.5 is 6.92 GB/sec, so the observed value is 87% (for B = 16) or 80% (for B = 32) of the predicted value. Thus, our model is a good predictor if we account for the machine overheads, which are typically about 20%.
4.6 Comparison with the dgetmo routine

We compared the memory throughput of our approach with that of the out-of-place matrix transpose routine dgetmo of the Engineering and Scientific Subroutine Library (ESSL). Figure 4 shows the memory bandwidth of the out-of-place matrix transpose routine dgetmo of ESSL for large pages (16 MiB).

Figure 4: Memory bandwidth of dgetmo for page size 16 MiB

The memory bandwidth of dgetmo does not depend strongly on the page size; the results for a page size of 64 KiB are similar and are not shown. We perform the comparison between our approach and dgetmo by selecting, for each n, the set of values of B, DSCR, and page size that give the best memory bandwidth. The result is shown in Figure 5; our approach gives a memory bandwidth up to five times better than dgetmo, and the improvement is larger for larger n.

Figure 5: Our approach is significantly better than dgetmo

4.7 Comparison with STREAM Copy
STREAM copy is a widely used memory bandwidth benchmark [6]; it copies a one-dimensional source array containing N = n^2 double precision elements to a target array. If there are no memory stalls, the time to copy the array is

T_copy = N × (2 T_L1) = 4 N cycles

and the memory throughput is 16 N / T_copy bytes/cycle, which for the 4 GHz clock of the POWER7 means 64 N / T_copy GB/sec, i.e.,

BW_stream = (64 N)/(4 N) = 16 GB/sec.

This model of STREAM copy throughput assumes that there are no memory stalls, but it does not include explicit prefetching; therefore, achieving the predicted throughput of 16 GB/sec requires hardware-managed data streams. Indeed, we have measured the bandwidth of STREAM copy for a page size of 16 MiB, and have obtained 14.8 GB/sec for DSCR = 15, but only 10.6 GB/sec for DSCR = 0 and 4.2 GB/sec for DSCR = 1.

Comparing the 14.8 GB/sec throughput of STREAM copy with the throughput of 7 GB/sec achieved by our matrix transpose approach (for B = 32, a page size of 16 MiB, and DSCR = 0), we note that we achieve 47% of the best bandwidth of STREAM copy. Moreover, because we use software prefetching, in the absence of hardware streams (i.e., DSCR = 1), our matrix transpose bandwidth (6 GB/sec for B = 16 and 5.5 GB/sec for B = 32) exceeds the bandwidth of STREAM copy (4.2 GB/sec).
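For reference, the copy kernel at the heart of the STREAM copy model above can be written as the following minimal sketch (ours, not the official benchmark code [6]).

/* Minimal copy kernel in the spirit of STREAM copy: move N doubles, i.e.
 * 16 N bytes of memory traffic; timing and the benchmark harness omitted. */
#include <stddef.h>

static void stream_copy(double *restrict dst, const double *restrict src, size_t N)
{
    for (size_t k = 0; k < N; k++)
        dst[k] = src[k];
}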
5. RELATED WORK
A study of various matrix transpose algorithms [1] gives two main insights. First, the predicted performance of an algorithm depends on how performance is measured, e.g., the number of cache misses or the execution time. Second, the actual performance of an algorithm often deviates significantly from the predicted performance.

Cache-oblivious algorithms [2] are independent of cache architecture parameters such as the cache line size and the total cache size. However, they assume a cache that is fully associative and tall, i.e., has a size of Ω(L^2), where L is the cache-line size. Experimental results [1] show that cache-oblivious algorithms work well for some problem sizes but not for others, which suggests that in fact their performance depends on the cache architecture, particularly the associativity. On the other hand, cache-blocked transpose, the most popular cache-aware algorithm, gives consistently good performance if the block size is correctly selected.

Optimizing cache-blocked transpose involves multiple parameters, including the block size, the loop nest order, and the loop unrolling factor. McCalpin and Smotherman [5] have followed an empirical optimization approach that uses a code generator which scans the domain of these parameters. Our work takes a model-based approach to determining the values of the design parameters, and it takes into account parameters not included in [5], such as the associativity of the cache and explicit prefetching (the latter was not supported at the time when [5] was published).
6. CONCLUSIONS
We have designed and implemented a matrix transpose algorithm that combines cache blocking, prefetching, and data alignment to efficiently use the level-one cache of the POWER7 processor. Our matrix transpose algorithm avoids memory stalls by exploiting features of the POWER7 microarchitecture, such as the store request queue and the degree of concurrency of the load and store requests. To understand the expected performance, we have modeled cache blocking and prefetching for matrix transpose in terms of the POWER7 cache organization, memory access latency and concurrency. We have conducted numerical experiments whose results indicate that the measured memory throughput of matrix transpose is in good agreement with the throughput predicted by the model. The experimental results indicate that
the memory throughput of our approach is up to five times better than that of the ESSL dgetmo routine and 2.5 times better than that of compiler-inserted prefetching. We have shown that large page sizes improve the memory throughput of matrix transpose by reducing the TLB miss rate. The single-core optimization of matrix transpose described here would also improve the performance of an MPI parallel matrix transpose that uses AlltoAll communication, since the MPI version would perform on-core transposes as well.
Acknowledgments This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, IBM, and the Great Lakes Consortium for Petascale Computation.
7. REFERENCES
[1] Chatterjee, S., and Sen, S. Cache-efficient matrix transposition. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture (HPCA-6) (2000), pp. 195-205.
[2] Frigo, M., Leiserson, C., Prokop, H., and Ramachandran, S. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (1999), pp. 285-297.
[3] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 4th ed. Morgan Kaufmann, 2007.
[4] Kalla, R. POWER7: IBM's next-generation server processor. IEEE Micro 30, 2 (March-April 2010), 7-15.
[5] McCalpin, J., and Smotherman, M. Automatic benchmark generation for cache optimization of matrix operations. In Proceedings of the 33rd Annual Southeast Regional Conference (New York, NY, USA, 1995), ACM-SE 33, ACM, pp. 195-204.
[6] McCalpin, J. D. STREAM: Sustainable memory bandwidth in high performance computers. Tech. rep., University of Virginia, Charlottesville, Virginia, 2012. http://www.cs.virginia.edu/stream/.
[7] Power.org. Power ISA Version 2.06. http://www.power.org/resources/downloads.
[8] Rajamony, R., Arimilli, L. B., and Gildea, K. PERCS: The IBM POWER7-IH high-performance computing system. IBM Journal of Research and Development 55, 3 (2011), 3:1-3:12.
[9] Sinharoy, B., Kalla, R., Starke, W., Le, H., Cargnoni, R., Van Norstrand, J., Ronchetti, B., Stuecheli, J., Leenstra, J., Guthrie, G., Nguyen, D., Blaner, B., Marino, C., Retter, E., and Williams, P. IBM POWER7 multicore server processor. IBM Journal of Research and Development 55, 3 (2011), 1:1-1:29.