Challenges and Potentials of Emerging Multicore Architectures Markus Stürmer, Gerhard Wellein, Georg Hager, Harald Köstler and Ulrich Rüde
Abstract We present performance results on two current multicore architectures, an STI (Sony, Toshiba, and IBM) Cell processor included in the new Playstation™ 3 and a Sun UltraSPARC T2 (“Niagara 2”) machine. On the Niagara 2 we use the standard STREAM benchmark and a shared-memory parallel lattice Boltzmann code to analyze typical performance patterns that emerge from the peculiar way the memory controllers are activated on this chip. On the Cell processor we measure the memory bandwidth and run performance tests for LBM simulations. Additionally, we show results for an application in image processing on the Cell processor that requires the solution of nonlinear anisotropic PDEs.
1 Introduction

Over many years single-core processors have dominated computer architectures, and performance was gained through higher clock rates. Currently this is changing, and there is a trend towards multicore architectures for several reasons: First, they run at lower clock speeds, which reduces thermal dissipation and power consumption and offers greater system density. Additionally, multicore processors deliver significantly greater computing power through concurrency than conventional single-core processor chips. But multicore processors are also changing the software development landscape. The codes have to sustain a much higher level of parallelism, and the bottlenecks emerging from shared resources like caches and memory connections have to be taken into account. In this contribution we try to evaluate the power of two multicore architectures, the Sun UltraSPARC T2 (“Niagara 2”) machine and the STI (Sony, Toshiba, and IBM) Cell processor. In Sect. 2 we discuss the architectural specifications of the Niagara 2 machine and then present performance results for lattice Boltzmann (LBM) fluid simulations and the STREAM benchmark.

M. Stürmer · H. Köstler · U. Rüde
Lehrstuhl für Systemsimulation, Universität Erlangen-Nürnberg, Cauerstraße 6, 91058 Erlangen, Germany
e-mail:
[email protected]
G. Wellein · G. Hager
Regionales Rechenzentrum Erlangen, Martensstraße 1, 91058 Erlangen, Germany
In Sect. 3 we analyze the achievable memory bandwidth of the Cell processor, run LBM simulations, and show runtimes for an application from variational image processing.
2 Sun UltraSPARC T2—Single Socket Highly Threaded Server

The Sun UltraSPARC T2 (codenamed “Niagara 2”) might offer a first glimpse of one potential direction for future chip designs: a highly threaded server-on-a-chip approach using many “simple” cores which run at low or moderate clock speed but support a large number of threads.
2.1 Architectural Specifications

Trading high single-core performance for a highly parallel system-on-a-chip architecture is the basic idea behind the Sun Niagara 2 concept [17], as can be seen in Fig. 1: Eight simple in-order cores (running at 1.4 GHz) are connected through a nonblocking switch with a shared L2 cache (4 MByte in total) and four independently operating dual-channel FB-DIMM memory controllers. At first glance this uniform memory architecture (UMA) provides the scalability of cache-coherent non-uniform memory architecture (ccNUMA) approaches, taking the best of the two worlds at no cost. The aggregated nominal main memory bandwidth of 42 GB/s (read) and 21 GB/s (write) for a single socket is far ahead of most other general purpose CPUs and is only topped by the NEC SX8 vector series. Since there is only a single floating point unit (performing mult or add operations) per core, the system balance of approximately 4 Bytes/Flop (assuming read) is the same as for the NEC SX8 vector processor.

To overcome the restrictions of in-order architectures and long memory latencies, each core is able to support up to eight threads through replicated hardware, e.g. register sets. Although at most two threads per core can be executing concurrently at any time, all eight threads are resident in hardware and can be interleaved between the various pipeline stages with only few restrictions, avoiding the costs of context switching, which can be substantial on classic CPUs. The cores implement the well-known SPARC V9 instruction set, allowing for easy porting of existing software packages. However, running more than a single thread per core is a must, and identifying as well as exploiting a high level of parallelism in the application codes will be the major challenge to achieve good performance on this system. For the tests to be performed on single or dual socket systems, OpenMP and MPI parallel applications can easily be used. For potential large scale systems with hundreds or thousands of sockets, i.e. tens to hundreds of thousands of threads, appropriate parallelization techniques like hybrid MPI/OpenMP codes will have to be employed in the future. Going beyond the requirements of the tests presented in this report, one should add that the Sun Niagara 2 chip also comprises a PCIe x8 and a 10 Gb Ethernet interconnect port as well as a cryptographic coprocessor.
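As a quick plausibility check of this balance figure (our own arithmetic, assuming one floating-point operation per core and cycle):

\[ \frac{42\ \text{GB/s}}{8 \times 1.4\ \text{GHz} \times 1\ \text{Flop/cycle}} = \frac{42\ \text{GB/s}}{11.2\ \text{GFlop/s}} \approx 3.75\ \text{Bytes/Flop} \approx 4\ \text{Bytes/Flop}. \]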
Fig. 1 Schematic view of the Sun UltraSPARC T2 chip architecture. Eight physical cores (C1, . . . , C8) with local L1 data (8 KByte) and L1 instruction (16 KByte) caches are connected to eight L2 banks (two of them sharing one memory controller) through a nonblocking crossbar switch (picture by courtesy of Sun Microsystems)
The single socket Sun system used for this report is an early access, preproduction model of Sun’s T2 server series.
2.2 Lattice Boltzmann Method Performance Characteristics

We use a simple lattice Boltzmann method (LBM) kernel to test the capabilities of Sun's Niagara 2 for data-intensive applications. Figure 2 presents the basic structure of the main compute routine (the “collision-propagation step”) of the LBM kernel. A detailed description of the implementation and data layout of the LBM kernel used, as well as a brief introduction to LBM, can be found in Ref. [22]. For the Sun Niagara 2 architecture the “propagation optimized” data layout (IJKL) is appropriate. As an implementation of the combined collision-propagation step we test both the 3D variant, featuring a standard Cartesian loop nest, and the vector variant, in which the loops are coalesced into a single long loop which visits all lattice sites in the same order as the 3D version (a schematic rendering of both variants is sketched at the end of this section).
Fig. 2 Code snippet of 3D implementation with propagation optimized data layout using Fortran notation
On top of the outermost loop, OpenMP directives parallelize the collision-propagation step. With respect to the highly threaded design of the Niagara 2 we did not apply a fixed OpenMP scheduling (e.g. static) but studied several alternatives. However, data locality issues on ccNUMA systems such as multi-socket AMD Opteron servers require the use of static scheduling for the 3D initialization loops of the distribution functions. As a test case we consider the lid-driven cavity problem using a cubic domain (holding N³ cells; N_x = N_y = N_z = N) and a D3Q19 discretization stencil. The performance is measured in units of “million fluid cell updates per second” (FluidMLUPs), which is a handy performance unit for LBM. Note that 5 FluidMLUPs are equivalent to approximately 1 GFlop/s of sustained CPU performance.

The performance characteristics of the Sun Niagara 2 as a function of the domain size are shown in Fig. 3 for 16 and 32 OpenMP threads using different scheduling strategies. Most notably, the behavior changes significantly when going from 16 to 32 threads, showing a substantial performance drop for the static scheduling strategies at larger domain sizes in the latter case. The combination of dynamic scheduling with reasonably large chunk sizes and the vector implementation provides the best flexibility and is a good approach for this architecture. There is also a high variation in performance for all scheduling strategies, yielding performance drops of a factor of up to three. Unlike on conventional cache-based architectures (cf. Fig. 4), these breakdowns cannot be correlated with array dimensions defined as powers of 2, which can cause massive cache thrashing for our LBM implementation (cf. discussion in Ref. [22]). Interestingly, for 32 threads and static scheduling performance drops at about N = 224 and does not recover for larger domains. On the other hand, performance of the dynamic scheduling approach, which has a completely different data access pattern, remains at a high level.
Fig. 3 Performance of LBM on Sun Niagara 2 using 16 (upper panel) or 32 OpenMP threads and different scheduling strategies. For the 32 (16) threads run, each (every third) point has been measured in the interval N = 50, …, 397
Since the strong performance fluctuations also show up over the whole range of domain sizes, a potential suspect for both effects is the memory subsystem or, more precisely, a load imbalance in the use of the four memory controllers. This topic is analyzed in more detail in Sect. 2.3. In contrast to “classic” CPUs, LBM performance still improves by approximately 20% when using four instead of two threads per physical core, emphasizing the need for highly parallel codes.

To demonstrate the potential of the Sun Niagara 2 for LBM as compared to standard technologies, we present in Fig. 4 the node performance of two compute node architectures widely used nowadays in HPC clusters. For the Intel Xeon system (upper panel in Fig. 4), employing a UMA architecture, all variants of LBM provide roughly the same performance. The major breakdowns are associated with cache thrashing and can easily be avoided by array padding. Even though this system uses two sockets and four high speed cores (3.0 GHz), its maximum performance falls short by a factor of two compared to Sun's Niagara 2. It is well known that AMD Opteron systems can provide better performance than Intel architectures for data intensive codes. The lower panel of Fig. 4 shows measurements on a four-socket AMD Opteron system using the fastest memory technology (“Socket F”) available at the time of writing. Per socket, the Opteron (approximately 7 FluidMLUPs/socket) can outperform the Intel system (approximately 6 FluidMLUPs/socket), but data locality problems must be carefully avoided, as can be seen from the results for dynamic scheduling. Here the data access patterns in the initialization step and the compute kernel are completely different, and performance breaks down by a factor of four to five. In summary, the Sun Niagara 2 can outperform the current workhorses of HPC clusters on a per-socket basis by a factor of three to four and thus provides a high potential for data intensive applications such as LBM.
Fig. 4 Performance of LBM on an Intel Xeon 5160 (“Woodcrest”) based HP DL140G3 server and an AMD Opteron based HP DL585G2 server. The number of threads is set to the number of cores in the system. For the Intel (AMD) system each point has been measured in the interval N = 20, …, 250 (300). Note that the HP DL140G3 uses the Intel 5000X (“Greencreek”) chipset and the AMD Opteron is based on the current “Socket F” (10.6 GB/s nominal memory bandwidth per socket)
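A schematic rendering of the two loop variants compared in this section (our own illustration in C; the actual kernel of Fig. 2 is written in Fortran, and the D3Q19 cell update is only hinted at by a placeholder function):

```c
/* Illustrative C rendering of the two loop variants; the cell update is
   a stand-in for the actual D3Q19 collision-propagation step. */
#include <stddef.h>

static void collide_propagate(const double *f_src, double *f_dst,
                              size_t cell, size_t ncells)
{
    (void)f_src; (void)f_dst; (void)cell; (void)ncells;  /* placeholder */
}

/* 3D variant: standard Cartesian loop nest, OpenMP on the outermost loop. */
void sweep_3d(const double *f_src, double *f_dst, size_t n)
{
    size_t ncells = n * n * n;
#pragma omp parallel for schedule(static)
    for (long k = 0; k < (long)n; k++)
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < n; i++)
                collide_propagate(f_src, f_dst, ((size_t)k * n + j) * n + i, ncells);
}

/* Vector variant: the loops are coalesced into one long loop visiting the
   cells in the same order, so schedules such as dynamic with a large
   chunk size can be applied. */
void sweep_vector(const double *f_src, double *f_dst, size_t n)
{
    size_t ncells = n * n * n;
#pragma omp parallel for schedule(dynamic, 8192)
    for (long cell = 0; cell < (long)ncells; cell++)
        collide_propagate(f_src, f_dst, (size_t)cell, ncells);
}
```

The only difference relevant here is the loop structure, which determines which OpenMP scheduling strategies can be applied to the sweep.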
2.3 Analysis of Data Access Performance

A widely used standard benchmark to test the sustainable main memory bandwidth is the STREAM suite [14], which consists of four vector-vector operations with large vector lengths. In the following we focus on the so-called “TRIAD” test, whose basic structure is shown in Fig. 5. Typically the parameters array (the array length) and offset are tuned to achieve optimal bandwidth for main memory access, and the corresponding value is reported for the system as the “optimal stream” number. For our purposes we fix array = 2²⁵ and scan the value of offset. Since the three arrays A, B and C are located consecutively within a common block, we can probe the Sun Niagara 2 for bandwidth problems arising from the relative alignment of data streams in main memory. Using an OpenMP variant and static scheduling we find, for reasonable thread counts, highly periodic structures with substantial breakdowns in performance (see Fig. 6).
Fig. 5 Code snippet of the TRIAD test within the STREAM benchmark suite
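A minimal C sketch of the offset-scan TRIAD described above (our own rendering; the original benchmark is Fortran with the three arrays placed consecutively in a COMMON block, and all names here are illustrative):

```c
/* Offset-scan TRIAD sketch: one contiguous block holds a, b and c, so
   "offset" shifts their relative alignment in main memory. */
#include <stdlib.h>

#define N          (1UL << 25)   /* 2^25 double words per array */
#define MAX_OFFSET 512UL

static double *buf;

static void triad(size_t offset, double s)
{
    double *a = buf;
    double *b = a + N + offset;
    double *c = b + N + offset;
#pragma omp parallel for schedule(static)
    for (long j = 0; j < (long)N; j++)
        a[j] = b[j] + s * c[j];
}

int main(void)
{
    size_t total = 3 * (N + MAX_OFFSET);
    buf = malloc(total * sizeof(double));
    if (!buf) return 1;
    for (size_t i = 0; i < total; i++) buf[i] = 1.0;
    for (size_t off = 0; off <= MAX_OFFSET; off++)
        triad(off, 3.0);          /* time each call separately to obtain Fig. 6 */
    free(buf);
    return 0;
}
```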
Fig. 6 Bandwidth of Sun Niagara 2 for the TRIAD test within the STREAM benchmark for different numbers of threads as a function of the array offset given in double words (8 bytes). The size of each array used in the benchmark is 2²⁵ double words
The period of the strongest fluctuation is exactly 64 double precision words, which can be explained by the peculiar way the processor assigns memory controllers to addresses: Physical address bits 8:7 select the controller, while bit 6 chooses between the two cache banks associated with it. If the offset between data streams is zero, the array length of 2²⁵ double words (a multiple of 512 bytes) ensures that all arrays start with identical bits 8:7, resulting in a single memory controller being used for all data transfers. The same happens again when offset = 64, i.e. 512 bytes, and for each integer multiple of this number. Interestingly, this effect shows up in a prominent way only when using 16 or 32 threads. With eight threads or less, the expected breakdowns are only minor, and the achievable maximum bandwidth is cut in half.
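Expressed as tiny helper functions (our reading of the description above; purely illustrative):

```c
/* Physical address bits 8:7 select one of the four memory controllers,
   bit 6 one of the two L2 banks attached to it. Any offset that is a
   multiple of 512 bytes leaves bits 8:7 unchanged, so all data streams
   end up on the same controller. */
unsigned memory_controller(unsigned long paddr) { return (paddr >> 7) & 0x3u; }
unsigned cache_bank(unsigned long paddr)        { return (paddr >> 6) & 0x1u; }
```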
Fig. 7 Bandwidth for the TRIAD and COPY test within the STREAM benchmark as a function of the array offset given in DP words (8 bytes). The two-socket Intel Xeon (HP DL140G3) and four-socket AMD Opteron (HP DL585G2) compute nodes are described in Sect. 2.2 and in the caption of Fig. 4. The benchmark has been run in parallel on all cores of each compute node
The reason for this halving is as yet unclear. One could speculate that the on-chip resources for sustaining outstanding memory transactions are used inefficiently when the number of threads is less than 16, because each core can execute at most two threads simultaneously. Interestingly, the amplitude of the maximum fluctuations is the same as for the LBM kernel, and the frequency of the fluctuations grows with the thread count as well. This is a clear indication that the alignment of concurrent data accesses must be carefully analyzed in future work on LBM.

Finally we present in Fig. 7 the STREAM TRIAD data for the standard compute nodes used in the previous section, to point out a peculiarity of these systems which is important for interpreting their “pure” STREAM numbers correctly. For the Intel Xeon system we find a strong fluctuation in performance, alternating between 4.5–5.0 GB/s (odd values of offset) and 6.0–6.5 GB/s (even values of offset). A thorough analysis of the data access patterns and the compiler-generated assembly code reveals that this effect is related to the compiler's ability to use “nontemporal stores” for a[]. In general, a write miss on the store stream for a[] causes a read for ownership (RFO) of the corresponding cache line, i.e. a[j] is loaded to cache before it is modified, causing additional data transfer. Nontemporal stores, however, bypass the cache and thus avoid the RFO. However, nontemporal stores can only be used in vectorized loops and require packed SSE instructions with operands aligned to 16-byte boundaries, i.e. a[j0] needs to be aligned to a 16-byte boundary. Since the starting address of a[j0] does not depend on offset, the compiler used in our tests (Intel Fortran compiler version 9.1.045 for 64-bit applications) further restricts the use of nontemporal stores to the case where all operands used in the vectorized loop are aligned to 16-byte boundaries as well.
This is true only for even values of offset, leading to the even-odd oscillations in the STREAM TRIAD data presented in Fig. 7. However, this might change with future compiler versions. For COPY (c(j)=a(j) in the inner loop of Fig. 5) the RFO on c[] can always be avoided, since the relative memory alignment of a[j0] and c[j0] is always a multiple of 16 bytes. The AMD Opteron system shows a similar behavior with an even larger fluctuation of the results, which could be caused, e.g., by banking effects in main memory. It must be emphasized that the RFO problem in the STREAM benchmark is to some extent artificial and can—at least for simple kernels—be easily resolved through straightforward array padding. On the other hand, the STREAM data access pattern of Sun's Niagara 2 is induced by the use of four independent on-chip memory controllers. In the future, more effort has to be put into investigating strategies to make best use of this “distributed UMA” design for user applications.
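To illustrate the nontemporal-store mechanism discussed above, a minimal sketch using SSE2 intrinsics (our own illustration; the measured code is Fortran and relies on the compiler to generate such stores automatically):

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_pd and friends */
#include <stddef.h>

/* TRIAD with explicit nontemporal ("streaming") stores: a[] bypasses the
   cache, so no read for ownership (RFO) of its cache lines is triggered.
   All three arrays are assumed to be 16-byte aligned here, mirroring the
   alignment restriction discussed above; n must be even. */
void triad_nt(double *a, const double *b, const double *c, size_t n, double s)
{
    __m128d vs = _mm_set1_pd(s);
    for (size_t j = 0; j < n; j += 2) {
        __m128d vb = _mm_load_pd(&b[j]);              /* aligned loads */
        __m128d vc = _mm_load_pd(&c[j]);
        _mm_stream_pd(&a[j], _mm_add_pd(vb, _mm_mul_pd(vs, vc)));
    }
    _mm_sfence();   /* order the streaming stores before subsequent accesses */
}
```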
3 Cell Processor

The STI Cell processor used, e.g., in the new Playstation™ 3 is an innovative, heterogeneous multicore processor that offers outstanding performance on a single chip. The organization of the processor is depicted in Fig. 8 [11, 12]: The backbone of the chip is a fast ring bus—the Element Interconnect Bus (EIB)—connecting all units on the chip and providing a total throughput of up to 204.8 GB/s at 3.2 GHz. A PowerPC-based general purpose core—the Power Processor Element (PPE)—is primarily used to run the operating system and to control execution. The Memory Interface Controller (MIC) can deliver data with up to 25.6 GB/s from Rambus XDR memory, and the Broadband Engine Interface (BEI) provides fast access to IO devices or a coherent connection to other Cell processors.
Fig. 8 Schematic view of the STI Cell Processor
The computational power resides in eight Synergistic Processor Elements (SPEs), simple but very powerful co-processors consisting of three components: The Synergistic Execution Unit (SXU) is a custom SIMD-only vector engine with 128 vector registers and two pipelines. It operates on its own 256 kB of Local Store (LS), a very fast, low-latency memory. SXU and LS constitute the Synergistic Processing Unit (SPU), which has a dedicated interface unit connecting it to the outside world: The primary use of the Memory Flow Controller (MFC) is to asynchronously copy data between Local Store and main memory or the Local Stores of other SPEs using Direct Memory Access (DMA). It also provides communication channels to the PPE or other SPEs and is utilized by the PPE to control execution of the associated SPU. Each SPE can be seen as a very simple computer running its own program, but dependent on and controlled by the PPE. Usually, a single Cell processor is able to perform 204.8 GFlop/s using fused multiply-adds in single precision (not counting the abilities of the PPE). However, only six SPEs are available under Linux running as a guest system on the Sony Playstation™ 3, which reduces the maximum performance accordingly.
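For reference, the peak number follows from commonly cited SPE characteristics (our own arithmetic, assuming four single-precision SIMD lanes and one fused multiply-add per lane and cycle):

\[ 8\ \text{SPEs} \times 4\ \text{lanes} \times 2\ \tfrac{\text{Flops}}{\text{FMA}} \times 3.2\ \text{GHz} = 204.8\ \text{GFlop/s}, \]

which drops to 153.6 GFlop/s with the six SPEs available on the Playstation™ 3.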
3.1 Results Memory Bandwidth

A theoretical main memory bandwidth of 25.6 GB/s is an impressive number at first glance. One must however consider that this bandwidth is shared by seven cores on the Playstation™ 3, the PPE and six SPEs—or even nine cores on other Cell hardware where all eight SPEs are available. This section therefore explores how much bandwidth can actually be expected and which factors it depends on. The Memory Interface Controller (MIC) of the current Cell implementation supports two memory channels, each accessing eight banks of Rambus XDR memory. Each channel provides half the total bandwidth; as every bank can deliver only about 2.1 GB/s, each channel must utilize six memory banks at a time to reach optimal throughput. The memory is organized in blocks of 128 B; while smaller memory accesses are supported, they are at least as costly as reading or writing a whole 128 B block and will not be discussed further. Blocks are distributed alternatingly to the two channels, and every 16th 128 B block resides on the same memory bank. Usually the MIC combines all memory transfers concurrently in flight to keep both channels busy. However, if only memory transfers to the same memory bank are requested (e.g. 128 B transfers with a stride of 2 kB), only the bandwidth of a single bank is available. Similarly, if only memory transfers to the eight banks of one channel are in flight (e.g. 128 B transfers with a stride of 256 B), the channel's bandwidth will be the bottleneck. To get high throughput, larger DMA transfers affecting different memory banks should be made whenever possible; having multiple smaller concurrent DMAs—each Memory Flow Controller can handle 16 at a time—also increases the chances of keeping many banks busy.
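For illustration, a minimal SPU-side sketch of how several DMA transfers can be kept in flight concurrently (using the spu_mfcio.h interface of the Cell SDK; buffer layout, chunk size and the processing hook are our own assumptions, not the code used for the measurements below):

```c
/* Overlapping DMA gets on one SPE: several outstanding transfers on
   different tags let the MFC keep multiple memory banks busy. */
#include <spu_mfcio.h>
#include <stdint.h>

#define NBUF  4
#define CHUNK 16384   /* 16 kB per DMA, the maximum size of a single transfer */

static char ls_buf[NBUF][CHUNK] __attribute__((aligned(128)));

void fetch_stream(uint64_t ea_base, int nchunks)
{
    for (int i = 0; i < nchunks; i++) {
        int tag = i % NBUF;
        /* make sure the buffer belonging to this tag is free for reuse */
        mfc_write_tag_mask(1u << tag);
        mfc_read_tag_status_all();
        /* queue the next get; up to 16 DMAs can be outstanding per MFC */
        mfc_get(ls_buf[tag], ea_base + (uint64_t)i * CHUNK, CHUNK, tag, 0, 0);
        /* ... process data fetched by earlier, already completed tags ... */
    }
    mfc_write_tag_mask((1u << NBUF) - 1);
    mfc_read_tag_status_all();   /* drain all outstanding transfers */
}
```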
Fig. 9 Available bandwidth depending on locality, size and concurrency of DMA transfers
Figure 9 presents bandwidth graphs for different combinations of DMA sizes, concurrent transfers on each SPE, and number of SPEs involved. In all cases, an SPE repeatedly reads from or writes to a main memory buffer of 16 MB size, and the accumulated bandwidth of all SPEs was measured. Another factor examined is the impact of TLB misses, as every MFC remembers only four TLB entries and needs the PPE's memory management unit to resolve TLB misses. We compare the behavior with the standard page size of 4 kB and with huge pages covering the whole 16 MB of the buffer. Figure 9(a) demonstrates which bandwidth is achievable using a single SPE. If only one transfer is performed at a time, memory latencies of nearly 500 cycles, plus the cycles necessary to set up the next transfer, play an important role. From an MFC's view, a write access has completed at the latest when it has been delivered to the MIC; writes therefore have apparently lower latencies, giving better overall bandwidth for writing. By processing multiple DMAs concurrently, both up- and downstream bandwidth can be increased; with more than four concurrent DMAs, bandwidth increases further only for smaller transfer sizes. Comparing the results with standard and huge memory pages, we see the strong impact of TLB misses. Nevertheless, a single SPE can only obtain about half of the possible upstream memory bandwidth.
In Fig. 9(b) all SPEs are used concurrently, each performing one DMA transfer at a time. For larger DMA transfers we get the highest bandwidth, which cannot be increased further by performing multiple DMA transfers on each SPE. If huge pages prevent TLB misses, a read bandwidth of about 25 GB/s and a write bandwidth of more than 24 GB/s is achievable. This is very close to the theoretical maximum of 25.6 GB/s, which cannot be reached because memory refreshes and similar effects intervene [6]. Bandwidth for mixed read and write accesses drops further, as the direction of the memory bus has to be switched. With the usual memory page size, throughput is effectively reduced by more than 3 GB/s due to TLB misses. If only parts of each memory page are actually transferred, the effect of TLB misses increases further.
3.2 Results LBM

Our CBEA-optimized lattice Boltzmann implementation is based on the same model as mentioned in Sect. 2.2 and was conceived to achieve high performance for regular and moderately irregular domains, e.g. to simulate the blood flow in human vessels. Neither allocation of a full cuboid, which would often consist mainly of solid lattice cells, nor a list representation of the domain, which could not be processed in SIMD and would not meet the alignment constraints of the MFCs' DMA transfers, was a reasonable approach for the memory layout. So the whole domain is divided into equally sized boxes—in this case 8³ cells—and only boxes containing fluid cells are actually allocated. A box is small enough to fit into Local Store, so that its update can be done without additional main memory accesses. Obstacle and boundary information is further stored in a format that enables propagation in SIMD; the importance of this will become clearer in this section. A more thorough description of the implementation can be found in [16].

Table 1 compares the performance of our optimized LBM kernel for the SPU to a straightforward scalar LBM implementation on the PPE, an SPE, and a standard CPU. The first observation is the low MLUP rate of the PPE compared to a current x86 CPU. The scalar performance of an SPE is even worse, which has multiple causes. As the SPU natively performs only SIMD operations, accomplishing a single scalar operation is much more expensive than performing a full SIMD operation.

Table 1 Performance of a straightforward single precision LBM implementation in C on an Intel Xeon 5160 at 3.0 GHz, a standard 3.2 GHz PPE and one SPE, compared with the optimized SPE implementation for an 8³ fluid lattice cell channel flow. No memory transfer was performed to or from the SPEs

                 Straightforward C                  Optimized
CPU              Xeon         PPE         SPE       SPE
MFLUPS           10.2         4.8         2.0       49.0
Moreover, no dynamic branch prediction is supported; if the compiler expects a branch to be taken, a hint instruction must be scheduled many cycles before its associated jump. Statically mispredicted jumps will usually cause at least 18 stall cycles. As execution is in-order, memory and branch latencies add up especially in the propagation step, as its instruction-level parallelism is very low.

Table 2 Cell MLUPS performance for a 96³ channel flow; MFLUPS = MLUPS · 94³/96³

No. of SPEs      1       2       3       4       5       6
MLUPS            42      81      93      94      94      95

Table 2 shows the performance of the Cell-optimized LBM code on the Playstation™ 3. Transfer of operands between main memory and Local Store(s) affects performance in the following ways: First, the MFC must be instructed which DMA transfers are to be performed; further, as Local Store is implemented in single-ported memory, DMA transfers sometimes delay concurrent accesses of the SPU; and it may finally be necessary to wait for DMA transfers to finish, especially if multiple SPEs compete for memory bandwidth. While performance scales well from one to two SPEs (more than 96.4% efficiency), three SPEs are already able to nearly saturate the memory bus—for the best rate of 95 MLUPS, a memory throughput of more than 17 GB/s is necessary.
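A rough plausibility check of this throughput figure (our own estimate): each D3Q19 cell update in single precision reads and writes at least 19 distribution values of 4 bytes each, i.e.

\[ 95 \times 10^{6}\ \tfrac{\text{cells}}{\text{s}} \times 19 \times 2 \times 4\ \text{B} \approx 14.4\ \text{GB/s}; \]

obstacle and boundary data and the 128 B access granularity plausibly account for the remaining traffic up to the more than 17 GB/s quoted above.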
3.3 Real Time Image Decompression on a Playstation™ 3

Based on the ideas presented in [10], we develop a method for video coding in [15] that we call PDEVC (PDE-based video compression). In order to encode an uncompressed video sequence we choose an adapted subselection of all points in the domain, based on a recursive subdivision of each frame by B-tree triangular coding (BTTC) [7], and save their color values and coordinates. The information of all other points is discarded. The missing data can be restored by image inpainting [1, 4, 13]. We assume that in the input image u_0 : Ω → R a finite number m ∈ N of landmarks in a subset Ω_C ⊂ Ω of the image domain Ω ⊂ R² is prescribed. For one frame u : Ω → R we have

\[ u_0(x_i) = u(x_i) \quad \forall\, x_i \in \Omega_C,\ 1 \le i \le m. \tag{1} \]
The remaining points are inpainted by nonlinear anisotropic diffusion [20, 21] to preserve image edges. This requires solving the (nonlinear) PDE

\[
-\operatorname{div}(D \nabla u) = 0 \quad \text{if } x \in \Omega \setminus \Omega_C, \qquad
u = u_0 \quad \text{if } x \in \Omega_C, \qquad
\langle D \nabla u, n \rangle = 0 \quad \text{if } x \in \partial\Omega
\tag{2}
\]
with Neumann boundary conditions. The diffusion tensor is given by

\[ D = g(D_u), \qquad D_u = \nabla u_\sigma (\nabla u_\sigma)^T, \tag{3} \]
where the subscript σ denotes the standard deviation of the Gaussian mask with which u is convolved. The so-called diffusivity function g : R₀⁺ → [0, 1] determines the influence of the image data on the diffusion process. We use the Charbonnier diffusivity [5], given by

\[ g(s^2) = \frac{1}{1 + s^2/\beta^2}. \tag{4} \]
Here β > 0 denotes a contrast parameter. To solve (2) we treat the nonlinearity by an inexact lagged diffusivity method [3, 9]: the idea is to fix the diffusion tensor D during one iteration step and to update the values of D in the whole domain after each iteration [2, 8, 19]. The discretization is done by finite volumes, which corresponds to a symmetric finite difference discretization [18]. Additionally, we apply a multilevel damped Jacobi solver, i.e., we start on a coarse level and perform enough iterations to obtain a solution there. For this purpose we construct an image pyramid and store landmarks on each level. The coarse solution is interpolated to the next finer level and used there as the initial guess for the solver. This scheme is repeated until we have reached the finest level. If we store about 15% of the points of a video sequence of size 320 × 240 and then apply 130 damped Jacobi iterations on three levels for the inpainting, we obtain acceptable image quality and achieve 25 fps on a Playstation™ 3, while on a standard CPU (3 GHz Pentium) we achieve less than 1 fps.
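For illustration, a minimal sketch of one damped Jacobi sweep of this inpainting solver, simplified to a scalar diffusivity per pixel (the actual method uses the full diffusion tensor D, a finite-volume discretization and the multilevel scheme described above; all names are our own):

```c
#include <stddef.h>

/* Charbonnier diffusivity, cf. eq. (4); beta is the contrast parameter. */
static inline float diffusivity(float s2, float beta)
{
    return 1.0f / (1.0f + s2 / (beta * beta));
}

/* One damped Jacobi sweep for -div(g grad u) = 0 on an nx*ny image,
   simplified to a scalar diffusivity g per pixel. mask[i] != 0 marks a
   landmark pixel, which is kept fixed (u = u0 on Omega_C, cf. eq. (2));
   the image border is left untouched here for brevity. */
void jacobi_sweep(const float *u, float *u_new, const float *g,
                  const unsigned char *mask, int nx, int ny, float omega)
{
    for (int y = 1; y < ny - 1; y++) {
        for (int x = 1; x < nx - 1; x++) {
            int i = y * nx + x;
            if (mask[i]) { u_new[i] = u[i]; continue; }
            /* averaged edge weights between neighbouring pixels */
            float wW = 0.5f * (g[i] + g[i - 1]);
            float wE = 0.5f * (g[i] + g[i + 1]);
            float wN = 0.5f * (g[i] + g[i - nx]);
            float wS = 0.5f * (g[i] + g[i + nx]);
            float diag = wW + wE + wN + wS;
            float rhs  = wW * u[i - 1] + wE * u[i + 1]
                       + wN * u[i - nx] + wS * u[i + nx];
            u_new[i] = (1.0f - omega) * u[i] + omega * rhs / diag;
        }
    }
}
```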
4 Conclusions and Future Work

We have analyzed the performance characteristics of data-intensive codes on new, innovative processor architectures. On Sun's UltraSPARC T2, a highly threaded eight-core design, the peculiar performance patterns of a propagation-optimized lattice Boltzmann code and of the well-known STREAM benchmark could be attributed to access conflicts on the built-in memory controllers. Architecture-specific optimization approaches will thus have to concentrate on these alignment problems in the future. x86-based processors suffer from similar, but more easily resolved, deficiencies. We have also shown that the Cell Broadband Engine Architecture represents an interesting approach to increase memory bandwidth and performance of real-world applications. Operating on local memory instead of caches reduces the complexity of cores and on-chip communication, allowing more cores for a given transistor budget. Asynchronous DMA transfers are advantageous compared to common load and store operations, enabling better exploitation of memory interleaving and hiding of latencies. But it is also necessary to mention the downside of the Cell approach: For all but the most trivial problems the programmer has to parallelize and partition the algorithm himself and create a memory layout suitable for DMA transfers and SIMD processing.
Compilers and frameworks can often do no more than assist in that. For complex problems, or to get optimal performance, it is often necessary to orchestrate DMA transfers and inter-core communication and to SIMD-vectorize kernels by hand.

Acknowledgements G.W. and G.H. are indebted to Sun Microsystems and the RWTH Aachen Computing Centre for granting access to a pre-production Niagara 2 system. This work has been funded in part by the Competence Network for Technical, Scientific High Performance Computing in Bavaria (KONWIHR).
References

1. M. Bertalmío, A. Bertozzi, G. Sapiro, Navier-Stokes, fluid dynamics, and image and video inpainting, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2001), pp. 355–362
2. A. Bruhn, Variational optic flow computation: Accurate modeling and efficient numerics. PhD thesis, Department of Mathematics and Computer Science, Saarland University, Saarbrücken, Germany, 2006
3. T. Chan, P. Mulet, On the convergence of the lagged diffusivity fixed point method in total variation image restoration. SIAM J. Numer. Anal. 36(2), 354–367 (1999)
4. T. Chan, J. Shen, Nontexture inpainting by curvature driven diffusions (CDD). J. Vis. Commun. Image Represent. 12(4), 436–449 (2001)
5. P. Charbonnier, L. Blanc-Féraud, G. Aubert, M. Barlaud, Two deterministic half-quadratic regularization algorithms for computed imaging, in Proceedings IEEE International Conference on Image Processing, vol. 2, Austin, TX, USA (1994), pp. 168–172
6. T. Chen, R. Raghavan, J. Dale, E. Iwata, Cell broadband engine architecture and its first implementation. http://www.ibm.com/developerworks/power/library/pa-cellperf/, November 2005. [Online; accessed 1.11.2007]
7. R. Distasi, M. Nappi, S. Vitulano, Image compression by B-tree triangular coding. IEEE Trans. Commun. 45(9), 1095–1100 (1997)
8. C. Frohn-Schauf, S. Henn, K. Witsch, Nonlinear multigrid methods for total variation image denoising. Comput. Vis. Sci. 7(3), 199–206 (2004)
9. S. Fučík, A. Kratochvil, J. Nečas, Kacanov–Galerkin method. Comment. Math. Univ. Carolinae 14(4), 651–659 (1973)
10. I. Galic, J. Weickert, M. Welk, A. Bruhn, A. Belyaev, H. Seidel, Towards PDE-based image compression, in Proceedings of Variational, Geometric, and Level Set Methods in Computer Vision. Lecture Notes in Computer Science (Springer, Berlin, 2005), pp. 37–48
11. IBM, Cell broadband engine architecture, October 2006
12. IBM, Cell broadband engine programming tutorial, May 2007
13. F. Lauze, M. Nielsen, A variational algorithm for motion compensated inpainting, in British Machine Vision Conference, 2004
14. J.D. McCalpin, Stream: Sustainable memory bandwidth in high performance computers. Technical report, University of Virginia, USA (2004). http://www.cs.virginia.edu/stream/
15. P. Münch, H. Köstler, Videocoding using a variational approach for decompression. Technical Report 07-1, Department of Computer Science 10 (System Simulation), Friedrich-Alexander Universität Erlangen-Nürnberg, Germany (2007)
16. M. Stürmer, J. Götz, G. Richter, U. Rüde, Blood flow simulation on the cell broadband engine using the lattice Boltzmann method. Technical Report 07-9, IMMD10 (2007)
17. OpenSPARC T2 core microarchitecture specification. Technical report, Sun Microsystems (2007). http://opensparc-t2.sunsource.net/specs/
18. U. Trottenberg, C. Oosterlee, A. Schüller, Multigrid (Academic Press, San Diego, 2001)
19. C. Vogel, Computational Methods for Inverse Problems (SIAM, Philadelphia, 2002)
20. J. Weickert, Theoretical foundations of anisotropic diffusion in image processing. Computing 11, 221–236 (1996)
21. J. Weickert, Anisotropic Diffusion in Image Processing (Teubner, Stuttgart, 1998)
22. G. Wellein, T. Zeiser, S. Donath, G. Hager, On the single processor performance of simple lattice Boltzmann kernels. Comput. Fluids 35, 910–919 (2006)