Using Simulation to Understand the Data Layout of Programs

Jie Tao∗, Wolfgang Karl, and Martin Schulz
LRR-TUM, Institut für Informatik, Technische Universität München, 80290 München, Germany
E-mail: {tao,karlw,schulzm}@in.tum.de
ABSTRACT
One of the most prominent performance issues on NUMA systems is the access latency to remote memories, which can be several orders of magnitude higher than that of local memory accesses. Effective data allocation that limits the need to access remote memories therefore has the potential to significantly improve the performance of applications. This paper presents a tool that simulates the parallel execution of shared memory programs and provides extensive and detailed information about their run-time data layout. This information allows users to analyze an application's memory access behavior and to specify an optimized data placement within the source code, resulting in a minimum of remote accesses at run-time. Using this simulation tool, a speedup improvement of up to 145.8% for numerical kernels has been achieved, demonstrating the potential of such optimizations.

KEY WORDS
Execution Simulation, Shared Memory Programming, Data Locality
1. Introduction

Clusters built from commodity PCs are becoming increasingly popular and viable platforms for high performance computing. Besides message passing, which is still the most important paradigm for programming parallel machines, shared memory is also increasingly supported on clusters. It provides transparent access to the physically distributed memories and enables concurrent threads of an application to communicate through a shared address space. However, most shared memory applications do not run efficiently on systems with non-uniform memory access (NUMA) characteristics, primarily due to excessive remote memory accesses.

A significant number of projects have therefore focused on improving the data locality of running programs and thereby minimizing the memory access latency. These projects fall into two categories: page migration approaches, which improve data locality through the run-time migration of virtual memory pages to the nodes that reference them more frequently, and compiler-based analysis and optimization, which minimizes the communication latency using remote access elimination, communication-computation overlap, and an initially effective data distribution. Both are efficient, but expensive, since the data migration schemes require support from the operating system, while the compiler schemes have to be integrated into the complex structures of modern compilers. In addition, these methods either cannot use application-specific semantic information, due to their system-level implementation, or are purely static, without the capability to adapt to dynamic run-time behavior.

We propose an alternative approach that helps users to understand the memory access behavior of their applications and aids them in specifying a correct data layout within the source code. This approach is based on a multiprocessor simulator, called SIMT, which is presented in this paper. It is an extended version of Augmint [9], a fast execution-driven multiprocessor simulation toolkit for Intel x86 architectures [4]. We utilize Augmint's memory reference generator to drive the simulation and provide a new back-end simulating our target architectures: PC-based clusters with NUMA interconnection fabrics. SIMT is used to study the impact of remote memory accesses on application performance, to evaluate the performance improvement achievable through better data locality, and to create memory access histograms, which contain a complete summary of all memory accesses performed during the execution of a program. The information provided by the memory access histograms allows users to understand an application's memory access behavior and further to specify a correct data distribution within the application, resulting in better performance.

The feasibility and effectiveness of this work has been demonstrated with several numerical kernels. Most of them show a high improvement in terms of absolute execution time and speedup. Due to an accurate modeling of the target architectures, it can be expected that this performance improvement also translates into similar gains during a real application execution.

The remainder of this paper is organized as follows. Section 2 briefly outlines a few previous approaches for improving data locality. In Section 3 the simulated architecture is introduced and the simulation tool is described in detail. Section 4 discusses the analysis and optimization of the data layout based on the simulation tool and the performance improvement gained from the optimization. The paper is rounded off in Section 5 with some concluding remarks and a few future directions.

∗ Jie Tao is a staff member of Jilin University in China and is currently pursuing her Ph.D. at Technische Universität München in Germany.
2. Related Work

In recent years, data locality has received increasing attention, since it is one of the most important performance issues on distributed shared memory systems, especially those with NUMA characteristics. Much research has therefore focused on improving data locality.

Navarro and Zapata [8] present a method using the compiler to automatically select iteration/data distributions for sequential F77 codes. The compiler analyzes the memory locality exhibited by a program based on descriptors of memory access patterns and establishes a locality-communication graph that allows automatically finding an efficient iteration/data distribution that minimizes the need to access remote memories.

Krishnamurthy and Yelick [6] describe compiler analyses and optimizations for explicitly parallel programs communicating through a shared address space. The analysis, called cycle analysis, checks for cycles among interfering accesses. The optimizations include message pipelining, which allows multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvement is about 20–35% for programs running on a CM-5 multiprocessor.

Tandri and Abdelrahman [13] introduce an algorithm for deriving data and computation partitions to improve memory locality on scalable shared memory multiprocessors. The algorithm reduces remote memory costs by establishing affinity between where the computations are performed and where the data is located. Experiments using the compiler prototype demonstrate its efficiency in parallelizing standard benchmarks.

SUIF-Adapt [3] is an integrated compile-time/run-time system for global data distribution in DSM systems. The compiler divides the program into phases and initially selects an effective distribution for each phase. The run-time system monitors program execution and redistributes data in order to achieve the best completion time. As it combines static and dynamic analysis, the SUIF-Adapt system outperforms static-only compiler-based systems.

Nikolopoulos et al. [10] present two algorithms for moving each virtual memory page to the node that references it most frequently. The purpose of the page movement is to minimize the maximum latency due to remote memory accesses. One algorithm works with iterative parallel programs and is based on the assumption that the page reference pattern of one iteration will be repeated throughout the execution of the program. The other algorithm periodically checks for hot memory areas and migrates the pages with excessive remote references. Performance evaluations on an SGI Origin 2000 show a significant improvement in throughput.
Verghese et al. [16] study the performance improvements provided by OS-supported dynamic page migration and replication. This kind of page migration is based on information about full cache misses. Results of their experiments show a performance increase of up to 29% for some workloads.

The approaches for improving data locality described above can be divided into two categories: page-migration and compiler-based approaches. Both are efficient and transparent, but expensive and not accurate. In the compiler scheme, the locality analysis and optimizations take place at compile time and are therefore inaccurate, since compilers lack information about unpredictable and dynamic behavior. The data migration scheme usually decides the location of a page based on the assumption that the access pattern to a single page does not change over a period of time. It is possible, however, that immediately after a migration a page is accessed not by the local node but predominantly by another one. Pages may then ping-pong back and forth among processors, causing high overhead.

Our approach is based on information about a program's run-time data layout and can therefore lead to better parallel performance, since programmers can distribute the data more precisely using the memory access pattern information. A speedup improvement of up to 145.8%, achieved by simulating the optimized versions, demonstrates the efficiency of this approach.
3. The SIMulation Tool

SIMT simulates the parallel execution of applications on multiprocessors with NUMA characteristics. It serves not only as a tool for data locality optimization as described above, but also for architecture design. We therefore simulate architectures with various system configurations, different memory management policies, and different cache consistency protocols.
3.1 Target Architecture

The simulated architecture comprises a set of nodes connected via an interconnection network with NUMA access properties. Each node includes a processor, a three-level memory hierarchy consisting of an L1 cache, an L2 cache, and a local memory, and other system facilities, including the network interface. By default, the L1 cache has a line size of 32 bytes, a 2-way set-associative organization, an access time of 1 cycle, and a size of 32 KB per node. The L2 cache is organized in the same way, but by default has an access time of 10 cycles and a size of 512 KB per node. The physical memory, which is distributed among the nodes and mapped logically into a single address space, has an access time of 100 cycles. All these system configuration values can be modified using command line parameters, keeping the simulation tool fully flexible with respect to the concrete target architectures.
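As a minimal sketch, the default configuration above could be captured in a parameter structure like the following; the struct and field names are our own illustration of the described parameters, not SIMT's actual implementation:

    // Illustrative simulator configuration with the default values given
    // above; names and layout are assumptions, not SIMT's code.
    struct CacheConfig {
        int line_size;      // cache line size in bytes
        int associativity;  // number of ways
        int latency;        // access time in processor cycles
        int size;           // total size in bytes per node
    };

    struct SimConfig {
        CacheConfig l1{32, 2, 1, 32 * 1024};    // L1: 32-byte lines, 2-way, 1 cycle, 32 KB
        CacheConfig l2{32, 2, 10, 512 * 1024};  // L2: same organization, 10 cycles, 512 KB
        int local_latency = 100;    // local memory access time in cycles
        int remote_latency = 2000;  // remote access latency, user-specified
        int num_nodes = 4;          // all values overridable via command line
    };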
The shared data is partitioned with page granularity (4096 bytes) and either cyclically distributed among all nodes or allocated on a single node. Any memory access, if not satisfied by a cache hit, is serviced by either the local or a remote memory, depending on the data distribution. The latency of a remote memory access is specified by the user according to the properties of the target architecture.
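A cache miss is thus classified as local or remote by the page's home node. The following sketch shows this decision under both allocation policies; it is an illustration under our own naming, not SIMT's code:

    // Illustrative computation of a page's home node under the two
    // allocation policies described above.
    constexpr long PAGE_SIZE = 4096;

    enum class Policy { Full, RoundRobin };

    int home_node(long addr, Policy policy, int num_nodes, int full_node = 0) {
        long page = addr / PAGE_SIZE;
        if (policy == Policy::Full)
            return full_node;                       // all data on one node
        return static_cast<int>(page % num_nodes);  // pages dealt out cyclically
    }

    // A miss issued by `node` for `addr` is local iff
    // home_node(addr, policy, num_nodes) == node; otherwise it pays the
    // user-specified remote access latency.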
3.2 Implementation of the Memory Reference Generator

Software simulation of multiprocessor systems is traditionally performed using one of two methods: trace-driven simulation or execution-driven simulation. Since in the execution-driven approach the interleaving of memory references is the same as it would be on the simulated machine, simulators of shared memory systems, like MINT [15], MAYA [1], Limes [7], Augmint [9], and SPASM [12], usually deploy this approach. To our knowledge, there are two simulators supporting distributed shared memory architectures on top of PC platforms built from Intel x86 computers: Limes [7] and Augmint [9]. Limes provides cache and bus modules to simulate bus-based target architectures, while Augmint does not provide architectural components like caches. Augmint, on the other hand, provides scheduling routines that simplify the task of developing a target architecture simulator; we therefore adopted Augmint as the base of our memory reference generator and extended it to simulate memory management schemes and cache consistency protocols.
3.3 Memory Management and Cache Consistency

In a distributed shared memory system, shared data is distributed among processors according to a data allocation policy. SIMT currently includes a memory simulator which implements two memory management policies: "Full" and "Round-robin". In the "Full" allocation scheme, all data is allocated on one node, which can be specified by the user. In the "Round-robin" allocation scheme, the pages are allocated across all nodes in a round-robin fashion at the smallest possible granularity (usually the page size). In addition, SIMT includes a full cache simulator which implements two fundamentally different coherence schemes. The first one, called "invalidation scheme", invalidates all cache copies on a write operation, as a hardware-based cache coherence protocol does [11]; the second one, called "flush scheme", flushes caches at synchronization events, as an application-level software-based cache coherence protocol does [5].

Memory references captured by the memory reference generator and filtered by the cache simulator are then accounted for by SIMT. The numbers of local and remote accesses from each node to each shared page or unit are delivered to the user and serve as the basis of program optimization with respect to data locality. In addition, the simulation tool provides the elapsed simulation time and simulated processor cycles, statistics of L1 and L2 read and write hits and misses, and information about the coherency behavior according to the chosen coherency scheme.
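Conceptually, this accounting amounts to maintaining a per-node, per-page counter for every reference that escapes the caches. Continuing the earlier sketch (PAGE_SIZE, Policy, and home_node as defined there; again an illustrative formulation, not SIMT's internals):

    #include <vector>

    // Per-node, per-page access accounting behind the memory access
    // histogram.
    struct AccessHistogram {
        // counts[node][page]: references issued by `node` to shared `page`
        std::vector<std::vector<long>> counts;
        std::vector<long> local_refs, remote_refs;

        AccessHistogram(int nodes, long pages)
            : counts(nodes, std::vector<long>(pages, 0)),
              local_refs(nodes, 0), remote_refs(nodes, 0) {}

        // Called for every reference that misses in both cache levels.
        void record(int node, long addr, Policy policy, int num_nodes) {
            counts[node][addr / PAGE_SIZE]++;
            if (home_node(addr, policy, num_nodes) == node)
                local_refs[node]++;
            else
                remote_refs[node]++;
        }
    };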
4. Understanding and Improving the Data Layout

4.1 Benchmark Applications and Simulation Parameters

In order to study the run-time data layout of applications, we have simulated a spectrum of applications, including FFT, LU, RADIX, OCEAN, WATER-NSQUARED, and MP3D from the SPLASH-2 benchmark suite [17] with their default working set sizes, as well as a Successive Over-Relaxation (SOR) and a Gaussian Elimination (GAUSS) with a 195 × 195 grid and a 128 × 128 matrix, respectively. All applications are simulated using several system configurations and different data allocation policies in combination with the "invalidation" cache coherence scheme. All applications have been implemented using a straightforward SPMD model and executed in parallel with one thread per processor. All simulation parameters are set to their default values described in Section 3. The remote memory access latency varies between 100 and 2000 cycles, representing a range from architectures in which remote accesses cost no more than local ones up to the NUMA characteristics typically found in loosely-coupled PC clusters.
4.2 Frequency of Remote Memory Accesses

The first experiment examines the frequency of remote memory accesses. Table 1 shows the percentage of remote memory accesses relative to total memory accesses for each application under different data distribution policies. In principle, with data distributed uniformly across N nodes, on average (N−1)/N of all references to main memory are expected to be remote. The data in Table 1 confirms this estimate. In addition, we observe that the number of remote accesses increases dramatically for large systems. A particular data distribution policy may improve the data locality of a specific application; however, this improvement is very limited and cannot significantly reduce the impact of remote memory accesses on performance. Consider the FFT program running on 32 processors, which is representative of the general trend shown in Table 1. When the "Full" allocation scheme replaces "Round-robin", the fraction of local memory accesses improves by only 1.88%. Compared with the 95.89% share of remote memory accesses, this improvement is negligible.
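This estimate follows from simple counting: with shared data spread evenly over N nodes, each node owns 1/N of the data, so a random main memory reference is local with probability 1/N and remote otherwise:

\[
  P(\text{remote}) \;=\; \frac{N-1}{N}
  \;=\; 75\% \;(N{=}4),\quad 87.5\% \;(N{=}8),\quad 93.75\% \;(N{=}16),\quad \approx 96.9\% \;(N{=}32),
\]

which closely tracks the measured percentages in Table 1.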
                         FFT    LU     RADIX  OCEAN  WATER  MP3D   SOR    GAUSS
Full (node = 1)       4  78.8   54.1   80.7   72.2   75.9   83.2   73.1   82.5
                      8  89.3   94.9   89.6   91.7   88.6   92.5   81.8   93.6
                     16  94.6   95.9   96.1   97.0   97.3   96.3   81.5   96.7
                     32  95.9   99.8   97.2   99.2   98.6   98.0   89.5   99.0
Round-robin           4  75.2   79.9   79.2   78.5   59.0   76.3   70.5   68.3
(size = 1 page)       8  92.4   51.4   87.5   87.6   75.3   88.4   86.4   85.6
                     16  96.7   96.4   95.8   95.3   89.7   94.4   84.4   95.0
                     32  97.7   70.8   97.1   98.8   94.6   97.1   62.6   96.3
Round-robin           4  71.7   67.0   80.1   84.2   78.7   74.9   76.3   67.9
(size = 2 pages)      8  87.9   94.1   89.3   79.7   92.2   87.6   79.5   81.6
                     16  94.4   95.1   87.1   93.8   94.7   93.7   45.8   89.1
                     32  95.8   98.8   91.4   98.9   96.2   96.9   99.1   92.2

Table 1. Percentage of remote memory accesses (%) per application, allocation policy, and number of nodes.
4.3 Analyzing the Data Layout

The next set of experiments studies the memory access behavior using the memory access histogram gathered by the simulation tool. Figure 1 visualizes the memory access histogram of the LU program executed on a 4-node system using the "Round-robin" allocation scheme; it shows that the data of the LU program is not properly allocated throughout its execution. An example is the virtual page beginning at address 12288. From the upper graph of Figure 1 we observe that this page is frequently accessed by node 1. This page, however, is located on node 3, as shown in the lower graph of the figure. The page is therefore incorrectly allocated. Other pages behave similarly.

Besides the information on individual pages, a memory access histogram provides a global overview of an application's memory access pattern. As illustrated in the upper graph of Figure 1, it is easily visible that the data access of the LU program is characterized by blocks, with one processor frequently accessing only one block. This observation is also likely to hold for different working set sizes. It is therefore possible to deduce an application's memory access behavior from the simulation of a single working set size and thereby to optimize the application in general.
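The misplacement visible in Figure 1 can also be detected automatically from the histogram: for each page, compare its home node with the node that accesses it most often. A sketch of this check, building on the histogram structure above (again our own illustration, not SIMT's code):

    #include <cstdio>

    // Flag every page whose dominant accessor differs from its home node,
    // e.g. the page at address 12288, used by node 1 but placed on node 3.
    void find_misplaced_pages(const AccessHistogram& h, int num_nodes,
                              long num_pages, Policy policy) {
        for (long page = 0; page < num_pages; ++page) {
            int busiest = 0;
            for (int n = 1; n < num_nodes; ++n)
                if (h.counts[n][page] > h.counts[busiest][page])
                    busiest = n;
            int home = home_node(page * PAGE_SIZE, policy, num_nodes);
            if (busiest != home)
                std::printf("page @ %ld: home node %d, dominant accessor %d\n",
                            page * PAGE_SIZE, home, busiest);
        }
    }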
4.4 Impact of Optimization

Based on the analysis described above, the LU source code can easily be optimized with respect to the data distribution. Using special allocation macros added to the simulation tool, incorrectly allocated memory regions or blocks are placed on the nodes matching the memory access behavior seen in Figure 1. Similar optimizations have been applied to three other applications: RADIX, SOR, and WATER.

Figure 2 shows the memory access distribution of the LU program before the optimization (top) and after the optimization (bottom). It is clear that most of the memory references are remote before the data layout optimization. After the optimization, the need to access remote memories is heavily reduced. This reduction results in a significant speedup improvement, which is illustrated in Figure 3. This figure compares the optimized version with the transparent versions (without optimization) of the LU, RADIX, SOR, and WATER programs. The curves in Figure 3 plot the simulation time against the latency of remote accesses for the different data distribution policies. They clearly show the difference between the simulation time observed for each data allocation scheme and that for the optimized data distribution. For SOR, WATER, and RADIX, the "Round-robin" scheme works better than the "Full" policy, while the "Full" scheme works better with LU. For each code and each allocation scheme, however, there is room to improve the simulation time, and the curves for the optimized versions exhibit this potential. Moreover, this improvement grows as the latency of a remote memory access increases.

The most dramatic improvement can be observed for the SOR code. Its optimized version runs about 2.5 times as fast (an improvement of 145.8%) as the transparent version using the "Full" policy, when simulated with a latency of 2000 processor cycles for remote memory accesses. This can be attributed to the improvement of 161% in data locality, which leaves only 20.07% of the total number of memory accesses going to remote memories. A similar result can be observed for LU. For RADIX and WATER, however, the improvement is more moderate, at 40.6% and 21.59% respectively, due to their more complex access patterns.
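The paper's macro interface is not spelled out here; purely as a hedged illustration of the kind of directive involved, a block-wise placement for LU might look as follows. The macro name ALLOC_ON_NODE, its signature, and all constants are our assumptions, not SIMT's actual API:

    // Hypothetical placement directive; name and signature are assumptions,
    // not SIMT's actual allocation macros.
    #define ALLOC_ON_NODE(ptr, bytes, node) /* place [ptr, ptr+bytes) on node */

    int main() {
        const long n = 512, num_nodes = 4;
        double *matrix = new double[n * n];  // shared LU matrix
        long block = n * n / num_nodes * sizeof(double);
        for (int node = 0; node < num_nodes; ++node)
            // Give each node the contiguous block it predominantly accesses,
            // matching the block-wise pattern visible in Figure 1.
            ALLOC_ON_NODE(reinterpret_cast<char *>(matrix) + node * block,
                          block, node);
        delete[] matrix;
    }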
5. Conclusion

This paper presents a simulation tool that simulates the parallel execution of shared memory programs on NUMA architectures and gathers information about their memory access behavior in the form of memory access histograms. These include the number of local and remote accesses to each shared virtual page or unit. The tool thereby allows an understanding of the run-time data layout of programs and the detection of access hot spots and communication bottlenecks, and it aids users in optimizing applications towards better run-time data locality and significant performance improvements. Results of simulated executions of small codes have shown this potential.
[Figure 1: two bar charts plotting the number of accesses over the address of the accessed memory location; the upper graph breaks the accesses down by the issuing node (access from nodes 0–3), the lower graph by the node holding the data (access to nodes 0–3).]

Figure 1. Memory accesses of the LU program running on a 4-node system using Round-robin.

[Figure 2: two bar charts plotting the number of local and remote accesses over the address of the accessed memory location, before optimization (top) and after optimization (bottom).]

Figure 2. Comparison of memory accesses before and after optimization (LU program).
[Figure 3: four panels (LU, RADIX, SOR, WATER) plotting the simulation time (per 10,000,000 cycles) against the latency of remote accesses (200–2000 processor cycles) for the round-robin (size = 1), round-robin (size = 2), full, and optimized data distributions.]

Figure 3. Impact of the data locality optimization on simulation time.
As a first line of future work, this kind of optimization will be validated on actual NUMA systems, such as PC clusters interconnected via the Scalable Coherent Interface [2], an IEEE-standardized System Area Network with a latency of less than 2 µs and a bandwidth of more than 80 MB/s for process-to-process communication. This work has begun with a Data Layout Visualizer [14], which presents the memory access histogram provided by the simulation tool in a user-understandable way and projects the information in the memory access histograms back onto the data structures within the source code. The second line of future work is to extend the simulation tool with further data allocation schemes, cache coherence protocols, and possibly interconnection technologies, thereby turning it into a more general tool for performance evaluation and architecture design.
References

[1] Divyakant Agrawal, Manhoi Choy, Hong Va Leong, and Ambuj K. Singh. Maya: A simulation platform for parallel architectures and distributed shared memories. Technical Report TRCS93-24, Department of Computer Science, University of California, Santa Barbara, 1993.

[2] Hermann Hellwagner and Alexander Reinefeld, editors. SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Computer Clusters, volume 1734 of Lecture Notes in Computer Science. Springer-Verlag, 1999.

[3] Gregory M. S. Howard and David K. Lowenthal. An integrated compiler/run-time system for global data distribution in distributed shared memory systems. In Proceedings of the Second Workshop on Software Distributed Shared Memory Systems, 2000.

[4] Intel Corporation. Intel Architecture Software Developer's Manual for the Pentium II, volumes 1–3. Published on Intel's developer website, 1998.

[5] Wolfgang Karl and Martin Schulz. Hybrid-DSM: An efficient alternative to pure software DSM systems on NUMA architectures. In Proceedings of the 2nd International Workshop on Software DSM (held together with ICS 2000), May 2000.

[6] Arvind Krishnamurthy and Katherine Yelick. Analyses and optimizations for shared address space programs. Journal of Parallel and Distributed Computing, 38(2):130–144, 1996.

[7] D. Magdic. Limes: An execution-driven multiprocessor simulation tool for the i486+-based PCs. School of Electrical Engineering, Department of Computer Engineering, University of Belgrade, Belgrade, Serbia, Yugoslavia, 1997.

[8] A. G. Navarro and E. L. Zapata. An automatic iteration/data distribution method based on access descriptors for DSMM. In Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing (LCPC'99), San Diego/La Jolla, CA, USA, August 1999.

[9] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint multiprocessor simulation toolkit for Intel x86 architectures. In Proceedings of the 1996 International Conference on Computer Design, October 1996.

[10] Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, et al. User-level dynamic page migration for multiprogrammed shared-memory multiprocessors. In Proceedings of the 29th International Conference on Parallel Processing, pages 95–103, Toronto, Canada, August 2000.

[11] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition, 1996.

[12] Anand Sivasubramaniam. Execution-driven simulators for parallel systems design. In S. Andradóttir, editor, Proceedings of the 1997 Winter Simulation Conference, pages 1021–1028, 1997.

[13] S. Tandri and T. S. Abdelrahman. Automatic partitioning of data and computations on scalable shared memory multiprocessors. In Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), pages 64–73, August 1997.

[14] Jie Tao, Wolfgang Karl, and Martin Schulz. Visualizing the memory access behavior of shared memory applications on NUMA architectures. In Proceedings of the 2001 International Conference on Computational Science (ICCS), San Francisco, CA, May 2001. To appear.

[15] J. E. Veenstra and R. Fowler. MINT tutorial and user manual. Technical Report 452, University of Rochester, 1993.

[16] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. OS support for improving data locality on CC-NUMA compute servers. Technical Report CSL-TR-96-688, Computer Systems Laboratory, Stanford University, February 1996.

[17] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.