Efficient Shared Memory Programming on Software-DSM Systems

Jie Tao1, Wolfgang Karl1, Carsten Trinitis2, Hubert Eichner2, and Tobias Klug2

1 Institut für Technische Informatik, Universität Karlsruhe (TH), Kaiserstr. 12, 76128 Karlsruhe, Germany
{tao, karl}@ira.uka.de

2 Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany
{trinitic, eichnerh, klug}@cs.tum.edu

Abstract
Software Distributed Shared Memory (SDSM) generally forms the basis for shared memory programming on cluster architectures. Due to their software-based implementation and page-based inter-node transfers, however, SDSMs usually do not show good performance. This paper discusses an optimization approach targeting data locality, which alleviates the overhead caused by communication across processor nodes. The approach relies on a simulation platform to acquire performance data about runtime data accesses; this information is then presented by a visualization tool. Based on the presented access patterns, users can detect the dominating node of each data page and then allocate the page on the node that accesses it most. The approach is verified on the simulation platform using standard benchmark applications, and the experimental results show an improvement of up to 70%.
1 Introduction
Cluster systems usually exhibit good scalability and cost-effectiveness. They are hence increasingly used for High Performance Computing in both research and commercial settings. Traditionally, message passing is used on such machines to parallelize applications. Over the last years, however, shared memory models have also been widely applied to programming cluster systems. A well-known example is OpenMP [2], a portable, standardized shared memory model initially designed for SMPs and currently being extended to clusters. In contrast to the message passing paradigm, shared memory models provide a programming interface semantically close to sequential programming, thus relieving users of the burden of having to express communication explicitly in the source code.
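For illustration, the following minimal C/OpenMP loop (not taken from the benchmarks used later in this paper) shows what "semantically close to sequential programming" means in practice: the parallel version differs from the sequential one only by a directive, and no communication is written by hand.

```c
/* Minimal OpenMP example: work sharing and the reduction are expressed
 * declaratively; on a cluster SDSM the runtime handles data movement. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* The only difference from the sequential loop is this directive. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```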
A prerequisite for shared memory programming on clusters is a distributed shared memory (DSM) that establishes a global memory space visible to all processor nodes. Accordingly, a number of DSMs have been developed, either based solely on software mechanisms [7, 17, 19, 16] or relying on additional hardware support [18]. However, shared memory programs often show poor performance on DSM systems, and this problem is especially critical for software-based DSMs (SDSMs), primarily due to the management of shared data. Hence, this work focuses on mechanisms capable of reducing the overhead of data transfer across processor nodes.

On an SDSM system, the memory is actually distributed over the processor nodes. The global address space seen by the programmer is therefore not physical but virtual, and the data a processor needs for its computation may reside on another node. In this case, SDSMs rely on the page fault handler of the operating system to fetch the data from the remote node (a simplified sketch of this mechanism is given at the end of this section). With this scheme, however, the whole page has to be copied, resulting in considerable inter-node traffic. To tackle this performance problem, we use locality optimizations that place each data page on its dominating node. This means that a data page is not placed on the node that first touches it, as traditional SDSMs usually do. Rather, it is explicitly allocated by the programmer on the node that dominantly accesses it during the execution. For this, we simulate the execution of a program and collect information about the runtime data accesses. This information is then presented to the user with a visualization tool, with which users can detect the dominating node of each data page. The proposed approach has been verified using the simulation platform; the experimental results show a considerable improvement for most tested applications.

The remainder of this paper is organized as follows: Section 2 briefly describes the concept of software distributed shared memory and related work. This is followed in Section 3 by a discussion of the efficiency problem of shared memory programming on SDSMs, using ViSMI, an SDSM built for our InfiniBand cluster, as a case study. Section 4 describes the optimization approach in detail, including the simulator, the visualization tool, and the optimization process. First experimental results are presented in Section 5. The paper concludes with a short summary and a few future directions in Section 6.
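The page-fault mechanism referred to above can be pictured roughly as follows. This is a simplified POSIX sketch, not ViSMI code: the function names (remote_fetch_page, install_sdsm_fault_handler) and the fixed 4 KB page size are assumptions, and consistency handling as well as error checking are omitted.

```c
/* Sketch of how an SDSM can intercept accesses to remote pages: shared
 * pages start out inaccessible, the resulting SIGSEGV is caught, the
 * whole page is fetched from the node holding a valid copy, and the
 * mapping is re-enabled. The unit of transfer is always a full page. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Placeholder: a real SDSM would copy the page contents here from the
 * node that currently holds the valid copy (e.g. over InfiniBand). */
static void remote_fetch_page(void *page)
{
    (void)page;
}

static void sdsm_fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* Round the faulting address down to its page boundary. */
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));

    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* make the page accessible */
    remote_fetch_page(page);                            /* then fill it with remote data */
}

/* Shared pages are initially mapped with PROT_NONE; the first touch by a
 * node therefore raises SIGSEGV and triggers the handler above. */
void install_sdsm_fault_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sdsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```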
2 SDSM and Related Work
The basic idea behind software distributed shared memory is to provide programmers with a virtually global address space on cluster architectures. This idea was first proposed by Kai Li [11] and implemented in IVY [12]. As the memories are actually distributed across the cluster, the required data may be located on a remote node, and multiple copies of shared data may exist. The latter leads to consistency issues: a write operation on shared data must become visible to the other processors. To tackle these problems, SDSMs usually rely on the page fault handler of the operating system to provide page protection mechanisms that implement invalidation-based consistency models.

The purpose of a memory consistency model is to precisely characterize the behavior of the respective memory system by clearly defining the order in which memory operations are performed. Depending on the concrete requirements, this order can be more or less strict, leading to various consistency models. The first and also strictest consistency model is Sequential Consistency [10], which requires a multiprocessor system to achieve the same result for any execution as if the operations of all processors were executed in some sequential order, with the operations of each individual processor appearing in this sequence in the order specified by its program. This provides an intuitive and easy-to-follow memory behavior; however, the strict ordering requires the memory system to propagate updates early and prohibits optimizations in both hardware and compilers. Hence, other models have been proposed that relax the constraints of Sequential Consistency with the goal of improving the overall performance.

Relaxed consistency models [1, 3, 7, 8] define a memory model in which programmers use explicit synchronization. Synchronizing memory accesses are divided into Acquires and Releases: an Acquire grants access to shared data and ensures that the data is up to date, while a Release relinquishes this access right and ensures that all memory updates have been properly propagated. By separating synchronization in this way, invalidations are only performed at synchronization operations, thereby avoiding unnecessary invalidations caused by premature coherence operations. A well-known relaxed consistency model is Lazy Release Consistency (LRC) [8], in which invalidations are propagated at acquire time. This postpones any communication of write updates until the data is actually needed. To reduce the communication caused by false sharing, where multiple unrelated shared data items reside on the same page, LRC protocols usually support a multiple-writer scheme. Within this scheme, multiple writable copies of the same page are allowed, and a clean copy is generated after an invalidation. Home-based Lazy Release Consistency (HLRC) [16], for example, implements such a multiple-writer scheme by assigning a home to each page. All updates to a page are propagated to its home node at synchronization points, such as lock releases and barriers. Hence, the page copy on the home node is always up to date.

Over the last years, a significant amount of work has been done in the area of software distributed shared memory. Quarks [19] is an example using relaxed consistency; it aims at providing a lean implementation of DSM that avoids complex, high-overhead protocols. TreadMarks [7] is the first implementation of shared virtual memory using the LRC protocol on a network of stock computers. Shasta [17] is another example that uses LRC to realize a distributed shared memory on clusters. For the HLRC protocol, representative works include Rangarajan and Iftode's HLRC-VIA [16] for a GigaNet VIA-based network, VIACOMM [4], which implements a multithreaded HLRC also over VIA, and NEWGENDSM [14], an SDSM for InfiniBand clusters. We also implemented an SDSM, called ViSMI (Virtual Shared Memory for InfiniBand) [15]. ViSMI provides a shared virtual address space by implementing the HLRC protocol. In addition, it establishes an interface for developing shared memory programming models on cluster architectures.
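To make the acquire/release semantics of the home-based protocol more concrete, the following sketch outlines where each protocol action takes place. All function names (sdsm_acquire, sdsm_release, and the stubs they call) are invented for illustration and do not correspond to the ViSMI API or to any particular HLRC implementation.

```c
/* Schematic sketch of home-based lazy release consistency (HLRC).
 * The stub bodies only mark where the real network/VM work would go. */

/* --- protocol actions (stubs standing in for real protocol code) --- */
static void fetch_write_notices(int lock_id)      { (void)lock_id; } /* from the last releaser */
static void invalidate_notified_pages(void)       {}                 /* protect the affected pages */
static void diff_dirty_pages_against_twins(void)  {}                 /* multiple-writer scheme */
static void send_diffs_to_home_nodes(void)        {}                 /* home copies become up to date */
static void pass_lock_ownership(int lock_id)      { (void)lock_id; }

/* Acquire: gain access to shared data and ensure it is up to date.
 * Under LRC, invalidations are applied only now, at acquire time; a
 * later access to an invalidated page faults and fetches the current
 * copy from the page's home node. */
void sdsm_acquire(int lock_id)
{
    fetch_write_notices(lock_id);
    invalidate_notified_pages();
}

/* Release: propagate local modifications before giving up access.
 * Per-page diffs (dirty page vs. its clean copy) rather than whole
 * pages are sent to the home nodes, limiting the cost of false sharing. */
void sdsm_release(int lock_id)
{
    diff_dirty_pages_against_twins();
    send_diffs_to_home_nodes();
    pass_lock_ownership(lock_id);
}
```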
3 Efficiency Problem of Shared Memory Programming on SDSMs: A Case Study
In order to depict the performance problem of shared memory applications running on SDSM-based clusters, we examine the parallel execution of OpenMP programs on top of our InfiniBand cluster. This OpenMP execution environment [5] is based on ViSMI; to adapt to the ViSMI programming interface, we modified the Omni OpenMP compiler for SMPs [9] and built a new runtime library.
Figure 1: Speedup on systems with different numbers of processors (Radix, FT, CG, Sparse, SOR, and Gauss)

Figure 1 shows the performance of several OpenMP applications from different benchmark suites together with a few small self-coded kernels. The speedup is measured on our InfiniBand cluster with different numbers of nodes. It can be observed that for most applications only a slight speedup is achieved by parallel execution. For example, except for the SOR (Successive Over Relaxation) code, all applications show a speedup below 3 when running on the cluster with 6 processors, i.e. an efficiency of less than 0.5. To find the reason, we examine the execution time breakdown of two sample codes, CG and Gauss. Figure 2 shows the result.

Figure 2: Sample normalized execution time breakdown (CG and Gauss)

In Figure 2, exec. denotes the overall execution time, while comput. specifies the time for actually executing the application, page is the time for fetching pages, barrier denotes the time for barrier operations (including both the synchronization itself and the transfer of diffs), lock is the time for performing locks, handler is the time
needed by the communication thread, and overhead is the time for other protocol activities. The sum of all partial times equals the total exec. time. It can be seen that for Gauss only about half of the time is spent on computation, and for CG the situation is even worse, with only 33% of the overall time spent on actually running the program. The time for page transfer accounts for a high proportion in both applications. In addition, synchronization activities such as lock and barrier also take a significant amount of the execution time; for CG this time even exceeds the computation time. This demonstrates the need to optimize the behavior of SDSMs with respect to page and synchronization operations. As a first step towards this kind of optimization, we focus on the former: reducing the time for data transfer.
4 The Approach
For such an optimization we first need knowledge about the data accesses performed at runtime during the execution of an application. For this we rely on an existing simulation tool called SIMT [20]. SIMT is a multiprocessor simulator modeling the parallel execution of shared memory applications on systems with shared or distributed shared memory. As it is used for research in the area of memory systems, SIMT mainly models the memory hierarchy, including arbitrary caches and the management of the main memory. SIMT itself is based on Augmint [13], an event-driven multiprocessor simulator. Hence, it applies the same SPMD programming model, where parallelism is expressed with m4 macros in a fashion similar to the SPLASH and SPLASH-2 applications. For simulating the parallel execution of OpenMP applications, we both extended SIMT and modified the Omni OpenMP compiler to map the OpenMP programming model onto the simulation platform. In addition, we added mechanisms to SIMT to simulate the page fault activities of SDSM systems.

Like most simulators, SIMT provides extensive statistics on the execution of an application during the simulation process. These include the elapsed simulation time, the simulated processor cycles, the total number of memory references, and information about page faults. In addition, SIMT delivers detailed information about data accesses in the form of memory access histograms. Each entry in the histogram corresponds to a data page and records the number of accesses performed by each processor. This enables the user to detect dominating nodes, providing the basis for determining a better data distribution with less inter-node communication.

To support this, we developed a visualization tool that presents the performance data in user-understandable graphical views. The visualizer uses different forms, such as diagrams, tables, curves, and 3D charts, to illustrate the various aspects of data accesses at different granularities. The visualization can be based on either virtual pages or data structures; the latter directly supports the detection of optimization targets in the source code.

Figure 3: Sample views of the visualization tool (virtual page)

Figure 3 shows two sample views based on virtual pages. The left one depicts a 2D block view that presents hot memory accesses over the application's whole shared virtual address space. For each page it shows, in color, the node that has issued the most accesses to the page. The number of accesses is depicted by the length of the column.
This allows the user to detect the node on which a page should be hosted in order to reduce page copying and the data transfer it causes on an SDSM system.

The right of Figure 3 is a 3D view, also showing the accesses to all virtual pages of the working set. In contrast to the 2D view, it separates the accesses into phases, where a phase can be a code region, a function, or a synchronization phase. Within a single phase, each page is represented by a colored bar showing all accesses from all nodes. The higher a bar, the more accesses were performed by the corresponding node. This enables easy detection of the dominating node for each virtual page. In addition, the viewpoint is adjustable, enabling a change of focus to an individual phase or the observation of changes in access behavior between phases. This allows a more detailed data allocation that corresponds to the different behavior in the various program phases. In the concrete example, different access patterns can be observed for different phases: while most pages are dominantly accessed by a single node in the first, second, and fourth phase, they are also accessed by other nodes in the remaining phases. This indicates that for some applications separate optimizations should be performed for each phase. An accumulated view across all phases alone would have failed to show this behavior and would have led to incorrect conclusions.

Figure 4 illustrates a sample data structure view showing the memory accesses performed on the main working set of the numerical kernel SOR (Successive Over Relaxation) on an eight-node system. This view directly shows the accesses to the main data structure, a large dense matrix. Small squares within the diagram represent the array elements; depending on the size of the array, each square can correspond to several elements. The color of a square indicates the node that performs the most accesses to these elements. For this concrete example, a block access pattern can be observed, where each block is predominantly accessed by a single processor node. The diagram can hence direct the user to allocate the data onto the dominating nodes.
Figure 4: Sample views of the visualization tool (data structure)
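The memory access histogram described above can be thought of as a table with one counter per page and node, from which the dominating node of each page follows directly. The following sketch merely illustrates this idea; the data layout, sizes, and sample counts are assumptions and do not reflect SIMT's actual output format.

```c
/* Minimal model of a per-page access histogram: one counter per
 * (page, node) pair, from which the dominating node of a page can be
 * derived. Layout and values are illustrative only, not SIMT data. */
#include <stdio.h>

#define NUM_NODES 8
#define NUM_PAGES 1024

/* accesses[p][n] = number of accesses of node n to virtual page p */
static unsigned long accesses[NUM_PAGES][NUM_NODES];

/* The node that issued the most accesses to a page is the candidate
 * home node for that page. */
int dominating_node(int page)
{
    int best = 0;
    for (int n = 1; n < NUM_NODES; n++)
        if (accesses[page][n] > accesses[page][best])
            best = n;
    return best;
}

int main(void)
{
    accesses[0][3] = 1200;   /* fabricated sample counts, for illustration only */
    accesses[0][1] = 150;
    printf("page 0 should be homed on node %d\n", dominating_node(0));
    return 0;
}
```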
Figure 5: Access behavior of SOR before (top) and after optimization (bottom): number of local and page-fault accesses per virtual page
5 Experimental Results
Following the directions from the visualization tool, we can explicitly allocate the data in the source code with a specification of the home node. As a first step, we use the simulation platform to examine the influence of such optimizations. To enable the specification of a home node for virtual pages, we implemented a macro ALLOC that the user can insert into the source code to specify the number of a page, or of a block of pages, together with the corresponding home node.

Figure 5 illustrates a sample experimental result for the SOR code. The figure shows the access behavior of the first 100 pages of the working set. For each page two numbers are presented: one for the accesses performed locally and the other for those that required a page fault. Comparing both diagrams, it can be clearly seen that after the optimization fewer accesses are served via page faults than before. This is caused by the correct placement of the data pages on the dominating nodes, which are now specified as the homes.
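The exact signature of ALLOC is not given here, so the following sketch only illustrates how such a placement could look, assuming a three-argument form ALLOC(first_page, num_pages, node) and a simple block-wise mapping of a matrix onto the nodes; both the signature and the page numbering are assumptions, not the actual interface.

```c
/* Hypothetical use of the ALLOC macro: the paper only states that it
 * takes the number of a page (or of a block of pages) and the home
 * node. Everything concrete below is assumed for illustration. */

#ifndef ALLOC
/* Stub so the sketch compiles standalone; the real macro is provided
 * by the SDSM/simulation environment. */
#define ALLOC(first_page, num_pages, node) ((void)0)
#endif

#define PAGE_SIZE 4096
#define N 1024

double matrix[N][N];   /* main working set, e.g. the dense SOR matrix */

/* Block-wise placement derived from the visualization: each node
 * becomes the home of the block of pages it dominantly accesses. */
void place_matrix_pages(int num_nodes)
{
    int num_pages = (int)(sizeof(matrix) / PAGE_SIZE);  /* pages covered by the matrix */
    int block     = num_pages / num_nodes;

    for (int node = 0; node < num_nodes; node++)
        ALLOC(node * block, block, node);
}
```

In practice, the page ranges and home nodes would be taken from the dominating-node information shown by the visualization tool rather than from a fixed block scheme.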
Figure 6: Improvement (original execution time / optimized execution time) of the tested OpenMP applications on 4, 8, 16, and 32 nodes

To examine the impact of this optimization on the overall execution behavior, we tested a variety of OpenMP applications from different benchmark suites, including the NAS OpenMP benchmarks [6] and our self-ported OpenMP version of the SPLASH-2 benchmarks [21], as well as a few small self-coded kernels. Figure 6 presents the experimental results. The figure gives the normalized improvement obtained by dividing the execution time of the unoptimized version by that of the optimized version. For each application, this metric is measured on systems with 4, 8, 16, and 32 processor nodes. It can be observed that all applications show an improvement (a value greater than 1.0), with the SOR code achieving the best result. The improvement rate increases on systems with more processors, indicating that larger systems benefit more from this optimization.
6 Conclusions
In this paper, we introduce an approach for raising the efficiency of shared memory programming on SDSM-based cluster systems. This approach relies on a multiprocessor simulator to acquire the runtime memory access behavior, a visualization tool to display the performance data in a user-understandable manner, and explicit data allocation to reduce runtime inter-node communication. The experimental results on the simulation platform show the feasibility of this approach.

This work is the initial phase of this kind of optimization. In the next step, we will move it to the InfiniBand cluster. For this, performance data will be collected via the operating system, and ViSMI, the SDSM established for our cluster, will be extended to support the explicit specification of home nodes.
References

[1] A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software Versus Hardware Shared-Memory Implementation: A Case Study. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 106–117, April 1994.
[2] L. Dagum and R. Menon. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering, 5(1):46–55, January 1998.
[3] L. Iftode and J. P. Singh. Shared Virtual Memory: Progress and Challenges. In Proceedings of the IEEE, Special Issue on Distributed Shared Memory, volume 87, pages 498–507, 1999.
[4] V. Iosevich and A. Schuster. Multithreaded Home-Based Lazy Release Consistency over VIA. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, pages 59–68, April 2004.
[5] J. Tao, W. Karl, and C. Trinitis. Implementing an OpenMP Execution Environment on InfiniBand Clusters. In Proceedings of IWOMP 2005 (Lecture Notes in Computer Science), Eugene, USA, June 2005.
[6] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011, NASA Ames Research Center, October 1999.
[7] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.
[8] P. J. Keleher. Lazy Release Consistency for Distributed Shared Memory. PhD thesis, Department of Computer Science, Rice University, January 1995.
[9] K. Kusano, S. Satoh, and M. Sato. Performance Evaluation of the Omni OpenMP Compiler. In Proceedings of the International Workshop on OpenMP: Experiences and Implementations (WOMPEI), volume 1940 of LNCS, pages 403–414, 2000.
[10] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, C-28(9):690–691, 1979.
[11] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September 1986.
[12] K. Li. IVY: A Shared Virtual Memory System for Parallel Computing. In Proceedings of the International Conference on Parallel Processing, Vol. II Software, pages 94–101, 1988.
[13] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, pages 486–491. IEEE Computer Society, October 1996.
[14] R. Noronha and D. K. Panda. Designing High Performance DSM Systems Using InfiniBand Features. In DSM Workshop, in conjunction with the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004.
[15] C. Osendorfer, J. Tao, C. Trinitis, and M. Mairandres. ViSMI: Software Distributed Shared Memory for InfiniBand Clusters. In Proceedings of the 3rd IEEE International Symposium on Network Computing and Applications (IEEE NCA04), pages 185–191, September 2004.
[16] M. Rangarajan and L. Iftode. Software Distributed Shared Memory over Virtual Interface Architecture: Implementation and Performance. In Proceedings of the 4th Annual Linux Showcase, Extreme Linux Workshop, pages 341–352, Atlanta, USA, October 2000.
[17] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 174–185, October 1996.
[18] M. Schulz. SCI-VM: A Flexible Base for Transparent Shared Memory Programming Models on Clusters of PCs. In Proceedings of HIPS'99, volume 1586 of Lecture Notes in Computer Science, pages 19–33, Berlin, April 1999. Springer Verlag.
[19] M. Swanson, L. Stoller, and J. B. Carter. Making Distributed Shared Memory Simple, Yet Efficient. In Proceedings of the 3rd International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 2–13, March 1998.
[20] J. Tao, M. Schulz, and W. Karl. A Simulation Tool for Evaluating Shared Memory Systems. In Proceedings of the 36th Annual Simulation Symposium, pages 335–342, Orlando, Florida, April 2003.
[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.