Improving Performance of OpenMP for SMP Clusters through Overlapped Page Migrations

Woo-Chul Jeun (1), Yang-Suk Kee (2), and Soonhoi Ha (1)

(1) Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea
{wcjeun, sha}@iris.snu.ac.kr, http://peace.snu.ac.kr/research/parade/
(2) Computer Science & Engineering, University of California, San Diego, USA
[email protected]
Abstract. Costly page migration is a major obstacle to integrating OpenMP and page-based software distributed shared memory (SDSM) to realize an easy-to-use programming paradigm for SMP clusters. To reduce the impact of the page migration overhead on the execution time of an application, previous research has mainly focused on reducing the number of page migrations and on hiding the page migration overhead by overlapping computation and communication. We propose the ‘collective-prefetch’ technique, which overlaps page migrations themselves and remains effective even when the prior approaches cannot be applied. Experiments with a communication-intensive application show that our technique reduces the page migration overhead significantly and the overall execution time to 57%~79% of the original.
1 Introduction

Since the OpenMP specification [1] was proposed as a standard shared-memory programming model in 1998, there have been a variety of efforts to adopt OpenMP as an easy-to-use programming paradigm for parallel processing platforms ranging from small chip-level systems [2] to Grids. As commodity off-the-shelf symmetric multiprocessors (SMPs) and high-speed network devices have become widely deployed, SMP clusters have become an attractive platform for high-performance computing. Accordingly, there have been many studies [3][4][5][6][7][8][9][10] on applying the OpenMP programming model to cluster systems. An intuitive approach to realizing OpenMP for SMP clusters is to use an SDSM (software distributed shared memory) system, which transparently provides a single shared address space across distributed memories. Most of these studies utilize page-based SDSM systems, which maintain memory consistency with a user-level page-fault signal handler.

The execution time of an application on a page-based SDSM system can be decomposed into computation time, page migration overhead, synchronization overhead, and signal handler overhead. We define a page migration as the procedure that sends a page request to the home node and receives the page reply for the request,
and the page migration overhead as the net time to complete all page migrations. It is known that SDSM systems suffer from poor performance due to high synchronization overhead and excessive accesses to remote pages [11]. Therefore, the common challenge confronting the prior studies on OpenMP for SMP clusters was how to overcome these intrinsic performance bottlenecks of conventional SDSM systems, in particular how to reduce the page migration and synchronization overheads. Their techniques can be categorized into five groups: implementing synchronization directives efficiently [5], reducing the shared address space [6][8], lessening the number of page migrations [4][5][6][8][9][10], reducing the page migration delay with fast communication hardware and a page update protocol [7], and hiding the page migration overhead by overlapping computation with page migration [6][10]. Some of these techniques are complementary and can therefore be applied independently to improve performance.

In this paper, we assume that the number of page migrations has already been minimized by other techniques. To lessen the page migration overhead for a given number of page migrations, previous studies have mainly focused on hiding it behind computation. However, this is effective only if the computation time is large enough to hide the page migration overhead. This paper proposes the ‘collective-prefetch’ technique to lessen the page migration overhead: it analyzes the page access patterns and prefetches remote pages by overlapping page migrations with one another. Note that our technique differs from the prior techniques in that it can be used even when computation cannot overlap communication.
When this technique is applied to our target SDSM system, experiments with a communication-intensive application show that the page migration overhead is reduced to 30%~72% and the overall execution time to 57%~79%. This paper is organized as follows. Section 2 presents the motivation of our research with a communication-intensive application. Section 3 explains the proposed technique. Section 4 details our implementation and presents experimental results obtained with an OpenMP programming environment. Finally, we conclude with some directions for future research.
2 Motivational Example

We used ParADE [5] as our target system. ParADE is an OpenMP-based parallel programming environment that consists of an OpenMP translator, a thread-safe SDSM based on the HLRC (home-based lazy release consistency) protocol [11] with migratory-home and multiple-writer protocols, and an MPI library such as MPI/Pro. By executing the program in a hybrid model of software DSM and MPI, ParADE shows better performance than pure SDSM-based environments such as Omni/SCASH. Our motivational application is FT [12], which contains the computational kernel of a 3-D Fast Fourier Transform (FFT)-based spectral method. Table 1 shows the number
of page migrations and the execution time breakdown of FT on ParADE for 2~8 nodes. To isolate the page migration overhead clearly, we use one computation thread per node to avoid overlapping computation and communication. The page migration overhead is about 65% or more of the total execution time, much larger than the computation time. Consequently, overlapping computation with page migration has little effect on this application: the possible improvement is no more than the computation time, about 13%~26%. Note that this poor performance is not due to any inefficiency in the ParADE implementation; when we ran the same FT program on Omni/SCASH, the execution time was much longer than on ParADE.

Table 1. The number of page migrations and execution time breakdown of FT class A on ParADE
Nodes  Synchronization  Handler  Computation  Migration  Execution time (sec)  Page migrations of each node (times)
2      1.54%            7.49%    26.18%       64.79%     116.29                115,968 ~ 116,399
4      2.50%            6.52%    17.62%       73.36%      87.27                 87,872 ~  88,397
8      5.62%            5.91%    12.87%       75.60%      59.16                 51,648 ~  52,187
Like most page-based SDSM systems, ParADE uses a segmentation fault signal handler for page migrations [13]. Figure 1 illustrates the memory consistency mechanism of a page-based SDSM. Any unprivileged access by a computation thread (at Node 0) to a page invokes the segmentation fault signal handler, and the handler then fetches the valid page by negotiating with the owner (Node 1) of the page. The problem is that the handler is blocked for a long time while waiting for the reply, which makes the application experience long memory access latencies at runtime.

[Figure 1: timeline of the computation and page server threads at Node 0 and Node 1, showing the SIGSEGV signal handler at Node 0 sending a page request to Node 1's page server thread and idling until the 4 KB page reply arrives.]
Fig. 1. Page migration between nodes in ParADE. Time runs down the page
The commonly used approaches to overcoming such long-latency problems are pipelining and prefetching, which have been proposed in different contexts such as the Web [14], HPF [15], and SDSM without OpenMP [16]. These techniques enable computation and communication to be overlapped. In contrast, we overlap page migrations themselves. Our technique does not reduce the latency of each page migration, but it reduces the total page migration overhead. Furthermore, it can be effective even when the computation workload is small.
3 Overlapping Page Migrations

In this section, we present the proposed technique based on the inspector-executor model [17]. Our technique consists of two steps: generating the inspector-executor code and translating the OpenMP code. Without any modification of the OpenMP source code, our OpenMP translator automatically generates the inspector-executor code from the source code. The inspector checks all references to the shared memory and collects the required page migrations. The executor overlaps the page migrations and executes the actual parallel loop. In the first step, the OpenMP translator replaces the original parallel loop with the inspector-executor code. In the second step, the OpenMP translator translates the inspector-executor code.

3.1 Inspector-Executor Code

The OpenMP translator performs a pre-processing step to generate the inspector-executor code before it translates a parallel loop. For example, Figure 2 shows a simple OpenMP code segment where x and y are shared.

#pragma omp parallel for default(shared)
for(i=0;i