TR-CS-08-03

Predicting Performance of Intel Cluster OpenMP with Code Analysis Method

Jie Cai, Alistair P. Rendell, Peter E. Strazdins, H'sien Jin Wong

November 2008

Joint Computer Science Technical Report Series
Department of Computer Science
Faculty of Engineering and Information Technology
Computer Sciences Laboratory
Research School of Information Sciences and Engineering

This technical report series is published jointly by the Department of Computer Science, Faculty of Engineering and Information Technology, and the Computer Sciences Laboratory, Research School of Information Sciences and Engineering, The Australian National University.

Please direct correspondence regarding this series to:

Technical Reports
Department of Computer Science
Faculty of Engineering and Information Technology
The Australian National University
Canberra ACT 0200
Australia

or send email to: Technical-DOT-Reports-AT-cs-DOT-anu.edu.au

A list of technical reports, including some abstracts and copies of some full reports, may be found at: http://cs.anu.edu.au/techreports/

Recent reports in this series:

TR-CS-08-02 Paul Thomas. Implementation of PIS. June 2008.
TR-CS-08-01 Stephen M. Blackburn, Sergey I. Salishev, Mikhail Danilov, Oleg A. Mokhovikov, Anton A. Nashatyrev, Peter A. Novodvorsky, Vadim I. Bogdanov, Xiao Feng Li, and Dennis Ushakov. The Moxie JVM Experience. April 2008.
TR-CS-07-05 Peter Strazdins. Research-Based Education in Computer Science at the ANU: Challenges and Opportunities. August 2007.
TR-CS-07-04 Stephen M. Blackburn and Kathryn S. McKinley. Immix Garbage Collection: Fast Collection, Space Efficiency, and Mutator Locality. August 2007.
TR-CS-07-03 Peter Christen. Towards Parameter-free Blocking for Scalable Record Linkage. August 2007.
TR-CS-07-02 Sophie Pinchinat. Quantified mu-calculus with decision modalities for concurrent game structures. January 2007.

Predicting Performance of Intel Cluster OpenMP with Code Analysis Method

Jie Cai, Alistair P. Rendell, Peter E. Strazdins, H'sien Jin Wong
College of Engineering and Computer Science
The Australian National University
{Jie.Cai, Alistair.Rendell, Peter.Strazdins, Jin.Wong}@anu.edu.au

Abstract

Intel Cluster OpenMP (CLOMP) extends OpenMP programs onto clusters by using a page-based software Distributed Shared Memory (sDSM) system. With the commercial release of CLOMP, interest in cluster-enabled OpenMP systems is likely to increase. Page fault detection and servicing are the major overheads of such systems. This paper presents a code analysis method to estimate the number of page faults incurred when running an OpenMP program with CLOMP, and then uses those estimates with the SDP model to predict CLOMP performance. On a 4-node Intel cluster connected via both InfiniBand (IB) and Gigabit Ethernet (GigaEth) interconnects, several NAS Parallel Benchmarks (NPB) are used to evaluate CLOMP and the predictions. The estimates show less than ∼10% prediction error on most benchmarks, which indicates that combining the code analysis method with the SDP model is an effective approach to predicting the performance of CLOMP.

1. Introduction

Intel Cluster OpenMP (CLOMP) is a cluster-enabled OpenMP system which provides support for the shared memory OpenMP programming model on distributed memory systems [2]. This offers not only increased portability for existing OpenMP programs, but also an alternative programming paradigm that is generally considered easier than the message passing model normally used on clusters. With the commercial release of the CLOMP compiler in 2006, interest in cluster-enabled OpenMP implementations is only likely to increase. This paper presents a code analysis method to help with understanding the performance of CLOMP.

Cluster-enabled OpenMP systems are typically implemented over page-based software Distributed Shared Memory (sDSM) systems [2, 9, 6, 7, 4]. This is viable because the relaxed memory consistency model of OpenMP limits the need to move globally shared memory pages between nodes to well-defined consistency points (i.e., OpenMP barrier and flush operations).

Accesses to these shared memory pages are detected using page protection, with protection states of "Invalid", "Read-Valid", and "Write-Valid". Memory consistency is maintained by page fault servicing, and the overhead of doing so can be characterized by the number and cost of the different types of page faults. In this way, the performance of a cluster OpenMP program can be rationalized using the SIGSEGV Driven Performance (SDP) model [1].

In this paper, a code analysis method for CLOMP is presented that estimates the number of each type of page fault that will occur when running an OpenMP program on CLOMP. Combined with the SDP model, the code analysis method can be used to predict the performance of CLOMP. Experiments are run on a 4-node cluster connected via both InfiniBand (IB) and Gigabit Ethernet (GigaEth) interconnects, and several NAS Parallel Benchmarks (NPB) programs are used to evaluate the performance of CLOMP and verify the code analysis method.

The rest of the paper is organized in five sections. Section 2 introduces the background on CLOMP, and section 3 describes the proposed code analysis method. In section 4, several NAS Parallel Benchmarks are analyzed to estimate their page fault counts. Section 5 presents and discusses the performance evaluation results, and conclusions are drawn in section 6.
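To make the page-protection mechanism described above concrete, the following minimal sketch shows how a page-based sDSM runtime can detect accesses to a shared page by protecting it with mprotect and fielding the resulting SIGSEGV. It is an illustration of the general technique only; the names and structure are ours, not CLOMP's actual implementation.

    /* Minimal sketch of page-access detection via mprotect + SIGSEGV.
       Illustrative only -- not the CLOMP/TreadMarks implementation. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *shared_page;           /* one "globally addressable" page */
    static size_t page_size;
    static volatile sig_atomic_t fault_count;

    /* A real sDSM would decide here whether this is a fetch fault
       (collect diffs) or a write fault (create a twin). This sketch
       just counts the fault and upgrades the page protection so the
       faulting access can be replayed. */
    static void segv_handler(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        char *addr = (char *)info->si_addr;
        if (addr >= shared_page && addr < shared_page + page_size) {
            fault_count++;
            mprotect(shared_page, page_size, PROT_READ | PROT_WRITE);
        } else {
            _exit(1);                   /* a genuine crash, not a DSM fault */
        }
    }

    int main(void) {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        shared_page = mmap(NULL, page_size, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = segv_handler;
        sigaction(SIGSEGV, &sa, NULL);

        shared_page[0] = 42;            /* page is "Invalid": raises SIGSEGV */
        printf("value = %d, faults = %d\n", shared_page[0], (int)fault_count);
        return 0;
    }

After the handler upgrades the protection, the faulting store is replayed transparently, which is exactly why the number and service cost of such faults dominate the overhead of cluster-enabled OpenMP.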

2. CLOMP

The first commercial cluster-enabled OpenMP implementation is CLOMP, which is derived from the TreadMarks sDSM system [2] and uses a lazy release consistency (LRC) memory model [5]. It augments OpenMP with a new sharable directive that is used to identify variables that can be referenced by more than one OpenMP thread. These variables are placed in globally addressable memory that is created, maintained, and synchronized across all OpenMP threads by the sDSM system. The TreadMarks sDSM system is page-based, partitioning the globally addressable memory into pages. Changes made to a page by other nodes are patched into the local view when required. The memory consistency model of CLOMP and the performance models developed in [1] are detailed in the rest of this section.
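As an illustration of the directive, the loop below uses two file-scope arrays from multiple threads, so both must be declared sharable. The #pragma intel omp sharable spelling follows the description in [2]; treat the example as a hedged sketch rather than a verbatim CLOMP program.

    /* Sketch of CLOMP's sharable directive; spelling as described in [2].
       Arrays touched by more than one OpenMP thread are placed in the
       DSM-managed, globally addressable memory. */
    #include <stdio.h>

    #define N 1024
    static double a[N], b[N];
    #pragma intel omp sharable(a, b)    /* both arrays cross thread/node boundaries */

    int main(void) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];          /* remote accesses fault in sDSM pages */
        printf("a[0] = %f\n", a[0]);
        return 0;
    }

With a plain (non-cluster) OpenMP compiler the sharable pragma would simply be ignored as an unrecognized pragma, so the same source remains a valid shared-memory OpenMP program.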

Figure 1: Page State Transfer Diagram for CLOMP (derived from [2] and experimental observation)

2.1. Memory Consistency Model

The LRC protocol used in CLOMP is summarized in Figure 1, which shows the transitions between page states. Write notices are passed between threads when an OpenMP barrier, lock, or flush directive is encountered. When a write notice is received for a shared page, the page is set to "Invalid". A subsequent read or write access to that page then gives rise to a fetch fault, which requires diffs to be collected from the relevant other threads before the page can be used. Threads from which diffs have been requested (consumed) must change the protection of that page from "Write-Valid" to "Read-Valid" if necessary, indicating the start of a new diff reference point. Transitions from "Read-Valid" to "Write-Valid" occur the first time a write is made to a page, at which point a write fault is issued, necessitating the creation of a copy or "twin" of the page.
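The transitions just described can be summarized as a small state machine. The toy model below encodes our reading of Figure 1; the state names and the exact fault accounting are illustrative, not CLOMP source code.

    /* Toy model of the page-state transitions of Figure 1:
       Invalid, Read-Valid, Write-Valid. Illustrative only. */
    #include <stdio.h>

    typedef enum { INVALID, READ_VALID, WRITE_VALID } page_state;
    typedef enum { WRITE_NOTICE, READ_ACCESS, WRITE_ACCESS, DIFF_REQUEST } event;

    static page_state step(page_state s, event e,
                           int *fetch_faults, int *write_faults) {
        switch (e) {
        case WRITE_NOTICE:              /* barrier/lock/flush invalidates page */
            return INVALID;
        case READ_ACCESS:
            if (s == INVALID) { (*fetch_faults)++; return READ_VALID; }
            return s;
        case WRITE_ACCESS:
            if (s == INVALID) {         /* fetch diffs, then twin for the write */
                (*fetch_faults)++; (*write_faults)++; return WRITE_VALID;
            }
            if (s == READ_VALID) { (*write_faults)++; return WRITE_VALID; }
            return s;                   /* already write-valid: no fault */
        case DIFF_REQUEST:              /* diffs consumed: new reference point */
            return (s == WRITE_VALID) ? READ_VALID : s;
        }
        return s;
    }

    int main(void) {
        int nf = 0, nw = 0;
        page_state s = READ_VALID;
        s = step(s, WRITE_NOTICE, &nf, &nw);   /* invalidated at a barrier */
        s = step(s, READ_ACCESS,  &nf, &nw);   /* fetch fault */
        s = step(s, WRITE_ACCESS, &nf, &nw);   /* write fault (twin created) */
        printf("fetch=%d write=%d final=%d\n", nf, nw, (int)s);
        return 0;
    }

Counting how often each transition fires for a given access pattern is precisely what the code analysis method of section 3 does statically.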

2.2. SDP Performance Models

Two SIGSEGV Driven Performance (SDP) models were developed for cluster-enabled OpenMP systems in [1]. In these models, the numbers and costs of the different types of page faults serve as the overhead term used to rationalize the performance of an OpenMP application running on a cluster-enabled OpenMP system: the estimated elapsed time of such a program is the computation time plus this overhead. There are two SDP models. One utilizes the number of page faults incurred along the critical path of the application; the other utilizes the aggregate number of page faults for the whole timed section. The critical path is determined by the thread that spends the longest time handling page faults in each parallel region. The critical path SDP model is therefore more accurate, but it requires detailed knowledge of the number of page faults incurred by each process within each parallel region, whereas the aggregate SDP model only requires the aggregate number of page faults for the whole timed section.

A few assumptions are required to introduce the SDP models:

• The page faults incurred on different processes are fully overlappable.

• The cost of servicing a fetch fault is constant regardless of the number of processes involved.

• The computation time on p threads can be approximated by $T(1)/p$, where $T(1)$ is the execution time on one thread. This in turn implies:

  – The OpenMP program is load-balanced.

  – The sequential portion within the timed section is negligible.

Under these assumptions, the critical path SDP model gives the total execution time on p threads as

    $T(p)_{crit} = \frac{T(1)}{p} + \sum_{r} \max_{i=0}^{p-1} \left( N_w^{r,i} C_w + N_f^{r,i} C_f \right)$    (1)

The aggregate SDP model additionally assumes that an equal number of page faults occurs on each thread in each parallel region. Under this model, the total execution time is given by

    $T(p)_{agg} = \frac{T(1)}{p} + \frac{N_w C_w + N_f C_f}{p}$    (2)

Here $N_w^{r,i}$ and $N_f^{r,i}$ denote the numbers of write and fetch faults, respectively, occurring within parallel region r on thread i; $N_w$ and $N_f$ denote the aggregate numbers of write and fetch faults, respectively; and $C_w$ and $C_f$ denote the times spent servicing a write fault and a fetch fault, respectively. The page fault counts can be obtained either with segvprof.pl, a profiling tool provided by Intel that reports the aggregate number of each type of page fault, or by analyzing the OpenMP code directly. The cost of servicing each type of fault can be measured by running an OpenMP test program provided in [1].¹
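To show how the two models are evaluated in practice, the sketch below computes $T(p)_{crit}$ and $T(p)_{agg}$ from per-region, per-thread fault counts, following equations (1) and (2). All numeric inputs are made-up placeholders, not measurements from [1].

    /* Evaluate the critical-path and aggregate SDP models (Eqs. 1 and 2)
       for hypothetical fault counts; all inputs are placeholders. */
    #include <stdio.h>

    #define P 4   /* threads */
    #define R 2   /* parallel regions in the timed section */

    int main(void) {
        double T1 = 120.0;               /* single-thread time (s), placeholder */
        double Cw = 50e-6, Cf = 300e-6;  /* per-fault service costs (s), placeholders */

        /* Per-region, per-thread write and fetch fault counts (placeholders). */
        int Nw[R][P] = { {100, 90, 95, 110}, {40, 45, 35, 50} };
        int Nf[R][P] = { {200, 210, 190, 205}, {80, 75, 85, 90} };

        /* Eq. (1): sum over regions of the per-region maximum fault overhead. */
        double crit = T1 / P;
        for (int r = 0; r < R; r++) {
            double max_over = 0.0;
            for (int i = 0; i < P; i++) {
                double over = Nw[r][i] * Cw + Nf[r][i] * Cf;
                if (over > max_over) max_over = over;
            }
            crit += max_over;
        }

        /* Eq. (2): aggregate counts, divided evenly over the p threads. */
        long aggNw = 0, aggNf = 0;
        for (int r = 0; r < R; r++)
            for (int i = 0; i < P; i++) { aggNw += Nw[r][i]; aggNf += Nf[r][i]; }
        double agg = T1 / P + (aggNw * Cw + aggNf * Cf) / P;

        printf("T(p)_crit = %.3f s, T(p)_agg = %.3f s\n", crit, agg);
        return 0;
    }

Whenever the fault load is imbalanced across threads, the critical-path estimate exceeds the aggregate one, which is exactly the effect the critical-path model is designed to capture.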

¹ The source code of the program is available at http://ccnuma.anu.edu.au/dsm/segv_cost.
