Cluster OpenMP and the Gaussian Code: A Preliminary Performance Analysis

Rui Yang, Jie Cai, Alistair P. Rendell, and V. Ganesh
School of Computer Science
College of Engineering and Computer Science
The Australian National University, Canberra, ACT 0200, Australia

Outline

• Introduction to Cluster OpenMP
• Gaussian 03
• CLOMP implementation of Gaussian 03
• Performance results
• Performance modeling
• Conclusions
• Comments

Cluster OpenMP (CLOMP)
• A software system that extends OpenMP to distributed memory clusters
• Released in 2006 with the Intel C/Fortran compilers
• Uses the TreadMarks page-based software distributed shared memory (sDSM) system
• Exploits the virtual memory environment

[Figure: CLOMP architecture — an OpenMP C/Fortran program is compiled with CLOMP into threads Thrd 0–3, which run on top of a software distributed shared memory runtime library spanning Node 0 through Node 3.]

Page-based sDSM Systems
• Program memory is partitioned into local and global address spaces
[Figure: each of Node 0 through Node 3 has its own local address space, while the global address space is shared across all nodes.]
• Global pages are protected using mprotect
• Pages are kept consistent by detecting and servicing different types of page faults (sketched below)
• The major overhead of the sDSM system is the cost of servicing these page faults according to the memory consistency model
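The following is a minimal, self-contained sketch (C, Linux) of the mprotect/SIGSEGV mechanism described above. It is not CLOMP or TreadMarks code: a real sDSM would also record a write notice and keep a twin copy of the page inside the handler, and all names here are illustrative.

```c
/* Minimal sketch (C, Linux) of page-based fault detection: a "global" page is
 * mapped read-only, the first write raises SIGSEGV, and the handler services
 * the fault by upgrading the protection before the write is retried. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *global_page;          /* one page of the global address space */
static long  page_size;

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr >= global_page && addr < global_page + page_size) {
        /* service the write fault: page becomes writable, write is retried */
        mprotect(global_page, page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);                  /* a genuine segmentation fault */
    }
}

int main(void)
{
    page_size   = sysconf(_SC_PAGESIZE);
    global_page = mmap(NULL, page_size, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    global_page[0] = 42;           /* triggers the write fault once */
    printf("write completed after the fault was serviced\n");
    return 0;
}
```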

Page Transitions in CLOMP
• Uses the Lazy Release Consistency model
• Write fault: all local operations
• Fetch fault: involves network communication
• Page state is reset at memory sequence points (OpenMP barrier, lock and flush operations)
[Diagram: page-state transitions between Invalid, Read-Valid and Write-Valid — a fetch fault (first read) makes an Invalid page Read-Valid, a write fault (first write) makes it Write-Valid, write notices from other processes (at barrier, lock or flush) invalidate it again, and a written page returns to Read-Valid once its changes have been consumed by other processes.]
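Read as a per-page state machine, the diagram above can be summarized in a few lines of code. This is a minimal sketch with illustrative names, not CLOMP internals, and the transition taken on EV_CHANGES_CONSUMED is inferred from the diagram.

```c
/* Minimal sketch (C) of the page-state transitions in the diagram above. */
#include <stdio.h>

typedef enum { PAGE_INVALID, PAGE_READ_VALID, PAGE_WRITE_VALID } page_state;

typedef enum {
    EV_FETCH_FAULT,      /* first read of an invalid page: needs a network fetch    */
    EV_WRITE_FAULT,      /* first write to a page: local operations only            */
    EV_WRITE_NOTICE,     /* write notice from another process (barrier/lock/flush)  */
    EV_CHANGES_CONSUMED  /* this page's changes have been consumed by others        */
} page_event;

static page_state next_state(page_state s, page_event e)
{
    switch (e) {
    case EV_FETCH_FAULT:      return PAGE_READ_VALID;  /* Invalid -> Read-Valid      */
    case EV_WRITE_FAULT:      return PAGE_WRITE_VALID; /* first write -> Write-Valid */
    case EV_WRITE_NOTICE:     return PAGE_INVALID;     /* reset at a sync point      */
    case EV_CHANGES_CONSUMED: return PAGE_READ_VALID;  /* Write-Valid -> Read-Valid  */
    }
    return s;
}

int main(void)
{
    page_state s = PAGE_INVALID;
    s = next_state(s, EV_FETCH_FAULT);   /* read a shared page after a barrier */
    s = next_state(s, EV_WRITE_FAULT);   /* then write to it                   */
    printf("state after read + write: %d (2 == Write-Valid)\n", s);
    return 0;
}
```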

Applicability of CLOMP
• Compared to regular OpenMP, the additional overhead of CLOMP is primarily the cost of servicing the various segmentation faults
• Barriers, locks and flush operations lead to inter-process communication and resetting of page protection
• The ideal CLOMP applications have good data locality and coarse-grained parallelism
• Good examples are applications for rendering, data mining, all kinds of parallel search, speech and visual recognition, and genetic sequencing
• Is Gaussian 03 a good example?
  − Gaussian 03 uses the Linda object-based DSM system on distributed memory systems and OpenMP on shared memory systems

The Gaussian Code
• Gaussian is a general-purpose computational chemistry package that can perform a variety of electronic structure calculations
• A calculation proceeds by executing a series of Links
• There are many Links, but only some are time dominant

[Diagram: a Gaussian job chains its Links through exec() calls — Link 101 (initialization) → Link 302 (calculates electronic integrals) → Link 502 (performs energy calculations) → Link 703 (performs force calculations) → Link 9999 (finalization).]

Link 502: Self-Consistent Field (SCF) Energy Evaluation
• Link 502 is used in virtually all Gaussian computations
  − Performs SCF calculations using Hartree-Fock and Density Functional Theories
• In Hartree-Fock theory the most time-consuming step is construction of the Fock matrix (F):

$$F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + \sum_{\lambda\sigma} D_{\lambda\sigma}\left[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\sigma|\lambda\nu)\right]$$

• D_{λσ} is the density matrix, which is shared and read by all threads
• Each thread calculates a subset of the electron repulsion integrals (ERIs) (μν|λσ) and adds its contribution to its own F matrix
• The thread-private F matrices are summed after all threads have completed
• For density functional theory the calculation is similar, but involves each thread performing a numerical integration

Parallelization of Gaussian 03
Starting the Link: malloc(size=M), then divide the allocation into D, F#0, F#1, F#2, ...
[Diagram: in the parallel region, Thread 0, Thread 1 and Thread 2 each read the shared density matrix D, calculate their subset of the ERIs or of the numerical integration, and save the products into their private matrices F#0, F#1 and F#2.]
After the parallel region, Thread 0 sums all the private copies of the F (or K) matrices into F#0. In the case of a force evaluation the approach is similar, except that each thread now makes contributions to the gradient vector rather than to the F (or K) matrix.

CLOMP and Gaussian 03
• Replace malloc() with its CLOMP analogue kmp_sharable_malloc() (see the sketch after this list)
  − The whole Gaussian working array is then shared between all threads in the Link
• A large number of the shared variables were automatically tagged as sharable by the compiler; however, a significant number of other variables had to be identified by hand
  − Around 60 sharable directives were inserted for Link 502
• The parallel Links were then compiled using the CLOMP Intel 10.0 compiler with the following flags:
  − -cluster-openmp
  − -clomp-sharable-commons
  − -clomp-sharable-localsaves
  − -clomp-sharable-argexprs
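A minimal sketch of that allocation change is given below. kmp_sharable_malloc() and kmp_sharable_free() are Cluster OpenMP runtime routines; the wrapper function, the preprocessor guard and the prototypes as written here are illustrative assumptions rather than the Gaussian source.

```c
/* Minimal sketch (C): take the working array from the CLOMP sharable heap so
 * every thread on every node sees a consistent copy.  The CLOMP build would
 * be compiled with: -cluster-openmp -clomp-sharable-commons
 *                   -clomp-sharable-localsaves -clomp-sharable-argexprs */
#include <stddef.h>
#include <stdlib.h>

/* Provided by the Cluster OpenMP runtime; declared here for the sketch. */
extern void *kmp_sharable_malloc(size_t size);
extern void  kmp_sharable_free(void *ptr);

double *allocate_working_array(size_t nwords)
{
#ifdef USE_CLOMP   /* hypothetical guard selecting the Cluster OpenMP build */
    return (double *)kmp_sharable_malloc(nwords * sizeof(double));
#else              /* ordinary shared-memory OpenMP build keeps malloc()    */
    return (double *)malloc(nwords * sizeof(double));
#endif
}
```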

Hardware Architecture and Software Environment
• Linux cluster of 4 nodes, each with a 2.4 GHz Intel Core 2 Quad Q6600 CPU and 4 GB of local DRAM memory; the nodes were connected via Gigabit Ethernet
• The Intel C/Fortran compiler 10.0 was used to compile and build the two ported parallel Links of the Gaussian development version (GDV) with the CLOMP flags given on the previous slide

Benchmark Molecules
[Figure: the three benchmark molecules — valinomycin, α-pinene and C60.]

Benchmark Calculations
Five different HF and DFT benchmark jobs were considered, spanning a typical range of molecular systems.

Case | Method | Basis    | Molecule    | Link      | Routines
I    | HF     | 6-311g*  | Valinomycin | 502       | PRISM
II   | BLYP   | 6-311g*  | Valinomycin | 502       | PRISMC, CALDFT
III  | BLYP   | cc-pvdz  | C60         | 502 & 703 | PRISMC, CALDFT (502); PRISM, CALDFT (703)
IV   | B3LYP  | 3-21g*   | Valinomycin | 502 & 703 | PRISM, CALDFT
V    | B3LYP  | 6-311g** | α-pinene    | 502 & 703 | PRISMC, CALDFT

Only the first SCF iteration is measured within Link 502 (the time to complete a full SCF calculation will scale almost linearly with the number of iterations required for convergence). The reported times comprise both the parallel formation of the F (or K) matrix and its (sequential) diagonalization.

Code Structure

Start of the Link:
  Allocate the working array in the sharable heap using kmp_sharable_malloc()
Within the Link:
  ...
  Obtain new density matrix
  Parallel loop over Nthread (OpenMP parallel region):
    Call Prism, PrismC or CalDFT to calculate 1/Nthread of the total integrals
    and save their contributions to each thread's private Fock matrix
  End parallel loop
  Loop over i = 2, Nthread (sequential region):
    Add the Fock matrix created by thread i to the master thread's Fock matrix
  End loop
  ...
Exit Link
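In ordinary OpenMP terms this structure is a parallel region with thread-private Fock matrices followed by a sequential reduction. The sketch below is illustrative only: flat N×N matrices, zero-based thread indexing, and compute_contribution() standing in for the Prism/PrismC/CalDFT calls; it is not the Gaussian source.

```c
/* Minimal sketch (C + OpenMP) of the Link's parallel structure: every thread
 * reads the shared density matrix D, accumulates into its own private Fock
 * matrix, and the private copies are summed sequentially afterwards. */
#include <stdlib.h>
#include <string.h>
#include <omp.h>

void compute_contribution(const double *D, double *Fpriv,
                          int n, int thread, int nthreads);

void build_fock(const double *D, double *F, int n)
{
    int nthreads = omp_get_max_threads();
    /* one private Fock matrix per thread: F#0, F#1, ... */
    double *Fpriv = calloc((size_t)nthreads * n * n, sizeof(double));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        /* each thread computes its 1/Nthread share of the integrals */
        compute_contribution(D, Fpriv + (size_t)t * n * n, n, t, nthreads);
    }   /* implicit barrier: all private copies are complete here */

    /* sequential reduction into the master copy (thread 0's matrix) */
    memcpy(F, Fpriv, (size_t)n * n * sizeof(double));
    for (int t = 1; t < nthreads; ++t)
        for (size_t k = 0; k < (size_t)n * n; ++k)
            F[k] += Fpriv[(size_t)t * n * n + k];

    free(Fpriv);
}
```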

Speedup: Valinomycin HF and DFT — Cases I and II (L502)
[Chart: speedup (0–8) of Case I and Case II versus Nnode × Ncore configuration, from 1×1 (1 thread) through 1×2, 2×1, 1×4, 2×2, 4×1, 2×4, 4×2 to 4×4 (16 threads).]

Speedup: C60 DFT Energy and Gradient — Case III (L502 and L703)
[Chart: speedup (0–14) of Case III Link 502 and Link 703 versus Nnode × Ncore configuration, from 1×1 (1 thread) to 4×4 (16 threads).]

Full Speedup Data
[Table: speedup for all cases across the Nnode × Ncore configurations 1×1, 1×2, 2×1, 1×4, 2×2, 4×1, 2×4, 4×2 and 4×4 (1 to 16 threads).]

SIGSEGV-Driven Performance Model
• Memory consistency work is the major overhead of cluster-enabled OpenMP systems
  − Page fault detection and servicing are its main activities
• Can we use the numbers and costs of the different types of page faults to rationalize CLOMP performance?
  − CLOMP provides the SEGVprof profiling tool to count page faults
• Two models have been developed
  − Critical path model: find the number of page faults along the critical path of each parallel region and combine them to estimate the overhead
  − Aggregate model: use aggregated page fault counts with a serialization fraction to estimate the overhead

Cai, J., Rendell, A.P., Strazdins, P.E., Wong, H.J.: Performance model for cluster-enabled OpenMP implementations. In: Proceedings of the 13th IEEE Asia-Pacific Computer Systems Architecture Conference, pp. 1–8 (2008)

The Critical Path Performance Model

$$Tot(p) = \frac{Tot(1)_{par}}{p} + T^{crit}(p) = \frac{Tot(1)_{par}}{p} + \max_{i=0,\dots,p-1}\left(N_i^{w} C_w + N_i^{f} C_f\right)$$

Tot(1)_par : elapsed time for the parallel region using one thread
p : number of threads used
T^crit(p) : page fault cost for the thread with the maximum number of page faults in the p-thread calculation
N_i^w : number of write page faults on thread i
N_i^f : number of fetch page faults on thread i
C_w : cost of a write fault
C_f : cost of a fetch fault
max_{i=0,...,p-1} : taken over all threads in the parallel region, selecting the one with the maximum fault cost
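As a worked illustration, the sketch below evaluates T^crit(p) and the 4-versus-2-thread difference that is tested later in the talk, using the Case I (PRISM) fault counts and the C_w = 10 μs, C_f = 171 μs costs quoted on the later slides. The helper names, and the reuse of the single reported maximum-fault count for every thread, are assumptions made purely for illustration.

```c
/* Minimal sketch (C) of the critical-path model: T_crit(p) is the fault cost
 * of the worst thread, and Tot(p) = Tot(1)_par / p + T_crit(p). */
#include <stdio.h>

static double t_crit(int p, const long *nw, const long *nf,
                     double cw, double cf)
{
    double worst = 0.0;
    for (int i = 0; i < p; ++i) {                /* max over the p threads */
        double t = nw[i] * cw + nf[i] * cf;
        if (t > worst) worst = t;
    }
    return worst;
}

int main(void)
{
    const double cw = 10e-6, cf = 171e-6;        /* C_w, C_f in seconds */

    /* Case I (PRISM): only the maximum-fault thread is reported, so its
     * counts are reused for every thread purely for illustration. */
    long nw2[2] = {4931, 4931},   nf2[2] = {13148, 13148};   /* 2 threads */
    long nw4[4] = {4934, 4934, 4934, 4934},
         nf4[4] = {13150, 13150, 13150, 13150};              /* 4 threads */

    /* dT_crit = 2*T_crit(4) - T_crit(2); the talk compares this (~2.25 s with
     * its unrounded data) against the measured dTot = 2*Tot(4) - Tot(2),
     * which is about 20.8 s.  Rounded table values give ~2.30 s vs 20 s. */
    double d_crit = 2.0 * t_crit(4, nw4, nf4, cw, cf)
                        - t_crit(2, nw2, nf2, cw, cf);
    printf("predicted dT_crit = %.2f s, measured dTot = %.1f s\n",
           d_crit, 2.0 * 386.0 - 752.0);
    return 0;
}
```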

Model Assumptions
• Find the critical path, containing the longest computation and overhead, for each parallel region
  − Assumes page faults occurring on different threads within the same parallel region are fully overlapped
  − Assumes the costs of servicing page faults are constant on a given hardware platform
  − Assumes the OpenMP application is load balanced, so the computation time is the same on each thread
  − The page fault overhead is then the determinant of the critical path for the parallel region
• Assumes the sequential region contained in the timed section of an OpenMP application is negligible
• The combined critical paths of the parallel regions can then be used to estimate the overall performance

Page Fault Counts for Benchmarks

Case (Link)     | Routine | Nthread | Elapsed time (s) | Max write faults/thread | Max fetch faults/thread
Case I (L502)   | PRISM   | 2 | 752  | 4931 | 13148
Case I (L502)   | PRISM   | 4 | 386  | 4934 | 13150
Case II (L502)  | PRISMC  | 2 | 285  | 5104 | 6036
Case II (L502)  | PRISMC  | 4 | 146  | 5347 | 6273
Case II (L502)  | CALDFT  | 2 | 185  | 8043 | 3933
Case II (L502)  | CALDFT  | 4 | 94   | 7540 | 3680
Case III (L502) | PRISMC  | 2 | 286  | 5337 | 5240
Case III (L502) | PRISMC  | 4 | 148  | 5628 | 5253
Case III (L502) | CALDFT  | 2 | 95   | 3852 | 1493
Case III (L502) | CALDFT  | 4 | 50   | 4146 | 2187
Case III (L703) | PRISMC  | 2 | 2894 | 6038 | 8577
Case III (L703) | PRISMC  | 4 | 1454 | 6043 | 8580
Case III (L703) | CALDFT  | 2 | 273  | 6682 | 1011
Case III (L703) | CALDFT  | 4 | 137  | 6475 | 1127

• The SEGVprof profiling tool was used to count the page faults
• Total page fault counts roughly reflect the size of the system, i.e. the execution time
• For any given test case and routine, the number of write and fetch page faults remains roughly constant when going from 2 to 4 threads (1 thread/node)

Testing the SDP Model

$$Tot(p) = \frac{Tot(1)_{par}}{p} + T^{crit}(p) = \frac{Tot(1)_{par}}{p} + \max_{i=0,\dots,p-1}\left(N_i^{w} C_w + N_i^{f} C_f\right)$$

Thus, between the 4- and 2-thread results:

$$\Delta Tot = 2 \times Tot(4) - Tot(2) = 2 \times T^{crit}(4) - T^{crit}(2) = \Delta T^{crit}$$

• ΔTot is measured
• ΔT^crit is calculated with C_w = 10 μs and C_f = 171 μs (a lower limit on the costs)
• The two are expected to be equal, but they are not, although there appears to be a linear relationship

[Plot: measured ΔTot (0–25 s) versus calculated ΔT^crit (0–2.5 s) for the benchmark routines.]

Failure of the SDP Model

[Plot: measured ΔTot (0–25 s) versus calculated ΔT^crit (0–2.5 s), as on the previous slide.]

Possible causes:
• Load imbalance
• Repeated computation
• A fixed page fault cost regardless of thread count
• Overlapping page fault costs

We suspect repeated computation and load imbalance, since the total cost of the page faults is too small: C_w = 10 μs and C_f = 171 μs give ΔT^crit = 2.25 s, yet ΔTot = 20.8 s.

Conclusions
• The Cluster OpenMP implementation of Gaussian 03 shows comparable or better scalability when using multiple nodes than when using multiple cores
• Page fault measurements reveal relatively low counts within long-duration parallel regions, implying that HF and DFT energy and gradient computations are well suited to implementation with Cluster OpenMP
• A critical path model has been used to analyze performance, but the accuracy of this model appears to be limited, probably due to repeated computation and load imbalance

Comments: The Not So Pretty CLOMP Results!
Link 1002, part of a frequency calculation
[Chart: speedup (or lack thereof, 0–2) of Link 1002 versus Nnode × Ncore configuration, from 1×1 to 4×4.]

Comments
• On software distributed shared memory:
  − ScaleMP provides a single system image over a cluster using a combination of virtualization and software distributed shared memory
• On local cache coherency domains:
  − Integrating Software Distributed Shared Memory and Message Passing Programming, J. Wong and A.P. Rendell (submitted to Cluster09)
  − What happens to memory consistency if a message is received into the global memory of a remote processor?

Acknowledgements
• Australian Research Council grants LP0669726 and LP0774896, and industry partners Intel, Sun and Gaussian Inc.
• Discussions with Jay Hoeflinger, Larry Meadows and Sanjiv Shah

Questions?