PowerPoint Presentation - Slide 1

Berkeley UPC http://upc.lbl.gov Overview

UPC-to-C Translator

 A portable and high-performance UPC implementation, compliant with UPC 1.2 spec  Features:  High performance UPC Collectives  Extensions for performance and programmability      

Non-blocking memcpy functions Semaphores and signaling put Value-based collectives Atomic memory operations Hierarchical layout query Localization (castability) queries

 Source-to-source translator, based on Open64  Enhances programmer productivity through static and dynamic optimizations: compiler, runtime, communication libraries

Performance Portability: System, Scale, Load

 Compiler and runtime optimizations for application scalability

 Compile time message vectorization and strip-mining  Runtime Analysis: communication instantiated at runtime based on system specific performance models  Performance models designed to take system scale and load into account

Communication dynamically instantiated using either Put/Get or VIS calls

 Open Source Software (Windows/Mac/UNIX), installation DVD available at PGAS booth (#136) (down is good)

Portable Design  Layered design, platform-independent code gen  Supports wide range of SMPs, clusters and MPPs  x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, T3E, PA-RISC, SPARC, X1, SX-6, Cray XT, IBM Blue Gene, …  Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS Windows, Mac OS X, Unicos, SuperUX, …  Pthreads, Myrinet, Quadrics Elan 3/4, InfiniBand, IBM LAPI, Dolphin SCI, MPI, Ethernet, Cray X1 / SGI Altix shmem, Cray XT Portals, IBM BG/P DCMF

BUPC Runtime + GASNet  Well-documented runtime interface, multiple UPC compilers (Berkeley UPC and Intrepid GCC/UPC)  Debugging and tracing support    

Performance Instrumentation Support (GASP) Supports Parallel Performance Wizard (PPW) Detailed communication tracing support TotalView debugger support

 Interoperability with other programming env:

Runtime Optimization of Vector Operations

Overall improved application scalability and programmer productivity

Scatter-gather operations are widely used in applications (boundary exchange)

 Multiple implementations  blocking, pipeline, AMs, progress threads

 Modern NICs are under-provisioned for multicore  Dynamic approach required  Performance models for communication at scale pose a research challenge  Qualitative models: track derivatives instead of absolute performance  Determine first order performance parameters  Methodology for parameter space exploration at scale  Preserve performance ordering of implementations  Account for errors and partial exploration of parameter space

 UPC calls to/from C, C++, Fortran, MPI

 Berkeley GASNet used for communication:  Performance from inline functions, macros, and network-specific implementations

 Optimized Collective ops  High-performance communication  Consistently matches or outperforms MPI  One-sided, lightweight semantics Cray XT4 Bandwidth

Cray XT4 Roundtrip Latency

1800

 Dynamic optimization approach implemented in the Berkeley UPC compiler  Demonstrated improved performance and productivity

30

1600

25

800

portals -conduit Put OSU MPI BW test mpi -conduit Put

400

20

15

10

mpi -conduit Put MPI Ping -Ack

5

portals -conduit Put

200 0

0 2K

4K

8K

16K

32K

64K

128K

256K

512K

1M

2M

1

2

4

8

16

Transfer Size (Bytes)

128

256

512

1024

MPI Send/Recv GASNet (Get + sync) GASNet (Put + sync)

8

7

3000

2500

2000

1500

1000

6

(down is good)

(up is good)

Bandwidth (MiB/s)

3500

9

Roundtrip Latency (µs)

4000

64

BlueGene/P Roundtrip Latency

Six Link Peak GASNet (6 link) MPI (6 link) GASNet (4 link) MPI (4 link) GASNet (2 link) MPI (2 link) One Link Peak GASNet (1 link) MPI (1 link)

4500

32


BlueGene/P Multi-link Bandwidth

5

4

3

2

500

0

(down is good)

1000

600

Roundtrip Latency (µs)

1200

(up is good)

Bandwidth (MiB/s)

1400

1

512

1k

2k

4k

8k

16k

32k

64k

128k


256k

512k

1M

2M

0

1

2

4

8

16

32

64


 2009, Lawrence Berkeley National Laboratory

128

256

512