A portable and high-performance UPC implementation, compliant ... Berkeley
GASNet used for communication: ... network-specific implementations. ▫
Optimized ...
Berkeley UPC http://upc.lbl.gov Overview
UPC-to-C Translator
A portable and high-performance UPC implementation, compliant with UPC 1.2 spec Features: High performance UPC Collectives Extensions for performance and programmability
Non-blocking memcpy functions Semaphores and signaling put Value-based collectives Atomic memory operations Hierarchical layout query Localization (castability) queries
Source-to-source translator, based on Open64 Enhances programmer productivity through static and dynamic optimizations: compiler, runtime, communication libraries
Performance Portability: System, Scale, Load
Compiler and runtime optimizations for application scalability
Compile time message vectorization and strip-mining Runtime Analysis: communication instantiated at runtime based on system specific performance models Performance models designed to take system scale and load into account
Communication dynamically instantiated using either Put/Get or VIS calls
Open Source Software (Windows/Mac/UNIX), installation DVD available at PGAS booth (#136) (down is good)
Portable Design Layered design, platform-independent code gen Supports wide range of SMPs, clusters and MPPs x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, T3E, PA-RISC, SPARC, X1, SX-6, Cray XT, IBM Blue Gene, … Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS Windows, Mac OS X, Unicos, SuperUX, … Pthreads, Myrinet, Quadrics Elan 3/4, InfiniBand, IBM LAPI, Dolphin SCI, MPI, Ethernet, Cray X1 / SGI Altix shmem, Cray XT Portals, IBM BG/P DCMF
BUPC Runtime + GASNet Well-documented runtime interface, multiple UPC compilers (Berkeley UPC and Intrepid GCC/UPC) Debugging and tracing support
Performance Instrumentation Support (GASP) Supports Parallel Performance Wizard (PPW) Detailed communication tracing support TotalView debugger support
Interoperability with other programming env:
Runtime Optimization of Vector Operations
Overall improved application scalability and programmer productivity
Scatter-gather operations are widely used in applications (boundary exchange)
Multiple implementations blocking, pipeline, AMs, progress threads
Modern NICs are under-provisioned for multicore Dynamic approach required Performance models for communication at scale pose a research challenge Qualitative models: track derivatives instead of absolute performance Determine first order performance parameters Methodology for parameter space exploration at scale Preserve performance ordering of implementations Account for errors and partial exploration of parameter space
UPC calls to/from C, C++, Fortran, MPI
Berkeley GASNet used for communication: Performance from inline functions, macros, and network-specific implementations
Optimized Collective ops High-performance communication Consistently matches or outperforms MPI One-sided, lightweight semantics Cray XT4 Bandwidth
Cray XT4 Roundtrip Latency
1800
Dynamic optimization approach implemented in the Berkeley UPC compiler Demonstrated improved performance and productivity
30
1600
25
800
portals -conduit Put OSU MPI BW test mpi -conduit Put
400
20
15
10
mpi -conduit Put MPI Ping -Ack
5
portals -conduit Put
200 0
0 2K
4K
8K
16K
32K
64K
128K
256K
512K
1M
2M
1
2
4
8
16
Transfer Size (Bytes)
128
256
512
1024
MPI Send/Recv GASNet (Get + sync) GASNet (Put + sync)
8
7
3000
2500
2000
1500
1000
6
(down is good)
(up is good)
Bandwidth (MiB/s)
3500
9
Roundtrip Latency (µs)
4000
64
BlueGene/P Roundtrip Latency
Six Link Peak GASNet (6 link) MPI (6 link) GASNet (4 link) MPI (4 link) GASNet (2 link) MPI (2 link) One Link Peak GASNet (1 link) MPI (1 link)
4500
32
Transfer Size (Bytes)
BlueGene/P Multi-link Bandwidth
5
4
3
2
500
0
(down is good)
1000
600
Roundtrip Latency (µs)
1200
(up is good)
Bandwidth (MiB/s)
1400
1
512
1k
2k
4k
8k
16k
32k
64k
128k
Transfer Size (Bytes)
256k
512k
1M
2M
0
1
2
4
8
16
32
64
Transfer Size (Bytes)
2009, Lawrence Berkeley National Laboratory
128
256
512