ACCESS-Opt UM HPC optimisation project


Dr Ben Evans

UM Collaborators Meeting, June 2016
nci.org.au | @NCInews

Earth Systems Science Research and Societal Impacts

Examples:
• Modelling Extreme & High Impact events – BoM
• NWP, Climate Coupled Systems & Data Assimilation – BoM, CSIRO, research collaborations
• Hazards – Geoscience Australia, BoM, States
• Geophysics, Seismic – Geoscience Australia, universities
• Monitoring the Environment & Ocean – ANU, BoM, CSIRO, GA, research, Federal/State
• International research – international agencies and collaborative programs

Illustrative events:
• Tropical cyclones: Cyclone Winston, 20-21 Feb 2016
• Volcanic ash: Manam eruption, 31 July 2015
• Bush fires: Wye Valley and Lorne fires, 25-31 Dec 2015
• Flooding: St George, QLD, February 2011

NCI High Performance Optimisation and Scaling activities 2014-16

Objectives:
• Upscale and increase performance of high-profile community codes

Year 1
• Characterise, optimise and tune critical applications for higher resolution
• Best-practice configuration for improved throughput
• Establish analysis toolsets and methodology

Year 2
• Characterise, optimise and tune next-generation high-priority applications
• Select high-priority geophysics codes and exemplar HPC codes for scalability
• Parallel algorithm review and I/O optimisation methods to enable better scaling
• Established TIHP Optimisation work package (Selwood, Evans)

Year 3
• Assess codes for scalability, encourage adoption from "best in discipline"
• Updated hardware (many-core), memory/data latency/bandwidths, energy efficiency
• Communication libraries, maths libraries


Common methodology and approach for analysis
• Analyse code to establish strengths and weaknesses: full code analysis, including hotspots and algorithm choices
• Expose the model to more extreme scaling, e.g. realistic higher resolution
• Consider software stack configuration
• Decomposition strategies for nodes, and node tuning
• Parallel algorithms and MPI communication patterns, e.g. halo analysis, grid exchanges (a minimal halo-swap sketch follows this list)
• I/O techniques: move to parallel I/O
• Consideration of future technologies based on the analysis
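To make the halo-analysis item concrete, here is a minimal sketch of a 1-D halo swap between neighbouring MPI ranks using MPI_Sendrecv. It illustrates the general pattern only, not the UM's exchange code; the field layout, tile size (nlocal, ncols) and single-row halo are assumptions for the example.

    /* Minimal 1-D halo exchange sketch: each rank holds nlocal interior rows
     * plus one halo row on each side, and swaps boundary rows with its
     * neighbours. Illustrative only; not the UM's exchange code. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int ncols = 512, nlocal = 128;       /* assumed local tile size */
        double *field = calloc((size_t)(nlocal + 2) * ncols, sizeof(double));

        int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send my first interior row up, receive my lower halo from below;
         * then send my last interior row down, receive my upper halo from above. */
        MPI_Sendrecv(&field[1 * ncols],            ncols, MPI_DOUBLE, up,   0,
                     &field[(nlocal + 1) * ncols], ncols, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&field[nlocal * ncols],       ncols, MPI_DOUBLE, down, 1,
                     &field[0],                    ncols, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0) printf("halo exchange complete on %d ranks\n", size);
        free(field);
        MPI_Finalize();
        return 0;
    }

Profiling this kind of exchange (message sizes, frequency, overlap with computation) is what the halo analysis above refers to.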


NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Domain: Atmosphere
• Yr1: APS1 (UM 8.2-4) – Global N320L70 (40km) and pre-APS2 N512L70 (25km); Regional N768L70 (~17km); City 4.5km
• Yr2: UM 10.x (PS36) – Global N768L70 (~17km); Regional; City
• Yr3: APS3 preparation – latest UM 10.x; ACCESS-G (Global) N512L85 (25km) and N1024L70 (12km) / N1280L70 (10km); ACCESS-GE Global Ensemble N216L70 (~60km, based on MOGREPS); ACCESS-TC 4km; ACCESS-R (Regional) 12km; ACCESS-C (City) 1.5km (3DVar DA); ACCESS-CE City Ensemble 2.2km with downscaled global perturbations

NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases (continued)

Domain: Data assimilation
• Yr1: 4D-VAR v30 (New Dynamics); N216L70, N320L70; BODAS
• Yr2: 4D-VAR
• Yr3: latest 4D-VAR for Global at N320L70; possibly 3D-VAR for City and ETKF for GE; enKF-C

Domain: Ocean
• Yr1: MOM 5.1 – OFAM3; initially 0.25° and 0.1°, L50
• Yr2: MOM 5.1 analysis – 0.25° and 0.1°, L50
• Yr3: MOM5/6 at 0.1° and 0.03°; model initialisation; scaling up to 40k+ cores; ROMS (regional) for storm surge (2D) and eReefs (3D); OceanMAPS3.1 (MOM5) with enKF-C

Domain: Wave
• WaveWatch3 v4.18 (v5.08 beta)
• AusWave-G 0.4°; AusWave-R 0.1°

NCI-Fujitsu Scaling/Optimisation: Bureau Research-to-Operations high-profile cases (continued)

Domain: Coupled systems
• Yr1: Climate: ACCESS-CM – GA6 (UM 8.5) + MOM5 with OASIS-MCT; Global N96L38 (135km), 1° and 0.25° ocean
• Yr2: Climate: ACCESS-CM continued. Seasonal climate: ACCESS-S1 – UK GC2 with OASIS3, N216L85 (~60km), NEMO 0.25°; NCI profiling methodology applied to the GC2 MPMD configuration
• Yr3: Climate: ACCESS-CM2 (AOGCM) – GA7 (UM 10.4+ and GLOMAP aerosol); Global N96L85 and N216L85 (~60km); MOM 5.1 0.25°; CABLE2 (pending); UKCA aerosols. Earth system: ACCESS-ESM2 – ACCESS-CM2 plus terrestrial biogeochemistry (CASA-CNP), oceanic biogeochemistry (WOMBAT; Matear, CSIRO) and atmospheric chemistry (UKCA). Multi-week and seasonal climate: ACCESS-S2 / UK GC3 – atmosphere N216L85 (60km) and NEMO 3.6 0.25° L75

HPC General Scaling

Domain: Profiling methodology
• Yr2: Create methodology for profiling codes
• Yr3: Update the methodology based on application across more codes

Domain: Profiling tools suite
• Yr2: Review major open-source profiling tools

Domain: I/O profiling
• Yr2: Baseline profiling and comparison of NetCDF3, NetCDF4, HDF5 and GeoTIFF, and of API options (e.g. GDAL); profiling comparison of I/O performance on Lustre and NFS; compare MPI-IO vs POSIX on Lustre
• Yr3: Advanced profiling of HDF5 and NetCDF4 compression algorithms, multithreading and cache management; profiling analysis of other data formats (e.g. astronomy FITS, SEG-Y, BAM)

Domain: Accelerator technology investigation
• Yr3: Intel Phi (Knights Landing); AMD GPU

HPC General Scaling (continued)

Domain: Compute node performance analysis
• Yr2: Partially committing nodes; hardware hyperthreading; memory bandwidth; interconnect bandwidth
• Yr3: Evaluating energy efficiency vs performance of next-generation Intel processors; Broadwell improvements; memory speed; vectorisation; OpenMP coverage

Domain: Software stacks
• Yr2: OpenMPI vs Intel MPI analysis; Intel compiler versions
• Yr3: OpenMPI vs Intel MPI analysis; Intel compiler versions; maths libraries

Domain: Analysis of major HPC parallel algorithms and codes
• Yr2: Initial analysis of MPI communications; commence analysis of high-priority/high-profile HPC codes, particularly in Earth systems, geophysics and gene sequencing
• Yr3: Detailed analysis of MPI-communication-dependent algorithms; survey of codes and algorithms used

ACCESS - Numerical Weather Prediction (NWP)

A prime requirement for the Bureau of Meteorology is to provide vastly improved weather prediction, including better resolution and severe events:
• High resolution: 1-1.5km
• Data assimilation
• Rapid update cycles
• High-resolution ensembles

             APS-2 (Op: 2016)   APS-3 (Op: 2017-2018)   APS-4 (Op: 2019-2020)
ACCESS-G     25km {4dV}         12km {4dVH}             12km {4dVH/En}
ACCESS-R     12km {4dV}         8km {4dVH}              4.5km {4dVH/En}
ACCESS-TC    12km {4dV}         4.5km {4dVH}            4.5km {4dVH}
ACCESS-GE    60km (lim)         30km                    30km
ACCESS-C     1.5km {FC}         1.5km {4dVH}            1.5km {4dVH/En}
ACCESS-CE    -                  2.2km (lim)             1.5km
ACCESS-X     -                  1.5km {4dVH}            1.5km {4dVH/En}
ACCESS-XE    -                  -                       1.5km

Fully coupled Earth System Model

Core model
• Atmosphere – UM 10.4
• Ocean – MOM 5.1
• Sea ice – CICE5
• Coupler – OASIS-MCT

Carbon cycle (ACCESS-ESM)
• Terrestrial – CABLE
• Bio-geochemical
• Coupled to modified ACCESS1.3

Aerosols
• UKCA
• Coupled to ACCESS-CM2

[Schematic: atmosphere, atmospheric chemistry, terrestrial carbon, and ocean and sea-ice components linked through the coupler.]

The Queensland storm surge - sea surface height using ROMS.


UM (atmosphere) and 4DVAR Data Assimilation
• Improved UM I/O server in 10.4 uses MPI-IO (see Dale Roberts' talk today; a collective-write sketch follows the table below)
• Establishing baselines for different decompositions and node commitment
• Improvements using partial node commitment and with hyperthreading enabled

                         Max bandwidth   FLOP efficiency
Fully committed CPUs     600 MB/sec      13 cycles / FLOP
Half-committed CPUs      923 MB/sec      8 cycles / FLOP
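As a picture of the collective MPI-IO writes that an I/O server layer relies on, here is a minimal sketch in which every rank writes its contiguous slice of a field into one shared file. It illustrates the general technique only, not the UM I/O server itself; the output file name and per-rank buffer size are assumptions.

    /* Sketch: each MPI rank writes its own contiguous block of doubles into a
     * single shared file using a collective MPI-IO call. Illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nvals = 1 << 20;                 /* assumed 1M doubles per rank */
        double *buf = malloc(nvals * sizeof(double));
        for (int i = 0; i < nvals; i++) buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "field.dat", /* hypothetical output file */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank targets its own offset, and all ranks write collectively. */
        MPI_Offset offset = (MPI_Offset)rank * nvals * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, nvals, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Collective writes like this are what allow the file system (e.g. Lustre) to aggregate requests, which is the behaviour being measured in the bandwidth comparisons above.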

Relative runtime performance for the MOM5 global 0.25° model, from 120 to 15,360 CPUs (figure courtesy of Marshall Ward).

GC3 Coupled Climate Simulation
• GC3: UM 10.3.1, UKCA, NEMO 3.6, CICE version 5.1
  • N216L85 atmosphere and ORCA025 global ocean
• 'MPMD' profiling issues:
  • Standard profiling tools just don't work
  • Score-P hangs during initialisation
  • Open|SpeedShop can only profile the first of the 3 applications
  • Intel VTune does not work at all with OpenMPI
  • IPM works, but lacks detail
  • DrHook is enough to balance the sub-models, but …

Work to improve the UM 10.x OpenMP coverage
• Optimisation and OpenMP coverage improvements in UM 10.3-10.4 (Illia Bermous, BoM)
  • Intel Sandy Bridge 2.6 GHz, 16 cores/node
  • Intel compilers v15.0.1.133 and Intel MPI library v5.0.3.048
  • DrHook to gather performance profiles on a per-subroutine basis
  • cpu_time, omp_get_wtime and mpi_wtime for time measurements of a code region before and after a change (a minimal timing sketch follows the table below)
• N768L70 EG 1-day forecast job (u-aa783), I/O off, horizontal decomposition = 16x30
  • check bit reproducibility of results after every change
  • measure performance using 2, 4 and quite often 8 threads
• UM 10.3 -> 10.4: improvement is calculated as the relative gain in the elapsed time from the start of the first time step to the end of the run

k x m x n     UM 10.3 (elapsed time)   UM 10.3 + mods (elapsed time)   Gain   Improvement
4 x 16 x 30   619 s                    599 s                           20 s   3.4%
8 x 16 x 30   410 s                    387 s                           23 s   6.1%

k = number of threads; m x n = E-W x N-S domain decomposition
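The before/after region timing described above can be reproduced with a few lines of wall-clock instrumentation. Below is a minimal sketch in C; the UM itself is Fortran and uses DrHook, and region_of_interest here is only a stand-in for the routine being tuned.

    /* Sketch: wall-clock timing of a code region before/after an optimisation,
     * in the spirit of the cpu_time/omp_get_wtime/mpi_wtime measurements above. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    static void region_of_interest(void)   /* stand-in for the routine being tuned */
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < 50000000L; i++) sum += 1.0 / (double)(i + 1);
        if (sum < 0.0) printf("%f\n", sum); /* keep the work from being optimised away */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double t0 = MPI_Wtime();            /* or omp_get_wtime() in a non-MPI code */
        region_of_interest();
        double t1 = MPI_Wtime();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("region elapsed: %.3f s with %d threads\n",
                   t1 - t0, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }

Running the same instrumented region before and after a change, with the same thread count and decomposition, gives the elapsed-time gains reported in the table above.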

Performance evaluation for UM 10.4 -> 10.5
• Run with various configurations:
  • 2, 4, 8 threads
  • decompositions (16x30, 24x40)
  • 960-3840 cores
• [Chart: percentage improvement in elapsed times for 2, 4 and 8 threads at 960, 1920 and 3840 cores]
• Elapsed times on 3840 cores: 365 s (1 thread), 341 s (2 threads), 319 s (4 threads), 311 s (8 threads)
• With configuration 8x16x30 (2 MPI tasks/node) on 3840 cores, the total performance gain from the 10.3 and 10.4 changes is approximately 14% on Sandy Bridge, comparable with the gain achieved by upgrading from Haswell to Broadwell (memory bandwidth plus effective tuning for cores)
• Further OpenMP coverage to be added in the global model, the LAM and diagnostics (a loop-level sketch of this kind of change follows)
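The OpenMP coverage work above largely amounts to wrapping more of the model's loop nests in parallel regions so that the extra threads do useful work. A schematic example in C of the pattern (the UM's own loops are Fortran with !$OMP directives; the arrays, loop bounds and arithmetic here are purely illustrative):

    /* Schematic of extending OpenMP coverage: a previously serial loop nest is
     * given a parallel-for directive so that 2, 4 or 8 threads share the rows. */
    #include <omp.h>
    #include <stdio.h>

    #define NROWS 2048
    #define NCOLS 2048

    static double a[NROWS][NCOLS], b[NROWS][NCOLS];

    int main(void)
    {
        /* collapse(2) lets the threads share the full 2-D iteration space */
        #pragma omp parallel for collapse(2) schedule(static)
        for (int j = 0; j < NROWS; j++)
            for (int i = 0; i < NCOLS; i++)
                a[j][i] = 0.25 * (b[j][i] + 3.0);   /* stand-in for real physics */

        printf("updated %d x %d points with %d threads\n",
               NROWS, NCOLS, omp_get_max_threads());
        return 0;
    }

The gains in the chart above come from applying this kind of change, routine by routine, wherever DrHook profiles show time spent outside existing parallel regions.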

I/O profiling
• Metrics:
  • I/O time: the absolute time spent on I/O and its weight in the program execution
  • I/O data access: concurrent or consecutive
  • I/O transaction unit size: program buffer size
  • I/O client access: shared or independent read/write/metadata operations
  • I/O amount: data transferred per file system
  • I/O imbalance: I/O time per MPI rank (a per-rank measurement sketch follows this list)
• Measurement tools: Darshan, Score-P, Intel tools, then command-line profiles
• Analysis tools: NCI profiler reporting tool
• Global profiling tools focused on I/O
• Compare to baselines
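For the imbalance metric in particular, the per-rank I/O time can be gathered and summarised with a reduction. A minimal sketch, assuming a simple per-rank POSIX write; in practice the measurements would come from tools such as Darshan or Score-P listed above, and the file names and write size here are illustrative.

    /* Sketch: measure per-rank time spent in a write and report min/avg/max,
     * i.e. the "I/O imbalance: I/O time per MPI rank" metric above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int nvals = 4 << 20;                  /* assumed 4M doubles per rank */
        double *buf = calloc(nvals, sizeof(double));

        char fname[64];
        snprintf(fname, sizeof(fname), "rank%04d.bin", rank); /* hypothetical file */

        double t0 = MPI_Wtime();
        FILE *fp = fopen(fname, "wb");
        fwrite(buf, sizeof(double), nvals, fp);
        fclose(fp);
        double io_time = MPI_Wtime() - t0;

        double tmin, tmax, tsum;
        MPI_Reduce(&io_time, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&io_time, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&io_time, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("I/O time per rank: min %.3f s, avg %.3f s, max %.3f s\n",
                   tmin, tsum / size, tmax);

        free(buf);
        MPI_Finalize();
        return 0;
    }

A large gap between the average and maximum per-rank time is the imbalance signature that the profiling reports are designed to expose.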


Data file layouts and performance analysis
• Data chunking patterns to match usage patterns and cache sizes
• Exploring different data compression techniques (particularly using HDF5 filters)
• Lustre configuration tuning (e.g., RPC sizes 1 MB vs 4 MB)
• Deficiencies of non-HPC formats becoming more apparent (GeoTIFF, SEG-Y, netCDF3, BAM)

How is data stored (buffer in memory vs data in the file)?
• Contiguous (default): data elements stored physically adjacent to each other
• Chunked: better access time for subsets; extendible
• Chunked & compressed: also improves storage efficiency and transmission speed
(Chunking overview adapted from The HDF Group, "Extreme Scale Computing HDF5", www.hdfgroup.org)
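A short example of the chunked-and-compressed layout using the HDF5 C API; the dataset name, dimensions, chunk shape and deflate level are illustrative choices, not NCI defaults.

    /* Sketch: create a chunked, deflate-compressed 2-D dataset with the HDF5 C API,
     * matching the chunk size to the expected access pattern as discussed above. */
    #include <hdf5.h>
    #include <stdlib.h>

    int main(void)
    {
        const hsize_t dims[2]  = {4096, 4096};   /* illustrative dataset size   */
        const hsize_t chunk[2] = {256, 256};     /* chunk chosen to suit access */

        double *data = calloc(dims[0] * dims[1], sizeof(double));

        hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);            /* chunked layout              */
        H5Pset_deflate(dcpl, 4);                 /* gzip filter, level 4        */

        hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        free(data);
        return 0;
    }

The same dataset-creation property list is where other HDF5 filters (e.g. alternative compressors) would be attached when comparing compression techniques.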


Emerging petascale geophysics codes
• Assess priority geophysics areas:
  • 3D/4D geophysics: magnetotellurics, AEM
  • Hydrology, groundwater, carbon sequestration
  • Forward and inverse seismic models and analysis (onshore and offshore)
  • Natural hazard and risk models: tsunami, ash cloud
• Issues:
  • Data across domains, data resolution (points, lines, grids), data coverage
  • Model maturity for running at scale
  • Ensemble and uncertainty analysis, and inferencing


NCI National Earth Systems Research Data Collections
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections
http://geonetwork.nci.org.au

Data collections (approx. capacity):
• CMIP5, CORDEX, ACCESS Models – 5 Pbytes
• Satellite Earth Obs: LANDSAT, Himawari-8, Sentinels, plus MODIS, INSAR, … – 2 Pbytes
• Digital Elevation, Bathymetry, Onshore/Offshore Geophysics – 1 Pbyte
• Seasonal Climate – 700 Tbytes
• Bureau of Meteorology Observations – 350 Tbytes
• Bureau of Meteorology Ocean-Marine – 350 Tbytes
• Terrestrial Ecosystem – 290 Tbytes
• Reanalysis products – 100 Tbytes


Data Classified Based on Processing Levels

• Level 0 – Raw Data: instrumental data as received from the sensor; includes any and all artefacts.
• Level 1 – Instrument Data: instrument data converted to sensor units but otherwise unprocessed; includes appended time and platform georeferencing parameters (e.g., satellite ephemeris).
• Level 2 – Calibrated Data: data that has undergone the corrections or calibrations necessary to convert instrument data into geophysical values; includes calculated position.
• Level 3 – Gridded Data: data that has been gridded and has undergone minor processing for completeness and consistency (i.e., replacing missing data).
• Level 4 – "Value-added" Data Products: analytical (modelled) data such as those derived from the application of algorithms to multiple measurements or sensors.
• Level 5 – Model-derived Data Products: data resulting from the simulation of physical processes and/or the application of expert knowledge and interpretation.

*The level numbers and descriptions above follow definitions used in satellite data processing, as defined by NASA.


National Earth Systems Research Data Interoperability Platform: a simplified view

[Schematic: the NERDIP data platform, offering server-side functions and data services, sits between compute-intensive access (virtual laboratories, fast/deep data access) and machine-connected access (portal views, program access).]

National Environmental Research Data Interoperability Platform (NERDIP)

• Workflow engines, virtual laboratories (VLs) and science gateways: Biodiversity & Climate Change VL, Climate & Weather Science Lab, eMAST/Speddexes, eReefs, AGDC VL, All Sky Virtual Observatory, VGL, Globe Claritas, VHIRL, Open Nav Surface
• Tools: Ferret, NCO, GDL, GDAL, GRASS, QGIS, Fortran, C, C++, Python, R, MatLab, IDL; models: MPI, OpenMP; visualisation: Drishti
• Data portals: ANDS/RDA Portal, AODN/IMOS Portal, TERN Portal, AuScope Portal, Data.gov.au, Digital Bathymetry & Elevation Portal
• Services layer: OpenDAP, OGC W*TS, SWE, W*PS, WCS (netCDF-CF), WFS, WMS, RDF/Linked Data, CS-W, a fast "whole-of-library" catalogue, Vocabulary and PROV services, and direct access
• Data conventions and API layers: ISO 19115, ACDD, RIF-CS, DCAT, etc.; GDAL
• HP data library layer: NetCDF4 (climate/weather, ocean/bathymetry, Earth observation), ASDF HDF5, PH5 HDF5, HDFEOS, HDF5, [Airborne Geophysics], [SEG-Y], [FITS], [LAS LiDAR], and other legacy formats
• Storage: Lustre and object storage
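Program access through the services layer can be as simple as pointing a netCDF-aware client at an OPeNDAP endpoint. A sketch using the netCDF C library (built with DAP support); the URL and variable name are placeholders, not a real NCI endpoint.

    /* Sketch: open a remote dataset through OPeNDAP with the netCDF C library
     * and inspect one variable. URL and variable name are placeholders. */
    #include <netcdf.h>
    #include <stdio.h>

    int main(void)
    {
        const char *url =
            "http://example.nci.org.au/thredds/dodsC/some/dataset.nc"; /* hypothetical */
        int ncid, varid, ndims;

        if (nc_open(url, NC_NOWRITE, &ncid) != NC_NOERR) {
            fprintf(stderr, "could not open %s\n", url);
            return 1;
        }
        if (nc_inq_varid(ncid, "temperature", &varid) == NC_NOERR) { /* assumed name */
            nc_inq_varndims(ncid, varid, &ndims);
            printf("variable 'temperature' has %d dimensions\n", ndims);
        }
        nc_close(ncid);
        return 0;
    }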

Enable global and continental scale … and to scale down to local/catchment/plot
• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with point clouds and local or other measurements
• Statistical techniques on key variables

Preparing for:
• Better programmatic access
• Machine/deep learning
• Better integration through semantic/linked-data technologies


Climate and Weather Science Laboratory

The CWSLab provides an integrated national facility for research in climate and weather simulation and analysis:
• To reduce the technical barriers to using state-of-the-art tools;
• To facilitate the sharing of experiments, data and results;
• To reduce the time to conduct scientific research studies; and
• To elevate the collaboration and contributions to the development of the Australian Community Climate and Earth-System Simulator (ACCESS).

Components: ACCESS modelling, data services, computational infrastructure, climate analysis.