ACCESS-Opt UM HPC optimisation project
Dr Ben Evans
UM Collaborators Meeting, June 2016
nci.org.au @NCInews
Earth Systems Science Research and Societal Impacts
Examples:
• Modelling Extreme & High Impact events – BoM
• NWP, Climate Coupled Systems & Data Assimilation – BoM, CSIRO, Research Collaborations
• Hazards – Geoscience Australia, BoM, States
• Geophysics, Seismic – Geoscience Australia, Universities
• Monitoring the Environment & Ocean – ANU, BoM, CSIRO, GA, Research, Federal/State
• International research – International agencies and Collaborative Programs
[Images: Tropical Cyclones – Cyclone Winston, 20-21 Feb 2016; Volcanic Ash – Manam eruption, 31 July 2015; Bush Fires – Wye Valley and Lorne fires, 25-31 Dec 2015; Flooding – St George, QLD, February 2011]
NCI High Performance Optimisation and Scaling activities 2014-16
Objectives: upscale and increase performance of high-profile community codes
• Year 1
  - Characterise, optimise and tune critical applications for higher resolution
  - Best-practice configuration for improved throughput
  - Establish analysis toolsets and methodology
• Year 2
  - Characterise, optimise and tune next-generation high-priority applications
  - Select high-priority geophysics codes and exemplar HPC codes for scalability
  - Parallel algorithm review and I/O optimisation methods to enable better scaling
  - Established TIHP Optimisation work package (Selwood, Evans)
• Year 3
  - Assess codes for scalability; encourage adoption from "best in discipline"
  - Updated hardware (many-core), memory/data latency/bandwidths, energy efficiency
  - Communication libraries, math libraries
Common methodology and approach for analysis
• Analyse code to establish strengths and weaknesses: full code analysis including hotspots and algorithm choices
• Expose the model to more extreme scaling, e.g. realistic higher resolution
• Consider software stack configuration
• Decomposition strategies for nodes, and node tuning
• Parallel algorithms and MPI communication patterns, e.g. halo analysis, grid exchanges (see the sketch below)
• I/O techniques: move to parallel I/O
• Consideration of future technologies based on the analysis
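A minimal sketch (Fortran, not UM code) of the 1-D halo-exchange pattern that the halo analysis above profiles: each rank swaps one edge cell with its east and west neighbours using MPI_Sendrecv. The array size and the periodic neighbour topology are illustrative assumptions.

program halo_sketch
  use mpi
  implicit none
  integer, parameter :: nx = 8            ! interior points per rank (illustrative)
  integer :: ierr, rank, nprocs, east, west
  double precision :: u(0:nx+1)           ! interior cells 1..nx plus two halo cells

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  u = dble(rank)                          ! stand-in for model data
  east = mod(rank + 1, nprocs)            ! assumed periodic neighbours
  west = mod(rank - 1 + nprocs, nprocs)

  ! Send interior edge values, receive into halo cells, in both directions
  call MPI_Sendrecv(u(nx), 1, MPI_DOUBLE_PRECISION, east, 0, &
                    u(0),  1, MPI_DOUBLE_PRECISION, west, 0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  call MPI_Sendrecv(u(1),    1, MPI_DOUBLE_PRECISION, west, 1, &
                    u(nx+1), 1, MPI_DOUBLE_PRECISION, east, 1, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program halo_sketch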
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Atmosphere
  Yr1: APS1 (UM 8.2-4) – Global N320L70 (40km) and pre-APS2 N512L70 (25km); Regional; City
  Yr2: UM10.x (PS36) – Global N768L70 (~17km); Regional N768L70 (~17km); City 4.5km
  Yr3: APS3 prep – UM10.x latest; ACCESS-G (Global) N512L85 (25km) and N1024L70 (12km) / N1280L70 (10km); ACCESS-GE Global Ensemble N216L70 (~60km, based on MOGREPS); ACCESS-TC 4km; ACCESS-R (Regional) 12km; ACCESS-C (City) 1.5km (3DVar DA); ACCESS-CE City Ensemble 2.2km with downscaled global perturbations
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases (continued)

Data assimilation
  Yr1: 4D-VAR v30 (New Dynamics) – N216L70, N320L70; BODAS
  Yr2: 4D-VAR
  Yr3: 4D-VAR latest for Global at N320L70; perhaps 3D-VAR for City and ETKF for GE; enKF-C

Ocean
  Yr1: MOM5.1 – OFAM3; initially 0.25°, 0.1° L50
  Yr2: MOM5.1 analysis – 0.25°, 0.1° + L50
  Yr3: MOM5/6 0.1° and 0.03°; model initialisation; scaling up to 40+k cores; ROMS (Regional) for storm surge (2D) and eReefs (3D); OceanMAPS3.1 (MOM5) with enKF-C

Wave
  WaveWatch3 v4.18 (v5.08 beta); AusWave-G 0.4°; AusWave-R 0.1°
NCI-Fujitsu Scaling/Optimisation: Bureau Research-to-Operations high-profile cases

Coupled Systems
  Yr1: Climate: ACCESS-CM – GA6 (UM 8.5) + MOM5 with OASIS-MCT; Global N96L38 (135km), 1° and 0.25° ocean
  Yr2: Climate: ACCESS-CM continued
       Seasonal Climate: ACCESS-S1 – UK GC2 with OASIS3, N216L85 (~60km), NEMO 0.25°; NCI profiling methodology applied for MPMD
  Yr3: Climate: ACCESS-CM2 (AOGCM) – GA7 (UM 10.4+ and GLOMAP aerosol); Global N96L85, N216L85 (~60km); MOM5.1 0.25°; CABLE2 (pending); UKCA aerosols
       Earth System: ACCESS-ESM2 – ACCESS-CM2 plus terrestrial biogeochemistry (CASA-CNP), oceanic biogeochemistry (WOMBAT; Matear, CSIRO) and atmospheric chemistry (UKCA)
       Multi-week and Seasonal Climate: ACCESS-S2 / UK GC3 – Atmos N216L85 (60km) and NEMO 3.6 0.25° L75
HPC General Scaling

Profiling Methodology
  Yr2: Create methodology for profiling codes
  Yr3: Updates to methodology based on application across more codes

Profiling tools suite
  Yr2: Review major open-source profiling tools

I/O profiling
  Yr2: Baseline profiling and comparison of NetCDF3, NetCDF4, HDF5 and GeoTIFF and API options (e.g. GDAL); profiling comparison of I/O performance on Lustre and NFS; compare MPI-I/O vs POSIX on Lustre
  Yr3: Advanced profiling of HDF5 and NetCDF4 for compression algorithms, multithreading and cache management; profiling analysis of other data formats, e.g. Astronomy FITS, SEG-Y, BAM

Accelerator Technology Investigation
  Yr3: Intel Phi (Knights Landing); AMD GPU
HPC General Scaling (continued)

Compute Node Performance Analysis
  Yr2: Partially committing nodes; hardware hyperthreading; memory bandwidth; interconnect bandwidth
  Yr3: Evaluating energy efficiency vs performance of next-generation Intel processors; Broadwell improvements; memory speed; vectorisation; OpenMP coverage

Software Stacks
  Yr2: OpenMPI vs IntelMPI analysis; Intel compiler versions
  Yr3: OpenMPI vs IntelMPI analysis; Intel compiler versions; math libraries

Analysis of major HPC parallel algorithms and codes
  Yr2: Initial analysis of MPI communications; commence analysis of high-priority/high-profile HPC codes, particularly in Earth Systems, Geophysics and Gene Sequencing
  Yr3: Detailed analysis of MPI-communication-dependent algorithms; survey of codes and algorithms used
ACCESS - Numerical Weather Prediction (NWP)
A prime requirement for the Bureau of Meteorology is to provide vastly improved weather prediction, including better resolution and better prediction of severe events:
• High resolution: 1-1.5km
• Data assimilation
• Rapid update cycles
• High-resolution ensembles

             APS-2 (Op: 2016)   APS-3 (Op: 2017-2018)   APS-4 (Op: 2019-2020)
ACCESS-G     25km {4dV}         12km {4dVH}             12km {4dVH/En}
ACCESS-R     12km {4dV}         8km {4dVH}              4.5km {4dVH/En}
ACCESS-TC    12km {4dV}         4.5km {4dVH}            4.5km {4dVH}
ACCESS-GE    60km (lim)         30km                    30km
ACCESS-C     1.5km {FC}         1.5km {4dVH}            1.5km {4dVH/En}
ACCESS-CE    -                  2.2km (lim)             1.5km
ACCESS-X     -                  1.5km {4dVH}            1.5km {4dVH/En}
ACCESS-XE    -                  -                       1.5km
Fully coupled Earth System Model

Core Model
• Atmosphere – UM 10.4
• Ocean – MOM 5.1
• Sea-Ice – CICE5
• Coupler – OASIS-MCT

Carbon cycle (ACCESS-ESM)
• Terrestrial – CABLE
• Bio-geochemical
• Coupled to modified ACCESS1.3

Aerosols
• UKCA
• Coupled to ACCESS-CM2

[Diagram: atmosphere, atmospheric chemistry, terrestrial, carbon and ocean/sea-ice components linked through the coupler]
[Figure: Queensland storm surge – sea surface height modelled using ROMS]
UM (atmosphere) and 4DVAR Data Assimilation
• Improved UM I/O server in UM 10.4 uses MPI-IO (see Dale Roberts' talk today, and the sketch below)
• Establishing baselines for different decompositions and node commitment
• Improvements using partial node commitment and with hyperthreading enabled

                        Max bandwidth   FLOP efficiency
Fully committed CPUs    600 MB/sec      13 cycles/FLOP
Half-committed CPUs     923 MB/sec      8 cycles/FLOP
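A minimal sketch of the collective MPI-IO pattern referred to above (this is not the UM I/O server code): each rank writes its slab of a 1-D decomposed field to a shared file with a collective write. The slab size and file name are illustrative assumptions.

program mpiio_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000     ! points owned by this rank (illustrative)
  integer :: ierr, rank, fh
  integer(kind=MPI_OFFSET_KIND) :: offset
  double precision :: field(nlocal)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  field = dble(rank)                      ! stand-in for model data

  call MPI_File_open(MPI_COMM_WORLD, 'field.dat', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
  ! Each rank writes at its own byte offset (8 bytes per double precision value)
  offset = int(rank, MPI_OFFSET_KIND) * nlocal * 8
  call MPI_File_write_at_all(fh, offset, field, nlocal, &
                             MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program mpiio_sketch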
[Figure: relative runtime performance for the MOM5 global 0.25° model, 120 to 15360 CPUs (c/- Marshall Ward)]
GC3 Coupled Climate Simulation
• GC3: UM 10.3.1, UKCA, NEMO 3.6, CICE version 5.1
  - N216L85 atmosphere and ORCA025 global ocean
• 'MPMD' profiling issues:
  - Standard profiling tools just don't work
  - Score-P: hangs during initialisation
  - Open|SpeedShop: can only profile the first of the 3 applications
  - Intel VTune: does not work at all with OpenMPI
  - IPM: works, but lacks detail
  - DrHook: enough to balance the sub-models, but …
Work to improve the UM 10.x OpenMP coverage
• Optimisation and OpenMP coverage improvements in UM 10.3-10.4 (Illia Bermous, BoM)
  - Intel SandyBridge 2.6 GHz, 16 cores/node
  - Intel compilers v15.0.1.133 and Intel MPI library v5.0.3.048
  - DrHook used to gather performance profiles on a per-subroutine basis
  - cpu_time, omp_get_wtime and mpi_wtime for time measurements of a code region before and after a change (see the timing sketch after the results table below)
• N768L70 EG 1-day forecast job (u-aa783), I/O off, horizontal decomposition 16x30
  - Check bit reproducibility of results after every change
  - Measure performance using 2, 4 and, quite often, 8 threads
• UM 10.3 -> 10.4: improvement is calculated as the relative gain in elapsed time from the start of the first time step to the end of the run
k x m x n      UM10.3 (elapsed time)   UM10.3 + mods (elapsed time)   Gain   Improvement
4 x 16 x 30    619s                    599s                           20s    3.4%
8 x 16 x 30    410s                    387s                           23s    6.1%

k = number of threads; m x n = E-W x N-S domain decomposition
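A minimal sketch of the region-timing approach listed above, assuming an MPI+OpenMP program that has already called MPI_Init: bracket the code region with omp_get_wtime (mpi_wtime or cpu_time work the same way) and compare the elapsed time before and after a change. The subroutine name is illustrative.

subroutine timed_region_demo()
  use omp_lib, only: omp_get_wtime
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: t0, t1, dt, dtmax

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  t0 = omp_get_wtime()
  ! ... code region under investigation (e.g. a physics section) ...
  t1 = omp_get_wtime()
  dt = t1 - t0

  ! The slowest rank determines the elapsed time of the step, so report the maximum
  call MPI_Reduce(dt, dtmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) write(*,'(a,f10.4,a)') 'region elapsed (max over ranks): ', dtmax, ' s'
end subroutine timed_region_demo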
Performance Evaluation for UM 10.4 -> 10.5
• Run with various configurations:
  - 2, 4 and 8 threads
  - decompositions (16x30, 24x40)
  - 960-3840 cores
• With configuration 8x16x30 (2 MPI tasks/node) on 3840 cores:
  - total performance gain with the 10.3 and 10.4 changes is approx. 14% on SandyBridge
  - comparable with the gain achieved by upgrading from Haswell to Broadwell (memory bandwidth plus effective tuning for cores)
• Further OpenMP coverage to be added in the global model, the LAM and diagnostics (a minimal example of this kind of loop-level threading follows below)

[Chart: improvement in the elapsed times (%) for 2, 4 and 8 threads at 960, 1920 and 3840 cores]
[Chart: elapsed times on 3840 cores for 1, 2, 4 and 8 threads; approx. 365s, 341s, 319s and 311s]
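As a minimal illustration of what adding OpenMP coverage means in practice (the routine, loop body and array names are made up, not UM code): a loop that previously ran serially inside each MPI task gets an OpenMP directive, so that the k (thread) count in the k x m x n configuration does useful work in this region.

subroutine add_increment(nlev, npts, theta, dtheta)
  implicit none
  integer, intent(in) :: nlev, npts
  double precision, intent(inout) :: theta(npts, nlev)
  double precision, intent(in)    :: dtheta(npts, nlev)
  integer :: k, i

  ! Thread over the level index; each thread updates whole columns independently
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(theta, dtheta, nlev, npts) PRIVATE(k, i)
  do k = 1, nlev
    do i = 1, npts
      theta(i, k) = theta(i, k) + dtheta(i, k)
    end do
  end do
!$OMP END PARALLEL DO
end subroutine add_increment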
I/O profiling
• Metrics:
  - I/O time: the absolute time spent on I/O and its weight in the program execution
  - I/O data access: concurrent or consecutive
  - I/O transaction unit size: program buffer size
  - I/O client access: shared or independent read/write/metadata operations
  - I/O amount: data transferred per file system
  - I/O imbalance: I/O time per MPI rank (see the sketch below)
• Measurement tools: Darshan, Score-P, Intel tools, then command-line profiles
• Analysis tools: NCI profiler reporting tool
• Global profiling tools focused on I/O
• Compare to baselines
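A minimal sketch of the I/O imbalance metric above: time the I/O phase on each MPI rank and compare the slowest rank with the mean. It assumes MPI is already initialised; the actual read or write being profiled is elided.

subroutine report_io_imbalance(comm)
  use mpi
  implicit none
  integer, intent(in) :: comm
  integer :: ierr, rank, nprocs
  double precision :: t0, tio, tmax, tsum

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  t0 = MPI_Wtime()
  ! ... per-rank write/read being profiled ...
  tio = MPI_Wtime() - t0

  call MPI_Reduce(tio, tmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, comm, ierr)
  call MPI_Reduce(tio, tsum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  if (rank == 0) write(*,'(a,f6.2)') 'I/O imbalance (max/mean): ', tmax / (tsum / nprocs)
end subroutine report_io_imbalance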
Data file layouts and performance analysis
• Data chunking patterns to match usage patterns and cache sizes (see the sketch below)
• Exploring different data compression techniques (particularly using HDF5 filters)
• Lustre configuration tuning (e.g., RPC sizes 1MB vs 4MB)
• Deficiencies of non-HPC formats becoming more apparent (GeoTIFF, SEG-Y, netCDF3, BAM)

How data is stored (buffer in memory vs data in the file):
• Contiguous (default): data elements stored physically adjacent to each other
• Chunked: better access time for subsets; extendible
• Chunked & compressed: improves storage efficiency and transmission speed
[Source: The HDF Group, "Extreme Scale Computing HDF5", August 7, 2013, www.hdfgroup.org]
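A hedged sketch of the chunking and compression choices above, using the netCDF-4 Fortran API: define a variable with explicit chunk sizes matched to the expected access pattern and enable the shuffle and deflate (HDF5) filters. The dimension sizes, chunk shape and deflate level are illustrative assumptions, not recommended NCI settings.

program chunking_sketch
  use netcdf
  implicit none
  integer :: ncid, dimids(3), varid, status
  integer :: chunks(3) = (/ 128, 128, 1 /)   ! spatial tiles, one time step per chunk

  status = nf90_create('demo.nc', NF90_NETCDF4, ncid)
  status = nf90_def_dim(ncid, 'lon',  1024, dimids(1))
  status = nf90_def_dim(ncid, 'lat',  1024, dimids(2))
  status = nf90_def_dim(ncid, 'time', NF90_UNLIMITED, dimids(3))
  status = nf90_def_var(ncid, 'temperature', NF90_FLOAT, dimids, varid)

  ! Chunked layout plus shuffle and deflate (level 4) via the underlying HDF5 filters
  status = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, chunks)
  status = nf90_def_var_deflate(ncid, varid, 1, 1, 4)

  status = nf90_enddef(ncid)
  status = nf90_close(ncid)
end program chunking_sketch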
Emerging Petascale Geophysics codes
• Assess priority geophysics areas:
  - 3D/4D geophysics: magnetotellurics, AEM
  - Hydrology, groundwater, carbon sequestration
  - Forward and inverse seismic models and analysis (onshore and offshore)
  - Natural hazard and risk models: tsunami, ash cloud
• Issues:
  - Data across domains, data resolution (points, lines, grids), data coverage
  - Model maturity for running at scale
  - Ensemble, uncertainty analysis and inferencing
NCI National Earth Systems Research Data Collections
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections
http://geonetwork.nci.org.au

Data Collections                                                            Approx. capacity
CMIP5, CORDEX, ACCESS models                                                5 Pbytes
Satellite Earth obs: LANDSAT, Himawari-8, Sentinels, plus MODIS, INSAR, …   2 Pbytes
Digital elevation, bathymetry, onshore/offshore geophysics                  1 Pbyte
Seasonal climate                                                            700 Tbytes
Bureau of Meteorology observations                                          350 Tbytes
Bureau of Meteorology ocean-marine                                          350 Tbytes
Terrestrial ecosystem                                                       290 Tbytes
Reanalysis products                                                         100 Tbytes
Data Classified Based On Processing Levels*

Level   Proposed name                   Description
0       Raw Data                        Instrumental data as received from the sensor. Includes any and all artefacts.
1       Instrument Data                 Instrument data that have been converted to sensor units but are otherwise unprocessed. Data include appended time and platform georeferencing parameters (e.g., satellite ephemeris).
2       Calibrated Data                 Data that have undergone the corrections or calibrations necessary to convert instrument data into geophysical values. Data include calculated position.
3       Gridded Data                    Data that have been gridded and undergone minor processing for completeness and consistency (i.e., replacing missing data).
4       "Value-added" Data Products     Analytical (modelled) data, such as those derived from the application of algorithms to multiple measurements or sensors.
5       Model-derived Data Products     Data resulting from the simulation of physical processes and/or the application of expert knowledge and interpretation.

*The level numbers and descriptions above follow definitions used in satellite data processing, as defined by NASA.
National Earth Systems Research Data Interoperability Platform: a simplified view
[Diagram: the NERDIP data platform with server-side data functions and services, accessed through compute-intensive workloads, virtual laboratories, fast/deep data access, portal views, machine-connected and program access]
National Environmental Research Data Interoperability Platform (NERDIP)
[Architecture diagram, summarised]
• Workflow engines, Virtual Laboratories (VLs) and science gateways: Biodiversity & Climate Change VL, Climate & Weather Science Lab, eMAST, Speddexes, eReefs, AGDC VL, All Sky Virtual Observatory, VGL, Globe Claritas, VHIRL, Open Nav Surface
• Tools and models: Ferret, NCO, GDL, Fortran, C, C++, Python, R, GDAL, GRASS, QGIS, MatLab, IDL; MPI, OpenMP; visualisation with Drishti
• Data portals: ANDS/RDA Portal, AODN/IMOS Portal, TERN Portal, AuScope Portal, Data.gov.au, Digital Bathymetry & Elevation Portal
• Services layer: OPeNDAP, OGC W*TS, OGC SWE, OGC W*PS, OGC WCS, OGC WFS, OGC WMS, netCDF-CF, RDF/LD, CS-W, direct access; Vocab and PROV services; fast "whole-of-library" catalogue; this layer is the entry point for programmatic access (see the sketch below)
• Data conventions and API layers: ISO 19115, ACDD, RIF-CS, DCAT, etc.; GDAL
• HP Data Library layer: NetCDF4 (climate/weather, ocean bathymetry, EO), ASDF HDF5, PH5 HDF5, HDF-EOS, [Airborne Geophysics], [SEG-Y], [FITS], [LAS LiDAR], HDF5 and other legacy formats
• Storage: Lustre and object storage
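As a sketch of programmatic access through the services layer: where the netCDF library is built with DAP support, nf90_open accepts an OPeNDAP URL directly, so the same code can read local files and published collections. The URL, variable name and slab shape below are hypothetical placeholders, not actual NCI endpoints.

program opendap_sketch
  use netcdf
  implicit none
  integer :: ncid, varid, status
  real :: sample(10, 1, 1)

  ! Hypothetical OPeNDAP endpoint; replace with a real catalogue URL
  status = nf90_open('http://example.nci.org.au/thredds/dodsC/some_dataset.nc', &
                     NF90_NOWRITE, ncid)
  status = nf90_inq_varid(ncid, 'tas', varid)     ! hypothetical variable name
  status = nf90_get_var(ncid, varid, sample, start=(/1, 1, 1/), count=(/10, 1, 1/))
  status = nf90_close(ncid)
end program opendap_sketch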
Enable global and continental scale analysis, and scale down to local/catchment/plot
• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with point clouds and local or other measurements
• Statistical techniques on key variables
Preparing for:
• Better programmatic access
• Machine/Deep Learning
• Better integration through Semantic/Linked Data technologies
Climate and Weather Science Laboratory
The CWSlab provides an integrated national facility for research in climate and weather simulation and analysis:
• To reduce the technical barriers to using state-of-the-art tools;
• To facilitate the sharing of experiments, data and results;
• To reduce the time to conduct scientific research studies; and
• To elevate the collaboration and contributions to the development of the Australian Community Climate and Earth System Simulator (ACCESS).

[Diagram: CWSlab components – ACCESS Modelling, Data Services, Computational Infrastructure, Climate Analysis]