IO Performance Evaluation on Massive Data Processing at the NCI High Performance Computing Platform
Rui Yang, Ben Evans
National Computational Infrastructure (NCI), Australia — nci.org.au

NCI – background and key motivation for this work

• NCI is the peak National Computational Centre for Research in Australia
  – ANU Supercomputer Facility (1987-2011)
  – APAC (Australian Partnership for Advanced Computing) (2000-2007)
  – NCI (2007-)



• NCI has moved from an academic HPC centre to a full national high-performance computing centre
  – The partnership includes ANU, Bureau of Meteorology, CSIRO and Geoscience Australia
  – Particular focus on climate, weather, Earth systems, environment, water management and geophysics

• Two key drivers for the work in this talk
  – HPC scaling and optimisation (supported by Fujitsu)
  – High Performance Data analysis



• Application areas
  – Weather forecasting for Bureau of Meteorology operations (see Tim Pugh's talk)
  – Weather, climate, Earth systems and water management research for government and universities (see Marshall Ward's talk)
  – Satellite Earth observation data with Geoscience Australia and the Bureau of Meteorology
  – Geophysics applications for Geoscience Australia, state surveys and universities

Raijin supercomputing facility

Raijin is a Fujitsu Primergy high-performance, distributed-memory cluster installed in 2013. It comprises:
• 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3,592 compute nodes
• 160 TBytes of main memory
• Infiniband FDR interconnect
• 10 PBytes of usable fast filesystem (for short-term scratch space)

High Performance Data at NCI

The National Computational Infrastructure has now co-located a priority set of more than 10 PetaBytes of national data collections for our priority areas.



The facility provides an integrated high-performance computational and storage platform (a High Performance Data, or HPD, platform) to serve and analyse massive amounts of data across the spectrum of environmental collections, in particular from the climate, environmental and geoscientific domains.



The data is managed to support government agencies, major academic research communities and collaborating overseas organizations.



By co-locating these vast data collections with high-performance computing environments and harmonizing these large, valuable data assets, new opportunities have arisen for data-intensive interdisciplinary science at scales and resolutions not previously possible.



Note that many elements of this work are not touched on in this talk.



As well as addressing management and performance issues, our work also aims to help these communities adopt better ways of processing, managing and analysing data.


We care about performant access to data through all levels of the IO stack

User applications: local or remote, serial or parallel, multiple file formats
High-level I/O libraries
  – GDAL (Geospatial Data Abstraction Library): converts among many formats such as GeoTIFF and NetCDF
  – NetCDF4 (Network Common Data Form): simplified, self-describing data abstraction
  – HDF5 (Hierarchical Data Format): chunking, compression
I/O middleware
  – MPI-IO: higher-level data abstraction, MPI hints
  – POSIX I/O: full, low-level control of serial I/O; transfers regions of bytes
Parallel file system
  – Lustre: parallel access to multiple OSTs
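To make the layering concrete, the sketch below (Python, with a hypothetical file data.nc and variable band1) reaches the same array through three of these layers; it assumes the GDAL, netCDF4 and h5py bindings are available.

    from osgeo import gdal       # high-level geospatial layer
    import netCDF4               # NetCDF4 layer (built on HDF5)
    import h5py                  # HDF5 layer underneath NetCDF4

    # GDAL: format-agnostic raster access (NetCDF variables appear as subdatasets)
    ds = gdal.Open('NETCDF:"data.nc":band1')
    arr_gdal = ds.GetRasterBand(1).ReadAsArray()

    # netCDF4: self-describing variable access
    with netCDF4.Dataset("data.nc") as nc:
        arr_nc = nc.variables["band1"][:]

    # h5py: the same data seen as an HDF5 dataset (NetCDF4 files are HDF5 files)
    with h5py.File("data.nc", "r") as f:
        arr_h5 = f["band1"][:]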


IO tuning activities and interests

Tuning parameters of interest for serial and parallel IO, by layer:

• User application: transfer size, chunk pattern, file size, chunk cache, subset selection, compression, access to a remote DAP server; concurrency, independent vs collective access and collective buffering (parallel IO only)
• IO interfaces: NetCDF4/HDF5, POSIX, GDAL/GeoTIFF and GDAL/NetCDF(4) Classic; MPI-IO (parallel IO only)
• MPI-IO middleware: data sieving
• Lustre file system: stripe count, stripe size
• IO profiling and tracing (14 items in total for serial IO, 16 for parallel IO)

SERIAL IO


Serial IO: Read benchmark on file format and IO library

File formats            pixel-interleaved GeoTIFF, band-interleaved GeoTIFF, NetCDF Classic, NetCDF4
Total bands/variables   9
Storage layouts         contiguous, chunked (chunk/tiling size = 640 x 640)
IO libraries            GDAL, NetCDF
Access patterns         full file access, block access (hyperslab size = 2560 x 2560)
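For illustration, the two access patterns roughly correspond to the following sketch (file and variable names are hypothetical; the 2560 x 2560 window matches the hyperslab size above):

    from osgeo import gdal
    import netCDF4

    # Full-file access via GDAL
    ds = gdal.Open("image.tif")
    full = ds.GetRasterBand(1).ReadAsArray()

    # Block access: one 2560 x 2560 window starting at the origin
    block = ds.GetRasterBand(1).ReadAsArray(xoff=0, yoff=0,
                                            win_xsize=2560, win_ysize=2560)

    # The same hyperslab through the NetCDF library, via slicing
    with netCDF4.Dataset("image.nc") as nc:
        block_nc = nc.variables["band1"][0:2560, 0:2560]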


Serial IO: Full file access

[Bar chart: read throughput (MB/s, scale to 900) for full-file access of pixel-interleaved GeoTIFF, band-interleaved GeoTIFF, NetCDF Classic and NetCDF4, in chunked and contiguous layouts, read through GDAL and the NetCDF library]

Serial IO: Block access

[Bar chart: read throughput (MB/s, scale to 400) for block access (2560 x 2560 hyperslabs) of pixel-interleaved GeoTIFF, band-interleaved GeoTIFF, NetCDF Classic and NetCDF4, in chunked and contiguous layouts, read through GDAL and the NetCDF library]

Serial IO: Profiling band-interleaved and pixel-interleaved GeoTIFF files

                            Band-interleaved        Pixel-interleaved
Read total                  1 of 9 variables        1 of 9 variables
Read time / throughput      0.273 s / 446.3 MB/s    3.208 s / 38 MB/s
Touch range / touch size    122 MB / 258 MB         1071 MB / 1214 MB

[Access-pattern diagrams not reproduced]

Serial IO: Access pattern, storage layout, chunking and compression of NetCDF

Raw GA NetCDF files       LS5_TM_NBAR, LS7_ETM_NBAR, LS8_OLI_TIRS_NBAR
Variable size/band/file   [full time, 4000 pixels, 4000 pixels], e.g. 1 year x 100 km x 100 km
Chunking shapes           [full time, 500, 500], [5, 500, 500], [5, 200, 200], [1, 200, 200]
Deflation levels          1, 4, 9
Data type                 short
Access patterns           spatial access: 13 days x 400 km x 400 km (16 files); time series access: 49 years x 1 pixel x 1 pixel (49 files)
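As an illustration of how these layouts are produced, the sketch below writes one NetCDF4 variable with one of the studied chunk shapes and deflate levels (the file name, dimension names and fill data are placeholders, not the actual GA production code):

    import numpy as np
    import netCDF4

    with netCDF4.Dataset("ls_nbar_tile.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("time", None)   # unlimited time axis
        nc.createDimension("y", 4000)
        nc.createDimension("x", 4000)

        # One of the studied layouts: 5 x 200 x 200 chunks, deflate level 1, short integers
        var = nc.createVariable("band1", "i2", ("time", "y", "x"),
                                chunksizes=(5, 200, 200),
                                zlib=True, complevel=1, shuffle=True)

        # Placeholder data for a single time step
        var[0, :, :] = np.zeros((4000, 4000), dtype="int16")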


Serial IO: File sizes, chunks and compression levels


Serial IO: Optimized compression level


Serial IO: Spatial Access


Serial IO: Time series access


Serial IO: optimized chunking patterns


Serial IO: Compression Algorithm

Library          NetCDF4           HDF5
Default          Deflate (Zlib)    Deflate (Zlib)
Dynamic filter   N/A               Bzip2, MAFISC, SPDP, Szip, LZ4, Blosc (blosclz, lz4hc, lz4, Snappy, Zlib)

The NetCDF4 library was modified to enable different compression algorithms via the HDF5 dynamic filter mechanism.
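From Python, the same dynamic-filter route can be sketched with h5py plus the hdf5plugin package, which registers Blosc and other HDF5 dynamic filters; this is an assumed tool choice for illustration, not necessarily the exact setup used here:

    import numpy as np
    import h5py
    import hdf5plugin   # registers Blosc and other HDF5 dynamic filters

    data = np.random.randint(0, 1000, size=(23, 128, 128), dtype="int16")

    with h5py.File("blosc_test.h5", "w") as f:
        # Blosc/LZ4 with byte shuffling, applied as a dynamically loaded filter
        f.create_dataset("band1", data=data, chunks=(23, 128, 128),
                         **hdf5plugin.Blosc(cname="lz4", clevel=5,
                                            shuffle=hdf5plugin.Blosc.SHUFFLE))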


Serial IO: Write performance of different compression algorithms

[Chart: NetCDF C library write throughput vs file size (chunk sizes T: 23, YX: 128), comparing the default deflate with Blosc/LZ4 without shuffling and with bit/byte shuffling]

Serial IO: Read performance of different compression algorithms

[Chart: NetCDF C library read throughput vs file size (chunk sizes T: 23, YX: 128), comparing the default deflate with Blosc/BLOSCLZ (no shuffle) and Blosc/LZ4 without shuffling and with bit/byte shuffling]

Serial IO: Read/write performance of different compression algorithms

[Chart: NetCDF C library read/write throughput vs file size (chunk sizes T: 23, YX: 128), comparing the default deflate with Blosc/LZ4 without shuffling and with bit/byte shuffling]

Serial IO: Chunk cache


Serial IO: Chunk cache size

[Chart: read throughput (MB/s, scale to 200) across subset selections for a chunk size of 1 x 5500, comparing a 2 MB chunk cache with a 128 MB chunk cache]
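As a sketch of the tuning knob behind this comparison, the netCDF4-python bindings allow the per-variable chunk cache to be enlarged before reading subsets (file and variable names are hypothetical, and the slot count is illustrative):

    import netCDF4

    with netCDF4.Dataset("ls_nbar_tile.nc") as nc:
        var = nc.variables["band1"]
        # Arguments: cache size in bytes, number of chunk slots, preemption policy
        var.set_var_chunk_cache(size=128 * 1024 * 1024, nelems=2003, preemption=0.75)
        subset = var[0, :, :]   # illustrative subset read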

Serial IO: Accessing local file system vs. remote THREDDS server

[Charts: read throughput (MB/s) vs transfer size for deflate levels 0-9 and the native format, comparing local file-system access (scale to ~1000 MB/s) with access through the remote THREDDS server (scale to ~70 MB/s)]
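Remote THREDDS access goes through the same netCDF4 API as local access, only the path changes to an OPeNDAP URL; a sketch with hypothetical paths:

    import netCDF4

    local = netCDF4.Dataset("/g/data/example/ls_nbar_tile.nc")
    remote = netCDF4.Dataset("https://dap.example.org/thredds/dodsC/example/ls_nbar_tile.nc")

    # Identical subset request; only the transport (POSIX vs OPeNDAP) differs
    sub_local = local.variables["band1"][0, 0:2560, 0:2560]
    sub_remote = remote.variables["band1"][0, 0:2560, 0:2560]

    local.close()
    remote.close()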


PARALLEL IO


Parallel IO: Lustre tuning on stripe count

Independent write, IOR benchmark: MPI size = 16, stripe size = 1 MB, block size = 8 GB, transfer size = 32 MB

[Chart: write throughput (MB/s, scale to 1800) vs stripe count (1, 8, 16, 32, 64, 128) for the HDF5, MPI-IO and POSIX interfaces]

Parallel IO: Lustre file system tuning on stripe count

Independent read, IOR benchmark: MPI size = 16, stripe size = 1 MB, block size = 8 GB, transfer size = 32 MB

[Chart: read throughput (MB/s, scale to 7000) vs stripe count (1, 8, 16, 32, 64, 128) for the HDF5, MPI-IO and POSIX interfaces]
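Outside the IOR runs themselves, the stripe count and stripe size of a Lustre directory are typically set with lfs setstripe before the benchmark files are created; a sketch (the directory path is hypothetical, and the flags follow the standard Lustre client tools):

    import subprocess

    target_dir = "/short/project/io_bench"   # hypothetical scratch directory

    # Stripe new files in this directory across 32 OSTs with a 1 MB stripe size
    subprocess.run(["lfs", "setstripe", "-c", "32", "-S", "1m", target_dir], check=True)

    # Inspect the resulting layout
    subprocess.run(["lfs", "getstripe", target_dir], check=True)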


MPI-IO tuning



Default MPI-IO hint values for 16 MPI ranks (8 cores/node x 2 nodes):
  direct_read = false
  direct_write = false
  romio_lustre_co_ratio = 1
  romio_lustre_coll_threshold = 0
  romio_lustre_ds_in_coll = enable
  cb_buffer_size = 16777216 (16 MB)
  romio_cb_read = automatic
  romio_cb_write = automatic
  cb_nodes = 2
  romio_no_indep_rw = false
  romio_cb_pfr = disable
  romio_cb_fr_types = aar
  romio_cb_fr_alignment = 1
  romio_cb_ds_threshold = 0
  romio_cb_alltoall = automatic
  ind_rd_buffer_size = 4194304
  ind_wr_buffer_size = 524288
  romio_ds_read = automatic
  romio_ds_write = automatic
  cb_config_list = *:1
  striping_unit = 1048576
  striping_factor = 1
  romio_lustre_start_iodevice = 0

Hints of interest:
• Collective buffering
  – romio_cb_read / romio_cb_write: automatic, enable or disable
  – cb_config_list (aggregators)
  – cb_buffer_size
• Data sieving
  – romio_ds_read / romio_ds_write: automatic, enable or disable
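These hints can be passed programmatically through an MPI Info object when a file is opened; a minimal mpi4py sketch with illustrative values (not the tuned settings reported in this talk):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    info = MPI.Info.Create()
    info.Set("romio_cb_write", "disable")    # collective buffering for writes
    info.Set("romio_ds_write", "disable")    # data sieving for writes
    info.Set("cb_buffer_size", "16777216")   # 16 MB collective buffer
    info.Set("striping_factor", "8")         # Lustre stripe count at file creation
    info.Set("striping_unit", "1048576")     # 1 MB stripe size

    fh = MPI.File.Open(comm, "hints_test.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
    fh.Write_at_all(comm.rank * 1024, bytearray(1024))   # simple collective write
    fh.Close()
    info.Free()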


Parallel IO: MPI-IO Tuning on data sieving and collective buffering



Data sieving
– I/O performance suffers considerably when making many small I/O requests
– Access to small, non-contiguous regions of data can be optimized by grouping requests and using temporary buffers
– This optimization is local to each process (a non-collective operation)

Collective buffering
– Reorganizes data across processes to match the data layout in the file
– Mixes I/O and MPI communication to read or write data
– A communication phase merges data from different processes into large chunks
– File accesses are performed only by selected processes (the aggregators); the other processes communicate with them
– Large operations are split into multiple phases, to limit buffer sizes and to overlap communication and I/O

MPI-IO tuning: contiguous access

Collective read: MPI size = 16, stripe count = 8, stripe size = 1 MB, block size = 1 GB (configs 1, 2, 3, 7) or 4 GB (configs 4, 5, 6), transfer size = 8 MB

[Chart: read throughput (MB/s, scale to 6000) for HDF5 and MPI-IO across the MPI-IO hint configurations]

Finding: disable collective buffering to get better performance for contiguous access.

MPI-IO tuning: contiguous access

Collective write on Raijin /short: MPI size = 16, stripe count = 8, stripe size = 1 MB, block size = 1 GB (configs 1, 2, 3, 7) or 4 GB (configs 4, 5, 6), transfer size = 8 MB

[Chart: write throughput (MB/s, scale to 1800) for HDF5 and MPI-IO across the MPI-IO hint configurations]

Finding: disable collective buffering to get better performance for contiguous access.

MPI-IO Tuning: Non-Contiguous Access

Read with hyperslab [1:1:512:256]; each block consists of 4 x 4 hyperslabs

[Chart: read throughput (MB/s, scale to 1000) for non-contiguous hyperslab access]
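With h5py built against parallel HDF5, a hyperslab like this can be read collectively as sketched below (file and dataset names are hypothetical):

    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD

    with h5py.File("parallel_data.h5", "r", driver="mpio", comm=comm) as f:
        dset = f["band1"]
        row0 = comm.rank * 512
        # Each rank reads its own 512 x 256 hyperslab; the request is issued collectively
        with dset.collective:
            block = dset[row0:row0 + 512, 0:256]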


Parallel IO: Write performance of Blosc multithreading (LZ4, byte shuffling)

[Charts: write throughput for the blosc_LZ4_H5_SHUFFLE_BLOSC_BYTE_SHUFFLE and blosc_LZ4_H5_NO_SHUFFLE_BLOSC_BYTE_SHUFFLE configurations]

Parallel IO: Read performance of Blosc multithreading (LZ4 byte shuffling)


Parallel IO: IO profiling and performance tuning on a climate application

Elapsed time (s)   Serial IO   Official HDF5/NetCDF4               Patched HDF5/NetCDF4
                               (independent metadata operations)   (collective metadata operations)
Case 1             4909        10173                               3766
Case 2             1625        8543                                1341

Summary

• IO performance tuning involves many parameters of interest:
  – File system (Lustre): stripe count, stripe unit
  – MPI-IO: collective buffer size, collective nodes, etc.
  – NetCDF/HDF5: alignment, sieve buffer size, data layout, chunking, compression, etc.
  – User application: access patterns
  – Access to data: network (remote) vs in-situ (local)

Thank you! nci.org.au
