Active Storage for High Performance Computing: Electric Field Potential Computation on a Flash-Based Key-Value Store

Master Thesis

Presented January 2015 to
Dr. Darius Sidlauskas
Data-Intensive Applications and Systems Laboratory
School of Computer and Communications
École Polytechnique Fédérale de Lausanne

by

Stefan Eilemann
Faubourg de l'Hôpital 12
CH-2000 Neuchâtel

Acknowledgements

The research leading to this thesis was supported in part by the Blue Brain Project, the Swiss National Science Foundation under Grant 200020-129525, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project), and the King Abdullah University of Science and Technology (KAUST) through the KAUST-EPFL alliance for Neuro-Inspired High Performance Computing.

I would like to take the opportunity to thank the Blue Brain Project, in particular the visualization team, the DIAS Laboratory, IBM Research, and all the other reviewers for their support in developing and writing this thesis. I would also like to thank GitHub for providing an excellent infrastructure for hosting our projects at http://github.com/BlueBrain, http://github.com/HBPVis and http://github.com/Eyescale.

Lausanne, January 15, 2015

Stefan Eilemann


Abstract

High-performance computing is on the verge of a disruptive change from batch processing to becoming an interactive tool for domain specialists. Various factors contribute to this, most notably that supercomputers are facing a bandwidth wall rendering current workflows ineffective. As a corollary of interactive supercomputing, innovation cycles will be drastically shortened, leading to improved productivity.

This master thesis designs and implements a first application use case that leverages server-class persistent flash memory on supercomputers for robust, fast and interactive coupling of simulation, data analysis and visualization. It delivers a first module towards interactive supercomputing, which allows running HPC simulations to be introspected dynamically and interactively. Its key contributions are: a production-quality software stack demonstrating the capabilities of an active storage subsystem; new, reusable software components for parallel field voxelization; a relevant, functional use case for neuro-scientific research; and a novel burst buffer storage system tightly integrated with simulations on high performance computing systems.

Key words: Key-Value Store, HPC, Data Staging, Distributed Programming, Big Data, Volume Rendering, Image Processing, Interactive Supercomputing


Contents

1 Introduction
  1.1 Motivation
  1.2 Use Case and Workflow
  1.3 Hardware
  1.4 User Requirements
  1.5 Software Ecosystem
    1.5.1 Hardware Interfaces
    1.5.2 ITK
    1.5.3 Lunchbox
    1.5.4 Brion
    1.5.5 BBPSDK
    1.5.6 ZeroEQ
    1.5.7 Livre
  1.6 Uniform Resource Identifiers

2 Architecture
  2.1 Software Stack
  2.2 Scalability
  2.3 Reliability
  2.4 Data Model
    2.4.1 Neurons
    2.4.2 Key-Value Store Data Format
  2.5 Field Voxelization
  2.6 Livre Data Source
    2.6.1 Local Data Source
    2.6.2 Remote Data Source
  2.7 Brion Compartment Report Writer and Reader
  2.8 Parallel Volume Compositor

3 Implementation
  3.1 Methodology
  3.2 Field Voxelization Library
  3.3 Livre Local Data Source
  3.4 SKV
  3.5 Brion Compartment Report Writer and Reader
  3.6 ZeroEQ
    3.6.1 Extensible Vocabularies
    3.6.2 Endian-Safe Messaging
    3.6.3 Connection Broker
    3.6.4 Shared Receivers
  3.7 Livre Remote Data Source

4 Results
  4.1 Fivox
  4.2 Livre Remote Data Source

5 Conclusion and Future Work

A Project Plan and Execution

B Raw Benchmark Data
  B.1 Figure 4.1 and Figure 4.2
  B.2 Figure 4.3

List of Figures

1.1 Local Field Potential
1.2 Workflow
1.3 Hardware Setup
1.4 Livre Data Sources
2.1 Software Stack
2.2 Schematic Morphological Skeleton and 3D Mesh of a Neuron
2.3 Data Layout and Access for a Compartment Report
2.4 Key-Value Data Format
2.5 Fivox Class Diagram
2.6 Livre Data Source Class Diagram
2.7 Bricks in a Three-Level Octree
2.8 Brick Addressing
2.9 Sequence Diagram of the Livre Remote Data Source Network Protocol
2.10 Parallel Direct-Send Compositing
2.11 Sequence Diagram of the Parallel Direct-Send Compositor
3.1 ZeroEQ Connection Broker
3.2 ZeroEQ Shared Receiver
4.1 Sampling Performance of the Fivox Data Source
4.2 Relative Hardware Performance for Fivox
4.3 Throughput of the Remote Data Source
4.4 Memory Operations Performance on Intel Hardware
A.1 Thesis Project Plan

1 Introduction

1.1 Motivation

Simulations on high performance computing (HPC) systems are increasingly bandwidth limited, since disk IO bandwidth and available compute power follow different exponential growth curves, doubling roughly every 45 months and 18 months, respectively. This leads to an ever-increasing "bandwidth wall", which requires rethinking current workflows to take this growing imbalance into account. This thesis explores the use of fast burst buffers to store simulation results in a temporary storage pool, using a novel storage paradigm for fine-grained data access. This storage pool provides a faster data buffer than traditional filesystems, although with weaker consistency and reliability guarantees. It allows asynchronous, live coupling of simulation, data analysis and visualization within a heterogeneous supercomputer.

Flash has emerged in the last couple of years as the technology of choice for these burst buffers. It is often used as a new level in the storage hierarchy, being faster but more expensive than disks, and persistent, cheaper and slower than main memory. Using traditional filesystem semantics to access flash storage is however suboptimal, since parallel filesystems are historically optimized for disk access with different design parameters: traditional hard disks provide good streaming bandwidth, relatively poor random access performance and slow performance under many random IO accesses. Server-class flash devices, on the other hand, have negligible random access times and need many concurrent IO operations to utilize all channels to the individual memory chips.

Key-value stores keep data in a distributed, associative array of key-value pairs. They promise to exploit the characteristics of flash memory better than parallel file systems by issuing many concurrent key-value operations on relatively small objects, exposing the fine-grained, parallel access to the hardware. The key-value store can be used to store more transient data than the longer-term parallel file system, not as a transparent cache in the storage hierarchy but as an explicit short-term burst buffer for more detailed simulation results. Furthermore, the compute capabilities of the flash storage system can be used to perform local computation on the data, as explored by the map-reduce paradigm [DG08].


1.2 Use Case and Workflow

The motivation for this thesis is a concrete application use case of visualizing the local electric field potential, given by the Blue Brain Project [RAP+13]. Figure 1.1 shows a local field potential (LFP) volume rendering overlaid on some representative neurons of the simulated circuit. The LFP is derived from the electrical activity produced when simulating a neuronal circuit. Each neuron is discretized into hundreds of compartments, each simulated separately. The simulation yields a time series of electrical activity at a position in space for each compartment, which is used to aggregate the field potential in any given region.

Figure 1.1 – Local Field Potential

For this thesis, the use case is to concurrently write, read, voxelize and render the simulation values of a full-machine simulation through a nearby key-value store. A full-machine simulation on the current Blue Brain supercomputer uses 64k MPI processes to simulate about 200,000 neurons with around 500 compartments per neuron at a simulation rate of 1 to 10 hertz.

Figure 1.2 depicts the high-level workflow and the interaction between the major components. The user interacts with an interactive direct volume rendering application to explore the local field potential of an electrical simulation of a piece of reconstructed brain tissue. This volume renderer, introduced in Section 1.5.7, uses an out-of-core algorithm and runs the rendering loop asynchronously to updating the input data. It receives its data from a remote or local data source.

The remote data source, described in Section 2.6.2, runs in parallel on all storage nodes. It uses a new field voxelization library, described in Section 2.5, to sample the node-local simulation data into a volume. This field voxelization library exploits node-level parallelism using multi-threading. The partial volumes created on each node are merged into the final result using a parallel volume compositor, described in Section 2.8. The data is read through a compartment report reader (Section 3.5) accessing a key-value store (Section 3.4) running in parallel on all storage nodes. The data is deposited into the key-value store by a simulation running on the Blue Gene/Q supercomputer, or by using a converter based on a compartment report writer (Section 3.5). The simulation may run concurrently with the visualization.

Figure 1.2 – Workflow

1.3 Hardware

Figure 1.3 shows the hardware setup used to implement, debug, benchmark and deploy the software developed in this thesis. The supercomputer used by the Blue Brain Project is a four-rack Blue Gene/Q providing 4096 compute nodes with 16 processing cores and 16 GB of memory each. The compute nodes form 64 groups of 64 nodes, each group connected to one of 64 Blue Gene Active Storage (BGAS) nodes [SDK+14]. BGAS nodes are extended Blue Gene IO nodes, fully integrated into the Blue Gene interconnect [CEH+11]. Each BGAS node hosts a two-terabyte PCI Express flash device, and all nodes run one key-value store instance. Each of the 64 BGAS nodes is connected to a ten Gigabit Ethernet switch shared with the visualization cluster. The BGAS nodes use an IBM A2 processor, an in-order Power-based CPU running 16 cores at 1.6 GHz [A2112], and a standard Red Hat Enterprise Linux installation.

Figure 1.3 – Hardware Setup (4096 compute nodes, 64 BGAS nodes with flash, 40 visualization nodes with Tesla K20 GPUs and GTX 580 display nodes; CNK, RoQ, DSA and IB Verbs interconnects, 10 GE switches and a 10 Gbit/s WAN link between Lugano and Lausanne)

The compute nodes use the same A2 processor as the BGAS nodes, but with a simplified operating system kernel that uses static scheduling and no virtual memory to minimize operating system jitter and overhead for HPC simulation runs.

Each of the 40 visualization cluster nodes has two NVidia Tesla K20 cards used for the GPU-based direct volume rendering. Each node has two processor sockets with 8-core Intel Xeon E5-2670 v2 processors, 128 GB of memory and an internal InfiniBand FDR network. For display, a 24-Megapixel, 4x3 tiled display wall is used locally. The two sites are connected with a 10 Gbit/s WAN link. The display nodes use three NVidia GTX 580 cards to drive two Full-HD displays per GPU. The nodes are connected to a local ten Gigabit Ethernet switch as well as a local QDR InfiniBand network.

1.4 User Requirements

This section outlines the expectations and needs of computational neuroscientists for the software stack designed and implemented in this thesis.

The core of the implementation is a reliable, always-on key-value store running as a data storage service on the Blue Gene Active Storage (BGAS) nodes, which are tightly integrated with the supercomputer. The primary goal of this store is to allow asynchronous coupling and buffering of data between simulations and downstream analysis and visualization software. It has to provide good performance for random IO operations, which for the given use case means 4 to 40 gigabytes per second. Additionally, lightweight computation on the BGAS nodes allows reformatting and extraction of data to reduce the bandwidth needed to transport data through the system.

Since we expect this type of workflow to be used on future supercomputers, the whole stack must be portable to commodity hardware. This ensures portability to future hardware by properly encapsulating vendor-specific APIs today, and enables reuse of parts or the whole of the stack on other systems for different use cases. Due to the limited storage and compute capabilities of the BGAS nodes, the stack needs to be efficient and optimized for Blue Gene and BGAS hardware. This optimization includes efficient use of the available flash memory, without excessive replication for storage reliability.

1.5 Software Ecosystem

1.5.1 Hardware Interfaces

The compute nodes and BGAS nodes are all connected through the Blue Gene torus and are programmed using InfiniBand-like RDMA interfaces. MPI is only supported within a set of compute nodes or within a set of BGAS nodes, that is, one cannot set up an MPI session between compute and BGAS nodes. The flash devices use multiple channels to the actual storage chips and are also programmed through an RDMA interface. This enables multiple asynchronous operations to exploit the hardware parallelism with less overhead than a standard Linux block device. The Ethernet interfaces are programmed using standard TCP sockets, and the QDR InfiniBand interfaces can be programmed using TCP, RDMA or MPI. The GPUs in both visualization clusters are accessible through OpenGL or CUDA.


The tiled display wall is driven by the DisplayCluster software [JAW+12], heavily enhanced by the Blue Brain Project. The primary access method is a pixel streaming library [dcS14], which uses parallel compression to allow interactive rendering even at high resolutions.

1.5.2 ITK

ITK [JMIC13] is an open-source, cross-platform system containing an extensive set of software implementations for image analysis. It is the de-facto standard for developing image processing software in C++. ITK makes extensive use of templates to parameterize the dimensionality and pixel type of the images processed.

One of the reasons for using ITK was its existing, versatile image implementation, which serves as the data storage container for the volume data. Furthermore, the produced volume images can easily be wrapped and visualized with VTK. Algorithms are implemented as filters in ITK, which are chained together to form image processing pipelines. This flexibility, coupled with the sheer number of existing filters, was the final deciding factor for ITK. Implementing the field voxelization code as an image source in ITK allows the resulting system to be easily extended with new processing pipelines in the future.
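As an illustration of this pipeline style, the following minimal sketch chains an ITK image source through a filter into a file writer. itk::RandomImageSource merely stands in for the Fivox voxelization source described in Section 2.5, and the output file name is a placeholder.

#include <itkImage.h>
#include <itkImageFileWriter.h>
#include <itkRandomImageSource.h>
#include <itkRescaleIntensityImageFilter.h>

// 3D float image, the same kind of volume type used for the LFP data.
using Volume = itk::Image<float, 3>;

int main()
{
    // Stand-in source; the real system plugs the Fivox voxelization source
    // (an itk::ImageSource subclass, see Section 2.5) into this slot.
    auto source = itk::RandomImageSource<Volume>::New();

    // Filters are chained by connecting outputs to inputs.
    auto rescale = itk::RescaleIntensityImageFilter<Volume, Volume>::New();
    rescale->SetInput(source->GetOutput());
    rescale->SetOutputMinimum(0.f);
    rescale->SetOutputMaximum(1.f);

    auto writer = itk::ImageFileWriter<Volume>::New();
    writer->SetInput(rescale->GetOutput());
    writer->SetFileName("volume.mhd"); // placeholder output file

    writer->Update(); // pulls the data through the whole pipeline on demand
    return 0;
}

Swapping the stand-in source for any other itk::ImageSource subclass leaves the rest of the pipeline untouched, which is exactly the extensibility exploited by the field voxelization library.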

1.5.3 Lunchbox

Lunchbox is a C++ library for multi-threaded programming, providing utilities that are not available in the Boost libraries. It is used as a base library by most C++ software developed in the Blue Brain Project and provides the common, reusable utility classes.

1.5.4 Brion

Brion is the C++ IO library used to access the files produced and consumed in the Blue Brain Project. It implements readers and writers for most of the static and dynamic data. Brion provides fast, low-overhead read access to the major formats: BlueConfig, Circuit, CompartmentReport, Mesh, Morphology, SpikeReport, SynapseSummary, Synapse and Target files, as well as write access to CompartmentReport, Mesh and Morphology files.

1.5.5 BBPSDK

The Blue Brain Project SDK provides a data model on top of the raw file data provided by Brion. The data structures relevant for this thesis are explained in Section 2.4. It provides a programming interface to access the model and simulations of the neocortical column in the NEURON simulator [HES08]. This so-called Microcircuit API is used for research and application development in the Blue Brain Project.


1.5.6 ZeroEQ

Zero Event Queue (ZeroEQ) enables loose coupling of different applications through event-based messaging using a publish-subscribe mechanism. It is built on ZeroMQ [Akg13] as the transport layer, zeroconf networking [zer13] (also known as Apple Bonjour) for automatic discovery, and flatbuffers [fla14] as the serialization format. Applications using ZeroEQ are coupled in an automatic and robust fashion, that is, application crashes or exits are handled automatically and do not affect other running applications. ZeroEQ was designed to provide a stateless, simple and performant messaging layer between different applications.
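The following sketch uses the raw ZeroMQ C API to illustrate the publish-subscribe pattern that ZeroEQ builds on; it deliberately omits the zeroconf-based discovery, the flatbuffers serialization and the ZeroEQ API itself, and the endpoint and event payload are placeholder values.

#include <zmq.h>

#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>

int main(int argc, char** argv)
{
    void* context = zmq_ctx_new();
    const bool isPublisher = argc > 1 && std::strcmp(argv[1], "publisher") == 0;

    if (isPublisher)
    {
        // Publisher: bind and emit events; subscribers may join and leave at
        // any time without affecting the publisher.
        void* pub = zmq_socket(context, ZMQ_PUB);
        zmq_bind(pub, "tcp://*:5556"); // placeholder endpoint
        const char event[] = "camera::update"; // placeholder event payload
        for (int i = 0; i < 100; ++i)
        {
            zmq_send(pub, event, sizeof(event), 0);
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        zmq_close(pub);
    }
    else
    {
        // Subscriber: connect and receive; a crashed publisher simply means
        // no further events arrive, the subscriber keeps running.
        void* sub = zmq_socket(context, ZMQ_SUB);
        zmq_connect(sub, "tcp://localhost:5556");
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0); // no topic filter
        char buffer[256];
        const int size = zmq_recv(sub, buffer, sizeof(buffer), 0);
        if (size > 0)
            std::printf("received: %s\n", buffer);
        zmq_close(sub);
    }

    zmq_ctx_destroy(context);
    return 0;
}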

1.5.7 Livre

Livre (Large-scale Interactive Volume Rendering Engine) is an out-of-core direct volume rendering application. It uses an octree data structure to provide error-based level of detail (LOD) selection for data sets of any size, aiming for a constant rendering quality.

Livre can render different input data sets by using a plugin interface abstracting the access to the underlying data. Based on the given URI, a plugin is selected to read the data. Figure 1.4 shows the available data sources: a file-based source reading terabyte image stacks, two data sources computing on-the-fly volumes from watertight meshes or morphologies, and the two data sources developed in this thesis, performing on-the-fly LFP calculation (Section 2.6.1) and enabling remote execution of any of the data sources (Section 2.6.2).

Figure 1.4 – Livre Data Sources (Tuvok: terabyte image stacks; MeshVox: obj/ply meshes; NeuroVox: BBPSDK morphologies; Fivox: Blue Brain electrical simulation; Remote: ZeroEQ)

The Livre core algorithm requests bricks from these data sources depending on the current view frustum, the screen resolution, and the CPU and GPU memory capacity. An asynchronous rendering pipeline separates the rendering algorithm into a data, an upload and a render thread. The render thread constantly updates the LOD tree to request new bricks based on the current view frustum, and then renders the available textures. The data thread is traditionally responsible for loading, and potentially decompressing, the requested data from disk. In our implementation the data is generated by on-the-fly voxelization of the simulation report. The volume data is then sent to the GPU upload thread, which uploads the raw data from memory into an OpenGL texture on the GPU. The active texture identifiers are sent to the render thread, which picks the appropriate texture bricks for rendering. All three threads run asynchronously, which allows interactive frame rates for rendering, while the data loading might be delayed by a few seconds due to poor IO bandwidth or computational overhead. The data and upload threads maintain caches to keep as many bricks and textures in CPU and GPU memory as the available hardware allows.

The Equalizer parallel rendering framework [EMP09] is used by Livre for sort-first parallel rendering onto multi-tile display walls. In this mode, each GPU render thread runs an asynchronous GPU upload thread, and each node runs a shared data loader thread.
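The URI-based plugin selection can be sketched as follows; the interface, the registry and the readBrick signature are simplified assumptions for illustration and do not reproduce the actual Livre plugin API.

#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal data source interface: a plugin returns the voxel data of one
// octree brick, identified by its LOD level and position.
struct DataSource
{
    virtual ~DataSource() = default;
    virtual std::vector<uint8_t> readBrick(uint32_t level, uint32_t x,
                                           uint32_t y, uint32_t z) = 0;
};

// Each plugin registers a factory under its URI scheme.
using Factory = std::function<std::unique_ptr<DataSource>(const std::string&)>;

std::map<std::string, Factory>& registry()
{
    static std::map<std::string, Factory> factories;
    return factories;
}

// Select the plugin based on the scheme of the given URI; the plugin itself
// interprets the rest of the URI (path, query parameters, ...).
std::unique_ptr<DataSource> createDataSource(const std::string& uri)
{
    const std::string scheme = uri.substr(0, uri.find("://"));
    const auto it = registry().find(scheme);
    if (it == registry().end())
        throw std::runtime_error("no data source plugin for " + uri);
    return it->second(uri);
}

With such a registry, a call like createDataSource("fivox://...") would resolve to the LFP data source and a "remote://..." URI to the proxy for the remote service; the concrete URI schemes shown here are illustrative examples, not the documented Livre schemes.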

1.6 Uniform Resource Identifiers

Various software components in this thesis use Uniform Resource Identifiers (URI) to select plugins or to connect to remote systems. The abstraction interface to different key-value stores, described in Section 2.7, uses URIs to identify the backend implementation and location. Data sources in the volume renderer, including the local and remote implementations, are selected and addressed using URIs, as described in Section 1.5.7 and Section 2.6.2. The ZeroEQ implementation, described in Section 3.6, uses URIs to identify publishers and their data schema. The URI format is described in the C++ standards committee proposal N3420 [MB12]; an example URI consists of the following parts:

URI part    Range    Value
scheme      [a, b)   http
user_info   [c, d)   bob
host        (d, e)   www.example.com
port        (e, f)   8080
path        [f, g)   /path/
query       (g, h)   key=value
fragment    (h, i)   fragment

http://bob@www.example.com:8080/path/?key=value#fragment
^   ^  ^  ^               ^    ^     ^         ^        ^
a   b  c  d               e    f     g         h        i
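The decomposition above can be reproduced with the generic splitting regular expression from RFC 3986, Appendix B, as in the following self-contained sketch; user_info, host and port would be extracted from the authority component in a second step.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string uri =
        "http://bob@www.example.com:8080/path/?key=value#fragment";

    // Generic URI splitter from RFC 3986, Appendix B.
    const std::regex rfc3986(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (std::regex_match(uri, m, rfc3986))
    {
        std::cout << "scheme:    " << m[2] << "\n"  // http
                  << "authority: " << m[4] << "\n"  // bob@www.example.com:8080
                  << "path:      " << m[5] << "\n"  // /path/
                  << "query:     " << m[7] << "\n"  // key=value
                  << "fragment:  " << m[9] << "\n"; // fragment
    }
    return 0;
}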


2 Architecture

2.1 Software Stack

The implementation of this thesis is embedded in the software ecosystem described in the previous section, which made it possible to implement most of the desired use case in the limited time available. This section introduces how the new and changed software components form the complete system within the scope of the larger ecosystem. The next chapter explains the implementation.

Figure 2.1 – Software Stack (Livre, fivox::DataSource, remote::DataSource, fivox::ImageSource, fivox::Compositor, bbp::report_converter, bbp::CompartmentReportReader/Writer, ZeroEQ with zmq, zeroconf and flatbuffers, itk::ImageSource, brion::CompartmentReport, lunchbox::PersistentMap, and the HDF5, Bin, LevelDB and SKV backends; the legend distinguishes new, BBP and third-party software)

Figure 2.1 shows how the work of this thesis integrates into the existing software stack. Livre, an existing large-scale interactive volume rendering engine (Section 1.5.7), is extended by two data sources: a local data source sampling discrete electrical events into regular 3D volumes (Section 2.6.1), and a remote data source providing access to data sources on different systems (Section 2.6.2), including the newly created local data source. The local data source allows on-the-fly electric field potential visualization of simulation reports stored in local files, and is built on a new, parallel field voxelization kernel (Section 2.5) implemented as an ITK image source (Section 1.5.2). The remote data source is built using an improved version of ZeroEQ (Section 3.6), which connects the data source proxy running in the volume renderer with the remote data source service process. The service process uses a parallel volume compositor (Section 2.8) and serves all data sources available in Livre.

The field voxelization kernel receives its data from a BBPSDK compartment report reader (Section 1.5.5), which is a high-level abstraction using the Brion low-level compartment report access class (Section 1.5.4) to read the data. A standalone report converter tool is used to convert the data from a file into the key-value store. A new plugin to the Brion compartment report implements the key-value storage backend, which accesses different key-value stores using an abstraction interface in Lunchbox (Section 3.5) and a data format (Section 2.4) optimized for key-value stores. The production environment will run SKV as the key-value store. SKV is developed in the context of a research collaboration with IBM (Section 3.4).

2.2 Scalability

The performance-critical components are designed to be horizontally and vertically scalable, as applicable. In the following we describe their scalability within the complete system.

SKV is a horizontally scalable service, which is exploited by the Lunchbox abstraction interface through many concurrent, asynchronous requests distributed over the SKV servers. Adding more servers increases storage capacity, network bandwidth, IO operation throughput and access bandwidth.

The field voxelization kernel is a vertically scalable component which uses all available CPU cores on a given system. Using faster clocked processors or more processing cores increases the processing speed for generating LFP volumes.

The remote data source service provides horizontal scalability by instantiating one field voxelization kernel per node. This also provides horizontally scalable access to SKV, as each instance reads a subset of the report to be voxelized. The parallel volume compositor provides a scalable algorithm to combine the partial results. Adding more nodes to the service increases the throughput to the SKV servers and the processing speed for generating LFP volumes.

Last, but not least, Livre scales horizontally by using multiple nodes and GPUs for sort-first scalable rendering. Adding more rendering nodes increases the frame rate and quality of the interactive volume rendering.

We expect the whole system to be vertically scalable to achieve faster voxelization performance, which is the main computational bottleneck in our current use case. Horizontal scalability, i.e., adding more storage nodes to the system, scales the available storage size as well as the overall performance, since the storage nodes are actively used for computation.
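As an illustration of this access pattern, the sketch below issues many key-value requests concurrently so that their latencies overlap instead of adding up; the fetch function is a stand-in for the actual Lunchbox/SKV interface, whose API is not reproduced here, and the key names are illustrative.

#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a single key-value fetch; the real code would call the
// Lunchbox key-value abstraction backed by SKV. The sleep models latency.
std::string fetch(const std::string& key)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    return "value-of-" + key;
}

int main()
{
    // Issue all requests up front; the keys are distributed over the storage
    // servers, so the individual request latencies overlap.
    std::vector<std::future<std::string>> pending;
    for (int gid = 0; gid < 64; ++gid)
        pending.emplace_back(std::async(std::launch::async, fetch,
                                        "Frame42_" + std::to_string(gid)));

    // Gather the results once all requests are in flight.
    std::vector<std::string> values;
    for (auto& result : pending)
        values.emplace_back(result.get());

    return values.size() == 64 ? 0 : 1;
}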

2.3 Reliability

The multiple software components running on a large-scale distributed infrastructure composed of many heterogeneous nodes create a system with inherent complexity, which requires robust handling of resource and software failures to produce a usable infrastructure.

In the following we outline the steps taken to increase the reliability of the system in the presence of failures. Reliability is achieved by two means: isolating components from each other by implementing them as separate services, and handling failures within a horizontally scalable service. The four main components in our system are separate applications: data storage in SKV, LFP calculation in the remote data source service, rendering in Livre, and writing of data from the HPC simulation.

Reliability in SKV, and in the communication to SKV, is part of the SKV project and therefore outside the scope of this thesis. SKV does not currently implement any reliability measures.

The remote data source service is designed to tolerate runtime failures by using stateless ZeroEQ events, constant runtime reconfiguration of the compositing network and appropriate timeouts during compositing. The communication protocol to the service also uses ZeroEQ, and the remote data source in Livre has appropriate timeouts for finding a remote data source and for receiving bricks after a request has been made.

Livre scales horizontally by using the Equalizer parallel rendering framework. Equalizer supports reliability transparently for its applications by detecting resource failures and removing these resources from the runtime configuration.

The HPC simulation is typically an MPI job, and therefore not failure tolerant. It is decoupled from the rest of the system through the SKV-based burst buffer.

2.4 Data Model

2.4.1 Neurons

A Neuron is a single simulated nerve cell, connected with thousands of other neurons. It is constructed from a morphological skeleton. This skeleton has a tree-like structure originating at the Soma, the cell body of the neuron. Each neuron has a unique GID (global identifier). A Target is a set of GIDs identifying a logical group of neurons, for example all neurons of a given type.

The branches of the tree are formed by Sections, which are subdivided into smaller pieces. They start at the soma and eventually branch into multiple sections in a tree-like structure, as shown in Figure 2.2. During the reconstruction from real neurons, each section is formed by a number of Segments. Each segment has two endpoints in ℝ³, as well as a radius at each endpoint. These segments consequently represent tube-lets in 3D space.



Figure 2.2 – Schematic Morphological Skeleton and 3D Mesh of a Neuron (soma, sections, segments and compartments).

For the simulation, each section is decomposed into m Compartments, which have an n : m mapping to segments. For each compartment, various observables are simulated and may be saved in a CompartmentReport.
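The relationships described above can be summarized in a few plain data structures; this is a simplified sketch of the concepts, not the actual Brion or BBPSDK types, and the field names are illustrative.

#include <array>
#include <cstdint>
#include <vector>

using GID = uint32_t; // unique global identifier of a neuron

// A segment is a tube-let in 3D space: two endpoints, each with a radius.
struct Segment
{
    std::array<float, 3> start, end;
    float startRadius, endRadius;
};

// A section is one branch of the morphological skeleton, built from segments.
// For the simulation it is covered by compartments; the report mapping stores
// where the section's compartments start in a frame and how many there are.
struct Section
{
    std::vector<Segment> segments;
    uint32_t frameOffset;     // index of the first compartment in the frame
    uint16_t numCompartments; // compartments covering this section
};

// A neuron is a tree of sections originating at the soma.
struct Neuron
{
    GID gid;
    std::vector<Section> sections;
};

// A target is a logical group of neurons, identified by their GIDs.
using Target = std::vector<GID>;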

2.4.2 Key-Value Store Data Format

Key-value stores use different semantics and looser consistency guarantees than traditional parallel file systems. They store objects and address them using a unique key, instead of storing files in a hierarchical file system structure. The performance of a key-value store scales through concurrent access to many key-value pairs, which are randomly distributed over the available storage nodes. This fine-grained access pattern can be exploited by no longer optimizing the data layout for a given algorithm, but instead providing good performance for all access patterns through the use of sufficiently small key-value pairs.

For example, storing a matrix in row-major layout on a file system leads to excellent performance when reading a single row, but to terrible performance when reading a single column. By storing the matrix in many key-value pairs, each holding a small 2D region of elements, one no longer favors one access pattern and reduces the access overhead to the granularity of the key-value pairs. Using the same data layout on a traditional file system would not yield the same benefits, as such systems tend to be optimized for sequential reads and large block sizes. For example, the GPFS used on the Blue Brain supercomputer has a chunk size of one Megabyte, while the storage granularity of the key-value store is 4096 bytes.

Consequently, our main design goal for the data layout in the key-value store is to provide performance optimized for all access patterns. The output of Neuron simulations is two-dimensional, one scalar voltage value per time step and compartment: $v = f(t, c)$. Neuron writes all compartments simulated on each process to a single file, which is reassembled offline into a single file: $v_t = \bigcup_{i=1}^{mpiProcs} f_i(t, c)$. The existing serial version of the field potential voxelization code iterates "vertically" over a full time step at once: $v_t = f(t, c)$ for a fixed $t$. Other analysis algorithms analyze the time series "horizontally" by reading all time steps of a single compartment or neuron: $v_c = f(t, c)$ for a fixed $c$.

Figure 2.3 shows these different data access patterns. A single time step is called a CompartmentReportFrame, the whole data set a CompartmentReport. The size of a frame is determined by the target, that is, the list of GIDs selected when opening the report. The simulator writes a frame in multiple chunks, the local field potential code accesses one frame at a time, and spike trace generation accesses one compartment of all frames at a time.

Figure 2.3 – Data Layout and Access for a Compartment Report (compartments over time; per-MPI-process chunks, frame-wise LFP access, compartment-wise spike trace access)

The layout of compartments within a frame is determined by the simulator, that is, it does not correlate with the order of compartments in the neurons forming the circuit. To establish this correlation, a mapping defines the index of each section of each neuron within the compartment report frame, as well as the number of compartments covered by each section.

The optimal size of a value (the number of stored compartment values) depends on a large number of factors, including the implementation details of the key-value store and the block size of the underlying storage layer and hardware. Since there is no predefined access pattern, the optimal layout would be a randomized placement of single-value entities. However, access time and storage overhead per key-value pair require a larger granularity of the items stored in the key-value store. To group a number of values into a single key-value pair, the most obvious and simple organization is to store all compartment report values for a single neuron and a single time step in one key-value pair. Storing a 2D area of a few compartments over a few time steps would give even fairer access times for compartment-oriented access. We decided not to use this layout in this first implementation, as it would require the capability to append to existing values and a more complex implementation of the access layer. Furthermore, it would negatively impact the predominant vertical access pattern.

Figure 2.4 shows the key-value format of the compartment report stored in the key-value store. All key-value pairs are scoped with the corresponding name of the BlueConfig file and the target name of the report. The key of each data item is composed using that scope, the name of the structure, and, where applicable, the cell identifier and time step. An example key for the header is MyBlueConfigLayer5Header, and an example key for one time step of a neuron in this report is MyBlueConfigLayer5Frame42_17.

The metadata describing the circuit, simulation, available targets and compartment reports is still stored as a BlueConfig file on a traditional file system. It is created beforehand using other tools and does not contain any performance-sensitive data. The compartment report contains a header, the list of all reported neuron identifiers, the compartment description of each neuron, and report data for each neuron and time step, as shown in Figure 2.4.

Figure 2.4 – Key-Value Data Format (Header: magic, startTime, endTime, timestep, dunit, tunit; GIDSet [m]; Compartments: numCompartmentsPerSection; per-neuron, per-time-step Frame entries)
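The key composition can be illustrated with the short sketch below; the helper functions and the binary value layout are assumptions for illustration, and the interpretation of the two numbers in the example key (time step 42, GID 17) is inferred, not confirmed by the text.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Scope every key with the BlueConfig name and the report target.
std::string scope(const std::string& config, const std::string& target)
{
    return config + target;
}

// Key for one neuron's values at one time step,
// e.g. "MyBlueConfigLayer5Frame42_17".
std::string frameKey(const std::string& reportScope, size_t timestep,
                     uint32_t gid)
{
    return reportScope + "Frame" + std::to_string(timestep) + "_" +
           std::to_string(gid);
}

int main()
{
    // Stand-in for the key-value store: one value per neuron and time step.
    std::map<std::string, std::vector<char>> store;

    const std::string reportScope = scope("MyBlueConfig", "Layer5");
    const uint32_t gid = 17;
    const size_t timestep = 42;

    // Serialize the neuron's compartment voltages as raw floats.
    const std::vector<float> voltages(500, -65.f); // ~500 compartments/neuron
    std::vector<char> value(voltages.size() * sizeof(float));
    std::memcpy(value.data(), voltages.data(), value.size());

    store[frameKey(reportScope, timestep, gid)] = value; // write one entry

    // Reading one full frame ("vertical" access) loads one pair per neuron;
    // reading one neuron over time ("horizontal") loads one pair per step.
    const auto& read = store.at(frameKey(reportScope, timestep, gid));
    return read.empty() ? 1 : 0;
}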
