Active Storage for High Performance Computing: Electric Field Potential Computation on a Flash-Based Key-Value Store

Master Thesis

Presented January 2015 to
Dr. Darius Sidlauskas
Data-Intensive Applications and Systems Laboratory
School of Computer and Communications
École Polytechnique Fédérale de Lausanne

by

Stefan Eilemann
Faubourg de l'Hôpital 12
CH-2000 Neuchâtel

Acknowledgements

The research leading to this thesis was supported in part by the Blue Brain Project, the Swiss National Science Foundation under Grant 200020-129525, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project), and the King Abdullah University of Science and Technology (KAUST) through the KAUST-EPFL alliance for Neuro-Inspired High Performance Computing.

I would like to take the opportunity to thank the Blue Brain Project, in particular the visualization team, the DIAS Laboratory, IBM Research, and all the other reviewers for their support in developing and writing this thesis. I would also like to thank GitHub for providing an excellent infrastructure for hosting our projects at http://github.com/BlueBrain, http://github.com/HBPVis and http://github.com/Eyescale.

Lausanne, January 15, 2015

Stefan Eilemann


Abstract

High-performance computing is on the verge of a disruptive change from batch processing to becoming an interactive tool for domain specialists. Various factors contribute to this, most notably that supercomputers are facing a bandwidth wall rendering current workflows ineffective. As a corollary of interactive supercomputing, innovation cycles will be drastically shortened, leading to improved productivity.

This master thesis designs and implements a first application use case that leverages server-class persistent flash memory on supercomputers for robust, fast and interactive coupling of simulation, data analysis and visualization. It delivers a first module towards interactive supercomputing, which allows running HPC simulations to be introspected dynamically and interactively. Its key contributions are: a production-quality software stack demonstrating the capabilities of an active storage subsystem; new, reusable software components for parallel field voxelization; a relevant, functional use case for neuro-scientific research; and a novel burst buffer storage system tightly integrated with simulations on high performance computing systems.

Key words: Key-Value Store, HPC, Data Staging, Distributed Programming, Big Data, Volume Rendering, Image Processing, Interactive Supercomputing


Contents

1 Introduction
  1.1 Motivation
  1.2 Use Case and Workflow
  1.3 Hardware
  1.4 User Requirements
  1.5 Software Ecosystem
    1.5.1 Hardware Interfaces
    1.5.2 ITK
    1.5.3 Lunchbox
    1.5.4 Brion
    1.5.5 BBPSDK
    1.5.6 ZeroEQ
    1.5.7 Livre
  1.6 Uniform Resource Identifiers

2 Architecture
  2.1 Software Stack
  2.2 Scalability
  2.3 Reliability
  2.4 Data Model
    2.4.1 Neurons
    2.4.2 Key-Value Store Data Format
  2.5 Field Voxelization
  2.6 Livre Data Source
    2.6.1 Local Data Source
    2.6.2 Remote Data Source
  2.7 Brion Compartment Report Writer and Reader
  2.8 Parallel Volume Compositor

3 Implementation
  3.1 Methodology
  3.2 Field Voxelization Library
  3.3 Livre Local Data Source
  3.4 SKV
  3.5 Brion Compartment Report Writer and Reader
  3.6 ZeroEQ
    3.6.1 Extensible Vocabularies
    3.6.2 Endian-Safe Messaging
    3.6.3 Connection Broker
    3.6.4 Shared Receivers
  3.7 Livre Remote Data Source

4 Results
  4.1 Fivox
  4.2 Livre Remote Data Source

5 Conclusion and Future Work

A Project Plan and Execution

B Raw Benchmark Data
  B.1 Figure 4.1 and Figure 4.2
  B.2 Figure 4.3

List of Figures

1.1 Local Field Potential
1.2 Workflow
1.3 Hardware Setup
1.4 Livre Data Sources
2.1 Software Stack
2.2 Schematic Morphological Skeleton and 3D Mesh of a Neuron
2.3 Data Layout and Access for a Compartment Report
2.4 Key-Value Data Format
2.5 Fivox Class Diagram
2.6 Livre Data Source Class Diagram
2.7 Bricks in a Three-Level Octree
2.8 Brick Addressing
2.9 Sequence Diagram of the Livre Remote Data Source Network Protocol
2.10 Parallel Direct-Send Compositing
2.11 Sequence Diagram of the Parallel Direct-Send Compositor
3.1 ZeroEQ Connection Broker
3.2 ZeroEQ Shared Receiver
4.1 Sampling Performance of the Fivox Data Source
4.2 Relative Hardware Performance for Fivox
4.3 Throughput of the Remote Data Source
4.4 Memory Operations Performance on Intel Hardware
A.1 Thesis Project Plan

1 Introduction

1.1 Motivation

Simulations on high performance computing (HPC) systems are increasingly bandwidth limited, since disk IO bandwidth and available compute power follow different exponential growth curves, doubling roughly every 45 months and 18 months, respectively. This leads to an ever-increasing "bandwidth wall", which requires rethinking current workflows to take this growing imbalance into account. This thesis explores the use of fast burst buffers to store simulation results in a temporary storage pool, using a novel storage paradigm for fine-grained data access. This storage pool provides a faster data buffer than traditional filesystems, although with weaker consistency and reliability guarantees. It allows asynchronous, live coupling of simulation, data analysis and visualization within a heterogeneous supercomputer.

Flash has emerged in the last couple of years as the technology of choice for these burst buffers. It is often used as a new level in the storage hierarchy, being faster but more expensive than disks, and persistent, cheaper and slower than main memory. Using traditional filesystem semantics to access flash storage is however suboptimal, since parallel filesystems are historically optimized for disk access with different design parameters: traditional hard disks provide good streaming bandwidth, relatively poor random access performance and slow performance under many random IO accesses. Server-class flash devices, on the other hand, have negligible random access times and need many concurrent IO operations to utilize all channels to the individual memory chips.

Key-value stores keep data in a distributed, associative array of key-value pairs. They promise to exploit the characteristics of flash memory better than parallel file systems by issuing many concurrent key-value operations on relatively small objects, exposing the fine-grained, parallel access to the hardware. The key-value store can be used to store more transient data than the longer-term parallel file system, not as a transparent cache in the storage hierarchy but as an explicit short-term burst buffer for more detailed simulation results. Furthermore, the compute capabilities of the flash storage system can be used to perform local computation on the data, as explored by the map-reduce paradigm [DG08].


1.2 Use Case and Workflow

The motivation for this thesis is a concrete application use case of visualizing the local electric field potential, given by the Blue Brain Project [RAP+13]. Figure 1.1 shows a local field potential (LFP) volume rendering overlaid on some representative neurons of the simulated circuit. The LFP is derived from the electrical activity produced when simulating a neuronal circuit. Each neuron is discretized into hundreds of compartments, each simulated separately. The simulation yields a time series of electrical activity at a position in space for each compartment, which is used to aggregate the field potential in any given region.

Figure 1.1 – Local Field Potential

For this thesis, the use case is to concurrently write, read, voxelize and render the simulation values of a full-machine simulation through a nearby key-value store. A full-machine simulation on the current Blue Brain supercomputer uses 64k MPI processes to simulate about 200,000 neurons with around 500 compartments per neuron at a simulation rate of 1 to 10 hertz.

Figure 1.2 depicts the high-level workflow and the interaction between the major components. The user interacts with an interactive direct volume rendering application to explore the local field potential of an electrical simulation of a piece of reconstructed brain tissue. This volume renderer, introduced in Section 1.5.7, uses an out-of-core algorithm and runs the rendering loop asynchronously to updating the input data. It receives its data from a remote or local data source.

The remote data source, described in Section 2.6.2, runs in parallel on all storage nodes. It uses a new field voxelization library, described in Section 2.5, to sample the node-local simulation data into a volume. This field voxelization library exploits node-level parallelism using multi-threading. The partial volumes created on each node are merged into the final result using a parallel volume compositor, described in Section 2.8. The data is read through a compartment report reader (Section 3.5) accessing a key-value store (Section 3.4) running in parallel on all storage nodes. The data is deposited into the key-value store by a simulation running on the Blue Gene/Q supercomputer, or by using a converter based on a compartment report writer (Section 3.5). The simulation may run concurrently with the visualization.

Figure 1.2 – Workflow

1.3 Hardware

Figure 1.3 shows the hardware setup used to implement, debug, benchmark and deploy the software developed in this thesis. The supercomputer used by the Blue Brain Project is a four-rack Blue Gene/Q providing 4096 compute nodes with 16 processing cores and 16 GB of memory each. The compute nodes form 64 groups of 64 nodes, each group connected to one of 64 Blue Gene Active Storage (BGAS) nodes [SDK+14]. BGAS nodes are extended Blue Gene IO nodes, fully integrated into the Blue Gene interconnect [CEH+11]. Each BGAS node hosts a two-terabyte PCI Express flash device, and all nodes run one key-value store instance. Each of the 64 BGAS nodes is connected to a ten Gigabit Ethernet switch shared with the visualization cluster. The BGAS nodes use an IBM A2 processor, an in-order Power-based CPU running 16 cores at 1.6 GHz [A2112], and a standard Red Hat Enterprise Linux installation.

Figure 1.3 – Hardware Setup (4096 compute nodes, 64 BGAS nodes with flash, 40 visualization nodes with Tesla K20 GPUs and GTX 580 display nodes; CNK, RoQ, DSA and IB Verbs interconnects, 10 GE switches and a 10 Gbit/s WAN link between Lugano and Lausanne)

The compute nodes use the same A2 processor as the BGAS nodes, but with a simplified operating system kernel that uses static scheduling and no virtual memory to minimize operating system jitter and overhead for HPC simulation runs.

Each of the 40 visualization cluster nodes has two NVidia Tesla K20 cards used for the GPU-based direct volume rendering. Each node has two processor sockets with 8-core Intel Xeon E5-2670 v2 processors, 128 GB of memory and an internal InfiniBand FDR network. For display, a 24-Megapixel, 4x3 tiled display wall is used locally. The two sites are connected with a 10 Gbit/s WAN link. The display nodes use three NVidia GTX 580 cards to drive two Full-HD displays per GPU. The nodes are connected to a local ten Gigabit Ethernet switch as well as a local QDR InfiniBand network.

1.4 User Requirements

This section outlines the expectations and needs of computational neuroscientists for the software stack designed and implemented in this thesis.

The core of the implementation is a reliable, always-on key-value store running as a data storage service on the Blue Gene Active Storage (BGAS) nodes, which are tightly integrated with the supercomputer. The primary goal of this store is to allow asynchronous coupling and buffering of data between simulations and downstream analysis and visualization software. It has to provide good performance for random IO operations, which for the given use case means 4 to 40 gigabytes per second. Additionally, lightweight computation on the BGAS nodes allows reformatting and extraction of data to reduce the bandwidth needed to transport data through the system.

Since we expect this type of workflow to be used on future supercomputers, the whole stack must be portable to commodity hardware. This ensures portability to future hardware by properly encapsulating vendor-specific APIs today, and enables reuse of parts or the whole of the stack on other systems for different use cases. Due to the limited storage and compute capabilities of the BGAS nodes, the stack needs to be efficient and optimized for Blue Gene and BGAS hardware. This optimization includes efficient use of the available flash memory, without excessive replication for storage reliability.

1.5 Software Ecosystem

1.5.1 Hardware Interfaces

The compute nodes and BGAS nodes are all connected through the Blue Gene torus and are programmed using InfiniBand-like RDMA interfaces. MPI is only supported within a set of compute nodes or within a set of BGAS nodes, that is, one cannot set up an MPI session between compute and BGAS nodes. The flash devices use multiple channels to the actual storage chips and are also programmed through an RDMA interface. This enables multiple asynchronous operations to exploit the hardware parallelism with less overhead than a standard Linux block device. The Ethernet interfaces are programmed using standard TCP sockets, and the QDR InfiniBand interfaces can be programmed using TCP, RDMA or MPI. The GPUs in both visualization clusters are accessible through OpenGL or CUDA.


The tiled display wall is driven by the DisplayCluster software [JAW+12], heavily enhanced by the Blue Brain Project. The primary access method is a pixel streaming library [dcS14], which uses parallel compression to allow interactive rendering even at high resolutions.

1.5.2 ITK

ITK [JMIC13] is an open-source, cross-platform system containing an extensive set of software implementations for image analysis. It is the de-facto standard for developing image processing software in C++. ITK makes extensive use of templates to parameterize the dimensionality and pixel type of the images processed.

One of the reasons for using ITK was its existing, versatile image implementation, which serves as the data storage container for the volume data. Furthermore, the produced volume images can easily be wrapped and visualized with VTK. Algorithms are implemented as filters in ITK, which are chained together to form image processing pipelines. This flexibility, coupled with the sheer number of existing filters, was the final deciding factor for ITK. Implementing the field voxelization code as an image source in ITK allows the resulting system to be easily extended with new processing pipelines in the future.
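As an illustration of this pipeline style, the following minimal sketch chains an ITK image source through a filter into a file writer. itk::RandomImageSource merely stands in for the Fivox voxelization source described in Section 2.5, and the output file name is a placeholder.

#include <itkImage.h>
#include <itkImageFileWriter.h>
#include <itkRandomImageSource.h>
#include <itkRescaleIntensityImageFilter.h>

// 3D float image, the same kind of volume type used for the LFP data.
using Volume = itk::Image<float, 3>;

int main()
{
    // Stand-in source; the real system plugs the Fivox voxelization source
    // (an itk::ImageSource subclass, see Section 2.5) into this slot.
    auto source = itk::RandomImageSource<Volume>::New();

    // Filters are chained by connecting outputs to inputs.
    auto rescale = itk::RescaleIntensityImageFilter<Volume, Volume>::New();
    rescale->SetInput(source->GetOutput());
    rescale->SetOutputMinimum(0.f);
    rescale->SetOutputMaximum(1.f);

    auto writer = itk::ImageFileWriter<Volume>::New();
    writer->SetInput(rescale->GetOutput());
    writer->SetFileName("volume.mhd"); // placeholder output file

    writer->Update(); // pulls the data through the whole pipeline on demand
    return 0;
}

Swapping the stand-in source for any other itk::ImageSource subclass leaves the rest of the pipeline untouched, which is exactly the extensibility exploited by the field voxelization library.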

1.5.3 Lunchbox

Lunchbox is a C++ library for multi-threaded programming, providing utilities that are not available in the Boost libraries. It is used as a base library by most C++ software developed in the Blue Brain Project and provides the common, reusable utility classes.

1.5.4 Brion

Brion is the C++ IO library used to access the files produced and consumed in the Blue Brain Project. It implements readers and writers for most of the static and dynamic data. Brion provides fast, low-overhead read access to the major formats: BlueConfig, Circuit, CompartmentReport, Mesh, Morphology, SpikeReport, SynapseSummary, Synapse and Target files, as well as write access to CompartmentReport, Mesh and Morphology files.

1.5.5 BBPSDK

The Blue Brain Project SDK provides a data model on top of the raw file data provided by Brion. The data structures relevant for this thesis are explained in Section 2.4. It provides a programming interface to access the model and simulations of the neocortical column in the NEURON simulator [HES08]. This so-called Microcircuit API is used for research and application development in the Blue Brain Project.


1.5.6 ZeroEQ

Zero Event Queue (ZeroEQ) enables loose coupling of different applications through event-based messaging using a publish-subscribe mechanism. It is built on ZeroMQ [Akg13] as the transport layer, zeroconf networking [zer13] (also known as Apple Bonjour) for automatic discovery, and flatbuffers [fla14] as the serialization format. Applications using ZeroEQ are coupled in an automatic and robust fashion, that is, application crashes or exits are handled automatically and do not affect other running applications. ZeroEQ was designed to provide a stateless, simple and performant messaging layer between different applications.
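The following sketch uses the raw ZeroMQ C API to illustrate the publish-subscribe pattern that ZeroEQ builds on; it deliberately omits the zeroconf-based discovery, the flatbuffers serialization and the ZeroEQ API itself, and the endpoint and event payload are placeholder values.

#include <zmq.h>

#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>

int main(int argc, char** argv)
{
    void* context = zmq_ctx_new();
    const bool isPublisher = argc > 1 && std::strcmp(argv[1], "publisher") == 0;

    if (isPublisher)
    {
        // Publisher: bind and emit events; subscribers may join and leave at
        // any time without affecting the publisher.
        void* pub = zmq_socket(context, ZMQ_PUB);
        zmq_bind(pub, "tcp://*:5556"); // placeholder endpoint
        const char event[] = "camera::update"; // placeholder event payload
        for (int i = 0; i < 100; ++i)
        {
            zmq_send(pub, event, sizeof(event), 0);
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        zmq_close(pub);
    }
    else
    {
        // Subscriber: connect and receive; a crashed publisher simply means
        // no further events arrive, the subscriber keeps running.
        void* sub = zmq_socket(context, ZMQ_SUB);
        zmq_connect(sub, "tcp://localhost:5556");
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0); // no topic filter
        char buffer[256];
        const int size = zmq_recv(sub, buffer, sizeof(buffer), 0);
        if (size > 0)
            std::printf("received: %s\n", buffer);
        zmq_close(sub);
    }

    zmq_ctx_destroy(context);
    return 0;
}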

1.5.7 Livre

Livre (Large-scale Interactive Volume Rendering Engine) is an out-of-core direct volume rendering application. It uses an octree data structure to provide error-based level of detail (LOD) selection for data sets of any size, aiming for a constant rendering quality.

Livre can render different input data sets by using a plugin interface abstracting the access to the underlying data. Based on the given URI, a plugin is selected to read the data. Figure 1.4 shows the available data sources: a file-based source reading terabyte image stacks, two data sources computing on-the-fly volumes from watertight meshes or morphologies, and the two data sources developed in this thesis, performing on-the-fly LFP calculation (Section 2.6.1) and enabling remote execution of any of the data sources (Section 2.6.2).

Figure 1.4 – Livre Data Sources (Tuvok: terabyte image stacks; MeshVox: obj/ply meshes; NeuroVox: BBPSDK morphologies; Fivox: Blue Brain electrical simulation; Remote: ZeroEQ)

The Livre core algorithm requests bricks from these data sources depending on the current view frustum, the screen resolution, and the CPU and GPU memory capacity. An asynchronous rendering pipeline separates the rendering algorithm into a data, an upload and a render thread. The render thread constantly updates the LOD tree to request new bricks based on the current view frustum, and then renders the available textures. The data thread is traditionally responsible for loading, and potentially decompressing, the requested data from disk. In our implementation the data is generated by on-the-fly voxelization of the simulation report. The volume data is then sent to the GPU upload thread, which uploads the raw data from memory into an OpenGL texture on the GPU. The active texture identifiers are sent to the render thread, which picks the appropriate texture bricks for rendering. All three threads run asynchronously, which allows interactive frame rates for rendering, while the data loading might be delayed by a few seconds due to poor IO bandwidth or computational overhead. The data and upload threads maintain caches to keep as many bricks and textures in CPU and GPU memory as the available hardware allows.

The Equalizer parallel rendering framework [EMP09] is used by Livre for sort-first parallel rendering onto multi-tile display walls. In this mode, each GPU render thread runs an asynchronous GPU upload thread, and each node runs a shared data loader thread.
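The URI-based plugin selection can be sketched as follows; the interface, the registry and the readBrick signature are simplified assumptions for illustration and do not reproduce the actual Livre plugin API.

#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal data source interface: a plugin returns the voxel data of one
// octree brick, identified by its LOD level and position.
struct DataSource
{
    virtual ~DataSource() = default;
    virtual std::vector<uint8_t> readBrick(uint32_t level, uint32_t x,
                                           uint32_t y, uint32_t z) = 0;
};

// Each plugin registers a factory under its URI scheme.
using Factory = std::function<std::unique_ptr<DataSource>(const std::string&)>;

std::map<std::string, Factory>& registry()
{
    static std::map<std::string, Factory> factories;
    return factories;
}

// Select the plugin based on the scheme of the given URI; the plugin itself
// interprets the rest of the URI (path, query parameters, ...).
std::unique_ptr<DataSource> createDataSource(const std::string& uri)
{
    const std::string scheme = uri.substr(0, uri.find("://"));
    const auto it = registry().find(scheme);
    if (it == registry().end())
        throw std::runtime_error("no data source plugin for " + uri);
    return it->second(uri);
}

With such a registry, a call like createDataSource("fivox://...") would resolve to the LFP data source and a "remote://..." URI to the proxy for the remote service; the concrete URI schemes shown here are illustrative examples, not the documented Livre schemes.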

1.6 Uniform Resource Identifiers

Various software components in this thesis use Uniform Resource Identifiers (URI) to select plugins or to connect to remote systems. The abstraction interface to different key-value stores, described in Section 2.7, uses URIs to identify the backend implementation and location. Data sources in the volume renderer, including the local and remote implementations, are selected and addressed using URIs, as described in Section 1.5.7 and Section 2.6.2. The ZeroEQ implementation, described in Section 3.6, uses URIs to identify publishers and their data schema. The URI format is described in the C++ standards committee proposal N3420 [MB12]; an example URI consists of the following parts:

URI part    Range    Value
scheme      [a, b)   http
user_info   [c, d)   bob
host        (d, e)   www.example.com
port        (e, f)   8080
path        [f, g)   /path/
query       (g, h)   key=value
fragment    (h, i)   fragment

http://bob@www.example.com:8080/path/?key=value#fragment
^   ^  ^  ^               ^    ^     ^         ^        ^
a   b  c  d               e    f     g         h        i
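The decomposition above can be reproduced with the generic splitting regular expression from RFC 3986, Appendix B, as in the following self-contained sketch; user_info, host and port would be extracted from the authority component in a second step.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string uri =
        "http://bob@www.example.com:8080/path/?key=value#fragment";

    // Generic URI splitter from RFC 3986, Appendix B.
    const std::regex rfc3986(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (std::regex_match(uri, m, rfc3986))
    {
        std::cout << "scheme:    " << m[2] << "\n"  // http
                  << "authority: " << m[4] << "\n"  // bob@www.example.com:8080
                  << "path:      " << m[5] << "\n"  // /path/
                  << "query:     " << m[7] << "\n"  // key=value
                  << "fragment:  " << m[9] << "\n"; // fragment
    }
    return 0;
}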


2 Architecture

2.1 Software Stack

The implementation of this thesis is embedded in the software ecosystem described in the previous section, which made it possible to implement most of the desired use case in the limited time available. This section introduces how the new and changed software components form the complete system within the scope of the larger ecosystem. The next chapter explains the implementation.

Figure 2.1 – Software Stack (Livre, fivox::DataSource, remote::DataSource, fivox::ImageSource, fivox::Compositor, bbp::report_converter, bbp::CompartmentReportReader/Writer, ZeroEQ with zmq, zeroconf and flatbuffers, itk::ImageSource, brion::CompartmentReport, lunchbox::PersistentMap, and the HDF5, Bin, LevelDB and SKV backends; the legend distinguishes new, BBP and third-party software)

Figure 2.1 shows how the work of this thesis integrates into the existing software stack. Livre, an existing large-scale interactive volume rendering engine (Section 1.5.7), is extended by two data sources: a local data source sampling discrete electrical events into regular 3D volumes (Section 2.6.1), and a remote data source providing access to data sources on different systems (Section 2.6.2), including the newly created local data source. The local data source allows on-the-fly electric field potential visualization of simulation reports stored in local files, and is built on a new, parallel field voxelization kernel (Section 2.5) implemented as an ITK image source (Section 1.5.2). The remote data source is built using an improved version of ZeroEQ (Section 3.6), which connects the data source proxy running in the volume renderer with the remote data source service process. The service process uses a parallel volume compositor (Section 2.8) and serves all data sources available in Livre.

The field voxelization kernel receives its data from a BBPSDK compartment report reader (Section 1.5.5), which is a high-level abstraction using the Brion low-level compartment report access class (Section 1.5.4) to read the data. A standalone report converter tool is used to convert the data from a file into the key-value store. A new plugin to the Brion compartment report implements the key-value storage backend, which accesses different key-value stores using an abstraction interface in Lunchbox (Section 3.5) and a data format (Section 2.4) optimized for key-value stores. The production environment will run SKV as the key-value store. SKV is developed in the context of a research collaboration with IBM (Section 3.4).

2.2 Scalability

The performance-critical components are designed to be horizontally and vertically scalable, as applicable. In the following we describe their scalability within the complete system.

SKV is a horizontally scalable service, which is exploited by the Lunchbox abstraction interface through many concurrent, asynchronous requests distributed over the SKV servers. Adding more servers increases storage capacity, network bandwidth, IO operation throughput and access bandwidth.

The field voxelization kernel is a vertically scalable component which uses all available CPU cores on a given system. Using faster clocked processors or more processing cores increases the processing speed for generating LFP volumes.

The remote data source service provides horizontal scalability by instantiating one field voxelization kernel per node. This also provides horizontally scalable access to SKV, as each instance reads a subset of the report to be voxelized. The parallel volume compositor provides a scalable algorithm to combine the partial results. Adding more nodes to the service increases the throughput to the SKV servers and the processing speed for generating LFP volumes.

Last, but not least, Livre scales horizontally by using multiple nodes and GPUs for sort-first scalable rendering. Adding more rendering nodes increases the frame rate and quality of the interactive volume rendering.

We expect the whole system to be vertically scalable to achieve faster voxelization performance, which is the main computational bottleneck in our current use case. Horizontal scalability, i.e., adding more storage nodes to the system, scales the available storage size as well as the overall performance, since the storage nodes are actively used for computation.
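As an illustration of this access pattern, the sketch below issues many key-value requests concurrently so that their latencies overlap instead of adding up; the fetch function is a stand-in for the actual Lunchbox/SKV interface, whose API is not reproduced here, and the key names are illustrative.

#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a single key-value fetch; the real code would call the
// Lunchbox key-value abstraction backed by SKV. The sleep models latency.
std::string fetch(const std::string& key)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    return "value-of-" + key;
}

int main()
{
    // Issue all requests up front; the keys are distributed over the storage
    // servers, so the individual request latencies overlap.
    std::vector<std::future<std::string>> pending;
    for (int gid = 0; gid < 64; ++gid)
        pending.emplace_back(std::async(std::launch::async, fetch,
                                        "Frame42_" + std::to_string(gid)));

    // Gather the results once all requests are in flight.
    std::vector<std::string> values;
    for (auto& result : pending)
        values.emplace_back(result.get());

    return values.size() == 64 ? 0 : 1;
}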

2.3 Reliability

The multiple software components running on a large-scale distributed infrastructure composed of many heterogeneous nodes create a system with inherent complexity, which requires robust handling of resource and software failures to produce a usable infrastructure.

In the following we outline the steps taken to increase the reliability of the system in the presence of failures. Reliability is achieved by two means: isolating components from each other by implementing them as separate services, and handling failures within a horizontally scalable service. The four main components in our system are separate applications: data storage in SKV, LFP calculation in the remote data source service, rendering in Livre, and writing of data from the HPC simulation.

Reliability in SKV, and in the communication to SKV, is part of the SKV project and therefore outside the scope of this thesis. SKV does not currently implement any reliability measures.

The remote data source service is designed to tolerate runtime failures by using stateless ZeroEQ events, constant runtime reconfiguration of the compositing network and appropriate timeouts during compositing. The communication protocol to the service also uses ZeroEQ, and the remote data source in Livre has appropriate timeouts for finding a remote data source and for receiving bricks after a request has been made.

Livre scales horizontally by using the Equalizer parallel rendering framework. Equalizer supports reliability transparently for its applications by detecting resource failures and removing these resources from the runtime configuration.

The HPC simulation is typically an MPI job, and therefore not failure tolerant. It is decoupled from the rest of the system through the SKV-based burst buffer.

2.4 Data Model

2.4.1 Neurons

A Neuron is a single simulated nerve cell, connected with thousands of other neurons. It is constructed from a morphological skeleton. This skeleton has a tree-like structure originating at the Soma, the cell body of the neuron. Each neuron has a unique GID (global identifier). A Target is a set of GIDs identifying a logical group of neurons, for example all neurons of a given type.

The branches of the tree are formed by Sections, which are subdivided into smaller pieces. They start at the soma and eventually branch into multiple sections in a tree-like structure, as shown in Figure 2.2. During the reconstruction from real neurons, each section is formed by a number of Segments. Each segment has two endpoints in ℝ³, as well as a radius at each endpoint. These segments consequently represent tube-lets in 3D space.



Figure 2.2 – Schematic Morphological Skeleton and 3D Mesh of a Neuron (soma, sections, segments and compartments).

For the simulation, each section is decomposed into m Compartments, which have an n : m mapping to segments. For each compartment, various observables are simulated and may be saved in a CompartmentReport.
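The relationships described above can be summarized in a few plain data structures; this is a simplified sketch of the concepts, not the actual Brion or BBPSDK types, and the field names are illustrative.

#include <array>
#include <cstdint>
#include <vector>

using GID = uint32_t; // unique global identifier of a neuron

// A segment is a tube-let in 3D space: two endpoints, each with a radius.
struct Segment
{
    std::array<float, 3> start, end;
    float startRadius, endRadius;
};

// A section is one branch of the morphological skeleton, built from segments.
// For the simulation it is covered by compartments; the report mapping stores
// where the section's compartments start in a frame and how many there are.
struct Section
{
    std::vector<Segment> segments;
    uint32_t frameOffset;     // index of the first compartment in the frame
    uint16_t numCompartments; // compartments covering this section
};

// A neuron is a tree of sections originating at the soma.
struct Neuron
{
    GID gid;
    std::vector<Section> sections;
};

// A target is a logical group of neurons, identified by their GIDs.
using Target = std::vector<GID>;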

2.4.2 Key-Value Store Data Format

Key-value stores use different semantics and looser consistency guarantees than traditional parallel file systems. They store objects and address them using a unique key, instead of storing files in a hierarchical file system structure. The performance of a key-value store scales through concurrent access to many key-value pairs, which are randomly distributed over the available storage nodes. This fine-grained access pattern can be exploited by no longer optimizing the data layout for a given algorithm, but instead providing good performance for all access patterns through the use of sufficiently small key-value pairs.

For example, storing a matrix in row-major layout on a file system leads to excellent performance when reading a single row, but to terrible performance when reading a single column. By storing the matrix in many key-value pairs, each holding a small 2D region of elements, one no longer favors one access pattern and reduces the access overhead to the granularity of the key-value pairs. Using the same data layout on a traditional file system would not yield the same benefits, as such systems tend to be optimized for sequential reads and large block sizes. For example, the GPFS used on the Blue Brain supercomputer has a chunk size of one Megabyte, while the storage granularity of the key-value store is 4096 bytes.

Consequently, our main design goal for the data layout in the key-value store is to provide performance optimized for all access patterns. The output of Neuron simulations is two-dimensional, one scalar voltage value per time step and compartment: $v = f(t, c)$. Neuron writes all compartments simulated on each process to a single file, which is reassembled offline into a single file: $v_t = \bigcup_{i=1}^{mpiProcs} f_i(t, c)$. The existing serial version of the field potential voxelization code iterates "vertically" over a full time step at once: $v_t = f(t, c)$ for a fixed $t$. Other analysis algorithms analyze the time series "horizontally" by reading all time steps of a single compartment or neuron: $v_c = f(t, c)$ for a fixed $c$.

Figure 2.3 shows these different data access patterns. A single time step is called a CompartmentReportFrame, the whole data set a CompartmentReport. The size of a frame is determined by the target, that is, the list of GIDs selected when opening the report. The simulator writes a frame in multiple chunks, the local field potential code accesses one frame at a time, and spike trace generation accesses one compartment of all frames at a time.

Figure 2.3 – Data Layout and Access for a Compartment Report (compartments over time; per-MPI-process chunks, frame-wise LFP access, compartment-wise spike trace access)

The layout of compartments within a frame is determined by the simulator, that is, it does not correlate with the order of compartments in the neurons forming the circuit. To establish this correlation, a mapping defines the index of each section of each neuron within the compartment report frame, as well as the number of compartments covered by each section.

The optimal size of a value (the number of stored compartment values) depends on a large number of factors, including the implementation details of the key-value store and the block size of the underlying storage layer and hardware. Since there is no predefined access pattern, the optimal layout would be a randomized placement of single-value entities. However, access time and storage overhead per key-value pair require a larger granularity of the items stored in the key-value store. To group a number of values into a single key-value pair, the most obvious and simple organization is to store all compartment report values for a single neuron and a single time step in one key-value pair. Storing a 2D area of a few compartments over a few time steps would give even fairer access times for compartment-oriented access. We decided not to use this layout in this first implementation, as it would require the capability to append to existing values and a more complex implementation of the access layer. Furthermore, it would negatively impact the predominant vertical access pattern.

Figure 2.4 shows the key-value format of the compartment report stored in the key-value store. All key-value pairs are scoped with the corresponding name of the BlueConfig file and the target name of the report. The key of each data item is composed using that scope, the name of the structure, and, where applicable, the cell identifier and time step. An example key for the header is MyBlueConfigLayer5Header, and an example key for one time step of a neuron in this report is MyBlueConfigLayer5Frame42_17.

The metadata describing the circuit, simulation, available targets and compartment reports is still stored as a BlueConfig file on a traditional file system. It is created beforehand using other tools and does not contain any performance-sensitive data. The compartment report contains a header, the list of all reported neuron identifiers, the compartment description of each neuron, and report data for each neuron and time step, as shown in Figure 2.4.

Figure 2.4 – Key-Value Data Format (Header: magic, startTime, endTime, timestep, dunit, tunit; GIDSet [m]; Compartments: numCompartmentsPerSection; per-neuron, per-time-step Frame entries)
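The key composition can be illustrated with the short sketch below; the helper functions and the binary value layout are assumptions for illustration, and the interpretation of the two numbers in the example key (time step 42, GID 17) is inferred, not confirmed by the text.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Scope every key with the BlueConfig name and the report target.
std::string scope(const std::string& config, const std::string& target)
{
    return config + target;
}

// Key for one neuron's values at one time step,
// e.g. "MyBlueConfigLayer5Frame42_17".
std::string frameKey(const std::string& reportScope, size_t timestep,
                     uint32_t gid)
{
    return reportScope + "Frame" + std::to_string(timestep) + "_" +
           std::to_string(gid);
}

int main()
{
    // Stand-in for the key-value store: one value per neuron and time step.
    std::map<std::string, std::vector<char>> store;

    const std::string reportScope = scope("MyBlueConfig", "Layer5");
    const uint32_t gid = 17;
    const size_t timestep = 42;

    // Serialize the neuron's compartment voltages as raw floats.
    const std::vector<float> voltages(500, -65.f); // ~500 compartments/neuron
    std::vector<char> value(voltages.size() * sizeof(float));
    std::memcpy(value.data(), voltages.data(), value.size());

    store[frameKey(reportScope, timestep, gid)] = value; // write one entry

    // Reading one full frame ("vertical" access) loads one pair per neuron;
    // reading one neuron over time ("horizontal") loads one pair per step.
    const auto& read = store.at(frameKey(reportScope, timestep, gid));
    return read.empty() ? 1 : 0;
}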
