Category: Large Scale Data Visualization & In-Situ Graphics - LI01 Poster

P4235

Contact: Oliver Jato, [email protected]

Memory Management for Interactive Rendering of Large and Semitransparent Volumes

Oliver Jato, André Hinkenjann

Introduction

Continuous improvements in data acquisition techniques and rising computational power create a need to handle very large data sets efficiently in interactive volume rendering. The key ingredients are spatial decomposition, simplification, efficient memory management strategies and hardware acceleration. The renderer Volt uses CUDA to accelerate ray integration in direct volume rendering. Sophisticated memory management enables low-opacity visualizations of large volumes in real time.

Brick cache management

The brick cache is managed by a module with a dedicated paging kernel. Two lists are updated during cut refinement: a list of nodes whose bricks should be added to the cache and a list of nodes whose cache blocks should be freed. Each node whose brick is added receives a free 3D cache block index in its Mapping instance. The index is linked to the brick's data by a device pointer that maps the pinned host memory into the device address space.

CutManager
+mDeviceUploadQueue : queue
+mDeviceFreeQueue : queue
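The bookkeeping described above can be sketched on the host side as follows. This is a minimal illustration, not Volt's implementation: the class `BlockAllocator`, the integer node ids, the linear (rather than 3D) block indices and the choice to process the free list before the upload list are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: hands out free cache blocks to nodes queued for upload.
class BlockAllocator {
public:
    explicit BlockAllocator(std::uint32_t numBlocks) {
        assert(numBlocks > 0);
        for (std::uint32_t i = 0; i < numBlocks; ++i) freeBlocks_.push(i);
    }

    // Releases the blocks of all nodes in freeQueue, then assigns free blocks
    // to nodes in uploadQueue until either the queue or the cache is exhausted.
    // Returns the (node id, block index) pairs assigned in this update.
    std::vector<std::pair<int, std::uint32_t>> update(std::queue<int>& freeQueue,
                                                      std::queue<int>& uploadQueue) {
        while (!freeQueue.empty()) {
            auto it = resident_.find(freeQueue.front());
            if (it != resident_.end()) {
                freeBlocks_.push(it->second);
                resident_.erase(it);
            }
            freeQueue.pop();
        }
        std::vector<std::pair<int, std::uint32_t>> uploads;
        while (!uploadQueue.empty() && !freeBlocks_.empty()) {
            const int node = uploadQueue.front();
            uploadQueue.pop();
            const std::uint32_t block = freeBlocks_.front();
            freeBlocks_.pop();
            resident_[node] = block;
            uploads.emplace_back(node, block);
        }
        return uploads;
    }

    std::size_t freeCount() const { return freeBlocks_.size(); }

private:
    std::queue<std::uint32_t> freeBlocks_;             // unused cache blocks
    std::unordered_map<int, std::uint32_t> resident_;  // node id -> cache block
};
```

Nodes that do not get a block stay in the upload queue, so a later update can serve them once other blocks have been freed.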

BrickedVolumeOctreeNode

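\[
I(b) = \frac{\mathrm{diagonal}(b)}{\mathrm{diagonal}(b) + \mathrm{distance}(b, \mathrm{Eye})} \tag{1}
\]

\[
\frac{\mathrm{diagonal}(b)}{\mathrm{diagonal}(b) + \mathrm{distance}(b, \mathrm{Eye})} \;\ge\; \frac{\frac{\mathrm{diagonal}(b)}{2}}{\frac{\mathrm{diagonal}(b)}{2} + \mathrm{distance}(b, \mathrm{Eye})} \tag{2}
\]

\[
ED(b) = E(b) - \frac{\sum_{b_i \in V(b)} E(b_i)}{|V(b)|} \tag{3}
\]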

[Figure: example tree with node type codes 00, 01, 10, 11 and array indices 1 to 12]

Results

Tree
+innerNodeIdxMask : unsigned int = 0x00000000
+dataLeafIdxMask : unsigned int = 0x40000000
+emptyLeafIdxMask : unsigned int = 0x80000000

The index of a cache block is encoded in 30 bits of a 32-bit value. The remaining 2 bits encode the node's status: a leaf with a brick, an empty leaf or an inner node. For an inner node, the 30 bits instead hold the index of its first child. This encoding is used to build a bucket PR octree (see [Sam06]) that serves as a hierarchical page table in linear device memory. Since this representation of an incomplete octree is very lean and the nodes are stored in breadth-first order, the device's caches are well utilised, which allows efficient stackless traversal by the rendering kernel, similar to the idea of kd-restart [FS05].
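As a rough illustration of this encoding, the sketch below combines a 2-bit status with a 30-bit payload. The mask values are the ones listed for the Tree and Index classes; the function names, the `NodeType` enum and `kPayloadMask` are hypothetical and only serve the example.

```cpp
#include <cassert>
#include <cstdint>

// Status masks as listed in the class diagram (top two bits of the value).
constexpr std::uint32_t kInnerNodeIdxMask = 0x00000000u;
constexpr std::uint32_t kDataLeafIdxMask  = 0x40000000u;
constexpr std::uint32_t kEmptyLeafIdxMask = 0x80000000u;
constexpr std::uint32_t kInvalidIdxMask   = 0xC0000000u;
// Hypothetical name for the low 30 bits that carry the index.
constexpr std::uint32_t kPayloadMask      = 0x3FFFFFFFu;

enum class NodeType : std::uint32_t { Inner = 0, DataLeaf = 1, EmptyLeaf = 2, Invalid = 3 };

// Combines a status mask with a 30-bit payload (first-child index or block index).
std::uint32_t makeEntry(std::uint32_t statusMask, std::uint32_t payload) {
    assert((payload & ~kPayloadMask) == 0u);  // payload must fit into 30 bits
    return statusMask | payload;
}

// Extracts the node type from the two most significant bits.
NodeType typeOf(std::uint32_t entry) {
    return static_cast<NodeType>(entry >> 30);
}

// Extracts the 30-bit payload.
std::uint32_t payloadOf(std::uint32_t entry) {
    return entry & kPayloadMask;
}
```

Because both kinds of payload share the same 30 bits, one 32-bit word per node is enough for the whole page table.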

Node type codes: 00 inner node, 01 leaf with brick, 10 empty leaf, 11 invalid node / index

-mPages : queue
-mTasks : Mapping*
-mNumPages : Vector3i
-mPageSize : Vector3i
-mPagingLimit : unsigned int

[Figure labels: node type; leaf with 3D index of the brick in the brick cache]

Pager

The split-and-collapse algorithm [CF11] is used to combine out-of-core rendering with pre-paging. The complete working set of bricks is built and transferred before the rendering of a frame starts. A cut through the octree is progressively refined by removing cut members and adding their visible children or their parents, with the aim of reducing the global error. The error E of a brick b in the current scene is determined by its importance I (1). Since the global error must never increase when a node is split, it is ensured that the importance of a child can never be greater than the importance of its parent (2). The error distance ED between a node's error and the average error of its visible children V(b) is then used as its priority in the cut (3).
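The three quantities can be written down in a few lines. The sketch below assumes that the importance is the brick diagonal divided by the sum of diagonal and eye distance, that the bound for a child uses half the diagonal at the same distance, and that the per-brick errors are already available as numbers. The function names are illustrative and not taken from the renderer.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Importance (1): brick diagonal relative to diagonal plus eye distance.
double importance(double diagonal, double distanceToEye) {
    assert(diagonal > 0.0 && distanceToEye >= 0.0);
    return diagonal / (diagonal + distanceToEye);
}

// Right-hand side of inequality (2): importance of a brick with half the
// diagonal at the same eye distance.
double childImportanceBound(double diagonal, double distanceToEye) {
    return importance(0.5 * diagonal, distanceToEye);
}

// Error distance (3): the node's error minus the mean error of its visible
// children. The errors themselves are taken as given inputs here.
double errorDistance(double nodeError, const std::vector<double>& visibleChildErrors) {
    assert(!visibleChildErrors.empty());
    const double sum = std::accumulate(visibleChildErrors.begin(),
                                       visibleChildErrors.end(), 0.0);
    return nodeError - sum / static_cast<double>(visibleChildErrors.size());
}
```

A node with a large error distance gains the most from being split, which is why the value is suitable as a priority in the cut.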

[Figure label: inner node with index of first child in array]

T : BrickedVolumeOctreeNode

The GPU's memory is not sufficient to hold a brick cache that contains all visible data. Streaming bricks to the device on page faults [CNLE09] would enable high-quality visualization, but in low-opacity visualizations it would also hinder rendering performance because the rendering kernel would be interrupted too often.

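Example page table array (node type bits and payload per entry):

index  type  payload
0      00    1
1      00    5
2      00    9
3      01    xyz
4      10    -
5-12   01    xyz

Type 00: inner node with index of first child in array. Type 01: leaf with 3D index (xyz) of the brick in the brick cache. Type 10: empty leaf.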
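To make the example array concrete, the following sketch rebuilds its 13 entries and follows child slots from the root by adding the slot number to the stored first-child index. It is index arithmetic only, not the ray-driven traversal of the rendering kernel. The example has four children per inner node; the function names and the placeholder brick index are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kInner       = 0x00000000u;  // type bits 00
constexpr std::uint32_t kDataLeaf    = 0x40000000u;  // type bits 01
constexpr std::uint32_t kEmptyLeaf   = 0x80000000u;  // type bits 10
constexpr std::uint32_t kPayloadMask = 0x3FFFFFFFu;  // low 30 bits

// Rebuilds the example: entries 0..2 are inner nodes with first children at
// 1, 5 and 9, entry 3 is a data leaf, entry 4 is an empty leaf and entries
// 5..12 are data leaves. brickIndex stands in for the "xyz" payload.
std::vector<std::uint32_t> buildExampleTable(std::uint32_t brickIndex) {
    std::vector<std::uint32_t> table;
    table.push_back(kInner | 1u);
    table.push_back(kInner | 5u);
    table.push_back(kInner | 9u);
    table.push_back(kDataLeaf | brickIndex);
    table.push_back(kEmptyLeaf);
    for (int i = 0; i < 8; ++i) table.push_back(kDataLeaf | brickIndex);
    return table;
}

// Follows the given child slots from the root and returns the array index of
// the node that is reached. The descent stops early at a leaf.
std::size_t descend(const std::vector<std::uint32_t>& table,
                    const std::vector<unsigned>& childSlots) {
    std::size_t node = 0;
    for (unsigned slot : childSlots) {
        const std::uint32_t entry = table[node];
        if ((entry >> 30) != 0u) break;  // not an inner node: leaf reached
        node = (entry & kPayloadMask) + slot;
        assert(node < table.size());
    }
    return node;
}
```

Since siblings are stored contiguously, a single first-child index per inner node is sufficient and no per-child pointers are needed.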

PageInfo
+doUpload : bool
+inDeviceMem : bool
+inFreeQueue : bool
+inUploadQueue : bool

Figure: Time step of a global simulation of ocean temperatures (3602 × 2394 × 80 voxels, data set kindly provided by [Deu13]) and a synthetic test data set.

Defining the working set


Reel::Utils::CuMemVirt

Index
+xMask : unsigned int = 0x3FFC0000
+yMask : unsigned int = 0x3FF80
+zMask : unsigned int = 0x7F
+xShift : unsigned int = 18
+yShift : unsigned int = 7
+idxMaskLShift : unsigned int = 2
+idxMaskRShift : unsigned int = 30
+invalidIdxMask : unsigned int = 0xC0000000
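The masks and shifts of the Index class suggest how a 3D block index is packed into the 30 payload bits. The following sketch uses exactly these constants and follows the axis naming of the class listing; the struct `BlockCoord` and the two function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Constants as listed for the Index class.
constexpr std::uint32_t kXMask = 0x3FFC0000u;  // 12 bits, values 0..4095
constexpr std::uint32_t kYMask = 0x0003FF80u;  // 11 bits, values 0..2047
constexpr std::uint32_t kZMask = 0x0000007Fu;  //  7 bits, values 0..127
constexpr unsigned kXShift = 18;
constexpr unsigned kYShift = 7;

// Hypothetical container for a 3D cache block index.
struct BlockCoord {
    std::uint32_t x, y, z;
};

// Packs the three coordinates into the low 30 bits of a 32-bit value.
std::uint32_t packBlockIndex(BlockCoord c) {
    assert(c.x < 4096u && c.y < 2048u && c.z < 128u);
    return (c.x << kXShift) | (c.y << kYShift) | c.z;
}

// Recovers the three coordinates; the two status bits are ignored.
BlockCoord unpackBlockIndex(std::uint32_t value) {
    return { (value & kXMask) >> kXShift, (value & kYMask) >> kYShift, value & kZMask };
}
```

The three masks together cover bits 0 to 29, so the two most significant bits remain free for the node status.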

The volume is decomposed into uniformly sized bricks. The two outer layers of voxels of each brick replicate voxels of the neighboring bricks, so that values can be interpolated and gradients calculated independently per brick. The bricks are recursively averaged and organized in a branch-on-need octree [WVG92], which forms a hierarchical multiresolution data structure with a minimal number of leaves.
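The cost of the replicated border can be estimated with simple arithmetic. In the sketch below, `border` is the number of replicated voxel layers on each side of a brick; this parameter and both function names are assumptions, since the exact brick layout is not specified here.

```cpp
#include <cassert>
#include <cstdint>

// Number of bricks needed along one axis when each brick contributes
// (brickEdge - 2 * border) unique voxels along that axis.
std::uint32_t bricksAlongAxis(std::uint32_t volumeExtent, std::uint32_t brickEdge,
                              std::uint32_t border) {
    assert(brickEdge > 2 * border);
    const std::uint32_t payload = brickEdge - 2 * border;
    return (volumeExtent + payload - 1) / payload;  // ceiling division
}

// Fraction of a brick's voxels that are unique, i.e. not replicated borders.
double payloadFraction(std::uint32_t brickEdge, std::uint32_t border) {
    assert(brickEdge > 2 * border);
    const double p = static_cast<double>(brickEdge - 2 * border) /
                     static_cast<double>(brickEdge);
    return p * p * p;
}
```

With a fixed border width, smaller bricks spend a larger share of their voxels on replication, so their payload fraction is lower.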


Mapping
+mmapPtr : CUdeviceptr
+pageIdx : unsigned int

Basic host-side data structure

Measurements were carried out with synthetic test data sets of sizes from 1024^3 up to 2560^3 voxels on a GTX 680 GPU. Frame rates range from 3 to 9 Hz and show no notable dependence on the volume's size. The execution time of the rendering kernel tends to rise with increasing brick size. Since the octree is shallower and traversed less often with larger bricks, the coarser empty-space skipping seems to cancel out the faster octree traversal. The impact of octree traversal is thus believed to be very low, but this still has to be confirmed by measurements. The transfer rate of the paging kernel is close to the maximum. Note that the net transfer rate decreases with smaller bricks, since the redundant voxels have to be excluded from the payload. Future work will focus on speeding up the rendering kernel. In addition, occluded bricks should be detected, and bricks that are more detailed than necessary should not be added to the cut.

References

[CF11] CARMONA, Rhadamés ; FROEHLICH, Bernd: Error-controlled real-time cut updates for multi-resolution volume rendering. In: Computers & Graphics 35 (2011), No. 4, pp. 931-944

[CNLE09] CRASSIN, Cyril ; NEYRET, Fabrice ; LEFEBVRE, Sylvain ; EISEMANN, Elmar: GigaVoxels: ray-guided streaming for efficient and detailed voxel rendering. In: I3D '09: Proceedings of the 2009 symposium on Interactive 3D graphics and games. New York, NY, USA : ACM, 2009, pp. 15-22

[Deu13] DEUTSCHES KLIMARECHENZENTRUM GMBH: http://www.dkrz.de

[FS05] FOLEY, Tim ; SUGERMAN, Jeremy: KD-tree acceleration, 2005 (HWWS '05), pp. 15-22

[Sam06] SAMET, Hanan: Foundations of Multidimensional and Metric Data Structures. San Francisco, CA, USA : Morgan Kaufmann, 2006

[WVG92] WILHELMS, Jane ; VAN GELDER, Allen: Octrees for faster isosurface generation. In: ACM Trans. Graph. 11 (1992), July, No. 3, pp. 201-227

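[Figure: bit layout of the 32-bit index, from bit 31 down to bit 0. Bits 31-30: node type. Bits 29-0: either a child index ∈ [0, 2^30) or a block index with three coordinate fields in bits 29-18, 17-7 and 6-0, labeled z ∈ [0, 4096), y ∈ [0, 2048) and x ∈ [0, 128) in the figure.]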

A surface reference to the 3D texture of the brick cache allows the paging kernel to update arbitrary blocks. The maximum dimensions of a surface reference differ per axis, which is reflected in the non-uniform mapping of bits to coordinate axes in the index. If a brick is to be uploaded, its Mapping structure is copied to the paging kernel's work list. The threads of a thread block transfer neighboring voxels of a brick to the cache until the brick is complete. The texture caches are invalidated when the paging kernel exits, so the rendering kernel will not read invalid entries.
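The address arithmetic of such a transfer can be illustrated on the CPU. The sketch below copies one brick into a flat array that stands in for the 3D cache texture. On the GPU, consecutive values of the loop index would be handled by neighboring threads of a block and the write would go through the surface reference instead. The `float` voxel type, the struct `Dim3` and the function name are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical triple of sizes or block coordinates.
struct Dim3 {
    std::size_t x, y, z;
};

// Copies a cubic brick (edge length brickEdge, x-fastest order) into the
// cache block at position `block`. The cache holds cacheBlocks.x * .y * .z
// blocks and is stored as one flat array in x-fastest order.
void copyBrickToCache(std::vector<float>& cache, Dim3 cacheBlocks,
                      const std::vector<float>& brick, std::size_t brickEdge,
                      Dim3 block) {
    assert(block.x < cacheBlocks.x && block.y < cacheBlocks.y && block.z < cacheBlocks.z);
    assert(brick.size() == brickEdge * brickEdge * brickEdge);
    const std::size_t cacheW = cacheBlocks.x * brickEdge;
    const std::size_t cacheH = cacheBlocks.y * brickEdge;
    assert(cache.size() == cacheW * cacheH * cacheBlocks.z * brickEdge);

    for (std::size_t i = 0; i < brick.size(); ++i) {
        // Position of voxel i inside the brick.
        const std::size_t x = i % brickEdge;
        const std::size_t y = (i / brickEdge) % brickEdge;
        const std::size_t z = i / (brickEdge * brickEdge);
        // Position of the same voxel inside the cache.
        const std::size_t cx = block.x * brickEdge + x;
        const std::size_t cy = block.y * brickEdge + y;
        const std::size_t cz = block.z * brickEdge + z;
        cache[(cz * cacheH + cy) * cacheW + cx] = brick[i];
    }
}
```

Mapping consecutive indices to voxels that are adjacent along x keeps the reads from the mapped host memory contiguous.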