OpenGL Multipipe SDK: A Toolkit for Scalable Parallel Rendering

Praveen Bhaniramka

Philippe C.D. Robert∗

Stefan Eilemann†

Silicon Graphics, Inc.

ABSTRACT

We describe OpenGL Multipipe SDK (MPK), a toolkit for scalable parallel rendering based on OpenGL. MPK provides a uniform application programming interface (API) to manage scalable graphics applications across many different graphics subsystems. MPK-based applications run seamlessly from single-processor, single-pipe desktop systems to large multi-processor, multipipe scalable graphics systems. The application is oblivious of the system configuration, which can be specified through a configuration file at run time. To scale application performance, MPK uses a decomposition system that supports different modes for task partitioning and implements optimized GPU-based composition algorithms. MPK also provides a customizable image composition interface, which can be used to apply post-processing algorithms on raw pixel data obtained from executing sub-tasks on multiple graphics pipes in parallel. This can be used to implement parallel versions of any GPU-based algorithm, not necessarily used for rendering. In this paper, we motivate the need for a scalable graphics API and discuss the architecture of MPK. We present MPK's graphics configuration interface, introduce the notion of compound-based decomposition schemes and describe our implementation. We present some results from our work on a couple of target system architectures and conclude with future directions of research in this area.

CR Categories: I.3.2 [Computer Graphics]: Graphics Systems—Distributed Graphics; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual Reality

Keywords: Scalable Rendering, Parallel Rendering, Immersive Environments, Scalable Graphics Hardware

1 INTRODUCTION

The need for interactive visualization systems continues to increase constantly. Large amounts of data have to be processed when visualizing complex simulations, rendering large 3D models or scientific data sets [22]. Data sizes in the magnitude of terabytes are not uncommon. In addition, there is a growing need for display technologies such as CAVEs [13], PowerWalls, domes and other immersive environments. This imposes high requirements on the development and deployment of interactive graphics applications, which have to render at high frame rates and achieve high visual realism. It is desirable that the same application be used in many different environments, ranging from common graphics workstations to high-end visualization systems with multiple graphics pipes and specialized scalable graphics hardware.

∗ Now at the University of Bern, Switzerland
† Now at the University of Zurich, Switzerland

IEEE Visualization 2005, October 23-28, Minneapolis, MN, USA. 0-7803-9462-3/05/$20.00 ©2005 IEEE.

Multiple graphics pipes can be used to allow graphics-intensive applications to achieve the desired rendering performance and image quality by executing sub-tasks in parallel and combining partial results generated by individual pipes. Traditionally, proprietary high-end graphics vendors, like SGI and SUN, supported hosting multiple graphics accelerators on one system. With the advent of PCI Express [20], motivated by the needs of the booming video gaming market, this technology can be expected to be more widely available in the future. The advent of modern graphics architectures [37][19], where commodity components are used to build powerful multi-processor, multi-GPU systems, adds to the complexity of modern visualization systems. There is a need for a scalable rendering system which enables applications to utilize all available processing and rendering power and reach optimal performance by taking advantage of multiple graphics pipes and other specialized hardware.

OpenGL Multipipe SDK (MPK) is a scalable rendering toolkit which facilitates the development and deployment of parallel, OpenGL-based multipipe applications. MPK-based applications can be configured at run time, either via a configuration file or programmatically. By separating the system's resource management and physical environment from the application, MPK is able to provide applications with run-time configurability and scalability. MPK implements compound algorithms based on various decomposition modes and provides a parallel rendering API. Being able to choose and adapt the decomposition strategy for a given problem domain and graphics environment at run time leads to a great amount of flexibility and guarantees that applications are deployable in a variety of environments. In addition to scalability, MPK allows controlling stereo features of the display environment. It is possible to switch between mono and stereo rendering at run time, allowing MPK applications to run in complex environments and support various input peripherals and projection systems, such as HMDs [39] or BOOMs [25].
MPK is also capable of providing transparent scalability in multi-head X11 environments. We present the architecture of OpenGL Multipipe SDK and show how it can be used for parallel rendering in a variety of target applications. By separating scene database management from rendering and resource management, we provide run-time configurability and run-time scalability for graphics applications. We introduce a novel way of describing and implementing scalability schemes for parallel graphics applications. Based on that, we design and implement a scalable graphics API. Finally, we show results for GPU-based composition algorithms and extensions to these algorithms to improve scalability and overall rendering performance. MPK currently runs on IRIX, 32-bit Linux and 64-bit Linux platforms.

2 BACKGROUND AND RELATED WORK

The field of parallel graphics abounds with literature on attempts to facilitate the development of large-scale visualization and virtual reality applications. Molnar et al. [27] identify three classes of parallel rendering paradigms based on the stage in the rendering pipeline where the sort from object space to screen space occurs: sort-first, sort-middle and sort-last. In the sort-first approach, the screen space is divided into a number of disjoint display regions which are rendered in parallel and later assembled in the output frame buffer. In sort-middle, primitives are redistributed between the geometry processing and rasterization stages. This straightforward model has been used in many hardware architectures, including InfiniteReality [29], which uses a vertex-data broadcast bus, and Pixel-Planes 5 [15], which uses a ring network to distribute primitives. In sort-last, each graphics pipe renders only a subset of the scene database, preventing any primitive from being rendered more than once. However, image compositing is required to combine the partial results into a single output.

Humphreys and Hanrahan [18] describe a virtual graphics system, WireGL, designed to support multiple simultaneous rendering streams to drive large tiled displays. This system was later ported to run on a cluster [16] using traditional sort-first rendering. Humphreys et al. [17] later integrated a parallel rendering interface in WireGL to achieve data scalability. They introduced Chromium [19], a generic system for manipulating streams of graphics commands on clusters of workstations, making it possible to build sort-first as well as sort-last architectures. Samanta et al. describe a cost-based model for load-balancing the rendering tasks among cluster nodes [34]. Later they extended their technique to allow for tile overlap, creating a hybrid sort-first and sort-last algorithm [32]. Since these approaches require full database replication on each cluster node, Samanta et al. compared various data management strategies for clustered environments [33].

CAVELib by Cruz-Neira et al. [13] is an API designed for creating interactive multi-screen applications for immersive environments. In 1999, CAVELib was enhanced to support PC-based visualization clusters. VR Juggler [21][10] is a development and deployment environment for virtual reality applications. Introducing the notion of a virtual platform for VR, Bierbaum et al. implement an operating environment that shields developers from specific details of the underlying hardware architecture and operating system. VR Juggler has also been designed to facilitate run-time reconfigurability [9]. Another toolkit which supports the development of multipipe applications is SUN's multi-display utility MDU [4]. Aside from these highly specialized solutions, high-level toolkits exist which provide at least some support for scalable rendering. OpenGL Performer [31] provides an API for managing multiple graphics pipes, but it does not virtualize the configuration of a multipipe system. Applications have to be aware of the available system resources and use them explicitly in order to take advantage of them. Likewise, Open Inventor applications can make use of immersive environments using TGS' MultiPipe extension [2]. Head and hand tracking is thereby provided through the Trackd library [5]. Paraview [3], on the other hand, is an application based on VTK [6] which uses a sort-last parallel implementation for scalable visualization of large data sets.

Numerous hardware architectures have been proposed for accelerating image composition using specialized hardware. PixelFlow [28] is a proprietary rendering system for real-time image generation, designed to scale linearly with its Image Composition Network composed of multiple compositors. Stoll et al. describe Lightning-2 [38], a system to perform composition on a cluster of commercial off-the-shelf (COTS) PCs using the DVI digital video output of commodity GPUs. Similarly, the Metabuffer [11] is a scalable multi-display system for COTS clusters, which includes novel multi-resolution capabilities. The Sepia system from Compaq [26] is a flexible architecture based on programmable FPGA devices to achieve real-time frame rates when rendering partitioned data sets on a cluster of PCs. Similar compositors are available for SGI Onyx and Prism high-performance visualization systems [36]. To leverage the adoption of hardware-based image-composition solutions using COTS clusters, Alcorn and Frank introduced the Parallel Image Compositing API, called PICA [7]. PICA provides a complete abstraction layer for distributed image composition, independent of the graphics API.

3 GRAPHICS CONFIGURATION

MPK provides the application a high-level view of the underlying system by hiding the details of low-level systemic issues like graphics resource management and parallel rendering. Figure 1 outlines the architecture of a scalable graphics system as viewed by an MPK application. The system consists of a host subsystem and a pool of GPUs connected via a high-bandwidth interconnect. The results generated by the GPUs in the graphics subsystem are distributed across an image composition network, which, after some processing, routes them to the display subsystem. The host is responsible for running the application and controlling the other components of the system. The composition network can consist of general-purpose CPUs, GPUs or dedicated composition devices. The architecture does not make any assumptions on the type or topology of the interconnect between the host and the graphics subsystems. Similarly, no restrictions are imposed on the display subsystem, which can either be a set of projectors used to drive a CAVE or a single screen displaying the composited output from the multiple graphics pipes.

Figure 1: A scalable graphics system

MPK isolates the application from the details of the underlying hardware by separating these details from other aspects of the application, like scene database management. The graphics configuration is abstracted using the MPK configuration. Based on this abstraction, MPK implements multiple parallel rendering paradigms using compounds.

3.1 The MPK Configuration

The MPK configuration is a tree-like data structure used to describe the graphics resources of the system, along with information on how to use these resources to generate the final output image. The run-time execution environment provided by MPK uses the configuration to determine which physical pipes to initialize and manage, which parallel tasks to create and synchronize, and where to send the final rendered image for display. This simple scheme is extremely powerful, since it isolates the application from the details of the system architecture on which it is running. Additionally, this high-level abstraction allows MPK to optimize the rendering process for specific target architectures in an application-transparent manner.

Figure 2 shows an example of a configuration hierarchy and its various components. Each of these components has a unique identifier, a character string. At the top level, the configuration contains a list of pipes. Each pipe corresponds to a physical rendering engine and is characterized by the name of its display, e.g., the X11 display identifier. Each pipe consists of a set of windows, each representing a single graphics context. A window retains attributes like geometry, display visuals and context handles, and also provides the abstraction for a task to be created for parallel rendering. Each window further consists of a set of channels, which provide a view definition of the scene. Each channel defines a viewport in the window where the scene, or a part thereof, will be drawn, including the various projection and display parameters. Channels defined in the same window render to the same graphics context and use the same rendering thread.

    config
    {
        pipe
        {
            display ":0.0"
            window
            {
                name     "window"
                viewport [ 0.25 0.25 0.5 0.5 ]
                channel  { name "channel0" }
                channel  { name "channel1" }
            }
        }
    }

Figure 2: A sample configuration hierarchy

The configuration can be loaded at run time using an ASCII configuration file (Figure 2) or created programmatically using the MPK API. The former scheme is more commonly used, requiring users to change only the MPK configuration during application development and deployment. Multiple windows per pipe are handy for developing, testing and debugging multi-threaded, multi-context applications on single-pipe desktop systems, while different multipipe configurations will be used when deploying the application in an immersive environment or for parallel rendering.
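For instance, extrapolating the syntax shown in Figure 2, a single-pipe debugging configuration with two windows (and thus two rendering threads sharing one physical pipe) might look like the following sketch; the window names and viewports are illustrative, not taken from the paper:

```
config
{
    pipe
    {
        display ":0.0"
        window
        {
            name     "left"
            viewport [ 0.0 0.0 0.5 1.0 ]
            channel  { name "channel0" }
        }
        window
        {
            name     "right"
            viewport [ 0.5 0.0 0.5 1.0 ]
            channel  { name "channel1" }
        }
    }
}
```

On a deployment system, only this file would change, e.g., to list several pipes with one full-screen window each, while the application binary stays the same.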

3.2 Compounds

Compounds describe how the rendering resources are combined to produce the final views. MPK allows decomposing the global rendering task into smaller tasks and assigns these tasks to individual channels for parallel rendering. This task division requires a decomposition scheme to send a subset of the rendering primitives to each channel, get back rendered images from each channel and then composite these to get the final image (Figure 3). The decomposition scheme is abstracted using compounds, which are specified as part of the MPK configuration and hence can be configured using the same ASCII file.


Figure 3: A compound with 3 channels

Compounds form a tree-like structure which provides an abstraction for the decomposition of the rendering. Multiple compounds can be configured as multi-level compound trees. A compound tree specifies the topology of the (de)composition network as well as the operations to be applied during rendering and image composition. Multiple disjoint compound trees may be used to drive multiple displays. Figure 3 outlines the elements of a compound. The destination channel is the root node of the compound tree, where the composed result will be displayed. The composition units – CPUs, GPUs or dedicated composition hardware – combine partial images and route the output image to the destination channel. Source channels represent the leaf nodes and perform the actual rendering. The compound mode specifies the decomposition scheme for the immediate children and the composition operator to be applied to the compound. The compound format controls the pixel data to be transported from the source channel; e.g., some composition schemes need both depth and color data to be transported, while others do not. The compound traversal algorithm allows application-specific load balancing and facilitates task-partitioning algorithms to achieve optimal rendering performance with minimal run-time overhead (Section 6.2).

4 SCALABLE RENDERING

As explained in Section 3.2, scalability in MPK is achieved through task decomposition and recomposition algorithms implemented as compounds. MPK supports various parallel rendering modes and by default uses optimized GPU-based composition algorithms. Factors affecting the performance of the application include the scalability of the decomposition algorithm, the load balancing between source pipes, the latency incurred during composition and the graphics I/O overhead. These factors can sometimes be mutually conflicting. To trade off these factors and achieve optimal performance for a given configuration, various heterogeneous compounds can be combined to create multi-tier (de)composition trees. This also allows parallel composition algorithms to be implemented, since individual compounds typically run in parallel. Multi-level compounds are also necessary to support multi-tier hardware compositors when the number of pipes on the system exceeds the number of inputs allowed by a single device [30]. Below we describe the currently available decomposition modes of MPK.

4.1 Frame Decomposition

In frame decomposition, a frame or view is divided into regions, each of which is assigned to a different source pipe for rendering. The following compounds fall in this category:

Screen Decomposition  In screen decomposition (2D), each pipe renders a smaller part of the screen area. Copying the source images and pasting them side by side in the destination channel generates the final image. This operation is easily implemented using dedicated hardware (see Section 5.4). 2D compounds scale pixel-fill-limited applications trivially and can also be used to scale general-purpose computations using graphics hardware (GPGPU) [1], preventing expensive network communication as on GPU clusters [14]. The graphics I/O requirements of 2D are low, because the source images are small. However, as noted in [27], the technique has issues with load balancing as the number of pipes increases.

Database Decomposition  In database decomposition (DB), the scene is rendered in parallel by dividing the rendered data across different graphics pipes. Each pipe renders a subset of the data to generate partial images, which are then composited to generate the final image using depth testing and/or alpha blending; e.g., for volume rendering, the application can partition the volume data into equal bricks, each of which is rendered on a different graphics pipe [8]. The system's pixel-fill performance, texture download bandwidth, as well as texture memory size scale linearly with this technique.

Sub-Pixel Decomposition  MPK can be used to parallelize operations at the fragment level by using multi-pass rendering algorithms executed in parallel on different pipes and then combining these partial results in the final composition step. Such a scheme can be used to implement full-scene antialiasing (FSAA) by rendering the scene from slightly different viewpoints and applying a filtering kernel during composition. The number of passes is thereby determined by the number of source channels. MPK allows every channel to be used multiple times to allow higher-order filtering algorithms to be implemented.

Eye Decomposition  Eye decomposition (EYE) is useful for stereo rendering only, where each pipe renders a particular view (left or right) of the scene. If stereo is active, each pipe view fills the right or left buffer of the final rendering. This provides excellent load balancing and scalability for stereo-view rendering, because the scene content is similar for each eye.


Figure 5: Hierarchical culling using multiple cull threads

4.2 Temporal Decomposition

In contrast to frame decomposition, temporal decomposition balances the workload by scheduling the work on each pipe in sync with that of the other pipes to produce a steady stream of rendered frames. The focus here is time scheduling rather than frame division. MPK provides two temporal decomposition algorithms:

Frame Multiplexing  Frame multiplexing (DPLEX) distributes entire frames to the source pipes over time for parallel processing. It pipelines successive frames by introducing latency into the rendering pipeline. DPLEX scales geometry, pixel-fill performance and host-to-graphics bandwidth, as the workload balance between pipes is intrinsically maintained. However, it has an increased transport delay inherent to the frame synchronization required across the pipes and produces a latency of (pipes − 1) frames; i.e., there will be a (pipes − 1)-frame delay between a user input and the corresponding output frame.

Data Streaming  Data streaming (3D) is similar to database decomposition in that it divides the scene among multiple pipes. The rendering of the final view is streamed through the available channels, using a series of successive compositions and readbacks for each frame, as shown in Figure 4. Like DPLEX, 3D compounds have a latency of (pipes − 1) frames, but they have low graphics I/O overhead, since each compound needs to read and assemble only one source image at a time. Hence, 3D is a good replacement for DB decomposition if the increased latency is acceptable.

Figure 4: Data streaming using 3 channels

4.3 Operational Decomposition

MPK can decompose operational parts of the application's pipeline similar to the rendering process itself. This scheme is used to parallelize the draw and cull operations by using one or more culling threads, in addition to the per-window rendering threads. It is used when the cull operation takes a significant amount of time and additional compute resources are available. This flexible scheme allows multiple cull threads per draw operation, multiple draw threads per compound, as well as hierarchical culling for 2D decompositions (Figure 5). Multiple draw threads per compound are implicitly load balanced: each draw operation pulls the next data to be rendered from the cull queue and renders it. This is repeated by each draw thread until everything is rendered, implicitly load-balancing the rendering. The source channel images are then recomposed in the same way as in a DB compound.

5 COMPOUND OPTIMIZATIONS

In this section, we discuss various optimizations introduced in MPK that provide a significant performance increase and help to achieve better scalability.

5.1 Asynchronous Compositions

As the number of pipes increases, the composition overhead soon becomes an issue, since the destination channel needs to composite the source images sequentially at the end of the frame. MPK provides an asynchronous composition mode (ASYNC) to minimize this overhead by pipelining the rendering and composition operations. The composition occurs asynchronously with the frame rendering in the individual source channels at the beginning of the next frame, allowing the source channels to render the next frame while the destination channel is compositing the current one. This process also distributes the frame transport evenly, reducing the impact on the I/O subsystem. For the CULL compound, draw operations occur asynchronously from the cull operations; e.g., if the draw threads render frame N−1, the cull threads work on culling for frame N. Hence, the draw never stalls on the cull operation. This scheme improves performance at the cost of an additional frame of latency in the pipeline. Figure 6 compares the execution pipeline without and with this mode.
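The benefit of overlapping rendering and composition can be captured with a back-of-the-envelope model (our simplification, using abstract stage times rather than measured MPK quantities): without pipelining the two stages run back to back, whereas with pipelining the steady-state frame time is bounded by the slower stage, at the cost of one extra frame of latency.

```cpp
#include <algorithm>

// Steady-state time per frame with synchronous composition: the
// destination composites immediately after the sources finish drawing,
// so the two stages add up.
double syncFrameTime(double drawTime, double compositeTime)
{
    return drawTime + compositeTime;
}

// With asynchronous composition, compositing frame N overlaps drawing
// frame N+1, so the steady-state frame time is bounded by the slower
// of the two stages (plus one extra frame of latency, not shown here).
double asyncFrameTime(double drawTime, double compositeTime)
{
    return std::max(drawTime, compositeTime);
}
```

For example, with a 10 ms draw and a 4 ms composite, the synchronous pipeline needs 14 ms per frame while the asynchronous one needs 10 ms.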

5.2 Dynamic Load Balancing

MPK provides a dynamic load balancing mode for 2D, DB and 3D compounds. This mode automatically computes the viewport or range of the compound's children based on the rendering time of each child during the last finished frame. This results in good load balancing for low-latency compounds, provided that the workload is relatively evenly distributed within the child viewport or range. To gain more precise knowledge about the workload distribution, MPK utilizes the region defined for adaptive readback (see Section 5.5). Moreover, the tiling scheme for 2D compounds can be configured to adapt to the nature of the data being rendered. It is possible to create load-balanced 2D compounds that use channels on other windows also used for the final display. This cross-usage of rendering resources enables scalability on tiled displays by distributing the rendering uniformly across all rendering resources.

5.3 Pbuffer Rendering

MPK uses pbuffer rendering to provide better resource utilization when scaling graphics applications. For example, typical implementations of DPLEX composition do not allow the destination channel to contribute to the final rendering, limiting the scalability of an N-pipe system to N−1 times the performance of a single pipe. MPK prevents this by using a separate pbuffer with the same OpenGL context on the destination window as the visible window. The draw thread renders to the pbuffer while the application calls MPK from the draw callback to render the other channels' output frames. On the availability of a frame, MPK makes the drawable of the visible window current and assembles the frame. Before returning to the application's draw callback, the OpenGL state is restored and the pbuffer drawable is made current again. This approach gives maximum scalability with DPLEX and has minimal overhead, since it does not require an OpenGL context switch. This is depicted in Figure 7.

Figure 7: Full-scale DPLEX rendering using 2 channels

5.4 Hardware Composition

Several composition schemes mentioned in the previous sections can be accelerated using special-purpose hardware. These devices avoid the overhead of the image acquisition and composition stages by ingesting the output video signal directly from the source graphics pipes and providing the composited video signal as output. 2D, DPLEX, EYE and FSAA decomposition modes are currently implemented in MPK using the SGI Graphics Compositor [36] or the DPLEX option board [35]. Hardware compositors also help in reducing the rendering latency in some composition schemes. The use of this hardware is transparent to the application.

5.5 Adaptive Image Pipeline

At the end of the draw operation, the application can specify the image-space bounding box within the framebuffer that was modified during the draw. MPK uses this information to minimize the pixel transfer overhead for the current frame by processing only this region. MPK further optimizes the image acquisition and composition steps for different graphics hardware, using 4-pixel-aligned transfers, for example. The bounding box is also used to tune the load balancing algorithm, since it provides the load balancer more concrete information about the workload distribution in screen space, leading to better predictions for the subsequent frames.

Figure 6: DB decomposition, without and with asynchronous compositions

6 PROGRAMMING AND EXECUTION MODEL

MPK's programming model reflects the natural application framework of OpenGL and isolates the rendering task from resource management by using a callback-based interface. This interface is similar to the popular OpenGL Utility Toolkit (GLUT). The application provides function callbacks for specific tasks while the core of MPK handles the multi-processing aspects of the application. For a number of tasks, such as window creation, frame readback and compound assembly, MPK provides default implementations, which can be replaced by the application. A typical example is to customize the compound assembly for GPGPU applications [1], where the final composition step combines the partial computation results instead of performing the default assembly. Initialization and exit callbacks are invoked for creating and destroying components (pipes, windows, channels, and compounds) and for setting initial parameter values, like display windows, graphics contexts, etc. Update callbacks are used for actions to execute during each frame refresh, including the per-channel rendering as well as the updates done on the global context handled by each window. Event callbacks process user input and execute actions for a given input event (mouse, keyboard, etc.) for each window.

6.1 Initialization

Figure 8 shows the execution model for a typical MPK application. The execution begins by loading the MPK configuration and initializing the application data. MPK allocates and initializes the various components of the configuration. MPK makes data management easier by allowing applications to store and access data in respective containers for each node in the configuration hierarchy. For example, identifiers like texture objects, display list identifiers, etc., which correspond to a given OpenGL context, can be created and stored using per-window containers and later retrieved in the update callbacks. MPK uses a multi-threaded execution model for parallelizing the rendering process and feeding the multiple graphics pipes in parallel. During configuration initialization, MPK creates threads for each window and manages their synchronization during each frame. Event interception and processing in MPK is centralized in the main application thread and allows event-driven execution or continuous rendering. Event handling can be configured and disabled on a per-window basis to allow custom event processing for different application scenarios. Applications can select from fork, sproc, and pthread multi-tasking schemes at run time. On NUMA systems, pipe- and window-specific data can be allocated on the same node in the system to prevent unnecessary inter-node communication.
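The per-window containers described above can be pictured as a small key/value store attached to each window. The following is a hypothetical sketch of the idea (the names and interface are ours, not MPK's actual API): handles created in a window's initialization callback are stored under a string key and retrieved later in that window's update callbacks, which run in the same rendering thread and OpenGL context.

```cpp
#include <map>
#include <string>

// Hypothetical per-window data container: context-specific handles
// (texture objects, display list identifiers, ...) created in the init
// callback are stored here and retrieved in later update callbacks of
// the same window, so each graphics context keeps its own resources.
struct WindowData
{
    std::map<std::string, unsigned> handles;

    void set(const std::string& key, unsigned id) { handles[key] = id; }
    unsigned get(const std::string& key) const { return handles.at(key); }
};
```

Because each window owns its container, no locking is needed between the rendering threads when they access their own handles.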

Figure 8: Execution model

6.2 Frame Generation

MPK uses a frame-driven rendering pipeline (Figure 8); i.e., the application tells MPK to produce a new frame, and MPK invokes all the callbacks on the configuration's resources required to produce that frame. Therefore, the rendering is always frame-synchronized. An exception to this rule is the DPLEX compound, where individual rendering threads run unsynchronized for multiple frames to allow overlapped, time-multiplexed rendering.

6.3 Frame Data

While rendering a frame, the correct contextual data has to be passed to the application callbacks, depending on the latency of the compounds. This is done by maintaining a unique data pointer (referred to as frame data), which is allocated by the application and passed to MPK at the beginning of each frame. Once the data is no longer needed, MPK notifies the application to delete it. Likewise, applications can use a culling infrastructure, where the data describing the frame is produced by the application thread. The frame data is always passed latency-correct to the update callbacks.

At the beginning of a frame, MPK performs several steps to prepare the compound trees for rendering. First, each tree is traversed to update the contextual data, like viewport (2D) or range (DB/3D), for the current frame. The next traversal prepares empty input and output frames as well as cull queues, as specified by the compound flags and the rendering operations. In the last traversal, the appropriate rendering tasks (like pre- and post-assemble, clear, cull and draw) are assigned to each compound. Now the frame is ready to be rendered. By unlocking the window threads, each rendering channel traverses the compound tree and executes the tasks assigned to it by its referencing compounds. When this is done, each window notifies the application thread, so that MPK can synchronize the buffer swaps for all the contributing windows.

7 DISCUSSION

In the previous sections, we described the design and optimizations of MPK aimed at maximum scalability and flexibility. Now we demonstrate that MPK-based applications indeed meet these requirements. Using a typical, MPK-based volume rendering application, we present results for a few commonly used compounds. All the results have been generated by using different configuration files with the same unmodified application. We compare these results with the theoretical performance numbers and show how optimizations built into MPK help achieve better scalability. The results collected in this paper come from two systems running Linux: an SGI Prism system with 10 Intel Itanium2 processors, 8 ATI FireGL X3 graphics pipes (AGP 8x) and 9.35 GB memory, and a COTS workstation with 1 AMD Opteron 3000+ processor, 2 NVidia GeForce 6600 graphics pipes (PCIe x16) and 2 GB memory. The achieved performance depends on the single-pipe rendering time $t_1$, readback rate $R_r$, draw rate $R_d$ and the number of pixels in the destination channel $nPixels$. Please note that the source channel readbacks are executed in parallel whereas the destination draws are done sequentially. Also note that for asynchronous compounds, the readbacks and draws are executed in parallel, whereas, for synchronous compounds, they are executed serially (see Figure 6). Hence, for $n_p$ pipes, assuming linear scalability, perfect load balancing and using the destination channel as a source as well, the theoretically achievable performance for ASYNC and NO ASYNC is given by the following:

$t_{total}^{async} = \frac{t_1}{n_p} + \max(t_{read}, t_{draw})$  (1)

$t_{total}^{noasync} = \frac{t_1}{n_p} + t_{read} + t_{draw}$  (2)

For 2D compounds, $t_{read}$ and $t_{draw}$ are given by the following:

$t_{read} = \frac{nPixels}{n_p \cdot R_r}, \quad t_{draw} = (n_p - 1) \cdot \frac{nPixels}{n_p \cdot R_d}$  (3)

whereas, for DB compounds, assuming full-screen reads and draws, the corresponding values are the following:

$t_{read} = \frac{nPixels}{R_r}, \quad t_{draw} = (n_p - 1) \cdot \frac{nPixels}{R_d}$  (4)

For the tests on the SGI Prism, we have the following characteristics: $t_1$ = 603 ms, $nPixels$ = 1280 · 1024, $R_r$ = 68 MPixels/s and $R_d$ = 148 MPixels/s. Figures 9(a) and 9(b) show the theoretical and measured numbers for 2D and DB compounds, respectively. The NOCOPY graph shows the performance if no software readback or compositing is done, i.e., without the composition overhead. Typically, this is the case when using a hardware compositing device. 2D compounds provide better scalability than DB compounds if there is perfect load balancing and the data fits into GPU memory, which is unfortunately not true for most real-world applications. MPK’s timing-based load-balancer provides a better decomposition for such applications, provided they maintain reasonable frame consistency, both spatially and time-wise.

            1-pipe    2-pipe 2D   2-pipe DB
NO ASYNC    0.979     1.785       0.962
ASYNC       0.979     1.860       1.764
NOCOPY      0.979     1.951       1.949
LINEAR      0.979     1.958       1.958

Table 1: Achieved frames per second on the dual PCI Express system.

Please note that compounds like 2D, EYE and DPLEX require minimal programming effort and have smaller composition overhead. However, these compounds are limited when handling large data sets. DB and 3D compounds remove these limitations, but they require more application awareness in partitioning the data set across source channels and compositing the partial results. Depending on the problem, data size and system configuration, different compounds might be more applicable.

Figure 9: Performance results using 2D decompositions (top) and DB decompositions (bottom)

Applications can provide their own load-balancing scheme based on their extended knowledge about the rendered data. The performance drop in Figure 9(b) for the NO ASYNC case is an application-specific anomaly caused by a suboptimal implementation of the test application.

Figure 10: Composition overhead for different compounds

Figure 10 compares only the measured overhead incurred by various compounds using ASYNC mode, i.e., $t_1$ = 0. It can be seen that for 2D and DPLEX the overhead remains almost constant as $n_p$ increases, whereas for DB the overhead increases almost linearly (an extreme case, since the complete viewports are being read and composited). Using a hybrid 2D/DB scheme (similar to Binary Swap [23]) or a multi-stage hierarchical DB composition scheme reduces the overhead considerably by distributing the pixel transfer overhead as well as parallelizing the compositions. Table 1 shows the results obtained on the dual-PCIe system with $t_1$ = 1021 ms and $R_r$ = $R_d$ = 190 MPixels/s. The advantage of using ASYNC composition is clear, especially for DB compounds, where the overhead of composition is higher.

8 CONCLUSIONS

In this paper, we have presented OpenGL Multipipe SDK, a toolkit for scalable parallel rendering. MPK provides applications with a high-level abstraction of the system’s graphics resources while making low-level optimizations behind the scenes. Unlike application-transparent systems like ATI’s CrossFire, Nvidia’s SLI or Chromium, MPK requires parts of an application to be rewritten to take advantage of MPK’s advanced scalability features. This application-aware approach avoids the overhead incurred by application-transparent approaches and thus achieves better scalability. At the most generic level, an application only needs to deal with OpenGL rendering areas, i.e., channels, by providing the drawing code to handle a channel of arbitrary configuration (frustum, location in a larger display setup, etc.). Everything else is handled by MPK: the type of windowing system, the system architecture, and the type and topology of the interconnect. The architecture as such does not depend on the use of OpenGL. MPK’s flexible compound-based decomposition system can be configured to scale any kind of target application. The compound traversal scheme implements new techniques to provide better scalability for applications. The callback-driven programming model coupled with the frame-driven execution model of MPK is easy to use and intuitive for OpenGL application developers. We have presented scalability results for a few commonly used compounds and shown how asynchronous composition and multi-level compound hierarchies can be configured to scale applications by removing limitations inherent in certain compounds.

9 FUTURE WORK

Our experiments with the dual-PCI Express system were limited by the availability of only two PCIe x16 slots on the system. We are currently investigating several approaches to enable distributed rendering using MPK on a cluster of PCs, e.g., using MPI or TCP/IP for distributing the rendering pipeline over the network. Additionally, using modified Chromium SPUs, we envision an integration where MPK can be used to configure and drive unmodified applications using the Chromium framework. We are also looking into new, more efficient algorithms for DB compositing to reduce the composition overhead and provide better scalability. Fragment shaders and highly tuned, parallel software implementations are approaches that we are currently investigating. Sub-pixel decomposition is a relatively new addition to MPK and we are investigating ways to make it more accessible to end users. Using this approach, applications can provide scalable visual effects, like a large number of light sources, by rendering a subset of the light sources on each source channel, or scalable fragment shaders, by splitting the shading passes across source channels. While MPK is able to scale GPGPU applications using 2D and DB compounds and custom composition implementations, it does not yet automate scalability for GPGPU applications. This could be addressed by implementing specialized compounds for processing multiple data streams in parallel, possibly using tools like BrookGPU [12] or Sh [24].

10 ACKNOWLEDGEMENTS

The authors would like to thank Patrick Bouchaud for the effort he put into MPK. We would also like to thank Vaibhav Saxena, Marc Romankewicz, Ken Jones, Bruno Stefanizzi (all SGI), Hanspeter Bieri (University of Bern, Switzerland), Mike Houston (Stanford University) and the other reviewers for their valuable assistance and discussions.

REFERENCES

[1] General-Purpose Computation using Graphics Hardware. http://www.gpgpu.org
[2] Multi-Pipe Extension for Open Inventor. http://www.tgs.com/
[3] ParaView. http://www.paraview.org
[4] Sun Multi-Display Utilities. http://www.sun.com/software/graphics/opengl/mdu/
[5] TrackD API. http://www.vrco.com/trackd/Overviewtrackd.html
[6] J. Ahrens, C. Law, W. Schroeder, K. Martin, and M. Papka. A Parallel Approach for Efficiently Visualizing Extremely Large, Time-Varying Datasets. Technical Report LAUR-00-1620, Los Alamos National Laboratory, 2000.
[7] B. Alcorn and R. Frank. Parallel Image Compositing API. Workshop on Commodity-Based Visualization Clusters, October 2002.
[8] P. Bhaniramka and Y. Demange. OpenGL Volumizer: A Toolkit for High Quality Volume Rendering of Large Data Sets. Proceedings of the 2002 IEEE Symposium on Volume Visualization and Graphics, pages 45–54, 2002.
[9] A. Bierbaum and C. Cruz-Neira. Run-Time Reconfiguration in VR Juggler. 4th Immersive Projection Technology Workshop, June 2000.
[10] A. Bierbaum, C. Just, P. Hartling, K. Meinert, A. Baker, and C. Cruz-Neira. VR Juggler: A Virtual Platform for Virtual Reality Application Development. Proceedings of IEEE Virtual Reality 2001, March 2001.
[11] W. Blanke, C. Bajaj, D. Fussel, and X. Zhang. The Metabuffer: A Scalable Multi-Resolution 3-D Graphics System Using Commodity Rendering Engines. Technical Report TR2000-16, University of Texas at Austin, 2000.
[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. Proceedings of SIGGRAPH 2004, 23(3):777–786, 2004.
[13] C. Cruz-Neira, D. Sandin, and T. DeFanti. Surround-Screen Projection-Based Virtual Reality: The Design and Implementation of the CAVE. Proceedings of SIGGRAPH 93, pages 135–142, 1993.
[14] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU Cluster for High Performance Computing. Proceedings of the ACM/IEEE SC2004 Conference, page 47, 2004.
[15] H. Fuchs, J. Poulton, J. Eyles, T. Greer, J. Goldfeather, D. Ellsworth, S. Molnar, G. Turk, B. Tebbs, and L. Israel. Pixel-Planes 5: A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories. Proceedings of SIGGRAPH 89, pages 79–88, July 1989.
[16] G. Humphreys, I. Buck, M. Eldridge, and P. Hanrahan. Distributed Rendering for Scalable Displays. IEEE Supercomputing, October 2000.
[17] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A Scalable Graphics System for Clusters. Proceedings of SIGGRAPH 2001, pages 129–140, August 2001.
[18] G. Humphreys and P. Hanrahan. A Distributed Graphics System for Large Tiled Displays. IEEE Visualization 1999, pages 215–224, October 1999.
[19] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, and P. D. Kirchner. Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters. ACM Transactions on Graphics, 21(3):693–702, 2002.
[20] Intel Corporation. The PCI Express Architecture and Advanced Switching, 2003.
[21] C. Just, A. Bierbaum, A. Baker, and C. Cruz-Neira. VR Juggler: A Framework for Virtual Reality Development. 2nd Immersive Projection Technology Workshop, May 1998.
[22] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, J. Shade, and D. Fulk. The Digital Michelangelo Project: 3D Scanning of Large Statues. Proceedings of SIGGRAPH 2000, pages 131–144, July 2000.
[23] K.-L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh. Parallel Volume Rendering Using Binary-Swap Image Composition. IEEE Computer Graphics and Applications, July 1994.
[24] M. D. McCool, Z. Qin, and T. S. Popa. Shader Metaprogramming. Proceedings of the ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 57–68, 2002.
[25] I. E. McDowall, M. T. Bolas, S. D. Pieper, S. S. Fisher, and J. Humphries. Implementation and Integration of a Counterbalanced CRT-based Stereoscopic Display for Interactive Viewpoint Control in Virtual-Environment Applications. Proc. SPIE, 1256, 1990.
[26] L. Moll, A. Heirich, and M. Shand. Sepia: Scalable 3D Compositing using PCI Pamette. Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines, page 146, 1999.
[27] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A Sorting Classification of Parallel Rendering. IEEE Computer Graphics and Applications, pages 23–32, July 1994.
[28] S. Molnar, J. Eyles, and J. Poulton. PixelFlow: High-Speed Rendering Using Image Composition. Proceedings of SIGGRAPH 92, pages 231–240, August 1992.
[29] J. S. Montrym, D. R. Baum, D. L. Dignam, and C. J. Migdal. InfiniteReality: A Real-Time Graphics System. Proceedings of SIGGRAPH 97, pages 293–302, August 1997.
[30] J. Nonaka, N. Kukimoto, N. Sakamoto, H. Hazama, Y. Watashiba, X. Liu, M. Ogata, M. Kanazawa, and K. Koyamada. Hybrid Hardware-Accelerated Image Composition for Sort-Last Parallel Rendering on Graphics Clusters with Commodity Image Compositor. IEEE Volume Visualization Symposium, 2004.
[31] J. Rohlf and J. Helman. IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics. Proceedings of SIGGRAPH 94, pages 381–395, July 1994.
[32] R. Samanta, T. Funkhouser, K. Li, and J. P. Singh. Hybrid Sort-First and Sort-Last Parallel Rendering with a Cluster of PCs. Proceedings of Eurographics Workshop on Graphics Hardware, pages 97–108, August 2000.
[33] R. Samanta, T. Funkhouser, K. Li, and J. P. Singh. Sort-First Parallel Rendering with a Cluster of PCs. SIGGRAPH 2000 Technical Sketch, August 2000.
[34] R. Samanta, J. Zheng, T. Funkhouser, K. Li, and J. P. Singh. Load Balancing for Multi-Projector Rendering Systems. Proceedings of Eurographics Workshop on Graphics Hardware, pages 107–116, August 1999.
[35] Silicon Graphics, Inc. Onyx2 DPLEX Option Hardware User’s Guide, 1999.
[36] Silicon Graphics, Inc. SGI InfinitePerformance: Scalable Graphics Compositor User’s Guide, 2002.
[37] Silicon Graphics, Inc. Silicon Graphics Prism Family of Visualization Systems, 2004.
[38] G. Stoll, M. Eldridge, D. Patterson, A. Webb, S. Berman, R. Levy, C. Caywood, M. Taveira, S. Hunt, and P. Hanrahan. Lightning-2: A High-Performance Display Subsystem for PC Clusters. Proceedings of SIGGRAPH 2001, pages 141–148, August 2001.
[39] M. A. Teitel. The Eyephone: A Head Mounted Stereo Display. Proc. SPIE, 1256:168–172, 1990.
