Network Aware Parallel Rendering with PCs
Andre Hinkenjann, University of Applied Sciences Bonn-Rhein-Sieg, Sankt Augustin, Germany
Matthias Bues, Fraunhofer IAO, Stuttgart, Germany
Abstract
Interactive rendering of complex models has many applications in the Virtual Reality Continuum. The oil and gas industry uses interactive visualizations of huge seismic data sets to evaluate and plan drilling operations. The automotive industry evaluates designs based on very detailed models. Unfortunately, many of these very complex geometric models cannot be displayed at interactive frame rates on graphics workstations, because the graphics performance of such workstations scales poorly. Recently, there has been a trend to use networked standard PCs to solve this problem. Care must be taken, however, because clustered PCs have no shared memory: all data and commands have to be sent across the network. Removing the resulting network bottleneck turns out to be a challenging problem. In this article we present some approaches for network aware parallel rendering on commodity hardware. These strategies are technological as well as algorithmic solutions.
Introduction
Many virtual environment applications depend on the ability to interact with multi-million polygon data sets. Usually, special graphics workstations are employed for this. Due to their limited scalability and sometimes poor base graphics performance (compared to modern PC graphics boards), some data sets cannot be viewed interactively. There is currently a strong trend to build graphics clusters from commodity hardware, because standard PCs have a better price to performance ratio than high end graphics workstations. With the cluster approach, another graphics board is easily added to an existing PC cluster, while graphics workstations are usually limited in the maximum number of installable graphics pipelines. In theory, the result is better scalability on the PC cluster side. In practice, because PC clusters lack shared memory, they produce a huge amount of network traffic. This traffic must be handled by the chosen standard network, whose bandwidth is usually about an order of magnitude below the memory bandwidth of a shared-memory SMP system. In the following sections, we present our system and the strategies to overcome the network bottleneck. For benchmark numbers of the system, see [Hinkenjann et al. 2002]. For the case of sort-last parallel rendering we utilize a hierarchical representation of the depth intervals occurring for a specific viewpoint. This helps reduce the number of bytes that have to be sent across the network. This approach is described in section 3. Section 4 deals with alternative hardware solutions for the network implementation and lists some performance numbers. Section 5.1 describes a fast, lossless compression strategy. In section 5.2 we describe the advantage of using a retained mode render strategy compared to immediate mode. The possibility of trading speed for latency is described in section 5.3.
Figure 1: Parallel rendering system overview
System Description
The work described here is based on [Hinkenjann et al. 2002]. There we implemented a cluster rendering system based on commodity hardware. We used PCs and standard Gigabit Ethernet technology. Our system is roughly divided into a front-end and a back-end part. The front-end attaches to unparallelized applications written in OpenGL|Performer and distributes work tasks to the back-end. Technically, the application programmer only has to add a few API calls to an existing application. All the details of the parallelization are hidden by the front-end. The back-end is a generic parallel rendering system that renders several render tasks in parallel on slave PCs and displays the result on a master PC (cf. figure 1). The back-end implements sort-first and sort-last parallel rendering (for a classification and explanation of parallel rendering strategies see [Molnar et al. 1994]). The speedups for the sort-first setup were promising, but the numbers for sort-last rendering were limited by the theoretical bandwidth of the chosen network technology. This can be seen by comparing the time the system spends in each of the steps it handles (cf. table 1). While all other steps are more or less negligible, step 4 was identified as the biggest bottleneck of the system. It is about one order of magnitude slower than the other steps. Although this bottleneck will be relieved by the introduction of 10-Gigabit Ethernet technology, current implementations still suffer from it. In addition, by the time faster Ethernet technology becomes broadly available, the other steps will be faster, too, due to constantly increasing CPU and GPU performance.

Step                                                   Time
1. slave update                                        negligible (camera updates)
2. slaves render                                       any rate (hw dependent)
3. slaves read back buffers (color, depth)             > 60 Hz per buffer
4. slaves send depth data to master                    < 4 Hz (5 slaves)!
5. master compares, requests color data                negligible
6. slaves send color data (1 amortized color buffer)   > 40 Hz
7. master assembles frame                              negligible
Table 1: Time spent in single steps

Figure 2: Sort first setup
Figure 3: Sort last setup

Because of the network bottleneck, we can deduce the expected frame rate of our system from the available network bandwidth. With Gigabit Ethernet network technology the following example frame rates can be achieved: Let the network bandwidth be 1 GBit/s, and let the image resolution be 1000·1000 pixels. Let the setup be 5 slave PCs that render in parallel and 1 master PC which combines the results.
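The frame rate bound for this example setup is the link bandwidth divided by the number of bits that must cross the network per frame. The following is a minimal sketch of that arithmetic, assuming 3 bytes of RGB and 4 bytes of depth per pixel and ignoring any protocol overhead; the function and variable names are illustrative and not part of the described system.

```python
def max_frame_rate(bandwidth_bits_per_s, bytes_per_frame):
    """Upper bound on frames per second if bytes_per_frame must cross the link."""
    return bandwidth_bits_per_s / (bytes_per_frame * 8)


BANDWIDTH = 1_000_000_000   # 1 GBit/s
PIXELS = 1000 * 1000        # image resolution
SLAVES = 5
RGB_BYTES = 3               # assumed color size per pixel
DEPTH_BYTES = 4             # assumed depth size per pixel

# Sort-first: one amortized color buffer per frame.
sort_first_bytes = PIXELS * RGB_BYTES
# Trivial sort-last: every slave sends its full color and depth buffer.
sort_last_bytes = SLAVES * PIXELS * (RGB_BYTES + DEPTH_BYTES)

sort_first_hz = max_frame_rate(BANDWIDTH, sort_first_bytes)
sort_last_hz = max_frame_rate(BANDWIDTH, sort_last_bytes)

print(f"sort-first: {sort_first_hz:.2f} Hz")   # 41.67 Hz
print(f"sort-last:  {sort_last_hz:.2f} Hz")    # 3.57 Hz
```

Changing `SLAVES` or the resolution shows how quickly the trivial sort-last bound drops, while the sort-first bound is independent of the number of slaves.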
Sort-first: We need to send 1000·1000·3 (RGB) bytes = 3 million bytes = 24 million bits each frame. This is because the render slaves together send one amortized color buffer each frame. With an available bandwidth of 1 GBit/s, we can achieve a maximum frame rate of 41.67 Hz.

Sort-last: We need to send 5·1000·1000·(3 (RGB) + 4 (depth)) bytes = 35 million bytes = 280 million bits each frame. This is because each render slave sends its complete color buffer and depth buffer for each frame. With an available bandwidth of 1 GBit/s, we can achieve a maximum frame rate of 3.57 Hz.

In general, with n slave PCs rendering in parallel and an image resolution of m pixels, the slaves need to send a total of O(m) bytes of color information for each frame with a sort-first setup. The master PC has to combine the partial color results into a final rendered image (cf. figure 2). With a sort-last setup, the slaves need to send O(n · m) bytes of depth information and an amortized color buffer (compare section 3.1) each frame, resulting in O((n + 1) · m) data. The master PC has to combine the partial color and depth results into a final rendered image (cf. figure 3).
The Hierarchical Depth Buffer (HRD)
Despite its worse complexity, the sort-last approach has the advantage of being easier to balance. Because of this crucial advantage it is worthwhile to look for improvements to the sort-last approach. This section describes a hierarchical data structure that we introduced to reduce the required network traffic in the case of sort-last parallel rendering. When using the trivial sort-last approach for parallel rendering, the slave PCs have to send all depth buffers to the render master. The master then compares all depth values for each pixel and chooses the pixel (color) with the smallest depth value. Different solutions have been proposed to avoid each slave having to send all of its depth information, as in [Cox and Hanrahan 1993], where a broadcasting mechanism is used to construct a global frame buffer sequentially. To reduce the network load with the sort-last approach, we utilize a hierarchical depth buffer (HRD). An HRD is a quadtree whose inner nodes contain the overall maximum and minimum depth values (depth interval) of all their child nodes. The leaves of the HRD correspond to the values of the original depth buffer or to a small constant size subset of it (to avoid sending very small data packages over the network). The hierarchies of the HRDs are sent by the render slaves top to bottom on demand and are examined by the render master (during the combination of the frames) top to bottom.
Figure 4: A three-level HRD example
This allows the combiner to decide, starting at the root of the HRD, which color buffers are overlapping or disjoint with respect to their depth. If their depth intervals are pairwise disjoint, only the color buffer with the nearest depth interval has to be considered. Overlapping intervals have to be compared recursively by traversing the HRD downwards to the leaves. At the leaves, a simple depth comparison is performed.
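The top-down decision made by the combiner can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the described system: depth buffers are square, power-of-two sized lists of lists, HRD nodes are computed on the fly by the hypothetical helper `node_interval` instead of being requested over the network, and leaves are single pixels. The sketch also discards any slave whose interval lies entirely behind another slave's interval, a slight generalization of the pairwise disjoint test.

```python
def node_interval(depth, x, y, size):
    """Depth interval (min, max) of one square block, i.e. one HRD node."""
    values = [depth[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    return min(values), max(values)


def combine(depths, x, y, size, candidates, owner, stats):
    """Decide which slave supplies the color for the block at (x, y)."""
    intervals = {s: node_interval(depths[s], x, y, size) for s in candidates}
    stats["nodes_requested"] += len(candidates)
    # A slave whose nearest depth lies behind another slave's farthest depth
    # cannot win any pixel of this block and is dropped.
    nearest_max = min(hi for _, hi in intervals.values())
    alive = [s for s in candidates if intervals[s][0] <= nearest_max]
    if len(alive) == 1 or size == 1:
        # Decided: either one slave is in front, or this is a leaf comparison.
        winner = min(alive, key=lambda s: intervals[s][0])
        for j in range(y, y + size):
            for i in range(x, x + size):
                owner[j][i] = winner
        return
    half = size // 2
    for dy in (0, half):          # overlapping intervals: descend one level
        for dx in (0, half):
            combine(depths, x + dx, y + dy, half, alive, owner, stats)


def composite(depths):
    """Return the per-pixel winning slave and the number of HRD nodes examined."""
    size = len(depths[0])
    owner = [[None] * size for _ in range(size)]
    stats = {"nodes_requested": 0}
    combine(depths, 0, 0, size, list(range(len(depths))), owner, stats)
    return owner, stats["nodes_requested"]


# Slave 0 is completely in front of slave 1, so the roots already decide.
front = [[0.2] * 4 for _ in range(4)]
back = [[0.8] * 4 for _ in range(4)]
owner, requested = composite([front, back])
print(requested)  # 2
```

In this example only the two root intervals are examined before all pixels are assigned to slave 0; with interleaved depths the recursion continues down to the leaves.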
Comparison to Trivial Approach
To compare the complexity of the HRD approach with that of the trivial algorithm, we are interested in the amount of data that needs to be sent across the network. Let n be the number of render slaves and m be the size of the depth and color buffers. As mentioned above, in the trivial approach the render master has to do a pixel by pixel depth comparison and request the pixel colors with the smallest depths from the render slaves. That means that the overall amount of data that has to be sent across the network is n depth buffers of size m and an amortized complete color buffer of the same size. Thus both the worst case and the best case complexity (in terms of the size of data to be transmitted) of the trivial approach are in O((n + 1) · m). The worst case performance of the HRD approach is also in O((n + 1) · m). In the worst case all depth intervals of all HRD levels of the slaves overlap, and the decision which color to choose can only be made at the n · m leaves (cf. figure 5). The pixels are then requested from the respective slaves.
Figure 5: Worst case depth distribution among render slaves (one HRD level shown)
The best case performance is achieved when the depth intervals that correspond to the roots of the HRDs (i.e. the overall depth ranges of the slaves' scenes) indicate that one interval has a maximum depth value that is smaller than the minimum value of all other intervals (cf. figure 6). That means that this slave's scene is completely in front of all other scenes. In this case the render master needs only a constant size data set from each slave to decide which pixels to choose. These O(m) pixels are then requested only from the respective slave. Thus, the best case performance is in O(n + m). As an example of best case performance we built a small scene with four quadrilaterals, each rendered by a separate render slave. One of the four quadrilaterals covers the complete viewport in front of all other quadrilaterals. For this setup we compared the frame rate with and without HRD. Figure 7 shows the results. The upper curve represents measured values with activated HRD, the lower curve represents values without HRD. While the lower curve stays constant at around 2.7 Hz, the upper curve shows a more or less constant 8.6 Hz, which is a speedup of about 3.2.
Figure 7: Comparison of frame rates for rendering an optimal scene with (upper curve) and without HRD (lower curve)
The average case performance will depend on a number of factors:
• Type of scene: When the scene has many large occluders, it is likely that they will enable the render master to decide at levels close to the root of the HRDs.
• Subdivision of the scene: The best subdivision of the scene across the slaves is one that minimizes, from all possible viewing positions and directions, the number of overlapping depth intervals at the root of the HRDs as a first priority, the number of overlapping depth intervals at the second level as a second priority, etc.
• Viewer's path through the scene: Similar to the last case, the best path is one that minimizes, from all viewing positions and directions along the path, the number of overlapping depth intervals at the root of the HRDs as a first priority, the number of overlapping depth intervals at the second level as a second priority, etc.
The hierarchical depth buffer has to be built by each render slave each frame because the depth information for a pixel usually changes from frame to frame. Therefore, the construction of the HRD has to be fast. For an original depth resolution of 1024 by 1024 pixels, we measured a construction rate of 53 Hz.
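One way to keep the per-frame construction cheap is to build the HRD bottom-up, deriving each inner node from its four children so that every depth value is read only once. The sketch below assumes a square depth buffer whose side length is a power of two and single-pixel leaves; `build_hrd` is an illustrative name, and the code is not the implementation that was measured above.

```python
def build_hrd(depth):
    """Return the HRD as a list of levels, root first.

    Each level is a square grid of (min_depth, max_depth) tuples; the last
    level holds the leaves, one per pixel of the original depth buffer.
    """
    level = [[(d, d) for d in row] for row in depth]   # leaves: min == max
    levels = [level]
    while len(level) > 1:
        half = len(level) // 2
        parent = []
        for j in range(half):
            row = []
            for i in range(half):
                kids = (level[2 * j][2 * i], level[2 * j][2 * i + 1],
                        level[2 * j + 1][2 * i], level[2 * j + 1][2 * i + 1])
                # An inner node spans the depth intervals of its four children.
                row.append((min(k[0] for k in kids), max(k[1] for k in kids)))
            parent.append(row)
        levels.append(parent)
        level = parent
    levels.reverse()
    return levels


depth = [
    [0.1, 0.2, 0.5, 0.6],
    [0.3, 0.4, 0.7, 0.8],
    [0.9, 0.9, 0.2, 0.2],
    [0.9, 0.9, 0.2, 0.2],
]
hrd = build_hrd(depth)
print(len(hrd))      # 3 levels for a 4 by 4 buffer
print(hrd[0][0][0])  # (0.1, 0.9), the root interval
```

Because each level has a quarter of the nodes of the level below it, the total number of nodes is less than 4m/3 for m pixels, so the construction cost is linear in the buffer size.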
Faster Network Hardware
With the hardware components chosen and set up with care, the achievable network bandwidth of Gigabit Ethernet is quite close to the theoretical maximum. However, as shown in the introduction, this bandwidth is insufficient in many cases, especially for sort-last rendering. Besides algorithmic approaches to reduce the amount of network traffic, faster networking hardware can help to improve scalability.
Figure 6: Best case depth distribution among render slaves (one HRD level shown)
SCI as a fast network for frame buffer distribution
We investigated the Scalable Coherent Interface (SCI) cluster hardware [Hellwagner 1999]. As opposed to other networking technologies, SCI provides a distributed shared memory in hardware. Memory segments on one node can be mapped into the address space of
a remote node; the SCI hardware takes care of keeping the memory coherent on both sides. This can reduce the data transmission from host A to host B to a read/write operation to the shared memory segment, avoiding any copying of data to temporary buffers. With a link speed of 667 MByte/s per direction, SCI is about five times faster than Gigabit Ethernet. In addition, very low latencies of down to 1.4 microseconds can be achieved. With the most recent D334 SCI boards from Dolphin Interconnect AS, a maximum throughput of 326 MBytes/s (application to application) was measured on Intel IA64 systems with the 460 GX chipset. For our tests, we used Dual 1.7 GHz Xeon boards with i860 chipset and D330 cards from Dolphin in a small cluster setup of three nodes. On this hardware combination, we measured a throughput of 169 MByte/s.
SCI for frame buffer transmission
We extended our rendering back-end to support SCI using the SISCI API [SISCI Project Group n. d.], which provides direct access to the shared memory functionality of SCI. For benchmarks, we concentrated on sort-last rendering to eliminate view and dataset dependencies as far as possible. Table 2 shows benchmark figures for the Stanford Dragon data set [Stanford Computer Graphics Lab n. d.], consisting of 871k triangles, at a window size of 1000 x 1000 pixels with uncompressed transmission of the color and depth buffers:
Figure 8: Sort-last rendering: Stanford Dragon with two slaves
Configuration     Rate      Factor
Unparallelized    5.9 Hz    1.0
SCI               11.4 Hz   1.93
GigE              7.5 Hz    1.27
Table 2: Sort-Last Benchmark results: Stanford Dragon
Due to the sort-last parallelization, the view-dependent changes of the frame rates turned out to be as negligible as expected. The unparallelized performance is measured on a single node, without any distribution overhead, and is therefore a reference value for performance gain. The parallelized performance is measured on a 3-node cluster consisting of two rendering slaves and a combiner node. Using SCI, the frame rate is almost doubled, yielding an effective parallelization gain of approx. 66 % per node. Transmitting a 7 megabyte frame buffer takes 42 milliseconds, which is almost the full SCI throughput measured using the manufacturer benchmarks. Solving the equation from Section 1 for the number of slaves n at a given network bandwidth b, a frame size of s, and a minimum target frame rate of f_min, we get

$$n \le \frac{b}{s \cdot f_{min}}$$

Assuming 10 Hz for f_min and the measured bandwidth of 169 MBytes/s of our test setup, we see that the maximum number of slaves with this hardware configuration is two. Given the 329 MBytes/s from Section 4.1, we get n = 4. As these values are based on raw, uncompressed transmission of frame buffer contents, they can be improved considerably by the bandwidth reduction techniques described in this paper. Due to the ring topology of SCI networks, which allows connecting up to 6 nodes with single-link cards without an external switch, a rendering cluster with 4 slaves and a master can be realized with a single SCI card per cluster PC.
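The bound on the number of slaves can be evaluated directly for the two bandwidth figures quoted above. The following sketch assumes a 7 MByte frame (1000 x 1000 pixels with 3 bytes of color and 4 bytes of depth) and takes 1 MByte as 10^6 bytes; `max_slaves` is an illustrative helper name.

```python
import math


def max_slaves(bandwidth_mb_per_s, frame_mb, min_frame_rate_hz):
    """Largest integer n with n <= b / (s * f_min)."""
    return math.floor(bandwidth_mb_per_s / (frame_mb * min_frame_rate_hz))


FRAME_MB = 7.0   # uncompressed color + depth buffer per slave
F_MIN = 10.0     # minimum target frame rate in Hz

n_measured = max_slaves(169.0, FRAME_MB, F_MIN)   # throughput of the test setup
n_faster = max_slaves(329.0, FRAME_MB, F_MIN)     # higher figure quoted above

print(n_measured)  # 2
print(n_faster)    # 4
```

Any technique that shrinks the effective frame size s, such as the HRD or compression, raises this bound proportionally.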
Further Efficiency Strategies
Compression of Data
As a further option to decrease the amount of data that has to be sent across the network, we utilize online compression of data. To maintain the quality of the original data, especially of the depth data, we decided to use a lossless compression algorithm. After benchmarking several algorithms, we chose a fast online Lempel-Ziv compression algorithm [Lehmann 2000]. The chosen algorithm is able to compress up to 40 frames of resolution 1000 by 1000 pixels each second with realistic scenes. We did not try a run length compression as in [Ahrens and Painter 1998] because we think that with large Gouraud shaded objects run length encoding might be less efficient than Lempel-Ziv encoding. An important application of our system is an interior walk through of very large construction sites. In most cases, this means that most of the image is covered by lit (and possibly textured) objects, so that subsequent pixels may be of similar color, but most probably not identical, so that no reasonable run lengths can be expected. Compression adds latency to the system. Unfortunately, the latency increases further when the compression algorithm cannot compress the data well. Because of this, compression can be switched on and off by a flag in the configuration files of the slaves and the master. This also allows comparing the behavior of the parallelized application with and without compression.
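The effect of a switchable lossless compression stage can be illustrated with a short sketch. Note that it uses Python's standard `zlib` module (DEFLATE, which is also Lempel-Ziv based) as a stand-in for the LZF library used in the system, and that the `compress_enabled` flag and the function names are hypothetical placeholders for the configuration flag mentioned above.

```python
import struct
import zlib


def pack_buffer(raw, compress_enabled):
    """Slave side: optionally compress a frame buffer before sending."""
    # Level 1 favors speed over ratio, which matters for per-frame use.
    return zlib.compress(raw, 1) if compress_enabled else raw


def unpack_buffer(packet, compress_enabled):
    """Master side: undo the compression if it was enabled."""
    return zlib.decompress(packet) if compress_enabled else packet


# Synthetic 64 x 64 depth buffer of 32-bit floats: far plane with a near square.
width = height = 64
depth = [1.0] * (width * height)
for y in range(16, 48):
    for x in range(16, 48):
        depth[y * width + x] = 0.25
raw = struct.pack(f"<{len(depth)}f", *depth)

packet = pack_buffer(raw, compress_enabled=True)
restored = unpack_buffer(packet, compress_enabled=True)

print(len(raw), len(packet))   # raw size vs. compressed size
print(restored == raw)         # True: the depth values survive bit for bit
```

Both sides must agree on the flag, which is why it is kept in the configuration of the slaves and the master alike. The synthetic buffer compresses very well; realistic depth and color data will compress less.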
Retained Mode Distribution of Changes
Besides frame buffer distribution, the second major source of network traffic is geometry distribution to the rendering slaves. Two major approaches can be distinguished:
• Immediate Mode. All graphics primitives to be drawn are sent to the rendering slaves every frame. The network traffic is directly related to the scene complexity, regardless of updates in the scene. It can be reduced by using caching mechanisms of the underlying graphics API (e.g. display lists in OpenGL); however, the stability and efficiency of display lists tend to be implementation- and hardware-dependent. WireGL [Humphreys et al. 2001] and Chromium [Humphreys et al. 2002] are examples of immediate mode distribution.
• Retained Mode. In the Retained Mode approach, all geometry is cached on the rendering slaves independently of API-integrated caching mechanisms. An efficient way to do this is to use a scene graph as the data structure to be distributed and retained on the slaves. The per-frame network traffic is reduced to updates in fields of the scene graph nodes. The per-frame amount of network traffic depends on the application; it ranges from a view matrix in a pure walk through to huge polygon meshes, e.g. in an immersive modeling application. Examples of this are Avango [Tramberend 1999] and OpenSG [Voss et al. 2002].
Instead of replicating only the scene graph data structure, the entire application manipulating the scene graph can be replicated on the slaves, too. The per-frame network traffic is then reduced to the application input data, e.g. tracking sensor positions and button events in the case of a typical VR system, at the cost of some amount of unnecessarily replicated computation. Systems following this approach are VR Juggler [Bierbaum et al. 2001] and Lightning [Bues et al. 2001]. [Staadt et al. 2003] give a performance analysis of these distribution approaches on a 4-node cluster without combiner stage, showing the network bottleneck of the immediate-mode distribution scheme.
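The retained mode idea of sending only changed fields can be sketched with a toy scene graph. The dictionary layout, node names, and the helpers `diff_fields` and `apply_changes` below are purely illustrative assumptions; they do not reflect the APIs of Avango, OpenSG, or any other system cited above, and node removal is not handled.

```python
import copy


def diff_fields(previous, current):
    """Return {node_id: {field: new_value}} for all fields changed since the last frame."""
    changes = {}
    for node_id, fields in current.items():
        old = previous.get(node_id, {})
        changed = {k: v for k, v in fields.items() if old.get(k) != v}
        if changed:
            changes[node_id] = changed
    return changes


def apply_changes(replica, changes):
    """Slave side: patch the retained scene graph copy with the received updates."""
    for node_id, fields in changes.items():
        replica.setdefault(node_id, {}).update(fields)


# Master scene graph: one camera node and one (tiny) mesh node.
master = {
    "camera": {"view_matrix": (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1)},
    "mesh_0": {"vertices": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]},
}
replica = copy.deepcopy(master)    # distributed once, then retained on the slave
previous = copy.deepcopy(master)   # master's snapshot of the last frame

# Pure walk through: only the camera moves in this frame.
master["camera"]["view_matrix"] = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, -5, 1)
changes = diff_fields(previous, master)
apply_changes(replica, changes)

print(list(changes))      # ['camera']: the mesh is not retransmitted
print(replica == master)  # True
```

In immediate mode the mesh vertices would be sent again every frame; here the per-frame message contains only the view matrix.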
Combiner Trees
As seen in Section 4, the network bandwidth limits the number of inputs of a combiner node and therefore the number of rendering slaves which can participate in generating a single final image. To scale beyond this limitation, the combiner nodes can be cascaded, forming a tree whose root is the master combiner driving the display. The inner nodes are "intermediate" combiners feeding the combiners of the next tree level. The leaves are the rendering slaves. Figure 9 gives an example of a combiner tree collecting the output of 4 rendering slaves, built from combiners with two inputs each. The aggregated network bandwidth as seen from the rendering slaves is the sum of the bandwidths of the combiner nodes of the level right below. On the other hand, the combiner tree can be seen as a pipeline; the nodes of each tree level form a stage of the pipeline. Each stage passes on its processing result at frame boundaries. Thus, one image generated by the rendering slaves must pass through the entire pipeline before it can be displayed, which takes (n − 1) frames on a combiner tree with n levels. At a given frame rate r, this leads to an additional latency of (n − 1)/r.
Figure 9: 4-slave combiner tree (nodes: Render Slaves 0 to 3, two Intermediate Combiners, one Master Combiner)
Conclusion and Future Work
We presented some approaches to attack the network bottleneck of parallel rendering with commodity hardware. Our strategies are a combination of software and hardware solutions. All of them can be combined in various ways. Wherever possible, we supplied performance numbers for the approaches. For the newly introduced hierarchical depth buffer we gave an analysis of the expected runtime behavior in the best and worst case. The results obtained so far indicate that the network bottleneck can be overcome; sort-last parallelization can be expected to become usable for reasonable display resolutions in the near future. The following points will be subject of future work:
• Evaluation of larger SCI clusters and other fast networking hardware solutions (VLAN/link aggregation for Gigabit Ethernet, 10 Gigabit Ethernet).
• The use of run length encoding or even video codecs.
• Thorough analysis of the average case performance of the HRD and usage of the results for the development of optimized subdivision techniques for HRD-based sort-last rendering.
References
AHRENS, J., AND PAINTER, J. 1998. Efficient sort-last rendering using compression-based image compositing. In Second Eurographics Workshop on Parallel Graphics and Visualisation, September 1998.
BIERBAUM, A., JUST, C., HARTLING, P., MEINERT, K., AND CRUZ-NEIRA, C. 2001. VR Juggler: A virtual platform for virtual reality application development. In IEEE Virtual Reality, 89–96.
BUES, M., BLACH, R., STEGMAIER, S., HÄFNER, U., HOFFMANN, H., AND HASELBERGER, F. 2001. Towards a scalable high performance application platform for immersive virtual environments. In Proceedings of Immersive Projection Technology and Virtual Environments, 165–174.
COX, M., AND HANRAHAN, P. 1993. Pixel merging for object-parallel rendering: a distributed snooping algorithm. In Proceedings of the 1993 Symposium on Parallel Rendering, ACM Press, 49–56.
HELLWAGNER, H. 1999. The SCI standard and applications of SCI. In SCI: Scalable Coherent Interface, H. Hellwagner and A. Reinefeld, Eds., vol. 1734 of Lecture Notes in Computer Science. Springer, Berlin.
HINKENJANN, A., BUES, M., OLRY, T., AND SCHUPP, S. 2002. Mixed-mode parallel real-time rendering on commodity hardware. In Proceedings of 5th Symposium on Virtual Reality (SVR 2002), Brazilian Society of Computing.
HUMPHREYS, G., ELDRIDGE, M., BUCK, I., STOLL, G., EVERETT, M., AND HANRAHAN, P. 2001. WireGL: A scalable graphics system for clusters. In SIGGRAPH 2001 Proceedings, ACM Press, 129–149.
HUMPHREYS, G., HOUSTON, M., NG, R., FRANK, R., AHERN, S., KIRCHNER, P. D., AND KLOSOWSKI, J. T. 2002. Chromium: a stream-processing framework for interactive rendering on clusters. In SIGGRAPH 2002 Proceedings, ACM Press, 693–702.
KURC, T. M., KUTLUCA, H., AYKANAT, C., AND BULENT. 1997. A comparison of spatial subdivision algorithms for sort-first rendering. In HPCN Europe, 137–146.
LEHMANN, M. 2000. Liblzf, a very small data compression library.
MOLNAR, S., COX, M., ELLSWORTH, D., AND FUCHS, H. 1994. A sorting classification of parallel rendering. IEEE Comput. Graph. Appl. 14, 4, 23–32.
SAMANTA, R., FUNKHOUSER, T., LI, K., AND SINGH, J. P. 2000. Hybrid sort-first and sort-last parallel rendering with a cluster of PCs. In SIGGRAPH/Eurographics Workshop on Graphics Hardware.
SISCI PROJECT GROUP. Standard software infrastructures for SCI-based parallel systems.
STAADT, O. G., WALKER, J., NUBER, C., AND HAMANN, B. 2003. A survey and performance analysis of software platforms for interactive cluster-based multi-screen rendering. In Proceedings of Immersive Projection Technology and Virtual Environments.
STANFORD COMPUTER GRAPHICS LAB. The Stanford 3D scanning repository.
TRAMBEREND, H. 1999. Avocado: A distributed virtual reality framework. In IEEE Virtual Reality, 14–21.
VOSS, G., BEHR, J., REINERS, D., AND ROTH, M. 2002. A multi-thread safe foundation for scene graphs and its extension to clusters. In Eurographics Workshop on Parallel Graphics and Visualization, 33–38.