IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications 8-10 September 2003, Lviv, Ukraine
A Proposal for a Sort-Middle Cluster Rendering System
Jorge Luis Williams, Robert E. Hiromoto
University of Idaho, Department of Computer Science, Moscow, Idaho 83843 USA, Tel. (208)-885-6589, Fax (208)-885-9052, {jorgew,hiromoto}@cs.uidaho.edu
Abstract: Cluster rendering systems often take a sort-first, sort-last, or a hybrid of these two approaches because it is generally assumed that the hardware pipeline cannot be split between the geometric and rasterization stages. These approaches often limit the scalability of the system because they introduce either load balancing problems or high contention for the communication network. We propose a novel approach for dividing up the pipeline and distributing work in a sort-middle fashion that has the promise to aid in the scaling of a cluster rendering system. We have shown, in preliminary experiments, that it is possible to implement this approach and, based on these results, we are proposing to extend an existing cluster rendering system (Chromium) to incorporate our approach.
Keywords: Parallel Cluster Rendering, Load Balancing, Sort-Middle, Pipeline Separation
1. INTRODUCTION
Most computer graphics images are represented in computer memory as a three-dimensional (3D) polygonal model: a collection of polygons that represent the surface of the object to be drawn (or rendered). The standard 3D graphics pipeline creates an image of this model (in 2D image space) that may be rendered to the computer screen. This pipeline contains two major stages. In the geometry stage, polygonal vertices are transformed from 3D object space into 2D image space and the light intensity at each vertex is computed. In the rasterization stage, the light intensity at each vertex is interpolated throughout the surface of the polygon. Often, it is desirable to interact with the model at run time by changing the viewpoint, the light source positions, or the model itself. To improve speed, the 3D graphics pipeline is usually implemented in hardware in a graphics accelerator card.
Typically, it is difficult or impossible to render large models at interactive frame rates due to the increased number of polygons. Advances in graphics accelerator technology continue to increase the number of polygons that a user can render interactively on a single workstation; however, this in turn creates the demand to render larger models, often far exceeding the capacity of a single graphics workstation [1].
Recently, there have been attempts to address this problem by interconnecting a number of graphics workstations via a high-speed network into a cluster rendering system [2, 3, 4, 5, 6]. A cluster has the advantage of providing more memory capacity and more graphics accelerator pipelines than are available on a single workstation. In addition, a cluster can parallelize the rendering task, which has the potential to speed up the renderer. Finally, building a cluster rendering system may be orders of magnitude less
0-7803-8138-6/03/$17.00 ©2003 IEEE
expensive than buying a large specialized parallel rendering machine.
Many different parallel rendering techniques have been developed. These techniques have been classified in terms of when data are sorted in the pipeline [7]. Data can be sorted before the geometry stage (sort-first), between the geometry and rasterization stages (sort-middle), or after the rasterization stage (sort-last). In cluster rendering systems, only sort-first and sort-last techniques have been used. Sort-middle is typically not considered because it is assumed that the hardware pipeline cannot be split between the geometry and rasterization stages. However, the sort-first and sort-last cluster approaches limit the scalability of the system by introducing either load-balancing problems or high network contention.
We have developed a technique for achieving a sort-middle cluster rendering system and have conducted experiments to test the viability of our approach. We have shown in preliminary experiments that it is possible to implement this approach, and from these results we are proposing to extend an existing cluster rendering system (Chromium) [4] to incorporate our approach. Our preliminary results indicate that extending Chromium to support sort-middle would not only create a functionally complete cluster rendering system (Chromium already includes support for sort-first and sort-last), but would also improve the cluster rendering system’s scalability and speed.
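To make the sort-middle case of the classification in [7] concrete, the following is a minimal sketch, not code from our driver or from Chromium. It assumes triangles have already passed through the geometry stage and are given as 2D screen-space vertices, and that the screen is divided into a fixed uniform grid of tiles, one per rasterizer. The function names, the grid layout, and the bounding-box overlap test are all illustrative assumptions.

```python
from collections import defaultdict


def tiles_overlapped(tri, screen_w, screen_h, cols, rows):
    """Return the (col, row) tiles touched by the triangle's bounding box.

    `tri` is a list of three (x, y) screen-space vertices. The bounding-box
    test is conservative: it may list a tile the triangle only nearly touches.
    """
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Triangles whose bounding box lies entirely off screen go to no tile.
    if max(xs) < 0 or min(xs) >= screen_w or max(ys) < 0 or min(ys) >= screen_h:
        return []
    tile_w = screen_w / cols
    tile_h = screen_h / rows
    # Clamp the tile range so partially visible triangles stay on the grid.
    c0 = max(0, min(cols - 1, int(min(xs) // tile_w)))
    c1 = max(0, min(cols - 1, int(max(xs) // tile_w)))
    r0 = max(0, min(rows - 1, int(min(ys) // tile_h)))
    r1 = max(0, min(rows - 1, int(max(ys) // tile_h)))
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def sort_middle(triangles, screen_w, screen_h, cols, rows):
    """Bin screen-space triangles by the tile (rasterizer) that must draw them."""
    bins = defaultdict(list)
    for tri in triangles:
        for tile in tiles_overlapped(tri, screen_w, screen_h, cols, rows):
            bins[tile].append(tri)
    return dict(bins)


if __name__ == "__main__":
    a = [(10, 10), (50, 10), (10, 50)]          # inside the top-left tile
    b = [(390, 100), (410, 100), (400, 120)]    # straddles two tiles
    for tile, tris in sorted(sort_middle([a, b], 800, 600, 2, 2).items()):
        print(tile, len(tris))
```

A triangle that straddles a tile boundary is sent to every tile it overlaps, which is the redistribution cost that sort-middle pays in exchange for non-overlapping output images.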
2. PROPOSED IMPLEMENTATION
We achieved the sort-middle implementation by moving the geometric part of the pipeline away from the graphics accelerator and over to the main processor(s). The standalone graphics accelerator is used only for the rasterization stage. We accomplished this by modifying an open source DRI [8] driver. Because the geometric portion of the pipeline involves mostly floating point operations, it is well suited for the main processor. In fact, modern processors contain technology that helps accelerate the geometric graphics process; SSE2 on the Intel Pentium 4 and AMD’s 3DNow! are examples of such technology. Splitting the geometric and rasterization stages of the pipeline introduces parallelism at the geometric process level that can easily be load-balanced; therefore, any loss in speed of the geometric portion caused by bypassing the graphics accelerator card on a single workstation can be offset by the parallelism.
To separate the stages of the pipeline, two extensions are written for the OpenGL library: triangle-feedback and triangle-rasterize. We use triangles because internally the OpenGL library converts all polygons in a model into triangles before processing them. The triangle-feedback extension allows the OpenGL implementation to pass the model through the geometric portion of the pipeline and return the results of
this portion to the caller instead of rasterizing them. The triangle-rasterize extension takes the results of the geometric portion and allows the data to be rasterized to an image buffer. In other words, the triangle-feedback extension computes only the geometry portion of the pipeline on the main CPU, and the triangle-rasterize extension computes only the rasterization portion on the graphics accelerator card.
Our implementation is based on the Chromium cluster rendering system. Chromium [4] was developed as an extension of WireGL [3]. An important contribution of the WireGL system is that it implemented a protocol for sending a stream of OpenGL graphics commands through a network to another computer where the commands could be executed. Chromium extends this idea by adding Stream Processing Units (SPUs). An SPU is a library that can filter a stream of OpenGL commands. These filters can be linked together in chains to achieve a particular cluster rendering algorithm. For example, Fig. 1 illustrates a sort-last operation.
Fig. 1 – Chromium sort-last operation.
Here rendering is split up into different processes (or applications), each of which generates its own polygons via OpenGL calls. Chromium captures these calls and sends them as a stream to a Readback SPU. This SPU executes the calls and renders the image into a private buffer. It then streams to the Send SPU an OpenGL call that renders this buffer on the screen. The Send SPU simply dispatches any call it receives to a Chromium Server via the WireGL protocol. The Chromium Server receives the WireGL network data and generates a stream of OpenGL calls that are sent to the Render SPU. The Render SPU simply executes the call (in this case each client is sending a render image and composite operation), which renders the final image to the screen.
Fig. 2 – Proposed sort-middle operation.
Fig. 2 illustrates our proposed approach to the sort-middle operation in Chromium. Here an application generates a collection of polygons via an OpenGL stream that is passed to the Poly-break SPU. This SPU distributes them evenly among all geometry servers. The geometry servers compute the geometry portion of the pipeline and send the data to the appropriate rasterization server. The rasterization server performs the rasterization portion of the OpenGL commands it receives and generates an image, which is sent to the render server so that it can be rendered on the screen.
Note that the render server in Fig. 1 and the render server in Fig. 2 behave in significantly different ways. Because there is complete overlap between the images generated by the Readback SPUs, the render server in Fig. 1 must perform a composite operation. In Fig. 2 none of the images produced by the Raster SPUs overlap, so a composite step is not necessary. The images sent through the network in a sort-middle operation are 1/P the size of the images sent in a sort-last operation, where P is the number of Raster SPUs.
The Geometry SPU will use triangle-feedback to perform the geometry portion of the rendering pipeline and will then stream triangle-rasterize operations to the Raster SPU. Data traffic between geometry and rasterization processors is performed in an asynchronous, pipelined manner. To accomplish this we require a two-processor SMP geometry node, where one processor handles the communication and the other performs the actual geometric calculations. A round-robin technique is used to transmit work from the geometry processors to the rasterization processors in order to minimize network contention.
In the rasterization node, we also require a two-processor SMP node with a graphics accelerator card. Here one processor handles communication and transmits work to the graphics accelerator card. The other is used if a dynamic splitting of a rasterization frame is required. This situation occurs when the node detects the need to dynamically re-tile the screen, thereby allowing dormant rasterization servers to take up some of the work to achieve load balance during the rasterization stage.
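The 1/P relationship between sort-middle and sort-last image traffic can be checked with a short calculation. The sketch below is illustrative only: it assumes the screen is split into P equal, non-overlapping tiles, that only color data is sent (no depth buffer), and that the pixel count divides evenly by P. The function name and the example resolution are hypothetical and are not taken from our experiments.

```python
def image_bytes_per_frame(width, height, bytes_per_pixel, num_raster_nodes):
    """Compare per-frame image traffic to the render server.

    Sort-last: every node sends a full-screen image to be composited.
    Sort-middle: every node sends only its own tile, 1/P of the screen.
    """
    if num_raster_nodes < 1:
        raise ValueError("need at least one rasterization node")
    full_image = width * height * bytes_per_pixel
    # Assumption: equal tiles, so each node owns exactly 1/P of the pixels.
    tile_image = full_image // num_raster_nodes
    return {
        "sort_last_per_node": full_image,
        "sort_last_total": full_image * num_raster_nodes,
        "sort_middle_per_node": tile_image,
        "sort_middle_total": tile_image * num_raster_nodes,
    }


if __name__ == "__main__":
    # Hypothetical configuration: 1024x768 screen, 4 bytes per pixel, P = 4.
    for name, value in image_bytes_per_frame(1024, 768, 4, 4).items():
        print(f"{name}: {value} bytes")
```

Under these assumptions the total image traffic into the render server stays at one full frame regardless of P in the sort-middle case, whereas in the sort-last case it grows linearly with P.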
3. CONCLUSION
In the past, sort-middle cluster rendering was considered impossible or at least very difficult to accomplish. Our current work indicates that it is feasible to build such a system, in which parallelism at the geometric level is easily load-balanced, network contention is minimized, the use of inexpensive two-processor SMP systems allows for the synchronization of communication between nodes, and a dynamic tiling algorithm is used to load-balance the rasterization stage.
4. REFERENCES
[1] David P. Luebke. A Developer’s Survey of Polygonal Simplification Algorithms. IEEE Computer Graphics and Applications, May/June 2001, pp. 24-35.
[2] A. Heirich, L. Moll. Scalable Distributed Visualization Using Off-the-Shelf Components. IEEE Parallel Visualization and Graphics Symposium, October 1999, pp. 55-59.
[3] Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, Pat Hanrahan. WireGL: A scalable graphics system for clusters. Proceedings of SIGGRAPH 2001, August 2001, pp. 129-140.
[4] Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D. Kirchner, James T. Klosowski. Chromium: A stream-processing framework for interactive rendering on clusters. Proceedings of SIGGRAPH 2002, July 2002, pp. 693-702.
[5] T. Mitra, T. Chiueh. Implementation and Evaluation of the Parallel Mesa Library. Technical report, State University of New York at Stony Brook, 1998.
[6] Rudrajit Samanta, Thomas Funkhouser, Kai Li, Jaswinder Pal Singh. Hybrid sort-first and sort-last parallel rendering with a cluster of PCs. Proceedings of SIGGRAPH/Eurographics Workshop on Graphics Hardware, August 2000, pp. 97-108.
[7] Steve Molnar, Michael Cox, David Ellsworth, Henry Fuchs. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, July 1994, pp. 23-32.
[8] The DRI Project Homepage. http://dri.sourceforge.net/.