
Visibility Culling per Cache Block with Tiling-Traversal Algorithm

Moon-Hee Choi†, Woo-Chan Park††, and Shin-Dug Kim†
†Department of Computer Science, Yonsei Univ. {mhchoi, sdkim}@yonsei.ac.kr
††School of Engineering, Sejong Univ. [email protected]

Abstract
As many applications in computer graphics require rendering highly complex 3D scenes at interactive rates, the search for an effective visibility culling method has become one of the most important issues in the design of 3D rendering processors. In this paper, we propose a new rasterization pipeline with visibility culling; the proposed architecture performs visibility culling at an early stage of the rasterization pipeline by retrieving data from a pixel cache, without any significant hardware logic such as a hierarchical z-buffer. When the proposed visibility culling method is performed, cache block misses can occur. To reduce this miss rate, we prefetch the pixel cache blocks that will be referenced next, which can be predicted by the tiling-traversal algorithm. Simulation results show that the proposed architecture can achieve a performance gain of about 41% compared with the conventional pre-texturing architecture, and about 17% compared with the hierarchical z-buffer visibility scheme.

Keywords: 3D graphics, graphics hardware, rendering processor, visibility culling, pixel cache

1. Introduction
Recently, 3D computer graphics has become a requisite technology for various multimedia applications, such as 3D animation and games. Although 3D graphics technology has matured significantly, real-time rendering is still difficult due to the growing complexity of computer-animated scenes. One of the well-known methods for dealing with this problem is visibility culling, which reduces unnecessary operations by processing only the actually visible triangles in the 3D graphics pipeline [1, 2, 3, 4, 5, 6]. Visibility culling algorithms have recently regained attention, mainly due to the ever-increasing size of 3D data sets and the emergence of specialized applications [5, 6]. Among the various approaches to visibility culling, image space culling, in which occlusion determination is carried out in image space, is the most suitable for hardware implementation [5]. Typical examples include hierarchical z-buffer visibility (HZB) [2], occlusion culling using hierarchical occlusion maps (HOM) [7], the OpenGL-assisted occlusion culling unit (OCCU) [8], and VISUALIZE fx [10]. Among the image space culling algorithms, the HZB implementation renders all visible triangles; it is an efficient culling architecture with a z-pyramid, based on the z-buffer, that quickly rejects most of the hidden triangles before the texture mapping routine is applied. In this paper, we propose a new rasterization architecture which performs visibility culling by referencing data in the pixel cache at an early stage of the rasterization pipeline.

We call this architecture visibility culling per cache block with tiling-traversal algorithm (VCBT). The VCBT culls occluded blocks of triangles using only a representative depth value (z-value) per pixel block saved in the pixel cache. Thus, unlike the HZB, which performs visibility culling by maintaining the whole or part of the full frame in a z-pyramid, it requires no complex unit and no extra computation to maintain the representative z-values used in visibility culling. When the proposed visibility culling method is performed, cache block misses can occur. To reduce this miss rate, we prefetch the pixel cache blocks that will be referenced next, which can be predicted by the tiling-traversal algorithm. We have built a trace-driven simulator to validate the performance of the VCBT, and we present simulation results using four benchmarks: Quake3 demo I and demo II [17], Crystal Space [18], and Lightscape [19]. The efficiency of the visibility testing in the proposed VCBT is computed by measuring the hit rate of visibility-tested pixel cache blocks at the traversal stage, namely, the average coverage rate for visibility testing. The rates of visibility testing at the traversal stage are about 97% and 98% for Quake3 demo I and demo II, about 90% for Lightscape, and about 96% for Crystal Space. We also measure the efficiency of the visibility culling at the traversal stage of the VCBT, namely, the fraction of hidden blocks rejected among the total input blocks. The culling rates here are about 36% and 39% for Quake3 demo I and demo II, about 27% for Lightscape, and about 4.4% for Crystal Space. The performance of our architecture was evaluated by computing the average visible fragments per cycle from the simulation results. The results show that the VCBT can achieve performance improvements of about 37% for Quake3, 50% for Lightscape, and 41% for Crystal Space, compared with the conventional pixel pipeline architecture.
Similarly, performance gains of 31% for Lightscape and 37% for Crystal Space can also be achieved with respect to the HZB; however, the performance for Quake3 deteriorated by about 1%. The rest of this paper is organized as follows. In Section 2, the rasterization pipeline, the tiling-traversal scheme, and the HZB are presented. The VCBT and its characteristics are described in Section 3, various simulation results and performance evaluations are presented in Section 4, and finally, conclusions are given in Section 5.

2. Related work
The 3D rendering pipeline consists of two steps: geometry processing and rasterization. In the geometry processing step, triangle vertices are transformed from object space into image space. In the rasterization step, a triangle is converted into pixels, which are depth-buffered into the frame memory. Rasterization algorithms can be classified into the scan-line algorithm (SLA) and the tiling-traversal algorithm (TTA) according to the polygon traversing scheme. The former has the simplest form and is easily implemented as a pipelined architecture, so it is widely used [1], but it has weak points in memory efficiency and texture caching. To solve this problem, the latter was introduced in [9]; it provides advantages through reduced memory page processing and increased locality for texture caching. A TTA-based unit can prevent page-crossing overhead effectively: it splits the screen into tiles the size of a memory page, and fragments are computed in tile order. Therefore, we choose the TTA as the rasterization pipeline algorithm.

2.1. A pixel rasterization pipeline
In conventional high-performance rendering processors, texture mapping is performed before the depth comparison (z-testing) [4, 9, 12]; this architecture, shown in figure 1(a), is

denoted as the pre-texturing architecture in [4]. As reported in [4, 12], the pre-texturing architecture supports the same semantics as standard APIs such as OpenGL, but its drawback is that even invisible fragments must undergo texture mapping. To remove these unnecessary operations, a post-texturing architecture was introduced in [4] to perform texture mapping after z-testing. Here, the semantics of OpenGL allow us to split the z-testing stage into two parts: the z-read and z-test operations, carried out before texture mapping, followed by the z-write operation. However, this separation of the post-texturing architecture between reading and writing z-values can cause inconsistency, as reported in [4, 9].
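The tile-order traversal that Section 2 contrasts with scan-line order can be illustrated with a small sketch. This is only a hedged illustration of the idea (the function names and tile mapping are assumptions, not the authors' implementation): both orders visit the same pixels, but tile order finishes one page-sized tile before moving on, which is what keeps consecutive frame-memory accesses within one page.

```python
# Hypothetical sketch: scan-line vs. tile-order traversal of a
# triangle's bounding box.  One tile models one memory page.

def scanline_order(x0, y0, x1, y1):
    """Visit every pixel of the bounding box row by row."""
    return [(x, y) for y in range(y0, y1) for x in range(x0, x1)]

def tile_order(x0, y0, x1, y1, tile_w, tile_h):
    """Visit the same pixels, but finish one tile before moving on,
    so consecutive accesses stay within one memory page."""
    pixels = []
    for ty in range(y0, y1, tile_h):
        for tx in range(x0, x1, tile_w):
            for y in range(ty, min(ty + tile_h, y1)):
                for x in range(tx, min(tx + tile_w, x1)):
                    pixels.append((x, y))
    return pixels
```

Both orders cover exactly the same pixel set; only the access pattern, and hence the page locality, differs.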

(a) pre-texturing
(b) post-texturing
(c) mid-texturing
(d) Hierarchical z-buffer visibility

Figure 1. The pixel rasterization pipelines.

The mid-texturing architecture shown in figure 1(c) performs texture mapping between the first z-testing stage and the second z-testing stage, thereby overcoming the drawbacks of the pre-texturing and post-texturing architectures [4]. That is, this approach eliminates unnecessary texture mapping operations on obscured fragments by rejecting them at the first z-testing stage. In addition, the inconsistency problem caused by separating the z-testing stage is eliminated by performing another z-test at the second z-testing stage. It also reduces the miss penalties of the pixel cache by using a pre-fetch scheme. However, the weakness of the mid-texturing architecture is that it does not eliminate unnecessary processor cycles for pixel processing, because hidden fragments are removed only at the pixel processing step. In addition, mid-texturing does not prevent the pixel pipeline from stalling if a pixel cache miss occurs at the first z-testing stage while the pre-fetch routine is being performed. The proposed VCBT can reduce not only the number of unnecessary operations occurring in the pre-texturing architecture, but also the unnecessary processor cycles occurring in the mid-texturing architecture, since it removes hidden fragments during the scan conversion step.
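The split z-test of the mid-texturing architecture can be sketched as follows. This is a hedged, software-level illustration only (the fragment format, `zbuf` dictionary, and `texture` callable are assumptions): the first z-test rejects obscured fragments before texturing, and a second z-test after texturing re-checks the depth before writing, resolving the read/write inconsistency of the post-texturing split.

```python
# Hypothetical sketch of the mid-texturing fragment ordering
# (smaller z = nearer to the viewpoint).

def mid_texturing(fragments, zbuf, texture):
    """fragments: list of (x, y, z, uv) tuples; zbuf maps (x, y) -> z."""
    out = []
    for x, y, z, uv in fragments:
        if z >= zbuf[(x, y)]:          # first z-test: reject early
            continue
        color = texture(uv)            # texturing only for survivors
        if z < zbuf[(x, y)]:           # second z-test: re-check, then write
            zbuf[(x, y)] = z
            out.append((x, y, color))
    return out
```

The second test matters because another fragment may have updated the z-buffer between the first test and the write.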

2.2. Hierarchical z-buffer visibility
Among the various approaches to image space culling, considered the most suitable for hardware implementation [5], the HZB [2] is an efficient culling architecture with a z-pyramid that quickly rejects most of the hidden triangles before texture mapping. The HZB renders all visible triangles using the conservative visibility mechanism; this is a very important concept: the conservative set includes at least all visible triangles and, in addition, some nearly visible ones [5, 6]. The z-pyramid is constructed as a layered buffer with a different resolution at each level, based on the original z-buffer [2, 6, 8]. As shown in figure 1(d), the HZB removes invisible fragments through visibility culling at the scan conversion step. In HyperZ of the ATI Radeon (a high-performance rendering processor), a simpler form of the HZB mechanism is included [2, 3, 5]; there, a considerable fraction (60-70% on average) of the fragments that would fail the z-test are detected and discarded from the pipeline by this approach [3]. On the negative side, the z-pyramid becomes a large-scale hardware structure, since a frame buffer has to be added for the HZB [2, 3, 5]; in addition, whenever the z-buffer is modified, maintaining the z-pyramid requires propagating the new z-value to coarser levels of the pyramid until the entry in the pyramid is as far away as the new z-value [2]. The VCBT needs only lightweight hardware and provides a simple scheme to maintain the visibility culling information, since it uses only data from the pixel cache.
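The z-pyramid maintenance described above can be sketched in a few lines. This is a hedged illustration of the propagation rule from [2], not a hardware design (the data layout and function names are assumptions): each coarser level stores the farthest (largest) z of its 2x2 children, and a z-buffer write only propagates upward while it actually changes a parent entry.

```python
# Hypothetical z-pyramid sketch (smaller z = nearer to the viewpoint).

def build_pyramid(zbuf):
    """zbuf: a 2**n x 2**n grid of depth values.  Each coarser level
    keeps the FARTHEST (largest) z of its 2x2 children."""
    levels = [zbuf]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        n = len(prev) // 2
        levels.append([[max(prev[2*y][2*x], prev[2*y][2*x+1],
                            prev[2*y+1][2*x], prev[2*y+1][2*x+1])
                        for x in range(n)] for y in range(n)])
    return levels

def write_z(levels, x, y, z):
    """Write a nearer z into the finest level and propagate it to
    coarser levels until an entry is already consistent."""
    levels[0][y][x] = z
    for lvl in range(1, len(levels)):
        x, y = x // 2, y // 2
        fine = levels[lvl - 1]
        new_max = max(fine[2*y][2*x], fine[2*y][2*x+1],
                      fine[2*y+1][2*x], fine[2*y+1][2*x+1])
        if new_max == levels[lvl][y][x]:
            break                      # pyramid already consistent: stop
        levels[lvl][y][x] = new_max
```

The early exit is the point of the rule quoted from [2]: most writes touch only the finest level, but the pyramid itself is still a full extra buffer hierarchy, which is the hardware cost the VCBT avoids.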

3. Visibility culling per cache block with tiling-traversal algorithm
In this section, we describe the proposed rasterization architecture, which effectively performs visibility culling per cache block with the tiling-traversal algorithm.

3.1. The proposed rasterization architecture
Figure 2 shows the rasterization architecture of the proposed VCBT scheme. Like the HZB, the VCBT can considerably reduce the processing cycles wasted in the pixel processing step on hidden fragments. Either the SLA [1] or the TTA [9], both normally used for rasterization pipelines, can be adapted for the VCBT. In this paper, we focus on a VCBT based on the TTA, in order to obtain information about the next referenced cache block, and our discussion is confined to aspects relevant to the TTA. The triangle setup during the scan conversion step handles the initialization of all the necessary parameters and the computations required for the series of increment operations of each triangle. In the traversal stage of Figure 2, object boundaries are determined and fragment stamps are generated using a half-plane edge function [9]. A stamp is a unit for enhancing the efficiency of the memory system; its size is flexible, and it is generally 2^w pixels wide and 2^h pixels high [9]. The representative values of each stamp are also computed at this stage: the z-value of the farthest fragment and that of the nearest fragment from the viewpoint. In addition, visibility culling is performed at the traversal stage using these computed representative values. At the span processing stage in figure 2, a set of fragments for each stamp is generated by interpolating the color and texture coordinates.
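The half-plane edge function and the per-stamp representative depths can be sketched as follows. This is a hedged illustration under assumed names (the stamp size, corner test, and vertex format are not from the paper): a stamp survives if no single edge has all stamp corners on its outside, and the two representative z-values are taken conservatively from the three triangle vertices, as the text describes.

```python
# Hypothetical sketch of stamp generation with half-plane edge functions.

def edge(ax, ay, bx, by, px, py):
    """Signed-area test: >= 0 when (px, py) lies on the inside of
    edge a->b for a counter-clockwise triangle."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def stamp_overlaps(tri, sx, sy, size=2):
    """Conservative test: can the stamp at (sx, sy) overlap the triangle?
    If every corner is outside one edge, the stamp certainly cannot."""
    (ax, ay), (bx, by), (cx, cy) = tri
    corners = [(sx, sy), (sx + size, sy), (sx, sy + size), (sx + size, sy + size)]
    for ea, eb in [((ax, ay), (bx, by)), ((bx, by), (cx, cy)), ((cx, cy), (ax, ay))]:
        if all(edge(*ea, *eb, px, py) < 0 for px, py in corners):
            return False
    return True

def stamp_z_bounds(z0, z1, z2):
    """Per-stamp representative depths taken from the three vertex
    z-values, as the paper suggests (conservative, not exact)."""
    return min(z0, z1, z2), max(z0, z1, z2)
```

Using the vertex z-values over-estimates the stamp's depth range, which is exactly the source of the small culling error rate the paper measures in Table 1.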

Figure 2. The organization of the VCBT.
In the pixel processing step, the texture mapping routine reads four or eight texels from the texture cache, performs a filtering operation on them to produce a single texel, and blends the texel with the pixel color. While processing a specific fragment, z-testing retrieves the corresponding z-value from the pixel cache and compares it with that of the current fragment. Note that the pixel cache consists of a depth cache and a color cache. A new z-value of the current fragment is written into the pixel cache at the z-write stage, provided the fragment turns out to be visible. In particular, if this visible fragment is nearer to the viewpoint than the representative value of the cache block within the Max-z-table, then this value is updated to maintain the visibility information, as described in detail in Section 3.2. At the alpha-blending stage, the color data of the current fragment are read from the color cache of the pixel cache, the alpha-blending is performed with the output from texture blending, and the final color data are written back into the color cache.

3.2 The proposed visibility culling method
The visibility culling method of the VCBT is based on comparing the two representative values (the maximum z-value and the minimum z-value) of the current stamp being processed with the representative value (the maximum z-value) of its corresponding cache block resident in the pixel cache. To perform visibility culling, a compare unit, a Max-z-generator, and a Max-z-table are added to the general pipeline. The compare unit compares the representative values of a particular stamp with the representative value of its corresponding cache block in order to carry out visibility culling at the traversal stage. To reduce the computation and additional hardware required to obtain the representative values of each stamp, the two representative values for each stamp are taken from the three z-values of the vertices of the triangle. The use of these representative values can adversely affect the accuracy of visibility culling: an obscured block may still be processed in the pixel processing step. However, the error rate of the visibility culling is so low that it has practically no effect on the performance of visibility culling, as demonstrated in Table 1 of Section 4.1 by computing error rates for the four benchmarks. The representative value of each cache block is stored in the Max-z-table. Maintaining the Max-z-table is very easy: the representative value is updated when a new z-value nearer to the viewpoint than the corresponding representative value arrives at the z-buffering stage, or when a new pixel cache block is loaded from the frame memory. The Max-z-generator calculates a new representative value for a particular cache block loaded from the frame memory. Depending on the z-value size and the cache block size, the entry size of the Max-z-table and the number of comparisons required by the Max-z-generator vary.
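The block-level test performed by the compare unit can be sketched as follows. This is a hedged software analogue, not the hardware design (the table representation, return values, and function names are assumptions): a stamp is culled when even its nearest z is no nearer than the farthest z recorded for its cache block, and the Max-z-generator recomputes the representative value when a block is loaded from frame memory.

```python
# Hypothetical sketch of the VCBT traversal-stage visibility test
# (smaller z = nearer to the viewpoint).

def visibility_test(max_z_table, block_id, stamp_min_z):
    """Return 'culled', 'visible', or 'unknown' (block not resident)."""
    block_max_z = max_z_table.get(block_id)
    if block_max_z is None:
        return "unknown"            # cache miss: cannot test, process normally
    if stamp_min_z >= block_max_z:
        return "culled"             # every fragment of the stamp is hidden
    return "visible"

def max_z_of_block(block_depths):
    """Max-z-generator: representative value for a cache block loaded
    from frame memory (the farthest depth in the block)."""
    return max(block_depths)
```

When the test returns "unknown", the block is not resident, which is exactly the miss case that the prefetching described below is meant to reduce.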
At the visibility culling stage, the tag field of the pixel address of the current stamp is examined in the tag table of the pixel cache; the pixel address of the stamp and that of the cache tag are compared. If the tag comparison reveals a hit, pipeline execution proceeds immediately; otherwise (on a miss), the pipeline stalls until the corresponding cache block is loaded. To reduce the pixel cache miss rate at the traversal stage, the neighboring stamps as well as the current stamp are loaded from frame memory. In the TTA, the positions of the next stamps are computed and saved in the contexts (special buffers) for traversing the triangle while the half-plane edge function for the current stamp is evaluated [9]. Thus, the next referenced stamps can easily be identified and loaded in advance.
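The prefetch idea can be sketched as follows. This is a hedged illustration under assumed parameters (the 8x8-pixel block size and the block-address mapping are not from the paper): while the current stamp is being processed, the traversal contexts already name the stamps to be visited next, so any of their cache blocks that are not yet resident can be requested ahead of time.

```python
# Hypothetical sketch of TTA-guided pixel cache prefetching.

BLOCK = 8   # assumed: a cache block covers an 8x8-pixel region

def block_of(x, y):
    """Map a pixel coordinate to its cache block address."""
    return (x // BLOCK, y // BLOCK)

def prefetch_targets(cur_x, cur_y, next_stamps, cached):
    """Blocks to fetch for the upcoming stamps: those not already
    resident and not the current stamp's own block."""
    wanted = {block_of(x, y) for x, y in next_stamps}
    wanted.discard(block_of(cur_x, cur_y))
    return sorted(wanted - cached)
```

In hardware this overlap is what hides the miss latency: the fetch for the next block proceeds while the edge functions for the current stamp are still being evaluated.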

4. Simulation and Performance evaluation

(a) Quake3 DemoI
(b) Quake3 DemoII
(c) Crystal Space
(d) Lightscape

Figure 3. The benchmarks.

In order to evaluate the proposed architecture, we built a trace-driven simulator by modifying the simulator reported in [4]. The traces were generated from the benchmarks Quake3 demo I, Quake3 demo II, Crystal Space, and Lightscape at 1024 × 768 screen resolution, by modifying the Mesa OpenGL-compatible API (Application Programming Interface). For each benchmark, 100 frames were used to generate each trace. Figure 3 shows the scenes captured from these benchmarks. Quake3 and Crystal Space are OpenGL-based game engines; Quake3 is a typical current video game and is frequently used as a benchmark by other authors in their simulations. These game engines are architectural walkthroughs with visibility culling [16]. Lightscape is a product of SPECviewperf [19], an industry-standard benchmark for measuring the performance of 3D rendering systems running under OpenGL; it was used as a benchmark in this paper because of its high scene complexity compared to the other SPECviewperf products.

4.1 Efficiency of visibility culling at the VCBT

Figure 4. The rate of visibility culling at the traversal stage within the VCBT.

Table 1. Error rate of visibility culling at the VCBT.

Figure 4 shows the percentage of visibility-tested stamps at the traversal stage, namely, the average coverage rate for visibility culling. This rate equals the hit rate of the cache blocks for visibility culling, denoted by Hblock. The fraction of stamps visibility-tested at the traversal stage is about 97% and 98% for Quake3 demo I and demo II, about 96% for Crystal Space, and about 90% for Lightscape. As described in Section 3.2, to reduce the rate of untested stamps, we fetch not only the block corresponding to the current stamp but also the next blocks. Figure 4 also clearly shows the rate of stamps removed by the proposed visibility culling method at the traversal stage; the percentage of invisible blocks is about 36% and 39% for Quake3 demo I and demo II, about 4.4% for Crystal Space, and about 27% for Lightscape. Table 1 shows the error rate of visibility culling with respect to the representative values in the VCBT. As mentioned in Section 3.2, this error could reduce the accuracy of visibility culling, but it is so low that it has no significant impact on the performance of the VCBT.

4.2 Performance Evaluation
To evaluate the performance of the VCBT analytically, we calculate the average visible fragments per cycle (AFPC), assuming that only the miss penalties of the pixel cache and the texture cache can degrade the overall performance; these assumptions are similar to those in [4]. This assumption implies that the bandwidth between the caches and the external memory is infinite and does not affect the overall performance. In this paper, we compare the pre-texturing and mid-texturing architectures as well as the HZB described in Section 2. The AFPC of the architectures under consideration can be computed as follows:

AFPC_pre = 1 / (H_pix × H_tex + (1 − H_pix) × T_pix + (1 − H_tex) × T_tex),

AFPC_mid = 1 / [(H_pix × γ) + {H_pix × (1 − γ) × (H1_pix × H_tex + (1 − H1_pix) × T_pix + (1 − H_tex) × T_tex)} + {(1 − H_pix) × ((1 − H_tex) × T_tex + T_pix × (1 − reduction))}],

AFPC_HZB = 1 / [(H_pix × H_tex + (1 − H_pix) × T_pix + (1 − H_tex) × T_tex) × (1 − γ1)],

AFPC_VCBT = 1 / [(H2_pix × γ2) + (H2_pix × (1 − γ2) × (H_tex + (1 − H_tex) × T_tex)) + (1 − H2_pix) × T1_pix].

The denominators of AFPC_pre and AFPC_mid correspond to ACPF_pre and ACPF_mid of [4], where ACPF is the average number of cycles per visible fragment. H_pix and H_tex are the hit rates of the pixel cache and the texture cache, respectively; T_pix and T_tex are the corresponding cycle counts for the miss penalty of each cache, and γ represents the occlusion rate of the first z-testing failure at mid-texturing. H1_pix = 1 − M1_pix, where M1_pix is the occurrence rate of the case where the full execution cycle for a cache miss penalty is executed, as mentioned in Section 3.3. The variable "reduction" in the denominator of ACPF_mid is the pixel cache miss penalty reduction rate given in [4], which fluctuates according to the length of the pipeline between the first z-testing and the second z-testing of the mid-texturing. For calculating AFPC_HZB, we assume that there is no delay in accessing the HZB. ACPF_HZB can thus be easily calculated by multiplying ACPF_pre by (1 − γ1), where γ1 represents the occlusion rate caused by the visibility culling failure at the HZB. An estimate of the AFPC of the proposed VCBT can be obtained by analyzing the logical sequence of operations, namely,

H2_pix × γ2, H2_pix × (1 − γ2) × (H_tex + (1 − H_tex) × T_tex), and (1 − H2_pix) × T1_pix. H2_pix is the hit rate of visibility-tested stamps at the traversal stage, shown in figure 4, and γ2 represents the occlusion rate caused by the visibility test failure. T1_pix is the cycle count for the miss penalty of the pixel cache in the VCBT. We assume that T1_pix is 12 cycles: the 10 cycles mentioned in [4] plus an additional 2 cycles incurred by the Max-z-generator unit.
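The AFPC formulas above translate directly into code. The sketch below is only a transcription of the equations for experimentation; the sample parameter values in the usage comments are illustrative, not the measured benchmark rates.

```python
# Direct transcription of the AFPC formulas (cycle counts per
# visible fragment appear in each denominator).

def afpc_pre(h_pix, h_tex, t_pix, t_tex):
    return 1.0 / (h_pix * h_tex + (1 - h_pix) * t_pix + (1 - h_tex) * t_tex)

def afpc_mid(h_pix, h1_pix, h_tex, t_pix, t_tex, gamma, reduction):
    return 1.0 / (h_pix * gamma
                  + h_pix * (1 - gamma) * (h1_pix * h_tex
                                           + (1 - h1_pix) * t_pix
                                           + (1 - h_tex) * t_tex)
                  + (1 - h_pix) * ((1 - h_tex) * t_tex
                                   + t_pix * (1 - reduction)))

def afpc_hzb(h_pix, h_tex, t_pix, t_tex, gamma1):
    # ACPF_HZB = ACPF_pre * (1 - gamma1)
    return 1.0 / ((h_pix * h_tex + (1 - h_pix) * t_pix
                   + (1 - h_tex) * t_tex) * (1 - gamma1))

def afpc_vcbt(h2_pix, h_tex, t1_pix, t_tex, gamma2):
    return 1.0 / (h2_pix * gamma2
                  + h2_pix * (1 - gamma2) * (h_tex + (1 - h_tex) * t_tex)
                  + (1 - h2_pix) * t1_pix)

# Example (illustrative rates only): with h_pix = h_tex = 0.95,
# t_pix = 10, t_tex = 20, a culling rate gamma1 > 0 raises AFPC_HZB
# above AFPC_pre, since culled fragments cost no pipeline cycles.
```

With perfect caches and no culling, each formula degenerates to one visible fragment per cycle, which is a quick sanity check on the transcription.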

Figure 5. AFPCs of the various architectures.
Figure 5 shows the AFPCs of the pre-texturing, mid-texturing, HZB, and VCBT architectures. The pixel cache configuration is assumed to be direct-mapped with a cache size of 16K bytes and a block size of 64 bytes, as described in [4]. The miss ratio of the pixel cache has already been given in [4]. For Quake3 demo I and demo II, the performance of the VCBT improves by about 37% compared with pre-texturing and about 19% with respect to mid-texturing; however, compared to the HZB, the performance deteriorated by about 1%. In the case of Crystal Space, the VCBT improves performance by about 41% compared to pre-texturing, about 28% with respect to mid-texturing, and about 37% relative to the HZB. Here, the performance of mid-texturing is better than that of the HZB because the depth complexity of Crystal Space is very low [4]. For Lightscape, the VCBT improves performance by about 50% compared to pre-texturing, about 40% with respect to mid-texturing, and about 31% relative to the HZB.

5. Conclusion
In this paper, we have proposed a new and more effective architecture, VCBT (visibility culling per cache block with tiling-traversal algorithm), which effectively executes visibility culling at the traversal stage by retrieving data from a pixel cache, without the addition of large-scale hardware such as that included in the design of the HZB (hierarchical z-buffer visibility). Simulations with four benchmarks show that the proposed VCBT can improve performance by about 41% on average compared with pre-texturing, by an average of about 27% with respect to mid-texturing, and by an average of 17% with respect to the HZB.

References
[1] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes, Computer Graphics: Principles and Practice, Second Edition, Addison-Wesley, Massachusetts, 1990.
[2] N. Greene, M. Kass, and G. Miller, "Hierarchical z-buffer visibility," Proc. SIGGRAPH '93, pp. 231-238, Aug. 1993.
[3] S. Morein, "ATI Radeon - HyperZ technology," Hot3D Session, 2000 SIGGRAPH/Eurographics Workshop on Computer Graphics Hardware, http://www.ibiblio.org/hwws/previous/www_2000/presentations/ATIHot3D.pdf, Aug. 2000.
[4] Woo-Chan Park, Kil-Whan Lee, Il-San Kim, Tack-Don Han, and Sung-Bong Yang, "An Effective Pixel Rasterization Pipeline Architecture for 3D Rendering Processors," IEEE Transactions on Computers, Vol. 52, No. 11, pp. 1501-1508, Nov. 2003.
[5] N. Greene, "Interactive Geometric Computing Using Graphics Hardware - Visibility Culling using Graphics Hardware," SIGGRAPH 2002 Tutorial Course #31, http://gamma.cs.unc.edu/SIG02_COURSE/course31_2002.pdf, July 2002.
[6] D. Cohen-Or, Y. Chrysanthou, and C. Silva, "A Survey of Visibility for Walkthrough Applications," IEEE Transactions on Visualization and Computer Graphics, Vol. 9, No. 3, pp. 412-431, Jul.-Sep. 2003.
[7] H. Zhang, D. Manocha, T. Hudson, and K. E. Hoff, "Visibility culling using hierarchical occlusion maps," Proc. SIGGRAPH '97, pp. 77-88, July 1997.
[8] D. Bartz, M. Meißner, and T. Hüttner, "Extending Graphics Hardware for Occlusion Queries in OpenGL," Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp. 97-104, Aug. 1998.
[9] J. McCormack, R. McNamara, C. Gianos, L. Seiler, N. P. Jouppi, K. Correll, T. Dutton, and J. Zurawski, "Neon: a (big) (fast) single-chip 3D workstation graphics accelerator," Research Report 98/1, Western Research Laboratory, Compaq Corporation, Aug. 1998, revised July 1999.
[10] N. Scott, D. Olsen, and E. Gannett, "An Overview of the VISUALIZE fx Graphics Accelerator Hardware," The Hewlett-Packard Journal, pp. 28-34, 1998.
[11] M. Meißner, D. Bartz, R. Günther, and W. Straßer, "Visibility Driven Rasterization," Computer Graphics Forum, Vol. 20, No. 4, pp. 283-294, 2001.
[12] M. Woo, J. Neider, and T. Davis, OpenGL Programming Guide, Addison-Wesley, 1996.
[13] M. F. Deering, S. A. Schlapp, and M. G. Lavelle, "FBRAM: A new form of memory optimized for 3D graphics," Proc. SIGGRAPH '94, pp. 167-174, 1994.
[14] H. Igehy, M. Eldridge, and K. Proudfoot, "Prefetching in a texture cache architecture," Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp. 133-142, Aug. 1998.
[15] M. D. Hill, J. R. Larus, A. R. Lebeck, M. Talluri, and D. A. Wood, "Wisconsin architectural research tool set," ACM SIGARCH Computer Architecture News, Vol. 21, pp. 8-10, Sep. 1993.

[16] L. Bishop, D. Eberly, T. Whitted, M. Finch, and M. Shantz, "Designing a PC Game Engine," IEEE Computer Graphics and Applications, Vol. 18, No. 1, pp. 46-53, Jan. 1998.
[17] http://www.idsoftware.com/games/quake/quake3-arena/, 2003.
[18] http://crystal.sourceforge.net, 2003.
[19] http://www.spec.org/gpc/opc.static/viewperf711info.html, 2003.
