CoarseZ Buffer Bandwidth Model in 3D Rendering Pipeline

Ke Yang, Ke Gao, Jiaoying Shi, Xiaohong Jiang, and Hua Xiong
State Key Lab. of CAD&CG, Zhejiang University, Hangzhou, China
{kyang,gaoke,jyshi,jiangxh,xionghua}@cad.zju.edu.cn

Abstract

Depth traffic occupies a major portion of 3D graphics memory bandwidth. In order to reduce depth reading, we propose employing a low-resolution depth buffer, namely the CoarseZ buffer, for tile-level depth culling before the per-pixel test. The maximum depth of a tile is stored in the corresponding entry of the CoarseZ buffer. Simulation results show that a small CoarseZ buffer can achieve a remarkably high culling rate and significantly reduce z-reading bandwidth. We build a model that quantifies the influence of the CoarseZ design parameters on its efficiency and bandwidth. Test results on industrial benchmarks show that CoarseZ with a tile size of 4 and a bit depth of 16 is the best of the tested configurations for reducing memory bandwidth.

1. Introduction

Memory bandwidth is usually a major bottleneck in 3D graphics rendering. Depth buffer traffic is a particular concern when rendering the scenes of today’s applications, which have considerably high depth complexity. Furthermore, the rapidly growing game market demands low-bandwidth devices, making techniques for reducing z traffic an important research area [6, 7, 9, 10]. In this paper we describe a low-bandwidth, high-efficiency depth buffer, namely CoarseZ, and present a model for tuning its efficiency and evaluating its bandwidth.

This paper is organized as follows. Section 2 describes the CoarseZ buffer architecture. Section 3 describes the model used to estimate CoarseZ bandwidth consumption. Section 4 presents the experimental methods used to measure CoarseZ efficiency, together with the experimental results. Conclusions and directions for future work are given in Section 5.

2. CoarseZ Buffer

In this section we present the basic idea of the CoarseZ buffer. As we show, depth reading constitutes a significant part of the overall bandwidth of the pipeline. The CoarseZ buffer decomposes this part into three much smaller parts, all of which depend on the design parameters of the buffer.

2.1. Related Works

To eliminate the bottleneck caused by depth traffic, techniques that cull entire triangles or parts of them before the per-pixel depth test are usually adopted. Most consumer graphics hardware adds one or more extra occlusion culling units. [2, 10] organize the z buffer into a hierarchy to avoid most per-pixel tests and are particularly efficient for densely occluded complex scenes, but they need expensive pre-processing to build and maintain the tree and are difficult to support on GPUs. [1] reduces memory traffic by using a two-level hierarchical buffer (HyperZ), fast clear and compression, and works best when drawing roughly front to back, but no detailed descriptions are available for commercial reasons. We add a much simpler unit than the Hierarchical Z-buffer and HyperZ, requiring only about 100K extra bytes for the CoarseZ buffer. [8] uses a similar low-resolution Z-buffer (LRZ), and accumulates more occlusion information before a triangle pops out of the delay stream. Our CoarseZ buffer performs per-tile culling in a similar way, but differs from LRZ in that CoarseZ is not updated unless all pixels in a tile pass the CoarseZ test, which reduces CoarseZ writes to a trivial amount. [3] adds two counters to adaptively find the optimal position of the depth filter at a cost of only 1-2 bits per fragment, which is close to CoarseZ in hardware cost, but no performance tests on industrial benchmarks are shown.

2.2. CoarseZ Buffer

This paper concentrates on optimizing the depth bandwidth of the 3D rendering pipeline (Fig.1). In commercial architectures the depth test is usually performed before the texture blocks. The bandwidth (B) of rendering a fragment consists of five parts: Z Read (MRZ), Texture Read (MRT), Z Write (MWZ), Color Read (MRC) and Color Write (MWC), each of which consumes roughly 4 bytes per pixel [1].

Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06) 0-7695-2581-4/06 $20.00 © 2006 IEEE

In the most common case, Color Read does not occur. A fragment that is occluded is culled right after Z Read and thus requires only the bandwidth of MRZ. Let rz be the probability of a pixel being rejected in the depth test; then the total bandwidth is

B = MRZ + (1 - rz) × (MRT + MWZ + MWC)    (1)

Fig.1. Rendering Pipeline (blocks: Scan Conversion; Depth Engine with Z Read, Z Test and Z Write; Texture Engine with Texture Cache, Texture Read and Filter; Color Engine with Color Write; External Memory)

Fig. 2 Approx. bandwidth distribution (lower panel, bandwidth distribution with 80% rz: Mrz 62%; Mwz, Mtr and Mwc about 12-13% each)

As an example, our tests show that the 3DMark03 demos Wings of Fury and Battle of Proxycon, which are representative of today’s typical applications, have an average rz of about 40%, indicating that MRZ occupies more than 1/3 of the rendering pipeline bandwidth (Fig.2, upper panel). Nowadays, when a complex scene is organized properly and traversed approximately front to back, e.g. with an octree [2], BSP tree [6] or PLP [4], rz is usually high. In fact, in the 3DMark03 demo Trolls Lair, the average rz is about 80%. In this case (Fig.2, lower panel), B is roughly 50 bits, of which MRZ alone takes more than 60%, revealing the importance of optimizing depth reading.
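The two cases above follow directly from Eq.1. The following is a minimal Python sketch, assuming that each of the four active accesses costs exactly 32 bits (the "roughly 4 bytes per pixel" stated earlier) and that rz is exactly 40% or 80%; the function names are ours and purely illustrative.

```python
BITS_PER_ACCESS = 32  # assumption: each memory access costs 4 bytes per pixel


def pipeline_bandwidth(rz):
    """Eq.1: B = MRZ + (1 - rz) * (MRT + MWZ + MWC), in bits per fragment."""
    m_rz = m_rt = m_wz = m_wc = BITS_PER_ACCESS
    return m_rz + (1.0 - rz) * (m_rt + m_wz + m_wc)


def z_read_share(rz):
    """Fraction of the total bandwidth B that is spent on Z Read."""
    return BITS_PER_ACCESS / pipeline_bandwidth(rz)


for rz in (0.4, 0.8):
    print(f"rz={rz:.0%}: B={pipeline_bandwidth(rz):.1f} bits, "
          f"Z Read share={z_read_share(rz):.1%}")
```

Under these assumptions, rz = 40% gives B of about 89.6 bits with Z Read taking roughly 36%, and rz = 80% gives B of about 51.2 bits with Z Read taking roughly 62%, in line with the approximate figures quoted in the text.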

Therefore, we introduce a low-resolution Z buffer, namely the CoarseZ buffer, prior to the per-pixel test (Fig.3). The maximum depth value within each tile, namely Zmax, can be easily obtained as a by-product by most tile-based implementations [5, 7]. CoarseZ subdivides the screen into a grid of tiles and stores only Zmax for each tile, so accessing it costs much less bandwidth. This newly added block needs only 1 depth read for all pixels of a tile, and can reject large portions of tiles before Z Read. If the closest depth value of a scan-converted tile is still farther than Zmax, then all pixels within that tile must be occluded, which enables conservative per-tile culling. Otherwise, at least one pixel is nearer than Zmax, CoarseZ cannot decide, and the tile has to go through the full-resolution depth test. If all pixels in a tile (yellow tile in Fig.3) pass the CoarseZ test, the CoarseZ buffer should be updated. Unlike [8], if only part of the pixels in a tile (blue tile in Fig.3) pass the test, the CoarseZ buffer is not updated. By updating Zmax much less frequently than [8], the extra memory access caused by CoarseZ is minimized.
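The per-tile decision described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the hardware design: smaller depth values are taken to be nearer, the function name and the covers_whole_tile flag are hypothetical, and the update rule (storing the farthest incoming depth as the new Zmax) is our reading of "should be updated".

```python
def coarse_z_test(frag_depths, zmax, covers_whole_tile):
    """Classify one scan-converted tile against its CoarseZ entry.

    frag_depths: depths of the incoming fragments in the tile (smaller = nearer).
    zmax: the Zmax value currently stored for this tile.
    covers_whole_tile: True if the fragments cover every pixel of the tile.
    Returns (decision, new_zmax).
    """
    nearest = min(frag_depths)
    farthest = max(frag_depths)
    if nearest > zmax:
        # Even the closest fragment is behind everything stored: cull the tile.
        return "culled", zmax
    if covers_whole_tile and farthest < zmax:
        # Every pixel of the tile passes the coarse test, so Zmax is updated
        # (assumed here to become the farthest incoming depth).
        return "pass_and_update", farthest
    # Only part of the tile passes: per-pixel test follows, Zmax is untouched.
    return "pass", zmax
```

Storing the farthest incoming depth keeps the entry conservative in this sketch: after the per-pixel test, no pixel in a fully covered tile can be farther than the farthest fragment that was just drawn over it.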

Fig.3 Z-Buffer (left) and related CoarseZ Buffer (right), in case of 4x4 tile size. Each CoarseZ entry holds the max Z value in its tile.

Fig. 2, upper panel. Bandwidth distribution with 40% rz: Mrz 37%; Mwz, Mtr and Mwc 21% each.

Thus the CoarseZ engine decomposes MRZ into three parts (Fig.4): CoarseZ Read (MRCZ) for all tiles, CoarseZ Update (MWCZ), and per-pixel depth buffer read (M’RZ) for tiles that pass the CoarseZ test, i.e.

MRZ = MRCZ + MWCZ + M’RZ    (2)

There are 2 major factors in designing LRZ-like buffers: the size of the tile and the bit count of Zmax. The value chosen for either factor greatly influences the efficiency of the CoarseZ test, and thus sways MRZ and M. As far as we know, however, related works either have not published the values they selected or have not explained the considerations behind choosing particular values. Few published works on hierarchical depth buffers quantitatively describe the influence of the buffer’s parameters on memory bandwidth; doing so is one of the main contributions of this paper.

Figure.4 Rendering Pipeline with CoarseZ Engine (blocks: Scan Conversion; CoarseZ Engine with CoarseZ Read, CoarseZ Test and CoarseZ Write; Depth Engine; Texture Engine with Texture Cache; Color Engine; External Memory)

3. CoarseZ Bandwidth Model

By Eq.2, MRZ is the sum of MRCZ, MWCZ and M’RZ. For a given fragment stream input, they depend on the choice of two CoarseZ parameters, the tile size n and the depth bit count k. A greater k increases MRCZ but improves comparison precision, which helps increase the culling rate and thus decreases MWCZ and M’RZ. A greater n saves MRCZ by letting more pixels share one CoarseZ entry, but bigger tiles usually have a wider depth range, which makes it harder for CoarseZ to hold them back and lets more pixels escape that would otherwise be culled, hence increasing MWCZ and M’RZ. We try to find a balance between these two contradictory effects exerted by n and k on the three parts of MRZ, so as to find the configuration of CoarseZ that in practice best reduces memory bandwidth.

3.1. CoarseZ Efficiency

Let rz be the probability of a fragment being culled in the depth test. rz is called the depth rejection ratio, and it characterizes the intrinsic complexity of a given fragment stream input. Let rc, namely the CoarseZ rejection ratio, be the probability of a fragment failing the CoarseZ test. Ideally rc equals rz, which happens when CoarseZ culls every occluded pixel before the subsequent depth test receives it. We propose here the metric of CoarseZ efficiency,

E = rc / rz.    (3)

The ideal value of E is 1, and an optimal choice of n and k pushes E nearest to 1.

3.2. Per-pixel Bandwidth Evaluation

Here we consider the bandwidth evaluation function for a pixel on average. By Eq.2, MRZ = MRCZ + MWCZ + M’RZ, where

1. MRCZ is the average bandwidth of reading the CoarseZ buffer. Since n×n pixels share one Zmax of k bits, each pixel contributes

MRCZ = k / n² (bits).    (4-1)

2. MWCZ is the average bandwidth of CoarseZ writes by tiles that pass the test. We define rw as the probability of a passed tile updating CoarseZ. We measured rw under different tile sizes and CoarseZ bit counts, and the results show that it is below 1%. Since the fraction of fragments passing CoarseZ is 1 - rc, we have

MWCZ = (1 - rc) × rw × k / n² = o(MRZ).    (4-2)

Thus MWCZ is trivial compared with MRZ, indicating that the extra write overhead that CoarseZ brings can be ignored.

3. M’RZ, the average bandwidth of per-pixel z-reading by pixels that pass CoarseZ, is (1 - rc) times the bit count of the full-resolution depth buffer, which is traditionally 32 (this bit count is distinct from the CoarseZ bit count k):

M’RZ = 32 (1 - rc) (bits).    (4-3)

To sum up, neglecting MWCZ and substituting rc = rz × E from Eq.3, the per-pixel z-reading bandwidth with the CoarseZ architecture is given by

MRZ = MRCZ + MWCZ + M’RZ ≈ k / n² + 32 (1 - rz × E).    (4)

Here we use E rather than rc to show how close the efficiency of CoarseZ is to its ideal peak. We also denote the percentage of MRZ reduction by CoarseZ relative to the traditional 32-bit bandwidth as DRZ = 1 - MRZ/32. In the following section, test results of E and DRZ as functions of n and k are presented.
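As a worked illustration of the model, the sketch below evaluates MRZ and DRZ for a given (n, k). It is a minimal Python example under stated assumptions: the function names are ours, rw defaults to 0 (which corresponds to neglecting MWCZ), and rz is taken to be exactly 0.8 for Trolls Lair although the paper only gives it as "about 80%".

```python
DEPTH_BITS = 32  # bit count of the traditional per-pixel depth buffer


def z_read_bandwidth(n, k, rz, efficiency, rw=0.0):
    """Per-pixel z-reading bandwidth MRZ in bits, following Eq.4-1 to Eq.4-3."""
    rc = rz * efficiency                 # from E = rc / rz
    m_rcz = k / n**2                     # CoarseZ read
    m_wcz = (1.0 - rc) * rw * k / n**2   # CoarseZ write (negligible in practice)
    m_rz_fine = DEPTH_BITS * (1.0 - rc)  # per-pixel depth read after CoarseZ
    return m_rcz + m_wcz + m_rz_fine


def z_read_reduction(n, k, rz, efficiency, rw=0.0):
    """DRZ = 1 - MRZ / 32."""
    return 1.0 - z_read_bandwidth(n, k, rz, efficiency, rw) / DEPTH_BITS


# Trolls Lair with n=4, k=16, using the measured E of 63.74%.
print(f"{z_read_reduction(4, 16, 0.8, 0.6374):.2%}")
```

With these inputs the sketch yields a DRZ of about 47.87%, which agrees with the n4k16 entry reported for Trolls Lair in the experimental tables.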

4. Experimental Evaluation

One of the most often used benchmarks for 3D graphics accelerators is 3DMark. We choose the 3DMark03 Game Test demos (Fig.5), which are representative of today’s common applications. For each test, 100-150 frames are rendered on our reference rasterizer, and tracing information such as the number of CoarseZ rejections is generated. This information is subsequently fed to an estimator to produce the statistics of CoarseZ efficiency and bandwidth. We first select reasonable values of n under a large enough k in Section 4.1, then examine various values of k under each n in Section 4.2.

Fig. 5. Screenshots of 3DMark03 Game Test demos. Left to right: Wings of Fury, Battle of Proxycon, Trolls Lair

4.1. Various Tile Size

Table 1 shows E and DRZ with n chosen among 4, 8, 16 and 32 in the benchmarks. k is fixed to 24, which is precise enough to distinguish the efficiencies under various n. The case of n=2 was not tested since its computation cost is too high for a practical implementation. As can be expected, the smaller the tile, the higher the efficiency.

Table1. E and DRZ with k=24. rz is about 40% in Wings of Fury and Battle of Proxycon, and about 80% in Trolls Lair

Tile size  Metric  Wings of Fury  Battle of Proxycon  Trolls Lair
n4         E       65.3%          67.4%               64.5%
n4         DRZ     21.5%          22.3%               46.9%
n8         E       43.2%          48.2%               41.5%
n8         DRZ     16.1%          18.1%               32.0%
n16        E       20.1%          34.3%               20.8%
n16        DRZ     7.7%           13.4%               16.3%
n32        E       5.1%           15.3%               8.9%
n32        DRZ     2.0%           6.1%                7.1%

When n grows bigger than 16, no matter what k is, the efficiency of CoarseZ decreases rapidly and contributes relatively little to saving depth-reading bandwidth, and much less to the bandwidth of the rendering pipeline. Considering hardware costs and computation overheads, n4 and n8 would be more practical choices.

4.2. Various CoarseZ Bit Depth

With n fixed to 4 and 8 respectively, we tested E and DRZ under all combinations of (n, k) with k chosen from among 4, 8, 16 and 24. Tab.2-3 show the statistics. CoarseZ contributes little in the case of k4, while it performs efficiently in the cases of k16 and k24, whose results are close to each other. In fact, DRZ of k16 is even a little higher than that of k24, since a smaller k leads to a smaller MRCZ. We therefore conclude that a 16-bit CoarseZ is generally better for practical use.

Table2. E and DRZ with n=4

Bit depth  Metric  Wings of Fury  Battle of Proxycon  Trolls Lair
k4         E       31.51%         6.67%               0.00%
k4         DRZ     11.82%         1.89%               -0.78%
k8         E       55.77%         50.84%              41.90%
k8         DRZ     20.75%         18.77%              31.96%
k16        E       65.08%         67.59%              63.74%
k16        DRZ     22.91%         23.91%              47.87%
k24        E       65.34%         67.39%              64.53%
k24        DRZ     21.45%         22.27%              46.94%

Table3. E and DRZ with n=8

Bit depth  Metric  Wings of Fury  Battle of Proxycon  Trolls Lair
k4         E       21.73%         2.79%               0.00%
k4         DRZ     8.50%          0.92%               -0.20%
k8         E       37.73%         38.10%              28.47%
k8         DRZ     14.70%         14.85%              22.39%
k16        E       42.89%         47.97%              41.11%
k16        DRZ     16.37%         18.41%              32.11%
k24        E       43.16%         48.23%              41.47%
k24        DRZ     16.09%         18.12%              32.00%

The game test Battle of Proxycon shows somewhat anomalous behavior, i.e., E increases as the bit depth decreases from n4k24 to n4k16. This effect is more significant in frame sequences with poor inter-frame consistency. We studied the traces and discovered that a large proportion of the primitives within these frames are random, so these frames are not representative of the result. We also noticed that the efficiency of Trolls Lair is zero in the case of k4, indicating that CoarseZ hardly rejected any fragments. The reason might be that only the lower 20 bits of all depth values in the Trolls Lair scenes are valid.

4.3. Summing Up

Tables 1-3 are visualized in Fig.6-7. Using the rough approximation in Fig.2, the portion of the total pipeline bandwidth saved by CoarseZ is estimated in Fig.8.

Fig. 6. CoarseZ efficiency (E for n4k16, n4k8, n8k16 and the best of the other (n,k), per benchmark)

Fig. 8. Estimation of decreased total bandwidth (decrease in rendering pipeline bandwidth for n4k16, n4k8, n8k16 and the best of the other (n,k), per benchmark)

From Fig.6-8, it can be inferred that

1. n4k16 outperforms the other combinations, both in efficiency and in its contribution to lowering bandwidth.

2. n4k8 and n8k8 perform at roughly the same level. Since they need less extra memory than n4k16 does, they are potential choices under certain conditions.

3. In applications where rz is low, such as the first two benchmarks, even the best case of n4k16 gains only a limited total bandwidth reduction (less than 7% in our case), and varying the configuration makes little difference. Only when rz is fairly high does the pipeline gain a substantially higher bandwidth reduction, up to more than 20% in our n4k16 case.

We encourage readers to try more (n, k) combinations for their own situations. In our n4k16 test running at 1024×768 resolution, we gain a reduction of about 30% in memory bandwidth in the best case, at the cost of an extra 100K bytes of memory for the CoarseZ buffer.
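The quoted storage cost can be checked with simple arithmetic. The sketch below is a minimal Python example that assumes one tightly packed k-bit Zmax per n×n tile with no padding or alignment overhead; the function name is hypothetical.

```python
def coarse_z_buffer_bytes(width, height, n, k):
    """Storage for one k-bit Zmax per n x n tile, rounding partial tiles up."""
    tiles_x = -(-width // n)   # ceiling division
    tiles_y = -(-height // n)
    total_bits = tiles_x * tiles_y * k
    return (total_bits + 7) // 8  # round up to whole bytes


print(coarse_z_buffer_bytes(1024, 768, 4, 16))
```

For 1024×768 with n4k16 this gives 256 × 192 = 49152 entries of 2 bytes each, i.e. 98304 bytes, which is consistent with the roughly 100K bytes stated above.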


Fig. 7. CoarseZ’s contribution to decreasing RZ (DRZ for n4k16, n4k8, n8k16 and the best of the other (n,k), per benchmark)

5. Conclusions

In this paper we have proposed and evaluated the use of a low-bandwidth CoarseZ buffer before the per-pixel depth test in a 3D graphics architecture. We have introduced a quantitative model for analyzing CoarseZ bandwidth, and through experiments on industrial benchmarks we have formed a look-up table that evaluates the influence of the CoarseZ design parameters on the bandwidth. For most frames in the recorded traces, a tile size of 4 with a bit depth of 16 yields the most efficient CoarseZ test, but designers can also make their own selections by weighing hardware cost against computational overhead.


Our tests were conducted at a resolution of 1024×768, which is common for today’s mainstream consumer hardware. In the future we intend to extend our experiments to other resolutions, e.g. 640×480, which might be more suitable for embedded applications. An adaptive CoarseZ with reconfigurable parameters may reduce bandwidth the most, in which case a profile is needed to match each application to its most suitable configuration. Finally, a cache for CoarseZ can further reduce bandwidth; simulating the influence of the cache parameters on its performance is left as future work.

6. Acknowledgments

We would like to thank the anonymous reviewers. Many thanks go to the organizers and committee of this first IMSCCS conference. We also thank Jian Yang at Centrality Communications Co. Ltd. for his long-standing help.

7. References

[1] S. Morein, “ATI radeon HyperZ technology,” in Workshop on Graphics Hardware, Hot3D Proceedings, ACM SIGGRAPH/Eurographics, Aug. 2000.
[2] N. Greene, M. Kass, and G. Miller, “Hierarchical Z-buffer visibility,” in Proceedings of ACM SIGGRAPH 93, ACM Press, pp. 231–240.
[3] C. H. Yu and L. S. Kim, “An adaptive spatial filter for early depth test,” IEEE International Symposium on Circuits and Systems, 2004, pp. II 137-140.
[4] J. T. Klosowski and C. T. Silva, “Efficient conservative visibility culling using the prioritized-layered projection algorithm,” IEEE Transactions on Visualization and Computer Graphics, Vol. 7, No. 4, pp. 365-379, October 2001.
[5] J. Torborg and J. Kajiya, “Talisman: commodity real time 3D graphics for the PC,” in Proceedings of SIGGRAPH 96, August 1996, pp. 353-364.
[6] N. Greene, “Hierarchical tiling with coverage masks,” in Proceedings of SIGGRAPH 96, August 1996, pp. 65-74.
[7] PowerVR white paper: 3D graphical processing. PowerVR, 2000 (http://www.powervr.com/pdf/TBR3D.pdf).
[8] T. Aila, V. Miettinen, and P. Nordlund, “Delay streams for graphics hardware,” ACM Trans. Graphics, Vol. 22, No. 3, pp. 792-800, July 2003.
[9] F. Xie and M. Shantz, “Adaptive hierarchical visibility in a tiled architecture,” in Proceedings of the 1999 Eurographics/SIGGRAPH workshop on Graphics hardware, 1999, pp. 75–84.
[10] H. Zhang, D. Manocha, T. Hudson, and K. E. Hoff, “Visibility culling using hierarchical occlusion maps,” in Proceedings of ACM SIGGRAPH 97, ACM Press, 1997, pp. 77–88.
