DE NAYER Department of Electronics-ICT – In Conjunction With –

A Generic Framework for Implementing Real-Time Stereo Matching Algorithms on the Graphics Processing Unit

Sammy Rogmans

Academic Year 2006–2007 Promoters: dr. ir. H. Crauwels and dr. ir. G. Lafruit

Belgian Master's thesis submitted to obtain the degree of industrial engineer in electronics, with specialization in information and communication technology.

Copyright 2007 Sammy Rogmans. The author and promoters give permission to use this thesis for consultation and to copy parts of it for personal use. Any other use is subject to copyright law; in particular, the source must be explicitly acknowledged when using results or information from this thesis. Leuven, April 2007

Sammy Rogmans, dr. ir. H. Crauwels and dr. ir. G. Lafruit. Compiled on: 21 August 2007

Acknowledgements

This work originated out of my own ambition, but it would never have been possible without the care of Lu Jiangbo. I want to explicitly thank him for his patience and efforts, which gave me all of these ambitious opportunities. He has closely collaborated with me for the entire year, and I have learned a great deal at his side. Together we have synergistically accelerated our research and ended up with two international state-of-the-art publications, and counting. Not to be forgotten are my promoters Gauthier Lafruit and Herman Crauwels, who have assisted and guided me through the entire project. I also want to give special thanks to Klaas Tack and Wolfgang Van Raemdonck for giving me the necessary information and expertise to get started. Surely, all the support that I have received from many people of the teaching staff at De Nayer deserves to be mentioned as well. I cannot thank them enough, as I have always felt that everyone stood right behind me and helped me whenever they could. As for the non-technical support, I especially want to express my gratitude toward my parents and my girlfriend's parents for their continuous moral support. And of course, last but not least, my girlfriend Linda Croughs herself, for her listening ear and for putting up with my ongoing nagging about the subject at hand. She has literally experienced it all together with me, the joyful but also the darker days of this work. Thank you all!


Abstract
English Version

Free Viewpoint Video is a revolutionary video technology that enables the user to freely roam to a custom viewpoint without needing a physical camera to capture the requested image. The image is instead synthesized by means of a number of fixed-viewpoint cameras. Analogous to the depth perception of the human eyes, the two cameras neighboring the requested viewpoint are used as a stereo vision to determine the depth of the captured scene. The depth information, together with the statically captured images, can be used to interpolate an intermediate camera viewpoint. In this thesis we formally study stereo matching, which is the driving force behind Free Viewpoint Video technology. Stereo matching algorithms make it possible to acquire depth from a stereo vision, but generally involve a tremendous amount of computation and a high complexity. It is therefore hard to use these algorithms in real time. We present a generic framework that makes it easy to implement real-time stereo matching algorithms by exploiting the parallel computational horsepower of the Graphics Processing Units (GPUs) on commodity graphics cards. This enables us to obtain very high matching speeds of up to 478 frames per second, while losing little quality. The Middlebury quality evaluation even ranks our proposed stereo algorithm above a number of algorithms that use global optimization. As the framework permits easy low-level extension, it can also be used to perform state-of-the-art stereo correspondence research. To that end we have enhanced the framework with a hybrid CPU/GPU application time benchmarking mechanism, to provide a formal way of evaluating implementations that run concurrently on both the central processor and the graphics card. The generic framework, codenamed SaJi, is implemented as a static library that uses the object-oriented programming language C++ as a host language and the High-Level Shading Language to program the underlying graphics hardware.


Abstract
Nederlandstalige Versie

Free Viewpoint Video is een revolutionaire videotechnologie die de gebruiker de mogelijkheid geeft om vrij een camerastandpunt te kiezen zonder de nood aan een fysische camera die het gevraagde beeld filmt. Het beeld wordt daardoor gesynthetiseerd met behulp van een aantal vaste camera's. Analoog aan het dieptezicht van het menselijke oog, worden twee naburige camera's van het gevraagde standpunt gebruikt als een stereovisie voor het bepalen van de diepte van de gefilmde scène. De diepte-informatie, samen met de statisch gefilmde beelden, kunnen gebruikt worden om een tussenliggend camerastandpunt te interpoleren. In deze thesis bestuderen we formeel het stereocorresponderen, wat de drijfveer is achter de Free Viewpoint Video technologie. Stereocorrespondentie-algoritmen maken het mogelijk om diepte te verwerven uit een stereovisie, en bestaan in het algemeen uit een enorme hoeveelheid berekeningen en een hoge complexiteit. Het is daarom moeilijk om zulke algoritmen te gebruiken in realtime. We presenteren een algemeen framework dat het mogelijk maakt om makkelijk real-time stereocorrespondentie-algoritmen te implementeren door het uitbuiten van de parallelle rekenkracht in Graphics Processing Units (GPU's) op commerciële beeldkaarten. Dit stelt ons in staat om correspondentieperformaties te halen met zeer hoge snelheid gaande tot 478 beelden per seconde, niettegenstaande dat we niet heel veel aan kwaliteit verliezen. Middlebury kwaliteitsevaluaties stellen het door ons voorgestelde stereocorrespondentie-algoritme zelfs boven een aantal algoritmen die gebruik maken van globale optimalisatie. Omdat het framework gemakkelijk uitbreidbaar is op laag niveau, kan het ook gebruikt worden om state-of-the-art stereocorrespondentie-onderzoek te verrichten. Daarbij hebben we het framework versterkt met een meetmechanisme voor hybride CPU/GPU applicatietijd, om een formele manier te voorzien voor het evalueren van implementaties die gelijktijdig uitgevoerd worden op de centrale processor en de beeldkaart. Het algemene framework met codenaam SaJi is geïmplementeerd als een statische bibliotheek die de object-georiënteerde programmeertaal C++ gebruikt als gasttaal, en de High-Level Shading Language om de onderliggende grafische hardware te programmeren.


Contents

Acknowledgements
Abstract
Contents
List of Figures
Acronyms
Dutch Summary

1 Thesis Introduction
  1.1 Background Information
    1.1.1 College Situation
    1.1.2 The Related Corporation
    1.1.3 Personal Motivations
  1.2 Thesis Objectives
    1.2.1 Free Viewpoint Video
    1.2.2 High-Speed Stereo Matching
    1.2.3 Two-Level Readability
    1.2.4 Compact Programming Tutorials
  1.3 Thesis Disposition
    1.3.1 Stereo Matching Fundamentals
    1.3.2 Basic Computer Graphics
    1.3.3 Exploiting Graphics Hardware
    1.3.4 Experimental Algorithms
    1.3.5 Framework Implementation
    1.3.6 Experimental Results
    1.3.7 Conclusions and Future Work
  1.4 Tools and Experimental Environment

2 Stereo Matching Fundamentals
  2.1 Overview
    2.1.1 Stereo Usefulness and Applications
    2.1.2 Disparity Maps
    2.1.3 Middlebury Evaluation
  2.2 Stereo Mathematics
    2.2.1 Epipolar Geometry
    2.2.2 Image Rectification
    2.2.3 Disparity Equation
  2.3 Taxonomy of Stereo Algorithms

3 Basic Computer Graphics
  3.1 Overview
    3.1.1 The Simplified Graphics Hardware Pipeline
    3.1.2 Graphics Programming Interfaces
    3.1.3 Hardware Programmability
    3.1.4 Multi-Level Parallel Computing
  3.2 Evolution of the Graphics Processing Unit
    3.2.1 Four Generations of Hardware
    3.2.2 Advanced Disposition of Current Pipelines
    3.2.3 Future Pipeline Architecture
  3.3 Programming Graphics Hardware
    3.3.1 Assembly Languages
    3.3.2 High-Level Programming
    3.3.3 Effect Development
  3.4 Efficient Hardware Availability
    3.4.1 Mipmap Scale-Space
    3.4.2 Texture Coordinate Interpolators
    3.4.3 Multiple Render Targets
    3.4.4 Early-Z Mechanism

4 Exploiting Graphics Hardware
  4.1 Overview
    4.1.1 Realizing General Purpose Computations
    4.1.2 Transfer Bottleneck
    4.1.3 Abstract Stream Programming
    4.1.4 Data Independence
    4.1.5 Locality Constraint
  4.2 Avoiding Data Corruption
    4.2.1 One-to-One Mapping
    4.2.2 Half-Pixel Shift Correction
    4.2.3 Using an Optimal Precision
  4.3 Mapping Computational Concepts
    4.3.1 Arithmetic Intensity
    4.3.2 Central Processor Analogies
    4.3.3 Simple Mapping Example
  4.4 Exploiting Specialized Hardware
    4.4.1 Pass Reductions
    4.4.2 Computational Masks
    4.4.3 Data Packing
    4.4.4 Data Filtering
    4.4.5 Optimal Sampling Kernels

5 Experimental Algorithms
  5.1 Overview
    5.1.1 Landau Notation
    5.1.2 Multi-Dimensional Spatial Filtering
    5.1.3 Integral Image
    5.1.4 Simple View Interpolation
  5.2 Advanced Matching Cost Aggregation
    5.2.1 Gaussian Approach
    5.2.2 Laplacian Approach
  5.3 Algorithmic Quality Enhancements
    5.3.1 Directional Support Windows
    5.3.2 Integral Image Adaptive Windowing
    5.3.3 Truncated Absolute Difference
  5.4 Motion Parallax View Interpolation
    5.4.1 Image Synthesis
    5.4.2 Simplified Occlusion Handling

6 Framework Implementation
  6.1 Overview
    6.1.1 Conceptual Diagram
    6.1.2 Library Calls and Functionalities
    6.1.3 Low-Level Library Expandability
    6.1.4 Efficient Implementations and Optimizations
  6.2 Detailed Framework Analysis
    6.2.1 Advanced Conceptual Diagram
    6.2.2 Object-Oriented Internal Library Structure
    6.2.3 Memory Management
    6.2.4 Algorithm Time Benchmarking
    6.2.5 GPGPU Accelerated Development
  6.3 Enhanced Quality Implementations
    6.3.1 Truncated Separable Laplacian Kernel Approximation
    6.3.2 Shortcomings of the Integral Image
  6.4 In-depth Image Synthesis
    6.4.1 Forward Warping
    6.4.2 Simplified Occlusion Handling

7 Experimental Results
  7.1 Overview
    7.1.1 Optimal Speed Results
    7.1.2 Best Quality Results
  7.2 Detailed Speed Measurements
    7.2.1 Separable 2D Convolution Speed Optimizations
    7.2.2 View Synthesis Performance
  7.3 Detailed Quality Evaluations
    7.3.1 Laplacian Kernel Approximation
    7.3.2 Alternative Approaches and Techniques
  7.4 View Synthesis

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

Bibliography

List of Figures

1    Conceptual Diagram of Free Viewpoint Video

1.1  The New IMEC Corporate Logo
1.2  Conceptual Diagram of Free Viewpoint Video
1.3  2-Level Readability Concept
1.4  GPU Evolution Timeline
1.5  GPGPU Computing Community Logo

2.1  Stereo Matching Flowchart
2.2  Standard Stereo Vision Alignment
2.3  Disparity Search Algorithm
2.4  Simple Matching Cost Aggregation
2.5  Tsukuba Scene
2.6  Map Scene
2.7  Sawtooth Scene
2.8  Venus Scene
2.9  Bull Scene
2.10 Poster Scene
2.11 Barn 1 Scene
2.12 Barn 2 Scene
2.13 Cones Scene
2.14 Teddy Scene
2.15 Pinhole Camera Concept
2.16 The Epipolar Constraint
2.17 Rectified Images
2.18 Stereo Depth Calculation
2.19 Stereo Taxonomy

3.1  Unbalanced and Balanced Pipelining
3.2  Vertex, Primitive and Mesh Representation
3.3  Simple Graphics Pipeline
3.4  Graphics API
3.5  Hardware Programmability Distinction
3.6  Different Levels of Parallelism
3.7  GPU Evolution
3.8  Detailed Vertex Transformations
3.9  Raster Operations Flowchart
3.10 DirectX 10 Pipeline
3.11 GPU Assembly Profile Structure
3.12 Point and Linear Sampling
3.13 Mip-level Hierarchy

4.1  GPGPU Computational Kernel Functionality
4.2  Personal Computer Memory Architecture
4.3  The Stream Programming Model
4.4  The Kernel Locality Constraint
4.5  One-to-One Mapping Procedure
4.6  Half-Pixel Origin Shift
4.7  Half-Pixel Texture or Vertex Coordinate Adjustment
4.8  Mapping CPU Inner Loops to GPU
4.9  Setting the Computational Domain
4.10 Independent Kernel Combination with MRT
4.11 Building a Computational Mask
4.12 Packing Scalar Data into a Four-Component Form
4.13 One Dimensional Sampling Kernel Example

5.1  Searching Optimal Idea Solutions
5.2  Generic 2D Convolution Kernel
5.3  Applying the Convolution Kernel
5.4  Main Advantage of the Integral Image
5.5  Synthesizing an Interpolated View
5.6  Changing the Gaussian Kernel Standard Deviation
5.7  Approximating a Gaussian Kernel with Mip-Levels
5.8  Approximating a Gaussian Kernel with Integral Images
5.9  Truncating the Generic Convolution Kernel
5.10 Composing Different Window Shapes with Integral Images
5.11 In-depth View Interpolation Synthesis

6.1  SaJi Framework Conceptual Diagram
6.2  High-level Concept for Low-level Expandability
6.3  Efficient Convolving on Graphics Hardware
6.4  The Recursive Doubling Algorithm
6.5  Advanced SaJi Framework Conceptual Diagram
6.6  Optimal Memory Management with Abstract Base Classes
6.7  Partial Class Diagram of the SaJi Framework
6.8  Processing Elements and Data Flow of the Proposed Stereo Algorithm
6.9  Dealing with Intrinsic Precision Problems in an Integral Image

7.1  Best Tsukuba and Map Disparity Output for the Optimal Speed Model
7.2  Best Sawtooth and Venus Disparity Output for the Optimal Speed Model
7.3  Speed Optimization Comparisons
7.4  Disparity Outputs for Different Standard Deviation γp
7.5  Full Laplacian
7.6  Optimal Gaussian
7.7  Box-Filter Sum
7.8  Interpolated View
7.9  Occlusion Handled View

Acronyms

AD      Absolute Difference
AGP     Accelerated Graphics Port
API     Application Programming Interface
Cg      C for graphics
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
D3DX    Direct3D Extension
DDR     Double Data Rate
de/s    disparity estimations per second
DIBR    Depth Image Based Rendering
fps     frames per second
FVV     Free Viewpoint Video
FX      Effects
GDDR    Graphics DDR
GLSL    GL Shading Language
GLU     GL Utility
GPGPU   General-Purpose GPU
GPU     Graphics Processing Unit
HLSL    High Level Shading Language
ICT     Information and Communication Technology
IEEE    Institute of Electrical and Electronics Engineers
IMEC    Inter-university Micro Electronics Center
LOD     Level of Detail
MCH     Memory Controller Hub
MIMD    Multiple Instruction Multiple Data
mip     multum in parvo
MML     Multiple Mip-Level
MRT     Multiple Render Target
NASA    National Aeronautics and Space Administration
NES     Nomadic Embedded Systems
PCI     Peripheral Component Interconnect
PCIe    PCI express
PPM     Perspective Projection Matrix
RAM     Random Access Memory
RGBA    Red Green Blue Alpha
ROP     Raster Operator
SAD     Sum of Absolute Difference
SAT     Summed Area Table
SD      Squared Difference
SIMD    Single Instruction Multiple Data
SM      Shader Model
SSD     Sum of Squared Difference
TAD     Truncated Absolute Difference
VGA     Video Graphics Array
WENK    Hogeschool voor Wetenschap & Kunst
WTA     Winner Takes All

Dutch Summary

Free Viewpoint Video (FVV) is a technology that makes it possible to choose an arbitrary camera viewpoint without an actual physical camera being present to capture that image (see Figure 1). The image is instead computed from a number of cameras with fixed viewpoints. FVV has future applications in live digital video, so that for instance a football match can be watched from a camera viewpoint chosen instantly by the viewer. Another application is the easy generation of film effects in which spatial images are shown, as for example in the popular 'Matrix' trilogy.

Figure 1: Conceptual Diagram of Free Viewpoint Video

Although the FVV technology can have several algorithmic implementations, Depth Image Based Rendering (DIBR) is one of the most reliable. Here the images of the fixed cameras neighboring the requested viewpoint are used, together with depth information of the scene, to synthesize the intermediate images. Since we are dealing with a dynamically filmed situation, this depth information cannot be prepared and made available in advance. DIBR therefore requires a way of acquiring depth from the filmed images in real time. This process is known as stereo matching. Analogous to the depth perception of the human eyes, two cameras neighboring the requested viewpoint are used as a stereo vision to determine the depth of the filmed scene. In this thesis we assume a stereo vision consisting of two neatly parallel-aligned cameras, conventionally defined as the left and the right camera. In the second chapter we describe in detail how objects in the left and right images are systematically 'shifted', and how we can first rectify the images in the case of a non-parallel-aligned setup. We give a formal proof that objects in the foreground shift more than objects in the background. Correspondences in the left and right images with a large shift, or disparity, are thus located close to the camera, and vice versa. Therefore, in chapter 2 we also give a mathematical proof that disparity is inversely proportional to depth. For DIBR it is thus already sufficient to have disparity maps produced by the stereo matching. Such disparity maps contain the amount of shift between correspondences, on a per-pixel basis, between the left and the right image.

Stereo matching algorithms are in general very complex and require an enormous amount of computation, and are therefore very hard to use in instantaneous or real-time applications. In this thesis we propose an approach in which we try to achieve real-time performance by exploiting commodity Graphics Processing Units (GPUs) found on recent graphics cards. In the third chapter we show that ever since the first generation of GPUs appeared in 1997, intrinsic parallelism has been a dominant factor. Many functions are divided into multiple levels, each of which is in turn executed a number of times in parallel. Present-day GPUs are therefore generally equipped with multiple Single Instruction Multiple Data (SIMD) processors, each executing the same instruction but on different data. Precisely for that reason we can also easily show that the raw computational power of GPUs far exceeds that of present-day CPUs. Because CPUs offer a high degree of sequential control over the data, they are hard to parallelize. Moore's law, which states that every two years transistor technology shrinks to half the area, then merely implies for CPUs that they can be designed smaller. Thanks to the SIMD and multi-level parallelism in GPUs, they can be made twice as fast every two years. In the remainder of chapter 3 we therefore try to understand the basics of graphics hardware very well, so that we can exploit it optimally.

In the fourth chapter we formalize the exploitation of the GPU into the principle of General-Purpose GPU (GPGPU) processing. GPGPU is a style of programming in which we use the graphics card for general computations that would normally be executed on the CPU. It is found more and more in new hybrid CPU/GPU applications that synergistically combine the high controllability of the CPU with the raw computational power of the GPU. In this chapter we therefore see how a simple example of processing an array of data can be translated into a program that runs on the GPU. On the CPU we would traverse the array sequentially and apply the necessary operations to each element. In GPGPU it is customary to divide this array into pieces of equal length and to organize them one below the other, so that a two-dimensional data field is built up. The scalar values in this field are interpreted as intensities and loaded into the graphics card as a texture. A rectangular geometry is then sent to the graphics card so that it rasterizes the rectangle into pixel-sized fragments. Whereas the graphics card would normally determine a color by looking up a pixel from the texture and performing a lighting calculation, we can also have it execute our own computations on the texture. An important part of GPGPU is therefore making sure that the rectangle is rasterized into exactly as many fragments as there are data elements available in the texture. Instead of displaying the result on the screen, it is customary to send it to a buffer in video memory that is not displayed. After the desired computations are done, we can copy the buffer to system memory, and then simply continue working as if the processing had been executed on the CPU. Chapter 4 also describes in-depth topics that allow specialized graphics hardware to be maximally exploited for a few specific applications.

Because we use the graphics card as the underlying platform for executing stereo matching algorithms, restrictions are imposed on the structure and design of the algorithms we can implement. Not only do we have to ensure a high arithmetic intensity so that there is as little data traffic as possible between the system and the graphics card, we also have to restrict ourselves to local stereo algorithms. This refers to the fact that global optimization is hard or impossible to realize on the graphics card, since reading from a texture can only be done efficiently over small local regions. In chapter 5 we propose a number of algorithms that stand a chance of being executed with maximum efficiency on the GPU. Since global optimization is often used to obtain disparity maps of very high quality, in the fifth chapter we also introduce a few local algorithmic techniques to raise the quality as much as possible.

In the sixth chapter we bring all the preceding chapters together in the design of a generic framework that makes it possible to develop stereo matching algorithms very easily. The SaJi framework can be used transparently at a high level as a static C++ library together with the algorithm we propose and developed, so as to have a module that provides high-speed stereo matching. The transparency lies in the fact that at this high level, the user needs no knowledge of programming a graphics card; the framework handles all GPGPU implications for him. Another important property of the SaJi framework is its easy extensibility at the low level. A complete GPGPU environment is available that can be inherited from and used for the development of new stereo matching algorithms that make use of hybrid CPU/GPU processing. For that purpose a sophisticated measurement mechanism is also available, which enables the low-level programmer to easily measure this hybrid execution time. In chapter 6 we therefore describe how internal CPU clock counters can be used to time algorithms very accurately. When the central processor of the computer is equipped with multiple cores, we have to force the affinity of the measurement mechanism to the same core, so that the same counter is always used consistently. To use these CPU clock counters for measuring GPU application time, it is necessary to synchronize with the graphics card just before every measurement. Finally, we also see that the time needed for this synchronization, certainly for very fast and small hybrid applications, disturbs the time measurement. Because of this, such an algorithm has to be executed a large number of times between two synchronization points in order to finally obtain a representative measurement.

In chapter 7 we present the results of the speed measurements that we performed on the stereo algorithm we developed. These measurements show that we are able to achieve matching performance at very high speed, up to 478 frames per second, while still not losing a great deal of quality. Quality evaluation of stereo algorithms has for a long time been systematically driven by Daniel Scharstein and Richard Szeliski. They have set up a website with an evaluation mechanism that also instantly shows the rank of the evaluated algorithm against previously published techniques. These advanced quality evaluations rank our proposed stereo matching algorithm even above a number of algorithms that use global optimization. From this we can conclude that our proposal and framework make a good contribution to the current state of the art. As a result, we have also been able to produce several international state-of-the-art publications for conferences affiliated with the Institute of Electrical and Electronics Engineers (IEEE). In the seventh chapter we also look more closely at the speed-up of a few optimizations that we implemented by exploiting specialized hardware. Finally, we briefly look at the results of a quickly developed DIBR algorithm, which interpolates a camera viewpoint using the disparity maps we computed. We have not integrated formal evaluations of these interpolated images into this thesis, because we focus more specifically on the stereo matching alone. Because this thesis opens many doors to further future work, in the final chapter we indicate a few directions that could even stimulate doctoral research. Since towards the end of this research a new generation of highly programmable graphics cards became available on the commercial market, the translation of computer vision algorithms to graphics cards can in general be abstracted further towards a higher formalism. Within the framework of developing a first practical FVV prototype, there are still plenty of extension and improvement possibilities that can be studied.
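To make the array-to-texture mapping described above concrete, the following C++ sketch contrasts the sequential CPU loop with the row-major packing that turns a one-dimensional array into a two-dimensional data field ready to be uploaded as a texture, where every texel is then processed independently by the fragment program. It is only an illustrative sketch and not part of the thesis code; the square-root operation and the row width are arbitrary placeholders.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU version: walk the array sequentially and apply the operation to each element.
void process_cpu(std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = std::sqrt(data[i]);          // stand-in for "the necessary operations"
}

// GPGPU-style layout: cut the 1D array into rows of equal length and stack them,
// so the data can be uploaded as a width x height texture. Texel (x, y) then holds
// element y * width + x, and the same operation is applied to every texel in parallel.
std::vector<float> to_texture_layout(const std::vector<float>& data, std::size_t width) {
    std::size_t height = (data.size() + width - 1) / width;   // pad the last row
    std::vector<float> texture(width * height, 0.0f);
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t x = i % width;             // column inside the texture
        std::size_t y = i / width;             // row inside the texture
        texture[y * width + x] = data[i];
    }
    return texture;
}
```

Making sure that the rasterized rectangle produces exactly width × height fragments, as stressed above, is what guarantees a one-to-one correspondence between texels and data elements.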

Chapter 1

Thesis Introduction

1.1 Background Information

1.1.1 College Situation

The assignment of this Master's thesis originates from the De Nayer Institute, which is part of the Hogeschool voor Wetenschap & Kunst (WENK) [1] in Sint-Katelijne-Waver, Belgium. It is associated with the University of Leuven and covers a diversity of disciplines such as chemical, civil, mechanical and electrical engineering. For a few years now, the whole of Europe has operated under the Bologna declaration of June 19th, 1999. This implies that colleges such as De Nayer are also able to grant academic Master's degrees. Hence this Master's thesis.

1.1.2 The Related Corporation

The Inter-university Micro Electronics Center [2], abbreviated IMEC, is the single largest European research center in the area of micro- and nano-electronics, nanotechnology, design methods and technologies for ICT systems. It is physically situated next to a campus of the Catholic University of Leuven, and thereby provides a grand motive for all kinds of post-academic theses.

Figure 1.1: The New IMEC Corporate Logo

The new logo, as depicted in Figure 1.1, also represents the administrative fusion of all IMEC divisions so that closer collaboration becomes possible. With this they hope to achieve a more productive working environment and the possibility to expand at an even faster rate.

1.1.3 Personal Motivations

That a non-university student writes his thesis at IMEC is less obvious. I have taken my motivation from the fact that I would be surrounded by a large number of intelligent and more abstractly thinking people, through which I could be certain to enormously improve my knowledge and way of thinking. Another very important motive for me was that strong ambition is very much appreciated and even encouraged there, because there are plenty of opportunities to publish papers and the like. I chose to apply at the Nomadic Embedded Systems (NES) division of IMEC, more specifically at the multimedia group. NES continuously tries to excel in international scientific research that precedes the industrial needs by three to ten years. NES was formerly known as DESICS in the old corporate administrative structure.

1.2 Thesis Objectives

1.2.1 Free Viewpoint Video

The ultimate umbrella goal for IMEC is the realization of a new revolutionary video technology known as Free Viewpoint Video (FVV). Free Viewpoint Video makes it possible for users to freely roam to any custom camera viewpoint. Of course this is not realized by moving a camera at the user's command, as that would limit the control to one master user. The idea is to have a number of fixed-viewpoint cameras, select the two which are closest to the viewpoint desired by the user, and perform some kind of imaging technique to create this custom view, as depicted in Figure 1.2.

Figure 1.2: Conceptual Diagram of Free Viewpoint Video

There are many different ways and techniques to create this custom view [3; 4], but one of the most reliable ones is the use of Depth Image Based Rendering (DIBR) [4]. In this technique, depth information is used together with available imagery to obtain the requested viewpoint. Hence this thesis will focus upon algorithms that are related to the latter.

1.2.2 High-Speed Stereo Matching

Looking at Figure 1.2, the most pertinent question that arises is how to acquire this depth information from the 3D scenery. The answer we are looking for is to be found in the domain of computer vision. By the use of two similar yet different images, we can derive depth through a mathematical approach called stereo matching [5; 6]. We may or may not realize that we bump into stereo matching every second of our everyday lives. Our very own eyes are what is called a stereo vision. Our brain interprets the similarities between the captured left and right retinal images, upon which we gain the ability of depth perception. Stereo matching is in general a very complex procedure, and a lot of computing power is used by our brain to sustain this perception. It is exactly for this reason that drunk people lose their ability to perceive depth: the brain is simply no longer capable of performing the complex operations that are involved. In this thesis we mainly focus upon the implementation and optimization of efficient stereo matching, in order to achieve at least real-time performance. Nonetheless, we also briefly discuss a very simple way of creating an intermediary viewpoint through the use of DIBR.

1.2.3 Two-Level Readability

As seen in Figure 1.3, this thesis has been designed so that it can be read at two different levels. One level gives you a basic idea of the story line, without going too much into detail. Hence this level offers a quick and easily understandable overview of the work that has been done. If you would like to do so, you only have to read the overview sections of chapters 2 through 7. It is preferable to read the current and the last chapter completely.

Figure 1.3: 2-Level Readability Concept

If you have no such interest, or are too compelled and drawn into this wonderful work to skip any details, you can go and read everything in the normal chronological order. For those who have finished reading at the easily comprehensible level and whose interest in the details has been triggered, you can simply read the remaining parts, or reread the overview of some chapters if necessary.

1.2.4 Compact Programming Tutorials

In an attempt to make this thesis even more useful than only its research results, most of the chapters can be seen as individual tutorials for beginning programmers. Students and/or programmers unfamiliar with any of the discussed topics can quickly train themselves without having to address several informational sources. As we all know, structured data and information has become progressively valuable these days. This will certainly be an additional value for IMEC, or for any other interested party. A quick overview of the discussed topics is given in the following disposition of the thesis.

1.3 Thesis Disposition

1.3.1 Stereo Matching Fundamentals

In chapter 2 we become familiar with the terminology used in the field of stereo matching. As a starter we discuss the use and fundamentals of stereo matching, by understanding how depth maps can be composed out of a stereo vision. We then briefly touch on a simple manner of improving the matching certainty and on evaluating the eventual resulting map. We then try to grasp the basic ideas of epipolar geometry [5], which describes the geometrical implications of using a stereo vision. Built upon that, we discuss some fundamental mathematics which will empower us to understand exactly how depth is derived from the available images, and how to determine whether it is even possible to do so. To finish this chapter, we briefly touch on a standardized conventional taxonomy which dissects the process of stereo matching into essential basic steps.

1.3.2 Basic Computer Graphics

As we will come to understand later on, a Graphics Processing Unit (GPU) on current generation graphics cards harnesses a tremendously larger amount of computational horsepower [7] than a standard Central Processing Unit (CPU). In this thesis we will therefore try to address this power to achieve high-speed stereo matching and completely off-load the CPU. In order to do so, we must first have some knowledge of the internal functionality of graphics hardware. In this chapter we will discuss the absolute basics of graphics hardware, and how to address it. Since GPUs evolved through four generations in as little as five years (see Figure 1.4), lots of people get confused about which functionality is available on which graphics card or GPU generation.

Figure 1.4: GPU Evolution Timeline

For most computer graphics developers this is not that important, knowing that most of the subsequent hardware has somehow implemented backwards compatibility. In terms of reverse-engineering this piece of equipment and exploiting it to its maximum capabilities, understanding correct hardware availability is crucial. We will have a quick look at how to address and use current graphics cards, followed by a disposition of older generation hardware and its functionalities. We build upon that information to discuss an advanced model of current GPUs. To further complete our knowledge in this area, we also take a brief look ahead into the future. We finish the third chapter by focusing on some specialized hardware availabilities that can be used to achieve maximum computational efficiency for some specific applications or processing elements.

1.3.3 Exploiting Graphics Hardware

Chapter 4 is somewhat the essence of the entire thesis. We describe a manner to exploit the GPU for general-purpose computations. The latter is also known under the name General-Purpose GPU or GPGPU computing (see Figure 1.5). As we will come to understand, it is a revolutionary way of developing high-performance applications. A relatively small group of programmers has formed a community [8] to share their thoughts and expertise and to expose their findings, so that programmers can save a great deal of time by using their sources and techniques.

Figure 1.5: GPGPU Computing Community Logo

We will fully explore the ways to exploit graphics hardware, together with the advantages, disadvantages and restrictions. The exploits will be described in a formal way by abstracting the GPU programming model. After that we will discuss the analogy with CPU applications, together with a brief example. Understanding the mapping from CPU to GPU applications is crucial, as it is necessary to implement efficient algorithms. Techniques to exploit some of the specialized hardware described in the previous chapter will be handled last.

1.3.4 Experimental Algorithms

In the fifth chapter we formulate a basic stereo matching algorithm, and create some variations upon the basic model, each using different techniques to obtain a similar or equivalent approach. Later we go deeper into the possibilities to enhance the quality on an algorithmic level, leading to an advanced and complex model proposal for GPU-based stereo matching. A manner to formally compare algorithmic complexity is discussed, so these models can be properly analyzed on an algorithmic level. To end this chapter we give a brief discussion on how to synthesize an intermediate viewpoint with DIBR. Mind that this is only a very simple model which is certainly not optimized, since it is not the core of this thesis. It is merely an extra to create additional value for our stereo matching work, and to test it in real-life circumstances.

1.3.5 Framework Implementation

The formal ideas of the previous chapters are implemented into a generic framework which was codenamed SaJi. In chapter 6 we analyze the structure and code of the framework. Since it is important for us to obtain high, beyond real-time performance, we also discuss how the algorithm time benchmarking is implemented. We present the developer with a smart timing mechanism that is able to benchmark hybrid CPU/GPU application time. Furthermore we also handle the memory management on both the system and graphics level, since the framework internally uses both resources. SaJi can also be used as a high-level static library in applications that need high-speed stereo matching algorithms. Therefore we introduce the functionality of this high-level library by means of some basic examples and documentation.

1.3.6 Experimental Results

Chapter 7 will mainly depict visual results and graphs of speed measurements. We evaluate our proposed advanced stereo matching algorithm with the time benchmarking mechanism that is available in the generic framework. Different optimization tricks have been evaluated separately in different circumstances, so we can derive the significance of the specific hardware exploits and thus understand their contribution in a more formal way. We finish this chapter with a brief look at some of the interpolated images that were generated with our algorithms. No formal evaluation was performed on these synthesized images, because of our focus on the stereo matching itself.

1.3.7 Conclusions and Future Work

In the final chapter we draw adequate conclusions from the results presented in the previous chapter. Building upon these conclusions, we can efficiently suggest further areas of study. At the time of writing, we are on the verge of being blessed with a new generation of graphics cards. We keep this in mind throughout the entire thesis, hence future work built on this project could be very interesting.

1.4 Tools and Experimental Environment

The framework described in this thesis is developed on commodity hardware, in contrast to dedicated systems [9], with a combination of several programming languages, each with their specific purposes. The motivations and the specific use of these languages will become clear as one advances through the following chapters. We just present a quick list of tools and the experimental environment of our research, to give the interested developer a comfortable and easy overview. The SaJi generic framework is implemented as a static library that uses the object-oriented programming language C++ as a host language for all of the others. We use Microsoft DirectX version 9.0c and the extension library D3DX for communicating with the graphics card. The graphics card in our system is an NVIDIA GeForce 7900GTX with 512 MB of GDDR3 memory, housed in a 3.2 GHz dual-core PC with 1 GB of system memory. For basic graphics programming we have used some existing examples [10] to get started. Application programming on the GPU is done by developing High-Level Shading Language code and compiling it to Shader Model 3.0 assembly. The GPU programming was structured by the use of the Effect format, to provide the developer with a clearer overview of the code. The thesis book itself was written in LaTeX, using some skeleton templates [11] and editorial courses of Ghent University.
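For orientation, the following minimal sketch shows how a Direct3D 9 device and an off-screen floating-point render target might be created for this kind of GPGPU work. It is only an illustrative sketch under assumptions, not code taken from the SaJi framework; the 512×512 texture size and the use of the desktop window as a dummy device window are arbitrary placeholder choices.

```cpp
#include <windows.h>
#include <d3d9.h>

int main() {
    // Create the Direct3D 9 interface and a hardware-accelerated device.
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);

    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed         = TRUE;
    pp.SwapEffect       = D3DSWAPEFFECT_DISCARD;
    pp.BackBufferFormat = D3DFMT_UNKNOWN;
    pp.BackBufferWidth  = 1;                   // dummy back buffer: all GPGPU work
    pp.BackBufferHeight = 1;                   // targets off-screen textures instead
    pp.hDeviceWindow    = GetDesktopWindow();  // placeholder; a hidden window is cleaner

    IDirect3DDevice9* device = NULL;
    HRESULT hr = d3d->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
                                   pp.hDeviceWindow,
                                   D3DCREATE_HARDWARE_VERTEXPROCESSING,
                                   &pp, &device);

    // A 128-bit floating-point texture that can serve as an off-screen render
    // target for intermediate results on Shader Model 3.0 class hardware.
    IDirect3DTexture9* target = NULL;
    if (SUCCEEDED(hr))
        hr = device->CreateTexture(512, 512, 1, D3DUSAGE_RENDERTARGET,
                                   D3DFMT_A32B32G32R32F, D3DPOOL_DEFAULT,
                                   &target, NULL);

    // ... load HLSL effects, draw a full-screen quad, read results back ...

    if (target) target->Release();
    if (device) device->Release();
    if (d3d)    d3d->Release();
    return SUCCEEDED(hr) ? 0 : 1;
}
```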


Chapter 2

Stereo Matching Fundamentals

2.1 Overview

2.1.1 Stereo Usefulness and Applications

Simply said, stereo matching is a concept that makes it possible to acquire depth information from two images sharing many similarities. It is also known under the name stereo correspondence, or simply stereo for short. In Figure 2.1 it is noticeable that images taken with cameras that are close to each other produce similar yet somewhat shifted images. Notice that two cameras always lie on a straight line that connects them. Exactly for this reason we conventionally name the two produced images in such a camera line-up the left and the right image, respectively. In the flowchart example we show the capturing of a block, so this is the area in the picture upon which we focus. We can clearly see that the block finds itself on the right in the left image and vice versa. In the depth image it is conventional to represent a pixel that lies closest to the camera in white and to gradually darken the color toward black [12] as the pixel lies deeper in the scene.

Figure 2.1: Stereo Matching Flowchart

The applications for stereo matching are endless, but almost always involve 3D imagery. Examples are controlling autonomous robots [13], as already done by NASA for planetary investigation or mechanical repairs. An upcoming application is the use of stereo visions on vehicles, so that an on-board computer can use this data to prevent collisions, to render the surrounding area into a user-friendly navigational system, etcetera. Another application is passively scanning 3D objects for movie [14] or game development. Passive 3D scanning means that nothing has to be changed in the scene or on the object, in contrast with active scanning, where most of the time a grid is projected onto the object that needs to be scanned. The application that we are targeting in our higher objectives is Free Viewpoint Video (FVV) through Depth Image Based Rendering (DIBR). As we can understand, FVV is the application and DIBR is an algorithmic implementation of it. To be able to successfully use DIBR, we of course need a depth image first. As we will come to understand later on, obtaining this depth information involves about 80 percent of the total amount of work that needs to be done. Hence our focus on high-speed stereo matching in this thesis.

2.1.2 Disparity Maps

In its simplest form, a stereo matching algorithm takes a pixel from the left image and looks for the corresponding pixel in the right image. When we have a camera setup where the two cameras are close to each other and arranged for parallel capture, as depicted in Figure 2.2, the matching becomes rather easy. Not only can we restrict our search to a single horizontal line, but because an object finds itself more to the left in the right image, we can also restrict the search to looking only to the left of the current coordinate we are trying to match.

Figure 2.2: Standard Stereo Vision Alignment

There are many ways to check whether two pixels are corresponding or matching, but for now we will focus on a technique that is called intensity based matching [6]. Intensity based matching involves checking whether the two pixels have the same intensities. There are two frequently used implementations of this technique.

• The Absolute Difference (AD) subtracts the two pixel intensities and takes the modulus of this value. The AD linearly characterizes the mismatch between the two pixel intensities.

$AD_r(s) = \left| I_L(s_r) - I_R(s_r - s) \right| \qquad (2.1)$

• The Squared Difference (SD) subtracts the two pixel intensities and squares the result. The SD grows quadratically as the intensities start to differ.

$SD_r(s) = \left[ I_L(s_r) - I_R(s_r - s) \right]^2 \qquad (2.2)$

Here $s_r$ denotes the reference coordinate of the pixel we are trying to match by looking in the right image, and $I_L$ and $I_R$ are the intensity functions of a horizontal line in the left and right image, respectively. To search for a good match, we let s grow and keep track of which coordinate has a minimal absolute or squared difference. This minimal value indicates the maximum possibility of a match, since in theory we need a zero value for a perfect match. The value of the AD or SD is often called the matching cost, so stereo matching can be defined as searching for the minimal matching cost. Because we are going to match digital images, we make use of discrete functions. The continuous functions express themselves as one-rowed vectors or matrices. We rewrite equations 2.1 and 2.2 in a discrete form and extend them to two dimensions to include every row.

$AD_{st}[\delta] = \left| I_L[s,t] - I_R[s-\delta,t] \right| \qquad (2.3)$

2 SDst [δ] = IL [s, t] − IR [s − δ, t]

(2.4)

In this thesis we mainly focus on the AD for the matching cost computation, so we will use equation 2.3 throughout the text. As we search in Figure 2.3 from 0 up to a preset maximum value R, we notice that δ = 5 gives the minimal matching cost.

Figure 2.3: Disparity Search Algorithm

The stride that gives us this minimal matching cost is defined as the disparity. Ergo, the disparity in the above example for the pixel at [s, t] is 5. Since we limit our search to R + 1 values, we say that our algorithm implements a disparity search range of R + 1. Again, in terms of the above example, we use a search range of 8. When we collect the disparities for every pixel, we obtain what is called the disparity map in stereo matching terminology. The disparity map is often already referred to as a depth map. This is because the closer an object is to the camera, the more the object shifts between the two images [15]; hence disparity is inversely proportional to depth. A conversion to true spatial depth is almost never done. The largest disparity in the depth map is visualized as white, and the visualization gradually darkens as the disparity decreases. Intuitively we can feel that this approach is certainly not foolproof. When searching in homogeneous or untextured areas, we will find a good match for several different disparities, so we can expect the disparity map for this single-pixel matching to become very noisy. In Figure 2.4 we demonstrate a simple solution to this problem. Instead of using only one pixel to determine a matching cost for the current disparity estimation, we also calculate the AD with the same

disparity for the surrounding pixels. The following diagram shows an image pair that has a row filled with pixels of the same intensity. We have hatched the pixel we are trying to match, to visually differentiate it from the others.

Figure 2.4: Simple Matching Cost Aggregation

If we used the former matching technique, a match would be found at every disparity estimation. By summing up the matching costs within the vicinity of the pixel we are trying to match, we use surrounding elements to strengthen the likelihood of a good match. The local area over which we aggregate all of these costs is known as the support window. The size M × N of the support window is often referred to as the window or kernel size. This plain aggregation of the AD is a technique called the Sum of Absolute Differences (SAD) [6]; in the same way we can create the Sum of Squared Differences (SSD) [6]. While SAD or SSD improves the matching for homogeneous areas, we can expect this technique to corrupt the disparity map around the edges, since the disparity is not constant in those areas. The overall effect is that the depth map gets more blurred as the kernel size of the support window increases. Because of the quality trade-off that arises when solving the stereo matching problem through SAD/SSD aggregation, people who want very high quality disparity maps often cast the problem as a global optimization problem [6; 12]. When estimating a disparity for a certain pixel, global optimization takes all of the other pixels of the image into account. Since SAD/SSD focuses only on a rather small region around the pixel of interest, we define these as local approaches.
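To make the interplay between the AD cost, the SAD aggregation and the minimum-cost search concrete, the following CPU reference sketch computes a dense disparity map with a square support window and a winner-take-all decision. It is only an illustrative sketch, not the GPU implementation developed in this thesis; the image container, the window radius and the border handling are assumptions chosen for the example.

#include <vector>
#include <cstdint>
#include <cstdlib>
#include <climits>

// Minimal grayscale image: 'data' holds width*height intensities, row by row.
struct Image {
    int width, height;
    std::vector<uint8_t> data;
    int at(int s, int t) const { return data[t * width + s]; }
};

// Local dense stereo through SAD aggregation and a winner-take-all (WTA) search.
// 'radius' gives a (2*radius+1) x (2*radius+1) support window, 'maxDisp' equals R.
std::vector<int> matchSAD(const Image& left, const Image& right, int radius, int maxDisp) {
    std::vector<int> disparity(left.width * left.height, 0);
    for (int t = radius; t < left.height - radius; ++t) {
        for (int s = radius + maxDisp; s < left.width - radius; ++s) {
            int bestCost = INT_MAX, bestDisp = 0;
            for (int d = 0; d <= maxDisp; ++d) {               // disparity search range R + 1
                int cost = 0;
                for (int dt = -radius; dt <= radius; ++dt)     // aggregate the AD over the
                    for (int ds = -radius; ds <= radius; ++ds) // M x N support window
                        cost += std::abs(left.at(s + ds, t + dt) - right.at(s + ds - d, t + dt));
                if (cost < bestCost) { bestCost = cost; bestDisp = d; } // keep the minimal cost
            }
            disparity[t * left.width + s] = bestDisp;
        }
    }
    return disparity;
}

The quadruple loop also makes clear why a brute-force CPU version is far too slow for real-time use, and why the massively parallel GPU implementation is pursued in the remainder of this thesis.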

2.1.3 Middlebury Evaluation

Depth from stereo has been studied intensively for decades. In 2002, Daniel Scharstein from Middlebury College, together with Richard Szeliski from Microsoft Research, published a paper [6] in which they proposed an evaluation methodology, so that all existing and future algorithms could be characterized consistently. From that point on, Scharstein and Szeliski have systematically surveyed and evaluated stereo research [12]. To perform this evaluation efficiently, they provide the stereo researcher with a number of data

sets. Hence, throughout this thesis we make use of their data sets to evaluate our proposed algorithms. To get the beginning stereo researcher and/or reader acquainted with this data, we introduce three currently used groups of Middlebury data sets.

Figure 2.5: Tsukuba Scene

Figure 2.6: Map Scene

The data sets depicted in Figures 2.5 and 2.6 are real-life scenes which are the most commonly used. A number of camera viewpoints are available, but we have shown only the reference image, which is often regarded as the left image. They are available together with what is called a ground truth disparity map. This disparity map contains the exact result of a theoretically perfect match with the outermost right image.

Figure 2.7: Sawtooth Scene

Figure 2.8: Venus Scene

Figure 2.9: Bull Scene

Figure 2.10: Poster Scene

Figure 2.11: Barn 1 Scene

Figure 2.12: Barn 2 Scene

The data sets in Figures 2.7 through 2.12 are synthetic images built up out of planar areas. In the evaluation mechanism of Scharstein and Szeliski, only the Sawtooth and Venus Scene are used.

Figure 2.13: Cones Scene

Figure 2.14: Teddy Scene

The Cones and Teddy scenes shown in Figures 2.13 and 2.14 are more recent data sets. We will focus specifically on the Tsukuba, Map, Sawtooth and Venus scenes, because most of the research on GPU-based stereo processing still uses these older data sets. We will not use any of the other synthetic images, as they are not incorporated into the Middlebury evaluation mechanism. A disparity map generated with a stereo matching algorithm can be evaluated efficiently, since the ground truth is known. The Middlebury evaluation mechanism looks for differences between the custom generated map and the ground truth. From this, a score can be given that indicates the quality of the disparity map at hand. The scores reported by Middlebury grow larger as there are more differences with the ground truth, so a theoretical score of zero indicates a perfect algorithm. Most of the time this score comes with two sub-scores, indicating the matching quality in homogeneous or untextured regions and at discontinuities. As we have discussed before, these discontinuous areas are generally the object edges in the image.
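To illustrate how such a quality score can be obtained in principle, the sketch below computes the percentage of "bad" pixels, i.e. pixels whose disparity deviates from the ground truth by more than a threshold. It only mimics the spirit of the Middlebury evaluation: the one-pixel threshold, the flat integer arrays and the absence of region masks are assumptions made for this example, whereas the official mechanism additionally reports the sub-scores per region type.

#include <vector>
#include <cstdlib>
#include <cstddef>

// Percentage of pixels whose disparity differs from the ground truth by more than
// 'threshold'; both maps are assumed to have the same size and pixel ordering.
double badPixelPercentage(const std::vector<int>& computed,
                          const std::vector<int>& groundTruth,
                          int threshold = 1) {
    int bad = 0;
    for (std::size_t i = 0; i < computed.size(); ++i)
        if (std::abs(computed[i] - groundTruth[i]) > threshold)
            ++bad;
    return 100.0 * bad / computed.size();
}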

2.2 Stereo Mathematics

2.2.1 Epipolar Geometry

In the previous section, we have always assumed that the stereo vision camera setup was perfectly parallel aligned. In practical systems, this is unfortunately not always the case. We first investigate the concept of a pinhole camera in Figure 2.15. The pinhole camera operates without a conventional glass lens and confines all the rays of a scene through a single point, which is referred to as its optical center. The focal depth is the distance needed behind the optical center to focus the scene. Therefore we can see the captured image as a projection on a plane that lies at an equal distance in front of the optical center. This plane is often called the retinal plane of the camera.

Figure 2.15: Pinhole Camera Concept

Capturing an image with a camera can therefore mathematically be seen as mapping 3D world coordinates p to 2D camera space coordinates m. These coordinates are generally written in matrix form, as visualized in equations 2.5 and 2.6 (for practical reasons the transposed form is often used, since column matrices take up more vertical space on the page and cannot be used in-line with text).

\[ \mathbf{p} = \begin{bmatrix} x & y & z \end{bmatrix}^{\top} \tag{2.5} \]

\[ \mathbf{m} = \begin{bmatrix} s & t \end{bmatrix}^{\top} \tag{2.6} \]

As we will discuss further in the chapter on basic computer graphics, it is possible to define the mapping from world coordinates to camera space as a linear transformation in homogeneous coordinates. That way a matrix that models the camera as a mathematical concept can easily be written down [16]. This matrix is commonly known as the Perspective Projection Matrix (PPM). Considering a random setup of two pinhole cameras as depicted in Figure 2.16, the distance between the two optical centers O_L and O_R is called the baseline. A plane drawn through the baseline can still be rotated around it in every direction. Focusing on one point p that we want to capture with the two cameras, it is possible to define a fixed plane ψ through the baseline which contains p. The plane ψ can be seen as the extension of the triangle pO_LO_R. Therefore, ψ will intersect with the retinal planes of both cameras. This intersection causes an imaginary line to be visualized onto the planes, which is referred to as the epipolar line. Because the epipolar line lies upon the retinal plane and on ψ, it will always intersect with the baseline. This special point e is called the epipole, since every epipolar line that could be drawn from an arbitrary p will go through this point.

Figure 2.16: The Epipolar Constraint

This geometry is consistently defined as epipolar geometry [5; 15] and has important implications when a point m is known on one of the retinal planes. Let us take an arbitrary point m_L for which we want to find the corresponding point in the right retinal plane for stereo matching purposes. Since we assume that the camera positions O_L and O_R are known, the baseline is also known. The plane extended from the triangle m_LO_LO_R is the same as ψ, so the epipolar line can be constructed in the right image when the focal depths are known as well. As we can come to understand, the matching pixel must be found on this line. This important restriction is known as the epipolar constraint. The disparity between those pixels can then be expressed as a 2D vector.

2.2.2 Image Rectification

Because of the epipolar constraint, we can allow a stereo matching algorithm to restrict its search area to one line. Nonetheless, we still need to calculate the epipolar lines for every pixel we want to match. This involves too much overhead, so we are looking for a technique that simplifies this matter. In Figure 2.17 we present a solution known as image rectification [16]. The process of image rectification applies a transformation upon one of the two PPMs that represent the retinal planes. The retinal planes thereby become parallel with the baseline. As the epipoles are

the intersection of the baseline with the retinal planes, they will lie at infinity. Therefore the epipolar lines also become parallel. If we choose a proper transformation, we can make sure that these parallel epipolar lines also become horizontal. The transformation needed to do this is a 3 × 3 homography [16].

Figure 2.17: Rectified Images

Image rectification can thereby be defined as the application of a transformation that aligns epipolar lines with scan lines (the horizontal lines that build up a screen image), hence reducing the disparity vector to a scalar. That way the stereo matching problem reduces to the simple parallel stereo we discussed earlier. Rectification is one of the most studied topics in the field of stereo correspondence, and has already been extensively explored in terms of both quality and speed. Hence we shall not focus upon this process, and assume pre-rectified images in our algorithms. This assumption is further strengthened by the fact that the rectification step can be avoided altogether through proper camera alignment.

2.2.3 Disparity Equation

The setup in Figure 2.18 shows a parallel camera alignment or a properly rectified stereo pair. On the left side we can see that the triangles m_L x_L O_L and p x O_L are similar. Given that o_W is the world origin in the center of the baseline, we can derive equation 2.7 by calculating the cotangent of the lower corner.

\[ \frac{x_L - x_{O_L}}{f_D} = \frac{b/2 + x}{z} \tag{2.7} \]

Two similar triangles can also be seen on the right side, so we can derive equation 2.8 in an analogous manner.

\[ \frac{x_{O_R} - x_R}{f_D} = \frac{b/2 - x}{z} \tag{2.8} \]

Since we are working with the pixels of the images, we are not interested in using the coordinates x_L and x_R. Instead, the previous equations will be expressed with respect to the retinal plane coordinates, whose origin lies at the optical center. This is simply a translation along the X-axis, hence x_L = s_L + x_{O_L} and x_R = s_R + x_{O_R}. Having those equations, we can relate the depth of point p to the coordinates m_L = [s_L t_L]^T and m_R = [s_R t_R]^T of the retinal planes. These are after all the coordinates of the captured images that we use directly. We substitute the coordinate transformation into equations 2.7 and 2.8 to obtain equations 2.9 and 2.10.

\[ \frac{s_L}{f_D} = \frac{b/2 + x}{z} \tag{2.9} \]

\[ \frac{-s_R}{f_D} = \frac{b/2 - x}{z} \tag{2.10} \]

If we sum equations 2.9 and 2.10, the interesting equation 2.11 is obtained. Notice that the disparity d_L(s_L, t_L) = s_L − s_R when we use the left image as a reference [12].

\[ \frac{s_L}{f_D} + \frac{-s_R}{f_D} = \frac{b/2 + x}{z} + \frac{b/2 - x}{z} \;\Rightarrow\; \frac{s_L - s_R}{f_D} = \frac{b}{z} \tag{2.11} \]

The formula known as the disparity equation can then be derived as shown in equation 2.12. The disparity equation gives us a relationship between the matching distance of a point in the stereo vision and its depth in the scene.

\[ d_L(s_L, t_L) = s_L - s_R = \frac{f_D \cdot b}{z} \tag{2.12} \]

As we have discussed before, this gives an exact mathematical proof of our intuition that objects closest to the camera have the greatest shift when comparing the stereo images. When we consider the baseline and focal depth to be constant, we see that disparity is indeed inversely proportional to depth.
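A small numerical sketch of equation 2.12 makes this inverse relationship tangible: with a fixed (and here purely hypothetical) focal depth and baseline, doubling the disparity halves the depth.

#include <cstdio>

// Equation 2.12: d = fD * b / z, which inverts to z = fD * b / d.
double depthFromDisparity(double focalDepth, double baseline, double disparity) {
    return focalDepth * baseline / disparity;
}

int main() {
    const double fD = 500.0;  // focal depth in pixels (example value)
    const double b  = 0.10;   // baseline in meters (example value)
    std::printf("d = 20 px -> z = %.2f m\n", depthFromDisparity(fD, b, 20.0)); // 2.50 m
    std::printf("d = 40 px -> z = %.2f m\n", depthFromDisparity(fD, b, 40.0)); // 1.25 m
    return 0;
}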

Figure 2.18: Stereo Depth Calculation.

Since disparities are only defined for matching points, we can understand that it is possible for some areas to lack this depth coordinate. This problem originates in the fact that the cameras do not have the same field of view. Another cause can be that an object in front of one camera blocks parts of the scene that are visible to the other. This phenomenon is known as occlusion. Depth cannot be derived for occluded areas. In the case that we need sparse disparity maps, areas that lack depth are not really a problem, since sparse maps only provide a depth determination for a small

percentage of the image area. When we want dense disparity maps, these areas should be handled by some kind of interpolation mechanism if they are large, as dense maps require a determined depth or disparity for almost every pixel. We can understand that for high-quality DIBR, dense disparity maps are certainly needed.

2.3 Taxonomy of Stereo Algorithms

In 2002, Scharstein and Szeliski not only published a test bed to benchmark correspondence algorithms in their paper, they also introduced a taxonomy [6] for stereo algorithms. To be complete, we briefly touch upon these essential building blocks of such algorithms. That way we can also properly characterize the previously discussed techniques.

Figure 2.19: Stereo Taxonomy

As shown in Figure 2.19, a stereo algorithm can be broken down into four algorithmic steps. The sequence in which they get executed can vary depending on the specific implementation, although most of the time the presented sequence is used.
• The matching cost computation is most commonly done by an AD or SD intensity calculation. There are some considerations with these techniques, as they assume that both cameras have the same lighting circumstances and internal intensity bias. When the intensity starts to vary, serious problems can arise when we do not correct the images. The solution can be biasing the intensity of one of the images, or using a more complex technique that is insensitive to these variations. Such a technique is the Census transform [17; 9], which only looks at whether pixels are darker or lighter; a small sketch of this idea is given below.
• The cost aggregation is optional, but needed for a more reliable quality in most of the algorithms. As discussed before, it makes matches more reliable. Techniques such as SAD or SSD are commonly used, but in the discussed form all values were summed up without assigning weights. More advanced approaches generally assign a high weight to the center pixel and lower weights to pixels farther from the center [18]. These support windows are commonly called center-biased support windows.
• The disparity computation and optimization mostly involves checking which disparity has the minimal aggregated matching cost. The algorithm generally executes a winner-take-all (WTA) [6] optimization for each pixel. When multiple windows are used [19] for the aggregation phase, the minimum cost window is used for the WTA update.
• The disparity refinement usually applies some kind of filtering to the resulting disparity map, or involves a global optimization problem. That way the map quality can be levered further.
In this thesis we will use this taxonomy for creating local dense stereo algorithms. We keep the presented sequence, although we only focus upon the first three algorithmic steps.
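As promised above, the following sketch shows the basic idea of a Census-style matching cost for a 3 × 3 neighborhood: each pixel is replaced by a bit string recording whether each neighbor is darker than the center, and the cost of comparing two pixels is the Hamming distance between their bit strings. This is a CPU illustration of the principle only, with an assumed row-major intensity buffer, and not necessarily the exact variant used in [17; 9].

#include <cstdint>
#include <bitset>

// 3 x 3 Census transform of the pixel at (s, t): one bit per neighbor,
// set when that neighbor is darker than the center pixel.
uint8_t census3x3(const uint8_t* image, int width, int s, int t) {
    uint8_t center = image[t * width + s];
    uint8_t signature = 0;
    for (int dt = -1; dt <= 1; ++dt)
        for (int ds = -1; ds <= 1; ++ds) {
            if (ds == 0 && dt == 0) continue;   // skip the center itself
            signature <<= 1;
            signature |= (image[(t + dt) * width + (s + ds)] < center) ? 1 : 0;
        }
    return signature;
}

// The matching cost is the Hamming distance between two Census signatures, which
// makes the comparison insensitive to absolute intensity differences and bias.
int censusCost(uint8_t leftSignature, uint8_t rightSignature) {
    return static_cast<int>(std::bitset<8>(leftSignature ^ rightSignature).count());
}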

Chapter 3

Basic Computer Graphics

3.1 Overview

3.1.1 The Simplified Graphics Hardware Pipeline

The goal of the computer graphics domain is to efficiently visualize high-quality synthesized scenery. Graphics hardware is implemented as a pipeline architecture [5]. Pipelines are conventionally separated into several stages, and every stage or block performs a different operation, but all stages operate at the same time. When a block has finished its calculations, it waits until the next block has finished and then sends its data through to the next stage. To fully optimize a pipeline architecture, one should therefore take care that these stages take an equal amount of time. This is often referred to as a balanced pipe, in contrast with an unoptimized or unbalanced pipe. To visualize this, we refer to Figure 3.1. Pipeline mechanisms are characterized by their high throughput rather than by a quick response for a single computation: a single computation always has to pass every stage, and while the data is in one block, the others are doing nothing. This is in contrast with feeding in new data every time the first stage is ready, where the last stage will output subsequent results very quickly.

Figure 3.1: Unbalanced and Balanced Pipelining

Since the beginning of commodity graphics hardware, electrical and silicon engineers have always focused their development on high throughput performance. In the old days only the pipeline mechanism provided this high throughput capability, but as electrical and silicon technology evolved into physically smaller implementations, graphics cards became more and more equipped with parallel hardware. The focus has shifted from efficiently balancing the pipe more to

mere brute force. Nonetheless, we still have to address this graphics hardware very intelligently when we want to obtain maximum performance. The graphics pipeline generally takes a scenery description as input. This description is possible through the use of vertices, or simply coordinates. A vertex represents a point in space, and thereby a number of vertices can represent a simple shape. These shapes built out of vertices are known as primitives. A computer graphics developer has numerous primitive shapes available, but in low-level hardware the polygon shapes always break down to triangles. By means of these primitive polygons or triangles, more complex 3D shapes can be represented through a mesh structure. Figure 3.2 tries to give an idea of this hierarchy.

Figure 3.2: Vertex, Primitive and Mesh Representation

Each of these meshes can be described in a local space which we call model space. The advantage of this is that we can use simple coordinates as we see fit. Because some of the meshes can become very complex, it would be nearly impossible to write these coordinates in a space bound to other model descriptions.

Figure 3.3: Simple Graphics Pipeline

With all of the previous information, we can start to comprehend a simplified model of the graphics pipeline [5; 20; 21] as depicted in Figure 3.3. The pipeline can be broken down into three basic but important computational stages.
1. The vertex transformation transforms coordinates from the model space in which meshes are described, to a 3D world space where all the objects build up the entire scenery. This 3D world is then mapped to a 2D screen space as we look at this world through a fictional

camera. These operations are completely independent of each other, so the order in which the vertices are transformed does not matter at all.
2. After the vertices are transformed, the graphics card checks which vertices are grouped together to form a primitive. This process is referred to as the primitive assembly. Because the hardware works with triangles on a low-level basis, this step is also commonly known as the triangle setup. After formation of these polygon planes, the rasterization process is started. The rasterizer is the central part of the graphics pipeline and divides the primitives into fragments which have the size of screen pixels.
3. The fragment texturing and coloring is the final stage in this simplified pipe. It accesses a texture in the video memory. Textures are images that are pasted onto the polygon surfaces. This texture access is implemented by a hardware texture sampler that looks up the correct pixel that has to be mapped on the current fragment. Because textures can be a lot larger in size than the area they are being mapped to, it is possible that some filtering will happen in the texture sampler. Most of the time a lighting calculation will occur next to this sampling procedure. By combining the texture color and lighting color, a proper resulting pixel color can be computed.
This simplified model of the graphics pipeline is certainly not complete. Although it gives a good overview of the stages, every one of them is equipped with a lot more complex functionality. For now it is enough to comprehend the presented model, as we will go into much more detail in the following sections.

3.1.2 Graphics Programming Interfaces

If we want to start rendering scenes, the question naturally arises of how to send meshes and textures to the graphics card. Communicating with the graphics card is made possible through several graphics programmer interfaces, more commonly known as graphical Application Programming Interfaces (APIs) [22; 5]. We can address such an interface from regular CPU code, and the call is handled internally by the API at hand. It offers us a way to send and receive data to and from the graphics card in a transparent way, as can be seen in Figure 3.4.

Figure 3.4: Graphics API

Currently, two widely used graphical APIs are available to graphics programmers: Microsoft Direct3D [23] and the open alternative OpenGL [21; 20]. Direct3D comes as part

of the DirectX API for transparent communication with all multimedia-related devices that are commonly found in a personal computer running Microsoft Windows. DirectX 9.0c [24] is the currently active version. OpenGL was created by Silicon Graphics and originally ran only on powerful UNIX workstations. Over the years, OpenGL has grown to support multiple additional operating systems such as Microsoft Windows, Macintosh OS and the most common Linux distributions.

3.1.3 Hardware Programmability

In the beginning, graphics cards were equipped with configurable fixed-functionality hardware. Later, parts of the pipeline systematically became programmable and replaced the corresponding fixed-functionality hardware. This programmability was introduced by the use of processors inside the Graphics Processing Unit (GPU). Instead of always executing the same transformation and lighting equations, GPUs could henceforth compute custom-made algorithms. Since programming this GPU is not directly related to communicating with the video card, another programming language is introduced to write the relatively small GPU applications. The graphical APIs are then used to send these applications to the graphics card and to load them into the GPU. It is of utmost importance that we grasp the difference between the graphical API and the GPU programming language; therefore we visualize this concept in Figure 3.5.

Figure 3.5: Hardware Programmability Distinction

A cooperation between NVIDIA and Microsoft led to a very popular high-level programming language known as C for graphics [20], or Cg in short. Cg lets us develop small programs that can be loaded into the GPU through the API, before executing the pipeline mechanism. Such a language is consistently defined as a shading language. Applications that involve vertex processing are thereby called vertex shaders; applications involving fragment processing are called fragment shaders or, more conventionally, pixel shaders. A shading language differs from other major programming languages by the use of additional semantic keywords next to normal data types. The data type float4 is a supported type in shading languages, being a vector containing four floats. We can associate a semantic meaning with this vector, for example to distinguish a color or a vertex coordinate. Since the introduction of Cg, Microsoft has developed its own shading language, known as the High-Level Shading Language (HLSL) [24; 20]. HLSL has basically the exact same syntax, but the

difference lies especially in the code compiler. Over the years we do expect the syntaxes to grow more and more apart. The OpenGL community has developed a shading language as well, the OpenGL Shading Language (GLSL) [5]. Again, this was done to create a completely open alternative to the other available ones.

3.1.4 Multi-Level Parallel Computing

Graphics cards have a great intrinsic parallel architecture. As depicted in Figure 3.6, we can distinguish up to four levels of parallelism [25; 26] inside current pipelines.
• First there are the different stages of the pipeline, which work simultaneously. This implies that different operations can be done at the same time. This technique is termed plain parallelism.
• Looking deeper inside a single stage or operation, we notice that the tasks performed there can run on different data at the same time. This is defined as task-level parallelism, and is basically possible because of the independence between the data. In current GPUs, this task-level parallelism is implemented by means of a number of parallel processors.
• Focusing on one processor or task, we notice that the hardware there operates on four-component vectors. This means that computations happen on four data elements simultaneously, and therefore it is referred to as data-level parallelism.
• Finally, multiple simple instructions can be executed on the same data element at the same time. We define the latter as instruction-level parallelism.
Since this architecture has been designed from the ground up to run computer graphics algorithms, writing high-performance applications in this domain does not require us to explicitly think at these levels. Nevertheless, it is always better to understand the underlying hardware, since it opens up the possibility to achieve maximum performance or efficiency.

Figure 3.6: Different Levels of Parallelism

This parallelism is also the quintessential element in understanding why graphics cards today outperform even very large clusters of CPUs. Since CPUs dedicate most of their silicon area to unparallelizable control logic, they can only become faster by increasing their core clock. As central processors have reached a plateau in clock speed, they are somewhat stuck at their current performance. Moore's law [5] teaches us that every two years the number of

transistors is doubled on the same silicon area, but CPUs do not benefit from it in terms of speed. Because of the great intrinsic parallel nature of GPUs, they can be equipped with about twice as many processors, so their performance will still double every two years. This makes GPUs more and more interesting, especially since Moore's law is not expected to saturate for at least another decade.

3.2 Evolution of the Graphics Processing Unit

3.2.1 Four Generations of Hardware

Throughout the years, GPUs have evolved at a rapid pace. GPUs were not always programmable; they have undergone several transformations [20] that made them the highly programmable machines they are today.

Figure 3.7: GPU Evolution

In 1987 IBM introduced the Video Graphics Array (VGA), which was basically just a piece of memory that retained the pixel colors being displayed on the screen. This piece of memory is often referred to as a frame buffer. In those graphics cards the frame buffer was updated by the CPU. We speak of the pre-GPU era, since these cards do not perform any computations at all. From that point on, we distinguish four generations of GPUs as shown in Figure 3.7.
• Around 1997, NVIDIA and ATI introduced the first generation GPUs with the NVIDIA TNT2 and ATI Rage. They were capable of rasterizing pre-transformed vertices and applying one or two textures, so they completely relieved the CPU of updating pixels inside the frame buffer. They implement the DirectX 6 feature set and were a serious improvement compared to the pre-GPU era.
• The second generation of GPUs was introduced in 1999 with the NVIDIA GeForce 256 and the ATI Radeon 7500. They implemented the DirectX 7 and OpenGL feature sets, and were capable of configurable 3D vertex transformation and lighting. The set of mathematical operations involved with texturing and coloring pixels had been extended. Still, all of this was configurable and not truly programmable.
• In 2001, a third generation of GPUs was developed with the NVIDIA GeForce 3 and ATI Radeon 8500. Feature sets from DirectX 8 and OpenGL were supported, and thereby

introduced the programmability of vertex processing. This means that the GPU lets the application specify a sequence of instructions to process the vertices. Although the pixel-level configurability was extended again by means of multi-texture register combiners, they still lacked full programmability of that stage.
• Only a year later, in 2002, the NVIDIA GeForce FX and ATI Radeon 9700 introduced the fourth generation of GPUs. By implementing the extensive DirectX 9 and OpenGL feature sets, they also supported pixel-level programmability. The ability to address the old fixed functionality still remains, but it gets emulated by standard shaders in the GPU. As DirectX was revised to versions 9.0a, 9.0b and 9.0c, the programmability was extended with more functionality, the possibility to write larger applications, etcetera. This newer hardware includes the NVIDIA GeForce 6 and 7 series. The latest and fastest graphics cards from this generation are the NVIDIA GeForce 7900GTX and ATI Radeon X1950 XTX. These last cards also implement the OpenGL 2.0 API, and have support for GLSL. In this thesis we experiment with the NVIDIA GeForce 7900GTX with 512MB GDDR3 memory. We have chosen NVIDIA because they often release professional tools [27; 28] for developers, and are very open about their internal hardware architecture.

3.2.2 Advanced Disposition of Current Pipelines

We will now go into a bit more detail on the current pipelines, implementing the DirectX 9.0c and OpenGL 2.0 feature sets. Most of these descriptions add more detail to the stages we have seen in the simplified model, but to be complete, we also explain an additional essential stage at the end of the pipe.

Vertex Transformation

We already discussed the transformation from model space to screen space. In Figure 3.8, a more in-depth flow of the transformations performed in the first stage is presented. To understand these transformations, we should know that 3D coordinates p = [x y z]^T in computer graphics are always written down in a four-component h = [x y z 1]^T column vector form. These are 3D homogeneous coordinates [21; 5], introduced by August Ferdinand Möbius, which let us represent affine transformations with a single matrix. That way, even translations can be applied by multiplying with a proper 4 × 4 matrix T, as shown in equation 3.1.

\[ T \cdot h = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ z + t_z \\ 1 \end{bmatrix} \tag{3.1} \]

Therefore we can also understand that all other operations involved in a geometric transformation, such as a rotation R and a scaling S, can be combined into one 4 × 4 matrix. After applying this matrix it is certainly possible for the fourth element of the homogeneous coordinates to differ from 1. This value is defined as the perspective division number w [20; 21; 25]. Hence we are able to recover the non-homogeneous coordinates by dividing the other components by w. This operation is referred to as the perspective divide.
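The following small CPU sketch mirrors equation 3.1 and the perspective divide: a homogeneous point is multiplied with a 4 × 4 matrix, and the non-homogeneous coordinates are recovered by dividing by w. It only illustrates the arithmetic that the hardware performs; the minimal vector and matrix types are assumptions made for the example.

#include <cstdio>

struct Vec4 { double x, y, z, w; };
struct Mat4 { double m[4][4]; };  // row-major 4 x 4 matrix

// Generic 4 x 4 matrix times homogeneous column vector (covers equation 3.1).
Vec4 transform(const Mat4& M, const Vec4& h) {
    return Vec4{
        M.m[0][0]*h.x + M.m[0][1]*h.y + M.m[0][2]*h.z + M.m[0][3]*h.w,
        M.m[1][0]*h.x + M.m[1][1]*h.y + M.m[1][2]*h.z + M.m[1][3]*h.w,
        M.m[2][0]*h.x + M.m[2][1]*h.y + M.m[2][2]*h.z + M.m[2][3]*h.w,
        M.m[3][0]*h.x + M.m[3][1]*h.y + M.m[3][2]*h.z + M.m[3][3]*h.w };
}

// The perspective divide recovers the non-homogeneous coordinates when w differs from 1.
Vec4 perspectiveDivide(const Vec4& h) {
    return Vec4{ h.x / h.w, h.y / h.w, h.z / h.w, 1.0 };
}

int main() {
    // Translation matrix T of equation 3.1 with (tx, ty, tz) = (2, 3, 4).
    Mat4 T = {{ {1,0,0,2}, {0,1,0,3}, {0,0,1,4}, {0,0,0,1} }};
    Vec4 p = transform(T, Vec4{1.0, 1.0, 1.0, 1.0});
    std::printf("translated point: (%g, %g, %g, %g)\n", p.x, p.y, p.z, p.w); // (3, 4, 5, 1)
    return 0;
}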

Looking at Figure 3.8, we notice a modeling transformation that is associated with a certain mesh. It maps the coordinates from this local mesh model to a world where all meshes come together and construct the scene. The model matrix determines where the object will be placed in the world, and thus it is one of the configurable elements in fixed-functionality hardware. It is therefore literally possible to load custom matrices into the hardware pipe.

Figure 3.8: Detailed Vertex Transformations

The next transformation is the view transformation; it transforms the world scene to eye space as seen from an imaginary camera. The matrix associated with this transform is usually called the view matrix, and has the same kind of configurability as the former. It is followed by the projection transformation, which defines a view frustum. This view frustum selects the volume that is viewable for the fictive camera, and is applied by the 4 × 4 projection matrix. The coordinates end up in clip space, where they are restricted to −w ≤ x ≤ w, −w ≤ y ≤ w and −w ≤ z ≤ w [21; 20], except for Direct3D, where 0 ≤ z ≤ w [24]. When the coordinates are fully transformed, a perspective division is performed to bring the coordinates to normalized device space. This means that all coordinates will lie between −1 and 1. To finish, a viewport transform brings these normalized coordinates to screen space. That way the geometry is expressed in pixel coordinates, and can be directly fed into the rasterizer for proper rasterization. Since vertex processing became programmable, the three 4 × 4 transformation matrices are combined into one single ModelViewProjection matrix [20]. As a result, only one matrix multiplication has to be executed in the vertex shader to perform the entire transformation.

Primitive Assembly and Rasterization

Besides the primitive assembly already discussed, this stage also takes care of clipping when polygons exceed the view frustum. On top of that, polygons that face backwards are discarded; this process is commonly known as culling. Triangles or polygons that survive the clipping and culling steps are fed into the rasterizer. The rasterization process breaks the polygons down into pixel-sized fragments. One should not confuse a fragment with a pixel. Pixels are elements in the frame buffer, in contrast to fragments, which are yet to be processed by coloring etcetera. Eventually they can be updated to a pixel in the frame buffer, so always look at a fragment as being a potential pixel instead of a pixel itself.

Fragment Texturing and Coloring

This stage has undergone the most changes as GPUs evolved. First and second generation GPUs supported sampling from only one or two textures, and it was very hard to blend two textures

together, as it was often only possible by going through the pipeline a second time. Third generation GPUs were equipped with multi-texture register combiners that could sample from at most 4 textures. A lot of logic was available in the register combiners to perform operations like additions, subtractions and blending in a single pass on all of the textures. Although this was a huge improvement, it is still no match for the more general programmable texturing that is available today. In theory one can now access and blend as many textures in a pixel shader as wanted, although one should still keep an eye on the internal architecture of the GPU in order to achieve high performance.

Raster Operations

The raster operations happen at the end of the graphics pipeline [20; 25]. They perform a large number of essential tests and operations on the fragments, but we only focus upon the ones presented in Figure 3.9. The hardware units that perform these operations are known as the Raster Operators, abbreviated ROPs. The first hardware of the fourth generation has as many ROPs as fragment processors. Since fragment processing became more complex, it takes a lot longer to complete the pixel shader compared to the raster operations. This results in the ROPs often being in an idle state. The newest hardware aggregates processed fragments into the fragment crossbar, after which they are sent to the ROPs. That way a smaller number of ROPs is needed, and more silicon area becomes available to exploit intrinsic parallelism.

Figure 3.9: Raster Operations Flowchart

First, a stencil test is performed. The stencil buffer is most of the time an 8-bit unsigned number that allows the programmer to create custom flags. By setting the stencil test to less, greater, etcetera, the programmer can implement a form of flow control, since only fragments that pass this stencil test go further down the pipe. Next follows the depth test. When a fragment reaches this test, its depth is compared with the Z-value of the pixel that is currently in the frame buffer. Passing this test transforms our fragment into a pixel, since it will get updated into the frame buffer. By updating the pixel, the Z-value is also overwritten in the depth or Z-buffer, for depth testing the next fragment that potentially occludes this pixel. Blending hardware is available at the very end of the pipe. This serves whenever the programmer wants to mix a pixel color with the color that was already there. Since the latter lies deeper in the scene, the blending can be efficiently used to create transparency.
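As a small illustration of what the blending hardware computes when it is configured for standard source-alpha blending (one common configuration, chosen here merely as an example), the incoming fragment color is weighted by its alpha value and mixed with the color already present in the frame buffer.

struct Color { float r, g, b, a; };

// Standard source-alpha blend: result = src * src.a + dst * (1 - src.a), where 'src' is
// the incoming fragment color and 'dst' the pixel already stored in the frame buffer.
Color alphaBlend(const Color& src, const Color& dst) {
    float a = src.a;
    return Color{ src.r * a + dst.r * (1.0f - a),
                  src.g * a + dst.g * (1.0f - a),
                  src.b * a + dst.b * (1.0f - a),
                  1.0f };
}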

3.2.3 Future Pipeline Architecture

Toward the end of this thesis research, a fifth generation of hardware was released that implements the Microsoft Windows Vista [5] DirectX 10 [24; 29] feature set. The NVIDIA GeForce

8800GTS and GTX [28; 27] are the first of their kind, and again have evolved tremendously compared to the previous generation. Fixed functionality has become completely obsolete, so there is no emulation whatsoever available. The most distinctive feature compared to DirectX 9.0c is the introduction of a third shader type. This shader type is the geometry shader and operates on primitives instead of vertices. Hence, as depicted in Figure 3.10, its location in the pipe is between the vertex shader and the rasterizer.

Figure 3.10: DirectX 10 Pipeline

Geometry shaders have the ability to write to the video memory, thus creating additional geometry. In future gaming, these shaders will be used to automatically create high-definition geometry when zooming into a specific area. Because vertex shaders and pixel shaders run on different types of processors in most of the previous generations of GPUs, NVIDIA has evolved to the unified shader architecture. Hereby all underlying processors can take on either functionality, and thus the design allows the hardware to dynamically adjust to the demands of the application in order to automatically achieve the highest pipeline performance possible. Another significant change in this hardware concept is the sacrifice of data-level parallelism for more task-level parallelism. The shaders no longer work on four-component vectors, but on single scalars. This frees up a lot of silicon area to create more parallel shaders, ergo more parallelism on the task level. When comparing an NVIDIA GeForce 8800GTX with the previous GeForce 7900GTX, it is noticeable that the 8800 has 128 processors instead of the 32 found in the 7900. This enormous increase in the number of processors does come at a price, since they only operate on single scalar data. Because the clock of the shader processors in the 8800GTX is about 2.5 times faster than its predecessor's, we can easily derive its speed compared to traditional GPU processors as 128 · 2.5/4 = 80 [29]. This roughly means that this GPU will have the same performance as if it were designed with 80 traditional four-component processors. Throughout this thesis we will develop our techniques upon strict shader code without relying on fixed functionality. That way the shader code can be ported easily, without too many problems, to the DirectX 10 graphical API in future applications.

3.3 Programming Graphics Hardware

3.3.1 Assembly Languages

Processors execute instructions through the use of an assembly language. As we have discussed earlier, different types of processors are housed within the GPU. A different assembly language exists for every processor type and, to make matters worse, DirectX and OpenGL both use different assembly instruction profiles. On top of that, as graphical APIs evolve, newer and more extensive

instruction sets become available. Therefore every new API revision has defined a new assembly profile. We have tried to visualize this complex structure of available assembly languages in Figure 3.11.

Figure 3.11: GPU Assembly Profile Structure

Microsoft DirectX has the habit of grouping the concurrent assembly profiles for the different shader types into a Shader Model (SM). DirectX 9 started with Shader Model 2.0, and revision 9.0c introduced Shader Model 3.0. The newest DirectX 10 also introduces a geometry shader assembly language (gs_4_0) as part of Shader Model 4.0. These DirectX assembly profiles are well supported on both ATI and NVIDIA hardware, in contrast with most of the OpenGL assembly profiles. The vp20/fp20, vp30/fp30 and vp40/fp40 profiles are more NVIDIA specific [20] and can thus give some practical issues when porting to ATI hardware. Please keep this in mind, as it is one of the reasons why we have decided to perform this research with the DirectX graphical API.

3.3.2 High-Level Programming

Because it is simply unworkable to develop very large programs in assembler, the languages Cg, HLSL and GLSL have been developed. Just as C was the first high-level language for system programming on CPUs, Cg was the first high-level language for programming GPUs. Analogous to needing a different compiler or assembler profile to compile C code for, e.g., an x86 and an Itanium-based architecture, we need to specify a different assembler profile when compiling a shading language for different graphics cards. Since it is possible to compile shading languages at runtime, the need for multiple compilation can be avoided. When developing high-level shader applications, we should always use the lowest possible profile that supports our needs. That way we can run our application on the broadest range of hardware possible. In modern gaming, it is good to support multiple profiles. Most of the time an advanced profile is used, together with a fair fall-back when the requested profile is not supported. We will discuss the exact implementation of this functionality later on. To introduce high-level shader programming, we focus only on Cg/HLSL syntax. Listings 3.1 and 3.2 give a very small and simple example of a vertex shader and a pixel shader. As we discussed before, by use of the semantics we define the layout of the data that enters and exits

our shader. This is done by means of a standard C structure. Next to the definitions of these structures, we also have to define variables that can be loaded from the application into the GPU. In the case of the vertex shader example, this is the ModelViewProjection matrix. Vertex shaders are commonly written in clear text files with a .vsh extension.

Listing 3.1: Simple Vertex Shader Code Example

// Defining shader variables which get loaded from the application.
float4x4 modelViewProjMatrix;

// Defining structures used in the shader.
struct MY_VERTEX_INPUT {
    float4 myPosition : POSITION; // Typical declaration in Cg/HLSL (datatype, variable and semantic).
};
struct MY_VERTEX_OUTPUT {
    float4 myPosition : POSITION;
};

// A standard vertex shader to emulate the normal ModelViewProj transformation.
MY_VERTEX_OUTPUT main(MY_VERTEX_INPUT myInput) {
    MY_VERTEX_OUTPUT myOutput;
    myOutput.myPosition = mul(modelViewProjMatrix, myInput.myPosition);
    return myOutput;
}

When the vertices have been processed, they are rasterized into fragments. Those fragments are sent to the fragment shader as shown in Listing 3.2. Analogous to vertex shaders, this code is written in files with the .psh extension.

Listing 3.2: Simple Pixel Shader Code Example

// Defining structures used in the shader.
struct MY_VERTEX_OUTPUT {
    float4 myPosition : POSITION;
};
struct MY_FRAGMENT_OUTPUT {
    float4 myColor : COLOR;
};

// A standard pixel shader which does not sample a texture, but always fills in blue.
MY_FRAGMENT_OUTPUT main(MY_VERTEX_OUTPUT myInput) {
    MY_FRAGMENT_OUTPUT myOutput;
    myOutput.myColor = float4( 0.0f, 0.0f, 1.0f, 1.0f ); // RGBA values.
    return myOutput;
}

We address this code by making a proper call to the API from inside the CPU code. In most situations, the graphical API will handle the compilation process for us.

3.3.3 Effect Development

Since graphics programming became more and more complex, games are being developed by larger groups of people that each have their own specific expertise. Because of this, two needs in

the domain of computer graphics had to be fulfilled. First there was the need to easily address fall-back shaders when the user does not have the latest hardware that the game demands. Because of the pressure that originates from the high-end user market, game developers cannot permit themselves to build a game based on older hardware. Therefore popular games will almost always use the latest tricks and hardware capabilities. The fall-back shaders are thus needed to make the game accessible for a broader spectrum of users or gamers. The second need was caused by the close collaboration between graphics designers and graphics programmers. Designers wanted to change texturing features on their digitally drawn models, but most of the time this involved changing the shader code. Since designers are not into programming, this would lead to chaotic communication with the programmers. All of these needs are satisfied with the introduction of effects (FX) [30; 20], which are written in clear text files with an .fx extension. When we look at the sample effect code in Listing 3.3, we can see that we can define techniques that consist of several passes. Inside each pass, all shader functions are defined together with the profile in which to compile the code at runtime. All custom settings of the pipeline for that pass can be set next to these shader definitions as well. This provides a clean structure for programming and configuring the entire pipe.

Listing 3.3: Effect File Structure

technique myNormalTechnique {
    pass Pass0 {
        // Call a shader written in the FX-file by its function name,
        // instead of writing single files with a main-function.
        VertexShader = compile vs_3_0 myNormalVertexShader1();
        PixelShader = compile ps_3_0 myNormalPixelShader1();
    }
    ...
    pass PassN {
        VertexShader = compile vs_3_0 myNormalVertexShaderN();
        PixelShader = compile ps_3_0 myNormalPixelShaderN();
    }
}

technique myFallBackTechnique {
    pass Pass0 {
        Lighting = FALSE;
        VertexShader = compile vs_1_1 myFallbackVertexShader1();
        PixelShader = compile ps_1_1 myFallbackPixelShader1();
    }
    ...
    pass PassM {
        VertexShader = compile vs_1_1 myFallbackVertexShaderM();
        PixelShader = compile ps_1_1 myFallbackPixelShaderM();
    }
}

Assume, for example, that we want to render a shiny metal surface upon some given geometry with the above effect. The idea is to create this effect through at least one technique. We use myNormalTechnique when Shader Model 3.0 is supported on the hardware running the application. If it is not supported, we fall back to a technique that uses Shader Model 1.1 by addressing myFallBackTechnique. Notice that we compile to a different assembly profile, but

also compile different shader code. Hence, such an approach demands that we write shaders that approximate the shiny metal effect using older hardware functionality. This other technique can very well have a different number of passes; most of the time older approaches require more passes than newer ones. The functions containing the shader code must be written in the fx-file as well, so that all of the graphics code for one entire effect is neatly tucked into a single file. When writing effects in HLSL, the syntax in this file is referred to as the Effect Format. In terms of Cg, the syntax is known as CgFX. As a small note, we also want to mention that there are graphical development tools for fx-files. They make it possible for programmers to link sliders and small graphical control interfaces to the variables inside the code, eliminating the need for designers to adjust the code when they want to visually change the effect. An example of such a tool is the NVIDIA FX Composer [27], which is widely used throughout movie and game development.
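To sketch how such an effect is typically driven from the C++ side of a Direct3D 9 application, the fragment below loads an fx-file and selects the fall-back technique when the preferred one does not validate on the current hardware. It is a simplified outline only: error handling is omitted, an ANSI build and an already created device are assumed, and the file and technique names are the hypothetical ones from Listing 3.3.

#include <d3dx9.h>

// 'device' is an already created IDirect3DDevice9*, set up elsewhere in the application.
void renderWithEffect(IDirect3DDevice9* device) {
    ID3DXEffect* effect = NULL;
    // Compile and load the effect file at runtime (compilation errors are ignored here).
    D3DXCreateEffectFromFile(device, "myEffect.fx", NULL, NULL, 0, NULL, &effect, NULL);

    // Prefer the Shader Model 3.0 technique, fall back when it does not validate.
    D3DXHANDLE technique = effect->GetTechniqueByName("myNormalTechnique");
    if (FAILED(effect->ValidateTechnique(technique)))
        technique = effect->GetTechniqueByName("myFallBackTechnique");
    effect->SetTechnique(technique);

    // Run every pass of the chosen technique; the draw call is issued once per pass.
    UINT passCount = 0;
    effect->Begin(&passCount, 0);
    for (UINT pass = 0; pass < passCount; ++pass) {
        effect->BeginPass(pass);
        // device->DrawPrimitive(...) would render the geometry here.
        effect->EndPass();
    }
    effect->End();
    effect->Release();
}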

3.4 Efficient Hardware Availability

3.4.1 Mipmap Scale-Space

When a texture is sampled in order to paste this image upon a surface in our geometry, the procedure is called texture mapping. Pixels from a texture are often named texels to distinguish them from frame buffer pixels. When mapping a texture, the size of the texture is not necessarily the size of the geometry we are mapping it to, so some kind of filtering is needed. When the geometry size exceeds that of the texture, we speak of texture magnification filtering. In the opposite case, we must apply a texture minification filter in order to properly reduce the size of the texture. In theory these filtering steps could be skipped, but the price to pay is very high. When skipping magnification filtering, we will see large sharp-edged blocks, since the texels are just blown up to the requested size. When skipping minification filtering, we sample at sparse places in the texture. Since all of the details are still in the texture, we will suffer from aliasing [25]. This is a phenomenon that arises whenever we sample something that contains too much information to handle with such sparse sampling, as formally captured by the Nyquist theorem.

Figure 3.12: Point and Linear Sampling

In graphics, we distinguish a number of filtering types that are supported in the hardware. It is very important that we have a good understanding of the point and linear sampling concepts depicted

in Figure 3.12. They are after all the main principle behind all of the filter types. While point filtering or sampling simply selects the closest texel color c_0, linear sampling applies weights based on the sampling coordinate s. The distance |s − s_0| between texels s_0 and s_b determines the weight that is used for the interpolation process, as shown in equation 3.2.

\[ c_s = \frac{|s - s_0|}{|s_b - s_0|} \cdot c_b + \left( 1 - \frac{|s - s_0|}{|s_b - s_0|} \right) \cdot c_0 \tag{3.2} \]

The closer we approach a texel, the larger the weight this texel gets compared to the neighboring texel. Hence this results in a smooth gradient transition. As this sampling is demonstrated in one dimension, linear filtering on a GPU translates into a bilinear sampling [25; 26] process. The latter uses the four texels that surround the sampling coordinate and performs basically the same weighted operation as discussed before. While a bilinear approach can be performed quickly for magnification filtering, it can become very complex for minification. If an entire texture were mapped onto a geometrical surface of only a single pixel, we would have to access all of the texels. Since minification filtering is needed most of the time, we can come to understand that it can seriously degrade speed. As the graphics card manufacturers were aware of this, efficient hardware was implemented to solve these problems. The solution is known as mipmaps [5]. The word 'mip' derives from the Latin phrase 'multum in parvo', literally meaning 'much in a small space'. To avoid redundant operations and heavy calculations, a mipmap can be quickly produced from a texture before it is needed in real-time applications. A texture is bilinearly filtered to what is called a higher mip-level. Since four texels become one, the total area of this next mip-level is reduced four times. This mip-level then undergoes the same process, creating the second mip-level. The hardware repeats this until a single texel remains, which is the highest mip-level in the chain. This entire mip-level hierarchy forms the mipmap. A common texture lookup will thereby result in a single simple bilinear interpolation from the closest level of detail (LOD) of the mipmap.
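The sketch below replays equation 3.2 on the CPU and shows how one mip-level is reduced to the next by averaging 2 × 2 texel blocks. It is only meant to make the two ideas tangible: the real texture units perform these steps in dedicated hardware, and the simple box filter and single-channel floating-point texels are assumptions made for this example.

#include <vector>
#include <cmath>

// Equation 3.2: weight the two neighboring texel colors c0 and cb according to the
// distance of the sampling coordinate s to the texel centers s0 and sb.
float linearSample(float s, float s0, float sb, float c0, float cb) {
    float w = std::fabs(s - s0) / std::fabs(sb - s0);
    return w * cb + (1.0f - w) * c0;
}

// One minification step of a mipmap chain: every 2 x 2 block of texels is averaged
// into a single texel, so the area of the next mip-level shrinks by a factor of four.
std::vector<float> nextMipLevel(const std::vector<float>& level, int width, int height) {
    std::vector<float> next((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
        for (int x = 0; x < width / 2; ++x)
            next[y * (width / 2) + x] = 0.25f * (level[(2 * y)     * width + 2 * x] +
                                                 level[(2 * y)     * width + 2 * x + 1] +
                                                 level[(2 * y + 1) * width + 2 * x] +
                                                 level[(2 * y + 1) * width + 2 * x + 1]);
    return next;
}

Calling nextMipLevel repeatedly, halving width and height each time until a single texel remains, produces exactly the kind of mip-level hierarchy shown in Figure 3.13.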

Figure 3.13: Mip-level Hierarchy

Figure 3.13 shows an example of a 16 × 16 texture that has been converted to a mipmap. The hierarchy consists of five levels in this case, but normally this is a lot higher, as textures today can grow up to 4096 × 4096 resolution [24] for usage with extremely high definition. When non-power-of-two resolutions are used, a complex algorithm is started to convert the first mip-level to the closest power-of-two resolution. For exactly this reason, power-of-two resolution textures in general seriously speed up rendering.


3.4.2 Texture Coordinate Interpolators

When we want to map a texture onto the geometry, we also have to specify texture coordinates along with the vertices. These texture coordinates specify the texels that correspond with the fragments lying at the vertex locations. Since we did not specify texture coordinates for all of the other fragments, texture coordinate interpolators are available in the rasterizer. They adequately interpolate the coordinate while generating the fragment. Because newer hardware often works with multiple textures, it is possible that all of these textures need different coordinate specifications. Most modern GPU hardware has up to eight parallel interpolator units available.

3.4.3 Multiple Render Targets

With Multiple Render Targets (MRT) [24] we are able to write to multiple buffers at the same time, and most graphics hardware supports this feature. Using this ability is very easy and is demonstrated in Listing 3.4.

Listing 3.4: Addressing Multiple Render Targets

// Defining structures used in the shader.
struct MY_VERTEX_OUTPUT {
    float4 myPosition : POSITION;
};
struct MY_FRAGMENT_OUTPUT {
    float4 myColor0 : COLOR0;
    float4 myColor1 : COLOR1;
};

// As this is part of an fx-file, we do not use a main function.
MY_FRAGMENT_OUTPUT myPixelShader(MY_VERTEX_OUTPUT myInput) {
    MY_FRAGMENT_OUTPUT myOutput;
    myOutput.myColor0 = float4( 0.0f, 1.0f, 0.0f, 1.0f ); // RGBA, green color.
    myOutput.myColor1 = float4( 1.0f, 0.0f, 0.0f, 1.0f ); // RGBA, red color.
    return myOutput;
}

By defining the proper semantics, the pixel shader has an abstract distinction between the color buffers it can render to. Linking these render targets to the proper areas or buffers in the video memory happens at the API level.

3.4.4 Early-Z Mechanism

Since fragment processing becomes more and more complex, a lot of computations that happen in the traditional pipeline are redundant. This is because fragments sometimes get discarded at the end of the pipe when they fail the depth test in the ROP, after all of the complex processing has already been executed. Therefore the early-Z mechanism [26; 24] was developed. When the hardware sees that no depth changes are made in the pixel shader, it is possible to test the depth when the fragment comes out of the rasterizer. When the test fails, the fragment can be discarded before the fragment processing. This technique is known as implicit early-Z culling, in contrast with explicit early culling, where we explicitly disable depth writing in the pixel shader.


Chapter 4

Exploiting Graphics Hardware

4.1 Overview

4.1.1 Realizing General Purpose Computations

Since GPUs became a lot more powerful than CPUs, they have attracted the interest of high-performance programmers. As this opens up a completely new programming era, techniques are still being developed to efficiently exploit graphics hardware [31; 15; 26]. This new domain of research has formalized itself into the principle of GPGPU computing.

Figure 4.1: GPGPU Computational Kernel Functionality

The abstract principle of performing general purpose computations on the GPU has been schematically visualized in Figure 4.1. First we have to dimension the data that we want to process. This dimensioning happens by chopping up the data array into equal pieces and pasting them underneath

each other. That way the data is organized into a two-dimensional form. As a side note, it is always best to try and make the resulting dimensions powers of two, since they allow faster advanced intrinsic processing on the GPU. When the data has been organized into a 2D form, it can be loaded into the video memory as a texture. Hence texture intensities represent the scalar values of the data. After that, computational activity in the graphical pipeline is triggered by sending a quadrilateral to it. A quadrilateral or quad is a mere rectangle that consists of four vertices. The quad gets rasterized into fragments, and every fragment executes a pixel shader. It is therefore possible to write the general purpose computations, instead of the color calculations, inside the shader. The fragment processing is often referred to as the GPGPU computational kernel, since it is the essence of exploiting the graphics card. The results of the computations are written to a color buffer inside the video memory, and can be transferred back to the system memory. To finish, the data is reorganized back into a one-dimensional array form. The processed data can then eventually be used by the application at hand.
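The following C++ sketch illustrates the dimensioning step described above on the CPU side; the function name and the padding-with-zeros choice are illustrative assumptions, not part of the thesis framework.

#include <cmath>
#include <vector>

// Chops a one-dimensional data array into equal rows and pastes them underneath
// each other, so the result can be uploaded as a (preferably power-of-two) texture.
std::vector<float> reshapeTo2D(const std::vector<float>& data, int& width, int& height) {
    width  = 1 << static_cast<int>(std::ceil(std::log2(std::sqrt(static_cast<double>(data.size())))));
    height = static_cast<int>(std::ceil(static_cast<double>(data.size()) / width));
    std::vector<float> texture(static_cast<size_t>(width) * height, 0.0f); // pad with zeros
    for (size_t i = 0; i < data.size(); ++i)
        texture[i] = data[i];   // row-major copy: element i maps to texel (i % width, i / width)
    return texture;
}

The inverse reorganization back to a one-dimensional array is simply the row-major copy in the other direction.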

4.1.2 Transfer Bottleneck

GPGPU has opened an era of programming where application developers start to use the synergy of combining the heavy computational horsepower of the GPU with the controllability of the CPU for program flow control. Writing these enhanced applications implies a lot of data transfers between video and system memory. Figure 4.2 depicts the common architecture found in commodity personal computers [5]. Since the CPU is still the central part of execution, a data transfer from system memory to video memory is defined as downloading. We define the reverse transfer as a read-back operation.

Figure 4.2: Personal Computer Memory Architecture

As the diagram shows, the video and system memory are very far apart from each other. A download operation requires data to pass over the memory bus into the PCI Express lanes, or into the Accelerated Graphics Port (AGP) in older computers. From that point on it is available to the GPU, but it still has to pass over the GPU's internal memory bus to reach the video memory. This basically involves three kinds of data transfers.

• Transferring data from the system memory to the NorthBridge or Memory Controller Hub (MCH) in Intel architectures. The speed of this transfer depends on the limiting clock frequency of either the memory or the controller itself. System memory at this time is commonly DDR2 RAM ranging from 667 MHz to 1250 MHz. This means transfer speeds of 5336 MB/s to 10000 MB/s, as the memory bus is 64 bits wide. Since most motherboards today are enhanced with a 128 bit dual-channel memory bus, these speeds can range from 10672 MB/s up to 20000 MB/s.

• From the NorthBridge/MCH, data has to pass over a bus dedicated to the graphics card. This is generally either AGP8x, which has a maximum speed of 2133 MB/s, or a 16-lane PCI Express slot. PCIe16x reaches speeds up to 4000 MB/s, since a single lane can transfer 250 MB/s. Although the PCIe32x standard has been developed, no graphics card uses it yet.

• The last transfer happens over the memory bus of the GPU. The latest graphics cards use GDDR3 memory with multiple 64 bit channels. The NVIDIA GeForce 7900GTX clocks its memory at 1600 MHz with a quadruple channel of 256 bits, so it can internally transfer data at up to 51200 MB/s. The GeForce 8800GTX clocks its memory at 1800 MHz and uses a six-wide channel of 384 bits, reaching speeds up to 86400 MB/s.

The conclusion to draw from these calculations is that the transfer bottleneck will most likely be the transfers that pass over the AGP or PCIe lanes. Hence we should avoid downloads and read-backs, as they will dramatically lower the performance of the higher-level application. It is therefore very wise to keep the data in one place as long as possible. Avoiding these operations should be the first priority when developing hybrid CPU/GPU applications.
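The bandwidth figures quoted above follow from a single formula, sketched below in C++ (the effective clock rates are the ones quoted in the text, already including the double data rate):

#include <cstdio>

int main() {
    // Theoretical bandwidth = effective clock (MHz) * bus width (bytes), giving MB/s.
    struct Link { const char* name; int clockMHz; int busBits; };
    const Link links[] = {
        { "DDR2-667, single channel", 667,  64  },
        { "DDR2-667, dual channel",   667,  128 },
        { "GeForce 7900GTX GDDR3",    1600, 256 },
        { "GeForce 8800GTX GDDR3",    1800, 384 },
    };
    for (const Link& l : links)
        std::printf("%-28s %d MB/s\n", l.name, l.clockMHz * (l.busBits / 8));
    return 0;
}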

4.1.3 Abstract Stream Programming

Developing applications for the GPU can be abstracted to what is known as the stream programming model [26]. In this model, all data is represented as a stream. A stream is made up of a number of data elements of the same type, and can be of any length.

Figure 4.3: The Stream Programming Model

Operations that are allowed on the streams are duplicating them, deriving substreams from them, and performing computations on them by means of a kernel. The kernel runs on a stream processor and operates only on entire streams, as visualized in Figure 4.3. Most of the time, this kernel performs a mapping operation where a single input element is mapped to a single output element. Other operations include expansions and reductions. Expansion kernels generate multiple data elements for each input element. Reduction kernels do the opposite and combine multiple inputs into one single output element. Applications that are written in the stream programming model are generally built up by linking these computational kernels to each other. By linking proper kernels that represent every stage, we can easily abstract the graphical pipeline into this model. Notice that in GPGPU only the fragment processing stage is seen as customizable, which means that we can generally implement only one kernel per pass. As we can

see the relevance between GPU and stream programming, we are able to derive some important properties of GPU programming through further study of the stream programming model.
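As a CPU-side caricature of the model, the sketch below emulates a mapping kernel with a plain C++ function applied to every stream element; all names are illustrative and no GPU is involved.

#include <vector>

// A kernel is a pure function applied to every element of an input stream,
// producing an output stream of the same length (a mapping kernel).
template <typename In, typename Out, typename Kernel>
std::vector<Out> runMapKernel(const std::vector<In>& stream, Kernel kernel) {
    std::vector<Out> output;
    output.reserve(stream.size());
    for (const In& element : stream)
        output.push_back(kernel(element));  // each invocation only sees its own element
    return output;
}

// Example: the doubling kernel of Section 4.3.3 expressed as a stream operation:
// std::vector<float> doubled = runMapKernel<float, float>(input, [](float x) { return 2.0f * x; });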

4.1.4 Data Independence

An important property of computational kernels is data element independence [31; 26]. Data independence is enforced since the kernel can only operate on its input and not on its output. Therefore, calculations previously performed by the kernel can never influence the following computations. Exactly because of this, the stream programming model is easily implemented on Single Instruction Multiple Data (SIMD) [25; 5] hardware. The GPU is therefore able to translate its programmable fragment processing to a SIMD architecture by means of multiple fragment processors that all execute the same instruction at the same time. Vertex processing also uses multiple processors, but since their hardware is a little more complex, they are also capable of performing Multiple Instruction Multiple Data (MIMD) [26; 5] operations. As GPGPU focuses mainly on fragment processing, this functionality is less important for us. Another important implication of data independence is that we cannot directly design a GPU algorithm that needs to perform calculations bound to previous outputs. Simply said, because of the intrinsic parallelism, we cannot predict in what sequence the calculations will be performed. When we are able to adjust the algorithm and split its design into two linked computational kernels, we can still implement it on the GPU. Notice that in such a case we need two passes, one for each of the computational kernels.

4.1.5 Locality Constraint

Every data element that is fed into a stream processor results in a kernel invocation. Keeping this in mind, another very important property of stream processors is kernel locality. As we can see in Figure 4.4, a computational kernel has a hard time operating on the global stream. We define kernel locality as the restriction to values that are short-lived within a single kernel invocation. Although it is theoretically possible to operate on the entire global stream with a large enough buffer, it is simply impossible to produce practical hardware that does this and still keeps its high performance.

Figure 4.4: The Kernel Locality Constraint

Although it would be possible otherwise, kernel locality implies that we should restrict ourselves to mapping local algorithms onto the GPU. Global approaches would seriously downgrade the overall performance, because of the GPU's close resemblance to a stream processor. The internal caches of the GPU have been designed especially for local access, because the GPU always operates on 2 × 2 texel areas (texels are the pixels of a texture). As a result, by sampling one texel, the neighboring ones are also read into the cache and can thereby be accessed very quickly. Hence our interest in and focus on local-based stereo matching algorithms in this thesis.

4.2 Avoiding Data Corruption

4.2.1 One-to-One Mapping

In common GPGPU, we want a kernel invocation for every data element that needs to be processed. Since the number of data elements equals the number of texels that form the texture, a fragment has to be created by the rasterizer for each texel [26]. This way we implement a one-to-one mapping between texels and pixels in the output color buffer. Figure 4.5 visualizes how this mechanism should be enforced. The fragments are transformed to pixels at the end of the pipe, so they already have an implicit 1:1 ratio. Hence we only have to focus on a 1:1 sampling procedure, meaning that every fragment samples exactly one texel.

Figure 4.5: One-to-One Mapping Procedure

A 1:1 sampling can be enforced by making sure that the viewport transformation of the quadrilateral results in a size equal to that of the texture. That way, the texture coordinate interpolators of the rasterizer output coordinates that exactly match the texel or data locations inside the texture. If this were not the case, the interpolators would output coordinates that lie between texels. The GPU hardware would then bilinearly filter these intensity values and thus corrupt the original scalar data representation. This makes one-to-one mapping a fundamental element in writing correct GPGPU computational kernels.
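The sketch below shows one way such a viewport could be set up in Direct3D 9; it is a minimal illustration assuming a valid device pointer and is not taken from the thesis framework.

#include <d3d9.h>

// Give the viewport exactly the same dimensions as the texture holding the input
// data, so one fragment is generated per texel.
void setOneToOneViewport(IDirect3DDevice9* device, UINT texWidth, UINT texHeight) {
    D3DVIEWPORT9 viewport;
    viewport.X = 0;
    viewport.Y = 0;
    viewport.Width  = texWidth;   // one fragment per texel horizontally
    viewport.Height = texHeight;  // one fragment per texel vertically
    viewport.MinZ = 0.0f;
    viewport.MaxZ = 1.0f;
    device->SetViewport(&viewport);
}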

4.2.2 Half-Pixel Shift Correction

While taking care of a one-to-one mapping procedure is sufficient in the OpenGL graphical API, it does not suffice in DirectX. Programmers who only use a correct viewport transformation will encounter corrupted data output, since the data is not sampled correctly. This is due to the intrinsic nature of the DirectX rasterization scheme [24]. People have the tendency to abstract pixels as squares; however, in reality pixels are dots. Since the squares indicate the areas that are lit by pixels, the pixel dot finds itself in the center of the square. In contrast to OpenGL, Direct3D uses these pixel dots as the main screen space coordinates. Exactly for this reason the square raster representation starts half a pixel earlier than the screen space origin, as depicted in Figure 4.6. The screen coordinates of the raster origin are thereby (−0.5, −0.5)

instead of (0.0, 0.0). This distinction, though seemingly small, is of the utmost importance if we want to perform a correct one-to-one texel-to-pixel mapping.

Figure 4.6: Half-Pixel Origin Shift

As transformed vertices are expressed in screen space coordinates, any geometry with rounded integer screen coordinates will never map straight onto the raster. Since texture coordinates are referenced to the geometry itself, the texels suffer from the same problem. As visualized in Figure 4.7, a texel covers multiple squares in the raster. The screen coordinates therefore fall in the middle of four texels, and a sample at this location results in a bilinear filtering of those four texels.

Figure 4.7: Half-Pixel Texture or Vertex Coordinate Adjustment

A solution to this problem is adjusting the texture or vertex coordinates. By making a half-pixel shift correction [24] to the texture coordinates, we can make sure that the texture eventually gets mapped straight onto the raster. As an alternative, the vertex coordinates can be shifted as well. This way the geometry maps straight to the raster, and as a consequence, so will the texture.
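A minimal C++ sketch of the texture-coordinate variant of this correction is given below; the names are illustrative and the offset sign assumes the usual Direct3D 9 convention of shifting by half a texel.

// Shift a texture coordinate by half a texel so the texels line up with the raster.
struct TexCoord { float u, v; };

TexCoord halfPixelCorrected(float u, float v, int texWidth, int texHeight) {
    TexCoord c;
    c.u = u + 0.5f / texWidth;   // half-texel shift in the horizontal direction
    c.v = v + 0.5f / texHeight;  // half-texel shift in the vertical direction
    return c;
}
// Equivalently, the quad's vertex positions can be offset by -0.5 pixel in screen space.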

4.2.3 Using an Optimal Precision

Since we are using textures to represent data and the pixel intensities represent the scalar values of the data elements, the precision of the scalar values is limited to the precision used by the

textures. A normal texture mostly consists of four 8 bit integer values that represent RGBA color space values (RGBA is the composition of a red, green, blue and alpha channel value). Luckily, modern GPU hardware also supports floating point textures. Floating point textures represent the RGBA color space with floating point numbers. These floating point numbers were generally custom formats, but current hardware usually supports full 32 bit IEEE standardized floating point numbers. Therefore current pipelines are commonly 128 bits wide. Programmers should always check which texture formats are supported on the hardware. Instead of always using the maximum precision available, one should adjust the precision to the data representation needs. By using the smallest precision necessary, maximum performance is obtained from the hardware. Especially formats that are at most 32 bits wide in total are significantly accelerated compared to the higher precision formats. Notice that data corruption arises when using a precision that is too low to represent the data. This is because the unrepresentable values will get truncated or wrapped to other values [25] that are representable in the used precision format.

4.3 Mapping Computational Concepts

4.3.1 Arithmetic Intensity

As we have discussed before, transferring data from system level to GPU level should be avoided because of the inevitable bottleneck that lies between them. Therefore we should be able to determine whether certain algorithms are inherently well suited for GPGPU acceleration in hybrid applications. Algorithms, or parts of them, are characterized by their arithmetic intensity [26], which is defined in equation 4.1. In this formula, AI stands for the arithmetic intensity, C_O for the number of computational operations and W_T for the number of data words that need to be transferred.

AI = C_O / W_T   (4.1)

Algorithms with a high arithmetic intensity are very well suited for the GPU. Because of the intrinsically parallel nature of the graphics hardware, the GPU is capable of performing a great number of parallel computations. Therefore a high peak performance will be obtained by porting application portions with a high arithmetic intensity to a GPGPU setup.
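A tiny numeric illustration of equation 4.1, with hypothetical operation counts that are not taken from the thesis:

#include <cstdio>

int main() {
    // A 5x5 aggregation performs roughly 25 multiply-and-accumulate pairs per output
    // word, while a plain copy performs one operation per transferred word.
    double aiAggregation = (25.0 * 2.0) / 26.0;  // 25 inputs + 1 output transferred
    double aiCopy        = 1.0 / 2.0;            // 1 input + 1 output transferred
    std::printf("AI aggregation: %.2f, AI copy: %.2f\n", aiAggregation, aiCopy);
    return 0;
}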

4.3.2 Central Processor Analogies

Even for very experienced CPU programmers it is tricky to start with GPGPU programming without any aid. In order to gain a better comprehension of porting code to a GPGPU environment, some analogies are drawn [31; 26] between familiar CPU-level concepts and their corresponding GPU counterparts.

System Data Arrays

This analogy is very easy. Everywhere we would use a data array at system level, we should use a texture to replace it. We have seen before that textures in general always store the unprocessed data for GPGPU kernels. The output data of such kernels is mostly written to off-screen buffers known as render-to-texture objects. These special buffers can be read back to the system memory, or they can be used by means of a texture as input for the next GPGPU computational kernel. Therefore we can see that both input and output system-level data arrays correspond to textures, and thus that textures can also contain processed data.

Inner Loops

Inner loops on the CPU usually process a data array sequentially into a new output data array. The larger an inner loop becomes, the higher the arithmetic intensity of the total iteration process will be. These processes are the fundamental building blocks for mapping entire algorithms to the GPU. By using system-level inner loops as GPGPU computational kernels, we can efficiently exploit the task-level parallelism of the GPU by running the same kernel on multiple processors in a SIMD architecture.

Figure 4.8: Mapping CPU Inner Loops to GPU

As shown in Figure 4.8, when porting the inner loop as a fragment program, the fragment processors can simultaneously process a large number of data elements. For this to work, we must be sure that the computations inside the inner loop are independent of previous outputs. Otherwise the inner loop cannot be mapped efficiently to the GPU.

Computation Invocation

We now have analogies for data representation and computation. The fact remains that we somehow need to invoke the needed computations. Since our computational kernels are implemented inside fragment programs, we need to generate fragments. This is simply done by rendering some geometry. In common GPGPU computing a specifically placed quadrilateral is used, but it is possible to specify different geometric elements or shapes. Hence we can say that geometry rasterization basically equals computation invocation.

Computational Domain

The computational domain is defined as the set of input data. While in CPU inner loops this set is controlled by setting iteration boundaries, in GPGPU kernels this translates to tweaking the texture coordinates. As Figure 4.9 shows, a fixed quadrilateral is assumed. When we change the coordinates of the texture, we basically select the texels that lie inside the quad. Since

this also determines the amount of texels in the quadrilateral, we have to adjust the viewport transformation so that an equal amount of fragments is generated. This is only the case if we want to implement a common GPGPU mapping kernel.

Figure 4.9: Setting the Computational Domain

Changing the amount of texels inside the quad without viewport adjustment destroys the one-to-one mapping between the texels and pixels. It is of the utmost importance that the programmer keeps this in mind when changing these coordinates. The ability to randomly select the input data is known as gather.

Computational Range

The computational range defines the set of output data. In contrast to CPU inner loops, fragment programs are not directly capable of scatter. Scatter is the ability to write the output to a randomly selected target. Therefore the computational range is determined by the vertex coordinates. By changing the vertex coordinates, we are changing where our geometry renders inside the color buffer or render-to-texture.

4.3.3 Simple Mapping Example

To complete our insight into mapping CPU code to the GPU, we will discuss a small example. The CPU inner loop can be seen in Listing 4.1 and has the purpose of doubling all input values.

Listing 4.1: Simple System Level Inner Loop Example

// Definition of used constants.
#define N 100

// Declaration of used variables.
float myInputData[N];
float myOutputData[N];

// Abstract the input data.
fillInValues(myInputData);

// CPU inner loop that doubles the values.
for(int i = 0; i < N; i++){
    myOutputData[i] = 2 * myInputData[i];
}

We have abstracted the values by calling a fictitious function that fills in the data. We thereby assume that we have 100 values available in a one-dimensional array. In a regular for-loop we

iterate through the values, sequentially doubling them and writing the result to the output array. Next, some typical shader code is presented that maps this functionality very basically to the GPU. We assume that we have downloaded the input data into the red channel of a texture. Such a download happens at the graphical API level, and not in the shader code. We write the output to the red channel of an output buffer, and fill in the other channels with zero.

Listing 4.2: Simple GPU Ported Kernel Example

struct MY_VERTEX_OUTPUT {
    float4 myPosition : POSITION;
    float2 myTexCoord : TEXCOORD0;
};
struct MY_FRAGMENT_OUTPUT {
    float4 myColor : COLOR;
};
// The sampler is declared at effect scope and bound to the input texture on the API level.
sampler2D mySampler;
// This fragment program replaces the CPU inner loop.
MY_FRAGMENT_OUTPUT myPixelShader(MY_VERTEX_OUTPUT myInput){
    MY_FRAGMENT_OUTPUT myOutput;
    // We are only using the red channel.
    myOutput.myColor.r = 2 * tex2D(mySampler, myInput.myTexCoord).r;
    myOutput.myColor.gba = float3(0.0f, 0.0f, 0.0f);
    return myOutput;
}

As we will see later on, mapping this sample functionality with the listed shader does not achieve maximum efficiency. But for now, it gives us a good idea of how GPGPU shader code originates from normal system-level code.

4.4 Exploiting Specialized Hardware

4.4.1 Pass Reductions

If a ported algorithm contains two different independent kernels F1 and F2 that read data from the same source texture, then we can optimize that algorithm.

Figure 4.10: Independent Kernel Combination with MRT

As depicted in Figure 4.10, we combine the separate kernel functionalities F1 and F2 into a single larger kernel F1F2 which incorporates both. Therefore we are not reducing

the amount of computations in total. The advantage lies in the fact that we avoid GPU pass overhead time. In contrast to the separate solution, we only need one pass to perform the requested calculations. This means that we only have to send the vertices once, and the rasterizer thereby only has to perform a single geometry rasterization, if we assume that kernels F1 and F2 have completely identical computational ranges. When at least one of the computational kernels is very light, or the computational domain is very small, we can expect a significant speedup compared to the total time needed to do them both separately. This is because in those cases the GPU pass overhead time is very large compared to the time needed for the actual computations. When both kernel functionalities are very complex, or the computational domain is very large, we can expect the speedup to be less significant. Listing 4.3 shows the most general example of a shader which incorporates two kernel functionalities. We have two separate texture coordinates that describe the computational domain for each of the kernels, and we have two render targets to output each set of processed data.

Listing 4.3: Shader Code Sample of Combined Kernels

struct MY_VERTEX_OUTPUT {
    float4 myPosition : POSITION;
    float2 myTexCoord0 : TEXCOORD0;
    float2 myTexCoord1 : TEXCOORD1;
};
struct MY_FRAGMENT_OUTPUT {
    float4 myColor0 : COLOR0;
    float4 myColor1 : COLOR1;
};
// To abstract kernel functionality, we call functions to perform the computations.
MY_FRAGMENT_OUTPUT myPixelShader(MY_VERTEX_OUTPUT myInput){
    MY_FRAGMENT_OUTPUT myOutput;
    // The first kernel functionality.
    float4 myInputData0 = tex2D(mySampler, myInput.myTexCoord0);
    myOutput.myColor0 = performComputations0(myInputData0);
    // The second kernel functionality.
    float4 myInputData1 = tex2D(mySampler, myInput.myTexCoord1);
    myOutput.myColor1 = performComputations1(myInputData1);
    return myOutput;
}

In a special case, the computational domains of both kernels are exactly the same. The data then has to be accessed only once, hence we only need one tex2D call to sample. Since texture samplings are among the most costly operations on a GPU, the speedup will always be very high. This formality can easily be extended to combining more than two kernels, but one should always keep the hardware limitations in mind.

4.4.2 Computational Masks

Sometimes we want to limit kernel execution to a select portion of the computational domain, without having to adjust the latter. A reason for this could be that the geometric complexity

of the domain becomes too high to be easily described in a couple of vertices. Another more compelling reason is that the information about which portion to execute originates from within the pixel shader. It would not be intelligent to send this information to the CPU, translate it into a proper set of vertices, and then send it back to the GPU. As we have discussed before, we should avoid any data transfer between these two systems.

Figure 4.11: Building a Computational Mask

To fully optimize such a scenario, we can build a computational mask by exploiting the early-Z mechanism in newer GPU hardware [32; 26]. This technique involves setting up a separate pass that writes the mask, as seen in Figure 4.11. We should therefore be sure that the fragment processing is complex enough, so that limiting those computations to the selected portion outweighs the penalty of the extra pass. Nonetheless, smart programmers will always try to incorporate this mask writing into an existing, necessary pass so that this penalty becomes obsolete.

Listing 4.4: Effect Structure Sample of a Computational Mask

technique myEarlyCulling {
    pass myMaskWriting {
        VertexShader = compile vs_3_0 myNormalVertexShader();
        PixelShader = compile ps_3_0 myMaskFunction();
    }
    pass myMainExecution {
        // Explicitly defining depth functionality.
        ZEnable = TRUE;
        ZFunc = LESS;
        ZWriteEnable = FALSE;
        VertexShader = compile vs_3_0 myNormalVertexShader();
        PixelShader = compile ps_3_0 myComputations();
    }
}

Listing 4.4 presents the most important code involved in using this technique. Writing the computational mask happens to the Z-buffer in a first pass. Mind that the Z-buffer needs to be of the same size as the computational domain you are specifying it for. We write a depth of 0.0f for the areas we do not want to process. For the portions that we do want to process, we fill in an arbitrary depth value, commonly 1.0f. In the next pass we execute our main kernel computations, but limited to the selected area. For this it is best to explicitly disable depth writing and set the depth test to LESS. That way, only fragments with a depth less than the value stored in the Z-buffer will pass the test. We start this pass by rendering our geometry at a different

depth, generally 0.5f. Since basic GPGPU never uses depth, it is most common to always render this flat geometry at zero depth. With this advanced approach, the early-Z mechanism kicks in and discards fragments that get occluded by existing ones. The exploit lies in the fact that we have filled the Z-buffer with custom values, thereby fooling the Z-hardware. As the fragments have a depth of 0.5f, the fragment processing is avoided for areas where we filled the buffer with 0.0f, in contrast to the areas where we filled the Z-buffer with 1.0f values.

4.4.3 Data Packing

A lot of general computations use single scalar values as data elements. Since the GPU operates on four-component data elements, we should always exploit this data-level parallelism by breaking down our scalar data into four equally sized areas [26]. This has been schematically visualized in Figure 4.12. Hereby we can pack the same amount of data in a texture that is four times smaller, thus also needing four times less space in the video memory. Hence transferring data between CPU and GPU becomes a lot faster. Although this is already a very good thing, it still is not the main advantage that we gain. By packing data, we are able to perform four computations in the same time as one in the unpacked form. This implies a four-fold speedup.

Figure 4.12: Packing Scalar Data into a Four-Component Form

A programmer should always try to exploit data-level parallelism maximally in every situation. Since texture coordinates for GPGPU are two-dimensional most of the time, we can pack two coordinate sets into one four-component vector. Thereby we extend the number of coordinate sets that can be hardware-interpolated in the rasterizer.
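The CPU-side sketch below illustrates the packing scheme of Figure 4.12: the scalar array is split into four equally sized blocks that become the R, G, B and A channels of a texture four times smaller. All names are illustrative.

#include <vector>

struct Texel { float r, g, b, a; };

std::vector<Texel> packIntoRGBA(const std::vector<float>& data) {
    size_t quarter = data.size() / 4;   // assume the size is divisible by four
    std::vector<Texel> packed(quarter);
    for (size_t i = 0; i < quarter; ++i) {
        packed[i].r = data[i];
        packed[i].g = data[i + quarter];
        packed[i].b = data[i + 2 * quarter];
        packed[i].a = data[i + 3 * quarter];
    }
    return packed;  // one four-component operation now processes four scalars at once
}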

4.4.4 Data Filtering

Sometimes a kernel needs the sum of multiple texel intensities from a small area. Instead of sampling every texel and manually adding the values, we can exploit the intrinsic filtering functions of the GPU. When we use mipmaps or the bilinear sampling capabilities, we can reduce this operation to a single sample and a multiplication. Since the GPU has averaged out the filtered values, the multiplication by the area is required to turn the average back into a sum. Notice that this only works for power-of-two areas, since mip-levels are generated with a bilinear filtering algorithm.
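The idea reduces to a single multiplication, as the tiny sketch below shows ('averagedSample' would come from one bilinear or mipmapped texture fetch; the function name is illustrative):

// The GPU's filtering hardware returns the average of a power-of-two area, so
// multiplying that single sample by the area restores the sum of the texels.
float sumFromFilteredSample(float averagedSample, int areaWidth, int areaHeight) {
    return averagedSample * static_cast<float>(areaWidth * areaHeight);
}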

4.4.5 Optimal Sampling Kernels

In a lot of situations, the computational kernel has to gather multiple independent texels, and therefore it is inevitable to perform multiple texture accesses per fragment. We cannot access a

texel once and use it in multiple fragments, as this would destroy the data-independence relationship between the inputs. We can, however, make sure that these texture accesses are accelerated as much as possible [26]. In Figure 4.13 we show an example of a kernel that samples 5 texels in one dimension. Such a kernel often uses a center pixel and offsets from there to the surrounding texels.

Figure 4.13: One Dimensional Sampling Kernel Example

Many programmers compute the coordinates for the surrounding pixels in the pixel shader, as shown in Listing 4.5. Although these computations look trivial, one should not take them lightly. Since 4 multiplications and additions are necessary per fragment, this amounts to over 6 million calculations for a normal 1024 × 768 resolution image.

Listing 4.5: Computing Sampling Kernel Coordinates in the Pixel Shader

MY_FRAGMENT_OUTPUT myPixelShader(MY_VERTEX_OUTPUT myInput, uniform float widthStride){
    MY_FRAGMENT_OUTPUT myOutput;
    float2 myCoordinates[5];
    myCoordinates[0] = float2(myInput.myTexCoord.x - 2*widthStride, myInput.myTexCoord.y);
    myCoordinates[1] = float2(myInput.myTexCoord.x - 1*widthStride, myInput.myTexCoord.y);
    myCoordinates[2] = myInput.myTexCoord;
    myCoordinates[3] = float2(myInput.myTexCoord.x + 1*widthStride, myInput.myTexCoord.y);
    myCoordinates[4] = float2(myInput.myTexCoord.x + 2*widthStride, myInput.myTexCoord.y);
    myOutput.myColor = performFunctionality(myCoordinates);
    return myOutput;
}

All of these operations can be avoided by porting the texture coordinate calculations to the vertex shader, as shown in Listing 4.6. We then assign the results to the hardware-dedicated texture coordinate sets that we can semantically declare in the vertex shader output structure.

Listing 4.6: Porting Sampling Kernel Coordinates to the Vertex Shader

MY_VERTEX_OUTPUT myVertexShader(MY_VERTEX_INPUT myInput, uniform float widthStride){
    MY_VERTEX_OUTPUT myOutput;
    myOutput.myTexCoord0 = float2(myInput.myTexCoord.x - 2*widthStride, myInput.myTexCoord.y);
    myOutput.myTexCoord1 = float2(myInput.myTexCoord.x - 1*widthStride, myInput.myTexCoord.y);
    myOutput.myTexCoord2 = myInput.myTexCoord;
    myOutput.myTexCoord3 = float2(myInput.myTexCoord.x + 1*widthStride, myInput.myTexCoord.y);
    myOutput.myTexCoord4 = float2(myInput.myTexCoord.x + 2*widthStride, myInput.myTexCoord.y);
    return myOutput;
}

Normally we only have four vertices, so the time of these calculations can be neglected. All of the other needed computations will be done by the texture coordinate interpolators in the rasterizer. This dedicated hardware will cause an enormous speedup of the entire process.

Chapter 5

Experimental Algorithms

5.1 Overview

5.1.1 Landau Notation

The first thing we need when theoretically designing algorithms is a formal way of comparing their efficiency without needing any implementation work. It is very important that our algorithmic blocks are maximally efficient before we start implementing. This also exposes the difference between an optimal algorithm and an optimal implementation, as shown in Figure 5.1. Since time measurement is good for finding an optimal implementation, we first need a mechanism to look for an optimal algorithm among its alternatives.

Figure 5.1: Searching Optimal Idea Solutions

To abstract computational or algorithmic complexity to a number of elementary operations, we use the Landau notation [5]. The Landau notation shows us the complexity of an algorithm as a function of the number of input elements n, to which it is proportional. There are a number of conventional symbols used for this notation, but Θ and O are most commonly used.

\Theta(n) = O(n), \; O(n^2), \; O(\log_2(n)), \; O(n \cdot \log_2(n)), \; \text{etc.}   (5.1)

Equation 5.1 shows us some examples of typical algorithmic complexities. We can see that a linear complexity O(n) corresponds, for example, to a simple for-loop, since we need n iterations for n elements. As

another example, the logarithmic complexity O(log2(n)) is typically associated with a binary search algorithm over n elements, where we only need to do a single extra operation when we double the number of elements.
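The two textbook examples just mentioned are sketched below in C++; both assume a vector of n elements, and the binary search additionally assumes the elements are sorted.

#include <vector>

long long sumAll(const std::vector<int>& values) {          // O(n): one operation per element
    long long sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int binarySearch(const std::vector<int>& values, int key) { // O(log2(n)): halve the range each step
    int lo = 0, hi = static_cast<int>(values.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (values[mid] == key) return mid;
        if (values[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;  // not found
}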

5.1.2 Multi-Dimensional Spatial Filtering

Because we are targeting programmable graphics hardware for our algorithms, we are bound to local approaches due to the kernel locality constraint. In order to enhance the standard stereo algorithms, we want to think of more intelligent ways to aggregate the matching cost instead of the simple SAD or SSD. We first propose to perform a multi-dimensional spatial filtering on the AD-image. This image is obtained by evaluating equation 2.3 for every s and t within the image borders, using a constant disparity δ. Hence the AD-image D collects the matching costs of every texel at the same disparity estimation. Spatial filtering is commonly performed by a convolution [33]. Since we are designing a local algorithm, we must restrict our convolution filter to a finite impulse response. We can thereby represent the convolution filter as a two-dimensional matrix, because we are working in a discrete two-dimensional environment. This matrix is often referred to as the convolution kernel, which is visualized in Figure 5.2.

Figure 5.2: Generic 2D Convolution Kernel

Convolving this kernel with the image can be done as described in equation 5.2, if we assume that the kernel sizes M and N are equal, so M = N = W. This basically comes down to aligning the center value with the texel for which we want to compute the convolution. We then multiply the values from the kernel with the corresponding texels, which results in a matrix with a size equal to the convolution filter kernel. After that we sum all of those results inside that matrix into a single value that indicates our new aggregated matching cost C.

C[s, t] = D[s, t] ** f[s, t] = \sum_{k=-W}^{W} \sum_{l=-W}^{W} D[s + k, t + l] \cdot f[k, l]   (5.2)

As we can understand from Figure 5.3, we do this for every texel of the image. When we approach the vicinity of the image edges, the convolution kernel will span over the borders. Since we have

no matching costs available outside the AD-image, we will implicitly assume zeros in these areas. Notice that this will negatively influence the integrity of the matching cost aggregation, and thus lower the reliability of the match.

Figure 5.3: Applying the Convolution Kernel

If we break a convolution down into elementary operations, it consists of a number of multiply-and-accumulate (MAC) computations. Therefore this approach has a general per-texel complexity of O(N × M), or O(W²) in our case. The values contained in f do not matter for the complexity. Although we could use any arbitrary values, we also propose to use a high value for the center and gradually lower the values as we go farther away from the center of the kernel [18]. Because the convolution filter can be seen as a support window, we refer to center-biased support windows for such an approach. If we fill in values so that there is symmetry in both the horizontal and vertical direction, we speak of a circular isotropic kernel. Circular kernels have the ability of being separable [33], which basically means that we can perform the 2D convolution as a cascade of two 1D convolutions. Since the 1D convolution kernels are both of size W, we can reduce the algorithm complexity to O(2W), or O(M + N) in general. If we apply this convolution filter to every AD-image that belongs to a specific disparity estimation from the search range, we are able to get a much improved disparity map. This is realized by associating the disparities that lead to a minimal aggregated matching cost with the texels of the reference image. Hence this algorithm can be briefly summarized into three essential steps.

1. Calculate the AD-image for every disparity estimation from the search range.
2. Perform the proposed multi-dimensional spatial filtering on every AD-image.
3. Search the minimum cost for every texel while associating the corresponding disparity.

We do not explicitly define the values contained in the convolution kernel. That way we can experiment with those values without any performance penalty, and use the Middlebury evaluation mechanism to determine the ones that deliver the highest quality depth maps.
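A minimal CPU sketch of the separable variant of this aggregation is given below; the AD-image D is stored row-major, values outside the image are assumed zero as described above, and all names are illustrative rather than taken from the thesis framework.

#include <vector>

// Convolve D (width x height) with a 1D kernel of size 2*W+1, first horizontally
// and then vertically, yielding the aggregated matching costs C[s, t].
std::vector<float> convolveSeparable(const std::vector<float>& D, int width, int height,
                                     const std::vector<float>& kernel1D, int W) {
    std::vector<float> horizontal(D.size(), 0.0f), result(D.size(), 0.0f);
    for (int t = 0; t < height; ++t)                       // horizontal pass
        for (int s = 0; s < width; ++s)
            for (int k = -W; k <= W; ++k) {
                int x = s + k;
                if (x >= 0 && x < width)
                    horizontal[t * width + s] += D[t * width + x] * kernel1D[k + W];
            }
    for (int t = 0; t < height; ++t)                       // vertical pass
        for (int s = 0; s < width; ++s)
            for (int l = -W; l <= W; ++l) {
                int y = t + l;
                if (y >= 0 && y < height)
                    result[t * width + s] += horizontal[y * width + s] * kernel1D[l + W];
            }
    return result;
}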

5.1.3 Integral Image

In an attempt to come up with a highly competitive stereo algorithm, we try to find a way of calculating variable-sized support windows without any performance penalty. We propose the

usage of integral images computed from the AD-images to efficiently aggregate matching costs. The integral image SI is also known as a Summed Area Table (SAT) in the domain of computer graphics. The idea behind integral images is to integrate texel intensities in two dimensions, as presented in equation 5.3. Basically, all the values in the rectangle spanned by the image origin and the texel under consideration are summed together.

SI(s, t) = \int_0^s \int_0^t D(s, t) \, ds \, dt   (5.3)

Again, since we are working in a discrete form, this formula translates into equation 5.4, hence the frequently used name summed area table. But as we are operating in the computer vision domain, we will henceforth always refer to the integral image.

SI[s, t] = \sum_{i=0}^{s} \sum_{j=0}^{t} D[i, j]   (5.4)

As one can immediately understand, generating this integral image out of the AD-image is pure algorithmic overhead, so it cannot be directly compared to other existing algorithmic complexities. Whether this overhead is outweighed by the advantages that the integral image offers is highly dependent on the way of implementation. Therefore, without any time benchmarking, we cannot draw any conclusions yet. We now consider the main advantage of this technique, together with a number of its possibilities.

Figure 5.4: Main Advantage of the Integral Image.

The main advantage is that we can compute an arbitrarily sized support window SAD with a constant time complexity. We demonstrate this in Figure 5.4. Only three computations are needed to calculate a SAD with kernel size |s4 − s1| × |t4 − t1|.

SAD_{|s_4-s_1| \times |t_4-t_1|} = SI[s_4, t_4] - SI[s_3, t_3] - SI[s_2, t_2] + SI[s_1, t_1]   (5.5)

Substituting equation 5.4, with the corner points (s3, t3) = (s1, t4) and (s2, t2) = (s4, t1) as in Figure 5.4, and cancelling the overlapping regions:

SAD_{|s_4-s_1| \times |t_4-t_1|} = \sum_{i=0}^{s_4} \sum_{j=0}^{t_4} D[i, j] - \sum_{i=0}^{s_3} \sum_{j=0}^{t_3} D[i, j] - \sum_{i=0}^{s_2} \sum_{j=0}^{t_2} D[i, j] + \sum_{i=0}^{s_1} \sum_{j=0}^{t_1} D[i, j]   (5.6)

SAD_{|s_4-s_1| \times |t_4-t_1|} = \sum_{i=0}^{s_4} \sum_{j=0}^{t_4} D[i, j] - \sum_{i=0}^{s_1} \sum_{j=0}^{t_4} D[i, j] - \sum_{i=0}^{s_4} \sum_{j=0}^{t_1} D[i, j] + \sum_{i=0}^{s_1} \sum_{j=0}^{t_1} D[i, j]   (5.7)

SAD_{|s_4-s_1| \times |t_4-t_1|} = \sum_{i=0}^{s_4} \sum_{j=t_1+1}^{t_4} D[i, j] - \sum_{i=0}^{s_1} \sum_{j=t_1+1}^{t_4} D[i, j]   (5.8)

SAD_{|s_4-s_1| \times |t_4-t_1|} = \sum_{i=s_1+1}^{s_4} \sum_{j=t_1+1}^{t_4} D[i, j]   (5.9)

Looking at equations 5.5 to 5.9, we deliver a solid proof of why the integral image has a high potential in fast adaptive algorithms.

We have the possibility to use the latter in standard stereo correspondence algorithms by only calculating the SAD [34]. Nonetheless, we can use it as a basis for more complex approaches. We can emulate a multi-dimensional spatial filter with a center-biased support window by summing up multiple SADs of different sizes. By summing up the center texel with a 3×3 SAD, a 5×5 SAD, etcetera, we are implicitly giving the center texel a larger weight. This is because the center texel is taken into account in every kernel. The farther a texel lies from the center, the fewer kernels take it into account. Hence we have a functionality similar to the center-biased convolution filter.
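The sketch below shows a CPU construction of the integral image of equation 5.4 and the constant-time box sum of equations 5.5–5.9; the AD-image D is stored row-major and all names are illustrative assumptions, not part of the thesis framework.

#include <vector>

std::vector<float> buildIntegralImage(const std::vector<float>& D, int width, int height) {
    std::vector<float> SI(D.size(), 0.0f);
    for (int t = 0; t < height; ++t)
        for (int s = 0; s < width; ++s) {
            float above = (t > 0) ? SI[(t - 1) * width + s] : 0.0f;
            float left  = (s > 0) ? SI[t * width + (s - 1)] : 0.0f;
            float diag  = (t > 0 && s > 0) ? SI[(t - 1) * width + (s - 1)] : 0.0f;
            SI[t * width + s] = D[t * width + s] + above + left - diag;
        }
    return SI;
}

// Sum of D over the rectangle spanned by (s1+1, t1+1) and (s4, t4), i.e. equation 5.9,
// using only three additions/subtractions regardless of the window size.
float boxSum(const std::vector<float>& SI, int width, int s1, int t1, int s4, int t4) {
    return SI[t4 * width + s4] - SI[t1 * width + s4] - SI[t4 * width + s1] + SI[t1 * width + s1];
}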

5.1.4 Simple View Interpolation

To finish designing a complete FVV algorithm, we use a very simple but efficient way of synthesizing an interpolated view between two camera viewpoints [4]. Since we are targeting DIBR, we use the reference image and the disparity map to compute the requested viewing angle. To formalize this requested image, we define the variable λ as the interpolation factor. The interpolation factor λ can take any value between zero and one. A value of zero indicates that we request the left image viewpoint, whereas a value of one indicates that we request the viewing angle of the rightmost image capture.

Figure 5.5: Synthesizing an Interpolated View

Since the disparity map contains the number of places a texel should be shifted to obtain the rightmost image, we first scale the disparity map by λ. Hereby we adjust the number of places a texel should be shifted to obtain the requested viewing angle. This operation is very trivial, as it only involves multiplying every disparity value from the map with the interpolation factor. Once we have obtained the scaled disparity map that belongs to the requested viewpoint, we can begin synthesizing the interpolated image. Figure 5.5 shows the forward warping process, where we simply shift the texels by their corresponding scaled disparity. As the diagram shows this for only two texels, we should repeat this procedure for every texel inside the image. Therefore it is very important that we have a good disparity map where almost every texel has been matched. Texels that lack disparity cannot be directly used in this approach. Although this simple forward warping process still has a lot of flaws, it is already sufficient to generate reasonable-quality interpolated views. We will discuss some possible enhancements to this

technique later on in the advanced topics. The main goal of this thesis is, after all, to examine stereo matching algorithms, so this is just an extra that provides the accomplished work with another means of practical testing and use.
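A minimal CPU sketch of the forward warping step is given below; the shift direction and the sentinel value for holes are illustrative assumptions, and occlusion handling as discussed in Section 5.4 is left out.

#include <vector>

// Shift every texel of the left reference image by its disparity scaled with lambda.
// When several texels land on the same target, keep the one with the largest
// (closest) disparity; untouched targets remain holes marked with -1.
void forwardWarp(const std::vector<float>& leftImage, const std::vector<float>& disparity,
                 int width, int height, float lambda,
                 std::vector<float>& synthesized, std::vector<float>& warpedDisparity) {
    synthesized.assign(leftImage.size(), -1.0f);
    warpedDisparity.assign(leftImage.size(), -1.0f);
    for (int t = 0; t < height; ++t)
        for (int s = 0; s < width; ++s) {
            float d = lambda * disparity[t * width + s];   // scaled disparity
            int target = s - static_cast<int>(d + 0.5f);   // sign depends on the camera setup
            if (target < 0 || target >= width) continue;
            if (d > warpedDisparity[t * width + target]) {
                warpedDisparity[t * width + target] = d;
                synthesized[t * width + target]     = leftImage[t * width + s];
            }
        }
}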

5.2 Advanced Matching Cost Aggregation

5.2.1 Gaussian Approach

As we have already come to understand, using center-biased support windows is an intelligent way of aggregating the matching cost. Since we have not yet discussed the exact aggregation weight values, we now introduce a more concrete solution to this approach. In the Gaussian approach we propose the use of a Gaussian distribution, as shown in equation 5.10.

f_G(k, l) = \frac{1}{2\pi\gamma_g^2} \, e^{-\frac{k^2 + l^2}{2\gamma_g^2}}   (5.10)

In the Gaussian distribution, γ_g stands for the standard deviation and thereby determines the fall-off rate of the kernel. The impact of changing the standard deviation is visualized in Figure 5.6. A high value causes a wide distribution, so the kernel will heavily take surrounding texels into account in the matching procedure. In contrast, a low deviation causes the Gaussian distribution function to become very narrow, so the aggregated cost focuses more on the center texel.

Figure 5.6: Changing the Gaussian Kernel Standard Deviation

If we use a standard deviation that is too low, only the center texel matters, so the aggregation phase actually becomes obsolete. Thereby we can already understand that there is an optimal deviation or fall-off rate that hits a sweet spot (an optimal compromise between two related, countering effects) between blurring the edges of the disparity map and causing a lot of noise in the homogeneous areas.

Generic 2D Convolution

We can apply this Gaussian kernel to the AD-image in a variety of ways. The most trivial one is to fill in k and l values in equation 5.10, and use the results to build up a discrete Gaussian kernel in a two-dimensional W × W matrix form. Hence we can apply the kernel as a generic 2D convolution. Since Gaussian kernels are of an isotropic nature, they are implicitly separable. Thereby we can break the 2D matrix down into two 1D arrays of size W, and apply the kernel by two separate fast convolutions as we have discussed earlier.
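The following sketch builds such a discrete Gaussian kernel on the CPU; normalizing the weights to sum to one (so the constant factor of equation 5.10 drops out) is an assumption for illustration, and the function name is hypothetical.

#include <cmath>
#include <vector>

// Build a W x W Gaussian kernel (W assumed odd) with fall-off rate gamma and
// normalize it so the aggregation weights sum to one.
std::vector<float> buildGaussianKernel(int W, float gamma) {
    std::vector<float> kernel(static_cast<size_t>(W) * W);
    int half = W / 2;
    float sum = 0.0f;
    for (int k = -half; k <= half; ++k)
        for (int l = -half; l <= half; ++l) {
            float value = std::exp(-(k * k + l * l) / (2.0f * gamma * gamma));
            kernel[(k + half) * W + (l + half)] = value;
            sum += value;
        }
    for (float& v : kernel) v /= sum;   // normalization replaces the 1/(2*pi*gamma^2) factor
    return kernel;
}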

Aggregating a Mip-level Hierarchy

If we think more implementation-wise, another very promising way of applying the Gaussian kernel is the use of mipmaps in graphics hardware. As we have already discussed, the first mip-level actually aggregates a 2 × 2 area, and thereby we can think of mip-levels as power-of-two support windows. The higher the mip-level we use, the larger the support window will be. As we can see in Figure 5.7, aggregating the bottom of a mip-level hierarchy approximates the distribution of a Gaussian. Notice that we have visualized the window size and not the size of the mip-level in the diagram. The support window size is inversely proportional to the mip-level size.

Figure 5.7: Approximating a Gaussian Kernel with Mip-Levels

This solution is often referred to as the multiple mip-level (MML) method introduced by Yang and Pollefeys [35; 36; 37; 38]. It tries to combine the benefits of small and large aggregation windows, which is exactly what the Gaussian kernel functionality wants to accomplish. It all comes down to the importance of center-biased support windows in general.

Box-filter Summation

Since we notice that MML is very static and does not offer much adaptability, we propose the use of integral images to aggregate arbitrarily sized windows. Because mipmaps are physically bound to their power-of-two size, they cannot easily adjust their intrinsic center-biased approach.

Figure 5.8: Approximating a Gaussian Kernel with Integral Images

In Figure 5.8 we see the potential of integral images, as we can easily change the size of a support window. Not only does this solution serve as an alternative to the MML method, it also proves to have much more intrinsic adaptability. One of the most important advantages of the integral

image is that speed performance remains the same when we adjust the emulated Gaussian kernel. The largest disadvantage is that it is not hardware supported by the GPU, and therefore has to be built manually. The performance issues involved with this overhead should be measured in a basic implementation.

5.2.2 Laplacian Approach

As an alternative to the Gaussian approach we propose the usage of a Laplacian approach, where we use another distribution function, described in equation 5.11. The Laplace distribution is also known as the double exponential distribution, and its curvature therefore changes quite differently compared to the Gaussian.

f_L(k, l) = \frac{1}{2\gamma_p} \, e^{-\frac{\sqrt{k^2 + l^2}}{\gamma_p}}   (5.11)

Hereby γ_p also determines the fall-off rate of the distribution. Although this approach looks very similar to the Gaussian approach, the subtle differences in the distribution curvatures can cause a large difference in the computed aggregated matching cost. Therefore it can also have an impact on the quality of the resulting disparity map of the stereo algorithm. Because of its unique curvature, it is difficult to implement this approach in another way than a generic 2D convolution with a Laplacian kernel.

5.3 Algorithmic Quality Enhancements

5.3.1 Directional Support Windows

We propose to truncate the convolution kernel instead of computing it as a whole. If we divide the kernel into four truncated parts contained in each quadrant of the K, L-coordinate system, we obtain directional support windows as shown in Figure 5.9.

Figure 5.9: Truncating the Generic Convolution Kernel

When we truncate the convolution kernel, we of course make sure that the center value is embedded in every directional support window. We can then compute four different aggregated costs by applying each directionally shifted window to the image. At the end, the minimum value of these

four is selected and used in the WTA disparity update. This technique is based on the non-linear Kuwahara filter [39], and will enormously improve the quality. Since we are aggregating in separate directions, areas that contain strong edges will automatically be avoided by selecting a different directional aggregated cost. Therefore we will not smear out the detail in the disparity map as much as when we apply the entire kernel as a single window.

C_0[s, t] = D[s, t] ** f_0[s, t] = \sum_{k=-W}^{0} \sum_{l=-W}^{0} D[s + k, t + l] \cdot f_0[k, l]   (5.12)

C_1[s, t] = D[s, t] ** f_1[s, t] = \sum_{k=-W}^{0} \sum_{l=0}^{W} D[s + k, t + l] \cdot f_1[k, l]   (5.13)

C_2[s, t] = D[s, t] ** f_2[s, t] = \sum_{k=0}^{W} \sum_{l=0}^{W} D[s + k, t + l] \cdot f_2[k, l]   (5.14)

C_3[s, t] = D[s, t] ** f_3[s, t] = \sum_{k=0}^{W} \sum_{l=-W}^{0} D[s + k, t + l] \cdot f_3[k, l]   (5.15)

Knowing that f0, f1, f2 and f3 represent the truncated impulse responses of the convolution kernel, equations 5.12 through 5.15 describe how the four matching costs should be computed. If we use a separable kernel, it might look like we need eight convolution passes to perform all four separate kernels. However, the upper and lower right sections use the same 1D horizontal filter kernel. Since we find an analogous situation in the upper and lower left portions, we can see that this approach can be optimized to six elementary convolution passes in total.

5.3.2 Integral Image Adaptive Windowing

As another disparity map quality enhancement feature, we propose the usage of integral images for adaptive windowing. Analogous to the truncated support windows, we use an adaptive windowing scheme to compute multiple matching costs. The window scheme may contain a number of various window shapes, as depicted in Figure 5.10. The minimum of all these matching costs is selected, and again this minimal cost is used in a WTA disparity update.

Figure 5.10: Composing Different Window Shapes with Integral Images

The integral image concept is a great candidate for creating a variety of adaptive windowing schemes. It can produce a very large spectrum of shapes without any performance penalty.

5.3.3 Truncated Absolute Difference

Instead of calculating the normal AD between two texels in a matching procedure, we encourage the usage of the truncated absolute difference (TAD). The TAD is computed in the exact same

way, but it gets truncated if it exceeds a certain maximum value ρ. Therefore the TAD between two texels will never be larger than ρ, which avoids corrupting an aggregated matching cost in case of mismatches or noise contained in the images.
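The TAD reduces to a clamp, as the tiny sketch below shows (the function name is illustrative):

#include <algorithm>
#include <cmath>

// Per-texel matching cost, clamped to rho so outliers cannot corrupt the aggregation.
float truncatedAbsoluteDifference(float leftIntensity, float rightIntensity, float rho) {
    return std::min(std::fabs(leftIntensity - rightIntensity), rho);
}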

5.4 Motion Parallax View Interpolation

5.4.1 Image Synthesis

We will now give a brief but more in-depth discussion of the simple view interpolation synthesis. We are able to synthesize an interpolated view from the left reference image and a scaled disparity map, through the mechanism of motion parallax [5]. Motion parallax describes the shifting effect of objects when they are viewed from different viewing angles. As we have already mathematically derived, disparity is inversely proportional to depth. Motion parallax therefore implies that objects in the foreground have a large shift when comparing two different viewpoints. In contrast, objects in the background shift a lot less.

Figure 5.11: In-depth View Interpolation Synthesis

If we look at Figure 5.11, we can see that motion parallax can cause objects in the background to become occluded by foreground objects. This implies that a motion parallax based DIBR implementation cannot perform a 1:1 mapping. Multiple texels can map to the same synthesized texel. Therefore we should always map the texel that lies most in front, or in other words, has the largest disparity. After synthesizing every texel, some areas will still not be filled in because no texel maps to these places. These are the occluded areas and they should be handled properly.

5.4.2 Simplified Occlusion Handling

Although occlusion handling is not strictly needed, it gives the synthesized view a much better first impression. It is best to fill these areas with texels from the right image, since they are not captured in the left image. As a simple solution, we calculate the inverse scaled disparity |1 − λ| · d[s, t]. We then use this inverse shift to look up a texel in the right image.

Chapter 6

Framework Implementation

6.1 Overview

6.1.1 Conceptual Diagram

As an objective of this thesis, we have developed a generic framework that makes it possible to implement real-time or high-speed stereo matching algorithms on the GPU. Although the algorithmic building blocks are inevitably implemented within this framework core, we have nevertheless tried to expose as much high-level user controllability as possible.

Figure 6.1: SaJi Framework Conceptual Diagram

Figure 6.1 shows the basic conceptual diagram of the framework that was codenamed SaJi, and thereby it exposes the mechanisms that are embedded into the latter. We have implemented the framework by means of a static library in the C++ programming language. We have chosen

a static library instead of a dynamic one, because it is imperative that we obtain maximum performance. Efficient cooperation and code linking with other libraries or applications are not that important for us. The library functionality is accessed by the user through proper calls. We have defined all library calls on the algorithmic level and not on the implementation level. Hereby we avoid that the framework user needs knowledge of low-level GPU hardware. A stereo matching algorithm can be built by making a sequence of proper calls and using the intrinsic object classes that are embedded into the library. Hence an external high-level developer can use the library very intuitively to create an FVV or other high-speed stereo matching application. A function call to the library is translated by the framework core into a small GPGPU accelerated hybrid program. This is done by using the DirectX 9.0c graphical API to load proper computational kernels or shaders into the GPU, and to trigger computational activity in the pipe. When all computations and functionality have been performed, the library makes the results available to the high-level application. This means that every result is always read back to the system memory, so that it can be managed from within the high-level application. Exactly for this reason the framework has, next to the fundamental stereo building blocks, also a real-time processor interface which runs a complete stereo matching algorithm without any slow inter-system communication. It is also possible that the framework core uses the CPU to perform computations that do not map efficiently to the GPU. By using this hybrid approach, we can maximally leverage the performance by exploiting the synergy between the graphics hardware and the central processor.

6.1.2 Library Calls and Functionalities

This subsection introduces the list of library calls and functional object classes that are currently supported in the framework. A beginning programmer can use it as a manual or documentation to incorporate the SaJi framework into a real-time stereo matching application.

Listing 6.1: Building a SaJi Framework Core

// Include the static library so we can use the SaJi framework.
#include "SaJi.h"

int main(int argc, char *argv[])
{
    // The most important thing first, is to build a SaJi core.
    // Look at the name of devices that can be used in the framework.
    char deviceName[SJ_DX9Core::DEVICE_NAME_LENGTH];
    SJ_DX9Core::getDeviceName(0, deviceName);
    printf("ACTIVE DEVICE: %s\n", deviceName);
    // Build a SaJi core on a device number of your choice.
    SJ_DX9Core* myCore = new SJ_DX9Core(0);
}

An introduction into the framework is given by means of a couple demo applications. Looking at Listing 6.1, it defines the essential part that every user should perform, i.e. building a SaJi core. The framework has multiple core support, meaning that you can build multiple cores when the computer has more than one graphics card. Since most commodity personal computers are equipped with only a single one, we restrict ourselves to this example. Although the framework 58

Chapter 6. Framework Implementation has been internally designed to support multiple graphical APIs, only the DirectX 9.0c has a fully working implementation. We can build a core by creating a new SJ DX9Core object in the application. For this to work, a device number should be given. The device name that belongs to the corresponding number can be looked up by the static call getDeviceName to the core class, as shown in the example. Listing 6.2: Using the Real-Time Processor for Stereo Matching Applications int main(int argc, char ∗argv[]) { // Set a fixed image size we are going to use in this application. const int w = 384; const int h = 288; // A structure available in the SaJi framework that contains all necessary parameters. SJ PROCESSORPARAMETERS myProcessorParameters; fillMyProcessorParameters(&myProcessorParameters); // We first have to build a SaJi framework core. SJ DX9Core∗ myCore = new SJ DX9Core(0); // Next load the images you want to process. SJ DX9Image∗ myLeftImage = new SJ DX9Image(myCore, L”../images/tsukubaL.png”, w, h); SJ DX9Image∗ myRightImage = new SJ DX9Image(myCore, L”../images/tsukubaR.png”, w, h); // Assemble the images into a stereo vision. SJ DX9StereoVision∗ myStereoVision = new SJ DX9StereoVision(myCore, myLeftImage, myRightImage); // And just tell the framework to calculate a (horizontal) disparity map. SJ DX9DisparityMap∗ myMap = new SJ DX9DisparityMap(myCore, w , h); myStereoVision−>calculateRealTimeDisparityMap(myMap, myProcessorParameters); // Take a look at the image, and save it to a bitmap. myMap−>show(); myMap−>saveToBitmap(L”../images/mydisparitymap.bmp”); }

Listing 6.2 shows us a standard demo application for using the real-time processor interface that is embedded into the framework. Users that are only interested in using SaJi for high-speed stereo matching, and not stereo research, will most of the time use this kind of setup. Hence we will explain this functionality first. For the convenience of the user, it is advised to define or allocate a pair of constants that indicate the image resolutions since most of the library calls need these values. This is internally used as intrinsic buffer sizes in the GPU, so it actually defines the size of memory allocations on the graphics card. The user then specifies a structure that contains all the settings that are needed and can be changed to perform the stereo algorithm at hand. At this point, we still abstract filling in this structure. After building a core, we load an image pair into the memory. Notice that we need to specify the core on which the images will be used. This is because the framework will allocate memory in the corresponding graphics card. The two images are then coupled to a stereo vision object. This stereo vision is used with the real-time processor to generate the disparity map 59

Chapter 6. Framework Implementation with a certain set of processor parameters. The disparity map is stored in a proper class object, which can then be viewed or saved to a generic 24 bits bitmap file for further external usage. Listing 6.3: Performing Separate Stages of the Stereo Matching Process int main(int argc, char ∗argv[]) { ... // We can calculate an absolute difference image. SJ DX9AbsDifference∗ myAbsDifference; myAbsDifference = new SJ DX9AbsDifference(myCore, w, h); myStereoVision−>calculateAbsDifference(myAbsDifference, myDispEst, myVerticalOffset); // Calculate the integral data of this absolute difference image. SJ DX9IntegralData∗ myIntData; myIntData = new SJ DX9IntegralData(myCore, w, h); myAbsDifferences−>calculateIntegralData(myIntData, SJ DYNVAL4); // Then calculate a number of SADs (we assume 2) with given support window sizes. const int SADS = 2; SJ DX9SADImage∗ mySADImages[SADS]; SJ SADKERNEL mySADKernels[SADS]; // The first kernel or support window size is 3 by 3 texels. mySADKernels[0].width = mySADKernels[0].height = 3; mySADImages[0] = new SJ DX9SADImage(myCore, w, h); myIntData−>calculateSADImage(mySADImages[0], mySADKernels[0]); // The second kernel or support window size is 15 by 15 texels. mySADKernels[1].width = mySADKernels[1].height = 15; mySADImages[1] = new SJ DX9SADImage(myCore, w, h); myIntData−>calculateSADImage(mySADImages[1], mySADKernels[1]); // With all SADs we can compose a matching kernel. SJ DX9MatchingKernel∗ myMatchingKernel; myMatchingKernel = new SJ DX9MatchingKernel(myCore, mySADImages[0], SADS); // Specify the weights that need to be coupled to every SAD. float myKernelWeights = { 0.90f, 0.50f }; // Now the total aggregated matching cost can be computed. SJ DX9MatchingCost∗ myCost; myCost = new SJ DX9MatchingCost(myCore, w, h); myMatchingKernel−>calculateMatchingCost(myCost, myKernelWeights); ... }

In Listing 6.3 we demonstrate how the SaJi framework can be used to perform smaller stages out of the stereo matching process. We assume a stereo vision object has already been setup. From that point we can calculate an absolute difference image by specifying the disparity estimation and a vertical offset if necessary. As we can calculate the integral image from a normal image, it is also possible to calculate the integral image from an AD-image. With this stage we need to select the dynamic value upon which the integral image will be generated. We will discuss this parameter more in-depth in one of the following subsections. The value SJ DYNVAL4 is most of 60

Chapter 6. Framework Implementation the time the appropriate value to use. Using this integral image, we can easily calculate different sized SADs. In the example we demonstrate this for two different sized support windows. When we have computed all of the SADs that we need, a matching kernel can be setup. This matching kernel can then aggregate all costs to the final matching cost. One should always repeat these stages a number of times for an entire range of disparity estimations. As a last step the minimum matching costs and their corresponding disparities should be kept. We demonstrate this in Listing 6.4. With this optimal matching cost, we can request the associated disparities and thereby obtain the disparity map that has been created by the algorithm. Listing 6.4: Updating the Matching Cost with Optimal Values // Define your search range in the beginning of the program. const int ESTIMATIONS = 16; ... // Selecting the optimal matching cost and associate the corresponding disparity. myCost−>calculateBestMatch(myOptimalCost); // Lookup the associated disparities out of the optimal matching cost. SJ DX9DisparityMap∗ myDisparityMap = new SJ DX9DisparityMap(myCore, w, h); myOptimalCosts−>getHorizontalDisparityMap(myDisparityMap, ESTIMATIONS); myDisparityMap−>show();

We also want to mention that every object supports a view and a save function call, so that the intermediate results can always be visualized or used for further external analysis. Listing 6.5: Using Mipmaps for Real-Time Stereo Computation // Another structure is used for specifying the real−time parameters. SJ MIPMAPPARAMETERS myMipMapParameters; fillMyMipMapParameters(&myMipMapParameters;); ... myStereoVision−>calculateRealTimeMipMapDisparityMap(myMap, myMipMapParameters);

Hereby we have implemented the summation of a variable sized box-filter. We can also use this approach to emulate a mip-level hierarchy aggregation quality-wise. If we want to compare speed, another low-level implementation is available that can be called as shown in Listing 6.5.

6.1.3

Low-Level Library Expandability

We need to have a possibility to expand the low-level internal functionality as well. This is needed to perform the research of different stereo algorithms for potential future publications.

Figure 6.2: High-level Concept for Low-level Expandability

Looking at Figure 6.2, we use the concept of the real-time processor as a basis to perform low-level implementation research. We have copied the standard SaJi library and a small standard program 61

Chapter 6. Framework Implementation as demonstrated in Listing 6.2. Except for the processor parameters structure, nothing changes on a high-level to test different types of low-level implementations. These copies represent different models that have been used to implement different approaches and algorithms. Those models than act as a reference to build state-of-the-art papers or publications when they have a high potential.

6.1.4

Efficient Implementations and Optimizations

We now discuss some of the essential low-level implementations out of the shader pool that drives the computational kernels inside the GPU. Since we are working on a DirectX 9.0c implementation, it is most advised to use the HLSL and its corresponding Microsoft compiler. It is normal for the Microsoft corporation to implement a better compiler to DirectX assembler than NVIDIA with their Cg compiler, since they have internal knowledge of their assembly instruction sets. Efficiently Convolving Large Kernels In order to efficiently convolve large kernels on graphics hardware, we exploit the bilinear sampling capabilities of the GPU. Figure 6.3 shows us this mechanism in a formal way. We sample with a 0.5 texel offset, so that the bilinear sampling filter of the GPU will average out two values of the AD-image. Then we multiply with the convolution kernel value at this location. As a result, we can convolve large kernels with only half the amount of texture accesses.

Figure 6.3: Efficient Convolving on Graphics Hardware

Since we are performing the convolution with steps of 2 texels, we should double the intermediate convolution values to compensate the loss of texel samples in the aggregation process. Generating Integral Images Hensley et al. proposed the use of the recursive doubling algorithm to quickly generate the integral image [40], but there are some downsides by implementing this on graphics hardware. The recursive doubling exists out of a horizontal phase and a analog vertical phase. Figure 6.4 shows us the recursive doubling algorithm for only the horizontal phase. In a first pass we add a value that is 20 = 1 place located to the left. Notice that the first box will sample out of the image, but we solve this by setting a border color that is black. Hereby, zero values are sampled outside the image. In a second pass we add a value that is 21 = 2 places located to the left. This continues, until all values over the entire length are summed up. If ∆W and ∆H are defined as the width and height of the image, this algorithm has a complexity of O(log2 (∆W ) + log2 (∆H )). Since the input depends on the previous output, we need an equal amount of GPU passes. Therefore we propose 62

Chapter 6. Framework Implementation the use of a dynamic value 1 ξ that reduces the amount of passes to O(logξ (∆W ) + logξ (∆H )). By this, we are also sampling more zero values. We can understand that there will be a sweet spot where we have an optimal dynamic value ξ that has a minimum amount of execution time, since it will compromise the amount of passes and the amount of redundant zero additions.

Figure 6.4: The Recursive Doubling Algorithm

As the image width and height are not known in advance, we have to control the amount of passes on the CPU level. We first need to calculate the amount of passes needed to finish the algorithm. Only then can we execute a dynamic shader for each pass, where we set the stride as a uniform float in the GPU from within the system level before shader execution. A uniform variable on the GPU is analog to a constant on the CPU, it ensures that it stays the same for every shader invocation of that pass. Hence it can be loaded into a constant register for optimal performance. Listing 6.6: System Level Control of Integral Image Generation horizontalPassCount = ceil(log((float)imageWidth)/log((float)dynamicValue)); int stride = 1; for(int currentPass = 0; currentPassRelease(); }

Listing 6.8 shows how to perform a basic memory allocation in the default pool [23; 24]. We need to specify the device on which the memory will be allocated, so this is the device linked to the SaJi core upon which the code is running. We then specify the buffer size, the amount of mip-levels, the usage and the format of the data elements in the buffer. These formats can be standard 8 bit integers up to 32 bit IEEE floating point numbers. Finally we define the memory pool and a place to store the pointer to this allocated memory. Deallocation happens with a Release() call.

6.2.4

Algorithm Time Benchmarking

To be able to test the speedup of certain optimizations or the execution time of entire algorithms in general, we need a time benchmarking mechanism. Time benchmarking hybrid application programs is less trivial, since we are parallel executing different code. We will propose a way of a good timing mechanism that is adapted for hybrid execution. System Timing Mechanisms A very accurate way of performing time measurements of CPU execution time is the use of the clock tick counter [23; 24] on the central processor. This counter increments with every clock cycle, so it the most accurate approach that can be developed. Reading values out of this processor core counter is done by the QueryPerformanceCounter, and is shown in Listing 6.9. Based on 68

Chapter 6. Framework Implementation the counter status on the start and end point of the algorithm, we can compute the number of clock ticks that where needed to perform the algorithm execution. If we then lookup the clock frequency of the central processor, we can easily calculate the corresponding execution time. Since it is certainly possible for the clock tick counter to overflow once and a while, we also present a overflow detection mechanism that handles these occurrences. Listing 6.9: Accurately Profiling CPU Execution Time // First declare the LARGE INTEGER structures from the windows library. LARGE INTEGER algorithmStart, algorithmStop, clockFrequency; ... // Enclose the algorithm with clock tick counter queries. QueryPerformanceCounter(&algorithmStart); executeAlgorithm(); QueryPerformanceCounter(&algorithmStop); ... // Perform counter overflow detection for correct results. LONGLONG tickStart = algorithmStart.QuadPart; LONGLONG tickStop = algorithmStop.QuadPart; LONGLONG tickTotal = 0; if(tickStoppDevice−>CreateQuery(D3DQUERYTYPE EVENT, &pEventQuery); ... // Wait until the GPU has finished its work (=synchronization). pEventQuery−>Issue(D3DISSUE END); while(S FALSE == pEventQuery−>GetData(NULL, 0, D3DGETDATA FLUSH)); ... // When all GPU syncs are done, release the query mechanism. pEventQuery−>Release();

Listing 6.11 provides us a way in DirectX 9.0c to setup a query mechanism with the GPU, and how to exploit the queries to perform a synchronization with the central processor [24]. The synchronization is done by querying the graphics card whether a previously given task or assignment has ended, and waiting in a while-loop until it does. Synchronization Overhead Distortion Synchronizing with the GPU before reading the clock tick counter of a specific processor core, allows us to measure the correct timing of an algorithm that is running on both CPU and GPU. However, the final synchronization query handling is also measured. Therefore it distorts the exact algorithm time measurements, certainly when the algorithm is very fast. Listing 6.12: Performing Correct Hybrid Algorithm Time Measurements synchronizeWithGPU(); queryClockTicks(&algorithmStart); for(int i=0; ipDevice−>Reset(¶meters); setupTimingEnvironment(); // STEP 2: Setup the needed GPGPU kernel invocation quadrilateral. setupGenericQuad(0, height, width, 0); // STEP 3: Setup the off−screen buffer array that is needed. setupOffscreenEnvironment(D3DFMT A8R8G8B8); // STEP 4: Load and initialize the effect file. setupEffect(L”../saji/shaders/SJ DX9AbsDifferenceResources.fx”, ”absDifference subtract”); initEffectParameters)(; // STEP 5: Time and execute the shader by sending the quad. LPDIRECT3DTEXTURE9 pProcessedData; setPerformanceCheckPoint(&perfStart); for(int i=0; ienvironmentWindow; parameters.BackBufferWidth = width; parameters.BackBufferHeight = height; parameters.BackBufferFormat = D3DFMT UNKNOWN; parameters.PresentationInterval = D3DPRESENT INTERVAL IMMEDIATE; parameters.EnableAutoDepthStencil = TRUE; parameters.AutoDepthStencilFormat = D3DFMT D16; }

The first step is to reset the hardware device to its required parameters. These parameters are initialized in the shader tree parent constructor, or can be changed in the child constructor. The default setup is shown in Listing 6.15. It initializes the device to an windowed environment that uses the desktop window handle as its control. Swapping between front and back buffer has been disabled, and the back buffer is being defined as unknown. The back buffer settings do not really matter since the latter will probably not be used. Listing 6.16: Generic GPGPU Quadrilateral Generation void SJ DX9Shader::setupGenericQuad(int leftTexCoordinate, int topTexCoordinate, int rightTexCoordinate, int bottomTexCoordinate) { // Normalize the texture coordinates and correct the half−pixel shift. float left = ((float)leftTexCoordinate+0.5f)/(float)width; ... // Create the quadrilateral with the flexible vertex format. GenericVertex quad[4]; fillQuadrilateralCoordinates(quad, left, top, right, bottom); // Create the vertex buffer and load the vertices into it. core−>pDevice−>CreateVertexBuffer(4∗sizeof(GenericVertex), D3DUSAGE WRITEONLY, FVF GENERICVERTEX, D3DPOOL DEFAULT, &pVertexBuffer, NULL); void ∗pVertices = NULL; pVertexBuffer−>Lock(0, sizeof(quad), (void∗∗)&pVertices, 0); memcpy(pVertices, quad, sizeof(quad)); pVertexBuffer−>Unlock(); // Calculate the 1:1 mapping matrices for the quad. D3DXMatrixIdentity(&modelViewMatrix); D3DXMatrixOrthoRH(&projectionMatrix, 2.0, 2.0, 0.0, 10.0); // Set the computational range to the same location as the domain. setViewport(leftTexCoordinate, topTexCoordinate, rightTexCoordinate, bottomTexCoordinate); }

The next step is to generate a quadrilateral that will invoke the GPGPU kernel computations. We can see that the shader tree parent provides us a way of implementing this with a simple function 72

Chapter 6. Framework Implementation call described in Listing 6.16. This function automatically calculates the texture coordinates and applies the half-pixel shift correction. After that, the quad is composed with the flexible vertex format [24] of DirectX. This allows us to define custom vertex formats, i.e. a float3 location and float2 texture coordinate indication. This vertex format should be compatible with the defined vertex input structure of the effect file we want to use. The matrices needed to perform a correct one-to-one mapping are then automatically generated. At the end we set the computational range to correspond with location of the computational domain. This way we do not only ensure correct mapping, but also a consistent relationship between input elements and their output. Listing 6.17: Using Custom Off-Screen Render Targets void SJ DX9Shader::setupOffscreenEnvironment(D3DFORMAT textureFormat) { if(currentOffscreenspDevice, width, height, 1, D3DUSAGE RENDERTARGET, textureFormat, D3DPOOL DEFAULT, &pTempTexture[currentOffscreens]); // The render to surface environment in D3DX makes it easy to render off−screen. D3DXCreateRenderToSurface(core−>pDevice, width, height, textureFormat, false, D3DFMT D16, &pRenderToSurface[currentOffscreens]); // Link an off−screen surface to the texture we just allocated. pTempTexture[currentOffscreens]−>GetSurfaceLevel(0, &pTextureSurface[currentOffscreens]); currentOffscreens++; } }

The third step is to allocate an array of off-screen buffers that are needed to perform the algorithmic functionality. Since we are only calculate the AD-image in the example, we need a single off-screen render target. This function call is in general subsequently called a number of times to create multiple buffers that can be used to store intermediate results. As we can see in Listing 6.17, the buffer target format can be specified upon creation. In the example we use a four channel 8 bit color representation. As we have already discussed, this format can go up to a four channel 32 bit IEEE floating point representation. Alway keep in mind that such large formats slow down the processing significantly. Listing 6.18: Effect Specification and Compilation void SJ DX9Shader::setupEffect(LPCWSTR effectPath, D3DXHANDLE effectTechnique) { // This call compiles the effect file into assembly code for the GPU. D3DXCreateEffectFromFile(core−>pDevice, effectPath, NULL, NULL, 0, NULL, &pEffect, &pBufferErrors); // We compute the 1:1 mapping modelViewProjection matrix. D3DXMATRIX modelViewProjMatrix = modelViewMatrix∗projectionMatrix; // Set the technique, and link the matrix to the effect parameter. pEffect−>BeginParameterBlock(); pEffect−>SetTechnique(effectTechnique); pEffect−>SetMatrix(”modelViewProjectionMatrix”, &modelViewProjMatrix); pEffect−>ApplyParameterBlock(pEffect−>EndParameterBlock()); }

The last preparation is setting up the desired effect that contains the shader code for the computational kernels that have to be executed. As shown in Listing 6.18, the D3DX library makes compiling effect code at run-time very easy. We also combine the matrices that were previously 73

Chapter 6. Framework Implementation computed into a single one-to-one mapping matrix. This matrix is loaded into the available ModelViewProj matrix parameter that is specified in the effect file. By combining effect parameter loads into a parameter block, we can optimize the data transfer by doing it all at once on the ApplyParameterBlock function call. Listing 6.19: Executing the Shader Functionality void SJ DX9Shader::executeShader(int targetNumber, LPDIRECT3DTEXTURE9∗ ppProcessedData) { pRenderToSurface[targetNumber]−>BeginScene(pTextureSurface[targetNumber], NULL); UINT totalPasses; pEffect−>Begin(&totalPasses, 0); for(UINT i=0; iBeginPass(i); core−>pDevice−>SetStreamSource(0, pVertexBuffer, 0, sizeof(GenericVertex)); core−>pDevice−>SetFVF(FVF GENERICVERTEX); core−>pDevice−>DrawPrimitive(D3DPT TRIANGLESTRIP, 0, 2); pEffect−>EndPass(); } pEffect−>End(); pRenderToSurface[targetNumber]−>EndScene(0); ∗(ppProcessedData) = pTempTexture[targetNumber]; }

The fifth and most important step is to execute the shader functionality. As we discussed before, this is done by sending the quadrilateral to the pipeline. Listing 6.19 shows how to do this. Notice that this is done for every pass required by the effect file. Listing 6.20: Data Transferring for Memory Management or Read-Back Time Benchmarking if(AbsDifference−>pTexture) AbsDifference−>pTexture−>Release(); D3DXCreateTexture(core−>pDevice, width, height, 1, D3DX DEFAULT, D3DFMT A8R8G8B8, D3DPOOL MANAGED, &(AbsDifference−>pTexture)); LPDIRECT3DSURFACE9 pSrcSurface = NULL; LPDIRECT3DSURFACE9 pDestSurface = NULL; pProcessedData−>GetSurfaceLevel(0, &pSrcSurface); AbsDifference−>pTexture−>GetSurfaceLevel(0, &pDestSurface); D3DXLoadSurfaceFromSurface(pDestSurface, NULL, NULL, pSrcSurface, NULL, NULL, D3DX DEFAULT, 0); pDestSurface−>Release(); pSrcSurface−>Release();

As an optional step, we can save the result to a different memory pool. The copyResult call is a pseudo function call that is usually replaced by the code in Listing 6.20. Copying from the default to the managed pool is necessary when we still want this resource to be available after a device reset. If we want to benchmark read-back time, we should copy it to the system memory pool.

6.3 6.3.1

Enhanced Quality Implementations Truncated Separable Laplacian Kernel Approximation

When the SaJi framework was finished, we have copied the generic base model and used the low-level expandability to do some research in high-speed dense stereo matching algorithms on 74

Chapter 6. Framework Implementation programmable graphics hardware. Since we have made a lot of different models, it is impossible to incorporate every one of them in this thesis. Thereby we only mention the most important contributions that we have made to the current state-of-the-art2 .

Figure 6.8: Processing Elements and Data Flow of the Proposed Stereo Algorithm

During our research, Lu Jiangbo et al. published a high-speed dense stereo algorithm to the 3DTV [44] and ICME [45] 2007 conference. The heart of our algorithm uses a truncated separable Laplacian kernel approximation. In this subsection we will explain the essential algorithmic building blocks that support these papers. The processing elements and data flow of our proposed stereo algorithm are depicted in Figure 6.8. Notice that we hereby follow the taxonomy of Scharstein and Szeliski [6] by first computing the matching cost, then aggregating it, and finally updating the disparity that belongs to the minimum matching cost. Our algorithm does not incorporate a subsequent disparity refinement stage, so this is still a future possibility to further lever the quality. The accompanying legend indicates the intense computational stages of the graphical pipe that is subsequently built out of a vertex shader (VS), the rasterizer (R.), a pixel shader (PS) 2 State-of-the-art

research is the highest level of development.

75

Chapter 6. Framework Implementation and the Z-testing hardware (Z) in the ROPs. Since MRT is used wherever possible, we only need a total of five passes to perform the entire algorithm. By this we implement a high-speed stereo algorithm that fully off-loads the CPU. Therefore it is still free for other high-level computations. Truncated Absolute Difference After downloading the left and right image into the default graphics memory pool, we first calculate a TAD-image for four disparity estimations at once. That way we can exploit the data-level parallelism of the graphics card by data packing. Listing 6.21 presents the shader code that is responsible for this stage. Listing 6.21: The TAD Computational Kernel PS OUTPUT subtractFP(VS OUTPUT vertexOutput, uniform float horDisp0, uniform float horDisp1, uniform float horDisp2, uniform float horDisp3, uniform float thresholdVal) { PS OUTPUT pixelOutput; // Sample a texel from the (reference) left image. float currPixelVal = tex2D(leftSampler, vertexOutput.texelCoordinate); pixelOutput.computedValue = float4(currPixelVal, currPixelVal, currPixelVal, currPixelVal); // Sample 4 texels with different subsequent offsets from the right image. float4 candPixelVal; candPixelVal.r = tex2D(rightSampler, float2(vertexOutput.texelCoordinate.x−horDisp0, vertexOutput.texelCoordinate.y)); candPixelVal.g = tex2D(rightSampler, float2(vertexOutput.texelCoordinate.x−horDisp1, vertexOutput.texelCoordinate.y)); candPixelVal.b = tex2D(rightSampler, float2(vertexOutput.texelCoordinate.x−horDisp2, vertexOutput.texelCoordinate.y)); candPixelVal.a = tex2D(rightSampler, float2(vertexOutput.texelCoordinate.x−horDisp3, vertexOutput.texelCoordinate.y)); // Calculate the absolute difference, and truncate it if necessary. pixelOutput.computedValue = abs(pixelOutput.computedValue − candPixelVal); pixelOutput.computedValue = min(pixelOutput.computedValue, thresholdVal); return pixelOutput; }

We have loaded the four offset values as uniform floats into the GPU, so that these calculations can be avoided. Although these calculations seem trivial again, every fragment, thus every texel in GPGPU, will execute the presented shader and its internal computations. To truncate the absolute difference, we need an additional min instruction. Therefore the TAD is calculated slower than the AD-image. But as we already discussed, the TAD significantly levers the quality of the outputted disparity map by generating more reliable matching costs. Hereby we can immediately see which hardware is specifically exploited per needed pass through the GPU. Separable Kernel Approximation We perform the matching cost aggregation phase by a generic 2D convolution with a 33 × 33 Laplacian kernel. Since the Laplacian is not separable, we propose to approximate it to equation 6.1 so that the algorithmic complexity significantly decreases. A negative consequence is that it doubles the amount of passes that are needed to perform the convolution. This would cause 76

Chapter 6. Framework Implementation an enormous performance drop when working with low resolution images, since the GPU pass overhead in those cases becomes significant. fL (k, l) =

|l| 1 − |k| 1 − |k|+|l| − e γp = e γp · e γp 2γp 2γp

(6.1)

As the second 1D convolution kernel needs the results of the first one, the computational kernels are not independent. Hence this negative influence cannot be taken care of through multiple render targets. Nonetheless, this performance drop is no match for the gigantic speedup that is accomplished by the kernel separation. Directional Support Windows with a Truncated Kernel As another quality boost, we propose to truncate the Laplacian kernel in four directional support windows. Listing 6.22 shows us one of the shaders that performs the truncated convolution. Listing 6.22: Simultaneous Generation of the Left and Right Convolution Filter struct VS OUTPUT PTEX PACKED { float4 viewCoordinate: POSITION; float4 texelCoordinate0: TEXCOORD0; ... float4 texelCoordinate7: TEXCOORD7; }; PS OUTPUT MRT fn GenerateHorImage(VS OUTPUT PTEX PACKED vertexOutput, uniform float widthTexelStride, uniform float colorWeightArray[12], uniform float maxWeightSum, uniform float thresholdInput) { PS OUTPUT MRT pixelOutput; // Kernel coordinate that could not be calculated in the vertex shader. float2 kernelCo17 = float2(vertexOutput.texelCoordinate7.x−2.0f∗widthTexelStride, vertexOutput.texelCoordinate0.y); // Sequential access of textures is significantly faster than random. pixelOutput.computedValue0 = colorWeightArray[7] ∗ tex2D(ADImageSampler, kernelCo17) + colorWeightArray[6] ∗ tex2D(ADImageSampler, vertexOutput.texelCoordinate7.xy) ... + colorWeightArray[0] ∗ tex2D(ADImageSampler, vertexOutput.texelCoordinate1.xy); float4 centerPixel = tex2D(ADImageSampler, vertexOutput.texelCoordinate0.xy); pixelOutput.computedValue0 += centerPixel; pixelOutput.computedValue1 = centerPixel + colorWeightArray[0] ∗ tex2D(ADImageSampler, vertexOutput.texelCoordinate1.zw) + colorWeightArray[1] ∗ tex2D(ADImageSampler, vertexOutput.texelCoordinate2.zw) ... + colorWeightArray[7] ∗ tex2D(ADImageSampler, vertexOutput.texelCoordinate0.zw); // Normalize the output results, so that it properly fits the buffer format. pixelOutput.computedValue0 /= (maxWeightSum ∗ thresholdInput); pixelOutput.computedValue1 /= (maxWeightSum ∗ thresholdInput); return pixelOutput; }

Since every directional support window needs a different 17 × 17 convolution kernel, the normal way of applying a separate kernel would implicate eight convolution passes. As a lot of these passes are independent of each other, we exploit the use of MRT to perform all of the convolutions 77

Chapter 6. Framework Implementation in only three passes as indicated in Figure 6.8. In theory, we could perform them in only two passes. The NIVIDIA GeForce 7900GTX supports rendering to four targets, but this would lead to the fact that we need to sample from two different images in the same kernel. Since the costly texture reads can be significantly accelerated by performing sequential accesses [26], reading from two images destroys this efficient caching. Eventually the two-pass solution will be slower than when split up in three passes that read from a single texture. Listing 6.23: Corresponding Vertex Shader for the Horizontal Convolution Filter struct VS INPUT { float3 modelCoordinate: POSITION; // Based on this central texel, an entire 1D sample kernel is computed. float2 texelCoordinate: TEXCOORD0; }; VS OUTPUT PTEX PACKED vs GenerateHorImage(VS INPUT vertexInput, uniform float4x4 transformationMatrix, uniform float widthTexelStride) { VS OUTPUT PTEX PACKED vertexOutput; vertexOutput.viewCoordinate = mul(transformationMatrix, float4(vertexInput.modelCoordinate, 1.0f)); vertexOutput.texelCoordinate0.xy = vertexInput.texelCoordinate; float2 texAddrStartLe = float2(vertexOutput.texelCoordinate0.x + 0.5 ∗ widthTexelStride, vertexOutput.texelCoordinate0.y); float2 texAddrStartRi = float2(vertexOutput.texelCoordinate0.x − 0.5 ∗ widthTexelStride, vertexOutput.texelCoordinate0.y); // First four kernel coordinate calculations. float4 offsets = mul(float4(2.0f, 4.0f, 6.0f, 8.0f), widthTexelStride); vertexOutput.texelCoordinate1.xy = texAddrStartLe − float2(offsets.x, 0.0f); vertexOutput.texelCoordinate1.zw = texAddrStartRi + float2(offsets.x, 0.0f); ... // Sixth to eighth kernel coordinate calculations. offsets = mul(float4(10.0f, 12.0f, 14.0f, 16.0f), widthTexelStride); vertexOutput.texelCoordinate5.xy = texAddrStartLe − float2(offsets.x, 0.0f); vertexOutput.texelCoordinate5.zw = texAddrStartRi + float2(offsets.x, 0.0f); ... // Mind that texelCoordinate0.xy is the central texel, and is not used here. vertexOutput.texelCoordinate0.zw = texAddrStartRi + float2(offsets.w, 0.0f); return vertexOutput; }

Another very important optimization is the computation of the kernel sample coordinates by the rasterizer, and reducing the number of needed coordinate sets by efficient convolving through exploiting the bilinear filtering hardware. Hereby we reduce the required coordinate sets from 33 to 17. As our graphics card3 supports up to eight texture coordinate interpolators, we pack 16 two-dimensional kernel coordinates into the corresponding vertex shader as shown in Listing 6.23. Because we need a total of 17 coordinate sets, one of them still has to be calculated in the fragment shader. As a side note, we also want to mention that we use the mul instruction to multiply up to four values simultaneously with a given factor. Although this is not so important in a GPGPU vertex shader, it can significantly speed up a pixel shader. The two vertical convolution passes are 3 Most

78

current commodity graphics cards have eight texture coordinate interpolaters.

Chapter 6. Framework Implementation implemented completely analog to the horizontal filtering pass shaders but they cannot be reused. Nevertheless, only a single shader pair is needed for the vertical passes. This is because only the input data is different, and not the kernel functionality. Z-Buffer Disparity Selection At the final end of our algorithm, the minimum value out of these four directional matching costs is selected as shown in Listing 6.24. The minimum directional matching cost is determined for every disparity estimation that is embedded on the data-level. After that, the minimal value out of the subsequent disparity estimations is searched and the best disparity is kept. Listing 6.24: Exploiting the Z-hardware for WTA Disparity Update PS OUTPUT Z fn FindBestMatchingCost(VS OUTPUT vertexOutput, uniform float horDispVec0, uniform float horDispVec1, uniform float horDispVec2, uniform float horDispVec3) { PS OUTPUT Z pixelOutput; // Sample the four directional matching costs. float4 sampleValue0 = tex2D(SeparatePassImageSampler0, vertexOutput.texelCoordinate); float4 sampleValue1 = tex2D(SeparatePassImageSampler1, vertexOutput.texelCoordinate); float4 sampleValue2 = tex2D(SeparatePassImageSampler2, vertexOutput.texelCoordinate); float4 sampleValue3 = tex2D(SeparatePassImageSampler3, vertexOutput.texelCoordinate); // Select minimum directional matching cost. float4 minFourCost = min(min(min(sampleValue0, sampleValue1),sampleValue2), sampleValue3); // Select minimum matching cost from four packed disparity estimations. float minCost, bestDisp; if (minFourCost.g < minFourCost.r) { minCost = minFourCost.g; bestDisp = horDispVec1; } else { minCost = minFourCost.r; bestDisp = horDispVec0; } if (minFourCost.b < minCost) { minCost = minFourCost.b; bestDisp = horDispVec2; } if (minFourCost.a < minCost) { minCost = minFourCost.a; bestDisp = horDispVec3; } // Store the best disparity as a color, and the matching cost as the fragment depth. pixelOutput.computedValue.r = bestDisp; pixelOutput.computedValue.g = pixelOutput.computedValue.b = 0.0f; pixelOutput.computedValue.a = 1.0f; pixelOutput.myDepth = minCost; return pixelOutput; }

The resulting minimal matching cost is then stored as the depth of the current fragment. As the fragment progresses to a ROP, it will be submitted to a depth test. If we have initially filled the Z-buffer with maximum values, the Z-test will automatically take care of the disparity update. When the matching cost is less than the cost that is currently in the Z-buffer, the fragment becomes a pixel and replaces the existing color by the color that was calculated in the fragment program. Since this color contains the new disparity value, we have successfully performed a WTA disparity update through exploiting the Z-hardware.

6.3.2

Shortcomings of the Integral Image

In our research to exploiting the integral image for high-speed stereo, we have noticed that although the latter has a very high potential in future hardware, momentarily it cannot be efficiently enough mapped to the GPU to compete with current state-of-the-art. We will briefly discuss the most 79

Chapter 6. Framework Implementation important shortcomings of the integral image, and the needed technology or techniques to solve them. This makes that the integral image can certainly be used in future work, and therefore our research on its domain is certainly still of great value. High Precision Needs If we sum up values from the origin up, the results will become very high toward the ending corner. This will lead to a loss in precision since current GPUs cannot provide a larger precision than 32 bits IEEE floating point. Thereby noise is introduced into the areas with high value sums.

Figure 6.9: Dealing with Intrinsic Precision Problems in an Integral Image

Figure 6.9 depicts a number of possible sums that can solve this precision problem. When we offset every value with the average of the entire image, we obtain an optimal summation. Since it is too costly to compute this average, we propose offsetting by 0.5 as intensities vary from 0 to 1. Large Pass Overhead Since an integral image has to be recursively computed, we need a relative large number of passes through the GPU. As we cannot get the arithmetic intensity high enough, this large pass overhead causes the algorithm to map very poorly to the GPU. Listing 6.25: Recursively Called Shader for Integral Image Generation PS OUTPUT intDataHorFP(VS OUTPUT vertexOutput, uniform int dynamicValue, uniform float deltaX) { PS OUTPUT pixelOutput; pixelOutput.computedValue = float4(tex2D(intDataSampler, vertexOutput.texelCoordinate).rgb, 1.0f); // Add a number of values equal to the dynamic value. for(int i = 1; i