To appear in Proceedings of ACM SIGGRAPH 2003

Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware

Karl E. Hillesland, University of North Carolina at Chapel Hill*
Sergey Molinov, Intel Corporation†
Radek Grzeszczuk, Intel Corporation†

*Email: [email protected]
†Email: [sergey.molinov,radek.grzeszczuk]@intel.com

Abstract

Graphics hardware is undergoing a change from fixed-function pipelines to more programmable organizations that resemble general-purpose stream processors. In this paper, we show that certain general algorithms, not normally associated with computer graphics, can be mapped to such designs. Specifically, we cast nonlinear optimization as a data streaming process that is well matched to modern graphics processors. Our framework is particularly well suited for solving image-based modeling problems, since it can be used to represent a large and diverse class of these problems using a common formulation. We successfully apply this approach to two distinct image-based modeling problems: light field mapping approximation and fitting the Lafortune model to spatial bidirectional reflectance distribution functions. Comparing the performance of the graphics hardware implementation to a CPU implementation, we show more than a 5-fold improvement.

CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors; I.3.3 [Computer Graphics]: Picture/Image Generation—Digitizing and Scanning, Viewing Algorithms

Keywords: Programmable Graphics Hardware, Nonlinear Optimization, Image-Based Modeling

1 Introduction

Graphics hardware has recently evolved from a fixed-function pipeline to a pipeline with programmable vertex and fragment stages. Increased programming flexibility brings graphics processing units (GPUs) closer to a general-purpose architecture similar to vector or stream processors. Stream processors are well suited for a variety of media applications since they are characterized by high levels of parallelism, high computation per memory access ratios and high tolerance of memory latency [Khailany et al. 2001]. These processors operate on large streams of data, executing the same program on each element of the stream. A graphics processor is a type of stream processor operating on streams of vertices and fragments [Lindholm et al. 2001]. Today's GPUs can be used to accelerate a wide range of non-graphics applications. The recent advent of programming languages for GPUs makes porting of these new applications much easier.

Image-based modeling, a field that looks at building photorealistic 3D models from images, stands to benefit greatly from these new hardware developments. Traditionally, image-based modeling has been done on the CPU because of the complexity of the data processing algorithms it uses, such as factorization of large matrices and solving of large systems of equations. In the past, these algorithms could not be implemented on graphics hardware because it was not programmable. Today, these limitations no longer exist. We show that by rearranging the order of processing of the reference image data, we are able to develop novel algorithms that take full advantage of the high-level parallelism of graphics hardware.

Since a big portion of building a model is its reconstruction, or rendering, implementing the full image-based modeling pipeline in graphics hardware leads to highly optimized performance. It takes full advantage of the rendering speed of graphics hardware and produces models that are by construction optimized for efficient visualization. Additionally, the close coupling of image synthesis and analysis results in a surprisingly simple and clean framework for building image-based models in graphics hardware.

In order to solve the largest possible class of problems, we develop a general framework for doing nonlinear optimization in graphics hardware and cast diverse image-based modeling problems in this framework. Since image-based models often have millions of parameters to optimize and image-based datasets contain billions of samples, our algorithms need to use memory efficiently. To this end, we propose an approach that treats the image data as a stream on which a sequence of programs gets executed and that builds the solution incrementally. We use this approach to build two common nonlinear optimization algorithms: steepest descent and conjugate gradient [Dennis and Schnabel 1996].

We apply our optimization framework to two complex problems in image-based modeling: the light field mapping approximation of surface light fields [Chen et al. 2002] and fitting the Lafortune model [Lafortune et al. 1997] to spatial bidirectional reflectance distribution functions [McAllister et al. 2002]. The first problem is challenging because it involves collecting information from many image samples during the construction of the view maps—a difficult case for graphics hardware. The latter problem requires fitting millions of highly nonlinear reflectance models and processing massive image datasets consisting of thousands of images.

1.1 Contributions

Nonlinear Optimization on Graphics Hardware: We develop a framework for solving many large nonlinear optimizations concurrently in graphics hardware. We implement the conjugate gradient algorithm, often used in practice because of its fast convergence rate and minimal storage requirements. See Section 3.

Image-based Modeling on Graphics Hardware: We apply the optimization framework to image-based modeling. The proposed methods can produce a broad class of function approximations, require minimal storage overhead and do not involve a resampling step. They are, therefore, particularly well suited for modeling of high dimensional functions, where resampling is prohibitively expensive. See Section 4.

Practical Applications and Performance: We solve two distinct image-based modeling problems—the light field mapping approximation of surface light fields and fitting the Lafortune model to spatial bidirectional reflectance distribution functions. We analyze the performance of these algorithms and show significant improvements over a CPU implementation. See Section 6.

2 Previous Work

Recent advances in the programmability of graphics hardware have enabled new research on porting rendering algorithms not normally associated with graphics hardware to this platform. Purcell et al. [2002] map a ray tracer to graphics hardware, showing performance that is competitive with CPU-based implementations. Carr et al. [2002] implement ray-triangle intersection as a fragment program and use it for a host of applications, such as ray tracing, photon mapping and subsurface scattering.

Several papers describing applications from non-graphics domains being ported to graphics hardware have also appeared in recent years. Hoff et al. [1999] use graphics hardware to compute Voronoi diagrams. Strzodka and Rumpf [2001] implement multiscale methods for image processing using multipass rendering. Harris et al. [2002] implement an explicit solver for dynamic systems represented as coupled map lattices and use it to simulate convection and diffusion. Yang et al. [2002] combine a plane-sweeping algorithm with view synthesis for real-time, on-line 3D scene acquisition and view synthesis. This paper is related to our work in the sense that it uses graphics hardware for image analysis. Thompson et al. [2002] implement a variety of non-graphics problems such as matrix multiplication and 3-satisfiability in current graphics architectures using a vertex unit. When compared against standard CPU implementations, they often demonstrate significant performance improvements. Bolz et al. [2003] implement two sparse matrix solvers using a fragment unit. Krüger and Westermann [2003] develop linear algebra operators for implementing numerical algorithms on graphics hardware.

The use of nonlinear optimization for image-based modeling is well established. Sato et al. [1997] compute a fixed reflectance model from a set of images of an object captured under controlled lighting conditions by means of nonlinear optimization. The proposed method is closely tied to the specific reflectance model used in the paper. The method works only for objects made of a uniform material that can be approximated with the proposed reflectance model. Inverse global illumination [Yu et al. 1999] reconstructs the reflectance properties of a scene from a sparse set of photographs using an iterative optimization procedure that additionally estimates the inter-reflections between surfaces present in the scene. As with the earlier approaches, this method is limited to a predefined, low-parameter reflectance model that is not flexible enough to approximate complex surface material properties. Homomorphic factorization [McCool et al. 2001] deals with a special class of function factorizations by converting them into the log domain and treating the problem as a system of linear equations. This approach can handle scattered data without a separate resampling step, but it only applies to a narrow class of factorizations, which significantly limits its generality. The light field mapping method [Chen et al. 2002] approximates a surface light field using matrix factorization. It requires an expensive resampling step that increases the complexity of the data processing pipeline and the amount of data to process. Matrix factorization and homomorphic factorization cannot be easily mapped to streaming architectures. The optimization framework adopted here eliminates resampling of image data and can handle a broad class of function approximations.

3 Nonlinear Optimization Framework

In this paper, we will be concerned with the general problem of fitting a parametric model to sample data through nonlinear optimization. The approach assumes the data are generated by an unknown function, defines a model that is supposed to approximate the original function well, and tries to minimize the difference between the model and the data by adjusting the model's parameters. Next, we introduce the mathematical formalism used in the paper to define the problem.

Let f(x) be the function from which the data are sampled, where x is the vector of physical parameters, such as the surface location, the light direction, etc. Let m(p) denote the model used to approximate the function, where p = [p_1 ... p_N]^T is the vector of model parameters. We will assume that for each instantiation x of the physical parameter vector there is a corresponding instantiation of the model parameter vector p, i.e., f(x) ≈ m(p). Finally, we will let t = [t_1 ... t_S]^T be the vector of data samples. In the paper, we will use N to denote the number of model parameters and S to denote the number of data samples. The vector function R(p) = [r_1(p) ... r_S(p)]^T is the residual function that is nonlinear in p and

    r_i(p) = m(p_i) − f(x_i) = m(p_i) − t_i                                  (1)

denotes the ith residual of function R(p) corresponding to the ith data sample. We seek to solve the following nonlinear least squares problem

    min_{p ∈ R^N} E(p) = (1/2) R(p)^T R(p) = (1/2) ∑_{i=1}^{S} r_i(p)^2.     (2)

The first derivative matrix of R(p) is the Jacobian matrix J(p), where J(p)[i, j] = ∂r_i(p)/∂p_j, and the first derivative of E(p) is simply

    ∇E(p) = J(p)^T R(p).                                                     (3)

The most straightforward technique for solving the nonlinear least squares problem (2) is the steepest descent method, which at each iteration moves by a small step in the negative direction of the gradient

    p_{k+1} = p_k − λ_{k+1} ∇E(p_k) = p_k − λ_{k+1} g_{k+1}                  (4)

where subscript k denotes the iteration of the algorithm and g_{k+1} = ∇E(p_k). (In the text, we will use both symbols for the gradient interchangeably.) Although each iteration of steepest descent is fairly efficient, the algorithm requires many iterations to reach the minimum because of its slow convergence rate [Dennis and Schnabel 1996]. The Newton method, which uses a local quadratic model to solve Equation (2), converges much more rapidly but requires the inversion of the Hessian matrix at every iteration. There exist variants of the Newton method that compute the approximation of the inverse of the Hessian incrementally, but they are fairly difficult to implement and not directly applicable to nonlinear least squares. Additionally, all Newton-like methods require O(N^2) memory to store the Hessian, which is not practical when N is large.

The conjugate gradient algorithms choose the successive search directions so that the component of the gradient parallel to the previous search direction, which has just been made zero, is unaltered [Bishop 1995]. At each iteration of the conjugate gradient algorithm the model parameters are adjusted according to the formula

    p_{k+1} = p_k + α_{k+1} d_{k+1}                                          (5)

where vectors d_{k+1} represent the search directions. The search direction is chosen to be a linear combination of the current gradient and the previous search direction

    d_{k+1} = −g_{k+1} + β_{k+1} d_k                                         (6)

where the coefficients α_{k+1} are given as

    α_{k+1} = − (d_{k+1}^T g_{k+1}) / (d_{k+1}^T H(p_k) d_{k+1})             (7)

and the coefficients β_{k+1} are

    β_{k+1} = (g_{k+1}^T g_{k+1}) / (g_k^T g_k).                             (8)

Note the implicit use of the Hessian to compute the coefficients α_{k+1} in Equation (7). Normally, the conjugate gradient algorithm will avoid the use of the Hessian matrix by performing a line minimization along the search direction to find the correct value of α_{k+1}. However, this complicates the implementation of the algorithm and makes it less robust, since the algorithm is known to be sensitive to the line search procedure. Later we show how the denominator of Equation (7) can be efficiently evaluated in graphics hardware. The conjugate gradient algorithm is very efficient because it exhibits the same convergence rate as the Newton method, yet it requires only O(N) memory.

3.1 Regularization and Simple Bounds

In most practical applications of nonlinear optimization one encounters the problem of insufficient, or irregularly sampled, data. Traditionally, this problem is handled through the use of regularization, which essentially ensures that every parameter of the model has a constraint and controls the smoothness of the solution. Following [McCool et al. 2001], we implement regularization by applying the Laplacian operator

        |  0  −1   0 |
    L = | −1   4  −1 |                                                       (9)
        |  0  −1   0 |

at every parameter of vector p.[1] We will denote the total error due to regularization as E_r. Often it is also required that the solution of a nonlinear optimization problem be within certain bounds, that is,

    P = { p | a_i ≤ p_i ≤ b_i, i = 1, ..., N }.                              (10)

We handle the constraints due to bounds using the penalty method, which generates a quadratic penalty (p_i − a_i)(p_i − b_i) for each parameter p_i that is out of bounds. We will denote the total error due to bounds as E_b. By combining the error from Equation (2) with the regularization error and the bounds error we obtain an expression for the total error

    E_t(p) = E(p) + λ E_r(p) + µ E_b(p)                                      (11)

where λ and µ are the weights that control the relative contribution of each term.

[1] As explained in Section 5, the model parameters are stored as a 2D texture map. We apply the Laplacian operator to the 2D texture.

3.2 Streaming Nonlinear Optimization

We implement in graphics hardware both the conjugate gradient algorithm and the steepest descent algorithm. To achieve efficiency, we need to compute the following expressions quickly: ∇E(p) from Equation (3) and the denominator of Equation (7). In practice, when both the number of model parameters and the residuals are large, each residual will depend on just a small subset of all parameters. This is certainly the case for image-based datasets, where each residual only affects the model parameters from one surface location. This means that the first and the second derivatives of each residual with respect to the model parameters p will have a very small number of nonzero entries. We rewrite the expression for the gradient as

    ∇E(p) = ∑_{i=1}^{S} ∇E_i(p) = ∑_{i=1}^{S} J_i(p) r_i(p)                  (12)

where ∇E_i(p) is the gradient of the error for residual r_i, J_i(p) is the Jacobian for this residual and S is the number of radiance samples. Because of sparseness, the terms on the right side of the equation can be computed in constant time per residual. Since the operation has to be repeated for each residual, Expression (12) can be evaluated in O(S) operations and requires O(N) memory. Similarly, we can rewrite the denominator in Equation (7) as (we drop subscripts for simplicity)

    d^T H(p) d = ∑_{i=1}^{S} d_i^T H_i(p) d_i.                               (13)

We evaluate each term d_i^T H_i(p) d_i independently, again in a constant number of operations per residual, and sum the results. The total cost of evaluating expression (13) is therefore O(S) and requires O(N) memory.

The above observation leads naturally to the formulation of optimization as a stream process. In this formulation, the data samples are continuously streamed through the processor, while the fragment unit updates the optimization information based on the contribution of each sample. The model parameters are updated only after all the data samples are processed. The loop is repeated as long as the error can be diminished. The current implementation of the conjugate gradient algorithm requires two full passes through the reference images: once to evaluate the direction based on Equation (6), and once to evaluate the stepsize based on Equation (7). High-level pseudo-code for the k-th iteration of streaming nonlinear optimization looks as follows.

STREAMING-NONLINEAR-OPTIMIZATION()
1  while E_t(p_{k+1}) < E_t(p_k)
2    do d_{k+1} ← EVALUATE-DIRECTION()
3       α_{k+1} ← EVALUATE-STEPSIZE()
4       p_{k+1} ← UPDATE-MODEL-PARAMETERS(p_k, α_{k+1}, d_{k+1})

EVALUATE-DIRECTION()
1  for i ← 1 to S
2    do r_i ← COMPUTE-RESIDUAL(p_k, t_i)
3       g_{k+1} += PER-SAMPLE-UPDATE-GRADIENT(r_i, p_k)
4  g_{k+1} += BOUNDS-REGULARIZE-GRADIENT(p_k)
5  β_{k+1} ← COMPUTE-BETA(g_k, g_{k+1})
6  d_{k+1} ← UPDATE-DIRECTION(d_k, g_{k+1}, β_{k+1})

EVALUATE-STEPSIZE()
1  for i ← 1 to S
2    do r_i ← COMPUTE-RESIDUAL(p_k, t_i)
3       dHd += PER-SAMPLE-UPDATE-DHD(r_i, d_{k+1}, p_k)
4  α_{k+1} ← COMPUTE-ALPHA(d_{k+1}, g_{k+1}, dHd)

In EVALUATE-DIRECTION(), Line 3 computes the contribution of each sample to the gradient based on Equation (12), Line 5 evaluates Equation (8) and Line 6 evaluates Equation (6). In EVALUATE-STEPSIZE(), Line 3 evaluates the denominator of Equation (7) based on Equation (13) and Line 4 completes the evaluation of this equation. In STREAMING-NONLINEAR-OPTIMIZATION(), Line 4 updates the model parameters based on Equation (5). The contribution of the regularization term and the bounds term to the overall solution is computed outside the main loop. Since these two terms depend only on the model parameters, and not the residuals, the cost of their evaluation is negligible compared to the cost of computing the least squares fit.
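To make the streaming formulation concrete, the following sketch runs the two-pass loop above on the CPU for a toy one-dimensional fitting problem. The function names mirror the pseudo-code; the exponential test model, the Gauss-Newton approximation H_i ≈ J_i^T J_i for the per-sample Hessian terms, and the omission of the regularization and bounds terms are our own simplifications, not part of the method as described.

import numpy as np

def residual_and_jacobian(p, x, t):
    # Per-sample residual r_i = m(p) - t_i (Eq. 1) and its Jacobian row J_i
    # for the toy model m(p) = p0 * exp(p1 * x).
    m = p[0] * np.exp(p[1] * x)
    return m - t, np.array([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])

def evaluate_direction(p, xs, ts, g_prev, d_prev):
    # First streaming pass: accumulate the gradient (Eq. 12), then form the
    # new search direction (Eqs. 6 and 8).
    g = np.zeros_like(p)
    for x, t in zip(xs, ts):                    # stream over the samples
        r, J = residual_and_jacobian(p, x, t)
        g += J * r                              # per-sample gradient term
    if d_prev is None:                          # first iteration: steepest descent
        return g, -g
    beta = (g @ g) / (g_prev @ g_prev)          # Eq. (8)
    return g, -g + beta * d_prev                # Eq. (6)

def evaluate_stepsize(p, xs, ts, g, d):
    # Second streaming pass: accumulate d^T H d (Eq. 13), then alpha (Eq. 7).
    dHd = 0.0
    for x, t in zip(xs, ts):
        _, J = residual_and_jacobian(p, x, t)
        dHd += (J @ d) ** 2                     # d^T (J_i^T J_i) d, Gauss-Newton
    return -(d @ g) / dHd                       # Eq. (7)

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)
ts = 2.0 * np.exp(1.5 * xs) + 0.01 * rng.standard_normal(xs.size)
p, g, d = np.array([1.0, 0.5]), None, None
for k in range(50):
    g, d = evaluate_direction(p, xs, ts, g, d)
    alpha = evaluate_stepsize(p, xs, ts, g, d)
    p = p + alpha * d                           # Eq. (5)
print(p)  # should approach the generating parameters [2.0, 1.5]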

4 Image-Based Modeling

We apply the streaming optimization framework developed in this paper to the problem of finding a compact representation for the light reflectance properties of an object from a set of input images captured under varying lighting and viewing conditions. We assume that the object's geometry is known to a sufficient level of accuracy. We allow the reflectance properties to vary across the surface and to be different at each surface point.

The optimization data comes in the form of reference images. With each image, there is an associated calibration file that describes the camera settings during the image capture, such as the field of view, the resolution, and the transformation from the world reference frame to the camera reference frame. Each image is cropped to minimize the number of unused pixels. The registration error for the datasets acquired using 3D photography is relatively small—we did not notice any problems due to misregistration. The images in these datasets are rectified to remove lens distortion.

In mathematical terms, the problem of computing an image-based model can be stated as follows using the notation introduced in Section 3. Function f(x) is the radiance function defined on the surface of an object. Function m(p) denotes the image-based model that approximates this radiance function. Vector p = [p_1 ... p_N]^T contains the parameters of the image-based model and vector t = [t_1 ... t_S]^T concatenates the radiance samples from all reference images. Building an image-based model can be thought of as finding the vector p that minimizes the difference between the target values t_i and the corresponding model values m(p_i) over all radiance samples t.

Approaches to image-based modeling use diverse algorithms from machine learning, computer graphics and computer vision. Some of the methods used are matrix factorization [Chen et al. 2002; Nishino et al. 1999; Kautz and McCool 1999], tensor product computation [Furukawa et al. 2002], solving a system of linear equations [McCool et al. 2001], and various flavors of nonlinear optimization [Sato et al. 1997; Yu et al. 1999; McAllister et al. 2002]. Instead of trying to port each of these algorithms to graphics hardware, we develop a general framework for solving data fitting problems on graphics hardware and then cast a variety of image-based modeling problems in this framework. In fact, all the algorithms listed above can be expressed in our framework. To illustrate this idea, we pick two distinct image-based modeling problems—surface light field approximation through matrix factorization [Chen et al. 2002] and fitting the Lafortune model to SBRDF data [McAllister et al. 2002]—and implement them as streaming nonlinear optimization. Next we describe the two problems.

4.1 Light Field Mapping Approximation

Light field mapping (LFM) approximation refers to the method of representing a surface light field as a sum of a small number of products of lower-dimensional functions [Chen et al. 2002]. A surface light field is a 4-dimensional function f(s, v), where s is a 2-dimensional vector describing the surface location and v is a 2-dimensional vector describing the viewing direction. This function completely defines the outgoing radiance of every point on the surface of an object in every viewing direction.

The LFM approximation assumes the model geometry is represented as a triangular mesh and partitions the light field data around each vertex. The light field unit corresponding to each vertex is called the vertex light field and for vertex v_j is denoted as f^{v_j}(s, v). Partitioning is computed by weighting the radiance function

    f^{v_j}(s, v) = Λ^{v_j}(s) f(s, v)                                       (14)

where Λ^{v_j} is the barycentric weight of each point in the ring of triangles relative to vertex v_j. For each vertex light field a local image-based model is built such that

    f^{v_j}(s, v) ≈ m^{v_j}(p^{v_j}) = ∑_{k=1}^{K} g_k^{v_j}(s) h_k^{v_j}(v)    (15)

where the surface maps g_k^{v_j}(s) and the view maps h_k^{v_j}(v) are stored in sampled form as 2D texture maps. The model parameters p^{v_j} correspond to the texels of the surface maps and the view maps used to approximate the light field for vertex v_j. The model parameter vector for the whole object is the concatenation of the model parameters for the individual vertices: p = [p^{v_1} ... p^{v_Q}], where Q denotes the number of vertices in the model.

From Equation (15) it follows that each light field mapping approximation uses 3 surface maps per triangle and one view map per vertex. We tile the view maps into one texture map and the surface maps into another texture map. A typical resolution of the individual surface maps and view maps used in our experiments is 8×8 pixels.

Light field mapping approximation can be thought of as a sequence of optimizations, one per vertex of the model. Each optimization corresponds to finding the view map and the surface map texels that best approximate the portion of the surface light field for the triangle ring of the given vertex. The error function for this problem is obtained by substituting Equation (15) for m() in Equation (1).
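As a concrete reading of Equation (15), the sketch below evaluates a one-term vertex light field model with nearest-neighbor texture lookups and forms the residual of Equation (1) for a single radiance sample. The 8×8 map resolution matches the experiments reported later; the mapping from (s, v) to texel coordinates is a stand-in of our own.

import numpy as np

K = 1
surface_maps = [np.random.rand(8, 8) for _ in range(K)]  # sampled g_k(s)
view_maps    = [np.random.rand(8, 8) for _ in range(K)]  # sampled h_k(v)

def lookup(tex, u):
    # Nearest-neighbor lookup; u is a 2D coordinate in [0, 1)^2.
    i = min(int(u[0] * tex.shape[0]), tex.shape[0] - 1)
    j = min(int(u[1] * tex.shape[1]), tex.shape[1] - 1)
    return tex[i, j]

def lfm_radiance(s, v):
    # m(p) = sum_k g_k(s) * h_k(v), Equation (15).
    return sum(lookup(g, s) * lookup(h, v) for g, h in zip(surface_maps, view_maps))

# Residual for one radiance sample t at surface location s, view direction v:
s, v, t = (0.25, 0.5), (0.1, 0.9), 0.6
r = lfm_radiance(s, v) - t   # Equation (1)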

4.2 Fitting the Lafortune Model to SBRDF Data

The Spatial Bidirectional Reflectance Distribution Function (SBRDF), as defined in [McAllister et al. 2002], is a six-dimensional light reflectance function parameterized over incoming and outgoing light directions and surface position. We fit an independent BRDF at each surface point and store the parameters of this model as texture maps. The Lafortune model [Lafortune et al. 1997] is a function with nonlinear parameters capable of approximating a broad class of BRDFs. The model is simple, compact and computationally efficient. It is also physically correct in the sense that it is reciprocal and energy-conserving, and can describe a variety of physical effects. The expression we use for the Lafortune model at surface point l_i is

    f^{l_i}(u, v) ≈ m^{l_i}(p^{l_i}) = ρ_d u_z + ∑_{k=1}^{K} ρ_{k,s} L_{k,s}    (16)

where u is the incident direction and v is the exitant direction, both expressed in the local coordinate system at the given surface point. (Although both vectors are expressed using 3 coordinates, they each have only 2 degrees of freedom.) The first term in the equation above denotes the diffuse component and the second term denotes the specular component, where

    L_{k,s} = [C_{k,x} u_x v_x + C_{k,y} u_y v_y + C_{k,z} u_z v_z]^{n_k}.       (17)

We assume isotropic reflection (C_{k,x} = C_{k,y}) modeled by a single (K = 1) forward-reflective term (no retro-reflection). Additionally, we use just one L_{k,s} for all color channels. This means that the model parameter vector p^{l_i} uses 9 independent parameters to describe the BRDF at a given surface location: [ρ_d^R ρ_d^G ρ_d^B], [ρ_s^R ρ_s^G ρ_s^B] and [C_{k,x} C_{k,z} n_k]. The model parameter vector for the whole object is the concatenation of the model parameters for the individual surface points: p = [p^{l_1} ... p^{l_T}], where T denotes the number of surface point samples.

Fitting the Lafortune model to SBRDF data can be thought of as a sequence of optimizations with the number of individual optimizations equal to the number of texels used to texture map the model. For each optimization problem, we seek to find the values of the 9 parameters of our model that best approximate the lighting at the given surface location. The error function is obtained by substituting Equation (16) for m() in Equation (1).
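The following sketch evaluates the 9-parameter model of Equations (16) and (17) at one surface location, under the assumptions just listed (isotropic C_x = C_y, a single lobe shared by all color channels). The packing of the parameters into a flat array and the particular numeric values are illustrative only.

import numpy as np

def lafortune(p, u, v):
    # p = (rho_d[3], rho_s[3], Cx, Cz, n); u and v are unit incident and
    # exitant directions in the local frame. Returns RGB radiance, Eq. (16).
    rho_d, rho_s, Cx, Cz, n = p[:3], p[3:6], p[6], p[7], p[8]
    lobe = (Cx * u[0] * v[0] + Cx * u[1] * v[1] + Cz * u[2] * v[2]) ** n  # Eq. (17)
    return rho_d * u[2] + rho_s * lobe

p = np.array([0.2, 0.3, 0.1,  0.5, 0.5, 0.5,  -1.0, 1.0, 20.0])
u = np.array([0.0, 0.0, 1.0])                 # incident direction
v = np.array([0.3, 0.0, np.sqrt(1 - 0.09)])   # exitant direction
t = np.array([0.25, 0.35, 0.15])              # observed RGB radiance sample
r = lafortune(p, u, v) - t                    # per-sample residual, Eq. (1)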

Figure 1: Image-based modeling on the CPU is a two step process. The first step, shown at the top, distributes the radiance data across individual optimizations. The second step, shown at the bottom, solves each optimization problem sequentially.

Figure 2: Image-based modeling on the GPU combines the preprocessing of the reference images with the optimization. Since the input data are no longer limited to individual optimizations, all optimizations need to be solved concurrently.


4.3 CPU and GPU Implementations

Image-based modeling inherently involves large data sets and many model parameters. For example, a typical light field mapping approximation requires 10^4 factorizations of matrices of size 10^3 × 10^3. Tensor product computation used in [Furukawa et al. 2002] for factorization of SBRDF data requires 10^4 factorizations of 3-dimensional arrays of size 10^3 × 10^3 × 10^2. Fitting the Lafortune model to SBRDF data involves solving 10^6 highly nonlinear optimizations, each using 10^3 data samples. Solving problems of this magnitude is computationally demanding and requires rearranging massive amounts of data.

A standard approach to building image-based models on the CPU is to first preprocess the radiance data as illustrated at the top of Figure 1. This step distributes the data stored in reference images between the individual optimizations. For example, in the case of the light field mapping approximation, it splits the pixels in the reference images into groups based on which mesh triangle they project to. Depending on the specifics of a given modeling task, this step might additionally perform resampling of data. In our case, there is no resampling done other than from rasterization. During preprocessing the data are written out to disk. Once this step is finished, the second step solves each individual optimization sequentially on the CPU as illustrated at the bottom of Figure 1.

While the second step is often considered the core of the modeling problem, the first step can be a major bottleneck. The preprocessing of data can take a significant portion of the total time to build an image-based model, sometimes as much as 50%. Therefore, to get a significant improvement in performance, both steps need to be accelerated. However, since the preprocessing of data is generally implemented using graphics hardware—this avoids having to write a renderer in software—it would be very hard to combine the two steps into a single process running on the CPU.

With the advent of programmable graphics hardware, it is much easier to combine the two steps into a single process running on the GPU. In this new approach, illustrated in Figure 2, the GPU reads and processes the radiance data one image at a time, distributing the individual image pixels between the different optimizations. All optimizations are solved simultaneously by applying the streaming nonlinear optimization framework introduced in Section 3.2.

The GPU implementation has certain drawbacks. First, since the radiance data are no longer distributed between the individual optimizations, all optimizations need to be solved concurrently. (It would be too costly to stream the radiance data for all optimizations, but solve just one.) This increases the memory requirements, since we have to keep the data structures for all optimizations in memory. Second, this approach performs redundant computation at every iteration, since the same reordering and distribution of data is redone many times. However, the benefits outweigh the limitations. The computational redundancy has almost no extra cost on the stream architecture and overall the new implementation achieves much better performance. In the next section, we describe the proposed method in more detail.

5 Graphics Hardware Implementation

Modern graphics hardware introduces programmable vertex and fragment stages [Lindholm et al. 2001], which execute user-defined programs. Vertex units operate on each incoming vertex and pass the results to a rasterizer. Fragment units operate on the fragments generated by the rasterization stage. Although less general than vertex units, fragment units have more computational power, since they are designed to operate at the system's fill rate, on the order of billions of operations per second. Our algorithms perform most of the work in the fragment units and use the vertex units to compute the transformation from world space to image space and model space.

Currently, fragment units have a number of limitations, often dictated by the performance needs of graphics applications. These include a limit on the number of texture lookups, texture coordinates and output values. The output location in the framebuffer or texture cannot be changed by the fragment program. Therefore, no data-dependent write addresses are allowed from the fragment program, restricting many possible applications. There are also limitations on branching and the number of instructions. Working within the constraints imposed by the graphics architecture often translates to using different programming constructs than on the CPU. We will explain some of them here.

In Section 4.3, we showed that image-based modeling on the GPU is a concatenation of two steps: the distribution step, which splits the radiance data coming from the reference images between the individual optimization problems, and the optimization step, which solves all optimization problems simultaneously. In this section, we describe the details of implementing these two steps on current graphics hardware. Specifically, we implement the algorithms in DirectX 9.0 on ATI's Radeon 9700.

5.1 Distribution Step

Our algorithms read and process the radiance data in the original image format. We will refer to this as the representation of data in the image space (IS). Since the algorithms must loop over the datasets multiple times, one of the fundamental operations they perform is distributing the information from the image space to the model parameter space (MPS).

One way of accomplishing this task is to loop through all image pixels and, each time through the loop, compute which model parameters the pixel affects, evaluate the relevant optimization information, and write the results to the affected data structures. Figure 3(a) illustrates this approach. Since this method requires writing to data-dependent memory locations, it is not well suited for an implementation in graphics hardware, which does not support this functionality in the fragment unit. Currently, the fragment unit is designed to write to a predefined memory location that cannot be altered from within the program. Instead, we formulate our algorithms in terms of multiple reads from, and conditional writes to, fixed memory locations. Our approach is motivated by the fact that the rendering pipeline is optimized for textured scan conversion, which involves gathering of texels for a fixed fragment.

To implement this idea, we introduce intermediate storage to help transfer the data from the image space to the model parameter space. We refer to it as the surface location space (SLS). This space is implemented as a texture map that has the same number of texels as would a texture map that we apply to the geometry mesh of the model. The mapping between the model parameter space and this space is fixed. This means that, for a given model parameter, our algorithm knows exactly where to look for relevant information in the SLS. This allows us to formulate our approach in terms of multiple reads from fixed memory locations as shown in Figure 3(b).

Figure 3: Approach (a) loops through all image pixels writing to arbitrary locations of p. Graphics hardware is more suited for reading from multiple memory locations as in (b).

5.1.1 Implementation of Distribution Step

The distribution step computes the residuals for each radiance sample and distributes them into the appropriate locations in the SLS. It can be thought of as the inverse of the transformation that we would apply to synthesize an image by texture mapping the geometry mesh using the SLS texture, as illustrated in Figure 4.

Figure 4: Transformation of radiance samples from Image Space to Surface Location Space.

The distribution step requires two passes through the model geometry. The first pass generates a texture item buffer. Values in the texture item buffer reflect locations in the SLS. The buffer is generated by rasterizing the model triangles in the image space and writing the item numbers as output. There is a unique item number for each texel in the SLS. The second pass moves from the image space to the SLS by rasterizing model triangles into the SLS. Projective texturing is used to perform a lookup into image space for each texel in the SLS. Visibility and rasterization consistency are tested by first looking into the texture item buffer to ensure that the pixel being read from the image does indeed correspond to the fragment that is being written out to the SLS. Below is the pseudo-code for the distribution step; r_i(SLS) are the residuals originating from image I_i expressed in the SLS.

COMPUTE-R-IN-SLS(I_i, T_i)
1  ItemBuf(IS) ← GENERATE-TEXTURE-ITEM-BUFFER(T_i)
2  r_i(SLS) ← CONVERT-R-TO-SLS(T_i, I_i, ItemBuf(IS))
3  return r_i(SLS)

Later stages of the algorithm read the residuals from the SLS and do further computation on them in the model parameter space. Each residual is read from the SLS as many times as there are model parameters that it affects.

Magnification occurs when multiple reference image pixels project to a single texel in the SLS and means that the surface of the object is sampled too coarsely. We choose the resolution of the SLS high enough that instead minification occurs, i.e., a single reference image pixel projects to multiple texels in the SLS. However, even with minification, our algorithm assigns each reference image pixel to just one texel in the SLS. Although this leaves some of the texels in the SLS with no samples after processing one reference image, we can afford to do it because image-based datasets are highly redundant and, after all images are processed, we typically get many image samples per model parameter. Additionally, regularization ensures that every model parameter has a constraint.
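The reformulation of Figure 3(b) can be mimicked on the CPU: rather than scattering each image pixel to a data-dependent address, every SLS texel performs a gather from a fixed, precomputed location and accepts the value only if the item buffer agrees. The sketch below models this with a trivial one-to-one projection; the real system uses projective texturing and per-texel item numbers as described above.

import numpy as np

H = W = 4                                     # toy image and SLS resolution
image = np.arange(H * W, dtype=float).reshape(H, W)

def project(texel):
    # Stand-in for projective texturing: here the SLS maps 1:1 onto the image.
    return texel                              # (row, col) in image space

# Item buffer: for each image pixel, the id of the SLS texel visible there.
item_buffer = np.arange(H * W).reshape(H, W)

sls = np.zeros((H, W))
for idx in np.ndindex(sls.shape):             # one fixed write per output texel
    px = project(idx)                         # gather location in image space
    if item_buffer[px] == np.ravel_multi_index(idx, sls.shape):
        sls[idx] = image[px]                  # visibility/consistency test passed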

5.2 Optimization Step

In describing the implementation of the optimization step, we concentrate on the conjugate gradient algorithm, since the steepest descent algorithm can be considered its special, simplified case. Once the residuals are evaluated and converted to the SLS by the distribution step, we can compute the other quantities required by the conjugate gradient algorithm. In Section 3.2, we wrote that the conjugate gradient algorithm first loops through all images to compute the new direction d_{k+1} and then loops through all images again to compute the stepsize α_{k+1}. Next we describe the two loops in more detail.

The main difference between the pseudo-code in Section 3.2 and the pseudo-code below is that the algorithm presented here solves multiple optimizations simultaneously. Although this is not explicitly reflected in the notation we use, the reader needs to keep in mind that our function calls will often involve many rendering passes that gather information on a per-optimization basis for all problems in parallel. We give a specific example of this in Section 5.2.1.

The main data structures used at iteration k of the algorithm are: model parameters p_k, search direction d_k and gradient g_k. The algorithm additionally uses p_{k+1}, d_{k+1} and g_{k+1} to store the estimates of these quantities for the next iteration. All of these are represented as 2D floating point texture maps containing N texels, where N is the total number of model parameters across all optimization problems. If M is the number of independent optimizations being solved, we additionally have the following floating point texture maps containing M texels to store the variables used by each individual optimization: vector a = [α_1 ... α_M] for the αs, b = [β_1 ... β_M] for the βs, and e_{k+1} and e_k for the errors from the current and the previous iteration, respectively.

Here is the pseudo-code for the first loop through the reference images. The symbol (*) at the end of a line denotes a call to a routine that requires a gather. L is the number of images.

EVALUATE-DIRECTION()
 1  for i ← 1 to L
 2    do load image I_i
 3       r_i(SLS) ← COMPUTE-R-IN-SLS(I_i, T_i, p_k)
 4       e_{k+1} += COMPUTE-ERRORS(r_i(SLS)) (*)
 5       g_{k+1} += COMPUTE-GRADIENTS(r_i(SLS), p_k) (*)
 6  e_{k+1} += REGULAR-AND-BOUNDS-ERROR(p_k)
 7  g_{k+1} += REGULAR-AND-BOUNDS-GRADIENT(p_k)
 8  gg_{k+1} ← CONDITIONAL-COMPUTE-GG(g_{k+1}) (*)
 9  b ← COMPUTE-BETAS(gg_k, gg_{k+1}) (*)
10  d_{k+1} ← CONDITIONAL-COMPUTE-DIRS(d_k, g_{k+1}, b)
11  dg_{k+1} ← COMPUTE-DG(d_{k+1}, g_{k+1}) (*)

The meaning of the individual function calls in the above code is as follows. The call to COMPUTE-ERRORS() evaluates Equation (1). The call to COMPUTE-GRADIENTS() evaluates Equation (12). The call to REGULAR-AND-BOUNDS-ERROR() adds the contribution of the regularization error and the bounds error based on Equation (11). The call to REGULAR-AND-BOUNDS-GRADIENT() computes the gradient due to regularization and the bounds. The call to CONDITIONAL-COMPUTE-GG() evaluates the numerator in Equation (8). The call to COMPUTE-BETAS() completes the evaluation of this equation. The call to CONDITIONAL-COMPUTE-DIRS() does a conditional update of the directions using Equation (6). Finally, the call to COMPUTE-DG() evaluates the numerator in Equation (7).

The computation of the denominator of Equation (7) is required to complete the evaluation of the vector of stepsizes a. Currently this requires a second pass through the images to compute the Hessian. The pseudo-code for this step is as follows.

EVALUATE-STEPSIZE()
1  for i ← 1 to L
2    do load image I_i
3       r_i(SLS) ← COMPUTE-R-IN-SLS(I_i, T_i, p_k)
4       dHd += COMPUTE-DHD(d_{k+1}, r_i(SLS), p_k) (*)
5  a ← CONDITIONAL-COMPUTE-ALPHAS(dg_{k+1}, dHd)

In the code above, conditional updates are required because the data structures are updated differently depending on whether the error at the current iteration is smaller than the error at the previous iteration.

5.2.1 Implementation of Optimization Step

It is beyond the scope of the paper to provide a more detailed description of our algorithm. Instead, we elaborate on one part of it and hope that this example will let the reader generalize to the rest of the algorithm, which uses the same ideas repeatedly. Specifically, we describe the implementation of the function COMPUTE-GRADIENTS() in Line 5 of EVALUATE-DIRECTION() for the problem of light field mapping approximation. We assume a constant viewing direction across the vertex light field.

For nearest neighbor interpolation, a residual for light field mapping approximation can be written as r_k = s_k v_k − t_k, where s_k is a surface map texel and v_k is a view map texel. Since the error for the residual is E_k = r_k^2 / 2, its term of the gradient vector in Equation (12) has only two non-zero entries: ∂E_k/∂s_k = r_k v_k and ∂E_k/∂v_k = r_k s_k. After computing the residuals r_i(SLS) for the reference image I_i, the gradient for each surface map texel is evaluated by reading the corresponding residual from the SLS and multiplying it by the appropriate view map texel. Since the surface map gradient texture has the same dimensions as the SLS, there is a one-to-one correspondence between these two spaces. This process is illustrated at the top of Figure 5.

When computing the gradient for a view map texel, we need to collect information from all residuals falling inside the triangle ring of the given vertex. This can be done by multiplying the surface map gradient texture by the surface map texture and summing all products for each vertex ring into a single value that gets written into the corresponding texel of the view map gradient texture. This process is illustrated at the bottom of Figure 5 for 3 different vertices, each having 2 triangles in its ring.

Current graphics hardware does not support loops and allows only up to 16 texture lookups in the fragment unit. This is rather limiting, since occasionally we need to perform a gather operation that spans several hundred elements or computes a dot product of two large vectors. Without enough texture lookups, we are forced to use a multi-pass approach to perform these operations. Naturally, additional rendering passes consume more bandwidth and should be avoided if possible. Gathering values from multiple locations in one texture map, and computing a sum of values, is done by rendering into a smaller resolution texture, while keeping positions in the texture space constant. If the 16 lookups afforded by current graphics hardware are insufficient, then multiple passes can be used, gathering 16 values at each level, so that gathering n values requires log_16 n passes. See [Krüger and Westermann 2003; Bolz et al. 2003] for details on how to implement gather operations on current graphics hardware.

Figure 5: Computation of the gradient for light field mapping approximation. The top shows the computation for the surface maps portion of the gradient and the bottom shows the computation for the view maps portion. Solid arrows indicate multiplication of the end points and dashed arrows indicate addition.
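A sketch of the multi-pass gather just described, with the rendering passes modeled as successive array reductions of width 16; on the GPU each pass would render into a 16× smaller texture at fixed texture coordinates.

import numpy as np

def gather_sum(values, reads_per_pass=16):
    # Sum `values` using repeated partial reductions of width 16, so that
    # n values take ceil(log16(n)) passes.
    v = np.asarray(values, dtype=float)
    while v.size > 1:
        pad = (-v.size) % reads_per_pass                # pad to a multiple of 16
        v = np.concatenate([v, np.zeros(pad)])
        v = v.reshape(-1, reads_per_pass).sum(axis=1)   # one rendering pass
    return v[0]

print(gather_sum(np.ones(300)))  # 300 values reduce in 3 passes (16^2 < 300)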

5.3 Other Implementation Issues

The steepest descent algorithm is simpler to implement than the conjugate gradient algorithm. It does not require a call to EVALUATE-STEPSIZE() and its call to EVALUATE-DIRECTION() involves only the computation of the residuals, the gradient and the regularization. During the first iteration of the conjugate gradient algorithm, the search direction is set to the direction of the negative gradient. If the error at the current iteration is larger than the error at the previous iteration, the search direction is reset to the steepest descent direction. When using the steepest descent direction, if we encounter the situation where the error increases, we halve the stepsize.

5.3.1 Regularization

We use the alpha channel to delineate the boundaries of the individual surface maps and view maps. The regularization kernel applies the Laplacian operator from Equation (9) to all texels of the model parameter texture. For each texel, only those neighbors that have the appropriate alpha value are used to evaluate the operator.
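A sketch of this masked regularization on the CPU, assuming (as described above) that the alpha channel stores a per-map id: the Laplacian of Equation (9) gathers a neighbor only when it carries the same id, so smoothing never leaks across map boundaries. The kernel form 4p minus the neighbor sum is folded into per-neighbor differences.

import numpy as np

def laplacian_gradient(params, alpha):
    # Per-texel regularization gradient; `alpha` delineates map boundaries.
    grad = np.zeros_like(params)
    H, W = params.shape
    for i in range(H):
        for j in range(W):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                # gather a neighbor only if it exists and shares the alpha id
                if 0 <= ni < H and 0 <= nj < W and alpha[ni, nj] == alpha[i, j]:
                    grad[i, j] += params[i, j] - params[ni, nj]
    return grad

params = np.random.rand(8, 8)
alpha = np.repeat([0, 1], 4)[None, :] * np.ones((8, 1), dtype=int)  # two 8x4 maps
g_reg = laplacian_gradient(params, alpha)  # add lambda * g_reg to the gradient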

5.3.2 Stopping Criteria

We have implemented 3 stopping criteria in the CPU version of the algorithm. The first criterion stops the execution of the optimizer if the difference between the target value and the value generated by the model is less than some threshold value. The second criterion stops the execution if the new update of the model parameters is so small that it changes the output of the model by less than another threshold value. The third criterion uses the number of iterations to decide when to stop. We found the first two comparisons to be too restrictive and, in our GPU implementation, use the third one.

When to stop the optimization has important implications for the overall utilization of the algorithm. If too many optimizations finish early then the utilization of the algorithm diminishes, because we are streaming the same amount of data yet doing less useful work per iteration.

6 Results and Discussion

We have implemented both the conjugate gradient method and the steepest descent method in graphics hardware and successfully applied them to the light field mapping approximation and the Lafortune fit to SBRDF data for several full-size models. Figure 6 shows the bust and the star models from [Chen et al. 2002] that were used for the light field mapping approximation experiments. The ornament model, shown on the right of Figure 6, was used to compute the Lafortune fit to SBRDF data. The dataset for it was generated synthetically in 3D Studio Max. The surface consists of 3 distinct materials produced using 2 reflectance models: Oren-Nayar-Blinn (white) and Lafortune (gold and green). Table 1 lists the pertinent information about the datasets: the polygon count, the model parameter count, the size of the radiance dataset and the number of images.

Figure 6: The bust model and the star model are used for the light field mapping approximation experiments and the ornament model is used for the Lafortune fit to SBRDF data.

Model     Poly. Count  Param. Count  Param. Tex. Size  Image Count  Image Size
Bust      7228         0.91 MT       1.64 MT           339          161 MB
Star      6093         0.69 MT       1.23 MT           282          159 MB
Ornament  3690         0.86 MT       1.73 MT           1760         0.98 GB

Table 1: Models used in the experiments. The 3rd column shows the model parameter count and the 4th column shows the count of pixels used to store the model parameters. The counts are given in megatexels [MT].

As explained in Section 4.1, a one-term light field mapping approximation uses 3 surface maps per triangle and one view map per vertex. In our experiments, we use fixed-size surface maps for all triangles, each having 32 texels. We also use fixed-size view maps of resolution 8×8. As explained in Section 4.2, the Lafortune fit to SBRDF data uses 9 parameters per surface location, which we store in 3 texture maps. Here also, we use fixed-size textures for all triangles, each having 128 texels. The total number of model parameters used for each model is given in Table 1. The textures used to store the model parameters are slightly larger because of the extra space required for packing. (We are not using a very efficient packing scheme; half of the space for surface maps is unused.) The size of the textures allocated to the model parameters is also given in the table.

Image data are stored on disk in uncompressed format. As shown in Section 6.2, streaming the reference images to the graphics card consumes about 20% of the run time. This time could be reduced by reading the images into host memory and streaming them to the GPU from there.

6.1 Steepest Descent vs. Conjugate Gradient

The conjugate gradient method achieves much better convergence than the steepest descent method. It often requires an order of magnitude fewer iterations to converge and usually finds a better local minimum. The comparison is shown in Figure 7. The left graph has the convergence rate for the conjugate gradient method on 3 randomly selected optimization problems from the star dataset. The right graph has the rate for the steepest descent method on the same 3 problems. The conjugate gradient method converges significantly faster. For example, "Optim 2" reaches the minimum in about 25 iterations using the first method and it takes about 250 iterations to get down to the same error level using the second method. This ratio is consistent across different optimizations and different datasets.

Figure 7: The left graph shows the convergence rate of the conjugate gradient method on 3 randomly selected optimization problems from the bust dataset. The right graph shows the convergence rate for the steepest descent method on the same 3 problems.

6.2 CPU and GPU Implementation Comparison

To have a fair comparison of performance, we implemented and optimized a CPU version of the light field mapping approximation using nonlinear optimization following the approach discussed in Section 4.3 and compared it against the GPU implementation. Since the CPU implementation constituted a fairly substantial effort, we have not done it for the Lafortune fit to SBRDF data. The distribution step for the CPU implementation writes out the following information to disk for each radiance sample falling on a given triangle: target value, residual value, surface map coordinate and 3 view map coordinates. The distribution step in this case is still computed on the GPU—implementing the whole renderer in software would require substantial effort. Once data distribution is finished, the algorithm sequentially solves each individual optimization on the CPU.

Figure 8 compares the computational performance of the GPU and the CPU implementations. For the CPU, we divide the total run time into the distribution step and the optimization step. For the GPU, we divide it into image streaming, computation and overhead for context switches. The numbers are given for the full star model running for exactly 100 iterations. When computing the location of a given radiance sample in the surface location space we used the nearest neighbor approximation. We used ATI's All-In-Wonder Radeon 9700 to run the GPU implementation and Intel's 2GHz Pentium 4 for the CPU implementation. In both cases, the GPU performance is more than 5 times better than the CPU performance. The GPU implementation of the steepest descent method spends approximately half the total execution time on context switches. For the same implementation of the conjugate gradient method this portion of execution time goes up to almost 66%. As graphics hardware evolves, we will be able to drastically reduce the number of context switches we need to perform.

Figure 8: Execution time comparison between the CPU and the GPU implementations of the light field mapping approximation running on the star model. We compare the performance of the conjugate gradient (CG) and the steepest descent (SD) methods.

6.3 Memory Requirements

Our algorithms use memory efficiently, since they only require O(N) storage. However, the textures become larger when using floating point texels, and memory on the graphics card is approximately an order of magnitude smaller than on the CPU. Although host memory is also accessible to the graphics processor, the data path is much slower and host memory cannot be used directly as a render target. Since the Radeon 9700 has only about 110MB of video memory available for textures, we were limited in the size of the models we could process. For example, the conjugate gradient algorithm running on the full star model uses about 7.8 megatexels for the data structures related to the surface maps and 1.1 megatexels for the data structures related to the view maps. Since each 32-bit floating point RGBA texel requires 16 bytes, this translates to 142MB of texture memory. Therefore, to solve the full star model, we had to use fairly low resolution light field maps and 16-bit floating point numbers. The use of 16-bit floating point numbers did not have a negative impact on the convergence, as shown in Section 6.4. Better tiling and distribution of texels according to the projected triangle screen size would have allowed for more efficient use of this limited resource, but would have required much longer and more sophisticated fragment programs for gather operations.
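The 142MB figure follows directly from the texel counts; a quick check (using decimal megabytes, which appears to be the convention here):

surface_mt, view_mt = 7.8e6, 1.1e6   # texels
bytes_per_texel = 4 * 4              # RGBA x 32-bit float
total_mb = (surface_mt + view_mt) * bytes_per_texel / 1e6
print(total_mb)                      # 142.4, matching the ~142MB quoted above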

6.4 Convergence and Error Analysis

To compare the error of the original light field mapping approximation algorithm with the nonlinear optimization implementation, we computed a one-term approximation of the star model using principal component analysis [Chen et al. 2002] and the CPU version of nonlinear optimization. In both cases, we used high resolution surface maps (144 texels on average) and view maps (256 texels). The errors reported in Figure 9 are computed based on the difference between the input image and the rendered image using both APE (average pixel error) and PSNR for the foreground pixels only.

The errors for the two methods described above indicate that, when using the same resolution light field maps, the nonlinear optimization algorithm yields a better approximation than the original algorithm based on matrix factorization. The most likely cause of this is that the new method does not perform the extensive resampling required by the matrix factorization approach.

To compare the error of nonlinear optimization running on the CPU and the GPU, we had to use low resolution light field maps and 16-bit floating point numbers on the GPU. The errors reported in Figure 9 for these two methods, although much higher than in the first experiment, are very similar to each other. This indicates that the use of 16-bit floating point numbers on the GPU did not have a detrimental effect on the convergence of the algorithm and that, when we have more memory available on the card, we will be able to achieve the high quality of approximation that the light field mapping method provides.

Figure 9: Convergence and error analysis for 4 different approximation methods.

6.5 Computation and Bandwidth Requirements

We calculate the computation and memory bandwidth requirements of our algorithms by counting the number of instructions of each fragment program and the number of memory reads and writes, and multiplying them by the number of times each fragment program is executed. Figure 10 shows computation and memory bandwidth utilization for the two image-based modeling problems, one running the star model, and one running the ornament model. The Radeon 9700 Pro is capable of a peak 2.6 Gops of arithmetic operations in the fragment unit and 0.65 Gops of 64-bit wide texture reads in parallel. The 29% utilization we currently obtain for the LFM approximation experiment could become much higher in the future. The two most obvious improvements are: (1) streaming images from host memory instead of the disk; (2) combining rendering passes by using multiple render targets and longer fragment programs. Combining passes would improve performance by reducing the effects of the synchronization that happens every time we perform a context switch and by eliminating redundant texture reads. This latter change would shift the utilization towards the arithmetic maximum by reducing texture operations. We use only a fraction of the available memory bandwidth between the GPU and video memory. For the Lafortune experiment, streaming images from the disk constitutes about 70% of the execution time.

6.5 Computation and Bandwidth Requirements

We calculate the computation and memory bandwidth requirements of our algorithms by counting the number of instructions in each fragment program and the number of memory reads and writes, and multiplying these counts by the number of times each fragment program is executed. Figure 10 shows computation and memory bandwidth utilization for the two image-based modeling problems, one running the star model and one running the ornament model. The Radeon 9700 Pro is capable of a peak 2.6 Gops of arithmetic operations in the fragment unit and 0.65 Gops of 64-bit wide texture reads in parallel. The 29% utilization we currently obtain for the LFM approximation experiment could become much higher in the future. The two most obvious improvements are: (1) streaming images from host memory instead of from disk; (2) combining rendering passes by using multiple render targets and longer fragment programs. Combining passes would improve performance by reducing the effects of the synchronization that happens every time we perform a context switch and by eliminating redundant texture reads; the latter change would shift the utilization towards the arithmetic maximum by reducing texture operations. We use only a fraction of the available memory bandwidth between the GPU and video memory. For the Lafortune experiment, streaming images from the disk constitutes about 70% of the execution time.
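The counting model is simple enough to summarize in code. The sketch below is a hypothetical illustration of the bookkeeping rather than our actual instrumentation; the per-pass counts would come from inspecting each fragment program:

```cpp
#include <cstdio>
#include <vector>

// Static cost of one fragment-program pass times the number of fragments
// it processes; the numbers themselves are placeholders.
struct Pass {
    double aluOpsPerFragment;    // arithmetic instructions in the program
    double texReadsPerFragment;  // texture fetches in the program
    double fragments;            // fragments processed by this pass
};

// Peak rates for the Radeon 9700 Pro quoted in the text.
const double kPeakAluOpsPerSec   = 2.6e9;   // fragment arithmetic ops
const double kPeakTexReadsPerSec = 0.65e9;  // 64-bit texture reads

void reportUtilization(const std::vector<Pass>& passes, double elapsedSec) {
    double aluOps = 0.0, texReads = 0.0;
    for (size_t i = 0; i < passes.size(); ++i) {
        aluOps   += passes[i].aluOpsPerFragment   * passes[i].fragments;
        texReads += passes[i].texReadsPerFragment * passes[i].fragments;
    }
    std::printf("arithmetic utilization: %.1f%%\n",
                100.0 * aluOps / (kPeakAluOpsPerSec * elapsedSec));
    std::printf("texture utilization:    %.1f%%\n",
                100.0 * texReads / (kPeakTexReadsPerSec * elapsedSec));
}
```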

6.6 Addressability Limitations of Fragment Unit

Output from a fragment program is written to a single, fixed memory location. By choosing to use the fragment unit, we were forced to work within this constraint, which led to formulating the algorithms in terms of multiple reads instead of multiple writes, as described in Section 5.1. This results in certain computational inefficiencies because it forces us to look through many texels in the SLS that do not contain any information. The inefficiency increases when the sampling rate of the appearance function of our model is significantly higher than the sampling rate in the image space. Unfortunately, since the resolution of the surface location space texture corresponds to the spatial sampling of the appearance function of the model, it is desirable that it exceed the sampling rate in the image space in order to avoid magnification artifacts.

To eliminate the direct dependence on the sampling rate of the model parameters, the fragment unit would need a limited scatter-gather capability. In the future, we would like to study the feasibility of implementing such a capability and its impact on the performance of our algorithms. The scatter requirements are limited in that a fragment unit would only need to write to locations from which it reads; the branching factor corresponds to the filter support for minification. The gather operation is a simple summation, which is order independent; its branching factor corresponds to the factor of magnification.
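To make the read-versus-write trade-off concrete, the following CPU-side sketch contrasts the two formulations of the same reduction into the surface location space (SLS). All names are hypothetical, and the real computation happens in fragment programs rather than in C++:

```cpp
#include <vector>

// Scatter formulation: each image sample writes into the SLS texel it
// maps to. Current fragment units cannot do this, because a fragment
// writes only to its own fixed location.
void scatterVersion(const std::vector<float>& samples,
                    const std::vector<int>& targetTexel,
                    std::vector<float>& sls) {
    for (size_t s = 0; s < samples.size(); ++s)
        sls[targetTexel[s]] += samples[s];   // arbitrary write address
}

// Gather formulation we are forced into: each SLS texel reads every
// sample in its footprint and sums. The sum is order independent, so any
// traversal works, but texels with an empty footprint still pay the cost
// of looking.
void gatherVersion(const std::vector<float>& samples,
                   const std::vector<std::vector<int> >& footprint,
                   std::vector<float>& sls) {
    for (size_t t = 0; t < sls.size(); ++t)
        for (size_t k = 0; k < footprint[t].size(); ++k)
            sls[t] += samples[footprint[t][k]];  // fixed write address
}
```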



[Figure 10: Computation and memory bandwidth utilization for the two modeling problems using the steepest descent method. The bar chart compares utilization (0%–35%) of arithmetic operations, texture operations, and read/write bandwidth for the LFM and Lafortune problems.]

7 Conclusion

We have developed a framework for solving large nonlinear optimizations in graphics hardware by turning the problem into a streaming process that is well matched to modern graphics processors, and we have applied this framework to building image-based models in graphics hardware. The proposed methods can produce a broad class of function approximations, require minimal storage overhead, and do not involve a resampling step. We have successfully applied this approach to two distinct image-based modeling problems, analyzed the performance of our algorithms on them, and shown significant improvements over a CPU implementation. This work demonstrates not only that it is possible to build image-based models using programmable graphics hardware, but that such hardware is particularly well suited for the task.

8 Acknowledgments

We appreciate the discussions with UNC-CH's global illumination reading group, which helped foster the idea of using projective texturing, and the timely proofreading and valuable suggestions by Gordon Stoll. We thank Gary Bishop, Anselmo Lastra and Radomir Mech for critical reads of the paper. We thank Wei-Chao Chen, Alexey Smirnov and Jean-Yves Bouguet for their contribution to OpenLF. From Intel, we thank Gary Bradski, Bob Liang and Justin Rattner for encouraging this work. From ATI, we thank Jason Mitchell, Mark Segal and Evan Hart. From Nvidia, we thank Matt Papakipos. We thank Pat Hanrahan for the inspiring talk he gave at Intel on graphics hardware.

References

Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Clarendon Press.

Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. 2003. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics 22, 3 (July). (Proceedings of ACM SIGGRAPH 2003).

Carr, N. A., Hall, J. D., and Hart, J. C. 2002. The Ray Engine. 2002 SIGGRAPH / Eurographics Workshop on Graphics Hardware, 1–10.

Chen, W.-C., Bouguet, J.-Y., Chu, M. H., and Grzeszczuk, R. 2002. Light Field Mapping: Efficient Representation and Hardware Rendering of Surface Light Fields. ACM Transactions on Graphics 21, 3 (July), 447–456. (Proceedings of ACM SIGGRAPH 2002).

Dennis, J. E., Jr., and Schnabel, R. B. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Classics in Applied Mathematics, 16. SIAM.

Furukawa, R., Kawasaki, H., Ikeuchi, K., and Sakauchi, M. 2002. Appearance Based Object Modeling Using Texture Database: Acquisition, Compression and Rendering. Eurographics Rendering Workshop 2002.

Harris, M. J., Coombe, G., Scheuermann, T., and Lastra, A. 2002. Physically-Based Visual Simulation on Graphics Hardware. 2002 SIGGRAPH / Eurographics Workshop on Graphics Hardware, 1–10.

Hoff, K., Culver, T., Keyser, J., Lin, M., and Manocha, D. 1999. Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware. In Proceedings of SIGGRAPH 99, Computer Graphics Proceedings, Annual Conference Series, 277–286.

Kautz, J., and McCool, M. D. 1999. Interactive Rendering with Arbitrary BRDFs using Separable Approximations. Eurographics Rendering Workshop 1999 (June).

Khailany, B., Dally, W. J., Rixner, S., Kapasi, U. J., Mattson, P., Namkoong, J., Owens, J. D., Towles, B., and Chang, A. 2001. Imagine: Media Processing with Streams. IEEE Micro (March/April), 35–46.

Krüger, J., and Westermann, R. 2003. Linear Algebra Operators for GPU Implementation of Numerical Algorithms. ACM Transactions on Graphics 22, 3 (July). (Proceedings of ACM SIGGRAPH 2003).

Lafortune, E. P. F., Foo, S.-C., Torrance, K. E., and Greenberg, D. P. 1997. Non-Linear Approximation of Reflectance Functions. Proceedings of SIGGRAPH 97 (August), 117–126.

Lindholm, E., Kilgard, M. J., and Moreton, H. 2001. A User-Programmable Vertex Engine. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, 149–158.

McAllister, D. K., Lastra, A., and Heidrich, W. 2002. Efficient Rendering of Spatial Bidirectional Reflectance Distribution Functions. Eurographics Rendering Workshop 2002 (June).

McCool, M. D., Ang, J., and Ahmad, A. 2001. Homomorphic Factorization of BRDFs for High-Performance Rendering. Proceedings of SIGGRAPH 2001 (August), 171–178.

Nishino, K., Sato, Y., and Ikeuchi, K. 1999. Eigen-Texture Method: Appearance Compression Based on 3D Model. In Proceedings of the IEEE Computer Science Conference on Computer Vision and Pattern Recognition (CVPR-99), 618–624.

Purcell, T. J., Buck, I., Mark, W. R., and Hanrahan, P. 2002. Ray Tracing on Programmable Graphics Hardware. ACM Transactions on Graphics 21, 3 (July), 703–712. (Proceedings of ACM SIGGRAPH 2002).

Sato, Y., Wheeler, M. D., and Ikeuchi, K. 1997. Object Shape and Reflectance Modeling from Observation. Proceedings of SIGGRAPH 97 (August), 379–388.

Strzodka, R., and Rumpf, M. 2001. Nonlinear Diffusion in Graphics Hardware. Proceedings EG/IEEE TCVG Symposium on Visualization, 75–84.

Thompson, C. J., Hahn, S., and Oskin, M. 2002. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. Proceedings of the 35th International Symposium on Microarchitecture (MICRO-35).

Yang, R., Welch, G., and Bishop, G. 2002. Real-Time Consensus-Based Scene Reconstruction using Commodity Graphics Hardware. Proceedings of Pacific Graphics.

Yu, Y., Debevec, P. E., Malik, J., and Hawkins, T. 1999. Inverse Global Illumination: Recovering Reflectance Models of Real Scenes From Photographs. Proceedings of SIGGRAPH 99 (August), 215–224.
