ROBUST GENERATION OF HIGH DYNAMIC RANGE IMAGES

Erum Arif Khan, Ahmet Oguz Akyuz
Department of Computer Science, University of Central Florida, USA

Erik Reinhard
Department of Computer Science, University of Bristol, UK
ABSTRACT

High dynamic range images may be created by capturing multiple images of a scene with varying exposures. Images created in this manner are prone to ghosting artifacts, which appear if there is movement in the scene at the time of capture. This paper describes a novel approach to removing ghosting artifacts from high dynamic range images, without the need for explicit object detection and motion estimation. Weights are computed iteratively and then applied to pixels to determine their contribution to the final image. We use a non-parametric model for the static part of the scene, and a pixel's membership in this model determines its weight. In contrast to previous approaches, our technique does not rely on explicit object detection, tracking, or pixelwise motion estimates. Ghost-free images of different scenes demonstrate the effectiveness of our technique.

1. INTRODUCTION

The illumination of typical world scenes around us varies over several orders of magnitude. Conventional sensors in image capture devices are only able to capture a limited part of this range. Instead, the radiance of a scene may be captured more accurately by spatially varying pixel exposures [1], by using multiple imaging devices, or by devices that use special sensors [2]. These devices are expensive and will not be affordable for the average consumer for some years to come [2]. Meanwhile, there exist methods of obtaining high dynamic range (HDR) images using conventional devices [3, 4]. Such techniques require the user to take several images of the same scene at different exposures, and apply a weighted average over these to compute radiance values of the scene.

Due to this requirement of multiple captures, such techniques have to address certain issues. For instance, if there is any movement in the scene while the exposures are being captured, the moving objects will appear in different locations in these exposures. Therefore, when corresponding pixel values from different exposures are merged to produce an HDR image, a ghosting effect will appear in regions where there was movement at the time of capture (see Figure 3). Due to this problem, existing techniques may only be used to create HDR images of scenes that are completely still. This is rather restricting, as most scenes around us contain motion. Without a solution to this problem, we are unable to use multiple-capture techniques to produce HDR images of scenes that have any moving objects, such as people, animals, and vehicles. This is especially problematic in natural scenes, where wind causes dynamic behavior in leaves, trees, flowers, clouds, etc.

One solution to this problem is to track the movement of objects across exposures, and average pixel values according to this movement. For instance, Bogoni [5] estimates an optical flow field between the different exposures, and warps these exposures such that all scene features are in accurate alignment.
Similarly, Kang et al. [6] use gradient-based optical flow between successive frames to compute a dense motion field, which is then used to warp pixels in the exposures so that the appropriate values may be averaged together to generate a ghost-free HDR image. Techniques that use motion estimation work as long as the motion estimation is accurate. Currently, there are no approaches to motion estimation that work reliably for all kinds of movement. For instance, such techniques will fail for scenes with effects such as inter-reflections, specularities, and translucency [6]. Even for simple Lambertian objects, motion estimation fails when objects appear in some exposures and not in others due to occlusion [7].

One solution that avoids motion estimation altogether works under the assumption that ghosting occurs in regions where the dynamic range is low enough to be represented accurately by a single exposure [2]. First, regions in the image where ghosting is likely to occur are detected by computing the weighted variance of pixel values for each location in the image, and selecting regions where this variance is above a threshold. Then, for each of these regions, a single exposure is selected from the exposure sequence, and its values are used directly in the HDR image. This technique fails to detect ghosting in regions where the object's color is not significantly different from the background. Jacobs et al. [8] address this issue by applying the threshold to a measure they derive from entropy, which is invariant to the amount of contrast in the data. Regions detected as possible ghost regions in this manner are again replaced with values from single exposures. This solution works well for many scenes, but fails when ghosting occurs in regions of high dynamic range. This locally high dynamic range may be due to the presence of features in the scene such as windows, strong highlights, or light sources.

In this work, we present a novel approach to removing ghosts from HDR images. Unlike previous techniques, the proposed approach does not require any intermediate representation, such as optical flow or explicit object detection. Instead, we generate an HDR image directly from image information. Thus, our approach is not conditional on the success of some intermediate process, such as optical flow computation or object detection. We make an important deviation from the standard HDR image generation process by iteratively weighting the contribution of each pixel according to its chance of belonging to the static part of the scene (henceforth referred to as the background), as well as its chance of being correctly exposed. We use a non-parametric model of the background, which enables us to compute a pixel's membership in the model, and therefore its weight. Since the model is non-parametric, we do not impose any restrictions on the background. The only assumption we make is that the exposure sequence predominantly captures the background, so that in any local region of image space, the number of pixels that capture the background is significantly greater than the number that capture the moving object. Given this assumption, the neighborhood of a pixel in image space may serve as a reasonable representation of the background, and the probability of the pixel's membership in this neighborhood may serve as the weight for that pixel.
The remainder of this paper is organized as follows: we describe the standard HDR image generation process in Section 2, and explain how we improve this process to remove ghosting artifacts. In Section 3, we present images in which ghosts have been removed using our technique. We also show that our algorithm can be used to remove noise from images.

2. ITERATIVE GHOST REMOVAL

In this section, we first briefly introduce the standard HDR image generation process, and then explain our iterative method of computing a ghost-free HDR image from a set of exposures. Once the exposures have been captured, they may be used to obtain the camera response function of the capturing device, which is then applied to the exposures to convert them to radiance maps. The individual radiance maps, normalized by their exposure times, are averaged together to generate an HDR image [2]. The process is represented by the following equation:

E_i = \frac{\sum_{j=1}^{Q} w(Z_{ij}) \, g(Z_{ij}) / \Delta t_j}{\sum_{j=1}^{Q} w(Z_{ij})}    (1)

where E_i is the radiance at location i in the image, j denotes the exposure, Q is the number of exposures, Z_{ij} is the value of the ith pixel in the jth exposure, g(·) is the inverse of the camera response function, and \Delta t_j is the exposure time of the jth exposure. Note that the above is a weighted average of the radiance in each exposure. The function w(·) is typically chosen to diminish the contribution of pixel values that are under- or over-exposed: this weight is small for extremely high or low pixel values. The above equation is typically evaluated three times for each pixel, once for each color channel.
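To make the merge step concrete, the following is a minimal NumPy sketch of Equation 1, not the authors' implementation: the array layout, the linear inverse response, and the simple hat-shaped weight in the usage comment are assumptions made purely for illustration.

```python
import numpy as np

def merge_hdr(exposures, exposure_times, inv_response, weight):
    """Weighted HDR merge (Equation 1), evaluated per color channel.

    exposures      -- list of (M, N, 3) uint8 arrays Z_j, assumed aligned
    exposure_times -- list of exposure times Delta t_j in seconds
    inv_response   -- g(.), inverse camera response: pixel value -> relative radiance
    weight         -- w(.), weighting function over 8-bit pixel values
    """
    num = np.zeros(exposures[0].shape, dtype=np.float64)
    den = np.zeros_like(num)
    for Z, dt in zip(exposures, exposure_times):
        Zf = Z.astype(np.float64)
        w = weight(Zf)                      # per-pixel, per-channel weight w(Z_ij)
        num += w * inv_response(Zf) / dt    # weighted radiance estimate from exposure j
        den += w
    return num / np.maximum(den, 1e-8)      # E_i, guarding against all-zero weights

# Hypothetical usage with a linear response and a simple hat weight (placeholders only):
# E = merge_hdr(exposures, [1/60, 1/30, 1/15],
#               inv_response=lambda Z: Z / 255.0,
#               weight=lambda Z: 1.0 - np.abs(2.0 * Z / 255.0 - 1.0))
```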
We make an important improvement to this process by computing weights that are determined not only by a pixel's chance of being correctly exposed, but also by the chance that it captures the background. Unlike the first attribute, which can be ascertained simply by looking at a pixel's absolute values, there is no existing method for finding the probability that a pixel captures part of a moving object. The object may be of any color, shape, or size, and its movement may be slow or fast, rigid or non-rigid. These features make it difficult to model the background, and to assign weights to pixels accordingly.

Given a set of S exposures, each of size M × N, our objective is to compute a set of M × N × S weights that will be used to determine the contribution of each pixel in the exposure sequence. We equate this problem to that of finding the probability that a pixel belongs to the background. We use a non-parametric estimation scheme to determine this probability, so as to impose as few restrictions on our data as possible. Such estimation schemes give a high probability of membership to elements that lie in densely populated regions of the distribution's feature space. In particular, we use the kernel density estimator to compute this probability [9, 10]. The probability that a vector x belongs to a class F can be estimated as

P(x|F) = n^{-1} \sum_{i=1}^{n} K_H(x - y_i),    (2)

where n is the number of vectors in the class, y_i is the ith vector in the class, H is a symmetric, positive definite d × d bandwidth matrix, and

K_H(x) = |H|^{-1/2} K(H^{-1/2} x),    (3)

where K is a d-variate kernel function. The d-variate Gaussian density is a common choice for the kernel K:

K_H(x) = |H|^{-1/2} (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2} x^T H^{-1} x\right),    (4)

and we use this kernel function in our implementation.
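As an illustration of Equations 2–4, the sketch below evaluates the Gaussian kernel density estimate for one query vector. It is a NumPy fragment written for this text, not the authors' code; the function and variable names are our own.

```python
import numpy as np

def gaussian_kde_prob(x, Y, H):
    """Estimate P(x | F) with a Gaussian kernel (Equations 2-4).

    x -- (d,) query feature vector
    Y -- (n, d) vectors y_i representing the class F
    H -- (d, d) symmetric positive definite bandwidth matrix
    """
    d = x.shape[0]
    H_inv = np.linalg.inv(H)
    norm = np.linalg.det(H) ** -0.5 * (2.0 * np.pi) ** (-d / 2.0)
    diff = Y - x                                          # differences between x and each y_i
    mahal = np.einsum('nd,de,ne->n', diff, H_inv, diff)   # (x - y_i)^T H^{-1} (x - y_i)
    return norm * np.mean(np.exp(-0.5 * mahal))           # n^{-1} sum of kernel values
```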
We represent each pixel by a vector in a feature space such that x_k ∈ R^5, k = 1, 2, ..., M × N × S. Three dimensions represent color, and two represent the location of the pixel in image space. Typically, a pixel's color channels are assigned three different weights [2, 8]. However, since the three channels are correlated, a single weight per pixel is necessary to preserve the original color information [5]. Secondly, using spatial information exploits dependencies between proximal pixels. We use the Lαβ color space to represent each pixel's color values in the feature space, as this color space is perceptually uniform to a first approximation [11]. For a vector x_k representing a pixel at location (p, q), the background is represented by an m × m neighborhood N around the vector, in all exposures. Thus, for each x_k, F = {y_{i,j,s} | (i, j) ∈ N(x_k), (i, j) ≠ (p, q), and s = 1, 2, ..., S} (see Figure 1). This representation of the background is identical for all S pixels at location (p, q).
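To make the feature-space construction concrete, the following fragment gathers the background set F for a single location (p, q). It assumes that a separate RGB-to-Lαβ conversion has already produced the lab_stack array; every name in it is ours, not the paper's.

```python
import numpy as np

def background_set(lab_stack, p, q, m=3):
    """Collect F = {y_{i,j,s}}: the m x m spatial neighborhood around (p, q),
    excluding (p, q) itself, taken from all S exposures.

    lab_stack -- (S, M, N, 3) array of L-alpha-beta color values (assumed precomputed)
    Returns an (n, 5) array of feature vectors [L, alpha, beta, row, col].
    """
    S, M, N, _ = lab_stack.shape
    r = m // 2
    feats = []
    for s in range(S):
        for i in range(max(0, p - r), min(M, p + r + 1)):
            for j in range(max(0, q - r), min(N, q + r + 1)):
                if (i, j) == (p, q):
                    continue                  # the pixel itself is excluded from F
                feats.append(np.concatenate([lab_stack[s, i, j], [i, j]]))
    return np.asarray(feats)
```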
Fig. 1. The arrays in this diagram represent images of the same scene taken at S different exposures. For a vector x_k representing a pixel at location (p, q) (shown in yellow), the vectors y_{i,j,s} are drawn from an m × m neighborhood around (p, q) in all S exposures (shown in blue); m equals three in this diagram. Note that the representation of the background is identical for all pixels at location (p, q).

Thus far, we have assumed that all vectors y_i are equally a part of the background. In practice, we know that many of these vectors represent pixels that are under- or over-exposed, and therefore do not represent the background well. These should not be considered part of the background. Likewise, some vectors represent pixels that capture the moving object, and they too are not a valid part of the background. Initially, while we do not yet know which vectors represent the moving object, we can reduce the effect of under- and over-exposed vectors by weighting their contribution to the kernel density estimate [12]. The weight assigned to each vector y_i is based on the simple hat function shown in Figure 2, computed as

w(Z) = 1 - \left( 2 \cdot \frac{Z}{255} - 1 \right)^{24},    (5)

where Z represents pixel values. This produces three weights for each pixel; the final weight w_i is the average of these three. Using these weights, the probability that a vector x belongs to the background becomes

P(x|F) = \frac{\sum_{i=1}^{n} w_i K_H(x - y_i)}{\sum_{i=1}^{n} w_i},    (6)

where w_i is the weight of vector y_i.
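The weighting of Equations 5 and 6 can be sketched as follows; this is again an illustrative NumPy fragment under our own naming assumptions (the weights w are supplied by the caller, initially from the hat function), not the paper's implementation.

```python
import numpy as np

def hat_weight(Z):
    """Equation 5: hat weight for 8-bit values, averaged over the three color channels."""
    w = 1.0 - (2.0 * Z / 255.0 - 1.0) ** 24
    return w.mean(axis=-1)                       # one weight w_i per pixel

def weighted_kde_prob(x, Y, w, H):
    """Equation 6: weighted kernel density estimate of P(x | F).

    x -- (d,) query feature vector
    Y -- (n, d) background feature vectors y_i
    w -- (n,) weights w_i of the background vectors
    H -- (d, d) symmetric positive definite bandwidth matrix
    """
    d = x.shape[0]
    H_inv = np.linalg.inv(H)
    norm = np.linalg.det(H) ** -0.5 * (2.0 * np.pi) ** (-d / 2.0)
    diff = Y - x
    mahal = np.einsum('nd,de,ne->n', diff, H_inv, diff)   # (x - y_i)^T H^{-1} (x - y_i)
    k = norm * np.exp(-0.5 * mahal)              # kernel responses K_H(x - y_i)
    return np.sum(w * k) / np.sum(w)
```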
Fig. 2. Initial weights assigned to each pixel based on its absolute values.

If our assumption holds and the neighborhood around each pixel predominantly represents the background, vectors that capture the moving object will receive lower probabilities than vectors that capture the background. Therefore, once the probabilities have been computed for each vector x, they can be used as the weights of the corresponding pixels in HDR image generation. An HDR image created with these weights will show diminished ghosting compared to an image generated with the initial weights, which are determined only by absolute pixel values. Now that we have a better set of weights for each pixel, we can repeat the above process of kernel density estimation, this time using as the weights of the vectors y_i the ones computed in the previous iteration:

w_{i,t+1} = w_{i,0} \cdot w_{i,t},    (7)

where i specifies the vector y_i that the weight is for, t denotes the iteration number, and w_{i,t} is P(x_i|F) from Equation 6. As before, we want to diminish the contribution of pixels that are under- or over-exposed, so we multiply the newly computed weights by the initial weights obtained from the hat function before using them in density estimation.
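One full pass of this iterative scheme might be organized as in the sketch below. It builds on the weighted_kde_prob and hat_weight fragments given earlier; the loop structure, the neighbors index lists, and all names are our assumptions rather than the authors' released code.

```python
import numpy as np

def iterate_weights(features, weights0, neighbors, H, iterations=9):
    """Iteratively refine per-pixel weights (Equation 7).

    features  -- (P, 5) feature vectors x_k for all P = M*N*S pixels
    weights0  -- (P,) initial weights w_{i,0} from the hat function (Equation 5)
    neighbors -- list of P index arrays; neighbors[k] selects the background set F of x_k
    H         -- (5, 5) bandwidth matrix
    Relies on weighted_kde_prob() from the previous sketch.
    """
    w = weights0.copy()
    for _ in range(iterations):
        p = np.empty_like(w)
        for k in range(features.shape[0]):
            idx = neighbors[k]
            # P(x_k | F) under the current weights of the neighborhood vectors (Equation 6)
            p[k] = weighted_kde_prob(features[k], features[idx], w[idx], H)
        w = weights0 * p                 # w_{i,t+1} = w_{i,0} * w_{i,t} (Equation 7)
    return w                             # final weights, used in place of w(Z_ij) in Equation 1
```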
3. RESULTS AND CONCLUSIONS

In this section, we show some of the results obtained using our iterative algorithm. We used a Nikon D2X camera to capture all the exposures required to create HDR images. We used fully manual settings, doubling only the exposure time between consecutive captures of the same scene. A tripod was used to keep the camera stable during image capture. The camera response function, which is required to generate HDR images (see Equation 1), was obtained using the algorithm proposed by Debevec and Malik [3]. All HDR images shown in this paper have been tonemapped with the photographic tonemapping operator [13]. Our algorithm requires the user to specify the size of the neighborhood around a pixel, which we have kept at a constant 3 × 3 for all the results shown in this section. The user is also required to specify the matrix H = diag(h_x, h_y, h_L, h_α, h_β), which we kept as the identity matrix for our runs.

Figure 3 shows our algorithm applied to a scene in which people are walking by. In this sequence, the extent of the movement is large, and ghosting occurs in most of the image as a result. In Figure 4, ghosting is more localized, but involves the movement of a highlight (on the edge of the green book), which implies that the ghosting region has high dynamic range. Successive iterations of our algorithm remove both instances of ghosting from the HDR images. We also show that our algorithm may be used to remove noise in HDR images.
Fig. 3. The left column shows the second, fifth, and seventh exposure of a sequence in which people are walking across the scene. The right column shows tonemapped HDR images that are generated using (top) weights determined only by absolute pixel values, (middle) weights that were estimated with a single iteration of our algorithm, (bottom) weights that were estimated after nine iterations of our algorithm.
The flower in Figure 5 appears clearer after a single iteration of our algorithm. We artificially added noise to an image sequence and applied our method to remove this noise. Noise was introduced by adding normally distributed random numbers to the color values of all the pixels in the exposures. After applying four iterations of our algorithm, we computed the mean squared error (MSE) between our result and the HDR image made from the same sequence without added noise. We repeated this process for different noise strengths, which we varied by changing the standard deviation of the noise distribution. In Figure 6, we compare these MSE values with those obtained when our algorithm is not applied. Clearly, our method may be used to reduce noise in HDR images. We have found that random noise typically takes fewer iterations to fade away than ghosting artifacts. This is possibly because moving objects are better represented in the background than random noise; consequently, the weights assigned to pixels that capture moving objects decrease more slowly. In the future, we intend to study the application of our algorithm to creating artifact-free HDR images from exposures captured with a hand-held (unstable) camera.
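For completeness, the noise experiment can be approximated along these lines; the noise model and helper names are our own assumptions, and the merge and weighting steps refer to the earlier sketches rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(exposures, sigma):
    """Add normally distributed noise to the color values of every pixel, clipped to 8 bits."""
    noisy = []
    for Z in exposures:
        n = rng.normal(0.0, sigma, size=Z.shape)
        noisy.append(np.clip(Z.astype(np.float64) + n, 0, 255).astype(np.uint8))
    return noisy

def mse(a, b):
    """Mean squared error between two HDR radiance maps."""
    return float(np.mean((np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)) ** 2))

# For each noise level sigma: merge add_noise(exposures, sigma) into an HDR image using four
# iterations of the weighting scheme, then compare it with mse() against the HDR image built
# from the clean sequence.
```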
Fig. 4. The top row shows the second, fourth, sixth, and eighth exposures of a sequence in which a book is being placed upon a stack of books. The bottom row (left to right) shows tonemapped HDR images that were generated using weights determined only by absolute pixel values, and weights that were estimated after one, two, and fourteen iterations of our algorithm.
Fig. 5. The image on the left was generated without using our algorithm. A single iteration of our algorithm reduces the noise in this image considerably, as shown on the right.

Fig. 6. MSE values plotted against the standard deviation of the added noise. The blue line indicates MSE values when our algorithm is not applied, while the red line plots MSE values obtained after four iterations of our algorithm.
4. REFERENCES

[1] Mitsunaga T. and Nayar S. K., "High dynamic range imaging: Spatially varying pixel exposures," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2000, vol. I, pp. 472–479.
[2] Reinhard E., Ward G., Pattanaik S., and Debevec P., High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting, Morgan Kaufmann, 2005.
[3] Debevec P. E. and Malik J., "Recovering high dynamic range radiance maps from photographs," in SIGGRAPH 97 Conference Proceedings, August 1997, Annual Conference Series, pp. 369–378.
[4] Mitsunaga T. and Nayar S. K., "Radiometric self-calibration," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1999, vol. I, pp. 374–380.
[5] Bogoni L., "Extending dynamic range of monochrome and color images through fusion," in International Conference on Pattern Recognition, 2000, vol. 3, pp. 7–12.
[6] Kang S. B., Uyttendaele M., Winder S., and Szeliski R., "High dynamic range video," in ACM Transactions on Graphics, 2003, vol. 22.
[7] Wang J. Y. A. and Adelson E., "Representing moving images with layers," in IEEE Transactions on Image Processing, May 1994, vol. 3, pp. 369–378.
[8] Jacobs K., Ward G., and Loscos C., "Automatic HDRI generation of dynamic environments," in Sketch, SIGGRAPH 2005, 2005, Annual Conference Series.
[9] Parzen E., "On estimation of a probability density function and mode," in Annals of Mathematical Statistics, 1962.
[10] Rosenblatt M., "Remarks on some nonparametric estimates of a density function," in Annals of Mathematical Statistics, 1956.
[11] Ruderman D. L., Cronin T. W., and Chiao C.-C., "Statistics of cone responses to natural images: Implications for visual coding," in Journal of the Optical Society of America A, 1998, vol. 15, pp. 2036–2045.
[12] Goerlich F. J., "Weighted samples, kernel density estimators and convergence," in Empirical Economics, 2003, vol. 28, pp. 335–351.
[13] Reinhard E., Stark M., Shirley P., and Ferwerda J., "Photographic tone reproduction for digital images," in ACM Transactions on Graphics, 2002, vol. 21, pp. 267–276.