Hashed Nonlocal Means for Rapid Image Filtering

Nicholas Dowson and Olivier Salvado

The authors are with the Australian e-Health Research Centre, Level 5, UQ Health Sciences Building, Royal Brisbane and Women's Hospital, Herston, QLD 4029, Australia. E-mail: {nicholas.dowson, olivier.salvado}@csiro.au.

Abstract—Denoising algorithms can alleviate the trade-off between noise level and acquisition time that still exists for certain image types. Nonlocal means, a recently proposed technique, outperforms other methods in removing noise while retaining image structure, albeit at prohibitive computational cost. Modifications have been proposed to reduce the cost, but the method is still too slow for practical filtering of 3D images. This paper proposes a hashed approach that explicitly represents two summed frequency (hash) functions of local descriptors (patches), utilizing all available image data. Unlike other approaches, the hash spaces are discretized on a regular grid, so primarily linear operations are used. The large memory requirements are overcome by recursing the hash spaces. Additional speed gains are obtained by using a marginal linear interpolation method. Careful choice of the patch features results in high computational efficiency at similar accuracy. The proposed approach can filter a 3D image in less than a minute, versus 15 minutes to 3 hours for existing nonlocal means methods.

Index Terms—Nonlocal means, image filtering.
1 INTRODUCTION

A trade-off still exists between image acquisition time and noise levels for certain types of images, e.g., 3D Magnetic Resonance (MR), Positron Emission Tomography (PET), and low-light and infrared images. It is frequently impractical to increase acquisition time beyond a certain limit due to movement of the subject or the time constraints associated with any expensive shared resource. Hence, denoising filters are required. The "optimal" filter attempts to trade off between the retention of image structure, such as edges and corners, and the removal of noise.

The classical approach is to compute a weighted sum of a voxel and its surroundings. This operation assumes that a voxel should resemble its neighbors. Algorithms that are adaptive to regional intensity gradients allow greater smoothing without additional destruction of image structure. This modification simply uses a tighter assumption: voxels should resemble similar, nearby neighbors. One such adaptive approach is anisotropic diffusion, proposed by Perona and Malik [1], which relies upon "diffusing" intensities orthogonally to the local intensity gradient, thereby preserving edges and corners. A more generalized approach, independently proposed by Smith and Brady [2] as SUSAN and by Tomasi and Manduchi [3] as the bilateral filter, is also widely used. The latter term is used in this paper. Other similar methods exist, such as M-smoothers, which regularize using the original image [4], methods using more robust image statistics [5], and methods using higher order
intensity statistics [6]. Mrazek et al. [7] provide a comprehensive summary of denoising methods and how they relate to each other.

Nonlocal means filtering, recently introduced by Buades et al. [8], [9] and Awate and Whitaker [6], [10], exploits the self-similarity within images. Nonlocal means computes a weighted sum of all voxels within the image and weights pairs of voxels according to the respective patches of intensities surrounding them, i.e., it assumes voxels should resemble other voxels with similar nearby intensities. Awate approaches the problem from an information-theoretic viewpoint by exploiting the increase in entropy generated by noise, which leads to an update function that reduces the entropy in feature space. Despite different starting points, the final derivations are essentially identical. There are some differences in that Awate uses random sampling, larger patches, and an iterative approach. Awate also optimizes the smoothing parameter. As Awate points out [10], iterative nonlocal means is a particular application of the mean shift algorithm [11], [12], which uses a Parzen window estimate [13] of the probability distribution function (PDF) of the feature space of local descriptors.

Muresan and Parks [14] develop a similar approach, treating local patches of intensity as samples of a set of principal components. The corresponding component coefficients for each patch are reduced to better fit the model according to the measured noise. Azzabou et al. [15] take the approach of iteratively selecting random sets of local patches for the purposes of smoothing. Likewise, Dabov et al. [16] iteratively cluster patches and perform within-cluster filtering to achieve the same effect.

The intuition is that nonlocal means matches local image structures rather than image intensities. Similar principles have been used before to increase the resolution in photographs, trading off between different information channels [17], and to generate self-similar texture [18]. Structure is retained because only regions of similar structure are combined. Noise is better reduced as a
large number of exemplars may be used from throughout the image rather than being restricted to a local subset, and the effect of statistical outliers is reduced. The disadvantage of nonlocal means filtering is its prohibitive computational cost, making it impractical for use on large 3D and 4D data sets.

Various methods for reducing the computational cost of nonlocal means have been explored [19], [20]. Coupe et al. combined these techniques with some additional proposals [21], specifically suggesting three ways to reduce the cost. First, weights between a pair of voxels are only computed if the mean and covariance of intensities within a pair of corresponding patches are sufficiently similar. This is similar to the early jump-out techniques used in block matching [22], [23]. Second, matches are restricted to voxels within a local neighborhood. Finally, coweights are not computed at every location, but on a subsampled grid of points. To fill in the gaps between voxels, entire patches of intensity are written. The patches are overlapped and averaged to improve robustness. The first modification improves performance at little cost to accuracy. The second modification means that only a subset of available information is used for smoothing, while the third creates a plaid pattern of artifacts within the image.

Brox et al. propose an approach [24] to reduce the computational cost without blocking artifacts. The data are arranged into a binary tree of clusters of patches, preventing comparison of dissimilar patches. Brox's approach divides the feature space up into a set of simplexes. To prevent bias of samples near cluster edges, a spill tree is used, which duplicates nearby samples external to each cluster. Spill trees exclude samples within the support region of the smoothing kernel, but outside the spill radius. This trade-off improves speed, but can also reduce accuracy in densely populated regions of feature space.

This paper proposes a novel hashed implementation of nonlocal means. This work uses an idea similar to the extension proposed by Paris and Durand to speed up bilateral filtering [25], where 2D image data are projected into a 3D space with an additional dimension for intensity. Unlike [25], the spatial dimensions are dispensed with, and techniques to overcome the high dimensionality of the data are proposed. With care, the hash structure results in an algorithm that is faster than even the optimized nonlocal means method proposed by Coupe et al. and uses all available image data.

The remainder of the paper is organized as follows: In Section 2, the nonlocal means algorithm is introduced. The proposed approach is described in Section 3. Some experiments are discussed in Section 4 before summarizing and concluding in Section 5.
2 BACKGROUND ON NONLOCAL MEANS
Images, $I$, of dimension $D$ are treated as a function defined at a set of points, $I : (X \subset \mathbb{Z}^D) \to \mathbb{R}$. $X$ is a discrete set of voxel locations on a grid for which $I$ is defined, i.e., $X = \{x_1, x_2, \ldots, x_i, \ldots, x_{|X|}\}$, and $|X|$ indicates the number of elements in $X$. Each element in $X$, or voxel, is indexed by $i$. The image is assumed to have been obtained using a process which includes additive noise, i.e., $I[x] = \hat{I}[x] + N[x]$, where $N$ is unknown additive noise with standard deviation $\sigma$, and
$\hat{I}$ is the noise-free image. Square brackets are used to indicate that $I$ is defined on a discrete grid.

Associated with each voxel $i$ is a feature vector (descriptor) describing the local image structure, $f_i$. In some versions of nonlocal means, $f$ is simply the 3D patch of intensities centered on $x_i$. However, $f$ can vary. Coupe experimented with patches of up to $7^3$ in size [21] and Awate used $9^2$ patches [10]. Features other than neighborhood intensities could also be used. An exploration of other types of features is left for future work. Here, the descriptor is defined using a set of intensities at predefined spatial offsets, $P = \{p_1, \ldots, p_k, \ldots, p_{|P|}\}$, indexed by $k$, from position $x_i$:

$$f_i = \{ I[x_i + p_k],\ \forall k \in [1, |P|] \}. \qquad (1)$$
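As an illustration, the following minimal Python/NumPy sketch (our own, not code from the paper; the function name and the wrap-around border handling are assumptions) assembles a descriptor of the form (1) for every voxel at once. The six face-neighbour offsets given here anticipate the descriptor favoured by the experiments described below.

```python
import numpy as np

def extract_descriptors(image, offsets):
    """Build f_i = {I[x_i + p_k]} for every voxel, following (1).

    image   : 3D ndarray of intensities.
    offsets : list of integer (dz, dy, dx) offsets defining P.
    Returns an array of shape image.shape + (|P|,).
    """
    descriptors = np.empty(image.shape + (len(offsets),), dtype=image.dtype)
    for k, p in enumerate(offsets):
        # np.roll wraps at the borders; a full implementation would pad or
        # clamp instead, but wrapping keeps the sketch short.
        descriptors[..., k] = np.roll(image, shift=tuple(-d for d in p), axis=(0, 1, 2))
    return descriptors

# The six face neighbours of a voxel (the descriptor favoured later in this
# section), excluding the central intensity itself.
SIX_NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                         (0, -1, 0), (0, 0, 1), (0, 0, -1)]
```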
The index $k$ is also used to refer to elements in the vector $f_i$, with the notation $f_{ik}$. For each voxel $i$, nonlocal means computes a smoothed value, $\tilde{I}[x_i]$, as a weighted sum of intensities of voxels at locations $Y \subseteq X$, normalized by the sum of weights. The effect of selecting a subset of nearby locations may be obtained by letting $Y = X$ and using a spatial distance kernel $\phi$:

$$\tilde{I}[x_i] = \frac{\sum_{j=1}^{|X|} \phi(x_i - x_j)\, w_K(f_i - f_j)\, I[x_j]}{\sum_{j=1}^{|X|} \phi(x_i - x_j)\, w_K(f_i - f_j)}, \qquad (2)$$
where $j$ is a summation index and $w_K$ is the weighting function. In Buades [8], $\phi(x_i - x_j)$ equals 1 everywhere, making (2) location-independent. However, this results in an $O(|X|^2 |P|)$ process to compute (2) for all voxels. One of the optimizations proposed by Coupe [21] is to define $\phi$ as a top-hat function so that the sum is computed over a subset of locations, $Y$, near to $x$. $|Y|$ is kept constant relative to image size. Kervrann and Boulanger [26] have proposed locally adapting $Y$ using a measure of the uncertainty in $\tilde{I}[x_i]$.

Computational costs may be further reduced by decreasing the size of the patch descriptor. Due to their proximity, patch intensities (features) share redundant information, which can increase computation without a proportionate improvement in the accuracy of $\tilde{I}$. Hence, $P$ need not include all the voxels directly surrounding a position $x$. Likewise, as elements are added to $P$, performance would be expected to improve, but at decreasing rates. Experiments showed that a feature vector consisting of the six intensities nearest to the center of a cube-shaped patch, excluding the central intensity, gave a good trade-off between performance and speed. These experiments are presented in Appendix I-A in the supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.114.

The weight between two voxels $i$ and $j$ is computed based upon the difference between their corresponding descriptors, $f_i$ and $f_j$. Like [26], the weight function is defined as a Gaussian with a full covariance matrix:

$$w_K(f_i - f_j) = \exp\left(-(f_i - f_j)^T h^{-2} K^2 (f_i - f_j)\right), \qquad (3)$$

where $h$ is a scalar used to control smoothing and $K^2$ is an inverse covariance matrix describing how good a predictor a given feature, $f_{\cdot k}$ (or linear combination of features), is of
a voxel's intensity. Like previous work [8], [10], [21], each offset intensity is treated as independent, so $K^2$ is diagonal. Buades [8] weighted the diagonal elements of $K^2$ according to the spatial euclidean distance of each offset from the central voxel: $K_{kk} = \exp(-\frac{1}{2} p_k^T p_k)$. Awate [10] also downweighted patch voxels distant from the center, but with a central plateau, because the patch size is larger: $9^2$. Like Coupe [21], this work makes no prior assumptions about the predictive ability of the offsets and a classical euclidean norm is used, i.e., $K^2$ is the identity matrix. $K^2$ is multiplied by a scalar smoothing parameter, $h$, to control the amount of smoothing. The smoothing parameter is chosen based upon the estimated noise variance, $\sigma^2$. Normalization for the scaling induced by $K^2$ is also required. This is simply $\mathrm{trace}(K^2)$, which equals $|P|$ for the classical euclidean norm used here:

$$h^2 = 2\beta\sigma^2 |P|. \qquad (4)$$
A tuning parameter, $\beta$, is retained, although Buades et al. [8] and Coupe et al. [21] suggest leaving this at one based on empirical evidence. In contrast, Awate and Whitaker [10] automatically optimize $h$ to minimize the entropy of the PDF of $f$.

Considering the PDF of $f$, $h$ can be thought of as a prior on the size of clusters of data contained within the PDF. The denominator in (2) is the Parzen window estimate of the PDF. Updating using nonlocal means drives voxel intensities toward the local mode, resulting in a decrease in average cluster volume. Similarly, adding noise blurs the clusters and induces mixing between them. Even without noise, variations in scene pose and the finite size of pixels mean that clusters should have a nonzero volume and may populate some nonelliptical region within the PDF. Nonlocal means seeks solely to remove the blurring induced by noise.

This highlights the importance of selecting $\beta$. Too large a value means the estimated PDF incorrectly models clusters as large elliptical Gaussians, distorting their true shape and, in the worst case, treating nearby but separate clusters as one entity. This can drive samples toward false modes and remove natural variations in appearance in addition to noise. Too small a $\beta$ means that random fluctuations in density in the PDF are modeled as clusters, to which samples are driven, arbitrarily partitioning the PDF. At best, almost no denoising occurs. At worst, this could increase noise for some samples as they are driven away from the true local mode of a nearby cluster.

Standard Fourier theory implies that separate clusters that have been mixed by noise cannot be resolved if their modes are within $\sigma$ of each other [27], [28], where $\sigma$ is the noise standard deviation. Likewise, the edges of clusters cannot be known with greater certainty than $\sigma$. This suggests that $\beta$ values greater than 1 are suboptimal. Optimizing $h$ as Awate suggests [10] tunes to the data and the local cluster size, and deals with errors in estimates of $\sigma$. However, optimizing requires computing the nonlocal means function for every iteration of the optimization function, and many iterations may be required. The cost of such a scheme can outweigh the benefits of obtaining an optimal $h$. An alternative is to use a subset of the available data at lower cost, but this too may give a suboptimal $h$.
Comaniciu and Meer [11] also propose methods to select $h$, but suggest that a manual selection using domain knowledge is sufficient. Similar care should be taken not to iterate nonlocal means too many times, as clusters shrink to unrealistically small volumes that poorly model real data and cause stepping artifacts in images. In extreme cases, clusters collapse to single points. The minimum entropy observed in these cases starts to describe the entropy of the Parzen windows rather than the data itself. Mostly for reasons of speed, this work does not iterate nonlocal means.
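For reference, the following sketch (our own Python/NumPy illustration rather than the authors' code; the function and argument names are hypothetical) applies the classical nonlocal means update of (2)-(4) to a single voxel, assuming $\phi \equiv 1$ and $K^2$ equal to the identity.

```python
import numpy as np

def nlm_single_voxel(i, descriptors, intensities, sigma, beta=1.0):
    """Denoise voxel i with classical nonlocal means, per (2)-(4).

    descriptors : (|X|, |P|) array of feature vectors f_j.
    intensities : (|X|,) array of intensities I[x_j].
    sigma       : estimated noise standard deviation.
    beta        : tuning parameter, usually left at 1.
    """
    num_features = descriptors.shape[1]          # |P|
    h2 = 2.0 * beta * sigma**2 * num_features    # equation (4)

    diff = descriptors - descriptors[i]          # f_i - f_j for all j
    # Euclidean norm (K^2 = identity) inside the Gaussian weight, equation (3).
    weights = np.exp(-np.sum(diff**2, axis=1) / h2)

    # Weighted average of intensities, equation (2) with phi == 1.
    return np.dot(weights, intensities) / np.sum(weights)
```

Applied naively to every voxel, this costs $O(|X|^2|P|)$, which is exactly the expense the hashed formulation of Section 3 is designed to avoid.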
3 HASHED NONLOCAL MEANS
The descriptor $f_j$ may be decoupled from $f_i$ in (2) by integrating over the feature space and multiplying by the indicator function, $\delta(g)$, which equals 1 when $g$ is at the origin and 0 elsewhere. Setting $\phi(\cdot)$ to 1 everywhere to decouple $x_i$ and $x_j$ gives an alternative formulation of nonlocal means:

$$\tilde{I}[x_i] = \frac{\sum_{j=1}^{|X|} \int_{g \in F} w_K(f_i - g)\, \delta(g - f_j)\, I[x_j]\, dg}{\sum_{j=1}^{|X|} \int_{g \in F} w_K(f_i - g)\, \delta(g - f_j)\, dg}, \qquad (5)$$

where $F \subset \mathbb{R}^{|P|}$ defines a hyperrectangular region containing all $f$-vectors and $g$ is a dummy variable used to specify a location in feature space. The sum and integral terms can be interchanged using the commutative law. This allows (5) to be simplified by using frequency distributions or hash spaces:

$$H_1(g) = \sum_{j=1}^{|X|} \delta(g - f_j), \qquad (6)$$

$$H_f(g) = \sum_{j=1}^{|X|} \delta(g - f_j)\, I[x_j]. \qquad (7)$$
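To make (6) and (7) concrete, the following sketch (our own Python/NumPy illustration; the quantization convention and all names are assumptions, not the authors' implementation) accumulates $H_1$ and $H_f$ over a regular grid of bins in a single pass over the image.

```python
import numpy as np

def build_hash_spaces(descriptors, intensities, f_min, bin_width, bins_per_dim):
    """Accumulate H_1 (counts) and H_f (summed intensities), per (6)-(7).

    descriptors  : (|X|, |P|) array of feature vectors.
    intensities  : (|X|,) array of voxel intensities.
    f_min        : minimum feature value, the origin of the bin grid.
    bin_width    : hash-space bin width (about h, see the discussion below).
    bins_per_dim : number of bins along each of the |P| dimensions.
    """
    num_dims = descriptors.shape[1]
    shape = (bins_per_dim,) * num_dims   # dense |P|-dimensional grid

    # Quantize each descriptor onto the regular Cartesian grid of bins.
    idx = np.clip(((descriptors - f_min) / bin_width).astype(int),
                  0, bins_per_dim - 1)
    flat = np.ravel_multi_index(idx.T, shape)

    H1 = np.bincount(flat, minlength=np.prod(shape)).astype(float)
    Hf = np.bincount(flat, weights=intensities, minlength=np.prod(shape))
    return H1.reshape(shape), Hf.reshape(shape)
```

The dense grid contains bins_per_dim to the power $|P|$ bins, which grows quickly with $|P|$; this is the memory pressure that the recursion described in Section 3.1 is intended to relieve.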
Examples of $H_1$ and $H_f$ are shown in Fig. 1.

Fig. 1. An illustrative example of (a) a 1D sinusoidal signal with additive noise. A descriptor consisting of a single feature, namely the intensity immediately preceding each sample, is extracted and used to construct (b) a frequency hash, which is equivalent to $H_1$, and (c) a summed intensity hash, which is equivalent to $H_f$. In (b) and (c), the thick and dashed lines show the hash space before and after convolution. The thin line is the difference between the two. The dots show the actual feature values, with a random y-axis offset to make density visible. (a) A sinusoid with additive noise. (b) Frequency hash for one feature ($H_1$). (c) Intensity hash for one feature ($H_f$).

Equation (6) counts the frequency with which certain $f$-signatures occur, while (7) accumulates intensity values with the same $f$-signatures. The hash spaces can be viewed as a computationally convenient way to group the terms in the summations in (2). Grouping the summation terms in (2) reveals a function that no longer depends on $x_i$, but only on the associated descriptor $f_i$:

$$\tilde{I}_i = \frac{\int_{g \in F} H_f(g)\, w_K(f_i - g)\, dg}{\int_{g \in F} H_1(g)\, w_K(f_i - g)\, dg}. \qquad (8)$$

The weight, $w_K$, becomes a nonlinear scaling of euclidean distance within the feature or hash spaces (hereafter, the latter term is used). Due to limitations in data representation and storage, the hash spaces are divided up into discrete bins. Hence, the terms in the numerator and denominator of (2) and (5) are grouped according to bin membership. Only the summed result of each bin is stored. A single coweight is computed between each pair of bins (group of terms), before computing a weighted sum for each bin. This grouping of terms makes large reductions in computational cost possible.

The bins in $H_f$ and $H_1$ form a regular Cartesian grid in $|P|$-dimensional hash space, so the weighted sum operations
are equivalent to convolutions by the weighting kernel $w_K$. The use of a Cartesian grid means no (expensive) search step is required to find neighboring bins. Inspection shows that the numerator and denominator of (8) are convolutions of $H_f$ and $H_1$ by $w_K$:

$$\tilde{I}_i = \frac{(w_K * H_f)(f_i)}{(w_K * H_1)(f_i)} = \frac{H_f'(f_i)}{H_1'(f_i)}. \qquad (9)$$
Although $H_f$ and $H_1$ are discretely represented, round brackets are used to indicate that interpolation may be used to obtain values at off-grid locations. Obtaining the smoothed (denoised) intensity becomes a matter of interpolating values for the numerator and denominator (after convolution) for a given $f$ and dividing them. The hash spaces only need to be created once for the image, after which they may be reused to interpolate the denoised value for every voxel in the image. Hence, nonlocal means no longer scales with $|X|$, but with the dimensionality of the hash space and the complexity of the interpolation method used.

The construction of hash spaces $H_f$ and $H_1$ is an $O(|X||P|)$ process. As each offset intensity is treated independently, i.e., the weighting kernel is axis-aligned, separable convolution may be applied, resulting in further potential cost savings. Separable convolution has a computational cost of $O(|P||F| w_{\mathrm{size}})$, where $F$ is the set of all bins in the hash spaces and $w_{\mathrm{size}}$ is the size of the kernel. $\tilde{I}$ may also be calculated by dividing the two discretized hash spaces and interpolating $\frac{w_K * H_f}{w_K * H_1}(f)$. However, in most cases $|X| \ll |F|$, e.g., $|X| \approx 7 \times 10^6$ and $|F| \approx 4.7 \times 10^7$ are typical, so this approach was not used. Division is a nonlinear operation, and further simplifications to (9) are not performed. Note that division by zero never occurs. The hash spaces are generated from the images that they are used for, so lookups only ever occur in regions of $H_1 * w_K$ that are populated, i.e., nonzero regions.

The method described by (9) is related to that proposed by Paris and Durand to speed up bilateral filtering denoising [25], [29]. Paris proposes projecting images into a $(D+1)$-dimensional hash space, where the coordinates are a concatenation of the spatial coordinates of each voxel, $x_i$,
and its intensity, $I_i$. Equation (9) applies the same principles but dispenses with the spatial coordinates and replaces them with multiple offset-intensity coordinates. Spatial dependence may be added back to (9) by concatenating $f$ and $x$ to form a new descriptor, with appropriate adjustments to $w_K$. This replaces $\phi$.

As Paris points out, the Gaussian-shaped kernel acts as a low-pass filter on the hash space. Artifacts smaller than the cutoff frequency of the filter (inverse of $h$) are overwhelmed by the lower frequency effects. The hash-space resolution, and hence computational expense, may be reduced according to the parameter $h$. By Nyquist's theorem, signal variations less than half the sampling frequency are hidden, but these variations would be removed by the low-pass filter anyway. The filter may be truncated where its values are below some threshold. Higher resolutions and proportionately larger kernels yield slight improvements in accuracy, but only at great computational cost. Hence, a bin width of $1h$ is used with a five-bin-wide kernel. This decision is justified by the experiments in Appendix I-B in the supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.114.
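The convolve-and-lookup step of (9) might be sketched as follows (again our own Python illustration, not the authors' implementation): scipy.ndimage.gaussian_filter performs separable, axis-aligned smoothing of the kind described above, truncated here to an approximately five-bin support, and a nearest-neighbour read stands in for the marginal-linear interpolation introduced in Section 3.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hashed_nlm_lookup(H1, Hf, descriptors, f_min, bin_width, sigma_bins=1.0):
    """Evaluate (9): smooth H_1 and H_f with an axis-aligned Gaussian, then
    read off the ratio at each voxel's descriptor (nearest-neighbour lookup
    here for brevity; Section 3.1 describes the marginal-linear scheme)."""
    # Separable convolution along every hash-space dimension; truncate=2.0
    # with sigma_bins=1.0 gives roughly a five-bin-wide kernel.
    H1_s = gaussian_filter(H1, sigma=sigma_bins, truncate=2.0)
    Hf_s = gaussian_filter(Hf, sigma=sigma_bins, truncate=2.0)

    # Quantize descriptors with the same convention used to build the hashes.
    idx = np.clip(((descriptors - f_min) / bin_width).astype(int),
                  0, np.array(H1.shape) - 1)
    num = Hf_s[tuple(idx.T)]
    den = H1_s[tuple(idx.T)]
    # Every looked-up bin contains at least its own sample, so den > 0.
    return num / den
```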
3.1 Interpolation and Recursion of the Hash Space

The deviation from standard nonlocal means caused by a discrete representation of the hash spaces, $H_1$ and $H_f$, arises from three sources [30]: the finite number of samples (number of voxels), the size of each bin (hash-space resolution), and the method used to obtain information from each bin (interpolation). The number of voxels (samples) is fixed, so the focus of the following discussion is on the latter two items. Fig. 1 gives an example of a discretely represented hash space for a 1D sinusoidal signal with a descriptor consisting of a single feature. The following discussion refers to $H_1$ (Fig. 1b) and $H_f$ (Fig. 1c), but for brevity, only $H_f$ is referred to.

Fig. 2. A zoomed region of Fig. 1c is shown to illustrate the interpolation process. A given sample, (a), is contained along with other samples in a particular bin, (b). The bias, (c), imposed by samples external to the bin $(H_f * w_K - H_f)$ is added. If the number of samples is small, the weighted sum of all nearby within-bin samples is computed using standard nonlocal means. Weighting is computed using the same Gaussian smoothing values illustrated by (d). If the number of samples is large, but the bin width is sufficiently small relative to $h$, then $H_f$ may be interpolated from (e) the $H_f$ values of nearby bins. Finally, if the bin width is large enough to make interpolation inaccurate, a subhash space is recursively generated from the current bin, illustrated by (f) new bin edges.

The resolution of the hash spaces is limited by available memory, severely so for high-dimensional spaces. However, if bin memberships are recorded, requiring $O(|X|)$ memory, standard nonlocal means can be used to compute $\tilde{I}$ values using (2) for within-bin samples only (Figs. 2a and 2b). The effect of extrabin samples is accounted for by adding a bin bias: $(H_f * w_K)[f] - H_f[f]$ (Fig. 2c). Deviations
from standard nonlocal means are restricted to samples near the edge of the bin, where local effects of nearby extrabin samples are lost in their accumulation with distant extrabin samples. As the cost of standard nonlocal means is quadratic in the number of bin samples, $|X_f|$, this approach is only feasible for bins with low populations.

For bins with higher populations, $H_f(f)$ can be interpolated directly from the values at discrete points in the hash space (Fig. 2d). Several options for interpolating the hash spaces exist. Nearest neighbor (NN) lookups are constant in time and cost; however, they provide limited accuracy, especially if the hash space is of lower resolution than required by the smoothing parameter $h$. Linear interpolation is not a viable possibility when the dimensionality of the hash space, $|P|$, is larger than about four or five, as linear interpolation is $O(2^{|P|-1})$ in computational cost. Hence, a marginal-linear interpolation scheme is used, which discards the cross products of standard N-dimensional linear interpolation. The resulting interpolation function is a weighted sum of marginal gradients along each hash-space dimension:

$$H_f(f_i) = H_f[\mathrm{round}(f_i)] + \sum_{k=1}^{|P|} \frac{d H_f(f_i)}{d f_{ik}} \left( f_{ik} - \mathrm{round}(f_{ik}) \right), \qquad (10)$$

where $\mathrm{round}(\cdot)$ of a vector gives the nearest vector consisting solely of integer components, and $f_{ik}$ is the $k$th component of $f_i$. This scheme is actually extrapolating when two or more components of $f_i$ are noninteger, and nonrealistic results can be obtained. For this reason, $H_f(f_i)$ is clamped to the range $\left[\frac{1}{2} H_f[\mathrm{round}(f_i)],\ 2 H_f[\mathrm{round}(f_i)]\right]$. Interpolation is inaccurate when the bin width is large relative to $h$, since local variations of higher frequency than the inverse bin width are filtered out. Hence, when bins are large relative to $h$, but have too many members to use
standard nonlocal means, a subhash space is constructed from the intrabin samples, and the procedure just described is recursively applied. Similarly, the effect of extrabin samples is recursively applied by adding them as a bin bias. As marginal-linear interpolation is only applied when the bin width is sufficiently low, i.e., at maximum recursion and when a given bin contains more than a certain threshold of samples, $k_{\max|X_f|}$, the inaccuracies introduced by the interpolation scheme are kept low. Also, when large numbers of nearby samples exist, local variations are minimized by averaging effects. Hence, the algorithm exploits dense regions in the hash spaces to reduce computation, while reverting to "local" standard nonlocal means in low-density regions. In experiments, the accuracy was not sensitive to the threshold of within-bin samples used to choose between nonlocal means and interpolation.

The use of marginal-linear interpolation implies a computational cost of $O(|P|)$ for looking up a single value in most cases. The cost is $O(k_{\max|X_f|})$ in bins where standard nonlocal means is used. In practice, however, the overhead of boundary checking and looking up neighboring bin values for marginal-linear interpolation is such that standard nonlocal means is actually faster than marginal-linear interpolation when $|X_f|$ is less than approximately 300. Hence, a cost of $O(|X||P|)$ is assumed for interpolation.

Recursion of the hash space nonintuitively implies an increasing computational cost in the convolution step for lower smoothing values. For uniformly distributed data (in the hash space), where every bin would need to be recursed, this would make the proposed approach impractical. In practice (for 3D medical images), data are not uniformly distributed and recursion is only applied to a low fraction (
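For illustration only, a sketch of the marginal-linear lookup of (10) with the clamping described above; this is our own Python illustration under the assumption of a central-difference estimate of the marginal gradients, and it omits the recursion, bin bias, and within-bin nonlocal means that the full algorithm applies.

```python
import numpy as np

def marginal_linear_lookup(H, f):
    """Marginal-linear interpolation of a hash space H at location f, per (10).

    H : |P|-dimensional ndarray (already convolved with w_K).
    f : descriptor expressed in bin coordinates (floats), shape (|P|,).
    """
    centre = np.rint(f).astype(int)
    centre = np.clip(centre, 1, np.array(H.shape) - 2)  # keep neighbours in range
    value = H[tuple(centre)]

    acc = value
    for k in range(len(f)):
        lo, hi = centre.copy(), centre.copy()
        lo[k] -= 1
        hi[k] += 1
        # Central-difference estimate of the marginal gradient dH/df_ik.
        grad = 0.5 * (H[tuple(hi)] - H[tuple(lo)])
        acc += grad * (f[k] - centre[k])

    # Clamp to avoid unrealistic extrapolated values, per the range above.
    return float(np.clip(acc, 0.5 * value, 2.0 * value))
```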