Invited Paper

Suprathreshold visual psychophysics and structure-based visual masking

Sheila S. Hemami, Damon M. Chandler, Bobbie G. Chern, Jeri Anne Moses
Visual Communications Lab, School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853
e-mail: [email protected], [email protected]

ABSTRACT

Current state-of-the-art algorithms that process visual information for end use by humans treat images and video as traditional "signals" and employ sophisticated signal processing strategies to achieve their excellent performance. These algorithms also incorporate characteristics of the human visual system (HVS), but typically in a relatively simplistic manner, and achievable performance is reaching an asymptote. However, large gains are still realizable with current techniques by aggressively incorporating HVS characteristics to a much greater extent than is presently done. Achieving these gains requires HVS characterizations which better model natural image perception, ranging from sub-threshold perception (where distortions are not visible) to supra-threshold perception (where distortions are clearly visible). This paper reviews classical psychophysical HVS characterizations focused on the visual cortex (V1), pertaining to the contrast sensitivity function, summation, and masking, which have been obtained using unrealistic stimuli such as sinusoids and white noise; the direct applicability of these results to natural images is often not clear. Complementary results are then presented which have been obtained using realistic stimuli derived from or consisting of natural images, along with several applications of these results. Finally, a new structure-based masking model is proposed to model masking in homogeneous natural image patches as a function of image type: textures, edges, or structures.

Keywords: visual psychophysics, supra-threshold distortion, masked detection, summation, wavelet image compression.

Visual Communications and Image Processing 2006, edited by John G. Apostolopoulos, Amir Said, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 6077, 60770O, © 2005 SPIE-IS&T

1. INTRODUCTION

Current state-of-the-art algorithms that process visual information for end use by humans treat images and video as traditional "signals," employing efficient transformations, high-dimensional probability models, and sophisticated prediction and filtering operations to exploit the statistical characteristics of these signals. With their state-of-the-art signal processing, these algorithms provide very high-efficiency compression and excellent quality at relatively high bit rates (i.e., at relatively large file sizes), where storage and transmission resources are not scarce. These algorithms incorporate characteristics of the human visual system (HVS), but typically in a relatively simplistic manner. The relative gains achieved along the way to these state-of-the-art results have exhibited, as expected, diminishing returns, and such purely signal-processing-based algorithms are reaching the asymptote of what can be achieved. However, gains are still achievable with current techniques by aggressively incorporating HVS characteristics to a much greater extent than is presently done. As such, more accurate characterizations are needed which better model natural image perception. This paper reviews results in HVS characterization obtained in the Visual Communications Laboratory at Cornell University over the past five years, and presents new results which begin to incorporate cognition into perceptual models.

Our work has focused on the visual cortex (also known as V1), which contributes to low-level vision. Two predominant theories of low-level vision have sparked a large body of research aimed at characterizing the role of V1. One theory proposed that cells in V1 operate as edge and bar detectors;1 another proposed that V1 contains "channels" responsible for analyzing an image's spatial frequencies and orientations.2,3 The former has led primarily to physiological work geared toward characterizing the non-linear response properties of individual V1 neurons. The latter has led to studies which have attempted to quantify, mainly via psychophysical experiments with humans, channel-level properties such as frequency/orientation bandwidths and interactions amongst channels. These two theories of V1 have more recently been unified under a widely accepted model of gain control.4-7

Two challenges exist in applying existing psychophysical V1 characterizations to natural image processing. First, the results are generally obtained using simple stimuli such as that shown in Figure 1(a); while such sinusoidal stimuli are excellent for isolating frequency channels in the visual cortex, few natural images consist of isolated sinusoids. Secondly, the resulting models often do not yield meaningful results when applied to complex images consisting of many components, the responses of which interact in V1.

Characterizing the HVS has traditionally been the job of psychologists and neuroscientists. Our approach differs in both method and goals: rather than showing observers sine-wave patches (Figure 1(a)) or white noise, we use natural images exhibiting relevant distortions, as illustrated in Figure 1(b,c). Our end goal is an improved processing or compression system. Model development may be explicit or implicit, but the usefulness of any resulting model is ultimately judged by whether images processed while applying the model are "better" than those processed without applying the model, or with a different model.

This paper first reviews classical psychophysical results used to characterize the HVS; it is these results which have traditionally been relied upon to guide the development of image processing and compression algorithms. Next, our approach to obtaining complementary results for natural image stimuli is described, and the results are compared with the classical results. Several exemplary applications of our results are discussed. Finally, our most recent work on moving beyond perception to incorporating elements of cognition into a perceptual model is presented.

Figure 1 Psychophysical experimental stimuli: (a) a simple sinusoid, which is an unnatural image; (b) unmasked distortions from quantizing wavelet coefficients in a single subband of a natural image, yielding a bandlimited and spatially correlated signal; (c) the same distortions as in (b), but now masked by the image.

2. CLASSICAL PSYCHOPHYSICAL RESULTS FOR THE HUMAN VISUAL SYSTEM

The HVS can be experimentally characterized by measuring visibility thresholds (VTs), or detectability, for various stimuli, or targets. The most common targets in classical psychophysics are either sinusoids or Gabor patches (sinusoids multiplied by a Gaussian). For these targets, a common contrast measure is the Michelson contrast, given in terms of the minimum and maximum luminances in the stimulus, l_min and l_max, as C = (l_max - l_min) / (l_max + l_min). Contrast is measured using the physical luminances which appear on the display rather than the integers representing grayscale values as the image is stored in the computer. The visibility threshold for a given target is defined as the contrast at which the target can be detected by a human observer; more specifically, it is the contrast at which some percentage of observers correctly detect that the target is present. This definition is formalized by the psychometric function Ψ(C), which relates the observers' performance to the intensity of the stimulus (here, given by the contrast C). The psychometric function expresses the probability of detection of a target with visibility threshold C_0 and is of the form

Ψ(C) = 1 - α^(-(C/C_0)^β)   (1)

The variable α can take on one of several values; α = e for the Weibull psychometric function and α = 2 for the Quick psychometric function. The probability of the observer being correct is then calculated by assuming that the observer guesses correctly in half of the trials, and that the probability of detection as given by the psychometric function applies in the other half of the trials. This yields a probability of correctly detecting a target with contrast C of

Prob(correct detection) = 1 - 0.5 α^(-(C/C_0)^β)   (2)

Because the VT occurs at C = C_0, the VT is the contrast at which observers correctly detect the target in 100 × (1 - 0.5 α^(-1))% of the trials (82% for Weibull, 75% for Quick).
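
As a numerical sanity check of Equations (1) and (2), the short Python sketch below (our own illustration; the function names are hypothetical, not from the paper) evaluates the probability of a correct detection at threshold:

    import math

    def psychometric(C, C0, alpha, beta):
        # Eq. (1): probability that the target is detected.
        return 1.0 - alpha ** (-(C / C0) ** beta)

    def prob_correct(C, C0, alpha, beta):
        # Eq. (2): the observer guesses correctly on half of the trials and
        # detects according to Eq. (1) on the other half.
        return 1.0 - 0.5 * alpha ** (-(C / C0) ** beta)

    # At threshold (C = C0) the exponent is -1 for any beta, so Eq. (2)
    # reduces to 1 - 0.5/alpha: ~82% for Weibull (alpha = e), 75% for Quick.
    for name, alpha in [("Weibull", math.e), ("Quick", 2.0)]:
        print(name, round(100 * prob_correct(1.0, 1.0, alpha, beta=3.0), 1))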

2.1. Subthreshold Spatial Frequency Perception — The Human Contrast Sensitivity Function (CSF)

Perhaps the simplest characterization of the HVS is the human contrast sensitivity function (CSF), a well-accepted description of spatial frequency perception. The CSF plots contrast sensitivity, defined as 1/VT, versus spatial frequency in units of cycles/degree of visual angle. This function peaks around 2-6 cycles/degree and exhibits exponential fall-off; the HVS behaves as a low-pass filter.2,8 Because the CSF is computed using VTs, at which observers can just detect the stimuli, it represents sub-threshold perception, and as such its direct application to the supra-threshold regime in compression, when artifacts are clearly visible, is not accurate. Because the CSF is determined experimentally with targets consisting of sinusoids (e.g., Figure 1(a)) or of specific basis functions, it represents measured sensitivities to either a single frequency or to a limited range of frequencies. To predict contrast sensitivity to more sophisticated compound stimuli (ideally, to an image or to broadband distortions), a summation model is required. Furthermore, to predict contrast sensitivity to a target (e.g., distortions) displayed simultaneously with another image, a masking model is required. Summation and masking models are described in the next two subsections.

2.2. Summation — Sensitivity to Compound Stimuli

A compound target is the superposition of more than one simple target. Components of a compound target are considered to be detected independently if they are sufficiently separated in the dimension of interest; for example, if they are separated by a large spatial distance, or if their frequencies are very different. If the components are not sufficiently separated, however, the compound target can be detected more easily than either of its simple components alone. In this case, the responses to those components are believed to have summed. The amount of summation is typically assessed by computing relative visibility thresholds, and it is often reported as a Minkowski summation exponent (β). For a compound target consisting of two components having individual simple contrast thresholds C_1^0 and C_2^0, and contrasts in the compound target (at threshold) given by C_1 and C_2, the Minkowski sum is given by

(C_1 / C_1^0)^β + (C_2 / C_2^0)^β = 1   (3)

This expression results from the assumption that each of the two components is detected by a separate channel, so the probability of detection is given by Ψ_12 = 1 - (1 - Ψ_1)(1 - Ψ_2) (where the subscripts refer to the components). Substituting for the psychometric functions Ψ_1 and Ψ_2 using Equation (1) yields a psychometric function for the compound target given by

Ψ_12 = 1 - α^(-[(C_1/C_1^0)^β + (C_2/C_2^0)^β])   (4)

Equation (3) then follows from an argument similar to that used to obtain Equation (2). As β → ∞, the compound target is detected only when one of its components reaches its individual VT. When β = 1, summation is linear and each component contributes proportionately to the visibility of the compound target. Values in between (typically from 2 to 4) reflect increased sensitivity to a component in the context of the compound target. Previous studies have compared the detectability of a compound target (e.g., two superimposed orthogonal sine-wave gratings) to the detectability of its individual components.


Figure 2 Standard gain control model of V1 neurons. The initial, linear response of a V1 neuron is computed as a weighted inner product of the neuron's receptive field and an input image; this response is then subjected to a pointwise nonlinearity followed by divisive inhibition from other neurons.

These studies have investigated summation-at-threshold along the orientation, spatial-frequency, and spatial dimensions, and have reported β values ranging from β = ∞ to β = 2.2 using sinusoid and Gabor targets presented against either a uniform background (no mask) or an unnatural masker such as white noise or a sinusoid.9-17 How these results can be extended to natural images, however, is unclear.
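
To make the role of β concrete, the following sketch (our own illustration, not code from the cited studies) solves Equation (3) for the special case in which both components are scaled together, i.e., C_1/C_1^0 = C_2/C_2^0 = k:

    def compound_threshold_fraction(beta, n_components=2):
        # Eq. (3) with equal normalized contrasts: n * k**beta = 1.
        # k is the fraction of its individual threshold that each component
        # must reach for the compound target to be at threshold.
        return (1.0 / n_components) ** (1.0 / beta)

    for beta in [1.0, 2.2, 4.0, 1e6]:  # 1e6 approximates beta -> infinity
        print(beta, round(compound_threshold_fraction(beta), 3))
    # beta = 1    -> 0.5   (linear summation: each component at half its VT)
    # beta = 2.2  -> 0.73
    # beta = 4    -> 0.841
    # beta -> inf -> ~1.0  (no summation: a component must reach its own VT)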

2.3. Explicitly Modeling Masking

Masking is a general term for the perceptual phenomenon in which the presence of one signal (the masker) reduces a subject's ability to detect another signal (the target); the VT of the target is raised. For example, Figure 1(b) displays unmasked distortions from quantizing bandlimited data in a natural image, formed by quantizing a single subband, subtracting the original image, and adding 128. Figure 1(c) is the image itself with the single subband quantized; the distortions are exactly those displayed in Figure 1(b), but they are substantially more difficult to see because they are masked by the image. Current explanations of visual masking can be divided into three paradigms: (1) noise masking, which attributes the increase in visibility thresholds to the corruptive effects of the masker on internal decision variables;18 (2) contrast masking, which attributes threshold elevations to gain control;4-7 and (3) entropy masking, which is imposed solely by an observer's unfamiliarity with the masker.19

Masking is commonly modeled using a widely accepted model of gain control.4-7 Noise and contrast masking, in particular, have been modeled using variations of the standard gain control model. In this model, detection thresholds are predicted from the difference between the model's response to the masker alone (e.g., an original image) and its response to the masker + target (e.g., an image with compression-induced distortions). This model has proved successful at predicting both the non-linear responses of V1 neurons and many of the channel interactions, and variations of this model have been used extensively in the vision and image-processing communities to predict various forms of visual masking for simple stimuli.5,6,18,20,21 Under the standard gain control model, channels are composed of a bank of neurons with similar tuning properties, and the response of an individual neuron is computed via three stages: (1) a weighted inner product between the input image and the neuron's receptive field; (2) a point-wise nonlinearity; and (3) divisive inhibition from other neurons which comprise a so-called "inhibitory pool." The structure of an HVS model up to and including V1 which implements the standard gain control model is illustrated in Figure 2. A filtering stage implements the weighted inner product and decomposes the input signal into typically octave-spaced frequency bands which are indexed by spatial frequency and orientation; approaches to implementing this stage include steerable pyramids, Gaussian pyramids, overcomplete wavelet decompositions, radial filters, and cortex filters. The filter outputs are then raised to a power typically between 2 and 4, followed by divisive inhibition based on the responses of other filters. Let x(u, f, θ) denote the filter-bank output at location u, frequency f, and orientation θ. The response of a neuron r(u, f, θ) as computed by this structure is mathematically modeled as


r(u, f, θ) = g_t × (w(u, f, θ) x(u, f, θ))^p / [ b^q + Σ_{(u, f, θ) ∈ S} (w(u, f, θ) x(u, f, θ))^q ]   (5)

where g_t is the detection gain, b represents the saturation constant, the set S indicates which other neuron outputs are included in the inhibitory pool and provide the masking effect, and w(u, f, θ) are weights for the filter-bank outputs of the neuron and of the inhibitory pool. Following formation of the responses of individual neurons, these responses must be combined via a summation operation, either to yield a spatial map indicating neural responses at different locations within the image (i.e., summation only across frequency and orientation), or to yield a single number which quantifies the visual response to the entire image (i.e., summation across frequency, orientation, and space). For example, for a location map of responses, the summation stage would form

R(u) = [ Σ_θ ( Σ_f r(u, f, θ)^β_f )^(β_θ/β_f) ]^(1/β_θ)   (6)

where β_f and β_θ are the pooling exponents for the frequency and orientation summation, respectively. To produce a single number, further summation over space with a pooling exponent β_u is performed. The inhibitory pool of the gain control model given in Equation (5) provides the functional model for masking: a larger denominator reduces the response and indicates increased masking. The actual contributors to the inhibitory pool in V1, however, are unknown. Use of the model with an orthonormal filter bank and a (commonly used) pooling exponent q = 2 leads to a simple energy term in the denominator. The specific parameters used in this model and its variations are generally fit to known or experimental psychophysical data, e.g., to visibility thresholds measured using simple sinusoidal gratings,22 to filtered white noise,23 or to threshold-versus-contrast (TvC) curves of target Gabor patterns with sinusoidal maskers.7,24 These parameters, however, do not generalize across stimuli and do not fit natural image data well. As a result, the utility of these models in practical applications is limited. General image-processing applications are typically more concerned with the detectability of specific targets presented against naturalistic, structured backgrounds rather than white noise or homogeneous textures. The standard gain control model, which does not account for structural information, consistently overestimates the ability of edges and object boundaries to mask distortions.
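
For concreteness, a minimal Python sketch of Equations (5) and (6) follows, assuming the filter-bank outputs are stored as an array indexed [u, f, θ]; the choice of inhibitory pool (all orientations at the same location and frequency) and the parameter values are placeholders of ours, not fitted values:

    import numpy as np

    def neuron_responses(x, w, g_t=1.0, b=0.03, p=2.0, q=2.0):
        # Eq. (5). x and w are arrays indexed as [u, f, theta] holding the
        # filter-bank outputs and their weights. The inhibitory pool S is
        # taken here to be all orientations at the same location and
        # frequency (one plausible choice among many).
        wx = np.abs(w * x)
        pool = (wx ** q).sum(axis=2, keepdims=True)  # sum over theta
        return g_t * wx ** p / (b ** q + pool)

    def response_map(r, beta_f=1.5, beta_theta=1.5):
        # Eq. (6): Minkowski pooling over frequency, then orientation,
        # yielding the spatial response map R(u).
        over_f = (r ** beta_f).sum(axis=1)           # sum over f
        return ((over_f ** (beta_theta / beta_f)).sum(axis=1)) ** (1.0 / beta_theta)

A masking prediction then compares the pooled response computed for the masker alone with the pooled response for the masker plus target.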

3. CHARACTERIZING THE HVS USING NATURAL IMAGES

In resource-limited systems, visually lossless image quality cannot be provided. Rather, images compressed to the low rates required for these systems exhibit obvious visual defects. Most work on including HVS characteristics in image compression has focused on incorporating perception in the sub-threshold regime, where compression-induced distortion is not visible. When compression artifacts become visible, the compression no longer operates in the sub-threshold regime but in the supra-threshold regime. In this regime, we can provide only visually optimal compression, in which distortion visibility is minimized for a given bit rate. While the models described in the previous sections have been successful in fitting experimental results obtained with sinusoidal targets and sinusoidal and white-noise maskers, they are not ideal for application to natural images. While numerous compression schemes have exploited the fact that the HVS is low-pass, the applicability of the CSF to visually lossy image compression remains unclear: the CSF represents subthreshold perception, while visually lossy compression induces distortions which are clearly supra-threshold. Summation exponents are also computed in the subthreshold regime, and the perception of two superimposed sinusoids is clearly not equivalent to viewing a complete image. Lastly, parameters in the standard gain control model and the final summation step are fit to specific targets; they neither generalize across stimuli nor accurately model masking in natural images. Our work has been motivated by the following principles:
• Characterize the HVS using realistic targets and maskers derived from or consisting of natural images;
• Fully explore supra-threshold perception as well as subthreshold perception;


• Generate results which, when incorporated into an image compression or processing algorithm, produce substantial gains.

Our stimuli have been primarily distortions caused by uniform quantization of wavelet coefficients in natural images. Quantization of the coefficients within a single subband gives rise to distortions in the reconstructed image that are localized in spatial frequency and orientation, and that are spatially correlated with the image. We have performed extensive psychophysical experimentation using both unmasked stimuli (e.g., presented against a blank background without the image present, as in Figure 1(b)) and masked stimuli (with the image present, as in Figure 1(c); i.e., what someone would actually see). While unmasked results often mimic those previously reported based on sinusoidal stimuli, masked results are vastly different and explain why the classical results have not extended well to, or achieved widespread use in, image processing applications. Three important results have emerged. First, visibility of wavelet subband quantization distortions presented against natural-image maskers requires its own model and cannot be fit by existing models. Secondly, summation of responses to distortions presented against natural-image maskers is much greater than summation in the unmasked paradigm and is nearly linear. Lastly, in the supra-threshold regime where the distortions are clearly visible, the perceived contrast of the distortions has little effect on the quality of the compressed image; rather, one must account for the impact that these distortions have on a subject's ability to visually process the image's edge structure. For complex stimuli such as images, root-mean-squared (RMS) contrast is used;25 the RMS contrast C_S of a stimulus S is given by

C_S = (1 / µ_{L_S}) [ (1 / NM) Σ_{y=0}^{N-1} Σ_{x=0}^{M-1} (L_S(x, y) - µ_{L_S})^2 ]^(1/2)   (7)

where µ_{L_S} denotes the average luminance of S, L_S(x, y) denotes the luminance of the pixel at location (x, y), and N × M is the total number of pixels. The contrast of the distortions is computed by subtracting the original image from the distorted image and computing the contrast of the resulting (zero-mean) difference image. To generate stimuli, images were decomposed into 16 subbands using a five-level hierarchical wavelet decomposition with the 9/7 biorthogonal filters. Viewing distances were adjusted (all approximately 45-55 cm) such that, combined with the dot pitch of the monitor, the 15 AC subbands corresponded to 5 LH bands (horizontal information) with horizontal spatial frequencies centered at 1.15, 2.3, 4.6, 9.2, and 18.4 cycles/degree, 5 HL bands with vertical spatial frequencies centered at the same values, and 5 HH bands with both horizontal and vertical spatial frequencies centered at the same values.
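
Equation (7) and the distortion-contrast convention above translate directly into code. In the sketch below (helper names are ours), the luminance arrays are assumed to have already been converted from stored grayscale values through the display model, and the zero-mean difference is normalized by the mean luminance of the original image, which is our reading of the convention:

    import numpy as np

    def rms_contrast(L):
        # Eq. (7): standard deviation of the luminance image divided by
        # its mean luminance.
        mu = L.mean()
        return np.sqrt(((L - mu) ** 2).mean()) / mu

    def distortion_contrast(L_original, L_distorted):
        # Contrast of the distortions alone: subtract the original from
        # the distorted image and compute the contrast of the (zero-mean)
        # difference. Eq. (7) applied literally to a zero-mean image would
        # divide by zero, so we normalize by the original's mean luminance
        # (an assumption on our part).
        diff = L_distorted - L_original
        return np.sqrt((diff ** 2).mean()) / L_original.mean()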

3.1. Subthreshold Spatial Frequency Perception — Masked and Unmasked CSFs

To quantify the effects of natural images on the visibility of distortions induced by quantization of wavelet coefficients, we examined the detectability of simple wavelet subband quantization distortions presented against no mask and against two natural-image maskers.26,27 VTs measured in the unmasked paradigm were compared with VTs measured when the distortions were presented against two natural-image backgrounds. Results are shown in Figure 3 and are compared with the traditionally obtained CSF as described in Section 2.1. Contrast sensitivity to simple wavelet subband quantization distortions presented against a uniform gray field displays a maximum at spatial frequencies at least as low as 1.15 cycles/degree. For the range of spatial frequencies tested (1.15-18.4 cycles/degree), the resulting CSF demonstrates a low-pass profile that is consistent with what has been reported in the literature for targets of similar bandwidth.8,14,28 With the exception of the very low-frequency behavior, this CSF is very similar to the traditionally obtained CSF. When simple wavelet subband quantization distortions are presented against a natural image, thresholds are elevated by factors ranging from approximately 10 at 1.15 cycles/degree to 2 at 18.4 cycles/degree. It is thus substantially more difficult to see these distortions in the context of an image, and both the unmasked results and the standard CSF overestimate sensitivity to these targets. Note that these results implicitly include the masking effects of the image.


Figure 3 Unmasked (as in Figure 1(b)) and masked (as in Figure 1(c)) CSFs for distortions caused by quantized wavelet coefficients in single subbands, plotted as contrast sensitivity versus spatial frequency (cycles/degree). The dashed line represents the traditionally obtained CSF as described in Section 2.1.

3.2. Sub- and Supra-threshold Summation for Unmasked and Masked Targets

To quantify the effects of natural images on the visibility of compound distortions, we investigated the summation of responses to distortions induced by quantization of discrete wavelet transform subbands presented against no mask and against eight different natural-image maskers.26,29 In the framework of previous summation studies, the detectability of distortions induced by quantization of pairs of discrete wavelet transform subbands was compared to the detectability of distortions induced by quantization of each subband alone. Pairs of subbands were taken to be LH and HL bands at the same frequency (i.e., from the same scale), and bands of the same orientation at two adjacent scales (e.g., LH bands at scales 2 and 3, corresponding to 2.3 and 4.6 cycles/degree center frequencies). For unmasked stimuli at subthreshold contrasts, subthreshold quantization distortions pooled in a non-linear fashion, and the amount of summation agreed with that of previous summation-at-threshold experiments (β = 2.43). However, the inclusion of natural-image maskers produced substantially different results. For masked stimuli at supra-threshold contrasts, summation measured in the presence of natural-image maskers was much greater than that found in previous studies utilizing unnatural maskers: supra-threshold quantization distortions pooled in a linear fashion (β ≈ 1). Summation increased as the distortions became increasingly supra-threshold but quickly settled to near-linear values. For signal processing applications, linear summation is fortunate, as it can substantially simplify algorithms. Furthermore, linear summation can be used in the subthreshold regime because it provides a conservative estimate of summation relative to larger β values. As such, a single summation model (β ≈ 1) can be applied in both subthreshold and supra-threshold compression.
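
The practical appeal of β ≈ 1 is easy to see in code. The hypothetical helper below pools threshold-normalized distortion contrasts across quantized subbands; with β = 1 it reduces to a simple sum that a rate-allocation loop can budget directly:

    def pooled_visibility(contrasts, thresholds, beta=1.0):
        # Minkowski pooling of per-subband distortion contrasts, each
        # normalized by its masked visibility threshold; the compound
        # distortion is predicted to be visible when this exceeds 1.
        s = sum((c / t) ** beta for c, t in zip(contrasts, thresholds))
        return s ** (1.0 / beta)

    # For the same inputs, beta = 1 yields a pooled value at least as large
    # as beta = 2.43 (lp-norm monotonicity), which is why linear summation
    # is the conservative choice in the subthreshold regime.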

3.3. Contrast Constancy and Distributing Distortions Across Frequency

Masked results for visibility thresholds of simple and compound targets provide detection thresholds for one or two quantized subbands, and show that supra-threshold distortions add linearly. However, these results do not answer the question of how to partition distortions over all subbands to produce visually acceptable images. A potential answer to this question is provided by the phenomenon known as contrast constancy.30 Contrast constancy refers to the fact that the perceived contrast of a supra-threshold target depends much less on its spatial frequency than is predicted by the CSF; in fact, perceived contrast becomes invariant with spatial frequency, resulting in a profile significantly flatter than that specified by the CSF. In the context of lossy image compression, contrast constancy suggests that the contrasts of the distortions could be allocated equally across the frequency spectrum; i.e., all subbands are quantized such that the induced distortions have the same contrast, regardless of each subband's spatial-frequency content. An alternative approach would be to proportion the contrasts based on contrast sensitivity. Under this assumption, the contrasts of the distortions should be allocated based on VTs; i.e., subbands should be quantized such that the contrasts of the distortions are proportioned based on masked VTs.

To assess whether contrast constancy holds for supra-threshold quantization distortions, we performed a contrast-matching experiment using targets consisting of wavelet subband quantization distortions presented against a uniform background.31 To quantify the effects of natural images on the perceived contrast of supra-threshold distortions, a second contrast-matching experiment was performed using wavelet subband quantization distortions presented against three natural-image maskers.31 In these experiments, observers adjusted the contrast of stimuli until their perceived contrasts were judged to be equal. Our results demonstrate that contrast constancy does indeed hold for supra-threshold wavelet subband quantization distortions. However, when applied to a compression application, we found that proportioning the contrasts of the distortions according to constant perceived-contrast ratios results in compressed images which exhibit lower visual quality than those obtained by proportioning the contrasts of the distortions using CSF-derived ratios. We therefore conclude that when the distortions are clearly visible, the perceived contrast of the distortions has little effect on the quality of the compressed image; rather, one must account for the impact that these distortions have on a subject's ability to visually process the image's edge structure. This conclusion is in accord with the psychophysical theory of global precedence, which has previously been investigated in the context of image recognition; specifically, global precedence holds that an image's edge structure is visually processed by combining information across all continuous spatial scales, beginning with the coarsest scale and ending with the finest available scale.32-34 Thus, eliminating or distorting image features (e.g., via quantization) at an intermediate spatial scale will result in two percepts: a blurred version of the object, and separate, erroneous high-frequency structure. This result has clear implications for compression, as discussed in the next section.
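
As a sketch of how these findings combine (our own formulation, using the masked VTs of Section 3.1 and the linear summation of Section 3.2): given a total distortion-contrast budget, the budget is split across subbands in proportion to their masked thresholds, so that every subband sits at the same multiple of its own threshold:

    def allocate_distortion_contrasts(masked_vts, total_contrast):
        # Split a total distortion-contrast budget across subbands in
        # proportion to their masked VTs. Because supra-threshold
        # distortion contrasts add approximately linearly (beta ~= 1),
        # the per-subband allocations simply sum to the budget.
        scale = total_contrast / sum(masked_vts)
        return [scale * vt for vt in masked_vts]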

4. SOME APPLICATIONS

This section describes three exemplary applications of the approach and results presented in Section 3: a quantization strategy, a visual quality metric, and a medical image compression application.

4.1. Distortion-Contrast Quantization

We have developed a quantization strategy, distortion-contrast quantization (DCQ), for use in wavelet-based image compression, based on the theory of global precedence described in the previous section and incorporating our measured psychophysical results.35,36 DCQ operates based on three results: (1) supra-threshold distortion contrasts add linearly, and subthreshold distortion contrasts can be added linearly; (2) supra-threshold distortion contrasts should be allocated to subbands based on masked VTs to provide maximum visual quality; and (3) no child wavelet coefficients should be retained if the parents have been quantized to zero (i.e., preserve global precedence). Within a single, unified framework, the proposed quantization strategy yields images which are competitive in visual quality with results from current visually lossless approaches at high bit rates, and which demonstrate improved visual quality over current visually lossy approaches at low bit rates. The unified framework requires no case statements or user-specified switch to indicate whether a low-rate or high-rate optimization should be performed. Furthermore, this strategy can be applied to both non-embedded and embedded quantization, the latter of which yields a scalable codestream which attempts to provide the best possible visual quality at all bit rates. We have applied DCQ in both a JPEG-2000-compliant coder35 and a general image coder.37 Example performance is given in Figure 4, which provides rate savings for equivalent visual quality for JPEG-2000 with DCQ over JPEG-2000. When included in a more efficient coder, DCQ can reliably provide images of visual quality equal to those provided by baseline JPEG-2000 at bit rates which are up to 30% lower.

Figure 4 Rate comparison for equivalent visual quality using JPEG-2000 with distortion-contrast quantization (J2K-DCQ) and standard JPEG-2000 (J2K). Left: cat image (from the JPEG-2000 test suite), with many high-frequency details. Right: rainriver image, with less high-frequency content. Numbers indicate rate savings for J2K-DCQ over J2K.
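
Rule (3) above, the global-precedence constraint, is simple to enforce on the wavelet coefficient tree. The sketch below (a minimal illustration under the usual dyadic parent-child mapping; the helper name is ours, not the paper's) zeroes the children of any parent coefficient that has been quantized to zero:

    import numpy as np

    def enforce_global_precedence(parent_band, child_band):
        # Each coefficient in a coarser subband has a 2x2 block of children
        # in the co-located finer subband of the same orientation. If the
        # parent was quantized to zero, discard its children.
        keep = (parent_band != 0).repeat(2, axis=0).repeat(2, axis=1)
        return child_band * keep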

4.2. A Visual Quality Metric

We have developed a visual quality metric incorporating our VT and summation models and global precedence.38,39 Using contrast signal-to-noise ratios (CSNRs, defined as the ratio of image-content contrast to distortion contrast in a given subband), the visual distortion between an original image and a distorted image is estimated as a sum of two distances: (1) the Euclidean distance between the ideal CSNR vector for all subbands, as predicted by our contrast-based quantization scheme, and the actual CSNR vector as measured from the distorted image; and (2) the distance along the path of minimum visual distortion in contrast space, which we determined via a psychophysical scaling experiment. The resulting metric produces results consistent with subjective tests for images compressed with DCQ, JPEG-2000, and baseline JPEG.
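
Schematically, the metric's first term reduces to a vector distance; the sketch below uses hypothetical helper names of ours (the full metric, including the scaling-experiment path term, is described in refs. 38 and 39):

    import numpy as np

    def subband_csnr(content_contrast, distortion_contrast):
        # CSNR for one subband: image-content contrast divided by
        # distortion contrast.
        return content_contrast / distortion_contrast

    def csnr_distance(ideal_csnrs, measured_csnrs):
        # Term (1) of the metric: Euclidean distance between the ideal
        # CSNR vector predicted by the contrast-based quantization scheme
        # and the CSNR vector measured from the distorted image.
        d = np.asarray(ideal_csnrs) - np.asarray(measured_csnrs)
        return float(np.sqrt((d ** 2).sum()))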

4.3. Visually Lossless Compression of Medical Images

Of primary concern in medical image compression is the determination of a visually lossless threshold, which specifies the maximum compression that can be applied before the resulting image appears distorted. Whereas previous approaches have reported visually lossless thresholds in terms of average compression ratios, we investigated the use of a masking model to predict the visually lossless threshold on a per-image basis. This approach provides a bit-rate-independent measure of visual losslessness and as such is more globally applicable than previous approaches. In cooperation with a radiologist, we performed a psychophysical experiment measuring the visibility of wavelet subband quantization distortions in digitized radiographs. Our measured results exhibited a marked variation from image to image, suggesting the need for image-specific parameters in compression. We have used these results to guide the development of an image-adaptive visually lossless compression algorithm, thereby avoiding the overly conservative or aggressive compression which can result from an average-compression-ratio-based approach.40

5. STRUCTURE-BASED VISUAL MASKING — EXPERIMENTS AND A MODEL

In spite of the successes in characterizing low-level vision and in applying this characterization to imaging applications, many applications are better approached with a higher-level view of images, considering color, structure, or even object content. When a user asks for "images of a volleyball court" in querying an image database, consideration of the image at the level of its content, or at least its structure, is required. Automated video indexing also requires a content- or structure-based representation of frames. No unified model of either the signal or the HVS processing exists at this higher-level view; such a model has the potential to substantially improve the automation of solutions to these problems. The HVS initially performs signal analysis in V1 and ultimately ends in cognition. Cognition is thought to occur gradually as the input is processed; no single portion of the visual system is singularly responsible for recognition or for transformation of the internal representation into an abstract concept.41 V1-based models, such as the standard gain control model, fail to varying degrees when applied to images because higher-order processing of image structure by humans produces cognitive effects which cannot be modeled by considering only responses to primitive components. While content-based analysis is clearly beyond what can be achieved using any signal-processing-based model, the continuum of visual processing from signal analysis to cognition suggests that signal-processing-based modeling of the HVS need not terminate abruptly after signal analysis but may instead be extended to include some structural processing. We wish to answer: can higher-level perception of structure be included in a vision model that is still essentially signal-based, and can these ideas be applied practically to common applications? Our preliminary results indicate that the answer is yes; we have demonstrated that a structure-based masking model can accurately predict threshold elevations in homogeneous image patches which represent textures, strong edges, or recognizable structures.

5.1. A Structural Masking Experiment

We performed an initial masking experiment to quantify the relative masking capabilities of textures, edges, and structures. In this discussion, a "texture" is defined as image content for which threshold elevation is reasonably well predicted by current masking models, and which roughly matches our intuitive idea of what the term texture represents. An "edge" is a single boundary between two homogeneous regions. A "structure" is neither edge nor texture but contains some recognizable organization. To investigate the effects of image content on visual masking, masked VTs were measured for wavelet subband quantization distortions in the presence of various natural-image patches containing either textures, structures, or edges. The targets consisted of distortions generated via uniform scalar quantization of the HL3 subband (obtained using the 9/7 filters), resulting in vertically oriented distortions centered at 4.6 cycles/degree. The masks consisted of 64 × 64 pixel sections cropped from 512 × 512 8-bit grayscale images. Fourteen 64 × 64 masks were used in the study: four visually categorized as containing primarily texture (fur, wood, newspaper, basket), five as containing primarily structure (baby, pumpkin, hand, cat, flower), and five as containing primarily edges (butterfly, sail, post, handle, leaf). The RMS contrast of each 64 × 64 image was adjusted to values of 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, and 0.64. Figure 5 depicts high-contrast versions of the 14 patches used in this study.

Figure 5 Stimuli in the structural masking experiment. "Texture" roughly matches our intuitive idea of what the term represents. An "edge" is a single boundary between two homogeneous regions. A "structure" is neither edge nor texture but contains some recognizable organization.
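
The distortion targets of this experiment can be sketched in a few lines. In the code below, PyWavelets' 'bior4.4' is assumed to stand in for the 9/7 biorthogonal pair, a midtread uniform quantizer is assumed, and which of PyWavelets' detail arrays corresponds to the HL orientation is a convention choice on our part (the lab's actual coder is linked in ref. 37):

    import numpy as np
    import pywt

    def hl3_distortion_target(image, step):
        # Quantize the HL3 subband of a 9/7 wavelet decomposition and
        # return the resulting distortion image (distorted minus original).
        # Assumes image dimensions are divisible by 8.
        coeffs = pywt.wavedec2(image.astype(float), 'bior4.4', level=3)
        cH, cV, cD = coeffs[1]                # level-3 detail subbands
        cV_q = step * np.round(cV / step)     # uniform scalar quantization
        coeffs[1] = (cH, cV_q, cD)            # cV assumed to be HL here
        distorted = pywt.waverec2(coeffs, 'bior4.4')
        return distorted - image              # zero-mean distortion target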

5.2. Results and a Model for Structure-Based Masking

The results of the experiment for two images of each type are shown in Figure 6 in the form of threshold-versus-contrast (TvC) curves, in which masked detection thresholds are plotted as a function of the contrast of the mask. Figure 6(a) depicts the results for patches fur and wood (textures); Figure 6(b) depicts the results for patches baby and pumpkin (structures); Figure 6(c) depicts the results for patches sail and butterfly (edges). In each graph, the horizontal axis denotes the RMS contrast of the mask, and the vertical axis denotes the RMS contrast of the target. Data points indicate VTs averaged over all three observers; error bars denote standard errors of the means.

Figure 6 Threshold vs. contrast curves for (a) texture, (b) structure, (c) edge patches.

As shown in Figure 6, for mask contrasts below 0.04, the thresholds for all three image types (textures, structures, edges) are roughly the same. These data suggest that at very low mask contrasts, in the regime in which the mask is nearly undetectable (and certainly visually unrecognizable), masking is perhaps due primarily to either noise masking or within-visual-area gain-control mechanisms (e.g., inhibition amongst V1 simple cells). However, as the contrasts of the masks increase, the TvC curves for the three image types demonstrate a marked divergence. On average, high-contrast textures elevate detection thresholds approximately 2 times more than high-contrast structures, and high-contrast structures elevate thresholds approximately 2.5 times more than high-contrast edges. This latter observation is further illustrated in Figure 7, which depicts relative threshold elevations for the 0.32-contrast condition plotted for each of the 14 images. The data points denote threshold elevations relative to the minimum observed threshold (which occurred for patch leaf); dashed lines denote average threshold elevations for each of the three image types. The patches depicted on the horizontal axis have been ordered to represent a general transition from simplistic edge to complex texture (from left to right); notice that the data demonstrate a corresponding left-to-right increase in threshold elevation.

The standard gain-control model, which has served as a cornerstone of our current understanding of the non-linear response properties of early visual neurons, has proved quite successful at predicting thresholds for detecting targets placed against relatively simplistic masks (e.g., sinusoidal gratings, Gabor patches, or white noise). However, gain-control models do not explicitly account for image content; rather, they employ a relatively oblivious inhibitory pool which imposes largely the same inhibition on the detecting mechanism (thereby raising detection thresholds) regardless of whether the mask is a texture, structure, or edge. Such a strategy is feasible for low-contrast masks; but, as demonstrated by our experimental results, high-contrast textures, structures, and edges impose significantly different threshold elevations. This finding motivates a revision of the standard model to account for these image-type-specific effects. Here, we employ an image-type-specific "inhibition modulation" term which serves to attenuate the inhibitory effect of the gain-control pool based on whether the mask is a texture, structure, or edge. Under this revised model, the response of a neuron r(u, f, θ) is given by

r(u, f, θ) = g_t × (w(f) x(u, f, θ))^p / [ b^q + g_m Σ_{(u, θ) ∈ S} h(u) (w(f) x(u, f, θ))^q ]   (8)

where the weights w(f) correspond to CSF values at frequencies f, h(u) implements spatial pooling of the weighted responses, and g_m is the inhibition modulation term, which varies based on image type. The inhibitory pool is computed only at a single frequency (scale).
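
A sketch of Equation (8) in the same style as the earlier gain-control sketch; here the 3 × 3 Gaussian h(u) is approximated with scipy's gaussian_filter, the g_m values and exponents are the fitted values reported with the Figure 8 results below, and everything else (array layout, sigma) is our own assumption:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Fitted inhibition modulation terms (see the Figure 8 fits below).
    G_M = {'texture': 1.0, 'structure': 0.3, 'edge': 0.1}

    def structure_masked_responses(x, w_f, image_type,
                                   g_t=1.0, b=0.03, p=2.4, q=2.45):
        # Eq. (8). x holds filter-bank outputs [uy, ux, theta] at a single
        # frequency (scale); w_f is the CSF weight at that frequency.
        # Use q = 2.75 for textures (the fitted value reported below).
        wx = np.abs(w_f * x)
        # Inhibitory pool: all orientations at this scale, spatially pooled
        # by a Gaussian h(u) (sigma chosen arbitrarily here).
        pool = gaussian_filter((wx ** q).sum(axis=2), sigma=1.0)
        return g_t * wx ** p / (b ** q + G_M[image_type] * pool[:, :, None])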


Figure 7 Relative threshold elevations for the 0.32-contrast condition plotted for each of the 14 patches, grouped from edge to structure to texture. The data points denote threshold elevations relative to the minimum observed threshold (which occurred for patch leaf); dashed lines denote average threshold elevations for each of the three image types. The patches depicted on the horizontal axis have been ordered to visually represent a transition from simplistic edge to complex texture.

Figure 8 Data fits for the gain control model with a structure-based inhibitory modulation term: (a) texture, (b) structure, (c) edge.

Figure 8 depicts average fits of the revised gain-control model (solid black lines) to the experimental results replotted from Figure 6. The initially linear responses of the neurons were modeled using a steerable pyramid decomposition with four orientations and four levels of decomposition, performed on 256 × 256 interpolated versions of the 64 × 64 images. The inhibitory pool consisted of the neural responses at all orientations within the same frequency band as the detecting neuron, and h(u) was a 3 × 3 Gaussian filter; pooling across frequency and orientation was performed using β_θ = 1.5 and β_f = 1.5, and final pooling across space to produce a single number was performed using β_u = 2. The parameters of the gain-control model chosen to provide the best fit to the experimental data were as follows: b = 0.03, p = 2.4, and q = 2.45 for edges and structures and q = 2.75 for textures. The inhibition modulation term g_m was 1.0 for textures, 0.3 for structures, and 0.1 for edges. As shown in Figure 8, the revised gain-control model provides a reasonable fit to the experimental data. However, the question remains as to whether the HVS actually employs such an inhibition-modulation-based scheme during masked detection.


Although the neurophysiological origin of the modulation mechanism remains an area of future investigation, the results of recent studies involving higher-level visual areas suggest that these areas provide shape selectivity.42,43 As such, the revised model is at least physically plausible.

6. CONCLUDING COMMENTS

This paper has presented an approach to characterizing the human visual system's response to natural images. Measurement of masked and unmasked CSFs for wavelet coefficient quantization distortions, an investigation of summation for these distortions in natural images, and a perceived-contrast and quality study have provided results which have been incorporated into compression and quality-measurement applications. A similar experimental methodology has been applied to visually lossless medical image compression. These results demonstrate that HVS characteristics can be exploited more aggressively than is commonly done to extract larger gains from image compression applications. While we have attempted to provide relevant references, the reader is strongly encouraged to refer to the authors' other publications listed in the References section below for a more complete review of background work. An additional example application to watermarking is also provided,44 as are several earlier works leading to the research discussed in this paper.45-48

ACKNOWLEDGEMENTS

The authors wish to thank Dr. Marcia Ramos Chang, Dr. Mark A. Masry, and Kenny Lim for their contributions to the research described herein.

REFERENCES

1. D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," J. of Physiology, Vol. 195, pp. 215-43, 1968.
2. F. W. Campbell and J. G. Robson, "Application of Fourier analysis to the visibility of gratings," J. of Physiology, Vol. 197, pp. 551-66, 1968.
3. C. Blakemore and F. W. Campbell, "On the existence of neurons in the human visual system selectively sensitive to the orientation and size of retinal images," J. of Physiology, Vol. 203, pp. 237-60, 1969.
4. D. J. Heeger, "Normalization of cell responses in cat striate cortex," Visual Neuroscience, Vol. 9, pp. 181-97, 1992.
5. J. M. Foley, "Human luminance pattern mechanisms: masking experiments require a new model," J. Opt. Soc. Am. A, Vol. 11, pp. 1710-19, 1994.
6. J. M. Foley and C. C. Chen, "Pattern detection in the presence of maskers that differ in spatial phase and temporal offset: threshold measurements and a model," Vision Research, Vol. 39, pp. 3855-72, 1999.
7. A. B. Watson and J. A. Solomon, "A model of visual contrast gain control and pattern masking," J. Opt. Soc. Am. A, Vol. 14, pp. 2379-90, 1997.
8. E. Peli, L. E. Arend, G. M. Young, and R. B. Goldstein, "Contrast sensitivity to patch stimuli: effects of spatial bandwidth and temporal presentation," Spatial Vision, Vol. 7, pp. 1-14, 1993.
9. C. R. Carlson, R. W. Cohen, and I. Gorog, "Visual processing of simple two-dimensional sine-wave luminance gratings," Vision Research, Vol. 17, pp. 351-8, 1977.
10. V. Manahilov and W. A. Simpson, "Energy model for contrast detection: spatial-frequency and orientation selectivity in grating summation," Vision Research, Vol. 41, pp. 1457-1560, 2001.
11. N. Graham and J. Nachmias, "Detection of grating patterns containing two spatial frequencies: a comparison of single-channel and multiple-channels models," Vision Research, Vol. 11, 1971.
12. M. B. Sachs, J. Nachmias, and J. G. Robson, "Spatial-frequency channels in human vision," J. Opt. Soc. Am., Vol. 61, pp. 1176-86, 1971.
13. A. B. Watson, "Summation of grating patches indicates many types of detector at one retinal location," Vision Research, Vol. 22, pp. 17-25, 1982.
14. A. B. Watson, G. Y. Tang, J. A. Solomon, and J. Villasenor, "Visibility of wavelet quantization noise," IEEE Trans. Image Processing, Vol. 6, pp. 1164-75, 1997.


15. Y. Bonneh and D. Sagi, "Effects of spatial configuration on contrast detection," Vision Research, Vol. 38, pp. 3541-53, 1998.
16. Y. Bonneh and D. Sagi, "Contrast integration across space," Vision Research, Vol. 39, pp. 2597-602, 1999.
17. D. Kersten, "Spatial summation in visual noise," Vision Research, Vol. 24, pp. 1977-90, 1984.
18. G. E. Legge and J. M. Foley, "Contrast masking in human vision," J. Opt. Soc. Am., Vol. 70, pp. 1458-70, 1980.
19. A. B. Watson, M. Taylor, and R. Borthwick, "Image quality and entropy masking," Proc. SPIE Human Vision, Visual Processing, and Digital Display VIII, pp. 2-12, 1997.
20. S. Daly, "Visible differences predictor: an algorithm for the assessment of image fidelity," in Digital Images and Human Vision, A. B. Watson, ed., pp. 179-206, 1993.
21. W. Zeng, S. Daly, and S. Lei, "An overview of the visual optimization tools in JPEG 2000," Signal Processing: Image Communication, Vol. 17, pp. 85-104, 2001.
22. S. Winkler, "Visual quality assessment using a contrast gain control model," IEEE Signal Processing Society Workshop on Multimedia Signal Processing, pp. 527-32, September 1999.
23. C. J. van den Branden Lambrecht, "A working spatio-temporal model of the human visual system for image representation and quality assessment applications," Proc. IEEE ICASSP, pp. 2291-4, May 1996.
24. P. C. Teo and D. J. Heeger, "Perceptual image distortion," Proc. SPIE, Vol. 2179, pp. 127-41, 1994.
25. B. Moulden, F. A. Kingdom, and L. F. Gatley, "The standard deviation of luminance as a metric for contrast in random-dot images," Perception, Vol. 19, pp. 79-101, 1990.
26. D. M. Chandler and S. S. Hemami, "Effects of natural images on the detectability of simple and compound wavelet subband quantization distortions," Journal of the Optical Society of America A, July 2003.
27. M. G. Ramos and S. S. Hemami, "Supra-threshold wavelet coefficient quantization in complex stimuli: psychophysical evaluation and analysis," Journal of the Optical Society of America A, October 2001.
28. R. J. Safranek and J. D. Johnston, "A perceptually tuned sub-band image coder with image dependent quantization and post-quantization data compression," Proc. IEEE ICASSP, Vol. 3, pp. 1945-8, 1989.
29. D. M. Chandler and S. S. Hemami, "Additivity models for supra-threshold distortion in quantized, wavelet-coded images," Proc. SPIE Human Vision and Electronic Imaging, San Jose, CA, January 2002.
30. M. A. Georgeson and G. D. Sullivan, "Contrast constancy: deblurring in human vision by spatial frequency channels," J. Physiology, Vol. 252, pp. 627-56, 1975.
31. D. M. Chandler and S. S. Hemami, "Supra-threshold image compression based on contrast allocation and global precedence," Proc. SPIE Human Vision and Electronic Imaging 2003, San Jose, CA, January 2003.
32. D. Navon, "Forest before trees: the precedence of global features in visual perception," Cognitive Psychology, pp. 353-83, 1977.
33. P. G. Schyns and A. Oliva, "Dr. Angry and Mr. Smile: when categorization flexibly modifies the perception of faces in rapid visual presentations," Cognition, Vol. 69, pp. 243-65, 1999.
34. A. Hayes, "Representation by images restricted in resolution and intensity range," Ph.D. dissertation, University of Western Australia, Perth, Australia, 1989.
35. D. M. Chandler and S. S. Hemami, "Dynamic contrast-based quantization for lossy wavelet image compression," IEEE Transactions on Image Processing, Vol. 14, No. 4, pp. 397-410, April 2005.
36. D. M. Chandler and S. S. Hemami, "Contrast-based quantization and rate control for wavelet-coded images," Proc. IEEE Intl. Conf. Image Processing, Rochester, NY, September 2002.
37. Visual Communications Lab Coder, http://foulard.ece.cornell.edu/index.php?loc=/vcl_coder/vcl_coder.html
38. D. M. Chandler, K. L. Lim, and S. S. Hemami, "Effects of spatial correlations and global precedence on the visual fidelity of distorted images," Proc. SPIE Human Vision and Electronic Imaging, San Jose, CA, January 2006.


39. D. M. Chandler, M. A. Masry, and S. S. Hemami, "Quantifying the visual quality of wavelet-compressed images based on local contrast, visual masking, and global precedence," Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2003.
40. D. M. Chandler, N. L. Dykes, and S. S. Hemami, "Visually lossless compression of digital radiographs based on contrast sensitivity and visual masking," Proc. SPIE Medical Imaging: Image Perception, Observer Performance, and Technology Assessment, San Diego, CA, February 2005.
41. G. Avidan, M. Harel, T. Hendler, D. Ben-Bashat, E. Zohary, and R. Malach, "Contrast sensitivity in human visual areas and its relationship to object recognition," Journal of Neurophysiology, Vol. 87, pp. 3102-16, 2001.
42. J. Hegdé and D. C. Van Essen, "Selectivity for complex shapes in primate visual area V2," Journal of Neuroscience, Vol. 20, RC61, 2000.
43. J. Hegdé and D. C. Van Essen, "Strategies of shape representation in macaque visual area V2," Visual Neuroscience, Vol. 20, pp. 313-28, 2003.
44. M. A. Masry, D. M. Chandler, and S. S. Hemami, "Digital watermarking using local contrast-based texture masking," Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2003.
45. S. S. Hemami and M. G. Ramos, "Quantifying visual distortion in low-rate wavelet-coded images," Proc. IEEE Intl. Conf. Image Processing, Thessaloniki, Greece, October 2001.
46. M. G. Ramos and S. S. Hemami, "Perceptual quantization for wavelet-based image coding," Proc. IEEE Intl. Conf. Image Processing, Vancouver, BC, September 2000.
47. S. S. Hemami and M. G. Ramos, "Wavelet coefficient quantization to produce equivalent visual distortion in complex stimuli," Proc. SPIE Human Vision and Electronic Imaging, San Jose, CA, January 2000.
48. S. S. Hemami, "Visual sensitivity considerations for subband coding," Proc. Thirty-first Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Vol. 1, pp. 652-6, November 1997.
