Performance Comparison of Masking Models Based on a New Psychovisual Test Method with Natural Scenery Stimuli

Marcus J. Nadenau, Member, IEEE, Julien Reichel, Member, IEEE, and Murat Kunt, Fellow, IEEE

Abstract— Various image processing applications exploit a model of the human visual system (HVS). One element of HVS models describes the masking effect, which is typically parameterized by psycho-visual experiments that employ superimposed sinusoidal stimuli. Those stimuli are oversimplified with respect to real images and can capture only very elementary masking effects. To overcome these limitations a new psychovisual test method is proposed. It is based on natural scenery stimuli and operates in the wavelet domain. The collected psycho-visual data is finally used to evaluate the performance of various masking models under conditions as found in real image processing applications like compression.

Keywords— HVS; Masking; Quality Assessment; Wavelet; Image Compression
I. History of Masking Models and Measurements
Various image processing applications exploit the limitations and imperfections of human visual perception. Typically, a model of the human visual system (HVS) predicts the visibility of specific image stimuli, so that, for example, a watermarking scheme can more efficiently hide its information in the image, a compression scheme can achieve a better visual quality, or a printer can deliver better rendered prints due to HVS-improved halftoning patterns. However, the main difficulty lies in a proper description of the human visual system, which is of amazingly high complexity. Hence, the idea to model the entire system by a direct transcription from physiological features to a numerical model was soon abandoned. Instead, one tries to parameterize psycho-physical effects, like limited spatial frequency resolution, light adaptation or masking. Generally speaking, masking refers to the reduced or totally inhibited visibility of a signal stimulus due to the simultaneous presence of a masker stimulus. However, there exist several types of masking [11]. Contrast masking occurs if a weak signal is hidden by a stronger signal that is present at the same location. One speaks of texture or activity masking if a signal is hidden by a local image surround that is so busy and irregular that the HVS gets confused and is no longer able to localize the initial signal. Both masking phenomena are of great relevance for image compression and image quality assessment tools. Before these masking effects can be exploited in final image processing applications, they need to be measured and modeled. The "classical" masking measurements employ two superimposed sinusoidal gratings. Their contrast, spatial frequency and orientation are varied to determine the conditions under which masking occurs. Based on these data a model is built that predicts the masking effect. Typically, it processes the image by several filter channels that differ in their band-pass frequency and spatial orientation.
The model might only consider the masking between stimuli of the same spatial frequency and orientation [12], called intra-channel masking, or also between different orientations and frequencies [9, 18, 19], called inter-channel masking. The variety of channels that is considered to compute the inhibitory term can also account for the interaction between the luminance and chrominance channels [4, 5, 13, 17]. The subjective decision task whether masking occurs or not is typically modeled as a Gaussian process [10] to explain the varying threshold level of detection. There are two drawbacks of the "classical" method that is based on sinusoidal patterns. First, the sinusoidal masker grating represents a tremendous simplification with respect to natural scenery images; effects like texture masking thus cannot be measured. Even if more recent measurements employ noise patterns, they still represent an oversimplification with respect to natural images. Second, a sinusoidal stimulus is in general much simpler than, for example, a real compression artifact. Hence, the finally derived model might make a good prediction for sinusoidal patterns, but not for real images, and is thus of limited use for image processing applications. A first step towards masking measurements with realistic distortions and natural scenery is presented in [3, 7], where non-deterministic noise patterns were used. Also Watson [21] did some preliminary experiments with more complex backgrounds. However, the masking measurements are still not as close to real conditions as they could be. Therefore, the authors propose a new scheme of masking measurements. Natural scenery images are decomposed by a wavelet decomposition and quantized within one subband of this decomposition to generate the test stimuli. This way the masker signal is a real image and the test signal a quantization error signal, as it would occur in image compression applications. This new technique allows measuring the contrast masking and texture masking effects. Finally, a database with the results of the masking experiments is provided. The Modelfest group [8] established a similar database, but for sinusoidal test stimuli. Once a common database is available, it can be used for the performance evaluation of various masking models. Such comparative work is urgently needed, because typically only new measurements and models are presented, without a clear performance comparison to existing ones.
However, the only comparative work known to the authors is based on sinusoidal test patterns [20] and is thus of limited relevance for real image processing applications. This article presents a comparison of five commonly used masking models based on psychovisual measurements with natural stimuli. The paper is structured as follows: Section II describes how the new test stimuli are generated. In Section III the psychovisual testing procedure is explained. Section IV explains the process of masking model training/optimization using the data from the psycho-visual experiments. Section V defines the masking models that are compared in their performance. Finally, in Section VI the performance of these models is analyzed and discussed, before the final conclusions are drawn.

II. Test stimuli

A. Generation of a single test stimulus

The ISO 300 image set [16] is used for the generation of the test stimuli. This image set consists of natural scenes like a market place, fruits, houses and so on. Each image of this set is converted into a pure luminance representation and then several sub-regions are selected. Finally, a set of NI = 30 different
mono-chromatic test images is obtained. In the following these NI images are referred to as originals. Each of the originals is decomposed using the Daubechies 9/7 filters [1]. This results in 5 levels of decomposition and 3 orientations at each level, namely horizontal, vertical and oblique. Including the low-frequency representation in the so-called LL-band, 16 subbands are computed. A single test stimulus consists of a slightly modified test image. The modification/distortion is introduced by quantizing a number of wavelet coefficients c_ij. All quantized wavelet coefficients ĉ_ij together form the set Φ. The set contains the wavelet coefficients of an 8 × 8 block within the subband with horizontal details of decomposition level 3. This specific subband was chosen because it contains the spatial frequencies to which the human observer is most sensitive under the given viewing conditions. However, the masking measurements and the masking model performance analysis based on these measurements are valid for any spatial frequency, because the spatial-frequency-dependent sensitivity is compensated separately via the contrast sensitivity function (CSF). The quantized coefficients ĉ_ij are described by:
ĉ_ij = sign(c_ij) · ⌊|c_ij| / ∆Q_c⌋ · ∆Q_c,   i,j ∈ Φ,   (1)

where ⌊·⌋ is the truncation operator and ∆Q_c the quantization step-size. The effect that small luminance differences in bright regions are less visible than those in dark regions is sometimes referred to as luminance masking. To simplify the masking models that are analyzed later on, the effect of luminance masking is already compensated for at this stage. This is achieved by transforming the luminance (L) data into lightness/brightness (B) data before quantization:

B = L^(1/ν),   (2)

where ν is set to 3. Thus, the linear quantization of c_ij results in a perceptually uniform brightness/lightness quantization.

B. Generation of a test set

Not only a single stimulus, but an entire series of NR = 12 stimuli is generated for each original. NR quantization step-sizes ∆Q_c are chosen manually, so that the range covers invisible to clearly visible artifacts. A change of the parameter ∆Q_c does not result in a scaled version of a constant quantization error shape, but changes the entire constellation of the distortion. A single test stimulus is therefore referred to as a distortion constellation (DICO). Fig. 1 illustrates such a series with NR DICO's.

Fig. 1. Examples for NR distortion constellations that form a test set. Here, the distortions resemble patches in the center region.

While the first images are barely quantized, the last images reveal artifacts that are easy to see. It is important to remark that the DICO's are not ordered with regard to the visibility of the distortion. They just cover the desired distortion range. For example, it is absolutely possible that DICO number 5 turns out to be more visible than DICO number 7. Beside the NR stimuli an additional one is generated that is referred to as the strongest possible distortion constellation (SP-DICO). It is generated by setting all ĉ_ij to zero.

III. Psychovisual Testing Procedure

A. Presentation Methodology

The actual psycho-visual testing procedure contains some innovative aspects that are of utmost importance to obtain reliable measurements although the pattern complexity is extremely high. Basically, the observer always sees three images (see Fig. 2). On the top of the screen, the original, known to be the original. On the bottom of the screen, one of the NR DICO's and the original. The observer has to answer on which side the distorted image is displayed. If the difference is not visible, the observer presses the "same"-button.

Fig. 2. The observers see this screen during their test. The upper image shows the original. In the bottom row, one image is the original, the other is one of the NR compressed images. The observer has to judge whether the compressed image is to the left or right.
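The stimulus generation of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' tool: PyWavelets' 'bior4.4' (CDF 9/7) filters stand in for the Daubechies 9/7 filters, and the image content, block position and step-size are arbitrary placeholders.

```python
import numpy as np
import pywt

def to_lightness(L, nu=3.0):
    # Eq. (2): compensate luminance masking before quantization, B = L^(1/nu)
    return np.power(L, 1.0 / nu)

def quantize_block(coeffs, step):
    # Eq. (1): truncation quantization of wavelet coefficients,
    # c_hat = sign(c) * floor(|c| / step) * step
    return np.sign(coeffs) * np.floor(np.abs(coeffs) / step) * step

# --- illustrative usage on a hypothetical 256x256 luminance image ---
rng = np.random.default_rng(0)
L = rng.uniform(1.0, 255.0, (256, 256))
B = to_lightness(L)

# 5-level decomposition; coeffs = [cA5, (H5,V5,D5), ..., (H1,V1,D1)],
# so the level-3 detail tuple sits at index 3
coeffs = pywt.wavedec2(B, 'bior4.4', level=5)
cH3 = coeffs[3][0]                 # horizontal-detail subband, level 3

# quantize an 8x8 block of that subband (the set Phi); position is arbitrary
i0, j0 = 4, 4
cH3[i0:i0 + 8, j0:j0 + 8] = quantize_block(cH3[i0:i0 + 8, j0:j0 + 8], step=2.0)

distorted = pywt.waverec2(coeffs, 'bior4.4')   # the DICO in the lightness domain
```

Varying `step` over a range of values would produce the series of NR DICO's described above.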
The testing tool also contains the important feature of flickering. It causes a flicker of only the top image, while the bottom images stay static. The flicker alternates between the original and a "second image" 4 times in 2 seconds and stops again at the original. The second image depends on the chosen test mode, as explained later on. Actually, the flickering is the reason why it becomes possible to do masking measurements with such complex stimuli. An observer who is trained to a specific image is able to identify a distortion even in very complex images immediately. However, it would take a long time of simply watching the images to reach this level of training for each image set and test person. The flicker enormously accelerates this learning process, because the distortion appears as a "moving object", which facilitates its detection due to the high-level vision effect of grouping. There exist three test modes, named A, B and C, that are summarized in Table I. They differ with regard to the second image used for flickering, the choice of when the observer is allowed to flicker, and the time constraints. In mode A the second flicker image is the one corresponding to the strongest possible distortion (SP-DICO). This shall only teach the observer the location, but not the exact shape, of the distortion. The observers can only flicker during the initial period, that is, before the actual test starts. Moreover, a maximum time constraint per decision is given. In mode B the second image used for flickering corresponds to the one statically shown in the bottom line. The observers can flicker whenever they want and do not have any time constraints for their decisions. Mode C is identical to mode A. However, it delivers different results, because it is done after the observer learned about the distortion in modes A and B.
TABLE I
Specification of the test modes

Mode | 2nd flicker image | Time/decision
A    | SP-DICO           | 6 s
B    | as bottom image   | ∞
C    | SP-DICO           | 6 s
B. Observer Data Discussion

The entire set of stimuli comprises NI × NR = 360 distortion constellations. To compute the probability of distortion detection pd in a reliable manner, it is necessary to test each stimulus several times. The probability pd is given by:

pd = (nc + 0.5·ns) / (ns + nc + nf),   (3)
where nc, nf, and ns are the number of correct, false, and "same"-judgements, respectively. A "same"-judgement means that the observer did not see any difference and would have to guess. In this situation a series of guessed responses would result in a detection probability of 50%. However, the tests have to be short, otherwise concentration decreases. Therefore, the observers can directly answer that they would have to guess and press the "same"-button, which is counted as 50% detection probability. Since long test sessions lead to a loss of concentration, an optimal tradeoff between test length and accuracy has to be found. The outcome is that in test modes A and C each DICO is tested 5 times and in test mode B only 3 times. Hence, the test data for a single observer can be limited to 4680 decisions. The test should also average over the varying detection capabilities of different observers. Therefore, for each stimulus the detection probability was evaluated by 4 different observers. In all cases the primary author was one of the observers. The second observer was an expert viewer familiar with image quality assessment. The two other observers were non-experts. In the following these observers are referred to as: Expert-1, Expert-2, Intermediate and Beginner. The above described measurements deliver a separate probability curve for each of the test sets, observers and test modes. Fig. 3 shows the plot of those curves for test mode B and three different observers. The x-axis shows the DICO index. The equal spacing on the x-axis should not be interpreted as equal distortion difference; it is just an index number. The corresponding y-value represents the measured detection probability. It can be observed that in the extremes of visible (DICO 9-12)
or non-visible (DICO 1-4) distortion all observers agree. The inter-observer variation is biggest in the transition zone. Sometimes only a single observer disagrees with all others. Therefore, the combined measured detection probability pM is computed as the linear average of the observer-specific detection probabilities pd, where a clear outlier (if existing) is suppressed before averaging. The measured detection probability pM is plotted as a thick black graph with additional error-bars that mark ± one standard deviation. Analyzing the data of test modes A to C reveals that mode B delivers the most coherent results. This could be expected, since its conditions (unrestricted flickering and no time constraints) are optimal for the detection process. The mode A data is the least stable of all tests. The observer needs this test to learn about the images and distortions. Its data is not recommended for any masking model parameterization. Mode C data is very coherent and reveals a reduced detection sensitivity with respect to mode B. However, it represents probably the most realistic setting for the task of spontaneous image quality judgement. Therefore it is recommended to use mode A just to train the observer, mode B for the development and first calibration of the masking models, and mode C data to set all parameters to values optimal for the model incorporation into image processing applications.

IV. Using the Subjective Data for Masking Model Evaluations

A. Model Training/Optimization

Now that the subjective data are available, they can be used to analyze various masking models. First, the tested masking model is trained on a subset of the subjective data to compute the optimal model parameters. Then, these optimal parameters are used to apply the masking model to a new data set that was not part of the training set. Thus, the prediction quality and parameter stability can be evaluated. Fig. 4 summarizes the procedure of parameter optimization.
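The observer-data computations of Section III, Eq. (3) and the outlier-suppressed average pM, can be sketched as below. The paper does not specify its outlier criterion, so the median-deviation rule used here is purely an assumption for illustration.

```python
import numpy as np

def detection_probability(n_correct, n_false, n_same):
    # Eq. (3): "same"-judgements count as guessing, i.e. 50 % detection
    return (n_correct + 0.5 * n_same) / (n_same + n_correct + n_false)

def combined_probability(p_obs, z=2.0):
    # Average the observer-specific p_d values, suppressing at most one
    # clear outlier. The rule below (deviation from the median versus the
    # spread of the remaining observers) is an assumed criterion; the
    # paper only states that a clear outlier is suppressed.
    p = np.asarray(p_obs, dtype=float)
    dev = np.abs(p - np.median(p))
    k = int(np.argmax(dev))
    rest = np.delete(p, k)
    if dev[k] > z * (np.std(rest) + 1e-9):
        p = rest
    return float(np.mean(p))
```

For example, 3 correct, 1 false and 1 "same" answer give a detection probability of (3 + 0.5)/5 = 0.7.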
Basically, all stored wavelet coefficients that characterize a specific DICO are used to compute the predicted distortion detection probability pC . This probability is compared with the one from the subjective experiments pM . In an iterative manner the model-parameters are modified to minimize the difference between pC and pM . Each DICO is described by its corresponding wavelet coefficients. The ones that can be used by the masking models are marked in light gray. The actually quantized coefficients are marked in dark gray. Depending on the specific masking model all marked coefficients or a subset of them is used to map each DICO into a single distortion value D. In the next step the distortion D is transformed via a Weibull-function [2] into the probability pC . These two steps are separated, because in the final image processing application the distortion measure D is needed, while the experiments deliver a probability. The Weibull-function is approximately an accumulated Gaussian-distribution that is typically used in psycho-physical experiments to model the varying decision threshold of human observers. Applied to the distortion D, pC is given by:
Fig. 3. Measured detection probability over the distortion constellation index for three observers (Expert-1, Expert-2, Intermediate) and their average in mode B. The error-bars indicate ± one standard deviation.
p_C = 1 − 0.5 · e^(−(D/D0)^ζ),   ζ = −log(log 2) / log(κ),   (4)
where the parameterization is chosen so that D = 1 is always mapped to a detection probability of 75 %. The parameter D0 normalizes the distortion values, and parameter κ determines the steepness of the Weibull-function. Both parameters, D0 and κ, are also varied during the optimization/training procedure.
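The Weibull mapping of Eq. (4), as reconstructed above, can be written as a short function. The exact coupling of D0 and κ that guarantees p_C(1) = 0.75 was obscured by the source formatting, so the normalization here should be taken as a sketch; the κ value used in the test is the mode-B average reported later in Table III.

```python
import math

def weibull_probability(D, D0, kappa):
    # Eq. (4): p_C = 1 - 0.5 * exp(-(D/D0)**zeta),
    # with steepness zeta = -log(log 2) / log(kappa)
    zeta = -math.log(math.log(2.0)) / math.log(kappa)
    return 1.0 - 0.5 * math.exp(-(D / D0) ** zeta)
```

For D → 0 the function approaches guessing level (0.5), and for large D it saturates at certain detection (1.0), as a psychometric function should.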
Fig. 4. The masking model is trained to optimize its predictions for a subset of subjective data (e.g. all DICO's of one test set). First, each DICO is mapped by the model to a single distortion value D. Then, D is mapped to a probability pC using a Weibull-function. In an iterative manner the model parameters (including the Weibull parameters) are varied to compute the optimal settings.

The computed detection probability pC is compared to the measured detection probability pM by a fitting error (FE) criterion that is defined as:

FE = (1/N) · Σ_{i=1...N} (pM_i − pC_i)² / σ_i²,   (5)

where σ_i is the observer's data variance and N the number of considered distortion constellations. It can be proven that this error criterion is optimal to account for data of different reliability [22]. Finally, the FE-value is fed back to modify the parameters of the masking model. This training/optimization process then leads to the optimal parameters of the masking model that are recommended for the chosen training set of images. Even if the training method is straightforward, a few aspects have to be treated carefully. For instance, a simple masking model with only two parameters typically shows good convergence. Due to its larger number of free parameters (e.g. 10), a complex model has many local minima that may be quite different from the global one. Therefore it is necessary to start the optimization process from many different starting points (initial parameters).

B. Graphical Representation

The optimization result yields an insight into various aspects of the model behavior. First, it offers the optimal parameters for a masking model in the framework of a wavelet-based coder. Second, the remaining fitting error gives a first idea of the model performance that can easily be compared to the performance of other models. Moreover, it indicates which test sets posed difficulties for the model. Thus, possible modifications of the model can be better designed. Fig. 5 illustrates the optimization result in a graphical manner. The masking model was applied to all NI test sets and the parameters optimized to reduce the global fitting error (GFE), which is the FE given in Eq. 5, for N = NI · NR. The result is shown on the left. The continuous curves are the detection probabilities predicted by the model; the black dots are the subjectively measured ones. Due to the limited number of decision tasks per stimulus the measured detection probabilities are at discrete positions. The main error contributions come from the samples marked by an ellipse. They have a very small variance, since all observers agreed on them, and are therefore statistically significant. Nevertheless, the model made a wrong prediction in these cases.

Fig. 5. Global Fitting Error (GFE) and Separate-Prediction-Performance (SPP). The black stars and light gray crosses represent the measured (pM) and computed (pC) detection probabilities, respectively. The SPP panels correspond to the test sets Musicians 250-270, Cafe 177-342, Fruit 213-113, Cafe 237-208, Cafe 110-174 and Cafe 316-121.

On the right of Fig. 5 the Separate-Prediction-Performance plot (SPP) of 6 test sets is shown. It allows identifying for which test sets the model has prediction problems. Columns I and II demonstrate the over- and under-estimation of the distortion visibility by the model, respectively. Column III shows an example of nearly perfect prediction.

V. Definition of Compared Masking Models

A. Framework Common to All Models

In this section five different masking models will be defined (see Table II), which will be analyzed in Section VI regarding their prediction performance and stability. All presented masking models are applied to the wavelet coefficients, which already contain a compensation for the CSF and luminance masking. These compensated coefficients are denoted by c_ij. This way the actual masking model can be analyzed separately from the CSF-stage. Additionally, the viewing conditions for the psycho-visual test were chosen so that the spatial frequencies affected by quantization correspond to the frequencies for which human observers are most sensitive. Furthermore, all models provide a distortion measure D that is mapped by Eq. 4 into a distortion detection probability pC.

B. MSE - Model
The mean-square-error (MSE) is a mathematical difference measure and not a masking model. Nevertheless, most image processing applications use the MSE as distortion measure. Typically, it is applied directly in the "pixel"-domain, but in wavelet-codecs it is often computed directly from the wavelet coefficients. Here, the MSE serves as a reference to evaluate how much can be gained by more sophisticated models. The MSE
distortion is defined as:

D = (1/64) · Σ_{i,j∈Φ} (c_ij − ĉ_ij)²,   (6)

where c_ij and ĉ_ij are the original and quantized wavelet coefficients, respectively. Their position in the quantized subband is given by the indices i and j, where Φ describes the set of 8 × 8 coefficients that were quantized.

C. Intra-Channel Model (IaC)

Within the class of intra-channel contrast masking models there exists a variety of minor model variations. A prior analysis [14] revealed that these modifications lead only to small performance changes. The general structure of such an intra-channel masking model consists of 4 stages: the local contrast difference ∆C, the threshold elevation function T^e, their combination to the masked contrast C^M and finally the pooling function to compute the probability of distortion detection pC. Here, the local contrast is described by the absolute value of the wavelet coefficient |c_ij|; thus C_ij is equivalent to |c_ij|. The contrast difference due to quantization is consequently defined as:

∆C_ij = |c_ij − ĉ_ij| = ∆c_ij.   (7)

The non-linearity of the threshold elevation function T^e is approximated by two piece-wise linear functions (in the log-log domain):

T^e_ij = max(1, |c_ij|^ε),   (8)

where ε is the slope-parameter. The original coefficient is assumed to represent the masker signal. Fig. 6 illustrates the shape of this curve.

Fig. 6. The threshold elevation function T^e is characterized by the slope-parameter ε.

At each position in the set of quantized coefficients Φ, the masked contrast C^M_ij is computed from the contrast difference and threshold elevation function:

C^M_ij = ∆C_ij / T^e_ij.   (9)

The final distortion value D for the entire distortion signal is obtained by pooling with the Minkowski-norm with parameter β:

D = ( Σ_{i,j∈Φ} (C^M_ij)^β )^(1/β).   (10)

D. Watson-Solomon Inter-Channel Model (WSIrC)

The inter-channel masking model by Watson and Solomon [19] does not start its computations with the contrast difference. Rather, it determines at each position i,j a response value for the original (r^o_ij) and the compressed image (r^c_ij). The difference is computed at the end and summed by a Minkowski-norm:

D = ( Σ_{i,j∈Φ} |r^o_ij − r^c_ij|^β )^(1/β).   (11)

The response itself is computed as the quotient of the excitatory and inhibitory term:

r_ij = (c_ij)^p / (z^q + h_ij),   (12)

where p and q are the excitatory and inhibitory exponents, z a constant and h_ij the pooling result at position i,j for channel u. It is pooled over the three orientations and the spatial support. Gaussian pooling kernels are used to characterize the pooling function, specified by their variance. For the concrete case this means that in each of the three subbands a region centered around the position i,j is weighted by a Gaussian kernel g_Γ and summed. Then, each of these sums is weighted by a factor g_Ψ. Hence, the final pooling result h is given by:

h_ij = Σ_{u=1,2,3} g_Ψ(u) · Σ_{x,y∈Γ} g_Γ(x,y) · (c_{x,y,u})^q,   (13)

where u is the channel number that runs over the three orientations and Γ describes the set of locally neighboring wavelet coefficients. Here, this neighborhood is limited to a region of NΓ = 7·7 coefficients centered around position i,j. The weighting factors are normalized by:

Σ_{u=1,2,3} g_Ψ(u) · Σ_{x,y∈Γ} g_Γ(x,y) = 1.   (14)

The values for g_Ψ and g_Γ are taken from a Gaussian distribution g that is characterized by its standard deviation k_Ψ and k_Γ, respectively:

g(v) = (1/√(2π)) · e^(−(v/k)²).   (15)

In the case of orientation pooling (k = k_Ψ), v takes values of 0° (in the excitatory subband), 45° (in the neighboring subband) and 90° (in the subband of opposite orientation). In the case of spatial pooling (k = k_Γ), v describes the Euclidean distance in pixels to the center location i,j. The WSIrC model has the strong advantage that it takes all three subbands (orientations) into account for the prediction of the masking effect. However, this causes a high complexity in terms of data access and arithmetic operations.

E. Intra-Channel Model with Local Activity (IaCLA)

In the framework of the JPEG2000 development two competitive techniques were merged into the so-called extended masking [6]. Basically, it considers the point-wise contrast masking as captured by the IaC-model, but additionally applies an inhibitory term that takes the neighborhood activity into account. The neighborhood Γ is chosen to be causal. This is important for coding purposes, since only already encoded coefficients are known to the decoder. The distortion function is given by:

D = ( Σ_{i,j∈Φ} ( ∆c_ij / (T^e_ij · (1 + w_Γ)) )^β )^(1/β),   (16)

where w_Γ is the newly introduced correction term for the influence of an active or homogeneous neighborhood. It is the normalized sum of the neighboring coefficients taken to the power of ϑ:

w_Γ = (1 / ((k_L)^ϑ · N_Γ)) · Σ_Γ |c_ij|^ϑ.   (17)

The parameter k_L determines the dynamic range of w_Γ, while N_Γ specifies the number of coefficients in the neighborhood Γ.
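As an illustration of Eqs. (7)-(10) and (16)-(17), a numpy sketch of the IaC and IaCLA distortion measures is given below. The parameter values are placeholders rather than the fitted ones, and the activity neighborhood is a simple square window instead of the causal 84-coefficient neighborhood used for coding.

```python
import numpy as np

def iac_distortion(c, c_hat, eps=0.7, beta=2.0):
    # Eqs. (7)-(10): threshold elevation, masked contrast, Minkowski pooling
    dC = np.abs(c - c_hat)                    # Eq. (7)
    Te = np.maximum(1.0, np.abs(c) ** eps)    # Eq. (8)
    cM = dC / Te                              # Eq. (9)
    return float(np.sum(cM ** beta) ** (1.0 / beta))  # Eq. (10)

def iacla_distortion(c, c_hat, eps=0.7, beta=2.0, kL=1e-4, theta=0.2, win=7):
    # Eqs. (16)-(17): the IaC term is additionally divided by (1 + w),
    # where w measures the activity of the local neighborhood
    dC = np.abs(c - c_hat)
    Te = np.maximum(1.0, np.abs(c) ** eps)
    pad = win // 2
    cp = np.pad(np.abs(c) ** theta, pad, mode='edge')
    w = np.empty_like(c)
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            w[i, j] = cp[i:i + win, j:j + win].sum()
    w /= (kL ** theta) * (win * win)          # Eq. (17)
    cM = dC / (Te * (1.0 + w))                # Eq. (16), inner term
    return float(np.sum(cM ** beta) ** (1.0 / beta))
```

Since the activity term w is non-negative, the IaCLA distortion for a given error pattern is never larger than the IaC distortion, which reflects the additional inhibition by a busy surround.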
F. Model Parameterization Overview

To facilitate the reference to a specific model, including its parameter set, Table II gives an overview. Basically it comprises 4 models (MSE, IaC, WSIrC, IaCLA-1) that are entirely different in their structure and a fifth (IaCLA-2) that is distinguished by another parameter setting.

TABLE II
Analyzed masking model configurations

Name    | Description          | Fixed parameters          | Free parameters
MSE     | Mean-square-error    | β=2                       | D0, κ
IaC     | Intra-channel        | β=2                       | D0, κ, ε
WSIrC   | Inter-channel        | β=4, q=2, Γ=49            | D0, κ, p, kΨ, kΓ, z
IaCLA-1 | IaC + local activity | β=2, Γ=84, kL=1e-4, ϑ=0.2 | D0, κ, ε
IaCLA-2 | Pure local activity  | β=20, Γ=84, kL=3e-6, ε=0  | D0, κ, ϑ
Fig. 7. Fitting error (FE) and prediction error (PE) for the models applied to mode B data.
The MSE-model is entirely fixed; the only free parameters concern the Weibull-function. The IaC-model has just the parameter ε that describes the "classical" non-linearity. With respect to the number of free parameters, the WSIrC-model is the most complex, but its real complexity is rather due to the increased number of data accesses (three orientations and local surround). The IaCLA model is basically the model that is used within the JPEG2000 framework. In the constellation IaCLA-1 its parameters ϑ and kL are fixed to the same values as in JPEG2000. By subjective evaluations it was shown that it significantly improves the visual compression quality [6]. A second parameterization, named IaCLA-2, is analyzed. It does not incorporate any ε-non-linearity, and the parameters β and kL are fixed to 20 and 3e-6, respectively.

VI. Performance Discussion
A. Prediction Performance (Mode B)

In the first step, the data from mode B are used to compare the prediction performance of the five masking models. Such a comparison involves both the average prediction error and the stability of the optimal model parameters. To gather the necessary statistical data, the leave-one-out method is used: the models are trained on N_I − 1 test sets, and the resulting optimal model parameters are used to predict the remaining single test set. This procedure is repeated N_I times, each time leaving out a different test set. Hence, data for N_I different training and prediction situations are collected, from which various statistics can be computed. The most interesting results are the fitting error (FE) and the prediction error (PE). The former indicates how well the model matches the training data; the latter represents the error of the model's prediction for the remaining test set that was not trained. It is computed exactly like the FE, given in Eq. 5, but accumulated only over the N_L images that were not used in the training:

\mathrm{PE} = \frac{1}{N_L} \sum_{i=1}^{N_L} \left( \frac{p_i^M - p_i^C}{\sigma_i^2} \right)^2 \qquad (18)

Both fitting and prediction error are presented in Fig. 7. The comparison reveals that the MSE-model fits the subjective data worst; it has the biggest FE. The IaCLA-2 model delivers the best FE-value; its FE is only half as big as the one of the MSE reference model. The IaCLA-1 model is closest to the IaCLA-2 model, but the difference is still significant. The IaC and WSIrC models have similar FE-values. With regard to the more relevant PE-value, the MSE again performs worst and the IaCLA-2 best. However, it is interesting to see that the WSIrC-model has a PE-value almost as big as that of the MSE. This already indicates at this stage that the optimal WSIrC-model parameters found in the training are not very stable or general: as soon as they are used for another, untrained prediction, they deliver a relatively bad result. Besides the prediction performance, it is important that the parameters stay stable. Table III summarizes the statistical data of the fitting/prediction experiment in mode B. The first row in each sub-table, marked with µ, displays the average of the N_I optimal parameters that were found. The row below (σ) indicates their standard deviation, expressed as a percentage of the average values. The last row summarizes the inter-parameter dependence. Based on the co-variance matrix M_C, the values k_{a,b} are computed:
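The leave-one-out procedure and the error measure of Eq. 18 can be sketched as follows. Everything here is an illustrative assumption rather than the paper's actual pipeline: the synthetic data, the Weibull-type psychometric function mapping a pooled distortion measure to a detection probability, and the coarse grid search standing in for the (unspecified) optimizer.

```python
import numpy as np

# Synthetic stand-ins: p_obs is the observed detection probability per test
# set, sigma its inter-observer deviation, and d a pooled distortion measure.
# The model maps d to a probability with a Weibull-type psychometric function
# whose free parameters D0 (threshold) and kappa (steepness) are the two
# quantities reported in Table III for the MSE-model.

def weibull(d, D0, kappa):
    return 1.0 - np.exp(-(d / D0) ** kappa)

def error(D0, kappa, d, p_obs, sigma):
    # Variance-weighted squared deviation, in the spirit of Eq. 18
    r = (weibull(d, D0, kappa) - p_obs) / sigma ** 2
    return np.mean(r ** 2)

def fit(d, p_obs, sigma):
    # Coarse grid search standing in for a proper minimizer
    best = None
    for D0 in np.linspace(0.5, 2.0, 31):
        for kappa in np.linspace(1.0, 3.0, 21):
            e = error(D0, kappa, d, p_obs, sigma)
            if best is None or e < best[0]:
                best = (e, D0, kappa)
    return best  # (FE, optimal D0, optimal kappa)

# Leave-one-out: train on N-1 test sets, predict the left-out one, N times
rng = np.random.default_rng(0)
d = np.linspace(0.2, 2.0, 10)
sigma = np.full_like(d, 0.2)
p_obs = np.clip(weibull(d, 1.0, 2.0) + rng.normal(0, 0.02, d.size), 0.0, 1.0)

fe, pe = [], []
for i in range(len(d)):
    train = np.arange(len(d)) != i
    e_fit, D0_opt, k_opt = fit(d[train], p_obs[train], sigma[train])
    fe.append(e_fit)  # fitting error on the N-1 training sets
    pe.append(error(D0_opt, k_opt, d[i:i+1], p_obs[i:i+1], sigma[i:i+1]))

print(f"mean FE = {np.mean(fe):.3f}, mean PE = {np.mean(pe):.3f}")
```

Collecting FE and PE over all N leave-one-out runs yields exactly the kind of per-model statistics (mean error plus parameter spread) that Table III summarizes.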
k_{a,b} = \frac{|M_C(a,b)|}{\sqrt{M_C(a,a)\,M_C(b,b)}} \qquad (19)
where a, b represent any combination of two model parameters. In the table, the parameter in parentheses is the one that has the strongest influence on the parameter of the corresponding column. For example, the prediction error PE depends most on the value of κ; it is correlated with it by 0.93. A closer look at the statistics reveals that the parameters of the MSE-model stay very stable; a standard deviation of ±5% is small. The parameter with the biggest influence on the error is κ, which indicates that the steepness of the Weibull-function, rather than the exact value of D0, is what matters. Finally, a very large variation of the PE (±117%) is observed. This is mainly due to a few extreme values of very good and very bad predictions. However, this is common to all models and will be analyzed in more detail later on. The IaC-model parameters can also be considered stable, even if the parameter D0 varies slightly more. The parameter correlation indicates a strong link between the slope parameter ε and D0, which could be expected, since the slope parameter has a direct impact on the final magnitude of the summed coefficients and must be compensated by a changed value of D0. The value of ε corresponds surprisingly well with the one observed for the same model in "classical" masking experiments, where typical values between 0.5 and 0.8 are found. Other experiments indicated an elevated value for rather unknown or difficult patterns [15].
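The inter-parameter dependence of Eq. 19 is the magnitude of the correlation coefficient derived from the covariance matrix of the N_I optimal parameter vectors. A minimal sketch, with toy data standing in for the leave-one-out results:

```python
import numpy as np

def parameter_correlations(param_samples):
    """param_samples: (N_I, n_params) array of optimal parameters per
    leave-one-out run. Returns the matrix of k_{a,b} values of Eq. 19."""
    mc = np.cov(param_samples, rowvar=False)   # covariance matrix M_C
    std = np.sqrt(np.diag(mc))                 # sqrt(M_C(a,a))
    # |M_C(a,b)| / sqrt(M_C(a,a) * M_C(b,b))
    return np.abs(mc / np.outer(std, std))

# Toy example: two strongly coupled parameters, mimicking the link between
# the slope parameter and D0 of the IaC model (values are hypothetical)
rng = np.random.default_rng(1)
eps = 0.67 + 0.03 * rng.standard_normal(30)
d0 = 138 + 600 * (eps - 0.67) + rng.standard_normal(30)
k = parameter_correlations(np.column_stack([eps, d0]))
print(f"k for (slope, D0) = {k[0, 1]:.2f}")    # close to 1: strong link
```

A value near 1, as obtained here, corresponds to the ~0.97 coupling the table reports between the slope parameter and D0.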
TABLE III
Parameter statistics for mode B data

MSE:
           D0          κ          FE         PE
  µ        3.39e+03    1.47       6.44       7.04
  σ        ±5.26%      ±1.96%     ±4.11%     ±118.00%
  k_{a,b}  0.72 (κ)    0.72 (D0)  0.93 (κ)   0.93 (κ)

IaC (β = 2):
           ε           D0         κ          FE         PE
  µ        0.672       138        1.09       4.91       5.71
  σ        ±4.44%      ±15.25%    ±0.47%     ±3.58%     ±116.60%
  k_{a,b}  0.97 (D0)   0.97 (ε)   0.67 (ε)   0.50 (D0)  0.72 (D0)

WSIrC (q = 2, β = 4, NΓ = 49):
           p           D0         kΨ         kΓ         z          κ          FE         PE
  µ        2.27        4.76       194        1.81       151        1.15       4.88       6.5
  σ        ±2.47%      ±37.28%    ±28.54%    ±36.75%    ±9.17%     ±1.07%     ±6.26%     ±153.65%
  k_{a,b}  0.93 (D0)   0.93 (p)   0.44 (κ)   0.69 (z)   0.87 (p)   0.44 (kΨ)  0.94 (κ)   0.94 (κ)

IaCLA-1 (ε = 0.3, β = 2, kL = 1e−04, NΓ = 84):
           D0          κ          ϑ          FE         PE
  µ        98.3        1.1        0.356      4.25       5.36
  σ        ±12.54%     ±0.32%     ±6.13%     ±3.83%     ±111.93%
  k_{a,b}  0.98 (ϑ)    0.21 (D0)  0.98 (D0)  0.90 (κ)   0.87 (κ)

IaCLA-2 (ε = 0, β = 20, kL = 3e−06, NΓ = 84):
           D0          κ          ϑ          FE         PE
  µ        30.5        1.09       0.389      3.45       4.08
  σ        ±12.29%     ±0.42%     ±3.66%     ±3.80%     ±105.29%
  k_{a,b}  0.98 (ϑ)    0.37 (ϑ)   0.98 (D0)  0.74 (κ)   0.72 (κ)
The WSIrC-model is the most complex one and might be expected to perform best, because it uses more information for its prediction than the other models. However, its prediction performance is not very convincing, and its parameters do not stay stable: variations between 30% and 40% indicate that the optimal parameter constellation is not well characterized. This is in agreement with the independent observation that the WSIrC-model converges only very slowly towards the optimal parameters. The strong variation in the parameters could be interpreted as an unsuccessful minimum search that dropped into local minima, but an intensive minimization running over several days came up with the same result.

The IaCLA-1 model shows a very stable parameter behavior. The improved prediction performance proves that the extension by an inhibition factor for local activity is very efficient. However, the optimization also showed that the values fixed here are sub-optimal; a better choice is presented in model IaCLA-2.

The IaCLA-2 model performed best. It obtains a very low average value for the PE and also has very stable parameters. The point-wise ε-non-linearity, applied only to the central position, is completely removed (ε = 0), with a positive impact on performance. Basically, this means that in natural scenery the main issue is a good characterization of the inhibition due to local activity; the point-wise non-linearity no longer has a significant impact and can even decrease performance. An increased value of β = 20 also showed a positive impact. It drives the Minkowski-norm towards a maximum operator, which appears natural for a detection experiment, because only a few large quantization errors within the 8 × 8-block may be above the visual threshold. Thus, a few peaks, rather than the sum of many small values, cause the visibility of the stimulus. It is interesting to see that the optimal value for ϑ is close to 0.4, since this disagrees with the often presented energy terms (ϑ = 2) that are supposed to capture local activity. The reason for the better performance with ϑ = 0.4 is that the neighboring coefficients are all close to zero for a homogeneous region, yet some of them might be very large if a single strong edge is present. A linear or squared sum would declare such a region highly active due to the single edge, which is false from a perceptual point of view; the clear edge rather helps the observer to locate the distortion in the homogeneous region. A value of ϑ smaller than 1 suppresses this effect, while maintaining the good property of declaring a region active if many medium-size coefficients are present.

B. Test Set Specific Behavior

The large standard deviations of the PE indicate that there exists at least one particular test set that causes problems for the masking models. The question is whether this is linked to the subjective data or to a specific problem with the masking models. Fig. 8 shows a plot of the prediction errors for each model and test set.

Fig. 8. Test-set specific prediction error (leave-one-out performance of the models MSE, IaC, WSIrC, IaCLA-1 and IaCLA-2, plotted over the test-set number).
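The two pooling effects discussed in the previous subsection, the Minkowski norm with β = 20 acting almost like a maximum operator and the activity term with ϑ < 1 suppressing single strong edges, can be illustrated with hypothetical wavelet-coefficient magnitudes:

```python
import numpy as np

# Hypothetical magnitudes for two neighborhood types around a distortion:
edge    = np.array([50.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])  # one strong edge
texture = np.array([6.0, 5.0, 7.0, 6.0, 5.5, 6.5, 6.0, 5.0])   # many medium values

def minkowski(x, beta):
    """Minkowski pooling; as beta grows it approaches the maximum."""
    return np.mean(np.abs(x) ** beta) ** (1.0 / beta)

# beta = 20 already behaves almost like a max operator:
print(minkowski(edge, 2), minkowski(edge, 20), np.max(edge))

def activity(x, theta):
    """Local-activity term: sum of |c|^theta over the neighborhood."""
    return np.sum(np.abs(x) ** theta)

# With theta = 2 the single-edge neighborhood dominates the activity score;
# with theta = 0.4 the textured neighborhood scores higher.
print(activity(edge, 2.0), activity(texture, 2.0))
print(activity(edge, 0.4), activity(texture, 0.4))
```

Under the energy term (ϑ = 2) the edge neighborhood wrongly looks far more "active" than the texture, while ϑ = 0.4 reverses the ranking, matching the perceptual argument above.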
While for the models MSE, IaC, WSIrC and IaCLA-1 the test-set 1 was the most difficult to predict, model IaCLA-2 had the most trouble with number 10. This indicates that there is no systematic error within the subjective data, because otherwise the model IaCLA-2 would also obtain bad prediction results for test-set 1. This is also confirmed by the fact that the observer data for the particular test-set 1 is very consistent. However, several test sets have an exceptionally large prediction error. The extremes are certainly test sets 1, 2 and 10, but 16, 20 and 30 also show bad performance. Obviously, something in these particular test sets disturbs the models. Therefore, it might help to look at the particular images of test-sets 1 and 10, which are shown in Fig. 9. Test-set 1 shows a very regular checker-board pattern. It consists of purely homogeneous regions, separated by a black grid. Hence, the border of each square is marked by a strongly contrasted edge. These edges cause large wavelet coefficients. In the test, the distortion itself is visible in the homogeneous regions close to the edges. A masking model should identify this pattern as not active, because the slightest distortion in
the homogeneous region will immediately be visible. However, this is not necessarily easy, since the strong edge coefficients might confuse a distortion metric into classifying the region as active instead. The IaCLA-2 model performs this classification successfully. The WSIrC-model operates with another inhibition term that probably also considers the other orientations, but is obviously confused by the strong edges. The MSE and IaC models do not have an inhibition term that considers the local surround at all.

Fig. 9. The images of the two test-sets (#1 and #10) that cause exceptionally large prediction errors. The distortion locations are marked (test-set 1: a distortion location between strong contrast edges; test-set 10: a distortion in a very active local surround), but the original images without distortion are shown.

In test-set 10 the distortion occurs in a small homogeneous region that is surrounded by very active image content. It seems that the local surround (NΓ = 84) of model IaCLA-2 was chosen too large: it captures the activity even if the closest neighborhood around the distortion is smooth. This leads to an underestimation of the distortion visibility and explains the elevated PE.

C. Differences between Mode B and C Results

So far, only the data of test mode B (see Table I) was used. It is interesting to see how the model prediction behavior changes when applied to the data of mode C. As a reminder, mode C imposes time restrictions and does not allow flickering during the test; its setting rather reflects a "real-case" image quality assessment situation. However, the noise in the observer data is also significantly higher. It is of particular interest to see how the optimal parameter settings change from mode B to C. All models are applied to the entire data of mode C, including all 4 observer groups. The fitting and prediction errors are given in Fig. 10.

Fig. 10. Fitting error (FE) and prediction error (PE) for the models applied to mode C data.

The errors increased significantly for all models compared to the values found for mode B. This was expected, because the mode C data are much noisier. In terms of relative performance, the MSE-model gained with respect to mode B and the IaCLA-2-model lost, but the latter still performs best in absolute terms. The WSIrC-model shows for mode C a prediction performance worse than that of the simple MSE-model. The optimal parameters and the statistics describing their stability are given in Table IV.

TABLE IV
Parameter statistics for mode C data

MSE:
         D0          κ          FE         PE
  µC     6.7e+03     1.73       15.2       16.4
  σC     ±6.86%      ±2.46%     ±3.30%     ±95.31%

IaC (β = 2):
         ε           D0         κ          FE         PE
  µC     0.855       77.2       1.08       12.4       14.6
  σC     ±2.79%      ±13.82%    ±0.39%     ±3.54%     ±105.14%

WSIrC (q = 2, β = 4, NΓ = 49):
         p           D0         kΨ         kΓ         z          κ          FE         PE
  µC     2.04        1.2        110        83.4       222        1.24       12.8       17.4
  σC     ±4.44%      ±62.51%    ±28.80%    ±44.30%    ±16.42%    ±1.50%     ±5.01%     ±126.73%

IaCLA-1 (ε = 0.3, β = 2, kL = 1e−04, NΓ = 84):
         D0          κ          ϑ          FE         PE
  µC     91.4        1.14       0.447      10.8       12.6
  σC     ±13.10%     ±0.46%     ±4.75%     ±3.66%     ±102.18%

IaCLA-2 (ε = 0, β = 20, kL = 3e−06, NΓ = 84):
         D0          κ          ϑ          FE         PE
  µC     23.2        1.17       0.477      9.88       11
  σC     ±15.58%     ±0.44%     ±3.58%     ±3.72%     ±100.73%

The D0-value of the MSE-model doubled and the slope of the Weibull-function became flatter (κ increased). This is exactly what could be expected: in mode C stronger distortions (D0) are necessary to trigger a detection, and the increased κ-value is due to the larger deviation among the 4 observers. For the IaC-model, an increase of the ε-value is observed; since the D0-value is strongly correlated with ε, it changes as well. The increase of ε was also somewhat expected, because experiments showed that for more complex or difficult patterns the value can increase up to 1 [15]. The situation is comparable, since the observers are under more pressure in mode C and the complexity of the patterns cannot be compensated for by the flickering tool. The WSIrC-model has significant trouble with test mode C. While it already had problems converging towards a minimum for the data of mode B, these became even more serious for test mode C. The excitatory exponent p is practically identical to the inhibitory exponent q. The Gaussian kernels changed their characteristics completely: the orientation pooling kΨ became rather small and the spatial pooling kΓ very large, which was the opposite for the mode B data. This parameter was not well determined for mode B either, with a standard deviation of ±36%, but now it got worse and reached ±44%. The IaCLA-1 model did not change much from mode B to C; only the exponent ϑ in the activity term increased from 0.36 to 0.45. The prediction performance stayed good and the parameter standard deviations remained small. Similar to the IaCLA-1 model, the IaCLA-2 model also increases ϑ, to a value of 0.48. Its prediction error is the best of all tested models. The value of D0
changed more than in the case of the IaCLA-1 model, since the coupling to ϑ is stronger.

VII. Conclusions

The choice of a specific masking model for an image processing application should be motivated by the model's performance in terms of prediction quality and stability. Unfortunately, no results on the explicit performance of masking models on natural scenery images were available. To close this information gap, the authors presented a new method of psycho-visual measurement that uses photo-realistic stimuli. Such complex stimuli can be handled because a flicker-tool drastically shortens the learning phase. Furthermore, the new way of masking measurement enables the parameterization of complex phenomena like texture masking, which is not possible with "classical" experiments. The enormous amount of work put into the psycho-visual tests (18720 subjective judgements) is justified by the concept that the database is built only once and can be used as-is for the design and performance check of newly proposed masking models. The actual performance comparison of 5 masking models confirmed the superiority of all masking models over the plain MSE measure, but also revealed a very surprising aspect: the most complex, and only inter-channel, masking model among the tested candidates performed worse than the simple intra-channel models. The best results were achieved by a model that considers exclusively the local activity and no longer any point-wise contrast masking effect. Further experiments are necessary for a deeper understanding of how far the more complex stimuli or the constraints of a wavelet decomposition are responsible for this behavior. However, it reflects the so-far ignored difference between elementary measurements and the conditions found in real applications. As a consequence of the presented analysis, the model IaCLA-2 is recommended for wavelet-based image processing applications.
Its very low complexity and significant performance gain (about 50%) over MSE make it the model of choice.

Acknowledgments

The authors would like to acknowledge the financial support by Hewlett-Packard.

References

[1] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image coding using wavelet transform. IEEE Transactions on Image Processing, 1(2):205-220, April 1992.
[2] Peter G. J. Barten. Contrast Sensitivity of the Human Eye and Its Effects on Image Quality. SPIE, Bellingham, Washington, 1999.
[3] Kim T. Blackwell. The effect of white and filtered noise on contrast detection thresholds. Vision Research, 38(2):267-280, 1998.
[4] P. Le Callet, A. Saadane, and D. Barba. Interactions of chromatic components on the perceptual quantization of the achromatic component. In Proc. SPIE Human Vision and Electronic Imaging, San Jose, CA, January 24-29, 1999.
[5] G. R. Cole, C. F. Stromeyer III, and R. E. Kronauer. Visual interactions with luminance and chromatic stimuli. Journal of the Optical Society of America A, 7(1):128-140, January 1990.
[6] Scott Daly, W. Zeng, and S. Lei. Visual masking in wavelet compression for JPEG2000. In Proc. SPIE Image and Video Communications and Processing, volume 3974, San Jose, CA, January 22-28, 2000.
[7] Miguel P. Eckstein, Albert J. Ahumada, Jr., and Andrew B. Watson. Visual signal detection in structured backgrounds. II. Effects of contrast gain control, background variations, and white noise. Journal of the Optical Society of America A, 14(9):2406-2419, September 1997.
[8] T. Carney et al. Modelfest: First year results and plans for year two. In Proc. SPIE Human Vision and Electronic Imaging, 2000.
[9] John M. Foley. Human luminance pattern-vision mechanisms: masking experiments require a new model. Journal of the Optical Society of America A, 11(6):1710-1719, June 1994.
[10] John M. Foley and Gordon E. Legge. Contrast detection and near-threshold discrimination in human vision. Vision Research, 21:1041-1053, 1981.
[11] Stanley A. Klein, Thom Carney, Lauren Barghout-Stein, and Christopher W. Tyler. Seven models of masking. In Proc. SPIE Human Vision and Electronic Imaging, volume 3016, pages 13-24, San Jose, CA, February 8-14, 1997.
[12] Gordon E. Legge and John M. Foley. Contrast masking in human vision. Journal of the Optical Society of America, 70(12):1458-1471, December 1980.
[13] M. A. Losada and K. T. Mullen. The spatial tuning of chromatic mechanisms identified by simultaneous masking. Vision Research, 34(3):331-341, 1994.
[14] Marcus J. Nadenau and Julien Reichel. Image compression related contrast masking measurements. In Proc. SPIE Human Vision and Electronic Imaging, volume 3959, pages 188-199, San Jose, CA, January 22-28, 2000.
[15] R. A. Smith and D. J. Swift. Spatial-frequency masking and Birdsall's theorem. Journal of the Optical Society of America A, 2:1593-1599, 1985.
[16] International Standard. Graphic Technology - Prepress Digital Data Exchange - CMYK Standard Colour Image Data. ISO, Geneva, Switzerland, 1997.
[17] E. Switkes, A. Bradley, and K. De Valois. Contrast dependence and mechanisms of masking interactions among chromatic and luminance gratings. Journal of the Optical Society of America A, 5(7):1149-1162, July 1988.
[18] Patrick C. Teo and David J. Heeger. Perceptual image distortion. In Proc. SPIE Human Vision, Visual Processing and Digital Display, volume 2179, pages 127-141, San Jose, CA, February 8-10, 1994.
[19] A. B. Watson and J. A. Solomon. Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A, 14(9):2379-2391, September 1997.
[20] Andrew B. Watson. Visual detection of spatial contrast patterns: Evaluation of five simple models. Optics Express, 6(1):12-33, 2000.
[21] Andrew B. Watson, M. Taylor, and R. Borthwick. Image quality and entropy masking. In Proc. SPIE Human Vision, Visual Processing, and Digital Display VIII, volume 3016, San Jose, CA, 1997.
[22] Asad Zaman. Statistical Foundations for Econometric Techniques. Academic Press, San Diego, California, 1996.