Technical Note PR-TN 2005/00447

Issued: 05/2005

A model for audibility data

T.J.J. Denteneer; J.F. Aprea (PDSL) Philips Research Europe

Unclassified ©

Koninklijke Philips Electronics N.V. 2006


Authors’ address


T.J.J. Denteneer

WL 11

[email protected]

J.F. Aprea (PDSL)

[email protected]

© KONINKLIJKE PHILIPS ELECTRONICS NV 2006. All rights reserved. Reproduction or dissemination in whole or in part is prohibited without the prior written consent of the copyright holder.


Title:

A model for audibility data

Author(s):

Dee Denteneer, Javier Aprea

Reviewer(s):

J. Engel, S. van de Par

Technical Note:

TN-2005/00447

Additional Numbers: Subcategory: Project: Customer:

Philips Research

Keywords:

audibility research, statistics

Abstract:

In audibility research, it is necessary to show that a given method for processing high quality audio does not entail a perceptible change to the audio quality. This task is usually approached via subjective listening tests. However, currently, there is no established basis in the standards for the statistical analysis of the data gathered in these experiments.

Conclusions:

In this report, we propose a multiplicative model for audibility data. In a comparative experiment, we have shown that this model greatly improves upon the standard ANOVA model that is usually proposed for such data. Thus, this model provides a suitable starting point for the analysis of audibility data. The model is not only useful for audibility data, but can also be used for the assessment of small impairments or for the assessment of intermediate quality levels.


Contents

1 Introduction

2 Current procedure
   2.1 Post-screening
   2.2 ANOVA
   2.3 Discussion

3 Proposed model
   3.1 Model description
   3.2 Comparison
      3.2.1 Interpretation
      3.2.2 Statistical, data-based, model comparison

4 Some technical issues
   4.1 Identifiability
   4.2 Estimation
   4.3 Robustness
   4.4 Standard deviations
   4.5 Statistical tests
   4.6 Model extensions

5 Conclusion and future research

Section 1

Introduction

In audibility research, it is necessary to show that a given method for processing high quality audio does not result in a perceptible change to the audio quality. Examples of research efforts for which this is relevant are SACD, see [18], and audio watermarking, see [15]. The basic question, ‘is this manipulation inaudible?’, poses considerable methodological problems, and there is no current standard that details the approach to be taken.

Notwithstanding this lack of an ‘official’ approach, there is an emerging approach within Philips, see e.g. [2], to answer this basic question of audibility research. This approach is based on subjective listening tests. The design of such listening tests, the choice of fragments involved, the physical set-up of the experiment, and the training and grading phases are all well described in known standards. These are available for the subjective assessment of small impairments in audio systems, see [13], and more recently also for the assessment of intermediate quality levels, see [14]. They can be used with minor modifications for audibility research.

However, the procedures for the statistical analysis of the data gathered in such audibility experiments cannot be reused without further effort. There are various reasons for this:

• The basic goal in audibility research is to show that no difference exists between original and modified stimuli. This is logically different from the more common question in subjective assessments of finding the best method.

• Audibility research seems to require more stringent criteria than the assessment of small impairments and the assessment of intermediate quality levels. Thus, one can expect more stringent post-screening criteria than are common in current standards, and different statistical tests to be executed on the data.

• Even the current approach to the statistical analysis of data from subjective listening tests is not well described and leaves much room for controversy.

It is this last issue that will be emphasized in this note. We will review the current practice on statistical analysis of data from subjective listening tests, basing ourselves on existing recommendations, e.g. [13] and [14], and results from formal listening tests, e.g. [11] and [12]. We will conclude from this review that the current approach is incomplete and inconsistent. We will then introduce a new, multiplicative, model. This model overcomes the inconsistencies inherent in the now standard approach based on linear models. Moreover, we will show that this model fits the data better than the linear models that are currently used. This model is a suitable starting point for the development of a standardized approach to the statistical analysis of audibility data.


This note is further organized as follows. In Section 2, we describe the current approach to the analysis of audibility data, as suggested by the standards on subjective listening tests. Moreover, we give an overview of what we see as the most important shortcomings in this current approach. Next, we introduce our alternative, in Section 3. In Section 3.2, we compare this proposed model with the current model; in Section 3.2.1 we do so generally, and in Section 3.2.2 we do so by using data from two listening experiments. In Section 4, we provide some further technical details for the proposed model to analyze the audibility data. Finally, in Section 5, we provide ideas for further research to standardize an approach based on the proposed model.


Section 2

Current procedure

In this section, we describe the statistical elements in the current procedure for analyzing the results from subjective listening tests. We will assume that data are collected according to the design described in [13], [11] and [12]. More specifically, one follows the double-blind triple-stimulus with hidden reference paradigm as described in [13] ITU BS-1116-1. At each trial, grades are given using the ITU-R BS-562-3 5-point impairment scale (continuous, equal interval range from 5.0 to 1.0). Therefore, data are subjective difference grades in the range [-4.0, 4.0], representing the difference between the grade given to the target and that given to the hidden reference.

This means that data are available to us in the form $y_{ijkl}$. Here $i$ ranges from 1 to $I$ and indicates the items used in the listening test. The variable $j$ ranges from 1 to $J$ and indicates listeners. The variable $k$ ranges from 1 to $K$ and indicates the codecs under test. Finally, $l$ ranges from 1 to $L$ and denotes the replications. Consequently, $y_{ijkl}$ denotes the so-called diff-grade associated with listener $j$ when grading item $i$ coded with target codec $k$ for the $l$th time. A negative diff-grade indicates that the listener correctly identified the target. A diff-grade closer to zero means that the listener judged the audio quality to be high. The resulting histogram for such data is expected to be left-skewed and centrally clustered, with mean slightly less than zero.

The data analysis procedure proposed in all standards consists of two steps. Firstly, the listeners are screened based on their responses, i.e. post-screening. Secondly, an ANOVA is applied to the responses of the listeners that passed the post-screening test. We now consider these two steps in turn.

2.1 Post-screening

In all the standards there is agreement that post-screening of the listeners should be carried out based on the data collected during the listening test. The goal of such a screening procedure is to eliminate the data from listeners who cannot make the appropriate discriminations, [13]. The most common way to carry out the post-screening is by means of a one-sided t-test, e.g. see [13] Appendices 1 and 2 to Annex 1, [14] Section 9, [11] Section 6.2, and [12] Section 10.2. This test assesses whether the mean response of each listener is significantly below zero. In [11] Section 6.2, it is suggested to use the Wilcoxon signed-rank test in addition to the t-test. The Wilcoxon test is then used for the same purpose as the t-test. However, it is more robust to non-normality of the data, i.e. it performs well in case of heavy-tailed distributions and outliers, see [16].

To this general procedure, it is usually added that the t-test should be based only on those items that are neither too easy nor too difficult. Additionally, [14] Section 4.1.2 suggests also eliminating those listeners that use the extremes of the scales too frequently, considering this as a lack of expertise on the part of the listener. On the other hand, [13] argues against alternative, post hoc, screening criteria, considering this as being tantamount to shaping the data according to the experimenter’s preconception.
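The one-sided t-test used for post-screening can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example grades are hypothetical, and the critical value is supplied by the caller from a table of Student's t-distribution rather than computed here. The sketch also assumes that the listener's grades are not all identical, since the sample standard deviation would then be zero.

```python
import math
import statistics


def one_sided_t_statistic(diff_grades):
    """t statistic for H0: mean diff-grade = 0 against H1: mean diff-grade < 0."""
    n = len(diff_grades)
    mean = statistics.mean(diff_grades)
    std_err = statistics.stdev(diff_grades) / math.sqrt(n)
    return mean / std_err


def passes_post_screening(diff_grades, critical_value):
    """Keep the listener if the mean diff-grade is significantly below zero.

    critical_value is the positive one-sided critical value of Student's t
    for len(diff_grades) - 1 degrees of freedom, looked up by the caller.
    """
    return one_sided_t_statistic(diff_grades) < -critical_value


# Hypothetical diff-grades of one listener on ten trials.
grades = [-1.0] * 5 + [-2.0] * 5
# 1.833 is the tabulated one-sided 5% critical value for 9 degrees of freedom.
print(passes_post_screening(grades, 1.833))
```

Which trials enter `diff_grades` is exactly the choice of items discussed above; the sketch leaves that choice to the caller.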

2.2 ANOVA

The second step in the proposed procedure consists of an Analysis of Variance (ANOVA) as applied to the responses of the listeners who passed the post-screening test. This amounts to the assumption of a linear model for the responses of the form

$$y_{ijkl} = \mu + \alpha_i + \gamma_k + \epsilon_{ijkl}, \tag{2.1}$$

where $\alpha_i$ is the effect of item $i$ and $\gamma_k$ is the effect of codec $k$. The variables $\epsilon_{ijkl}$ are measurement noise. In carrying out the ANOVA, one usually assumes that these noise variables are independently and identically distributed according to a normal distribution. Potentially, this model is extended to capture interactions, $\delta_{ik}$, between codec and item, as in [11] Section 6.6.1, or to include additional effects, such as respondents’ position, as in [12] Section 10.3.

In [13] and [14] no guidelines are given as to the actual use of the analysis of variance. In [11] and [12], the ANOVA model is used to assess the significance of the codec effect by means of the F-test that is common in the analysis of variance. They then go on to more specific investigations. In particular, overall performance of the codecs is displayed via overall means and 95% confidence intervals, based on the results from the ANOVA. Next, codec by item performance is assessed by applying a one-way ANOVA for item to the data for each codec. Finally, they rank the codecs. This is done both by the overall means and by the number of codec-item combinations for which the confidence interval for the mean includes zero. Here, the interpretation is that if such a confidence interval for the mean contains zero, then the coded version is not significantly different from the original signal.

There are no specific guidelines given on how to choose the best codec from a number of codecs under test. In [8], also see [11], Annex H, a precise EBU definition is given of ‘indistinguishable’ quality. Here, indistinguishable means that the coded signals must be indistinguishable from compact disc quality. This definition is based on the overlap of the confidence intervals for coded and reference signal. The definition is probably too weak for use in audibility tests with almost transparent items, as they arise in e.g. SACD or in tests with watermarked items.

2.3 Discussion

The procedure proposed in the standards amounts to the assumption of a linear model for the responses of the form

$$y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_{ik} + \epsilon_{ijkl}. \tag{2.2}$$

Here the $\alpha_i$, the $\gamma_k$, and the $\delta_{ik}$ are as described in Section 2.2. The parameters $\beta_j$ are additional parameters that model listener ability. Note that the systematic quantities described in Section 2.2 all derive from this linear model. The t-test in the post-screening step described in Section 2.1 thus uses the estimate of $\beta_j + \mu$ in the numerator. The significance of the codec effect is based on the variation explained by the $\gamma_k$. Finally, the codec by item effect is based on the estimates for $\delta_{ik}$.

However, it should also be noted that the values for the standard deviations used in the construction of confidence intervals in the sections above are not readily interpretable in the light of the above model. For example, in the post-screening procedure a within-person standard deviation is used in the denominator. This standard deviation includes systematic item effects, so that it cannot be given a clear probabilistic interpretation. Using this model-based perspective, we can now comment on the procedure outlined in Sections 2.1 and 2.2.

Interpretation. A linear model as in (2.2) is difficult to interpret as a model for audibility data because additive effects have no meaning in this context. As an example, assume that a critical listener, $j$, has a clearly negative score $\beta_j$. Then the model (2.2) predicts that this listener has a large probability of correctly discriminating perfectly coded items. However, in reality, it is not possible to detect flaws in perfectly coded items, whatever the sensitivity of the listener. One can see a similar flaw in the model when it is applied to perfect codecs. Again, the model predicts audibility of artefacts for critical items, due to its additive nature. However, a perfect codec does not produce artefacts, whatever the difficulty of the item to which it is applied.

Inconclusiveness. The post-screening procedure leads to a chicken and egg problem. Indeed, the t-test must be based on only those items that are neither too easy nor too difficult to assess. However, these items can only be selected beforehand by a selection panel, or once the grades from non-critical listeners have been removed. The current procedure is essentially inconclusive in how to handle this chicken and egg problem. We note that this has led [2] to consider iterative procedures for post-screening. Furthermore, the procedures are also inconclusive in that they do not formally state the tests or procedures to be carried out to answer the various questions for which the data were actually gathered.

Efficiency. Leaving out data from listeners that are on the borderline of criticality throws away relevant information, and thus reduces the precision with which the various effects are estimated. Thus, one needs more listeners for a meaningful conclusion.

Multiple comparisons effect. The post-screening is based on the upper limit of a two-sided 90% confidence interval for the listener means, so that there is only 5% probability that the interval will contain zero if the true listener mean is at its estimated value. However, when performing a number of such tests simultaneously, the limits must be adapted to reflect the number of tests carried out. Indeed, 5 out of 100 competent listeners will be rejected, thus aggravating the efficiency issue above.

Statistical interpretation. It is tempting to interpret the confidence intervals used above in terms of relative frequencies, and, indeed, such an interpretation is strongly suggested by the use of the phrase ‘90%’ confidence interval. However, due to the non-standard nature of the estimates of the standard deviations, such an interpretation is not warranted.
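The size of the multiple comparisons effect can be illustrated with a short calculation. The sketch below assumes, as an idealization, that each competent listener is rejected independently with probability 0.05. The function names are hypothetical, and the Bonferroni adjustment is shown only as one standard way of adapting the limits; it is not prescribed by the standards discussed here.

```python
def expected_false_rejections(n_listeners, alpha=0.05):
    """Expected number of competent listeners that are wrongly rejected."""
    return n_listeners * alpha


def prob_at_least_one_false_rejection(n_listeners, alpha=0.05):
    """Probability of wrongly rejecting at least one competent listener,
    assuming independent tests that each have false-rejection rate alpha."""
    return 1.0 - (1.0 - alpha) ** n_listeners


def bonferroni_level(n_listeners, alpha=0.05):
    """Per-listener level that keeps the overall false-rejection rate <= alpha."""
    return alpha / n_listeners


for n in (10, 31, 100):
    print(n,
          expected_false_rejections(n),
          round(prob_at_least_one_false_rejection(n), 3),
          bonferroni_level(n))
```

With a panel of a few dozen listeners, losing at least one competent listener to chance becomes the likely outcome rather than the exception.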


Section 3

Proposed model

In view of the deficiencies of the linear model, we now propose an alternative model. We will first formally introduce the model and some of the technical details. In Section 3.2, we will then compare the model to the additive model (2.2) currently in use. This comparison will readdress the issues raised in the discussion of Section 2.3. Moreover, we will use data from a number of listening tests, described in [3], to carry out a data-based comparison.

3.1 Model description

We propose to view the responses as arising from a multiplicative model, rather than from a linear model:

$$y_{ijkl} = \alpha_{ik}\beta_j + \epsilon_{ijkl}. \tag{3.1}$$

Here, as before, the $\beta_j$ model the sensitivity of the respondent, whereas $\alpha_{ik}$ is the audibility of the artefact when item $i$ is coded by means of codec $k$. The $\epsilon_{ijkl}$ again denote random noise variables that are added to the structural part of the response. We will assume that the $\epsilon_{ijkl}$ are independently drawn from a given distribution with mean 0 and constant variance.

Note that this model utilizes the same effects as the linear model. However, these effects are now multiplied rather than added as in the linear model (2.2) considered previously. Thus the sensitivity of the respondent induces a relative change in the audibility rather than an absolute shift. This model thus alleviates a number of difficulties with the linear model that underlies the traditional approach, and that were outlined in the discussion in Section 2.3 above. We will turn to these in more detail in Section 3.2, particularly Subsection 3.2.1.

In this model, we interpret the parameters $\alpha_{ik}$ as indicating the audibility of a given item $i$ when coded with codec $k$. Consequently, parameters $\alpha_{ik}$ that are zero correspond to item-codec combinations that are indistinguishable from the reference. Likewise, a parameter $\alpha_{ik}$ that is very negative corresponds to an item-codec combination that is very distinguishable from the reference signal. In case of audibility tests, when comparison is with a perfect reference, it seems reasonable to restrict the $\alpha_{ik}$ to non-positive values. Hence we require that

$$\alpha_{ik} \le 0, \quad \text{for } i = 1, \dots, I, \text{ and } k = 1, \dots, K. \tag{3.2}$$

However, when comparisons are carried out with respect to a general reference, such a restriction may not be appropriate.


The parameters $\beta_j$ are interpreted as listener ability parameters. We assume these to be centered around an arbitrary value of 1. Moreover, we restrict the $\beta_j$ to be non-negative:

$$\frac{1}{J}\sum_{j=1}^{J}\beta_j = 1, \qquad \beta_j \ge 0, \quad \text{for } j = 1, \dots, J. \tag{3.3}$$

Listeners with an ability parameter of 0 correspond to very non-critical listeners, whereas listeners with large positive parameters correspond to very critical listeners.
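To make the structure of model (3.1) with restrictions (3.2) and (3.3) concrete, the following sketch generates diff-grades from it. It is an illustration only: the function name, the parameter values and the use of Gaussian noise are assumptions made here (the model itself only requires noise with mean 0 and constant variance), and indices start at 0 as is usual in Python.

```python
import random


def simulate_diff_grades(alpha, beta, n_rep, noise_sd, rng):
    """Draw y[i, j, k, l] = alpha[i][k] * beta[j] + noise.

    alpha[i][k] <= 0 is the audibility of item i coded with codec k,
    beta[j] >= 0 is the ability of listener j (the abilities average to 1).
    Returns a dict keyed by the index tuple (i, j, k, l).
    """
    y = {}
    for i, row in enumerate(alpha):
        for k, a_ik in enumerate(row):
            for j, b_j in enumerate(beta):
                for l in range(n_rep):
                    y[(i, j, k, l)] = a_ik * b_j + rng.gauss(0.0, noise_sd)
    return y


# Hypothetical values: one item, a transparent codec and a clearly audible one.
alpha = [[0.0, -2.0]]
beta = [0.0, 1.0, 2.0]  # non-critical, average and very critical listener
noiseless = simulate_diff_grades(alpha, beta, n_rep=2, noise_sd=0.0,
                                 rng=random.Random(1))
noisy = simulate_diff_grades(alpha, beta, n_rep=2, noise_sd=0.5,
                             rng=random.Random(1))
```

Without noise, every listener scores the transparent codec at exactly zero, whereas the audible codec is scored in proportion to the listener's ability.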

3.2 Comparison

In this section, we argue that the proposed model (3.1) is better than the linear model (2.2). We do so by showing, in Section 3.2.1, that it solves some of the problems with the linear model that were discussed in Section 2.3. Moreover, in Section 3.2.2, we show that the multiplicative model better fits the data gathered in a number of audibility experiments.

3.2.1 Interpretation

Interpretation. The multiplicative model (3.1) avoids the interpretational difficulties associated with the linear model (2.2) as discussed in Section 2.3. For this, note that $\alpha_{ik} = 0$ if there is no audible difference between item $i$ coded with codec $k$ and the reference version of item $i$. Clearly, we then also have that $\alpha_{ik}\beta_j = 0$, whatever the sensitivity of listener $j$. This solves the essential flaw in the linear model, where, of course, $\alpha_{ik} = 0$ does not imply that $\alpha_{ik} + \beta_j$ is equal to zero.

Inconclusiveness. The multiplicative model has an automatic post-screening feature. To see this, note that some of the $\beta_j$ will be estimated as being close to 0 when estimating the parameters $\beta_j$ and $\alpha_{ik}$. Next, observe that the observations $y_{ijkl}$ associated with $j$ such that $\beta_j = 0$ do not contain information about the $\alpha_{ik}$. Hence, data from subjects $j$ associated with a $\beta_j \approx 0$ are automatically left out of the analysis. The model thus avoids the inconclusiveness associated with the linear model, in that we do not need arbitrary thresholds and selection of items to carry out the post-screening.

However, some of the inconclusiveness problems with the linear model reappear in a different form. Recall that with the linear model, items that are too easy or too difficult should be deleted for the post-screening procedure. Now, items that are too difficult do not cause problems here, as they exert little impact when estimating the ability parameters. However, items that are too easy and stand apart, far from the body of the data, may exert an undue influence. This feature of the model can be readily understood via a parallel with simple linear regression models of a straight line through the origin, and will be discussed more extensively in Section 4.3. The relevance of this problem should be assessed, in practice and in simulation. Countermeasures, e.g. based on robust estimates, are possible.

Efficiency. As observed above, the multiplicative model has an automatic post-screening feature in that listeners for whom $\beta_j = 0$ are automatically left out when estimating the item-codec parameters $\alpha_{ik}$. This observation can be extended to listeners $j$ for whom the estimated listener sensitivity, $\beta_j$, is small. For such listeners, the responses will contain little information on the $\alpha_{ik}$, and so will be ignored to a large extent when estimating the audibility parameters. Hence, the multiplicative model automatically weighs the listeners according to their ability. Again, the parallel with simple linear regression through the origin is helpful here. We conjecture that the soft thresholding, implicit in the multiplicative model, is more efficient than the hard thresholding via post-screening that characterizes the approach based on the linear model. However, this feature should be further investigated by means of simulation.

Multiple comparisons effect. By avoiding post-screening, we also avoid the multiple comparisons problem associated with it.

Statistical interpretation. By using one explicit statistical model, the noise variance is estimated in a way that is consistent with that model. Thus, there are no interpretational problems with the estimates of the variability.
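The automatic weighting of listeners can be made explicit with a short derivation. It is a sketch under simplifying assumptions: the abilities $\beta_j$ are treated as known, the restriction (3.2) is ignored, and $\bar{y}_{ijk\cdot}$ is notation introduced here for the average over the $L$ replications. Minimizing the sum of squares for one item-codec combination gives

```latex
\begin{align*}
S(\alpha_{ik}) &= \sum_{j=1}^{J}\sum_{l=1}^{L}\bigl(y_{ijkl}-\alpha_{ik}\beta_j\bigr)^2, \\
\frac{\partial S}{\partial \alpha_{ik}} &= -2\sum_{j=1}^{J}\sum_{l=1}^{L}\beta_j\bigl(y_{ijkl}-\alpha_{ik}\beta_j\bigr) = 0
\quad\Longrightarrow\quad
\hat{\alpha}_{ik} = \frac{\sum_{j=1}^{J}\beta_j\,\bar{y}_{ijk\cdot}}{\sum_{j=1}^{J}\beta_j^2},
\qquad \bar{y}_{ijk\cdot} = \frac{1}{L}\sum_{l=1}^{L} y_{ijkl}.
\end{align*}
```

Each listener's average grade thus enters the estimate with a weight proportional to $\beta_j$: a listener with $\beta_j = 0$ drops out entirely, and a listener with small $\beta_j$ contributes little. Equivalently, for $\beta_j > 0$, the listener-specific estimates $\bar{y}_{ijk\cdot}/\beta_j$ are averaged with weights proportional to $\beta_j^2$.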

3.2.2 Statistical, data-based, model comparison

In this section, we show that the multiplicative model is preferable to a linear model in that it gives a better fit to the data gathered in two listening experiments. To show this, we must compare two different, non-nested models.

Standard statistical theory provides a complete set of tools to compare nested models, in which one model is a simplification of the other. In this case, one uses a likelihood ratio test statistic, which has, asymptotically, a $\chi^2$-distribution. The statistical, data-based, comparison of non-nested models is much less standard. This problem was first considered by Cox [6, 7] by means of asymptotic methods. The approach was extended by Williams [21] to a simulation-based approach. Our account is based on Hinde [10] and Allcroft and Glasbey [1] and uses the bootstrap.

We assume that the noise variables in either model follow a normal distribution. We use $L^{(l)}(\hat{a})$ to denote minus two times the log-likelihood of the observed data under the linear model given in (2.2) with fitted parameter $\hat{a}$. Here $a$ is a shorthand for the parameters in the linear model. Hence, with the model as in (2.2) we have that

$$a = (\mu, \alpha_i, \beta_j, \gamma_k, \delta_{ik}), \quad i = 1, \dots, I;\ j = 1, \dots, J;\ k = 1, \dots, K, \tag{3.4}$$

and that

$$L^{(l)}(\hat{a}) \propto \log\left(\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{l=1}^{L}\bigl(y_{ijkl} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\gamma}_k - \hat{\delta}_{ik}\bigr)^2\right) - \log(n - IK - J + 1) - n - IK - J + 1, \tag{3.5}$$

where $n$ denotes the total number of observations. Similarly, we use $L^{(m)}(\hat{b})$ for the corresponding quantity under the multiplicative model given in (3.1). Here, we have that

$$b = (\alpha_{ik}, \beta_j), \quad i = 1, \dots, I;\ j = 1, \dots, J;\ k = 1, \dots, K, \tag{3.6}$$

and that

\begin{align}
L^{(m)}(\hat{b}) \propto{} & \log\left(\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{l=1}^{L}\bigl(y_{ijkl} - \hat{\alpha}_{ik}\hat{\beta}_j\bigr)^2\right) - \log(n - IK - J + 1) \tag{3.7} \\
& - n - IK - J + 1. \tag{3.8}
\end{align}

Now the distribution of $L^{(l)}(\hat{a}) - L^{(m)}(\hat{b})$ is not known, so that there is no formal statistical basis to compare $L^{(l)}(\hat{a})$ and $L^{(m)}(\hat{b})$. However, it is possible to use a simulation-based approach. To use this simulation-based approach, we first consider the linear model as the null hypothesis and the multiplicative model as the alternative hypothesis. We use $\hat{a}$ to denote the fitted parameter vector under the linear model and $\hat{b}$ to denote the fitted parameter vector under the multiplicative model. Moreover, we use $\hat{F}^{(l)}$ for the empirical distribution of the residuals with respect to the linear model and $\hat{F}^{(m)}$ for the empirical distribution of the residuals with respect to the multiplicative model. Let

$$d^{\mathrm{obs}} := L^{(l)}(\hat{a}) - L^{(m)}(\hat{b}) \tag{3.9}$$

denote the observed log likelihood ratio for the two models. We now generate $S$ simulated data sets, $y^{(s)}_{ijkl}$, $s = 1, \dots, S$, acting as if the linear model is the true one. We do so by generating data according to the linear model (2.2) with fixed parameter $\hat{a}$:

$$y^{(s)}_{ijkl} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_k + \hat{\delta}_{ik} + \epsilon^{(s)}_{ijkl}. \tag{3.10}$$

Here, the noise variables $\epsilon^{(s)}_{ijkl}$ are generated i.i.d. according to $\hat{F}^{(l)}$. We fit both models to the simulated data. Thus we obtain parameter estimates $\hat{a}^{(s)}$ and $\hat{b}^{(s)}$ and simulated log likelihood ratios $d^{(s)}$, where

$$d^{(s)} := L^{(l)}(\hat{a}^{(s)}) - L^{(m)}(\hat{b}^{(s)}). \tag{3.11}$$

A test statistic to test the linear model versus the multiplicative model is based on the position of the observed log likelihood ratio within the empirical distribution of the simulated log likelihood ratios, e.g. using

$$T^{l,m} := \frac{1}{S}\sum_{s=1}^{S} \mathbf{1}\bigl(d^{(s)} > d^{\mathrm{obs}}\bigr). \tag{3.12}$$

The null hypothesis will now be rejected if $T^{l,m}$ falls outside some predetermined interval, e.g. [.025, .975]. An equivalent, graphical, procedure is to plot the distribution of the simulated likelihood differences together with the observed value of the likelihood difference. The null hypothesis is then not rejected if the observed value lies in the core of the distribution, i.e. if the observed likelihood difference is not unlikely under this distribution.

A similar test with the multiplicative model as null hypothesis and the linear model as alternative can be constructed. In practice, one will apply both tests. This may lead to the rejection of one, both, or neither model.

We now apply this technique to the data generated in two experiments with a watermarking system for high quality audio, see [3], in order to compare the linear model (2.2) with the proposed multiplicative model (3.1). Presentation order of all trials was randomly chosen for each item-codec and each listener. Replications of the same item-codec pair were presented as Ref/A/B and Ref/B/A. Furthermore, listening conditions were the same for both listening tests. Each item is a 10 s long audio fragment of very high quality audio selected to be critical for the system under test (i.e., making explicit its shortcomings).
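The simulation-based test in (3.9) through (3.12) can be sketched generically. The code below is an illustration under stated assumptions, not the implementation used for the results in this note. The two toy models (a constant mean, and a straight line through the origin) stand in for the linear and the multiplicative model, all names are hypothetical, and the fit criterion is simplified to $n \log(\mathrm{RSS}/n)$, which drops the degrees-of-freedom terms of (3.5) and (3.7).

```python
import math
import random

X = list(range(1, 21))  # toy design points (hypothetical)


def criterion(y, fitted):
    """n * log(RSS / n): minus two times a Gaussian log-likelihood, up to constants."""
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    return len(y) * math.log(max(rss, 1e-300) / len(y))


def fit_constant(y):
    """Toy model A: y = c + noise. Returns (fitted values, criterion)."""
    m = sum(y) / len(y)
    fitted = [m] * len(y)
    return fitted, criterion(y, fitted)


def fit_line_through_origin(y):
    """Toy model B: y = a * x + noise. Returns (fitted values, criterion)."""
    slope = sum(x * v for x, v in zip(X, y)) / sum(x * x for x in X)
    fitted = [slope * x for x in X]
    return fitted, criterion(y, fitted)


def bootstrap_comparison(y, fit_null, fit_alt, n_sim, rng):
    """Return (d_obs, simulated d values, T) with fit_null as null hypothesis."""
    fitted_null, crit_null = fit_null(y)
    _, crit_alt = fit_alt(y)
    d_obs = crit_null - crit_alt                      # as in (3.9)
    residuals = [v - f for v, f in zip(y, fitted_null)]
    d_sim = []
    for _ in range(n_sim):
        # Null fit plus residuals drawn i.i.d. from their empirical distribution.
        y_sim = [f + rng.choice(residuals) for f in fitted_null]   # as in (3.10)
        d_sim.append(fit_null(y_sim)[1] - fit_alt(y_sim)[1])       # as in (3.11)
    t = sum(d > d_obs for d in d_sim) / n_sim         # as in (3.12)
    return d_obs, d_sim, t


rng = random.Random(0)
y = [2.0 * x + rng.gauss(0.0, 1.0) for x in X]  # data that truly follow model B
d_obs, d_sim, t = bootstrap_comparison(y, fit_constant, fit_line_through_origin,
                                       200, rng)
print(d_obs, t)
```

Swapping the two fitting functions gives the converse test with the other model as null hypothesis, so both tests can be run with the same routine.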

[Figure 3.1: density plot; horizontal axis: log likelihood ratio; vertical axis: density.]

Figure 3.1: Model comparison test assuming multiplicative model, first experiment. The density gives the empirical distribution of the simulated log likelihood ratios; the dot indicates the value of the observed log likelihood ratio.

[Figure 3.2: density plot; horizontal axis: log likelihood ratio; vertical axis: density.]

Figure 3.2: Model comparison test assuming linear model, first experiment. The density gives the empirical distribution of the simulated log likelihood ratios; the dot indicates the value of the observed log likelihood ratio.

The first data set (n = 1,240, m = -0.30, s = 1.07) comes from 31 listeners (of whom 18 were found critical after post-screening), 10 items, 2 codecs and 2 replications. Of the 2 codecs, 1 was the target codec, or system under test (where we are interested in the mean opinion of the listeners); the other was used as a low-rating anchor codec (with somewhat exaggerated artefacts, to help identify critical listeners by means of post-screening). The second data set (n = 1,824, m = -0.34, s = 1.00) comes from 38 listeners (of whom 32 were found critical), 8 items, 3 codecs (2 target codecs, 1 anchor) and 2 replications.

The results of the test are graphically displayed in Figures 3.1 through 3.4. In Figures 3.1 and 3.2 we show the results for the first data set. Figure 3.1 shows the graphics when taking the multiplicative model as the null model and the linear model as the alternative. In this case, the observed value of the log likelihood ratio, indicated by the big dot in the plot, lies well within the distribution of the simulated log likelihood ratios given that the multiplicative model is the true model. This gives no cause to reject the multiplicative null model in favour of the linear alternative model.

In Figure 3.2, we show the results of the converse situation, in which we have taken the linear model as the null model and the multiplicative model as the alternative. We now see a completely different picture. In this case, the observed value of the likelihood difference falls far from the distribution of the simulated likelihood differences. Thus the log likelihood ratios simulated under the assumption that the linear model is the true model are much higher than the log likelihood ratio as actually observed. This gives cause to reject the null hypothesis of the linear model in favour of the multiplicative alternative. Figures 3.3 and 3.4 display similar results for the second data set and lead to the same conclusions.

[Figure 3.3: density plot; horizontal axis: log likelihood ratio; vertical axis: density.]

Figure 3.3: Model comparison test assuming multiplicative model, second experiment. The density gives the empirical distribution of the simulated log likelihood ratios; the dot indicates the value of the observed log likelihood ratio.

[Figure 3.4: density plot; horizontal axis: log likelihood ratio; vertical axis: density.]

Figure 3.4: Model comparison test assuming linear model, second experiment. The density gives the empirical distribution of the simulated log likelihood ratios; the dot indicates the value of the observed log likelihood ratio.

Section 4

Some technical issues

We now turn to some technical issues concerning model (3.1).

4.1 Identifiability

Note that the model (3.1) as such is not identified: dividing the listener parameters by a constant factor, $c$, and multiplying the item parameters by this same factor will lead to the same predicted values:

$$\alpha_{ik}\beta_j = (c\,\alpha_{ik})(\beta_j / c). \tag{4.1}$$

In particular, note that the sign of the parameters is not identified. However, the model for the design as outlined in Section 2 together with the restrictions (3.3) is identifiable. Hence, optimization procedures will obtain a unique set of estimates for the parameters in (3.1). The restrictions (3.2) are not necessary for identifiability but encode extra knowledge about the comparison at hand. We make some further remarks about the restrictions imposed to make the system identifiable:

• The restrictions (3.3) are arbitrary, as we can restrict the average listener ability to any arbitrary value. This may open a road to inter-experiment comparison, using some known average listener ability.

• Moreover, restrictions (3.3) are arbitrary, as we can also restrict the average $\alpha_{ik}$ to some fixed value. Mathematically, there is no difference in restricting the $\beta_j$ or the $\alpha_{ik}$. However, it must be investigated whether there are practical differences.

• Note that with other versions of model (3.1), see e.g. Section 4.6, or with other designs in which one does not collect data for all combinations of the factors, it may be necessary to reconsider the identifiability issue.
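The scale indeterminacy in (4.1) and its removal via the restriction (3.3) can be sketched as follows. The function name and the numerical values are hypothetical, and the sketch assumes that the abilities are non-negative and not all zero.

```python
def normalise_to_unit_mean_ability(alpha, beta):
    """Rescale (alpha, beta) so that the abilities average to 1.

    All products alpha[i][k] * beta[j], and hence all fitted values,
    are left unchanged; compare (4.1) with c equal to the mean ability.
    """
    c = sum(beta) / len(beta)
    if c <= 0:
        raise ValueError("abilities must be non-negative and not all zero")
    scaled_alpha = [[a * c for a in row] for row in alpha]
    scaled_beta = [b / c for b in beta]
    return scaled_alpha, scaled_beta


# Hypothetical, not yet normalised parameter values.
alpha = [[-1.0, -0.5]]
beta = [0.5, 1.5, 4.0]
alpha_n, beta_n = normalise_to_unit_mean_ability(alpha, beta)
print(alpha_n, beta_n)
```

The two parameter sets cannot be distinguished from data, because they predict identical diff-grades for every listener and every item-codec combination.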

4.2 Estimation

We propose to estimate the parameters in model (3.1) by minimizing a least squares criterion:

$$(\hat{\alpha}, \hat{\beta}) := \arg\min_{\alpha, \beta} \sum_{i,j,k,l} (y_{ijkl} - \alpha_{ik}\beta_j)^2. \qquad (4.2)$$

Here the sum runs over all observations, $\hat{\alpha}$ denotes the vector of estimated values corresponding to the $\alpha_{ik}$, and $\hat{\beta}$ similarly denotes the vector of estimated values corresponding to the $\beta_j$. The proposed model (3.1) is non-linear, and this complicates the estimation of the parameters as compared to the current model (2.2) used in conjunction with a least squares criterion. However, the optimization of such non-linear models has been well investigated, see e.g. [20]. Moreover, the model is linear in the $\alpha_{ik}$ when the $\beta_j$ are fixed, and is also linear in the $\beta_j$ when the $\alpha_{ik}$ are fixed. This enables a very simple see-saw procedure to estimate the parameters: alternately update the $\alpha_{ik}$ with the $\beta_j$ held fixed and the $\beta_j$ with the $\alpha_{ik}$ held fixed, each update being an ordinary linear least squares problem. This is an example of a more general procedure, see [9], which is guaranteed to converge.

However, the restrictions (3.3) and (3.2) cause us to make two minor changes to this general procedure. Firstly, after estimation of the $\beta_j$, they are normalized to satisfy (3.3). Secondly, after estimation of the $\alpha_{ik}$, positive parameters, which violate (3.2), are set to zero. We have not investigated the properties of this modified procedure in detail. However, in practice, we have not encountered any problems in the optimization of (4.2), and convergence was always obtained within about five iterations.
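The see-saw procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the data layout `y[i, j, k, l]` (item, listener, codec, replicate), the function name `see_saw_fit`, the starting values, and the stopping rule are all hypothetical, and restriction (3.3) is assumed to fix the average listener ability at 1.

```python
import numpy as np

def see_saw_fit(y, n_iter=50, tol=1e-10):
    """Alternating least squares for y_ijkl ~ alpha_ik * beta_j.

    y : array of shape (I, J, K, L), complete design (assumed layout).
    Returns (alpha, beta) with shapes (I, K) and (J,).
    """
    ybar = y.mean(axis=3)              # replicates only enter via their mean
    beta = np.ones(y.shape[1])         # start with equal listener abilities
    prev_rss = np.inf
    for _ in range(n_iter):
        # alpha step: linear least squares in alpha for fixed beta
        alpha = np.einsum('ijk,j->ik', ybar, beta) / np.sum(beta ** 2)
        alpha = np.minimum(alpha, 0.0)  # positive values violate (3.2): set to 0
        denom = np.sum(alpha ** 2)
        if denom == 0.0:
            raise ValueError("all alpha estimates are zero; beta is undefined")
        # beta step: linear least squares in beta for fixed alpha
        beta = np.einsum('ijk,ik->j', ybar, alpha) / denom
        beta = beta / beta.mean()       # assumed form of (3.3): mean ability 1
        # stop when the criterion (4.2) no longer changes
        fitted = alpha[:, None, :] * beta[None, :, None]
        rss = np.sum((y - fitted[..., None]) ** 2)
        if abs(prev_rss - rss) < tol:
            break
        prev_rss = rss
    return alpha, beta
```

Because the replicates share the same predicted value, each update depends on the data only through the cell means, which keeps both steps in closed form. Normalizing and truncating inside the loop mirrors the two modifications described above; as noted there, the convergence guarantee of the unmodified procedure does not automatically carry over.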

4.3 Robustness

Traditionally, parameter estimation via the optimization of a least squares criterion is considered to be not very robust, in that very deviant responses may exert a large influence on the parameter estimates. As the responses in the audibility tests are restricted to the range [1, 5], resulting in difference grades in the range [−4, 4], there is no room for really outlying responses, and this problem will not be present in the current context.

However, linear models can also be non-robust in that certain observations have high leverage, see e.g. [5], Section 4.2.2. Usually, this arises when one of the design points is far from the other design points. It arises in its purest form when fitting a simple linear regression. As an illustration, assume that most design points are located near the origin whereas one design point is located far from the origin, see e.g. Figure 4.1. In this case, the design point far from the origin will exert undue influence when estimating the slope of the regression line. Indeed, in Figure 4.1, the fitted regression line misses the orientation of the majority of the data points and is almost fully determined by the isolated observation.

This leverage problem is relevant to our problem. It will particularly arise when one of the item-codec combinations contains artefacts that are easy to detect and that are unacceptable, whereas the artefacts in the other item-codec combinations are nearly imperceptible. In this case, this one combination may exert an inappropriate amount of influence when estimating listener abilities. The size of this problem has to be assessed within the current context. If it is considered serious, measures against it should be taken. One possibility is to use a robust linear regression procedure in the see-saw fitting procedure. An attractive candidate may be the procedure considered by Rousseeuw, see e.g. [19], which is robust against high leverage.

Moreover, note that this can also be a problem in the linear model. Indeed, for the post-screening procedure it is advised to eliminate those items that are too easy, see Section 2.1. Hence, one can also use the solutions proposed in the current standards for the multiplicative model. However, this retains at least part of the inconclusiveness described in Section 2.3, which we would rather avoid.

Finally, it is relevant to investigate the robustness against model errors. An example of this is the presence of various levels of within-subject variability. These are not included in the model but might very well be present in the data. The effect of such omissions on the estimation procedures needs further attention.
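The leverage effect in simple linear regression can be made concrete with a small numerical sketch. The data below are hypothetical and only mimic the configuration described above (a cluster of design points near the origin plus one isolated point); they are not the data behind Figure 4.1. The leverage of observation $i$ in a simple linear regression with intercept is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_m (x_m - \bar{x})^2$.

```python
import numpy as np

def leverages(x):
    """Diagonal of the hat matrix for a simple linear regression with intercept."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + dev ** 2 / np.sum(dev ** 2)

def ls_slope(x, y):
    """Ordinary least squares slope of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dev = x - x.mean()
    return np.sum(dev * (y - y.mean())) / np.sum(dev ** 2)

# Hypothetical data: nine points near the origin, one isolated point at x = 10.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 10.0])
y = np.array([-1.0, 0.5, -0.5, 1.0, 0.0, -1.0, 0.5, 1.0, -0.5, 50.0])

h = leverages(x)                       # the last entry dominates
slope_all = ls_slope(x, y)             # fit including the isolated point
slope_cluster = ls_slope(x[:-1], y[:-1])  # fit without it
```

In this configuration the isolated point carries most of the total leverage, and the slope fitted to all ten points differs strongly from the slope fitted to the cluster alone, which is the behaviour the text warns about for a single easily detected item-codec combination.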

Figure 4.1: An isolated observation has high leverage and may exert an undue influence on the fitted regression (dotted line), as can be observed from a comparison with the fitted regression without the outlying point (solid line). Axes: x (horizontal), y (vertical).

4.4 Standard deviations

The discussion above has focussed on the interpretation of the model and the estimation of the most relevant parameters, notably the item-codec audibility parameters $\alpha_{ik}$. These estimates must be complemented with estimates of their standard deviations. Parameter estimates plus their standard deviations can then form the basis for appropriate inference, see Section 4.5.

There are two avenues open to determine the standard deviations of the estimates. The first approach is computational and uses simulation based on the bootstrap. In this approach, we simulate data sets, as described in Section 3.2.2. By repeatedly estimating the parameters of interest for these simulated data sets, we obtain an estimate of their variability.

An alternative approach uses the inverse of the expected information matrix, see [20], Section 2.2.1. The standard errors so obtained are valid under somewhat more restrictive conditions than the standard errors obtained via the bootstrap. This analytical approach has the advantage, however, that it clarifies the impact of the (estimated) $\beta_j$ on the standard deviation of the estimated $\alpha_{ik}$. It thus makes the implicit weighting procedure, described in Section 3.2.1, more explicit and as such has independent interest.
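The bootstrap route can be sketched as below. This is an illustration under assumptions rather than the procedure of Section 3.2.2, which is not reproduced here: the sketch simulates data sets by adding resampled residuals to the fitted values (a residual bootstrap), uses a compact see-saw fit with a fixed number of iterations, and assumes the data layout `y[i, j, k, l]` and an average listener ability of 1 for restriction (3.3). The names `fit` and `bootstrap_se` are hypothetical.

```python
import numpy as np

def fit(y, n_iter=20):
    """Compact see-saw fit of y_ijkl ~ alpha_ik * beta_j (assumed constraints)."""
    ybar = y.mean(axis=3)
    beta = np.ones(y.shape[1])
    for _ in range(n_iter):
        alpha = np.einsum('ijk,j->ik', ybar, beta) / np.sum(beta ** 2)
        alpha = np.minimum(alpha, 0.0)          # enforce alpha <= 0, cf. (3.2)
        denom = np.sum(alpha ** 2)
        if denom == 0.0:
            raise ValueError("all alpha estimates are zero; beta is undefined")
        beta = np.einsum('ijk,ik->j', ybar, alpha) / denom
        beta = beta / beta.mean()               # assumed form of (3.3)
    return alpha, beta

def bootstrap_se(y, n_boot=200, seed=0):
    """Residual-bootstrap standard errors for the alpha_ik estimates."""
    rng = np.random.default_rng(seed)
    alpha, beta = fit(y)
    fitted = (alpha[:, None, :] * beta[None, :, None])[..., None]
    resid = (y - fitted).ravel()
    draws = np.empty((n_boot,) + alpha.shape)
    for b in range(n_boot):
        # simulated data set: fitted values plus resampled residuals
        y_star = fitted + rng.choice(resid, size=y.shape, replace=True)
        draws[b], _ = fit(y_star)
    return alpha, draws.std(axis=0, ddof=1)
```

Resampling residuals treats the errors as exchangeable across listeners, which is the same independence assumption made in model (3.1); if the errors of one listener are correlated, a scheme that resamples whole listeners would be the more natural choice.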

4.5 Statistical tests

The multiplicative model (3.1) provides a coherent framework to estimate the relevant parameters plus standard deviations. These in turn provide the basis to test the relevant hypotheses used for decision making. Given the non-standard questions of audibility research, the formulation of the relevant hypotheses will require further discussion. Additionally, we observe that even the more standard settings for codec comparisons call for a more precise discussion of hypotheses and their associated tests than is currently available in the standards.


4.6 Model extensions

As in the case of the linear model, it is possible to extend the multiplicative model (3.1) in various directions. We now illustrate possible extensions.

Extension related to the systematic term

It is possible to substitute alternative models for the item-codec interaction term. One possibility is to use

$$\alpha_{ik} = \alpha_i \gamma_k, \qquad (4.3)$$

which is related to the main effects model in the linear model. Alternatively, one could extend the model to include additional effects such as listening position. An extension of the multiplicative model in this direction calls for modification of both the algorithm and the identifiability constraints.

Extension related to the error term

In the proposed model (3.1), the errors $\varepsilon_{ijkl}$ are assumed to be independently drawn from the same error distribution. In practice, there will be correlation among the errors that are associated with one listener. Extensions of the model in this direction require precise models for this correlation structure and adapted estimation procedures. The advantage of such models is that they capture the differences between listeners even better, as listeners can differ not only with respect to sensitivity, but also with respect to precision. This makes it possible to differentiate even further the weight given to the responses of the listeners when estimating the $\alpha_{ik}$, leading to even more efficient estimates.

Extension related to the parameter interpretations

In the current model (3.1), the parameter $\alpha_{ik}$ is interpreted as relating to a fixed item $i$ for a given codec $k$. However, an alternative interpretation is to view the $\alpha_{ik}$, $i = 1, \ldots, I$, as randomly drawn from a distribution that characterizes the codec $k$. The research interest is most likely not in the $\alpha_{ik}$ themselves, but in the distribution that underlies these $\alpha_{ik}$, $i = 1, \ldots, I$, for a given $k$, and in the comparison of these distributions. A model based on random components does exactly this and models the parameters $\alpha_{ik}$, $i = 1, \ldots, I$, as arising from a distribution characterized via a number of parameters, such as mean $\mu_k$ and variance $\sigma_k^2$. The goal of the estimation is then not to estimate the individual $\alpha_{ik}$ but to estimate the distributional parameters $\mu_k$ and $\sigma_k^2$.

Again, this requires further work to model the distributions underlying the parameters and to adapt the estimation procedure. The benefit is that the model expresses the actual situation even better. Additionally, such a random components model has fewer parameters. Hence, they can be estimated more accurately.


Section 5

Conclusion and future research

In this note, we have proposed a multiplicative model for audibility data. In a comparative experiment, we have shown that this model greatly improves upon the standard ANOVA model that is usually proposed for such data. This model thus provides a suitable starting point for the analysis of audibility data, as well as for the assessment of small impairments or for the assessment of intermediate quality levels in combination with listening tests. We propose the following steps for future research in order to develop the proposed model into a tool for the analysis of data from listening tests.

Corroboration. The proposed model must be used to analyze various data sets so as to be able to weigh its practical use against that of the more common linear model.

Hypotheses. One must formulate the basic hypotheses for audibility data, and develop the basic test statistics to test these hypotheses. Essential for this is the development of procedures to estimate the standard errors of the parameter estimates.

Robustness assessment. The robustness issue, discussed in Section 4.3, must be investigated. Countermeasures must be taken if the problem is judged to be serious. Such countermeasures can be based on robust estimation procedures. An alternative option was suggested by Steven van de Par. This is to modify the model so that it interpolates between the model as proposed (3.1) and the standard model (2.2). The multiplicative model (3.1) would then be valid for small impairments, whereas the linear model (2.2) would be valid for medium impairments. The method of interpolation must be sorted out, and this gives an interesting perspective for further research.

Generalization. Finally, application to other areas of listening tests can be investigated.

After the completion of this technical note, Jan Engel kindly communicated to me a paper from sensory research by Brockhoff and Skovgaard [4]. In this paper, the authors propose a multiplicative model, similar to ours. They motivate their model by applications in food research, more particularly the analysis of sensory data from panels. The discussion in [4] is a valuable starting point to pursue the research issues listed above.


References

[1] Allcroft, D.J., and C.A. Glasbey (2003). A simulation-based method for model evaluation. Statistical Modelling 3, pp. 1-13.

[2] Aprea, J. (2003). Notes on assessing subjects' expertise for selecting critical listeners in subjective listening tests. PDSL report AR6-875020JA C05S2.

[3] Aprea, J. et al. (2002). WATERMASK Robustness and Audibility Test Report. PDSL reports AR6-875020JA C08S2, C08S3 and C08S4.

[4] Brockhoff, P.M., and I. Skovgaard (1994). Modelling individual differences between assessors in sensory evaluations. Food Quality and Preference 5, pp. 215-224.

[5] Chatterjee, S., and A.S. Hadi (1988). Sensitivity analysis in linear regression. Wiley, New York.

[6] Cox, D.R. (1961). Tests of separate families of hypotheses. Proc. of the Fourth Berkeley Symposium 1, pp. 105-123.

[7] Cox, D.R. (1962). Further results on tests of separate families of hypotheses. J. Royal Statist. Soc., Ser. B 24, pp. 406-424.

[8] EBU (1991). Basic audio quality requirements for digital audio bit-rate reduction systems for broadcast emission and primary distribution. CCIR document number TG 10-2/2.

[9] Golub, G., and V. Pereyra (1973). The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J. Numer. Anal. 10, pp. 413-432.

[10] Hinde, J. (1992). Choosing between non-nested models: A simulation approach. In: Fahrmeir et al. (eds.), Advances in GLIM and statistical modelling: Proc. of the GLIM92 conference and the 7th workshop on statistical modelling. Springer, New York.

[11] ISO/IEC JTC1/SC29/WG11 N1419, Kirby, D., and K. Watanabe (1996). Report on the formal subjective listening tests of MPEG-2 NBC multichannel audio coding.

[12] ISO/IEC JTC1/SC29/WG11 N2006, Meares, D., K. Watanabe, and E. Scheirer (1998). Report on the MPEG-2 AAC stereo verification tests.

[13] ITU-R BS.1116-1 (1997). Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems.

[14] ITU-R BS.1534-1 (2003). Methods for the subjective assessment of intermediate level of coding systems.

[15] Lemma, A.N., J. Aprea, W. Oomen, and L. vd Kerkhof (2003). A temporal domain audio watermarking technique. IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1088-1097.

[16] Miller, R.G. (1986). Beyond ANOVA. Wiley, New York.

[17] Neyman, J., and E.L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1.

[18] Reefman, D., and P. Nuijten (2001). Editing and switching in 1-bit audio stream. AES 110th Convention, Amsterdam, Netherlands, 12-15 May 2001.

[19] Rousseeuw, P., and S. van Aelst (1998). The deepest fit. In: Payne and Green (eds.), Proceedings Compstat, Physica-Verlag.

[20] Seber, G.A.F., and C.J. Wild (1989). Nonlinear regression. Wiley, New York.

[21] Williams, D.A. (1970). Discrimination between regression models to determine the pattern of enzyme synthesis in synchronous cell cultures. Biometrics 28, pp. 28-32.
