Audio Engineering Society

Convention Paper 8563
Presented at the 131st Convention
2011 October 20–23, New York, USA

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author's advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Exploratory Studies on Perceptual Stationarity in Listening Tests – Part II: Synthetic Signals with Time-Varying Artifacts

Frederik Nagel (1,2) and Max Neuendorf (1)

(1) Fraunhofer Institute for Integrated Circuits IIS, 91058 Erlangen, Germany
(2) International Audio Laboratories Erlangen

Correspondence should be addressed to Frederik Nagel ([email protected])

ABSTRACT
Many recent publications related to audio coding use the recommendation "MUltiple Stimuli with Hidden Reference and Anchor" (MUSHRA; ITU-R BS.1534-1) for the evaluation of subjective audio quality. Judging the quality of multiple conditions can be inconclusive if the employed test excerpts exhibit more than one prevalent artifact. Two papers investigate the impact of time-varying artifacts in both synthetic and real-world signals and claim "perceptual stationarity" as a requirement for test sequences used in MUSHRA tests. This second part focuses on the alternation of multiple types of artifacts within one item and on the differences in the ratings compared to items which contain only one of the respective types. It furthermore discusses the significance of the temporal position of artifacts within an item.

1. INTRODUCTION
Many recent publications in audio use the recommendation "MUltiple Stimuli with Hidden Reference and Anchor" (MUSHRA; ITU-R BS.1534-1) for the evaluation of perceptual quality. In these tests it is common practice to choose items of approximately 10 seconds in length or more. During this time period the subjective perception of the items is not necessarily constant. Even the ranking of the conditions under test may change during this time span.

Several articles published by the Audio Engineering Society in recent years have focused on different aspects of the MUSHRA [1] methodology. These covered the statistical evaluation of the results and possible biases with the MUSHRA scale [2, 3, 4, 5], audibility and classification of artifacts [6, 7, 8, 9], item selection [10], and speculations about 'listener fatigue'. Søren Bech and Nick Zacharov's book [11] and a recently published PhD thesis [12] give a comprehensive overview of listening test design and evaluation. Hence, almost all necessary information on the good design, execution, and evaluation of listening tests is available and accessible to the AES. While most of the aforementioned literature deals with proper design, possible biases in MUSHRA ratings, and pitfalls in the statistical evaluation, there is a lack of empirical investigations, particularly regarding item selection. [10, 2, 12] discuss different motivations in the selection process of items; in particular, [2] claims the necessity of items not changing their characteristics. The present paper therefore aims at empirically identifying the perceptual meaning of different types of artifacts within one item. It explores the effects of alternating the types of artifacts and of changing their order within one audio excerpt. While the first part of this set of two papers, entitled "Real World Signals from Custom Listening Tests" [13], already demonstrated the importance of perceptual stationarity on the basis of real-world items from stereotypical listening tests, this paper helps to begin understanding the implicit and possibly unconscious processes listeners have to apply in order to generate MUSHRA ratings for audio items which are not perceptually stationary and which exhibit different types of artifacts at various temporal positions. The types of artifacts we used were motivated by the work of [8] and [9], which investigated the impact of different grades of artifacts on the perceived audio quality. The artifacts included birdies, transient smearing, and lowpass filtering. While Marins et al. additionally used spatial artifacts, we decided to additionally incorporate a typical parametric bandwidth extension artifact: the failure to maintain the harmonic frequency structure of an audio signal.

1.1. Aims
The objective of this study was to generate knowledge about listeners' ratings of items which contain more than one type of artifact over time. The following questions were thereby investigated:

i. How do subjects rate items with more than one prevalent type of artifact?
ii. How does the rating differ between combinations of different types of artifacts, and how does it compare to items with constant characteristics?
iii. How does the rating for two alternating types of artifacts compare to items which contain only one of the respective types?
iv. Is the variance in the ratings amongst subjects increased by changing perceptual characteristics, compared to perceptually stationary items?
v. What is the impact of the temporal position of a prevalent artifact within an item?
vi. How are the types of artifacts weighed against each other?

We expected that the rating for two types of artifacts in one item would always be worse than for items with one type alone, and that multiple types would lead to an increased variance in the data compared to items with constant characteristics. Furthermore, we expected that the temporal location of an artifact would have an effect on the rating, and that the beginning of an item would be more important than the end.

2. METHOD AND PROCEDURE
A test set of synthetic items was created based on three different regular items with constant characteristics taken from MPEG test sets: a male choir (salvation), transients (castanets), and a stationary, harmonic pitch pipe (pitchpipe). These items are referred to in the following as regular items. Salvation was used for generating birdies, which describe a rapidly time-varying spectral quantization effect leading to repeatedly appearing and disappearing spectral holes within an audio signal [6]. In order to simulate non-harmonic spectral continuation, pitchpipe was first lowpass filtered at a cutoff frequency of 4 kHz. The resulting bandlimited lowpass part was then single-sideband modulated multiple times to generate a frequency range of up to 16 kHz within a filterbank. The spectral envelope was reconstructed by applying a short-term LPC filter. The effects of lowpass filtering and smeared transients (pre-echoes) were generated with the castanets.
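To make this construction concrete, the following is a minimal Python sketch of single-sideband-based spectral continuation, assuming 48 kHz material; the function names are ours, and the short-term LPC envelope restoration described above is omitted. It is a sketch of the kind of processing used, not the authors' implementation.

```python
import numpy as np
from scipy.signal import firwin, hilbert, lfilter

def ssb_shift(x, shift_hz, fs):
    """Shift the spectrum of x upwards by shift_hz via single-sideband modulation."""
    t = np.arange(len(x)) / fs
    # The analytic signal has no negative frequencies, so multiplying by a
    # complex exponential and taking the real part shifts the whole band up.
    return np.real(hilbert(x) * np.exp(2j * np.pi * shift_hz * t))

def nonharmonic_continuation(x, fs=48000, cutoff=4000, target=16000):
    """Copy the 0..4 kHz band upwards in 4 kHz steps, up to 16 kHz.
    The copies are generally not harmonically related to the baseband,
    which produces the 'Harm' artifact described in the text.
    (The paper additionally restores the spectral envelope with a
    short-term LPC filter, omitted here.)"""
    lp = lfilter(firwin(511, cutoff, fs=fs), 1.0, x)  # bandlimit to 4 kHz
    y = lp.copy()
    shift = cutoff
    while shift + cutoff <= target:
        y = y + ssb_shift(lp, shift, fs)              # fill 4-8, 8-12, 12-16 kHz
        shift += cutoff
    return y
```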


For the smearing, 1024-sample Modified Discrete Cosine Transform (MDCT) frames were quantized with 4 bits in a non-uniform quantizer. Lowpass filtering was carried out with a steep 200-tap FIR filter with a cutoff frequency of 12 kHz. Birdies were introduced into salvation by randomly setting varying small regions of the MDCT spectra above 4 kHz to zero. All introduced delays were subsequently removed from the signals.

Two test sets were created from these regular items. For the first test, two times six items were generated by concatenating processed and unprocessed regular items. These 12 items represent all possible permutations of concatenating two of the four regular items. Table 1 illustrates the different items and the four conditions, which are denoted as xx (different types of artifacts in the first and second half), xo (one type of artifact in the first half but no artifact in the second half), and ox (one type of artifact in the second half but none in the first half of the item). Consequently, the reference without any artifact is referred to as oo. Every cell of the table marks one item with the four different conditions. The second test set simply contained the four regular items with and without the assigned artifact. In the following, the item to which the birdie artifact was applied is referred to as Bird, the item with lowpass filtering as Low, the one with non-harmonic continuation as Harm, and the one with transient smearing as Smear. Note that the type of artifact and the item to which the respective artifact was applied have the same name. In the following formulas, every original concatenated item is denoted as $\mathrm{item}_{ij}^{oo}$; with one artifact as $\mathrm{item}_{ij}^{xo}$ or $\mathrm{item}_{ij}^{ox}$, respectively; and with two types of artifacts as $\mathrm{item}_{ij}^{xx}$. The regular items are denoted $\mathrm{item}_{i}^{o}$ without and $\mathrm{item}_{i}^{x}$ with an artifact.

Twelve male expert listeners participated in this experiment. Three of them had to be postscreened, as they did not detect the hidden reference correctly for at least one item; a fourth participant's data was excluded because the 3.5 kHz anchor was graded higher than the 7 kHz anchor. The items were played within a MUSHRA interface allowing loops to be set at arbitrary positions and all conditions to be rated. The items were replayed in randomized order with randomized conditions. In addition to the three conditions xx, xo, and ox, two lower anchors, lp35 and lp70, with cutoff frequencies of 3.5 kHz and 7 kHz, respectively, plus a hidden reference were used.
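The lowpass and birdie artifacts likewise reduce to a few lines. A hedged sketch follows, assuming 48 kHz material and substituting an STFT for the paper's MDCT for brevity; the function names and the hole probability are our own illustration.

```python
import numpy as np
from scipy.signal import firwin, lfilter, stft, istft

def lowpass(x, cutoff_hz, fs=48000, taps=201):
    """Steep FIR lowpass, as used for the Low artifact (12 kHz)
    and the lp35/lp70 anchors (3.5 kHz and 7 kHz)."""
    return lfilter(firwin(taps, cutoff_hz, fs=fs), 1.0, x)

def birdies(x, fs=48000, hole_prob=0.3, rng=None):
    """Randomly zero small spectral regions above 4 kHz in each frame,
    mimicking the time-varying spectral holes of the birdie artifact.
    (The paper operates on 1024-sample MDCT spectra; an STFT is used here.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    f, t, X = stft(x, fs=fs, nperseg=1024)
    band = np.flatnonzero(f > 4000)
    for m in range(X.shape[1]):                   # new hole pattern per frame
        holes = rng.random(band.size) < hole_prob
        X[band[holes], m] = 0.0
    _, y = istft(X, fs=fs, nperseg=1024)
    return y[: len(x)]
```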

Basic item            Bird         Low          Harm         Smear
Bird (= salvation)    x / o        BirdLow      BirdHarm     BirdSmear
Low (= castanets)     LowBird      x / o        LowHarm      LowSmear
Harm (= pitch pipe)   HarmBird     HarmLow      x / o        HarmSmear
Smear (= castanets)   SmearBird    SmearLow     SmearHarm    x / o

Table 1: Generation of items. Each off-diagonal cell corresponds to one item from test set 1 with four conditions for the combination of two types of artifacts: xx means both types of artifacts are introduced, xo has artifacts only in the first part, and ox only in the second part. The diagonal cells indicate items from test set 2, which contained the regular items with and without artifacts (conditions x and o only).

The items were played out from a passively cooled computer equipped with a professional sound card, Stax headphones, and an amplifier. As the listening environment, a laboratory was chosen that was not anechoic but dedicated to listening tests. Test session one was followed by the second session for all participants.

As only the results from 8 of the 12 subjects could be used for the statistical evaluation, we decided not to assume a normal distribution of the data [5] but to use boxplots for data visualization. The boxplots contain notches which can be used to identify statistical differences between groups at a 5% significance level if the two notches do not overlap. This corresponds to the application of a Wilcoxon test, which is more conservative than a t-test. The boxes cover the range from the 25th to the 75th percentile, the whiskers indicate the range of the data, and outliers (values further than 1.5 times the inter-quartile range from the box boundaries) are marked with +-symbols.
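These boxplot conventions map directly onto matplotlib's boxplot options. A minimal sketch with made-up scores follows; the data and condition labels are placeholders, not the test results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ratings: 8 listeners x 5 conditions on the 0..100 MUSHRA scale.
rng = np.random.default_rng(1)
scores = np.clip(rng.normal([20, 45, 60, 75, 100], 8, size=(8, 5)), 0, 100)

fig, ax = plt.subplots()
ax.boxplot(scores,
           notch=True,   # non-overlapping notches ~ difference at the 5% level
           whis=1.5,     # whiskers out to 1.5 x the inter-quartile range
           sym='+')      # outliers beyond the whiskers marked with '+'
ax.set_xticklabels(['lp35', 'lp70', 'xx', 'xo', 'ox'])
ax.set_ylabel('MUSHRA Score')
plt.show()
```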

3. RESULTS
3.1. Overall Ratings and Rating Variance
Figure 2 shows the raw listening test results from the first test set. It supports the decision to use non-parametric statistics, as many boxes are not symmetrical. The boxes have a large span for the conditions with artifacts (xx, xo, and ox). The ratings for test set 2 with the regular items can be seen in Figure 1. The rating for the item Smear is significantly lower than the ratings for Low and Bird (Wilcoxon test, p < 5%). The four items under test have different perceptual quality which, however, only becomes significant for these two comparisons.

Fig. 1: Boxplots showing the raw listening test results from test set 2. The ratings for the item Smear are significantly lower than the ratings for Low and Bird.

Furthermore, it can be seen that the variance differs between both test sets; the standard errors of the mean are shown in Table 2, calculated as

$$\mathrm{SEM}_{ij}^{mn} = \sqrt{\frac{1}{n(n-1)} \sum_{k=1}^{n} \left( \mathrm{item}_{ijk}^{mn} - \mu_{ij}^{mn} \right)^{2}}$$

with listener $k \le n$, sample size $n = 8$, conditions $mn \in \{oo, ox, xo, xx\}$, and $\mu_{ij}^{mn} = \sum_{k=1}^{n} \mathrm{item}_{ijk}^{mn} / n$. It is worth noticing that the variance is particularly large for the Harm artifact. Obviously, there is little agreement amongst subjects about the quality degradation of audio due to the non-maintained harmonic structure in bandwidth extension.
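This SEM is simply the sample standard deviation divided by √n; a minimal numpy equivalent (the function name is ours):

```python
import numpy as np

def sem(ratings):
    """Standard error of the mean over listeners.

    ratings: 1-D array of the n = 8 listeners' scores for one
    item/condition pair, i.e. item_ijk^mn for k = 1..n."""
    n = len(ratings)
    mu = ratings.mean()
    return np.sqrt(np.sum((ratings - mu) ** 2) / (n * (n - 1)))

# equivalently: scipy.stats.sem(ratings), or ratings.std(ddof=1) / np.sqrt(n)
```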

Table 2: Standard errors of the mean for the ratings of all four conditions of the combined items, and of the regular items on the main diagonal.

Comparing the results from both sets in Figure 1 ($\mathrm{item}_{i}^{m}$) and Figure 2 ($\mathrm{item}_{ij}^{mn}$), there is no rating difference between the concatenated xo/ox items and the regular items with the respective artifact. Figure 3 shows the results for the items with artifact-free endings ($\mathrm{item}_{ij}^{xo}$) to illustrate the small influence, if any, of those parts of the signals.

Fig. 3: Ratings of the xo-conditions. The lower labels denote the beginnings, the upper smaller labels the endings of the items. Gradings amongst items with an artifact in the beginning and different artifact-free endings do not differ significantly.

3.2. Order of Artifacts
Table 2 and Figure 2 hint at a dependency on the order of the artifact types in the concatenation. This was analyzed in more detail, resulting in Figure 4, which shows the rating differences between both orders of the combined items from test set 1. The differences were calculated as $\Delta\mathrm{item}_{ij}^{xx} = \mathrm{item}_{ij}^{xx} - \mathrm{item}_{ji}^{xx}$, $\Delta\mathrm{item}_{ij}^{xo} = \mathrm{item}_{ij}^{xo} - \mathrm{item}_{ji}^{ox}$, and $\Delta\mathrm{item}_{ij}^{ox} = \mathrm{item}_{ij}^{ox} - \mathrm{item}_{ji}^{xo}$, with $j > i \in \{1 \dots 4\}$. The column BirdLow, for example, expresses the rating for the birdie artifact followed by the lowpass-filtered castanets minus the rating for the reversed-order concatenation of the same regular items, i.e. LowBird. In many items a significant difference caused by the order of the items was found, most prominently for items which combine artifacts with a high impact on the grading, such as Smear and Harm, with less severe artifacts or the unprocessed versions. Therefore, BirdLow shows no significant differences, while for all other combined items this effect can be observed for at least one condition.

The significantly (Wilcoxon, p < 5%) positive rating differences for ox and negative differences for xo indicate that the first part of an item generally has a bigger effect on the overall rating than the rest of the item ($\mathrm{item}_{ij}^{xo} < \mathrm{item}_{ji}^{ox}$). Items are graded higher if the affected part is located at the end (ox) of the item than items with an artifact at the very beginning (xo). ox items were even rated as transparent by some listeners, although they were instructed to listen carefully to the entire item, even when setting loop points at freely chosen positions.
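Such a paired comparison can be reproduced with scipy's Wilcoxon signed-rank test. A sketch with hypothetical ratings follows; the numbers are placeholders, not the reported data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired ratings per listener for one item pair:
# artifact in the first half (xo) vs. in the second half (ox).
xo = np.array([38, 45, 52, 40, 47, 35, 50, 44])
ox = np.array([55, 60, 58, 49, 66, 47, 72, 61])

stat, p = wilcoxon(xo, ox)      # paired, non-parametric test
if p < 0.05:
    print(f"significant order effect (p = {p:.3f}), "
          f"median difference = {np.median(ox - xo):.1f}")
```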


Fig. 2: Boxplots showing the raw listening test results from test set 1, including the lowpass anchors at 3.5 kHz and 7 kHz and the hidden reference.

3.3. Rating of the Alternation of Artifacts and Artifact Weighing
One of the basic questions of this paper was how the rating of an alternation of different types of artifacts relates to items which contain only one of the respective types. These results are shown in Figure 5. For each listener and each item, the differences $\Delta\mathrm{item}_{ijk}^{mn} = \mathrm{item}_{ijk}^{mn} - \mathrm{item}_{ijk}^{xx}$ with $mn \in \{ox, xo\}$ were calculated. Surprisingly, some participants rated the combination of two types of artifacts (xx) better than the worse of the two single-artifact conditions. This is indicated by whiskers or outliers in the $\Delta\mathrm{item}_{ij}^{mn}$ ratings which reach below the zero line. The rating of the artifact combination is barely, if at all, worse than the worst rating of the single-type conditions. This means that the influence of an additional "softer" artifact on the overall rating is very limited if a "strong" artifact is already present.

The second type of artifact does, however, have an influence on the overall rating, as displayed in Figure 6 for all $\mathrm{item}_{ij}^{xx}$, $i, j \in \{1 \dots 4\}$. Only for the most severe artifact type, Smear, does the second artifact not seem to play any role. This effect was further assessed in a regression analysis on the ratings, solving the equation $\mathrm{item}_{ijk}^{xx} = \alpha \cdot \mathrm{item}_{ijk}^{xo} + \beta \cdot \mathrm{item}_{ijk}^{ox}$. The results for α and β are shown in Table 3, which contains the mean estimates of a linear regression analysis with 95%-confidence bounds. For each item, the first row corresponds to the impact of the first part containing an artifact (α); the second row refers to the second type of artifact (β), located at the end of the item and indicated by the column header.

Table 3 expresses the strongly unbalanced levels of importance of the different types of artifacts for the overall rating. Additionally, as already shown in Figure 4, the dependence of the overall grade on the location of the artifact within the item is substantiated.
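The per-item regression can be estimated with ordinary least squares. A minimal sketch follows, where `xo`, `ox`, and `xx` stand for one combined item's ratings across listeners; the names and values are hypothetical.

```python
import numpy as np

def weigh_artifacts(xx, xo, ox):
    """Least-squares estimates of alpha and beta in
    item_xx ~= alpha * item_xo + beta * item_ox (no intercept)."""
    A = np.column_stack([xo, ox])
    coeffs, *_ = np.linalg.lstsq(A, xx, rcond=None)
    return coeffs  # [alpha, beta]

# Hypothetical ratings from 8 listeners for one combined item:
xo = np.array([30, 35, 28, 40, 33, 31, 38, 29], dtype=float)
ox = np.array([70, 65, 72, 60, 68, 74, 66, 71], dtype=float)
xx = np.array([32, 36, 30, 39, 35, 33, 37, 31], dtype=float)
alpha, beta = weigh_artifacts(xx, xo, ox)   # here alpha is near 1, beta near 0
```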

Fig. 6: Different gradings for all combinations of different types of artifacts in the xx-conditions. The lower labels denote the beginnings, the upper smaller labels the endings of the items.

Item            Bird                  Low                   Harm                  Smear
Bird    α   0.00 [ 0.00,  0.00]   0.71 [ 0.44,  0.97]   0.40 [-0.13,  0.93]   0.06 [-0.05,  0.18]
        β   0.00 [ 0.00,  0.00]   0.17 [-0.07,  0.41]   0.47 [-0.07,  1.01]   0.85 [ 0.67,  1.03]
Low     α   0.22 [-0.14,  0.59]   0.00 [ 0.00,  0.00]   0.11 [-0.12,  0.34]   0.02 [-0.07,  0.11]
        β   0.69 [ 0.35,  1.04]   0.00 [ 0.00,  0.00]   0.81 [ 0.55,  1.07]   0.89 [ 0.72,  1.07]
Harm    α   0.97 [ 0.68,  1.25]  -0.09 [-0.83,  0.65]   0.00 [ 0.00,  0.00]   0.21 [ 0.01,  0.40]
        β  -0.03 [-0.27,  0.22]   0.74 [ 0.10,  1.39]   0.00 [ 0.00,  0.00]   0.52 [ 0.22,  0.82]
Smear   α   0.84 [ 0.64,  1.04]   1.01 [ 0.89,  1.13]   0.76 [ 0.45,  1.08]   0.00 [ 0.00,  0.00]
        β   0.02 [-0.08,  0.12]  -0.01 [-0.06,  0.05]   0.04 [-0.14,  0.21]   0.00 [ 0.00,  0.00]

Table 3: Linear regression coefficients, given as mean [95%-confidence bounds]. For each item, the first row corresponds to the impact of the first part containing an artifact (α); the second row refers to the second artifact type, located at the end of the item and indicated by the column header (β).


Fig. 4: Rating differences caused by the order of the items; the differences are most prominent for items with severe artifacts. Negative numbers mean that the reversed order received a higher grading than the order compared against. The beginning therefore has a higher impact on the rating than the end, as indicated by the negative results for xo and the positive results for ox.

Fig. 5: Differences between the ox and xo conditions and the xx condition for each item and each listener.

4. DISCUSSION
In the following, we first discuss general aspects of the test design chosen for this study. Afterwards, the questions stated at the beginning of this paper are revisited.

4.1. Test Design
Each of the test items from the first test was assembled from two regular items which were well known from former tests, e.g. from use in MPEG standardization. All listeners were familiar with these items and the assigned artifacts. Some of the listeners experienced the test as strange or peculiar, as they had to listen to two items with different types of artifacts in every trial. Exactly this was desired, as the intention of the test was to model real-world items with time-varying characteristics. Naturally, researchers would hardly select such items for testing audio coders. Nevertheless, as Part I of this set of papers [13] demonstrated, items with changing perceptual characteristics exist and are used in MUSHRA listening tests.

The subjects reported that they began to guess the purpose of this listening test. Therefore, the rating behavior might have changed over time. Several participants complained that the test was exhausting or annoying. This can be taken as a request from listeners to avoid changing perceptual characteristics in audio items which are used for quality testing.

The four different types of artifacts were not rated equally, and the artifacts were applied to different items. The dependency between the item and the type of artifact has thereby not been taken into account. It cannot be ruled out that with other items some observations would have been different. For instance, the Smear artifact probably would not much degrade the quality of the pitch pipe, and the castanets would hardly be affected by the Harm artifact.

4.2. Rating of Items with More Than One Prevalent Artifact
One answer to questions i to iii is that the listeners' judgments are mainly influenced by those parts of the audio signals which represent the worst perceived degradation. The existence of artifacts which either follow or precede the prevalent artifact does not, or hardly, further decrease the MUSHRA rating. Artifact-free continuations of any item had no statistically significant impact on the rating. Hence, the hypothesis that the rating of items with two types of artifacts is always worse than the lowest rating of items which exhibit only one of these two types has to be rejected. It cannot even be stated that the variance is consistently increased for the ratings of the concatenated items exhibiting different types of artifacts. We could also find no evidence that the rating of items with varying types of artifacts was the result of any averaging between the ratings for the single artifacts. Furthermore, no relation to the lengths of the different parts exhibiting one of the two artifact types was found. Artifacts are not graded cumulatively in the process of audio quality evaluation. One consequence is that in the course of optimizing the audio quality, e.g. of an audio coding algorithm, the engineer should mainly, if not exclusively, focus on the most prevalent artifact and may neglect other, less severe artifacts.

4.3. Order of Artifacts and Temporal Position of a Prevalent Artifact Within an Item
The order of the regular items within the concatenated items under test had a huge impact on the overall rating. While we expected that the end of an item would have a bigger impact due to the recency effect [3], we found the opposite. This can be explained by the tendency of listeners to set a loop at the very beginning of an item and to roughly grade the quality at that point in time. Subsequently, a fine adjustment amongst the conditions is carried out based on the rest of the item. Severe artifacts at the end of an item can hence easily be missed and may not distinctly contribute to the overall quality rating.

4.4. Weighing of Artifacts
In the field of modern audio coding, particularly at low bitrates, artifacts cannot be totally avoided. Instead, engineers have to carefully weigh which artifact may be introduced into which signals. We have shown that different types of artifacts are graded very differently and that, for combinations of these, the overall rating depends almost entirely on the more severe artifact. A simple answer to the question of artifact weighing is thus that the more severe artifact beats the lesser one. These observations have to be taken into account for proper test design in codec optimization in order to really identify optimal operating points. It should also be mentioned that these findings can easily be misused for hiding artifacts and weaknesses in audio quality evaluations by putting the parts that are critical for an algorithm towards the end of an item. Furthermore, the MUSHRA methodology may not reveal small degradations of audio items if they are outweighed by severe prevalent artifacts.

5. CONCLUSION AND OUTLOOK
When combining audio material with different characteristics, it is impossible to assess quality degradations of different types which occur in the same test item. Therefore, MUSHRA tests should only use items which are perceptually stationary over time. In particular, less pronounced artifacts, which do not lead to a very low grading, have to be tested independently from more severe artifacts. If more than one type of artifact exists in an item, the perceptual characteristics at the beginning of the item dominate the resulting grading. This approach did not reveal implicit or explicit weighing of artifacts by the subjects. In order to address this question, different strengths of the same artifacts are needed, as applied in [8, 9]. In any case, further testing is necessary with all types of artifacts applied to different items with varying artifact severity. In addition to considering alternating types of artifacts, these should also be applied simultaneously to the test items in order to gain more insight into the rating and weighing of artifacts.

6. REFERENCES
[1] International Telecommunication Union, "Method for the subjective assessment of intermediate sound quality (MUSHRA)," Recommendation ITU-R BS.1534-1, Geneva, Switzerland, 2001.
[2] Slawomir Zieliński, Philip Hardisty, Christopher Hummersone, and Francis Rumsey, "Potential biases in MUSHRA listening tests," in Audio Engineering Society Convention 123, October 2007.
[3] Slawomir Zieliński, Francis Rumsey, and Søren Bech, "On some biases encountered in modern audio quality listening tests – a review," J. Audio Eng. Soc., vol. 56, no. 6, pp. 427–451, 2008.
[4] Thomas Sporer, Judith Liebetrau, and Sebastian Schneider, "Statistics of MUSHRA revisited," in Audio Engineering Society Convention 127, New York, USA, October 2009, Paper 7825.
[5] Frederik Nagel, Thomas Sporer, and Peter Sedlmeier, "Towards a statistically well-grounded evaluation of listening tests – avoiding pitfalls, misuse, and misconceptions," in Audio Engineering Society Convention 128, May 2010.
[6] Markus Erne, "Perceptual audio coders: 'What to listen for'," in Audio Engineering Society Convention 111, November 2001.
[7] Han-Wen Chang, Chia-Ming Hsu, Kan-Chun Lee, Wen-Chieh Lee, Chi-Min Liu, Shou-Hung Tang, Chung-Han Yang, and Yung-Cheng Yang, "Compression artifacts in perceptual audio coding," in Audio Engineering Society Convention 121, October 2006.
[8] Paulo Marins, Francis Rumsey, and Slawomir K. Zieliński, "The relationship between selected artifacts and basic audio quality in perceptual audio codecs," in Audio Engineering Society Convention 120, May 2006.
[9] Paulo Marins, Francis Rumsey, and Slawomir K. Zieliński, "The relationship between basic audio quality and selected artefacts in perceptual audio codecs – Part II: Validation experiment," in Audio Engineering Society Convention 122, May 2007.
[10] Jonas Ekeroot, Jan Berg, and Arne Nykänen, "Selection of audio stimuli for listening tests," in Audio Engineering Society Convention 130, May 2011.
[11] Søren Bech and Nick Zacharov, Perceptual Audio Evaluation: Theory, Method and Application, Wiley, Chichester, 2006.
[12] G. Lorho, Perceived Quality Evaluation: An Application to Sound Reproduction over Headphones, Ph.D. thesis, Aalto University, Espoo, 2010.
[13] Max Neuendorf and Frederik Nagel, "Exploratory studies on perceptual stationarity in listening tests – Part I: Real world signals from custom listening tests," in Audio Engineering Society Convention 131, New York, October 2011.