Objective Perceptual Analysis: Comparing the Audible Performance of Data Reduction Schemes
M.P. Hollier (1), M.O. Hawksford (2), D.R. Guard (1)
(1) BT Labs, Martlesham Heath, Ipswich, IP5 7RE. (2) Dept. Electronics Sys. Eng., University of Essex, Colchester, CO4 3SQ.
3797 (P3.6)
Presented at the 96th Convention 1994 February 26 - March 01 Amsterdam
This preprint has been reproduced from the author's advance manuscript, without editing, corrections or consideration by the Review Board. The AES takes no responsibility for the contents. Additional preprints may be obtained by sending request and remittance to the Audio Engineering Society, 60 East 42nd St., New York, New York 10165-2520, USA. All rights reserved. Reproduction of this preprint, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
AN AUDIO ENGINEERING SOCIETY PREPRINT
ABSTRACT

Data reduction schemes, such as DCC and Mini-Disc in domestic hi-fi, DAB in broadcasting, and low-bit-rate speech codecs in telecommunications, are not adequately characterised by simple engineering performance metrics. This paper introduces the emerging field of perceptual analysis for predicting subjective opinion. A practical perceptual analysis is used to compare example data reduction schemes, and its usefulness as a performance metric and diagnostic tool is discussed.

1.0 INTRODUCTION

There is increasing use of DRSs (Data Reduction Schemes) across a range of industries including telecommunications, broadcasting, and domestic audio equipment. The success of such schemes depends on minimising the audibility of the errors produced by the data reduction. This audibility is ultimately determined by human psychoacoustics. As a result, some DRSs employ psychoacoustic "rules" in an attempt to mask errors with the desired signal. In principle, this technique should lead to even lower data rates as the psychoacoustic rules are improved and more signal redundancy can be omitted. Example DRSs include:
(i) Domestic audio: DCC (Digital Compact Cassette) and Mini-Disc.
(ii) Broadcasting: DAB (Digital Audio Broadcasting).
(iii) Telecommunications: low-rate coding algorithms in GSM (Global System for Mobile Communications) and DCME (Digital Circuit Multiplication Equipment).

DRS operation includes non-linear processes, and the subjective performance of DRSs is not adequately assessed by simple engineering performance metrics such as SNR (Signal-to-Noise Ratio), THD (Total Harmonic Distortion), and steady-state frequency response. In particular, conventional steady-state test-stimuli are unlikely to produce representative excitation of the system, and the engineering metrics listed will not take account of the masking of distortion products by the rest of the signal.

The use of perception based analysis to assess the audibility of errors produced by non-linear processes is a rapidly emerging field. This field is introduced in Section 2, together with a practical perceptual analysis, which seeks to analyse audio signals in a way which is broadly analogous to human hearing. The advantages of a perceptual analysis are that performance can be expressed in terms of a listener's opinion, and that the detailed performance of a system can be examined in terms of the audible error from individual signal artefacts.
The performance of example DRSs is examined in Section 3, using a broadband telephony auditory model developed for performance evaluation of non-linear processes in speech communications systems. The broadband telephony model operates to 8 kHz and therefore does not cover the full bandwidth of all the DRSs investigated. It can, however, indicate the presence of audible distortions below 8 kHz. Further, the principles of the modelling technique are equally applicable over the full auditory bandwidth.

2.0 PERCEPTION BASED ANALYSIS
2.1 Perception based performance metrics

It is highly desirable to provide a performance measurement which yields a direct indication of human perception. In particular, a typical user is concerned neither with the technical operation of a coding algorithm, nor with its abstract objective performance. The user is, however, concerned with the audio quality provided by equipment and services. The emerging perceptual measurement techniques seek to produce performance metrics which estimate whether an audio-error is audible and, if it is, estimate its subjective significance. In telecommunications assessment there are two main approaches to the objective prediction of subjective opinion:

(i) Empirical methods. For this approach a number of objective measures, such as cepstral distance, coherence and intelligibility index, are applied to the system to be characterised. The measurement method is "trained" by determining the relationship between these objective measures and a corpus of known subjective test data, typically using advanced statistical techniques such as clustering. Examples of this approach include [1,2,3]. The main limitation of such methods is that they only work well within the range of their training data, and hence may produce spurious predictions when characterising an unknown degradation. This is a serious limitation since we would not routinely expect to know the type of degradation in advance.

(ii) Auditory models. In this approach the audio signal is analysed in a way which is broadly analogous to human hearing. In this way an auditory excitation for original and degraded versions of the signal can be used to produce a prediction of audible-error. Examples of this kind of approach include [4,5,6,7,8]. The principal advantage of this method is that, in conjunction with a higher level perceptual analysis, it is potentially independent of degradation.

Some auditory models, such as [4], are concerned with predicting whether an audio-error will be audible at all. Such an approach is appropriate for the evaluation of very high quality codecs for professional audio, where any perceptible error is unacceptable. The model reported here, together with several others including [5, 6, 7], attempts to assess the subjective significance of errors which are certainly audible. This is appropriate to telecommunications evaluation, where constraints such as limited radio bandwidth and high ambient noise for a mobile telephone mean that some audible error is present in practical systems. Hence it is desirable to optimise the practical system to minimise the subjective impact of the audible error.
2.2 Practical perceptual analysis

The practical auditory model used in this investigation is related to that previously reported [8], and includes the major non-linear behaviours of the ear, such as the Frequency-to-Pitch and Level-to-Sensation transformations. Simultaneous and temporal masking properties are also modelled, together with frequency selectivity. The simultaneous masking and frequency selectivity of human hearing are commonly modelled as a bank of bandpass filters. The model in [8] has been improved by employing a better approximation to the auditory filter shape, and band-dependent temporal resolution. Auditory filter approximations by Moore and Glasberg [9], equation (1), and Sekey and Hanson [10], equation (3), have been tested. From [9], we have:

$$W(g) = (1 + pg)\,e^{-pg} \tag{1}$$

where g is the relative frequency variable, $g = |f - f_0| / f_0$, and p is an adjustable parameter. For the simplified filter in (1), the value of ERB/f (Equivalent Rectangular Bandwidth / frequency) is exactly 4/p, where $BW_{ER}$ (Band Width (Equivalent Rectangular)) is given by:

$$BW_{ER} = 6.23 f^2 + 93.39 f + 28.52 \ \text{(Hz)} \tag{2}$$

where f is in kHz. Alternatively, after [10]:

$$10 \log F(z) = 7.00 - 7.5\,(z - 0.215) - 17.5\,[0.196 + (z - 0.215)^2]^{0.5} \tag{3}$$

where the individual filters are spaced one Bark apart, with centre frequencies determined by:

$$f = 600 \sinh[(z + 0.5)/6], \quad z = 1, 2, \ldots, 19 \tag{4}$$

ensuring that all frequency components in the range 100 Hz to 8 kHz are within the 3 dB range of at least one filter (extending the frequency range covered in [10] from 5 kHz to 8 kHz). The rate at which new filters are added is defined by the Critical-Band-rate (CB-rate, or Bark), but the exact centre frequencies of the filters are of no consequence. Figure 1 shows a graph of ERB-rate (equation (5)) and CB-rate vs. frequency. It is apparent that there are more ERBs than CBs (29 ERBs vs. 19 CBs) within the required frequency range, 100 Hz to 8 kHz.

$$\text{ERB-rate} = 11.17 \ln \left| \frac{f + 0.312}{f + 14.68} \right| + 43.0 \tag{5}$$

where f is in kHz. The analysis included in this paper uses 19 filters on a 1 Bark spacing, where the individual filter characteristics are described by equations (3) and (4), and the filter bank is shown graphically in Figure 2. The filter bank, in conjunction with threshold, temporal-masking, and level-to-sensation mapping, forms the basis of the auditory model.
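As an illustration only, equations (2) to (5) can be evaluated numerically with a short Python sketch. The function names are hypothetical and are not taken from the model in [8]. The sketch assumes that z in equation (3) is the Bark offset from a filter's centre, that f in equation (4) is in Hz, and that f in equations (2) and (5) is in kHz as stated.

```python
import math

def bark_centre_hz(z):
    """Equation (4): centre frequency in Hz of the filter with Bark index z."""
    return 600.0 * math.sinh((z + 0.5) / 6.0)

def filter_gain_db(dz):
    """Equation (3): filter response in dB at dz Bark from the centre.

    Assumption: the z of equation (3) is read as an offset from the centre.
    """
    x = dz - 0.215
    return 7.0 - 7.5 * x - 17.5 * math.sqrt(0.196 + x * x)

def erb_hz(f_khz):
    """Equation (2): equivalent rectangular bandwidth in Hz, f in kHz."""
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52

def erb_rate(f_khz):
    """Equation (5): ERB-rate, f in kHz."""
    return 11.17 * math.log(abs((f_khz + 0.312) / (f_khz + 14.68))) + 43.0

if __name__ == "__main__":
    # Centre frequencies of the 19 filters on a 1 Bark spacing.
    centres = [bark_centre_hz(z) for z in range(1, 20)]
    for z, fc in enumerate(centres, start=1):
        print(f"filter {z:2d}: centre {fc:7.1f} Hz")
    # Number of ERBs spanned between 100 Hz and 8 kHz.
    print("ERBs, 100 Hz to 8 kHz:", erb_rate(8.0) - erb_rate(0.1))
```

Under these assumptions the 19 centre frequencies run from about 152 Hz to about 7.7 kHz, and the ERB-rate difference between 100 Hz and 8 kHz rounds to 29, consistent with the count quoted above.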
The auditory model can be used to predict the sensation (excitation pattern) of original and distorted versions of a signal, in the form of a "surface". Normalising the degraded sensation-surface by the original sensation-surface then gives the audible-error-surface, presented on a dB amplitude scale. The form of the error surface is given in Figure 3, which is the audible error for a fragment of speech degraded by an amplitude proportional noise distortion.

The audible error surface is useful for diagnostic investigation. The audible error due to particular signal features can be examined and conclusions drawn on both the cause of the error and the likely subjective impact. Specific errors, and their diagnostic interpretation, are included in the example measurements in Section 3.

It is not sufficient, however, to represent the audible error only in the form of a surface. To simplify routine testing, the overall subjective impact of the error must be estimated, ideally leading to a prediction of subjective opinion against a recognised opinion scale, such as YLE (Listening Effort) [11]. To achieve this, a higher level perceptual analysis, which is intended to be broadly analogous to human perception, has been developed [12]. This analysis interprets the overall psychoacoustic consequences of the error, and leads to an estimate of subjective opinion.

3.0 PERFORMANCE MEASUREMENT OF EXAMPLE DRSs
3.1 Test signal selection

The behaviour of a non-linear system is stimulus dependent. It is therefore essential that the test-stimulus is representative of the "in-service" signal. This will be predominantly: a signal with speech properties for low-rate speech codec testing, speech and music for broadcast systems, and music for domestic hi-fi equipment. The development of test signals with a particular range of statistical properties, and especially those of speech, is an important area in its own right. However, real speech and music were used for this investigation to allow an informal listening test, and because the measurements did not need to be automated. For the experimental investigation reported below, a test-signal composite was used containing:
(i) broadband speech (100 Hz to 8 kHz), narrowband speech (315 Hz to 4 kHz),
(ii) music (opening section of Bach's Italian Concerto in F major, played by Tatyana Nikolayeva) and,
(iii) pseudo-random noise.

It is possible to include specific objective test-signals, such as a chirp and maximum-length-sequence, into such a composite to yield conventional performance metrics. The test signal was compiled as a digital file, using a Studer Dyaxis digital audio editor, and stored as 48 kHz and 44.1 kHz DAT recordings.
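A maximum-length sequence of the kind mentioned above can be generated with a linear feedback shift register. The following Python sketch is illustrative only: the register length, feedback taps, and function names are assumptions chosen for brevity, and are not taken from the test signal used in this investigation.

```python
def mls_bits(n_bits=4, taps=(4, 3)):
    """Return one period (2**n_bits - 1 bits) of an LFSR output sequence.

    The default taps give a maximal-length sequence for a 4-bit register;
    other register lengths need their own primitive-polynomial taps.
    """
    state = [1] + [0] * (n_bits - 1)  # hypothetical non-zero start state
    out = []
    for _ in range(2 ** n_bits - 1):
        out.append(state[-1])          # output the oldest bit
        feedback = 0
        for t in taps:                 # XOR of the tapped register stages
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return out

def circular_autocorrelation(bits, lag):
    """Circular autocorrelation of the sequence mapped to +1/-1 samples."""
    x = [1 if b else -1 for b in bits]
    n = len(x)
    return sum(x[i] * x[(i + lag) % n] for i in range(n))

if __name__ == "__main__":
    seq = mls_bits()
    print("sequence:", seq)
    print("autocorrelation:",
          [circular_autocorrelation(seq, k) for k in range(len(seq))])
```

The two-valued circular autocorrelation (the sequence length at zero lag, -1 at every other lag) is the property that makes such a sequence convenient for conventional impulse-response style measurements.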
In each of the examples below, the test-signal was encoded and decoded by the relevant DRS, and analysed by the auditory model. An informal listening test was also conducted by the experimenter.

3.2 Low-bit-rate speech codec performance

Results for an example commercial low-rate codec are included because:
(i) the subjective opinion estimates of the perceptual analysis have been confirmed by rigorous subjective testing,
(ii) the signal degradations are clearly audible, and therefore readily measured and interpreted,
(iii) it provides a performance comparison with the other DRSs.

The speech codec in question is a regular pulse excited linear predictive codec (RPE-LPC) with long-term (pitch) prediction, and a net bit-rate of 13 kbit/s. Figures 4 and 5 show the sensation and audible-error predictions for a speech fragment. By careful examination of the audible-error-surface, and with reference to the relevant portion of the sensation-surface, the error can be seen to occur for:
(i) Voiced sounds; particularly when the amplitude is high, and
(ii) Unvoiced sounds; particularly when the rate of change of amplitude is high.

The scale of the error amplitude indicates that a clearly audible degradation has occurred. This is found to be the case in subjective experiments. If the audible-error-surfaces for a small corpus of speech material are calculated, the higher level perceptual analysis can be used to interpret the surfaces, and the subjective opinion of an average user predicted. The current analysis has been able to predict the subjective opinion of an average user to within 0.5 of an opinion score for a range of non-linear distortions, against the established 5-point listening effort scale.

3.3 DCC performance

The audible error-surface predicted for the DCC code and decode process resulted in a very high opinion score against the established five point telephony scale for listening effort. This is to be expected since the scale is designed to rank a wide range of audible distortions, while DCC is intended for hi-fi audio applications where audible errors are not acceptable. However, some error-surface features are produced, and these represent audible differences between the original and processed files.

Figures 6 and 7 show the sensation and audible-error predictions for a piano chord strike in the first few bars of Bach's Italian Concerto. There is a feature on the error-surface at the time of the chord strike. This is potentially due to:
(i) A change in the audible error with signal level,
(ii) Imperfections in the temporal masking predictions made by the DRS, i.e. the degree to which forward masking may be relied on to mask error has been inaccurately estimated.

Figures 8 and 9 show the sensation and audible-error prediction for a fragment of speech. The error-surface is very detailed, but shows more activity where the rates of change of signal amplitude are greatest. In particular, there is an error feature at around 475 ms which corresponds to the onset of a speech event.
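The kind of error-surface inspection described in this section can be sketched in Python. This is a toy illustration under stated assumptions, not the auditory model itself: the sensation surfaces are assumed to be already available as lists of frames, each frame holding one positive value per band, and the 20 log10 ratio, the frame duration, and the function names are hypothetical choices.

```python
import math

def error_surface_db(original, degraded, floor=1e-12):
    """Ratio of degraded to original sensation in dB, per (frame, band) cell.

    Assumption: a 20*log10 amplitude ratio; `floor` avoids log of zero.
    """
    surface = []
    for orig_frame, deg_frame in zip(original, degraded):
        surface.append([
            20.0 * math.log10(max(d, floor) / max(o, floor))
            for o, d in zip(orig_frame, deg_frame)
        ])
    return surface

def largest_error(surface, frame_ms):
    """Return (time_ms, band_number, error_db) of the largest absolute error."""
    best = (0.0, 1, 0.0)
    for t, frame in enumerate(surface):
        for b, err in enumerate(frame):
            if abs(err) > abs(best[2]):
                best = (t * frame_ms, b + 1, err)
    return best

if __name__ == "__main__":
    # Toy surfaces: 3 frames x 3 bands, with one band doubled in one frame.
    original = [[1.0, 1.0, 1.0] for _ in range(3)]
    degraded = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
    surface = error_surface_db(original, degraded)
    print(largest_error(surface, frame_ms=25.0))
```

Cells where the two surfaces agree map to 0 dB, so only the differing cells stand out, which is what allows a feature to be located in time and band.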
3.4 MiniDisc performance

As for DCC, the audible error-surface predicted for the MiniDisc code and decode process resulted in a very high predicted opinion score against the established five point telephony scale for listening effort. However, by examining the error-surface detail, it is also possible to predict just audible error due to particular signal artefacts within the model bandwidth.

Figures 10 and 11 show the sensation and audible-error estimates for a piano chord strike equivalent to that analysed for DCC. The audible error estimate does not show a particular feature at the chord strike onset, but gradually rises in level, mainly in bands 10 to 15, with signal energy. This is potentially due to changes in the criteria for dynamic bit allocation, and filter adaptation, as the signal energy to be coded increases.

Figures 12 and 13 show the sensation and audible-error estimates for a fragment of speech. Again the error-surface level rises with signal energy. There is also an error feature at around 425 ms which corresponds to the peak of the speech signal.

4.0 DISCUSSION AND CONCLUSIONS
There are numerous applications across a range of industries where DRSs are of great importance: the storage of digital signals for professional and domestic music applications, the transmission of high quality digital audio channels in broadcasting, and telecommunications. In telecommunications, demand for bandwidth is expected to increase dramatically as multi-media applications (such as video-telephony and home-working) become the norm. Bandwidth is already a serious issue for mobile telephony due to the limitations in available radio-spectrum, hence the use of low-rate speech coding in digital mobile telephones, e.g. GSM.

The perceptual analysis accounts for the main non-linear functions, and masking, of human hearing and can thus predict the audibility of signal errors. In particular, those errors which would not be audible, for example because they are masked by the rest of the signal, will not appear on the audible error surface. Conversely, errors which are not masked will be apparent, and will be represented by an appropriate psychoacoustic feature on the error-surface. A higher level analysis then interprets the error-surface to predict overall subjective opinion. The technique is of special value when evaluating systems which rely on psychoacoustic masking in order to operate, e.g. DRSs, since the performance of the psychoacoustic "rules" cannot be readily assessed objectively with conventional engineering metrics.

It is acknowledged that the psychoacoustic evaluation discussed in this paper is dependent on the practical auditory model. The performance predictions are therefore, at best, indicative of an average listener's opinion, but are dependent on the performance of the model estimates and assumptions. The performance of the auditory model, over the telephony bandwidth, has been assessed against subjective test data, obtained under well controlled conditions, and has been shown to produce realistic opinion predictions for a range of degradations.
As might be expected, DCC and MiniDisc both produce high scores from the auditory model intended for evaluating wideband telephony applications. For comparison, the 13 kbit/s telephony codec produces clearly audible errors and yields a reduced opinion score estimate.

Even within the 8 kHz bandwidth of the telephony auditory model, errors which are just audible are apparent for both DCC and MiniDisc systems. An examination of these errors as audible-error-surfaces illustrates the form of the perceptual error. This representation could not be readily achieved with conventional engineering performance metrics.

The model described in this paper is bandlimited to suit telecommunications applications, but in principle could be extended to cover the full auditory bandwidth. This increase in bandwidth would require further subjective testing in order to validate the model's performance and predictions.

It is interesting to hypothesise that the widespread use of DRSs, which work satisfactorily individually, will ultimately lead to audible distortions. The reason for this is the effect of the masked distortion products from the first DRS on the second, and subsequent, DRSs. Measuring the behaviour of tandem non-linear processes is, of course, within the capabilities of perceptual analysis.

To conclude, the paper describes the importance of DRSs across a range of industries, and introduces the benefits of perception-based performance metrics for such schemes. The benefits are: performance measures which reflect a user's opinion, and a powerful diagnostic tool which can illustrate the audibility of errors which result from particular signal artefacts.

5.0 ACKNOWLEDGEMENTS

The authors would like to thank their colleagues at BT Labs for their encouragement and support. Particular thanks are due to Phil Gray for reviewing the paper at short notice.
6.0 REFERENCES

[1] Halka U, Heuter U, "A New Approach to Objective Quality-Measures Based on Attribute-Matching", Speech Comms, Early 1992.
[2] NTIA, CCITT SG XII Contribution "Effects of speech amplitude normalization on NTIA objective voice quality assessment method", DOC. SQ-74.91, December 1991.
[3] Irii H, Kozono J, Kurashima K, "PROMOTE-A System for Estimating Speech Transmission Quality In Telephone Networks", NTT Review, Vol.3 No.5, September 1991.
[4] Paillard B, Mabilleau P, Morissette S, Soumagne J, "PERCEVAL: Perceptual Evaluation of the Quality of Audio Systems", J. Audio Eng. Soc., Vol.40, No.1/2, Jan/Feb 1992.
[5] Wang S, Sekey A, Gersho A, "An Objective Measure for Predicting Subjective Quality of Speech Coders", IEEE J. on Selected Areas in Communications, Vol.10, No.5, June 1992.
[6] Beerends J, Stemerdink J, "A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation", J. Audio Eng. Soc., Vol.40, No.12, December 1992.
[7] Stuart J, "Psychoacoustic models for evaluating error in audio systems", Procs. Inst. of Acoustics, Vol.13, No.7, November 1991.
[8] Hollier M P, Hawksford M O, Guard D R, "Characterisation of Communications Systems Using a Speech-Like Test Stimulus", 93rd AES Convention, San Francisco (Preprint 3395), October 1992.
[9] Moore B, Glasberg B, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns", J. Acoust. Soc. Am., Vol.74, No.3, September 1983.
[10] Sekey A, Hanson B, "Improved 1-Bark bandwidth auditory filter", J. Acoust. Soc. Am., Vol.75, No.6, June 1984.
[11] Section 7, "Subjective Opinion Tests", ITU-TS (CCITT), Blue Book Vol.V, Geneva 1989.
[12] Hollier M P, Hawksford M O, Guard D R, "Error-activity and error-entropy as a measure of psychoacoustic significance in the perceptual domain", Submitted to IEE-I Proceedings on Vision, Image and Signal Processing, November 1993.
Figure 1, CB-rate and ERB-rate vs. log Frequency (horizontal axis: Frequency [Hz], 100 to 10000).
Figure 2, Bandpass filter bank, after [10] (horizontal axis: Frequency [Hz]).
Figure 3, Audible-error-surface for speech fragment.
Figure 4, Auditory sensation estimate for speech fragment.
Figure 5, Audible-error-surface for speech fragment in Figure 4.
Figure 6, Auditory sensation estimate for music fragment.
Figure 7, Audible-error-surface for music fragment in Figure 6.
Figure 8, Auditory sensation estimate for speech fragment.
Figure 9, Audible-error-surface for the speech fragment in Figure 8.
Figure 10, Auditory sensation estimate for music fragment.
Figure 11, Audible-error-surface for music fragment in Figure 10.
Figure 12, Auditory sensation estimate for speech fragment.
Figure 13, Audible-error-surface for the speech fragment in Figure 12.