Audio Engineering Society
Convention Paper Presented at the 118th Convention 2005 May 28–31 Barcelona, Spain This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Multistimulus ranking versus pairwise comparison in assessing quality of musical instruments sounds
Luiza Budzyńska, Jacek Jelonek, Ewa Łukasik, Roman Słowiński and Robert Susmaga
Institute of Computing Science, Poznań University of Technology, Poznan, Poland e-mail contact:
[email protected]
ABSTRACT
The paper compares the process and the results of two different methods for ranking musical instruments. A dedicated software tool presents recorded sounds to the expert, who makes her/his assessment according to particular criteria using a multistimulus test on a scale from 1 to n and a pairwise comparison followed by the Net Flow Scoring method. Several aspects of the ranking process can be analyzed, e.g. consistency of the results for the two methods, stability of rankings over time and the cognitive effort required from the expert in each ranking method. The resulting rankings are compared using statistical measures such as the Kendall coefficient and the Blest coefficient. The results show that the multistimulus test was faster to perform but less distinctive than pairwise comparison, and that it demanded more cognitive effort.
1. INTRODUCTION
Assessing the quality of musical instrument voices belongs to a group of tests where stimuli are perceived by human senses, so – depending on the particular preferences of the expert – the assessment may prove strongly non-linear and the results may be highly subjective. The problem presented in this paper arose from a more general task in which the expert’s preference models were inferred from the objective characteristics of acoustic signals and stored in the form of “if… then…” decision rules. Such analysis was in particular applied to the voices of a set of violins submitted to the 10th International Henryk Wieniawski Violinmakers Competition in Poznań. An evaluation test that involves human senses has three main stages: sensation, perception and response. Therefore, the test paradigm should include methods for presenting the stimuli to the senses, methods for quantifying the sensory magnitude of the stimuli and methods for expressing preference-related responses. The evaluation tests are performed with a dedicated software tool, in which the elements of the test paradigm are efficiently supported. These include:
Budzyńska et al.
Multistimulus ranking vs pairwise comparison …
• proper presentation of stimuli to the expert,
• various methods for performing a ranking concordant with the expert’s preferences.
Assessment of the stimuli may be based on the preference test paradigm, with an ordinal scale forced by paired comparisons, or on the acceptance test paradigm, where a quantitative response on, e.g., a 1 to n point scale is expected (e.g. the nine-point hedonic equal-interval scale, describing the degree of liking, or the five-point Likert scale used in psychology).
Following the above assumptions, the software tool presents sound stimuli to the expert in a comprehensible manner and enables her/him to assess the objects on each particular criterion using two major approaches: pairwise comparison followed by the Net Flow Scoring method (NFS) for generating a ranking, and multistimulus grading on a scale from 1 to n. All items are presented on the screen, allowing the expert to compare the sounds of any pair of instruments. The number of comparison criteria may be chosen arbitrarily; in the case discussed the criteria are volume of the sound, timbre of the sound and equal sound volume of strings. A reference sound and an anchor may be added to the test if needed.
2. RANKING METHODS
Ranking of auditory and visual objects is a part of the more general domain of sensory evaluation – a method used to evoke, measure, analyse and interpret features of objects perceived through the senses of sight, smell, touch, taste and hearing [1,6]. It must define methods for presentation of the stimuli, quantification of the sensory magnitude of the stimuli (auditive measurement) and assessment of the acceptability of the stimulus in a so-called affective test. Affective tests may take the form of a preference test, in which the subject is asked to choose the preferred stimulus from a group of two or more stimuli, or an acceptance test, in which the acceptance/liking of the stimulus is graded on a given scale. Several scales are used, e.g. the five-point Likert scale or the nine-point hedonic scale expressing the degree of liking of each stimulus.
2.1. Pairwise comparison table (PCT)
Let A be a set of objects characterized by some 1-D and 2-D signals. The objects have to be ranked with respect to a chosen criterion. The evaluations of objects on this criterion are rather subjective and may change from one expert to another.
In order to support an expert in building her/his ranking of the objects, we acquire the preferential information from the expert in the form of pairwise comparisons of all pairs of objects, or of a subset of all pairs including each object in at least one pair [4,7].
The objects to be compared are stored in a pairwise comparison table (PCT), where each row corresponds to a pair of objects (x,y)∈A×A. For each particular ordered pair (x,y), the expert may make one of the following preferential assessments with respect to a given criterion:
• x weakly outranks y, denoted by xSwy,
• x strongly outranks y, denoted by xSsy,
• x does not outrank y, denoted by xScy.
The difference between the weak and strong outranking relations is that Sw denotes a much weaker preference than Ss; thus Sw is reflexive and nontransitive, while Ss is irreflexive and transitive. However, neither Sw nor Ss is complete.
Once the preference relation has been established for all or for the selected pairs of objects, an outranking graph can be drawn, where nodes correspond to objects from the set A. An edge from x to y is drawn if and only if (in the PCT) xSwy or xSsy, or if there exists a chain of relations xSsz, zSsu, uSsy.
In order to construct a ranking with respect to a considered criterion, we exploit the outranking graph using the Net Flow Scoring procedure. For each node in the outranking graph, which corresponds to an object x∈A, its score NFS(x) is calculated as the difference between the number of outgoing edges and the number of ingoing edges. These numbers, also called positive and negative flows, respectively, provide arguments about the relative strength and relative weakness of the node x. Aggregation of both arguments leads to the final value of NFS(x). The score is a real number, so if its value is known for every object, it is possible to construct a ranking as a total preorder of the objects according to their scores.
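The Net Flow Scoring procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in the software tool: the function names `net_flow_scores` and `ranking_from_scores`, the representation of the PCT as two sets of ordered pairs, and the treatment of chains of strong relations as a transitive closure of arbitrary length are all assumptions made for illustration.

```python
from itertools import product


def net_flow_scores(objects, weak, strong):
    """Return NFS(x) = (outgoing edges) - (ingoing edges) for every object.

    weak, strong: sets of ordered pairs (x, y) read as "x Sw y" / "x Ss y".
    (Hypothetical input format; the paper does not prescribe one.)
    """
    # Transitive closure of the strong relation: covers chains such as
    # x Ss z, z Ss u, u Ss y (assumed here to be of any length).
    closure = set(strong)
    changed = True
    while changed:
        changed = False
        for (x, z), (z2, y) in product(list(closure), repeat=2):
            if z == z2 and x != y and (x, y) not in closure:
                closure.add((x, y))
                changed = True

    # Edges: direct weak or strong outranking, plus strong chains.
    # Self-loops are dropped: each would add one outgoing and one ingoing
    # edge to the same node and therefore never changes its score.
    edges = {(x, y) for (x, y) in set(weak) | closure if x != y}

    scores = {x: 0 for x in objects}
    for x, y in edges:
        scores[x] += 1  # positive flow of x
        scores[y] -= 1  # negative flow of y
    return scores


def ranking_from_scores(scores):
    """Total preorder: best score first, objects with equal scores tied."""
    levels = sorted(set(scores.values()), reverse=True)
    return [sorted(o for o, s in scores.items() if s == v) for v in levels]


if __name__ == "__main__":
    scores = net_flow_scores(
        ["a", "b", "c", "d"],
        weak={("c", "d")},
        strong={("a", "b"), ("b", "c")},
    )
    print(ranking_from_scores(scores))  # [['a'], ['b'], ['c', 'd']]
```

In this toy example the chain a Ss b, b Ss c adds the edge from a to c, so a gets the score 2, b gets 0, and c and d tie at -1, which shows how NFS can leave objects indistinguishable in the resulting preorder.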
AES 118th Convention, Barcelona, Spain, 2005 May 28–31 Page 2 of 6
2.2. Multistimulus auditive and preference tests
The multistimulus method was introduced for comparing audio signals and systems with clearly audible differences on a given criterion [8]. The method offers the expert random access to the objects being assessed. It has been observed that experts tend to roughly sort the objects first and to grade them afterwards. Thus they get the benefit of both paired comparisons and grading. The assessor can switch at will between any of the objects under test. The comparison may also be carried out against the reference signal. Since the assessor can directly compare different objects, he can relatively easily detect differences and grade them accordingly.
3. ACOUSTIC DATA – VIOLIN VOICES
During the 10th International Henryk Wieniawski Violinmakers Competition in Poznań the voices of a set of violins submitted to the contest were recorded and placed in the AMATI database [5] for further processing. The recorded material includes:
• individual sounds played in detached mode on all open strings,
• plucked sounds on all open strings,
• the entire range of notes across a chromatic scale (two octaves) on each string,
• a range of notes of a diatonic scale from G3 to G6,
• a fragment of Johann Sebastian Bach’s Partita No. 2 in D minor, BWV 1004, part 3 – Sarabande.
The expert listeners evaluated each instrument according to several criteria (part of the set of criteria used during the competition), choosing sound material for listening at will (see Sect. 5).
4. THE SOFTWARE TOOL “RANKER”
The software tool “Ranker” has been created to enable the presentation of stimuli, the creation of rankings of audiovisual objects and the comparison of the ranking methods used. In its present form the tool allows presenting items to the expert and obtaining and storing her/his response (the assessment) in an effective way. The program also enables the choice of:
• data sets,
• method of assessment,
• number of criteria used,
• type of particular criterion,
• number of scale levels for scoring.
The following ranking options are available: single stimulus ranking, Pairwise Comparison Table (PCT) ranking and multistimulus ranking.
4.1. Single stimulus ranking
This simplest method allows the expert to assess individual objects presented to her/him in random order by the system. There is no possibility to change the decision.
4.2. PCT
The expert is confronted with two objects to be evaluated at a time. Any of the sounds (characterizing the objects) may be replayed an arbitrary number of times at any volume (Fig. 1).
Fig. 1. Pairwise comparison of two violin voices
For violin voices, it is possible to listen to all recorded sounds from the AMATI database. Having compared a pair of objects, the expert is confronted with another pair of objects to compare. It is important that the test is ‘blind’, i.e. the expert has no means of identifying the object. Nor can she/he change the decision.
4.3. Multistimulus ranking
The method has been inspired by the multi-stimulus method for comparing coded audio signals and systems with clearly audible differences, included in the ITU-R BS.1534 standard for subjective listening tests [8]. BS.1534 is called the multi-stimulus with hidden reference and anchor method; it uses a continuous quality scale with a sorting and grading process. Usually the assessor needs a reference in order to know the best-case or a benchmark performance. In the case of comparing competitive violin voices there may be a dummy instrument, but usually the listener considers one of the instruments as a benchmark. The expert has free access to all test items (Fig. 2).
Fig. 2. Multistimulus test of violin voices
Usually, the experts tend to sort and then grade the test items, being able to perform pairwise comparisons between freely chosen objects. Grading may be changed an arbitrary number of times before the end of the test. In this way, we get benefits of both paired-comparisons and grading. It has been proved that random access to assessed sound provides finer resolution and greater consistency of results than sequential access forced by the experiment designer [8]. Combining sorting and grading process also tends to provide more consistent inter-subject results.
4.4. Ways to compare rankings
The Kendall’s coefficient τ [2] is used to measure a ‘distance’ between two rankings. We will use it to assess a similarity between the rankings resulting from the methods described above. Let R1 and R2 be two m×m matrices where m is the number of ranked objects. R1 and R2 correspond to the two rankings being compared. Kendall’s τ coefficient is calculated from the formula:

$$\tau = 1 - 4\,\frac{d_k(R_1, R_2)}{m(m-1)}$$

where d_k(R1,R2) is the Kendall’s distance between matrices R1 and R2 defined as:

$$d_k(R_1, R_2) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left| r_{ij}^{1} - r_{ij}^{2} \right|$$

Elements r_{ij}^{1} (i,j=1,...,m) of the matrix R1 are defined as:

$$r_{ij}^{1} = \begin{cases} 0 & \text{if } i = j \text{ or } a_j \, P \, a_i \text{ in ranking 1} \\ 0.5 & \text{if } a_i \, I \, a_j \text{ in ranking 1} \\ 1 & \text{if } a_i \, P \, a_j \text{ in ranking 1} \end{cases}$$

Elements r_{ij}^{2} (i,j=1,...,m) of the matrix R2 are defined analogously. Kendall’s coefficient τ∈〈–1, +1〉: if τ=1 then the rankings are identical, if τ=–1 the two rankings are strictly inverse. It is usually assumed that τ ≥ 0.6 corresponds to similar rankings.
Blest measure of rank correlation [3] attaches more significance to the top part of the ranking (unlike the Kendall's coefficient, which assigns the same importance to all ranks). In the Blest measure discrepancies in top ranking are given more weight when calculating the correlation. This complies with our experience on evaluating violins.
The Blest index ν has similar properties to that of Kendall’s τ and is computed using the formula:

$$\nu = \frac{2n+1}{n-1} - \frac{12}{n(n-1)} \sum_{i=1}^{n} \left(1 - \frac{R_x}{n+1}\right)^{2} R_y$$

where Rx and Ry are ranks of the elements’ positions in the compared rankings, n – number of objects ranked.
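To make the top-weighting property of the Blest index concrete, the following Python sketch evaluates ν = (2n+1)/(n−1) − 12/(n(n−1)) · Σ (1 − Rx/(n+1))² · Ry for two rankings. It assumes that each ranking is given as a list of rank positions without ties and that rank 1 denotes the best object; the function name `blest_nu` is hypothetical and not taken from the paper.

```python
def blest_nu(rank_x, rank_y):
    """Blest index for two rankings given as lists of rank positions.

    rank_x[i] and rank_y[i] are the ranks of the same object i in the two
    rankings (assumed: 1 = best, no ties). The index is not symmetric in
    its arguments, so rank_x acts as the reference ranking.
    """
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two rankings of the same length n >= 2")
    # The weight (1 - Rx/(n+1))^2 is largest for objects at the top of
    # the reference ranking, so disagreements there cost more.
    total = sum((1 - rx / (n + 1)) ** 2 * ry for rx, ry in zip(rank_x, rank_y))
    return (2 * n + 1) / (n - 1) - 12 / (n * (n - 1)) * total


if __name__ == "__main__":
    reference = [1, 2, 3, 4]
    print(blest_nu(reference, [1, 2, 3, 4]))  # identical rankings
    print(blest_nu(reference, [2, 1, 3, 4]))  # top two objects swapped
    print(blest_nu(reference, [1, 2, 4, 3]))  # bottom two objects swapped
```

For four objects, swapping the top two gives ν of about 0.72, while swapping the bottom two gives about 0.88, so the same size of disagreement is penalized more heavily at the top of the ranking.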
5. EXPERIMENT
The main goal of the experiment was to examine the two rankings created by the two ranking methods: the pairwise comparison followed by the Net Flow Score and the multistimulus direct ranking. The similarity of results was estimated using the Kendall’s τ and Blest ν coefficients.
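The Kendall comparison used here can be sketched in Python directly from the matrix definition given in Section 4.4 (r_ij is 1 if object i is preferred to object j, 0.5 if they are indifferent, and 0 otherwise, including i = j). The sketch assumes each ranking is supplied as a list of rank positions, where a smaller number means a better position and equal numbers mean indifference; the function names are hypothetical.

```python
def preference_matrix(ranks):
    """Build the m x m matrix R from a list of rank positions.

    Assumed encoding: ranks[i] < ranks[j] means object i is preferred to j;
    equal values mean the two objects are indifferent.
    """
    m = len(ranks)
    R = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # diagonal entries stay 0
            if ranks[i] < ranks[j]:
                R[i][j] = 1.0  # a_i P a_j
            elif ranks[i] == ranks[j]:
                R[i][j] = 0.5  # a_i I a_j
    return R


def kendall_distance(R1, R2):
    """d_k = 1/2 * sum over all i, j of |r1_ij - r2_ij|."""
    return 0.5 * sum(
        abs(a - b) for row1, row2 in zip(R1, R2) for a, b in zip(row1, row2)
    )


def kendall_tau(ranks1, ranks2):
    """tau = 1 - 4 * d_k / (m * (m - 1))."""
    m = len(ranks1)
    if m != len(ranks2) or m < 2:
        raise ValueError("need two rankings of the same length m >= 2")
    d = kendall_distance(preference_matrix(ranks1), preference_matrix(ranks2))
    return 1 - 4 * d / (m * (m - 1))


if __name__ == "__main__":
    print(kendall_tau([1, 2, 3], [1, 2, 3]))  # identical rankings
    print(kendall_tau([1, 2, 3], [3, 2, 1]))  # strictly inverse rankings
    print(kendall_tau([1, 2, 3], [2, 1, 3]))  # one adjacent pair swapped
```

With this encoding identical rankings give τ = 1 and strictly inverse rankings give τ = −1, in agreement with the range stated for the coefficient; a single swapped pair among three objects gives τ = 1/3.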
The actual rankings were created by two independent, highly qualified experts – a violin player (“BB”) and a violinmaker (“CC”). Both were asked to rank the same set of 13 violins recorded during the Henryk Wieniawski competition. The instruments had been selected in a representative way from the set of all instruments taking part in the competition. The violins were to be ranked using the following criteria:
• volume of the sound,
• timbre of the sound,
• inter-string equality of the sound.
The last of the three criteria seems to be the most difficult to assess, as inter-string equality means simultaneously comparing both the volume and the timbre of individual strings and deciding how well they match each other. Since the task of comparing voices of the best quality musical instruments from the same family is highly specialized, we can only arrange it for individual qualified assessors.
5.1. The course of the experiment
The experiment started with the multistimulus direct ranking. The expert browsed all instruments subjected to the test and chose the reference instrument. Then he proceeded with a sort of pairwise comparisons, changing from time to time the grade previously assigned to the instrument on a given criterion. A scale from 1 to 10 was used for each criterion; however, the expert tended to flatten the scale, scoring the instruments not lower than 5, so the actual resolution was halved (resulting in an actual scale from 1 to 5 – such a span of the scale is used in certain standardized listening tests). The paired comparison of instrument voices appeared to be very attractive and easy for the expert in the first phase of the listening tests. However, it grew more and more tiring and the expert reported limited ability to express his preferences. Since he never used the possibility of skipping some pairs of violins or of assessing a given instrument as strongly outranking its counterpart (which automatically decreases the number of comparisons), all (n²–n)/2 comparisons for the three criteria had to be made, which, in our case (n = 13, i.e. 78 pairs per criterion), made a total of 78·3 = 234 evaluations.
5.2. The results
The similarity of the instrument rankings was slightly different for each of the three criteria since, as mentioned before, the difficulty of assessment was different in each case. An exemplary ordering of violin voices by one expert in two different tests (pairwise comparison and multistimulus ranking) for the sum of two criteria, volume and timbre of the sound, is shown in Fig. 3. Numbers denote identifiers of the instruments, encircled objects – violins ranked identically, arrows – the direction of outranking.
Fig. 3. Exemplary rankings of the 13 instruments.
We also considered some very simple combinations of these criteria (actually implemented as sums of rankings).
As can be observed in Fig. 3, the test based on pairwise comparisons seems to be more ‘informative’, in the sense that it generated a more distinct ranking. This phenomenon was also observed for other rankings. Summing up the remarks collected from the experts during the experiments, we conclude that the specificity of listening tests, which are linear in time, makes paired comparison easy and attractive at first glance, but rather tiring in the long run (the test becomes especially long when all possible pairs of objects are compared using the weak outranking relation). The degree of annoyance may be diminished by introducing multistimulus direct ranking, which combines pairwise comparisons with direct grading. Since it is usually assumed that Kendall’s τ coefficient is at least 0.6 for similar rankings, the methods did not always produce similar results – see Table 1. The particularly low value of the coefficient for the inter-string equality criterion reflects the difficulty in assessing this feature of the instrument (this feature is known to be difficult for the listener to assess; it is easier to assess for the player).
Table 1. Similarity of rankings generated by the two methods in terms of the Kendall coefficients

Criterion \ Expert     BB      CC
volume                 0.54    0.65
timbre                 0.38    0.59
inter-string           0.22    0.35
vol.+tim.              0.47    0.69
vol.+tim.+inter        0.38    0.65
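As a small illustrative check of the statement that the methods did not always produce similar results, the Python sketch below applies the commonly assumed threshold τ ≥ 0.6 to the values of Table 1. The dictionary layout and the function name `similar_cells` are assumptions made for this example only.

```python
# Kendall coefficients copied from Table 1 (criterion -> expert -> tau).
TABLE_1 = {
    "volume": {"BB": 0.54, "CC": 0.65},
    "timbre": {"BB": 0.38, "CC": 0.59},
    "inter-string": {"BB": 0.22, "CC": 0.35},
    "vol.+tim.": {"BB": 0.47, "CC": 0.69},
    "vol.+tim.+inter": {"BB": 0.38, "CC": 0.65},
}

# Threshold usually assumed for "similar" rankings.
SIMILARITY_THRESHOLD = 0.6


def similar_cells(table, threshold=SIMILARITY_THRESHOLD):
    """List (criterion, expert) pairs whose coefficient reaches the threshold."""
    return [
        (criterion, expert)
        for criterion, row in table.items()
        for expert, tau in row.items()
        if tau >= threshold
    ]


if __name__ == "__main__":
    for criterion, expert in similar_cells(TABLE_1):
        print(criterion, expert, TABLE_1[criterion][expert])
```

Under this threshold only three of the ten cells qualify, all of them for expert CC (volume, vol.+tim. and vol.+tim.+inter); none of the rankings produced by expert BB reaches 0.6.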
The multistimulus direct ranking method is faster, but it demands more cognitive effort from the expert. Additionally, a particular phenomenon of ‘scale compression’ was observed. The experts, being afraid of grading good instruments too low, used only a part of the available scale (usually the upper part).

Table 2. Similarity of rankings generated by the two methods in terms of the Blest coefficients

Criterion \ Expert     BB      CC
volume                 0.80    0.81
timbre                 0.68    0.64
inter-string           0.46    0.21
vol.+tim.              0.62    0.79
vol.+tim.+inter        0.55    0.74
Another important phenomenon observed is a kind of non-linearity in the way the experts treat the created rankings. They tend to pay much more attention to the top part of the rankings, thus slightly neglecting the bottom parts. To take this into account, we also compare the rankings in terms of the Blest ν coefficient, which puts more weight on the top of the ranking (see Table 2).
6. CONCLUSIONS
The paper reports on a series of experiments with two methods for object ranking: the pairwise comparison followed by the Net Flow Scoring and the multistimulus auditive and preference test based on the MUSHRA method. Both ranking methods have been tested from the point of view of similarity between the generated results and expert preferences for the method, although other objectives, like stability of the results in time, equivalence of individual experts’ rankings, etc. may also be considered. First of all, the similarities between the results generated by the two methods were satisfactory. The multistimulus test was quicker to perform but gave less distinctive results than pairwise comparison. The latter was more attractive for the expert at the beginning of the test (less cognitive cost), but with time it tended to become tiring, as it took rather long to perform.
Future experiments may concentrate on different data sets as well as on new comparative aspects, like stability of rankings over time and assessment of the cognitive effort required from the expert in each of the ranking methods.
7. REFERENCES
[1] Bech S., “Test paradigm and selection”, www.aes.org/technical.
[2] Budzyńska L., Jelonek J., Łukasik E., Słowiński R., “Supporting Experts In Ranking Generic Audio And Visual Objects”, Proceedings of Signal Processing'2004, Poznan, pp. 81–86.
[3] Genest C., Franc J., “On Blest's measure of rank correlation”, The Canadian Journal of Statistics, 31 (1), 2003, pp. 1–18.
[4] Greco S., Matarazzo B., Słowiński R., “Rough sets theory for multicriteria decision analysis”, European Journal of Operational Research 129 (2001), pp. 1–47.
[5] Łukasik E., “Multimedia Database of Violin Sounds”, Proc. of Stockholm Musical Acoustics Conference (SMAC’03), Stockholm, pp. 79–82.
[6] Meilgaard M., Civille G.V., Carr B.T., Sensory Evaluation Techniques, 3rd Edition, CRC Press, 1999.
[7] Słowiński R., Greco S., Matarazzo B., “Mining decision-rule preference model from rough approximation of a preference relation”, Proc. of the 26th IEEE Annual Int. Conf. on Computer Software & Applications (COMPSAC 2002), pp. 1129–1134.
[8] Soulodre G., “Overview of ITU-R BS.1534 (The MUSHRA Method)”, www.aes.org/technical.