Submitted to the Annals of Applied Statistics

arXiv:1705.08805v1 [stat.AP] 24 May 2017

A BAYESIAN MALLOWS APPROACH TO NON-TRANSITIVE PAIR COMPARISON DATA: HOW HUMAN ARE SOUNDS? By Marta Crispino1, Elja Arjas2,3, Valeria Vitelli3, Natasha Barrett4 and Arnoldo Frigessi3. Bocconi University1, University of Helsinki2, University of Oslo3, Norwegian State Academy for Music in Oslo4.

We propose a Bayesian probabilistic method to learn preferences from non-transitive pairwise comparison data, as happens when one (or more) individual preferences in the data contradict what is implied by the others. Lack of transitivity easily arises when the items compared are perceived as rather similar and when the pairwise comparisons are presented sequentially without allowing for consistency checks. We build a Bayesian Mallows model in order to handle non-transitive data, with a latent layer of uncertainty which captures the generation of preference misreporting. We then develop a mixture extension of the Mallows model, able to learn individual preferences in a heterogeneous population, which is particularly important in applications. We are interested in learning how listeners perceive sounds as having human origins. An experiment was performed with a series of electronically synthesized sounds, and listeners were asked to compare them in pairs. The result of our analysis is relevant for composers and sound designers whose aim is to understand how computer generated sounds can sound more human.

1. Introduction. We consider pairwise comparison data in the form "x is preferred to y", denoted x ≺ y, where x and y are some items of interest to be compared. The word "preference" is interpreted in a broad sense as an order relation. When the set of items is large, users are unlikely to be able, or willing, to inspect and compare all items in order to perform a complete ranking. A pairwise comparison test is then often preferred, and sometimes it is the only possible experimental procedure: when differences between items are small it is in fact recommended to collect preferences through comparisons between items in pairs (Agresti, 1996). In this paper we consider pairwise comparison data coming from an experiment where people were asked to listen to pairs of different abstract sounds, and to tell which of the two was perceived as more human.

Keywords and phrases: Pairwise comparisons, non-transitivity, ranking data, Mallows model, Bayesian learning, mixture models.

The results of this test are very interesting for musicologists, composers and sound designers, whose aim is to understand how human performance expression can be communicated through spatial audio, leading to computer generated sounds appearing more life-like. Although every sound can be regarded as 'spatial' in that sound waves propagate through space, the term 'spatial audio' is here used to describe the way sound captures the physical movement in 3-D needed to produce it. The cohort of listeners who took part in the experiment had varying backgrounds, ranging from musicologists to university students. We therefore expected listeners to cluster into groups, holding different opinions about the degree of human causation behind the sounds. In addition to the grouping of the listeners around a shared consensus ranking of the "humanness of sounds", we were interested in studying the association between individual listeners' rankings and their own musical experience or musical background. This application is described in detail in Section 2.

When dealing with this kind of data a major challenge arises: pairwise preferences are not always transitive (Tversky, 1969). By transitivity we mean that, for every triplet of items {x, y, z}, x ≺ y and y ≺ z imply x ≺ z. In practice this property is often violated; the data we deal with indeed contain preferences of the form x ≺ y, y ≺ z, but z ≺ x. Importantly, we consider this kind of non-transitivity at an individual level, not the non-transitivities arising when aggregating preferences across users, as under majority rule or Condorcet voting. Non-transitivity of preferences can arise for many reasons, for example users' inattentiveness, uncertainty in their preferences, and actual confusion, even when one specific criterion for ranking is used.
These situations are so common that most pairwise comparison data are in fact non-transitive, creating a need for methods able to predict individual preferences from data that lack logical transitivity. To our knowledge, most methods designed to estimate individual rankings from pairwise comparison data do not handle individual-level non-transitivity: they either drop the offending pairs, or they focus only on the estimation of the consensus ranking, without specifically modeling the non-transitivity characterizing the data. We believe that the only attempt to account for non-transitivity from a probabilistic perspective is Volkovs and Zemel (2014), who propose their Multinomial Preference model, a score-based method that generalizes the Plackett-Luce model (Luce, 1959; Plackett, 1975). A brief review of methods for pairwise preference data is presented in Section 5.

In this paper, we incorporate the non-transitive patterns of the data directly into the Mallows model (Mallows, 1957; Diaconis, 1988). This is done by postulating the existence of a true latent individual ranking of the items for each user, and assuming that non-transitive patterns may arise because users make mistakes by switching the order of some pairs. The new method is presented in Section 3. We distinguish between the two following situations:

1. The probability of making a mistake is constant, independent of the pair being assessed, and also independent of all other comparisons made by the same user. These assumptions model, for example, a mouse click mistake or a random preference between a pair of items.
2. The probability of making a mistake depends on the items being compared: the stronger the individual preference between a pair of items is, the smaller the probability of making a mistake. This models the situation where items are more easily mis-compared by the user when they are rather similar in her personal ranking.

These situations correspond to two different generative models for mistakes, which call for different strategies (described in Sections 3.1 and 3.2). Our method provides the posterior distribution of the consensus ranking, as well as the posterior distribution of the latent individual ranking of each user, and follows the approach in Vitelli et al. (2017). The consensus ranking can be seen as a model-based Bayesian aggregation of the individual preferences of a group of users. It is analogous to the quantities which are usually of interest in the rank aggregation literature (Negahban, Oh and Shah, 2012; Dwork et al., 2001; Kenyon-Mathieu and Schudy, 2007; Rajkumar et al., 2015). The estimated posterior distributions of the individual rankings can be of great interest, for example, when performing personalized recommendations, or in studying how individual preferences vary with user-related covariates. In Section 3.3 we present a cluster model, which allows us to treat groups of users with different consensus rankings. Inference is based on a Markov chain Monte Carlo algorithm, outlined in Section 4.
Section 6 is devoted to simulations, while in Section 7 we apply our method to the sound data, showing that the model identifies interesting clusters of listeners with similar perceptions of electroacoustic sounds. Finally, in Section 8 we summarize the contributions of this paper.

2. Acousmatic music experiment. Acousmatic music is a type of electronic music composed for presentation over loudspeakers, as opposed to live or video recorded performance. The composer manipulates digitally recorded sounds, so that the cause of the sound, be it a musical instrument or any other sound-making system, remains hidden. Indeed, when sounds are played over loudspeakers there are no visual cues to help listeners understand how the sounds were made. On the other hand, when we hear the sound of musical instruments or sounds from our everyday environment, we are able to recognize their cause, since in visual settings we obtain the information that indicates the sounding object, i.e. its causation. To give an example: if you listen to the crying of a baby, you understand that a baby produced this sound. Since the advent of recording technology, abstract sounds (that is, sounds transformed with computer tools) have been used not only in acousmatic music, but also in much of the sound-world we experience over the Internet, TV and film. The question of interest is related to the capacity of listeners to identify the presence of human causation through the spatial behavior of abstract sounds. Spatial in this context describes the fact that the causation of sound happens as an action in 3-D space.

The starting point for the experiment was a high-speed motion tracking recording of the physical movement used to produce one selected sound: a cellist bowing a down-bow chord. Features of this 3-D movement were successively subtracted, resulting in a series of 12 motion datasets of varying proximity to the original. The motion data were then made audible by a process called parameter-mapping sonification (Grond and Berger, 2011), where parameters in the data are mapped to parameters controlling computer generated sound (see Figure 1). The mapping rules are chosen to draw on our everyday perception of spatial motion, which involves not only absolute 3-D spatial location but also changes in volume, intensity and pitch, correlated with changes in proximity and speed. In other words, listeners heard the physical spatial motion through sonification, rather than hearing the sound that the motion created, which, in this instance, was the sound of the cello.

Fig 1: A scheme of the process generating sounds.

Testing how listeners perceive a sound for which we lack a clear and commonly understood descriptive vocabulary is problematic; pairwise comparison is therefore the most appropriate design. Furthermore, listeners' varying familiarity with acousmatic music (or, more generally, with abstract sounds), as well as personal tastes, affect the test results.


2.1. Pair comparison experiment. N = 46 listeners were presented with a series of sounds, called spatial audio stimuli because they originate from a spatial 3-D performance. The number of stimuli was n = 12. Test stimulus 1 (S1) was designed to most clearly sonify all features of the data. Each of the other 11 test stimuli was sonified by modifying one or more features of the data: removing pitch and volume variation, flattening directional changes in the motion, or slowing the overall motion speed. The test stimuli can be summarized as follows:

S2: Spatial motion occurring in front.
S3: Played in mono over one speaker, direct-front.
S4: Partial flattening of 3-D spatial variation, leaving the main direction changes.
S5: Total flattening of 3-D spatial variation, leaving the main direction changes.
S6: Removal of volume variation.
S7: Removal of pitch variation.
S8: Removal of pitch and volume variation.
S9: Partial flattening of 3-D spatial variation and removal of pitch and volume variation.
S10: Total flattening of 3-D spatial variation and removal of pitch and volume variation.
S11: S1 played 30% slower.
S12: S1 played 50% slower.

Each listener was then exposed to M = 30 pairs of these sounds, about 45% of the total number of possible pairs of 12 stimuli. The pairs were chosen randomly and independently for each user, and the order in which the sounds were played was also randomized. Listeners were then asked to indicate, for each pair, which of the two stimuli most evoked a sensation of human physical movement of any kind. They were asked to follow their feelings, rather than to imagine watching a performance. The listeners were not told that the source motion stemmed from a cellist, nor were they asked to identify a specific human spatial movement. Before the test, participants were informed that the sounds were made by sonification and were warned that the sounds might be heard as strange. Each listener carried out the test sitting centrally to the loudspeaker array. Prior to the experiment, listeners were presented with a short training session of three sounds not used in the test sequence. When the experiment began, the test number was displayed on a computer screen, answers were written on paper, and listeners were requested to always make a choice, even if they found it difficult to decide. If needed, they could ask to hear a test pair a second time. At the end, they were asked to complete two questionnaires, the aim of which was to assign a Musical Sophistication Index score (MSI) and a rating of Spatial Audio Awareness (SAA) to each listener. The MSI used was the Ollen musical sophistication index (Ollen, 2006), an online survey that tests the validity of 29 indicators of musical sophistication used in the published music research literature. The SAA index consisted of five questions serving as indicators of how aware listeners were of spatial audio, regardless of musical background. Such a test did not exist in the literature, and had to be custom designed for the experiment.

The choice to rely on a pairwise comparison experiment is crucially based on the listeners' lack of experience with abstract sounds: it is easier for the participants to compare two sounds than to be exposed to several, which could create confusion. The experiment was indeed difficult, as expected: 37 listeners (80%) reported non-transitivities in their pair comparisons, and only 9 out of 46 listeners were able to stay consistent with themselves. A full description of the basic background, hypotheses, experimental setup, developed scores, and discussion of results will appear in Barrett and Crispino (2017), in preparation.

3. Bayesian Mallows models for non-transitive pairwise comparisons. We consider the situation where N users independently express their preferences between pairs of the n items in $\mathcal{O} = \{O_1, \dots, O_n\}$. In many situations of practical interest the users do not decide on the set of pairs to be considered, which is instead assigned to them by an external authority. In this paper we do not model the way in which the pairs are chosen, and simply assume that each user j receives a different subset $\mathcal{C}_j = \{C_{j1}, \dots, C_{jM_j}\}$ of $M_j \le n(n-1)/2$ random pairs.
Let $B_j = \{B_{j1}, \dots, B_{jM_j}\}$ be the set of pairwise preferences given by user j, where $B_{jm}$ is the order that user j assigned to the pair $C_{jm}$. For example, if $C_{jm} = \{O_{m_1}, O_{m_2}\}$, it could be that $B_{jm} = (O_{m_1} \prec O_{m_2})$, $m_1, m_2 \in \{1, \dots, n\}$, meaning that item $O_{m_1}$ is preferred to item $O_{m_2}$. Such data can be incomplete, since not all items, nor all pairs, are necessarily handled by each user. We assume no ties in the data, that is, users are forced to express their preference for all pairs in the list $\mathcal{C}_j$ assigned to them, and indifference is not permitted.

We denote a generic ranking by $r = (r_1, \dots, r_n) \in \mathcal{P}_n$, where $r_i \in \{1, \dots, n\}$ is the rank of item $O_i$ (the most preferred item has rank $r_i = 1$), and $\mathcal{P}_n$ is the space of n-dimensional permutations. A widely used distance-based family of distributions for ranks in $\mathcal{P}_n$ is the Mallows model (Mallows, 1957; Diaconis, 1988). According to the Mallows model, the probability density of a given ranking $r = (r_1, \dots, r_n)$, here denoted by $\mathrm{Mal}(\rho, \alpha)$, is given by

$$ f_R(r \mid \alpha, \rho) := \frac{\exp\{-\frac{\alpha}{n}\, d(r, \rho)\}}{Z_n(\alpha)}\, 1_{\mathcal{P}_n}(r). \qquad (1) $$
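To make the density (1) concrete, it can be evaluated by brute force for small n. The Python sketch below is our own illustration, not the paper's code: the helper names `footrule` and `mallows_density` are ours, the footrule distance is one of the choices discussed later in the paper, and the exhaustive enumeration of $\mathcal{P}_n$ is only feasible for small n (the paper instead uses exact formulas or importance sampling for $Z_n(\alpha)$).

```python
import math
from itertools import permutations

def footrule(r, rho):
    """Footrule (l1) distance between two rankings of the same n items."""
    return sum(abs(a - b) for a, b in zip(r, rho))

def mallows_density(r, rho, alpha):
    """Probability of ranking r under eq. (1), with the partition function
    Z_n(alpha) computed by exhaustive enumeration of P_n (small n only)."""
    n = len(rho)
    kernel = lambda x: math.exp(-alpha / n * footrule(x, rho))
    Z = sum(kernel(p) for p in permutations(range(1, n + 1)))
    return kernel(tuple(r)) / Z
```

By construction the probabilities sum to one over $\mathcal{P}_n$, and the consensus $\rho$ is the mode of the distribution.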

In (1), $\rho \in \mathcal{P}_n$ is the location parameter representing the shared consensus ranking, $\alpha > 0$ is the scale parameter measuring the concentration of the data around $\rho$, and $d(\cdot, \cdot)$ is a distance function between two n-dimensional permutations that satisfies right-invariance (Diaconis, 1988), i.e., $d(r, \rho) = d(r \circ r', \rho \circ r')$ for all $r, r', \rho \in \mathcal{P}_n$, where $\rho \circ r' = \rho_{r'} = (\rho_{r'_1}, \dots, \rho_{r'_n})$. Right-invariance is crucial since from this property it follows that the partition function of (1) does not depend on the location parameter, and can then be written as $Z_n(\alpha) = \sum_{r \in \mathcal{P}_n} \exp\{-\frac{\alpha}{n}\, d(r, \mathbf{1}_n)\}$, where $\mathbf{1}_n = (1, \dots, n)$ (see for example Mukherjee (2016)). When the distance function in (1) is chosen to be the Kendall, the Cayley, or the Hamming distance, the partition function of the Mallows model is available in closed form (Fligner and Verducci, 1986). For this reason, most of the work on the Mallows model has been limited to these distances (see, for example, Fligner and Verducci (1986), Lu and Boutilier (2014), Irurozki, Calvo and Lozano (2016, 2014)); the Mallows model with other distance functions has received less attention because of its computational complexity. Recently, Vitelli et al. (2017) gave a procedure to compute $Z_n(\alpha)$ when the footrule and Spearman distances are used, either exactly (up to some moderate values of n), or approximated through an importance sampling technique. The authors set the original Mallows model in a Bayesian framework, also allowing for data in the form of transitive pairwise comparisons.

The main assumption of our model is that each user j has a personal latent ranking, $R_j = (R_{j1}, \dots, R_{jn}) \in \mathcal{P}_n$, distributed according to the Mallows density (1), $R_1, \dots, R_N \mid \rho, \alpha \overset{\text{i.i.d.}}{\sim} \mathrm{Mal}(\rho, \alpha)$. We model the situation where each user j, when announcing her preferences, matches the items under comparison with her latent ranking $R_j$. Then, if the user is consistent with $R_j$, the pairwise orderings in $B_j$ are induced by $R_j$ according to

$$ (O_{m_1} \prec O_{m_2}) \iff R_{jm_1} < R_{jm_2}, \qquad (2) $$

where $R_{jm_i}$ denotes the rank of item $O_{m_i}$ in $R_j$. In this case the set of pairwise orderings $B_j$ contains only transitive (mutually compatible) preferences, since the preferences are induced from a complete ranking in $\mathcal{P}_n$ which, by definition, is transitive. The transitive closure of a set of pairwise preferences, denoted by $\mathrm{tc}(B_j)$, is the smallest set that consistently extends the original preference set: it is defined as the union of $B_j$ and all pairwise preferences that are not explicitly given but are induced from $B_j$ by transitivity. In this case it is possible to first compute $\mathrm{tc}(B_j)$, and second, to make inference on the posterior distribution of the Mallows parameters by integrating out all the rankings $r \in \mathcal{P}_n$ that are compatible with the transitive closure of the preference sets, denoted by $r \leftarrow \mathrm{tc}(B_j)$,

$$ \pi(\alpha, \rho \mid B_1, \dots, B_N) \propto \pi(\alpha)\, \pi(\rho) \prod_{j=1}^N \Bigg[ \sum_{r \leftarrow \mathrm{tc}(B_j)} f_{R_j}(r \mid \alpha, \rho) \Bigg]. \qquad (3) $$

This setting was described in Vitelli et al. (2017), Section 4.2. If the user is not fully consistent with her latent ranking, the pairwise orderings in $B_j$ may not be mutually compatible. In such a case the transitive closure may not exist and the previous procedure cannot be followed; a model able to account for non-transitive patterns in the data is therefore needed in this setting. We propose a probabilistic strategy based on the assumption that non-transitivities are due to mistakes in deriving the pair orders from the latent ranking $R_j$. The likelihood assumed for a set of preferences $B_j$ (analogous to the summation of eq. (3)) is

$$ f(B_j \mid \alpha, \rho) = \sum_{r \in \mathcal{P}_n} f(B_j, R_j = r \mid \alpha, \rho) = \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha, \rho)\, f(B_j \mid R_j = r), \qquad (4) $$

where $f(B_j \mid R_j = r)$ is the probability of ordering the pairs in $\mathcal{C}_j$ as in $B_j$ (possibly generating non-transitivities) when the latent ranking of user j is $R_j = r$. It can therefore be seen as the error model in this context, and will be specified below. The joint posterior of the model parameters is then

$$ \pi(\alpha, \rho \mid B_1, \dots, B_N) \propto \pi(\alpha)\, \pi(\rho) \prod_{j=1}^N \Bigg[ \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha, \rho)\, f(B_j \mid R_j = r) \Bigg]. $$

In this paper we have assumed a gamma prior for $\alpha$, $\pi(\alpha) = \frac{\lambda^\gamma}{\Gamma(\gamma)}\, \alpha^{\gamma-1} e^{-\lambda\alpha}\, 1_{\mathbb{R}^+}(\alpha)$, and the uniform prior on $\mathcal{P}_n$ for $\rho$, $\pi(\rho) = \frac{1_{\mathcal{P}_n}(\rho)}{n!}$. This strategy is able to recover possible linear orderings close (in terms of some given distance) to the non-transitive sets of preferences. We developed two basic models for the probability of making a mistake: the Bernoulli model (BM) and the Logistic model (LM). BM assumes that non-transitivities arise from random mistakes, while LM assumes that non-transitivities arise from mistakes due to the difficulty of ordering similar items.
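Before specifying the two mistake models, it is useful to see concretely why a consistent user never produces non-transitive data, and how non-transitivity can be detected. The Python sketch below is our own illustration (the helpers `induced_preferences` and `is_transitive` are hypothetical names, not from the paper): a complete ranking induces pairwise orderings via eq. (2), and a preference set is transitive exactly when the directed graph "preferred item → less preferred item" has no cycle.

```python
from itertools import combinations

def induced_preferences(ranks):
    """Pairwise orderings induced by a complete ranking via eq. (2).
    ranks[i] is the rank of item i (1 = best); returns pairs (preferred, other)."""
    items = range(len(ranks))
    return {(i, k) if ranks[i] < ranks[k] else (k, i)
            for i, k in combinations(items, 2)}

def is_transitive(prefs, n_items):
    """True iff the preference digraph is acyclic (depth-first search)."""
    graph = {i: [] for i in range(n_items)}
    for a, b in prefs:
        graph[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * n_items
    def dfs(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY or (color[v] == WHITE and dfs(v)):
                return True  # back edge found: a preference cycle exists
        color[u] = BLACK
        return False
    return not any(color[u] == WHITE and dfs(u) for u in range(n_items))
```

Any set produced by `induced_preferences` passes the check, while a cyclic set such as {(0,1), (1,2), (2,0)} (the pattern x ≺ y, y ≺ z, z ≺ x) fails it, which is exactly the case where the transitive closure does not exist.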

3.1. Bernoulli model (BM). We assume that the pairwise comparisons given by a user are conditionally independent given her latent ranking $R_j$,

$$ f(B_j \mid R_j = r) = \prod_{m=1}^{M_j} f(B_{jm} \mid R_j = r). \qquad (5) $$

We define here a function of a given comparison $B_{jm} = (O_{m_1} \prec O_{m_2})$ and of a given ranking $r = (r_1, \dots, r_n) \in \mathcal{P}_n$,

$$ g(B_{jm}, r) = \begin{cases} 0 & \text{if } r_{m_1} < r_{m_2} \\ 1 & \text{otherwise,} \end{cases} $$

where $m_1$ is the index of the preferred item $O_{m_1}$ in the m-th comparison $B_{jm}$ of user j, and $m_2$ is the index of the less preferred item. Thus $g(B_{jm}, r) = 1$ if the preference order of $B_{jm}$ contradicts the one implied by the ranking r (in the sense of eq. (2)). We then assume the following Bernoulli-type model for the probability that user j makes a mistake in a given pairwise comparison $B_{jm}$, that is, the probability that she reverses the true latent preference implied by her latent ranking $R_j$:

$$ \mathrm{P}(B_{jm}\ \text{mistake} \mid \theta, R_j = r) = \mathrm{P}(g(B_{jm}, r) = 1 \mid \theta, R_j = r) = \theta, \qquad \theta \in [0, 0.5). $$

The probability of eq. (5) is then given by

$$ f(B_j \mid \theta, R_j = r) = \left( \frac{\theta}{1-\theta} \right)^{\sum_{m=1}^{M_j} g(B_{jm}, r)} (1-\theta)^{M_j}. $$

We assign to $\theta$ a Beta distribution truncated to the interval $[0, 0.5)$ as prior, with given hyperparameters $\kappa_1$ and $\kappa_2$: $\pi(\theta) \propto \theta^{\kappa_1 - 1}(1-\theta)^{\kappa_2 - 1}\, 1_{[0, 0.5)}(\theta)$, conjugate to the Bernoulli model. We choose the truncated Beta mainly for identification purposes, but this choice is also motivated by the fact that we want to force the probability of making a mistake to be less than 0.5. Let $B_{1:N}$ be a shorthand for $B_1, \dots, B_N$, and $R_{1:N}$ for $R_1, \dots, R_N$. The posterior density of the model parameters, defined on the support $S = \{\alpha > 0\} \cap \{\rho \in \mathcal{P}_n\} \cap \{R_j \in \mathcal{P}_n\}_{j=1}^N \cap \{0 \le \theta < 0.5\}$, has the following form,

$$ \pi(\alpha, \rho, \theta \mid B_{1:N}) \propto \pi(\alpha)\, \pi(\rho)\, \pi(\theta) \prod_{j=1}^N \Bigg[ \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha, \rho)\, f(B_j \mid \theta, R_j = r) \Bigg]. \qquad (6) $$
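The Bernoulli error likelihood is simple enough to compute directly. The Python sketch below is our own toy illustration (helper names `g` and `bernoulli_likelihood` are ours; items are 0-indexed, `ranks[i]` is the rank of item i, and a preference `(m1, m2)` means item `m1` was declared preferred to item `m2`):

```python
def g(pref, ranks):
    """g(B_jm, r): 1 if the stated preference contradicts ranking r, else 0."""
    m1, m2 = pref
    return 0 if ranks[m1] < ranks[m2] else 1

def bernoulli_likelihood(prefs, ranks, theta):
    """f(B_j | theta, R_j = r) under the Bernoulli mistake model:
    (theta / (1 - theta))^(number of mistakes) * (1 - theta)^M_j."""
    mistakes = sum(g(b, ranks) for b in prefs)
    M = len(prefs)
    return (theta / (1 - theta)) ** mistakes * (1 - theta) ** M
```

For instance, a preference set fully consistent with the ranking has likelihood $(1-\theta)^{M_j}$, while each contradicted comparison swaps one factor $(1-\theta)$ for a factor $\theta$.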

We sample from the density of eq. (6) through an augmented sampling scheme, by first updating $\alpha$, $\rho$ and $\theta$ given $B_{1:N}$ and $R_{1:N}$, and then updating $R_{1:N}$ given $\alpha$, $\rho$, $\theta$ and $B_{1:N}$. The former step is performed by using the conditional density

$$ \pi(\alpha, \rho, \theta \mid B_{1:N}, R_{1:N}) \propto \alpha^{\gamma-1}\, e^{-\alpha\left(\lambda + \frac{1}{n}\sum_{j=1}^N d(R_j, \rho)\right) - N \ln Z_n(\alpha)} \cdot \left( \frac{\theta}{1-\theta} \right)^{\kappa_1 - 1 + \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)} (1-\theta)^{\kappa_2 + \kappa_1 - 2 + \sum_{j=1}^N M_j}. \qquad (7) $$

The second step is performed by using the density

$$ \pi(R_{1:N} \mid \alpha, \rho, \theta, B_{1:N}) \propto \pi(R_{1:N} \mid \alpha, \rho)\, \pi(B_{1:N} \mid \theta, R_{1:N}) = \frac{e^{-\frac{\alpha}{n}\sum_{j=1}^N d(R_j, \rho)}}{[Z_n(\alpha)]^N} \left( \frac{\theta}{1-\theta} \right)^{\sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)} (1-\theta)^{\sum_{j=1}^N M_j}. \qquad (8) $$

Figure 2 shows a graphical representation of the hierarchical construction of the Bernoulli model for mistakes.

Fig 2: Graphical representation of the Bernoulli model for mistakes (nodes $\rho$, $\alpha$, $\theta$, $R_j$ and $B_{jm}$, with plates over $m = 1, \dots, M_j$ and $j = 1, \dots, N$).

3.2. Logistic model (LM). The idea behind the logistic model for mistakes is that user j is more likely to be confused (and consequently to make a mistake) if the two items in a pair are more similar according to her latent rank vector $R_j$. We assume the following logistic-type model for the probability of making a mistake in a given pairwise comparison,

$$ \mathrm{logit}\left[ \mathrm{P}\left( B_{jm}\ \text{mistake} \mid R_j, \beta_0, \beta_1 \right) \right] = -\beta_0 - \beta_1\, \frac{d_{R_j,m} - 1}{n - 2}, $$

where $d_{R_j,m}$ is the $\ell_1$ distance between the ranks of the two items under comparison in $B_{jm}$, according to $R_j$: if $B_{jm} = (O_{m_1} \prec O_{m_2})$, then $d_{R_j,m} = |R_{jm_1} - R_{jm_2}|$. We assume that $\beta_1$ and $\beta_0$ are a priori independent and gamma distributed, $\beta_1 \sim \Gamma(\lambda_{11}, \lambda_{12})$ and $\beta_0 \sim \Gamma(\lambda_{01}, \lambda_{02})$. These choices are motivated by the fact that, first, we want to model a negative dependence between the distance of the items and the probability of making a mistake ($\beta_1 > 0$), and second, we want to force the probability of making a mistake when the items have ranks differing by 1 to be less than 0.5 ($\beta_0 > 0$). The posterior density of the model, defined on the support $S = \{\alpha > 0\} \cap \{\rho \in \mathcal{P}_n\} \cap \{R_{1:N} \in \mathcal{P}_n\} \cap \{\beta_1 > 0\} \cap \{\beta_0 > 0\}$, is then

$$ \pi(\alpha, \rho, \beta_0, \beta_1 \mid B_{1:N}) \propto \pi(\beta_0)\, \pi(\beta_1)\, \pi(\rho)\, \pi(\alpha) \prod_{j=1}^N \Bigg[ \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha, \rho)\, f(B_j \mid \beta_0, \beta_1, R_j = r) \Bigg]. \qquad (9) $$
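The logistic mistake probability and the resulting per-user likelihood can be sketched in a few lines of Python. This is our own illustration (the helper names `mistake_prob` and `logistic_likelihood` are ours; items are 0-indexed, `ranks[i]` is the rank of item i, and a preference `(m1, m2)` declares item `m1` preferred to item `m2`):

```python
import math

def mistake_prob(ranks, pref, beta0, beta1):
    """P(mistake) under the logistic model:
    logit P = -beta0 - beta1 * (d - 1) / (n - 2),
    where d is the rank distance of the two compared items."""
    n = len(ranks)
    m1, m2 = pref
    d = abs(ranks[m1] - ranks[m2])
    eta = -beta0 - beta1 * (d - 1) / (n - 2)
    return 1.0 / (1.0 + math.exp(-eta))

def logistic_likelihood(prefs, ranks, beta0, beta1):
    """f(B_j | beta0, beta1, R_j = r): product over comparisons of
    p^g * (1 - p)^(1 - g), with g = 1 when the stated order contradicts r."""
    out = 1.0
    for m1, m2 in prefs:
        p = mistake_prob(ranks, (m1, m2), beta0, beta1)
        wrong = 0 if ranks[m1] < ranks[m2] else 1
        out *= p if wrong else 1.0 - p
    return out
```

With $\beta_0, \beta_1 > 0$ the mistake probability is below 0.5 for adjacent ranks (d = 1) and decreases as the two items move further apart in the latent ranking, which is exactly the qualitative behavior the priors are designed to enforce.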

Analogously to eq. (6), we sample from the posterior of eq. (9) by first updating $\alpha$, $\rho$, $\beta_0$ and $\beta_1$, given $B_{1:N}$ and $R_{1:N}$, i.e. from

$$ \pi(\alpha, \rho, \beta_0, \beta_1 \mid B_{1:N}, R_{1:N}) \propto \frac{\alpha^{\gamma-1}\, \beta_0^{\lambda_{01}-1}\, \beta_1^{\lambda_{11}-1}}{\prod_{j=1}^N \prod_{m=1}^{M_j} \left[ 1 + e^{-\beta_0 - \beta_1 \frac{d_{R_j,m}-1}{n-2}} \right]} \cdot e^{-\alpha\left(\lambda + \frac{1}{n}\sum_{j=1}^N d(R_j, \rho)\right) - N \ln Z_n(\alpha) - \beta_0\left(\lambda_{02} + \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)\right)} \cdot e^{-\beta_1\left(\lambda_{12} + \frac{1}{n-2}\sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)\,(d_{R_j,m}-1)\right)}. \qquad (10) $$

Secondly, we update $R_{1:N}$, given $\alpha$, $\rho$, $\beta_0$, $\beta_1$ and $B_{1:N}$, from

$$ \pi(R_{1:N} \mid \alpha, \rho, \beta_0, \beta_1, B_{1:N}) \propto \pi(R_{1:N} \mid \alpha, \rho)\, \pi(B_{1:N} \mid \beta_0, \beta_1, R_{1:N}) \propto e^{-\frac{\alpha}{n}\sum_{j=1}^N d(R_j, \rho) - N \ln Z_n(\alpha) - \beta_0 \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j) - \frac{\beta_1}{n-2} \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)\,(d_{R_j,m}-1)} \cdot \prod_{j=1}^N \prod_{m=1}^{M_j} \left[ 1 + e^{-\beta_0 - \beta_1 \frac{d_{R_j,m}-1}{n-2}} \right]^{-1}. \qquad (11) $$

3.3. Clustering non-transitive users. So far we assumed that a unique consensus ranking was shared by all users. Since in many situations the assumption of a common underlying consensus ranking is unrealistic, we here allow for clustering the users into separate subsets, each sharing a consensus ranking of the items, similarly to Vitelli et al. (2017). We propose a mixture model generalization of the Bernoulli model of Section 3.1 to deal with heterogeneous users expressing pairwise preferences with mistakes.

Let $z_1, \dots, z_N \in \{1, \dots, C\}$ be the class labels indicating how individual users are assigned to one of the C clusters. Each cluster is described by a different pair of Mallows parameters $(\alpha_c, \rho_c)$, $c = 1, \dots, C$, so that the likelihood has the following form:

$$ f(B_{1:N} \mid \alpha_{1:C}, \rho_{1:C}, \theta, \eta_{1:C}, z_{1:N}) = \prod_{j=1}^N \Bigg[ \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha_{z_j}, \rho_{z_j})\, f(B_j \mid \theta, R_j = r) \Bigg], $$

where

$$ f_{R_j}(r \mid \alpha_{z_j}, \rho_{z_j}) = \frac{1_{\mathcal{P}_n}(r)}{Z_n(\alpha_{z_j})} \exp\left\{ -\frac{\alpha_{z_j}}{n}\, d(r, \rho_{z_j}) \right\}. $$

We assume that the cluster labels are a priori conditionally independent given the mixing parameters of the clusters, $\eta_1, \dots, \eta_C$, and distributed according to a categorical distribution,

$$ \pi(z_1, \dots, z_N \mid \eta_1, \dots, \eta_C) \propto \prod_{j=1}^N \eta_{z_j} = \prod_{j=1}^N \prod_{c=1}^C \eta_c^{1_c(z_j)}, $$

where $\eta_c \ge 0$ for all $c = 1, \dots, C$ and $\sum_c \eta_c = 1$. Finally, we assign to $\eta_1, \dots, \eta_C$ the Dirichlet density with parameter $\chi$,

$$ \pi(\eta_1, \dots, \eta_C) \propto \prod_{c=1}^C \eta_c^{\chi - 1}. $$

These choices lead to the following posterior density,

$$ \pi(\alpha_{1:C}, \rho_{1:C}, \eta_{1:C}, \theta, z_{1:N} \mid B_{1:N}) \propto \pi(\theta) \prod_{c=1}^C \left[ \pi(\alpha_c)\, \pi(\rho_c)\, \pi(\eta_c) \right] \cdot \prod_{j=1}^N \Bigg[ \pi(z_j \mid \eta_{1:C}) \sum_{r \in \mathcal{P}_n} f_{R_j}(r \mid \alpha_{z_j}, \rho_{z_j})\, f(B_j \mid \theta, R_j = r) \Bigg]. \qquad (12) $$

Similarly to the homogeneous case, we then sample from the posterior of eq. (12) by first updating $\alpha_{1:C}$, $\rho_{1:C}$, $\eta_{1:C}$, $z_{1:N}$ and $\theta$, given $B_{1:N}$ and $R_{1:N}$, and then updating $R_{1:N}$, given $\alpha_{1:C}$, $\rho_{1:C}$, $\eta_{1:C}$, $z_{1:N}$, $\theta$ and $B_{1:N}$. The former step is done by using the conditional density

$$ \pi(\alpha_{1:C}, \rho_{1:C}, \eta_{1:C}, \theta, z_{1:N} \mid B_{1:N}, R_{1:N}) \propto \prod_{c=1}^C \left[ \alpha_c^{\gamma-1}\, e^{-\lambda\alpha_c}\, \eta_c^{\chi - 1 + \sum_{j=1}^N 1_c(z_j)} \right] \cdot \prod_{j=1}^N \frac{e^{-\frac{\alpha_{z_j}}{n} d(R_j, \rho_{z_j})}}{Z_n(\alpha_{z_j})} \cdot \left( \frac{\theta}{1-\theta} \right)^{\kappa_1 - 1 + \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j)} (1-\theta)^{\kappa_2 + \kappa_1 - 2 + \sum_{j=1}^N M_j}. \qquad (13) $$
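Eq. (13) implies, in particular, a standard Gibbs step for the cluster labels (each $z_j$ is categorical with probabilities proportional to $\eta_c\, f_{R_j}(R_j \mid \alpha_c, \rho_c)$) and a conjugate Dirichlet update for $\eta$. The Python sketch below is our own reading of those two sub-steps (the paper defers the algorithmic details of the mixture to Supplement A); the helper names are ours, and $Z_n(\alpha)$ is again computed by brute-force enumeration, feasible only for small n:

```python
import math
import random
from itertools import permutations

def footrule(r, rho):
    return sum(abs(a - b) for a, b in zip(r, rho))

def log_mallows(r, rho, alpha):
    """log f(r | alpha, rho), with Z_n(alpha) by enumeration (small n only)."""
    n = len(rho)
    log_Z = math.log(sum(math.exp(-alpha / n * footrule(p, rho))
                         for p in permutations(range(1, n + 1))))
    return -alpha / n * footrule(r, rho) - log_Z

def gibbs_z(R_j, alphas, rhos, etas):
    """Sample a cluster label: P(z_j = c) proportional to eta_c * f(R_j | alpha_c, rho_c)."""
    logw = [math.log(e) + log_mallows(R_j, rho, a)
            for e, a, rho in zip(etas, alphas, rhos)]
    mx = max(logw)
    w = [math.exp(x - mx) for x in logw]
    u = random.random() * sum(w)
    acc = 0.0
    for c, wc in enumerate(w):
        acc += wc
        if u <= acc:
            return c
    return len(w) - 1

def gibbs_eta(z, C, chi):
    """Draw mixing weights from Dirichlet(chi + n_1, ..., chi + n_C)
    via normalized Gamma variates."""
    counts = [z.count(c) for c in range(C)]
    draws = [random.gammavariate(chi + nc, 1.0) for nc in counts]
    s = sum(draws)
    return [d / s for d in draws]
```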


The second step is performed by using the density

$$ \pi(R_{1:N} \mid \alpha_{1:C}, \rho_{1:C}, \eta_{1:C}, \theta, z_{1:N}, B_{1:N}) \propto \prod_{j=1}^N \left[ \frac{e^{-\frac{\alpha_{z_j}}{n} d(R_j, \rho_{z_j})}}{Z_n(\alpha_{z_j})} \left( \frac{\theta}{1-\theta} \right)^{\sum_{m=1}^{M_j} g(B_{jm}, R_j)} (1-\theta)^{M_j} \right]. \qquad (14) $$

4. MCMC for non-transitive pairwise preferences. We develop a Markov chain Monte Carlo (MCMC) algorithm which, at convergence, samples from the posterior density of eq. (6). As explained in Section 3.1, the MCMC iterates between two main steps:

1. Update $\alpha$, $\rho$ and $\theta$ given $B_{1:N}$ and $R_{1:N}$ (using eq. (7)):
(a) Metropolis update of $\rho$;
(b) Metropolis update of $\alpha$;
(c) Gibbs update of $\theta$.
2. Update $R_{1:N}$ given $\alpha$, $\rho$, $\theta$ and $B_{1:N}$ (using eq. (8)).

In step 1(a), we propose a new consensus ranking $\rho^p$ according to a symmetric proposal which is centered around the current consensus ranking $\rho^t$.

Definition 1 (Swap proposal). At step t, denote the current version of the consensus ordering vector by $x^t = (\rho^t)^{-1}$, which is the vector whose n components are the items in $\mathcal{O}$ ordered from best to worst according to $\rho^t$, i.e., $x^t_i = O_k \iff \rho^t_k = i$. Let $L^* \in \{1, \dots, n\}$. Sample uniformly an integer l from $\{1, 2, \dots, L^*\}$ and draw a random number u uniformly in $\{1, 2, \dots, n-l\}$. The proposal $x^p$ has components

$$ x^p_i = \begin{cases} x^t_i & \text{if } i \notin \{u, u+l\} \\ x^t_{u+l} & \text{if } i = u \\ x^t_u & \text{if } i = u+l, \end{cases} \qquad (15) $$

and the proposed ranking is $\rho^p = (x^p)^{-1}$.

The parameter $L^*$ plays the role of the maximum allowed distance between the ranks of the swapped items, and is used for tuning the acceptance probability in the Metropolis-Hastings step. The transition probability of the Swap proposal is symmetric and given by

$$ q(\rho^t \to \rho^p) = q(\rho^p \to \rho^t) = \sum_{l=1}^{L^*} \mathrm{P}(L = l)\, \mathrm{P}(\rho^p \to \rho^t \mid L = l)\, 1(|\rho^p - \rho^t| = 2l) = \frac{1}{L^*} \sum_{l=1}^{L^*} \frac{1}{n-l}\, 1(|\rho^p - \rho^t| = 2l). $$
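The Swap proposal of Definition 1 can be sketched in a few lines of Python (our own toy helper `swap_proposal`; items are 0-indexed and `rho[i]` is the rank of item i, so positions in the ordering vector are 0-based as well):

```python
import random

def swap_proposal(rho, L_star):
    """One draw from the Swap proposal (Definition 1): pick l ~ U{1..L*},
    a position u ~ U{1..n-l}, and swap the items in positions u and u+l of
    the ordering vector x = rho^{-1}."""
    n = len(rho)
    # Ordering vector: x[pos] = item ranked (pos + 1)-th.
    x = [0] * n
    for item, rank in enumerate(rho):
        x[rank - 1] = item
    l = random.randint(1, L_star)
    u = random.randint(0, n - l - 1)   # 0-based position
    x[u], x[u + l] = x[u + l], x[u]
    # Back to ranks: rho_p[item] = position + 1.
    rho_p = [0] * n
    for pos, item in enumerate(x):
        rho_p[item] = pos + 1
    return rho_p
```

Note that a single swap of items l apart changes the footrule distance to the current ranking by exactly 2l, in line with the indicator $1(|\rho^p - \rho^t| = 2l)$ in the transition probability above.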


The Swap proposal is a very intuitive and simple way of exploring $\mathcal{P}_n$, and it has appealing properties for us that will become clear in step 2. In this step, alternative proposals could be considered, for example the Leap-and-Shift proposal of Vitelli et al. (2017).

Remark 1. The Swap proposal $\rho^p$ is a local perturbation of $\rho^t$, separated from $\rho^t$ by Cayley distance $d_C = 1$, by Hamming distance $d_H = 2$, expected Kendall distance $\mathrm{E}(d_K) = L^*$, expected footrule distance $\mathrm{E}(d_F) = L^* + 1$, and expected Spearman distance $\mathrm{E}(d_S) = \frac{(L^*+1)(2L^*+1)}{3}$. This follows from the definitions of the various distances and simple calculations.

The proposed ranking is then accepted with probability $\min\{1, a_\rho\}$, where

$$ \ln a_\rho = -\frac{\alpha}{n} \sum_{j=1}^N \left[ d(R_j, \rho^p) - d(R_j, \rho^t) \right]. $$

In step 1(b) we propose $\alpha^p$ from a log-normal density $\ln \mathcal{N}(\ln \alpha^t, \sigma_\alpha^2)$, and accept it with probability $\min\{1, a_\alpha\}$, where

$$ \ln a_\alpha = \gamma \ln(\alpha^p / \alpha^t) - \left( \lambda + \frac{1}{n} \sum_{j=1}^N d(R_j, \rho) \right)(\alpha^p - \alpha^t) - N \ln\left[ Z_n(\alpha^p) / Z_n(\alpha^t) \right]. $$

This acceptance probability takes into account the asymmetric transition probability of the chain, which results from the log-normal proposal. The partition function $Z_n(\alpha)$ can be computed exactly or approximated by the importance sampling scheme proposed by Vitelli et al. (2017), depending on the distance function chosen and on the number n of items considered.

In step 1(c) we sample $\theta$ from the beta distribution, truncated to the interval $[0, 0.5)$, with updated hyperparameters

$$ \kappa_1' = \kappa_1 + \sum_{j=1}^N \sum_{m=1}^{M_j} g(B_{jm}, R_j), \qquad \kappa_2' = \kappa_2 + \sum_{j=1}^N \sum_{m=1}^{M_j} \left[ 1 - g(B_{jm}, R_j) \right]. $$
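Step 1(c) is a conjugate update, so it is easy to sketch. The Python fragment below is our own illustration (the helper `sample_theta` is a hypothetical name; the truncation to $[0, 0.5)$ is handled here by simple rejection, drawing untruncated Beta variates until one falls in the interval, which is an implementation convenience rather than the paper's stated method):

```python
import random

def sample_theta(prefs_all, ranks_all, kappa1, kappa2):
    """Gibbs update for theta: Beta(kappa1', kappa2') truncated to [0, 0.5).
    kappa1' adds the total number of mistakes, kappa2' the non-mistakes.
    prefs_all[j] is user j's list of (preferred, other) item pairs;
    ranks_all[j][i] is the rank user j's latent ranking gives item i."""
    mistakes = sum(0 if r[m1] < r[m2] else 1
                   for prefs, r in zip(prefs_all, ranks_all)
                   for m1, m2 in prefs)
    total = sum(len(prefs) for prefs in prefs_all)
    k1 = kappa1 + mistakes
    k2 = kappa2 + (total - mistakes)
    while True:
        # Beta(k1, k2) via two Gamma draws; reject values >= 0.5.
        a = random.gammavariate(k1, 1.0)
        b = random.gammavariate(k2, 1.0)
        theta = a / (a + b)
        if theta < 0.5:
            return theta
```

Rejection is cheap here because the posterior mass above 0.5 is small whenever mistakes are a minority of the comparisons, which the truncated prior enforces by construction.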

Step 2 is a Metropolis-Hastings for the individual rankings. Here we exploit the fact that, when fixing all other parameters and the data B1 , ..., BN , R1 , ..., RN are conditionally independent, and that each Rj only depends on the corresponding data Bj . We thus sample a proposed individual ranking

A BAYESIAN MALLOWS APPROACH TO NON-TRANSITIVE DATA

15

r_j^p from the Swap proposal, separately for each j = 1, ..., N. The Swap proposal is here advantageous because it perturbs locally not only the current individual ranking r_j^t, but also the function g(B_jm, r_j^t).

Remark 2. The Swap proposal always gives a proposed individual ranking r_j^p ≠ r_j^t. However, it may happen that g(B_jm, r_j^p) = g(B_jm, r_j^t), ∀m = 1, ..., M_j. This matters for the acceptance probability of r_j^p. If g(B_jm, r_j^p) = g(B_jm, r_j^t), ∀m = 1, ..., M_j, the acceptance probability depends only on the ratio of the Mallows likelihoods of r_j^p and r_j^t, and equals min{1, a_1}, where
\[
\ln a_1 = -\frac{\alpha}{n}\left[d(r_j^p, \rho) - d(r_j^t, \rho)\right].
\]
If g(B_jm, r_j^p) ≠ g(B_jm, r_j^t) for some m = 1, ..., M_j, the acceptance probability also depends on the mistake model, and equals min{1, a_2}, where
\[
\ln a_2 = \ln a_1 + \sum_{m=1}^{M_j}\left[g(B_{jm}, r_j^p) - g(B_{jm}, r_j^t)\right]\ln\left[\theta/(1-\theta)\right].
\]

Example. To illustrate this step of the algorithm, suppose that a user expresses the following set of preferences,
B_j = {(O2 ≺ O1), (O5 ≺ O4), (O5 ≺ O3), (O5 ≺ O2), (O5 ≺ O1), (O3 ≺ O2), (O1 ≺ O3)}.

This set contains the non-transitive pattern O2 ≺ O1 ≺ O3 ≺ O2. For the illustration, suppose that the current value of the individual ranking vector is r_j^t = (5, 4, 3, 2, 1), which corresponds to the ordering vector x_j^t = (O5, O4, O3, O2, O1), and for which Σ_{m=1}^{7} g(B_jm, r_j^t) = 1. If we sample the proposal x_j^p = (O5, O3, O4, O2, O1), this gives g(B_jm, r_j^p) = g(B_jm, r_j^t), ∀m = 1, ..., 7, and r_j^p = (5, 4, 2, 3, 1) ≠ r_j^t. However, if we sample x_j^p = (O4, O5, O3, O2, O1), then r_j^p = (5, 4, 3, 1, 2) ≠ r_j^t and also Σ_{m=1}^{7} g(B_jm, r_j^p) = 2 ≠ Σ_{m=1}^{7} g(B_jm, r_j^t), since, according to the sampled r_j^p, the preference O5 ≺ O4 is reversed.

Appropriate convergence of the MCMC must in practice be checked by inspecting the trace plots of the parameters, and by monitoring, for example, the integrated autocorrelation. In Supplement A we explain in detail how the algorithm is adapted to the case of the logistic mistake model, and to the mixture extension.
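The counts in this example are easy to verify numerically. Below is a small Python check; the convention that g(B_jm, r) = 1 exactly when the m-th stated preference disagrees with the ranking r is our assumption, consistent with the counts in the example above.

```python
# Pairwise data B_j from the example; (a, b) encodes "O_a is preferred to O_b".
B = [(2, 1), (5, 4), (5, 3), (5, 2), (5, 1), (3, 2), (1, 3)]

def g(pair, r):
    """Return 1 if the stated preference disagrees with ranking r, else 0.
    r[i-1] is the rank of item O_i; a preferred item should have a smaller rank.
    (This sign convention is our assumption, matching the counts above.)"""
    a, b = pair
    return int(r[a - 1] > r[b - 1])

r_t  = (5, 4, 3, 2, 1)   # ordering (O5, O4, O3, O2, O1)
r_p1 = (5, 4, 2, 3, 1)   # ordering (O5, O3, O4, O2, O1)
r_p2 = (5, 4, 3, 1, 2)   # ordering (O4, O5, O3, O2, O1)

counts = [sum(g(p, r) for p in B) for r in (r_t, r_p1, r_p2)]
print(counts)  # → [1, 1, 2], matching the example
```

Note that any ranking must leave at least one of the seven preferences unmatched, since they contain a cycle.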


M. CRISPINO ET AL.

5. Other approaches to pairwise preference data. A widely used approach to pairwise preference data is the class of paired comparison models (David, 1963), which includes the Babington Smith (BS) model (Smith, 1950). In the BS model, for each pair of items {Oi, Ok} ⊂ O, 1 ≤ i < k ≤ n, p_ik = P(Oi ≺ Ok) is the probability that Oi is preferred to Ok. The density of a ranking R ∈ P_n under the BS model is then proportional to the product of such parameters. The Mallows-Bradley-Terry (MBT) model (Mallows, 1957) simplifies the BS model by assuming that the probability p_ik has the Bradley-Terry form (Bradley and Terry, 1952) p_ik = µ_i/(µ_i + µ_k), where the parameter µ_i > 0, i = 1, ..., n, with Σ_{i=1}^{n} µ_i = 1, represents the score, rating, or skill of item i. The number of free parameters is thus n − 1. The goal is then to estimate the µ = (µ_1, ..., µ_n) that best represents the consensus preference of all the users. As a rank aggregation model it is insensitive to possible individual-level non-transitivities. Some interesting extensions of the original MBT model were developed by Causeur and Husson (2005), who proposed a two-dimensional MBT model, and by Usami (2010), who suggested a multidimensional generalization of the same model. Both methods deal with individual-level non-transitive preferences, but they do not lead to a final ranking of all items. The MBT model was studied in the Bayesian framework by Caron and Doucet (2012), who proposed an efficient Gibbs sampling procedure to estimate µ. Mallows (1957) introduced a further simplification of the BS model, defining the φ-model and the ρ-model, parametrized by only two parameters. Both models belong to the wider distance-based family of Mallows models, and assume that p_ik depends only on the relative order of items Oi and Ok in the consensus ranking. All the works mentioned so far assume that non-transitive response patterns cannot occur. The literature on inferential models for non-transitive pairs is more limited.
As mentioned, Volkovs and Zemel (2014) deal with non-transitivities. Their model is motivated by an idea similar to ours, namely that when rankings are converted into pairwise preferences, they carry two kinds of information: the binary preference, and a confidence, based on the size of the rank difference. In this respect, their method is connected to our logistic model for mistakes (Section 3.2). The main difference between their model and ours is the data generating mechanism: they assume a multinomial generative process to model the observed preferences, while our model is based on a distance between rankings. Another relevant paper is Ailon (2012), which aims at finding the linear ordering of items with the smallest number of disagreements with the preferences, by using methods from combinatorial optimization. In this respect his goal is similar to ours; however, the model


is not probabilistic, and therefore cannot be used for expressing uncertainty in the estimates.

6. Simulation study. The aim of the experiments was to validate the method and to evaluate its performance in some test situations. The data were simulated from the Mallows model with the Bernoulli mistake model, varying the parameters θ, α, n, N, and M_j, j = 1, ..., N, while always using the footrule distance. The number of items n was always kept below 50, thus enabling us to use the exact partition function (Vitelli et al., 2017). For a detailed description of the data generation, see Supplement B.

Various point estimates can be deduced from the posterior distribution of ρ, one being the maximum a posteriori (MAP). We prefer the following sequential construction, which we call the cumulative probability (CP) consensus ordering, as in Vitelli et al. (2017): first we select the item which has the largest marginal posterior probability of being ranked 1st; then, excluding this first choice, we select the item which has the largest marginal posterior probability of being ranked 1st or 2nd among the remaining ones, and so on.

In order to assess the performance of our methods, in Figure 3 we plot the posterior distribution of the normalized footrule distance between the estimated consensus ρ and the true consensus, d_f(ρ, ρ_true) = (1/n) Σ_{i=1}^{n} |ρ_i − ρ_{i,true}|, for varying parameters α, θ, λ_M (the average number of pairs given to each user) and N, while keeping n = 10 fixed. As expected, the performance of the method improves as the number of users N increases (Figure 3a), as the probability of making mistakes θ decreases (Figure 3b), as the dispersion of the individual latent rankings R_{j,true} around ρ_true decreases, that is, when α increases (Figure 3c), and when the average number of pairwise comparisons becomes larger (Figure 3d).
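The CP consensus ordering described above is simple to compute from posterior samples. The following is a minimal Python sketch; the function name and the (T, n)-array layout of the sampled consensus rankings are our own conventions.

```python
import numpy as np

def cp_consensus(rho_samples):
    """Cumulative-probability (CP) consensus ordering.
    rho_samples: (T, n) array, each row a sampled consensus ranking (values 1..n).
    At step k, among unassigned items, pick the one with the largest marginal
    posterior probability of being ranked among the top k."""
    T, n = rho_samples.shape
    remaining = set(range(n))
    ordering = []
    for k in range(1, n + 1):
        probs = {i: np.mean(rho_samples[:, i] <= k) for i in remaining}
        best = max(probs, key=probs.get)
        ordering.append(best)
        remaining.remove(best)
    return ordering  # item indices, best first
```

Unlike the MAP, this construction only needs the marginal posterior rank probabilities, which are cheap to estimate along the MCMC trajectory.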
Interestingly, in the last case the method performs generally well even when the average number of pairs is λ_M = 15, only one third of the maximal possible number of pairs. In Figure 4 we plot the posterior distribution of d_f(ρ, ρ_true) corresponding to simulation experiments with number of items n ∈ {15, 25}, when increasing the number of users N. Note that the number of pairs assessed by each user in the case n = 25 is around 50, which is 1/6 of all possible pairs.

Next, we studied the performance of the method in terms of the precision of the individual ranking estimation. We quantified the results by the probability of getting at least 3 items right among the top-5, defined as follows. For each user j = 1, ..., N, we found the triplet of items D_3^j = {O_i1, O_i2, O_i3} that had maximum posterior probability of being ranked jointly among the top-3

Fig 3: Results of the simulated data. Posterior CDFs of df (ρ, ρtrue ) as a function of N for α = 3, λM = 25, θ = 0.1 (a); as a function of θ for α = 3, N = 40, λM = 25 (b); as a function of α for θ = 0.1, N = 40, λM = 25 (c); as a function of λM for α = 3, N = 40, θ = 0.1 (d).

items, i.e. the triplet that maximized Σ_{σ∈P_3} P({R_{ji_1}, R_{ji_2}, R_{ji_3}} = σ | data), where σ denotes a permutation of the set {1, 2, 3}. This posterior quantity was estimated along the MCMC trajectory. We defined H_5^j to be the set of

Fig 4: Results of the simulated data. Posterior CDFs of df (ρ, ρtrue ) as a function of N , for θ = 0.1, α = 3.5, λM = 25, n = 15 (a), and for θ = 0.1, α = 4.5, λM = 50, n = 25 (b).

5 highest ranked items in R_{j,true}, for each user j. We then checked whether D_3^j ⊂ H_5^j (that is, whether the top-3 estimated items were all among the top-5 of each user). The percentage of users for which this is true is reported in Table 1. We notice that the results are overall very good: in the cases where n is set to 10 (first 4 sub-tables from the left in Table 1), we consistently learn 3 out of the top-5 items for more than 70% of the users (with a peak of 100%). Also in the more difficult cases of n = 15 and n = 25 (first 2 sub-tables from the right in Table 1) the results are very good, especially considering that this percentage does not include the cases where only 2 (or 1) items were correctly estimated in the top positions.

Table 1
Results of the simulated data. Percentage of users for which the estimated top-3 items belong to the true top-5. Data correspond to simulations with the same parameter settings as the results shown in Figures 3 and 4: from left to right, same parameters as in Figure 3a, Figure 3b, Figure 3c, Figure 3d, Figure 4a and Figure 4b.

N    20    30    60    120       θ    0.05   0.1    0.15   0.2
%    88    83    83    75        %    92.5   87.5   75     72.5

α    2     4     6               λM   15     25     35
%    82.5  95    92.5            %    85     97.5   100

N    50    100   150             N    100    150    300
%    65    58    60              %    44     46     45
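The quantity reported in Table 1 can be computed from the MCMC output as sketched below. This is a minimal Python sketch under our own conventions: R_samples is a (T, n) array of sampled individual rankings for one user, and the joint top-3 probability of a triplet is estimated by its empirical frequency along the chain.

```python
import itertools
import numpy as np

def best_top3_triplet(R_samples):
    """Triplet of items with maximum posterior probability of jointly occupying
    ranks {1, 2, 3}, estimated by its frequency over the samples (T, n)."""
    T, n = R_samples.shape
    def p_joint(tri):
        return np.mean([set(R_samples[t, list(tri)]) == {1, 2, 3} for t in range(T)])
    return max(itertools.combinations(range(n), 3), key=p_joint)

def top3_in_top5(D3, r_true):
    """Check whether the estimated top-3 items D3 all lie among the 5
    highest-ranked items of the true individual ranking r_true."""
    H5 = {i for i, rank in enumerate(r_true) if rank <= 5}
    return set(D3) <= H5
```

Averaging `top3_in_top5` over users gives the percentages shown in Table 1; the exhaustive search over triplets is feasible because only n choose 3 candidates exist.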

We then chose one of the simulated data cases and computed the posterior probabilities of correctly predicting the preference order of all pairs not assessed by the users, i.e. P[g(B_{j,new}, R_j) = g(B_{j,new}, R_{j,true}) | data]. Figure 5 shows the boxplots for these predictive probabilities, stratified (left) according to the number of pairs each user assessed in the data, and (right) according to the footrule distance between the individual ranking R_{j,true} and the consensus ρ_true, d(ρ_true, R_{j,true}) = Σ_{i=1}^{n} |ρ_{i,true} − R_{ji,true}|.

Fig 5: Results of the simulated data. Posterior probabilities of correctly predicting the preference order of all pairs not assessed by the users, (left) stratified according to the number of pairs each user assessed in the data, and (right) stratified according to d(ρ_true, R_{j,true}).

In the case considered, the model had very good predictive power, especially considering that the simulated data contained many mistakes (around 10%). We also notice a slight increase of the predictive probabilities as M_j increases (left panel) and as d(ρ_true, R_{j,true}) decreases (right panel). These results are not surprising: it is easier to predict the correct ordering of new pairs when (i) the user assesses more pairs, and (ii) the user's own ranking more closely resembles the shared consensus. In Supplement C we report an analysis of data generated by the logistic model LM. The results were very similar to those obtained above. In fact, the posterior distribution of β_1 was highly concentrated around 0, in which case LM collapses to BM.

7. Human causation in sounds. We analyzed the data using the mixture model explained in Section 3.3 with the footrule distance. With n = 12 sounds we can use the exact expression of the partition function (Vitelli et al., 2017). In the Dirichlet prior for η, we set the hyperparameter


χ = 20, which favors high-entropy distributions, thus reflecting our inability to express precise prior knowledge. In the Beta prior for θ, we set the hyperparameters at κ_1 = κ_2 = 1, i.e. the uniform distribution on the interval [0, 0.5), and the hyperparameters of the prior for α at γ = 1 and λ = 1/10, as discussed in Vitelli et al. (2017). We ran the MCMC sampler for 10^6 iterations, after a burn-in of 2·10^5. Separate analyses were performed for C ∈ {1, ..., 7}.

In order to choose an appropriate number of clusters, we plot in Figure 6 two quantities: on the left, the within-cluster sum of footrule distances between the individual rankings and the consensus ranking of that cluster, Σ_{c=1}^{C} Σ_{j: z_j=c} d_f(R_j, ρ_c); on the right, the within-cluster indicator of misfit to the data, Σ_{c=1}^{C} Σ_{j: z_j=c} Σ_{m=1}^{M_j} g(B_{jm}, ρ_c). Both these measures are defined in Vitelli et al. (2017), and tested as good measures to select C.
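For a fixed cluster assignment, the two selection measures above reduce to simple sums. The sketch below is our own minimal Python rendering; the distance d_f and the disagreement indicator g are passed in as (hypothetical) helper functions, and z[j] holds the cluster label of user j.

```python
def within_cluster_distance(R, z, rho, footrule):
    """Sum over clusters c, and users j with z_j = c, of d_f(R_j, rho_c)."""
    return sum(footrule(R[j], rho[z[j]]) for j in range(len(R)))

def within_cluster_misfit(B, z, rho, g):
    """Sum over users of the number of stated preferences B_jm that disagree
    with the consensus ranking of the user's cluster."""
    return sum(g(pair, rho[z[j]]) for j in range(len(B)) for pair in B[j])
```

Evaluating both along the posterior samples for each candidate C, and looking for an elbow as in Figure 6, gives the selection rule used in the paper.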

Fig 6: Acousmatic data. Boxplots of the within-cluster sum of footrule distances between the individual rankings and the consensus ranking of that cluster (left), and of the within-cluster indicator of misfit to the data (right), for different choices of C.

There appears to be an elbow at C = 3 in both plots, guiding our choice of the number of clusters. We decided on C = 3, also motivated by the relatively small sample size of the experiment (N = 46). Table 2 shows the results for C = 3: the maximum a posteriori (MAP) estimates for η and α, together with their 95% highest posterior density (HPD) intervals, are shown at the top of the table. The table also shows the cluster-specific consensus lists of sounds, estimated by the CP procedure. We observe clear differences among the three consensus lists. S1, the stimulus with the most dynamic spatial motion, is on top in cluster 3, but


at the bottom in cluster 1; S8, the test stimulus that has maximum spatial details but no volume or pitch change, is on top in cluster 1, but second to last in clusters 2 and 3. Finally, S5, the stimulus that contains the least movement variation but has pitch and volume suppressed, is ranked third and first in clusters 1 and 2, respectively, but towards the bottom of the list in cluster 3.

Table 2
Acousmatic data. Sounds are ordered according to the CP consensus ordering, obtained from the posterior distribution of ρ_c, c = 1, 2, 3.

Cluster 1: α_1 = 2.66 (1.14, 4.96), η_1 = 0.31 (0.21, 0.41): S8, S10, S5, S9, S6, S4, S7, S11, S12, S2, S3, S1
Cluster 2: α_2 = 5.16 (3.15, 9.29), η_2 = 0.33 (0.22, 0.43): S5, S4, S12, S2, S11, S3, S6, S1, S7, S9, S8, S10
Cluster 3: α_3 = 5.32 (3.61, 7.66), η_3 = 0.37 (0.27, 0.48): S1, S7, S11, S2, S4, S12, S6, S3, S5, S9, S8, S10

Figure 7 shows the heatplot of the posterior marginal probabilities, for each sound, of being ranked as the k-th highest, k = 1, ..., 12, in each of the three clusters. On the x-axis the sounds are ordered according to the CP consensus ordering of Table 2.

Fig 7: Acousmatic data. Posterior consensus ranking of the three clusters.

Listeners in cluster 1 perceived variation in volume or pitch as a negative or distracting feature. They rated S8 at the top, a test stimulus that has maximum spatial details but no volume or pitch change. Also S10, S5 and S9, which were ranked next, lack volume and pitch details. The bottom 4 stimuli contain maximum pitch and volume variation. Among them was S3


(mono sound, no space at all), forming a strong contrast to the top-ranked S8 (maximum spatial movement). Evidently, space was important for these listeners, while pitch and volume variation was a negative feature.

In cluster 2, listeners did not interpret fast movements as a sign of human origin, but they did like correlated pitch and volume (the top 4 sounds feature a low amount of spatial variation, but also correlated pitch and volume, while the bottom 3 sounds are the same as the top 3 but lack correlated pitch and volume variation). Listeners in this cluster prioritized pitch and volume variations above spatial variation, and preferred low spatial variation (slower, or more relaxed, movements).

Cluster 3 consists of subjects who, in their evaluation of the test stimuli, appear to include all spatial cues that adhere to our everyday perception of spatial motion. The stimuli with the most dynamic spatial motion, enhanced by spatially correlated pitch and volume variations, are in the top-3, while stimuli with the least of these features are in the bottom-3. These listeners prioritize high levels of spatial detail above all other features, and their perception of these details is enhanced by correlated pitch and volume variations. This is indicated by (i) S1 being at the top; (ii) S7, which is the same as S1 but lacks pitch variation, being second; (iii) S11, which is the same as S1 but played 30% slower, being third (i.e. space, volume and pitch variations are just a bit slower); and (iv) S8, S9 and S10 being at the bottom, all lacking pitch and volume variation and spatial movement details.

We investigate the stability of the clustering in Figure 8, which shows the heatplot of the posterior probabilities, for all the listeners (shown on the x-axis), of being assigned to each of the C = 3 clusters identified in Table 2.
Most of the probabilities are concentrated on one particular value of c among the three possibilities, indicating reasonably sharp cluster assignments. Fixing these cluster assignments, we then computed the marginal posterior probability that each sound is among the top-4 in ρ_{1:C} and in R_j, j = 1, ..., 46, respectively. The results are shown in Figure 9. Each heatplot refers to a cluster (C1 (left), C2 (center) and C3 (right)) and represents the marginal posterior probabilities of each sound (y-axis) being ranked among the top-4 in the consensus of that cluster (first column), and in the individual rankings of the listeners in that cluster (remaining columns, users on the x-axis).

As Figure 9 shows, there is considerable variation in the estimated rankings of the sounds between individual listeners, even when they are included in the same cluster. For example, looking at the left panel of Figure 9, we see that S8, S10, and S5 have high (> 0.8) posterior probability of being ranked among the top-4 stimuli in the consensus ranking (column 1). However, looking at the


Fig 8: Acousmatic data. Heatplot, for all the listeners (on the x-axis), of the posterior probabilities of being assigned to each of the three clusters (on the y-axis).


Fig 9: Acousmatic data. Heatplot of the marginal posterior probabilities for all the stimuli (y-axis) of being ranked among the top-4 for cluster 1 (left), 2 (center) and 3 (right).

estimates for the listeners in cluster 1, we see that the variation is very high: for example, listener 30 (column with label 30) has a very high posterior probability of ranking S3 and S6 among the top-4 stimuli. This aspect is important when individual estimates are of interest.

Here we consider the relationship between the probability of placing some given stimuli in the top (bottom) ranks and the musical sophistication index (MSI), or the spatial audio awareness index (SAA). Figure 10 shows the relationship between listeners' SAA and the probability of sounds S1 and S7 being ranked in the top-4 (both marginally and jointly). Recall that S1 was the original sound, while S7 was identical to S1 but without pitch variation. The plot suggests that spatial listening is a skill that is enhanced through

Fig 10: Acousmatic data. Boxplot of the posterior probabilities for sounds S1 (left), S7 (middle), S1 and S7 jointly (right), of being ranked among the top-4 in the individual ranking Rj , stratified by the SAA index. The horizontal red line is the threshold in the case of random assignment. The scale of SAA goes from 0 to 3, where 3 is an indicator of awareness of spatial dimension of sounds.

training. Figure 11 shows the relationship between listeners' MSI and the probability of sounds S8 and S10 being ranked among the bottom-4 (both marginally and jointly). Respondents with a score greater than 500 were classified as musically more sophisticated, and those with a score less than 500 as less sophisticated, as suggested in http://marcs-survey.uws.edu.au/OMSI/omsi.php. Both S8 and S10 suppress pitch and volume variations, which are expected to enhance the implication of human causation. These two stimuli are more likely to be ranked in the last 4 positions by listeners with high MSI. Interestingly, this suggests that musically sophisticated listeners find pitch and volume variations to be qualities that make a stimulus sound human.

8. Conclusions. The main contribution of this paper is to introduce a new Bayesian method for handling and correcting non-transitive pairwise preference data. The principal advantage of the Bayesian approach in this context is its ability to combine different types and sources of uncertainty in the reported data, and to convert such data into meaningful probabilistic inferences. Our method provides the posterior distribution of the consensus ranking, based on pairwise assessment data from a pool of users who may have individually violated logical transitivity in their reporting. The method is also able to produce the posterior distributions of the latent individual rankings of the users. Importantly, such rankings can be used in the construction of person-


Fig 11: Acousmatic data. Boxplot of the posterior probabilities for sounds S8 (left), S10 (middle), S8 and S10 jointly (right), of being ranked among the bottom-4 in the individual ranking Rj , stratified by the MSI index. The horizontal red line is the threshold in the case of random assignment.

alized recommendations, or in studying how individual preferences change with user-related covariates. We also developed a mixture model generalization of the main model, able to handle heterogeneity in pairwise and non-transitive preference data.

The model was then used to investigate how individual listeners perceive human spatial causation in acousmatic sounds. The data came from a difficult experiment involving human perception; for this reason, pair comparison of sounds was the only feasible practical design in this context. The data were noisy, and in particular logically non-transitive at the individual level. We thus used our approach to estimate individual rankings, which were in turn used to discover a clustering structure. The results revealed how differently people listen to and interpret abstract sounds. We also related individual musicological scores to individual rankings, leading to an interesting correspondence between the perception of spatial sound and sound expertise.

Acknowledgments. MC thanks Sonia Petrone and Isadora Antoniano Villalobos for useful discussions.

References.
Agresti, A. (1996). Categorical data analysis. New York: John Wiley & Sons.
Ailon, N. (2012). An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research 13 137–164.
Barrett, N. and Crispino, M. (2017). The Impact of 3-D Sound Spatialisation on Listeners' Understanding of Human Agency in Acousmatic Music. In preparation.


Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39 324–345.
Caron, F. and Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley–Terry models. Journal of Computational and Graphical Statistics 21 174–196.
Causeur, D. and Husson, F. (2005). A 2-dimensional extension of the Bradley–Terry model for paired comparisons. Journal of Statistical Planning and Inference 135 245–259.
David, H. A. (1963). The method of paired comparisons 12. DTIC Document.
Diaconis, P. (1988). Group representations in probability and statistics. Lecture Notes–Monograph Series 11. Institute of Mathematical Statistics, Hayward, CA, USA.
Dwork, C., Kumar, R., Naor, M. and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web 613–622. ACM.
Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society: Series B 48 359–369.
Grond, F. and Berger, J. (2011). Parameter mapping sonification. In Thomas Hermann, Andrew D. Hunt, and John Neuhoff (eds.), The Sonification Handbook 363–398.
Irurozki, E., Calvo, B. and Lozano, A. (2014). Sampling and learning the Mallows and Generalized Mallows models under the Hamming distance. Bernoulli (submitted).
Irurozki, E., Calvo, B. and Lozano, A. (2016). Sampling and learning the Mallows and Generalized Mallows models under the Cayley distance. Methodology and Computing in Applied Probability.
Kenyon-Mathieu, C. and Schudy, W. (2007). How to rank with few errors. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing 95–103.
Lu, T. and Boutilier, C. (2014). Effective sampling and learning for Mallows models with pairwise-preference data. Journal of Machine Learning Research 15 3783–3829.
Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. Wiley, New York, NY, USA.
Mallows, C. L. (1957). Non-null ranking models. I. Biometrika 44 114–130.
Mukherjee, S. (2016). Estimation in exponential families on permutations. The Annals of Statistics 44 853–875.
Negahban, S., Oh, S. and Shah, D. (2012). Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems 2474–2482.
Ollen, J. E. (2006). A criterion-related validity test of selected indicators of musical sophistication using expert ratings. PhD thesis, The Ohio State University.
Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society: Series C 24 193–202.
Rajkumar, A., Ghoshal, S., Lim, L.-H. and Agarwal, S. (2015). Ranking from stochastic pairwise preferences: Recovering Condorcet winners and tournament solution sets at the top. In ICML 665–673.
Smith, B. B. (1950). Discussion of Professor Ross' paper. Journal of the Royal Statistical Society: Series B 12 153–162.
Tversky, A. (1969). Intransitivity of preferences. Preference, Belief, and Similarity 433.
Usami, S. (2010). Individual differences multidimensional Bradley–Terry model using reversible jump Markov chain Monte Carlo algorithm. Behaviormetrika 37 135–155.
Vitelli, V., Sørensen, Ø., Crispino, M., Frigessi, A. and Arjas, E. (2017). Probabilistic preference learning with the Mallows rank model. ArXiv e-prints, 1405.7945v4.
Volkovs, M. N. and Zemel, R. S. (2014). New learning methods for supervised and unsupervised preference aggregation. Journal of Machine Learning Research 15 1135–1176.


Supplement A. Details of the algorithm.

A.1. Adaptations of the algorithm for the logistic model. The structure of the MCMC algorithm for the logistic model is the same as the one of Section 4. The differences in the LM version of the algorithm are (1) the acceptance ratios for R_1, ..., R_N, and (2) the way in which the parameters of the error model are updated.

The acceptance probability of an individual ranking r_j^p is the following. If g(B_{j,m}, r_j^p) = g(B_{j,m}, r_j^t), ∀m = 1, ..., M_j, the acceptance probability, conditioned on the current values of α, ρ, β_0 and β_1, is min{1, a_3}, where

(16)
\[
\ln a_3 = \ln a_1 - \frac{\beta_1}{n-2} \sum_{m=1}^{M_j} g(B_{jm}, r_j^t)\left[d_{r_j^p,m} - d_{r_j^t,m}\right]
+ \sum_{m=1}^{M_j} \ln\!\left[\frac{1+\exp\{-\beta_0 - \frac{\beta_1}{n-2}(d_{r_j^t,m}-1)\}}{1+\exp\{-\beta_0 - \frac{\beta_1}{n-2}(d_{r_j^p,m}-1)\}}\right]
\]

and \ln a_1 = -\frac{\alpha}{n}\left[d(r_j^p, \rho) - d(r_j^t, \rho)\right]. If g(B_{j,m}, r_j^p) ≠ g(B_{j,m}, r_j^t) for some m = 1, ..., M_j, the acceptance probability is min{1, a_4}, where

(17)
\[
\ln a_4 = \ln a_1 - \beta_0 \sum_{m=1}^{M_j}\left[g(B_{j,m}, r_j^p) - g(B_{j,m}, r_j^t)\right]
- \frac{\beta_1}{n-2}\sum_{m=1}^{M_j}\left[g(B_{jm}, r_j^p)(d_{r_j^p,m}-1) - g(B_{jm}, r_j^t)(d_{r_j^t,m}-1)\right]
+ \sum_{m=1}^{M_j}\ln\!\left[\frac{1+\exp\{-\beta_0 - \frac{\beta_1}{n-2}(d_{r_j^t,m}-1)\}}{1+\exp\{-\beta_0 - \frac{\beta_1}{n-2}(d_{r_j^p,m}-1)\}}\right].
\]

In place of the Gibbs step for θ there are two Metropolis steps for updating β_0 and β_1. The β_1 step, conditioning on the current values of α, ρ, β_0 and R_1, ..., R_N, is performed as follows. We sample the proposal β_1^p from \ln\mathcal{N}(\ln β_1^t, σ_{β_1}), and accept it with probability min{1, a_{β_1}}, where

(18)
\[
\ln a_{\beta_1} = (\beta_1^t - \beta_1^p)\left[\lambda_{12} + \frac{1}{n-2}\sum_{j=1}^{N}\sum_{m=1}^{M_j} g(B_{jm}, R_j)(d_{R_j,m}-1)\right]
+ \lambda_{11}\ln(\beta_1^p/\beta_1^t)
+ \sum_{j=1}^{N}\sum_{m=1}^{M_j}\ln\!\left[\frac{1+\exp\{-\beta_0 - \beta_1^t \frac{d_{R_j,m}-1}{n-2}\}}{1+\exp\{-\beta_0 - \beta_1^p \frac{d_{R_j,m}-1}{n-2}\}}\right].
\]

The β_0 step is performed by conditioning on the current values of α, ρ, β_1 and R_1, ..., R_N. We sample the proposal β_0^p from \ln\mathcal{N}(\ln β_0^t, σ_{β_0}), and accept it with probability min{1, a_{β_0}}, where

(19)
\[
\ln a_{\beta_0} = (\beta_0^t - \beta_0^p)\left[\lambda_{01} + \sum_{j=1}^{N}\sum_{m=1}^{M_j} g(B_{jm}, R_j)\right]
+ \lambda_{00}\ln(\beta_0^p/\beta_0^t)
+ \sum_{j=1}^{N}\sum_{m=1}^{M_j}\ln\!\left[\frac{1+\exp\{-\beta_0^t - \beta_1 \frac{d_{R_j,m}-1}{n-2}\}}{1+\exp\{-\beta_0^p - \beta_1 \frac{d_{R_j,m}-1}{n-2}\}}\right].
\]

A.2. Adaptations of the algorithm for the mixture extension. The algorithm for the mixture model is a simple generalization of the algorithm in Section 4. The M-H steps for ρ_{1:C} and α_{1:C} are straightforward, since (ρ_c, α_c)_{c=1,...,C} are assumed conditionally independent given z_1, ..., z_N. For each c = 1, ..., C, we sample the proposal ρ_c^p from the Swap distribution centered at ρ_c^t, and accept it with probability min{1, a_{ρ_c}}, where
\[
\ln a_{\rho_c} = -\frac{\alpha_c}{n}\sum_{j:\, z_j=c}\left[d(R_j, \rho_c^p) - d(R_j, \rho_c^t)\right].
\]
Next, we sample α_c^p ∼ \ln\mathcal{N}(\ln α_c^t, σ_α^2), and accept it with probability min{1, a_{α_c}}, where
\[
\ln a_{\alpha_c} = \gamma\ln(\alpha_c^p/\alpha_c^t) - \Big[\lambda + \frac{1}{n}\sum_{j:\, z_j=c} d(R_j, \rho_c)\Big](\alpha_c^p - \alpha_c^t) - N_c\ln\!\left[Z_n(\alpha_c^p)/Z_n(\alpha_c^t)\right],
\]
where N_c = \sum_{j=1}^{N} 1_c(z_j), c = 1, ..., C. We then update η_1, ..., η_C by sampling from a Dirichlet density with updated hyperparameters, η_1, ..., η_C ∼ D(χ + N_1, ..., χ + N_C). In the Gibbs sampler step for z_1, ..., z_N, we sample each z_j independently from the categorical distribution with probabilities
\[
\pi(z_j = c \mid \eta_c, \rho_c, \alpha_c, R_j) \propto \eta_c\, \pi(R_j \mid \rho_c, \alpha_c) = \eta_c\, \frac{e^{-\frac{\alpha_c}{n} d(R_j, \rho_c)}}{Z_n(\alpha_c)}, \qquad c = 1, ..., C.
\]
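The Gibbs update of the cluster labels can be sketched directly from the categorical probabilities above. This is a minimal Python sketch under our own conventions: the footrule distance is passed in as a helper, and logZ holds the log partition values ln Z_n(α_c); working on the log scale avoids underflow for large α.

```python
import numpy as np

def sample_cluster_labels(R, rho, alpha, eta, logZ, footrule, rng):
    """Gibbs step: z_j ~ Categorical with
    P(z_j = c) proportional to eta_c * exp(-(alpha_c/n) d(R_j, rho_c)) / Z_n(alpha_c)."""
    n = len(rho[0])
    C = len(rho)
    z = []
    for Rj in R:
        logw = np.array([np.log(eta[c]) - alpha[c] / n * footrule(Rj, rho[c]) - logZ[c]
                         for c in range(C)])
        w = np.exp(logw - logw.max())       # stabilize before normalizing
        z.append(rng.choice(C, p=w / w.sum()))
    return z
```

Each z_j is drawn independently given the current cluster parameters, exactly as in the Gibbs sampler step described above.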

The Gibbs sampler step for θ is the same as in Section 4. The Metropolis step for updating the latent individual ranks goes as follows. For each assessor j = 1, ..., N, a new rank vector r_j^p is proposed with

30

M. CRISPINO ET AL.

the Swap proposal centered at rj^t. Then, the proposed augmented rank vector is accepted with probability min{1, a5}, where

(20)  ln a5 = −(α_{zj}/n) [ d(rj^p, ρ_{zj}) − d(rj^t, ρ_{zj}) ],

if g(Bjm, rj^p) = g(Bjm, rj^t) for all m = 1, ..., Mj, and with probability min{1, a6}, where

(21)  ln a6 = ln a5 + Σ_{m=1}^{Mj} [ g(Bjm, rj^p) − g(Bjm, rj^t) ] ln[θ/(1 − θ)],

if g(Bjm, rj^p) ≠ g(Bjm, rj^t) for some m = 1, ..., Mj.

Supplement B. Sampling procedure. Several simulation experiments were carried out to assess the performance of the methodology introduced in this paper. An important aspect in the design of these experiments was to find appropriate values for the model parameters. The parameter α_true controls the concentration of the individual latent rankings R1,true, ..., RN,true around the true consensus ρ_true: the larger α_true is, the more concentrated the individual rankings are. To give an idea of this effect, we plot in Figure 12, for a range of different α_true values, the average distances (1/N) Σ_{j=1}^N d(Rj,true, ρ_true) obtained when Rj,true ∼ Mal(α_true, ρ_true), with n = 10 and N = 100. Each point in this figure is computed from a simulated sample, and there were T = 100 samples for each α_true considered. As the number of items grows, identifying a consensus ranking becomes increasingly hard, due to the n! possible permutations. To balance this, N, the number of users, and λM, the expected number of pairs compared by each user, must be chosen accordingly. For instance, if n = 10, the maximal number of pairs that can be formed is M_max = n(n − 1)/2 = 45, while for n = 20, M_max = 190. Choosing λM = 30 would in the case of n = 10 correspond to providing the individual users, on average, with the proportion λM/M_max = 30/45 ≈ 0.67 of all possible pair comparisons, while for n = 20 the corresponding proportion is only λM/M_max = 30/190 ≈ 0.16. One should also account for the effect of the parameter θ, controlling the level of noise in the data, in the form of mistakes in reporting individual pairwise comparisons. While larger values of N will generally facilitate the estimation of the consensus ranking, larger θ will render the individual ranking estimates much less reliable.
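The arithmetic above can be checked with a two-line helper (illustrative only; the function name is ours):

```python
from math import comb

def pair_proportion(n, lam_M):
    """Average fraction of all C(n,2) possible pairs that a user compares,
    when the per-user number of pairs has mean lam_M."""
    return lam_M / comb(n, 2)

# n = 10: comb(10, 2) = 45, so pair_proportion(10, 30) is about 0.67;
# n = 20: comb(20, 2) = 190, so pair_proportion(20, 30) is about 0.16.
```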


Fig 12: Distribution of (1/N) Σ_{j=1}^N d(Rj,true, ρ_true), for N = 100, and for individual rankings R_{1:N,true} generated from the Mallows model Mal(α_true, ρ_true), for increasing values of α_true.

B.1. Sample from the Mallows in the presence of mistakes. In this section we explain the procedure we used to sample a collection B = {B1, ..., BN} of individual lists of pairwise preferences, where Bj = {Bj1, ..., BjMj}, from the Mallows model in the presence of mistakes. The scheme is the following:

1. Sample R1,true, ..., RN,true from the Mallows model, with consensus ρ_true and scale α_true;
2. Sample the number of pairwise comparisons given by each assessor, M1, ..., MN, from a Poisson distribution with mean λM < M_max = n(n − 1)/2, truncated from above at M_max;
3. For each j = 1, ..., N, sample without replacement Mj unordered pairs, Cj = {Cj1, ..., CjMj}, from the collection of M_max possible pairs;
4. For each pair Cjm, generate the ordered comparison Bjm, either correctly (w.r.t. Rj,true) or reversed, with probability depending on the model used, BM or LM.

Steps 1-3 are almost the same as in Vitelli et al. (2017), the only difference being that the data were here produced in a nested fashion. This was done to facilitate the direct comparability of the simulation results generated under different parameter settings.

When increasing N, the number of users. For fixed n, θ, λM, α and ρ, we first generated B = {B1, ..., BN} with the largest N, through steps 1-4 of Section B.1. Then we created the nested datasets by subsampling from B the intended smaller number of users.
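Steps 2-4 can be sketched as follows for the BM mistake model, taking the Mallows draws Rj,true of step 1 as given (a hypothetical helper; the Poisson truncation is handled crudely by capping at M_max, and all names are ours):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def simulate_pair_data(R_true, lam_M, theta):
    """Steps 2-4: for each assessor j, draw M_j ~ Poisson(lam_M) capped at
    M_max, sample M_j distinct pairs, and order each pair according to
    R_true[j], reversing it with probability theta (the BM mistake model).
    R_true[j, i] is the rank of item i for assessor j; lower rank = preferred."""
    N, n = R_true.shape
    all_pairs = list(itertools.combinations(range(n), 2))
    M_max = len(all_pairs)
    data = []
    for j in range(N):
        Mj = 0
        while Mj == 0:                       # avoid empty preference lists
            Mj = min(rng.poisson(lam_M), M_max)
        idx = rng.choice(M_max, size=Mj, replace=False)
        prefs = []
        for k in idx:
            a, b = all_pairs[k]
            winner, loser = (a, b) if R_true[j, a] < R_true[j, b] else (b, a)
            if rng.uniform() < theta:        # mistake: reverse the true order
                winner, loser = loser, winner
            prefs.append((winner, loser))    # (preferred item, other item)
        data.append(prefs)
    return data
```

With θ = 0 every generated comparison is consistent with the underlying ranking, which provides a direct check of the bookkeeping.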


When increasing λM, the average number of pairs per user. Fix n, θ, N, α and ρ, and let 1 ≤ λM^1 < ··· < λM^T ≤ M_max. The goal here was to create, for all users j, individual nested pairwise datasets B_{λM^t}, 1 ≤ t ≤ T, such that (i) all pairs present in B_{λM^t} had to be present also in B_{λM^{t+1}}, and (ii) the cardinalities Mj^t = |B_{λM^t}| would satisfy the approximation E(Mj^t) ≈ λM^t. This was achieved by first sampling Mj^T = |B_{λM^T}| from the Poisson distribution with parameter λM^T, truncated at M_max, and then performing sequential thinning of the pairs, moving first from B_{λM^T} to B_{λM^{T−1}}, then from B_{λM^{T−1}} to B_{λM^{T−2}}, etc., until finally reaching B_{λM^1}. Thinning was done independently for different users j, and so that approximately the proportion λM^{t−1}/λM^t of the Mj^t pairs in B_{λM^t} were kept also in B_{λM^{t−1}}, 1 < t ≤ T, with the other pairs being removed.

When increasing θ, the number of mistakes in the BM model. The goal here was to create nested datasets Bθ1, ..., BθT, corresponding to increasing probabilities of mistakes 0 < θ1 < ... < θT < 0.5, with the general rule that the mistakes of Bθt had to be present also in Bθt+1. Each Bθt = {B1,θt, ..., BN,θt}, t ∈ {1, ..., T}, denotes the collection of the pairwise comparisons of the N users, corresponding to mistake probability θt. We implemented steps 1-3 of Section B.1, while step 4 was done as follows. For each Cjm, j = 1, ..., N, m = 1, ..., Mj:

- Divide the interval [0, 1) into subintervals: [0, θ1), [θ1, θ2), ..., [θT−1, θT), [θT, 1);
- Sample um ∼ U(0, 1);
- If um ∈ [θt, θt+1), t ∈ {0, ..., T}, with the conventions θ0 = 0 and θT+1 = 1, then keep the order of the pair comparison Bjm the same as it is in Rj,true in the datasets Bj,θ1, ..., Bj,θt, and reverse it in the datasets Bj,θt+1, ..., Bj,θT.

When increasing the number of mistakes in the LM model. The goal here was to create nested data with an increasing number of mistakes generated from the logistic model:

logit P(Bjm mistake | Rj, β0, β1) = −β0 − β1 (d_{Rj,m} − 1)/(n − 2).

To do so, we either fixed β1 and varied β0, or fixed β0 and varied β1. Here we explain the case of β1 fixed at β1* and β0 varying. Let us denote the nested datasets as Bβ0,1, ..., Bβ0,T, corresponding to the decreasing sequence β0,1 > ... > β0,T > 0, and assume, as in the previous section, that the mistakes in Bβ0,t−1 had to be present also in Bβ0,t (notice that here the larger β0 is, the fewer mistakes there are in the data).


For an illustration, consider datasets of n = 10 items. In Table 3, we show how the probabilities P(Bj,m mistake | Rj, β0, β1 = β1*), for β1* = 5 and for some chosen values of β0 (columns), depend on the distance dRj,m between the compared items (rows).

Table 3
Values of P(Bj,m mistake | Rj, β0, β1 = 5), for some chosen values of β0 (columns), depending on the value of the distance between the compared items (rows).

dRj,m   β0,1 = 1.6   β0,2 = 1.1   β0,3 = 0.6   β0,4 = 0.1
1       0.17         0.25         0.35         0.48
2       0.11         0.17         0.25         0.35
3       0.07         0.11         0.17         0.25
4       0.04         0.07         0.11         0.17
5       0.03         0.04         0.07         0.11
6       0.02         0.03         0.04         0.07
7       0.01         0.02         0.03         0.04
8       0.01         0.01         0.02         0.03
9       0.00         0.01         0.01         0.02
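Entries of such a table come directly from the logistic formula above; a small helper (illustrative, name ours) makes this explicit:

```python
import math

def mistake_prob(d, n, beta0, beta1):
    """P(B_jm is a mistake | R_j, beta0, beta1) under the logistic model:
    logit P(mistake) = -beta0 - beta1 * (d - 1) / (n - 2)."""
    x = beta0 + beta1 * (d - 1) / (n - 2)
    return 1.0 / (1.0 + math.exp(x))

# For d = 1 the distance term vanishes, so P = 1/(1 + exp(beta0));
# e.g. mistake_prob(1, 10, 1.6, 5.0) is about 0.17, matching the first row.
```

The probability decreases monotonically in the distance d, consistently with the assumption that mistakes are more likely between similarly ranked items.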

We denote the matrix of values of Table 3 by Λ, and its d-th row by Λd. For each user j, we first sampled a set of pair comparisons Cj, as in steps 1-3 of Section B.1, and then generated four nested datasets (corresponding to the 4 values of β0 reported in Table 3) with increasing numbers of mistakes as follows. For each Cjm, j = 1, ..., N, m = 1, ..., Mj:

- Compute dRj,m and select the corresponding row Λ_{dRj,m};
- Divide the interval [0, 1) into subintervals with, as endpoints, the entries of Λ_{dRj,m}: [0, β0,1(dRj,m)), [β0,1(dRj,m), β0,2(dRj,m)), ..., [β0,4(dRj,m), 1);
- Sample um ∼ U(0, 1);
- If um ∈ [β0,t(dRj,m), β0,t+1(dRj,m)), t ∈ {0, ..., T}, with the conventions β0,0(dRj,m) = 0 and β0,T+1(dRj,m) = 1, then keep the order of the pair comparison Bjm the same as it is in Rj,true in the datasets Bβ0,1, ..., Bβ0,t, and reverse it in the datasets Bβ0,t+1, ..., Bβ0,T.

Supplement C. Simulations with the Logistic DGP. Here we show results obtained from experiments on simulated data generated from the logistic model for mistakes. We varied the parameters N, β0, β1, α and λT, while, as in Section 6, we fixed the true ranking ρ_true and used the footrule distance. We then analyzed the generated datasets by applying both the logistic (LM) and the Bernoulli (BM) models. No systematic differences in the results could be detected in the accuracy of the consensus ranking estimate, evaluated in terms of the posterior CDF of df(ρ, ρ_true) (not shown). We then studied whether clear differences could


be found when comparing the estimates of the individual ranking vectors to the corresponding true values. For this, we inspected the performance of the LM model by using two sets of simulations where n = 10, N = 100, Tj = 25 ∀j, α = 2.5, but with different settings of the logistic parameters:

S1: We varied β0 while keeping β1 = 5 constant, with the nested procedure explained in Supplement B.1;
S2: We varied the parameters β0 and β1 together, while keeping the probability of making a mistake (averaged across the simulated distances dRj,m) constant, and equal to that of the BM model.

Figure 13 gives us an idea of how P(Bjm mistake | Rj, β0, β1) depends on dRj,m, the distance between the items compared, when varying β0 and β1 according to the schemes S1 (left) and S2 (right).

Fig 13: Logistic probabilities of making a mistake as a function of the distance between the items compared in the two sets of simulations, S1 left, S2 right.

In S1, where β1 is fixed, the smaller the value of β0, the more likely are mistakes in the data. With scheme S2 instead, we want to inspect how the performance of LM changes as the dependence on the distance dRj,m becomes stronger. For this, we used the same measure as in Table 1 of Section 6, namely the percentage of users for which the estimated top-3 items belong to the true top-5. In Table 4, we report the results for all the simulated datasets, when estimated with the BM and the LM algorithms.

Table 4
Percentage of users for which the estimated top-3 items belong to the true top-5. Simulations with the parameter settings of Figure 13.

Simulation      Model   %        Simulation       Model   %
S1: β0 = 1.4    LM      94%      S2: black line   LM      91%
                BM      90%                       BM      92%
S1: β0 = 1.2    LM      92%      S2: red line     LM      89%
                BM      92%                       BM      87%
S1: β0 = 1      LM      93%      S2: green line   LM      92%
                BM      91%                       BM      91%
S1: β0 = 0.8    LM      88%      S2: blue line    LM      91%
                BM      85%                       BM      91%

As expected, the performance of both models deteriorates as β0 decreases when β1 is fixed (left panel in Table 4, corresponding to the simulation parameters in Figure 13, left). Somewhat surprisingly, BM and LM perform in very similar ways (right panel in Table 4, corresponding to the simulation parameters in Figure 13, right), and this remains true when the dependence in LM on the distance becomes stronger. We then computed the posterior probabilities of correctly predicting the preference order of all pairs not assessed by the users, that is

(22)  P( g(Bj,new, Rj) = g(Bj,new, Rj,true) | data ).

Figure 14 shows the box-plots for these predictive probabilities, stratified according to the true distance between the items.

Fig 14: Box-plots of the posterior probability for correctly predicting the missing preferences stratified by dRj , the distance between the items in Rj,true . Simulation S2, red line in Figure 13 right.
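The probability in (22) can be estimated from MCMC output as the fraction of posterior draws of Rj that order a new pair as Rj,true does; a schematic helper (names ours, assuming posterior samples of Rj are available):

```python
import numpy as np

def predictive_agreement(R_samples, R_true, pair):
    """Monte Carlo estimate of eq. (22): fraction of posterior samples of R_j
    in which the order of an unassessed pair agrees with R_{j,true}.
    Each ranking maps item index to rank; lower rank = preferred."""
    a, b = pair
    truth = R_true[a] < R_true[b]
    agree = [(R[a] < R[b]) == truth for R in R_samples]
    return float(np.mean(agree))
```

Stratifying these estimates by the true distance between the two items then yields box-plots of the kind shown in Figure 14.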


In this comparison, both models had very good predictive power, especially considering that the simulated data contained many mistakes (around 10%) and the assessments provided by different users were quite variable (α = 2.5). In many instances, the more general LM model appears to have had a slight edge over the simpler BM, but not always, and overall both methods produced very similar estimates of the individual rank vectors. The similarity of their performances may be due to the transitivity property, required in constructing versions of complete rankings Rj, being so strong that the precise form of the error model no longer has a major impact on the results.

E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]

E-mail: [email protected] E-mail: [email protected]
