The Annals of Statistics 2016, Vol. 44, No. 2, 853–875 DOI: 10.1214/15-AOS1389 © Institute of Mathematical Statistics, 2016
ESTIMATION IN EXPONENTIAL FAMILIES ON PERMUTATIONS

BY SUMIT MUKHERJEE

Columbia University

Asymptotics of the normalizing constant are computed for a class of one-parameter exponential families on permutations which include Mallows models with Spearman's footrule and Spearman's rank correlation statistic. The MLE and a computable approximation of the MLE are shown to be consistent. The pseudo-likelihood estimator of Besag is shown to be √n-consistent. An iterative algorithm (IPFP) is proved to converge to the limiting normalizing constant. The Mallows model with Kendall's tau is also analyzed to demonstrate the flexibility of the tools of this paper.
1. Introduction. Analysis of permutation data has a long history in statistics. An early paper is the work of Mallows [34] in 1957, where the author proposed an exponential family of the form $e^{-\theta d(\pi,\sigma) - Z_n(\theta,\sigma)}$, henceforth referred to as Mallows models, to study nonuniform distributions on permutations. In this model, σ is a fixed permutation which acts as a location parameter, θ is a real-valued parameter, and d(·, ·) : Sn × Sn → R is a "distance" function on the space of permutations. Here, Zn(θ, σ) denotes the (unknown) log normalizing constant of this family. Mallows mainly considered two distances, namely Kendall's tau and Spearman's rank correlation. Using this modeling approach, Feigin and Cohen [16] analyzed the nature of agreement between several judges in a contest. Critchlow [9] gave some examples where the Mallows model gives a good fit to ranking data. Fligner and Verducci [18, 19] generalized the Mallows models with Kendall's tau and Cayley's distance to an n-parameter exponential family over permutations, where each parameter inductively determines a permutation in Sk starting from a permutation in Sk−1. The extra parameters allow more flexibility, and the structure of the model allows one to retain tractability. See also [10], which deals with various aspects of permutation models, Chapters 5 and 6 of [13], which sketch out numerous possible applications, and the book-length treatment in [35], which covers both theoretical and applied aspects of permutation modeling. Permutation modeling has also received some recent attention in the machine learning literature. The generalized version of the Mallows model with Kendall's tau was considered in [8, 30]. Estimation of the location and scale parameters in this

Received May 2015; revised September 2015.
MSC2010 subject classifications. Primary 62F12, 60F10; secondary 05A05.
Key words and phrases. Permutation, normalizing constant, Mallows model, pseudo-likelihood.
model was carried out in [40]. An infinite version of the generalized Mallows model was proposed in [38, 39], where the authors show the existence of sufficient statistics and compute conjugate priors for this model. If the data have multiple modes, the authors propose to use clustering techniques to separate permutations coming from different modes before applying their estimation scheme. In another direction, [28] proposes the use of Fourier analysis on the representation theory of finite groups to study probability models on permutations. Representing a probability distribution via its Fourier coefficients, the authors are able to formulate the update rule in a Bayesian model tracking problem in terms of Fourier coefficients. This representation, along with fast computation of Fourier coefficients, allows the study of permutation models of moderate size (n = 30). For a more detailed exposition of this technique, refer to [23]. Modeling of partially ranked data under the Mallows model with Kendall's tau was studied in [31]. Efficient sampling schemes for learning Mallows models with the Cayley, Ulam and Hamming distances have been studied recently in [24–26], utilizing the structure of the model under consideration. See also the recent work [3], where the authors study estimation of location and scale mixtures of the Mallows model with Kendall's tau in an algorithmic framework.

This paper analyzes a class of exponential families on the space of permutations using the recently developed concept of permutation limits. The notion of permutation limits was first introduced in [22], and is motivated by dense graph limit theory (see [32] and the references therein). The main idea is that a permutation can be thought of as a probability measure on the unit square with uniform marginals.
Multivariate distributions with uniform marginals have been studied widely in probability and statistics (see [20, 27, 33, 37, 45, 46, 48, 51] and references therein) and in finance (see [1, 7, 36, 41, 43] and references therein) under the name copula. One of the reasons for their popularity is that copulas are able to capture any dependence structure, as shown by Sklar's theorem [48].

1.1. Main contributions. This paper gives a framework for analyzing probability distributions on large permutations. It computes asymptotics of normalizing constants in a class of exponential families on permutations, and explores identifiability of such models. It derives the limits in probability of statistics under such models, and shows the consistency of two estimates including the MLE. It also shows the existence of consistent tests for such models. It gives an Iterative Proportional Fitting Procedure (IPFP) to numerically compute the normalizing constant. It also shows √n-consistency of the pseudo-likelihood estimator of Besag. It demonstrates the flexibility of this approach by analyzing the Mallows model with Kendall's tau; for that model, it again shows consistency of two estimates including the MLE. Finally, using results from [6], it computes the joint limiting distribution of the whole permutation for a class of permutation models which include the Mallows models with Spearman's rank correlation and Kendall's tau.
Even though Mallows models for a class of distance functions have been considered in the literature, not much is available in terms of weak limits, limiting distributions, error rates in estimation and hypothesis testing outside the Mallows model with Kendall's tau. One of the advantages of the framework proposed in this paper is that it allows analysis of permutation models from both a rigorous and a visual point of view. From a rigorous perspective, this approach allows one to study limiting properties of permutation models beyond the Mallows model with Kendall's tau and its generalizations, models which seemingly have no built-in independence structure to exploit. In particular, this approach covers the Mallows model with Spearman's correlation, for which not much theory is available in the literature. From a visual and diagnostic perspective, it gives a way to compare permutations, and thus to test goodness of fit for permutation models. Another advantage of this approach is that it provides a limiting distribution for the whole permutation viewed as a process for a wide class of models on permutations, which to the best of my knowledge was not known before even for the usual Mallows model with Kendall's tau. This allows for the possibility of studying partially observed rankings as well, by studying finite-dimensional marginals of this permutation process. The main tool for proving the results is a large deviation principle for a uniformly random permutation. This was first proved in [50], but the supplementary file [42] contains a new proof of this large deviation principle using permutation limits.

1.2. The 1970 draft lottery. To see how permutation data can arise naturally, consider the following example of historical importance.
On December 1, 1969, during the Vietnam War, the US Government used a random permutation of size 366 to decide the relative dates on which people (among US citizens born in the years 1944–1950) would be inducted into the army in 1970, based on their birthdays. 366 cylindrical capsules were put in a large box, one for each day of the year. The people born on the first chosen date had to join the war first, those born on the second chosen date had to join next, and so on. There were widespread allegations that the chosen permutation was not uniformly random. In [17], Fienberg computed the Spearman's rank correlation between birthdays and lottery numbers to be −0.226, which is significantly negative at the 0.001 level of significance. This suggests that people born in the latter part of the year were more likely to be inducted into the army earlier. If a permutation is not chosen uniformly at random, then the question arises whether a particular nonuniform model gives a better fit. It might be the case that there is a specific permutation σ toward which the sampling mechanism is biased, so that permutations close to σ have a higher probability of being selected. In the draft lottery example, σ is the permutation (366, 365, 364, . . . , 3, 2, 1). The Mallows models defined above are able to capture such behavior. The hypothesis of uniformity in this setting is equivalent to the hypothesis that θ = 0.
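Such a rank-correlation computation is mechanical. The sketch below (illustrative only; the actual 1970 lottery data is not reproduced here) computes Spearman's rank correlation between a ranking and the identity ranking via the standard formula ρ = 1 − 6Σᵢdᵢ²/(n(n² − 1)):

```python
def spearman_rho(pi):
    """Spearman's rank correlation between the ranking pi (a list holding a
    permutation of 1..n) and the identity ranking 1, 2, ..., n."""
    n = len(pi)
    d2 = sum((pi[i] - (i + 1)) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

n = 366
identity = list(range(1, n + 1))      # perfect agreement with itself
reversal = list(range(n, 0, -1))      # the permutation (366, 365, ..., 1)
print(spearman_rho(identity))         # 1.0
print(spearman_rho(reversal))         # -1.0
```

A correlation of −1 is attained exactly at the reversal permutation σ toward which the lottery was allegedly biased; the observed −0.226 sits between the uniform expectation 0 and this extreme.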
Possibly the most famous and widely used model on permutations is the Mallows model with Kendall's tau as the divergence function. One of the reasons for this is that for this model the normalizing constant is known explicitly (see, e.g., [15], (2.9)), and so analyzing this model becomes a lot simpler. However, when one moves away from the Mallows model with Kendall's tau and its generalizations, not much theory is available in the literature. One reason for this is that the normalizing constant is not available in closed form, and there are no straightforward independence assumptions in the model which one can exploit to analyze such models. Even basic properties of such models, such as identifiability and consistency of estimates, are not well understood in general.

1.3. Choice of the function d. Even though the literature on Mallows models mostly mentions d(·, ·) as a metric, for the purposes of this paper this is not necessary. As such, the function d will henceforth be referred to as a divergence function. One restriction on the divergence d(·, ·) which seems reasonable is that it is right invariant, that is,

$$d(\pi, \sigma) = d(\pi \circ \tau, \sigma \circ \tau) \quad \text{for all } \pi, \sigma, \tau \in S_n.$$
The justification for this last requirement is as follows: suppose the students in a class are labeled {1, 2, . . . , n}, and let π(i) and σ (i) denote the rank of student i based on math and physics scores, respectively (assume no tied scores). The divergence d(π, σ ) can be thought of as a measure of the strength of the relationship between math and physics rankings. If students are now labeled differently using a permutation τ , so that student i now becomes student τ (i), then the math and physics rankings become π ◦ τ and σ ◦ τ , respectively. But this relabeling of students in principle should not change the relation between math and physics rankings, which requires the right invariance of d(·, ·). Some of the common choices of right invariant divergence d(·, ·) in the literature are the following ([13], Chapters 5 and 6):
(a) Spearman's footrule: $\sum_{i=1}^n |\pi(i) - \sigma(i)|$.
(b) Spearman's rank correlation: $\sum_{i=1}^n (\pi(i) - \sigma(i))^2$.
(c) Hamming distance: $\sum_{i=1}^n 1\{\pi(i) \neq \sigma(i)\}$.
(d) Kendall's tau: minimum number of pairwise adjacent transpositions which converts π−1 into σ−1.
(e) Cayley's distance: minimum number of transpositions which converts π into σ, which equals n minus the number of cycles in πσ−1.
(f) Ulam's distance: number of deletion–insertion operations to convert π into σ, which equals n minus the length of the longest increasing subsequence in σπ−1.

See [13], Chapters 5 and 6, for more details on these divergences. It should be noted here that barring Spearman's rank correlation, all the other divergences in
this list are metrics on the space of permutations. If d(·, ·) is right invariant, then the normalizing constant is free of σ, as

$$\sum_{\pi \in S_n} e^{-\theta d(\pi,\sigma)} = \sum_{\pi \in S_n} e^{-\theta d(\pi \circ \sigma^{-1},\, e)} = \sum_{\pi \in S_n} e^{-\theta d(\pi, e)},$$
where e is the identity permutation. Also, if π is a sample from the probability mass function $e^{-\theta d(\pi,\sigma) - Z_n(\theta)}$, then π ◦ σ−1 is a sample from the probability mass function $e^{-\theta d(\pi,e) - Z_n(\theta)}$. This paper focuses on the case where σ is known, and carries out inference on θ when one sample π is observed from this model. If the location parameter σ is unknown, estimating it from one permutation π seems impossible, unless the model puts very small mass on permutations which are far from σ, in which case π itself is a reasonable estimate for σ. In case σ is known, without loss of generality by a relabeling it can be assumed that σ is the identity permutation. In an attempt to cover the first two divergences in the above list, consider an exponential family of the form

$$Q_{n,f,\theta}(\pi) = e^{\theta \sum_{i=1}^n f(i/n,\, \pi(i)/n) - Z_n(f,\theta)}, \tag{1.1}$$
where f is a continuous function on the unit square. In particular, if f(x, y) = −|x − y|, then

$$\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) = -\frac{1}{n}\sum_{i=1}^n \big|i - \pi(i)\big|,$$

which is a scaled version of the footrule [see (a) in the list above]. For the choice f(x, y) = −(x − y)²,

$$\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) = -\frac{1}{n^2}\sum_{i=1}^n \big(i - \pi(i)\big)^2$$

is a scaled version of Spearman's rank correlation statistic [see (b) in the list above]. A simple calculation shows that the right-hand side above is the same as

$$-\frac{(n+1)(2n+1)}{3n} + \frac{2}{n^2}\sum_{i=1}^n i\,\pi(i),$$
and so the same model would have been obtained by setting f(x, y) = xy. Note that the Hamming distance [(c) in the list of divergences] is also of this form for the choice f(x, y) = 1{x = y}, which is a discontinuous function.

REMARK 1.1. It should be noted here that the model Qn,f,θ covers a wide class of models, some of which are not unimodal. For example, if one sets f(x, y) = x(1 − x)y, then for n = 7,

$$\sum_{i=1}^7 f\big(i/7, \pi(i)/7\big) = \frac{1}{7^3}\sum_{i=1}^7 i(7-i)\,\pi(i),$$
which is maximized when

$$\pi(7) = 1, \quad \{\pi(1), \pi(6)\} = \{2, 3\}, \quad \{\pi(2), \pi(5)\} = \{4, 5\}, \quad \{\pi(3), \pi(4)\} = \{6, 7\}.$$
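This maximizer set can be checked by exhaustive search over all 5040 elements of S₇; a small sanity-check sketch (not part of the paper's argument):

```python
from itertools import permutations

coef = [i * (7 - i) for i in range(1, 8)]       # 6, 10, 12, 12, 10, 6, 0

# Score sum_i i(7-i) pi(i) for every permutation of {1,...,7}.
scores = {pi: sum(c * v for c, v in zip(coef, pi))
          for pi in permutations(range(1, 8))}
best = max(scores.values())
modes = [pi for pi, s in scores.items() if s == best]

print(len(modes))                                # 8 maximizers
for pi in modes:
    assert pi[6] == 1                            # pi(7) = 1
    assert {pi[0], pi[5]} == {2, 3}              # {pi(1), pi(6)} = {2, 3}
    assert {pi[1], pi[4]} == {4, 5}              # {pi(2), pi(5)} = {4, 5}
    assert {pi[2], pi[3]} == {6, 7}              # {pi(3), pi(4)} = {6, 7}
```

The eight maximizers arise by swapping values within each of the three tied coefficient pairs, with position 7 forced.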
Thus, for θ > 0 this model has 2³ = 8 modes. In general, for θ > 0 this model has 2^{(n−1)/2} modes for n odd, and 2^{(n−2)/2} modes for n even. If one assumes that for every fixed x the function y ↦ f(x, y) has a unique global maximum at y = x, then the model Qn,f,θ is unimodal. Indeed, in this case the mode is the identity permutation (1, 2, . . . , n) for θ > 0 and the reverse identity permutation (n, n − 1, . . . , 1) for θ < 0. Note that both the functions f(x, y) = −(x − y)² and f(x, y) = −|x − y| satisfy this condition. In general, if y ↦ f(x, y) is maximized uniquely at y = φ(x) for some function φ : [0, 1] → [0, 1], then at a heuristic level the mode of the distribution is the permutation πφ given by πφ(i) ≈ nφ(i/n). If the specific problem at hand demands unimodality, one can restrict to functions f(x, y) which are maximized on the diagonal, though nothing changes as this information is not exploited anywhere.

One important comment about the model Qn,f,θ is that different choices of the function f may give the same model. Indeed, as already remarked above, the functions f(x, y) = −(x − y)²/2 and f(x, y) = xy give rise to the same model. In general, whenever f(x, y) − g(x, y) can be written as φ(x) + ψ(y) for two functions φ, ψ : [0, 1] → R, the two models are the same. In particular, the functions f(x, y) = x + y and g(x, y) ≡ 0 give rise to the same model, which is the uniform distribution on Sn. The following definition restricts the class of functions f to ensure identifiability.

DEFINITION 1.2. Let C be the set of all continuous functions f on [0, 1]² which satisfy

$$\int_0^1 f(x, z)\,dz = 0 \quad \forall x \in [0, 1]; \qquad \int_0^1 f(z, y)\,dz = 0 \quad \forall y \in [0, 1], \tag{1.2}$$
and f is not identically 0.

PROPOSITION 1.3. If f1, f2 ∈ C and Qn,f1,θ(π) = Qn,f2,θ(π) for all π ∈ Sn, for all large n and some θ ≠ 0, then f1 ≡ f2.

PROOF. Setting g(x, y) = θ[f1(x, y) − f2(x, y)], the given assumption implies $\sum_{i=1}^n g(i/n, \pi(i)/n) = C_n$ for all π ∈ Sn, where $C_n := Z_n(f_1, \theta) - Z_n(f_2, \theta)$ is free of π. Thus, for any 1 ≤ i, j ≤ n − 1, define σ1, σ2 ∈ Sn by

$$\sigma_1(i) = j, \quad \sigma_1(j) = i, \quad \sigma_1(k) = k \text{ otherwise};$$
$$\sigma_2(i) = n, \quad \sigma_2(n) = j, \quad \sigma_2(j) = i, \quad \sigma_2(k) = k \text{ otherwise}.$$

By choosing π = σ1 and σ2 in succession, one has

$$\sum_{k=1}^n g\big(k/n, \sigma_1(k)/n\big) = \sum_{k=1}^n g\big(k/n, \sigma_2(k)/n\big) = C_n,$$
giving g(i/n, j/n) + g(1, 1) = g(i/n, 1) + g(1, j/n). Clearly, this holds for i = n or j = n as well by direct substitution. Thus, fixing x, y ∈ (0, 1] and choosing i = ⌈nx⌉, j = ⌈ny⌉ and letting n → ∞, on invoking continuity of g one gets g(x, y) + g(1, 1) = g(x, 1) + g(1, y). On integrating with respect to y and using the condition g = θ[f1 − f2] ∈ C, this gives g(x, 1) = g(1, 1). By symmetry, one has g(1, y) = g(1, 1), and so g(x, y) = g(1, 1) as well, which forces g(x, y) ≡ 0, again invoking the condition g ∈ C. This completes the proof of the proposition.

Another set of constraints which would have served the same purpose is f(x, 0) = 0 ∀x ∈ [0, 1]; f(0, y) = 0 ∀y ∈ [0, 1]. For the sake of definiteness, this paper uses (1.2). This mimics the condition in the discrete setting that the row and column sums of a square matrix are all 0. It should be noted here that the function f(x, y) = xy does not belong to C, and it should be replaced by the function f(x, y) = (x − 1/2)(y − 1/2). However, this is not done in Sections 2 and 3 to simplify notation, on observing that all the proofs and conclusions of this paper go through as long as f(x, y) cannot be written as φ(x) + ψ(y), which is true for f(x, y) = xy.

1.4. Statement of main results. The first main result of this paper is the following theorem, which computes the limiting value of the log normalizing constant of models of the form (1.1) for a general continuous function f in terms of an optimization problem over copulas.

DEFINITION 1.4. Let M denote the space of all probability distributions on the unit square with uniform marginals.

THEOREM 1.5. For any function f ∈ C and θ ∈ R, consider the probability model Qn,f,θ(π) as defined in (1.1). Then the following conclusions hold:

(a)
$$\lim_{n\to\infty} \frac{Z_n(f,\theta) - Z_n(0)}{n} = Z(f,\theta) := \sup_{\mu\in\mathcal{M}} \big\{\theta\,\mu[f] - D(\mu\|u)\big\},$$

where u is the uniform distribution on the unit square, μ[f] := ∫ f dμ is the expectation of f with respect to the measure μ, and D(·‖·) is the Kullback–Leibler divergence.
(b) If π ∈ Sn is a random permutation from the model Qn,f,θ, then the random probability measure

$$\nu_\pi := \frac{1}{n}\sum_{i=1}^n \delta_{(i/n,\,\pi(i)/n)}$$

on the unit square converges weakly in probability to the probability measure μf,θ ∈ M, where μf,θ is the unique maximizer in part (a).

(c) The measure μf,θ of part (b) has density $g_{f,\theta}(x,y) := e^{\theta f(x,y) + a_{f,\theta}(x) + b_{f,\theta}(y)}$ with respect to Lebesgue measure on [0, 1]², with functions af,θ(·), bf,θ(·) ∈ L¹[0, 1] which are unique almost surely. Consequently, one has

$$\sup_{\mu\in\mathcal{M}} \big\{\theta\,\mu[f] - D(\mu\|u)\big\} = -\int_0^1 \big[a_{f,\theta}(x) + b_{f,\theta}(x)\big]\,dx.$$

(d) The function Z(f, θ) of part (a) is a differentiable convex function of θ with a continuous and strictly increasing derivative Z′(f, θ) which satisfies

$$Z'(f,\theta) = \lim_{n\to\infty} \frac{1}{n} Z_n'(f,\theta) = \mu_{f,\theta}[f].$$
REMARK 1.6. Part (b) of the above theorem gives one way to visualize a permutation π as a measure νπ on the unit square. A somewhat similar way to view a permutation π as a measure is presented in the supplementary file [42], which also demonstrates what the measure νπ looks like when π is a large permutation from Qn,f,θ. As an example, setting θ = 0 one gets the uniform distribution on Sn, in which case the limiting measure μf,θ becomes u, the uniform distribution on [0, 1]². Note that the theorem statement uses Zn(0) instead of Zn(f, 0). This is because Zn(f, 0) = log n! for all choices of the function f, and so the use of the notation Zn(0) is without loss of generality.

Focusing on inference about θ when an observation π is obtained from the model Qn,f,θ, the following corollary of Theorem 1.5 shows consistency of the Maximum Likelihood Estimate (MLE). In this model, the MLE for θ is the solution to the equation
$$\frac{1}{n}\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) - \frac{1}{n} Z_n'(f,\theta) = 0.$$

Since Zn(f, θ) and Z′n(f, θ) are hard to compute numerically, as an approximation one can replace the quantity $\frac{1}{n} Z_n'(f,\theta)$ above by its limiting value Z′(f, θ) and then solve for θ. The following corollary shows that this estimate is consistent for θ as well.
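Once a numerical routine for Z′(f, ·) is available (for instance via the iterative scheme of Theorem 1.9 below), solving for θ is a one-dimensional root-finding problem, and since Z′(f, ·) is continuous and strictly increasing, plain bisection suffices. A minimal sketch; the stand-in `zprime` below is a toy strictly increasing function chosen for illustration, not the actual Z′ of any permutation model:

```python
import math

def solve_theta(t, zprime, lo=-100.0, hi=100.0, tol=1e-10):
    """Solve zprime(theta) = t by bisection, assuming zprime is
    continuous and strictly increasing with a root in [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if zprime(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy stand-in for Z'(f, .): any strictly increasing function behaves the
# same way.  Feeding it the value t = zprime(0.7) recovers theta = 0.7.
theta_hat = solve_theta(math.tanh(0.7), math.tanh)
print(round(theta_hat, 6))    # 0.7
```

In practice `t` would be the observed statistic (1/n)Σᵢ f(i/n, π(i)/n), so the quality of the resulting estimate is governed entirely by the quality of the approximation to Z′.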
COROLLARY 1.7. For f ∈ C, consider the model Qn,f,θ as in (1.1), and let π be an observation from this model.

(a) One has

$$\frac{1}{n}\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) \xrightarrow{P} Z'(f,\theta) = \mu_{f,\theta}[f]$$

for every θ ∈ R.

(b) Both the expressions

$$ML_n(\pi,\theta) := \frac{1}{n}\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) - \frac{1}{n} Z_n'(f,\theta),$$
$$LD_n(\pi,\theta) := \frac{1}{n}\sum_{i=1}^n f\big(i/n, \pi(i)/n\big) - Z'(f,\theta)$$

have unique roots θ̂ML and θ̂LD with probability tending to 1, and these roots are consistent for θ.

(c) Consider the testing problem of θ = θ0 versus θ = θ1 with θ1 > θ0. Then the test φn := 1{θ̂LD > (θ0 + θ1)/2} is consistent, that is,

$$\lim_{n\to\infty} E_{Q_{n,f,\theta_0}}\varphi_n = 0, \qquad \lim_{n\to\infty} E_{Q_{n,f,\theta_1}}\varphi_n = 1.$$
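The concentration in part (a) can be illustrated by simulation. Sampling from Qn,f,θ does not require the normalizing constant: a Metropolis chain with random-transposition proposals only ever uses probability ratios, in which Zn(f, θ) cancels. The sketch below is an illustration of this standard device, not a procedure from the paper; chain length, burn-in, and the choices n = 60, θ = 8 are arbitrary:

```python
import math
import random

def mcmc_mean_statistic(n, theta, f, steps=40000, burn=10000, seed=1):
    """Metropolis chain on S_n targeting Q_{n,f,theta} via random
    transpositions; the unknown normalizing constant Z_n(f, theta)
    cancels in the acceptance ratio.  Returns the average of
    (1/n) sum_i f(i/n, pi(i)/n) over the post-burn-in iterates."""
    rng = random.Random(seed)
    pi = list(range(1, n + 1))
    stat = sum(f((i + 1) / n, pi[i] / n) for i in range(n)) / n
    total = 0.0
    for t in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:
            # Change in sum_i f(i/n, pi(i)/n) if pi(i) and pi(j) swap.
            delta = (f((i + 1) / n, pi[j] / n) + f((j + 1) / n, pi[i] / n)
                     - f((i + 1) / n, pi[i] / n) - f((j + 1) / n, pi[j] / n))
            if theta * delta >= 0 or rng.random() < math.exp(theta * delta):
                pi[i], pi[j] = pi[j], pi[i]
                stat += delta / n
        if t >= burn:
            total += stat
    assert sorted(pi) == list(range(1, n + 1))      # chain stays in S_n
    return total / (steps - burn)

f = lambda x, y: x * y
m_tilted = mcmc_mean_statistic(60, 8.0, f)
m_uniform = mcmc_mean_statistic(60, 0.0, f)
print(m_uniform, m_tilted)   # the tilted chain concentrates higher
```

For f(x, y) = xy the uniform chain hovers near the uniform mean (n + 1)²/(4n²) ≈ 1/4, while the tilted chain concentrates near the strictly larger value Z′(f, θ), in line with part (a).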
The above corollary shows that it is possible to estimate the parameter θ consistently with just one observation from the model Qn,f,θ. No error rates have been obtained for the estimates {θ̂ML, θ̂LD} in this paper, as part (a) of Theorem 1.5 does not have any error rates. A good approximation of the limiting log normalizing constant will lead to an efficient estimator for θ, in the sense that the estimator will be close to the MLE. The definition of Z(f, θ) is in terms of an optimization problem over M, which is an infinite-dimensional space. In general, such an optimization can be hard to carry out. The next theorem gives an iterative algorithm for computing the density of the optimizing measure μf,θ with respect to Lebesgue measure. Intuitively, the algorithm starts with the function e^{θf(x,y)} and alternately scales it along the x and y marginals to produce uniform marginals in the limit.

DEFINITION 1.8. For any integer k ≥ 1, let Mk denote the set of all k × k matrices with nonnegative entries with all row and column sums equal to 1/k.

THEOREM 1.9. (a) Define a sequence of k × k matrices by setting B0(r, s) := e^{θf(r/k, s/k)} for 1 ≤ r, s ≤ k, and

$$B_{2m+1}(r,s) := \frac{B_{2m}(r,s)}{k\sum_{l=1}^k B_{2m}(r,l)}, \qquad B_{2m+2}(r,s) := \frac{B_{2m+1}(r,s)}{k\sum_{l=1}^k B_{2m+1}(l,s)}.$$
Then there exists a matrix Ak,θ ∈ Mk such that lim_{m→∞} Bm = Ak,θ.

(b) Ak,θ ∈ Mk is the unique maximizer of the optimization problem

$$\sup_{A\in\mathcal{M}_k} \Big\{\theta \sum_{r,s=1}^k f(r/k, s/k)\,A(r,s) - 2\log k - \sum_{r,s=1}^k A(r,s)\log A(r,s)\Big\}.$$
(c) The function

$$W_k(f,\theta) := \sup_{A\in\mathcal{M}_k} \Big\{\theta \sum_{r,s=1}^k f(r/k, s/k)\,A(r,s) - 2\log k - \sum_{r,s=1}^k A(r,s)\log A(r,s)\Big\}$$

is a convex differentiable function in θ for fixed f, with

$$W_k'(f,\theta) = \sum_{r,s=1}^k A_{k,\theta}(r,s)\, f(r/k, s/k).$$
(d) Finally, for any continuous function φ : [0, 1]² → R one has

$$\lim_{k\to\infty} \sum_{r,s=1}^k A_{k,\theta}(r,s)\,\varphi(r/k, s/k) = \int_{[0,1]^2} \varphi(x,y)\, g_{f,\theta}(x,y)\,dx\,dy.$$

In particular, this implies

$$Z(f,\theta) = \lim_{k\to\infty} W_k(f,\theta) = \lim_{k\to\infty}\lim_{m\to\infty} \Big\{\theta \sum_{i,j=1}^k f(i/k, j/k)\,B_m(i,j) - 2\log k - \sum_{i,j=1}^k B_m(i,j)\log B_m(i,j)\Big\}.$$
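The theorem translates directly into code. The sketch below alternates row and column rescaling and evaluates the objective of part (b) at the iterate; grid size and iteration count are arbitrary illustrative choices, and the starting matrix uses e^{θf(r/k, s/k)}, matching the intuition stated before Definition 1.8:

```python
import math

def ipfp(theta, f, k, iters=200):
    """IPFP of Theorem 1.9: alternately rescale rows and columns of
    B0(r, s) = exp(theta * f(r/k, s/k)) until all row and column sums
    are (numerically) 1/k, i.e. the matrix lies in M_k."""
    B = [[math.exp(theta * f((r + 1) / k, (s + 1) / k)) for s in range(k)]
         for r in range(k)]
    for _ in range(iters):
        for r in range(k):                       # row step: B_{2m+1}
            z = k * sum(B[r])
            B[r] = [v / z for v in B[r]]
        for s in range(k):                       # column step: B_{2m+2}
            z = k * sum(B[r][s] for r in range(k))
            for r in range(k):
                B[r][s] /= z
    return B

def w_k(theta, f, k, iters=200):
    """Evaluate theta*sum f*A - 2 log k - sum A log A at the IPFP limit;
    by part (d) this approximates Z(f, theta) for large k."""
    A = ipfp(theta, f, k, iters)
    val = -2.0 * math.log(k)
    for r in range(k):
        for s in range(k):
            a = A[r][s]
            val += theta * f((r + 1) / k, (s + 1) / k) * a - a * math.log(a)
    return val

f = lambda x, y: x * y
A = ipfp(2.0, f, 40)
assert all(abs(sum(row) - 1 / 40) < 1e-8 for row in A)   # uniform marginals
print(w_k(0.0, f, 40))    # ~0.0: at theta = 0 the optimum is uniform, Z = 0
```

At θ = 0 the iteration is exact after one pass (the uniform matrix), and the objective vanishes, matching Z(f, 0) = 0; for θ > 0 the value Wk(f, θ) is strictly increasing in θ, as part (d) of Theorem 1.5 predicts for its limit.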
REMARK 1.10. Since gf,θ(x, y) has uniform marginals, the functions af,θ(·) and bf,θ(·) are the solutions to the joint integral equations

$$\int_0^1 e^{\theta f(x,z) + a_{f,\theta}(x) + b_{f,\theta}(z)}\,dz = 1, \qquad \int_0^1 e^{\theta f(z,y) + a_{f,\theta}(z) + b_{f,\theta}(y)}\,dz = 1 \qquad \forall x, y \in [0,1].$$

By Theorem 1.9, it follows that

$$\lim_{n\to\infty} \frac{Z_n(f,\theta) - Z_n(0)}{n} = -\int_0^1 \big[a_{f,\theta}(x) + b_{f,\theta}(x)\big]\,dx.$$
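These integral equations can be attacked numerically by discretizing [0, 1] on a grid and iterating them as a fixed-point scheme, which is just the IPFP written in log scale. A sketch under arbitrary illustrative choices (grid size, iteration count, θ = 1.5, and the footrule function):

```python
import math

def solve_ab(theta, f, k=60, iters=200):
    """Discretize [0,1] by midpoints and alternately enforce the two
    integral equations: a(x) = -log int exp(theta f(x,z) + b(z)) dz and
    b(y) = -log int exp(theta f(z,y) + a(z)) dz (midpoint rule)."""
    xs = [(i + 0.5) / k for i in range(k)]
    a, b = [0.0] * k, [0.0] * k
    for _ in range(iters):
        a = [-math.log(sum(math.exp(theta * f(x, z) + bz)
                           for z, bz in zip(xs, b)) / k) for x in xs]
        b = [-math.log(sum(math.exp(theta * f(z, y) + az)
                           for z, az in zip(xs, a)) / k) for y in xs]
    return xs, a, b

theta = 1.5
f = lambda x, y: -abs(x - y)          # Mallows model with the footrule
xs, a, b = solve_ab(theta, f)

# Both integral equations hold on the grid ...
k = len(xs)
for i in (0, k // 2, k - 1):
    lhs = sum(math.exp(theta * f(xs[i], z) + a[i] + bz)
              for z, bz in zip(xs, b)) / k
    assert abs(lhs - 1.0) < 1e-6

# ... and -int (a + b) approximates the limiting log normalizing constant.
z_limit = -sum(ai + bi for ai, bi in zip(a, b)) / k
print(z_limit)
```

Since f ≤ 0 here, the limit is sandwiched: it is at most 0 and at least θ·u[f] = −θ/3, which gives a quick plausibility check on the numerical output.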
For the limiting normalizing constant in the Mallows model with the footrule or Spearman's rank correlation, one needs to take f(x, y) = −|x − y| or f(x, y) = −(x − y)² [or f(x, y) = xy], respectively. Even though analytic computation of af,θ(·), bf,θ(·) might be difficult, the algorithm of Theorem 1.9 (known as IPFP) can be used for a numerical evaluation of these functions. The Iterative Proportional Fitting Procedure (IPFP) originated in the work of Deming and Stephan [12] in 1940. For more background on IPFP, see [11, 29, 44, 47] and the references therein. Theorem 1.9 gives a way to approximate the limiting log partition function numerically by fixing k large and running the IPFP for a suitably large number of iterations m.

Another approach to estimation in such models is to estimate the parameter θ without estimating the normalizing constant. The following theorem constructs an explicit √n-consistent estimator for θ for the class of models considered in Theorem 1.5. This estimate is similar in spirit to Besag's pseudo-likelihood estimator [4, 5]. The pseudo-likelihood is defined to be the product of all one-dimensional conditional distributions, one for each random variable. Since in a permutation the conditional distribution of π(i) given {π(j), j ≠ i} determines the value of π(i), it does not make sense to look at the conditional distribution (π(i) | π(j), j ≠ i). In this case, a meaningful thing to consider is the distribution of (π(i), π(j) | π(k), k ≠ i, j), which gives the pseudo-likelihood as

$$\prod_{1\le i<j\le n} Q_{n,f,\theta}\big(\pi(i), \pi(j) \mid \pi(k),\, k \neq i, j\big).$$
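Concretely, conditionally on {π(k) : k ≠ i, j}, the pair (π(i), π(j)) can only be the observed assignment or its swap, so each factor above is a two-point (logistic) distribution in θ, and the resulting pseudo-score is monotone in θ. A minimal sketch of the resulting estimator; the bisection solver and the synthetic near-identity data are illustrative choices, not the paper's construction:

```python
import math
import random

def pseudo_mle(pi, f, lo=-50.0, hi=50.0, tol=1e-8):
    """Maximize the pseudo-likelihood in theta.  Conditionally on the
    other coordinates, (pi(i), pi(j)) is either the observed pair or its
    swap, so each factor is exp(theta*h)/(exp(theta*h) + exp(theta*h')),
    and the pseudo-score below is strictly decreasing in theta; its root
    is found by bisection."""
    n = len(pi)
    # d_ij = f(i,pi(i)) + f(j,pi(j)) - f(i,pi(j)) - f(j,pi(i)) on [0,1]^2.
    ds = [f((i + 1) / n, pi[i] / n) + f((j + 1) / n, pi[j] / n)
          - f((i + 1) / n, pi[j] / n) - f((j + 1) / n, pi[i] / n)
          for i in range(n) for j in range(i + 1, n)]

    def score(theta):
        # d/dtheta of the pseudo-log-likelihood: sum_d d / (1 + e^{theta d})
        return sum(d / (1.0 + math.exp(theta * d)) for d in ds)

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Synthetic data: a near-identity permutation (identity plus a few random
# swaps) should yield a positive, finite estimate for f(x, y) = xy.
rng = random.Random(7)
n = 30
pi = list(range(1, n + 1))
for _ in range(10):
    i, j = rng.randrange(n), rng.randrange(n)
    pi[i], pi[j] = pi[j], pi[i]

theta_hat = pseudo_mle(pi, f=lambda x, y: x * y)
print(theta_hat)
```

Note that for extreme data (e.g., the exact identity with f(x, y) = xy, where every d is positive) the pseudo-score never vanishes and the maximizer diverges, so a finite estimate requires the observed permutation to contain both concordant and discordant pairs.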