Linking continuous and discrete linear ordering problems Rajeev Kohli Columbia University
[email protected]
Khaled Boughanmi Columbia University
[email protected]
Vikram Kohli Northwestern University
[email protected]
We characterize relations between continuous and discrete optimization versions of the linear ordering problem. One continuous version uses a logit model to represent uncertainty in outcomes and is solved in polynomial time by maximizing a likelihood function. Another is obtained by changing the objective from maximizing a likelihood function to maximizing the expected value of a randomized algorithm. We show that the latter unconstrained optimization problem is equivalent to a discrete, NP hard, version of the linear ordering problem. The solution to the maximum likelihood problem can be used to implement a randomized algorithm for the discrete problem. The expected value and performance ratio of this algorithm have a lower bound that is a function of the maximum likelihood value. The lower bound can be further refined for each problem instance. We present an application using data collected by TestTube on consumer perceptions of funny videos. Results suggest that the proposed continuous methods can quickly obtain near-optimal solutions for problems with thousands of alternatives.

Key words: Linear ordering problem; discrete optimization; nonlinear optimization; approximation algorithms; randomized algorithms; analysis of algorithms; big data analytics.
1. Introduction

"While there have always been connections between the foundations of continuous and discrete optimization, the active areas of research in these two fields have significantly differed until recently. In the last decade, partly stimulated by the growth of machine learning and by the proliferation of massive datasets, a number of new research areas have emerged at the intersection of the two fields. The expanded interface between continuous and discrete optimization has already led to a number of breakthroughs in both areas . . ." Bridging Continuous and Discrete Optimization¹
The purpose of this paper is to develop links between discrete and continuous versions of the linear ordering problem, and to illustrate how the latter can be used to solve problems with hundreds or thousands of alternatives. The linear ordering problem concerns the aggregation of multiple rankings or paired comparisons into a single representative ranking over a set of alternatives (Martí and Reinelt (2011)). Classic applications of the problem are the ranking of political candidates across voters, and the ranking of players (or teams) in a sport based on the results of matches in a season or tournament. Other applications include combining multiple rankings of schools and colleges, musical albums and songs, and videos and movies. Dwork, Kumar, Naor and Sivakumar (2001) describe an application to meta-search, in which query results from two or more Internet search engines are combined into a single ranking that is used to present the results to a user. Tromble and Eisner (2009) describe an application to the machine translation of text from one language to another. Extensions of the linear ordering problem include ranking items for selection in a Facebook newsfeed, and ranking missed tweets from which Twitter constructs "while you were away" lists. These problems can have hundreds or thousands of alternatives, a small, personalized, subset of which is presented to a user. The solutions to these problems need to be found quickly, sometimes in real time.

We consider one discrete and two continuous versions of the linear ordering problem. The discrete problem is a maximization version of the problem described by Kemeny (1959). It seeks a single
ordering of alternatives that is consistent with the largest possible number of paired comparisons in, or inferred from, the data. The problem is NP hard (Bartholdi, Tovey and Trick (1989)). An integer programming formulation of the problem with m alternatives has O(m^2) 0-1 decision variables and O(m^3) constraints. Since the numbers of decision variables and constraints grow quickly with m, it can be difficult to solve even a linear programming relaxation of the problem (which, for example, can be useful for randomized rounding). In section 3, we discuss a number of approximate procedures that have been proposed for solving the problem.

The first continuous version of the problem we consider is a statistical inference problem. It allows uncertainty in outcomes, which can occur because of uncertain or time-varying preferences or abilities, measurement error, and missing information on variables that may affect outcomes (for example, the surface on which a tennis match is played, or the weather). A random utility model is used to capture the uncertainty in outcomes. If the utilities of the alternatives have independent, extreme value distributions, the probability with which one alternative beats another has a binary logit form. The parameters, one for each alternative, are estimated by maximizing a likelihood function. Unlike the discrete problem, the statistical version considers the data for a problem instance to be a random sample from a relevant population, and does not assume statistical independence of paired comparisons when these are inferred from individual rankings or choices over more than two alternatives. Despite these differences, we show that the statistical and optimization approaches are related. Changing the objective of the statistical inference problem from maximizing a likelihood function to maximizing an expected value (of a randomized algorithm) yields a continuous and unconstrained formulation of the discrete linear ordering problem.
The two objectives, one maximizing a likelihood function and the other an expected value, are also related. Both maximize measures of central tendency: the likelihood formulation maximizes a geometric mean, and the expected value formulation maximizes an arithmetic mean, of the same probabilities. Since the geometric mean is no larger than the arithmetic mean, the maximum likelihood value can be used to obtain
a lower bound on the optimal solution value for the discrete version of the linear ordering problem. This lower bound is a function of the geometric mean and the variance of the square roots of the maximum likelihood probabilities. Notably, the lower bound, and the associated performance ratio, can be refined using the data for each problem instance. The maximum likelihood solution can be used to implement a randomized algorithm, and as a starting solution in a nonlinear optimization routine maximizing the expected value (which may converge to a local optimum). We compare the solutions obtained by the maximum likelihood method, the randomized algorithm, and the maximum expected value method, using data on pairwise comparisons of a large number of funny videos obtained from TestTube, a research lab at YouTube. The instance-specific lower bounds on the performance ratios are consistently high across the methods.
Organization of the paper. Section 2 formulates a statistical model for the linear ordering problem. Section 3 describes a discrete version of the problem and its integer formulation. Then it obtains a continuous, nonlinear programming representation of the problem by changing the objective function in the statistical model from maximizing a likelihood value to maximizing an expected value. Section 4 develops a randomized algorithm using the maximum likelihood solution, and relates the expected value of its solution to the maximum likelihood value. It also characterizes the lower bound on the performance ratio in terms of the maximum likelihood probabilities and their variances. Section 5 illustrates the results with an application using TestTube data on funny videos. The largest problem solved has 2,000 videos. Section 6 concludes the paper, and discusses the potential use of the proposed approach for other discrete optimization problems.
2. Statistical approach

The statistical approach allows uncertainty in outcomes. The special case in which the probabilities are restricted to, or obtain, 0-1 values corresponds to deterministic outcomes. We develop the following analysis for data consisting of independent paired comparisons among alternatives. The extension to choice data and ranking data is straightforward, and is briefly discussed at the end of the section.
Let m denote the number of alternatives. Let n_ij ≥ 0 denote the number of times alternative i beats alternative j, for all 1 ≤ i < j ≤ m. Let N_ij = n_ij + n_ji denote the number of paired comparisons of i and j in the data. Let

\[ N_i = \sum_{j=1,\, j \ne i}^{m} n_{ij} \]

denote the total number of times alternative i beats another alternative, where i = 1, . . . , m. Let

\[ N = \sum_{i=1}^{m} N_i = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} N_{ij} \]

denote the total number of paired comparisons. Let p_ij denote the probability that alternative i beats alternative j. Then the likelihood (joint probability) that i beats j n_ij times, and j beats i n_ji times, is

\[ l_{ij} = p_{ij}^{n_{ij}} (1 - p_{ij})^{n_{ji}}. \]

The likelihood function for the data is

\[ l = \prod_{i=1}^{m-1} \prod_{j=i+1}^{m} l_{ij}. \]

The objective of the linear ordering problem is to (1) estimate the p_ij values, and (2) find a linear ordering of the alternatives that maximizes the likelihood function. Since maximizing l is equivalent to maximizing ln l, we consider the equivalent problem of maximizing

\[ \ln l = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} c_{ij}, \]

where c_ij = ln l_ij = n_ij ln p_ij + n_ji ln(1 − p_ij) is the log-likelihood for the pair of alternatives i and j.

DeCani (1969) allowed p_ij to obtain a separate value for each pair of alternatives, i and j, subject to the constraints (1) p_ij + p_ji = 1, and (2) p_ij ≥ 1/2 (that is, p_ij ≥ p_ji) when i is ranked above j in the linear ordering. He showed that in any ordering of the alternatives, the maximum likelihood estimates of the probabilities are

\[ \hat p_{ij} = \max\left\{ \frac{1}{2},\ \frac{n_{ij}}{n_{ij} + n_{ji}} \right\}, \quad 1 \le i < j \le m. \]
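As a concrete illustration, the per-pair estimates above can be computed directly from the count data. The following sketch is our own, not the authors' code (function and variable names are hypothetical); it returns p̂_ij and the pairwise log-likelihood coefficients c_ij, assuming i is ranked above j:

```python
import math

def decani_pairwise_mle(n):
    """Per-pair maximum likelihood probabilities in DeCani's formulation,
    assuming i is ranked above j; n[i][j] = number of times i beat j."""
    m = len(n)
    p_hat, c = {}, {}
    for i in range(m):
        for j in range(i + 1, m):
            nij, nji = n[i][j], n[j][i]
            if nij + nji == 0:
                continue  # pair never compared
            p = max(0.5, nij / (nij + nji))   # p_hat_ij = max{1/2, n_ij/N_ij}
            c_val = nij * math.log(p)
            if nji:                            # p < 1 whenever n_ji > 0
                c_val += nji * math.log(1 - p)
            p_hat[(i, j)], c[(i, j)] = p, c_val
    return p_hat, c
```

With these coefficients in hand, the log-likelihood of any linear ordering is the sum of the appropriate c values over the pairs ordered consistently with it.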
Thus, for each pair of alternatives, i and j, the data for a problem instance can be used to calculate the values of p̂_ij, p̂_ji = 1 − p̂_ij, and c_ij = n_ij ln p̂_ij + n_ji ln(1 − p̂_ij). The problem of maximizing the log-likelihood function can then be written in the same form as the discrete optimization problem described in section 3, except that the decision variables x_ij have the coefficients c_ij. As noted, this is an NP-hard problem (DeCani incorrectly observed that it is a linear programming problem).

An alternative formulation, for which a solution can be obtained in polynomial time, uses a logit model, which is closely related to the Luce choice rule (Luce (1977)) employed by Thompson and Remage (1964), Remage and Thompson (1966) and Singh and Thompson (1968). Let alternative i have a random value u_i = v_i + ε_i, where v_i is a deterministic value and ε_i is a stochastic term with an independent distribution for each alternative i = 1, . . . , m. For example, if the pairwise comparisons represent preferences over products, then v_i and ε_i are the deterministic and stochastic components of a person's utility for alternative i. Similarly, if the paired comparisons are the results of tennis matches, then v_i is an (unobserved) measure of a player's capability and ε_i a random term that captures uncertainty in the player's performance in a particular match. The value of v_i can be a function of other variables, like the playing surface, the number of sets played and injuries to a player. The value of ε_i may capture both uncertainty and heterogeneity (for example, in the preferences of consumers, and in the judged relevance of documents in response to a search query). The probability that i beats j is then given by

\[ p_{ij} = P(u_i > u_j) = \int_{u_i=-\infty}^{\infty} \int_{u_j=-\infty}^{u_i} f_i(u_i)\, f_j(u_j)\, du_j\, du_i, \]

where f_i(u_i) is the density function of the random variable u_i. Different assumptions about the distribution function lead to different expressions for p_ij. The well-known logit model assumes that each ε_i has an independent, extreme value distribution with density function f_i(ε_i) = e^{−ε_i} e^{−e^{−ε_i}} and cumulative distribution function F_i(ε_i) = e^{−e^{−ε_i}}. In this case (see Maddala (1983)),

\[ p_{ij} = \frac{e^{v_i}}{e^{v_i} + e^{v_j}}. \]

The likelihood function, l_ij, is the joint probability that i beats j n_ij times and loses to j n_ji times:

\[ l_{ij} = \left( \frac{e^{v_i}}{e^{v_i} + e^{v_j}} \right)^{n_{ij}} \left( \frac{e^{v_j}}{e^{v_i} + e^{v_j}} \right)^{n_{ji}}. \]
Taking logs on both sides of this expression gives the log-likelihood for the pair i and j:

\[ c_{ij} = \ln l_{ij} = n_{ij}\{v_i - \ln(e^{v_i} + e^{v_j})\} + n_{ji}\{v_j - \ln(e^{v_i} + e^{v_j})\} = n_{ij} v_i + n_{ji} v_j - N_{ij} \ln(e^{v_i} + e^{v_j}), \]

where N_ij = n_ij + n_ji is the total number of paired comparisons of alternatives i and j. Thus, the log-likelihood function for the data is

\[ \ln l = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} c_{ij} = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} (n_{ij} v_i + n_{ji} v_j) - \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} N_{ij} \ln(e^{v_i} + e^{v_j}). \]
Since the probabilities depend on the differences v_i − v_j between pairs of alternatives, one of the v_i values, say v_1, can be arbitrarily set to zero. The values of the remaining m − 1 parameters in the model are estimated from the data. The maximum likelihood estimates v̂_i are obtained by maximizing ln l over the variables v_1, . . . , v_m. Since the log-likelihood function is concave in these variables, the estimates can be obtained by solving a convex optimization problem. Suppose alternative k = k_j has the jth largest value of v_k = v̂_k obtained by maximizing the likelihood function. Then the linear ordering k_1, k_2, . . . , k_m maximizes the likelihood function. It has the property that if i precedes j, then p_ij ≥ p_ji. The maximum likelihood solution also implies that if i < j < k for three alternatives, i, j and k, in the optimal linear ordering (that is, if i precedes j and j precedes k in the linear ordering), then p_ij ≤ p_ik.
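The estimation step above can be sketched as follows. This is our own minimal illustration, not the authors' code: it maximizes the log-likelihood with SciPy's quasi-Newton L-BFGS-B routine (the same family of algorithms the application in section 5 uses) and reads off the maximum likelihood ordering.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logit_ordering(n):
    """Fit the pairwise logit model by maximum likelihood.
    n[i, j] = number of times alternative i beat alternative j.
    Returns the estimates v (normalized so v[0] = 0) and the
    maximum likelihood linear ordering."""
    m = n.shape[0]

    def neg_log_lik(v):
        # -ln l = -sum_{i<j} [n_ij v_i + n_ji v_j - N_ij ln(e^{v_i}+e^{v_j})]
        ll = 0.0
        for i in range(m):
            for j in range(i + 1, m):
                N_ij = n[i, j] + n[j, i]
                if N_ij:
                    ll += (n[i, j] * v[i] + n[j, i] * v[j]
                           - N_ij * np.logaddexp(v[i], v[j]))
        return -ll

    res = minimize(neg_log_lik, np.zeros(m), method="L-BFGS-B")
    v = res.x - res.x[0]        # identification: set v_1 = 0
    order = np.argsort(-v)      # decreasing v gives the ML ordering
    return v, order
```

For large m, supplying the analytic gradient would speed this up considerably; the sketch relies on numerical differentiation for brevity.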
Suppose v̂_{k_1} ≫ v̂_{k_2} ≫ · · · ≫ v̂_{k_m}. Then p_ij = 1 − ε_ij, for all 1 ≤ i < j ≤ m, where ε_ij > 0 becomes smaller as j − i increases. If the paired comparison data have few reversals, the p_ij are close to 0-1 values, and the optimal linear ordering also solves the discrete optimization version of the problem.

The preceding method for paired comparisons has a straightforward extension when the data are choices from sets with two or more alternatives, or multiple rankings of alternatives. For choice data, the likelihood function is described as a product of choice probabilities for the selected alternatives. For rank-ordered data, it is described by the joint probability of obtaining the set of rankings. For random utilities with extreme value distributions, the choice probabilities are given by the multinomial logit model, and the probabilities of rank orderings are given by the rank-ordered logit model described by Beggs, Cardell and Hausman (1981). In each case, the log-likelihood function is a concave function of the parameters, and can be maximized in polynomial time.
3. Discrete optimization approach

The discrete optimization version of the linear ordering problem uses paired comparisons data. Thus, if the data consist of rankings of some or all of m alternatives by individuals, then these are converted into paired comparisons by assuming transitivity. Similarly, if the data are choices from sets of alternatives, then these are converted into paired comparisons between the chosen alternative and each other alternative in the choice set. Unlike the statistical approach, there is no explicit consideration of error or uncertainty in outcomes, and all paired comparisons, including those inferred from ranking or choice data, are assumed to be independent.

We consider a maximization version of the linear ordering problem due to Kemeny (1959). Its objective is to find a linear ordering of the alternatives that maximizes the number of paired comparisons consistent with the data. As noted, Bartholdi, Tovey and Trick (1989) showed that the problem is NP hard; Dwork, Kumar, Naor and Sivakumar (2001) showed that it continues to be NP hard even when the data has only four rank orderings over alternatives.
Let x_ij = 1 if alternative i precedes alternative j in a linear ordering; otherwise, x_ij = 0. The optimal linear ordering is a solution to the following 0-1 integer programming problem, denoted P_1:

\[ \text{Maximize } z = \sum_{i=1}^{m} \sum_{j=1,\, j \ne i}^{m} n_{ij} x_{ij} \]

subject to

\[ x_{ij} + x_{ji} = 1, \quad \text{for all } i \ne j,\ 1 \le i, j \le m, \]
\[ x_{ij} + x_{jk} + x_{ki} \le 2, \quad \text{for all distinct } i, j, k,\ 1 \le i, j, k \le m, \]
\[ x_{ij} \in \{0, 1\}, \quad \text{for all } i \ne j,\ 1 \le i, j \le m. \]

Let z* denote the optimal solution value for problem P_1. Observe that this formulation has O(m^2) decision variables and O(m^3) constraints. In section 5, we solve problems of different sizes, the largest of which has m = 2,000 alternatives. Formulated as the integer programming problem P_1, it has about 4 million decision variables and 8 billion constraints. Even a continuous relaxation of the problem, which could be used to implement a randomized algorithm, is impractically large. We describe an alternative formulation of the problem as an unconstrained and continuous nonlinear program with m decision variables.

A feasible solution to problem P_1 that obtains at least half as many non-reversals as there are paired comparisons has long been known to be trivially obtained: if any linear ordering has less than half the number of non-reversals, then the opposite ordering of alternatives must have at least half the number of non-reversals. Grötschel, Jünger and Reinelt (1984) described a cutting plane algorithm for the problem. Other methods used for solving the linear ordering problem include tabu search (Laguna, Martí and Campos (1999)), memetic algorithms (Schiavinotto and Stützle (2004)), variable neighborhood search (García, Pérez-Brito, Campos and Martí (2006)), simulated annealing (Charon and Hudry (2007)), scatter search (Campos, Glover, Laguna and Martí (2001)) and greedy randomized adaptive search (Campos, Glover, Laguna and Martí (2001)). Ailon, Charikar and Newman (2008) obtained a randomized algorithm that finds a linear ordering for which the expected number of reversals (in a minimization version of the problem)
is no greater than 11/7 times the number of reversals in an optimal ordering. Kenyon-Mathieu and Schudy (2007) developed a polynomial-time approximation scheme for the feedback arc set problem for weighted tournaments, which solves the Kemeny problem as a special case. Although an important theoretical result, the algorithm appears to be impractical. Van Zuylen and Williamson (2009) described a polynomial time approximation algorithm that guarantees that the number of reversals is no more than 8/5 times the number of reversals in an optimal ordering. Martí and Reinelt (2011) and Charon and Hudry (2007, 2010) surveyed many of the approaches that have been used for solving the linear ordering problem. Fagin, Kumar, Mahdian, Sivakumar and Vee (2006) considered an extension of the problem to allow partial rankings that allow ties or the classification of alternatives into ordered categories. Ailon, Charikar and Newman (2008), Filkov and Skiena (2004) and Van Zuylen and Williamson (2009) considered the related problem of consensus clustering.

Continuous formulation. We use the probabilistic framework described in section 2 to reformulate problem P_1 as an unconstrained, nonlinear program. As before, let u_i = v_i + ε_i denote the random utility of alternative i, where ε_i has an independent, extreme value distribution, for each i = 1, . . . , m. Suppose we knew the values of v_1, . . . , v_m. Then we could use the following randomized algorithm to obtain a linear ordering:

(1) Generate an observation u_i = v_i + ε_i, where ε_i is a random draw from an extreme value distribution, with density function f_i(ε_i) = e^{−ε_i} e^{−e^{−ε_i}}, for each i = 1, . . . , m;

(2) Arrange the utilities u_i in decreasing order of their values. If u_{k_1} > u_{k_2} > · · · > u_{k_m}, declare k_1, k_2, . . . , k_m as the solution to the linear ordering problem.

The expected value of the solution obtained by the randomized algorithm is

\[ E = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \left( \frac{e^{v_i}}{e^{v_i} + e^{v_j}}\, n_{ij} + \frac{e^{v_j}}{e^{v_i} + e^{v_j}}\, n_{ji} \right), \]

where n_ij is the number of paired comparisons in the data in which i is preferred to j, n_ji the number in which j is preferred to i, and e^{v_i}/(e^{v_i} + e^{v_j}) the probability that i precedes j in the linear ordering obtained by the randomized algorithm. Let P_2 denote the
problem of maximizing E over the variables v_1, . . . , v_m. Let E* denote the value of the optimal solution to problem P_2. Theorem 1 shows that E* = z*, where z* is the optimal solution value for problem P_1. The optimal solution to P_2 is obtained when the probability e^{v_i}/(e^{v_i} + e^{v_j}) approaches the value 1 when x_ij = x*_ij = 1 in the optimal solution to problem P_1, and the value 0 when x*_ij = 0.

Theorem 1. E* = z*.

Proof. Without loss of generality, let 1, . . . , m denote the optimal linear ordering for problem P_1. Consider a feasible solution v_1, . . . , v_m to the problem of maximizing E, where v_1 > v_2 > · · · > v_m. Then the probability that i precedes j in the linear ordering obtained by the randomized algorithm is

\[ p_{ij} = \frac{1}{1 + e^{-(v_i - v_j)}}. \]

Let d = min{v_i − v_{i+1} | 1 ≤ i ≤ m − 1} > 0. Then

\[ p_{ij} = \frac{1}{1 + e^{-(v_i - v_j)}} \ge \frac{1}{1 + e^{-d}} = 1 - \epsilon, \quad \text{for all } 1 \le i < j \le m, \]

where ε > 0 is a decreasing function of d. As d becomes arbitrarily large, the value of ε becomes arbitrarily close to zero, and the expected value of the randomized algorithm approaches the value E* = z* from below. Since E is an expected value, it cannot exceed the value z* of the optimal linear ordering. It follows that E* = z*.

Theorem 1 implies that the discrete optimization version of the linear ordering problem can be formulated as the unconstrained nonlinear optimization problem, P_2. Continuous optimization methods may be used to find locally optimal solutions to the problem. Randomized algorithms in which the v_i values are obtained by solving a related problem in polynomial time may also be used. We consider a randomized algorithm using the maximum likelihood solution described in section 2. Its performance depends on a relation with problem P_1, which we consider next.
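To make the randomized algorithm concrete, here is a small sketch of one draw (our own illustration, with hypothetical names): the extreme value noise is a standard Gumbel draw, and the ordering's objective value counts the paired comparisons it leaves unreversed.

```python
import numpy as np

def randomized_ordering(v, n, seed=None):
    """One draw of the randomized algorithm: u_i = v_i + Gumbel noise,
    rank alternatives by decreasing u_i, and count non-reversals.
    n[i][j] = number of paired comparisons in which i beat j."""
    rng = np.random.default_rng(seed)
    m = len(v)
    u = np.asarray(v, dtype=float) + rng.gumbel(size=m)  # step (1)
    order = np.argsort(-u)                               # step (2)
    pos = np.empty(m, dtype=int)
    pos[order] = np.arange(m)                            # pos[k] = rank of k
    value = sum(n[i][j] for i in range(m) for j in range(m)
                if i != j and pos[i] < pos[j])
    return list(order), value
```

Averaging value over many draws estimates E for the given v_1, . . . , v_m.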
4. Relation between statistical and optimization approaches

As before, let p_ij = e^{v_i}/(e^{v_i} + e^{v_j}) denote the probability that i beats j, where 1 ≤ i < j ≤ m. Then

\[ a = \frac{E}{N} = \frac{1}{N} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \left( p_{ij}\, n_{ij} + (1 - p_{ij})\, n_{ji} \right) \]

is the arithmetic mean of the probabilities of observing n_ij paired comparisons in which i beats j, and n_ji paired comparisons in which j beats i, for all 1 ≤ i < j ≤ m. Following Theorem 1, the maximum value of a is a* = z*/N. Now consider the geometric mean of the same probabilities:

\[ g = \left( \prod_{i=1}^{m-1} \prod_{j=i+1}^{m} p_{ij}^{n_{ij}} (1 - p_{ij})^{n_{ji}} \right)^{1/N}. \]

Then g = l^{1/N}, where l is the likelihood function for the data. Maximizing l, or ln l, is equivalent to maximizing g. The difference between maximizing E or l is that the first maximizes the arithmetic mean and the second maximizes the geometric mean of the probabilities p_ij for the data.

Let ĝ = (l*)^{1/N} denote the value of the geometric mean when it is evaluated using the maximum likelihood probabilities p̂_ij = e^{v̂_i}/(e^{v̂_i} + e^{v̂_j}). Let z = ẑ denote the value of the objective function for problem P_1 when p_ij = p̂_ij. Let â = ẑ/N. Since the arithmetic mean is no smaller than the geometric mean, â ≥ ĝ. Thus,

\[ z^* = N a^* \ge N \hat a \ge N \hat g = N (l^*)^{1/N}. \]

The solution z* = N(l*)^{1/N} is obtained when each inequality in the above expression is satisfied as an equality. It is attained when n_ji = 0, for all 1 ≤ i < j ≤ m; or when v_i = v (and thus p_ij = 1/2, for all i and j). Except for these two trivial cases, the lower bound for z* depends on the value of â − ĝ, in a manner described below.

Let r = E/E* = E/z* denote the performance ratio of a randomized algorithm using the maximum likelihood parameters v̂_i, for all i = 1, . . . , m. We obtain a lower bound for r using the following lemmas.

Lemma 1.

\[ \frac{N}{2} \le z^* \le M \le N, \]

where

\[ M = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \max\{n_{ij}, n_{ji}\}. \]
Proof. The upper bound of M on the value of z* follows trivially from the observation that there are at most max{n_ij, n_ji} non-reversals for each pair of alternatives i and j. Thus, z* ≤ M ≤ N because a linear ordering problem can have at most M ≤ N non-reversals. The lower bound follows from the observation that p_ij = e^{v_i}/(e^{v_i} + e^{v_j}) = 1/2, and thus E = N/2, for a randomized algorithm using v_i = v, for all i = 1, . . . , m. Since an expected value is no greater than the maximum value of a feasible solution, E = N/2 ≤ z*.

Tung (1975) and Aldaz (2012) obtained the following bounds for the difference between an arithmetic and geometric mean.

Lemma 2. (Tung 1975)

\[ \frac{1}{N} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2 \le \hat a - \hat g \le c\, \hat p_1 + (1 - c)\, \hat p_N - \hat p_1^{\,c}\, \hat p_N^{\,1-c}, \]

where

\[ c = \frac{\log\left[ \left( \hat p_N / (\hat p_N - \hat p_1) \right) \log\left( \hat p_N / \hat p_1 \right) \right]}{\log\left( \hat p_N / \hat p_1 \right)}. \]

Lemma 3. (Aldaz 2012)

\[ \mathrm{Var}(\sqrt{\hat p}) \le \hat a - \hat g \le (N - 1)\, \mathrm{Var}(\sqrt{\hat p}), \]

where

\[ \mathrm{Var}(\sqrt{\hat p}) = \frac{1}{N - 1} \sum_{k=1}^{N} \left( \sqrt{\hat p_k} - s \right)^2 \]

is the variance of the square roots of the probabilities p̂_1, . . . , p̂_N and

\[ s = \frac{1}{N} \sum_{k=1}^{N} \sqrt{\hat p_k} \]

is the mean of the square roots of the probabilities p̂_1, . . . , p̂_N.

The following theorem characterizes the lower bound for r.
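The quantities in Lemmas 2 and 3 are easy to compute from the N per-comparison probabilities. The sketch below is our own illustration (names are hypothetical); it evaluates the arithmetic mean, geometric mean, and the variance of the square roots, and can be used to verify the Aldaz sandwich numerically:

```python
import numpy as np

def mean_gap_terms(probs):
    """probs: one probability per observed paired comparison (length N).
    Returns (a_hat, g_hat, var_sqrt): arithmetic mean, geometric mean,
    and the sample variance of the square roots of the probabilities."""
    p = np.asarray(probs, dtype=float)
    a_hat = p.mean()                          # arithmetic mean = E/N
    g_hat = np.exp(np.log(p).mean())          # geometric mean = l^(1/N)
    var_sqrt = np.sqrt(p).var(ddof=1)         # Var of sqrt(p), Lemma 3
    return a_hat, g_hat, var_sqrt
```

For example, with probs = [0.8, 0.8, 0.8, 0.2] one can check that Var(√p̂) ≤ â − ĝ ≤ (N − 1) Var(√p̂).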
Theorem 2.

\[ r \ge k^* \left( \hat g + \max\left\{ \frac{1}{N} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2, \mathrm{Var}(\sqrt{\hat p}) \right\} \right), \]

where ĝ > 1/2 and k* = N/z* ≥ N/M ≥ 1.

Proof. If z* = N, the maximum likelihood solution obtains the optimal ordering of the alternatives, say i = 1, . . . , m, when v̂_1 ≫ v̂_2 ≫ · · · ≫ v̂_m. In this case, p_ij = e^{v̂_i}/(e^{v̂_i} + e^{v̂_j}) = 1 − ε, where ε can be made arbitrarily small by making the difference between v_i and v_{i+1} arbitrarily large. Thus, r = â = ĝ = 1 when z* = N. If z* ≤ N − 1, then z* ≥ Ê = Nâ. Using the lower bounds in the theorems by Tung and Aldaz gives

\[ z^* \ge \hat E = N \hat a \ge N \hat g + \max\left\{ \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2, N \cdot \mathrm{Var}(\sqrt{\hat p}) \right\}. \]

Thus

\[ r = \frac{\hat E}{z^*} = \frac{N \hat a}{z^*} \ge k^* \left( \hat g + \max\left\{ \frac{1}{N} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2, \mathrm{Var}(\sqrt{\hat p}) \right\} \right), \]

where k* = N/z* ≥ N/M ≥ 1 following Lemma 1. Finally, ĝ > 1/2 because the feasible solution v_i = v, for all i = 1, . . . , m, obtains p_ij = e^{v_i}/(e^{v_i} + e^{v_j}) = 1/2, for all pairs of alternatives i and j. The corresponding value of the likelihood function is l = 1/2^N. Since the maximum likelihood value l* is no smaller than the feasible likelihood value l, we obtain ĝ = (l*)^{1/N} ≥ l^{1/N} = 1/2.

As noted, the lower bound on the performance ratio is unity if z* = N/2 or z* = N. If z* = N/2, then k* = N/z* = 2 and the maximum likelihood solution obtains the optimal solution since r ≥ k*ĝ ≥ 2 · (1/2) = 1. If z* = N, then again the maximum likelihood solution obtains the optimal solution because k* = N/z* = 1 and ĝ = 1, since p_ij ∈ {0, 1} and l* = 1. Thus, if z* is close to its minimum value of N/2, the lower bound on the performance ratio is improved because k* is large. It is equal to one when z* = N because each p_ij value is unity. Finally, if there is an ordering, say 1, . . . , m, for which n_ij ≥ n_ji for all 1 ≤ i < j ≤ m, then the maximum likelihood procedure will choose the ordering that maximizes the number of non-reversals.

The maximum likelihood estimates v̂_i can be used to obtain a lower bound on the performance ratio for each problem instance. We can use the data for a problem instance to calculate N/M, and
the maximum likelihood parameters to calculate the values of p̂_ij, for all i, j, and thus Ê, ĝ and Var(√p̂). From these, we can calculate the lower bound

\[ r \ge \frac{N}{M} \left( \hat g + \max\left\{ \frac{1}{N} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2, \mathrm{Var}(\sqrt{\hat p}) \right\} \right). \]

Theorem 2 implies that r ≥ k*ĝ ≥ (N/M)ĝ > 1/2, because N/M ≥ 1 and ĝ > 1/2. The Tung lower bound decreases with the value of N, since z* ≥ N/2 implies that

\[ \frac{1}{z^*} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2 \le \frac{2}{N} \left( \sqrt{\hat p_N} - \sqrt{\hat p_1} \right)^2 \le \frac{2}{N}. \]

Thus for large values of N, the Aldaz lower bound dominates the Tung lower bound, and the performance ratio in Theorem 2 becomes

\[ r \ge k^* \left\{ \hat g + \mathrm{Var}(\sqrt{\hat p}) \right\}. \]

In general, the lower bound increases with the variance of the square roots of the maximum likelihood probabilities. The variance is zero only when all p̂_k values are equal, or equivalently, when â = ĝ, which only occurs when z* = N/2 or z* = N and thus r = 1.
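Putting the pieces together, the instance-specific lower bound can be computed from the data (N, M) and the maximum likelihood probabilities. The following sketch is our own illustration with hypothetical names, using the weaker multiplier k* ≥ N/M:

```python
import numpy as np

def instance_lower_bound(probs, M):
    """Lower bound on the performance ratio r from Theorem 2, using
    k* >= N/M. probs holds one maximum likelihood probability per
    paired comparison, so N = len(probs); M = sum of max(n_ij, n_ji)."""
    p = np.sort(np.asarray(probs, dtype=float))
    N = len(p)
    g_hat = np.exp(np.log(p).mean())           # geometric mean (l*)^(1/N)
    roots = np.sqrt(p)
    tung = (roots[-1] - roots[0]) ** 2 / N     # Tung term (Lemma 2)
    aldaz = roots.var(ddof=1)                  # Aldaz term, Var(sqrt(p))
    return (N / M) * (g_hat + max(tung, aldaz))
```

When every probability equals 1/2 (so that M = N/2), the bound equals one, matching the discussion above.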
5. Application

Comedy Slam was an experiment conducted in 2011 and 2012 by TestTube, a YouTube research lab (Shetty (2012)). Participants were shown pairs of funny videos and asked to vote for the one they found funnier. The two videos in a pair were placed side by side, and were randomly assigned left/right positions. The experiment used 21,207 videos, and 327,091 distinct pairs of videos. These 327,091 pairs account for 0.145% of the more than 200 million possible pairs (21,207 choose 2) that could have been compared. A total of 1,138,562 votes were obtained across participants. The average number of times a pair of videos was compared was 2.79, the minimum one, and the maximum 5,493.

We used subsets of the data to solve linear ordering problems with 35, 50, 100, 200, 500, 1,000 and 2,000 videos. The smallest problem with 35 videos was obtained by selecting the largest clique in which each video was compared with each other video at least 19 times.² These 35 videos comprise
Table 1   Summary of data for the solved problems.

No. of    No. of pairs   No. of paired    % of     % of pairs   % of paired
videos    of videos      comparisons      videos   of videos    comparisons
35             595          359,326        0.17       0.18         31.56
50           1,138          396,033        0.24       0.35         34.78
100          2,844          462,589        0.47       0.87         40.63
200          6,709          549,131        0.94       2.05         48.23
500         20,581          594,723        2.36       6.29         52.23
1,000       43,554          653,455        4.72      13.32         57.39
2,000       75,872          715,166        9.43      23.20         62.81
0.165% of the 21,207 videos in the data, but 31.55% (359,326) of the 1,138,562 total votes. The larger problems were generated by successively adding videos to the data set. In each step, we added a subset of videos that were compared with the largest number of videos already in the clique. Table 1 provides descriptive statistics for the data used for each problem. The smallest problem has 0.17% of all videos. It accounts for 0.18% of the 327,091 pairs of videos, and 31.56% of the 1,138,562 votes. The largest problem has 9.43% of all videos. It accounts for 23.20% of all pairs of videos, and 62.81% of all votes.
5.1. Computational time and the performance of algorithms

We first maximized the likelihood function to obtain the values of v̂_i for each problem. We used the maximum likelihood solution to calculate the lower bound on the performance ratio for the randomized algorithm. Then we used the maximum likelihood solution as starting values for the problem of maximizing the expected value of a randomized algorithm for the linear ordering problem. As noted, the solution obtained by maximizing the expected value may be a locally optimal solution. In each case, we obtained an integer solution. We used the limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, available in the RSTAN package in R, to maximize the likelihood and expected value functions (Byrd, Nocedal and Schnabel (1994)). The algorithm uses a quasi-Newton method and is particularly suited to problems with very large numbers of variables. The computations were done on a Dell laptop computer using an Intel i5-6440HQ CPU (2.60GHz) with 8GB of RAM.
Figure 1   Computational time for the maximum likelihood and maximum expected value solutions as a function of the number of alternatives in a problem. [Plot of running time, in seconds, against the number of videos m, for the two methods.]

Running time (in seconds)
No. of     Maximum       Maximum
videos     likelihood    expected value
35            0.0105          0.0141
50            0.0356          0.0702
100           0.4582          0.4572
200           1.4498          8.4635
500          12.4563         71.3828
1,000       114.7020        325.6792
2,000       864.6066      1,436.2584
Figure 1 shows the computational time for the maximum likelihood and maximum expected value solutions. For problems with 200 or fewer videos, the maximum likelihood solution was obtained in less than 1.5 seconds, and the maximum expected value solution was obtained in less than 9 seconds. The problem with 2,000 videos took the longest time to solve: 14.41 minutes (864.6066 seconds) for maximum likelihood, and 23.94 minutes (1436.2584 seconds) for maximum expected value. Table 2 shows the values of the upper bound (M ), the total number of paired comparisons (N ), the solution values zl and zE using the maximum likelihood and maximum expected value solutions, and the lower bounds zl /M and zE /M . The second-last column of Table 2 shows the (instance specific) lower bound on the performance ratio of a randomized algorithm using the maximum likelihood parameter values. The last column shows the values of Kendall’s tau for the rankings of
videos obtained by maximizing the likelihood function and the expected value. We interpret these results below.

Table 2    Solution values and lower bounds on the performance ratios using the maximum likelihood and maximum expected value solutions, and the randomized algorithm.
No. of
videos        M         N        z_ℓ       z_E     z_ℓ/M    z_E/M   L.B. on r  Kendall's τ
35         199,839   359,326   198,873   199,287   0.9952   0.9972    0.9067     0.8454
50         220,451   396,033   218,709   219,377   0.9921   0.9951    0.9057     0.8057
100        258,325   462,589   255,557   256,263   0.9893   0.9920    0.9029     0.6008
200        309,405   549,131   304,863   305,779   0.9853   0.9883    0.8955     0.5427
500        341,688   594,723   330,962   332,487   0.9686   0.9731    0.8787     0.5670
1,000      384,740   653,455   364,216   366,636   0.9467   0.9529    0.8578     0.5506
2,000      433,889   715,166   400,757   404,362   0.9236   0.9319    0.8339     0.4877
(1) The solutions obtained using both methods are close to the upper bound M, and thus close to the value of the optimal solution for the discrete optimization problem. The performance ratios z_ℓ/M and z_E/M decrease as the problem size increases, but both remain above 0.92 for the largest problem, with 2,000 videos.
(2) Maximizing the expected value provides marginal improvements in the solution values for each problem. As noted, maximizing the expected value may obtain a locally optimal solution. The same solutions were obtained using random starts, although the computational time was longer (by about 100 seconds for the problem with 2,000 videos). As the Kendall's τ values in Table 2 show, the similarity in the orderings obtained by maximizing the likelihood value and the expected value decreases as the number of videos increases.
(3) The lower bound on the performance ratio for the randomized algorithm is about 10% lower than the performance ratios for both the maximum likelihood and maximum expected value solutions.

5.2. Illustrative details

We describe details of the analysis for the problem with 35 videos. Figure 2 displays the data for the problem as a graph. Each black band around the circle, numbered 1 to 35, represents a video, and each edge a single comparison between a pair of videos. The videos are numbered in
increasing order of the number of times they were compared to another video. For example, video 1 was compared 5,540 times, and video 35 was compared 55,910 times, to another video (the average number of comparisons per video was 10,226). The total number of votes recorded for a pair of videos ranged between 19 and 3,225, and had an average value of 302.
(1) We estimated the maximum likelihood parameters, v̂_i, and used them to compute the predicted probabilities p̂_ij = e^{v̂_i}/(e^{v̂_i} + e^{v̂_j}). Figure 3 compares the (35 choose 2) = 595 values of p̂_ij against n_ij/(n_ij + n_ji), the actual proportion of times video i beat video j. There is a strong correlation between the two probabilities (but note that these pairwise probabilities are not independent). We ranked the videos in decreasing order of their v̂_i values. Figure 4 plots p̂_ij against n_ij/(n_ij + n_ji) for each of the videos ranked first, tenth, fifteenth, twenty-fifth and thirty-fifth. The topmost panel shows that the highest-ranked video has a probability of 0.53 of beating the second-highest ranked video; this probability increases monotonically and has a value of 0.675 against the lowest-ranked video. The actual proportions fluctuate around the dotted line showing the predicted probabilities. The lowest panel shows that the lowest-ranked video has a probability just below 0.5 of beating the second-lowest ranked video; this probability decreases to 0.32 against the highest-ranked video. In general, each video has a probability below 0.5 of beating a video of a higher rank and above 0.5 of beating a video of a lower rank. These probabilities increase or decrease monotonically with the difference in the ranks of a pair of videos. The difference between the largest and smallest predicted probabilities ranges from 0.145 for the highest-ranked video to 0.186 for the fifteenth-ranked video.
(2) We used the maximum likelihood estimates to implement a randomized algorithm.
We generated an observation u_i = v̂_i + ε_i for each video, where ε_i is an independent random draw from an extreme value distribution (with scale parameter equal to 1 and variance equal to π²/6). We obtained a linear ordering of the videos by arranging them in decreasing order of their u_i values, i = 1, ..., 35. We repeated the procedure one million times. The total running time was 6.533 minutes (392 seconds). Figure 5 shows the distribution of the total number of non-reversals obtained by these solutions. The average solution value, which is indicated by the dashed line, is equal to z̄ = 182,719, the
Figure 2    Graph showing paired comparisons data for the problem with 35 videos. (Note: black bands correspond to videos and edges to paired comparisons.)
Figure 3    Comparison of the actual and predicted proportions of votes for the 595 pairs of videos.
expected value of the solution obtained by the randomized algorithm using the maximum likelihood solution. The smallest and largest solution values were z_1 = 164,618 and z_2 = 196,513 non-reversals, respectively. The latter solution is at least z_2/M = 0.983 times the optimal solution value for the discrete optimization problem, where M = 199,839 is the upper bound on the optimal solution z* of the discrete optimization problem.
(3) We also used the maximum likelihood solution as the starting solution in a nonlinear maximization procedure that maximized the expected value of the randomized algorithm. An optimal solution to the discrete problem (and thus to the maximum expected value problem) satisfies the condition n_{i,i+1} ≥ n_{i+1,i}; otherwise, exchanging the positions of i and i+1 would increase the number of non-reversals for the pair i and i+1 without affecting reversals for any other pair of alternatives. Figure 6 shows that this condition is satisfied by the (possibly local) solution obtained by maximizing the expected value. The twenty-fourth and twenty-fifth ranked videos beat each other an equal number of times; in all other cases, a higher-ranked video beats a lower-ranked video more often than it is beaten by it.
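The randomized algorithm described in this subsection — perturbing each v̂_i with an independent extreme value (Gumbel) draw, sorting by the perturbed values, and counting non-reversals — can be sketched as follows. The v̂_i values and comparison counts below are hypothetical stand-ins for the estimates and data from the 35-video problem.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 35
v_hat = np.sort(rng.normal(size=m))[::-1]             # hypothetical logit estimates
n = rng.integers(10, 100, size=(m, m)).astype(float)  # hypothetical counts n[i, j]
np.fill_diagonal(n, 0)

def randomized_ordering(v_hat, rng):
    # Perturb each v̂_i with an independent standard Gumbel draw, then
    # order alternatives by decreasing u_i = v̂_i + ε_i.
    u = v_hat + rng.gumbel(size=v_hat.size)
    return np.argsort(-u)

def non_reversals(order, n):
    # Count comparisons consistent with the ordering: sum n[i, j] over
    # all pairs where i precedes j in the ordering.
    total = 0.0
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            total += n[i, j]
    return total

# Repeat the randomized algorithm and keep the best value found
# (the paper uses one million repetitions; 200 suffice for a sketch).
best = max(non_reversals(randomized_ordering(v_hat, rng), n) for _ in range(200))
```

A useful sanity check on the objective is that any ordering and its reversal together account for every comparison: non_reversals(order, n) + non_reversals(reversed order, n) equals the total number of paired comparisons.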
Figure 4    Actual and predicted proportions of votes for the kth ranked video against the other thirty-four videos (panels for k = 1, 10, 15, 25, 35; each panel plots the probability of beating video i against the rank of video i).
(4) Figure 7 compares the rankings of videos obtained by maximizing the expected value and the likelihood value. The upper nodes, colored black, correspond to the expected value solution, and the lower nodes, colored gray, to the maximum likelihood solution. The numbering of the videos corresponds to their
Figure 5    Distribution of solution values for the randomized algorithm using the maximum likelihood solution.
Figure 6    Values of n_{i,i+1}/(n_{i,i+1} + n_{i+1,i}) for the ordering obtained by the maximum expected value solution, i = 1, ..., 34.
ordering in the maximum expected value solution (video 1 is the highest-ranked video and video 35 the last-ranked video). An edge between the ith black node and the jth gray node means that the video ranked i by the maximum expected value solution was ranked j by the maximum likelihood solution. The two orderings are close; as shown in Table 2, the Kendall tau correlation between the
Figure 7    Comparison of the rank orderings obtained by maximizing the expected value and the likelihood value for the problem with 35 videos.
two rankings is 0.8454. The largest difference in the two rankings is seven (the video ranked 23 by the maximum expected value method is ranked 30 by the maximum likelihood method). Four of the top five, and eight of the top ten, videos identified by the two methods are the same.
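The Kendall's τ values reported in Table 2 measure exactly this kind of agreement between two rank orderings. A minimal sketch using SciPy, with small hypothetical orderings in place of the 35-video rankings:

```python
from scipy.stats import kendalltau

# Two hypothetical orderings of five alternatives: ranking_a[k] is the
# alternative placed in position k + 1 by the first method, and so on.
ranking_a = [3, 1, 2, 5, 4]
ranking_b = [1, 3, 2, 4, 5]

# Convert each ordering into a rank vector (position of each alternative),
# then correlate: τ = 1 for identical orderings, −1 for exact reversals.
rank_a = {alt: pos for pos, alt in enumerate(ranking_a)}
rank_b = {alt: pos for pos, alt in enumerate(ranking_b)}
alts = sorted(rank_a)
tau, _ = kendalltau([rank_a[a] for a in alts], [rank_b[a] for a in alts])
```

With no ties, τ equals (concordant pairs − discordant pairs) divided by the total number of pairs; here 8 of the 10 pairs are ordered the same way by both rankings, giving τ = 0.6.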
6. Conclusion

We examined the relation between continuous and discrete optimization versions of the linear ordering problem. One continuous version of the problem uses a logit model; its maximum likelihood parameters can be estimated in polynomial time. Another continuous version of the problem is obtained by changing the objective function from maximizing a likelihood value to maximizing an expected value. It is equivalent to a discrete version of the problem, which is NP hard. The maximum likelihood value is related to a geometric mean of probabilities, and the expected value to their arithmetic mean. Since the arithmetic mean is no smaller than the geometric mean, the maximum likelihood value provides a lower bound on the performance ratio of a randomized algorithm that is implemented using the maximum likelihood estimates of the logit parameters. The bound exceeds 1/2, and has a value of one when the number of reversals in the optimal solution is equal to half or all of the paired comparisons. The lower bound can be computed for each problem
instance. The maximum likelihood solution can be used as a starting solution when maximizing the expected value. An empirical application using a large dataset from YouTube on the ranking of funny videos shows that the maximum likelihood procedure is fast, and that improvements in the solution to the expected value problem are also quickly obtained. The lower bound on the performance ratios was consistently above 0.92, and progressively closer to one for smaller problems. The lower bound on the expected value of the solution obtained by the randomized algorithm decreased from over 90% of the optimal for the smallest problem, with 35 videos, to over 83% of the optimal for the largest problem, with 2,000 videos. The standard integer programming formulation of this problem has about 4 million decision variables and 8 billion constraints; it is unlikely that this problem, or a randomized algorithm based on a linear programming relaxation of this problem, could have been solved as easily and quickly as it was using the continuous formulations considered in the paper. The present formulations may be extended to allow the rank orders to be functions of covariates describing the alternatives, and to allow heterogeneity in the rankings across groups or individuals using latent-class or random-effects approaches. They may also be extended to the related problem of consensus clustering, introduced by Filkov and Skiena (2004) in the context of organizing microarray data and identifying genes of similar or differential expression. The approach presented in this paper can be useful for problems other than the linear ordering problem.
For example, it can be used to obtain continuous and unconstrained nonlinear formulations for minimum vertex cover (maximum independent set), minimum and maximum satisfiability, and maxcut problems (the maximum expected value formulation for the maxcut problem is equivalent to the formulation used by Goemans and Williamson (1995) as the basis for their famous randomized algorithm using semidefinite programming). Continuous, maximum likelihood formulations can be obtained for these problems, and those corresponding to minimum vertex cover and minimum satisfiability can be solved in polynomial time. The results relating the maximum likelihood value to a lower bound on the performance ratio of a randomized algorithm extend to these problems.
Endnotes
1. https://simons.berkeley.edu/programs/optimization2017
2. The proportion of times a video is voted over another stabilizes as the number of comparisons of the two videos increases. We used a minimum of nineteen comparisons per pair because the number of videos in a clique fell sharply when we increased the minimum number of paired comparisons.
References
Ailon, N., M. Charikar and A. Newman (2008), “Aggregating inconsistent information: ranking and clustering,” Journal of the ACM, 55 (5), 23:1–23:27.
Aldaz, J.M. (2012), “Sharp bounds for the difference between the arithmetic and geometric means,” Archiv der Mathematik, 99 (4), 393–399.
Bartholdi, J., C.A. Tovey and M.A. Trick (1989), “Voting schemes for which it can be difficult to tell who won the election,” Social Choice and Welfare, 6 (2), 157–165.
Beggs, S., S. Cardell and J. Hausman (1981), “Assessing the potential demand for electric cars,” Journal of Econometrics, 17 (1), 1–19.
Byrd, R.H., J. Nocedal and R.B. Schnabel (1994), “Representations of quasi-Newton matrices and their use in limited memory methods,” Mathematical Programming, 63 (4), 129–156.
Campos, V., F. Glover, M. Laguna and R. Martí (2001), “An experimental evaluation of a scatter search for the linear ordering problem,” Journal of Global Optimization, 21 (4), 397–414.
Charon, I. and O. Hudry (2007), “A survey on the linear ordering problem for weighted or unweighted tournaments,” 4OR, 5 (1), 5–60.
Charon, I. and O. Hudry (2010), “An updated survey on the linear ordering problem for weighted or unweighted tournaments,” Annals of Operations Research, 175 (1), 107–158.
DeCani, J.S. (1969), “Maximum likelihood paired comparison ranking by linear programming,” Biometrika, 56 (2), 537–545.
Dwork, C., R. Kumar, M. Naor and D. Sivakumar (2001), “Rank aggregation methods for the web,” WWW ’01: Proceedings of the 10th International Conference on World Wide Web, 613–622.
Fagin, R., R. Kumar, M. Mahdian, D. Sivakumar and E. Vee (2006), “Comparing partial rankings,” SIAM Journal on Discrete Mathematics, 20 (3), 628–648.
Filkov, V. and S. Skiena (2004), “Integrating microarray data by consensus clustering,” International Journal on Artificial Intelligence Tools, 13 (4), 863–880.
García, C.G., D. Pérez-Brito, V. Campos and R. Martí (2006), “Variable neighborhood search for the linear ordering problem,” Computers and Operations Research, 33 (12), 3549–3565.
Goemans, M.X. and D.P. Williamson (1995), “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming,” Journal of the ACM, 42 (6), 1115–1145.
Grötschel, M., M. Jünger and G. Reinelt (1984), “A cutting plane algorithm for the linear ordering problem,” Operations Research, 32 (6), 1195–1220.
Kemeny, J.G. (1959), “Mathematics without numbers,” Daedalus, 88 (4), 577–591.
Kenyon-Mathieu, C. and W. Schudy (2007), “How to rank with few errors,” Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, ACM, 95–103.
Laguna, M., R. Martí and V. Campos (1999), “Intensification and diversification with elite tabu search solutions for the linear ordering problem,” Computers and Operations Research, 26 (12), 1217–1230.
Luce, R.D. (1977), “The choice axiom after twenty years,” Journal of Mathematical Psychology, 15 (3), 215–233.
Maddala, G.S. (1983), Limited Dependent and Qualitative Variables in Econometrics, New York: Cambridge University Press.
Martí, R. and G. Reinelt (2011), The Linear Ordering Problem: Exact and Heuristic Methods in Combinatorial Optimization, Vol. 175, Springer Science & Business Media.
Remage, R. and W.A. Thompson (1966), “Maximum-likelihood paired comparison rankings,” Biometrika, 53 (1/2), 143–149.
Schiavinotto, T. and T. Stützle (2004), “The linear ordering problem: Instances, search space analysis and algorithms,” Journal of Mathematical Modelling and Algorithms, 3 (4), 367–402.
Shetty, S. (2012), “Quantifying comedy on YouTube: why the number of o’s in your LOL matter,” Google Research Blog, https://research.googleblog.com/2012/02/quantifying-comedy-on-youtube-why.html
Singh, J. and W.A. Thompson (1968), “A treatment of ties in paired comparisons,” Annals of Mathematical Statistics, 39 (6), 2002–2015.
Thompson, W.A. and R. Remage (1964), “Rankings from paired comparisons,” Annals of Mathematical Statistics, 35 (2), 739–747.
Tromble, R. and J. Eisner (2009), “Learning linear ordering problems for better translation,” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, Association for Computational Linguistics.
Tung, S.H. (1975), “On lower and upper bounds of the difference between the arithmetic and the geometric mean,” Mathematics of Computation, 29 (131), 834–836.
Van Zuylen, A. and D.P. Williamson (2009), “Deterministic pivoting algorithms for constrained ranking and clustering problems,” Mathematics of Operations Research, 34 (3), 594–620.