arXiv:1609.05374v1 [cs.LG] 17 Sep 2016
Extended Formulation for Online Learning of Combinatorial Objects
Holakou Rahmanian Department of Computer Science University of California Santa Cruz Santa Cruz, CA 95064
[email protected]
S.V.N. Vishwanathan Department of Computer Science University of California Santa Cruz Santa Cruz, CA 95064
[email protected]
David P. Helmbold Department of Computer Science University of California Santa Cruz Santa Cruz, CA 95064
[email protected]
Abstract

The convex hull of n-symbol Huffman trees is known to have exponentially many facets/constraints [17]. This makes the standard on-line learning techniques for learning Huffman trees impractical, since they use multiplicative updates followed by projections to satisfy all of the constraints. However, there are general extended formulation techniques that encode the convex hull of Huffman trees as a polytope in a higher dimensional space with only polynomially many facets. This extended formulation methodology can also be used to encode the n-element permutahedron in O(n log n) dimensions with only a polynomial number of facets. We develop a general technique for converting these extended formulations into efficient on-line algorithms with good relative loss bounds. The resulting algorithms have nearly the same regret bounds as state-of-the-art algorithms for permutations, and are the first efficient algorithms for the on-line learning of Huffman trees.
1 Introduction

This paper introduces a general methodology for developing efficient and effective on-line learning algorithms over combinatorial structures. Examples include learning the best permutation of a set of elements for scheduling or assignment problems, or learning the best Huffman tree for compressing sequences of symbols. On-line learning algorithms are being successfully applied to an increasing variety of problems, so it is important to have good tools and techniques for creating algorithms that match the particular problem at hand.

The on-line learning setting proceeds in a series of trials where the algorithm makes a prediction or takes an action associated with a combinatorial object in the space, and then receives the loss of its choice in such a way that the loss of any of the possible combinatorial objects can be easily computed. The algorithm can then update its internal representation based on this feedback, and the process moves on to the next trial. Unlike batch learning settings, there is no assumed distribution from which losses are randomly drawn; instead, the losses are chosen adversarially. In general, an adversary can force arbitrarily large loss on the algorithm, so instead of measuring the algorithm's performance by the total loss incurred, the algorithm is measured by its regret: the amount of loss the algorithm incurs above that of the single best predictor in some comparator class. Usually the comparator class is the class of predictors defined by the combinatorial space being learned. To
make the setting concrete, consider the case of learning Huffman trees. In each trial, the algorithm would predict a Huffman tree, and then obtain a sequence of symbols to be encoded. The loss of the algorithm on that trial can be the inner product of any [0, 1]^n loss vector and the vector of code lengths of the symbols. Using frequencies as the loss vector, the loss will be the average code length of the sequence. Over a series of trials, the regret of the algorithm is the difference in combined average code lengths between the on-line algorithm and the single best Huffman tree chosen in hindsight. Therefore the regret of the algorithm can be viewed as the cost of not knowing the best combinatorial object ahead of time. With proper tuning, the regret is typically logarithmic in the number of combinatorial objects.

One way to create algorithms for these combinatorial problems is to use one of the well-known so-called "experts algorithms" like weighted majority [15] or hedge [5], with each combinatorial object treated as an "expert". However, this requires explicitly keeping track of one weight for each of the exponentially many combinatorial objects, and thus results in an inefficient algorithm. Furthermore, it also introduces an additional loss-range factor into the regret bounds. There has been much work on creating efficient algorithms that implicitly encode the weights over the set of combinatorial objects using concise representations. For example, many distributions over the 2^n subsets of n elements can be encoded by the probability of including each of the n elements. In addition to subsets, such work includes permutations [9, 21], paths [14, 19], and k-sets [20].

The component hedge algorithm of Koolen, Warmuth, and Kivinen [13] is a powerful generic technique when the implicit encodings are suitably simple. The component hedge algorithm works by performing multiplicative updates on the parameters of its implicit representation. However, the implicit representation is typically constrained to lie in a convex polytope. Therefore Bregman projections are used after the update to return the implicit representation to the desired polytope. The technique for proving good relative loss bounds is, on each trial, to relate the excess loss of the algorithm to its movement (from both update and projection) towards the implicit representation of an arbitrary comparator in the class. Note that this process is only efficient when there is a small (polynomial) number of constraints on the implicit representations.

Mathematicians have studied the problem of concisely specifying the convex hulls of complicated combinatorial structures like permutations and trees using few constraints. They have developed a powerful technique of representing these polytopes as linear projections of higher-dimensional polyhedra, called extended formulations, so that the polytope description has far fewer (polynomially instead of exponentially many) constraints. Unfortunately, component hedge techniques do not apply directly to these extended formulations. Our main contribution is a general methodology for creating efficient and effective on-line learning algorithms over combinatorial structures that exploits these extended formulations.
This methodology yields new on-line learning algorithms over structures like permutations that already have efficient algorithms, as well as efficient algorithms over structures like Huffman trees that are not suitable for component hedge techniques and did not previously have efficient on-line learning algorithms. Our methodology with extended formulations includes a novel way of producing predictions directly from the algorithm's implicit representation. Previous techniques require that the algorithm perform a potentially expensive decomposition of its implicit representation into a convex combination of corners of its convex polytope, and then sample one of these corners to use as its prediction. In contrast, we can sample directly from the algorithm's implicit representation without requiring a decomposition step.

The remainder of the paper is organized as follows. In Section 2, we discuss existing work in the area of on-line learning generally, and over structured concepts specifically, as well as extended formulation techniques. Section 3 explains the extended formulation used in our setting by reviewing the work of Kaibel and Pashkovich [12]. We then propose our algorithm, along with its technical steps, in Section 4. Finally, Section 5 concludes by contrasting our approach with existing ones and describing directions for future work.
2 Related Work

On-line learning is a rich and vibrant area; see [3] for a textbook treatment. Implicit representations for structured concepts (sometimes called 'indirect representations') have been used for a variety of problems [7, 8, 16, 18, 19]. In the case of permutations [9, 21] and component hedge [13], the implicit representations seem to be better matched with the combinatorial structure than the explicit representation, allowing not only decreased running time but also the proof of better bounds. Our more general methodology with extended formulations also has this advantage, although perhaps to a lesser extent than the permutation-specific algorithms.

As in [21], the loss family provided in our approach is linear over the first-order representation of the objects [4]. Concretely, for permutations of n items, we work with vectors v ∈ R^n in which each of the elements of {1, 2, . . . , n} appears exactly once. Also, for Huffman trees of n symbols, we work with vectors v ∈ R^n in which v_i indicates the depth of the node corresponding to symbol i in the coding tree. In contrast, [9] works with the second-order representation and the corresponding losses, which form a more general loss family (see [21] for a comparison).

There have been several works aimed at efficiently describing the polytopes of different combinatorial objects like permutations [6] and Huffman trees [17]. Extended formulation, developed in the combinatorial optimization community and on which our results rely, is a general methodology for concisely describing combinatorial polyhedra [11, 12].
3 Extended Formulation

There are many combinatorial objects whose polytopes cannot be described in their original space without exponentially many facets (e.g. see [17]). In order to have more efficient algorithms, there have been several efforts in the field of combinatorial optimization towards describing these polytopes in other spaces. In recent years, the concept of representing these polytopes as linear projections of higher-dimensional polyhedra, known as extended formulation, has received significant attention: many combinatorial polyhedra can be described with far fewer facets than in the original space as projections of higher-dimensional polyhedra. See [11] for some of the tools for constructing such extended formulations. In the following subsections, we first give an overview of the work of Kaibel and Pashkovich [12], and then extract the formulation that fits our methodology.

3.1 Constructing Extended Formulations from Reflection Relations

One of the tools for constructing polynomial-size extended formulations is the framework developed by [12] using reflection relations. The basic idea is to start with a corner of the polytope (e.g. a permutation) and then create a sequence of reflections through hyperplanes (e.g. swapping a pair of elements) so that any corner of the polytope can be generated by applying a subsequence of the reflections to the canonical corner. The convex hull is then generated by allowing "partial reflections" (i.e. containing the entire line segment connecting the original point and its reflection). Any point in the convex hull is then created by a sequence of partial reflections, and can be encoded by a sequence of variables indicating how much of each reflection was used. In fact, there is no need to start with a single corner: one can consider passing an entire polytope as an input through the sequence of (partial) reflections to generate a new polytope.

This framework provides an inductive construction of higher-dimensional polytopes via sequences of reflection relations (see Theorem 1 in [12]). Concretely, let P^n_obj be the polytope of a given combinatorial object of size n (i.e. the convex hull of H, the set of all vectors of the combinatorial object in R^n). The typical approach is to properly embed P^n_obj ⊂ R^n into P̂^n_obj ⊂ R^{n+1}, and then feed it through an appropriate sequence of reflection relations as an input polytope in order to obtain an extended formulation for P^{n+1}_obj ⊂ R^{n+1}. Theorem 1 in [12] provides sufficient conditions for the correctness of this procedure.

By doing this inductive step, there will be additional variables and inequalities in the formulation. In fact, for each reflection relation, there will be one additional variable indicating the extent to which the reflection occurs, and two additional inequalities indicating the two extreme cases of complete reflection and remaining unchanged. Therefore, if polynomially many reflection relations are used to go from n to n + 1, then we can construct an extended formulation of polynomial size for P^n_obj for all n.
[Figure 1: An example sorting network illustrating the extended formulation for pure and mixture permutations of n = 3 items. The values of v and x are shown in blue and red, respectively.]

3.2 Extended Formulation of Objects Closed under Re-Ordering

Assume we want to construct an extended formulation for a class of combinatorial objects which is closed under any re-ordering. Note that Huffman trees and, trivially, permutations belong to such a class of objects. In these cases, for the reflection relations, one can use hyperplanes through the origin with normal vectors of the form e_k − e_ℓ, which makes the reflections swaps of the kth and ℓth elements.

Now let us work out the form of the extended formulation in this particular case. First, we find the additional variable along with the two additional inequalities associated with such a reflection relation. Concretely, assume v ∈ R^n goes through the reflection relation and v′ ∈ R^n is the output. Since v′ must fall on the line segment connecting v and its reflection through the hyperplane, there exists x ∈ R such that v′ − v = x (e_k − e_ℓ) and ⟨e_k − e_ℓ, v⟩ ≤ ⟨e_k − e_ℓ, v′⟩ ≤ −⟨e_k − e_ℓ, v⟩. Using these two properties, one can obtain the relation between v′ and v via a linear transformation given the additional variable x, as well as the constraints enforced on x:

v′ = m x + v,   where m_i = 1 if i = k, m_i = −1 if i = ℓ, and m_i = 0 otherwise,   (1)

0 ≤ x ≤ v_ℓ − v_k ¹   (2)
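For concreteness, the following is a minimal sketch (our own illustration, not code from [12]) of a single reflection relation acting as a partial-swap comparator, implementing (1) and (2) for a pair of coordinates k and ℓ:

```python
import numpy as np

def partial_swap(v, k, l, x):
    """Apply one reflection relation (partial swap) on coordinates k and l.

    The swap amount x must satisfy constraint (2): 0 <= x <= v[l] - v[k].
    """
    assert 0.0 <= x <= v[l] - v[k], "x violates constraint (2)"
    m = np.zeros_like(v, dtype=float)
    m[k], m[l] = 1.0, -1.0              # the vector m of Eq. (1)
    return v + x * m                    # v' = m x + v

v = np.array([1.0, 2.0, 3.0])
print(partial_swap(v, k=0, l=2, x=2.0))  # complete swap -> [3. 2. 1.]
print(partial_swap(v, k=0, l=2, x=1.0))  # partial swap  -> [2. 2. 2.] (a mixture)
```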
Notice that x indicates the amount being swapped between the kth and ℓth elements, which can range from zero (remaining unchanged) to the maximum swap capacity (a complete swap). Also note that the sequence of reflection relations can be viewed as a sequence of comparators in a network [1]. In the case of permutations, the comparators associated with the combined sequences of reflection relations form bubble-sort networks, since in each inductive step a sequence of length n of corresponding comparators must bubble out the largest element. For Huffman trees, however, the sequence of comparators used in the inductive step consists of two sweeps of the elements to bubble out the two largest elements, which are associated with the two deepest leaves in the tree [12]. Furthermore, since the construction of the polytope is inductive, the order of the reflection relations is the opposite of the direction of the network of comparators. In fact, one can observe that the extended formulation of a Huffman tree/permutation is the vector of swap values of the comparators in the sorting network. Moreover, a mixture of Huffman trees/permutations can be represented by partial swap values in the comparators (see Figure 1).

Now we can express the extended formulation using the results above. Suppose we are using m reflection relations in total. Then, starting from an anchor point c and applying Equation (1) successively, we obtain the affine transformation connecting the extended formulation space X and the original space V:

v = M x + c,   v, c ∈ V ⊂ R^n,  x ∈ X ⊂ R^m,  M ∈ {−1, 0, 1}^{n×m}
The anchor point c is [1, 2, . . . , n]^T and [1, 2, . . . , n − 2, n − 1, n − 1]^T in the case of permutations and Huffman trees, respectively. Also, using all the inequalities from the reflection relations as in (2), the extended formulation space X can be described as X = {x ∈ R^m : Ax ≤ b and x ≥ 0}, in which A ∈ R^{m×m} and b ∈ R^m. See the Appendix for some theoretical results on the structure of A and b.

¹ Note that v_ℓ − v_k may already be expressed in terms of previous variables of the extended formulation.

Note: In the description above, for simplicity, we explained the basic inductive approach to building extended formulations for permutations and Huffman trees. However, one can build an extended formulation for both permutations and Huffman trees non-inductively and more efficiently. Concretely, for permutations one can construct an extended formulation based on an arbitrary sorting network. Similarly, by using an arbitrary sorting network along with some additional comparators and simple linear maps, an extended formulation for Huffman trees can also be built (see Theorem 7 in [12]).

3.3 Combinatorial Polytope Description in Augmented Formulation

Even though the polytope can now be described using a polynomial number of facets in the extended formulation x ∈ X, it is not natural to define a linear loss over the elements of x. However, one can concatenate the original formulation v ∈ V with the extended formulation x ∈ X, so that it is possible not only to describe the polytope efficiently, but also to express meaningful losses such as the average code length and the sum of completion times for Huffman trees and permutations, respectively. So, given a loss vector ℓ ∈ [0, 1]^n defined over V, we can work with (V, X) as shown in the left column of Table 1. In addition, in order to have the comfort of working with affine subspaces², we add a non-negative slack vector λ to turn all inequalities into equalities. As a result, we define the augmented formulation space W (see the right column of Table 1).

Table 1: Concatenating the original space and the extended formulation (left), and the augmented formulation space (right).

(V, X): (v, x) ∈ R^{n+m},  L = (ℓ, 0) ∈ R^{n+m},  s.t. x ≥ 0, Ax ≤ b, v = Mx + c
W:      (v, x, λ) ∈ R^{n+2m},  L = (ℓ, 0, 0) ∈ R^{n+2m},  s.t. x, λ ≥ 0, Ax + λ = b, v = Mx + c

Observe that there is a duality between x and λ. Concretely, x_i + λ_i indicates the swap capacity at the ith reflection relation/comparator. Also, x_i λ_i = 0 for all i ∈ [m] for any pure Huffman tree/permutation, since each comparator either passes or swaps its input elements.
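To make Sections 3.2 and 3.3 concrete, here is a small self-contained sketch (our own illustration, not code from the paper) that builds the augmented formulation for permutations of n = 3 items, using one particular comparator ordering as an assumed valid network. It maps each permutation to its (v, x, λ) triple by running the comparators in reverse (sorting) order, and checks the identities v = Mx + c and Ax + λ = b, with A and b as given in Lemma 6 of the Appendix:

```python
import numpy as np

n = 3
# Comparators (k, l), 0-indexed, in "generation" order: running them in reverse
# order sorts any permutation back to the anchor c = [1, 2, ..., n].  This is one
# valid choice of network for n = 3 (an assumption of this illustration).
comparators = [(0, 1), (1, 2), (0, 1)]
m = len(comparators)

c = np.arange(1, n + 1, dtype=float)             # anchor point [1, 2, ..., n]
M = np.zeros((n, m))
for i, (k, l) in enumerate(comparators):
    M[k, i], M[l, i] = 1.0, -1.0                 # column e_k - e_l, as in Eq. (1)

A = np.tril(M.T @ M, k=-1) + np.eye(m)           # Lemma 6: A = Tri(M^T M) + I
b = -M.T @ c                                     # Lemma 6: b = -M^T c

def augmented(perm):
    """Map a permutation to its point (v, x, lam) in the augmented space W."""
    v, x = np.array(perm, dtype=float), np.zeros(m)
    for i in reversed(range(m)):                 # backward (sorting) pass
        k, l = comparators[i]
        if v[k] > v[l]:                          # this comparator swaps
            x[i] = v[k] - v[l]
            v[k], v[l] = v[l], v[k]
    assert np.allclose(v, c), "the network failed to sort back to the anchor"
    v, lam = c.copy(), np.zeros(m)
    for i, (k, l) in enumerate(comparators):     # forward pass: slack = capacity - x
        lam[i] = (v[l] - v[k]) - x[i]
        v[k], v[l] = v[k] + x[i], v[l] - x[i]
    return v, x, lam

for perm in ([1, 2, 3], [2, 3, 1], [3, 1, 2], [2, 1, 3]):
    v, x, lam = augmented(perm)
    assert np.allclose(M @ x + c, v)             # v = Mx + c
    assert np.allclose(A @ x + lam, b)           # Ax + lambda = b
    assert (x >= 0).all() and (lam >= 0).all()
    print(perm, "-> x =", x, " lambda =", lam)
```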
4 Algorithm

In this section, we propose our algorithm Extended-Learn, discuss its technical steps, and prove its regret bounds. The main idea of our algorithm is to maintain a distribution over all instances of the objects by keeping track of an evolving point w^t = (v^t, x^t, λ^t) in the augmented formulation space W through trials t = 1, . . . , T. The main structure of Extended-Learn is shown in Algorithm 1. Similar to [13], our algorithm consists of three main technical parts:

1. Prediction: Predict with an instance γ^{t−1} ∈ H such that E[γ^{t−1}] = v^{t−1}.
2. Update: Update the mixture w^{t−1} to ŵ^{t−1} multiplicatively according to the incurred loss.
3. Projection: Project the updated mixture ŵ^{t−1} back onto the polytope W and obtain w^t.

In the prediction step, which is discussed in Section 4.1, we draw an instance probabilistically from the distribution latent in w^{t−1} ∈ W such that it has the same expected value as v^{t−1} ∈ V. For the update step, having defined L^t = (ℓ^t, 0, 0), the updated ŵ^{t−1} is obtained from a trade-off between the linear loss and the unnormalized relative entropy [13]:

ŵ^{t−1} = argmin_{w ∈ R^{n+2m}} ∆(w||w^{t−1}) + η w · L^t
² The reader will see that having affine subspaces allows us to take an efficient projection approach, namely iterative Bregman projections, in Section 4.2.
It is fairly straightforward to see:

∀i ∈ [n + 2m]:  ŵ_i^{t−1} = w_i^{t−1} e^{−η L_i^t}
⟹  ∀i ∈ [n]: v̂_i^{t−1} = v_i^{t−1} e^{−η ℓ_i^t},   ∀i ∈ [m]: x̂_i^{t−1} = x_i^{t−1},   ∀i ∈ [m]: λ̂_i^{t−1} = λ_i^{t−1}
In the projection step, for which we propose an iterative approach in Section 4.2, we obtain w^t, the Bregman projection of ŵ^{t−1} back onto the augmented formulation space W:

w^t = argmin_{w ∈ W} ∆(w||ŵ^{t−1})
Algorithm 1 Extended-Learn
1: w^0 ← q ∈ W (a proper prior distribution discussed in Section 4.3)
2: For t = 1, . . . , T
3:   w^{t−1} = (v^{t−1}, x^{t−1}, λ^{t−1}) ∈ W
4:   Execute Prediction(w^{t−1}) and get a random instance γ^{t−1} ∈ H s.t. E[γ^{t−1}] = v^{t−1}
5:   Incur a loss γ^{t−1} · ℓ^t
6:   Update w^{t−1} to ŵ^{t−1} = (v̂^{t−1}, x̂^{t−1}, λ̂^{t−1}):
7:     Keep x̂^{t−1} ← x^{t−1} and λ̂^{t−1} ← λ^{t−1} unchanged
8:     Update v̂^{t−1} as v̂_i^{t−1} ← v_i^{t−1} e^{−η ℓ_i^t}
9:   Execute Projection(ŵ^{t−1}) and obtain w^t = argmin_{w ∈ W} ∆(w||ŵ^{t−1})
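The following is a schematic sketch (our own illustration) of the trial loop of Algorithm 1 in Python; `extended_prediction` and `bregman_projection` are hypothetical placeholders standing in for the routines described in Sections 4.1 and 4.2:

```python
import numpy as np

def extended_learn(w0, losses, eta, extended_prediction, bregman_projection):
    """Schematic trial loop of Algorithm 1; w is stored as the triple (v, x, lam)."""
    v, x, lam = w0                                   # a proper prior q in W
    total_loss = 0.0
    for loss in losses:                              # loss = l^t in [0, 1]^n
        gamma = extended_prediction(v, x, lam)       # random instance, E[gamma] = v
        total_loss += float(gamma @ loss)            # incur gamma^{t-1} . l^t
        v = v * np.exp(-eta * loss)                  # multiplicative update on v only;
                                                     # x and lam stay unchanged
        v, x, lam = bregman_projection(v, x, lam)    # project back onto W
    return total_loss
```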
4.1 Prediction

In this subsection, we describe how to probabilistically predict with an instance that has the same expected value as the mixture vector we are keeping track of. First, we propose an algorithm for decomposing any mixture point into a convex combination of instances. Despite its inefficiency, it leads us to a second algorithm which does produce proper predictions efficiently.
Inefficient Decomposition

The decomposition algorithm is shown in Algorithm 2. The main idea of the algorithm is to exploit the notion of partial swaps in the comparators corresponding to the reflection relations in our extended formulation. In other words, at each comparator, we want to decompose based on the extent to which the swap capacity is used. One can show that Algorithm 2, despite its inefficiency, results in a valid convex combination of instances (see the Appendix for the proof):

Lemma 1. (i) Given (v, x, λ), Algorithm 2 generates a decomposition set S of (γ, p_γ) pairs with γ ∈ H such that v = Σ_{(γ, p_γ) ∈ S} p_γ γ. (ii) The time complexity of Algorithm 2 is O(2^m).
Algorithm 2 Extended-Decomposition
1: Input: (v, x, λ) ∈ W ⊆ R^{n+2m}
2: Output: A decomposition set S of (instance, probability)-pairs (γ, p_γ)
3: S ← {(c, 1)}
4: for i = 1 to m do
5:   if x_i = 0 then
6:     continue
7:   else if λ_i = 0 then
8:     Swap the associated coordinates of γ for all (γ, p_γ) ∈ S
9:   else
10:    for (γ, p_γ) ∈ S do
11:      p_i ← x_i / (x_i + λ_i)
12:      Remove (γ, p_γ) from S
13:      Add (γ, (1 − p_i) p_γ) and (M_i(γ), p_i p_γ) to S ³
14:    end for
15:  end if
16: end for
17: return S
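As a concrete sketch (our own illustration, mirroring Algorithm 2 under the assumption that each M_i is a coordinate swap as in Section 3.2), the decomposition can be written as follows; note the exponential blow-up of S stated in Lemma 1:

```python
def extended_decomposition(x, lam, comparators, c):
    """Mirror of Algorithm 2: decompose (v, x, lam) into (instance, probability) pairs."""
    S = [(list(c), 1.0)]                         # start from the anchor instance
    for i, (k, l) in enumerate(comparators):
        if x[i] == 0:
            continue
        p_i = x[i] / (x[i] + lam[i])             # equals 1 when lam[i] == 0 (full swap)
        new_S = []
        for gamma, p in S:
            swapped = list(gamma)
            swapped[k], swapped[l] = swapped[l], swapped[k]   # apply M_i
            if p_i < 1.0:
                new_S.append((gamma, (1.0 - p_i) * p))
            new_S.append((swapped, p_i * p))
        S = new_S
    return S
```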
Efficient Prediction

In order to achieve efficiency, in contrast to the decomposition-based approach common in the literature [9, 13, 20, 21], one can avoid decomposition and predict directly. To this end, we replace the partial swaps of Algorithm 2 with probabilistic swaps. Algorithm 3 describes this idea in concrete terms. It can be shown (see the proof in the Appendix) that Algorithm 3 performs proper prediction efficiently:

Lemma 2. (i) Given (v, x, λ), Algorithm 3 generates a γ ∈ H such that E[γ] = v. (ii) The time complexity of Algorithm 3 is O(m).

Algorithm 3 Extended-Prediction
1: Input: (v, x, λ) ∈ W ⊆ R^{n+2m}
2: Output: A prediction γ ∈ H
3: γ ← c
4: for i = 1 to m do
5:   if x_i = 0 then
6:     continue
7:   else
8:     γ ← M_i(γ) with probability x_i/(x_i + λ_i), and γ is left unchanged with probability λ_i/(x_i + λ_i)
9:   end if
10: end for
11: return γ
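A minimal sketch (our own illustration, assuming each M_i is a coordinate swap as in Section 3.2) of Algorithm 3:

```python
import numpy as np

def extended_prediction(x, lam, comparators, c, rng=None):
    """Sample a pure object gamma with E[gamma] = v by probabilistic swaps (Algorithm 3)."""
    if rng is None:
        rng = np.random.default_rng()
    gamma = np.array(c, dtype=float)
    for i, (k, l) in enumerate(comparators):
        if x[i] == 0:
            continue
        if rng.random() < x[i] / (x[i] + lam[i]):    # swap with probability x_i/(x_i+lam_i)
            gamma[k], gamma[l] = gamma[l], gamma[k]  # apply M_i
    return gamma
```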
4.2 Projection

Formally, the problem is to find the ∆-projection of the multiplicatively updated p back onto the set W:

P_W^∆(p) = argmin_{q ∈ W} ∆(q||p)   (3)

in which ∆(·||·) is the unnormalized relative entropy. Observe that W is an intersection of affine subspaces given by the m + n equality constraints in Ax + λ = b and Mx + c = v, denoted by C_1, . . . , C_{m+n}. Since the non-negativity constraints are already part of the definition of ∆, it is possible to solve (3) simply by using iterative ∆-projections⁴ [2]. Starting from p^0 = p, we iteratively compute:

∀k > 0:  p^k = P_{C_k}^∆(p^{k−1})   (4)

in which the index k is (m + n)-periodic (i.e. ∀k ∈ N, C_{k+m+n} = C_k). It is proved in [2] that p^k converges to the unique solution of (3). In the Appendix, we discuss how one can project onto each hyperplane C_k for k ∈ [n + m]. Although the iterative projection is only guaranteed to reach the polytope W in the limit, the approaches in [9, 13] provide ways to overcome this issue.

³ M_i denotes the swap action associated with the ith reflection relation.
⁴ In [9], Sinkhorn balancing is used for projection, which is also a special case of iterative Bregman projection.
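A sketch (our own illustration) of this iterative projection: each affine constraint ⟨a, w⟩ = a_0 (a row of Ax + λ = b or of v = Mx + c, written over w = (v, x, λ)) has a ∆-projection of the form w_i β^{a_i}, with β > 0 solving Equation (5) of the Appendix; here we find β by bisection rather than Newton's method:

```python
import numpy as np

def project_onto_hyperplane(p, a, a0):
    """Delta-projection of p onto the hyperplane <a, w> = a0: w_i = p_i * beta**a_i."""
    p, a = np.asarray(p, dtype=float), np.asarray(a, dtype=float)
    g = lambda beta: float(np.sum(a * p * np.power(beta, a))) - a0   # nondecreasing in beta
    lo, hi = 1e-12, 1.0
    for _ in range(100):                      # grow the bracket until g(hi) >= 0
        if g(hi) >= 0:
            break
        hi *= 2.0
    for _ in range(200):                      # bisection for the unique positive root
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    beta = 0.5 * (lo + hi)
    return p * np.power(beta, a)

def iterative_projection(p, constraints, sweeps=100):
    """Cycle Delta-projections over the equality constraints C_1, ..., C_{m+n}."""
    w = np.array(p, dtype=float)
    for _ in range(sweeps):
        for a, a0 in constraints:             # each constraint is a pair (a, a0)
            w = project_onto_hyperplane(w, a, a0)
    return w
```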
4.3 Regret Bounds

Using the vectors w^t ∈ W produced by the algorithm, we can prove the following cumulative regret bound:

Lemma 3. E[Σ_{t=1}^T γ^{t−1} · ℓ^t] ≤ (η min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t + ∆(θ||w^0)) / (1 − e^{−η}), in which θ ∈ W is the augmented formulation corresponding to the instance γ ∈ V.

The proof is standard in the online learning literature (see, e.g., [13]) and is given in the Appendix. Next, we prove that there exists a good initial point w^0 in W such that the divergences ∆(θ||w^0) are all appropriately bounded (shown in the Appendix).

Lemma 4. Assume we are working with m reflection relations, and assume that the associated network does not have any redundant comparators. Then, given n ≤ m ≤ n², there exists⁵ q ∈ W such that for all p ∈ W we have ∆(p||q) ≤ 16 m n log n.

By Lemmas 3 and 4, we can now show the main regret bound result (see the Appendix for the proof):

Theorem 5. Assume the number of reflection relations used in the extended formulation is m ∈ [n, n²]⁶ and none of the corresponding comparators in the network are redundant. Then, given ℓ^t ∈ [0, 1]^n for all t = 1 . . . T, we have:

E[Σ_{t=1}^T γ^{t−1} · ℓ^t] ≤ (η min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t + 16 m n log n) / (1 − e^{−η})

Furthermore, in both the permutation and Huffman tree cases, by choosing a sorting network of size m = O(n log n) and tuning η appropriately, the expected regret of Extended-Learn is at most O(n² log n √T).
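To spell out the last step of Theorem 5 (our own restatement of the tuning bound from Lemma 4 of [5], writing D := 16 m n log n for the divergence bound): the tuned regret is at most √(2 L_best D) + D. With m = O(n log n) and L_best ≤ T n² (losses in [0, n²] per trial), this is

√(2 T n² · O(n² (log n)²)) + O(n² (log n)²) = O(n² log n √T).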
5 Comparisons and Conclusion

Table 2 contains a comparison of the regret bounds for the new Extended-Learn algorithm, previous algorithms for permutations, and the hedge algorithm [5], which inefficiently maintains an explicit weight for each of the exponentially many (in n log n) permutations or trees. Using loss vectors from the general unit cube, for permutations we assume the scheduling loss from [21], which has range [0, n²] per trial. When compared with the state-of-the-art implicit algorithms PermELearn [9] and PermutahedLearn [21], the general Extended-Learn methodology has a small additional regret bound penalty of √(log n). When compared with the generic explicit hedge algorithm (which is not computationally efficient), Extended-Learn has a better loss bound by a factor of √(n / log n).
When comparing Extended-Learn with explicit hedge on Huffman trees, we consider two loss regimes: (i) one where the loss vectors are from the general unit cube, and consequently the per-trial losses are in [0, n²] (as with permutations), and (ii) one where the loss vectors represent frequencies and lie on the unit simplex, so the per-trial losses are in [0, n]. In the first case, Extended-Learn and Hedge have the same asymptotic bounds as with permutations. In the second case, the lower loss range favors Hedge, and the inefficient algorithm's bound is slightly better, by a factor of √(log n).

⁵ The construction of q can be done as a pre-processing step and may depend on the network of comparators. For instance, if we use the bubble-sort sorting network in the case of permutations, the average of the augmented formulations of the two permutations [1, 2, . . . , n] and [n, n − 1, . . . , 1] is a good choice.
⁶ We still have ∆(p||q) = O(m n log n), and the same bounds asymptotically, with m = poly(n).
Algorithm | Permutation | Huffman Tree (ℓ^t ∈ Unit Cube) | Huffman Tree (ℓ^t ∈ Unit Simplex)
Extended-Learn | O(n² log n √T) | O(n² log n √T) | O(n^{3/2} log n √T)
PermELearn and PermutahedLearn | O(n² (log n)^{1/2} √T) | – | –
Hedge Algorithm | O(n^{5/2} (log n)^{1/2} √T) | O(n^{5/2} (log n)^{1/2} √T) | O(n^{3/2} (log n)^{1/2} √T)

Table 2: Comparing the regret bounds of Extended-Learn with other existing algorithms for different problems and different loss regimes.
In conclusion, we have presented a general methodology for creating on-line learning algorithms from generic extended formulation constructions. Because these extended formulations are designed to describe complex polytopes using a manageable number of constraints, they allow efficient learning on structures, like the polytope of Huffman trees, that previously could not be learned efficiently. Several areas for future work remain. We mentioned how to generate a good initial w^0 for particular extended formulations. It would be nice to have a generic w^0 generation method for an arbitrary extended formulation. Also recall that the iterative Bregman projections (see Section 4.2) may only enter the polytope W in the limit. Although this causes a negligible increase in regret with Sinkhorn balancing [9], it would be helpful to have a formal analysis of when Extended-Learn makes predictions using a w^t just outside the polytope. Finally, and most importantly, one can investigate extended formulations for other combinatorial objects and analyze the learning algorithms that result from the Extended-Learn methodology.
References

[1] Miklós Ajtai, János Komlós, and Endre Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3(1):1–19, 1983.
[2] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[3] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] Persi Diaconis. Group representations in probability and statistics. Lecture Notes-Monograph Series, 11:i–192, 1988.
[5] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[6] Michel X Goemans. Smallest compact formulation for the permutahedron. Mathematical Programming, 153(1):5–11, 2015.
[7] David P Helmbold, Sandra Panizza, and Manfred K Warmuth. Direct and indirect algorithms for on-line learning of disjunctions. Theoretical Computer Science, 284(1):109–142, 2002.
[8] David P Helmbold and Robert E Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.
[9] David P Helmbold and Manfred K Warmuth. Learning permutations with exponential weights. The Journal of Machine Learning Research, 10:1705–1736, 2009.
[10] Mark Herbster and Manfred K Warmuth. Tracking the best linear predictor. The Journal of Machine Learning Research, 1:281–309, 2001.
[11] Volker Kaibel. Extended formulations in combinatorial optimization. arXiv preprint arXiv:1104.1023, 2011.
[12] Volker Kaibel and Kanstantsin Pashkovich. Constructing extended formulations from reflection relations. In Facets of Combinatorial Optimization, pages 77–100. Springer, 2013.
[13] Wouter M Koolen, Manfred K Warmuth, and Jyrki Kivinen. Hedging structured concepts. In COLT, 2010.
[14] Dima Kuzmin and Manfred K Warmuth. Optimum follow the leader algorithm. In Learning Theory, pages 684–686. Springer, 2005.
[15] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[16] Wolfgang Maass and Manfred K Warmuth. Efficient learning with virtual threshold gates. Information and Computation, 141(1):66–83, 1998.
[17] Jean-François Maurras, Thanh Hai Nguyen, and Viet Hung Nguyen. On the convex hull of Huffman trees. Electronic Notes in Discrete Mathematics, 36:1009–1016, 2010.
[18] Eiji Takimoto and Manfred K Warmuth. Predicting nearly as well as the best pruning of a planar decision graph. Theoretical Computer Science, 288(2):217–235, 2002.
[19] Eiji Takimoto and Manfred K Warmuth. Path kernels and multiplicative updates. The Journal of Machine Learning Research, 4:773–818, 2003.
[20] Manfred K Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(10):2287–2320, 2008.
[21] Shota Yasutake, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Masayuki Takeda. Online linear optimization over permutations. In Algorithms and Computation, pages 534–543. Springer, 2011.
6 Appendix

6.1 Structure of Inequality Constraints in Extended Formulation

Lemma 6. If M is the matrix of the affine transformation corresponding to m reflection relations, then we have:

A = Tri(M^T M) + I,   b = −M^T c

in which Tri(·) is a function over square matrices which zeros out the upper-triangular part of its input, including the diagonal.

Proof. Let v^k be the vector in V after going through the kth reflection relation. Also denote the kth column of M by M_k. Observe that v^0 = c and v^k = c + Σ_{i=1}^{k} M_i x_i. Let M_k = e_r − e_s. Then, using (2), the inequality associated with the kth row of Ax ≤ b is obtained as below:

x_k ≤ v_s^{k−1} − v_r^{k−1} = −M_k^T v^{k−1} = −M_k^T (c + Σ_{i=1}^{k−1} M_i x_i)

⟹  x_k + Σ_{i=1}^{k−1} M_k^T M_i x_i ≤ −M_k^T c = b_k

Thus, for all i, j ∈ [m]:

A_ij = M_i^T M_j if i > j,   A_ij = 1 if i = j,   A_ij = 0 if i < j.

Projecting onto a single hyperplane C given by the constraint ⟨a, w⟩ = a_0 (a row of Ax + λ = b or of v = Mx + c, written over w = (v, x, λ)), the ∆-projection of a point w has coordinates of the form w_i β^{a_i}; thus it suffices to find β > 0 such that:

Σ_{i=1}^{n+2m} a_i w_i β^{a_i} − a_0 = 0   (5)

Observe that, due to the structure of the matrices M and A (see Lemma 6), a_i ∈ Z and a_i ≥ −1 for all i ∈ [n + 2m], and furthermore a_0 ≥ 0. Thus, multiplying through by β, we can rewrite equation (5) as the polynomial equation below:

f(β) = α_k β^k + . . . + α_2 β² − α_1 β − α_0 = 0

in which all the α_i's are positive real numbers. Note that f(0) < 0 and f(β) → +∞ as β → +∞. Thus f(β) has at least one positive root. However, it cannot have more than one positive root, which we prove by contradiction. Assume that there exist 0 < r_1 < r_2 such that f(r_1) = f(r_2) = 0. Since f is convex on the positive real line, using Jensen's inequality we obtain the contradiction below:

0 = f(r_1) = f( ((r_2 − r_1)/r_2) · 0 + (r_1/r_2) · r_2 ) ≤ ((r_2 − r_1)/r_2) f(0) + (r_1/r_2) f(r_2) = ((r_2 − r_1)/r_2) f(0) < 0

Therefore f has exactly one positive root, which can be found by Newton's method starting from a sufficiently large initial point.⁷
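A short sketch (our own illustration) of this root-finding step: we apply Newton's method to the polynomial obtained from (5) after multiplying by β, starting from a point assumed to lie to the right of the root:

```python
import numpy as np

def solve_beta(w, a, a0, beta0=1e6, iters=100):
    """Find the unique positive root of f(beta) = sum_i a_i w_i beta^(a_i+1) - a0*beta
    by Newton's method; beta0 is assumed to exceed the root ("sufficiently large")."""
    a = np.asarray(a, dtype=float)
    w = np.asarray(w, dtype=float)
    f  = lambda b: float(np.sum(a * w * b ** (a + 1.0)) - a0 * b)
    df = lambda b: float(np.sum(a * (a + 1.0) * w * b ** a) - a0)
    beta = beta0
    for _ in range(iters):
        beta = beta - f(beta) / df(beta)
    return beta
```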
6.5 Proof of Lemma 3

Proof. Assuming w = (v, x, λ), θ = (γ, x_θ, λ_θ) and L = (ℓ, 0, 0):

(1 − e^{−η}) v^{t−1} · ℓ^t = (1 − e^{−η}) w^{t−1} · L^t
  ≤ Σ_i w_i^{t−1} (1 − e^{−η L_i^t})
  = ∆(θ||w^{t−1}) − ∆(θ||ŵ^{t−1}) + η θ · L^t
  ≤ ∆(θ||w^{t−1}) − ∆(θ||w^t) + η γ · ℓ^t

The first inequality is obtained using 1 − e^{−ηx} ≥ (1 − e^{−η}) x for x ∈ [0, 1], as done in [15]. The second inequality is a result of the Generalized Pythagorean Theorem [10], since w^t is a Bregman projection of ŵ^{t−1} onto the convex set W, which contains θ. By summing over t = 1 . . . T and using the non-negativity of divergences, we obtain:

(1 − e^{−η}) Σ_{t=1}^T v^{t−1} · ℓ^t ≤ ∆(θ||w^0) − ∆(θ||w^T) + η Σ_{t=1}^T γ · ℓ^t

⟹  E[Σ_{t=1}^T γ^{t−1} · ℓ^t] ≤ (η min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t + ∆(θ||w^0)) / (1 − e^{−η})
⁷ Note that if the constraint belongs to v = Mx + c, then since all the coefficients are in {−1, 0, 1}, f will be quadratic and the positive root can be found through the closed-form formula.
6.6 Proof of Lemma 4

Proof. According to the definition:

∆(p||q) = Σ_i p_i log(p_i/q_i) + q_i − p_i = Σ_i p_i log p_i − p_i log q_i + q_i − p_i

We will bound each term of the expression above. First observe that, given S_p = Σ_i p_i, we have:

Σ_i (p_i/S_p) log(p_i/S_p) ≤ log(n + 2m)
⟹  Σ_i p_i log p_i ≤ S_p log(n + 2m) + S_p log S_p

Note that for an arbitrary point (v, x, λ) ∈ W and for all i ∈ [m], x_i + λ_i ≤ n, because it indicates the maximum swap value at reflection relation i. Also, since v_i ≤ n for all i ∈ [n], we have S_q, S_p ≤ n² + mn ≤ 2mn. Therefore:

Σ_i p_i log p_i ≤ S_p log(n + 2m) + S_p log S_p ≤ 4 m n log n + 6 m n log n = 10 m n log n

Now we only need to bound Σ_i −p_i log q_i. To do so, we choose q to be the average of some instances {θ^(j)}_{j∈J} ⊂ W such that, for all i ∈ [n + 2m], q_i is sufficiently large. Trivially v_i^θ ≥ 1. Also note that, since we do not have any redundant comparator in our network, we can assume that for all i ∈ [m] there exists a witness instance in W with x_i ≥ 1. The same argument can be made for the λ_i's. Now let q be the average of all these 2m instances. Therefore, for all i ∈ [n + 2m], q_i ≥ 1/(2m), and consequently −log q_i ≤ log 2m. Putting everything together, we obtain:

∆(p||q) = Σ_i p_i log p_i − p_i log q_i + q_i − p_i ≤ 10 m n log n + 2 m n log 2m + 2 m n ≤ 16 m n log n
6.7 Proof of Theorem 5

Proof. The first part is an immediate consequence of Lemmas 3 and 4:

E[Σ_{t=1}^T γ^{t−1} · ℓ^t] ≤ (η min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t + 16 m n log n) / (1 − e^{−η})

Let L_best = min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t. We can tune η as instructed in Lemma 4 of [5]:

E[Σ_{t=1}^T γ^{t−1} · ℓ^t] − min_{γ ∈ H} Σ_{t=1}^T γ · ℓ^t ≤ √(2 L_best · 16 m n log n) + 16 m n log n

Plugging L_best ≤ T n² and m = O(n log n) (by choosing an appropriate sorting network [1]) into the inequality above yields the desired result.