Learning Algorithms for Enclosing Points in Bregmanian Spheres

Koby Crammer and Yoram Singer
School of Computer Science & Engineering, The Hebrew University, Jerusalem 91904, Israel
{kobics,singer}@cs.huji.ac.il

Acknowledgments. We would like to thank Yair Censor for his comments and advice throughout the research, and Ran Bachrach for his valuable comments.
Abstract. We discuss the problem of finding a generalized sphere that encloses points originating from a single source. The points contained in such a sphere are within a maximal divergence from a center point. The divergences we study are known as the Bregman divergences which include as a special case both the Euclidean distance and the relative entropy. We cast the learning task as an optimization problem and show that it results in a simple dual form which has interesting algebraic properties. We then discuss a general algorithmic framework to solve the optimization problem. Our training algorithm employs an auxiliary function that bounds the dual’s objective function and can be used with a broad class of Bregman functions. As a specific application of the algorithm we give a detailed derivation for the relative entropy. We analyze the generalization ability of the algorithm by adopting margin-style proof techniques. We also describe and analyze two schemes of online algorithms for the case when the radius of the sphere is set in advance.
1
Introduction
To motivate the topic of this paper let us briefly discuss the following application. The task of speaker verification is concerned with determining whether a speech segment is uttered by a pre-specified speaker or by an imposter. This problem is inherently different from classification problems since, in general, the verification system is provided at the training stage with examples only from the speaker whose voice we wish to verify in the test phase. Although a verification system might receive supervised data from different speakers, casting the problem as a multiclass problem might be problematic since the number of classes is not fixed and an imposter is not necessarily one of the speakers in the training set. Therefore, a common approach in speaker verification systems is to build a different model for each speaker, where each model is trained based on examples from a single source (speaker). In the problem setting we discuss, we indeed observe examples from a single source and our goal is to find a body of small volume such that most of the examples lie inside the learned body. Since we observe examples from a single source we term this problem the uniclass¹ problem. To illustrate the problem,
¹ In a similar setting [18], this problem was called the one-class problem. We believe, however, that the term uniclass is more appropriate.
let us assume that the examples are points in a Euclidean space. A natural and simple candidate for enclosing points from a single source is a ball. The learning problem therefore reduces to the problem of finding the center and the radius of a ball enclosing the points. Clearly, a finite sample can always be enclosed in a sphere of a large radius. However, such a sphere is very likely to include points not belonging to the source. Therefore, we give some leeway to the learning algorithm so that a few of the points may not lie inside the ball. This view casts a natural tradeoff between the radius of the enclosing sphere and the portion of points outside this ball.

[Fig. 1. The level sets induced by the Euclidean distance (left column) and the relative entropy (right column). The points are in the three dimensional simplex and are projected onto the plane. Each line represents equidistant points from a center, where the center is at [0.3 0.3 0.3] in the top figures and at [0.8 0.2 0.2] in the bottom figures.]

In quite a few applications, the examples do not reside in a Euclidean space. For instance, in text retrieval applications documents are often represented by word frequencies, and information theoretic measures are more natural than the Euclidean distance as the means for assessing divergences between documents. We therefore employ a rather general notion of divergence, called Bregman divergences [3]. A Bregman divergence is defined via a strictly convex function F : X → R defined on a closed, convex set X ⊆ Rⁿ. A Bregman function F needs to satisfy a set of constraints; we omit the discussion of these constraints and refer the reader to [4]. All the functions we discuss in this paper fulfill these constraints and are hence Bregman functions. In this paper we occasionally require that X is also bounded and therefore compact. Assume that F is continuously differentiable at all points of X_int, the interior of X, which we assume is nonempty.
The Bregman divergence associated with F is defined for x ∈ X and w ∈ X_int to be
$$B_F(x\|w) \stackrel{\mathrm{def}}{=} F(x) - \left[ F(w) + \nabla F(w) \cdot (x - w) \right]\ .$$
Thus, B_F measures the difference between F and its first-order Taylor expansion about w, evaluated at x. Bregman divergences generalize some commonly studied distance measures. The divergences we employ are defined via a single scalar convex function f such that $F(x) = \sum_{l=1}^n f(x_l)$, where $x_l$ is the lth coordinate of x. In this paper we exemplify our algorithms and their analyses with two commonly used divergences. The first, when X ⊂ Rⁿ, is derived by setting f(x) = (1/2)x², and thus B_F becomes the squared Euclidean distance between x and w, $B_F(x\|w) = \frac{1}{2}\|x - w\|^2$. The second divergence we consider is derived by setting f(x) = x log(x) − x. In this case B_F is the (unnormalized) relative entropy,
$$B_{RE}(x\|w) = \sum_{l=1}^n \left[ x_l \log\frac{x_l}{w_l} - x_l + w_l \right]\ .$$
While the above divergence can be defined over a convex subset of $\mathbb{R}^n_+$, in this paper we restrict the domain in the case of the relative entropy to be a compact subset of the nth dimensional simplex, $\Delta_n = \{x \mid x_l \ge 0\ ;\ \sum_l x_l = 1\}$. For this specific choice of domain, the relative entropy reduces to $B_{RE}(x\|w) = \sum_{l=1}^n x_l \log(x_l/w_l)$, which is often referred to as the Kullback-Leibler divergence. An illustration of the Bregmanian spheres for the Euclidean norm and the relative entropy is given in Fig. 1. The two divergences exhibit different characteristics: while the level sets of the Euclidean distance intersect the boundary of the simplex, the relative entropy bends close to the boundary and all of its level sets remain strictly within the simplex.

The problem we consider in this paper is the construction of a simple subset that encloses a large portion of a set of instances. Concretely, we are given a set of examples $S = \{x_i\}_{i=1}^m$ where $x_i \in X$. Our goal is to find w ∈ X such that many of the examples attain a Bregman divergence from w that is smaller than R (R ≥ 0). Informally, we seek w ∈ X and a small scalar R such that most of the $x_i \in S$ are in a Bregmanian ball of radius R, i.e., $B_F(x_i\|w) \le R$. Clearly, the smaller R is, the less likely we are to succeed in our task. Therefore, there is a natural tradeoff between the size of R and the number of points in S that are within a divergence R from w. We thus associate a loss with each example $x_i$ that falls outside the Bregmanian ball of radius R. This loss is equal to the excess Bregman divergence of the point $x_i$ from w over the maximal radius R, that is, $B_F(x_i\|w) - R$. We cast the tradeoff between the need to find a Bregmanian ball with a small radius and the need to attain a small excess loss on each point as the following optimization problem,
$$\min_{w,R,\xi}\ R + \frac{1}{\nu m}\sum_{i=1}^m \xi_i \qquad \text{s.t.}\ \ B_F(x_i\|w) \le R + \xi_i\ ;\ \ \xi_i \ge 0\ ,\quad i = 1,\dots,m\ . \tag{1}$$
Here ν ∈ [0, 1] is a parameter that governs the tradeoff between the value of R and the total amount of discrepancies from the Bregmanian sphere. We also need to require that R is non-negative and that w is in X. However, these constraints are automatically fulfilled at the optimal solution. We would also like to note that although a Bregman divergence is not necessarily convex in its second argument w, the dual problem of Eq. (1) is concave and attains a unique optimal solution. The optimization problem defined in Eq. (1) was originally cast for the squared distance by Schölkopf [14] and Tax and Duin [18]. Later on, Schölkopf et al. [15] introduced and analyzed a closely related problem in which the goal is to separate most of the examples from the origin using a single hyperplane. We adopt the notation used in [15]. However, our setting is more general since we allow the use of any Bregman divergence that satisfies the conditions above. The uniclass problem has many potential applications such as simple clustering, outlier detection (e.g. intrusion detection), novelty detection, and density estimation. Due to the lack of space we refer the reader to the work of Tax [17], Tax and Duin [18] and Schölkopf et al. [15] and the references therein. Finally, we would like to emphasize that although the use of Bregman divergences in the context of uniclass problems is novel, Bregman divergences have become popular and useful tools in other learning settings such as online learning [1, 10, 9], boosting [6], and principal component analysis [5]. We now turn to the dual form of the optimization problem described in Eq. (1) and its properties.
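Before moving on, it may help to make the two divergences of Eq. (1) concrete. The following is a minimal sketch (the function names are ours, not the paper's) of the squared Euclidean distance and the unnormalized relative entropy, together with the membership test that defines a Bregmanian ball:

```python
import numpy as np

# Two per-coordinate Bregman functions used in the paper:
#   f(x) = x^2 / 2        -> squared Euclidean distance
#   f(x) = x log(x) - x   -> (unnormalized) relative entropy
def bregman_euclidean(x, w):
    """B_F(x||w) = (1/2) ||x - w||^2."""
    d = x - w
    return 0.5 * float(np.dot(d, d))

def bregman_rel_entropy(x, w):
    """Unnormalized relative entropy: sum_l x_l log(x_l/w_l) - x_l + w_l.
    On the simplex the last two terms cancel after summation."""
    return float(np.sum(x * np.log(x / w) - x + w))

def in_ball(bregman, x, w, R):
    """A point x lies in the Bregmanian ball of center w and radius R
    iff B_F(x||w) <= R."""
    return bregman(x, w) <= R
```

For example, `in_ball(bregman_euclidean, x, w, R)` implements the feasibility condition of Eq. (1) with all slacks set to zero.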
2
The Optimization Problem
The Lagrangian of Eq. (1) is equal to,
$$\begin{aligned}
\mathcal{L} &= R + \frac{1}{\nu m}\sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i\left[B_F(x_i\|w) - R - \xi_i\right] - \sum_{i=1}^m \gamma_i \xi_i \\
&= R\left(1 - \sum_{i=1}^m \alpha_i\right) + \sum_{i=1}^m \xi_i\left(\frac{1}{\nu m} - \alpha_i - \gamma_i\right) + \sum_{i=1}^m \alpha_i B_F(x_i\|w)\ .
\end{aligned} \tag{2}$$
We first compute the derivatives of the Lagrangian with respect to its primal variables R and ξ_i and equate them to zero,
$$\frac{\partial}{\partial R}\mathcal{L} = 1 - \sum_{i=1}^m \alpha_i = 0 \quad\Rightarrow\quad \sum_{i=1}^m \alpha_i = 1 \tag{3}$$
$$\frac{\partial}{\partial \xi_i}\mathcal{L} = \frac{1}{\nu m} - \alpha_i - \gamma_i = 0 \quad\Rightarrow\quad \alpha_i \le \frac{1}{\nu m} \tag{4}$$
Next we compute the derivative of $B_F(x\|w)$ with respect to w,
$$\nabla_w B_F(x\|w) = \nabla_w \left[F(x) - F(w) - \nabla F(w)\cdot(x-w)\right] = -\nabla F(w) - \nabla^2 F(w)(x-w) + \nabla F(w) = -\nabla^2 F(w)\,(x-w)\ , \tag{5}$$
where $\nabla^2 F(w)$ is the matrix of second-order derivatives of F with respect to w and we write all vectors as column vectors. Using Eq. (5) we calculate the derivative of the Lagrangian with respect to w, $\nabla_w \mathcal{L} = \sum_{i=1}^m \alpha_i \nabla_w B_F(x_i\|w) = -\sum_{i=1}^m \alpha_i \nabla^2 F(w)(x_i - w)$. Setting $\nabla_w \mathcal{L} = 0$ we get,
$$-\sum_{i=1}^m \alpha_i \nabla^2 F(w)(x_i - w) = 0 \quad\Rightarrow\quad \nabla^2 F(w)\left[\sum_{i=1}^m \alpha_i x_i - w \sum_{i=1}^m \alpha_i\right] = 0\ . \tag{6}$$
Since F(w) is strictly convex, $\nabla^2 F(w)$ is positive definite and thus its inverse exists². Multiplying both sides of Eq. (6) by this inverse and using the fact that $\sum_i \alpha_i = 1$ (Eq. (3)) we now get,
$$w = \sum_{i=1}^m \alpha_i x_i\ . \tag{7}$$

² Since we assume that $F(x) = \sum_{l=1}^n f(x_l)$, we have that $\nabla^2 F(w)$ is a diagonal matrix with positive elements on the diagonal.
Substituting Eqs. (3), (4) and (7) in the Lagrangian given by Eq. (2) we get the following dual objective function,
$$Q(\alpha) = \sum_{i=1}^m \alpha_i\, B_F\!\left(x_i \,\Big\|\, \sum_{j=1}^m \alpha_j x_j\right) = \sum_{i=1}^m \alpha_i F(x_i) - \sum_{i=1}^m \alpha_i\, F\!\left(\sum_{j=1}^m \alpha_j x_j\right) - \nabla F\!\left(\sum_{j=1}^m \alpha_j x_j\right)\cdot\left(\sum_{i=1}^m \alpha_i x_i - \sum_{j=1}^m \alpha_j x_j \sum_{i=1}^m \alpha_i\right)\ . \tag{8}$$
Using again Eq. (3) we finally get, $Q(\alpha) = \sum_{i=1}^m \alpha_i F(x_i) - F\!\left(\sum_{i=1}^m \alpha_i x_i\right)$. In summary, we get the following dual problem for Eq. (1),
$$\max_\alpha\ Q(\alpha) = \sum_{i=1}^m \alpha_i F(x_i) - F\!\left(\sum_{i=1}^m \alpha_i x_i\right) \qquad \text{s.t.}\ \sum_{i=1}^m \alpha_i = 1\ ;\ \ 0 \le \alpha_i \le \frac{1}{\nu m}\ ,\quad i = 1,\dots,m\ . \tag{9}$$
We would like to underscore a few properties of the above constrained optimization problem before turning to an algorithm for finding the optimal solution. First, the objective function Q(α) is strictly concave and also nonnegative due to the convexity of F. This function is the difference between the convex combination of F evaluated at each sample point (the line in Fig. 2) and F applied to the convex combination of the sample (the curve in Fig. 2). Second, if ν = 1 then there is a unique assignment that satisfies the two constraints, namely α_j = 1/m for all j = 1, …, m. Using arguments similar to those of Schölkopf et al. [15], it is rather simple to prove that ν is a lower bound on the fraction of examples whose Lagrange multiplier α_i is greater than zero, and also that ν is an upper bound on the fraction of examples which are outliers (examples for which ξ_i > 0).

[Fig. 2. Illustration of the dual's objective function for the uniclass problem.]
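The nonnegativity of Q(α) is just Jensen's inequality applied to the convex F. A minimal numeric sketch of the dual objective of Eq. (9) (names and the random data are ours) illustrates this:

```python
import numpy as np

def F_euclidean(x):
    # F(x) = sum_l x_l^2 / 2, yielding the squared Euclidean divergence
    return 0.5 * float(np.dot(x, x))

def dual_objective(alpha, X, F):
    """Q(alpha) = sum_i alpha_i F(x_i) - F(sum_i alpha_i x_i)   (Eq. (9))."""
    w = X.T @ alpha                     # the center induced by alpha (Eq. (7))
    return float(np.dot(alpha, [F(x) for x in X])) - F(w)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))         # m = 5 sample points in R^3
alpha = np.full(5, 1.0 / 5)             # the unique feasible point when nu = 1
q = dual_objective(alpha, X, F_euclidean)
assert q >= 0.0                         # Q is nonnegative by convexity of F
```

Note that placing all the weight on a single point drives Q to zero, which matches the interpretation of Q as a Jensen gap.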
3
A Parallel-update Algorithm
In this section we describe and analyze an algorithm that finds the optimal set of α's by solving the dual optimization problem given in Eq. (9). Our algorithm can be used in conjunction with a rather general class of Bregman divergences. We would like to note in passing that although there exists a wide range of algorithms for finding the optimal solution of a concave function under linear constraints (see for instance [8, 4]), these algorithms are rather general and do not take advantage of the particular form of the optimization problem at hand.
In contrast, we describe in this section a specialized algorithm that is simple to implement and to analyze. Our algorithm employs an auxiliary function that bounds the objective function from below and is reminiscent of a recent learning algorithm for SVMs by Sha et al. [16]; it also bears resemblance to fixed-point algorithms such as the NMF algorithm [11] and the EM algorithm [7]. Before diving into the specifics of the algorithm we first provide a high level overview of its operation. Let Q(α) be the objective function of Eq. (9). We employ an auxiliary function R(·, ·) which for any α* > 0 satisfies the two properties
$$Q(\alpha) = R(\alpha, \alpha) \qquad\qquad Q(\alpha) \ge R(\alpha, \alpha^*)\ . \tag{10}$$
Input:
– A set of examples $\{x_i\}_{i=1}^m$, $x_{i,l} > 0$
– A Bregman function F(x)
Initialize: Set $\alpha_i^1 = \frac{1}{m}$
Loop: For t = 1, 2, …, T
– Set $w_t = \sum_i \alpha_i^t x_i$
– Update $\alpha^{t+1} \leftarrow \mathrm{NewWeights}(w_t, \{x_i\}_{i=1}^m, \alpha^t)$
Compute sphere's parameters:
– Set $w = \sum_i \alpha_i^T x_i$
– Choose i such that $\alpha_i^T \in (0, \frac{1}{\nu m})$
– Define $R = B_F(x_i\|w)$
Output: Center w ; Radius R

Fig. 3. The parallel-update algorithm for the uniclass problem.
The algorithm is initialized with any α¹ that satisfies the constraints of Eq. (9). It then works in rounds. On round t it sets $\alpha^{t+1}$ to be the maximizer of $R(\alpha, \alpha^t)$. As we show in the sequel, the above two properties of R imply that the value of $Q(\alpha^t)$ is non-decreasing in t, with a single fixed point at the optimal solution. The analysis we provide below further motivates the following construction of R. The function R that we use for Eq. (9) is,
$$R(\alpha, \alpha^t) = \sum_{l=1}^n \sum_i \alpha_i f(x_{i,l}) - \sum_{l=1}^n \sum_i \frac{\alpha_i^t x_{i,l}}{\sum_j \alpha_j^t x_{j,l}}\, f\!\left(\frac{\alpha_i}{\alpha_i^t} \sum_j \alpha_j^t x_{j,l}\right)\ , \tag{11}$$
where f is the per-coordinate Bregman function. For simplicity, we assume that $x_{i,l} > 0$, though there are less restrictive conditions for which Eq. (11) is well defined. Let us assume for now that R satisfies Eq. (10). Replacing Q(α) with $R(\alpha, \alpha^t)$ we get the following optimization problem,
$$\alpha^{t+1} = \arg\max_\alpha\ R(\alpha, \alpha^t) \qquad \text{s.t.}\ \sum_{i=1}^m \alpha_i = 1\ ;\ \ 0 \le \alpha_i \le \frac{1}{\nu m}\ ,\quad i = 1,\dots,m\ . \tag{12}$$
The above constrained optimization problem constitutes the core of the algorithm. To solve Eq. (12) we first write its Lagrangian,
$$L(\alpha) = R(\alpha, \alpha^t) - \gamma\left(\sum_{i=1}^m \alpha_i - 1\right) + \sum_{i=1}^m \mu_i \alpha_i - \sum_{i=1}^m \eta_i\left(\alpha_i - \frac{1}{\nu m}\right)\ ,$$
where μ_i, η_i ≥ 0 (1 ≤ i ≤ m) and γ are the Lagrange multipliers. Taking the derivative of L with respect to α_i we get,
$$\frac{\partial}{\partial \alpha_i} L = -\sum_{l=1}^n x_{i,l}\, f'\!\left(\frac{\alpha_i}{\alpha_i^t}\sum_j \alpha_j^t x_{j,l}\right) + \sum_{l=1}^n f(x_{i,l}) - \gamma + \mu_i - \eta_i\ . \tag{13}$$
Rearranging, and adding terms which do not depend on i, the negation of Eq. (13) becomes,
$$\sum_{l=1}^n x_{i,l}\, f'\!\left(\frac{\alpha_i}{\alpha_i^t}\sum_j \alpha_j^t x_{j,l}\right) - \sum_{l=1}^n f(x_{i,l}) + \gamma - \mu_i + \eta_i
= -B_F\!\left(x_i \Big\| \sum_j \alpha_j^t x_j\right) + x_i \cdot \left[\nabla F\!\left(\frac{\alpha_i}{\alpha_i^t}\sum_j \alpha_j^t x_j\right) - \nabla F\!\left(\sum_j \alpha_j^t x_j\right)\right] + \gamma' - \mu_i + \eta_i\ ,$$
where $\gamma' = \gamma + \sum_{l=1}^n \left[ f'\!\left(\sum_j \alpha_j^t x_{j,l}\right)\sum_j \alpha_j^t x_{j,l} - f\!\left(\sum_j \alpha_j^t x_{j,l}\right) \right]$. For clarity, we denote $w_t = \sum_j \alpha_j^t x_j$. Setting the derivative of the Lagrangian, in the manipulated form above, to zero we get,
$$x_i \cdot \left[\nabla F\!\left(\frac{\alpha_i}{\alpha_i^t}\, w_t\right) - \nabla F(w_t)\right] = B_F(x_i\|w_t) - \gamma' + \mu_i - \eta_i\ . \tag{14}$$
Eq. (14) is the heart of a single weight-update iteration: given $\alpha^t$ we set $w_t = \sum_j \alpha_j^t x_j$; we then solve Eq. (14) for α while fixing $w_t$ and $\alpha^t$. The Lagrange multipliers are still present and are set so that the constraints of Eq. (9) are satisfied along with their corresponding Karush-Kuhn-Tucker (KKT) conditions. Before proceeding to the description of the full algorithm, we would like to note that the solution of Eq. (14) takes a multiplicative form, $\alpha_i^{t+1} = \alpha_i^t\, g(w_t, x_i, \gamma', \mu_i, \eta_i)$.

Pseudo-code for the algorithm is given in Fig. 3. The algorithm starts with an initialization of the α_i that satisfies the constraints by setting $\alpha_i^1 = 1/m$. It then alternates between setting $w_t$ from the current $\alpha^t$ and finding a new set $\alpha^{t+1}$ from the newly calculated $w_t$. The radius of the resulting Bregmanian sphere is then computed by choosing any point $x_i$ that lies on the sphere itself and computing its divergence from the center of the sphere w. Such a point can be easily identified since the KKT conditions imply that its Lagrange multiplier should be inside (0, 1/(νm)).

Analysis: We now turn our attention to the convergence properties of the above algorithm and its correctness. The following lemma serves as a basic technical tool in our analysis. The lemma is a generalization of Eq. (29) in [11].

Lemma 1. Let f(x) be a convex function and let $\{a_i\}_{i=1}^m$ and $\{z_i\}_{i=1}^m$ be two sets of variables such that $a_i > 0$ for all i and $\sum_i a_i = 1$. Then, the following bound holds, $f\!\left(\sum_i z_i\right) \le \sum_i a_i f(z_i/a_i)$.
Proof. The convexity of f implies, $f\!\left(\sum_i z_i\right) = f\!\left(\sum_i a_i \frac{z_i}{a_i}\right) \le \sum_i a_i f(z_i/a_i)$. ⊓⊔

We now state the main conditions on the function R.

Lemma 2. Assume that $x_{i,l} > 0$ and $\alpha_i^t > 0$ for all i and l. Then Q (defined in Eq. (9)) and R (defined in Eq. (11)) satisfy: (1) R(α, α) = Q(α) and (2) R(α, αᵗ) ≤ Q(α).

We omit the proof due to lack of space and proceed to the main result of this section. In the following theorem we show that the process of solving Eq. (12) repeatedly never decreases the value of the objective function and, furthermore, that the only fixed point of the algorithm is the solution of Eq. (9).

Theorem 1. Let $\alpha^t$, t = 1, 2, …, be the solutions obtained by the algorithm for Eq. (12). Then, $Q(\alpha^t) \le Q(\alpha^{t+1})$ for all t ≥ 1. Furthermore, $\alpha^t = \alpha^{t+1}$ iff $\alpha^t$ is the optimal solution of Eq. (9).

Proof. To prove the first property we apply Lemma 2. The first part of Lemma 2 implies that $Q(\alpha^t) = R(\alpha^t, \alpha^t)$. Since $\alpha^t$ satisfies the constraints of Eq. (12) and $\alpha^{t+1}$ is the maximizer of Eq. (12), we have that $R(\alpha^t, \alpha^t) \le R(\alpha^{t+1}, \alpha^t)$. Using the second part of Lemma 2 we have, $R(\alpha^{t+1}, \alpha^t) \le Q(\alpha^{t+1})$. Summing up we get, $Q(\alpha^t) = R(\alpha^t, \alpha^t) \le R(\alpha^{t+1}, \alpha^t) \le Q(\alpha^{t+1})$, which proves property (1). For the second property, let us start with Eq. (12). We have shown that Eq. (13), when set to zero, defines the set of conditions needed for solving Eq. (12). A fixed point of Eq. (13) is obtained when $\alpha^t = \alpha^{t+1}$. Put another way, $\alpha^t$ is a fixed point of Eq. (13) if and only if it is a solution of the following set of equations (for i = 1, …, m),
$$\sum_{l=1}^n x_{i,l}\, f'\!\left(\sum_j \alpha_j^t x_{j,l}\right) - \sum_{l=1}^n f(x_{i,l}) + \gamma - \mu_i + \eta_i = 0\ . \tag{15}$$
Similarly, solving Eq. (9) by writing the appropriate Lagrangian and equating its derivative with respect to α_i to zero yields the same set of equations as Eq. (15). Since the fixed points of Eq. (12) and the optimal solution of Eq. (9) yield the same set of equations, which cast necessary and sufficient conditions for optimality, we obtain that a fixed point of Eq. (12) is the optimal solution of Eq. (9). ⊓⊔
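The bound of Lemma 1, which underlies the construction of R, is easy to sanity-check numerically. A minimal sketch with the relative-entropy Bregman function f(x) = x log x − x (the values here are arbitrary):

```python
import math

def f(x):
    # the per-coordinate Bregman function of the relative entropy
    return x * math.log(x) - x

# Lemma 1: for convex f, a_i > 0 with sum a_i = 1:
#   f(sum_i z_i) <= sum_i a_i f(z_i / a_i)
a = [0.2, 0.3, 0.5]
z = [0.1, 0.4, 0.2]
lhs = f(sum(z))
rhs = sum(ai * f(zi / ai) for ai, zi in zip(a, z))
assert lhs <= rhs + 1e-12
```

Equality holds when the ratios z_i/a_i are all equal, which is why the auxiliary function R touches Q at α = αᵗ.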
4
A Closed-form Update for the Relative Entropy
In the previous section we described a general iterative algorithm for solving the constrained optimization problem for the Bregmanian uniclass problem. We have left unspecified though the algorithmic details of finding αt+1 that minimizes R(α, αt ). In this section we demonstrate the applicability of our approach by describing an efficient algorithm for the relative entropy as the Bregman divergence. Similar updates can be devised for the squared norm. We focus though
on the relative entropy, which has not been explored in the context of uniclass problems. To cast the relative entropy as a Bregman divergence we set f(x) = x log(x) − x as the convex Bregman function of a single component. This implies that f′(x) = log(x). We restrict ourselves to the more common case where x defines a discrete distribution, i.e., each instance $x_i$ conforms with $x_{i,l} \ge 0$ and $\sum_l x_{i,l} = 1$. In this case Eq. (14) reduces to,
$$\sum_{l=1}^n x_{i,l}\left[\log\!\left(\frac{\alpha_i}{\alpha_i^t}\sum_j \alpha_j^t x_{j,l}\right) - \log\!\left(\sum_j \alpha_j^t x_{j,l}\right)\right] = B_F(x_i\|w_t) - \gamma' + \mu_i - \eta_i\ .$$
Since $\log\!\left(\frac{\alpha_i}{\alpha_i^t}\sum_j \alpha_j^t x_{j,l}\right) = \log\frac{\alpha_i}{\alpha_i^t} + \log\!\left(\sum_j \alpha_j^t x_{j,l}\right)$ and $\sum_l x_{i,l} = 1$, we get that the update rule is of the form, $\alpha_i^{t+1} = \alpha_i^t \exp\left[B_F(x_i\|w_t) - \gamma' + \mu_i - \eta_i\right]$. Note that if $\alpha_i^t > 0$ then $\alpha_i^{t+1} > 0$ as well. Thus, the KKT conditions imply that μ_i = 0. Next note that the requirement $\sum_i \alpha_i^{t+1} = 1$ implies that γ′, which does not depend on i, is a normalization constant, and we can rewrite the update rule as,
$$\alpha_i^{t+1} = \frac{1}{Z_t} \exp\left[B_F(x_i\|w_t) + \log(\alpha_i^t) - \eta_i\right]\ , \tag{16}$$
where $Z_t = \sum_i \exp\left[B_F(x_i\|w_t) + \log(\alpha_i^t) - \eta_i\right]$. We are not done yet: Eq. (16) is not sufficient to devise an iterative update that computes $\alpha^{t+1}$, since each Lagrange multiplier η_i is not bound to a specific value. The following lemma provides a monotonicity property that in turn serves as a tool for setting η_i.

Lemma 3. Let η_i and $\alpha_i^{t+1}$ be chosen so as to satisfy both Eq. (16) and the KKT conditions for Eq. (12). Then, $B_F(x_i\|w_t) + \log(\alpha_i^t) \ge B_F(x_j\|w_t) + \log(\alpha_j^t)$ implies that $\eta_i \ge \eta_j$.

Proof. The KKT conditions of Eq. (12) imply that the Lagrange multipliers η_k and $\alpha_k^{t+1}$ must satisfy (for all k) $\eta_k(\alpha_k^{t+1} - 1/(\nu m)) = 0$ with $\eta_k \ge 0$. Assume for contradiction that $B_F(x_i\|w_t) + \log(\alpha_i^t) \ge B_F(x_j\|w_t) + \log(\alpha_j^t)$, but $\eta_i < \eta_j$. From Eq. (16) we get that,
$$\log \alpha_i^{t+1} - \log \alpha_j^{t+1} = \left[B_F(x_i\|w_t) + \log(\alpha_i^t)\right] - \left[B_F(x_j\|w_t) + \log(\alpha_j^t)\right] - \eta_i + \eta_j\ .$$
The right-hand side is strictly greater than zero since $B_F(x_i\|w_t) + \log(\alpha_i^t) \ge B_F(x_j\|w_t) + \log(\alpha_j^t)$ and we assumed $\eta_j > \eta_i$; therefore $\alpha_i^{t+1} > \alpha_j^{t+1}$. However, using the assumption that $\eta_j > \eta_i \ge 0$ in conjunction with the KKT condition $\eta_j(\alpha_j^{t+1} - 1/(\nu m)) = 0$, we must have $\alpha_j^{t+1} = 1/(\nu m)$. Alas, $\alpha_i^{t+1}$ must satisfy the box constraints, that is, $0 \le \alpha_i^{t+1} \le 1/(\nu m)$, and thus $\alpha_i^{t+1} \le 1/(\nu m) = \alpha_j^{t+1}$, which leads to a contradiction. ⊓⊔

The above lemma sheds more light on the form of the solution of Eq. (16). We get that the values of the Lagrange multipliers are ordered according to the (known) sums of the relative entropy between each point and $w_t$ and the logarithm of the current estimate $\alpha^t$. Combining this property with the box constraints yields the following lemma.

Lemma 4. Assume that η_i and $\alpha_i^{t+1}$ satisfy Eq. (16) and the KKT conditions of Eq. (12). Denote $v_i = B_F(x_i\|w_t) + \log(\alpha_i^t)$. Then, there exists a constant ψ such that $\alpha_i^{t+1} = \exp(u_i)/Z_t$ where $u_i = \min\{\psi, v_i\}$.

Proof. Assume w.l.o.g. that the indices are sorted according to $v_i$, i.e., $v_1 \ge v_2 \ge \dots \ge v_m$. Using Lemma 3 together with the KKT conditions we get $\eta_1 \ge \eta_2 \ge \dots \ge \eta_m \ge 0$. Denote by k the maximal index for which η_i is strictly greater than zero, that is, $\eta_1 \ge \dots \ge \eta_k > 0$ and $\eta_{k+1} = \dots = \eta_m = 0$. From the KKT conditions we have that whenever $\eta_i > 0$ (meaning $v_i \ge v_k$), $\alpha_i^{t+1}$ is on the boundary of the constraints and is equal to 1/(νm). An analogous argument implies that for i such that $\eta_i = 0$ (and thus $v_i \le v_{k+1}$), $\alpha_i^{t+1}$ is inside the box of constraints and is equal to $\exp\left[B_F(x_i\|w_t) + \log(\alpha_i^t)\right]/Z_t = \exp(v_i)/Z_t$. Summing up, there exists a threshold $\psi \in [v_{k+1}, v_k)$ such that for $v_i > \psi$, $\alpha_i^{t+1} = \exp(\psi)/Z_t$, and for $v_i \le \psi$, $\alpha_i^{t+1} = \exp(v_i)/Z_t$. Finally, ψ is set such that $1/(\nu m) = \exp(\psi)/Z_t$. ⊓⊔

As noted in Lemma 4, the exact value of ψ is the solution of $1/(\nu m) = \exp(\psi)/Z_t$. As in the lemma, let us assume that the values $v_i = B_F(x_i\|w_t) + \log(\alpha_i^t)$ are sorted in descending order, and for boundary conditions define $v_0 = \infty$. An immediate consequence of the lemma is that we can write $\alpha_i^{t+1}$ as $\alpha_i^{t+1} = \exp(\min\{v_i,\psi\}) \,/\, \sum_j \exp(\min\{v_j,\psi\})$. Let k be the index such that $v_k > \psi \ge v_{k+1}$. Then from the above we get that,
$$\frac{1}{\nu m} = \alpha_k^{t+1} = \frac{e^\psi}{k\, e^\psi + \sum_{j=k+1}^m e^{v_j}}\ . \tag{17}$$
Put another way, had we wrongly set ψ to be $v_k$ we would have gotten,
$$\frac{1}{\nu m} < \frac{e^{v_k}}{k\, e^{v_k} + \sum_{j=k+1}^m e^{v_j}}\ . \tag{18}$$
Analogously, since $\psi \ge v_{k+1}$ we must have,
$$\frac{1}{\nu m} \ge \frac{e^{v_{k+1}}}{(k+1)\, e^{v_{k+1}} + \sum_{j=k+2}^m e^{v_j}}\ . \tag{19}$$
Eq. (18) and Eq. (19) do not depend on the exact value of ψ. Therefore, in order to find the interval $[v_{k+1}, v_k)$ in which ψ resides, it suffices to find the index k for which Eq. (18) and Eq. (19) are both satisfied. A direct calculation of Eq. (18) takes O(m) time for each candidate index, and thus the total search would take O(m²) time. However, we can reduce the search time to be linear in m by defining the auxiliary variables $\Phi(r) = \frac{e^{v_r}}{Z(r)}$ and $Z(r) = r\, e^{v_r} + \sum_{j=r+1}^m e^{v_j}$. We can compute Z(r) recursively as $Z(r+1) = Z(r) + r\,(e^{v_{r+1}} - e^{v_r})$. To find the value of ψ, we first perform a linear search to find the index r for which $\Phi(r) > 1/(\nu m)$ and $\Phi(r+1) \le 1/(\nu m)$, which, from Eq. (18) and Eq. (19), implies that $\psi \in [v_{r+1}, v_r)$. To find the exact value of ψ we use Eq. (17) and get, $\psi = \log\left[\left(\sum_{j=r+1}^m e^{v_j}\right) / (\nu m - r)\right]$. (Note that the above equation implies that $r \le \nu m$.) The search process is linear in m, and with the sorting of the $v_i$ we get that finding ψ and then $\alpha^{t+1}$ takes O(m log(m)) time.
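The steps above can be sketched as a single NewWeights routine for the relative entropy. This is our own sketch (names and the stabilizing shift of the vᵢ are ours; the shift leaves all ratios, and hence the result, unchanged), assuming the instances lie strictly inside the simplex:

```python
import numpy as np

def new_weights_re(w, X, alpha, nu):
    """One NewWeights step for the relative entropy (Lemma 4):
    alpha_i' = exp(min{v_i, psi}) / Z  with  v_i = B_RE(x_i||w) + log(alpha_i)."""
    m = len(alpha)
    cap = 1.0 / (nu * m)                      # box constraint of Eq. (9)
    b = np.sum(X * np.log(X / w), axis=1)     # B_RE(x_i||w) on the simplex
    v = b + np.log(alpha)
    v = v - v.max()                           # shift for numerical stability
    vs = -np.sort(-v)                         # v_1 >= v_2 >= ... (descending)
    e = np.exp(vs)
    # suffix[r] = sum_{j > r+1 (1-based)} e^{v_j}
    suffix = np.append(np.cumsum(e[::-1])[::-1][1:], 0.0)
    # linear search for the largest r with Phi(r) = e^{v_r}/Z(r) > 1/(nu m)
    r = 0
    while r < m and e[r] / ((r + 1) * e[r] + suffix[r]) > cap + 1e-12:
        r += 1
    if r == 0:                                # no multiplier hits the bound
        u = v
    else:
        psi = np.log(suffix[r - 1] / (nu * m - r))   # Eq. (17) solved for psi
        u = np.minimum(v, psi)                # Lemma 4: clip at the threshold
    new = np.exp(u)
    return new / new.sum()
```

Plugged into the loop of Fig. 3, this routine keeps α on the scaled simplex with the box constraint $\alpha_i \le 1/(\nu m)$ satisfied at every round.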
5
Generalization
We now analyze the generalization ability of the uniclass algorithm of Eq. (1) in terms of the empirical discrepancies from the sphere enclosing the points in the sample. Informally, the theorem below states that the true probability of observing a point outside the enclosing sphere depends on the empirical distribution of the points within a divergence r from the center, where r ∈ [R − θ, R]. This probability is also inversely proportional to θ. Similar bounds have been derived for the generalization of margin classifiers; indeed, our proof is closely related to the proofs in [2, 13, 12]. The proof exploits the fact that the solution of the batch problem is a convex combination of the points in the sample (Eq. (7)). The proof is omitted due to lack of space.

Theorem 2. Let D be a distribution over a compact domain X ⊂ Rⁿ, and let S be a sample of m examples chosen independently according to D. Let $B_F$ be a Bregman divergence with f′ continuous over X. Define
$$\mathcal{H} = \left\{ h : X \to \mathbb{R} \ \middle|\ h(x) = R - B_F(x\|w)\ ;\ w = \sum_{i=1}^m \alpha_i x_i,\ \alpha_i \ge 0,\ \sum_i \alpha_i = 1 \right\}$$
and assume that $0 < R \le \theta_0$. Then with probability at least 1 − δ over the random choice of the training set S, every function h ∈ H satisfies the following bound for all $0 < \theta \le \theta_0$,
$$P_D[h(x) \le 0] \ \le\ P_S[h(x) \le \theta] + O\!\left(\frac{n \log(m)}{\theta\sqrt{m}}\left(\log(m) + \log\frac{\theta_0}{\delta}\right)\right)\ .$$
6
Online Algorithms
In this section we describe online algorithms for the uniclass problem and analyze their mistake bounds. In online settings we receive examples one at a time. We therefore need to employ somewhat different assumptions in order to analyze the algorithms. We cast the learning goal as the task of constructing a Bregmanian sphere whose radius is competitive with the radius of the smallest sphere that encloses all of the sample points. Formally, we observe a sequence of points $x_1, x_2, \dots, x_t, \dots$ ($x_s \in X$) and assume that there exist a radius r and a center u such that $\forall t : B_F(x_t\|u) \le r$. The goal of the learning algorithm is to construct a sphere that encloses all of the points with a small radius. The algorithms we present in this section assume that the radius of the sphere we need to construct, denoted R = (1 + δ)r, is given. We also assume that R is strictly greater than r (δ > 0). We start with an initial value for the center of the sphere and modify the center on any input point that lies outside the current sphere. We prove bounds on the number of center modifications, that is, the number of times a newly observed point resides outside the Bregmanian sphere of radius R. Our analysis implies that after a finite number of examples, we find a center such that all future points lie inside a ball of radius R.

Input: Divergence $B_F$ ; Radius R > 0
Initialize: Set $w_1 = x_0$
Loop: For t = 1, 2, …, T
– Get a new instance $x_t \in \mathbb{R}^n$
– If $B_F(x_t\|w_t) \ge R$ update:
  1. Choose $\alpha_t \in [0, 1]$
  2. Set $w_{t+1} = \alpha_t x_t + (1 - \alpha_t) w_t$
  Else: $w_{t+1} = w_t$
Return: sphere's center $w_{T+1}$

Fig. 4. The uniclass online algorithm.

In Fig. 4 we give the skeleton of all of our online algorithms. Upon a mistake, i.e., when a new example $x_t$ is found to be outside the current ball, we set the new center $w_{t+1}$ to be a convex combination of the current center $w_t$ and the new example $x_t$. We describe below two schemes to set the interpolation parameter, denoted $\alpha_t \in [0, 1]$. The first is very simple as it employs a fixed value for $\alpha_t$, where we only require that $\alpha_t$ is bounded above by a number which depends only on δ. The second scheme chooses on each round t a different value for $\alpha_t$ that depends on $x_t$, $w_t$, and R. While the mistake bound for the second scheme is potentially better, finding $\alpha_t$ requires solving a one-dimensional optimization problem over the interval [0, 1]. In our analyses we make use of the following equality, derived by Kivinen and Warmuth [10]:
$$B_F(x\|u) = B_F(x\|w) + B_F(w\|u) + (\nabla F(w) - \nabla F(u)) \cdot (x - w)\ , \tag{20}$$
for any x, w, u ∈ X. This equality generalizes the Euclidean cosine rule to Bregman divergences. Let us first describe and analyze the version that chooses a different value of $\alpha_t$ on each round. In this version we set $\alpha_t$ as the minimizer of the following (convex) problem,
$$\alpha_t = \arg\min_{\alpha \in [0,1]}\ Q(\alpha) = B_F(\alpha x_t + (1-\alpha)w_t \,\|\, w_t) - \alpha B_F(x_t\|w_t) + \alpha R\ . \tag{21}$$
The rationale for this choice of $\alpha_t$ is as follows. On one hand, if $B_F(x_t\|w_t) \le R$ then the optimal value for $\alpha_t$ is zero. If, on the other hand, $B_F(x_t\|w_t) > R$ then the optimal value for α is set aggressively in order to enclose $x_t$ in a sphere centered at $w_{t+1}$. The next theorem shows that the cumulative sum of the $\alpha_t$ is bounded. This result enables us to prove the corresponding mistake bound.

Theorem 3. Let $x_1, \dots, x_T$ be an input sequence of points in X fed to the online uniclass algorithm of Fig. 4, where $\alpha_t$ is chosen according to Eq. (21). Assume that there exist a center u ∈ X and a radius r such that $B_F(x_t\|u) \le r$ for all t ∈ {1, …, T}. Then, for any R = (1 + δ)r with δ > 0 the following bound holds, $\sum_{t=1}^T \alpha_t \le \frac{1}{\delta}$.
Proof. Denote $\Delta_t = B_F(w_{t+1}\|u) - B_F(w_t\|u)$. To prove the theorem we bound $\sum_{t=1}^T \Delta_t$ from below and from above. First note that the sum telescopes and, with $w_1 = x_0$,
$$\sum_{t=1}^T \Delta_t = B_F(w_{T+1}\|u) - B_F(w_1\|u) \ge -B_F(w_1\|u) = -B_F(x_0\|u) \ge -r\ . \tag{22}$$
To bound the sum from above we bound each term $\Delta_t$ independently. Let us assume that $B_F(x_t\|w_t) > R$ (otherwise $\Delta_t = 0$). In this case $w_{t+1} = \alpha_t x_t + (1 - \alpha_t)w_t$. We rewrite $\Delta_t$ as
$$\begin{aligned}
\Delta_t &= F(w_{t+1}) - F(u) - \nabla F(u)\cdot(w_{t+1} - u) - \left[F(w_t) - F(u) - \nabla F(u)\cdot(w_t - u)\right] \\
&= F(w_{t+1}) - F(w_t) - \nabla F(u)\cdot(w_{t+1} - w_t)\ .
\end{aligned} \tag{23}$$
Substituting $w_{t+1} = \alpha_t x_t + (1 - \alpha_t)w_t$, we get,
$$\Delta_t = F(w_{t+1}) - F(w_t) - \nabla F(u)\cdot(\alpha_t x_t + (1-\alpha_t)w_t - w_t) = F(w_{t+1}) - F(w_t) - \alpha_t \nabla F(u)\cdot(x_t - w_t)\ . \tag{24}$$
Using the "generalized cosine equality" of Eq. (20) we get, $(\nabla F(w_t) - \nabla F(u))\cdot(x_t - w_t) = B_F(x_t\|u) - B_F(x_t\|w_t) - B_F(w_t\|u)$. Substituting in Eq. (24),
$$\begin{aligned}
\Delta_t &= F(w_{t+1}) - F(w_t) + \alpha_t\left[-\nabla F(w_t)\cdot(x_t - w_t) + B_F(x_t\|u) - B_F(x_t\|w_t) - B_F(w_t\|u)\right] \\
&= F(w_{t+1}) - F(w_t) - \alpha_t \nabla F(w_t)\cdot(x_t - w_t) + \alpha_t B_F(x_t\|u) - \alpha_t B_F(x_t\|w_t) - \alpha_t B_F(w_t\|u)\ .
\end{aligned} \tag{25}$$
We use the definition of $B_F$ to further develop Eq. (25),
$$\begin{aligned}
\Delta_t &= B_F(w_{t+1}\|w_t) + \alpha_t B_F(x_t\|u) - \alpha_t B_F(x_t\|w_t) - \alpha_t B_F(w_t\|u) \\
&= B_F(\alpha_t x_t + (1-\alpha_t)w_t \,\|\, w_t) - \alpha_t B_F(x_t\|w_t) + \alpha_t R - \alpha_t R + \alpha_t B_F(x_t\|u) - \alpha_t B_F(w_t\|u) \\
&= Q(\alpha_t) - \alpha_t R + \alpha_t B_F(x_t\|u) - \alpha_t B_F(w_t\|u)\ ,
\end{aligned} \tag{26}$$
where we used the definition of Q(α) to obtain the last equality. It is straightforward to verify that Q(0) = 0, and thus $Q(\alpha_t) \le 0$ since $\alpha_t$ is the minimizer of Q(α). Since we assumed that $B_F(x_t\|u) \le r$, and $B_F(w_t\|u) \ge 0$, we can upper bound $\Delta_t$,
$$\Delta_t \le -\alpha_t R + \alpha_t r = -\alpha_t\big((1+\delta)r - r\big) = -\alpha_t\,\delta\, r\ . \tag{27}$$
Combining the lower bound of Eq. (22) with the upper bound of Eq. (27) we get, $-r \le \sum_{t=1}^T \Delta_t \le -\delta r \sum_t \alpha_t$, which yields $\sum_t \alpha_t \le \frac{1}{\delta}$ and concludes the proof. ⊓⊔
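The generalized cosine equality of Eq. (20), used repeatedly above, is easy to verify numerically. A quick sketch for the relative entropy (random positive vectors; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def B_re(x, w):
    # unnormalized relative entropy: sum_l x_l log(x_l/w_l) - x_l + w_l
    return float(np.sum(x * np.log(x / w) - x + w))

grad_F = np.log  # f(x) = x log x - x  =>  (grad F(x))_l = log(x_l)

x, w, u = (rng.random(4) + 0.1 for _ in range(3))
lhs = B_re(x, u)
rhs = B_re(x, w) + B_re(w, u) + float(np.dot(grad_F(w) - grad_F(u), x - w))
assert abs(lhs - rhs) < 1e-10   # Eq. (20) holds exactly
```

The identity reduces to the law of cosines when F is the squared Euclidean norm, since then the gradient difference is simply w − u.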
Note that the above theorem does not directly equip us with a mistake bound. A priori, one would hope that it is possible to derive a bound on the number of times a new point does not reside inside the learned ball. However, the situation seems a bit more delicate and we could not derive such a bound. We give below a somewhat weaker bound. In a nutshell, the theorem states that the number of rounds on which the divergence $B_F(x_t\|w_t)$ exceeds a radius $\tilde R > R$ is bounded.

Theorem 4. Assume that the largest eigenvalue of the Hessian of F(x) is bounded by λ for any x ∈ X and that $\|x_t\|_2 \le B$ for 1 ≤ t ≤ T. Then, under the same conditions as Thm. 3 and for any $\tilde R = (1 + \tilde\delta)r$ with $\tilde\delta > \delta$, the number of rounds for which $B_F(x_t\|w_t) \ge \tilde R$ is bounded above by $\frac{4\lambda B^2}{r\,\delta\,(\tilde\delta - \delta)}$.
Proof. Due to the lack of space we only give a sketch of the proof. We want to bound the number of rounds for which $B_F(x_t \| w_t) \ge \tilde{R}$. To do so we find a lower bound on $\alpha_t$ for all such rounds. To derive the bound we use the second order Taylor expansion of $Q$ and write $Q'(\alpha_t) = Q'(0) + \alpha_t Q''(\xi)$ where $\xi \in [0, \alpha_t]$. Since $\alpha_t$ is the minimizer of $Q$ we know that $Q'(\alpha_t) = 0$. Combining this property with tedious calculus yields that,
$$\alpha_t \, (x_t - w_t)^{\top} H\!\left[ w_t + \xi (x_t - w_t) \right] (x_t - w_t) = B_F(x_t \| w_t) - R \; , \quad (28)$$
where $H$ is the Hessian of $F$. Using the assumption that $B_F(x_t \| w_t) \ge \tilde{R}$, we can lower bound the right hand side of Eq. (28) by $r(\tilde{\delta} - \delta)$. Using the Hessian's eigenvalue bound, the left hand side of Eq. (28) is upper bounded by $4 \lambda B^2 \alpha_t$. Therefore, on each round for which $B_F(x_t \| w_t) \ge \tilde{R}$ we have $\alpha_t \ge \frac{r(\tilde{\delta} - \delta)}{4 \lambda B^2}$. Since we know from Thm. 3 that $\sum_t \alpha_t \le 1/\delta$, the number of rounds with $B_F(x_t \| w_t) \ge \tilde{R}$ cannot exceed the desired bound. ⊓⊔

Finally, to conclude the section on online algorithms we would like to mention in passing the formal result obtained for the fixed-$\alpha$ update. Using similar proof techniques it is possible to prove the following mistake bound.

Theorem 5. Let $x_1, \ldots, x_T$ be an input sequence of points in $\mathcal{X}$ fed to the online uniclass algorithm of Fig. 4 used with a fixed-rate update ($\forall t: \alpha_t = \alpha$). Then, under the same conditions as Thm. 4, there exists an $\alpha$ such that the number of rounds for which $B_F(x_t \| w_t) \ge \tilde{R}$ is bounded from above by $4(1+\delta)/\delta^2$.
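The count bound of Thm. 4 can also be checked empirically. In the Euclidean case $F(x) = \frac{1}{2}\|x\|^2$ the Hessian is the identity, so $\lambda = 1$; the sketch below (not from the original paper, with illustrative constants and data) counts the rounds on which the divergence reaches $\tilde{R}$ and compares the count against $4\lambda B^2 / (r \delta (\tilde{\delta} - \delta))$.

```python
import random

# Sanity check of Thm. 4 in the Euclidean case: lambda = 1 since the
# Hessian of F(x) = ||x||^2/2 is the identity. Constants are illustrative.

def div2(x, w):
    """B_F(x||w) = ||x - w||^2 / 2 for F(x) = ||x||^2 / 2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, w))

random.seed(1)
r, delta, delta_t = 2.0, 0.5, 1.5
R, R_tilde = (1 + delta) * r, (1 + delta_t) * r
B = (2 * r) ** 0.5                   # div2(x, 0) <= r implies ||x||_2 <= B

points = []
while len(points) < 1000:            # all points within divergence r of u = 0
    p = (random.uniform(-B, B), random.uniform(-B, B))
    if div2(p, (0.0, 0.0)) <= r:
        points.append(p)

w, big_rounds = list(points[0]), 0   # w_1 = x_0
for x in points[1:]:
    d = div2(x, w)
    if d >= R_tilde:
        big_rounds += 1              # round with divergence at least R_tilde
    if d > R:                        # adaptive update of Eq. (21)
        alpha = min(1.0, (d - R) / (2.0 * d))
        w = [alpha * a + (1 - alpha) * b for a, b in zip(x, w)]

bound = 4 * 1 * B * B / (r * delta * (delta_t - delta))
print(big_rounds <= bound)           # Thm. 4 guarantee
```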
7 Discussion
We introduced and analyzed a general algorithmic framework for uniclass problems that can be used with a broad family of Bregman divergences. We provided a new parallel-update algorithm for solving the optimization problem and demonstrated its usage with the relative entropy. We also introduced and analyzed an apparatus for online learning of Bregmanian spheres. There are numerous directions in which this work can be extended and further investigated. A particularly interesting extension builds on the relation of Bregman divergences to the exponential family of distributions. One of the more challenging questions that arises from this view is whether the uniclass framework can be generalized and applied to the complex distribution models which arise in speech recognition, information retrieval, and biological sequence analysis.
Acknowledgments We would like to thank Yair Censor for his comments and advice throughout the research and to Ran Bachrach for his valuable comments. Thanks also to Guy Lebanon for his help in producing the figures and to Leo Kontorovich for comments and suggestions. This work was partially funded by EU project KerMIT No. IST-2000-25341 and by NSF ITR Award 0205594.