
Journal of Machine Learning Research 1–8

ICML 2017 AutoML Workshop

Hyperparameter Learning for Kernel Embedding Classifiers with Rademacher Complexity Bounds

Kelvin Y.S. Hsu [email protected]
Richard Nock [email protected]
Fabio Ramos [email protected]

The University of Sydney and Data61, CSIRO, Australia

Abstract

We propose learning-theoretic bounds for hyperparameter learning of conditional kernel embeddings in the probabilistic multiclass classification context. Kernel embeddings are nonparametric methods to represent probability distributions directly through observed data in a reproducing kernel Hilbert space (RKHS). This property forms the core of modern kernel methods, yet hyperparameter learning for kernel embeddings remains challenging, often relying on heuristics and domain expertise. We begin by developing the kernel embedding classifier (KEC), and prove that its expected classification error can be bounded with high probability using Rademacher complexity bounds. This bound is used to propose a scalable hyperparameter learning algorithm for conditional embeddings with batch stochastic gradient descent. We verify our learning algorithm on standard UCI datasets, and also use it to learn feature representations of a convolutional neural network with improved accuracy, demonstrating the generality of this approach.

1. Introduction

Kernel embeddings are principled methods to represent probability distributions in a nonparametric setting. By transforming distributions into mean embeddings within a reproducing kernel Hilbert space (RKHS), distributions can be represented directly from data without assuming a parametric structure (Song et al., 2013). Consequently, nonparametric probabilistic inference can be carried out entirely within the RKHS, where difficult marginalisation integrals become simple linear algebra (Muandet et al., 2016). In this framework, positive definite kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ provide a coherent sense of similarity between two elements of the same space by implicitly defining higher dimensional features. However, kernel hyperparameters are often selected a priori and not learned. In this paper, we take a learning-theoretic approach to learn the hyperparameters of a conditional kernel embedding in a supervised manner. We begin by proposing the kernel embedding classifier (KEC), a principled framework for inferring multiclass probabilistic outputs using conditional embeddings, and provide a proof for its stochastic convergence. We then employ Rademacher complexity as a data-dependent model complexity measure, and prove that expected classification risk can be bounded by a combination of empirical risk and conditional embedding norm with high probability. We use this bound to propose a learning objective that learns the balance between data fit and model complexity in a way that does not rely on priors.



2. Hilbert Space Embeddings of Conditional Probability Distributions

To construct a conditional embedding map $U_{Y|X}$ corresponding to the distribution $\mathbb{P}_{Y|X}$, where $X : \Omega \to \mathcal{X}$ and $Y : \Omega \to \mathcal{Y}$ are measurable random variables, we choose a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for the input space $\mathcal{X}$ and another kernel $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ for the output space $\mathcal{Y}$. These kernels $k$ and $l$ each describe how similarity is measured within their respective domains $\mathcal{X}$ and $\mathcal{Y}$, and are symmetric and positive definite such that they uniquely define the RKHSs $\mathcal{H}_k$ and $\mathcal{H}_l$. We then define $U_{Y|X} := C_{YX} C_{XX}^{-1}$, where $C_{YX} := \mathbb{E}[l(Y, \cdot) \otimes k(X, \cdot)]$ and $C_{XX} := \mathbb{E}[k(X, \cdot) \otimes k(X, \cdot)]$ (Song et al., 2009). The conditional embedding map can be seen as an operator map from $\mathcal{H}_k$ to $\mathcal{H}_l$. In this sense, it sweeps out a family of conditional embeddings $\mu_{Y|X=x}$ in $\mathcal{H}_l$, each indexed by the input variable $x$, via the property $\mu_{Y|X=x} := \mathbb{E}[l(Y, \cdot) \mid X = x] = U_{Y|X}\, k(x, \cdot)$. Under the assumption that $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}_k$, Song et al. (2009, Theorem 4) proved that the conditional expectation of a function $g \in \mathcal{H}_l$ can be expressed as an inner product, $\mathbb{E}[g(Y) \mid X = x] = \langle \mu_{Y|X=x}, g \rangle$. While the assumptions that $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}_k$ and $k(x, \cdot) \in \mathrm{image}(C_{XX})$ hold for finite input domains $\mathcal{X}$ and characteristic kernels $k$, it is not necessarily true when $\mathcal{X}$ is a continuous domain (Fukumizu et al., 2004), which is the scenario for many classification problems. In this case, $C_{YX} C_{XX}^{-1}$ becomes only an approximation to $U_{Y|X}$, and we instead regularise the inverse and use $C_{YX} (C_{XX} + \lambda I)^{-1}$, which also serves to avoid overfitting (Song et al., 2013). In practice, we do not have access to the distribution $\mathbb{P}_{XY}$ to analytically derive the conditional embedding. Instead, we have a finite collection of observations $\{x_i, y_i\} \in \mathcal{X} \times \mathcal{Y}$, $i \in \mathbb{N}_n := \{1, \ldots, n\}$, for which the conditional embedding map $U_{Y|X}$ can be estimated by

$\hat{U}_{Y|X} = \Psi (K + n\lambda I)^{-1} \Phi^T,$    (1)

where $K := \{k(x_i, x_j)\}_{i=1,j=1}^{n,n}$, $\Phi := [\phi(x_1) \cdots \phi(x_n)]$, $\Psi := [\psi(y_1) \cdots \psi(y_n)]$, $\phi(x) := k(x, \cdot)$, and $\psi(y) := l(y, \cdot)$ (Song et al., 2013). The empirical conditional embedding $\hat{\mu}_{Y|X=x} := \hat{U}_{Y|X}\, k(x, \cdot)$ then stochastically converges to $\mu_{Y|X=x}$ in the RKHS norm at a rate of $O_p((n\lambda)^{-\frac{1}{2}} + \lambda^{\frac{1}{2}})$, under the assumption that $k(x, \cdot) \in \mathrm{image}(C_{XX})$ (Song et al., 2009, Theorem 6). This allows us to approximate the conditional expectation with $\langle \hat{\mu}_{Y|X=x}, g \rangle$ instead, where $\mathbf{g} := \{g(y_i)\}_{i=1}^n$ and $\mathbf{k}(x) := \{k(x_i, x)\}_{i=1}^n$,

$\mathbb{E}[g(Y) \mid X = x] \approx \langle \hat{\mu}_{Y|X=x}, g \rangle = \mathbf{g}^T (K + n\lambda I)^{-1} \mathbf{k}(x).$    (2)
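To make (2) concrete, the following is a minimal NumPy sketch of the finite-sample estimator. The Gaussian (RBF) kernel, the helper names, the length scale, and the default regularisation value are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale=1.0):
    # Illustrative choice of k: a Gaussian (RBF) kernel evaluated pairwise.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def conditional_expectation(X, g_vals, x_query, lam=1e-3, length_scale=1.0):
    """Estimate E[g(Y) | X = x] via (2): g^T (K + n*lam*I)^{-1} k(x)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, length_scale)                  # K_ij = k(x_i, x_j)
    k_x = gaussian_kernel(X, x_query, length_scale)          # k(x)_i = k(x_i, x)
    weights = np.linalg.solve(K + n * lam * np.eye(n), k_x)  # (K + n*lam*I)^{-1} k(x)
    return g_vals @ weights                                  # one estimate per query point
```

Here X is the (n, d) array of training inputs, g_vals the vector $\{g(y_i)\}_{i=1}^n$, and x_query a (q, d) array of query points; the regularisation $\lambda$ and the kernel length scale are exactly the hyperparameters learned in Section 4.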

3. Kernel Embedding Classifier

In the multiclass setting, the output label space is finite and discrete, taking values only in $\mathcal{Y} = \mathbb{N}_m := \{1, \ldots, m\}$. Naturally, we first choose the Kronecker delta kernel $\delta : \mathbb{N}_m \times \mathbb{N}_m \to \{0, 1\}$ as the output kernel $l$, where labels that are the same have unit similarity and labels that are different have no similarity. That is, for all pairs of labels $y_i, y_j \in \mathcal{Y}$, $\delta(y_i, y_j) = 1$ only if $y_i = y_j$ and is $0$ otherwise. As $\delta$ is an integrally strictly positive definite kernel on $\mathbb{N}_m$, it is therefore characteristic (Sriperumbudur et al., 2010, Theorem 7). As such, by the definition of characteristic kernels (Fukumizu et al., 2004), $\delta$ uniquely defines a RKHS $\mathcal{H}_\delta = \overline{\mathrm{span}}\{\delta(y, \cdot) : y \in \mathcal{Y}\}$, which is the closure of the span of its kernel induced features (Xu and Zhang, 2009). For $\mathcal{Y} = \mathbb{N}_m$, this means that any real-valued function $g : \mathbb{N}_m \to \mathbb{R}$ that is bounded on its discrete domain $\mathbb{N}_m$ is in the RKHS of $\delta$, because we can always write $g = \sum_{y=1}^m g(y)\, \delta(y, \cdot) \in \mathrm{span}\{\delta(y, \cdot) : y \in \mathcal{Y}\}$. In particular, indicator functions on $\mathbb{N}_m$ are in the RKHS $\mathcal{H}_\delta$, since $1_c(y) := 1_{\{c\}}(y) = \delta(c, y)$ are simply the canonical kernel induced features of $\mathcal{H}_\delta$. Such properties do not necessarily hold for continuous domains in general, and they allow consistent estimation of the decision probabilities used in multiclass classification. Let $p_c(x) := \mathbb{P}[Y = c \mid X = x]$ be the decision probability function for class $c \in \mathbb{N}_m$, or the probability of the class label $Y$ being $c$ when the example $X$ is $x$. We begin by writing this probability as an expectation of indicator functions,

$p_c(x) := \mathbb{P}[Y = c \mid X = x] = \mathbb{P}[Y \in \{c\} \mid X = x] = \mathbb{E}[1_c(Y) \mid X = x].$    (3)
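For instance, with $m = 3$ and a bounded function taking values $g(1) = 0.2$, $g(2) = -1$, $g(3) = 0.8$, we can write

$g = 0.2\,\delta(1, \cdot) - \delta(2, \cdot) + 0.8\,\delta(3, \cdot) \in \mathcal{H}_\delta, \qquad \langle g, \delta(c, \cdot) \rangle_{\mathcal{H}_\delta} = g(c) \quad \text{for each } c \in \mathbb{N}_3,$

and choosing $g = 1_c$ in this expansion is exactly what allows the expectation in (3) to be estimated through the conditional embedding in the next step.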

With $1_c \in \mathcal{H}_\delta$, we let $g = 1_c$ in (2) and $\mathbf{1}_c := \{1_c(y_i)\}_{i=1}^n$ to estimate the right hand side by

$\hat{p}_c(x) = f_c(x) := \mathbf{1}_c^T (K + n\lambda I)^{-1} \mathbf{k}(x).$    (4)

Therefore, the vector of empirical decision probability functions over the classes $c \in \mathbb{N}_m$ is

$\hat{\mathbf{p}}(x) = \mathbf{f}(x) := Y^T (K + n\lambda I)^{-1} \mathbf{k}(x) \in \mathbb{R}^m,$    (5)

where $Y := [\mathbf{1}_1 \ \mathbf{1}_2 \ \cdots \ \mathbf{1}_m] \in \{0, 1\}^{n \times m}$ is simply the one hot encoded labels $\{y_i\}_{i=1}^n$. The KEC is thus the multi-valued decision function $\mathbf{f}(x)$ (5). We then prove that the empirical decision probabilities (4) converge to the true decision probabilities. In fact, the inference distribution (5) is equivalent to the empirical conditional embedding.

Theorem 1 (Uniform Convergence of Empirical Decision Probability Function) Assuming that $k(x, \cdot)$ is in the image of $C_{XX}$, the empirical decision probability function $\hat{p}_c : \mathcal{X} \to \mathbb{R}$ (4) converges uniformly to the true decision probability $p_c : \mathcal{X} \to [0, 1]$ (3) at a stochastic rate of at least $O_p((n\lambda)^{-\frac{1}{2}} + \lambda^{\frac{1}{2}})$ for all $c \in \mathcal{Y} = \mathbb{N}_m$.
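A minimal NumPy sketch of the decision function (5) is given below, assuming a user-supplied kernel function; the function name, the default $\lambda$, and the input conventions are illustrative choices rather than anything fixed by the text.

```python
import numpy as np

def kec_decision_probabilities(X, y, x_query, kernel, lam=1e-3):
    """Empirical KEC decision function (5): f(x) = Y^T (K + n*lam*I)^{-1} k(x).

    X: (n, d) training inputs; y: (n,) integer labels in {1, ..., m};
    x_query: (q, d) query inputs; kernel(A, B): pairwise Gram matrix of k.
    """
    n = X.shape[0]
    classes = np.unique(y)                              # the m class labels
    Y = (y[:, None] == classes[None, :]).astype(float)  # one hot labels, shape (n, m)
    K = kernel(X, X)                                    # K_ij = k(x_i, x_j)
    k_q = kernel(X, x_query)                            # (n, q) cross kernel matrix
    W = np.linalg.solve(K + n * lam * np.eye(n), k_q)   # (K + n*lam*I)^{-1} k(x)
    return (Y.T @ W).T                                  # row i approximates p(. | x_query[i])
```

Note that nothing in (5) forces these estimates to lie exactly within $[0, 1]$, which is one reason the clipped loss introduced in the next section is convenient.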

4. Hyperparameter Learning with Rademacher Complexity Bounds

Kernel embedding classifiers (4) are equivalent to a conditional embedding with a discrete target space $\mathcal{Y}$. Hyperparameter learning for conditional embeddings is particularly difficult compared to joint embeddings, since the kernel $k_\theta$ with parameters $\theta \in \Theta$ is to be learned jointly with a regularisation parameter $\lambda \in \Lambda = \mathbb{R}_+$. This implies that the notion of model complexity is especially important. To this end, we propose to use a learning theoretic approach to balance model complexity and data fit. The Rademacher complexity (Bartlett and Mendelson, 2002) measures the expressiveness of a function class $\mathcal{F}$ by its ability to shatter, or fit, noise. It is a data-dependent measure, and is thus particularly well suited to learning tasks where generalisation ability is vital, since complexity penalties that are not data dependent cannot be universally effective (Kearns et al., 1997). We begin by defining a loss function as a measure for performance. For decision functions of the form $f : \mathcal{X} \to \mathcal{A} = \mathbb{R}^m$ whose entries are probability estimates, we employ the cross entropy loss,

$L(y, f(x)) := -\log\, [\,y^T f(x)\,]_\epsilon^1 = -\log\, [\,f_y(x)\,]_\epsilon^1,$    (6)

to express classification risk, where we use the notation $[\,\cdot\,]_\epsilon^1 := \min\{\max\{\,\cdot\,, \epsilon\}, 1\}$ for $\epsilon \in (0, 1)$. Under this loss, we prove a bound on the expected risk.
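To illustrate, a small NumPy sketch of the clipped cross entropy loss (6) follows; the default value of eps is an arbitrary placeholder for $\epsilon \in (0, 1)$, not a value prescribed in the text.

```python
import numpy as np

def clipped_cross_entropy(y, f_x, eps=1e-15):
    """Cross entropy loss (6) with the clipping [.]_eps^1 := min(max(., eps), 1).

    y: integer label in {1, ..., m}; f_x: (m,) vector of decision probability estimates.
    """
    p = np.clip(f_x[y - 1], eps, 1.0)  # [f_y(x)]_eps^1 guards against log of non-positive estimates
    return -np.log(p)
```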


Algorithm 1 KEC Hyperparameter Learning with Batch Stochastic Gradient Updates
 1: Input: kernel family $k_\theta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, dataset $\{x_i, y_i\}_{i=1}^n$, initial kernel parameters $\theta_0$, initial regularisation parameter $\lambda_0$, learning rate $\eta$, tolerance, batch size $n_b$
 2: $\theta \leftarrow \theta_0$, $\lambda \leftarrow \lambda_0$
 3: repeat
 4:   Sample the next batch $I_b \subseteq \mathbb{N}_n$, $|I_b| = n_b$
 5:   $Y \leftarrow \{\delta(y_i, c) : i \in I_b, c \in \mathbb{N}_m\} \in \{0, 1\}^{n_b \times m}$
 6:   $K_\theta \leftarrow \{k_\theta(x_i, x_j) : i \in I_b, j \in I_b\} \in \mathbb{R}^{n_b \times n_b}$
 7:   $L_{\theta,\lambda} \leftarrow \mathrm{cholesky}(K_\theta + n_b \lambda I_{n_b})$
 8:   $V_{\theta,\lambda} \leftarrow L_{\theta,\lambda}^T \backslash (L_{\theta,\lambda} \backslash Y) \in \mathbb{R}^{n_b \times m}$
 9:   $P_{\theta,\lambda} \leftarrow K_\theta V_{\theta,\lambda} \in \mathbb{R}^{n_b \times m}$
10:   $q(\theta, \lambda) \leftarrow \frac{1}{n_b} \sum_{i=1}^{n_b} L\big((Y)_i, (P_{\theta,\lambda})_i\big) + 4e\, \alpha(\theta) \sqrt{\mathrm{trace}(V_{\theta,\lambda}^T K_\theta V_{\theta,\lambda})}$
11:   $\theta \leftarrow \theta - \eta \frac{\partial q}{\partial \theta}(\theta, \lambda)$, $\lambda \leftarrow \lambda - \eta \frac{\partial q}{\partial \lambda}(\theta, \lambda)$ (or other gradient based updates)

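The per-batch computation of Algorithm 1 can be sketched in NumPy/SciPy as follows. Here kernel, alpha, and eps stand in for the kernel family $k_\theta$, the constant $\alpha(\theta)$ appearing in step 10 (defined in the paper but not reproduced in this excerpt), and the clipping constant of (6); they are assumptions supplied by the caller, and in practice the gradients in step 11 would be obtained by automatic differentiation of this objective rather than written by hand.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kec_batch_objective(theta, lam, X_b, y_b, m, kernel, alpha, eps=1e-15):
    """One evaluation of the batch objective q(theta, lambda), following steps 5-10.

    kernel(theta, A, B) returns the Gram matrix of k_theta; alpha(theta) is the
    kernel-dependent constant from the Rademacher bound, assumed supplied by the caller.
    """
    n_b = X_b.shape[0]
    Y = (y_b[:, None] == np.arange(1, m + 1)[None, :]).astype(float)  # {0,1}^{n_b x m}
    K = kernel(theta, X_b, X_b)                                       # K_theta on the batch
    L = cho_factor(K + n_b * lam * np.eye(n_b), lower=True)           # Cholesky factorisation
    V = cho_solve(L, Y)                                               # V = (K + n_b*lam*I)^{-1} Y
    P = K @ V                                                         # batch decision probabilities
    p_true = np.clip(P[np.arange(n_b), y_b - 1], eps, 1.0)            # [f_{y_i}(x_i)]_eps^1
    risk = -np.mean(np.log(p_true))                                   # empirical cross entropy risk
    complexity = 4.0 * np.e * alpha(theta) * np.sqrt(np.trace(V.T @ K @ V))
    return risk + complexity
```

Each update in step 11 then moves $(\theta, \lambda)$ along the negative gradient of this objective, so each iteration only requires a Cholesky factorisation of an $n_b \times n_b$ matrix rather than of the full $n \times n$ Gram matrix.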