Prediction-time Efficient Classification Using Computational Dependencies in Feature Generation (Supplementary Materials)

This supplementary material provides the proof of Theorem 5.1 of the paper: Liang Zhao, Amir Alipour-Fanid, Martin Slawski, Kai Zeng. Prediction-time Efficient Classification Using Computational Dependencies in Feature Generation. KDD 2018, submitted. Theorem 5.1 and its proof are as follows.

Theorem 5.1 [Convergence Analysis] Given that the loss function L(·) is a Lipschitz-differentiable function (e.g., the logistic loss) and the function L(Ω1 · B) + λ · Rc(H Ω2 B, D) is coercive [1], the proposed ADMM-based method converges when optimizing Equation (10), provided the following conditions are satisfied: 1) the matrix H · Ω2 has full rank; 2) the number of FCCs is not larger than the number of features.

Proof. Equation (10) can be written as follows:

    min_{M, B ≥ 0}  ϕ(M, B) = f(B) + λ · h(M)
    s.t.  P · M + Q · B = 0                                                         (1)

where f(B) = L(Ω1 · B) is convex and Lipschitz-differentiable, and h(M) = Rc(M, D) is nonsmooth and nonconvex. In addition, Q = −H · Ω2 ∈ R^{|E|×2|V|} and P ∈ R^{|E|×|E|} is the identity matrix. The above problem is a nonconvex-regularized optimization objective with a linear equality constraint on two decision variables. For this type of problem, Wang et al. [6] provided sufficient conditions for the convergence of the Alternating Direction Method of Multipliers (ADMM) [2], which amount to the following five requirements:

(1) (Coercivity) Define the feasible set F := {(M, B) ∈ R^{n+q} : PM + QB = 0}. The objective function ϕ(M, B) is coercive over this set, namely, ϕ(M, B) → ∞ if (M, B) ∈ F and ∥(M, B)∥ → ∞.

(2) (Feasibility) Im(P) ⊆ Im(Q), where Im(·) denotes the image of a matrix.

(3) (Lipschitz sub-minimization paths) Three sub-requirements need to be satisfied: (i) Φ : Im(Q) → R^{|E|} defined by Φ(u) ≜ arg min_B {ϕ(M, B) : QB = u} is a Lipschitz-continuous map; (ii) Ψ : Im(P) → R^{|E|} defined by Ψ(u) ≜ arg min_M {ϕ(M, B) : PM = u} is a Lipschitz-continuous map; and (iii) Ψ and Φ have a positive universal Lipschitz constant.

(4) (Regularity of the regularization term) h(Mi) is either "continuous and piecewise linear" or "restricted prox-regular". Here restricted prox-regularity [6] is defined as follows. For a lower semi-continuous function F : R^N → R ∪ {∞} and m ∈ R+, define the exclusion set Sm := {x ∈ dom(F) : ∥d∥ > m for all d ∈ ∂F(x)/∂x}. Then F is called restricted prox-regular if, for any m > 0 and any bounded set T ⊆ dom(F), there exists θ > 0 such that

    F(y) + (θ/2) · ∥x − y∥² ≥ F(x) + ⟨d, y − x⟩,   ∀x ∈ T − Sm, y ∈ T, d ∈ ∂F(x)/∂x with ∥d∥ ≤ m.

(5) (Regularity of the loss function term) f(B) is Lipschitz-differentiable.
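For reference, the ADMM iterations whose convergence is analyzed here can be sketched in the standard augmented-Lagrangian form of [2] (this sketch is ours; the penalty parameter ρ > 0 and the dual variable u are notation introduced only for this sketch and may differ from the algorithm in the main paper):

    L_ρ(M, B, u) = f(B) + λ · h(M) + ⟨u, P·M + Q·B⟩ + (ρ/2) · ∥P·M + Q·B∥²

    M^{k+1} = arg min_{M ≥ 0} L_ρ(M, B^k, u^k)
    B^{k+1} = arg min_{B ≥ 0} L_ρ(M^{k+1}, B, u^k)
    u^{k+1} = u^k + ρ · (P·M^{k+1} + Q·B^{k+1})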
Since the first and fifth requirements are satisfied by our assumptions, in the following we prove that our problem satisfies the second, third, and fourth requirements.

Because Q = −H · Ω2 ∈ R^{|E|×2|V|} has full rank and 2|V| ≥ |E|, Requirement (3)(i) is satisfied. Similarly, because the matrix P is the identity matrix and thus has full rank, Requirement (3)(ii) is satisfied; hence the third requirement holds. Also, because the matrix Q has full rank and 2|V| ≥ |E|, we have Im(A_{|E|}) ⊆ Im(Q), where A_{|E|} ∈ R^{|E|×|E|} is the identity matrix. Since P = A_{|E|}, the second requirement is also satisfied.

In the following, we prove that Requirement (4) is satisfied for the different types of regularization terms Rc(M, D). Our algorithm uses the re-weighted nonconvex regularization term Rc(M, D), which can be based on commonly used nonconvex regularization terms such as MCP, SCAD, the capped ℓ1 norm, and the ℓp quasi-norm (0 < p < 1) [3].

When Rc(M, D) = Σ_i Rc′(Mi, Di) is the re-weighted capped ℓ1 norm, we have:

    Rc′(Mi, Di) = Di · min(|Mi|, γ),   (γ > 0)                                      (2)

where it is clear that Rc′(Mi, Di) is continuous and piecewise linear. Thus it satisfies Requirement (4).

When Rc(M, D) is the re-weighted MCP term, then Rc(M, D) = Σ_i Rc′(Mi, Di) is defined as follows:

    Rc′(Mi, Di) = { |Di · Mi| − Mi²/(2γ),   |Mi| ≤ γ Di
                  { (1/2) γ Di²,            |Mi| ≥ γ Di                             (3)
which is the maximum of a set of quadratic functions and thus can be proved to be prox-regular, as shown in Example 2.9 in [4].

In addition, when Rc(M, D) = Σ_i Rc′(Mi, Di) is the re-weighted SCAD, we define:

    Rc′(Mi, Di) = { |Di · Mi|,                            |Mi| ≤ Di
                  { (2γ|Di · Mi| − Mi² − Di²)/(2γ − 2),   Di < |Mi| ≤ γ Di
                  { (1/2)(γ + 1) · Di²,                   |Mi| > γ Di               (4)

which is again the maximum of a set of quadratic functions, so its prox-regularity follows readily from Example 2.9 in [4]. As proximal regularity is a sufficient condition for restricted proximal regularity (see [5] and Definition 1.1 in [4]), the re-weighted MCP and SCAD are therefore restricted prox-regular.

Now we prove that when Rc(M, D) is the re-weighted ℓp quasi-norm (0 < p < 1), i.e., Rc(M, D) = Σ_i Di|Mi|^p, Requirement (4) still holds. The proof extends the corresponding argument in [6].
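For concreteness, the following small Python sketch (ours and purely illustrative; the function names and all parameter values are arbitrary choices, not from the paper's implementation) evaluates the re-weighted penalties of Equations (2)-(4) and the re-weighted ℓp quasi-norm elementwise:

```python
import numpy as np

def capped_l1(m, d, gamma):
    # Re-weighted capped l1, Eq. (2): D_i * min(|M_i|, gamma), with gamma > 0.
    return d * np.minimum(np.abs(m), gamma)

def mcp(m, d, gamma):
    # Re-weighted MCP, Eq. (3): |D_i M_i| - M_i^2/(2 gamma) if |M_i| <= gamma D_i,
    # and gamma D_i^2 / 2 otherwise.
    return np.where(np.abs(m) <= gamma * d,
                    np.abs(d * m) - m ** 2 / (2 * gamma),
                    0.5 * gamma * d ** 2)

def scad(m, d, gamma):
    # Re-weighted SCAD, Eq. (4), with gamma > 1 so the middle piece is well defined.
    a = np.abs(m)
    mid = (2 * gamma * np.abs(d * m) - m ** 2 - d ** 2) / (2 * gamma - 2)
    return np.where(a <= d, np.abs(d * m),
                    np.where(a <= gamma * d, mid, 0.5 * (gamma + 1) * d ** 2))

def lp_quasi_norm(m, d, p):
    # Re-weighted l_p quasi-norm, 0 < p < 1: sum_i D_i |M_i|^p.
    return np.sum(d * np.abs(m) ** p)

M = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
D = np.array([1.0, 0.5, 2.0, 1.0, 0.8])
print(capped_l1(M, D, 1.0), mcp(M, D, 2.0), scad(M, D, 3.0), lp_quasi_norm(M, D, 0.5))
```

Each piecewise definition above is continuous at its breakpoints, which is easy to confirm numerically with such a sketch.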

First, given any two positive constants α > 0, β > 1, we define c ≜ (1/3) · (p/β)^{1/(1−p)}, and define the set Cβ = {x | min_{xi ≠ 0} |xi| ≤ 3c}. Then for any vectors z, y ∈ R^{|E|} such that z ∈ {x | ∥x∥ < α} − Cβ and y ∈ {x | ∥x∥ < α}, if ∥z − y∥ ≤ c, then supp(z) ⊂ supp(y), where supp(z) denotes the index set of all non-zero elements of z. Also define y′ elementwise by y′i = yi · I′(i), where I′(i) is the indicator function such that I′(i) = 1 if i ∈ supp(z) and I′(i) = 0 if i ∉ supp(z). Define d(z) = ∂Rc(z, D)/∂z and d′(z) = ∂R(z)/∂z, where R(z) = Σ_i |zi|^p, and hence for each element d(zi) = Di · d′(zi). Therefore, when ∥z − y∥ ≤ c, we have:

    Rc(y, D) − Rc(z, D) − ⟨d, y − z⟩                                                                        (5)
      = Σ_i Di|yi|^p − Σ_i Di|zi|^p − Σ_i di · (yi − zi)                                                    (6)
      ≥ Σ_i Di|y′i|^p − Σ_i Di|zi|^p − Σ_i Di · d′i · (y′i − zi)      (by the definition of y′ and d′)       (7)
      ≥ − Σ_i (p(1 − p)/2) · c^{p−2} · Di · (zi − y′i)²    (the second-order derivative of R(z) is no bigger than p(1 − p)c^{p−2} in magnitude)   (8)
      ≥ − Σ_i (p(1 − p)/2) · c^{p−2} · Di · (zi − yi)²                                                      (9)
      ≥ − (p(1 − p)/2) · |E| · Dmax · c^{p−2} · ∥z − y∥²    (Dmax = max_i Di)                               (10)

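As a quick numerical sanity check of the bound in (5)-(10) (this check is ours and purely illustrative; the constants p, c and the random construction are arbitrary choices, with the nonzero entries of z forced to have magnitude at least 3c so that z lies outside Cβ):

```python
import numpy as np

rng = np.random.default_rng(0)
p, c, n = 0.5, 0.1, 50            # n plays the role of |E|

D = rng.uniform(0.5, 2.0, size=n)                          # positive re-weighting coefficients
z = rng.uniform(3 * c, 1.0, size=n) * rng.choice([-1.0, 1.0], size=n)
z[rng.random(n) < 0.3] = 0.0                               # some exactly-zero entries
delta = rng.standard_normal(n)
y = z + c * delta / np.linalg.norm(delta)                  # so that ||y - z|| = c

Rc = lambda x: np.sum(D * np.abs(x) ** p)                  # re-weighted l_p quasi-norm
d = np.zeros(n)                                            # gradient of Rc at z; zero entries of z
nz = z != 0                                                # contribute a zero subgradient component
d[nz] = D[nz] * p * np.sign(z[nz]) * np.abs(z[nz]) ** (p - 1)

lhs = Rc(y) - Rc(z) - d @ (y - z)
rhs = -(p * (1 - p) / 2) * n * D.max() * c ** (p - 2) * np.linalg.norm(y - z) ** 2
print(lhs, rhs, lhs >= rhs)                                # the last value should be True
```
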
When ∥z − y∥ > c, we have:

    Rc(y, D) − Rc(z, D) − ⟨d, y − z⟩                                                                        (11)
      = Σ_i Di|yi|^p − Σ_i Di|zi|^p − Σ_i di · (yi − zi)                                                    (12)
      = Σ_i Di (|yi|^p − |zi|^p − d′i · (yi − zi))           (by the definition of y′ and d′)                (13)
      ≥ − Σ_i Di (2α^p + 2αβ)                                                                               (14)
      ≥ − ((2α^p + 2αβ)/c²) · (Σ_i Di) · ∥y − z∥²            (∥z − y∥ > c)                                  (15)
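To make the final step explicit (this bridging remark is ours and only sketches one way to combine the two cases): both cases bound Rc(y, D) − Rc(z, D) − ⟨d, y − z⟩ from below by a negative multiple of ∥y − z∥². Hence, for any bounded set T ⊆ {x | ∥x∥ < α}, any z ∈ T − Cβ, any y ∈ T, and any admissible d as above, the inequality required in Requirement (4),

    Rc(y, D) + (θ/2) · ∥z − y∥² ≥ Rc(z, D) + ⟨d, y − z⟩,

holds with, for example, θ = max( p(1 − p) · |E| · Dmax · c^{p−2},  2(2α^p + 2αβ) · (Σ_i Di) / c² ).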

Therefore, according to the definition of restricted proximal regularity given in Requirement (4), Requirement (4) is also satisfied when Rc is the re-weighted ℓp quasi-norm (0 < p < 1). In summary, when Rc(M, D) is the re-weighted version of commonly used nonconvex regularization terms that satisfy Requirement (4), the proposed ADMM-based algorithm converges under the aforementioned conditions. □

REFERENCES

[1] Aleksandr Y. Aravkin, James V. Burke, and Gianluigi Pillonetto. 2013. Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. The Journal of Machine Learning Research 14, 1 (2013), 2689-2728.
[2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1 (2011), 1-122.
[3] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. 2013. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning. 37-45.
[4] René Poliquin and R. Tyrrell Rockafellar. 1996. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc. 348, 5 (1996), 1805-1838.
[5] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems. 2152-2160.
[6] Yu Wang, Wotao Yin, and Jinshan Zeng. 2015. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324 (2015).
