Science in China Series A: Mathematics Nov., 2009, Vol. 52, No. 11, 2506–2516 www.scichina.com math.scichina.com www.springer.com/scp www.springerlink.com
Generalization performance of graph-based semi-supervised classification

CHEN Hong 1,2 & LI LuoQing 2†

1 College of Science, Huazhong Agricultural University, Wuhan 430070, China
2 Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China
(email: [email protected], [email protected])

Received December 21, 2007; accepted February 10, 2009
DOI: 10.1007/s11425-009-0190-8
† Corresponding author
This work was supported by the National Natural Science Foundation of China (Grant No. 10771053), the Specialized Research Foundation for the Doctoral Program of Higher Education of China (SRFDP) (Grant No. 20060512001), and the Natural Science Foundation of Hubei Province (Grant No. 2007ABA139).
Abstract
Semi-supervised learning has been of growing interest over the past few years, and many methods have been proposed. Although various algorithms have been developed to implement semi-supervised learning, there are still gaps in our understanding of how the generalization error depends on the numbers of labeled and unlabeled data. In this paper, we consider a graph-based semi-supervised classification algorithm and establish its generalization error bounds. Our results show the close relation between the generalization performance and the structural invariants of the data graph.
Keywords: semi-supervised learning, generalization error, graph Laplacian, graph cut, localized envelope
MSC(2000): 68T05, 62J02

1 Introduction
Semi-supervised learning, i.e., learning from labeled and unlabeled data, has attracted considerable attention in recent years. The key challenge of semi-supervised learning is how to improve the generalization performance by utilizing unlabeled data together with labeled data. Under various assumptions, a number of semi-supervised learning algorithms have been proposed, including margin-based methods (see [1, 2]), co-training [3], and various graph-based methods (see [4–8]); see [9] for a survey of semi-supervised learning.

Despite the active development of semi-supervised algorithms, only recently have theoretical studies tried to understand why they work. Under the cluster assumption, generalization error bounds for semi-supervised classification are provided in [10]. Johnson and Zhang [11, 12] present generalization error estimates for graph-based transductive learning. In [4], the error analysis of graph-based learning is established based on algorithmic stability. In [1], the generalization error of a semi-supervised margin-based method is estimated by using grouping information from unlabeled data. However, despite this remarkable progress on the analysis of semi-supervised learning algorithms, there are still significant gaps in their theoretical foundations.

Inspired by the error analysis in [13], we consider the generalization performance of a new graph-based semi-supervised classification algorithm in this paper.
We try to create a new model that brings together three distinct concepts which have each received independent attention recently in machine learning: regularization learning on graphs (see [4, 5, 14]), error analysis in reproducing kernel Hilbert spaces (RKHS) (see [15, 16]), and concentration inequalities for empirical processes (see [17, 18]). We show how these ideas can be brought together in a coherent and natural way to study the performance of the graph-based semi-supervised algorithm.

The points below highlight several new features of the current paper. Firstly, we design the graph-based semi-supervised algorithm without any constraints on class proportions (see [19]). It has long been noticed that constraining the class proportions on unlabeled data can be important for semi-supervised learning. Compared with the transductive learning in [11, 12], the new model is suitable for dealing with more general classification problems. Secondly, we introduce a novel graph cut which can be considered as a natural extension of previous definitions in [11, 20]. Furthermore, the relation between the graph cut and the regularization error is presented. Finally, we use the method of slicing (or peeling) in [17] to measure the complexity of the hypothesis space, instead of the covering numbers in [21, 22] and the VC dimensions in [1].
2 Graph-based semi-supervised classification
Suppose that $X = \{v_1, \ldots, v_n\}$ and $Y = \{-1, 1\}$. There is a probability distribution $\rho$ on $Z := X \times Y$ according to which labeled examples $(x, y)$ are generated. Let $N_n := \{1, \ldots, n\}$. Consider an undirected weighted graph $G = (V, E)$, where $V = X$ is the vertex set with edges $E \subset N_n \times N_n$ and weights $w_{ij}$ associated to edges $(i, j) \in E$. For simplicity, we assume that the weights satisfy $w_{ii'} > 0$ when $(i, i') \in E$, and $w_{ii'} = 0$ otherwise. We denote by $W = (w_{ij})_{i,j=1}^{n}$ the adjacency matrix. The graph Laplacian $L$ is the $n \times n$ matrix defined as $L := D - W$, where $D = \mathrm{diag}(d_i, i \in N_n)$ and $d_i = \sum_{j=1}^{n} w_{ij}$. Let $S = \mathrm{diag}(s_i, i \in N_n)$, where $s_i$ is a scaling factor. The $S$-normalized Laplacian matrix is defined as $L_S = S^{-1/2} L S^{-1/2}$. A common choice is $S = I$, corresponding to the unnormalized Laplacian $L$. Another common choice is to normalize with $s_i = d_i$, as in [20].

Our goal is to realize better prediction by exploiting the knowledge of the structure of the data graph. Of course, if there is no identifiable relation between the vertices and the conditional distribution $\rho(y|x)$, the information of the graph structure is unlikely to be of much use. We will assume that the weights indicate the affinity of nodes with respect to each other and consequently are related to the potential similarity of the $y$ values these nodes are likely to have (see [7, 8]).

As illustrated in [4, 14], graph Laplacian regularization, viewed from the perspective of a graph smoothing functional, coincides with kernel regularization in an RKHS. However, as pointed out in [19], the algorithms considered in [4, 14] impose unnecessary constraints on the labels of the vertices. In order to remedy this deficiency, we introduce the following graph-based quantity
$$\tilde{L}_S = \alpha I + L_S, \qquad (1)$$
where $I$ is the identity matrix. Note that $\alpha > 0$ is a tuning parameter to ensure that $\tilde{L}_S$ is strictly positive. Another choice is to set $\tilde{L}_S = \alpha S^{-1} + L_S = S^{-1/2}(\alpha I + L)S^{-1/2}$, which we will not refer to in this paper.
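To make the matrices above concrete, here is a minimal NumPy sketch; the weight matrix `W`, the choice $s_i = d_i$ and the value of $\alpha$ are hypothetical toy inputs, not taken from the paper. It builds $L = D - W$, the $S$-normalized Laplacian $L_S = S^{-1/2}LS^{-1/2}$ and the regularized matrix $\tilde{L}_S = \alpha I + L_S$ of (1).

```python
import numpy as np

# Hypothetical symmetric weight matrix W on n = 4 vertices.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = W.sum(axis=1)                          # degrees d_i = sum_j w_ij
L = np.diag(d) - W                         # graph Laplacian L = D - W

# Scaling S: s_i = d_i gives the degree-normalized Laplacian (S = I gives L itself).
S_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_S = S_inv_sqrt @ L @ S_inv_sqrt          # S-normalized Laplacian

alpha = 0.1                                # tuning parameter alpha > 0
L_S_tilde = alpha * np.eye(len(d)) + L_S   # regularized matrix of (1)

# L_S is positive semi-definite, so the smallest eigenvalue of L_S_tilde is at least alpha.
print(np.linalg.eigvalsh(L_S_tilde).min())
```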
Let $\mathcal{H}(G) = \{f : V \to \mathbb{R}\}$ be a real-valued function space defined on $G$. For $f \in \mathcal{H}(G)$, we denote $\mathbf{f} = (f(v_1), \ldots, f(v_n))^T$. The inner product on $\mathcal{H}(G)$ is defined as
$$\langle f, g\rangle_{\mathcal{H}(G)} := \mathbf{f}^T \tilde{L}_S \mathbf{g}, \qquad f, g \in \mathcal{H}(G),$$
where $T$ denotes transposition. Moreover, $\|f\|_{\mathcal{H}(G)} := \sqrt{\langle f, f\rangle_{\mathcal{H}(G)}}$, $f \in \mathcal{H}(G)$, is a norm. This fact can be easily verified by noting that
$$\|f\|_{\mathcal{H}(G)}^2 = \mathbf{f}^T \tilde{L}_S \mathbf{f} = \alpha \sum_{i=1}^{n} \frac{f^2(v_i)}{s_i} + \sum_{(i,j)\in E} \frac{w_{ij}}{2}\bigg(\frac{f(v_i)}{\sqrt{s_i}} - \frac{f(v_j)}{\sqrt{s_j}}\bigg)^2.$$
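As a quick numerical sanity check of the smoothness functional, the following sketch verifies the identity above in the unnormalized case $S = I$ (so every $s_i = 1$); the weight matrix, $\alpha$ and the function values are again hypothetical.

```python
import numpy as np

# Unnormalized case S = I (every s_i = 1); W, alpha and f are hypothetical.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
alpha = 0.1
L_tilde = alpha * np.eye(4) + L

f = np.array([1.0, -0.5, 2.0, 0.3])        # arbitrary function values on the vertices

lhs = f @ L_tilde @ f                      # ||f||^2_{H(G)} as a quadratic form
rhs = alpha * np.sum(f**2) + 0.5 * np.sum(W * (f[:, None] - f[None, :])**2)
print(np.isclose(lhs, rhs))                # True
```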
The norm $\|\cdot\|_{\mathcal{H}(G)}$ is a measure of smoothness (complexity) on the graph: if $\|f\|_{\mathcal{H}(G)}$ is small, then $f$ varies slowly on the graph. Given a set of random labeled data $\mathbf{z} := \{z_i\}_{i=1}^{l} = \{(x_i, y_i)\}_{i=1}^{l} \in Z^l$ independently drawn according to $\rho$, the graph-based learning algorithm outputs the function $f_{\mathbf{z}}$, where $f_{\mathbf{z}}$ is a minimizer of the following optimization problem (see [4, 11, 14]):
$$f_{\mathbf{z}} = \arg\min_{f\in\mathcal{H}(G)}\bigg\{\frac{1}{l}\sum_{i=1}^{l}(1 - y_i f(x_i))_+ + \gamma\,\mathbf{f}^T\tilde{L}_S\mathbf{f}\bigg\}. \qquad (2)$$
Here $\gamma = \gamma(l) > 0$ is a regularization parameter which tends to zero as $l \to \infty$. Let $\{\lambda_i, u_i\}_{i=1}^{n}$ be a system of eigenvalues/eigenvectors of $L_S$, with the eigenvalues in non-decreasing order, and let
$$K = (\tilde{L}_S)^{-1} = \sum_{i=1}^{n}(\lambda_i + \alpha)^{-1} u_i u_i^T.$$
Thus, if $f \in \mathcal{H}(G)$ then $(\tilde{L}_S)^{-1}\tilde{L}_S\mathbf{f} = \mathbf{f}$, which implies that
$$f(v_i) = e_i(\tilde{L}_S)^{-1}\tilde{L}_S\mathbf{f} = \langle K(v_i, \cdot), f\rangle_{\mathcal{H}(G)},$$
where $e_i$ is the $i$-th coordinate vector. Hence $K(v_i, v_j) = ((\tilde{L}_S)^{-1})_{i,j}$, $i, j \in N_n$, is a reproducing kernel. The reproducing kernel Hilbert space $\mathcal{H}_K$ associated with the kernel $K$ is defined (see [23]) to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in V\}$ with the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_K} = \langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x, K_y\rangle_K = K(x, y)$. We can rewrite (2) as
$$f_{\mathbf{z}} = \arg\min_{f\in\mathcal{H}_K}\bigg\{\frac{1}{l}\sum_{i=1}^{l}(1 - y_i f(x_i))_+ + \gamma\|f\|_K^2\bigg\}, \qquad (3)$$
where $\mathcal{H}_K = \mathcal{H}(G)$ (see [14] for details).
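A small illustration of how (2)–(3) can be evaluated in practice: the sketch below uses a toy graph with two hypothetical labeled vertices, and plain subgradient descent is used only for illustration (it is not an algorithm proposed in the paper). It forms $K = \sum_i(\lambda_i+\alpha)^{-1}u_iu_i^T$ from the eigendecomposition of $L_S$, checks that it inverts $\tilde{L}_S$, and minimizes the regularized hinge loss over $\mathbf{f}\in\mathbb{R}^n$.

```python
import numpy as np

# Hypothetical toy graph: 4 vertices, degree-normalized Laplacian (s_i = d_i).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = W.sum(axis=1)
L_S = np.diag(1 / np.sqrt(d)) @ (np.diag(d) - W) @ np.diag(1 / np.sqrt(d))
alpha, gamma = 0.1, 0.05
L_tilde = alpha * np.eye(4) + L_S

# Kernel K = sum_i (lambda_i + alpha)^{-1} u_i u_i^T, i.e. the inverse of L_tilde.
lam, U = np.linalg.eigh(L_S)                 # eigenvalues in non-decreasing order
K = U @ np.diag(1.0 / (lam + alpha)) @ U.T
assert np.allclose(K, np.linalg.inv(L_tilde))

# Two hypothetical labeled vertices with labels in {-1, +1}.
labeled_idx = np.array([0, 3])
y = np.array([1.0, -1.0])
l = len(y)

# Subgradient descent on (1/l) sum_i (1 - y_i f(x_i))_+ + gamma f^T L_tilde f.
f = np.zeros(4)
for t in range(2000):
    step = 1.0 / (t + 1)
    grad = 2 * gamma * (L_tilde @ f)         # gradient of the quadratic penalty
    active = y * f[labeled_idx] < 1          # examples where the hinge loss is active
    grad[labeled_idx[active]] -= y[active] / l
    f -= step * grad

print(np.where(f >= 0, 1, -1))               # sgn(f_z) on all vertices
```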
The misclassification error for a classifier $f : X \to Y$ is defined as
$$R(f) := \int_Z I(y \neq f(x))\, d\rho,$$
where $I(\cdot)$ is the indicator function: it takes the value 1 if its argument is true, and 0 otherwise. Recall that the regression function is given by
$$f_\rho(x) = \int_Y y\, d\rho(y|x), \qquad x \in X,$$
and the Bayes classifier is $f_c = \mathrm{sgn}(f_\rho)$, where the signum function is defined as $\mathrm{sgn}(f)(x) = 1$ if $f(x) \geq 0$ and $\mathrm{sgn}(f)(x) = -1$ otherwise. It is well known that $f_c$ is the minimizer of $R(f)$. From now on we focus on estimating the excess misclassification error $R(\mathrm{sgn}(f_{\mathbf{z}})) - R(f_c)$ for the graph-based regularized method (3).
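For a binary label set $Y = \{-1, 1\}$, the regression function reduces to $f_\rho(x) = \rho(1|x) - \rho(-1|x)$, so the Bayes classifier can be computed directly from the conditional probabilities; a tiny sketch with made-up probabilities:

```python
import numpy as np

def sgn(f):
    # Signum convention of the paper: sgn(t) = 1 if t >= 0, and -1 otherwise.
    return np.where(f >= 0, 1, -1)

# Hypothetical conditional probabilities rho(y = 1 | x) on four vertices.
p_plus = np.array([0.9, 0.6, 0.3, 0.1])
f_rho = 2 * p_plus - 1        # f_rho(x) = rho(1|x) - rho(-1|x)
f_c = sgn(f_rho)              # Bayes classifier
print(f_rho, f_c)             # [ 0.8  0.2 -0.4 -0.8] [ 1  1 -1 -1]
```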
3 Regularization error and error decomposition
Denote by $\ell(f, z) = (1 - yf(x))_+$ the hinge loss. We define the empirical risk of $f$ with the hinge loss as $E_{\mathbf{z}}(f) = \frac{1}{l}\sum_{i=1}^{l}\ell(f, z_i)$ and the expected risk as
$$E(f) = \int_{V\times Y}\ell(f, z)\, d\rho.$$
Let
$$f_\gamma := \arg\min_{f\in\mathcal{H}(G)}\big\{E(f) + \gamma\,\mathbf{f}^T\tilde{L}_S\mathbf{f}\big\} = \arg\min_{f\in\mathcal{H}_K}\big\{E(f) + \gamma\|f\|_K^2\big\} \qquad (4)$$
and define the regularization error of scheme (2) as
$$D(\gamma) := \inf_{f\in\mathcal{H}(G)}\big\{E(f) - E(f_c) + \gamma\,\mathbf{f}^T\tilde{L}_S\mathbf{f}\big\} = \inf_{f\in\mathcal{H}_K}\big\{E(f) - E(f_c) + \gamma\|f\|_K^2\big\}. \qquad (5)$$
Inspired by the definitions of graph cut in [11, 20], we introduce a new learning-theoretic definition.

Definition 3.1.  Given a distribution $\rho$ on $Z$, we define the cut for the $S$-normalized Laplacian $L_S$ as
$$\mathrm{cut}(L_S, f_c) = \sum_{i,j=1}^{n}\frac{w_{ij}}{2}\bigg(\frac{f_c(v_i)}{\sqrt{s_i}} - \frac{f_c(v_j)}{\sqrt{s_j}}\bigg)^2,$$
where $\mathbf{f}_c := (f_c(v_i), i \in N_n)$.

Note that in the transductive learning case, our definition corresponds to the definition in [11]. Furthermore, for the unnormalized Laplacian, the learning-theoretic definition reduces to the definition of graph cut in [20]. Since $f_c \in \mathcal{H}(G)$, we have
$$D(\gamma) \leq \gamma\,\mathbf{f}_c^T\tilde{L}_S\mathbf{f}_c \leq \gamma\bigg(\mathrm{cut}(L_S, f_c) + \alpha\sum_{j=1}^{n}\frac{f_c^2(v_j)}{s_j}\bigg) \leq \gamma\bigg(\mathrm{cut}(L_S, f_c) + \alpha\sum_{j=1}^{n}s_j^{-1}\bigg).$$
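For concreteness, the graph cut of Definition 3.1 and the resulting upper bound on $D(\gamma)$ can be evaluated directly; in the sketch below the weight matrix, the scaling $s_i = d_i$, the $\pm 1$ labeling $f_c$ and the values of $\alpha, \gamma$ are all hypothetical.

```python
import numpy as np

# Hypothetical weight matrix, scaling s_i = d_i, and a +/-1 labeling f_c of the vertices.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
s = W.sum(axis=1)
f_c = np.array([1., 1., -1., -1.])

# cut(L_S, f_c) = sum_{i,j} (w_ij / 2) (f_c(v_i)/sqrt(s_i) - f_c(v_j)/sqrt(s_j))^2
g = f_c / np.sqrt(s)
cut = 0.5 * np.sum(W * (g[:, None] - g[None, :])**2)

alpha, gamma = 0.1, 0.05
upper_bound = gamma * (cut + alpha * np.sum(1.0 / s))   # upper bound on D(gamma)
print(cut, upper_bound)
```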
We introduce the projection operator used in [15] as follows.

Definition 3.2.  The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1; \\ f(x), & \text{if } -1 \leq f(x) \leq 1; \\ -1, & \text{if } f(x) < -1. \end{cases}$$
It is easy to see that $\pi(f)$ and $f$ induce the same classifier, so it is sufficient to bound the excess misclassification error for $\pi(f_{\mathbf{z}})$ instead of $f_{\mathbf{z}}$; this leads to better estimates. It is known that for any $f$ the following holds (see [15, 26]):
$$R(\mathrm{sgn}(f)) - R(f_c) \leq E(\pi(f)) - E(f_c).$$
Thus, it suffices to estimate the excess misclassification error $R(\mathrm{sgn}(f_{\mathbf{z}})) - R(f_c)$ by bounding $E(\pi(f_{\mathbf{z}})) - E(f_c)$. We now give bounds for $f_{\mathbf{z}}$ and $f_\gamma$.

Lemma 3.3.  Let $\kappa := \sup_{1\leq i\leq n}\sqrt{K(v_i, v_i)} = \sup_{1\leq i\leq n}\sqrt{\big((\tilde{L}_S)^{-1}\big)_{ii}}$. Then we have the following estimates:
(a) $\|f_{\mathbf{z}}\|_K \leq 1/\sqrt{\gamma}$ and $\|f_{\mathbf{z}}\|_\infty \leq \min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}/\sqrt{\gamma}$;
(b) $\|f_\gamma\|_K \leq \sqrt{D(\gamma)/\gamma}$ and $\|f_\gamma\|_\infty \leq \min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}\sqrt{D(\gamma)/\gamma}$;
(c) $\|\pi(f_{\mathbf{z}})\|_\infty \leq \min\{1, \kappa/\sqrt{\gamma}, 1/\sqrt{\gamma(\alpha+\lambda_1)}\}$.

Proof.  We can verify that $\|f_{\mathbf{z}}\|_\infty \leq 1/\sqrt{(\alpha+\lambda_1)\gamma}$ by using Proposition 1 in [4]. Based on (3) and the definition of $f_{\mathbf{z}}$, we have $\|f_{\mathbf{z}}\|_K \leq \sqrt{E_{\mathbf{z}}(0)/\gamma} \leq 1/\sqrt{\gamma}$. Note that $\|f_{\mathbf{z}}\|_\infty \leq \kappa\|f_{\mathbf{z}}\|_K$. Therefore, we have $\|f_{\mathbf{z}}\|_\infty \leq \min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}/\sqrt{\gamma}$, which proves (a).

We now turn to (b). According to the definition of $f_\gamma$,
$$\|f_\gamma\|_K \leq \sqrt{\big(E(f_\gamma) - E(f_c) + \gamma\|f_\gamma\|_K^2\big)/\gamma} = \sqrt{D(\gamma)/\gamma}.$$
Thus, $\|f_\gamma\|_\infty \leq \kappa\sqrt{D(\gamma)/\gamma}$. Moreover, for all $f \in \mathcal{H}(G)$, by the Rayleigh–Ritz theorem in [24] we have the estimates
$$(\alpha+\lambda_1)\|f\|_\infty^2 \leq (\alpha+\lambda_1)\,\mathbf{f}^T\mathbf{f} \leq \mathbf{f}^T\tilde{L}_S\mathbf{f}$$
and
$$\gamma\,\mathbf{f}_\gamma^T\tilde{L}_S\mathbf{f}_\gamma \leq E(f_\gamma) - E(f_c) + \gamma\,\mathbf{f}_\gamma^T\tilde{L}_S\mathbf{f}_\gamma = D(\gamma).$$
Combining all the inequalities above, we derive the desired result.
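For completeness, the sup-norm bound $\|f\|_\infty \leq \kappa\|f\|_K$ used in the proof above is a direct consequence of the reproducing property and the Cauchy–Schwarz inequality:
$$|f(v_i)| = |\langle K(v_i, \cdot), f\rangle_K| \leq \sqrt{\langle K(v_i, \cdot), K(v_i, \cdot)\rangle_K}\,\|f\|_K = \sqrt{K(v_i, v_i)}\,\|f\|_K \leq \kappa\|f\|_K, \qquad i \in N_n.$$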
Proposition 3.4.  Let $f_{\mathbf{z}}$ and $f_\gamma$ be defined by (2) and (4), respectively. Then we have
$$E(\pi(f_{\mathbf{z}})) - E(f_c) \leq \bigg\{E(\xi_1) - \frac{1}{l}\sum_{i=1}^{l}\xi_1(z_i)\bigg\} + \bigg\{\frac{1}{l}\sum_{i=1}^{l}\xi_2(z_i) - E(\xi_2)\bigg\} + D(\gamma), \qquad (6)$$
where $\xi_1 := \ell(\pi(f_{\mathbf{z}}), z) - \ell(f_c, z)$ and $\xi_2 := \ell(f_\gamma, z) - \ell(f_c, z)$.

Proof.  We decompose the difference $E(\pi(f_{\mathbf{z}})) - E(f_c)$ as follows:
$$E(\pi(f_{\mathbf{z}})) - E(f_c) = \{E(\pi(f_{\mathbf{z}})) - E_{\mathbf{z}}(\pi(f_{\mathbf{z}}))\} + \{E_{\mathbf{z}}(f_\gamma) - E(f_\gamma)\} + \{E(f_\gamma) - E(f_c) + \gamma\,\mathbf{f}_\gamma^T\tilde{L}_S\mathbf{f}_\gamma\} + \big\{E_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \gamma\,\mathbf{f}_{\mathbf{z}}^T\tilde{L}_S\mathbf{f}_{\mathbf{z}} - \big[E_{\mathbf{z}}(f_\gamma) + \gamma\,\mathbf{f}_\gamma^T\tilde{L}_S\mathbf{f}_\gamma\big]\big\} - \gamma\,\mathbf{f}_{\mathbf{z}}^T\tilde{L}_S\mathbf{f}_{\mathbf{z}}.$$
By the definition of $f_{\mathbf{z}}$, the fourth term is non-positive, and by the definition of $f_\gamma$ the third term equals $D(\gamma)$. Then the conclusion follows.
4 Error bounds based on localized envelopes
In this section, we consider generalization error bounds for the graph-based learning algorithm by means of concentration inequalities for empirical processes (see [17, 18]). We introduce some notation that will be used in this section. Let $\mathbf{z} = \{z_i\}_{i=1}^{l} \in Z^l$ be a sequence of independent random
variables with probability law $\rho$, and let $\mathcal{G}$ be a class of Borel measurable functions defined on $Z$. We define $P_l g$ and $Pg$ as
$$P_l g := \frac{1}{l}\sum_{i=1}^{l} g(z_i), \qquad Pg := E\bigg(\frac{1}{l}\sum_{i=1}^{l} g(z_i)\bigg) = \int_Z g(z)\, d\rho.$$
We also use the following notation in the rest of this section: $\sigma_P^2(g) \geq \mathrm{Var}_P(g)$ and $\sigma_P := \sup_{g\in\mathcal{G}}\sigma_P(g)$ (usually $\sigma_P(g)$ will be either the standard deviation of $g$ or $\sqrt{Pg}$). Given $0 < \tilde{r} \leq 1$ and $\tilde{r} < \tilde{\delta} \leq 1$, we set
$$\mathcal{G}(\tilde{r}) = \{g \in \mathcal{G} : \sigma_P(g) \leq \tilde{r}\}, \qquad \mathcal{G}(\tilde{r}, \tilde{\delta}] = \mathcal{G}(\tilde{\delta})\setminus\mathcal{G}(\tilde{r}).$$
For $1 < q \leq 2$ and $\tilde{r} < \tilde{\delta} \leq \tilde{r}q^k$ for some $k \in \mathbb{N}$, we let $\rho_j := \tilde{r}q^j$, $j = 0, \ldots, k$, and
$$\psi_{l,q}(t) := E\|P_l - P\|_{\mathcal{G}(\rho_{j-1},\rho_j]}, \qquad t \in (\rho_{j-1}, \rho_j], \quad j = 1, \ldots, k,$$
where $\|\psi\|_{\mathcal{G}} := \sup_{g\in\mathcal{G}}|\psi(g)|$. For given $\tilde{\delta}$ and $\tilde{r}$, we set $k = \lceil\log_q(\tilde{\delta}/\tilde{r})\rceil$, which is the smallest integer larger than or equal to $\log_q(\tilde{\delta}/\tilde{r})$. We also set
$$\bar{\psi}_{l,q} := \sup_{t\in(\tilde{r},\tilde{\delta}]}\psi_{l,q}(t) = \max_{1\leq j\leq k}\psi_{l,q}(\rho_j)$$
and
$$V_{l,q}(\rho_j) = V_l(\rho_j) := E\bigg\|\frac{1}{l}\sum_{i=1}^{l}(g(z_i) - Pg)^2\bigg\|_{\mathcal{G}(\rho_{j-1},\rho_j]}.$$
Finally, we let $\zeta^{-1}(t) := t\log(1+t)$, $0 \leq t \leq 1$, and for any $\bar{V}_{l,q}(\rho_j) \geq V_{l,q}(\rho_j)$, denote
$$\tau_{l,q} = \tau_{l,q}(u_1, \ldots, u_k) := \Bigg(\max_{j:\,u_j>2l\bar{V}_{l,q}(\rho_j)}\frac{2u_j}{l\log\big(u_j/\bar{V}_{l,q}(\rho_j)\big)}\Bigg) \vee \Bigg(\max_{j:\,u_j\leq 2l\bar{V}_{l,q}(\rho_j)}\sqrt{\frac{2u_j\bar{V}_{l,q}(\rho_j)}{l}}\Bigg),$$
where $u_j$ is any sequence of positive numbers and $a \vee b := \max\{a, b\}$.

Lemma 4.1.  With the above definitions, there exists a universal constant $C \in (0, \infty)$ such that for any sequence $u_j$ of positive numbers
$$\mathrm{Prob}_{\mathbf{z}\in Z^l}\bigg\{\sup_{g\in\mathcal{G},\,\tilde{r}<\sigma_P(g)\leq\tilde{\delta}}\big(|P_l g - Pg| - \bar{\psi}_{l,q}\big) \geq \tau_{l,q}\bigg\} \leq C\sum_{j=1}^{k}\exp\{-u_j/C\}.$$

Lemma 4.2.  With the previous notation, write $g_f := \ell(\pi(f), z) - \ell(f_c, z)$ and let $\tilde{r} < \sigma_P(g_f) \leq \tilde{\delta}$. Then there exists a universal constant $C \in (0, \infty)$ such that for any positive sequence $u_j$,
$$E(\pi(f)) - E(f_c) - \big(E_{\mathbf{z}}(\pi(f)) - E_{\mathbf{z}}(f_c)\big) \leq \Big(1 + \min\Big\{1, \frac{\kappa}{\sqrt{\gamma}}, \frac{1}{\sqrt{\gamma(\alpha+\lambda_1)}}\Big\}\Big)\big(\bar{\psi}_{l,q} + \tau_{l,q}\big)$$
with confidence at least $1 - C\sum_{j=1}^{k}\exp\{-u_j/C\}$.
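To illustrate the slicing (peeling) construction behind Lemmas 4.1 and 4.2, the following sketch builds the grid $\rho_j = \tilde{r}q^j$ and the index $k = \lceil\log_q(\tilde{\delta}/\tilde{r})\rceil$ that determine the slices $\mathcal{G}(\rho_{j-1}, \rho_j]$; the values of $\tilde{r}$, $\tilde{\delta}$ and $q$ are hypothetical.

```python
import math

# Hypothetical slicing parameters.
r_tilde, delta_tilde, q = 0.01, 1.0, 2.0

# k = ceil(log_q(delta_tilde / r_tilde)): smallest k with delta_tilde <= r_tilde * q^k.
k = math.ceil(math.log(delta_tilde / r_tilde, q))
rho = [r_tilde * q**j for j in range(k + 1)]        # rho_0 = r_tilde, ..., rho_k

# The j-th slice G(rho_{j-1}, rho_j] collects the g with rho_{j-1} < sigma_P(g) <= rho_j.
slices = [(rho[j - 1], rho[j]) for j in range(1, k + 1)]
print(k, slices[:3])    # 7 [(0.01, 0.02), (0.02, 0.04), (0.04, 0.08)]
```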
Based on the Bernstein inequality, we can derive the following error estimates.

Lemma 4.3.  Let $\gamma, \alpha, \lambda_1$ be defined as above. Then for every $0 < \delta < 1$, with confidence at least $1 - \delta$, we have
$$\frac{1}{l}\sum_{i=1}^{l}\xi_2(z_i) - E(\xi_2) \leq \frac{21\min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}\log(2/\delta)}{6l}\sqrt{\frac{D(\gamma)}{\gamma}} + D(\gamma).$$

Proof.  Denote $\xi_{21} = \ell(f_\gamma, z) - \ell(\pi(f_\gamma), z)$ and $\xi_{22} = \ell(\pi(f_\gamma), z) - \ell(f_c, z)$. Then we have $\xi_2 = \xi_{21} + \xi_{22}$.

We first bound $\xi_{21}$. According to Lemma 3.3,
$$0 \leq \xi_{21} \leq \tilde{B} := 2\min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}\sqrt{D(\gamma)/\gamma}.$$
Hence $|\xi_{21} - E\xi_{21}| \leq \tilde{B}$ and $\sigma^2(\xi_{21}) \leq E(\xi_{21}^2) \leq \tilde{B}E\xi_{21}$. For all $\varepsilon > 0$, the one-sided Bernstein inequality yields
$$\mathrm{Prob}_{\mathbf{z}\in Z^l}\bigg\{\frac{1}{l}\sum_{i=1}^{l}\xi_{21}(z_i) - E(\xi_{21}) > \varepsilon\bigg\} \leq \exp\bigg\{-\frac{l\varepsilon^2}{2\big(\sigma^2(\xi_{21}) + \frac{1}{3}\tilde{B}\varepsilon\big)}\bigg\}.$$
Choose $\varepsilon^*$ to be the unique positive solution of the quadratic equation
$$\frac{l\varepsilon^2}{2\big(\sigma^2(\xi_{21}) + \frac{1}{3}\tilde{B}\varepsilon\big)} = \log(1/\delta).$$
Then with confidence $1 - \delta$, there holds $\frac{1}{l}\sum_{i=1}^{l}\xi_{21}(z_i) - E(\xi_{21}) \leq \varepsilon^*$. It is easy to verify that
$$\varepsilon^* \leq \frac{2\tilde{B}\log(1/\delta)}{3l} + \sqrt{\frac{2\sigma^2(\xi_{21})\log(1/\delta)}{l}} \leq \frac{7\tilde{B}\log(1/\delta)}{6l} + E\xi_{21},$$
where the last step uses $\sigma^2(\xi_{21}) \leq \tilde{B}E\xi_{21}$ together with $\sqrt{ab} \leq (a+b)/2$. In the same way, we can obtain that with confidence $1 - \delta$ the inequality
$$\frac{1}{l}\sum_{i=1}^{l}\xi_{22}(z_i) - E(\xi_{22}) \leq \frac{7\tilde{B}\log(1/\delta)}{12l} + E\xi_{22}$$
holds true. Noticing that $E\xi_{21} + E\xi_{22} \leq D(\gamma)$, we get the desired result.

Theorem 4.4.  With the previous notation, set $\sigma_P(g) = \sqrt{Pg}$ and $\tilde{\delta} = 1$. Then for any given $\tilde{\alpha}, u > 0$ and all $0 < \delta < 1$, there exists a universal constant $C \in (0, \infty)$ such that, with confidence at least $1 - C(q^{\tilde{\alpha}} - 1)^{-1}\exp\{-u/C\} - \delta$, there holds
$$R(\mathrm{sgn}(f_{\mathbf{z}})) - R(f_c) \leq 2\tilde{r}^2 + \inf_{0<\gamma\leq 1}\bigg\{\Big(1 + \min\Big\{1, \frac{\kappa}{\sqrt{\gamma}}, \frac{1}{\sqrt{\gamma(\alpha+\lambda_1)}}\Big\}\Big)\big(\bar{\psi}_{l,q} + \tau_{l,q}\big) + \frac{21\min\{\kappa, 1/\sqrt{\alpha+\lambda_1}\}\log(2/\delta)}{6l}\sqrt{\mathrm{cut}(L_S, f_c) + \alpha\sum_{j=1}^{n}s_j^{-1}} + \gamma\Big(\mathrm{cut}(L_S, f_c) + \alpha\sum_{j=1}^{n}s_j^{-1}\Big)\bigg\}.$$
Choosing $\tilde{r}$, $\gamma$ and the $u_j$ appropriately as functions of $l$, one can derive the explicit convergence rate for $R(\mathrm{sgn}(f_{\mathbf{z}})) - R(f_c)$.

In the remainder of this section, we try to find a generalization error bound for the graph-based algorithm by means of new ratio probability inequalities. Note that the ratio probability inequality (see Lemma 3.1 in [25]) is a standard result in learning theory (see, e.g., [15, 21]). Our interest is to bound $E(\xi_1) - \frac{1}{l}\sum_{i=1}^{l}\xi_1(z_i)$ with new ratio concentration inequalities. Now, we introduce the following notation. Given a continuous, strictly increasing function $\phi$ such that $\phi(0) = 0$, define
$$\phi_q(t) = \phi(\rho_j), \qquad t \in (\rho_{j-1}, \rho_j], \quad j = 1, \ldots, k,$$
and
$$\beta_{l,q,\phi}(\tilde{r}, \tilde{\delta}] := \sup_{t\in(\tilde{r},\tilde{\delta}]}\frac{\psi_{l,q}(t)}{\phi_q(t)}.$$
Associated with the quantity $\tau_{l,q}$, we denote
$$\tau_{l,q,\phi} := \tau_{l,q,\phi}(u_1, \ldots, u_k) := \Bigg(\max_{j:\,u_j>2l\bar{V}_{l,q}(\rho_j)}\frac{2u_j}{l\phi(\rho_j)\log\big(u_j/\bar{V}_{l,q}(\rho_j)\big)}\Bigg) \vee \Bigg(\max_{j:\,u_j\leq 2l\bar{V}_{l,q}(\rho_j)}\sqrt{\frac{2u_j\bar{V}_{l,q}(\rho_j)}{l\phi^2(\rho_j)}}\Bigg).$$
With the above definitions, there exist universal constants C ∈ (0, ∞) such that
for any number sequence uj of positive numbers Probz∈Z l
k |Pl g − P g| − βl,q,φ τl,q,φ C exp{−uj /C}. g∈G,˜ r