Semi-Supervised Minimum Error Entropy Principle with Distributed Method

Baobin Wang 1 and Ting Hu 2,*

1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China; [email protected]
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Correspondence: [email protected]
Received: 26 October 2018; Accepted: 10 December 2018; Published: 14 December 2018
Abstract: The minimum error entropy principle (MEE) is an alternative to the classical least squares, owing to its robustness to non-Gaussian noise. This paper studies the gradient descent algorithm for MEE with a semi-supervised approach and a distributed method, and shows that using the additional information of unlabeled data can enhance the learning ability of the distributed MEE algorithm. Our result proves that the mean squared error of the distributed gradient descent MEE algorithm can be minimax optimal for regression if the number of local machines increases polynomially with the total sample size.

Keywords: information theoretical learning; distributed method; MEE algorithm; semi-supervised approach; gradient descent; reproducing kernel Hilbert spaces
1. Introduction

The minimum error entropy (MEE) principle is an important criterion proposed in information theoretical learning (ITL) [1] and was first addressed for adaptive system training by Erdogmus and Principe [2]. It has been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, minimum cross-entropy for model selection, and some other topics [3–8]. Taking entropy as a measure of the error, the MEE principle can fully extract the information contained in the data and is robust to outliers in the implementation of algorithms.

Let X ∈ R^n be an explanatory variable with values taken in a compact metric space (𝒳, d), Y be a real response variable with Y ∈ 𝒴 ⊂ R, and g : 𝒳 → 𝒴 be a prediction function. For a given set of labeled examples D = {(x_i, y_i)}_{i=1}^N ⊂ 𝒳 × 𝒴 (N denotes the sample size) and a windowing function G : R → R₊, the MEE principle is to find a minimizer of the empirical quadratic entropy:

  Ĥ(g) = −log{ (h²/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} G( [(y_i − g(x_i)) − (y_j − g(x_j))]² / h² ) },

where h > 0 is the scaling parameter. Its goal is to solve the problem y = g_ρ(x) + ε, where ε is the noise and g_ρ(x) is the target function. Setting f(x_i, x_j) := g(x_i) − g(x_j), MEE belongs to the family of pairwise learning problems, which involve interactions between example pairs. Since the logarithmic function is monotonic, we only consider the empirical information error of MEE:

  R(f) = −(h²/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} G( [y_i − y_j − f(x_i, x_j)]² / h² ),    (1)

Entropy 2018, 20, 968; doi:10.3390/e20120968    www.mdpi.com/journal/entropy
in the optimization process. Borrowing the idea from Reference [9], we introduce the Mercer kernel K(·,·) : 𝒳² × 𝒳² → R (𝒳² := 𝒳 × 𝒳) and employ the reproducing kernel Hilbert space (RKHS) H_K as our hypothesis space. With K, H_K is defined as the linear span of the function set {K_{(x,u)} := K((x,u),(·,·)), ∀(x,u) ∈ 𝒳²}, equipped with the inner product ⟨·,·⟩_K and the reproducing property ⟨K_{(x,u)}, K_{(x′,u′)}⟩_K = K((x,u),(x′,u′)), ∀(x,u),(x′,u′) ∈ 𝒳². Since G is nonconvex, we usually solve Equation (1) using the kernel-based gradient descent method as follows. It starts with f_{1,D} = 0 and is updated by:

  f_{t+1,D} = f_{t,D} − η ∇R(f_{t,D}),    (2)

in the t-th step, where η > 0 is a step size, ∇ is the gradient operator, and:

  ∇R(f_{t,D}) = −(1/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} G′( [y_i − y_j − f_{t,D}(x_i, x_j)]² / h² ) [f_{t,D}(x_i, x_j) − y_i + y_j] K_{(x_i,x_j)}.
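As a concrete (and deliberately naive) illustration of the update in Equation (2), the following sketch performs one kernel gradient-descent MEE step with the exponential window G(u) = exp{−u}, representing f_t by a coefficient matrix alpha over the pair-kernel basis K_{(x_i,x_j)}. The base kernel k1 is borrowed from the simulation section later in the paper; the O(N⁴) evaluation loop is an assumption made for clarity, not an efficient implementation.

```python
import numpy as np

def k1(a, b):
    # Illustrative base kernel K1(x, x') = 1 + min(x, x') (from the simulation section).
    return 1.0 + np.minimum(a, b)

def pair_kernel(x, u, xp, up):
    # Pairwise kernel K((x,u),(x',u')) = K1(x,x') + K1(u,u') - K1(x,u') - K1(x',u).
    return k1(x, xp) + k1(u, up) - k1(x, up) - k1(xp, u)

def mee_gradient_step(alpha, x, y, eta=0.5, h=1.0):
    """One step of Equation (2), with f_t = sum_{i,j} alpha[i,j] K_{(x_i,x_j)}."""
    n = len(x)
    # Evaluate f_t on every example pair: f[a, b] = f_t(x_a, x_b).
    f = np.array([[np.sum(alpha * pair_kernel(x[a], x[b], x[:, None], x[None, :]))
                   for b in range(n)] for a in range(n)])
    u = (y[:, None] - y[None, :]) - f          # y_i - y_j - f_t(x_i, x_j)
    gprime = -np.exp(-(u ** 2) / h ** 2)       # G'(v) = -exp(-v) for G(v) = exp(-v)
    # Coefficient of grad R(f_t) on the basis K_{(x_i,x_j)}:
    # -(1/n^2) G'(.) (f_t(x_i,x_j) - y_i + y_j) = (1/n^2) * gprime * u.
    grad_coef = (1.0 / n ** 2) * gprime * u
    return alpha - eta * grad_coef
```

Starting from alpha = 0, one step produces an antisymmetric coefficient matrix, consistent with the antisymmetric target f(x, u) = g(x) − g(u).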
As we know, the number of example pairs grows quadratically with the sample size N, which brings a computational burden to the MEE implementation. Thus, it is necessary to reduce the algorithmic complexity by a distributed method based on a divide-and-conquer strategy [10]. Semi-supervised learning (SSL) [11] has attracted extensive attention as an emerging field in machine learning research and data mining. Actually, in many practical problems, few labeled data are given, but a large number of unlabeled data are available, since labeling data requires a lot of time, effort, or money. In this paper, we study a distributed MEE algorithm in the framework of SSL and show that the learning ability of the MEE algorithm can be enhanced by the distributed method and the combination of labeled data with unlabeled data.

There are three main contributions in this paper. The first is that we derive the explicit learning rate of the gradient descent method for distributed MEE in the context of SSL, which is comparable to the minimax optimal rate of the least squares in regression. This implies that the MEE algorithm can be an alternative to the least squares in SSL, in the sense that both have the same prediction power. The second is that we provide a theoretical upper bound for the number of local machines guaranteeing the optimal rate in the distributed computation. The last is that we extend the range of the target function allowed in the distributed MEE algorithm. In Table 1, we summarize some notations used in this paper.

Table 1. List of notations used throughout the paper.

  Notation    Meaning
  X           the explanatory variable
  Y           the response variable
  𝒳           X ∈ 𝒳, a compact subset of a Euclidean space R^n
  𝒴           Y ∈ 𝒴, a subset of R
  ρ(·,·)      a Borel measure on 𝒳 × 𝒴
  ρ_X         the marginal probability measure of ρ on 𝒳
  ρ(y|x)      the conditional probability measure of y ∈ 𝒴 given X = x
  g_ρ(x)      the mean regression function g_ρ(x) = ∫_𝒴 y dρ(y|x)
  f_ρ(x,u)    the target function of MEE, induced by f_ρ(x,u) = g_ρ(x) − g_ρ(u)
  K           a reproducing kernel on 𝒳 × 𝒳
  D           the labeled data set D = {(x₁, y₁), ..., (x_N, y_N)}
  N           the size of the labeled data set D
  ⌈N/4⌉       the smallest integer not less than N/4
  |D|         the cardinality of D, |D| = N
  D*          the unlabeled data set D* = {x₁, ..., x_S}
  S           the size of the unlabeled data set D*
  |D*|        the cardinality of D*, |D*| = S
Table 1. Cont.

  Notation      Meaning
  D̃             the training data set used in the distributed MEE algorithm, consisting of D and D*
  |D̃|           the cardinality of D̃, |D̃| = N + S
  m             the number of local machines
  D̃_l           the l-th subset of D̃, 1 ≤ l ≤ m
  G             the loss function of the MEE algorithm
  L_K           the integral operator associated with K
  L_{K,D̃}       the empirical operator of L_K on D̃
  f_{t+1,D̃}     the function output by the kernel gradient descent MEE algorithm with data D̃ and kernel K after t iterations
  f_{t+1,D̃_l}   the function output by the kernel gradient descent MEE algorithm with data D̃_l and kernel K after t iterations
  f̄_{t+1,D̃}     the global output averaging over the local outputs f_{t+1,D̃_l}, l = 1, ..., m
2. Algorithms and Main Results

We consider MEE for the regression problem. To allow noise in sampling processes, we assume that a Borel measure ρ(·,·) is defined on the product space 𝒳 × 𝒴. Let ρ(y|x) be the conditional distribution of y ∈ 𝒴 for any given x ∈ 𝒳, and ρ_X(·) the marginal distribution on 𝒳. For the semi-supervised MEE algorithm, our goal is to estimate the regression function g_ρ(x) = ∫_𝒴 y dρ(y|x), x ∈ 𝒳, from labeled examples D = {(x_i, y_i)}_{i=1}^N and unlabeled examples D* = {x_j}_{j=1}^S drawn from the distributions ρ and ρ_X, respectively.

Based on the divide-and-conquer strategy, both D and D* are partitioned equally into m subsets, D = ∪_{l=1}^m D_l and D* = ∪_{l=1}^m D_l*. Here, we denote the sizes of the subsets by |D_l| = n and |D_l*| = s, 1 ≤ l ≤ m, i.e., N = mn and S = ms. We construct a new data set D̃ = ∪_{l=1}^m D̃_l by:

  D̃_l = D_l ∪ D_l* = {(x_k, y_k)}_{k=1}^{n+s},

where:

  x_k = { x_k,  if (x_k, y_k) ∈ D_l,
          x_k,  if x_k ∈ D_l*,

and:

  y_k = { ((n+s)/n) y_k,  if (x_k, y_k) ∈ D_l,
          0,              if x_k ∈ D_l*.

Based on the gradient descent algorithm (Equation (2)), we can get a set of local estimators {f_{t,D̃_l}} for each subset D̃_l, 1 ≤ l ≤ m. Then, the global estimator averaging over these local estimators is given by:

  f̄_{t,D̃} = (1/m) ∑_{l=1}^m f_{t,D̃_l}.    (3)
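As a concrete illustration of this construction, the sketch below partitions labeled and unlabeled samples into m local data sets D̃_l with the (n+s)/n label rescaling, and averages local estimators as in Equation (3). The function names and the even split via array_split are assumptions for illustration; the local training routine itself is left abstract.

```python
import numpy as np

def make_local_datasets(x_lab, y_lab, x_unlab, m):
    """Build the m augmented subsets D~_l = D_l U D_l*: inputs are pooled,
    labeled targets are rescaled by (n+s)/n, unlabeled targets are set to 0."""
    parts_x = np.array_split(x_lab, m)
    parts_y = np.array_split(y_lab, m)
    parts_u = np.array_split(x_unlab, m)
    local = []
    for xl, yl, xu in zip(parts_x, parts_y, parts_u):
        n, s = len(xl), len(xu)
        x_tilde = np.concatenate([xl, xu])
        y_tilde = np.concatenate([yl * (n + s) / n, np.zeros(s)])
        local.append((x_tilde, y_tilde))
    return local

def average_estimators(local_fs):
    """Global estimator of Equation (3): pointwise average of local outputs."""
    return lambda x, u: np.mean([f(x, u) for f in local_fs], axis=0)
```

Heuristically, the (n+s)/n rescaling compensates for the zero labels placed on D_l*, so that first-order statistics built from the augmented labels match, in expectation, those built from fully labeled data.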
In the pairwise setting, our target function is f_ρ(x, x′) = g_ρ(x) − g_ρ(x′), x, x′ ∈ 𝒳, which is the difference of the regression function g_ρ. Denote by L²_{ρ_{𝒳²}} the space of square integrable functions on the product space 𝒳²:

  L²_{ρ_{𝒳²}} := { f : 𝒳² → R : ‖f‖_{L²} = ( ∫∫_{𝒳²} |f(x,x′)|² dρ_X(x) dρ_X(x′) )^{1/2} < ∞ }.

Throughout this paper, we assume that for some constant M > 0, |y| ≤ M almost surely. Without loss of generality, the windowing function G is assumed to be differentiable and to satisfy G′(0) = −1, G′(u) < 0 for u > 0, and C_G := sup_{u∈(0,∞)} |G′(u)| < ∞, and we assume that there exist p > 0 and c_p > 0 such that:

  |G′(u) − G′(0)| ≤ c_p |u|^p,  ∀u > 0.    (4)
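These conditions are easy to verify numerically for the exponential window used below; the grid check here is only a sanity illustration that Equation (4) holds with p = 1 and c_p = 1.

```python
import numpy as np

# For G(u) = exp(-u): G'(u) = -exp(-u), so G'(0) = -1, G'(u) < 0 on (0, inf),
# C_G = sup_{u>0} |G'(u)| = 1, and |G'(u) - G'(0)| = 1 - exp(-u) <= u,
# i.e. Equation (4) with p = 1 and c_p = 1.
u = np.linspace(1e-8, 20.0, 2001)
gprime = -np.exp(-u)
assert np.all(gprime < 0)                    # G' is negative
assert np.all(np.abs(gprime) <= 1.0)         # C_G = 1
assert np.all(np.abs(gprime - (-1.0)) <= u)  # |G'(u) - G'(0)| <= u
```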
It is easy to check that the Gaussian windowing function G(u) = exp{−u} satisfies the assumptions above with p = 1. Before we present our main results, define the integral operator L_K : L²_{ρ_{𝒳²}} → L²_{ρ_{𝒳²}} associated with the kernel K by:

  L_K(f) := ∫_𝒳 ∫_𝒳 f(x,x′) K_{(x,x′)} dρ_X(x) dρ_X(x′),  ∀f ∈ L²_{ρ_{𝒳²}}.

Our error analysis for the distributed MEE algorithm (Equation (3)) is stated in terms of the following regularity condition:

  f_ρ = L_K^r(φ)  for some r > 0 and φ ∈ L²_{ρ_{𝒳²}},    (5)

where L_K^r denotes the r-th power of L_K on L²_{ρ_{𝒳²}}, which is well defined since the operator L_K is positive and compact with the Mercer kernel K. We use the effective dimension [12,13] N(λ) to measure the complexity of H_K with respect to ρ_X; it is defined to be the trace of the operator (λI + L_K)^{−1} L_K:

  N(λ) = Tr((λI + L_K)^{−1} L_K),  λ > 0.

To obtain optimal learning rates, we need to quantify N(λ) of H_K. A suitable assumption is that:

  N(λ) ≤ C₀ λ^{−β},  for some C₀ > 0 and 0 < β ≤ 1.    (6)
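When the spectrum of L_K is known (or estimated from a kernel matrix), the effective dimension is just a spectral sum, as this sketch shows. The polynomial decay σ_i = i^{−2} is an illustrative assumption matching the eigenvalue condition of Remark 1 with b = 2, i.e. β = 1/2 in Equation (6).

```python
import numpy as np

def effective_dimension(sigma, lam):
    """N(lam) = Tr((lam*I + L_K)^{-1} L_K) = sum_i sigma_i / (sigma_i + lam),
    computed from the eigenvalues sigma_i of L_K."""
    return np.sum(sigma / (sigma + lam))

# Polynomial eigenvalue decay sigma_i = i^{-b} (here b = 2, an illustrative
# choice) gives N(lam) = O(lam^{-1/b}).
sigma = np.arange(1.0, 100001.0) ** -2.0
for lam in (1e-2, 1e-3, 1e-4):
    print(lam, effective_dimension(sigma, lam))
```

The printed values grow roughly like λ^{−1/2} as λ shrinks, consistent with β = 1/2.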
Remark 1. When β = 1, Equation (6) always holds with C₀ = Tr(L_K). For 0 < β < 1, when H_K is a Sobolev space W^α(𝒳) on 𝒳 ⊂ R^d with all derivatives of order up to α > d/2, then Equation (6) is satisfied with β = d/(2α) [14]. Moreover, if the eigenvalues {γ_i}_{i=1}^∞ of the operator L_K decay as γ_i = O(i^{−b}) for some b > 1, then N(λ) = O(λ^{−1/b}). The eigenvalue assumption is typical in the analysis of the performance of kernel-method estimators and was recently used in References [13,15,16] to establish the optimal learning rate in least squares problems.

The following theorem shows that the distributed gradient descent algorithm (Equation (3)) can achieve the optimal rate, providing the iteration number T and the maximal number of local machines; its proof can be found in Section 3.

Theorem 1 (Main Result). Assume Equations (5) and (6) hold with r + β ≥ 1/2. Let the iteration number be T = ⌈N/4⌉^{1/(2r+β)} and let S + N ≥ N^{(β+1)/(2r+β)}. Suppose the number of local machines satisfies:

  m ≤ (N + S) N^{−(β+1)/(2r+β)} / log⁶ N,    (7)

and the scaling parameter satisfies:

  h ≥ (N + S)^{(2p+1)/(2p)} N^{(r+p+3/2)/(2p(2r+β))} N^{−(2p+1)/(2p)},    (8)
then for any 0 < δ < 1, with confidence at least 1 − δ:
  ‖f̄_{T+1,D̃} − f_ρ‖_{L²} ≤ C′ N^{−r/(2r+β)} log⁴(24/δ).    (9)
Remark 2. The rate O(N^{−r/(2r+β)}) in Equation (9) is optimal in the minimax sense for kernel regression problems [13]. When m = 1, the result of Equation (9) shows that the kernel gradient descent MEE algorithm (Equation (2)) on a single big data set can achieve the minimax optimal rate for regression. Thus, MEE is a nice alternative to the classical least squares. Meanwhile, the upper bound (Equation (7)) for the number of local machines implies that the performance of the distributed MEE algorithm (Equation (3)) can be as good as that of the standard MEE algorithm (Equation (2)) acting on the whole data set D̃, provided that the size n + s of the subsets D̃_l is not too small.

Remark 3. If no unlabeled data are engaged in the algorithm (Equation (3)), then S = 0 and the upper bound (Equation (7)) for the number of local machines m that ensures the optimal rate is about O(N^{(2r−1)/(2r+β)}). So, when the regularity parameter r in Equation (5) is close to 1/2, the upper bound O(N^{(2r−1)/(2r+β)}) reduces to a constant, and then the distributed algorithm (Equation (3)) will not be feasible in real applications. A similar phenomenon is observed in various distributed algorithms [15–18]. When the size of unlabeled data S > 0, we see from Equation (7) that the upper bound of m keeps growing with the increase of S when the size of labeled data N is fixed. For example, let β > 1/2 and S = N^{(r+β+1)/(2r+β)}; then the upper bound in Equation (7) is about O(N^{r/(2r+β)}) and does not reduce to a constant when r → 1/2. Hence, with sufficient unlabeled data D*, the distributed algorithm (Equation (3)) allows more local machines in the distributed method.

Remark 4. A series of distributed works [15–19] were carried out when the target function f_ρ lies in the space H_K, i.e., the regularity parameter r > 1/2. As a byproduct, our work in Theorem 1 does not impose the restriction r > 1/2 on the distributed algorithm (Equation (3)).

3. Proof of Main Result

In this section, we prove our main result in Theorem 1. To this end, we introduce the data-free gradient descent method in H_K for the least squares, defined as f₁ = 0 and:

  f_{t+1} = f_t − η ∫_𝒳 ∫_𝒳 ( f_t(x,x′) − f_ρ(x,x′) ) K_{(x,x′)} dρ_X(x) dρ_X(x′),  t ≥ 1.

Recalling the definition of L_K, it can be written as:

  f_{t+1} = f_t − η L_K(f_t − f_ρ) = (I − η L_K) f_t + η L_K(f_ρ),  t ≥ 1.    (10)
Following the standard decomposition technique in learning theory, we split the error f̄_{t+1,D̃} − f_ρ into the sample error f̄_{t+1,D̃} − f_{t+1} and the approximation error f_{t+1} − f_ρ.

3.1. Approximation Error

Firstly, we estimate the approximation error ‖f_{t+1} − f_ρ‖_{L²}. It has been proven in Reference [20] and is shown in the following lemmas.

Lemma 1. Define {f_t} by Equation (10) with 0 < η ≤ 1. If Equation (5) holds with r > 0, there holds:

  ‖f_t − f_ρ‖_{L²} ≤ c_{φ,r} t^{−r},

and when r ≥ 1/2:

  ‖f_t − f_ρ‖_K ≤ c_{φ,r} t^{−(r−1/2)},

where c_{φ,r} = max{ ‖φ‖_{L²} (2r/e)^r, ‖φ‖_{L²} [(2r−1)/e]^{r−1/2} }.

Moreover, we derive a uniform bound on the sequence {f_t} defined by Equation (10) when 0 < r < 1/2, which is useful in our analysis. Here and in the sequel, denote by π^t_{i+1}(L) the polynomial operator associated with an operator L, defined by π^t_{i+1}(L) := ∏_{j=i+1}^t (I − ηL) and π^t_{t+1} := I. We use the conventional notation ∑_{j=T+1}^T := 1.

Lemma 2. Define {f_t} by Equation (10) with 0 < η ≤ 1. If Equation (5) holds with 0 < r < 1/2, there holds:

  ‖f_t‖_K ≤ d_{φ,η,r} t^{1/2−r},    (11)

where d_{φ,η,r} is defined in the proof.

Proof. Using Equation (10) iteratively from t down to 1, we have that:

  f_{t+1} = ∑_{i=1}^t η π^t_{i+1}(L_K) L_K(f_ρ)  for all t ≥ 1.
With Equation (5):

  ‖f_{t+1}‖_K = ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K L_K^r(φ) ‖_K ≤ ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K^{r+1/2} ‖ ‖L_K^{1/2} φ‖_K
             = ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K^{r+1/2} ‖ ‖φ‖_{L²}.    (12)
Let {σ_k}_{k=1}^∞ be the eigenvalues of the operator L_K; then 0 ≤ σ_k ≤ 1, k ≥ 1, since L_K is positive and ‖L_K‖_{H_K→H_K} ≤ 1. Then the norm:

  ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K^{r+1/2} ‖ = sup_{k≥1} ∑_{i=1}^t η π^t_{i+1}(σ_k) σ_k^{r+1/2}
    ≤ η sup_{a>0} ∑_{i=1}^{t−1} π^t_{i+1}(a) a^{r+1/2} + η
    ≤ ∑_{i=1}^{t−1} η sup_{a>0} { [exp{−ηa(t−i)}] a^{r+1/2} } + η.

For each i ≤ t−1, by a simple calculation, we have:

  η sup_{a>0} { [exp{−ηa(t−i)}] a^{r+1/2} } = η { [exp{−ηa(t−i)}] a^{r+1/2} } |_{a=(r+1/2)(η(t−i))^{−1}}
    = η^{1/2−r} (r+1/2)^{r+1/2} exp{−(r+1/2)} (t−i)^{−(r+1/2)} ≤ (t−i)^{−(r+1/2)}.

Thus, we have:

  ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K^{r+1/2} ‖ ≤ η ∑_{i=1}^{t−1} (t−i)^{−(r+1/2)} + η = η ∑_{i=1}^{t−1} i^{−(r+1/2)} + η.

By the elementary inequality ∑_{i=1}^t i^{−θ} ≤ t^{1−θ}/(1−θ) with 0 < θ < 1, it follows that:

  ‖ ∑_{i=1}^t η π^t_{i+1}(L_K) L_K^{r+1/2} ‖ ≤ η ( 1/(1/2−r) + 1 ) t^{1/2−r} = η (3/2−r)/(1/2−r) t^{1/2−r}.

Together with Equation (12), the proof is completed by taking d_{φ,η,r} := η (3/2−r)/(1/2−r) ‖φ‖_{L²}.
3.2. Sample Error

Define the empirical operator L_{K,D} : H_K → H_K by:

  L_{K,D} := (1/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} ⟨·, K_{(x_i,x_j)}⟩_K K_{(x_i,x_j)},

and, for any f ∈ H_K:

  L_{K,D}(f) = (1/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} ⟨f, K_{(x_i,x_j)}⟩_K K_{(x_i,x_j)} = (1/N²) ∑_{(x_i,y_i)∈D} ∑_{(x_j,y_j)∈D} f(x_i,x_j) K_{(x_i,x_j)}.

Then, the MEE gradient descent algorithm (Equation (2)) on D̃ can be written as:

  f_{t+1,D̃} = [I − η L_{K,D̃}](f_{t,D̃}) + η f_{ρ,D̃} + η E_{t,D̃},    (13)

where:

  E_{t,D̃} = (1/(N+S)²) ∑_{(x_i,y_i)∈D̃} ∑_{(x_j,y_j)∈D̃} ( G′( [y_i − y_j − f_{t,D̃}(x_i,x_j)]² / h² ) − G′(0) ) ( f_{t,D̃}(x_i,x_j) − y_i + y_j ) K_{(x_i,x_j)},    (14)

and:

  f_{ρ,D̃} = (1/(N+S)²) ∑_{(x_i,y_i)∈D̃} ∑_{(x_j,y_j)∈D̃} (y_i − y_j) K_{(x_i,x_j)}.

In the sequel, denote:

  𝓑_{D̃,λ} = ‖ (L_{K,D̃} + λI)^{−1} (L_K + λI) ‖,
  𝓒_{D̃,λ} = ‖ (L_K − L_{K,D̃}) (L_K + λI)^{−1/2} ‖,
  𝓓_{D̃,λ} = ‖ (1/m) ∑_{l=1}^m (L_K + λI)^{−1/2} (L_K − L_{K,D̃_l}) ‖,
  𝓕_{D̃,λ} = ‖ (1/m) ∑_{l=1}^m (L_K + λI)^{−1/2} [f_{ρ,D̃_l} − L_K(f_ρ)] ‖_K,
  𝓖_{D̃,λ} = ‖ (L_K + λI)^{−1/2} (L_K f_ρ − f_{ρ,D̃}) ‖_K.
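In finite-sample computations, the action of the empirical operator reduces to Gram-tensor arithmetic. The sketch below evaluates L_{K,D}(f) at the sample pairs for an f given by its values on those pairs; the base kernel and the naive O(N⁴) Gram tensor are illustrative assumptions, meant only to make the definition concrete.

```python
import numpy as np

def k1(a, b):
    # Illustrative base kernel; the paper's simulation uses K1(x,x') = 1 + min(x,x').
    return 1.0 + np.minimum(a, b)

def pair_gram(x):
    """G[i, j, a, b] = K((x_i, x_j), (x_a, x_b)) for the additive pairwise kernel
    K((x,u),(x',u')) = K1(x,x') + K1(u,u') - K1(x,u') - K1(x',u)."""
    K1 = k1(x[:, None], x[None, :])
    return (K1[:, None, :, None] + K1[None, :, None, :]
            - K1[:, None, None, :] - K1[None, :, :, None])

def empirical_operator(x, f_vals):
    """(L_{K,D} f)(x_a, x_b) = (1/N^2) sum_{i,j} f(x_i,x_j) K((x_i,x_j),(x_a,x_b))."""
    n = len(x)
    return np.einsum('ij,ijab->ab', f_vals, pair_gram(x)) / n ** 2
```

One useful sanity property: since K1 is symmetric, the operator maps antisymmetric pair functions to antisymmetric pair functions, matching the antisymmetry of the target f_ρ(x, u) = g_ρ(x) − g_ρ(u).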
With these preliminaries in place, we now turn to the estimate of the sample error f̄_{T+1,D̃} − f_{T+1}, presented in the following lemma, whose proof can be found in the Appendix. Here and in the sequel, we use the conventional notation ∑_{i=1}^t (t−i)^{−1} := ∑_{i=1}^{t−1} (t−i)^{−1} + 1.

Lemma 3. Let λ > 0 and 0 < η < min{C_G^{−1}, 1}. For any f* ∈ H_K, there holds:

  ‖f̄_{T+1,D̃} − f_{T+1}‖_{L²} ≤ term 1 + term 2 + c_{p,M} |N+S|^{2p+1} N^{−(2p+1)} T^{p+3/2} h^{−2p},    (15)

where the constant c_{p,M} = 2^{4p+2} c_p C_G^{2p+1} M^{2p+1}:

  term 1 = sup_{1≤l≤m} ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] 𝓒_{D̃_l,λ} × { ∑_{s=1}^{i−1} [ (i−s−1)^{−1} + λη ] ‖f_s − f*‖_K 𝓑_{D̃_l,λ} 𝓒_{D̃_l,λ} λ^{−1/2}
      + (1 + ληi) 𝓑_{D̃_l,λ} ( 𝓒_{D̃_l,λ} ‖f*‖_K + 𝓖_{D̃_l,λ} ) λ^{−1/2} + c_{p,M} |N+S|^{2p+1} N^{−(2p+1)} i^{p+1/2} h^{−2p} },

and:

  term 2 = ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] 𝓓_{D̃,λ} ‖f_i − f*‖_K + (1 + ληT) ( 𝓓_{D̃,λ} ‖f*‖_K + 𝓕_{D̃,λ} ).

With the help of the lemma above, to bound the sample error ‖f̄_{T+1,D̃} − f_{T+1}‖_{L²}, we first need to estimate the quantities 𝓑_{D̃,λ}, 𝓒_{D̃,λ}, 𝓓_{D̃,λ}, 𝓕_{D̃,λ}, and 𝓖_{D̃,λ}. Denote 𝓐_{D,λ} := 1/(⌈|D|/4⌉ √λ) + √( N(λ) / ⌈|D|/4⌉ ) (|D| is the cardinality of D). In previous work [19,21–23], it has been found that each of the following inequalities holds with confidence at least 1 − δ:

  𝓑_{D̃,λ} ≤ 2 ( 2 𝓐_{D̃,λ} log(2/δ) / √λ )² + 2,  𝓒_{D̃,λ} ≤ 2 𝓐_{D̃,λ} log(2/δ),  𝓓_{D̃,λ} ≤ 2 𝓐_{D̃,λ} log(2/δ),
  𝓕_{D̃,λ} ≤ 16M 𝓐_{D,λ} log(4/δ),  and  𝓖_{D̃,λ} ≤ 16M 𝓐_{D,λ} log(4/δ).    (16)
By Lemma 3, we also see that the function f* is crucial in determining ‖f̄_{T+1,D̃} − f_{T+1}‖. To get a tight bound on the learning error, we should choose an appropriate f* ∈ H_K according to the regularity of the target function. When r ≥ 1/2, f_ρ ∈ H_K and we take f* = f_ρ. When 0 < r < 1/2, f_ρ is outside the space H_K and we let f* = 0. Now, we give the first main result, for the case where the target function f_ρ is outside H_K with 0 < r < 1/2.

Theorem 2. Assume Equation (5) for 0 < r < 1/2. Let 0 < η < min{1, C_G^{−1}}, T ∈ N and λ = T^{−1}. Then, for any 0 < δ < 1, with probability at least 1 − δ, there holds:

  ‖f̄_{T+1,D̃} − f_ρ‖_{L²} ≤ C* { T^{−r} + log²(T) 𝓙_{D̃,D,λ} log⁴(24m/δ) + log(T) 𝓐_{D̃,λ} λ^{r−1/2} + 𝓐_{D,λ} log(16/δ)
      + |N+S|^{2p+1} N^{−(2p+1)} h^{−2p} ( T^{p+3/2} + T^{p+1/2} log(T) sup_{1≤l≤m} 𝓐_{D̃_l,λ} log(2/δ) ) },    (17)

where C* is a constant given in the proof and:

  𝓙_{D̃,D,λ} = sup_{1≤l≤m} ( 𝓐_{D̃_l,λ}/√λ + 1 )² ( 𝓐²_{D̃_l,λ} λ^{r−1} + 𝓐_{D̃_l,λ} 𝓐_{D_l,λ} λ^{−1/2} ).
Proof. Decompose ‖f̄_{T+1,D̃} − f_ρ‖_{L²} as:

  ‖f̄_{T+1,D̃} − f_ρ‖_{L²} ≤ ‖f̄_{T+1,D̃} − f_{T+1}‖_{L²} + ‖f_{T+1} − f_ρ‖_{L²}.

The estimate of ‖f_{T+1} − f_ρ‖_{L²} is presented in Lemma 1. We only need to handle ‖f̄_{T+1,D̃} − f_{T+1}‖_{L²} by Lemma 3. For any 0 < s ≤ T−1 and λ = T^{−1}, by Equation (11), we have ‖f_s‖_K ≤ d_{φ,η,r} s^{1/2−r} ≤ d_{φ,η,r} λ^{r−1/2}. Take f* = 0 in Lemma 3; then:

  term 1 ≤ (1 + d_{φ,η,r} + c_{p,M}) sup_{1≤l≤m} ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] 𝓒_{D̃_l,λ} × { ∑_{s=1}^{i−1} [ (i−s−1)^{−1} + λη ] 𝓑_{D̃_l,λ} 𝓒_{D̃_l,λ} λ^{r−1}
      + (1 + ληi) 𝓑_{D̃_l,λ} ( 𝓒_{D̃_l,λ} λ^{r−1/2} + 𝓖_{D̃_l,λ} ) λ^{−1/2} + |N+S|^{2p+1} N^{−(2p+1)} i^{p+1/2} h^{−2p} },

and:

  term 2 ≤ (1 + d_{φ,η,r}) { ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] 𝓓_{D̃,λ} λ^{r−1/2} + (1 + ληT) ( 𝓓_{D̃,λ} λ^{r−1/2} + 𝓕_{D̃,λ} ) }.
Noticing the elementary inequality ∑_{s=1}^i s^{−1} ≤ 2 log(i), then:

  ∑_{s=1}^{i−1} [ (i−s−1)^{−1} + λη ] ≤ ∑_{s=1}^{i−1} (i−s−1)^{−1} + i T^{−1} ≤ 4 log(i),

  ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] ∑_{s=1}^{i−1} [ (i−s−1)^{−1} + λη ] ≤ 4 ∑_{i=1}^T [ (T−i)^{−1} + T^{−1} ] log(i)
    ≤ 16 ∑_{i=1}^{T−1} log(i)/(T−i) ≤ 16 log(T) ∑_{i=1}^{T−1} 1/(T−i) = 16 log(T) ∑_{i=1}^{T−1} i^{−1} ≤ 32 log²(T),

  ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] (1 + ληi) ≤ ∑_{i=1}^T [ (T−i)^{−1} + η T^{−1} ] (1 + T^{−1} η i)
    ≤ 2 ( ∑_{i=1}^T (T−i)^{−1} + 1 ) ≤ 8 log(T),

and:

  ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] ≤ 4 log(T),

  ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] i^{p+1/2} ≤ ∑_{i=1}^T [ (T−i)^{−1} + ηλ ] T^{p+1/2} ≤ 4 T^{p+1/2} log(T).
Plugging the above inequalities into term 1 and term 2, then:

  term 1 ≤ sup_{1≤l≤m} C₁ { log²(T) 𝓑_{D̃_l,λ} 𝓒²_{D̃_l,λ} λ^{r−1} + log(T) 𝓑_{D̃_l,λ} 𝓒²_{D̃_l,λ} λ^{r−1}
      + log(T) 𝓑_{D̃_l,λ} 𝓒_{D̃_l,λ} 𝓖_{D̃_l,λ} λ^{−1/2} + |N+S|^{2p+1} N^{−(2p+1)} T^{p+1/2} log(T) 𝓒_{D̃_l,λ} h^{−2p} },    (18)

and:

  term 2 ≤ C₂ ( log(T) 𝓓_{D̃,λ} λ^{r−1/2} + 𝓕_{D̃,λ} ),    (19)

where C₁ = 32 (1 + d_{φ,η,r} + c_{p,M}) and C₂ = 6 (1 + d_{φ,η,r}). By Equation (16), for any fixed l, there exist three subsets with measure at least 1 − δ such that:

  𝓑_{D̃_l,λ} ≤ 2 ( 2 𝓐_{D̃_l,λ} log(2/δ) / √λ )² + 2,  𝓒_{D̃_l,λ} ≤ 2 𝓐_{D̃_l,λ} log(2/δ),

and:

  𝓖_{D̃_l,λ} ≤ 16M 𝓐_{D_l,λ} log(4/δ).

Thus, for any fixed l, with confidence at least 1 − 3δ, there holds:

  𝓑_{D̃_l,λ} 𝓒²_{D̃_l,λ} λ^{r−1} ≤ 32 ( 𝓐_{D̃_l,λ}/√λ + 1 )² 𝓐²_{D̃_l,λ} λ^{r−1} log⁴(2/δ),

and:

  𝓑_{D̃_l,λ} 𝓒_{D̃_l,λ} 𝓖_{D̃_l,λ} λ^{−1/2} ≤ 256M ( 𝓐_{D̃_l,λ}/√λ + 1 )² 𝓐_{D̃_l,λ} 𝓐_{D_l,λ} λ^{−1/2} log³(2/δ) log(4/δ).

Therefore, with confidence at least 1 − 3mδ, there holds:

  sup_{1≤l≤m} 𝓑_{D̃_l,λ} 𝓒²_{D̃_l,λ} λ^{r−1} ≤ 32 sup_{1≤l≤m} ( 𝓐_{D̃_l,λ}/√λ + 1 )² 𝓐²_{D̃_l,λ} λ^{r−1} log⁴(2/δ),

and:

  sup_{1≤l≤m} 𝓑_{D̃_l,λ} 𝓒_{D̃_l,λ} 𝓖_{D̃_l,λ} λ^{−1/2} ≤ 256M sup_{1≤l≤m} ( 𝓐_{D̃_l,λ}/√λ + 1 )² 𝓐_{D̃_l,λ} 𝓐_{D_l,λ} λ^{−1/2} log³(2/δ) log(4/δ).

Thus, by Equation (18), it follows with confidence at least 1 − δ/2 (by scaling 3mδ to δ/2) that:

  term 1 ≤ C₃ sup_{1≤l≤m} { log²(T) ( 𝓐_{D̃_l,λ}/√λ + 1 )² ( 𝓐²_{D̃_l,λ} λ^{r−1} + 𝓐_{D̃_l,λ} 𝓐_{D_l,λ} λ^{−1/2} ) log⁴(24m/δ)
      + |N+S|^{2p+1} N^{−(2p+1)} T^{p+1/2} log(T) 𝓐_{D̃_l,λ} h^{−2p} },

where C₃ = C₁ (256M + 64).
Similarly, with confidence at least 1 − 2δ, there holds:

  𝓓_{D̃,λ} ≤ 2 𝓐_{D̃,λ} log(2/δ)  and  𝓕_{D̃,λ} ≤ 16M 𝓐_{D,λ} log(4/δ).

By Equation (19), it follows with confidence at least 1 − δ/2 (by scaling 2δ to δ/2) that:

  term 2 ≤ C₄ ( log(T) 𝓐_{D̃,λ} λ^{r−1/2} + 𝓐_{D,λ} log(16/δ) ),

where C₄ = C₂ (16M + 2). Together with Lemma 1, we obtain the desired bound (Equation (17)) with C* = c_{φ,r} + C₃ + C₄ + c_{p,M}.

Next, we give the result when the target function f_ρ is in H_K with r ≥ 1/2.

Theorem 3. Assume Equation (5) for r ≥ 1/2. Let 0 < η < min{1, C_G^{−1}}, T ∈ N and λ = T^{−1}. Then, for any 0 < δ < 1, with probability at least 1 − δ, there holds:

  ‖f̄_{T+1,D̃} − f_ρ‖_{L²} ≤ C* { T^{−r} + log²(T) 𝓚_{D̃,D,λ} log⁴(24m/δ) + log(T) 𝓐_{D̃,λ} + 𝓐_{D,λ} log(16/δ)
      + |N+S|^{2p+1} N^{−(2p+1)} h^{−2p} ( T^{p+3/2} + T^{p+1/2} log(T) sup_{1≤l≤m} 𝓐_{D̃_l,λ} log(2/δ) ) },    (20)

where 𝓚_{D̃,D,λ} = sup_{1≤l≤m} ( 𝓐_{D̃_l,λ}/√λ + 1 )² ( 𝓐²_{D̃_l,λ} + 𝓐_{D̃_l,λ} 𝓐_{D_l,λ} ) λ^{−1/2} and C* is a constant given in the proof. The proof is similar to that of Theorem 2, and we omit it here.

With these preliminaries in place, we can prove our main result in Theorem 1.
Proof of Theorem 1. We first prove Equation (9) by Theorem 2 when 0 < r < 1/2. Let T = ⌈|D|/4⌉^{1/(2r+β)} and λ = T^{−1}. Notice that |D| = N, |D̃| = N + S, and m|D_l| = |D|, m|D̃_l| = |D̃| for 1 ≤ l ≤ m. With r + β ≥ 1/2 and Equation (7), we obtain that:

  𝓐_{D,λ} = ⌈|D|/4⌉^{−1+1/(4r+2β)} + √C₀ ⌈|D|/4⌉^{−1/2+β/(4r+2β)}
    ≤ (√C₀ + 1) ⌈|D|/4⌉^{−r/(2r+β)} ≤ √5 (√C₀ + 1) |D|^{−r/(2r+β)} = √5 (√C₀ + 1) N^{−r/(2r+β)},

  𝓐_{D̃,λ} = ⌈|D̃|/4⌉^{−1} ⌈|D|/4⌉^{1/(4r+2β)} + √C₀ ⌈|D̃|/4⌉^{−1/2} ⌈|D|/4⌉^{β/(4r+2β)}
    ≤ √5 (√C₀ + 1) ( |D̃|^{−1} |D|^{1/(4r+2β)} + |D̃|^{−1/2} |D|^{β/(4r+2β)} )
    = √5 (√C₀ + 1) ( |N+S|^{−1} N^{1/(4r+2β)} + |N+S|^{−1/2} N^{β/(4r+2β)} ),

  𝓐_{D_l,λ} = ⌈|D_l|/4⌉^{−1} ⌈|D|/4⌉^{1/(4r+2β)} + √C₀ ⌈|D_l|/4⌉^{−1/2} ⌈|D|/4⌉^{β/(4r+2β)}
    ≤ √5 (√C₀ + 1) ( m |D|^{−1+1/(4r+2β)} + m^{1/2} |D|^{−1/2+β/(4r+2β)} )
    = √5 (√C₀ + 1) ( m N^{−1+1/(4r+2β)} + m^{1/2} N^{−1/2+β/(4r+2β)} ),

and:

  𝓐_{D̃_l,λ} ≤ √5 (√C₀ + 1) ( |D̃_l|^{−1} |D|^{1/(4r+2β)} + |D̃_l|^{−1/2} |D|^{β/(4r+2β)} )
    = √5 (√C₀ + 1) ( m |D̃|^{−1} |D|^{1/(4r+2β)} + m^{1/2} |D̃|^{−1/2} |D|^{β/(4r+2β)} )
    ≤ 2√5 (√C₀ + 1) m^{1/2} |D̃|^{−1/2} |D|^{β/(4r+2β)}.

Thus:

  𝓐_{D̃_l,λ}/√λ ≤ √5 (√C₀ + 1) ( m |D̃|^{−1} |D|^{1/(4r+2β)} + m^{1/2} |D̃|^{−1/2} |D|^{β/(4r+2β)} ) ⌈|D|/4⌉^{1/(2(2r+β))}
    ≤ 2√5 (√C₀ + 1) m^{1/2} |D̃|^{−1/2} |D|^{(β+1)/(4r+2β)}
    = 2√5 (√C₀ + 1) m^{1/2} |N+S|^{−1/2} N^{(β+1)/(4r+2β)} ≤ 2√5 (√C₀ + 1).
s
2
1+ β
1−r
A2D˜ ,λ λr−1 ≤ 10(
p
˜ |−1 | D | 2r+β )| D | 2r+β ˜ |−2 | D | 2r+β + m| D C0 + 1)2 (m2 | D
= 10(
p
˜ |−2 | D | 2r+β + m| D ˜ |−1 | D | 2r+β )| D | C0 + 1)2 (m2 | D
≤ 20(
p
C0 + 1)2 | D |− 2r+s / log6 | D |
= 20(
p
C0 + 1)2 N − 2r+s / log6 N,
l
r − 2r+ β
r
r
1
β
1
β
1
A D˜ l ,λ A Dl ,λ λ− 2 ≤ 10(
p
1 ˜ |− 12 | D | 4r+2β (m| D |−1+ 4r+2β + m 21 | D |− 2 + 4r+2β )| D | 4r+2β C0 + 1)2 m 2 | D
≤ 10(
p
˜ |− 2 | D | C0 + 1)2 (m 2 | D
≤ 20(
p
C0 + 1)2 | D |
= 20(
p
C0 + 1)2 N
1
3
1
r − 2r+ β
r − 2r+ β
−
β+4r −2 4r +2β
1
+ m| D˜ |− 2 | D |
β+1−2r 4r +2β
)
/ log6 | D |
/ log6 N.
Thus, by the above estimates:
J D,D,λ ≤ 40(20( ˜
p
C0 + 1)2 + 1)(
p
C0 + 1)2 N
r − 2r+ β
/ log6 N
Thus: 24m 4 4 24 ≤ 24 log2 ( T )J D,D,λ ˜ (log m ) log δ δ 24 4 4 ≤ 24 (2r + β)−2 (log2 N )J D,D,λ ˜ (log m ) log δ 24 4 ≤ 24 (2r + β)−2 (log6 N )J D,D,λ ˜ log δ p p 24 − r 10 −2 2 ≤ 2 (2r + β) (20( C0 + 1) + 1)( C0 + 1)2 | D | 2r+β log4 δ p p r 4 24 10 −2 2 2 − 2r+ β = 2 (2r + β) (20( C0 + 1) + 1)( C0 + 1) N log , δ
4 log2 ( T )J D,D,λ ˜ log
and: β +1 √ p 1 r r− 1 −1 ˜ |− 12 | D | 4r+2β )| D |− 2r+β ˜ |−1 | D | 2r+β + | D log( T )A D,λ 5( C0 + 1) log | D |(| D ˜ λ 2 ≤ (2r + β ) p p √ √ − r − r ≤ 2 5(2r + β)−1 ( C0 + 1)| D | 2r+β = 2 5(2r + β)−1 ( C0 + 1) N 2r+β .
Putting the above estimates into Theorem 2, we have the desired conclusion (Equation (8)) with: p p p √ √ p C 0 = C ∗ 210 (2r + β)−2 (20( C0 + 1)2 + 1)( C0 + 1)2 + 2 5(2r + β)−1 ( C0 + 1) + 5( C0 + 2) . When r ≥ 12 , we apply Theorem 3 and take the same proof procedure a above. Then, the conclusion (Equation (8)) can be obtained. The proof is completed. 4. Simulation and Conclusions In this section, we provide the simulation to verify our theoretical statements. We assume that the inputs { xi } are independently drawn according to the uniform distribution on [0, 1]. Consider the regression model yi = gρ ( xi ) + ε i , i = 1, · · · , N, where ε i is the independent Gaussian noise N (0, 0.12 ) and: ( x, if 0 < x ≤ 0.5, gρ ( x ) = 1 − x, if 0.5 < x ≤ 1. Define the pairwise kernel K : X 2 × X 2 → R by K (( x, u), ( x 0 , u0 )) := K1 ( x, x 0 ) + K1 (u, u0 ) − K1 ( x, u0 ) − K1 ( x 0 , u) where: K1 ( x, x 0 ) = 1 + min{ x, x 0 }. We apply the kernel K to the distributed algorithm (Equation (3)). In Figure 1, we plot the mean squared error of Equation (3) for N = 600 and S = 0, 300, 600 when the number of local machines m varies. Note that S = 0, and it is a standard distributed MEE algorithm without unlabeled data. When m becomes large, the red curve increases dramatically. However, when we add 300 or 600 unlabeled data, the error curves begin to increase very slowly. This coincides with our theory that using unlabeled data can enlarge the range of m in the distributed method.
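The data-generating setup of this experiment can be reproduced in a few lines. Everything below follows the stated design (uniform inputs, tent-shaped g_ρ, N(0, 0.1²) noise, and the pairwise kernel built from K₁); the random seed and the use of NumPy are incidental assumptions. Note the antisymmetry K((u,x),·) = −K((x,u),·), which matches the antisymmetric target f_ρ(x,u) = g_ρ(x) − g_ρ(u).

```python
import numpy as np

rng = np.random.default_rng(0)

def g_rho(x):
    # Tent-shaped regression function from the simulation section.
    return np.where(x <= 0.5, x, 1.0 - x)

def k1(a, b):
    return 1.0 + np.minimum(a, b)

def pair_kernel(x, u, xp, up):
    # K((x,u),(x',u')) = K1(x,x') + K1(u,u') - K1(x,u') - K1(x',u)
    return k1(x, xp) + k1(u, up) - k1(x, up) - k1(xp, u)

N = 600
x = rng.uniform(0.0, 1.0, N)
y = g_rho(x) + rng.normal(0.0, 0.1, N)

# Pairwise target f_rho(x, u) = g_rho(x) - g_rho(u), which the distributed
# MEE algorithm estimates in the experiment.
f_rho = g_rho(x[:, None]) - g_rho(x[None, :])
```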
Figure 1. The mean squared errors for the size of unlabeled data S ∈ {0, 300, 600} as the number of local machines m varies.
This paper studied the convergence rate of the distributed gradient descent MEE algorithm in a semi-supervised setting. Our results demonstrated that using additional unlabeled data can improve the learning performance of the distributed MEE algorithm, especially by enlarging the range of m that guarantees the learning rate. As we know, there are many gaps between theory and empirical studies. We regard this paper as mainly a theoretical one and expect that the theoretical analysis gives some guidance for real applications.

Author Contributions: B.W. conceived the presented idea. T.H. developed the theory and performed the computations. All authors discussed the results and contributed to the final manuscript.

Funding: The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], the Natural Science Foundation of Hubei Province in China [No. 2017CFB523], and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY18033].

Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. Proof of Lemma 3

We state two useful lemmas as follows, whose proofs can be found in Reference [22].

Lemma A1. For λ > 0, 0 < η < 1 and j = 1, ..., t−1, we have:

  max{ ‖η (L_K + λ) π^t_{j+1}(L_K)‖, ‖η (L_{K,D̃} + λ) π^t_{j+1}(L_{K,D̃})‖ } ≤ 1/(t−j) + ηλ,

  max{ ‖∑_{j=1}^t η (L_K + λ) π^t_{j+1}(L_K)‖, ‖∑_{j=1}^t η (L_{K,D̃} + λ) π^t_{j+1}(L_{K,D̃})‖ } ≤ 1 + ηλt.

Lemma A2. For any λ > 0 and f* ∈ H_K, there holds:

  ‖f_{t+1,D̃} − f_{t+1}‖_{L²} ≤ 𝓑_{D̃,λ} 𝓒_{D̃,λ} ∑_{i=1}^t [ (t−i)^{−1} + ηλ ] ‖f_i − f*‖_K
    + 𝓑_{D̃,λ} (1 + ηλt) ( 𝓒_{D̃,λ} ‖f*‖_K + 𝓖_{D̃,λ} ) + c_{p,M} (N+S)^{2p+1} N^{−(2p+1)} t^{p+1/2} h^{−2p},    (A1)

and:

  ‖f_{t+1,D̃} − f_{t+1}‖_K ≤ 𝓑_{D̃,λ} 𝓒_{D̃,λ} ∑_{i=1}^t [ (t−i)^{−1} + ηλ ] ‖f_i − f*‖_K / √λ
    + 𝓑_{D̃,λ} (1 + ηλt) ( 𝓒_{D̃,λ} ‖f*‖_K + 𝓖_{D̃,λ} ) / √λ + c_{p,M} (N+S)^{2p+1} N^{−(2p+1)} t^{p+1/2} h^{−2p},    (A2)

where the constant c_{p,M} = 2^{4p+2} c_p C_G^{2p+1} M^{2p+1}.
Proof of Lemma A2. By Equations (10) and (13), we get two error decompositions for f_{t+1,D̃} − f_{t+1}. The first one is:

  f_{t+1,D̃} − f_{t+1} = η ∑_{i=1}^t π^t_{i+1}(L_{K,D̃}) [L_K − L_{K,D̃}](f_i)
    + η ∑_{i=1}^t π^t_{i+1}(L_{K,D̃}) [f_{ρ,D̃} − L_K(f_ρ)] + η ∑_{i=1}^t π^t_{i+1}(L_{K,D̃}) E_{i,D̃},    (A3)

and the second one is:

  f_{t+1,D̃} − f_{t+1} = η ∑_{i=1}^t π^t_{i+1}(L_K) [L_K − L_{K,D̃}](f_{i,D̃})
    + η ∑_{i=1}^t π^t_{i+1}(L_K) [f_{ρ,D̃} − L_K(f_ρ)] + η ∑_{i=1}^t π^t_{i+1}(L_K) E_{i,D̃}.    (A4)

It has been proven in Reference [22] that {f_{t,D̃}} satisfies ‖f_{t,D̃}‖_K ≤ 2 C_G (|D̃|/|D|) M t^{1/2} = 2 C_G (|N+S|/N) M t^{1/2}. It follows from Equation (4) that:

  ‖E_{t,D̃}‖_K ≤ (1/|N+S|²) ∑_{(x_i,y_i)∈D̃} ∑_{(x_j,y_j)∈D̃} ‖ [ G′( (y_i − y_j − f_{t,D̃}(x_i,x_j))² / h² ) − G′(0) ] ( f_{t,D̃}(x_i,x_j) − y_i + y_j ) K_{(x_i,x_j)} ‖_K
    ≤ c_p ( 2M + 2 C_G (|N+S|/N) M t^{1/2} )^{2p+1} h^{−2p}
    ≤ 2^{4p+2} c_p C_G^{2p+1} M^{2p+1} (|N+S|^{2p+1} / N^{2p+1}) t^{p+1/2} h^{−2p}.    (A5)

Then, we can follow the proof procedure in Proposition 1 of Reference [24] to prove Equations (A1) and (A2).

With the help of the lemmas above, we can prove Lemma 3.

Proof of Lemma 3. Applying Equation (A4) with D̃ = D̃_l for l = 1, ..., m, we have that:
  ‖f̄_{T+1,D̃} − f_{T+1}‖_{L²} = ‖ (1/m) ∑_{l=1}^m f_{T+1,D̃_l} − f_{T+1} ‖_{L²}
    ≤ ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m [L_K − L_{K,D̃_l}](f_{i,D̃_l}) ‖_{L²}
    + ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m [f_{ρ,D̃_l} − L_K(f_ρ)] ‖_{L²}
    + ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m E_{i,D̃_l} ‖_{L²}
    := I₁ + I₂ + I₃.

Firstly, we bound I₁, which is the most difficult to handle. It can be decomposed as:

  I₁ ≤ ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m (L_K + λ)^{1/2} [L_K − L_{K,D̃_l}](f_{i,D̃_l} − f_i) ‖_K
    + ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m (L_K + λ)^{1/2} [L_K − L_{K,D̃_l}](f_i − f*) ‖_K
    + ‖ η ∑_{i=1}^T π^T_{i+1}(L_K) (1/m) ∑_{l=1}^m (L_K + λ)^{1/2} [L_K − L_{K,D̃_l}](f*) ‖_K
    := I₁₁ + I₁₂ + I₁₃.

Then, by Lemma A1 and f_{1,D̃_l} = f₁ = 0, it is easy to get that:
T m 1 1
= ∑ η ( LK + λ)πiT+1 ( LK ) ∑ ( LK + λ)− 2 [ LK − LK,D˜ l ]( f i,D˜ l − f i )
i =1 m l =1 K
1 m
T
T − 21 ≤ ∑ η ( LK + λ)πi+1 ( LK ) ∑ ( LK + λ) [ LK − LK,D˜ l ]( f i,D˜ l − f i )
m i =1 l =1
I11
K
T
1
≤ ∑ (ηλ + ( T − i )−1 ) sup ( LK + λ)− 2 [ LK − LK,D˜ l ]( f i,D˜ l − f i ) 1≤ l ≤ m
i =1
K
T
≤ sup
∑ (ηλ + (T − i)−1 )k fi,D˜ l − fi kK CD˜ l ,λ .
1≤ l ≤ m i =1
Applying Proposition A2 with $\tilde D = \tilde D_l$ and $t+1 = i$, we have that:
\[
\|f_{i,\tilde D_l} - f_i\|_K
\le \sum_{s=1}^{i-1} \big[(i-s-1)^{-1} + \lambda\eta\big]\, \|f_s - f^\ast\|_K\, \mathcal{B}_{\tilde D_l,\lambda}\, \mathcal{C}_{\tilde D_l,\lambda}\, \lambda^{-\frac12}
+ (1+\lambda\eta i)\, \mathcal{B}_{\tilde D_l,\lambda} \big(\mathcal{C}_{\tilde D_l,\lambda}\|f^\ast\|_K + \mathcal{G}_{\tilde D_l,\lambda}\big)\lambda^{-\frac12}
+ c_{p,M}\, \frac{|N+S|^{2p+1}}{N^{2p+1}}\, i^{p+1/2} h^{-2p}.
\]
Thus:
\[
\begin{aligned}
I_{11} \le \sup_{1\le l\le m} \sum_{i=1}^{T} \big(\eta\lambda + (T-i)^{-1}\big)\, \mathcal{C}_{\tilde D_l,\lambda} \times \bigg\{
&\sum_{s=1}^{i-1} \big[(i-s-1)^{-1} + \lambda\eta\big]\, \|f_s - f^\ast\|_K\, \mathcal{B}_{\tilde D_l,\lambda}\, \mathcal{C}_{\tilde D_l,\lambda}\, \lambda^{-\frac12} \\
&+ (1+\lambda\eta i)\, \mathcal{B}_{\tilde D_l,\lambda} \big(\mathcal{C}_{\tilde D_l,\lambda}\|f^\ast\|_K + \mathcal{G}_{\tilde D_l,\lambda}\big)\lambda^{-\frac12}
+ c_{p,M}\, \frac{|N+S|^{2p+1}}{N^{2p+1}}\, i^{p+1/2} h^{-2p} \bigg\}.
\end{aligned} \tag{A6}
\]
By Lemma A1 again, we have:
\[
\begin{aligned}
I_{12} &\le \Big\| \sum_{i=1}^{T} \eta (L_K+\lambda)\,\pi_i^{T+1}(L_K)\, \frac{1}{m}\sum_{l=1}^{m} (L_K+\lambda)^{-\frac12} [L_K - L_{K,\tilde D_l}](f_i - f^\ast) \Big\|_K \\
&\le \sum_{i=1}^{T} \big\| \eta (L_K+\lambda)\,\pi_i^{T+1}(L_K) \big\|\, \Big\| \frac{1}{m}\sum_{l=1}^{m} (L_K+\lambda)^{-\frac12} [L_K - L_{K,\tilde D_l}] \Big\|\, \|f_i - f^\ast\|_K \\
&\le \sum_{i=1}^{T} \big(\eta\lambda + (T-i)^{-1}\big)\, \mathcal{D}_{\tilde D,\lambda}\, \|f_i - f^\ast\|_K,
\end{aligned} \tag{A7}
\]
and:
\[
I_{13} \le \Big\| \sum_{i=1}^{T} \eta (L_K+\lambda)\,\pi_i^{T+1}(L_K)\, \frac{1}{m}\sum_{l=1}^{m} (L_K+\lambda)^{-\frac12} [L_K - L_{K,\tilde D_l}] \Big\|\, \|f^\ast\|_K
\le (1+\lambda\eta T)\, \mathcal{D}_{\tilde D,\lambda}\, \|f^\ast\|_K. \tag{A8}
\]
This completes the estimate of $I_1$ with Equations (A6), (A7), and (A8). Now, we turn to bound $I_2$. By the definition of $\mathcal{F}_{\tilde D,\lambda}$, the bound (Equation (A5)) of $E_{t,\tilde D}$, and Lemma A1, we obtain that:
\[
I_2 \le (1+\eta\lambda T)\, \mathcal{F}_{\tilde D,\lambda},
\]
and:
\[
I_3 \le c_{p,M}\, \frac{|\tilde D|^{2p+1}}{|D|^{2p+1}}\, T^{p+3/2} h^{-2p}
= c_{p,M}\, \frac{|N+S|^{2p+1}}{N^{2p+1}}\, T^{p+3/2} h^{-2p}.
\]
Together with the bounds of Equations (A6), (A7), and (A8), we can get the desired conclusion (Equation (15)).

References

1. Principe, J.C. Renyi's entropy and kernel perspectives. In Information Theoretic Learning; Springer: New York, NY, USA, 2010.
2. Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the International Conference on ICA and Signal Separation; Springer: Berlin, Germany, 2000; pp. 75–90.
3. Erdogmus, D.; Hild, K.; Principe, J.C. Blind source separation using Renyi's α-marginal entropies. Neurocomputing 2002, 49, 25–38. [CrossRef]
4. Erdogmus, D.; Principe, J.C. Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. IEEE Trans. Signal Process. 2003, 51, 1966–1978. [CrossRef]
5. Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2002, 24, 158–171. [CrossRef]
6. Silva, L.M.; Marques, J.; Alexandre, L.A. Neural network classification using Shannon's entropy. In Proceedings of the European Symposium on Artificial Neural Networks; D-Side: Bruges, Belgium, 2005; pp. 217–222.
7. Silva, L.M.; Marques, J.; Alexandre, L.A. The MEE principle in data classification: A perceptron-based analysis. Neural Comput. 2010, 22, 2698–2728. [CrossRef]
8. Choe, Y. Information criterion for minimum cross-entropy model selection. arXiv 2017, arXiv:1704.04315.
9. Ying, Y.; Zhou, D.X. Online pairwise learning algorithms. Neural Comput. 2016, 28, 743–777. [CrossRef] [PubMed]
10. Zhang, Y.; Duchi, J.C.; Wainwright, M.J. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2013, 30, 592–617.
11. Chapelle, O.; Zien, A. Semi-Supervised Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2006.
12. Zhang, T. Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 2005, 17, 2077–2098. [CrossRef] [PubMed]
13. Caponnetto, A.; Vito, E.D. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368. [CrossRef]
14. Steinwart, I.; Hush, D.R.; Scovel, C. Optimal rates for regularized least squares regression. In Proceedings of COLT 2009—The Conference on Learning Theory, Montreal, QC, Canada, 18–21 June 2009.
15. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
16. Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Probl. 2017, 33, 074009. [CrossRef]
17. Guo, Z.C.; Shi, L.; Wu, Q. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res. 2017, 18, 4237–4261.
18. Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097.
19. Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276. [CrossRef]
20. Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315. [CrossRef]
21. Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514.
22. Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Unpublished work, 2018.
23. Guo, X.; Hu, T.; Wu, Q. Distributed minimum error entropy algorithms. Unpublished work, 2018.
24. Wang, B.; Hu, T. Distributed pairwise algorithms with gradient descent methods. Unpublished work, 2018.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).