Adaptive Multiregression in Reproducing Kernel Hilbert Spaces: The Multiaccess MIMO Channel Case
Konstantinos Slavakis, Member, IEEE, Pantelis Bouboulis, Member, IEEE, and Sergios Theodoridis, Fellow, IEEE
Abstract— This paper introduces a wide framework for online, i.e., time-adaptive, supervised multiregression tasks. The problem is formulated in a general infinite-dimensional reproducing kernel Hilbert space (RKHS). In this context, a fairly large number of nonlinear multiregression models fall as special cases, including the linear case. Any convex, continuous, and not necessarily differentiable function can be used as a loss function in order to quantify the disagreement between the output of the system and the desired response. The only requirement is that the subgradient of the adopted loss function be available in analytic form. To this end, we demonstrate a way to calculate the subgradients of robust loss functions, suitable for the multiregression task. As it is by now well documented, when dealing with online schemes in RKHS, the memory keeps increasing with each iteration step. To attack this problem, a simple sparsification strategy is utilized, which leads to an algorithmic scheme of linear complexity with respect to the number of unknown parameters. A convergence analysis of the technique, based on arguments of convex analysis, is also provided. To demonstrate the capacity of the proposed method, the multiregressor is applied to the multiaccess multiple-input multiple-output channel equalization task for a setting with poor resources and nonavailable channel information. Numerical results verify the potential of the method, when its performance is compared with those of the state-of-the-art linear techniques, which, in contrast, use space-time coding, more antenna elements, as well as full channel information.

Index Terms— Adaptive kernel learning, convex analysis, multiple-input multiple-output channel equalization, projection, regression, subgradient.
I. INTRODUCTION
KERNEL methods have become indispensable tools for modern classification and regression tasks [1]–[8]. Their appeal to the machine learning and signal processing community is based on the mathematically sound way [9] that they offer in order to map a task from a low-dimensional
Euclidean data space to a high-, possibly infinite-, dimensional feature space. The most remarkable thing is that kernel methods operate in the high-dimensional space inexpensively: the celebrated kernel trick [1]–[3] guarantees that the inner products in the high-dimensional feature space are performed by means of simple kernel function evaluations on the low-dimensional Euclidean data space. The mainstream of kernel methods relates to batch processing, i.e., all the necessary data are available beforehand. The most celebrated kernel method for batch processing is the support vector machine (SVM) scheme [1]–[3]. All the available data are used in order to form a constrained quadratic optimization task, posed usually in a dual space, by exploiting the intimate connection of optimization theory and the classical concept of the Lagrangian functions [1]–[3]. Recently, significant effort has been devoted to the development of online or time-adaptive kernel methods, i.e., the case where data excite a possibly time-varying system in a sequential fashion. Motivated by the success of SVM, a large number of online schemes have been devised as variants of the SVM framework [10]–[16]. Another path for kernel methods, which springs from the classical adaptive filtering theory [17], [18], has been followed in [19] and [20]. The common point of the studies in [19] and [20] is the use of a quadratic function as the measure of loss in a regression task. The reason for adopting such a loss function stems from the following two benefits granted in the classical adaptive filtering theory [17], [18]: 1) the quadratic function, being the most celebrated differentiable loss function, is easy to handle, and 2) there exists a plethora of fast recursive realizations [17], [21] which can be straightforwardly extended to the kernel context. A different path for online kernel methods has been very recently developed in [22]–[24], emanating from the advances in adaptive projection-based algorithms [25]–[30]. Following the same path as in [22]–[24], this paper introduces a wide framework for adaptive multiregression in reproducing kernel Hilbert spaces (RKHS). Any convex function can be used as a measure of loss in the regression task. Differentiability is no longer a necessary condition, and the only requirement is for the subgradients of the loss function to take an analytic form. To infuse robustness into the design, for every loss function we adopt, its ε-insensitive version is utilized, because of its attractive features widely known in robust statistics [1], [31]. To learn by example,
this paper demonstrates a way to calculate the set of all the subgradients of the ε-insensitive versions of two loss functions: a differentiable and a nondifferentiable one. The proposed algorithmic procedure takes a simple recursive form that builds upon the information within the subgradients of the adopted ε-insensitive loss function. The algorithm gives us the freedom of concurrently processing a (user-defined) finite number of loss functions, thus paving the way for efficient implementation via parallel processing. It is well known that, when dealing with online schemes in high- or even infinite-dimensional RKHS, the memory keeps increasing with the sequentially incoming data. To overcome such a computational obstacle, this paper employs sparsification by means of a sliding buffer and a closed convex set into which we constrain our sequence of multiregressors. The way to incorporate such a constraint into the algorithmic procedure is via the fundamental functional analytic tool of the metric projection mapping onto closed convex sets. Overall, the resulting algorithm takes a compact form with a simple geometric representation; it is nothing but the concatenation of a subgradient and a metric projection mapping. Such a geometric analog results in low computational complexity, namely, linear with respect to the number of unknowns. This paper also offers a theoretical analysis of the algorithm via arguments of convex analysis, different from the celebrated dual space and stochastic gradient approaches [1]–[3], [8], [10]–[16], as well as from the methodologies met in classical adaptive filtering theory [17]–[20]. To demonstrate the potential of the proposed method, an adaptive multiregression task over a demanding multiple-input multiple-output (MIMO) channel [32]–[35], with multiple users (multiaccess), is considered. Batch kernel methods for a single-user MIMO channel, based on the SVM methodology with a quadratic loss function, are given in [36] and [37]. The proposed algorithm is validated against the state-of-the-art linear receivers [33] that employ the highly efficient orthogonal space-time block codes (OSTBC) [32], [38], [39], a strategy that optimally utilizes the diversity offered by the multiple transmit antennas in the channel. Moreover, our method is tested against the recently introduced kernel RLS [20]. The numerical results demonstrate that the kernel methods outperform the linear receivers in demanding channel conditions, i.e., many users, which implies a high-interference environment, with poor hardware resources. In addition, the proposed method results in a similar performance to [20], albeit with lower complexity, i.e., linear with respect to the number of unknowns as opposed to the quadratic one of [20]. This paper is organized as follows. The online multiregression task is introduced in Section II, where the main differences with its well-known batch counterpart are pointed out. The definition of the RKHS, i.e., our stage of discussion, is given in Section III. Section IV introduces the way to form the ε-insensitive loss functions, and provides a way to calculate the subgradients of loss functions usually met in regression tasks. Our algorithm is presented in Section V, where sparsification is also addressed. The convergence analysis is given in Section VI. Section VII discusses the computational complexity issues. The multiaccess MIMO channel model is defined
in Section VIII, and the related numerical results are given in Section IX. The appendixes address all the mathematical details that are necessary for a rigorous discussion. Finally, all the claims and propositions are sorted in the order of appearance, in order to help the reader easily navigate throughout the text.

We will denote the set of all integers, nonnegative integers, positive integers, real, and complex numbers by Z, N, N∗, R, and C, respectively. For any integers j1 ≤ j2, we denote by j1, j2 the set {j1, j1 + 1, . . . , j2}. Vector- and matrix-valued quantities appear in boldfaced symbols.

II. ONLINE MULTIREGRESSION PROBLEM

This paper considers the case of supervised learning, i.e., the scenario where a sequence of input data (xn)n∈N ⊂ RQ, Q ∈ N∗, and a sequence of desired responses (yn)n∈N ⊂ RL, L ∈ N∗, are available to the designer. A multiregressor is a mapping f : RQ → RL : xn → f(xn) such that, given a pair (xn, yn), the "disagreement" or "discrepancy" f(xn) − yn, between the image of the multiregressor and the desired response, obtains some small value, via a user-defined loss function. To capture the dynamic nature of the training data, and to comply with the spirit of classical time-adaptive filtering [17], [18], we allow our multiregressors to be time-varying. For such a reason, all the training data (xn, yn)n∈N, as well as all the sequences of multiregressors (fn)n∈N, that appear in this paper are indexed by n ∈ N, which stands for discrete time. Let us denote the set of all candidates of the mapping f, or better (fn)n∈N, by H. More details on the structure of H will be given in Section III. Choose any convex, continuous, and not necessarily differentiable function l : RL → R, which quantifies the designer's perception of "disagreement" between the output f(x) ∈ RL and the desired response y ∈ RL. Choose also a nonnegative ε in order to obtain the ε-insensitive version of l:

l_ε(ξ) := max{0, l(ξ) − ε}, ∀ξ ∈ RL
(1)
Notice that l_ε is a nonnegative function. The function l_ε scores a penalty only for those ξ where l(ξ) > ε. It is easy to verify also that the points not penalized by l_ε are all those ξ such that l(ξ) − ε ≤ 0, assuming, of course, that such ξ exist. Equivalently, the set of global minimizers of (1) is exactly the 0th level set lev≤0 l_ε := {ξ ∈ RL : l_ε(ξ) ≤ 0}. In other words, we are not looking for a unique optimum point, but we abide by the set theoretic estimation approach [40], where all points that achieve a predefined tolerance via l, up to ε, are equally qualified as candidates for "optimal" points. Illustrations of such functions can be found in Fig. 1(a) and (b), where l_ε takes the place of Θ, and RL that of H. Given the sequence of training data (xn, yn)n∈N, we use the loss function l_ε to measure the instantaneous disagreement f(xn) − yn. Notice that the training data xn, yn are the parameters in such a formalization. Hence, a sequence of loss functions (L_{ε,n})n∈N, defined on H, is formed as

L_{ε,n}(f) := l_ε(f(xn) − yn)
∀f ∈ H, ∀n ∈ N.
(2)
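To make the constructions in (1) and (2) concrete, the following short Python sketch evaluates an ε-insensitive loss for a given training pair. The choice of the ℓ1-norm as the base loss l, the toy multiregressor f, and all numerical values are illustrative assumptions and not part of the formal development.

```python
import numpy as np

def eps_insensitive(l, xi, eps):
    """epsilon-insensitive version of a base loss l, cf. (1)."""
    return max(0.0, l(xi) - eps)

# Illustrative base loss: the l1-norm (any convex, continuous l would do).
l1_loss = lambda xi: np.sum(np.abs(xi))

def L_eps(f, x, y, eps, l=l1_loss):
    """Loss of a multiregressor f at the training pair (x, y), cf. (2)."""
    return eps_insensitive(l, f(x) - y, eps)

# Toy example: a fixed (hypothetical) multiregressor f : R^2 -> R^2.
f = lambda x: np.array([x[0] + x[1], x[0] - x[1]])
x_n, y_n = np.array([0.5, -0.2]), np.array([0.31, 0.69])
print(L_eps(f, x_n, y_n, eps=0.01))   # zero only if l(f(x_n) - y_n) <= eps
```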
In this paper, all the loss functions L_{ε,n} will be assumed convex. It is easy to verify by (1) that the minimizers of such
a loss function are

arg min_{f∈H} L_{ε,n}(f) = {f ∈ H : l(f(xn) − yn) ≤ ε} = lev≤0 L_{ε,n}.

An illustration can be found in Fig. 1(a) and (b), where L_{ε,n} takes the place of Θ.

Problem 1 (Online Multiregression Problem): Given the sequence of data (xn, yn)n∈N ⊂ RQ × RL, let a sequence of convex, continuous, and not necessarily differentiable loss functions (L_{ε,n})n∈N. Find, then, a sequence of multiregressors, i.e., a sequence (fn)n∈N, such that (L_{ε,n}(fn))n∈N converges to some small value.

Remark 2 (Batch Multiregression Problem): The classical multiregression problem is usually stated in a batch form, i.e., given a finite set of training data (xn, yn)_{n=0}^{N−1} ⊂ RQ × RL, for some N ∈ N∗, and a user-defined loss function l, find a mapping f that minimizes the following empirical loss:

(1/N) Σ_{n=0}^{N−1} Ln(f) = (1/N) Σ_{n=0}^{N−1} l(f(xn) − yn).

A well-known example for l, used extensively by the machine learning community due to its computational tractability [1], [2], [19], [20], is the quadratic loss function. An extended version of the batch multiregression problem, for a quadratic loss function, is its regularized version [41]: find an

f ∈ arg min_{f̂} (1/N) Σ_{n=0}^{N−1} ‖f̂(xn) − yn‖²_{RL} + η‖f̂‖²    (3)

where η > 0 is the regularization parameter, ‖·‖_{RL} denotes the standard Euclidean norm in RL, and ‖f̂‖ stands for the norm, in some sense, of the mapping f̂. Such a regularization term is introduced for accommodating sparsification issues of the solution. In other words, apart from the term that refers to the disagreement between the output and the desired response, the regularization term η‖f̂‖² forces the optimization method to search for solutions which have, also, a small norm or "size." This acts beneficially against overfitting [1], [2]. Usually, batch techniques, e.g., SVMs, mobilize an optimization method in order to calculate a minimizer f of the empirical loss, based on all the training data received at the time instants 0, N − 1. However, such an approach is not appropriate for a dynamic and computationally efficient updating process of the minimizer: for a newly arrived training pair (xN, yN), the optimization method should start from scratch. For this reason, novel computationally efficient online techniques need to be devised, following the spirit of the classical time-adaptive filtering [17], [18], in order to take advantage of the dynamic nature of both the training data and the multiregressor function. In the next section, we introduce a structure to the set H, which contains all the candidates for the regressor f. Such a structure models a fairly large number of nonlinearities, which are used extensively in machine learning and signal processing applications. Moreover, it contains also, as a special case, the scenario of linear regressors, i.e., the case where the mapping f is linear.

III. RKHS

Henceforth, the symbol H will stand for a possibly infinite-dimensional real Hilbert space [42], equipped with an inner product denoted by ⟨·, ·⟩, and an induced norm ‖·‖ := √⟨·, ·⟩. Assume an H whose elements are functions defined on RQ, i.e., each point in H is a function f : RQ → R.

Definition 3 (RKHS): The real Hilbert space H is called an RKHS [1], [2], [9] whenever, for an arbitrarily fixed x ∈ RQ, the mapping f → f(x) is continuous on H. This condition is equivalent to the existence of a (unique) reproducing kernel function κ(·, ·) : RQ × RQ → R which satisfies the following: 1) κ(x, ·) ∈ H, ∀x ∈ RQ; 2) the following reproducing property holds:

f(x) = ⟨f, κ(x, ·)⟩,  ∀x ∈ RQ, ∀f ∈ H.    (4)
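The reproducing property (4) is what makes the computations of the later sections tractable: functions built as finite combinations of kernel sections are evaluated, and their inner products computed, through kernel evaluations alone. The following minimal Python sketch illustrates this for the Gaussian kernel; the dimension Q = 3, the number of centers, and the random coefficients are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), a reproducing kernel on R^Q."""
    d = np.asarray(x) - np.asarray(xp)
    return float(np.exp(-d @ d / (2.0 * sigma**2)))

# A function in the RKHS built as a finite linear combination of kernel sections,
# f = sum_j gamma_j * kappa(x_j, .); such expansions are dense in H.
rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 3))   # x_j in R^Q with Q = 3 (illustrative)
gammas = rng.standard_normal(5)

def f_eval(x):
    # By the reproducing property (4), evaluating f at x needs only kernel values.
    return sum(g * gaussian_kernel(c, x) for g, c in zip(gammas, centers))

# Likewise, the inner product of two such expansions reduces to kernel evaluations
# alone: <f, g> = sum_{j,k} gamma_j beta_k kappa(x_j, z_k) (the "kernel trick").
print(f_eval(np.zeros(3)))
```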
In addition, it turns out that H = span{κ(x, ·) : x ∈ RQ}, where the symbol span stands for the set of all linear combinations of the elements of a set, and the overline symbol denotes the closure, in the strong topology, of a set. Celebrated examples of reproducing kernels, with applications that span from pattern analysis and recognition to equalization, identification, and sampling in signal processing, are the linear kernel, the polynomial kernel, and the Gaussian kernel [1]–[3], as well as the sinc function. Now, assume a number of L RKHS H1, H2, . . . , HL, and define the Cartesian product space H := H1 × · · · × HL. An element of H can apparently be written as f := [f1, . . . , fL]^t, where f1 ∈ H1, . . . , fL ∈ HL, and the superscript (·)^t stands for vector/matrix transposition. To make H a Hilbert space, we need to define an inner product. To do so, and in order to take advantage of the inter-relations between the multiple functional RKHS spaces, we let P = [pij] ∈ RL×L be a positive definite matrix such that pij = 0 for all those i, j ∈ 1, L with Hi ≠ Hj. For example, if there are no identical spaces in H1, . . . , HL, then the positive definite matrix P becomes a diagonal one. Now, we define the following inner product: for any f1, f2 ∈ H,

⟨f1, f2⟩ := Σ_{i,j=1}^L pij ⟨f1i, f2j⟩.

The induced norm in H will be denoted by |||·||| := √⟨·, ·⟩. In general, the Hilbert space H need not be an RKHS. Whenever the discussion is constrained to the Euclidean space RL, the following inner product will be considered: ⟨ξ1, ξ2⟩_Q := ξ1^t Q ξ2, ∀ξ1, ξ2 ∈ RL, and the induced norm will be ‖·‖_Q := √⟨·, ·⟩_Q, where Q = [qij] ∈ RL×L is a user-defined positive definite matrix. If Q becomes the identity matrix, i.e., Q = IL, then the norm ‖·‖_Q becomes the standard unweighted Euclidean norm of RL, which we will denote by ‖·‖_{RL} [see, for example, (3)].

IV. DESIGNING THE LOSS FUNCTIONS

Definition 4 (Loss Function Design): The strategy for constructing loss functions for the online multiregression problem contains three steps.
1) Choose any convex continuous function l : RL → R. The function need not be differentiable (see Appendix XI). Since l is chosen to be convex and continuous, its subgradients (see Definition 5) are guaranteed to exist [43], [44]. The only requirement on the choice of l is for its subgradients to be known in analytic form.
2) Form the ε-insensitive version of l as follows: l_ε(ξ) := max{0, l(ξ) − ε}, ∀ξ ∈ RL.
3) Given a pair of training data (x, y) ∈ RQ × RL, define the loss function L_ε : H → R as follows: L_ε(f) := l_ε(f(x) − y), ∀f ∈ H. Notice that L_ε is convex (for the reason, see Example 18 in Appendix XI).

An illustration of the convex loss function L_ε is given in Fig. 1(a) and (b), where L_ε takes the place of Θ.

Fig. 1. (a) Geometric illustration of a differentiable and a nondifferentiable convex function Θ : H → R, with a nonempty zeroth level set lev≤0 Θ := {f ∈ H : Θ(f) ≤ 0}. The figures also demonstrate the concepts of the differential and the subgradient of Θ at a point f. (b) The notion of the subgradient projection mapping TΘ with respect to Θ is also illustrated (for more details, see Appendix IX).

The notion of the subdifferential is becoming increasingly popular these days, as convex optimization tools are diffused more and more into the machine learning and signal processing community. It is by now well established that, in optimization tasks, the gradient direction guarantees a path toward the optimal point. When the function is not differentiable, the subgradient is a generalization of the gradient and it also indicates a path toward an optimal point.

Definition 5 (Subdifferential, Subgradient): Given a continuous convex function Θ : H → R, we define the subdifferential of Θ at f as the following set:

∂Θ(f) := {g ∈ H : ⟨h − f, g⟩ + Θ(f) ≤ Θ(h), ∀h ∈ H}.
Any element of ∂Θ(f) is called a subgradient of Θ at f, and will be denoted by Θ'(f). In other words, Θ'(f) ∈ ∂Θ(f). Note that for any continuous convex function Θ we can always define a subgradient, since for such a function we have ∂Θ(f) ≠ ∅, ∀f ∈ H [43], [44]. Now, in the case where the function Θ is (Fréchet) differentiable (see Definition 14), Θ has a unique subgradient at f, which is nothing but the (Fréchet) differential Θ'(f). Put in geometrical terms, both the differential and a subgradient Θ'(f) of Θ at f produce the following hyperplane, which supports the graph of Θ at f: {(h, ⟨h − f, Θ'(f)⟩ + Θ(f)) : h ∈ H} [see Fig. 1(a) and (b)]. If f ∉ lev≤0 Θ, then this hyperplane splits H into two parts, and lev≤0 Θ is contained in one of them, namely, the closed halfspace H−, which definitely does not contain f [see Fig. 1(a) and (b)]. The intimate connection of the subgradient and the optimal points associated with a given loss function can be verified by the following well-known fact.

Fact 6 (Subgradient and Minimization [43], [44]): For any convex function Θ : H → R, the following holds: 0 ∈ ∂Θ(f) ⇔ f ∈ arg min_{f̂∈H} Θ(f̂).

In what follows, we provide a couple of examples of loss functions and their subdifferentials in closed form. The kick-off point is well-known convex loss functions l, used extensively in optimization tasks for machine learning and signal processing applications, and defined on classical finite-dimensional Euclidean spaces RL. However, since in kernel methods infinite-dimensional spaces are involved, our objective is the derivation of the subdifferential of loss functions defined on general infinite-dimensional Hilbert spaces H, and, more specifically, of their ε-insensitive versions L_ε. Hence, although the subdifferential of several loss functions l can be found in various contexts of convex analysis (for example, the subdifferential of the ℓ∞(L) or Tchebycheff norm can be found in [43, p. 215]), here, our objective is to introduce a way to calculate the subdifferential of L_ε.

Lemma 7 (ℓ1(L)-Norm ε-Insensitive Loss Function): Choose the ℓ1(L)-norm loss as the function l in Definition 4, i.e., let l(ξ) := ‖ξ‖_1 := Σ_{j=1}^L |ξj|, ∀ξ := [ξ1, . . . , ξL]^t ∈ RL. Then, the loss function L_ε of Definition 4 becomes L_ε(f) := max{0, ‖f(x) − y‖_1 − ε}, ∀f ∈ H. The subdifferential of this loss function is given in Table I.
Proof: The computation of the subdifferential in Table I can be found in Appendix IX.

Lemma 8 (Quadratic ε-Insensitive Loss Function): Choose the quadratic loss as the function l in Definition 4, i.e., let l(ξ) := ‖ξ‖²_Q := ξ^t Q ξ, ∀ξ ∈ RL, where Q = [qij] ∈ RL×L is a user-defined positive definite matrix. Then, the loss function L_ε of Definition 4 becomes L_ε(f) :=
max{0, ‖f(x) − y‖²_Q − ε}, ∀f ∈ H. The subdifferential of this loss function L_ε is given in Table II.
Proof: This proof follows similar steps to the proof of Lemma 7 and it is omitted.

TABLE I
SET OF ALL SUBGRADIENTS, i.e., THE SUBDIFFERENTIAL, OF THE ℓ1(L)-NORM ε-INSENSITIVE LOSS FUNCTION OF LEMMA 7. HERE, Jf := {j ∈ 1, L : fj(x) − yj = 0}, AND ν STANDS FOR THE CARDINALITY OF Jf, WHENEVER Jf ≠ ∅. THE conv SYMBOL STANDS FOR THE CONVEX HULL OF A SET [43]

Case ‖f(x) − y‖_1 < ε:
∂L_ε(f) = {0}.
Case ‖f(x) − y‖_1 > ε, Jf = ∅:
∂L_ε(f) = {P^{-1} [sgn(f1(x) − y1)κ1(x, ·), . . . , sgn(fL(x) − yL)κL(x, ·)]^t}.
Case ‖f(x) − y‖_1 > ε, Jf ≠ ∅:
∂L_ε(f) = conv{P^{-1}u_1, . . . , P^{-1}u_{2^ν}}, where the components of the vectors u_k, ∀k ∈ 1, 2^ν, are given by u_{k,j} := sgn(fj(x) − yj)κj(x, ·) if j ∉ Jf, and u_{k,j} := ±κj(x, ·) if j ∈ Jf.
Case ‖f(x) − y‖_1 = ε, Jf = ∅:
∂L_ε(f) = {γP^{-1} [sgn(f1(x) − y1)κ1(x, ·), . . . , sgn(fL(x) − yL)κL(x, ·)]^t : γ ∈ [0, 1]}.
Case ‖f(x) − y‖_1 = ε, Jf ≠ ∅:
∂L_ε(f) = conv{0, P^{-1}u_1, . . . , P^{-1}u_{2^ν}}.

TABLE II
SUBDIFFERENTIAL OF THE QUADRATIC ε-INSENSITIVE LOSS FUNCTION

Case ‖f(x) − y‖²_Q < ε:
∂L_ε(f) = {0}.
Case ‖f(x) − y‖²_Q > ε:
∂L_ε(f) = {2P^{-1} [Σ_{j=1}^L q_{j1}(fj(x) − yj)κ1(x, ·), . . . , Σ_{j=1}^L q_{jL}(fj(x) − yj)κL(x, ·)]^t}.
Case ‖f(x) − y‖²_Q = ε:
∂L_ε(f) = {2γP^{-1} [Σ_{j=1}^L q_{j1}(fj(x) − yj)κ1(x, ·), . . . , Σ_{j=1}^L q_{jL}(fj(x) − yj)κL(x, ·)]^t : γ ∈ [0, 1]}.

The subdifferentials for the ε-insensitive versions of a variety of loss functions, suited not only for multiregression but also for classification tasks, can be straightforwardly calculated by following the same steps as in Appendix IX. The details and a full list of such subdifferentials for a large variety of loss functions will be reported elsewhere.

V. ALGORITHM

A. Preliminaries

The algorithmic scheme that will be developed shows similarities with the classical Newton–Raphson algorithm [45] for root finding. Given a differentiable function Θ : R → R : x → Θ(x), and an arbitrary starting point x0 ∈ R, the recursion, ∀k ∈ N,

x_{k+1} := x_k − [Θ(x_k)/(Θ'(x_k))²] Θ'(x_k) = x_k − Θ(x_k)/Θ'(x_k)

under certain conditions, converges to a root of Θ, i.e., to an x∗ such that Θ(x∗) = 0. This classical scheme has a nice geometric interpretation, shown in Fig. 2.

Fig. 2. Illustration of the classical Newton–Raphson algorithm for finding the root of a function Θ. The point x_{k+1} is the metric projection of x_k onto the intersection of the horizontal axis with the line {(x, (x − x_k)Θ'(x_k) + Θ(x_k)) : x ∈ R}. Notice the similarity of this line with the supporting hyperplane introduced in Definition 5.
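Before turning to Polyak's extension of this scheme, the following sketch indicates how one member of the subdifferential of Table I could be selected numerically, which is all that the algorithm of Section V-B will require. The representation (an L-by-L coefficient matrix acting on the implicit kernel sections κj(x, ·)) and the tie-breaking rule are our own illustrative choices, not prescriptions of the paper.

```python
import numpy as np

def l1_eps_subgrad(fx, y, eps, P_inv):
    """
    One valid subgradient of the l1(L)-norm eps-insensitive loss at f (Table I),
    encoded by an L-by-L coefficient matrix G: the i-th component of the chosen
    subgradient is sum_j G[i, j] * kappa_j(x, .), the kernel sections being implicit.
    Ties (f_j(x) = y_j) are resolved with coefficient 0, which lies in the convex
    hull prescribed by the table; the zero element is returned on the zero-penalty set.
    """
    r = np.asarray(fx, dtype=float) - np.asarray(y, dtype=float)
    L = r.size
    if np.sum(np.abs(r)) <= eps:
        return np.zeros((L, L))
    return P_inv * np.sign(r)[np.newaxis, :]   # G[i, j] = [P^{-1}]_{ij} sgn(f_j(x) - y_j)

# Illustrative use with P = I_L and a toy residual; the third component is a tie.
print(l1_eps_subgrad(fx=[0.4, -0.1, 0.0], y=[0.0, 0.0, 0.0], eps=0.2, P_inv=np.eye(3)))
```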
Departing from the previous classical scheme, Polyak [46] introduced an algorithm for nondifferentiable convex functions Θ : H → R : f → Θ(f), where H is a real Hilbert space, and where the subgradient is used in place of the gradient. Assuming that the minimum Θ∗ of the function Θ is known, Polyak used his extension to locate a point f∗ where Θ(f∗) = Θ∗, and proposed the following scheme for an arbitrarily fixed f0 ∈ H:

∀k ∈ N, f_{k+1} := f_k − λ_k [(Θ(f_k) − Θ∗) / ‖Θ'(f_k)‖²] Θ'(f_k)

where Θ'(f_k) denotes a subgradient of Θ at f_k, and λ_k ∈ [ε, 2 − ε], for some sufficiently small ε > 0. Although, in general, knowing Θ∗ poses a problem, in the case where one deals with nonnegative penalty functions (see Section IV), we know that the minimum is zero, and it is attained at any point in the respective lev≤0 Θ. Hence, Θ∗ = 0. The nice geometric interpretation, illustrated in Fig. 2, is still retained in Fig. 1(a) and (b). We stress here that both the Newton–Raphson and Polyak algorithms deal with the case of a fixed loss function Θ. This is the reason for using the index k ∈ N in both the Newton–Raphson and Polyak algorithms, as opposed to the index n ∈ N, which is reserved to indicate time instants. However, our focus is on potentially time-varying systems, and Polyak's algorithm, which employs a fixed loss function, is no longer suitable for the dynamic scenario considered in this paper. In the online supervised setting, at each time instant n ∈ N, a measurement pair (xn, yn) ∈ RQ × RL is received. To quantify the designer's perception of loss with respect to the training pair (xn, yn), the loss function L_{ε,n} was introduced in (2). In order to speed up convergence, we are going to consider, concurrently, data points that were received at time instants previous to the current one, n. To this end, given a user-defined positive integer q ∈ N∗, and for every time instant n, we define the following sliding window on the time axis of size at most q: Jn := max{0, n − q + 1}, n. Every j ∈ Jn is associated to the loss function L_{ε,j} [see (2)], which is, in turn,
determined by the training data (xj, yj). The set Jn indicates those loss functions to be concurrently processed at the time instant n. We define, now, Θn as a weighted average of the loss functions {L_{ε,j}}_{j∈Jn} (see Appendix IX). In other words, instead of a fixed Θ, as in the cases of the Newton–Raphson and Polyak algorithms, as well as in the batch multiregression problem (see Remark 2), the online setting imposes a sequence (Θn)n∈N on the design. Every penalty function Θn defines the set lev≤0 Θn. All points that lie in the set lev≤0 Θn would score a zero penalty for {(xj, yj)}_{j∈Jn}. The aim is to move toward the convex set lev≤0 Θn (property set, zero-penalty area) associated with {(xj, yj)}_{j∈Jn}. Our final objective, therefore, is to produce a sequence of multiregressors (fn)n∈N which asymptotically minimizes the sequence of loss functions (Θn)n∈N. It turns out that this can be viewed as the task of trying to find a point that lies in the intersection of the aforementioned property sets. This ties up our approach with the classical projections onto convex sets (POCS) theory [47], [48]. However, now, the number of the involved convex sets is infinite and not finite, as required by the classical theory. The extension of the Polyak algorithm for the asymptotic minimization of a sequence of loss functions (Θn)n∈N was given by the adaptive projected subgradient method (APSM) [26], [27]. It was motivated by projection-based adaptive algorithms [25], which stand as generalizations of POCS [47], [48], in the case of an infinite number of convex sets. Recent extensions of the APSM for constrained online learning tasks were presented in [28] and [29], and applications can be found in [24], [30], [49], and [50]. The following section builds upon the arguments of the APSM and presents a novel algorithmic scheme for online multiregression in general RKHS, where the designer has the freedom to choose from a large variety of convex functions for measuring loss with respect to the received data.

B. Algorithm

Algorithm 9 (Adaptive Kernel Multiregression):
1) Initialization: Choose a nonnegative ε ≥ 0, and a positive integer q, which will stand for the number of loss functions to be concurrently processed at each time instant n. Fix an arbitrary f0 ∈ H as a starting point for the algorithm.
2) Given any time instant n ∈ N, define the following sliding window on the time axis, of size at most q:
Jn := max{0, n − q + 1}, n. The user-defined parameter q determines the number of training data which are reused or concurrently processed at the time instant n. The larger the q, the faster the convergence speed of the resulting algorithmic scheme.
3) Given the current multiregressor estimate fn, choose any subgradient L'_{ε,j}(fn) ∈ ∂L_{ε,j}(fn). In such a way, the following collection of subgradients is formed: {L'_{ε,j}(fn)}_{j∈Jn}.
4) Define the active index set In := {j ∈ Jn : L_{ε,j}(fn) ≠ 0}. If In ≠ ∅, define a set of weights {ω_i^{(n)}}_{i∈In} ⊂ (0, 1] such that Σ_{i∈In} ω_i^{(n)} = 1. Each ω_i^{(n)} assigns a weight to the contribution of L_{ε,i} to the following concurrent scheme. The most straightforward assignment is to let each L_{ε,i}, i ∈ In, contribute equally to the subsequent recursion by setting ω_i^{(n)} := 1/card In, ∀i ∈ In, where card stands for the cardinality of a set.
5) Then, calculate the next multiregressor fn+1 by

fn+1 := fn − μn Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn) / |||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn).

In other words, the update term is formed as a weighted average. Each term in the summation is the contribution of the loss function L_{ε,i}, i ∈ In. Such a term would be the update term if only the loss function at time i were considered. Due to concurrency, extrapolation is possible. The extrapolation parameter μn lies within the interval (0, 2Mn), where

Mn := [Σ_{i∈In} ω_i^{(n)} L²_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] / |||Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn)|||²,
  if Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn) ≠ 0,
Mn := 1, otherwise.    (5)

Due to the convexity of |||·|||², notice that Mn ≥ 1, ∀n ∈ N. To remove any ambiguity, in the case where In = ∅, the summation term over ∅, i.e., Σ_{i∈∅} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn), will be set equal to 0.
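The following Python sketch assembles steps 2)–5) of Algorithm 9 for the ℓ1(L)-norm ε-insensitive loss under simplifying assumptions that are ours, not the paper's: a single Gaussian kernel shared by all L output components, P := IL, ties broken by a zero coefficient, and a fixed μn := 1 (which is admissible, since Mn ≥ 1). The multiregressor is stored through its kernel-expansion coefficients.

```python
import numpy as np

def gauss(x, xp, sigma=1.0):
    d = np.asarray(x) - np.asarray(xp)
    return float(np.exp(-d @ d / (2.0 * sigma**2)))

class KernelMultiregressor:
    """f_n = sum_j kappa(c_j, .) * G[j], one coefficient vector G[j] in R^L per center c_j."""
    def __init__(self, L):
        self.centers, self.coeffs, self.L = [], [], L

    def __call__(self, x):
        out = np.zeros(self.L)
        for c, g in zip(self.centers, self.coeffs):
            out += gauss(c, x) * g
        return out

def apsm_step(f, window, eps=1e-2, mu=1.0):
    """
    One pass of steps 2)-5) of Algorithm 9 for the l1(L)-norm eps-insensitive loss,
    assuming a common Gaussian kernel for all L outputs and P = I_L.
    `window` holds the q most recent pairs (x_j, y_j).
    """
    updates = []
    for x_i, y_i in window:
        r = f(x_i) - np.asarray(y_i)
        loss = max(0.0, np.sum(np.abs(r)) - eps)
        if loss == 0.0:
            continue                      # i is not in the active set I_n
        s = np.sign(r)                    # subgradient: one kernel section at x_i, weighted by s
        norm_sq = (s @ s) * gauss(x_i, x_i)
        updates.append((x_i, loss / norm_sq * s))
    if not updates:
        return f
    w = 1.0 / len(updates)                # uniform weights omega_i^{(n)}
    for x_i, vec in updates:
        f.centers.append(np.asarray(x_i, dtype=float))
        f.coeffs.append(-mu * w * vec)    # weighted-average subgradient step
    return f
```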
C. Sparsification

The celebrated representer theorem [41] claims that the solution of (3), for the case of L := 1, lies in the finite-dimensional subspace of H spanned by the set {κ(xn, ·) : n ∈ 0, N − 1}. That is, if we denote by f∗ the solution of (3), for L := 1, then

f∗ = Σ_{n=0}^{N−1} γn κ(xn, ·)    (6)

for some set of real-valued parameters {γn}_{n=0}^{N−1}. It has been observed, e.g., [1], [2], that the presence of the regularization term η|||·|||² in (3) forces a number of the coefficients {γn}_{n=0}^{N−1} to obtain values close to zero. Such a regularization term, together with a properly chosen loss function, introduces a sparsification of the solution, which is crucial for the limited memory and computational resources of real-world systems. Moreover, such a sparsification acts beneficially against model overfitting [1], [2]. Sparsification becomes imperative in the present online setting, since, even for the special case of L := 1, the sequentially arriving training data force N → ∞ in (6). The overwhelming number of training data makes memory requirements and computational load grow unbounded. According to the representer theorem, the solution (6) of (3) will be located in a sequence of finite-dimensional subspaces of H, with nondecreasing dimension as N → ∞. In contrast to the classical batch setting, where an unconstrained minimization task is formed (3), here we will follow a different approach for incorporating sparsification in the online learning task: a constrained optimization task will be imposed. That is, we choose a positive parameter Δ, and we impose the closed ball with center 0 and radius Δ, i.e., B[0, Δ] := {f ∈ H : |||f||| ≤ Δ}, onto the design. To incorporate such a constraint, we replace the respective step of Algorithm 9 with, ∀n ∈ N,

fn+1 := P_{B[0,Δ]} (fn − μn Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn) / |||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn))    (7)

where P_{B[0,Δ]} is the metric projection mapping (see Appendix IX) onto B[0, Δ], given as follows, ∀f ∈ H
P_{B[0,Δ]}(f) := [Δ / max{Δ, |||f|||}] f.
(8)
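A minimal sketch of the sparsified update (7)–(8) on top of the previous code sketch (it reuses gauss and apsm_step, and the same coefficient representation). For clarity, the norm |||f||| is recomputed naively here, at quadratic cost; Section VII explains how it can instead be tracked recursively at linear cost.

```python
import numpy as np

def norm_H(f):
    """|||f||| for the kernel-expansion representation used above, i.e.,
    |||f|||^2 = sum_{j,k} kappa(c_j, c_k) <G[j], G[k]> (common kernel, P = I_L)."""
    n = len(f.centers)
    sq = 0.0
    for j in range(n):
        for k in range(n):
            sq += gauss(f.centers[j], f.centers[k]) * float(np.dot(f.coeffs[j], f.coeffs[k]))
    return np.sqrt(max(sq, 0.0))

def project_ball(f, delta):
    """Metric projection onto B[0, delta], cf. (8): scale by delta / max{delta, |||f|||}."""
    scale = delta / max(delta, norm_H(f))
    f.coeffs = [scale * g for g in f.coeffs]
    return f

# One sparsified iteration, cf. (7): subgradient step followed by the projection.
# f = project_ball(apsm_step(f, window, eps=0.01, mu=1.0), delta=100.0)
```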
Fig. 3. Geometric illustration of (7) for the case where μn := Mn (⇔ λn = 1). It can be readily verified that each recursion is nothing but the concatenation of two projections: first the subgradient projection mapping TΘn, with respect to a function Θn (see Appendix IX), and then the metric projection mapping onto the closed ball B[0, Δ]. For the definition of the halfspace Hn−, see Definition 5, as well as Fig. 1(a) and (b).

D. Geometric Equivalent of the Algorithm

The previous algorithmic procedure of (7) obtains a simple geometric interpretation if we employ the concept of the subgradient projection mapping (see Appendix IX, Definition 15). It turns out (under Assumption 10.1) that (7) can be rephrased as follows:

fn+1 = P_{B[0,Δ]} (I + λn(TΘn − I))(fn)

where Θn is a weighted average of the loss functions {L_{ε,i}}_{i∈In} (see Appendix IX), TΘn is the subgradient projection mapping with respect to Θn, and λn ∈ (0, 2). The mapping I + λn(TΘn − I) stands for the relaxed subgradient projection mapping, where I denotes the identity mapping
in H. Notice that in the special case where λn = 1, the relaxed subgradient mapping becomes TΘn itself. In other words, the update fn+1 is obtained from fn by concatenating two projection mappings: first, apply a relaxed version of the subgradient projection mapping, i.e., I + λn(TΘn − I), λn ∈ (0, 2), with respect to Θn, and then apply the metric projection mapping onto the closed ball B[0, Δ]. This is illustrated in Fig. 3.

VI. THEORETICAL ANALYSIS OF THE ALGORITHM

First, define the set Ωn := B[0, Δ] ∩ (∩_{i∈In} lev≤0 L_{ε,i}), where the symbol lev≤0 was introduced in Section II and in Fig. 1(a) and (b). In other words, the set Ωn stands for all of the joint minimizers of {L_{ε,i}}_{i∈In}, such that their norm or "size" is constrained by B[0, Δ]. Along the lines of the discussion in Section V-A, any point lying in Ωn attains a zero penalty for the function Θn, defined in the supplementary file.

Assumption 10: Make the following assumptions, which will be used in the sequel.
1) There exists a nonnegative integer n0 ∈ N such that Ω := ∩_{n≥n0} Ωn ≠ ∅. In other words, with the possible exception of a finite number of the Ωn s, the rest of them own a common intersection. This point becomes important in real-world applications, since it accommodates a number of outliers which, most likely, are inconsistent with the rest of the incoming training data.
2) Let some sufficiently small ε > 0 such that μn/Mn ∈ [ε, 2 − ε], ∀n ∈ N.
3) There exists a hyperplane Π ⊆ H such that riΠ Ω ≠ ∅, where riΠ Ω stands for the relative interior of Ω with respect to Π. For the definitions of the hyperplane and the relative interior, see Appendix IX.
4) Assume that ω̌ := inf{ω_i^{(n)} : i ∈ In ≠ ∅, n ∈ N} > 0. This guarantees that none of the contributions of the loss functions in the weighted sum of (7) will fade away, via the ω_i^{(n)} s, as n advances to ∞.
5) The set {L_{ε,i}(fn) : i ∈ In, n ∈ N} is bounded.
Fig. 4. Multiaccess MIMO channel model. A number of P users or transmitters (Txs), with N antennas per Tx, send information, via the channel matrices H1, . . . , HP, to a receiver (Rx) with M antennas. At the pth Tx, and at each time instant n ∈ N, the complex vector sp(n) is encoded by the encoder C prior to transmission. The received signal is the matrix-valued sequence (X(n))n∈N, given in (11).
6) Assuming the existence of f∗ := lim_{n→∞} fn, let {L_{ε,i}(f∗) : i ∈ In, n ∈ N} be bounded.

Theorem 11: Under the previous assumptions, the following statements hold.
1) (Monotone approximation): For all n ≥ n0,

d(fn+1, Ωn) ≤ d(fn, Ωn),  d(fn+1, Ω) ≤ d(fn, Ω)

where d(·, ·) stands for the (metric) distance function (see Appendix IX). In other words, every step of Algorithm 9 potentially takes us closer to the set Ω.
2) (Strong convergence): There exists an f∗ ∈ B[0, Δ] to which the sequence (fn)n∈N converges, in the strong (norm) topology of H, i.e., lim_{n→∞} fn = f∗.
3) (Asymptotic minimization): lim_{n→∞} max{L_{ε,i}(fn) : i ∈ In} = 0. That is, the sequence of multiregressors (fn)n∈N forces the sequence (L_{ε,n})n∈N to go to zero.
4) (Asymptotic minimization at the limiting point): Assuming the existence of f∗ = lim_{n→∞} fn, then lim_{n→∞} max{L_{ε,i}(f∗) : i ∈ In} = 0.
Proof: The proof, as well as the details on which assumptions are mobilized in order to prove each one of the previous claims, can be found in Appendix IX.

VII. COMPUTATIONAL COMPLEXITY

Similar to the procedure adopted in [22, Appendix C] and [24, Appendix F], by using mathematical induction, one can show (details are omitted) that the recursion (7) leads to the following result: for every time instant n ∈ N, there exists a sequence of real numbers {γ_j^{(n+1)}}_{j=0}^{n} such that fn+1, in (7), is equivalently written as

fn+1 = Σ_{j=0}^{n} γ_j^{(n+1)} [κ1(xj, ·), . . . , κL(xj, ·)]^t,  ∀n ∈ N.    (9)

It is easy to observe that the key point for the calculation of all the subgradients that appear in Tables I and II, given the training pair (xi, yi), for any time instant n, is the quantity fn(xi).
By the fundamental reproducing property (4) and the linearity of the inner product, it is easy to verify by (9) that

fn(xi) = Σ_{j=0}^{n−1} γ_j^{(n)} [κ1(xj, xi), . . . , κL(xj, xi)]^t.

Notice, here, that this quantity is also the key point for the evaluation of the loss function L_{ε,i}(fn) = l_ε(fn(xi) − yi), met in (7). Let us see, now, the computational load needed to calculate the projection operator P_{B[0,Δ]} in (7). According to (8), the main load comes from the computation of a norm. However, this scales linearly with the number of unknowns {γ_j^{(n)}}_{j=0}^{n−1}, due to the fact that |||fn||| is known from the previous time instant, and that

|||fn − μn Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn)|||²
= |||fn|||² + μn² |||Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn)|||² − 2μn Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] ⟨fn, L'_{ε,i}(fn)⟩
= |||fn|||² + μn² |||Σ_{i∈In} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] L'_{ε,i}(fn)|||² − 2μn Σ_{i∈In} Σ_{j=0}^{n−1} γ_j^{(n)} ω_i^{(n)} [L_{ε,i}(fn)/|||L'_{ε,i}(fn)|||²] ⟨[κ1(xj, ·), . . . , κL(xj, ·)]^t, L'_{ε,i}(fn)⟩

where the last equation was obtained by using (9). Overall, the complexity of the algorithm is linear with respect to the number of the parameters {γ_j^{(n)}}_{j=0}^{n−1}. More accurately, the complexity is of order O(qn), owing to the concurrency introduced by the summation term in (7). It is clear by the above that, as time n goes by, the computational load O(qn) of the algorithm grows unbounded. However, it has been observed in [22, Sec. VI] and [24, Sec. VIII] that the mapping P_{B[0,Δ]}, in a recursion similar to (7), potentially forces those γ_j^{(n+1)} in (9), with js close to zero, to take very small absolute values. In other words, as time n goes by, more and more coefficients, with indexes in the remote past, attain negligible values. As such, the introduction of the mapping P_{B[0,Δ]} gives the designer the freedom to discard such coefficients and keep only those with indexes that correspond to recent training data. Hence, we choose the length Lb ∈ N∗ of a buffer, into which we keep only the Lb most recent coefficients {γ_j^{(n+1)}}_{j=n−Lb+1}^{n}, in order to add the following equation as the last step of Algorithm 9:

f̃n+1 := Σ_{j=n−Lb+1}^{n} γ_j^{(n+1)} [κ1(xj, ·), . . . , κL(xj, ·)]^t.    (10)
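A minimal sketch of the buffer truncation (10) for the coefficient representation used in the earlier sketches; the choice of Lb and the call sequence in the comment are illustrative.

```python
def truncate_buffer(f, L_b):
    """
    Sliding-buffer sparsification, cf. (10): keep only the L_b most recently added
    kernel centers and their coefficient vectors, discarding the (typically
    negligible) coefficients attached to the remote past.
    """
    f.centers = f.centers[-L_b:]
    f.coeffs = f.coeffs[-L_b:]
    return f

# e.g., appended after each sparsified iteration:
# f = truncate_buffer(project_ball(apsm_step(f, window), delta=100.0), L_b=4000)
```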
TABLE III
SUMMARY OF NOTATIONS FOR THE MULTIACCESS MIMO CHANNEL

n: Time index.
P: Number of Txs.
N: Number of antennas per Tx.
M: Number of antennas at the Rx.
K: Number of symbols prior to encoding.
T: Number of symbols transmitted per antenna.
R := K/T: Rate of the code.
sp(n) ∈ C^K: The vector of K symbols to be encoded at time n.
C: The encoder.
C(sp(n)) ∈ C^{T×N}: The block of symbols after encoding.
{Hp}_{p=1}^P ⊂ C^{N×M}: Channel matrices.
L: Number of SOI Txs.
M̄: The vector obtained by stacking the columns of the matrix M.
Therefore, Algorithm 9, together with (10), leads to a scheme with a computational complexity of order O(qLb). A different sparsification strategy, which results in a quadratic computational complexity and which could be adopted in the multiregression task, can be found in [23].

VIII. ONLINE MULTIREGRESSION PARADIGM: THE MULTIACCESS MIMO CHANNEL

In the sequel, we will apply the developed algorithm to a major application in the communications discipline, namely, the multiaccess MIMO channel [32]–[35] equalization problem, which will be viewed as a supervised learning task. In the following, we let ı := √−1.

A. Transmit Diversity and Coding

The model of a multiaccess MIMO channel [33], [34] is given in Fig. 4, with P users or transmitters (Txs), N antennas per Tx, and M antennas at the receiver (Rx). Due to the multiplicity of the antennas, special coding strategies, known as the OSTBC, have been proposed, which optimally utilize the diversity offered by the multiplicity of antennas at the Txs. For more information on this coding strategy, the reader is referred to [32]–[34], [38], [39], and the references therein. In the OSTBC setting, and at each time instant n ∈ N, a Tx encodes information in blocks of symbols prior to its transmission through the channel. The information to be encoded at the pth Tx is the vector sp(n) := [sp1(n), . . . , spK(n)]^t ∈ C^K of K complex symbols, where the notation (·)^t stands for vector transposition. The encoder is mathematically expressed by the linear mapping C : C^K → C^{T×N} : sp(n) → C(sp(n)). For simplicity, we assume that all the Txs share the same encoding strategy, i.e., a common linear mapping C. Notice that the output of the encoder is a block of symbols of size T × N. The choice of the OSTBC depends on the number N of antennas at the Tx. The rate of the code is defined as R := K/T. The signal received at the Rx is given by [33] and [34]

X(n) = Σ_{p=1}^P C(sp(n)) Hp + V(n) ∈ C^{T×M},  ∀n ∈ N    (11)

where Hp ∈ C^{N×M} stands for the channel matrix between the pth Tx and the Rx. The (i, j)th component of Hp is a
complex number that gathers all the possible channel effects on the signal, e.g., fading, when transmitted from the ith antenna of the pth Tx to the jth antenna of the Rx. The sequence (V(n))n∈N ⊂ C^{T×M} will stand for the complex matrix-valued noise process. Table III summarizes the notations introduced previously. In this paper, we will adopt the following assumptions on the encoder and the available channel information.

Assumption 12 (Transmit Coding and Channel Information):
1) All the linear techniques that appear in this paper will follow the encoding strategy of the OSTBC, where the mapping C is defined in [32]–[34], [38], and [39].
2) The proposed nonlinear kernel-based method will adopt, instead, the following simple encoding strategy: let K := T := 1, and C : C → C^{1×N} : s → [s, s, . . . , s]. That is, at every time instant n, the same symbol is transmitted by every antenna of the Tx. In other words, no source coding is used for the proposed technique. We will demonstrate that the benefits a space-time coding scheme offers to the system can alternatively be gained by the implicit mapping of the original task into a high- (even infinite-) dimensional space. Hence, since no coding scheme is used, the decoder operates simply on a symbol-by-symbol basis. (A short simulation of this setting is sketched after this list.)
3) No channel information will be available for the proposed design, i.e., no knowledge is assumed for the channel matrices {Hp}_{p=1}^P. On the contrary, most of the linear techniques that will be discussed in Section IX utilize partial or full exact knowledge of {Hp}_{p=1}^P. It should be pointed out here that linear methods are quite flexible in incorporating additional side information regarding the MIMO channel. For the proposed design, such a capability in the context of MIMO systems is currently under investigation. To incorporate side information in the proposed MIMO scheme, the new directions of employing a priori knowledge in projection-based adaptive algorithms, established very recently in [28] and [29], will be followed.
4) Among the P Txs, we identify a number L ∈ 1, P of them as the users or signals of interest (SOI), whose transmitted symbols need to be identified.
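As announced in item 2), the following sketch simulates the received-signal model (11) under the simple repetition encoding of Assumption 12.2 (K = T = 1). The dimensions, the Gaussian channel draw, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, M = 4, 2, 2                     # Txs, antennas per Tx, antennas at the Rx (illustrative)

def repetition_code(s, N):
    """Assumption 12.2: K = T = 1 and C(s) = [s, s, ..., s] (a 1-by-N block)."""
    return np.full((1, N), s, dtype=complex)

# Random channel matrices H_p in C^{N x M} with Gaussian real and imaginary parts.
H = [rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)) for _ in range(P)]

def received_block(symbols, noise_std=0.1):
    """X(n) = sum_p C(s_p(n)) H_p + V(n), cf. (11), for one time instant n."""
    X = sum(repetition_code(s, N) @ H[p] for p, s in enumerate(symbols))
    V = noise_std * (rng.standard_normal((1, M)) + 1j * rng.standard_normal((1, M)))
    return X + V

# QPSK-like symbols +-1 +- i for each of the P Txs.
s_n = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), size=P)
print(received_block(s_n))            # a T x M = 1 x 2 complex block
```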
Fig. 5. Model of the proposed receiver. The quantity X̄(n) is the vectorized form of the received signal (11). The symbol S̄(n) ∈ C^{KL} stands for the vectorized form of the matrix S(n) := [s1(n), . . . , sL(n)] ∈ C^{K×L}, which contains the symbols sent by the L SOI and which need to be identified at the output of the receiver (equalization task). The receiver is the vector-valued mapping f := [f1, . . . , fL]^t. Our goal is to find an f such that the "disagreements" f(χ1(n)) − ℜ(S̄(n)) and f(χ2(n)) − ℑ(S̄(n)) become sufficiently small in some sense, i.e., when viewed via a user-defined loss function.
B. Receiver

To make the notation easier to follow, and to work with real vectors instead of complex ones, we introduce some preprocessing before the signal reaches the receiver (see Fig. 5). First, let us define the mapping M̄ := vect M, where vect stands for the standard column stacking operation, and M is any matrix. Now, take X(n), ∀n ∈ N, from (11), and define also the following (2TM)-dimensional real vectors, ∀n ∈ N:

χ1(n) := [ℜ(X̄(n)); ℑ(X̄(n))],  χ2(n) := [ℑ(X̄(n)); −ℜ(X̄(n))]    (12)

where ℜ(·) and ℑ(·) denote the real and the imaginary part of a complex vector, respectively. Our receiver will be defined as the mapping

f : R^{2TM} → R^{KL} (= R^{2M} → R^L, since K := T := 1) : x → f(x) := [f1(x), . . . , fL(x)]^t.    (13)

The receiver f is not constrained to be linear. This paper will focus on the case where f takes a nonlinear form; more specifically, each component of f belongs to a special Hilbert space, i.e., the RKHS. Such a setting is wide enough to include also the cases where f is a linear mapping. This is also the reason for introducing the vectors in (12), which is elucidated by the following proposition.

Proposition 13: Assume that the receiver f is a linear mapping. Then, there exists a W ∈ C^{M×L} such that f(χ1(n)) + ıf(χ2(n)) = W∗X̄(n), ∀n ∈ N, where the superscript (·)∗, when applied to a matrix, stands for complex conjugate transposition. In other words, the mapping f, defined in (13), when applied to the vectors in (12), covers the classical case of a linear receiver.
Proof: The proof is given in Appendix IX.

In order to unify notations with the previous sections, we define the sequence of received data as

xn := χ_{(n mod 2)+1}(⌊n/2⌋),  ∀n ∈ N

where ⌊γ⌋ stands for the largest integer less than or equal to the real number γ. To see how this works, notice that the previous definition generates the sequence (x0, x1, x2, x3, . . .) = (χ1(0), χ2(0), χ1(1), χ2(1), . . .).
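A short sketch of the preprocessing (12) and of the interleaving of the received data into the sequence (xn)n∈N; the helper names and the reuse of received_block from the earlier sketch are illustrative assumptions. The desired responses (yn)n∈N described next are interleaved in exactly the same way.

```python
import numpy as np

def preprocess(X_n):
    """Build the real vectors chi_1(n), chi_2(n) of (12) from the complex block X(n)."""
    x_bar = np.asarray(X_n, dtype=complex).flatten(order='F')   # vect: column stacking
    chi1 = np.concatenate([x_bar.real, x_bar.imag])
    chi2 = np.concatenate([x_bar.imag, -x_bar.real])
    return chi1, chi2

def interleave(blocks):
    """Turn (X(0), X(1), ...) into (x_0, x_1, x_2, ...) = (chi_1(0), chi_2(0), chi_1(1), ...)."""
    xs = []
    for X_n in blocks:
        chi1, chi2 = preprocess(X_n)
        xs.extend([chi1, chi2])
    return xs

# e.g., with received_block() from the previous sketch:
# xs = interleave([received_block(s) for s in symbol_stream])
```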
In a similar way, define for every time instant n ∈ N, S(n) := [s1(n), . . . , sL(n)], let σ1(n) := ℜ(S̄(n)), σ2(n) := ℑ(S̄(n)), and finally define

yn := σ_{(n mod 2)+1}(⌊n/2⌋),  ∀n ∈ N.

In other words, the following sequence is constructed: (y0, y1, y2, . . .) = (σ1(0), σ2(0), σ1(1), . . .). Since we deal with a supervised learning task, the training data (xn, yn)n∈N ⊂ R^{2M} × R^L are assumed to be available at the receiver. Notice, here, that Q := 2M for the initial formulation of the online multiregression Problem 1. The user can select from a variety of loss functions, given, for example, in Tables I and II, and apply Algorithm 9 to attack this equalization task.

IX. NUMERICAL RESULTS

The proposed approach was validated against both linear and nonlinear (kernel-based) techniques. All the linear techniques in this section employ the OSTBC methodology [38], [39], and are divided into two classes: 1) those that utilize both the available channel information and the training data [33]; and 2) those that cannot incorporate any channel information and rely only on the sequential training data in order to build the multiregressors. As a representative of the second class, we have chosen the classical recursive least squares (RLS) algorithm [17]. To validate the proposed methodology against other kernel-based techniques, we have chosen the recently introduced kernel RLS [20]. We stress here that no channel coding was employed in any of the evaluated methods, both linear and nonlinear. For a comparison of a method similar in spirit with the present one against an online SVM variant [11], in the context of classification, we refer the interested reader to [22]. Moreover, a linear variant of the proposed scheme, i.e., the case of a linear kernel function, can be found in [34]. The linear receiver in [33] follows a linearly constrained quadratic minimization approach. The OSTBC methodology formulates the regression problem in a Euclidean feature space of dimension 4KMT (for the physical meaning of each symbol refer to Table III). The user has knowledge of a number τ ∈ 1, P of the channel matrices {Hp}_{p=1}^P. For sure, the user possesses the knowledge of the SOI channel matrix. In short, the receiver is constrained to produce a distortionless output when applied on the SOI channel matrix, and to cancel the output when applied on any other available channel matrix. The knowledge of each channel matrix introduces a number of 4K² constraints in the feature space, hence a total number of 4K²τ linear constraints is introduced. As a result, the receiver has, at least, a number of 4KMT − 4K²τ available degrees of freedom (DOF) in order to minimize a "variance," i.e., a quadratic function formed by an estimate of a correlation matrix. To estimate such a correlation matrix, the receiver uses the sequentially arriving training data. According to [33], the previous approach is followed for each Tx whose symbols need to be identified, i.e., a number of L independent linear multiregressors are derived to identify the L incoming streams of symbols.
[Fig. 6 appears here; its curves correspond to "RLS (N = 3)," "KRLS (N = 2)," "MV_SOI (N = 2)," "MV_All (N = 3)," "MV_SOI (N = 3)," "Proposed (N = 3)," and "Proposed (N = 2)," with the SER plotted on a logarithmic axis versus the SNR (dB).]

Fig. 6. SER versus SNR at the Txs. Both "Proposed" and "KRLS" do not use any channel information. They are built solely on the sequentially incoming training data. By contrast, the linear techniques "MV_All" and "MV_SOI" assume full and partial knowledge of the channel information, respectively.

The preceding discussion results in the following rule of thumb: a sufficient condition for having a meaningful constrained optimization problem in [33] is 4KMT − 4K²τ > 0, or, equivalently, R := (K/T) < M/τ. In other words, the previous condition is equivalent to solving a quadratic minimization problem over an underdetermined set of linear equations. The "MV_SOI" and "MV_All" tags, which appear in the subsequent figures, refer to the methodology in [33], for τ = 1 and τ = P, respectively. Note here that the RLS [17], which corresponds to the "RLS" tag in the subsequent figures, solves an unconstrained minimization problem, which is slightly different from the one in [33], where the cost function is formed by the correlation matrix of the training data. This is due to the fact that the "RLS" cannot incorporate any a priori knowledge, i.e., channel information, and only the sequentially arriving training data are exploited. The present section will deal with the following scenario: P := 4, M := 2, L := 2 (see Table III). The number of antennas per Tx, i.e., N, takes two values: 2 and 3. Each component of the source symbol sp(n) takes equiprobably one of the complex values ±1 ± ı, ∀p, and ∀n. Each component of the channel matrix Hp, p ∈ 1, P, is a complex-valued random variable whose real and imaginary parts follow the normal distribution with zero mean and variance equal to 1. Different components of a single channel matrix are set to be independently distributed. Such an independence is also assumed among components of different channel matrices. Of course, the channel matrices {Hp}_{p=1}^P are set to be different for every realization in the following experiments. Fig. 6 illustrates the symbol error rates (SER) versus the signal-to-noise ratio (SNR) variations of the Txs. For the present scenario, the SNR is set to be equal for all the Txs. The term "signal power" refers to the total power contained in a single block of symbols. Such a choice was dictated by the need to have the same amount of power reaching the Rx, from each Tx of interest, for different encoding schemes, and for the same N. Each point on each curve is the uniform average of 100 experiments. The number of complex training data for
each experiment, for all the methods, is set equal to 5000. In order to show the convergence behavior of all the methods versus the sequential training data, Fig. 7 depicts the root mean square (RMS) distance to a set of 500 known test data versus the number of data used to train the multiregressors, in the case where the SNR is set to 9 dB. Once more, a uniform average of 100 experiments is considered in order to draw each curve. Both the SER and RMS quantities are considered per block of transmitted symbols. Regarding the "MV_SOI (N = 2)" method, the celebrated Alamouti OSTBC strategy [38], [39, Eq. 32] was employed. In such a case, K = 2 and T = 2. According to the preceding discussion, the lower bound of DOF for "MV_SOI (N = 2)," onto which a quadratic cost function is minimized, becomes 4KMT − 4K²τ = 4 · 2 · 2 · 2 − 4 · 2² · 1 = 16. However, as Fig. 6 suggests, "MV_SOI (N = 2)" does not lead to a satisfactory performance. It is clear that the number of available DOF is not sufficient to obtain a "satisfactory" solution of the constrained minimization task in [33]. The case of "MV_All (N = 2)" does not satisfy the preceding rule of thumb: 4KMT − 4K²τ = 4 · 2 · 2 · 2 − 4 · 2² · 4 = −32. Our experiments show that "MV_All (N = 2)" results in an almost identical performance to "MV_SOI (N = 2)," hence this case is not depicted in Fig. 6. To increase the number of available DOF, the rate R of the OSTBC should be decreased, according to the preceding rule of thumb. However, the rate R of the OSTBC is tightly connected to the number N of the antennas at the Txs; in order to decrease R, N has to be increased. To this end, we set N = 3 and employ the OSTBC scheme found in [39, Eq. 27], where K = 4 and T = 8, i.e., R = K/T = 0.5. Notice that in this case, the available DOF for "MV_All (N = 3)" are larger than or equal to 4KMT − 4K²τ = 4 · 4 · 2 · 8 − 4 · 4² · 4 = 0. The respective number for "MV_SOI (N = 3)," i.e., for τ = 1, becomes 192. As suggested by Fig. 6, both "MV_All (N = 3)" and "MV_SOI (N = 3)" produce satisfactory performance at the expense of increased hardware, i.e., the number of antennas at the Txs. Notice that, although "MV_All (N = 3)" exploits the knowledge of all the channel matrices, its performance is worse than that of "MV_SOI (N = 3)." This is due to the previously demonstrated fact that the lower bound for the DOF, in the case of "MV_SOI (N = 3)," is 192, as opposed to 0 for "MV_All (N = 3)." In short, more DOF result in a "larger" solution space for the minimization problem. All of the preceding linear methods operate in a Euclidean feature space of dimension 4KMT. We have demonstrated by the previous discussion that, in order to enhance performance, the DOF should be increased, which forces K and T to take large values, and which results in a large-dimensional feature space. However, such a strategy inevitably increases the number of antennas N at each Tx. By contrast, the kernel-based method presented here follows a different approach. The dimension of the feature space is determined, and can (implicitly) be increased, by the appropriate choice of the kernel function. The hardware burden of a large number of antennas at the Tx is no longer needed. Here, the Gaussian kernel is used, which can result in an RKHS feature space of
[Fig. 7 plot: RMS distance versus the number of training data (500–5000); curves: RLS (N = 3), KRLS (N = 2), MV_SOI (N = 2), MV_All (N = 2), MV_SOI (N = 3), MV_All (N = 3), and Proposed (N = 2).]
Fig. 7. Illustration of convergence behavior by the RMS distance to a known set of test source symbols versus the number of sequential data used to train the multiregressors. Recall here that each component of the source symbol s_p(n) takes equiprobably one of the complex values ±1 ± ı, ∀p, ∀n. Hence, the minimum distance between two symbols in the constellation, per component of s_p(n), takes the value of 2.
The variance σ² of the Gaussian kernel was set equal to 1 for all the experiments, and for both the kernel recursive least squares (KRLS) and the “Proposed” schemes. More specifically, for the “Proposed” method, the ℓ1(L)-norm ε-insensitive loss function (Section IV, Table I) was adopted in order to build the sequence of multiregressors. Moreover, the following parameters were used: 1) the length of the buffer Lb, in Section VII, is set equal to 4000 real-valued data, i.e., 2000 complex data, according to the scheme adopted in (12); 2) the radius Δ of the ball in Section VII is set equal to 100; 3) q := 2 in Algorithm 9; 4) ω_j^{(n)} := 1/card I_n, ∀j ∈ I_n, ∀n ∈ N, where card I_n stands for the cardinality of I_n; 5) ε in Section IV is set equal to 0.01; 6) the matrix P := I_L in Table I; and 7) the extrapolation parameter μ_n := 1, ∀n ∈ N.

We stress here that neither the proposed kernel-based method nor the “KRLS” assumes any channel information, as opposed to “MV_All” and “MV_SOI”; they are built solely on the sequentially incoming training data. For the “KRLS” method, the value of the ν parameter [20], which determines the cardinality of a basis in the feature space, is set equal to 0.5. This value was chosen in order for the “KRLS” to produce performance similar to that of the “Proposed” method, in both Figs. 6 and 7. For example, in Fig. 7, the average cardinality of the basis, i.e., the number of Gaussian kernel functions, is card B = 1154.5 ± 155.99. In our experiments, we noticed that the lower the SNR, i.e., the noisier the training data, the larger the card B. Behavior similar to the one presented in the figures is also valid for values of ν close to 0.5. We remark here that the computational complexity of the “KRLS” is of order O((card B)²) [20]. In our attempt to decrease the computational complexity of the “KRLS” by drastically decreasing card B, we observed that the performance of the “KRLS” deteriorates significantly. We recall here that the “KRLS” [20], as a nonlinear counterpart of the celebrated RLS [17], is confined to employing only a quadratic loss function. The proposed methodology gives us the freedom of choosing any objective function, provided that the subgradients take a closed form.
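For concreteness, a minimal sketch of the adopted ℓ1-norm ε-insensitive loss and of one valid subgradient is given below, assuming, for simplicity, Q = I_L (the closed-form subdifferential for a general positive definite Q is derived in Appendix C and collected in Table I); the function names are our own.

```python
import numpy as np

def l1_eps_insensitive(xi, eps=0.01):
    """l_eps(xi) = max{0, ||xi||_1 - eps}: zero inside the l1-ball of radius eps."""
    return max(0.0, np.sum(np.abs(xi)) - eps)

def l1_eps_subgradient(xi, eps=0.01):
    """One valid subgradient of l_eps at xi (with Q = I); 0 inside the zeroth level set."""
    if np.sum(np.abs(xi)) <= eps:
        return np.zeros_like(xi)
    # np.sign returns 0 at coordinates where xi_j = 0, which also lies in the subdifferential
    return np.sign(xi)

# the error vector xi = f(x) - y is what drives the update of the multiregressor
xi = np.array([0.3, -0.05, 0.0])
print(l1_eps_insensitive(xi), l1_eps_subgradient(xi))   # approx. 0.34 and [ 1. -1.  0.]
```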
[Fig. 8 plot: RMS distance versus the number of training data (500–5000); curves labeled by (Lb, Δ, ε): (4000, 100, 0.01), (3750, 50, 0.01), (3000, 100, 0.01), and (2000, 100, 0.01).]
Fig. 8. Dependence of the RMS distance of the proposed methodology on the parameters (Lb , Δ), introduced in Section VII.
We have already seen in Section V that our algorithm employs only the subgradient of an adopted loss function or, even better, the respective subgradient projection mapping, which is nothing but a metric projection onto a closed halfspace, as illustrated also in Fig. 3. Due to this fact, the computational complexity of our algorithm stays linear with respect to the number of unknowns, i.e., of order O(qLb). Recall, now, that we have set q = 2 and Lb = 4000 for the “Proposed” scheme, in both Figs. 6 and 7.

The statistical analysis of the curves in Fig. 7, i.e., the error bars, consolidates the preferable overall performance of the “Proposed” when compared to the “KRLS.” Despite the lower computational complexity of the “Proposed” (O(qLb)) with respect to that of the “KRLS” (O((card B)²)), not only is the RMS performance similar, but the error bars that correspond to the “Proposed” curves also show a smaller span than those of the “KRLS.” In order to decrease the standard deviation of the “KRLS” curves, one solution would be to increase the number of kernel functions in the basis B, i.e., card B. However, this would have a costly impact on its computational load; recall that the “KRLS” complexity scales with O((card B)²).

Fig. 8 illustrates the performance of the proposed method for various values of (Lb, Δ, ε). The curve “(4000, 100, 0.01)” is exactly the same as the one appearing in Fig. 7. As expected, a decrease of the buffer length Lb, introduced in Section VII, leads to a performance degradation, as can be seen from the increase of the RMS distance level. This is a natural outcome of (10): the smaller the Lb, the fewer the terms in the kernel series expansion of (10), and thus the less accurate our multiregressor becomes in capturing the fine details of the stochastic model that underlies our data. However, it is clear from Fig. 8 that such a performance degradation is minor, even when we cut down the number of kernel functions from 4000 to 2000. The robustness of the method is also justified by the almost identical standard deviation behavior of all the curves in Fig. 8. Notice in Fig. 8 that, as we move toward small values of Lb, the curves exhibit a distinct tendency to raise the RMS distance level in the region of large time instants.
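The role of the buffer can be visualized with the following rough sketch (our own illustration, not the paper's implementation), which assumes, as in the kernel series expansion of (10), that the multiregressor is a finite sum of Gaussian kernel functions centered at past training points: only the Lb most recent coefficient/center pairs are retained, so older terms are simply dropped.

```python
from collections import deque
import numpy as np

def gaussian_kernel(x, y, sigma2=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma2))

class SlidingKernelExpansion:
    """Keeps only the Lb most recent (coefficient, center) pairs of the expansion."""
    def __init__(self, Lb):
        self.coeffs = deque(maxlen=Lb)    # oldest terms fall out automatically
        self.centers = deque(maxlen=Lb)

    def add_term(self, gamma, x):
        self.coeffs.append(gamma)
        self.centers.append(np.asarray(x, dtype=float))

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        return sum(g * gaussian_kernel(x, c) for g, c in zip(self.coeffs, self.centers))

# a buffer of length Lb = 3: adding a fourth term evicts the oldest one
f = SlidingKernelExpansion(Lb=3)
for n, gamma in enumerate([0.5, -0.2, 0.1, 0.4]):
    f.add_term(gamma, [float(n), 0.0])
print(len(f.coeffs), f([3.0, 0.0]))      # 3 terms remain; value approx. 0.43
```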
[Fig. 9 plot: RMS distance versus the number of training data (500–5000); curves labeled by (Lb, Δ, ε): (4000, 100, 0.01), (4000, 100, 0.1), and (4000, 100, 1).]
[Fig. 10 plot: RMS distance versus the number of training data (500–5000); curves: MV_SOI (N = 2), MV_All (N = 2), and Proposed (N = 2).]
Fig. 9. Dependence of the RMS distance of the proposed methodology on the parameter ε, which is used in the design of the loss functions in Section IV.
Such a rise of the RMS level can be observed in both the “(3000, 100, 0.01)” and “(2000, 100, 0.01)” curves. This is due to the fact that a small buffer Lb is not sufficient to keep the RMS distance level at low values. In order to keep the RMS distance level low, the designer has to perform two tasks: 1) increase the length of the buffer Lb; and 2) decrease the value of the ball radius Δ, introduced in Section VII. Equation (10) suggests that the introduction of the buffer is a brute-force method to cut off the terms in the kernel series expansion that are located in the remote past. However, prior to applying such a drastic method, a better idea would be to gradually trim down the coefficients in (10) which correspond to the remote past. This can be done by the introduction of Δ (see Section VII), and it is illustrated in Fig. 8 by the curve “(3750, 50, 0.01).” Such parameter trimming is omnipresent in adaptive algorithms, in order to guarantee satisfactory convergence performance.

Fig. 9 illustrates the performance of the proposed methodology when the value of ε, used in the design of the loss functions in Section IV, takes several values. Here, we let ε span two orders of magnitude. We notice that, even with a change of one order of magnitude, i.e., from 0.01 to 0.1, the method results in almost identical performance, both in terms of the RMS value and of the standard deviation, i.e., it is rather insensitive to the choice of ε. However, if one overdoes it, i.e., assigns the value 1 to ε, then the method does not perform well. This has a simple geometrical explanation: the larger the ε, the larger the zeroth level sets in Fig. 1(a) and (b). The larger these level sets are, the larger the intersection Ω is, i.e., the neighborhood to which our method converges (see Theorem 11).

In order to demonstrate the behavior of the proposed algorithm in time-adaptive scenarios, an experiment is realized for the case of a MIMO channel that abruptly changes its state at the time instant n = 2500. This is depicted in Fig. 10. For the time span 0, 2499, a number of P = 3 Txs send symbols to the Rx. However, at the time instant n = 2500, the MIMO channel shows a sudden change by adding an extra Tx to the scenario, i.e., P = 4 for the time span 2500, 5000.
Fig. 10. RMS distance versus number of training data for the case where an abrupt channel change occurs at the time instant n = 2500. Starting with a total number of P = 3 Txs, for the time span 0, 2499, the MIMO channel changes its state at n = 2500 by adding an extra Tx that uses the same coding scenario as the rest, such that P = 4 for the time span of 2500, 5000. The rest of the parameters follow exactly the scenario of Fig. 7.
It can be readily verified that both the proposed method and the linear techniques [33] adapt to the channel change, albeit with the proposed algorithm exhibiting superior performance. Such superior performance is due to the same reasons behind the behavior illustrated in Fig. 7. Finally, we stress here that batch techniques, which are built upon the empirical loss minimization strategy (see Remark 2), can no longer be applied to the scenario of Fig. 10, since the consideration of the total number of training data in the loss function, as in Remark 2, is not appropriate for dynamic cases where the probability distribution of the training data changes with time.

APPENDIX A
CONVEX SETS AND PROJECTIONS

Assume a real Hilbert space H, with inner product ⟨·, ·⟩ and norm ‖·‖ := ⟨·, ·⟩^{1/2}. Given a subset S of H, the function defined as d(f, S) := inf{‖f − f̂‖ : f̂ ∈ S}, ∀f ∈ H, is called the (metric) distance function of f to the set S. A subset C of H will be called convex if, ∀f₁, f₂ ∈ C and ∀γ ∈ [0, 1], we have γf₁ + (1 − γ)f₂ ∈ C. Let us give two basic examples of convex sets, which are found in several places of this paper. The first example is the open ball: given a δ > 0 and an f₀ ∈ H, the open ball is the set B(f₀, δ) := {f ∈ H : ‖f − f₀‖ < δ}. An example of a closed convex set is the hyperplane, defined as the set Π := {f ∈ H : ⟨f, a⟩ = β}, for some given nonzero a ∈ H and some β ∈ R.

Given a closed convex subset C of H, the (metric) projection onto C is defined as the mapping P_C : H → C, which takes a point f ∈ H to the uniquely existing P_C(f) ∈ C such that ‖f − P_C(f)‖ = d(f, C). Given a subset S of H, the relative interior of S with respect to a set A ⊂ H is defined as the following subset of S: ri_A S := {f ∈ S : ∃δ_f > 0 with ∅ ≠ (B(f, δ_f) ∩ A) ⊂ S}. Notice that the interior of S is defined as int S := ri_H S.
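For two of the convex sets used in this paper, the metric projection admits a closed form. The sketch below (a minimal illustration, assuming the Euclidean case H = R^n; the function names are our own) computes the projections onto the closed ball B[f₀, δ], i.e., the closure of the open ball defined above and the set used as B[0, Δ] in Algorithm 9, and onto the hyperplane Π.

```python
import numpy as np

def project_ball(f, f0, delta):
    """Metric projection of f onto the closed ball B[f0, delta]."""
    d = f - f0
    n = np.linalg.norm(d)
    return f if n <= delta else f0 + (delta / n) * d

def project_hyperplane(f, a, beta):
    """Metric projection of f onto the hyperplane {g : <g, a> = beta}, with a != 0."""
    return f - ((f @ a - beta) / (a @ a)) * a

# small usage check: the projected points satisfy the defining constraints
f = np.array([3.0, 4.0])
print(np.linalg.norm(project_ball(f, np.zeros(2), 1.0)))              # approx. 1.0
print(project_hyperplane(f, np.array([1.0, 1.0]), 1.0) @ np.ones(2))  # approx. 1.0
```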
APPENDIX B
DIFFERENTIABILITY AND SUBDIFFERENTIABILITY

First, we will introduce the notion of differentiability of a function defined on a real Hilbert space H.

Definition 14 (Fréchet Differentiability): Assume a function Θ : H → R, and fix arbitrarily an f ∈ H. If there exists a g ∈ H such that
$$\lim_{h\to 0}\frac{|\Theta(f+h)-\Theta(f)-\langle g,h\rangle|}{\|h\|}=0$$
then Θ will be called (Fréchet) differentiable at f, and the continuous linear mapping ⟨g, ·⟩ : H → R will be called the (Fréchet) differential of Θ at f. It turns out that, if such a g exists, then it is unique. In other words, g characterizes the differential of Θ at f. If Θ is (Fréchet) differentiable at any f, then the mapping Θ′ : f ↦ g will be called the (Fréchet) derivative of Θ. Hence, in this notation, and due to the uniqueness of g, Θ′(f) stands for the differential of Θ at f.

Definition 15 (Subgradient Projection Mapping): Let Θ : H → R be a convex continuous function such that lev≤0 Θ ≠ ∅. Then, the subgradient projection mapping T_Θ : H → H with respect to Θ is defined as (see, for example, [51])
$$T_\Theta(f):=\begin{cases} f-\dfrac{\Theta(f)}{\|\Theta'(f)\|^2}\,\Theta'(f), & \text{if } f\notin \operatorname{lev}_{\le 0}\Theta\\[4pt] f, & \text{if } f\in \operatorname{lev}_{\le 0}\Theta\end{cases}$$
where Θ′(f) is any subgradient of Θ at f. Of course, if Θ is differentiable at f, then Θ′(f) is the differential. The subgradient projection mapping has the following simple geometrical interpretation: T_Θ(f) is nothing but the metric projection of f onto H⁻ [see Fig. 1(a) and (b), and the discussion regarding Definition 5]. For more details on the theoretical properties of T_Θ, see, for example, [51].

Definition 16 (Adjoint Mapping [42]): Let A₀ : H → R^L be a continuous linear mapping. Then, the adjoint mapping A₀* : R^L → H is the linear mapping defined by ⟨ξ, A₀(f)⟩_Q = ⟨A₀*(ξ), f⟩, ∀ξ ∈ R^L, ∀f ∈ H.

Fact 17 (Subdifferential of the Composition of a Convex Function With an Affine Mapping [43, Th. 23.9], [44, Ch. 1, Prop. 5.7]): Assume a continuous convex function l : R^L → R, and a continuous affine mapping A : H → R^L defined as A(f) := A₀(f) − y, where A₀ : H → R^L is a continuous linear mapping and y is a given vector in R^L. Consider, now, the composition L := l ∘ A : H → R. Then, ∂(l ∘ A)(f) = A₀* ∂l(A(f)), where the mapping A₀* : R^L → H is the adjoint of A₀.

Example 18: This is an example of an affine mapping A, which is frequently used in the sequel. Given the pair of training data (x, y) ∈ R^Q × R^L, and the Hilbert space H := H₁ × ··· × H_L, where each H_j, j ∈ 1, L, is an RKHS, define the affine mapping
$$A(f):=f(x)-y=\begin{bmatrix}\langle f_1,\kappa_1(x,\cdot)\rangle\\ \vdots\\ \langle f_L,\kappa_L(x,\cdot)\rangle\end{bmatrix}-y=:A_0(f)-y,\quad \forall f\in H.$$
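As a finite-dimensional illustration of Fact 17 (a minimal sketch with H = R^d and the standard inner products, i.e., Q and P taken as identities, and with a matrix A0 of our own choosing), a subgradient of the composition l ∘ A at f is obtained by pushing a subgradient of l at A(f) through the adjoint of A₀, which here is simply the matrix transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.standard_normal((3, 5))        # linear mapping A0 : R^5 -> R^3
y = rng.standard_normal(3)
A = lambda f: A0 @ f - y                # affine mapping A(f) = A0 f - y

l = lambda xi: np.sum(np.abs(xi))       # convex l(.) = ||.||_1
l_sub = lambda xi: np.sign(xi)          # a subgradient of l at xi

def composition_subgradient(f):
    """A subgradient of (l o A) at f, via Fact 17: the adjoint applied to a subgradient of l."""
    return A0.T @ l_sub(A(f))

# numerical check of the subgradient inequality l(A(g)) >= l(A(f)) + <s, g - f>
f, g = rng.standard_normal(5), rng.standard_normal(5)
s = composition_subgradient(f)
print(l(A(g)) >= l(A(f)) + s @ (g - f))   # -> True
```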
Notice, here, that the loss function L in (2) and in Definition 4 is defined as L := l ∘ A, i.e., it is the composition of a convex function with an affine mapping. For this reason, L is convex.

Lemma 19: The adjoint mapping A₀* : R^L → H of the linear mapping A₀, given in Example 18, becomes (recall that P, Q are positive definite matrices), ∀ξ ∈ R^L,
$$A_0^*(\xi):=P^{-1}\begin{bmatrix}\sum_{i=1}^{L} q_{i1}\,\xi_i\,\kappa_1(x,\cdot)\\ \vdots\\ \sum_{i=1}^{L} q_{iL}\,\xi_i\,\kappa_L(x,\cdot)\end{bmatrix}=P^{-1}K(x)Q\,\xi$$
where K(x) is the diagonal matrix K(x) := diag{κ₁(x,·), . . . , κ_L(x,·)}.
Proof: The proof is an immediate consequence of the definition of an adjoint mapping.

Fact 20 ([43, Th. 25.6]): Let l : R^L → R be a convex function, continuous on R^L. Then, $\overline{\operatorname{conv}}(X(\xi)) = \partial l(\xi)$, ∀ξ ∈ R^L, where conv stands for the convex hull of a set [43], and the overline symbol denotes the closure of a set. X(ξ) is the set of all limit points of sequences of the form (l′(ξ_n))_{n∈N}, where (ξ_n)_{n∈N} is a sequence of points at which l is (Fréchet) differentiable, (l′(ξ_n))_{n∈N} are the differentials at the respective points, and lim_{n→∞} ξ_n = ξ.

APPENDIX C
PROOF OF LEMMA 7

Notice that L_ε = l_ε ∘ A, where A is the affine mapping defined in Example 18. First, we will calculate the subdifferential of the loss function l_ε(ξ) := max{0, ‖ξ‖₁ − ε}, ∀ξ ∈ R^L. Then, we will apply Fact 17. We construct the proof case by case.

1) Consider the case of a ξ such that ‖ξ‖₁ < ε. It can be easily verified (proof omitted) that ξ belongs to the interior of the set {ξ̂ : ‖ξ̂‖₁ ≤ ε}. Hence, there exists a δ > 0 such that the open ball B(ξ, δ) ⊂ {ξ̂ : ‖ξ̂‖₁ ≤ ε} (for the definition of the interior of a set and of an open ball, see Appendix A). By the definition of l_ε, ∀h ∈ B(ξ, δ), l_ε(h) = 0. Moreover, by Definition 5 of the subgradient, any subgradient ζ of l_ε at ξ will satisfy the following: ∀h ∈ B(ξ, δ), ζᵗQ(h − ξ) + l_ε(ξ) ≤ l_ε(h) or, equivalently, ζᵗQ(h − ξ) ≤ 0. If we change variables to ĥ := h − ξ, this takes the more compact form: ∀ĥ ∈ B(0, δ), ζᵗQĥ ≤ 0. Now, we can always find an r > 0 (sufficiently small) so that rζ ∈ B(0, δ). Hence, by the above, ζᵗQ(rζ) ≤ 0 ⇒ ζᵗQζ = ‖ζ‖²_Q ≤ 0, which clearly implies that ζ = 0. In other words, we have just showed that ∂l_ε(ξ) = {0}.

2) We introduce here some notation: given a ξ, define the index set J_ξ := {j ∈ 1, L : ξ_j = 0}. Consider the case of a ξ such that ‖ξ‖₁ > ε and J_ξ = ∅. The fact J_ξ = ∅ means that
|ξ_j| > 0, ∀j ∈ 1, L.   (14)
Then, for such a ξ, we clearly have l_ε(ξ) = Σ_{j=1}^{L} |ξ_j| − ε = Σ_{j=1}^{L} sgn(ξ_j) ξ_j − ε, where the function sgn stands for the sign of a real number. Now, recall Definition 14, and assume an arbitrary h which belongs to a neighborhood of 0. We can make the neighborhood small enough so that (14) implies the following:
sgn(ξ_j) = sgn(ξ_j + h_j), ∀j ∈ 1, L.   (15)
Use, now, (15) with Definition 14, and especially the point where one has to consider a lim_{h→0}, to verify that the Fréchet differential of l_ε at ξ is given by, ∀h ∈ R^L,
$$l_\varepsilon'(\xi)(h)=\sum_{j=1}^{L}\operatorname{sgn}(\xi_j)\,h_j=\zeta^t Q h=\langle \zeta,h\rangle_Q$$
where ζ := Q^{-1}[sgn ξ₁, . . . , sgn ξ_L]ᵗ. Hence, by Definition 14, we obtain l′_ε(ξ) = ζ = Q^{-1}[sgn ξ₁, . . . , sgn ξ_L]ᵗ.

3) Next, we move to the case where ξ is such that ‖ξ‖₁ > ε and J_ξ = {j₁, . . . , j_ν} ≠ ∅. The function l_ε is not Fréchet differentiable at such a ξ. To tackle such a case, we will make use of Fact 20. By the definition of J_ξ, we have ξ_j = 0 for any j ∈ {j₁, . . . , j_ν}. We can always choose a sequence (h_n)_{n∈N} such that lim_{n→∞} h_n = ξ with h_{n,j} ≠ 0, ∀n ∈ N and ∀j ∈ 1, L. For example, let h_{n,j} := ξ_j ± (1/n), ∀n ∈ N and ∀j ∈ 1, L. Since ‖·‖₁ is continuous, and since lim_{n→∞} h_n = ξ, it is clear that, for all sufficiently large n, ‖h_n‖₁ > ε. As we have already seen in part 2 of the proof of Lemma 7, the function l_ε is Fréchet differentiable at such a point h_n, with
l′_ε(h_n) = Q^{-1}[sgn h_{n,1}, . . . , sgn h_{n,L}]ᵗ.   (16)
Recall now that, since lim_{n→∞} h_n = ξ, we have lim_{n→∞} h_{n,j} = 0, ∀j ∈ J_ξ. However, we have constructed every point h_n of our sequence so that, for all large n, h_{n,j} ≠ 0, which means that, for all such n, sgn h_{n,j} is either +1 or −1. Hence, if we take lim_{n→∞} on both sides of (16), then l′_ε(h_n) converges to Q^{-1}v, where v is an L-dimensional vector whose components are as follows:
$$v_j:=\begin{cases}\operatorname{sgn}\xi_j, & \text{if } j\notin J_\xi\\ +1 \text{ or } -1, & \text{if } j\in J_\xi.\end{cases}\qquad(17)$$
Apparently, the total number of such vectors v is 2^ν, since ν is the cardinality of J_ξ. If we gather all such vectors v and use Fact 20, we obtain
$$\partial l_\varepsilon(\xi)=\overline{\operatorname{conv}}\{Q^{-1}v_1,\ldots,Q^{-1}v_{2^\nu}\}=\operatorname{conv}\{Q^{-1}v_1,\ldots,Q^{-1}v_{2^\nu}\}.$$
Notice that the closure operation on the convex hull is not needed, since the cardinality of the set {Q^{-1}v₁, . . . , Q^{-1}v_{2^ν}} is finite.

4) Assume now a ξ such that ‖ξ‖₁ = ε and J_ξ = ∅. By following similar arguments, one can verify that
∂l_ε(ξ) = conv{0, Q^{-1}[sgn ξ₁, . . . , sgn ξ_L]ᵗ} = {γQ^{-1}[sgn ξ₁, . . . , sgn ξ_L]ᵗ : γ ∈ [0, 1]}.
The point 0 results from Fact 20 and the fact that we can construct a sequence that approaches ξ from the interior of the set {ξ̂ : ‖ξ̂‖₁ ≤ ε}, while the point Q^{-1}[sgn ξ₁, . . . , sgn ξ_L]ᵗ comes from the fact that we can also approach ξ from the complement of the set {ξ̂ : ‖ξ̂‖₁ ≤ ε}.

5) Now, we are experienced enough to verify that, in the case where ξ is such that ‖ξ‖₁ = ε and J_ξ ≠ ∅, the subdifferential becomes ∂l_ε(ξ) = conv{0, Q^{-1}v₁, . . . , Q^{-1}v_{2^ν}}.

It can be easily verified, by using the definition of the convex hull and the fact that the mapping A₀* is linear, that
A₀* conv(S) = conv(A₀*(S)), ∀S ⊂ R^L.   (18)
Use, now, this result, the expressions computed for ∂l_ε above, and Fact 17 to obtain the final result for ∂L_ε = ∂(l_ε ∘ A), given in Table I.

APPENDIX D
PROOF OF THEOREM 11

The proof is based on an equivalent description of Algorithm 9. For such a description, the following nonnegative convex function is defined, ∀f ∈ H:
$$\Theta_n(f):=\begin{cases}\dfrac{1}{L_n}\displaystyle\sum_{i\in I_n}\omega_i^{(n)}\dfrac{L_{\varepsilon,i}(f_n)}{\|L'_{\varepsilon,i}(f_n)\|^2}\,L_{\varepsilon,i}(f), & \text{if } I_n\neq\emptyset\\[4pt] 0, & \text{if } I_n=\emptyset\end{cases}$$
where L_n := Σ_{i∈I_n} ω_i^{(n)} (L_{ε,i}(f_n)/‖L′_{ε,i}(f_n)‖)². Then, it can be shown that Algorithm 9 can be written equivalently as follows, ∀n ∈ N:
$$f_{n+1}=\begin{cases}P_{B[0,\Delta]}\!\left(f_n-\lambda_n\dfrac{\Theta_n(f_n)}{\|\Theta'_n(f_n)\|^2}\,\Theta'_n(f_n)\right), & \text{if } \Theta'_n(f_n)\neq 0\\[4pt] P_{B[0,\Delta]}(f_n), & \text{if } \Theta'_n(f_n)=0\end{cases}$$
where Θ′_n(f_n) is a subgradient of Θ_n at f_n, and λ_n ∈ (0, 2). In order not to overload this paper with lengthy mathematical arguments, the detailed proof is deferred to the supplementary file of this paper.

APPENDIX E
PROOF OF PROPOSITION 13

Since the mapping f is assumed linear, with domain R^{2M} and range R^L, elementary linear algebra guarantees the existence of a matrix U ∈ R^{2M×L} such that f(x) = Uᵗx, ∀x ∈ R^{2M}. Now, perform the partition U =: [U₁ᵗ, U₂ᵗ]ᵗ, with U₁, U₂ ∈ R^{M×L}, and define W := U₁ + ıU₂. Then, it is easy to verify by elementary algebra that f(χ₁(n)) = Uᵗχ₁(n) = ℜ(W^∗X(n)) and f(χ₂(n)) = Uᵗχ₂(n) = ℑ(W^∗X(n)), which establishes Proposition 13.
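For intuition only, here is a minimal Euclidean sketch (our own simplification; Algorithm 9 itself operates in the RKHS H, and the function names below are hypothetical) of one step of the equivalent update used in the proof of Theorem 11: a relaxed subgradient step on Θ_n, followed by the metric projection onto the ball B[0, Δ].

```python
import numpy as np

def ball_projection(f, delta):
    """Metric projection onto the closed ball B[0, delta]."""
    n = np.linalg.norm(f)
    return f if n <= delta else (delta / n) * f

def apsm_step(f_n, theta_n, theta_n_sub, lam=1.0, delta=100.0):
    """One update f_{n+1} of the equivalent form above (Euclidean sketch)."""
    g = theta_n_sub(f_n)
    if np.allclose(g, 0.0):
        return ball_projection(f_n, delta)
    step = lam * theta_n(f_n) / (g @ g)
    return ball_projection(f_n - step * g, delta)

# toy usage with a single epsilon-insensitive-like convex function
eps = 0.01
a, y = np.array([1.0, -2.0]), 0.5
theta = lambda f: max(0.0, abs(f @ a - y) - eps)
theta_sub = lambda f: np.sign(f @ a - y) * a if theta(f) > 0 else np.zeros_like(f)
f = np.zeros(2)
for _ in range(10):
    f = apsm_step(f, theta, theta_sub, lam=1.0, delta=100.0)
print(theta(f))   # approx. 0 after a few iterations
```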
R EFERENCES [1] B. Schölkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2001. [2] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. New York: Academic, 2009. [3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000. [4] F. Dufrenois, J. Colliez, and D. Hamad, “Bounded influence support vector regression for robust single-model estimation,” IEEE Trans. Neural Netw., vol. 20, no. 11, pp. 1689–1706, Nov. 2009. [5] D. Musicant and A. Feinberg, “Active set support vector regression,” IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 268–275, Mar. 2004. [6] D. Tzikas, A. Likas, and N. Galatsanos, “Sparse Bayesian modeling with adaptive kernel learning,” IEEE Trans. Neural Netw., vol. 20, no. 6, pp. 926–937, Jun. 2009. [7] S. Chen, A. Wolfgang, C. Harris, and L. Hanzo, “Symmetric RBF classifier for nonlinear detection in multiple-antenna-aided systems,” IEEE Trans. Neural Netw., vol. 19, no. 5, pp. 737–745, May 2008. [8] F. Pérez-Cruz and O. Bousquet, “Kernel methods and their potential use in signal processing,” IEEE Signal Process. Mag., vol. 21, no. 3, pp. 57–65, May 2004. [9] N. Aronszajn, “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404, 1950. [10] T. Frieß, N. Cristianini, and C. Campbell, “The kernel adatron algorithm: A fast and simple learning procedure for support vector machines,” in Proc. Int. Conf. Mach. Learn., 1998, pp. 188–196. [11] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004. [12] G. Cauwenberghs and T. Poggio, “Incremental and decremental support vector machine learning,” in Advances in Neural Information Processing Systems, vol. 13. Cambridge, MA: MIT Press, 2000, pp. 409–415. [13] P. Laskov, C. Gehl, S. Krüger, and K.-R. Müller, “Incremental support vector learning: Analysis, implementation and applications,” J. Mach. Learn. Res., vol. 7, pp. 1909–1936, Dec. 2006. [14] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” J. Mach. Learn. Res., vol. 6, no. 9, pp. 1579–1619, Sep. 2005. [15] S. Vishwanathan, N. Schraudolph, and A. Smola, “Step size adaptation in reproducing kernel Hilbert space,” J. Mach. Learn. Res., vol. 7, pp. 1107–1133, Jun. 2006. [16] D. J. Sebald and J. A. Bucklew, “Support vector machine techniques for nonlinear equalization,” IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3217–3226, Nov. 2000. [17] A. H. Sayed, Fundamentals of Adaptive Filtering. New York: Wiley, 2003. [18] S. Haykin, Adaptive Filter Theory, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1996. [19] W. Liu, J. Príncipe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. Hoboken, NJ: Wiley, 2010. [20] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004. [21] G.-O. Glentiis, K. Berberidis, and S. Theodoridis, “Efficient least squares adaptive algorithms for FIR transversal filtering,” IEEE Signal Process. Mag., vol. 16, no. 4, pp. 13–41, Jul. 1999. [22] K. Slavakis, S. Theodoridis, and I. Yamada, “Online kernel-based classification using adaptive projection algorithms,” IEEE Trans. Signal Process., vol. 56, no. 7, pp. 2781–2796, Jul. 2008. [23] K. Slavakis and S. 
Theodoridis, “Sliding window generalized kernel affine projection algorithm using projection mappings,” EURASIP J. Adv. Signal Process., vol. 2008, pp. 1–16, 2008. [24] K. Slavakis, S. Theodoridis, and I. Yamada, “Adaptive constrained learning in reproducing kernel Hilbert spaces: The robust beamforming case,” IEEE Trans. Signal Process., vol. 57, no. 12, pp. 4744–4764, Dec. 2009. [25] I. Yamada, K. Slavakis, and K. Yamada, “An efficient robust adaptive filtering algorithm based on parallel subgradient projection techniques,” IEEE Trans. Signal Process., vol. 50, no. 5, pp. 1091–1101, May 2002. [26] I. Yamada, “Adaptive projected subgradient method: A unified view for projection based adaptive algorithms,” J. IEICE, vol. 86, no. 8, pp. 654– 658, Aug. 2003.
[27] I. Yamada and N. Ogura, “Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions,” Numer. Funct. Anal. Optim., vol. 25, nos. 7–8, pp. 593–617, 2004. [28] K. Slavakis, I. Yamada, and N. Ogura, “The adaptive projected subgradient method over the fixed point set of strongly attracting nonexpansive mappings,” Numer. Funct. Anal. Optim., vol. 27, nos. 7–8, pp. 905–930, 2006. [29] K. Slavakis and I. Yamada. (2011). The Adaptive Projected Subgradient Method Constrained by Families of Quasi-Nonexpansive Mappings and Its Application to Online Learning [Online]. Available: http://arxiv.org/abs/1008.5231 [30] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning in a world of projections: A unifying framework for linear and nonlinear classification and regression tasks,” IEEE Signal Process. Mag., vol. 28, no. 1, pp. 97–123, Jan. 2011. [31] P. J. Huber, Robust Statistics. New York: Wiley, 1981. [32] E. G. Larsson, P. Stoica, and G. Ganesan, Space-Time Block Coding for Wireless Communications. New York: Cambridge Univ. Press, 2003. [33] S. Shahbazpanahi, M. Beheshti, A. Gershman, M. Gharavi-Alkhansari, and K. Wong, “Minimum variance linear receivers for multiaccess MIMO wireless systems with space-time block coding,” IEEE Trans. Signal Process., vol. 52, no. 12, pp. 3306–3313, Dec. 2004. [34] R. Cavalcante and I. Yamada, “Multiaccess interference suppression in orthogonal space–time block coded MIMO systems by adaptive projected subgradient method,” IEEE Trans. Signal Process., vol. 56, no. 3, pp. 1028–1042, Mar. 2008. [35] M. Chen, S. Ge, and B. How, “Robust adaptive neural network control for a class of uncertain MIMO nonlinear systems with input nonlinearities,” IEEE Trans. Neural Netw., vol. 21, no. 5, pp. 796–812, May 2010. [36] M. Sánchez-Fernández, M. de Prado-Cumplido, J. Arenas-García, and F. Pérez-Cruz, “SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2298–2307, Aug. 2004. [37] I. Goethals, K. Pelckmans, J. Suykens, and B. D. Moor, “Identification of MIMO Hammerstein models using least squares support vector machines,” Automatica, vol. 41, no. 7, pp. 1263–1272, Jul. 2005. [38] S. Alamouti, “A simple transmit diversity technique for wireless communications,” IEEE J. Sel. Areas Commun., vol. 16, no. 8, pp. 1451–1458, Oct. 1998. [39] V. Tarokh, H. Jafarkhani, and A. Calderbank, “Space-time block codes from orthogonal designs,” IEEE Trans. Inf. Theory, vol. 45, no. 5, pp. 1456–1467, Jul. 1999. [40] P. L. Combettes, “The foundations of set theoretic estimation,” Proc. IEEE, vol. 81, no. 2, pp. 182–208, Feb. 1993. [41] G. S. Kimeldorf and G. Wahba, “Some results on Tchebycheffian spline functions,” J. Math. Anal. Appl., vol. 33, no. 1, pp. 82–95, 1971. [42] E. Kreyszig, Introductory Functional Analysis with Applications (Classics Library). New York: Wiley, 1989. [43] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970. [44] I. Ekeland and R. Témam, Convex Analysis and Variational Problems. Amsterdam, The Netherlands: North-Holland, 1976. [45] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, 2nd ed. New York: Wiley, 1993. [46] B. T. Polyak, “Minimization of unsmooth functionals,” USSR Comput. Math. Phys., vol. 9, no. 3, pp. 14–29, 1969. [47] L. M. Bregman, “The method of successive projections for finding a common point of convex sets,” Soviet Math. Dokl., vol. 
162, no. 3, pp. 688–692, 1965. [48] L. G. Gubin, B. T. Polyak, and E. V. Raik, “The method of projections for finding the common point of convex sets,” USSR Comput. Math. Phys., vol. 7, no. 6, pp. 1–24, 1967. [49] M. Yukawa, K. Slavakis, and I. Yamada, “Multi-domain adaptive learning based on feasibility splitting and adaptive projected subgradient method,” IEICE Trans. Fundam., vol. E93-A, no. 2, pp. 456–466, Feb. 2010. [50] Y. Kopsinis, K. Slavakis, and S. Theodoridis, “Online sparse system identification and signal reconstruction using projections onto weighted 1 balls,” IEEE Trans. Signal Process., vol. 59, no. 3, pp. 936–952, Mar. 2011. [51] H. H. Bauschke and P. L. Combettes, “A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert spaces,” Math. Oper. Res., vol. 26, no. 2, pp. 248–264, May 2001.
Konstantinos Slavakis (M’08) received the M.E. and Ph.D. degrees in electrical and electronic engineering from the Tokyo Institute of Technology (TokyoTech), Tokyo, Japan, in 1999 and 2002, respectively. He was with TokyoTech as a Japan Society for the Promotion of Science Post-Doctoral Fellow from 2004 to 2006 and, from 2006 to 2007, he was a Post-Doctoral Fellow with the Department of Informatics and Telecommunications, University of Athens, Athens, Greece. Since September 2007, he has been an Assistant Professor with the Department of Telecommunications Science and Technology, University of Peloponnese, Tripolis, Greece. His current research interests include applications of convex analysis and computational algebraic geometry to signal processing, machine learning, arrays, and multidimensional systems problems. Dr. Slavakis was a recipient of the Japanese Government Scholarship from 1996 to 2002. He serves as an Associate and Area Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING.
Pantelis Bouboulis (M’10) received the M.Sc. and Ph.D. degrees in informatics and telecommunications from the National and Kapodistrian University of Athens, Athens, Greece, in 2002 and 2006, respectively. He served as an Assistant Professor with the Department of Informatics and Telecommunications, University of Athens, from 2007 to 2008. Since 2008, he has taught mathematics in Greek High Schools. His current research interests include the areas of machine learning, fractals, wavelets, and image processing.
Sergios Theodoridis (F’08) is currently a Professor of signal processing and communications with the Department of Informatics and Telecommunications, University of Athens, Athens, Greece. He is the co-editor of the book Efficient Algorithms for Signal Processing and System Identification (Englewood Cliffs, NJ: Prentice-Hall, 1993), and the co-author of Pattern Recognition (Boston: Academic Press, 4th ed. 2008), Introduction to Pattern Recognition: A MATLAB Approach (Boston: Academic Press, 2009), and three books in Greek, two of them for the Greek Open University. He is the co-author of six papers that have received best paper awards, including the IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award in 2009. His current research interests include the areas of adaptive algorithms and communications, machine learning and pattern recognition, and signal processing for audio processing and retrieval. He has served as an IEEE Signal Processing Society Distinguished Lecturer. He was the General Chairman of the European Signal Processing Conference in 1998, the Technical Program Co-Chair of the International Symposium on Circuits and Systems in 2006, the Co-Chairman and Co-Founder of CIP in 2008, and the Co-Chairman of CIP in 2010. He has served as the President of the European Association for Signal Processing and as a member of the Board of Governors of the IEEE CAS Society. He currently serves as a member of the Board of Governors (Member-at-Large) of the IEEE Signal Processing (SP) Society. He has served as a member of the Greek National Council for Research and Technology and he was the Chairman of the SP Advisory Committee of the Edinburgh Research Partnership. He has served as a Vice Chairman of the Greek Pedagogical Institute and he was a member of the Board of Directors of COSMOTE (the Greek mobile phone operating company) for four years. He is a Fellow of the Institution of Engineering and Technology and of the European Association for Signal Processing, and a Corresponding Fellow of the Royal Society of Arts.