IEEE COMMUNICATIONS LETTERS, VOL. 20, NO. 11, NOVEMBER 2016
Machine Learning-Based Antenna Selection in Wireless Communications

Jingon Joung, Senior Member, IEEE

Abstract—This letter is a first attempt to combine a machine learning technique with wireless communications. By recasting antenna selection (AS) in wireless communications (i.e., an optimization-driven decision) as a multiclass-classification learning problem (i.e., a data-driven prediction), and by comparing learning-based AS using k-nearest neighbors and support vector machine algorithms with conventional optimization-driven AS methods in terms of communication performance, computational complexity, and feedback overhead, we provide insight into the potential of fusing machine learning and wireless communications.

Index Terms—Machine learning, multiclass classification, k-NN, SVM, data-driven prediction (DDP), optimization-driven decision (ODD), antenna selection, MIMO.
I. INTRODUCTION

Over the past few decades, advanced analytics techniques, such as data analytics, data mining, and machine learning, have attracted the attention of analysts, data scientists, researchers, and engineers in various fields seeking to exploit very large and diverse data sets. For example, in information communications, a network utilizing big data was recently demonstrated [1], and networks and wireless communications that embrace big data have been studied [2], [3].

This study focuses on the potential of machine learning techniques in wireless communications. Specifically, we apply multiclass classification, a primary task in machine learning, to a multiple-input multiple-output (MIMO) system with transmit antenna selection (AS) [4]. We employ multiclass classification algorithms, namely multiclass k-nearest neighbors (k-NN) and a support vector machine (SVM) [5], to classify training channel state information (CSI) into one of a set of classes, each of which represents the antenna set that provides the best communication performance. By training with a sufficient number of channels (data), we obtain a classification model and use it to classify a new channel and thereby obtain the best antenna set, i.e., data-driven prediction (DDP). We compare the learning-based AS systems with conventional AS systems that maximize the minimum of either the eigenvalue or the norm of the channels, i.e., optimization-driven decision (ODD). From the comparison, we discuss the systems' advantages and disadvantages in terms of bit-error rate (BER), selection complexity, and feedback overhead. This study thus provides insight into the fusion of machine learning and wireless communications.
II. ODD: OPTIMIZATION-DRIVEN ANTENNA SELECTION

We consider simple point-to-point communication between a transmitter (Tx) and a receiver (Rx) employing n_t and n_r antennas, respectively. The Tx transmits n_d independent data streams, x ∈ C^{n_d×1}, through n_s selected antenna(s) to the Rx, where n_d ≤ min{n_r, n_s} for reliable performance with linear processing. The Rx then estimates x̃ ∈ C^{n_d×1} as follows:¹

    x̃ = W_r H W_t x + W_r n,

where W_t ∈ C^{n_s×n_d} and W_r ∈ C^{n_d×n_r} are the pre- and post-coding matrices, respectively; H ∈ C^{n_r×n_s} is a partial MIMO channel matrix whose n_s columns are selected (i.e., AS) from the n_t columns of the original full MIMO channel matrix H_full ∈ C^{n_r×n_t}; and n ∈ C^{n_r×1} is the additive white Gaussian noise (AWGN) at the Rx. Next, a set of selected-antenna index vectors s_n ∈ R^{n_s×1} is defined as S = {s_1, ..., s_S}, where the elements of s_n are the indices of the selected antennas and S is the number of selection candidates. The optimal AS s_n^o in terms of BER is obtained by solving the following optimization problem [4]:

    s_n^o = arg max_{s_n ∈ S} sv_{n_d}([H_full]_{s_n}),    (1)

where sv_{n_d}(A) gives the n_d-th largest singular value of matrix A, and [A]_{s_n} gives the partial matrix consisting of the columns of A selected according to the indices in s_n. The pre- and post-coding matrices are then obtained as W_t = V and W_r = U^H, respectively, where U ∈ C^{n_r×n_d} and V^H ∈ C^{n_d×n_s} are the left and right singular matrices corresponding to the largest n_d singular values of H. Because s_n^o in (1) with the corresponding pre- and post-coding maximizes the minimum of the effective channel gains, it yields the optimal BER and is included as a performance bound in the BER comparison.

¹In this study, for the feasibility test, we assume that the Tx and Rx know the CSI. Note that the CSI is typically not required at the Tx in a practical AS system because the selection can be performed at the Rx.
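To make the ODD baseline concrete, the following is a minimal Python sketch of the exhaustive search in (1); the function name odd_select and the i.i.d. Rayleigh test channel are illustrative and not from the letter.

```python
import itertools
import numpy as np

def odd_select(H_full, n_s, n_d):
    """Return the antenna-index tuple solving (1) for an n_r x n_t channel."""
    n_t = H_full.shape[1]
    best_set, best_sv = None, -np.inf
    for s_n in itertools.combinations(range(n_t), n_s):  # candidate set S
        # singular values of the selected sub-channel, in descending order
        sv = np.linalg.svd(H_full[:, list(s_n)], compute_uv=False)
        if sv[n_d - 1] > best_sv:                        # n_d-th largest singular value
            best_sv, best_set = sv[n_d - 1], s_n
    return best_set

# illustrative i.i.d. Rayleigh channel with n_r = 2, n_t = 6
rng = np.random.default_rng(0)
H_full = (rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))) / np.sqrt(2)
print(odd_select(H_full, n_s=2, n_d=2))
```

Note that the loop visits all |S| candidate sets, which is precisely the combinatorial cost that the learning-based approach below aims to avoid at selection time.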
III. DDP: DATA-DRIVEN ANTENNA SELECTION

In order to select the optimal antenna indices among the |S| candidates, where |A| denotes the cardinality of a set A, without solving the combinatorial optimization problem in (1), we consider supervised machine-learning algorithms. In particular, we employ a multiclass classification algorithm to classify CSI into |S| classes, each of which represents the antenna set that provides the best communication performance. With a sufficient number of CSI samples, i.e., training data, we can obtain a classification model and predict the class of new CSI, i.e., the best antenna set for a new channel. Here, we interpret a communications system as a learning system as follows: (i) CSI as training samples (i.e., instances or observations); (ii) the index n of s_n ∈ S as a class label ℓ (i.e., membership); and (iii) the set of labels as the set of classes L (i.e., categories or groups).
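To make interpretation (i)–(iii) concrete, the sketch below enumerates the candidate antenna sets S and labels a batch of training channels; for brevity it scores each set by the n_d-th largest singular value from (1), standing in for the BER-based KPI that the letter actually uses for labeling.

```python
import itertools
import numpy as np

n_t, n_r, n_s, n_d, M = 6, 2, 2, 2, 2000
# label l <-> antenna set S[l], cf. Table I
S = list(itertools.combinations(range(n_t), n_s))

rng = np.random.default_rng(1)
H = (rng.standard_normal((M, n_r, n_t)) + 1j * rng.standard_normal((M, n_r, n_t))) / np.sqrt(2)

def kpi(H_m, s_n):
    # stand-in KPI: the n_d-th largest singular value of the selected columns, per (1)
    return np.linalg.svd(H_m[:, list(s_n)], compute_uv=False)[n_d - 1]

# class label of each training channel = index of its best antenna set
labels = np.array([max(range(len(S)), key=lambda l: kpi(H[m], S[l])) for m in range(M)])
```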
TABLE I
EXAMPLE OF A MAPPING TABLE BETWEEN LABEL ℓ ∈ L AND ANTENNA INDICES s_n ∈ S WHEN n_t = 6, n_r = 2, AND n_s = 2
We refer to this method as a DDP method; henceforth, we use communications and machine-learning terms interchangeably and introduce two popular multiclass classification algorithms, k-NN and SVM, to select the antennas.

A. Training Set Manipulation From Channels

We perform three procedures (not necessarily in sequence): (i) design the training samples from the channels, (ii) design the key performance indicator (KPI), and (iii) declare the corresponding labels based on the KPI, i.e., labeling.

1) Training Set Generation: The training samples are the input of a learning system and are known as input variables, features, predictors, or attributes. In communications, M n_r-by-n_t complex channel matrices H_m (or vectors) are used for the training. Because the training samples must be real-valued vectors, the channels are manipulated into N real-valued features, such as the angle, magnitude, or real and imaginary parts of h_ij, where h_ij is the (i, j)-th complex-valued element of H_m. The training samples must also be normalized, i.e., feature normalization, in order to avoid significant bias in the training. The procedure used in this study is as follows (a code sketch of these steps is given below):

S.1 Generate the 1-by-N real-valued feature vector d_m from the training CSI matrix H_m (refer to the example in Sec. IV).
S.2 Repeat S.1 for all training CSI sets m ∈ {1, ..., M}.
S.3 Generate a training data matrix D ∈ R^{M×N} by stacking the d_m's as D = [d_1^T ··· d_M^T]^T.
S.4 Normalize/scale D to generate a normalized feature matrix T, such that the (i, j)-th element of T is the normalized value of the (i, j)-th element of D: t_ij = (d_ij − E_i[d_ij])/(max_i d_ij − min_i d_ij).

2) KPI Design: A KPI is designed to label the training samples. In general, a KPI can be any metric used in communications, such as spectral efficiency, energy efficiency, BER, the norm of an effective channel, the effective received signal-to-noise ratio (SNR), received signal power, the maximum of the minimum eigenvalue of the effective channel, communication latency, or any combination thereof. In this work, we use the BER as the KPI.

3) Labeling: From the interpretation of AS as multiclass classification, it is clear that designing L is equivalent to designing S. Suppose that there are (n_t choose n_s) combinations of n_s antennas from a set of n_t antennas, i.e., |S| = n_t!/(n_s!(n_t − n_s)!). From the communication-domain knowledge that less correlated antennas are more likely to be selected for better communication performance, we design S with the sets of less correlated antennas. By doing this, the training samples are devoted as equally as possible to each class label, which results in better training performance. Without this knowledge (or if it is known that the channels are uncorrelated), we design S with the full combinations of an antenna set.
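A minimal sketch of steps S.1–S.4, assuming the feature choice |h_ij|² that Sec. IV adopts; the helper names features and normalize are illustrative.

```python
import numpy as np

def features(H):                 # S.1-S.3: one 1 x N real feature vector per CSI matrix
    return (np.abs(H) ** 2).reshape(H.shape[0], -1)   # N = n_r * n_t features per sample

def normalize(D):                # S.4: per-feature (column-wise) scaling as in t_ij
    return (D - D.mean(axis=0)) / (D.max(axis=0) - D.min(axis=0))

# illustrative Rayleigh training channels: M = 2000, n_r = 2, n_t = 6
rng = np.random.default_rng(1)
H = (rng.standard_normal((2000, 2, 6)) + 1j * rng.standard_normal((2000, 2, 6))) / np.sqrt(2)
T = normalize(features(H))       # M x N normalized feature matrix, N = 12
```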
A mapping example between the labels ℓ ∈ L and the antenna-index vectors s_n ∈ S is presented in Table I for n_t = 6, n_r = 2, and n_s = 2. There, we list the full set of s_n and highlight the less correlated antenna sets in bold typeface for correlation factors ρ_t = 0.3 and ρ_r = 0.1 at the Tx and Rx, respectively (refer to the precise correlation model in [6]). We then identify a label for each training sample from L. This procedure is called labeling and is summarized as follows:

S.5 Evaluate the KPI for the m-th training CSI H_m with the particular antenna set s_n corresponding to each label ℓ ∈ L.
S.6 Generate a class-label vector c ∈ R^{M×1} by setting the m-th element c_m of c to ℓ*, the label that gives the best KPI.
S.7 Repeat S.5 and S.6 for all CSI samples H_m, m ∈ {1, ..., M}.

B. Build a Learning System

We now have a manipulated real-valued matrix T ∈ R^{M×N} and a corresponding class-label vector c = [c_1 ··· c_M]^T. Using this labeled training data set, i.e., T and c, we build a learning system, i.e., a trained multiclass classifier, whose input is CSI and whose output is the index of the selected antenna set. As |L| > 2 in AS systems, we employ |L|-class (i.e., multiclass) classification algorithms, namely the k-NN and SVM algorithms (a code sketch using both is given at the end of this section). For a simple description of the multiclass classification algorithms, we denote the m-th row vector of T by t_r[m].

1) Multiclass k-NN Classification: Among the M training samples, a k-NN classifier finds the k nearest training samples {t_r[m]} to a new observation t_r, which is a query sample. The k-NN classifier then declares the class/label ℓ* of t_r based on a majority (the highest representation or a weighted average) among the labels ℓ_m that belong to the k nearest training samples. Here, 'nearest' is defined by a distance (or dissimilarity) measure between the samples. In this study, we confine ourselves to the most popular distance measure, the Euclidean distance, defined as d(t_r[m], t_r) = ||t_r[m] − t_r||_2. We restrict k to an odd number in order to avoid an ambiguous majority, and we determine it by exhaustively searching from 1 to approximately M/3 for the value that minimizes the misclassification error.

2) Multiclass SVM Classification: In order to classify multiple classes using an SVM, we employ |L| binary classifiers, each of which identifies one category versus all other categories, i.e., a one-vs.-rest (or one-vs.-all) binary classification approach. The detailed procedure is as follows.

S.8 Define a sub-training data set T_ℓ, such that t_r[m] is a row of T_ℓ if c_m = ℓ, for all m ∈ {1, ..., M}. Similarly, define T_ℓ for all ℓ ∈ L. We then run an SVM to classify the two training groups T_ℓ and T_ℓ̄, where T_ℓ̄ is the shrunken matrix obtained by eliminating the row vectors of T_ℓ from T.
S.9 Generate a binary label vector b_ℓ = [b_ℓ[1] ··· b_ℓ[M]]^T for the ℓ-th binary classification, such that b_ℓ[m] = 1 if c_m = ℓ, and b_ℓ[m] = 0 otherwise.
S.10 One(ℓ)-vs.-rest(ℓ̄) method: Solve an alternative logistic regression problem with the two training groups {T_ℓ, T_ℓ̄} and the corresponding binary labels b_ℓ, as follows:
    θ_ℓ = arg min_θ { C Σ_{m=1}^{M} [ b_ℓ[m] g_1(θ^T f(t_r[m])) + (1 − b_ℓ[m]) g_0(θ^T f(t_r[m])) ] + ||θ||_2^2 / 2 }.    (2)
In (2), C is a penalty (or cost) parameter that balances bias against overfitting (similar to a regularization factor in regression); the cost function g_k(·) is defined as g_k(z) = (−1)^k z + 1 if (−1)^k z ≥ −1, and 0 otherwise; θ ∈ R^{M×1} is a learning parameter vector; and f(t_r[m]) ∈ R^{M×1} is a Gaussian radial-basis kernel function vector that improves on the raw choice of features, whose q-th element f_q(t_r[m]) gives a similarity score between t_r[q] and t_r[m] as f_q(t_r[m]) = exp(−||t_r[q] − t_r[m]||²/(2σ²)). Here, the penalty parameter C and the variance σ² of the Gaussian radial-basis kernel are design parameters. The Gaussian kernel is commonly used in SVMs when the number of features N is small and the number of training samples M is intermediate [5]; this case is typical of the AS system considered.

S.11 Repeat S.10 for all ℓ ∈ {1, ..., |L|}.

C. Antenna Selection Based on a Multiclass SVM Classifier

Once all θ_ℓ values are obtained, we can build an AS system using the learning function in (2). Upon the arrival of a new channel matrix, we manipulate it into an input of the learning machine, i.e., t_r, and feed it to the classifier built with {θ_ℓ} in order to predict the label of its class, i.e., the selected antenna index. Because the SVM is implemented as |L| binary classifiers, it might predict multiple labels; in this case, we determine the optimal label ℓ* as the one with the lowest cost, and the antenna set corresponding to ℓ* is selected for communications.

D. Parameter Optimization

The classification performance depends on the design of the parameters. For example, the performance of k-NN depends on the number of neighbors k, the distance function d(·), and the weighting of the distance, while the performance of the SVM depends primarily on the kernel functions. Furthermore, in communications, a learning system is built offline, and it is difficult to directly include the numerous practical parameters of the communication environment in the design. Moreover, the parameters also depend on the system configuration, such as n_t, n_r, n_s, the SNR, the channel uncertainty level, and the spatial correlation factors. Hence, the design of the optimal parameters remains a subject for future study. In this study, we follow a heuristic approach that is typically used to identify good parameters: for k-NN, we perform an exhaustive search for a better k; for the SVM, we evaluate multiple random initial parameters σ² and C using iterative cross-validation with M/10 (10%) of the training samples.
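As a rough illustration of Secs. III-B through III-D, the sketch below trains both classifiers using scikit-learn as a stand-in for the letter's own one-vs.-rest solver of (2); SVC's RBF kernel corresponds to the Gaussian kernel with gamma = 1/(2σ²), the parameter grids are arbitrary, and random placeholder data stand in for (T, c).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# placeholder training data standing in for (T, c) from Sec. III-A
rng = np.random.default_rng(2)
T = rng.standard_normal((200, 12))
c = rng.integers(0, 6, 200)

# k-NN: odd k chosen by 10-fold cross-validated search (cf. Sec. III-D)
knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 60, 2))}, cv=10)
knn.fit(T, c)

# one-vs.-rest SVM with a Gaussian (RBF) kernel; C and gamma = 1/(2*sigma^2)
# are the design parameters, tuned here by cross-validation
svm = OneVsRestClassifier(
    GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=10)
)
svm.fit(T, c)

t_new = rng.standard_normal((1, 12))   # a manipulated new channel, cf. Sec. III-C
print(knn.predict(t_new), svm.predict(t_new))
```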
IV. PERFORMANCE EVALUATION WITH TEST CHANNELS

In this section, we evaluate the BER performance of AS systems invoking machine learning-based AS and optimization-based AS, and compare their performance (Table II). For comparison, we consider three benchmark systems. The AS criteria/methods compared in the simulation are listed and denoted as follows:

• ODD: MaxMinEV. Maximize the minimum eigenvalue of the selected channels, as in (1).
• ODD: MaxMinNorm. Maximize the minimum norm of the selected channels, i.e., s_i = arg max_{s_i ∈ S} min ||[H_full]_{s_i}||².
• DDP: SVM. Multiclass SVM-based selection.
• DDP: k-NN. Multiclass k-NN-based selection.
• Random selection: Random.

We generated 2×10³ training CSI samples (i.e., M = 2×10³) following a Rayleigh distribution and considered spatial correlation factors ρ_t and ρ_r for the transmit and receive antennas, respectively. For each channel realization, 200 16-quadrature-amplitude-modulation (QAM) symbols were transmitted. We set the spatial correlation factors of the channels to ρ_t = 0.3 and ρ_r = 0.1. For uncertain CSI, the uncertain channel power was set to 2% of the average channel power. In S.1, |h_ij|² was used to construct d_m because, from our observations, it is effective in learning the cost in (1). The other communications and training-and-learning parameters tested in the evaluation are as follows:

• Fig. 1(a): {n_t, n_r, n_s, n_d} = {8, 1, 1, 1}; N = 8; |L| = 8 with s_i ∈ {1, ..., 8}; and 15 dB SNR per Rx antenna.
• Fig. 1(b): {n_t, n_r, n_s, n_d} = {6, 2, 2, 2}; N = 12; |L| = 6 with the sets highlighted in Table I; and 20 dB SNR per Rx antenna.
• Note: Other configurations with feasible n_s for {8, 1, n_s, 1} and {6, 2, n_s, 2} show the same trends as the results in Figs. 1(a) and 1(b), respectively; these results are therefore omitted in this letter.

A. Classification Performance

In order to visualize the multiclass classification performance, we illustrate the misclassification (error) rate using a web representation [7] in Figs. 1(a) and 1(b). Each vertex of the regular polygon represents a misclassification ℓ → ℓ̄, where ℓ̄ ∈ L and ℓ̄ ≠ ℓ, and the corresponding misclassification rate is marked on the spoke. When the Tx selects one antenna and transmits a single stream, the multiclass classification performs better than in the two-antenna, two-stream case. This is because the dimension of the features, N, increases from 8 to 12 while the training set remains limited. Furthermore, the magnitudes of the elements of a vector channel, i.e., d_m, are sufficient information to capture the KPI of the vector channel, yet they are not sufficient for a matrix channel.

B. Communications Performance

In Fig. 1, we illustrate the empirical cumulative distribution function (CDF) of the BER. Because we have finitely many training samples of channels in a large dimension (proportional to N), and samples of random channels are outliers with high probability, the k-NN classifier often suffers from high variation of the classification at the decision boundary, as reported in [5], whereas an SVM handles outliers well and can be used in a linear or non-linear manner through the kernel. As seen in the results, an SVM classifier is therefore preferable to a k-NN classifier for communications. Moreover, better classification yields better communications performance.
Fig. 1. Empirical CDF of the BER. (a) Single-antenna selection for single-stream transmission: {n_t, n_r, n_s, n_d} = {8, 1, 1, 1}, N = 8, ρ_t = 0.3, ρ_r = 0.1, and SNR of 15 dB. (b) Two-antenna selection for double-stream transmission: {n_t, n_r, n_s, n_d} = {6, 2, 2, 2}, N = 12, ρ_t = 0.3, ρ_r = 0.1, and SNR of 20 dB.

TABLE II
COMPARISON OF AS METHODS. |L| ≤ (n_t choose n_s) = n_t!/(n_s!(n_t − n_s)!)
When the Tx selects one antenna to send a single stream, all methods except Random achieve near-optimal performance, as illustrated in Fig. 1(a). In contrast, there is a performance loss when multiple antennas are selected for multi-stream transmission. Here, we note that the learning-based AS, i.e., the SVM, outperforms the other methods.

C. Feedback Amount and Computational Complexity

The required feedback amount varies depending on whether the AS decision is made at the Tx or the Rx. If the Rx determines the selection, all schemes except Random require log₂|L|-bit feedback to inform the Tx of the label of s_i. In contrast, if the Tx determines the selection, H, i.e., n_r n_t complex values, must be fed back for MaxMinEV and MaxMinNorm, whereas the SVM and k-NN schemes need only t, i.e., N real values (see the short calculation below). Through the design of d_m in S.1, the learning-based methods can reduce the feedback amount further.

The selection complexity is compared in Table II. The selection complexity is defined as the prediction complexity excluding the training complexity, as the training is performed offline before the communications. It can be seen that the selection complexity of the learning-based AS (i.e., DDP) is polynomial, which is clearly lower than that of MaxMinEV and MaxMinNorm (i.e., ODD), which use a combinatorial search across |L| ≤ (n_t choose n_s) candidates.
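A back-of-envelope check of this feedback comparison for the two configurations simulated above, using the bound |L| ≤ C(n_t, n_s) from Table II:

```python
from math import ceil, comb, log2

# feedback needed per selection decision, for the two simulated configurations
for n_t, n_r, n_s, N in [(8, 1, 1, 8), (6, 2, 2, 12)]:
    L_max = comb(n_t, n_s)                  # |L| <= C(n_t, n_s), cf. Table II
    print(f"n_t={n_t}, n_s={n_s}: Rx-side selection: {ceil(log2(L_max))}-bit label; "
          f"Tx-side ODD: {n_r * n_t} complex values; Tx-side DDP: {N} real values")
```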
V. CONCLUSION

In this letter, we applied multiclass classification to an antenna selection system. The results, viewed from a communications perspective and leaving the training cost aside, verify the feasible complexity and performance of learning-based antenna selection. This study provides a reference for the use of various machine-learning algorithms in wireless communications (i.e., DDP in place of ODD), and it will accelerate their implementation in real-life communications. Interesting areas for further study include the validation of a learning system with realistic channels and an online learning algorithm to track channels with time-varying statistics.

REFERENCES

[1] Korea Communication Review, pp. 38–42, Apr. 2015. [Online]. Available: http://www.netmanias.com/en/?m=view&id=reports&no=7430&vm=pdf
[2] T. T. T. Nguyen and G. Armitage, "A survey of techniques for Internet traffic classification using machine learning," IEEE Commun. Surveys Tuts., vol. 10, no. 4, pp. 56–76, 4th Quart., 2008.
[3] S. Bi, R. Zhang, Z. Ding, and S. Cui, "Wireless communications in the era of big data," IEEE Commun. Mag., vol. 53, no. 10, pp. 190–199, Oct. 2015.
[4] R. W. Heath, S. Sandhu, and A. Paulraj, "Antenna selection for spatial multiplexing systems with linear receivers," IEEE Commun. Lett., vol. 5, no. 4, pp. 142–144, Apr. 2001.
[5] H. Zhang, A. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), New York, NY, USA, Jun. 2006, pp. 2126–2136.
[6] J.-P. Kermoal, L. Schumacher, K. I. Pedersen, P. E. Mogensen, and F. Frederiksen, "A stochastic MIMO radio channel model with experimental validation," IEEE J. Sel. Areas Commun., vol. 20, no. 6, pp. 1211–1226, Aug. 2002.
[7] B. Diri and S. Albayrak, "Visualization and analysis of classifiers performance in multi-class medical data," Expert Syst. Appl., vol. 34, no. 1, pp. 628–634, Jan. 2008.