Helsinki University of Technology Department of Signal Processing and Acoustics Espoo 2008

Report 6

ADVANCED OPTIMIZATION ALGORITHMS FOR SENSOR ARRAYS AND MULTI-ANTENNA COMMUNICATIONS Traian Abrudan Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Department of Signal Processing and Acoustics for public examination and debate in Auditorium S2 at Helsinki University of Technology (Espoo, Finland) on the 21st of November, 2008, at 12 o’clock noon.

Helsinki University of Technology
Faculty of Electronics, Communications and Automation
Department of Signal Processing and Acoustics
Teknillinen korkeakoulu
Elektroniikan, tietoliikenteen ja automaation tiedekunta
Signaalinkäsittelyn ja akustiikan laitos

Distribution: Helsinki University of Technology Department of Signal Processing and Acoustics P.O. Box 3000 FIN-02015 HUT Tel. +358-9-451 3211 Fax. +358-9-452 3614 E-mail: [email protected] Web page: http://signal.hut.fi

© Traian Abrudan ISBN 978-951-22-9606-4 (Printed) ISBN 978-951-22-9607-1 (Electronic) ISSN 1797-4267 Multiprint Oy Espoo 2008

ABSTRACT OF DOCTORAL DISSERTATION
HELSINKI UNIVERSITY OF TECHNOLOGY
P.O. BOX 1000, FI-02015 TKK
http://www.tkk.fi

Author: Traian Abrudan

Name of the dissertation: Advanced Optimization Algorithms for Sensor Arrays and Multi-antenna Communications

Manuscript submitted: August 1, 2008
Manuscript revised: October 10, 2008
Date of the defence: November 21, 2008

Monograph
X Article dissertation (summary + original articles)
Faculty: Faculty of Electronics, Communications and Automation
Department: Department of Signal Processing and Acoustics

Field of research: Sensor Array Signal Processing
Opponent(s): Prof. Athina Petropulu (Drexel University) and Prof. Erik G. Larsson (Linköping University)
Supervisor: Prof. Visa Koivunen
Instructor: Prof. Visa Koivunen

Abstract

Optimization problems arise frequently in sensor array and multi-channel signal processing applications. Often, optimization needs to be performed subject to a matrix constraint. In particular, unitary matrices play a crucial role in communications and sensor array signal processing. They are involved in almost all modern multi-antenna transceiver techniques, as well as sensor array applications in biomedicine, machine learning and vision, astronomy and radars. In this thesis, algorithms for optimization under unitary matrix constraint stemming from Riemannian geometry are developed. Steepest descent (SD) and conjugate gradient (CG) algorithms operating on the Lie group of unitary matrices are derived. They have the ability to find the optimal solution in a numerically efficient manner and satisfy the constraint accurately. Novel line search methods specially tailored for this type of optimization are also introduced. The proposed approaches exploit the geometrical properties of the constraint space in order to reduce the computational complexity. Array and multi-channel signal processing techniques are key technologies in wireless communication systems. High capacity and link reliability may be achieved by using multiple transmit and receive antennas. Combining multi-antenna techniques with multicarrier transmission leads to high spectral efficiency and helps to cope with severe multipath propagation. The problem of channel equalization in MIMO-OFDM systems is also addressed in this thesis. A blind algorithm that optimizes a combined criterion in order to cancel both inter-symbol and co-channel interference is proposed. The local convergence properties of the algorithm are established as well.

Keywords: Optimization, unitary matrix, array signal processing, blind separation, equalization, MIMO, OFDM

ISBN (printed): 978-951-22-9606-4
ISBN (pdf): 978-951-22-9607-1
Language: English
ISSN (printed): 1797-4267
Number of pages: 95 p. + app. 94 p.
Publisher: Helsinki University of Technology, Department of Signal Processing and Acoustics

Print distribution: Helsinki University of Technology, Department of Signal Processing and Acoustics

X The dissertation can be read at http://lib.tkk.fi/Diss/2008/isbn9789512296071

ABSTRACT OF DOCTORAL DISSERTATION (in Finnish)
HELSINKI UNIVERSITY OF TECHNOLOGY
P.O. BOX 1000, FI-02015 TKK
http://www.tkk.fi

Author: Traian Abrudan

Name of the dissertation: Kehittyneet optimointialgoritmit antenniryhmien signaalinkäsittelyssä ja moniantennikommunikaatiossa (Advanced optimization algorithms in sensor array signal processing and multi-antenna communications)

Date of the manuscript: 01.08.2008
Date of the revised manuscript: 10.10.2008
Date of the defence: 21.11.2008

Monograph
X Article dissertation (summary + original articles)
Faculty: Faculty of Electronics, Communications and Automation
Department: Department of Signal Processing and Acoustics

Field of research: Signal processing for communications
Opponent(s): Prof. Athina Petropulu (Drexel University) and Prof. Erik G. Larsson (Linköping University)
Supervisor: Prof. Visa Koivunen
Instructor: Prof. Visa Koivunen

Abstract

Optimization problems with constraints on the parameters to be optimized arise frequently in sensor and antenna array processing and in multi-channel signal processing. In the multi-channel case the constraints are typically imposed on matrices, and unitary matrices in particular play a central role in multi-antenna communications as well as in biomedical sensor array applications, radar systems, machine learning and radio astronomy. In this thesis, algorithms for optimization under the unitarity constraint are developed. The developed steepest descent and conjugate gradient methods are based on Riemannian geometry and exploit the properties of the Lie group of unitary matrices. The developed methods find the optimal solution efficiently and satisfy the unitarity constraint accurately. Efficient search algorithms are also developed for line search on the unitary group; they exploit the structure of the constraint space in order to reduce the computational load. Multi-antenna technologies are a key part of future broadband wireless communication systems. They allow a significant improvement of the capacity and quality of radio links. When multicarrier techniques are used in addition, excellent spectral efficiency and reliable operation are achieved despite the multipath propagation of the radio channel. In this work, a blind channel equalizer algorithm is derived for MIMO-OFDM receivers based on multi-antenna and multicarrier techniques. The developed algorithm cancels inter-symbol interference as well as co-channel interference using constrained optimization. The work shows that the developed algorithm converges to a locally optimal solution.

Keywords: Optimization, unitary matrix, array signal processing, blind signal separation, equalization, MIMO, OFDM
ISBN (printed): 978-951-22-9606-4
ISBN (pdf): 978-951-22-9607-1
Language: English
ISSN (printed): 1797-4267
Number of pages: 95 p. + app. 94 p.
Publisher: Helsinki University of Technology, Department of Signal Processing and Acoustics

Print distribution: Helsinki University of Technology, Department of Signal Processing and Acoustics

X The dissertation can be read at http://lib.tkk.fi/Diss/2008/isbn9789512296071

Acknowledgements

The research work for this doctoral thesis was carried out at the Department of Signal Processing and Acoustics, Helsinki University of Technology, during the years 2001–2008. The Statistical Signal Processing group led by Prof. Visa Koivunen is part of SMARAD (Smart and Novel Radios Research Unit), a Centre of Excellence in research nominated by the Academy of Finland.

First, I wish to express my sincere gratitude to my supervisor, Prof. Visa Koivunen, for his continuous support and encouragement during the course of this work. His guidance was crucial and helped me overcome many obstacles I encountered in my research work. It has been an honour to work with such a dedicated scientist and outstanding group leader. I am also grateful to all my co-workers, especially Dr. Jan Eriksson and Dr. Marius Sîrbu, with whom I co-authored several publications. Their comments, suggestions and constructive criticism have greatly contributed to the technical quality of the thesis. I would like to thank my thesis pre-examiners, Prof. Corneliu Rusu and Prof. Keijo Ruotsalainen, for their comments and for the effort they have put into reviewing the manuscript. Furthermore, Prof. Iiro Hartimo, the former director of the GETA Graduate School in Electronics, Telecommunications and Automation, and Marja Leppäharju, the GETA coordinator, are highly acknowledged. The department secretaries Mirja Lemetyinen and Anne Jääskeläinen deserve many thanks for assisting with all the practical issues and arrangements. This research was funded by the Academy of Finland and the GETA graduate school. I use this opportunity to thank the Nokia Foundation, the Jenny and Antti Wihuri Foundation, and the Elisa Foundation for the financial support they provided during my studies. I would like to thank all my colleagues in the department for the interesting discussions, especially Dr. Timo Roman, Dr. Mihai Enescu, Eduardo Zacarías, Karol Schober, Dr. Fabio Belloni, Jussi Salmi, Mário Costa, Dr. Stefan Werner, Dr. Cássio Ribeiro, Dr. Andreas Richter, Tuomas Aittomäki, Mei Yen Cheong, and Prof. Risto Wichman. Special thanks go to my dear friends from Otaniemi for the fantastic times we spent together. I will remember these wonderful years all my life.

I also thank all my Romanian friends in Finland who made me feel not so far from home. Furthermore, I am very grateful to my good friends in my home city Cluj-Napoca, as well as my dear old friends from my childhood place Feleacu, for the memorable holidays we spent together. They certainly contributed to my good mood throughout all these years. Finally, I am deeply grateful and I dedicate this thesis to my parents Traian and Leontina, and my sister Luiza, and thank them for their love and irreplaceable support. În final, dedic această lucrare părinților mei Traian și Leontina, și surorii mele Luiza și le mulțumesc din suflet pentru dragostea dăruită și pentru suportul lor de neînlocuit. I am grateful to my love, Orquidea Ribeiro, for bringing immeasurable happiness into my life. I thank her from the heart for her support and encouragement. Estou grato ao meu amor, Orquidea Ribeiro, pela imensurável felicidade que deu à minha vida. Agradeço-lhe, do fundo do coração, pelo apoio e encorajamento.

Espoo, October 2008

Traian Abrudan


Contents

Acknowledgements v

List of original publications ix

List of abbreviations xi

List of symbols xv

1 Introduction 1
1.1 Motivation of the thesis 1
1.2 Scope of the thesis 4
1.3 Contributions 4
1.4 Structure of the thesis 5
1.5 Summary of publications 6

2 Overview of geometric optimization techniques 9
2.1 Constrained optimization from a differential geometry perspective 9
2.2 Optimization under unitary matrix constraint – different approaches 13
2.2.1 Classical Euclidean approach for optimization under unitary matrix constraint 14
2.2.2 Differential geometry based optimization algorithms 16
2.2.3 Optimization under unitary matrix constraint – an illustrative example 19
2.3 Applications of differential geometry to array and multichannel signal processing 21
2.3.1 Optimization and tracking on manifolds 22
2.3.2 Quantization on manifolds 27
2.3.3 Statistics on manifolds 29

3 Practical Riemannian algorithms for optimization under unitary matrix constraint 33
3.1 The unitary group U(n) as a real manifold 33
3.1.1 Revealing the real Lie group structure of U(n) 34
3.1.2 Differentiation of real-valued function of complex-valued argument 35
3.1.3 Justification of using complex-valued matrices 35
3.2 Practical optimization algorithms along geodesics on U(n) 36
3.2.1 Steepest Descent Algorithm along geodesics on U(n) 37
3.2.2 Conjugate Gradient Algorithm along geodesics on U(n) 37
3.2.3 Efficient Line search methods on U(n) 41
3.3 Discussion 49

4 Overview of blind equalization techniques for MIMO-OFDM systems 51
4.1 Second-Order Statistics (SOS) based methods 52
4.1.1 SOCS based methods 52
4.1.2 Statistical subspace methods 53
4.2 Higher-Order Statistics (HOS) based methods 54
4.2.1 BSS methods 54
4.3 Structural properties based methods 56
4.3.1 Modulation properties 57
4.3.2 Properties of the guard interval of OFDM signal (CP or ZP) 58
4.3.3 Exploiting special matrix structures 59
4.4 Discussion 60

5 Blind equalizer for MIMO-OFDM systems based on vector CMA and decorrelation criteria 63
5.1 System model for spatial multiplexing MIMO-OFDM system 64
5.2 Blind MIMO-OFDM equalizer 66
5.2.1 Modified VCMA Criterion 66
5.2.2 Output Decorrelation Criterion 67
5.2.3 Composite Criterion 67
5.2.4 Conditions for symbol recovery 68
5.3 Discussion 70

6 Summary 73

Bibliography 75

Publications 97
Errata 97

List of original publications

(I) T. Abrudan, J. Eriksson, V. Koivunen, "Steepest Descent Algorithm for Optimization under Unitary Matrix Constraint", IEEE Transactions on Signal Processing, vol. 56, no. 3, Mar. 2008, pp. 1134–1147.

(II) T. Abrudan, J. Eriksson, V. Koivunen, "Conjugate Gradient Algorithm for Optimization Under Unitary Matrix Constraint", submitted for publication. Material presented also in the technical report: "Conjugate Gradient Algorithm for Optimization Under Unitary Matrix Constraint", Technical Report 4/2008, Department of Signal Processing and Acoustics, Helsinki University of Technology, 2008. ISBN 978-951-22-9483-1, ISSN 1797-4267.

(III) T. Abrudan, J. Eriksson, V. Koivunen, "Optimization under Unitary Matrix Constraint using Approximate Matrix Exponential", Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 28 Oct.–1 Nov. 2005, pp. 242–246.

(IV) T. Abrudan, J. Eriksson, V. Koivunen, "Efficient Line Search Methods for Riemannian Optimization Under Unitary Matrix Constraint", Conference Record of the Forty-First Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 4–7 Nov. 2007, pp. 671–675.

(V) T. Abrudan, J. Eriksson, V. Koivunen, "Efficient Riemannian Algorithms for Optimization Under Unitary Matrix Constraint", IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, 31 Mar.–4 Apr. 2008, pp. 2353–2356.

(VI) T. Abrudan, M. Sîrbu, V. Koivunen, "Blind Multi-user Receiver for MIMO-OFDM Systems", IEEE Workshop on Signal Processing Advances in Wireless Communications, Rome, Italy, 15–18 Jun. 2003, pp. 363–367.

(VII) T. Abrudan, V. Koivunen, "Blind Equalization in Spatial Multiplexing MIMO-OFDM Systems based on Vector CMA and Decorrelation Criteria", Wireless Personal Communications, vol. 43, no. 4, Dec. 2007, pp. 1151–1172.


List of abbreviations

[Publication p]   pth original publication
3G   Third Generation
3GPP   Third Generation Partnership Project
4G   Fourth Generation
ADSL   Asymmetric Digital Subscriber Line
ANSI   American National Standards Institute
ARIB   Association of Radio Industries and Businesses
B3G   Beyond 3G
BLAST   Bell Labs lAyered Space-Time architecture
BPSK   Binary Phase Shift Keying
BRAN   Broadband Radio Access Networks
BSS   Blind Source Separation
CCI   Co-Channel Interference
CDMA   Code Division Multiple Access
CG   Conjugate Gradient
CM   Constant Modulus
CMA   Constant Modulus Algorithm
CP   Cyclic Prefix
CR   Cognitive Radios
CRLB   Cramér-Rao Lower Bound
CSI   Channel State Information
DAB   Digital Audio Broadcast
DFT   Discrete Fourier Transform
DVB   Digital Video Broadcast
DVB-H   Digital Video Broadcast - Handheld
DVB-T   Digital Video Broadcast - Terrestrial
E-UTRA   Evolved Universal Terrestrial Radio Access
EASI   EquivAriant source Separation via Independence
ED   Eigen-Decomposition
EEG   ElectroEncephaloGraphy
EKG   ElektroKardioGramm, known also as ECG (ElectroCardioGram)
FFT   Fast Fourier Transform
FIM   Fisher Information Matrix
flop   floating point complex operation
GCEIR   Global Channel-Equalizer Impulse Response
HIPERLAN   HIgh PERformance Radio Local Area Networks
HOS   Higher-Order Statistics
ICA   Independent Component Analysis
IEEE   Institute of Electrical and Electronics Engineers
IDFT   Inverse Discrete Fourier Transform
IFFT   Inverse Fast Fourier Transform
IMT-2000   International Mobile Telecommunications-2000
ISA   Independent Subspace Analysis
ISI   Inter-Symbol Interference
IVLB   Intrinsic Variance Lower Bound
i.i.d.   independent and identically distributed
JADE   Joint Approximate Diagonalization of Eigen-matrices
JD   Joint Diagonalization
LAN   Local Area Network
LDPC   Low-Density Parity-Check
LS   Least Squares
LTE   Long Term Evolution
MCA   Minor Component Analysis
MEG   MagnetoEncephaloGraphy
MIMO   Multiple-Input Multiple-Output
MISO   Multiple-Input Single-Output
ML   Maximum Likelihood
MMSE   Minimum Mean-Square Error
MSE   Mean-Square Error
OFDM   Orthogonal Frequency-Division Multiplexing
OFDMA   Orthogonal Frequency-Division Multiple Access
PAST   Projection Approximation Subspace Tracking
PASTd   Projection Approximation Subspace Tracking with deflation
PHY   PHYsical layer
QAM   Quadrature Amplitude Modulation
QPSK   Quadrature Phase Shift Keying
RADICAL   Robust Accurate Direct ICA aLgorithm
RQI   Rayleigh Quotient Iteration
RX   receiver
SA   Steepest Ascent
SD   Steepest Descent
SDR   Software Defined Radio
SIC   Successive Interference Cancellation
SIMO   Single-Input Multiple-Output
SISO   Single-Input Single-Output
SOCS   Second-Order Cyclostationarity Statistics
SINR   Signal-to-Interference-plus-Noise Ratio
SNR   Signal-to-Noise Ratio
SOS   Second-Order Statistics
SVD   Singular Value Decomposition
TX   transmitter
UMTS   Universal Mobile Telecommunications System
V-BLAST   Vertical-BLAST
VCM   Vector Constant Modulus
VCMA   Vector Constant Modulus Algorithm
VSC   Virtual Sub-Carriers
ZF   Zero Forcing
ZP   Zero Padding
WiMAX   Worldwide Interoperability for Microwave Access
WLAN   Wireless Local Area Network
WSS   Wide-Sense Stationary
w.r.t.   with respect to

List of symbols

a, A   complex-valued scalars
a   complex-valued vector
A   complex-valued matrix
A   real-valued scalar function
A   complex-valued matrix function
A*   conjugate of matrix A
A^T   transpose of matrix A
A^H   Hermitian transpose of matrix A
A^{-1}   inverse of matrix A
min   minimum value
max   maximum value
arg min   minimizing argument
arg max   maximizing argument
E[A]   expected value of matrix A
trace{A}   trace of matrix A
sign   sign operator
modulo   modulo operation
◦   function composition operation
≜   defined as
:=   operation attributing the value of the right operand to the left operand
==   equality testing
≠   not equal
j   imaginary unit
ℜ{A}   real part of matrix A
ℑ{A}   imaginary part of matrix A
∠a   angle of complex scalar a
|a|   absolute value of complex scalar a
‖A‖_F   Frobenius norm of matrix A
⟨A, B⟩_E   Euclidean inner product between A and B
⟨A, B⟩_W   Riemannian inner product between A and B at W
∂J/∂A   partial derivative of function J w.r.t. matrix A
∇_E   Euclidean gradient
∇_R   Riemannian gradient

N   set of natural numbers
R   set of real numbers
C   set of complex numbers
R^n   n-dimensional real vector space
C^n   n-dimensional complex vector space
R^{n×p}   set of n × p real matrices
C^{n×p}   set of n × p complex matrices
GL(n)   general linear group (Lie group of n × n invertible matrices)
SL(n)   special linear group (Lie group of n × n matrices with unit determinant)
O(n)   orthogonal group (Lie group of n × n orthogonal matrices)
SO(n)   special orthogonal group (Lie group of n × n orthogonal matrices with unit determinant)
U(n)   unitary group (Lie group of n × n unitary matrices)
SU(n)   special unitary group (Lie group of n × n unitary matrices with unit determinant)
St(n, p)   Stiefel manifold (the set of n × p orthonormal matrices)
Gr(n, p)   Grassmann manifold (the set of p-dimensional subspaces of the n-dimensional Euclidean space)
exp   standard matrix exponential
so(n)   Lie algebra of SO(n)
u(n)   Lie algebra of U(n)
A_R   real part of matrix A
A_I   imaginary part of matrix A
Ā   block matrix built from the real and imaginary parts of matrix A
a_p   GCEIR corresponding to the pth transmitted data stream
b_1, ..., b_q   complex coefficients
C(t)   curve parametrized by t, on a differentiable manifold
C'(t)   first-order derivative of C(t) w.r.t. parameter t
C̄   MIMO channel matrix
C_pq   Sylvester matrix containing the coefficients of the (l, p) MIMO branch
c_pq   vector of coefficients of the (l, p) MIMO channel branch
c_pq   coefficient of the (l, p) MIMO channel branch
D_k   diagonal matrix with the eigenvalues of a skew-Hermitian matrix, at iteration k
d   integer delay
d_1, d_2   range limits of the integer delay
E_qp   Sylvester matrix containing the coefficients of the sub-equalizer (q, p)
e_p   vector with coefficients of the equalizer of the pth data stream
e_qp   vector with coefficients of the pth sub-equalizer from the qth receive antenna
e_qp   coefficient of the pth sub-equalizer from the qth receive antenna
F   normalized IDFT matrix
F   almost periodic function
G_k   Riemannian gradient translated to the identity, at iteration k
G̃_k   Riemannian gradient, at iteration k

H   Riemannian search direction
H_k   Riemannian search direction translated to the identity, at iteration k
H̃_k   Riemannian search direction at iteration k
H^E_k   Euclidean ascent direction at iteration k
H^R_k   Riemannian ascent direction at iteration k
I_n   n × n identity matrix
I   group identity element
J   differentiable cost function
Ĵ   cost function along a curve on the manifold
Ĵ'   first-order derivative of the cost function along a curve on the manifold
J_geod   cost function along a geodesic curve on U(n)
J_proj   cost function along a curve on U(n) described by the projection operator
J_Cayley   cost function along a curve on U(n) described by the Cayley transform
J_VCMA   VCMA cost function
J_xcorr   cross-correlation cost function
k   time instance or iteration number
L   Lagrangian function
L   order of the global channel-equalizer filter
L_c   maximum channel order
L_e   equalizer order
M   number of subcarriers
N   length of the OFDM block
N_a   order of the approximation of the first-order derivative of the cost function
n   number of rows of a matrix with orthonormal columns
O(n^p)   function of n such that lim_{n→∞} |n^{-p} O(n^p)| < ∞
P   number of transmit antennas
P{A}   operator that projects an arbitrary matrix A into the unitary group
P̂   local parametrization arising from a projection operator
p   natural number
q   natural number
Q   number of receive antennas
R_k   unitary rotation, at iteration k
R_pl   cross-correlation matrix between the pth and lth equalized outputs
r   order of the cost function
r_2   energy dispersion constant
s_p   block of constellation symbols of the pth transmitted data stream
ŝ_p   estimated block of constellation symbols of the pth transmitted data stream
s_p   constellation symbol of the pth transmitted data stream
T   almost period
T_W   tangent space at point W
T_CP   cyclic prefix addition matrix
t   real parameter
t_a   approximation range of the first-order derivative of the cost function

U   unitarity criterion
U^E_k   Euclidean gradient of the unitarity criterion at iteration k
U_k   eigenvectors of a skew-Hermitian matrix, at iteration k
u   vector obtained by stacking the P transmitted OFDM blocks
ũ_p   OFDM block corresponding to the pth transmitted data stream
u_p   vector of samples corresponding to the pth transmitted data stream
u_p   sub-symbol within the OFDM block of the pth transmitted data stream
v_q   noise vector at the qth receive antenna
v   noise vector obtained by stacking the Q noise vectors
W   local parametrization on the unitary group
W_geod   geodesic curve on the unitary group
W_proj   curve on the unitary group described by the projection operator
W_Cayley   curve on the unitary group described by the Cayley transform
W   unitary matrix
W_k   unitary matrix, at iteration k
W̃_k   matrix close to unitary, at iteration k
w   unit-norm complex vector
w   unit-length complex number
y   vector obtained by stacking the Q received data vectors
y_q   received data vector at the qth receive antenna
ỹ_q   enlarged received OFDM block at the qth receive antenna
Z   arbitrary n × n complex matrix
z_p   equalized data vector corresponding to the pth data stream
z_p   equalized data sample corresponding to the pth data stream
γ_k   weighting constant of the previous conjugate direction, at iteration k + 1
δ_i   vector which has one unit element on position i and zeros elsewhere
ε   small real number
θ   arbitrary phase rotation
λ   weighting constant of the composite cost function
λ_i   Lagrange multipliers
λ_k   weighting constant of the extra-penalty function, at iteration k
μ   step size parameter
μ_k   step size parameter at iteration k
τ_H   parallel transport of the tangent vector H along a geodesic
ω_1, ..., ω_n   imaginary parts of the eigenvalues of a skew-Hermitian matrix
ω_max   dominant eigenvalue of a skew-Hermitian matrix

Chapter 1

Introduction

1.1 Motivation of the thesis

The field of multi-channel and array signal processing develops and applies powerful mathematical and statistical techniques to process multi-channel signals. Typically, space and time are employed as explanatory variables. The spatial dimension is enabled by using multiple sensors, or multiple emitters located at different positions. Sensor array signal processing algorithms have the ability to fuse data collected at several sensors in order to perform a given estimation task [125]. Conversely, by using multiple transmitters, a special space-time structure may be built into the transmitted signals depending on the task at hand. These array signal processing techniques are needed to solve real-world problems. The most common applications include communications and radar applications such as smart antennas and adaptive beamforming, interference cancellation, high-resolution direction-of-arrival estimation, channel sounding, MIMO radar and sonar, and sensor networks. Other important practical applications include source localization and classification, tracking, surveillance and navigation. Array signal processing is also commonly used in biomedical applications (e.g., EEG, MEG, EKG, cancer diagnosis), machine vision, as well as geophysical and astronomical applications (radio telescopes). The most common application of space-time processing is spatial filtering (beamforming). A beamformer is a processor used in conjunction with an array of sensors in order to provide a versatile form of spatial filtering [201]. The main goal is to estimate signals arriving from the desired directions in the presence of noise and interference signals, or to transmit signals in the desired directions. Antenna arrays with smart signal processing algorithms may also be used to identify spatial signal signatures such as the direction of arrival or location.
Spatial processing techniques are also used in acoustic signal processing, track and scan radars, as well as in cellular systems. The basic idea of sensor array signal processing is given in Figure 1.1.


Figure 1.1: Sensor array signal processing.
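To make the spatial filtering idea concrete, the following is a minimal NumPy sketch of a delay-and-sum beamformer, the simplest space-time processor of the kind in Figure 1.1. The array size, element spacing, look direction and function names are hypothetical choices for illustration, not taken from the thesis.

```python
import numpy as np

def steering_vector(theta_deg, n_sensors, spacing=0.5):
    """Narrowband steering vector of a uniform linear array; spacing in wavelengths."""
    k = np.arange(n_sensors)
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

n = 8
look = 20.0                                  # desired direction of arrival, in degrees
w = steering_vector(look, n) / n             # delay-and-sum weights (spatial matched filter)

# array response: unit gain toward the look direction, attenuation elsewhere
g_look = np.abs(np.vdot(w, steering_vector(look, n)))
g_off = np.abs(np.vdot(w, steering_vector(-40.0, n)))
```

Steering the weights toward 20 degrees gives unit gain in that direction, while a plane wave from, say, -40 degrees is strongly attenuated; adaptive beamformers refine the weight vector w from data rather than fixing it in advance.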

The algorithms used in the space-time processor need to be smart and adaptive [105] in order to deal with time and space varying signals. In wireless receivers, optimization techniques are commonly needed. Minimizing certain error criterion, or maximizing certain gain function needs to be done iteratively, in an online fashion. Numerical optimization may be the only solution because a closed-form solution may not exist, or if it exists, it is hard to find. In multi-channel signal processing the optimization is often performed subject to matrix constraints. An elegant way to solve this type of constraint optimization problems is to view the error criterion as a multi-dimensional surface contained on the space of the free parameters. The constraint is viewed as a second surface. The feasible solutions may be found in the parameter space determined by the intersection of the two surfaces. This parameter space is usually a non-Euclidean space, i.e., a differential manifold. Consequently, powerful geometric optimization methods are needed. Classical optimization techniques operating on the usual Euclidean space suffer from low convergence speed, and/or deviation from the constraint. In order to overcome these impairments, state-of-the-art Riemannian optimization algorithms may be employed. They provide efficient solutions and allow better understanding of the problem. By using the Riemannian geometry approach, the initial constrained optimization problem is converted into an unconstrained one, in a different parameter space [18]. The constraints are fully satisfied, in a natural way. The geometric properties of the constrained space may be exploited in order to reduce to computational complexity. Unitary/orthonormal matrices play a crucial role in array and multichannel signal processing applications. They are involved in almost all modern transceiver techniques such as limited-feedback MIMO sys2

tems [50, 51, 121, 143–145], space-time codes [103, 116, 118, 120, 141, 144], smart antennas [85, 86, 183, 186], blind equalization and source separation [45, 156, 187, 192, 209] and MIMO radars and sonars [34, 82, 167, 191]. Unitary matrices also play an important role in biomedical applications that employ sensor arrays or pattern recognition systems (EKG, EEG, MEG, cancer prevention and diagnosis, modelling of the human body) [20, 164], machine learning [46,73–79,151,160,161], computer vision [133,146] and optimal control [56,211,223]. For this reason, reliable algorithms for optimization under unitary matrix constraint are needed. In wireless communication, a major practical issue to be considered is that the terminals may possess different signal processing capabilities and limited power resources. Thus, algorithms with reasonable computational complexity should be employed in order to cope with high data rates and time-space-frequency selective channels. In anticipation of the growing demands for voice and multimedia applications, the future beyond third generation (B3G) and fourth generation (4G) wireless services promise considerably higher effective data rates and enhanced user mobility. In this respect, emerging wireless systems like B3G Long Term Evolution (LTE) [10], IMT-2000 [8] and WiMAX [1] are being deployed. Array signal processing plays again a crucial role. By using multiple transmit/receive antennas, i.e. the so called Multiple-Input Multiple-Output (MIMO) systems [36, 194], high capacity, link reliability and enhanced network coverage may be achieved. These benefits give a strong motivation to develop reliable multi-antenna transceiver structures. Radio spectrum is a scarce and expensive resource. It is desirable to increase the data rates without expanding the bandwidth or using additional power. Multicarrier techniques such as Orthogonal Frequency Division Multiplexing (OFDM) modulation [106, 200] use the limited spectrum very efficiently. 
In addition, they are very robust to channel multipath propagation and allow simple equalization. Future wireless systems such as LTE [10] and IMT-2000 [8] employ OFDM in the physical layer. OFDM has already been adopted in many standards such as ADSL (Asymmetric Digital Subscriber Line) [3], wireless local and metropolitan area networks (WLAN, WMAN, HIPERLAN/2) [1, 4–6, 9], and European digital audio and video broadcast (DAB, DVB) [2, 7]. The recently proposed IEEE 802.11n standard for WLAN [4] combines OFDM with MIMO in order to increase the throughput and the operation range. In order to fully achieve the benefits of MIMO-OFDM, accurate channel estimation is needed. In MIMO systems, this is more difficult than in the single-antenna case due to the large number of channel parameters. In mobile MIMO-OFDM systems, a large number of pilot symbols is needed in order to deal with the space-time-frequency selective channels. Blind methods [128] achieve higher effective data rates because they do not require any training data or pilot signals. They rely only on statistical or structural properties of the transmitted signal. Semi-blind methods use a reduced amount of training data and are more feasible for fast-fading scenarios. Moreover, they allow resolving all the ambiguities that blind receivers are subject to.

1.2

Scope of the thesis

The objective of this thesis is to develop efficient optimization algorithms for multi-channel and sensor array signal processing applications such as the future multi-antenna transceivers. Special attention needs to be paid to practicality, so that no unrealistic assumptions are made in deriving the algorithms. Reasonable computational complexity of the algorithms is required in order to be able to operate at the very high sampling rates needed in real-time applications. Optimization algorithms are used in most modern wireless receivers. Constrained optimization problems arise frequently in sensor array and multi-channel signal processing applications. The core of an adaptive algorithm is usually an optimization algorithm, possibly with some constraints. In particular, optimization under unitary matrix constraint is needed in smart antennas, closed-loop MIMO systems, space-time codes, blind signal separation, blind subspace methods and blind beamforming. Novel optimization techniques that possess fast convergence to the desired solution need to be developed. Riemannian geometry provides powerful tools for solving this type of problem. They prove to be computationally feasible and outperform classical Euclidean approaches in terms of convergence speed. For this reason, the main objective of this dissertation is to develop optimization algorithms stemming from Riemannian geometry that have the ability to find the optimal solution in a numerically efficient manner. The problem of blind channel equalization in MIMO-OFDM systems is also addressed in this thesis. The goal is to develop computationally efficient blind algorithms for MIMO-OFDM that are able to cancel both inter-symbol and co-channel interference. They must achieve fast convergence and good tracking capabilities when used in a semi-blind mode. The convergence properties and the conditions for symbol recovery need to be established as well.

1.3

Contributions

This work contributes to the fields of multi-channel and array signal processing, optimization theory and multi-antenna communications. There are two main contributions of this thesis, as explained below. The first main contribution is in numerical optimization techniques for array and multi-channel signal processing applications. Two novel Riemannian techniques for optimization under unitary matrix constraint are proposed.

This type of optimization arises frequently in multi-antenna transceiver algorithms. Typical applications are blind equalization, source separation and smart antennas. Steepest descent (SD) and conjugate gradient (CG) algorithms operating on the Lie group of unitary matrices are derived. Novel line search methods specially tailored for this type of optimization are also introduced. The proposed algorithms exploit the geometrical features of the Lie group in order to reduce the computational cost. They outperform the classical Euclidean approaches for constrained optimization both in terms of convergence speed and computational complexity. Moreover, they generalize the existing Riemannian algorithms for optimization under orthogonal constraint, which are designed only for real-valued matrices and signals. In communications and array signal processing we deal with complex-valued signals and matrices. The complex-valued case has been addressed only in [137], since most authors consider the extension from the real to the complex case trivial. We show that this simplistic assumption is not always true. The complexity of the proposed algorithms is significantly lower than that of the differential geometry approach in [137]. They are directly applicable to joint diagonalization (JADE) [45], which is a widely used technique for blind source separation (BSS). The proposed algorithms achieve faster convergence in comparison to the approach based on Givens rotations, originally proposed in [45]. Other possible applications include high-resolution direction finding and blind subspace methods. The proposed SD and CG algorithms, together with the two proposed line search methods, are the first ready-to-implement algorithms for optimization under unitary matrix constraint. The second main contribution of this dissertation is in the area of multi-antenna OFDM communication systems.
An optimization algorithm based on a combined criterion is proposed in order to cancel both inter-symbol interference (ISI) and co-channel interference (CCI) in a blind manner, i.e., without training or pilot symbols. It possesses reduced computational complexity since it does not involve any matrix inversions or decompositions. The local convergence properties of the algorithm are established as well. The proposed blind algorithm can be used for channel tracking using the data symbols only. It is suitable for MIMO-OFDM systems under slow to moderate fading conditions, such as in wireless LANs and continuous transmissions (television and radio).

1.4

Structure of the thesis

The thesis consists of an introductory part and seven original publications. The publications are listed at page ix and appended at the end of the manuscript, starting at page 97. The introductory part of this thesis is organized as follows. The first two chapters deal with Riemannian optimization algorithms. In particular, the problem of optimization under unitary matrix constraint is addressed. The next two chapters address the problem of blind equalization in MIMO-OFDM systems. In Chapter 2, an overview of optimization techniques stemming from differential geometry is provided. Constrained optimization is regarded as a geometric problem. The main focus is on optimization under unitary matrix constraint and the existing techniques. A comprehensive review of applications of differential geometry to array and multi-channel signal processing is provided. Chapter 3 proposes algorithms for optimization under unitary matrix constraint. The Lie group of unitary matrices U(n) is described as a real manifold. It is motivated why the existing algorithms designed for real-valued matrices cannot be applied to complex-valued matrices in a straightforward manner. Two computationally feasible optimization algorithms operating on the Lie group U(n) are introduced. A Steepest Descent (SD) and a Conjugate Gradient (CG) algorithm exploiting the geometrical properties of U(n) are provided. Two efficient line search methods specially tailored for the proposed algorithms are also introduced. In Chapter 4, blind channel identification and equalization algorithms for multi-antenna OFDM systems are reviewed. A classification of these algorithms is also provided. Chapter 5 addresses the problem of blind equalization in spatial multiplexing MIMO-OFDM systems. An algorithm which optimizes a composite criterion in order to mitigate both inter-symbol and co-channel interference is proposed. Chapter 6 provides a summary of the contributions and the results of the thesis. Future research directions are also discussed.

1.5

Summary of publications

In this subsection, a brief overview of the author's original publications is given. In [Publication I], a Riemannian Steepest Descent (SD) algorithm for optimization under unitary matrix constraint is derived. The algorithm benefits from the geometrical features of the Lie group of unitary matrices U(n) in order to reduce the computational complexity. Recent advances in numerical techniques for computing the matrix exponential needed in the update are exploited. The Armijo line search method is used for efficient step-size selection. The computational complexity and stability issues are addressed in detail. Detailed implementation tables and numerical solutions are provided, unlike other more general algorithms that require solving matrix equations (sometimes differential equations) and do not provide any feasible numerical solutions. The proposed algorithm is tested in a blind source separation application for MIMO systems, using joint diagonalization. In [Publication II], a Riemannian Conjugate Gradient (CG) algorithm for optimization under unitary matrix constraint is derived. Two efficient line search methods exploiting the almost periodic property of the cost function along geodesics on U(n) are also proposed. A detailed description of the implementation is provided. The proposed CG algorithm and line search methods are tested in a blind source separation application for MIMO systems using joint diagonalization. They are also used to compute the eigendecomposition of a Hermitian matrix iteratively by maximizing the Brockett criterion [42, 183, 184]. In [Publication III], a Riemannian steepest descent algorithm based on a Taylor approximation of the geodesics is proposed. The Riemannian SD algorithm is applied to high-resolution direction finding. A comparison to Euclidean approaches is provided. The algorithm is also applied to an existing blind receiver for MIMO-OFDM systems [228] in order to reduce complexity. [Publication IV] introduces two novel line search methods on U(n). The first method uses a polynomial approximation of the derivative of the cost function. The second method is based on a DFT approximation. They are used together with the Riemannian SD algorithm in [Publication I]. [Publication V] compares Riemannian SD and CG algorithms on U(n) using the Armijo line search method. The algorithms are applied to two different cost functions. The first one is the Brockett criterion and is used to perform the diagonalization of a Hermitian matrix. The second one is the JADE criterion and is used to perform the blind signal separation of communications signals in a MIMO system. In [Publication VI], a blind equalization algorithm for spatial multiplexing MIMO-OFDM systems is proposed.
The algorithm is based on a composite criterion designed to cancel both the inter-symbol and co-channel interference. The composite criterion is comprised of a Vector Constant Modulus (VCM) criterion and a decorrelation criterion. Identifiability conditions for the MIMO channel are also provided. In [Publication VII], a blind equalization algorithm for spatial multiplexing MIMO-OFDM systems is proposed. The algorithm is the final version of the algorithm in [Publication VI], subsequently developed in [13] and [12]. The VCM criterion is modified in order to be able to cope with the correlation introduced by the cyclic prefix (CP). The resulting composite criterion consists of a modified VCM criterion and a decorrelation criterion. The local convergence properties of the algorithm have been established. Conditions for the blind equalization and co-channel signal cancellation are also provided. All the simulation software for all the original publications included in this dissertation was written solely by the author.

In [Publications I–VII], the original ideas and the derivations of the algorithms were developed by the first author. The simulations were carried out by the first author as well. The co-authors provided guidance during the development of the algorithms, the establishment of their properties and the design of the experiments. They have also provided valuable comments that substantially improved the rigor and the technical quality of the papers.


Chapter 2

Overview of geometric optimization techniques

In this chapter, different approaches for solving optimization problems [27, 83, 153, 163] subject to differentiable equality constraints are reviewed. We are mainly interested in minimizing cost functions, but all the algorithms considered can be easily adapted to maximization problems. Classical constrained optimization algorithms operating on Euclidean spaces, as well as non-Euclidean approaches, are presented in Section 2.1. Different algorithms for optimization under unitary matrix constraint are revisited in Section 2.2. A detailed literature review of Riemannian optimization algorithms and their applications in signal processing is provided in Section 2.3.

2.1

Constrained optimization from a differential geometry perspective

Constrained optimization problems arise frequently in many sensor array and multi-channel signal processing applications. Most adaptive algorithms [105] require minimizing an error criterion in an iterative manner, subject to some matrix equality constraint. Solving this type of problem is typically done numerically. In fact, numerical optimization may be the only way to solve certain problems, because a closed-form solution does not exist, or if it exists, it is very hard to find. Iterative optimization algorithms are also suitable in the case where only small corrections to a previous solution need to be applied sequentially. Classical approaches solve this problem on the Euclidean space and they are mainly of two types. The first approach involves optimizing the unconstrained cost function by using classical gradient algorithms. In order to satisfy the constraint, a restoration procedure needs to be applied after

every iteration. In general, the resulting algorithms converge slowly due to the fact that most of the effort is spent on enforcing the constraint and not on the actual optimization. The second classical approach for constrained optimization is based on the method of Lagrange multipliers [97]. The method introduces an additional set of unknown scalar parameters called Lagrange multipliers. The number of multipliers is equal to the number of scalar constraints. A new cost function called the Lagrangian is constructed. It is comprised of the original cost function and the set of constraints weighted by the Lagrange multipliers. This function is jointly optimized w.r.t. both the original variables and the new variables, i.e., the Lagrange multipliers. Closed-form solutions may be found in simple cases by solving a system of equations. Often, increasing the number of unknowns is undesirable, especially when the dimension of the optimization problem is already large. Many practical applications involve complicated expressions of the cost function (e.g., the JADE criterion [45]) and non-linear matrix constraints (e.g., the unitary matrix constraint). In such cases, the Lagrangian approach requires solving a large system of nonlinear equations which may be mathematically intractable. For this reason, simplified extra-penalty methods [153] stemming from the Lagrangian approach have been proposed. Instead of using several Lagrange multipliers, they use a single scalar parameter to weight the extra term which penalizes the deviation from the constraint. The corresponding composite cost function is minimized iteratively by using a classical steepest descent method. This approach may lead to inaccurate solutions due to the fact that the deviation from the constraint accumulates after each iteration. Additional stabilization procedures are usually needed.
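As a toy illustration of the multiplier mechanism in one of the simple cases where a closed-form solution exists (a hypothetical numerical sketch, not taken from the thesis; NumPy is assumed), consider minimizing J(w) = wᵀAw subject to the single scalar constraint wᵀw = 1. Setting the gradient of the Lagrangian L(w, λ) = wᵀAw − λ(wᵀw − 1) to zero yields the eigenvalue equation Aw = λw, so the constrained minimizer is the eigenvector of the smallest eigenvalue:

```python
import numpy as np

# Minimize J(w) = w^T A w subject to w^T w = 1 via the Lagrangian
# L(w, lam) = w^T A w - lam (w^T w - 1): the stationarity condition
# A w = lam w identifies the minimizer as an eigenvector of A.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T                       # symmetric test matrix
vals, vecs = np.linalg.eigh(A)    # eigenvalues in ascending order
w = vecs[:, 0]                    # eigenvector of the smallest eigenvalue
```

Here the multiplier λ acquires a concrete meaning: it equals the attained value of the cost, vals[0]. For a unitary matrix constraint, by contrast, n² such conditions couple nonlinearly and no comparable closed form is available in general.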
More reliable and modern solutions to some specific classes of constrained optimization problems [18, 87, 131, 184, 196] may be obtained by using tools of differential geometry [59, 101]. The initial constrained optimization problem on the Euclidean space is converted into an unconstrained one, on a different parameter space. This new parameter space can be viewed as a system of coordinates where only the values of coordinates satisfying the constraint are allowed. Geometrically, the set determined by the constraints can be understood as a lower-dimensional space embedded in the initial Euclidean space (see Figure 2.1). Usually, smooth (differentiable) constraints determine differentiable manifolds [59]. In general, these manifolds are non-Euclidean spaces and may be elegantly explored by using state-of-the-art tools from differential geometry. Efficient numerical optimization algorithms may be derived by taking into account the geometrical structure of the manifold. In addition to that, matrix manifolds [18] possess a rich algebraic structure arising from the special properties of matrices. The classical gradient-based algorithms such as steepest descent, conjugate gradient and Newton methods can be naturally extended from the Euclidean space to differentiable manifolds. Pioneering work by Luenberger

Figure 2.1: Optimization on manifolds. (The figure depicts, in the embedding Euclidean space, the contours of J(W), the constraint surface, and a curve C(t) on it with C(0) = W and C′(0) = H.)

[131] and Gabay [87] treats the constrained optimization problem in a differential geometry context and establishes interesting connections with the Lagrange multipliers method, considering the first- and second-order optimality conditions. Optimizing a differentiable cost function J(W) on a differentiable manifold involves choosing a point W on the manifold and a search direction H tangent to the manifold at W (see Figure 2.1). The next iteration consists of moving along a curve C(t) which emanates from W in the direction of H. This curve is called a local parametrization and can be used to describe a neighborhood around a given point on the manifold. Therefore, C(t) must be contained on the manifold and fulfill the condition C(0) = W. Additionally, its derivative at W must coincide with the search direction H, i.e., C′(0) = H. Optimizing the cost function in one dimension along the curve C(t) is required at every iteration, i.e., a line search is performed. By moving along such curves, the constraint is automatically satisfied at each iteration. Choosing the appropriate local parametrization and the appropriate search direction must be computationally feasible. The most natural local parametrizations are the geodesics, which on Riemannian manifolds are locally the length-minimizing curves. They correspond to the straight lines on the Euclidean space. On certain manifolds such as Lie groups [89, 107, 123, 208], the geodesics are computationally attractive. Using other local parametrizations is also possible. A well-known non-geodesic approach is to use a retraction, which is a map that locally projects the tangent plane

onto the manifold [16, 18, 131, 137]. Choosing the search direction is a compromise between high complexity and fast convergence. This direction can be, for example, the steepest descent (or ascent) direction, or another direction, such as the one corresponding to a conjugate gradient or Newton algorithm. The steepest descent (SD) or steepest ascent (SA) algorithm on differentiable manifolds [41, 57, 69, 77, 87, 131, 150, 151, 161, 183, 196, 219, 220] is relatively simple, but its convergence is only linear [87, 183, 196, 219, 220]. Its asymptotic rate of convergence is related to the eigenvalues of the Hessian associated with the Lagrangian function of the constrained optimization problem, evaluated at the solution [131]. Developing an SD algorithm on a differentiable manifold requires defining intrinsically the gradient field of the cost function on the manifold. The gradient can be defined only after endowing the differentiable manifold with a Riemannian metric, which turns it into a Riemannian manifold. The Riemannian gradient is a vector tangent to the manifold, along which the cost function increases the fastest. The conjugate gradient (CG) algorithm on differentiable manifolds [57, 69, 122, 183, 184, 229] is still a relatively simple algorithm and it achieves superlinear convergence [183]. In general, CG is considerably simpler than a Newton algorithm, which would require computing second-order derivatives. CG captures the second-order information from two successive first-order derivatives which are properly combined. The additional complexity compared to SD is due to the fact that CG requires transporting gradient vectors from one point to another, i.e., performing parallel transport. This operation is not as simple as in a vector space. The parallelism is not relative to straight lines as on the Euclidean space, but to an affine connection (often the Levi-Civita connection) [59, 107, 154].
The connection makes clear the two basic ideas of covariant differentiation and parallel transport [59, 107, 154]. On certain manifolds the parallel transport can be done in a very simple manner, and the resulting CG algorithm is comparable to SD in terms of computational complexity. The Riemannian Newton algorithm [17, 87, 135, 155, 183, 196] achieves quadratic convergence [17, 87, 183]. The Newton algorithm is prohibitively expensive even on the Euclidean space when the dimension of the optimization problem is large. On Riemannian manifolds, the complexity is even higher, since it would require computing the second covariant derivative and inverting it. For this reason, CG [69, 122, 183, 184, 229] or modified versions of the Newton algorithm [137] are often preferred. Moreover, the Newton algorithm may converge to any type of stationary point, not only the extrema of interest. Trust-region methods on Riemannian manifolds have been recently proposed in the literature [15, 18]. In conclusion, in order to extend the optimization algorithms from the Euclidean space to Riemannian manifolds, the straight lines are replaced by geodesics, the classical differentiation by covariant differentiation and

the idea of vector addition by the exponential map and parallel transport [184]. The most common differentiable manifolds which arise in signal processing applications are homogeneous spaces [107, 208] such as the Stiefel manifold and the Grassmann manifold [69, 74, 75, 137, 183]. The Stiefel manifold St(n, p) is the set of n × p (real or complex) matrices with mutually orthonormal columns. The Grassmann manifold Gr(n, p) is the set of all p-dimensional subspaces of the n-dimensional Euclidean space (R^n or C^n). Its elements are also represented by n × p orthonormal matrices, which in this case can be any arbitrarily rotated basis of a subspace. The Stiefel manifold of n × n orthogonal/unitary matrices is a special case due to the fact that these matrices are algebraically closed under the standard matrix multiplication operation, i.e., they form a matrix Lie group [89, 107, 123, 208]. A matrix Lie group is a differentiable manifold and a matrix group at the same time. Orthogonal matrices form the orthogonal group O(n). Similarly, unitary matrices form the unitary group U(n). Another relevant Lie group is the general linear group GL(n), which is the group of n × n invertible matrices. In general, the additional group structure brings computational benefits which may be exploited in practical algorithms, as will be shown later in Chapter 3. For this reason, special attention has been paid in the literature to optimization algorithms operating on Lie groups [17, 40, 41, 57, 69, 76, 77, 79, 135, 150, 151, 155, 161, 183, 214, 229, 230]. In particular, in this thesis, the problem of optimization under unitary matrix constraint is addressed. Computationally efficient optimization algorithms which exploit the geometry of the Lie group of n × n unitary matrices U(n) are proposed.
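To make the Lie-group machinery concrete, the following is a minimal numerical sketch (NumPy/SciPy assumed; the step-size rule and constants are illustrative, not the tuned algorithms of Chapter 3) of Riemannian steepest descent on U(n) applied to the Brockett criterion J(W) = Re tr(W^H A W N), whose minimization over U(n) diagonalizes a Hermitian matrix A when N has distinct diagonal entries:

```python
import numpy as np
from scipy.linalg import expm

def brockett_sd(A, n_iter=500, mu=0.05):
    """Riemannian steepest descent on U(n) for the Brockett criterion
    J(W) = Re tr(W^H A W N); at the minimizers W^H A W is diagonal."""
    n = A.shape[0]
    N = np.diag(np.arange(1.0, n + 1))
    J = lambda W: np.real(np.trace(W.conj().T @ A @ W @ N))
    W = np.eye(n, dtype=complex)
    for _ in range(n_iter):
        Gamma = A @ W @ N                            # Euclidean gradient
        G = Gamma @ W.conj().T - W @ Gamma.conj().T  # skew-Hermitian, in the Lie algebra
        W_new = expm(-mu * G) @ W                    # geodesic step, stays unitary
        if J(W_new) < J(W):                          # crude adaptive step size
            W, mu = W_new, 1.2 * mu
        else:
            mu *= 0.5
    return W

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B + B.conj().T                                   # Hermitian test matrix
W = brockett_sd(A)
D = W.conj().T @ A @ W                               # close to diagonal at convergence
```

Because each update left-multiplies W by the matrix exponential of a skew-Hermitian matrix, the iterate remains unitary to machine precision, so no constraint-restoration step is needed; this is precisely the computational benefit of the group structure discussed above.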

2.2

Optimization under unitary matrix constraint - different approaches

Consider a differentiable real-valued cost function J : C^{n×n} → R. The problem of optimization under unitary matrix constraint may be formulated as:

minimize J(W)                 (2.1)
subject to W^H W = I_n.       (2.2)

Two main approaches for solving the above optimization problem have been proposed in the literature. The first one requires solving the constrained optimization problem on the Euclidean space by using classical gradient-based algorithms. The second one requires solving an unconstrained optimization problem on the differentiable manifold determined by the constrained set, which is the Lie group of n × n unitary matrices U(n). The two main approaches are presented below together with the methods which they include. A similar classification is provided in [Publication I], [Publication III], and [162].

2.2.1

Classical Euclidean approach for optimization under unitary matrix constraint

The classical Euclidean approach for solving constrained optimization problems includes the unconstrained gradient-based method with constraint enforcement and the method of Lagrange multipliers (or methods stemming from it).

Euclidean gradient algorithms with constraint enforcement

An unconstrained classical gradient-based method is used to minimize the cost function J(W). The constraint is satisfied by using an additional restoration procedure, which needs to be applied after every iteration. This is a well-known technique for optimization under unitary/orthonormal matrix constraint [64, 113, 156, 157, 171, 172, 216]. Each iteration k of the algorithm consists of the two following steps:

(2.3) (2.4)

where H^E_k is an ascent direction and the step size μ_k > 0 sets the convergence speed. The search direction H^E_k may be the gradient direction on the Euclidean space at W_k, i.e., ∇^E J(W_k). Other search directions could be used, for example the one corresponding to a conjugate gradient or Newton method. Due to the additive type of update, the result of (2.3) is generally not a unitary matrix. The new iterate W̃_{k+1} deviates from the unitary property at every iteration. The constraint needs to be restored by projecting W̃_{k+1} onto the space of unitary matrices. A projection operator P : C^{n×n} → U(n) is used in (2.4) in order to obtain a unitary matrix. A few algorithms in the literature [11, 113, 216] find the unitary matrix which is closest to the original one under the Euclidean norm. The corresponding projection is known as “the symmetric orthogonalization” procedure [113, Ch. 6, Sect. 6.6], and it may be written as

P{W̃_{k+1}} = W̃_{k+1} (W̃^H_{k+1} W̃_{k+1})^{−1/2},     (2.5)

or in terms of the left and right singular vectors of W̃_{k+1} [108]. Other non-optimal projections (under the Euclidean norm) may be used, for example the one based on Gram-Schmidt orthogonalization [64, 156, 157, 171, 172, 216]. This approach does not take into account the group property of unitary matrices, i.e., the fact that unitary matrices are closed under the multiplication operation. This is not the case under addition. The departure from the unitary property may be significant, and most of the effort will be spent on enforcing the constraint instead of moving towards the optimum. Consequently, this algorithm achieves lower convergence speed, as demonstrated in [Publication I] and [Publication III]. The Euclidean gradient algorithms combined with projection methods do not take into account the curvature of the constrained surface, and therefore achieve only linear convergence [183]. Moreover, this approach makes the idea of line search optimization meaningless.

Lagrange multipliers and related methods

The method of Lagrange multipliers [97] is the basic tool for optimizing a function of multiple variables subject to one or more scalar constraints. The method introduces a set of new real scalar unknowns λ_i called Lagrange multipliers. A composite cost function L(W) called the Lagrangian is constructed by using the original cost function J(W) and an extra term containing the constraints weighted by the Lagrange multipliers. This function is jointly optimized w.r.t. both the elements of W and the Lagrange multipliers λ_i. In this way, finding the stationary points of the constrained cost function J(W) is equivalent to finding the stationary points of the unconstrained cost function L(W). The unitary matrix constraint (2.2) is equivalent to n² real scalar constraints. Consequently, n² real Lagrange multipliers are required to construct the Lagrangian. The original cost function J(W) already has 2n² free real variables. Therefore, the Lagrangian has 3n² free variables. Increasing the number of variables is undesirable, especially when n is large. A large system of non-linear matrix equations needs to be solved in order to find the stationary points. Often, for practical cost functions, solving such a system of equations is non-trivial already for n ≥ 2. An example of such a cost function is the JADE criterion [45], whose optimization has been considered in [Publication I] and [Publication II].
For this reason, a simplified technique stemming from the method of Lagrange multipliers has been proposed in the literature [205]. This technique uses a gradient-based iterative method to minimize a composite cost function on the Euclidean space. The composite cost function is comprised of the original cost function J(W) and an extra term U(W) = ‖W^H W − I_n‖_F, which penalizes the deviation from the unitary constraint. In this work we will refer to this method as “the extra-penalty method”. The kth iteration of the extra-penalty method is of the form

W_{k+1} = W_k − μ_k [H^E_k + λ_k U^E_k].     (2.6)

The direction H^E_k represents the Euclidean gradient of the original cost function J(W) at W_k, whereas the direction U^E_k is the Euclidean gradient of the penalty function U(W) at W_k. The latter term is used to penalize the deviation from the unitary property, and in a way it resembles the Tikhonov regularization method [147]. A single scalar weighting parameter λ_k is used to weight the direction U^E_k in a manner similar to the method of Lagrange multipliers. For this reason, the method has also been called the bigradient method in the literature [205]. Most of the extra-penalty type of algorithms in the literature dealing with optimization under orthonormal constraints are specialized to certain tasks such as subspace tracking or Independent Component Analysis (ICA) [61, 64, 66, 205]. They are computationally efficient for the task they are designed for, but in general they cannot be applied to other optimization problems with unitary/orthonormal constraints. Some of the optimization algorithms are restricted to the case of optimizing J(w), where w is a unit-norm vector. In the unit-norm vector case, a scalar parameter λ_k always exists such that the unit-norm constraint is satisfied. Finding the optimum λ_k which keeps the constraint satisfied limits the choice of the step size μ_k > 0, and consequently the convergence speed. This limitation is due to the additive update which alters the constraint. Algorithms based on an additive update which are able to keep the constraint satisfied up to machine precision exist in the literature [60]. They are based on Householder transforms [99]. These algorithms are specialized to minor component analysis (MCA). The extra-penalty method is in general numerically unstable, since the deviation from the unitary constraint accumulates over time. Self-stabilized algorithms have been proposed [61–66] in order to avoid this unstable behavior. They usually discretize the differential equation which describes the motion on the Riemannian manifold determined by the constraint. The discretization leads to numerical errors which are compensated by inserting additional stabilizing factors at various points within the update (see [Publication III]).
The numerical stability is improved, and this fact has been proved in [61] by using an asymptotic analysis which shows that the error does not accumulate over time. In conclusion, solutions based on the method of Lagrange multipliers may be non-trivial to compute for large dimensions n. Moreover, the mathematical tractability is highly dependent on the expression of the cost function, since a system of 3n^2 nonlinear equations with 3n^2 unknowns needs to be solved. The solution obtained by using the extra-penalty method satisfies the constraint only approximately, when only one scalar parameter λ_k is used. This fact has been demonstrated in [Publication I] and [Publication III]. Finally, these methods are not directly applicable to general problems of optimization under unitary matrix constraint.
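As a concrete illustration of the extra-penalty approach, the sketch below minimizes a toy cost J(w) = wᵀAw over (approximately) unit-norm vectors. The matrix A, the penalty weight λ and the step size µ are illustrative choices, not values from the cited works; the final norm deviates visibly from 1, showing why a single scalar penalty weight satisfies the constraint only approximately.

```python
import numpy as np

# Extra-penalty steepest descent for min w^T A w subject to ||w|| = 1
# (toy setup; A, lambda and mu are chosen purely for illustration).
A = np.diag([3.0, 2.0, 1.0])      # minor eigenvector is e3
lam = 5.0                          # fixed penalty weight lambda_k
mu = 0.01                          # fixed step size mu_k
w = np.array([0.2, 0.3, 0.9])

for _ in range(3000):
    grad_J = 2.0 * A @ w                      # Euclidean gradient of J(w)
    grad_U = 4.0 * (w @ w - 1.0) * w          # gradient of the penalty U(w)
    w = w - mu * (grad_J + lam * grad_U)      # additive update

print(np.linalg.norm(w))   # close to, but not exactly, 1
```

The iteration does find the minor eigenvector direction, but the limit point has norm sqrt(1 - 1/(2λ)) rather than 1, so the deviation from the constraint is controlled, not eliminated, by λ.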

2.2.2 Differential geometry-based optimization algorithms

The unitary matrix constraint (2.2) is a smooth constraint which determines a differentiable manifold. This manifold can be seen as a “constrained surface” embedded in a higher-dimensional Euclidean space. Differential geometry-based optimization algorithms move along directions which are

tangent to the manifold. Depending on the choice of the local parametrization, there are two types of algorithms: projection-based algorithms and geodesic algorithms. An important aspect to be considered is that the unitary matrices are closed under the standard matrix multiplication operation, i.e., they form the Lie group of n × n unitary matrices U(n). This fact brings additional properties which may be exploited in optimization. The most important property is that geodesics are described by simple formulas, therefore they can be efficiently computed. Unlike the Lagrange multipliers method, which introduces new unknown variables, differential geometry-based algorithms exploit the reduced dimension of the manifold. The original problem involves a cost function of 2n^2 variables and n^2 constraints. The differential geometry-based approach involves only n^2 variables (which is the dimension of U(n)), and no constraints. Moreover, the mathematical tractability does not depend on the expression of the cost function.

Non-geodesic differential geometry based gradient algorithms

This type of algorithm uses a projection operator (2.5) as a local parametrization. It moves along straight lines tangent to the manifold and deviates from the unitary constraint at every iteration. This is due to the fact that the manifold is a “curved space”. Therefore, this algorithm has the same drawback as its Euclidean counterpart, i.e., the constraint restoration procedure needs to be applied after every iteration. The algorithm consists of the following two steps at each iteration:

W̃_{k+1} = W_k − µ_k H^R_k,    (2.7)
W_{k+1} = P{W̃_{k+1}}.    (2.8)

The search direction −H^R_k is a descent direction tangent to the manifold at W_k. The step size µ_k > 0 determines the convergence speed. Compared to their Euclidean counterpart, these algorithms depart less from the constrained surface, since they move along search directions which are tangent to the manifold. The deviation is due only to the curvature of the manifold, and not to an inaccurate search direction. Consequently, this type of algorithm achieves better convergence speed, as demonstrated in [Publication I] and [Publication III]. From a differential geometry point of view, the projection operator serves as a local parametrization on the manifold, i.e., a mathematical description of the neighborhood of an arbitrary point W_k ∈ U(n). The corresponding curve starting from W_k ∈ U(n) with the initial tangent vector −H^R_k is given by P̂(µ) = P{W_k − µH^R_k}. Unfortunately, due to the nature of the projection operator, the line search optimization needed in the step size selection is computationally expensive.
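The two steps (2.7)–(2.8) can be sketched as follows for an illustrative cost J(W) = −Re tr(WᴴB), whose minimum over U(n) is attained at W = B. The target B, the step size and the iteration count are assumptions made for this example, and P{·} is realized as the SVD-based projection onto the closest unitary matrix in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative cost J(W) = -Re tr(W^H B), minimized over U(n) at W = B.
# B, mu and the iteration count are assumptions made for this sketch.
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
B = Q                                  # target unitary matrix
W = np.eye(n, dtype=complex)
mu = 0.1

def project_unitary(X):
    """P{X}: closest unitary matrix in the Frobenius norm (via SVD)."""
    U, _, Vh = np.linalg.svd(X)
    return U @ Vh

for _ in range(500):
    G = -B                                             # Euclidean gradient of J
    # Project G onto the tangent space of U(n) at W -> descent direction H
    H = G - W @ (W.conj().T @ G + G.conj().T @ W) / 2
    W = project_unitary(W - mu * H)                    # steps (2.7)-(2.8)

print(np.real(np.trace(W.conj().T @ B)))               # approaches n = 4
```

Note that the tangent step W − µH alone leaves the manifold; the SVD projection at every iteration is exactly the restoration overhead discussed above.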

Riemannian gradient algorithms along geodesics

A natural way to optimize a cost function on a Riemannian manifold is to move along geodesics. Geodesics are locally the length-minimizing paths on a Riemannian manifold [59]. Intuitively, they correspond to straight lines in Euclidean space. Riemannian algorithms for optimization under unitary matrix constraint use the exponential map as a local parametrization. The resulting algorithms employ a multiplicative update rule, i.e., a rotation is applied to the previous value to obtain the new one. Each iteration k of the algorithm consists of the following step:

W_{k+1} = exp(−µ_k H^R_k) W_k = R_k W_k,    (2.9)

where −H^R_k is the direction vector of the geodesic and is represented by a skew-Hermitian matrix. Consequently, its matrix exponential R_k = exp(−µ_k H^R_k) is a unitary matrix. Since W_k is a unitary matrix, W_{k+1} remains unitary at every iteration. In this way, the constraint is maintained automatically and no enforcing procedure is necessary. Optimization algorithms with unitary constraints such as the ones in [137] are more general, in the sense that they are designed for the Stiefel and the Grassmann manifolds. Therefore, when dealing with the case of n × n unitary matrices, they do not take into account its Lie group structure, which brings numerous computational benefits. We fully exploit these benefits in the gradient algorithms proposed in [Publication I], [Publication II], [Publication III], [Publication IV] and [Publication V]. The most important advantage is the convenient expression for the geodesics and the parallel transport. Geodesics are expressed in terms of the matrix exponential of skew-Hermitian matrices. Hence, they are easier to compute compared to the projection-based method [137], which requires computing the SVD of arbitrary matrices. Recent progress in numerical methods for calculating the matrix exponential may be exploited [47, 117, 142]. For more details see [Publication I], Section V. Another important property of the Lie group is that transporting vectors from one tangent space to another may be done in a very simple manner. The parallel transport, which is needed for the conjugate gradient algorithm, may be done simply by using left/right matrix multiplications [Publication I]. Moreover, the geodesic search when adapting the step size µ_k at every iteration may be done efficiently by using, for example, the Armijo rule [163]. Efficient line search methods for step size adaptation have been proposed in [Publication IV]. They exploit the almost periodic property of the cost function along geodesics on U(n).
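A minimal sketch of the multiplicative update (2.9) is given below, for an illustrative cost J(W) = −Re tr(WᴴB) minimized at W = B. The skew-Hermitian direction G = ΓWᴴ − WΓᴴ (with Γ the Euclidean gradient) is one standard form of the Riemannian gradient translated to the identity; the cost, step size and iteration count are illustrative assumptions, not taken from the cited publications.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

# Illustrative cost J(W) = -Re tr(W^H B); the multiplicative update
# W <- expm(-mu*G) W keeps W unitary without any restoration step.
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
B = Q
W = np.eye(n, dtype=complex)
mu = 0.1

for _ in range(500):
    Gamma = -B                                     # Euclidean gradient of J at W
    G = Gamma @ W.conj().T - W @ Gamma.conj().T    # skew-Hermitian direction
    W = expm(-mu * G) @ W                          # rotation R_k applied to W_k, eq. (2.9)

print(np.linalg.norm(W.conj().T @ W - np.eye(n)))  # unitary to machine precision
```

Because exp of a skew-Hermitian matrix is exactly unitary, the constraint is preserved by construction, in contrast to the projection-based steps (2.7)–(2.8).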
In conclusion, geodesic algorithms for optimization under unitary matrix constraint fully exploit the geometric and algebraic structure of the Lie group of unitary matrices to reduce complexity.

2.2.3 Optimization under unitary matrix constraint – an illustrative example

The goal of this example is to illustrate how different optimization algorithms operate under unitary matrix constraint. We use a simple toy problem for illustration purposes. We minimize the cost function J(w) = |w + 0.2|^2 under the unitary constraint, which in this case is the unit circle w*w = 1, w ∈ C. The unit-length complex numbers form the Lie group of 1 × 1 unitary matrices U(1). We consider a three-dimensional representation of the cost function J(w) with respect to the real and the imaginary parts of w, i.e., x = ℜ{w} and y = ℑ{w}. The cost function J : R^2 → R is quadratic in x and y and is represented by the paraboloid P in Figure 2.2. The unitary constraint is represented by the cylinder obtained by translating the unit circle U(1) along the vertical axis. The resulting cylinder is the space U(1) represented in three dimensions (embedded in R^3). The parameter space where the cost function satisfies the constraint x^2 + y^2 = 1 is represented by the intersection of the cost function surface with the cylinder, i.e., the ellipse E represented by the thick curve. This ellipse represents the constrained parameter space of the cost function J : U(1) → R. There are significant differences between minimizing the cost function J(w) on R^2 and on U(1). On the Euclidean space R^2, the minimum is attained at the point m_E of coordinates (x, y) = (−0.2, 0) (the minimum of the paraboloid P in Figure 2.2). This point does not satisfy the constraint, therefore it is an undesired minimum. On the Riemannian space, the minimum is attained at the point m_R of coordinates (x, y) = (−1, 0) (the minimum on the ellipse E in Figure 2.2). This point satisfies the constraint and is the desired minimum. The Euclidean steepest descent direction −∇_E J(x_0, y_0) at a given point p satisfying the constraint is tangent to the meridian of the paraboloid, and points in the direction of the undesired minimum.
The Riemannian steepest descent direction −∇_R J(x_0, y_0) is tangent to the ellipse E, and points in the direction of the desired minimum. The two-dimensional representation of the cost function in Figure 2.3 shows how different algorithms operate under the unitary constraint. Five different algorithms are considered. The first two algorithms are the unconstrained and the constrained SD on R^2, respectively. The constrained version enforces the unit norm after every iteration, as described in Subsection 2.2.1. The third algorithm is the extra-penalty method, also described in Subsection 2.2.1. The fourth and the fifth algorithms operate on U(1); they are the non-geodesic SD described in Subsection 2.2.2 and the geodesic SD algorithm in Subsection 2.2.2, respectively. We may notice in Fig. 2.3 that the unconstrained SD (marked by ♦) takes the steepest descent direction in R^2, and goes straight to the undesired minimum. By enforcing the unit norm constraint, we project the current point radially onto the unit circle (). In

Figure 2.2: Three-dimensional visualization of the cost function given by J(x, y) = (x + 0.2)^2 + y^2, represented by the paraboloid P. The unitary constraint x^2 + y^2 = 1 in R^3 is the cylinder obtained by translating the unit circle along the vertical axis (not shown in the figure). The constrained parameter space of the cost function J(x, y) is obtained by intersecting the paraboloid with the cylinder, i.e., the ellipse E represented by the thick curve. The minimum of the unconstrained cost function m_E is the minimum in the Euclidean space. The minimum of the constrained cost function m_R is the minimum in the Riemannian space U(1). The Euclidean steepest descent direction −∇_E J(x_0, y_0) at the point (x_0, y_0) is tangent to the meridian of the paraboloid crossing (x_0, y_0). The Riemannian steepest descent direction −∇_R J(x_0, y_0) at the point (x_0, y_0) is tangent to the ellipse.

each step, the constraint has to be enforced in order to avoid the undesired minimum. The extra-penalty SD algorithm (▽) converges somewhere between

[Figure: trajectories of the five SD algorithms on the unit circle, starting from the initialization point w_0 = exp(jπ/4). The legend distinguishes SD on the Euclidean space (unconstrained, with constraint enforcement, and with the extra-penalty approach) from SD on the Riemannian space (non-geodesic and geodesic); the desired minimum m_R and the undesired minimum m_E are marked.]
Figure 2.3: Minimizing the cost function J(x, y) = (x + 0.2)^2 + y^2 on the unit circle U(1): Euclidean vs. Riemannian steepest descent (SD) methods.

the desired and the undesired minimum, depending on the factor which is used to weight the extra-penalty. The non-geodesic SD algorithm on U(1) (×) [137] takes the steepest descent direction on U(1), and moves along a straight line tangent to the unit circle. Due to the non-zero curvature of U(1), the constraint needs to be enforced after every iteration by projection. The geodesic SD algorithm uses a multiplicative update, which in this case is a phase rotation. Consequently, the constraint is satisfied at every step in a natural way. Although this low-dimensional example is rather trivial, it has been included for illustrative purposes. In the case of multi-dimensional unitary matrices, a similar behavior of the algorithms considered here is encountered.
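The geodesic SD trajectory of this example can be reproduced in a few lines. The step size and iteration count below are illustrative choices; the update is the U(1) case of (2.9), i.e., a pure phase rotation.

```python
import numpy as np

# Geodesic SD on U(1) for the toy cost J(w) = |w + 0.2|^2 of this example.
# Step size mu and the iteration count are illustrative choices.
w = np.exp(1j * np.pi / 4)        # initialization point on the unit circle
mu = 0.5

for _ in range(200):
    gamma = w + 0.2                                # Euclidean gradient dJ/dw*
    g = gamma * np.conj(w) - w * np.conj(gamma)    # Riemannian gradient (purely imaginary)
    w = np.exp(-mu * g) * w                        # phase rotation; |w| stays 1

print(w)   # converges to the desired minimum  w = -1
```

Since g is purely imaginary, exp(−µg) has unit modulus, so the iterate never leaves the unit circle and no constraint enforcement is needed.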

2.3 Applications of differential geometry to array and multi-channel signal processing

Differential geometry has become a highly important topic in the signal processing community. It does not only provide powerful tools for solving certain problems, but also allows a better understanding of them [89].

A comprehensive overview of the geometric methods and their applications in engineering is provided in [88]. A recent review of applications of differential geometry to signal processing may be found in [138]. The applications may be classified by the nature of the task to be solved as follows: optimization on manifolds, tracking on manifolds, statistics on manifolds and quantization on manifolds. In this section, a detailed presentation of this classification, as well as an application-oriented literature review, are provided.

2.3.1 Optimization and tracking on manifolds

When the optimization needs to be performed subject to differentiable equality constraints, optimization on differentiable manifolds arises naturally [18]. In this way, the initial constrained optimization problem becomes an unconstrained one on an appropriate differentiable manifold. Among the most relevant contributions to this area throughout the past thirty-five years are those brought by Luenberger in 1972 [131], Gabay in 1982 [87], Smith in 1993 [183] and Udrişte in 1994 [196]. They consider optimization on general differentiable manifolds. However, the corresponding algorithms do not exploit special properties that may appear on certain manifolds, such as Lie groups and homogeneous spaces [107, 123, 208]. For this reason, the resulting algorithms may exhibit high computational complexity. Most of the work done after 1994 is dedicated to optimization on particular Riemannian manifolds. The most popular manifolds arising in practice are determined by orthonormal matrix constraints, such as the Stiefel manifold St(n, p) and the Grassmann manifold Gr(n, p) [17, 69, 74, 75, 137, 151]. The points on these manifolds may be represented by orthonormal n × p matrices. The n × n orthogonal/unitary matrices are a special case of the Stiefel manifold, i.e., they form Lie groups. Special attention has been paid in the literature to optimization algorithms operating on the orthogonal group O(n) [17, 40, 41, 69, 76, 77, 79, 150, 151, 161, 183, 214] and on the general linear group GL(n) [229, 230]. The above algorithms are designed only for real-valued matrices. The complex-valued case has been addressed only in [137], since most of the authors consider the extension from the real to the complex case trivial. Most of the communication and sensor array signal processing applications deal with complex-valued matrices and signals. In this thesis, we focus on Riemannian algorithms operating on the unitary group U(n).
Therefore, these algorithms are designed for complex-valued matrices. They fully exploit the properties arising from the Lie group structure of the manifold. In [Publication I] and [Publication III] we propose Riemannian steepest descent algorithms on U(n), and in [Publication II] we propose a Riemannian conjugate gradient algorithm on U(n). In the following part of this subsection, a few relevant applications of optimization on manifolds are presented. They include subspace techniques, MIMO communications, blind source separation, and array signal processing in general.

Subspace estimation and tracking

Subspace techniques are fundamental tools in signal processing. Among the most common applications are high-resolution frequency estimation, direction finding used for smart antennas [170, 178], beamforming, delay estimation and channel equalization. A thorough literature survey of the algorithms for tracking the extreme eigenvalues and/or eigenvectors in signal processing up to the year 1990 may be found in [53]. Since then, many new algorithms for subspace estimation and tracking with different complexity and performance have been developed. One of the most common algorithms is Projection Approximation Subspace Tracking (PAST and PASTd) [11, 215, 216], which is mostly used for direction of arrival estimation. Other algorithms in the literature are designed for principal or minor subspace tracking in blind source separation [60, 63, 64, 66, 205]. Many authors approach subspace estimation and tracking as an optimization problem on the Grassmann manifold Gr(n, p) [17–19, 85, 86, 122, 186]. The subspaces represent points on Gr(n, p) and their time variation corresponds to a trajectory on the manifold. In addition to the observations, a stochastic model for the dynamics of the subspaces may be employed. Most of the algorithms in the literature use simple models to predict the motion or the rotation of the subspaces. In these models, the best estimate of the subspace at one time instance is simply the current value of the subspace. These algorithms do not take into account any information on the dynamic behavior, i.e., “the observed motion of the subspaces”. This lack of predictability comes from the fact that the subspaces cannot be associated with vector quantities moving in a finite-dimensional vector space, described by a conventional state-space model, such as the Kalman filter. Algorithms which may take into account the subspace dynamics have been proposed, and they operate on the Grassmann manifold [85, 86, 186].
In general, the complexity is relatively high, but they possess very good tracking capabilities. As suggested in [138], a potential research area is to extend the state-space model to nonlinear settings by replacing the addition operation with a Lie group action, such as multiplication. An example of the group action is the rotation applied to an orthonormal matrix. Some applications require estimating the exact set of eigenvectors, not only the subspace they span. In this case, the optimization should be performed on the Stiefel manifold. Classical algorithms for computing eigenvectors [53] are formulated on Riemannian manifolds, such as Jacobi-type methods [112] (see [Publication III] and [62]) and the Rayleigh Quotient Iteration (RQI) [18, 19, 136, 148, 183]. In this work, we propose Riemannian algorithms for computing the complete basis of eigenvectors (see [Publication I], [Publication II] and [Publication III]). They operate on the Lie group of unitary matrices U(n).
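As an illustration of computing a complete eigenvector basis on U(n), the sketch below applies geodesic SD to the Brockett cost J(W) = tr(WᴴAWN), a standard diagonalization criterion (not necessarily the exact criterion used in the cited publications). The matrix A, the weights N, the step size and the iteration count are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

# Geodesic SD on U(n) for the Brockett cost J(W) = tr(W^H A W N): at the
# minimum, the columns of W form an eigenvector basis of the Hermitian A.
n = 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
A = Q @ np.diag([1.0, 2.0, 3.0]) @ Q.conj().T      # known eigenvalues 1, 2, 3
N = np.diag(np.arange(1.0, n + 1))                 # fixed diagonal weighting matrix
W = np.eye(n, dtype=complex)
mu = 0.02

for _ in range(5000):
    Gamma = A @ W @ N                              # Euclidean gradient of J
    G = Gamma @ W.conj().T - W @ Gamma.conj().T    # skew-Hermitian direction
    W = expm(-mu * G) @ W                          # geodesic step on U(n)

D = W.conj().T @ A @ W                             # approximately diagonal
print(np.sort(np.round(np.real(np.diag(D)), 3)))   # approximately the eigenvalues of A
```

The multiplicative update keeps W exactly unitary throughout, so the limit is a full orthonormal eigenbasis rather than merely a subspace estimate.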

MIMO communication systems

Communication systems with multiple transmit and/or receive antennas achieve high capacity and link reliability [194]. Sending data on the eigenmodes of the MIMO channel is an important practical transmission scheme. The unitary matrices play a crucial role in the beamformer design. The Stiefel manifold is relevant for this type of application, since the exact eigenvectors are required. Orthonormal coding matrices are also used in space-time coding [111]. A blind identification approach is applied to beamforming in [45], based on Joint Approximate Diagonalization of Eigen-matrices (JADE). The JADE approach relies on the independence of the sources, by exploiting the statistical information of the fourth-order cumulants. Without knowing the array manifold, the beamforming is made robust to antenna array imperfections, so no physical modeling or calibration is needed. The diagonalization of the fourth-order cumulant matrices is formulated as an optimization problem under unitary matrix constraint. This problem is addressed by proposing a steepest descent algorithm on the unitary group in [Publication I] and a conjugate gradient algorithm on the unitary group in [Publication II]. The proposed technique outperforms the classical JADE optimization approach [45] based on Givens rotations, especially when the number of signals is relatively large. This type of application has also been formulated as an optimization problem on the Stiefel manifold [149], and recently on the oblique manifold [16]. A Riemannian optimization technique on the oblique manifold has been applied in [21] for optimizing the transmit beamforming covariance matrix of MIMO radars [34, 82, 167]. In [192] a constrained beamformer design is considered. The problem may also be formulated as an optimization under unitary matrix constraint. In [185] beamforming is used to maximize the signal-to-interference-plus-noise ratio (SINR).
The optimal weight is determined by using linear and nonlinear conjugate gradient algorithms. They operate on the Euclidean space and the Stiefel manifold, respectively. An LS approach to blind beamforming was adopted in [209]. The problem is decomposed into two stages. First, a whitening procedure is applied to the received array vector, which transforms the array response matrix into a unitary matrix. The second step is a unitary rotation. The rotation matrix is determined from the fourth-order cumulants, similarly to the JADE algorithm [45]. Riemannian gradient algorithms for array signal processing have also been proposed in [187]. The optimum weight coefficients are vectors with constant magnitude, but variable phase. Therefore, only a phase-nulling approach is used to maximize the SINR. This is done by using Riemannian conjugate gradient and Newton algorithms. In [197] a tracking solution for the downlink eigenbeamforming in Wideband CDMA is proposed. The unitary constrained optimization is performed by using Givens rotations.

The complexity may be decreased by formulating the problem on the Stiefel manifold. A blind source separation approach for single/multi-user MIMO communication systems has been adopted in [156]. The algorithm requires maximizing a multi-user kurtosis criterion under unitary matrix constraint. This is done by using the classical Euclidean steepest descent combined with a Gram-Schmidt orthogonalization procedure after every iteration (see Section 2.2.1). In [171] a blind source separation approach for the Bell Labs lAyered Space-Time coding (BLAST) architecture [210] is proposed. The algorithm is based on a multi-modulus algorithm, which leads to the same unitary optimization problem as in [156]. Similar constant-modulus criteria to be minimized under unitary matrix constraint are employed in [130, 157, 172].

Blind source separation

Separating signals blindly may be done by exploiting the statistical properties of the transmitted signals. Amari [25] proposed the natural gradient algorithm for blind separation. The learning algorithm operates on the Lie group of invertible matrices and has been proved to be Fisher efficient by means of information geometry [24]. Cardoso and Laheld [44] developed Equivariant Adaptive Separation via Independence (EASI). The concept of the matrix multiplicative group is considered and the resulting algorithm is called the relative gradient. The algorithm provides “isotropic convergence”, similarly to the Newton algorithm. The connection between the natural gradient and the relative gradient has been established in [75, 161]. Douglas and Kung consider blind source separation with orthogonality constraints and propose the ordered rotational KuickNet algorithm [65]. The algorithm discretizes the geodesic motion on the Stiefel manifold. Even though [25, 44, 65, 230] consider the matrix group concept, the update of the corresponding algorithms is not based on the group operation. An additive update is used instead.
For this reason, the constraint needs to be restored by separate procedures in each iteration. A conjugate gradient algorithm for blind separation of temporally correlated signals, which exploits the group properties of the group of invertible matrices GL(n), is proposed in [229]. In this way, the undesired trivial solution to which the Euclidean gradient algorithms would converge (the zero matrix) is avoided. Differential geometry-based learning algorithms on the orthogonal group for blind separation have been proposed first by Fiori et al. [73, 76, 78, 79] and Nishimori [150]. Algorithms operating on the Stiefel and on the Grassmann manifolds have also been proposed [74, 75]. More recent relevant work in this area is [46, 77, 151]. Plumbley [161, 162] proposed Lie group methods for non-negative ICA, i.e., a steepest descent on the orthogonal group. All the above-mentioned steepest descent algorithms are designed only for real-valued matrices and sources. These algorithms are not directly suitable for unitary matrices, as will be shown later in Chapter 3.

A reliable alternative for solving the blind separation problem is the JADE algorithm proposed by Cardoso and Souloumiac [45]. The JADE algorithm consists of two stages. First, a pre-whitening of the received signal is performed. The second stage is a unitary rotation. This second stage is formulated as an optimization problem under unitary matrix constraint, since no closed-form solution can be provided except for simple cases such as 2-by-2 unitary matrices. It should be noted that the first stage can also be formulated as a unitary optimization problem, as in [Publication II] and [Publication III]. In order to solve for the unitary rotation, we propose a steepest descent (SD) algorithm which fully exploits the benefits of the Lie group of unitary matrices U(n) [Publication I]. For this reason, the complexity per iteration is lower compared to the steepest descent in [137, 149]. In general, conjugate gradient (CG) converges faster than SD [183]. This happens also in the case of the JADE cost function [Publication II], when the input signals are not identically distributed. Moreover, the computational complexity is comparable to that of the steepest descent in [Publication I] and [Publication III]. The reduction in complexity is achieved by exploiting the additional group properties of U(n) when computing the search directions, as well as special matrix structures associated with the search directions. The almost periodic property of a smooth cost function along geodesics on U(n) enables efficient search along geodesics when adapting the step size parameter [Publication IV], [Publication II]. General algorithms for optimization under unitary matrix constraint operating on the complex Stiefel manifold of n × n unitary matrices have also been proposed [137, 149]. The algorithms in [137, 149] are very general, in the sense that the local parametrization is chosen for the Stiefel and the Grassmann manifolds.
For this reason, when applied to n × n unitary matrices, they do not exploit the additional Lie group properties of U(n) in order to reduce the computational complexity. The FastICA algorithm [113] has recently been extended to Independent Subspace Analysis (ISA) in [179]. The corresponding algorithms are based on optimization on the Grassmann manifold. Other ICA and ISA algorithms operating on the flag manifold have been recently proposed in [152]. The Robust Accurate Direct ICA aLgorithm (RADICAL) on the orthogonal group has been proposed in [26].

Linear algebra applications

Various linear algebra problems may be solved iteratively by using tools of Riemannian geometry. The most popular matrix decompositions, such as the (generalized) eigendecomposition (ED) and the singular value decomposition (SVD), may be formulated in terms of descent equations on differentiable manifolds [14, 18, 40, 43, 52, 57, 112, 182–184]. The SVD has direct applications in Least Squares (LS) estimation, low-rank approximation, matrix inversion, and subspace techniques. For the low-rank approximation

SVD is optimal under the Frobenius norm, but this is no longer true under weighted norms. Moreover, no closed-form solution exists, in general. The weighted low-rank approximation is formulated as a minimization problem on the Grassmann manifold in [140]. The convolutive reduced-rank Wiener filtering has also been formulated as an optimization problem on the Grassmann manifold in [139]. Several Least Squares (LS) matching problems may be solved by minimizing an error criterion on a suitable manifold [40, 43, 52, 69]. A typical example is encountered in image processing applications, where matching points in one image with points in a second image is required. Matching is often very difficult because of the large number of possibilities. Hence, approximate solutions may be of interest. In this case, the matching problem reduces to finding an optimum orthogonal matrix and a permutation matrix [43]. Other image processing applications are motion estimation in computer vision [133, 146], or biomedical applications [20, 164].
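The orthogonal part of such a matching problem is the classical orthogonal Procrustes problem, which does admit a closed-form SVD solution: the rotation Q minimizing ‖A − QB‖_F is Q = UVᵀ, where UΣVᵀ is the SVD of ABᵀ. The point sets below are synthetic.

```python
import numpy as np

# Orthogonal Procrustes: recover the orthogonal matrix aligning two point sets.
rng = np.random.default_rng(3)
B = rng.standard_normal((3, 10))                   # reference point set
Qt, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # unknown true rotation
A = Qt @ B                                         # rotated point set

U, _, Vt = np.linalg.svd(A @ B.T)
Q = U @ Vt                                         # LS-optimal orthogonal matrix

print(np.allclose(Q, Qt))   # True
```

With noisy correspondences the same formula gives the LS-optimal alignment; only the joint optimization over the permutation matrix remains combinatorial.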

2.3.2 Quantization on manifolds

Quantization on manifolds requires approximating arbitrary points on the manifold by elements of a finite set of points on the manifold. The finite set of manifold-valued points is called a code book. The goodness of the approximation is defined in terms of the Riemannian distance. The code book design is crucial for the performance of the quantizer. It involves maximizing the minimum distance between the code words. Quantization on the Grassmann manifold has straightforward application to limited-feedback MIMO communication systems [50, 51, 121, 143–145] and space-time codes [103, 116, 118, 120, 141, 144].

Closed-loop MIMO systems

Often, in MIMO communication systems, in order to increase capacity, a low-rate feedback channel is used to provide channel state information back to the transmitter. This information needs to be quantized in order to reduce the transmission rate of the feedback link. Unitary/orthonormal matrices play again an important role. The quantization aims at describing orthonormal matrices by as few parameters as possible, with sufficient accuracy. In [144] a Grassmann code book design strategy for MIMO systems is provided. The goal is to achieve full diversity and significant array gain in an uncorrelated fading channel by using the channel knowledge at the transmitter. The design is based on minimizing a “chordal distance”, which is a length defined on the manifold in order to describe the distortion introduced by quantization. The results are applied to performance analysis of a MIMO wireless communication system with quantized transmit beamforming and quantized receive combining. A similar Riemannian approach has also been

considered in [145]. In [143] algorithms for quantized MIMO-OFDM systems are developed. The scheme uses a quantized feedback link in order to provide the channel state information (CSI) at the transmitter and achieve capacity and diversity gains. The motivation is that the existing schemes designed for flat-fading channels do not extend naturally to frequency-selective channels, due to an enormous feedback overhead. Two classes of algorithms for quantizing the channel information are considered. They are named “clustering algorithms” and “transform algorithms”, respectively. The clustering algorithms group the subcarriers in clusters and choose a common frequency domain representation for each group. Thus, the feedback rate depends on the number of groups and not on the number of subcarriers. The transform algorithms quantize the channel information in the time domain, where the transform essentially decorrelates the channel information. Both algorithms provide significant compression of the channel information while maintaining the bit error rate close to the case of perfect channel knowledge. A spatial multiplexing scheme with multi-mode precoding for MIMO-OFDM systems is considered in [121]. Multi-mode precoding uses linear transmit precoding, but adapts the number of transmit data streams or modes according to the channel conditions; therefore, it achieves high capacity and reliability. Typically, for an OFDM scheme, multi-mode precoding requires complete knowledge of the transmit precoding matrices for each subcarrier at the transmitter. The authors propose an alternative way to reduce the feedback rate by quantizing the precoding matrices of a fraction of the subcarriers and obtaining the other precoders using interpolation. The subcarrier mode selection, the precoder-quantizer design and the interpolation are addressed in the paper.
It is found that unitary matrices cannot be interpolated by using linear interpolation techniques due to the fact that they do not form a vector space, but a group, i.e., the unitary group. Two algorithms for interpolation of unitary matrices are proposed, namely "geodesic interpolation" and "conditional interpolation". The geodesic interpolation yields a point which lies halfway along the geodesic connecting two points. The method exploits the fact that the right singular vectors of the channel matrix are ambiguous up to a diagonal unitary matrix. These additional degrees of freedom are used to identify the smoothest interpolation path between adjacent quantized points. The conditional interpolation considers the interpolation of MIMO channel matrices acquired on the pilot subcarriers. Recent work on channel-adaptive quantization for limited-feedback MIMO beamforming systems may be found in [50]. In contrast to [121, 143, 144], the quantization algorithm is designed for correlated Rayleigh fading MIMO channels. A Grassmannian switched code book is used to exploit the inherent spatial and temporal correlation of the channel. In [51] an interpolation-based unitary precoding for spatial multiplexing

MIMO-OFDM with limited feedback is considered. The algorithm exploits the fact that OFDM transmission converts the frequency-selective channel into multiple narrow-band flat-fading sub-channels. Operating on each subcarrier may be costly, especially if their number is large. Therefore, the precoding algorithm operates on groups of subcarriers, and interpolation techniques on the Grassmann manifold are proposed.

Space-time codes

Space-time codes improve the reliability of radio links by providing a good trade-off between data rate and diversity gain [111]. Grassmann space-time codes have become increasingly popular due to their ability to use all the degrees of freedom of the MIMO system, i.e., M(1 − M/T) symbols per channel use, where M is the number of transmit antennas and T is the temporal length of the space-time code. In [120] a family of Grassmann space-time codes for non-coherent MIMO systems is proposed. The codes exploit all degrees of freedom of the Grassmann manifold Gr(T, M). The code design and also the decoding are based on minimum chordal distance, similarly to [144]. Unitary matrices play an important role in space-time coding. A capacity-efficient scheme using isotropically random unitary space-time signals is proposed in [141]. The signals transmitted across antennas, viewed as a matrix with spatial and temporal dimensions, form a unitary matrix. A unitary space-time modulation scheme via the Cayley transform is proposed in [118]. No channel knowledge is required at the receiver. The scheme is suitable for wireless systems where channel tracking is infeasible, either because of rapid changes in the channel characteristics or because of limited system resources. Similar transmission schemes have been considered in [103]. The codes may be decoded near ML performance by using a sphere decoder algorithm [203] (often with cubic complexity). In [165], Cayley differential unitary space-time codes for MIMO-OFDM are proposed.
These codes possess excellent features. They allow effective high-rate data transmission in multi-antenna communication systems, with reasonable encoder and decoder complexity. Differential geometry methods may be combined with information theory to help understand and interpret various problems. The resulting methods are called information geometry methods. LDPC codes, which are powerful and practical error-correction codes, may be analyzed by using information geometry [116].

2.3.3 Statistics on manifolds

When dealing with estimation of parameters which are constrained to a specific subset of the Euclidean space by some smooth constraints, the classical

estimation techniques must be revisited [24, 188, 189, 191, 192, 212]. Estimators of manifold-valued parameters, as well as their statistical bounds, need to be derived on the space where the parameters are constrained. This space is usually a curved space, i.e., a Riemannian manifold with non-zero sectional curvature. Such problems occur frequently in practice, for example when the parameters are defined on a sphere, or when estimating eigenvectors or subspaces. Another class of such problems occurs when the parameters can be estimated only up to certain ambiguities, as in blind channel estimation and blind source separation. Concepts such as bias and variance need to be defined in the proper parameter space, i.e., the constrained set. Estimators which are unbiased on the Euclidean space may be biased on the Riemannian space. The bias should be measured by using the distance defined on the corresponding Riemannian manifold, instead of the Euclidean distance. Also, statistical performance bounds such as the Cramér-Rao Lower Bound (CRLB) are in general derived for parameters taking values in Euclidean spaces. They are no longer valid for constrained parameters defined on differential manifolds. Ignoring the constraint when deriving such bounds may lead to a singular Fisher information matrix (FIM) due to too many degrees of freedom. Often the pseudo-inverse of the FIM is considered; the geometric interpretation of this was given in [212]. CRLBs for estimating parameters with differentiable deterministic constraints have been derived by Stoica et al. [192]. The unconstrained Fisher information matrix, which is not necessarily of full rank, is replaced by a constrained FIM determined from the smooth constraint. This bound is still expressed in terms of Euclidean distance. Its accuracy is expected to degrade since the curvature of the corresponding Riemannian space is neglected. An Intrinsic Variance Lower Bound (IVLB) on Riemannian manifolds has been derived in [213].
The IVLB is a lower limit on the estimation accuracy, measured in terms of the mean-square Riemannian distance. For parameters defined in Euclidean spaces (zero sectional curvature), the IVLB coincides with the classical CRLB. Estimation bounds on arbitrary manifolds in which no set of extrinsic coordinates exists have been established recently by Smith [188, 189]. The frequently encountered examples of estimating either an unknown subspace or a covariance matrix are examined in detail in [188]. Intrinsic versions of the Cramér-Rao bound on manifolds are derived for both biased and unbiased estimators. Remarkably, it is shown in [188] that from an intrinsic perspective, the sample covariance matrix is a biased and inefficient estimator. The bias term reveals the dependency of the estimator on the limited sample support observed in practice. For this reason, the natural invariant metric is recommended over the flat metric (Euclidean norm) for analysis of covariance matrix estimation. The capacity of non-coherent MIMO fading channels is derived in [231] by using a Riemannian geometry approach. The scenario of fast fading

is considered; thus an accurate estimate of the fading coefficients is not available to either the transmitter or the receiver. A geometric interpretation of the capacity expression on the Grassmann manifold is given. Monte-Carlo extrinsic estimators of manifold-valued parameters have been considered in [191]. The estimation of means and variances of manifold-valued parameters is considered, using two popular sampling methods (independent and importance sampling). The results are applied to target pose estimation on the orthogonal group and subspace estimation on the Grassmann manifold.


Chapter 3

Practical Riemannian algorithms for optimization under unitary matrix constraint

3.1 The unitary group U(n) as a real manifold

Most of the Riemannian optimization algorithms in the literature [40, 41, 46, 69, 73, 76–79, 150, 151, 161, 183, 214] are designed for optimization on the orthogonal group O(n), i.e., they consider only real-valued matrices. Very often, in communications and sensor array signal processing applications we are dealing with complex-valued matrices and signals. Consequently, the optimization needs to be performed under unitary matrix constraint, i.e., on the unitary group U(n). Often, and unjustifiably, extending algorithms designed for real-valued matrices to complex-valued matrices is considered to be trivial. Commonly, this is done in a simplistic manner by changing a real-valued result into a complex-valued one, just by replacing the transpose operation with the Hermitian transpose, and skipping all the intermediate derivation steps. In many cases the result holds, but there are cases when this simplistic approach fails, leading to wrong results. We will show at the end of this section an illustrative example where the simplistic approach fails. Thus, a proper derivation of the algorithms dealing with complex-valued matrices is required. In order to be able to derive optimization algorithms on the unitary group U(n), it is important to know that the Lie group U(n) of n × n unitary matrices is a real differentiable manifold. We show how all complex algebraic operations can be mapped into real operations and vice versa, and describe a convenient way of real differentiation using complex-valued matrices and operations.

3.1.1 Revealing the real Lie group structure of U(n)

A Lie group is a differentiable manifold [59] and a group at the same time, with the property that the group operations are differentiable [107, 123, 208]. The n × n unitary matrices are closed under the standard matrix multiplication and they form the unitary group U(n). Even though the elements of U(n) are represented by complex-valued matrices, the unitary group is a real Lie group, i.e., it possesses a real differentiable structure. Complex Lie groups are the ones whose multiplication operation is compatible with a complex analytic manifold structure [123]. Some groups possess both real and complex differentiable structures (e.g., the complex general linear group GL(n), the complex special linear group SL(n), and so on) [107, Ch. 8]. Although the Lie group of unitary matrices U(n) cannot be viewed as a complex Lie group, it is important to understand that the complex manifold structure is generally useless in the optimization context. This is because we are always dealing with real-valued cost functions of complex-valued argument. Such functions are not complex differentiable, unless they are constant functions. Instead, they may be differentiable w.r.t. the real and the imaginary parts of the complex argument.

Any complex-valued matrix A ∈ C^{n×n} can be mapped into a real-valued matrix A_ℝ ∈ R^{2n×2n} by using its real and imaginary parts A_R ≜ ℜ{A} and A_I ≜ ℑ{A}, respectively, as follows:

    A = A_R + jA_I  ←→  A_ℝ ≜ [ A_R  −A_I
                                A_I   A_R ].                       (3.1)

It is straightforward to verify that the mapping (3.1) is differentiable, and the following equalities hold for any A, B ∈ C^{n×n}:

    (tA)_ℝ = t A_ℝ,  ∀t ∈ R,                                       (3.2)
    (A + B)_ℝ = A_ℝ + B_ℝ,                                         (3.3)
    (AB)_ℝ = A_ℝ B_ℝ,                                              (3.4)
    (A^{−1})_ℝ = (A_ℝ)^{−1},  det{A} ≠ 0,                          (3.5)
    (A^H)_ℝ = A_ℝ^T,                                               (3.6)
    trace{A_ℝ} = 2ℜ{trace{A}}.                                     (3.7)

Based on (3.2), (3.3) and (3.4), it also follows that

    (exp(tA))_ℝ = exp(t A_ℝ),  t ∈ R,                              (3.8)

where exp(·) denotes the exponential map from the Lie algebra to the Lie group [107, Ch. II, §1]. For matrix Lie groups, this coincides with the standard matrix exponential given by the convergent power series

    exp(A) ≜ Σ_{m=0}^{∞} A^m / m!.                                 (3.9)

In conclusion, the mapping (3.1) reveals the real Lie group structure of U (n). The main benefit is that it enables using complex-valued matrices instead of large real-valued matrices containing the real and the imaginary parts in separate blocks. Other properties of U (n) are presented in detail in [Publication I].
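The properties (3.2)–(3.8) are easy to verify numerically. The following sketch (an illustration, not part of the thesis; it assumes numpy and scipy, and uses randomly drawn matrices) checks the multiplicativity (3.4), the transpose relation (3.6), the trace relation (3.7), and the exponential relation (3.8) for the mapping (3.1):

```python
import numpy as np
from scipy.linalg import expm

def to_real(A):
    """Map a complex n x n matrix A = A_R + j*A_I to its real 2n x 2n
    representation [[A_R, -A_I], [A_I, A_R]], as in Eq. (3.1)."""
    AR, AI = A.real, A.imag
    return np.block([[AR, -AI], [AI, AR]])

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# (3.4): the mapping is multiplicative
assert np.allclose(to_real(A @ B), to_real(A) @ to_real(B))
# (3.6): the Hermitian transpose maps to the ordinary transpose
assert np.allclose(to_real(A.conj().T), to_real(A).T)
# (3.7): trace relation
assert np.isclose(np.trace(to_real(A)), 2 * np.trace(A).real)
# (3.8): the matrix exponential commutes with the mapping
assert np.allclose(to_real(expm(A)), expm(to_real(A)))
```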

3.1.2 Differentiation of real-valued functions of complex-valued argument

The real differentiation of functions of complex-valued matrix arguments can be conveniently described in complex terms by the following partial derivatives [39, 124]:

    ∂J/∂A ≜ (1/2) ( ∂J/∂A_R − j ∂J/∂A_I )                          (3.10)

and

    ∂J/∂A* ≜ (1/2) ( ∂J/∂A_R + j ∂J/∂A_I ).                        (3.11)

In practice, it would be inconvenient to split all complex-valued matrices in their real and imaginary parts because this would complicate the mathematical expressions. The operators (3.10)-(3.11) and the mapping (3.1) enable direct manipulation of the complex-valued matrices, but we have to keep in mind that in fact all the operations are applied to the real and imaginary parts. The computation of derivatives (3.10) and (3.11) has been recently addressed in [109].
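As a sanity check on (3.11), the conjugate derivative can be approximated entry by entry with central finite differences on the real and imaginary parts. The sketch below (illustrative, not from the thesis) uses the example cost J(A) = trace{A^H A}, for which the closed-form result ∂J/∂A* = A is well known:

```python
import numpy as np

def J(A):
    """Example cost: squared Frobenius norm, J(A) = trace{A^H A} (real-valued)."""
    return np.sum(np.abs(A) ** 2)

def wirtinger_conj_grad(J, A, eps=1e-6):
    """Numerically evaluate dJ/dA* = (1/2)(dJ/dA_R + j dJ/dA_I), Eq. (3.11),
    entry by entry with central finite differences."""
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        E = np.zeros_like(A)
        E[idx] = 1.0
        dR = (J(A + eps * E) - J(A - eps * E)) / (2 * eps)            # d/dA_R
        dI = (J(A + 1j * eps * E) - J(A - 1j * eps * E)) / (2 * eps)  # d/dA_I
        G[idx] = 0.5 * (dR + 1j * dI)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
# For J(A) = trace{A^H A}, the closed-form gradient is dJ/dA* = A
assert np.allclose(wirtinger_conj_grad(J, A), A, atol=1e-4)
```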

3.1.3 Justification of using complex-valued matrices

In this subsection we provide an example showing that the results involving real-valued matrices do not always extend in a straightforward manner to complex-valued matrices. The "trick" of replacing the transpose operation with the Hermitian transpose may lead to a wrong result. Moreover, the correct result is not a trivial extension. We consider the chain rule for differentiating a real-valued function involving real and complex matrices, respectively. A comparison between the real and the complex case is provided in Table 3.1. As can be seen in Table 3.1, extending the result obtained in the real-valued case to the complex-valued case is not straightforward. Just replacing the transpose operation with the Hermitian transpose leads to a wrong result: the complex conjugation of one of the two factors inside the trace operator would be missed, as well as taking the real part. This chain-rule example has been selected on purpose, since it is used to differentiate the cost function along geodesics when performing the line search optimization. For details, see [Publication II] and [Publication IV].

Real-valued case: Consider a real-valued scalar function of real-valued matrix argument J_1 : R^{n×n} → R, and a real-valued matrix function of real-valued scalar argument W_1 : R → R^{n×n}. The composition of J_1 and W_1 is Ĵ_1(t) ≜ (J_1 ∘ W_1)(t) = J_1(W_1(t)). Real-case result:

    dĴ_1/dt (t_1) = trace{ [dJ_1/dW_1 (W_1(t_1))]^T dW_1/dt (t_1) }.

Complex-valued case: Consider a real-valued scalar function of complex-valued matrix argument J_2 : C^{n×n} → R, and a complex-valued matrix function of real-valued scalar argument W_2 : R → C^{n×n}. The composition of J_2 and W_2 is Ĵ_2(t) ≜ (J_2 ∘ W_2)(t) = J_2(W_2(t)). Complex-case result:

    dĴ_2/dt (t_2) = 2ℜ{ trace{ [∂J_2/∂W_2 (W_2(t_2))]^T dW_2/dt (t_2) } }.

Table 3.1: Differentiation by using the chain rule: real-valued case vs. complex-valued case. The first result is the correct one for the real-valued case. Replacing the transpose operation with the Hermitian transpose in the real-valued result would lead to the wrong complex-valued result for dĴ_2/dt (t_2), i.e., trace{ [∂J_2/∂W_2 (W_2(t_2))]^H dW_2/dt (t_2) }. The correct result obtained in the complex-valued case is the second one. It may be noticed that extending the real-valued result to the complex-valued one is not trivial.

In conclusion, the algorithms dealing with complex-valued matrices need to be derived from scratch, by using the approach presented in Subsections 3.1.1 and 3.1.2. In [Publication I] and [Publication II] we derive Riemannian steepest descent and conjugate gradient algorithms on U (n).
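The difference between the two chain rules in Table 3.1 can be demonstrated numerically. In the sketch below (illustrative, not from the thesis), the cost J(W) = ℜ{trace{BW}} (for which ∂J/∂W = B^T/2) and the geodesic curve W(t) are arbitrary choices; the correct complex-valued formula matches a finite-difference reference, whereas the naive Hermitian-transpose version does not:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n = 3
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
K = X - X.conj().T                      # skew-Hermitian direction
W0 = np.linalg.qr(X)[0]                 # some unitary starting point

def J(W):                               # real-valued cost of complex argument
    return np.trace(B @ W).real

def W_of_t(t):                          # smooth curve on U(n) (a geodesic)
    return expm(t * K) @ W0

t0, eps = 0.7, 1e-6
dW = K @ W_of_t(t0)                     # dW/dt at t0
dJdW = 0.5 * B.T                        # Euclidean gradient dJ/dW for this J

# Finite-difference reference for dJhat/dt at t0
ref = (J(W_of_t(t0 + eps)) - J(W_of_t(t0 - eps))) / (2 * eps)
# Correct complex chain rule (second result in Table 3.1)
ok = 2 * np.real(np.trace(dJdW.T @ dW))
# Naive "replace transpose by Hermitian transpose" version (wrong in general)
bad = np.trace(dJdW.conj().T @ dW).real
assert np.isclose(ok, ref, atol=1e-4)
assert not np.isclose(bad, ref, atol=1e-4)
```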

3.2 Practical optimization algorithms along geodesics on U(n)

When deriving optimization algorithms on Riemannian manifolds, the geometrical properties of the parameter space play a crucial role in reducing the computational complexity. The fact that U(n) is a matrix Lie group enables very simple formulas for geodesics and parallel transport. Additional computational benefits arise from exploiting the special matrix structures. The tangent space at the group identity element is the Lie algebra of skew-Hermitian matrices u(n). Geodesics through an arbitrary point W ∈ U(n) can be given in terms of an exponential of skew-Hermitian matrices. Recent developments in numerical methods dealing with this type of computation can be exploited [117, 224]. For details, see [Publication I]. Another useful aspect to be considered is the fact that the exponential map induces an almost periodic behavior of the cost function along geodesics. The almost periodic property may be exploited when performing the line search needed

in the step size selection [Publication IV], [Publication II]. In this section we provide Riemannian algorithms which may be used in practical applications involving optimization under unitary matrix constraint. Steepest Descent (SD) and Conjugate Gradient (CG) algorithms operating on the Lie group of unitary matrices U (n) are given in Section 3.2.1 and Section 3.2.2, respectively. Efficient line search methods which can be used together with the proposed algorithms are provided in Section 3.2.3.

3.2.1 Steepest Descent Algorithm along geodesics on U(n)

Similarly to the Euclidean space, the main advantage of the Riemannian steepest descent (SD) algorithm is that it is very simple to implement. Each iteration of the Riemannian SD algorithm consists of two subsequent stages. The first one is the computation of the Riemannian gradient, which gives the steepest ascent direction on the manifold. The second one is taking a step along the geodesic emanating in the direction of the negative gradient. The Riemannian gradient of the smooth cost function J at an arbitrary point W_k ∈ U(n) is given by:

    ∇_R J(W_k) = ∂J/∂W* (W_k) − W_k [∂J/∂W* (W_k)]^H W_k,          (3.12)

where ∂J/∂W* (W_k) is defined in (3.11) and represents the gradient of the cost function J on the Euclidean space at a given W [39, 198]. The derivation of expression (3.12) is provided in [Publication I]. The geodesic emanating from W_k along the steepest descent direction −∇_R J(W_k) on U(n) is given by:

    W(µ) = exp(−µ G_k) W_k,  where                                 (3.13)
    G_k ≜ ∇_R J(W_k) W_k^H ∈ u(n).                                 (3.14)

G_k is the gradient of J at W_k after translation into the tangent space at the identity element. Consequently, the matrix G_k is skew-Hermitian, i.e., G_k = −G_k^H. The skew-Hermitian structure of G_k brings important computational benefits when computing the matrix exponential [Publication I], as well as when performing the line search [Publication IV]. The Riemannian SD algorithm on U(n) has been derived in [Publication I] and it is summarized in Table 3.2.

3.2.2 Conjugate Gradient Algorithm along geodesics on U(n)

The conjugate gradient (CG) algorithm generally achieves faster convergence than SD, not only on the Euclidean space, but also on Riemannian manifolds. This is due to the fact that the Riemannian SD algorithm has

1. Initialization: k = 0, W_k = I
2. Compute the Riemannian gradient direction G_k:  Γ_k = ∂J/∂W* (W_k),  G_k = Γ_k W_k^H − W_k Γ_k^H
3. Evaluate ⟨G_k, G_k⟩_I = (1/2) trace{G_k^H G_k}. If it is sufficiently small, then stop
4. Determine µ_k = arg min_µ J(exp(−µ G_k) W_k)
5. Update: W_{k+1} = exp(−µ_k G_k) W_k
6. k := k + 1 and go to step 2

Table 3.2: Steepest descent (SD) algorithm along geodesics on U(n)
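The steps of Table 3.2 can be sketched in a few lines of numpy/scipy. The Brockett-type cost J(W) = trace{W^H A W N}, whose Euclidean gradient is Γ = AWN, and the crude dyadic backtracking rule below are illustrative stand-ins for the Armijo geodesic search of [Publication I]:

```python
import numpy as np
from scipy.linalg import expm

# Hermitian matrix to diagonalize via the Brockett-type cost
# J(W) = trace{W^H A W N}, with Euclidean gradient Gamma = dJ/dW* = A W N.
rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = X + X.conj().T
N = np.diag(np.arange(1.0, n + 1.0))      # fixed real diagonal weights

def J(W):
    return np.trace(W.conj().T @ A @ W @ N).real

W = np.eye(n, dtype=complex)              # step 1: W_0 = I
costs = [J(W)]
for k in range(200):
    Gamma = A @ W @ N                                  # dJ/dW*
    G = Gamma @ W.conj().T - W @ Gamma.conj().T        # step 2: G_k in u(n)
    if 0.5 * np.trace(G.conj().T @ G).real < 1e-12:    # step 3: stopping test
        break
    mu = 1.0                                           # step 4: crude dyadic
    while J(expm(-mu * G) @ W) > costs[-1] and mu > 1e-10:
        mu /= 2.0                                      # backtracking search
    W = expm(-mu * G) @ W                              # step 5: geodesic step
    costs.append(J(W))

assert np.allclose(W.conj().T @ W, np.eye(n), atol=1e-8)  # W stays unitary
assert costs[-1] < costs[0]                               # cost decreased
```

Since G_k is skew-Hermitian, exp(−µG_k) is unitary, so the iterate stays on U(n) up to numerical round-off without any re-orthogonalization.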

the same drawback as its Euclidean counterpart, i.e., it takes ninety-degree turns at each iteration [183]. This is illustrated in Figure 3.1, where the contours of a cost function are plotted on the Riemannian surface determined by the constraint. The CG algorithm may significantly reduce this drawback. It exploits the information provided by the current search direction −H̃_k at W_k and the SD direction −G̃_{k+1} at the next point W_{k+1}. The new search direction is chosen to be a combination of these two, as shown in Figure 3.2. The difference compared to the Euclidean space is that the current search direction −H̃_k and the gradient G̃_{k+1} at the next point lie in different tangent spaces, T_{W_k} and T_{W_{k+1}}, respectively. For this reason they are not directly compatible. In order to combine them properly, the parallel transport of the current search direction −H̃_k from W_k to W_{k+1} along the corresponding geodesic is needed. The new search direction at W_{k+1} is

    −H̃_{k+1} = −G̃_{k+1} − γ_k τH̃_k,                              (3.15)

where τH̃_k denotes the parallel transport of the vector H̃_k into T_{W_{k+1}} along the corresponding geodesic (see Figure 3.2). The weighting factor γ_k is determined such that the directions τH̃_k and H̃_{k+1} are Hessian-conjugate [69, 183]. The exact conjugacy would require expensive computation of the Hessian matrices. In practice an approximation of γ_k is used instead, for example the Polak-Ribière formula [69]. For more details, see [Publication II].

The fact that U(n) is a Lie group enables describing all tangent directions (steepest descent and search directions) by tangent vectors which correspond to elements of the Lie algebra u(n) via right (or left) translation. Then, all tangent vectors are represented by skew-Hermitian matrices. Thus, the new search direction on the Lie algebra u(n) is

    H_{k+1} = G_{k+1} + γ_k H_k,   H_k, H_{k+1}, G_{k+1} ∈ u(n),   (3.16)

where H_k is the old search direction on u(n). The conjugate gradient step is taken along the geodesic emanating from W_k in the direction −H̃_k = −H_k W_k, i.e.,

    W(µ) = exp(−µ H_k) W_k.                                        (3.17)

[Figure 3.1 here: contours of J(W) on the manifold, with the points W_k, W_{k+1}, W_{k+2}, the directions −G̃_k, −G̃_{k+1}, the transported direction −τG̃_k, and the minimum.]

Figure 3.1: The SD algorithm takes ninety-degree turns at every iteration, i.e., ⟨−G̃_{k+1}, −τG̃_k⟩_{W_{k+1}} = 0, where τ denotes the parallelism w.r.t. the geodesic connecting W_k and W_{k+1}.

[Figure 3.2 here: contours of J(W) on the manifold, with the points W_k, W_{k+1}, the directions −H̃_k, −G̃_{k+1}, −τH̃_k, −H̃_{k+1}, and the minimum.]

Figure 3.2: The CG takes a search direction −H̃_{k+1} at W_{k+1} which is a combination of the new SD direction −G̃_{k+1} at W_{k+1} and the current search direction −H̃_k transported to W_{k+1} along the geodesic connecting W_k and W_{k+1}. The new Riemannian steepest descent direction −G̃_{k+1} at W_{k+1} will be orthogonal to the current search direction −H̃_k at W_k transported to W_{k+1}, i.e., ⟨−G̃_{k+1}, −τH̃_k⟩_{W_{k+1}} = 0.

A geodesic search needs to be performed in order to choose a suitable value of µ [Publication IV], [Publication II]. The step size selection is crucial for the performance of the CG algorithm. The Riemannian CG algorithm on U(n) has been derived in [Publication II] and it is summarized in Table 3.3.

1. Initialization: k = 0, W_k = I
2. Compute the Riemannian gradient direction G_k and the search direction H_k: if (k modulo n²) == 0, then Γ_k = ∂J/∂W* (W_k), G_k = Γ_k W_k^H − W_k Γ_k^H, H_k := G_k
3. Evaluate ⟨G_k, G_k⟩_I = (1/2) trace{G_k^H G_k}. If it is sufficiently small, then stop
4. Determine µ_k = arg min_µ J(exp(−µ H_k) W_k)
5. Update: W_{k+1} = exp(−µ_k H_k) W_k
6. Compute the Riemannian gradient direction G_{k+1} and the search direction H_{k+1}:  Γ_{k+1} = ∂J/∂W* (W_{k+1}),  G_{k+1} = Γ_{k+1} W_{k+1}^H − W_{k+1} Γ_{k+1}^H,  H_{k+1} = G_{k+1} + γ_k H_k
7. k := k + 1 and go to step 2

Table 3.3: Conjugate gradient (CG) algorithm along geodesics on U(n).

The Riemannian CG algorithm achieves superlinear convergence, whereas the Riemannian SD converges only linearly [87, 183, 196]. On U(n), the computational complexity of the CG is comparable to that of the SD, due to the fact that the parallel transport is easy to perform. This is not always true on general Riemannian manifolds. Both the SD on U(n) introduced in [Publication I] and the CG on U(n) introduced in [Publication II] exhibit cubic complexity in n per iteration, i.e., O(n³). This property seems to be unavoidable, since a trivial multiplication of two n × n matrices already requires about 2n³ flops [99]. The CG is considerably simpler than a Newton algorithm, which would require computing costly second-order derivatives. The CG algorithm captures the second-order information by computing successive first-order derivatives and combining them properly. Newton algorithms on general Riemannian manifolds [87, 155, 183, 196] are computationally expensive also because they do not take into account the particular properties that may appear on certain manifolds, such as special matrix structures. Newton algorithms on Stiefel and Grassmann manifolds are proposed by Edelman et al. [69]. Other Newton algorithms on Stiefel and Grassmann manifolds have been proposed in the literature [17, 130, 137]. When applied on U(n), their complexity is of order O(n⁶). For computational reasons, in this thesis we treat only SD and CG on U(n) and provide computationally feasible solutions. Moreover, the Newton algorithm is not guaranteed to converge, not even locally [69, 137, 183, 184]. It may converge to any stationary point unless some strict requirements (such as convexity of the cost function) are satisfied. Trust-region methods on Riemannian manifolds have been recently proposed [15, 18] to overcome this drawback. In conclusion, the Riemannian CG algorithm provides a reliable alternative for optimization under unitary matrix constraint.
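A minimal sketch of the CG iteration of Table 3.3, for the same illustrative Brockett-type cost as before. All tangent vectors are kept right-translated to u(n), the Polak-Ribière factor is computed there, and a crude backtracking rule replaces the geodesic search of [Publication IV]; all of these simplifications are illustrative choices, not the thesis implementation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n = 4
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = X + X.conj().T                        # Hermitian matrix to diagonalize
N = np.diag(np.arange(1.0, n + 1.0))

def J(W):                                 # Brockett-type cost
    return np.trace(W.conj().T @ A @ W @ N).real

def riem_grad(W):
    Gamma = A @ W @ N                                 # dJ/dW*
    return Gamma @ W.conj().T - W @ Gamma.conj().T    # G in u(n)

def inner(Xm, Ym):                                    # <X, Y>_I on u(n)
    return 0.5 * np.trace(Xm.conj().T @ Ym).real

W = np.eye(n, dtype=complex)
G = riem_grad(W)
H = G.copy()
cost = [J(W)]
for k in range(200):
    if inner(G, G) < 1e-12:                           # stopping test
        break
    mu = 1.0
    while J(expm(-mu * H) @ W) > cost[-1] and mu > 1e-10:
        mu /= 2.0                                     # crude line search
    W = expm(-mu * H) @ W                             # geodesic step (3.17)
    G_new = riem_grad(W)
    gamma = inner(G_new - G, G_new) / inner(G, G)     # Polak-Ribiere factor
    if (k + 1) % (n * n) == 0:
        H = G_new.copy()                              # periodic restart
    else:
        H = G_new + gamma * H                         # Eq. (3.16)
    G = G_new
    cost.append(J(W))

assert np.allclose(W.conj().T @ W, np.eye(n), atol=1e-8)
assert cost[-1] < cost[0]
```

Because G_{k+1} and H_k both live in u(n) and γ_k is real, the new direction H_{k+1} in (3.16) is again skew-Hermitian, so every update stays on U(n).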

3.2.3 Efficient line search methods on U(n)

Once the search direction corresponding to the SD or to the CG algorithm has been chosen, a line search needs to be performed in order to select an appropriate step size. The line search supposes minimizing (or maximizing) the cost function J(W) in one dimension, along the curve W(µ) describing the local parametrization, i.e., find

    µ_k = arg min_µ Ĵ(µ),  where                                   (3.18)
    Ĵ(µ) ≜ J(W(µ)).                                                (3.19)

Line search usually requires expensive operations, even in the case of Euclidean optimization algorithms [163], due to multiple cost function evaluations. On Riemannian manifolds, the problem becomes even harder because every evaluation of the cost function requires expensive computations of the local parametrization. For this reason, the choice of the local parametrization plays a crucial role in reducing the computational complexity. In the case of U(n), the exponential map possesses desirable properties that may be exploited. In [Publication I], the Armijo method [163] is efficiently used to perform the search along geodesics. The step size µ_k evolves in a dyadic basis. By exploiting the properties of the exponential map, the computation of the matrix exponential may often be avoided. A reduction in complexity by half is achieved when the Armijo method is used together with the geodesic SD algorithm, compared to the non-geodesic SD algorithm in [137]. The computational issues are addressed in detail in [Publication I], Section V.

An important property of the exponential map which can be exploited in line search (see [Publication I] and [Publication IV]) is that it induces an almost periodic [80] behavior of the cost function along geodesics on U(n). Almost periodic functions are a well-studied class of functions [181]. There are many definitions for this type of function [33, 35, 54, 80, 127]. We present the most intuitive one, as in [80, 81]. A real number T is called an ε-almost period (or just almost period) of the function F : R → R if

    |F(t + T) − F(t)| ≤ ε,  ∀t ∈ R.                                (3.20)

The function F is called almost periodic if for any ε > 0, the set of ε-almost periods is relatively dense in R [80]. An almost periodic function can also

be expressed as a trigonometric polynomial

    F(t) = Σ_{m=1}^{q} b_m exp(jω_m t),                            (3.21)

where b_1, ..., b_q ∈ C, ω_1, ..., ω_q ∈ R, and q ∈ N. If the numbers ω_1, ..., ω_q are in harmonic relation, the above expression represents the classical Fourier series of a periodic function comprised of q harmonic components. For almost periodic functions, the frequencies ω_1, ..., ω_q are non-harmonic.

The almost periodic property of a cost function along geodesics on U(n) is a consequence of the fact that geodesics are expressed in terms of exponentials of skew-Hermitian matrices. This special property appears only on certain manifolds such as the unitary group U(n) and the special orthogonal group SO(n); it does not appear on Euclidean spaces or on general Riemannian manifolds. For this reason, other local parametrizations designed for more general Riemannian manifolds, such as Stiefel and Grassmann manifolds [69, 137], exhibit higher complexity on U(n). It will be shown next that the almost periodic property appears for the exponential map. For other common parametrizations, such as the Euclidean projection operator or the Cayley transform (for details see [Publication I]), the property does not appear. These parametrizations do not take into account the special structure of the tangent vectors at the group identity. Any search direction H̃_k at W_k corresponds via right translation to a skew-Hermitian matrix in the Lie algebra u(n):

    H̃_k ∈ T_{W_k}U(n)  ←→  H_k = H̃_k W_k^H ∈ u(n).               (3.22)

Skew-Hermitian matrices have purely imaginary eigenvalues of the form jω_i, i = 1, ..., n. Consider the eigendecomposition of H_k:

    H_k = U_k D_k U_k^H,  U_k ∈ U(n),  D_k = diag(jω_1, jω_2, ..., jω_n),   (3.23)

where U_k are the eigenvectors of H_k, and D_k is a diagonal matrix containing the eigenvalues along the diagonal. Next, different local parametrizations on U(n) will be considered (see [Publication I]) and the differences among them from the line search perspective are explained.

Exponential map

The geodesic update on U(n) at iteration k is expressed in terms of the exponential of skew-Hermitian matrices:

    W_geod(µ) = exp(−µ H_k) W_k                                    (3.24)
             = U_k exp(−µ D_k) U_k^H W_k.                          (3.25)
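The structure (3.23)–(3.25) is easy to verify numerically. In the sketch below (illustrative, assuming numpy/scipy), the Hermitian matrix jH_k is diagonalized once, which confirms that the eigenvalues of H_k are purely imaginary and that a single eigendecomposition then serves every step size µ:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = X - X.conj().T                      # a tangent direction H_k in u(n)

# Eq. (3.23): since 1j*H is Hermitian, eigh returns a unitary U and real w,
# so H = U diag(d) U^H with purely imaginary eigenvalues d_i = j*omega_i.
w, U = np.linalg.eigh(1j * H)
d = -1j * w
assert np.allclose(U @ np.diag(d) @ U.conj().T, H)

# Eqs. (3.24)-(3.25): only the diagonal of complex exponentials exp(-mu*d_i)
# changes with the step size mu; the eigenvectors are reused.
for mu in (0.1, 1.0, 7.5):
    geod = U @ np.diag(np.exp(-mu * d)) @ U.conj().T
    assert np.allclose(geod, expm(-mu * H), atol=1e-8)
```

This is what makes multiple cost evaluations along one geodesic cheap: each new µ costs only n scalar exponentials plus matrix products, not a fresh matrix exponential.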

[Figure 3.3 here: "Comparison between different local parametrizations" — Brockett function value vs. µ for the exponential map, the projection operator, and the Cayley transform.]

Figure 3.3: Behavior of the Brockett function [41, 184] (see also [Publication II] and [Publication V]) along different parametrizations on U(n). The exponential map (3.24) is represented by a continuous line. The projection operator (3.27) is represented by a dashed line. The Cayley transform (3.30) is represented by a dotted line. The exponential map induces an almost periodic behavior of the cost function along geodesics. For the other two parametrizations, the behavior is not almost periodic.

From (3.23) it follows that the matrix exp(−µD_k) is a diagonal matrix whose diagonal elements are complex exponentials of the form e^{−jω_i µ}, i = 1, ..., n. Consequently, each element of W_geod(µ) is a sum of complex exponentials, as in Eq. (3.21). The cost function evaluated along the geodesic,

    Ĵ_geod(µ) ≜ J(W_geod(µ)),                                      (3.26)

is an almost periodic function [80], due to the fact that J is a smooth function of W_geod(µ). As an example, the behavior of the Brockett function [41, 184] along geodesics is shown in Figure 3.3 by a continuous line. The almost periodic behavior may be exploited in line search optimization [Publication II] and [Publication IV].
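The almost periodicity of Ĵ_geod can be illustrated numerically. The sketch below (an illustration, not from the thesis) constructs a direction H with the prescribed non-harmonic frequencies 1 and √2, so the cost along the geodesic has the form (3.21); T = 140π is then an ε-almost period, because 70√2 is within 0.006 of the integer 99 (a continued-fraction convergent of √2). The cost J(W) = ℜ{trace{BW}} and the starting point W_k = I are arbitrary choices:

```python
import numpy as np
from scipy.linalg import expm

# Direction H in u(2) with eigenvalues j*1 and j*sqrt(2): the cost along
# the geodesic is then a trigonometric polynomial with non-harmonic
# frequencies, i.e. almost periodic but not periodic (Eq. (3.21)).
rng = np.random.default_rng(6)
Q = np.linalg.qr(rng.standard_normal((2, 2))
                 + 1j * rng.standard_normal((2, 2)))[0]
H = Q @ np.diag([1j, 1j * np.sqrt(2)]) @ Q.conj().T    # skew-Hermitian

B = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
def Jhat(mu):                         # cost along the geodesic from W_k = I
    return np.trace(B @ expm(-mu * H)).real

mus = np.linspace(0.0, 50.0, 2001)
vals = np.array([Jhat(m) for m in mus])

# T = 140*pi: the omega = 1 component returns exactly, and the
# omega = sqrt(2) component returns up to a phase error of about 0.03 rad.
T = 140 * np.pi
shifted = np.array([Jhat(m + T) for m in mus])
assert np.max(np.abs(shifted - vals)) < 0.1 * np.ptp(vals)
```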

Euclidean projection map

The Euclidean projection map supposes moving along a straight line tangent to the manifold and projecting back onto the manifold [137]. The projection operator is defined in (2.5). It approximates the matrix exponential at the origin up to the second order, as shown in [Publication I]. The update at iteration k may be written as:

    W_proj(µ) = P{W_k − µ H_k W_k}                                 (3.27)
             = U_k P{I − µ D_k} U_k^H W_k.                         (3.28)

From (2.5) and (3.23) it follows that the matrix P{I − µDk } is a diagonal matrix whose diagonal entries are of the form (1 − ωi µ)/|1 − ωi µ| = e∠(1−ωi µ) , i = 1, . . . , n. Consequently, the corresponding rotation angles ∠(1 − ωi µ) are all confined within the interval [0, π/2), regardless how much µ is increased. This fact can be easily understood by taking as an example the unit circle U (1). Moving along a straight line tangent to the circle towards infinity, and projecting back ends up to a point which is rotated ninety degrees from the starting point. Therefore, the projection operator cannot produce rotations larger than ninety degrees. The exponential map, on the other hand, can span the whole unit circle, and the cost function is periodic along the circle. The illustrative example in Figure 2.3, Section (2.2.3) may clarify these explanations. The variation of the rotation angle on U (1) w.r.t. µ is shown in Figure 3.4. The exponential map is represented by continuous line and the projection operator, by dashed line. The same angle limitation of the projection operator appears also in the multi-dimensional case. By increasing µ towards infinity, the projection map will converge to a fixed matrix 

lim_{µ→∞} P{W_k − µH_k W_k} = U_k diag( sign(ω_1), . . . , sign(ω_n) ) U_k^H W_k.   (3.29)

The values of the Brockett function [41, 184] (see also [Publication II] and [Publication V]) along different curves on U(n) are shown in Figure 3.3. It may be noticed that the function is not almost periodic along the curve described by the projection operator (3.27) (dashed line), as it is in the case of the exponential map (solid line).

Cayley transform. A local parametrization based on the Cayley transform is also a second-order approximation of the matrix exponential at the origin (see [Publication I]).

Figure 3.4: The rotation angle on the unit circle U(1) for different local parametrizations. The initial point is the zero angle, and the tangent vector is unit-norm. The exponential map (3.24) spans the whole range [−π, π) (continuous line); the phase increases linearly with µ and is periodic. The projection operator (3.27) spans the interval [0, π/2) (dashed line); the angle has a horizontal asymptote at π/2. The Cayley transform (3.30) spans the interval [0, π) (dotted line); the angle has a horizontal asymptote at π.
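The saturation behavior illustrated in Figure 3.4 can be reproduced in a few lines; a sketch on U(1) (with ω = 1 as the unit-norm tangent, i.e., H = j; not thesis code):

```python
import numpy as np

# Rotation angle on U(1) versus mu for the three local parametrizations,
# starting at w = 1 (zero angle) with H = 1j*omega, omega = 1.
omega = 1.0
mu = np.linspace(0.0, 1e3, 101)

ang_exp = -omega * mu                                    # exponential map: linear in mu
z_proj = 1.0 - 1j * omega * mu                           # tangent-line point w - mu*H*w
ang_proj = np.abs(np.angle(z_proj / np.abs(z_proj)))     # projection: renormalize to U(1)
z_cay = (1 - 1j * omega * mu / 2) / (1 + 1j * omega * mu / 2)
ang_cay = np.abs(np.angle(z_cay))                        # Cayley transform

print(ang_proj.max() < np.pi / 2, ang_cay.max() < np.pi) # saturation at pi/2 and pi
```

Only the exponential map wraps around the circle; the other two angles saturate no matter how large µ grows.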

The corresponding update at iteration k is given by

W_Cayley(µ) = (I + (µ/2) H_k)^{−1} (I − (µ/2) H_k) W_k   (3.30)
            = [ I + 2 U_k ( Σ_{m=1}^{∞} (µD_k)^m ) U_k^H ] W_k.   (3.31)

From (3.23) it follows that the matrix Σ_{m=1}^{∞} (µD_k)^m is a diagonal matrix whose diagonal entries are of the form ω_i µ/(1 − ω_i µ), i = 1, . . . , n. Consequently, the Cayley transform has the same limitation as the projection operator: the corresponding rotation angles ∠(1 − ω_i µ) are all confined within the interval [0, π), regardless of how much µ is increased. The variation of the rotation angle on U(1) w.r.t. µ is shown in Figure 3.4. In the multi-dimensional case, the Cayley transform converges to the matrix

lim_{µ→∞} (I + (µ/2) H_k)^{−1} (I − (µ/2) H_k) W_k = −W_k.   (3.32)
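The limit (3.32) can be checked numerically; a small sketch (not thesis code) with a random 3 × 3 skew-Hermitian direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Hk = (A - A.conj().T) / 2            # skew-Hermitian direction (-Hk in u(n))
Wk = np.linalg.qr(A)[0]              # an arbitrary unitary starting point

def cayley(mu):
    """Cayley update (3.30) for a given step size mu."""
    I = np.eye(n)
    return np.linalg.solve(I + (mu / 2) * Hk, I - (mu / 2) * Hk) @ Wk

# For very large mu the Cayley update approaches -Wk, cf. (3.32):
print(np.abs(cayley(1e10) + Wk).max())   # close to zero
```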

Again, the behavior of the cost function along the curve described by the Cayley transform (3.30) is not almost periodic, as it is in the case of the exponential map. This is shown in Figure 3.3 by a dotted line, taking as an example the Brockett function [41, 184] (see also [Publication II] and [Publication V]). In conclusion, the exponential map is suitable for line search methods due to its almost periodic behavior, unlike other common local parametrizations.

Practical line search methods on U(n). Many of the existing geometric optimization algorithms do not include practical line search methods [69, 151], or if they do, they are too complex when applied to optimization on U(n) [77, 137]. In some cases, the line search methods are either valid only for specific cost functions [183], or the resulting search is not highly accurate [77, 122, 161]. The difficulty of finding a closed-form solution for a suitable step size is discussed in [122]. The accuracy of the line search is crucial for the performance of the resulting algorithms, especially in the case of the CG algorithm, which assumes exact line search. Two efficient high-accuracy line search methods exploiting the almost periodic property of the cost function along geodesics on U(n) are proposed in [Publication II] and [Publication IV]. The first method finds only the first local minimum (or maximum) of the cost function along a given geodesic. It is based on a low-order polynomial approximation of the first-order derivative of the cost function along geodesics and on detecting its first sign change [225]. The second one finds several local minima (or maxima) of the cost function along a given geodesic and selects the best one. It approximates the almost periodic function by a periodic one [81], using the classical Discrete Fourier Transform (DFT) approach.
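The polynomial variant can be sketched generically; in the toy example below, a hand-picked almost periodic function stands in for dĴ/dµ (not thesis code):

```python
import numpy as np

# Toy stand-in for the derivative dJ/dmu along a geodesic: almost periodic,
# negative at mu = 0 (descent direction); its first zero crossing marks the
# first local minimum of J(mu).
dJ = lambda mu: -(np.cos(mu) + 0.5 * np.cos(np.sqrt(2) * mu))

mus = np.linspace(0.0, 3.0, 8)            # equi-spaced samples of the derivative
coeffs = np.polyfit(mus, dJ(mus), 4)      # low-order polynomial approximation
roots = np.roots(coeffs)
real_roots = roots[np.abs(roots.imag) < 1e-6].real
candidates = real_roots[(real_roots > 0) & (real_roots <= mus[-1])]
mu_k = candidates.min()                   # smallest positive zero crossing

print(mu_k, dJ(mu_k))                     # derivative is ~0 at the chosen step
```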


The almost periodic behavior of the cost function Ĵ(µ) (3.26) and its first-order derivative dĴ/dµ along the geodesic W_geod(µ) (3.24) is shown in Figure 3.5.


Figure 3.5: Performing the geodesic search for the JADE cost function [45]. The almost periodic behavior of the cost function Ĵ(µ) (3.26) and its first-order derivative dĴ/dµ along the geodesic W_geod(µ) (3.24) may be noticed. The first proposed line search method uses a polynomial approximation of dĴ/dµ in order to find its smallest positive zero-crossing value, which corresponds to the first local minimum of Ĵ(µ), i.e., the desired step size µ_k. The second proposed line search method uses a DFT-based approximation of dĴ/dµ (dashed line) in order to find several local minima of Ĵ(µ) along the geodesic and select the best one. Both methods sample the derivative dĴ/dµ at equi-spaced points in order to avoid repeated computations of the matrix exponential.

Both proposed methods find one or more zero-crossing values of the first-order derivative dĴ/dµ of the cost function. These correspond to local minima of the cost function Ĵ(µ). The zero-crossing values are related to the frequency spectrum of the first-order derivative. Due to the differentiation, this spectrum corresponds to a high-pass filtered version of the spectrum of the cost function. The approximation range is set according to the highest frequency

component in the spectrum, which is related to the dominant eigenvalue of the argument of the matrix exponential. The common steps of the two line search methods in [Publication II] and [Publication IV] are summarized in Table 3.4. The main difference between the polynomial-based approach and the DFT-based approach is the choice of the approximation range t_a and the number of approximation points N_a. Other specific characteristics, such as computational complexity issues, are presented in detail in [Publication II] and [Publication IV]. An important common feature shared by the two methods is that the matrix exponential is evaluated at equi-spaced points (see Figure 3.5). Therefore, both proposed methods require only one evaluation of the matrix exponential, i.e., R_1 in step 4 (Table 3.4). The other N_a − 1 rotations are powers of R_1.

1. Given W_k ∈ U(n), −H_k ∈ u(n), compute the eigenvalue of H_k of highest magnitude, |ω_max|.
2. Determine the order r of the cost function J(W) in the coefficients of W, i.e., the highest degree in which t appears in the expansion of J(W + tZ), t ∈ R, Z ∈ C^{n×n}.
3. Based on |ω_max| and r, determine an appropriate range t_a for approximating the first-order derivative of the cost function along geodesics, Ĵ′(µ).
4. Evaluate the rotation R(µ) = exp(−µH_k) at N_a + 1 equi-spaced points µ_i ∈ {0, t_a/N_a, 2t_a/N_a, . . . , t_a} as follows: R_0 ≜ R(0) = I, R_1 ≜ R(t_a/N_a) = exp(−(t_a/N_a) H_k), R_2 ≜ R(2t_a/N_a) = R_1 R_1, . . . , R_{N_a} ≜ R(t_a) = R_{N_a−1} R_1.
5. By using the computed values of R_i, evaluate Ĵ′(µ_i) ≜ dĴ/dµ |_{µ=µ_i} = −2ℜ{trace[ (∂J/∂W^*)(R_i W_k) W_k^H R_i^H H_k ]}, i = 0, . . . , N_a.
6. Approximate Ĵ′(µ) by using a polynomial approximation or the DFT-based approximation, and find the zero-crossing values of the approximation.
7. Set the step size µ_k to a root corresponding to the desired minimum (or maximum) along the geodesic.

Table 3.4: Geodesic search algorithm on U(n).
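Step 4 of Table 3.4 — one matrix exponential, the rest obtained as powers — can be sketched as follows (assuming a skew-Hermitian H_k, with the exponential computed via the Hermitian eigendecomposition of jH_k; not thesis code):

```python
import numpy as np

def geodesic_samples(H, t_a, Na):
    """R_i = R(i*t_a/Na) = expm(-(i*t_a/Na)*H), i = 0..Na, computed from a
    single matrix exponential R_1 and its powers (step 4 of Table 3.4)."""
    w, V = np.linalg.eigh(1j * H)                 # 1j*H is Hermitian for skew-Hermitian H
    R1 = V @ np.diag(np.exp(1j * (t_a / Na) * w)) @ V.conj().T   # expm(-(t_a/Na)*H)
    R = [np.eye(H.shape[0], dtype=complex)]
    for _ in range(Na):
        R.append(R[-1] @ R1)                      # R_{i+1} = R_i R_1
    return R

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = (A - A.conj().T) / 2                          # skew-Hermitian test direction
R = geodesic_samples(H, t_a=2.0, Na=32)
```

Each R_i costs one matrix product instead of one matrix exponential, which is the source of the method's efficiency.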

The periodicity of functions on SO(2) was discussed and exploited in the ICA context in [72]. This property was later treated extensively in [160, 161] for SO(2) and SO(3). A one-dimensional Newton method which uses a first-order Fourier approximation of the cost function along geodesics on SO(n) is proposed in [161]. The main difference between the proposed DFT-based method [Publication II] and the line search method in [160, 161] is that we choose to approximate the first-order derivative of the cost function along geodesics and find the corresponding zeros, instead of approximating the cost function itself and finding a local minimum as in [160, 161]. Another

difference is that the line search method in [160, 161] finds only one minimum, whereas the proposed method finds multiple local minima and selects the best one. For this reason, when used with an SD algorithm, the proposed DFT method leads to a performance comparable to that of the CG algorithm, as shown in [Publication II]. The method in [160, 161] exploits the periodicity of the cost function along geodesics, which appears only on SO(2) and SO(3) (and does not appear even on U(2) and U(3)). For n > 3 the accuracy of the approximation decreases, since the periodicity of the cost function is lost. Moreover, the proposed DFT-based approach uses multiple frequency components for the DFT in order to approximate the almost periodic derivative. Thus, a better spectral description of the almost periodic function is obtained. This result is demonstrated in [Publication II]. Furthermore, the proposed algorithm avoids computing second-order derivatives, unlike the method in [160, 161]. These derivatives are not always straightforward to compute, and they may be computationally expensive since large matrices (the Euclidean Hessian) may be involved. The two proposed line search methods exploit the almost periodicity of the cost function and its derivatives in a computationally efficient manner. Other approaches are also possible. The proposed DFT-based method opens multiple possibilities for finding better local minima. One interesting approach would be to store several local minima at one iteration and use them in subsequent iterations. A simulated annealing approach [77] may be employed in order to reduce the dimension of the search space and, at the same time, avoid convergence to weak local minima. Another approach for finding better local minima (or possibly the global minimum) is to take into account more eigenvalues of the argument of the matrix exponential. The proposed approaches use only the dominant eigenvalue (no eigenvectors are required).
By including several dominant eigenvalues, more precise information about the evolution of the almost periodic cost function over a wide step size range may be obtained. These aspects remain to be studied.

3.3 Discussion

In this chapter, the problem of optimization under a unitary matrix constraint has been addressed. Computationally efficient SD and CG algorithms along geodesics on U(n) have been proposed, together with two high-accuracy line search methods specially tailored to them. The algorithms proposed in this chapter are compared to other existing optimization algorithms (for details, see Chapter 2). The advantages and disadvantages of each algorithm are summarized in Table 3.5.


Euclidean SD with enforced constraint [113, 156, 157, 171, 172, 216]
  Benefits: ✔ easy to implement (just a few equations)
  Weaknesses: ✘ slow convergence ✘ computationally expensive ✘ expensive step size adaptation

Lagrange multipliers method [97]
  Benefits: ✔ a closed-form solution may exist for simple cost functions and low matrix dimension n
  Weaknesses: ✘ increases even more the dimension of the optimization problem (from 2n² to 3n²) ✘ often mathematically intractable

Extra-penalty method [205]
  Benefits: ✔ easy to implement (just a few equations)
  Weaknesses: ✘ very slow convergence ✘ low accuracy in satisfying the unitary constraint

Non-geodesic SD on U(n) [137, 149]
  Benefits: ✔ fast convergence (linear) ✔ reduces the dimension of the optimization problem (from 2n² to n²)
  Weaknesses: ✘ expensive local parametrization (projection of an arbitrary matrix) ✘ expensive line search (no properties to be exploited) ✘ less simple to implement (more equations are needed)

Geodesic SD on U(n) [Publication I], [Publication III], [Publication IV]
  Benefits: ✔ fast convergence (linear) ✔ reduces the dimension of the optimization problem (from 2n² to n²) ✔ efficient computation of the geodesics: exponential of a skew-Hermitian matrix ✔ efficient line search methods (due to the almost periodic behavior of the cost function along geodesics)
  Weaknesses: ✘ less simple to implement (more equations are needed)

Geodesic CG on U(n) [Publication II], [Publication V]
  Benefits: ✔ faster convergence (superlinear) ✔ reduces the dimension of the optimization problem (from 2n² to n²) ✔ efficient computation of geodesics and parallel transport ✔ efficient line search methods
  Weaknesses: ✘ less simple to implement (more equations are needed)

Geodesic and non-geodesic Newton algorithms on U(n) [137, 149, 155]
  Benefits: ✔ very fast convergence (quadratic)
  Weaknesses: ✘ computationally very expensive (of order O(n⁶)) ✘ may converge to undesired stationary points ✘ less simple to implement

Table 3.5: Comparison between different algorithms for optimization under unitary matrix constraint: the classical approaches operating in the Euclidean space vs. the differential-geometry approaches operating on U(n).

Chapter 4

Overview of blind equalization techniques for MIMO-OFDM systems

Multiple-Input Multiple-Output (MIMO) systems are a key technology for future high-rate wireless communication systems such as 3GPP Long-Term Evolution (LTE), IMT-2000, WiMAX and WLAN [1, 4, 8, 10]. By using multiple transmit and receive antennas, a linear increase in capacity may be achieved [36, 194]. Spatial multiplexing produces parallel data streams, resulting in high data rates. The transmitted streams are not necessarily orthogonal; therefore, the co-channel interference problem must be considered. Moreover, the MIMO channel is selective in time, frequency and space. Space-time and space-frequency codes [22, 111, 165, 193] may be used to increase the link reliability, especially in the absence of channel state information (CSI) at the transmitter [95]. They provide a good balance between the multiplexing gain and the diversity gain. When MIMO techniques are combined with spectrally efficient Orthogonal Frequency Division Multiplexing (OFDM) modulation [106, 200], the resulting MIMO-OFDM systems are very robust to multipath propagation. Typically, a cyclic prefix (CP) is employed in OFDM transmission. In this way, the broadband frequency-selective channel is converted into multiple orthogonal flat-fading channels. Moreover, OFDM modulation enables multiuser access schemes by allocating distinct subcarriers to different users; the resulting system is called OFDMA (Orthogonal Frequency Division Multiple Access). Channel estimation in MIMO systems [70, 126] is a difficult problem due to the fact that the number of unknown channel parameters grows rapidly with the number of transmit/receive antennas. Consequently, pilot-aided channel estimation methods require a very large amount of training data, which decreases the effective data rates. Blind techniques [128] may be used

to improve the effective data rates by exploiting the statistical and/or structural properties of the transmitted signals. They are very suitable in the case of continuous transmissions (e.g., DVB-T) or slowly time-varying channels (e.g., ADSL). Blind algorithms are subject to inherent ambiguities (e.g., amplitude, phase and permutation indeterminacies). A small amount of training data may be used in order to remove the ambiguities; this amount is much smaller than what is needed for purely training-based methods. The resulting semi-blind methods may improve the convergence speed and tracking capability of the blind methods. They use both the received symbols and the training data. Training-based channel estimation methods need to wait until the next pilot is received, and the training sequence may be distorted by the channel in such a way that it is not recognized at the receiver. In conclusion, semi-blind algorithms are more feasible in practice, and the core of any semi-blind method is a blind method. In this chapter we focus on blind methods for channel estimation and equalization in MIMO-OFDM systems using P transmit antennas and Q receive antennas. Most of the methods considered here are intended for spatial multiplexing scenarios. The transmitted data streams are mutually independent and correspond to different users or to multiple streams from a single user. Single- and multi-user SIMO cases are also considered. We classify the corresponding channel estimation and equalization algorithms into three main categories. The first two categories include algorithms exploiting statistical properties of signals and matrices, i.e., Second-Order Statistics (SOS) and Higher-Order Statistics (HOS) approaches. The third category includes deterministic algorithms that exploit structural properties of signals and matrices (or hybrid structural-statistical algorithms).
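As background for the methods below, the per-subcarrier flat-fading model created by the CP (mentioned in the introduction) can be verified in a few lines; a sketch with assumed sizes (N = 64 tones, a 4-tap channel, CP of length 8; not code from the cited works):

```python
import numpy as np

# With a CP at least as long as the channel memory, the frequency-selective
# channel becomes a set of flat per-subcarrier gains: Y_k = H_k * X_k.
rng = np.random.default_rng(0)
N, L, cp = 64, 4, 8                       # subcarriers, channel taps, CP length
h = rng.standard_normal(L) + 1j * rng.standard_normal(L)
X = np.exp(2j * np.pi * rng.integers(0, 4, N) / 4)   # QPSK symbols on N tones

x = np.fft.ifft(X)                        # OFDM modulation
tx = np.concatenate([x[-cp:], x])         # prepend cyclic prefix
rx = np.convolve(tx, h)[cp:cp + N]        # linear channel, then drop the CP

Y = np.fft.fft(rx)                        # OFDM demodulation
H = np.fft.fft(h, N)                      # channel frequency response
print(np.allclose(Y, H * X))              # True: one flat gain per subcarrier
```

The CP makes the linear convolution look circular over the retained samples, which the DFT diagonalizes.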

4.1 Second-Order Statistics (SOS) based methods

In general, SOS-based blind methods exploit the correlation properties of the received signals. They belong to two main classes. The first class includes SOCS (Second-Order Cyclostationary Statistics) based methods that rely on different correlation functions. The second class includes statistical subspace methods that exploit the output covariance matrix or the received data matrix.

4.1.1 SOCS based methods

Many man-made signals encountered in communications possess intrinsic periodicities caused, for example, by modulation and coding. Their statistics, such as the mean or the autocorrelation, are periodic functions. Conventional WSS (wide-sense stationary) models do not take into account the valuable information contained in this periodicity. The property is called cyclostationarity and may be exploited in blind algorithms. Typically, the symbol-rate received signals are WSS, but by taking multiple samples within the symbol interval, additional information is obtained. This can be done either by oversampling in the time domain, or in the spatial domain by employing several symbol-rate receivers (an antenna array). Cyclostationarity may also be induced at the sampling rate by shaping the signal statistics at the transmitter, but this may require redundancy. OFDM signals possess special correlation properties. The cyclic prefix (CP), or zero-padding (ZP), induces correlation in the transmitted signals which may be exploited. Blind channel estimation methods for MIMO-OFDM systems based on the SOCS induced by the CP have been proposed in the literature [29, 67, 68, 134]. Channel correlation properties due to the Fourier transform have also been exploited in [90]. Special correlation properties may also be introduced by precoding or space-time coding [37, 221, 222]. In addition, the special signal structure helps in solving the ambiguities inherent to all blind methods. In [37], a scheme for resolving the multi-dimensional ambiguity up to a diagonal complex matrix is presented. The SOCS methods in [29, 37] are immune to common channel zeros.
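A minimal numeric illustration (not from the cited works) of the CP-induced correlation that SOCS methods exploit: an OFDM block correlates with itself at lag N, because the CP copies the block tail.

```python
import numpy as np

rng = np.random.default_rng(0)
N, cp, blocks = 64, 16, 2000
acc = 0.0
for _ in range(blocks):
    X = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # one OFDM block
    x = np.fft.ifft(X)
    s = np.concatenate([x[-cp:], x])                 # prepend cyclic prefix
    acc += np.mean(s[:cp] * np.conj(s[N:N + cp]))    # correlation at lag N
corr = acc / blocks
print(corr)   # clearly nonzero (roughly the per-sample power), unlike a WSS model
```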

4.1.2 Statistical subspace methods

Another alternative for inducing structure in the transmitted signals is to insert zero guard bands at the end of each OFDM block, i.e., the so-called zero-padding (ZP) [226, 228]. Virtual subcarriers (VSC) [28, 180] may also be used; these are unmodulated subcarriers at known frequencies in the spectrum, usually in the roll-off region. Statistical subspace methods for blind channel estimation in MIMO-OFDM have also been proposed in the literature [28, 84, 91–94, 129, 180, 226, 228]. They rely on a low-rank model where the signal subspace is associated with the range space of the channel matrix. The signal and noise subspaces are obtained either via eigendecomposition of the sample estimate of the covariance matrix, or via singular value decomposition of the received data matrix [110]. Receive diversity plays an important role in building the low-rank model. Blind identification algorithms for SIMO-OFDM have been considered in [23, 48, 96]. The method in [23] is sensitive to common channel zeros. Unlike most of the subspace methods, in [91, 92, 96] the model uses the covariance matrix computed in the frequency domain; the advantage of this approach is its resilience to common channel zeros. Identifiability conditions for the subspace methods applied to OFDM have been formulated in [180, 227]. Statistical subspace methods are applied to OFDM signals with CP [84, 93, 94] or ZP [226, 228] in MIMO scenarios. It has been shown in [93, 94] that algorithms designed for ZP-OFDM transmission [226, 228] may be adapted to CP-OFDM transmission simply by rewriting the MIMO system model appropriately. In this way, they become compatible with most of the existing OFDM standards [2–7, 9], which use CP instead of ZP. In [67, 68] a subspace method exploiting the

CP is proposed to initialize an iterative CMA. Subspace-based methods exploiting virtual subcarriers (VSC) have been proposed in [23, 28, 180]; they do not need a CP, as long as VSC are used. Space-time codes have been used in conjunction with subspace methods in [222, 226]. In general, subspace methods are able to estimate the MIMO channel up to a full-rank complex P × P ambiguity matrix [84, 91, 92, 226, 228]. The ambiguity is removed by exploiting the signal structure induced at the transmitter via precoding or space-time coding, or by using a small amount of training data, i.e., P symbols within one OFDM block of length N. Due to the fact that in practice P ≪ N, semi-blind subspace methods still achieve increased effective data rates. An extensive review of semi-blind channel estimation methods for MIMO-OFDM systems is provided in [168]. An efficient HOS-based blind approach for solving the ambiguities remaining after blind subspace identification was proposed in [202]. By exploiting the independence between the in-phase and quadrature components of the complex-valued signal, the algorithm is able to reduce the remaining full-rank ambiguity matrix to a diagonal matrix, whose diagonal elements correspond to complex scalar gains multiplying each of the (possibly permuted) data streams. Most of the subspace methods involve eigendecomposition or SVD operations. In general, such matrix decompositions are computationally expensive, especially when the matrix dimensions are large [99]. Complexity reduction for the subspace methods may often be achieved. The subspace algorithm in [228], which requires both an eigendecomposition and an SVD, has been reconsidered in [Publication III]: the SVD of a large tall matrix was replaced by an eigendecomposition of a small square matrix. The eigendecomposition is obtained iteratively by using a Riemannian optimization technique (steepest descent on the unitary group U(n)).
A more efficient solution would be to optimize the Brockett function [41, 184] by using the Riemannian conjugate gradient algorithm, as in [Publication II].
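The low-rank model underlying these subspace methods can be sketched for a single tone (an assumed flat P × Q mixture with BPSK streams; not code from the cited works):

```python
import numpy as np

# With Q > P receive antennas, the noise subspace (eigenvectors of the Q-P
# smallest eigenvalues of the output covariance) is orthogonal to the columns
# of the channel matrix -- the range space identified by subspace methods.
rng = np.random.default_rng(0)
P, Q, T = 2, 4, 8000
Hch = rng.standard_normal((Q, P)) + 1j * rng.standard_normal((Q, P))
S = (2 * rng.integers(0, 2, (P, T)) - 1).astype(complex)           # BPSK streams
noise = 0.05 * (rng.standard_normal((Q, T)) + 1j * rng.standard_normal((Q, T)))
Y = Hch @ S + noise

Ryy = Y @ Y.conj().T / T                     # sample covariance of the output
eigval, eigvec = np.linalg.eigh(Ryy)         # ascending eigenvalues
Un = eigvec[:, :Q - P]                       # estimated noise subspace
print(np.abs(Un.conj().T @ Hch).max())       # ~0: channel lies in the signal subspace
```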

4.2 Higher-Order Statistics (HOS) based methods

Typically, HOS-based blind methods need a larger sample support than SOS-based blind methods, because HOS-based estimators have higher variances. On the other hand, HOS methods may potentially have increased noise immunity, since the higher-order statistics of Gaussian noise vanish.

4.2.1 BSS methods

Most of the HOS-based blind channel estimation methods proposed in the literature [55, 102, 115, 173–177] use BSS (Blind Source Separation) principles [71] in order to separate the transmitted signals. Commonly used

Independent Component Analysis (ICA) methods [113], such as the natural gradient [25] and JADE [45], are applied. They rely on the mutual statistical independence and non-Gaussianity of the transmitted signals. Therefore, due to the IDFT operation at the transmitter side, they operate only in the frequency domain. Assuming a CP in the OFDM transmission, the MIMO channel is regarded as a set of per-tone instantaneous mixing matrices. The size of each mixing matrix is identical to the dimensionality of the MIMO channel, i.e., P × Q. Consequently, the number of mixing matrices is equal to the number of tones, which leads to computationally expensive algorithms. In order to avoid this inconvenience, algorithms which exploit the correlation among subcarriers in the frequency domain have been proposed in [175, 177]. The correlation across subcarriers depends on the coherence bandwidth and the inter-carrier spacing; these figures are directly related to the channel delay spread and the number of subcarriers. The channel length is considerably smaller than the IDFT length, and the channel frequency response is the result of the IDFT of a zero-padded channel impulse response [169]. Consequently, high correlation among the channel coefficients corresponding to adjacent subcarriers is introduced. In conclusion, it is sufficient to obtain the channel frequency response by using ICA on a number of frequency bins equal to the maximum channel length (the CP length may be used as an upper bound); for the remaining tones, the channel response is obtained by interpolation. Other algorithms [115, 174] apply JADE to only one reference subcarrier, and the other subcarriers are unmixed iteratively by using a linear MMSE receiver. This may lead to error propagation across subcarriers. An additional successive interference cancellation (SIC) technique may be involved to improve the performance [174]. The SIC approach has also been used in [175], combined with a layered space-time architecture (V-BLAST) [210].
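The interpolation argument can be made concrete: an L-tap channel's frequency response over N tones is fully determined by L of its samples. A sketch with assumed sizes (N = 64, L = 4; not code from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 64, 4
h = rng.standard_normal(L) + 1j * rng.standard_normal(L)   # L-tap channel
H = np.fft.fft(h, N)                                       # response on all N tones

bins = np.arange(0, N, N // L)                  # L evenly spaced reference tones
F = np.exp(-2j * np.pi * np.outer(bins, np.arange(L)) / N) # DFT rows at those tones
h_hat = np.linalg.solve(F, H[bins])             # recover the taps from L tone values

print(np.allclose(np.fft.fft(h_hat, N), H))     # True: remaining tones interpolated
```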
A blind source separation approach based on a natural gradient learning algorithm is developed in [102]. The algorithm in [55] uses ICA and fractional sampling. Consequently, the noise resilience is improved (more decision variables are available). On the other hand, the complexity is increased even further in comparison to the other ICA-based methods, by a factor equal to the inverse of the oversampling rate. The major problem of the BSS methods for MIMO-OFDM systems considered above is that they operate in the frequency domain. This leads to several drawbacks. First, the computational complexity increases linearly with the number of subcarriers, so complexity reduction techniques become compulsory even for a relatively small number of subcarriers. Second, the ambiguity problem is very hard to resolve: on each subcarrier, there is an unknown complex scalar which multiplies the channel frequency response. For this reason, convolutional coding [115, 173] may be required in order to introduce redundancy, thus decreasing the effective data rate. Non-redundant linear precoding of each transmitted data stream has also been proposed [174, 175]. These techniques introduce a known correlation structure between subcarriers

which may be exploited at the receiver. Special constellation properties have also been considered [55]. Most of the blind algorithms considered in this section are based on minimizing the JADE criterion [45]. Complexity reduction for JADE may be achieved by using the Riemannian conjugate gradient as in [Publication II]. This problem has also been addressed in [Publication I], [Publication IV] and [Publication V]. The reduction in complexity is considerable if multiple JADE algorithms are employed in parallel, as in the blind algorithms considered in this section, which operate on a subcarrier basis. Moreover, when the dimensions of the mixing matrix are large, the pairwise processing used in the diagonalization stage of JADE leads to slow convergence [Publication II], [Publication IV], [Publication V]. The same problem occurs when the input data streams have different distributions, i.e., they belong to different constellations [Publication II].

4.3 Structural properties based methods

Typically, statistical blind methods require a large sample size in order to provide unbiased channel estimates. Apart from statistical properties, signals and matrices may possess special structural properties that may be exploited in blind algorithms. These properties arise either from the modulation scheme or from different matrix structures employed in the description of the system model, and they appear even for very small sample sizes. For this reason, in some cases (e.g., noise-free scenarios or constant-modulus constellations) the channel estimation can be achieved even from a single received data block, in a deterministic manner. In general, deterministic methods outperform statistical methods for a small number of received data blocks. By using more received data blocks, the robustness to noise and other imperfections is increased; in this way, some deterministic methods are converted into hybrid structural-statistical methods. Such hybrid methods require fewer observations than purely statistical methods. Reducing the size of the data records enables good channel tracking capabilities. OFDM modulation and the MIMO channel determine special signal and/or matrix structures that may be useful in blind channel estimation. In this section, the structural blind methods for multiple-antenna OFDM systems are classified into three main categories. The first one exploits the structure of the transmitted signals arising from the known modulation constellation [30, 67, 68, 114, 126, 159, 190, 232]. The second category includes algorithms that use the OFDM guard bands [31, 49, 119]. The third category includes blind algorithms exploiting special matrix structures arising from the data model [49, 100, 126, 166, 204].

4.3.1 Modulation properties

Blind algorithms that rely on the properties of the modulation scheme have been proposed in the literature [114, 159, 232]. The main properties that are exploited are the finite alphabet, constant envelope, or constant block energy of the transmitted signals. In addition to these properties, some of the methods exploit the receive diversity which is achieved by oversampling in time or space.

Finite-alphabet methods. Digitally modulated communications signals have a finite-alphabet (FA) structure, i.e., the transmitted symbols belong to a finite set of amplitudes and phases. Blind methods using the FA property match the received signals to the unknown channel taps and project the soft-estimated symbols onto the constellation set. Least squares methods that estimate the channel coefficients and the transmitted symbols alternately have been proposed [126]. The FA property was first applied to OFDM in [232], in the single-antenna case. The corresponding deterministic blind algorithm is able to identify the channel from a single OFDM block when PSK constellations are used, under high-SNR conditions. The remaining phase ambiguity may be easily resolved, since it belongs to a finite set. The algorithm is computationally expensive due to the fact that it operates on each tone; it requires an exhaustive search over a space which grows exponentially with the number of active subcarriers. A sub-optimal version of the FA method is proposed in [232], but the dimension of the search space is still exponential in the number of channel taps. An improved version was proposed in [159]. The method dramatically reduces the computational complexity because it operates on clusters of subcarriers (within the same coherence bandwidth); for this reason, the method is sensitive to the choice of the clusters. In order to overcome this difficulty, turbo-decoding was used in conjunction with the FA property in [159]. The drawback of the method is that channel coding requires redundancy.
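The FA projection step can be sketched in a few lines (assumed QPSK alphabet; not code from the cited works):

```python
import numpy as np

# Project soft symbol estimates onto the nearest point of a finite alphabet
# (here QPSK), as done in the FA methods' symbol-estimation step.
qpsk = np.exp(1j * np.pi / 4) * np.array([1, 1j, -1, -1j])   # assumed QPSK alphabet

def project_fa(soft):
    d = np.abs(soft[..., None] - qpsk)      # distance to each constellation point
    return qpsk[np.argmin(d, axis=-1)]      # hard decision

s = qpsk[[0, 3, 1]]                               # true symbols
soft = s + 0.2 * np.array([1 + 1j, -1j, 0.5])     # noisy soft estimates
print(np.allclose(project_fa(soft), s))           # True
```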
Other sub-optimal approaches for reducing the complexity of the FA method were proposed in [114]. Receive diversity may also be employed in order to improve the performance. A deterministic ML (maximum likelihood) blind method for SIMO-OFDM exploiting the FA property was considered in [30]. The method exploits the receive diversity and, in the absence of noise, can achieve perfect channel estimation using a single received OFDM block. The algorithm exhibits very high complexity. The ML method can be decoupled into two separate LS problems, involving the channel coefficients and the transmitted symbols, respectively. Exhaustive search over a high-dimensional space is still needed for the symbol estimation part. A comprehensive review of different FA methods up to the year 2000 may be found in [126, Sec. 4.3]. For QAM constellations, FA

methods suffer from error floor effect and high variance of the estimates. Moreover, for higher-order constellations they require HOS of the received signals, and therefore longer data records. Constant modulus property An iterative algorithm exploiting the constant envelope characteristic of the transmitted data symbols was proposed for SIMO-OFDM in [190]. The algorithm is based on least squares CMA (Constant Modulus Algorithm) that takes additional benefit from the receive diversity. In [67, 68] constant modulus algorithm was employed for channel equalization on MIMO-OFDM systems. The initialization is made by using the estimates provided by a subspace method. After initialization the CMA works in an adaptive fashion. The constant modulus property has also been used to resolve the multiple scaling ambiguities of blind algorithms operating in frequency domain [102]. Constant mean-block energy property Iterative methods minimizing different criteria have been proposed for OFDM. A blind equalizer based on restoring the constant mean block energy property [218] of the received OFDM data blocks was proposed for singleantenna case in [119]. The algorithm is called VCMA (Vector Constant Modulus Algorithm). In addition, the structure of the CP and ZP guard bands are exploited. A blind equalizer for MIMO OFDM systems using VCMA and decorrelation criteria was introduced in [Publication VI]. The algorithm has been modified in order to take into account the correlation introduced when using CP in [13]. A block-Toeplitz structure of the equalizer is enforced by averaging along diagonals. Other approaches enforcing the Toeplitz structure are considered in [126]. The VCMA algorithm proposed in [Publication VI] and its improved final version in [Publication VII] will be discussed in detail in Chapter 5. The VCMA-based algorithms can also be included in the class of HOS methods, since the corresponding criteria use fourth order moments. 
We have included them in the class of structural methods because they attempt to restore the constant mean block energy property and do not use sample averaging, which would decrease the convergence speed. In practice, they are implemented in an adaptive fashion and the expected values are replaced by instantaneous estimates.
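As a concrete illustration of the FA projection step described above, the sketch below snaps soft symbol estimates onto a QPSK alphabet. The alphabet choice, function name, and test values are illustrative assumptions, not part of any cited method.

```python
import numpy as np

# Hedged sketch: the finite-alphabet (FA) property lets a receiver project
# soft symbol estimates onto the nearest constellation point.
# A unit-energy QPSK alphabet is assumed here for illustration.
ALPHABET = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def fa_project(soft_symbols):
    """Map each soft estimate to the closest point of the finite alphabet."""
    soft = np.asarray(soft_symbols).ravel()
    # Distance from every soft estimate to every constellation point
    dist = np.abs(soft[:, None] - ALPHABET[None, :])
    return ALPHABET[np.argmin(dist, axis=1)]

# Mildly noisy soft estimates snap back onto the constellation set
noisy = ALPHABET + 0.1 * (np.array([1, -1, 1, -1]) + 1j * np.array([-1, 1, 1, -1]))
hard = fa_project(noisy)
```

Since the perturbations are much smaller than half the minimum constellation distance, the hard decisions recover the original points exactly.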

4.3.2 Properties of the guard interval of the OFDM signal (CP or ZP)

These methods rely on the guard intervals used in the OFDM transmission, such as the CP or ZP [119]. In addition, they may exploit the receive diversity. A blind beamformer for SIMO-OFDM systems exploiting receive antenna diversity and the temporal redundancy induced by the cyclic prefix was proposed in [31]. A criterion which penalizes the MSE between the CP samples and the corresponding data samples within the received OFDM block is employed. In [49], a deterministic LS blind approach using receive diversity and CP similarity is proposed. The algorithm is able to estimate the SIMO channel by using a single received OFDM block. The channel response and the array response are incorporated into a global mixing matrix, whose dimensions depend on the channel order. Therefore, the channel order must be known precisely. It is estimated by using the SVD of the data matrix. The channel is found by using the SVD of a difference matrix which penalizes the difference between the CP samples and the corresponding data samples. The ambiguity is also resolved based on the CP redundancy.
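The CP-similarity idea behind criteria such as the one in [31] can be sketched numerically: in a well-synchronized, undistorted block the cyclic prefix replicates the last samples of the block, so their mean-squared difference is a usable blind penalty. The block layout and names below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the CP-similarity criterion: in a received OFDM block,
# the first n_cp samples (the cyclic prefix) copy the last n_cp samples,
# so their MSE vanishes for an undistorted, synchronized block.
def cp_mse(block, n_cp):
    """MSE between the CP and the data samples it duplicates."""
    cp = block[:n_cp]        # cyclic prefix at the head of the block
    tail = block[-n_cp:]     # last n_cp data samples of the block
    return np.mean(np.abs(cp - tail) ** 2)

rng = np.random.default_rng(0)
data = rng.standard_normal(64) + 1j * rng.standard_normal(64)
block = np.concatenate([data[-16:], data])   # prepend a CP of length 16
```

An undistorted block yields a zero penalty, while any channel distortion of the block raises it, which is what makes the MSE usable as a blind criterion.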

4.3.3 Exploiting special matrix structures

Special matrix structures arising from the SIMO and MIMO models may be exploited in blind methods [199].

Deterministic subspace methods

The full column rank property of the channel matrix is a prerequisite for subspace-based system identification. Therefore, the model must involve a tall channel matrix (more rows than columns). This is a necessary condition, but not a sufficient one. If the full column rank condition is not met, several received data blocks may be stacked on top of each other. Receive diversity is also a means of building a low-rank model. In spatial multiplexing scenarios, estimating the desired basis of the subspaces is not sufficient for separating the transmitted data streams. Thus, a second constraint must be employed. The finite-alphabet and constant-modulus properties, or specific Hankel and Toeplitz matrix structures, may be used in conjunction with the subspace methods [126]. Enforcing special matrix structures satisfies the second constraint needed for the estimated subspace. In this way, by properly combining the estimated basis vectors, the co-channel signal cancellation may be achieved. The cyclic prefix or zero-padding may also be exploited [49]. In some cases [49, 100, 204] the projection matrix onto the noise subspace is deterministic and known at the receiver, and the channel estimation may be accomplished after the first received OFDM block. In [204], a blind SIMO channel identification algorithm based on receive diversity was proposed. Special properties arising from the unitary FFT matrix used in the OFDM transmission were exploited. The SIMO channel estimation problem reduces to the usual problem of finding the minimal eigenvector of a Hermitian matrix (or the minimal singular vector of a tall matrix). The algorithm may be sensitive to common zeros of the subchannels. Moreover, in [204] the channel length is assumed to be known a priori. Special structure in the signals may be induced at the transmitter by using a redundant precoding scheme. In [100], a deterministic blind equalization algorithm for SIMO-OFDM is proposed. The algorithm exploits the receive antenna diversity and the structure imposed by frequency-domain spreading. Therefore, the transmission system can be viewed as a multicarrier CDMA system. In the absence of noise, the equalization can be achieved in a single OFDM block. A regularization approach is used in order to cope with the problem of common zeros. The redundancy introduced by spreading improves the system reliability. It is shown that the proposed redundant scheme outperforms the uncoded scheme: the same data rate is achieved at a given SNR with a lower bit error rate.

Other algebraic techniques

A deterministic blind channel identification method for MIMO-OFDM has been recently proposed in [166]. It is based on an algebraic technique which decomposes the received signal vector into a four-way tensor whose dimensions are space, time, and frequency. The method exhibits high complexity, due to the fact that it uses large multi-dimensional matrices. In addition, it is difficult to prove whether the identifiability conditions are met in practice.
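The core linear-algebra step mentioned above for methods such as [204], extracting the minimal eigenvector of a Hermitian matrix, can be sketched with a standard eigendecomposition. The matrix below is synthetic, not a real data covariance.

```python
import numpy as np

# Hedged sketch of the subspace-method workhorse: the channel estimate is the
# eigenvector of a Hermitian matrix associated with its smallest eigenvalue.
# Q here is a synthetic Hermitian positive semi-definite test matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
Q = A @ A.conj().T

# numpy's eigh returns eigenvalues in ascending order, so column 0 of the
# eigenvector matrix is the minimizer of the quadratic form.
eigvals, eigvecs = np.linalg.eigh(Q)
h_est = eigvecs[:, 0]            # minimal eigenvector (estimate up to scale/phase)

# It minimizes x^H Q x over unit-norm vectors x
min_quad = np.real(h_est.conj() @ Q @ h_est)
```

The unit-norm minimal eigenvector attains the smallest eigenvalue as the value of the quadratic form, which is exactly the property these deterministic subspace methods exploit.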

4.4 Discussion

In this chapter, the most relevant blind channel estimation and equalization methods for multi-antenna OFDM systems have been reviewed. The main focus was on blind methods applicable to SIMO and MIMO systems. The methods were classified into three main classes. The first two include statistical methods (SOS and HOS), and the third one includes methods exploiting structural properties of signals and matrices. In Table 4.1, a comparison of the three classes of methods is given. The pros and cons of each class of methods are considered, as well as the computational complexity. Both statistical and structural methods possess benefits and drawbacks. HOS-based methods require large sample support in order to provide unbiased estimates. In general, SOS-based methods outperform the HOS-based methods for the same amount of received data. Subspace-based methods require computationally expensive matrix decompositions. This is also valid for some of the structural methods. Moreover, some structural methods such as FA are of mainly theoretical interest, since their computational complexity is unaffordable in practice. Finally, most of the methods require precoding or shaping the space-time signals in order to cope with the channel common zeros. In conclusion, exploiting the structural properties of matrices and signals, and at the same time taking advantage of their statistics, may result in fast and computationally efficient algorithms.

SOS methods: SOCS [29, 37, 67, 68, 134, 221, 222]
  Benefits: ✔ require less received data compared to HOS methods ✔ computationally simple
  Weaknesses: ✘ may require additional precoding or oversampling ✘ may be unable to identify nonminimum-phase channels ✘ sensitive to common zeros unless precoding is involved ✘ require CP/ZP

subspace based [23, 28, 84, 91–94, 96, 129, 180, 226, 228]
  Benefits: ✔ require less received data compared to HOS methods ✔ frequency domain models are immune to common zeros
  Weaknesses: ✘ require expensive matrix decompositions ✘ frequency domain models are computationally complex ✘ require CP/ZP and/or VSC

HOS methods: BSS [55, 102, 115, 173–177]
  Benefits: ✔ robust to Gaussian noise ✔ robust to channel zeros
  Weaknesses: ✘ high variance ✘ require more received data compared to SOS methods ✘ high complexity for large number of subcarriers ✘ ambiguities are harder to resolve ✘ require CP/ZP

Structural methods: Modulation properties (FA, CMA, VCMA) [30, 67, 68, 102, 190]
  Benefits: ✔ FA may achieve estimation in a single OFDM block (high SNR and CM constellations) ✔ CMA and VCMA are adaptive ✔ for VCMA, the complexity does not grow with the number of subcarriers ✔ for VCMA, ambiguities are easier to resolve
  Weaknesses: ✘ FA methods are extremely complex ✘ for FA and CMA, ambiguities are harder to resolve ✘ FA and CMA exhibit high complexity for large number of subcarriers ✘ FA and CMA require CP/ZP ✘ VCMA is sensitive to common zeros ✘ CMA and VCMA may converge only locally

CP/ZP structure [31, 49, 119]
  Benefits: ✔ estimation may be achieved in a single OFDM block (high SNR)
  Weaknesses: ✘ ambiguities are harder to resolve ✘ require CP/ZP

Special matrix structures [49, 100, 166, 204]
  Benefits: ✔ estimation may be achieved in a single OFDM block (high SNR)
  Weaknesses: ✘ ambiguities are harder to resolve ✘ require expensive matrix decompositions ✘ require CP or precoding

Table 4.1: Different algorithms for blind channel estimation and equalization for SIMO/MIMO-OFDM systems: the statistical methods (SOS, HOS) and the methods exploiting structural properties.


Chapter 5

Blind equalizer for MIMO-OFDM systems based on vector CMA and decorrelation criteria

In this chapter, the problem of blind recovery of multiple OFDM data streams in a MIMO system is addressed. We propose an equalization algorithm for MIMO-OFDM receivers which optimizes a composite criterion in order to cancel both the ISI and the CCI. The ISI is minimized by using a modified Vector Constant Modulus criterion, while the CCI is minimized by using a decorrelation criterion. The composite criterion was introduced in [Publication VI]. The algorithm was subsequently improved in [12, 13] and [Publication VII]. The convergence properties of the algorithm have also been established. Conditions for the existence of stable minima corresponding to the zero-forcing receiver, which performs joint blind equalization and co-channel signal cancellation, are established in [Publication VII]. The proposed blind algorithm operates in the time domain, before the DFT operation at the receiver. Therefore, it is designed to deal with three different cases: there is no CP at all, the CP is too short (compared to the channel impulse response), or the CP is sufficiently long. The CP may be used for synchronization purposes, hence it is included in the algorithm derivation. However, it is not needed in finding the equalizer. The proposed blind algorithm exploits the mutual statistical independence among the transmitted data streams and the constant mean block energy property of the OFDM signals. Hence, it is applicable to spatial multiplexing systems. The VCMA criterion [217] penalizes the deviation of the block energy from a dispersion constant. The VCMA cost function may be decomposed into a constant modulus (CM) cost function [98] and an auto-correlation function of the squared magnitudes of the received signal [195]. Therefore, the original VCMA is not suitable for signals which have a periodic correlation, such as the OFDM signal. When the CP is used, a strong auto-correlation is introduced in the transmitted signals. It may be stronger than the correlation caused by the multipath propagation channel. Consequently, the performance of the original VCMA then degrades significantly, because it penalizes the correlation induced by the CP [12, 13]. The modified VCMA proposed in [Publication VII] is designed to deal with the auto-correlation caused by the CP and to cancel the ISI simultaneously. The VCMA was applied to blind equalization for shaped constellations in [217] and to a Single-Input Single-Output (SISO) OFDM system in [119]. In [119], the CP or ZP was required in order to perform the equalization. MIMO schemes have been considered in [132, 158] using the classical CMA for BPSK signals. VCMA was employed in the context of DS-CDMA systems in [206]. In a spatial multiplexing MIMO scenario, the problem becomes more difficult. At one receive antenna we have the desired signal with its delayed replicas caused by the channel ISI, in addition to the co-channel signals, i.e., the CCI with their delayed replicas. In order to perform both the blind equalization and the signal separation, an output decorrelation criterion is needed. This criterion assumes that the transmitted data streams are mutually independent. Hence, it is suitable for spatial multiplexing systems. It penalizes the correlation among the equalized outputs. Consequently, we come up with a cost function comprised of two criteria: a modified VCMA criterion and a decorrelation criterion. The proposed algorithm is presented in detail in this chapter. First, the system model is given in Section 5.1. The blind equalizer is presented in Section 5.2.

5.1 System model for spatial multiplexing MIMO-OFDM system

We consider a MIMO-OFDM system with P transmit and Q receive antennas (Figure 5.1). We assume a spatial multiplexing scenario, where independent OFDM data streams are launched from each antenna. Each data stream consists of i.i.d. complex symbols modulated by M subcarriers. Multi-user SIMO systems have a similar model. We use a block formulation similar to the one in [207]. The sample index is denoted by (·) and the block index by [·]. Consider the complex symbols from the pth data stream stacked in an M × 1 vector s_p[k] = [s_p(kM), ..., s_p(kM − M + 1)]^T. The N × 1 transmitted OFDM block of the pth data stream can be written as

$\tilde{u}_p[k] = T_{\mathrm{CP}} F s_p[k],$   (5.1)

where F is the M × M normalized IDFT matrix and T_CP is the N × M cyclic prefix addition matrix.

Figure 5.1: MIMO-OFDM system model

The sequence of L + 1 consecutive transmitted OFDM samples corresponding to antenna p is denoted by u_p(k) = [u_p(k), ..., u_p(k − L)]^T. The MIMO channel branches from the pth transmit to the qth receive antenna (see Figure 5.1) have maximum order L_c and are characterized by the impulse responses c_pq = [c_pq(0), ..., c_pq(L_c)]. Stacking the vectors u_p(k) corresponding to the P transmitted data streams in a vector u(k) = [u_1^T(k), ..., u_P^T(k)]^T, the L − L_c + 1 consecutive samples received at antenna q are

$y_q(k) = [C_{1q} \; \cdots \; C_{Pq}]\, u(k) + v_q(k), \quad q = 1, \ldots, Q,$   (5.2)

where C_pq are (L − L_c + 1) × (L + 1) Sylvester convolution matrices containing the channel coefficients c_pq, and v_q(k) is the additive white Gaussian noise at the qth receive antenna. Consider the MIMO channel matrix C̄ whose blocks (p, q) are the matrices C_pq, with p = 1, ..., P and q = 1, ..., Q. The Q(L − L_c + 1) × 1 array output y(k) = [y_1^T(k), ..., y_Q^T(k)]^T may be written as

$y(k) = \bar{C} u(k) + v(k),$   (5.3)

where v(k) = [v_1^T(k), ..., v_Q^T(k)]^T. The adaptive equalizers have order L_e and are row vectors denoted by e_qp[k] = [e_qp(0), ..., e_qp(L_e)]. The minimum equalizer order is chosen according to the identifiability conditions presented in [Publication VII]. In order to recover the P transmitted data streams, a bank of P equalizers

$e_p[k] = \big[e_{1p}[k], \ldots, e_{Qp}[k]\big], \quad p = 1, \ldots, P,$   (5.4)

is used at each receive antenna (see Figure 5.1). By choosing L = L_c + L_e, the equalized sample corresponding to the pth data stream can be written as

$z_p(k) = e_p[k]\, y(k).$   (5.5)

Considering the 1 × (L + 1) global channel-equalizer impulse response (GCEIR) a_p = e_p[k]C̄ corresponding to the pth data stream, the equalized sample from this data stream may be written as

$z_p(k) = a_p u(k).$   (5.6)

If the equalization is achieved, the GCEIRs are equal to a standard unit vector multiplied by an unknown phase rotation, i.e., a_p = δ_i e^{jθ}, where p = 1, ..., P, i ∈ {0, 1, ..., L} and θ ∈ [−π, π). A permutation of the equalized data streams may also be encountered. The phase and permutation ambiguities are inherent to all blind methods. The adaptive equalizer corresponding to each recovered data stream operates in a block mode and outputs a block of N samples z_p[k] = [z_p(kN), ..., z_p(kN − N + 1)]^T. The equalized block corresponding to the pth data stream may be written as

$z_p[k] = \sum_{q=1}^{Q} E_{qp}[k]\, \tilde{y}_q[k],$   (5.7)

where ỹ_q[k] = [y_q(kN), ..., y_q(kN − N − L_e + 1)]^T and E_qp[k] are N × (N + L_e) Sylvester convolution matrices built with the coefficients e_qp[k].
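A minimal numerical sketch of the transmit model (5.1), with illustrative sizes: the OFDM block is the normalized IDFT of the symbol vector followed by cyclic-prefix insertion.

```python
import numpy as np

# Hedged sketch of (5.1): u = T_CP F s, with N = M + L_cp.
# The sizes and the QPSK symbol alphabet are illustrative assumptions.
M, L_cp = 8, 3
N = M + L_cp

# M x M normalized IDFT matrix: column i is the orthonormal IDFT of e_i
F = np.fft.ifft(np.eye(M), axis=0, norm="ortho")
# N x M CP-addition matrix: copies the last L_cp rows on top of the identity
T_cp = np.vstack([np.eye(M)[-L_cp:], np.eye(M)])

rng = np.random.default_rng(2)
s = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), size=M) / np.sqrt(2)

u = T_cp @ F @ s    # N x 1 transmitted OFDM block with cyclic prefix
```

By construction the first L_cp samples of u duplicate its last L_cp samples, which is the redundancy the CP-based blind criteria of Section 4.3.2 exploit.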

5.2 Blind MIMO-OFDM equalizer

The proposed blind algorithm performs the equalization and the co-channel interference cancellation. It minimizes a composite cost function comprised of two criteria: a modified VCMA criterion and a decorrelation criterion. These two criteria are described next.

5.2.1 Modified VCMA Criterion

In the single-transmitter case [217], the VCMA criterion penalizes the deviation of the equalized block energy from a given dispersion constant. In the multiple-transmitter scenario considered in this work, the energy penalty over all data streams may be written as

$J^{\mathrm{VCMA}}(e_1[k], \ldots, e_P[k]) = \sum_{p=1}^{P} E\left[\left(\|z_p[k]\|^2 - r_2\right)^2\right],$   (5.8)

where ‖·‖ denotes the ℓ2-norm of a vector. The block energy dispersion constant is r_2 = E[‖ũ_p[k]‖^4] / E[‖ũ_p[k]‖^2]. The original VCMA cost function [217] is not applicable to signals which have a periodic correlation, such as the OFDM signal using a cyclic prefix (CP). This is due to the fact that the criterion penalizes both the autocorrelation and the cross-correlations of the transmitted data streams. The CP introduces an autocorrelation which may be stronger than the inter-symbol interference (ISI) caused by the multipath channel. The proposed modified VCMA [Publication VII] can handle both the auto-correlation caused by the CP and the channel ISI simultaneously. For details, see [12, 13].
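The VCMA penalty (5.8) can be sketched with sample averages in place of the expectations. The synthetic blocks below only illustrate that blocks of exactly constant energy r_2 drive the cost to zero; they are not OFDM signals.

```python
import numpy as np

# Hedged sketch of the VCMA penalty (5.8): deviation of each equalized
# block's energy from the dispersion constant r2, with sample means
# standing in for expectations. The data here are synthetic.
def vcma_cost(blocks, r2):
    """blocks: list of (num_blocks x N) arrays, one array per data stream."""
    cost = 0.0
    for z in blocks:
        energy = np.sum(np.abs(z) ** 2, axis=1)   # ||z_p[k]||^2 per block
        cost += np.mean((energy - r2) ** 2)       # E[(||z||^2 - r2)^2]
    return cost

rng = np.random.default_rng(3)
N = 16
u = rng.standard_normal((200, N)) + 1j * rng.standard_normal((200, N))
eng = np.sum(np.abs(u) ** 2, axis=1)
# Dispersion constant from the transmitted blocks: r2 = E||u||^4 / E||u||^2
r2 = np.mean(eng ** 2) / np.mean(eng)

# Rescale every block to exactly constant energy r2: zero penalty
z_const = u * np.sqrt(r2 / eng)[:, None]
```

Random-energy blocks incur a strictly positive penalty, while the constant-energy blocks sit at the criterion's minimum.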

5.2.2 Output Decorrelation Criterion

The pth equalized data stream z_p[k] may contain interfering signals corresponding to the other data streams z_l[k], as well as their delayed replicas z_l[k, d] = [z_l(kN − d), ..., z_l(kN − d − N + 1)]^T. The interference is measured by the cross-correlation matrix R_pl(d) between the pth and lth equalized outputs for a certain delay d, i.e., R_pl(d) = E{z_p[k] z_l^H[k, d]}. A decorrelation criterion must be employed because multiple copies of the other signals, i.e., the CCI, may be present in the desired signal. This criterion minimizes the squared Frobenius norm of the cross-correlation matrices. The cross-correlation cost function over all equalized data streams is

$J^{\mathrm{xcorr}}(e_1[k], \ldots, e_P[k]) = \sum_{\substack{p,l=1 \\ p \neq l}}^{P} \; \sum_{d=d_1}^{d_2} \left\| R_{pl}(d) \right\|_F^2.$   (5.9)

The delays d_1, d_2 are chosen according to the maximum delay introduced by the channel. The integer d spans the window of all possible delays, in order to mitigate all the delayed replicas of the interfering signals.
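A sample-based sketch of the decorrelation penalty (5.9); the per-block circular shift used for the delayed replica and the delay window are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
P, K, N = 2, 400, 8   # streams, blocks, block length (illustrative sizes)

def xcorr_cost(z, d1, d2):
    """Sum of squared Frobenius norms of R_pl(d), p != l, d = d1..d2.
    z has shape (P, K, N): P streams, K equalized blocks of N samples."""
    P, K, N = z.shape
    cost = 0.0
    for p in range(P):
        for l in range(P):
            if p == l:
                continue
            for d in range(d1, d2 + 1):
                # Delayed replica of stream l; a per-block circular shift is
                # used here as a simplification of the thesis formulation.
                zl_d = np.roll(z[l], d, axis=1)
                R = np.einsum('ki,kj->ij', z[p], zl_d.conj()) / K  # R_pl(d)
                cost += np.linalg.norm(R, 'fro') ** 2
    return cost

indep = rng.standard_normal((P, K, N)) + 1j * rng.standard_normal((P, K, N))
same = np.stack([indep[0], indep[0]])   # fully correlated streams

cost_indep = xcorr_cost(indep, 0, 2)
cost_same = xcorr_cost(same, 0, 2)
```

Independent streams yield only a small residual cost due to finite sample averaging, while duplicated streams are heavily penalized, which is exactly the behavior the criterion needs to separate co-channel signals.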

5.2.3 Composite Criterion

The VCMA cost function (5.8) was originally designed for the single-transmitter case [217]. Its global convergence has not been established, not even in the single-transmitter case [104, 195]. If multiple signals are present, depending on its initialization, VCMA may converge to any of the transmitted signals, usually to the ones that have the strongest power [217]. This is due to the fact that VCMA updates the equalizers corresponding to the P data streams independently, i.e., the equalized outputs do not influence each other. Obviously, VCMA alone is not sufficient for equalization in a spatial multiplexing scenario, since the problem of co-channel signals must be considered as well [158]. We propose a composite criterion which we prove to be locally convergent. The cost functions (5.8) and (5.9) may be combined in order to cancel both the ISI and the CCI. A weighting parameter 0 < λ < 1 is used to balance the two criteria. The composite cost function is given by

$J = \lambda J^{\mathrm{VCMA}} + (1 - \lambda) J^{\mathrm{xcorr}}.$   (5.10)

The composite criterion (5.10) needs to be minimized w.r.t. the equalizer coefficients. In addition, the unknown parameter λ needs to be found. This is a challenging optimization problem, since the function to be minimized is a multivariate function of fourth order in its complex-valued arguments. The method of Lagrange multipliers [97] is the first method one would have in mind for this type of constrained minimization problem. In that case, the unknown parameters are the equalizer coefficients and the Lagrange multipliers. The equalized outputs depend on both the channel and the equalizer coefficients. Consequently, the corresponding Lagrangian function includes the unknown channel impulse responses, and the Lagrangian method cannot be applied. Moreover, solving the corresponding system of equations would be difficult even with a known channel. The same happens for the dual optimization approach [32, 38]. In addition, due to the fact that the criterion is nonconvex, the dual approach does not provide an optimal solution, i.e., there is the so-called duality gap [32, 38]. Our derivation in [Publication VII] provides an optimal solution to the minimization problem. In order to get rid of the unknown channel impulse responses, the cost function is analyzed in the space of the global channel-equalizer responses. In this way, the optimum weighting parameter λ may be found. By using the obtained value, the equalizer coefficients are updated by using a stochastic steepest descent method

$e_p[k+1] = e_p[k] - \mu \nabla^{E}_{e_p} J,$   (5.11)

where ∇^E_{e_p} J is the instantaneous Euclidean gradient (3.10) of the composite cost function J (5.10) w.r.t. the pth equalizer coefficients e_p (5.4) at iteration k. The gradient expression is given in [Publication VII] and it includes instantaneous estimates instead of expected values. A conjugate gradient algorithm could also be used instead, but this would require performing an exact line search. An accurate adaptive step size would be required, but this is difficult due to the stochastic nature of the algorithm. A closed-form expression for the optimum weighting parameter λ and an upper bound for the step size µ are provided in [Publication VII]. It has been demonstrated in [Publication VII] that stable zero-forcing solutions always exist if the value of the parameter λ is set appropriately. Local convergence properties of the algorithm have also been established in [Publication VII].
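The form of the stochastic update (5.11) can be illustrated on a toy scalar problem: a single complex equalizer tap driving a constant-modulus-style penalty. The signal model, gradient expression, and step size below are illustrative assumptions, not the thesis algorithm itself.

```python
import numpy as np

# Hedged sketch of a stochastic steepest-descent update of the form (5.11):
# one complex tap e, toy penalty J = E[(|e*y|^2 - r)^2] for a scaled
# unit-modulus source. Instantaneous gradients replace the expectation.
rng = np.random.default_rng(5)
r = 1.0     # dispersion constant of the unit-modulus source
g = 2.0     # unknown channel gain (assumption of this toy model)
e = 0.1     # non-zero initialization, as required in the text
mu = 0.01   # step size (assumption)

for _ in range(2000):
    s = np.exp(1j * rng.uniform(0, 2 * np.pi))   # unit-modulus symbol
    y = g * s                                    # received sample
    z = e * y                                    # equalized sample
    # Instantaneous gradient of (|z|^2 - r)^2 w.r.t. conj(e)
    grad = 2.0 * (np.abs(z) ** 2 - r) * z * np.conj(y)
    e = e - mu * grad                            # update of the form (5.11)
```

After convergence the equalized output has unit modulus, i.e., |e·g| ≈ 1, up to the phase ambiguity discussed in the text.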

5.2.4 Conditions for symbol recovery

The proposed algorithm guarantees both signal equalization and co-channel interference cancellation under the conditions below. First, the existence of the zero-forcing (ZF) equalizer must be established. Second, conditions for the convergence of the equalizer need to be found.

Existence of the blind equalizer

Two conditions are necessary for the existence of the ZF equalizer [58]. The first condition is that the virtual polynomials associated with the MIMO channel have no common zeros. This is ensured by a sufficient antenna element spacing w.r.t. the coherence distance, or by coding. The second necessary condition for identifiability is the minimum equalizer length, which depends on the maximum channel order and the number of transmit/receive antennas. For details, see [Publication VII].

Convergence properties

The global convergence properties of the original VCMA [217] have not been established so far, not even in the fractionally-spaced case [104]. The local convergence of VCMA has been investigated in [195]. In [Publication VII] we prove that the proposed algorithm, which is based on a composite cost function, converges at least locally. The differences compared to the pure VCMA [104, 217] are explained in detail in [Publication VII]. Once the existence of the ZF equalizer is guaranteed, the convergence of the blind equalizer to the ZF solution depends upon the characteristics of the surface of the composite criterion represented in the space of the global channel-equalizer impulse responses. We show that truly stable local minima of the composite criterion which correspond to the zero-forcing solutions always exist, under the assumption that the parameter λ weighting the two criteria and the step size µ are appropriately selected. It has been demonstrated in [Publication VII] that in the absence of noise, any value 0 < λ < 1 is appropriate. In noisy conditions, values of λ close to zero and one should be avoided. A closed-form expression for the optimal parameter λ and an upper bound on the step size parameter µ are also provided, and they depend only on the system parameters. It has been shown in [Publication VII] that local minima other than the ones corresponding to the ZF solutions are very unlikely to exist.
Even if they did exist, convergence could be achieved by proper initialization of the algorithm. This can be done by using a very small amount of training data. In this case the algorithm operates in a semi-blind mode. Semi-blind methods are more feasible in practice, since they also resolve the inherent ambiguities that blind algorithms are subject to. Initialization strategies may also be found. A basic requirement is that the initial equalizer settings are non-zero vectors. The case of zero vectors corresponds to a maximum of the cost function, and the coefficients would remain identically zero. Moreover, the initial settings of the sub-equalizers corresponding to different output data streams must not be identical. This is necessary because identical initial settings may cause the same data stream to be recovered at the corresponding outputs. This case corresponds to a saddle point. For details, see the comments related to Table 1 in [Publication VII]. The saddle points may be easily avoided in practice by ensuring that the initial gradient value is non-zero. In that case, different initialization settings may be chosen. Initialization at a saddle point has extremely low probability, since this type of stationary point involves a very special structure of the global channel-equalizer impulse responses.

5.3 Discussion

We propose a blind equalizer for spatial multiplexing MIMO-OFDM systems based on the minimization of a composite criterion. It is able to perform blind equalization and co-channel interference cancellation without estimating the MIMO channel matrix, unlike most of the existing SOS and HOS blind methods. For this reason, the proposed blind algorithm is computationally simple. It does not require expensive operations such as the matrix inversions usually required for the ZF and MMSE equalizers, the matrix decompositions employed in subspace methods [28, 84, 91–94, 129, 180, 226, 228], or the algebraic techniques [100, 166]. When the channel order exceeds the CP length used in the OFDM transmission, the benefit of single-tap equalization is lost. The proposed equalizer operates in the time domain, before the FFT operation at the receiver. Consequently, the equalizer is able to deal with three situations: no CP at all, a CP that is too short, or a CP that is sufficiently long. The CP is not needed for equalization, but it may be used for synchronization purposes, for example. The channel identifiability is conditioned on the common zeros of the MIMO channel branches, but this problem can be solved via non-redundant precoding [37, 91, 92, 129, 174, 175, 221, 222]. Unitary Cayley space-time codes [103, 118, 141, 165] may also be used, due to the fact that they preserve the correlation properties of the transmitted space-time signals. Grassmann space-time codes [120] are also a good alternative, since they efficiently use the degrees of freedom of the MIMO channel [231]. Compared to the BSS methods [55, 102, 115, 173–177], or the tensor-based method in [166], which perform the channel estimation in the frequency domain, the proposed blind algorithm has much lower complexity. Moreover, the ambiguities (inherent to all blind methods) do not affect every subcarrier, but only every transmitted data stream.
The proposed algorithm is able to recover each of the transmitted data streams up to a phase ambiguity and a possible delay. A user permutation ambiguity may also be encountered. These can be resolved by using P pilot symbols within a single OFDM block. Since in practice P is much smaller than the length of the OFDM block, the semi-blind version still provides a high data rate. The same number of pilots is necessary for subspace-based methods, but in addition these methods require either CP, ZP or VC. The proposed blind algorithm can be used for channel tracking using the data symbols only. It is suitable for MIMO-OFDM systems under slow to moderate fading conditions, such as in wireless LANs, continuous transmissions (television and radio), and fixed wireless communications systems.


Chapter 6

Summary

Optimization techniques are a key part of many array and multi-channel signal processing algorithms. Application domains include radar, multi-antenna communications, sensor arrays, and biomedical applications. Often, the optimization needs to be performed subject to matrix constraints. In particular, orthogonal or unitary matrices play a crucial role in many tasks, for example, adaptive beamforming, interference cancellation, MIMO transmission, space-time coding, and signal separation. In order to obtain optimal or close-to-optimal performance, optimization algorithms are needed to minimize the selected error criterion or cost function. In many practical applications numerical optimization is the only computationally feasible solution. For this reason, in this work we focus on optimization under the unitary matrix constraint. In this thesis, reliable and computationally feasible constrained optimization algorithms are proposed. Riemannian steepest descent and conjugate gradient algorithms operating on the Lie group of unitary matrices U(n) are derived. They take full benefit of the geometrical properties of the group, as well as of recent advances in numerical techniques. Two novel line search methods exploiting the almost periodic property of the cost function along geodesics on U(n) are also proposed. The proposed algorithms are suitable for performing the joint diagonalization of a set of Hermitian matrices, which is a fundamental problem in blind source separation. They outperform the classical JADE approach based on Givens rotations [45] in terms of convergence speed, at a similar cost per iteration, as demonstrated in [Publication I], [Publication II], [Publication IV], and [Publication V]. SD and CG on U(n) are used for computing the full set of eigenvectors of a Hermitian matrix, by optimizing the off-norm cost function [112] in [Publication III], and the Brockett cost function [41, 184] in [Publication II] and [Publication V].
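The off-norm criterion mentioned above can be sketched directly: it sums the squared off-diagonal entries of the rotated matrices and vanishes at an exact joint diagonalizer. The commuting test matrices below are synthetic, built to share a common eigenbasis.

```python
import numpy as np

# Hedged sketch of the off-norm joint-diagonalization criterion:
# sum of squared off-diagonal entries of W^H A_i W over a matrix set.
def off_norm(W, mats):
    cost = 0.0
    for A in mats:
        B = W.conj().T @ A @ W
        cost += np.linalg.norm(B - np.diag(np.diag(B)), 'fro') ** 2
    return cost

rng = np.random.default_rng(6)
n = 4
# Build commuting Hermitian matrices sharing a random unitary eigenbasis U
H = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
U, _ = np.linalg.qr(H)
mats = [U @ np.diag(rng.standard_normal(n)) @ U.conj().T for _ in range(3)]
```

The common eigenbasis U drives the criterion to (numerically) zero, whereas an arbitrary unitary such as the identity leaves a strictly positive off-diagonal residual; minimizing this cost over U(n) is what the proposed Riemannian algorithms do.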
Multi-antenna MIMO systems and multicarrier transmission schemes such as OFDM are key technologies in future wireless communication systems such as B3G Long Term Evolution (LTE), IMT-2000 and WiMAX [1, 8, 10].

In this work, a blind receiver for MIMO-OFDM systems is proposed. The algorithm optimizes a composite criterion in order to cancel both inter-symbol and co-channel interference. Identifiability conditions and local convergence properties of the algorithm are established. Possible topics of future research include extending the proposed algorithms to optimization w.r.t. non-square orthonormal matrices. When the optimization needs to be performed w.r.t. an orthonormal matrix with more rows than columns, the appropriate parameter space is the Stiefel manifold of n × p orthonormal matrices, St(n, p). This is the case for applications that require a distinct set of orthonormal vectors, such as limited-feedback MIMO systems, unitary space-time codes, and MIMO radars and sonars. If the cost function possesses symmetries, such as invariance to right multiplication of its argument by unitary matrices, the appropriate parameter space is the Grassmann manifold Gr(n, p) of p-dimensional subspaces of the n-dimensional Euclidean space. This is the case for all subspace estimation and tracking techniques. Therefore, a broad range of array and multi-channel applications may be addressed. The most important applications include blind equalization and source separation, smart antennas, as well as biomedical applications. The fact that the Stiefel and Grassmann manifolds are homogeneous spaces may be beneficial in reducing the computational complexity of optimization algorithms. Stiefel and Grassmann manifolds are quotient spaces arising from the unitary group U(n) (for complex-valued matrices) or from the orthogonal group O(n) (for real-valued matrices). Many properties inherited from the corresponding Lie groups may be exploited. In conclusion, the fact that the proposed algorithms focus only on U(n) should not be seen as a limitation to n × n unitary matrices. Line search methods are crucial for the performance of the optimization algorithms.
New approaches exploiting the almost periodicity of the cost function along geodesics are possible, and they remain to be studied. The proposed DFT-based line search method opens multiple possibilities for finding better local minima (or for reaching the global minimum faster). Computationally efficient Riemannian Newton algorithms may also be addressed in future work. An important goal is to achieve a complexity of order O(n³) per iteration by fully exploiting the properties of U(n). Trust-region methods [15, 18] are of particular interest due to their desirable global convergence properties, which classical Newton algorithms do not possess in general.

Another possible research topic for the future is developing algorithms for joint blind equalization and carrier frequency-offset compensation in MIMO-OFDM systems. This may lead to computationally efficient algorithms that enable high user mobility.
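The idea behind a DFT-based line search along geodesics can be illustrated with a toy sketch. This is again illustrative only: the search interval T, the grid sizes, and the simple diagonalization criterion are assumptions made here, and a periodic trigonometric interpolant only approximates the almost periodic cost. The cost along the geodesic t ↦ exp(−tΩ)W is sampled on a uniform grid, a trigonometric interpolant is formed from the DFT of the samples, and its minimizer is taken as the step size.

```python
import numpy as np
from scipy.linalg import expm

def cost(W, A):
    """Off-diagonal energy of W^H A W (same toy criterion as before)."""
    D = W.conj().T @ A @ W
    return np.linalg.norm(D - np.diag(np.diag(D)))**2

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (B + B.conj().T) / 2
W = np.eye(n, dtype=complex)

# A fixed skew-Hermitian search direction (here: the lifted gradient).
D0 = W.conj().T @ A @ W
E = D0 - np.diag(np.diag(D0))
G = 2 * A @ W @ E
Omega = G @ W.conj().T - W @ G.conj().T

# Sample the cost along the geodesic t -> exp(-t*Omega) W.  The rotation
# angles of exp(-t*Omega) are set by the eigenvalues of Omega, so the cost
# is an (almost) periodic function of t, well captured by few Fourier terms.
T = 1.0                              # search interval (assumed, not tuned)
N = 64
ts = np.linspace(0.0, T, N, endpoint=False)
J = np.array([cost(expm(-t * Omega) @ W, A) for t in ts])

# DFT-based refinement: evaluate the trigonometric interpolant of the
# samples on a much finer grid and take its minimizer as the step size.
c = np.fft.fft(J)
k = np.fft.fftfreq(N, d=T / N)       # frequencies of the DFT bins
fine = np.linspace(0.0, T, 16 * N, endpoint=False)
J_interp = np.real(
    np.array([(c * np.exp(2j * np.pi * k * t)).sum() for t in fine]) / N
)
t_best = fine[np.argmin(J_interp)]

# Fall back to the best raw sample if the interpolant misleads us.
if cost(expm(-t_best * Omega) @ W, A) > J.min():
    t_best = ts[np.argmin(J)]

print(t_best, cost(expm(-t_best * Omega) @ W, A), cost(W, A))
```

When the rotation frequencies are commensurate the cost along the geodesic is exactly periodic and the interpolant is exact at the sample points; otherwise it is almost periodic and the interpolant is only an approximation, which is why the sketch keeps the sampled minimum as a fallback.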


Bibliography

[1] Worldwide inter-operability for microwave access (WiMAX). WiMAX Forum: http://www.wimaxforum.org/home.
[2] Digital broadcasting systems for television, sound, and data services: framing structure, channel coding, and modulation for digital terrestrial television. ETSI Standard: EN 300 744, 1996. European Telecommunications Standardization Institute, Sophia-Antipolis, Valbonne, France.
[3] Network and customer installation interfaces – asymmetric digital subscriber line (ADSL). ANSI Standard: T1.413, 1998. American National Standards Institute.
[4] Wireless local area network (LAN) medium access control (MAC) and physical layer (PHY) specifications: high speed physical layer in the 5 GHz band. IEEE Standards: IEEE 802.11a-1999, IEEE 802.11g-2003, IEEE 802.11n-2007, 1999/2003/2007.
[5] Broadband radio access networks (BRAN); high performance radio local area networks (HIPERLAN), type 2; physical (PHY) layer. ETSI Standard: TS 101 475, 20 Dec. 2001. European Telecommunications Standardization Institute, Sophia-Antipolis, Valbonne, France.
[6] Local and metropolitan area networks: Air interface for fixed broadband access systems. IEEE Standard: 802.16, 2004.
[7] Radio broadcasting system, digital audio broadcasting (DAB) to mobile, portable, and fixed receivers. ETSI Standard: EN 300 401, 2006. European Telecommunications Standardization Institute, Sophia-Antipolis, Valbonne, France.
[8] IMT-2000 (International Mobile Telecommunications-2000) OFDMA (Orthogonal Frequency Division Multiplexing) TDD (Time Division Duplex) WMAN (Wireless Metropolitan Area Network). ITU Recommendation: ITU-R M.1457, 2007. International Telecommunication Union.

[9] OFDMA broadband mobile wireless access system. ARIB Standards: STD-T94, STD-T95, 2008. Association of Radio Industries and Businesses, Tokyo, Japan.
[10] Third Generation Partnership Project (3GPP). Evolved universal terrestrial radio access (E-UTRA); long-term evolution (LTE) physical layer. 3GPP Standard TS 36.201, http://www.3gpp.org/.
[11] K. Abed-Meraim, A. Chkeif, and Y. Hua. Fast orthonormal PAST algorithm. IEEE Signal Processing Letters, 7(3):60–62, Mar. 2000.
[12] T. Abrudan, A. Hjørungnes, and V. Koivunen. Toeplitz method for blind equalization in MIMO OFDM systems. In International Zürich Seminar on Communications, IZS 2004, pages 212–215, Zürich, Switzerland, 18–20 Feb. 2004.
[13] T. Abrudan, M. Sîrbu, and V. Koivunen. A block-Toeplitz VCMA equalizer for MIMO-OFDM systems. In Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 1, pages 1037–1041, Pacific Grove, CA, 9–12 Nov. 2003.
[14] P.-A. Absil, C. G. Baker, and K. A. Gallivan. A truncated-CG style method for symmetric generalized eigenvalue problems. Journal of Computational and Applied Mathematics, 189(1–2):274–285, May 2006.
[15] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
[16] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. Technical Report NA2006/01, DAMTP, University of Cambridge, http://www.damtp.cam.ac.uk/user/na/reports.html, 2006.
[17] P.-A. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematicae, 80(2):199–220, 2004.
[18] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, Jan. 2008.
[19] P.-A. Absil, R. Mahony, R. Sepulchre, and P. Van Dooren. A Grassmann-Rayleigh quotient iteration for computing invariant subspaces. SIAM Review, 44:57–73, 2002.

[20] R. L. Adler, J.-P. Dedieu, J. Y. Margulies, M. Martens, and M. Shub. Newton's method on Riemannian manifolds and a geometric model for the human spine. IMA Journal of Numerical Analysis, 22(3):359–390, Jul. 2002.
[21] T. Aittomäki and V. Koivunen. Signal covariance matrix optimization for transmit beamforming in MIMO radars. In Forty-First Asilomar Conference on Signals, Systems and Computers, 2007, pages 182–186, 4–7 Nov. 2007.
[22] S. Alamouti. A simple transmit diversity technique for wireless communications. IEEE Journal on Selected Areas in Communications, 16:1451–1458, Oct. 1998.
[23] H. Ali, J. H. Manton, and Y. Hua. A SOS subspace method for blind channel identification and equalization in bandwidth efficient OFDM systems based on receive antenna diversity. In 11th IEEE Signal Processing Workshop on Statistical Signal Processing, pages 401–404, Aug. 2001.
[24] S.-I. Amari. Differential Geometry Methods in Statistics. Lecture Notes in Statistics. Springer-Verlag, 1985.
[25] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[26] M. Journée, P.-A. Absil, and R. Sepulchre. Gradient-optimization on the orthogonal group for independent component analysis. In 7th International Conference on Independent Component Analysis and Signal Separation, ICA 2007, pages 57–64, London, UK, 9–12 Sep. 2007.
[27] A. Antoniou and W.-S. Lu. Practical Optimization: Algorithms and Engineering Applications. Springer, 2007.
[28] W. Bai and Z. Bu. Channel identification in MIMO-OFDM systems. In The IEEE 6th Circuits and Systems Symposium on Emerging Technologies: Frontiers of Mobile and Wireless Communication '04, volume 2, pages 611–614, 31 May–2 Jun. 2004.
[29] W. Bai, C. He, L. Jiang, and H. Zhu. Blind channel estimation in MIMO-OFDM systems. In IEEE Global Telecommunications Conference, GLOBECOM '02, volume 1, pages 317–321, 17–21 Nov. 2002.
[30] W. Bai, H. Yang, and Z. Bu. Blind channel identification in SIMO-OFDM systems. In International Conference on Communications, Circuits and Systems, ICCCAS 2004, volume 1, pages 318–321, 27–29 Jun. 2004.

[31] D. Bartolomé, A. I. Pérez-Neira, and A. Pascual. Blind and semiblind spatio-temporal diversity for OFDM systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 2769–2772, 13–17 May 2002.
[32] D. P. Bertsekas. Convex Analysis and Optimization. Athena Scientific, 2003.
[33] A. S. Besicovitch. Almost Periodic Functions. Dover, New York, 1954.
[34] D. W. Bliss and K. W. Forsythe. Multiple-input multiple-output (MIMO) radar and imaging: degrees of freedom and resolution. In Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 1, pages 54–59, 2003.
[35] H. Bohr. Almost Periodic Functions. Chelsea, New York, 1951.
[36] H. Bölcskei, D. Gesbert, and A. J. Paulraj. On the capacity of OFDM-based spatial multiplexing systems. IEEE Transactions on Communications, 50(2):225–234, Feb. 2002.
[37] H. Bölcskei, R. W. Heath, and A. J. Paulraj. Blind channel identification and equalization in OFDM-based multiantenna systems. IEEE Transactions on Signal Processing, 50(1):96–109, Jan. 2002.
[38] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[39] D. H. Brandwood. A complex gradient operator and its applications in adaptive array theory. In IEE Proceedings, Parts F and H, volume 130, pages 11–16, Feb. 1983.
[40] R. W. Brockett. Least squares matching problems. Linear Algebra and its Applications, 122/123/124:761–777, 1989.
[41] R. W. Brockett. Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems. Linear Algebra and its Applications, 146:79–91, 1991.
[42] R. W. Brockett. Differential geometry and the design of gradient algorithms. Proceedings of the Symposia in Pure Math, American Mathematical Society, 54(1):69–92, 1992.
[43] R. W. Brockett. Singular values and least squares matching. In Proceedings of 36th IEEE Conference on Decision and Control, pages 1121–1124, San Diego, CA, USA, Dec. 1997.
[44] J.-F. Cardoso and B. H. Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44:3017–3030, 1996.

[45] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. IEE Proceedings-F, 140(6):362–370, 1993.
[46] E. Celledoni and S. Fiori. Neural learning by geometric integration of reduced 'rigid-body' equations. Journal of Computational and Applied Mathematics, 172(2):247–269, 2004.
[47] E. Celledoni and A. Iserles. Methods for approximation of a matrix exponential in a Lie-algebraic setting. IMA Journal on Numerical Analysis, 21(2):463–488, 2001.
[48] Y. Y. Cheng, Y. Lee, and H. J. Li. Subspace-MMSE blind channel estimation for multiuser OFDM with receiver diversity. In IEEE Global Telecommunications Conference, volume 4, pages 2295–2299, Dec. 2003.
[49] H. Cheon and D. Hong. A blind spatio-temporal equalizer using cyclic prefix in OFDM systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 2957–2960, Jun. 2000.
[50] J. Choi and R. W. Heath Jr. Channel adaptive quantization for limited feedback MIMO beamforming systems. IEEE Transactions on Signal Processing, 54(12):4717–4729, Dec. 2006.
[51] J. Choi, B. Mondal, and R. W. Heath Jr. Interpolation based unitary precoding for spatial multiplexing MIMO-OFDM with limited feedback. IEEE Transactions on Signal Processing, 54(12):4730–4740, Dec. 2006.
[52] M. T. Chu and K. R. Driessel. The projected gradient methods for least squares matrix approximations with spectral constraints. SIAM Journal on Numerical Analysis, 27(4):1050–1060, 1990.
[53] P. Comon and G. H. Golub. Tracking a few extreme singular values and vectors in signal processing. In Proceedings of the IEEE, volume 78, pages 1327–1343, Aug. 1990.
[54] C. Corduneanu. Almost Periodic Functions. Interscience Tracts in Pure and Applied Mathematics. Interscience Publishers, New York, 1961.
[55] S. R. Curnew and J. Ilow. Blind signal separation in MIMO OFDM systems using ICA and fractional sampling. In International Symposium on Signals, Systems and Electronics, ISSSE '07, pages 67–70, 30 Jul.–2 Aug. 2007.
[56] J. Dehaene. Continuous-time matrix algorithms, systolic algorithms and adaptive neural networks. PhD thesis, K. U. Leuven, Oct. 1995.

[57] J. Dehaene, C. Yi, and B. De Moor. Calculation of the structured singular value with gradient-based optimization algorithms on a Lie group of structured unitary matrices. IEEE Transactions on Automatic Control, 42(11):1596–1600, Nov. 1997.
[58] L. Deneire, E. De Carvalho, and D. T. M. Slock. Identifiability conditions for blind and semi-blind multiuser multichannel identification. In Ninth IEEE SP Workshop on Statistical Signal and Array Processing, pages 372–375, 14–16 Sept. 1998.
[59] M. P. do Carmo. Riemannian Geometry. Mathematics: Theory and Applications. Birkhäuser, 1992.
[60] S. C. Douglas. Numerically-robust adaptive subspace tracking using Householder transformations. In Proceedings of the 2000 IEEE Sensor Array and Multichannel Signal Processing Workshop, pages 499–503, Mar. 2000.
[61] S. C. Douglas. Self-stabilized gradient algorithms for blind source separation with orthogonality constraints. IEEE Transactions on Neural Networks, 11(6):1490–1497, Nov. 2000.
[62] S. C. Douglas. On the design of gradient algorithms employing orthogonal matrix constraints. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 4, pages 1401–1404, 15–20 Apr. 2007.
[63] S. C. Douglas, S.-I. Amari, and S.-Y. Kung. A self-stabilized minor subspace rule. IEEE Signal Processing Letters, 5(12):328–330, Dec. 1998.
[64] S. C. Douglas, S.-I. Amari, and S.-Y. Kung. On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing, 48(6):1843–1847, Jun. 2000.
[65] S. C. Douglas and S.-Y. Kung. An ordered-rotation KuicNet algorithm for separating arbitrarily-distributed sources. In Proceedings of IEEE International Conference on Independent Component Analysis and Signal Separation, pages 419–425, Aussois, France, Jan. 1999.
[66] S. C. Douglas and X. Sun. Designing orthonormal subspace tracking algorithms. In Thirty-Fourth Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1441–1445, 2000.
[67] Jiang Du, Qicong Peng, and Yubei Li. Adaptive blind equalization for MIMO-OFDM wireless communication systems. In International Conference on Communication Technology Proceedings, ICCT '03, volume 2, pages 1086–1090, 31 May–2 Jun. 2003.

[68] Jiang Du, Qicong Peng, and Hongying Zhang. Adaptive blind channel identification and equalization for MIMO-OFDM wireless communication systems. In 14th IEEE Proceedings on Personal, Indoor and Mobile Radio Communications, PIMRC 2003, volume 3, pages 2078–2082, 7–10 Sept. 2003.
[69] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[70] M. Enescu. Adaptive Methods for Blind Equalization and Signal Separation in MIMO Systems. PhD thesis, Helsinki University of Technology, Helsinki, Finland, 2002.
[71] J. Eriksson. Contributions to Theory and Algorithms of Independent Component Analysis and Signal Separation. PhD thesis, Helsinki University of Technology, Helsinki, Finland, Aug. 2004.
[72] J. Eriksson and V. Koivunen. Characteristic-function-based independent component analysis. Signal Processing, 83:2195–2208, Oct. 2003.
[73] S. Fiori. 'Mechanical' neural learning for blind source separation. Electronics Letters, 35(22):1963–1964, 28 Oct. 1999.
[74] S. Fiori. Stiefel-Grassman Flow (SGF) learning: further results. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 3, pages 343–348, 24–27 Jul. 2000.
[75] S. Fiori. A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation, 13:1625–1647, 2002.
[76] S. Fiori. Unsupervised neural learning on Lie group. International Journal of Neural Systems, 12(3–4):219–246, 2002.
[77] S. Fiori. Quasi-geodesic neural learning algorithms over the orthogonal group: a tutorial. Journal of Machine Learning Research, 1:1–42, Apr. 2005.
[78] S. Fiori and F. Piazza. Orthonormal strongly-constrained neural learning. In IEEE International Joint Conference on Neural Networks, volume 2, pages 1332–1337, 4–9 May 1998.
[79] S. Fiori, A. Uncini, and F. Piazza. Application of the MEC network to principal component analysis and source separation. In Proceedings of International Conference on Artificial Neural Networks, pages 571–576, 1997.

[80] A. Fischer. Structure of Fourier exponents of almost periodic functions and periodicity of almost periodic functions. Mathematica Bohemica, 121(3):249–262, 1996.
[81] A. Fischer. Approximation of almost periodic functions by periodic ones. Czechoslovak Mathematical Journal, 48(123):193–205, 1998.
[82] E. Fishler, A. Haimovich, R. Blum, D. Chizhik, L. Cimini, and R. Valenzuela. MIMO radar: an idea whose time has come. In IEEE Radar Conference 2004, pages 71–78, 2004.
[83] R. Fletcher. Practical Methods of Optimization (2nd ed.). Wiley-Interscience, New York, NY, USA, 1987.
[84] H. Fu, P. H. W. Fung, and S. Sun. Semiblind channel estimation for MIMO-OFDM. In 15th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC 2004, volume 3, pages 1850–1854, 5–8 Sept. 2004.
[85] D. R. Fuhrmann. A geometric approach to subspace tracking. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, volume 1, pages 783–787, 2–5 Nov. 1997.
[86] D. R. Fuhrmann, A. Srivastava, and Hojin Moon. Subspace tracking via rigid body dynamics. In Proceedings of the 8th IEEE Signal Processing Workshop on Statistical Signal and Array Processing, pages 578–581, 24–26 Jun. 1996.
[87] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, Jun. 1982.
[88] J. Gallier. Geometric Methods and Applications, volume 38 of Texts in Applied Mathematics. Springer, 2000.
[89] J. Gallier. Notes on Differential Geometry and Lie Groups. Apr. 2008. Book in progress – available online at http://www.cis.upenn.edu/~jean/gbooks/manif.html.
[90] C. Gao, Ming Zhao, Shidong Zhou, and Yan Yao. Blind channel estimation algorithm for MIMO-OFDM systems. Electronics Letters, 39(19):1420–1422, 18 Sept. 2003.
[91] F. Gao and A. Nallanathan. Subspace-based blind channel estimation for SISO, MISO and MIMO OFDM systems. In IEEE International Conference on Communications, ICC '06, volume 7, pages 3025–3030, Jun. 2006.

[92] F. Gao and A. Nallanathan. Blind channel estimation for MIMO OFDM systems via nonredundant linear precoding. IEEE Transactions on Signal Processing, 55(2):784–789, Jan. 2007.
[93] F. Gao, W. Wu, Y. Zeng, and A. Nallanathan. A novel blind channel estimation for CP-based MIMO OFDM systems. In IEEE International Conference on Communications, ICC '07, pages 258–2591, 24–28 Jun. 2007.
[94] F. Gao, Y. Zeng, A. Nallanathan, and T.-S. Ng. Robust subspace blind channel estimation for cyclic prefixed MIMO OFDM systems: algorithm, identifiability and performance analysis. IEEE Journal on Selected Areas in Communications, 26(2):378–388, Feb. 2008.
[95] D. Gesbert, M. Shafi, D. Shiu, and P. Smith. From theory to practice: An overview of space-time coded MIMO wireless systems. IEEE Journal on Selected Areas in Communications, 21(3), Apr. 2003. Special issue on MIMO systems.
[96] M. Ghogho and A. Swami. Blind channel identification for OFDM systems with receive antenna diversity. In IEEE Workshop on Signal Processing Advances in Wireless Communications, pages 378–382, Jun. 2003.
[97] P. E. Gill and W. Murray. The computation of Lagrange-multiplier estimates for constrained minimization. Mathematical Programming, 17(1):32–60, Dec. 1979.
[98] D. N. Godard. Self-recovering equalization and carrier tracking in two-dimensional data communication systems. IEEE Transactions on Communications, 28(11):1867–1875, Nov. 1980.
[99] G. H. Golub and C. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
[100] A. Gorokhov. Blind equalization in SIMO OFDM systems with frequency domain spreading. IEEE Transactions on Signal Processing, 48(12):3536–3549, Dec. 2000.
[101] S. Gudmundsson. An introduction to Riemannian geometry, 2004. Lecture notes available at http://www.matematik.lu.se/matematiklu/personal/sigma/index.html.
[102] B. Guo, H. Lin, and K. Yamashita. Blind signal recovery in multiuser MIMO-OFDM system. In The 2004 47th Midwest Symposium on Circuits and Systems, MWSCAS '04, pages 637–640, 25–28 Jul. 2004.

[103] B. Hassibi and B. M. Hochwald. Cayley differential unitary space-time codes. IEEE Transactions on Information Theory, 48(6):1485–1503, Jun. 2002.
[104] M. A. Haun. The fractionally spaced vector constant modulus algorithm. Master's thesis, University of Illinois at Urbana-Champaign, 1999.
[105] S. Haykin. Adaptive Filter Theory. Prentice Hall, 3rd edition, 1996.
[106] J. Heiskala and J. Terry. OFDM Wireless LANs: A Theoretical and Practical Guide. SAMS Publishing, 2001.
[107] S. Helgason. Differential Geometry, Lie Groups and Symmetric Spaces. Academic Press, 1978.
[108] N. J. Higham. Matrix nearness problems and applications. In M. J. C. Gover and S. Barnett, editors, Applications of Matrix Theory, pages 1–27. Oxford University Press, 1989.
[109] A. Hjørungnes and D. Gesbert. Complex-valued matrix differentiation: Techniques and key results. IEEE Transactions on Signal Processing, 55(6):2740–2746, Jun. 2007.
[110] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
[111] A. Hottinen, O. Tirkkonen, and R. Wichman. Multi-antenna Transceiver Techniques for 3G and Beyond. Wiley, Jan. 2003.
[112] K. Hüper, U. Helmke, and J. B. Moore. Structure and convergence of conventional Jacobi-type methods minimizing the off-norm function. In Proceedings of 35th IEEE Conference on Decision and Control, volume 2, pages 2124–2129, Kobe, Japan, 11–13 Dec. 1996.
[113] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[114] I. Ghaleb, O. A. Alim, and K. Seddik. A new finite alphabet based blind channel estimation for OFDM systems. In IEEE 5th Workshop on Signal Processing Advances in Wireless Communications, pages 102–105, 11–14 Jul. 2004.
[115] D. Iglesia, A. Dapena, and C. J. Escudero. Multiuser detection in MIMO OFDM systems using blind source separation. In Proc. 6th Baiona Workshop on Signal Processing in Communications, pages 41–46, Sept. 2003.

[116] S. Ikeda, T. Tanaka, and S.-I. Amari. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 50(6):1097–1114, Jun. 2004.
[117] A. Iserles and A. Zanna. Efficient computation of the matrix exponential by general polar decomposition. SIAM Journal on Numerical Analysis, 42(5):2218–2256, Mar. 2005.
[118] Yindi Jing and B. Hassibi. Unitary space-time modulation via Cayley transform. IEEE Transactions on Signal Processing, 51(11):2891–2904, Nov. 2003.
[119] D. L. Jones. Property-restoral algorithms for blind equalization of OFDM. In Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 1, pages 619–622, Pacific Grove, CA, 9–12 Nov. 2003.
[120] I. Kammoun and J.-C. Belfiore. A new family of Grassmann space-time codes for non-coherent MIMO systems. IEEE Communications Letters, 7(11):528–530, Nov. 2003.
[121] N. Khaled, B. Mondal, R. W. Heath Jr., G. Leus, and F. Petré. Quantized multi-mode precoding for spatial multiplexing MIMO-OFDM system. In 2005 IEEE 62nd Vehicular Technology Conference, volume 2, pages 867–871, Sept. 2005.
[122] M. Kleinsteuber and K. Hüper. An intrinsic CG algorithm for computing dominant subspaces. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 4, pages 1405–1408, Apr. 2007.
[123] A. Knapp. Lie Groups Beyond an Introduction, volume 140 of Progress in Mathematics. Birkhäuser, 1996.
[124] S. G. Krantz. Function Theory of Several Complex Variables. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 2nd edition, 1992.
[125] H. Krim and M. Viberg. Two decades of array signal processing research: the parametric approach. IEEE Signal Processing Magazine, 13(4):67–94, Jul. 1996.
[126] J. Laurila. Semi-Blind Detection of Co-Channel Signals in Mobile Communications. PhD thesis, Technische Universität Wien, Wien, Austria, Mar. 2000. Available online at http://www.nt.tuwien.ac.at/mobile/theses finished/.

[127] B. M. Levitan and V. V. Zhikov. Almost Periodic Functions and Differential Equations. Cambridge, 1982.
[128] R. Liu and L. Tong (Ed.). Special issue on blind system identification and estimation. Proceedings of the IEEE, 86(10):1903–2116, Oct. 1998.
[129] Xia Liu and M. E. Bialkowski. SVD-based blind channel estimation for a MIMO OFDM system employing a simple block pre-coding scheme. In The International Conference on "Computer as a Tool", EUROCON 2007, pages 926–929, 9–12 Sept. 2007.
[130] J. Lu, T. N. Davidson, and Z.-Q. Luo. Blind separation of BPSK signals using Newton's method on the Stiefel manifold. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 301–304, Apr. 2003.
[131] D. G. Luenberger. The gradient projection method along geodesics. Management Science, 18:620–631, 1972.
[132] Y. Luo, J. A. Chambers, and S. Lambotharan. Global convergence and mixing parameter selection in the cross-correlation constant modulus algorithm for the multi-user environment. IEE Vision, Image and Signal Processing, 148(1):9–20, Feb. 2001.
[133] Y. Ma, J. Košecká, and S. Sastry. Motion estimation in computer vision: optimization on Stiefel manifolds. In Proceedings of the 37th IEEE Conference on Decision and Control, volume 4, pages 3751–3756, Dec. 1998.
[134] Yi Ma, Yi Huang, Xu Zhu, and Na Yi. Blind channel estimation for OFDM based multitransmitter systems using guard interval diversity. In 2004 IEEE 59th Vehicular Technology Conference, volume 1, pages 440–444, 17–19 May 2004.
[135] R. Mahony and J. H. Manton. The geometry of the Newton method on non-compact Lie groups. Journal of Global Optimization, 23:309–327, 2002.
[136] J. H. Manton. A new algorithm for computing the extreme eigenvectors of a complex Hermitian matrix. In Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing, pages 225–228, 6–8 Aug. 2001.
[137] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50(3):635–650, Mar. 2002.
[138] J. H. Manton. On the role of differential geometry in signal processing. In International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 1021–1024, Philadelphia, Mar. 2005.

[139] J. H. Manton and Y. Hua. Convolutive reduced-rank Wiener filtering. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, May 2001.
[140] J. H. Manton, R. Mahony, and Y. Hua. The geometry of weighted low-rank approximations. IEEE Transactions on Signal Processing, 51(2):500–514, Feb. 2003.
[141] T. L. Marzetta, B. Hassibi, and B. M. Hochwald. Structured unitary space-time autocoding constellations. IEEE Transactions on Information Theory, 48(4):942–950, Apr. 2002.
[142] C. Moler and C. van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3–49, 2003.
[143] B. Mondal and R. W. Heath Jr. Algorithms for quantized precoded MIMO-OFDM systems. In Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, pages 381–385, 2005.
[144] B. Mondal, R. W. Heath Jr., and L. W. Hanlen. Quantization on the Grassmann manifold: applications to precoded MIMO wireless systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, volume 5, pages 1025–1028, Mar. 2005.
[145] B. Mondal, R. Samanta, and R. W. Heath Jr. Frame theoretic quantization for limited feedback MIMO beamforming systems. In 2005 International Conference on Wireless Networks, Communications and Mobile Computing, volume 2, pages 1065–1070, 2005.
[146] J. B. Moore and P. Y. Lee. Differential geometry applications to vision systems. In Symposium on Mechanical Systems Control, pages 1–30, Berkeley, CA, Jun. 2006.
[147] A. Neumaier. Solving ill-conditioned and singular linear systems: A tutorial on regularization. SIAM Review, 40:636–666, 1998.
[148] M. Nikpour, K. Hüper, and J. H. Manton. Generalizations of the Rayleigh quotient iteration for the iterative refinement of the eigenvectors of real symmetric matrices. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, volume 5, pages 1041–1044, 2005.
[149] M. Nikpour, J. H. Manton, and G. Hori. Algorithms on the Stiefel manifold for joint diagonalisation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 1481–1484, 2002.

[150] Y. Nishimori. Learning algorithm for independent component analysis by geodesic flows on orthogonal group. In International Joint Conference on Neural Networks, volume 2, pages 933–938, 10–16 Jul. 1999.
[151] Y. Nishimori and S. Akaho. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67:106–135, Jun. 2005.
[152] Y. Nishimori, S. Akaho, S. Abdallah, and M. D. Plumbley. Flag manifolds for subspace ICA problems. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 4, pages 1417–1420, 15–20 Apr. 2007.
[153] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2006.
[154] K. Nomizu. Invariant affine connections on homogeneous spaces. American Journal of Mathematics, 76(1):33–65, Jan. 1954.
[155] B. Owren and B. Welfert. The Newton iteration on Lie groups. BIT Numerical Mathematics, 40(1):121–145, Mar. 2000.
[156] C. B. Papadias. Globally convergent blind source separation based on a multiuser kurtosis maximization criterion. IEEE Transactions on Signal Processing, 48(12):3508–3519, Dec. 2000.
[157] C. B. Papadias and A. M. Kuzminskiy. Blind source separation with randomized Gram-Schmidt orthogonalization for short burst systems. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 809–812, 17–21 May 2004.
[158] C. B. Papadias and A. J. Paulraj. A constant modulus algorithm for multiuser signal separation in presence of delay spread using antenna arrays. IEEE Signal Processing Letters, 4(6):178–181, Jun. 1997.
[159] T. Petermann, S. Vogeler, K. D. Kammeyer, and D. Boss. Blind turbo channel estimation in OFDM receivers. In Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1489–1493, Nov. 2001.
[160] M. D. Plumbley. Optimization using Fourier expansion over a geodesic for non-negative ICA. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation, ICA 2004, pages 49–56, Granada, Spain, Sept. 2004.
[161] M. D. Plumbley. Geometrical methods for non-negative ICA: manifolds, Lie groups, toral subalgebras. Neurocomputing, 67:161–197, 2005.

[162] M. D. Plumbley. Geometry and manifolds for independent component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 4, pages 1397–1400, 15–20 Apr. 2007. [163] E. Polak. Optimization: Algorithms and Consistent Approximations. New York: Springer-Verlag, 1997. [164] N. Prabhu, H.-C Chang, and M. deGuzman. Optimization on Lie manifolds and pattern recognition. Pattern Recognition, 38:2286–2300, 2005. [165] A. Qatawneh and L. de Haro Ariet. OFDM-MIMO system using Cayley differential unitary space time coding. In Proceedings of Sensor Array and Multichannel Signal Processing Workshop 2004, pages 254– 258, 18–21 Jul. 2004. [166] M. Rajih, P. Comon, and D. Slock. A deterministic blind receiver for MIMO OFDM systems. In IEEE 7th Workshop on Signal Processing Advances in Wireless Communications, SPAWC ’06, pages 1–5, 2–5 Jul. 2006. [167] F. C. Robey, S. Coutts, D. Weikle, J. C. McHarg, and K. Cuomo. MIMO radar theory and experimental results. In Conference Record of the Thirty-Eighth Asilomar Conference on Signals Systems and Computers, volume 1, pages 300–304, 2004. [168] T. Roman. Advanced receiver structures for mobile MIMO multicarrier systems. PhD thesis, Helsinki University of Technology, Helsinki, Finland, Apr. 2006. [169] T. Roman, M. Enescu, and V. Koivunen. Joint time-domain tracking of channel and frequency offsets for MIMO OFDM systems. Wireless Personal Communications, 31(3–4):181–200, Dec. 2004. [170] R. Roy and T. Kailath. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(7):984–995, Jul. 1989. [171] P. Sansrimahachai, D. B. Ward, and A. G. Constantinides. Blind source separation for BLAST. In 14th International Conference on Digital Signal Processing, volume 1, pages 139–142, 1-3 Jul. 2002. [172] P. Sansrimahachai, D. B. Ward, and A. G. Constantinides. 
Multiple-input multiple-output least-squares constant modulus algorithms. In IEEE Global Telecommunications Conference, volume 4, pages 2084–2088, 1–5 Dec. 2003.

[173] L. Sarperi, A. K. Nandi, and Xu Zhu. Multiuser detection and channel estimation in MIMO OFDM systems via blind source separation. In Proc. 5th Int. Symposium on Independent Component Analysis and Blind Signal Separation, ICA 2004, pages 1189–1196, Granada, Spain, Sept. 2004. [174] L. Sarperi, X. Zhu, and A. K. Nandi. Low-complexity ICA based blind multiple-input multiple-output OFDM receivers. Neurocomputing, 69(13–15):1529–1539, 2006. [175] L. Sarperi, X. Zhu, and A. K. Nandi. Blind OFDM receiver based on independent component analysis for multiple-input multiple-output systems. IEEE Transactions on Wireless Communications, 6(11):4079–4089, Nov. 2007. [176] L. Sarperi, Xu Zhu, and A. K. Nandi. Blind layered space-time equalization for MIMO OFDM systems. In Proc. 13th European Signal Processing Conference, EUSIPCO 2005, 2005. [177] L. Sarperi, Xu Zhu, and A. K. Nandi. Reduced complexity blind layered space-time equalization for MIMO OFDM systems. In IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC 2005, pages 236–240, 11–14 Sept. 2005. [178] R. O. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276–280, Mar. 1986. [179] H. She and K. Hüper. Generalised FastICA for independent subspace analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, volume 4, pages 1409–1412, Apr. 2007. [180] C. Shin and E. J. Powers. Blind channel estimation for MIMO-OFDM systems using virtual carriers. In IEEE Global Telecommunications Conference, GLOBECOM '04, volume 4, pages 2465–2469, 29 Nov.–3 Dec. 2004. [181] A. I. Shtern. Almost periodic functions and representations in locally convex spaces. Russian Math. Surveys, 60(3):489–557, 2005. [182] S. T. Smith. Dynamical systems that perform singular value decomposition. Systems and Control Letters, 16(5):319–327, May 1991. [183] S. T. Smith.
Geometric optimization methods for adaptive filtering. PhD thesis, Harvard University, Cambridge, MA, May 1993.

[184] S. T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute Communications, American Mathematical Society, 3:113–136, 1994. [185] S. T. Smith. Linear and non-linear conjugate gradient methods for adaptive processing. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1834–1837, Atlanta, GA, USA, May 1996. [186] S. T. Smith. Subspace tracking with full rank updates. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers, volume 1, pages 793–797, 2-5 Nov. 1997. [187] S. T. Smith. Optimum phase-only adaptive nulling. IEEE Transactions on Signal Processing, 47(7):1835–1843, Jul. 1999. [188] S. T. Smith. Covariance, subspace, and intrinsic Cramér-Rao bounds. IEEE Transactions on Signal Processing, 53(5):1610–1630, May 2005. [189] S. T. Smith. Statistical resolution limits and the complexified Cramér-Rao bound. IEEE Transactions on Signal Processing, 53(5):1597–1609, May 2005. [190] J. H. Son and D. B. Williams. A blind symbol recovery for dual antenna OFDM systems. In IEEE 10th Digital Signal Processing Workshop, pages 29–34, 13-16 Oct. 2002. [191] A. Srivastava and E. Klassen. Monte Carlo extrinsic estimators of manifold-valued parameters. IEEE Transactions on Signal Processing, 50(2):299–308, Feb. 2002. [192] P. Stoica and D. A. Linebarger. Optimization result for constrained beamformer design. In IEEE Signal Processing Letters, volume 2, pages 66–67, Apr. 1995. [193] V. Tarokh, N. Seshadri, and A. R. Calderbank. Space-time codes for high data rate wireless communication: Performance criterion and code construction. IEEE Transactions on Information Theory, 44:744–765, Mar. 1998. [194] I. E. Telatar. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications, 10(6):585–595, Nov. 1999. [195] A. Touzni, L. Tong, R. A. Casas, and C. R. Johnson, Jr. Vector-CM stable equilibrium analysis.
IEEE Signal Processing Letters, 7(2):31–33, Feb. 2000.

[196] C. Udrişte. Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and Its Applications. Kluwer Academic Publishers Group, Boston, MA, 1994. [197] W. Utschick and C. Brunner. Efficient tracking and feedback of DL-eigenbeams in WCDMA. In Proceedings of the 4th European Personal Mobile Communications Conference, Vienna, Austria, 2001. [198] A. van den Bos. Complex gradient and Hessian. IEE Vision, Image and Signal Processing, 141(6):380–383, Dec. 1994. [199] A.-J. van der Veen. Algebraic methods for deterministic blind beamforming. Proceedings of the IEEE, 86(10):1987–2008, 1998. [200] R. van Nee and R. Prasad. OFDM for Wireless Multimedia Communications. Artech House, 2000. [201] B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine, 5(5):4–24, Apr. 1988. [202] S. Visuri and V. Koivunen. Resolving ambiguities in subspace-based blind receiver for MIMO channels. In Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, volume 1, pages 589–593, 3-6 Nov. 2002. [203] E. Viterbo and J. Boutros. A universal lattice code decoder for fading channels. IEEE Transactions on Information Theory, 45:1639–1642, Jul. 1999. [204] Hao Wang, Ying Lin, and Biao Chen. Data-efficient blind OFDM channel estimation using receiver diversity. IEEE Transactions on Signal Processing, 51(10):2613–2623, Oct. 2003. [205] L. Wang, J. Karhunen, and E. Oja. A bigradient optimization approach for robust PCA, MCA and source separation. In Proceedings of IEEE Conference on Neural Networks, volume 4, pages 1684–1689, 27 Nov.-1 Dec. 1995. [206] X. M. Wang, W.-S. Lu, and A. Antoniou. Blind adaptive multiuser detection using a vector constant-modulus approach. In Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, volume 1, pages 36–40, Pacific Grove, CA, 4–7 Nov. 2001. [207] Z. Wang and G. B. Giannakis. Wireless multicarrier communications. IEEE Signal Processing Magazine, 17(3):29–48, May 2000.

[208] F. W. Warner. Foundations of differentiable manifolds and Lie groups. Graduate Texts in Mathematics. Springer-Verlag New York, LLC, Oct. 1983. [209] M. Wax and Y. Anu. A new least squares approach to blind beamforming. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3477–3480, 21-24 Apr. 1997. [210] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela. V-BLAST: an architecture for realizing very high data rates over the rich-scattering channel. In Proc. of 1998 URSI International Symposium on Signals, Systems and Electronics, ISSSE 98, pages 295–300, 29 Sept.–2 Oct. 1998. [211] C. Wu, D. Gao, and L. Zhang. The normal matrix approach to the design of multivariable robust control systems. In IEEE TENCON, pages 203–207, Beijing, 1993. [212] J. Xavier and V. Barroso. The Riemannian geometry of certain parameter estimation problems with singular Fisher information matrix. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '04, volume 2, pages 1021–1024, Montreal, Canada, May 2004. [213] J. Xavier and V. Barroso. Intrinsic Variance Lower Bound (IVLB): An extension of the Cramér-Rao bound to Riemannian manifolds. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05, volume 5, pages 1033–1036, Mar. 2005. [214] I. Yamada and T. Ezaki. An orthogonal matrix optimization by dual Cayley parametrization technique. In Proceedings of ICA, pages 35–40, 2003. [215] B. Yang. Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43(1):95–107, Jan. 1995. [216] J.-F. Yang and M. Kaveh. Adaptive eigensubspace algorithms for direction or frequency estimation and tracking. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):241–251, Feb. 1988. [217] V. Y. Yang and D. L. Jones. A vector constant modulus algorithm for shaped constellation equalization.
In Thirty-First Asilomar Conference on Signals, Systems and Computers, volume 1, pages 590–594, Pacific Grove, CA, 2–5 Nov. 1997. [218] V. Y. Yang and D. L. Jones. A vector constant modulus algorithm for shaped constellation equalization. IEEE Signal Processing Letters, 5(4):89–91, Apr. 1998.

[219] Y. Yang. Optimization on Riemannian manifold. In Proceedings of the 38th Conference on Decision & Control, pages 888–893, Phoenix, Arizona USA, Dec. 1999. [220] Y. Yang. Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization. Journal of Optimization Theory and Applications, 132(2):245–265, Feb. 2007. [221] S. Yatawatta and A. P. Petropulu. Blind channel estimation in MIMO OFDM systems. In IEEE Workshop on Statistical Signal Processing, pages 363–366, 28 Sept.–1 Oct. 2003. [222] S. Yatawatta and A. P. Petropulu. Blind channel estimation in MIMO OFDM systems with multiuser interference. IEEE Transactions on Signal Processing, 54(3):1054–1068, Mar. 2006. [223] C. Yi. Robustness analysis and controller design for systems with structured uncertainties. PhD thesis, K. U. Leuven, May 1995. [224] A. Zanna and H. Z. Munthe-Kaas. Generalized polar decomposition for the approximation of the matrix exponential. SIAM Journal on matrix Analysis, 23(3):840–862, Jan. 2002. [225] J. Zeng. The first sign change of a cosine polynomial. Proceedings of the American Mathematical Society, 111(3):709–716, Mar. 1991. [226] Y. Zeng, W. H. Lam, and T.-S. Ng. Semiblind channel estimation and equalization for MIMO space-time coded OFDM. IEEE Transactions on Circuits and Systems I, 53(2):463–474, Feb. 2006. [227] Y. Zeng and T.-S. Ng. A proof of the identifiability of a subspace-based blind channel estimation for OFDM systems. IEEE Signal Processing Letters, 11(9):756–759, Sept. 2004. [228] Y. Zeng and T.-S. Ng. A semi-blind channel estimation method for multiuser multiantenna OFDM systems. IEEE Transactions on Signal Processing, 52(5):1419–1429, May 2004. [229] L. Zhang. Conjugate gradient approach to blind separation of temporally correlated signals. In IEEE International Conference on Communications, Circuits and Systems, ICCCAS-2004, volume 2, pages 1008–1012, Chengdu, China, 2004. [230] L. Q. Zhang, A. 
Cichocki, and S.-I. Amari. Natural gradient algorithm for blind separation of overdetermined mixture with additive noise. IEEE Signal Processing Letters, 6(11):293–295, Nov. 1999.

[231] L. Zheng and D. N. C. Tse. Communication on the Grassmann manifold: a geometric approach to the noncoherent multiple-antenna channel. IEEE Transactions on Information Theory, 48(2):359–383, Feb. 2002. [232] S. Zhou and G. B. Giannakis. Finite-alphabet based channel estimation for OFDM and related multicarrier systems. IEEE Transactions on Communications, 49(8):1402–1414, 2001.


Original Publications

Errata

[Publication I]
• p. 1136, first column, line 17: the equation within the text should read w_{k+1} = w_k exp(0.4µ ℑ{w_k}).
• p. 1137, first column, lines 5, 10, 15: the equations within the text should read C(0) = W.

[Publication III]
• p. 243, second column, line 3: should read "...requires 5n³ operations...".
• p. 243, second column, line 5: should read "...order of ω ≤ 5...".

[Publication V]
• p. 2355, first column, Table 2. The second row should read:

  2. Compute the Riemannian gradient direction G_k and the search direction H_k:
     if (k modulo n²) == 0
        Γ_k = ∂J/∂W* (W_k)
        G_k = Γ_k W_k^H − W_k Γ_k^H
        H_k := G_k
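The corrected table row above can also be checked numerically. A minimal Python/NumPy sketch follows; the cost function J(W) = ||W − A||²_F and its Euclidean derivative Γ = ∂J/∂W* = W − A are hypothetical placeholders chosen only so the example is self-contained, not taken from the publication:

```python
import numpy as np

def riemannian_gradient(W, euclidean_grad):
    """Translated Riemannian gradient on U(n): G = Gamma W^H - W Gamma^H."""
    Gamma = euclidean_grad(W)
    return Gamma @ W.conj().T - W @ Gamma.conj().T

# Hypothetical cost J(W) = ||W - A||_F^2, so Gamma = dJ/dW* = W - A.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
euclidean_grad = lambda W: W - A

# A random unitary starting point (QR of a random complex matrix).
W = np.linalg.qr(rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))[0]
G = riemannian_gradient(W, euclidean_grad)

# G must be skew-Hermitian, i.e., an element of the Lie algebra u(n).
assert np.allclose(G, -G.conj().T)
```

Whatever the cost function, G = ΓW^H − WΓ^H is skew-Hermitian by construction, which is exactly what makes the subsequent matrix-exponential update remain on U(n).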

[Publication VII]
• p. 1156, eq. (23) should read:

      L_g + 1 ≥ ⌈ K L_h / (Q − K) ⌉.      (23)
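As a numeric sanity check of the corrected bound, with hypothetical parameter values K = 2, L_h = 6, Q = 8 chosen only for illustration:

```python
import math

K, L_h, Q = 2, 6, 8               # hypothetical system parameters
bound = math.ceil(K * L_h / (Q - K))
L_g_min = bound - 1               # smallest L_g satisfying L_g + 1 >= bound
print(L_g_min)                    # -> 1
```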


[Publication I] T. Abrudan, J. Eriksson, V. Koivunen, "Steepest Descent Algorithms for Optimization under Unitary Matrix Constraint", IEEE Transactions on Signal Processing, vol. 56, no. 3, Mar. 2008, pp. 1134–1147. ©2008 IEEE. Reprinted with permission.


IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 3, MARCH 2008

Steepest Descent Algorithms for Optimization Under Unitary Matrix Constraint Traian E. Abrudan, Student Member, IEEE, Jan Eriksson, Member, IEEE, and Visa Koivunen, Senior Member, IEEE

Abstract—In many engineering applications we deal with constrained optimization problems with respect to complex-valued matrices. This paper proposes a Riemannian geometry approach for optimization of a real-valued cost function J of a complex-valued matrix argument W, under the constraint that W is an n × n unitary matrix. We derive steepest descent (SD) algorithms on the Lie group of unitary matrices U(n). The proposed algorithms move towards the optimum along the geodesics, but other alternatives are also considered. We also address the computational complexity and the numerical stability issues considering both the geodesic and the nongeodesic SD algorithms. The Armijo step size adaptation rule [1] is used similarly to [2], but with reduced complexity. The theoretical results are validated by computer simulations. The proposed algorithms are applied to blind source separation in MIMO systems by using the joint diagonalization approach [3]. We show that the proposed algorithms outperform other widely used algorithms. Index Terms—Array processing, optimization, source separation, subspace estimation, unitary matrix constraint.

I. INTRODUCTION

Constrained optimization problems arise in many signal processing applications. One common task is to minimize a cost function with respect to a matrix, under the constraint that the matrix has orthonormal columns. Some typical applications in communications and array signal processing are subspace tracking [4]–[6], blind and constrained beamforming [7]–[9], high-resolution direction finding (e.g., MUSIC and ESPRIT), and generally all subspace-based methods. Another straightforward application is independent component analysis (ICA) [3], [10]–[19]. This type of optimization problem has also been considered in the context of multiple-input multiple-output (MIMO) communication systems [6], [20]–[23]. Most of the existing optimization algorithms are derived for the real-valued case and orthogonal matrices [10], [11], [13]–[15], [17], [24]–[27]. Very often in communications and signal processing applications we are dealing with complex matrices and signals. Consequently, the optimization needs to be performed under a unitary matrix constraint. Commonly, optimization algorithms employing an orthogonal/unitary matrix constraint minimize a cost function on


Manuscript received January 30, 2007; revised August 13, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sergiy Vorobyov. This work was supported in part by the Academy of Finland and by the GETA Graduate School. The authors are with the SMARAD CoE, Signal Processing Laboratory, Department of Electrical Engineering, Helsinki University of Technology, FIN-02015 HUT, Finland (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TSP.2007.908999

the space of n × n matrices using a classical steepest descent (SD) algorithm. A separate orthogonalization procedure must be applied after each iteration [12], [20]–[22]. Approaches stemming from the Lagrange multipliers method have also been used to solve such problems [16]. In such approaches, the error criterion contains an extra term that penalizes for the deviations from the orthogonality property. Self-stabilized algorithms have been developed to provide more accurate, but still approximate solutions [17]. Major improvements over the classical methods are obtained by taking into account the geometrical aspects of the optimization problem. Pioneering work by Luenberger [28] and Gabay [29] converts the constrained optimization problem into an unconstrained one on an appropriate differentiable manifold. An extensive treatment of optimization algorithms with orthogonality constraints was given later by Edelman et al. [24] in a Riemannian geometry context. A non-Riemannian approach has been proposed in [2], which is a general framework for optimization under unitary matrix constraint. A more detailed literature review is presented in Section II. In this paper we derive two generic algorithms stemming from differential geometry. They optimize a real-valued cost function J(W) with respect to a complex-valued matrix W satisfying W^H W = I, i.e., they perform optimization subject to the unitary matrix constraint. SD algorithms operating on the Lie group of unitary matrices U(n) are proposed. They move towards the optimum along the locally shortest paths, i.e., geodesics. Geodesics on Riemannian manifolds correspond to straight lines in Euclidean space. Our motivation to opt for the geodesic algorithms is that on the Lie group of unitary matrices U(n), the geodesics have simple expressions described by the exponential map. We can fully exploit recent developments in computing the matrix exponential needed in the multiplicative update on U(n). The generalized polar decomposition [30] proves to be one of the most computationally efficient methods if implemented in a parallel architecture, or the Cayley transform (CT) [31] otherwise. We also consider other parametrizations proposed in the literature and show that all these parametrizations are numerically equivalent up to a certain approximation order. However, the algorithms differ in terms of computational complexity, which is also addressed in this paper. The proposed generic geodesic algorithms, unlike other parametrizations, can be relatively easily adapted to different problems with varying complexity and strictness of the unitarity requirements. This is due to the fact that the computation of the matrix exponential function employed in the proposed algorithms is a well-researched problem [32] with some recent progress relevant to the unitary optimization [30]. Moreover, we show that the expo-


nential map is well suited for adapting the step size for the SD method on the unitary group.

This paper is organized as follows. In Section II, an overview of the problem of optimization under unitary matrix constraint is provided. A brief review of different approaches presented in the literature is given as well. A simple geometric example is used to illustrate the differences among various approaches. In Section III, we derive the Riemannian gradient on the Lie group of unitary matrices and the corresponding SD algorithms. Equivalence relationships between the proposed algorithms and other algorithms are established in Section IV. The computational complexity and the numerical stability issues are studied in Sections V and VI, respectively. Simulation results are presented in Section VII. The proposed algorithms are used to solve the unitary matrix optimization problem encountered in the joint approximate diagonalization of eigenmatrices (JADE) algorithm [3], which is applied for blind source separation in a MIMO system. Finally, Section VIII concludes the paper.

II. OPTIMIZATION UNDER UNITARY MATRIX CONSTRAINT

In this section, a brief overview of optimization methods under orthonormal or unitary matrix constraint is provided. Different approaches are reviewed and the key properties of each approach are briefly studied. A simple example is presented to illustrate how each algorithm searches for the optimum.

A. Overview

Most classical optimization methods with a unitary matrix constraint operate on the Euclidean space by using an SD algorithm. The unitary property of the matrix is lost in every iteration, and it needs to be restored in each step. Moreover, the convergence speed is reduced. Other algorithms use a Lagrangian type of optimization, by adding an extra-penalty function which penalizes the deviation from unitarity [16]. These methods suffer from slow convergence and find only an approximate solution in terms of orthonormality.
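The restore-after-each-step approach described above can be sketched as follows; this is a minimal Python/NumPy illustration, not code from the paper. The toy cost J(W) = ||W − A||²_F, its gradient, and the step size are hypothetical placeholders, and the re-unitarization uses the closest unitary matrix in the Frobenius norm, computed here from the SVD (the unitary polar factor):

```python
import numpy as np

def project_to_unitary(W):
    """Closest unitary matrix in Frobenius norm: U V^H from the SVD W = U S V^H."""
    U, _, Vh = np.linalg.svd(W)
    return U @ Vh

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
grad = lambda W: W - A            # hypothetical Euclidean gradient dJ/dW*
W = np.eye(3, dtype=complex)
mu = 0.1
for _ in range(50):
    W = W - mu * grad(W)          # additive SD step leaves U(n) ...
    W = project_to_unitary(W)     # ... so unitarity must be restored each step
assert np.allclose(W.conj().T @ W, np.eye(3))
```

The point of the sketch is the extra projection on every iteration: without it, the additive update drifts off the constraint surface, which is exactly the drawback the Riemannian algorithms in this paper avoid.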
Self-stabilized algorithms provide more accurate solutions [17], [33]. A major drawback of the classical Euclidean SD and Lagrange type of algorithms [12], [16], [20]–[22] is that they do not take into account the special structure of the parameter space where the cost function needs to be optimized. The constrained optimization problem may be formulated as an unconstrained one in a different parameter space called a manifold. Therefore, the space of unitary matrices is considered to be a "constrained surface." Optimizing a cost function on a manifold is often considered [10], [14], [24], [25], [29], [34] as a problem of Riemannian geometry [35]. Algorithms more general than the traditional Riemannian approach are considered in [2]. The second important aspect neglected in classical algorithms is that the n × n unitary matrices are algebraically closed under the multiplication operation, not under addition. Therefore, they form a group under the multiplication operation, which is the Lie group of unitary matrices U(n) [36]. Consequently, by using an iterative algorithm based on an additive update, the unitarity property is lost after each iteration. Even though we are moving along a straight line pointing in the right direction,


we depart from the constrained surface in each step. This happens because a Riemannian manifold is a "curved space." The locally length-minimizing curve between two points on the Riemannian manifold is called a geodesic, and it is not a straight line like in the Euclidean space. Several authors [10], [11], [13], [14], [24], [25], [28], [29], [34], [37], [38] have proposed that the search for the optimum should proceed along the geodesics of the constrained surface. Relevant work in Riemannian optimization algorithms may be found in [24], [29], [34], and [38]–[41]. Algorithms considering the real-valued Stiefel and/or Grassmann manifolds have been proposed in [10], [11], [15], [17], [24]–[26], and [42]. Edelman et al. [24] consider the problem of optimization under orthonormal constraints. They propose SD, conjugate gradient, and Newton algorithms along geodesics on Stiefel and Grassmann manifolds. A general framework for optimization under unitary matrix constraints is presented in [2]. It does not follow the traditional Riemannian optimization approach. A modified SD algorithm, coupled with Armijo's step size adaptation rule [1], and a modified Newton algorithm are proposed for optimization on both the complex Stiefel and the complex Grassmann manifold. These algorithms do not employ a geodesic motion, but geodesic motion could be used in the general framework. A local parametrization based on a Euclidean projection of the tangent space onto the manifold is used in [2]. Hence, the computational cost may be reduced. Moreover, it is suggested that the geodesic motion is not the only solution, since there is no direct connection between the Riemannian geometry of the Stiefel (or Grassmann) manifold (i.e., the "constrained surface") and an arbitrary cost function. The SD algorithms proposed in this paper operate on the Lie group of unitary matrices U(n). We have derived the Riemannian gradient needed in the optimization on U(n). We choose to follow a geodesic motion.
This is justified by the desirable property of U(n) that the right multiplication is an isometry with respect to the canonical bi-invariant metric [35]. This allows us to translate the descent direction at any point in the group to the identity element and exploit the fact that the tangent space at identity is the Lie algebra u(n) of skew-Hermitian matrices. This leads to lower computational complexity because the argument of the matrix exponential operation is skew-Hermitian. Novel methods for computing the matrix exponential for skew-symmetric matrices recently proposed in [30] and [32] may be exploited. Moreover, we show that using an adaptive step size according to Armijo's rule [1] fits the proposed algorithms very well.

B. Illustrative Example

We present a rather simple simulation example in order to illustrate how different algorithms operate under the unitary constraint. We consider the Lie group of unit-norm complex numbers U(1), which are the 1 × 1 unitary matrices. The unitary constraint is in this case the unit circle. We minimize a cost function J(w), subject to w ∈ U(1). Five different algorithms are considered. The first one is the unconstrained SD algorithm on the Euclidean space, with the corresponding additive update in the negative gradient direction, where µ is the step size. The second one is the same SD with enforcing the unit norm constraint, w_{k+1} ← w_{k+1}/|w_{k+1}|, after each iteration. The third method is similar to the bigradient method [16] derived from the Lagrange multiplier method. An extra-penalty term weighted by a parameter is added to the original cost function in order to penalize the deviation from the unit norm. The fourth SD algorithm operates on the right parameter space determined by the constraint. At each point the algorithm takes a direction tangent to the unit circle, with step size µ, and the resulting point is projected back to the closest point on the unit circle in terms of the Euclidean norm. The fifth algorithm is the multiplicative update SD algorithm derived in this paper. The corresponding update is a rotation, i.e., a multiplication by a unit-modulus complex number whose phase is proportional to the imaginary part of the current point w_k; the parameter µ represents the step size. The starting point is the same for all the algorithms (see Fig. 1). The point minimizing the cost function without the constraint is an undesired minimum because it does not satisfy the constraint. The desired optimum is the point on the unit circle where the constraint is satisfied and the cost function attains its minimum among the feasible points. We may notice in Fig. 1 that the unconstrained SD takes the SD direction in the Euclidean plane and goes straight to the undesired minimum. By enforcing the unit norm constraint, we project the current point radially onto the unit circle. The enforcing is necessary at every iteration in order to avoid the undesired minimum. The extra-penalty SD algorithm follows the unconstrained SD in the first iteration, since initially the extra-penalty term is equal to zero. It converges somewhere between the desired and the undesired minimum. The SD algorithm [2] on the space determined by the constraint takes in this case the SD direction tangent to the unit circle. The resulting point is projected to the closest point on the unit circle in terms of the Euclidean norm. The proposed SD algorithm uses a multiplicative update which is a phase rotation. The phase is proportional to the imaginary part of the complex number associated with the point. For this reason, the constraint is satisfied at every step in a natural way. Although this low-dimensional example is rather trivial, it has been included for illustrative purposes. In the case of multidimensional unitary matrices, a similar behavior is encountered.

Fig. 1. Minimization of a cost function on the unit circle U(1): Euclidean versus Riemannian SD methods.

III. ALGORITHM DERIVATION

In this section, we derive two generic SD algorithms on the Lie group of unitary matrices. Consider a real-valued cost function J of a complex n × n matrix W, i.e., J : C^{n×n} → R. Our goal is to minimize (or maximize) the function under the constraint that W^H W = W W^H = I, i.e., W is unitary. We proceed as follows. First, in Section III-A we describe the Lie group U(n) of unitary matrices, which is a real differentiable manifold. Moreover, we describe the real differentiation of functions defined in complex spaces in a way which is suitable for the optimization. In Section III-B, we introduce the Riemannian metric on the Lie group U(n). The definition of the gradient on the Riemannian space is intimately related to this metric. The Riemannian gradient is derived in Section III-C, and a basic generic optimization algorithm is given in Section III-D. Finally, a Riemannian SD algorithm with an adaptive step size is given in Section III-E.

A. Differentiation of Functions Defined in Complex Spaces

A Lie group is defined to be a differentiable manifold with a smooth, i.e., differentiable, group structure [36]. The Lie group of unitary matrices U(n) is a real differentiable manifold because it is endowed with a real differentiable structure. Therefore, we deal with a real-valued cost function essentially defined in a real parameter space. However, since the algebraic properties of the group are defined in terms of the complex field, it is convenient to operate directly with the complex representation of the matrices instead of using separately their real and imaginary parts, i.e., without using reals for representing the complex space [43], [44]. Now the real differentiation can be described by a pair of complex-valued operators defined in terms of real differentials with respect to the real and imaginary parts [43], [45]

    ∂/∂w = (1/2)(∂/∂x − j ∂/∂y)  and  ∂/∂w* = (1/2)(∂/∂x + j ∂/∂y),      (1)

with w = x + jy and x, y ∈ R. If a function is holomorphic (analytic), the first differential operator in (1) coincides with the complex differential and the second one is identically zero (the Cauchy-Riemann equations). It should be noted that a real-valued function is holomorphic only if it is a constant. Therefore, complex analyticity is irrelevant to optimization problems. The above representation is more compact, allows differentiation of complex-argument functions without


using reals for the representation, and it is appropriate for many applications [45]. B. Riemannian Structure on A differentiable function represents a (see Fig. 2). Let curve on the smooth manifold and let be the set of functions on that are differentiable is a function at . The tangent vector to the curve at given by

W

X

Fig. 2. Illustrative example representing the tangent space , and a tangent vector 2 T U (n ) .

(2)

T U (n) at point

satisfying for all the condition

A tangent vector at $\mathbf{W}$ is the tangent vector $\dot{\mathbf{Y}}(0)$ at $\varepsilon = 0$ of some curve $\mathbf{Y}(\varepsilon)$ with $\mathbf{Y}(0) = \mathbf{W}$. All the tangent vectors at $\mathbf{W}$ form the tangent space $T_{\mathbf{W}} U(n)$. The tangent space is a real vector space attached to every point in the differential manifold. It should be noted that the value of $\dot{\mathbf{Y}}(0)$ is independent of the choice of local coordinates (chart) and independent of the curve as long as $\mathbf{Y}(0) = \mathbf{W}$ [35]. Since the curve lies in $U(n)$, we have $\mathbf{Y}(\varepsilon)\mathbf{Y}^H(\varepsilon) = \mathbf{I}$. Differentiating both sides with respect to $\varepsilon$, the tangent space at $\mathbf{W}$ may be identified with the $n^2$-dimensional real vector space (i.e., it is a vector space isomorphic to)

$$T_{\mathbf{W}} U(n) = \{ \mathbf{X} \in \mathbb{C}^{n \times n} : \mathbf{X}\mathbf{W}^H + \mathbf{W}\mathbf{X}^H = \mathbf{0} \}. \qquad (3)$$

From (3), it follows that the tangent space of $U(n)$ at the group identity $\mathbf{I}$ is the real Lie algebra of skew-Hermitian matrices $\mathfrak{u}(n) = \{ \mathbf{S} \in \mathbb{C}^{n \times n} : \mathbf{S}^H = -\mathbf{S} \}$. We emphasize that $\mathfrak{u}(n)$ is not a complex vector space (Lie algebra), because the skew-Hermitian matrices are not closed under multiplication with complex scalars. For example, if $\mathbf{S}$ is a skew-Hermitian matrix, then $\jmath\mathbf{S}$ is Hermitian. Let $\mathbf{X}$ and $\mathbf{Y}$ be two tangent vectors, i.e., $\mathbf{X}, \mathbf{Y} \in T_{\mathbf{W}} U(n)$. The inner product in $T_{\mathbf{W}} U(n)$ is given by

$$\langle \mathbf{X}, \mathbf{Y} \rangle = \tfrac{1}{2}\, \Re\{ \operatorname{tr}(\mathbf{X}\mathbf{Y}^H) \}. \qquad (4)$$

This inner product induces a bi-invariant Riemannian metric on the Lie group [35]. We may define the normal space at $\mathbf{W}$ by considering that $U(n)$ is embedded in the ambient space $\mathbb{C}^{n \times n}$, equipped with the Euclidean metric. The normal space is the orthogonal complement of the tangent space with respect to the metric of the ambient space [24], i.e., for any $\mathbf{N} \in N_{\mathbf{W}} U(n)$ and $\mathbf{X} \in T_{\mathbf{W}} U(n)$, we have $\langle \mathbf{N}, \mathbf{X} \rangle = 0$. It follows that the normal space at $\mathbf{W}$ is given as

$$N_{\mathbf{W}} U(n) = \{ \mathbf{W}\mathbf{H} : \mathbf{H} = \mathbf{H}^H \}. \qquad (5)$$

C. The SD Direction on the Riemannian Space

We consider a differentiable cost function $\mathcal{J} : U(n) \to \mathbb{R}$. Intuitively, the SD direction is defined as "the direction where the cost function decreases the fastest per unit length." Having the Riemannian metric, we are now able to derive the Riemannian gradient we are interested in. A tangent vector $\tilde{\nabla}\mathcal{J}(\mathbf{W}) \in T_{\mathbf{W}} U(n)$ is the Riemannian gradient of $\mathcal{J}$ at $\mathbf{W}$ if it satisfies

$$\langle \Gamma_{\mathbf{W}}, \mathbf{X} \rangle_e = \langle \tilde{\nabla}\mathcal{J}(\mathbf{W}), \mathbf{X} \rangle \quad \text{for all } \mathbf{X} \in T_{\mathbf{W}} U(n), \qquad (6)$$

where $\Gamma_{\mathbf{W}} = \partial\mathcal{J}/\partial\mathbf{W}^*$ is the gradient of the cost function of complex argument in the Euclidean space, evaluated at $\mathbf{W}$; this direction represents the steepest ascent direction in the Euclidean sense [45]. The left-hand side (LHS) in (6) represents an inner product in the ambient space, $\langle \mathbf{X}, \mathbf{Y} \rangle_e = \Re\{\operatorname{tr}(\mathbf{X}\mathbf{Y}^H)\}$, whereas the right-hand side (RHS) represents the Riemannian inner product (4) at $\mathbf{W}$. Equation (6) may be written as

$$\langle 2\Gamma_{\mathbf{W}} - \tilde{\nabla}\mathcal{J}(\mathbf{W}), \mathbf{X} \rangle = 0 \quad \text{for all } \mathbf{X} \in T_{\mathbf{W}} U(n). \qquad (7)$$

Equation (7) shows that the difference $2\Gamma_{\mathbf{W}} - \tilde{\nabla}\mathcal{J}(\mathbf{W})$ is orthogonal to all $\mathbf{X} \in T_{\mathbf{W}} U(n)$. Therefore, it lies in the normal space (5), i.e.,

$$\tilde{\nabla}\mathcal{J}(\mathbf{W}) = 2\Gamma_{\mathbf{W}} - \mathbf{W}\mathbf{H}, \qquad (8)$$

where $\mathbf{H}$ is a Hermitian matrix determined by imposing the condition that $\tilde{\nabla}\mathcal{J}(\mathbf{W}) \in T_{\mathbf{W}} U(n)$. From (3) it follows that

$$\mathbf{H} = \mathbf{W}^H \Gamma_{\mathbf{W}} + \Gamma_{\mathbf{W}}^H \mathbf{W}. \qquad (9)$$

From (8) and (9), we get the expression for the Hermitian matrix $\mathbf{H}$, and the gradient of the cost function on the Lie group of unitary matrices at $\mathbf{W}$ may be written by using (8) as follows:

$$\tilde{\nabla}\mathcal{J}(\mathbf{W}) = \Gamma_{\mathbf{W}} - \mathbf{W}\Gamma_{\mathbf{W}}^H \mathbf{W}. \qquad (10)$$

D. Moving Towards the SD Direction in $U(n)$

Here, we introduce a generic Riemannian SD algorithm along geodesics on the Lie group of unitary matrices $U(n)$. A geodesic curve on a Riemannian manifold is defined as a curve for which the second derivative is zero or lies in the normal space for all $t$ (i.e., the acceleration vector stays normal to the direction of motion as long as the curve is traced with constant speed). Locally, the geodesics minimize the path length with respect to the Riemannian metric ([35, p. 67]). A geodesic emanating from the identity with a velocity $\mathbf{S} \in \mathfrak{u}(n)$ is characterized by the exponential map:

$$\mathcal{G}(t) = \exp(t\mathbf{S}). \qquad (11)$$
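As a sanity check on the expressions above, the following short NumPy sketch (our illustration, not code from the paper) verifies numerically that the Riemannian gradient (10) lies in the tangent space (3), and that its right-translation to the identity is skew-Hermitian, i.e., belongs to $\mathfrak{u}(n)$. The matrices `W` and `Gamma` are random stand-ins for a unitary point and a Euclidean gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
H = lambda A: A.conj().T  # Hermitian transpose

# Random unitary W (QR of a random complex matrix) and random Euclidean gradient
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
Gamma = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Riemannian gradient at W, as in (10)
grad = Gamma - W @ H(Gamma) @ W

# Tangency condition (3): X W^H + W X^H = 0, satisfied up to round-off
print(np.linalg.norm(grad @ H(W) + W @ H(grad)))

# Translated to the identity: G = Gamma W^H - W Gamma^H is skew-Hermitian
G = Gamma @ H(W) - W @ H(Gamma)
print(np.linalg.norm(G + H(G)))
```

Both printed norms are zero up to machine precision, confirming that the gradient respects the constraint surface.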

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 3, MARCH 2008

Fig. 3. The geodesic emanating from the identity in the direction of $-\mathbf{G}$, ending at $\mathbf{P} = \exp(-\mu\mathbf{G})$.

The exponential of complex matrices is given by the convergent power series $\exp(\mathbf{S}) = \sum_{k=0}^{\infty} \mathbf{S}^k / k!$. The optimization of $\mathcal{J}(\mathbf{W})$ is carried out along geodesics on the constraint surface. For the optimization, we need the equation of the geodesic emanating from $\mathbf{W}$. This may be found by taking into account the fact that the right translation in $U(n)$ is an isometry with respect to the metric given by (4), and an isometry maps geodesics to geodesics [10], [35]. Therefore, the geodesic emanating from $\mathbf{W}$ is the right-translated geodesic emanating from the identity, i.e.,

$$\mathcal{G}_{\mathbf{W}}(t) = \exp(t\mathbf{S})\,\mathbf{W}. \qquad (12)$$

Consequently, we need to translate the gradient of the cost function at $\mathbf{W}$ (10) to the identity, i.e., into the Lie algebra $\mathfrak{u}(n)$. Since the differential of the right translation is a vector space isomorphism, this is performed simply by postmultiplying by $\mathbf{W}^H$, i.e.,

$$\mathbf{G} \triangleq \tilde{\nabla}\mathcal{J}(\mathbf{W})\,\mathbf{W}^H = \Gamma_{\mathbf{W}}\mathbf{W}^H - \mathbf{W}\Gamma_{\mathbf{W}}^H. \qquad (13)$$

We have to keep in mind that this is not the Riemannian gradient of the cost function evaluated at the identity. The tangent vector $\mathbf{G}$ is the Riemannian gradient of the cost function evaluated at $\mathbf{W}$ and translated to the identity. Note that the argument of the matrix exponential operation is skew-Hermitian. We exploit this very important property later in this paper in order to reduce the computational complexity. The cost function $\mathcal{J}(\mathbf{W})$ may be minimized iteratively by using a geodesic motion. Typically, we start at $\mathbf{W}_0 = \mathbf{I}$. We choose the direction in $\mathfrak{u}(n)$ to be the negative direction of the gradient (13), i.e., $-\mathbf{G}_k$. Moving from $\mathbf{W}_k$ to $\mathbf{W}_{k+1}$ is equivalent to moving from $\mathbf{I}$ to $\mathbf{P} = \exp(-\mu\mathbf{G}_k)$, as shown in Fig. 3. The geodesic motion from $\mathbf{W}_k$ to $\mathbf{W}_{k+1}$ in $U(n)$ corresponds to the multiplication by a rotation matrix $\mathbf{P}$. The parameter $\mu > 0$ controls the magnitude of the tangent vector and, consequently, the algorithm convergence speed. The update corresponding to the SD algorithm along geodesics on $U(n)$ is given by

$$\mathbf{W}_{k+1} = \exp(-\mu\mathbf{G}_k)\,\mathbf{W}_k. \qquad (14)$$
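To make the update (14) concrete, here is a minimal sketch of the basic geodesic SD iteration on a toy cost $\mathcal{J}(\mathbf{W}) = \|\mathbf{W} - \mathbf{B}\|_F^2$ with a unitary target $\mathbf{B}$, for which the Euclidean gradient is $\Gamma = \mathbf{W} - \mathbf{B}$. The target, step size, and seed are illustrative assumptions, not from the paper; the exponential of the skew-Hermitian argument is computed exactly via an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
H = lambda A: A.conj().T

def expm_skew(S):
    # Exact exponential of a skew-Hermitian S: 1j*S is Hermitian, so
    # exp(S) = V diag(exp(-1j*d)) V^H with (d, V) = eigh(1j*S).
    d, V = np.linalg.eigh(1j * S)
    return V @ np.diag(np.exp(-1j * d)) @ H(V)

# Unitary target B near the identity (toy problem)
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = expm_skew(0.2 * (A - H(A)))

W = np.eye(n, dtype=complex)            # initialization, W_0 = I
mu = 0.2                                 # fixed step size (Table I variant)
for _ in range(100):
    Gamma = W - B                        # Euclidean gradient dJ/dW*
    G = Gamma @ H(W) - W @ H(Gamma)      # translated Riemannian gradient (13)
    W = expm_skew(-mu * G) @ W           # rotational update (14)

print(np.linalg.norm(W - B))                  # essentially zero: converged to B
print(np.linalg.norm(W @ H(W) - np.eye(n)))   # essentially zero: W stays unitary
```

In the self-tuning variant (Table II), `mu` would be doubled or halved by the Armijo rule; doubling only requires squaring the rotation matrix.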

TABLE I THE BASIC RIEMANNIAN SD ALGORITHM ON U(n)

The algorithm is summarized in Table I. Practical algorithms require the computation of the exponential map, which is addressed in Section V.

E. A Self-Tuning Riemannian SD Algorithm on U(n)

An optimal value of the step size $\mu$ is difficult to determine in practice. Moreover, it is cost-function dependent, and the appropriate step size may change at each iteration. The SD algorithm with a fixed small step size converges in general close to a local minimum. It trades off between high convergence speed, which requires a large step size, and low steady-state error, which requires a small step size. An adaptive step size is often a desirable choice. In [27], a projection algorithm is considered together with three other optimization alternatives along geodesics in the real case, i.e., on the orthogonal group. The first geodesic algorithm in [27] uses a fixed step size, which leads to the "real-valued counterpart" of the algorithm in Table I. The second one is a geodesic search for computing the step size in the update equation. If the geodesic search is performed in a continuous domain [39], [40], it is computationally very expensive since it involves solving differential equations. A discretized version of the geodesic search may be employed; two such methods are reviewed in [27]. The third alternative is a stochastic type of algorithm which adds a perturbation to the search direction. We opt for a fourth alternative, based on the Armijo step size rule [1]. It allows reducing the computational complexity and gives good local performance. This type of algorithm takes an initial step along the geodesic. Then, two other possibilities are checked by evaluating the cost function for the case of doubling or halving the step size. The doubling or halving step continues

ABRUDAN et al.: SD ALGORITHMS FOR OPTIMIZATION

TABLE II THE SELF-TUNING RIEMANNIAN SD ALGORITHM ON U(n)

as long as the step size is outside a range whose limits are set by two inequalities. It is known that in a stationary scenario (i.e., when the matrices involved in the cost function are time invariant), the SD algorithm together with the Armijo step size rule [1] almost always converges to a local minimum, provided that it is not initialized at a stationary point. The convergence properties of the geodesic SD algorithm using the Armijo rule have been established in [29], [46] for general Riemannian manifolds, provided that the cost function is continuously differentiable and has bounded level sets. The first condition is an underlying assumption in this paper, and the second one is ensured by the compactness of $U(n)$.

In [2], an SD algorithm is coupled with the Armijo rule for optimizing the step size. Geodesic motion is not used; nevertheless, in the general framework proposed in [2], it could be. We show that by using the Armijo rule together with the generic SD algorithm along geodesics, the computational complexity is reduced by exploiting the properties of the exponential map, as will be shown later. The generic SD algorithm with adaptive step size selection is summarized in Table II. The choice of method for computing the matrix exponential is explained in Section V.

Algorithm Description: The algorithm consists of the following steps.
• Step 1—Initialization: A typical initial value is $\mathbf{W}_0 = \mathbf{I}$. If the gradient $\mathbf{G}_0 = \mathbf{0}$, then the identity element is a stationary point. In that case, a different initial value may be chosen.
• Steps 2–3—Gradient computation: The Euclidean gradient and the Riemannian gradient are computed.
• Step 4—Setting the threshold for the final error: Evaluate the squared norm of the Riemannian gradient in order to check whether we are sufficiently close to the minimum of the cost function. The residual error may be set to a value close to the smallest value available in the limited-precision environment, or to the highest value which can be tolerated in the task at hand.
• Step 5—Rotation matrix computation: This step requires the computation of the rotation matrix $\mathbf{P}_k = \exp(-\mu\mathbf{G}_k)$. The rotation matrix corresponding to the doubled step size may be computed just by squaring $\mathbf{P}_k$, because $\exp(-2\mu\mathbf{G}_k) = \mathbf{P}_k^2$. Therefore, when doubling the step size, instead of computing a new matrix exponential, only a matrix squaring operation is needed. It is important to mention that the squaring operation is a very stable operation [32], being also used in software packages for computing the matrix exponential.
• Steps 6 and 7—Step size evaluation: In every iteration, we check whether the step size is in the appropriate range determined by the two inequalities. The step size evolves in a dyadic fashion: if it is too small, it will be doubled, and if it is too large, it will be halved.
• Step 8—Update: The new update is obtained in a multiplicative manner, and a new iteration is started with Step 2 if the residual error is not sufficiently small.

Remark: The SD algorithm in Table II may easily be converted into a steepest ascent algorithm. The only differences are that the step size would be negative and the inequalities in Steps 6 and 7 need to be reversed.

IV. RELATIONS AMONG DIFFERENT LOCAL PARAMETRIZATIONS ON THE UNITARY GROUP

The proposed SD algorithms search for the minimum by moving along geodesics, i.e., the local parametrization is the exponential map. Other local parametrizations used to describe a small neighborhood of a point in the group have been proposed in [2] and [11]. In this section, we establish equivalence relationships among different local parametrizations of the unitary group $U(n)$. The first one is the exponential map used in the proposed SD algorithms, the second one is the Cayley transform [31], the third one is the Euclidean projection operator [2], and the fourth one is a parametrization based on the QR-decomposition. The four parametrizations lead to different update rules for the basic SD algorithm on $U(n)$. The update expressions may be described in terms of Taylor series expansions, and we prove the equivalence among all of them up to a certain approximation order.

A. The Exponential Map

The rotation matrix used in the update expression (14) of the proposed algorithm can be expressed as a Taylor series expansion of the matrix exponential, i.e., $\exp(-\mu\mathbf{G}_k) = \mathbf{I} - \mu\mathbf{G}_k + \frac{\mu^2}{2}\mathbf{G}_k^2 - \cdots$. The update is equivalent to

$$\mathbf{W}_{k+1} = \Big( \mathbf{I} - \mu\mathbf{G}_k + \frac{\mu^2}{2}\mathbf{G}_k^2 - \cdots \Big)\mathbf{W}_k. \qquad (15)$$

B. The Cayley Transform

If the rotation matrix in the update is computed by using the CT [31] instead of the matrix exponential, then $\mathbf{P}_k = \big( \mathbf{I} + \frac{\mu}{2}\mathbf{G}_k \big)^{-1} \big( \mathbf{I} - \frac{\mu}{2}\mathbf{G}_k \big)$. The corresponding update equation is

$$\mathbf{W}_{k+1} = \Big( \mathbf{I} - \mu\mathbf{G}_k + \frac{\mu^2}{2}\mathbf{G}_k^2 - \frac{\mu^3}{4}\mathbf{G}_k^3 + \cdots \Big)\mathbf{W}_k. \qquad (16)$$

Obviously, (16) is equivalent to (15) up to the second order. Notice also that the CT equals the first-order diagonal Padé approximation of the matrix exponential (see [32]).
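The second-order agreement between the exponential map and the CT, and the exact unitarity of the CT for a skew-Hermitian argument, can be checked numerically. The sketch below is our illustration (random direction, illustrative step sizes); the exact exponential is computed via an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
H = lambda A: A.conj().T

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
G = 0.3 * (A - H(A))                     # skew-Hermitian direction in u(n)

def expm_skew(S):
    # Exact exponential of skew-Hermitian S via eigh of the Hermitian 1j*S
    d, V = np.linalg.eigh(1j * S)
    return V @ np.diag(np.exp(-1j * d)) @ H(V)

def cayley(S):
    # Cayley transform = (1,1) diagonal Pade approximant of exp(S);
    # exactly unitary when S is skew-Hermitian.
    I = np.eye(S.shape[0])
    return np.linalg.solve(I - 0.5 * S, I + 0.5 * S)

for mu in (0.2, 0.1, 0.05):
    err = np.linalg.norm(expm_skew(-mu * G) - cayley(-mu * G))
    print(mu, err)                        # error shrinks roughly 8x per halving

C = cayley(-0.2 * G)
print(np.linalg.norm(C @ H(C) - np.eye(n)))   # unitary up to round-off
```

The cubic shrinkage of the error reflects the third-order mismatch between (15) and (16), while the last line confirms that the CT never leaves the constraint surface.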
C. The Euclidean Projection Map

Another possibility is to use a Euclidean projection map as the local parametrization, as in [2]. This map projects an arbitrary matrix $\mathbf{V} \in \mathbb{C}^{n \times n}$ onto $U(n)$ at the point $\pi(\mathbf{V})$ which is the closest to $\mathbf{V}$ in terms of the Euclidean norm, i.e., $\pi(\mathbf{V}) = \arg\min_{\mathbf{W} \in U(n)} \|\mathbf{V} - \mathbf{W}\|_F$. The unitary matrix minimizing the above norm can be obtained from the polar decomposition of $\mathbf{V}$, as in [47],

$$\pi(\mathbf{V}) = \mathbf{U}\mathbf{Z}^H, \qquad (17)$$

where $\mathbf{U}$ and $\mathbf{Z}$ are the left and the right singular vectors of $\mathbf{V}$, respectively. Equivalently,

$$\pi(\mathbf{V}) = \mathbf{V}(\mathbf{V}^H\mathbf{V})^{-1/2}. \qquad (18)$$

Equation (18) is also known as the "symmetric orthogonalization" procedure [12]. In [2], the local parametrizations are more general in the sense that they are chosen for the Stiefel and Grassmann manifolds. The projection operation is computed via the SVD as in (17). For the SD algorithm on the Stiefel manifold [2], the update is of the form $\mathbf{W}_{k+1} = \pi(\mathbf{W}_k - \mu\mathbf{D}_k)$, where $\mathbf{D}_k$ is the SD direction on the manifold and $\mu$ is the step size. According to (18), the update is equivalent to $\mathbf{W}_{k+1} = (\mathbf{W}_k - \mu\mathbf{D}_k)\big[(\mathbf{W}_k - \mu\mathbf{D}_k)^H(\mathbf{W}_k - \mu\mathbf{D}_k)\big]^{-1/2}$. By expanding the above expression in a Taylor series, we obtain the update (19). Considering the correspondence between the SD direction $\mathbf{D}_k$ and the translated gradient $\mathbf{G}_k$ from Section III, the three update expressions (15), (16), and (19) become equivalent up to the second order.

D. The Projection Based on the QR-Decomposition

A computationally inexpensive way to approximate the optimal projection (18) of an arbitrary matrix onto $U(n)$ is the QR-decomposition. We show that this is not the optimal projection in terms of minimum Euclidean distance, but it is accurate enough to be used in practical applications, and we establish a connection between the QR-based projection and the optimal projection. Let us consider the QR-decomposition of an arbitrary nonsingular matrix $\mathbf{V}$, given by $\mathbf{V} = \mathbf{Q}\mathbf{R}$, where $\mathbf{R}$ is an upper triangular matrix and $\mathbf{Q}$ is a unitary matrix. The unitary matrix $\mathbf{Q}$ is an approximation of the optimal projection of $\mathbf{V}$ onto $U(n)$, i.e., $\mathbf{Q} \approx \pi(\mathbf{V})$. The connection can be established by using the polar decomposition of the upper-triangular factor, $\mathbf{R} = \mathbf{U}_R(\mathbf{R}^H\mathbf{R})^{1/2}$, where $\mathbf{U}_R$ is unitary. We obtain $\pi(\mathbf{V}) = \mathbf{Q}\mathbf{U}_R$; therefore, the matrix $\mathbf{Q}$ is an approximation of the optimal projection and differs from it by the additional rotation $\mathbf{U}_R$. The update of the SD algorithm is equal to the unitary factor from the QR-decomposition. In other words, if $\mathbf{W}_k - \mu\mathbf{D}_k = \mathbf{Q}_k\mathbf{R}_k$, then

$$\mathbf{W}_{k+1} = \mathbf{Q}_k. \qquad (20)$$

The equivalence up to the first order between the update (20) and the other update expressions (15), (16), and (19) is obtained by expanding the columns of the matrix $\mathbf{Q}_k$ resulting from the Gram-Schmidt process separately in a Taylor series (proof available on request), yielding (21). The equivalence may extend to higher orders, but this remains to be studied.

V. COMPUTATIONAL COMPLEXITY

In this section, we evaluate the computational complexity of the SD algorithms on $U(n)$ by considering separately the geodesic and the nongeodesic SD algorithms. The proposed geodesic SD algorithms use a rotational update of the form $\mathbf{W}_{k+1} = \mathbf{P}_k\mathbf{W}_k$, and the rotation matrix $\mathbf{P}_k$ is computed via the matrix exponential. We review the variety of algorithms available in the literature for calculating the matrix exponential in the proposed algorithms; details are given for the matrix exponential algorithms with the most appealing properties. The nongeodesic SD algorithms are based on an update expression of the form $\mathbf{W}_{k+1} = \pi(\mathbf{W}_k - \mu\mathbf{D}_k)$, and the computational complexity of the different cases is described. The cost of adapting the step size using the Armijo rule is also evaluated. The SD method on $U(n)$ involves an overall complexity of $O(n^3)$ flops¹ per iteration. Algorithms like the conjugate gradient or the Newton algorithm are expected to provide faster convergence, but their complexity is also expected to be higher. Moreover, a Newton algorithm is more likely to converge to stationary points other than local minima.

¹One "flop" is defined as a complex addition or a complex multiplication. An operation of the form $ab + c$, with $a, b, c \in \mathbb{C}$, is equivalent to two flops. A simple multiplication of two $n \times n$ matrices requires $2n^3$ flops. This is a quick evaluation of the computational complexity, not necessarily proportional to the computational speed.

A. Geodesic SD Algorithms on $U(n)$

In general, the geodesic motion on manifolds is computationally expensive. In the case of $U(n)$, the complexity is reduced even though it requires the computation of the matrix exponential. We are interested in the special case of a matrix exponential of the form $\exp(-\mu\mathbf{G})$, where $\mu \in \mathbb{R}$ and $\mathbf{G}$ is a skew-Hermitian matrix. Obviously, finding the exponential of a skew-Hermitian matrix has lower complexity than that of a general matrix. The matrix exponential operation maps the skew-Hermitian matrices from the Lie algebra $\mathfrak{u}(n)$ into unitary matrices which reside on the Lie group $U(n)$. Several alternatives for approximating the matrix exponential have been proposed in [30], [32], [48], and [49]. In general, the term "approximation" may refer to two different things. The first kind of approximation maps the elements of the Lie algebra exactly into the Lie group, and the approximation takes place only in terms of deviation "within the constraint surface." Among the most efficient methods from this category are the diagonal Padé approximation [32], the generalized polar decomposition [30], [48], and the technique of coordinates of the second kind [49]. The second category includes methods for which the resulting elements do not reside on the group anymore, i.e., $\mathbf{W}\mathbf{W}^H \neq \mathbf{I}$. The most popular methods belonging to

this category are the truncated Taylor series and the nondiagonal Padé approximation. They do not preserve the algebraic properties, but they still provide reasonable performance in some applications [50]. Their accuracy may be improved by using them together with the scaling and squaring procedure.

1) Padé Approximation of the Matrix Exponential: This is [32, Method 2], and together with scaling and squaring [32, Method 3] it is considered to be one of the most efficient methods for approximating a matrix exponential. For normal matrices (i.e., matrices which satisfy $\mathbf{A}\mathbf{A}^H = \mathbf{A}^H\mathbf{A}$), the Padé approximation prevents round-off error accumulation. The skew-Hermitian matrices are normal matrices; therefore, they enjoy this benefit. Because we deal with an SD algorithm on $U(n)$, we are also concerned about preserving the Lie-algebraic properties. The diagonal Padé approximation preserves the unitarity property accurately. The Padé approximation together with scaling and squaring requires choosing the Padé approximation order and the scaling-and-squaring exponent so as to obtain the best approximant for a given approximation accuracy; see [32] for information on choosing this pair optimally. The complexity of this approximation is $O(n^3)$ flops. The drawback of the Padé method when used together with the scaling and squaring procedure is that, if the norm of the argument is large, the computational efficiency decreases due to the repeated squaring.

2) Approximation of the Matrix Exponential via Generalized Polar Decomposition (GPD): The GPD method, recently proposed in [30], is consistent with the Lie group structure, as it maps the elements of the Lie algebra exactly into the corresponding Lie group. The method lends itself to implementation on parallel architectures, and it requires $O(n^3)$ flops [30] regardless of the approximation order. It may not be the most efficient implementation in terms of flop count, but the algorithm has potential for highly parallel implementation.
GPD algorithms based on splitting techniques have also been proposed in [48]. The corresponding approximation is less complex than the one in [30] for the second and the third order; the second-order approximation requires only the same number of flops as the CT. Other efficient approximations in a Lie-algebraic setting have been considered in [49] by using the technique of coordinates of the second kind (CSK); a second-order CSK approximant likewise requires $O(n^3)$ flops.
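As an illustration of the first (structure-preserving) category, the sketch below combines the (1,1) diagonal Padé (Cayley) core with scaling and squaring for a skew-Hermitian argument. This is our simplified stand-in, not the exact routines of [30], [32], or [48]; the scaling exponent `j` is an illustrative choice.

```python
import numpy as np

H = lambda A: A.conj().T

def expm_ss_pade11(S, j=8):
    # Scaling and squaring with a (1,1) diagonal Pade (Cayley) core:
    # exp(S) ~ [pade11(S / 2^j)]^(2^j). For skew-Hermitian S the core is
    # exactly unitary, so the result stays on U(n) up to round-off.
    I = np.eye(S.shape[0])
    X = S / (2.0 ** j)                             # scaling
    P = np.linalg.solve(I - 0.5 * X, I + 0.5 * X)  # (1,1) Pade core
    for _ in range(j):                             # repeated squaring
        P = P @ P
    return P

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (A - H(A))                     # skew-Hermitian argument

d, V = np.linalg.eigh(1j * S)            # reference: exact exponential via eigh
E = V @ np.diag(np.exp(-1j * d)) @ H(V)

P = expm_ss_pade11(S)
print(np.linalg.norm(P - E))                  # small approximation error
print(np.linalg.norm(P @ H(P) - np.eye(n)))  # unitarity preserved
```

Increasing `j` reduces the approximation error but, as noted above, repeated squaring is the cost one pays for a large-norm argument.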

B. Nongeodesic SD Algorithms on $U(n)$

This category includes local parametrizations derived from a projection operator which is used to map arbitrary matrices into $U(n)$. The optimal projection and an approximation of it are considered.

1) Optimal Projection: The projection that minimizes the Euclidean distance between an arbitrary matrix and a unitary matrix may be computed in different ways. By using the SVD, the computation of the projection (17) requires $O(n^3)$ flops, and by using the procedure (18) it likewise requires $O(n^3)$ flops.

2) Approximation of the Optimal Projection: This method is the most inexpensive approximation of the optimal projection, being based on the QR-decomposition of the matrix. It requires only the unitary matrix $\mathbf{Q}$, which is an orthonormal basis for the range space of the matrix being projected. This can be done by using Householder reflections, Givens rotations, or the Gram-Schmidt procedure [32]. The most computationally efficient and numerically stable approach is the modified Gram-Schmidt procedure, which requires only $O(n^3)$ flops.

TABLE III THE COMPLEXITY (IN FLOPS) OF COMPUTING THE LOCAL PARAMETRIZATION IN U(n)

In Table III, we summarize the complexity² of computing the local parametrizations for the geodesic and the nongeodesic methods, respectively. The geodesic methods include the diagonal Padé approximation with scaling and squaring of type (1, 0) [32] (CT), and the generalized polar decomposition with reduction to tridiagonal form (GPD-IZ) [30] and without reduction to tridiagonal form (GPD-ZMK) [48]. All methods have an approximation order of two. The nongeodesic methods include the optimal projection (OP) and its approximation (AOP).

C. The Cost of Using an Adaptive Step Size

In this subsection, we analyze the computational complexity of adapting the step size with the Armijo rule. The total computational cost is given by the complexity of computing the local parametrization plus the additional complexity of selecting the step size. Therefore, the step size adaptation is a critical aspect to be considered. We consider again the geodesic SD algorithms and the nongeodesic SD algorithms, respectively, and show that the geodesic methods may reduce the complexity of the step size adaptation.

1) The Geodesic SD Algorithms: Since the step size evolves in a dyadic fashion, the geodesic methods are very suitable for the Armijo step. This is due to the fact that doubling the step size does not require any expensive computation, just squaring the rotation matrix as in the scaling and squaring procedure. For normal matrices, the computation of the matrix exponential via matrix squaring prevents round-off error accumulation [32]. An Armijo type of geodesic SD algorithm enjoys this benefit, since the argument of the matrix exponential is skew-Hermitian.
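The cheap-doubling point can be checked directly: for a skew-Hermitian $\mathbf{G}$, the rotation for step size $2\mu$ is exactly the square of the rotation for $\mu$. The sketch below is our illustration, with a random direction and an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
H = lambda A: A.conj().T

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
G = A - H(A)                              # skew-Hermitian gradient direction
mu = 0.1

# Diagonalize once: 1j*G is Hermitian, so exp(-t*G) = V diag(exp(1j*t*d)) V^H
d, V = np.linalg.eigh(1j * G)
rot = lambda t: V @ np.diag(np.exp(1j * t * d)) @ H(V)

P = rot(mu)                               # rotation for step size mu
P2 = P @ P                                # doubled step: one matrix squaring
print(np.linalg.norm(P2 - rot(2 * mu)))   # agrees with exp(-2*mu*G) to round-off
```

This is exactly why the Armijo rule pairs well with the geodesic update: each candidate doubling costs one $O(n^3)$ multiplication instead of a fresh exponential.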
Moreover, when the step size is halved, the corresponding rotation matrix may already be available from the scaling and squaring procedure, which is often combined with other methods for approximating the matrix exponential. This allows reducing the complexity, because the expensive exponential evaluation may often be avoided.

2) The Nongeodesic SD Algorithms: The nongeodesic methods compute the update by projecting a matrix onto $U(n)$. Unfortunately, the Armijo step size adaptation is relatively expensive in this case. The main reason is that the update and the one corresponding to the doubled step size do not have a straightforward relationship such as squaring the rotation matrix, as for the geodesic methods. Thus, the projection operation needs to be computed multiple times. Moreover, even keeping the step size constant involves the computation of the projection twice,

dominant terms are reported, i.e.,

O (n ) .

TABLE IV THE COMPLEXITY OF ADAPTING THE STEP SIZE

since both inequalities 6 and 7 in Table II need to be tested even if they fail; in that case, both projections need to be evaluated. We compare the proposed geodesic SD algorithms to the nongeodesic SD algorithms by considering the complexity of adapting the step size. We also take into account the cost of computing the rotation matrix for the geodesic SD algorithms and the cost of computing the projection operation for the nongeodesic algorithms. These costs are given in Table III for the different local parametrizations. We denote by $N_d$ and $N_h$ the number of times we double, respectively halve, the step size during one iteration. The complexity of adapting the step size is summarized in Table IV. We may conclude that the local parametrization may be chosen based on the requirements of the application. Often, preserving the algebraic structure is important; on the other hand, the implementation complexity may be a limiting factor. Most of the parametrizations presented here are equivalent up to the second order; therefore, the difference in convergence speed is not expected to be significant. The cost function to be minimized also plays a role in this difference, as stated in [2]. An SD algorithm with adaptive step size is more suitable in practice; consequently, the geodesic algorithms are a good choice. In this case, the matrix exponential is employed, and it may be computed either by using the CT [31] or the GPD-ZMK method [48]. They require an equal number of flops; therefore, the choice rests upon numerical stability. Even though the GPD-IZ method recently proposed in [30] is somewhat less efficient in terms of flop count, it may be faster in practice if implemented on parallel architectures. Moreover, it provides good numerical stability, as will be seen in the simulations.
As a final conclusion, we would opt for the GPD-IZ method [30] if the algorithm is implemented in a parallel fashion, and for the CT if parallel computation is not an option.

VI. NUMERICAL STABILITY

In this section, we focus on the numerical stability of the proposed SD algorithms on $U(n)$. Taking into account the recursive nature of the algorithms, we analyze the deviation of each new update from the unitary constraint, i.e., the departure from $U(n)$. The nongeodesic SD algorithms do not experience this problem, due to the nature of their local parametrization: the error does not accumulate because the projection operator maps the update onto the manifold at every new iteration. Therefore, we consider only the geodesic SD algorithms. The methods proposed here for approximating the matrix exponential map the elements of the Lie algebra exactly into the Lie group; therefore, they do not cause deviation from the unitary constraint. However, the rotation matrix may be affected by round-off errors, and the error may accumulate in the update (14) due to the repeated matrix multiplications.
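A quick numerical experiment (our illustration) of this round-off accumulation: applying many exactly computed rotations in sequence and measuring the squared Frobenius deviation $\|\mathbf{W}\mathbf{W}^H - \mathbf{I}\|_F^2$ shows that the departure from $U(n)$ stays negligible in double precision.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
H = lambda A: A.conj().T

def random_rotation():
    # Rotation exp(S) with S skew-Hermitian, computed via eigh of 1j*S
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    S = 0.1 * (A - H(A))
    d, V = np.linalg.eigh(1j * S)
    return V @ np.diag(np.exp(-1j * d)) @ H(V)

W = np.eye(n, dtype=complex)
for k in range(10000):                    # many multiplicative updates (14)
    W = random_rotation() @ W

dev = np.linalg.norm(W @ H(W) - np.eye(n), 'fro') ** 2
print(dev)                                # remains tiny after 10^4 iterations
```

Even after far more iterations than any of the experiments below require for convergence, the accumulated deviation stays many orders of magnitude below any practically relevant tolerance.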

We provide a closed-form expression for the expected value of the deviation from the unitary constraint after a certain number of iterations. The theoretical value derived here predicts the error accumulation with high accuracy, as will be shown in the simulations; we show that the error accumulation is negligible in practice. We assume that at each iteration $k$, the rotation matrix is affected additively by a quantization error, i.e., $\hat{\mathbf{P}}_k = \mathbf{P}_k + \mathbf{E}_k$, where $\mathbf{P}_k$ is the true rotation matrix. The real and imaginary parts of the entries of $\mathbf{E}_k$ are mutually independent and independent of the entry indices; they are assumed to be uniformly distributed within the quantization interval of width $\epsilon$. The deviation of the quantized update from the unitary constraint is measured by

$$\Phi(\mathbf{W}_k) = \|\mathbf{W}_k\mathbf{W}_k^H - \mathbf{I}\|_F^2. \qquad (22)$$

The closed-form expression of the expected value of the deviation at iteration $k$ is given by (derivation available on request)

(23)

The theoretical value (23) depends on the matrix dimension $n$ and the width of the quantization interval $\epsilon$. Often, convergence is reached in just a few iterations, as in the practical example presented in Section VII; therefore, the error accumulation problem is avoided. We show that even if convergence is achieved only after a large number of iterations, the expected value of the deviation from the unitary constraint is negligible. This is due to the structure of the dominant term in (23): the error increases very slowly, and the rate of increase decays rapidly, as will be shown in Section VII.

VII. SIMULATION RESULTS AND APPLICATIONS

In this section, we test how the proposed method performs in signal processing applications. An example of separating independent signals in a MIMO system is given. Applications to array signal processing, ICA, and BSS for MIMO systems may be found in [5]–[8], [10], [14]–[17], [19]–[23], [51]. A recent review of the applications of differential geometry to signal processing may be found in [52].

A. Blind Source Separation for MIMO Systems

Separating signals blindly in a MIMO communication system may be done by exploiting the statistical information of the transmitted signals. The JADE algorithm [3] is a reliable alternative for solving this problem. The JADE algorithm consists of two stages. First, a prewhitening of the received signal is performed. The second stage is a unitary rotation. This second stage is formulated as an optimization problem under a unitary matrix constraint, since no closed-form solution can be given except for simple cases such as 2-by-2 unitary matrices. It may be efficiently solved by using the proposed SD on the unitary group. It should be noted that the first stage can also be formulated as a unitary optimization problem [50], and the algorithms

proposed in this paper could be used to solve it. However, here we focus only on the second stage. The JADE approach has recently been considered on the oblique [53] and Stiefel [19] manifolds. The SD algorithm in [19] has a complexity of $O(n^3)$ per iteration, like the original JADE [3], but in general it converges in fewer iterations. This is true especially for large matrix dimensions, where JADE seems to converge slowly due to its pairwise processing approach. Therefore, the overall complexity of the algorithm in [19] is lower than that of the original JADE. It operates on the Stiefel manifold of unitary matrices, but still without taking into account the additional Lie group structure of the manifold. Our proposed unitary algorithm is designed specifically for the case of square unitary matrices, and for this reason the complexity per iteration is lower compared to the SD in [19]; the convergence speed is identical, as will be shown later. The algorithms in [2] and [19] are more general than the proposed one, in the sense that the parametrization is chosen for the Stiefel and the Grassmann manifolds. The reduction in complexity for the proposed algorithm is achieved by exploiting the additional group structure of $U(n)$. The SD along geodesics is also more suitable for the Armijo step size.

A number of $r$ independent zero-mean signals are sent by using $r$ transmit antennas, and they are received by $m$ receive antennas. The frequency-flat MIMO channel matrix is in this case an $m \times r$ mixing matrix. We use the classical signal model used in source separation: the matrix corresponding to the received signal is the product of the mixing matrix and the matrix of transmitted signals, plus additive white noise. In the prewhitening stage, the received signal is decorrelated based on the eigendecomposition of its correlation matrix; the prewhitened received signal is obtained by projecting onto the signal subspace spanned by the corresponding eigenvectors and normalizing by the corresponding eigenvalues.

In the second stage, the goal is to determine a unitary matrix $\mathbf{W}$ such that the estimated signals are the transmitted signals up to a phase and a permutation ambiguity, which are inherent to any blind method. The unitary matrix may be obtained by exploiting the information provided by the fourth-order cumulants of the whitened signals. The JADE algorithm minimizes the criterion

$$\mathcal{J}_{\mathrm{JADE}}(\mathbf{W}) = \sum_i \operatorname{off}\big(\mathbf{W}^H \hat{\mathbf{M}}_i \mathbf{W}\big) \qquad (24)$$

with respect to $\mathbf{W}$, under the unitarity constraint on $\mathbf{W}$, i.e., we deal with a minimization problem on $U(n)$. The eigenmatrices $\hat{\mathbf{M}}_i$, which are estimated from the fourth-order cumulants, need to be jointly diagonalized. The operator $\operatorname{off}(\cdot)$ computes the sum of the squared magnitudes of the off-diagonal elements of a matrix; therefore, the criterion penalizes the departure of all eigenmatrices from the diagonal property. The Euclidean gradient of the JADE cost function is obtained by differentiating (24) with respect to $\mathbf{W}^*$; it involves an elementwise (Hadamard) matrix multiplication that selects the off-diagonal entries.

The performance is studied in terms of convergence speed, considering the JADE criterion and the Amari distance (performance index) [12]. The JADE criterion (24) is a measure of how well the eigenmatrices $\hat{\mathbf{M}}_i$ are jointly diagonalized; this characterizes the goodness of the optimization solution, i.e., the unitary rotation stage of the BSS. The Amari distance is a good performance measure for the entire blind source separation problem, since it is invariant to permutation and scaling. In terms of deviation from the unitary constraint, the performance is measured by using the unitarity criterion (22), on a logarithmic scale.

Fig. 4. The constellation patterns corresponding to (a) four of the six received signals and (b) the four recovered signals by using JADE with the algorithm proposed in Table I. There is an inherent phase ambiguity, which may be noticed as a rotation of the constellation, as well as a permutation ambiguity.

A number of four signals are transmitted: three QPSK signals and one BPSK signal. The channel taps are independent random coefficients whose power is distributed according to a Rayleigh distribution. For a fixed SNR, the results are averaged over 100 random realizations of the (4 × 6) MIMO matrix and the (4 × 1000) signal matrix. In the first simulation, we compare three optimization algorithms: the classical Euclidean SD algorithm, which enforces the unitarity of $\mathbf{W}$ after every iteration; the Euclidean SD with an extra penalty, similar to [16], stemming from the Lagrange-multipliers method; and the proposed Riemannian SD algorithm from Table II. The update rule for the classical SD algorithm is a Euclidean gradient step, after which the unitarity property is enforced by symmetric orthogonalization [12], i.e., $\mathbf{W} \leftarrow \mathbf{W}(\mathbf{W}^H\mathbf{W})^{-1/2}$. The extra-penalty SD method uses an additional term added to the original cost function (24), similarly to the bigradient method in [16]; a weighting factor is used to weight the importance of the unitarity constraint. The third method is the SD algorithm summarized in Table II. The Armijo step size selection rule [1] is used for all methods. Fig. 4 shows the received signal mixtures and the signals separated by using JADE with the proposed algorithm. The performance of the three algorithms in terms of convergence speed and accuracy of satisfying the unitary constraint is presented in Fig. 5. The JADE criterion (24) versus the number of iterations is shown in subplot (a) of Fig. 5. Subplot (b) of Fig. 5 shows the evolution of the Amari distance with the number of iterations. We may notice that the accuracy of the optimization solution, described by the value of the JADE cost function, is closely related to the accuracy of solving the entire source separation problem, i.e., to the Amari distance. The Riemannian SD algorithm (Table I) converges faster compared to the classical


IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 3, MARCH 2008

Fig. 5. A comparison between the conventional optimization methods operating on the Euclidean space (the classical SD algorithm with enforced unitarity and the extra-penalty SD method) and the Riemannian SD algorithm from Table II. The horizontal thick dotted line in subplots (a) and (b) represents the solution of the original JADE algorithm [3]. The performance measures are the JADE criterion J (24), the Amari distance d, and the unitarity criterion (22) versus the iteration step. The Riemannian SD algorithm outperforms the conventional methods.
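For reference, the unitarity-enforcing step used by the baseline Euclidean SD method, the symmetric orthogonalization W ← W(W^H W)^(-1/2), can be sketched in NumPy via the SVD-based polar factor. This is a minimal illustration; the helper name is ours, not from the paper:

```python
import numpy as np

def nearest_unitary(W):
    """Closest unitary matrix to W (its polar factor), computed via the SVD.

    Equivalent to the symmetric orthogonalization W (W^H W)^{-1/2} used to
    re-enforce unitarity after a Euclidean gradient step.
    """
    U, _, Vh = np.linalg.svd(W)
    return U @ Vh

rng = np.random.default_rng(0)
n = 4
# A unitary starting point, then a small drift, as after one Euclidean SD step.
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
W_drifted = W + 0.05 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
W_fixed = nearest_unitary(W_drifted)
print(np.linalg.norm(W_fixed.conj().T @ W_fixed - np.eye(n)))  # ~1e-15
```

The polar factor is the unitary matrix closest to W in the Frobenius norm, which is why this projection is a natural (if costly) way to restore the constraint.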


methods, i.e., the Euclidean SD with enforced unitarity and the extra-penalty method. The Euclidean SD and the extra-penalty method do not operate in an appropriate parameter space, and their convergence speed is decreased. All SD algorithms satisfy the unitary constraint almost perfectly, except for the extra-penalty method, which also achieves the lowest convergence speed. This is due to the fact that an optimum scalar weighting parameter may not exist. The unitary matrix constraint is equivalent to a set of smooth real Lagrangian constraints, and therefore more weighting parameters could be used. However, computing these parameters may be computationally expensive and/or non-trivial even in simple cases, like the example presented in Section II-B. The accuracy of the solution in terms of the unitary constraint is shown in subplot c) of Fig. 5, considering the criterion (22) versus the number of iterations. We next analyze how the choice of the local parametrization affects the performance of the SD algorithms. The results are compared in terms of convergence speed and the accuracy of satisfying the unitary constraint. Two classes of Riemannian SD algorithms are considered: the first one includes the geodesic SD algorithms and the second one includes the nongeodesic SD algorithms. For the geodesic SD algorithms the exponential map is computed by three different methods: Matlab's expm function, which uses the diagonal Padé approximation with scaling and squaring [32]; the Generalized Polar Decomposition of order four by Iserles and Zanna


Fig. 6. The JADE criterion J (24), the Amari distance, and the unitarity criterion (22) versus the iteration step. A comparison between different local parametrizations on U(n): the geodesic algorithms (continuous line) versus the nongeodesic algorithms (dashed line). For the geodesic algorithms the exponential map is computed by using three different methods: Matlab's diagonal Padé approximation with scaling and squaring (DPA SS), GPD-IZ [30], and the CT. For the nongeodesic algorithms the projection operation is computed with two different methods: the OP via Matlab's svd function and the approximation of it (AOP) via QR decomposition. The horizontal thick dotted line in subplots (a) and (b) represents the solution of the original JADE algorithm [3].
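The two geodesic parametrizations compared here can be illustrated numerically: the rotation exp(−μS) computed by a Padé-type routine (scipy.linalg.expm) versus the Cayley transform (CT). Both produce exactly unitary rotations for skew-Hermitian S, and they agree up to O(μ³); this is a sketch with illustrative variable names, not the paper's implementation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (A - A.conj().T)   # skew-Hermitian: S = -S^H
mu = 0.1                     # step size

R_exp = expm(-mu * S)        # geodesic rotation via Padé approximation
I = np.eye(n)
# Cayley transform: (I + mu S / 2)^{-1} (I - mu S / 2)
R_cay = np.linalg.solve(I + 0.5 * mu * S, I - 0.5 * mu * S)

# Both rotations are unitary to machine precision ...
print(np.linalg.norm(R_exp.conj().T @ R_exp - I))
print(np.linalg.norm(R_cay.conj().T @ R_cay - I))
# ... and agree up to a third-order error in mu
print(np.linalg.norm(R_exp - R_cay))
```

The Cayley factor is exactly unitary because for skew-Hermitian X the factors (I + X) and (I − X) commute and are Hermitian conjugates of each other; the two rotations differ by a term proportional to μ³S³.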


(GPD-IZ) [30]; and the CT. The nongeodesic SD algorithms are based on a projection type of local parametrization. The OP is computed via the SVD [2] by using the Matlab svd function, and the approximation of the optimal projection (AOP), based on the QR decomposition, is computed via the modified Gram-Schmidt procedure [54]. In terms of convergence speed, both the geodesic and the nongeodesic SD algorithms such as [2] and [19] have similar performance, regardless of the local parametrization, as shown in subplots a) and b) of Fig. 6. Also in terms of the unitarity criterion, all algorithms provide good performance. The solution of the original JADE algorithm (represented by the horizontal thick dotted line in Fig. 6) is achieved in fewer than 10 iterations for all SD algorithms, regardless of the local parametrization. In conclusion, the choice of the local parametrization is made according to the computational complexity and numerical stability. An Armijo step size rule is very suitable for the geodesic algorithms, i.e., when using the exponential map as a local parametrization. If implemented on a parallel architecture, the

ABRUDAN et al.: SD ALGORITHMS FOR OPTIMIZATION


Fig. 7. The deviation from the unitary constraint for different values of the quantization errors on the rotation matrices, after one million iterations. The theoretical value E in (23) is represented by a continuous black line. The unitarity criterion (22) obtained by repeatedly multiplying unitary matrices is represented by a dashed gray line. The value obtained by using the proposed SD algorithm in Table II is represented by a dot-dashed thick black line. The theoretical expected value of the deviation from the unitary constraint predicts accurately the value obtained in simulations. The proposed algorithm produces an error lower than the theoretical bound due to the fact that the error does not accumulate after convergence has been reached.
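The accumulation effect studied in Fig. 7 is easy to reproduce: repeatedly multiplying by rotation matrices that carry a small quantization-like error makes the product drift away from the unitary group. A small NumPy sketch; the Frobenius deviation below is our stand-in for the paper's unitarity criterion (22), whose exact definition and scaling may differ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, iters = 4, 1e-8, 5000

def unitarity_deviation(W):
    # Frobenius-norm deviation from unitarity (assumed stand-in for Eq. (22)).
    return np.linalg.norm(W.conj().T @ W - np.eye(W.shape[0]))

W = np.eye(n, dtype=complex)
for _ in range(iters):
    # An exact random rotation, then an additive perturbation of size eps.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    E = eps * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    W = (Q + E) @ W   # multiply by the "quantized" rotation

print(unitarity_deviation(W))  # orders of magnitude above machine epsilon
```

The drift behaves like a random walk, so the deviation grows slowly with the number of multiplications, which is consistent with the observation that an algorithm converging in a few steps accumulates a negligible error.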


GPD-IZ method [30] for computing the matrix exponential is a reliable choice from the point of view of efficiency and computational speed. Otherwise, the CT provides a reasonable performance at a low computational cost. Finally, the proposed algorithm has lower computational complexity per iteration compared to the nongeodesic SD in [2], [19] at the same convergence speed. This reduction is achieved by exploiting the Lie group structure of the Stiefel manifold of unitary matrices. The last simulation shows how the round-off errors caused by finite numerical precision affect the proposed iterative algorithm. In Fig. 7, different values of the quantization error are considered for the rotation matrices, which are obtained by using the method of [32]. Similar results are obtained by using the other approximation methods of the matrix exponential presented in Section V-A. The theoretical value of the unitarity criterion in (23) is represented by continuous lines. The value (22) obtained by repeatedly multiplying unitary matrices is represented by dashed lines, and the value obtained by using the proposed algorithm in Table II is represented by dot-dashed thick lines. The theoretical value (23) predicts accurately the value (22) obtained by repeated multiplications of unitary matrices. The proposed algorithm exhibits an error below the theoretical value due to the fact that convergence is reached in a few steps. Even if convergence were reached after a much larger number of iterations, the error accumulation would be negligible for reasonable values of the quantization errors, as shown in Fig. 7. In practice, a much smaller number of iterations needs to be performed.

VIII. CONCLUSION

In this paper, Riemannian optimization algorithms on the Lie group of unitary matrices have been introduced.


An expression for the Riemannian gradient needed in the optimization has been derived. The proposed algorithms move towards the optimum along geodesics, and the local parametrization is the exponential map. We exploit the recent developments in computing the matrix exponential needed in the multiplicative update on U(n). This operation may be efficiently computed in a parallel fashion by using the GPD-IZ method [30] or in a serial fashion by using the CT. We also address the numerical issues and show that the geodesic algorithms together with the Armijo rule [1] are more efficient in practical implementations. Nongeodesic algorithms have been considered as well, and equivalence up to a certain approximation order has been established. The proposed geodesic algorithms are suitable for practical applications where a closed-form solution does not exist, or to refine estimates obtained by classical means. Such an example is the joint diagonalization problem presented in the paper. We have shown that the unitary matrix optimization problem encountered in the JADE approach for blind source separation [3] may be efficiently solved by using the proposed algorithms. Other possible applications include smart antenna algorithms, wireless communications, and biomedical measurements and signal separation, where unitary matrices play an important role in general. The algorithms introduced in this paper provide significant advantages over the classical Euclidean gradient with enforced unitary constraint and Lagrangian types of methods in terms of convergence speed and accuracy of the solution. The unitary constraint is automatically maintained at each iteration, and consequently, undesired suboptimal solutions may be avoided. Moreover, for the specific case of U(n), the proposed algorithm has lower computational complexity than the nongeodesic SD algorithms in [2].

REFERENCES

[1] E. Polak, Optimization: Algorithms and Consistent Approximations. New York: Springer-Verlag, 1997. [2] J. H.
Manton, “Optimization algorithms exploiting unitary constraints,” IEEE Trans. Signal Process., vol. 50, pp. 635–650, Mar. 2002. [3] J. Cardoso and A. Souloumiac, “Blind beamforming for non Gaussian signals,” Inst. Elect. Eng. Proc.-F, vol. 140, no. 6, pp. 362–370, 1993. [4] S. T. Smith, “Subspace tracking with full rank updates,” in Conf. Rec. 31st Asilomar Conf. Signals, Syst. Comp., Nov. 2–5, 1997, vol. 1, pp. 793–797. [5] D. R. Fuhrmann, “A geometric approach to subspace tracking,” in Conf. Rec. 31st Asilomar Conf. Signals, Syst., Comp., Nov. 2–5, 1997, vol. 1, pp. 783–787. [6] J. Yang and D. B. Williams, “MIMO transmission subspace tracking with low rate feedback,” in Int. Conf. Acoust., Speech Signal Process., Philadelphia, PA, Mar. 2005, vol. 3, pp. 405–408. [7] W. Utschick and C. Brunner, “Efficient tracking and feedback of DL-eigenbeams in WCDMA,” in Proc. 4th Europ. Pers. Mobile Commun. Conf., Vienna, Austria, 2001. [8] M. Wax and Y. Anu, “A new least squares approach to blind beamforming,” in IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 21–24, 1997, vol. 5, pp. 3477–3480. [9] P. Stoica and D. A. Linebarger, “Optimization result for constrained beamformer design,” in IEEE Signal Process. Lett., Apr. 1995, vol. 2, pp. 66–67. [10] Y. Nishimori, “Learning algorithm for independent component analysis by geodesic flows on orthogonal group,” in Int. Joint Conf. Neural Netw., Jul. 10–16, 1999, vol. 2, pp. 933–938. [11] Y. Nishimori and S. Akaho, “Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold,” Neurocomputing, vol. 67, pp. 106–135, Jun. 2005.


[12] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001. [13] S. Fiori, A. Uncini, and F. Piazza, “Application of the MEC network to principal component analysis and source separation,” in Proc. Int. Conf. Artif. Neural Netw., 1997, pp. 571–576. [14] M. D. Plumbley, “Geometrical methods for non-negative ICA: Manifolds, Lie groups, toral subalgebras,” Neurocomput., vol. 67, pp. 161–197, 2005. [15] A. Cichocki and S.-I. Amari, Adaptive Blind Signal and Image Processing. New York: Wiley, 2002. [16] L. Wang, J. Karhunen, and E. Oja, “A bigradient optimization approach for robust PCA, MCA and source separation,” in Proc. IEEE Conf. Neural Netw., 27 Nov.–1 Dec. 1995, vol. 4, pp. 1684–1689. [17] S. C. Douglas, “Self-stabilized gradient algorithms for blind source separation with orthogonality constraints,” IEEE Trans. Neural Netw., vol. 11, pp. 1490–1497, Nov. 2000. [18] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural Comput., vol. 10, no. 2, pp. 251–276, 1998. [19] M. Nikpour, J. H. Manton, and G. Hori, “Algorithms on the Stiefel manifold for joint diagonalisation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2002, vol. 2, pp. 1481–1484. [20] C. B. Papadias, “Globally convergent blind source separation based on a multiuser kurtosis maximization criterion,” IEEE Trans. Signal Process., vol. 48, pp. 3508–3519, Dec. 2000. [21] P. Sansrimahachai, D. Ward, and A. Constantinides, “Multiple-input multiple-output least-squares constant modulus algorithms,” in IEEE Global Telecommun. Conf., Dec. 1–5, 2003, vol. 4, pp. 2084–2088. [22] C. B. Papadias and A. M. Kuzminskiy, “Blind source separation with randomized Gram-Schmidt orthogonalization for short burst systems,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 17–21, 2004, vol. 5, pp. 809–812. [23] J. Lu, T. N. Davidson, and Z. Luo, “Blind separation of BPSK signals using Newton’s method on the Stiefel manifold,” in IEEE Int. Conf. 
Acoust., Speech, Signal Process., Apr. 2003, vol. 4, pp. 301–304. [24] A. Edelman, T. Arias, and S. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM J. Matrix Analysis Applicat., vol. 20, no. 2, pp. 303–353, 1998. [25] S. Fiori, “Stiefel-Grassman Flow (SGF) learning: Further results,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Netw., Jul. 24–27, 2000, vol. 3, pp. 343–348. [26] J. H. Manton, R. Mahony, and Y. Hua, “The geometry of weighted low-rank approximations,” IEEE Trans. Signal Process., vol. 51, pp. 500–514, Feb. 2003. [27] S. Fiori, “Quasi-geodesic neural learning algorithms over the orthogonal group: A tutorial,” J. Mach. Learn. Res., vol. 1, pp. 1–42, Apr. 2005. [28] D. G. Luenberger, “The gradient projection method along geodesics,” Manage. Sci., vol. 18, pp. 620–631, 1972. [29] D. Gabay, “Minimizing a differentiable function over a differential manifold,” J. Optim. Theory Applicat., vol. 37, pp. 177–219, Jun. 1982. [30] A. Iserles and A. Zanna, “Efficient computation of the matrix exponential by general polar decomposition,” SIAM J. Numer. Anal., vol. 42, pp. 2218–2256, Mar. 2005. [31] I. Yamada and T. Ezaki, “An orthogonal matrix optimization by dual Cayley parametrization technique,” in Proc. ICA, 2003, pp. 35–40. [32] C. Moler and C. van Loan, “Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later,” SIAM Rev., vol. 45, no. 1, pp. 3–49, 2003. [33] S. Douglas and S.-Y. Kung, “An ordered-rotation kuicnet algorithm for separating arbitrarily-distributed sources,” in Proc. IEEE Int. Conf. Independ. Compon. Anal. Signal Separat., Aussois, France, Jan. 1999, pp. 419–425. [34] C. Udriste, Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and Its Applications. Boston, MA: Kluwer Academic, 1994. [35] M. P. do Carmo, Riemannian Geometry. Mathematics: Theory and Applications. Boston, MA: Birkhauser, 1992. [36] A. Knapp, Lie Groups Beyond an Introduction, Vol. 
140 of Progress in Mathematics. Boston, MA: Birkhauser, 1996. [37] D. G. Luenberger, Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1984. [38] S. T. Smith, “Optimization techniques on Riemannian manifolds,” Fields Inst. Commun., Amer. Math. Soc., vol. 3, pp. 113–136, 1994.


[39] R. W. Brockett, “Least squares matching problems,” Linear Algebra Applicat., vol. 122/123/124, pp. 761–777, 1989. [40] R. W. Brockett, “Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems,” Linear Algebra Applicat., vol. 146, pp. 79–91, 1991. [41] B. Owren and B. Welfert, “The Newton iteration on Lie groups,” BIT Numer. Math., vol. 40, pp. 121–145, Mar. 2000. [42] P.-A. Absil, R. Mahony, and R. Sepulchre, “Riemannian geometry of Grassmann manifolds with a view on algorithmic computation,” Acta Applicandae Mathematicae, vol. 80, no. 2, pp. 199–220, 2004. [43] S. G. Krantz, Function Theory of Several Complex Variables, 2nd ed. Pacific Grove, CA: Wadsworth and Brooks/Cole Advanced Books and Software, 1992. [44] S. Smith, “Statistical resolution limits and the complexified Cramér-Rao bound,” IEEE Trans. Signal Process., vol. 53, pp. 1597–1609, May 2005. [45] D. H. Brandwood, “A complex gradient operator and its applications in adaptive array theory,” in Inst. Elect. Eng. Proc., Parts F and H, Feb. 1983, vol. 130, pp. 11–16. [46] Y. Yang, “Optimization on Riemannian manifold,” in Proc. 38th Conf. Decision Contr., Phoenix, AZ, Dec. 1999, pp. 888–893. [47] N. J. Higham, “Matrix nearness problems and applications,” in Applications of Matrix Theory, M. J. C. Gover and S. Barnett, Eds. Oxford, U.K.: Oxford Univ. Press, 1989, pp. 1–27. [48] A. Zanna and H. Z. Munthe-Kaas, “Generalized polar decomposition for the approximation of the matrix exponential,” SIAM J. Matrix Anal., vol. 23, pp. 840–862, Jan. 2002. [49] E. Celledoni and A. Iserles, “Methods for approximation of a matrix exponential in a Lie-algebraic setting,” IMA J. Numer. Anal., vol. 21, no. 2, pp. 463–488, 2001. [50] T. Abrudan, J. Eriksson, and V. Koivunen, “Optimization under unitary matrix constraint using approximate matrix exponential,” in Conf. Rec. 39th Asilomar Conf. Signals, Syst. Comp. 2005, 28 Oct.–1 Nov. 2005. [51] P. Sansrimahachai, D. Ward, and A. 
Constantinides, “Blind source separation for BLAST,” in Proc. 14th Int. Conf. Digital Signal Process., Jul. 1–3, 2002, vol. 1, pp. 139–142. [52] J. H. Manton, “On the role of differential geometry in signal processing,” in Int. Conf. Acoust., Speech Signal Process., Philadelphia, PA, Mar. 2005, vol. 5, pp. 1021–1024. [53] P. Absil and K. A. Gallivan, Joint diagonalization on the oblique manifold for independent component analysis DAMTP, Univ. Cambridge, U.K., Tech. Rep. NA2006/01, 2006 [Online]. Available: http://www. damtp.cam.ac.uk/user/na/reports.html [54] G. H. Golub and C. van Loan, Matrix Computations, 3rd ed. Baltimore, MD: The Johns Hopkins Univ. Press, 1996.

Traian E. Abrudan (S’02) received the M.Sc. degree from the Technical University of Cluj-Napoca, Romania, in 2000. Since 2001, he has been with the Signal Processing Laboratory, Helsinki University of Technology (HUT), Finland. He is a Ph.D. student with the Electrical and Communications Engineering Department, HUT. Since 2005, he has been a member of GETA, Graduate School in Electronics, Telecommunications and Automation. His current research interests include statistical signal processing and optimization algorithms for wireless communications with emphasis on MIMO and multicarrier systems.

Jan Eriksson (M’04) received the M.Sc. degree in mathematics from University of Turku, Finland, in 2000, and the D.Sc.(Tech) degree (with honors) in signal processing from Helsinki University of Technology (HUT), Finland, in 2004. Since 2005, he has been working as a postdoctoral researcher with the Academy of Finland. His research interests are in blind signal processing, stochastic modeling, constrained optimization, digital communication, and information theory.


Visa Koivunen (S’87–M’93–SM’98) received the D.Sc. (Tech) degree with honors from the Department of Electrical Engineering, University of Oulu, Finland. He received the primus doctor (best graduate) award among the doctoral graduates during 1989–1994. From 1992 to 1995, he was a visiting researcher at the University of Pennsylvania, Philadelphia. In 1996, he held a faculty position with the Department of Electrical Engineering, University of Oulu. From August 1997 to August 1999, he was an Associate Professor with the Signal Processing Laboratory, Tampere University of Technology, Finland. Since 1999 he has been a Professor of Signal Processing with the Department of Electrical and Communications Engineering, Helsinki University of Technology (HUT), Finland. He is one of the Principal Investigators in the Smart and Novel Radios (SMARAD) Center of Excellence in Radio and Communications Engineering nominated by the Academy of Finland. Since 2003, he has also been an adjunct full professor with the University of Pennsylvania. During his sabbatical leave (2006–2007), he was the Nokia Visiting Fellow at the Nokia Research Center, as well as a Visiting Fellow at Princeton University, Princeton, NJ. His research interests include statistical, communications, and sensor array signal processing. He has published more than 200 papers in international scientific conferences and journals. Dr. Koivunen coauthored the papers receiving the Best Paper award in IEEE PIMRC 2005, EUSIPCO 2006, and EuCAP 2006. He served as an Associate Editor for IEEE SIGNAL PROCESSING LETTERS. He is a member of the editorial board for the Signal Processing journal and the Journal of Wireless Communication and Networking. He is also a member of the IEEE Signal Processing for Communication Technical Committee (SPCOM-TC). He was the general chair of the IEEE SPAWC (Signal Processing Advances in Wireless Communication) conference held in Helsinki, June 2007.

[Publication II] T. Abrudan, J. Eriksson, V. Koivunen, “Conjugate Gradient Algorithm for Optimization Under Unitary Matrix Constraint”, Technical Report 4/2008, Department of Signal Processing and Acoustics, Helsinki University of Technology, 2008. ISBN 978-951-22-9483-1, ISSN 1797-4267. Submitted for publication.

Conjugate Gradient Algorithm for Optimization Under Unitary Matrix Constraint 1

Traian Abrudan, Jan Eriksson, and Visa Koivunen

Abstract In this paper we introduce a Riemannian algorithm for minimizing (or maximizing) a real-valued function J of complex-valued matrix argument W under the constraint that W is an n × n unitary matrix. This type of constrained optimization problem arises in many array and multi-channel signal processing applications. We propose a conjugate gradient (CG) algorithm on the Lie group of unitary matrices U (n). The algorithm fully exploits the group properties in order to reduce the computational cost. Two novel geodesic search methods exploiting the almost periodic nature of the cost function along geodesics on U (n) are introduced. We demonstrate the performance of the proposed conjugate gradient algorithm in a blind signal separation application. Computer simulations show that the proposed algorithm outperforms other existing algorithms in terms of convergence speed and computational complexity.

Index Terms: Optimization, unitary matrix constraint, array processing, subspace estimation, source separation

1 Introduction

Constrained optimization problems arise in many signal processing applications. In particular, we address the problem of optimization under unitary matrix constraint. Such problems may be found in communications and array signal processing, for example, in blind and constrained beamforming, high-resolution direction finding, and generally in all subspace-based

This work was supported in part by the Academy of Finland and GETA Graduate School. The authors are with the SMARAD CoE, Department of Signal Processing and Acoustics, Helsinki University of Technology, FIN-02015 HUT, Finland (email: {tabrudan, jamaer, visa}@wooster.hut.fi)

methods. Another important class of applications is source separation and Independent Component Analysis (ICA). This type of optimization problem also occurs in Multiple-Input Multiple-Output (MIMO) communication systems. See [1, 2], and [3] for recent reviews. Commonly, optimization under unitary matrix constraint is viewed as a constrained optimization problem on the Euclidean space. Classical gradient algorithms are combined with different techniques for imposing the constraint, for example, orthogonalization, approaches stemming from the Lagrange multipliers method, or some stabilization procedures. If unitarity is enforced by using such techniques, one may experience slow convergence or departures from the unitary constraint, as shown in [1]. A constrained optimization problem may be converted into an unconstrained one on a different parameter space determined by the constraint set. The unitary matrix constraint considered in this paper determines a parameter space which is the Lie group of n × n unitary matrices U(n). This parameter space is at the same time a Riemannian manifold [4] and a matrix group [5] under the standard matrix multiplication. By using modern tools of Riemannian geometry, we take full advantage of the favorable geometrical properties of U(n) in order to solve the optimization problem efficiently and, at the same time, satisfy the constraint with high fidelity. Pioneering work in Riemannian optimization may be found in [6–9]. Optimization with orthogonality constraints is considered in detail in [10], where steepest descent (SD), conjugate gradient (CG), and Newton algorithms on the Stiefel and Grassmann manifolds are derived. A CG algorithm on the Grassmann manifold has also been proposed recently in [11]. A non-Riemannian approach, which is a general framework for optimization, is introduced in [12], where modified SD and Newton algorithms on the Stiefel and Grassmann manifolds are derived.
SD algorithms operating on the orthogonal group have recently been considered in [13–15], and on the unitary group in [1, 2]. A CG algorithm on the special linear group is proposed in [16]. The algorithms in the existing literature [6–8, 10, 12, 17] are, however, more general in the sense that they can be applied to more general manifolds than U(n). For this reason, when applied to U(n), they do not take full advantage of the special properties arising from the Lie group structure of the manifold [1]. In this paper we derive a conjugate gradient algorithm operating on the Lie group of unitary matrices U(n). The proposed CG algorithm provides faster convergence compared to the existing SD algorithms [1, 12] at even lower complexity. There are two main contributions in this paper. First, a computationally efficient CG algorithm on the Lie group of unitary matrices U(n) is proposed. The algorithm fully exploits the Lie group features such as simple formulas for the geodesics and tangent vectors.

The second main contribution of this paper is that we propose Riemannian optimization algorithms which exploit the almost periodic property of the cost function along geodesics on U(n). Based on this property, we derive novel high-accuracy line search methods [8] that facilitate fast convergence and the selection of a suitable step size parameter. Many of the existing geometric optimization algorithms do not include practical line search methods [10, 13], or if they do, they are too complex when applied to optimization on U(n) [12, 14]. In some cases, the line search methods are either valid only for specific cost functions [8], or the resulting search is not highly accurate [1, 11, 12, 14, 15]. Because the conjugate gradient algorithm assumes exact search along geodesics, the line search method is crucial for the performance of the resulting algorithm. An accurate line search method exploiting the periodicity of a cost function, which appears in the limited case of the special orthogonal group SO(n) for n ≤ 3, is proposed in [15] for non-negative ICA (independent component analysis). The method can also be applied for n > 3, but the accuracy decreases, since the periodicity of the cost function is lost. The proposed high-accuracy line search methods have lower complexity compared to well-known efficient methods [1] such as the Armijo method [18]. To the best of our knowledge, the proposed algorithm is the first ready-to-implement CG algorithm on the Lie group of unitary matrices U(n). It is also valid for the orthogonal group O(n). This paper is organized as follows. In Section 2 we approach the problem of optimization under unitary constraint by using tools from Riemannian geometry. We show how the geometric properties may be exploited in order to solve the optimization problem in an efficient way. Two novel line search methods are introduced in Section 3. The practical conjugate gradient algorithm for optimization under unitary matrix constraint is given in Section 4.
Simulation results and applications are presented in Section 5. Finally, Section 6 concludes the paper.

2 Optimization on the Unitary Group U(n)

In this section we show how the problem of optimization under unitary matrix constraint can be solved efficiently and in an elegant manner by using tools of Riemannian geometry. In Subsection 2.1 we review some important properties of the unitary group U(n) which are needed later in our derivation. A few important properties of U(n) that are very beneficial in optimization are pointed out in Subsection 2.2. The difference in behavior of the steepest

descent and conjugate gradient algorithms on Riemannian manifolds is explained in Subsection 2.3. A generic conjugate gradient algorithm on U(n) is proposed in Subsection 2.4.

2.1 Some key geometrical features of U(n)

This subsection briefly describes some Riemannian geometry concepts related to the Lie group of unitary matrices U(n) and shows how they can be exploited in optimization algorithms. Consider a real-valued function J of an n × n complex matrix W, i.e., J : C^(n×n) → R. Our goal is to minimize (or maximize) the function J = J(W) under the constraint that W W^H = W^H W = I, i.e., W is unitary. This constrained optimization problem on C^(n×n) may be converted into an unconstrained one on the space determined by the unitary constraint, i.e., the Lie group of unitary matrices. We view our cost function J as a function defined on U(n). The space U(n) is a real differentiable manifold [4]. Moreover, the unitary matrices are closed under the standard matrix multiplication, i.e., they form a Lie group [5]. The additional properties arising from the Lie group structure may be exploited to reduce the complexity of the optimization algorithms.

2.1.1 Tangent vectors and tangent spaces

The tangent space T_W U(n) is an n²-dimensional real vector space attached to every point W ∈ U(n). At the group identity I, the tangent space is the real Lie algebra of skew-Hermitian matrices, u(n) ≜ T_I U(n) = {S ∈ C^(n×n) | S = −S^H}. Since the differential of the right translation is an isomorphism, the tangent space at W ∈ U(n) may be identified with the matrix space T_W U(n) ≜ {X ∈ C^(n×n) | X^H W + W^H X = 0}.
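These two characterizations can be checked numerically: a right-translated Lie algebra element X = SW satisfies the tangent space condition X^H W + W^H X = 0. A short NumPy sketch (random test matrices only, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
# A point on U(n): the Q factor of a random complex matrix is unitary.
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
# An element of the Lie algebra u(n): skew-Hermitian part of a random matrix.
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (A - A.conj().T)      # S = -S^H
# Right-translated tangent vector at W.
X = S @ W
# Tangent space condition X^H W + W^H X = 0 holds to machine precision.
print(np.linalg.norm(X.conj().T @ W + W.conj().T @ X))  # ~0
```

The identity follows algebraically: X^H W + W^H X = W^H (S^H + S) W = 0 since S is skew-Hermitian.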

2.1.2 Riemannian metric and gradient on U(n)

After U(n) is equipped with a Riemannian structure (metric), the Riemannian gradient on U(n) can be defined. The inner product given by

⟨X, Y⟩_W = (1/2) ℜ{trace(X Y^H)}, X, Y ∈ T_W U(n), (1)

induces a bi-invariant metric on U(n). This metric induced from the embedding Euclidean space is also the natural metric on the Lie group, since it equals the negative (scaled) Killing form [19]. The Riemannian gradient gives the steepest ascent direction on U(n) of the function J at some point W ∈ U(n):

∇̃J(W) ≜ Γ_W − W Γ_W^H W, (2)

where Γ_W = dJ/dW*(W) is the gradient of J on the Euclidean space at a given W, defined in terms of real derivatives with respect to the real and imaginary parts of W [20].
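As a sanity check, for any Euclidean gradient Γ_W the expression (2) always lands in the tangent space T_W U(n). The sketch below uses the illustrative cost J(W) = ‖W − A‖²_F, for which Γ_W = W − A; this example cost is ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Euclidean gradient Gamma_W = dJ/dW* for the example cost J(W) = ||W - A||_F^2.
Gamma = W - A
# Riemannian gradient, Eq. (2): Gamma_W - W Gamma_W^H W.
G = Gamma - W @ Gamma.conj().T @ W
# It lies in the tangent space: G^H W + W^H G = 0.
print(np.linalg.norm(G.conj().T @ W + W.conj().T @ G))  # ~0
```

Expanding G^H W + W^H G term by term shows the four contributions cancel in pairs for any Γ_W, so the construction (2) is a genuine tangent vector at every W ∈ U(n).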

2.1.3 Geodesics and parallel transport on U(n)

Geodesics on a Riemannian manifold are curves whose second derivative lies in the normal space. Locally, they are also the length-minimizing curves. Their expression is given by the Lie group exponential map [19, Ch. II, §1]. For U(n) this coincides with the matrix exponential, for which stable and efficient numerical methods exist (see [1] for details). The geodesic emanating from W in the direction S̃ = SW is given by

G_W(t) = exp(tS) W, S ∈ u(n), t ∈ R. (3)
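Since exp(tS) is unitary whenever S is skew-Hermitian, the geodesic (3) stays on U(n) for every t. A quick numerical check with scipy.linalg.expm (random test data, for illustration):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
n = 3
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (A - A.conj().T)   # S in u(n)

def geodesic(t):
    # Eq. (3): G_W(t) = exp(tS) W
    return expm(t * S) @ W

for t in (0.0, 0.5, 2.0):
    Gt = geodesic(t)
    print(np.linalg.norm(Gt.conj().T @ Gt - np.eye(n)))  # stays unitary for all t
```

At t = 0 the geodesic passes through W itself, and its velocity there is S̃ = SW, matching the direction stated above.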

The parallelism on a differentiable manifold is defined with respect to an affine connection [4, 19, 21]. Usually the Levi-Civita connection [8, 17] is used on Riemannian manifolds. Now, the parallel transport of a tangent vector X̃ = XW ∈ T_W U(n), X ∈ u(n), w.r.t. the Riemannian connection along the geodesic (3) is given by [19]

τX̃(t) = exp(tS/2) X exp(−tS/2) G_W(t), (4)

where τ denotes the parallel transport. In the important special case of transporting the velocity vector of the geodesic, i.e., X̃ = S̃, Eq. (4) simply reduces to right multiplication by W^H G_W(t):

τS̃(t) = S G_W(t) = S̃ W^H G_W(t). (5)
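The simplification in (5) follows because S commutes with exp(±tS/2). Numerically, the general transport formula (4) applied to X = S coincides with S G_W(t); a brief check (random test data, for illustration):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
n = 3
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (A - A.conj().T)
t = 0.7

G_t = expm(t * S) @ W
# Eq. (4) with X = S: tau X(t) = exp(tS/2) X exp(-tS/2) G_W(t)
tau_general = expm(t * S / 2) @ S @ expm(-t * S / 2) @ G_t
# Eq. (5): the transported velocity is simply S G_W(t)
tau_special = S @ G_t
print(np.linalg.norm(tau_general - tau_special))  # ~0
```

This cheap transport of the geodesic velocity is exactly what a CG iteration needs when combining the old search direction with the new gradient.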

2.2 Almost periodic cost function along geodesics

In this subsection we present an important property of the unitary group, namely the behavior of a smooth cost function along geodesics on U(n). A smooth cost function J : U(n) → R takes predictable values along geodesics on U(n). This is a consequence of the fact that geodesics (3) are given by the matrix exponential of skew-Hermitian matrices S ∈ u(n). Such matrices have purely imaginary eigenvalues of the form jω_i, i = 1, . . . , n. Therefore, the eigenvalues of the matrix exponential exp(tS) are complex exponentials of the form e^(jω_i t). Consequently, the composed cost function Ĵ(t) = J(G_W(t)) is an almost periodic function, and therefore it may be expressed as a sum of periodic functions of t. Ĵ(t) is a periodic function only if the frequencies ω_i, i = 1, . . . , n, are in harmonic relation [22]. This happens for SO(2) and

SO(3) as noticed in [15], but not for U(n) with n > 1, in general. The derivatives of the cost function are also almost periodic functions. The almost periodic property of the cost function and its derivatives appears in the case of exponential map. This is not the case for other common parametrizations such as the Cayley transform or the Euclidean projection operator [1]. Moreover, this special property appears only on certain manifolds such as the unitary group U(n) and the special orthogonal group SO(n), and it does not appear on Euclidean spaces or on general Riemannian manifolds. The almost-periodic behavior of the cost function along geodesics may be used to perform geodesic search on U(n). This will be shown later in Section 3 where two novel line search methods for selecting a suitable step size parameter are introduced.
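The almost periodic behavior is easy to verify numerically. In the sketch below (our own illustration; the linear cost J(W) = ℜ tr{C^H W} and all variable names are assumptions, not from the text), the composed function Jˆ(t) along a geodesic is reproduced exactly as a finite sum of sinusoids whose frequencies ωi are the imaginary parts of the eigenvalues of S, generally incommensurate, hence almost periodic rather than periodic.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
S = 0.5 * (B - B.conj().T)                           # S in u(n)
W, _ = np.linalg.qr(rng.standard_normal((n, n))
                    + 1j * rng.standard_normal((n, n)))
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# -jS is Hermitian, so S = U diag(j*omega) U^H with real omega
omega, U = np.linalg.eigh(-1j * S)

def Jhat_direct(t):
    # Jhat(t) = Re trace(C^H exp(tS) W), evaluated the obvious way
    return np.real(np.trace(C.conj().T @ expm(t * S) @ W))

# exp(tS) = U diag(e^{j omega t}) U^H, so Jhat collapses to a sinusoid sum:
# Jhat(t) = Re sum_i a_i e^{j omega_i t} with a_i = (U^H W C^H U)_{ii}
a = np.diag(U.conj().T @ W @ C.conj().T @ U)

def Jhat_sinusoids(t):
    return np.real(np.sum(a * np.exp(1j * omega * t)))

for t in (0.0, 0.9, 3.7):
    assert abs(Jhat_direct(t) - Jhat_sinusoids(t)) < 1e-9
```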

2.3  SD vs. CG on Riemannian manifolds

The Conjugate Gradient (CG) algorithm typically provides faster convergence than the Steepest Descent (SD) algorithm, not only on the Euclidean space but also on Riemannian manifolds. The Riemannian SD algorithm has the same drawback as its Euclidean counterpart, i.e., it takes ninety-degree turns at each iteration [8]. This is illustrated in Figure 1 (top), where the contours of a cost function are plotted on the manifold surface. The steps are taken along geodesics, i.e., the trajectory of the SD algorithm is comprised of geodesic segments connecting successive points Wk, Wk+1, Wk+2 on the manifold. The zig-zag type of trajectory decreases the convergence speed, e.g., if the cost function has the shape of a “long narrow valley”. The conjugate gradient algorithm may significantly reduce this drawback, as shown in Figure 1 (bottom). It exploits the information provided by the current search direction −H̃k at Wk and the SD direction −G̃k+1 at the next point Wk+1. The new search direction is chosen to be a combination of these two, as shown in Figure 1 (bottom). The difference compared to the Euclidean space is that the current search direction −H̃k and the gradient G̃k+1 at the next point lie in different tangent spaces, TWk and TWk+1, respectively. For this reason they are not directly compatible. In order to combine them properly, the parallel transport of the current search direction −H̃k from Wk to Wk+1 along the corresponding geodesic is utilized. The new search direction at Wk+1 (see Figure 1 (bottom)) is

−H̃k+1 = −G̃k+1 − γk τH̃k,  (6)

where τH̃k is the parallel transport (5) of the vector H̃k into TWk+1. The weighting factor γk is determined such that the directions τH̃k and H̃k+1 are


Figure 1: SD (top) vs. CG (bottom) on Riemannian manifolds. The SD algorithm takes ninety-degree turns at every iteration, i.e., ⟨−G̃k+1, −τG̃k⟩Wk+1 = 0, where τ denotes the parallelism w.r.t. the geodesic connecting Wk and Wk+1. CG takes a search direction −H̃k+1 at Wk+1 which is a combination of the new SD direction −G̃k+1 at Wk+1 and the current search direction −H̃k translated to Wk+1 along the geodesic connecting Wk and Wk+1. The new Riemannian steepest descent direction −G̃k+1 at Wk+1 will be orthogonal to the current search direction −H̃k at Wk translated to Wk+1, i.e., ⟨−G̃k+1, −τH̃k⟩Wk+1 = 0.

Hessian-conjugate [8,10], i.e.,

γk = − Hess J(τH̃k, G̃k+1) / Hess J(τH̃k, τH̃k).  (7)

2.4  Conjugate Gradient Algorithm on U(n)

In the exact conjugacy formula (7), only the special case of vector transport (5) is needed. However, the factor γk contains the computationally expensive Hessian and is therefore, as usual, approximated. Using the crucial assumption that Wk+1 is a minimum point along the geodesic GW(t) and the first-order Taylor series approximation of the first differential form of J(Wk+1), the Polak-Ribière approximation formula for the factor γk is obtained [8,10]:

γk = ⟨G̃k+1 − τG̃k, G̃k+1⟩Wk+1 / ⟨G̃k, G̃k⟩Wk.  (8)

The parallel transport for G̃k could be obtained from (4), but we choose to approximate it by the right multiplication by Wk^H Wk+1, leading to the approximate Polak-Ribière formula

γk = ⟨Gk+1 − Gk, Gk+1⟩I / ⟨Gk, Gk⟩I.  (9)

If H̃k = G̃k, then by (5) formulae (8) and (9) become equal. In our simulations (8) and (9) have given identical results, and therefore we propose the computationally simpler formula (9). Finally, the conjugate gradient step is taken along the geodesic emanating from Wk in the direction −H̃k = −HkWk, i.e.,

Wk+1 = exp(−µkHk)Wk.  (10)

A line search needs to be performed in order to find the step size µk which corresponds to a local minimum along the geodesic.

3  Line Search on U(n). Step size selection

Step size selection plays a crucial role in the overall performance of the CG algorithm. In general, selecting an appropriate step size may be computationally expensive even for Euclidean gradient algorithms. This is due to the fact that most line search methods [18] require multiple cost function evaluations. On Riemannian manifolds, every cost function evaluation requires the expensive computation of the local parametrization (in our case, the exponential map). In [11], a conjugate gradient on the Grassmann manifold is proposed; its line search method is exact only for convex quadratic cost functions on the Euclidean space. The difficulty of finding closed-form solutions for a suitable step size is discussed in [11]. A one-dimensional Newton method which uses a first-order Fourier expansion to approximate the cost function along geodesics on SO(n) is proposed in [15]. It requires computing the first- and second-order derivatives of the cost function along geodesics. The method exploits the periodicity of the cost function along geodesics on SO(2) and SO(3). For n > 3 the accuracy of the approximation decreases, since the periodicity of the cost function is lost. The method avoids computing the matrix exponential by using the closed-form Rodrigues formula, valid for SO(2) and SO(3) only. In the case of U(2) and U(3), and in general for n > 1, the Rodrigues formula cannot be applied.
In this section, we propose two novel methods for performing high-accuracy one-dimensional search along geodesics on U(n). They rely on the fact that smooth functions, as well as their derivatives, are almost periodic [22] along geodesics on U(n). The first method is based on a polynomial approximation of the first-order derivative of the cost function along geodesics. The second one is based on an approximation using the Discrete Fourier Transform (DFT). We choose to approximate the derivative of the cost function along geodesics and find the corresponding zeros, instead of approximating the cost function itself and finding the local minima as in [15]. Moreover, compared to [15], the proposed methods do not require the second-order derivative.
The main goal is to find a step size µk > 0 along the geodesic curve

W(µ) = exp(−µHk)Wk ≜ R(µ)Wk,  R(µ) ∈ U(n),  (11)

which minimizes the composed function

Jˆ(µ) ≜ J(W(µ)).  (12)

The direction −Hk ∈ u(n) in (11) may correspond to a steepest descent, conjugate gradient, or any other gradient-type method. Consider two successive points on U(n) such that Wk = W(0) and Wk+1 = W(µk). Finding the step size µ = µk that minimizes Jˆ(µ) may be done by computing the first-order derivative dJˆ/dµ and setting it to zero. By using the chain rule for the composed function J(W(µ)), we get

dJˆ/dµ (µ) = −2ℜ{trace[ ∂J/∂W*(R(µ)Wk) Wk^H R^H(µ) Hk^H ]}.  (13)
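As a sanity check, formula (13) can be compared against a finite difference of Jˆ(µ). The sketch below is our own illustration using the Brockett-type cost of Section 5.1, J(W) = tr{W^H ΣWN}, whose Euclidean gradient is ∂J/∂W* = ΣWN; all variable names are assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = 0.5 * (A + A.conj().T)                 # Hermitian Sigma
Nmat = np.diag(np.arange(1.0, n + 1.0))
Wk, _ = np.linalg.qr(rng.standard_normal((n, n))
                     + 1j * rng.standard_normal((n, n)))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Hk = 0.5 * (B - B.conj().T)                    # -Hk in u(n)

def J(W):
    return np.real(np.trace(W.conj().T @ Sigma @ W @ Nmat))

def dJhat_dmu(mu):
    # Eq. (13), with dJ/dW* = Sigma W N evaluated at W(mu) = R(mu) Wk
    R = expm(-mu * Hk)
    Gamma = Sigma @ R @ Wk @ Nmat
    return -2.0 * np.real(np.trace(
        Gamma @ Wk.conj().T @ R.conj().T @ Hk.conj().T))

# central finite difference agrees with the analytic derivative
mu, eps = 0.3, 1e-6
fd = (J(expm(-(mu + eps) * Hk) @ Wk)
      - J(expm(-(mu - eps) * Hk) @ Wk)) / (2.0 * eps)
assert abs(dJhat_dmu(mu) - fd) < 1e-4
```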

Almost periodicity may be exploited in many ways in order to find the zeros of the first-order derivative corresponding to the desired values of the step size. We present two different approaches. The first approach finds only the first zero of the derivative by using a polynomial approximation of the derivative. The second one finds several zeros of the derivative and is based on a Fourier series approximation. Other approaches may also be possible, and they are being investigated.

3.1  Line search on U(n) by using a polynomial approximation approach

The goal of the polynomial approximation approach is to find the first local minimum of the cost function along a given geodesic. This corresponds to finding the first zero-crossing of the first-order derivative of the cost function, which is also almost periodic. For this purpose we use a low-order polynomial approximation of the derivative and find its smallest positive root. The approximation range of the derivative is determined from its spectral content. The method provides computational benefits since only one evaluation of the matrix exponential is needed. The method is explained in detail below.
The direction −H̃k is a descent direction at Wk; otherwise it is reset to the negative gradient. Therefore, the first-order derivative dJˆ/dµ is always negative at the origin (at µ = 0) and the cost function Jˆ(µ) is monotonically decreasing up to the first zero-crossing of dJˆ/dµ. This value corresponds to a local minimum of Jˆ(µ) along the geodesic (11) (or, seldom, to a saddle point). Due to differentiation, the spectrum of dJˆ/dµ is the high-pass filtered spectrum of Jˆ(µ). The frequency components are determined by the purely imaginary eigenvalues of −Hk, as shown in Subsection 2.2. Therefore, the cost function as well as its derivative possess discrete frequency spectra. For the task at hand, we are not interested in the complete spectrum of dJˆ/dµ; the main interest lies in the smallest zero-crossing of the derivative. This is determined by the highest frequency component in the spectrum of dJˆ/dµ in the following way. In the interval of µ equal to one period of the highest frequency in the spectrum, the function dJˆ/dµ has at most one complete cycle on that frequency, and less than one on the other frequencies. The highest frequency component of dJˆ/dµ is q|ωmax|, where ωmax is the eigenvalue of Hk having the highest magnitude, and q is the order of the cost function.
The order q corresponds to the highest degree to which t appears in the Taylor series expansion of J(W + tZ) about t0 = 0, and it is assumed to be finite (which holds for most practical cost functions).


Figure 2: Performing the line search for the JADE [23] cost function. The almost periodic behavior of the function Jˆ(µ) and its first-order derivative dJˆ/dµ (13) along the geodesic W(µ) (11) may be observed. The first zero-crossing of dJˆ/dµ corresponds to the first local minimum of Jˆ(µ), i.e., the desired step size µk. This zero-crossing is obtained by using a fourth-order polynomial approximation of dJˆ/dµ at equi-spaced points within the interval [0, Tµ].


Otherwise, a truncated Taylor series may be used. The period corresponding to the highest frequency component is

Tµ = 2π / (q|ωmax|).  (14)

The highest frequency component is amplified the most by the differentiation (high-pass filtering). The other components have less than one cycle within that interval, as well as usually smaller amplitudes. Therefore, the first-order derivative dJˆ/dµ crosses zero at most twice within the interval [0, Tµ). The zeros of the derivative are detected as sign changes of the derivative within [0, Tµ). Since dJˆ/dµ varies very slowly within the interval [0, Tµ) due to the almost periodic property of the derivative, a low-order polynomial approximation of the derivative is sufficient to determine the corresponding zero-crossing. The approximation requires evaluating the derivative at least at P points, where P is the order of the polynomial, resulting in at most P zero-crossings for the approximation of the derivative. In order to reduce complexity, the derivative is evaluated at the equi-spaced points {0, Tµ/P, 2Tµ/P, . . . , Tµ}. Consequently, only one computation of the matrix exponential R(µ) = exp(−µHk) is needed, at µ = Tµ/P; the next (P − 1) values are powers of R(µ). The polynomial coefficients may be found by solving a set of linear equations. In Figure 2 we take as an example the JADE cost function used to perform the joint diagonalization for blind separation in [23]. A practical application of the proposed algorithm to blind separation by optimizing the JADE criterion will be given later in Section 5.2. The cost function Jˆ(µ) is represented by the black continuous curve in Figure 2, and its first-order derivative dJˆ/dµ by the gray continuous curve. The interval Tµ where the derivative needs to be approximated is also shown. A fourth-order polynomial approximation at equi-spaced points within the interval [0, Tµ) is used; the approximation is represented by the thick dashed line.
The steps of the proposed geodesic search algorithm based on polynomial approximation are given in Table 1.

3.2  Line search on U(n) by using a DFT-based approach

The goal of our second line search method is to find multiple local minima of the cost function along a given geodesic and select the best one. The main benefit of this method is that it allows large steps along geodesics. The proposed method also requires only one evaluation of the matrix exponential,

1. Given Wk ∈ U(n) and −Hk ∈ u(n), compute the eigenvalue of Hk of highest magnitude, |ωmax|.
2. Determine the order q of the cost function J(W) in the coefficients of W, i.e., the highest degree to which t appears in the Taylor expansion of J(W + tZ), t ∈ R, Z ∈ C^{n×n}.
3. Determine the value Tµ = 2π/(q|ωmax|).
4. Choose the order of the approximating polynomial: P = 3, 4, or 5.
5. Evaluate R(µ) = exp(−µHk) at the equi-spaced points µi ∈ {0, Tµ/P, 2Tµ/P, . . . , Tµ} as follows:
   R0 ≜ R(0) = I,
   R1 ≜ R(Tµ/P) = exp(−(Tµ/P)Hk),
   R2 ≜ R(2Tµ/P) = R1R1, . . . , RP ≜ R(Tµ) = RP−1R1.
6. By using the computed values of Ri, evaluate the first-order derivative of Jˆ(µ) at µi, for i = 0, . . . , P:
   Jˆ′(µi) = −2ℜ{trace[ ∂J/∂W*(RiWk) Wk^H Ri^H Hk^H ]}.
7. Compute the polynomial coefficients a0, . . . , aP: a0 = Jˆ′(0) and
   [µ1 µ1² . . . µ1^P; . . . ; µP µP² . . . µP^P] [a1; . . . ; aP] = [Jˆ′(µ1) − a0; . . . ; Jˆ′(µP) − a0].
8. Find the smallest real positive root ρmin of a0 + a1µ + . . . + aPµ^P = 0. If it exists, set the step size to µk = ρmin; otherwise set µk = 0.

Table 1: Proposed geodesic search algorithm on U(n) based on polynomial approximation.
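A compact Python rendering of Table 1 might look as follows (an illustrative sketch under our own naming; `np.polyfit`/`np.roots` replace the explicit linear system of step 7, which is mathematically equivalent since P + 1 samples determine a degree-P interpolant). It is exercised here on the minimization of the negative Brockett cost of Section 5.1.

```python
import numpy as np
from scipy.linalg import expm

def poly_line_search(Wk, Hk, euclid_grad, q, P=5):
    """Geodesic search of Table 1: step size toward the first local minimum
    of Jhat(mu) = J(exp(-mu*Hk) Wk), using one matrix exponential."""
    # steps 1-3: largest eigenvalue magnitude of Hk and the interval T_mu
    w_max = np.max(np.abs(np.linalg.eigvals(Hk)))
    T_mu = 2.0 * np.pi / (q * w_max)                   # Eq. (14)
    # step 5: a single matrix exponential; the remaining rotations are powers
    R1 = expm(-(T_mu / P) * Hk)
    R = np.eye(Wk.shape[0], dtype=complex)
    mus, deriv = [], []
    for i in range(P + 1):
        Gamma = euclid_grad(R @ Wk)
        # step 6: first-order derivative of Jhat at mu_i, Eq. (13)
        deriv.append(-2.0 * np.real(np.trace(
            Gamma @ Wk.conj().T @ R.conj().T @ Hk.conj().T)))
        mus.append(i * T_mu / P)
        R = R1 @ R
    # steps 7-8: degree-P interpolant of the derivative, smallest positive root
    roots = np.roots(np.polyfit(mus, deriv, P))
    real_pos = [r.real for r in roots if abs(r.imag) < 1e-6 and r.real > 0]
    return min(real_pos) if real_pos else 0.0

# usage on J(W) = -tr(W^H Sigma W N): one SD step must not increase J
rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = 0.5 * (A + A.conj().T)
Nmat = np.diag(np.arange(1.0, n + 1.0))
grad = lambda W: -Sigma @ W @ Nmat                     # dJ/dW*
J = lambda W: -np.real(np.trace(W.conj().T @ Sigma @ W @ Nmat))
Wk = np.eye(n, dtype=complex)
Gamma0 = grad(Wk)
Gk = Gamma0 @ Wk.conj().T - Wk @ Gamma0.conj().T       # Riemannian gradient
mu_k = poly_line_search(Wk, Gk, grad, q=2)
W_new = expm(-mu_k * Gk) @ Wk
assert J(W_new) <= J(Wk) + 1e-9
```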


but more matrix multiplication operations. The basic idea is to approximate the almost periodic function dJˆ/dµ (13) by a periodic one, using the classical Discrete Fourier Transform (DFT) approach. The method is explained next. First, the length of the DFT interval TDFT needs to be set. The longer the DFT interval, the better the approximation. In practice we have to limit the length of the DFT interval to a few periods Tµ (14) corresponding to the highest frequency component (minimum one; the maximum depends on how many minima are targeted). Once the DFT interval length is set, the derivative dJˆ/dµ needs to be sampled at NDFT equi-distant points. According to the Nyquist sampling criterion, K ≥ 2 samples must be taken within an interval of length Tµ. Therefore, if NT periods Tµ are considered, the DFT length is NDFT ≥ 2NT. Due to the fact that Tµ does not necessarily correspond to any almost period [22] of the derivative, its values at the edges of the DFT interval may differ. In order to avoid approximation mismatches at the edges of the DFT interval, a window function may be applied [24]. The chosen window function must be strictly positive in order to preserve the positions of the zeros of the first-order derivative that we are interested in. In our approach we choose a Hann window h(i) [24] and discard the zero values at the edges. This type of window minimizes the mismatches at the edges of the window. Therefore, instead of approximating the first-order derivative (13) directly, it is more desirable to approximate the windowed derivative D(µi) = h(i) dJˆ/dµ(µi), i = 0, . . . , NDFT − 1, as

D(µ) ≈ Σ_{k=−(NDFT−1)/2}^{+(NDFT−1)/2} ck exp( j 2πkµ / TDFT ),  (15)

where NDFT is chosen to be an odd number. Again, in order to avoid computing the matrix exponential more than once, the derivative dJˆ/dµ is evaluated at the points µi ∈ {0, TDFT/NDFT, . . . , (NDFT − 1)TDFT/NDFT}. After determining the Fourier coefficients ck, the polynomial corresponding to the Fourier series approximation (15) is set to zero. The roots of the polynomial (15) which are close to the unit circle need to be determined, i.e., ρl = e^{jωl}, l ≤ 2NT. A tolerance δ from the unit circle may be chosen experimentally (e.g., δ < 1%). The values of µ corresponding to those roots need to be found. Given a descent direction −Hk, the smallest step size value µl corresponds to a minimum (or, seldom, to a saddle point). If no saddle points occur within the DFT window, all step size values µl with l odd correspond to local minima, and the even ones correspond to maxima. Within the interval TDFT there are at most NT minima, and it is then possible to choose the best one. Therefore, the global minimum within the DFT window can be chosen in order to


Figure 3: Performing the line search for the JADE [23] cost function. The almost periodic behavior of the function Jˆ(µ) and its first-order derivative dJˆ/dµ (13) along the geodesic W(µ) (11) may be noticed. The odd zero-crossing values of dJˆ/dµ correspond to local minima of Jˆ(µ), i.e., to desired values of the step size µk. They are obtained by a DFT-based approximation of dJˆ/dµ at equi-spaced points within the interval TDFT.

reduce the cost function as much as possible at every iteration. Finding the best minimum would require evaluating the cost function, and therefore computing the matrix exponential, for all µl with odd l, which is rather expensive. A reasonable solution in this case is to use the information in the sampled values of the cost function. Therefore, the step size is set to the root which is closest to the sample value that achieves a minimum of the sampled cost function. In Figure 3, we consider the JADE cost function [23], analogously to the example in Figure 2. The steps of the proposed geodesic search algorithm based on the discrete Fourier transform (DFT) approximation are given in Table 2.

1. Given Wk ∈ U(n) and −Hk ∈ u(n), compute the eigenvalue of Hk of highest magnitude, |ωmax|.
2. Determine the order q of the cost function J(W) in the coefficients of W, i.e., the highest degree to which t appears in the Taylor expansion of J(W + tZ), t ∈ R, Z ∈ C^{n×n}.
3. Determine the value Tµ = 2π/(q|ωmax|).
4. Choose the sampling factor K = 3, 4, or 5. Select the number of periods Tµ for the approximation, NT = 1, 2, . . .
5. Determine the length of the DFT interval TDFT = NT Tµ and the DFT length NDFT = 2⌊KNT/2⌋ + 1, where ⌊·⌋ denotes the integer part.
6. Evaluate the rotation R(µ) = exp(−µHk) at the equi-spaced points µi ∈ {0, TDFT/NDFT, . . . , (NDFT − 1)TDFT/NDFT} as follows:
   R0 ≜ R(0) = I,
   R1 ≜ R(TDFT/NDFT) = exp(−(TDFT/NDFT)Hk),
   R2 ≜ R(2TDFT/NDFT) = R1R1, . . . , RNDFT−1 ≜ R((NDFT − 1)TDFT/NDFT) = RNDFT−2R1.
7. By using the Ri computed in step 6, evaluate the first-order derivative of Jˆ(µ) at µi, for i = 0, . . . , NDFT − 1:
   Jˆ′(µi) = −2ℜ{trace[ ∂J/∂W*(RiWk) Wk^H Ri^H Hk^H ]}.
8. Compute the Hann window: h(i) = 0.5 − 0.5 cos( 2π(i+1)/(NDFT+1) ), i = 0, . . . , NDFT − 1.
9. Compute the windowed derivative: D(µi) = h(i)Jˆ′(µi), i = 0, . . . , NDFT − 1.
10. For k = −(NDFT − 1)/2, . . . , +(NDFT − 1)/2, compute the Fourier coefficients
   ck = Σ_{i=0}^{NDFT−1} D(µi) exp( −j 2πik/NDFT ).
11. Find the roots ρl of the approximating Fourier polynomial
   P(µ) = Σ_{k=−(NDFT−1)/2}^{+(NDFT−1)/2} ck exp( +j 2πkµ/TDFT ) ≈ D(µ)
   close to the unit circle, within the radius tolerance δ. If there are no roots of the form ρl ≈ e^{jωl}, set the step size to µk = 0 and STOP.
12. If there are roots of the form ρl ≈ e^{jωl}, compute the corresponding zero-crossing values of P(µ): µl = [ωl TDFT/(2π)] modulo TDFT. Order the µl in ascending order and pick the odd-indexed values µ2l+1, l = 0, 1, . . .
13. By using the Ri computed in step 6, find the value of µ which minimizes Jˆ(µ) over the samples µi: µi⋆ = arg minµi J(RiWk), i = 0, . . . , NDFT − 1. Set the step size to µk = arg minµ2l+1 |µi⋆ − µ2l+1|.

Table 2: Proposed geodesic search algorithm on U(n) based on DFT approximation.
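The DFT search of Table 2 can be sketched in the same spirit (our own illustrative code; the root-finding of step 11 is done by mapping the Fourier series to an ordinary polynomial of degree NDFT − 1 in z = e^{j2πµ/TDFT}; all names are assumptions). If no root lands within the tolerance δ of the unit circle, the step size falls back to 0, as in step 11.

```python
import numpy as np
from scipy.linalg import expm

def dft_line_search(Wk, Hk, cost, euclid_grad, q, K=3, NT=5, delta=0.05):
    """Geodesic search of Table 2: pick among several minima of Jhat(mu)."""
    n = Wk.shape[0]
    w_max = np.max(np.abs(np.linalg.eigvals(Hk)))
    T_mu = 2.0 * np.pi / (q * w_max)                  # Eq. (14)
    T_dft = NT * T_mu                                 # step 5
    N = 2 * ((K * NT) // 2) + 1                       # odd DFT length
    R1 = expm(-(T_dft / N) * Hk)                      # step 6: one expm only
    R = np.eye(n, dtype=complex)
    Jp = np.empty(N)                                  # sampled derivative
    Js = np.empty(N)                                  # sampled cost, step 13
    for i in range(N):
        Wi = R @ Wk
        Jp[i] = -2.0 * np.real(np.trace(              # step 7, Eq. (13)
            euclid_grad(Wi) @ Wk.conj().T @ R.conj().T @ Hk.conj().T))
        Js[i] = cost(Wi)
        R = R1 @ R
    h = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(1, N + 1) / (N + 1))
    c = np.fft.fft(h * Jp)                            # steps 8-10
    # step 11: coefficients of z^{(N-1)/2} P(mu) as a polynomial in z
    half = (N - 1) // 2
    p = np.zeros(N, dtype=complex)
    for idx, k in enumerate(np.fft.fftfreq(N, 1.0 / N).astype(int)):
        p[k + half] = c[idx]
    roots = np.roots(p[::-1])
    circ = [r for r in roots if abs(abs(r) - 1.0) < delta]
    mus = sorted((np.angle(r) % (2 * np.pi)) * T_dft / (2 * np.pi)
                 for r in circ)
    if not mus:
        return 0.0                                    # no zero found
    minima = mus[0::2]                                # step 12: odd-indexed
    mu_star = np.argmin(Js) * T_dft / N               # step 13
    return min(minima, key=lambda m: abs(m - mu_star))

# usage on J(W) = -tr(W^H Sigma W N), as in the sketch for Table 1
rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = 0.5 * (A + A.conj().T)
Nmat = np.diag(np.arange(1.0, n + 1.0))
grad = lambda W: -Sigma @ W @ Nmat
J = lambda W: -np.real(np.trace(W.conj().T @ Sigma @ W @ Nmat))
Wk = np.eye(n, dtype=complex)
Gamma0 = grad(Wk)
Gk = Gamma0 @ Wk.conj().T - Wk @ Gamma0.conj().T
mu_k = dft_line_search(Wk, Gk, J, grad, q=2)
W_new = expm(-mu_k * Gk) @ Wk
assert J(W_new) <= J(Wk) + 1e-6
```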

3.3  Computational aspects

Both the polynomial approach and the DFT-based approach require several evaluations of the cost function Jˆ(µ) and its first-order derivative dJˆ/dµ (13) within the corresponding approximation interval. However, they require only one computation of the matrix exponential; the desirable property exp(−mµHk) = [exp(−µHk)]^m is used to evaluate the rotation matrices at the equi-spaced points. We also emphasize that for both methods, when evaluating the approximation interval Tµ by using (14), only the largest-magnitude eigenvalue |ωmax| of Hk needs to be computed, and not the full eigen-decomposition (nor the corresponding eigenvector), which is of complexity O(n) [8]. The major benefit of the DFT method is that multiple minima are found and the best minimum can be selected at every iteration (which is not necessarily the first local minimum). In conclusion, in terms of complexity both proposed geodesic search methods are more efficient than the Armijo method [18], which requires multiple evaluations of the matrix exponential at every iteration [1]. Unlike the method in [15], the proposed methods do not require computing any second-order derivatives, which in some cases may involve large matrix dimensions or may be non-trivial to calculate (e.g., for the JADE criterion (18)).

4  The practical conjugate gradient algorithm on U(n)

In this section we propose a practical conjugate gradient algorithm operating on the Lie group of unitary matrices U(n). By combining the generic conjugate gradient algorithm proposed in Subsection 2.4, which uses the approximate Polak-Ribière formula, with one of the novel geodesic search algorithms described in Subsections 3.1 and 3.2, we obtain a low-complexity conjugate gradient algorithm on the unitary group U(n). The proposed CG-PR algorithm on U(n) is summarized in Table 3.
Remark 1: The line search algorithms in Table 1 and Table 2 and the CG algorithm in Table 3 are designed for minimizing a function defined on U(n). They may easily be converted into algorithms for maximizing a function on U(n). The rotation matrix R1 in step 5 (Table 1) needs to be replaced by R1 = exp[(+Tµ/P)Hk], and the sign of the derivative Jˆ′(µi) in step 6 needs to be changed, i.e., Jˆ′(µi) = +2ℜ{trace[ ∂J/∂W*(RiWk) Wk^H Ri^H Hk^H ]}. In Table 2, the same sign changes are needed in steps 6 and 7. Additionally, in step 13, the value µi⋆ = arg maxµi J(RiWk) is used. Similarly, for the CG algorithm in

1. Initialization: k = 0, Wk = I.
2. If (k modulo n²) == 0, compute the Euclidean gradient at Wk, then the Riemannian gradient and the search direction at Wk, translated to the group identity:
   Γk = ∂J/∂W*(Wk),
   Gk = ΓkWk^H − WkΓk^H,
   Hk := Gk.
3. Evaluate ⟨Gk, Gk⟩I = (1/2) trace{Gk^H Gk}. If it is sufficiently small, STOP.
4. Given the point Wk ∈ U(n) and the tangent direction −Hk ∈ u(n), determine the step size µk along the geodesic emanating from Wk in the direction −HkWk, by using the algorithm in Table 1 or the algorithm in Table 2.
5. Update: Wk+1 = exp(−µkHk)Wk.
6. Compute the Euclidean gradient at Wk+1, then the Riemannian gradient and the search direction at Wk+1, translated to the group identity:
   Γk+1 = ∂J/∂W*(Wk+1),
   Gk+1 = Γk+1Wk+1^H − Wk+1Γk+1^H,
   γk = ⟨Gk+1 − Gk, Gk+1⟩I / ⟨Gk, Gk⟩I,
   Hk+1 = Gk+1 + γkHk.
7. If ⟨Hk+1, Gk+1⟩I = (1/2)ℜ trace{Hk+1^H Gk+1} < 0, then Hk+1 := Gk+1.
8. k := k + 1; go to step 2.

Table 3: Conjugate gradient algorithm on U(n) using the Polak-Ribière formula – CG-PR.

Table 3, the update in step 5 would be Wk+1 = exp(+µkHk)Wk.
Remark 2: In step 7, if the CG search direction is not a descent/ascent direction when minimizing/maximizing a function on U(n), it is reset to the steepest descent/ascent direction. This step remains the same whether minimization or maximization is performed, since the inner product ⟨Hk+1, Gk+1⟩I needs to be positive.
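Putting the pieces together, the CG-PR loop of Table 3 can be sketched as below (illustrative code of ours, in minimization form; for brevity a plain backtracking rule stands in for the geodesic searches of Section 3, and the cost is J(W) = −tr{W^H ΣWN}, so that minimizing J maximizes the Brockett criterion of Section 5.1).

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = 0.5 * (A + A.conj().T)                        # Hermitian Sigma
Nmat = np.diag(np.arange(1.0, n + 1.0))

J = lambda W: -np.real(np.trace(W.conj().T @ Sigma @ W @ Nmat))

def riem_grad(W):
    Gamma = -Sigma @ W @ Nmat                         # Euclidean gradient dJ/dW*
    return Gamma @ W.conj().T - W @ Gamma.conj().T    # G = Gamma W^H - W Gamma^H

inner = lambda X, Y: 0.5 * np.real(np.trace(X.conj().T @ Y))

W = np.eye(n, dtype=complex)                          # step 1
G = riem_grad(W)
H = G.copy()                                          # step 2 reset at k = 0
for k in range(300):
    if inner(G, G) < 1e-18:                           # step 3: stopping rule
        break
    mu, J0 = 1.0, J(W)                                # step 4 (backtracking here)
    while mu > 1e-12 and J(expm(-mu * H) @ W) >= J0:
        mu *= 0.5
    W = expm(-mu * H) @ W                             # step 5: geodesic update
    G_new = riem_grad(W)                              # step 6
    gamma = inner(G_new - G, G_new) / inner(G, G)     # approximate PR, Eq. (9)
    H = G_new + gamma * H
    if inner(H, G_new) < 0:                           # step 7: reset to SD
        H = G_new.copy()
    G = G_new

# at convergence, W^H Sigma W approaches a diagonal matrix of eigenvalues
D = W.conj().T @ Sigma @ W
```

Since every accepted step strictly decreases J and the iterate never leaves U(n), the final W is a unitary matrix with a lower cost than the identity initialization.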

5  Simulation Examples

In this section we apply the proposed Riemannian conjugate gradient algorithm to two different optimization problems on U(n). The first one is the maximization of the Brockett function on U(n), which is a classical example of optimization under orthogonal matrix constraint [9,17]. The second one is the minimization of the JADE cost function [23] which is a practical application of the proposed conjugate gradient algorithm to blind source separation. Other possible signal processing applications are considered in [1–3].

5.1  Diagonalization of a Hermitian Matrix. Maximizing the Brockett criterion on U(n)

In this subsection we maximize the Brockett criterion [9,17], which is given as

JB(W) = tr{W^H ΣWN}, subject to W ∈ U(n).  (16)

The matrix Σ is a Hermitian matrix and N is a diagonal matrix with the diagonal elements 1, . . . , n. By maximizing (16) (see Remark 1 in Section 4), the matrix W converges to a matrix of eigenvectors of Σ, and D = W^H ΣW converges to a diagonal matrix containing the eigenvalues of Σ sorted in ascending order along the diagonal. This type of optimization problem arises in many signal processing applications such as blind source separation, subspace estimation, and high-resolution direction finding, as well as in communications applications [2]. This example is chosen for illustrative purposes. The order of the Brockett function is q = 2, and the Euclidean gradient is given by ΓW = ΣWN. The performance is studied in terms of convergence speed, considering a diagonality criterion ∆, and in terms of deviation from the unitary constraint, using a unitarity criterion Ω, defined as

∆ = 10 lg ( off{W^H ΣW} / diag{W^H ΣW} ),  Ω = 10 lg ‖WW^H − I‖²F.  (17)
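For reference, the two criteria of (17) can be computed as in the sketch below (our own code; all names are assumptions). With W taken as the eigenvector matrix of a Hermitian Σ, i.e., the maximizer of (16), both ∆ and Ω sit near machine precision.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = 0.5 * (A + A.conj().T)                 # random Hermitian Sigma

def off(M):
    # sum of squared magnitudes of the off-diagonal elements
    return np.sum(np.abs(M - np.diag(np.diag(M))) ** 2)

def diag_energy(M):
    # sum of squared magnitudes of the diagonal elements
    return np.sum(np.abs(np.diag(M)) ** 2)

def criteria(W):
    D = W.conj().T @ Sigma @ W
    Delta = 10.0 * np.log10(off(D) / diag_energy(D))       # diagonality, dB
    Omega = 10.0 * np.log10(
        np.linalg.norm(W @ W.conj().T - np.eye(n), 'fro') ** 2)  # unitarity, dB
    return Delta, Omega

# eigh returns eigenvalues in ascending order, matching the Brockett maximizer
_, W = np.linalg.eigh(Sigma)
Delta, Omega = criteria(W)
assert Delta < -100.0          # essentially perfectly diagonalized
assert Omega < -100.0          # unitary to machine precision
```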


Figure 4: A comparison between the steepest ascent (SA) algorithm on U(n) and the proposed conjugate gradient algorithm on U(n) in Table 3, using the Polak-Ribière formula (CG-PR). Five line search methods for selecting the step size parameter are considered: the Armijo method [18], the grid search, the line search method in Table 1 based on polynomial approximation, the line search method in Table 2 based on DFT approximation, and the line search method proposed for SO(n) in [15]. The performance measure is the diagonality criterion ∆ vs. the iteration step.

where the off{·} operator computes the sum of the squared magnitudes of the off-diagonal elements of a matrix, and diag{·} does the same for the diagonal elements [25]. The diagonality criterion ∆ (17) measures the departure of the matrix W^H ΣW from the diagonal property on a logarithmic scale, and it is minimized when the Brockett criterion (16) is maximized. The results are averaged over 100 random realizations of the 6 × 6 Hermitian matrix Σ. In Figure 4, we compare two different optimization algorithms. The first algorithm is the geodesic steepest ascent (SA) on U(n), obtained from the CG-PR algorithm in Table 3 by setting γk to zero at every iteration k. The second algorithm is the CG-PR algorithm in Table 3 itself (see Remark 1, Section 4). For both algorithms (SA, CG-PR), five different line search methods for selecting the step size parameter are compared. The first one is the Armijo

method [18], used as in [1,12]. The second is an exhaustive search along the geodesics W(µ) (11) using a linear grid µ ∈ [0, 10] for the parameter µk. A sufficiently large upper limit of the interval has been set experimentally in order to ensure that the interval contains at least one local maximum at every iteration. The grid spacing has been set to ensure a reasonable resolution (10⁻⁴). The grid search method is very accurate, but extremely expensive; it has been included for comparison purposes only. The third line search method is the polynomial approximation approach proposed in Table 1. The fourth one is the DFT approximation approach in Table 2. The fifth one is the line search method proposed for SO(n) in [15]; it is based on a Newton step which approximates the cost function along geodesics by using a first-order Fourier expansion. For the proposed line search methods a polynomial order P = 5 has been used (see Table 1). The parameters used in the DFT approach are the sampling factor K = 3 and NT = 10 periods Tµ (see Table 2). It may be noticed in Figure 4 that the CG-PR algorithm significantly outperforms the steepest ascent (SA) algorithm for all line search methods considered here, except the proposed DFT-based approach. The polynomial approximation approach proposed in Table 1 performs as well as the method in [15] and the grid search method when used with the SA. The proposed DFT-based line search approach in Table 2 significantly outperforms the method in [15] when used with the SA algorithm, and achieves a convergence speed comparable to that of the CG-PR algorithm. The convergence of the SA algorithm with the Armijo line search method [18] is better than with the proposed polynomial approach and worse than with the DFT approach. When used with the CG-PR, all methods achieve similar convergence speed, but their complexities differ. In terms of satisfying the unitary constraint, all algorithms provide good performance.
The unitarity criterion Ω (17) is close to machine precision, as also shown in [1].

5.2  Joint Approximate Diagonalization of a Set of Hermitian Matrices. Minimizing the JADE criterion on U(m)

In this subsection we apply the proposed CG-PR algorithm, together with the two novel line search methods, to a practical application: blind source separation (BSS) of communication signals. A number of m = 16 independent signals are separated from their r = 18 mixtures based on the statistical properties of the original signals. Four signals from each of the following constellations are transmitted: BPSK, QPSK, 16-QAM and 64-QAM. A total of 5000 snapshots are collected and 100 independent realizations of the r × m

mixture matrix are considered. The signal-to-noise ratio is SNR = 20 dB. The blind recovery of the desired signals may be done in two stages by using the JADE approach [23]. It can be done up to a phase and a permutation ambiguity, which is inherent to all blind methods. The first stage is the prewhitening of the received signals based on the subspace decomposition of the received correlation matrix, and it could also be formulated as a maximization of the Brockett function (16) as shown in Section 5.1. The second stage is a unitary rotation operation which needs to be applied to the whitened signals. It is formulated as an optimization under unitary constraint and solved by using the approach proposed in Table 3. The function to be minimized is the joint diagonalization criterion [23]

J_JADE(W) = Σ_{i=1}^{m} off{W^H M̂_i W}  subject to W ∈ U(m),  (18)

where the eigen-matrices M̂_i are estimated from the fourth-order cumulants. The criterion penalizes the departure of all eigen-matrices from the diagonal property [25]. The order of the function (18) is q = 4, and the Euclidean gradient of the JADE cost function is given in [1]. In Figure 5-a) we show four of the eighteen received signals, i.e., noisy mixtures of the transmitted signals. Four of the sixteen separated signals are shown in Figure 5-b).

In the first simulation we study the performance of the proposed Riemannian algorithms in terms of convergence speed considering the JADE criterion (18). This criterion is a measure of how well the eigen-matrices M̂_i are jointly diagonalized. The whitening stage is the same for both the classical JADE and the Riemannian algorithms. The unitary rotation stage differs. The classical JADE algorithm in [23] performs the approximate joint diagonalization task by using Givens rotations. Three different Riemannian optimization algorithms are considered. The first one is the steepest descent (SD) on U(m) obtained from the CG algorithm in Table 3 by setting γ_k to zero at every iteration k. The line search method in Table 2, which is based on the DFT approximation approach, is used. The second one is the CG-PR algorithm in Table 3 with the line search method proposed in Table 1 (polynomial approximation approach). The third algorithm is the CG-PR algorithm in Table 3 with the line search method proposed in Table 2 (DFT approximation approach). In Figure 6 it may be noticed that all three Riemannian algorithms outperform the classical Givens rotations approach used in [23]. Again, the CG algorithm converges faster than the SD algorithm with both proposed line search methods (Table 1 and Table 2). The parameters used in
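Interpreting off{·} as the squared Frobenius norm of the off-diagonal elements [25], the JADE criterion (18) can be sketched as follows (an illustrative reconstruction; the eigen-matrices here are synthetic rather than cumulant estimates):

```python
import numpy as np

def off_norm(A):
    """off{A}: squared Frobenius norm of the off-diagonal part of A."""
    return np.linalg.norm(A - np.diag(np.diag(A)), 'fro') ** 2

def jade_criterion(W, M_hats):
    """J_JADE(W) = sum_i off{W^H M_i W}, cf. (18)."""
    return sum(off_norm(W.conj().T @ M @ W) for M in M_hats)

# Sanity check: synthetic eigen-matrices sharing a common unitary diagonalizer Q
rng = np.random.default_rng(0)
m = 4
Q, _ = np.linalg.qr(rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m)))
M_hats = [Q @ np.diag(rng.standard_normal(m)) @ Q.conj().T for _ in range(m)]
j_at_Q = jade_criterion(Q, M_hats)          # (near) zero at the true rotation
j_at_I = jade_criterion(np.eye(m), M_hats)  # positive away from it
```

The criterion vanishes exactly when W jointly diagonalizes all the M̂_i, which is what the Riemannian minimization over U(m) seeks.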

Figure 5: The constellation patterns corresponding to a) 4 of the 18 received signals and b) 4 of the 16 signals recovered by the CG-PR algorithm proposed in Table 3 with the novel line search method in Table 1. Since the method is blind, there is an inherent phase ambiguity, as well as a permutation ambiguity.

this simulation for the line search methods in Tables 1 and 2 are the same as in the previous simulation (Subsection 5.1). All three Riemannian algorithms have a complexity of O(m³) per iteration, and only a few iterations are required to achieve convergence. Moreover, the number of iterations needed to achieve convergence stays almost constant when increasing m. The Givens rotation approach in [23] has a total complexity of O(m⁴), since it updates not only the unitary rotation matrix, but also the full set of eigen-matrices M_i. Therefore, the total complexity of the proposed algorithms is lower, especially when the number of signals m is very large. The proposed algorithms converge faster at similar computational cost per iteration. Therefore, they are suitable for blind separation applications, especially when the number of signals to be separated is large (m > 10).

Figure 6: A comparison between the classical JADE algorithm in [23] based on Givens rotations and three other optimization algorithms: SD on U(m) (Table 3 with γ_k = 0) + the line search method in Table 2 (DFT method); the CG-PR algorithm on U(m) in Table 3 + the line search method in Table 1 (polynomial method); the CG-PR algorithm on U(m) in Table 3 + the line search method in Table 2 (DFT method). The performance measure is the JADE criterion (18) vs. the iteration step. All three Riemannian algorithms outperform the classical Givens rotation approach used in [23]. CG-PR converges faster than SD for both proposed line search methods.

6 Conclusions

In this paper, a Riemannian conjugate gradient algorithm for optimization under unitary matrix constraint is proposed. The algorithm operates on the Lie group of n × n unitary matrices. In order to reduce the complexity, it exploits the geometrical properties of U(n), such as simple formulas for the geodesics and the tangent vectors. The almost-periodic behaviour of smooth functions and their derivatives along geodesics on U(n) is shown. Two novel line search methods exploiting this property are introduced. The first one uses a low-order polynomial approximation for finding the first local minimum along geodesics on U(n). The second one uses a DFT-based approximation for finding multiple minima along geodesics and selects the best one, unlike the Fourier method in [15], which finds only one minimum. Our method better models the spectral content of the almost-periodic derivative of the cost function. The two proposed line search methods outperform the Armijo method [18] in terms of computational complexity and provide better performance. The proposed Riemannian CG algorithm not only achieves faster convergence compared to the SD algorithms proposed in [1, 12], but also has lower computational complexity. It also outperforms the widely used Givens rotations approach for jointly diagonalizing Hermitian matrices, i.e., the classical JADE algorithm [23]. It may be applied, for example, to smart antenna algorithms, wireless communications, biomedical measurements, signal separation, and subspace estimation and tracking tasks, where unitary matrices play an important role in general.

References

[1] T. Abrudan, J. Eriksson, and V. Koivunen, "Steepest descent algorithms for optimization under unitary matrix constraint," IEEE Transactions on Signal Processing, vol. 56, pp. 1134–1147, Mar. 2008.
[2] T. Abrudan, J. Eriksson, and V. Koivunen, "Optimization under unitary matrix constraint using approximate matrix exponential," in Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, pp. 242–246, 28 Oct.–1 Nov. 2005.
[3] J. H. Manton, "On the role of differential geometry in signal processing," in International Conference on Acoustics, Speech and Signal Processing, vol. 5, (Philadelphia), pp. 1021–1024, Mar. 2005.
[4] M. P. do Carmo, Riemannian Geometry. Mathematics: Theory and Applications, Birkhäuser, 1992.
[5] A. Knapp, Lie Groups Beyond an Introduction, vol. 140 of Progress in Mathematics. Birkhäuser, 1996.

[6] D. G. Luenberger, "The gradient projection method along geodesics," Management Science, vol. 18, pp. 620–631, 1972.
[7] D. Gabay, "Minimizing a differentiable function over a differential manifold," Journal of Optimization Theory and Applications, vol. 37, pp. 177–219, Jun. 1982.
[8] S. T. Smith, Geometric Optimization Methods for Adaptive Filtering. PhD thesis, Harvard University, Cambridge, MA, May 1993.
[9] R. W. Brockett, "Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems," Linear Algebra and its Applications, vol. 146, pp. 79–91, 1991.
[10] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[11] M. Kleinsteuber and K. Hüper, "An intrinsic CG algorithm for computing dominant subspaces," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007, vol. 4, pp. 1405–1408, Apr. 2007.
[12] J. H. Manton, "Optimization algorithms exploiting unitary constraints," IEEE Transactions on Signal Processing, vol. 50, pp. 635–650, Mar. 2002.
[13] Y. Nishimori and S. Akaho, "Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold," Neurocomputing, vol. 67, pp. 106–135, Jun. 2005.
[14] S. Fiori, "Quasi-geodesic neural learning algorithms over the orthogonal group: a tutorial," Journal of Machine Learning Research, vol. 1, pp. 1–42, Apr. 2005.
[15] M. D. Plumbley, "Geometrical methods for non-negative ICA: manifolds, Lie groups, toral subalgebras," Neurocomputing, vol. 67, pp. 161–197, 2005.
[16] L. Zhang, "Conjugate gradient approach to blind separation of temporally correlated signals," in IEEE International Conference on Communications, Circuits and Systems, ICCCAS-2004, vol. 2, (Chengdu, China), pp. 1008–1012, 2004.

[17] S. T. Smith, "Optimization techniques on Riemannian manifolds," Fields Institute Communications, American Mathematical Society, vol. 3, pp. 113–136, 1994.
[18] E. Polak, Optimization: Algorithms and Consistent Approximations. New York: Springer-Verlag, 1997.
[19] S. Helgason, Differential Geometry, Lie Groups and Symmetric Spaces. Academic Press, 1978.
[20] S. G. Krantz, Function Theory of Several Complex Variables. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 2nd ed., 1992.
[21] K. Nomizu, "Invariant affine connections on homogeneous spaces," American Journal of Mathematics, vol. 76, pp. 33–65, Jan. 1954.
[22] A. Fischer, "Structure of Fourier exponents of almost periodic functions and periodicity of almost periodic functions," Mathematica Bohemica, vol. 121, no. 3, pp. 249–262, 1996.
[23] J. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings-F, vol. 140, no. 6, pp. 362–370, 1993.
[24] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, pp. 51–83, Jan. 1978.
[25] K. Hüper, U. Helmke, and J. B. Moore, "Structure and convergence of conventional Jacobi-type methods minimizing the off-norm function," in Proceedings of the 35th IEEE Conference on Decision and Control, vol. 2, (Kobe, Japan), pp. 2124–2129, 11–13 Dec. 1996.


[Publication III] T. Abrudan, J. Eriksson, V. Koivunen, “Optimization under Unitary Matrix Constraint using Approximate Matrix Exponential”, Conference Record of the Thirty Ninth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 28 Oct.–1 Nov. 2005, pp. 242–246. ©2005 IEEE. Reprinted with permission.

Optimization under Unitary Matrix Constraint using Approximate Matrix Exponential

Traian Abrudan, Jan Eriksson, Visa Koivunen

SMARAD CoE, Signal Processing Laboratory, Helsinki University of Technology, P.O. Box 3000, Espoo, FINLAND 02015
e-mail: {tabrudan, jamaer, visa}@wooster.hut.fi

Abstract—In many engineering applications we deal with constrained optimization problems w.r.t. complex-valued matrices. This paper proposes a Riemannian geometry approach for the optimization of a real-valued cost function J of a complex-valued matrix argument W, under the constraint that W is an n × n unitary matrix. An approximate steepest descent algorithm based on a Taylor series expansion is developed. The approximation satisfies the unitary matrix constraint accurately even if a low-order approximation is used. The Armijo adaptive step size rule [1] is used while moving towards the optimum. In the simulation examples, the proposed algorithm is applied to array signal processing and communications problems. The method outperforms other widely used algorithms.

I. INTRODUCTION

(This work was supported by the Academy of Finland and the GETA Graduate School.)

Constrained optimization problems arise in many applications. In particular, we are addressing the problem of optimization under unitary matrix constraint. Such problems may be found in communications and array signal processing, for example, blind and constrained beamforming, high-resolution direction finding (e.g., MUSIC and ESPRIT), and generally all subspace-based methods where subspace tracking is needed. In addition, this type of optimization problem occurs in Multiple-Input Multiple-Output (MIMO) communication systems and blind signal separation. See [2] for a recent review. Typically, in communications and signal processing applications we are dealing with complex matrices and signals. Consequently, the methods derived for real-valued signals and orthogonal matrices may not be applicable. The extension from the real case [3] to the complex case and unitary matrices is not trivial. It is not obtained just by replacing the transposition operation by the Hermitian transposition and the real derivative by the complex derivative, respectively. Commonly, a cost function with unitary matrix constraint is minimized in the space of n × n matrices using a classical Steepest Descent Algorithm (SDA) with a separate orthogonalization step applied at each iteration [4], [5]. The method of Lagrange multipliers, where deviations from the unitarity property are penalized, has also been employed in such problems [6]. A major improvement over the classical methods above is proposed in [7]. This differential-geometry-based method performs the optimization under unitary matrix constraints in an appropriate parameter space. In this paper we propose an algorithm stemming from differential geometry for the optimization of a real-valued cost


function J : C^{n×n} → R subject to W W^H = W^H W = I. This constrained optimization problem in C^{n×n} may be translated into an unconstrained one in a different parameter space, i.e., the Lie group of n × n unitary matrices U(n). A steepest descent algorithm operating in such a parameter space is proposed. The exact method proposed in our earlier work [8] performs geodesic motion, i.e., it moves along locally length-minimizing paths towards the optimum. It requires the computation of a matrix exponential, which may be too expensive in certain applications. In order to reduce the computational cost, we propose an approximate method that uses a truncated Taylor series expansion. Even a low-order model may be used, since the approximate method does not suffer from error propagation. Consequently, the unitary matrix constraint may be satisfied with high fidelity in adaptive algorithms.

This paper is organized as follows. In Section II we introduce the Riemannian gradient in the unitary group, and a steepest descent algorithm employing a Taylor series expansion is derived. Simulation results and array signal processing and communications applications are presented in Section III. Finally, Section IV concludes the paper.

II. ALGORITHM

In this section we propose a steepest descent algorithm operating in the Lie group of n × n unitary matrices U(n). In order to reduce the computational cost, we propose an approximate alternative to the exact algorithm proposed in [8]. The exact algorithm optimizes the constrained cost function J(W) along geodesics on U(n). The Riemannian gradient of the cost function evaluated at W and translated to the identity is given by

G(W) ≜ Γ_W W^H − W Γ_W^H ∈ T_I U(n),  (1)

where Γ_W = ∇J(W) is the gradient of J in the Euclidean space R^{2n×2n} [9] at a given W. The cost function J(W) is minimized iteratively, and the geodesic motion may be described by using an exponential map:

W_{k+1} = expm(−µ G_k) W_k ≜ R_k W_k,  (2)

where G_k = G(W_k) (1), the parameter µ > 0 controls the algorithm convergence speed, and R_k is a unitary (rotation) matrix. Note that the argument of the matrix exponential operation is a skew-Hermitian matrix. Equation (2) is the exact update from [8]. It exploits the fact that the unitary matrices form a Lie group under the multiplication operation.
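Because the argument of the exponential is skew-Hermitian, the exact rotation in (2) can be computed through an eigendecomposition, and the multiplicative update then preserves unitarity to machine precision. A minimal sketch (illustrative helper names; not the paper's implementation):

```python
import numpy as np

def expm_skew(G, mu):
    """Exact expm(-mu*G) for skew-Hermitian G. Since H = -1j*G is Hermitian,
    expm(-mu*G) = V exp(-1j*mu*lam) V^H with H = V diag(lam) V^H."""
    lam, V = np.linalg.eigh(-1j * G)
    return V @ np.diag(np.exp(-1j * mu * lam)) @ V.conj().T

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
G = B - B.conj().T                   # skew-Hermitian, like the gradient in (1)
W = np.eye(n, dtype=complex)
for _ in range(10):                  # repeated multiplicative updates (2)
    W = expm_skew(G, 0.1) @ W
# W stays unitary to machine precision after every step
```

This is exactly the property the approximate (Taylor-truncated) method below trades away for lower cost: the truncated rotation is only approximately unitary.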



A product of unitary matrices is a unitary matrix. Hence, the multiplicative update in (2) satisfies the constraint at each step. Finding the rotation matrix R_k requires the computation of the matrix exponential operation (expm). The expm operation may be too expensive in some applications. We propose a low-complexity approximate algorithm based on a Taylor series expansion of the expm operation. The effect of the approximation order is studied. Approximate algorithms do not necessarily satisfy the unitary matrix constraint exactly. However, already a low-order approximation (order 3 to 5) produces accurate results. The truncated Taylor series approximation of expm of order ω is

expm(A) ≈ Σ_{m=0}^{ω} A^m / m!.  (3)

The corresponding approximate update is W_{k+1} = R̃_k W_k, where R̃_k is the approximate rotation matrix. For example, the second-order approximation is

R̃_k = I − µ G_k + (µ²/2) G_k²,  (4)

where µ is the step size. After expanding the expression of the Riemannian gradient G_k, the corresponding update is

W_{k+1} = W_k − µ (Γ_k W_k^H W_k − W_k Γ_k^H W_k) + (µ²/2) [ (Γ_k W_k^H)² W_k + (W_k Γ_k^H)² W_k − Γ_k W_k^H W_k Γ_k^H W_k − W_k Γ_k^H Γ_k W_k^H W_k ].  (5)

The multiplicative update in (2) turns into an additive update. Error propagation is an important practical issue. Since the unitary constraint is satisfied only approximately, the weighting factor W_k^H W_k in equation (5) differs from the identity matrix. This weighting factor directly affects the unitary property of W_{k+1}. If the weighting factor is ignored in (5) (i.e., if we assume W_k^H W_k = I), then an even more severe degradation in performance and departure from the unitarity property is experienced. The weighting factor improves the accuracy of the update in terms of the unitarity criterion, similarly to the self-stabilized gradient algorithms in [10]. One important advantage of the proposed approximate algorithm is that the deviation from unitarity remains constant after a number of iterations and does not accumulate error, as can be seen in the simulation results. This is due to the fact that the gradient (1) is always skew-Hermitian. The remaining error depends on the truncation order of the Taylor series. For a small step size, the third-order truncated Taylor series approximates the matrix exponential with high fidelity. This is shown in [11] and verified by our simulations.

The proposed approximation of the matrix exponential operation requires (ω − 1)[n³ + 2n²] operations (multiplications and additions). This takes into account the special skew-Hermitian structure of G_k. The approximation requires computing the matrix powers in (3), which may be done very efficiently for skew-Hermitian matrices. The odd powers of G_k are also skew-Hermitian and the even powers are Hermitian. Consequently, the complexity of computing the matrix powers is reduced approximately by half. The GPD method in [12], used to compute the matrix exponential, requires 10n³ operations for skew-Hermitian matrices. Hence, a Taylor series approximation of order ω ≤ 10 is justified.

An optimal value of the step size µ is difficult to determine in practice, since the matrices involved in the cost function may be random. Moreover, they may be time-varying. An adaptive step size is a reliable choice. It is known that the steepest descent algorithm together with the Armijo rule [1] for choosing the step size almost always converges to a local minimum if not initialized at a stationary point. The proposed algorithm is summarized in Table I using a third-order Taylor series approximation of the matrix exponential. The step size µ evolves on a dyadic basis: if it is too small, it is doubled; if it is too large, it is halved. The criteria for choosing the step size value are defined by two inequalities, steps 6 and 7, respectively.
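The effect of the truncation order ω on the deviation from unitarity can be illustrated numerically (a sketch under assumed sizes and step size; not the paper's code):

```python
import numpy as np

def taylor_rotation(G, mu, order):
    """Truncated Taylor series (3) for expm(-mu*G)."""
    n = G.shape[0]
    R = np.eye(n, dtype=complex)
    term = np.eye(n, dtype=complex)
    for m in range(1, order + 1):
        term = term @ (-mu * G) / m      # next Taylor term (-mu*G)^m / m!
        R = R + term
    return R

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
G = B - B.conj().T                       # skew-Hermitian, like the gradient (1)

def unitarity_dev(R):
    """Frobenius-norm deviation of R from unitarity, cf. the criterion Omega."""
    return np.linalg.norm(R.conj().T @ R - np.eye(5), 'fro')

# deviation shrinks rapidly as the truncation order grows
devs = {w: unitarity_dev(taylor_rotation(G, 0.05, w)) for w in (2, 3, 5)}
```

For a small step the deviation drops by orders of magnitude between ω = 2 and ω = 5, which is consistent with the claim that orders 3 to 5 already produce accurate results.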

1. Initialization: k = 0, W_k = I, µ = 1.
2. Compute the gradient of the cost function in the Euclidean space: Γ_k = ∂J/∂W*(W_k).
3. Compute the gradient direction in the Riemannian space: G_k = Γ_k W_k^H − W_k Γ_k^H.
4. Evaluate ‖G_k‖²_{W_k} = trace{G_k G_k^H}. If it is sufficiently small, then STOP.
5. Determine the approximate rotation matrices: R̃_k = I − µG_k + (µG_k)²/2 − (µG_k)³/6, Q̃_k = R̃_k R̃_k.
6. While J(W_k) − J(Q̃_k W_k) ≥ µ‖G_k‖²_{W_k}: set R̃_k := Q̃_k, Q̃_k := R̃_k R̃_k, µ := 2µ.
7. While J(W_k) − J(R̃_k W_k) < (µ/2)‖G_k‖²_{W_k}: set µ := µ/2 and R̃_k = I − µG_k + (µG_k)²/2 − (µG_k)³/6.
8. Update: W_{k+1} = R̃_k W_k, set k := k + 1, and go to step 2.

TABLE I: The proposed approximate algorithm. The Taylor series approximation order is ω = 3.
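The steps of Table I can be sketched end-to-end as follows, applied here to an iterative-diagonalization cost of the kind used later in (6); this is an illustrative reconstruction (function names, sizes, and tolerances are assumptions, not the authors' code):

```python
import numpy as np

def taylor_rotation(G, mu, order=3):
    """Truncated Taylor approximation of expm(-mu*G), cf. (3)."""
    n = G.shape[0]
    R = np.eye(n, dtype=complex)
    term = np.eye(n, dtype=complex)
    for m in range(1, order + 1):
        term = term @ (-mu * G) / m
        R = R + term
    return R

def sd_unitary(J, egrad, n, max_iter=50, tol=1e-10):
    """Steepest descent on U(n) with the dyadic Armijo-type step rule of
    Table I, using third-order Taylor rotations."""
    W = np.eye(n, dtype=complex)
    mu = 1.0
    for _ in range(max_iter):
        Gamma = egrad(W)                                 # step 2
        G = Gamma @ W.conj().T - W @ Gamma.conj().T      # step 3
        g2 = np.real(np.trace(G @ G.conj().T))           # step 4
        if g2 < tol:
            break
        R = taylor_rotation(G, mu)                       # step 5
        Q = R @ R
        while J(W) - J(Q @ W) >= mu * g2:                # step 6: double mu
            R, Q, mu = Q, Q @ Q, 2.0 * mu
        while J(W) - J(R @ W) < (mu / 2.0) * g2:         # step 7: halve mu
            mu = mu / 2.0
            R = taylor_rotation(G, mu)
        W = R @ W                                        # step 8
    return W

# Example: iterative diagonalization of a Hermitian matrix, cf. (6)
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = A @ A.conj().T                                   # Hermitian

def off(D):
    return D - np.diag(np.diag(D))

def J(W):
    return np.linalg.norm(off(W.conj().T @ Sigma @ W), 'fro') ** 2

def egrad(W):
    return 2.0 * Sigma @ W @ off(W.conj().T @ Sigma @ W)

W = sd_unitary(J, egrad, n)
```

Because the loops only accept steps with sufficient decrease, the cost is monotonically reduced, while the order-3 truncation keeps the iterate close to (but not exactly on) U(n).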

This type of step adaptation allows reducing the complexity. When the step size needs to be doubled (step 6), the computation of the Taylor series approximation is not needed, because the rotation matrix Q̃_k ≈ expm(−2µG_k) may be obtained by squaring the matrix R̃_k ≈ expm(−µG_k). This is a realistic assumption even though we deal with an approximation, because for normal matrices the approximate expm computation via matrix squaring (when doubling the step size) prevents round-off error accumulation [11]. An Armijo-type update algorithm enjoys this benefit, since the argument of the matrix exponential is skew-Hermitian, i.e., a normal matrix. A larger step size causes a larger approximation error in the Taylor series. This may be avoided by using the scaling-and-squaring approach (Method 3 from [11]).

III. APPLICATION EXAMPLES

In this section we test the proposed algorithm in two different examples of signal processing applications. The first application example is a subspace-based direction of arrival



(DOA) estimation used in smart antenna systems. The second one is a subspace method for (semi)blind channel estimation in MIMO OFDM systems.

1) Subspace-based direction of arrival estimation: The method requires the computation of the signal or noise subspace. Here, the subspaces are estimated using a diagonalization approach. The antenna array covariance matrix Σ is diagonalized by finding a diagonal matrix D = W^H Σ W such that W is a unitary matrix. The matrix D contains the eigenvalues of Σ, and W is a unitary matrix whose columns are the eigenvectors of Σ. They may be found iteratively by minimizing the off-diagonal elements of W^H Σ W, w.r.t. W, under the unitarity constraint on W. This is equivalent to minimizing the following cost function:

J(W) = ‖W^H Σ W − I ⊙ (W^H Σ W)‖²_F,  (6)

where ⊙ denotes the elementwise matrix multiplication. The gradient of (6) is Γ_W = 2ΣW[W^H Σ W − I ⊙ (W^H Σ W)].

In the first simulation the impact of the Taylor series approximation order ω is studied, for ω = 2, 3, 5. The performance is studied in terms of convergence speed. Obviously, faster methods for finding the DOA are used in practice. Two figures of merit are considered: a diagonality criterion ∆ and a unitarity criterion Ω. The diagonality criterion is defined as the ratio of two squared Frobenius norms, i.e., the one corresponding to the off-diagonal vs. the one corresponding to the diagonal elements of D, on a logarithmic scale, i.e., ∆ = 10 lg[‖off(D)‖²_F / ‖diag(D)‖²_F]. The unitarity criterion is defined as the squared Frobenius norm of the deviation from the unitarity property, also on a logarithmic scale, i.e., Ω = 10 lg ‖WW^H − I‖²_F. The results are depicted in Fig. 1.

Fig. 1. The performance of the proposed algorithm (Table I) using the truncated Taylor series approximation of the matrix exponential operation. The diagonality criterion ∆ (left) and the unitarity criterion Ω (right) vs. the iteration step. The unitarity criterion stabilizes to a steady-state value after a few iterations. This value depends on the approximation order.

We may notice that the approximation order does not impact the diagonality criterion significantly, see Fig. 1 (left). However, the approximation order plays a significant role in satisfying the unitarity property, which is crucial in many applications, see Fig. 1 (right).

In the second simulation we show the importance of unitarity and how it reflects on the DOA estimates. A 6-element uniform linear array (ULA) is used. The "true" DOAs are θ₁ = 87° and θ₂ = 92°. The separation angle is 5°, hence the high-resolution property is needed. The obtained eigenvectors are plugged into the standard MUSIC algorithm. We compare three algorithms: the classical steepest descent (SD) method which enforces unitarity in every iteration as in [4], a Lagrangian type of method as in [6], and the proposed approximate algorithm (Table I). We plot the spatial pseudo-spectrum after 200 iterations for all methods. The classical SD algorithm produces the pseudo-spectrum in Fig. 2-a). It fails to detect both sources. This method converges to an accurate estimate only after several thousands of iterations. The Lagrangian method yields eigenvectors for which the unitary property does not hold. Therefore, the orthogonality property of the signal and noise subspaces is degraded, and the high-resolution property is lost as well. See Fig. 2-b), where closely spaced sources remain unresolved. The proposed algorithm finds both signals, as shown in Fig. 2-c). The peaks are high and well separated. This is obtained even if very poor initial estimates of the eigenvectors are used, i.e., the identity matrix. The method performs reliably in subspace-based estimation and tracking tasks where the unitarity property plays a crucial role.

Fig. 2. The estimated DOAs (solid line) vs. the true DOAs θ₁ = 87° and θ₂ = 92° (marked by dashed lines). The MUSIC algorithm is applied to a 6-element ULA at SNR = 15 dB. The eigendecomposition is obtained iteratively by minimizing the cost function (6), based on three different algorithms: a) the classical SD with enforcing unitarity at every iteration, b) the Lagrangian SD algorithm, c) the proposed algorithm (Table I). The classical algorithms a), b) lose the high-resolution property. The proposed algorithm c) can resolve closely spaced DOA angles.

2) Subspace method for semi-blind channel estimation in MIMO OFDM systems: In this example we consider the channel estimation algorithm for multi-user MIMO OFDM systems proposed in [13]. The same simulation parameters as in Example 2 in [13] are considered. The MIMO system has K = 2 transmit antennas and J = 3 receive antennas. All the MIMO channel branches are frequency selective and have order L < 10. The corresponding taps are zero-mean complex Gaussian, mutually independent, generated according to the exponential power-delay profile E[|h^(j,k)(l)|²] = exp(−0.64l), l = 0, …, L. The OFDM block contains N = 32 sub-symbols belonging to a 16-QAM constellation. Each block is zero-padded, resulting in a block of length M = 41.

The blind channel identification algorithm proposed in [13] is based on second order statistics of the received signal. The algorithm consists of the following steps. First, a sample estimate of the auto-correlation matrix R̂_x is computed based on a finite number of received OFDM blocks. Ideally, the auto-correlation matrix is given by R_x = E[x_i x_i^H] = H R_u H^H + σ_ν² I_JM, where x_i is the receive antenna array output and H is the JM × KN block-Toeplitz matrix modeling the MIMO channel. R_u is the auto-correlation matrix of the transmitted signals and σ_ν² is the noise variance. The noise subspace is identified from the eigendecomposition (ED) of R̂_x. There are q = JM − KN eigenvectors β_i corresponding to the noise subspace, therefore the classical subspace orthogonality property β_i^H H = 0 holds. In the next step, each eigenvector β_i is re-arranged into an N × (L+1)J block-Toeplitz matrix G_i, and the resulting matrices are stacked into a large matrix G. The MIMO channel coefficients are also reshaped into a J(L+1) × K matrix H̄. An equivalent orthogonality condition G H̄ = 0 is obtained. The matrix H̄ may be determined up to a K × K ambiguity matrix B, i.e., H̄ = H̄₀B. The matrix H̄₀ is a basis of the right null space of G. This can be obtained from the singular value decomposition (SVD) of G. Therefore, the columns of H̄₀ are formed by the right singular vectors corresponding to the K smallest singular values. The ambiguity matrix B may be removed based on pilot data, i.e., at least K sub-symbols in one OFDM block must be known.

In conclusion, the channel estimation method proposed in [13] requires one ED of the auto-correlation matrix R_x, followed by an SVD of the matrix G formed with the eigenvectors β_i. The proposed approximate algorithm reduces the complexity with a small performance degradation. First, we compare the complexity of the algorithm proposed in [13] using the exact ED and SVD operations to that of the proposed iterative algorithm. An exact ED of the JM × JM matrix R_x requires about 13(JM)³ ≈ 24·10⁶ operations, including both additions and multiplications. The exact SVD of the Nq × (L+1)J matrix G requires about 2(Nq)[(L+1)J]² + 11[(L+1)J]³ ≈ 3.7·10⁶ operations [14]. By using the proposed approximate algorithm (Table I) a complexity reduction may be achieved. We assume low approximation orders, i.e., ω₁ = 4 for the ED of the R_x matrix and ω₂ = 5 for the SVD of G. The approximation of the ED operation reduces the complexity more than four times, i.e., about (ω₁ − 1)[(JM)³ + 2(JM)²] ≈ 5.7·10⁶ operations are required. Moreover, the SVD operation is converted to an ED operation. Instead of computing the right singular vectors of a large tall matrix G of size Nq × (L+1)J = 1288 × 30, we compute the eigenvectors of the smaller matrix G^H G of size 30 × 30. In this way we further reduce the complexity. The multiplication G^H G requires about (1/2)(JM − KN)(L+1)[J²N(L+2) − J(J−1)] + (1/2)[(L+1)J]⁴ ≈ 1.3·10⁶ operations by exploiting the block-Toeplitz structure of G. Finding the eigenvalues and the associated eigenvectors by using the proposed iterative approximation requires about (ω₂ − 1)[((L+1)J)³ + 2((L+1)J)²] ≈ 0.1·10⁶ operations. Therefore, the SVD may be approximated in 1.4·10⁶ operations, which is less than half of the complexity of the exact method.

We compare the exact and the approximated algorithms in order to evaluate the performance loss due to the approximation. In this simulation we perform both the ED and the SVD by using the proposed iterative approximation. The sample estimate of the auto-correlation matrix R_x is computed by using 200 received OFDM blocks. The number of iterations for the computation of the ED is also equal to 200, for both EDs. After the MIMO channel is estimated, the equalization is performed in the time domain and the ambiguity B due to the blind identification is removed as in [13]. The error is averaged over 50 Monte Carlo realizations. In Fig. 3 the performance of the proposed approximation is compared to the exact method [13] in terms of root mean-square error (RMSE) in the channel coefficients, as a function of the signal-to-noise ratio (SNR). Both the SNR and the channel RMSE are defined in [13]. The symbol RMSE is defined in the same manner for each user, and the average value over all users is considered. The performance degradation in the channel estimate due to the approximation may be noticed at high SNR. In terms of symbol RMSE the gap between the exact method [13] and the proposed approximate algorithm is not very significant, as shown in Fig. 4. The demodulated constellation patterns for the two users, for the exact method and for the approximate method, are shown in Fig. 5-a) and Fig. 5-b), respectively. The SNR is 21.4 dB, as in [13].

Fig. 3. The channel estimation RMSE as a function of SNR.
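The SVD-to-ED conversion described above (computing the right null-space basis from the small Gram matrix G^H G instead of a full SVD of G) can be sketched as follows; matrix sizes and function names are illustrative:

```python
import numpy as np

def right_null_basis(G, K):
    """Basis of the K-dimensional right null space of a tall matrix G,
    obtained from the ED of the small Gram matrix G^H G instead of a
    full SVD of G (the complexity-reduction step described above)."""
    lam, V = np.linalg.eigh(G.conj().T @ G)   # eigenvalues in ascending order
    return V[:, :K]                           # eigenvectors of the K smallest

# Sanity check on a 100 x 10 matrix with an exact 2-dimensional null space
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((100, 8)))
V0, _ = np.linalg.qr(rng.standard_normal((10, 10))
                     + 1j * rng.standard_normal((10, 10)))
s = np.linspace(1.0, 3.0, 8)                  # 8 nonzero singular values
G = U0 @ np.diag(s) @ V0[:, :8].conj().T      # rank-8, null space = V0[:, 8:]
H0 = right_null_basis(G, 2)
```

The eigenvectors of G^H G associated with the K smallest eigenvalues span the same subspace as the right singular vectors of G associated with its K smallest singular values, which is why the cheap ED can replace the SVD here.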

The algorithm convergence speed is also evaluated in terms of both channel RMSE and symbol RMSE. It may be noticed in Fig. 6 that after approximately 10 iterations both the channel and the symbol RMSE decrease significantly. This is due to the fact that after a certain number of iterations



Fig. 4. The symbol RMSE as a function of SNR [dB]. (Plot omitted; curves: exact method vs. proposed approximation.)

Fig. 6. The channel and the symbol RMSE vs. the number of iterations at SNR = 21.4 dB. (Plot omitted; curves: channel RMSE and symbol RMSE.)

the smallest eigenvalues may be distinguished easily and the proper eigenvectors are plugged into the algorithm.

Fig. 5. The constellation patterns corresponding to the two users at SNR = 21.4 dB. (Plots omitted.) a) The exact method [13]: the channel RMSE is equal to 0.04 and the symbol RMSE is equal to 0.11. b) The proposed approximate algorithm from Table I: the channel RMSE is equal to 0.11 and the symbol RMSE is equal to 0.19.

IV. CONCLUSIONS

In this paper, an approximate Riemannian optimization algorithm under unitary matrix constraint is proposed. The truncated Taylor series approximation reduces the complexity and is very robust in the face of error propagation. The Armijo step size [1] as well as more classical adaptation rules may be used in the update. The proposed method may be applied, for example, to smart antenna algorithms, wireless communications, biomedical measurements, signal separation, and subspace estimation and tracking tasks where unitary matrices play an important role. A comparison to classical steepest descent and Lagrangian methods is given as well. The proposed algorithm provides significant advantages over the classical methods in terms of computational complexity; in terms of accuracy, the approximation approaches the exact method.

REFERENCES

[1] E. Polak, Optimization: Algorithms and Consistent Approximations. New York: Springer-Verlag, 1997.
[2] J. H. Manton, "On the role of differential geometry in signal processing," in International Conference on Acoustics, Speech and Signal Processing, vol. 5, pp. 1021–1024, Philadelphia, March 2005.
[3] S. Fiori, "Quasi-geodesic neural learning algorithms over the orthogonal group: a tutorial," Journal of Machine Learning Research, vol. 1, pp. 1–42, Apr. 2005.
[4] P. Sansrimahachai, D. Ward, and A. Constantinides, "Multiple-Input Multiple-Output Least-Squares Constant Modulus Algorithms," in IEEE Global Telecommunications Conference, vol. 4, pp. 2084–2088, 1–5 Dec. 2003.
[5] C. B. Papadias and A. M. Kuzminskiy, "Blind source separation with randomized Gram-Schmidt orthogonalization for short burst systems," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 809–812, 17–21 May 2004.
[6] L. Wang, J. Karhunen, and E. Oja, "A bigradient optimization approach for robust PCA, MCA and source separation," in Proceedings of IEEE Conference on Neural Networks, vol. 4, pp. 1684–1689, 27 Nov.–1 Dec. 1995.
[7] J. H. Manton, "Optimization algorithms exploiting unitary constraints," IEEE Transactions on Signal Processing, vol. 50, pp. 635–650, Mar. 2002.
[8] T. Abrudan, J. Eriksson, and V. Koivunen, "Optimization under unitary matrix constraints via multiplicative update," submitted to IEEE Transactions on Signal Processing, May 2005.
[9] D. H. Brandwood, "A complex gradient operator and its applications in adaptive array theory," IEE Proceedings, Parts F and H, vol. 130, pp. 11–16, Feb. 1983.
[10] S. C. Douglas, "Self-stabilized gradient algorithms for blind source separation with orthogonality constraints," IEEE Transactions on Neural Networks, vol. 11, pp. 1490–1497, Nov. 2000.
[11] C. Moler and C. van Loan, "Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later," SIAM Review, vol. 45, no. 1, pp. 3–49, 2003.
[12] A. Iserles and A. Zanna, "Efficient computation of the matrix exponential by general polar decomposition," SIAM Journal on Numerical Analysis, vol. 42, pp. 2218–2256, March 2005.
[13] Y. Zeng and T. Ng, "A semi-blind channel estimation method for multiuser multiantenna OFDM systems," IEEE Transactions on Signal Processing, vol. 52, pp. 1419–1429, May 2004.
[14] G. H. Golub and C. van Loan, Matrix Computations. Baltimore: The Johns Hopkins University Press, 3rd ed., 1996.



[Publication IV] T. Abrudan, J. Eriksson, V. Koivunen, “Efficient Line Search Methods for Riemannian Optimization Under Unitary Matrix Constraint”, Conference Record of the Forty-First Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 4–7 Nov. 2007, pp. 671–675. ©2007 IEEE. Reprinted with permission.

[Publication V] T. Abrudan, J. Eriksson, V. Koivunen, “Efficient Riemannian Algorithms for Optimization Under Unitary Matrix Constraint”, IEEE International Conference on Acoustics Speech and Signal Processing Las Vegas, NV, 31 Mar.– 4 Apr. 2008, pp. 2353–2356. ©2008 IEEE. Reprinted with permission.

EFFICIENT RIEMANNIAN ALGORITHMS FOR OPTIMIZATION UNDER UNITARY MATRIX CONSTRAINT Traian Abrudan, Jan Eriksson, Visa Koivunen Helsinki University of Technology, Department of Electrical and Communications Engineering, SMARAD CoE, Signal Processing Laboratory, FIN-02015 HUT, Finland (e-mail: {tabrudan, jamaer, visa}@wooster.hut.fi) ABSTRACT In this paper we propose practical algorithms for optimization under unitary matrix constraint. This type of constrained optimization is needed in many signal processing applications. Steepest descent and conjugate gradient algorithms on the Lie group of unitary matrices are introduced. They exploit the Lie group properties in order to reduce the computational cost. Simulation examples on signal separation in MIMO systems demonstrate the fast convergence and the ability to satisfy the constraint with high fidelity. Index Terms— Optimization, unitary matrix constraint, array processing, subspace estimation, source separation

In this paper, efficient steepest descent (SD) and conjugate gradient (CG) algorithms operating on the Lie group of unitary matrices U(n) are proposed. They move towards the optimum along geodesics, which on a Riemannian manifold correspond to straight lines in Euclidean space. The main contribution of this paper is that we take full benefit of the geometric properties of the Lie group, such as simple formulas for geodesics and parallel transport, as well as special matrix structures. Therefore, the resulting optimization algorithms are computationally efficient. This paper is organized as follows. In Section 2 we propose practical steepest descent and conjugate gradient algorithms for optimization under unitary matrix constraint. Simulation results are presented in Section 3. Finally, Section 4 concludes the paper.

1. INTRODUCTION

Many signal processing applications require optimizing a certain criterion w.r.t. a complex-valued matrix, under the constraint that the matrix has orthonormal columns. Such problems arise in communications and array signal processing, for example, high-resolution direction finding, blind and constrained beamforming, and generally all methods where subspace estimation or tracking is needed. Another important class of applications is source separation and Independent Component Analysis (ICA). This type of optimization problem also occurs in Multiple-Input Multiple-Output (MIMO) communication systems. For a recent review, see [1]. Commonly, the problem of optimization under unitary matrix constraint is solved on the Euclidean space by using classical gradient algorithms. In order to keep the constraint satisfied, additional orthogonalization or stabilization procedures need to be applied after every iteration. Consequently, such algorithms experience slow convergence or deviations from the constraint [1]. The initial constrained optimization problem may be converted into an unconstrained one, on an appropriate differential manifold [2,3]. In the case of the unitary matrix constraint, the appropriate parameter space is the Lie group of n × n unitary matrices U(n). The nice geometrical properties of U(n) may be exploited in order to solve the optimization problem efficiently and satisfy the constraint with high fidelity. Riemannian geometry based algorithms for optimization with orthogonality constraints are considered in [4]. In [5] a non-Riemannian approach is introduced. Optimization algorithms operating on the unitary group are considered in [1, 6]. Algorithms in the existing literature [2–5] are, however, more general in the sense that they can be applied on more general manifolds than U(n). On the other hand, when applied to U(n), they do not take benefit of the special properties arising from the Lie group structure of the manifold.
This work was supported in part by the Academy of Finland and GETA Graduate School.




2. RIEMANNIAN OPTIMIZATION ALGORITHMS ON THE UNITARY GROUP

In this section we propose steepest descent (SD) and conjugate gradient (CG) algorithms operating on the Lie group of unitary matrices U(n). The goal is to minimize (or maximize) the real-valued cost function J of an n × n complex matrix argument W, under the unitary matrix constraint, i.e., WW^H = W^H W = I, where I is the n × n identity matrix. The constrained optimization problem on the Euclidean space C^(n×n) may be formulated as an unconstrained one, on a different parameter space determined by the constraint. The unitary constraint defines the Lie group of unitary matrices U(n), which is a differential manifold and a multiplicative matrix group at the same time. By exploiting the additional group properties of the manifold of unitary matrices, a reduction in complexity is achieved.

2.1. Some key geometrical features of U (n) This subsection describes briefly some Riemannian geometry concepts related to the Lie group of unitary matrices U (n). We also show how the properties of U (n) may be exploited in order to reduce the complexity of the optimization algorithms.

2.1.1. Tangent vectors and tangent spaces

The tangent space is an n²-dimensional real vector space attached to every point W ∈ U(n), and it may be identified with the matrix space TW U(n) ≜ {X ∈ C^(n×n) | X^H W + W^H X = 0}. The tangent space at the group identity I is the real Lie algebra of skew-Hermitian matrices u(n) ≜ TI U(n) = {S ∈ C^(n×n) | S = −S^H}.



Fig. 1. The SD algorithm takes ninety-degree turns at every iteration, i.e., ⟨−G̃k+1, −τG̃k⟩Wk+1 = 0, where τ denotes the parallelism w.r.t. the geodesic connecting Wk and Wk+1. (Illustration omitted.)

2.1.2. Riemannian metric and gradient on U(n)

The gradient vector can only be defined after endowing U(n) with a Riemannian metric. The inner product given by ⟨X, Y⟩W = (1/2) ℜ{trace{XY^H}}, X, Y ∈ TW U(n), induces a bi-invariant metric on U(n). Therefore, the right (and also the left) translation preserves the inner product, i.e., ⟨X, Y⟩W = ⟨XV, YV⟩WV, ∀V ∈ U(n). This property is called isometry, and it is very useful for performing translations of the tangent vectors (gradients and search directions) from one tangent space to another. The Riemannian gradient at a point W ∈ U(n) is:

∇̃J(W) ≜ ΓW − W ΓW^H W,

(1)

where ΓW = dJ/dW*(W) is the gradient of J on the Euclidean space at a given W [1].
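The two gradient expressions are easy to sanity-check numerically: the Riemannian gradient (1) is a tangent vector at W (it satisfies X^H W + W^H X = 0), and its right translation to the identity, ΓW W^H − W ΓW^H, is skew-Hermitian, i.e., an element of u(n). The sketch below assumes nothing beyond the formulas already given; the random W and Γ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
# Random unitary W (QR of a complex Gaussian matrix) and an arbitrary
# Euclidean gradient Gamma standing in for dJ/dW*.
W, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
Gamma = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

R = Gamma - W @ Gamma.conj().T @ W            # Riemannian gradient at W, eq. (1)
G = Gamma @ W.conj().T - W @ Gamma.conj().T   # gradient translated to the identity

# R is a tangent vector at W: R^H W + W^H R = 0
print(np.allclose(R.conj().T @ W + W.conj().T @ R, 0))   # True
# G = R W^H lies in the Lie algebra u(n): skew-Hermitian
print(np.allclose(G, -G.conj().T))                        # True
```

The second identity, G = R W^H, is exactly the right translation of tangent vectors discussed in Subsection 2.1.3.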

2.1.3. Geodesics and parallel translation on U(n)

Intuitively, geodesics on a Riemannian manifold are the locally length-minimizing paths. On U(n) they have simple expressions described by the exponential map. The fact that the right translation is an isometry enables simple parallel transport of the tangent vectors along geodesics, via matrix multiplication. When performing the geodesic optimization on U(n), due to computational reasons it is more convenient to translate all tangent vectors into u(n), whose elements correspond to skew-Hermitian matrices. Because an isometry maps geodesics into geodesics, the right multiplication also allows translating geodesics from one point to another. The geodesic emanating from the identity element of U(n) in the direction of S ∈ u(n) is given by GI(t) = exp(tS). Using the right translation, a geodesic emanating from an arbitrary W ∈ U(n) in the direction SW is given by GW(t) = exp(tS)W, SW ∈ TW U(n), t ∈ R. Consequently, the tangent direction S ∈ u(n) is transported along the geodesic to W and the resulting tangent vector is SW ∈ TW U(n). Conversely, if X ∈ TW U(n), then XW^H ∈ u(n).

2.2. Steepest descent algorithm on U(n)

The unitary optimization can be solved in an iterative manner, by using a steepest descent algorithm along geodesics on U(n). The corresponding rotational update at iteration k is given by:

Wk+1 = exp(−µk Gk) Wk,

k = 0, 1, . . .


(2)

where Gk ≜ ∇̃J(Wk) Wk^H = ΓWk Wk^H − Wk ΓWk^H ∈ u(n) is the Riemannian gradient of J(W) at Wk translated to the group identity,




˜ k+1 at Wk+1 which Fig. 2. The CG takes a search direction −H ˜ k+1 at Wk+1 and is a combination of the new SD direction −G ˜ k translated to Wk+1 along the the current search direction −H geodesic connecting Wk and Wk+1 . The new Riemannian steep˜ k+1 at Wk+1 will be orthogonal to the est descent direction −G current search direction −Hk at Wk translated to Wk+1 , i.e., ˜ k+1 , −τ H ˜ k −G Wk+1 = 0. 1 2 3 4 5 6

Initialization: k = 0 , Wk = I Compute the Riemannian gradient direction Gk : H H ∂J Γk = ∂W ∗ (Wk ), Gk = Γk Wk − Wk Γk Evaluate Gk , Gk I = (1/2)trace{GH G }. If it is suffik k ciently small, then stop Determine µk = arg minµ J (exp(−µGk )Wk ) Update: Wk+1=exp(−µk Gk )Wk k := k + 1 and go to step 2

Table 1. Steepest descent (SD) algorithm along geodesics on U (n) ∂J and ΓW = ∂W ∗ (Wk ) is the Euclidean gradient at Wk . The notation exp(·) stands for the matrix exponential. The rotational update (2) maintains Wk+1 unitary at each iteration. The step size µk > 0 controls the convergence speed and needs to be computed at each iteration. In [1], Armijo step size rule [7] is efficiently used. The step size evolves in a dyadic basis. Therefore, when doubling the step size, only a matrix squaring is needed instead of computing a new matrix exponential. In this way the complexity is reduced approximately by half [1], compared to the SD in [5] using also the Armijo rule. Other approaches from the Euclidean space [7] may also be adapted. The proposed SD algorithm is summarized in Table 1.
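A minimal sketch of the Table 1 iteration is given below. Two simplifications are ours, not the paper's: a fixed step size replaces the Armijo line search, and the toy cost J(W) = ||W − Q||²_F (minimized over U(n) at W = Q) is chosen only for illustration. The matrix exponential of a skew-Hermitian argument is computed exactly through the eigendecomposition of the Hermitian matrix −iS.

```python
import numpy as np

def expm_skew(S):
    """Exact matrix exponential of a skew-Hermitian S, via the
    eigendecomposition of the Hermitian matrix -iS."""
    lam, U = np.linalg.eigh(-1j * S)              # S = U diag(i*lam) U^H
    return (U * np.exp(1j * lam)) @ U.conj().T

def sd_on_unitary_group(euclid_grad, W, mu=0.1, iters=300):
    """Geodesic steepest descent on U(n), following Table 1 (with a
    fixed step size mu in place of the Armijo rule, for brevity)."""
    for _ in range(iters):
        Gamma = euclid_grad(W)
        G = Gamma @ W.conj().T - W @ Gamma.conj().T   # translated gradient, in u(n)
        W = expm_skew(-mu * G) @ W                    # rotational update (2)
    return W

rng = np.random.default_rng(3)
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
grad = lambda W: W - Q                                # Euclidean gradient dJ/dW*

W = sd_on_unitary_group(grad, np.eye(n, dtype=complex))
print(np.linalg.norm(W - Q) ** 2 < 1e-8)              # True: converged to Q
print(np.allclose(W @ W.conj().T, np.eye(n)))         # True: W stayed unitary
```

Because every update is a multiplication by the exponential of a skew-Hermitian matrix, unitarity is preserved to machine precision without any re-orthogonalization.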

2.3. Conjugate gradient algorithm on U(n)

The Conjugate Gradient (CG) algorithm typically provides faster convergence than the Steepest Descent (SD) algorithm, not only on the Euclidean space but also on Riemannian manifolds. This is due to the fact that the Riemannian SD algorithm suffers from the same deficiency as its Euclidean counterpart, i.e., it takes ninety-degree turns at each iteration [3]. This fact is illustrated in the left plot of Figure 1 by plotting the cost function level sets on the manifold surface. The conjugate gradient algorithm may significantly reduce this drawback. Moreover, CG provides an inexpensive alternative to the Newton algorithm. The new search direction is chosen to be a combination of the current search direction at Wk and the gradient at the next point Wk+1, as illustrated in Figure 2. The difference compared to the Euclidean space is that the two vectors lie in different tangent spaces. For this reason they are not directly compatible. The fact that U(n) is a Lie group enables simple parallel translation of tangent vectors from one tangent space to another. It is desirable to translate all the tangent directions (steepest descent and search directions) to the same tangent space. Due to computational reasons, the tangent space at the group identity element is preferred [6]. Then, all the tangent vectors belong to the Lie algebra u(n) and they are represented by skew-Hermitian matrices. The computation of the exponential of skew-Hermitian matrices and its approximations have been thoroughly studied in the literature, see [1]. The new search direction translated into u(n) is

Hk+1 = Gk+1 + γk Hk,   Hk, Hk+1, Gk+1 ∈ u(n),   (3)

where Hk is the old search direction at Wk, translated into u(n). The weighting factor γk may be determined, for example, by using the Polak-Ribière formula γk = ⟨Gk+1 − Gk, Gk+1⟩I / ⟨Gk, Gk⟩I [3]. The conjugate gradient step is taken along the geodesic emanating from Wk in the direction −H̃k = −Hk Wk, i.e.,

Wk+1 = exp(−µk Hk) Wk,   k = 0, 1, . . . .   (4)

Table 2. Conjugate gradient algorithm along geodesics on U(n) using the Polak-Ribière formula (CG-PR)
1. Initialization: k = 0, Wk = I
2. Compute the Riemannian gradient direction Gk and the search direction Hk:
   if (k modulo n²) == 0: Γk = ∂J/∂W*(Wk), Gk = Γk Wk^H − Wk Γk^H, Hk := Gk
   else: Gk := Gk+1, Hk := Hk+1
3. Evaluate ⟨Gk, Gk⟩I = (1/2) trace{Gk^H Gk}. If it is sufficiently small, then stop
4. Determine µk = arg min_µ J(exp(−µHk) Wk)
5. Update: Wk+1 = exp(−µk Hk) Wk
6. Compute the Riemannian gradient direction Gk+1 and the search direction Hk+1:
   Γk+1 = ∂J/∂W*(Wk+1), Gk+1 = Γk+1 Wk+1^H − Wk+1 Γk+1^H,
   γk = ⟨Gk+1 − Gk, Gk+1⟩I / ⟨Gk, Gk⟩I, Hk+1 = Gk+1 + γk Hk
7. k := k + 1 and go to step 2

Fig. 3. Comparison between different SA (steepest ascent) and CG algorithms: the geodesic SA obtained from the SD in Table 1, the non-geodesic SA as in [5], and the CG-PR algorithm given in Table 2, which uses the Polak-Ribière formula. The geodesic and the non-geodesic SA algorithms perform the same, but the geodesic SA algorithm has lower complexity. The CG provides faster convergence compared to the geodesic SA at comparable complexity. (Plots omitted; left: diagonality criterion [dB] vs. iteration for the Brockett function; right: unitarity criterion [dB] vs. iteration.)
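The direction update (3) reduces to a few lines once the inner product ⟨·,·⟩I is written out. The snippet below only checks the algebra on random skew-Hermitian matrices standing in for Gk, Gk+1 and Hk: since γk is real, the new direction remains skew-Hermitian, i.e., it stays in u(n).

```python
import numpy as np

def ip(X, Y):
    """Bi-invariant inner product on u(n): <X, Y> = 0.5 * Re trace{X Y^H}."""
    return 0.5 * np.real(np.trace(X @ Y.conj().T))

def random_skew(rng, n):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return A - A.conj().T                      # random element of u(n)

rng = np.random.default_rng(4)
n = 5
G_k, G_next, H_k = (random_skew(rng, n) for _ in range(3))

# Polak-Ribiere factor and new search direction, both expressed in u(n):
gamma = ip(G_next - G_k, G_next) / ip(G_k, G_k)
H_next = G_next + gamma * H_k

print(np.allclose(H_next, -H_next.conj().T))   # True: H_next is in u(n)
```

Note that ip(X, X) equals half the squared Frobenius norm of X, so the stopping test in step 3 of the tables is simply a gradient-norm threshold.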

Analogous to the Euclidean CG, it is desirable to reset the search direction −Hk to the gradient direction −Gk after every n² steps, n² being the dimension of U(n). This may enhance the convergence speed. The proposed CG algorithm on U(n) using the Polak-Ribière formula is summarized in Table 2. Remarks: The SD algorithm in Table 1 is designed to minimize a cost function. It may be converted into a steepest ascent (SA) algorithm for solving maximization problems by changing the update in step 5 into Wk+1 = exp(+µk Gk) Wk. The same change needs to be applied to the CG-PR in Table 2, step 5, i.e., Wk+1 = exp(+µk Hk) Wk. Additionally, step 4 in Table 2 needs to be replaced by µk = arg max_µ J(exp(+µHk) Wk).

3. SIMULATION RESULTS AND APPLICATIONS

In this section we test the proposed Riemannian algorithms on two different optimization problems on U(n). The first one is a classical test function for optimization under orthogonal matrix constraint [3]. The second one is the JADE cost function [8], which is a practical application of the proposed algorithms to blind source separation.

3.1. Diagonalization of a Hermitian Matrix

The diagonalization of a Hermitian matrix Σ can be achieved by maximizing the Brockett criterion [3]

JB(W) = tr{W^H Σ W N},   subject to W ∈ U(n).   (5)

The matrix W converges to the eigenvectors of Σ sorted according to the ascending order of the eigenvalues, provided that N is a diagonal matrix with the diagonal elements 1, . . . , n. This type of optimization problem arises in many signal processing applications such as blind source separation, subspace estimation, and high-resolution direction finding, as well as in communications applications. This example of computing the eigenvectors of a Hermitian matrix (such as a covariance matrix) is chosen for illustrative purposes since it is well known by most of the readers. A more practical application, the blind source separation problem, is considered in Subsection 3.2. The Euclidean gradient of the Brockett function is ΓW = ΣWN. The performance is studied in terms of convergence speed considering a diagonality criterion ∆, and in terms of deviation from the unitary constraint using a unitarity criterion Ω, defined as

∆ = 10 lg ( off{W^H ΣW} / diag{W^H ΣW} ),   Ω = 10 lg ‖WW^H − I‖²F,   (6)

where the off{·} operator computes the sum of the squared magnitudes of the off-diagonal elements of a matrix, and diag{·} does the same operation, but for the diagonal ones. The diagonality criterion measures the departure of the matrix W^H ΣW from the diagonal property, in logarithmic scale. The unitarity criterion is the squared Frobenius norm of the deviation from the unitarity property, in logarithmic scale. The results are averaged over 100 random realizations of the 6 × 6 Hermitian matrix Σ. In order to maximize the criterion (5), the SD in Table 1 and the CG-PR in Table 2 need minor modifications (see the Remarks at the end of Subsection 2.3). In Figure 3, we compare three algorithms. The first one is the geodesic steepest ascent (SA) obtained from the SD in Table 1. The second algorithm is the non-geodesic SA obtained from the SD in [5]. The third one is the CG algorithm in Table 2 which uses the Polak-Ribière formula (CG-PR). The step size for all three algorithms is chosen by using the Armijo rule [7] as in [1]. In the left plot of Figure 3 we observe that the geodesic and the non-geodesic SA algorithms perform the same, but the geodesic SA algorithm has lower complexity, especially when the Armijo step is used [1]. The CG-PR algorithm outperforms both the geodesic and the non-geodesic SA algorithms, with comparable computational complexity. In terms of satisfying the unitary constraint, all algorithms provide good performance, as shown in the right subplot of Figure 3.

3.2. Joint Approximate Diagonalization of a Set of Hermitian Matrices: Minimizing the JADE Criterion on U(m)

In this subsection we test the proposed SD and CG algorithms in a practical application of blind source separation (BSS) of communication signals by using a joint diagonalization approach [8]. We show that the proposed algorithms outperform the classical JADE algorithm [8]. A number of m = 20 independent 16-QAM signals are separated blindly from their r = 20 mixtures. A total of N = 15000 snapshots are collected and the results are averaged over 100 independent realizations of the r × m mixture matrix. The signal-to-noise ratio is 20 dB. The blind recovery of the desired signals is based on statistical properties and it is done in two stages. The first one is a whitening operation, which can be done by diagonalizing the sample covariance matrix as in Subsection 3.1. The second stage is the joint diagonalization of a set of eigen-matrices M̂i which are estimated from the fourth-order cumulants of the whitened signals. In [8], this is done by using Givens rotations. In this paper we find the unitary rotation by minimizing the JADE criterion [8], which penalizes the deviation of the eigen-matrices from the diagonal property:

JJADE(W) = Σ_{i=1}^m off{W^H M̂i W},   subject to W ∈ U(n).   (7)

The gradient of the JADE cost function on the Euclidean space is

ΓW = 2 Σ_{i=1}^m M̂i W [ W^H M̂i W − I ⊙ (W^H M̂i W) ],

where ⊙ denotes the elementwise matrix multiplication. In Figure 4 we compare the classical JADE algorithm to the proposed SD and CG algorithms on U(m). The performance measure for the optimization problem is the JADE criterion (7). The performance index used for the entire blind separation problem is the Amari distance. The geodesic SD and the CG-PR algorithm have similar convergence speed and they outperform the classical JADE algorithm [8]. The non-geodesic SD [5] (not shown in Figure 4) performs the same as the geodesic SD. The geodesic algorithms on U(m) take benefit of the Lie group properties of U(m) in order to reduce complexity. The proposed SD and CG-PR algorithms have a complexity of O(m³) per iteration and only a few iterations are required to achieve convergence. Moreover, the number of iterations needed to achieve convergence stays almost constant when increasing m. The Givens rotation approach in [8] has a total complexity of O(m⁴), since it updates not only the unitary rotation matrix, but also the full set of eigen-matrices M̂i. Therefore, the total complexity of the proposed algorithms is lower, especially when the number of signals m is very large. The proposed algorithms converge faster at similar computational cost per iteration. Therefore, they are suitable for blind separation applications, especially when the number of signals to be separated is large.
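The Brockett diagonalization experiment of Subsection 3.1 can be sketched as follows. Two simplifications relative to the paper are ours: a fixed step size replaces the Armijo rule, and Σ is built with well-separated eigenvalues so that the fixed step converges quickly. The steepest ascent update rotates W until W^H Σ W is diagonal with the eigenvalues in ascending order.

```python
import numpy as np

def expm_skew(S):
    # Exponential of a skew-Hermitian matrix via eigh of -iS
    # (repeated here so the snippet is self-contained).
    lam, U = np.linalg.eigh(-1j * S)
    return (U * np.exp(1j * lam)) @ U.conj().T

rng = np.random.default_rng(5)
n = 6
# Hermitian Sigma with a known, well-separated spectrum.
V, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
evals = np.linspace(0.0, 1.0, n)
Sigma = V @ np.diag(evals) @ V.conj().T
N = np.diag(np.arange(1.0, n + 1))                 # N = diag(1, ..., n)

W = np.eye(n, dtype=complex)
mu = 0.05
for _ in range(3000):
    Gamma = Sigma @ W @ N                          # Euclidean gradient of (5)
    G = Gamma @ W.conj().T - W @ Gamma.conj().T
    W = expm_skew(+mu * G) @ W                     # steepest *ascent* step

C = W.conj().T @ Sigma @ W
off = np.sum(np.abs(C - np.diag(np.diag(C))) ** 2)
print(off < 1e-10)                                 # True: C is (nearly) diagonal
print(np.allclose(np.diag(C).real, evals))         # True: ascending eigenvalues
```

The fixed step works here because the eigenvalue gaps are uniform; with nearly degenerate eigenvalues, the Armijo rule used in the paper is the safer choice.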


Fig. 4. The classical JADE algorithm [8] vs. the SD and CG algorithms on U(20): the geodesic SD in Table 1 and the CG-PR in Table 2. The geodesic SD and the CG-PR algorithm perform similarly in terms of the JADE criterion and the Amari distance. They outperform the classical JADE algorithm [8], especially when the number of signal sources is large. In this application the CG did not improve the convergence speed of the SD. (Plots omitted; axes: JADE criterion [dB] and Amari distance [dB] vs. iteration k.)
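The Euclidean gradient of the JADE cost given above can be sanity-checked against a finite-difference directional derivative, using the complex gradient convention of Brandwood [9], dJ = 2ℜ trace{Γ^H dW}. The matrices below are random Hermitian stand-ins for the estimated eigen-matrices M̂i; the helper names are our own.

```python
import numpy as np

def off(C):
    return np.sum(np.abs(C) ** 2) - np.sum(np.abs(np.diag(C)) ** 2)

def jade_cost(W, Ms):
    return sum(off(W.conj().T @ M @ W) for M in Ms)

def jade_grad(W, Ms):
    """Euclidean gradient dJ/dW* of the JADE cost: 2 * sum M W (C - I.C)."""
    G = np.zeros_like(W)
    for M in Ms:
        C = W.conj().T @ M @ W
        G += 2 * M @ W @ (C - np.diag(np.diag(C)))
    return G

rng = np.random.default_rng(6)
n, m = 4, 3
Ms = []
for _ in range(m):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Ms.append((A + A.conj().T) / 2)            # Hermitian "eigen-matrices"
W = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
E = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

analytic = 2 * np.real(np.trace(jade_grad(W, Ms).conj().T @ E))
h = 1e-6
numeric = (jade_cost(W + h * E, Ms) - jade_cost(W - h * E, Ms)) / (2 * h)
print(abs(analytic - numeric) < 1e-4 * max(1.0, abs(analytic)))   # True
```

In the optimization itself, this Γ is translated to the identity as G = Γ W^H − W Γ^H and fed to the SD or CG iteration, exactly as in Tables 1 and 2.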

4. CONCLUSIONS

In this paper, Riemannian steepest descent and conjugate gradient algorithms for optimization under unitary matrix constraint were proposed. They operate on the Lie group of n × n unitary matrices and exploit the geometrical properties of U(n), such as simple geodesic formulas and parallel transport, in order to reduce the complexity. For this reason their complexity is lower than that of the non-geodesic SD in [5] at the same convergence speed. The algorithms provide a reliable solution to the joint diagonalization problem for blind separation and outperform the widely used Givens rotations approach, i.e., the classical JADE algorithm [8]. They may be applied, for example, to smart antenna algorithms, wireless communications, biomedical measurements, signal separation, and subspace estimation and tracking tasks where unitary matrices play an important role.

5. REFERENCES

[1] T. Abrudan, J. Eriksson, and V. Koivunen, "Steepest descent algorithms for optimization under unitary matrix constraint," to appear in IEEE Transactions on Signal Processing.
[2] D. Gabay, "Minimizing a differentiable function over a differential manifold," Journal of Optimization Theory and Applications, vol. 37, pp. 177–219, June 1982.
[3] S. T. Smith, Geometric Optimization Methods for Adaptive Filtering. PhD thesis, Harvard University, Cambridge, MA, May 1993.
[4] A. Edelman, T. Arias, and S. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[5] J. H. Manton, "Optimization algorithms exploiting unitary constraints," IEEE Transactions on Signal Processing, vol. 50, pp. 635–650, Mar. 2002.
[6] T. Abrudan, J. Eriksson, and V. Koivunen, "Conjugate gradient algorithm for optimization under unitary matrix constraint," submitted to IEEE Transactions on Signal Processing.
[7] E. Polak, Optimization: Algorithms and Consistent Approximations. New York: Springer-Verlag, 1997.
[8] J. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings-F, vol. 140, no. 6, pp. 362–370, 1993.



[Publication VI] T. Abrudan, M. Sˆırbu, V. Koivunen, “Blind Multi-user Receiver for MIMOOFDM Systems”, IEEE Workshop on Signal Processing Advances in Wireless Communications, Rome, Italy, 15–18 Jun. 2003, pp. 363–367. ©2003 IEEE. Reprinted with permission.

[Publication VII] T. Abrudan, V. Koivunen, “Blind Equalization in Spatial Multiplexing MIMO-OFDM Systems based on Vector CMA and Decorrelation Criteria”, Wireless Personal Communications, vol. 43, no. 4, Dec. 2007, pp. 1151– 1172. ©2007 Springer. Reprinted with permission.

Wireless Pers Commun (2007) 43:1151–1172 DOI 10.1007/s11277-007-9291-2

Blind equalization in spatial multiplexing MIMO-OFDM systems based on vector CMA and decorrelation criteria Traian Emanuel Abrudan · Visa Koivunen

Received: 15 July 2005 / Accepted: 7 March 2007 / Published online: 29 June 2007 © Springer Science+Business Media B.V. 2007

Abstract In this paper we address the problem of blind recovery of multiple OFDM data streams in a Multiple-Input Multiple-Output (MIMO) system. We propose an equalization algorithm which is based on a combined criterion designed to cancel both inter-symbol interference (ISI) and co-channel interference (CCI). ISI is minimized by using a modified Vector Constant Modulus criterion, while CCI is minimized by a decorrelation criterion. We establish conditions for the existence of the stable minima corresponding to the zero-forcing receiver which performs joint blind equalization and co-channel signal cancellation. The local convergence properties of the algorithm are proved under the assumption that the balance parameter weighting the two criteria is set appropriately. We also provide the optimal value for this parameter. Reliable performance is achieved with relatively fast convergence and small steady-state error. The implementation of the blind equalizer has a low computational cost, without any matrix inversions or other expensive operations.

T. E. Abrudan (B) · V. Koivunen SMARAD CoE, Signal Processing Laboratory, Department of Electrical Engineering, Helsinki University of Technology, 02015 Espoo, Finland e-mail: [email protected] V. Koivunen e-mail: [email protected]

Keywords Blind equalization · OFDM · MIMO systems · Spatial multiplexing · Local convergence analysis

1 Introduction Multiantenna Multiple-Input Multiple-Output (MIMO) systems and multicarrier transmission such as orthogonal frequency division multiplexing (OFDM) are among the key technologies in future beyond 3G wireless communication systems. These technologies provide high-spectral efficiency and improved reliability of radio links. Moreover, multicarrier transmission turns broadband frequency selective channel into set of narrowband frequency flat channels which leads to simplified receiver design. In mobile MIMO-OFDM systems, however, plenty of pilot symbols are needed in order to deal with time selectivity of the channel which means that more control signals are transmitted instead of information symbols. Blind equalization algorithms provide higher effective data rates because they do not require any training data or pilot signals. They rely only on statistical or structural properties of the transmitted signal in order to compensate the impairment caused by the channel. Typically in OFDM transmission a cyclic prefix (CP) is employed because it allows single tap equalization at the receiver. When the CP is too short this benefit is lost, and the inter-carrier interference may make the demodulation impossible.


In this paper, a blind equalization algorithm combining modified Vector Constant Modulus and decorrelation criteria is developed. We also prove the local convergence of the proposed algorithm. The proposed blind algorithm operates in the time domain, before the DFT operation at the receiver. Therefore, it is designed to deal with three different cases: no CP at all, a too-short CP, and a sufficiently long CP. The CP may be used for synchronization purposes, hence it is included in the algorithm derivation. However, it is not needed in finding the equalizer. The proposed blind algorithm exploits two important properties of OFDM signals: the constant mean block energy property and the Gaussianity. Both of them are due to the unitary IDFT used in OFDM transmission. The VCMA criterion penalizes the deviation of the block energy from a dispersion constant. The VCMA cost function may be decomposed into a constant modulus (CM) cost function [7] and an auto-correlation function of the squared magnitudes of the received signal [15]. Therefore, it is not applicable to signals which have a periodic correlation, such as the OFDM signal. When a CP is used, a strong auto-correlation in the transmitted signals is introduced. It may be stronger than the correlation caused by the multipath propagation channel. The performance of the classical VCMA then degrades significantly, because it penalizes the correlation induced by the CP. The proposed modified VCMA is designed to deal with the auto-correlation caused by the CP and to cancel inter-symbol interference (ISI) simultaneously. The VCMA was applied to blind equalization for shaped constellations in [18] and for Single-Input Single-Output (SISO) OFDM systems in [10]. In [10], the CP or zero padding was required in order to perform the equalization. MIMO schemes have been considered in [11,12] using the classical CMA, for real constant modulus signals. VCMA was employed in the context of DS-CDMA systems in [17].
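The constant mean block energy property that VCMA relies on follows from Parseval's relation for the unitary IDFT, as the sketch below illustrates. The block length and the unit-energy 16-QAM normalization are assumptions made for this example, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
N, blocks = 32, 1000

# Unit-energy 16-QAM sub-symbols (the 1/sqrt(10) scaling is an assumption).
re = rng.integers(0, 4, (blocks, N)) * 2 - 3
im = rng.integers(0, 4, (blocks, N)) * 2 - 3
s = (re + 1j * im) / np.sqrt(10)

# The unitary IDFT used in OFDM modulation preserves each block's energy
# (Parseval), which yields the constant *mean* block energy property.
x = np.fft.ifft(s, axis=1, norm="ortho")
print(np.allclose(np.sum(np.abs(x) ** 2, axis=1),
                  np.sum(np.abs(s) ** 2, axis=1)))      # True

# VCMA-style dispersion of the block energy around R = E[||s||^2] = N:
R = N
cost = np.mean((np.sum(np.abs(x) ** 2, axis=1) - R) ** 2)
print(round(cost, 1))   # roughly N * var(|s|^2), i.e., about 10 for 16-QAM
```

The mean block energy equals N exactly, while individual 16-QAM blocks fluctuate around it; the VCMA criterion penalizes exactly this deviation at the equalizer output.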
T. E. Abrudan, V. Koivunen

Blind channel identification and equalization algorithms for MIMO-OFDM systems exploiting the statistics of the OFDM signals have been proposed. Some of these algorithms exploit the cyclostationary statistics induced by block precoding [3,19] or the cyclic prefix [2,6]. Subspace-based methods exploiting virtual carriers or zero padding have been proposed in [1,14,20]. A blind source separation approach based on a natural gradient learning algorithm is developed in [8]. In a spatial multiplexing MIMO scenario, the problem becomes more difficult. At one receive antenna we have the desired signal with its delayed replicas caused by the channel ISI, in addition to the co-channel signals, i.e., co-channel interference (CCI), with their delayed replicas. In order to perform both the blind equalization and signal separation, an output decorrelation criterion is needed. This criterion assumes that the transmitted data streams are mutually independent. Hence, it is suitable for spatial multiplexing systems. It penalizes the correlation among the equalized outputs. Consequently, we come up with a cost function comprised of two criteria: a modified VCMA criterion and a decorrelation criterion. A weighting parameter λ is used to balance the two criteria. The equalizer coefficients are adaptively adjusted using a steepest descent algorithm. The global convergence of VCMA has not been established so far, not even in the fractionally spaced case [9]. The local convergence properties of VCMA have been investigated in [15]. In our analysis, we assume a composite cost function and we analyze its local convergence properties under more general conditions, as will be shown later in Sect. 5. We show that truly stable local minima corresponding to the zero-forcing solutions always exist if the balance parameter λ of the composite cost function is set appropriately. We find an optimal value for this parameter. This leads to a non-convex constrained optimization problem. This paper is organized as follows. In Sect. 2, we define the system model. The blind equalization algorithm for MIMO-OFDM systems is introduced in Sect. 3. The proposed composite cost function is analyzed in Sect. 4. Local convergence analysis of the algorithm is provided in Sect. 5. Simulation examples are presented in Sect. 6. The proofs are provided in the Appendix.

Blind equalization in spatial multiplexing MIMO-OFDM systems based on vector CMA and decorrelation criteria

Fig. 1 MIMO-OFDM system model. Each of the K transmitters applies an IFFT and CP insertion to its data stream s_k(n); the Q receive antennas observe the MIMO channel output corrupted by additive noise w_q(n); a bank of equalizers g_qk produces the equalized streams z_k(n), which are passed through the FFT and CP removal to obtain the symbol estimates ŝ_k(n).

2 System model

We consider a MIMO-OFDM system with K transmit and Q receive antennas (Fig. 1). We assume a spatial multiplexing scenario, where independent OFDM data streams are transmitted across the antennas. Each data stream consists of i.i.d. complex symbols modulated by M subcarriers. In this model, we use a block formulation similar to the one in [16]. The sample index is denoted by (·), and the block index by [·]. We consider the complex symbols from the kth data stream stacked in an M × 1 vector s_k[n] = [s_k(nM), ..., s_k(nM − M + 1)]^T. The N × 1 transmitted OFDM block of the kth data stream can be written as:

  \tilde{u}_k[n] = T_{CP} F s_k[n],    (1)
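Equation (1) can be sketched numerically; the dimensions M = 8 and L_cp = 3 and the variable names below are illustrative assumptions. The final check exposes the periodic correlation that the CP introduces — the very correlation the modified VCMA must later tolerate:

```python
import numpy as np

rng = np.random.default_rng(1)
M, L_cp = 8, 3
N = M + L_cp                               # OFDM block length with cyclic prefix

F = np.fft.ifft(np.eye(M), norm="ortho")   # M x M normalized IDFT matrix
# cyclic-prefix addition matrix T_CP: copies the last L_cp rows of I_M on top
T_cp = np.vstack([np.eye(M)[-L_cp:], np.eye(M)])   # N x M

s = rng.standard_normal(M) + 1j * rng.standard_normal(M)
u = T_cp @ (F @ s)                         # transmitted block, Eq. (1)

assert u.shape == (N,)
# the prefix replicates the block tail -> strong periodic auto-correlation
assert np.allclose(u[:L_cp], u[-L_cp:])
```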

where F is the M × M normalized IDFT matrix and T_{CP} is the N × M cyclic prefix addition matrix [16]. The sequence of L + 1 consecutive transmitted OFDM samples corresponding to antenna k is denoted by u_k(n) = [u_k(n), ..., u_k(n − L)]^T. The MIMO channel branches from the kth transmit to the qth receive antenna have maximum order L_h and are characterized by the impulse responses h_kq = [h_kq(0), ..., h_kq(L_h)]. The corresponding (L − L_h + 1) × (L + 1) Sylvester convolution matrices describing the MIMO channel are:

  H_{kq} = \begin{bmatrix} h_{kq}(0) & \cdots & h_{kq}(L_h) & & 0 \\ & \ddots & & \ddots & \\ 0 & & h_{kq}(0) & \cdots & h_{kq}(L_h) \end{bmatrix} \triangleq \mathrm{Sylvester}\{h_{kq}\}.    (2)

Stacking the vectors u_k(n) corresponding to the K transmitted data streams in a vector u(n) = [u_1^T(n), ..., u_K^T(n)]^T, the L − L_h + 1 consecutive samples received at antenna q are y_q(n) = [H_{1q} ... H_{Kq}] u(n) + w_q(n), q = 1, ..., Q. The vectors w_q(n) represent the additive noise at the qth receiver. Considering the MIMO channel matrix

  H = \begin{bmatrix} H_{11} & \cdots & H_{K1} \\ H_{12} & \cdots & H_{K2} \\ \vdots & \ddots & \vdots \\ H_{1Q} & \cdots & H_{KQ} \end{bmatrix},    (3)

the Q(L − L_h + 1) × 1 array output y(n) = [y_1^T(n), ..., y_Q^T(n)]^T may be written as:

  y(n) = H u(n) + w(n),    (4)
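The banded Sylvester structure in (2) is simply an FIR convolution written as a matrix product. A small sketch — the channel order, helper name `sylvester` and test data are illustrative assumptions — cross-checked against `numpy.convolve`:

```python
import numpy as np

def sylvester(h, n_rows):
    """(n_rows) x (n_rows + len(h) - 1) banded convolution matrix as in Eq. (2)."""
    H = np.zeros((n_rows, n_rows + len(h) - 1), dtype=complex)
    for i in range(n_rows):
        H[i, i:i + len(h)] = h
    return H

rng = np.random.default_rng(2)
L_h, L = 2, 6
h = rng.standard_normal(L_h + 1) + 1j * rng.standard_normal(L_h + 1)
H_kq = sylvester(h, L - L_h + 1)           # (L - L_h + 1) x (L + 1)

# u_rev holds time-reversed samples [u(n), ..., u(n - L)]
u_rev = rng.standard_normal(L + 1) + 1j * rng.standard_normal(L + 1)
y = H_kq @ u_rev

# cross-check against the full linear convolution in chronological order:
# c[k] = (h * u)(n - L + k), so y(i) = c[L - i]
c = np.convolve(h, u_rev[::-1])
assert np.allclose(y, c[L - np.arange(L - L_h + 1)])
```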

where w(n) = [w_1^T(n), ..., w_Q^T(n)]^T. The adaptive equalizers have order L_g and are row vectors denoted by g_qk[n] = [g_qk(0), ..., g_qk(L_g)]. The minimum equalizer order is chosen according to the identifiability conditions presented in Sect. 3.4. In order to recover the K transmitted data streams, a bank of K equalizers g_k[n] = [g_1k[n], ..., g_Qk[n]], k = 1, ..., K, is used, each combining the signals from all receive antennas. By choosing L = L_h + L_g, the equalized sample corresponding to the kth data stream can be written as:

  z_k(n) = g_k[n] y(n).    (5)

Considering the 1 × K(L + 1) global channel-equalizer impulse response (GCEIR) a_k = g_k[n] H corresponding to the kth data stream, the equalized sample from this data stream may be written as:

  z_k(n) = a_k u(n).    (6)
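The existence of a zero-forcing GCEIR can be illustrated for a toy single-stream, two-receive-antenna setup. The dimensions and the least-squares route to g below are illustrative assumptions (the paper's equalizer is adapted blindly, not computed from a known channel matrix):

```python
import numpy as np

rng = np.random.default_rng(8)
L_h, L_g, Q = 1, 1, 2
L = L_h + L_g

def sylv(h, n_rows):
    Hm = np.zeros((n_rows, n_rows + len(h) - 1), dtype=complex)
    for i in range(n_rows):
        Hm[i, i:i + len(h)] = h
    return Hm

# random complex channels to the two receive antennas (generically co-prime)
H = np.vstack([sylv(rng.standard_normal(L_h + 1) + 1j * rng.standard_normal(L_h + 1),
                    L - L_h + 1) for _ in range(Q)])   # Q(L - L_h + 1) x (L + 1)

# seek g with a = g H = [1, 0, 0]: a zero-forcing GCEIR with delay 0
target = np.zeros(L + 1)
target[0] = 1.0
g = np.linalg.lstsq(H.T, target, rcond=None)[0]        # solves H^T g^T = target^T
a = g @ H
assert np.allclose(a, target, atol=1e-6)
```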

If the equalization is achieved, the a_k's are equal to a standard unit vector multiplied by an unknown phase rotation, which is inherent to all blind methods. The adaptive equalizer corresponding to each recovered data stream operates in a block mode, and it outputs a block of N samples z_k[n] = [z_k(nN), ..., z_k(nN − N + 1)]^T. The equalized block corresponding to the kth data stream may be written as:

  z_k[n] = \sum_{q=1}^{Q} G_{qk}[n] \tilde{y}_q[n],    (7)

where \tilde{y}_q[n] = [y_q(nN), ..., y_q(nN − N − L_g + 1)]^T and G_{qk}[n] = Sylvester{g_qk[n]} are N × (N + L_g) Sylvester convolution matrices built in the same manner as in (2), but with the coefficients g_qk[n].


3 Blind MIMO-OFDM equalizer

The equalizer coefficients are computed for each N received samples. They are updated according to a steepest descent algorithm which minimizes a cost function comprised of two criteria: a modified VCMA criterion and a decorrelation criterion. These two criteria are described next.

3.1 Modified VCMA criterion

In the single-transmitter case [18], the VCMA criterion penalizes the deviation of the equalized block energy from a given dispersion constant. In our multiple-transmitter scenario, we consider the energy penalty over all data streams:

  J^{VCMA}(g_1[n], \ldots, g_K[n]) = \sum_{k=1}^{K} E\big[ (\|z_k[n]\|^2 - R_2)^2 \big],    (8)

where \|\cdot\| denotes the ℓ2-norm of a vector. The block energy dispersion constant is R_2 = E[\|\tilde{u}_k[n]\|^4] / E[\|\tilde{u}_k[n]\|^2]. It can be shown that for complex Gaussian signals R_2 = N + 1. We are interested in finding the gradient of this cost function w.r.t. the equalizers g_k, k = 1, ..., K:

  \nabla_{g_k} J^{VCMA} = \left[ \frac{\partial J^{VCMA}}{\partial g_{1k}^*} \; \cdots \; \frac{\partial J^{VCMA}}{\partial g_{Qk}^*} \right].    (9)

In our modified VCMA we use a different approach in the update, compared to the original VCMA [18]. The equalizer G_{qk}[n] has a Sylvester structure and the received data samples are stacked in a column vector (7). This does not make any difference from the convolution point of view, but it plays a role in how we deal with the periodic correlation induced by the CP. The update rule will be different for our modified VCMA; more details will be provided after this derivation. Using the instantaneous values instead of the expectation in (8), we may compute the derivative of this cost function w.r.t. the row vector g_{qk}^*. This vector contains the coefficients of one particular FIR subequalizer, corresponding to the kth output from the qth receive antenna. The derivative may be written as follows:

  \frac{\partial J^{VCMA}}{\partial g_{qk}^*} = 2(\|z_k[n]\|^2 - R_2) \frac{\partial \|z_k[n]\|^2}{\partial g_{qk}^*}.    (10)

By denoting \Phi_{q_1 q_2} = \tilde{y}_{q_1}[n] \tilde{y}_{q_2}[n]^H, from (7) we get:

  \|z_k[n]\|^2 = \sum_{q_1=1}^{Q} \sum_{q_2=1}^{Q} \mathrm{trace}\big\{ G_{q_1 k} \Phi_{q_1 q_2} G_{q_2 k}^H \big\}.    (11)

The above equation is expressed in terms of the Sylvester matrix G_{qk}, which contains the coefficients of interest g_{qk} for the derivative in (10). We need to express \|z_k[n]\|^2 as a function of these coefficients. We denote \Psi_{q_1 q_2} = G_{q_1 k} \Phi_{q_1 q_2} G_{q_2 k}^H. The trace of \Psi_{q_1 q_2} may be expressed taking into account the special structure of G_{q_1 k} and G_{q_2 k}. The dth diagonal element \Psi_{q_1 q_2}^{d,d} is obtained by multiplying the corresponding dth row of G_{q_1 k} by the columns of \Phi_{q_1 q_2} and the dth column of G_{q_2 k}^H, i.e.,

  \Psi_{q_1 q_2}^{d,d} = G_{q_1 k}(d,:) \, \Phi_{q_1 q_2} \, G_{q_2 k}^H(:,d),    (12)

where the notations (d,:) and (:,d) stand for the dth row and column of a matrix, respectively. The trace of \Psi_{q_1 q_2} is the sum of these elements over all d. The zero elements located outside the band structure of G_{q_1 k} and G_{q_2 k}^H reduce the complexity of the multiplication in (12). Taking into consideration the band structure of the Sylvester matrices, the remaining contribution from the matrix \Phi_{q_1 q_2} consists of square blocks of size (L_g + 1) × (L_g + 1) on the main diagonal, which we denote by \Phi_{q_1 q_2}^{(d:d+L_g)}, d = 1, ..., N. Now,

  \mathrm{trace}\{\Psi_{q_1 q_2}\} = \sum_{d=1}^{N} g_{q_1 k} \, \Phi_{q_1 q_2}^{(d:d+L_g)} \, g_{q_2 k}^H.    (13)

The squared magnitude may thus be expressed as:

  \|z_k[n]\|^2 = \sum_{q_1=1}^{Q} \sum_{q_2=1}^{Q} \sum_{d=1}^{N} g_{q_1 k} \, \Phi_{q_1 q_2}^{(d:d+L_g)} \, g_{q_2 k}^H.    (14)

For the particular equalizer g_{qk} located at the qth receive antenna on the kth branch (see Fig. 1), the derivative of interest is:

  \frac{\partial \|z_k[n]\|^2}{\partial g_{qk}^*} = \sum_{q_1=1}^{Q} g_{q_1 k} \sum_{d=1}^{N} \Phi_{q_1 q}^{(d:d+L_g)} = \sum_{q_1=1}^{Q} g_{q_1 k} \bar{\Phi}_{q_1 q},    (15)

where \bar{\Phi}_{q_1 q} = \sum_{d=1}^{N} \Phi_{q_1 q}^{(d:d+L_g)}. The derivative (10) may be written as:

  \frac{\partial J^{VCMA}}{\partial g_{qk}^*} = 2(\|z_k[n]\|^2 - R_2) \sum_{q_1=1}^{Q} g_{q_1 k} \bar{\Phi}_{q_1 q}.    (16)

The quantity \bar{\Phi}_{q_1 q} represents the sample estimate of the correlation matrix between the received signals at the q_1th and qth receive antennas. It is computed on a short processing window of length equal to the equalizer length L_g + 1, instead of the whole OFDM block as in [18]. Therefore, the CP does not enter the computation of the correlation matrix and does not affect its diagonal dominance. In this way, the algorithm is able to handle both the correlation induced by the CP and the correlation introduced by the multipath propagation channel. In the original VCMA the correlation is computed using the whole OFDM block of length N. In that case, the periodic correlation between the CP and the end of the OFDM block may be stronger than the correlation introduced by the channel ISI. Consequently, the equalizer performance would degrade, depending on the CP length. We emphasize that the equalizer is blind and does not need the CP at all. The CP has been included in the derivations because it may be used for block synchronization purposes.
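The windowing argument behind (11)–(14) can be verified numerically: the block energy computed directly from the convolution (7) must equal the sum over the (L_g + 1) × (L_g + 1) diagonal blocks of Φ_{q1 q2}. A sketch with arbitrary test data (all sizes are hypothetical):

```python
import numpy as np

def sylv(g, n_rows):
    """N x (N + L_g) Sylvester convolution matrix, as G_qk[n] in (7)."""
    G = np.zeros((n_rows, n_rows + len(g) - 1), dtype=complex)
    for i in range(n_rows):
        G[i, i:i + len(g)] = g
    return G

rng = np.random.default_rng(4)
N, L_g, Q = 8, 2, 2
g = [rng.standard_normal(L_g + 1) + 1j * rng.standard_normal(L_g + 1) for _ in range(Q)]
y = [rng.standard_normal(N + L_g) + 1j * rng.standard_normal(N + L_g) for _ in range(Q)]

# ||z_k[n]||^2 directly from the block convolution (7)
z = sum(sylv(g[q], N) @ y[q] for q in range(Q))
direct = np.linalg.norm(z) ** 2

# ... and from the diagonal (L_g+1) x (L_g+1) blocks of Phi_{q1 q2} = y_q1 y_q2^H,
# Eqs. (13)-(14): only short windows of the received data enter the sum
acc = 0j
for q1 in range(Q):
    for q2 in range(Q):
        Phi = np.outer(y[q1], y[q2].conj())
        for d in range(N):
            acc += g[q1] @ Phi[d:d + L_g + 1, d:d + L_g + 1] @ g[q2].conj()
assert np.isclose(direct, acc.real) and abs(acc.imag) < 1e-9
```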


3.2 Output decorrelation criterion

The kth equalized data stream z_k[n] may contain interference signals corresponding to the other data streams z_l[n], as well as their delayed replicas z_l[n, τ] = [z_l(nN − τ), ..., z_l(nN − τ − N + 1)]^T. The interference is measured by the cross-correlation matrix R_kl(τ) between the kth and lth branches for a certain delay τ, i.e., R_kl(τ) = E[z_k[n] z_l^H[n, τ]]. A decorrelation criterion must be employed because multiple copies of other signals, i.e., CCI, may be present in the desired signal. This criterion minimizes the squared Frobenius norm of the cross-correlation matrices, \|R_kl(τ)\|_F^2 = trace{R_kl(τ) R_kl^H(τ)}. The cross-correlation cost function over all equalized data streams is:

  J^{xcorr}(g_1[n], \ldots, g_K[n]) = \sum_{l,k=1, l \neq k}^{K} \sum_{\tau=\tau_1}^{\tau_2} \|R_{kl}(\tau)\|_F^2.    (17)

The delays τ_1, τ_2 are chosen according to the maximum delay introduced by the channel, i.e., the integer τ spans the window of all possible delays, in order to mitigate all the replicas of the interference signals. Using the instantaneous values instead of the expectation, and following the same derivation as above, we compute the gradient of this cost function w.r.t. the row vector g_qk:

  \nabla_{g_{qk}} J^{xcorr} = \sum_{l=1, l \neq k}^{K} \sum_{\tau=\tau_1}^{\tau_2} \frac{\partial \|R_{kl}(\tau)\|_F^2}{\partial g_{qk}^*}.    (18)

First, we compute the derivative of the squared Frobenius norm:

  \frac{\partial \|R_{kl}(\tau)\|_F^2}{\partial g_{qk}^*} = \|z_l[n-\tau]\|^2 \frac{\partial \|z_k[n]\|^2}{\partial g_{qk}^*} + \|z_k[n]\|^2 \frac{\partial \|z_l[n-\tau]\|^2}{\partial g_{qk}^*}.    (19)

Using the same strategy as before, and using the same notations, we can write the equalizer update:

  \nabla_{g_{qk}} J^{xcorr} = \sum_{q_1=1}^{Q} g_{q_1 k} \bar{\Phi}_{q_1 q} \sum_{l=1, l \neq k}^{K} \sum_{\tau=\tau_1}^{\tau_2} \|z_l[n-\tau]\|^2 + \|z_k[n]\|^2 \sum_{q_1=1}^{Q} \sum_{l=1, l \neq k}^{K} \sum_{\tau=\tau_1}^{\tau_2} g_{q_1 l} \bar{\Phi}_{q_1 q}^{\tau},    (20)

where \bar{\Phi}_{q_1 q_2}^{\tau} = \sum_{d=1}^{N} \big[ \tilde{y}_{q_1}[n-\tau] \tilde{y}_{q_2}[n-\tau]^H \big]^{(d:d+L_g)}.

3.3 Composite criterion

The VCMA cost function (8) was originally designed for the single-transmitter case. Its global convergence has not been established [9,15]. If multiple signals are present then, depending on its initialization, VCMA may converge to any of the transmitted signals, usually to the ones that have the strongest power [18]. This is due to the fact that VCMA updates the equalizers corresponding to the K data streams independently, i.e., the equalized outputs do not influence each other. Obviously, VCMA alone is not sufficient for equalization in a spatial multiplexing scenario, since the problem of co-channel signals must be considered as well [12]. We propose a composite criterion which we prove to be locally convergent. The cost functions (8) and (17) may be combined in order to cancel both ISI and CCI. A balance parameter λ ∈ (0, 1) is used to weight the two criteria. We will show that stable zero-forcing solutions always exist if the parameter λ is set appropriately. We also provide a closed-form expression for the optimum parameter λ in Sect. 5.5. The composite cost function is given by:

  J = \lambda J^{VCMA} + (1 - \lambda) J^{xcorr}.    (21)

The equalizer coefficients are updated for each N incoming samples using the steepest descent method:

  g_k[n+1] = g_k[n] - \mu \nabla_{g_k} J.    (22)
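The behaviour that the decorrelation penalty (17) is designed to capture can be illustrated with surrogate equalizer outputs: for independent streams the sample estimate of ‖R_kl(0)‖_F² is near zero, while residual CCI inflates it. The Gaussian surrogates, the mixing coefficient 0.7 and the restriction to τ = 0 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, B = 16, 4000                            # block length, number of blocks

z1 = rng.standard_normal((B, N)) + 1j * rng.standard_normal((B, N))
z2_indep = rng.standard_normal((B, N)) + 1j * rng.standard_normal((B, N))
z2_mixed = 0.7 * z1 + 0.7 * z2_indep       # output contaminated by CCI

def xcorr_penalty(zk, zl):
    """Sample estimate of ||R_kl(0)||_F^2 = ||E[z_k z_l^H]||_F^2, cf. Eq. (17)."""
    R = zk.T @ zl.conj() / zk.shape[0]     # N x N cross-correlation matrix
    return np.linalg.norm(R) ** 2

# independent streams give a much smaller penalty than correlated ones
assert xcorr_penalty(z1, z2_mixed) > 10 * xcorr_penalty(z1, z2_indep)
```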


The parameters involved in the algorithm are the weighting factor λ and the convergence step µ. The optimal value of the parameter λ depends on the extent of ISI in different MIMO branches as well as the power of CCI. Bounds for choosing λ and µ such that both the local convergence of the equalizer and the stability are ensured are established later in this paper in Sect. 5.

3.4 Identifiability conditions

We provide necessary conditions for the existence of the zero-forcing equalizer, i.e., conditions which must be fulfilled in order to have an identifiable system. We will show later that the sufficient condition for the existence of such an equalizer is set by the balance parameter λ. The following conditions are needed for the existence of the ZF equalizer [5]:

(1) Equalizer minimum length: the length of each equalizer g_qk must satisfy

  L_g + 1 \geq \left\lceil \frac{K(L_h + 1)}{Q - K} \right\rceil,    (23)

where ⌈·⌉ denotes the ceiling operator.

(2) Common zeros condition: the virtual channel polynomials defined in the Z-transform domain¹ as

  \bar{H}_q(Z) = \sum_{k=1}^{K} Z^{-k+1} h_{kq}(Z^K), \quad q = 1, \ldots, Q,    (24)

must be co-prime. The channels are defined in the Z-transform domain as h_{kq}(Z) = \sum_{l=0}^{L_h} h_{kq}(l) Z^{-l}. This condition is achieved by using a sufficient antenna element spacing w.r.t. the coherence distance, which is a natural requirement in a spatial multiplexing scenario.
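Condition (23) is straightforward to evaluate; a sketch with hypothetical system dimensions (the helper name is an assumption):

```python
import math

def min_equalizer_order(K, Q, L_h):
    """Minimum subequalizer order L_g from Eq. (23); requires Q > K."""
    assert Q > K, "more receive than transmit antennas are needed"
    return math.ceil(K * (L_h + 1) / (Q - K)) - 1

# hypothetical 2x4 spatial multiplexing system with channel order L_h = 3
L_g_min = min_equalizer_order(K=2, Q=4, L_h=3)
print(L_g_min)  # -> 3, i.e., each FIR subequalizer needs at least 4 taps
```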

4 Computation of the composite cost function

In this section, an expression for the composite cost function in terms of the global channel-equalizer impulse responses is derived. Knowledge about the input signal statistics is needed in order to compute the expectation. Expressions for the two components of the cost function are found separately. After that, the global cost function is analyzed in the space of global channel-equalizer impulse responses. The conditions and assumptions needed in this analysis are given below.

¹ To avoid confusion we denote by Z the variable of the transform.

4.1 Underlying assumptions

1. The analysis considers a noise-free scenario.

2. The data symbols modulated across the subcarriers are independent and belong to a complex circularly symmetric constellation (e.g., QPSK). Because the OFDM signal is the result of the unitary IFFT operation, this assumption has important consequences. The first one is that the block energy is preserved, according to Parseval's theorem, i.e., ‖s_k(n)‖² = ‖F s_k(n)‖² = M. The second consequence is that if the number of subcarriers is sufficiently large, the resulting OFDM signal has a nearly Gaussian distribution. This is justified by the Central Limit Theorem. Moreover, the real and imaginary parts of the resulting OFDM signals are independent and identically distributed, and we get E[u_k²(n)] = 0, ∀k = 1, ..., K. The third consequence is that if the input symbols are white, zero mean and have a given variance σ_s², the unitary IFFT operation preserves these properties. Therefore, E[u_k(n) u_k^*(n − m)] = σ_u² δ(m), with σ_u² = σ_s², and E[u_k(n)] = 0, ∀k = 1, ..., K.

3. Each transmitter sends independent data streams, i.e., spatial multiplexing transmission is assumed: E[u_k(n) u_l^*(n)] = 0, ∀k ≠ l, with k, l = 1, ..., K.

4. The CP is not considered here (N = M). The equalizer operates in the time domain before the FFT operation at the receiver. Therefore, it is able to cope with all three cases: no CP is used, the CP is too short, and the CP is sufficiently long. The effect of the CP has been considered in the derivation of the modified VCMA.

4.2 MIMO considerations

We want to emphasize that the algorithm exploits the independence between the transmitted data streams, which is a key assumption in a spatial multiplexing scenario. The proposed algorithm is applicable to general K-by-Q MIMO systems, under the identifiability conditions stated above. To make the derivation clearer for the reader, we consider a two-transmit antenna MIMO system, i.e., K = 2. In this case, the channel input is u(n) = [u_1(n), ..., u_1(n − L) | u_2(n), ..., u_2(n − L)]^T. The global channel-equalizer impulse responses corresponding to the two data streams are:

  a_k = [α_k(0), ..., α_k(L) | β_k(0), ..., β_k(L)] = [α_k | β_k],  k = 1, 2.    (25)

The equalized outputs are:

  z_k(n) = a_k u(n) = α_k u_1(n) + β_k u_2(n),  k = 1, 2.    (26)

Perfect equalization is achieved for data stream k if a_k = [0, ..., e^{jθ}, ..., 0 | 0, ..., 0]. The non-zero element is located at position d, having unit magnitude and phase θ. The first data stream is then recovered with a delay d, unit power and phase rotation θ. This phase rotation corresponds to the inherent phase ambiguity which is present in all blind equalization algorithms. If perfect recovery is not achieved, we have |α_1(d)| > |α_1(l)|, ∀l = 1, ..., L, l ≠ d. W.l.o.g. we assume that the first equalizer retrieves the first data stream with some interference from the second one. Then, we may consider:

  \sum_{l=0, l \neq d}^{L} |α_1(l)|^2 \geq \sum_{l=0}^{L} |β_1(l)|^2.    (27)

The coefficients α_1(l) with l ≠ d correspond to the channel ISI. The coefficients β_1(l) correspond to the CCI. In our analysis, we will study the impact of these coefficients on the recovery of the two signal sources. The composite cost function is designed for multiple independent data streams. Hence, the algorithm is not limited to two-transmit antenna systems. The K global channel-equalizer impulse responses can incorporate the K × Q MIMO channel matrix. Therefore, the derivation is applicable to a general K × Q MIMO system. If more transmit antennas are employed, we have to deal with a larger number of cross-terms and larger matrix dimensions. More details are presented in Sect. 4.5. See also Remark 1 referring to the expression of the composite cost function (41). We prove in our analysis that the algorithm provides stable local minima corresponding to the ZF solution if the weighting parameter λ is properly set. In this case, the autocorrelation function corresponding to each GCEIR, as well as the cross-correlation between any two pairs of GCEIRs, are equal to an impulse function. This is ensured by the whitening performed by the modified VCMA criterion together with the pairwise decorrelation penalty present in the composite cost function. We show that the algorithm is not sensitive to the choice of the weighting parameter λ. Only two values must be avoided, λ = 0 and λ = 1. In practice, especially in noisy conditions, we may choose any value λ ∈ (0, 1) which is not close to either zero or one. We will see that the zero value is the least desirable. Obviously, the choice of λ impacts the convergence speed. An optimum value may be chosen in order to achieve strong desired minima. It may be found based on the eigenvalues of the Hessian of the composite cost function (see Sect. 5.3). This optimum value is derived in Sect. 5.5. It ensures the existence of the ZF receiver which is able to recover all the transmitted data streams.

4.3 Modified VCMA cost function

The modified VCMA cost function penalizes the deviation of the equalized block energy from a given dispersion constant, over all equalized data streams. For the simple two-transmit system considered in this analysis, (8) may be written as:

  J^{VCMA}(a_1, a_2) = E\big[ (\|z_1[n]\|^2 - R_2)^2 \big] + E\big[ (\|z_2[n]\|^2 - R_2)^2 \big].    (28)

First, we compute the term corresponding to the first recovered data stream:

  J^{VCMA}(a_1) = E\big[ \|z_1[n]\|^4 \big] - 2 R_2 E\big[ \|z_1[n]\|^2 \big] + R_2^2.    (29)

We can compute this expectation because the statistics of the input signal are known. From (26) we can compute the statistics of the equalized samples, which are also Gaussian and zero mean, but with a different variance (linear transformation). The second-order term is

  E\big[ \|z_1[n]\|^2 \big] = N \|a_1\|^2.    (30)

For the fourth-order term, we obtain

  E\big[ \|z_1[n]\|^4 \big] = (N^2 + N) \|a_1\|^4 + 2 \sum_{m=1}^{L} (N - m) |a_1 D_{2(L+1),m} a_1^H|^2,    (31)


where the matrix D_{2(L+1),m} is defined as

  D_{2(L+1),m} = \begin{bmatrix} I_{L+1,m} & 0_{L+1} \\ 0_{L+1} & I_{L+1,m} \end{bmatrix}    (32)

and the matrices I_{L+1,m} are (L + 1) × (L + 1) matrices with row and column indices r and c such that {I_{L+1,m}}_{r,c} = δ(r − c − m), r, c = 0, ..., L. In other words, the matrices I_{L+1,m} have ones only on one diagonal, shifted by m positions from the main diagonal, and zeros elsewhere. If m > 0 this diagonal is located m positions below the main diagonal, and if m < 0 it is located |m| positions above the main diagonal. Obviously, I_{L+1,0} is the (L + 1) × (L + 1) identity matrix, and I_{L+1,m} = 0_{L+1} if |m| > L. The calculations of the second-order term (30) and the fourth-order term (31) are contained in Appendices A.1 and A.2, respectively. Finally, the term of the modified VCMA cost function which corresponds to the first data stream may be expressed as:

  J^{VCMA}(a_1) = (N^2 + N) \|a_1\|^4 + 2 \sum_{m=1}^{L} (N - m) |a_1 D_{2(L+1),m} a_1^H|^2 - 2 R_2 N \|a_1\|^2 + R_2^2.    (33)
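The fourth-moment expression (31), and hence (33), can be checked by Monte Carlo simulation of z_1(t) = (α ∗ u_1)(t) + (β ∗ u_2)(t) with unit-variance circular Gaussian inputs. The GCEIR, block length and sample sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
L, N, B = 2, 8, 40000                 # GCEIR order, block length, number of blocks
a = rng.standard_normal(2 * (L + 1)) + 1j * rng.standard_normal(2 * (L + 1))
alpha, beta = a[:L + 1], a[L + 1:]    # a_1 = [alpha | beta], cf. Eq. (25)

# simulate the equalized output with unit-variance circular Gaussian inputs
T = B * N + L
u1 = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
u2 = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
z = np.convolve(u1, alpha)[L:T] + np.convolve(u2, beta)[L:T]

en = np.sum(np.abs(z.reshape(B, N)) ** 2, axis=1)   # block energies ||z_1[n]||^2
emp2, emp4 = en.mean(), (en ** 2).mean()

norm_a2 = np.sum(np.abs(a) ** 2)
# a_1 D_{2(L+1),m} a_1^H is the lag-m autocorrelation of the GCEIR coefficients
r = [alpha[m:] @ alpha[:L + 1 - m].conj() + beta[m:] @ beta[:L + 1 - m].conj()
     for m in range(1, L + 1)]
th2 = N * norm_a2                                                # Eq. (30)
th4 = (N ** 2 + N) * norm_a2 ** 2 + 2 * sum(
    (N - m) * abs(r[m - 1]) ** 2 for m in range(1, L + 1))       # Eq. (31)

assert np.isclose(emp2, th2, rtol=0.02)
assert np.isclose(emp4, th4, rtol=0.05)
```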

For the two equalized data streams, the VCMA cost function is

  J^{VCMA}(a_1, a_2) = J^{VCMA}(a_1) + J^{VCMA}(a_2),    (34)

which may be expressed as

  J^{VCMA}(a_1, a_2) = (N^2 + N)(\|a_1\|^4 + \|a_2\|^4) + 2 \sum_{m=1}^{L} (N - m)\big( |a_1 D_{2(L+1),m} a_1^H|^2 + |a_2 D_{2(L+1),m} a_2^H|^2 \big) - 2 R_2 N (\|a_1\|^2 + \|a_2\|^2) + 2 R_2^2.    (35)

4.4 Decorrelation cost function

This criterion penalizes the cross-correlation between an equalized output z_k[n] and any other output z_l[n], as well as its delayed replicas z_l[n − τ] = [z_l(nN − τ), ..., z_l(nN − τ − N + 1)]^T. In our simple two-transmit antenna system considered for analysis, the cost function to be minimized is

  J^{xcorr}(a_1, a_2) = \sum_{\tau=-t}^{t} \|R_{12}(\tau)\|_F^2.    (36)

The co-channel interference is measured by the squared Frobenius norm of the correlation matrix, \|R_{12}(τ)\|_F^2 = trace{R_{12}(τ) R_{12}(τ)^H}. This squared norm may be expressed as:

  \|R_{12}(\tau)\|_F^2 = \sum_{m=-L}^{L} |a_1 D_{2(L+1),m} a_2^H|^2 (N - |\tau - m|).    (37)

The derivation is given in Appendix A.3. Substituting the expression (37) in (36) and changing the summation order, the decorrelation cost function may be written as:

  J^{xcorr}(a_1, a_2) = \sum_{m=-L}^{L} |a_1 D_{2(L+1),m} a_2^H|^2 \times \sum_{\tau=-t}^{t} (N - |\tau - m|).    (38)

The maximum delay of one recovered data stream z_1(n) compared to its replica which may be contained as interference in the other data stream z_2(n) is L. This delay L translates at the other output into a delay −L. Therefore, the argument τ of the cross-correlation matrix spans the interval −L, ..., L, i.e., t = L, which corresponds to all the possible delays between an equalized output and the corresponding interference at the other output. Considering that \sum_{\tau=-L}^{L} (N - |\tau - m|) = (2L + 1)N - L(L + 1) - m^2, the decorrelation cost function may now be expressed as:

  J^{xcorr}(a_1, a_2) = \sum_{m=-L}^{L} |a_1 D_{2(L+1),m} a_2^H|^2 \, [(2L + 1)N - L(L + 1) - m^2].    (39)

4.5 The composite cost function

In the two-transmit antenna case, the composite cost function may be written in terms of the global channel-equalizer impulse responses as

  J(a_1, a_2) = \lambda J^{VCMA}(a_1, a_2) + (1 - \lambda) J^{xcorr}(a_1, a_2).    (40)

From (35) and (39) we get:

  J(a_1, a_2) = \lambda \Big[ c_4 (\|a_1\|^4 + \|a_2\|^4) + 2 \sum_{m=1}^{L} k_m \big( |a_1 D_m a_1^H|^2 + |a_2 D_m a_2^H|^2 \big) - c_2 (\|a_1\|^2 + \|a_2\|^2) + c_0 \Big] + (1 - \lambda) \sum_{m=-L}^{L} v_m |a_1 D_m a_2^H|^2,    (41)

where

  c_4 = N^2 + N,
  k_m = N - m,  m = 1, ..., L,
  c_2 = 2 R_2 N,
  c_0 = 2 R_2^2,
  v_m = (2L + 1)N - L(L + 1) - m^2,  m = -L, ..., L,    (42)

and for brevity D_m ≜ D_{2(L+1),m}.

Remark 1 We may notice that the composite cost function (41) is comprised of a CMA cost function [7] and two additional terms:

1. The CMA part consists of the sum of the quantities containing the coefficients c_4, c_2, c_0. This part is homogeneous in the coefficients of the global channel-equalizer impulse response, i.e., it is just a norm constraint of the second and fourth order, respectively. In other words, it reduces to a power constraint on the global channel-equalizer impulse response. Consequently, the CMA criterion alone is not able to equalize Gaussian signals [15]; the stationary points are any vectors which have a certain norm.

2. The first additional term is the autocorrelation of the global channel-equalizer impulse response coefficients corresponding to each data stream, and it consists of the first sum, with the coefficients k_m. This autocorrelation term breaks the homogeneity of the CMA cost function. Therefore, the cost function no longer accepts arbitrary stationary points with a certain norm value. This term penalizes the coefficient autocorrelation and helps the ISI cancellation by whitening the input signal. The CMA term together with the autocorrelation term form the VCMA cost function [18].

3. The second additional term is the cross-correlation between the global channel-equalizer impulse responses corresponding to the two transmitted data streams, and it consists of the last sum, with the summation coefficients v_m. Therefore, this term penalizes the co-channel interference. This function is analyzed jointly with respect to a_1 and a_2.

5 Local convergence analysis

In this section, we study the local convergence properties of the proposed algorithm. To make it easier for the reader to follow, we analyze only the two-transmit antenna case. In the K-transmit antenna case (K > 2), the expression of the composite cost function (41) contains more terms, and larger matrices are involved in the analysis. The derivation is then very tedious and lengthy, hence it is not included in this paper. The results we derive, however, apply to the general case. The VCMA part of the composite cost function decouples into independent penalties for each data stream, as may be seen in Eq. (35). Therefore, in the K-transmit antenna case (K > 2) we deal with more GCEIRs a_k, corresponding to all transmitted data streams. Note that the same coefficients c_4, k_m, c_2, and c_0 in Eq. (42) will be involved in the VCMA part. Similarly, for the decorrelation part of the composite cost function the same coefficients v_m in Eq. (42) will be employed. This is due to the fact that the decorrelation criterion operates in a pairwise manner, and the two-transmit antenna case captures the general case. The dimension of the space where the cost function is analyzed increases, but the same characteristics of the error surface are maintained, because the above weighting coefficients remain unchanged. The bounds we derive for the two-transmit antenna case depend only on these coefficients; therefore, they remain valid when more transmit antennas are employed. In this analysis, we prove the local convergence of the proposed algorithm, under the condition² that the weighting parameter λ belongs to a certain range. We show that under this condition the proposed composite criterion admits truly stable minima corresponding to the zero-forcing equalizer performing both the blind equalization and co-channel signal cancellation.

These local minima always exist and they are stable for any input–output delays, unlike the classical VCMA criterion [18], which admits stable zero-forcing solutions only for extreme input–output delays [15]. For other input–output delays the VCMA behavior remains unknown. Unlike [15], we assume a more general scenario. In the analysis in [15] the authors consider real-valued quantities. Moreover, only the case when the global channel-equalizer impulse response length is equal to the VCMA block length is considered. We consider a composite cost function, and complex circularly symmetric signal sources. We also assume that the VCMA block length is considerably larger than the length of the global channel-equalizer impulse response. The sufficient condition for the existence of the zero-forcing minima of the composite criterion is that the weighting parameter λ is not close to zero or one. We also provide a closed-form expression for the optimal value of λ. This value always exists and it depends only on basic parameters of the MIMO-OFDM system. This optimization problem is non-trivial, since the function to be minimized is a complex multivariate function of fourth order in its arguments. In addition, it is non-convex, and the dual optimization approach does not provide an optimal solution in this case, since there is a so-called duality gap. Our derivation provides an optimal solution to the minimization problem. We need to compute the gradient of the composite cost function and the corresponding stationary points. We are able to compute several stationary points, and the nature of each of them is evaluated by calculating the corresponding Hessian eigenvalues. Depending on the value of the weighting parameter λ, several cases may be noticed: only one signal is recovered, both signals are recovered, or none of them. We show that an optimum weighting parameter guarantees the existence of the zero-forcing equalizer which ensures both the equalization and co-channel signal cancellation. Unfortunately, we were unable to prove the uniqueness of these minima due to the mathematical difficulties, but we bring arguments that minimize the possibility of existence of other minima. This aspect is detailed later, in Sect. 5.3. A range for the parameter µ is derived as well, in order to guarantee the local convergence and the stability.

² The identifiability conditions stated in Sect. 3.4 are also necessary.

5.1 Expression for gradient

The composite cost function (41) needs to be minimized w.r.t. the complex-valued row vector a = [a_1 a_2]; therefore it can be seen as J = J(a). Because we deal with a real function of several complex variables, this function is non-analytic w.r.t. a. It is differentiable w.r.t. the real and imaginary parts of the complex argument instead. We use Brandwood's approach in order to minimize this cost function. The function J is seen as J(a, a^*). For more details, see [4]. The gradient of this function is:

  G = \nabla_a J(a) = \left[ \frac{\partial J(a_1, a_2)}{\partial a_1^*} \;\; \frac{\partial J(a_1, a_2)}{\partial a_2^*} \right] \triangleq [G_1 \; G_2].    (43)

Differentiating the cost function (41) w.r.t. a_1^* and a_2^* we get, respectively:

  \frac{\partial J(a_1, a_2)}{\partial a_1^*} = \lambda \Big[ 2 c_4 \|a_1\|^2 a_1 + 2 \sum_{m=1}^{L} k_m \big( a_1 D_m^H a_1^H \, a_1 D_m + a_1 D_m a_1^H \, a_1 D_m^H \big) - c_2 a_1 \Big] + (1 - \lambda) \sum_{m=-L}^{L} v_m \, a_1 D_m a_2^H \, a_2 D_m^H,    (44)

  \frac{\partial J(a_1, a_2)}{\partial a_2^*} = \lambda \Big[ 2 c_4 \|a_2\|^2 a_2 + 2 \sum_{m=1}^{L} k_m \big( a_2 D_m^H a_2^H \, a_2 D_m + a_2 D_m a_2^H \, a_2 D_m^H \big) - c_2 a_2 \Big] + (1 - \lambda) \sum_{m=-L}^{L} v_m \, a_2 D_m^H a_1^H \, a_1 D_m.    (45)

5.2 Expression for Hessian

The Hessian matrix of J(a, a^*) is:

  H = \begin{bmatrix} \dfrac{\partial^2 J(a, a^*)}{\partial a^T \partial a^*} & \dfrac{\partial^2 J(a, a^*)}{\partial a^T \partial a} \\[2mm] \dfrac{\partial^2 J(a, a^*)}{\partial a^H \partial a^*} & \dfrac{\partial^2 J(a, a^*)}{\partial a^H \partial a} \end{bmatrix} \triangleq \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}.    (46)

The computation of the Hessian matrix is given in Appendix A.4.

5.3 Stationary points

Blind equalization in spatial multiplexing MIMO-OFDM systems based on vector CMA and decorrelation criteria

Finding all the stationary points of the cost function (41) requires solving the system of Eqs. (44)–(45), which is non-trivial. We found several stationary points, and for this reason we consider our analysis a proof of local convergence rather than global convergence, even though we show that other solutions are unlikely to exist and that, if they do exist, they are unlikely to be local minima. We evaluate the Hessian at these points and analyze whether the points correspond to maxima (negative definite Hessian), minima (positive definite Hessian), or saddle points (indefinite Hessian). The weighting parameter λ allows us to control the definiteness of the Hessian in order to avoid some potentially undesired minima. The different cases are summarized in Table 1; more details are given in Appendix A.5. In this table, we consider m, n ∈ {0, …, L} and θ1, θ2, θ ∈ [0, 2π). The coefficients c2, km, vn have been defined in Sect. 4.5. We distinguish five solution types in Table 1, together with the corresponding Hessian eigenvalues. The solution types are listed below.

Type 1 This stationary point is a maximum if λ ≠ 0. If the equalizer is initialized at this point, the coefficients remain equal to zero after every iteration, since the update is zero. This case never occurs in practice if we start from non-zero initial conditions and λ ≠ 0, because we deal with a maximum of the composite cost function.

Type 2 In this case, we deal with a saddle point or a potential minimum. For λ ∈ (0, 1) the Hessian is indefinite; therefore we have a saddle point. If λ = 0, the Hessian matrix is positive semi-definite, which may correspond to a local minimum or a saddle point. In the case of a local minimum, only one data stream would be recovered, which would be an undesired minimum. Again, this case never occurs in practice for non-zero initial settings and λ ≠ 0, because we deal with an unstable stationary point.

Type 3 This is the desired solution and corresponds to zero-forcing equalization. In this case both sources are recovered with unit power, without interference, and with arbitrary delays. An unknown phase rotation as well as a permutation ambiguity of the equalized streams may occur. The eigenvalues are balanced by λ and 1 − λ. Consequently, for λ = 0 or 1, the Hessian is positive semi-definite, which leads to slow convergence or may even prevent local convergence. A well-conditioned positive-definite Hessian is needed in this case in order to achieve fast local convergence.

Type 4 For any λ ∈ (0, 1) this stationary point is a saddle point. This means that the composite criterion is able to cancel the ISI for a circularly symmetric complex Gaussian source, i.e., the criterion does not admit a global channel-equalizer impulse response that is correlated with itself. For λ = 0 this point becomes a minimum, and in this case only the decorrelation criterion is employed; the criterion then decorrelates the independent sources from each other, but it does not whiten them.

Type 5 For any λ ∈ (0, 1) this stationary point is a saddle point. A weighting factor λ = 0 sets the Hessian to zero, which leaves the behavior undetermined. For λ = 1, we deal with a positive semi-definite Hessian, i.e., a potential local minimum or a saddle point. This may lead to ill convergence or to recovery of the same source at both outputs.

Consequently, the cases λ ∈ {0, 1} must be avoided. The desired solution of Type 3 needs to be achieved; the other solutions must be avoided by setting the weighting parameter λ accordingly.
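The Brandwood-style derivatives used in (43)–(45) can be sanity-checked numerically: for a real-valued cost of a complex vector, ∂J/∂a* equals (∂J/∂Re a + j ∂J/∂Im a)/2. A minimal sketch on a toy quartic cost (the function `J` below is an illustrative stand-in, not the criterion (41)):

```python
import numpy as np

lam = 0.6

def J(a):
    # Toy real-valued quartic cost of a complex 2-vector.
    return lam * (np.vdot(a, a).real - 1.0) ** 2 \
         + (1 - lam) * abs(a[0] * np.conj(a[1])) ** 2

def grad_conj(a):
    # Analytic Brandwood gradient dJ/da* (a and a* treated as independent).
    g = 2 * lam * (np.vdot(a, a).real - 1.0) * a.astype(complex)
    g[0] += (1 - lam) * abs(a[1]) ** 2 * a[0]
    g[1] += (1 - lam) * abs(a[0]) ** 2 * a[1]
    return g

def grad_numeric(a, h=1e-6):
    # dJ/da_k* = 0.5 * (dJ/dRe(a_k) + 1j * dJ/dIm(a_k)), central differences.
    g = np.zeros(a.size, dtype=complex)
    for k in range(a.size):
        e = np.zeros(a.size, dtype=complex)
        e[k] = 1.0
        d_re = (J(a + h * e) - J(a - h * e)) / (2 * h)
        d_im = (J(a + 1j * h * e) - J(a - 1j * h * e)) / (2 * h)
        g[k] = 0.5 * (d_re + 1j * d_im)
    return g

a = np.array([0.8 + 0.3j, -0.2 + 0.5j])
print(np.allclose(grad_conj(a), grad_numeric(a), atol=1e-5))  # True
```

The same finite-difference check applies term by term to (44)–(45) once the matrices D_m and the coefficients c2, c4, k_m, v_m are fixed.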

5.4 On the existence of other local minima

In this section, we investigate the existence of other potential local minima. First, we investigate the existence of stationary points other than the ones in Table 1, and then the possibility that such points correspond to stable minima. We do not provide a complete proof of whether they exist or not, because such a proof is not straightforward, but we give arguments that make the existence of such local minima unlikely. We start from the desired solution of Type 3, which comprises two sequences that are uncorrelated with themselves and with each other. We prove that by adding additional pulses the resulting sequences are no longer minima; in fact, they are not even stationary points, i.e., solutions of (44)–(45). It follows that only sequences containing at most two non-zero pulses are admitted as solutions.

Lemma 1 There are no non-zero pulses ν1, ν2, ν3 such that the sequences a1 = [0, …, ν1, …, 0] and a2 = [0, …, ν2, …, ν3, …, 0] can be solutions of the system of Eqs. (44)–(45), regardless of their positions.

Lemma 2 There are no non-zero pulses ν1, ν2, ν3, ν4 such that the sequences a1 = [0, …, ν1, …, ν2, …, 0] and a2 = [0, …, ν3, …, ν4, …, 0] can be solutions of the system of Eqs. (44)–(45), regardless of their positions.

The proofs of Lemmas 1 and 2 are provided in Appendix A.6.


Table 1 The stationary points of the composite cost function are denoted as follows: Max (maxima), Min (minima), Sad (saddle points), U (unknown). The Hessian definiteness is represented as follows: PD (positive definite), PSD (positive semi-definite), ND (negative definite), ID (indefinite), respectively Null for an identically zero Hessian.

Type 1: a1 = 0, a2 = 0
Hessian eigenvalues: γi = −c2 λ
Definiteness: λ = 0: U/Null; λ ∈ (0, 1): Max/ND; λ = 1: Max/ND

Type 2: a1 = [0, 0, …, 0, e^{jθ}, 0, …, 0], a2 = 0 (or vice versa)
Hessian eigenvalues: γi ∈ {0, −c2 λ, 2 km λ, −c2 λ + vn (1 − λ)}
Definiteness: λ = 0: U/PSD; λ ∈ (0, 1): Sad/ID; λ = 1: Sad/ID

Type 3: a1 = [0, …, e^{jθ1}, …, 0 | 0, …, 0], a2 = [0, …, 0 | 0, …, e^{jθ2}, …, 0] (or vice versa)
Hessian eigenvalues: γi ∈ {2 λ km, vn (1 − λ)}
Definiteness: λ = 0: U/PSD; λ ∈ (0, 1): Min/PD; λ = 1: U/PSD

Type 4: a1 = [0, …, |ν1| e^{jθ1}, …, 0 | 0, …, √(1 − |ν1|²) e^{jθ2}, …, 0], a2 = [0, …, 0 | 0, …, 0] (or vice versa)
Hessian eigenvalues: γi ∈ {0, −c2 λ, 2 λ km, 2 λ km |ν1|², 2 λ km |ν2|², −c2 λ + vn (1 − λ), −c2 λ + vn (1 − λ)|ν1|², −c2 λ + vn (1 − λ)|ν2|²}
Definiteness: λ = 0: U/PSD; λ ∈ (0, 1): Sad/ID; λ = 1: Sad/ID

Type 5: a1 = [0, …, |ν1| e^{jθ1}, …, 0 | 0, …, 0], a2 = [0, …, |ν2| e^{jθ2}, …, 0 | 0, …, 0] (or vice versa)
Hessian eigenvalues: strictly negative or negative
Definiteness: λ = 0: U/Null; λ ∈ (0, 1): Sad/ID; λ = 1: U/PSD
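The definiteness patterns in Table 1 can be reproduced by inspecting the signs of the eigenvalue sets. The sketch below does this for the Type 3 point; the values of c2, km, vn are hypothetical placeholders standing in for the moment-based coefficients of Sect. 4.5:

```python
import numpy as np

# Hypothetical placeholder values for the coefficients of Sect. 4.5.
c2, km, vn = 1.0, 0.45, 0.9

def classify(eigs, tol=1e-12):
    """Classify Hessian definiteness from its eigenvalues."""
    eigs = np.asarray(eigs, dtype=float)
    if np.all(eigs > tol):
        return "PD"
    if np.all(eigs < -tol):
        return "ND"
    if np.all(eigs >= -tol):
        return "PSD"
    if np.all(eigs <= tol):
        return "NSD"
    return "ID"

def type3_eigs(lam):
    # Type 3 eigenvalue families from Table 1: {2*lam*k_m, v_n*(1-lam)}
    return [2 * lam * km, vn * (1 - lam)]

for lam in (0.0, 0.5, 1.0):
    print(lam, classify(type3_eigs(lam)))
# 0.0 -> PSD (potential minimum or saddle), 0.5 -> PD (minimum), 1.0 -> PSD
```

This matches the Type 3 row of Table 1: the zero-forcing point is a strict minimum only for λ strictly between 0 and 1.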

Remark 2 Lemma 1 follows from Lemma 2, which is a generalization of it. We were not able to extend the generalization further, because more cross-terms become involved, leading to a non-trivial system of equations. Therefore, the behavior at other potential stationary points has not been investigated, because we were not able to determine such points. If such points exist, they are sequences containing more non-zero pulses than the solutions considered in Table 1 and in Lemmas 1 and 2, respectively. Consequently, more cross-terms are involved in the computation of the Hessian blocks, which means a higher magnitude of the elements of the off-diagonal Hessian blocks. As a result, the Hessian tends to depart from positive definiteness, because it loses diagonal dominance. It follows that if other stationary points exist, they are very likely to be saddle points rather than local minima. Even if such local minima exist, a proper initialization close to the desired minima may be achieved by using pilot symbols, i.e., a semi-blind version of the algorithm. In this case the algorithm would operate in a tracking mode.

5.5 Selection of weighting parameter λ

The value λ = 0 is undesired in most of the cases, i.e., for solutions of Types 1, 2, 4, and 5; see Table 1. These stationary points may become attractive (positive semi-definite Hessian), and the algorithm may not converge to the desired solution of Type 3. In practice, a very small value of λ causes the same behavior as λ = 0, especially in noisy conditions. This is due to the poorly conditioned Hessian matrix, i.e., the negative eigenvalues have much smaller magnitude than the positive ones. Consequently, these saddle points may become attractive.


The value λ = 1 is undesired as well, as shown for the solution of Type 5. The same problem of undesired minima may appear if λ is very close to one. In this case the Hessian becomes close to positive semi-definite, having large positive eigenvalues and very small negative eigenvalues. Consequently, the solution of Type 5 may become an attractive stationary point, and in the best case only one source is recovered at both outputs (or neither of them). The decorrelation criterion is disabled, and the outputs will be highly correlated with each other. This is very likely when the equalizer starts with the same initial conditions for all K equalizers and λ is close to one.

An optimum value of the weighting parameter λ may be derived from the desired solution of Type 3. Strong local minima need to be achieved in this case; therefore, large positive eigenvalues are required. There are two types of eigenvalues, weighted by λ and 1 − λ; see the solution of Type 3 in Table 1. For λ close to zero or close to one, we get a poorly conditioned Hessian. The shape of the composite cost function will then resemble "a long narrow valley": most of the effort is spent going up and down the steep valley walls, and very little in moving toward the minimum along the less steep direction. The Hessian eigenvalues must therefore be as close as possible in magnitude. This ensures faster local convergence by imposing similar curvature of the composite cost function around the desired local minima in all dimensions. Because L ≪ N, the coefficients km are relatively close to each other for any m ∈ {1, …, L}. The same holds for the coefficients vn, n ∈ {0, …, L} (42). Therefore, a condition which ensures that the eigenvalues are relatively close to each other is

$$2 \lambda k_m = v_n (1 - \lambda). \quad (47)$$

By setting m = n = L, we get the optimum weighting parameter

$$\lambda_{\text{opt}} = \frac{v_L}{v_L + 2 k_L}. \quad (48)$$

To conclude, a very small value of the weighting parameter λ is more undesirable than a value close to one, because it may happen that none of the source signals is recovered. The derived value ensures a strong minimum of Type 3, and it also helps avoid some undesired solutions: with this value, Type 1 remains a maximum, and Types 2, 4, and 5 remain well-defined saddle points.
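With hypothetical values for k_L and v_L (the true values follow from the MIMO-OFDM system parameters), (48) can be evaluated directly, and one can verify that it balances the two eigenvalue families of the Type 3 minimum, i.e., that condition (47) holds with m = n = L:

```python
# Hypothetical coefficient values; in the paper these follow from
# the MIMO-OFDM system parameters.
kL, vL = 0.45, 0.9

lam_opt = vL / (vL + 2 * kL)          # Eq. (48)

# At lam_opt the two curvature families of the Type 3 minimum coincide:
left = 2 * lam_opt * kL               # eigenvalues weighted by lambda
right = vL * (1 - lam_opt)            # eigenvalues weighted by 1 - lambda
print(lam_opt, left, right)
```

Equal curvatures in all dimensions avoid the "long narrow valley" shape and yield the fastest local convergence around the zero-forcing minimum.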


5.6 Selection of convergence step size µ

To ensure convergence of the algorithm, the step size µ has to be chosen according to the curvature of the error surface around the minima, i.e., the equilibrium points. This curvature is characterized by the Hessian eigenvalues at these minima (see the solution of Type 3 in Table 1). We may notice that the eigenvalues depend on the weighting parameter λ; therefore, the convergence step size will depend on λ as well. The parameter µ should obey the following condition:

$$0 < \mu < \frac{2}{\gamma_i}, \quad i = 1, \ldots, 8(L+1). \quad (49)$$
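The bound (49) behaves like the classical gradient-descent stability condition. A sketch on a quadratic surface whose curvatures play the role of the Hessian eigenvalues γi (the numerical values are illustrative, not taken from the paper):

```python
import numpy as np

gammas = np.array([0.2, 1.0, 4.0])   # illustrative Hessian eigenvalues

def descend(mu, steps=200):
    # Gradient descent on J(x) = 0.5 * sum(gammas * x**2); each
    # coordinate is scaled by (1 - mu * gamma_i) per iteration.
    x = np.ones_like(gammas)
    for _ in range(steps):
        x = x - mu * gammas * x
    return float(np.linalg.norm(x))

print(descend(0.4))   # mu < 2/max(gammas) = 0.5: converges to 0
print(descend(0.6))   # mu > 0.5: |1 - mu*gamma_max| > 1, diverges
```

The largest eigenvalue sets the binding constraint, which is why the step size must shrink as λ pushes the Type 3 curvatures apart.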
