Biometrics: Theory, Methods, and Applications

Edited by
Nikolaos V. Boulgouris
Konstantinos N. Plataniotis
Evangelia Micheli-Tzanakou

IEEE Computational Intelligence Society, Sponsor
IEEE Press Series on Computational Intelligence
David B. Fogel, Series Editor
Biometrics
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Lajos Hanzo, Editor in Chief

R. Abari, J. Anderson, S. Basu, A. Chatterjee, T. Chen, T. G. Croda, M. El-Hawary, S. Farshchi, B. M. Hammerli, O. Malik, S. Nahavandi, W. Reeve
Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewer: Qinghan Xiao, Defence Research and Development Canada
IEEE-CIS Liaison to IEEE Press: Gary B. Fogel

Books in the IEEE Press Series on Computational Intelligence

Evolving Intelligent Systems: Methodology and Applications
Edited by Plamen Angelov, Dimitar Filev, and Nik Kasabov
2010, 978-0470-28719-4

Biometrics: Theory, Methods, and Applications
Edited by Nikolaos V. Boulgouris, Konstantinos N. Plataniotis, and Evangelia Micheli-Tzanakou
2010, 978-0470-24782-2

Clustering
Rui Xu and Donald Wunsch II
2009, 978-0470-27680-0

Computational Intelligence in Bioinformatics
Edited by David B. Fogel, David W. Corne, and Yi Pan
2008, 978-0470-10526-9

Introduction to Evolvable Hardware: A Practical Guide for Designing Self-Adaptive Systems
Garrison W. Greenwood and Andrew M. Tyrrell
2007, 978-0471-71977-9

Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, Third Edition
David B. Fogel
2006, 978-0471-66951-7

Emergent Information Technologies and Enabling Policies for Counter-Terrorism
Edited by Robert L. Popp and John Yen
2006, 978-0471-77615-4

Computationally Intelligent Hybrid Systems
Edited by Seppo J. Ovaska
2005, 0-471-47668-4

Handbook of Learning and Approximate Dynamic Programming
Edited by Jennie Si, Andrew G. Barto, Warren B. Powell, and Donald Wunsch II
2004, 0-471-66054-X

Computational Intelligence: The Experts Speak
Edited by David B. Fogel and Charles J. Robinson
2003, 0-471-27454-2
Copyright © 2010 by the Institute of Electrical and Electronics Engineers, Inc. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher, the editors, and the authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor the editors/authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is available.

ISBN 978-0470-24782-2

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
Contents

Preface

Contributors

1. Discriminant Analysis for Dimensionality Reduction: An Overview of Recent Developments
Jieping Ye and Shuiwang Ji

2. A Taxonomy of Emerging Multilinear Discriminant Analysis Solutions for Biometric Signal Recognition
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

3. A Comparative Survey on Biometric Identity Authentication Techniques Based on Neural Networks
Raghudeep Kannavara and Nikolaos Bourbakis

4. Designing Classifiers for Fusion-Based Biometric Verification
Krithika Venkataramani and B. V. K. Vijaya Kumar

5. Person-Specific Characteristic Feature Selection for Face Recognition
Sreekar Krishna, Vineeth Balasubramanian, John Black, and Sethuraman Panchanathan

6. Face Verification Based on Elastic Graph Matching
Anastasios Tefas and Ioannis Pitas

7. Combining Geometrical and Statistical Models for Video-Based Face Recognition
Amit K. Roy-Chowdhury and Yilei Xu

8. A Biologically Inspired Model for the Simultaneous Recognition of Identity and Expression
Donald Neth and Aleix M. Martinez

9. Multimodal Biometrics Based on Near-Infrared Face Recognition
Rui Wang, Shengcai Liao, Zhen Lei, and Stan Z. Li

10. A Novel Unobtrusive Face and Hand Geometry Authentication System Based on 2D and 3D Data
Filareti Tsalakanidou and Sotiris Malassiotis

11. Learning Facial Aging Models: A Face Recognition Perspective
Narayanan Ramanathan and Rama Chellappa

12. Super-Resolution of Face Images
Sung Won Park and Marios Savvides

13. Iris Recognition
Yung-Hui Li and Marios Savvides

14. Learning in Fingerprints
Alessandra Lumini, Loris Nanni, and Davide Maltoni

15. A Comparison of Classification- and Indexing-Based Approaches for Fingerprint Identification
Xuejun Tan, Bir Bhanu, and Rong Wang

16. Electrocardiogram (ECG) Biometric for Robust Identification and Secure Communication
Francis Minhthang Bui, Foteini Agrafioti, and Dimitrios Hatzinakos

17. The Heartbeat: The Living Biometric
Steven A. Israel, John M. Irvine, Brenda K. Wiederhold, and Mark D. Wiederhold

18. Multimodal Physiological Biometrics Authentication
Alejandro Riera, Aureli Soria-Frisch, Marco Caparrini, Ivan Cester, and Giulio Ruffini

19. A Multiresolution Analysis of the Effect of Face Familiarity on Human Event-Related Potentials
Brett DeMarco and Evangelia Micheli-Tzanakou

20. On-Line Signature-Based Authentication: Template Security Issues and Countermeasures
Patrizio Campisi, Emanuele Maiorana, and Alessandro Neri

21. Unobtrusive Biometric Identification Based on Gait
Xiaxi Huang and Nikolaos V. Boulgouris

22. Distributed Source Coding for Biometrics: A Case Study on Gait Recognition
Savvas Argyropoulos, Dimosthenis Ioannidis, Dimitrios Tzovaras, and Michael G. Strintzis

23. Measuring Information Content in Biometric Features
Richard Youmaran and Andy Adler

24. Decision-Making Support in Biometric-Based Physical Access Control Systems: Design Concept, Architecture, and Applications
Svetlana N. Yanushkevich, Vlad P. Shmerko, Oleg Boulanov, and Adrian Stoica

25. Privacy in Biometrics
Stelvio Cimato, Marco Gamassi, Vincenzo Piuri, Roberto Sassi, and Fabio Scotti

26. Biometric Encryption: The New Breed of Untraceable Biometrics
Ann Cavoukian and Alex Stoianov

Index
Preface
The objective of biometric systems is the recognition or authentication of individuals based on physical or behavioral characteristics that are intrinsically unique to each individual. Nowadays, biometric systems are fundamental components of advanced security architectures. The applications of biometrics range from access control, military, and surveillance to banking and multimedia copyright protection. Recently, biometric information has started to become an essential element in government-issued authentication and travel documents. The large-scale deployment of biometric sensors in a variety of electronic devices, such as mobile phones, laptops, and personal digital assistants (PDAs), has further accelerated the pace at which the demand for biometric technologies has been growing. The immense interest in the theory, technology, applications, and social implications of biometric systems has created an imperative need for the systematic study of the use of biometrics in security and surveillance infrastructures.

This edited volume provides an extensive survey of biometrics theory, methods, and applications, making it a good source of information for researchers, security experts, policy makers, engineers, and graduate students. The volume consists of 26 chapters, which cover most aspects of biometric systems. The first few chapters address particular recognition techniques that can be used in conjunction with a variety of biometric traits. The following chapters present technologies tailored to specific biometric traits, such as face, hand geometry, fingerprints, signature, electrocardiogram, electroencephalogram, and gait. The remaining chapters focus on theoretical issues as well as issues related to the emerging area of privacy-enhancing biometric solutions.

An overview of recent developments in discriminant analysis for dimensionality reduction is presented in the first chapter. Specifically, a unified framework is presented for generalized linear discriminant analysis (LDA) via a transfer function. It is shown that various LDA-based algorithms differ in their transfer functions. This framework explains the properties of the various algorithms and their relationships, and the theoretical properties of the algorithms are also presented. An emerging extension of classical LDA is multilinear discriminant analysis (MLDA) for biometric signal recognition. Biometric signals are mostly multidimensional objects, known as tensors, and recently there has been a growing interest in MLDA solutions. In Chapter 2, the fundamentals of existing MLDA solutions are presented and then categorized according to the multilinear projection employed. At the same time, their connections with traditional linear solutions are pointed out.

The next two chapters present classification issues in biometric identification. The problem of classification is extremely important because it
essentially sets the framework for the way decisions are made once feature extraction and dimensionality reduction have taken place. A variety of classification approaches can be taken. One of these approaches is to use neural networks (NN). Chapter 3 is a comparative survey of biometric identity authentication techniques based on neural networks. This chapter presents a survey of representative NN-based methodologies for biometric identification. In particular, it captures the evolution of some of the representative NN-based methods in order to provide an outline of the application of neural networks in biometric systems. A specific, but far from uncommon, case of classification is that involving fusion of biometrics. The main task here is the design of classifiers for fusion-based biometric verification, which is addressed in Chapter 4. The chapter provides guidelines for optimal ensemble generation, where each classifier in the ensemble is a base classifier. Examples are shown for support vector machines and correlation filters. The chapter also focuses on decision fusion rules, and the effect of classifier output diversity on decision fusion accuracy is analyzed.

Chapters 5-20 present systems based on specific biometric modalities. Methods for face recognition/verification are presented in Chapters 5-8. One of the most important problems in face recognition is feature selection. Chapter 5 presents person-specific characteristic feature selection for face recognition. In this chapter, a new methodology for face recognition is introduced that detects and extracts unique features on a person's face and then uses those features for the purpose of recognition. Chapter 6 presents a different approach by performing face verification based on elastic graph matching. Using elastic graph matching, a face is represented as a connected graph. This approach endows the recognition process with robustness against geometric distortions of the facial image. Another challenging task in the area of face-based biometric systems is the efficient use of video sequences for face authentication. Chapter 7 presents a method for the combination of geometrical and statistical models for video-based face authentication. In this chapter, it is shown that it is possible to describe object appearance using a combination of analytically derived geometrical models and statistical data analysis. Specifically, a framework that is robust to large changes in facial pose and lighting conditions is presented for face recognition from video sequences. The method can handle situations where the pose and lighting conditions in the training and testing data are completely disjoint. Chapter 8 is about a biologically inspired model for the simultaneous recognition of identity and expression. This work builds upon the fact that faces can provide a wide range of information about a person's identity, race, sex, age, and emotional state. In most cases, humans easily derive such information by processes that appear rapid and automatic. However, upon closer inspection, one finds these processes to be diverse and complex. This chapter examines the perception of identity and emotion. Next, it develops a computational model that is applied to identification based on face images with differing expression as well as to the classification of expressions.

Chapters 9-11 present some more advanced methods for face recognition. The first two of these chapters go beyond the conventional approach and are based on the realization that face recognition does not have to rely on an image taken using a
conventional camera. Face recognition using infrared cameras is a very interesting extension of conventional face recognition. In Chapter 9, a near-infrared (NIR) face-based approach is presented for multimodal biometric fusion. The NIR face is fused with the visible light (VL) face or iris modality. This approach has several advantages, including the fact that NIR face recognition overcomes problems arising from uncontrolled illumination in VL images and achieves significantly better results than when VL faces are used. Furthermore, the fusion of NIR face with VL face or iris is a natural combination for multibiometric solutions. A different, multimodal system based on the fusion of 2D and 3D face and hand geometry data is presented in Chapter 10. This topic is of particular interest because recent advances in multimodal biometrics, as well as the emergence of affordable 3D imaging technologies, have created great potential for techniques that involve 3D data. The main advantage is the simultaneous acquisition of a pair of depth and color images of biometric information using low-cost sensors. Although the above face-based methodologies offer improved performance, they do not directly improve the resilience of visual biometric systems to the change that these biometrics undergo over time. Aging is a crucial factor for recognition applications and, in the case of face recognition, can be dealt with by learning facial aging models. Such facial models, studied in Chapter 11, can be used for the prediction of one's appearance across ages and, therefore, are of great importance for performing reliable face recognition across age progression. Chapter 12 is about super-resolution techniques, which can be used in conjunction with face recognition technologies.

The next three chapters are devoted to iris and fingerprint recognition. The technologies that are used in iris recognition systems are presented in Chapter 13. Iris recognition is an extremely reliable technique for the identification of individuals, and this chapter reviews both its theoretical and practical aspects. Fingerprint recognition is another very important technology that has been reliably used in biometric systems for many years. Chapter 14, entitled Learning in Fingerprints, gives a short introduction to the basic concepts and terminology. Furthermore, it provides a detailed review of the existing literature by discussing the most salient learning-based approaches applied to feature extraction, matching, and classification of fingerprints. Chapter 15 presents a comparison of two key approaches for fingerprint identification: classification followed by verification, and indexing followed by verification. The fingerprint classification approach is based on a feature-learning algorithm, while the indexing approach is based on features derived from triplets of minutiae.

Chapters 16 and 17 present methods using electrocardiograms (ECG). The ECG is essentially a medical diagnostic technique, but more recently it has fulfilled a rather unlikely role as a provider of security and privacy in the form of a biometric. Chapter 16, entitled Electrocardiogram (ECG) Biometric for Robust Identification and Secure Communication, examines the various implications and technical challenges of using the ECG as a biometric. Specifically, novel signal processing techniques are surveyed and proposed that seek not only to establish the status of the ECG as an indisputable biometric trait, but also to reinforce its versatile utility, such as
in alleviating resource consumption in certain communication networks. Chapter 17 discusses the heartbeat as a living biometric. Although previous research on the topic focused mainly on analysis of the electrocardiogram, this chapter extends the ECG results by applying processing methods to a larger and more diverse set of individuals, demonstrating that performance remains high for a larger and more diverse population. Alternative sensing methods, using blood pressure and pulse oximetry, are presented and their corresponding performance is documented. The chapter also discusses the phenomenology and sensing modalities for monitoring cardiovascular function and, finally, examines the fusion of heartbeat information across the three modalities and quantifies its performance.

Chapters 18 and 19 explore methodologies mainly based on electroencephalograms (EEG). In Chapter 18, a method is proposed that uses physiological signals as key features in high-security biometric systems. The experimental protocol that is common to EEG and ECG recording is explained. EEG and ECG features, as well as the authentication algorithms, are presented and their efficiency is individually assessed. A fusion process carried out to achieve higher performance is also presented. Chapter 19 presents a multiresolution analysis of the effect of face familiarity on human event-related potentials (ERPs). This method works by processing electroencephalograms (EEGs) recorded in response to familiar and unfamiliar face stimuli. Stimuli were presented in successive trials and consisted of (a) multiple presentations of frontal, gray-scale images of one person known to the subject and (b) unique unknown images taken from multiple face databases. Coherent oscillations in phase were observed in the lower delta activity of ERPs in response to known stimuli but not in response to unknown stimuli.

Chapters 20 and 21 present methods and applications based on signature recognition and gait recognition. Although several approaches can be used in authentication systems, the most commonly used authentication method in everyday transactions is based on the signature. The specific points of concern regarding online signature-based authentication have to do mostly with template security issues and countermeasures. Chapter 20 focuses on the security issues related to biometric templates, with application to signature-based authentication systems. The main privacy and security issues are briefly summarized, and some approaches that are used for the protection of biometric templates are discussed. Data hiding techniques are used to design a security-scalable authentication system. The enrollment and authentication procedures are detailed. In contrast to the signature, which is an established method for authentication, gait recognition is an emerging technology that is particularly attractive for biometric identification because the capturing of gait can take place in an unobtrusive manner. Chapter 21 presents the fundamental approaches for unobtrusive biometric identification based on gait and provides directions for future research.

Chapters 22-26 deal with biometric applications, including issues related to the concept of biometric capacity. Chapter 22 presents a completely new framework for biometric authentication in secure environments, as well as a relevant application based on gait recognition. The proposed framework is based on distributed source coding for biometrics. In this new framework, the problem of biometric recognition
is formulated as the dual of data communication over noisy channels. In such a system, the enrollment and authentication procedures are considered as the encoding and decoding stages of a communication system. The above approach is highly relevant to information theory. A further application of information theory in biometrics can be found in the assessment of the information content of biometric traits. The discriminating ability of biometric features is usually estimated by means of experimentation. However, the information carried by biometrics, their uniqueness, and their fusion prospects can be studied based on concepts from information theory. This is the topic of Chapter 23, which deals with measuring information content in biometric features. Next, in Chapter 24, a summary is presented of the theoretical results and design experience obtained during the development of a next-generation physical access security system (PASS). The main feature of this PASS is its efficient decision-making support for security personnel, enhanced with the situational awareness paradigm and intelligent tools.

Despite the increasing use of biometric features for authentication and identification purposes in a broad variety of institutional and commercial systems, the adoption of biometric techniques is restrained by a rising concern regarding the protection of biometric templates. In fact, people are generally not keen to give out biometric traits unless they are assured that their biometrics cannot be stolen or used without their consent. Recent results showed that it is feasible to generate a unique identifier by combining biometric traits. This approach makes it impossible to recover the original biometric features and, thus, ensures the privacy of the biometrics. Chapter 25, entitled Privacy in Biometrics, reviews the privacy issues related to the use of biometrics, presents some of the most advanced techniques available to date, provides a comparative analysis, and gives an overview of future trends. A particular system that builds privacy into an information system is presented in the final chapter, entitled Biometric Encryption. In this chapter, the emerging area of privacy-enhancing biometric technologies, referred to as "untraceable biometrics," is shown to make it possible to enhance both privacy and security in a positive-sum model.

By its nature, an edited volume covers only a limited number of works and initiatives in the area of biometric systems. Researchers and practitioners are introducing new developments at a very fast pace, and it would be impossible to cover all of them in a single volume. However, we believe that the collection of chapters presented here covers sufficiently well the theory, methods, and applications of biometrics. Readers who wish to further explore the fascinating area of biometrics can find additional information using the bibliographic links provided in each chapter of this volume.

We thank all those who have helped to make this edited volume possible, especially the contributors, who spent much of their precious time and energy in preparing their chapters. We are really grateful for their enthusiasm and devotion to this project. We thank the contributors and other experts who served as reviewers. Special thanks go to Dr. Qinghan Xiao, the reviewer assigned by IEEE Press, for providing many useful suggestions for the improvement of the book. Our deep feelings of appreciation go to John Wiley & Sons for the impeccable processing of the authors' contributions and the final production of the book. Last, but certainly not least, we
would like to thank Jeanne Audino of IEEE Press for her professionalism and her continuous support and assistance during all stages of the preparation and publication of the manuscript.

Nikolaos V. Boulgouris, London, United Kingdom
Konstantinos N. Plataniotis, Toronto, Ontario, Canada
Evangelia Micheli-Tzanakou, New Brunswick, New Jersey

July 2009
Contributors
ANDY ADLER, Carleton University, Ottawa, Ontario, Canada
FOTEINI AGRAFIOTI, University of Toronto, Toronto, Ontario, Canada
SAVVAS ARGYROPOULOS, Informatics and Telematics Institute, Thessaloniki, Greece
VINEETH BALASUBRAMANIAN, Arizona State University, Tempe, Arizona
BIR BHANU, University of California, Riverside, California
JOHN BLACK, Arizona State University, Tempe, Arizona
OLEG BOULANOV, University of Calgary, Calgary, Alberta, Canada
NIKOLAOS V. BOULGOURIS, King's College, London, United Kingdom
NIKOLAOS BOURBAKIS, Wright State University, Dayton, Ohio
FRANCIS MINHTHANG BUI, University of Toronto, Toronto, Ontario, Canada
PATRIZIO CAMPISI, University of Rome, Rome, Italy
MARCO CAPARRINI, Starlab, Barcelona, Spain
ANN CAVOUKIAN, Office of Information and Privacy Commissioner Ontario, Toronto, Ontario, Canada
IVAN CESTER, Starlab, Barcelona, Spain
RAMA CHELLAPPA, University of Maryland, College Park, Maryland
STELVIO CIMATO, University of Milan, Milan, Italy
BRETT DEMARCO, Rutgers University, New Brunswick, New Jersey
MARCO GAMASSI, University of Milan, Milan, Italy
DIMITRIOS HATZINAKOS, University of Toronto, Toronto, Ontario, Canada
XIAXI HUANG, King's College, London, United Kingdom
DIMOSTHENIS IOANNIDIS, Informatics and Telematics Institute, Thessaloniki, Greece
JOHN M. IRVINE, Science Applications International Corporation (SAIC), Arlington, Virginia
STEVEN A. ISRAEL, Science Applications International Corporation (SAIC), Arlington, Virginia
SHUIWANG JI, Arizona State University, Tempe, Arizona
RAGHUDEEP KANNAVARA, Wright State University, Dayton, Ohio
SREEKAR KRISHNA, Arizona State University, Tempe, Arizona
B. V. K. VIJAYA KUMAR, Carnegie Mellon University, Pittsburgh, Pennsylvania
ZHEN LEI, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
STAN Z. LI, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
YUNG-HUI LI, Carnegie Mellon University, Pittsburgh, Pennsylvania
SHENGCAI LIAO, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
HAIPING LU, University of Toronto, Toronto, Ontario, Canada
ALESSANDRA LUMINI, Biometric System Laboratory, University of Bologna, Bologna, Italy
EMANUELE MAIORANA, University of Rome, Rome, Italy
SOTIRIS MALASSIOTIS, Informatics and Telematics Institute, Thessaloniki, Greece
DAVIDE MALTONI, Biometric System Laboratory, University of Bologna, Bologna, Italy
ALEIX M. MARTINEZ, Ohio State University, Columbus, Ohio
EVANGELIA MICHELI-TZANAKOU, Rutgers University, New Brunswick, New Jersey
LORIS NANNI, Biometric System Laboratory, University of Bologna, Bologna, Italy
ALESSANDRO NERI, University of Rome, Rome, Italy
DONALD NETH, Ohio State University, Columbus, Ohio
SETHURAMAN PANCHANATHAN, Arizona State University, Tempe, Arizona
SUNG WON PARK, Carnegie Mellon University, Pittsburgh, Pennsylvania
IOANNIS PITAS, University of Thessaloniki, Thessaloniki, Greece
VINCENZO PIURI, University of Milan, Milan, Italy
KONSTANTINOS N. PLATANIOTIS, University of Toronto, Toronto, Ontario, Canada
NARAYANAN RAMANATHAN, University of Maryland, College Park, Maryland
ALEJANDRO RIERA, Starlab, Barcelona, Spain
AMIT K. ROY-CHOWDHURY, University of California, Riverside, California
GIULIO RUFFINI, Starlab, Barcelona, Spain
ROBERTO SASSI, University of Milan, Milan, Italy
MARIOS SAVVIDES, Carnegie Mellon University, Pittsburgh, Pennsylvania
FABIO SCOTTI, University of Milan, Milan, Italy
VLAD P. SHMERKO, University of Calgary, Calgary, Alberta, Canada
AURELI SORIA-FRISCH, Starlab, Barcelona, Spain
ALEX STOIANOV, Office of Information and Privacy Commissioner Ontario, Toronto, Ontario, Canada
ADRIAN STOICA, Jet Propulsion Laboratory, NASA, USA
MICHAEL G. STRINTZIS, Informatics and Telematics Institute, Thessaloniki, Greece
XUEJUN TAN, University of California, Riverside, California
ANASTASIOS TEFAS, University of Thessaloniki, Thessaloniki, Greece
FILARETI TSALAKANIDOU, Informatics and Telematics Institute, Thessaloniki, Greece
DIMITRIOS TZOVARAS, Informatics and Telematics Institute, Thessaloniki, Greece
ANASTASIOS N. VENETSANOPOULOS, University of Toronto, Toronto, Ontario, Canada
KRITHIKA VENKATARAMANI, Carnegie Mellon University, Pittsburgh, Pennsylvania
RONG WANG, University of California, Riverside, California
RUI WANG, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
BRENDA K. WIEDERHOLD, Science Applications International Corporation (SAIC), Arlington, Virginia
MARK D. WIEDERHOLD, Science Applications International Corporation (SAIC), Arlington, Virginia
YILEI XU, University of California, Riverside, California
SVETLANA N. YANUSHKEVICH, University of Calgary, Calgary, Alberta, Canada
JIEPING YE, Arizona State University, Tempe, Arizona
RICHARD YOUMARAN, Carleton University, Ottawa, Ontario, Canada
Chapter 1

Discriminant Analysis for Dimensionality Reduction: An Overview of Recent Developments

Jieping Ye and Shuiwang Ji
1.1 INTRODUCTION
Many biometric applications such as face recognition involve data with a large number of features [1-3]. Analysis of such data is challenging due to the curse of dimensionality [4, 5], which states that an enormous number of samples is required to perform accurate prediction on problems with high dimensionality. Dimensionality reduction, which extracts a small number of features by removing irrelevant, redundant, and noisy information, can be an effective solution [6]. Commonly used dimensionality reduction methods include supervised approaches such as linear discriminant analysis (LDA) [7, 8], unsupervised ones such as principal component analysis (PCA) [9], and additional spectral and manifold learning methods [10-14]. When class label information is available, supervised approaches such as LDA are usually more effective than unsupervised ones such as PCA for classification.

Linear discriminant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction and classification [8, 15-18]. LDA computes an optimal transformation (projection) by minimizing the within-class distance and maximizing the between-class distance simultaneously, thus achieving maximum class discrimination. The optimal transformation in LDA can be readily computed by applying an eigendecomposition on the so-called scatter matrices. It has been used widely in many applications involving high-dimensional data [19-24]. However, classical LDA requires the so-called total scatter matrix to be nonsingular. In many applications involving high-dimensional and low sample size data, the total scatter matrix can be singular, since the data points come from a very high-dimensional space and in general the sample size does not exceed this dimension. This is the well-known singularity or undersampled problem encountered in LDA.

In recent years, many LDA extensions have been proposed to deal with the singularity problem, including PCA+LDA [19, 23], regularized LDA (RLDA) [21], null space LDA (NLDA) [20], the orthogonal centroid method (OCM) [25], uncorrelated LDA (ULDA) [24], orthogonal LDA (OLDA) [24], and LDA/GSVD [26]. A brief overview of these algorithms is given in Section 1.2. Different algorithms have been applied successfully in various domains, such as PCA+LDA in face recognition [19, 23], OCM in text categorization [25], and RLDA in microarray gene expression data analysis [21]. However, there has been a lack of a systematic study exploring the commonalities and differences of these algorithms, as well as their intrinsic relationships. This has been a challenging task, since different algorithms apply completely different schemes when dealing with the singularity problem.

Many of these LDA extensions involve an eigenvalue problem, which is computationally expensive to solve, especially when the sample size is large. LDA in the binary-class case, called Fisher LDA, has been shown to be equivalent to linear regression with the class label as output. Such a regression model minimizes the sum-of-squares error function, whose solution can be obtained efficiently by solving a system of linear equations. However, the equivalence relationship is limited to the binary-class case.

In this chapter, we present a unified framework for generalized LDA via a transfer function. We show that various LDA-based algorithms differ in their transfer functions. The unified framework elucidates the properties of various algorithms and their relationships. We then discuss recent developments on establishing the equivalence relationship between multivariate linear regression (MLR) and LDA in the multiclass case. In particular, we show that MLR with a particular class indicator matrix is equivalent to LDA under a mild condition, which has been shown to hold for most high-dimensional data. We further show how LDA can be performed in the semisupervised setting, where both labeled and unlabeled data are provided, based on the equivalence relationship between MLR and LDA. We also extend the discussion to the kernel-induced feature space and present recent developments on multiple kernel learning (MKL) for kernel discriminant analysis (KDA).

The rest of this chapter is organized as follows. We give an overview of classical LDA and its generalizations in Section 1.2. A unified framework for generalized LDA, as well as the theoretical properties of various algorithms and their relationships, is presented in Section 1.3. Section 1.4 discusses the least squares formulation for LDA. We then present extensions of the discussion to semisupervised learning and the kernel-induced feature space in Sections 1.5 and 1.6, respectively. This chapter concludes in Section 1.8.
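As a minimal illustration of the supervised/unsupervised distinction drawn above, the following sketch (ours, not from the chapter; it assumes scikit-learn is available and uses an arbitrary synthetic data set) projects the same data with both PCA and LDA:

```python
# Minimal sketch (assumes scikit-learn): PCA ignores labels, LDA uses them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised projection
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)                   # (200, 2) (200, 2)
```

With labels available, the LDA projection typically separates the three classes far better than the PCA one, which is the point made above.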
1.2 OVERVIEW OF LINEAR DISCRIMINANT ANALYSIS
We are given a data set that consists of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ ℝ^d denotes the d-dimensional input, y_i ∈ {1, 2, ..., k} denotes the corresponding class label, n is the sample size, and k is the number of classes. Let X = [x_1, x_2, ..., x_n] ∈ ℝ^{d×n} be the data matrix and let X_j ∈ ℝ^{d×n_j} be the data matrix of the jth class, where n_j is the sample size of the jth class and Σ_{j=1}^k n_j = n. Classical LDA computes a linear transformation G ∈ ℝ^{d×ℓ} that maps x_i in the d-dimensional space to a vector x_i^L in the ℓ-dimensional space as follows:

$$\mathbf{x}_i \in \mathbb{R}^d \;\rightarrow\; \mathbf{x}_i^L = G^T \mathbf{x}_i \in \mathbb{R}^{\ell}, \qquad \ell < d.$$

In LDA, three scatter matrices, called the within-class, between-class, and total scatter matrices, are defined as follows [8]:

$$S_w = \frac{1}{n} \sum_{j=1}^{k} \sum_{\mathbf{x} \in X_j} (\mathbf{x} - \mathbf{c}^{(j)})(\mathbf{x} - \mathbf{c}^{(j)})^T, \qquad (1.1)$$

$$S_b = \frac{1}{n} \sum_{j=1}^{k} n_j (\mathbf{c}^{(j)} - \mathbf{c})(\mathbf{c}^{(j)} - \mathbf{c})^T, \qquad (1.2)$$

$$S_t = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{c})(\mathbf{x}_i - \mathbf{c})^T, \qquad (1.3)$$

where c^{(j)} is the centroid of the jth class and c is the global centroid. It can be verified from the definitions that S_t = S_b + S_w [8]. Define three matrices H_w, H_b, and H_t as follows:

$$H_w = \frac{1}{\sqrt{n}} \left[ X_1 - \mathbf{c}^{(1)} (\mathbf{e}^{(1)})^T, \; \ldots, \; X_k - \mathbf{c}^{(k)} (\mathbf{e}^{(k)})^T \right], \qquad (1.4)$$

$$H_b = \frac{1}{\sqrt{n}} \left[ \sqrt{n_1}\,(\mathbf{c}^{(1)} - \mathbf{c}), \; \ldots, \; \sqrt{n_k}\,(\mathbf{c}^{(k)} - \mathbf{c}) \right], \qquad (1.5)$$

$$H_t = \frac{1}{\sqrt{n}} \left( X - \mathbf{c}\mathbf{e}^T \right), \qquad (1.6)$$

where e^{(j)} and e are vectors of all ones of length n_j and n, respectively. Then the three scatter matrices defined in Eqs. (1.1)-(1.3) can be expressed as

$$S_w = H_w H_w^T, \qquad S_b = H_b H_b^T, \qquad S_t = H_t H_t^T. \qquad (1.7)$$

It follows from the properties of the matrix trace that

$$\mathrm{trace}(S_w) = \frac{1}{n} \sum_{j=1}^{k} \sum_{\mathbf{x} \in X_j} \left\| \mathbf{x} - \mathbf{c}^{(j)} \right\|_2^2, \qquad (1.8)$$

$$\mathrm{trace}(S_b) = \frac{1}{n} \sum_{j=1}^{k} n_j \left\| \mathbf{c}^{(j)} - \mathbf{c} \right\|_2^2. \qquad (1.9)$$
Thus trace(S_w) measures the distance between the data points and their corresponding class centroids, and trace(S_b) captures the distance between the class centroids and the global centroid.

In the lower-dimensional space resulting from the linear transformation G, the scatter matrices become

$$S_w^L = G^T S_w G, \qquad S_b^L = G^T S_b G, \qquad S_t^L = G^T S_t G. \qquad (1.10)$$

An optimal transformation G would maximize trace(S_b^L) and minimize trace(S_w^L) simultaneously, which is equivalent to maximizing trace(S_b^L) and minimizing trace(S_t^L) simultaneously, since S_t^L = S_w^L + S_b^L. The optimal transformation, G^{LDA}, of LDA is computed by solving the following optimization problem [8, 16]:

$$G^{\mathrm{LDA}} = \arg\max_{G} \; \mathrm{trace}\!\left( S_b^L \left( S_t^L \right)^{-1} \right). \qquad (1.11)$$

It is known that the optimal solution to the optimization problem in Eq. (1.11) can be obtained by solving the following generalized eigenvalue problem [8]:

$$S_b \mathbf{x} = \lambda S_t \mathbf{x}. \qquad (1.12)$$

More specifically, the eigenvectors corresponding to the k − 1 largest eigenvalues form the columns of G^{LDA}. When S_t is nonsingular, this reduces to the following regular eigenvalue problem:

$$S_t^{-1} S_b \mathbf{x} = \lambda \mathbf{x}. \qquad (1.13)$$

When S_t is singular, the classical LDA formulation discussed above cannot be applied directly. This is known as the singularity or undersampled problem in LDA. In the following discussion, we consider the more general case when S_t may be singular. The transformation, G^{LDA}, then consists of the eigenvectors of S_t^+ S_b corresponding to the nonzero eigenvalues, where S_t^+ denotes the pseudo-inverse of S_t [27]. Note that when S_t is nonsingular, S_t^+ equals S_t^{-1}.

The above LDA formulation is an extension of the original Fisher linear discriminant analysis (FLDA) [7], which deals with binary-class problems, that is, k = 2. The optimal transformation, G^F, of FLDA is of rank one and is given by [15, 16]

$$G^F = S_t^{+} \left( \mathbf{c}^{(1)} - \mathbf{c}^{(2)} \right). \qquad (1.14)$$

Note that G^F is invariant of scaling. That is, αG^F, for any α ≠ 0, is also a solution to FLDA.

When the dimensionality of the data is larger than the sample size, which is the case for many high-dimensional and low sample size data sets, all three scatter matrices are singular. In recent years, many algorithms have been proposed to deal with this singularity problem. We first review these LDA extensions in the next subsection. To elucidate their commonalities and differences, a general framework is presented in Section 1.3 that unifies many of these algorithms.
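To make Eqs. (1.1)-(1.14) concrete, here is a small NumPy sketch (our construction; the function names scatter_matrices and lda_transform are hypothetical, not from the chapter) that builds the scatter matrices, checks S_t = S_b + S_w, and computes the classical LDA transformation through the eigenvectors of S_t^+ S_b:

```python
# Sketch of classical LDA via the scatter matrices of Eqs. (1.1)-(1.3) and the
# eigenproblem on S_t^+ S_b (assumption: NumPy only; function names are ours).
import numpy as np

def scatter_matrices(X, y):
    """X: d x n data matrix, y: length-n labels. Returns S_w, S_b, S_t."""
    d, n = X.shape
    c = X.mean(axis=1, keepdims=True)                 # global centroid
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[:, y == j]
        cj = Xj.mean(axis=1, keepdims=True)           # class centroid
        S_w += (Xj - cj) @ (Xj - cj).T                # Eq. (1.1), unnormalized
        S_b += Xj.shape[1] * (cj - c) @ (cj - c).T    # Eq. (1.2), unnormalized
    S_t = (X - c) @ (X - c).T                         # Eq. (1.3), unnormalized
    return S_w / n, S_b / n, S_t / n

def lda_transform(X, y):
    """Columns of G: eigenvectors of S_t^+ S_b with nonzero eigenvalues."""
    S_w, S_b, S_t = scatter_matrices(X, y)
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_t) @ S_b)
    q = np.linalg.matrix_rank(S_b)                    # at most k - 1
    return evecs[:, np.argsort(-evals.real)[:q]].real

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 10)                       # 3 classes, 10 samples each
X = rng.normal(size=(10, 30)) + 0.5 * y               # shift columns per class
S_w, S_b, S_t = scatter_matrices(X, y)
assert np.allclose(S_t, S_w + S_b)                    # checks S_t = S_b + S_w
print(lda_transform(X, y).shape)                      # (10, 2), i.e., d x (k - 1)
```

Running the sketch returns a d × (k − 1) transformation, matching the rank argument above.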
1.2.1 Generalizations of LDA
A common way to deal with the singularity problem is to apply an intermediate dimensionality reduction, such as PCA [9], to reduce the data dimensionality before classical LDA is applied. The algorithm is known as PCA+LDA, or subspace LDA [19, 28]. In this two-stage PCA+LDA algorithm, the discriminant stage is preceded by a dimensionality reduction stage using PCA. The dimensionality, p, of the subspace transformed by PCA is chosen such that the "reduced" total scatter matrix in this subspace is nonsingular, so that classical LDA can be applied. The optimal value of p is commonly estimated through cross-validation, as in the sketch following this subsection.

Regularization techniques can also be applied to deal with the singularity problem of LDA. The algorithm is known as regularized LDA (RLDA) [21]. The key idea is to add a constant μ > 0 to the diagonal elements of S_t, as S_t + μI_d, where I_d is the identity matrix of size d. It is easy to verify that S_t + μI_d is positive definite [27], hence nonsingular. Cross-validation is commonly applied to estimate the optimal value of μ. Note that regularization is also the key to many other learning algorithms, including support vector machines (SVM) [29].

In reference 20, null space LDA (NLDA) was proposed, where the between-class distance is maximized in the null space of the within-class scatter matrix. The singularity problem is thus avoided implicitly. The efficiency of the algorithm can be improved by first removing the null space of the total scatter matrix. This is based on the observation that the null space of the total scatter matrix is the intersection of the null spaces of the between-class and within-class scatter matrices. In contrast, the orthogonal centroid method (OCM) [25] maximizes the between-class distance only and thereby omits the within-class information. The optimal transformation of OCM is given by the top eigenvectors of the between-class scatter matrix S_b.

In reference 24, a family of generalized discriminant analysis algorithms was presented. Uncorrelated LDA (ULDA) and orthogonal LDA (OLDA) are two representative algorithms from this family. The features in the reduced space of ULDA are uncorrelated, while the transformation, G, of OLDA has orthonormal columns, that is, G^T G = I_ℓ. The LDA/GSVD algorithm proposed in reference 26, which overcomes the singularity problem via the generalized singular value decomposition (GSVD) [27], also belongs to this family. Discriminant analysis with an orthogonal transformation has also been studied in reference 30.
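The two-stage PCA+LDA scheme described above maps naturally onto a pipeline. The sketch below is an illustration using scikit-learn (our construction, with an arbitrary candidate set for p), selecting the PCA dimensionality by cross-validation as suggested above:

```python
# Minimal PCA+LDA sketch (assumes scikit-learn; p chosen by cross-validation).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           n_classes=4, random_state=0)   # d > n: S_t singular

pca_lda = Pipeline([("pca", PCA()), ("lda", LinearDiscriminantAnalysis())])
search = GridSearchCV(pca_lda, {"pca__n_components": [10, 20, 40, 60]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The PCA stage here plays exactly the role of the linear step transfer function discussed in the next section.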
1.3 A UNIFIED FRAMEWORK FOR GENERALIZED LDA
The LDA extensions discussed in the last section employ different techniques to deal with the singularity problem. In this section, we present a four-step general framework for various generalized LDA algorithms. The presented framework unifies most of the generalized LDA algorithms. The properties of various algorithms, as well as their relationships, are elucidated from this framework. The unified framework consists of the four steps described below:

1. Compute the eigenvalues, {λ_i}_{i=1}^d, of S_t in Eq. (1.3) and the corresponding eigenvectors {u_i}_{i=1}^d, with λ_1 ≥ ... ≥ λ_d. Then S_t can be expressed as S_t = Σ_{i=1}^d λ_i u_i u_i^T.

2. Given a transfer function Φ: ℝ → ℝ, let λ̃_i = Φ(λ_i), for all i. Construct the matrix S̃_t as S̃_t = Σ_{i=1}^d λ̃_i u_i u_i^T.

3. Compute the eigenvectors, {φ_i}_{i=1}^q, of S̃_t^+ S_b corresponding to the nonzero eigenvalues, where q = rank(S_b) and S̃_t^+ denotes the pseudo-inverse of S̃_t [27]. Construct the matrix G as G = [φ_1, ..., φ_q].

4. Optional orthogonalization step: Compute the QR decomposition [27] of G as G = QR, where Q ∈ ℝ^{d×q} has orthonormal columns and R ∈ ℝ^{q×q} is upper triangular.

With this four-step procedure, the final transformation is given by either the matrix G from step 3, if the optional orthogonalization step is not applied, or the matrix Q from step 4 if the transformation matrix is required to be orthogonal. In this framework, different transfer functions, Φ, in step 2 lead to different generalized LDA algorithms, as summarized below:

- In PCA+LDA, the intermediate dimensionality reduction stage by PCA keeps the top p eigenvalues of S_t; thus it applies the following linear step function: Φ(λ_i) = λ_i, for 1 ≤ i ≤ p, and Φ(λ_i) = 0, for i > p. The optional orthogonalization step is not employed in PCA+LDA.

- In regularized LDA (RLDA), a regularization term is applied to S_t as S_t + μI_d, for some μ > 0. This corresponds to the use of the following transfer function: Φ(λ_i) = λ_i + μ, for all i. The optional orthogonalization step is not employed in RLDA.

- In uncorrelated LDA (ULDA), the optimal transformation consists of the top eigenvectors of S_t^+ S_b [24]. The corresponding transfer function is thus given by Φ(λ_i) = λ_i, for all i. The same transfer function is used in orthogonal LDA (OLDA). The difference between ULDA and OLDA is that OLDA performs the optional orthogonalization step, while it is not applied in ULDA.

- In the orthogonal centroid method (OCM), the optimal transformation is given by the top eigenvectors of S_b [25]. The transfer function is thus given by Φ(λ_i) = 1, for all i. Since the eigenvectors of S_b form an orthonormal set, the optional orthogonalization step is not necessary in OCM.

It has been shown [31] that the regularization in RLDA is effective for nonzero eigenvalues only. Thus, we can apply the following transfer function for RLDA:
$$\Phi(\lambda_i) = \begin{cases} \lambda_i + \mu & \text{for } 1 \le i \le t, \\ 0 & \text{for } i > t, \end{cases}$$

where t = rank(S_t). The transfer functions for different LDA extensions are summarized in Table 1.1.

Table 1.1. Transfer Functions for Different LDA Extensions

    PCA+LDA:    Φ(λ_i) = λ_i for 1 ≤ i ≤ p;  Φ(λ_i) = 0 for i > p
    RLDA:       Φ(λ_i) = λ_i + μ for 1 ≤ i ≤ t;  Φ(λ_i) = 0 for i > t
    ULDA/OLDA:  Φ(λ_i) = λ_i for all i
    OCM:        Φ(λ_i) = 1 for all i

In null space LDA (NLDA) [20, 32], the data are first projected onto the null space of S_w, which is then followed by classical LDA. It is not clear which transfer function corresponds to the projection onto the null space of S_w. In reference 33, the equivalence relationship between NLDA and OLDA was established under a mild condition

$$\mathrm{C1}: \quad \mathrm{rank}(S_t) = \mathrm{rank}(S_b) + \mathrm{rank}(S_w), \qquad (1.15)$$

which has been shown to hold for many high-dimensional data sets. Thus, for high-dimensional data, we can use the following transfer function for NLDA: Φ(λ_i) = λ_i, for all i.
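The four-step framework and the transfer functions of Table 1.1 can be prototyped directly. In the sketch below (ours; generalized_lda and the phi_* helpers are hypothetical names, not the authors' implementation), the transfer function Φ is a pluggable callable and the optional QR step distinguishes OLDA from ULDA:

```python
# Sketch of the four-step unified framework with a pluggable transfer function.
import numpy as np

def generalized_lda(X, y, phi, orthogonalize=False):
    d, n = X.shape
    c = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((d, d))
    for j in np.unique(y):                            # Eq. (1.2)
        nj = int(np.sum(y == j))
        cj = X[:, y == j].mean(axis=1, keepdims=True)
        S_b += (nj / n) * (cj - c) @ (cj - c).T
    S_t = (X - c) @ (X - c).T / n                     # Eq. (1.3)
    lam, U = np.linalg.eigh(S_t)                      # step 1
    lam, U = lam[::-1], U[:, ::-1]                    # eigenvalues descending
    lam_tilde = np.array([phi(l, i) for i, l in enumerate(lam)])      # step 2
    evals, evecs = np.linalg.eig(np.linalg.pinv((U * lam_tilde) @ U.T) @ S_b)
    q = np.linalg.matrix_rank(S_b)                    # step 3: top q eigenvectors
    G = evecs[:, np.argsort(-evals.real)[:q]].real
    if orthogonalize:                                 # step 4: optional QR step
        G, _ = np.linalg.qr(G)
    return G

# Transfer functions of Table 1.1 (p, mu, t are user-chosen parameters).
phi_ulda = lambda l, i: l                             # ULDA; OLDA adds the QR step
phi_ocm = lambda l, i: 1.0                            # OCM
phi_rlda = lambda mu, t: (lambda l, i: l + mu if i < t else 0.0)      # RLDA
phi_pca_lda = lambda p: (lambda l, i: l if i < p else 0.0)            # PCA+LDA

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 15)
X = rng.normal(size=(40, 45)) + 0.8 * y               # 3 classes, d = 40
print(generalized_lda(X, y, phi_ulda, orthogonalize=True).shape)      # (40, 2)
```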
1.3.1 Analysis
The unified framework from the last section summarizes the commonalities and differences of various LDA-based algorithms. This unification of diverse algorithms into a common framework sheds light on the key features of the various algorithms as well as their relationships.

It is clear from Table 1.1 that ULDA reduces to the OCM algorithm [25] when S_t is a multiple of the identity matrix. Recent studies on the geometric representation of high-dimensional and small sample size data show that, under mild conditions, the covariance matrix S_t tends to a scaled identity matrix when the data dimension d tends to infinity with the sample size n fixed [34]. This implies that all the eigenvalues of S_t are the same; in other words, the data behave as if the underlying distribution were spherical. In this case, OCM is equivalent to ULDA. This partially explains the effectiveness of OCM when working on high-dimensional data.

We can observe from Table 1.1 that when the reduced dimensionality, p, in the PCA stage of PCA+LDA is chosen to be the rank of S_t (that is, the PCA stage keeps all the information), the transfer functions for PCA+LDA and ULDA are identical. That is, PCA+LDA is equivalent to ULDA in this case. It can also be observed from Table 1.1 that the transfer function for RLDA equals the one for ULDA when μ = 0. Thus, ULDA can be considered a special case of both PCA+LDA and RLDA.

It follows from the above discussion that when μ = 0 in RLDA and p = rank(S_t) in PCA+LDA, they both reduce to ULDA. It has been shown that, under condition C1 in Eq. (1.15), the transformation matrix of ULDA lies in the null space of S_w [33]; that is, G^T S_w = 0. Furthermore, it was shown in reference 31 that if G^T S_w = 0 holds, then the transformation matrix G maps all data points from the same class to a common vector. This is an extension of the result in reference 32, which assumes that all classes in the data set have the same number of samples. It thus follows that the ULDA transformation maps all data points from the same class to a common vector, provided that condition C1 is satisfied. This leads to a perfect separation between different classes in the dimensionality-reduced space. However, it may also result in overfitting. RLDA overcomes this limitation by choosing a nonzero regularization value μ, while PCA+LDA overcomes this limitation by setting p < rank(S_t).

The above analysis shows that the regularization in RLDA and the PCA dimensionality reduction in PCA+LDA are expected to alleviate the overfitting problem, provided that appropriate values for μ and p can be estimated. Selecting an optimal value for a parameter such as μ in RLDA or p in PCA+LDA from a given candidate set is called model selection [17]. Existing studies have focused on estimation from a small candidate set, as model selection involves expensive matrix computations for each candidate value. However, a large candidate set is desirable in practice to achieve good performance. This has been one of the main reasons for the limited applicability of these algorithms in practice. To overcome this problem, an efficient model selection algorithm for RLDA was proposed in reference 31; this algorithm can estimate an optimal value for μ from a large number of candidate values efficiently.
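Model selection of the kind discussed here is straightforward to prototype. The sketch below (an illustration, not the efficient algorithm of reference 31) cross-validates a grid of regularization values using scikit-learn's shrinkage-based LDA; note the assumption that the shrinkage coefficient merely plays the role of μ, since it is a covariance-shrinkage estimator rather than the exact S_t + μI_d formulation:

```python
# Cross-validated choice of the regularization strength for shrinkage-based
# LDA (scikit-learn); shrinkage stands in for the mu of RLDA here.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=80, n_features=300, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

grid = {"shrinkage": [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]}   # candidate set
lda = LinearDiscriminantAnalysis(solver="eigen")
search = GridSearchCV(lda, grid, cv=5).fit(X, y)
print(search.best_params_)
```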
1.4 A LEAST SQUARES FORMULATION FOR LDA
In this section, we discuss recent developments on connecting LDA to multivariate linear regression (MLR). We first discuss the relationship between linear regression and LDA in the binary-class case. We then present multivariate linear regression with a specific class indicator matrix. This indicator matrix plays a key role in establishing the equivalence relationship between MLR and LDA in the multiclass case.
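The binary-class equivalence derived in Section 1.4.1 below is easy to verify numerically. The following sketch (our construction, under the centering and ±1-labeling assumptions stated there) checks that the least squares weight vector is parallel to the FLDA direction of Eq. (1.14):

```python
# Numerical check (ours): least squares on centered +/-1 labels is parallel
# to the Fisher LDA direction S_t^+ (c1 - c2); cf. Eqs. (1.14) and (1.18).
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d = 12, 18, 40
X = np.hstack([rng.normal(0.0, 1.0, (d, n1)),     # negative class
               rng.normal(0.6, 1.0, (d, n2))])    # positive class
y = np.array([-1.0] * n1 + [1.0] * n2)

Xc = X - X.mean(axis=1, keepdims=True)            # center the data
yc = y - y.mean()                                 # centered labels
w = np.linalg.pinv(Xc @ Xc.T) @ (Xc @ yc)         # least squares solution

c1 = X[:, :n1].mean(axis=1)
c2 = X[:, n1:].mean(axis=1)
St = Xc @ Xc.T / (n1 + n2)
gF = np.linalg.pinv(St) @ (c1 - c2)               # FLDA direction, Eq. (1.14)

cos = w @ gF / (np.linalg.norm(w) * np.linalg.norm(gF))
print(abs(cos))  # ~= 1.0: the two directions coincide up to scaling
```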
1.4.1 Linear Regression versus Fisher LDA
Given a data set of two classes, {(x_i, y_i)}_{i=1}^n, x_i ∈ ℝ^d and y_i ∈ {−1, 1}, the linear regression model with the class label as the output has the following form:

$$f(\mathbf{x}) = \mathbf{x}^T \mathbf{w} + b, \qquad (1.16)$$

where w ∈ ℝ^d is the weight vector and b is the bias of the linear model. A popular approach for estimating w and b is to minimize the sum-of-squares error function, called least squares, as follows:

$$L(\mathbf{w}, b) = \frac{1}{2} \sum_{i=1}^{n} \left\| f(\mathbf{x}_i) - y_i \right\|^2 = \frac{1}{2} \left\| X^T \mathbf{w} + b\mathbf{e} - \mathbf{y} \right\|^2, \qquad (1.17)$$

where X = [x_1, x_2, ..., x_n] is the data matrix, e is the vector of all ones, and y is the vector of class labels. Assume that both {x_i} and {y_i} have been centered, that is, Σ_{i=1}^n x_i = 0 and Σ_{i=1}^n y_i = 0. It follows that

$$y_i \in \left\{ -2n_2/n, \; 2n_1/n \right\},$$
where n_1 and n_2 denote the number of samples from the negative and positive classes, respectively. In this case, the bias term b in Eq. (1.16) becomes zero, and we construct a linear model f(x) = x^T w by minimizing

$$L(\mathbf{w}) = \frac{1}{2} \left\| X^T \mathbf{w} - \mathbf{y} \right\|^2. \qquad (1.18)$$

It can be shown that the optimal w minimizing the objective function in Eq. (1.18) is given by [16, 17]

$$\mathbf{w} = \left( X X^T \right)^{+} X \mathbf{y}.$$

Note that the data matrix X has been centered, and thus XX^T = nS_t and Xy = (2n_1 n_2 / n)(c^{(1)} − c^{(2)}). It follows that

$$\mathbf{w} = \frac{2 n_1 n_2}{n^2} \, S_t^{+} \left( \mathbf{c}^{(1)} - \mathbf{c}^{(2)} \right) = \frac{2 n_1 n_2}{n^2} \, G^F,$$

where G^F is the optimal solution to FLDA in Eq. (1.14). Hence linear regression with the class label as the output is equivalent to Fisher LDA, as the projection in FLDA is invariant of scaling. More details on this equivalence relationship can be found in references 15, 16, and 35.

1.4.2 Relationship Between Multivariate Linear Regression and LDA

In the multiclass case, we are given a data set consisting of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ ℝ^d and y_i ∈ {1, 2, ..., k} denotes the class label of the ith sample and k > 2. To apply the least squares formalism to the multiclass case, the 1-of-k binary coding scheme is usually used to associate a vector-valued class code with each data point [15, 17]. In this coding scheme, the class indicator matrix, denoted as Y_1 ∈ ℝ^{n×k}, is defined as follows:

$$Y_1(ij) = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1.19)$$

It is known that the solution to the least squares problem approximates the conditional expectation of the target values given the input [15]. One justification for using the 1-of-k scheme is that, under this coding scheme, the conditional expectation is given by the vector of posterior class probabilities. However, these probabilities are usually approximated rather poorly [15]. There are also some other class indicator matrices considered in the literature. In particular, the indicator matrix Y_2 ∈ ℝ^{n×k}, defined as

$$Y_2(ij) = \begin{cases} 1 & \text{if } y_i = j, \\ -1/(k-1) & \text{otherwise,} \end{cases} \qquad (1.20)$$

has been introduced to extend support vector machines (SVM) to multiclass classification [36] and to generalize the kernel target alignment measure [37], originally proposed in reference 38.

In multivariate linear regression, a k-tuple of discriminant functions

$$f(\mathbf{x}) = \left( f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x}) \right)$$

is considered for each x ∈ ℝ^d. Denote X̃ = [x̃_1, ..., x̃_n] ∈ ℝ^{d×n} and Ỹ = (Ỹ_ij) ∈ ℝ^{n×k} as the centered data matrix X and the centered indicator matrix Y, respectively. That is, x̃_i = x_i − x̄ and Ỹ_ij = Y_ij − Ȳ_j, where x̄ = (1/n) Σ_{i=1}^n x_i and Ȳ_j = (1/n) Σ_{i=1}^n Y_ij. Then MLR computes the weight vectors, {w_j}_{j=1}^k ∈ ℝ^d, of the k linear models, f_j(x) = x^T w_j, for j = 1, ..., k, via the minimization of the following sum-of-squares error function:

$$L(W) = \frac{1}{2} \left\| \tilde{X}^T W - \tilde{Y} \right\|_F^2 = \frac{1}{2} \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| f_j(\tilde{\mathbf{x}}_i) - \tilde{Y}_{ij} \right\|^2, \qquad (1.21)$$

where W = [w_1, w_2, ..., w_k] is the weight matrix and ‖·‖_F denotes the Frobenius norm of a matrix [27]. The optimal W is given by [15, 17]

$$W = \left( \tilde{X} \tilde{X}^T \right)^{+} \tilde{X} \tilde{Y}, \qquad (1.22)$$

which is dependent on the centered class indicator matrix Ỹ. Both Y_1 and Y_2 defined in Eqs. (1.19) and (1.20), as well as the one in reference 39, could be used to define the centered indicator matrix Ỹ.

An interesting connection between the linear regression model using Y_1 and LDA can be found in reference 17 (page 112). It can be shown that if X^L = W_1^T X̃ is the data transformed by W_1, where W_1 = (X̃X̃^T)^+ X̃Ỹ_1 is the least squares solution in Eq. (1.22) using the centered indicator matrix Ỹ_1, then LDA applied to X^L is identical to LDA applied to X̃ in the original space. In this case, linear regression is applied as a preprocessing step before classification and is in general not equivalent to LDA. The second indicator matrix, Y_2, has been used in SVM, and the resulting model using Y_2 is also not equivalent to LDA in general. This is also the case for the indicator matrix in reference 39.

One natural question is whether there exists a class indicator matrix Ỹ ∈ ℝ^{n×k} with which multivariate linear regression is equivalent to LDA. If this is the case, then LDA can be formulated as a least squares problem in the multiclass case, and the generalizations of least squares can be readily applied to LDA.

In MLR, each x̃_i is transformed to

$$\left( f_1(\tilde{\mathbf{x}}_i), \ldots, f_k(\tilde{\mathbf{x}}_i) \right)^T = W^T \tilde{\mathbf{x}}_i,$$

and the centered data matrix X̃ ∈ ℝ^{d×n} is transformed to W^T X̃ ∈ ℝ^{k×n}, thus achieving dimensionality reduction if k < d. Note that the transformation matrix W in MLR depends on the centered class indicator matrix Ỹ as in Eq. (1.22). To derive a class indicator matrix for MLR with which the transformation matrix is related to that of LDA, it is natural to apply the class discrimination criterion used in LDA. We thus look for Ỹ, which solves the following optimization problem:

$$\max_{\tilde{Y}} \; \mathrm{trace}\!\left( (W^T S_b W)(W^T S_t W)^{+} \right) \quad \text{subject to} \quad W = \left( \tilde{X} \tilde{X}^T \right)^{+} \tilde{X} \tilde{Y}, \qquad (1.23)$$

where the pseudo-inverse is used, as the matrix X̃X̃^T can be singular.

In reference 40, a new class indicator matrix, called Y_3, is constructed, and it was shown that Y_3 solves the optimization problem in Eq. (1.23). This new class indicator matrix Y_3 = (Y_3(ij))_{ij} ∈ ℝ^{n×k} is defined as follows:

$$Y_3(ij) = \begin{cases} \sqrt{\dfrac{n}{n_j}} - \sqrt{\dfrac{n_j}{n}} & \text{if } y_i = j, \\[2ex] -\sqrt{\dfrac{n_j}{n}} & \text{otherwise,} \end{cases} \qquad (1.24)$$

where n_j is the sample size of the jth class and n is the total sample size. Note that Y_3 defined above has been centered (in terms of rows), and thus Ỹ_3 = Y_3. More importantly, it was shown in reference 40 that, under condition C1 in Eq. (1.15), multivariate linear regression with Y_3 as the class indicator matrix is equivalent to LDA. We outline the main result below; the detailed proof can be found in reference 40.

Recall that in LDA the optimal transformation matrix (G^{LDA}) consists of the top eigenvectors of S_t^+ S_b corresponding to the nonzero eigenvalues. On the other hand, since X̃X̃^T = nS_t and X̃Y_3 = nH_b, where S_t and H_b are defined in Eqs. (1.3) and (1.5), respectively, the optimal weight matrix W^{MLR} for MLR in Eq. (1.22) can be expressed as

$$W^{\mathrm{MLR}} = \left( \tilde{X} \tilde{X}^T \right)^{+} \tilde{X} Y_3 = (n S_t)^{+} \, n H_b = S_t^{+} H_b. \qquad (1.25)$$

It can be shown that the transformation matrix G^{LDA} of LDA, which consists of the top eigenvectors of S_t^+ S_b, and the projection matrix for MLR given in Eq. (1.25) are related as follows [40]:

$$W^{\mathrm{MLR}} = \left[ G^{\mathrm{LDA}} \Lambda, \; 0 \right] Q^T,$$

where Λ is a diagonal matrix and Q is an orthogonal matrix. The K-nearest-neighbor (K-NN) algorithm [16] based on the Euclidean distance is commonly applied as the classifier in the dimensionality-reduced space of LDA. If we apply W^{MLR} for dimensionality reduction before K-NN, the matrix W^{MLR} is invariant to an orthogonal transformation, since any orthogonal transformation preserves all pairwise distances. Thus W^{MLR} is essentially equivalent to [G^{LDA} Λ, 0], or G^{LDA} Λ, as the removal of zero columns does not change the pairwise distances either. Thus the essential difference between W^{MLR} and G^{LDA} is the diagonal matrix Λ. Interestingly, it was shown in reference 40 that the matrix Λ is an identity matrix under the condition C1 defined in Eq. (1.15). This implies that multivariate linear regression with Y_3 as the class indicator matrix is equivalent to LDA, provided that the condition C1 is satisfied. Thus LDA can be formulated as a least squares problem in the multiclass case. Experimental results in reference 40 show that condition C1 is likely to hold for high-dimensional and undersampled data.

1.5 SEMISUPERVISED LDA

Semisupervised learning, which occupies the middle ground between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no labeled data are given), has received considerable attention recently [41-43]. The least squares LDA formulation from the last section results in Laplacian-regularized LDA [44]. Furthermore, it naturally leads to semisupervised dimensionality reduction by incorporating the unlabeled data through the graph Laplacian.

1.5.1 Graph Laplacian

Given a data set {x_i}_{i=1}^n, a weighted graph can be constructed in which each node corresponds to a data point in the data set. The weight S_ij between two nodes x_i and x_j is commonly defined as follows:

$$S_{ij} = \begin{cases} \exp\!\left( -\dfrac{\| \mathbf{x}_i - \mathbf{x}_j \|^2}{\sigma} \right) & \mathbf{x}_i \in \mathcal{N}(\mathbf{x}_j) \text{ or } \mathbf{x}_j \in \mathcal{N}(\mathbf{x}_i), \\[1ex] 0 & \text{otherwise,} \end{cases} \qquad (1.26)$$

where both the neighborhood size and σ > 0 are parameters to be specified, and x_i ∈ N(x_j) implies that x_i is among the nearest neighbors of x_j [45]. Let S be the similarity matrix whose (i, j)th entry is S_ij. To learn an appropriate representation {z_i}_{i=1}^n that preserves the locality structure, it is common to minimize the following objective function [45]:

$$\sum_{i,j} \| \mathbf{z}_i - \mathbf{z}_j \|^2 \, S_{ij}. \qquad (1.27)$$

Intuitively, if x_i and x_j are close to each other in the original space, that is, S_ij is large, then z_i − z_j tends to be small if the objective function in Eq. (1.27) is minimized. Thus the locality structure in the original space is preserved.

Define the Laplacian matrix L as L = D − S, where D is a diagonal matrix whose diagonal entries are the column sums of S; that is, D_ii = Σ_{j=1}^n S_ij. Note that L is symmetric and positive semidefinite. It can be verified that

$$\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| \mathbf{z}_i - \mathbf{z}_j \|^2 \, S_{ij} = \mathrm{trace}\!\left( Z L Z^T \right), \qquad (1.28)$$

where Z = [z_1, ..., z_n].
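As a numerical companion to Eqs. (1.26)-(1.28), the sketch below (ours; the neighborhood size and σ are arbitrary choices) builds a symmetrized nearest-neighbor similarity matrix, forms L = D − S, and verifies the trace identity of Eq. (1.28):

```python
# Sketch (ours): kNN similarity graph, Laplacian L = D - S, and a numerical
# check of (1/2) sum_ij ||z_i - z_j||^2 S_ij = trace(Z L Z^T), Eq. (1.28).
import numpy as np

rng = np.random.default_rng(0)
n, d, n_neighbors, sigma = 50, 5, 6, 1.0
X = rng.normal(size=(d, n))                       # columns are data points

sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise ||xi - xj||^2
S = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(sq[i])[1:n_neighbors + 1]   # skip self (distance 0)
    S[i, nbrs] = np.exp(-sq[i, nbrs] / sigma)     # Eq. (1.26)
S = np.maximum(S, S.T)                            # "xi in N(xj) or xj in N(xi)"

D = np.diag(S.sum(axis=1))
L = D - S                                         # graph Laplacian

Z = rng.normal(size=(3, n))                       # any low-dimensional embedding
lhs = 0.5 * sum(((Z[:, i] - Z[:, j]) ** 2).sum() * S[i, j]
                for i in range(n) for j in range(n))
rhs = np.trace(Z @ L @ Z.T)
print(np.isclose(lhs, rhs))  # True
```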