IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 45, NO. 2, FEBRUARY 2007


Unsupervised Linear Feature-Extraction Methods and Their Effects in the Classification of High-Dimensional Data Luis O. Jiménez-Rodríguez, Member, IEEE, Emmanuel Arzuaga-Cruz, Student Member, IEEE, and Miguel Vélez-Reyes, Senior Member, IEEE

Abstract—This paper presents an analysis and a comparison of different linear unsupervised feature-extraction methods applied to hyperdimensional data and their impact on classification. The dimensionality reduction methods studied fall under the category of unsupervised linear transformations: principal component analysis, projection pursuit (PP), and band subset selection. Special attention is paid to an optimized version of PP introduced in this paper, optimized information divergence PP, which maximizes the information divergence between the probability density function of the projected data and the Gaussian distribution. This paper is particularly relevant to the current and next generations of hyperspectral sensors, which acquire more information in a higher number of spectral channels or bands when compared to multispectral data. The process to uncover these high-dimensional data patterns is not a simple one. Challenges such as the Hughes phenomenon and the curse of dimensionality have an impact on high-dimensional data analysis. Unsupervised feature extraction, implemented as a linear projection from a higher dimensional space to a lower dimensional subspace, is a relevant process necessary for hyperspectral data analysis due to its capacity to overcome some difficulties of high-dimensional data. An objective of unsupervised feature extraction in hyperspectral data analysis is to reduce the dimensionality of the data while maintaining its capability to discriminate data patterns of interest from an unknown cluttered background that may be present in the data set. This paper presents a study of the impact these mechanisms have on the classification process. The impact is studied for supervised classification, even under the condition of a small number of training samples, and for unsupervised classification, where unknown structures are to be uncovered and detected.

Index Terms—Classification, dimensionality reduction, feature extraction, feature selection, hyperspectral data, pattern recognition, principal component analysis (PCA), projection pursuit (PP).

Manuscript received May 9, 2005; revised June 11, 2006. This work was supported in part by the U.S. Army Corps of Engineers Topographic Engineering Center under Grant DACA76-97-K-0007, in part by the National Science Foundation Engineering Research Center Program under Grant EEC-9986821, in part by the NASA University Research Centers Program under Grant NCC5518, and in part by the National Imagery and Mapping Agency under Contract NMA2010112014. L. O. Jiménez-Rodríguez and M. Vélez-Reyes are with the Electrical and Computer Engineering Department, University of Puerto Rico at Mayagüez, Mayagüez 00681, Puerto Rico (e-mail: [email protected]; [email protected]). E. Arzuaga-Cruz is with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TGRS.2006.885412

I. INTRODUCTION

THIS PAPER presents an analysis and a comparison of different linear unsupervised feature-extraction methods applied to hyperdimensional data and their impact on classification. Feature extraction is a process applied in multivariate pattern recognition to reduce the number of variables to be used in the classification process. The objective of feature extraction is to reduce the dimensionality of the data while maintaining its discrimination capability. This paper focuses on linear projections due to a series of advantages [1]: 1) the linear mapping function is well defined; 2) the only requirement is to find the coefficients of the linear function, which is done by maximizing or minimizing a criterion; and 3) linear projection uses well-developed techniques of linear algebra or well-known iterative and numerical techniques. For features that are not linear functions of the original measurements, it is a challenge to find suitable nonlinear projection functions, and the result is usually very problem dependent. Linear feature selection is a subset of the linear feature-extraction process. It is a type of linear projection that chooses variables instead of adding weighted variables that represent numerical features.

The main purpose of this paper is to compare and study different unsupervised linear methods for dimensionality reduction applied to hyperdimensional data and their impact on classification. The methods studied in this paper are principal component analysis (PCA), projection pursuit (PP), and band subset selection. PP techniques search for projections that optimize certain projection indexes. The projection index we optimized in PP, named information divergence PP (IDPP), is the information divergence between the probability density function of the projected data and the Gaussian distribution. To calculate the projection, an optimization algorithm was used. The PP was modified to develop an unsupervised feature-selection version of the information divergence measure. We named this version information divergence band subset selection (IDSS). In order to select the set of dsub bands out of the original set of d bands, we measured the divergence between each band's probability density function and the Gaussian probability density function. An unsupervised band subset selection mechanism already developed by Vélez-Reyes [2] was implemented for the purpose of further comparison. This method, named singular value decomposition subset selection (SVDSS), selects the dsub bands that best approximate the dsub principal


components. The method uses a technique based on the SVD to select the best subset of dsub bands out of a total number of d bands.

These techniques are relevant for hyperspectral data analysis. As the number of spectral bands increases in hyperspectral sensors, a significant amount of information is added to the high-dimensional feature space. Previous work has shown that hyperspectral data contain redundant information in terms of spectral features [3]. The relevant information content of hyperspectral data can be represented in a lower dimensional subspace for specific applications [3]. Therefore, reducing the dimensionality of hyperspectral data without losing important information about objects of interest is a very important issue for the remote sensing community.

The differences in terms of statistical properties and the difficulties in density parameter estimation between high- and low-dimensional spaces cause conventional supervised classification methods originally developed for low-dimensional multispectral data to give unsatisfactory results when managing hyperspectral data. These unsatisfactory results of hyperspectral data analysis are due to the characteristics of high-dimensional spaces. Fukunaga shows that as the number of dimensions increases, the number of samples needed to have high classification accuracy increases as well [1]. The increase in the number of samples is linear in the dimensionality for a linear classifier and proportional to the square of the dimensionality for a quadratic one. This problem is also known as "the Hughes phenomenon" [4]. For density estimation, the situation is more difficult. Local neighborhoods are almost surely empty, requiring the bandwidth of estimation to be large and producing the effect of losing detailed density estimation. As a consequence, the situation is even worse for nonparametric classifiers based on density estimation. It has been estimated that as the number of dimensions increases, the sample size needs to increase exponentially in order to have an effective estimate of multivariate densities [5], [6]. This has also been named "the curse of dimensionality" [3], [5].

A similar problem is found in supervised feature extraction. A number of techniques for case-specific feature extraction have been developed to reduce the dimensionality without loss of class separability. Most of these techniques require the estimation of statistics at full dimensionality in order to extract relevant features for classification. If the number of training samples is not adequately large, the estimation of parameters in high-dimensional data will not be accurate enough. As a result, the estimated features may not be as effective as they could be. To avoid the Hughes phenomenon and the curse of dimensionality, dimensionality reduction methods used for multispectral data analysis must be modified in a way that takes into consideration the high-dimensional characteristics of hyperspectral data [7].

This paper presents different mechanisms that perform unsupervised linear dimensionality reduction in the hyperspectral image analysis process. Special attention is paid to an optimized version of PP introduced in this paper: optimized IDPP (OIDPP), which is the optimization of the information divergence between the probability density function of the projected data and the Gaussian distribution. The objective is to present a

study of the impact these mechanisms have on the classification process and how they handle the Hughes phenomenon. Section II introduces different unsupervised feature-extraction mechanisms. Section III introduces the PP using information divergence. Section IV presents our modification to this mechanism, named OIDPP. A subset selection process based on OIDPP is presented in Section V. Experimental results with synthetic, multispectral, and hyperspectral data are shown in Section VI. For the purpose of comparison, the study is performed for both types of classification processes: unsupervised and supervised classification. The last experiment is performed and studied on supervised classification under the conditions of a small number of training samples. Finally, concluding remarks are presented in Section VII.

II. DIMENSIONALITY REDUCTION TECHNIQUES

This section presents a summary of several dimensionality reduction techniques. Although the scope of this paper is on unsupervised techniques, one of the best known supervised techniques, Fisher discriminant analysis (DA), will be presented. Its strengths and limitations are exemplary of other supervised feature-extraction mechanisms. Unsupervised techniques, such as PP, PCA, and SVDSS, will be summarized along with their potential to address the Hughes phenomenon and the curse of dimensionality.

A. Fisher DA

DA is a supervised feature-extraction technique that uses the information of the data distribution as an index to maximize. The index to be maximized is the ratio of the between-class and within-class variances as a measure of separability. This method needs a priori information in terms of labeled samples that represent classes in the data set. This transformation is based on the eigenvalue/eigenvector decomposition and on the Fisher criterion [8]. For mathematical clarity, let us define the within-class scatter matrix

$$\Sigma_\omega = \frac{1}{M}\sum_{i=1}^{M}\Sigma_i \qquad (1)$$

where $\Sigma_i$ is the covariance matrix of the $i$th class. Let us define the between-class scatter matrix $\Sigma_b$

$$\Sigma_b = \frac{1}{M}\sum_{i=1}^{M}(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T \qquad (2)$$

where $M$ is the number of spectral classes, $\mathbf{m}_i$ is the sample mean vector of the $i$th class, and $\mathbf{m}$ is the sample mean vector for the whole labeled data set. Finally, we compute the feature vector $\mathbf{a}$ by maximizing the following function:

$$J(\mathbf{a}) = \frac{\mathbf{a}^T\Sigma_b\mathbf{a}}{\mathbf{a}^T\Sigma_\omega\mathbf{a}}. \qquad (3)$$
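As a concrete illustration of (1)–(3) (not part of the original paper), the following minimal NumPy sketch computes the classical Fisher direction from labeled samples; the function and variable names are hypothetical.

```python
import numpy as np

def fisher_direction(X, labels):
    """Compute the 1-D Fisher discriminant direction a that maximizes (3).

    X      : (d, N) data matrix, one sample per column.
    labels : (N,) integer class labels.
    Returns a unit-norm vector a of length d.
    """
    classes = np.unique(labels)
    M = len(classes)
    m = X.mean(axis=1)                      # global sample mean vector

    d = X.shape[0]
    S_w = np.zeros((d, d))                  # within-class scatter, eq. (1)
    S_b = np.zeros((d, d))                  # between-class scatter, eq. (2)
    for c in classes:
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1)
        S_w += np.cov(Xc)                   # covariance matrix of class c
        S_b += np.outer(mc - m, mc - m)
    S_w /= M
    S_b /= M

    # The maximizer of (3) is the leading eigenvector of inv(S_w) @ S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    a = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return a / np.linalg.norm(a)
```

This is the classical (non-kernel) Fisher DA that the paper later uses only for comparison purposes.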


The feature vector $\mathbf{a}$ that maximizes (3) is the eigenvector of $\Sigma_\omega^{-1}\Sigma_b$ associated with the largest eigenvalue and is used to linearly project the data. One of the major drawbacks of Fisher DA is that, if the class-mean vectors are very close to each other, the method may select a projection that merges classes. DA also requires a large number of training samples per feature in order to avoid the Hughes phenomenon.

Kernel-based methods have been developed that include variants of Fisher DA based on more general equations than (1)–(3). For example, (3) could be expressed as

$$J(\mathbf{a}) = \frac{\mathbf{a}^T \mathbf{S}_i \mathbf{a}}{\mathbf{a}^T \mathbf{S}_N \mathbf{a}} \qquad (4)$$

where $\mathbf{S}_i$ is a symmetric matrix that measures desired information and $\mathbf{S}_N$ measures undesired noise along the direction of projection. Solutions to this problem have been formulated as an optimization in some kernel feature space [9], [10], where $\mathbf{a}$ is expressed as

$$\mathbf{a} = \sum_k \alpha_k \phi_k, \qquad \alpha_k \in \mathbb{R}. \qquad (5)$$

The feature vector $\mathbf{a}$ lies in the span of a set of functions $\phi_i$. This method is called kernel Fisher discriminant and resembles support vector machine classifiers. This mechanism has been implemented in hyperspectral data analysis [11]. Linear and nonlinear feature-extraction methods have been developed and implemented based on kernel methods, including PCA [12]. Fisher DA as expressed in (1)–(3) was used for comparison purposes in terms of class separability in one experiment.

B. PP

PP is a mechanism that linearly projects data to a lower dimensional subspace while retaining most of the information in terms of optimizing a defined projection index. This technique is able to reduce the dimensionality of hyperspectral data, bypassing many of the challenges that a high-dimensional space introduces. For a mathematical definition, we define the following variables: $\mathbf{X}$, the original data set ($d \times N$); $\mathbf{Y}$, the projected data ($d_{sub} \times N$); and $\mathbf{A}$, an orthonormal projection matrix ($d \times d_{sub}$) that satisfies

$$\mathbf{Y} = \mathbf{A}^T \mathbf{X} \qquad (6)$$

where $N$ is the number of spectral samples, $d$ is the original dimensionality of the hyperspectral data, and $d_{sub}$ is the dimensionality of the projected data ($d_{sub} < d$). PP computes the matrix $\mathbf{A}$ by optimizing a projection index that is a function of $\mathbf{Y}$: $f(\mathbf{Y}) = f(\mathbf{A}^T\mathbf{X})$. Unless we state otherwise, bold uppercase letters will refer to matrices and bold lowercase letters will refer to vectors in the rest of this paper. Friedman and Tukey were the first to introduce the term PP [7]. Since then, many articles related to the topic have appeared. All publications that followed Friedman and Tukey's work are based on the same concept of optimizing a certain projection index. Jiménez and Landgrebe [7] proposed a supervised projection index based on the minimum Bhattacharyya

Fig. 1. (a) First principal component axis as an optimum separability projection. (b) First principal component axis as a poor separability direction.

distance, a class distance measure. Ifarraguerri and Chang [13] proposed the use of the information divergence index, which is an unsupervised version based on relative entropy. Chiang and Chang [14] presented projection indexes of higher order statistics for target detection problems. MacDonald et al. [15] used kernel methods applied to PP principles.

C. Principal Components

PCA is one of the unsupervised techniques most commonly used to reduce the dimensionality of hyperspectral images [16]. PCA is the discrete version of the Karhunen-Loève expansion [17]. This transformation is based on the eigenvalue/eigenvector decomposition of the covariance matrix of the whole data set. PCA can be defined as a type of PP that uses the data variance as the projection index to be optimized [14]. The projection matrix A contains the eigenvectors that correspond to the largest selected eigenvalues. PCA is optimum under the conditions of signals plus Gaussian noise [3], [14]. It does not work properly in the detection and classification of objects that are small relative to the scene [16, p. 144]. This is due to the small contribution that objects with a small number of pixels make to the covariance matrix of the total image data set. This method can find good class separability in its projections when the classes of interest are located along the largest eigenvector, the first principal axis [Fig. 1(a)]. PCA obtains a poor separability projection in the case that the classes are not distributed along


the principal axis [Fig. 1(b)]. Another helpful characteristic of this mechanism is the fact that it uses the whole data set to estimate the parameters. This implies that the ratio of samples per number of features is generally acceptable.
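As a brief illustration (not from the paper), the following NumPy sketch shows PCA viewed as a linear projection Y = A^T X onto the eigenvectors of the data covariance matrix, which is how the projection matrix A is formed in this method; the function name is hypothetical.

```python
import numpy as np

def pca_projection(X, d_sub):
    """PCA as a linear projection Y = A^T X, in the sense of eq. (6).

    X     : (d, N) data matrix, one pixel spectrum per column.
    d_sub : number of principal components to retain.
    Returns (A, Y) where A is d x d_sub and Y is d_sub x N.
    """
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                              # center the data
    cov = np.cov(Xc)                         # d x d covariance matrix of the whole data set
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :d_sub]          # eigenvectors of the largest eigenvalues
    Y = A.T @ Xc                             # projected (reduced) data
    return A, Y
```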

D. SVD Band Subset Selection

Another dimension-reduction problem of particular interest is band subset selection, where a subset of $d_{sub}$ bands is selected from the set of $d$ bands in a way that some measure of the information contained in the data subset is maximized (or the information loss is minimized). Let $\mathbf{x}$ be a $d$-dimensional random vector with zero mean and covariance matrix $\Sigma_X$. We wish to consider a linear dimension-reducing transformation of $\mathbf{x}$ to a $d_{sub}$-dimensional random vector $\mathbf{y}$ given by

$$\mathbf{y} = \mathbf{A}^T \mathbf{x} \qquad (7)$$

where $\mathbf{A}$ is a $d \times d_{sub}$ projection matrix with $d_{sub} < d$ and $\mathbf{A}^T\mathbf{A} = \mathbf{I}_{d_{sub}}$ (orthonormal). Thus, $\mathbf{y}$ is a $d_{sub}$-dimensional random vector. Notice that $\mathbf{y}$ is a zero-mean random vector with covariance

$$\Sigma_Y = \mathbf{A}^T \Sigma_X \mathbf{A}. \qquad (8)$$

Suppose that we want to reconstruct the random vector $\mathbf{x}$ using $\mathbf{y}$. We denote the prediction of $\mathbf{x}$ by $\hat{\mathbf{x}}$. The best linear mean-squared estimator of $\mathbf{x}$ from $\mathbf{y}$ is given by

$$\hat{\mathbf{x}} = \Sigma_{XY}\Sigma_Y^{-1}\mathbf{y} = (\Sigma_X\mathbf{A})\left(\mathbf{A}^T\Sigma_X\mathbf{A}\right)^{-1}\mathbf{y}. \qquad (9)$$

This estimate is the optimal mean-squared Bayes estimator for the case in which $\mathbf{x}$ is a Gaussian random vector. The covariance of the estimate is given by

$$\Sigma_{\hat{X}} = \Sigma_X\mathbf{A}\left(\mathbf{A}^T\Sigma_X\mathbf{A}\right)^{-1}\mathbf{A}^T\Sigma_X. \qquad (10)$$

Note that $\Sigma_{\hat{X}}$ is a $d \times d$ singular matrix of rank $d_{sub} < d$. Therefore, the transformation from $\mathbf{x}$ to $\mathbf{y}$ and back to $\hat{\mathbf{x}}$ involves a loss of information. Let $\mathbf{e} = \mathbf{x} - \hat{\mathbf{x}}$ be the reconstruction error; the error covariance is given by

$$\Sigma_e = \Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}^T = \Sigma_X - \Sigma_X\mathbf{A}\left(\mathbf{A}^T\Sigma_X\mathbf{A}\right)^{-1}\mathbf{A}^T\Sigma_X = \Sigma_X - \Sigma_{\hat{X}}. \qquad (11)$$

From this result, it is clear that the difference between $\Sigma_{\hat{X}}$ and $\Sigma_X$ is a measure of the loss of information resulting from the dimension-reduction process. Therefore, we can think of selecting the orthonormal projection matrix $\mathbf{A}$ to minimize the loss of information in dimension reduction by minimizing some measure of this difference [1]. The optimal solution to this problem for any unitarily invariant norm is given by the first $d_{sub}$ principal components of $\mathbf{x}$. Although the principal components are optimal in the mean-square sense, the resulting projection vector $\mathbf{y}$ is a linear combination of the original bands. Therefore, computation of the principal components requires processing the full hyperspectral data cube, which implies that most of the tradeoffs discussed previously will be difficult to achieve.

The band subset selection problem can be framed in the approximation framework discussed previously by further restricting the projection matrix $\mathbf{A}$ as follows:

$$\mathbf{A} = \mathbf{P}\begin{bmatrix}\mathbf{I}_{d_{sub}} \\ \mathbf{0}\end{bmatrix} \qquad (12)$$

where $\mathbf{P}$ is a permutation matrix. The net effect of this constraint is that the dimension-reducing transformation $\mathbf{A}$ now selects a subset of the original variables in the vector $\mathbf{x}$ as follows:

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} = \left[\,\mathbf{I}_{d_{sub}} \;\; \mathbf{0}\,\right]\mathbf{P}^T\mathbf{x} = \left[x_{i_1},\, x_{i_2},\, \ldots,\, x_{i_{d_{sub}}}\right]^T. \qquad (13)$$
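To make the information-loss measure in (9)–(13) concrete, here is a small sketch (not from the paper; NumPy, with a synthetic covariance matrix) that compares the trace of the error covariance of (11) for a PCA projection and for a plain band subset of the form (12)–(13).

```python
import numpy as np

def error_covariance(Sigma_X, A):
    """Reconstruction-error covariance of eq. (11) for an orthonormal A (d x d_sub)."""
    M = Sigma_X @ A @ np.linalg.inv(A.T @ Sigma_X @ A) @ A.T @ Sigma_X
    return Sigma_X - M

rng = np.random.default_rng(0)
d, d_sub = 20, 5
B = rng.standard_normal((d, d))
Sigma_X = B @ B.T                            # synthetic positive-definite covariance

# PCA projection: eigenvectors of the d_sub largest eigenvalues.
w, V = np.linalg.eigh(Sigma_X)
A_pca = V[:, ::-1][:, :d_sub]

# Band subset projection, eqs. (12)-(13): keep the first d_sub bands.
A_bands = np.eye(d)[:, :d_sub]

print("trace of error covariance, PCA  :", np.trace(error_covariance(Sigma_X, A_pca)))
print("trace of error covariance, bands:", np.trace(error_covariance(Sigma_X, A_bands)))
```

For the PCA projection the trace equals the sum of the discarded eigenvalues, which is the minimum achievable; any band subset can only do as well or worse, which motivates the SVD-based selection described next.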

The selection of a subset of bands has several interesting advantages, since we are retaining the physical meaning of the data in order to: 1) maximize human understanding; 2) combine spectral data with other data types; and 3) exploit physical modeling/simulation. The selection of the $d_{sub}$ optimal bands is still a combinatorial optimization problem with a very large solution space. For instance, selecting ten out of 210 bands (as in HYDICE) results in searching approximately $3.7 \times 10^{16}$ possibilities. This problem can be tackled using standard search mechanisms for combinatorial optimization problems, which are quite time consuming. We have implemented a heuristic algorithm for band selection that is based on approximating the principal components using subspace approximation methods, which give suboptimal solutions but in significantly less time. This is because we use matrix decompositions that can be computed in polynomial time. This approach is based on the subset selection method described in [18]. The basic idea behind the approach of [19] is that, under a certain condition, the subspace spanned by a subset of $d_{sub}$ sufficiently independent columns of a matrix will approximate the subspace spanned by the first $d_{sub}$ principal components of the same matrix. To search for a subset of sufficiently independent columns, rank-revealing decompositions such as the SVD [18] and rank-revealing QR factorizations can be used. Here, we focus on using the SVD method, which has been shown in the linear algebra literature to produce the best approximation [19]. The SVD subset selection algorithm is summarized in the following.

1) Construct a matrix representation of the hyperspectral cube as follows. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$, where $N$ is the number of pixels in the image. Notice that each row of the image corresponds to a band of the cube and each column corresponds to a measured pixel spectral signature.
2) Construct the normalized matrix $\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N]$, which is obtained by subtracting the mean from each pixel and normalizing each band to unit variance as follows:
$$\mathbf{z} = \mathbf{D}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}_x)$$
where $\mathbf{D} = \mathrm{diag}\{\sigma_{x_1}^2, \sigma_{x_2}^2, \ldots, \sigma_{x_d}^2\}$ and $\sigma_{x_i}^2$ is the variance of the $i$th band.
3) Compute the SVD of $\mathbf{Z}$ and determine its rank $d_{sub}$ (you can determine $d_{sub}$ in this form or use an a priori value).
4) Compute the QR factorization with pivoting of the matrix $\mathbf{V}_1^T$, that is, $\mathbf{V}_1^T\mathbf{P} = \mathbf{Q}\mathbf{R}$, where $\mathbf{V}_1$ is formed by the first $d_{sub}$ left singular vectors of $\mathbf{Z}$, $\mathbf{P}$ is the pivoting matrix of the QR factorization, and $\mathbf{Q}$ and $\mathbf{R}$ are the QR factors.
5) Compute $\tilde{\mathbf{X}} = \mathbf{P}^T\mathbf{X}$, where $\mathbf{P}$ is the pivoting matrix from the QR factorization in step 4).
6) Take the first $d_{sub}$ rows of $\tilde{\mathbf{X}}$ as the selected bands.

A sketch of these steps is given below.
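The following is a minimal sketch of steps 1)–6) (not the authors' MATLAB implementation; NumPy/SciPy assumed, with `scipy.linalg.qr(..., pivoting=True)` providing the column-pivoted QR):

```python
import numpy as np
from scipy.linalg import qr, svd

def svd_band_subset(X, d_sub):
    """SVD subset selection (SVDSS): return d_sub band indices of X (d bands x N pixels)."""
    # Step 2: center each band and scale it to unit variance.
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    Z = (X - mu) / sigma

    # Step 3: SVD of Z; U holds the left singular vectors.
    U, s, Vt = svd(Z, full_matrices=False)
    V1 = U[:, :d_sub]                       # first d_sub left singular vectors (d x d_sub)

    # Step 4: column-pivoted QR of V1^T; the pivot order ranks the bands.
    Q, R, piv = qr(V1.T, pivoting=True)

    # Steps 5-6: the first d_sub pivots are the selected band indices.
    return np.sort(piv[:d_sub])

# Usage sketch on random data standing in for a hyperspectral cube (d bands, N pixels).
rng = np.random.default_rng(1)
cube = rng.standard_normal((30, 500))
print(svd_band_subset(cube, d_sub=5))
```

Returning sorted indices is only for readability; the pivot order itself ranks the bands by how independent they are.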

The impact of all the unsupervised feature-extraction methods discussed previously on the unsupervised classification process will be studied. In the rest of this paper, PP will refer to its use with the information divergence function, as will be explained in the next section. DA, a supervised method, will only be used in one experiment to show the advantages of using PP.

III. IDPP

Recently, Ifarraguerri and Chang [13] proposed a method that uses the divergence between the estimated probability density function of the projected data and the Gaussian distribution as a projection index. This method is unsupervised, and it extracts the required information, the estimated probability density function of the projected data, from the image itself without any a priori information [13]. This method starts by sphering the whole data set $\mathbf{X}$. The sphering procedure is an orthogonal transformation that "standardizes" the data by subtracting its mean (centering) and whitening it. The sphering procedure is a preprocessing step. Let $\mathbf{X}$ be expressed as $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N]$, where each column is a pixel vector, and let the matrix $\mathbf{Z}$ be expressed as $\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \ldots, \mathbf{z}_N]$, whose columns are the $N$ sphered pixel vectors. For a mathematical definition, the sphered column $\mathbf{z}_i$ of $\mathbf{Z}$ is defined as

$$\mathbf{z}_i = \mathbf{D}^{-1/2}\mathbf{U}^T(\mathbf{x}_i - \boldsymbol{\mu}_x) \qquad (14)$$

where $\mathbf{D}$ is the diagonal matrix containing the eigenvalues of the estimated covariance matrix of the image, $\mathbf{U}$ is the matrix containing the column eigenvectors associated with those eigenvalues, and $\boldsymbol{\mu}_x$ is the mean of the columns of the image $\mathbf{X}$. Note that the resulting mean of the columns of $\mathbf{Z}$ is the vector $\mathbf{0} = [0, 0, \ldots, 0]^T$, and its covariance matrix is equal to the identity matrix $\mathbf{I}_d$. The projection index, the function to be optimized, used in this method is known as the information divergence index because it calculates the relative entropy between probability density functions. In information theory, entropy is defined as a measure of the amount of "information" available in the data set [20]. To better understand this method, let us use the notation in (6), $\mathbf{Y} = \mathbf{a}^T\mathbf{Z}$, where $\mathbf{a}$ is a $d \times 1$ vector. The projected columns


data in $\mathbf{Y}$ will be one dimensional and represent the projected data. Let $y$ be the random variable that generates the columns of the projected data $\mathbf{Y}$. The projection index uses the estimated continuous probability density function of the projected data, $f(y)$, and the continuous probability density function $g(y)$ of the distribution from which we desire the projection to diverge the most. In order to give the definition of this projection index, there are other measures we must introduce first. The continuous relative entropy of $f(y)$ with respect to $g(y)$ is given by

$$d(f\|g) = \int_{-\infty}^{\infty} f(y)\,\ln\!\left[\frac{f(y)}{g(y)}\right]dy. \qquad (15)$$

The continuous absolute information divergence between $f(y)$ and $g(y)$ is expressed by the following [20]:

$$i(f, g) = d(f\|g) + d(g\|f). \qquad (16)$$

This index is symmetric and nonnegative. When the value of the index increases, the two distributions are said to diverge more. The minimum value is zero, and this value occurs when both distributions are the same; $f(y) = g(y)$. If we define $g(y)$ to be a normal Gaussian distribution, $N(0, 1)$, we can compute the divergence of $f(y)$ from that Gaussian probability density function. We must estimate the distribution $f(y)$ from the projected image data in order to compute the divergence. Given the fact that we are using a discrete estimate of $f(y)$, namely the vector $\mathbf{p}$ obtained from the data, we can simplify the information divergence computation using a discrete approximation of the Gaussian distribution as well. Ifarraguerri and Chang use a quantization technique for the discrete approximation of $g(y)$. They select a number of bins $n$ and a width $\Delta y$ to approximate the integral of a standard normal distribution

$$q_i = \frac{1}{\sqrt{2\pi}}\int_{i\Delta y}^{(i+1)\Delta y} e^{-t^2/2}\,dt. \qquad (17)$$

The vector $\mathbf{q}$ is the discrete approximation of $g(y)$. The index $i$ of $\mathbf{q}$ ranges from $-n/2$ to $n/2$. The discrete estimate of $f(y)$, the vector $\mathbf{p}$, is created using a histogram of the projection with the same number of bins $n$. The bins must cover the range $[-5\sigma, 5\sigma]$ (where $\sigma$ is the standard deviation) in order to cover most of the Gaussian shape. The authors stated that the optimal width $\Delta y$ is close to 1 for 1000 samples and decreases with an increasing sample size. Once the vectors $\mathbf{q}$ and $\mathbf{p}$ are generated, the relative entropy for discrete distributions can be used. The relative entropy for discrete distributions is defined as [20]

$$D(\mathbf{p}\|\mathbf{q}) = \sum_i p_i \log\!\left(\frac{p_i}{q_i}\right) \qquad (18)$$

where $q_i$ and $p_i$ are the $i$th components of the column vectors $\mathbf{q}$ and $\mathbf{p}$, which are the discrete approximations of $g(y)$ and $f(y)$, respectively. The discrete absolute information divergence that is implemented is defined by the following expression:

$$I(\mathbf{p}, \mathbf{q}) = D(\mathbf{p}\|\mathbf{q}) + D(\mathbf{q}\|\mathbf{p}). \qquad (19)$$
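A minimal sketch of (17)–(19) is given below (not from the paper; NumPy/SciPy assumed, function name hypothetical). It builds p from a histogram of the projected data, q from the binned normal distribution, and evaluates the symmetric divergence.

```python
import numpy as np
from scipy.stats import norm

def information_divergence(y, n_bins=41, sigma_range=5.0):
    """Discrete absolute information divergence I(p, q) of eqs. (17)-(19).

    y : 1-D array with the projected (sphered) data a^T Z.
    """
    # Bin edges covering [-5*sigma, 5*sigma] of the projected data.
    sigma = y.std()
    edges = np.linspace(-sigma_range * sigma, sigma_range * sigma, n_bins + 1)

    # p: histogram estimate of f(y); q: Gaussian mass in each bin, as in eq. (17).
    p, _ = np.histogram(y, bins=edges)
    p = p / p.sum()
    q = norm.cdf(edges[1:] / sigma) - norm.cdf(edges[:-1] / sigma)
    q = q / q.sum()

    # Avoid log(0): restrict to bins where both estimates are positive (a simplification).
    mask = (p > 0) & (q > 0)
    d_pq = np.sum(p[mask] * np.log(p[mask] / q[mask]))   # D(p||q), eq. (18)
    d_qp = np.sum(q[mask] * np.log(q[mask] / p[mask]))   # D(q||p)
    return d_pq + d_qp                                    # I(p, q), eq. (19)
```

IDSS (Section V) applies this same index band by band and keeps the dsub bands with the largest values.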


Fig. 2. Ifarraguerri and Chang procedure that considers all pixels as possible candidates for projections of the whole data set.

The implementation of this method, named IDPP, as proposed by Ifarraguerri and Chang, yields a search over all pixels, after the sphering process, as possible candidates for projections of the whole data set. Fig. 2 shows that every sample could represent the direction of the projection. Therefore, it makes an exhaustive search through the entire image data set. The number of projections that has to be made to obtain one that maximizes the projection index equals the number of pixels available in the image. If we consider the size of hyperspectral data, this calculation can be extremely time consuming and does not guarantee an optimal projection.

IV. OIDPP

It is important to note that, when using the information divergence index, we are maximizing the difference between the distribution of the projected data and the Gaussian distribution. In this section, the use of a numerical optimization algorithm is proposed as an alternative to the exhaustive search followed by Ifarraguerri and Chang's approach. This modification is named the OIDPP.

A. Finding Projections

1) Basic Algorithm: Using the information divergence index defined in (19), we can implement the PP using the following algorithm.
1) Select from the data the desired starting direction and create a vector p.
2) Create the discrete approximation of a Gaussian distribution, that is, the vector q, by means of a Gaussian random variable with distribution N(0, 1).
3) Maximize I(p, q) using a numerical optimization routine to obtain the optimum projection vectors that will form the projection matrix A.
4) Project the data using A to reduce the dimensionality.
A detailed explanation of how to use this algorithm and how to create the vectors p and q is given in the following paragraphs.

2) Initial Step: Data Sphering: In order to simplify the correlation structure of the data and reduce the contribution of

Fig. 3. Selection of vector p: Histogram of standardized data projection. Horizontal axes have values of Y = aT Z. Vertical axes have number of samples on a particular window of the histogram.

outliers [8], [21], the first step is to sphere the data. Define $\mathbf{D}$ as the diagonal matrix of eigenvalues and $\mathbf{U}$ as the matrix whose columns are the eigenvectors of the estimated covariance matrix of the image that correspond to the eigenvalues in $\mathbf{D}$. The matrix $\mathbf{X}$ in (6) is first transformed to a sphered matrix $\mathbf{Z}$ by the orthogonal transformation in (14). As previously explained, $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N]$, where each column is a pixel vector, and $\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \ldots, \mathbf{z}_N]$, whose columns are the $N$ sphered pixel vectors. For every pixel, each value in the column is associated with a corresponding feature. Let $Z_{i,j}$ be the $j$th pixel in the $i$th feature. For example, $Z_{1,2}$ is the value of the second pixel in the first feature.

3) Constructing A: Once the matrix $\mathbf{Z}$ is created, we need to construct the projection matrix $\mathbf{A}$ to project $\mathbf{Z}$ to a lower dimensional subspace $\mathbf{Y}$. For this task, we must select the vector $\mathbf{a}_{opt}$ that maximizes the information divergence index between the distribution $\mathbf{p}$ of the projected data ($\mathbf{Y} = \mathbf{a}^T\mathbf{Z}$) and our vector $\mathbf{q}$, which is an approximation to the Gaussian distribution. In what follows, we explain in more detail how to create the vectors $\mathbf{p}$ and $\mathbf{q}$.

Estimates of the p and q vectors: Let us define the vector $\mathbf{p}$ in (18) and (19) as a vector that approximates the distribution of the projected data. In our case, we obtain $\mathbf{p}$ from an initial direction $\mathbf{a}_0$ extracted from the data. After obtaining the direction $\mathbf{a}_0$, we create $\mathbf{p}$ by taking the histogram of the projection $\mathbf{a}_0^T\mathbf{Z}$. Fig. 3 shows the estimated distribution, a histogram, of the projected data from which $\mathbf{p}$ is created. For the case of creating a vector $\mathbf{q}$ with the same size as $\mathbf{p}$, a Gaussian random variable with distribution $N(0, 1)$ is used. The vector $\mathbf{q}$ is created by estimating a histogram of that Gaussian random variable. An example of the selection of the vector $\mathbf{q}$ can be seen in Fig. 4.

Finding the projection vector aopt: From the previous definitions and descriptions, let us identify our projection vector $\mathbf{a}_{opt}$ as

$$\mathbf{a}_{opt} = \arg\max_i \; I\!\left(f(\mathbf{a}_i^T\mathbf{Z}),\, \mathbf{q}\right). \qquad (20)$$
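As an illustrative sketch only of how (20) can be approached numerically (the authors used MATLAB's fminunc; here SciPy's derivative-free minimizer is applied to the negative index, and the divergence computation from the sketch after (19) is inlined so the block is self-contained):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def divergence_index(a, Z, n_bins=41):
    """I(p, q) of eq. (19) for the projection y = a^T Z (a is normalized first)."""
    a = a / np.linalg.norm(a)
    y = a @ Z
    edges = np.linspace(-5.0, 5.0, n_bins + 1)          # sphered data has roughly unit variance
    p, _ = np.histogram(y, bins=edges)
    p = p / p.sum()
    q = norm.cdf(edges[1:]) - norm.cdf(edges[:-1])
    q = q / q.sum()
    mask = (p > 0) & (q > 0)
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) + \
           np.sum(q[mask] * np.log(q[mask] / p[mask]))

def find_projection(Z, a0=None):
    """Numerically maximize (20) starting from an initial direction a0."""
    if a0 is None:
        a0 = Z[:, 0]                                     # a data sample as the start; the paper uses mean(Z)
    # Derivative-free method, since the histogram-based index is not smooth.
    res = minimize(lambda a: -divergence_index(a, Z), a0, method="Nelder-Mead")
    a_opt = res.x / np.linalg.norm(res.x)                # eq. (21): keep only the direction
    return a_opt
```

The optimizer treats the histogram-based index as a black box, which is the same role played by the authors' fminunc call.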

We used a numerical optimization routine to find aopt . The routine used is the MATLAB “fminunc” function that will be


described below. The initial value in the optimization routine is given by $\mathbf{a}_0 = \mathrm{mean}(\mathbf{Z})$. Observe that $\mathbf{p} = f(\mathbf{a}^T\mathbf{Z})$, the estimated distribution of the projected data. The vector $\mathbf{a}_{opt}$ is normalized, since we are only interested in its direction:

$$\mathbf{a}_{opt} = \frac{\mathbf{a}_{opt}}{\|\mathbf{a}_{opt}\|_2}. \qquad (21)$$

Fig. 4. Selection of vector q: Histogram of the Gaussian random variable with distribution N(0, 1).

Constructing A iteratively: The matrix $\mathbf{A}$ is constructed iteratively. Let us define the structure of the matrix $\mathbf{A}$ as follows:

$$\mathbf{A} = [\,\mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots\,] \qquad (22)$$

where $\mathbf{a}_k$ is the $k$th column of $\mathbf{A}$ and is the column that will produce the $k$th projection of the data set $\mathbf{Z}$. The range of possible values of the index $k$ is $1 \leq k < d$, where $d$ is the full dimensionality of $\mathbf{Z}$. The index should be less than $d$ in order to have a reduction of dimensionality. The notation $\mathbf{A}^{(i)}$ refers to the $i$th iteration in the construction of $\mathbf{A}$, where the last column is the column $\mathbf{a}_i$:

$$\mathbf{A}^{(i)} = [\,\mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_i\,]. \qquad (23)$$

As an example, $\mathbf{A}^{(1)} = [\mathbf{a}_1]$, $\mathbf{A}^{(2)} = [\mathbf{a}_1\;\mathbf{a}_2]$, $\mathbf{A}^{(3)} = [\mathbf{a}_1\;\mathbf{a}_2\;\mathbf{a}_3]$, etc. The first iteration, which coincides with the first column in the projection matrix $\mathbf{A}$, consists of the vector $\mathbf{a}_{opt}$ ($\mathbf{A}^{(1)} = [\mathbf{a}_{opt}]$). The other columns of $\mathbf{A}$ ($\mathbf{a}_i$ for $1 < i \leq d$) are obtained by an iterative process. The iterative process consists of generating a matrix $\mathbf{Z}^{(i)}$ by projecting $\mathbf{Z}^{(i-1)}$ onto a subspace orthogonal to the columns of $\mathbf{A}^{(i-1)}$. After finding $\mathbf{Z}^{(i)}$, we repeat the process in (20) to find the next column of $\mathbf{A}$. In general, the $i$th iteration to obtain $\mathbf{Z}^{(i)}$ can be stated as

$$\mathbf{Z}^{(i)} = \mathbf{S}^T_{\perp\mathbf{A}^{(i-1)}}\,\mathbf{Z}^{(i-1)} \qquad (24)$$

where $\mathbf{S}^T_{\perp\mathbf{A}^{(i-1)}} = \mathbf{I} - \mathbf{A}^{(i-1)}\mathbf{A}^{(i-1)+}$ is the projection onto the subspace orthogonal to $\mathbf{A}^{(i-1)}$, $\mathbf{A}^{(i)}$ is the matrix whose columns are the optimized projection vectors obtained up to the $i$th iteration, $\mathbf{I}$ is the identity matrix, and $\mathbf{A}^{(i-1)+}$ is the pseudoinverse of the matrix $\mathbf{A}^{(i-1)}$. As a result of the iterative process, the rank of $\mathbf{Z}^{(i)}$ is reduced by a factor of $i$. As a consequence of this rank reduction of $\mathbf{Z}^{(i)}$, the maximum number of projections that this process is able to find equals the number of iterations that reduces the rank of $\mathbf{Z}^{(i)}$ to zero. In the ideal case, this amount is equal to the original number of bands ($= d$). A sketch of this deflation loop is given at the end of this section.

B. Description of the Optimization Routine

As explained before, a numerical optimization routine was used to find $\mathbf{a}_{opt}$. MATLAB was used for this task. The function named "fminunc" was chosen due to its simplicity. This function attempts to find a minimum of a scalar function of several variables, starting at an initial estimate. This is generally referred to as unconstrained nonlinear optimization. There is neither "free" parameter selection nor parameter tuning. In this particular case, we created an objective function that accepts an initial parameter (our projector candidate) and returns the information divergence index between the distribution of the projection ($\mathbf{p}$) and the distribution of the Gaussian vector ($\mathbf{q}$).
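A minimal sketch of the full iterative construction of A in (22)–(24) follows (illustrative only, not the authors' implementation). The direction finder is passed in as a parameter so the block is self-contained; it could be, for example, the find_projection sketch given after (20).

```python
import numpy as np

def oidpp_projection_matrix(Z, d_sub, find_direction):
    """Build A = [a_1 ... a_d_sub] by the deflation of eqs. (22)-(24).

    Z              : (d, N) sphered data matrix.
    d_sub          : number of projections to extract (d_sub < d).
    find_direction : callable returning a unit vector that maximizes (20)
                     for a given data matrix.
    """
    d = Z.shape[0]
    A = np.empty((d, 0))
    Z_i = Z.copy()
    for _ in range(d_sub):
        a = find_direction(Z_i)                  # maximize the divergence index, eq. (20)
        A = np.column_stack([A, a])              # grow A one column at a time, eqs. (22)-(23)
        # Deflate: project the data onto the subspace orthogonal to the columns of A, eq. (24).
        # (Deflating the original Z with the full A is equivalent to the recursive form.)
        S = np.eye(d) - A @ np.linalg.pinv(A)
        Z_i = S @ Z
    return A

# The reduced data is then Y = A.T @ Z, as in eq. (6).
```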

V. INFORMATION DIVERGENCE BAND SELECTION

As mentioned earlier, the advantage of band subset selection methods is that they maintain the physical structure of the data; that is, we remain with the original values measured by the instrument instead of a linear transformation. Band subset selection methods select a set of spectral bands that maximize certain measurements of separability between objects. The separability criterion used in the band subset selection method proposed in this paper is based on the information divergence index previously introduced.

A. Algorithm for IDSS

Band subset selection can be implemented using the information divergence index through the following algorithm.
1) Create a vector p for every spectral band.
2) Create the discrete approximation of g(y), the vector q.
3) Select the dsub bands with the maximum indexes I(p, q).
4) Reduce the dimensionality of the data by using only the selected dsub (dsub < d) bands.
Note that the IDSS will select the dsub bands with the least Gaussian structure according to the information divergence index.

VI. EXPERIMENTS

The experiments were designed to compare the results of the unsupervised feature-extraction techniques explained earlier: PCA, SVDSS, IDPP, OIDPP, and IDSS. There were five different experiments applied to four different data sets. The data sets in each experiment were projected to a lower dimensional feature space. The results were compared in terms of class separability or classification accuracy. The first data set used was computer-generated bivariate data. Although the scope of this paper


is limited to unsupervised feature extraction, the supervised feature-extraction method DA was used only in this experiment for validation purposes in terms of class separability. It enables us to observe the potential of these unsupervised feature-extraction methods for classification purposes. The second data set used was remotely sensed multispectral data. These data were acquired from the Airborne Thematic Mapper (ATM), which possesses seven bands. The experiments consisted of applying feature extraction and unsupervised classification. The third data set used was hyperspectral data collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS), with 224 bands, at Kennedy Space Center. Two experiments were performed on these data, which consisted of applying the unsupervised feature-extraction techniques to a whole set of training samples to compute the matrix A. After the dimensionality of the data was reduced, we used a supervised classifier to classify testing samples. One experiment used training samples that were difficult to classify, and the other used more separable labeled samples. The fourth data set used was hyperspectral data collected by AVIRIS at Indiana's NW Indian Pines Test Site. The objective was to study the capabilities of each studied unsupervised feature-extraction technique to enable the detection of small objects, small in terms of the number of pixels with respect to the whole data set. In this data set, we applied PCA, SVDSS, OIDPP, and IDSS to the whole data set and then, using the training samples, we applied a supervised ML classifier to the testing samples. These experiments were conducted using a MATLAB toolbox for hyperspectral image analysis developed in LARSIP at UPRM [22]. The experiments are explained in detail in the following sections.

A. Computer-Generated Bivariate Data

Gaussian bivariate data were generated. This data set is a matrix of two rows by 2000 columns. Each column vector represents a sample. The number of rows is equal to the number of features. The whole data set consisted of two clusters of 1000 samples each, with different mean vectors and identical covariance matrices. The experiments conducted consisted of linearly projecting the data from a two- to a one-dimensional feature space. In that subspace, we analyzed the separability of the clusters obtained by the different unsupervised techniques. A supervised feature-extraction method, DA, was used for validation purposes. Results are shown in Fig. 5. Each subfigure consists of a scatter plot of the bivariate data and the estimated direction of the projection for each particular method. Observe that the scatter plots of OIDPP and IDPP appear different from the rest due to the sphering process of (14). Table I shows the results in terms of Bhattacharyya distances of the projected clusters as a measure of separability. The Bhattacharyya distance was used due to the fact that it is related to the classification accuracy probability [3]. If we observe the projected results for the simulated data set and their corresponding Bhattacharyya distances, the unsupervised mechanisms PCA (Fig. 5) and SVDSS provide the results with the least cluster separability. This is due to the fact that

the means are not located along the principal component, as discussed previously. OIDPP and IDPP find the best direction to project the data in terms of providing the best class separability, even while being unsupervised mechanisms. The Bhattacharyya distance of OIDPP is optimum when it is compared to IDPP, due to its numerical optimization procedure. We can also see that IDSS [Fig. 5(e) and Table I] selects the feature that maximizes the information divergence and the Bhattacharyya distance under the restriction of being a band selection algorithm. As expected, the SVDSS succeeds in finding the best approximation to PCA available in the original feature set. If we use the results of the supervised feature-extraction DA method [Fig. 5(f)] to compare these unsupervised results, we can observe that OIDPP and IDPP are the unsupervised methods most capable of obtaining a direction in which the classes are separable. From these results, we can clearly see that OIDPP will obtain a favorable direction for projection, preserving the separability of objects in terms of Bhattacharyya distance, even in cases where conventional unsupervised methods encounter difficulties and under the condition of not having any a priori information in terms of labeled samples. IDSS also proved to select the feature that best separates the objects present.

B. Remotely Sensed Multispectral Data: ATM

The ATM multispectral image data set used for this experiment consists of 409 rows by 565 columns by seven bands. It was taken at Añasco, a town located in a well-known region on the west coast of Puerto Rico. Fig. 6 shows a true color RGB composite from bands 1, 2, and 3. There are several known spatial characteristics in this image: sediment in the coastal waters due to the river discharge in the upper left corner of the image, agricultural fields on the right side of the image, and some urban zones, including a small airport, on the right. The experiments conducted using the ATM image consisted of reducing the dimensionality of the data from their original number of seven bands to two features using the unsupervised mechanisms of PCA, SVDSS, OIDPP, IDPP, and IDSS. Using only two features enables us to study and compare not only the clustering results but also the histograms of every feature for each method without multiplying excessively the number of figures. The lower dimensional images were then classified using an unsupervised classification mechanism with no labeled samples: C-Means with covariance clustering. This clustering algorithm is similar to C-Means, but instead of using the minimum Euclidean distance classifier in its iterative process, it uses the maximum-likelihood (ML) classifier. As a consequence, each cluster is assumed to have a different mean and a different covariance, which are computed in every iteration. This mechanism enables better extraction of information, in terms of class maps, than traditional C-Means [23]. The number of clusters was seven in all of the experiments. The results consist of class maps produced by the clustering algorithm, histograms of the whole data set at every projected feature, and the minimum Bhattacharyya distance between the clusters for each method. Results are displayed in Figs. 6–11 and in Table II.

Fig. 5. Experiment A, scatter plots, and linear projections of (a) PCA, (b) SVDSS, (c) OIDPP, (d) IDPP, (e) IDSS, and (f) DA.

TABLE I EXPERIMENTS WITH COMPUTER-GENERATED BIVARIATE DATA: BHATTACHARYYA DISTANCE FOR THE DIFFERENT FEATURE-EXTRACTION TECHNIQUES
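Tables I and II report (minimum) Bhattacharyya distances as the separability measure; the closed form is not written out in the text, so here is a minimal sketch of the standard Bhattacharyya distance between two Gaussian clusters, which is assumed to be the measure intended (cf. [3]).

```python
import numpy as np

def bhattacharyya_distance(m1, S1, m2, S2):
    """Bhattacharyya distance between two Gaussian clusters N(m1, S1) and N(m2, S2)."""
    S = 0.5 * (S1 + S2)                               # average covariance
    dm = m1 - m2
    term_mean = 0.125 * dm @ np.linalg.solve(S, dm)   # mean-separation term
    term_cov = 0.5 * np.log(np.linalg.det(S) /
                            np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mean + term_cov
```

The minimum Bhattacharyya distance reported for the clustering experiments is then the smallest pairwise value over all cluster pairs, computed from each cluster's sample mean and covariance.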

The results of this experiment, specifically the histograms of the first feature obtained using OIDPP [Fig. 12(c)], have a less Gaussian structure than those from the other methods. If we also observe Table II, we can see that the OIDPP obtained the highest value of the minimum Bhattacharyya distance between two clusters. These results suggest that the OIDPP accomplished the task of finding a projection that preserves the separability of the unknown objects present in the image. Although the IDPP follows OIDPP in terms of minimum Bhattacharyya distance, Fig. 12(d) shows that its histogram is not as spread out as that of OIDPP in Fig. 12(c). This signifies that the data set and, as a consequence, the clusters are much more separable using the OIDPP. The classification maps of ATM demonstrate how OIDPP and IDPP are able to preserve information about the sediment that arrives at the coastal waters through the river discharge in the upper left side of the clustering map. Other areas, such as urban zones, agricultural fields, and woods, are best identifiable in the OIDPP results. The classification map using IDSS (Fig. 11) obtained similar results to OIDPP in terms of the clustering map and the appearance of sediments at the coastal waters, and at the same time IDSS obtained the smallest value of the minimum Bhattacharyya distance. This may imply that the IDSS may not work well to uncover very close clusters in a multiclass problem, although it works better than PCA and SVDSS for moderately separated groups. It is interesting to notice that, again, the SVDSS method approximates the results obtained using the PCA. The results demonstrate that OIDPP and IDPP are


Fig. 10. Clustering map: IDPP feature extraction.

Fig. 11. Clustering map: IDSS feature extraction.

Fig. 6. ATM image, Añasco, PR.

Fig. 7. Clustering map: PCA feature extraction.

TABLE II EXPERIMENTS WITH ATM DATA: MINIMUM BHATTACHARYYA DISTANCE FOR DIFFERENT FEATURE-EXTRACTION TECHNIQUES

excellent methods for unsupervised dimensionality reduction of hyperspectral data.

C. Remotely Sensed AVIRIS Hyperspectral Data: Applying the Feature-Extraction Techniques to Training Samples

Fig. 13 shows an RGB composite of an AVIRIS image from the Kennedy Space Center in Florida. The image size is 397 rows by 268 columns. Note the contrast between urban zones, roads, and soil. Two experiments were performed with these hyperspectral data. Both of them consisted of constructing the projection matrix A with the use of a reduced number of pixels: a set of all training samples from all classes. The difference consists in the level of separability of the labeled samples. The first training set contained data that are spectrally close; as a consequence, they are very difficult to classify. The second set of labeled samples, although it consists of the same classes as the first, contains data that are more separable in terms of classification accuracy. Both sets of labeled samples used for training and testing were obtained from knowledge of the area and from unsupervised classification performed on the image

Fig. 9. Clustering map: OIDPP feature extraction.

Fig. 13 shows an RGB composite of an AVIRIS image from the Kennedy Space Center in Florida. The image size is 397 rows by 268 columns. Note the contrast between urban zones, roads, and soil. Two experiments were performed with this hyperspectral data. Both of them consisted of constructing the projection matrix A with the use of a reduced amount of pixels: a set of all training samples from all classes. The differences consist on the level of separability of the labeled samples. The first training set contained data that are spectrally closed; as a consequence, they are very difficult to classify. The second set of labeled samples, although consists of the same classes than the first, contains data more separable in terms of classification accuracy. Both sets of labeled samples used for training and testing were obtained from knowledge of the area and unsupervised classification performed on the image


Fig. 12. Histogram of data at first projected feature. (a) PCA, (b) SVDSS, (c) OIDPP, (d) IDPP, and (e) IDSS.

Fig. 13. Kennedy Space Center image with ground truth information.

previously. Areas where the labeled fields were obtained are shown in Fig. 13 as well.

1) Labeled Samples With Less Separability: The labeled samples used in these experiments are spectrally closer and, as a consequence, are more difficult to separate in terms of classification accuracy. Table III shows the number of training and testing samples per class. The mechanisms of PCA, OIDPP, IDPP, SVDSS, and IDSS were used to reduce the dimensionality by doing all the computations based only on the data from the training samples. All the training samples were gathered in one file, and the unsupervised techniques were applied to them, thus computing the matrix A that reduces the dimensionality of the image from its original size to the lower dimensional subspace where the classification process was applied. For this particular experiment, the data set was projected from full


TABLE III EXPERIMENTS WITH AVIRIS KENNEDY SPACE CENTER DATA: CLASSES AND NUMBER OF TRAINING AND TESTING SAMPLES WITH LESS SEPARABILITY

TABLE V EXPERIMENTS WITH AVIRIS KENNEDY SPACE CENTER DATA: CLASSES AND NUMBER OF TRAINING AND TESTING SAMPLES WITH MORE SEPARABILITY

TABLE IV EXPERIMENTS WITH AVIRIS KENNEDY SPACE CENTER DATA TESTING SAMPLES: CONFUSION MATRICES AND OVERALL ACCURACIES

Fig. 14. Experiments with testing samples, 15 features reduction. Total classification accuracy.

dimensionality to seven features. The final number of features of the projected subspace was chosen taking into consideration that there are eight classes and based on previous experience with the data set. The total number of training samples is 1245, and the total number of testing samples is 1011. We applied a supervised classification algorithm, the ML classifier, using the training samples. Table IV shows the classification of testing samples in terms of the total test classification accuracy, Kappa statistic, and Kappa variance for every feature-extraction method. Table IV shows that PCA, SVDSS, and IDPP had similar performances; OIDPP obtained the best results and IDSS the lowest. This is true for both metrics: total test classification accuracy and Kappa statistic.

2) More Separable Labeled Samples: The classes in this data set coincide with those of experiment C1, but the labeled fields are different. This data set is more separable than the previous one. Table V shows the number of training samples and testing samples per class. On this data set, the mechanisms of PCA, OIDPP, SVDSS, and IDSS were used to reduce the dimensionality by doing all the computations based only on the data from the training samples. The IDPP was not used

due to its similarity to OIDPP and its time-consuming procedure. Experiments A, B, and C1 have shown that the OIDPP outperforms the IDPP. As in C1, all the training samples were gathered in one file, and the unsupervised techniques were applied to them, thus computing the matrix A that reduces the dimensionality of the image from its original size to the lower dimensional subspace where the classification process was applied. For the classification process, we increased the number of features sequentially from 1 to 15 and applied the algorithm. The feature-selection procedure was implemented as follows: for each feature-extraction method, we project to the first feature and perform the classification. Then, the data set was projected to the second feature, and the classification analysis was done on the subspace composed of the first and second features. The process continues by iteratively adding features until a maximum of 15 features was reached. The percentages of classification accuracy for the testing samples are shown in Fig. 14. The maximum dimensionality was chosen due to the fact that, as seen in Fig. 14, three out of four feature-extraction methods reached more than 90% classification accuracy and the fourth reached its maximum. We applied a supervised classification algorithm, the ML classifier, using the training samples. With respect to this data set, OIDPP is the method that obtains the best classification results for the testing samples that belong to the identified classes. This statement is based on the classification accuracies of every method. The classification results show that the OIDPP achieves above 90% classification accuracy using the first three features, versus PCA and SVDSS, for which similar classification accuracy is achieved


TABLE VI EXPERIMENTS WITH AVIRIS INDIAN TEST SITE DATA: CLASSES AND NUMBER OF TRAINING AND TESTING SAMPLES

Fig. 15. Experiments with testing samples, 15 features reduction. Total classification accuracy.

only using the first 14 features for PCA and the first eight features for SVDSS. These results suggest that the band subset obtained using IDSS, as proposed in this paper, will not obtain a suboptimal subset as SVDSS does for this data set. The total number of training samples is 1082, and the total number of testing samples is 1242.

D. Remotely Sensed AVIRIS Hyperspectral Data: Small Number of Training and Testing Samples and the Application of Feature-Extraction Techniques to the Whole Image

The AVIRIS image data set used in this experiment is from the NW Indian Pines test site located in Indiana. The data set has 145 rows by 145 columns. Note that in this image, most of the contents are agricultural fields in the growing season; therefore, the spectral classes are very alike. Table VI shows the number of training and testing samples. The total number of classes is 16. These classes are very difficult to separate. It is important to note that, according to Table VI, many classes have a small number of training samples and some even have an extremely small number, hence accentuating the Hughes phenomenon. This enables the study of the capability of feature extraction to handle the effect of this event. For example, Grass-Pasture-mowed has only six training samples and Bldgs-Grass-Trees-Drives has only four training samples. For obvious reasons, supervised feature-extraction techniques could not be used to reduce the dimensionality of the data from 220 initial bands to 15 features. This particular number of features was chosen because, according to the results in Figs. 14–16, the feature-extraction methods reach saturation in classification accuracy or reach 100%. Labeled samples were obtained based on previous information collected at the Laboratory of Remote Sensing at Purdue University.

Fig. 16. Experiments with testing samples, 15 features reduction. Grass-Pasture-mowed classification accuracy.

The unsupervised feature-extraction techniques, PCA, SVDSS, OIDPP, and IDSS, were applied to the whole AVIRIS data set. As with the previous experiment, we applied supervised classification using the ML classifier with the training samples. Fig. 14 shows the classification accuracy over the total number of testing samples. In it, we can observe that PCA, SVDSS, and OIDPP provide similar results, although the PCA does better. However, when some of the classes with a small number of training samples are observed, the results show that the OIDPP has a better performance. Figs. 15 and 16 show the classification results on testing samples for the Grass-Pasture-mowed and Oats classes. Observe that OIDPP reaches 100% classification accuracy for both classes once a certain number of features is reached, whereas with the other methods, the classification accuracy reaches a large value at some points but then drops drastically with an increase in the number of features. As mentioned before, most unsupervised feature-extraction techniques, i.e., PCA and SVDSS, do not optimize the separability


Fig. 17. Experiments with testing samples, 15 features reduction. Oats classification accuracy.

among the unknown clusters. This explains why adding features in the classification process does not improve the classification accuracy; on the contrary, the Hughes phenomenon takes place due to the small number of training samples, and the accuracy decreases. OIDPP has a stronger capability to find a robust projection in terms of class separability, and it is sensitive enough to handle cases of clusters with a small number of samples. The OIDPP is more efficient in the process of retrieving the information from the whole data set and handles the Hughes phenomenon much better. Normally, the ML classification algorithm uses the ML estimate of the covariance matrix for each class. For classes where the estimated covariance matrix is singular, we replace it with an identity matrix, and the classifier becomes the minimum Euclidean distance classifier in those classes. For the Grass-Pasture-mowed class and all the feature-extraction techniques, the ML classifier uses the ML estimate of the covariance up to five features. For six features or more, the estimated covariance matrix is singular and the classifier uses the identity matrix. For fewer than six features, all methods, except OIDPP, reach the optimum classification accuracy and then drop to zero. OIDPP continues to improve the accuracy because it is able to extract more information in terms of cluster separability than the other techniques. Something similar happened with the Oats class, shown in Fig. 17, which has only four training samples. The covariance matrix is singular for four or more features. That explains the instability of the classification process for three or fewer features for all feature-extraction techniques. Still, the OIDPP is able to improve the classification accuracy for four or more features until it reaches 100% for that particular class.

VII. CONCLUSION

In this paper, different methods for unsupervised dimensionality reduction of multispectral and hyperspectral images were presented and compared in terms of cluster or class separability and classification accuracy. Experiments using computer-generated and remotely sensed multispectral and hyperspectral

VII. CONCLUSION

In this paper, different methods for the unsupervised dimensionality reduction of multispectral and hyperspectral images were presented and compared in terms of cluster or class separability and classification accuracy. Experiments using computer-generated and remotely sensed multispectral and hyperspectral data were conducted to analyze the capability of each method. The results of applying supervised and unsupervised classification to different images after each feature-extraction mechanism were shown. These results empirically validate the statement that it is possible to reduce the dimensionality of high-dimensional data by means of unsupervised algorithms while preserving the important separability information between clusters or classes and, at the same time, dealing with the Hughes phenomenon. Among the mechanisms used to reduce the dimensionality, OIDPP obtained the best results in terms of finding a direction that maximizes non-Gaussian structure, thus maximizing class separability. Comparing OIDPP and IDPP, the former is faster and outperforms the latter. A band subset selection algorithm, IDSS, which selects a subset of bands that maximizes non-Gaussian structure, was proposed and compared with other feature-extraction and band subset selection algorithms; it still needs modifications to be improved. The results show that the information divergence index is an excellent measure of the separability of the classes found in the multispectral and hyperspectral data sets. The implementations of IDPP and OIDPP, using the information divergence index, obtained projections that preserve the separability of objects and deal with the challenges posed to high-dimensional data by the Hughes phenomenon and the curse of dimensionality, even in cases where conventional supervised and unsupervised mechanisms encounter difficulties such as a small number of training samples. These projections are found without a priori information in the form of labeled samples.

ACKNOWLEDGMENT

The authors would like to thank the Laboratory of Remote Sensing at Purdue University for providing the AVIRIS hyperspectral image data set used in experiment D, and the NASA Kennedy Space Center for providing the AVIRIS hyperspectral image data set used in experiment C.



Luis O. Jiménez-Rodríguez (M’96) received the B.S.E.E. degree from the University of Puerto Rico at Mayagüez (UPRM), in 1989, the M.S.E.E. degree from the University of Maryland, College Park, in 1991, and the Ph.D. degree from Purdue University, West Lafayette, IN, in 1996. He is currently a Professor of electrical and computer engineering with the University of Puerto Rico at Mayagüez. He was the Director of the Laboratory of Applied Remote Sensing and Image Processing (LARSIP) and was with the UPRM component of the Center for Subsurface Sensing and Imaging Systems, an NSF Engineering Research Center. During the academic year 2000–2001, he was selected as the Outstanding Professor of the Department of Electrical and Computer Engineering. During the same year, before his term, he was promoted by exceptional merit to Full Professor for his distinguished research and education work. The President of the University of Puerto Rico selected him as a Distinguished Researcher of the University of Puerto Rico. His research has been in the areas of hyperspectral image analysis, remote sensing, pattern recognition, image processing, and subsurface sensing. Dr. Jiménez-Rodríguez is a member of the IEEE Geoscience and Remote Sensing Society, the IEEE Systems, Man, and Cybernetics Society, and SPIE. He is also a member of the Tau Beta Pi and Phi Kappa Phi honor societies.


Emmanuel Arzuaga-Cruz (S’99) received the BSCpE and MSCpE degrees from the University of Puerto Rico at Mayagüez (UPRM), in 2000 and 2003, respectively. He is currently working toward the Ph.D. degree at Northeastern University, Boston, MA. He worked at UPRM as a Software Developer for the Laboratory for Applied Remote Sensing and Image Processing, where he developed remote sensing and pattern-recognition software for the NSF-funded Engineering Research Center for Subsurface Sensing and Imaging Systems and for the Tropical Center for Earth and Space Studies, funded by the NASA University Research Centers Program. Mr. Arzuaga-Cruz is a member of the ACM and the IEEE Computer Society.

Miguel Vélez-Reyes (S’81–M’92–SM’00) received the B.S.E.E. degree from the University of Puerto Rico at Mayagüez (UPRM), in 1985, and the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, in 1988 and 1992, respectively. In 1992, he joined the faculty of UPRM, where he is currently a Professor. He has held faculty internship positions with AT&T Bell Laboratories, the Air Force Research Laboratory, and the NASA Goddard Space Flight Center. His teaching and research interests are in the areas of model-based signal processing, system identification, parameter estimation, and remote sensing using hyperspectral imaging. He has over 60 publications in journals and conference proceedings. He is the Director of the UPRM Tropical Center for Earth and Space Studies, a NASA University Research Center, and the Associate Director of the Center for Subsurface Sensing and Imaging Systems, an NSF Engineering Research Center led by Northeastern University. Dr. Vélez-Reyes was one of the 60 recipients from across the United States and its territories of the Presidential Early Career Award for Scientists and Engineers from the White House in 1997. He is a member of the Academy of Arts and Sciences of Puerto Rico and a member of the Tau Beta Pi, Sigma Xi, and Phi Kappa Phi honor societies.
