
JOURNAL OF CHEMOMETRICS J. Chemometrics 2007; 21: 270–279 Published online 24 July 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/cem.1044

From projection pursuit to other unsupervised chemometric techniques

Michał Daszykowski*
Department of Chemometrics, Institute of Chemistry, The University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland
*Correspondence to: M. Daszykowski, Department of Chemometrics, Institute of Chemistry, The University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland. E-mail: [email protected]

Received 1 December 2006; Revised 28 March 2007; Accepted 2 April 2007

The main goal of exploratory data analysis is to reveal clusters of objects, local changes of data density, outlying objects and/or influential sources of data variance. These different aspects of data exploration can sometimes be accomplished simultaneously with the use of one algorithm, the projection algorithm (PA). In this paper, the PA is described and discussed in detail. It is shown that this algorithm can be considered as a general platform to perform principal components analysis (PCA), robust PCA and independent component analysis (ICA). This goal is achieved by optimizing different projection indices in the PA. Among these indices one can find entropy, variance and a robust scale estimator. The present paper can be regarded as a tutorial, aiming to provide a better understanding of the projection pursuit approaches (PPs). Copyright © 2007 John Wiley & Sons, Ltd. KEYWORDS: latent variables; projection; robust PCA; projection index; projection algorithm

1. INTRODUCTION
Projection methods play an important role in the exploration and interpretation of the structure of chemical data. In general, projection methods aim to project the data, in a linear or non-linear way, from a high-dimensional space onto a few latent factors that are then used for data exploration and modeling purposes [1]. The linear projection methods are the simplest ones and allow an easier data interpretation. In this context, principal components analysis (PCA) [2] is one of the most often applied methods for data compression and visualization. Due to its variance criterion and to the orthogonality constraint, a data set can usually be compressed to a few latent factors, better known as principal components (PCs). These PCs serve as a new coordinate system and replace the explanatory variables in the subsequent data analysis. There are several other ways in which similar latent factors can be constructed. The concept of latent factors is widely used in chemometrics. To support this statement, a simple projection algorithm (PA) is discussed and its applications to the exploration and analysis of chemical data are presented. In general, the latent factors can be viewed as a special solution of the projection pursuit approach (PP) [3]. The goal of PP is to find a set of low-dimensional projections (latent factors) that maximize the so-called projection index (PI), which defines the intent of the method. The search for directions that maximize the PI of a given projection can be facilitated using the PA [4]. Although it might not be a very efficient approach in some applications, giving only an approximation of a true solution, it can be considered as a general platform for other chemometric methods such as PCA [2], robust PCA [4], independent component analysis (ICA) [5], projection pursuit and other projection-type approaches. Therefore, it is our conviction that this paper can serve as a tutorial, explaining the key aspects of data exploration by means of the PPs that employ a PA.

2. THEORY
Firstly, let us point out a general principle of latent variable modeling. The data, X, are often represented as a product of two matrices. The columns of the first matrix are the so-called scores, t_i, and the columns of the second (loading) matrix are the weight vectors, p_i, which describe the contributions of the individual data variables to the construction of each latent factor:



X = \sum_{i=1}^{f} t_i p_i^T + E    (1)

The residual matrix, E, describes the part of the data which remains unexplained by the model with f latent factors. The presented decomposition model is linear in each of the matrices. Therefore, every latent factor is a weighted sum of the explanatory variables, or, in other words, their linear combination. From a geometrical point of view, the loading vector, being of unit length, points to a direction in the multivariate data space, whereas the score vector is the result of an orthogonal projection of the data onto that direction.
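To make the bilinear model of Equation (1) concrete, the following toy Python/NumPy sketch builds two rank-one terms from an SVD of random data and verifies that scores, loadings and residuals reconstruct X; the numbers and variable names are illustrative assumptions only, not data from the paper.

```python
import numpy as np

# Toy illustration of Equation (1): X is approximated by f rank-one terms
# t_i p_i^T plus a residual matrix E (synthetic data, for demonstration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))          # 10 objects, 4 variables

# Scores and unit-length loadings taken here from a standard SVD,
# purely to have some latent factors for the illustration.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
f = 2
T = U[:, :f] * s[:f]                  # score vectors t_i (columns of T)
P = Vt[:f, :]                         # loading vectors p_i (rows of P)

E = X - T @ P                         # residual matrix E
print(np.allclose(X, T @ P + E))      # True: X = sum_i t_i p_i^T + E
```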

2.1. Goals of data exploration
Data exploration aims to reveal hidden structures in the data, such as clusters, main sources of data variance, fluctuations of data density and unique objects. To achieve these goals simultaneously, it is necessary to construct several latent factors that contain this information and enable data visualization. Constructing latent factors requires determining a suitable set of directions that allow a linear transformation of the data into latent factors by means of an orthogonal projection of the data onto these directions. Intuitively, different data projections lead to different low-dimensional data spaces and hence to different information captured in these sub-spaces. A general algorithm for finding such directions, given a certain criterion, can be obtained with use of the projection pursuit technique [3].


2.2. Projection pursuit
The main idea behind PP is to find a few directions in the data space that lead to 'interesting' low-dimensional projections displaying groups of objects, regions of higher data density or atypical objects (outliers) [6]. This is achieved by maximizing the PI, which is the core of the PP method. Most projection indices are designed based on the assumption that all projections of an approximately Gaussian distribution are the least 'interesting' ones. From a practical point of view, the attractive projection indices are those that can be computed fast. For instance, the most popular PI, entropy, can be approximated using the higher order moments of the data distribution, which are relatively easy to compute [5]. Application of projection indices such as kurtosis or Yenyukov's index [7] helps to reveal groups in the data on the low-dimensional projections. Depending on the applied PI, a different picture of the data structure can be observed, since each PI is sensitive to different aspects of the data structure. The PP solution can be found in several ways [8,9], but the most straightforward one is probably obtained with the simple PA. This algorithm was originally introduced by Croux and Ruiz-Gazen [4] for the construction of robust PCs. Depending on the PI applied within the PA, the 'interesting' directions and projections approximate the solutions of other unsupervised chemometric methods, such as the marker objects projections (MOPs) [10], PCA [2], robust PCA [11], ICA [5], etc. All these methods aim to construct latent factors, which are linear combinations of the explanatory variables, as described in Equation (1).

2.3. Projection algorithm
The PA can be summarized as follows. First, a set of possible directions is constructed. These directions are defined by the individual data objects, so each row of the data matrix is normalized to unit length. Second, the data are projected onto all possible directions to form a set of latent factors (projections). In the next step, the PI is calculated for each projection. After all projections have been scored, the direction characterized by the highest PI value is selected. Then the data are deflated [see Equation (5)], and the next direction is sought in the residual data space. The main steps of the PA are presented below (a code sketch of the complete loop follows the list):

1. Construct a set of potential directions, p_i, as the normalized rows of X(m, n):

   p_i = x_i / \|x_i\|    (2)

   where \|...\| is the Euclidean norm of vector x_i and i = 1, 2, ..., m.

2. Project the data onto all possible directions to obtain a set of m projections t_i:

   t_i = X p_i^T    (3)

3. Score every projection according to its PI and determine the direction characterized by the highest value of the PI:

   \arg\max_i \; PI(t_i)    (4)

4. Remove the information explained by the selected projection, t_i:

   X = X - t_i p_i^T    (5)

Repeat steps 1–4 until the assumed number of latent factors is constructed. Depending on the type of PI applied, different objectives can be achieved, and the solution approximated with the PA resembles that of such chemometric techniques as PCA (and its robust variant), ICA and MOPs. Let us briefly describe these methods and the projection indices that, used within the PA, lead to these solutions.
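As announced above, the following is a minimal Python/NumPy sketch of the PA loop (steps 1–4). The function name, the optional extra random directions and the default variance index are illustrative assumptions and not part of the original algorithm description; any projection index that takes a one-dimensional projection and returns a score can be plugged in.

```python
import numpy as np

def projection_algorithm(X, n_factors, proj_index=np.var, n_random=0, seed=0):
    """Simple projection algorithm (PA) sketch.

    X          : (m, n) data matrix, assumed already centred/scaled
    n_factors  : number of latent factors to extract
    proj_index : projection index scoring a 1-D projection (default: variance)
    n_random   : optional extra directions built as random linear combinations
                 of the (deflated) data rows
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    m, n = X.shape
    T = np.zeros((m, n_factors))              # scores (projections)
    P = np.zeros((n_factors, n))              # loadings (directions)
    for f in range(n_factors):
        # Step 1: candidate directions = normalized rows of the deflated data,
        # optionally augmented with random linear combinations of those rows.
        cand = X.copy()
        if n_random:
            cand = np.vstack([cand, rng.normal(size=(n_random, m)) @ X])
        norms = np.linalg.norm(cand, axis=1)
        keep = norms > 1e-12
        cand = cand[keep] / norms[keep, None]
        # Step 2: project the data onto every candidate direction.
        proj = X @ cand.T                     # shape (m, n_candidates)
        # Step 3: score each projection with the PI and keep the best direction.
        scores = np.apply_along_axis(proj_index, 0, proj)
        best = int(np.argmax(scores))
        p, t = cand[best], proj[:, best]
        # Step 4: deflate the data with the selected projection.
        X -= np.outer(t, p)
        T[:, f], P[f] = t, p
    return T, P
```

With proj_index=np.var this run approximates PCA; robust and ICA-like variants follow by swapping in the indices sketched in the next subsections.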

2.4. Principal components analysis
In PCA, the data set is decomposed into PCs and loadings. The PCs are constructed in such a way as to maximize the data variance and are mutually orthogonal. The PI used to obtain the PCA solution is expressed as:

PI(t_i) = \mathrm{var}(t_i)    (6)

where var(t_i) denotes the variance of projection t_i. It should also be emphasized that the obtained decomposition is optimal in the least-squares sense. The PCA solution can be found with different algorithms [12], and it can also be approximated with the PA. From the exploratory point of view, the obtained PCs can often display groups in the score plots. This is because groups of objects are usually distributed along the directions representing the largest data variance. However, when the information about the clustering tendency is not associated with these directions, the PCs might not reveal groups on the low-dimensional projections.

2.5. Robust principal components analysis
The goal of robust PCA is to provide a set of robust latent factors (robust PCs) that are not influenced by the presence of outlying objects in the data [13]. There are several robust PCA methods available [14]; one of them was proposed by Croux and Ruiz-Gazen [4].


The robust PCs are constructed by finding, with the PA, a set of directions that maximize the robust scale of the projections. To estimate the robust scale of a projection, several robust scale estimators can be used, among them the Qn estimator, which is considered very efficient from a statistical point of view [15]. The Qn scale estimator of projection t_i is defined as the element corresponding to roughly one-fourth of the length of the vector of sorted absolute pairwise differences between all its elements. The value obtained from the Qn scale estimator is then multiplied by the constant factor c, which equals 2.2219:

Q_n = c \cdot \{|t_i - t_j|;\ i < j\}_{(k)}    (7)

where k = \binom{h}{2} \approx \binom{m}{2}/4, h = \lfloor m/2 \rfloor + 1, and m is the number of elements in vector t. The PI of projection t_i used in the robust PCA technique can then be expressed as:

PI(t_i) = Q_n(t_i)    (8)
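A naive but runnable sketch of the Qn estimator of Equation (7) is given below; for real data one would use the O(m log m) algorithm of Croux and Rousseeuw rather than forming all pairwise differences, and the function name is an assumption.

```python
import numpy as np

def qn_scale(t, c=2.2219):
    """Qn scale estimator of a 1-D projection t (naive O(m^2) sketch).

    Takes the k-th smallest absolute pairwise difference |t_i - t_j| (i < j),
    with k = C(h, 2), h = floor(m/2) + 1, and multiplies it by the constant c,
    following Equation (7) and Rousseeuw and Croux [15].
    """
    t = np.asarray(t, dtype=float)
    m = t.size
    h = m // 2 + 1
    k = h * (h - 1) // 2
    diffs = np.abs(t[:, None] - t[None, :])[np.triu_indices(m, k=1)]
    return c * np.sort(diffs)[k - 1]

# Used as the PI for robust PCA with the PA sketch of Section 2.3, e.g.:
# T_rob, P_rob = projection_algorithm(X_robustly_scaled, 3, proj_index=qn_scale)
```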

Once the robust PCs are found, it is possible to identify the outlying objects by taking into account their Mahalanobis distances computed in the space of the robust PCs and their orthogonal distances (residuals from the robust PCA model), using the so-called distance–distance plot constructed for the assumed number of robust PCs. Such a plot displays the Mahalanobis and the orthogonal distances for each object. Depending on these two types of distances, data objects can be labeled as regular objects, good leverage objects, high residual (orthogonal) outliers or bad leverage objects. The good leverage objects are characterized by relatively large Mahalanobis distances and small orthogonal distances (contrary to the high residual objects, which have large orthogonal distances but small Mahalanobis distances). The bad leverage objects have both high Mahalanobis and high orthogonal distances. Although the score plots can reveal samples located far away from the data majority, it is almost impossible to detect the orthogonal outliers there. The orthogonal outliers fall into the data cloud when projected onto the robust PCs and are therefore not conspicuous in the robust score plots. They can only be identified using the distance–distance plots.
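A simplified sketch of how the two distances of the distance–distance plot might be computed from a PA-based robust PCA model is shown below; it reuses the qn_scale sketch above, and treating the robust scores as uncorrelated with the median and Qn estimator as their robust location and scale are assumptions of this sketch rather than details given in the paper.

```python
import numpy as np

def robust_pca_distances(X, T, P):
    """Score (Mahalanobis-type) and orthogonal distances for each object.

    X : robustly centred/scaled data, shape (m, n)
    T : robust scores from the PA with the Qn index, shape (m, f)
    P : corresponding loadings, shape (f, n)

    The score distance is computed in the space of the robust PCs using the
    median and the Qn scale of each score column (the PCs are treated as
    uncorrelated); the orthogonal distance is the norm of the residual.
    """
    center = np.median(T, axis=0)
    scale = np.array([qn_scale(T[:, j]) for j in range(T.shape[1])])
    sd = np.sqrt((((T - center) / scale) ** 2).sum(axis=1))  # score distances
    od = np.linalg.norm(X - T @ P, axis=1)                   # orthogonal distances
    return sd, od
```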

2.6. Independent component analysis
In ICA, the observed data are assumed to be generated by mixing a number of independent sources [16,17]. The main goal of this method is to provide a set of latent variables that are constructed to be statistically independent. Statistical independence is a stronger condition than orthogonality; two variables can be orthogonal but still statistically dependent. In order to determine a set of statistically independent components, a PI maximizing a certain measure of non-Gaussianity is used. The most popular one is entropy, which makes a direct link between the PP and the ICA approach. The entropy of a projection can be approximated using the higher order moments of the data distribution:

PI(t_i) = \frac{1}{12} E_3(t_i)^2 + \frac{1}{48}\,\mathrm{kurtosis}(t_i)^2    (9)

where E_3 is the third moment of the data distribution.
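A hedged sketch of Equation (9) as a projection index, together with the kurtosis index mentioned in Section 2.2, is given below; standardizing the projection before computing the moments is an assumption of this sketch (cf. Hyvärinen et al. [5]), and the function names are illustrative.

```python
import numpy as np

def kurtosis(t):
    """Excess kurtosis of a 1-D projection."""
    t = np.asarray(t, dtype=float)
    t = t - t.mean()
    return (t ** 4).mean() / (t ** 2).mean() ** 2 - 3.0

def entropy_index(t):
    """Moment-based approximation of Equation (9):
    PI(t) = (1/12) E3(t)^2 + (1/48) kurtosis(t)^2,
    evaluated on the standardized projection."""
    t = np.asarray(t, dtype=float)
    t = (t - t.mean()) / t.std()
    e3 = (t ** 3).mean()
    return e3 ** 2 / 12.0 + kurtosis(t) ** 2 / 48.0

# ICA-like run of the PA; minimizing kurtosis (to reveal groups) can be done
# by passing proj_index=lambda t: -kurtosis(t) instead, e.g.:
# T_ica, P_ica = projection_algorithm(X_centred, 2, proj_index=entropy_index)
```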


The ICA approach is mostly applied in the field of signal processing, but it is gaining popularity in chemistry, mainly for the identification of pure components in mixtures [18–21].

2.7. Marker objects projections
The MOPs approach can be considered as the simplest case of PP, where no PI is involved in determining the projections. Instead, the data are projected onto selected directions defined by individual data objects [10]. In that way, the projections are made onto a priori defined, meaningful axes. In order to select these axes, one can either use prior knowledge about the studied data or choose the most dissimilar samples.
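The following is a minimal sketch of the MOP idea, reusing the deflation step of the PA; taking each marker direction from the already deflated data is one possible reading of the procedure, and the function name and interface are assumptions.

```python
import numpy as np

def marker_objects_projection(X, marker_indices):
    """Project the data onto directions defined by chosen marker objects
    (rows of X), deflating the data between consecutive projections;
    no projection index is involved."""
    X = np.asarray(X, dtype=float).copy()
    scores, loadings = [], []
    for idx in marker_indices:
        p = X[idx] / np.linalg.norm(X[idx])   # direction through a marker object
        t = X @ p                             # projection (latent) feature
        X -= np.outer(t, p)                   # deflate before the next direction
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.vstack(loadings)
```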

3. DATA DESCRIPTION
Data set 1 contains 572 samples of olive oil, for which the concentrations of eight fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, eicosanoic, linolenic, eicosenoic) have been determined. The samples have been collected in nine different growing regions in Italy (North Apulia, Calabria, South Apulia, Sicily, Inland Sardinia, Coast Sardinia, East Liguria, West Liguria, Umbria) [22]. The data set is available from Reference 23. The original data set was autoscaled. Data set 2 contains 124 samples of opium. For each sample, the concentration profiles of 14 amino acids (D, aspartic acid; T, threonine; S, serine; E, glutamic acid; G, glycine; A, alanine; V, valine; I, isoleucine; L, leucine; Y, tyrosine; F, phenylalanine; H, histidine) have been measured by means of liquid chromatography. The samples of opium have been collected in three provinces of India (Uttar Pradesh, Rajasthan and Madhya Pradesh, denoted as classes 1, 2 and 3, respectively). More detailed information about the analytical procedure, as well as the data, can be found in Reference 24.

4. RESULTS AND DISCUSSION
The PCA technique is a special case of the projection pursuit technique, which uses variance as the PI. To illustrate the similarities between PP and PCA, we first constructed PCs for data set 1 with the singular value decomposition (SVD) algorithm. Then, for the same data set, the PCs were constructed with the PA, where variance was used as the PI. The results (see Figure 1a) show that the first PC obtained from the SVD is very similar to that obtained from the PA. However, it should be emphasized that the PA can only approximate the true solution, because the directions are restricted to the individual data objects only. Depending on the data distribution and dimensionality, it can also happen that the number of directions to be examined is insufficient to provide a satisfactory result. In the case of our example (see Figure 1b), the number of possible directions is relatively large, but not large enough to obtain the same second PC from the PA as from the SVD. A simple way to overcome this problem is to increase the number of directions to be examined. In that way it is more likely to find a better solution.


Figure 1. PCs obtained from the SVD versus PCs obtained from the PA by maximization of the variance in the projected data: (a) the first and (b) the second PC are obtained for data set 1 from the PA, considering the directions defined by the individual data objects only; (c) the second PC obtained for data set 1 from the PA, where additionally 5000 random directions were investigated; (d) the first PC and (e) the second PC are obtained for data set 2 from the PA, where additionally 5000 random directions were included.

Additional directions can be constructed as normalized linear combinations of the data. In the presented case, an additional 5000 directions were constructed in the course of running the PA. The second PCs obtained from the SVD and the PA, respectively, are shown in Figure 1c. The increased number of directions allows a better approximation of PC 2. For data set 2, the PA solution obtained from the set of directions defined by the data objects, augmented with 5000 random directions, is compared with the results obtained from the SVD in Figure 1d and e, respectively. Construction of additional directions is highly recommended when the ratio of the number of samples to the number of variables is small. However, even with many objects in the data, the data structure also plays an important role and influences the quality of the PA results. Although adding extra directions helps to achieve a better PA solution, an increased number of variables in the data requires more directions to sample the data space. The price one has to pay for these extra directions is the computational speed of the algorithm. For highly multivariate data, examination of a large number of directions can become completely unfeasible with this algorithm. The main objective of the PCA method is to compress the data in such a way as to capture as much variance as possible in a few latent variables. In other words, the goal is to approximate the original data matrix by as few latent factors and corresponding loading vectors as possible. Evidently, this strategy quasi-automatically leads to a reduction of the data dimensionality. If the data are compressed to the full rank, the obtained latent factors (PCs) represent the same variance as the original data.
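Referring back to the PA sketch of Section 2.3, a PCA-like run with 5000 extra random directions and its comparison against SVD scores might look roughly as follows; the synthetic stand-in data and all variable names are assumptions made purely so the snippet runs on its own.

```python
import numpy as np

# Stand-in for the autoscaled data set 1 (572 olive oil samples, 8 fatty acids).
rng = np.random.default_rng(1)
X_auto = rng.normal(size=(572, 8))

# PCA-like run of the PA: variance as the PI, 5000 extra random directions
# (uses the projection_algorithm sketch from Section 2.3).
T_pa, P_pa = projection_algorithm(X_auto, n_factors=2,
                                  proj_index=np.var, n_random=5000)

# Reference PCs from the SVD, as in Figure 1.
U, s, Vt = np.linalg.svd(X_auto, full_matrices=False)
T_svd = U[:, :2] * s[:2]

# Agreement of the first two components (sign-indeterminate, hence abs).
for k in range(2):
    r = np.corrcoef(T_pa[:, k], T_svd[:, k])[0, 1]
    print(f"PC {k + 1}: |correlation PA vs SVD| = {abs(r):.3f}")
```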

Usually, the first few PCs explain the majority of the data variance and for this reason they serve the visualization and/or modeling purposes well. Among all linear approaches, PCA allows the most efficient data compression, since the obtained PCs are by definition orthogonal and represent the largest data variability. For illustrative purposes, the cumulative percentage of the data variance explained by the consecutive PCs is given in Figure 2a. For data set 1, the first two PCs describe 68.59% of the data variance. Of course, the more variance explained by the PCA model the better. Since the first two PCs represent the largest variance that can be explained, they allow gaining some basic knowledge about the data structure, together with the corresponding loading vectors (see Figure 2b and c). For data set 2, the PCA compression is not that efficient. The first two PCs explain 65.87% of the data variance (see Figure 2e) and also reveal four outlying objects. These are objects numbers 36, 41, 61 and 64. Since the variance criterion used in PCA is vulnerable to outliers, it should be replaced by a robust measure of scale in order to construct the robust PCs. The presented PA can also be used for the construction of the robust PCs and identification of outliers, as discussed in Reference 4. The robust PCs are constructed as projections of the data onto the directions that maximize the robust scale of these projections, for example the Qn scale estimator [15]. An example of identification of the outlying samples is demonstrated with data set 2. First, the data set was preprocessed in a robust way. Since the variables appeared in different ranges, they were autoscaled using the robust estimator of location (the L1-median [25]) and the robust estimator of the data scale (the Qn estimator).


Figure 2. Results of PCA for data sets 1 and 2, respectively: (a) and (d): the cumulative percentage of the data variance explained by the consecutive PCs; (b) and (e): projection of objects on the planes defined by the first two PCs with the indicated geographical origin of the samples; and (c) and (f): loadings on PC 1 versus loadings on PC 2.

Then the PA with the Qn PI was applied to the preprocessed data. On the basis of the robust eigenvalues, the complexity of the robust PCA model was chosen as equal to three. The projection of the objects onto the first two robust PCs is given in Figure 3. Four samples, numbers 36, 49, 61 and 64, are located at a considerable distance from the remaining ones.

In order to determine the cut-off values for the Mahalanobis and orthogonal distances, the distances were z-transformed in a robust way: they were centered about the median and scaled with the Qn scale estimator [14]. The cut-off values were set to three, and all objects with absolute values of the z-transformed distances above these cut-offs were considered as outliers.
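A small sketch of this robust z-transform rule is shown below; it reuses the qn_scale and robust_pca_distances sketches given earlier, and the names are assumptions.

```python
import numpy as np

def flag_outliers(distances, cutoff=3.0):
    """Robust z-transform of a distance vector: centre about the median,
    scale with the Qn estimator, and flag |z| values above the cut-off."""
    d = np.asarray(distances, dtype=float)
    z = (d - np.median(d)) / qn_scale(d)
    return np.abs(z) > cutoff

# Usage on a distance-distance plot: an object is suspect if it exceeds the
# cut-off for either the score (Mahalanobis) or the orthogonal distance.
# sd, od = robust_pca_distances(X_robust, T_rob, P_rob)
# suspects = np.flatnonzero(flag_outliers(sd) | flag_outliers(od))
```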


Figure 3. Results of the robust PCA for data set 2: (a) projection of objects on the plane defined by the first two robust PCs with the indicated geographical origin of the samples, (b) the distance–distance plot constructed for the three-component robust PCA model and (c) projection of robust loadings on robust PC 1 and 2.

For data set 2, the distance–distance plot constructed for the three-component robust PCA model shows a core of regular objects (located within the cut-off boundaries) and several objects exceeding the cut-off lines of the Mahalanobis and orthogonal distances (see Figure 3b). According to the distance–distance plot, sample no. 88 has a large orthogonal distance and therefore does not fit the constructed model. However, this sample cannot be recognized as a problematic one on the score plot. A similar conclusion can be drawn about sample no. 63. The most problematic sample, no. 49, can be considered as a bad leverage object, since both its Mahalanobis and orthogonal distances are large compared with the remaining samples. Now, let us turn to the application of projection pursuit, which shares the same objective as the MOPs, except that the directions are selected in order to maximize the PI. We construct the projection pursuit features using the PA. As pointed out by Huber [6], any projection that shows a distribution deviating from the normal distribution is 'interesting'. In our applications of the PA, two projection indices scoring departures from normality, entropy and kurtosis, are applied. Entropy can be efficiently approximated, as described in Reference 5. Kurtosis can be calculated for each projection in a straightforward way. For instance, Peña and Prieto [26] demonstrated that, when minimizing the kurtosis of a projection, one can detect groups of similar objects in the data, and when maximizing kurtosis, one can display the outliers. For data set 1, minimization of kurtosis as the PI leads to the projection shown in Figure 4a. The projection of objects onto the first two projection pursuit features uncovers four groups in the data. Although these groups are not fully separated, the first projection pursuit feature is generally responsible for the first data split into the samples from Umbria, West and East Liguria and Sardinia (the positive values), and the samples from South and North Apulia, Sicily and Calabria (the negative values). Such a data partition can be explained by the properties of the kurtosis PI. Kurtosis reaches its minimum if the distribution of a projection is bi-modal, which happens for two well-separated groups of objects of approximately equal size [26]. Another split can be observed along the second PP feature. The positive values of the second PP feature are characteristic of the samples collected in Sicily, Calabria, Umbria, North Apulia and East and West Liguria.


Figure 4. Projection of objects on the planes defined by the first two projection pursuit features (objects are coded by the different markers describing geographical origin of the samples) and the corresponding projection of loadings obtained with the PA (a) minimizing and (c) maximizing kurtosis of the projected data and their corresponding loadings (b) and (d), respectively.

The PP feature 1 discriminates the samples from South Apulia and Sardinia from all the remaining ones. Looking at the loadings of the first two PP features (see Figure 4b), the samples located on the positive side of PP feature 1 are rich in oleic acid (e.g. samples from Sardinia, West Liguria, etc.), while the samples located on the positive side of PP feature 2 contain relatively high amounts of linoleic acid (e.g. samples from Umbria, East Liguria, etc.). Complementary information about the data structure can be obtained from maximization of the projection kurtosis, as presented in Figure 4c. Apart from the information revealed by the PP features shown in Figure 4a, it is also possible to distinguish a group of distant samples from West Liguria that are characterized by negative values on PP feature 1. This is due to the relatively small content of stearic acid in the West Ligurian samples (see Figure 4d). Another PI in this study is entropy, which is sensitive to projections uncovering unique objects. The plot of the first two projection pursuit features constructed for data set 2 reveals the presence of several outliers located far away from the data majority.

These are object numbers 36, 49, 61, 63, 64 and 88. When the data core is enlarged (see Figure 5b), the samples from class 3 (collected in Madhya Pradesh) can be distinguished from the remaining ones mainly along PP feature 1, whereas the objects from classes 1 and 2 (samples from Uttar Pradesh and Rajasthan) strongly overlap. Samples from class 1 are characterized mostly by low concentrations of glutamic acid (see Figure 5c). The results of the MOP approach can also be obtained from the PA. In the MOP approach, the user selects the directions on the basis of a priori knowledge about the studied data. The directions pass through individual data objects. For illustrative purposes, in our application of MOP to data set 1, two pairs of samples were chosen. First, the directions were selected so as to produce the best contrast between the samples from South Apulia and West Liguria. Therefore, two of the most dissimilar samples from these groups were chosen. The first projection feature is the result of projecting the data onto the normalized object no. 317. The second projection feature is then constructed by projecting the residual data space onto the normalized object no. 542.


Figure 5. Results of projection pursuit performed by the PA with entropy as the PI applied to data set 2: (a) projection of objects on the plane defined by the first two projection pursuit features (objects are labeled with different markers representing the geographical origin of the samples; class 1: Uttar Pradesh; class 2: Rajasthan; class 3: Madhya Pradesh), (b) the enlarged core of the data shown in (a) and (c) the corresponding loadings of the projection pursuit features.

The projection onto the two projection features reveals many regions of high data density. The high-density regions correspond to the geographical origin of the samples (see Figure 6a). The first projection feature differentiates the West Ligurian samples (negative values on projection feature 1) from the South Apulian samples (positive values on projection feature 1). In order to better understand why such structures can be observed on the projection, the loadings of the projection features are further interpreted. The main difference between these two groups of samples is due to the relatively low concentrations of palmitic acid in the West Ligurian olive oil samples, in contrast to the South Apulian samples. The second pair of samples defining the directions, sample numbers 390 and 411, was selected to provide information about possible differences between the Sardinian samples collected in the coastal and the inland parts of the island. The obtained projection of objects onto the space of the first two projection features is shown in Figure 6c. There is no clear division between these two groups of samples, yet the first two projection features also uncover an interesting data structure.

The samples from Sardinia are well separated from the remaining ones. The structures observed on the projection can be explained with the help of the loadings of the projection features. From the discussed example it can be concluded that the difference between the two types of Sardinian samples is mostly due to palmitoleic acid. This example clearly demonstrates that, depending on the directions chosen, different projections can be obtained. For exploratory purposes, many directions can be valuable and give complementary information about the data structure. For instance, the projection of objects shown in Figure 6c reveals a group of Sardinian samples that was previously not visible in Figure 6a.

5. CONCLUSIONS
It was demonstrated and discussed that such approaches as PCA, ICA and robust PCA can be considered as special cases of projection pursuit, in which a certain PI is maximized within the PA. The PI itself is the core of the method and is responsible for the final result.


Figure 6. Projections of objects of data set 1 on the plane defined by the first two projection features found with marker objects projection (objects are coded with different markers describing the geographical origin of the samples): (a) the two directions pass through objects numbered 317 and 542 and (c) the two directions pass through objects numbered 390 and 411. The corresponding projections of loadings of the first two projection features for: (b) the two directions pass through objects numbered 317 and 542 and (d) the two directions pass through objects numbered 390 and 411.

Depending on the type of the applied index, different aspects of the data structure can be studied. The obtained latent variables can capture as much variability of the data as possible (PCA), help to detect unique samples (robust PCA) and reveal groups in the data (projection pursuit with an index sensitive to non-Gaussian distributions). The main advantage of the presented PA is that it is straightforward to use and conceptually simple, and with its aid different aspects of the data structure can be studied simultaneously. Nevertheless, one should be aware that searching for an interesting direction in the data is in fact an optimization problem, and with the PA only an approximation of the true solution can be obtained. For exploratory data analysis this might be enough, but when the latent variables are to be used further, some problems can be encountered, for instance when PCs are used instead of the original data variables to construct calibration models or as input to clustering approaches.

Of course, there are more efficient algorithms for performing PCA and ICA than the simple PA presented in this paper, which was discussed mainly to provide a better understanding of the PP-like approaches. We limited ourselves to the unsupervised approaches only, but the PP concept of finding certain directions in the data space is popular with the supervised techniques, too. For instance, PC regression actively uses PCs [27]. In the partial least squares approach, one tries to find directions in the data space that maximize the covariance between X and y [27]. The robust continuum regression employs PP to search for suitable directions directly in the data space [28]. Huber [6] has also discussed the use of the so-called projection pursuit regression; a similar application can be found in Reference 29. The ICA components can also be used to construct a multivariate regression model, as proposed by Chen and Wang [22], and Westad [21] extended the same idea to independent component partial least squares regression.


Acknowledgements M. Daszykowski expresses his sincere gratitude to the Foundation for Polish Science for the financial support.

REFERENCES
1. Daszykowski M, Walczak B, Massart DL. Projection methods in chemistry. Chemom. Intell. Lab. Syst. 2003; 65: 97–112.
2. Malinowski ER. Factor Analysis in Chemistry. John Wiley & Sons, Inc.: New York, 1991.
3. Friedman JH, Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. 1974; C-23: 881–890.
4. Croux C, Ruiz-Gazen A. A fast algorithm for robust principal components based on projection pursuit. In COMPSTAT: Proceedings in Computational Statistics. Physica-Verlag: Heidelberg, 1996; 211–217.
5. Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. John Wiley & Sons, Inc.: New York, 2001.
6. Huber PJ. Projection pursuit. Ann. Stat. 1985; 13: 435–475.
7. Yenyukov S. Indices for projection pursuit. In Data Analysis, Learning Symbolic and Numeric Knowledge, Diday E (ed.). Nova Science Publishers: New York, 1989; 181–188.
8. Glover D, Hopke PK. Exploration of multivariate chemical data by projection pursuit. Chemom. Intell. Lab. Syst. 1992; 16: 45–59.
9. Guo Q, Wu W, Questier F, Massart DL, Boucon C, de Jong S. Sequential projection pursuit using genetic algorithms for data mining of analytical data. Anal. Chem. 2000; 72: 2846–2855.
10. Kvalheim OM, Telnaes N. Visualizing information in multivariate data: applications to petroleum geochemistry. Part 1. Projection methods. Anal. Chim. Acta 1986; 191: 87–96.
11. Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components with application to chemometrics. Chemom. Intell. Lab. Syst. 2002; 60: 101–111.
12. Wu W, Massart DL, de Jong S. The kernel PCA algorithms for wide data. Part I: Theory and algorithms. Chemom. Intell. Lab. Syst. 1997; 36: 165–172.
13. Frosch Møller S, von Frese J, Bro R. Robust methods for multivariate data analysis. J. Chemometrics 2005; 19: 549–563.


14. Daszykowski M, Kaczmarek K, Vander Heyden Y, Walczak B. Robust statistics in data analysis - a review. Basic concepts. Chemom. Intell. Lab. Syst. 2007; 85: 203–219.
15. Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 1993; 88: 1273–1283.
16. Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000; 13: 411–430.
17. De Lathauwer L, De Moor B, Vandewalle J. An introduction to independent component analysis. J. Chemometrics 2000; 14: 123–149.
18. Visser E, Lee T. An information-theoretic methodology for the resolution of pure component spectra without prior information using spectroscopic measurements. Chemom. Intell. Lab. Syst. 2004; 70: 147–155.
19. Westad F, Kermit M. Cross validation and uncertainty estimates in independent component analysis. Anal. Chim. Acta 2003; 490: 341–354.
20. Wang G, Cai W, Shao X. A primary study on resolution of overlapping GC-MS signal using mean-field approach independent component analysis. Chemom. Intell. Lab. Syst. 2006; 82: 137–144.
21. Westad F. Independent component analysis and regression applied on sensory data. J. Chemometrics 2005; 19: 171–179.
22. Chen J, Wang XZ. A new approach to near-infrared spectral data analysis using independent component analysis. J. Chem. Inf. Comput. Sci. 2001; 41: 992–1001.
23. ftp://ftp.clarkson.edu/pub/hopkepk/Chemdata/Original/oliveoil.dat
24. Krishna Reddy MM, Ghosh P, Rasool SN, Sarin RK, Sashidhar RB. Source identification of Indian opium based on chromatographic fingerprinting of amino acids. J. Chromatogr. A 2005; 1088: 158–168.
25. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. John Wiley & Sons: New York, 1987.
26. Peña D, Prieto FJ. Cluster identification using projections. J. Am. Stat. Assoc. 2001; 96: 1433–1445.
27. Martens H, Næs T. Multivariate Calibration. John Wiley & Sons: Chichester, UK, 1989.
28. Serneels S, Filzmoser P, Croux C, Van Espen PJ. Robust continuum regression. Chemom. Intell. Lab. Syst. 2005; 76: 197–204.
29. Liu H, Yao X, Liu M, Hu Z, Fan B. Prediction of gas-phase reduced ion mobility constants (K0) based on the multiple linear regression and projection pursuit regression. Talanta 2007; 71: 258–263.

