Chemometrics and Intelligent Laboratory Systems 116 (2012) 1–8


MultiDA: Chemometric software for multivariate data analysis based on Matlab

Qianxu Yang, Liangxiao Zhang, Longxing Wang, Hongbin Xiao ⁎

Key Lab of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China

Article history: Received 17 November 2011; Received in revised form 1 March 2012; Accepted 5 March 2012; Available online 2 May 2012

Keywords: Chemometrics software; Matlab; Multivariate analysis; Metabolomics/metabonomics; Multi-model comparison

Abstract

Multivariate data analysis (MultiDA), a chemometric software package with a user-friendly interface, has been developed for routine metabolomics/metabonomics data analysis. MultiDA has two main advantages. First, it simultaneously provides multiple methods for data preprocessing and multivariate analysis. The main chemometric methods in MultiDA comprise k-means cluster analysis, k-medoid cluster analysis, hierarchical cluster analysis (HCA), principal component analysis (PCA), robust principal component analysis (ROPCA), non-linear PCA (NLPCA), non-linear iterative partial least squares (NIPALS), SIMPLS, discriminant analysis (DA), canonical discriminant analysis (CDA), stepwise discriminant analysis (SDA) and uncorrelated linear discriminant analysis (ULDA), together with data preprocessing methods such as standardization, outlier detection, genetic algorithm for feature selection (GAFS), orthogonal signal correction (OSC) and weight analysis. Second, multi-model comparison can be conducted to obtain the best outcome. Moreover, the software is available for free. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Chemometrics is defined as "a chemical discipline that uses statistical and mathematical methods to design or select optimum procedures and experiments, and to provide maximum chemical information by analyzing chemical data" [1]. With the emergence and development of systems biology, including genomics, translatomics, proteomics and metabolomics, massive amounts of data are produced by instruments, and the subsequent data processing has become a challenge for the development of omics. Besides, many algorithms address the same problem, such as the nine PLS1 algorithms compared in [2], which often confuse users without a statistical background. At the same time, no single method is best for all data; the choice of chemometric method depends on the data at hand. Thus, it is necessary to compare models built by different methods on the same data.

Matlab is a high-level technical computing language and interactive platform for algorithm development, data visualization, data analysis and numeric computation [3]. With the help of the graphical user interface (GUI) tools in Matlab, it is possible to develop user-friendly software; in this study, MultiDA is created on the basis of the Matlab GUI. Recently, several excellent Matlab toolboxes have been developed for multivariate data processing, such as ParLes [4] and TOMCAT [5], both of which are popular. ParLes focuses on spectroscopic data processing with minor multivariate calibration functions, whereas TOMCAT emphasizes multivariate calibration, including many algorithms for PCA, robust PCA, PLS and robust PLS.

⁎ Corresponding author. Tel./fax: +86 411 84379756. E-mail address: [email protected] (H. Xiao).
doi:10.1016/j.chemolab.2012.03.019

In this study, MultiDA is developed for metabolomics/metabonomics data analysis. Multi-model comparison is conducted in MultiDA to obtain comprehensive, robust and concise outcomes. Additionally, MultiDA aims to discriminate samples from different groups and to explore informative variables. MultiDA is released under the GNU GPL v3 license; the source code is available at http://code.google.com/p/multida/ and the standalone version at http://code.google.com/p/multida-standalone/.

2. Software

2.1. Software environment and installation

MultiDA is developed in the MATLAB R2007b environment on the basis of the graphical user interface, and has been tested under MATLAB 7, 2010 and 2011b. A standalone version is also available. A Microsoft Spreadsheet ActiveX Control (file "OWC10.DLL") is embedded in the software for displaying raw data. An ActiveX control is an object linking and embedding control extension (OCX) defined by Microsoft, providing a set of rules for how applications share information. If MultiDA is transferred to a target computer without the OWC10.DLL file, the ActiveX control must be registered first, or an error will be generated.
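As an illustration of the registration step (this is not MultiDA's own installer), the control can be registered from within Matlab by shelling out to the standard Windows regsvr32 utility; the path below is a placeholder for wherever OWC10.DLL was copied, and administrator rights are required:

```matlab
% Register the Microsoft Spreadsheet ActiveX control on a target machine.
% Replace the placeholder path with the actual location of OWC10.DLL.
status = system('regsvr32 /s "C:\MultiDA\OWC10.DLL"');
if status ~= 0
    warning('Registration of OWC10.DLL failed; try running Matlab as administrator.');
end
```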

2.2. Outline and structure of MultiDA

As shown in Fig. 1, the software comprises seven compartments: data input, data preprocessing, cluster, PCA, classification, PLS and figure control. Each compartment performs the specific function marked in Fig. 1. MultiDA provides a plain operation interface and a straightforward operation model without complicated parameter settings.

Fig. 1. Outline of MultiDA: the software is composed of seven compartments, i.e., data input, data preprocessing, cluster, PCA, classification, PLS and figure processing. Each compartment is a separate panel.

Note that the data input compartment is a Microsoft spreadsheet ActiveX control, separated into an independent variable input block and a group variable (or dependent variable) input block, marked "X data" and "Y data" at the top of the spreadsheet ("X" for independent variable data, "Y" for group variable data), to reduce the complexity of subsequent data processing. Fig. 2 displays the structure and data stream of the software. There are two ways to input data: the [ReadData] button (square brackets denote uicontrols in MultiDA) reads data from the spreadsheet, and [ImportData] reads from a file in a supported format. After the data are read, several preprocessing methods are available, such as standardization, weight analysis [6], outlier detection [7,8], GAFS [9–11] and OSC [12]. If the class information of the samples is unknown, cluster analysis is the only choice; MultiDA provides three cluster methods, namely k-means cluster analysis, k-medoid cluster analysis and HCA. The commonly used unsupervised methods in MultiDA comprise PCA, ROPCA [13,14] and NLPCA [15,16], while supervised learning methods are also available when sample classes are given, including linear DA (LDA), quadratic DA (QDA), Mahalanobis DA (MDA), CDA [17], SDA [18], ULDA [19,20], kernel PLS [21], NIPALS [22,23] and SIMPLS [24]. In actual data processing, one or more methods can be employed according to the requirements of the analysis, and the results from different methods can be compared with each other. To clearly show the results of these methods, many figures are provided by MultiDA, such as a bar plot of the weight coefficients from weight analysis and a stem-leaf plot of the Hotelling T-square statistic of each sample in PCA. Matrices and structures are the formats for data storage in MultiDA. All useful figures and data, especially transient variables, can be saved with the [Save Plot] or [Save Data] button, respectively.

2.3. Figure output

There are two modes of figure generation in MultiDA. In Mode 1, figures accompany an algorithm to provide additional information: outliers among the samples (PCA boxplot), principal component selection (Pareto plot, AIC-RMSE plot), computation progress (plot of fitness against generation), etc. In Mode 2, figures are produced by a dedicated uicontrol button, such as [Score plot], [Loading plot] and [Weight], to visualize a desired result. The [PlotControl] panel offers fundamental uicontrols for modifying figures; for score and loading plots in particular, the error ellipse (with or without groups), the cross line through the point (0, 0), the group names, etc. can be changed. When a figure exists, [SavePlot] is available for saving the plot in different formats; for the tif format in particular, figures can be saved at a resolution that meets publication requirements.
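For users who want the same kind of export from the command line, base Matlab's print command writes the current figure at a chosen resolution (the file name below is arbitrary):

```matlab
% Export the current figure as a 300 dpi TIFF suitable for publication.
print(gcf, '-dtiff', '-r300', 'score_plot.tif');
```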

Fig. 2. Structure and data stream of MultiDA: rectangles represent data in different formats; ellipses show the chemometric capabilities; round rectangles stand for the data input buttons in the software; double-line arrows display the main direction of data transfer; dashed arrows indicate complementary data streams; brackets mean that all data in processing can be saved in figure or mat format.


2.4. Others

2.4.1. Transparent data analysis

MultiDA provides a relatively transparent data analysis environment. All useful data produced by MultiDA are stored in the "handle" object and subsequently saved to the workspace under the name of the algorithm tag, including the transition data, for error checking. Taking ULDA as an example, a "structure" type variable named "ULDA" is exported to the Matlab workspace when a ULDA run finishes. The "ULDA" structure contains the following fields:

- NumberOfEachGroup: number of samples in each group
- MeanXByGroup: mean of each variable in each group
- WithInDeviation: within-class deviation
- TotalDeviation: total deviation
- BetweenClassScatter: between-class scatter
- Transmits: transform matrix
- UDV: uncorrelated discriminant vectors
- CrossValidation: recognition, prediction, correct and five-fold recognition rates of ULDA

Sometimes it is possible to draw a plot entirely from the UDVs instead of the sample numbers in the Matlab command window. Because it is usually difficult to obtain more than one UDV, the default figure uses the sample numbers to avoid errors. Fig. 3 is the scatter plot obtained from UDV1 and UDV2; it looks much cleaner than the plot drawn with sample numbers.
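A minimal sketch of reusing the exported structure from the command window is shown below; it assumes that ULDA.UDV holds one column of sample scores per discriminant vector and that y holds the class labels, which is our reading of the field list above rather than a documented layout:

```matlab
% Draw a UDV1-vs-UDV2 scatter plot from the exported "ULDA" structure.
udv = ULDA.UDV;                                 % sample scores on the UDVs (assumed layout)
scatter(udv(:, 1), udv(:, 2), 36, y, 'filled'); % color each sample by its class label y
xlabel('UDV1'); ylabel('UDV2');
title('ULDA scores plot');
```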

3. Applications

3.1. Data acquisition

The wine data set [25] from the UCI Machine Learning Repository is employed to test the functions of MultiDA and to demonstrate the multi-model comparison method. The wine data set contains the quantities of 13 constituents measured in 178 wine samples of three types. The 13 constituents are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.

There are two ways to input data into the software: the [ReadData] and [ImportData] buttons. [ReadData] reads data from both the X and Y spreadsheets, while [ImportData] imports data from a specified file. The file types MultiDA can recognize are listed in Table 1. The [XdataLabel] field in the [ImportData] dialog stores labels for the samples or variables of the X data, which can be invoked by the [Label] button in the scores plot, loading plot and elsewhere.

3.2. Methods of data preprocessing

MultiDA provides many data preprocessing methods, including descriptive analysis, data standardization, outlier detection and OSC.

3.2.1. Descriptive analysis and standardization

Descriptive analysis describes the main features of a collection of data quantitatively. In descriptive analysis, MultiDA provides a fundamental analysis of the within-class data and the whole data set, including mean, median, standard deviation, variance, maximum, minimum, kurtosis, skewness and coefficient of variation, giving an overview of the data. The descriptive analysis results indicate that the magnitudes of the wine data vary from 0.1 to 1000 and the variances vary from 0.01 to 100,000. This means that almost 99.9% of the variance is concentrated in a few variables, which dramatically masks the effect of the other variables. Comparing the prediction errors of PCA-DA and Z-scores-PCA-DA shows a prominent improvement of predictive ability after standardization (data not shown). In MultiDA, [Standerlize] provides five methods for standardization: log-transform, centering, Z-scores, min-max normalization and decimal scaling.

3.2.2. Outlier detection

As defined by Grubbs [7], an outlier is an observation that appears to deviate markedly from the other members of the sample in which it occurs. MultiDA provides two outlier detection methods (Grubbs' test [7] and Wilk's method [8,26]) and three graphical descriptions of outliers (PCA boxplot, stem-leaf plot of Hotelling's T-square, and ROPCA distance scatter plot). The [Outlier] button invokes a dialog for choosing between Grubbs' test and Wilk's method. MultiDA does not provide a direct control interface for the graphical descriptions of outliers: the PCA boxplot is a partial outcome of outlier analysis, the stem-leaf plot of Hotelling's T-square is output by PCA, and the ROPCA score diagnostic plot is obtained from ROPCA.

3.2.3. Multi-comparison of outlier detection

Table 2 displays the outliers recognized by the different methods. The results indicate that different outlier detection methods tend to select different outliers. However, the 70th and 96th samples were selected by all methods, so the multi-model comparison suggests that these samples can be classified as extreme outliers. Fig. 4 presents a visualized outcome of the three graphical descriptions of outliers. Interestingly, all outliers detected above belong to group 2 (Fig. 4a and b), which might suggest that samples in group 2 show greater variation. The overview of outlier detection based on different methods gives a more comprehensive insight into the wine data.
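For reference, a minimal sketch of the single-variable Grubbs test [7] is given below. It follows the textbook two-sided test and is not necessarily MultiDA's exact implementation; tinv requires the Statistics Toolbox:

```matlab
% One round of the two-sided Grubbs test: flag the most deviating value
% of a single variable x at significance level alpha (e.g. 0.05).
function idx = grubbsTest(x, alpha)
n = numel(x);
[G, idx] = max(abs(x - mean(x)) / std(x));             % Grubbs statistic and candidate
t  = tinv(alpha / (2 * n), n - 2);                     % Student t quantile
Gc = ((n - 1) / sqrt(n)) * sqrt(t^2 / (n - 2 + t^2));  % critical value
if G <= Gc
    idx = [];                                          % no outlier at this level
end
end
```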

3.3. Feature selection method

Feature selection, or variable selection, selects a subset of features for constructing a more robust model. MultiDA provides weight analysis, genetic algorithm, CART, PLS weight and stepwise DA ([Stepwise]) for this purpose.

Fig. 3. Scatter plot obtained from UDV1 and UDV2 of ULDA. The x and y axes represent UDVs (uncorrelated discriminant vectors) 1 and 2. Star, circle and square represent the three different groups. The two UDVs give a perfect separation of the three groups in the wine data.

Table 1. File types that MultiDA can recognize.

File type | Format
Text file | txt, dat, tap, dlm
Excel file | xls, csv, wk1
Image file | gif, cur, ico, tif, bmp, pcx, jpeg
Sound file | wav, au, snd
Others | mat, avi


Table 2. Outliers detected by the different outlier detection methods and graphical descriptions of outliers.

Method | Outliers detected
Grubbs' test | 60, 70, 96
Wilk's method | 159, 70, 96, 74, 122
Box plot | 96, 70, 97, 74, 122
Hotelling T-square plot | 122, 70, 96, 74, 159, 111
Distance plot | 74, 96, 122, 70, 79, 158, 160

3.3.1. Weight analysis

The purpose of weight analysis is to find the best variable set for discriminating between two groups. A variance weight method based on Liang et al. [6] is introduced to achieve this aim. For variable j and two groups C and T with n_C and n_T samples, the weight is the ratio of the averaged between-group to within-group squared differences:

w_j = \frac{ \frac{1}{n_C n_T} \sum_{c=1}^{n_C} \sum_{t=1}^{n_T} \left( x_{cj} - x_{tj} \right)^2 }{ \frac{1}{n_C^2} \sum_{c'=1}^{n_C} \sum_{c=1}^{n_C} \left( x_{c'j} - x_{cj} \right)^2 + \frac{1}{n_T^2} \sum_{t'=1}^{n_T} \sum_{t=1}^{n_T} \left( x_{t'j} - x_{tj} \right)^2 }

where x_{cj} denotes the value of variable j for sample c. Fig. 5a shows the layout of the weight analysis GUI. Because the weight analysis method applies only to two groups, the GUI contains two popup menus for selecting the group pair. From Fig. 5b and c, it is clear that variables 13, 10, 7 and 1 (i.e., proline, color intensity, flavanoids and alcohol) are probably the latent variables. However, different variables may affect different groups: Fig. 5c indicates that color intensity is the key factor for discriminating groups 1 and 3, while flavanoids influence the classification of groups 2 and 3 according to Fig. 5d.
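A sketch of this computation is shown below. The function name and interface are ours, not MultiDA's internal API, and the normalization constants follow the reconstructed equation above; any common scaling cancels when weights are compared across variables:

```matlab
% Variance weight of each variable between two groups C and T:
% ratio of averaged between-group to within-group squared differences.
function w = varianceWeight(Xc, Xt)
% Xc: nC-by-p matrix of group C, Xt: nT-by-p matrix of group T.
nC = size(Xc, 1);  nT = size(Xt, 1);  p = size(Xc, 2);
w = zeros(1, p);
for j = 1:p
    % all pairwise squared differences across and within the groups
    between = sum(sum(bsxfun(@minus, Xc(:, j), Xt(:, j)').^2)) / (nC * nT);
    withinC = sum(sum(bsxfun(@minus, Xc(:, j), Xc(:, j)').^2)) / nC^2;
    withinT = sum(sum(bsxfun(@minus, Xt(:, j), Xt(:, j)').^2)) / nT^2;
    w(j) = between / (withinC + withinT);
end
end
```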

3.3.2. Genetic algorithm for feature selection

The GAFS GUI in MultiDA shows default values for the population size, mutation rate, number of generations, crossover rate and goal (see Fig. 6a); users may of course adopt other values to optimize the results. Moreover, evaluation functions based on LDA, QDA, MDA, ULDA, NIPALS and SIMPLS are included in the software. Fig. 6b shows the outcome of GAFS based on NIPALS: MultiDA compares the fitness of every chromosome in each generation, and the best chromosome (or subset) is marked by its fitness, as in Fig. 6b. A structure variable for GAFS is also sent to the Matlab workspace, containing the best chromosome, the selected variables and the fitness of the best chromosome. For the wine data, the best chromosome contains six variables (variables 1, 4, 5, 7, 12 and 13) and five components are extracted by NIPALS; these six variables give a 92.1348% correct classification rate.
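The selection loop itself can be sketched as a toy genetic algorithm with tournament selection, one-point crossover and bit-flip mutation; this is an illustration of the general technique, not MultiDA's code, and evalFcn stands for an assumed user-supplied handle such as a cross-validated classification rate:

```matlab
% Toy GA feature selection: chromosomes are logical masks over variables.
function [best, bestFit] = gafsSketch(X, y, evalFcn, pop, gens, pc, pm)
% pop: population size (even); gens: generations;
% pc: crossover rate; pm: per-bit mutation rate.
p = size(X, 2);
P = rand(pop, p) > 0.5;                          % random initial chromosomes
bestFit = -Inf;  best = P(1, :);
for g = 1:gens
    f = zeros(pop, 1);
    for i = 1:pop
        if any(P(i, :))
            f(i) = evalFcn(X(:, P(i, :)), y);    % fitness of the variable subset
        end
    end
    [mx, im] = max(f);
    if mx > bestFit, bestFit = mx; best = P(im, :); end
    Q = false(pop, p);                           % tournament selection
    for i = 1:pop
        a = randi(pop);  b = randi(pop);
        if f(a) >= f(b), Q(i, :) = P(a, :); else Q(i, :) = P(b, :); end
    end
    for i = 1:2:pop - 1                          % one-point crossover
        if rand < pc
            cut = randi(p - 1);
            tmp = Q(i, cut+1:end);
            Q(i, cut+1:end) = Q(i+1, cut+1:end);
            Q(i+1, cut+1:end) = tmp;
        end
    end
    P = xor(Q, rand(pop, p) < pm);               % bit-flip mutation
end
end
```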

3.3.3. Multi-comparison of feature selection

Table 3 lists the informative variables identified by the feature selection methods mentioned above. All selected variable subsets were evaluated by 10-fold cross-validation RMSEP, RMSER and BIC. As shown in Table 3, variables 1, 13 and 7 (namely alcohol, proline and flavanoids) are selected by most of the six methods and are therefore considered the more informative variables. A model built with these three variables achieves a low BIC of −14.4213, indicating a robust and concise model. Variables 2, 3, 4, 5, 9, 10, 11 and 12 were selected by only some of the methods, and little RMSE is gained when they are included in the model, so they can be classified as assistant variables. Similarly, total phenols (variable 6) and nonflavanoid phenols (variable 8) were never selected by any method, implying that they contribute little to discriminating the different types of wine. Multi-model comparison of various feature selection methods thus gives a deeper insight into each variable and a more accurate conclusion.

Fig. 4. Three graphical descriptions of outliers: a) PCA boxplot of the three groups on the first component, b) PCA boxplot on the second component, c) score diagnostic plot, where each sample is displayed by its score distance within the PC space and its orthogonal distance to the PCA space, and d) stem-leaf plot of Hotelling's T-square statistic for each sample. Outliers are marked with their sample numbers in all four figures.


Fig. 5. Weight analysis GUI and bar plots of the weight coefficient of each variable between different group pairs: a) layout of the weight analysis GUI, b) weight coefficients between groups 1 and 2, c) groups 1 and 3, d) groups 2 and 3. In the weight coefficient bar plots, the x axis represents the variables and the y axis indicates the discriminating ability of each variable for the given group pair.


Fig. 6. Genetic algorithm for feature selection: a) outline of the GAFS GUI, b) fitness of the population across generations based on NIPALS (x axis: generation; y axis: fitness). Circles represent the mean fitness of each generation and stars the maximum fitness in each generation. The printed number denotes the best fitness over all generations. If the goal is met, the run stops early.


3.4. Supervised learning method

MultiDA contains many algorithms for supervised analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Mahalanobis discriminant analysis (MDA), CDA, ULDA and PLS.

3.4.1. Discriminant analysis

LDA, QDA and MDA are invoked by the [Classification] button. A prior probability must be selected before classification analysis; MultiDA supplies two kinds of prior probability, "All Groups Equal" and "Weight by Group Size". After classification analysis, the territorial map and tree plot can be output. CDA is related to PCA and canonical correlation; the raw CDA program, created by Trujillo-Ortiz [17], was obtained from the Matlab File Exchange and revised. Leave-one-out (LOO) and k-fold cross-validation are available for evaluating the discriminant algorithms. PCA-DA is also available for tackling high-dimensional data.

3.4.2. Partial least squares analysis

Partial least squares (PLS) is a commonly used method for modeling the relations between multiple independent variables and one (PLS1) or several (PLS2) dependent variables. MultiDA provides the kernel PLS, NIPALS and SIMPLS algorithms for PLS. [NIPALS] and [SIMPLS] invoke a dialog for choosing the number of components; at the same time, a figure displays the variation of AIC and RMSE with the number of components. As for PCA, the scores and loadings plots are easily output for PLS. LOO and k-fold cross-validation are available, as for discriminant analysis.

3.4.3. Multi-comparison of supervised learning methods

Table 4 displays the multi-model comparison of the different supervised learning methods by 10-fold cross-validation RMSEP and RMSER. Both CDA and ULDA gave perfect recognition and prediction errors. The multi-model comparison therefore shows that ULDA and CDA are suitable for analyzing this wine data set.
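For readers who wish to reproduce the component-versus-error inspection of Section 3.4.2 outside MultiDA, the Statistics Toolbox function plsregress (which implements SIMPLS) offers a comparable workflow; using the numeric class labels directly as y is our simplification of the PLS-DA setup:

```matlab
% Cross-validated PLS on the wine data: X is 178-by-13, y holds the
% numeric class labels (1, 2, 3). mse(2,:) is the CV error for Y.
ncomp = 10;
[XL, YL, XS, YS, beta, pctvar, mse] = plsregress(X, y, ncomp, 'CV', 10);
plot(0:ncomp, sqrt(mse(2, :)), '-o');     % RMSE vs. number of components
xlabel('Number of PLS components'); ylabel('RMSECV');
```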

3.5. Unsupervised learning method

Cluster analysis and PCA are the two most important unsupervised learning methods. There are three methods for cluster analysis in MultiDA, namely k-means ([KMeans]), k-medoid ([KMedoid]) and hierarchical clustering ([HierachicalClustering]), and three methods for PCA: classical PCA, robust PCA and NLPCA.

3.5.1. Cluster analysis

A group number is needed before cluster analysis in MultiDA. For [HierachicalClustering], a dialog pops up for further parameter selection, in which the methods for calculating distance and proximity are provided. The grouping results are displayed in the Y data spreadsheet.

3.5.2. Comparison of multiple cluster analyses

Fig. 7 shows the PCA scores plots from the three methods. It is clear that k-means and k-medoid depict the group information correctly and give similar results. Compared with k-means, k-medoid is more robust to outliers but computationally intensive. The hierarchical cluster method is sensitive to the data values used for calculating similarity and distance. In the left part of Fig. 7b, the 19th sample is grouped on its own because its proline value is larger than the others' and the magnitude of proline is significantly larger than that of the other variables. For Z-scores data, hierarchical clustering groups the 96th sample as one class and the 70th, 79th and 74th samples as another, all of which are outliers according to the outlier detection. Thus, it is essential to standardize the samples and exclude outliers before hierarchical clustering. As described above, different cluster analyses reveal the inner data structure more deeply, so multi-model comparison offers considerable advantages over a single analysis.

3.5.3. Principal component analysis

Classical PCA, robust PCA (ROPCA) and non-linear PCA (NLPCA) are provided in MultiDA. PCA is sensitive to outliers, so robust PCA is needed for handling data with outliers. After a run of ROPCA, a distance scatter plot is drawn automatically to reflect the extreme and moderate outliers in the data; extreme outliers are located in the top right region. In MultiDA, non-linear PCA is based on a feed-forward back-propagation network with five layers, and the number of neurons in each hidden layer is determined by the data dimensionality. Fig. 8a shows the NLPCA GUI profile, with default parameters of 500 for the maximum number of epochs, 0.001 for the error, 0.3 for the learning rate and 10 for the display interval. Fig. 8b is the scores plot based on NLPC 1 and 2. Compared with the PCA scores plot, the NLPCA scores plot displays poor separation of the three groups, which indicates that the variables in the wine data show little nonlinear correlation.
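A comparison like the one in Section 3.5.2 can be sketched with standard toolbox counterparts of these cluster methods (MultiDA ships its own implementations; kmeans, linkage, cluster and crosstab below are Statistics Toolbox functions):

```matlab
% Compare k-means and hierarchical partitions of the standardized wine data.
Xz = zscore(X);                       % standardize first, as recommended above
ik = kmeans(Xz, 3);                   % k-means with 3 groups
Zl = linkage(Xz, 'ward');             % hierarchical clustering tree
ih = cluster(Zl, 'maxclust', 3);      % cut the tree into 3 groups
crosstab(ik, ih)                      % contingency table of the two partitions
```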

4. Conclusion

Multivariate data analysis (MultiDA), a chemometric software package with a user-friendly interface, has been developed for routine metabolomics data analysis. MultiDA integrates many algorithms for handling metabolomics data. Multi-model comparison is also adopted in MultiDA to give a comprehensive insight into the data structure and to make up for the shortcomings of individual algorithms. A case study on the wine data set demonstrates the functions and operating procedures of MultiDA. Because it is based on the MATLAB platform, MultiDA can be readily extended by users for their own purposes.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (90709014).

Table 3. Multi-model comparison of different feature selection methods.

Method | Informative variables | RMSEP-10CV^c | RMSER-10CV | BIC
Normal^a | 1–13 | 0.0618 | 0.0546 | −1.9601
Weight analysis | 1, 7, 10–13 | 0.0278 | 0.0305 | −15.0633
GA NIPALS | 1, 2, 4, 7, 9–13 | 0.0562 | 0.0431 | −9.1189
CART | 5, 7, 11–13 | 0.0781 | 0.0614 | −10.8625
NIPALS weight | 1, 4, 5, 13 | 0.2245 | 0.1327 | −6.1949
Stepwise | 1–4, 7, 10–13 | 0.0451 | 0.0377 | −8.2655
Co-variables^b | 1, 7, 13 | 0.0614 | 0.0584 | −14.4213

^a All variables were introduced into the PLS model.
^b Variables selected by nearly all six methods.
^c 10-fold cross-validation.

Table 4. Comparison of different supervised learning methods.

Method | RMSEP-10CV^a | RMSER-10CV
LDA | 0.0056 | 0
QDA | 0.0056 | 0.0047
MDA | 0.0222 | 0
CDA | 0 | 0
ULDA | 0 | 0
NIPALS | 0.0448 | 0.0322
SIMPLS | 0.0448 | 0.0273

^a 10-fold cross-validation.


Fig. 7. PCA scores plots grouped by the three cluster methods and by the raw groups: a) PCA scores plot grouped by k-means cluster analysis, b) PCA scores plot grouped by the raw groups, c) PCA scores plot grouped by hierarchical cluster analysis, d) PCA scores plot grouped by k-medoid cluster analysis. In all four figures, the x and y axes indicate principal component scores 1 and 2 (explained variance: PC1 99.8091%, PC2 0.17359%), and star, circle and square represent the three different groups.

Fig. 8. Outline of the non-linear PCA GUI: a) NLPCA GUI profile; the number of nodes in the three hidden layers, the epochs, error, learning rate and display interval are adjustable. b) Scores plot of the wine data processed by NLPCA under the 20-3-20 hidden layer structure; the x and y axes are the weights of the nodes in the bottleneck layer. Star, circle and square represent the three different groups.

References

[1] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press, Boca Raton, 2009.
[2] M. Andersson, A comparison of nine PLS1 algorithms, Journal of Chemometrics 23 (2009) 518–529.
[3] Matlab, The MathWorks, Inc., Natick, MA (USA), http://www.mathworks.com.
[4] R.A. Viscarra Rossel, ParLeS: software for chemometric analysis of spectroscopic data, Chemometrics and Intelligent Laboratory Systems 90 (2008) 72–83.
[5] M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak, TOMCAT: a MATLAB toolbox for multivariate calibration techniques, Chemometrics and Intelligent Laboratory Systems 85 (2007) 269–277.
[6] L.Z. Yi, J. He, Y.Z. Liang, D.L. Yuan, F.T. Chau, Plasma fatty acid metabolic profiling and biomarkers of type 2 diabetes mellitus based on GC/MS and PLS-LDA, FEBS Letters 580 (2006) 6837–6845.
[7] F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1969) 1–21.
[8] J.F. Gentleman, M.B. Wilk, Detecting outliers in a two-way table: 1. Statistical behaviour of residuals, Technometrics 17 (1975) 1–14.
[9] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. De Noord, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Analytical Chemistry 67 (1995) 4295–4301.
[10] R. Leardi, Application of a genetic algorithm for feature selection under full validation conditions and to outlier detection, Journal of Chemometrics 8 (1994) 65–79.
[11] R. Leardi, Genetic algorithms in chemometrics and chemistry: a review, Journal of Chemometrics 15 (2001) 559–569.
[12] S. Wold, H. Antti, F. Lindgren, J. Ohman, Orthogonal signal correction of near-infrared spectra, Chemometrics and Intelligent Laboratory Systems 44 (1998) 175–185.
[13] M. Hubert, P.J. Rousseeuw, K. Vanden Branden, ROBPCA: a new approach to robust principal components analysis, Technometrics 47 (2005) 64–79.
[14] M. Hubert, S. Engelen, Fast cross-validation of high-breakdown resampling algorithms for PCA, Computational Statistics and Data Analysis 51 (2007) 5013–5024.
[15] H. Sohn, K. Worden, C.R. Farrar, Novelty detection under changing environmental conditions, SPIE's 8th Annual International Symposium on Smart Structures and Materials, Newport Beach, 2001, pp. 108–118.
[16] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal 37 (1991) 233–243.
[17] A. Trujillo-Ortiz, R. Hernandez-Walls, S. Perez-Osuna, RAFisher2cda: Canonical Discriminant Analysis, a MATLAB file, 2004 [WWW document].
[18] H.W. Wang, Partial Least-Squares Regression Method and Applications, National Defense Industry Press, Beijing, 1994.
[19] D. Yuan, Y. Liang, L. Yi, Q. Xu, O. Kvalheim, Uncorrelated linear discriminant analysis (ULDA): a powerful tool for exploration of metabolomics data, Chemometrics and Intelligent Laboratory Systems 93 (2008) 70–79.
[20] Y. Xu, J.Y. Yang, Z. Jin, A novel method for Fisher discriminant analysis, Pattern Recognition 37 (2004) 381–384.
[21] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, Journal of Chemometrics 7 (1993) 45–49.
[22] H. Wold, Nonlinear estimation by iterative least squares procedures, in: Research Papers in Statistics, Wiley, New York, 1966, pp. 411–444.
[23] P. Geladi, B.R. Kowalski, Partial least squares regression: a tutorial, Analytica Chimica Acta 185 (1986) 1–17.
[24] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[25] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, 2010.
[26] S. Yang, Y. Lee, Identification of a multivariate outlier, Annual Meeting of the American Statistical Association, San Francisco, 1987.
