PSYCHOMETRIKA — VOL. 76, NO. 3, JULY 2011, PP. 440–460
DOI: 10.1007/s11336-011-9216-6

STATISTICAL SIGNIFICANCE OF THE CONTRIBUTION OF VARIABLES TO THE PCA SOLUTION: AN ALTERNATIVE PERMUTATION STRATEGY

MARIËLLE LINTING, BART JAN VAN OS, AND JACQUELINE J. MEULMAN

LEIDEN UNIVERSITY, THE NETHERLANDS

In this paper, the statistical significance of the contribution of variables to the principal components in principal components analysis (PCA) is assessed nonparametrically by the use of permutation tests. We compare a new strategy to a strategy used in previous research consisting of permuting the columns (variables) of a data matrix independently and concurrently, thus destroying the entire correlational structure of the data. This strategy is considered appropriate for assessing the significance of the PCA solution as a whole, but is not suitable for assessing the significance of the contribution of single variables. Alternatively, we propose a strategy involving permutation of one variable at a time, while keeping the other variables fixed. We compare the two approaches in a simulation study, considering proportions of Type I and Type II error. We use two corrections for multiple testing: the Bonferroni correction and controlling the False Discovery Rate (FDR). To assess the significance of the variance accounted for by the variables, permuting one variable at a time, combined with FDR correction, yields the most favorable results. This optimal strategy is applied to an empirical data set, and results are compared with bootstrap confidence intervals.

Key words: principal components analysis, permutation, statistical significance, component loadings, p-values.

1. Introduction

Principal components analysis (PCA) is a nonparametric analysis method frequently used in the social and behavioral sciences to reduce a large number of variables to a smaller number of uncorrelated unobserved variables—called principal components—that contain as much information from the observed variables as possible. The PCA solution can be computed in different ways; in the present study, we use the eigenvalue decomposition of the Pearson correlation matrix of the observed variables.1 The eigenvalues in PCA equal the eigenvalues of the Pearson correlation matrix of the variables and represent the variance accounted for (VAF) by the principal components. Principal components are ordered according to their eigenvalues, with the first component associated with the largest eigenvalue. The variance in a variable accounted for by all principal components selected for analysis is represented by its communality, which is the sum of squared component loadings for that variable across principal components. Component loadings represent correlations between the principal components and the observed variables. Although, in practice, PCA is often treated as a form of exploratory factor analysis, it has a different statistical background as well as a different objective: if the goal of the analysis is to optimally reduce a large number of variables to a smaller number of composite variables, rather than to derive a parsimonious model of the correlational structure between the variables, PCA is the more appropriate procedure (Fabrigar, Wegener, MacCallum, & Strahan, 1999).

1 The advantage of analyzing the correlation matrix instead of the covariance matrix is that differences in observed variance between the variables do not influence the analysis results; that is, the principal components are not sensitive to the measurement units of the variables. When PCA is performed on the covariance matrix, variables with relatively large variances dominate the first few principal components (Jolliffe, 2002).

Requests for reprints should be sent to Mariëlle Linting, Department of Education and Child Studies, Leiden University, P.O. Box 9555, 2300 RB Leiden, The Netherlands. E-mail: [email protected]

© 2011 The Psychometric Society


Although PCA is often used in exploratory research, it need not be deprived of inferential diagnostics. Two of the main challenges in performing PCA are finding the optimal number of principal components, and determining which variables contribute significantly to those components. In previous research, these challenges have been addressed with different methods. For example, asymptotic confidence intervals for the component loadings have been established for PCA based on the covariance matrix (Anderson, 1963; Girshick, 1939) as well as the correlation matrix (Ogasawara, 2004). Alternatively, nonparametric methods do not require a particular probability model, which theoretically better matches the nonparametric nature of PCA. The bootstrap is one such nonparametric method; it has been used in PCA to establish stability (Linting, Meulman, Groenen, & van der Kooij, 2007b) and statistical significance of the component loadings (Peres-Neto, Jackson, & Somers, 2003; Timmerman, Kiers, & Smilde, 2007). Another way of nonparametrically establishing significance of PCA results is by using permutation tests (Buja and Eyuboglu, 1992; Peres-Neto et al., 2003). However, the permutation approaches used in previous studies have not been completely successful in correctly assessing the significance of the contribution of the variables to the PCA solution. In the current paper, we compare a new permutation strategy with a strategy used in previous studies. After a brief introduction to permutation tests in general, we discuss the theoretical value of this new strategy in the context of variable fit. To assess the practical value of this new strategy, we perform a simulation study in which we account for multiple testing. Finally, we illustrate the practical use of the proposed strategy by applying it to an empirical data set, and compare the results with bootstrap results on the same data.

2. Permutation Tests in PCA

2.1. Permutation Tests in General

A permutation test is used in nonparametric hypothesis testing to estimate the null distribution of a particular statistic conditionally on the observed data, without relying on a specific probability model. The statistic of interest is computed on the observed data set as well as on a large number of new data sets, generated by randomly and independently permuting the elements within the columns of the data matrix. If the variables are assumed to be interchangeable (i.e., to share their marginal distributions), the data may be permuted across rows as well as columns. However, this assumption is unrealistic in most cases, because variables commonly differ in content and scaling. Therefore, usually the data are only permuted within the columns of the data set, on the assumption of interchangeability of the cases (Good, 2000). The values of the statistic computed on the permuted data sets compose the null distribution. The total number of possible permuted data sets is n!^(m−1), with n the number of objects (or persons) and m the number of permuted variables.2 Because this number increases rapidly with the number of objects and variables, usually a random sample of the total set of permutations is used. The alternative hypothesis that the observed value deviates significantly from the center of its null distribution is tested against the null hypothesis that it does not. A one-sided p-value is computed as p = (q + 1)/(P + 1), with q the number of times a statistic from the permutation distribution is greater than or equal to the observed statistic, and P the number of permuted data sets (Buja and Eyuboglu, 1992; Noreen, 1989).

2 This is the upper bound of the number of permuted data sets. The minus 1 in this formula indicates that all variables but one are permuted: when all but one of the variables have been permuted in every possible way, every combination with the values of the last variable has already occurred, so permuting the last variable would only produce duplicates and is thus redundant. When variables have duplicate values, the number of possible distinct data sets is smaller than n!^(m−1).


Because, under the null hypothesis, the observed data are assumed to be just another permutation of a random data set, the denominator in this equation is P + 1 rather than P.3 Note that, with this approach, p-values have a lower bound of 1/(P + 1), indicating that the number of permutations has an effect on the minimum p-value that can be obtained. If too few permutations are used, the minimum p-value will be relatively large. Buja and Eyuboglu (1992) suggested using 99 or 499 permutations, which would lead to minimum p-values of p = (0 + 1)/(99 + 1) = 0.01 and p = (0 + 1)/(499 + 1) = 0.002, respectively. In this study, we use 999 permutations. P-values are, as usual, compared with a prechosen significance level α: if p < α, the result is marked significant.4

The permutation test originated as a nonparametric alternative to using the theoretical t-distribution in comparing the means of two different groups of objects (see Fisher, 1935; Good, 2000). The biologist Mantel (1967) developed a significance test for the congruence of two distance matrices, by constructing a null distribution from random permutations of the rows and columns of one of the matrices, while keeping the other matrix fixed. Many extensions and generalizations of this idea have been proposed (e.g., see Dietz, 1983; Smouse, Long, & Sokal, 1985), and the idea has been applied and modified in several research fields, such as geography (Glick, 1979; Sokal, 1979), ecology and evolutionary research (Douglas and Endler, 1982), and psychometrics and classification (e.g., see Hubert and Schultz, 1976; Hubert, 1984, 1985, 1987). Permutation tests have proved useful in multiple regression and ANOVA (Anderson and Ter Braak, 2003; Ter Braak, 1992), and in several forms of nonlinear multivariate analysis that use optimal scaling (e.g., see De Leeuw and Van der Burg, 1986; Heiser and Meulman, 1994; Meulman, 1992, 1993, 1996). Suitable permutation tests can be formulated for many different types of problems, following a general procedure consisting of four steps: (1) formulate the testing problem; (2) choose the test statistic; (3) formulate the null and alternative hypotheses in terms of the test statistic; and (4) decide on the appropriate permutation strategy. These four steps will be followed in formulating a permutation test for assessing the significance of the contribution of variables to the PCA solution.
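To make this general recipe concrete, the following is a minimal sketch in Python (the function names and the example statistic are ours, not taken from the paper): it permutes independently within the columns of the data matrix and applies the p-value formula given above.

```python
import numpy as np

def permutation_pvalue(data, statistic, n_perm=999, rng=None):
    """One-sided permutation test: permute independently within the
    columns of `data`, recompute `statistic`, and return
    p = (q + 1) / (P + 1)."""
    rng = rng or np.random.default_rng()
    observed = statistic(data)
    q = 0
    for _ in range(n_perm):
        # Permute each column independently; this destroys the
        # correlational structure while preserving the marginal
        # distribution of every variable.
        permuted = np.column_stack([rng.permutation(col) for col in data.T])
        if statistic(permuted) >= observed:
            q += 1
    return (q + 1) / (n_perm + 1)

# Example statistic: the largest eigenvalue of the correlation matrix.
def largest_eigenvalue(x):
    return np.linalg.eigvalsh(np.corrcoef(x, rowvar=False))[-1]
```

Note that this sketch permutes all columns concurrently, which corresponds to the strategy used in previous studies (PermD, in the terminology introduced below); the alternative strategy proposed in this paper modifies only the inner permutation step.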
2.2. Assessing Significance of Contribution of Variables to the Principal Components

The significance of the contribution of the variables to the PCA solution has been established using permutation tests before (Buja and Eyuboglu, 1992; Landgrebe, Wurst, & Welzl, 2002; Peres-Neto et al., 2003). Alternatively, in studies by Peres-Neto et al. (2003) and Timmerman et al. (2007), the nonparametric bootstrap procedure rendered the best results in establishing significance of PCA component loadings, compared to other (parametric and nonparametric) methods. The nonparametric bootstrap and permutation tests are two different approaches to the same type of problem. Whereas permutation tests are developed to establish the distribution of a particular statistic under a specific null hypothesis, the bootstrap establishes the sampling distribution of a statistic based on an observed data set. In fact, the bootstrap simulates resampling from the population by randomly drawing, with replacement, new samples from the observed sample. The values of the statistic of interest in the bootstrap samples form the sampling distribution from which (percentile) confidence intervals can be computed. If a bootstrap confidence interval does not contain the value assumed under the null hypothesis, the observed statistic is concluded to be statistically significant.

3 When all possible permutations, including the observed data, are used (complete enumeration), the +1 is not needed.

4 In the literature, the null hypothesis is often rejected if the p-value is smaller than or equal to the significance level, but in the current paper, we chose the slightly more conservative rule of rejecting when p is smaller than the significance level.


The advantage of permutation tests over the bootstrap is that they render exact p-values. Also, the simulation of a null distribution is more closely related to traditional hypothesis testing. Peres-Neto et al. (2003) compared several approaches to establishing significance of PCA loadings, including the bootstrap and a permutation approach in which all variables within a data set are independently and concurrently permuted. Compared to the bootstrap, this permutation approach was less successful: permutation results showed proportions of Type I error smaller than nominal alpha, and the applied method had too little power. The same permutation approach was used earlier by Buja and Eyuboglu (1992), who found surprisingly few significant loadings considering the content of the data. These results illustrate the theoretical inappropriateness of permuting the entire data set for testing the significance of the contribution of separate variables to the PCA solution. However, this approach is appropriate for situations where the test statistic is the VAF in the entire data set by the first c principal components, with c the number of components selected to represent the data set sufficiently. This fit measure, the total VAF (TVAF), is equal to the sum of the eigenvalues of the first c components. When the test statistic is a fit measure for single variables (loadings or communalities), comparisons between its observed value and its value for variables with the same univariate distribution (permutations) in data sets with a completely random structure do not make sense. Alternatively, we propose a different approach, following the general procedure for choosing a suitable permutation test:

Step 1: The testing problem at hand is assessing the significance of variable fit to the PCA solution, conditional on the structure among the other variables in the data set.

Step 2: The test statistic in which we are interested is the communality: the contribution of each separate variable to the TVAF, given by the sum of the squared component loadings for a variable across components. The value of this statistic is constant across orthogonal rotations of the PCA solution. Alternative test statistics would be the (squared) loadings per component, but in that case rotational freedom of the PCA solution becomes an issue. In the previously applied approach of permuting variables concurrently, incorporating rotation toward a target solution (for instance, the observed solution) does not make sense, because the correlational structure of the permuted data sets is completely random. Because our aim is to compare the new approach with this previous approach, it seems best to use a rotation-independent statistic.

Step 3: For each variable in a data set, the null hypothesis to be tested is that its communality does not differ from the expected value (established from the permutation distribution) for a variable that is independent of the other variables in the data set. The alternative hypothesis is that the communality exceeds this expected value.

Step 4: To establish conditional significance of the communality, instead of permuting all variables concurrently, the values of one variable at a time should be randomly permuted, while keeping the other variables fixed. In this way, the correlational structure between the permuted variable and the other variables is destroyed, while the structure among the other variables remains unchanged.
This latter strategy is referred to as PermV in the remainder of this paper. We will refer to permuting the entire data set (all variables concurrently) as PermD. To establish empirical support for the theoretical idea that PermV is a more appropriate strategy for establishing significance of variable VAF, the remainder of this paper will be mainly dedicated to comparing results from the PermV strategy to the PermD strategy. In doing so, we focus on the use of permutation tests with two-dimensional PCA solutions.
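The following is a compact sketch of the proposed strategy under our reading of Steps 1–4 (names and implementation details are ours, not the authors' code): the communality of each variable is computed from the first c components of the correlation matrix, and PermV permutes one column at a time while keeping the rest fixed.

```python
import numpy as np

def communalities(data, n_comp=2):
    """Sum of squared loadings of each variable on the first n_comp
    principal components of the Pearson correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # ascending order
    top = np.argsort(eigvals)[::-1][:n_comp]
    loadings = eigvecs[:, top] * np.sqrt(eigvals[top])
    return np.sum(loadings ** 2, axis=1)

def permv_pvalues(data, n_comp=2, n_perm=999, rng=None):
    """PermV: permute one variable at a time while keeping the other
    variables fixed; returns one p-value per variable."""
    rng = rng or np.random.default_rng()
    observed = communalities(data, n_comp)
    pvalues = np.empty(data.shape[1])
    for j in range(data.shape[1]):
        q = 0
        permuted = data.copy()
        for _ in range(n_perm):
            # Only column j is shuffled; the correlational structure
            # among the other variables remains intact.
            permuted[:, j] = rng.permutation(data[:, j])
            if communalities(permuted, n_comp)[j] >= observed[j]:
                q += 1
        pvalues[j] = (q + 1) / (n_perm + 1)
    return pvalues
```

Replacing the inner loop so that all columns are permuted concurrently, as in the sketch in Section 2.1, yields PermD.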


FIGURE 1. Complete design of the Monte Carlo study; m is the total number of variables in a data set, n is the number of objects. In each cell, Type I and II errors for the uncorrected, Bonferroni corrected, and FDR corrected 0.05 significance levels are computed.

3. Design of the Monte Carlo Study

To compare the effectiveness of PermV and PermD, we performed an extensive Monte Carlo simulation study, varying several aspects of the design. We used a large number of Monte Carlo replications (R) of three different types of simulated data sets: data with a strong two-dimensional principal components structure, data with a moderately strong two-dimensional structure, and data with no distinct component structure (random data). Each Monte Carlo replication consists of the following steps: (a) generating a data set of a specific size and structure, (b) permuting the generated data set a large number of times, (c) creating a permutation distribution for the communality of each variable, and (d) computing a p-value for each communality. The simulation procedure we use (see the next section) allows us to generate data sets in which some variables contribute to the two-dimensional structure (and thus should have significant communalities) and others do not. Data sets in all cells of the design were analyzed by both PermD and PermV. We performed 999 permutations for each test, such that PermD requires 999 permutations in each Monte Carlo replication and PermV requires 999 × m permutations in each replication. We studied the behavior of each strategy with respect to the proportion of false significant results (Type I error) and the proportion of false nonsignificant results (Type II error) regarding the communalities.5 The results of the Monte Carlo study are expected to vary with the size of the data set, as a principal components structure may be more clearly established in larger data sets. Therefore, for each principal components structure, we varied the size of the data set, from quite small (with 20 variables and 100 objects) to relatively large (with 40 variables and 500 objects). In Figure 1, the complete design is presented schematically. We estimated the p-values for the VAF per variable by using a large number of Monte Carlo replications for each cell of the design (1500 for data sets with 20 variables, 750 for data sets with 40 variables; the validation of this number of replications can be found in Appendix B).

5 The probability of making a Type II error is also referred to as β, where 1 − β approximates the power.

3.1. Generating Data Matrices with Different Principal Component Structures

To examine the behavior of the two strategies under different conditions, data matrices have been constructed with three types of prespecified principal components structure. Each type can be represented by two blocks of variables: one block with variables (the signal variables) that may contribute significantly to the principal components of a correlation matrix C, and another block with variables that do not (the noise variables).


FIGURE 2. Block structure of the simulated correlation matrices.

We assume that in the population there is no correlation between the two blocks of variables. The blocks of variables are created as follows. We generate a block-diagonal population correlation matrix C (see Figure 2) with two blocks on the diagonal. The first block, C1, contains the correlations between m1 (possibly) signal variables, and the second block, C2, the correlations between m2 noise variables. The off-diagonal blocks of C consist of zeros, thus the variables in C1 do not correlate with the variables in C2. A C1 block is generated according to one of three types of structure:

1. A strong two-component structure;
2. A moderately strong two-component structure;
3. No specific principal components structure (i.e., a random structure).

In the C1 blocks with a strong structure, the first principal component accounts for approximately 50% of the TVAF, while the second principal component accounts for approximately 30%. The rest of the variance is approximately equally divided among the remaining components, thus the associated eigenvalues are approximately equal. In C1 blocks with a moderately strong structure, the first two components account for 30% and 10% of the TVAF, respectively. C1 blocks without a particular principal components structure are generated such that the proportion of variance accounted for by each component approximately equals 1/m. In that case, the entire data set consists of noise variables. Scree plots for the eigenvalues expected for each of the structures are given in Figure 3. The noise variables do not contribute to the principal components structure of C. In reality, the population data will never exactly concur with the null hypothesis, and thus the noise variables will not be perfectly uncorrelated, either with each other or with the signal variables. Therefore, to obtain a more realistic situation, we impose a very weak one-dimensional structure among the noise variables (for example, as in a simple form of method effect). Consequently, in structured data sets, the proportion of VAF by each component within C2 is not exactly 1/m2 (which is the value we would expect when the variables are uncorrelated with each other), but is slightly higher for the first component. In data sets without a specific component structure, the VAF is also slightly higher for the first than for the other components. The details of the data generating procedure are described in Appendix A.
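Since Appendix A is not reproduced here, the sketch below only illustrates the block-diagonal setup of Figure 2; the loading values, and hence the resulting eigenvalue pattern, are our own assumptions and do not reproduce the exact VAF targets described above.

```python
import numpy as np

def simulate_block_data(n=100, m1=15, m2=5, signal_loadings=(0.75, 0.55),
                        noise_loading=0.2, rng=None):
    """Draw n objects from a block-diagonal population correlation
    matrix C: an m1-variable signal block with a two-component
    structure and an m2-variable noise block with a very weak
    one-dimensional (method-effect) structure."""
    rng = rng or np.random.default_rng()
    # Signal block C1: half the variables load on each of two
    # orthogonal components (illustrative values, not Appendix A).
    loadings = np.zeros((m1, 2))
    loadings[: m1 // 2, 0] = signal_loadings[0]
    loadings[m1 // 2 :, 1] = signal_loadings[1]
    c1 = loadings @ loadings.T
    np.fill_diagonal(c1, 1.0)
    # Noise block C2: one weak shared factor among the noise variables.
    c2 = np.full((m2, m2), noise_loading ** 2)
    np.fill_diagonal(c2, 1.0)
    # Block-diagonal C as in Figure 2: zero correlation across blocks.
    c = np.block([[c1, np.zeros((m1, m2))],
                  [np.zeros((m2, m1)), c2]])
    return rng.multivariate_normal(np.zeros(m1 + m2), c, size=n)
```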


FIGURE 3. Expected eigenvalues of the dimensions from block C1 of the simulated data matrices.

To keep the results as comparable as possible, and to not further complicate the design, the same ratio of variables from C1 to C2 has been used to create small and large data sets. The ratio of m1 to m2 is 3:1, resulting in m1 = 15 and m2 = 5 for the data sets with 20 variables, and m1 = 30 and m2 = 10 for the data sets containing 40 variables. When we perform a two-dimensional PCA on data with a strong or moderately strong two-dimensional principal components structure, the null hypothesis (i.e., the VAF of a variable does not exceed its expected value under the assumption of independence from the other variables) should be rejected for the variables in C1, and not rejected for the variables in C2. For the data without a distinct principal components structure, the null hypothesis should not be rejected for any variable. To find out whether the two permutation strategies are able to detect the imposed components structure in terms of the VAF, the entire permutation procedure (Steps (a) through (d) mentioned at the start of this section) is repeated R times, resulting in R p-values per variable. Consequently, R decisions of whether or not to reject the null hypothesis at a certain significance level are made. Following the procedure described in Appendix B, we chose R to be 1500 for data sets with 20 variables, and 750 for data sets with 40 variables. Whenever the decision is false compared to the actual structure of the simulated data set, we record it as an error (Type I error for false significants, Type II error for false nonsignificants). The proportion of Type I errors should equal the nominal α, whereas the Type II error rate should be as small as possible. We use different significance levels, either corrected or uncorrected for multiple testing.

3.2. Correction for Multiple Testing

It is common practice to consider a test result significant if its p-value is smaller than 0.05. We will refer to this criterion as the uncorrected significance level (UNC). When we consider the VAF for each of m variables, however, multiple (m) tests are performed on the same data set. Thus, when a significance level of 0.05 is used for each separate result, the probability of incorrectly marking one or more of the results significant will be inflated.


To overcome this problem in multiple testing, we use two different methods to correct the significance level. The first correction method we use is the simple Bonferroni correction (BF), which is aimed at controlling the so-called familywise error rate (FWE). The FWE is defined as the probability of making one or more false rejections in a "family" of hypothesis tests (see Shaffer, 1995). Thus, the probability that any of the results is incorrectly marked significant is controlled. The BF divides the significance level α by the number of tests performed on one data set. In this study, applying the BF reduces to dividing the significance level by the total number of variables in the data set (α/m), because we perform a significance test on each of the variables. As this type of correction is easy to understand and implement, it is quite often used by applied researchers.

The second correction method that is used is a less conservative alternative to the Bonferroni correction, developed by Benjamini and Hochberg (1995). Instead of controlling the FWE, this method is aimed at controlling the false discovery rate (FDR), which is the proportion of falsely rejected null hypotheses within the total set of rejections. Suppose we obtain a list of results from several hypothesis tests for one data set, with a 0.05 significance level α. Methods controlling the FWE ensure that the probability of the list containing at least one false rejection is at most 5%, that is, that the probability of the entire list being correct (containing no false rejections) is at least 95%. Alternatively, controlling the false discovery rate (from here on referred to as the FDR correction) ensures that the proportion of rejections that is false is kept below 5% (i.e., that at most 5% of the significant results on the list are incorrect). The FWE is always larger than or equal to the FDR, thus FWE control automatically implies FDR control, and FDR control will lead to a gain in power compared with FWE control. In the context of exploratory research, FDR control seems more sensible, as accepting a certain amount of error is common practice (Keselman, Cribbie, & Holland, 1999; Verhoeven, Simonsen, & McIntyre, 2005).

The FDR procedure starts with sorting the p-values for all the variables of one data set in ascending order, and assigning each of them a rank number. Next, starting with the largest p-value (the one with the highest rank number, i.e., the last one in the sorted list), each p-value is tested against a significance level of r/t × α, with t the number of tests (in this study, t equals the number of variables m), and r the rank number of the p-value. When a p-value is smaller than the FDR corrected significance level, the statistic corresponding to this p-value as well as the statistics corresponding to all p-values with lower rank numbers are marked significant. For instance, if m = 5, α = 0.05, and the ordered p-values are 0.001 (r = 1), 0.021 (r = 2), 0.025 (r = 3), 0.09 (r = 4), and 0.12 (r = 5), FDR control is executed in the following way. The p-value of 0.12 is tested with 5/5 × 0.05 = 0.05 and found nonsignificant, the p-value of 0.09 is tested with 4/5 × 0.05 = 0.04 and found nonsignificant, the p-value of 0.025 is tested with 3/5 × 0.05 = 0.03 and found significant, and consequently the other two p-values are also marked significant.
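The step-up procedure just described is straightforward to implement; the sketch below (our code, using the paper's strict "<" rejection rule) reproduces the worked example.

```python
import numpy as np

def fdr_significant(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure: mark each p-value significant or
    not while controlling the false discovery rate at `alpha`."""
    p = np.asarray(pvals, dtype=float)
    t = len(p)
    order = np.argsort(p)                  # ascending; rank r = index + 1
    levels = (np.arange(1, t + 1) / t) * alpha
    passed = p[order] < levels             # strict '<', as in this paper
    keep = np.zeros(t, dtype=bool)
    if passed.any():
        last = np.nonzero(passed)[0].max() # highest rank meeting r/t * alpha
        keep[order[:last + 1]] = True      # it and all lower-ranked p-values
    return keep

# The worked example from the text:
print(fdr_significant([0.001, 0.021, 0.025, 0.09, 0.12]))
# [ True  True  True False False]
```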
Benjamini and Hochberg (1995) have shown that the performance of this procedure with respect to the proportion of Type I error is quite satisfactory. The major advantage of the FDR procedure over the Bonferroni correction is that it attains greater power (i.e., a smaller proportion of Type II error). In addition, FDR control proved to be more powerful than several other correction procedures, especially as the number of hypotheses tested increased (Keselman et al., 1999).

3.3. Computing Proportions of Type I and Type II Error

Type I and Type II error rates are calculated for both permutation strategies, for each cell of the design, both without correction and with BF and FDR correction. In the data sets with a strong or moderately strong two-component structure in C1, the variables in C2 are not supposed to contribute to the TVAF. The proportion of tests in which these variables give false significant results is the Type I error rate. If the testing procedure works correctly, the Type I error rate will equal the nominal alpha (that is, render false significant results in α × 100% of the tests).


The Type I error rate is computed as TypeI = sigC2/m2, where sigC2 is the number of times that the VAF of a variable in C2 is incorrectly found significant, and m2 is the number of significance tests with a (possibly false) positive outcome that are performed on one data set, which in the structured data sets equals the number of variables in C2. Note that the number of significant results in C2 differs for each type of significance level condition, as the actual significance level used depends upon the correction method. In data sets with no specific component structure, each significant result is a false significant. Then, in the computation of the Type I error rate in the UNC condition, m2 should be replaced by the total number of variables in the data set, m. The Type I error rate is averaged over all R data replications. The Type II error rate for the strong and moderately strong two-dimensional data structures is given by the proportion of times that a variable contributing to the two-dimensional data structure (i.e., a variable in C1) does not come up with a significant VAF. The proportion of Type II error is computed as TypeII = nsigC1/m1, with nsigC1 the number of times a variable in C1 is falsely found nonsignificant (at the significance level corresponding to the correction method used), and m1 the number of tests with a (possibly false) nonsignificant outcome. Data sets without a specific two-dimensional structure have no possibility of rendering false nonsignificant results (each result is supposed to be nonsignificant), and thus for unstructured data, Type II errors do not exist.
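In code, the two proportions for a single replication reduce to the following (a sketch with names of our choosing; the values are then averaged over the R replications, and for unstructured data an all-False signal mask is passed and the undefined Type II value ignored):

```python
import numpy as np

def error_proportions(significant, is_signal):
    """Type I and Type II proportions for one replication, given a
    boolean vector of test decisions and a boolean mask flagging the
    C1 (signal) variables."""
    significant = np.asarray(significant)
    is_signal = np.asarray(is_signal)
    type_i = np.mean(significant[~is_signal])    # sigC2 / m2
    type_ii = np.mean(~significant[is_signal])   # nsigC1 / m1 (NaN if no signal)
    return type_i, type_ii
```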

4. Results of the Simulation Study

The main question in this study is how PermV performs compared to PermD, under different significance level conditions (UNC, BF, or FDR) and for different sizes and structures of data sets. In the following sections, we will first focus on the most useful results for answering that question, that is, the data sets with 20 variables, as these show higher Type I and II error rates than the data sets with 40 variables. The relatively high error rates for data sets with 20 variables result from the fact that we kept the proportion of VAF by the two principal components constant across the two data size conditions. Consequently, the principal components structure is more evident in data sets with 40 variables than in data sets with 20 variables: for example, for data sets with a moderate structure, the first two components account for 40% of the variance in the data, and the remaining 60% is approximately equally divided across the other components. So, in the 40-variable condition, 60% of the variance is accounted for by the remaining 38 components, whereas in the 20-variable condition, 60% is accounted for by only 18 components. Thus, in data sets with 20 variables, the remaining proportion of VAF will be distributed among relatively few components compared to the 40-variable condition, such that the difference in VAF between the first two components and the rest will be smaller, and the structure is less clear.

4.1. Permutation Strategies: General Comparison of Error Rates

First, we will focus on the general comparison between PermD and PermV. Table 1 displays a selection of general results of the study, showing the proportions of Type I and Type II error for the two permutation strategies, with an uncorrected 0.05 significance level (UNC). The mean proportion of Type I error over numbers of objects is 0.005 for PermD and 0.063 for PermV, and the mean proportion of Type II error is 0.271 for PermD and 0.043 for PermV. The fact that proportions of Type I error are much smaller than the nominal α with PermD confirms the inappropriateness of PermD for the testing problem at hand. For PermV, proportions of Type I error are slightly higher than the nominal α, which may be due to the fact that we introduced noise to the data structures. Power (1 − proportion of Type II error) is clearly higher with PermV.

TABLE 1.
Proportions of Type I, Type II, and total errors for the PermD and PermV strategies. (a, b, c)

                      Type I              Type II
No. of objects     PermD    PermV      PermD    PermV
100                0.015    0.063      0.538    0.159
200                0.004    0.064      0.303    0.011
300                0.001    0.062      0.177    0.001
500                0.000    0.064      0.067    0.000
Mean               0.005    0.063      0.271    0.043

(a) Moderately strong 2-dimensional data structure (m = 20)
(b) Number of replications = 1500
(c) With uncorrected 0.05 significance levels

FIGURE 4. Examples of permutation distributions of the VAF in variable V2 of a particular Monte Carlo data set, after permutation of the entire data set (PermD) and after permutation of separate variables (PermV). Results for a moderately structured data set with 20 variables and 100 objects. The observed VAF in V2 is indicated by the star.

The next sections will show that a similar comparison applies (albeit with smaller error rates) with Bonferroni and FDR correction of the significance level. These differences between the two strategies can be explained by the fact that the permutation distribution with PermD has a wider spread than the permutation distribution with PermV, as is illustrated by Figure 4. In this figure, examples of the permutation distributions for the VAF of one variable (V2) in a particular simulated data set are shown. Panel (a) contains the distribution obtained by PermD. The observed VAF value is represented by a star. Because with PermD permuted data sets have an entirely random structure, variables may sometimes by chance obtain a relatively high loading. As a result, the distribution for PermD is quite widely spread, and the observed VAF value lies within the permutation distribution, with a corresponding p-value of 0.199, which is not significant. If, alternatively, the PermV strategy is used, where only one variable is permuted, the permuted data set will still have a principal components structure as determined by the other variables in C1. The chance that the permuted variable will obtain a permutation distribution containing relatively large VAF values is very small. Panel (b) of Figure 4 shows that, indeed, the permutation distribution for V2 with the PermV strategy is much narrower than that for PermD (Panel a), and the observed VAF (indicated by a star) is close to the tail of the distribution, with a p-value of 0.006, which indicates significance. Therefore, with the PermV strategy, the probability of marking VAF significant, and thus also of making a Type I error, will be larger compared with the PermD strategy, while the probability of making a Type II error will be smaller.

TABLE 2.
Proportions of Type I and II errors (a) based on 1500 replications on simulated data with strong, moderately strong, or no distinct 2-dimensional structure, m = 20.

No. of   Permu-   Sign. (b)   Strong struct.      Moderate struct.    No struct.
cases    tation               Type I   Type II    Type I   Type II    Type I
100      PermD    0.05        0.000    0.000      0.015    0.538      0.052
                  FDR         0.000    0.000      0.002    0.878      0.003
                  BF          0.000    0.000      0.001    0.961      0.003
         PermV    0.05        0.048    0.000      0.063    0.159      0.060
                  FDR         0.039    0.000      0.046    0.261      0.005
                  BF          0.002    0.000      0.004    0.624      0.003
200      PermD    0.05        0.000    0.000      0.004    0.303      0.052
                  FDR         0.000    0.000      0.001    0.600      0.003
                  BF          0.000    0.000      0.000    0.907      0.003
         PermV    0.05        0.048    0.000      0.064    0.011      0.077
                  FDR         0.040    0.000      0.054    0.024      0.006
                  BF          0.002    0.000      0.004    0.167      0.004
300      PermD    0.05        0.000    0.000      0.001    0.177      0.054
                  FDR         0.000    0.000      0.000    0.381      0.003
                  BF          0.000    0.000      0.000    0.855      0.003
         PermV    0.05        0.049    0.000      0.062    0.001      0.091
                  FDR         0.040    0.000      0.053    0.002      0.010
                  BF          0.002    0.000      0.004    0.035      0.006
500      PermD    0.05        0.000    0.000      0.000    0.067      0.057
                  FDR         0.000    0.000      0.000    0.149      0.004
                  BF          0.000    0.000      0.000    0.783      0.003
         PermV    0.05        0.054    0.000      0.064    0.000      0.125
                  FDR         0.044    0.000      0.056    0.000      0.022
                  BF          0.002    0.000      0.003    0.001      0.012

(a) Sizes of confidence regions for these errors range from 0.0001 to 0.0065
(b) FDR = controlling the false discovery rate; BF = Bonferroni correction

In exploratory research, Type II error is often considered more serious than Type I error, because in such research it is important to find any effect that might be present in the data. When an effect is discovered in an exploratory study, new studies might be conducted to confirm this result. However, when an exploratory study fails to find an effect that is present in the population (i.e., a Type II error is made), new studies investigating this effect might never be conducted. This emphasis on avoiding Type II error again implies the greater suitability of PermV for the testing problem at hand.

4.2. Permutation Strategies Combined with Different Confidence Level Conditions

In this section, we combine the two permutation strategies with the three different confidence level conditions: the uncorrected condition (UNC) with 0.05 significance level, the BF corrected, and the FDR corrected condition. Table 2 shows the Type I error rates for data sets with 20 variables and with three different types of structure. When we compare the two permutation strategies, for all confidence level conditions, PermD gives smaller proportions of Type I error than PermV, showing that the conclusion for the uncorrected confidence level in Table 1 also applies to the BF and FDR conditions. The results considering Type II error for the moderately structured data sets with 20 variables are also displayed in Table 2. To show the differences among the confidence level conditions more clearly, we have additionally displayed the Type II errors in Figure 5.


FIGURE 5. Proportions of Type II error for data sets with a moderately strong principal components structure, calculated after R replications of the permutation test. (R = 1500 when m = 20, and R = 750 when m = 40.) Results after PermD are indicated by dashed lines; results after PermV are indicated by solid lines. Circles indicate results with an uncorrected 0.05 significance level, downward-pointing triangles indicate results after FDR correction, and upward-pointing triangles after BF correction.

Panel (a) shows results for data sets with 20 variables, and Panel (b) for data sets with 40 variables. The results are averaged over the Monte Carlo replications (1500 for the 20-variables condition, and 750 for the 40-variables condition). Estimates for proportions of Type II error are displayed for the UNC condition (marked with circles), the BF condition (upward-pointing triangles), and the FDR condition (downward-pointing triangles). The dashed lines indicate the results from the PermD strategy, and the solid lines those from the PermV strategy. The results for Type II error are completely reversed compared with the results for Type I error: PermD has much larger proportions of Type II error than PermV, specifically for the BF and the FDR condition. The proportions of Type II error in the UNC condition are smaller than those with FDR, which are smaller than those with BF. For data with a moderate two-component structure and over different numbers of objects, for PermD, proportions of Type II error range from 0.07 to 0.54 for the UNC condition, from 0.15 to 0.88 for the FDR condition, and from 0.78 to 0.96 for the BF condition. For PermV, these ranges are 0.00 to 0.16 (UNC), 0.00 to 0.26 (FDR), and 0.00 to 0.63 (BF). If we consider a power of 0.80 or higher acceptable, the PermD strategy never gives acceptable results (except for n = 500, with UNC and FDR), while PermV is acceptable under all conditions (except for n = 100, with BF and FDR). For both PermD and PermV, the Bonferroni correction leads to the highest loss of power, thus we conclude that this correction is much too conservative.

Permutation strategies with different confidence level conditions for different data structures

The results in Table 2 show that, for all confidence level conditions, Type I error rates are smaller when the data structure is stronger. In unstructured data sets, Type I error rates are much larger than in structured data sets. This result is in accordance with traditional, parametric hypothesis testing, where the probability of obtaining a false significant is also larger in data sets with a weak structure than in data sets with a stronger structure. In this study, because the unstructured data sets were not "unstructured" in the strict sense, as they contained a (realistic) method effect (which should be insignificant), the probability of a false significant was increased compared with truly unstructured data. Type II error rates are also smaller for data sets with a strong structure compared to a moderate structure, as (of course) the power is much higher when effects are strong. For unstructured data sets, Type II errors do not apply.

TABLE 3.
Proportions of Type I and II errors (a) based on 750 replications on simulated data with strong, moderately strong, or no distinct 2-dimensional structure, m = 40.

No. of   Permu-   Sign. (b)   Strong struct.      Moderate struct.    No struct.
cases    tation               Type I   Type II    Type I   Type II    Type I
100      PermD    0.05        0.000    0.000      0.000    0.108      0.051
                  FDR         0.000    0.000      0.000    0.194      0.001
                  BF          0.000    0.000      0.000    0.822      0.001
         PermV    0.05        0.048    0.000      0.048    0.000      0.059
                  FDR         0.038    0.000      0.039    0.000      0.002
                  BF          0.000    0.000      0.001    0.042      0.002
200      PermD    0.05        0.000    0.000      0.000    0.007      0.051
                  FDR         0.000    0.000      0.000    0.013      0.001
                  BF          0.000    0.000      0.000    0.552      0.001
         PermV    0.05        0.045    0.000      0.050    0.000      0.068
                  FDR         0.035    0.000      0.040    0.000      0.003
                  BF          0.001    0.000      0.001    0.000      0.002
300      PermD    0.05        0.000    0.000      0.000    0.000      0.054
                  FDR         0.000    0.000      0.000    0.001      0.001
                  BF          0.000    0.000      0.000    0.360      0.001
         PermV    0.05        0.052    0.000      0.048    0.000      0.081
                  FDR         0.041    0.000      0.039    0.000      0.004
                  BF          0.001    0.000      0.002    0.000      0.003
500      PermD    0.05        0.000    0.000      0.000    0.000      0.055
                  FDR         0.000    0.000      0.000    0.000      0.002
                  BF          0.000    0.000      0.000    0.158      0.001
         PermV    0.05        0.048    0.000      0.050    0.000      0.100
                  FDR         0.037    0.000      0.037    0.000      0.007
                  BF          0.001    0.000      0.001    0.000      0.005

(a) Sizes of confidence regions for these errors range from 0.0001 to 0.0065
(b) FDR = controlling the false discovery rate; BF = Bonferroni correction

Permutation strategies with different confidence level conditions for different numbers of objects

For structured data sets, the Type I error rates do not depend on the number of objects in the data set (see Table 2), which indicates that 100 objects are sufficient for proper PermV results. However, for unstructured data sets, Type I error rates are higher for data sets containing more objects, which reflects the fact that, in general, smaller effects will turn up significant in large samples than in small samples. Thus, in this study, the small method effect is more likely to surface in a larger sample. From Figure 5, we conclude that the proportion of Type II error decreases when the number of objects in the data set increases. Figure 5a shows that this conclusion holds specifically for the PermV condition, where the error drops considerably from 100 to 200 objects.

Permutation strategies with different confidence level conditions for different numbers of variables

The results for the data sets with 40 variables are displayed in Table 3. All the results described for the 20-variables condition are confirmed; and, as expected, there is an overall drop in the proportion of errors compared to the 20-variables condition. In the structured data sets with PermV, for the UNC condition, the Type I error rate is around the desired 0.05 for data sets of all sizes; the proportions of Type I error in the FDR condition are slightly smaller than expected


(they vary slightly around 0.04); and the Type I error rates in the BF condition are close to 0. For PermD with structured data sets, the proportion of Type I error is always (close to) zero. For the unstructured data sets and without correction of the significance level, the Type I error rates are higher, specifically for PermV with large samples. The results for the proportion of Type II error are consistent with the 20-variables condition: for PermD on moderately structured data, the error proportions in the BF condition are too large if n < 500. Results for the other conditions range from acceptable (for n = 100) to excellent. In summary, the number of objects and variables in the data set is primarily important for the proportion of Type II error. If the entire data set is permuted, more objects and variables (the proportion of VAF by the components being equal) are needed to obtain enough power compared to the permutation of one variable at a time. In addition, for small data sets, the Bonferroni correction leads to a huge loss of power for both permutation strategies, especially for PermD. Considering the size of structured data sets, acceptable results for both Type I and Type II error rates were obtained for PermV with FDR and BF, for data sets with 20 variables and at least 200 objects, or with 40 variables and at least 100 objects. With FDR correction, a higher level of power was reached than with BF correction. For data sets without any structure and without correction of the significance level, Type I error rates were somewhat inflated, probably due to the noise that was induced on the simulated data.

5. Application to Empirical Data

In practice, applied researchers are confronted with empirical data instead of the simulated data we used in this study. Thus, from a substantive perspective, it would be interesting to know whether the PermV method leads to insightful results for such data. Empirical data differ in two essential ways from simulated data: (1) their underlying population structure is unknown, and (2) the substantive meaning of the variables is of interest. The first difference points out the unsuitability of empirical data for technical validation studies, whereas the second indicates their suitability for gaining substantive insight into a method's results. With this last issue in mind, we present an application of PermV to an empirical data set. Of course, results for these particular data should be viewed as an illustration and cannot be generalized. The data in the example were collected by the National Institute of Child Health and Human Development in their Study of Early Child Care (NICHD Early Child Care Research Network, 1996). A sample of 592 six-month-old children was observed in their primary non-maternal child care arrangement using the Observational Record of the Caregiving Environment (ORCE). We selected 20 ORCE items for analysis, all indicators of quality of caregiver–child interaction. Thirteen of these indicators were counts of behavior during the observation period; the other seven (variables 2, 5, 12, 13, 14, 16, and 19 in Table 4) were 4-point ratings of behavior ranging from "not at all characteristic" through "highly characteristic". Instead of including all ORCE items in further analyses, the NICHD wished to construct composite variables that summarize the information in the data as well as possible. This type of objective can typically be attained using PCA. The ORCE variables are expected to incorporate two subscales, thus a two-dimensional structure is expected to underlie these data. To find out how PermV would deal with possible noise variables, we included five background variables in the two-dimensional PCA: three characteristics of the child care setting (caregiver age, caregiver–child ratio, and group size) and two of the children's families (income-to-needs ratio, hours per week worked by the mother).

TABLE 4.
95% Bootstrap Confidence Intervals (BCI) for the component loadings for the NICHD data in a two-component PCA, and p-values from a permutation test for corresponding VAF. (a)

Component 1
Variable               Load     BCIlow   BCIup    VAF     p       cFDR
1. Other talk          0.843    0.821    0.862    0.710   0.001
2. Pos. regard         0.774    0.742    0.804    0.599   0.001
3. Resp. nondistress   0.755    0.715    0.787    0.570   0.001
4. Asks                0.762    0.723    0.794    0.580   0.001
5. Facilitates beh.    0.748    0.711    0.782    0.559   0.001
6. Stimulation         0.733    0.690    0.770    0.538   0.001
7. Stim. cogn.         0.712    0.669    0.747    0.506   0.001
8. Vocalization        0.705    0.663    0.743    0.497   0.001
9. Pos. affect         0.601    0.546    0.652    0.361   0.001
10. Pos. physical      0.597    0.531    0.653    0.357   0.001
11. Stim. social       0.316    0.228    0.397    0.100   0.001
12. Reads              0.284    0.206    0.361    0.081   0.001
13. Detachment        −0.733   −0.767   −0.692    0.537   0.001
14. Flatness          −0.474   −0.535   −0.407    0.225   0.001
15. Restricts phys.   −0.376   −0.449   −0.294    0.141   0.001
16. Intrusiveness      0.010   −0.063    0.086    0.000   0.857   0.049
17. Restricts act.     0.191    0.115    0.261    0.036   0.001
18. Neg. speech       −0.071   −0.143    0.004    0.005   0.149   0.035
19. Neg. regard        0.013   −0.060    0.085    0.002   0.788   0.045
20. Neg. physical      0.005   −0.056    0.066    0.000   0.916   0.050
21. Group size        −0.559   −0.611   −0.510    0.312   0.001
22. Cg-child ratio    −0.547   −0.599   −0.495    0.299   0.001
23. Income-needs       0.074   −0.013    0.164    0.005   0.117   0.033
24. Hours work m.     −0.083   −0.173    0.009    0.007   0.096   0.032
25. Cg age             0.071   −0.022    0.168    0.005   0.129   0.034

Component 2
Variable               Load     BCIlow   BCIup    VAF     p       cFDR
1. Other talk          0.080    0.005    0.151    0.006   0.327   0.039
2. Pos. regard        −0.157   −0.226   −0.065    0.025   0.070   0.030
3. Resp. nondistress  −0.263   −0.331   −0.182    0.069   0.002
4. Asks                0.062   −0.039    0.167    0.004   0.449   0.041
5. Facilitates beh.    0.109    0.029    0.182    0.012   0.177   0.036
6. Stimulation        −0.086   −0.207    0.052    0.007   0.285   0.038
7. Stim. cogn.        −0.022   −0.143    0.092    0.000   0.817   0.046
8. Vocalization        0.065   −0.029    0.158    0.004   0.440   0.040
9. Pos. affect         0.092   −0.019    0.197    0.008   0.256   0.037
10. Pos. physical     −0.034   −0.176    0.104    0.001   0.670   0.044
11. Stim. social       0.017   −0.189    0.223    0.000   0.839   0.047
12. Reads             −0.140   −0.293    0.031    0.019   0.092   0.031
13. Detachment         0.196    0.104    0.281    0.039   0.015   0.028
14. Flatness           0.402    0.283    0.498    0.161   0.001
15. Restricts phys.   −0.033   −0.195    0.126    0.001   0.667   0.043
16. Intrusiveness      0.614    0.474    0.698    0.377   0.001
17. Restricts act.     0.596    0.486    0.670    0.356   0.001
18. Neg. speech        0.567    0.325    0.700    0.321   0.001
19. Neg. regard        0.525    0.294    0.667    0.275   0.001
20. Neg. physical      0.280    0.081    0.455    0.078   0.004
21. Group size        −0.350   −0.471   −0.204    0.123   0.001
22. Cg-child ratio    −0.373   −0.495   −0.230    0.139   0.001
23. Income-needs      −0.179   −0.357    0.007    0.032   0.031   0.029
24. Hours work m.     −0.014   −0.220    0.161    0.000   0.847   0.048
25. Cg age             0.047   −0.159    0.248    0.002   0.571   0.042

(a) BCIs that do not contain zero and p-values significant at the FDR corrected significance level (cFDR) are in boldface in the original table.


Out of these characteristics, we expected group size and caregiver–child ratio to be related to the quality of interactions. Thus, only these variables were expected to have significant squared loadings on either of the two principal components we were interested in.6 For the research objective of the NICHD—as in much other applied research—the overall contribution of each variable to the solution (communality) is of lesser interest than the variables' contribution to each specific component (squared component loading). Thus, for substantive reasons we selected the squared component loadings as a test statistic. As this statistic is not invariant under rotation of the PCA solution, we used Procrustes rotation to rotate the permutation solution toward the observed solution. In the case of PermV, this approach makes sense, because—contrary to PermD—the structure of all but one variable from the observed solution is retained.7 As a comparison to the PermV results, we also applied the nonparametric bootstrap with Procrustes rotation to the loadings from the NICHD data, where 95% percentile confidence intervals should agree with decisions on significance at a two-sided 5% alpha from the permutation approach. Permutation and bootstrap results from a PCA on the NICHD data are shown in Table 4. Bootstrap confidence intervals containing the value zero indicate insignificance. For p-values resulting from PermV, FDR corrected significance levels are used to determine significance of variable VAF. These significance levels are only displayed up to the first significant p-value; all p-values with lower rank numbers are automatically also significant, so their corrected significance levels need not be displayed. Results show that PermV and the bootstrap agree on the significance of variable contributions to the two principal components in all but three cases (the squared loadings of variables 1, 2, and 5 on Component 2). In these cases, the bootstrap gives a significant result, whereas PermV does not. However, in all three cases the bootstrap confidence interval borders close to zero, indicating near-insignificance. Substantively, results are as expected. All ORCE variables show significant squared loadings on at least one component. The first principal component can be interpreted as the degree of positive caregiver–child interaction, with variables with significant positive loadings (1 through 12) indicating movement toward the child, and variables with significant negative loadings (13 through 15) indicating movement away from the child. Variables with significant positive loadings on the second component (16 through 20) indicate overt negative behaviors (moving against the child). These results are in accordance with psychological theory (see, for example, Horney, 1945). Note that variables 3, 13, 14, and 17 have significant loadings on both components, but one loading is clearly higher than the other. Group size and caregiver–child ratio relate negatively to moving toward the child as well as to moving against the child, which seems plausible, as caregivers in large groups will have less time to pay attention (positive as well as negative) to individual children. As expected, family characteristics and caregiver age have insignificant squared loadings on both components. Although these results are tentative, they illustrate the substantive validity of the PermV strategy for empirical data.

6 If we used PermD to establish the optimal number of components, we would find more than two significant dimensions, with the family and caregiver characteristics loading on the higher dimensions. However, for this particular example, we introduced these variables as "noise" variables that should obtain insignificant loadings on the two-dimensional structure we were interested in.

7 Procrustes rotation is the most relevant when eigenvalues are close together, such that changes in component ordering may occur. In the data set at hand, the first component was quite dominant over the second, such that rotation did not make much difference.
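The rotation step used here has a closed-form solution; the minimal sketch below (our illustration, not the authors' code) rotates a permuted or bootstrapped loading matrix toward the observed one before the squared loadings are compared.

```python
import numpy as np

def procrustes_rotate(loadings, target):
    """Orthogonal Procrustes rotation: find the rotation matrix R
    minimizing ||loadings @ R - target|| in the least-squares sense
    and return the rotated loadings, via the SVD of loadings' target."""
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return loadings @ (u @ vt)
```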

6. Conclusions and Discussion

In this study, we proposed a new method for assessing the significance of the contribution of variables to the PCA solution by performing permutation tests in which one variable is permuted while keeping the others fixed (PermV).


This method seems theoretically sound, and its statistical validity was indicated in a simulation study. PermV combined with FDR correction of the significance level yields the most favorable combination of proportions of Type I and Type II error. As in previous studies, the alternative strategy of permuting the entire data set (PermD) proves inappropriate for the research question at hand, which is expressed in Type I error rates much smaller than the nominal α, and very low power, especially when the size of the data set is small. For data sets with 20 variables, PermV gives Type I error rates somewhat higher than the nominal α, which can be explained by the fact that we introduced noise to the data structure. However, especially for structured data sets, proportions of Type I error seem acceptable. The Bonferroni correction (as applied here) is much too conservative and leads to a huge loss of power. Regarding the number of objects, we can conclude that for acceptable Type I error rates with PermV, 100 objects are enough. However, for data sets with 100 objects, the power of the permutation test is somewhat low, resulting from the fact that a population principal components structure will be less manifest in smaller samples (n ≤ 100). Based on these results, we can give researchers who wish to apply permutation tests to PCA for assessing the significance of the VAF per variable the following advice: permute one variable at a time, while keeping the others fixed (PermV). Do not use the Bonferroni correction, especially for samples with fewer than 200 objects and 40 variables. For samples with 100 or fewer objects and fewer than 40 variables, consider whether Type I error is more serious than Type II error (which in exploratory research is not very likely). If so, apply the FDR correction; otherwise, use an uncorrected significance level. For larger samples (n > 100), FDR correction is recommended, not only because it implies higher power than the Bonferroni correction, but also because it theoretically fits the objective of exploratory data analysis (Keselman et al., 1999).

Our simulation study was focused on the significance of variable VAF in two-dimensional PCA solutions. We do not know whether the results also apply to other dimensionalities. For high-dimensional structures (i.e., with the number of dimensions close to the number of variables), the results might be different, because variable VAF may be similar on several dimensions, and structures may become harder to establish. The fact that we added noise to the correlational structure of the data sets in this study prevents us from drawing "clean" conclusions about the comparison of proportions of Type I error and nominal alpha. However, our strategy does allow a proper comparison between PermV and PermD under realistic circumstances, so the question of which approach performs best can still be answered. The advantage of the approach taken here is that it gives more insight into the performance of the methods for realistic data than a noise-free approach would. In addition, our results are similar to results from previous studies in which the performance of the PermD method under noise-free circumstances was evaluated (Buja and Eyuboglu, 1992; Peres-Neto et al., 2003), that is, proportions of Type I error for structured data sets are clearly smaller than nominal alpha.
In this study, for the sake of comparability of the Type I error rate across correction methods, we computed the Type I errors for each correction method in the same way: the number of false significant results divided by the total number of tests for which a false significant result was possible. However, it could be argued that, as the Bonferroni and the FDR correction attempt to control different types of Type I error (the FWE and the FDR, respectively), the error rates for these correction methods should be calculated differently. In that case, for the Bonferroni-corrected results, the Type I error rate can be computed as FWE = datsig^b_{C2>0} / R, with datsig^b_{C2>0} denoting the number of data sets in which there is at least one false significant result at a Bonferroni-corrected significance level of 0.05/m, and R the number of replications. For the FDR-corrected results, one may compute the Type I error rate as FDR = sig^f_{C2} / (sig^f_{C2} + sig^f_{C1}), that is, the number of false significant results divided by the total number of significant results at an FDR-corrected significance level (r/t × α). This FDR can then be averaged over replications.
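As an illustration, these alternative rates could be computed from simulation output as follows. This is a sketch under our reading of the definitions above, with array names of our own choosing: sig is an R × m Boolean matrix marking which variables were flagged significant in each replication, and noise is a Boolean mask for the m2 variables in C2.

```python
import numpy as np

def alternative_error_rates(sig, noise):
    """FWE and replication-averaged FDR as defined in the text.
    sig:   (R, m) Boolean array, True where a variable was significant.
    noise: (m,) Boolean mask, True for the C2 (noise) variables."""
    false_sig = sig[:, noise]                # false significants per replication
    # FWE: proportion of data sets with at least one false significant result
    fwe = false_sig.any(axis=1).mean()
    # FDR: false significants among all significants, averaged over replications
    n_false = false_sig.sum(axis=1).astype(float)
    n_total = sig.sum(axis=1).astype(float)
    per_rep = np.divide(n_false, n_total,
                        out=np.zeros_like(n_false), where=n_total > 0)
    return fwe, per_rep.mean()
```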
Although Type I error rates are no longer comparable when different calculation methods are used across correction methods, we computed these alternative rates for our simulated data sets to find out how they behaved for each correction method separately. We found that in the structured data sets, FDR and FWE were always well below 0.05 (and quite close together), whereas in the data sets without a particular two-dimensional structure both rates were inflated, due to the small "method effect" we simulated. However, as we have shown in Tables 2 and 3, the traditional Type I error rates were controlled by FDR and Bonferroni correction under all conditions. Therefore, correction of the significance level is worthwhile.

In this study, we chose a test statistic that is invariant across all possible rotations of the PCA solution. This choice was sensible, because otherwise we would have to take rotational variability into account for PermV, but not for PermD (as rotation of random data toward a target does not make sense). However, for interpretation purposes, it is more interesting to look at the VAF of the variables per component, which varies across different rotations. In the application section of this paper, we focused only on PermV and addressed rotational variability by using a Procrustes rotation. Further simulation studies may investigate whether this is the most appropriate procedure.

In the literature, there has been an ongoing discussion about the validity of null hypothesis significance testing (NHST) in the popular sense, which is an unfortunate combination of Fisherian, Neyman–Pearson, and Bayesian reasoning (Cohen, 1994). The main point brought forward by opponents of NHST is that it is not valid to use the probability of the observed data given that the null hypothesis is true as an answer to the reversed question, that is, what is the probability of the null hypothesis given the observed data (Cohen, 1994; Gliner, Leech, & Morgan, 2002; Killeen, 2005, 2006). Consistent with that line of thought, an alternative to the traditional p-value, called p_rep, has been proposed (Killeen, 2005), which gives the probability that the direction of a certain effect (positive or negative) can be replicated in another study under the same circumstances. p_rep can be calculated within a parametric framework under the assumption that the data are normally distributed. In addition, it can be calculated nonparametrically by performing a bootstrap study and calculating the proportion of bootstrap values for a specific outcome that point in the same direction as the observed result (Killeen, 2005); a sketch of this calculation is given below. This latter approach may also be used in the PCA context, which could be a valuable addition to the permutation procedure.

Buja and Eyuboglu (1992) noted that significance of loadings should not be mistaken for sampling stability. Significance means that loadings of a certain magnitude are unlikely to be due to chance alone. Sampling stability refers to the question of how much a statistic varies between different samples from the same population (variability in the sampling distribution). Stability can be established by resampling techniques, like the bootstrap (see Linting et al., 2007b), but not by permutation tests. On the other hand, permutation tests better match the traditional idea of hypothesis testing, as they simulate the null distribution.
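The nonparametric p_rep calculation mentioned above is simple to sketch; the function name is ours, and this is not part of the procedure evaluated in this paper.

```python
import numpy as np

def p_rep(bootstrap_values, observed):
    """Nonparametric p_rep (Killeen, 2005): the proportion of bootstrap
    replicates of a statistic that point in the same direction as the
    observed result."""
    bootstrap_values = np.asarray(bootstrap_values, dtype=float)
    return float(np.mean(np.sign(bootstrap_values) == np.sign(observed)))
```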
From the application section of this paper, we may conclude that 95% bootstrap confidence ellipses and permutation tests at an (FDR-corrected) 5% significance level may complement each other and lead to similar conclusions.

In common research practice, data do not always meet the assumptions of linear PCA, such as linearity of the relationships between variables and a numeric measurement level. For such cases, a nonlinear alternative has been developed (see, e.g., Linting, Meulman, Groenen, & van der Kooij, 2007a; Meulman, Van der Kooij, & Heiser, 2004). Linting et al. (2007b) established the stability of the nonlinear PCA solution and compared it with that of the linear PCA solution. The simulation study performed here provides a basis for the application of PermV to nonlinear PCA. This next step may be considered worthwhile for the application of multivariate categorical data analysis methods in the social and behavioral sciences.


Appendix A. Simulating Data with a Prespecified Principal Components Structure

In this study, each simulated data set is generated such that its correlation matrix approximates a block-diagonal correlation matrix C (see Figure 2). The first block on the diagonal, C1, contains the correlations between m1 variables that correspond to either a strong or a moderate two-dimensional structure, or to no significant components structure. The second block on the diagonal, C2, contains the random correlations between m2 variables. The off-diagonal blocks consist of zeros, such that C1 and C2 do not correlate with each other. In this appendix, we explain how we derive a data set that corresponds to C.

First, C1 and C2 are constructed using an algorithm by Lin and Bendel (1985) that generates random correlation matrices for a given eigenvalue structure. Then, C is composed of C1 and C2 on the diagonal, keeping the off-diagonal entries zero.

Second, the data matrix X is constructed, consisting of n objects and m variables, and normalized such that X′X = C. This is accomplished by taking X = BS, where B is an n × m orthonormal matrix (B′B = I), and S is an m × m square matrix such that S′S = C. One way to compute S is to use the eigenvalue decomposition C = QΛ²Q′, with Q an m × m orthonormal matrix and Λ² a diagonal matrix containing the eigenvalues of C on its main diagonal. Then, we take S = ΛQ′. (Another way to compute S is to use the Cholesky decomposition.)

The final step toward arriving at a data matrix X is to generate B. To ensure that the simulated data are realistic, random sampling error was included in the data-generating process by creating, instead of B, the approximately orthonormal matrix B̃, containing a random sample from a normal distribution with zero mean and a standard deviation of 1/√n, such that B̃′B̃ ≈ I. Consequently, if we take the data matrix to be X̃ = B̃S, it reflects sampling variation. PCA is performed on X̃′X̃, which is the correlation matrix of the generated data matrix X̃; X̃′X̃ is not exactly, but asymptotically, equal to C.
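This procedure translates directly into code. Below is a minimal NumPy sketch under the assumptions stated above; the target matrix C is taken as given (the Lin–Bendel step for constructing C is not reproduced), and the function and argument names are ours.

```python
import numpy as np

def simulate_data(C, n, seed=None):
    """Generate an n x m data matrix whose correlation matrix is
    asymptotically equal to the target correlation matrix C."""
    rng = np.random.default_rng(seed)
    m = C.shape[0]
    # Eigenvalue decomposition C = Q Lambda^2 Q'; clip tiny negative eigenvalues
    eigvals, Q = np.linalg.eigh(C)
    S = np.diag(np.sqrt(np.clip(eigvals, 0, None))) @ Q.T   # S'S = C
    # Approximately orthonormal B-tilde: N(0, sd = 1/sqrt(n)) entries, B'B ~ I
    B = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))
    return B @ S                                            # X'X ~ C asymptotically
```

A block-diagonal target matrix can be assembled from C1 and C2 with, for instance, scipy.linalg.block_diag(C1, C2).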

Appendix B. Construction of Confidence Intervals for Type I and Type II Error in PCA

In general, the population proportion p is estimated by the sample proportion p̂ = X/t, with X the number of 'successes' and t the number of trials. If the number of trials is sufficiently large, p̂ has approximately a normal distribution, with mean μ_p̂ = p and standard deviation σ_p̂ = √(p(1 − p)/t). Simulation studies have shown that this traditional approach to computing confidence intervals for proportions can be quite inaccurate, because p̂ may approximate 0 or 1, in which case the estimated standard deviation becomes 0, and thus the margin of error (unrealistically) also becomes 0 (see Agresti & Coull, 1998). To avoid this problem, the Wilson estimate (Wilson, 1927) is proposed, which moves p̂ slightly away from 1 or 0. In our permutation study, it seems sensible to use the Wilson estimate, because the proportions of Type I and Type II error can easily become 0 (for instance, if all variables that are supposed to be significant are indeed marked significant in all replications).

The idea of the Wilson estimate of the population proportion is to add two failures and two successes to the observed data. Then, the estimate is calculated as p̃ = (X + 2)/(t + 4). The standard error of p̃ is SE_p̃ = √(p̃(1 − p̃)/(t + 4)). The approximate confidence interval for p is p̃ ± z* SE_p̃, with z* the standard score corresponding to the specified confidence level (for example, for a 95% confidence interval, z* equals 1.96).

The Wilson estimate can be applied to our study containing R replications of permutations on different data sets. For the calculation of Type I errors for the uncorrected results, X would be the number of times a variable present in C2 (a noise variable) is found to be significant, and t would equal the number of times a test is applied (which equals m2R in the structured data sets, and (m1 + m2)R in the unstructured data sets).
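These formulas can be transcribed directly; the following is a sketch with names of our own choosing.

```python
import math

def wilson_ci(x, t, z_star=1.96):
    """Wilson estimate and approximate confidence interval for a
    proportion: add two successes and two failures before estimating."""
    p_tilde = (x + 2) / (t + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (t + 4))
    return p_tilde, (p_tilde - z_star * se, p_tilde + z_star * se)
```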
For Type II errors, X would equal the number of times a variable in C1 is marked nonsignificant, and t would equal m1R.

Using the Wilson estimate, the number of replications needed to obtain an acceptable margin of error (me) can be calculated as t = (z*/me)² p*(1 − p*) − 4, with p* a presupposed value for the proportion. Here, we concentrate on Type I error and estimate p* to be equal to the chosen significance level of 0.05. We wish the confidence interval to be no broader than 0.01; in other words, the margin of error should not exceed 0.005. Then, t = (1.96/0.005)² × 0.05(1 − 0.05) − 4 = 7295. As t = m2R, R should be 7295/m2. Consequently, in the cells concerning data sets with 20 variables, with m1 = 15 and m2 = 5, R should be at least 7295/5 = 1459. In the cells with 40 variables, with m1 = 30 and m2 = 10, R should be 7295/10 = 729.5. To be on the safe side, in our study we decided to use R = 1500 for data sets with 20 variables, and R = 750 for data sets with 40 variables.

For Type II errors, confidence regions may become larger, because p* for Type II errors will be closer to 0.50 than p* for Type I errors. However, in practice, the differences between proportions of Type II error under different conditions in our study are much larger than for Type I errors, and the confidence intervals do not overlap. Therefore, for the estimation of proportions of Type II error, we perform the same number of replications as for proportions of Type I error.
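The sample-size calculation above can be verified with a few lines of code (a sketch; names ours).

```python
import math

def replications_needed(me, m2, p_star=0.05, z_star=1.96):
    """Number of tests t, and replications R = t / m2, required so that
    the margin of error of a Type I error proportion stays below me."""
    t = (z_star / me) ** 2 * p_star * (1 - p_star) - 4
    return t, t / m2

# me = 0.005 with m2 = 5 gives t ~ 7295 and R ~ 1459;
# with m2 = 10 it gives R ~ 729.5, matching the values used in the study.
print(replications_needed(0.005, 5))   # (7295.04, 1459.008)
print(replications_needed(0.005, 10))  # (7295.04, 729.504)
```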

Acknowledgements

The data used in the application were collected by the NICHD Early Child Care Research Network, supported by the NICHD through a cooperative agreement that calls for scientific collaboration between the grantees and NICHD staff. The authors acknowledge the generous way in which the NICHD Study on Early Child Care has made this unique data set available for secondary analyses.

References

Agresti, A., & Coull, B.A. (1998). Approximate is better than 'exact' for interval estimation of binomial proportions. The American Statistician, 52, 119–126.
Anderson, M.J., & Ter Braak, C.J.F. (2003). Permutation tests for multi-factorial analysis of variance. Journal of Statistical Computation and Simulation, 73, 85–113.
Anderson, T.W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57, 289–300.
Buja, A., & Eyuboglu, N. (1992). Remarks on parallel analysis. Multivariate Behavioral Research, 27, 509–540.
Cohen, J. (1994). The earth is round (p < 0.05). The American Psychologist, 49, 997–1003.
De Leeuw, J., & Van der Burg, E. (1986). The permutational limit distribution of generalized canonical correlations. In E. Diday (Ed.), Data analysis and informatics IV (pp. 509–521). Amsterdam: Elsevier.
Dietz, E.J. (1983). Permutation tests for association between two distance matrices. Systematic Zoology, 32, 21–26.
Douglas, M.E., & Endler, J.A. (1982). Quantitative matrix comparisons in ecological and evolutionary investigations. Journal of Theoretical Biology, 99, 777–795.
Fabrigar, L.R., Wegener, D.T., MacCallum, R.C., & Strahan, E.J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299.
Fisher, R.A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Girshick, M.A. (1939). On the sampling theory of roots of determinantal equations. Annals of Mathematical Statistics, 10, 203–224.
Glick, B.J. (1979). Tests for space-time clustering used in cancer research. Geographical Analysis, 11, 202–208.
Gliner, J., Leech, N., & Morgan, G. (2002). Problems with null hypothesis significance testing (NHST): What do the textbooks say? Journal of Experimental Education, 71, 83–92.
Good, P.I. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses. New York: Springer.
Heiser, W.J., & Meulman, J.J. (1994). Homogeneity analysis: Exploring the distribution of variables and their nonlinear relationships. In M. Greenacre & J. Blasius (Eds.), Correspondence analysis in the social sciences: Recent developments and applications (pp. 179–209). New York: Academic Press.
Horney, K. (1945). Our inner conflicts: A constructive theory of neurosis. New York: Norton.
Hubert, L.J. (1984). Statistical applications of linear assignment. Psychometrika, 49, 449–473.
Hubert, L.J. (1985). Combinatorial data analysis: Association and partial association. Psychometrika, 50, 449–467.
Hubert, L.J. (1987). Assignment methods in combinatorial data analysis. New York: Marcel Dekker.
Hubert, L.J., & Schultz, J. (1976). Quadratic assignment as a general data analysis strategy. British Journal of Mathematical & Statistical Psychology, 29, 190–241.
Jolliffe, I.T. (2002). Principal component analysis. New York: Springer.
Keselman, H., Cribbie, R., & Holland, B. (1999). The pairwise multiple comparison multiplicity problem: An alternative approach to familywise and comparisonwise Type I error control. Psychological Methods, 4, 58–69.
Killeen, P.R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345–353.
Killeen, P.R. (2006). Beyond statistical inference: A decision theory for science. Psychonomic Bulletin & Review, 13, 549–562.
Landgrebe, J., Wurst, W., & Welzl, G. (2002). Permutation-validated principal components analysis of microarray data. Genome Biology, 3, 0019.
Lin, S.P., & Bendel, R.B. (1985). Algorithm AS 213: Generation of population correlation matrices with specified eigenvalues. Applied Statistics, 34, 193–198.
Linting, M., Meulman, J.J., Groenen, P.J.F., & van der Kooij, A.J. (2007a). Nonlinear principal components analysis: Introduction and application. Psychological Methods, 12, 336–358.
Linting, M., Meulman, J.J., Groenen, P.J.F., & van der Kooij, A.J. (2007b). Stability of nonlinear principal components analysis: An empirical study using the balanced bootstrap. Psychological Methods, 12, 359–379.
Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209–220.
Meulman, J.J. (1992). The integration of multidimensional scaling and multivariate analysis with optimal transformations of the variables. Psychometrika, 57, 539–565.
Meulman, J.J. (1993). Nonlinear principal coordinates analysis: Minimizing the sum of squares of the smallest eigenvalues. British Journal of Mathematical & Statistical Psychology, 46, 287–300.
Meulman, J.J. (1996). Fitting a distance model to homogeneous subsets of variables: Points of view analysis of categorical data. Journal of Classification, 13, 249–266.
Meulman, J.J., Van der Kooij, A.J., & Heiser, W.J. (2004). Principal components analysis with nonlinear optimal scaling transformations for ordinal and nominal data. In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences (pp. 49–70). London: Sage Publications.
NICHD Early Child Care Research Network (1996). Characteristics of infant child care: Factors contributing to positive caregiving. Early Childhood Research Quarterly, 11, 269–306.
Noreen, E.W. (1989). Computer intensive methods for testing hypotheses. New York: Wiley.
Ogasawara, H. (2004). Asymptotic biases of the unrotated/rotated solutions in principal component analysis. British Journal of Mathematical & Statistical Psychology, 57, 353–376.
Peres-Neto, P.R., Jackson, D.A., & Somers, K.M. (2003). Giving meaningful interpretation to ordination axes: Assessing loading significance in principal component analysis. Ecology, 84, 2347–2363.
Shaffer, J.P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561–584.
Smouse, P.E., Long, J., & Sokal, R.R. (1985). Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology, 35, 627–632.
Sokal, R.R. (1979). Testing statistical significance of geographical variation. Systematic Zoology, 28, 227–232.
Ter Braak, C.J.F. (1992). Permutation versus bootstrap significance tests in multiple regression and ANOVA. In K.H. Jöckel, G. Rothe, & W. Sendler (Eds.), Bootstrapping and related techniques (pp. 79–86). Berlin: Springer.
Timmerman, M.E., Kiers, H.A.L., & Smilde, A.K. (2007). Estimating confidence intervals for principal component loadings: A comparison between the bootstrap and asymptotic results. British Journal of Mathematical & Statistical Psychology, 60, 295–314.
Verhoeven, K., Simonsen, K., & McIntyre, L. (2005). Implementing false discovery rate control: Increasing your power. Oikos, 108, 643–647.
Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209–212.

Manuscript Received: 8 SEP 2009
Final Version Received: 20 DEC 2010
Published Online Date: 28 MAY 2011