Tests and Variables Selection on Regression Analysis for Massive Datasets

Tsai-Hung Fan, Graduate Institute of Statistics, National Central University
Kuang-Fu Cheng, Graduate Institute of Statistics, National Central University
Abstract

In this paper, a two-stage block hypothesis testing procedure following the idea of Fan, Lin and Cheng (2004) is proposed for massive data regression analysis. Variable selection criteria incorporating the classical stepwise procedure are also developed to select significant explanatory variables. A simulation study confirms that our approach is more accurate, in the sense of achieving the nominal significance level, for huge data sets. A real data example also verifies that the proposed procedure is accurate compared with the classical method.

Keywords: massive data, hypothesis testing, regression analysis, stepwise selection.
1. Introduction

In the past decade, we have witnessed a revolution in information technology. As a consequence, routine collection of systematically generated data is now commonplace. Databases with hundreds of fields, billions of records and terabytes of information are not unusual. For example, Barclaycard (UK) carries out 350 million transactions a year, Wal-Mart makes over 7 billion transactions a year, and AT&T carries over 70 billion long distance calls annually; see Hand, Blunt, Kelly and Adams (2000). It becomes very challenging to extract useful features from a large data set because many statistics are difficult to compute by standard algorithms or statistical packages when the data set is too large to be stored in primary memory. The memory space in some computing environments can be as large as several terabytes. However, the number of observations that can be stored in primary memory is often restricted; the available memory, though large, is finite. Many computing environments also limit the maximum array size allowed, and this limit can be much smaller than, and even independent of, the available memory. Large data sets present some obvious difficulties, such as complex relationships among variables and a large physical memory requirement. In general, a job that is simple for a small data set may pose major difficulties for a large one. Unlike observations resulting from designed experiments, large data sets sometimes become available without predefined purposes, or only with rather vague purposes. Typically, it is desirable to find interesting features in the data that will provide valuable information to support decision making. Primary tasks in analyzing large data sets include data processing, classification, detection of
abnormal patterns, summarization, visualization, and association/correlation analysis. Conventional regression analysis is a popular tool for serving these purposes. However, regression analysis is not straightforward for massive datasets. For efficiency, an estimator should be derived from the whole data set rather than from its parts, which is not feasible for large data sets. Intuitively, we may sequentially read the data into primary memory block by block and analyze the data in each block separately. As long as the block size is small, one can easily implement this estimation procedure within each block under various computing environments. A question that arises here is how to make a final conclusion based on the results obtained from the individual blocks. Fan, Lin and Cheng (2004) consider block weighted least square estimators of the regression coefficients obtained by minimizing the variances of the estimators and prove the asymptotic properties of the resulting estimators. They also indicate that the estimators provide better interval estimation, in terms of coverage probabilities, than the usual least square estimators. In this paper, we further discuss testing hypotheses as well as variable selection criteria in regression analysis for massive data sets. We propose to carry out the classical test in each block via its p-value, followed by a second-stage test that integrates the results from the blocks. The method can also be applied to select the significant explanatory variables. An ideal sample size to be used in each block is also suggested. The rest of the paper is organized as follows. Section 2 reviews the general approach of Fan, Lin and Cheng (2004). Section 3 presents our main results, including the hypothesis testing procedure and the variable selection criteria. The block sample size determination is given in Section 4. A simulation study and a real example are provided in Section 5, and Section 6 gives the final conclusions and discussion.
2. Block weighted least square method

This section introduces the block weighted least square method proposed by Fan, Lin and Cheng (2004). For the linear regression model of a massive dataset, we separate a huge data set of sample size N into k blocks, each of n samples, such that for block j = 1, 2, ..., k the regression model is given by
$$Y_j = X_j \beta + \varepsilon_j,$$
where
$$Y_j = \begin{bmatrix} Y_{1j} \\ Y_{2j} \\ \vdots \\ Y_{nj} \end{bmatrix},\quad
X_j = \begin{bmatrix} 1 & X_{1,1,j} & \cdots & X_{1,p-1,j} \\ 1 & X_{2,1,j} & \cdots & X_{2,p-1,j} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1,j} & \cdots & X_{n,p-1,j} \end{bmatrix},\quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix},\quad
\varepsilon_j = \begin{bmatrix} \varepsilon_{1j} \\ \varepsilon_{2j} \\ \vdots \\ \varepsilon_{nj} \end{bmatrix},$$
with $\varepsilon_j \sim N(0, \sigma_j^2 I)$, and the $\varepsilon_j$'s assumed independent across blocks. The
procedure for the block weighted least square estimators of the regression coefficients proposed by Fan, Lin and Cheng (2004) is as follows (for each i = 0, 1, ..., p − 1):
Step 1. Split the N sample points into k blocks such that each block contains n data points (and fits into the available memory); that is, N = kn.
Step 2. Estimate β_i in each block by the usual least square method and denote the estimator by β̂_ij, for j = 1, 2, ..., k; i.e., β̂_ij is the i-th element of $(x_j' x_j)^{-1} x_j' y_j$, where $(x_j, y_j)$ are the observations in the j-th block.
Step 3. Compute $\hat r_{ij} = (\hat\sigma_{ij}^2)^{-1} = a_{ij}\hat\sigma_j^{-2}$, an estimate of $(\mathrm{var}(\hat\beta_{ij}))^{-1}$, where $a_{ij}^{-1}$ is the i-th diagonal element of $(x_j' x_j)^{-1}$ and $\hat\sigma_j^2 = \sum_{l=1}^{n}(y_{lj} - x_{lj}\hat\beta_j)^2/(n-p)$ is the variance estimate in block j, with $x_{lj} = (1, X_{l,1,j}, X_{l,2,j}, \ldots, X_{l,p-1,j})$.
Step 4. The weighted estimator of β_i is $\tilde\beta_i^{*} = \sum_{j=1}^{k}\hat w_{ij}\hat\beta_{ij}$, where $\hat w_{ij} = \hat r_{ij}/\sum_{j}\hat r_{ij}$ is the weight with which each block estimator β̂_ij contributes to the integrated estimator.
Fan, Lin and Cheng (2004) show that these estimators achieve the minimum variance $1/\sum_{j}\hat r_{ij}$ among all weighted estimators and possess the asymptotic normality
property, i.e.,
$$\sqrt{\textstyle\sum_{j=1}^{k}\hat r_{ij}}\,\bigl(\tilde\beta_i^{*} - \beta_i\bigr) \to N(0,1)$$
in distribution as n tends to infinity. In the next section, we develop a block testing procedure for the regression coefficients and make variable selection from the result.
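To fix ideas, the four steps above can be written out as a short NumPy sketch. The function name block_wls, the contiguous block split and the returned quantities are our own illustrative choices and are not prescribed by Fan, Lin and Cheng (2004).

```python
import numpy as np

def block_wls(X, y, k):
    """Block weighted least square sketch of Steps 1-4.

    X : (N, p) design matrix whose first column is the intercept column of ones.
    y : (N,) response vector.
    k : number of blocks; assumes N = k * n with n = N // k.
    Returns the integrated estimator and its estimated standard errors.
    """
    N, p = X.shape
    n = N // k
    beta_blocks = np.empty((k, p))   # \hat{beta}_{ij}, one row per block
    r_hat = np.empty((k, p))         # \hat{r}_{ij} = a_{ij} / \hat{sigma}_j^2

    for j in range(k):
        Xj, yj = X[j * n:(j + 1) * n], y[j * n:(j + 1) * n]
        XtX_inv = np.linalg.inv(Xj.T @ Xj)
        beta_j = XtX_inv @ Xj.T @ yj                  # block least squares (Step 2)
        resid = yj - Xj @ beta_j
        sigma2_j = resid @ resid / (n - p)            # \hat{sigma}_j^2 (Step 3)
        beta_blocks[j] = beta_j
        r_hat[j] = 1.0 / (np.diag(XtX_inv) * sigma2_j)

    weights = r_hat / r_hat.sum(axis=0)               # \hat{w}_{ij} (Step 4)
    beta_star = (weights * beta_blocks).sum(axis=0)   # \tilde{beta}_i^*
    se_star = 1.0 / np.sqrt(r_hat.sum(axis=0))        # sqrt of 1 / sum_j \hat{r}_{ij}
    return beta_star, se_star, beta_blocks, r_hat
```

Note that each pass of the loop only needs an n × p slice of the data in memory, which is the point of the block decomposition.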
3. Main results

3.1 Block hypothesis testing
If we want to assess whether an explanatory variable, say the one with coefficient β_i for some i = 0, 1, ..., p − 1, has a significant effect on the response variable of a regression model, we may consider testing H_0: β_i = 0 versus H_1: β_i ≠ 0. Fan, Lin and Cheng (2004) show that the block least square estimator $\tilde\beta_i^{*}$ is approximately normally distributed. Since our sample size is always large, by simple normal theory,
the null hypothesis H_0: β_i = 0 is rejected if the test statistic $\sqrt{\sum_{j=1}^{k}\hat r_{ij}}\,\bigl|\tilde\beta_i^{*}\bigr|$ is large. Specifically, given the significance level α, H_0 is rejected if the observed weighted estimate satisfies
$$\bigl|\tilde\beta_i^{*}\bigr| > \frac{z_{\alpha/2}}{\sqrt{\sum_{j}\hat r_{ij}}}, \qquad (1)$$
where z_{α/2} is the critical value of the normal test. However, in massive datasets the sample sizes are usually huge, so whatever the true value of β_i is, (1) is almost always satisfied and the conclusion tends to be rejection of H_0. This is indeed a serious phenomenon, the so-called 'Lindley paradox', encountered in classical tests when the sample size is too large; see Lindley (1957), Berger (1985), Hand (1998), Hand et al. (2000) and the references therein. Thus the above test is practically unsuitable for massive datasets. In this section, we adopt a two-stage test: first compute the p-value of the classical normal test in each block, and then integrate the results from all k blocks. As in Fan, Lin and Cheng (2004), the original data are divided into k blocks. In the first stage, we individually perform the normal test of H_0: β_i = 0 vs. H_1: β_i ≠ 0 based on the usual least square estimator β̂_ij in each block. Let $\hat\beta_{ij}^{0}$ be the observed value of β̂_ij based on the data in block j; then H_0 is rejected in block j if $|\hat\beta_{ij}^{0}|/\hat\sigma_{ij} > z_{\alpha/2}$, i.e., if the corresponding p-value $P_j = P_{H_0}(|\hat\beta_{ij}| > |\hat\beta_{ij}^{0}|) < \alpha$.
Note that for any observed value $\hat\beta_{ij}^{0}$ of β̂_ij, $P_j = 1 - F(|\hat\beta_{ij}^{0}|)$, where F(·) is the cumulative distribution function of $|\hat\beta_{ij}|$ under H_0. Hence $F(|\hat\beta_{ij}^{0}|) \sim \mathrm{uniform}(0,1)$, and so is P_j, if H_0 is true. Therefore, for j = 1, 2, ..., k, the P_j's are i.i.d. uniform random variables when H_0 is true, so $P_{H_0}(P_j < \alpha) = \alpha$ for j = 1, 2, ..., k. The second stage integrates the p-values from the k blocks. Let Z be the number of blocks whose p-values are less than α in the first-stage normal tests. Then, if H_0 is true, Z has a binomial distribution with parameters k and p, where k is the total number of blocks and the success probability is $p = P_{H_0}(P_j < \alpha) = \alpha$. When Z is too large, H_0 is rejected in most blocks, so the original hypothesis H_0: β_i = 0 is intuitively suspicious. Indeed, the critical value of the second-stage test is the smallest
integer c satisfying
$$\sum_{j=c}^{k}\binom{k}{j}\alpha^{j}(1-\alpha)^{k-j} \le \alpha.$$
Moreover, when k is large, Z has an approximate N(kα, kα(1 − α)) distribution, and the above critical value c can be approximated by c* such that
$$\Phi\!\left(\frac{c^{*} - 0.5 - k\alpha}{\sqrt{k\alpha(1-\alpha)}}\right) = 1 - \alpha,$$
where Φ(·) is the cumulative distribution function of the standard normal distribution. More precisely, letting $z_{1-\alpha} = \Phi^{-1}(1-\alpha)$, we have $c^{*} = z_{1-\alpha}\sqrt{k\alpha(1-\alpha)} + k\alpha + 0.5$.
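A compact sketch of the two-stage test may help; it uses SciPy for the binomial and normal tails, and the function names and the argument se_blocks (the block standard errors σ̂_ij) are ours, not the paper's.

```python
import numpy as np
from scipy import stats

def second_stage_critical_value(k, alpha=0.05):
    """Smallest integer c with P(Z >= c) <= alpha for Z ~ Binomial(k, alpha),
    together with the normal approximation c* = z_{1-a} sqrt(k a (1-a)) + k a + 0.5."""
    c = next(c for c in range(k + 1) if stats.binom.sf(c - 1, k, alpha) <= alpha)
    c_star = stats.norm.ppf(1 - alpha) * np.sqrt(k * alpha * (1 - alpha)) + k * alpha + 0.5
    return c, c_star

def two_stage_block_test(beta_blocks, se_blocks, i, alpha=0.05):
    """Two-stage block test of H0: beta_i = 0.

    Stage 1: a two-sided normal test in each block based on the block least
    square estimate beta_blocks[j, i] and its standard error se_blocks[j, i].
    Stage 2: count the blocks with p-value < alpha and reject H0 overall when
    this count Z reaches the binomial critical value c.
    """
    k = beta_blocks.shape[0]
    z = beta_blocks[:, i] / se_blocks[:, i]      # block test statistics
    p_values = 2 * stats.norm.sf(np.abs(z))      # two-sided p-values, one per block
    Z = int(np.sum(p_values < alpha))            # number of blocks rejecting H0
    c, c_star = second_stage_critical_value(k, alpha)
    return {"Z": Z, "c": c, "c_star": c_star, "reject": Z >= c}
```

As a quick check, for k = 1,000 and α = 0.05 the approximation gives c* ≈ 1.645 × √47.5 + 50 + 0.5 ≈ 61.8, in line with the critical value 62 reported for that allocation in Table 1 below.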
3.2 Variables selection
Variable selection has been an important issue in multiple regression analysis. In classical regression analysis, many techniques are used to choose significant explanatory variables, among which stepwise selection, forward selection and backward elimination are the most popular. However, all of these methods are based on hypothesis tests for the regression coefficients. Therefore, the problem caused by the Lindley paradox mentioned above arises once more in the selection of variables for massive data sets. In fact, a huge data set will tend to include all variables in the model, so all three criteria no longer work. In this subsection, we adapt the block two-stage hypothesis tests derived in Subsection 3.1, incorporating the idea of classical forward/backward selection, to select the variables in massive datasets. First, the data are divided into k blocks as before. At the first stage, classical stepwise selection is carried out in each block. For each i = 0, 1, ..., p − 1 and j = 1, 2, ..., k, let Z_ij = 1 if the i-th explanatory variable X_i is contained in the individual model built by the classical stepwise procedure in block j, and Z_ij = 0 otherwise. Let $Z_i = \sum_j Z_{ij}$, which is the total number of blocks in which variable X_i is chosen (or, equivalently, significant) via the individual stepwise selections. By the same reasoning as in Subsection 3.1, each Z_i has a binomial(k, α) distribution when H_0: β_i = 0 is true, for i = 0, 1, ..., p − 1. Thus, if Z_i is too large, there is significant evidence to support X_i (i.e., to reject H_0: β_i = 0) in the integrated model. We describe the second-stage forward selection and backward elimination criteria as follows (a code sketch is given at the end of this subsection):
(1) Forward selection: Begin with the largest Z_i, associated with the variable X_i that appears most frequently in the individual models of the k blocks. If Z_i > c*,
233
Proceedings of the Second Workshop on Knowledge Economy and Electronic Commerce
then X_i has a significant overall effect and should be included in the model. Then return to the first stage and rerun the stepwise selection in each block over the remaining variables, with X_i already forced into the model. Repeat the second-stage forward selection until the largest current Z_i is no greater than c*. The integrated model finally consists of all the variables selected up to that point.
(2) Backward elimination: This method is the reverse of forward selection. It begins with the smallest Z_i obtained from the classical stepwise procedure applied to each block at the first stage, which means that X_i is least significant in most blocks and so should be removed from the overall model if Z_i < c*. Then go back to the first stage and redo the stepwise procedure in each block with X_i eliminated from the model. Repeat this second-stage elimination until all the variables remaining in the model are significant.
As expected, the two selection methods may yield different results. Forward selection adds the explanatory variables to the model one by one, so it is harder for an explanatory variable to be selected. In contrast, backward elimination removes explanatory variables from the model, so it is harder for a variable to be dropped. If the user prefers fewer explanatory variables, in consideration of time and cost, we suggest forward selection; otherwise, backward elimination is advised.
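The second-stage forward selection can be sketched as follows. The routine block_stepwise(j, forced) is a placeholder for any classical stepwise procedure run within block j with the variables in `forced` kept in the model; it is not specified by the paper, and the loop below only illustrates the counting-and-thresholding logic.

```python
def block_forward_selection(block_stepwise, p, k, c_star):
    """Second-stage forward selection over p explanatory variables and k blocks.

    block_stepwise(j, forced) -> set of variable indices selected in block j
    when the variables in `forced` are kept in the model (hypothetical helper).
    """
    selected = set()                 # variables admitted to the integrated model
    candidates = set(range(p))       # variables not yet admitted
    while candidates:
        # First stage: stepwise selection in each block, keeping `selected`.
        counts = {i: 0 for i in candidates}      # Z_i for the remaining variables
        for j in range(k):
            chosen = block_stepwise(j, forced=selected)
            for i in candidates & set(chosen):
                counts[i] += 1
        # Second stage: admit the most frequently chosen variable if Z_i > c*.
        i_best = max(counts, key=counts.get)
        if counts[i_best] <= c_star:
            break                    # no remaining variable is significant overall
        selected.add(i_best)
        candidates.remove(i_best)
    return selected
```

Backward elimination follows the same pattern with the comparison reversed: drop the least frequently selected variable whenever its count falls below c*.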
4. Optimal block size

In the block test above, if the sample size of each block is large, then the test tends to reject H_0 within each block as well, as described in Section 3. Here we discuss the block sample size allocation needed to avoid this scenario. We consider the maximum acceptable sample size in each block based on the equal weights and on the optimal weights used in the block weighted least square estimators, separately. If the normal test is performed based on the weighted least square estimator, then by (1), H_0 will be rejected if
$$\frac{\bigl|\sum_{j=1}^{k}\hat r_{ij}\hat\beta_{ij}\bigr|}{\sqrt{\sum_{j=1}^{k}\hat r_{ij}}} > z_{\alpha/2}. \qquad (2)$$
Letting the residual sum of squares in block j be $e_j = (n-p)\hat\sigma_j^2 = \sum_{l=1}^{n}(y_{lj} - x_{lj}\hat\beta_j)^2$, for j = 1, 2, ..., k, and expressing $\hat r_{ij}$ in terms of $e_j$ in (2), it yields
$$n > z_{\alpha/2}^{2}\,\frac{\sum_{j=1}^{k} a_{ij}e_j^{-1}}{\bigl(\sum_{j=1}^{k} a_{ij}e_j^{-1}\hat\beta_{ij}\bigr)^{2}} + p. \qquad (3)$$
If equal weights, i.e. $w_{ij} = 1/k$ for each block, are used instead, then (3) reduces to
$$n > z_{\alpha/2}^{2}\,\frac{\sum_{j=1}^{k}(1/k)^{2}\,a_{ij}^{-1}e_j}{\bigl(\sum_{j=1}^{k}\hat\beta_{ij}/k\bigr)^{2}} + p. \qquad (4)$$
Denote the integer parts of the quantities on the right-hand sides of (3) and (4) by n* and n_0, respectively. They are the maximum sample sizes one can tolerate without rejecting H_0 when using the optimal or the equal weights, respectively. The next theorem shows that the maximum acceptable sample size obtained with equal weights is no smaller than that obtained with optimal weights.

Theorem 1: Let n* and n_0 be the maximum sample sizes in a block that one can tolerate using $\tilde\beta_i^{*}$ and $\tilde\beta_i^{0} = \sum_{j=1}^{k}\hat\beta_{ij}/k$, respectively, in testing β_i. Then
$$\frac{n_0 - p}{n^{*} - p} \ge 1, \quad a.s.$$
Proof: For i = 0, 1, ..., p − 1, consider
$$\frac{n_0 - p}{n^{*} - p}
= \left[\frac{\sum_{j} a_{ij}^{-1}e_j / k^{2}}{\bigl(\sum_{j}\hat\beta_{ij}/k\bigr)^{2}}\right]
  \left[\frac{\bigl(\sum_{j} a_{ij}e_j^{-1}\hat\beta_{ij}\bigr)^{2}}{\sum_{j} a_{ij}e_j^{-1}}\right]
= \left[\frac{\sum_{j} a_{ij}^{-1}e_j}{k^{2}}\right]\left[\sum_{j} a_{ij}e_j^{-1}\right]\bigl(\tilde\beta_i^{*}/\tilde\beta_i^{0}\bigr)^{2}.$$
The Cauchy-Schwarz inequality yields
$$\Bigl(\sum_{j} a_{ij}^{-1}e_j\Bigr)\Bigl(\sum_{j} a_{ij}e_j^{-1}\Bigr) \ge k^{2},$$
thus $(n_0 - p)/(n^{*} - p) \ge (\tilde\beta_i^{*}/\tilde\beta_i^{0})^{2}$. Furthermore, for i = 0, 1, ..., p − 1, the least square estimates $\hat\beta_{ij} \to \beta_i$ a.s. for each j = 1, 2, ..., k as n → ∞. Thus $\tilde\beta_i^{*} \to \beta_i$ a.s. and $\tilde\beta_i^{0} \to \beta_i$ a.s., so $\tilde\beta_i^{*}/\tilde\beta_i^{0} \to 1$ a.s., which completes the proof. □
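Formulas (3) and (4) translate directly into code. The sketch below assumes the block quantities a_ij, e_j and β̂_ij have already been computed, e.g. from a small pilot sample as in Section 5; the function name and interface are our own.

```python
import numpy as np
from scipy import stats

def max_block_sizes(a, e, beta_i, p, alpha=0.05):
    """Maximum acceptable block sizes n* (optimal weights, formula (3)) and
    n0 (equal weights, formula (4)) for testing H0: beta_i = 0.

    a      : array of a_{ij}, the reciprocals of the i-th diagonal elements of (x_j' x_j)^{-1}
    e      : array of block residual sums of squares e_j
    beta_i : array of block least square estimates of beta_i
    p      : number of regression coefficients (intercept included)
    """
    a, e, beta_i = map(np.asarray, (a, e, beta_i))
    k = len(e)
    z2 = stats.norm.ppf(1 - alpha / 2) ** 2
    n_star = int(z2 * np.sum(a / e) / np.sum(a / e * beta_i) ** 2 + p)   # formula (3)
    n_0 = int(z2 * np.sum(e / a) / k ** 2 / np.mean(beta_i) ** 2 + p)    # formula (4)
    return n_star, n_0
```

By Theorem 1, the returned n_0 should never fall below n_star, up to integer truncation.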
5. Simulation study and real example

In this section, we conduct a simulation study to illustrate and verify the proposed methods. Following the same notation, let p = 5, $\sigma_j^2 = 1$, β_1 = 0, and let the remaining regression coefficients (β_0, β_2, β_3, β_4, β_5) be generated from N(0, 10000), giving β_0 = −22.23, β_2 = −83.66, β_3 = 65.69, β_4 = 91.66 and β_5 = −59397. N = 1,000,000 data points are simulated from the model $Y_l = \beta_0 + \beta_1 X_{l1} + \cdots + \beta_5 X_{l5} + \varepsilon_l$, $\varepsilon_l \sim N(0,1)$, l = 1, ..., N, with the explanatory variables $x_1, x_2, \ldots, x_5$ generated individually from uniform(0,1). We split the data randomly into k blocks such that each block contains n samples, so kn = 1,000,000. We first test the hypothesis H_0: β_1 = 0 at significance level α = 0.05 by the block hypothesis testing procedure proposed in Subsection 3.1. We consider three different combinations of k and n, with 1000 simulation runs each, and the resulting Type I error probabilities together with the associated critical values c are listed in Table 1. We see that if there are not enough blocks, the sample size in each block is still too large, so that H_0 is rejected quite often; hence the first two allocations produce large Type I error probabilities. A finer split, as in the last allocation, indeed gives a better result. In this model the maximum acceptable sample size is n_0 = 37,888, where the quantities needed in the formula are roughly computed from a small pilot sample of the data. This may explain why the first allocation yields such a poor result, as its block sample size n = 100,000 is much larger than the maximum sample size allowed.
k × n             c (≅ c*)    Type I error
10 × 100,000          3          0.952
100 × 10,000         10          0.083
1,000 × 1,000        62          0.055
Table 1: The Type I error probabilities and the critical values via block hypothesis testing in various block allocations.
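A rough driver for the Table 1 experiment, reusing the block-wise least square fits and the two_stage_block_test sketch from Section 3, might look as follows; the coefficient values are drawn afresh here rather than fixed at the ones reported above, and the number of runs is kept small for illustration.

```python
import numpy as np

def estimate_type1_error(k, n, runs=200, alpha=0.05, seed=0):
    """Empirical Type I error of the two-stage test of H0: beta_1 = 0
    under the simulation design described in the text (uniform(0,1)
    covariates, N(0,1) errors, beta_1 = 0 true)."""
    rng = np.random.default_rng(seed)
    p = 6                                              # intercept + 5 explanatory variables
    rejections = 0
    for _ in range(runs):
        beta = rng.normal(0.0, 100.0, size=p)          # illustrative coefficient values
        beta[1] = 0.0                                  # H0 holds
        X = np.column_stack([np.ones(k * n), rng.uniform(size=(k * n, p - 1))])
        y = X @ beta + rng.standard_normal(k * n)
        beta_blocks = np.empty((k, p))
        se_blocks = np.empty((k, p))
        for j in range(k):                             # block least square fits
            Xj, yj = X[j * n:(j + 1) * n], y[j * n:(j + 1) * n]
            XtX_inv = np.linalg.inv(Xj.T @ Xj)
            bj = XtX_inv @ Xj.T @ yj
            s2 = np.sum((yj - Xj @ bj) ** 2) / (n - p)
            beta_blocks[j], se_blocks[j] = bj, np.sqrt(s2 * np.diag(XtX_inv))
        rejections += two_stage_block_test(beta_blocks, se_blocks, i=1, alpha=alpha)["reject"]
    return rejections / runs
```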
Next, we apply the proposed variable selection methods to analyze a data set obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), a census of the American population from 1994 to 1995. Each record consists of personal income (Y_1), age (X_1), capital profit (X_2), capital loss (X_3), savings interest (X_4) and the number of members in the person's family (X_5). The sample size is only 16,540, which of course is not large enough to be called a massive data set, but it can be used to check the proposed approach against the regular regression analysis. We study how personal income is affected by all the other variables, so a regression analysis is appropriate. Since the personal incomes are all positive and skewed with some large observations, the logarithmic transformation Y = log(Y_1) is applied to better satisfy the normality assumption of the model. Table 2 presents the correlation coefficients between the variables. We see that not many explanatory variables have a strong relationship with Y. We are interested in choosing the significant variables.
       Y       X1      X2      X3      X4      X5
Y      1       0.280   0.068   0.047   0.036   0.171
X1     0.280   1       0.043   0.026   0.086   0.045
X2     0.068   0.043   1      -0.015   0.054   0.019
X3     0.047   0.026  -0.015   1       0.005   0.018
X4     0.036   0.086   0.054   0.005   1      -0.007
X5     0.171   0.045   0.019   0.018  -0.007   1
Table 2: Correlation coefficients between the variables in the census data.
We divide the data into 10 blocks, each containing 1,645 data points. We apply the two-stage forward selection and backward elimination methods to choose the important variables and then obtain the block weighted least square estimates of the coefficients of the selected model. Classical regression analysis without blocking, with the stepwise variable selection procedure, is also conducted, and the results are listed in Table 3. Table 3 shows that all three methods exclude X_4 (savings interest), and the block forward selection does not include X_3 (capital loss) either. They all give close estimates of the other coefficients. In this data set, the maximum acceptable block size n_0 is about 6,438 for testing β_4 = 0 and 2,656 for testing β_3 = 0, so the Lindley paradox never occurs and the classical method is a reasonable benchmark for comparison. Our results are quite satisfactory, especially those of the backward elimination approach.

Regression coefficients           constant    X1      X2      X3      X4     X5
Blocking, forward selection         9.66      0.015   0.012   *       *      0.065
Blocking, backward elimination      9.66      0.015   0.012   0.101   *      0.065
No blocking                         9.66      0.015   0.012   0.101   *      0.065
Table 3: Estimates of the regression coefficients of the census data via the three variable selection methods. (* means the variable is not selected in the model.)
6. Conclusions and discussion

We propose a two-stage block hypothesis testing method, together with variable selection criteria, for regression analysis of massive data sets. Our approach provides an alternative that avoids the Lindley paradox encountered in classical tests for huge data sets. We also suggest a maximum acceptable sample size for each block, which hopefully gives a hint for the block allocation. The simulation study shows that our testing procedure is better than the classical method in terms of Type I error probabilities for massive data sets, and the real data study verifies that our approach is as good as the usual procedure for moderate sample sizes. It is true that, even with
the maximum acceptable size in each block, the number of blocks might in turn become too large in the second-stage test when the overall sample size is overwhelmingly large. Such cases deserve further study.
References

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd Ed. Springer-Verlag, New York.
Fan, T. H., Lin, D. K. J. and Cheng, K. F. (2004). "Intelligent regression analysis for massive data sets." Manuscript.
Hand, D. J. (1998). "Data mining: Statistics and more?" The American Statistician, 52, 112-119.
Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). "Data mining for fun and profit." Statistical Science, 15, 111-131.
Lindley, D. V. (1957). "A statistical paradox." Biometrika, 44, 187-192.