Quality & Quantity 26: 19-38, 1992. © 1992 Kluwer Academic Publishers. Printed in the Netherlands.


A modified GMDH approach for social science research: exploring patterns of relationships in the data

TIM FUTING LIAO
Department of Sociology, University of Illinois, 326 Lincoln Hall, 720 South Wright Street, Urbana, IL 61801, U.S.A.

Abstract. Group Method of Data Handling (GMDH) is an approach in which a system of models self-organizes by forming higher-order polynomials and selecting those with the best power of prediction according to a certain criterion. This method is helpful when we explore patterns of relationships in the data under investigation. In this paper the author presents a modified version of the GMDH algorithm that emphasizes the parsimony of models and the behavior of individual parameter estimates as well as of the whole model, and that utilizes the consistency and accuracy of bootstrap estimates. This approach is suitable for most research social scientists conduct. An example, the 1907 Romanian Peasant Rebellion, is used to illustrate how to employ the GMDH algorithm when the research topic has been theory-laden. The findings show that GMDH is an appropriate method that social scientists can utilize in their pursuit of a model that is at once most parsimonious and theoretically meaningful. Possible extensions of the modified approach, which in its present form works on linear regression-type models, to logit and probit models are also considered.

Introduction

Sometimes we are faced with the demanding task of estimating dozens, on occasion even hundreds, of linear or linearizable models when we try to explore social phenomena. This is true even when we know something about the theory behind the relationships among the variables we are dealing with. To start with, we build a few models around the theory. Then we modify these models according to certain criteria - R square, goodness-of-fit statistics, and parameter estimates for certain variables, to name just a few. The modification may involve redefining functional forms, trying out interaction terms, excluding variables that appear to be random noise, and including other variables that might be more meaningful. If one round of modification is not enough, it may be necessary to carry out several iterations of modification.

More often we test a theoretical hypothesis by setting up a model representing the hypothesis and deciding whether the relationships specified by the hypothesis are supported by the findings from the model. Most such hypotheses in social science specify a linear relationship, which is the most basic type of relationship. What if we loosen the linearity assumption and explore? This again involves testing many more models with nonlinear or polynomial terms.


Both situations of seeking scientific explanations above can be handled by the method proposed in this paper. We construct a few basic primeval equations first, and a new generation of equations based on the parental generation. Then we let a survival-of-the-fittest principle determine which equations of the new generation live and which ones die. We continue the process for more generations until the system of mathematical models satisfies a condition we have specified.

What I described in the paragraph above is the group method of data handling (GMDH) algorithm. Ivakhnenko, a Ukrainian cyberneticist, discouraged by the fact that the modeler must know things about the system that are typically impossible to have knowledge of, first proposed the method in 1966. Since its invention, the method has been applied to research ranging from modeling environmental effects on fisheries (Brooks and Probert, 1984) to predicting air pollution (Tamura and Kondo, 1980).

The purpose of this paper is to introduce a modified GMDH algorithm to social scientists. This is achieved in the following way. First, I go over the steps of the original GMDH method. Second, I present a modified version of the GMDH approach, which is more appropriate for the kind of research social scientists do, since it selects the best subset of variables in their most effective polynomial forms, and since its criterion for selection takes into account the statistical behavior of each individual parameter as well as of the whole model and controls Type I error, which often presents a problem in social analysis but not necessarily in weather forecasting. Finally, I use an example of previously published social research to illustrate the application of the modified approach. The emphasis of the presentation is not on using GMDH in a situation where we simply do not know what possible relationships are implied by the data and let the model decide, but on applying the method to possibly viable patterns in the data that are not specified by the existing hypotheses. The example demonstrates how this exploration is carried out. It also shows that only when the modeler has adequate substantive knowledge can the use of GMDH be fruitful; otherwise it is easy to abuse it and engage in data-dredging. Also discussed are some "user-defined" aspects of the method. Other issues dealt with in the paper are possible extensions to other types of models not exemplified, and GMDH's relations to the social science methodology in common use. The latter is the topic of the following section.

GMDH's relations to methods in use

Empirical social scientists are mostly familiar with four methodological approaches that bear relations to the GMDH method. First, the effects of interactions between predictor variables are widely studied and employed (e.g., Allison, 1978; McCullagh and Nelder, 1983; Stolzenberg, 1980). Related to the study and use of interaction terms is a tradition of employing and interpreting polynomials (e.g., Bock, 1975; Draper and Smith, 1981; Stolzenberg, 1980). More recently, a procedure called the ACE algorithm was developed by a group of statisticians to select the best functional transformation(s) in multiple regression (Breiman and Friedman, 1985). Finally, methodologists are also aware of the necessity, and carry out the task, of selecting the best subset of all independent variables for the sake of parsimony without sacrificing too much model fit, by way of forward or backward selection (e.g., Bock, 1975; Draper and Smith, 1981; Mosteller and Tukey, 1977). The criterion for such a selection may be R², the residual mean square, Mallows' Cp, Allen's PRESS, or some other statistic (Draper and Smith, 1981; Mosteller and Tukey, 1977). Because of the tremendous amount of computation involved in examining all possible models when the number of predictor variables is large (Draper and Smith, 1981), in reality no known research has incorporated all four traditions above at once, by examining all possible combinations of interaction as well as polynomial terms, to arrive at a "best" model that describes the data well. GMDH is an approach that accomplishes most of the goals of the four approaches while examining a variety of functional forms. The GMDH method does not look at all possible combinations of terms. Instead, it evolves from simple models to relatively complex models by taking the best path, keeping promising terms and excluding hopeless ones. The goal is thus reached.

A practice all but out of favor that seeks to select the best subset of predictor variables is stepwise regression. However, it does not allow the researcher to test interactions and polynomials not included in the model. A lesser known procedure that at first look bears a resemblance to GMDH is the Automatic Interaction Detector (AID) algorithm. The problem of detecting interactions was first discussed by Morgan and Sonquist (1962). They then proposed their solution, AID (Sonquist and Morgan, 1964), and further revised it (AID III) (Sonquist and Morgan, 1973). Perhaps the only major similarities between AID and GMDH are that both are based on regression-type models and that both have some kind of self-organizing mechanism that searches for a better-fitting model without assuming linearity and additivity. However, there are more differences between the two. First, the methodological basis for AID is sequentially partitioning data sets into subgroups, whereas that for GMDH is the Kolmogorov-Gabor polynomial. Because of the partitioning, the AID algorithm requires at least 1,000 cases, a number larger than that of many smaller-scale surveys. GMDH, on the other hand, needs only as many observations as the total number of parameters to be estimated. (The example in this paper has only 32 cases.)


Moreover, AID may eliminate extreme cases and truncate variables automatically, while with the present version of GMDH these steps are deemed inappropriate and unnecessary. Finally, the aim in AID is to reduce error in terms of the sum of squares, not to obtain better statistical significance for the individual parameters. With the modified version of GMDH, I attempt both to reduce the error of the model and to select the combination of the most statistically significant predictors. In addition to the distinguishing features above, in GMDH the data for model-building and for fit-checking may be drawn independently. This ensures that the model found is not too "case-specific".

A more important difference between the methodology in use by social scientists and the GMDH approach, however, lies at the heart of social science methodology. It is a problem in the philosophy of science: how do we practice science? Science, in general, has been well known for its inductive way of research as well as for its deductive reasoning. We extract and generalize principles from empirical data. Interestingly, most researchers today deduce from social science theories a few hypotheses and employ statistical tools to test them. The GMDH approach has features of both induction and deduction. It explores possible patterns in the data that otherwise would be ignored. The method, however, still works within a deductive framework, but now the deductive reasoning is supplemented with inductive support. This alternative is identical to the usual hypothesis testing except for one important point: it loosens the assumptions about the functional forms implied by the hypothesis and about interactions with variables other than the ones specified by the hypothesis. Therefore, it tests a deductively derived theoretical hypothesis with some inductive insights not specified by previous hypotheses, in addition to testing relationships already specified by them.

The original GMDH

Mathematically, the GMDH algorithm is based on the Kolmogorov-Gabor polynomial. (Some authors refer to this as the Ivakhnenko polynomial.) The complete Kolmogorov-Gabor polynomial, a discrete analog to the Volterra series, takes the following form (Ivakhnenko, 1970, 1984):

$$ y = a + \sum_{i=1}^{m} b_i x_i + \sum_{i=1}^{m}\sum_{j=1}^{m} c_{ij} x_i x_j + \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{k=1}^{m} d_{ijk} x_i x_j x_k + \cdots \qquad (1) $$

where y is the dependent variable; xi, xj, xk, etc., are independent variables; m is the total number of x variables; and a, b, c, d, etc., are parameters. The equation above can be replaced by a composition of lower-order polynomials of the form

$$ \hat{y} = A + Bx_i + Cx_j + Dx_i^2 + Ex_j^2 + Fx_ix_j, \quad i \neq j. \qquad (2) $$
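To make the estimation step concrete, here is a minimal sketch (illustrative data and names, not the author's code) of obtaining the coefficients A through F of Equation (2) for one pair of input variables by ordinary least squares:

```python
import numpy as np

def fit_pair_quadratic(y, xi, xj):
    """Fit y = A + B*xi + C*xj + D*xi**2 + E*xj**2 + F*xi*xj by OLS.

    Returns the coefficient vector (A, B, C, D, E, F)."""
    X = np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi * xj])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Illustrative use: 32 cases and two of the original input variables.
rng = np.random.default_rng(0)
x1, x4 = rng.random(32), rng.random(32)
y = 1.0 + 2.0 * x1 + 0.5 * x4**2 + rng.normal(scale=0.1, size=32)
print(fit_pair_quadratic(y, x1, x4))
```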

It is sufficient to have only six data points to find the coefficients A, B, C, D, E, and F in Equation (2). Rather than taking a top-down design, the GMDH algorithm works its way up from the lowest-order equations taking the form of Equation (2). In the first set of equations to be evaluated, we use all possible combinations of the input variables x1, x2, ..., xm in place of xi and xj in Equation (2), and regress the observed y on each such combination. These xi and xj are our zeroth-generation variables. Since there are m(m - 1)/2 possible combinations of x1 to xm, we will have m(m - 1)/2 equations predicting y. Let us call the variables generated by these equations z1, z2, ..., and use them as first-generation variables in place of the original input variables x1, x2, ..., xm. We select those first-generation variables that better predict the original y, and estimate a new collection of m1(m1 - 1)/2 equations following the form of Equation (2) (m1 may be greater than, smaller than, or equal to m). We proceed as before and generate new independent variables with the help of z1, z2, ..., zm1. These form the basis of a second generation (see Diagram 1). We continue this process until the system begins to show poorer predictability (by a certain user-defined criterion) than the previous generation; that is, the model is now overspecialized. Experimental work has shown that the curve of the power of prediction over generations is V-shaped: once the model is overspecified, the power of prediction gets poorer as the order of the polynomials becomes higher (Farlow, 1984). We stop at this point and pick the best of the polynomials in the previous generation. Here the estimate of y is a quadratic in two variables, which are themselves quadratics in two other variables from a lower-order generation, which are in turn quadratics in two variables from a still lower generation, ..., which are themselves quadratics in the original input variables. Should we make the necessary algebraic manipulations, we would arrive at a complex polynomial having the form of Equation (1).

Following the descriptions in Ivakhnenko (1970) and Farlow (1984), I summarize the GMDH procedure in the following six steps:

(1) Divide the data set into two parts - the training set and the checking set. The selection of the training set can be done randomly, or in some other way that the modeler deems appropriate.

Diagram 1: The Tree of Polynomials in GMDH

[Diagram not reproduced. It shows a layered tree: the original input variables x1, x2, ..., xm at the bottom (the 0th generation); first- and second-generation z variables, each formed from a pair of lower-level variables by a polynomial of the form of Equation (2); and, at the top, the last generation g, with ŷ = A + B z_gi + C z_gj + D z_gi² + E z_gj² + F z_gi z_gj.]

Note: The parameters A, B, C, D, E, and F are not subscripted accordingly for the sake of comprehensibility. The first subscript for the z's denotes the generation; the second, the sequence number within a generation. "m" designates the total number of original input variables, and "n" the number of input variables (Z vectors) for each generation after the 0th, which can be greater than, equal to, or smaller than m but not greater than m(m - 1)/2.

The remaining data points will be the checking set. E.g.,

training set: $y_1, y_2, \ldots, y_{N-k}$ and $x_1, x_2, \ldots, x_{N-k}$;

checking set: $y_{N-k+1}, y_{N-k+2}, \ldots, y_N$ and $x_{N-k+1}, x_{N-k+2}, \ldots, x_N$.

(2) Take all independent variables in the X matrix two at a time and construct m(m - 1)/2 regression polynomials in the form of Equation (2) (where m is the number of independent variables). This is done on the dependent observations y in the training set. One could also start this step from the checking set and reverse the functions of the two sets; the results should be similar, however, since the training of the model on one set is always checked against the data in the other.

(3) Using the parameter estimates from (2), evaluate the polynomials at all data points in both the training set and the checking set, and store these new observations (the new generation of variables), the z's, in the columns of the X matrix, thus replacing the original x1, x2, ..., xm. This can be expressed mathematically as

$$ z = A^{*} + Bx_i + Cx_j + Dx_i^2 + Ex_j^2 + Fx_ix_j, \quad i \neq j, \; n = 1 \text{ to } N, \qquad (3) $$

where B, C, D, E, and F are the parameter estimates from Equation (2) in step 2, and A* may not be equal to A since intercepts are not stored for later use.

(4) Compute the root mean square (also called the regularity criterion) rj for each column of Z in the checking set,

$$ r_j = \frac{\sum_i (y_i - z_{ij})^2}{\sum_i y_i^2}, \qquad (4) $$

order the columns of Z according to increasing rj, and then select those columns of Z satisfying rj < R (R being some number prescribed in advance) to replace the X matrix. The statistic rj is computed over all observations yi in the checking set. The number of z variables, m1, may be greater than, less than, or equal to the original number, m.

(5) Find the smallest of the rj's from the last step and call it RMIN. If the value of RMIN is less than that from the previous generation, we return to Step 2 and repeat Steps 2 through 4. If the value of RMIN is greater than the preceding one, we assume that the minimum of RMIN has been reached. We stop the process here and return to the results from the previous generation.

(6) Select the best-fit polynomial from the previous generation, and translate it back in terms of the original independent variables in X.


Now one may want to estimate this model again in terms of the original variables on the whole data set. (This step is not in Ivakhnenko's algorithms, but it is a natural extension of the original approach.)

Of the six steps above, the first need not be done by dividing the data set into two parts. Although the argument for dividing is that the polynomials will be checked against independent observations, it is also true that some information is lost by estimating the polynomials on only a portion of the original data set. In that case both training and checking can be done on the original data set. It seems reasonable to divide the data into a training and a checking set when the sample is relatively large, and to use the data set as a whole when the sample is relatively small. How small is small? While this question is mostly a matter of taste, it seems that either the training or the checking data set should preferably have 15 cases or more after division. As noted earlier, the first-generation equations need very few data points for the estimation. Although the method uses many equations and has a large number of parameters estimated, the required number of data points remains unchanged, because variables from the second generation on are evaluated on observations created from the previous generation and stored for use.

The GMDH method is heuristic. Therefore, details of the algorithm are "user-defined". Using the whole data set as the training and the checking set at the same time, instead of dividing it into two subsets, is one such example. Moreover, the criterion used to choose the better-fitting equations at each generation can be, instead of the root mean square, Mallows' Cp or some other criterion of the user's choice.
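To fix ideas, the following is a compact sketch in Python of steps 1 through 5 as described above. It is an illustration of the procedure, not the author's implementation; the data, the training/checking split, the number of columns kept, and the cap on generations are all illustrative choices.

```python
import numpy as np
from itertools import combinations

def quad_design(xi, xj):
    """Design matrix for Equation (2): 1, xi, xj, xi^2, xj^2, xi*xj."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi * xj])

def regularity(y, z):
    """Regularity criterion of Equation (4): sum (y - z)^2 / sum y^2."""
    return np.sum((y - z) ** 2) / np.sum(y ** 2)

def gmdh_generation(y_train, X_train, y_check, X_check, keep):
    """One GMDH generation (steps 2-4): fit Eq. (2) on every pair of current
    inputs in the training set, evaluate each fitted polynomial on the
    checking set, and keep the `keep` columns with the smallest r_j."""
    results = []
    for i, j in combinations(range(X_train.shape[1]), 2):
        beta = np.linalg.lstsq(quad_design(X_train[:, i], X_train[:, j]),
                               y_train, rcond=None)[0]
        z_train = quad_design(X_train[:, i], X_train[:, j]) @ beta
        z_check = quad_design(X_check[:, i], X_check[:, j]) @ beta
        results.append((regularity(y_check, z_check), z_train, z_check))
    results.sort(key=lambda r: r[0])
    best = results[:keep]
    return best[0][0], np.column_stack([r[1] for r in best]), \
        np.column_stack([r[2] for r in best])

# Illustrative driver with random data.
rng = np.random.default_rng(1)
X = rng.random((32, 4))
y = X[:, 0] + X[:, 3] ** 2 + rng.normal(0, 0.1, 32)
tr, ck = slice(0, 16), slice(16, 32)       # step 1: training / checking split
Xtr, Xck, r_prev = X[tr], X[ck], np.inf
for _ in range(8):                         # cap on the number of generations
    r_min, Ztr, Zck = gmdh_generation(y[tr], Xtr, y[ck], Xck, keep=5)
    if r_min >= r_prev:                    # step 5: RMIN stopped improving,
        break                              # fall back to previous generation
    r_prev, Xtr, Xck = r_min, Ztr, Zck     # step 3: z's replace the X matrix
```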

A modified GMDH approach

In this section I present a modified version of the original GMDH approach.¹ The modified version is more appropriate for most research social scientists conduct. Most analyses published in major social science journals focus on explaining the determinants of a certain social behavior, rather than on predicting how frequently that behavior will happen. This makes it necessary to have a different type of criterion for selecting the best models from those used by researchers in other fields interested in prediction only (e.g., Brooks and Probert, 1984; Ikeda et al., 1976; Tamura and Kondo, 1980; Vicino et al., 1987).

There is a great difference between prediction and explanation, for under certain circumstances it is possible to predict phenomena without necessarily being able to explain them, and vice versa (Scriven, 1959). "Roughly speaking, prediction requires only a correlation, the explanation requires more" (Scriven, 1959: 480). Commenting from the standpoint of a philosopher of science, Kaplan (1964) states that the ideal explanation is probably one that allows prediction, but that the converse is certainly questionable. As far as a model is concerned, we may be able to predict well by selecting the model with the smallest root mean square or the highest R square. But the independent variables may fare poorly as explicators, since their parameter estimates can well be nonsignificant because of collinearity, even though collectively they may be good predictors and give a high R square and a low root mean square. There are two situations of explanation I identified at the outset of the paper: building models of phenomena of interest, based on some knowledge; and testing hypotheses, which explain the specified relationships by supporting or refuting them. Moreover, we not only are more interested in explanation than in prediction of social behavior, but also prefer to do so with models that are as parsimonious as possible. Thus, I have modified the original GMDH algorithm in the following three ways:

(1) Similar to Prager's (1984) treatment of model selection, I use Mallows' Cp to select the best of all possible subsets of variables, based on Equation (2) of the original Step 2. This gives a model with one to five variables, whichever has the smallest value of Cp, which is entered later into the Z vector. Cp is given by the following equation:

$$ C_p = \frac{\mathbf{y}'\mathbf{y} - \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{\sigma^2} - (N - 2p), \qquad (5) $$

where y and ε are the matrix notations for the dependent variable and the disturbance, respectively, σ² is the variance of the error term, and p is the number of independent variables including the intercept.

(2) Instead of the root mean square or some other similar measure that is based on the behavior of the model as a whole, I propose to use the α-root mean square as the criterion of selection. The α-root mean square is a measure that takes into account how the individual input variables as well as the model as a whole fare. It is here defined as the root mean square for the model, computed on the checking data set, weighted by the average of the α values for the two-tailed Student's t-tests of all parameters excluding the intercept. It is defined mathematically as follows:

$$ \alpha\text{-RMS} = \frac{(\mathbf{y}-\mathbf{z})'(\mathbf{y}-\mathbf{z})}{\mathbf{y}'\mathbf{y}} \cdot \frac{\sum_j \operatorname{prob}\!\left[\,|\hat{\beta}_j/\hat{\sigma}_{\beta_j}| \geq t_{df}\,\right]}{p-1}. \qquad (6) $$

The purpose of using this criterion is to select the model with the highest predictive power for the model as a whole while the parameter estimates in the model on average render the most significant t-tests. Other measures, such as the tolerance level and the variance inflation factor, that could be used in other situations would not be appropriate here, since terms in a polynomial are by definition highly collinear. The α-root mean square has proved appropriate in my test runs of the modified GMDH algorithm and for the example in the next section.

(3) A final improvement over the original GMDH method is the use of bootstrap methods to control Type I error, for it is possible, five out of one hundred times, for instance, to happen upon a set of polynomials that behave very well. Therefore, the modified GMDH method evaluates the α-root mean square on a checking data set that consists of a certain number (50 in the following example) of bootstrap samples of the original data. The bootstrap method used here is discussed by Efron and Tibshirani (1985). This method gives unbiased and consistent estimates of standard errors. In addition to using bootstrap samples for the checking data set, bootstrap estimates of the parameters and their standard errors for the final model can also be calculated similarly.
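As an illustration of how the three modifications might fit together computationally, the following sketch computes the α-root mean square of Equation (6) for one candidate model, using bootstrap resamples of the data as the checking set. It is a minimal sketch under stated assumptions (the per-sample root mean squares are averaged across bootstrap samples, and the p-values come from the fit on the original data), not the author's SAS/IML module (see note 1).

```python
import numpy as np
from scipy import stats

def alpha_rms(y, X, n_boot=50, seed=0):
    """alpha-root mean square (Eq. 6): the regularity criterion evaluated on
    bootstrap checking samples, weighted by the average two-tailed p-value
    of the slope estimates (intercept excluded). X excludes the intercept."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    df = n - k - 1
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(Xc.T @ Xc)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)  # two-tailed p-values
    avg_p = pvals[1:].mean()                       # exclude the intercept

    rng = np.random.default_rng(seed)
    rms = []
    for _ in range(n_boot):                        # bootstrap checking samples
        idx = rng.integers(0, n, n)
        yb, zb = y[idx], Xc[idx] @ beta            # model fitted on original data
        rms.append(np.sum((yb - zb) ** 2) / np.sum(yb ** 2))
    return np.mean(rms) * avg_p
```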

An example: the 1907 Romanian peasant rebellion

The original model

In the following I present an empirical example in which two theories are candidates for testing. The example is a re-analysis of Chirot and Ragin's (1975) study of the 1907 Romanian peasant rebellion in 32 counties. They tested two theories of peasant rebellion. They first synthesized Wolf, Moore, Tilly and Hobsbawm's theoretical arguments and tested the "transitional society" hypothesis, the main argument of which is that peasant rebellion is most likely when there has been extensive, recent commercialization of agriculture in a traditional peasant society. They derived the second hypothesis from Stinchcombe and called it the "structural" hypothesis. In this model the focus is class stratification. That is, peasants have a greater tendency to rebel if the inequality of land tenure is greater and the middle peasants are stronger. Chirot and Ragin tested three versions of the dependent variable: Violence of rebellion, a measure that records the approximate deaths in each county in four groups; Spread, a variable that records the proportion of villages involved in rebellion in each county; and Intensity, a summary measure that is the sum of the standardized Violence and Spread variables.

A modified GMDH approach for social science research

29

Table 1. Multiple regression results from Chirot and Ragin's original models

| Variable | (1) | (2) | (3) | (4) |
| --- | --- | --- | --- | --- |
| Intercept | -12.837* (4.796) | 11.302 (10.747) | -2.251 (2.655) | 12.784 (10.983) |
| 1 Commercial agriculture | 0.095** (0.019) | -0.775* (0.353) | | -0.873* (0.366) |
| 2 Traditionalism | 0.123* (0.058) | -0.161 (0.127) | | -0.194 (0.131) |
| 3 Interaction of 1 and 2 | | 0.010* (0.004) | | 0.011* (0.004) |
| 4 Mid. peasant strength | | | -0.011 (0.025) | -0.001 (0.016) |
| 5 Gini coefficient of land inequality | | | 3.770 (3.921) | 2.293 (2.642) |
| R² | 0.575 | 0.651 | 0.101 | 0.670 |
| Adjusted R² | 0.545 | 0.613 | 0.039 | 0.607 |

Note: standard errors of estimates are reported in parentheses after the unstandardized regression coefficients; * denotes statistical significance at the 0.05 level, and ** at the 0.01 level.

For testing the "transitional society" hypothesis, they operationalized the commercialization of agriculture as the percent of cultivated land devoted to wheat, the main cash crop, in each county over a five-year period (1900-1904), and traditionalism as the percent illiterate in each county, since rising literacy symbolized the breaking down of traditional ways of viewing the world for the Romanian peasants. They tested the hypothesis by running two regression models, one with additive effects only, and the other adding the interaction effect between commercialization and traditionalism, an effect implied by the hypothesis. To test the "structural" hypothesis, they also had two variables: the relative strength of middle peasants, and the relative strength of the family-sized tenancy system. The former is measured as the percent of rural households that owned between 7 and 50 ha, and the latter is represented by a Gini index of inequality of landowning.

Chirot and Ragin (1975) altogether tested four models - the first with commercialization of agriculture and traditionalism as independent variables, the second the same as the first except for the addition of an interaction term, the third with the strength of middle peasants and that of land inequality, and the last a combined model of the second and the third to test the competing hypotheses - on all three versions of the dependent variable, the tendency to have peasant rebellion. In Table 1 I report the results for all four models using Intensity as the dependent variable, which gives the most consistent support to their conclusions.²


Instead of reporting standardized parameter estimates as they did, I report unstandardized coefficients because the independent variables are comparable - they are all on a scale either from 1 to 100 or from 0 to 1. As shown in the table, the findings support the "transitional society" hypothesis but not the "structural" hypothesis. In model 1, the parameter estimates for both commercialization of agriculture and traditionalism are significant. In model 2, the interaction term between the two factors of the "transitional society" model is significant, giving the hypothesis strong support.³ Model 3 tests the "structural" hypothesis and appears to be a poor model. In the full model, where variables for both hypotheses are entered, the significant parameter estimates are the same ones as in model 2, while the adjusted R square decreases somewhat. Based on these findings, Chirot and Ragin (1975) concluded that the "transitional society" hypothesis was supported by the Romanian data on the 1907 peasant rebellion, while the "structural" hypothesis was not.

Needless to say, the models they used are adequate for testing the hypotheses they set up. But disproving one theory while retaining the other is meaningful only under one assumption: that reality is as simple (or as complex) as the limits of the two theoretical hypotheses allow. Imagine that reality is more complex than what either theory has defined. In that case we have not found what the relationships among the social factors really are. If we do not assume the constraints of the theoretical hypotheses above, and instead let the model self-organize and better represent the relationships among the four social forces indicated by the four variables, the picture may be drastically different. In the following I show this picture by using the GMDH method to arrive at the most representative model.

The new model selected by GMDH

Using the modified GMDH algorithm described in a previous section, a "best" model was arrived at within four iterations. The results from each generation, or iteration, are reported in Table 2. The variables x1 through x4 represent commercialization of agriculture, traditionalism, strength of middle peasants, and the Gini coefficient of land inequality, respectively. The z variables are the Z vectors described in the GMDH algorithm section; they consist of either x or z variables from the previous generation. For this example I restrict the number of input variables for the next generation to one less than the total possible number of combinations of the variables (6 - 1 = 5), or the number of input variables for the present generation plus one (4 + 1 = 5). That is why there are five z variables for the first and the second generations.


Table 2. Results from the GMDH method

| Pair of variables (a) | Variables actually in the model (b) | Training data R² | Checking data α-RMS |
| --- | --- | --- | --- |
| Generation 1 | | | |
| z1: x1 x4 | x1, x1², x4² | 0.6482 | 0.04281269 |
| z2: x1 x2 | x2, x1², x2² | 0.6792 | 0.04590468 |
| z3: x1 x3 | x1, x1² | 0.6341 | 0.05283396 |
| z4: x2 x3 | x2, x2² | 0.2759 | 0.08902807 |
| z5: x3 x4 | x4 | 0.1052 | 0.19281394 |
| Generation 2 | | | |
| z1: z3 z4 | z3, z4² | 0.6482 | 0.02559303 |
| z2: z1 z4 | z1, z4² | 0.5858 | 0.02626779 |
| z3: z1 z2 | z2, z2² | 0.6985 | 0.03745713 |
| z4: z1 z3 | z1, z3² | 0.6790 | 0.04767710 |
| z5: z3 z5 | z3, z5, z3² | 0.6964 | 0.04936548 |
| Generation 3 | | | |
| z1: z1 z5 | z5 | 0.7302 | 0.00009262 |
| z2: z1 z4 | z4 | 0.7172 | 0.00015946 |
| z3: z2 z3 | z2 | 0.7039 | 0.00016771 |
| z4: z1 z3 | z3 | 0.7022 | 0.00019836 |
| Generation 4 | | | |
| z1: z1 z2 | z1 | 0.7112 | 0.00026847 |
| z2: z1 z4 | z1 | 0.6997 | 0.00026847 |
| z3: z2 z3 | z3 | 0.7121 | 0.00044620 |

(a) These pairs are only those selected by GMDH, keeping the number of input variables for each generation equal to or no more than the original number of input variables plus one.
(b) These are the variables symbolized by the polynomials from the preceding generation.

Note that the third and the fourth generations have four and three z variables, respectively. This is so because in each generation only the most parsimonious model, the one with the lowest Cp, is selected for each possible pair of variables, and thereby some of the Z vectors may become identical to others and thus redundant. The input matrix is trimmed for singularity before entering the next iteration.

A few general remarks about the results in Table 2 are in order. First, the model with the lowest α-root mean square on the checking data - with the α-root mean square calculated from 50 bootstrap samples drawn from the original data - generally has a relatively high, though not necessarily the highest, R square from the training data (which is based on all data in the original set). Over the first three iterations, the trends are clear: the R squares have increased while the α-root mean squares have declined.⁴ Generation 4 shows that on average the α-root mean squares are higher than those in generation 3, indicating that the system is overspecialized beyond iteration 3.


Table 3. Coefficient estimates for the polynomial derived by the GMDH method for the 1907 Romanian peasant rebellion data

| Variable | OLS estimates | Bootstrap estimates |
| --- | --- | --- |
| Intercept | -4.399* (1.855) | -4.610* (1.651) |
| 1 Commercial agriculture | 0.219 (0.199) | 0.248 (0.203) |
| 2 Gini coefficient of land inequality | 3.362# (1.822) | 3.327* (1.434) |
| 3 (Commercial agriculture)² | -0.008 (0.006) | -0.009 (0.006) |
| 4 (Commercial agriculture)⁴ | 2.6E-06* (1.2E-06) | 2.8E-06* (1.3E-06) |
| R² | 0.716 | |
| Adjusted R² | 0.674 | |

Note: standard errors of estimates are reported in parentheses to the right of the unstandardized regression coefficients; # denotes statistical significance at the 0.05 level, one-tailed test, and * denotes statistical significance at the 0.05 level, two-tailed test.

Therefore, let us go back to the best model in generation 3 and work out the polynomial in terms of the original variables, the x's. The best model in generation 3 is z1, which has the lowest α-root mean square of 0.00009 and is composed of z1 and z5 from generation 2. However, z1 can be dropped, since the variable actually used in z1 of generation 3 is z5 only. z5 of generation 2, in turn, consists of z3, z5 and (z3)² from generation 1. Again, the z3 and z5 variables in generation 1 can be traced to a representation of the original input variables: they are x1 + x1², and x4, respectively. If we do all the factoring and throw out constants, we arrive at the following equation:

$$ \hat{y} = x_1 + x_4 + x_1^2 + x_1^4. \qquad (7) $$

This equation is the best model the modified GMDH algorithm has produced using the selection criterion of the α-root mean square with 50 bootstrap samples as the checking data set. The bootstrap method is useful in keeping the selected model parsimonious, for without bootstrap samples a model with many interaction terms and higher-order polynomials would have been selected. Now we may re-estimate this model in the form of the original variables (the x's). The estimation in the original variables renders summary statistics such as the R² a little different from those for generation 3, because z variables in a higher generation are composite variables and the degrees of freedom therefore differ. Ordinary least squares parameter estimates for the model expressed in terms of the original variables, using all the data, are given in Table 3. Also given are estimates based on 100 bootstrap samples of the original data. This is done to confirm that Type I error does not present a problem, because polynomials of a higher order may happen to be significant in an estimation with one sample.
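For readers who wish to reproduce this kind of check, a minimal sketch of case-resampling bootstrap estimates for a final model of the form of Equation (7) might look as follows (the column construction and names are illustrative, not the author's code):

```python
import numpy as np

def bootstrap_ols(y, X, n_boot=100, seed=0):
    """Case-resampling bootstrap of OLS estimates for a final model.

    X should already contain the intercept column and the polynomial terms.
    Returns the mean and standard deviation of the coefficient estimates
    across the bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        draws.append(beta)
    draws = np.asarray(draws)
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)

# Illustrative design matrix for Equation (7):
# X = np.column_stack([np.ones(len(x1)), x1, x4, x1**2, x1**4])
```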


First, although the parameter estimates for the lower-order forms of commercialization of agriculture are not statistically significant, this does not suggest that these lower-order terms are unimportant. In fact, a series of F-tests show that the contribution in terms of variance explained from variable 4 alone is significant at the 0.05 level, and that the contributions from variables 3 and 4, and from variables 1, 3, and 4, are both significant at the 0.01 level. If our aim is testing the structural hypothesis, the one-tailed test may be used for the variable of land inequality. If our purpose is more along the lines of exploration, as with the GMDH method, then the two-tailed test should be used. It is clear that the results from the estimation using bootstrap samples are reassuring: the parameter estimates are by and large consistent with the OLS estimates, and the estimate for land inequality is also significant at the 0.05 level in a two-tailed test.

Interpretation of the parameter estimate for the variable of land inequality is straightforward, in the same way as for any multiple regression coefficient. It demonstrates that the greater land inequality was, the more likely the peasants were to rebel with a greater intensity, thereby supporting the structural hypothesis. Interpreting polynomials of a higher order is trickier; we may take first-degree partial derivatives with respect to the variable involved in the higher-order polynomial (Stolzenberg, 1980). Such a partial derivative for commercialization of agriculture, ∂ŷ/∂x1, is taken and graphed in Figure 1. The effect of commercialization of agriculture on the intensity of rebellion changes with the extent to which the county's agriculture was commercialized. This effect decreases over the lower levels of commercialization, falls below zero around the middle level, and then increases for higher levels of commercialized agriculture. This suggests that when a county had a small proportion of wheat production, an increase in this proportion, though to a small extent, would incur an increase in the intensity of rebellion. This effect weakens as the proportion of wheat production approaches a middle level; in fact, an increase in the proportion would only dampen the intensity for counties at this level of commercialization. As soon as the proportion of wheat production passed this middle level (about 25-26%), an increase in commercialization of agriculture would have a greater and greater positive influence on the intensity of rebellion. The same idea can also be derived by graphing the relationship between the intensity of rebellion and the degree of commercialized agriculture when the extent of land inequality is kept at its mean (Figure 2).
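For reference, the derivative graphed in Figure 1 follows directly from the fitted model in Table 3; writing b1, b3, and b4 for the coefficients on commercialization of agriculture, its square, and its fourth power, we have

$$ \frac{\partial \hat{y}}{\partial x_1} = b_1 + 2 b_3 x_1 + 4 b_4 x_1^3. $$

With b3 negative and b4 positive (Table 3), this cubic is positive at low values of x1, dips below zero over a middle range, and turns positive again at high values, which is the pattern described above.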

Fig. 1. The effect of commercialization of agriculture on intensity of rebellion. [Figure not reproduced; it plots the effect on intensity of rebellion (vertical axis) against % commercialization of agriculture (horizontal axis).]

Fig. 2. The relation between intensity of rebellion and commercialization of agriculture. [Figure not reproduced; it plots intensity of rebellion (vertical axis) against % commercialization of agriculture (horizontal axis), with land inequality held at its mean.]


Overall, the findings support elements from both the "transitional society" and the "structural" theories, but this support is not without conditions; the findings really modify both theories in important ways. The patterns in which the two social forces - commercialization of agriculture and land inequality - influence the intensity of peasant rebellion are very clear from the model obtained with the GMDH method. This model is parsimonious, and it synthesizes elements from both theories. Other things being equal, it suggests that stratification in the Romanian society of the time is an important factor in explaining rebellion, and that commercialization of agriculture aggravates peasants when its level is either low or high, and especially when the commercialized proportion is high, suggesting that beyond a certain degree the impositions of the market force would become unbearable for the peasants.

In sum, using the GMDH method I have arrived at a model different from Chirot and Ragin's. Their model found support for the "transitional society" hypothesis but not the "structural" hypothesis, whereas my final model gives both theories support with important qualifications. The example showed how patterns of relationships in the data were explored.

Some extensions

The use of the modified GMDH approach exemplified above can be generalized to other types of models. In the example all variables, dependent and independent, are continuously measured. Thus, the first extension of the GMDH approach is to include the use of dummy variables. In fact, this task is a trivial case, for few modifications are necessary to allow dummy variables into the model. Dummy variables of higher polynomial orders will drop out of the game, since a dummy variable raised to the nth power is still itself. In the present modified algorithm the rank of the input matrix for each succeeding generation is checked, and redundant terms are eliminated. Therefore, the more dummy variables a model has, the more quickly the GMDH algorithm finds the "best" model before it gets overspecified.

Since categorical variables can be handled by using dummy representations in GMDH, a more important extension is in order. In the present paper the GMDH approach is used on regression-type models. A regression model is only a special case of the family of generalized linear models. Following McCullagh and Nelder's (1983) formalism, the systematic component - a linear predictor η produced by covariates x1, x2, ..., xp - is defined as follows:

$$ \eta = \sum_{j=1}^{p} \beta_j x_j. \qquad (8) $$

If we define E(Y) = μ, then for the linear regression model we have μ = η.


For two other members of the family of generalized linear models, logit and probit models, the links between μ and η are different. For the logit model we have η = ln[μ/(1 − μ)], and for the probit model we have η = Φ⁻¹(μ). When these two types of models are used in the GMDH algorithm, we must change the criterion for checking the fit of the model. Recall that both the root mean square and Mallows' Cp (and most other measures of fit for regression-type models) are based on Σ(y − ŷ)². For logit models this criterion may be a form of the deviance measure

$$ 2 \sum \big\{ y \ln(y/\hat{\mu}) + (n - y) \ln[(n - y)/(n - \hat{\mu})] \big\}. $$

The counterpart for probit models takes the same form, with the fitted values obtained through the probit link Φ. It is not difficult to see that with these checking criteria implemented, possibly again weighted with the α-levels pertaining to the t-tests of individual parameters, the generalization of the modified GMDH method to logit and probit models is feasible. However, detailed discussions are beyond the scope of the present paper.
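To make the proposed criterion concrete, here is a minimal sketch of a binomial deviance function of the kind indicated above, written for grouped data with observed counts y out of n trials and fitted values μ̂ = n·p̂. It is an illustration of the idea, not a worked-out GMDH extension; for a probit fit the same function applies with the fitted probabilities obtained through Φ.

```python
import numpy as np

def binomial_deviance(y, n, p_hat):
    """Deviance-type checking criterion for grouped binomial fits:
    2 * sum{ y*ln(y/mu) + (n-y)*ln[(n-y)/(n-mu)] }, with mu = n*p_hat.
    Terms with y == 0 or y == n contribute only their non-degenerate part."""
    y, n, p_hat = map(np.asarray, (y, n, p_hat))
    mu = n * p_hat
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2 * np.sum(t1 + t2)
```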

Conclusion

In this paper I have introduced the GMDH approach and presented a modified version of the GMDH algorithm that is based on the parsimony of models and on the behavior of individual parameter estimates as well as of the whole model. The modified version is more suitable for the kind of analysis most social scientists do. I have also used the example of the 1907 Romanian peasant rebellion to illustrate how a model with higher-order polynomials can be arrived at and interpreted when the data have been fully explored for patterns of relationships. The example showed that the model arrived at, and the conclusions implied therein, can be very different from the original. The use of bootstrap methods contributed significantly to the parsimony of the models and to the control of possible Type I error in the example. Bootstrap samples also make the idea of a "checking" data set more meaningful. I have also shown some possible ramifications of the GMDH method. The algorithm in the present paper can easily handle models with dummy variables.


With some modifications, the approach can also be extended to logit and probit models.

Given GMDH's appeal to modelers, a word of warning is in order. GMDH is no panacea. If you have variables that are poorly measured, or if you are at your wit's end in finding a new theory, the GMDH approach is not going to help you; it may even mislead you. Only when you have done all the necessary preliminary cleaning and analysis, and when that preparation is coupled with substantive knowledge of the subject, will the GMDH approach not only save you time and energy but also shed new light on your pursuit of patterns of relationships in the data in your research.

Acknowledgements

I would like to thank Gerhard Arminger and Kenneth Bollen for their helpful comments on an earlier draft. My special thanks go to the late Prof. Hubert M. Blalock, who gave some nice suggestions but did not live to see this publication. A version of this paper was presented at the American Sociological Association Annual Meeting, August 9-13, 1989, San Francisco, U.S.A. Support from the Carolina Population Center is also acknowledged.

Notes

1. The modified GMDH algorithm is programmed by the author in SAS/IML as a module that takes matrix data as input.
2. For the sake of parsimony, results for models using the other two dependent variables are not reported here, because Intensity is a summary measure of the other two. For the same reason, the GMDH method is applied only to models with Intensity as the dependent variable.
3. A minor discrepancy is found between Chirot and Ragin's results and those reproduced using their data. Unlike their findings, in which the inclusion of the interaction term made the two main effects insignificant, among the reproduced results the main effect of commercialization is still significant after the interaction term has been entered. This, however, may not affect their conclusion much, since the simultaneous main and interaction effects of commercialization of agriculture would be mostly positive over the observed values of traditionalism if the partial derivative of the intensity of rebellion with respect to commercialization of agriculture were taken.
4. For readers concerned about the R² increase for a polynomial model, it is necessary to point out that the R² for the best model in Table 2 is not directly comparable to the R² for Chirot and Ragin's model 4 in Table 1, because the best model from generation 3 in Table 2 has only one parameter (in addition to the intercept) to be estimated, fewer than in the original model, though this single variable is a complex one composed of many lower-order polynomials. The R² from the model in Table 3, in which the terms are rearranged in terms of the original variables, is more comparable to that of the original model, though strictly speaking the variables in the original model are not an exact subset of those in Table 3, for two variables, traditionalism and middle peasant strength, dropped out in the GMDH selection.

References

Bock, R. Darrell (1975) Multivariate Statistical Methods in Behavioral Research. New York: McGraw-Hill.
Breiman, Leo and Friedman, Jerome H. (1985) Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80: 580-598.
Brooks, Hugh A. and Probert, Thomas H. (1984) Let's ask GMDH what effect the environment has on fisheries. In Stanley Farlow (ed) Self-Organizing Methods in Modeling: GMDH Type of Algorithms. New York: Marcel Dekker.
Chirot, Daniel and Ragin, Charles (1975) The market, tradition and peasant rebellion: the case of Romania in 1907. American Sociological Review 40: 428-444.
Draper, N.R. and Smith, H. (1981) Applied Regression Analysis. New York: Wiley.
Efron, B. and Tibshirani, R. (1985) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1(1): 54-77.
Farlow, Stanley J. (1984) The GMDH algorithm. In Stanley Farlow (ed) Self-Organizing Methods in Modeling: GMDH Type of Algorithms. New York: Marcel Dekker.
Ikeda, Saburo, Ochiai, Mikiko and Sawaragi, Yoshikazu (1976) Sequential GMDH algorithm and its application to river flow prediction. IEEE Transactions on Systems, Man and Cybernetics SMC-6: 473-479.
Ivakhnenko, A.G. (1970) Heuristic self-organization in problems of engineering cybernetics. Automatica 6: 207.
Ivakhnenko, A.G. (1984) Past, present, and future of GMDH. In Stanley Farlow (ed) Self-Organizing Methods in Modeling: GMDH Type of Algorithms. New York: Marcel Dekker.
Kaplan, A. (1964) The Conduct of Inquiry. San Francisco: Chandler.
McCullagh, P. and Nelder, J.A. (1983) Generalized Linear Models. London: Chapman and Hall.
Morgan, James N. and Sonquist, John A. (1962) Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58: 415-434.
Mosteller, Frederick and Tukey, John W. (1977) Data Analysis and Regression. Reading, MA: Addison-Wesley.
Prager, Michael H. (1984) An SAS program for simplified GMDH models. In Stanley Farlow (ed) Self-Organizing Methods in Modeling: GMDH Type of Algorithms. New York: Marcel Dekker.
Scriven, M. (1959) Explanation and prediction in evolutionary theory. Science 130: 477-482.
Sonquist, John A. and Morgan, James N. (1964) The Detection of Interaction Effects. Survey Research Center Monograph No. 35, Institute for Social Research, The University of Michigan, Ann Arbor, MI.
Sonquist, John A., Baker, Elizabeth Lauh and Morgan, James N. (1973) Searching for Structure. Survey Research Center Monograph, Institute for Social Research, The University of Michigan, Ann Arbor, MI.
Stolzenberg, Ross M. (1980) The measurement and decomposition of causal effects in nonlinear and nonadditive models. In Karl F. Schuessler (ed) Sociological Methodology, Chapter 15, pp. 459-488.
Tamura, Hiroyuki and Kondo, Tadashi (1980) Heuristic free group method of data handling algorithm of generating optimal partial polynomials with application to air pollution prediction. International Journal of Systems Science 11: 1095-1111.
Vicino, A., Tempo, R., Genesio, R. and Milanese, M. (1987) Optimal error and GMDH predictors. International Journal of Forecasting 3: 313-328.