Non-White, No More: Effect Coding as an Alternative to Dummy Coding With Implications for Higher Education Researchers
Matthew J. Mayhew and Jeffrey S. Simonoff
Journal of College Student Development, Volume 56, Number 2, March 2015, pp. 170-175 (Article)
Published by Johns Hopkins University Press
DOI: 10.1353/csd.2015.0019
Research in Brief
Vasti Torres, associate editor
Non-White, No More: Effect Coding as an Alternative to Dummy Coding With Implications for Higher Education Researchers

Matthew J. Mayhew
Jeffrey S. Simonoff

The purpose of this article is to describe effect coding as an alternative quantitative practice for analyzing and interpreting categorical, race-based independent variables in higher education research. Unlike indicator (dummy) codes that imply that one group will be a reference group, effect codes use average responses as a means for interpreting information. This technique is especially appropriate for examining race, as such a process enables raced subgroups to be compared to each other and does not position responses of any raced group as normative, the standard against which all other race effects are interpreted. The issues raised here apply in any research context where a categorical variable without a natural reference group (e.g., college major) is a potential predictor in a regression model.
Theoretical Orientation

The use of alternative codings, like effect coding, for analyzing and interpreting categorical predictors is not a new idea. According to Struik (1987), the formal operations date back to the Babylonians and their understandings of linear algebra. More recently, Fisher (1921) discussed alternative codings in his early work
on the two-way analysis of variance (ANOVA). Expanding the principles articulated in these early efforts, Yates (1934) discussed alternative codings by suggesting that the interpretation of the regression coefficients for indicator variables corresponds to ANOVA interpretations of deviations: An overall level, as opposed to one based on a specific group of individuals, could serve as the reference point for interpreting effects. These possibilities have been discussed frequently in the more recent methodology literature; examples include Cohen and Cohen (1984), Hardy (1993), Miles and Shevlin (2001), Muller and Fetterman (2002), Kleinbaum, Kupper, Nizam, and Muller (2008), Warner (2008), Rutherford (2011), and Chatterjee and Simonoff (2013).

This effort draws on philosophical tenets embedded within critical paradigms for examining the raced experiences of individuals located in a particular sociohistorical and political context. Borrowing from critical race theorists (see Delgado, 1995; Solorzano, 1997), we concede that, in America, systemic oppression of individuals of color is existent, sustained, and culturally reproduced, often by forces ranging from, but not limited to, overt
Matthew J. Mayhew is Associate Professor of Higher Education and Jeffrey S. Simonoff is Professor of Statistics, both at New York University. The authors gratefully acknowledge the Ewing Marion Kauffman Foundation for its generous support in the funding of this research project.
hegemonic and discriminatory practices and behaviors to more subtle micro-aggressions, such as the language used to essentialize the experiences of one group over another. Even though it is often unintended, quantitative researchers who work with indicator variables as a means for comparing raced identity patterns may be using language to essentialize the experiences of one group over another. For example, researchers interested in examining effects by race might say, "When compared to White students, African American students are significantly more or less likely to _____." Here, the language used to describe racial differences uses the White experience as the context for making meaning of the African American experience; thus, the White student experience is essentialized, becoming the norm against which the raced experiences of African American students are understood.

The goal of this article is to offer an alternative practice of working with categorical variables in a way that does not use language that essentializes the experiences of any particular group. Relatedly, this study draws on Stage's (2007) definition of the quantitative criticalist, a researcher who "question[s] the models, measures and analytical practices of quantitative research in order to offer competing models, measures and analytical practices that better describe the experiences of those who have not been adequately represented" (p. 10). Critical quantitative research is sensitive, in particular, to the challenges of examining race. Until now, these challenges have been associated with between-group analyses—taking into account that comparative approaches to subgroup variations can both mask important differences that exist within groups and decontextualize the varied experiences of individual groups (for example, see Solorzano, 1997). The current effort extends ideas championed by the quantitative criticalist by problematizing the use of dummy coding in matters of understanding
race and its effects on outcomes. With effect coding, there is no reference group per se, and as a result, interpretations of the effects offered would read, “Compared to an overall level, African American/Black students are significantly more or less likely to engage in cocurricular programming.” We offer effect coding as a small step toward addressing issues of equity within quantitative research. To clarify, we explain differences between effect and dummy coding as well as the relative advantages of each and describe how to code using effect coding procedures. It is our hope that this study will shed light on how certain analytic techniques, even those as foundational as the coding of categorical variables, influence the ways we make meaning about race as a predictor of outcomes.
Description of Method

For the purposes of this example, we have chosen to focus on explaining the mechanics of effect coding. To do this, we use the example mentioned earlier, where participants are forced to respond by selecting one of the presented options: 1 = White; 2 = African American/Black; 3 = Latino/a; 4 = Asian/Asian Pacific; 5 = Native American; 6 = Biracial or Multiracial; or 7 = Prefer Not to State. Creating indicator variables begins with selecting a reference group. In this example, we use White students as the reference group. Indicator variables would then be created for all other groups. This process involves changing the seven-category variable into a series of six new variables. The process of creating these new variables involves a coding sequence, with 0 = all other races and 1 = African American/Black, for example. This process is repeated for each new variable: Latino/a, Asian/Asian Pacific, Native American, Biracial or Multiracial, and Prefer Not to State, respectively. These newly created
six variables are then simultaneously entered into a regression equation. In this example, White students serve as the reference group and are therefore not included in the process of recoding or entry into the regression equation, and all race effects are interpreted using White students as the benchmark; as an example, an interpretation would read, “Compared to White students, African American/Black students were significantly more or less likely to engage in cocurricular programming.”* Effect coding slightly differs from indicator variable recoding. We again would need to pick a group to not be included in the process of recoding or the final regression equations. Extending the aforementioned example, we select Whites. Effect coding then involves a process of changing the seven-category variable into a series of six new variables. The process of creating these new variables involves a different coding sequence, with –1 = Whites, 0 = all other races, and 1 = African American/Black, for example.† This process is repeated for each new variable: Latino/a, Asian/Asian Pacific, Native American, Biracial or Multiracial, and Prefer Not to State, respectively. These newly created six variables are again simultaneously entered into a regression equation.‡ While
interpretations for indicator variables are based on White student responses, interpretations of effect codes are made relative to the unweighted average of the group means. See Table 1 for a side-by-side comparison of dummy versus effect coding. In addition, when using effect coding, the choice of the variable to exclude from the regression equation matters less, and the group corresponding to the omitted variable should therefore not be referred to as the "reference group." Using effect coding, the parameter estimates for each categorical covariate included in model estimations will be the same, regardless of the choice of group receiving the code –1. As a result, researchers using effect coding should refer to the group of students coded as –1 as the group omitted from the regression analysis rather than as the reference group. See Table 2 for effect coding values when individuals who self-identified as White compose the omitted group.
* For illustrative purposes, we discuss coefficient interpretations in the context of ordinary least squares regression. Effect codes are also appropriate for other types of analyses, including but not limited to logistic regression, hazard modeling, hierarchical linear modeling, and structural equation modeling. For example, when the outcome of interest is binary, such as whether or not a student graduates from college, using indicator variables to define racial groups, with White as the reference group, results in coefficients that are log-odds ratios relative to that group; that is, exponentiating a coefficient gives the multiplicative effect on the odds of a member of that group graduating relative to a White student graduating (holding all else in the model fixed). The use of effect coding would change the interpretation to be the multiplicative effect on the odds of a member of that group graduating relative to an overall level.

† All of the discussion here has assumed that observations can fall into one and only one group. Sometimes multiple group membership is possible (as in a "Check all that apply" situation), and it is possible to generalize both indicator variables and effect coding to this situation; see Chatterjee and Simonoff (2013, sec. 6.3.4) and Mayhew and Simonoff (in press).

‡ This does not affect the underlying regression fit, including statistics designed to measure overall fit like R², the overall F test, the estimated slope coefficient for any other variable, or the partial F test for the significance of the race effect given everything else in the model.
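To make the two recodings concrete, the following is a minimal sketch in Python (numpy only). The simulated data, sample size, and the continuous cocurricular-engagement outcome are illustrative assumptions, not data from this study; the seven categories follow the 1-7 coding listed above. The sketch builds the six indicator variables and the six effect-coded variables, fits both ordinary least squares models, and shows that the dummy-coded intercept is the White group mean while the effect-coded intercept is the unweighted average of the seven group means, with the implied White effect recoverable as the negative sum of the six estimated coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, illustrative data (not from this study): a seven-category race
# variable coded 1-7 as in the article, and a continuous outcome standing in
# for a cocurricular-engagement measure.
n = 700
race = rng.integers(1, 8, size=n)
y = 3.0 + 0.3 * (race == 2) - 0.2 * (race == 5) + rng.normal(0, 1, size=n)

def dummy_codes(race, omit=1):
    """Six 0/1 indicator variables; the reference group (White = 1) is all zeros."""
    cats = [c for c in range(1, 8) if c != omit]
    return np.column_stack([(race == c).astype(float) for c in cats])

def effect_codes(race, omit=1):
    """Six effect codes: 1 for the row's own category, -1 for the omitted
    group (White = 1), 0 otherwise; equivalently E_i = D_i - D_White."""
    return dummy_codes(race, omit) - (race == omit).astype(float)[:, None]

def ols(X, y):
    """Ordinary least squares with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

b_dummy = ols(dummy_codes(race), y)
b_effect = ols(effect_codes(race), y)
group_means = np.array([y[race == c].mean() for c in range(1, 8)])

# Dummy coding: the intercept is the White group mean and each slope is a
# difference from White.  Effect coding: the intercept is the unweighted
# average of the seven group means and each slope is a difference from that
# overall level.
print("dummy intercept :", round(b_dummy[0], 3), " White mean:", round(group_means[0], 3))
print("effect intercept:", round(b_effect[0], 3),
      " unweighted mean of group means:", round(group_means.mean(), 3))

# The implied White effect is minus the sum of the six estimated effect-coding
# slopes, because the seven effects sum to zero.
print("implied White effect:", round(-b_effect[1:].sum(), 3),
      " check:", round(group_means[0] - group_means.mean(), 3))
```

Because race is the only predictor here, either coding reproduces the same fitted group means; only the meaning of the intercept and slopes changes, which is the point the article develops below.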
Table 1. Side-by-Side Comparison of Dummy and Effect Coding

| Race | Dummy Variable (Di) Recoding | In Model | Effect Coding (Ei) Recoding | In Model |
| White | Nothing (excluded for reference group) | No | Nothing (excluded from analysis) | No |
| African American / Black | 0 = All Other Races; 1 = African American / Black | Yes | –1 = White; 0 = All Other Races; 1 = African American / Black | Yes |
| Latino/a | 0 = All Other Races; 1 = Latino/a | Yes | –1 = White; 0 = All Other Races; 1 = Latino/a | Yes |
| Asian / Asian American | 0 = All Other Races; 1 = Asian / Asian American | Yes | –1 = White; 0 = All Other Races; 1 = Asian / Asian American | Yes |
| Native American | 0 = All Other Races; 1 = Native American | Yes | –1 = White; 0 = All Other Races; 1 = Native American | Yes |
| Biracial or Multiracial | 0 = All Other Races; 1 = Biracial or Multiracial | Yes | –1 = White; 0 = All Other Races; 1 = Biracial or Multiracial | Yes |
| Prefer Not to State | 0 = All Other Races; 1 = Prefer Not to State | Yes | –1 = White; 0 = All Other Races; 1 = Prefer Not to State | Yes |
Table 2. Values of Effect Codings When the Coding for White Is the Omitted Group

Note that Ei = Di − D7, where D7 is the dummy variable for White.

| Variable (Ei) | Race | E1 | E2 | E3 | E4 | E5 | E6 | Expected Response |
| — | White | –1 | –1 | –1 | –1 | –1 | –1 | β0 − β1 − ⋯ − β6 (≡ β0 + β7) |
| E1 | African American / Black | 1 | 0 | 0 | 0 | 0 | 0 | β0 + β1 |
| E2 | Latino/a | 0 | 1 | 0 | 0 | 0 | 0 | β0 + β2 |
| E3 | Asian / Asian American | 0 | 0 | 1 | 0 | 0 | 0 | β0 + β3 |
| E4 | Native American | 0 | 0 | 0 | 1 | 0 | 0 | β0 + β4 |
| E5 | Biracial or Multiracial | 0 | 0 | 0 | 0 | 1 | 0 | β0 + β5 |
| E6 | Prefer Not to State | 0 | 0 | 0 | 0 | 0 | 1 | β0 + β6 |

Note. Averaging all seven expected responses in the table shows that β0 is an overall level that equals the (unweighted) average of the expected responses over all races. The equivalence given in parentheses for the expected response of the group whose effect coding is omitted follows from the fact that the original model corresponds to seven slope coefficients that sum to 0, that is, β1 + β2 + ⋯ + β7 = 0, implying that β7 = −β1 − ⋯ − β6.
In addition to interpreting coefficients by way of an overall level rather than a reference group, another advantage of effect codings is their potential to include more information concerning parameter estimates, even those associated with the group omitted from the original regression analysis. Initially, the model may be fit using White as the coding not included in the regression analysis; this yields estimated coefficients and t statistics for all groups except those in the White category. However, if the researcher wanted to include information on the White individuals, she or he could create a new set of effect codings by using another category (e.g., African American/Black, although the choice of group is immaterial) as the group coded –1 and thus not included in the next regression analysis. This step would yield estimated coefficients and t statistics for the White group, as well as for the other groups in the model, with the exception of the African American/Black students now omitted from this new analysis. In addition to providing new parameter estimates for the White group, the coefficients and t statistics for all other groups (with the exception of the now-omitted African American/Black students) will be identical to those from the original analysis (i.e., the one using effect codings based on omitting the White group), since the choice of the omitted group has no effect on inferences for the other groups included in the model. Thus, the summary of the model fit can include all of the coefficients and t statistics from the original analysis, along with the coefficient and t statistic for the White group from the additional analysis. This added step renders a statement like "this group is significantly more or less likely to engage in cocurricular programming compared to an overall level" more meaningful, as it is based on information for all seven race categories rather than just six of them (i.e., as would be the case for dummy coding, where inference is based on difference from the arbitrarily chosen reference group).
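The two-fit procedure just described can be sketched as follows, again with simulated, illustrative data (nothing here comes from the authors' dataset). The model is fit once with White receiving the –1 coding (and thus omitted) and once with African American/Black omitted; the residual sum of squares and the coefficients and t statistics for the groups appearing in both fits are identical, and the second fit supplies the coefficient and t statistic for the White group.

```python
import numpy as np

rng = np.random.default_rng(0)

labels = {1: "White", 2: "African American/Black", 3: "Latino/a",
          4: "Asian/Asian Pacific", 5: "Native American",
          6: "Biracial or Multiracial", 7: "Prefer Not to State"}

# Same simulated, illustrative setup as in the earlier sketch.
n = 700
race = rng.integers(1, 8, size=n)
y = 3.0 + 0.3 * (race == 2) - 0.2 * (race == 5) + rng.normal(0, 1, size=n)

def effect_design(race, omit):
    """Intercept column plus six effect codes, with `omit` as the -1 group."""
    cats = [c for c in range(1, 8) if c != omit]
    E = np.column_stack([(race == c).astype(float) for c in cats])
    E -= (race == omit).astype(float)[:, None]
    return np.column_stack([np.ones(len(race)), E]), cats

def ols_with_t(X, y):
    """OLS coefficients, t statistics, and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    s2 = rss / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se, rss

# Fit 1 omits White (coded -1); Fit 2 omits African American/Black instead.
X1, cats1 = effect_design(race, omit=1)
X2, cats2 = effect_design(race, omit=2)
b1, t1, rss1 = ols_with_t(X1, y)
b2, t2, rss2 = ols_with_t(X2, y)

# The overall fit does not depend on which group is omitted ...
print("identical residual sum of squares:", np.isclose(rss1, rss2))

# ... and neither do the coefficients or t statistics for groups in both fits.
for c in sorted(set(cats1) & set(cats2)):
    i, j = cats1.index(c) + 1, cats2.index(c) + 1   # +1 skips the intercept
    print(f"{labels[c]:<24} coef {b1[i]: .3f} / {b2[j]: .3f}   t {t1[i]: .2f} / {t2[j]: .2f}")

# The second fit supplies what the first could not: the coefficient and
# t statistic for the White group, relative to the same overall level.
w = cats2.index(1) + 1
print(f"{labels[1]:<24} coef {b2[w]: .3f}   t {t2[w]: .2f}")
```

A report would then combine the six coefficients and t statistics from the first fit with the White coefficient and t statistic from the second, so that every one of the seven groups is compared to the same overall level.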
Discussion

To be clear, dummy coding is not incorrect or inaccurate; in fact, it is, and continues to be, the dominant practice among quantitative researchers for examining race and its effects on a variety of student outcomes. However, from an inclusive perspective, what are the implications of consistently essentializing the voices of any group of students as the benchmark for understanding racial differences? How are we, in our research practices, using quantitative practices that serve to privilege certain voices over others, to norm all behaviors to certain experiences, and to interpret all raced experiences based on the race of one particular group? While effect coding is certainly not the answer to these complicated and nuanced questions, it may serve as a small step toward rethinking conventions in quantitative practice and their often insidious use in privileging certain voices, experiences, and perceptions over others. By removing the idea of a reference group and by interpreting categorical effects as those that differ from an overall level, effect coding may equip quantitative criticalists (see Stage, 2007) with the language they need to start making more informed choices regarding the use of statistics in understanding race and its effects on a variety of outcomes of interest to the higher education audience.

Correspondence concerning this article should be addressed to Matthew J. Mayhew, New York University, 239 Greene St., Suite 300, New York, NY 10003;
[email protected]
References

Chatterjee, S., & Simonoff, J. S. (2013). Handbook of regression analysis. Hoboken, NJ: John Wiley.
Cohen, J., & Cohen, P. (1984). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Fisher, R. A. (1921). Studies in crop variation. I. An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107–135.
Hardy, M. A. (1993). Regression with dummy variables. Newbury Park, CA: SAGE.
Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Muller, K. E. (2008). Applied regression analysis and other multivariable methods. Belmont, CA: Thomson Brooks/Cole.
Mayhew, M. J., & Simonoff, J. S. (in press). Effect coding as a mechanism for improving the accuracy of measuring students who self-identify with more than one race. Research in Higher Education.
Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. London, England: SAGE.
Muller, K. E., & Fetterman, B. A. (2002). Regression and ANOVA: An integrated approach using SAS software. Cary, NC: SAS Institute.
Rutherford, A. (2011). ANOVA and ANCOVA: A GLM approach (2nd ed.). Hoboken, NJ: John Wiley.
Solorzano, D. (1997). Images and words that wound: Critical race theory, racial stereotyping, and teacher education. Teacher Education Quarterly, 24, 5–19.
Stage, F. K. (2007). Answering critical questions using quantitative data. In F. Stage (Ed.), Using quantitative data to answer critical questions (pp. 5–23). San Francisco, CA: Jossey-Bass.
Struik, D. J. (1987). A concise history of mathematics. New York, NY: Dover.
Warner, R. M. (2008). Applied statistics: From bivariate through multivariate techniques. Thousand Oaks, CA: SAGE.
Yates, F. (1934). The analysis of multiple classifications with unequal numbers in the different classes. Journal of the American Statistical Association, 29, 51–66.