Psychological Assessment 2011, Vol. 23, No. 3, 656 – 669

© 2011 American Psychological Association 1040-3590/11/$12.00 DOI: 10.1037/a0023089

Recent Reliability Reporting Practices in Psychological Assessment: Recognizing the People Behind the Data

Carlton E. Green, Cynthia E. Chen, Janet E. Helms, and Kevin T. Henze
Boston College

Helms, Henze, Sass, and Mifsud (2006) defined good practices for internal consistency reporting, interpretation, and analysis consistent with an alpha-as-data perspective. Their viewpoint (a) expands on previous arguments that reliability coefficients are group-level summary statistics of samples’ responses rather than stable properties of scales or measures and (b) encourages researchers to investigate characteristics of reliability data for their own samples and subgroups within their samples. In Study 1, we reviewed past and current reliability reporting practices in a sample of Psychological Assessment articles published across 3 decades (i.e., from the years 1989, 1996, and 2006). Results suggested that contemporary and past researchers’ reliability reporting practices have not improved over time and generally were not consistent with good practices. In Study 2, we analyzed an archival data set to illustrate the real-life repercussions of researchers’ ongoing misconstrual and misuse of reliability data. Our analyses suggested that researchers should conduct preliminary analyses of their data to determine whether their data fit the assumptions of their reliability analyses. Also, the results indicated that reliability coefficients varied across racial or ethnic and gender subgroups, and these variations had implications for whether the same depression measure should be used across groups. We concluded that the alpha-as-data perspective has implications for one’s choice of psychological measures and interpretation of results, which subsequently affect conclusions and recommendations. We encourage researchers to recognize the people behind their data by adopting better practices in internal consistency reporting, analysis, and interpretation.

Keywords: Cronbach’s alpha, reliability, assessment, counseling research, item response bias

This article was published Online First April 25, 2011. Carlton E. Green, Cynthia E. Chen, Janet E. Helms, and Kevin T. Henze, Institute for the Study and Promotion of Race and Culture, Boston College. Kevin T. Henze is now at the Empowerment and Peer Services Center, Edith Nourse Rogers Memorial Veterans Hospital, Bedford, Massachusetts. We are grateful to Leyla Perez-Gualdron for her contributions to the pilot study for this article. Correspondence concerning this article should be addressed to Carlton E. Green or Cynthia E. Chen, Institute for the Study and Promotion of Race and Culture, Boston College, 318 Campion Hall, Chestnut Hill, MA 02467. E-mail: [email protected]

Historically, the reporting and use of reliability estimates of scale scores have been criticized by a variety of sources (Helms, Henze, Sass, & Mifsud, 2006; Vacha-Haase, Kogan, & Thompson, 2000; Willson, 1980). Cronbach’s (1951) coefficient alpha, a measure of interitem response relatedness or internal consistency, is the most frequently used and/or misused reliability statistic (Cortina, 1993; Hogan, Benjamin, & Brezinski, 2000; Peterson, 1994). Typically, researchers have reported internal consistency reliability data as a property of measures rather than as a characteristic of the scores from a sample of interest despite advice to the contrary (Thompson & Vacha-Haase, 2000). Wilkinson and the Task Force on Statistical Inference (1999) urged authors to recognize that tests are neither reliable nor unreliable but that reliability is a psychometric property of a specific sample’s responses to a measure administered under specific conditions. Thus, they advised authors to “provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric” (Wilkinson & the Task Force on Statistical Inference, 1999, p. 596).

Improper reliability reporting practices imply that “reliable” measures are appropriate for use with all populations. Yet Dawis (1987) pointed out that the magnitudes of reliability coefficients are related to sample composition (e.g., the demographic characteristics of race, ethnicity, gender, and education level). Reporting sample demographic information in research helps readers interpret and determine the relevance of findings for their samples (Helms et al., 2006; Vacha-Haase et al., 2000). For instance, test manuals typically report score reliability for their samples’ and subgroups’ responses on the basis of demographic factors such as race, age, education level, and gender (e.g., Butcher et al., 1992; Myers, McCaulley, Quenk, & Hammer, 1998; Reynolds & Kamphaus, 2002), and it appears to be common practice for researchers to induct reliability from test manuals. Yet “this induction turns on the pivotal requirement that both [referent and current] samples are comparable as regards both composition and score variability” (Vacha-Haase et al., 2000, p. 518). Here, we focus on how better reporting of internal consistency reliability data, as well as sample demographics, has implications for whether researchers can infer that a measure is fair and appropriate for use with a particular population. In Study 1, we review past and current reliability reporting practices in a sample of articles published in the 1989, 1996, and 2006 volumes of Psychological Assessment (PA). In Study 2, we analyze archival data that were used in a previously

published study to illustrate the real-life repercussions of researchers’ ongoing misconstrual and misuse of reliability data.

In a recent article, recognized with an Outstanding Quantitative Contribution Award by the American Psychological Association (APA) Division of Counseling Psychology and Sage Publications, Helms et al. (2006) advocated treating Cronbach’s alpha (CA) reliability coefficients as data in counseling psychology. The authors provided a user-friendly list of better reliability analysis and reporting practices (a) to correct the misperception of reliability as an intrinsic characteristic of a measure or scale and (b) to aid researchers in reporting, interpreting, and analyzing internal consistency coefficients. Additionally, they advocated for an alpha-as-data perspective, which encourages researchers to report, interpret, and analyze reliability data as a characteristic of the assessed sample’s responses. This stance is meant to prevent researchers from incorrectly inferring that a scale or measure is appropriate for use with all populations. In other words, CA coefficients, as well as other forms of reliability, are group-level statistics that should be reported for each sample’s and subsample’s responses to a specific scale rather than inferring that the coefficients describe stable attributes of scales or measures per se.

The understanding and application of reliability analyses within psychology research have changed over the past 15+ years. In this time frame, extant reviews of reporting practices (e.g., Meier & Davis, 1990), surveys of undergraduate (Friedrich, Buday, & Kerr, 2000) and graduate psychology statistics curricula (Aiken et al., 1990), guidelines for psychometric best practices (American Educational Research Association [AERA], APA, & National Council on Measurement in Education, 1999; Wilkinson & the Task Force on Statistical Inference, 1999; Joint Committee on Testing Practices, 2004), and reliability primers (Cousin & Henson, 2000; Helms et al., 2006; Onwuegbuzie & Daniel, 2002; Thompson, 2003) have entered the fold of psychology literature. Yet it remains unclear whether researchers have changed their praxis to be consistent with recommended practice.

Authors of these previously cited reliability reviews, guidelines, and primers have attempted to address and redirect the improper reporting, characterization, and use of reliability coefficients, particularly CA, in research publications. For example, Willson (1980) reviewed articles published in the American Educational Research Journal from 1969 to 1978 and concluded that only 37% of those articles included reliability coefficients for their own sample’s responses. He argued that recognizing the heterogeneity of the sample is fundamental to conducting between-group analyses; thus, he advised that researchers should report reliability coefficients for their own data. He also suggested that “editors and reviewers ought to routinely return papers that fail to establish psychometric properties” (Willson, 1980, p. 9).

A decade later, Meier and Davis (1990) provided evidence of the CA reporting practices in the Journal of Counseling Psychology (JCP) by aggregating reliability reporting trends over three decades (i.e., by studying articles from volumes published in the years 1967, 1977, and 1987). They found that from 1967 to 1977, there was an increase in researchers’ providing reliability coefficients for their own sample’s responses; however, fewer reliability estimates were published in 1987.
More specifically, whereas 85% of articles in 1977 cited previously published reliability coefficients, only 60% of articles in 1987 reported these data. Additionally, in 1977, 97% of the articles contained reliability estimates for their sample’s


responses, whereas only 77% of articles in 1987 provided reliability data (Meier & Davis, 1990).

Meier and Davis (1990) suggested two reasons for the inconsistent reliability reporting practices. One reason that the authors proposed was an absence of standards for documenting psychometric data. The lack of standards might have reflected silent condoning of the misrepresentation of reliability as a characteristic intrinsic to measures. Second, researchers might have perceived that reliability has no practical utility, thus accepting and perpetuating the assumption that a universal human experience can be captured by a unitary measure without regard for individual and cultural differences.

As the premier assessment journal for APA, PA should be an exemplar of good psychometric reporting practices for all APA journals in which psychological measures are used. In his inaugural editorial for PA: A Journal of Consulting and Clinical Psychology, Kazdin (1989) cited a rise in assessment articles as the impetus for creating the journal and actively solicited assessment articles that focused on developing, validating, and evaluating measurement methods. Furthermore, PA editors, including both Kazdin and Strauss (2004), stressed the importance of creating an inclusive journal by petitioning for articles that included cross-cultural samples and populations that varied in areas such as “ethnicity, minority status, gender, and sexual orientation” (Kazdin, 1989, p. 3). Strauss (2004) indicated that “the applied focus of [PA] means that potential generalizability of findings and other aspects of external validity are important considerations in the evaluation of all submissions, especially studies of nonclinical groups” (p. 3). In other words, PA’s preeminence in the psychological community relies on its role as a reference that readers can use to critically generalize and ethically translate research findings into treatment and practice with diverse populations. Consequently, good reliability reporting practices in PA are paramount for both the researchers who submit manuscripts and the readers who use their results. That is, reliability reporting has implications for generalizability and ethical practice (AERA et al., 1999; APA, 2003; Wilkinson & the Task Force on Statistical Inference, 1999).

During the past three decades, various scholarly organizations, professional associations, and experts in psychometrics have consistently recommended that ethical assessment practices with nondominant populations (e.g., women, girls, racial and linguistic minorities) require at least two considerations by psychologists. First, “potential test users need to determine whether reliability estimates reported in test manuals [or journal articles] are based on samples similar in composition and variability to the group for whom the test will be used” (Crocker & Algina, 1986, p. 144). Second, when no such data exist, “psychologists [should] describe the strengths and limitations of test results and interpretation” (APA, 2002, p. 1071). Some of the relevant literature includes the “Ethical Principles of Psychologists and Code of Conduct” (APA, 1992, 2002), the Standards for Educational and Psychological Testing (AERA et al., 1999), and the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 2004). For example, AERA et al. (1999) stated, “The critical information on reliability includes . . .
a description of the examinee population to whom the foregoing data apply, as the data may accurately reflect what is true of one population but misrepresent what is true of another” (p. 27). Additionally, the Joint Committee on Testing Practices (2004) has


recommended that psychologists and other test users “obtain and provide evidence on the performance of test takers of diverse subgroups, [make] significant efforts to obtain sample sizes that are adequate for subgroup analyses, [and] evaluate the evidence to ensure that differences in performance are related to the skills being assessed” (p. 3). From each of these documents, it is possible to infer that it is poor practice and potentially harmful to tested subgroups for researchers to assume that reliability evidence obtained with one sample (e.g., adult men) can generalize to other samples and/or populations (e.g., women, children, adolescents). Given PA’s position in the measurement community and the likelihood that researchers look to PA for methodological direction, a review of the journal’s reliability reporting practices seems to be in order. Our aim in Study 1 was to examine the trends of reporting reliability information in PA, starting with its inception in 1989. To assess the state of analyzing, interpreting, and reporting reliability coefficients in recent psychological research, we adapted and integrated the Meier and Davis (1990) methodology with the Helms et al. (2006) alpha-as-data perspective for better reliability reporting practices. More specifically, we investigated whether researchers reported demographic data (e.g., gender, age, education level), scale properties (e.g., number of items), and reliability data (e.g., coefficients) for their own samples.

Study 1

Method

Sample. Similar to Meier and Davis’s (1990) procedure, we reviewed a random sample of articles in PA spanning three decades (i.e., articles from PA volumes published in 1989, 1996, and 2006) to discern if reliability reporting, interpretation, and analysis practices changed over time. We chose to review volumes of PA published in 1989, 1996, and 2006 for multiple reasons. First, we sought to be as consistent as possible with previously published studies of reliability reporting practices. For example, Meier and Davis documented reliability reporting trends spanning three decades (i.e., from articles in volumes published in 1967, 1977, and 1987) in JCP. On the basis of their model, we intended to review volumes extending over three decades. Because when we began the study, 2006 was the last complete volume, we chose to review volumes of PA from 1996 and 2006. Second, the first year in which PA was published was 1989, which made it impossible for us to review articles from 1986. Of the 180 articles published in PA during the years 1989, 1996, and 2006, only the 170 (94%) empirical studies were considered for review. To sample across the three decades and make the review process more manageable, we used an online random number generator (http://www.random.org/integers) to randomly select 40% of the empirical articles from each year to review. This represents a total of 68 articles that were eligible for review. Given that reliability is a sample-dependent psychometric statistic, we examined internal consistency reliability reporting practices pertaining to the use of scales that measured psychological constructs. We included articles that used self-report scales designed to measure an underlying construct on which people potentially vary (e.g., identity, self-esteem, depression). A scale was operationally defined as a series of two or more items used in

summative fashion to assess a construct or domain. Henceforth, scale and measure are used interchangeably throughout this article. Some measures consisted of multiple subscales. Additionally, an article was included in the current study if (a) the researcher(s) used at least one scale measuring psychological constructs and (b) original data were collected and analyzed in their study. We assumed authors collected their own data unless they reported using archival data. Researchers using original data published details about administration of measures, allowing us to verify that only self-report measures were included in our sample. In many cases, researchers’ secondary analyses of existing data sets were vague about the administration procedures (e.g., participant self-assessment or rater or observer assessment). Additionally, aspects of the rater’s identity (i.e., gender socialization, racial identity) might have influenced participants’ responses to the assessment tool, thus potentially influencing the participants’ reports of content areas. Therefore, scales completed by raters (e.g., checklists, structured diagnostic interviews) were not included in our analyses. As shown in Table 1, 45 (66%) of the 68 articles met the inclusion criteria.

Table 1
Summary of Sampled Articles in Psychological Assessment in 1989, 1996, and 2006

| Sample                      | 1989 f | 1989 % | 1996 f | 1996 % | 2006 f | 2006 % |
|-----------------------------|--------|--------|--------|--------|--------|--------|
| Articles                    | 15     | 33.3   | 16     | 35.6   | 14     | 31.1   |
| Measures                    | 43     |        | 51     |        | 42     |        |
| Multiscale measures         | 12     |        | 20     |        | 21     |        |
| Scales                      | 111    |        | 175    |        | 144    |        |
| Previously developed scales | 78     | 70     | 163    | 93     | 137    | 95     |
| Researcher-modified scales  | 16     | 20.5   | 14     | 8.6    | 0      | 0      |
| Scale administrations       | 137    |        | 275    |        | 182    |        |

Procedure. We reviewed each article using a coding scheme devised for this study that combined Meier and Davis’s (1990) data collection methods and the three categories of better practice (i.e., internal consistency reporting, analysis, and interpretation) specified by Helms et al. (2006). We approached our sample with four levels of analysis in mind: (a) article, (b) measure, (c) subscale, and (d) scale administration. Multiple components of interest made up each level of analysis.

At the article level, we collected data on how researchers described their samples. We also assessed whether they engaged in better practices of reliability reporting, interpretation, and analysis such as (a) reporting reliability data for subgroups’ scores within their research samples and (b) comparing reliability data from their analyses with previously published internal consistency estimates via reliability generalization studies. At the measure level, we coded data according to whether measures consisted of subscales. We also collected data on how many scales were researcher developed versus previously developed and, for this latter group, how many were researcher modified. Additionally, given that the number of items in a scale or subscale can influence the calculation and the magnitude of reliability coefficients (Helms et al., 2006), we identified when researchers reported the number of items in the scales or subscales they used. Finally, at the level of scale administration, we coded data according to whether researchers reported their sample’s


internal consistency estimates with confidence intervals and related descriptive statistical information (i.e., means, standard deviations, and standard errors of measurement) each time the scale was administered in a study (e.g., Time 1, Time 2). These four levels of analysis and their related components of interest were the foundation of our coding scheme. In our pilot study, four reviewers independently coded 10 articles and then compared results to determine if there were any discrepancies. When discrepancies occurred between raters, coding criteria were further refined and clarified. For example, initially we only coded whether authors reported reliability either for their own data or for scores from previous studies that explicitly labeled reliability by type (e.g., CA, theta, Kuder–Richardson formulas). This approach was augmented after reviewing several articles that made a general reference to internal consistency or reliability but failed to explicitly mention the type of internal consistency reported in one or more places. We opted to code such instances as CA, as this is the most frequently used internal consistency method in current literature (Helms et al., 2006; Thompson, 2003) and other methods (e.g., split-half, Kuder– Richardson) tended to be labeled as such when reported in the articles. Moreover, Kuder–Richardson formulas are special cases of CA for analyzing dichotomous item responses (e.g., true–false), and CA may be interpreted as the average of all possible split-half reliability coefficients. Perhaps the common oversight in identifying the type of internal consistency statistic used can be attributed to authors’ unfamiliarity with the different types of reliability coefficients (e.g., split-half, CA, and Kuder–Richardson formulas) or lack of knowledge about how to choose the appropriate reliability method. (Thompson, 2003, provided a detailed description of the various reliability coefficients.) Also, it is the case that CA is the default option in some statistical packages, and, consequently, researchers may not realize that the appropriate use of CA depends on whether one’s data satisfy specific statistical assumptions. After finalizing the coding scheme, reviewers individually coded the remaining 35 of the 45 total articles. The reviewers met several times as a group to discuss individual reviews and reach consensus when points of discrepancy emerged. This consensus approach proved to be necessary given the disparate manner in which researchers described their sample(s) and reported, analyzed, and interpreted internal consistency reliability coefficients. Thus, the results pertain to the analysis of the 45 randomly selected articles from PA from the years 1989, 1996, and 2006 that met the inclusion criteria.
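As an aside on the coding decision above (this sketch is ours, not part of the reviewed studies), the relationship among the common internal consistency statistics is easy to see in code: coefficient alpha is computed from a respondents-by-items matrix, and when the items are dichotomous the identical calculation is KR-20. The simulated data and function name below are purely illustrative.

```python
# Illustrative only: coefficient alpha from a respondents-by-items matrix.
# With dichotomous (0/1) items, the same computation is KR-20, which is why
# unlabeled "internal consistency" values were coded as Cronbach's alpha.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # per-item response variances
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scale scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated dichotomous responses: alpha here equals KR-20.
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
dichotomous = (trait + rng.normal(size=(500, 8)) > 0).astype(int)
print(round(cronbach_alpha(dichotomous), 2))
```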

Results

The results of our analyses indicated that adoption of better practices of reporting reliability data as suggested by various psychometric guidelines (AERA et al., 1999; Joint Committee on Testing Practices, 2004; Wilkinson & the Task Force on Statistical Inference, 1999) and reliability primers (Cousin & Henson, 2000; Helms et al., 2006; Onwuegbuzie & Daniel, 2002; Thompson, 2003) has been inconsistent over time. As shown in Table 2, we examined the frequency with which researchers reported quantifiable demographic data. Means and standard deviations constituted quantifiable demographic reporting for age, whereas numbers or percentages sufficed for all other demographic variables.

Table 2
Demographic Reporting in Sampled Psychological Assessment Articles by Year

| Demographic variable  | 1989 f | 1989 % | 1996 f | 1996 % | 2006 f | 2006 % |
|-----------------------|--------|--------|--------|--------|--------|--------|
| Age                   | 12     | 80     | 14     | 88     | 6      | 43     |
| Race                  | 9      | 60     | 12     | 75     | 9      | 64     |
| Education             | 8      | 53     | 11     | 69     | 0      | 0      |
| Geographic location   | 3      | 20     | 1      | 6      | 1      | 7      |
| Gender                | 14     | 93     | 14     | 88     | 13     | 93     |
| Socioeconomic status  | 3      | 20     | 0      | 0      | 1      | 7      |
| Sexual orientation    | 0      | 0      | 0      | 0      | 0      | 0      |

Age,
race, and gender were the most frequently reported demographics. However, the reporting of age statistics decreased in 2006, with only 43% of the sampled articles reporting both means and standard deviations, whereas in 1989 and 1996, at least 80% of articles reported that information. Likewise, education level was reported for at least 50% of the articles in 1989 and 1996, but no authors reported education levels in the 2006 articles. Geographic location, socioeconomic status (SES), and sexual orientation were the least reported demographic variables. None of the articles reviewed reported demographic information related to the sexual orientation of samples. In Table 3, we summarized the degree to which investigators embraced an alpha-as-data perspective by reporting reliability coefficients calculated from the scores of their own samples or subsamples rather than inferring reliability from measures used in past research (i.e., referent reliability). The trend of researchers’ reporting reliability data for their own samples (e.g., subgroup reliability estimates, scale score means and standard deviations) remained consistently low across the three volumes reviewed. The fewest reports of researchers’ own samples’ reliability coefficients occurred in 1996 (10%). In 1989, 21% of investigators reported reliability estimates for their own samples’ responses, and in 2006, 28% included this information. Similarly, in 1989 and 1996, no authors (0%) provided reliability coefficients for subgroups’ responses, and in 2006, only 1 (7%) article supplied reliability data from subgroup scores. Equally important, there was little change in authors’ reporting of referent reliability coefficients. In contrast to Meier and Davis’s (1990) finding that the majority of JCP authors reported reliability estimates for previous samples’ responses, our review found that PA authors minimally reported these data. Over the three time points sampled, only 6%–11% of PA authors reported previously published reliability estimates. The provision of referent reliability would have indicated whether the authors had correctly concluded that their selected measures were appropriate for use with their sample(s). Contrary to Meier and Davis’s (1990) survey of JCP articles, our investigation indicated that PA authors used more previously developed scales in their research and provided references more frequently for those scales. As the use of previously developed scales increased, researchers would have more referent reliability statistics available by which to conduct reliability generalization studies (Vacha-Haase, 1998). Vacha-Haase (1998), the originator of the concept of reliability generalization studies, indicated that


Table 3
Frequency of Reliability Reporting, Analysis, and Interpretation Good Practices in Sampled Psychological Assessment Articles by Year

| Good practice                                    | 1989 f | 1989 % | 1996 f | 1996 % | 2006 f | 2006 % |
|--------------------------------------------------|--------|--------|--------|--------|--------|--------|
| Author reported                                  |        |        |        |        |        |        |
|   Number of scale items                          | 40     | 36     | 32     | 18     | 40     | 28     |
|   Cited reference for developed scale            | 64     | 82     | 112    | 69     | 96     | 70     |
|   Previous sample’s reliability coefficients     | 5      | 6.4    | 10     | 6.1    | 15     | 11     |
|   Own study reliability coefficients             | 29     | 21     | 28     | 10     | 51     | 28     |
|   Own sample subgroup reliability coefficients   | 0      | 0      | 0      | 0      | 1      | 7      |
|   M and SD of scale scores                       | 52     | 38     | 66     | 24     | 46     | 25     |
|   M and SD of subgroup scale scores              | 7      | 47     | 9      | 56     | 5      | 36     |
|   Composite reliability                          | 0      | 0      | 0      | 0      | 0      | 0      |
| Analysis                                         |        |        |        |        |        |        |
|   Reliability confidence intervals               | 0      | 0      | 0      | 0      | 0      | 0      |
| Interpretation                                   |        |        |        |        |        |        |
|   Reliability generalization study               | 0      | 0      | 0      | 0      | 0      | 0      |

the meta-analytic process would help researchers and practitioners understand “(a) the typical reliability of scores for a given test across studies, (b) the amount of variability in reliability coefficients for given measures, and (c) the sources of variability in reliability coefficients across studies” (p. 6). Cousin and Henson (2000) further explained that the reliability generalization method could potentially assist test users in identifying appropriate measures for use in practice and research while also specifying “sample characteristics and factors that influence and affect score data” (p. 3). Beyond the reliability reporting practices investigated in the Meier and Davis (1990) review, we also examined how many PA authors used good standards of analyzing and interpreting reliability data, such as calculating and reporting composite alpha and confidence intervals and conducting reliability generalization studies. Unfortunately, rather than reporting composite alpha (i.e., treating scales as items) for multiscale measures, authors reported alpha for all items across scales (i.e., total-scale alpha) without regard for the likelihood that a sample may respond differently to the unique dimensions being assessed by subscales. Instead, as recommended by Helms et al. (2006), authors should have reported composite alpha, which “provides a numerical estimate of the extent to which a shared theme runs through the subscales as opposed to the individual items” (p. 643). Thus, calculating composite alpha would have constituted a better practice. (See Cronbach, 1951, or Helms et al., 2006, for a more thorough explanation of composite alpha.) Moreover, no authors calculated and reported confidence intervals around their obtained sample reliability estimates or conducted reliability generalization studies. Either of these procedures would have enhanced confidence in the adequacy of the obtained reliability estimates. In sum, the frequency of reporting the most basic reliability information (e.g., CA coefficients) was minimal. Furthermore, reporting of basic sample demographic information (e.g., age, education) and scale attributes (e.g., number of items) was inconsistent and marginal over the years. The majority of authors reported the age, race, gender, and, to a lesser degree, educational demographics of their samples. However, we inferred that authors’ oversight in reporting sample demographic information pertaining

to geographic location, SES, and sexual orientation implied that the authors did not study the effects of these demographic factors on CA coefficients, thus limiting the generalizability of their study’s results. No authors included reliability confidence intervals, generalization studies, or standard errors of measurement in their reports; of those who reported full-scale alpha for multiscale measures, none included a related composite alpha estimate.
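The distinction between total-scale alpha and composite alpha may be easier to see with a small sketch. The example below is ours, not drawn from any reviewed article: the two subscale groupings and the simulated data are arbitrary, and the point is only that pooling all items into one alpha can obscure the multidimensionality that composite alpha (subscale totals treated as the "items") is meant to summarize.

```python
# Illustrative only: total-scale alpha (all items pooled) versus composite
# alpha (subscale totals treated as the "items"). Subscale groupings and data
# are simulated, not taken from any reviewed article.
import numpy as np

def cronbach_alpha(parts):
    parts = np.asarray(parts, dtype=float)
    k = parts.shape[1]
    return (k / (k - 1)) * (1 - parts.var(axis=0, ddof=1).sum()
                            / parts.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(1)
n = 400
dim_a = rng.normal(size=(n, 1))                      # dimension tapped by subscale A
dim_b = 0.4 * dim_a + rng.normal(size=(n, 1))        # related but distinct dimension B
subscale_a = dim_a + rng.normal(scale=0.8, size=(n, 5))
subscale_b = dim_b + rng.normal(scale=0.8, size=(n, 5))

total_scale_alpha = cronbach_alpha(np.hstack([subscale_a, subscale_b]))
composite_alpha = cronbach_alpha(
    np.column_stack([subscale_a.sum(axis=1), subscale_b.sum(axis=1)]))
print(round(total_scale_alpha, 2), round(composite_alpha, 2))
```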

How Reliability Coefficients Might Matter

One factor possibly contributing to researchers’ incorrect assumptions about reliability (e.g., reporting reliability as a property of a measure) could be the lack of a clear understanding as to how poor reliability practices might affect the people behind their data. When scales are used for screening or diagnostic purposes, clients or clinical populations may be over- or underassessed or misdiagnosed if reliability estimates are inaccurate; when scales are used for research purposes, insufficient reliability reporting practices influence the interpretation and application of research results and contribute to development and use of faulty measures. Specifically pertinent to psychological testing and research, reliability coefficients matter with respect to (a) test norms, (b) validity studies, (c) the assessment of individuals, and (d) the evaluation of sensitivity of measures. A 12-item version (CES–D-12) of Radloff’s (1977) Center for Epidemiologic Studies Depression Scale (CES–D) is used to elaborate on these issues.

Concerning test norms, Radloff’s (1977) CES–D was developed on entirely White samples of unknown genders, and the psychometric analyses conducted on these samples’ responses have been presumed to generalize to African, Latino, Asian, and Native American (ALANA) and immigrant subgroups or samples without confirming that such is the case. Yet if reliability coefficients differ among subgroups within a sample, then the interpretation of scale scores should also differ.

With respect to validity, studies to determine what construct is being measured (e.g., construct validity) are often conducted by correlating participants’ scores from two construct-related measures, as, for example, two measures of depression (e.g., CES–D and days of mental health treatment). By means of the correction


for attenuation, reliability coefficients may be used to “correct” validity coefficients (i.e., correlations) for imperfect reliability and, therefore, provide information about the likely magnitude of the correlation under similar measurement conditions if one or both measures had yielded scores that were perfectly reliable. Validity coefficients of subgroups with lower reliability coefficients would require more correction than analogous coefficients for subgroups with higher reliability coefficients. Sample-level scale statistics are often used to evaluate scores of individuals on measures. The standard error of measurement (SEM), which is derived in part from the sample-level reliability coefficient and standard deviation, is useful in this respect. The SEM reveals how variable a person’s scores are theoretically, but its use presumes that the assessed individual is similar in important aspects (e.g., age, race or ethnicity, gender) to the group from which the reliability statistics were derived. Its use assumes that a normal distribution of scores would have occurred if the person had responded to the CES–D-12 many times, say 5,000 times, and did not remember her or his previous responses. The SEM is the hypothetical standard deviation for the person’s many hypothetical test scores estimated from a sample’s reliability data. Thus, the normal curve, which is demarked by standard deviations, may be used to provide a theoretical estimate of the range wherein the person’s “true [depression] score” (Helms et al., 2006, p. 634) lies at some level of probability. The SEM is smaller when reliability is high and the standard deviation is small. Test sensitivity refers to whether scores obtained from a specific test yield accurate classifications with respect to some external criterion. For example, in clinical settings, cutoff scores (e.g., 10) on the CES–D might be used in contingency tables to predict desirable (e.g., improvement following treatment) or undesirable (e.g., suicide attempts) outcomes. Correct identification of the proportions of people who actually attempted suicide, for example, and whose scale scores were above the cutoff would indicate the sensitivity of the CES–D-12 scale scores. Thus, for different subgroups (e.g., racial or ethnic groups), one would expect different numbers of people to exceed the cutoff score if the reliability coefficients of the subgroups varied; therefore, sensitivity would be different. In sum, understanding reliability coefficients as data has implications for how researchers and practitioners use test scores, which, in turn, will have implications for the people or participants who are being assessed. Practically speaking, examination of reliability coefficients may assist test users with understanding whether or how well measures are actually assessing the psychological concepts they purport to measure.
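A worked numeric sketch of the two quantities just described may be useful; it is ours, not the authors’. The reliability (.83) and standard deviation (6.23) are the White women values reported later in Table 5 (Study 2), whereas the observed score of 12, the observed validity coefficient of .40, and the criterion reliability of .75 are hypothetical values chosen only for illustration.

```python
# Illustrative only. Reliability (.83) and SD (6.23) are the White women values
# reported in Table 5 (Study 2); the observed score (12), observed validity
# correlation (.40), and criterion reliability (.75) are hypothetical.
import math

alpha, sd = 0.83, 6.23

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - alpha)                      # about 2.57, as in Table 5
score = 12
band_95 = (score - 1.96 * sem, score + 1.96 * sem)   # theoretical band around one observed score

# Correction for attenuation: r_true = r_observed / sqrt(r_xx * r_yy)
r_observed, r_criterion = 0.40, 0.75
r_corrected = r_observed / math.sqrt(alpha * r_criterion)

print(round(sem, 2), tuple(round(b, 1) for b in band_95), round(r_corrected, 2))
```

A subgroup with a lower reliability coefficient would have a larger SEM (and hence a wider band around any individual’s score) and would require a larger attenuation correction, which is the practical sense in which the coefficients “matter” for the people being assessed.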

Study 2

In Study 2, we applied the alpha-as-data perspective to archival data from the National Survey of American Life (NSAL; Jackson, Neighbors, Nesse, Trierweiler, & Torres, 2004), a data set housed at the University of Michigan’s Inter-University Consortium for Political and Social Research (n.d.). We illustrate how subgroups’ responses affect reliability coefficients, as well as related analysis and interpretation practices. It was not our intent to question the integrity and usefulness of previously published studies that used NSAL data; instead, we intended to encourage use of better reliability practices more generally.


Many of the researchers who wrote the articles we reviewed did not seem to recognize that sociodemographic categories function as proxies for life experiences (Helms et al., 2006). To investigate how contemporary researchers have used the CES–D-12 with ALANA and immigrant samples, we explored the reliability practices of Lincoln, Chatters, Taylor, and Jackson (2007). These researchers used latent profile analysis to identify how variables (e.g., demographic, social support, and life stress) clustered, creating risk and protective profiles for depression among African American and Black Caribbean respondents. Lincoln et al. found that older age and male gender were related to lower levels of depression as measured by the CES–D-12; in contrast, female gender and low SES (a combination of educational level and income) increased risk for depression. Lincoln et al. (2007) stressed “the importance of examining heterogeneity [with respect to various combinations of demographic (e.g., age, gender, SES) and psychological (e.g., experiences of racial discrimination and social support) factors] within racial and ethnic groups” (p. 209). Yet Lincoln et al. seem not to have considered how the internal consistency coefficients (i.e., CA) of their samples’ responses to the CES–D-12 were affected by the demographic factors (e.g., gender, SES, and age) that were relevant to their results and conclusions. Here, we chose to focus our investigation on the scores of subgroups categorized by race or ethnicity and gender given Lincoln et al.’s conclusion that race or ethnicity and gender were related to different levels (e.g., low vs. high) of depression reported by their participants.

According to Helms et al. (2006), preliminary within-group examinations of the statistical properties of the data on which CA coefficients are based allow researchers and practitioners (a) to decide whether CA is an appropriate statistic for analyzing their data and (b) to better interpret their CA coefficients, if they decide that it is. Typically, researchers have not treated reliability coefficients as data, nor have they conducted the same sorts of preliminary analyses they would if they were conducting other types of statistical analyses (e.g., analyses of variance, t tests) subsumed under the family of general linear models (GLM). Perhaps, because psychometricians and statisticians often use different terminology or rationales to refer to essentially the same concept, the interrelationships among reliability and statistical analyses may not be readily apparent. With respect to CA reliability coefficients specifically, Thompson (2003), quoting Dawson (1999), noted that the GLM is evident in the calculation of CA, because CA is essentially the ratio of two variances (p. 12). Most standard texts on psychological assessment present the formula for the ratio of CA as some version of the following:

$$\left(\frac{K}{K-1}\right)\left(1 - \frac{\sum SD_k^2}{SD_{\text{total}}^2}\right),$$

where $K$ is the number of scale items, $\sum SD_k^2$ is the sum of each item’s response variance, and $SD_{\text{total}}^2$ is the variance for the total scale scores. Thompson (2003, pp. 14–19) pointed out that the algebraic equivalent of $SD_{\text{total}}^2$ in the foregoing formula is

$$SD_{\text{total}}^2 = \sum SD_k^2 + 2\sum COV_{ij},$$

where $SD_{\text{total}}^2$ and $SD_k^2$ are as previously described and $\sum COV_{ij}$ is defined as the sum of the products of the off-diagonal correlation coefficients and the corresponding pairs of standard deviations (e.g., $r_{12} \times SD_1 \times SD_2$ in a $K \times K$ square matrix).
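A small numeric check of the two identities above, using made-up item responses, is sketched below; it is not part of the NSAL analyses.

```python
# Illustrative only: verify, with made-up item responses, that alpha computed
# from the item and total-score variances matches the decomposition of the
# total-score variance into summed item variances plus twice the covariances.
import numpy as np

rng = np.random.default_rng(2)
common = rng.normal(size=(300, 1))
items = common + rng.normal(scale=1.2, size=(300, 4))   # four positively correlated items
k = items.shape[1]

item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

cov = np.cov(items, rowvar=False)                       # K x K variance-covariance matrix
sum_cov_ij = (cov.sum() - np.trace(cov)) / 2            # sum of COV_ij over item pairs
rebuilt_total = item_variances.sum() + 2 * sum_cov_ij   # Thompson's identity

print(round(alpha, 3), bool(np.isclose(total_variance, rebuilt_total)))
```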


Decomposition of CA into its components makes it easier to show that, as is the case for other GLM statistics, response styles of individuals within samples ought to affect the magnitude of standard deviations of item responses as well as their variances ($SD^2$). Response-style effects occur because item–response mean values are used to calculate the standard deviation and their variances and, therefore, are hidden foundations of the variance–covariance matrices used to compute $SD_{\text{total}}^2$ as previously discussed. Oddly, it appears that researchers typically do not conduct preliminary analyses to determine whether their data meet the assumptions of GLM statistics or the tenets of classical test theory with respect to CA specifically. Moreover, preliminary analyses of the characteristics of one’s data would enable the researcher to infer how or whether CA is deleteriously influenced by such characteristics and, in turn, whether CA is an appropriate reliability analysis for their data (Helms et al., 2006). Thus, one purpose of Study 2 was to show why such preliminary analyses are important. The second purpose was to illustrate better reliability reporting and interpretive practices. The final purpose was to demonstrate how sociodemographic characteristics (i.e., proxies for different life experiences) might affect the reliability of subgroups’ responses.
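The kind of preliminary screening being argued for can be sketched generically as follows; the function, the skewness z cutoff, and the simulated data are ours and only illustrative, not the procedure used in Study 2.

```python
# Illustrative only: generic preliminary checks on a respondents-by-items array.
# The skewness z cutoff (3.29) and the simulated data are arbitrary choices.
import numpy as np
from scipy import stats

def preliminary_checks(items):
    items = np.asarray(items, dtype=float)
    n, k = items.shape
    skew = stats.skew(items, axis=0, bias=False)
    se_skew = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    skew_z = skew / se_skew                              # z test for each item's skewness
    corr = np.corrcoef(items, rowvar=False)
    negative_pairs = int((corr[np.triu_indices(k, 1)] < 0).sum())
    return {
        "items_significantly_skewed": int((np.abs(skew_z) > 3.29).sum()),
        "negative_interitem_pairs": negative_pairs,
    }

rng = np.random.default_rng(3)
simulated = rng.poisson(lam=0.6, size=(500, 12)).clip(0, 3)  # CES-D-12-like counts
print(preliminary_checks(simulated))
```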

Method

Sample. The NSAL was a household probability sample of African Americans (n = 3,570), Black Caribbeans (n = 1,438), and White Americans (n = 891), all 18 years of age or older (Jackson et al., 2004). Our sample (N = 5,323) consisted of African American (n = 3,383), Black Caribbean (n = 1,375), and White (n = 565) men and women who completed the CES–D-12. Our sample is somewhat smaller than the sample described by Lincoln et al. (2007) given that we only included respondents who had no missing responses to the CES–D-12. Participants’ mean age was 43.30 years (SD = 16.34). The majority of participants identified as women (62.6%) with men making up 37.4% of the sample. The majority of participants were employed (78%) with unemployment affecting 9.1% of the sample. A segment of the sample (22.9%) was retired or otherwise not in the labor force. Married or cohabiting participants constituted 38.4% of the sample, with the remaining participants being either divorced (30.4%) or never married (31.2%). The sample’s educational background spanned from no high school diploma (22.7%) to some college experience (23.9%) to college graduate and beyond (18.1%), with high school graduates (35.2%) constituting the largest percentage of the sample.

Measure. The CES–D-12 was developed to assess depressive symptoms experienced within the past 7 days (Radloff, 1977). Examples of items include “I felt that I was just as good as other people” and “I felt that everything I did was an effort.” Participants responded to each item using a frequency scale with responses ranging from 0 = rarely or none of the time (less than 1 day) to 3 = most or all of the time (5–7 days). Positively worded items were reverse scored. Summing the responses to all 12 items yielded a continuous measure of depression; higher scores indicated a greater number of depressive symptoms. This commonly used measure has been used to assess depression with diverse samples. Seaton, Caldwell, Sellers, and Jackson

(2008) reported a CA coefficient for their total sample’s responses (α = .68) to the CES–D-12. Their sample included African American (n = 810; M CES–D-12 score = 9.08) and Black Caribbean (n = 360; M CES–D-12 score = 8.29) adolescents. (Standard deviations for each group’s CES–D-12 scores were not included in the article.) For an adult sample’s responses to the CES–D-12, Lincoln et al. (2007) reported data for their African American (n = 3,361; M = 6.78; α = .76) and Black Caribbean (n = 1,554; M = 6.07; α = .74) subgroups. Elgar, Mills, McGrath, Waschbusch, and Brownridge (2007) reported a range of CA coefficients (α = .80–.87) for a predominantly White sample of Canadian men’s (n = 330; M = 16.43, SD = 5.95) and women’s (n = 3,854; M = 16.62, SD = 5.81) CES–D-12 scores.

Procedure. Using SPSS 16.0 Frequency and Reliability analyses and syntax we wrote, we conducted preliminary and reliability analyses of the responses from the entire selected sample, as well as for each of the racial or ethnic and gender subgroups. Readers may note the limited demographic data that we used, which seems incongruent with better practices. We chose to report demographics related to gender, race or ethnicity, age, and income level (i.e., SES) because Lincoln et al. (2007) concluded that these factors either promoted or protected against depression. In other words, gender, race or ethnicity, age, and income level were the most salient demographic factors in their study. Additionally, we used income level as a proxy for SES because we could not determine how Lincoln et al. operationally defined the variable. Here, we provide an example of how subgroup characteristics might influence their responses to a measure and therefore reliability. Thus, we opted to conduct the preliminary analyses and reliability analyses for racial or ethnic and gender subgroups’ responses but recognize that examination of subgroups’ scores defined by the other tabled demographic characteristics might have yielded different results.
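The authors ran these analyses with SPSS 16.0; a rough Python parallel of the subgroup workflow is sketched below. The column names and the simulated data frame are hypothetical stand-ins, not the restricted NSAL file.

```python
# Illustrative only: a rough Python parallel of the SPSS subgroup workflow.
# Column names and the simulated data frame are hypothetical stand-ins for the
# NSAL file; respondents with any missing item are dropped listwise.
import numpy as np
import pandas as pd

ITEMS = [f"cesd{i}" for i in range(1, 13)]

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def subgroup_summary(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for (race, gender), grp in df.groupby(["race_ethnicity", "gender"]):
        complete = grp[ITEMS].dropna()
        totals = complete.sum(axis=1)
        rows.append({"race_ethnicity": race, "gender": gender, "n": len(complete),
                     "alpha": round(cronbach_alpha(complete), 2),
                     "M": round(totals.mean(), 2), "SD": round(totals.std(ddof=1), 2)})
    return pd.DataFrame(rows)

rng = np.random.default_rng(4)
demo = pd.DataFrame(rng.poisson(0.7, size=(300, 12)).clip(0, 3), columns=ITEMS)
demo["race_ethnicity"] = rng.choice(["African American", "Black Caribbean", "White"], 300)
demo["gender"] = rng.choice(["Men", "Women"], 300)
print(subgroup_summary(demo))
```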

Results

Preliminary analyses. To determine whether the data for each group fit the GLM and classical test theory assumptions underlying use and interpretation of CA, we conducted preliminary analyses. We tested the assumptions of (a) normal distributions of item responses, (b) homogeneity of variances, and (c) positive correlations among item responses. Pearson correlations (e.g., average interitem correlations) among item scores were examined to determine the extent to which the item responses were positively linearly interrelated. See Table 4 for a summary of the results of these preliminary analyses.

Effects of skewed or nonnormal response distributions. Concerns about the effects of skewed item responses on the magnitude of CA have been expressed, but only two studies (Enders & Bandalos, 1999; Norris & Aroian, 2004) have investigated effects on CA specifically rather than on correlation coefficients or standardized alpha, which is derived from correlation coefficients (Greer, Dunlap, Hunter, & Berman, 2006). In statistical theory, it is well-known that skewed distributions affect the size of the mean. Glass and Hopkins (1996) stated that the mean of a skewed distribution “is pulled toward the tail [or skew]” (p. 59); consequently, the mean is overestimated when the skew is positive (left bunched) and underestimated when the skew is negative (right bunched). None of the studies that we located examined the effects

Table 4
Summary of Preliminary Analyses of Skewness, Homogeneity of Variance, and Item-Response Correlations for the NSAL CES–D-12

Total sample

| Variable        | AA men | AA women | BC men | BC women | White men | White women | Total  |
|-----------------|--------|----------|--------|----------|-----------|-------------|--------|
| Sample size     | 1,200  | 2,183    | 542    | 833      | 211       | 354         | 5,323  |
| Item skewness   |        |          |        |          |           |             |        |
|   Slight        | 1      | 1        | 1      | 1        | 7         | 6           | 1      |
|   Moderate      | 6      | 9        | 7      | 8        | 4         | 6           | 8      |
|   Strong        | 4      | 2        | 3      | 3        | 0         | 0           | 3      |
|   Very strong   | 1      | 0        | 1      | 0        | 1         | 0           | 0      |
| No. skews sig.  | 12     | 12       | 12     | 12       | 11        | 11          | 12     |
| Z test          | 15.49* | 19.63*   | 13.30* | 13.09*   | 4.04      | 5.45*       | 31.21* |
| Variance ratio  | 2.45   | 1.68     | 2.40   | 1.76     | 1.92      | 1.38        | 1.69   |
| rs              | −0.99  | −1.00    | −0.98  | −0.99    | −0.95     | −0.97       | −0.99  |
| Interitem r̄     | .20    | .25      | .21    | .22      | .27       | .29         | .24    |
| Skewness alpha  | .73    | .79      | .75    | .77      | .75       | .82         | .78    |

Sample with at least one symptom

| Variable        | AA men | AA women | BC men | BC women | White men | White women | Total  |
|-----------------|--------|----------|--------|----------|-----------|-------------|--------|
| Sample size     | 1,046  | 1,899    | 472    | 711      | 203       | 338         | 4,669  |
| Item skewness   |        |          |        |          |           |             |        |
|   Slight        | 2      | 4        | 1      | 1        | 7         | 7           | 3      |
|   Moderate      | 7      | 8        | 7      | 9        | 4         | 5           | 7      |
|   Strong        | 2      | 0        | 3      | 2        | 0         | 0           | 2      |
|   Very strong   | 1      | 0        | 1      | 0        | 1         | 0           | 0      |
| No. skews sig.  | 12     | 12       | 12     | 12       | 11        | 10          | 12     |
| Z test          | 15.46* | 19.07*   | 13.69* | 12.48*   | 4.12      | 5.71*       | 31.00* |
| Variance ratio  | 2.30   | 1.59     | 2.29   | 1.69     | 1.92      | 1.34        | 1.61   |
| rs              | −0.98  | −0.94    | −0.99  | −0.99    | −0.95     | −0.97       | −1.00  |
| Interitem r̄     | .16    | .20      | .17    | .17      | .25       | .27         | .20    |
| Skewness alpha  | .68    | .74      | .71    | .72      | .74       | .80         | .74    |

Note. NSAL = National Survey of American Life; CES–D-12 = Center for Epidemiologic Studies Depression Scale (12 items); AA = African American; BC = Black Caribbean; Item skewness = degree of skewness; Slight = 0–1; Moderate = 1–2; Strong = 2–3; Very strong = >3; No. skews sig. = number of significantly positively skewed item responses; Z test = skewness statistic/standard error of skew for CES–D-12 total score; Variance ratio = the ratio of the largest item standard deviation to the smallest item standard deviation; rs = Spearman rank order correlation between item mean scores and item levels of skewness; Interitem r̄ = average interitem correlations; Skewness alpha = internal consistency calculated with skewness statistic.
* p < .0001 level if Z test ≥ 4.27.


of skewed response distributions on item means. Yet the mean is implicated in virtually all of the components of the CA formula previously discussed, including the variance and standard deviation (e.g., $SD^2 = \sum[X - M]^2/N$); correlation coefficient ($r_{xy} = COV_{xy}/SD_xSD_y$); and, therefore, the variance–covariance matrix. Thus, in our preliminary analyses, we focused on determining whether the CES–D-12 item responses were skewed for the various groups and, if so, how the means and CA were affected.

Z tests for skewness (i.e., the value of skewness divided by the standard error of skewness) and examination of histograms were used to evaluate the normality of distributions of the individual items of the CES–D-12, as well as the total scores. Spearman rank order correlations (rs) were calculated to explore the relationships of mean item responses to levels of item skewness.

The assumption of normal distribution of item responses (i.e., lack of skewness) was violated for the full sample and each of the subgroups. We could locate no objective criteria for classifying skewness as problematic; for our purposes, we arbitrarily used the labels slight (0 to 1), moderate (1 to 2), strong (2 to 3), and very strong (3 to 4) to classify the degree of skewness. For the full sample and each of the subgroups, the distributions of each CES–D-12 item and the total score were significantly positively skewed (i.e., bunched toward the lower end of the scale) with the exceptions of one item each for White women and White men. Most of the total sample’s 12 item response distributions (67%) were moderately skewed and the z test for the total score disregarding race, ethnicity, and gender was significant, as shown in Table 4. The degree of skewness for the subgroups’ item response distributions tended to vary. For African American and Black Caribbean women, all except one of their 12 item response distributions were moderately or strongly skewed. For the African American and Black Caribbean men, 10 (83%) of their item response distributions were at the moderate to strongly skewed levels, whereas most (92% or 11) of White men’s responses were in the slight to moderate ranges of skewness. For each subgroup, the Spearman rank order correlations between the 12 item mean scores and item levels of skewness were almost perfectly inversely related (e.g., for Black Caribbean women, rs = −.99; for Black Caribbean women endorsing depressive symptoms, rs = −.99), indicating that higher levels of skewness were related to lower mean item responses. Even though level of skewness is also implicitly derived from item means (i.e., $\sum z^3/N$), note that if the item response distributions had not been skewed or had been skewed in different or multiple directions, the correlations would not have been so high.

Given the strong levels of positive skewness across subgroups, we attempted to reduce the skewness of item response distributions. Transforming item responses was not practicable, because results of analyses would not be generalizable to other studies in which the CES–D was used. Adapting an analytic procedure from item response theory (Henard, 2002), we eliminated the consistent responders (i.e., the people who gave the same response to all items). The only consistent responders in our sample were those who answered 0 to all items on the CES–D-12 (i.e., their total score was 0). Thus, we removed these consistent responders; henceforth, we refer to them as zero responders.
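The two steps just described — dropping zero responders and correlating item means with item skewness — can be sketched as follows with simulated CES–D-12-like responses; this is our illustration, not the authors’ SPSS syntax.

```python
# Illustrative only: drop "zero responders" (total score of 0) and correlate
# item means with item skewness via Spearman's rho, using simulated responses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam = np.linspace(0.2, 1.0, 12)                       # items differ in endorsement rates
items = rng.poisson(lam=lam, size=(1000, 12)).clip(0, 3)

totals = items.sum(axis=1)
share_zero = (totals == 0).mean()                     # proportion of zero responders
symptomatic = items[totals > 0]                       # endorsed at least one symptom

item_means = symptomatic.mean(axis=0)
item_skews = stats.skew(symptomatic, axis=0, bias=False)
rho, p_value = stats.spearmanr(item_means, item_skews)

print(round(share_zero, 3), round(rho, 2))            # rho is expected to be strongly negative
```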
The proportions of zero responders for each subgroup in descending order were 14.65% of Black Caribbean women, 13.01% of African American women, 12.92% of Black Caribbean men, 12.83% of African American men,


4.52% of White women, and 3.79% of White men. We replicated the previous analyses with the remaining responders (i.e., those who endorsed at least one symptom). Table 4 also summarizes the results of this second set of preliminary analyses. The results were essentially the same without zero responders, but the levels of positive skewness were not as strong as when zero responders were included. Changes were that 10 rather than 12 of White women’s item responses were significantly positively skewed and 11 item responses were significantly positively skewed for White men. Thus, the assumption of normality of item responses was not supported for either the full sample and subgroups or their parallels without zero responders. Spearman rank order correlations between the 12 item response means and levels of skewness again revealed almost perfect inverse relationships for all groups in each condition, indicating that higher skewness was associated with lower means. Moreover, almost perfect redundancy between the two variables (i.e., item means and levels of item response skewness) suggests that level of skewness could be substituted for mean values in the computation of the variance–covariance matrices used to calculate CA coefficients without changing the value of CA coefficients by much, if at all. We performed these analyses and discuss the results in the Reliability Analyses section.

Homogeneity of item variances. In measurement theory, whether item response variances are equal (i.e., homogeneous) determines in part whether CA is an appropriate statistic for estimating reliability (Feldt & Charter, 2003). Item variance ratios (i.e., the largest item standard deviation divided by smallest item standard deviation) were used for determining homogeneity of variance among the item responses. To reduce the number of analyses, we compared only the largest and smallest variances because if these were not equivalent, then the assumption of equality of variances was violated. Analyses of the item variance ratios shown in Table 4 revealed that almost all of the larger item variances were significantly different from the smaller item variances in the total samples and the subgroups well beyond the .001 level of significance. This was true even after adjusting for correlations between variances with the exceptions of the total sample without zero responders, r(4667) = .13, p < .01; the total sample of African American men, r(1198) = .10, p < .01; and White women with symptoms, r(336) = .29, p < .01 (Snedecor & Cochran, 1967, p. 187). Thus, the assumption that item response variances were equal was not confirmed for most of the groups. Therefore, CA coefficients might have been attenuated in most of the subgroups and the total sample.
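A minimal sketch of the variance-ratio screen described above, again with simulated responses; the correlated-variances adjustment the authors take from Snedecor and Cochran (1967) is not reproduced here.

```python
# Illustrative only: the variance-ratio screen (largest over smallest item SD),
# with simulated responses; the Snedecor and Cochran correlated-variances
# adjustment used in the article is not reproduced here.
import numpy as np

rng = np.random.default_rng(6)
items = rng.poisson(lam=np.linspace(0.2, 1.2, 12), size=(800, 12)).clip(0, 3)

item_sds = items.std(axis=0, ddof=1)
variance_ratio = item_sds.max() / item_sds.min()
print(round(float(variance_ratio), 2))
```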

ces revealed varying numbers of negative correlations among items for the different groups. In both the total and zero-responder reduced total samples, all item responses were positively correlated except one pair and two pairs, respectively. The average interitem correlation was positive for each of the three subgroups of women in each sample; but one pair of interitem responses was negatively correlated for the White women in the total sample, whereas one, four, and seven negative correlations occurred for the White women, African American women, and Black Caribbean women, respectively, in the sample with at least one symptom. All total-sample interitem responses, with the exception of two, were positively correlated for the Black Caribbean men; the scores of the African American and White men’s subgroups each resulted in only one pair of negatively correlated interitem responses. The numbers of negatively correlated interitem responses for symptomatic men increased to seven for Black Caribbeans and six for African Americans but remained at one for Whites. Thus, regardless of the subgroups’ characteristics (e.g., gender, race or ethnicity, and presence of depressive symptoms), the average interitem correlations for the sample and the subsamples were positive but low. Additionally, there were varying numbers of negative correlations among the within-subgroup item responses, which might have signified the presence of multidimensional or multifactorial item responses. In sum, the assumption of normal distributions of item responses was not met due to our findings of significant positive skewness across all of the groups and the significant, almost perfect inverse relations between levels of skewness and item response means. The assumption of equality of variance was not met for the item responses within subgroups or the total samples and one or more interitem correlations were negative for each subgroup. Thus, none of the assumptions for the use of CA were met, and violation of any of them potentially would lead to attenuation of internal consistency reliability coefficients and/or misinterpretation of CA findings. Our preliminary analyses suggested that CA and internal consistency reliability statistics generally might not have been appropriate for analyzing the NSAL sample’s responses to the CES–D-12, but we proceeded so that we could buttress our points about the importance of good reliability practices. Reliability analyses. Ignoring the results of the preliminary analyses that suggested that CA might not be the best statistic for evaluating the internal consistency of subgroups’ responses to the CES–D-12 because of extreme item response skewness, heterogeneity of variances, and some negative interitem response correlations for some of the subgroups, we calculated the statistic on the basis of all respondents in each of the subgroups (see Table 5) and subgroup respondents with all zero responders removed (see Table 6) for illustrative purposes. In the tables, we used better practices for summarizing reliability data (e.g., 95% confidence intervals, means, standard deviations, and CA by subgroup) for the responses of the total sample of NSAL participants (N ⫽ 5,323) who completed the CES–D-12. Sample sizes for the total sample and the separate subgroups under the two conditions are shown in Tables 5 and 6. The 95% confidence intervals for the CA coefficients indicate the probability that the unknown true value of CA is contained within the obtained range. 

Table 5
National Survey of American Life CES–D-12 Reliability Data by Subgroup

Subgroup | N | Age M | Age SD | Income M | Income SD | CES–D-12 M | CES–D-12 SD | CES–D-12 95% CI | CA | SEM | CA 95% CI
Black Caribbean men | 542 | 40.62 | 15.48 | 47,067 | 36,579 | 5.83 | 5.04 | [5.40, 6.26] | .72 | 2.67 | [.68, .75]
Black Caribbean women | 833 | 41.07 | 15.46 | 38,754 | 31,806 | 6.06 | 5.40 | [5.70, 6.43] | .76 | 2.65 | [.73, .78]
African American men | 1,200 | 43.26 | 16.17 | 37,930 | 31,831 | 6.23 | 5.28 | [5.93, 6.53] | .72 | 2.79 | [.70, .75]
African American women | 2,183 | 42.44 | 16.11 | 28,139 | 25,943 | 7.15 | 6.09 | [6.90, 7.41] | .78 | 2.86 | [.77, .80]
White men | 211 | 48.25 | 15.80 | 49,950 | 36,323 | 8.55 | 5.94 | [7.74, 9.35] | .81 | 2.59 | [.77, .85]
White women | 354 | 49.07 | 17.76 | 40,285 | 34,248 | 9.08 | 6.23 | [8.43, 9.73] | .83 | 2.57 | [.80, .85]
Total sample | 5,323 | 42.90 | 16.21 | 35,607 | 31,255 | 6.82 | 5.77 | [6.67, 6.98] | .77 | 2.77 | [.76, .78]

Note. CES–D-12 = Center for Epidemiologic Studies Depression Scale (12 items); CA = Cronbach's alpha; SEM = standard error of measurement; CI = confidence interval.

In our entire sample divided by race or ethnicity and gender subgroups, both groups of men of Color responded to the CES–D-12 in a less internally consistent manner than did women across racial and ethnic groups and White men in the samples including zero responders, but both men of Color and Black Caribbean women's responses were less internally consistent relative to the other groups when zero responders were excluded. For the full samples, all of the midpoint coefficients and all but one (Black Caribbean men) of the CI lower limits exceeded the arbitrary standard of .70, which was the minimum value used by most of the Study 1 researchers to judge the adequacy of their scales when they reported reliability coefficients. CA coefficients without zero responders were reduced by about 6%–8% for the groups of Color but only about 1%–2% for the White groups.

In the absence of the preliminary analyses and subsequent reliability analyses summarized in Tables 5 and 6, the results in Table 5 might have been interpreted as indicating moderate to high reliability of each individual subgroup's responses, and mean scores would have been interpreted as meaningful. However, Table 6 shows that when the levels of item response skewness were reduced but not eliminated, CA reliability coefficients decreased relative to their Table 5 values while mean CES–D-12 scores increased for all of the subgroups, although these effects were much smaller for the White men and White women than for the other groups. These findings with respect to item response variability were contrary to Greer et al.'s (2006) Monte Carlo simulation studies of the effects of skewness on another type of internal consistency reliability estimate (i.e., standardized alpha), where they found that alpha was decreased most when skewness was largest. In other words, the reliability coefficients in Table 5 (which included zero responders) should have been smaller than the reliability coefficients in Table 6 because the item skewness levels in Table 5 were larger. That is, our findings were the reverse of what might have been expected.

The explanation for the obtained skewness anomaly is that when item responses of zero (i.e., the symptom is not endorsed as self-descriptive) are used disproportionately by respondents relative to other response options, the zero values have indirect effects on the item means but strong direct effects on the item variance–covariance matrix used to calculate CA. Consider, for example, the item "I had crying spells," the most severely positively skewed item for African American men after the consistent zero responders had been removed. The proportions of respondents endorsing each response option were as follows: 0 (rarely or never), .8872; 1 (some of the time), .0669; 2 (occasionally), .031; and 3 (most or all of the time), .0153. Thus, the sample mean for this item is the sum of the products of each nonzero response option and its corresponding proportion: [(.0669 × 1) + (.031 × 2) + (.0153 × 3)] = .175. Each item mean would be calculated in like manner, and the total subgroup sample means, reported in Tables 5 and 6, are the sums of the item means so calculated.
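As a minimal illustration of this arithmetic (not the authors' computation), using the response-option proportions reported above:

```python
# Response-option proportions reported in the text for "I had crying spells"
# among African American men with consistent zero responders removed.
proportions = {0: 0.8872, 1: 0.0669, 2: 0.0310, 3: 0.0153}

# The item mean is the proportion-weighted sum of the response options;
# only the nonzero options contribute to the sum.
item_mean = sum(option * p for option, p in proportions.items())
print(round(item_mean, 3))  # 0.175, as reported above
```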

Table 6
National Survey of American Life CES–D-12 Reliability Data by Subgroup for Participants Who Reported Depressive Symptoms

Subgroup | N | Age M | Age SD | Income M | Income SD | CES–D-12 M | CES–D-12 SD | CES–D-12 95% CI | CA | SEM | CA 95% CI
Black Caribbean men | 472 | 39.80 | 15.41 | 45,886 | 35,724 | 6.69 | 4.84 | [6.26, 7.13] | .66 | 2.82 | [.61, .70]
Black Caribbean women | 711 | 40.75 | 15.59 | 36,891 | 30,004 | 7.10 | 5.18 | [6.72, 7.49] | .69 | 2.88 | [.66, .73]
African American men | 1,046 | 42.54 | 15.94 | 37,291 | 32,065 | 7.14 | 5.04 | [6.84, 7.45] | .66 | 2.94 | [.63, .69]
African American women | 1,899 | 41.58 | 16.02 | 26,836 | 25,413 | 8.22 | 5.81 | [7.96, 8.48] | .73 | 3.02 | [.71, .75]
White men | 203 | 47.80 | 15.71 | 50,142 | 36,843 | 8.88 | 5.80 | [8.08, 9.68] | .80 | 2.59 | [.75, .84]
White women | 338 | 48.40 | 17.55 | 40,278 | 34,603 | 9.51 | 6.04 | [8.86, 10.16] | .81 | 2.63 | [.78, .84]
Total sample | 4,669 | 42.25 | 16.13 | 34,622 | 30,924 | 7.78 | 5.30 | [7.62, 7.94] | .72 | 2.80 | [.71, .73]

Note. CES–D-12 = Center for Epidemiologic Studies Depression Scale (12 items); CA = Cronbach's alpha; SEM = standard error of measurement; CI = confidence interval.


Proportions of nonzero responses increased as the sample sizes decreased because, of course, proportions are based on total sample size; but levels of respondents' endorsement of the items did not actually change; they were just given more weight as the total numbers of (zero) respondents decreased. Thus, the zero responses exerted an indirect or nonobvious influence on the item means and CES–D-12 total mean scores at the sample level.

CA coefficients are calculated from variance–covariance matrices, where item variances are defined as item standard deviations squared and item response covariances are the cross products of item standard deviations and corresponding correlation coefficients. Mean item responses, calculated as previously described, also have an insidious effect on estimates of variance and covariance. Recall that the first step in calculating an item standard deviation is to subtract the item mean from each obtained score for each person. There can be only four obtained item responses for the CES–D-12 (i.e., 0, 1, 2, or 3). Deviations of these responses from the item means are squared, multiplied by the respective proportions of respondents endorsing each response option, and then summed, which yields the item variances that are the diagonal values of the variance–covariance matrix used to compute CA. The square roots of the item variances are the item standard deviations. Thus, zero values always contribute the squared item mean, weighted by the proportion of zero responses, to the item variance (and hence the standard deviation) because zero minus the item mean is the negative mean (e.g., 0 − .175 = −.175), which is squared in computing the item response variance (e.g., .175² ≈ .0306). Thus, in general, greater numbers of zero responses per item yielded higher levels of item variability and, therefore, higher CA coefficients, because the squared item mean entered the item variances a greater number of times when item responses were positively skewed. In other words, depending on their proportions in the subgroup(s), the responses of people reporting an absence of symptoms (i.e., zero values) were given as much or more weight in the internal consistency reliability analyses as the responses of people reporting some symptoms.

To explore the effects of positive skewness (i.e., zero bunching), we made use of our preliminary finding that item means and levels of item skewness were almost perfectly inversely correlated for most of the subgroups by substituting levels of item skewness for item mean values in calculating the variance–covariance matrices for the zero-responder-reduced subgroups. The results of these analyses were that skewness alpha coefficients were essentially equivalent to CA within rounding error for the total sample, the total sample with at least one symptom, and the race or ethnicity and gender subsamples, with the exception of the four White and Black Caribbean men subsamples (see Table 4). These results suggest that, for most subgroups, CA was likely a response-style (i.e., zero-endorsing) coefficient rather than a content (i.e., depression) coefficient.

In sum, if we had not conducted preliminary analyses to confirm the assumptions pertaining to interpretation of CA for each of the subgroups, we might have erroneously concluded that the CES–D-12 measured the same construct (depression) across groups. Moreover, if we had followed the reliability reporting procedures of the studies reviewed in our Study 1, we would have used the magnitude of the CA for the total sample as support for our interpretation and subsequent main analyses. Our findings suggest that further analyses are necessary to determine what construct(s) the CES–D-12 measures in the NSAL data set and for whom.
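Extending the earlier illustrative sketch, the following lines reproduce the variance computation just described and isolate the contribution of the zero responses (again, an illustration rather than the authors' code):

```python
# Same response-option proportions as before for "I had crying spells."
proportions = {0: 0.8872, 1: 0.0669, 2: 0.0310, 3: 0.0153}
item_mean = sum(option * p for option, p in proportions.items())  # ~ .175

# Item variance: squared deviations from the item mean, weighted by the
# proportion endorsing each response option, then summed.
item_variance = sum(p * (option - item_mean) ** 2
                    for option, p in proportions.items())

# Contribution of the zero responses alone: .8872 * (0 - .175)^2 ~= .8872 * .0306
zero_contribution = proportions[0] * (0 - item_mean) ** 2
print(round(item_variance, 3), round(zero_contribution, 3))
```

The heavier the zero responding, the more often the squared item mean enters the weighted sum, which is why zero-dominated items can inflate the variance terms that feed the CA computation.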

General Discussion

The purpose of Study 1 was to examine the trends of reporting CA reliability information in PA, starting with the inception of PA in 1989. Overall, our stratified random review indicated that researchers continued to misconstrue reliability as a property of scales as opposed to a property of the samples' responses from which their data were derived. Prior alpha-as-data scholars have provided persuasive, well-researched treatises on the origins of this measurement fallacy (e.g., Thompson & Vacha-Haase, 2000), as well as clear frameworks for circumventing it (e.g., Helms et al., 2006; Onwuegbuzie & Daniel, 2002).

In Study 2, we highlighted that the lack of explicit attention to reliability coefficients and related statistics as data may contribute to the development and use of scales that do not have the same meaning across subgroups and, perhaps, across the populations that they are intended to assess. More specifically, psychological measures (e.g., the CES–D-12) that are frequently endorsed as reliable might not be appropriate for assessing groups (e.g., ALANAs) who were not included in the normative sample or whose presence was too small to influence the group-level statistic. If we had not conducted item-level preliminary analyses and race or ethnicity and gender subgroup analyses as recommended by Helms et al. (2006), we might have incorrectly inferred from the CA coefficients that the CES–D-12 scores should have been interpreted similarly across groups.

Also, previous researchers seem not to have recognized that the psychometric assumption underlying the use of CA is that all items on a scale are supposed to assess a single theoretical construct (i.e., first-factor concentration), in this case, symptoms of depression. Moreover, if the scale is to be used with multiple populations, the items should assess the same construct across groups. Yet several authors have indicated that depression may manifest differently for people of Color relative to their White counterparts, including populations of African descent (Ayalon & Young, 2003; Das, Olfson, McCurtis, & Weissman, 2006). For example, Brown, Schulberg, and Madonia (1996) reported disparate types and levels of somatic symptoms, stress, and physical functioning in depressed African Americans and Whites. Also, when discussing depression, White participants were more likely to mention changes in their affective mood, whereas African American participants tended to address their physical symptoms.

In our Study 2 analyses, the higher percentages of zero responders and zero item responses among the men and women of Color, as compared with White women and White men, seem consistent with the observation that different theories of depression, and therefore different scale items, might be necessary to assess depression adequately across race or ethnicity and gender groups, despite their scores meeting the level of CA arbitrarily defined as acceptable (e.g., .70; Helms et al., 2006). We did not find many examples of studies in which researchers investigated the within-group psychometric characteristics of single race or ethnicity and gender subgroups' responses to the items from which their CA coefficients were derived, even when the coefficients were obtained as supplements to item-level factor analyses or structural equation modeling.
In one exception, Love and Love (2006) found that zero responders to the CES–D-20 accounted for 72% (CA = .91) of their community sample and 61% (CA = .87) of their clinic sample of elderly Black urban men of unspecified ethnicities.


Thus, the practice of allowing asymptomatic responders to define the psychometric structure of the CES–D, and perhaps other scales, may be as problematic elsewhere as it was in our analyses, especially for men of Color. Also, the relatively lower mean scores for the subgroups of Color in Study 2 might not necessarily indicate lower levels of depression but instead may reflect the inadequacy of the items to assess depression as it is manifested in the studied groups.

Limitations

The small sample of PA articles reviewed in Study 1 may not be representative of the state of affairs in the field as a whole, nor does the sample fully represent the past 24 years of publications in PA. However, we based our approach to this study on extant literature that used a similar methodology for exploring similar issues (e.g., Meier & Davis, 1990) in other journals. Furthermore, our findings are consistent with the work of previous researchers who have concluded that there is a dearth of reliability data in published articles in the psychological literature (e.g., Meier & Davis, 1990; Thompson & Snyder, 1998; Willson, 1980).

In both Study 1 and Study 2, we relied on previously published and collected data. Thus, we were constrained by the quality and/or ambiguities of the data that were available to us for analysis. For Study 1, the researchers' nonstandard reporting practices made it difficult to ascertain how to appropriately code the reliability-related data. In Study 2, when the NSAL code book and the data were inconsistent, we were forced to use our best judgment to resolve the discrepancies. For example, the NSAL code book (Inter-University Consortium for Political and Social Research, n.d.) referenced "in the past month" as the time frame for assessing symptoms of depression, whereas the actual data set refers to "in the previous week." Also in Study 2, we experienced limitations because three of the assumptions (i.e., normality of response distributions, homogeneity of item variances, and positive interitem correlations) underlying the use of CA were violated. We could find no evidence that Lincoln et al. (2007) or, for that matter, other researchers have addressed these three violations of CA assumptions concurrently, either in Monte Carlo studies or in real-life samples.

The goal of our studies was not to recommend use of particular types of reliability. Yet it should be noted that if our findings generalize to other samples, researchers would be well advised to consider whether CA and analogous internal consistency estimates yield interpretable results for their samples, even when the magnitudes of obtained coefficients appear adequate.

Recommendations and Conclusions

We recognize that the Helms et al. (2006) alpha-as-data guidelines were articulated and published after or concurrently with our sample of articles, but the guidelines were a collection of longstanding recommendations for proper reporting and understanding of the concept of reliability generally (Gronlund & Linn, 1990; Rowley, 1976). Yet other than Helms et al. (2006), previous advisories have not focused on conducting preliminary CA-focused analyses, and none of the studies we reviewed in Study 1 indicated that such analyses had been conducted. On the basis of our current findings and observations, we recommend six practical strategies for reporting, analyzing, and interpreting reliability data


in the psychology literature (see Appendix). We provided some easily replicable analyses that might be useful for these purposes.

Here, we focused on the reporting practices of PA, the preeminent measurement journal in professional psychology. Our studies, consistent with previous reviews, suggest that proper reporting of score reliability is not yet routine in published empirical articles. Ideally, our recommendations will help foster an alpha-as-data perspective. If researchers understand that reliability is calculated from a sample's pattern of responses, then it should become apparent that internal consistency is a property of the sample rather than of the scale. Thus, researchers can accurately interpret their reliability coefficients not as characteristics of measures but as data about test scores that potentially have real-life implications for the people behind the data.

References

Aiken, L. S., West, S. G., Sechrest, L., Reno, R. R., Roediger, H. L., III, Scarr, S., & Sherman, S. J. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721–734. doi:10.1037/0003-066X.45.6.721
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Psychological Association. (1992). Ethical principles of psychologists and code of conduct. American Psychologist, 47, 1597–1611. doi:10.1037/0003-066X.47.12.1597
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. American Psychologist, 57, 1060–1073. doi:10.1037/0003-066X.57.12.1060
American Psychological Association. (2003). Guidelines on multicultural education, training, research, practice, and organizational change for psychologists. American Psychologist, 58, 377–402. doi:10.1037/0003-066X.58.5.377
Ayalon, L., & Young, M. A. (2003). A comparison of depressive symptoms in African Americans and Caucasian Americans. Journal of Cross-Cultural Psychology, 34, 111–124. doi:10.1177/0022022102239158
Brown, C., Schulberg, H., & Madonia, M. (1996). Clinical presentations of major depression by African Americans and Whites in primary medical care practice. Journal of Affective Disorders, 41, 181–191. doi:10.1016/S0165-0327(96)00085-7
Butcher, J. N., Williams, C. L., Graham, J. R., Archer, R. P., Tellegen, A., Ben-Porath, Y. S., & Kaemmer, B. (1992). MMPI–A (Minnesota Multiphasic Personality Inventory—Adolescent): Manual for administration, scoring, and interpretation. Minneapolis, MN: University of Minnesota Press.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104. doi:10.1037/0021-9010.78.1.98
Cousin, S. L., & Henson, R. K. (2000). What is reliability generalization, and why is it important? Retrieved from ERIC database. (ED44507)
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Holt, Rinehart & Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. doi:10.1007/BF02310555
Das, A. K., Olfson, M., McCurtis, H. L., & Weissman, M. M. (2006). Depression in African Americans: Breaking barriers to detection and treatment. Journal of Family Practice, 55, 30–39.
Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481–489. doi:10.1037/0022-0167.34.4.481
Elgar, F. J., Mills, R. S. L., McGrath, P. J., Waschbusch, D. A., & Brownridge, D. A. (2007). Maternal and paternal depressive symptoms and child maladjustment: The mediating role of parental behavior. Journal of Abnormal Child Psychology, 35, 943–955. doi:10.1007/s10802-007-9145-0
Enders, C. K., & Bandalos, D. L. (1999). The effects of heterogeneous item distributions on reliability. Applied Measurement in Education, 12, 133–150. doi:10.1207/s15324818ame1202_2
Feldt, L. S., & Charter, R. A. (2003). Estimating the reliability of a test split into two parts of equal or unequal length. Psychological Methods, 8, 102–109. doi:10.1037/1082-989X.8.1.102
Friedrich, J., Buday, E., & Kerr, D. (2000). Statistical training in psychology: A national survey and commentary on undergraduate programs. Teaching of Psychology, 27, 248–257. doi:10.1207/S15328023TOP2704_02
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Needham Heights, MA: Allyn & Bacon.
Greer, T., Dunlap, W. P., Hunter, S. T., & Berman, M. E. (2006). Skew and internal consistency. Journal of Applied Psychology, 91, 1351–1358. doi:10.1037/0021-9010.91.6.1351
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York, NY: Macmillan.
Helms, J. E., Henze, K. T., Sass, T. L., & Mifsud, V. A. (2006). Treating Cronbach's alpha reliability coefficients as data in counseling research. The Counseling Psychologist, 34, 630–660. doi:10.1177/0011000006288308
Henard, D. H. (2002). Item response theory. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67–96). Washington, DC: American Psychological Association.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60, 523–531. doi:10.1177/00131640021970691
Inter-University Consortium for Political and Social Research. (n.d.). National Survey of American Life (NSAL), 2001–2003. Retrieved from http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/190
Jackson, J. S., Neighbors, H. W., Nesse, R. M., Trierweiler, S. J., & Torres, M. (2004). Methodological innovations in the National Survey of American Life. International Journal of Methods in Psychiatric Research, 13, 289–298. doi:10.1002/mpr.182
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC: Author. Retrieved from http://www.apa.org/science/FinalCode.pdf
Kazdin, A. E. (1989). Editorial. Psychological Assessment: A Journal of Consulting and Clinical Psychology, 1, 3–5. doi:10.1037/h0092776
Lincoln, K. D., Chatters, L. M., Taylor, R. J., & Jackson, J. S. (2007). Profiles of depressive symptoms among African Americans and Caribbean Blacks. Social Science & Medicine, 65, 200–213. doi:10.1016/j.socscimed.2007.02.038
Love, A. S., & Love, R. J. (2006). Measurement suitability of the Center for Epidemiological Studies—Depression Scale among older urban Black men. International Journal of Men's Health, 5, 173–189. doi:10.3149/jmh.0502.173
Meier, S. T., & Davis, S. R. (1990). Trends in reporting psychometric properties of scales used in counseling psychology research. Journal of Counseling Psychology, 37, 113–115. doi:10.1037/0022-0167.37.1.113
Myers, I. B., McCaulley, M. H., Quenk, N. L., & Hammer, A. L. (1998). MBTI manual: A guide to the development and use of the Myers–Briggs Type Indicator (3rd ed.). Palo Alto, CA: Consulting Psychologists Press.
Norris, A. E., & Aroian, K. J. (2004). To transform or not transform skewed data for psychometric analysis: That is the question! Nursing Research, 53, 67–71. doi:10.1097/00006199-200401000-00011
Nunnally, J. C. (1967). Psychometric theory. New York, NY: McGraw-Hill.
Onwuegbuzie, A. J., & Daniel, L. G. (2002). Framework for reporting and interpreting internal consistency reliability estimates. Measurement and Evaluation in Counseling and Development, 35, 89–103.
Peterson, R. A. (1994). A meta-analysis of Cronbach's coefficient alpha. Journal of Consumer Research, 21, 381–391.
Radloff, L. S. (1977). The CES–D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi:10.1177/014662167700100306
Reynolds, C. R., & Kamphaus, R. W. (2002). The clinician's guide to the Behavior Assessment System for Children (BASC). New York, NY: Guilford Press.
Rowley, G. L. (1976). The reliability of observational measures. American Educational Research Journal, 13, 51–59.
Seaton, E. K., Caldwell, C. H., Sellers, R. M., & Jackson, J. S. (2008). The prevalence of perceived discrimination among African American and Caribbean Black youth. Developmental Psychology, 44, 1288–1297. doi:10.1037/a0012747
Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods (6th ed.). Ames, IA: Iowa State University Press.
Strauss, M. E. (2004). Editorial. Psychological Assessment, 16, 3. doi:10.1037/1040-3590.16.1.3
Thompson, B. (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.
Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent Journal of Counseling & Development research articles. Journal of Counseling & Development, 76, 436–441.
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174–195. doi:10.1177/0013164400602002
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6–20. doi:10.1177/0013164498058001002
Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509–522. doi:10.1177/00131640021970682
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037/0003-066X.54.8.594
Willson, V. L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9, 5–10. doi:10.2307/1175221


Appendix

Summary of Practical Recommendations for Reporting, Analyzing, and Interpreting Reliability Data in Psychology Literature

• Regardless of the type of reliability analysis used, researchers should conduct and report preliminary analyses to determine the fit of their data to the analytic model.
• Researchers should report and specify the type of internal consistency statistic (e.g., Cronbach's alpha) for each scale administered in their research.
• When researchers administer a multiscale measure (e.g., with subscales), the reliability reporting should include internal consistency estimates for responses to each subscale and composite reliability coefficients when indicated (Cronbach, 1951).
• It is important that researchers report internal consistency estimates, means, and standard deviations for each relevant subgroup (e.g., race, ethnicity, gender, and age) within their study.

• To lessen ambiguous reporting and to aid future researchers in replicating research studies, it is important that authors be more explicit and detailed when reporting relevant information related to scale modifications and assessments used in their studies. For example, authors should clearly describe any scale modifications, which include, but are not limited to, (a) deleting items from a scale, (b) adding items to a scale, (c) altering the wording of items, or (d) modifying scale instructions.
• Authors should clearly specify scale administration protocols (e.g., self-assessment, clinician or researcher administered) in the procedure section.

Received March 16, 2010
Revision received January 14, 2011
Accepted January 18, 2011
