RR-00-15

R E S E A R C H  R E P O R T

GRADES AND TEST SCORES: ACCOUNTING FOR OBSERVED DIFFERENCES

Warren W. Willingham Judith M. Pollack Charles Lewis

Princeton, New Jersey 08541

September 2000

Grades and Test Scores: Accounting for Observed Differences

Warren W. Willingham, Judith M. Pollack, and Charles Lewis

September 2000

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the Research Publications Office, Mail Stop 07-R, Educational Testing Service, Princeton, NJ 08541.

Abstract Why do grades and test scores often differ? A framework of possible differences was proposed. An approximation of the framework was tested with data on 8454 high school students. Individual and group differences in grade versus test performance were substantially reduced by focusing the two measures on similar academic subjects, correcting for grading variations and unreliability, and adding teacher ratings and other information about students. Concurrent prediction of high school average was thus increased from .62 to .90; differential prediction was reduced to .02 letter-grades. Grading variation was a major source of discrepancy between grades and test scores. The analysis suggested Scholastic Engagement as a promising organizing principle in understanding student achievement. It was defined by three types of observable behavior: employing school skills, demonstrating initiative, and avoiding competing activities. Groups differed in average achievement, but group performance was generally similar on grades and tests. If artifactual differences between the two measures are not corrected, common statistical estimates of test validity and fairness are unduly conservative. Different characteristics give grades and test scores complementary strengths in high-stakes assessment. (Key words: validity, school achievement, scholastic engagement, group differences, grading, differential prediction)

Contents

Introduction

Differences in Grades and Test Scores
  A framework
  A five-factor approximation

Previous Research
  Grading practices
  Student characteristics
  Teacher ratings
  Implications for this study

Study Design
  The sample
  Tests and grade averages
  Other variables in the analysis

Statistical analysis
  Adjusting for grading variations
  Estimating reliability

Results of the Analyses
  The effects of factors 1 to 5
  Differential prediction
  A condensed analysis of major factors
  Gender, ethnicity, and school program

On Four Noteworthy Findings
  Accounting for grade-test score differences
  The problematic variation in school grading
  Scholastic engagement as an organizing principle
  Group performance: Similar dynamics, different levels

The Merits of Grades and Tests
  Validity and fairness
  Differential strengths

Summary

References

Author Note

Figures and Tables

Appendices:
  A. Descriptive statistics: Tables A-1 to A-8
  B. Student variables: Acronyms and specifications
  C. Notes

Introduction

Many of the most important educational decisions we make about young people concern those summative, often irreversible, judgments regarding student entry or exit from programs or institutions. Who will be placed in a slow or fast track in grade school, or earn a high school diploma, or be accepted in a selective college, or advance to the upper division, or flunk out, or be admitted to a demanding graduate or professional program? Grade averages and test scores are the two types of evidence most commonly used in supporting these high-stakes judgments. Tests are routinely evaluated for such educational purposes, grades less systematically. When a cumulative grade record is used in reaching important educational decisions, it becomes, in effect, a predictor or criterion. In this capacity grades take on an assessment function both broader than and different from the teachers' original evaluations of their students' acquired proficiency in a particular subject in a given class. In serving the broader function, grade averages have virtues as well as limitations. Understanding such characteristics of grades is important to the valid use of test scores as well as grade averages because, in practice, the two measures are often intimately connected. The uses of grades and tests are interdependent in a number of ways. Teachers use classroom tests in assigning grades. Administrators use standardized tests in monitoring grading standards and in evaluating grade differences between students and among groups of students. On the other hand, we use grades to validate tests, and we use grades to judge the fairness of tests. In many situations test sponsors urge the use of the grade record and the test score together in order to enhance the validity and fairness of important educational decisions.

Despite these obvious interdependencies, there are odd contradictions in the ways we view tests and grades. Note the paradox: In some educational contexts we use tests to keep grade scales honest or because we do not fully understand or trust grades to be an accurate indicator of educational outcomes. But we also reverse those sentiments and use grades to demonstrate the usefulness of tests and to justify their use. One likely source of the contradiction is the tendency, for educators and measurement specialists alike, to assume that a grade average and a test score are, in some sense, mutual surrogates; i.e., measuring much the same thing, even in the face of obvious differences.1 One manifestation of that implicit assumption is the common inclination to treat an improvement in grade prediction as the dominant, if not the sole, basis for validating a high-stakes admissions test and justifying its use. For example, it may be hard to sell the substitution of a new test predictor that has important educational advantages unless it is clearly equal to or preferably stronger than a current measure in predicting GPA (i.e., normally, the surrogate of primary interest). Similarly, debate over the added value of an admissions test often focuses only on its incremental predictive validity over the already available prior grade record, overlooking other important educational considerations that may hinge on intrinsic differences between the grade record and test scores. A more telling instance of the implicit assumption that grade criteria and test scores are mutual surrogates lies in the formal professional definition of test bias. With few qualifications, a test is considered biased for a group of examinees if it predicts a mean criterion score any different from the actual criterion mean (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, 1985; American Psychological Association, American Educational Research
Association, National Council on Measurement in Education, 1999). This interpretation leaves little room for differential prediction due to the test predictor and the grade criterion incorporating somewhat different construct relevant components or due to technical artifacts such as unreliable predictors (Linn & Werts, 1971). In their defense, measurement specialists may read "predictive bias" simply as “different result.” The more common interpretation is "something wrong with the test,” regardless of why the results are different. Presumably, both grades and test scores have characteristic strengths. It would be useful to have a better understanding of those strengths, when one measure might be superior to the other, and in what ways their joint use might be advantageous. Insuring the validity and fairness of one requires an appreciation of issues concerning the validity and fairness of the other. Despite the widespread use of both grades and tests as indicators of educational achievement, we have quite different habits and expectations regarding the standards to which we hold these two measures. National agencies and special commissions give careful attention to the technical quality and proper use of tests in high-stakes decisions, but seldom grades (Gifford & O’Connor, 1992; Heubert & Hauser, 1999; Office of Civil Rights, 1999; Wigdor & Garner, 1982). A substantial body of professionals, with varied interests and agendas, devote most of their time and attention to tests. We study what tests measure, how to evaluate and improve their quality, and how to insure the validity and fairness of test scores. We have extensive debates, scholarly literature, and textbooks on test theory, standards, and practice. All of this is as it should be. Testing is a public enterprise. To be sure, researchers have carried out useful studies of grades and grading, but nothing to compare with the systematic attention devoted to tests. This is not to say that tests are fine and grades are a mess.
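As a rough illustration of the differential-prediction logic behind this definition of predictive bias, the sketch below fits a single regression of a grade criterion on a test score for all examinees and then checks the mean residual (actual minus predicted criterion) within each group. The data, variable names, and two-group setup are invented for illustration; they are not the NELS variables or the analyses reported later in this study.

```python
# Illustrative only: under the common regression model, a predictor shows
# differential prediction for a group when the pooled regression line
# systematically over- or under-predicts that group's criterion scores.
import numpy as np

rng = np.random.default_rng(0)

n = 1000
group = rng.integers(0, 2, size=n)          # hypothetical group membership (0 or 1)
test = rng.normal(50, 10, size=n)           # predictor (test score)
grades = 1.0 + 0.04 * test + rng.normal(0, 0.4, size=n)  # criterion (grade average)

# Pooled least-squares regression of the grade criterion on the test score
slope, intercept = np.polyfit(test, grades, 1)
residual = grades - (intercept + slope * test)   # actual minus predicted

# Mean residual by group: positive values indicate underprediction for that
# group, negative values indicate overprediction
for g in (0, 1):
    print(f"group {g}: mean residual = {residual[group == g].mean():+.3f}")
```

With data simulated this way the group means of the residuals hover near zero; differential prediction of the kind discussed in this report shows up as group means reliably different from zero.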

Testing has a long history of public controversy (Cronbach, 1975; Linn, 1982b). Many critics—both within and without the profession—have discussed the technical shortcomings and the social concerns that testing engages (Crouse & Trusheim, 1988; Frederiksen, 1984; Jencks, 1998; Lemann, 1999; Madaus, 1994; Shepard, 1992b). These issues notwithstanding, objective measures of school achievement have obvious benefit—especially for high-stakes selection (Beatty, Greenwood, & Linn, 1999, pp. 20-22) and as policy instruments to foster educational accountability (Heubert & Hauser, 1999, pp. 33-40). Heubert and Hauser (p. 1) described the current interest in testing to promote accountability:

The use of large-scale achievement tests as instruments of educational policy is growing. In particular, states and school districts are using such tests in making high-stakes decisions with important consequences for individual students. Three such high-stakes decisions involve tracking (assigning students to schools, programs, or classes based on their achievement levels), whether a student will be promoted to the next grade, and whether a student will receive a high school diploma. These policies receive widespread public support and are increasingly seen as a means of raising academic standards, holding educators and students accountable for meeting those standards, and boosting public confidence in the schools.

This is not a new development. Linn (2000) described the use of tests as key elements in five waves of educational reform during the past fifty years. These included tracking and selection in the 1950s, program accountability in the 1960s, minimum competency programs in the 1970s, school and district accountability in the 1980s, and standards-based accountability in the 1990s. The most recent reform effort is accompanied by a strong emphasis on improving teaching and learning through improved assessment.

One goal in assessment reform is to establish more direct linkages between instructional objectives and the content and process of testing (Frederiksen, J. & Collins, 1989; Frederiksen, N., Glaser, Lesgold, & Shafto, 1990; and Resnick & Resnick, 1992). Another goal is to focus assessment on established educational standards—in both the educational system and the individual classroom (Baker & Linn, 1997; Shepard, 2000). A third goal is to broaden the range of assessment formats and the skills thereby engaged (Bennett & Ward, 1993; Wiggins, 1989). Assessment reform seeks measures that will better inform teaching and learning and provide more useful feedback regarding the outcomes. Current initiatives imply dissatisfaction with both grading and testing—and the need, one might say, to better realize the strengths of each. To that end, this study endeavors to enhance our understanding of some of the main ways in which grades and tests differ. Our premise was that it should be possible to account for much of the difference observed between grades and test scores. This study had several purposes: to suggest a framework that might help to explain major differences between grades and scores, to evaluate an approximation of the framework in a national database, to test its generality among different groups of students, and to examine possible implications of the findings as to the respective merits of grades and scores in high-stakes decisions.

Differences in Grades and Test Scores

Explaining why students often perform somewhat differently on classroom grades and standardized tests calls for some form of framework. Developing a framework poses two challenges. One task is to describe how such a framework might look in theory. Another, somewhat different, task is to devise an approximation that can be evaluated with real data. We address those tasks in turn.

A Framework

In considering a framework to compare two generic measures like grades and test scores, it is useful to consider two basic aspects of educational measurement. First, any measure that is used in making high-stakes decisions about individual students serves two overlapping but distinguishable functions: selective and educative. Either of these functions may be the primary purpose of the measure. A school-leaving test may be one basis for deciding who graduates from high school, but its primary purpose may be to further certain educational objectives. A college admissions test may influence high school instruction, but its primary purpose is normally to facilitate selection. Since our presenting question is why individual students or groups of students often perform somewhat differently on grades and test scores, we are concerned with factors that bear on the selective function. That is, what factors cause the selective function of grades and test scores to identify somewhat different high or low scorers? Such factors may affect the educational quality of grades or tests as well, but it is important to bear in mind that additional features of grades and tests, not considered here, will also affect the quality of the measures and the effects of their use.

Both a grade record and a test may have good or poor educative qualities, independent of whether they rank students similarly or differently. Other features may primarily determine what knowledge and skills are being acquired and how grading and testing affect teaching and learning. This study focuses on why students often rank differently on the two measures. Thus, the analysis is based on patterns of individual and group differences in assessment outcomes rather than content differences in the measures themselves. The corollary measurement consideration is how the specific differences between grades and tests that result in different ranking of students may also influence the validity, and necessarily the fairness, of each measure. Two critical aspects of the validity of a high-stakes measure can be usefully contrasted as content relevance and fidelity—does the measure assess desirable aspects of achievement and does it do so accurately? These two qualities play an important role in evaluating results in any framework proposed, because content relevance and fidelity provide the basis for drawing validity and fairness implications from specific sources of grade-test score differences.2 The relevance of a measure refers to how well it represents the domain of pertinent knowledge and skills, and does not include knowledge and skills that are irrelevant to the measurement objective (Messick, 1989; 1995). Relevance determines not only the short-term usefulness of a measure for high-stakes decisions, but also its long-term importance in developing human resources and its antecedent effects on the priorities of teachers and learners who know that the test is coming. A measure's fidelity includes its reliability from one testing to another, its comparability from one situation to another, and its security from cheating and compromise--all of these being socially demanded aspects of accuracy and dependability in the actual use of a high-stakes measure. With these measurement

distinctions in mind, we turn now to the problem of identifying those factors that are most likely to cause the selective function of grades and test scores to operate differently. For many years researchers have sought to understand what factors are associated with favorable educational outcomes. Research on this question readily lends itself to causal modes of thinking; viz., what family circumstances and values promote achievement in school, what student characteristics, habits, and attitudes lead to good grades? It is common to include test scores in a longitudinal analysis of student development or prediction of future grades, because the purpose is to "explain," in a pragmatic statistical sense, what accounts for the educational gain or the achievement above or below expectation. But this reasonable concern and line of inquiry does not mean that causal logic is the most useful way to view the relationship between test scores and grades. Test scores do not cause individual differences in grades, nor vice versa. If a test and a grade are intended to represent much the same achievement, then individual differences in both measures presumably result from much the same learning processes, influenced by much the same environmental and genetic factors, and channeled by much the same cognitive differences and personal interests. From that perspective, attaining a better understanding of the wellsprings of achievement will not necessarily help in understanding differences in performance on grades and test scores. For the purposes of this study, we pose a somewhat different question, "How does the composition of grades and test scores differ and what are the implications of those differences?" The two measures are presumably somewhat similar composites of skill and knowledge that are generally relevant to the achievement construct of interest plus some other sources of construct-irrelevant variance. From this view, grades and test scores are correlated

only moderately because the elements of the two composites overlap only partly. It is important to bear in mind that, for our purposes here, a different composition is important only because the different components affect individuals and groups differently. That is, components interact with other individual characteristics such as behavior and background. Figure 1 suggests a framework of possible sources of difference between grades and test scores. Categories A and C refer to the composition of grades and test scores—the former to content differences between the two measures, the latter to different types of error in both. Category B refers to related individual differences that play an important role in converting the content differences of Category A into score differences. Category D refers to situational differences that increase the possibility of content differences and divergent patterns of individual and group differences. ______________ Insert Figure 1 about here ______________ Figure 1 is not intended to be comprehensive or to represent fully the goals or the outcomes of education. The sources of difference overlap, and the framework focuses on major areas of difference, not details. Furthermore, we are concerned here with possible sources of the observed differences between the two measures as we normally encounter them, not with noticeably improved grades or tests that we might sometimes encounter or hope to develop. From a measurement perspective and on the basis of what is widely known about grades and test scores, most of the sources here suggested are commonsensical. In some cases, a substantial body of pertinent research literature can be consulted for clues as to how the source of difference actually works. It is reasonable to assume, however, that these

various types of grade-test score differences may work differently for different groups of students or academic programs. If it is expected that a grade and a test score should rank students similarly, perhaps the most obvious implicit assumption is that the two measures encompass knowledge and skills based on a generally similar academic domain as implied by Category A.1 in Figure 1. Similar broad areas of competence might be sufficient to place students in much the same order if the domain were broad, but a closer correspondence of specific subject matter and skills would be required if the domain were narrow. For example, consider comparing student performance on a grade average and a test battery, each based on a number of academic subjects. For that particular comparison, having free-response word problems on one and multiple-choice equation solving on the other might constitute a sufficiently similar representation of mathematics. Those elements would not be sufficiently similar if one were only comparing performance in mathematics. Differences between internal and external tests must also be considered. The construct-relevant knowledge and skills typically represented on external standardized tests are surely not identical to those typically found in the local classroom tests on which teachers base their grading. It is reasonable to expect more difference between grades and performance on external tests than on local tests that more closely reflect the local syllabus as well as the teacher’s particular view of the subject and how it should be assessed. Some teachers may be performance oriented in their testing and grading and, for that reason, tend to stress written or oral presentation. Skills involved in such assessment are not frequently represented on external tests. To be sure, some teachers will lean to assessing knowledge and problem solving skills more similar to those typically represented on an objective test.

Nonetheless, assessment format is, like subject matter, a likely source of relevant or irrelevant skills that differ somewhat between grades and test scores. Category A.2 suggests an inherent distinction between a teacher’s grade and an external test score that stems from different purposes of the two measures. To aid learning and instruction, grades reflect specific knowledge and skills stressed in the particular classes that a given student takes. To foster accountability, standardized tests provide outcome assessments that are comparable across schools (Shepard, 2000). This distinction will result in different performance on grades and test scores among students with different schooling because students will therefore experience somewhat different curricula and learning situations, and also because students respond to school differently. Normally, teachers assign grades largely on the basis of quizzes and examinations on the lessons that they assign in class. A student may know a good deal about the subject, but if poorly motivated to study the material assigned, he or she is less likely to correctly answer the teacher's specific questions about that particular material and will be graded accordingly. An external test in a given subject area does not represent the specific knowledge and skills that characterize a particular learning experience, but tends to represent content typical of a generic course in the subject, wherever it is taught. Some students might make a reasonably good score on such a generic course-based test due to having learned applicable knowledge and skills outside of school or some years earlier. Other students may have a mediocre command of the course but earn a good grade in the class by working hard on the particular material and exercises presented by their teacher. Different patterns of individual and group differences will depend, of course, on the interaction of content differences and other sources of individual differences. Thus, the extent

to which students do relatively better on classroom-based grades as compared to course-based tests will tend to depend on their total learning experiences (Category B.1), their dedication to schoolwork (B.2), and their teacher’s judgment as to how well they have performed in class (B.3). Category A.2.c in Figure 1 represents a related fundamental difference between grades and test scores. To some degree, all students individualize their academic pursuits, but most educators ascribe to common learning goals and standards within an educational jurisdiction. As a matter of principle, non-traditional education places high value on the individual assuming responsibility for his or her learning. From this perspective, personal development for each student is a more important goal of education than is mastery of prescribed knowledge and skills by all students (Keeton and Associates, 1976). In theory, grades or some parallel form of individualized assessment can readily recognize special learning and accomplishment. To the extent that education is individualized, a standardized test will tend to yield results somewhat different from an individualized grade assessment (Whitaker, 1989). Category A.3 refers to other elements that may be represented in the grade or the test score but are not formally part of the knowledge and skills that define the subject domain. A social objective of education such as enhancing good citizenship would be one example (A.3.a), though it seems doubtful that either grades or tests are much influenced by such outcomes. In the case of grades, students may receive credits or deductions for particular behavior that is more directly connected with schoolwork. Examples include attendance, class participation (or disruption), turning in homework assignments, other evidence of dilligence and progress, contributions to the learning environment, and so on (A.3.b). These elements may or may not influence subject mastery. Their construct relevance in grading

14

stems from their pertinence to a broader definition of education that includes personal development and conative aspects of learning like volition and effort. When teachers assess learning outcomes for individual students or consider awarding credits or making deductions from grades, the teacher’s judgment is an additional source of variation (Category B.3). Positive halo or negative bias may play an unconscious role— clearly a potential construct-irrelevant component of grading. Other construct irrelevant components may be connected with the test format or the assessment process, whether it is an external test or a classroom test or graded exercise (Category A.3.c). Examples include test wiseness and anxiety in testing or performing situations. Such elements can influence grades or scores either positively or negatively, and the effects are likely to vary considerably with the specific situation. In Category A of Figure 1, one could also include other types of knowledge and skill that may be quite relevant to a specific syllabus but are not routinely considered academic subject matter; for example, cooperative learning, physical development, or religious teaching. In a particular program, such educational outcomes may well be represented in either grades or tests. Error in grades and scores. Two types of error cause discrepancies between observed grades and test scores: systematic and unsystematic. Noncomparability is the important form of systematic error (C.1.a). An extensive literature over several decades has documented substantial differences in grading standards from school to school and college to college. The corresponding problem with test scores (C.1.b) can apply if, for example, a practitioner compares scores from different tests that have the same names, or uses percentile scores based on the wrong norm group, or makes some consequential equating error (Hartocollis, 1999).

Another serious, and evidently more prevalent, type of noncomparability is the possibility of a change in test difficulty over time due to increased familiarity with a particular test form following its repeated use (Cannell, 1988). Cheating (C.1.c) can be another form of noncomparability—evidently a common student practice in school, but often notorious when it involves a high-stakes test. When stakes go up, schools can also engage in questionable practices whether it involves tests or grades (Saslow, 1989; Wilgoren, 2000). Unreliability is, by definition, unsystematic measurement error (C.2). Like all educational measures, both grades and test scores are subject to such error. While there are many sources of measurement error (Thorndike, 1951), variation in the likelihood that a student will know the answer to a particular question is probably the most important in the context of this study. Unreliability is independent of any noncomparability among grades or test scores, though both attenuate the observed relationship between grades and test scores. Finally, if grades and tests are expected to reflect a similar level of performance, they need to be based on concurrent learning in a similar context (Category D). Otherwise, performance differences may be due to situational variation in the student's motivation or other differences associated with the particular learning experience. A particular characteristic that distinguishes grades and test scores may look different in different situations. For example, assume that one is considering whether the use of an essay versus a multiple-choice test has any differential effect on grades and test scores. That choice of assessment format might look like a difference in test-taking skills (A.3.c) in a course such as physics where writing is often incidental to knowing the correct answer. On the other hand, writing would be a construct-relevant cognitive skill in English (A.1.b), where
compositional expertise is likely to be a critical element among the intended learning outcomes. In a given assessment such details may loom large. But for the purposes of this study, the immediate question is how to delineate and quantify the major sources of discrepancy so as to study their effects, preferably all at the same time. More specifically, how might the most important factors in Figure 1 be represented with sufficient validity in real data available in a large database?

A Five-Factor Approximation

Any proposal to study the effects of the various sources of grade-test score discrepancy poses some nontrivial problems. Of the sources listed in Figure 1, some can only be partly estimated, some can only be estimated indirectly, some are not included in the database we proposed to use, and for some there are simply no data available. Fortunately, the possible sources overlap and the ones that can be approximated are likely to be the more important ones. Consequently, the analysis reported here is an approximation based upon the following five factors:

• Factor 1. Subjects Covered
• Factor 2. Grading Variations
• Factor 3. Reliability
• Factor 4. Student Characteristics
• Factor 5. Teacher Ratings

These five factors are described below. It is first necessary to comment briefly on the analytic model and the database. Principal aims of this study were to determine to what extent the proposed factors can account for differences in the rank order of students on
corresponding grades and test scores and to evaluate what role each factor plays in that regard. This is why the study examines differences between grades and test scores by analyzing patterns of individual and group differences rather than analyzing content or structural differences between the two measures. The five-factor approximation derives from the assumption that grades and test scores have different constituent parts, which, in turn, have different effects on the observed performance of individual students. The grades and test scores of students should correspond more closely—that is, the measures should be more highly correlated—if one could alter the components or statistically adjust the two composite measures so that they are similarly constituted. Prediction is a useful analytic framework with which to initiate the inquiry because it asks, simply, "What must one add to one measure in order to account for variation in the other?" Consequently, there is considerable emphasis throughout this report on the size of the grade-test correlation, the magnitude of group differences, and other such statistical characteristics of the two measures. It is important to remember the limited objectives of this analysis. Any attempt to assess the overall validity of grades and tests would also stress the character of the measures, their educational and psychometric qualities, and the effect of their use in high-stakes decisions. Toward the end of this account, we return to that thought. The five-factor approximation, the statistical procedure, and the aims of the study all call for an unusual database. No available database can fully fill the bill, but the National Education Longitudinal Study of 1988 (NELS) affords remarkably rich information about a national sample of students who graduated from high school in 1992. NELS provides critical information of several types: student background and personal characteristics, test scores in

four basic academic areas, full course-grade transcripts throughout high school, teacher ratings of individual students, plus other useful data from school records. NELS has some shortcomings, which will be explained, but compared to any other database, it is particularly well suited to examining differences between grades and test scores. While NELS provides useful information that bears upon each of the five factors, the data available do not map exactly on the factors postulated. In order to evaluate the effect of the factors, the task is to approximate the role of each in a manner that is compatible with multiple regression analysis. There are three means of doing so. The most obvious approach—not always possible—is to index a factor; that is, represent it as a score for each student that can be included as a variable in the analysis. Another method is to define or correct the grade or test score so as to reduce apparent differences in their constitution. Finally, some factors can be handled as statistical adjustments. Thus, representing each factor in the analysis as accurately as possible involves somewhat different steps and assumptions. The following paragraphs are intended to indicate only briefly the general manner in which the five factors are here approximated. Specific procedures are described in more detail in two subsequent sections concerning Study Design and Statistical Analysis. Factors 1, 2, and 3 are somewhat different from Factors 4 and 5. The former three concern mismatched material and error components that make grades and test scores less comparable. The latter two concern student behavior and other characteristics that were postulated to have a heavier bearing on grade performance than on test performance. As previously observed, the mutual validity of grades and test scores as equivalent measures can be compromised either by lapses in fidelity or in content relevance. Factors 2 and 3 bear upon the fidelity of the measures, because they represent some degree of inevitable
error in both grades and test scores. Factors 1, 2, and 5 bear upon the relevance of the two measures, because they represent potential content differences between grade and tests. Factor 1. Subjects Covered. The NELS survey provides a complete transcript for each student including traditional academic subjects, vocational subjects, and service courses like driver training and physical education. The NELS tests cover a more restricted range of academic subjects: reading, mathematics, science and social studies. A reasonably good match is attained between an overall grade and a test composite for each student by defining the measures as follows: a) restrict the grade average to the four “New Basics” subject areas (Ingels et al, 1995) that correspond most closely to the four NELS tests—courses in English, mathematics, science, and social studies, and b) weight the four test components so as to optimally reproduce the students’ rank order on the grade average. Factor 2. Grading Variations. In cross-sectional data grading standards can vary across situations associated with different instructors, sections, courses, programs, schools, and several possible interactions among those. The NELS transcript database permits analysis of grading variations across schools and courses within schools, which likely account for the more consequential errors due to differences in grading. Variations in school grading can be corrected by carrying out regression analyses within schools and adjusting the pooled results for restriction in range. In this manner school differences in either grades or test scores do not come into play. The effect of course-grading differences can be indexed for each student according to the leniency or strictness of grading (i.e., average grades in relation to average test scores) in the courses that he or she took. The resulting index is then used to correct the student’s grade average. Variation in test score scales is not likely to be significant in the NELS data because all students took parallel and equated forms of the same test.

Factor 3. Reliability. Measurement error in both grades and test scores is independent of any systematic scale differences due to grading variations. This source of error cannot be indexed by student but can be taken into account by traditional corrections for attenuation. Test reliabilities are available from NELS. Grade reliabilities can be estimated through appropriate analysis of course-grade records. Factor 4. Student Characteristics. It is assumed that both grades and test scores reflect, in varying degree, the learning that students acquire in their particular school programs. There is no practical way to estimate the difference in specific knowledge and skills that are represented in the NELS Test and the grade averages (A.2 in Figure 1). Detailed comparison of the actual subject matter in the test and each student’s courses is not a realistic alternative. It is possible, however, to index these differences indirectly by focusing on student characteristics (B.2 in Figure 1) that help to predict the academic performance of individual students. Adding this factor advances the goal of accounting for grades earned because such characteristics are likely to be more highly related to grades than to test scores. Research on why students often make grades higher or lower than expected on the basis of test performance has a long history, which is briefly reviewed in the following section. Student characteristics that appear to be most promising include family background, personal attitudes, academic history, activities in school, and a variety of behaviors often found to be related to school achievement. A number of such measures can be constructed from the NELS database. The common objective is to include measures that may signify higher or lower performance in school due to greater or lesser commitment to school. Some of these self-reported characteristics are also related to grading pluses or minuses that students often receive for certain behavior in school (A.3.b in Figure 1).
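The corrections for attenuation mentioned under Factor 3 follow the classical formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The small helper below is a generic illustration; the numbers in the example are made up, not the NELS reliability estimates.

```python
# Classical correction for attenuation: r_true = r_xy / sqrt(r_xx * r_yy)
import math

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Made-up example: observed grade-test correlation of .62, with reliabilities
# of .84 for the grade average and .93 for the test composite
print(round(disattenuate(0.62, 0.84, 0.93), 3))
```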

Factor 5. Teacher Ratings. Teachers are a primary source of information about the academic achievement of students as well as their behavior in school. Information concerning such matters as class participation and completing homework was also obtained from student questionnaires, but teachers observe the behavior directly and may be more objective than the students themselves. More important, teachers decide what counts, and teachers assign the grades. Teacher ratings may also provide some indication of performance skills more likely reflected in grades than in standardized test scores. Lastly, teachers' ratings of students may reflect two other types of performance more likely to be represented in grades than in standard tests of subject knowledge and skill: educational objectives in the individual classroom and individualized learning outcomes (A.2.b and A.2.c, respectively, in Figure 1).

Previous Research

In reviewing previous research, our main objective was to inform and improve the design of the proposed study. The most critical design questions concerned the selection and definition of useful measures and the best modes of analysis in order to examine the effects of the five factors proposed. Two of the factors posed limited options. Matching the subjects covered by test and grades (Factor 1) was simply a matter of using measures already available from NELS. Correcting for unreliability (Factor 3) required application of mostly standard procedures. On the other hand, Factors 2, 4, and 5 presented many options and uncertainties. A substantial research literature proved quite helpful in addressing design issues in the three pertinent areas: grading practices, student characteristics, and teacher ratings.

Grading Practices

Grading has a rich and sometimes quirky past. In an entertaining account of that history, Cureton (1971) cites instances to illustrate that current grading problems are not all that new. There was a time, for example, when enforcing standards called for corporal punishment. Cureton cites the Stuttgart Schulordnung of 1505 as specifying that, for this purpose, school children should be instructed to bring in fresh rods from the forest each week. Similarly, grade inflation has manifested itself in a manner fitting to the times. In the 19th century, a Virginia academy graded its students with these clear categories: optimus, melior, bonus, malus, pejor, and pessimus. Cureton (1971, p. 2) quotes the president, Henry Ruffner, regarding the continual tendency of teachers to mark inferior students too high. "While optimus ought to have been reserved for students of the highest merit, [it] was commonly bestowed on all who rose above mediocrity." To counter this problem, the president modernized the grading system to three categories—disapproved, approved, and
distinguished. Nevertheless, he reportedly mourned that "within two or three years, some bad scholars were approved, and good scholars were nearly all distinguished" (p. 2). Troublesome and often controversial issues swirl about grading because grades have consequences, and diverse grading practices are often not regulated by any clear consensus regarding principles and values. On some occasions, as in the student protest movement of the 1960's, grading was a lightning-rod social issue. But in every decade, basic assumptions about grading are regularly debated: the purposes of grading, the basis on which grades are assigned, the standards imposed, the effects on teaching and learning, and the social consequences. Such issues have regularly spawned scholarly and popular articles and books bemoaning what's wrong with grading and offering advice on what to do about it (Hills, 1981; Kirschenbaum, Simon, & Napier, 1971; Loyd, B. & Loyd, D., 1997; Milton, Pollio, & Eison, 1986; Terwilliger, 1989; Vickers, 2000; and Warren, 1971). Meanwhile, a variety of technical and research topics are periodically reviewed: evaluation methods, criterion vs. normative standards, marking systems, as well as favored reforms like pass-fail or contract grading (Geisinger, 1982; Natriello, 1992; and Thorndike, 1969). Similarly, regional or national surveys tally the latest evidence as to what schools and colleges are actually doing regarding such matters (College Board, 1998; National School Public Relations Association, 1972; Robinson & Craver, 1989). It is useful to distinguish two aspects of grading: components and standards. Grading components. For the purposes of this study, "What counts?" in the eyes of teachers is an especially germane aspect of grading practice. More specifically, what additional components are represented in grades other than demonstrated mastery of knowledge and skills that are pertinent to the objectives of the course? What extra factors
might be expected to result in discrepancies between a grade and a test primarily concerned only with subject knowledge and skills? The other factors that count and when they come into play are likely influenced by the multiple purposes that grades are often intended to serve. Geisinger (1982) listed the diverse purposes of grading based on his review of research and writing on the topic. His list agrees rather well with the opinions of college faculty collected a decade later by Ekstrom and Villegas (1994). The following objectives appear to be most prominent in educators’ minds: provide feedback useful to students in their studies, help colleges make decisions about students and maintain standards, provide information about students’ performance to other institutions and employers, motivate students academically, and help students learn discipline for later work and adult life. Several studies provide information concerning teachers’ opinions as to what influences grade assignment. Frary, Cross, and Weber (1993) surveyed a random sample of high school teachers in five academic areas in the state of Virginia. Tests and quizzes easily had the most influence on grades, but several additional elements were considered to be important or should be taken into account in determining final grades. Teachers endorsed these factors with the following frequency: projects/papers—71%; daily homework—71%; class participation—51%; exceptionally high or low effort—66%; laudatory or disruptive classroom behavior—31%. In varying degree, each of these added considerations implies behavior for which the student may win or lose grade points irrespective of demonstrated knowledge and skill in the subject. The practice of rewarding appropriate behavior applies particularly to special projects and homework because such assignments help to serve broader pedagogical ends like

encouraging effort and initiative, or learning skills critical to the management of complex tasks . The validity of such grades as a measure of the individual’s competence may be compromised because work out of class often benefits from the cooperation and talents of fellow students—either by instructional design or unsanctioned collusion. In any event, the points thus won or lost represent one source of difference between grades and test scores. A national survey by Robinson and Craver (1989) solicited information from 832 high school districts on the influence of behavioral factors in grading. Regarding the role of student effort, the districts reported as follows: Effort did not enter into grading–34%, was included in the course grade–33%, or there was no uniform district policy–33%. Presumably the absence of a district policy left the matter up to the teachers. In a more recent survey, approximately half of the teachers reported that colleagues in their school pass a student who has tried hard even if he or she had not learned what was expected (Public Agenda, 2000). Some data suggest that in higher education formal policy is rarer than individual teachers taking extenuating factors into account when they assign grades. Table 1 (Ekstrom & Villegas, 1994) shows the frequency with which faculty reported that certain student behaviors influence grades. On average, 30% of respondents said that it was informally expected that faculty would take into account such factors as effort and timeliness, but only 7% reported any official policies regarding such practice. ______________ Insert Table 1 about here ______________ How important are different factors in assigning grades? Ekstrom and Villegas summarized their data as follows. Among 15 types of evidence, tests and papers were considered of great or very great importance in introductory courses by most of the seven

departments. Four other factors—subject-specific skills, meeting due dates, creativity, attitude/effort, and class participation—were considered of moderate to great importance in at least three of the seven departments. Two differences were distinguished with respect to grading in advanced courses: faculty placed more importance on the four factors just mentioned, and they considered the quality of students’ papers more critical than their classroom test performance. Another study illustrates that grading can be a complex value-laden process when different purposes of grading come into play. Brookhart (1993) posed to teachers such grading questions as the following, “What do you do when a student with an A test average fails to turn in a report that counts for 25% of the grade?” The answer varied among teachers and circumstances and often turned on the importance attached to different objectives that might be served. In summary, grading procedures depend upon local policy and practice, and they have varied over time. Surveys of schools and teachers clearly indicate that graders often consider factors in addition to the student’s acquired knowledge and skill in order to serve other educational objectives. To what degree such considerations directly affect grades is hard to say. The factors that seem especially worth attention in the present analysis include attendance, class participation, disruptive behavior, and completing work assignments. Grading standards. The preceding section cited evidence of variation in the components of grades, or the factors that teachers consider in grading. Researchers have also accumulated considerable evidence of variation in grading standards; that is, the level of the grades typically assigned for presumably comparable work by individual teachers or teachers in a given subject, school, etc. Different types of evidence indicate that grading standards

vary: variation in the grades that different instructors assign to the same papers, variation over time in the average grade assigned to a population of students that has not ostensibly changed, variation in the average grade for groups of students with comparable scores on a relevant test, variation in the average grade earned by the same group of students in different courses. Early in the last century, Starch and Elliott (1912, 1913) published perhaps the most widely quoted studies of “alarming” variation in the grades that different instructors assign to the same papers—in subjectively graded subjects, and even in mathematics. Much of such disagreement is no doubt just a matter of inconsistent judgment, but some variation comes from different instructors assigning grades that are systematically higher or lower than the norm for their particular subject (Kelly, 1914). Instructors’ grading habits are common fodder for student lore. Shifts in the apparent level of grades in schools and colleges across the country are also frequently reported. Such shifts are typically characterized as grade inflation, probably because they always seem to go up. In fact, grading practices are also periodically redefined or differently enforced so as to cut down on the number of students receiving an A, 95, 4.0, or “optimus.” There are years of inflation, deflation, and stagnation (see Willingham & Cole, 1997, p. 305 for periods and references). Periods of grade inflation and deflation provide a sign of the ease with which teachers can shift their grading standards. Contrary to what one might assume, such broad shifts within the range we have experienced do not appear likely to affect either the correlation between test scores and grades (Bejar & Blew, 1981) or the observed discrepancies between

grades and test scores. What does have such effects is the comparability of grades; i.e., variation in grading standards from course to course and from school to school. A number of studies have demonstrated that grades are not comparable from course to course (Elliott & Strenta, 1988; Goldman & Hewitt, 1975; Goldman, Schmidt, Hewitt, & Fisher, 1974; Goldman & Slaughter, 1976; Goldman & Widawski, 1976; Juola, 1968; Strenta & Elliott, 1987; Willingham, 1963c, 1985). These analyses are typically based on a comparison of average grades across groups after average test differences are taken into account. Similar results are obtained without any reference to test scores by simply comparing the grades earned by the same group of students in different courses (Goldman & Widawski, 1976). Grading standards tend to be stricter in courses like mathematics and science that often attract stronger students; grading tends to be more lenient in courses like education and sociology that often attract weaker students (Bridgeman, McCamley-Jenkins, & Ervin, 2000; Goldman & Hewitt, 1975; Ramist, Lewis, & McCamley, 1990). That particular pattern of noncomparability has been replicated across studies and appears to be similar from one institution to another (Elliott & Strenta, 1988). Goldman and Hewitt (1975) proposed a theory of grading variations based on “adaptation level.” They suggested that observed patterns of apparently discrepant grading standards result from the tendency of faculty to adapt their grading level to the ability level of the students that they typically encounter. In this view, teachers have a tendency to assign a generally similar spread of As, Bs, Cs, etc. despite significant differences in the average level of competence of students in different courses and institutions.
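In the spirit of the paired-comparison approach just described (comparing the grades earned by the same students in different courses), the sketch below tallies within-student grade differences for every pair of courses. The transcripts, course names, and grades are invented, and the simple mean difference stands in for the more careful analyses cited above.

```python
# Illustrative paired comparison of grading standards: for students who took
# both courses in a pair, the mean within-student grade difference suggests
# which course grades more leniently.
from collections import defaultdict
from itertools import combinations

# Invented transcript records: {student: {course: grade on a 4-point scale}}
transcripts = {
    "s1": {"math": 2.7, "english": 3.3, "art": 3.7},
    "s2": {"math": 3.0, "english": 3.3},
    "s3": {"math": 2.3, "art": 3.0},
    "s4": {"english": 3.7, "art": 4.0},
}

diffs = defaultdict(list)
for courses in transcripts.values():
    for a, b in combinations(sorted(courses), 2):
        diffs[(a, b)].append(courses[a] - courses[b])

# A positive mean for (a, b) means course a awarded higher grades than course b
# to the same students; a negative mean runs the other way.
for (a, b), d in sorted(diffs.items()):
    print(f"{a} vs {b}: mean difference {sum(d) / len(d):+.2f} over {len(d)} students")
```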

Numerous studies have demonstrated that noncomparable course grades will lower the correlation between test scores and grades within an institution and that adjusting the grades to make them more comparable will improve the correlation (Elliott & Strenta, 1988; Goldman & Slaughter, 1976; Pennock-Roman, 1994; Ramist et al., 1990; Strenta & Elliott, 1987; Stricker, Rock, Burton, Muraki, & Jirele, 1994; Willingham, 1985; Young, 1990). Raising the correlation between test scores and grades by using a more comparable grade criterion is simply the result of identifying and removing one source of grade-test score discrepancy. The same principle applies to groups of students. A number of studies have demonstrated that underprediction of women’s grades is due partly to somewhat different grading standards in courses that are typically taken by males and females. Furthermore, it has been shown that using a grade criterion that is more comparable for women and men will reduce that prediction error (Clark & Grandy, 1984; Elliott & Strenta, 1988; Gamache & Novick, 1985; Hewitt & Goldman, 1975; Leonard & Jiang, 1995; McCornack & McLeod, 1988; Pennock-Roman, 1994; Ramist, Lewis, McCamley-Jenkins, 1994; Young, 1991. Stricker, Rock, & Burton [1993] is an exception.) A similar principle applies to variation in grading standards from school to school. College admission officers have long known that a B from one high school is not necessarily equivalent to a B from another school. Students with equal test scores tend to make higher grades in schools with low average test scores than in schools with high average scores. Early on, individual colleges were interested in adjusting the grades of applicants from different high schools in order to improve the correlation between school grades and grades that students earn after enrollment in college (Burnham, 1954; Dressel, 1939). Bloom and Peters (1961) stimulated great interest in this topic. They purported to show that college grade

prediction and college selection decisions could be improved substantially by rescaling the grades of individual high schools. A flurry of research on grade adjustment ensued; see Linn (1966) for a thorough review. The main result of this research was to show that the book by Bloom and Peters had raised a false hope. Subsequent work soon showed that adjusting school grades did not improve prediction of college grades when admissions tests were included and regression equations were cross-validated (Lindquist, 1963; Willingham, 1963a). More elaborate statistical models have not shown promise of changing that result (Tucker, 1960). Numerous studies in this period did document in some detail, however, that grading standards do vary from school to school. Analyses of college grades yielded similar results (Astin, 1971; Braun & Szatrowski, 1984; Rock & Evans, 1982; Willingham, 1961). An important implication of this work is that institutional grading variations can be an important source of discrepancies between grade averages and test scores if the grade averages come from different schools or different colleges.

Student Characteristics

Understanding the characteristics of students and their environment that are most associated with achievement in school is a challenge that has fascinated researchers for many years. It is a complex topic with major sub-themes concerning academic work habits, the role of activities outside of class, attitudes, influence of family and peers, and so on. The purpose of this section is to review briefly the results of previous research on the major types of variables that may hold some promise for understanding differences in grade performance and test performance.


One can think broadly of such variables as either behavioral or contextual; the latter referring to the backdrop of attitudes, family situation, and peer relationships that often influence a student’s behavior. Two other important sources of variation—subgroup membership and school differences—are beyond the scope of this review, partly because they are not used in this study as explanatory variables. Ethnic and gender identification are not treated as independent variables in order to see how and to what extent other explanatory factors account for observed results in differential prediction among such groups. In our analytic model, school origin is associated with the possibility of grading variations, holding test performance constant. The data will allow a test of that proposition. Historically, analysis of school effects on achievement has focused on different issues, especially the effects of school organization, financing, and educational practices on test performance (Coleman et al., 1966; Jencks et al., 1972; Wenglinsky, 1997).

This review includes studies that use both grades and test scores as performance criteria in order to locate any variables that might provide a better understanding of student achievement. Behavior or conditions that foster achievement will likely enhance both test and grade performance. Note, however, that the ultimate interest is differential grade performance; that is, high or low grades in relation to test scores covering similar subject matter. Shedding light on that question requires a demanding combination of grades, scores, and other useful measures from the same sample in the same situation. For the moment, we attend to a broad range of potentially useful studies. They fall into several distinct topics, only generally related.

College prediction. The practical interests of college admissions officers have, for many years, encouraged researchers to search for characteristics that might account for


students making higher or lower college grades than their admissions tests would lead one to expect (Fishman & Pasanella, 1960). Such characteristics include a variety of information often referred to as biographical factors (background, activities, attitudes, academic behaviors, etc.). For the purposes of the present study, prediction research is hampered by several limitations. First, self-reported biographical data collected in connection with admissions may not be entirely trustworthy. Second, such data are limited by what a college considers appropriate to ask of applicants. Third, variables may be limited in their usefulness in predicting college achievement because student attitudes and habits do not necessarily transfer to a new situation. Finally, the useful information that a variable might offer about positive or negative achievement tendencies may already be represented in the high school average that is routinely used as a predictor along with admissions tests. Nonetheless, such research can provide useful clues for the present investigation. Reviewing studies undertaken during the height of prediction research, Freeberg (1967) remarked on the somewhat conflicting results but noted a frequent finding that higher grades are associated with positive attitudes about education and good study habits. Willingham (1963b, 1965) coded a number of items in a college application blank and found 15 that were significantly related to college grades. Some were especially notable: a strong record in school (but not community) activities, expressions of confidence, a willingness to undertake demanding academic work, and (negatively) any sign of hedging in the school’s recommendation. Astin (1971) obtained a variety of information, including self-ratings, from a large national sample of college matriculants. Among the 13 personal characteristics that were useful in predicting college grades, confidence was also a prominent factor in this study.


Several types of behavior reported by students were associated with poorer grades than expected on the basis of test scores and the prior academic record: turning in work late, coming late to class, making wisecracks in class, and going to movies frequently.

To what extent is such information helpful in accounting for errors in predicting grades in college? In his review of biographical inventories and their usefulness in admissions, Breland (1981) cautioned that many of the apparently positive results are not promising because studies are often based on small samples or were not cross-validated on a new class. In Astin’s large study of matriculants, personal characteristics raised the multiple correlation with college GPA from .56 to .60 (Astin, 1971, p. 12). In Willingham’s study of applicants, an application composite based on promising personal characteristics had a cross-validated correlation of .48 with freshman GPA; corresponding validity coefficients were .46 for the SAT and .49 for High School Average (Willingham, 1963b). The application composite added .10 to the correlation between the SAT and freshman grade average, but only .02 when both SAT and High School Average (HSA) were used as predictors of freshman GPA. This pattern suggests that the HSA already included much of the useful information that might be gleaned from evidence of attitudes and behavior in high school.

Another college prediction study is interesting because it included information from high school as well as concurrent information about academic behavior in college. Stricker et al. (1991) identified several types of behavior that correlated with college GPA from .11 to .32 with SAT partialled out. In order of merit, they were: attendance, completing assignments, taking tests on schedule, a study skills scale, taking notes in class, and average years that key high school subjects were studied. The overall picture is that of the serious, conscientious student.


High school performance. Several analyses of large databases in the past decade have demonstrated the variety of variables that might help to account for differences in grade and test performance among high school students. In Ekstrom’s (1994) analysis of High School and Beyond (HSB) data, school attitudes and behavior added significantly to test scores in accounting for English grades. School behaviors showing significant relationships with grades included hours of homework, attendance, discipline problems, and coming to class unprepared. A study of 2500 ninth-grade students in Indiana also demonstrated positive relationships between school attitudes and grades, but is more notable for arguing the influence of parent education and expectations on student aspiration and attainment (Hossler & Stage, 1992).

Ekstrom, Goertz, and Rock (1988) reported a more detailed analysis of the HSB data. This study identified six student characteristics that showed some independent contribution in accounting for differential grade performance. Among these, two aspects of student behavior showed the strongest effects: behavior problems and time on homework. Other significant contributors were school activities, parent aspirations, parent involvement in program planning, and locus of control. Hanson and Ginsburg (1988) reported similar results from their analysis of the sophomore HSB cohort. These authors examined student performance from a value perspective and stressed the notion of “responsibility” in explaining the predictive value of this nest of variables. Finn (1993) showed that a pattern of participation in school was related to test performance among eighth-grade NELS students. Positive school behavior has also been referred to as effective “studenting” (Cole, 1997).

Homework is always a hot topic with parents and in the public press. Keith (1982) examined the role of time on homework in explaining grade performance in the HSB


sample. He showed that, within ability strata, time on homework had a quite linear additive effect on grades earned, and was not subject to diminishing returns as other writers had suggested (Frederick & Walberg, 1980). Some evidence indicates that the portion of homework completed may have a stronger effect on grades earned than does time on homework. Cooper, Lindsay, Nye, and Greathouse (1998) provide data suggesting that this effect may operate through teachers’ evaluation of homework assignments rather than homework influencing acquisition of knowledge and skill, which in turn affect grades.

One theme is particularly evident in the results of these studies. The students who make good grades in relation to test performance are those who behave like serious students. They come to class, they participate positively rather than disrupt, and they do their homework. Three other factors that deserve attention are the activities in which students engage, the attitudes they bring to school, and the influence of family and peers on those attitudes. We consider each of those topics in turn.

Activities. The relationship of so-called extracurricular activities to academic performance has long been a topic of great interest to many educators. Some have studied student activities with a view to broadening the public view of talent, admissions criteria, and useful outcomes of education (Richards, Holland, & Lutz, 1967; Taber & Hackman, 1976; Willingham, 1974, 1985; Wing & Wallach, 1971). Richards et al. (1967) concluded that nonacademic accomplishments in high school can be assessed with moderate reliability, are related to similar achievements in college, but are largely independent of academic achievement. Werts (1967) challenged that interpretation, and among other arguments, pointed to a correlation of .37 between high school average and a composite of 18 extracurricular


achievement items. Hanks and Eckland (1976) also reported strong connections between academic and nonacademic achievement but distinguished sharply between social and athletic activities. Achievement in these two domains correlated .38 and .05, respectively, with high school grades. Spady (1970) argued that while athletics can be a major source of peer status, it is only through other service and leadership activities that students achieve success in academic and later life. Such views were part of a continuing debate as to whether extracurricular activities have a causal relationship to beneficial outcomes of education (Brown, 1988; Holland & Andre, 1988; Steinberg, Brown, Cider, Kaczmarek, & Lazzaro, 1988).

A more limited question, more germane to this investigation, is whether engaging in activities outside the classroom is associated with the more strictly academic outcomes and with grades specifically. To see extracurricular achievement as a potentially useful variable in identifying possible sources of grade-test score discrepancies, it is only necessary to assume, along with Marsh (1992b), that such achievement represents “commitment-to-school.” Marsh found that a composite measure of extracurricular achievement had positive relationships with a number of secondary and postsecondary outcomes (e.g., correlated .23 with high school grades). Since this composite included athletic as well as community activities, Marsh (1992b, p. 560) interpreted his results to be inconsistent with Coleman’s (1961) zero-sum model wherein different domains of activity compete for the student’s time.

On the other hand, in a second study Marsh (1991) found that a number of school outcomes were negatively related to total hours of employment—evidently a competing activity. Marsh noted a positive relationship between academic achievement and employment that students undertook in order to earn money for college. He explained the apparent


contradiction in the two studies by arguing that the values represented in activities are important, not the hours. Nevertheless, it does seem likely that activities—even ostensibly positive activities—can have negative effects on academic performance if they significantly divert time or commitment from schoolwork. Some investigators have found that employment has adverse effects on schoolwork only when students work 20 or more hours a week (Steinberg et al., 1988). We do not yet know how consequential the zero-sum reality might be if a number of competing activities are considered together (e.g., employment, childcare, community work, athletics, socializing, gangs, TV and video games).

A quite recent study further supports the proposition that after-school activities have a positive or negative effect on differential grade performance, depending upon whether they contribute to or compete with a student’s school responsibilities. Cooper, Valentine, Nye, and Lindsay (1999) found that residual grade performance (i.e., test performance controlled) was positively correlated with extracurricular activities and amount of homework finished, but negatively correlated with watching TV and (marginally) number of hours employed.

Another technical issue suggests that activity measures cannot always be taken at face value. One might suppose that students have greater opportunity for a wider range of activities in a large high school but face more competition due to limited spaces available, especially regarding leadership positions. There is some evidence to support the latter assumption. Students in small schools evidently perceive ample extracurricular opportunity (National Center for Education Statistics, 1995) and participate at a higher frequency than do students at large schools. In a national sample of seniors, Lindsay (1982, 1984) reported that participation correlated –.22 with school size.


Student attitudes. Teachers probably need no hard evidence to know that students’ attitudes about themselves and their education can have a marked effect on their behavior and performance in school. In recent years the “academic self-concept” has provided an active research focus for this common-sense observation. Self-concept instruments typically correlate positively with performance in school. But in reviewing the nature of the academic self-concept and its relationship to academic performance, Byrne (1996) observed that such correlations are widely discrepant. Research on several technical issues in recent years has helped to clarify the nature of academic self-concept and its relationship to performance (see Byrne, 1996; Marsh, Byrne, & Yeung, 1999; and Marsh & Yeung, 1997 for useful reviews). This work is applicable to the present investigation for its implications regarding design of the proposed analysis.

Through the work of several investigators in recent years, it has become progressively clearer that self-concept is best viewed as a complex hierarchical structure. Researchers have distinguished general self-concept, academic self-concept, and subareas of self-concept based on broad skill areas (especially quantitative and verbal) as well as particular academic subjects (Byrne, 1986; Marsh, 1990a; Marsh, Byrne, & Shavelson, 1988; Marsh & Shavelson, 1985). Marsh (1992a) has emphasized that the more specifically the self-concept refers to a particular academic subject, the stronger its effects on performance in that area. This empirical observation raises a caution in studying how strongly self-concept measures relate to grade vs. test performance. Self-ratings of confidence in very specific academic areas may exaggerate the role of self-concept as a source of grade-test score discrepancy. In fact, self-ratings in a particular subject may depend heavily on grades in that academic area. The issue of domain specificity exacerbates what Byrne (1996, p. 302)


called “the most perplexing and illusory” issue in studying the relationship of self-concept to performance in school. Does self-concept cause grades, or do grades cause self-concept? In the former case, the argument that attitudes play a functional role in helping to distinguish grades and test scores is more convincing. If attitudes do influence effort, then a relatively strong correlation between attitudes and grades reflects a greater sensitivity of grades to the level of student effort than would normally be the case with an external test score.

In a pioneering study, Byrne (1986) concluded that neither academic achievement nor self-concept had causal dominance. Using a more rigorous multi-wave design than had previous investigators, Marsh (1990b) concluded that grade averages in Grades 11 and 12 were affected by academic self-concept measured the previous year, whereas prior grades did not affect subsequent measures of self-concept. However, in a subsequent study within three academic subjects, Marsh and Yeung (1997) found that the effects of achievement on self-concept were somewhat larger and more systematic than were the effects of self-concept on achievement. The authors saw the results as a contribution to “the growing body of research—particularly at the high school level—in support of the reciprocal effects model” (Marsh & Yeung, 1997, p. 50; see also Marsh et al., 1999). As Byrne (1996) lamented, the evidence leans both ways.

Another technical problem has received considerable attention. It is called the big-fish-little-pond effect and derives from the fact that from school to school, average ability is negatively associated with average academic self-concept (Marsh, 1987, 1994; Marsh & Parker, 1984). As a result, students of similar academic ability tend to have higher academic self-concepts in low-ability schools than in high-ability schools. Soares and Soares (1969) first reported this phenomenon in their work with disadvantaged children. It is a frame-of-


reference issue parallel to variation in school grading standards that was previously discussed, and both phenomena can be seen in the same data (Marsh, 1994). The implication for the present investigation is to illustrate a further advantage of analyzing data in a pooled within-school sample in order to avoid distortions due to spurious school differences.

Despite these various difficulties, academic self-concept has proven to be a useful measure in research. More pertinent to the present investigation was a meta-analysis reporting that, on average, self-concept correlated .34 with grades and .22 with several types of achievement tests (Hansford & Hattie, 1982, Table VIII). That pattern suggests that academic self-concept is a variable that may be useful in accounting for grade-test score differences.

Locus-of-control is a somewhat related attitude measure that has been included in the HSB and NELS surveys. It refers to a tendency of people to feel responsible for things that happen to them (i.e., internal control) or to feel that forces beyond their control determine outcomes in life (i.e., external control). This measure showed some promise in HSB data (Ekstrom et al., 1988) but does not appear consistently to have the same favorable pattern of correlations just cited for academic self-concept (Findley & Cooper, 1983).

Eccles (1983) has proposed a framework of “expectancies, values, and academic behaviors” that suggests a much broader range of beneficial student attitudes than does a positive academic self-concept alone. Her theoretical orientation assumes a very practical network of manifest behaviors and interpersonal relationships—all necessary elements in a student’s pursuit of effective personal development through educational achievement. Thus, positive academic attitudes are reflected in aspiration and planning, in recognizing the value of taking certain courses and working hard, in positive relationships with parents and teachers, and in choosing peers who help to define and reinforce academic commitments and beneficial


habits. Many such behaviors were noted in studies previously discussed. Such results suggest the need for additional comments on the family—a final domain of potential influence on student grades.

Family. Sociologists have established in many studies that socioeconomic status (SES) of the family is positively related to educational attainment (Jencks et al., 1972). It has also been widely assumed that parental encouragement has an independent positive effect on the educational aspirations of students (Sewell & Shah, 1968). Harris (1995) has challenged that conventional view, arguing that peer culture, not family, plays the dominant role in shaping the behavior, personality, language, motivation, and values of young people. Most recently, it has been argued that parents do influence children, but mainly through indirect effects and interactions with other variables (Collins, Maccoby, Steinberg, Hetherington, & Bornstein, 2000).

With regard to the more specific question addressed in this investigation, it is not clear whether the family has more effect on cognitive skills developed over time and probably better represented in test scores, or on school achievement better represented in a more proximal measure like grades. The analysis of HSB data by Rock, Ekstrom, Goertz, and Pollack (1986) did indicate that the family had an influence on differential grade performance (i.e., controlling for test performance). Other studies have indicated that the educational aspirations of students are influenced by the parents’ aspirations on their behalf (Eccles, 1983; Hossler & Stage, 1992) and that parents may directly influence grade performance of students only through specific types of parenting behavior (DiMaggio, 1982; Lamborn, Mounts, Steinberg, & Dornbusch, 1991).


Researchers have had a special interest in the influence of the single-parent family on students and their academic performance. There is a small relationship between family structure and test performance (higher average scores for students in intact families), but those differences are evidently explained by demographic differences (Milne, Myers, Rosenthal, & Ginsburg, 1986). In another study, such differences associated with family structure were larger on grades than on test scores (Mulkey, Crain, & Harrington, 1992). In the latter analysis, test differences connected with family structure also vanished when background was controlled. Residual grade differences were associated with student behaviors; for example, absenteeism, not doing homework, frequent dating, not talking to parents. Marsh (1990a) found similar effects with HSB senior data but not when sophomore outcomes were controlled.

Teacher Ratings

Several considerations suggest that teacher ratings may be useful in better understanding the observed discrepancies between grade performance and test performance. First, teachers’ ratings should give an indication of whether a student tends to perform well on local instructional goals—outcomes likely to be reflected more accurately in grades than in external test scores. Second, teachers should be able to provide good evidence not otherwise attainable regarding student behavior that can result in a direct debit or credit when grades are assigned (e.g., work habits, class behavior). Also, teachers may be able to provide independent evidence regarding complex skills that are relevant to course objectives and reflected in course grades but not ordinarily reflected in test scores (e.g., performing skills, ability to explain the subject to others).


In considering whether teacher ratings can provide useful evidence regarding observed discrepancies in grade-test performance, it is useful first to examine the obverse. Do teacher evaluations of students provide a reasonably accurate estimate of a student’s subject knowledge and skill as represented in test performance? In a review of 16 studies, Hoge and Coladarci (1989) found average correlations in the mid-.60s between teacher judgments and standardized achievement tests. On this basis they concluded “high validity” for the teacher-judgment measures. Another reasonable conclusion is that there are substantial differences between the two measures. What other information do the teacher judgments offer?

Some years ago, Davis (1965) carried out a series of studies to identify what components make up the college teacher’s perception of the valued student. Based on faculty ratings of students on a number of traits, Davis identified 16 factors, five of which had a consequential relationship to college grades. The relationship with grades was high for ratings of “academic performance,” moderate for “intellectual curiosity” and “orientation to tasks,” and low for “creativity” and “achievement motivation.” There is limited evidence in these particular studies as to whether faculty grading is influenced by student behavior other than the specific knowledge and skills pertinent to the course.

What happens when the research gives such factors more direct attention? Pedulla, Airasian, and Madaus (1980) offered interesting data based on 170 teachers and 2617 fifth-grade students in Ireland. Factoring three tests and 15 teacher ratings, they found three factors: one based on behavior in school, another on academic work habits (also behavior), and a third loading on tests and ratings of academic achievement. Teacher ratings of both behavioral factors were more highly related to grades than to test performance,


suggesting that the teachers’ judgments were reflecting a noncognitive grade component to some degree.

Ekstrom (1994) factored teacher comments on students that were collected in the High School and Beyond longitudinal survey of high school students. The first factor was defined as a teacher comments composite. The composite looked like student motivation, loading mainly on “self-discipline,” “seems to dislike school,” and “will probably go to college.” This teacher comment composite was correlated—typically in the .20s—with several items from the questionnaire completed by students; in order of magnitude, they were discipline problems, attitudes about school, anxiety about school, educational aspirations, attendance, and coming to class unprepared. This teacher comment composite added to tests and other student characteristics in predicting grade performance. The multiple correlation with English grade average as of the sophomore year was .68 with teacher comments included, and .63 without.

Teacher ratings of behavior can be unduly influenced by the grades a student earns, especially if the ratings are collected at about the same time as the grades are assigned. A study by Farkas, Grobe, Sheehan, and Shaun (1990) appears to illustrate the point. These authors reported a concurrent correlation of .77 (N = 486) between teachers’ ratings of students’ work habits and their grades in social studies—quite high for a single course grade. It is also possible that teacher ratings can be a source of discrepancy between test scores and grades either because teacher judgments influence grades (Caldwell & Hartnett, 1967) or because their judgments may become self-fulfilling prophecies in student learning and grade performance (Rosenthal & Jacobson, 1968). The specter of the self-fulfilling bias has been an active and controversial research topic (Brophy, 1983). Researchers continue to


interpret results in different situations either as evidence of only minor effects (Jussim, 1989) or as substantial effects (Babad, Inbar, & Rosenthal, 1982).

Implications for This Study

The foregoing review of previous research has focused on grading practices, student characteristics, and teachers’ ratings of students. The NELS Study includes extensive information on these three topics, and each holds promise for understanding performance in school, especially differential grade performance in relation to test performance. Results of the previous work suggest a number of points that bear especially upon the choice of variables and analyses that are likely to be most appropriate to the current study.

• A long history of research indicates that grading patterns vary from time to time and situation to situation. Grading standards often vary among instructors, courses, schools, academic majors, and colleges. Correction for such differences has typically enhanced the relationship between grades and test scores and reduced any differential prediction for gender and ethnic groups. Differences in standards across courses and across schools seem the most likely sources for grading variations that can result in observed discrepancies between grades and test scores.



• Because of the multiple purposes served by grades, teachers and schools often report taking into account student behavior and other factors in addition to knowledge and skills relevant to a course when they assign grades. Thus, various aspects of effective school skills such as attendance, class participation, discipline, and timely completion of assigned work may be important components of grades that are not represented in test scores.



• In addition to using effective school skills, initiative and involvement with school also tend to be associated with higher academic achievement—higher test performance and especially higher grades. Evidence of student initiative such as taking a strong program of demanding courses, participating in school activities, and avoiding activities that might compete with academic work all tend to be more strongly associated with grades than with test performance. Evidence to date suggests studying student activities in several separate spheres: school activities (athletic distinct from non-athletic), community activities, and other ways that students may spend their time in competition with schoolwork.

• Effective school skills and initiative tend to predict future grades as well as concurrent grades. That parallel pattern suggests that such characteristics of students represent a somewhat stable orientation to schooling. Thus, some student behaviors, like turning in homework, may result in differences between grades and test scores in several ways: quite directly if the teacher gives extra grade credit for completing assignments, also directly if doing the assigned work enhances the specific knowledge and skill that results in a higher grade in that particular course, and indirectly if such behavior reinforces an habitual pattern of commitment to academic work and commensurate payoff in better grades.



• Various aspects of the student’s family life and socioeconomic status have shown promise in helping to account for academic performance. There is wide agreement among researchers that positive attitudes foster achievement in school—especially attitudes reflecting confidence in specific subjects and aspirations regarding educational goals. Since it is also likely that good grades foster good attitudes, it is important to guard against spurious attitude “effects” that could result from using attitude variables that are obviously dependent upon past achievement. This is a form of confounding or experimental dependence—a design hazard to avoid in trying to understand the dynamics of school achievement.

• Another methodological consideration concerns the context of data analysis. Different lines of evidence suggest the importance of studying academic achievement within schools. One is the well-documented difference in grading standards across schools. Another is the noncomparability of some student characteristics—especially attitudes and extracurricular activities—in larger and smaller schools. These problems recommend a within-school analysis; that is, an analysis based upon deviation scores after school means have been subtracted from all scores for all variables (a minimal sketch of such an analysis follows this list). Much evidence also underscores the importance of correcting for range restriction in such analyses.



• Studies of teacher ratings indicate that teachers can be a useful source of information as to what student characteristics are associated with achievement in school. Such ratings are typically more highly related to grades than to test scores, though teacher judgment can be unduly influenced by the grades that students have earned—another form of confounding. For that reason, it may be important to use teacher ratings that are not collected at the same time that grades are assigned and that are focused on specific behaviors of students rather than their academic reputation.
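To make the within-school recommendation above concrete, the following minimal sketch (Python; the file and column names such as school, hsa, and test are hypothetical, not NELS variable names) shows how deviation scores are formed by subtracting school means from each variable before correlations are computed. The further correction for range restriction is taken up later, in the description of the statistical analysis.

import pandas as pd

# Hypothetical long-format student file: one row per student, with a school
# identifier and the variables to be analyzed (grade average, test score, etc.).
df = pd.read_csv("nels_students.csv")        # assumed columns: school, hsa, test

analysis_vars = ["hsa", "test"]              # other student variables would be added here

# Within-school deviation scores: subtract each school's mean from its students'
# scores, so every variable has a mean of zero within every school.
deviations = df.groupby("school")[analysis_vars].transform(lambda x: x - x.mean())

# Pooled within-school correlation, free of between-school differences in grading level.
pooled_r = deviations["hsa"].corr(deviations["test"])
print(round(pooled_r, 3))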


Study Design

The National Education Longitudinal Study of 1988 (NELS) is part of a major long-term program of the National Center for Education Statistics (NCES) to study representative samples of students as they progress through elementary school, high school, and beyond. “The general aim of the NELS program is to study the educational, vocational, and personal development of students at various grade levels, and the personal, familial, social, institutional, and cultural factors that may affect that development” (Ingels et al., 1994, p. 1). The NELS database comprises an unusually broad and detailed record of students’ characteristics, attitudes, experiences, and achievements.

The analyses here reported are based largely on information from the NELS second follow-up. Most of these data were collected in January through March of 1992, when the survey students were in their senior year, though some other data were gathered as late as the summer of 1992. Five types of data were employed: tests administered in the senior year, a senior student questionnaire, a full course-grade transcript for grades 9 through 12, teacher ratings collected in the sophomore year, and some additional information from school records.

The general approach in this analysis is to “correct” for the five factors discussed in the previous section by introducing each, in turn, as adjustments or predictors in a multiple regression analysis that starts with the concurrent correlation between average grade performance and NELS Test performance. To what extent can the five factors account for differential grade performance, the common tendency for students to make grades somewhat higher or lower than expected on the basis of a test covering generally similar academic material? To what extent can the multiple R be raised and differential prediction be lowered? The analytic approach requires close attention to the sample of students, the selection and


definition of variables, and the statistical procedures. Appendix B describes the student variables and lists acronyms used here.

The Sample

The 12th grade NELS cohort numbered 17,153 students. We restricted this study sample to students with essential data; i.e., those who took the NELS Test, filled out a questionnaire, and had transcript data with no missing records. Students in special education and bilingual programs were not included since the comparability of their grades and test scores was uncertain. With these constraints, the original sample with requisite data numbered 10,849.

Ten subgroups were available for analysis: two genders, four ethnic groups (African-American, Asian-American, Hispanic, and White), and four school programs (Rigorous Academic, Academic, Academic-Vocational, Vocational). For research purposes, NCES assigned students to school programs insofar as possible on the basis of the pattern of coursework in each student’s record. In the sample employed for this analysis, some 90% were assignable to the four programs indicated. Each of these 10 groups included sufficient students for separate analysis, but an additional constraint on the sample had to be taken into account.

Assessing grading variations required at least a minimal number of students in each high school—the more the better in order to stabilize any grade-score differences among schools or courses. From this perspective, the database posed a problem because some of the original cohort in the 8th grade moved to other cities or districts. NELS followed those students to their new school even if there were no other survey participants in that school. This meant adding many schools that had only one or two students in the NELS study—too


few for our purposes. Deciding on a minimum school sample size involved conflicting considerations. Setting a higher minimum number of survey participants per school promised to yield more stable data in individual schools, but that constraint would simultaneously reduce the total sample as well as subgroup representation. As it happened, setting the minimum school sample size above 10 quickly reduced subgroup samples to a level too small for analysis but offered very little improvement in the typical size of course or school samples. It was not necessary to go below a minimum school sample of 10 to obtain sufficient data (i.e., a minimum of 400 in all subgroups of interest). There were 581 schools having 10 or more students with the requisite data. Together, these schools yielded a reduced student sample of 8454 (78% of the original sample meeting all data constraints). Subgroup sample sizes are shown in Table 2.

______________
Insert Table 2 about here
______________

The most obvious difference between the reduced sample and the original sample with requisite data would be a smaller number of students in the reduced sample who had changed high schools. A substantial proportion of students who moved would necessarily transfer into a school with few other NELS students because the new school was often not in the original NELS sample. This final, reduced sample appeared similar to the sample with full data in most respects. For example, the average test score differed by .035 SD; the standard deviation differed by .3%. Ethnic minority representation was 26% in the original sample, 23% in the reduced sample. Also, in analysis of NELS data in grade eight, little association was found between family moves and school performance level (Finn, 1993, p. 66).
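In data-handling terms, the school-size restriction just described is a simple filter over the student file. A minimal sketch follows (Python; the file and column names are hypothetical placeholders for the NELS data files).

import pandas as pd

# Hypothetical file of students meeting the data requirements (NELS Test,
# questionnaire, and complete transcript), one row per student.
students = pd.read_csv("nels_requisite_data.csv")    # assumed columns: student_id, school

MIN_PER_SCHOOL = 10    # minimum number of survey participants per school

school_counts = students["school"].value_counts()
eligible_schools = school_counts[school_counts >= MIN_PER_SCHOOL].index

reduced = students[students["school"].isin(eligible_schools)]
print(reduced["school"].nunique(), len(reduced))     # e.g., 581 schools, 8454 students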


Nevertheless, it is impossible to say what self-selected considerations might be involved. Because this selective influence rendered the sample not necessarily representative of high school students nationally, NELS sample weights were not employed in the analysis. Another reason for not using weighted data was the possibly distorting effects that wide variations in sample weights might have on the estimation of local grading parameters in small school samples.

Due to these sampling limitations, the analysis is best regarded as exploratory; statistical tests of hypotheses were not carried out, and standard errors of statistics were not estimated. Results should be interpreted as descriptive of characteristics observed in a database of 8454 students attending 581 schools throughout the country, who earned some 187,000 credits in 21,000 high school courses. It can be expected that the results reported here are similar to what would be found with another similar group of schools, courses, and students. Needless to say, results would more likely be similar in a group of 4125 males than in a group of 402 vocational students, the latter being the smallest group in which correlational analyses were undertaken.

Tests and Grade Averages

The NELS database provides four test scores and four grade averages covering generally similar academic subject matter over the same time period. Students were administered “a series of curriculum-sensitive cognitive tests to measure educational achievement and cognitive growth between the eighth and twelfth grades in four subject areas—reading, mathematics, science, and social studies (history, geography, civics)” (Ingels et al., 1994, p. 7). The complete battery comprised 116 multiple-choice items. Ingels et al. (1994) described the four tests as follows:




Reading Comprehension. (21 questions, 21 minutes) This subtest contained five short reading passages or parts of passages, with three to five questions about the content of each. Questions encompassed understanding the meaning of words in context, identifying figures of speech, interpreting the author’s perspective, and evaluating the passage as a whole.



Mathematics. (40 questions, 30 minutes) Test items included word problems, graphs, equations, quantitative comparisons, and geometric figures. Some questions could be answered by simple application of skills or knowledge; others required the student to demonstrate a more advanced level of comprehension and/or problem solving.



Science. (25 questions, 20 minutes) The science test contained questions drawn from the fields of life science, earth science, and physical science/chemistry. Emphasis was placed on understanding of underlying concepts rather than retention of isolated facts.



Social Studies. (History/Citizenship/Geography—30 questions, 14 minutes) American history questions addressed important issues and events in political and economic history from colonial times through the recent past. Citizenship items included questions on the workings of the federal government and the rights and obligations of citizens. The geography questions touched on patterns of settlement and food production shared by other societies as well as our own.

Student grades were obtained from the NELS Transcript Component Data File (Ingels et

al., 1995). The file contains an enormous amount of detailed information regarding individual term grades for all courses attempted by each student. Working with individual student transcripts and curriculum specifications from the schools, NELS slotted each course as accurately as possible into a single comprehensive framework of 1540 courses with different


titles. The total number of courses in all schools was only 21,000 because many courses were available in only a few schools. It is impossible to say to what extent courses with the same title (e.g., Precalculus, European History, or Introduction to Computers) may have actually covered somewhat different subject matter from school to school. In addition to different types of courses within mathematics, history, science, etc., the framework included placement levels (remedial, honors, etc.) and the year (academic and calendar) in which the course was taken. NELS also transformed different school grading systems to a common scale of 1 to 13 (i.e., A+ to F).

Our intended analyses required grades for individual courses as well as several averages and summary indices based on the total transcript. The unusual size of the NELS transcript tape and the noncomparability of course information from school to school complicated the compilation of these various measures. For our purposes, it was necessary to create a course-grade file in which numerous instances of multiple grades for the same student for the same course title (repeats and multiple terms of different lengths in different schools) were appropriately weighted and averaged in order to provide a single grade and a comparable term length for each course. This dedicated file provided for each student a measure of total course hours and total course credits expressed on a common Carnegie Unit (CU) base. The file contained 311,607 individual course grades based on all 1540 courses in the NELS framework, not counting a few service courses like physical education and driver training. In this analysis, grades were converted to a “4.0” scale—actually 4.3 because the original scale provided for an A+ grade. These course grades, weighted by CU hours, yielded the data necessary for the analysis of grading variations from course to course. The course


grades also served as the basis for a High School Average, HSA(T), based on each student’s total transcript. The correlation between this HSA(T) and the unweighted total score on the NELS Test provided the initial baseline indication of the relationship between grade performance and test performance for the NELS graduating seniors.

In order to take account of Factor 1 (Subjects Covered), it was also necessary to compute a grade average based on subject matter generally comparable to that represented on the NELS Test. As it happened, NELS had already computed for each student a set of grade averages corresponding to the so-called “New Basics” subject areas (Ingels et al., 1995, p. 56 and Appendix H). The NELS Tests correspond reasonably well to four of these six subject areas (excepting Foreign Languages and Computer Science):

• English (113 courses)
• Mathematics (47 courses)
• Science (74 courses)
• Social Studies (256 courses)

For the present study, the mean of the students’ grade averages in these four subject areas was defined as HSA, the academic average. Unless otherwise specified, all of the analyses reported here are based upon this set of 490 academic courses, though as will be described, a correction for grading variations was applied in some analyses. HSA was based on a great variety of courses in these four academic areas, including advanced as well as remedial work. Courses that were represented in HSA(T) but not in HSA included all work in foreign languages and computer sciences, a wide variety of other special interest courses of an academic nature, and all vocational courses.
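The construction of HSA(T) and HSA amounts to a credit-weighted average of course grades on the 4.3 scale, followed (for HSA) by an unweighted mean of the four subject-area averages. The sketch below shows the general form of the computation (Python; the file and column names are illustrative assumptions, and grades are assumed already converted to the 4.3 scale).

import pandas as pd

# Hypothetical consolidated course-grade file: one record per student per course
# title, with the grade on the 4.3 (A+ = 4.3) scale and Carnegie Unit (CU) hours.
courses = pd.read_csv("course_grades.csv")    # assumed columns: student_id, subject, grade_43, cu_hours

def weighted_gpa(g):
    """CU-weighted grade average for one set of course records."""
    return (g["grade_43"] * g["cu_hours"]).sum() / g["cu_hours"].sum()

# HSA(T): CU-weighted average over the student's total transcript.
hsa_t = courses.groupby("student_id").apply(weighted_gpa)

# HSA: unweighted mean of the four New Basics subject-area averages
# (English, mathematics, science, social studies), each computed the same way.
academic = courses[courses["subject"].isin(["english", "mathematics", "science", "social_studies"])]
subject_means = academic.groupby(["student_id", "subject"]).apply(weighted_gpa).unstack()
hsa = subject_means.mean(axis=1)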


Other Variables in the Analysis

The selection and definition of variables used in this analysis were determined by our proposed approximation as discussed earlier, by what information was available in the NELS database, and by information gleaned from our review of previous research. The overriding interest was to identify a set of student characteristics and qualities that might help to account for differential grade performance—as here defined, the tendency of students to make somewhat higher or lower grades than would be expected on the basis of the NELS curriculum-based test scores. As prior research indicates, many such characteristics and qualities show some promise in this regard. All available measures were included if they appeared to be potentially useful and did not involve excessive missing data or raise technical problems such as those discussed below. Needless to say, even the very rich NELS database did not include all information that might be interesting. This is especially the case for information concerning the knowledge and skills expected of students in individual courses, the meaning of grades assigned, and student skills that might be particularly pertinent to test performance but not to grade performance.

As Table 3 indicates, 26 student variables were included under these five headings: School Skills, Initiative, Competing Activities, Family Background, and Student Attitudes. The measures reflect both behavior and context. The first two categories are clearly behavioral; these measures refer to things students do. Competing Activities are also behavioral but are likely to be more influenced by context and are therefore somewhat less under the student’s control. Student Attitudes and Family Background are more contextual. These latter variables refer less to behavior than to conditions and circumstances that can influence behavior. Appendix B provides a description of each variable. See Green, Dugoni,


Ingels, and Camburn (1995) for a “profile” that gives extensive information about the NELS seniors and how they vary by subgroup and background characteristics.

______________
Insert Table 3 about here
______________

In the main, student variables were developed and retained on the basis of their rationale and substance, not their interrelationships or validity in accounting for differential grade performance. Thus, the 26 student variables originally selected were retained throughout the analysis. Analytical relationships among possible measures were sometimes critical, however, in the initial choice and definition of variables to be used in the analysis. This was because, in actual practice, some pairs of variables proved to be mutually constrained or confounded. While we guarded against collinearity among the student characteristics, there was no avoiding the mirror-image “zero-sum” problem described by Coleman (1961). That problem is evident in Competing Activities. A student cannot easily score high on killing time, child care, employment, etc., all at the same time. Each of the variables in that category may have an important idiosyncratic bearing on school achievement, but statistically, they hardly form a coherent construct.

Ambiguous causality is also a special problem. For example, has involvement in honor societies contributed to a student having earned high grades relative to test scores, or has such involvement merely resulted from superior classroom performance? Thurstone (1947, p. 441) referred to confounding of this sort as experimental dependence and argued that it deserves close attention in the correlational analysis of behavioral data, especially data collected concurrently. It is useful to distinguish three levels of potential confounding in the relationship between individual variables and grade performance.


a) A statistical constraint can consistently increase or decrease the association between grades and another measure regardless of the underlying relationship. For example, a failing grade automatically lowers both the student’s grade average and the number of credit hours earned. This type of spurious dependence of one variable on another can often be identified and even statistically isolated.

b) An explicit dependence can influence the relationship of a variable to a student’s grade average even though the effect is not necessarily identifiable or consistent. Examples include the use of variables such as the following: attitudes like “I don’t do well in math,” enrollment in remedial English or in the Vocational curriculum (assignment to which may be directly influenced by the grade record), or an activities measure based partly on “member of an honor society.”

c) An implicit dependence may be subtle and unmeasurable but real, nonetheless. Examples include the following: partial dependence of self-esteem on grades earned, the likely tendency of students to form educational aspirations on the basis of academic performance, the tendency of students to like or dislike school partly on the basis of how they do there, the likelihood that decisions to enroll in demanding courses are influenced by a student’s grade history in that subject area.

We attempted to avoid the more clearly artifactual and potentially misleading forms of confounding, especially those in the first two categories above. Note, however, that a student’s level of academic motivation or degree of scholastic orientation is clearly influenced by a history of doing well or poorly in school. That is one of the key phenomena under study here—the student’s personal orientation to schooling. It is not a momentary state or a technical problem that is pertinent only to a concurrent analysis. Student attitudes about


school do change over time and circumstance, but a strong or weak commitment to school will likely be reflected in longitudinal as well as in concurrent analyses. Much of the national effort that goes into encouraging excellence and selecting good students for demanding educational programs is based on that assumption. In describing results of the analyses, we will come back to the possibility of spurious findings due to confounding.

In most cases, student variables were constructed as composites of several items in the Student Questionnaire or other types of information (see Table 3 and Appendix B). Developing composite variables served two purposes. One was to enhance generalizability by including different aspects and evidence of a characteristic. Another objective was to minimize missing data; at least partial information was available for almost all students on most variables.

NELS collected from classroom teachers a variety of ratings regarding the academic behavior and work habits of the participating students. These judgments provide a valuable supplement to the student variables because they come from a person who is presumably more objective than the student but is also experienced and informed. Five Teacher Rating variables were used, either based on individual ratings or based on composites of two or three ratings. The five variables, shown in Table 3, represent observable classroom behavior that is often cited as a legitimate consideration in grading or might be expected to influence teachers’ evaluations of students and the grades they assign.

Such ratings were collected in both the first and the second follow-up survey (in the middle of the sophomore and the senior year). Several considerations led to a decision to use the sophomore ratings. In the senior year, only one rating was obtained from either a science or a mathematics teacher who had a NELS student in her or his class. In the sophomore year,


ratings were obtained from two teachers balanced among English, mathematics, science, and social studies—one from a verbal and one from a quantitative discipline for each student. Thus, the two sophomore ratings are more reliable and more representative than is the single senior rating. Another consideration was the possibility of the teachers’ ratings being biased by their knowledge of the students’ grade records. Such influence is not unlikely, even though care was taken to avoid using any ratings with wording that suggested any direct dependence on the grade record (e.g., “I have talked to this student’s parents about his/her grades”). Concern about such possible confounding was also a major reason for using teacher ratings that pertained to student behavior, not academic achievement. Focusing on behavior undoubtedly reduced the likelihood that the Teacher Ratings would reflect some types of learning outcomes that may be reflected more in grades than in test scores; for example, social goals of education and individualized learning outcomes (Categories A.3.a and A.2.c, respectively, in Figure 1). Since more than half of the grade record came after the first follow-up teacher ratings, these ratings are, in part, predictors rather than concurrent evaluations. Using the sophomore ratings seemed a more conservative choice for valid and useful teacher ratings. Finally, this choice resulted in less missing data: 90% of our sample had at least one sophomore rating, while teacher ratings in the fourth year of high school were available for only about seven in ten participating seniors (Ingels et al., 1994).
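As noted above, most student variables were scored as composites of several questionnaire items (and the teacher variables as composites of one to three ratings) so that a student with some missing items still receives a score. A minimal sketch of that scoring rule follows (Python; the item names are hypothetical placeholders, not NELS variable names, and the standardization step is an illustrative assumption).

import pandas as pd

# Hypothetical questionnaire items contributing to one composite (e.g., School Skills).
items = ["homework_done", "comes_prepared", "attendance", "no_discipline_problems"]

responses = pd.read_csv("student_questionnaire.csv")   # assumed: student_id plus the items above

# Standardize each item, then average whatever items are present for each student,
# so partial information still yields a composite score.
z = (responses[items] - responses[items].mean()) / responses[items].std()
responses["school_skills"] = z.mean(axis=1)    # missing only if every item is missing

# Students missing all items remain missing and are handled later through pairwise deletion.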


Statistical Analysis

The statistical analysis proceeded in the following manner. It was proposed earlier that five major factors could help to explain discrepancies between grades and test scores; that is, improve the concurrent prediction of grade average. These corrections, Factors 1 through 5, were introduced successively. The presenting issue in this study concerns this question: What effect do these corrections have on grade prediction and on observed patterns of differential prediction for subgroups? Other questions concern how each factor operates; for example, what role is played by different components, or what happens if the factors are defined through different methods or in different sequence? In this connection, the nature of grading variations and of students’ engagement in their schooling deserves some special attention. Finally, we examined the generality of results across gender and ethnic groups and among high school programs.

Most of these analyses involve familiar applications of regression analysis, which are best described as results are reported. Missing data were not extensive and were handled through pairwise deletion. Two methodological issues require some initial explanation. These involve analysis of grading variations and estimation of the reliability of grade averages.

Adjusting for Grading Variations

Our review of the research literature identified various types of grading variations and a number of statistical methods that have been used for correcting such differences. The NELS transcript database provided sufficient information to analyze grading variations from school to school and from course to course, but not variations among instructors or sections. We employed relatively simple methods to correct only for average differences in grading


level (i.e., the overall strictness or leniency in relation to test score level). Intercept differences appear to largely account for observed variations in grading standards (Linn, 1966). Limiting the model in this manner makes the school and course-grading variations additive; the objective is to correct each in turn. The same methods can often apply to the correction of grading variations across different types of groups; e.g., students enrolled either in different courses or in different schools.

Figure 2 illustrates two methods of correcting grading variations—in the case depicted, variations across schools. High School Average is here regressed on a NELS composite test score. In each of the three panels, the regression line is based on all students in the total sample. The top panel represents scatterplots for four schools as they might appear in the original data. The middle panel shows how those plots would look after applying the “within-school” method. Similarly, the bottom panel illustrates the “residual method.”

______________
Insert Figure 2 about here
______________

In the within-school method, school means are subtracted from the observed scores for all variables (i.e., all grade averages, test scores, student characteristics, and teacher ratings). This creates a pooled within-school matrix of deviation scores where all variables have a mean of zero. The effect is to superimpose the school scatterplots so that all have their center at scale values of 0,0. The pooled within-school correlation matrix based on these deviation scores is then corrected for range restriction. This multivariate correction employs an extension of the Pearson-Lawley method (Gulliksen, 1950, p. 165). In all corrections for range restriction, we used a composite of the four NELS tests and the SES composite as explicit selection variables. These two were the only variables in the

As illustrated in the bottom panel of Figure 2, the residual method makes use of the average difference between the observed grades in a particular school and grades that would be predicted based on the regression line for the overall sample. For example, if the High School Averages (HSA) in a given school run low compared to those predicted on the basis of the NELS Test in the total sample, this mean residual is treated as the School Grading Factor (SGF) and all HSAs in the school are adjusted by that amount. As illustrated in the figure, when this correction for grading strictness is applied to each school grade scale, the effect is to move all school scatterplots to the overall regression line. Recognize, however, that in this method HSA is the only variable so corrected.

Ostensibly, the within-school and the residual methods are based on different assumptions. The residual method takes any difference between actual and predicted grade performance to be error that is best removed. The within-school method ignores school differences altogether and relies on individual differences (within schools) to best understand the nature of achievement. Earlier analysis of these two methods (Willingham, 1962b) suggests that they are actually quite similar in derivation. In the present situation, there appear to be two main analytical distinctions between the two methods. First, linear corrections generally akin to the residual method will tend to overfit, and to some degree,

inflate the relationship between the two measures in question.3 Second, scale distortions in other pertinent variables have been well documented. The within-school method corrects for such scale differences in other variables, while the residual method does not.

Previous investigators have demonstrated school contextual effects on other variables that are both psychological and statistical in character. One example is the “big-fish-little-pond” effect already noted; viz., students’ attitudes about themselves and their education can be inversely influenced by the average ability of students in the school that they attend (Marsh, 1987; Marsh & Parker, 1984). The size of a school has contradictory effects on two other variables of interest. Larger school enrollment is directly related to the comprehensiveness of the curriculum and therefore expands a student’s course-taking possibilities (Monk & Haller, 1993). On the other hand, smaller school enrollment expands a student’s possibilities for extracurricular achievements, because fewer people are competing for limited positions of honor (Lindsay, 1982, 1984). Holland and Andre (1987, p. 437) argued that, as a consequence, “Low-ability and lower SES students are more involved in school life in smaller schools.” The within-school analysis removes these sources of ‘noise.’ There is an implicit assumption; namely, that the resulting gain in accuracy outweighs any unappreciated signal associated with schools that may be lost in the process.

Both the overfitting problem and the evidence of scale distortion in other variables recommend a within-group analysis in the case of school differences. It is not clear that this method is appropriate for course-grading differences. In any event, samples were entirely too small for this latter purpose. Consequently, the within-school method was used first to correct school differences for all variables. Then the residual method was applied to the pooled within-school data matrix in order to correct grading differences among courses. In

subsequent discussion, this joint correction of school differences and course-grading differences is referred to as the “within-school” analysis. This usage clearly differentiates the method of correcting for school grading variations, which was the grading variation of consequence in the analysis.

Following Ramist et al. (1994), the course-grading correction was handled in the within-school analysis by adding an adjustment for course-grading strictness, K, to each student’s within-school grade average. Thus, the doubly corrected grade average becomes $HSA_w + K$, where $HSA_w = HSA - \overline{HSA}$ denotes the deviation score obtained by subtracting the mean HSA at a student’s school from that student’s average grade. The K correction for course-grading strictness was the average Course Grading Residual (CGR) for the particular courses in which the student was enrolled. In the within-school analysis CGR was defined for each course (j) as

$$ CGR_j \;=\; \frac{1}{N_j} \sum_{k=1}^{N_j} \Big[ \mathrm{Pred}\big(HSA_k - \overline{HSA}_{(k)}\big) \;-\; \big(CG_{jk} - \overline{HSA}_{(k)}\big) \Big] . $$

In this equation, $CG_{jk}$ represents the grade obtained by the kth student in course j, $\overline{HSA}_{(k)}$ is the mean HSA in this student’s school, $\mathrm{Pred}\big(HSA_k - \overline{HSA}_{(k)}\big)$ is the predicted value of this student’s $HSA_w$ based on a pooled within-school regression analysis,4 using the NELS Composite deviation score for the student as the predictor, and $N_j$ denotes the total number of students with grades in course j. In other words, the predicted $HSA_w$ is used as a baseline for determining how strict the grading is in a particular course. The Course Grading Residual (CGR) for a course is the average discrepancy between the predicted HSAs and the grades obtained by students in that course—all computed within schools on the basis of deviations

from each school mean, and then pooled across schools. Again, each student’s K score was simply the mean CGR for the courses taken by that student.

A complication in computing CGRs was the large number of empty or near-empty school-course cells due to the wide variations in courses elected and the diverse names that schools attach to courses.5 As a result, CGRs ultimately had to be based on all students taking a course with a given title; i.e., all students in Algebra I, all students in American History, etc., irrespective of school. Ignoring schools assumes that grading variations from course to course follow the same pattern in all schools. As will be reported among the results concerning grading variations, a special analysis was undertaken in order to determine what proportion of course-grading variation was lost by collapsing each course across schools in this manner.

An important practical consideration had a bearing on the methodology of analyzing grading variations. One research objective was to understand the effect of grading variations on the observed grades of individual students and on groups of students. Another useful objective, recognized in the course of the study, would be to examine the relationship between school grading variations and course-grading variations. These objectives required indexing the effect of grading variations for each student. In the within-school method, course-grading variations could be indexed for individual students through the K score, but school-grading variations could not be indexed in that method because the pooled within-school data set simply removed school differences altogether.

An alternate correction procedure, based solely on the across-school matrix (the original data set), was developed in order to examine independently the effects of school and course-grading variations and how each factor was related to other variables and to group

differences. In this way each grading factor becomes a separate predictor variable. Two grading factors were defined: a School Grading Factor (SGF) and a Course Grading Factor (CGF), which are given by

$$ SGF_i \;=\; \frac{1}{N_i} \sum_{k=1}^{N_i} \big[ \mathrm{Pred}(HSA_k) - HSA_k \big] $$

$$ CGF_j \;=\; \frac{1}{N_j} \sum_{k=1}^{N_j} \big[ \mathrm{Pred}(HSA_k) - CG_{jk} \big] $$
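In computational terms the two factors are simply grouped means of the same set of residuals. The sketch below is illustrative only—the data frames and column names are placeholders, not NELS variable names:

import numpy as np
import pandas as pd   # 'students' and 'records' are assumed to be pandas DataFrames

# Regression of HSA on the four NELS tests, fit once in the total sample.
tests = ["reading", "math", "science", "soc_studies"]
design = np.column_stack([np.ones(len(students)), students[tests].to_numpy()])
beta, *_ = np.linalg.lstsq(design, students["HSA"].to_numpy(), rcond=None)
students["pred_HSA"] = design @ beta

# SGF_i: mean (predicted minus actual) HSA in school i, shared by all of that school's students.
students["SGF"] = (students["pred_HSA"] - students["HSA"]).groupby(students["school_id"]).transform("mean")

# CGF_j: mean (predicted HSA minus obtained grade) over the N_j students graded in course j.
records = records.merge(students[["student_id", "pred_HSA"]], on="student_id")
records["CGF"] = (records["pred_HSA"] - records["grade"]).groupby(records["course_title"]).transform("mean")

Averaging each student's CGF values over the courses that student took yields the student-level course-grading index (MCGF) described below.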

Both were derived in a manner similar to the computation of K, described above. The equations are comparable to the one for CGR, but simpler because we are not here working with within-school deviation scores. Course-grade residuals for individual courses and HSA residuals for individual schools were determined from predictions based on the regression of HSA on the four NELS tests with original data in the total sample. When the grading factors are associated with individual students, the SGF for a given school is assigned to all students in that school. However, the index for course-grading variations for each student is the MCGF, the mean CGF for the particular courses that student took. The two grading factors, SGF and MCGF, were expressed on the same scale as HSA, so they could be used either as predictors or as criterion corrections. Since each factor represented a grading strictness “handicap,” the criterion correction simply entailed adding the two factors to the high school grade average; i.e., HSA + SGF + MCGF (labeled “HSA+2G”). In effect, this correction adds to or subtracts from each student’s HSA an amount appropriate to the strictness or leniency of grading in the student’s high school and the particular courses that the student elected to take.6 Applying the SGF correction alone gives the lower panel of Figure 2. Incorporating SGF and MCGF in HSA+2G made it possible to examine the relationship of variables to an original across-school HSA in which grading

variations were corrected. Using SGF and MCGF with other variables in their original form is hereafter referred to as an “across-school” analysis (in contrast to a within-school analysis). Because an across-school analysis makes it possible to examine how the effects of grading might vary among groups, this method was used in lieu of the within-school method in most of the analyses based on students in different school programs, gender, and ethnic groups. As we have noted, the across-school analysis has two shortcomings that influence the grade-test correlation in opposite directions. First, the method is likely to overfit. Second, it does not correct known scale distortions in other variables relevant to the analysis. Nevertheless, the across-school analysis serves as a partial check on the within-school method and provides additional evidence regarding the assumptions on which the methods are based.

Estimating Reliability

The proposed analyses required estimates of the reliability of high school grade averages and the four NELS tests for two purposes. Reliability estimates were needed in order to correct for measurement error as one factor in explaining discrepancies in grade and test performance. Reliability estimates were also needed in order to adjust for any differential effects of HSA reliability on analyses within different subgroups. It was assumed that HSA reliability would vary to some degree because subgroups took varying numbers of the academic courses on which HSA was based. The following reliabilities for the NELS tests were reported by NCES (Rock, Pollack, & Quinn, 1995):

Reading          .85
Mathematics      .94
Science          .82
Social Studies   .85

Standard equations for the reliability of a composite (see Note 7) were used to estimate the reliability of two other test variables:

NELS-T (the total NELS test score)                               .96
NELS-C (the best-weighted test composite for predicting HSA)    .95
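Note 7 is not reproduced in this section, but one standard expression for the reliability of a weighted composite—presumably the kind of equation intended—is

$$ \rho_{CC'} \;=\; 1 \;-\; \frac{\sum_i w_i^{2}\,\sigma_i^{2}\,\big(1-\rho_{ii}\big)}{\sigma_C^{2}}, \qquad \sigma_C^{2} \;=\; \sum_i \sum_j w_i\,w_j\,\sigma_{ij}, $$

where $w_i$, $\sigma_i^2$, and $\rho_{ii}$ are the weight, variance, and reliability of component $i$, and $\sigma_{ij}$ is the observed covariance between components $i$ and $j$. With equal weights and the four subject-area averages as components, the same expression underlies the HSA estimates described next.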

Since HSA was based on the equally weighted average of four New Basics subject area grade averages computed by NELS, its reliability was estimated as the reliability of an equally weighted composite of the split half reliabilities for the four area average grades. Reliability estimates of HSA for the 10 subgroups (2 gender, 4 ethnic, 4 school programs) were similarly derived. Reliability of HSA(T) was based on a simple split half estimate for the total grade record. Finally, reliability estimates were corrected for range restriction as necessary using procedures described in Ramist et al. (1994, p. 10).

One further complication arose because grade averages sometimes included course-grade corrections (e.g., HSA+K). This model corrects only for the unsystematic measurement error inherent in grades and in test scores. Such measurement error is unrelated to systematic grading variation, another factor in the approximation of the original framework. Since neither grading variations nor any other factors are being corrected for attenuation, the criterion component represented by course-grade corrections was assumed to have no error; i.e., perfect reliability.7 This assumption is conservative with regard to corrections for attenuation (i.e., the effect of Factor 3).

Results of the Analyses

The analyses revolved around the five factors assumed to affect the relationship between the NELS tests and high school grade average: 1. Subjects Covered, 2. Grading Variations, 3. Reliability, 4. Student Characteristics, and 5. Teacher Ratings. The first issue to examine is how taking each of these factors into account alters the pattern of individual differences on test scores and grade averages; that is, the multiple correlation between the two. Following an overview of this analysis, each factor is examined more closely. Next is an examination of group effects. This entails an analysis of differential prediction and, finally, a condensed analysis of how the most important variables work for students in each of the gender and ethnic groups and the four school programs. Descriptive statistics for all variables are shown in Appendix A.

The Effects of Factors 1 to 5

Figure 3 shows the accumulating effect of Factors 1 through 5 in predicting the grade performance of individual students; that is, the extent to which one can account for differential grade performance. The right column describes the steps involved in making the five adjustments; the left column characterizes the status of the grade-score relationship before and after each adjustment. The correlation with grade average at each of those points appears in the middle column.

In considering the successive correlations in Figure 3, it is good to remember that the correlations, and therefore the steps between them, are not comparable in the sense of adding variables to a multiple regression. It is one thing to add a variable, another thing to remove school differences, yet another to correct for unreliability. Each step must be interpreted on its merits.

______________
Insert Figure 3 about here
______________

The initial correlation, .62, is based on the simplest representation of grade average and test score: essentially all course grades on the transcript and the total score for all four tests. From that point, each of the five adjustments resulted in a tangible increment in the correlation. Matching the subject matter of grade average and test score increases the correlation by .06. Taking account of grading variations and measurement errors raises the multiple correlation by .13. Taking account of student characteristics and teachers’ judgments of students' behavior increases the multiple R another .09. The eventual multiple correlation of .90 was based on the four NELS tests, 31 additional variables, corrections for school and course-grading variations, and corrections for unreliability of grades and test scores. This analysis resulted in an increase to 81% of variance accounted for, up from the 38% based on the total NELS Test score alone.

There is logic to the order of the factors. It is necessary, first, to move from the grade and score in hand to the grade and test measures that are appropriate to the analysis (Factor 1), then to correct errors in those measures (Factors 2 & 3), and lastly, to consider why students perform differently on the “true” scores and grades (Factors 4 & 5). The last two factors are of a different character because they add new predictors in order to help account for grade and test score differences.

It is possible to entertain somewhat different orders, in which case the end result is the same but the increments in R that are associated with each factor may vary. In general, a factor will make less apparent contribution if taken into account after, rather than before, another overlapping factor that makes a contribution. Reliability is a special case here. In

third position, the correction for unreliability increases the grade-score correlation by .05. Were it Factor 5 at the end, it would increase the multiple by only .02.

Factor 1. Subject Match. The total NELS Test score and the average of all grades on the school transcript are both summative measures of performance in high school, but they differ even in the general skills and subject matter that they comprise. Bringing the two measures into rough concordance involved two steps. One is to restrict the grade average to courses in the four New Basics subject areas that best correspond to the four tests. Another is to weight the tests so that a composite of the four (NELS-C) best represents (predicts or reproduces) individual differences in the grade average. Together, the two steps raise the correlation between grade and test score from .62 to .68. Since the effects of the two steps overlap to some degree, how much each contributes to the correlation depends upon which comes first. As Table 4 indicates, weighting the tests to predict the total transcript average HSA(T) increased the multiple correlation by .03. If shifting to the more strictly academic grade average (HSA) had come first, the correlation would have been increased by .02.

Two other aspects of the data in Table 4 are noteworthy. The correlations and standard regression weights associated with the two grade criteria are remarkably similar. The traditional academic courses represented in HSA constitute about two-thirds of the courses in HSA(T), the remainder often being quite different in surface character. The heavy weight on mathematics in concurrent prediction of grade performance is also notable. Part of that heavier weight is due to the mathematics test being longer and more reliable than the other tests. The small negative weight for the science test does not appear to be a consequential finding; rather, it results from the collinearity of that variable with the other tests.
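A small numerical illustration (the numbers are invented, not NELS values) shows how collinearity can produce such a sign reversal: a predictor that correlates positively with the criterion can receive a negative standardized weight when a stronger, highly correlated predictor is already in the equation.

import numpy as np

R = np.array([[1.00, 0.85],      # correlation between two collinear predictors
              [0.85, 1.00]])
r_y = np.array([0.60, 0.48])     # each predictor's correlation with the criterion
beta = np.linalg.solve(R, r_y)   # standardized regression weights
print(beta)                      # approximately [ 0.69, -0.11 ]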

______________
Insert Table 4 about here
______________

Factor 2. Grading Variations. Correcting the grade criterion for variations in grading standards was also a two-step process. As previously described, school grading variations were removed by subtracting the school means from all variables, thereby creating a pooled within-school covariance matrix, which was then corrected for range restriction. In this data set the multiple correlation between the NELS tests and HSA was .75, an increment of .07 attributable solely to removing the variation in grading from school to school. Step two was to remove variations in course grading. This involved adding to each student’s grade average the constant K, the average grading strictness of the courses taken by that student. The multiple correlation of the NELS tests with this doubly adjusted HSAw+K was .76, an additional increment of only .01 associated with differences in course-grading standards. Using K as an independent predictor had no effect on the multiple correlation.

This result is in sharp contrast to the findings of Ramist et al. (1994). Their similarly derived Z, an index of grading strictness in college courses, substantially improved Freshman GPA predictions (to .58 from a multiple correlation of .48 based on the SAT and high school average). The Z index of Ramist et al. (1994) correlated negatively with GPA, indicating that students who take strictly graded courses earn somewhat lower grades. But in the present analysis of high school grades, the comparable K index correlated positively with HSA. At this point in the analysis it was not clear why K behaved unexpectedly. The discrepancy between these results and those of Ramist et al. (1994) could be related to different grading habits of school and college instructors or to different habits of college freshmen and high school students in selecting tough versus easy courses. The likelihood of such

school-college differences is considered in later discussion of possible educational implications of our findings.

At this point it is desirable to examine two empirical questions that may be helpful in interpreting the grading data. One question is whether school and course-grading variations are highly correlated. If that were the case, the initial removal of school-grading variations in a within-school analysis would also have removed much of the course-grading variations. Thus, the impression in our analysis of a large school-grading effect and a small course-grading effect would be due simply to having lumped most of the course-grading variations in with the correction for school-grading variations. Another empirical question concerns the consistency of the pattern of course-grading strictness in high school. If the pattern of course-grading standards varies noticeably from school to school, the sparseness of NELS data within individual schools would take on added significance. In that case, the lack of sufficient data to represent course-by-school interactions in grading variations would limit the effectiveness of K and, as a result, underestimate the overall effect of grading variations. Two additional analyses were undertaken in order to explore these questions further.

One analysis required indexing both school-grading variations and course-grading variations for each student so that the two factors could be examined as separate predictor variables. As described earlier, a School Grading Factor (SGF) and a Course Grading Factor (CGF) were developed based on the original across-school data. Both SGF and CGF represented grading strictness; that is, a positive score on CGF or SGF implies that the student would have received, respectively, a higher course grade or grade average were it not for variation in

grading standards. Each was based on residuals from the regression of HSA on the four NELS tests in the total sample. The SGF for a given school—and for each of its students—was simply the average residual HSA (i.e., predicted minus actual HSA) in that school. The index of course-grading variations assigned to each student was the Mean Course Grading Factor (MCGF) for the courses taken by that student. The development of MCGF was parallel to that of the Course Grading Residual leading to K in the within-school analysis (described above). CGF was computed for each course with at least 25 students enrolled in the total sample. Since MCGF and SGF were on the same scale as HSA, they could be used as criterion corrections (instead of predictors) by simply adding the two factors to HSA.

As was the case in the within-school analysis, correcting for course-grading variations by adding MCGF to the HSA criterion resulted in a relatively small addition (.015) to the multiple R based on tests alone. MCGF had essentially no effect on the multiple correlation when used as a predictor. As will be evident, SGF was quite useful as a criterion correction, either when it alone was added to the NELS test or when it was used as one of 37 predictors. These results were consistent with the previous within-school analysis, but puzzling nonetheless. At the least, this result established that the failure of course grading to play much of a role in the within-school analysis was not because course-grading variations are highly correlated with school-grading variations and thereby removed from the picture when school means were removed from the analysis. As here defined, the two factors are additive and interdependent but do not appear to be strongly related in high school grading patterns. SGF and MCGF correlated only .18.

It remains to consider the second possible reason why correcting course-grading variations did not work according to expectation; namely, not being able to take into account differences in course-grading patterns from school to school. It was possible to determine whether that was the case through an analysis of grading variations attributable to schools, courses, and the interaction of schools and courses. This ANOVA was performed on residual course grades in an across-school analysis, using HSA predictions based on the four NELS tests. There were 225 courses in the four subject areas on which HSA was based. These 225 courses in 574 schools created a matrix with 129,150 cells, of which only about one in six contained data. Table 5 shows the results of an analysis of variation among those cells, weighted to account for unequal N’s and credits.8
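The decomposition reported in Table 5 can be approximated with a weighted two-way analysis of the cell means, sketched below. The code is illustrative only (the variable names are ours, and a dummy-variable regression stands in for whatever computational routine was actually used); with 574 schools and 225 courses the design matrix is large, but the logic is the same.

import statsmodels.formula.api as smf   # 'cells' is a pandas DataFrame

# One row per non-empty school-by-course cell: the mean course-grade residual in that
# cell, and a weight reflecting the number of grades (and credits) it represents.
additive = smf.wls("mean_residual ~ C(school_id) + C(course_title)",
                   data=cells, weights=cells["weight"]).fit()

# The weighted R-squared of the additive model estimates the share of between-cell
# variation attributable to overall school and course effects; the remainder is the
# school-by-course interaction, i.e., course-grading patterns that differ across schools.
print(additive.rsquared, 1.0 - additive.rsquared)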

Second, the ANOVA results indicate that the relative effect of course grading versus school grading on discrepancies between grades and test scores is underestimated in our analysis. The course main effect, which is the sole basis of MCGF in our analysis, accounted for 14% of the variation in cell means (i.e., all course-grade residuals). The interaction, representing 46%, is also most reasonably treated as variation in course grading. Thus, differences in the pattern of course grading from school to school accounted for three times as much grading variation as did the overall differences from course to course (e.g., 3rd year Chemistry versus 4th year English). Had it been possible to reliably identify grading patterns within schools, whatever enhanced grade prediction the interaction might provide would have gone to MCGF. Thus, the ANOVA results clarify the misleading impression of the regression analysis—that course-grading variations are not consequential. Such variations played a minor role in this analysis only because limitations in the data base precluded corrections for the major source of inconsistency in course grading. Factor 3. Reliability. In our five-factor model, variation in grading standards represents a systematic error in HSA that can be associated with particular students and situations. Unreliability, on the other hand, represents unsystematic measurement error. As here defined, the two types of error are independent and additive as correction factors. Unreliability in both grades and test scores contributes to observed differences between the two. In meta analyses of validity studies, it is appropriate to correct for unreliability of the criterion alone (Hunter & Schmidt, 1982, p. 88). But in the present study the objective was rather to remove the effects of observed discrepancies between grades and test scores due to errors of measurement in either of the two.

The reliabilities for the NELS tests (.95 for total score) were reported above. No data were readily available to determine whether the reliabilities of the NELS tests vary among subgroups. They were, however, the same tests for all students or parallel forms that were scaled together, and there is no indication that they would vary in reliability except due to differences in range, which are correctable and not reflected in the error of measurement. Other evidence suggests that the reliability of standardized tests of this type are not likely to vary consequentially across groups (Rock & Werts, 1979). On the other hand, the reliability of a grade average could vary if there is variation in the nature of the courses included in the average or in the amount of coursework on which the average is based. Table 6 illustrates such an association between reliability and amount of coursework. Among subject area averages for different groups, reliabilities varied from .63 to .92. The lower reliabilities tend to be found in mathematics and science. In these areas weaker students tend to take fewer courses, so reliability is lower due to fewer observations and a somewhat restricted range. In Table 6, reliability estimates of overall average were corrected for range restriction but those for subject averages were not. ______________ Insert Table 6 about here ______________ Reliabilities for overall grade averages are quite high. The reliability of HSA for vocational students is the apparent exception. Previously reported estimates of the reliability of grade averages have tended to be substantially lower—from the low .60s to the low .80s (Etaugh, Etaugh, & Hurd, 1972; Ramist et al., 1990; Werts, Linn, & Joreskog, 1978). Several possible explanations come to mind. Estimates of the reliability of grade averages are typically based upon samples with a restricted range, but here the range is quite broad since

all high school seniors are included. Furthermore, estimates of grade reliability are normally based upon a more limited amount of coursework than one finds in a four-year high school transcript. Another consideration is that the reliability estimates were based on a conventional odd-even split half method that, in this context, may yield overly highly high reliabilities. The odd-even method treats any variation in performance across years as reliable covariance. An alternate interpretation of HSA reliability would treat yearly variation in performance as unsystematic error. This latter definition would presumably yield a lower and arguably more appropriate estimate of reliability because that would be more consistent with our use of a concurrent analysis to control grade-test score differences. Figure 1 treats variations over time as one source of such differences. In an analysis of college data, Humphreys (1968, p. 375) reported that, “A substantial amount of instability of intellectual performance over this four-year time span is revealed.” His data also showed systematic differences in the pattern of grade-test correlations from year to year.9 These data suggest that our corrections for unreliability are likely to be conservative. Range restriction does not appear to place much of a ceiling on the reliability of HSA. Correcting for range restriction typically had relatively little effect. It is also noteworthy that the reliabilities of HSA and HSA(T) are very similar for most groups and programs, despite there being only two-thirds the amount of coursework represented in HSA as is in HSA(T), which is based on the full transcript. Reliability is lower for HSA than for HSA(T) in the Vocational Program because those students take a smaller proportion of academic courses. For other programs and groups, the reliability enhancing effects of more coursework in

HSA(T) is possibly offset by more homogeneous grading standards among the more strictly academic New Basics courses included in HSA. In any event, the reliability of the grade average (or the test score) alone was not one of the larger factors in this accounting for observed differences in grades and test scores. The lack of comparability of grades due to variation in grading standards was a more consequential source of grade-test score discrepancy. Correcting for unreliable grades increased the multiple correlation between grades and test scores about .02,10 but correcting for grading variations increased the correlation by .08. As was experienced in the experimental Vermont state assessment a few years ago, however, low reliability can be a significant problem in a high-stakes situation with a less traditional performance test (Koretz, Stecher, Klein, & McCaffrey, 1994). Factor 4. Student Characteristics. Table 7 shows the original correlations (across schools) of 26 student characteristics with the total test score, NELS-T, and the transcript grade average HSA(T). These variables were selected on their promise of showing some relationship to performance in school. Each of the five categories, A to E, included variables with at least a moderate correlation with performance. As expected, some of the relationships were negative, as in the case of behavior involving Competing Activities. Also, as expected, a number of the characteristics were more strongly correlated with grades than with test scores. For seven of the 26 characteristics, the absolute value of the correlation was at least .10 higher with HSA(T) than with NELS-T. ______________ Insert Table 7 about here ______________

In two instances, the absolute value of the correlation was .10+ lower with HSA(T) than with NELS-T. That is not surprising in the case of SES, because the social advantages implied by a higher SES would presumably act over a lifetime on the development of general cognitive skill in and out of school. The test is more likely to reflect such general skills than is the school average that focuses on specific learning objectives and behavior in the classroom. Furthermore, the correlation between SES and grade average is likely to be depressed by grading variations (an assumption supported by the corrected correlation in Table 8 following). The correlations with “Leisure Reading” on material unrelated to school (Variable 16) were somewhat more of a surprise. Students who read more tended to have higher test scores but not higher grades. The underlying reasons are unclear but may be similar to the case with SES. A history of outside reading could raise general cognitive skills as do other advantages of a high SES, but also take time away from schoolwork. The result would be less opportunity for the achievement on specific course objectives that is required for good grades. In either sense, leisure reading would be a competing activity with regard to relative performance on grades and tests. Table 7 also shows that the various student characteristics have a highly similar pattern of correlations with HSA(T) and HSA. Conventional wisdom might suggest that some significant reordering of students would occur with a shift from the full transcript HSA(T) to the academic emphasis of HSA. The more important consideration is that the two averages are based on substantially overlapping coursework and correlate .97. In particular subgroups, the average HSA(T) tended to be .01 to .03 higher than the average HSA (see Appendix Tables A-3 and A-4). The focus of the analysis now moves from HSA(T) to HSA.

This shift helps to satisfy the first specification of our five-factor model, an improved fit of HSA with the NELS Test with respect to subject matter. Table 8 traces the relationship of each student characteristic with HSA as several corrections and controls are taken into account. The first data column shows the acrossschool correlation of each characteristic with HSA—a repeat of the third data column in Table 7. All other data columns in Table 8 are based on within-school analyses corrected for range restriction. Going from the first to the second column, one might expect some increase in the relationship of these variables to HSA after grading errors in HSA are corrected (withinschool method). The correlations in the second column do tend to be somewhat higher than the original correlations; only one is smaller in absolute value. ______________ Insert Table 8 about here ______________ The third column of Table 8 shows the partial correlation of each student characteristic with HSA when grading variations are corrected and scores on the four NELS tests are controlled. This partial correlation gives the best indication of the extent to which each characteristic is related to differential grade performance. Again, all five categories, A to E, include variables that make such an explanatory contribution. All partials for competing activities are negative, as are “Discipline problems” and “Stress at home.” As would be expected, the partials are typically lower than the correlations in the preceding column, but the partial for “Work completed” (#4) increased substantially. This self-rating of homework habits was related to grades earned, despite a slightly negative relationship to test scores. Thus, getting one’s schoolwork done presents a near mirror-image

of the situation—behaviorally and statistically—with “Leisure Reading,” where a small positive relationship with HSA swung negative when the test score was held constant. The beta weights in column 4 indicate which variables made an independent contribution in accounting for grade performance. As would be expected, the weight for many student characteristics was near zero. The interesting aspect of this stage of the analysis was which types of variables remained in the picture. Variables 17 through 26 concerning Family Background and Student Attitudes largely drop out. It was mostly the behavioral variables—especially those directly involved with school—that made an independent contribution in accounting for grades earned. Scholastic Engagement. As mentioned earlier in our review of pertinent literature, researchers have looked hard for variables that might help in understanding why some students work hard in school and others do not. Many studies have focused on measures or circumstances that influence achievement, but several writers who have taken a more holistic view in seeking to understand how students do or do not become involved or engaged in school. These and similar ideas have been variously applied to students’ behavior and attitudes (Finn, 1993; Hanson & Ginsburg, 1988; Lamborn, Brown, Mounts, & Steinberg, 1992; NCES, 1995; Newmann, 1992) In seeking to understand the relationship of student characteristics to school achievement, it seems helpful to distinguish overt student behavior from contextual factors like family background and student attitudes. The distinction is especially useful in accounting for differences in the two types of school outcomes: grade performance and test performance. As Table 7 shows, the measures of Family Background and Student Attitudes (categories D & E) are, for the most part, similarly related to grade average and test scores. It

is mostly the student’s behaviors (categories A, B, & C) that are more strongly related to grades and seem therefore to largely determine whether a student’s differential grade performance is high or low (i.e., HSA relative to test scores) . The measures in categories A through C all involve behaviors that represent different aspects of being engaged in school: taking a demanding courseload, doing the work assigned, and not being involved in competing activities. A number of those behavioral measures showed a consequential partial correlation with HSA when grading variations and test scores are held constant. As indicated in the third data column of Table 8, nine of the 16 behavioral measures had a partial of ±.10 or larger. Table 9 shows comparable partial correlations computed by subgroup, with the nine measures listed in order of their partial in the total sample. The most obvious thing about the table is the generally similar pattern of partial correlations across school programs as well as gender and ethnic groups. Two differences are noticeable: a) taking advanced electives, participating in class, and being involved in school activities played little role in defining engagement for Vocational students; b) discipline problems and killing time played no such role for Asian-American students. ______________ Insert Table 9 about here ______________ Table 10 shows group differences in the level of these behaviors for different groups. For most of the measures, a trend is evident across the four school programs. Moving progressively from the Rigorous Academic program to the Vocational program, students are likely to take substantially fewer advanced electives and have many more disciplinary

infractions. Similar but lesser trends are visible in the overall number of courses completed, in attendance, in class participation, and in number of school activities. ______________ Insert Table 10 about here ______________ Differences in the incidence of these behaviors were typically small among the gender and ethnic groups, though three exceptions are notable. Females and males showed a twofold difference in the incidence of disciplinary problems (26% vs. 51%). Also, females were generally more engaged in school, as here defined. In fact, women scored more positively than males on all nine measures. Lastly, there was a substantial difference in the number of advanced electives taken by Asian-American students compared to African-American and Hispanic students. On this measure, the standard mean differences, D, between the AsianAmerican group and the latter two groups were .88 and .85, respectively. 11 In subsequent analyses Scholastic Engagement—or more simply, Engagement—was defined as a composite based on the nine characteristics listed in Table 9. The nine components were weighted in proportion to the partials for the total group. Engagement correlated .56 with HSA and had a moderately strong relationship to Family Background and Attitude measures. Indeed, the family and attitude measures were more closely associated with Engagement (Multiple R = .59) than with High School Average (Multiple R = .51). This correlational pattern involving Engagement, along with the pattern of partial correlations and regression weights in Table 8, support a commonsense view that it is largely the student’s behavior that directly influences differential grade performance. As the correlations in Table 11 suggest, Scholastic Engagement is more related to the students’ attitudes than to their backgrounds. When these two sets of variables were used to

predict Engagement in the total sample, the multiple correlations were .57 and .41, respectively. This finding is consistent with some writers’ supposition (Eccles, 1983; Newmann, 1992) that involvement with schoolwork is intrinsically a matter of attitude. Variable 26, the students’ judgment regarding their closest peers’ attitudes about education, had a somewhat stronger correlation with Engagement than did Variable 20, the parents’ educational aspiration for their child (r = .44 versus .30 in the total group). That relationship would appear to be compatible with the controversial argument that peers count more than parents in the development of personality (Harris, 1995). Of course, parents do influence the selection of friends. It is also notable that the results in Table 11 are very similar for males and females. ______________ Insert Table 11 about here ______________ Factor 5. Teacher Ratings. Teacher ratings showed strong relationships with grade performance and with differential grade performance. Four of the five ratings correlated from .37 to .63 with HSA. A single 5-point rating on whether the student regularly turned in assignments had a partial correlation of .56 with HSA, with grading variations controlled and holding the test score constant—the highest partial for any variable that was examined. Three of the five teacher ratings had significant beta weights in the overall regression analysis (see Table 8). Teacher Ratings raised the multiple correlation .04 (.86 to .90), even after all other factors had been taken into account. Several aspects of these ratings make these results all the more striking. The ratings do not represent a consensus judgment of all the teachers who knew each student. Ratings were available from an average of only 1.6 teachers per student. The evaluations were not

very consistent from teacher to teacher. The reliability of an average rating by two teachers ranged from .27 to .66 for Variables 27 through 31. Finally, the ratings were collected in the middle of the sophomore year, not near the end of high school when the teachers would have had a more extensive track record on which to judge. By design, only those NELS Teacher Ratings that had a behavioral emphasis were included in the analysis. In Variables 27 to 31, teachers largely described what students did in school. The teacher ratings obviously overlapped with several of the student self-ratings among Variables 1 to 26. The last two columns of Table 8 indicate that the Teacher Ratings accounted for some, but not all, of the predictive variance associated with the student characteristics. In the regression analysis based on 26 student characteristics, the larger beta weights dropped somewhat when the Teacher Ratings were added, but not nearly to zero. Both the teacher’s and the student’s judgments contributed in accounting for differences in grade and test performance. The student is privy to information that the teacher is not, but the teacher is probably more objective than most students would be in judging their own behavior. The student and the teacher may have slightly different views of attendance and completing assignments (Variables 1 & 27 and 4 & 31, respectively). Observe that there are two ways in which such behaviors can directly influence grades earned. Effective academic behavior can lead to added knowledge and skill that gets reflected in higher grades. The student’s behavior can also result in points being added to or subtracted from course grades, irrespective of knowledge and skill actually acquired. Teachers not only provide an independent view of the student’s behavior; they also assign the grades in ways that reflect their pedagogical values.

A Teacher Rating Composite (TRC) was developed for use in some subsequent analyses. TRC was based on the best-weighted average of the five ratings for predicting HSA. All five ratings contributed to that prediction; the heaviest weights went to Work completed (.33), Educational mtivation (.28), and Class behavior (.11). The TRC composite correlated .68 with HSA. Engagement, based largely on information supplied by the students, correlated .56 with HSA. It is possible that Teacher Ratings help to account for grade-test score differences for yet another reason. A teacher’s positive or negative bias regarding some students may be reflected in the teachers’ ratings as well as in the grades they assign. To the extent that that occurs, it is for our purposes, another real difference between grades and test scores. But if these Teacher Ratings were heavily based on such “halo effects,” it seems unlikely that three of the five teacher ratings would each make an independent contribution to the multiple correlation. Finally, do Teacher Ratings contribute to grade prediction mainly because of the information that ratings add or because of confounding? That is, do individual teachers base high or low student ratings on the grades students earn in their particular class or on knowledge of the student’s past grade record? Two considerations argue against confounding being a big factor. First, the one or two teachers who rated each NELS student constitute only a small fraction of the teachers who assigned grades to that student over four years. Second, the ratings were collected relatively early in high school and therefore act more as predictors than concurrent correlates. Nevertheless, the possibility of spurious effects due to confounding of grades with the predictors is a legitimate concern. Each of the 31 predictors variables in Table 8 was

examined for possible susceptibility to confounding; that is, the likelihood that the measure could reflect grade performance rather than account for grade performance. Considering the nature of the measures, three variables appeared to have the greatest likelihood of a relationship to HSA due to confounding: #7 Advanced electives, which attract students with good grades; #23 Educational plans, which are likely to be optimistic if the grade record is good; and #30 Educational motivation, which teachers may be prone to rate high simply because the grades are high. When all three of these variables were removed from the final regression analysis shown in Table 8, the multiple correlation was reduced by .007. Since other variables largely accounted for the predictive variance contributed by these three measures—the ones that seem most suspect—the role of confounding does not appear to be large.

Differential Prediction

As already noted, an increment in the multiple R is one index for evaluating the effects of identifying and adjusting for sources of discrepancy between grades and test scores. Another index is change in the extent of differential prediction. The former concerns individual differences in grades and test scores; the latter concerns group differences. Figure 4 shows the extent to which the HSA of various groups of students was over- or underpredicted on the basis of the NELS Tests and the accumulating effects of adjusting for grading variations, student characteristics, and teacher ratings. Results are shown for students in four subgroups and four school programs. Predictions were based on the regression line for the total sample and did not involve any correction for unreliability.

______________
Insert Figure 4 about here
______________
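Computationally, each point in Figure 4 is simply a group mean of prediction residuals from the total-sample equation. A minimal sketch follows (the names are placeholders; 'predictor_cols' stands for whichever set of predictors defines the column in question):

import numpy as np   # 'students' is assumed to be a pandas DataFrame

design = np.column_stack([np.ones(len(students)), students[predictor_cols].to_numpy()])
beta, *_ = np.linalg.lstsq(design, students["HSA"].to_numpy(), rcond=None)
students["residual"] = students["HSA"] - design @ beta    # actual minus predicted HSA

# Positive group means indicate underprediction of that group's grades; negative means
# indicate overprediction. Figure 4 tracks these means as more predictors are added.
print(students.groupby("group")["residual"].mean())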

Among the eight groups, initial differential predictions based on the NELS tests alone ranged from +.13 to –.14; the average was .09, disregarding sign.12 When grade predictions were based on additional information (moving to the right in Figure 4), differential prediction diminished in all cases where there was originally any consequential differential prediction. The lines converge toward zero as Grading Variations, Student Characteristics, and Teacher Ratings are taken into account. With corrections for all three factors, the absolute level of differential prediction (in column 4) averaged about two hundredths of a letter-grade.

Table 12 shows predicted and actual grades of each group at each stage of the analysis. When grading variations are corrected (moving from column 1 to column 2 in Table 12), both the actual and predicted mean grades for subgroups were adjusted somewhat because all school means were removed. Moving from column 2 to column 4, only the predicted mean grades were adjusted.

______________
Insert Table 12 about here
______________

The three corrections affected the groups somewhat differently. Grading variations had a small effect on differential prediction for most groups but a fairly large effect for African-American students. As reported above, it was mainly school grading variations rather than course-grading variations that affected the accuracy of concurrent grade prediction. African-American students were slightly overrepresented in schools that graded more strictly.13 Otherwise, there appears to be little if any systematic relationship in these data between grading standards and subgroup representation from school to school (see SGF means for subgroups in Appendix Table A-4). These somewhat surprising results are inconsistent with

an assumption that African American and Hispanic students are more likely to benefit from easy grading due to being overrepresented in poor schools.14 Taking student characteristics into account reduced grade underprediction (by –.05 to –.07) for three groups: women, Asian-Americans, and students in Rigorous Academic programs. These groups tended to achieve higher grades than one might expect on the basis of the NELS Test. Correcting for Student Characteristics had a more substantial effect (+.11) in accounting for the underprediction of the grades of Vocational students. These effects were notably associated with different levels of Scholastic Engagement. The standard mean difference (D) in Engagement between males and females was .48. The difference in Engagement between students in Rigorous Academic versus Vocational programs was substantially larger (D = 1.41). Teacher Ratings tended to reduce differential prediction slightly for all groups. A frequently noted gender difference in differential prediction (Bridgeman et al., 2000; Willingham & Cole, 1997) was consistently observed in these data as well. The total differential prediction by gender (absolute difference for males and females) ranged from .18 to .25 for all ethnic-racial groups: African-American, Asian-American, Hispanic, and White. When predicted and actual grades were computed for male and female students within these four groups, there was a wider spread of results but the same convergence to low levels of differential prediction when additional information was taken into account. Actual minus predicted grade ranged from +.21 (Asian-American females) to –.23 (African-American males). For all groups, the mean absolute value of differential prediction was .12 based on tests alone, and .03 based on all information. Table 13 shows all predicted and actual grades by gender within ethnic group. These results closely match those of Bridgeman et al. (2000,

Table 6), which were based on a comparison of college grades and admissions test scores. The main exception is greater underprediction of the grades of Asian women in the high school data.

______________
Insert Table 13 about here
______________

Differential validity. In these analyses of differential prediction, the objective has been to identify and evaluate the effects of different sources of discrepancy in the grade performance and test performance of groups of students. Differential validity is a related issue. A following section—Gender, ethnicity, and school program—takes up the question of whether the several major factors under study here show similar correlational patterns for different groups. But it is first useful to look for possible patterns of differential validity based on the NELS Test alone. Table 14 shows correlations, multiple correlations, and mean score levels for six subgroups, four school programs, and the total sample.

______________
Insert Table 14 about here
______________

The correlations in Table 14 are corrected for range restriction, grading variations, and unreliability. Consequently, the correlations are more comparable across groups and tests than is normally the case in inspecting such data. First-order correlations show a quite similar pattern across groups. Except for the vocational students, NELS Mathematics has consistently the highest correlation with HSA. Because of that similar pattern from group to group, the multiple correlation with HSA is typically very close to the correlation between HSA and the NELS Composite, which is based on weights derived from the total sample.16

Based on either the multiple R or the composite, the corrected correlation between the test and the grade average ranges fairly consistently from the low .70s to the low .80s. The correlational patterns indicate that the four NELS Tests function quite similarly from group to group with respect to their relationship to grade performance. There were two main differences. As has been frequently reported, tests tended to be somewhat better predictors of women’s grades than of men’s grades. Also, the grades of vocational students were not predicted as well as those of students in academic programs. That was true despite HSA being based only on academic courses. Furthermore, performance of the individual groups is, on the whole, similar on each of the four tests. On the other hand, there is a range of approximately one standard deviation in mean test performance among the ethnic groups and among the school programs. In the following paragraphs, those differences are compared with differences on other major variables.

A Condensed Analysis of Major Factors

To this point, our analysis of differential grade performance has involved a large number of variables, which makes interpretation somewhat unwieldy. The results suggest that each of the major factors contributing to differences between grades and test scores might be represented with little loss by a single variable or composite. If that is the case, a much-condensed analysis of the major variables could help to clarify relationships among the several factors. Furthermore, one of the purposes of the study was to test the generality of the results within subgroups of students. A simplified basis for describing how grades and test scores work for key groups of students would likely offer analytic and heuristic benefits.

95

It is clear that school grading variation is one major variable that would need to be included in such a condensed analysis. Since there are zero school differences in a within-school data matrix, it was necessary to use an across-school analysis in order to index school grading variations as an independent variable affecting each student’s grade average. This analytic approach raises first the question of how the results of the within-school and the across-school analyses compare (see pp. 61-68 for a description of the two methods). Table 15 shows the outcome for the two approaches using all 37 variables plus corrections for unreliability of grades and test scores. In the across-school analysis, school and course-grading variations were treated as predictors (column 2) and as criterion corrections (column 3). In both cases the across-school analysis uses the residual method of correcting grading variations.
______________
Insert Table 15 about here
______________
Overall, the within-school and the across-school (residual) methods of correcting grading variations gave comparable results. The pattern of beta weights for the various Student Characteristics and Teacher Ratings is highly similar in the two types of analysis. The negative weight for the NELS Science Test is apparently due to collinearity and should not be taken seriously. As Table 14 indicates, the Science test had a strong positive correlation with HSA in all groups save the Vocational students. As expected, the within-school analysis was somewhat more successful in accounting for grade performance than was the across-school analysis; the multiple R was .90 in the former and .88 in the latter. The within-school analysis removes school-related scale differences for all variables; the across-school analysis adjusts only for grading differences.
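As a rough sketch of the two approaches just described (column and file names are hypothetical, and the report's actual residual method may differ in detail):

    import numpy as np
    import pandas as pd

    # Hypothetical student file with columns: school_id, hsa, nels_comp.
    df = pd.read_csv("students.csv")

    # Within-school analysis: subtract school means from every variable so that
    # only within-school variance remains.
    for col in ["hsa", "nels_comp"]:
        df[col + "_within"] = df[col] - df.groupby("school_id")[col].transform("mean")

    # Across-school (residual) adjustment: index each school's grading level as the
    # part of its mean grade not explained by its mean test score, then remove that
    # school effect from each student's grade average.
    school = df.groupby("school_id")[["hsa", "nels_comp"]].mean()
    slope, intercept = np.polyfit(school["nels_comp"], school["hsa"], 1)
    school["grading_factor"] = school["hsa"] - (intercept + slope * school["nels_comp"])
    df["hsa_adjusted"] = df["hsa"] - df["school_id"].map(school["grading_factor"])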
Thus, the across-school approach limits the tendency of scale corrections to enhance the multiple correlation. Apparently this limitation outweighed the tendency of the across-school method to inflate the correlation somewhat through overfitting. Subtracting the school means from all variables in the within-school analysis could also remove real school differences that affect both grade and test performance. The results suggest, however, that subtracting the means removed more noise than signal.15
Table 15 also shows that quite similar results were obtained when SGF and MCGF were employed either as criterion corrections or as predictors. The two methods of handling grading variations both gave an overall multiple R of .88 and a pattern of regression weights that was highly similar to that of the within-school analysis. The main difference was where the grading variations show up. They appear as somewhat higher beta weights for the test in the within-school analysis and when SGF and MCGF are used to correct the criterion (columns 1 & 3). They appear as SGF and MCGF beta weights when those grading factors are used as predictors (column 2). The two grading factors, separately indexed, offer an additional perspective on grading variations in a solution that is almost as effective as the within-school analysis.
The following four composite variables gave essentially the same level of predictive accuracy as did the analysis based on 37 variables:
• NELS Test Composite (NELS-C). The NELS total test score was the initial predictive baseline. This best-weighted composite of the four tests as predictors of HSA (in lieu of HSA-T) incorporates Factor 1 in the model.
• School Grading Factor (SGF). The Course Grading Factor (MCGF) correlated .31 with HSA and .47 with NELS-C. As Table 15 shows, the course-grading correction (MCGF) added little if anything to the multiple correlation of the NELS tests with HSA. That was true for every group (as later shown in Tables 19 & 20). Thus, SGF comprises the useful part of Factor 2 and represents the second major variable in the condensed analysis.
• Engagement Composite. Typically, the student characteristics most directly associated with differential grade performance were the behavioral variables. The Scholastic Engagement Composite includes the nine behavioral variables that were most effective in that regard, each weighted in proportion to its contribution. Thus, Engagement is included as a third major variable on the assumption that it incorporates most of the useful variance among the 26 variables that originally defined Factor 4.
• Teacher Rating Composite (TRC). This best-weighted composite of the five Teacher Ratings in predicting HSA incorporates Factor 5.
Table 16 shows correlations among these variables and HSA. Since the correlations are corrected for unreliability of the test and the HSA, Factor 3 is incorporated in addition to the four factors to which the four variables above refer. A noteworthy aspect of Table 16 is the multiple correlation of .876, which is within rounding error of the multiple R of .884 based on 37 predictors in the comparable across-school analysis just reported in Table 15. Condensing the analysis to four major predictors lost little, if any, information in accounting for grade average.
______________
Insert Table 16 about here
______________
Of these four major predictors of grade performance, the NELS Test had the largest correlation with HSA, though the Teacher Rating was a close second. Each of the three nontest variables correlated substantially higher with HSA than with NELS-C, and each had a
consequential beta weight in the multiple regression. The weight for the NELS Test was clearly larger than that of Engagement or Teacher Rating, in part because the latter two variables overlap as previously discussed. School Grading (SGF) had a more modest correlation with HSA, but it made a substantial contribution in accounting for differential grade performance because it was largely independent of the other variables.

Gender, Ethnicity, and School Program

Table 17 shows correlations and multiple regression results in a condensed, four-variable analysis for the two gender and four ethnic groups. As was stated in the results of the condensed analysis for the total group, little information was lost in the subgroup analyses in going from the 37-variable to the 4-variable analysis. In the latter case, the multiple Rs are all at nearly the same level. The striking thing about these regression analyses is the similarity of results across groups. Almost without exception, the critical results that were noted above for the total group also characterized each gender and ethnic group. Had we carried out this and the following analysis by gender within ethnic groups, some additional distinctions would no doubt have emerged. Considering that the gender by ethnicity breakdown produced comparable overall results in Table 12 and Table 13, and the fact that gender differences were quite similar from group to group, we used the less detailed analysis here for the sake of a simpler presentation. In all groups, the NELS Test and the Teacher Rating had the strongest relationships with grade average. In each group, each of the non-test predictors—Engagement, Teacher Rating, and School Grading—had a substantially higher correlation with HSA than with the NELS Test. Consistently, the Teacher Rating had a moderately high correlation with Engagement. School Grading also showed a fairly consistent, moderately negative
relationship with HSA from group to group, though that was likely due in large part to the happenstance of each group being more or less equally represented among schools that graded more and less strictly.
______________
Insert Tables 17 & 18 about here
______________
As a result of this consistent pattern of correlations, the standard regression weights were also quite similar across groups. The NELS Test and the Teacher Rating, in particular, carried almost exactly the same weight in all gender and ethnic groups. Indeed, the only differences of any note concerned the Asian-American students. In this group, compared to students generally, Engagement was somewhat more important, and School Grading was somewhat less important in accounting for differential grade performance.
Table 18 shows results for the condensed multiple regression analysis for each of the four school programs. Results of this analysis showed important parallels to the analysis by gender and ethnic subgroups—as well as some important differences. Here again, the four-variable analysis yielded a multiple correlation almost as high as the 37-variable analysis in all groups. Furthermore, Engagement, Teacher Rating, and School Grading were consistently more highly related to grade average than to test score. There are, however, progressive changes in the predictive pattern as one moves, left to right, across the table from the more academic to the less academic school programs. In general, the correlations tended to be lower in the less academic programs. Grade performance was somewhat less predictable; the multiple R dropped progressively from .88 in the Rigorous Academic program to .73 in the Vocational program. The correlation between the NELS Test and HSA dropped from .71 to .40 even though both measures were based on
performance in traditional academic subjects for the students in each of the programs. On the other hand, the Teacher Rating was a stronger predictor in the less academic programs. Among Vocational students, Teacher Rating had a larger regression weight than did the test. Two factors may be at work in producing lower correlations between test scores and grade averages in the more vocationally oriented programs. A somewhat broader range of competencies in the more vocational programs may influence teachers’ judgments and their grading. Also, widely practiced social promotion of academically weaker students may play a role. If many teachers are inclined to pass students partly on the basis of effort, as has been reported (Public Agenda, 2000), the practice might well produce, in vocational programs, a correlation with grades that is lower for the test and higher for the teacher rating.
Table 19 shows results of multiple regression analyses for gender and ethnic groups based on the full set of 37 variables with reliability corrections as before. These regression weights present a very consistent picture from group to group. The parallel Table 20 for school programs shows some differences in the dynamics of educational achievement in the more academic versus the more vocational programs. As one moves along the academic continuum from Rigorous Academic to Vocational programs of study, competence in reading (NELS Reading) becomes a more important predictor of differential grade performance, and the Science test evidently becomes less relevant. Some unsystematic fluctuations in weights for the four NELS tests from group to group likely reflect instability due to a high degree of collinearity among the tests after scores and grade averages were corrected for unreliability.
______________
Insert Tables 19 & 20 about here
______________
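For readers who want to see how the standardized (beta) weights reported in these tables are obtained, a minimal sketch follows; the column names are hypothetical, and the report's corrections for unreliability and grading are not reproduced here:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("students.csv")  # hypothetical file
    predictors = ["nels_comp", "engagement", "teacher_rating", "school_grading"]
    cols = predictors + ["hsa"]

    # Standardize the criterion and the predictors (mean 0, SD 1).
    z = (df[cols] - df[cols].mean()) / df[cols].std()

    # Ordinary least squares on standardized variables; the slope coefficients
    # are the standardized (beta) regression weights.
    X = np.column_stack([np.ones(len(z))] + [z[p] for p in predictors])
    coefs, *_ = np.linalg.lstsq(X, z["hsa"].to_numpy(), rcond=None)
    print(dict(zip(["intercept"] + predictors, np.round(coefs, 2))))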
Class behavior and educational motivation evidently have more influence on teachers’ ratings of vocational students than of academic students. For vocational students, that judgment of motivation was the best predictor of HSA (highest correlation and highest regression weight) among all of the student characteristics and teacher ratings. For reasons that are unclear, some variables took on negative weights for the vocational students in the full regression analysis. For example, turning in assignments appears to be just as important for Vocational students as it is for academic students, but spending a lot of time on homework was associated with poor grades. Since the Vocational group is not large, chance fluctuations are likely responsible for some of these aberrant results. Effects that appear in both the Academic-Vocational and the Vocational groups, or effects that change progressively across the academic-vocational continuum, are likely to be the most dependable.
The previous discussion was concerned with group differences in the role of several major variables in explaining grade performance; that is, whether the factors that influence grade achievement are similar or different from group to group. A separate issue is whether subgroups tend to score at a similar or different level on such measures. Figure 5 shows profiles of average scores for HSA and the three major composite measures that reflect individual differences: the NELS Test, Scholastic Engagement, and the overall Teacher Rating. School Grading is not included here because it reflects school, rather than individual, differences. Each of these measures is expressed on a standard scale with a mean of 50 and standard deviation of 10. The horizontal lines at 48 and 52 are shown in order to aid interpretation (± .20 standard deviations being, by convention, the lower boundary of a so-called “small” difference). Actual means and standard deviations are shown in Appendix Tables A-7 and A-8.
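A small sketch of the 50/10 rescaling used for Figure 5 (hypothetical column names; the actual figure is based on the report's corrected composites):

    import pandas as pd

    df = pd.read_csv("students.csv")  # hypothetical file
    measures = ["hsa", "nels_comp", "engagement", "teacher_rating"]

    # Rescale each measure to a mean of 50 and a standard deviation of 10.
    for m in measures:
        df[m + "_t"] = 50 + 10 * (df[m] - df[m].mean()) / df[m].std()

    # Mean profile by school program; group means outside 48-52 exceed the
    # +/- .20 standard deviation boundary for a "small" difference.
    print(df.groupby("program")[[m + "_t" for m in measures]].mean().round(1))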
______________
Insert Figure 5 about here
______________
The four measures in Figure 5 can be viewed as somewhat different indicators of student performance in school. In that sense, the profile of average scores for each group gives some indication of how the groups are similar and different with respect to their scholastic achievement. Looking first at the top panel of Figure 5, score levels differed substantially and in a rather consistent pattern from program to program. Students in the Rigorous Academic program had almost the same consistently high mean scores on HSA, NELS-C, Engagement, and Teacher Rating. Students in the Academic, Academic-Vocational, and Vocational programs also had generally similar mean scores on those four measures, but at progressively lower levels. The range of mean scores across the four school programs was quite similar for each measure—a difference of 1.2 to 1.5 standard deviations from Rigorous Academic to Vocational. More important, the pattern was much the same for each measure.
The group means in the lower panel of Figure 5 tell a generally similar story, but with important differences. Each of the six groups tends to show a characteristic pattern of performance, above or below average to some degree. Mean scores for the four ethnic groups—Asian American, White, Hispanic, and African American—reflect a pattern of scores almost as divergent as that of the four school programs. Mean scores on the four measures were not always as consistent for a given gender or ethnic subgroup as was typically true for students in one of the four school programs. Females had slightly higher means than males, but gender differences on these broad measures were typically less than “small”; that is, between the lines designating ± .20 standard deviations. Both the Hispanic and the African
American students, particularly the latter, tended to score lower on NELS Test and HSA, compared to Engagement and Teacher Rating. This finding could be due partly to noncomparability of ratings and student-reported information, since these groups are more likely to cluster in particular schools.
Results here show the familiar tendency of women to score somewhat better on the grade average than on the test (.16 standard deviations). Recall that the superior grades of women compared to men were consistently observed in all ethnic groups (see column 1 in Table 13). To a considerable extent, that gender difference appears to be attributable to women being more involved in school. Women scored higher than men on all nine of the measures that defined Scholastic Engagement.
The results here are notable in the generally similar performance of each ethnic group on mean test score and mean grade average. This finding in high school stands in contrast to the relatively low average grades that have been reported for some minority students in selective colleges. Bowen and Bok reported that, among students with comparable SAT scores, 50% of White students versus 25% of Black students achieved a median rank in class on their grade performance (1998, Figure 3.10).
On Four Noteworthy Findings

The previous section described in some detail a number of results from the various analyses that were undertaken. What were the principal outcomes? Broadly speaking, four findings seem especially significant and warrant further discussion. First, our premise that it should be possible to account for most of the observed differences between grades and test scores proved to be largely accurate. Several other factors possibly explain the remaining differences. Second, grading variation is clearly a major source of discrepancies between observed grades and test scores. Taking those variations into account may prove problematic in practice, however, because patterns of course grading vary from school to school and are not likely to be reliably identifiable. Third, Scholastic Engagement defines a logical pattern of successful academic behavior—an organizing principle that holds promise for studying and improving achievement in school. Fourth, subgroups often differed in achievement level but were mostly quite similar in achievement dynamics—including generally consistent average performance on grades and tests. Each of these findings is addressed in the following pages.
The discussion closes with some reflection on two questions. First, what general implications might the findings have regarding our evaluation of the validity and fairness of grades and test scores? Second, what do the findings imply regarding the particular strengths of these measures when they are used in high-stakes decisions?
In considering the four findings, it is useful to call attention again to limitations in the scope of this study and to caution against overgeneralizing the results. The scope of the study is defined by the nature of the data and the context from which they are derived. Clearly, the
results are most directly applicable to high school performance. At lower grade levels or in advanced education, the relationship between grades and test scores will certainly engage additional issues not considered here. In principle, however, it is likely that the major reasons for differences in grades and test scores will, in varying degree, transcend a particular educational situation. The scope of the study is also defined by the specific objectives of the analysis. The presenting issue in this study was why students score differently on grades and tests. The statistical connection is obviously important because high-stakes decisions depend on score level. Using one measure or the other results in some difference in the group of students selected. Therefore, understanding the main reasons why grades and tests yield somewhat different results should help to inform questions concerning their validity and fairness. Nevertheless, this study does focus on how current high school grades are related to a set of pertinent standardized test scores, not on the specific content of each measure or what they ought to represent from an educational or social perspective. The generality of the results also depends upon the stability and representativeness of the sample. This national sample is large and includes good representation of important subgroups. Nevertheless, there were constraints on the sample such as availability of requisite data and loss of those students in schools with fewer than 10 NELS participants. An effect of the latter constraint was to exclude many students who moved during high school. The effect of such sample losses is impossible to evaluate with any accuracy. For example, underrepresentation of students who move during high school may influence these results, even though this constraint appeared to have limited effect on either the mean test scores or
the ethnic representation. In interpreting the results of this exploratory analysis, the sample constraints should be kept in mind.

Accounting for Grade-Test Score Differences

The premise of this study was that an approximation based on five factors could account for most of the observed differences in grade performance and test performance. The premise was largely accurate. Adding corrections and supplemental information that bear on the five proposed factors did account, to a substantial degree, for individual and group differences in grade and test performance. One index of that result was an augmentation in the correlation between the NELS Test and high school grade average from .62 to a multiple correlation of .90 based on the test plus 31 additional variables and corrections for unreliability and grading variations (see p. 72). A corollary finding was a similar effect on group differences. Taking all information into account reduced average differential prediction to two hundredths of a letter-grade for four subgroups and four school programs—about one-quarter of the differential prediction based on test scores alone.
An important related finding was that grade performance could be explained to a large extent with only four composite variables: a test covering a similar academic domain, school grading variations, student engagement, and an overall teacher rating. These four “major” variables yielded a multiple correlation only .008 less than the multiple R based on otherwise comparable analyses including all 37 variables (Table 15 versus Table 16). The significance of this result lies in the simple accounting that it permits. Individual differences in grade and test performance need not be conceived as an amorphous array of numerous conditions and student characteristics. Rather, accounting for school performance can be usefully viewed as
mainly dependent upon a few recognizable variables that are readily studied and evaluated in relation to educational practice and social policy concerns.
To what extent does the five-factor approximation actually succeed in accounting for grade-test score differences? An average differential prediction of .02 letter-grades suggests, in absolute terms, little room for improvement in the accuracy of accounting for group performance. On the other hand, a correlation of .90 leaves 19% of the grade variance of individuals unaccounted for (since .90² = .81). In part, this is because the effect of each of the five factors was probably underestimated; viz., Factor 1 (subject match) because curricula vary, Factor 2 (grading variations) because variations are known to be larger than those here corrected, Factor 3 (reliability) because our correction for HSA reliability was conservative, and Factors 4 (student characteristics) and 5 (teacher ratings) because of technical problems in measuring those variables accurately.
A useful perspective on the results is to consider, in more general terms, what types of differences between grades and test scores we are likely to be missing in this analysis. As Thorndike (1963) once argued, identifying all causes of difference between grade performance and test performance would require intensive study of individual students. It is possible, however, to identify some additional factors that likely account for most of the remaining variance. Five come to mind. They concern limitations just cited in the effectiveness of the five-factor analysis as well as other factors that could not be taken into account. In considering these several types of grade-test score difference that are at least partly missing, recall that the test is assumed to be external in some sense, i.e., not a classroom or school-based test.
Curriculum variation. By intention, educational objectives vary somewhat among districts and schools, teachers vary somewhat in the material they teach and assess, and students take different courses and focus on somewhat different learning goals. Accordingly, grades are based on a syllabus that varies to some degree across schools, classrooms, and individuals. Tests are designed to avoid content that is unique to particular learners or learning situations. A good standardized test is intended to sample content from the common ground of the curriculum in order to provide a common measure that is fair to all students. Grades, on the other hand, are intended to reflect the diverse learning of different students in different situations. Thus the test content is constant, but the substance of the grading standard necessarily varies from student to student. There is no way to adequately “correct for” that fundamental difference between grades and tests. Construct differences. Aside from variations in curriculum, it is in the nature of teachers’ grades and external tests to focus on somewhat different learning outcomes. Grades will necessarily reflect a broader range of knowledge and skills than can be represented in a test of limited length and more restrictive modes of assessment. Grading may also be influenced to some degree by broad educational objectives like leadership and citizenship, but tests are normally limited to more traditional cognitive skills and subject knowledge. Abstract reasoning is more likely to be stressed on a standardized test; performance skills are more likely to be reflected in grading. Class grades may be somewhat more influenced by writing or expositional skills, tests by other test-taking skills. Some learning opportunities may be unique to graded class assignments; other learning may stem mainly from out-of-school experiences.
Scholastic Engagement and Teacher Ratings presumably account for some of the curriculum and construct differences just outlined, but the database for this analysis contained little information concerning student differences that would be directly pertinent to curriculum or construct features intrinsic to grades versus test scores. Considering the many possibilities for curriculum and construct differences, we might not expect the relationship between grades and test scores to approach 1.00. From this perspective, one might even argue that with careful evaluation of each student’s performance a multiple correlation with grade average of .90 sounds higher than it ought to be. Temporal variations. The NELS Test and the NELS Questionnaire were administered in the senior year, but the HSA was based on grades earned in four different years. Students change over time, and they do not necessarily perform the same in school from one year to the next. Therefore, the time factor was not fully controlled, and the apparent concurrent analysis was only partly that. Earlier research has shown that grade-test correlations can vary with the length of time separating the two measures, as can intercorrelations among term-to-term grade averages. Such temporal variations were not taken into account in either the correlational analyses or in the corrections for unreliability of HSA. Technical shortcomings. Several shortcomings have been cited regarding the measures and statistical methods used in the analysis. The accuracy of questionnaire data is always open to question. That is probably especially true of data from graduating seniors. The effects of unreliability in grades and test scores were likely underestimated. There were significant missing data in this analysis, particularly among the Teacher Ratings that came from only one or two teachers. The reliability of those ratings was quite modest, and the data were collected relatively early in the students’ high school program. Both of the statistical
methods that were used to adjust for school grading variations had known deficiencies. All such factors would likely limit an accounting of grade-test score differences.
Grading standards. While variation in grading standards from school to school was found to be an important source of differences between grades and test scores, other grading variations such as differences among instructors and sections could not be identified in the database. It is more consequential that schools do not follow the same pattern of grading strictness from course to course. A major shortcoming of this analysis was the lack of adequate grade data to correct for those different patterns. Aside from leaving an important source of grade-test score difference unaccounted for in this study, this data limitation has more general implications regarding the negative effects of grading variations.

The Problematic Variation in School Grading

Our analysis attempted to take separately into account variation in grading standards in both schools and courses. It is important to recall that here, as in most research on this topic, variation in grading standards means average grading level in relation to average test score level. As shorthand, we typically refer to variation in grading standards as grading variations. As expected, grading variation among schools was a major factor in diminishing the observed relationship between grades and test scores. The result is consistent with extensive previous research. This means, as a corollary, that accurate interpretation of grades is problematic and may often be an important source of unfairness when school grades are used for high-stakes decisions. An unexpected finding was the rather unpredictable pattern of school grading variations. It is often assumed that students from families at a higher socioeconomic level tend to come from schools that grade strictly, and that students with a
disadvantaged or minority background are likely to attend schools with lenient grading. Tables 7 and 12 suggest no such relationship. School grading standards were mostly unrelated to any of the personal or background characteristics of students that we examined. Several studies have demonstrated substantial negative effects of course-grading variations on predictive accuracy (Elliott & Strenta, 1988; Ramist et al., 1994; Young, 1990). The results of our analysis of course-grading variations were both unexpected and instructive. Unlike the findings of previous work on college grade prediction, grading variations among high school courses had little such effect in this analysis. The different result is evidently due to our analysis being based on course grade data that were pooled across schools, while earlier studies have analyzed course-grading variations within each school. Limitations in school sample sizes in the NELS database prevented our measuring course-grading variations within individual schools. ANOVA results indicated substantial differences in course-grading patterns from school to school. Because of extreme data fragmentation within schools, our analysis could not capture most of the course-grading differences. Earlier investigators had suggested a consistent pattern in course-grading strictness from college to college; e.g., more strict in the natural sciences, less strict in the social sciences (Elliott & Strenta, 1988). On that basis, we hoped that within-school data pooled across schools would capture most of the course-grading variations. Why course-grading patterns are evidently more variable among secondary schools than among colleges is unclear. Teachers in different schools may have different grading cultures; that is, different values, theories, and habits as to the fair and proper way to grade honors courses, remedial courses, service courses, and so on. That may be less true of colleges. College instructors tend to congregate in different discipline areas where students of somewhat different average ability
tend to concentrate their course-taking. These conditions may well promote similar course-grading patterns from college to college.
In any event, our inability to represent much of the course-grading variation in this database undoubtedly degraded the quality of any course-grading index so derived. Correlational patterns provide clues as to the effect of this shortcoming in the data. SGF, our index for strictness of school grading, correlated –.29 with HSA and +.09 with the NELS Test composite. Thus, being subjected to strict school grading is somewhat associated with lower grades but not with lower test scores. In the analysis of Ramist et al. (1994), where data were available for most students in each college rather than a small sample, the index Z of course-grading strictness showed a similar pattern: a correlation of –.22 with college grade average and +.18 with SAT score. In our data, the corresponding pattern of correlations with the course-grading index MCGF was markedly different from that of Z. In the present analysis, MCGF showed a substantial positive correlation with both HSA (+.31) and NELS test composite (+.46). With this pattern of correlations, MCGF adds nothing to NELS-C in predicting HSA.
Table 7 shows how different SGF and MCGF actually are. SGF was largely unrelated to any student characteristic or teacher rating. On the other hand, MCGF was related to a number of student variables—most strongly with the teacher’s rating of educational motivation, taking advanced electives, and parents’ educational aspirations for the student.17 MCGF identifies students who take courses that tend to be strictly graded throughout the country, though as we know, there is much variation from school to school. Judging from its pattern of correlations with HSA, NELS-C, and other student variables, the MCGF index is apparently a weak representation of educational motivation. If MCGF reflects motivation to
some extent, why does it add nothing to the NELS Test in predicting HSA? Presumably, this is because its correlation with HSA is attenuated by the second component of MCGF. The grades of students with a high MCGF suffer somewhat from strict course grading. All in all, the irretrievable school variations in course-grading patterns resulted in a poor MCGF for our purposes. It is neither a good measure of motivation nor a good measure of course-grading variations. Notice, however, that MCGF might perform quite differently if it were being used to predict future grades rather than current grades. In that case, both components of MCGF might be helpful in forecasting academic performance; i.e., stronger motivation and an undervalued HSA. A problem in using coursework information in high-stakes admissions decisions is signaled by the large variation we found in course-grading patterns from one school to another. That result needs further confirmation, but it is apparently a mistake to assume that one can rely on conventional wisdom in evaluating a student’s course grades in high school. Our data indicate no sound basis for knowing whether a course with a particular title will be strictly or leniently graded in a given secondary school. This ambiguity makes interpretation of course grades problematic. The same caution apparently applies to fair interpretation of high school grade average. Even if a school is known to grade strictly or leniently, one cannot safely say that the grade average of a particular student is artificially high or low. The student’s particular selection of courses may not reflect the average grading strictness of the school. Overall, it is probably hazardous to make any assumption regarding school grading strictness or the worth of a specific course grade without analysis of the particular situation. College admissions officers frequently face such questions, but seldom have sufficient data to examine them.
Further research on these issues is badly needed. State testing programs with course-grade data on all students may provide the opportunity.

Scholastic Engagement as an Organizing Principle

The composite of several student characteristics, which we have termed Scholastic Engagement, represents one of the major factors in accounting for observed differences in grades and test scores. As an added benefit of this analysis, the Engagement composite was found to represent a quite sensible pattern of student behavior. The measure may therefore prove to be more broadly useful. It is worthwhile to consider further the nature of Scholastic Engagement and how it might provide a useful organizing principle in understanding student achievement.
Several ideas are prominent in the notion of Scholastic Engagement. Foremost is the idea that learning is not a purely intellectual activity. Achievement is much dependent upon personal motivation and positive attitudes about school and the value of learning. Certain behaviors apparently play a critical role in translating those attitudes into relevant academic accomplishment. Finally, the connection between engagement and achievement suggests that influencing such behaviors positively is likely to be a worthwhile institutional goal.
That learning does not depend upon cognition alone is hardly a new idea. A century ago, Dewey (1900) articulated an influential philosophy of education that rejected the image of the learner as an empty cognitive vessel to be filled through instruction. For Dewey, effective schooling depended critically upon student involvement; that is, experiencing directly the substance and process of learning. Fifty years later, work on the psychology of behavior enlisted individual motivation as another aid in understanding school learning.
By mid-century, it was commonly accepted that the challenge in explaining variations in academic achievement was to understand the influence of the non-intellective factors (Fishman, 1958). At that time, theory regarding individual need for achievement was much influenced by two interpretations of motivation: fear-of-failure and hope-for-success (McClelland, Atkinson, Clark, & Lowell, 1953). Subsequent work revealed that motivation has many dimensions and is often highly dependent upon the circumstances of particular situations. A much broadened domain of research and theory has incorporated such additional notions as peer status, school culture, competitiveness, intrinsic motivation, and the value orientations that such distinctions carry (Jackson, Ahmed, & Heapy, 1976; Pintrich & Schunk, 1996; Weiner, 1992). In recent years a further effort to understand individual differences in school achievement has focused on the role of non-cognitive factors in the ongoing learning process. Snow (1989), in particular, articulated the conative aspects of learning—notably such factors as interest, volition, and self-regulation, which lie at the intersection of the affective and cognitive domains of behavior. Snow and Jackson (1993; 1997) have provided helpful reviews of research and theory concerning the assessment of conative constructs in learning. Clearly, concern with the effects of students’ involvement or engagement with their learning and schooling has a very long history. As the data and analysis here show, grade achievement is more sensitive than is test achievement to individual differences in motivation, involvement, and effort of high school students. Scholastic Engagement represents our attempt to capture that pattern of behavior with respect to school generally. Several researchers have used the term “engagement,” in somewhat different ways, to characterize students’ attitudes and behavior in school. Finn’s (1993) conception of student
engagement was based on two components: “participation” (primarily behavior) and “identification” (primarily attitude and value). Disengagement from school has been characterized as poor attendance, disruptive behavior, and giving up (Kleese & D’Onofrio, 1994). Engagement has also been defined on the basis of student self-ratings of effort, concentration, and attention to academic work (Lamborn et al., 1991; Newmann, 1992). For these researchers, engagement was one manifestation of involvement in school; the others being misconduct, doing homework, and academic expectations—all of which were seen to be affected by family, peers, extracurricular activities, and part-time work.
In the broader context of personal development, a number of writers have advanced the idea that student involvement is essential to an effective college education. Books by Astin (1985) and Chickering (1981) illustrate that work. Student development in secondary education is a somewhat separate literature, well summarized by Eccles, Wigfield, and Schiefele (1998).
In the course of our analysis, Scholastic Engagement was defined on the basis of several considerations, both inductive and deductive. The analysis started with 26 student characteristics that one or more previous studies had shown to be related to achievement in school. Among those characteristics, the 16 that involved overt behavior got early attention in our work, because it was mostly those measures that made independent contributions in predicting grade performance (see Table 8). These 16 measures were sorted into three types of behavior18 that had some logical relationship to school achievement. For working purposes, it seems reasonable to view the three types of behavior as three components of Scholastic Engagement. Finally, for our purposes, an overall Engagement composite was based on the nine specific behaviors that had the largest effect on differential grade
performance (HSA, with test scores controlled). Each of the three components was represented by at least two behaviors. In this schema:
• The engaged student employs appropriate School Skills. Engagement means coming to school regularly, participating in class, refraining from misbehavior, and doing the work assigned. A number of studies reviewed earlier showed one or more of these behaviors to be related to school achievement. A particularly interesting result in the present analysis is the contrast between time on homework and homework completed. Cooper et al. (1998) pointed out that studies on the effects of homework have been based almost routinely on reports of the number of hours that a student spends doing homework. From their analysis of a limited sample, those authors suggested that homework completed might be the more significant variable. Our results for student self-reports on Homework Hours and Work Completed strongly support that proposition. Furthermore, the teachers’ rating of Work Completed proved to have the highest partial correlation with HSA (test score controlled) and the highest beta weight in predicting HSA among all of the 31 student variables in the present analysis. Notice that teachers see the outcome of the student’s homework efforts, not how much time they have devoted to it.
• The engaged student takes Initiative in school. Engagement means taking a full and demanding program of coursework and participating in other scholastic activities. The student’s track record of course enrollment was a strong indicator of differential grade performance. A robust transcript appears to be the most telling aspect of a student’s academic paper trail. A recent analysis by Adelman (1999) provides a form of corroborative evidence. His findings indicate that the quantity and quality of high school coursework are also a strong indicator of the likelihood of a student graduating from college. Like earlier studies, our analysis suggests that participation in school activities also reflects scholastic initiative. There is a long history of research and interest in the contribution of sports to the personal development and school life of young people (Steinberg et al., 1988). Results here support Hanks and Eckland’s (1976) early data and well-argued contention that only school activities relevant to academic work are likely to have a bearing on grade performance. The same principle apparently applies to community activities.
• The engaged student avoids unnecessary Competing Activities. Engagement means abstaining, where possible, from pursuits that take undue time and commitment away from schoolwork. All six of the competing activities included here showed some negative effect, especially killing time and involvement with drugs and gangs. Some earlier research has associated differential grade performance with movie-goers (Astin, 1971) and “druggies” (Lamborn et al., 1992). Research has typically shown the effects of afterschool employment to be marginal (see Marsh, 1991, and Steinberg et al., 1988), as do the findings here. Leisure Reading is an interesting item on the list of activities that may compete with schoolwork. It remains to be seen whether the significant negative beta weight of this variable in predicting HSA does actually represent a behavioral retreat from homework, or simply means that avid readers develop higher test scores over time. None of the effects for competing activities is large, but this may be partly due to the behaviors not being very accurately reported.
The heuristic attraction and potential benefit of this Engagement proposal lie in the overall configuration of the findings, not in the results for particular measures. The three components describe a recognizable pattern of behavior: employ appropriate school skills,
take initiative in school, and avoid competing activities. Together, they provide a logical framework that extends what we know about successful student performance and makes clear the primacy of scholastic behavior. The framework appears, in the most important respects, to be consistent with previous research and to accommodate the most salient variables. Finally, the Engagement composite incorporates in one measure essentially all of the variance in the 26 student characteristics that was useful in accounting for students doing better or poorer than their test performance would suggest.
As a single variable, an Engagement composite has the potential for conceptual complexity as well as analytical parsimony. Its behavioral focus is a distinct benefit. To be sure, the behavior of students is constrained by background and conditioned by peer culture and personal attitudes. Nevertheless, it is reasonable to assume, and the data do suggest, that it is behavior that most directly affects achievement. Furthermore, behavior is more readily described, observed, and assessed. Behavior can, in principle, be modified to the benefit of teachers and learners. All this is to suggest that the idea of behavioral engagement in school can be a useful tool in studying and improving the educational process.
If students do well because they are engaged, is it not just as likely that they are engaged because they do well? That is certainly true, as has been argued with respect to the reciprocal effects of self-concept and achievement (Marsh & Yeung, 1997). The methodological concern is the possibility of spurious causal effects due to direct dependence of grade predictors on grade performance. As previously discussed, we took pains to avoid such confounding, apparently with some success (see p. 89). The reciprocity of behavior and performance is not a technical problem pertinent only to the analysis of concurrent data. To be sure, a student’s attitude about school will change with time and circumstance, but a strong
or weak commitment to school will likely be reflected in longitudinal as well as in concurrent analyses.
Is the engagement argument circular? To a degree, yes. But, from an educational perspective, the challenge is to influence student behavior in ways that create positive self-fulfilling achievement prophecies. Much of the national effort that goes into encouraging excellence and selecting good students for demanding educational programs is based on that assumption. When interest and commitment lead to achievement, it is a demonstration of the affective side of learning. Students do better on those topics and lines of endeavor that match their prior experience, interests, and value orientation (Dwyer & Johnson, 1997; Stricker & Emmerich, 1999). Some students like school more than others. Those who work in school learn schoolwork. There is a critical assumption. Students who are engaged in school not only learn their lessons; they are also more likely to develop the habits of mind and broadly applicable cognitive skills that serve the unpredictable demands of adult life. Recall that Engagement is substantially related to both test performance and grade performance, but especially to the latter (see Tables 7 & 16). Reason asserts and the data confirm: Students who are more motivated toward academic pursuits are likely to perform relatively better on the specific knowledge and skills that grades are intended to recognize. Thus, a stronger motivational component in grade performance is an important difference between grades and test scores.
A special merit of Scholastic Engagement, as defined, is the focus on behavior. The data suggest that if we want to know who is motivated or what motivates, look at the students’ behavior. From a practical standpoint, what does that mean? In designing a research project or considering alternate educational practices, evidence in the students’ track record may
prove useful. Extracurricular precursors in high school of success in college are a good example (Willingham, 1985). Clearly, teacher ratings are also a potentially quite useful source of information. The Teacher Ratings included in our analysis focused largely on some of the same behaviors as represented in the Engagement composite—School Skills and Initiative. The Teacher Ratings were actually more strongly related to achievement and better indicators of differential grade performance than was student-supplied information. The strength of the results from the Teacher Ratings offers further validation of the Engagement composite, especially considering that the ratings came from the middle of the sophomore year. The Teacher Ratings may provide a somewhat more objective and valid view of the students’ behavior19 than do the Student Characteristics (Variables 1-26) that are mostly based on student self-reports. On the other hand, the Student Characteristics and the Teacher Ratings made independent contributions in predicting HSA. Ratings by the teachers may show stronger relationships partly because teachers add and subtract grade points depending upon their own observations of students. Also, teachers likely base their grades to some degree on competencies and considerations that are important in their classroom, but not normally covered by tests. Different issues arise if evidence of engagement is intended for actual use in highstakes decisions such as promotion, graduation, or admission to a selective institution or demanding program. Teacher ratings, extracurricular activities, or self-ratings can pose problems because of ethical issues or the obvious pressures and fudging problems that usually arise when consequential use is made of a measure. The student’s transcript is another matter.
The academic record has long been recognized as a proper basis for high-stakes decisions. Historically, successful completion of prescribed courses has been a major factor in deciding who earns a diploma or is admissible to a selective advanced program of study. The student’s transcript is public, highly sanctioned, and not readily manipulated. Our data indicate, as does Adelman’s (1999) recent analysis, that a strong track record of academic coursework is a robust indicator of academic success.

Group Performance: Similar Dynamics, Different Levels

Much of the attention in this analysis was directed to understanding individual differences in grade performance and test performance. Parallel questions regarding group differences on these high-stakes measures take on special significance because group differences are particularly associated with notions of fairness in assessment. Fairness in high-stakes tests has often been cast as differential validity and differential prediction (Linn, 1982a). In the present context, it is useful to think of differential validity as pertaining, more generally, to whether the dynamics of achievement are comparable across groups. Comparable achievement dynamics imply that the major variables that influence achievement function in a similar manner from group to group; that is, they are similarly constituted and similarly interrelated. In contrast, differential prediction pertains to comparability of performance level across groups.
The data and the analyses offer several advantages for examining the ways in which groups of students are similar and different on grades and test scores with respect to the dynamics of school achievement and performance level. We have here a sizable—if not assuredly representative—national sample of students, four composite measures known to largely account for grade performance, additional detailed information underlying those
composites, six gender and ethnic groups, and four program groups distributed along a continuum of academic to vocational emphasis. Furthermore, the analyses incorporate corrections, where appropriate, for distortions that often cloud comparative group data: range restriction, unreliability, and grading variations.
Dynamics of achievement. Do grades, test scores, and related measures function similarly or dissimilarly as indicators of school achievement from one group of students to another? Our results indicate that the main factors that represent or influence school achievement are quite similar across six gender/ethnic subgroups. At different points in the analysis, several types of evidence point to this conclusion.
• Prediction of HSA on the basis of the four NELS tests yielded similar multiple correlations for the six gender/ethnic subgroups, all within the range of the low .70s to the low .80s (variables corrected as appropriate for range restriction, grading variations, and unreliability). [Table 14]
• Prediction of HSA in a full analysis based on 37 variables yielded quite similar multiple correlations for the six gender/ethnic subgroups, all within the range of .84 to .89. [Table 19]
• A corresponding condensed analysis based on four composite predictors—NELS Test, Scholastic Engagement, Teacher Rating, and School Grading—yielded a multiple R within .02 of that for the full analysis for each of the six subgroups. [Table 17]
• In all six gender/ethnic subgroups, Scholastic Engagement was more highly correlated with HSA than with the NELS Test. As a result, Engagement was a significant contributor in accounting for differential grade performance in all six groups. A similar relationship obtained for both the Teacher Rating and School Grading in each of the six subgroups. [Table 17]
• In the condensed analyses, the magnitude of the standard regression weights for the four variables followed an almost identical rank order for each gender/ethnic subgroup: NELS Test, Teacher Rating, School Grading, and Scholastic Engagement. NELS Test had by far the highest weight—in the tight range of .49 to .53—for each of the six groups. Regression weights for the other three composite variables were similarly consistent. [Table 17]
• Each of the student-based composite measures was constituted in a consistent manner from one gender/ethnic subgroup to another:
  • In each subgroup, the Mathematics test had the highest correlation with HSA among the four NELS Tests. [Table 14]
  • In each subgroup, the same two ratings—Does Homework and Educational Motivation—had the highest and second highest weight, respectively, in predicting HSA among the five constituents of the composite Teacher Rating. [Table 19]
  • In each subgroup, with minor exceptions, the same three behaviors—taking advanced electives, completing work assignments, and coming to school—had the largest partial correlations with HSA among the nine student characteristics constituting Scholastic Engagement. [Table 9]
  • In each subgroup, the internal consistency of the high school average was uniformly high—ranging from .94 to .97 for HSA and .96 to .97 for the full-transcript HSA-T. [Table 6]
These results indicate that the dynamics of achievement are strikingly similar across gender and ethnic groups. Few differences are apparent from group to group in the behaviors that contribute to school achievement or in the abilities that come into play. The major variables that here account for differences in grades and test scores are similarly constituted and function in a very similar manner across groups. Parallel analyses based on students in the four school programs provide an additional perspective on the dynamics of school achievement. In large measure, the consistent results just described for the six gender/ethnic groups hold for the four program groups as well. There were, however, some significant variations associated with school programs. NELS classified students into programs on the basis of coursework. Most were placed in either the Rigorous Academic (25%) or the Academic (61%) program. Differences in the dynamics of school achievement mainly pertained to the two relatively small groups of students who took either some vocational coursework or a strictly vocational program. These were the principal differences observed: •

The correlations between the NELS Test and HSA were .80 and .79 in the two academic programs but declined to .73 and .56 in the more vocationally oriented programs—even though HSA was based only on courses classified as academic. [Table 14]



The multiple correlation between the four major composite variables and HSA was .88 and .87 in the two academic programs but declined to .83 and .73 in the more vocationally oriented programs. In that analysis, the role of two predictors changed quite noticeably from academic to vocational programs. The standard regression weight for the NELS Test decreased progressively from .57 to .35, but the weight for Teacher Rating increased from .31 to .44. [Table 18]

The reliability of HSA declined from .97 in the Rigorous Academic program to .89 in the Vocational program (based on academic coursework alone and corrected for restriction in range). [Table 6]

While the dynamics of school achievement were quite similar across gender and ethnic groups, it is clear that there were some differences in the way that the major variables influence school achievement in academic versus vocational programs. In this respect, the findings here appear to echo Snow's observation that there is much evidence for interaction between ability (here, factors influencing achievement) and treatments (here, programs), but little evidence of interaction between ability and group membership (Snow, 1998, p. 100).

What is the implication of similar dynamics in school achievement across gender and ethnic groups? Consistency in the way that the variables are related from group to group indicates little differential validity—an important mark of fair assessment. Whereas groups obviously differ in interests, background, and culture, the findings here give no indication that such differences impart a different meaning to grades, test scores, and the other major variables that influence school achievement. The level at which students achieve in school may be another matter.

Achievement level. As just described, achievement dynamics largely reflected group similarities. Achievement level presents a more complex picture of group differences as well as similarities. The performance of students in different school programs provides a useful frame of reference for considering gender and ethnic group differences. We focus here on the four major student-based variables: High School Average, NELS Test, Engagement, and Teacher Rating. As was illustrated in Figure 5, mean scores on these four variables presented a very consistent picture. Students in each school program tended to score at much the same level on each variable—the Rigorous Academic group scoring moderately high on all four, the less academically oriented groups scoring progressively and consistently lower on all four measures.

The High School Average and the NELS Test show a pattern of mean differences among the four ethnic groups much like that found among the four school programs—approximately one standard deviation spread from the highest to the lowest scoring group on each of these measures. These ethnic group mean differences are also reflected with some consistency in the four grade averages that make up HSA (Table A-4) and the four subtests that constitute the NELS Test (Table 14). The gender mean differences on grades, and especially tests, were considerably smaller.

The question of greatest interest to the present study is, of course, how grade performance and test performance compare, group by group. Relative to the size of the differences in academic achievement from group to group, the mean difference between these two measures was typically quite small. Among the 10 groups that we studied, in only one group (the Academic-Vocational program) was the mean difference in HSA and NELS Test as large as one-fifth of a standard deviation. The analysis of differential prediction provides a somewhat different view of group differences, but does suggest that even these small mean differences on the standard scale can be largely accounted for. When school grading, student engagement, and teacher ratings were controlled, the mean difference between actual grade average and grade average expected on the basis of test scores was about .02 grade points (Figure 4).
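Differential prediction, as used here, is the mean difference between a group's actual grade average and the grade average predicted for its members from a regression fit to the pooled sample. The Python sketch below shows that basic computation only; the column names (hsa, test, group) and the fabricated data are hypothetical stand-ins, not the NELS variables or the study's code, and the study additionally controlled school grading, engagement, and teacher ratings.

```python
import numpy as np
import pandas as pd

def differential_prediction(df, criterion="hsa", predictor="test", group="group"):
    """Mean over- or underprediction of the criterion for each group,
    using a single least-squares line fit to the pooled sample."""
    x, y = df[predictor].to_numpy(), df[criterion].to_numpy()
    slope, intercept = np.polyfit(x, y, 1)          # pooled regression line
    residual = y - (intercept + slope * x)
    # Positive values: the group earns higher grades than its test scores predict.
    return pd.Series(residual).groupby(df[group].to_numpy()).mean()

# Example with fabricated data (illustration only):
rng = np.random.default_rng(0)
demo = pd.DataFrame({"test": rng.normal(50, 10, 1000),
                     "group": rng.choice(["A", "B"], 1000)})
demo["hsa"] = 2.8 + 0.02 * demo["test"] + rng.normal(0, 0.4, 1000)
print(differential_prediction(demo))
```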


Most of the gender and ethnic groups tended to score on Scholastic Engagement and the Teacher Rating Composite at a mean level generally commensurate with their mean grade performance. For reasons that are unclear, that finding was less true of the African American and Hispanic students. As might be expected, a number of student groups showed different mean performance levels on the various components of Scholastic Engagement and Teacher Rating. Also, there were instances where these two composite measures indicated different mean performance levels for apparently the same behavior (i.e., discrepant student and teacher reports). The nature of these various measures and their relationship to school achievement seems a promising ground for further research.

These data appear to be at some variance with college data regarding the mean performance level of some groups of students on grades versus test scores. The results of differential prediction in these data are generally similar to those of corresponding college groups (Bridgeman et al., 2000; Ramist et al., 1994). On the other hand, recent college data suggest that, on average, African American students tend to perform less well on grades than would be expected from test scores. Results for Hispanic students are not clear (see Bowen & Bok, 1998; Ramist et al., 1994; and the discussion on pp. 90-94 and Note 12). The major variables identified here—engagement, teacher ratings, grading variations—may or may not be involved in these somewhat different patterns observed in school and college data. Nonetheless, these factors appear sufficiently promising to warrant careful study at the college level.


The Merits of Grades and Tests

The somewhat uncertain relationship between grades and test scores was cast as the presenting issue in this study. Both measures are used in high-stakes educational decisions and, in some respects, each measure is dependent upon the other for corroborative evidence as to its quality. Yet it is often not clear why observed performance on grades and tests differs for individual students, and sometimes groups of students. Results of the present analysis indicate several reasons for such differences. What, then, do these findings suggest regarding the merits of grades and tests for high-stakes decisions? That question is usefully considered from two perspectives: how we evaluate validity and fairness, and the differential strengths of grades and tests in high-stakes decisions.

In discussing these topics, we do so on the basis of the findings here described. Examining possible sources of score differences has often indicated differences in the nature of grades and test scores and suggested broader implications regarding their use in high-stakes decisions. But it is important to recall again the limited scope of this analysis. Needless to say, the overall value of grades and tests in a given situation will depend on many additional factors, especially their content and how they are used.

Validity and Fairness

Traditional statistical indices of validity and fairness are especially appropriate to high-stakes measures, because rank order is a critical consideration in such decisions. To account for differences in rank order between the measure and its surrogate or criterion is to provide evidence relevant to validity and fairness. Given a grade average and a NELS Test based on a comparable representation of the high school curriculum, the application of several
corrective factors resulted in a quite high correlation between the two measures and a quite low level of differential prediction for subgroups. Thus, taking into account reasonable sources of difference, we found grades and test scores to be strongly related. Furthermore, the several corrective factors appeared to work in largely similar fashion from group to group. The results thus provide, in principle, evidence of the validity and fairness of grades and test scores based on generally similar subject material, assuming that extraneous sources of error like grading variations are taken into account. As here illustrated, the two measures can be mutually validating because they are differently derived indicators of student performance. Grades represent the teacher’s summative judgment of performance based on evidence collected in class. An external test represents a performance sample of knowledge and skills devised by external subject matter specialists. Only moderate correlations between grades and test scores and some degree of differential prediction for subgroups are routine results in real-life assessment. In the public’s perception, poor test performance in relation to grades earned often means that the student does not “test well.” In some cases that is undoubtedly true, and groups can certainly tend to do less well on a particular type of test. For example, other things being equal, men tend to do less well on a test that calls for writing, and women tend to do less well on a test that calls for spatial visualization, though such relationships are complex and not always predictable (Bridgeman & Lewis, 1994; Willingham & Cole, 1997, p. 244f). But simply ascribing observed discrepancies between grades and test scores to “testing well” would be misleading at best. In these data the various subgroups typically earned mean grades generally similar to their mean test scores. In the analysis of differential prediction, non-test factors largely


accounted for the small differences that were observed between these two measures for a given group. Anything less than a strong correspondence between test results and grade results is usually taken to be evidence of invalidity and unfairness in the test scores—seldom the grades. That interpretation is not surprising given the formal connection of these statistical indices to professional standards of test validity and fairness (American Psychological Association et al., 1999) and the emphasis that such empirical markers receive in test interpretation materials and in the habits of researchers. On the other hand, the interpretation seems oddly inconsistent with the results reported here. Given a grade average and a test score based on generally similar subject matter, discrepancies between the two appear to have less to do with mysterious sources of invalidity or defects in the test than with errors in the grades and incomplete information about the students and their approach to schooling. Do the statistical findings suggest that, except for a few known differences, grades and tests based on a corresponding domain are likely to be comparably valid and fair? In a limited sense that is true, because accepted statistical indicators of validity and fairness look very robust when we take into account factors that should logically cause the two measures to differ in practice. But this view of the topic leaves quite a lot unsaid—even about these statistical criteria of validity and fairness. A broader view. Paradoxically, finding that it is possible to account for much of the apparent discrepancy between grade performance and test performance should caution against excessive reliance on the statistical indicators that we seek to explain and improve. Researchers and policy analysts give close scrutiny to whether a particular test correlates with grades .50 or .55 and to whether differential prediction for a particular group is .10 or .05 of a


letter-grade. We know from earlier work that statistical indicators of validity and fairness can be strongly influenced by various technical artifacts as well as social and educational values. These include, for example, range restriction and other aspects of selective sampling (Lewis & Willingham, 1995; Linn, 1983), unreliability of predictors and criteria (Humphreys, 1968; Linn & Werts, 1971), the nature of the criterion that is available or preferred (Elliott & Strenta, 1988; Ramist et al., 1994; Willingham, 1985), institutional differences and changes over time (Willingham & Lewis, 1990), whether other variables typically used in decisionmaking happen to be included or not included in an analysis (Linn & Werts, 1971), and finally, what particular ethical values are emphasized in defining validity and fairness (Hunter & Schmidt, 1976; Messick, 1989; Petersen & Novick, 1976). The results here further demonstrate the hazards in uncritical interpretation of statistical indicators of validity and fairness. We see that it is possible to substantially increase grade-test correlations and decrease differential prediction with corrections for factors that may have little if any actual connection with the quality of the test being evaluated. The statistics will vary if grades and tests are based on somewhat different subjects, if grades deliberately include factors other than knowledge and skill in the subject, if grading standards vary, and so on. Without corrections for such potential artifacts, these common statistics will be unduly conservative; that is, indicate less validity and less fairness than is warranted. In practice, the information necessary to make such corrections is hard to obtain. Furthermore, the statistical indices are insufficient. With a broader view of the topic, it is also obvious that the substance of the measures is critical. Grades, as currently assigned in typical schools, may be based on much the same knowledge and skills as represented in the NELS Test. Assessing different constructs or


using different assessment formats for grades or test scores might affect their validity and fairness in a variety of ways, statistical results notwithstanding. Ultimately, the validity and fairness of any measure depend upon the consequences of its use for individuals, groups, and the public interest. A broader view of the consequences of using a test includes not only the particular high-stakes decision at issue. Evaluation of validity and fairness must also include the backward effects on instruction and learning and the forward effects on the eventual social outcomes of the educational process (Frederiksen, 1984; Haertel, 1999; Messick, 1989; Resnick & Resnick, 1992; Willingham, 1999). Validity coefficients and differential prediction provide useful information regarding validity and fairness, but such evidence is bounded by what we can learn from a correlational model that has inherent limitations. A low correlation between a grade and a test score may say little about fairness and validity of the test if the grade criterion is poor. That is well understood. Similarly, a high correlation does not necessarily inform us about the quality of either measure except that they are mutually supportive. Depending upon the situation and the nature of the high-stakes decision, other grades and other tests might better serve educational objectives. Also, alternate tests might have quite similar correlations with a grade average, yet have quite different learning implications, social significance, or practical ramifications (Willingham & Cole, 1997, p. 234-244). As we stated at the outset, the type of analysis presented here can tell little about the relevance or sufficiency of the construct that is normally represented by grades and tests. Indeed, the analysis helps to illustrate the importance of understanding better the particular knowledge, skills, and cognitive processes that are and are not included in a given representation of educational outcomes.


Finally, it is important to distinguish what conclusions the results do and do not suggest regarding the comparability of grades and test scores. Being able to adjust away a major part of the observed differences between the two measures does not mean that they are the same. To the contrary, the several factors that were employed in "correcting" the relationship between the two measures show that grades and test scores are dissimilar in important ways. These factors influence grades and test scores differently with respect to content relevance and fidelity. What do these differences suggest regarding the use of grades and tests in making high-stakes decisions? In the main, the findings imply that grades and test scores have different strengths. The different strengths are likely to have different practical effects in the actual use of one measure or the other. For that reason, the findings reinforce the importance of considering all aspects of validity and fairness in deciding what measures to use in high-stakes situations.

Differential Strengths

In considering what these results may imply regarding differential strengths, we start with the factors that differentiate the two measures. Of the five factors originally included in the analysis, three represent characteristic differences between grades and test scores: Factor 2—grading variations, Factor 4—scholastic engagement, and Factor 5—the teacher's judgment of the student's performance in school. Each of these three can be seen as a component of grades that is not normally represented in tests. Clearly, an analysis of individual differences—on which we have focused here—is not unrelated to an analysis of construct differences. Indeed, critical evidence of construct differences lies in systematic patterns of individual differences.


From that perspective, the additional components found mainly in grades can be expected to generate different patterns of individual and group differences in grades and test scores. As the analyses indicate, taking the three components into account brings grades and test scores more closely into line. The following discussion focuses on possible strengths of grades and tests that may be implied by those three components. Factors 1 and 3 (subject match and reliability) are not considered because neither necessarily represents a strength or advantage that is especially associated with the use of either grades or test scores.20

The practical implications of Factors 2, 4, and 5 can easily vary with the nature of the high-stakes decision. There are many types of decisions where the stakes are high for individual students. It is clearly beyond the scope of this report to examine possible implications of using one measure or the other for specific purposes in particular situations. Nevertheless, the findings do suggest that some inherent differences between grades and tests tend to be associated with particular strengths. In evaluating possible strengths, it is useful to consider typical objectives of high-stakes assessment and the context in which assessment is carried out. Figure 6 proposes a schema for that purpose.

______________
Insert Figure 6 about here
______________

As this figure suggests, each of Factors 2, 4, and 5 is associated with a component of grades that is especially pertinent to a particular assessment objective. They are, respectively, that high-stakes assessment should be fair to the student, that assessment should recognize and foster the development of critical skills, and finally, that assessment should help to motivate effective teaching and learning. The relationships are not exact. Also, these three objectives do not apply equally to all types of high-stakes assessment, nor do they exhaust or
fully state its various purposes. Nevertheless, these important goals of assessment provide a useful context for considering possible implications of the findings. In comparing grades and test scores, Figure 6 helps to connect substantive issues (the goals and content of education) with outcome issues (the statistics of individual differences). It is also necessary to take account of the particular administrative context in which the two types of assessment are carried out. Grades represent each teacher’s judgment in each class as to how well the student has fulfilled the implicit local contract between teacher and student. The contract may be poorly stated, but in broad outline it is likely to be reasonably clear to both parties. For example, the implicit understanding with a given teacher may be, “If you master the knowledge and skills pertinent to my course reasonably well, you will probably get at least a B—maybe higher if you do all the assignments and contribute to the class, but maybe lower if you forget your homework or disrupt class.” A test, on the other hand, provides an external standard that is intended to compare performance across educational units. For that reason, the test is designed to include knowledge and skills generally representative of relevant coursework that, in detail, will differ somewhat from unit to unit. Naturally, there is considerable overlap between the local contract and the external standard, but as Figure 6 indicates, they do have distinguishing characteristics that tend to be associated with different strengths. In this schema, Factors 2, 4 and 5 suggest differential strengths of grades and test scores as follows. Grading variations (Factor 2). One important objective in high-stakes assessment is to insure that all students are evaluated on the same scale. In the case of grades, same scale usually means the “local standard” that is normally applied in a given situation. For some purposes that may be viewed as sufficient for fair assessment. The strictness of grading in a


particular program or institution can clearly affect the likelihood that a student will pass, but local standards are institutionalized and may not be perceived as a fairness issue if applied in an evenhanded manner. Passing standards that vary from one educational unit to another are not necessarily dysfunctional. Schools or educational programs with students who are academically weak or unusually talented may not be well served by the same grade scale. On the other hand, many high-stakes decisions call for an assessment that is comparable across educational units—either to be fair to all students involved or, for educational purposes, to base decisions on some objective standard of competence. Grading variations become an important issue if a system or state wishes to enforce accountability by imposing comparable standards across schools. Admission to a selective college or graduate program is, of course, the most familiar and telling instance of the fairness hurdle imposed by noncomparable grade records. In all such situations, grading variations represent a failure in the fidelity of grades as a basis for high-stakes decisions. In this regard, the consistent scale meaning of a “common yardstick” has long been seen as a distinctive strength of a standardized external test (Cameron, 1989; Dyer & King, 1955). It is no surprise that grading variations should be a major factor in accounting for observed differences between grades and test scores. The earlier review of pertinent literature indicates a long history and ample documentation of fluctuating standards and differences in grading patterns from one time to another, one situation to another. Conversely, the promise of having a common yardstick that avoids such problems is a principal reason why tests are used in the first place. Tests lack the error component that is represented in grading variations.


To be sure, test score scales are not always dependable. Some K-12 tests have been known to deliver suspicious scores,21 but tests used in high-stakes decisions involving individual students are typically normed and scaled with considerable care. In most high-stakes situations, consistent scale meaning is likely to be a major strength of tests and a notable weakness of grades. It has been frequently documented that basing high-stakes selection decisions on grades and test scores together routinely improves predictive validity (Ramist et al., 1994). But fairness is an equally compelling argument for using the two measures together, because a test will tend to compensate for a grade average that is either inflated or too stringent.

Scholastic engagement (Factor 4). Another common objective of high-stakes assessment is to evaluate skills generally considered to be critical outcomes of education. When grades and tests are based on the same subject matter, both presumably cover relevant knowledge and skills, but there are important distinctions in the content of grades and tests. The findings here indicate that the grade average reflects in part the degree to which a student is effectively engaged in school, while the test focuses on academic content. In Figure 6 this difference is contrasted as conative versus cognitive skills—an important distinction in the construct relevance of grades and test scores as high-stakes measures. Some students are more engaged than others, and they work harder. The effort pays off with higher grades and higher test scores, but the payoff is more direct and surer in the case of grades, which include the conative component. Furthermore, students often receive grade credit quite directly for taking school seriously and doing the work assigned. This undeniable relevance and responsiveness of grades to student effort and learning are clearly strengths of grades as a fair, though not necessarily sufficient, basis for many high-stakes decisions.
Educators recognize that conative skills like volition, habits of inquiry, effort, and self-regulation are, in themselves, important goals of schooling in a free and effective society—a consideration that adds to the validity of grades as a graduation requirement. As Bandura (2000) notes, the educational enterprise has multifaceted aims, though currently there is much sentiment throughout the country to hold students to a test-based, largely cognitive graduation requirement (Baker & Linn, 1997). Conative and cognitive skills serve complementary roles in facilitating achievement in school as well as adult life (Bloom, 1956; Krathwohl, Bloom, & Masia, 1964; Snow, 1989; Sternberg, 1985). Colleges have long searched for evidence of student motivation (a close relative of Factor 4) in the hope of enhancing predictive validity in high-stakes selection decisions. Interest in developing high-stakes tests of conative skills has raised numerous possibilities but many hurdles (Fishman, 1958; Messick, 1967; Snow & Jackson, 1993; Willingham & Breland, 1982). Our findings suggest that a likely reason that the search typically bears only limited predictive fruit is that scholastic engagement is represented to some degree in the student's previous grade record—already used as a predictor in many high-stakes situations.

The dependence of grades on student engagement is, of course, not always a strength. Whether a student is engaged with school is not independent of influences beyond the student's control. As the data indicate, engagement can be influenced negatively by family circumstances, dysfunctional peer associations, and so on. And for a student who has experienced a disastrous period of indifference or disconnect in high school, grades can be an unforgiving indicator of academic incompetence. A test can be a safety net for such students because the more general skills typically emphasized on the test are more dependent on a
lifetime of learning, both in and out of school. On the whole, however, the conative component is a distinctive strength of grades. This feature of grades is especially valuable because of the continuing difficulties likely to be encountered in developing tests of such characteristics as scholastic engagement that would be valid and acceptable in high-stakes situations. For similar reasons, the focus on cognitive skills is a strength of tests. Research in recent years has added to our understanding of a broader range of cognitive skills that are known to have value in academic work and adult life (Shepard, 1992a). Several circumstances make it now possible to better focus the content of a test on important cognitive skills. Briefly, these advances include the emphasis on educational standards, the developing technology of test design, the standardization procedures that yield comparable measures for all students, and the research potential for determining which designs work best. In recent years considerable effort has gone into improved design and delivery in the assessment of more complex cognitive and performing skills (Bennett & Ward, 1993; Frederiksen et al., 1990; Linn, 2000; Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999; Tatsuoka & Tatsuoka, 1992). In comparison, harnessing the judgment of innumerable teachers to the goal of correctly recognizing and consistently assessing specific cognitive skills is a daunting task. There is some evidence to support the conventional wisdom that the grades and classroom tests of teachers tend to emphasize facts, terms, and rules (Fleming & Chambers, 1983; Terwilliger, 1989). Recently, however, serious efforts have been undertaken to improve classroom assessment in order to better serve both instruction and accountability.22 (Shepard, 2000; Snow & Mandinach, 1999).


Teacher judgment (Factor 5). That teacher ratings are more closely related to grades than to test scores suggests a connection to a time-honored purpose of outcome assessment; namely, to motivate academic achievement by providing feedback. "Motivate" implies the need to shape learning as well as encourage effort. Grades and tests perform this function differently. Grades are based more directly on what is going on in the individual classroom. Through the teacher's assessment, grades provide an immediate reward or punishment to the student for good or poor work on the specific material assigned. External tests are not usually tied to specific courses. Tests are more concerned with how students are doing at the end of the year on the knowledge and skills that are common to and most relevant to the curriculum—in the system, the state, or the nation, depending on the test. Thus, the inclusion of the teacher's judgment in grading represents another distinction in the content relevance of grades and test scores. Grades more clearly reflect performance on the wide range of material that students are actually studying in a given class, day by day. In discussing this content distinction, Shepard (2000) refers to the formative and summative roles of classroom assessment and external tests, respectively.

It is no surprise that teacher ratings are more highly correlated with grades than with test scores. Evidently, teachers are about evenly split on whether it is acceptable to pass students simply because they have tried hard (Public Agenda, 2000). Parents and teachers certainly look to grades to motivate students, hopefully on a month-to-month basis. On the other hand, administrators and politicians look to annual test results to motivate teachers and schools. In theory, grades motivate students because they reflect the student's behavior, and tests motivate the educational process because they encourage accountability and provide a more effective means of designating what knowledge and skills are most important.


Developing good work habits and attaining long-term curriculum objectives are surely complementary and worthy goals. How effectively either grades or tests actually motivate or can be designed to motivate is, of course, a matter of regular study and debate (Covington, 2000; Linn, 2000; Shepard et al., 1995). In practice, do these different characteristics of grades and tests constitute differential strengths in making valid and fair high-stakes decisions? That the teacher’s judgment of students’ performance in school has special weight in accounting for grades is very likely a net gain for the validity and fairness of grades in many high-stakes decisions. The teacher is in the best position to know who is working hard and achieving learning objectives in school—in all ways that students do achieve. Teachers are in a position to recognize many different forms of achievement on many occasions. For that reason, grades inevitably cover a broader range of skills than do tests. In time, advances in technology may narrow the gap between teacher judgement and standardized tests. Better models of performance along with computer simulation and natural language processing of student responses hold promise for assessment of much more complex skills than is possible with current tests (Braun, Bennett, Frye, & Soloway, 1990; Bennett, 1999; Gitomer, 1993). To be sure, the added breadth that teacher judgement adds to assessment likely comes at the expense of consistent meaning as to what the grade represents. On the other hand, tests can be off the mark altogether. Heubert and Hauser (1999) describe the public policy dilemmas and the legal issues that arise if an external test does not adequately match the school’s curriculum or is otherwise considered unfair to its students. Thus the more direct instructional relevance of grades and the comparability of test scores are key complementary strengths of the two measures.


Additional evidence of validity and fairness in grades lies in the observation that teacher ratings reduced differential prediction. This analysis of high school data may actually sell short the importance of the teacher’s judgment as a component of grades. In elementary school and in graduate school, we normally assume that the teacher has the best feel for how the student is doing. Traditionally, teacher recommendations have been a critical element in selective admissions and probably a useful predictor as well (see Willingham, 1985, Table 5.2). Factor 5 appears to work much like Factor 4 in arguing the strengths of grades and tests for high-stakes decisions. In principle, engagement in school and the teacher’s judgment of school achievement should overlap considerably as components of grades or sources of variation in grades. Both surely reflect the student’s conative skills. Both surely indicate a student’s level of attainment on local learning objectives. But in this particular analysis, the Teacher Rating is probably a weak measure of the latter—partly because the ratings focused on student behavior rather than learning outcomes, and partly because they were based on the judgment of only two teachers in the sophomore year. The influence of teacher judgment may represent both strength and weakness in grades, because the teachers’ judgment may be as subject to varying standards, as are the grades. Presumably, teachers are the primary source of positive and negative adjustments in grading to give explicit credit for effort, attendance, completing assignments, and other factors that are not necessarily related to acquired knowledge and skill. To what extent such considerations are a strength or weakness of an assessment depends partly on the nature of the particular high-stakes decision and partly on the educational philosophy that governs it.


Theory aside, it would be useful to have better facts about the role of teacher judgment. In Table 16, the Teacher Rating Composite was correlated .69 with High School Average. Is all of that strong correlation good news? Considering that the 1.6 teacher ratings per student came from the middle of the sophomore year, this substantial relationship is not likely to be explained simply as contamination between the two measures. Is the behavior of students (the main focus of the ratings) really that stable through high school? Or could the .69 partly reflect a tendency for teachers’ grades to be influenced by mindsets as to who deserves good grades? The possibility of self-fulfilling prophecies in teacher judgment is an old and controversial issue (Brophy, 1983; Rosenthal & Jacobson, 1968). This strong relationship between grade average and teacher ratings of student behavior further illustrates the need for more research on the validity and fairness of grades as the basis for high-stakes decisions. It would also be desirable to see a more balanced concern regarding fair use of grades as compared to the concern typically advanced regarding legal obligations in using test scores (Office of Civil Rights, 1999). This discussion of differential strengths of grades and test scores was not intended to be exhaustive. Mainly, we have cited differential strengths suggested by the analysis reported here. While grades and tests clearly overlap, differences between the two measures are thereby all the more obvious. It is also more obvious that the strengths of grades and tests are often complementary. For example, in the last few pages we have noted these contrasts: Grades can represent broader content and reflect distinct accomplishments, but tests are more amenable to research and focused design on the most important content. Tests can more readily focus on priority cognitive skills, but grades can more readily focus on motivational components of academic achievement. Grades can readily reflect performance on what


students are studying, but tests can focus on more significant long-term educational objectives. Test scores can be compared from one school to another, but grade scales can be accommodated to local situations and programs. The implications of these various characteristics will naturally depend upon the nature of the educational decision. High-stakes decisions can be quite different as to purpose, context, and values. Certain assessment strengths can be major considerations in some high-stakes decisions. In selective admissions, for instance, evidence of scholastic engagement is a major strength of grades, while the common yardstick is a major strength of test scores. The two strengths advance validity and fairness in complementary ways.

Furthermore, it is useful to recall again that this comparison of grades and test scores has been based on an analysis of individual differences as opposed to a construct analysis. These two perspectives are not the same, though they are parallel and complementary. Each is useful because it suggests different value issues and different possible courses of action. Construct analysis leads to assessment and curriculum design; analysis of individual differences leads more naturally to questions concerning differential educational practice and performance of particular students.

Finally, it is worth observing that this analysis has been based on grades that we commonly know and tests that we commonly use. In making consequential decisions about individual students, it will often prove desirable to use the two measures together. Both are useful, and both have unique and complementary strengths. But there is no reason to assume that either grades or tests are as good as they could be or need to be. Measurement specialists and educators might do well to worry less about the statistical indices of the grade-test relationship and more about what we assess and how that improves teaching and learning.


Summary

Both grades and tests are used in high-stakes educational decisions. Who will be placed in a slow or fast track in grade school, or earn a high school diploma, or be accepted in a selective college or a prestigious graduate program? Grades and test scores play a curious, interdependent role in such decisions. We use tests to keep grade scales honest or because we do not fully understand or trust grades as an accurate indicator of educational outcomes. On the other hand, we use grades to judge the validity and fairness of tests. Grades and tests serve this mutually supportive role, in part, because it is commonly assumed that they measure much the same thing. Yet we wonder why observed grades and test scores frequently differ. There is much evidence but little systematic attention to why the two measures often yield somewhat different results. The premise of this study was that it should be possible to account for the differences in large part, and that such an analysis would be helpful in understanding better the distinctive strengths of grades and tests as high-stakes measures.

A Framework of Differences. To that end, a framework of possible sources of discrepancy between grades and test scores was developed. The framework emphasized four broad domains: content differences between grades and test scores, individual differences related to content differences, errors in grades and in test scores, and situational differences across contexts and over time. In the large sample necessary for this type of study, it is not feasible to examine all possible sources of difference, but that was not essential because the sources overlap.

Furthermore, results of prior studies suggest that some sources of potential difference between grades and test scores are much more promising than others. A long history of
research on grading standards indicates that variations across courses and across schools are the most likely types of grading errors to produce differences between observed grades and test scores. Because of the multiple purposes served by classroom assessment, grading is influenced by educationally relevant student behavior in addition to subject knowledge and skills. Evidence suggests that teacher ratings of students are a useful source of information regarding such influence. Various aspects of students’ attitudes and family life have also shown promise in helping to account for school achievement. Rather than attempt to cover all possible sources of difference between grades and test scores, this study employed an approximation amenable to analysis and based upon the most promising sources of difference that were suggested by relevant prior research. Accordingly, the analysis took into account these five factors: 1) the particular subjects covered by the grades and the tests, 2) grading variations, 3) reliability of grades and test scores, 4) characteristics of students that are likely to influence performance in school, and 5) teacher ratings of behaviors that can bear on grades assigned. Design of the study. The study focused on patterns of individual and group differences on grades and test scores, rather than content differences in the two measures. The analysis posed this question, “How can one correct errors or otherwise add to a test score to account for differential grade performance?” Grades were regressed on test scores, taking each of the five factors into account in turn. Achievement was analyzed both across and within schools because studies have indicated that attitudes and other critical variables in addition to grades are often not comparable from school to school. Since it is also likely that good grades foster good attitudes, variables that are obviously dependent upon past grades were avoided in order to minimize spurious effects.
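The phrase "taking each of the five factors into account in turn" describes a block-wise (hierarchical) regression: predictors are added in successive blocks and the multiple correlation with the grade average is recomputed at each step. The Python sketch below illustrates only that bookkeeping; the block names are hypothetical placeholders, and the corrections for grading variations and unreliability applied in the study are not reproduced here.

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation between y and its least-squares prediction from X."""
    X = np.column_stack([np.ones(len(y)), X])        # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return np.corrcoef(y, y_hat)[0, 1]

def factor_blocks_in_turn(y, blocks):
    """Add each named block of predictors in turn and report the cumulative R."""
    cumulative = []
    for name, X_block in blocks:
        cumulative.append(X_block)
        X = np.column_stack(cumulative)
        print(f"through {name}: R = {multiple_r(X, y):.3f}")

# Hypothetical usage, with blocks ordered roughly as the study's factors:
# blocks = [("NELS tests (Factor 1)", tests),
#           ("engagement measures (Factor 4)", engagement),
#           ("teacher ratings (Factor 5)", ratings)]
# factor_blocks_in_turn(hsa, blocks)
```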


The data were based on 8454 NELS seniors in 1992 in schools having at least 10 survey participants with necessary data. Ten subgroups were available for analysis: two genders, four ethnic groups (African-American, Asian-American, Hispanic, and White), and four school programs (Rigorous Academic, Academic, Academic-Vocational, Vocational). The five factors were introduced on the basis of the following data and analyses.

Factor 1: The subjects covered by grades and tests were matched by restricting the High School Average (HSA) to the four academic areas (English, mathematics, science, social studies) most similar to the four NELS tests and optimally weighting the tests as predictors of grade average.



Factor 2: Grading variations across schools were adjusted by two alternate methods—a pooled within-school analysis corrected for range restriction and a residual method that adjusted grades for mean over- or underprediction based on test performance. Course-grading variations were adjusted by the residual method either within or across schools. (A simplified, illustrative sketch of this kind of adjustment, together with the Factor 3 corrections, follows this list.)



Factor 3: Correlations between grade averages and test scores were corrected for attenuation with traditional estimates of reliability, corrected for range restriction where appropriate.



Factor 4: All student characteristics available in the NELS database were used where research evidence indicated promise in predicting grade performance (family and attitude measures and a number of behaviors such as attendance, coursetaking, and school activities—26 variables in all, mostly composites of related information).



Factor 5: Ratings by one or two teachers in the middle of the sophomore year focused on five types of behavior that are often considered relevant in assigning grades (e.g., effort and completing assignments).
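As noted in the Factor 2 and Factor 3 items above, two kinds of adjustments recur throughout the analysis: a residual-style correction for school grading variations and the classical corrections of a correlation for unreliability and for range restriction. The Python sketch below shows textbook versions of these adjustments under simplified assumptions (a single predictor, known reliabilities, direct restriction on the predictor); it is illustrative only and does not reproduce the study's pooled within-school procedure or its variable names.

```python
import numpy as np
import pandas as pd

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / np.sqrt(rel_x * rel_y)

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / np.sqrt(1 + r_restricted**2 * (u**2 - 1))

def adjust_school_grading(df, grade="hsa", test="test", school="school"):
    """Residual-style adjustment: remove each school's mean over- or
    underprediction of grades from test scores (a simplified Factor 2)."""
    slope, intercept = np.polyfit(df[test], df[grade], 1)
    resid = df[grade] - (intercept + slope * df[test])
    school_effect = resid.groupby(df[school]).transform("mean")
    return df[grade] - school_effect

# Example of the correlation corrections (all values are illustrative only):
r = correct_range_restriction(0.62, sd_unrestricted=10.0, sd_restricted=8.0)
print(disattenuate(r, rel_x=0.92, rel_y=0.95))
```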


The Findings. Extensive data were presented regarding many influences on grade performance and test performance in high school. The most important results can be summarized under four general findings. The first finding concerns this presenting question, “Can one account for observed differences between grade performance and test performance?” Two other significant findings concern the nature of two of the major factors that do account for grade and test score differences: grading variations and scholastic engagement. A fourth refers to the systematic patterns of similarities and differences that we observed in the performance of subgroups. 1. Accounting for the Differences. The assumption that it would be possible to account for most of the observed differences between grades and test scores by making adjustments for the five proposed factors proved largely accurate. Thus, in an analysis based on the test and 31 additional variables plus corrections for grading variations and unreliability, both individual and group differences between the two measures were reduced substantially. Reduced error in predicting the grade performance of individuals was reflected in a multiple correlation of .90 compared to one of .62 based on the tests alone. Average differential prediction in subgroups was reduced to two hundredths of a letter-grade. Grade performance could be explained almost as well with only a few composite variables: a test covering a comparable academic domain, school grading variations, student engagement in school, and an overall teacher rating. These four major variables gave a multiple correlation with grade average only .008 less than the analysis based on 37 variables. The remaining individual and group differences can probably be explained largely by limitations of the analysis or other known differences that could not be examined: curriculum variations among schools and students, construct differences in grades and tests, temporal variations, shortcomings in the


measures and statistics used here, and limitations in information available regarding grading variations.

2. Grading Variations. Grading variation among schools was a major source of discrepancy between observed grades and grades predicted from test scores. Correction of those school grading variations had a substantial effect on the correlation between grade average and test scores. An analysis of variance showed that grading variation among courses also accounted for a large part of the discrepancy between grade performance and test performance; however, almost half of all observed school and course-grading variation was associated with differences in the pattern of course grading from school to school. Such differences in course-grading patterns could not be identified and corrected due to sparse data within individual schools. As a result, the full effect of grading variations was underestimated in this study. Furthermore, taking grading variations into account in actual use of grade information in high-stakes decisions is problematic, because in actual practice, the data necessary to estimate the effect of course—and therefore also school—grading variations on a given student's transcript are seldom likely to be available.

3. Scholastic Engagement. Many student characteristics—family background, attitudes, and especially behavior—were related to both test performance and grade performance. Particular types of behavior made large, independent contributions in predicting differential grade performance; that is, grade performance with test scores taken into account. These behaviors were oriented toward school and followed a logical pattern. The best single predictor of differential grade performance was completing schoolwork assignments. Overall, the students who tended to make higher grades than test scores were those who employed appropriate school skills, demonstrated initiative in school, and avoided competing activities.


A weighted composite of nine such behaviors, termed Scholastic Engagement, proved to be a major variable in this analysis of differential grade performance. Family background and student attitudes were often strongly related to Engagement, but apparently play a secondary role to overt behavior in determining school achievement. Because the behaviors represent a coherent pattern and are potentially modifiable, Scholastic Engagement appears promising as an organizing principle for studying and improving school performance. Teacher ratings showed a similar pattern of behaviors to be relevant to test performance and especially to grade performance.

4. Group Performance. Subgroups often differed significantly in average achievement level, but were mostly quite similar with respect to achievement dynamics. That is, the major variables that accounted for individual differences in grades and test scores were similarly constituted and functioned in a similar manner across each of the six gender and ethnic groups. For example, in all groups: "does homework" and "educational motivation" were consistently the best grade predictors among the teacher ratings; taking advanced electives, completing assignments, and coming to school were consistently the most effective components of Scholastic Engagement. In predicting grades, multiple correlations and standard regression weights for each of the major variables were quite similar for each of these six groups. Parallel analyses based on the four school programs gave similar results, except among vocational students where grades tended to be more closely related to teacher judgment and less to test performance. The ten student groups differed substantially in school performance as indicated by grades and test scores, but average results based on each of these two measures were generally similar within each group, particularly when other group differences were taken into account. While the groups obviously differed in interests,
background, and culture, the results here give no indication that such differences impart a different meaning to grades, test scores, and other major variables that influence school achievement. The merits of grades and tests. The implications of these findings may vary substantially depending upon the particular situation in which grades and tests are used, the nature of the high-stakes decision, and other practical considerations. Nevertheless, the important differences between grades and test scores appear to be characteristic of the two measures and can therefore be expected, in some significant degree, to have generalizable effects on their validity and fairness. The results further demonstrate that it is possible to decrease substantially both individual and group errors of prediction by taking into account factors that may have little connection with the test that is being evaluated. Without correction for such artifacts, validity coefficients and differential prediction will be unduly conservative indicators of validity and fairness. That it is possible to so readily account for much of the apparent discrepancy between grade performance and test performance should caution against assuming that validity and fairness are fully and accurately described by the statistical indicators that we commonly seek to explain and improve. It is obvious that the substance of the measures and the consequences of their use are critical. Being able to adjust away a major part of the differences between grades and test scores does not mean that they are the same. To the contrary, the findings imply that grades and tests are different, and they have different strengths with respect to various interpretations of validity and fairness in the actual use of one measure or the other.


Two circumstances account for the fact that grades and tests have differential strengths. First, as this study suggests, three components of grades characterize important differences between the two measures: grading variations, scholastic engagement, and the teacher’s judgment of the student’s performance. Second, these components are pertinent to somewhat different though overlapping assessment objectives. In considering how each translates into differential strengths of grades and tests, it is useful to think of the two measures in the following way. Grades represent the teacher’s assessment as to how well the student has fulfilled the implicit local contract between teacher and student. Tests represent an external standard that makes it possible to compare performance across educational units. Grading variations (Factor 2) are especially associated with fairness in assessment. Flexible grading standards can be an arguable strength in a local contract. Classes largely composed of academically weak or unusually talented students may not always be well served by the same grade scale. Most high-stakes decisions, however, involve students in different educational locales competing for the same recognition or opportunity. In such situations, variation in grading standards compromises the fidelity of grades as a high-stakes measure. The fairness inherent in a common yardstick is a critical strength of the external test standard. Scholastic engagement (Factor 4) appeals to another important assessment objective; that is, to represent as fully as possible the most critical skills. Aside from covering the subject matter, grades and tests have the potential for distinctive strengths due to their somewhat different content relevance to high-stakes decisions. The behaviors that define Scholastic Engagement pertain to conative skills like volition, effort, and self-regulation— attributes not readily represented in a test. On the other hand, a well-developed test can focus


on particular cognitive skills considered to be most significant educationally—a goal not readily achieved in classroom grading.

That teacher ratings (Factor 5) are more closely related to grades than to test scores suggests a familiar objective of educational assessment: to motivate effective teaching and learning. Grades and tests do this differently. The potential strength of grades is to motivate students by focusing specifically on the individual achievement and behavior that their teachers can recognize and appreciate. The potential strength of tests is to motivate the educational process by encouraging accountability and providing a more effective means of designating what educational outcomes are most important. As Heubert and Hauser (1999) have emphasized, high-stakes testing is a policy instrument.

The strengths of grades and tests are clearly different and often complementary. Grades can gauge performance specific to the material that students are actually studying; tests can focus on more significant long-term educational objectives. Tests can help in developing the most critical cognitive skills; grades can help in developing the often neglected but essential motivational component of learning. Teachers' grades may be more nuanced, but tests are more objective. As high-stakes measures, the comparability of test scores and the breadth of grades are key complementary strengths. Common advice that the two measures should be used together where possible is well founded. Emphasis on research and other efforts that might enhance their complementary strengths would be well placed.


References

Adelman, C. (1999). Answers in the tool box: Academic intensity, attendance patterns, and bachelor's degree attainment. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Astin, A. W. (1971). Predicting academic performance in college. New York: The Free Press.
Astin, A. W. (1985). Achieving educational excellence. San Francisco, CA: Jossey-Bass.
Babad, E. Y., Inbar, J., & Rosenthal, R. (1982). Pygmalion, Galatea, and the Golem: Investigations of biased and unbiased teachers. Journal of Educational Psychology, 74(4), 459-474.
Baker, E. L., & Linn, R. L. (1997). Emerging educational standards of performance in the United States (CSE Technical Report 437). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.
Bandura, A. (2000). A sociocognitive perspective on intellectual development and functioning. Newsletter for Educational Psychologists, 23(2), 1-4.
Beatty, A., Greenwood, M. R. C., & Linn, R. L. (Eds.). (1999). Myths and tradeoffs: The role of tests in undergraduate admissions. Washington, DC: National Academy Press.
Bejar, I. I., & Blew, E. O. (1981). Grade inflation and the validity of the Scholastic Aptitude Test. American Educational Research Journal, 18(2), 143-156.
Bennett, R. E. (1999). Using new technology to improve assessment (ETS RR-99-6). Princeton, NJ: Educational Testing Service.
Bennett, R. E., & Ward, W. (Eds.). (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.

Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: Handbook I. Cognitive domain. New York: David McKay Company. Bloom, B. S., & Peters, F. R. (1961). The use of academic prediction scales for counseling and selecting college entrants. New York: The Free Press of Glencoe. Bowen, W. G. & Bok, D. (1998). The shape of the river. Princeton: Princeton University Press. Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert systems. Journal of Educational Measurement, 27(2), 93-108. Braun, H. I., & Szatrowski, T. H. (1984). The scale-linkage algorithm: Construction of a universal criterion scale for families of institutions. Journal of Educational Statistics, 9(4), 311-330. Breland, H. M. (1981). Assessing student characteristics in admissions to higher education (Research Monograph No. 9). New York: College Entrance Examination Board. Bridgeman, B. & Lewis, C. (1994). The relationship of essay and multiple-choice scores with grades in college courses. Journal of Educational Measurement, 31(1), 37-50. Bridgeman, B., McCamley-Jenkins, L., & Ervin, N. (2000). Predictions of freshman grade-point average from the revised and recentered SAT I: Reasoning Test (College Board Report No. 2000-1, ETS RR-00-1). New York: College Board. Brookhart, S. M. (1993). Teachers' grading practices: Meaning and values. Journal of Educational Measurement, 30(2), 123-142. Brophy, J. E. (1983). Research on the self-fulfilling prophecy and teacher expectations. Journal of Educational Psychology, 75(5), 631-661. Brown, B. B. (1988). The vital agenda for research on extracurricular influences: A reply to Holland and Andre. Review of Educational Research, 58(1), 107-111. Burnham, P. S. (1954). The evaluation of academic ability. College admissions. New York: College Entrance Examination Board. Byrne, B. M. (1986). Self-concept/academic achievement relations: An investigation of dimensionality, stability, and causality. Canadian Journal of Behavioral Science, 18(2), 173-185. Byrne, B. M. (1996). Academic self-concept: Its structure, measurement, and relation to academic achievement. In B. A. Bracken (Ed.), Handbook of self-concept:

161 Developmental, social, and clinical considerations (pp. 287-316). New York: John Wiley & Sons, Inc. Caldwell, E., & Hartnett, R. (1967). Sex bias in college grading? Journal of Educational Measurement, 4(3), 129-132. Cameron, R. G. (1989). The common yardstick: A case for the SAT. New York: College Entrance Examination Board. Cannell, J. J. (1988). Nationally normed elementary achievement testing in America’s public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practices, 7(2), 5-9. Chickering, A. W. (Ed.). (1981). The modern American college. San Francisco: Jossey-Bass Publishers. Clark, M., & Grandy, J. (1984). Sex differences in the academic performance of Scholastic Aptitude Test takers (CB Rep. No. 84-8, ETS RR-84-43). New York: College Entrance Examination Board. Cole, N. S. (1997). Understanding gender differences and fair assessment in context. In W. W. Willingham & N. S. Cole, Gender and fair assessment (pp. 157-184). Mahwah, NJ: Lawrence Erlbaum Associates. Coleman, J. S. (1961). The adolescent society: The social life of the teenager and its impact on education. New York: Free Press. Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. L. (1966). Equality of educational opportunity. Washington, DC: U.S. Department of Health, Education and Welfare. College Board. (1992). College bound seniors. New York: College Board. College Board. (1998). High school grading policies (RN-04). New York: College Board Office of Research and Development. Collins, W. A., Maccoby, E. E., Steinberg, L., Hetherington, E. M., & Bornstein, M. H. (2000). Contemporary research on parenting: The case for nature and nurture. American Psychologist, 55(2), 218-232. Cooper, H., Lindsay, J. J., Nye, B., & Greathouse, S. (1998). Relationships among attitudes about homework, amount of homework assigned and completed, and student achievement. Journal of Educational Psychology, 90(1), 70-83. Cooper, H., Valentine, J. C., Nye, B., & Lindsay, J. J. (1999). Relationships between five after-school activities and academic achievement. Journal of Educational Psychology, 91(2), 369-378.


Covington, M. V. (2000). Intrinsic versus extrinsic motivation in schools: A reconciliation. Current Directions in Psychological Science, 9(1), 22-25. Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30(1), 1-14. Crouse, J. & Trusheim, D. (1988). The case against the SAT. Chicago: University of Chicago Press. Cureton, L. W. (1971). The history of grading practices. Measurement in Education, 2(4), 1-8. Davis, J. A. (1965). Faculty perceptions of students: V. A second-order structure for faculty characterizations (College Entrance Examination Board RDR-64-5, No. 14; ETS RB-65-12). Princeton, NJ: Educational Testing Service. Dewey, J. (1900). School and society. Chicago: University of Chicago Press. DiMaggio, P. (1982). Cultural capital and school success: The impact of status culture participation on the grades of U.S. High School students. American Sociological Review, 47, 189-201. Dressel, P. L. (1939). Effect of the high school on college grades. Journal of Educational Psychology, 30, 612-617. Dwyer, C. A. & Johnson, L. M (1997). Grades, accomplishments, and correlates. In W. W. Willingham & N. S. Cole, Gender and fair assessment (pp. 127-156). Mahwah, NJ: Lawrence Erlbaum Associates. Dyer, H. S., & King, R. G. (1955). College Board scores. New York: College Entrance Examination Board. Eccles (Parsons), J. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75-146). San Francisco: W. H. Freeman Company. Eccles, J. S., Wigfield, A., & Schiefele, U. (1998). Motivation to succeed. In N. Eisenberg (Ed.), Social, Emotional, and Personality Development (Vol. 3, pp. 1017-1095). New York: John Wiley & Sons. Ekstrom, R. (1994). Gender differences in high school grades: An exploratory study (College Board Report No. 94-3, ETS RR-94-25). New York: College Entrance Examination Board. Ekstrom, R., Goertz, M., & Rock, D. (1988). Education & American youth. London: Falmer Press.


Ekstrom, R., & Villegas, A. (1994). College grades: An exploratory study of policies and practices (College Board Report No. 94-1, ETS RR-94-23). New York: College Entrance Examination Board. Elliott, R., & Strenta, A. C. (1988). Effects of improving the reliability of the GPA on prediction generally and on comparative predictions for gender and race particularly. Journal of Educational Measurement, 25(4), 333-347. Etaugh, A. F., Etaugh, C. F., & Hurd, D. E. (1972). Reliability of college grades and grade point averages: Some implications for prediction of academic performance. Educational and Psychological Measurement, 32(4), 1045-1050. Farkas, G., Grobe, R. P., Sheehan, D., & Shuan, Y. (1990). Cultural resources and school success: Gender, ethnicity, and poverty groups within an urban school district. American Sociological Review, 55(1), 127-142. Fetters, W. B., Stowe, P. S., & Owings, J. A. (1984). Quality of responses of high school students to questionnaire items. Washington: National Center for Educational Statistics. Findley, M. J., & Cooper, H. M. (1983). Locus of control and academic achievement: A literature review. Journal of Personality and Social Psychology, 44(2), 419-427. Finn, J. D. (1993). School engagement and students at risk. Washington, DC: National Center for Education Statistics. Fishman, J. A. (1958). The use of tests for admission to college: The next fifty years. In A. E. Traxler, Long range planning for education (pp. 74-79). Washington: American Council on Education. Fishman, J. A., & Pasanella, A. K. (1960). College admission selection studies. Review of Educational Research, 30(4), 298-310. Fleming, M. & Chambers, B. (1983). Teacher-made tests: Windows on the classroom. In W. E. Hathaway (Ed.), Testing in the schools (pp. 29-38). (New Directions for Testing and Measurement, no. 19.) San Francisco: Jossey-Bass. Frary, R. B., Cross, L. H., & Weber, L. J. (1993). Testing and grading practices and opinions of secondary teachers of academic subjects: Implications for instruction in measurement. Educational Measurement: Issues and Practice, 12(3), 23-30. Frederiksen, J. R. & Collins, A. A systems approach to educational testing. Educational Researcher, 18(9), 27-32.

164 Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39(3), 193-202. Frederiksen, N., Glaser, R., Lesgold, A. & Shafto, M. G. (1990). Diagnostic monitoring of skill and knowledge acquisition. Hilsdale, NJ: Lawrence Erlbaum Associates. Fredrick, W. C., & Walberg, H. J. (1980). Learning as a function of time. Journal of Educational Research, 73, 183-204. Freeberg, N. E. (1967). The biographical information blank as a predictor of student achievement: A review. Psychological Reports, 20(4), 911-925. Gamache, L. M., & Novick, M. R. (1985). Choice of variables and gender differentiated prediction within selected academic programs. Journal of Educational Measurement, 22(1), 53-70. Geisinger, K. F. (1982). Marking systems. In H. E. Mitzell (Ed.), Encyclopedia of educational research (5th ed., pp. 1139-1149). New York: The Free Press. Gifford, B. R. & O’Connor, M. C. (Eds.). (1992). Changing assessments: Alternate views of aptitude, achievement, and instruction. Boston: Kluwer Academic Publishers. Gitomer, D. H. (1993). Performance assessment and educational measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 241-264). Hillsdale, NJ: Lawrence Erlbaum Associates. Goldman, R. D., & Hewitt, B. N. (1975). Adaptation-level as an explanation for differential standards in college grading. Journal of Educational Measurement, 12(3), 149-161. Goldman, R. D., Schmidt, D. E., Hewitt, B. N., & Fisher, R. (1974). Grading practices in different major fields. American Educational Research Journal, 11(4), 343357. Goldman, R. D., & Slaughter, R. E. (1976). Why college grade point average is difficult to predict. Journal of Educational Psychology, 68(1), 9-14. Goldman, R. D., & Widawski, M. H. (1976). A within-subjects technique for comparing college grading standards: Implications in the validity of the evaluation of college achievement. Educational and Psychological Measurement, 36(2), 381-390. Green, P. G., Dugoni, B. L., Ingels, S. J., Camburn, E. (1995). A profile of the American high school senior in 1992. Washington: U. S. Department of Education.


Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5(10), 511-517. Haertel, E. H. (1999). Validity arguments for high stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18(4), 5-9. Hanks, M. P., & Eckland, B. K. (1976). Athletics and social participation in the educational attainment process. Sociology of Education, 49(4), 271-294. Hansford, B. C., & Hattie, J. A. (1982). The relationship between self and achievement/performance measures. Review of Educational Research, 52(1), 123-142. Hanson, S. L. & Ginsburg, A. L. (1988). Gaining ground: Values and high school success. American Educational Research Journal, 25(3), 334-365. Harris, J. R. (1995). Where is the child's environment? A group socialization theory of development. Psychological Review, 102(3), 458-489. Hartocollis, A. (1999). Chancellor cites test score errors. New York Times, p. A1, Sept. 15, 1999. Heubert, J. P., & Hauser, R. M. (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press. Hewitt, B. N., & Goldman, R. D. (1975). Occam's razor slices through the myth that college women overachieve. Journal of Educational Psychology, 67(2), 325-330. Hills, J. R. (1981). Measurement and evaluation in the classroom. (2nd ed.). Columbus, Ohio: Merrill. Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of literature. Review of Educational Research, 59(3), 297-313. Holland, A., & Andre, T. (1987). Participation in extracurricular activities in secondary school: What is known, what needs to be known? Review of Educational Research, 57(4), 437-466. Holland, A., & Andre, T. (1988). Beauty is in the eye of the reviewer. Review of Educational Research, 58(1), 113-118. Hoover, H. D. & Han, L. (1995). The effect of differential selection on gender differences in college admissions test scores. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Hossler, D., & Stage, F. K. (1992). Family and high school experience influences on the postsecondary educational plans of ninth-grade students. American Educational Research Journal, 29(2), 425-451. Hulin, C. L., Henry, R. A., & Noon, S. L. (1990). Adding a dimension: Time as a factor in the generalizability of predictive relationships. Psychological Bulletin, 107(3), 328-340. Humphreys, L. G. (1968). The fleeting nature of academic prediction. Journal of Educational Psychology, 59(5), 375-380. Humphreys, L. G. & Taber, T. (1973). Postdiction study of the Graduate Record Examination and eight semesters of college grades. Journal of Educational Measurement, 10(3), 179-184. Hunter, J. E., & Schmidt, F. L. (1976). Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 83(6), 1053-1071. Hunter, J. E., & Schmidt, F. L. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills: Sage Publications. Ingels, S. J., Abraham, S., Rasinski, K. A., Karr, R., Spencer, B. D., & Frankel, M. R. (1990). NELS:88 Base year student component (NCES 90-464). Washington, DC: National Center for Education Statistics, U.S. Department of Education. Ingels, S. J., Dowd, K. L., Baldridge, J. D., Stipe, J. L., Bartot, V. H., & Frankel, M. R. (1994). Second follow-up: Student component data file user's manual (NCES 94-374). Washington, DC: National Center for Education Statistics, U.S. Department of Education. Ingels, S. J., Dowd, K. L., Taylor, J. R., Bartot, V. H., Frankel, M. R., & Pulliam, P. A. (1995). Second follow-up: Transcript component data file user's manual (NCES 95-377). Washington, DC: National Center for Education Statistics, U.S. Department of Education. Ingels, S. J., Scott, L. A., Lindmark, J. T., Frankel, M. R., & Myers, S. L. (1992a). NELS:88 First follow-up student component (NCES 92-030). Washington, DC: National Center for Education Statistics, U.S. Department of Education. Ingels, S. J., Scott, L. A., Lindmark, J. T., Frankel, M. R., & Myers, S. L. (1992b). NELS:88 First follow-up teacher component (NCES 92-085). Washington, DC: National Center for Education Statistics, U.S. Department of Education.

Jackson, D. N., Ahmed, S. A. & Heapy, N. A. (1976). Is achievement a unitary construct? Journal of Research in Personality, 10(1), 1-21. Jencks, C. (1998). Racial bias in testing. In Jencks, C. & Phillips, M. (Eds.), The black-white score gap (pp. 55-85). Washington: Brookings Institution Press. Jencks, C., Smith, M., Acland, H., Bane, M. J., Cohen, D., Gintis, H., Heyns, B., & Michelson, S. (1972). Inequality: A reassessment of the effect of family and schooling in America. New York: Basic Books, Inc. Juola, A. E. (1968). Illustrative problems in college-level grading. Personnel and Guidance Journal, 47(1), 29-33. Jussim, L. (1989). Teacher expectations: Self-fulfilling prophecies, perceptual biases, and accuracy. Journal of Personality and Social Psychology, 57(3), 469-480. Keeton, M. & Associates. (1976). Experiential learning: Rationale, characteristics, and assessment. San Francisco: Jossey-Bass. Keith, T. Z. (1982). Time spent on homework and high school grades: A large sample path analysis. Journal of Educational Psychology, 74(2), 248-253. Kelly, F. J. (1914). Teachers' marks: Their variability and standardization (No. 6). New York: Teachers College, Columbia University. Kirschenbaum, H., Simon, S. B., & Napier, R. W. (1971). Wad-ja-get? The grading game in American education. New York: Hart Publishing Company. Kleese, E. J. & D'Onofrio, J. A. (1994). Student activities for students at risk. Reston, VA: National Association of Secondary School Principals. Koretz, D., Stecher, B., Klein, S. & McCaffrey, D. (1994). The Vermont portfolio assessment program. Educational Measurement: Issues and Practices, 13(3), 5-16. Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives: Handbook II. Affective domain. New York: David McKay Company, Inc. Lamborn, S. D., Brown, B. B., Mounts, N. S., & Steinberg, L. (1992). Putting school in perspective: The influence of family, peers, extracurricular participation, and part-time work on academic engagement. In F. M. Neuman (Ed.), Student engagement and achievement in American secondary schools (pp. 153-181). New York: Teachers College Press.

168 Lamborn, S. D., Mounts, N. S., Steinberg, L., & Dornbusch, S. M. (1991). Patterns of competence and adjustment among adolescents from authoritative, authoritarian, indulgent, and neglectful families. Child Development, 62(5), 1049-1065. Lemann, N. (1999). The big test: The secret history of American meritocracy. New York: Farrar, Straus, & Giroux. Leonard, D. K., & Jiang, J. (1995, April). Gender bias in the college predictions of the SAT. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA. Lewis, C., & Willingham, W. W. (1995). The effects of sample restriction on gender differences (ETS RR-95-13). Princeton, NJ: Educational Testing Service. Lindquist, E. F. (1963). An evaluation of a technique for scaling high school grades to improve prediction of college success. Educational and Psychological Measurement, 23(4), 623-646. Lindsay, P. (1982). The effect of high school size on student participation, satisfaction, and attendance. Educational Evaluation and Policy Analysis, 4(1), 57-65. Lindsay, P. (1984). High school size, participation in activities, and young adult social participation: Some enduring effects of schooling. Educational Evaluation and Policy Analysis, 6(1), 73-83. Linn, R. L. (1966). Grade adjustments for prediction of academic performance: A review. Journal of Educational Measurement, 3(4), 313-329. Linn, R. L. (1982a). Ability testing: Individual differences, prediction, and differential prediction. In Wigdor, A. K. & Garner, W. R. (Eds.), Ability testing: Uses, controversies, and consequences (Vol. 2, pp. 335-388). Washington: National Academy Press. Linn, R. L. (1982b). Admissions testing on trial. American Psychologist, 37(3), 279291. Linn, R. L. (1983). Predictive bias as an artifact of selection procedures. In H. Wainer & S. Messick (Eds.), Principals of modern psychological measurement: A festschrift for Frederic M. Lord (pp. 27-40). Hillsdale, NJ: Lawrence Erlbaum Associates. Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 416.

169 Linn, R. L. & Graue, E. & Sanders, N. M. (1990). Comparing state and district test results to national norms: The validity of the claims that “Everyone is above average.” Educational Measurement.: Issues & Practices, 9(3), 5-14. Linn, R. L., & Werts, C. E. (1971). Considerations for studies of test bias. Journal of Educational Measurement, 8(1), 1-4. Loyd, B. H., & Loyd, D. E. (1997). Kindergarten through grade 12 standards: A philosophy of grading. In G. D. Phye (Ed.), Handbook of classroom assessment: Learning, adjustment, and achievement (pp. 481-489). San Diego, CA: Academic Press. Madaus, G. F. (1994). A technological and historical consideration of equity issues associated with proposals to change the nation’s testing policy. Harvard Educational Review, 64(1), 76-95. Makitalo, A. (1994). Non-comparability of female and male admission test takers (Department of Education and Educational Research. Report no. 1994:06). Goteborg: Goteborg University. Marsh, H. W. (1987). The big-fish-little-pond effect on academic self-concept. Journal of Educational Psychology, 79(3), 280-295. Marsh, H. W. (1990a). Causal ordering of academic self-concept and academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646-656. Marsh, H. W. (1990b). A multidimensional, hierarchical model of self-concept: Theoretical and empirical justification. Educational Psychology Review, 2(2), 77-172. Marsh, H. W. (1991). Employment during high school: Character building or a subversion of academic goals? Sociology of Education, 64(3), 172-189. Marsh, H. W. (1992a). Content specificity of relations between academic achievement and academic self-concept. Journal of Educational Psychology, 84(1), 35-42. Marsh, H. W. (1992b). Extracurricular activities: Beneficial extension of the traditional curriculum or subversion of academic goals? Journal of Educational Psychology, 84(4), 553-562. Marsh, H. W. (1994). Using the National Longitudinal Study of 1988 to evaluate theoretical models of self-concept: The self-description questionnaire. Journal of Educational Psychology, 86(3), 439-456.

170 Marsh, H. W., Byrne, B. M., & Shavelson, R. J. (1988). A multifaceted academic selfconcept: Its hierarchical structure and its relation to academic achievement. Journal of Educational Psychology, 80(3), 366-380. Marsh, H. W., Byrne, B. M. & Yeung, A. S. (1999). Causal ordering of academic selfconcept and achievement: Reanalysis of a pioneering study and revised recommendations. Educational Psychologist, 34(3), 155-168. Marsh, H. W., & Parker, J. W. (1984). Determinants of student self-concept: Is it better to be a relatively large fish in a small pond even if you don't learn to swim as well? Journal of Personality and Social Psychology, 47(1), 213-231. Marsh, H. W., & Shavelson, R. (1985). Self-concept: Its multifaceted, hierarchical structure. Educational Psychologist, 20(3), 107-123. Marsh, H. W., & Yeung, A. S. (1997). Causal effects of academic self-concept on academic achievement: Structural equation models of longitudinal data. Journal of Educational Psychology, 89(1), 41-54. McClelland, D. C., Atkinson, J. W., Clark, R. A., & Lowell, E. L. (1953). The achievement motive. New York: Appleton-Century-Crofts. McCornack, R. L., & McLeod, M. M. (1988). Gender bias in the prediction of college course performance. Journal of Educational Measurement, 25(4), 321-331. Messick, S. (1967). Personality measurement and college performance. In D. N. Jackson & S. Messick (Eds.), Problems in human assessment (pp.834-845), New York: McGraw-Hill. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education & Macmillan. Messick, S. (1995). Validation of inferences from persons’ responses and performance as scientific inquiry into score meaning. American Psychologist, (50)9, 741-749. Milne, A. M., Myers, D. E., Rosenthal, A. S., & Ginsburg, A. (1986). Single parents, working mothers, and the educational achievement of school children. Sociology of Education, 59(2), 125-139. Milton, O., Pollio, H. R., & Eison, J. A. (1986). Making sense of college grades. San Francisco: Jossey-Bass. Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers and Human Behavior. 15, 335-374.

171 Monk, D. H. & Haller, E. J. (1993). Predictors of high school academic course offerings: The role of school size. American Educational Research Journal, 30(1), 3-21. Mulkey, L. M., Crain, R. L., & Harrington, A. J. C. (1992). One-parent households and achievement: Economic and behavioral explanations of a small effect. Sociology of Education, 65(1), 48-65. National Center for Education Statistics. (1995). Extracurricular participation and student engagement. Education Policy Issues: Statistical Perspectives (NCES 95741). Washington, DC: U.S. Department of Education, National Center for Education Statistics. National School Public Relations Association. (1972). Grading and reporting: Current trends in school policies & programs. Arlington, VA: National School Public Relations Association. Natriello, G. (1992). Marking systems. In M. C. Alkin (Ed.), Encyclopedia of Educational Research (6th ed., Vol. 3, pp. 772-776). New York: Macmillan Publishing Company. Newmann, F. M. (Ed.). (1992). Student engagement and achievement in American secondary schools. New York: Teachers College, Columbia University. Office of Civil Rights. (1999). Nondiscrimination in high-stakes testing: A resource guide. Washington, DC: U.S. Department of Education. Pedulla, J. J., Airasian, P. W., & Madaus, G. F. (1980). Do teacher ratings and standardized test results of students yield the same information? American Educational Research Journal, 17(3), 303-307. Pennock-Roman, M. (1990). Test validity and language background. New York: College Board. Pennock-Roman, M. (1994). College major and gender differences in the prediction of college grades (College Board Report No. 94-2, ETS RR-94-24). New York: College Entrance Examination Board. Petersen, N. S., & Novick, M. R. (1976). An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 13(1), 3-29. Pintrich, P. R., & Schunk, D. H. (1996). Motivation in education: Theory, research, and applications. Englewood Clifts, NJ: Prentice Hall. Public Agenda (2000). Reality check 2000. Education Week. Special Report. February 16. p. S1-S8.

172 Ramist, L., Lewis, C., & McCamley, L. (1990). Implications of using freshman GPA as the criterion for the predictive validity of the SAT. In W. W. Willingham, C. Lewis, R. Morgan, & L. Ramist, Predicting college grades: An analysis of institutional trends over two decades (pp. 253-288). Princeton, NJ: Educational Testing Service. Ramist, L., Lewis, C., & McCamley-Jenkins, L. (1994). Student group differences in predicting college grades: Sex, language, and ethnic groups (College Board Report No. 93-1, ETS RR-94-27). New York: College Entrance Examination Board. Resnick, L. B. & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O’Connor (Eds.), Changing assessments: Alternate views of aptitude, achievement, and instruction (pp. 3776). Boston: Kluwer Academic Publishers. Richards, J. M., Jr., Holland, J. L., & Lutz, S. W. (1967). Prediction of student accomplishment in college. Journal of Educational Psychology, 58(6), 343-355. Robinson, G. E., & Craver, J. M. (1989). Assessing and grading student achievement. Arlington, VA: Educational Research Service. Rock, D. A., Ekstrom, R. B., Goertz, M. E., & Pollack, J. (1986). Study of excellence in high school education: Longitudinal study, 1980-82 final report. Washington, DC: U.S. Department of Education, Center for Statistics, Office of Educational Research and Improvement. Rock, D., & Evans, F. (1982). The effectiveness of several grade adjustment methods for predicting law school performance (LSAC No. 82-02). Newtown, PA: Law School Admission Services. Rock, D. A., Pollack, J. M., & Quinn, P. (1995). Psychometric report for the NELS:88 base year through second follow-up (NCES 95-382). Washington, DC: U.S. Department of Education, National Center for Education Statistics. Rock, D. A. & Werts, C. E. (1979). Construct validity of the SAT across populations— an empirical confirmatory study (Research Report 79-2). Princeton: Educational Testing Service. Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom: Teacher expectations and pupils' intellectual development. New York: Holt, Rinehart and Winston. Saslow, L. (1989). Schools say inflated grades cut grants. New York Times. p. 1, May 7, 1989.

173 Sewell, W. H., & Shah, V. P. (1968). Social class, parental encouragement, and educational aspirations. American Journal of Sociology, 73(5), 559-572. Shepard, L. A. (1990). Inflated test score gains: Is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9(3), 15-22. Shepard, L. A. (1992a). Commentary: What policy makers who mandate tests should know about the new psychology of intellectual ability and learning. In B. R. Gifford & M. C. O’Connor (Eds.), Changing assessments: Alternate views of aptitude, achievement, and instruction (pp. 301-328). Boston: Kluwer Academic Publishers. Shepard, L. A . (1992b). Uses and abuses of testing. In M. C. Alkin (Ed.), Encyclopedia of educational research (pp.1477-1485). New York: Macmillan Publishing Company. Shepard, L. A. (2000). The role of classroom assessment in teaching and learning (CSE Technical Report 517). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Shepard, L. A., Flexer, R. J., Hiebert, E. H., Marion, S. F., Mayfield, V. & Weston, T. J. (1995). Effects of introducing classroom performance assessments (CSE Technical Report 394). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18(9), pp.8-14. Snow, R. E. (1998). Abilities as aptitudes and achievements in learning situations. In J. J. McCardle & R. W. Woodcock (Eds.), Human cognitive abilities in theory and practice (pp. 93-112). Mawah, NJ: Lawrence Erlbaum Associates. Snow, R. E. & Jackson, D. N. (1993). Assessment of conative constructs for educational research and evaluation: A catalogue (CSE Technical Report 354). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Snow, R. E. & Jackson, D. N. (1997). Individual differences in conation: Selected constructs and measures (CSE Technical Report 447). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Snow, R. E., & Mandinach, E. B. (1999). Integrating assessment and instruction for classrooms and courses: Programs and prospects for research. Princeton: Educational Testing Service. Soares, A. T., & Soares, L. M. (1969). Self-perceptions of culturally disadvantaged children. American Educational Research Journal, 6(1), 31-45.

174 Spady, W. G. (1970). Lament for the letterman: Effects of peer status and extracurricular activities on goals and achievement. American Journal of Sociology, 75(4), 680-702. Starch, D., & Elliott, E. C. (1912). Reliability of the grading of high-school work in English. School Review, 20, 442-457. Starch, D., & Elliott, E. C. (1913). Reliability of grading work in mathematics. School Review, 21(5), 254-256. Steinberg, L., Brown, B. B., Cider, M., Kaczmarek, N., & Lazzaro, C. (1988). Noninstructional influences on high school student achievement: The contributions of parents, peers, extracurricular activities, and part-time work. Madison, WI: National Center for Effective Secondary Schools. Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence. New York: Cambridge University Press. Strenta, A. C., & Elliott, R. (1987). Differential grading standards revisited. Journal of Educational Measurement, 24(4), 281-291. Stricker, L. J. & Emmerich, W. (1999). Possible determinants of differential item functioning: Familiarity, interest, and emotional reaction. Journal of Educational Measurement, 36(4), 347-366. Stricker, L. J., Rock, D. A., & Burton, N. W. (1991). Sex differences in SAT predictions of college grades (College Board Report No. 91-2, ETS RR-91-38). New York: College Entrance Examination Board. Stricker, L. J., Rock, D. A., & Burton, N. W. (1993). Sex differences in predictions of college grades from Scholastic Aptitude Test scores. Journal of Educational Psychology, 85(4), 710-718. Stricker, L. J., Rock, D. A., Burton, N. W., Muraki, E., & Jirele, T. J. (1994). Adjusting college grade point average criteria for variations in grading standards: A comparison of methods. Journal of Applied Psychology, 79(2), 178-183. Taber, T. D., & Hackman, J. D. (1976). Dimensions of undergraduate college performance. Journal of Applied Psychology, 61(5), 546-558. Tatsuoka, K. K. & Tatsuoka, M. M. (1992). A psychometrically sound cognitive diagnostic model: Effect of remediation as empirical validity (ETS RR-92-38). Princeton, NJ: Educational Testing Service. Terwilliger, J. S. (1989). Classroom standard setting and grading practices. Educational Measurement: Issues and Practice, 8(2), 15-19.


Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 561-620). Washington, DC: American Council on Education. Thorndike, R. L. (1963). The concepts of over- and underachievement. New York: Teachers College, Columbia University. Thorndike, R. L. (1969). Marks and marking systems. In R. L. Ebel (Ed.), Encyclopedia of educational research (4th ed., pp. 759-766). New York: Macmillan. Thurstone, L. L. (1947). Multiple-factor analysis. Chicago: University of Chicago Press. Tucker, L. (1960). Formal models for a central prediction system (ETS RB-60-14). Princeton, NJ: Educational Testing Service. Vars, F. E. & Bowen, W. G. (1998). Scholastic Aptitude Test scores, race, and academic performance in selective colleges and universities. In Jencks, C. & Phillips, M. (Eds.), The black-white score gap (pp. 457-479). Washington: Brookings Institution Press. Vickers, J. M. (2000). Justice and truth in grades and their averages. Research in Higher Education, 41(2), 141-164. Warren, J. R. (1971). College grading practices: An overview (Report 9). Washington, DC: ERIC Clearinghouse on Higher Education, George Washington University. Weiner, B. (1992). Human motivation: Metaphors, theories, and research. Newbury Park, CA: Sage Publications. Wenglinsky, H. (1997). How money matters: The effect of school district spending on academic achievement. Sociology of Education, 70(July), 221-237. Werts, C., Linn, R. L., & Joreskog, K. G. (1978). Reliability of college grades from longitudinal data. Educational and Psychological Measurement, 38(1), 89-95. Werts, C. E. (1967). The many faces of intelligence. Journal of Educational Psychology, 58(4), 198-204. Whitaker, U. (1989). Assessing learning: Standards, principles, and procedures. Philadelphia: Council for Adult and Experiential Learning. Wigdor, A. K., & Garner, W. R. (Eds.). (1982). Ability testing: Uses, consequences, and controversies. Washington, DC: National Academy Press.

176 Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713. Wilgoren, J. (2000). Cheating of statewide tests is reported in Massachusetts. New York Times. February 25. Willingham, W. W. (1961). Prediction of the academic success of transfer students (RM 61-16). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1962a). Longitudinal analysis of academic performance. (RM 625). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1962b). The analysis of grading variations (RM 62-9). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1963a). Adjusting college predictions of the basis of academic origins. In M. Katz (Ed.), The twentieth yearbook of the National Council on Measurement in Education (pp. 1-6). East Lansing, MI: National Council on Measurement in Education. Willingham, W. W. (1963b). The application blank as a predictive instrument (RM 6310). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1963c). The effect of grading variations on the efficiency of predicting freshman grades (RM 63-1). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1963d). Variation among the grade scales of different high schools (RM 63-3). Atlanta: Georgia Institute of Technology. Willingham, W. W. (1965). The application blank as a predictive instrument. College and University, Spring, 271-281. Willingham, W. W. (1974). Predicting success in graduate education. Science, 183, 273-278. Willingham, W. W. (1985). Success in college: The role of personal qualities and academic ability . New York: College Entrance Examination Board. Willingham, W. W. (1999). A systemic view of test fairness. In S. Messick (Ed.), Assessment in Higher Education (p. 213-242). Mahwah, NJ: Lawrence Erlbaum Associates.

Willingham, W. W. & Breland, H. M. (1982). Personal qualities and college admissions. New York: College Board.


Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates. Willingham, W. W. & Lewis, C. (1990). Institutional differences in prediction trends. In W. W. Willingham, C. Lewis, R. Morgan, & L. Ramist, Predicting college grades: An analysis of institutional trends over two decades (pp. 141-160). Princeton, NJ: Educational Testing Service. Wing, C. W., & Wallach, M. A. (1971). College admissions and the psychology of talent. New York: Holt, Rinehart, & Winston. Young, J. W. (1990). Adjusting the cumulative GPA using item response theory. Journal of Educational Measurement, 27(2), 175-186. Young, J. W. (1991). Gender bias in predicting college academic performance: A new approach using item response theory. Journal of Educational Measurement, 28(1), 37-47.

Author Note

Appreciation is expressed to Henry Braun, Brent Bridgeman, Nancy Burton, Robert Linn, and Lawrence Stricker for reviewing drafts of this report and to Linda Johnson for graphical and editorial assistance. The National Center for Education Statistics provided the data for the analysis. The study was supported by Educational Testing Service.

Figure 1
A Framework of Possible Sources of Discrepancy Between Observed Grades and Test Scores

A. Content Differences Between Grades and Test Scores
   1. Domain of general knowledge and skill
      a. Subjects covered, such as science and history; broad divisions within subjects, such as physics or European history
      b. General cognitive skills, such as reasoning, writing, or performance
   2. Specific knowledge and skills as reflected in:
      a. Course-based content throughout the school district, state, or nation (especially relevant to an external test)
      b. Classroom-based content (especially relevant to a teacher's grade)
      c. Individualized content (especially relevant to personal interests and skills)
   3. Components other than subject knowledge and skills:
      a. Social objectives of education (e.g., leadership, citizenship)
      b. Academic and personal development (e.g., attendance, class participation, completing assignments, disruptive behavior, effort and coping skills)
      c. Assessment skills and debilities (pertinent to test-taking or class assignments, construct-relevant or irrelevant; general or specific to a particular assessment)
B. Individual Differences Related to Content Differences
   1. Early development and relevant learning acquired outside of school
   2. Student motivation reflected in academic behavior, attitudes, and circumstances
   3. Teacher judgment regarding the student's performance
C. Errors in Grades or Test Scores
   1. Systematic error (noncomparability)
      a. Variation in grading standards (across schools, courses, teachers, and sections)
      b. Variation in test score scales (across forms; across time)
      c. Cheating (by students or schools, on class assignments or tests)
   2. Unsystematic measurement error (unreliability, in grades and in test scores)
D. Situational Differences
   1. Across contexts
   2. Over time

Figure 3
The Accumulating Effects of Five Factors That Help to Account for Observed Differences Between Grades and Test Scores

Status of the grade-score relationship (correlation of HSA with the NELS test, in brackets) and the adjustment made for each factor:

Transcript grade average is correlated with total test score [.62]
   Factor 1. Subjects Covered. Restrict the grade average to 4 NAEP "new basic" academic areas; optimally weight the 4 corresponding test scores.
Grades and scores are based on corresponding subject matter [.68]
   Factor 2. Grading Variations. Subtract school means; correct for range restriction. Adjust HSA for the mean grading difficulty of each student's courses.
School and course grading variations are removed [.76]
   Factor 3. Reliability. Correct for the unreliability of HSA and the NELS test scores.
Measurement errors in scores and grades are taken into account [.81]
   Factor 4. Student Characteristics. Add 26 student variables to the multiple correlation between test scores and HSA.
Differential effects of student effort on grades and test scores are taken into account [.86]
   Factor 5. Teacher Ratings. Add 5 teacher judgments concerning the school behavior of individual students.
Other factors noted by teachers that may influence grades are taken into account [.90]
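The Factor 3 step in Figure 3 is the conventional correction for attenuation. The short sketch below is illustrative only and is not code from the study; the test reliability it uses is an assumed value rather than one estimated from the NELS data.

```python
# Illustrative sketch only (not the study's code): the standard correction for
# attenuation applied at the Factor 3 step of Figure 3.
from math import sqrt

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation between true scores from an observed correlation."""
    return r_observed / sqrt(rel_x * rel_y)

# The observed correlation after Factor 2 comes from Figure 3; the grade-average
# reliability is in the range shown in Table 6; the test reliability is assumed.
r_after_factor2 = 0.76
rel_hsa, rel_test = 0.97, 0.92
print(round(disattenuate(r_after_factor2, rel_hsa, rel_test), 2))
```

With assumed reliabilities of roughly this size, the corrected value falls near the .81 reported for Factor 3.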

Figure 6
A Schema for Considering Possible Strengths of Grades and Test Scores in High Stakes Decisions

Differential characteristics of GRADES (as performance on the local contract) and TEST SCORES (as performance on the external standard):

Grading Variations (Factor 2)
   Pertinent assessment objective: to assess fairly
   Grades: local standards.  Test scores: common yardstick.
Scholastic Engagement (Factor 4)
   Pertinent assessment objective: to evaluate critical skills
   Grades: conative skills.  Test scores: cognitive skills.
Teacher Judgment (Factor 5)
   Pertinent assessment objective: to motivate teaching and learning
   Grades: assessment based on each student's learning and behavior.  Test scores: assessment based on designated knowledge and skills.

Table 1
Percentage of Faculty Reporting that Various Factors Affect Grades as a Matter of Policy or as Expected Practice*

Grading Factor                                                  Official Policy   Expected Practice
Late work must be graded lower                                         7                 47
Attendance is included in the course grade                           11                  31
Grades reflect progress toward goals of individual students           9                  30
Effort is included in the course grade                                3                  30
Attitude and/or behavior is included in the course grade              6                  26
Students can raise grades through an extra credit project             4                  15

*Reproduced with permission. Copyright © 1994 by The College Entrance Examination Board. All rights reserved.

Table 2
Sample Sizes for Gender by Ethnicity and School Program*

                             Male    Female    Total
Ethnicity
  African American            268       323      591
  Asian American              251       273      524
  Hispanic                    405       399      804
  White                      3201      3270     6471
  (subtotal)                 4125      4265     8390
School Program
  Rigorous Academic           907       985     1892
  Academic                   2235      2415     4650
  Academic-Vocational         345       282      627
  Vocational                  235       167      402
  (subtotal)                 3722      3849     7571
Total                        4154      4300     8454

*Ethnicity was not available for 64 students. Program counts do not include 883 students whose programs were not classified by NCES or a very small group characterized as Rigorous Academic/Vocational. Information regarding ethnicity was not available for 1% of the sample.

Table 3
Student Characteristics and Teacher Ratings—Illustrative Content and Missing Data*

Variable, with the percentage of data missing shown at the end of each line:

A. School Skills
   1. Attendance (6: not absent/tardy—from student and school record) ... 0
   2. Class participation (8: come prepared, pay attention, take notes) ... 0.1
   3. Discipline problems (6: trouble with school rules, suspension, fights) ... 0.1
   4. Work completed (5: turn in work on time, more than required) ... 0.1
   5. Homework hours (2: hours per week—at home and school) ... 0.6
B. Initiative
   6. Courses completed (number, irrespective of grade earned) ... 0
   7. Adv. electives (5 pts: any AP course, 12th grade Math though not required) ... 0
   8. School activities (12 pts: with added points for awards/offices) ... 0
   9. School sports (5 pts: with added points for achievement/leadership) ... 0
   10. Outside activities (6: frequency or award) ... 0
C. Competing Activities
   11. Drugs/gangs (4: involvement with) ... 5.1
   12. Killing time (4: TV, video games, talking, riding around) ... 0.5
   13. Peer sociability (3: friends like parties, popularity, going steady) ... 4.6
   14. Employment (20+ hours per week—yes/no) ... 2.8
   15. Child care (20+ hours per week—yes/no) ... 0.4
   16. Leisure reading (hours/week, not school related) ... 0
D. Family Background
   17. SES (NELS composite: parents' education, occupation, and income) ... 0.1
   18. Family intact (living with 2 parents/stepparents—from 1990 survey) ... 0.9
   19. Parent relations (7: discuss school with parents, OK with parents) ... 0.1
   20. Parent aspiration (4: want child to continue education) ... 0.9
   21. Stress at home (10: parent lost job, died; family member on drugs) ... 0.7
E. Attitudes
   22. Teacher relations (3: thinks they do a good job, solicits their help) ... 0.1
   23. Educational plans (6: plans college, plans career requiring college) ... 0
   24. Self esteem (6: pride, optimism) ... 2.6
   25. Locus of control (5: internal control—planning and effort pay off) ... 2.9
   26. Peer studiousness (6: friends like school, studying, good grades) ... 4.3
F. Teacher Ratings
   27. Attendance (2: seldom absent or tardy) ... 9.9
   28. Class behavior (2: usually attentive, seldom disruptive) ... 9.9
   29. Consults teacher (talks with teacher outside class) ... 10.1
   30. Educational motivation (3: works hard, going to college, will not drop out) ... 9.9
   31. Work completed (usually completes homework assignments) ... 10.1

*Numbers in parentheses indicate that a variable is based on a mean or point count (pts.) for more than one response, rating, or other item of information; the following phrases illustrate the types of content.

Table 4
Regression of Total Transcript Average (HSA-T) and Academic Average (HSA) on NELS Tests*

                           HSA-T              HSA
NELS Test               r        β         r        β
Reading                .55      .17       .56      .14
Mathematics            .64      .52       .66      .54
Science                .51     −.07       .53     −.07
Social Studies         .51      .07       .54      .10
Multiple R                      .65                .68

*N = 8454.
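For readers who want to reproduce this kind of summary, the sketch below shows how standardized weights and a multiple R are obtained by ordinary least squares; it is illustrative only and uses simulated stand-in data, not the NELS:88 file.

```python
# Illustrative sketch only: standardized regression weights and multiple R of the
# kind reported in Table 4, computed with ordinary least squares on z-scored data.
# The arrays below are placeholders, not the NELS:88 data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # stand-ins for the four NELS subject tests
y = X @ np.array([0.15, 0.55, 0.0, 0.1]) + rng.normal(scale=0.8, size=500)  # stand-in HSA

Xz = (X - X.mean(0)) / X.std(0)        # z-score predictors and criterion so the
yz = (y - y.mean()) / y.std()          # fitted coefficients are beta weights
betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
multiple_r = np.corrcoef(Xz @ betas, yz)[0, 1]
print(np.round(betas, 2), round(multiple_r, 2))
```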

Table 5
Analysis of Course Grading Variations by Schools, Courses, and the Interaction of Schools and Courses*

Source of Variation                 Sum of Squares   Proportion of Total
Schools                                  8,115.3             .41
Courses, controlling schools             2,751.4             .14
Additive model                          10,866.8             .54
Course-school interaction                9,118.4             .46
Total between cells                     19,985.2            1.00

*Based on 574 schools and 225 courses in four basic academic areas. Adjusted for overfitting; see Note 8.
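The decomposition in Table 5 partitions variation in course grade means into school, course, and interaction components. The sketch below illustrates the balanced two-way version of that partition on hypothetical cell means; the study's cells are unbalanced and were adjusted for overfitting (Note 8), which this simple version ignores.

```python
# Illustrative sketch only: a balanced two-way breakdown of school-by-course mean
# grades into school, course, and interaction sums of squares. The cell means are
# simulated placeholders, not the transcript data analyzed in the report.
import numpy as np

rng = np.random.default_rng(1)
cell_means = rng.normal(loc=2.5, scale=0.4, size=(20, 8))   # hypothetical schools x courses

grand = cell_means.mean()
school_eff = cell_means.mean(axis=1) - grand
course_eff = cell_means.mean(axis=0) - grand

n_schools, n_courses = cell_means.shape
ss_schools = n_courses * np.sum(school_eff ** 2)
ss_courses = n_schools * np.sum(course_eff ** 2)
ss_total = np.sum((cell_means - grand) ** 2)
ss_interaction = ss_total - ss_schools - ss_courses

for name, ss in [("Schools", ss_schools), ("Courses", ss_courses),
                 ("Interaction", ss_interaction), ("Total", ss_total)]:
    print(f"{name:12s} {ss:8.1f}  {ss / ss_total:.2f}")
```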

Table 6
Reliabilities of Grade Averages and the Credit Hours upon Which They Were Based

Reliabilities
                            Subject Area Averages               4-Year Averages
                        English   Math  Science  Soc. Stud.   HSA   HSA*  HSA-T*
Male                       .91     .87    .87      .91        .97    .97    .97
Female                     .90     .86    .87      .90        .96    .97    .97
African American           .84     .76    .81      .83        .94    .94    .96
Asian American             .91     .91    .85      .91        .97    .97    .96
Hispanic                   .89     .82    .84      .89        .95    .96    .96
White                      .92     .87    .87      .91        .97    .97    .97
Rigorous Academic          .91     .88    .87      .89        .96    .97    .97
Academic                   .91     .87    .87      .90        .96    .97    .97
Academic/Vocational        .87     .82    .79      .86        .94    .95    .95
Vocational                 .80     .63    .69      .79        .89    .89    .92
Total                      .91     .87    .87      .91        .97    .97    .97

Mean Credit Hours#
                        English   Math  Science  Soc. Stud.  HSA Hours  % of Total  Total Hours
Male                       4.2     3.6    3.2      3.6         14.7         67         21.9
Female                     4.2     3.5    3.2      3.7         14.5         65         22.4
African American           4.4     3.6    3.1      3.7         14.7         68         21.7
Asian American             4.3     3.9    3.5      3.7         15.4         69         22.3
Hispanic                   4.4     3.5    2.9      3.6         14.4         65         22.0
White                      4.2     3.5    3.2      3.6         14.5         65         22.2
Rigorous Academic          4.3     3.9    3.8      3.8         15.8         69         23.0
Academic                   4.3     3.6    3.3      3.7         14.9         67         22.1
Academic/Vocational        4.3     3.3    2.8      3.6         14.0         62         22.7
Vocational                 4.0     2.5    2.2      3.0         11.6         54         21.3
Total                      4.2     3.5    3.2      3.6         14.6         66         22.1

*Corrected for range restriction. #Carnegie Units completed, irrespective of grades earned.
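The reliabilities above are for averages of many course grades, so they run much higher than the reliability of any single grade. One common way to see why is the Spearman-Brown relation for a k-component average, sketched below with hypothetical values; the study's own estimation procedure is described in its Statistical Analysis section.

```python
# Illustrative sketch only: Spearman-Brown reliability of an average of k components.
# The intercorrelation and number of courses below are assumed, not estimated here.
def spearman_brown(mean_r: float, k: int) -> float:
    """Reliability of an average of k components whose mean intercorrelation is mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical values: about 15 academic course grades with a mean intercorrelation of .55
print(round(spearman_brown(0.55, 15), 2))   # roughly .95, in the range shown in Table 6
```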

Table 7
Correlations of Student Characteristics and Teacher Ratings with NELS Test, High School Averages, and Grading Corrections*

Variable                      NELS-T   HSA(T)    HSA     SGF    MCGF
A. School Skills
1. Attendance                   .19      .35     .33    −.03     .12
2. Class participation          .09      .22     .21    −.01     .12
3. Discipline problems         −.22     −.34    −.31     .04    −.10
4. Work completed              −.04      .19     .18    −.03    −.00
5. Homework hours               .22      .20     .21    −.02     .13
B. Initiative
6. Courses completed            .28      .36     .32    −.05    −.10
7. Advanced electives           .59      .54     .58     .03     .47
8. School activities            .19      .30     .28    −.08    −.03
9. School sports                .04      .04     .06    −.02     .14
10. Outside activities          .15      .14     .14    −.01     .10
C. Competing Activities
11. Drugs/gangs                −.07     −.23    −.20     .07    −.00
12. Killing time               −.10     −.16    −.16     .01    −.04
13. Peer sociability           −.09     −.10    −.09     .01     .00
14. Employment                 −.12     −.13    −.13     .01    −.08
15. Child care                 −.10     −.06    −.07     .00    −.07
16. Leisure reading             .20      .06     .06     .01     .00
D. Family Background
17. SES                         .48      .33     .35     .08     .31
18. Family intact               .11      .13     .13    −.02     .03
19. Parent relations            .07      .18     .17    −.02     .09
20. Parent aspiration           .34      .31     .33     .02     .32
21. Stress at home             −.07     −.12    −.12     .03    −.04
E. Attitudes
22. Teacher relations           .21      .23     .23    −.00     .13
23. Educational plans           .33      .34     .35     .02     .28
24. Self esteem                 .28      .29     .29    −.00     .21
25. Locus of control            .24      .25     .25    −.01     .13
26. Peer studiousness           .25      .29     .29     .00     .16
F. Teacher Ratings
27. Attendance                  .22      .38     .37    −.07     .16
28. Class behavior              .35      .51     .51    −.04     .20
29. Consults teacher            .11      .15     .15    −.04     .05
30. Educ. motivation            .45      .62     .63    −.05     .33
31. Work completed              .33      .61     .61    −.11     .19

*N = 8454.

Table 8
The Relationship of Student Characteristics and Teacher Ratings to High School Average (HSA)—with Progressive Controls Applied

Columns: (1) correlation with HSA; (2) r, plus grading control; (3) plus test and grading control; (4) beta weight, reliability correction added, 26 student variables; (5) beta weight, adding 5 teacher ratings.

Variable                        (1)      (2)      (3)      (4)      (5)
A. School Skills
1. Attendance                   .33      .39      .31#     .11*     .07*
2. Class participation          .21      .23      .23#     .01      .00
3. Discipline problems         −.31     −.35     −.23#    −.03*     .00
4. Work completed               .18      .18      .30#     .11*     .09*
5. Homework hours               .21      .21      .06     −.04*    −.04*
B. Initiative
6. Courses completed            .32      .45      .24#     .07*     .04*
7. Advanced electives           .58      .65      .34#     .16*     .12*
8. School activities            .28      .32      .18#     .03*     .03*
9. School sports                .06      .06      .03     −.00     −.01
10. Outside activities          .14      .18      .06     −.01     −.01
C. Competing Activities
11. Drugs/gangs                −.20     −.24     −.21#    −.02     −.00
12. Killing time               −.16     −.16     −.14#    −.03*    −.03*
13. Peer sociability           −.09     −.11     −.05     −.02     −.01
14. Employment                 −.13     −.13     −.08     −.01     −.01
15. Child care                 −.07     −.08     −.01     −.00      .00
16. Leisure reading             .06      .07     −.08     −.06*    −.04*
D. Family Background
17. SES                         .35      .43      .11      .03*     .02
18. Family intact               .13      .15      .08      .02      .01
19. Parent relations            .17      .20      .17      .02      .01
20. Parent aspiration           .33      .39      .15     −.00     −.01
21. Stress at home             −.12     −.13     −.10     −.00      .00
E. Student Attitudes
22. Teacher relations           .23      .19      .12      .01      .01
23. Educational plans           .35      .41      .21      .04*     .03*
24. Self esteem                 .29      .35      .17      .00      .00
25. Locus of control            .25      .27      .13      .02      .02
26. Peer studiousness           .29      .30      .18     −.01     −.01
F. Teacher Ratings
27. Attendance                  .37      .43      .31               .01
28. Class behavior              .51      .55      .41               .04*
29. Consults teacher            .15      .18      .10              −.01
30. Educ. motivation            .63      .68      .50               .12*
31. Work completed              .61      .65      .56               .20*

*β ≥ .03. #Variables used to define Scholastic Engagement. N = 8454.

Table 9
Partial Correlations Among Behavioral Measures That Suggest Scholastic Engagement—By Subgroup and School Program*

Partial correlations with HSA:

Student Behavior           Rigor.  Acad.  Acad./  Voc.   Male  Female   Afr.   Asian  Hisp.  White
                           Acad.          Voc.                          Amer.  Amer.
Advanced electives           .20    .30    .20   −.06    .30    .27     .28    .27    .30    .29
Work completed               .32    .30    .30    .21    .28    .27     .28    .32    .26    .30
Attendance                   .25    .27    .25    .18    .29    .28     .19    .37    .31    .27
Class participation          .22    .25    .18    .03    .21    .21     .13    .27    .21    .24
Discipline problems (−)     −.20   −.20   −.17   −.13   −.19   −.15    −.20    .01   −.15   −.21
Drugs/gangs (−)             −.22   −.19   −.25   −.21   −.16   −.20    −.12   −.18   −.11   −.21
Killing time (−)            −.12   −.16   −.13   −.08   −.11   −.12    −.15   −.02   −.10   −.14
School activities            .12    .13    .11   −.06    .12    .06     .12    .11    .12    .12
Courses completed            .07    .02    .02    .07    .11    .14     .12    .13    .12   −.04

*Student behaviors are listed by size of the partial r in the total sample. Test scores and grading variations were controlled in computing partial correlations. All coefficients were corrected for range restriction. HSA and test scores were corrected for unreliability. (−) indicates disengaged behavior.
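The coefficients in Table 9 are partial correlations, that is, correlations between a behavior and HSA after the control variables (test scores and grading variations) have been regressed out of both. A minimal illustrative sketch with simulated stand-in data, not the study's analysis:

```python
# Illustrative sketch only: a partial correlation of the kind reported in Table 9,
# computed by correlating the residuals of the two measures after regressing each
# on the control variables (here a single test score). Data are simulated.
import numpy as np

def partial_corr(x, y, controls):
    """Correlate the residuals of x and y after regressing each on the controls."""
    Z = np.column_stack([np.ones(len(x)), controls])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(3)
test = rng.normal(size=800)
behavior = 0.4 * test + rng.normal(size=800)          # stand-in for, e.g., advanced electives
hsa = 0.5 * test + 0.3 * behavior + rng.normal(size=800)
print(round(partial_corr(behavior, hsa, test.reshape(-1, 1)), 2))
```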

Table 10
Patterns of Scholastic Engagement by Subgroup and School Program

Mean standard scores*:

Student Behavior           Rigor.  Acad.  Acad./  Voc.   Male  Female   Afr.   Asian  Hisp.  White
                           Acad.          Voc.                          Amer.  Amer.
Advanced electives          55.7   50.9   44.0   39.4   49.6   50.4    46.7   55.1   46.9   50.3
Work completed              50.8   49.9   50.1   49.1   48.4   51.6    51.5   50.3   50.0   49.8
Attendance                  52.2   50.3   49.1   46.8   49.9   50.1    51.2   50.4   47.2   50.2
Class participation         51.2   50.5   49.2   45.5   48.2   51.7    52.1   49.6   49.8   49.9
Discipline problems (−)      30%    36%    44%    56%    51%    26%     45%    31%    41%    37%
Drugs/gangs (−)             48.9   50.1   49.6   52.1   51.6   48.5    46.2   47.3   50.4   50.5
Killing time (−)            49.2   49.9   51.2   50.8   51.0   49.0    50.5   48.6   49.5   50.1
School activities           51.1   50.5   48.5   46.5   48.1   51.8    49.9   51.5   48.5   50.1
Courses completed           54.4   49.8   51.6   45.2   48.8   51.2    46.8   51.2   48.2   50.4

*Standard scales with X̄ = 50 and SD = 10 were based on the total sample. (−) indicates disengaged behavior. Discipline here described as % reporting any infraction.

Table 11
Regression of Scholastic Engagement on Family and Attitude Measures—By Gender

                                  Males              Females
Variable                        r       β          r       β
Family Background
17. SES                        .24     .05        .23     .05
18. Family intact              .10     .03        .10     .04
19. Parent relations           .28     .09        .29     .09
20. Parent aspiration          .31     .06        .28     .02
21. Stress at home            −.22    −.10       −.17    −.06
Attitudes
22. Teacher relations          .33     .16        .30     .14
23. Educational plans          .40     .19        .38     .22
24. Self esteem                .35     .07        .35     .05
25. Locus of control           .27     .04        .30     .10
26. Peer studiousness          .44     .25        .40     .23
Multiple R                             .60                .56

Table 12
Actual and Predicted High School Average for Four Subgroups and Four School Programs—By Amount of Predictive Information

Entries are actual HSA / predicted HSA, with the differential prediction (diff. pred.) in parentheses.

                          Predictive Information*
Group                     1. NELS Test        2. Plus grade         3. Plus 26            4. Plus teacher
                                              variations controlled student variables     judgments
Women                     2.59/2.47 (+.12)    2.58/2.47 (+.11)      2.58/2.52 (+.06)      2.59/2.55 (+.03)
African-American          2.01/2.11 (−.10)    2.28/2.29 (−.01)      2.28/2.31 (−.03)      2.28/2.30 (−.02)
Asian-American            2.82/2.69 (+.13)    2.76/2.62 (+.14)      2.76/2.70 (+.06)      2.76/2.72 (+.04)
Hispanic                  2.18/2.23 (−.05)    2.33/2.36 (−.03)      2.33/2.37 (−.03)      2.34/2.37 (−.02)
School Program
Rigorous Academic         2.87/2.75 (+.12)    2.84/2.71 (+.13)      2.84/2.78 (+.06)      2.84/2.80 (+.04)
Academic                  2.53/2.53 (.00)     2.50/2.51 (−.00)      2.50/2.51 (−.01)      2.51/2.52 (−.01)
Academic-Vocational       2.22/2.20 (+.03)    2.29/2.29 (+.03)      2.29/2.25 (+.02)      2.30/2.28 (−.00)
Vocational                1.84/1.98 (−.14)    1.97/2.10 (−.13)      1.97/1.99 (−.02)      1.97/1.96 (+.01)

*In each column, predictions also take into account the predictors used in previous columns. In columns 2-4, school means were subtracted from all measures. Differences between actual and predicted HSA (diff. pred.) reflect rounding. Total N = 7571 in each of columns 1-3; 6853 in column 4.
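Differential prediction in Tables 12 and 13 is simply a group's mean actual HSA minus the mean HSA predicted for that group from an equation fitted to the whole sample. A minimal sketch with hypothetical data, not the study's analysis:

```python
# Illustrative sketch only: "differential prediction" as each group's mean residual
# (actual minus predicted grade average) from a single whole-sample prediction
# equation. The data frame here is a simulated placeholder.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "test": rng.normal(50, 10, 1000),
    "group": rng.choice(["A", "B"], 1000),
})
df["hsa"] = 1.0 + 0.03 * df["test"] + rng.normal(0, 0.5, 1000)

# Fit one prediction equation for everyone, then summarize residuals by group
b, a = np.polyfit(df["test"], df["hsa"], 1)
df["predicted"] = a + b * df["test"]
diff_pred = (df["hsa"] - df["predicted"]).groupby(df["group"]).mean().round(2)
print(diff_pred)   # positive = group earns higher grades than the equation predicts
```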

Table 13
Actual and Predicted High School Average for Males and Females Within Ethnic Groups—By Amount of Predictive Information

Entries are actual HSA / predicted HSA, with the differential prediction (diff. pred.) in parentheses.

                            Predictive Information*
Group                       1. NELS Test        2. Plus grade         3. Plus 26            4. Plus teacher
                                                variations controlled student variables     judgments
White Male                  2.43/2.54 (−.12)    2.39/2.51 (−.12)      2.39/2.45 (−.06)      2.40/2.43 (−.03)
White Female                2.65/2.52 (+.13)    2.60/2.49 (+.11)      2.60/2.54 (+.06)      2.60/2.57 (+.03)
African-American Male       1.85/2.09 (−.23)    2.11/2.27 (−.16)      2.11/2.22 (−.11)      2.11/2.17 (−.06)
African-American Female     2.14/2.13 (+.02)    2.42/2.31 (+.11)      2.42/2.38 (+.04)      2.42/2.40 (+.02)
Asian-American Male         2.71/2.67 (+.03)    2.66/2.61 (+.05)      2.66/2.65 (+.02)      2.67/2.65 (+.02)
Asian-American Female       2.92/2.71 (+.21)    2.85/2.64 (+.21)      2.85/2.76 (+.09)      2.84/2.79 (+.05)
Hispanic Male               2.13/2.27 (−.14)    2.27/2.39 (−.13)      2.27/2.35 (−.09)      2.28/2.33 (−.05)
Hispanic Female             2.23/2.18 (+.04)    2.40/2.33 (+.07)      2.40/2.38 (+.02)      2.41/2.41 (+.01)

*In each column, predictions also take into account the predictors used in previous columns. In columns 2-4, school means were subtracted from all measures. Differences between actual and predicted HSA (diff. pred.) reflect rounding. Total N = 8390 in each of columns 1-3; 7565 in column 4.

Table 14
NELS Test Means and Correlations With High School Average for Six Groups*

                        Afr.    Asian   Hisp.   White   Male   Female  Rigor.  Acad.  Acad./  Voc.   Total
                        Amer.   Amer.                                  Acad.          Voc.
Correlation with HSA:
 4 tests (Mult. R)       .76     .84     .72     .80     .78    .82     .80     .79    .73     .56    .79
 NELS composite          .76     .78     .72     .79     .77    .81     .79     .78    .70     .52    .78
 NELS Reading            .69     .62     .66     .69     .64    .71     .70     .68    .61     .50    .68
 NELS Math               .75     .78     .71     .78     .77    .80     .78     .77    .68     .50    .77
 NELS Science            .65     .70     .64     .66     .67    .72     .70     .67    .53     .37    .66
 NELS Social Studies     .63     .69     .62     .67     .65    .71     .69     .67    .54     .41    .66
Means:
 NELS Reading           43.7    52.0    45.9    51.0    48.9   51.1    54.0    50.8   45.2    42.3   50.0
 NELS Math              42.6    54.3    45.0    51.0    50.6   49.4    55.1    50.9   44.7    40.3   50.0
 NELS Science           41.8    52.0    44.7    51.3    51.6   48.4    53.9    50.7   45.9    43.0   50.0
 NELS Social Studies    44.2    52.8    45.9    50.9    50.9   49.1    54.0    50.9   45.4    42.6   50.0

*Correlations were corrected for range restriction, grading variations, and unreliability of tests and grades. The grade criterion was HSA+2G. In this table test scores were scaled to a mean of 50 and a standard deviation of 10 based on the total study sample.
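The corrections referred to in the footnote rest on standard psychometric adjustments: the Spearman correction for attenuation due to unreliability and, for range restriction, the Thorndike Case II formula. A minimal sketch of those two textbook formulas follows; whether they reproduce the authors' exact procedure in every detail is not shown here, and the numbers in the example are illustrative:

    import math

    def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
        """Spearman correction: estimated correlation between true scores."""
        return r_xy / math.sqrt(rel_x * rel_y)

    def thorndike_case2(r: float, sd_unrestricted: float, sd_restricted: float) -> float:
        """Correct a correlation in a restricted sample for direct range restriction on x."""
        u = sd_unrestricted / sd_restricted
        return r * u / math.sqrt(1 - r**2 + (r * u) ** 2)

    # Example: observed r = .62, reliabilities .90 and .80, test SD restricted to 80% of full SD.
    r_corrected = thorndike_case2(disattenuate(0.62, 0.90, 0.80), 10.0, 8.0)
    print(round(r_corrected, 2))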

Table 15
A Comparison of Within-School and Across-School Regression of Grade Average on All 37 Variables

                                  Beta Weights        Beta Weights Across Schools
                                 Within Schools
                                   (HSA + K)           (HSA)        (HSA + 2G)
A. School Skills
 1. Attendance                       .07                .06            .06
 2. Class participation              .00               −.01           −.00
 3. Discipline problems              .00               −.01           −.00
 4. Work completed                   .09                .09            .09
 5. Homework hours                  −.04               −.03           −.03
B. Initiative
 6. Courses completed                .04                .02           −.01
 7. Advanced electives               .12                .12            .14
 8. School activities                .03                .04            .02
 9. School sports                   −.01               −.01           −.00
 10. Outside activities             −.01               −.01           −.01
C. Competing Activities
 11. Drugs/gangs                    −.00               −.02           −.01
 12. Killing time                   −.03               −.03           −.03
 13. Peer sociability               −.01               −.01           −.01
 14. Employment                     −.01               −.01           −.01
 15. Child care                      .00               −.00           −.00
 16. Leisure reading                −.04               −.04           −.04
D. Family Background
 17. SES                             .02                .00            .01
 18. Family intact                   .01                .02            .01
 19. Parent relations                .01                .02            .02
 20. Parent aspiration              −.01               −.01            .00
 21. Stress at home                  .00                .00            .01
E. Attitudes
 22. Teacher relations               .01               −.01           −.01
 23. Educational plans               .03                .02            .03
 24. Self esteem                     .00                .00            .01
 25. Locus of control                .02                .02            .02
 26. Peer studiousness              −.01               −.02           −.02
F. Teacher Ratings
 27. Attendance                      .01                .02            .02
 28. Class behavior                  .04                .03            .04
 29. Consults teacher               −.01               −.01           −.01
 30. Educ. motivation                .12                .11            .12
 31. Work completed                  .20                .20            .18
G. Grading Factors
 32. SGF                              −                −.30             −
 33. MCGF                             −                −.02             −
H. NELS Test
 34. Reading                         .04                .09            .08
 35. Mathematics                     .51                .43            .48
 36. Science                        −.30               −.16           −.19
 37. Social Studies                  .27                .16            .18

Multiple R                           .90                .88            .88
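The within-school analysis in the first column differs from the across-school analyses only in that every measure is first expressed as a deviation from its school mean, so that between-school differences (including school grading level) cannot contribute to the regression. A minimal sketch of that centering step, assuming pandas and illustrative column names:

    import pandas as pd

    def center_within_schools(df, school_col, cols):
        """Replace each measure with its deviation from the student's school mean."""
        out = df.copy()
        out[cols] = df.groupby(school_col)[cols].transform(lambda s: s - s.mean())
        return out

    df = pd.DataFrame({"school": ["s1", "s1", "s2", "s2"],
                       "hsa": [3.2, 2.8, 2.4, 2.0],
                       "nels_math": [55.0, 50.0, 48.0, 44.0]})
    print(center_within_schools(df, "school", ["hsa", "nels_math"]))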

Table 16
A Condensed Analysis: Intercorrelations and Beta Weights for HSA Regressed on Four Major Factors*

                     NELS-C     SGF     Engage     TRC
School Grading         .09
Engagement             .41     −.05
Teacher Rating         .49     −.08       .52
HSA                    .71     −.29       .57       .69

Beta Weights           .50     −.30       .18       .33        Multiple R = .88

*NELS-C and HSA were corrected for unreliability.

Table 17
Condensed Analyses of Major Factors Related to Differences Between Grades and Test Scores—By Gender and Ethnic Groups*

                                 Male     Female   African    Asian      Hispanic   White
                                (4154)    (4300)   American   American    (804)     (6471)
                                                    (591)      (524)
Correlations
 With HSA:
  NELS Test                      .70       .74       .68        .72        .62       .71
  Engagement                     .55       .57       .52        .61        .51       .59
  Teacher rating                 .69       .68       .64        .72        .60       .71
  School grading                −.28      −.31      −.28       −.14       −.28      −.28
 With NELS Test:
  Engagement                     .41       .45       .35        .41        .34       .44
  Teacher rating                 .50       .51       .44        .53        .41       .51
  School grading                 .11       .06       .08        .18        .16       .11
 With Engagement:
  Teacher rating                 .50       .51       .49        .57        .43       .54
  School grading                −.06      −.04      −.04        .04       −.03      −.06
 With Teacher rating:
  School grading                −.08      −.09      −.01       −.01       −.03      −.09
Beta weights in predicting HSA:
  NELS Test                      .49       .53       .50        .50        .48       .49
  Engagement                     .16       .17       .16        .24        .20       .17
  Teacher rating                 .34       .30       .34        .31        .31       .34
  School grading                −.30      −.31      −.31       −.23       −.34      −.30
Multiple R based on:
  4 condensed variables          .86       .89       .85        .87        .83       .88
  37 original variables          .88       .89       .87        .88        .84       .89

*Corrected for range restriction in all variables and unreliability in HSA and NELS Test. (Total N = 8454 except for teacher rating where N = 7619; subgroup Ns in parentheses)

Table 18
Condensed Analyses of Major Factors Related to Differences Between Grades and Test Scores—By School Program*

                                 Rigorous    Academic   Academic     Vocational
                                 Academic    (4650)     Vocational    (402)
                                  (1892)                  (627)
Correlations
 With HSA:
  NELS Test                        .71         .70         .60          .40
  Engagement                       .48         .54         .42          .26
  Teacher rating                   .64         .67         .62          .58
  School grading                  −.30        −.30        −.35         −.31
 With NELS Test:
  Engagement                       .34         .38         .15         −.03
  Teacher rating                   .41         .46         .33          .21
  School grading                   .10         .08         .10          .14
 With Engagement:
  Teacher rating                   .41         .50         .40          .32
  School grading                  −.04        −.05        −.13         −.05
 With Teacher rating:
  School grading                  −.09        −.08        −.14         −.09
Beta weights in predicting HSA:
  NELS Test                        .57         .52         .50          .35
  Engagement                       .15         .18         .16          .11
  Teacher rating                   .31         .31         .34          .44
  School grading                  −.32        −.31        −.33         −.32
Multiple R based on:
  4 condensed variables            .88         .87         .83          .73
  37 original variables            .89         .88         .86          .77

*Corrected for range restriction in all variables and unreliability in HSA and NELS Test. (Total N = 8454 except for teacher rating where N = 7619; subgroup Ns in parentheses)

Table 19
Regression of HSA on All Variables—By Gender and Ethnic Groups*

Beta weights for:               Males   Females   African    Asian     Hispanic   White
                                                  American   American
A. School Skills
 1. Attendance                   .07      .05       .01        .13        .09       .05
 2. Class participation         −.03     −.00      −.05       −.04       −.01      −.01
 3. Discipline problems          .01      .01      −.01       −.02       −.04      −.00
 4. Work completed               .09      .09       .15        .12        .12       .08
 5. Homework hours               .01     −.04      −.03       −.06       −.05      −.03
B. Initiative
 6. Courses completed            .02      .01       .07        .08        .01      −.02
 7. Advanced electives           .13      .10       .08        .14        .14       .11
 8. School activities            .06      .01       .08        .04        .07       .03
 9. School sports                .01      .02      −.01       −.03       −.02      −.01
 10. Outside activities          .00     −.02      −.01       −.01       −.05      −.01
C. Competing Activities
 11. Drugs/gangs                 .02      .05      −.00       −.03       −.06      −.02
 12. Killing time                .02      .02      −.02       −.02       −.02      −.03
 13. Peer sociability            .00      .03       .01       −.01       −.01      −.01
 14. Employment                  .01     −.01      −.01       −.03       −.06      −.01
 15. Child care                  .00      .01       .00       −.01       −.01      −.02
 16. Leisure reading            −.04     −.04      −.09       −.04       −.06      −.04
D. Family Background
 17. SES                         .01      .00      −.01       −.02       −.04      −.02
 18. Family intact               .02      .01       .02        .01       −.00      −.01
 19. Parent relations            .02      .01       .00        .00        .02       .02
 20. Parent aspiration           .03      .02       .01       −.01       −.00      −.01
 21. Stress at home              .00      .03       .01       −.00       −.01      −.01
E. Attitudes
 22. Teacher relations           .00      .02       .04       −.01       −.02      −.01
 23. Educational plans           .02      .03       .03        .04       −.00      −.02
 24. Self esteem                 .03      .03       .01       −.01       −.00      −.01
 25. Locus of control            .04      .02       .02       −.01       −.01      −.00
 26. Peer studiousness          −.02     −.03      −.04       −.02       −.04      −.02
F. Teacher Ratings
 27. Attendance                  .00      .04       .02       −.00       −.00      −.01
 28. Class behavior              .02      .03       .03        .01        .06       .03
 29. Consults teacher            .01     −.01      −.01       −.01       −.01      −.01
 30. Educ. motivation            .10      .12       .08        .10        .09       .13
 31. Completes work              .23      .15       .27        .24        .19       .19
G. Grading Factors
 32. SGF                        −.29     −.30      −.31       −.22       −.35      −.29
 33. MCGF                        .03      .06      −.04       −.01       −.04      −.02
H. NELS Test
 34. Reading                     .06      .03       .25        .12        .09        —
 35. Mathematics                 .40      .43       .37        .25        .43      [.50]#
 36. Science                    −.11     −.08      −.00       −.00       −.19        —
 37. Social Studies              .15      .19       .10        .18       −.06        —

Multiple R                       .88      .89       .87        .88        .84       .89

*Across-schools analysis corrected for unreliability, range restriction, and shrinkage.
#NELS-C was substituted for this analysis because the matrix was singular with the four tests entered separately.

Table 20
Regression of HSA on All Variables—By School Program*

Beta weights for:               Rigorous    Academic   Academic     Vocational   Total
                                Academic               Vocational
A. School Skills
 1. Attendance                    .05          .04        .08          .08        .06
 2. Class participation           .00         −.03       −.04         −.06       −.01
 3. Discipline problems           .04         −.00       −.01         −.00       −.01
 4. Work completed                .11          .08        .14          .15        .09
 5. Homework hours               −.01         −.03       −.03         −.15       −.03
B. Initiative
 6. Courses completed             .00         −.03       −.01          .05        .02
 7. Advanced electives            .07          .11        .13         −.01        .12
 8. School activities             .04          .04        .02         −.04        .04
 9. School sports                 .03         −.00       −.02          .00       −.01
 10. Outside activities          −.03         −.00       −.03          .00       −.01
C. Competing Activities
 11. Drugs/gangs                 −.03         −.01       −.01         −.05       −.02
 12. Killing time                −.01         −.05       −.01         −.00       −.03
 13. Peer sociability            −.03         −.00       −.01          .00       −.01
 14. Employment                  −.04         −.00       −.03         −.01       −.01
 15. Child care                   .00         −.02       −.01          .09       −.00
 16. Leisure reading             −.06         −.03       −.03          .02       −.04
D. Family Background
 17. SES                          .01         −.02       −.08         −.11        .00
 18. Family intact                .02          .04       −.02          .05        .02
 19. Parent relations             .03          .02       −.00          .02        .02
 20. Parent aspiration            .02         −.01       −.04          .00       −.01
 21. Stress at home               .00          .01       −.05          .03        .00
E. Attitudes
 22. Teacher relations            .01         −.00       −.01          .02       −.01
 23. Educational plans            .03          .03       −.02          .00        .02
 24. Self esteem                  .00          .03       −.01          .08        .00
 25. Locus of control             .03          .01        .01         −.09        .02
 26. Peer studiousness           −.01         −.02       −.01         −.14       −.02
F. Teacher Ratings
 27. Attendance                   .03          .02       −.06         −.02        .02
 28. Class behavior               .06          .02        .11          .09        .03
 29. Consults teacher             .00          .00       −.01         −.05       −.01
 30. Educ. motivation             .07          .11        .13          .26        .11
 31. Work completed               .19          .19        .18          .19        .20
G. Grading Factors
 32. SGF                         −.32         −.31       −.32         −.31       −.30
 33. MCGF                        −.03         −.04       −.01         −.01       −.02
H. NELS Test
 34. Reading                      .07          .06        .25          .32        .09
 35. Mathematics                  .58          .47        .57          .21        .43
 36. Science                     −.37         −.15       −.40         −.03       −.16
 37. Social Studies               .33          .15        .15         −.06        .16

Multiple R                        .89          .88        .86          .77        .88

*Across-school analysis corrected for unreliability, range restriction, and shrinkage.
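The shrinkage correction noted in the footnotes to Tables 19 and 20 adjusts the multiple R for capitalization on chance when many predictors are fitted in a finite sample. The Wherry adjustment is the standard textbook version; whether the authors used exactly this formula is not stated here, so the sketch below is illustrative only:

    import math

    def wherry_adjusted_r(r: float, n: int, p: int) -> float:
        """Wherry's shrinkage-adjusted multiple correlation for p predictors and n cases."""
        adj_r2 = 1 - (1 - r**2) * (n - 1) / (n - p - 1)
        return math.sqrt(max(adj_r2, 0.0))

    # Example: with 8454 students and 37 predictors, an observed R of .88 barely shrinks.
    print(round(wherry_adjusted_r(0.88, 8454, 37), 3))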

Appendices
A. Descriptive Statistics: Tables A-1 to A-8
B. Student Variables: Acronyms and Specifications
C. Notes

Appendix Table A-1
Student Characteristics for Four School Programs*

                                            Means
                            Rigorous   Academic   Academic     Vocational   Total    Total
                            Academic              Vocational                Mean     S.D.
A. School Skills
 1. Attendance                3.92       3.78       3.70         3.52        3.76      .73
 2. Class participation       3.84       3.80       3.72         3.50        3.76      .59
 3. Discipline problems        .08        .12        .15          .23         .12      .22
 4. Work completed            3.26       3.20       3.21         3.15        3.20      .65
 5. Homework hours            3.61       3.49       3.15         2.71        3.40     1.67
B. Initiative
 6. Courses completed        22.93      21.74      22.20        20.54       21.78     2.59
 7. Advanced electives        2.46       1.81        .88          .26        1.69     1.35
 8. School activities         2.74       2.57       2.08         1.58        2.46     2.49
 9. School sports             2.28       2.10       1.52         1.20        1.98     2.60
 10. Outside activities        .73        .68        .59          .51         .67      .49
C. Competing Activities
 11. Drugs/gangs               .95       1.00        .98         1.08        1.00      .42
 12. Killing time             1.97       2.00       2.06         2.04        2.01      .41
 13. Peer sociability         2.01       2.03       2.03         2.10        2.03      .52
 14. Employment                .09        .12        .22          .32         .14      .35
 15. Child care                .04        .05        .07          .07         .05      .22
 16. Leisure reading          2.13       2.20       2.06         1.99        2.14     1.76
D. Family Background
 17. SES                       .34        .22       −.24         −.42         .14      .78
 18. Family intact             .87        .84        .82          .79         .83      .37
 19. Parent relations         2.06       2.03       2.00         1.93        2.02      .41
 20. Parent aspiration        7.94       7.42       6.21         5.35        7.21     2.12
 21. Stress at home           1.34       1.39       1.36         1.44        1.39      .49
E. Attitudes
 22. Teacher relations        3.26       3.22       3.14         3.05        3.20      .49
 23. Educational plans        2.63       2.53       2.38         2.23        2.51      .35
 24. Self esteem              3.82       3.71       3.55         3.43        3.69      .52
 25. Locus of control         3.13       3.06       2.96         2.91        3.05      .48
 26. Peer studiousness        2.56       2.49       2.31         2.18        2.45      .43
Scholastic Engagement        54.41      50.56      47.72        42.16       50.00    10.00

*Original score scales as described in text and Appendix B.

Appendix Table A-2
Student Characteristics for Gender and Ethnic Groups*

                             Males    Females   African    Asian     Hispanic   White
                                                American   American
A. School Skills
 1. Attendance                3.75      3.77      3.85       3.79       3.56      3.78
 2. Class participation       3.66      3.87      3.89       3.74       3.75      3.76
 3. Discipline problems        .18       .07       .16        .08        .13       .12
 4. Work completed            3.10      3.31      3.30       3.23       3.21      3.19
 5. Homework hours            3.30      3.50      3.12       3.79       3.29      3.41
B. Initiative
 6. Courses completed        21.47     22.08     20.95      22.10      21.32     21.89
 7. Advanced electives        1.64      1.73      1.24       2.38       1.27      1.73
 8. School activities         1.99      2.91      2.44       2.82       2.08      2.48
 9. School sports             2.62      1.36      1.98       1.72       1.85      2.02
 10. Outside activities        .72       .61       .66        .68        .61       .67
C. Competing Activities
 11. Drugs/gangs              1.06       .94       .84        .88       1.02      1.02
 12. Killing time             2.05      1.97      2.03       1.95       1.99      2.01
 13. Peer sociability         2.08      1.98      1.95       1.94       2.01      2.04
 14. Employment                .16       .12       .10        .08        .14       .15
 15. Child care                .02       .08       .12        .06        .08       .04
 16. Leisure reading          2.06      2.22      1.86       2.14       1.99      2.19
D. Family Background
 17. SES                       .16       .11      −.32        .33       −.39       .23
 18. Family intact             .83       .83       .62        .89        .81       .85
 19. Parent relations         1.99      2.05      1.95       1.98       2.05      2.02
 20. Parent aspiration        7.11      7.30      7.04       7.78       7.25      7.17
 21. Stress at home           1.37      1.41      1.52       1.34       1.45      1.38
E. Attitudes
 22. Teacher relations        3.17      3.23      3.15       3.19       3.19      3.21
 23. Educational plans        2.48      2.53      2.58       2.66       2.53      2.49
 24. Self esteem              3.69      3.68      3.70       3.65       3.64      3.69
 25. Locus of control         3.02      3.07      2.97       2.99       3.01      3.06
 26. Peer studiousness        2.38      2.52      2.42       2.56       2.39      2.46
Scholastic Engagement        47.65     52.27     49.69      53.03      47.67     50.09

*Original score scales as described in text and Appendix B.

Appendix Table A-3
Grade Averages and Grade Corrections for Four School Programs*

                            Rigorous   Academic   Academic     Vocational   Total    Total
                            Academic              Vocational                Mean     S.D.
Subject Grade Average
 English                      2.93       2.62       2.26         1.88        2.56      .81
 Mathematics                  2.69       2.38       2.12         1.77        2.34      .85
 Science                      2.81       2.47       2.21         1.83        2.43      .84
 Social Studies               3.03       2.66       2.30         1.87        2.60      .85
School Grade Average
 HSA                          2.87       2.53       2.22         1.84        2.48      .76
 HSA (T)                      2.99       2.70       2.48         2.17        2.66      .70
 HSA + K                      2.85       2.50       2.17         1.77        2.45      .78
 HSA + 2G                     2.74       2.36       1.97         1.59        2.30      .76
Grading Corrections
 K                            −.01       −.03       −.05         −.07        −.03      .05
 SGF                           .02       −.00       −.03          .00         .00      .26
 MCGF                         −.14       −.17       −.22         −.26        −.18      .08

*All entries are based on a 4.0 grade scale as described in the text.

Appendix Table A-4
Grade Averages and Grade Corrections for Gender and Ethnic Groups*

                             Males    Females   African    Asian     Hispanic   White
                                                American   American
Subject Grade Average
 English                      2.37      2.74      2.07       2.89       2.23      2.62
 Mathematics                  2.27      2.40      1.87       2.66       2.04      2.40
 Science                      2.36      2.50      1.97       2.78       2.14      2.48
 Social Studies               2.50      2.70      2.14       2.93       2.30      2.66
School Grade Average
 HSA                          2.37      2.59      2.01       2.82       2.18      2.54
 HSA (T)                      2.54      2.79      2.19       2.95       2.38      2.72
 HSA + K                      2.34      2.55      1.97       2.79       2.13      2.51
 HSA + 2G                     2.21      2.39      1.90       2.68       2.00      2.35
Grading Corrections
 K                            −.03      −.04      −.04       −.02       −.05      −.03
 SGF                           .01      −.01       .07        .01        .01      −.01
 MCGF                         −.17      −.19      −.18       −.15       −.19      −.18

*All entries are based on a 4.0 grade scale as described in the text.

Appendix Table A-5
NELS Test Scores and Teacher Ratings for Four School Programs*

                                            Means
                              Rigor.     Acad.     Acad.      Voc.      Total    Total
                              Acad.                Voc.                 Mean     S.D.
NELS Tests
 Reading                       56.2       53.3      47.9       45.2      52.5     9.48
 Mathematics                   58.0       54.0      48.0       43.9      53.1     9.52
 Science                       56.5       53.4      48.7       45.9      52.7     9.66
 Social Studies                56.3       53.4      48.2       45.6      52.6     9.47
 NELS-T (total)                56.8       53.5      48.2       45.1      52.7     8.52
 NELS-C (weighted comp.)       3.30       3.09      2.75       2.53      3.03      .51
Teacher Ratings
 Attendance                    4.80       4.68      4.63       4.48      4.68      .47
 Class behavior                4.90       4.74      4.61       4.37      4.71      .61
 Consults teacher               .40        .39       .34        .32       .38      .40
 Educational motivation         .92        .82       .70        .54       .80      .27
 Work completed                4.40       4.15      3.97       3.69      4.12      .78
 TRC (weighted comp.)          2.88       2.70      2.54       2.29      2.67      .47

*Original score scales as described in text and Appendix B.

Appendix Table A-6
NELS Test Scores and Teacher Ratings for Gender and Ethnic Groups*

                              Male     Female    Afr.      Asian     Hispanic   White
                                                 Amer.     Amer.
NELS Tests
 Reading                      51.4      53.5      46.5       54.3       48.6      53.4
 Mathematics                  53.7      52.6      46.1       57.2       48.3      54.1
 Science                      54.2      51.2      44.7       54.6       47.6      53.9
 Social Studies               53.5      51.7      47.1       55.3       48.7      53.4
 NELS-T (total)               53.2      52.2      46.1       55.4       48.3      53.7
 NELS-C (weighted comp.)      3.05      3.03      2.66       3.24       2.78      3.09
Teacher Ratings
 Attendance                   4.69      4.66      4.63       4.81       4.54      4.69
 Class behavior               4.59      4.83      4.59       4.95       4.65      4.72
 Consults teacher              .36       .40       .34        .35        .33       .39
 Educational motivation        .76       .83       .71        .89        .73       .80
 Work completed               3.96      4.28      3.92       4.38       3.98      4.14
 TRC (weighted comp.)         2.59      2.76      2.53       2.85       2.56      2.69

*Original score scales as described in text and Appendix B.

Appendix Table A-7
Means and Standard Deviations for Five Composite Measures—By Gender and Ethnic Groups*

                           Male     Female    African    Asian      Hispanic   White
                          (4154)    (4300)    American   American    (804)     (6471)
                                               (591)      (524)
Mean for:
 HSA                       48.6      51.4      43.8        54.4       46.0      50.8
 NELS Test                 50.2      49.8      42.7        54.1       45.0      51.0
 Engagement                47.6      52.3      49.7        53.0       47.7      50.1
 Teacher rating            48.1      51.8      47.0        53.8       47.6      50.3
 School grading            50.3      49.7      52.7        50.5       50.5      49.7
S.D. for:
 HSA                       10.1       9.7       8.8         9.5        8.8       9.9
 NELS Test                 10.3       9.7       9.2         9.9        9.3       9.6
 Engagement                10.4       9.0       8.6        10.2        9.4      10.1
 Teacher rating            10.6       9.0      10.7         8.3       10.8       9.8
 School grading            10.0      10.0       9.7         8.8        9.7      10.1

*Each measure is scaled to a mean of 50 and standard deviation of 10 for all students in the full sample. (Total N = 8454 except for teacher rating where N = 7619; subgroup Ns in parentheses)

Appendix Table A-8
Means and Standard Deviations for Five Composite Measures—By School Program*

                          Rigorous    Academic   Academic     Vocational
                          Academic    (4650)     Vocational    (402)
                           (1892)                  (627)
Mean for:
 HSA                        55.1         50.7        46.6         41.5
 NELS Test                  55.2         51.0        44.4         40.2
 Engagement                 54.4         50.6        47.7         42.2
 Teacher rating             54.4         50.6        47.1         41.9
 School grading             50.7         49.9        48.7         50.0
S.D. for:
 HSA                         8.5          9.6         8.6          7.1
 NELS Test                   7.5          9.7         9.0          7.6
 Engagement                  8.6          9.5         9.1          8.7
 Teacher rating              6.8          9.5        10.3         10.7
 School grading             10.0          9.9        10.6          9.9

*Each measure is scaled to a mean of 50 and standard deviation of 10 for all students in the full sample. (Total N = 8454 except for teacher rating where N = 7619; subgroup Ns in parentheses)


Appendix B: Student Variables

Acronyms for variables and corrections:

CGF – Course Grading Factor, computed across schools
CGR – Course Grading Residual, computed within schools
CU – Carnegie Unit
HSA – High School Average based on four “new basic” subject areas, computed within schools or across schools as indicated
HSAw – HSA specified as a within-school deviation score (school mean removed)
HSA+K – Within-school HSA, plus a correction for course grading variations
HSA+2G – Across-school HSA, plus corrections for school and course grading variations
HSA(T) – High School Average based on all courses on the transcript, save physical education and service courses like driver training
K – Mean Course Grading Residual (computed within schools) for the courses taken by each student
MCGF – Mean Course Grading Factor (computed across schools) for the courses taken by each student
NELS-C – NELS Test Composite; 4 tests, best weighted predictors of HSA
NELS-T – NELS Test total score; 4 tests unweighted
SGF – School Grading Factor, computed across schools
SE – Scholastic Engagement; composite of 9 behavioral measures
SES – Socioeconomic Status
TR – Teacher Rating
TRC – Teacher Rating Composite; 5 ratings, best weighted predictors of HSA
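Two kinds of composites appear in this list: regression-weighted composites (NELS-C and TRC, weighted to best predict HSA) and the Scholastic Engagement composite built from nine behavioral measures. The sketch below illustrates both kinds under the simplifying assumption that an engagement-type composite averages standardized behaviors with equal weights; the report's exact weighting scheme is not restated here, and the data are illustrative:

    import numpy as np

    def zscore(x):
        return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

    def regression_weighted_composite(X, y):
        """Best-weighted composite (e.g., NELS-C, TRC): predicted y from a least-squares fit of y on X."""
        Xd = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        return Xd @ b

    def equally_weighted_composite(X):
        """Simplified engagement-type composite: mean of standardized behaviors, rescaled to mean 50, SD 10."""
        m = zscore(X).mean(axis=1)
        return 50 + 10 * (m - m.mean()) / m.std(ddof=1)

    rng = np.random.default_rng(2)
    behaviors = rng.normal(size=(100, 9))     # nine engagement behaviors (illustrative data)
    tests = rng.normal(size=(100, 4))         # four NELS tests (illustrative data)
    hsa = rng.normal(size=100)
    print(equally_weighted_composite(behaviors)[:3], regression_weighted_composite(tests, hsa)[:3])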

Specifications for Student Characteristics and Teacher Ratings*

School Skills

1. Attendance (average of 6 vars., reflected so 0 = poor attendance, 5 = good attendance; the average can reach 6, and the reflected average −1, if only the RAB variables are present and in the highest category). A recode-and-average sketch for composites of this kind appears after item 5 below.
   F2RAB90 (transcript): Number of days absent 90-91, grouped as in codebook frequency distributions, but changed from a 1-7 to a 0-6 range (Source 2)
   F2RAB91 (transcript): Same as above, for 91-92 (Source 2)
   F2S9A Late for school: code 0=none, code 5=15+
   F2S9B Cut/skip classes: code 0=none, code 5=15+
   F2S11A Last unexcused absence: recoded so that code 1 (never)=0; code 2 (this term)=3; code 3 (first term this year)=2; code 4 (last year) and code 5 (2 or more years ago)=1; other or omit=0
   F2S11B Days missed, last unexcused absence: code 0=1-2; code 5=21+

2. Class participation (average of 8 vars.; 1 = low participation, 5 = high participation)
   F2S15bc Copy science notes: 1=never/very rarely; 5=every day
   F2S19bc Copy math notes: 1=never/very rarely; 5=every day
   F2S17a Pay attention, science class: 1=never; 5=always
   F2S21a Pay attention, math class: 1=never; 5=always
   F2S17d Participate actively, science class: 1=never; 5=always
   F2S21d Participate actively, math class: 1=never; 5=always
   F2S24A Come to class without pencil/paper: 1=usually; 4=never
   F2S24B Come to class without books: 1=usually; 4=never

3. Discipline problems (average of 6 vars.; 0 = no problem, 2 = more than 2 problems)
   F2S.8f Fight at school: 0=never; 1=once or twice; 2=more than twice
   F2S.8g Fight to or from school: scored as above
   F2S.9d Trouble re school rules: 0=never; 1=1-2 times; codes 2-6 recoded to 2=3+ or multiple response
   F2S9e In-school suspension: scored as above
   F2S9f Suspended from school: scored as above
   F2S9h Was arrested: scored as above

*The six NELS data sources used in this study are cited on page B-8. Most variables come from #1, the Second Follow-up Student Component. Other sources are indicated on the list of variables where appropriate.

4. Work completed (average of 5 vars.; 1 = never complete/do more than required, 5 = always complete/do more than required)
   F2S17b Complete science work on time: 1=never; 5=always
   F2S21b Complete math work on time: 1=never; 5=always
   F2S17c Do more science work than required: 1=never; 5=always
   F2S21c Do more math work than required: 1=never; 5=always
   F2S24c Come to class without homework done: 1=usually; 4=never

5. Homework hours (average of 2 vars., range 0-8)
   F2S25F1 Total homework time per week in school: codes 0-8
   F2S25F2 Total homework time per week out of school: codes 0-8
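As referenced under item 1, the student variables in this appendix are typically built by recoding several NELS items to a common direction and range and then averaging whatever items are present. A minimal sketch of that pattern, assuming pandas; the recode maps and item names shown are illustrative, not the exact specifications above:

    import pandas as pd

    def recode_and_average(df, recodes, reflect_max=None):
        """Recode each item with its own map, average the items that are present, optionally reflect the scale."""
        items = pd.DataFrame({col: df[col].map(mapping) for col, mapping in recodes.items()})
        score = items.mean(axis=1, skipna=True)          # average over non-missing items only
        return reflect_max - score if reflect_max is not None else score

    # Illustrative recodes for two absence items; reflecting at 5 makes higher scores mean better attendance.
    recodes = {"late_for_school": {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5},
               "cut_classes":     {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}}
    df = pd.DataFrame({"late_for_school": [1, 3, 6], "cut_classes": [2, 2, 5]})
    print(recode_and_average(df, recodes, reflect_max=5))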

Initiative

6. Courses completed
   Total graded “course values”: credits for all courses graded 1-13 (credits imputed from term duration for failed courses); service courses excluded (Source 3)

7. Advanced electives (sum of 1 point each for specified codes)
   F2S13E Ever in AP program: 1=yes
   F2S18a Taking science this term: 2=yes, but not required
   F2S22a Taking math this term: 2=yes, but not required
   F2RENG_C Units in English (NAEP): 5 units or more
   F2RFOR_C Units in Foreign Language (NAEP): 3 units or more

8. School activities (2 points for each of the responses noted, plus 1 point for participation alone as noted; maximum=12, set to 12 if more than 12 points indicated)
   F2S29a Class officer: 1=yes (2 points)
   F2S29c Award in math/science: 1=yes (2 points)
   F2S29f Recognition of writing: 1=yes (2 points)
   F2S30ac School spirit group: 5=captain (2 points); code 4 (varsity) 1 point
   F2S30ba Music group: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bb Drama group: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bc Student govt: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30be Publication: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bf Service club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bg Academic club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bh Hobby club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bi Professional club: 4=officer/leader (2 points); code 3 (participated) 1 point

9. School sports (sum of points as noted)
   F2S29g Most valuable player on a team: code 1=yes, 2 points
   F2S30aa Team sport: code 3 or 4 (jv, varsity) 2 points; code 5 (captain) 3 points
   F2S30ab Individual sport: scored as above
   F2S30bj Intramural team: code 3 (participated) 1 point; code 4 (leader) 2 points
   F2S30bk Intramural individual sport: scored as above

10. Outside activities (average of 6 variables coded 1-4; the composite then rescaled to 0-3)
   F2S33b How often work on hobbies: 1=never ... 4=every day
   F2S33c How often attend religious activities: coded as above
   F2S33d How often attend youth groups: coded as above
   F2S33e How often community service: coded as above
   F2S33L How often play sports: coded as above
   F2S29h Received a community service award: code 1 (yes) recoded to 4; codes 2 (no), 6 (multiple), and 8 (missing) recoded to 1

Competing Activities

11. Drugs/gangs (average of 4 vars.)
   F2S81b Alcohol in 12 months: 0=none; 3=20+ occasions; legitimate skip (code 9, never in lifetime) recoded to 0
   F2S83b Marijuana in last 12 months: 0=never/none; 3=frequent/all (legitimate skip recoded as above)
   F2S70 Number of friends in gangs: 1=none, 2=some, 3=all
   F2S71 I belong to a gang: recoded so 1 (yes) becomes 3; 2 (no) becomes 1

12. Killing time

F2S33f Riding around: recode 1&2→1 (< 1/week); 3→2 (1-2/week); 4→ 3 (every day)

average of 4 vars. F2S33g Talking, doing things with friends 1=lowest amount of time 3=highest amount of time



F2S34a Video/computer games (weekdays): recode 0→1 (none); 1,2→2 (< 2 hours); 3-5→3 (2+ hours/day) F2S35a TV (weekdays): recode 0,1→1 (40 hours/week Recoded to: 0= codes 0-4 ≤20 hr 1= codes 5+ >20 hr 15. Child care 0=none; 6=10+ hr/day Recoded to: 0=codes 0-2:40 recodes: if F2S88 Missing or legitimate skip and F2S86a (ever worked)=no → hours=0; if F2S86a (ever worked) = 3 (not currently employed) and F2S86bmo/byr (last month and year worked) is NOT Sept-Dec/91 or any 1992 → hours=0 F2S94 Hours per day: 1=