British Educational Research Journal Vol. 32, No. 2, April 2006, pp. 209–225

Comparing the predictive validity of reasoning tests and national end of Key Stage 2 tests: which tests are the 'best'?

Steve Strand*
University of Warwick, UK

(Submitted 9 July 2004; conditionally accepted 24 November 2004; accepted 10 December 2004)

This article describes a longitudinal analysis of a nationally representative cohort of over 80,000 pupils in England who completed both national end of Key Stage 2 (KS2) tests and the Cognitive Abilities Test (CAT) at age 11 in 1997, national end of Key Stage 3 (KS3) tests at age 14 in summer 2000, and General Certificate of Secondary Education (GCSE) and other public examinations at age 16 in summer 2002. The CAT had significantly higher correlations with subsequent KS3 and GCSE outcomes than did KS2 test points scores. However, multiple regression analyses indicated that a combination of CAT and KS2 test scores gave the best prediction of future KS3/GCSE outcomes. The article argues that measures of pupils' general transferable learning abilities and measures of their specific curricular attainments at the end of primary school have unique and distinct value at the start of the secondary phase. The article discusses some practical ways in which the different types of assessment data can be used within the secondary school.

*CEDAR, Institute of Education, University of Warwick, Coventry CV4 7AL, UK. Email: [email protected]
© 2006 British Educational Research Association. DOI: 10.1080/01411920600569073

Introduction

End of Key Stage 2 (KS2) tests, completed in the final year of primary school at age 10/11 years, have been a statutory feature of the National Curriculum (NC) in England since 1995. The KS2 tests in English, mathematics and science are expected to fulfil a wide range of functions. For primary schools, they have a prominent role in public accountability, through the publication of school results in comparative performance tables alongside the results for other local schools and local and national averages. The tests are also expected to drive school improvement through (i) the setting of national targets for performance in the tests, and (ii) the requirement for school governors to set and publish school-level targets for future performance in the tests.

For secondary schools, the KS2 tests provide the baseline against which pupil progress is assessed in subsequent end of Key Stage 3 (KS3) tests at age 14, and in General Certificate of Secondary Education (GCSE) and other public examinations at age 16. Pupils' KS3 and GCSE results are compared to their prior KS2 scores in order to calculate 'value-added' measures of school effectiveness, and these measures are published in national secondary school performance tables. Governors also have to set school-level targets for future KS3 and GCSE results (e.g. set targets in autumn 2002 for summer 2004 tests/examinations), and the KS2 results of the relevant cohort are expected to inform these targets.

Measuring educational progress at secondary school

Measuring progress between the end of KS2 and KS3 has always been more contentious than between other key stages. The School Curriculum and Assessment Authority (SCAA, 1994), in looking at the type of information that might be used as baseline data for value-added measures of progress, indicated that National Curriculum (NC) levels 'should not be used in value-added studies since they are likely to lead to less reliable predictions and greater error for individual schools' (p. 11). NC levels were seen to be inappropriate as baseline data because it was not safe to assume that 'the levels have the same interpretation irrespective of key stage' (p. 24). Similarly, the then Department for Education (DfE, 1995) concluded that 'progress from one NC level of attainment to the next could be too large a step to be used for accurate measurement of value added; more fine-grained measures … are likely to be needed' (p. 2).
However, this position changed substantially in 1997 when it was suggested that 'from 1998, schools should be provided with value-added scores based on KS2 data predicting KS3 outcomes' (SCAA, 1997a, p. 98). While the National Value Added Project (SCAA, 1997a) had shown positive correlations between KS1 test levels and KS2 test levels, and between KS3 test levels and GCSE scores, no correlations between KS2 and KS3 test results were then available. Despite this, there were claims (SCAA, 1997b) that 'standardised tests are no longer needed to predict outcomes at KS3, as KS2 results are reliable enough to serve as baseline data'. Despite these assurances, the comparability of NC levels between KS2 and KS3 is still a concern for many involved in educational measurement. Wiliam (2001) highlights the theoretical issues involved. Others have highlighted practical issues: for example, Hayes (2001) reports that in one local education authority (LEA) almost 1 in 7 (14%) of the entire cohort were at level 4 in KS2 (1998) and still at level 4 in KS3 three years later (2001) in at least one of the core subjects. These results were not simply attributable to gender, social disadvantage or ethnicity of the pupils, and suggest some discontinuity in the assessment of 'level 4' at the end of the two key stages. Many measurement professionals are similarly concerned about the value-added methodology employed by the Department for Education and Skills (DfES), not only because it uses a simple comparison of NC levels at KS2 and KS3, but also because of the 'median line' methodology, which introduces systematic bias in value-added results, and the absence of confidence bands around the reported value-added measures (Critchlow & Coe, 2003; Goldstein, 2003). The absence of published information on the reliability of any of the end of key stage tests compounds these problems (Tymms & Dean, 2004, Appendix 4).

The Cognitive Abilities Test

Around two-thirds of secondary schools in England use reasoning tests, principally the Cognitive Abilities Test (CAT), as part of their pupil assessment programme. As a result, a large proportion of the school population complete the CAT in September–November of Year 7, some four to six months after the KS2 tests. When considering targets for the future attainment of their pupils, secondary schools often ask which is the 'best' predictor: the CAT or the KS2 tests. There is actually very little empirical evidence on the comparative predictive validity of the two sets of tests (see below). As a result, schools base their decisions on a variety of other criteria. Some schools give more weight to the CAT because they remain sceptical about the reliability and validity of the national tests. For example, some point to rising KS2 test results for their intakes between 1996 and 2000, and contrast this with relatively stable CAT scores. A range of evidence has been reported that appears to support this finding (Times Educational Supplement, 2002; Hopkins & Davis, 2003; Massey et al., 2003), and cynics assert that pupils are simply becoming increasingly well prepared to take the KS2 tests. On the other hand, some schools observe that the provision of KS2 results from primary schools has improved markedly in terms of speed and completeness, and that KS2 results are sufficient to meet their statutory requirements in regard to target setting.
The question of which is the 'best' test seems to presume that differences between the results would indicate a flaw in one or other of the tests (e.g. Rayment, 1996; Pegg, 1998). However, reasoning tests such as the CAT, and attainment tests such as the KS2 English, mathematics and science tests, may be contrasted on a number of dimensions: for example, test content, the transferability of the skills assessed, and the publication (or otherwise) of the results in public performance tables. It is apparent, therefore, that the two sets of tests assess different domains, with different sensitivities. The tests may therefore have different, but complementary, roles. However, since both sets of tests are potentially available as a baseline for future performance, either to set targets or to measure 'value added', it is important to know how they compare in terms of their predictive validity.

Comparing predictive validity

We know that CAT scores at age 11 are highly correlated both with subsequent KS3 test levels and with GCSE examination results. For example, Fernandes and Strand (1998) compared the autumn 1994 Year 7 CAT scores and subsequent summer 1997 KS3 test levels of 12,000 pupils and reported correlations of 0.65 between CAT VR and KS3 English test level, and 0.84 and 0.76 between mean CAT score and KS3 mathematics and science test levels respectively. At GCSE, Thomas and Mortimore (1996) compared autumn 1987 Year 7 CAT scores with summer 1993 GCSE results for 8500 pupils and reported correlations of 0.72, 0.67 and 0.74 with GCSE total points score, GCSE English grade and GCSE mathematics grade respectively. Fernandes and Strand (1998) reported similarly high correlations for 13,000 pupils between autumn 1994 CAT scores and summer 1997 GCSE outcomes, and additionally reported high correlations with GCSE science double award (0.69), history (0.62), geography (0.68) and modern foreign languages (0.57), amongst other subjects.

We presume from the analyses contained in the Autumn Package¹ that KS2 test results are also significantly correlated with subsequent KS3 tests and GCSE results. The Autumn Package does not actually report the correlations underlying the data. However, the present author has completed an analysis of the matched data set underlying the first KS3 Autumn Package published in 1999. This revealed correlations of 0.65, 0.79 and 0.69 between the KS2 English, mathematics and science test levels and the respective KS3 test levels in the same subjects. A similar analysis of the data set underlying the first KS2 to GCSE Autumn Package in 2001 indicates significant correlations between KS2 average points score and GCSE total points score (0.63), English (0.63), mathematics (0.66), science double award (0.59), history (0.58), geography (0.59) and French (0.59), among other subjects. It is apparent, therefore, that both CAT and KS2 tests are highly correlated with pupils' subsequent performance in national tests and examinations. However, there is only limited data giving a direct comparison of reasoning tests against KS2 tests as predictors of subsequent attainment.
Moody (2001) compared CAT scores and KS2 test scores as predictors of subsequent KS3 outcomes for a large secondary girls' comprehensive school. The data were drawn from 131 pupils taking KS3 tests in 1997 and 153 pupils taking the KS3 tests in 1998. He concluded that 'KS2 data, either in the form of test results or teacher assessment (TA), have no predictive validity, or reliability, for test results or TA at KS3', although 'CAT correlated more highly with both test results and TA at KS3 in core subjects' (p. 81). For example, the KS3 English test correlated 0.74 with CAT VR, compared to only 0.49 with the KS2 English test. Similarly, for the KS3 science test, the correlation with CAT VR score was 0.61, compared to 0.31 with the KS2 science test. For mathematics, the CAT and KS2 tests were more comparable as predictors, with correlations of 0.81 for both CAT QR and the KS2 mathematics test (although mean CAT score was the best predictor, with a correlation of 0.87 compared to 0.78 for KS2 average test level). Strand (2001) matched the results for 18,000 pupils who took KS2 and CAT tests in summer/autumn 1996 and KS3 tests in summer 1999. The results indicated that both CAT and KS2 test scores were strongly correlated with KS3 test outcomes, although CAT gave consistently higher correlations than KS2 test average points score, particularly for KS3 mathematics (0.85 for mean CAT score versus 0.78 for KS2 average points score).

The current study extends these studies by drawing on results from the most recent (2002) national data sets, by tracking pupils from age 10/11 years right through to GCSE/GNVQ (General National Vocational Qualification) results at age 15/16, and by substantially increasing the sample size. The following questions are asked:

- What are the correlations between reasoning test scores at age 11 and KS3/GCSE outcomes in a large and nationally representative sample of pupils?
- What are the correlations between the end of KS2 tests and KS3/GCSE outcomes in a large and nationally representative sample of pupils?
- Based on the same large sample of pupils, which tests have the highest correlation with each of the KS3 and GCSE outcomes?
- Do KS2 tests and reasoning tests explain common or distinct parts of the variance in KS3 and GCSE outcomes? What are the multiple correlations between a combination of reasoning and KS2 tests in predicting subsequent attainment?
- How can schools integrate and use the data arising from the KS2 tests and reasoning tests productively?

Method

The Cognitive Abilities Test (CAT)

The CAT is the most widely used test of reasoning abilities in the UK, with close to one million students assessed each academic year. The data reported here relate to the CAT second edition (CAT2E) (Thorndike et al., 1986). CAT2E has 10 separate subtests which are aggregated into three standardised measures of Verbal (VR), Quantitative (QR) and Non-Verbal (NVR) reasoning abilities. An average of the three standardised scores, the mean CAT score, is also calculated. The tests are described in detail in Strand (2004a) and in the test manual. The CAT2E spans the age range 7:06–15:09 years and is divided into six levels of difficulty, Level A through to Level F. Level D is typically used in the first year of secondary school (Year 7). The results are most often reported as standard age scores, with a mean of 100 and a standard deviation of 15. The tests have very high levels of reliability, both in terms of internal consistency estimates (Thorndike et al., 1986) and test–retest correlations (Sax, 1984; Strand, 2004a). A third edition of the CAT (CAT3) was launched in June 2001 and is now more widely used than CAT2E. However, an extensive equating study involving over 10,000 pupils allows CAT2E scores to be converted to CAT3 equivalents or vice versa (Smith et al., 2001).

National end of Key Stage 2 and Key Stage 3 tests

End of Key Stage 2 tests are completed in May of the final year of primary school (Year 6), when pupils are aged between 10 and 11 years. The tests and tasks cover the core subjects of the National Curriculum: English (reading, writing, spelling and handwriting), mathematics (separate written papers with and without the use of calculators, and a separate test of mental maths) and science. End of Key Stage 3 tests are completed in May of Year 9, when pupils are aged between 13 and 14 years. They cover the same curriculum areas as the KS2 tests. Further details on the content of the national tests can be obtained from the QCA (2002a, b). Results of national tests are typically expressed as National Curriculum (NC) levels. At KS2 in 2002 the results could range from level 2 to level 6, and at KS3 from level 2 to level 8. The 'typical' pupil is expected to attain level 4 at the end of KS2, and between level 5 and level 6 at the end of KS3. In this article, NC test levels have been converted to points scores as described in the Autumn Package (2002). An overall measure of a pupil's performance at the end of each key stage was also derived by calculating the average of the pupil's points scores across all the tests they completed, again as described in the Autumn Package (2002).

National GCSE/GNVQ public examination results

Pupils in England sit national public examinations at age 15/16 years. These are typically General Certificate of Secondary Education (GCSE) examinations, which are offered in a wide range of subjects and are graded from A* down to G (or U for ungraded) in each subject. A small proportion of entries are also made for GNVQs. These can be awarded at Distinction, Merit or Pass levels. A detailed set of score equivalencies is defined in the Autumn Package which allows GCSE/GNVQ results to be expressed on a common scale as 'points scores', and allows the calculation of the 'Best 8' performance score as an overall measure of attainment. The Best 8 performance score gives the pupil's points score in the eight highest scoring units studied, whether they be GCSE full course, GCSE short course, GNVQ foundation or GNVQ intermediate examinations.
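The points-score conversions and the 'Best 8' calculation described above can be sketched as follows. The level-to-points and grade-to-points mappings shown here are assumptions based on the commonly cited pre-2004 scales; the authoritative equivalencies are those defined in the Autumn Package (2002).

```python
# Assumed grade-to-points mapping (illustrative; the authoritative table,
# including GNVQ equivalencies, is defined in the Autumn Package 2002).
GCSE_POINTS = {"A*": 8, "A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "F": 2, "G": 1, "U": 0}

def nc_level_to_points(level):
    """Convert a National Curriculum test level to a points score.
    Assumed mapping: points = 6 * level + 3, so the 'typical' KS2
    result of level 4 maps to 27 points."""
    return 6 * level + 3

def best8_score(grades):
    """'Best 8' performance score: the sum of points in the eight
    highest-scoring units (GCSE full/short course or GNVQ units)."""
    points = sorted((GCSE_POINTS[g] for g in grades), reverse=True)
    return sum(points[:8])

# A pupil's KS2 average points score is the mean of points across the tests
# completed, e.g. English level 4, mathematics level 5, science level 4:
ks2_average = sum(map(nc_level_to_points, [4, 5, 4])) / 3
```

A pupil with ten grade-C entries, for example, would score from only the best eight of them.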
The data sets and the matching process

CAT is processed through a national service which provides computer scoring and analysis of pupils' answer sheets. Schools are asked to consent to this data being retained for further research on monitoring CAT norms or developing indicators, and 94% of users consent to this use. CAT test scores from administrations in the 1997/98 academic year, mostly in September–November 1997, were matched to the national matched KS2–GCSE 2002 data set. Thus a data set of matched May 1997 KS2 test results, autumn 1997 CAT results, May 2000 KS3 test results and May 2002 GCSE results was created. After matching was complete, individual records were given a unique identification code and all identifying information on pupils' names was deleted from the file to assure confidentiality for individual pupils and schools.

Results

The matched data sets

CAT scores were found for over 80,000 of the pupils in the national KS2–GCSE data set, with pupils drawn from a total of 973 secondary schools across 103 LEAs.
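The matching and anonymisation steps can be sketched in a few lines of pandas; the identifier columns used below are hypothetical, since the article does not specify which fields the national matching used.

```python
import pandas as pd

# Hypothetical records; the real data sets use their own identifier fields.
cat = pd.DataFrame({"surname": ["Smith"], "dob": ["1986-03-01"], "school": ["S1"],
                    "vr": [105], "qr": [98], "nvr": [101]})
ks2_gcse = pd.DataFrame({"surname": ["Smith"], "dob": ["1986-03-01"], "school": ["S1"],
                         "ks2_avg_ps": [27.0], "gcse_best8": [42]})

# Match CAT records to the national KS2-GCSE data set on pupil identifiers,
# keeping only pupils present in both sources.
matched = cat.merge(ks2_gcse, on=["surname", "dob", "school"], how="inner")

# Replace identifying fields with a unique anonymous code, as described above.
matched["pupil_id"] = range(1, len(matched) + 1)
matched = matched.drop(columns=["surname", "dob"])
```

An inner join reproduces the key property of the study sample: only pupils with both a CAT record and a KS2–GCSE record are retained.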

Table 1. Comparison of the matched sample against the population average for KS3 and GCSE outcomes

Measure                                    CAT matched sample   Full national data set
KS3 average points score, mean (SD)        34.1 (6.3)           33.9 (6.7)
KS3 English, % level 5+                    72                   70
KS3 maths, % level 5+                      73                   71
KS3 science, % level 5+                    66                   65
GCSE/GNVQ 'Best 8' points, mean (SD)       37.3 (13.4)          36.8 (14.0)
GCSE English, % entries graded A*-C        62                   61
GCSE maths, % entries graded A*-C          54                   53
GCSE science, % entries graded A*-C        54                   53
Sample size                                80,074               361,335

Table 1 contrasts the KS3 and GCSE results for the CAT matched sample against the national averages derived from the full national data set. The KS3 average points score and the GCSE/GNVQ Best 8 performance score for the CAT matched sample do not differ substantially from the national average. Given that the matched CAT sample constitutes over one-fifth (22%) of the entire national data set, this is perhaps not surprising.

Correlations with subsequent KS3 and GCSE attainment

Table 2 presents the correlations of CAT standard age scores and KS2 test points scores with subsequent end of KS3 test points scores at age 14.

Table 2. Correlation coefficients of Year 7 CAT scores and KS2 points scores with KS3 points scores

                      KS3 English   KS3 mathematics   KS3 science   KS3 average
                      points score  points score      points score  points score
Verbal                0.70*         0.75              0.76          0.79
Quantitative          0.59          0.81              0.67          0.78
Non-Verbal            0.50          0.72              0.65          0.69
Mean CAT score        0.67          0.85*             0.78*         0.84*
KS2 English PS        0.68          0.63              0.65          0.73
KS2 Mathematics PS    0.55          0.77              0.66          0.74
KS2 Science PS        0.52          0.64              0.67          0.64
KS2 Average PS        0.68          0.79              0.72          0.81

Notes. Key Stage test levels have been recoded as points scores (PS) as given in the Autumn Package (2002). All correlations are statistically significant at p < .0001. The highest correlation for each KS3 outcome is marked with an asterisk.

The correlations between the KS2 tests and the equivalent KS3 tests are all highly significant: 0.68 between KS2 and KS3 English, 0.77 between KS2 and KS3 mathematics and 0.67 between KS2 and KS3 science. It is perhaps not surprising that the KS2 and KS3 tests are so closely correlated, since the tests are similar in content and approach. However, the relevant KS2 test is not actually the best predictor of the KS3 test outcome. For KS3 English the best single predictor is VR score, and for KS3 mathematics, KS3 science and KS3 average points score, the best single predictor is mean CAT score. The advantage for CAT over KS2 average points score is highly statistically significant for every KS3 outcome (p < .0001). The absolute size of the CAT advantage over the KS2 tests is relatively small for English (0.70 versus 0.68) and KS3 average points score (0.84 versus 0.81), but is quite pronounced for KS3 mathematics (0.85 versus 0.79) and KS3 science (0.78 versus 0.72). While Verbal Reasoning and KS2 English points score have similar correlations with KS3 English points score (0.70 for VR versus 0.68 for KS2), there is a marked superiority for Verbal Reasoning over KS2 English in the correlations with KS3 mathematics (0.75 for VR versus 0.63 for KS2) and KS3 science (0.76 for VR versus 0.65 for KS2). The language skills assessed by the KS2 English test appear specific to performance in national English tests; Verbal Reasoning is a far stronger indicator of general performance in the end of KS3 tests.

Table 3 presents the correlations of CAT standard age scores and KS2 test points scores with subsequent GCSE/GNVQ results at age 16. Results are reported for eight GCSE outcomes: the GCSE/GNVQ Best 8 performance score, the number of higher grade (A*-C) passes achieved, and the GCSE grade in the six subjects with the highest national levels of entry (English, mathematics, double award science, geography, history and French). Again, the single best predictor for every GCSE outcome is one of the CAT scores: VR for English and French, and mean CAT score for all other outcomes.
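The article reports the CAT advantage as highly statistically significant without naming the test used. One standard procedure for comparing two dependent correlations that share an outcome variable (here, CAT-KS3 versus KS2-KS3, where CAT and KS2 are themselves correlated) is the Hotelling-Williams t-test, sketched below with illustrative figures from Tables 2 and 4; this is an assumed choice of method, not necessarily the one used in the paper.

```python
from math import sqrt

def williams_t(r12, r13, r23, n):
    """Hotelling-Williams t-test for comparing two dependent correlations
    r12 and r13 that share variable 1, given the correlation r23 between
    the two predictors. Returns t with n - 3 degrees of freedom."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (n - 1) * (1 + r23)
    den = 2 * det * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3
    return (r12 - r13) * sqrt(num / den)

# Mean CAT vs KS2 average as predictors of KS3 average points score
# (0.84 vs 0.81, Table 2), CAT-KS2 intercorrelation 0.82 (Table 4):
t = williams_t(0.84, 0.81, 0.82, 80000)
```

With n of roughly 80,000, even a difference of 0.03 between correlations yields a very large t, consistent with the p < .0001 reported in the text.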
The differences between the CAT and the KS2 test correlations with GCSE outcomes are all highly statistically significant (p < .0001).

Table 3. Correlations of Year 7 CAT and KS2 points scores with GCSE outcomes five years later

                      'Best 8'  No. of  GCSE     GCSE   GCSE     GCSE       GCSE     GCSE
                      points    A*-C    English  maths  double   geography  history  French
                      score     grades                  science
Verbal                0.69      0.66    0.68*    0.66   0.64     0.64       0.63     0.65*
Quantitative          0.65      0.62    0.59     0.72   0.60     0.58       0.56     0.57
Non-Verbal            0.58      0.56    0.51     0.65   0.56     0.52       0.48     0.48
Mean CAT score        0.72*     0.69*   0.67     0.76*  0.68*    0.66*      0.63*    0.64
KS2 English PS        0.65      0.61    0.66     0.59   0.56     0.58       0.59     0.60
KS2 Mathematics PS    0.61      0.58    0.55     0.68   0.58     0.55       0.53     0.52
KS2 Science PS        0.57      0.54    0.52     0.57   0.55     0.53       0.50     0.46
KS2 Average PS        0.70      0.66    0.66     0.71   0.64     0.64       0.62     0.61

Notes. Key Stage test levels and GCSE grades have been recoded into points scores (PS) as given in the Autumn Package (2002). All correlations are statistically significant at p < .0001. The highest correlation for each GCSE outcome is marked with an asterisk.

The absolute difference in the size of the CAT to GCSE correlations versus the KS2 to GCSE correlations is relatively small, except for mathematics (0.76 versus 0.71), science (0.68 versus 0.64) and French (0.65 versus 0.61). Differences between CAT scores and KS2 test points scores do exist in terms of the strength of their relationships with KS3 and GCSE outcomes. However, on an absolute level all the correlations are high, typically 0.60 as a minimum. It could be argued that both CAT and KS2 tests therefore have a strong statistical basis for predictions of subsequent attainment.

CAT raw scores or standard age scores?

There can be up to 12 months' difference in age between the youngest and oldest pupils in a year group. However, NC levels derived from the KS2 and KS3 tests are not adjusted for age. We might therefore expect that CAT raw scores, which are also not adjusted for age, would have somewhat higher correlations with key stage test outcomes than the CAT standard age scores reported above. A similar argument would apply for GCSE results, since these are also not adjusted for age, although small month of birth effects are reported (e.g. Massey et al., 1996). To test this hypothesis, the KS3 and GCSE correlations with CAT Level D raw scores were compared to the correlations with CAT Level D standard age scores. The raw score correlations were only very marginally higher than the standard age score correlations. For example, the correlation with KS3 average points score remained at 0.84, and the correlation with GCSE Best 8 performance score increased only from 0.72 to 0.73. Standard age scores have therefore been used throughout this report because they are comparable across CAT test levels and therefore allow the maximum number of cases to be included.

KS2 test marks or test levels?

There are two reasons why KS2 points scores rather than KS2 test marks have been used in the above correlations. First, the KS2 tests change every year.
The relationship between KS2 test marks and KS3 outcomes will therefore change each year. However, KS2 test levels are assumed to be comparable across years, even though the tests change, so relationships between KS2 test levels and KS3 outcomes should be consistent across time. Second, secondary schools frequently do not have access to the KS2 test marks of each pupil, although they should all receive KS2 test levels. It is only realistic to suggest analyses based on data that is readily available to secondary schools, and consistent over time, and these criteria are met by CAT standard age scores and KS2 test levels.

Intercorrelations between CAT and KS2 test scores

Table 4 shows the intercorrelations between the CAT scores and KS2 test results. These correlations are all highly significant. For example, the correlation between

Table 4. Correlations between CAT standard age scores and KS2 test points scores

                         KS2 English   KS2 mathematics   KS2 science   KS2 average
                         points score  points score      points score  points score
Verbal Reasoning         0.76          0.67              0.65          0.79
Quantitative Reasoning   0.63          0.75              0.59          0.75
Non-Verbal Reasoning     0.54          0.64              0.56          0.66
Mean CAT score           0.72          0.77              0.67          0.82

Notes: Listwise deletion; sample size = 79,977.
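The listwise-deletion approach noted under Table 4 (dropping any pupil with a missing score before correlating) can be reproduced in a few lines; the data and column names below are synthetic and hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical matched file: two CAT standard age scores and one KS2 points
# score, with some pupils missing a QR score.
vr = rng.normal(100, 15, n)
df = pd.DataFrame({
    "VR": vr,
    "QR": 100 + 0.7 * (vr - 100) + rng.normal(0, 10, n),
    "KS2_English_PS": 27 + 0.2 * (vr - 100) + rng.normal(0, 3, n),
})
df.loc[rng.choice(n, 50, replace=False), "QR"] = np.nan

# Listwise deletion: drop any pupil with any missing score, then correlate.
corr = df.dropna().corr()
```

This mirrors why the Table 4 sample (79,977) is slightly smaller than the full matched sample of 80,074: pupils missing any one score drop out of all the intercorrelations.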

VR and KS2 English points score is 0.76, and the correlation between QR and KS2 mathematics points score is 0.75. However, the two types of test are not measuring exactly the same thing. This is shown by the results of multiple regression analyses, described below.

Which is the 'best' predictor of subsequent KS3 and GCSE outcomes?

Multiple regression is a technique to determine the unique association of each of a range of independent variables with a particular outcome. Table 5 shows the results of multiple regression analyses for each of the KS3 and GCSE outcomes. For each outcome, all six scores at age 11 (VR, QR and NVR standard age scores, and KS2 English, mathematics and science points scores) were eligible for stepwise entry to the multiple regression equation. With a sample of this size, conventional statistical significance levels are a poor guide to educational significance, since almost any change is statistically significant (Strand, 2004b). Therefore a rule was applied that an explanatory variable must account for at least an additional 2% of the variance in the KS3/GCSE outcome in order to warrant inclusion in the relevant multiple regression equation.

Table 5. Multiple regression equations of CAT and KS2 test points scores on KS3 test and GCSE examination outcomes

Outcome                    Significant    Multiple     R square   Standard   R square
                           predictors     correlation             error      change
KS3 English                VR             0.70         49.0%      4.42       -
                           VR, En         0.74         54.8%      4.18       5.8%
KS3 Mathematics            QR             0.81         65.6%      4.41       -
                           QR, Ma         0.84         71.2%      4.00       5.6%
                           QR, Ma, VR     0.86         74.1%      3.82       2.9%
KS3 Science                VR             0.76         57.8%      4.25       -
                           VR, Sc         0.79         62.6%      3.98       4.8%
                           VR, Sc, NVR    0.81         65.0%      3.87       2.4%
KS3 Average Points Score   VR             0.79         62.4%      3.76       -
                           VR, QR         0.85         71.6%      3.30       9.2%
                           VR, QR, Ma     0.86         73.8%      3.17       2.2%
GCSE English               VR             0.68         46.6%      1.07       -
                           VR, En         0.72         51.6%      1.01       4.9%
GCSE Mathematics           QR             0.72         52.0%      1.15       -
                           QR, Ma         0.75         56.7%      1.09       4.7%
                           QR, Ma, NVR    0.77         58.8%      1.07       2.1%
GCSE Double Science        VR             0.64         41.1%      1.21       -
                           VR, Ma         0.67         45.2%      1.16       4.1%
GCSE Geography             VR             0.64         41.2%      1.29       -
                           VR, Ma         0.66         44.0%      1.24       2.7%
GCSE History               VR             0.63         39.1%                 -
                           VR, En         0.65         42.3%                 3.2%
GCSE French                VR             0.65         42.0%      1.31       -
                           VR, En         0.67         44.6%      1.25       2.6%
GCSE Best 8 points score   VR             0.69         47.9%      8.68       -
                           VR, QR         0.72         52.3%      8.31       4.4%
                           VR, QR, En     0.74         54.8%      8.07       2.5%

Notes: VR = Verbal Reasoning; QR = Quantitative Reasoning; NVR = Non-Verbal Reasoning; En = KS2 English test points score; Ma = KS2 mathematics test points score; Sc = KS2 science test points score.

The main results from Table 5 are:

- For the summary measures (KS3 average points score and GCSE Best 8 performance score), the combination of VR and QR explains nearly all the unique variance. However, adding KS2 mathematics score explained a further 2% of the variance in KS3 average points score (increasing the variance accounted for from 72% to 74%), and KS2 English explained a further 3% in GCSE Best 8 performance score (increasing from 52% to 55%).
- For KS3 English and for GCSE English, VR was the best single predictor. However, the addition of KS2 English points score accounted for an additional 6% of the variance in KS3 English and 5% of the variance in GCSE English.
- For KS3 mathematics and GCSE mathematics, QR was the best single predictor. The addition of the KS2 mathematics test accounted for a further 6% of the variance in KS3 mathematics and 5% of the variance in GCSE mathematics.
- For KS3 science and GCSE science, VR was the best single predictor. For KS3 science, VR explained 58% of the variance, the KS2 science test accounted for a further 5%, and NVR added a final 2%, raising the total to 65%. For GCSE science, VR accounted for 41% of the variance and KS2 mathematics points score accounted for an additional 5%. It is interesting that KS2 science score did not account for any unique variance in GCSE science grade.
- For GCSE history, geography and French, VR was the best single predictor of performance, although KS2 mathematics accounted for a further 3% of the variance for geography, and KS2 English accounted for a further 3% of the variance for history and French.
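The stepwise procedure with the 2% inclusion rule can be sketched as a simple forward selection. This is an illustrative reimplementation on synthetic data, not the article's actual analysis (the software used is not stated), and the variable names are borrowed from the paper purely as labels.

```python
import numpy as np

def r_squared(cols, y):
    """R^2 from an ordinary least squares fit of y on the given columns."""
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, names, min_gain=0.02):
    """Forward stepwise selection with the paper's rule: a predictor enters
    only if it raises R^2 by at least 2 percentage points."""
    chosen, current = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2 = {j: r_squared([X[:, k] for k in chosen + [j]], y) for j in remaining}
        best = max(r2, key=r2.get)
        if r2[best] - current < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = r2[best]
    return [names[j] for j in chosen], current

# Synthetic illustration: an outcome driven by 'VR' and 'En', with 'QR'
# correlated with 'VR' (hypothetical data, not the article's).
rng = np.random.default_rng(0)
n = 500
vr, en = rng.normal(size=n), rng.normal(size=n)
qr = 0.8 * vr + 0.6 * rng.normal(size=n)
y = vr + 0.5 * en + 0.3 * rng.normal(size=n)
selected, r2_final = forward_select(np.column_stack([vr, qr, en]), y, ["VR", "QR", "En"])
```

Because QR shares most of its predictive information with VR, it fails the 2% gain rule once VR has entered, which is the same logic by which the paper excludes predictors that add no unique variance.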

Discussion

The main results can be summarised as follows:

- KS2 test points scores and reasoning test scores at age 11 are both highly correlated with pupils' subsequent KS3 test scores at age 14 and GCSE outcomes at age 16. Both reasoning tests and KS2 tests provide a statistically adequate basis for generating indicated KS3/GCSE outcomes.
- Reasoning test standard age scores correlate more highly than KS2 test points scores with each outcome at KS3 and at GCSE. Reasoning scores at age 11 were the best single predictors of all KS3 and GCSE outcomes.
- Multiple regression analyses reveal that reasoning scores and KS2 test scores account for somewhat different parts of the variation in KS3/GCSE outcomes. Adding the KS2 test points scores along with reasoning scores accounts for a small but significant additional part of the variance in KS3/GCSE outcomes. Consequently, a combination of CAT and KS2 points scores together gives the best indication of future KS3/GCSE outcomes.

KS2 test marks revisited

As already argued, KS2 test marks do not have the necessary properties of consistency over time and ready availability to be of practical use to secondary schools in KS3/GCSE target setting. However, KS2 test levels (range 2–6) are relatively undifferentiated compared to CAT2E standard age scores (range 70–130) and KS2 test marks (range 0–80 for science and 0–100 for English and mathematics). Is it simply the greater differentiation of CAT standard age scores, compared to KS2 test levels, that explains the primacy of CAT scores in the multiple regression analyses? KS2 test marks were not included in the national KS2–GCSE 2002 data set reported in this article. However, a direct comparison of KS2 test marks, KS2 test levels and CAT standard age scores was possible using the national KS2–KS3 2002 data set. While the correlations of KS3 outcomes with KS2 test marks were somewhat higher than the correlations with KS2 test levels, the multiple regression analyses again revealed that the combination of CAT and KS2 test marks provided a significantly better prediction of KS3 outcomes than either type of test on its own. It is not, therefore, the simple fact that CAT scores are more differentiated that accounts for their significant role in the multiple regressions; otherwise the reasoning tests would no longer be significant once an equally differentiated KS2 measure was introduced. This confirms the unique contribution of both reasoning tests and attainment tests to the prediction of future performance.

Why do predictions from reasoning tests and KS2 tests differ?

The CAT and the KS2 tests may be contrasted on a number of dimensions. First, reasoning tests focus on very familiar content, basic elements such as simple words or sentences, numbers and number operations, or shapes and geometric forms, while the content of attainment tests is drawn from the detailed programmes of study of the National Curriculum. Second, reasoning tests emphasise the perception and manipulation of relationships using these basic elements, abilities which should transfer to a wide range of tasks, while attainment tests tend to measure specific outcomes of learning and instruction. Third, the KS2 tests are 'high stakes' for primary schools and are published in national performance tables, while the CAT is a 'low stakes' test, used for teaching, learning and management purposes within the school.

Each set of tests therefore has its own strengths. The KS2 tests assess attainment in some core areas of the curriculum, and reflect how well pupils have acquired and retained specific knowledge in these areas. By contrast, reasoning tests assess more general, transferable learning abilities, and can provide a measure of potential. While there is a high correlation, around 0.75, between the CAT and the KS2 tests, they are not measuring the same thing. Hence the best basis for predicting future attainment comes from a combination of both.

It is important to note that the high correlations reported here do not indicate a deterministic relationship between KS2 or reasoning test scores at age 11 and subsequent performance at age 14/16. Even a correlation as high as 0.70 indicates that over half the variation in outcomes is attributable to factors other than the age 11 score. Such factors include the motivation and effort of the pupil, the quality of teaching, the level of parental support, and many others. The indicators can only give the typical or most frequent outcome for a particular age 11 score, and there will be a range of pupil achievement around it. See Strand (2003, pp. 94–116) for further details on how the CAT indicators are presented and should be interpreted.
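The 'over half the variation' point follows from squaring the correlation coefficient: the proportion of variance in one measure shared with another is r squared. A quick check using the correlations quoted above:

```python
# Shared variance is the square of the correlation coefficient.
for r in (0.70, 0.75):
    shared = r ** 2
    unexplained = 1 - shared
    print(f"r = {r:.2f}: shared = {shared:.0%}, unexplained = {unexplained:.0%}")
```

So even at r = 0.70, 51% of the variation in outcomes lies outside what the age 11 score predicts.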
Using reasoning tests and KS2 tests together in target setting

It follows from the above that the KS3/GCSE indicators provided by reasoning tests such as the CAT and by KS2 tests will not be a perfect match in every case. Some pupils may be expected to gain a certain outcome by one measure but not by another. However, this should not be seen as problematic, or as indicating a flaw in one or other of the tests, since different domains are being assessed.

There are precedents for KS3/GCSE indicators from different 'baselines' being available to schools. For example, the Autumn Package currently gives secondary schools a choice of benchmarks for GCSE performance, based either on the percentage of pupils entitled to Free School Meals or on the prior KS3 average points score of the cohort. In the same way, schools can have a choice of KS3/GCSE indicators based on reasoning test scores or on KS2 test levels, and can make an informed choice between them depending on their local circumstances. There may be real variability in the reliability of KS2 test results for schools with a diverse intake from many different primary schools, or KS2 results may be unavailable because of administrative problems, pupil absence and so on. The availability of such an option offers schools the greatest flexibility in the analysis and use of their data.

However, the strongest use would come from considering and integrating the indicators from both sets of tests alongside each other. The most comprehensive approach to end of KS3/GCSE target setting would be based on individual pupil data and would include reasoning scores, KS2 test results and teacher assessment at the time that targets were being set. For example, consider the situation where targets are being set at the start of Year 8 for end of KS3 science attainment. One way of making sense of the three sources of data (reasoning tests, KS2 tests and teacher assessment) is to plot the various predictions as a Venn diagram (see Figure 1). If the name of each pupil predicted to achieve Level 6 or above in the KS3 science test is plotted in the appropriate hoop of the Venn diagram, then pupils whose names appear in all three hoops are highly likely to achieve the outcome, while pupils predicted to attain Level 6 or above by only one or two measures are less secure. The profile of predictions across the three measures can prompt particular diagnostic questions about the pupil's performance (see Strand, 2003, pp. 155–157 for a detailed discussion). The results can also be used to estimate a range of KS3 targets for the cohort as a whole. The least challenging target for the school is that which includes only those pupils encompassed by all three hoops; the most challenging target would include any pupil encompassed by at least one hoop. Schools can explore the range across these possible targets, and consider how they relate to current performance, in determining appropriately challenging targets for future attainment.

Secondary school value-added calculations

Early in their introduction there was considerable debate about the viability of using KS2 test levels for value-added calculations, or whether more finely differentiated measures of the intake, such as standardised tests, were needed (SCAA, 1994; DfE, 1995). Later work suggested that by taking the average of the test levels (the KS2 average points score) a sufficiently refined measure could be obtained (SCAA, 1997a).
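The Venn-diagram logic for target setting described earlier reduces to simple set operations: the intersection of the three 'hoops' gives the least challenging cohort target, and their union the most challenging. A minimal sketch (the pupil names and predictions are invented for illustration):

```python
# Hypothetical pupils predicted to reach Level 6+ in KS3 science by each source.
by_reasoning = {"Amy", "Ben", "Chloe", "Dev", "Eve"}
by_ks2 = {"Amy", "Ben", "Chloe", "Fay"}
by_teacher = {"Amy", "Ben", "Dev", "Fay"}

# Pupils in all three hoops: the most secure predictions,
# and the basis of the least challenging cohort target.
secure = by_reasoning & by_ks2 & by_teacher

# Pupils in at least one hoop: the most challenging cohort target.
possible = by_reasoning | by_ks2 | by_teacher

print("Secure (all three sources):", sorted(secure))
print("Possible (any source):", sorted(possible))
print(f"Cohort target range: {len(secure)}-{len(possible)} pupils")
```

Pupils in `possible` but not `secure` are those flagged by only one or two measures, and so prompt the diagnostic questions discussed above.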
Figure 1. Combining predictions from reasoning tests, Key Stage 2 tests and current teacher assessment when setting targets for the end of Key Stage 3

The strong correlations reported here between KS2 average points score and both KS3 and GCSE outcomes bear this out. However, this does not mean that 'other standardised tests are no longer required to predict outcomes at KS3' (SCAA, 1997b). This article shows that reasoning scores actually have higher correlations with both KS3 and GCSE outcomes than KS2 test points scores. More generally, reasoning tests improve prediction based on KS2 test results alone, and this should allow better measures of value added to be developed. However, such refinement of the input measures is a small concern relative to the deficiencies of the particular methodology currently employed (the median line analysis) and the absence of reported confidence bands around the value-added scores (Goldstein, 2003; Tymms & Dean, 2004).

Conclusion

Contrary to some initial predictions that other tests might no longer be necessary following the introduction of national KS2 tests, this article demonstrates a clear role for reasoning tests alongside national end of KS2 tests. Crucially, it demonstrates that, in relation to predictive validity, a combination of reasoning and KS2 tests provides the most reliable basis for predicting future performance. However, predictive validity is only one of the criteria involved in selecting a test or assessment. Many secondary schools use reasoning tests in addition to the data from the KS2 tests because they offer: a baseline administered in a consistent fashion for pupils from all feeder primary schools; a means of identifying pupils who may be 'underachieving' against their abilities; diagnostic information on pupils' cognitive strengths, weaknesses and learning styles/preferences; and the opportunity to triangulate and assess the viability of school or pupil KS3/GCSE targets based on KS2 results alone. The challenge facing schools is how to integrate and use effectively the wide range of assessment information available to them, rather than arguing about which (if any) are the 'best' tests.

Notes

1. The 'Autumn Package' is a set of analyses of national test and assessment results at ages 7, 11 and 14, and public examination results at age 16, produced each autumn by government departments and agencies and published on a national website (see references). The analyses include national averages and trends for results at each key stage, national value-added information based on typical (median) progress made between key stages, and school-level benchmarks of performance against socio-economic circumstances (the percentage of pupils entitled to free school meals) or the average prior attainment of the cohort.

References

Autumn Package (2002) The Autumn Package. Available online at: www.standards.dfes.gov.uk/performance (accessed 17 May 2004).

Critchlow, J. & Coe, R. (2003) Serious flaws arising from the use of the median in calculating value-added measures for UK school performance tables, paper presented to the International Association for Educational Assessment (IAEA) Annual Conference, Manchester, October 2003.

Department for Education (1995) Value added in education: a briefing paper from the Department for Education (London, Department for Education).

Fernandes, C. & Strand, S. (1998) CAT and KS3/GCSE indicators: technical report (Windsor, NFER-Nelson). Latest version available online at: http://www.nfer-nelson.co.uk/cat/gtbfc/GCSE/CAT_GCSE_2003_TechRep.pdf (accessed 24 June 2004).

Goldstein, H. (2003) A commentary on the secondary school value added performance tables for 2002 (released 22 January 2003, updated December 2003). Available online at: http://www.mlwin.com/hgpersonal/value-added-commentary-jan03.htm (accessed 2 June 2004).

Hayes, S. (2001) Pupils who achieved Level 4 at Key Stage 2 and also at Key Stage 3, paper presented to the Annual Conference of the British Educational Research Association, Leeds University, 13–15 September 2001.

Hopkins, D. R. & Davis, R. (2003) An investigation into KS2 national test data and consideration of issues affecting primary/secondary phase transition within one Local Education Authority. Unpublished research supported by Teacher Research Scholarships granted by the General Teaching Council of Wales (Professional Development References 151 & 3022).

Massey, A., Elliott, G. & Ross, E. (1996) Season of birth, sex and success in GCSE English, mathematics and science: some long lasting effects from the early years? Research Papers in Education, 11(2), 129–150.

Massey, A., Green, S., Dexter, T. & Hamnett, L. (2003) Comparability of national tests over time: key stage test standards between 1996 and 2001. Final report to the QCA (London, QCA Publications).

Moody, I. (2001) A case-study of the predictive validity and reliability of Key Stage 2 test results, and teacher assessments, as baseline data for target-setting and value-added at KS3, The Curriculum Journal, 12(1), 81–101.

Pegg, D. (1998) National Curriculum Key Stage 3 tests and NFER tests—are comparisons of use? British Journal of Curriculum and Assessment, 8(2), 37–43.

Qualifications and Curriculum Authority (2002a) Key Stage 2 assessment and reporting arrangements 2002 (London, QCA Publications).

Qualifications and Curriculum Authority (2002b) Key Stage 3 assessment and reporting arrangements 2002 (London, QCA Publications).

Rayment, T. (1996) Key Stage 3 National Curriculum tests 1996: some observations from a study of one school, British Journal of Curriculum and Assessment, 7(1), 19–22.

Sax, G. (1984) The Lorge–Thorndike Intelligence Tests/Cognitive Abilities Test, in: D. J. Keysner & R. C. Sweetland (Eds) Test critiques (vol. 1) (Kansas, Test Corporation of America).

School Curriculum and Assessment Authority (1994) Value added performance indicators for schools (London, School Curriculum and Assessment Authority).

School Curriculum and Assessment Authority (1997a) The Value Added National Project: final report (ref COM/97/844) (London, SCAA Publications).

School Curriculum and Assessment Authority (1997b) Making effective use of Key Stage 3 assessments (London, SCAA Publications).

Smith, P., Fernandes, C. & Strand, S. (2001) Cognitive Abilities Test (3rd edn): technical manual (Windsor, NFER-Nelson).

Strand, S. (2001) Exploring the use of reasoning tests in secondary schools: letting the CAT out of the bag, paper presented to the Annual Conference of the British Educational Research Association, Leeds University, 13–15 September 2001.

Strand, S. (2003) Getting the best from CAT: a practical guide for secondary schools (London, NFER-Nelson).

Strand, S. (2004a) Consistency in reasoning scores over time, British Journal of Educational Psychology, 74(4), 617–631.

Strand, S. (2004b) The use of effect sizes: two examples from recent educational research, in: I. Schagen & K. Elliott (Eds) But what does it mean? The use of effect sizes in educational research (Slough, National Foundation for Educational Research).

Times Educational Supplement (2002) Great results fudge true picture, 10 May, pp. 28–29.

Thomas, S. & Mortimore, P. (1996) Comparison of value-added models for secondary school effectiveness, Research Papers in Education: Policy and Practice, 11(1), 5–33.

Thorndike, R. L., Hagen, E. & France, N. (1986) Cognitive Abilities Test (2nd edn): administration manual (Windsor, NFER-Nelson).

Tymms, P. & Dean, C. (2004) Value-added in the primary school league tables: a report for the National Association of Head Teachers (London, National Association of Head Teachers).

Wiliam, D. (2001) Level best? Levels of attainment in national curriculum assessment (London, Association of Teachers and Lecturers).
