One of the most salient findings from the field of education is that there are huge national differences in student achievement as shown in international comparative studies like PISA and TIMSS. The shockingly large gap between the highest performing countries (mostly in East Asia) and many European countries corresponds to a difference in attainment of two years of schooling. Although this finding has been replicated in several studies, the reasons for and consequences of such differences are currently not well understood. This book is a collection of essays and studies by leading experts in international comparative education who demonstrate how international comparative assessments can be used to evaluate educational policies. The volume is organized into two parts that address, first, theoretical foundations and methodological developments in the field of international assessments, and, second, innovative substantive studies that utilize international data for policy evaluation studies. The intention of this book is to revisit the idea of ‘using the world as an educational laboratory’, both to inform policy and to facilitate theory development.
Rolf Strietholt, Wilfried Bos, Jan-Eric Gustafsson, Monica Rosén (Eds.)
Educational Policy Evaluation through International Comparative Assessments
Waxmann 2014
Münster · New York
Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de
Print-ISBN 978-3-8309-3091-4 E-Book-ISBN 978-3-8309-8091-9 © Waxmann Verlag GmbH, 2014 www.waxmann.com
[email protected] Cover design: Anne Breitenbach, Tübingen Typesetting: Stoddart Satz- und Layoutservice, Münster Print: Hubert & Co., Göttingen Printed on age-resistant paper, acid-free as per ISO 9706
All rights reserved. Printed in Germany No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, electrostatic, magnetic tape, mechanical, photocopying, recording or otherwise without permission in writing from the copyright holder.
Contents
PART A: CONCEPTUAL AND METHODOLOGICAL FOUNDATIONS

Rolf Strietholt, Jan-Eric Gustafsson, Monica Rosén and Wilfried Bos
Outcomes and Causal Inference in International Comparative Assessments ....................... 9

Jan-Eric Gustafsson and Monica Rosén
Quality and Credibility of International Studies ..................................................................... 19

Leonidas Kyriakides and Charalambos Y. Charalambous
Educational Effectiveness Research and International Comparative Studies: Looking Back and Looking Forward ........................................................................................ 33

Rolf Strietholt
Studying Educational Inequality: Reintroducing Normative Notions ................................. 51

Eugenio J. Gonzalez
Calculating Standard Errors of Sample Statistics when Using International Large-Scale Assessment Data ............................................................................. 59

Agnes Stancel-Piątak and Deana Desa
Methodological Implementation of Multi Group Multilevel SEM with PIRLS 2011: Improving Reading Achievement .............................................................. 75

Martin Schlotter, Guido Schwerdt and Ludger Woessmann
Econometric Methods for Causal Evaluation of Education Policies and Practices: A Non-Technical Guide ................................................................................... 95
PART B: EMPIRICAL STUDIES

Hongqiang Liu, Kim Bellens, Wim Van Den Noortgate, Sarah Gielen and Jan Van Damme
A Cross-country Comparison of the Effect of Family Social Capital on Reading Literacy, Based on PISA 2009 ............................................................................. 129

Eric A. Hanushek and Ludger Woessmann
Institutional Structures of the Education System and Student Achievement: A Review of Cross-country Economic Research .................................................................. 145

Anne-Catherine Lehre, Petter Laake and Joseph Andrew Sexton
Using Quantile Distance Functions to Assess Inter- and Intrasex Variability in PISA Achievement Scores ................................................................................ 177
Leonidas Kyriakides, Charalambos Y. Charalambous, Demetris Demetriou and Anastasia Panayiotou
Using PISA Studies to Establish Generic Models of Educational Effectiveness ................ 191

Monica Rosén and Jan-Eric Gustafsson
Has the Increased Access to Computers at Home Caused Reading Achievement to Decrease in Sweden? ..................................................................... 207

Hongqiang Liu, Kim Bellens, Sarah Gielen, Jan Van Damme, and Patrick Onghena
A Country Level Longitudinal Study on the Effect of Student Age, Class Size and Socio-Economic Status – Based on PIRLS 2001, 2006 & 2011 .................. 223
Authors ....................................................................................................................................... 243
PART A CONCEPTUAL AND METHODOLOGICAL FOUNDATIONS
Rolf Strietholt, Jan-Eric Gustafsson, Monica Rosén and Wilfried Bos
Outcomes and Causal Inference in International Comparative Assessments

Abstract
The main aim of this essay is to discuss how international large-scale assessments can be utilized for policy evaluation studies. We review key findings from previous studies and propose the curriculum as an organizing concept for considering, firstly, how educational opportunities are provided to students around the world and, secondly, the factors that influence how students use these opportunities. Thereafter, we discuss recent developments in the design of the international studies and methodological advances that allow for robust inferences about the mechanisms behind the observed differences in student outcomes. Finally, we identify major challenges for future research, including the need for studies that exploit the trend design of modern assessments, a stronger focus on educational equity, and stronger interdisciplinary and intersectoral collaboration.
Introduction

One of the most salient findings from the field of education is that there are huge national differences in student achievement observed in international comparative studies (Gustafsson & Rosén, this volume). The shockingly large gap between the highest performing countries (mostly in East Asia) and many European countries corresponds to a difference in attainment of two years of schooling. Although this finding has been replicated in several studies (Mullis, Martin, Foy, & Arora, 2012; Mullis, Martin, Foy, & Drucker, 2012; Mullis, Martin, Foy, & Stanco, 2012; OECD, 2014), the reasons for and consequences of such differences are currently not well understood.

To understand the great need for research in this area it is worth recapitulating some major empirical results stemming from the international comparisons. As has already been noted, one of the most striking results is the very large difference in mean levels of educational achievement between countries. In the area of mathematics, for example, the TIMS studies show enormous differences in the mean levels of performance between the highest performing countries (most of which are East Asian) and the lowest performing ones. In the most extreme cases these differences reach more than three standard deviations (SD), one SD corresponding to the effect of approximately two years of schooling. Even within the group of developed countries, the mean differences between the East Asian countries and European countries such as Sweden exceed one SD. The PISA studies present a similar pattern of differences in educational achievement, with East Asian countries displaying a large advantage in mathematics and science. These studies also show that the school systems of some Western countries, such as Finland, yield a high level of achievement.
Another important finding is that even though there is a general pattern of stability over time, there are also considerable changes in levels of achievement, which are sometimes dramatic. One example is the sharp decline in levels of achievement in mathematics and science in Norway and Sweden after 1995, which amounts to the effect of one year of schooling. A further example is the rapid increase in the level of achievement within the Finnish system from the 1980s, when achievement was at about the same level as in the other Nordic countries, to the extremely high level that the country boasts today. A third cluster of results is connected to the large differences in educational inequality across countries. These differences can be observed in terms of the dispersion of student test scores and inequality of opportunity by gender, social background and ethnicity. Interestingly, the measures of inequality also differ between domains and they change between primary and secondary schools in the respective countries.
The Curriculum Model in the Multilevel Educational System

Many features of educational systems affect how students learn. The curriculum, broadly defined, is an organizing concept for considering, firstly, how educational opportunities are provided to students around the world and, secondly, the factors that influence how students use these opportunities (Robitaille et al., 1993). The curriculum model has three aspects: (1) the intended curriculum is what national educational policies intend students to learn and how the education system should be organized to facilitate this learning; (2) the implemented curriculum includes how educational organizations (e.g. schools) implement such goals, what is actually taught in classrooms, who teaches it, and how it is taught; (3) lastly, the attained curriculum describes what students have actually learned and what they think about it, as well as the emergence of educational inequality (see Figure 1).
Figure 1: The curriculum model (elements of the diagram: National Educational and Social Context; Educational Settings and Home Learning Environment; Intended Curriculum; Implemented Curriculum; Attained Curriculum; Educational Outcomes and Inequality)
International large-scale assessments provide a suitable data basis for testing the curriculum model, because their achievement tests can be used to describe student learning in the participating countries. To form a more complete picture of these learners, the studies’ questionnaires (for students, parents, teachers and principals) and other relevant sources (e.g. the UNESCO Institute for Statistics and OECD statistics) provide a wealth of additional information. It is worth recalling that educational systems have a multilevel structure in which students are nested within classes, classes within schools, and schools within regions, societies and nations. Although educational policies are typically located at the higher levels, they also manifest themselves at lower levels. For this reason it is important to study direct, mediating and moderating effects at the various levels in order to understand the complex mechanisms within the educational system.
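To make the multilevel structure described above concrete, the following sketch fits a simple two-level random-intercept model to simulated data in Python with the statsmodels library. All variable names (score, ses, school) and the data themselves are hypothetical illustrations, not taken from any of the studies discussed in this volume, and a real analysis of international assessment data would additionally have to handle sampling weights and plausible values.

```python
# A minimal sketch of a two-level model: students (level 1) nested in schools (level 2).
# The data are simulated for illustration only; all variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_schools, n_per_school = 50, 20
school = np.repeat(np.arange(n_schools), n_per_school)
school_effect = rng.normal(0, 10, n_schools)[school]        # between-school variation
ses = rng.normal(0, 1, n_schools * n_per_school)             # hypothetical SES measure
score = 500 + 25 * ses + school_effect + rng.normal(0, 80, n_schools * n_per_school)

data = pd.DataFrame({"score": score, "ses": ses, "school": school})

# The random intercept for 'school' captures the clustering of students within schools;
# the fixed coefficient for 'ses' is a composite of within- and between-school effects.
model = smf.mixedlm("score ~ ses", data, groups=data["school"]).fit()
print(model.summary())
```

In this simple form the model only separates within- and between-school variance; mediating and moderating effects at several levels would require the richer multilevel SEM machinery discussed later in this chapter and applied by Stancel-Piątak and Desa (this volume).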
Educational Policy Evaluation through International Comparative Assessments?

Educational effectiveness research aims to understand how and under what circumstances students learn (Creemers & Kyriakides, 2008; Kyriakides & Charalambous, this volume). International comparisons are particularly useful for evaluating the impact of educational reforms and measures. As variation in many system-level features (e.g. the existence of central examinations) can only be observed across countries, international comparative studies provide a unique approach to studying how educational policies and societal issues affect learning and the emergence of educational inequality (Hanushek & Wößmann, 2011). International assessments facilitate comparable measurements of central outcomes of educational systems not only within but also between countries. Gustafsson (2008) observes that, since the start of the new millennium, a new generation of international comparative studies with a trend design has been implemented. Recent assessments such as PISA and TIMSS are repeated every few years and thus have a longitudinal component at the system level. Unlike previous cross-sectional comparisons, such longitudinal designs allow researchers to estimate causal effects of changes in educational policies and other factors at the system level.

Overview of Development of International Studies. The International Association for the Evaluation of Educational Achievement (IEA) was founded in 1958, with the aim of understanding the factors influencing student achievement. The researchers used the metaphor of using the world as an ‘educational laboratory’ to investigate the effects of school, home, student and societal factors on educational outcomes, arguing that an international comparative approach was necessary to investigate the effects of many of these factors. During the 1960s and 1970s two main studies were conducted, one on mathematics (First International Mathematics Study (Husén, 1967; Postlethwaite, 1967)), and
one on six different subjects (Six Subject Study (Walker, 1976)). During the 1980s the studies in mathematics, science and reading literacy were repeated (Second International Mathematics Study (Pelgrum, Eggen, & Plomp, 1986), Second International Science Study (Postlethwaite & Wiley, 1992), Reading Literacy Study (Elley, 1992)). While many interesting results were obtained, it was obvious that the studies were not particularly successful at answering questions regarding the determinants of educational achievement and the causal mechanisms involved. The primary reason for this was that the studies conducted were cross-sectional surveys, and such designs do not easily support causal inference.

In 1995 the TIMS study (Third International Mathematics and Science Study), a study of enormous scope and complexity, was launched (Martin et al., 1997; Mullis et al., 1997). This study was heralded as a major success, and it marked the beginning of a new phase in the development of international studies. In this phase, the presence of educational researchers is less marked and the involvement of national administrative and policy institutions is stronger. Even though researchers are still involved in the design, analysis and reporting of the international studies, the level of ambition in the reporting of important international findings is rather limited. The task of analyzing the factors behind the outcomes for the different countries is left to each participating country, and the databases are made available to the research community for secondary analysis. There has thus been an unfortunate drift away from explanations of causality towards more descriptive aims, mainly serving the purpose of evaluation of educational quality.

Since 1995, the TIMS study has been repeated on a four-yearly cycle, the acronym TIMSS now standing for Trends in International Mathematics and Science Study, and the number of participating countries has increased successively. In 2001, a study on a five-year cycle assessing reading literacy in Grade 4 (PIRLS, Progress in International Reading Literacy Study; Mullis, Martin, Gonzalez, & Foy, 2003) was also established, based upon the same solid design principles as TIMSS. In 2000, the OECD launched its popular Programme for International Student Assessment (PISA), which covers mathematics, science and reading attainment in 15-year-olds (OECD, 2001). PISA includes all the OECD countries, along with a large number of associate countries, and it is repeated every third year. This study uses methods and techniques that are similar to those used in the IEA studies. However, while the IEA studies focus on curriculum-defined knowledge and skills, the OECD studies also try to capture competencies expected to be important in adult life. Furthermore, while the IEA studies have a base in communities of researchers, the OECD studies have a more explicit policy orientation, aiming to influence the educational systems of the member states.

One area that is not well represented in the studies conducted by the IEA and OECD during the last few decades is that of foreign languages. In 2011, however, the European Survey on Language Competences (ESLC) was conducted in 16 European countries and educational entities; the study, which investigates reading, listening and writing in several languages, has recently been completed (European Commission, 2012).
Research Methodology. During the last two decades there have been important methodological developments which have made it possible to address issues that were previously impossible to approach. A brief overview of important developments in the fields of measurement and of causal inference from observational data is given below.

The fields of educational and psychological measurement have seen remarkable developments in powerful statistical methods through the evolution of modern test theory, or item response theory (IRT; De Boeck & Wilson, 2004). The power of IRT comes from the fact that the parameters of probabilistic models of performance on test items are invariant over samples of persons and items, whereas the statistics computed within the framework of classical test theory depend on the sample of persons and on which particular combinations of items are used. Since the early 1990s, IRT has been used regularly in the international studies and, through the employment of these techniques, the quality of the studies has improved immensely. With the IRT methodology, matrix-sampling models in which different persons take different subsets of items have been implemented, as have methods for equating the scales of different studies.

Another significant contribution to the field of measurement is the development of structural equation and latent variable models (SEM) (Muthén, 2002). By formulating models in terms of both latent and manifest variables, SEM can deal with errors of measurement in observed variables. Such models can also estimate both direct and indirect effects in chains of variables. Over the last twenty years, SEM has been extended in several different directions, such as the analysis of categorical data and the modeling of the nested structure of the units of educational systems, with students clustered in classrooms, classrooms clustered in schools, schools clustered in municipalities, and so on (see Stancel-Piątak & Desa, this volume, for an application). Currently SEM can be employed to model up to three levels of latent variables.

Another important strand of development concerns analytical approaches that allow valid causal inferences based on observational data (Morgan & Winship, 2007; Schlotter, Schwerdt, & Woessmann, this volume). The randomized experiment is a prototypical way to achieve valid conclusions about causal effects, but the challenge is greater when the researcher cannot manipulate conditions in experimental designs. Indeed, many interesting research issues within the field of education are not suitable for experimentation, for ethical, practical and economic reasons. This forces researchers to rely on different types of observational data. However, a problem with using such data is that associations are not easily interpretable in causal terms: it is often not possible to say that one factor actually causes a particular outcome. One reason for this is that a variable that is assumed to be dependent may partially cause an effect in a variable that is assumed to be independent. This is what is known as the problem of reverse causality or endogeneity. Another reason why an observed association between two variables need not express a causal relation is that there may be one or more variables that have been omitted from the study which affect both variables. A further threat that
can complicate the interpretation of results in terms of causality is errors of measurement in the observed variables. Such errors tend to cause systematic underestimation of relations between variables. Several approaches have been developed to guard against the different threats to valid causal inference in analyses of observational data:

• One class of approaches relies on conditioning techniques. The basic strategy is to find a set of control variables that can be included in regression equations in order to remove the effects of omitted variables. The multilevel and SEM approaches allow more efficient and correct analysis of multilevel and error-laden data, and propensity score matching techniques add additional power. However, even though conditioning works well when we have valid and reliable measures of the control variables, many omitted variables can only be partially observed, and there may be unobserved omitted variables. As such, conditioning is not an infallible route to valid causal inference.

• Another approach is instrumental variables (IV) regression. The idea is to find a variable (an ’instrument’) that is related to an independent, endogenous variable X, but not to the dependent variable Y, except indirectly via X (Angrist & Krueger, 2001). The treatment effect is identified through the part of the variation in X that is triggered by the instrument. This approach is often used to deal with problems of reverse causality and errors of measurement, and there are many examples of successful applications, particularly within the field of economics. However, IV regression suffers from limitations as well. For example, the standard errors of IV estimates tend to be large, and the approach rests on quite strong and generally untestable assumptions.

• Within the social sciences, longitudinal designs are frequently used (Gustafsson, 2010). When the units under study have characteristics that remain constant over time, the units can be used as their own controls, which brings the advantage that fixed characteristics can be omitted without causing any bias. Such analyses can be conducted with regression on change scores for the independent and dependent variables, or with ‘fixed effects’ regression in which each observed unit is identified with a dummy variable. This approach does not require longitudinal observations at the individual level, but can be applied at other levels of observation.

• Repeated cross-sectional designs, which are used in the international studies of educational achievement to measure achievement trends, have a longitudinal design at the country level (Liu, Bellens, Van Den Noortgate, Gielen, & Van Damme, and Rosén & Gustafsson, both in this volume, provide examples). Therefore, with data aggregated to the country level it is possible to take advantage of the strengths of longitudinal designs. Analysis of longitudinal data at aggregated levels is often referred to as differences-in-differences analysis (a minimal illustration is sketched at the end of this section). Aggregated data also has the advantage that mechanisms which cause reverse causality at the individual level need not be present at higher levels of observation. Furthermore, such data is not influenced by errors of measurement to the same extent as individual data: the
downward-biasing effect of measurement error is therefore much less of a problem with this approach than with individual-level data.

One of the main criticisms of the international studies is that the varying characteristics of nations in terms of culture, history and populations make it impossible to draw any inferences concerning the causal effects of different aspects of the educational system (Wiseman & Baker, 2005). This criticism essentially points to the problem of omitted variables in between-country comparisons, and it is well founded. Most of these problems can be avoided, however, with a country-level longitudinal approach, because such an approach investigates change and development within countries.

This description of advances in methodology for making causal inferences from observational data suggests that there are indeed tools available that can be fruitfully applied to investigate substantive research problems within the field of education. It is also clear, however, that each of the different methods has its limitations when used alone, which makes it necessary to use multiple approaches, to attend to possible sources of bias, and to find innovative ways to analyze the complex data from international comparative studies.
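The country-level longitudinal logic can be illustrated with a small sketch. The example below estimates a two-way fixed-effects (differences-in-differences) regression on simulated country-by-cycle aggregates in Python; the countries, assessment cycles, reform indicator and score values are all invented for illustration and are not drawn from TIMSS, PIRLS or PISA.

```python
# Minimal sketch of a differences-in-differences / two-way fixed-effects analysis on
# country-level trend data. All data and the 'treated' reform indicator are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
countries = [f"C{i:02d}" for i in range(20)]
cycles = [1995, 1999, 2003, 2007, 2011]

rows = []
for c in countries:
    base = rng.normal(500, 30)                      # stable country characteristics
    reform_year = rng.choice([1999, 2003, 9999])    # 9999 = never introduces the reform
    for t in cycles:
        treated = int(t >= reform_year)
        score = base + 2 * (t - 1995) + 8 * treated + rng.normal(0, 5)
        rows.append({"country": c, "cycle": t, "treated": treated, "score": score})

panel = pd.DataFrame(rows)

# Country dummies absorb fixed national characteristics (culture, history, wealth);
# cycle dummies absorb common trends; 'treated' picks up the within-country change
# associated with the hypothetical reform.
fit = smf.ols("score ~ treated + C(country) + C(cycle)", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["country"]})
print(f"estimated reform effect: {fit.params['treated']:.2f} "
      f"(SE {fit.bse['treated']:.2f})")
```

The same logic underlies the country-level analyses of trend data referred to above; with real assessment data the outcome would be a weighted country mean over plausible values rather than a simulated score.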
Conclusions and Challenges

For decades international comparative studies had cross-sectional designs, and the possibilities of using such data for studies that aim at identifying the causal effects of educational policies on educational outcomes were limited. It is only within the last 10–15 years that studies with a longitudinal trend component have been implemented. New data from multiple cycles of such studies are now available. A promising avenue for future research is therefore to exploit the fact that these trend data are much better suited than cross-sectional data for testing hypotheses about the causal effects of particular educational policies and reforms on student learning.

Most previous educational effectiveness research focused on average levels of achievement. A challenge for future research is to go beyond the currently dominant focus on averages in educational outcomes by emphasizing the idea that educational equality is an equally important outcome of educational systems (Strietholt, this volume). Different ways to operationalize equality and inequality have to be considered and discussed in terms of underlying theories of justice.

Furthermore, the integration of various disciplines promises to generate new multidisciplinary approaches to educational effectiveness. Traditionally, economists, sociologists and political scientists investigate social structures, institutions and other phenomena that are located at the higher levels of the educational system. Conversely, educational scientists and psychologists are typically concerned with individual differences, and therefore focus their attention on the lower levels of the system, namely the individual students, educators or principals. The
different research traditions are also visible in the various methodological approaches that have traditionally been used. On the one hand, econometrics has a particular strength in estimating causal effects from observational data. On the other hand, psychometricians and educational measurement experts have developed elaborate models for measuring competences and attitudes. From the point of view of research on educational policies and their effects on student learning, it is, however, necessary to attend to both individuals and institutions, and to take account of the multilevel nature of educational phenomena.

Finally, we feel that it is worthwhile to strengthen collaboration between public and private organizations. The integration of different sectors is clearly less developed in education than in other scientific fields such as engineering or pharmacy. In this context, it is important to note that it is not universities but organizations like ACER (Australian Council for Educational Research), ETS (Educational Testing Service), the IEA (International Association for the Evaluation of Educational Achievement), and the OECD (Organisation for Economic Co-operation and Development) that are internationally responsible for almost all large-scale studies on student achievement carried out to date. They are the driving forces behind the development of new survey and testing methodologies and behind the implementation of new studies. However, these organizations tend to produce reports that merely describe international differences in educational achievement without explaining their root causes. Collaboration between the private sector and leading university researchers might strengthen future international studies, as research institutes could engage with private sector partners not only in describing international differences but also in explaining their causes. At the same time, such collaboration promises to enhance the capacity of universities to conduct international comparative studies. University researchers, or groups of researchers from different universities, may, for instance, make use of the existing infrastructure of studies like PISA and TIMSS by adding national extensions (e.g. an individual panel component). This would be a valuable resource for researchers wishing to answer specific questions particular to their nations’ educational systems.
References

Angrist, J. D. & Krueger, A. B. (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Journal of Economic Perspectives, 15(4), 69–85.

Creemers, B. P. M. & Kyriakides, L. (2008). The dynamics of educational effectiveness. London: Routledge.

De Boeck, P. & Wilson, M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.

Elley, W. B. (1992). How in the world do students read? IEA Study of Reading Literacy. The Hague: IEA.

European Commission. (2012). First European Survey on Language Competences: Final report. Luxembourg: Publications Office of the European Union.
Gustafsson, J.-E. (2008). Effects of international comparative studies on educational quality on the quality of educational research. European Educational Research Journal, 7(1), 1–17.

Gustafsson, J.-E. (2010). Longitudinal designs. In B. P. M. Creemers, L. Kyriakides, & P. Sammons (Eds.), Methodological advances in educational effectiveness research (pp. 77–101). London and New York: Routledge.

Hanushek, E. A. & Wößmann, L. (2011). The economics of international differences in educational achievement. In E. A. Hanushek, S. Machin, & L. Wößmann (Eds.), Handbook of the economics of education (Vol. 3). Amsterdam: Elsevier.

Husén, T. (Ed.). (1967). International study of achievement in mathematics: A comparison of twelve countries (Vols. 1–2). Stockholm: Almqvist & Wiksell.

Martin, M. O., Mullis, I. V. S., Beaton, A. E., Gonzalez, E. J., Smith, T. A., & Kelly, D. L. (1997). Science achievement in the primary school years: IEA’s Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.

Morgan, S. L. & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge: Cambridge University Press.

Mullis, I. V. S., Martin, M. O., Beaton, A. E., Gonzalez, E. J., Kelly, D. L., & Smith, T. A. (1997). Mathematics achievement in the primary school years: IEA’s Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.

Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.

Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). PIRLS 2011 international results in reading. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.

Mullis, I. V. S., Martin, M. O., Foy, P., & Stanco, G. M. (2012). TIMSS 2011 international results in science. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.

Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Foy, P. (2003). PIRLS 2001 international report: IEA’s study of reading literacy achievement in primary schools in 35 countries. Chestnut Hill, MA: Boston College.

Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29(1), 81–117.

OECD. (2001). Knowledge and skills for life: First results from PISA 2000. Paris: OECD.

OECD. (2014). PISA 2012 results: What students know and can do – Student performance in mathematics, reading and science (Volume I, revised edition, February 2014). Paris: OECD Publishing.

Pelgrum, W. J., Eggen, T., & Plomp, T. (1986). Second International Mathematics Study: The implemented and attained mathematics curriculum – a comparison of eighteen countries. Washington, DC: Center for Education Statistics.

Postlethwaite, N. (1967). School organization and student achievement: A study based on achievement in mathematics in twelve countries. Stockholm: Almqvist & Wiksell.

Postlethwaite, T. N. & Wiley, D. E. (Eds.). (1992). The IEA Study of Science II: Science achievement in twenty-three countries. Oxford: Pergamon Press.

Robitaille, D. F., Schmidt, W. H., Raizen, S., McKnight, C., Britton, E., & Nicol, C. (1993). Curriculum frameworks for mathematics and science: TIMSS monograph no. 1. Vancouver, Canada: Pacific Educational Press.
Walker, D. A. (1976). The IEA Six Subject Survey: An empirical study of education in twenty-one countries. Stockholm: Almqvist & Wiksell.

Wiseman, A. W. & Baker, D. P. (2005). The worldwide explosion of internationalized educational policy. In D. P. Baker & A. W. Wiseman (Eds.), Global trends in educational policy (pp. 1–26). Oxford: Elsevier.
Jan-Eric Gustafsson and Monica Rosén
Quality and Credibility of International Studies
Abstract
Large-scale survey studies of educational achievement are becoming increasingly frequent, and they are visibly present both in educational policy debates and within the educational research community. These studies face a large number of methodological challenges which, in combination with the fact that they often yield unpopular results, are reasons why they are frequently contested on quality grounds. Taking its starting points in two published papers criticizing international studies, this chapter discusses methodological challenges related to the validity of paper-and-pencil-based measurement instruments and to the applicability of scaling models based on item response theory. It is concluded that, while international studies do indeed face methodological challenges that need further work, there is little reason to reject the studies as yielding invalid results on the basis of the expressed criticism.
International comparative studies of educational achievement currently form one of the most conspicuous phenomena in the field of education. At an increasing rate, these studies produce data that policy-makers can worry about and take advantage of, and that researchers can use in analyses of achievement differences between and within countries, and as a basis for investigating the effects of different educational and societal factors on educational achievement. Such international studies are, furthermore, a hotly debated phenomenon (e.g., Hopmann et al., 2007; Nóvoa & Yariv-Mashal, 2003; Simola, 2005) which attracts considerable media attention and which has a profound influence on educational policies.

One fundamental question that is often raised in relation to international studies is whether the results that they present can be trusted. Some researchers are likely to respond to this question with a very definite ‘no’, while others are likely to respond with an equally definite ‘yes’. However, international comparative studies are extremely complex endeavors, so it would seem unlikely that their results can be blindly trusted and, given the large amount of resources spent on them, it would also seem unlikely that they are completely untrustworthy. The purpose of the present chapter is, therefore, to discuss some recent methodologically oriented challenges to the quality and credibility of international studies. One line of criticism concerns the measurement design of international studies and argues, basically, that with paper and pencil tasks it is not possible to obtain valid results concerning students’ knowledge (e.g., Schoultz, Säljö, & Wyndhamn, 2001). The other line of criticism is raised from a quantitative methodological point of view, questioning the ways in which measurement and scaling models based on item response theory are being applied (e.g., Kreiner & Christensen, in press). Both
lines of criticism are in a sense devastating because, if the critiques are correct, international studies are fraught with fundamental problems which invalidate the entire approach. These two lines of criticism will be focused upon below. First, however, there is reason to provide some background on the development of international studies.
Background and Development of International Studies

The International Association for the Evaluation of Educational Achievement (IEA) was founded in 1958 by a small group of educational and social science researchers, with the purpose of conducting international comparative research studies focused on educational achievement and its determinants. Their aim was to understand the great complexity of factors influencing student achievement in different subject matter domains (Husén & Postlethwaite, 1996; Papanastasiou, Plomp, & Papanastasiou, 2011). They used the metaphor that they wanted to use the world as an educational laboratory to investigate effects of school, home, student and societal factors, arguing that an international comparative approach was necessary to investigate the effects of many of these factors. The researchers also had the responsibility of raising funding and conducting the entire research process, from theoretical conceptions and design, to analysis and reporting.

The TIMSS (Third International Mathematics and Science Study) 1995 study (Beaton et al., 1996) marks the beginning of a second phase in the development of international studies (Gustafsson, 2008). Now, the researcher presence is less marked and there has been a shift away from explanatory towards descriptive purposes. The involvement of national administrative and policy institutions has become stronger and, even though researchers are still involved in the design, analysis and reporting of the studies, the level of ambition of the reporting typically is limited. International reports mainly describe outcomes, along with background and process factors, but there is no attempt to explain the variation in outcomes between school systems, or to make inferences about causes and effects. The task of analyzing the factors behind the outcome for different countries is left to each participating country, and the databases are made available to the research community for secondary analysis. Thus, there has been a drift from explanation to description, mainly serving the purpose of evaluation of educational quality as a basis for national discussions about educational policy.

After 1995, there has also been a dramatic increase in the volume and frequency of studies. The number of countries participating in a particular study has increased dramatically and now often amounts to more than 60 countries or school systems. The frequency of repetition has also increased: the IEA studies of mathematics and science (i.e., TIMSS) and reading (i.e., PIRLS) are now designed to capture within-country achievement trends and are therefore repeated every fourth or fifth year. The OECD PISA study (Programme for International Student Assessment), which cov-
ers mathematics, science and reading, includes all the OECD countries, along with a large number of associate countries, and is repeated every third year. There were several reasons for this upsurge of interest in international comparative studies in the 1990s. One was that, since the 1980s, there has been an increased focus on outcomes of education, partly as a consequence of the changes in educational governance through processes of decentralization and deregulation. Another reason was that great advances had been made in the methodology for large-scale assessment of knowledge and skills. International studies adopted the methodology developed in the National Assessment of Educational Progress (NAEP) in the United States in the 1980s, based on complex item-response theory, matrix-sampling designs and sophisticated stratified cluster sampling techniques (Jones & Olkin, 2004). This methodology was well suited for efficient and unbiased estimation of system-level performance, and it was skillfully implemented to support international studies. The TIMSS 1995 study was the first study to take full advantage of this technology and, when PISA started a few years later, similar techniques were adopted in that study.
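As a rough, purely illustrative sketch of the matrix-sampling idea mentioned above, the code below rotates a hypothetical item pool into overlapping booklets so that each student answers only a subset of the items while adjacent booklets share a block, which is what later allows all items to be placed on a common scale. The pool size, block size and booklet layout are invented and far simpler than the balanced designs used in operational studies such as TIMSS or PISA.

```python
# Minimal sketch of matrix sampling: split a hypothetical item pool into blocks and
# rotate the blocks into booklets; each student takes one booklet, i.e. only part of
# the pool, and the overlap between booklets supports linking onto a common scale.
ITEM_POOL = [f"item_{i:03d}" for i in range(60)]          # hypothetical pool of 60 items
BLOCK_SIZE = 10
blocks = [ITEM_POOL[i:i + BLOCK_SIZE] for i in range(0, len(ITEM_POOL), BLOCK_SIZE)]

# Each booklet pairs one block with the next one (wrapping around), so every block
# appears in exactly two booklets.
booklets = {b: blocks[b] + blocks[(b + 1) % len(blocks)] for b in range(len(blocks))}

def assign_booklet(student_index: int) -> list:
    """Assign booklets by simple rotation; real studies randomize within classrooms."""
    return booklets[student_index % len(booklets)]

print(len(booklets), "booklets,", len(booklets[0]), "items per student,",
      len(ITEM_POOL), "items in the pool")
```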
Stability of Results in International Studies

It does seem reasonable to assume that, unless the technology implemented in the TIMSS 1995 study and later studies had generated results that were perceived as being trustworthy, the great boom of international studies would not have taken place. By and large, it seems that country-level results remain quite stable over time. Even though this is of course not necessarily a demonstration of reliability, a pattern of random variation in the outcomes for different countries over time would cause stakeholders to lose faith in the studies. There are also several examples of countries that repeatedly perform unexpectedly poorly or unexpectedly well, and where it rather seems that the expectations were incorrect, compared to the measured outcomes. One example of unexpectedly high levels of achievement is provided by the excellent results of East Asian countries, which are surprising given that the Western literature had indicated that instructional practices in East Asia were traditional and backward, failing to keep pace with the latest developments in learning and instructional theories (Leung, 2008). Another, similar, example is Finland, where the PISA results have been unexpectedly high.

The stability of both expected and unexpected outcomes suggests that there must be at least a basic level of quality and credibility in the international studies, as does the expansion of the studies. However, this has been bought at the price of adopting a very complex technology that is inaccessible to most educational researchers and policy-makers and that only a few specialists fully master. Furthermore, the international comparative studies on student achievement have a somewhat deceptive appearance. They involve students who work on tasks that are similar to those used in classrooms in everyday schoolwork. Yet, the primary purpose is not to provide knowledge about everyday classroom activities but to make generalized descriptions of achievement outcomes at the school system level.
These studies also have the appearance of research studies, involving large and representative samples of students, teachers and schools, and a large number of instruments designed to capture not only student outcomes but also many categories of background and explanatory variables. Yet, they are not designed to test theories or provide explanations, but rather to provide an infrastructure for research through generating data that may be used to investigate a wide range of issues.
Limitations of Paper and Pencil Assessments

Typically, items in international studies are presented in written form and require written responses. Such items are seen by many as artificial and restricted, and it has been argued that more authentic performance assessments should be preferred. The TIMSS 1995 study offered countries the opportunity to administer a set of performance assessment tasks in science and mathematics to additional samples of students not participating in the main study (Harmon, Smith, Martin, Kelly, Beaton, Mullis, Gonzalez, & Orpwood, 1997). In the study, about a dozen different tasks were administered to students in Grades 4 and 8, each student being given three or four tasks. Altogether, some 20 countries participated in the performance assessment study, even though participation rates were not acceptable in all countries. The overall level of achievement on the performance tasks agreed quite well with the results in the written assessments, though limitations in the data prohibited deeper analyses. Another finding was that countries that did well overall generally tended to do better than other countries on each of the tasks, even though there was also some variation in rank ordering across tasks. After this first study of performance assessments in international comparisons, no other TIMS study has included such tasks. The reason for this is that they are time-consuming to administer and score whilst, at the same time, the increase in information yield is marginal compared to paper and pencil tasks.
Level of Performance in Paper and Pencil Tests vs. Interviews

However, Schoultz, Säljö and Wyndhamn (2001) argued that paper and pencil tasks have severe limitations, influencing their reliability and validity. They took a starting-point in a socio-cultural perspective and argued that differences in performance should not be seen as a consequence of students’ abilities and knowledge; performance should rather be seen as produced through concrete communicative practice. In particular, they argued that there are difficulties associated with the particular communicative format of test items which are presented in written form and which require a written response. They thus claimed that reading and responding to test items in solitude cannot be taken as an unbiased indicator of what students know and understand.
Schoultz et al. (2001) selected two items from the TIMSS 1995 study for scrutiny in an interview study comprising 25 Swedish Grade 7 students. One was an optics item. It presented an illustration showing two flashlights, one with and one without a reflector, and the question was which of the two flashlights shines more light on a wall 5 meters away. An open response was required and, to be scored correct, the response had to include an explanation that argued that the reflector focused the light on the wall. According to the TIMSS results, this item was quite difficult. In the Swedish Grade 7 sample, only 39 % of the students answered the item correctly, a figure somewhat below the international average. In the interview study, 66 % of the students gave correct answers. Even though this small and possibly unrepresentative sample makes it difficult to compare this result with that from the TIMS study, it nevertheless indicates that the interview situation makes the item easier. One reason for this was that the students did not have to write the answer in the interview situation. Furthermore, many students did not understand the word “reflector” and so had initial difficulties connecting what was written in the question with the illustration but, in the dialogue with the interviewer, these things were clarified. Thus, the higher performance in the interview study was to a large extent due to the scaffolding provided by the interviewer in a Socratic dialogue.

The other item was a multiple-choice chemistry item where the results were even more dramatic. According to the TIMSS data, only 26 % of the Swedish Grade 7 students chose the correct response alternative but, in the interview study, no less than 80 % of the students responded correctly. As in the previous case, this was due to the interaction between the interviewer and the interviewee, which helped the students to interpret the text and the meaning of the response alternatives. From this study, the authors concluded, among other things, that the low performance demonstrated in the TIMS study was due to the fact that the students were limited to operating on their own, and in a world of paper. They concluded that: “Knowing is in context and relative to circumstance. This would seem an important premise to keep in mind when discussing the outcomes of psychometric exercises.” (p. 234).

This may seem to be a serious criticism, not only of the TIMS study, but also of results from paper and pencil tests generally. However, the results of this study have little to do with quality aspects of the TIMSS assessment, or with the validity of paper and pencil tests. The Schoultz et al. (2001) study appears at a surface level to deal with the validity of items in the TIMSS test, but it in fact has different aims and is based on different assumptions than those made in TIMSS. As will be shown, this makes it impossible to make any inference about the phenomena studied in TIMSS from the results obtained in the Schoultz et al. study, and vice versa. The most fundamental difference concerns the assumptions made about the nature of performance differences over different contexts. Schoultz et al. view the performance differences between the paper and pencil and interview situations as absolute while, in TIMSS, performance differences between two situations are seen as relative. They interpret the higher level of performance when the item is admin-
istered in an interview situation compared to a paper and pencil situation as evidence of a higher level of knowledge and conceptual insight, and therefore as better evidence of what students can actually accomplish. This interpretation also implies that, if TIMSS were to use interviews to a larger extent than is currently done, this would result in a more positive picture of student knowledge. However, this is not so, because in TIMSS the observed performance level is seen as being determined not only by student ability but also by the difficulty of the item. Thus, a TIMSS researcher who is presented with the finding that the level of performance is higher when an item is presented in a highly supportive interview context than in a paper and pencil context would not necessarily think that the level of ability becomes higher when students are interviewed than when they sit alone and read and write. Another, more reasonable, interpretation is that the level of ability of the person is more or less constant in the two situations, while the task presented in the interview situation is easier than the paper and pencil task.

Another difference between the assumptions underlying the Schoultz et al. (2001) study and the TIMS study concerns the notions of reliability and validity. Schoultz et al. (2001) argue that it is possible to subject the TIMSS items, which had already been tested for validity and reliability, to a further test which, in a truer sense, would reveal the actual validity and reliability of the items. According to this view, the items have immanent and absolute characteristics which can be revealed through a careful and detailed analysis of the context in which the student interacts with the item. This view is related to the absolute view of student performance discussed above. A person working on large-scale assessment would, by contrast, find such a view to be incomprehensible because, according to the assessment view, the constructs of validity and reliability do not primarily refer to characteristics of single items, but to collections of items. Thus, the most commonly used form of reliability refers to the internal consistency of a scale based on many items. Similarly, the most fundamental concept of validity, namely construct validity (Messick, 1989), is not applicable to an item in isolation. When it comes to reliability and validity, it would rather seem that the Schoultz et al. (2001) study faces serious problems making credible inferences about students’ absolute level of ability to perform the two tasks on the basis of an interview study which, in many respects, was more like a teaching situation than a testing situation.

According to this analysis, Schoultz et al. (2001) have made the mistake of starting from one set of assumptions, which emphasize the context-bound nature of human action and interaction, and applying them to an activity which is based on the assumption that it is possible to generalize across contexts to the system level. This generates more confusion than clarification because concepts and observations that seem to refer to the same phenomena do, in fact, refer to different phenomena. The problem is that Schoultz et al. (2001) have applied the socio-cultural perspective, with its set of assumptions, in a critique of a phenomenon that is based on quite different assumptions. Their results therefore do not invalidate the TIMS study; nor is it possible to argue that the present criticism invalidates the socio-cultural approach in general.
It might, however, be worthwhile trying to capture the differ-
ence between the two perspectives in somewhat more constructive terms than just to state that they are different. One way to capture different perspectives is to describe them in terms of metaphors. So let us introduce a metaphor intended to do just that.
Weather and Climate

We are almost always concerned with weather because it profoundly affects our daily life: decisions about what clothes we should wear, whether we should go to the golf course or to the museum, whether it would be advisable to take the car, just to mention a few examples. Weather also affects our mood, and it supplies us with conversation material in almost all social contexts. However, we cannot do much about the weather, except adapt to the conditions it creates for us. Fortunately, meteorologists can predict what the weather will be like within the next couple of days. However, there is a margin of error in these predictions and, beyond a week or so, the predictions are useless. This is because of the great complexity of weather phenomena, and because the weather is chaotic; it is not even theoretically possible to predict weather over longer periods of time. Should we not like the weather, there is not much to do, except, of course, to move to a place with a better climate. Simple indicators, like average temperature, average rainfall, and number of days with sunshine, give us much information with which to compare the climates of different places. However, even though such information tells us much about the climate, it does not tell us much about what weather we are likely to experience on a particular visit, because these numbers are averages with a lot of variation. Thus, the link between climate and weather is a weak, probabilistic one. However, while weather is unpredictable and chaotic, climate and climate changes are stable phenomena which we can understand theoretically and for which empirically based models, predicting long-term development, can be constructed.

It could be argued that climate does not exist, in the sense that we cannot experience it directly. We do experience weather, however, and through aggregating these experiences, we get a sense of climate. In a more precise manner, scientists define climate as aggregate weather, using indicators such as mean temperature. Thus, climate is an abstraction which, in a sense, only exists in theoretical models. Nevertheless, it is a powerful abstraction which has very concrete and important implications for how we could and should live our lives.

In terms of this metaphor, large-scale survey studies are concerned with climate, while research that focuses on context-bound phenomena is concerned with weather. Thus, the assessment in TIMSS is based on the aggregation of a very large number of item responses, with little or no interest being focused on the particular items. In contrast, the Schoultz et al. (2001) study is focused on particular contexts. Many object to the aggregation of observations in educational and psychological research, ascribing validity only to that which can be directly observed (e.g., Yanchar & Williams, 2006). But the argument can also be turned around and it can be argued
that, in order to see the general aspects (e.g., the climate), it is necessary to get rid of the specifics (e.g., the weather). Seen from this perspective, methods which conceal context-dependent variation have strengths, rather than disadvantages, when the purpose is to investigate general patterns and relations.
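The following small simulation, with invented temperature figures, illustrates the point: a comparison made on any single day (the "weather") is unreliable, while the aggregate indicator (the "climate") recovers the underlying difference rather closely.

```python
# Illustrative sketch of the metaphor above: single observations ("weather") are noisy,
# while aggregates ("climate") reveal stable differences. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)

days = 365
city_a = rng.normal(loc=12.0, scale=8.0, size=days)   # daily mean temperatures, city A
city_b = rng.normal(loc=14.0, scale=8.0, size=days)   # city B is 2 degrees warmer on average

# On a single, randomly chosen day the ordering of the two cities is unreliable ...
single_day = rng.integers(days)
print("difference on one day:", round(city_b[single_day] - city_a[single_day], 1))
print("share of days on which B is warmer:", round((city_b > city_a).mean(), 2))

# ... but the aggregate (the 'climate' indicator) recovers the true difference.
print("difference in annual means:", round(city_b.mean() - city_a.mean(), 1))
```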
Low- and High-level Inference Research

The Schoultz et al. study would be classified as a qualitative study, while the TIMSS study would be classified as a quantitative study. However, Ercikan and Roth (2006) challenged the meaningfulness of this distinction, arguing that the quantitative and qualitative dichotomy is fallacious. One of their arguments was that all phenomena involve both quantitative and qualitative aspects at the same time. As an alternative to the quantitative/qualitative distinction, Ercikan and Roth (2006) proposed that different forms of research should be put on a continuous scale that goes from the lived experience of people on one end (low-level inference) to idealized patterns of human experience on the other (high-level inference). According to Ercikan and Roth (2006),

“Knowledge derived through lower-level inference processes … is characterized by contingency, particularity, being affected by the context, and concretization. Knowledge derived through higher-level inferences is characterized by standardization, universality, distance, and abstraction … The more contingent, particular, and concrete knowledge is, the more it involves inexpressible biographical experiences and ways in which human beings are affected by dramas of everyday life. The more standardized, universal, distanced and abstract knowledge is, the more it summarizes situations and relevant situated knowledge in terms of big pictures and general ideas.” (p. 20)

This level-of-inference approach to characterizing different forms of research is much more useful than the qualitative/quantitative dichotomy. Thus, while research on weather and climate cannot easily be characterized with the quantitative/qualitative distinction, research on weather may be meaningfully described as low-level inference and research on climate as high-level inference. Similarly, the Schoultz et al. (2001) study is an example of low-level-inference research, while the TIMSS study is an example of high-level-inference research.
Quality Aspects of High-level Inference Data

While the low-level inference approach can be grounded in interpretations generated from observations in specific contexts, this is not possible in the high-level inference approach. In this approach, the intention is to capture abstractions which span specific contexts and contents. The question, then, is if this is possible and meaningful, and what criteria we can use to decide whether it is meaningful. Ocular inspection of the items obviously cannot be used, and the answer cannot be found in detailed analyses of the contents and contexts of specific items, even
though this seems to be a common belief. A solution to this problem is, instead, to take advantage of the concepts and techniques within the field of measurement. The technology of measurement has evolved over more than 100 years, and thousands of researchers have contributed to its development. It is still under development, and the field of large-scale assessments is a driving force in this development. The technology of measurement is full of complex and esoteric constructs such as reliability, validity, item characteristic curve, item difficulty parameter, just to mention a few. In international studies, the items are developed in a laborious process of invention, creation, preliminary tryouts, and field trials. In this process, different statistical techniques are used to generate information about the characteristics of the items, along with qualitative techniques. In the final step, scaling is done in such a way that the results on different items are put onto the same scale, taking the difficulties of the items into account. This is an extremely complicated process and, at every step, things may go wrong, presenting threats to the usefulness of the derived scale. In order to ensure the quality of the final scale, every step of the process of development and implementation of large-scale assessments involves quality controls against explicitly defined criteria (Martin, Rust & Adams, 1999). However, while there are numerous quality criteria, the technology of measurement does not offer a single technique or number which may be used to characterize the meaningfulness and quality of the resulting scale. However, even though the field of educational measurement offers a useful set of tools, this does not imply that the tools are perfect, or that they are easy to apply. After the development of highly mathematically and statistically sophisticated models of measurement, which now form a methodological foundation for international studies, the technical complexities of these models have become a source of problems. One source of problems is that these models are not easily understood by anyone but the experts. Another source is that there are conflicting schools of thought in the field of educational measurement, and there rarely is consensus about which model is to be preferred in a particular case. Recently, Kreiner and Christensen (in press) directed strong critique against the scaling techniques employed in PISA, arguing that the PISA results cannot be trusted because the model used does not fit the data. This critique is discussed below.
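Before turning to that critique, it may help to state the simplest such scaling model, the Rasch model discussed in the next section, in its standard form (PISA's reading scale is based on this model, while TIMSS and PIRLS use somewhat more complex extensions, as noted below):

```latex
% Rasch item response function: the probability that person p answers item i correctly
% depends only on the person's ability \theta_p and the item's difficulty b_i.
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```

It is by estimating the difficulties b_i that responses to different items can be placed on one common ability scale.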
Can PISA Results be Trusted if the Rasch Model Does not Fit the Data?

In a paper entitled “Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy”, which has been accepted for publication in the prestigious journal Psychometrika, Svend Kreiner and Karl Bang Christensen (K&C) claim that the country rankings with respect to reading literacy levels reported by PISA cannot be trusted. The main line of argument in the paper is that the Rasch model is not appropriate for use with the PISA data. The Danish statistician Georg Rasch (1960) developed the Rasch model, and Svend Kreiner, who is a professor of biostatistics at the University of Copenhagen, has done much research on further development of this model. The Rasch model is the simplest of the IRT models, and postulates that the probability of a correct answer to an item is a function only of the ability of the person and the difficulty of the item. When this model fits the data, it has very attractive characteristics. It thus allows estimation of ability on the same scale from different sets of items for different persons, and it allows estimation of the difficulty of items from data from different groups of persons. These characteristics are taken advantage of in the PISA study. The other international studies, like TIMSS and PIRLS, use similar, but somewhat more complex, types of model.

The problem that K&C point out is that the attractive characteristics of the Rasch model are not guaranteed to hold unless the data is in agreement with the assumptions of the model or, to express it differently, unless the model fits the data. K&C argue in particular that so-called Differential Item Functioning (DIF) results in certain countries being disadvantaged by some test items whilst being advantaged by other test items. The consequence is that the rank ordering of countries is strongly influenced by which particular items are included in the test. This stands in conflict with one assumption of the Rasch model, namely, that the rank ordering is invariant over selection of items. However, K&C show without any doubt that the Rasch model does not fit data from the PISA reading literacy test. They also demonstrate that the rank ordering of countries varies considerably over different subsets of items, showing, for example, that Denmark can get ranks between 5 and 36 among 56 countries, depending upon which subsets of items were used for ranking.

This would seem like a devastating criticism of the entire PISA project. However, if K&C were correct in their criticism, we would expect to see considerable variability in the PISA results for countries across the different waves of measurement in PISA. These are conducted every third year and, for each wave, a part of the previously used items is replaced with new items. Still, the PISA results tend to be quite consistent for countries from one wave to another, even though it may be noted that the degree of consistency is lower for minor domains than for major domains. Given that K&C demonstrated poor fit of the Rasch model to the data, it may be asked how this can come about. The answer is that this is possible because results may be robust against violations of the model assumptions, in the sense that the correct results may be obtained in spite of violations of model assumptions (Gustafsson, 1980). Thus, even though model fit in a sense guarantees the correct results, lack of model fit does not automatically imply that incorrect results are obtained.

A closer look at the analyses upon which K&C base their conclusions reveals that these are limited to only a small fraction of the available data. They only use data from one of the many test booklets in PISA 2006, namely Booklet 6. This booklet includes 8 texts and 28 test items. However, K&C only analyzed 20 test items which had responses from all participating countries. The PISA samples of students for each country typically include at least 4000 students, but not all students take all items because they are given different booklets of test items.
The samples analyzed by K&C, therefore, only comprised about 250–400 students in many countries (e.g., 357 out of 4532 students in the Danish sample, 361 out of 4692 in the Norwegian sample, and 325 out of 4443 in the Swedish sample).
The main reason why K&C fail to find sufficient invariance of the country ranks when they analyze these data is that the results of their analyses are heavily influenced by random errors due to the small samples of items and students. Thus, the subsets of items used for comparing different rankings comprised between 5 and 9 items, and such small sets of items are heavily influenced by errors of measurement. The small samples of students contribute further random variation. K&C do indeed show that, for the actually used samples, the confidence interval for the observed rank of 17 for Denmark based on the Booklet 6 data would range from 10 to 24, while samples of 2000 students would have yielded a confidence interval ranging from 14 to 20. However, in spite of this clear demonstration of the importance of random errors for ranking, K&C have chosen not to pay much attention to this source of non-invariance of country rankings in the analyzed data. K&C generalize the results obtained from the Booklet 6 data to apply to the entire PISA survey, but this inference is fundamentally flawed. Their study allows the conclusion that, with samples comprising some 350 students and samples of 5 to 9 items, it is not possible to achieve trustworthy rankings of countries according to levels of reading literacy achievement. However, this would be a trivial conclusion because no one has ever made the claim that this can be done and this is not how PISA is designed. Instead of paying attention to the unsystematic sources of influence on the stability of country rankings, K&C focus on the possible systematic effects of Rasch model misfit on the country rankings. They demonstrate that model misfit causes error over and above that contributed by unsystematic factors, and this seems to be the basis for their strong emphasis on the importance of model fit. However, they do not investigate the relative importance of error due to model misfit and random errors due to selection of items and students in a realistic assessment situation. This is the kind of investigation that would be needed to obtain information about the degree of robustness of the Rasch model against violations of the model assumptions.
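As a rough illustration of what such an investigation could look like, the sketch below simulates Rasch-type responses with mild, invented DIF and compares the spread of country ranks under a small design of the kind K&C analyzed (a few hundred students, a handful of items) with a larger design. All country means, item parameters, DIF magnitudes, and sample sizes are invented, and ranking countries by percent correct is a deliberate simplification of operational IRT scaling; a fuller study would also vary the amount of DIF in order to separate misfit error from sampling error.

```python
# Rough robustness sketch: how much do country rankings move around when Rasch-type
# responses (with mild, invented DIF) are scored from small vs. large samples of
# items and students? Everything here is illustrative, not the operational PISA design.
import numpy as np

rng = np.random.default_rng(2)

n_countries, n_items = 20, 28
true_means = np.linspace(-0.5, 0.5, n_countries)        # countries differ in mean ability
difficulty = rng.normal(0.0, 1.0, n_items)              # common item difficulties
dif = rng.normal(0.0, 0.3, (n_countries, n_items))      # mild country-by-item DIF (misfit)

def simulate_ranks(n_students, n_subset, n_reps=100):
    """Rank countries by mean percent correct over repeated item/student samples."""
    ranks = np.empty((n_reps, n_countries), dtype=int)
    for r in range(n_reps):
        items = rng.choice(n_items, size=n_subset, replace=False)
        pct_correct = np.empty(n_countries)
        for c in range(n_countries):
            theta = rng.normal(true_means[c], 1.0, n_students)
            logits = theta[:, None] - (difficulty[items] + dif[c, items])[None, :]
            p = 1.0 / (1.0 + np.exp(-logits))            # Rasch response probabilities
            responses = rng.random(p.shape) < p
            pct_correct[c] = responses.mean()
        ranks[r] = (-pct_correct).argsort().argsort() + 1  # rank 1 = best performance
    return ranks

for n_students, n_subset in [(350, 7), (2000, 28)]:
    ranks = simulate_ranks(n_students, n_subset)
    spread = ranks.max(axis=0) - ranks.min(axis=0)
    print(f"{n_students} students, {n_subset} items: "
          f"average rank spread across countries = {spread.mean():.1f}")
```

In such a simulation the small design produces far wider rank spreads than the large one even though the same mild DIF is present in both, which is in line with the argument above that unsystematic sampling error, not model misfit alone, drives much of the observed non-invariance.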
Discussion and Conclusions

We have here identified two lines of criticism against international studies put forward in published papers. It can be noted that, even though these papers are very different, they have one thing in common: they make far-reaching generalizations about the invalidity of the results of international studies. These generalizations are most certainly incorrect, both because they are based on very limited evidence and because circumstantial evidence provides support for the validity of the results from international comparative studies.

It is interesting to note that non-quantitatively oriented researchers arrive at the same conclusion as that arrived at by quantitatively oriented researchers, and it might be interesting to speculate about the reason for this. One hypothesis that may explain this convergence is that both Schoultz et al. and K&C take their starting point in a particular meta-theory. Schoultz et al. work within a socio-cultural framework and formulate both their criticism and their conclusions within this framework, one which is not at all suited for bringing insight into measurement aspects of international studies of educational achievement. K&C take their starting point in the Rasch model, their view being that, to the extent that measurements do not fit the Rasch model, the fault is with the measurements, not with the model. Thus, in both cases, the criticism is based on convictions that a particular meta-theory is the correct one, making it easy to arrive at strong conclusions about the invalidity of procedures based on other starting points and meta-theories.

This is unfortunate for the simple reason that, even though the critique does not arrive at particularly useful conclusions, the critics certainly identify problem areas in need of attention. Thus, different item formats can not only be expected to influence item difficulty but may also influence item difficulty in different ways for different groups of test-takers, thereby causing DIF. The scaling model based on the Rasch model has been demonstrated to be too simplified, and there is a need to develop scaling models which are better adapted to the nature of the data from international studies. The two lines of criticism discussed here bring up fundamental problems and, if the conclusions claimed or implied by the critics were correct, they could invalidate the entire approach taken in international studies. However, even though the critics identify areas of concern, our conclusion is that the critique certainly is not devastating to the further conduct of international studies.
References

Beaton, A. E. et al. (1996). Mathematics Achievement in the Middle School Years: IEA's Third International Mathematics and Science Study (TIMSS). Chestnut Hill, Mass.: Boston College.
Ercikan, K. & Roth, W.-M. (2006). What good is polarizing research into qualitative and quantitative? Educational Researcher, 35(5), 14–23.
Gustafsson, J.-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205–233.
Gustafsson, J.-E. (2008). Effects of international comparative studies on educational quality on the quality of educational research. European Educational Research Journal, 7(1), 1–17.
Harmon, M., Smith, T. A., Martin, M. O., Kelly, D. L., Beaton, A. E., Mullis, I. V. S., Gonzalez, E. J., & Orpwood, G. (1997). Performance Assessment in IEA's Third International Mathematics and Science Study. Boston: TIMSS International Study Centre, Boston College.
Hopmann, S. T., Brinek, G., & Retzl, M. (Eds.) (2007). PISA zufolge PISA: Hält PISA, was es verspricht? / PISA according to PISA: Does PISA keep what it promises? Vienna: LIT Verlag.
Husén, T. & Postlethwaite, N. (1996). A Brief History of the International Association for the Evaluation of Educational Achievement (IEA). Assessment in Education: Principles, Policy & Practice, 3(2), 129–141.
Jones, L. V. & Olkin, I. (Eds.) (2004). The nation's report card. Evolution and perspectives. Bloomington, IN: Phi Delta Kappan.
Kreiner, S. & Christensen, K. B. (in press). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika. DOI: 10.1007/S11336-013-9347-Z.
Leung, F. K. S. (2008). The Significance of IEA Studies for Education in East Asia and Beyond. Keynote address at the 3rd IEA International Research Conference, Taipei, 18 September 2008.
Martin, M. O., Rust, K., & Adams, R. J. (Eds.) (1999). Technical standards for IEA studies. Amsterdam: IEA.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational Measurement (3rd ed.). Washington: National Council of Measurement in Education.
Nóvoa, A. & Yariv-Mashal, T. (2003). Comparative Research in Education: A Mode of Governance or a Historical Journey? Comparative Education, 39(4), 423–438.
Papanastasiou, C., Plomp, T., & Papanastasiou, E. C. (Eds.) (2011). IEA 1958–2008. 50 years of experiences and memories. Nicosia, Cyprus: Cultural Center of the Kykkos Monastery.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Sahlberg, P. (2011). Finnish Lessons: What Can the World Learn from Educational Change in Finland? New York: Teachers College Press.
Schoultz, J., Säljö, R., & Wyndhamn, J. (2001). Conceptual knowledge in talk and text: What does it take to understand a science question? Instructional Science, 29, 213–236.
Simola, H. (2005). The Finnish miracle of PISA: historical and sociological remarks on teaching and teacher education. Comparative Education, 41(4), 455–470.
Yanchar, S. C. & Williams, D. D. (2006). Reconsidering the compatibility thesis and eclecticism: Five proposed guidelines for method use. Educational Researcher, 35(9), 3–12.
Leonidas Kyriakides and Charalambos Y. Charalambous
Educational Effectiveness Research and International Comparative Studies: Looking Back and Looking Forward

Abstract
With a little more than half of a century of conducting International Comparative Studies (ICS) and almost half of a century after the issuing of Coleman et al.'s study – often thought to have sparked Educational Effectiveness Research (EER) – we step back in this paper and adopt a Janus-faced approach in considering the interrelatedness of ICS and EER. Looking back, we investigate how ICS and EER have contributed to each other; looking forward, we outline ways in which we think a stronger link between ICS and EER can contribute to the mutual development of both while, at the same time, serving their common goal: to better understand how education contributes to student learning and, in doing so, help improve student learning outcomes.
Aiming to start unpacking and understanding problems of school and student evaluation, in 1958 a group of educational psychologists, sociologists, and psychometricians met in Hamburg, Germany, forming a cross-national enterprise devoted to comparative studies of school practices and achievement, known as the International Association for the Evaluation of Educational Achievement (IEA) (Purves, 1987). This year can be considered to mark the inauguration of International Comparative Studies (ICS) – at least those conducted at large scale – with the first one, the Pilot Twelve-Country Study, carried out in 1960 and followed by a conglomerate of other studies conducted during the last 50 years. At about the same time, a study undertaken by Coleman and colleagues (1966) reported that only a very small proportion of the variation in student achievement can be attributed to schools; this finding sparked heated debates on the role of schools and teachers in student learning, thus giving birth to Educational Effectiveness Research (EER) (see Teddlie & Reynolds, 2000).1 A little more than half of a century after the inauguration of ICS and about 50 years after the issuing of the Coleman report which motivated the development of EER, in this paper we adopt a Janus-faced approach and, by stepping back, look at the evolution and the interrelatedness of EER and ICS. In doing so, we examine how each domain contributed to the other and unravel open issues that are worth considering. Looking forward, we then point to areas in which EER and ICS can contribute to each other in the future.

The chapter is organized in two sections. In the first section, we provide a brief historical overview of ICS and EER. Our aim in this overview is not to provide a comprehensive review of the historical development of either domain but, rather, to identify areas of convergence. In the second section, we first point to areas in which each domain has contributed to the other; next, based on what we learned during the past 50 years, we examine how in the future ICS and EER can help advance work in either domain while, at the same time, serving their common agenda of improving student learning. The reader is cautioned that several other contributions of both ICS and EER could be considered (e.g., Levers, Dronkers, & Kraaykamp, 2008; Luyten, 2006); however, because it is not possible to cover them all in a single paper, we opted to mainly focus on the use of ICS and EER for understanding how certain teacher and school attributes and actions promote student learning outcomes. Therefore, this chapter is more concerned with how ICS and EER can contribute to the development of a theory of effective teaching and learning.

1 Following a conceptualization of educational effectiveness research that is more evident in works of the last decade (cf. de Jong, Westerhof, & Kruiter, 2004; Kyriakides, 2005; Reynolds, Creemers, Stringfield, Teddlie, & Schaffer, 2002), in this chapter we are using the term Educational Effectiveness Research (EER) as an overarching term to convey studies originally clustered under School Effectiveness or Teacher Effectiveness Research. As used in this chapter, EER captures studies that help identify how different factors at the student, classroom, school, and system level, as well as their interactions, can contribute to student performance/learning.
A Brief History of International Comparative Studies

Although here we largely think of ICS as a unified body, in essence these studies display notable differences. However, in this section, we consider two broad strands of ICS: those conducted by the IEA and those carried out by the Organisation for Economic Co-operation and Development (OECD). We note, nonetheless, that several other ICS have also been conducted during these past 50 years (see Reynolds, Creemers, Stringfield, Teddlie, & Schaffer, 2002).
A Brief Historical Overview of IEA Studies

Providing a comprehensive overview of even one of the aforementioned strands of ICS lies beyond the scope of the work presented herein; rather, we concentrate on selected studies to illustrate some trends and critically discuss design decisions associated with these studies. With respect to IEA studies, the first large-scale ICS took place in 1960 and focused on five subjects: mathematics, science, reading comprehension, geography, and non-verbal ability (Purves, 1987). Carried out in 12 countries, and focusing on students of twelve years of age, this study convinced scholars about the possibility of conducting ICS, the results of which could have both theoretical and practical implications. Four years later, the First International Mathematics Study (FIMS) took place, focusing on a single subject matter and extending the student population to also involve students in their final year of secondary education (Husén, 1967; Postlethwaite, 1967). In 1970, this study was followed by a series of studies focusing on six different subject matters: science, reading comprehension, literature, foreign languages (English and French), and civic education, largely known
as the Six Subject Survey. This first round of studies conducted in the mid-1960s and early 1970s pointed to different predictors of student outcomes, including opportunity to learn – based on how the curriculum is taught – student motivation, different methods of teaching, and school practices. It was not surprising, then, that new cycles of ICS were conducted by IEA in the next decades.

Initiated by a study focusing again on mathematics (the Second International Mathematics Study, SIMS), the second cycle was conducted at the beginning of the 1980s, with an even broader country participation (20 countries); this study was again followed by a study focusing on science (the Second International Science Study) involving 24 countries. Because both these studies drew on items and questions from the two earlier corresponding studies, they not only outlined a picture of teaching mathematics and science in the participating countries, but also facilitated comparing and contrasting parts of this picture with how the two subjects were found to be taught; they also provided an opportunity for linking teaching with student performance in the first cycle. Even more critically, at this point, the designers of IEA thought of conducting what came to be known as a Longitudinal Study (Purves, 1987). For a subset of the participating countries in SIMS, student performance data were obtained at two different time points. By obtaining both a pre-test and a post-test measure of achievement, this study not only enabled examining changes in student performance in mathematics, but critically also gave the opportunity to investigate how different classroom and school characteristics contributed to this learning. This was also accomplished by another longitudinal IEA study, the Classroom Environment Study, which focused on the nature of classroom activities and student-teacher interactions (Anderson, 1987). A similar longitudinal IEA study initiated in the 1980s, the Preprimary Project (cf. Olmsted & Weikart, 1995), examined how the early preprimary experiences of students of four years of age contributed to their cognitive and language performance at the age of seven. As will be argued below, the decision to depart from the critical attribute that characterized the studies conducted in the 1980s – namely their longitudinal character – limited the potential of later studies to yield information on student learning and its contributors.

The 1990s marked the transition to ICS that included even more countries and which focused on topics such as computers and civic education and engagement. The ICS focusing on mathematics and science conducted in the Third International Mathematics and Science Study (TIMSS) of 1995 turned out to be the first in a four-year cycle of assessments in mathematics and science, currently known as the Trends in International Mathematics and Science Study. An important attribute of the international studies conducted in the 1990s relates to the two videotaped studies that aimed to examine the teaching of mathematics in a subset of the participating countries. Accompanying the TIMSS 1995 study, the first TIMSS Videotaped Classroom Study examined the practices of teaching eighth grade mathematics in the USA, Japan, and Germany by analyzing 231 videotaped lessons.
Recognizing that surveys alone cannot tell much about the teaching that takes place in the classroom and using a national probability sample, the designers of this study videotaped each of the participating classrooms for one complete lesson on a date convenient for the teacher (see more in Stigler & Hiebert, 1999). Parallel to that – although not using videotaped lessons – another study conducted over 120 classroom observations in mathematics and science classrooms in six countries, attempting to portray what comprises a “typical” lesson in the disciplines under consideration in those countries (Schmidt et al., 1996). The videotaping endeavor was also undertaken four years later in the context of TIMSS 1999, given that it was recognized that “to better understand, and ultimately improve, students' learning, one must examine what happens in the classroom” (Hiebert et al., 2003, p. 2). The second TIMSS Videotaped Classroom Study extended its scope to include videotaped lessons from Australia, the Czech Republic, Hong Kong SAR, the Netherlands, Switzerland, the United States, and Japan. Despite the criticism directed at both these studies because of their sampling and the argument that the lessons videotaped were not necessarily representative of the typical teaching in each country (see, for example, Keitel & Kilpatrick, 1999), both videotaped studies afforded the research community an indispensable opportunity not only to peer into classrooms and examine different approaches to teaching but, most importantly, to start understanding how teaching practices can contribute to student outcomes. The problem, however, was the absence of the longitudinal data that characterized IEA studies during the previous decade. Had TIMSS 1995 and 1999 also collected pre-test data, it would have been possible to examine how certain practices contribute to student learning. Despite this and other limitations, looking inside the classrooms – either through videotapes or through “live” classroom observations – marked a significant shift in the ICS carried out so far in that it conveyed a significant message: that survey data alone cannot tell much of the story of how teaching contributes to student learning. In this respect, that such studies have not been part of ICS conducted in the 21st century cannot but be considered another way in which ICS can be improved, as discussed in the next section.

In addition to carrying out a TIMSS study on mathematics and science every four years, IEA also conducted a series of studies in other subjects, such as Reading Literacy (i.e., the Progress in International Reading Literacy Study, PIRLS) and Information Technology and Computers (i.e., the International Computer and Information Literacy Study, ICILS). What perhaps characterizes an important shift in the 21st century is the first international study to be conducted at the tertiary education level, the Teacher Education and Development Study in Mathematics (TEDS-M). This study investigated the policies, programs, and practices for the preparation of future primary and lower-secondary mathematics teachers in 17 countries (Tatto et al., 2012). By initiating this study, IEA underlined the importance of exploring variables that may affect student achievement indirectly – such as teacher knowledge and preparation – and which can inform decisions on teacher education at the district or national level. Studies such as TEDS-M can be thought of as a major avenue for collecting and channeling information to policymakers and other stakeholders about the effectiveness of tertiary education, just as the studies reviewed above have attempted to do for compulsory education.

To summarize, this brief review of the evolution of IEA studies points to four interesting features. First, after the initial IEA studies that covered multiple
subject matter, separate studies were conducted for different subject matters (with the exception of TIMSS, which concerns both mathematics and science). This separation and domain specialization resonates with Shulman's (1986) plea for greater attention to the subject matter per se and the implications that this increased attention has for both teaching and learning within certain disciplines. Second, another notable feature concerns the longitudinal studies mainly carried out in the 1980s, which, however, were not continued in the recent cycles of IEA studies. Looking inside classrooms to better understand teaching and its effects can be considered a landmark feature of the studies conducted in the 1990s, while attention to tertiary education, in addition to primary and secondary, constitutes an advancement of IEA studies during the 21st century.
A Brief Historical Overview of OECD Studies

The studies conducted by the OECD had a somewhat different scope in terms of the type of outcomes measured and the data collection approaches pursued. Below we briefly focus on only one such OECD set of studies, the Programme for International Student Assessment (PISA); we do so because of its focus on student achievement and its aim of identifying potential predictors of student learning. We also use PISA because it constitutes a good comparison to the IEA studies considered thus far; comparing PISA studies to IEA studies (and particularly TIMSS) points to different design decisions that have informed ICS – decisions that should not be divorced from the findings yielded by these studies and which cannot be ignored when designing new cycles of ICS.

Initiated in 2000, PISA studies are conducted every three years, and mainly focus on three subject matters: mathematics, reading, and science. In contrast to the IEA studies, PISA studies measure skills and knowledge of students of 15 years of age (which, in most countries, marks the end of compulsory education); therefore, these studies target the student population based on age, instead of grade level (as IEA studies do). Additionally, unlike IEA studies, PISA studies are literacy- rather than curriculum-oriented; this implies that, instead of examining mastery of specific school curricula, these studies investigate the extent to which students are able to apply knowledge and skills in the subject areas under consideration in a gamut of “authentic” situations, including analyzing, reasoning, communicating, interpreting, and solving problems (Lingard & Grek, 2008). So far, more than 70 countries have participated in PISA studies, which mainly use student tests and self-reports (of students, teachers, school headteachers, and parents) as the main data-collection instruments. It is also important to note that, in contrast to IEA studies that enable linking student background and performance data to their classrooms/teachers, PISA studies do not enable such linking.

Regardless of their origin (IEA or OECD), their data-collection methods, their student populations, and their underlying assumptions, the ICS conducted so far have provided the research community with the means not only to start understanding student performance in one country relative to other countries – which often turns out to be one of the misuses of such studies – but also to start grasping how different factors – be they related to the student, the classroom, the teacher, the school, the curriculum, or other wider contextual and system factors – contribute to student learning. A similar agenda characterizes EER, which is considered next.
A Brief Historical Overview of Educational Effectiveness Research

Largely originating as a reaction to two studies conducted in the mid 1960s and early 1970s which showed that schools do not matter for student learning (Coleman et al., 1966; Jencks et al., 1972), EER attempts to establish and test theories which explain why and how some schools and teachers are more effective than others in promoting student learning (Creemers, Kyriakides, & Sammons, 2010). The first two school effectiveness studies, independently undertaken by Edmonds (1979) in the USA and Rutter, Maughan, Mortimore, Ouston, and Smith (1979) in England, aimed at providing empirical evidence that schools matter for student learning. By providing encouraging results regarding the effect of teachers and schools on student outcomes, these two studies have paved the way for a series of studies that followed in different countries and with different student populations, all aiming to further unpack and understand how teachers and schools contribute to student learning. The establishment of the International Congress for School Effectiveness and School Improvement, along with its related journal School Effectiveness and School Improvement founded in 1990, formally heralded the development of a new field concerned with understanding how classroom and school processes influence student learning. A plethora of studies have been conducted in this field, which can be grouped into four phases (see Kyriakides, 2008).

Mainly concerned with proving that teachers and schools do matter for student learning, the studies conducted during the first phase set the foundations for this new field. Conducted in the early 1980s, the studies of this phase attempted to show that there were differences in the impact of particular teachers and schools on student learning outcomes. Once empirical evidence regarding the effect of education on student learning began to mount, scholars in EER further refined their research agenda, trying to understand the magnitude of school effects on student learning. By the end of this phase, mounting empirical evidence had been accrued which negated Coleman et al.'s (1966) report by showing that schools and teachers do matter for student learning.

After having established the role of teachers and schools in student learning, the next step was to understand what contributes to this learning. Therefore, the studies of the second phase of EER (late 1980s and early 1990s) largely aimed at identifying factors that can help explain differences in educational effectiveness. As a result of the studies undertaken during this phase, lists of correlates associated with student achievement were generated, often leading to models of educational effectiveness, such as Edmonds' (1979) five-factor model. Despite the criticism that this and other models received on methodological and conceptual grounds, the models proposed during this phase emphasized the importance of developing more sound theoretical foundations for EER, an endeavor undertaken during the next phase.

The studies of the next phase, conducted during the late 1990s and early 2000s, sought to model educational effectiveness by proposing and empirically testing theoretical models that aimed to explain why factors situated at different levels (e.g., student, classroom, school, and system levels) contribute to student outcomes. In general, three different perspectives can be identified during this phase. The first perspective emanates from economists (e.g., Hanushek, 1986; Monk, 1992). These scholars were generally concerned with producing mathematical functions – in studies which came to be known as education production function studies – which link resource inputs with educational outcomes, after controlling for various background features. The underlying assumption of these studies was that increased inputs will lead to increments in learning outcomes. Largely influenced by sociologists, the second perspective focused on how factors related to students' educational and family background (e.g., SES, ethnicity, social capital) contribute to student outcomes. Because of its sociological underpinnings, in addition to examining the contribution of these factors, studies in this realm also investigated the extent to which schools are successful in decreasing the gap between different student populations, thus promoting a dual agenda: focusing not only on the quality of education, but also on school equity. The third perspective was mainly influenced by educational psychologists who focused on how certain student attributes, such as student motivation or learning aptitude, contribute to student learning. Of interest to this group of scholars was also how certain teacher behaviors can increase student learning. This led to the identification of certain teaching behaviors based on which teaching models such as the Direct Instruction model or the Active Teaching model were proposed as means to the improvement of student learning (see Rosenshine, 1983). In essence, the work associated with this last perspective contributed to a re-orientation of EER – both theoretically and empirically – toward the processes transpiring at the teaching and learning level, thus considering factors at the classroom or the teaching level as the primary contributors to student learning. More recently, there also seems to be a shift from focusing only on observable teaching behaviors to additionally considering aspects of teacher cognition and thinking as potential contributors to both teaching quality and student learning (e.g., Hamre et al., 2012; Hill et al., 2008; Shechtman, Roschelle, Haertel, & Knudsen, 2010).

Having shown that schools matter and having generated models that explain teacher and school effectiveness, during the fourth and most recent phase of EER, scholars are largely preoccupied with issues of complexity.
A gradual move from the third to the fourth phase is particularly observed after 2000, when researchers started to realize that educational effectiveness should not be seen as a stable characteristic; rather, it should be considered a dynamic attribute of teachers and schools, and one that might vary across years, different student populations, different outcomes, and even different subject matters (see, for example, Creemers & Kyriakides, 2008; Graeber, Newton, & Chambliss, 2012). Consequently, EER scholars have started to
attend to issues such as growth and change over time – which has become the major focus of this phase – as well as issues such as consistency, stability, and differential effectiveness (Thomas, Peng, & Gray, 2007). In this respect, it is not a coincidence that, during this phase, school effectiveness is coming even closer to school improvement, aiming to propose and empirically test how different theoretical models of educational effectiveness can contribute to the improved functioning of schools (Creemers & Kyriakides, 2012). Because of this emphasis on change, theoretical developments in EER during this phase have also been associated with methodological developments which have underpinned this new research agenda of EER.

In fact, closer attention to the evolution of EER suggests that the theoretical and empirical developments of EER studies have been accompanied and supported by methodological advancements, as summarized below. During the first phase of EER, major emphasis was placed on outlier studies that compared the characteristics of more and less effective schools. Because of conceptual and methodological concerns associated with these studies, during the second and third phases researchers moved to cohort designs and, more recently (third and fourth phases), to longitudinal designs involving large numbers of schools and students. Additionally, during the last two phases emphasis has also been given to searching for predictors that have indirect effects, in addition to the direct effects examined in the previous two phases. The employment of advanced techniques, such as multilevel modeling to account for the nested nature of educational data and the development of contextual value-added models that controlled for student prior attainment and background characteristics, as well as for contextual measures of class or school composition (Harker & Tymms, 2004; Opdenakker & Van Damme, 2001; Sammons, Thomas, & Mortimore, 1997), also contributed significantly to the development of EER studies. The recent development of multilevel structural equation modeling approaches (see Hox & Roberts, 2011) is also envisioned to advance work in EER, by enabling researchers to search for indirect effects and examine the validity of recent EER models, which are multilevel in nature. Finally, during this last phase of EER, emphasis has been placed on longitudinal models with at least three points of measurement, attempting to examine how changes in the functioning of the effectiveness factors under consideration are associated with changes in educational effectiveness. By employing such longitudinal approaches, scholars can also investigate reciprocal relationships between different factors, which are advanced by more recent EER theoretical developments.
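As a minimal illustration of the multilevel logic referred to above, the following sketch simulates students nested in schools and estimates how much of the achievement variance lies between schools (the intraclass correlation). The variance components and sample sizes are invented; a real EER analysis would add predictors at the student, classroom, and school levels and use multilevel modeling software rather than this simple balanced-design estimator.

```python
# Minimal sketch: variance decomposition for students nested in schools.
# All quantities are simulated; the 'true' between-school variance share is 0.15.
import numpy as np

rng = np.random.default_rng(3)

n_schools, n_per_school = 100, 30
school_effect = rng.normal(0.0, np.sqrt(0.15), n_schools)            # between-school part
student_effect = rng.normal(0.0, np.sqrt(0.85), (n_schools, n_per_school))
achievement = school_effect[:, None] + student_effect                 # total variance ~ 1.0

# One-way ANOVA estimator of the variance components (balanced design).
grand_mean = achievement.mean()
school_means = achievement.mean(axis=1)
ms_between = n_per_school * ((school_means - grand_mean) ** 2).sum() / (n_schools - 1)
ms_within = ((achievement - school_means[:, None]) ** 2).sum() / (n_schools * (n_per_school - 1))
var_between = (ms_between - ms_within) / n_per_school
icc = var_between / (var_between + ms_within)

print(f"estimated intraclass correlation: {icc:.2f}")   # should be close to 0.15
```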
A Retrospective and a Prospective View of the Mutual Interaction of ICS and EER

So far, we have considered ICS and EER as two parallel and unrelated strands. In reality, however, there has been much overlap and interrelatedness between these two strands. In this section, we first employ a retrospective view and examine how each strand has contributed to the other. We then adopt a prospective view and discuss how we envision that, in the future, these two domains can contribute to further advancing the work carried out in each of them.
A Retrospective View of the Mutual Interaction of ICS and EER

Although having a similar agenda – to understand what contributes to student achievement – in the past 50 years ICS and EER seem to have evolved as two distinct and rather unrelated domains; despite this, each area has contributed to the other in significant ways. In what follows, we first reflect on the contribution of ICS to EER and then zoom in on the ways in which EER has informed ICS.

The first way in which ICS contributed to EER pertains to highlighting the importance of, and mobilizing the resources for, conducting educational effectiveness studies. In particular, the ultimate goal of ICS seems to have been to raise awareness of the importance of education and its effects. Admittedly, because of media pressure, oftentimes the results of ICS have been misinterpreted and misused in simplistic ways to either rank order countries based on student performance or to even transplant ideas from systems that “proved to be working” to less effective systems, without any detailed acknowledgement of the possible context specificity of the apparently “effective” policies in the original societies utilizing them (Creemers, 2006). For example, Reynolds (2006) attributed the British enthusiasm for whole class direct instruction at Key Stage 2 in British primary schools to a simplistic association between the educational practices of the Pacific Rim and their high levels of achievement in international studies. Without underestimating these side effects arising from ICS, information yielded from such studies has been employed more constructively to inform policy makers, curriculum specialists, and researchers by functioning as a mirror through which each participating country can start better grasping its educational system. Schmidt and Valverde (1995) speak to this idea:

By looking at the educational systems of the world, we challenge our own conceptions, gain new and objective insights into education in our own country, and are thus empowered with a fresh vision with which to formulate effective educational policy and new tools to monitor the effects of these new policies (p. 7).

This opportunity to look closer at the educational systems of different countries has motivated policy makers to fund research projects aiming to understand the nature of educational effectiveness and identify ways of improving school effectiveness. Consequently, a number of EER projects were initiated in various countries (e.g., Baumert et al., 2010).

Another way in which EER has benefited from ICS pertains to the national, and in some respects limiting, character of most EER studies (Mortimore, 2001). Because of their international character, the data emerging from ICS have much larger variance in the functioning of possible predictors of student outcomes, given that dissimilarities are more likely to occur across rather than within countries. This large variation, in turn, increases statistical power and gives the opportunity to scholars working in
EER and capitalizing on data from ICS not only to figure out whether variables at the teacher and school level predict student outcomes, but to even examine if these effects are curvilinear rather than linear. This is important as some theoretical EER models (e.g., the Dynamic Model of Educational Effectiveness, Creemers & Kyriakides, 2008) assume that some factors may have such effects; yet, this assumption cannot be adequately tested through national effectiveness studies (Creemers et al., 2010). By affording scholars in the field of EER rich international datasets, ICS also enabled them to conduct secondary analyses (e.g., Kyriakides & Charalambous, 2005; Maslowski, Scheerens, & Luyten, 2007) which can facilitate the theoretical and methodological development of the field. Additionally, conducting across- and within-country analyses supported investigating whether specific factors at the classroom and/or the school level are generic while others are more country-specific. By also including information on several contextual factors, ICS provided a platform for examining why some factors operate mainly in specific contexts while others “travel” across countries (Creemers et al., 2010).

Turning to the contribution of EER to ICS, perhaps the main way in which the former has been supportive of the latter pertains to providing theoretical constructs to inform the design of ICS. This influence is more palpable in the most recent ICS (e.g., PISA 2009 and PISA 2012), which have capitalized on theoretical models from EER to develop both the theoretical frameworks undergirding those studies and associated measurement tools. In fact, in more recent ICS emphasis is also placed on process variables at the classroom and school level which are drawn from effectiveness factors included in EER models. For example, instead of examining the school climate factor at a more general level, recent PISA studies concentrate on specific school learning environment factors shown to be associated with student achievement, as suggested by meta-analyses conducted in the field of EER (Kyriakides, Creemers, Antoniou, & Demetriou, 2010; Scheerens, 2013). Also, instead of examining teaching behaviors more generally or the content taught, recently ICS have started investigating the impact of specific teaching behaviors, be they generic or content specific. This broadening of their scope also aligns with recent meta-analyses of EER pointing to the importance of exploring both types of practices as potential contributors to student learning (Kyriakides, Christoforou, & Charalambous, 2013; Seidel & Shavelson, 2007).

The second contribution of EER pertains to its impact on the methodological design of ICS and the analysis of the data yielded from such studies. Multilevel analysis has been a prominent approach in EER, especially during the last two decades, because it takes into consideration the nested character of educational data. The systematic use of multilevel analysis in EER has also been fueled by the type of research questions addressed in this field. At the beginning of this century, the employment of multilevel techniques to analyze ICS data was stressed (e.g., Kyriakides & Charalambous, 2005), and such approaches are gradually making their way into technical reports of ICS (e.g., OECD, 2005, 2009) as well as into secondary analyses of ICS (e.g., Kyriakides et al., 2010; Kyriakides et al., 2013; Scheerens, 2013). EER scholars have also noted the limitations of using cohort data from ICS to identify causal relationships between effectiveness factors and student learning outcomes (Creemers et al., 2010; Gustafsson, 2010). Recognizing these limitations, these scholars have pointed to the importance of identifying trends in the functioning of different factors across years. Identifying such trends will enable both the design of policy reforms to account for any undesirable changes in teaching quality and/or student learning that occur over time and the evaluation of interventions designed toward improving teaching and learning. In this respect, the importance of maintaining focus on certain theoretical constructs and using identical items/scales across different ICS cycles is stressed.

The examination of the contribution of ICS to EER and vice versa as outlined above is by no means comprehensive; rather, it indicates how in the past 50 years each domain has informed the design and contributed to the evolution of the other domain. Taking for granted that this mutual interaction between these domains has been beneficial for both and will thus continue in the future, in what follows we contemplate ways in which each domain can further advance the work undertaken in the other.
A Prospective View of the Mutual Interaction of ICS and EER

As explained in the previous section, during the fourth phase of EER more emphasis has been placed on the dynamic nature of educational effectiveness. In this context, EER researchers call for a reconceptualization of how the functioning of effective schools should be perceived. Specifically, instead of merely focusing on what happens at a specific point in time (i.e., a static approach), it is recommended that attention also be paid to the actions taken by teachers, schools, and other stakeholders in dealing with the challenges that seem to impinge on learning. For example, in the PISA studies of 2009 and 2012, some items ask headteachers to indicate whether student learning is hindered by student or teacher absenteeism. Apart from collecting such static data, at least equal emphasis needs to be given to the actions that the educational systems and their constituent components (schools, teachers, parents) take in order to deal with this challenge (e.g., through reducing teacher/student absenteeism, encouraging parental involvement, etc.). A similar argument applies to items in IEA studies: instead of simply tapping into students' and teachers' expectations, actions taken to raise expectations should also be investigated.

A second possible way of advancing ICS by drawing on lessons learned in the context of EER pertains to becoming more critical about the sources of data used in measuring different constructs. For example, information on school-level factors (e.g., the school learning environment) is typically gathered through questionnaires administered to the school headteacher. However, it has been found (Freiberg, 1999; Walberg, 1979) that student, teacher, and headteacher conceptions of school-level factors are independent of each other, whereas within-group perceptions (e.g., student or teacher perceptions alone) turn out to be quite consistent. These findings pose a challenging question: Can the data drawn from a questionnaire administered
to a single person (i.e., the headteacher) yield accurate and valid information regarding school-level factors? Because the answer is rather in the negative, we argue that information on school-level factors should also be collected at least from the teaching staff, an approach that will not only allow testing the generalizability of the data but also reveal any potential biases. For example, in schools with a poor learning environment, teachers might be more inclined to portray the real situation compared to the headteacher, who may feel more responsible for the poor functioning of the school. A similar challenge pertains to data collected on teaching, where most information is typically collected through surveys. Although more economically efficient, surveys are typically less accurate in depicting what goes on in the classroom compared to classroom observations or log books (Kane & Staiger, 2012; Pianta & Hamre, 2009; Rowan & Correnti, 2009). The reader is reminded that, in earlier cycles of IEA studies, classroom observations were used in parallel with other data-collection approaches, something that suggests that similar approaches can also be employed in the future. The financial and other logistic difficulties inherent in conducting observations (e.g., training observers and monitoring their ability to consistently and adequately use observation instruments in each country) should not be underestimated; nor can we ignore other substantive issues that need to be addressed in order to determine the optimal number of lessons to be observed and the number of coders observing these lessons in order to obtain reliable estimates of the quality of teaching in different classrooms (cf. Hill, Charalambous, & Kraft, 2012). However, past experience has pointed to the multiple benefits of employing classroom observations, at least as complements to surveys. By employing classroom observations, more aspects of effective teaching practices can be measured, covering both generic and domain-specific teaching skills found to be associated with student achievement gains. This dual focus on teaching skills has actually been manifested in more recent rounds of ICS (e.g., in the most recent PISA self-report surveys), and hence seems to resonate with the expansion of the ICS agenda to also incorporate such practices.

Drawing on EER studies, ICS could also collect prior achievement data, which will enable the examination of gains in student learning and, consequently, connect these gains to student, teacher, and school-level factors. Collecting such data will not be new to ICS since, as the reader might recall, such data have also been collected in earlier cycles of IEA studies. The benefits of collecting prior achievement data in future ICS are expected to be multidimensional, not only in methodological terms but also in terms of informing policy decisions. For instance, if information on student progress rather than student final achievement is reported, the press might not continue adhering to simplistic approaches which pit one country against the other and ignore differences in student entrance competencies. This, in turn, might encourage underdeveloped countries to also participate in ICS, for the focus will no longer be on how each country performs relative to the others but, rather, on the progress (i.e., student learning) made within each country.

Recent developments in ICS can also advance the work currently undertaken in the field of EER.
Specifically, if the recommendations suggested above are considered in future ICS, richer datasets might be yielded. This, in turn, can benefit EER in significant ways, since secondary analyses employing such datasets could contribute to the testing and further development of the theoretical frameworks of EER. One main element in these frameworks concerns system-level factors, such as the national policy on teaching or the evaluation of such policies (cf. Creemers & Kyriakides, 2008). Testing these factors is an area in which ICS could be particularly useful, given that EER studies are mostly national and hence do not lend themselves to testing such factors. To successfully explore the effect of such factors, ICS that collect data on the functioning of the educational system (e.g., the OECD Teaching and Learning International Survey [TALIS] and the INES Network for the Collection and Adjudication of System-Level Descriptive Information on Educational Structures, Policies and Practices [NESLI]) should expand their agenda to also investigate national policies for teaching and the school learning environment. These data also need to be linked to student achievement data. This will enable investigating whether these system-level factors have direct and indirect effects – through school and teacher level factors – on student learning outcomes. For example, links could be established between studies such as TIMSS, NESLI, and TALIS.
Finally, we note that for many years emphasis has been given to investigating cognitive outcomes, in both ICS and EER. Because the mission of compulsory schooling has recently been conceptualized to also incorporate new learning goals (such as self-regulation and meta-cognition), ICS, and particularly PISA, have lately included measures of these types of learning outcomes in addition to the traditional cognitive outcomes. Given that the instruments developed in the context of PISA studies have proven to have satisfactory psychometric properties (OECD, 2005), future EER studies can capitalize on these instruments to examine whether the impact of different effectiveness factors is consistent across different types of learning outcomes. For example, it is an open issue whether factors typically associated with the Direct and Active teaching approach (Joyce, Weil, & Calhoun, 2000), such as structuring and application, relate only to cognitive outcomes, while constructivist-oriented factors (Vermunt & Verschaffel, 2000), such as orientation and modeling, are associated with both types of outcomes.
By adopting both a retrospective and a prospective approach, in this chapter we have suggested that the two fields – ICS and EER – not only have similar agendas, but also have commonalities in several respects, ranging from design, to analysis, to how their results can inform policy. Because both give emphasis to providing evidence-based suggestions for improving policy, we believe that in the years to come a closer collaboration between scholars in both fields can advance both domains and better serve their common agenda: to understand what contributes to student learning and, through that, develop reform policies to promote quality in education.
References
Anderson, L.W. (1987). The classroom environment study: teaching for learning. Comparative Education Review, 31(1), 69–87. Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., et al. (2010). Teachers' mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47(1), 133–180. Coleman, J.S., Campbell, E.Q., Hobson, C.F., McPartland, J., Mood, A.M., Weinfeld, F.D., et al. (1966). Equality of educational opportunity. Washington, D.C.: U.S. Government Printing Office. Creemers, B.P.M. (2006). The importance and perspectives of international studies in educational effectiveness. Educational Research and Evaluation: An International Journal on Theory and Practice, 12(6), 499–511. Creemers, B.P.M. & Kyriakides, L. (2008). The dynamics of educational effectiveness: a contribution to policy, practice and theory in contemporary schools. London and New York: Routledge. Creemers, B.P.M. & Kyriakides, L. (2012). Improving quality in education: Dynamic approaches to school improvement. London and New York: Routledge. Creemers, B.P.M., Kyriakides, L., & Sammons, P. (2010). Methodological advances in educational effectiveness research. London and New York: Routledge. de Jong, R., Westerhof, K.J., & Kruiter, J.H. (2004). Empirical evidence of a comprehensive model of school effectiveness: A multilevel study in mathematics in the 1st year of junior general education in the Netherlands. School Effectiveness and School Improvement, 15(1), 3–31. Edmonds, R.R. (1979). Effective schools for the urban poor. Educational Leadership, 37(1), 15–27. Freiberg, H.J. (Ed.) (1999). School climate: Measuring, improving and sustaining healthy learning environments. London: Falmer. Graeber, A.O., Newton, K.J., & Chambliss, M.J. (2012). Crossing the borders again: Challenges in comparing quality instruction in mathematics and reading. Teachers College Record, 114(4), 1–30. Gustafsson, J.-E. (2010). Longitudinal designs. In B.P.M. Creemers, L. Kyriakides, & P. Sammons (Eds.), Methodological advances in educational effectiveness research (pp. 103–114). London, UK: Routledge. Hamre, B.K., Pianta, R.C., Burchinal, M., Field, S., LoCasale-Crouch, J., et al. (2012). A course on effective teacher-child interactions: Effects on teacher beliefs, knowledge, and observed practice. American Educational Research Journal, 49(1), 88–123. Hanushek, E.A. (1986). The economics of schooling: Production and efficiency in public schools. Journal of Economic Literature, 24, 1141–1177. Harker, R. & Tymms, P. (2004). The effects of student composition on school outcomes. School Effectiveness and School Improvement, 15(2), 177–199. Hiebert, J., Gallimore, R., Garnier, H., Givvin, K., Hollingsworth, H., Jacobs, J., et al. (2003). Teaching mathematics in seven countries: Results from the TIMSS 1999 video study. Washington: National Center for Educational Statistics. Hill, H.C., Blunk, M., Charalambous, C.Y., Lewis, J., Phelps, G.C., Sleep, L., et al. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: An exploratory study. Cognition and Instruction, 26, 430–511.
Hill, H. C., Charalambous, C. Y., & Kraft, M. (2012). When rater reliability is not enough: Teacher observation systems and a case for the G-study. Educational Researcher, 41(2), 56–64. Hox, J.J., & Roberts J.K. (Eds.) (2011). Handbook of advanced multilevel analysis. New York: Routledge. Husén, T. (Ed.). (1967). International study of achievement in mathematics: A comparison of twelve countries (Vols. 1–2). Stockholm: Almqvist & Wiksell. Jencks, C., Smith, M., Acland, H., Bane, M.J., Cohen, D., Gintis, H., Heyns, B., & Michelson, S. (1972). Inequality: A reassessment of the effects of family and schooling in America. New York: Basic Books. Joyce, B., Weil, M., & Calhoun, E. (2000). Models of teaching. Boston: Allyn & Bacon. Kane, T.J., & Staiger, D.O. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Seattle: Bill & Melinda Gates Foundation. Retrieved March 3, 2012 from http://www.metproject.org/reports.php Keitel, C. & Kilpatrick, J. (1999). The rationality and irrationality of international comparative studies. In G. Kaiser, E. Luna, & I. Huntley (Eds.). International comparisons in mathematics education (pp. 241–256). Philadelphia: Falmer Press. Kyriakides, L. (2005). Extending the comprehensive model of educational effectiveness by an empirical investigation. School Effectiveness and School Improvement, 16(2), 103– 152. Kyriakides, L. (2008). Testing the validity of the comprehensive model of educational effectiveness: a step towards the development of a dynamic model of effectiveness. School Effectiveness and School Improvement, 19(4), 429–446. Kyriakides, L. & Charalambous, C. (2005). Using educational effectiveness research to design international comparative studies: turning limitations into new perspectives. Research Papers in Education, 20(4), 391–412. Kyriakides, L., Christoforou, C., & Charalambous, C.Y. (2013). What matters for student learning outcomes: A meta-analysis of studies exploring factors of effective teaching. Teaching and Teacher Education, 36, 143–152. Kyriakides, L., Creemers, B., Antoniou, P., & Demetriou, D. (2010). A synthesis of studies searching for school factors: Implications for theory and research. British Educational Research Journal, 36(5), 807–830. Levels, M., Dronkers, J., & Kraaykamp, G. (2008). Immigrant children’s educational achievement in Western Countries: Origin, destination, and community effects on mathematical performance. American Sociological Review, 73, 835–853. Lingard, B. & Grek, S. (2008). The OECD, indicators and PISA: An exploration of events and theoretical perspectives. ESRC/ESF Research Project on Fabricating Quality in Education. Working Paper 2. Luyten, H. (2006). An empirical assessment of the absolute effect of schooling: regression discontinuity applied to TIMSS-95. Oxford Review of Education, 32(3), 397–429. Maslowski R., Scheerens J., & Luyten H. (2007). The effect of school autonomy and school internal decentralization on students’ reading literacy. School Effectiveness and School Improvement, 18(3), 303–334. Monk, D.H. (1992). Education productivity research: An update and assessment of its role in education finance reform. Educational Evaluation and Policy Analysis, 14 (4), 307– 332. Mortimore, P. (2001). Globalisation, effectiveness, and improvement. School Effectiveness and School Improvement, 12(1), 229–249. OECD (2005). PISA 2003 technical report. Paris: OECD Publications.
OECD (2009). PISA 2006 data analysis manual. Paris: OECD Publications. Olmsted, P.P. & Weikart, D.P. (Eds.) (1995). The IEA preprimary study: Early childhood care and education in 11 countries. London: Elsevier Science Inc. Opdenakker, M.-C., & Van Damme, J. (2001). Relationship between school composition and characteristics of school process and their effect on mathematics achievement. British Educational Research Journal, 27(4), 407–432. Pianta, R. & Hamre, B.K. (2009). Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38 (2), 109–119. Postlethwaite, N. (1967). School organization and student achievement: A study based on achievement in mathematics in twelve countries. Stockholm: Almqvist & Wiksell. Purves, A.C. (1987). The evolution of the IEA: A memoir. Comparative Education Review, 31(1), 10–28. Reynolds, D. (2006). World class schools: Some methodological and substantive findings and implications of the International School Effectiveness Research Project (ISERP). Educational Research and Evaluation, 12(6), 535–560. Reynolds, D., Creemers, B., Stringfield, S., Teddlie, C., & Schaffer, G. (Eds.). (2002). World class schools. London: RoutledgeFalmer. Rosenshine, B. (1983). Teaching functions in instructional programs. The Elementary School Journal, 83(4), 335–351. Rowan, B., & Correnti, R. (2009). Studying reading instruction with teacher logs: Lessons from the study of instructional improvement. Educational Researcher, 38(2), 120–131. Rutter, M., Maughan, B., Mortimore, P., Ouston, J., & Smith, A. (1979). Fifteen thousand hours: Secondary schools and their effects on children. Cambridge, MA: Harvard University Press. Sammons, P., Thomas, S., & Mortimore, P. (1997). Forging links: Effective schools and effective departments. London: Paul Chapman. Scheerens, J. (2013). School leadership effects revisited: review and meta-analysis of empirical studies. Dordrecht, the Netherlands: Springer. Schmidt, W. & Valverde, G.A. (1995) National policy and cross-national research: United States participation in the Third International and Science Study. East Lansing, MI, Michigan State University, Third International Mathematics and Science Study. Schmidt, W.H., Jorde, D., Cogan, L.S., Barrier, E., Gonzalo, I., Moser, U., et al. (1996). Characterizing pedagogical flow. Dordrecht: Kluwer Academic Publishers. Seidel, T. & Shavelson, R.J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis research. Review of Educational Research, 77, 454–499. Shechtman, N., Roschelle, J., Haertel, G., & Knudsen, J. (2010). Investigating links from teacher knowledge, to classroom practice, to student learning in the instructional system of the middle-school mathematics classroom. Cognition and Instruction, 28(3), 317–359. Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15 (2), 4–14. Stigler, J. & Hiebert, J. (1999). The teaching gap. New York: The Free Press. Tatto, M.T., Schwille, J., Senk, S.L., Ingvarson, L., Rowley, G., Peck, R., et al. (2012). Policy, practice, and readiness to teach primary and secondary mathematics in 17 countries: Findings from the IEA Teacher Education and Development Study in Mathematics (TEDS-M). Amsterdam, the Netherlands: International Association for Educational Achievement (IEA).
Teddlie, C. & Reynolds, D. (2000). The international handbook of school effectiveness research. London: Falmer Press. Thomas, S., Peng, W.J., & Gray, J. (2007). Modelling patterns of improvement over time: value added trends in English secondary school performance across ten cohorts. Oxford Review of Education, 33(3), 261–295. Vermunt, J. & Verschaffel, L. (2000). Process-oriented teaching. In R.J. Simons, J. van der Linden, & T. Duffy (Eds), New learning (pp. 209–225). Dordrecht, the Netherlands: Kluwer. Walberg, H.J. (Ed.) (1979). Educational environments and effects. Berkeley, California: McCutchan.
Rolf Strietholt
Studying Educational Inequality: Reintroducing Normative Notions
Abstract
The main argument of the paper is that studying educational inequalities is based on certain ideas about social justice that are often not sufficiently explicated. Some inequalities are irrelevant or less relevant than others when we think about educational justice. The central thesis is that the operationalization of justice, i.e. the selection of an inequality from a set of alternatives, is a normative decision. Different normative assumptions lead to different operationalizations. The metric of inequalities and the choice of an equitable distributive rule provide a conceptual framework with which to describe how inequalities are assessed in empirical studies. As researchers are obliged to be transparent about the entire research process, they should also reveal their normative accounts more explicitly and thereby empower the reader to evaluate the theoretical foundations of a study.
Introduction
Assessing the effectiveness of educational systems requires the development of criteria that define effectiveness. In this regard, not only certain outcome levels are of interest – e.g. of performance – but also the variation among individuals (Creemers & Kyriakides, 2008). This aspect is related to discussions on educational equity, which constitutes one of the main topics of educational effectiveness research (EER) and is related to certain ideas about justice. In this essay, I will discuss why and how educational equity and justice are studied within the field of EER. I argue that, although normativity is often neglected, it is still an essential part of educational research.
Studying equity and justice is a prominent research area in social science: the distribution of income, participation in policy-making, different educational trajectories and so forth are overarching research themes. Educational researchers, for instance, are interested in the effects of streaming children into different ability tracks, or of the existence of private schools, on variation in educational outcomes. Certain ideas about social justice are arguably the main motivations to study such inequalities. For example, some researchers argue that tracked school systems or the existence of private schools increase educational inequality and, thus, contradict the idea of a just society (see e.g. Coleman & Hoffer, 1987; Slavin, 1990 for further discussions). From this perspective, the educational system can be considered a social project that aims to (re-)establish (social) justice. However, in order to study the effects of certain features of the educational system on educational equity, one has to define what equity and justice mean because the terms themselves are content-free.
What Does Justice Mean in Education?
It is reasonable to believe that most people agree that educational inequalities should be minimal in a just society, but that does not imply that everyone has the same understanding of what kind of inequalities matter for educational equity and justice. Some inequalities are irrelevant or less relevant than others when we think about justice. For example, it is hardly problematic that some people like plays whilst others prefer poetry. But, to the contrary, girls having no access to education in many developing countries is an issue that is vehemently discussed. This leads to the question, “Equality of What?” (Sen, 1979). In order to evaluate educational policies, we have to select between different normative accounts. This decision means that certain inequalities are considered to be unjust. In this context, it is useful to distinguish between the metric and the distributive rule that is used to assess educational justice. Any empirical study of educational inequality is based on a certain metric and a distributive rule. For example, the United Nations aims to reduce educational inequalities at the global level because they claim that every child has the right to complete his or her primary education (UNESCO, 2003). Here, the metric is completing primary school and the distributive rule is that every child should have access to this type of education. In order to assess how a country stands in relation to educational inequality, one might use the proportion of children who are not able to complete primary education.
Metric of inequality. The metric of inequality might be considered a ‘currency’ for equity. For different reasons, it is not possible to define a single context-free and universal metric. First, this assessment differs depending on the concrete context and environment. Typically, almost all children complete their primary education in developed countries. Thus, it would be hard to defend the proposition that universal primary education is a sufficient indicator for educational justice in affluent countries. Here, it might be more appropriate to focus on another metric to assess inequalities: the proportion of young people who drop out during secondary school or enroll for tertiary education, for example. One may also shift the focus to achieved competences in mathematics or other domains. The basic idea underlying international student assessments is not that it is the time that children spend in school that matters but, rather, the competences that are actually achieved. Besides context-sensitivity, the choice of a metric for educational inequalities is related to specific research interests. For example, a focus on actually achieved competences might be more prominent in the fields of psychology and educational effectiveness research, whereas sociologists emphasize educational choices and trajectories.
The question of which inequalities are justified is arguably more complex in education than in other areas because the role of education can be studied from different perspectives (e.g. Drèze & Sen, 2002). For example, education can be regarded as intrinsically important if we think about the joy of playing an instrument or studying a foreign language. Even if it is unlikely that someone will ever play in an orchestra or use the language professionally, it might still be intrinsically satisfying to learn an
instrument or a foreign language, thus showing that a purely economic perspective is not necessarily sufficient to study inequalities. However, education can also play an instrumental role, qualifying people for a job or enabling them to live healthy lives. Furthermore, education might also be important on the individual and collective levels. On the individual level, being knowledgeable about vaccination can make oneself immune against a disease whilst, on the collective level, vaccinated persons cannot affect others. Thus, different roles of education can inform different normative judgments about what inequalities are justified and which are not. Distributive rules. How to distribute education in society? The choice of an equitable distributive rule is another decision to be made in studies on educational inequality. The normative character of this can be explained with a concrete example. Raymond Boudon (1974) introduced the distinction between primary and secondary effects in the reproduction of class differences. Some countries, for example, track students into differing-ability schools after primary school. Children from privileged backgrounds tend to take up more ambitious options at the transition from primary to secondary school. In this example, primary effects may be defined as those that are expressed via the correlation between the students’ achievement level at the end of primary school and their home background. The secondary effect is the impact of children’s backgrounds on the decision to take up the more ambitious route after controlling for prior achievement. I assume that secondary effects clash with the ideas of justice that most people have. However, the situation is more controversial for primary effects as we have to decide whether we are willing to accept the idea of meritocracy in educational contexts. Many people argue that class-specific educational trajectories are justified as long as they are due to differences in ability. Napoleon Bonaparte declared that, under his rule, careers were open to the talented (carrière ouverte aux talents). The normative account that underlies this view is that inequalities are not tolerated if they are based on nepotism but legitimate if they are based on merit. Others reject the idea that meritocracy justifies such educational inequalities because they argue, for example, that it is virtually impossible to disentangle performance and family background (e.g. Bell, 1972; Goldthorpe, 1996; Solga, 2005). In privileged families, children receive more academic support in order to fulfill admission requirements. Here, it is impossible to avoid the question as to whether primary effects justify educational inequalities or not; at least implicitly, a decision has to be made. Normative theories provide a conceptual foundation for selecting between different metrics and distributive rules. Normative theories. Different ideas about justice may not be compatible with each other. For this reason, the normative dimension is the sine qua non in studies on inequality and justice. For example, an economic egalitarian may demand that everyone should earn about the same, whereas proponents of liberalism urge individual degrees of freedom and identity. What differences do these philosophies make for social policies? Large income inequalities are not justified from the egalitarian perspective but they do not contradict the ideals of liberalism. The liberal argument is
that there is a right to property and this right justifies inequalities. The normative account that underlies social policies will lead, for example, to different views about certain forms of taxation. John Stuart Mill (1848) called progressive taxation “a mild form of robbery”. For an egalitarian, however, taxation is a legitimate way to achieve justice.
The different roles education plays in our lives make it important to arrive at a reasonable, convincing, and well-grounded definition of which particular inequalities matter. For this purpose, Robeyns (2006) compared three models of education: rights, capabilities, and human capital. The human capital theory considers education as an investment and economic production factor because educated workers are more productive and earn higher wages (Schultz, 1963). This is an entirely instrumental perspective because it values education only in so far as it contributes to economic productivity. The human rights approach describes universal moral principles every human being is entitled to: the right to live, the freedom of speech, or the right to a fair trial, for example (Morsink, 1999). Such rights are fundamental to governments and legislation. The United Nations adopted this approach in the Education For All (EFA) movement and demanded that every child has the right to complete primary education. This perspective highlights the intrinsic value of education. Finally, the Capability Approach (CA) emphasizes the distinction between capabilities (opportunities) and functionings (outcomes) (Nussbaum, 2003; Sen, 1992). It emphasizes that education can be viewed as a means or an end; in other words, the CA provides a flexible framework by which to evaluate both the intrinsic and the instrumental value of education.
The comparison reveals the importance of a sound normative foundation for studies on educational inequality, especially as normative theories may not be compatible with each other. The human capital approach suggests a cost-benefit analysis to evaluate whether an educational reform pays off in terms of economic growth rates. Expensive interventions for children at risk or programs for people with disabilities may not be approved by such an analysis if they fail to increase growth. The human rights approach would disregard the economic pay-offs as long as the intervention supplies a proper education to disadvantaged or disabled persons. While a more elaborate discussion of the different normative accounts that can underlie educational policy evaluations is beyond the scope of this paper, it is important to bear in mind that studies on inequality should not neglect discussing what educational justice means in the context of the respective study. In this regard, the fact that there is no objective truth about which inequalities are justified (or not) is not helpful, but researchers should be clear and transparent to themselves and others about the normative framework they are employing.
The Current Situation
Justice is a moral concept based on ethics and norms. If one accepts that ideas about justice are the main motivation for studying inequalities, we are in the field of normativity. Many educational effectiveness researchers are still nervous about the status
of normativity in empirical research and avoid discussing the normative foundations of their work. This might stem from the century-long debate on the question as to whether or not the social sciences can make normatively obligatory statements in politics (Werturteilsstreit, positivism dispute; see Albert, Dahrendorf, Habermas, Pilot, & Popper, 1976; von Schmoller, 1893; Weber, 1904). Opponents argue that preferring one political guideline to another cannot be justified scientifically. But, even if science might not be eligible to make normatively obligatory statements, the question of which inequalities are within the focus of a study remains.
Example 1. The consequences of circumventing clear statements about the normative foundations of a study are best described with a concrete example. Some countries stream students into different school tracks after primary school while others have a comprehensive secondary-school system. Hanushek and Woessmann (2006) used data from international comparative large-scale studies on student achievement to compare countries that track students after primary school with countries having a comprehensive secondary school system. The main finding of the study was that the achievement level does not increase in countries that employ tracking but the variation in achievement scores does. The authors concluded that tracking increases inequality in the achieved competences measured in international comparative studies. Waldinger (2007) reanalyzed similar data, focusing on the effect of streaming on the relationship between parental education and other family background measures and achievement. In this study, there was no evidence of a negative impact of tracking on equity. How can such apparently contradictory results on the impact of tracking on educational equity be interpreted? One possible explanation is that the two studies used different inequality measures. The first study examined the variation in achievement scores while the second focused on the relationship between achievement and students’ background. The answer to the question of whether educational equity is higher in countries without tracking thus depends on the operationalization of equity, i.e. the inequality measure that underlies a study. This, in turn, depends on the normative account we are willing to apply. The two studies demonstrate that the way we operationalize and measure equity matters. Even though both studies are excellent in many regards, the authors of both leave the reader in the dark about the normative account they endorse.
Example 2. The identification of a social problem is itself a normative decision. Therefore, it is essential that researchers do not conceal but, rather, reveal their normative accounts. Transparency about the moral concept that underlies a study enables policy makers and researchers to scrutinize whether they are willing to utilize the results of an empirical study or not. I would like to illustrate this with another example. As mentioned above, Boudon argued that the transitions between different educational stages are particularly important in the generation of educational inequalities. His work motivated a number of empirical studies testing his theory. One example was the German extension to TIMSS 2007 (Maaz, Baumert, Gresch, & McElvany, 2010). Germany has a comprehensive primary school system and a secondary school system with different ability tracks. The students not only took the TIMSS tests in grade 4 (about 10 years old), but the study also comprised a follow-up survey after the transition to secondary school. The study confirmed primary effects because privileged children outperformed disadvantaged children on the TIMSS test, and this achievement gap explained in part why they more often took the prestigious higher ‘Gymnasium’ track. Furthermore, the study revealed secondary effects because, even after controlling for achievement differences, privileged children took the higher track more often than disadvantaged children. How should the results from the study be interpreted? The authors of the study were quite critical of the observed secondary effects because they are not based on merit and violate their ‘sense of justice’. Primary effects, however, are not criticized because they are based on merit. I would like to challenge the authors’ evaluation of their results. One could argue that it is hardly reasonable to hold 10-year-olds responsible for which secondary school track they take, even if this decision is solely based on merit. But more importantly, primary effects in themselves lead to the question as to whether meritocracy is a suitable normative account in an educational context at the end of primary school. While it is beyond the scope of this essay to elaborate on this example, it demonstrates that it is not possible to avoid normative decisions in educational research. In this context, Ziegler and Böllert (2011) warn that avoiding the discussion of normative issues leads to an affirmation of existing conditions and practices. As such, they demand that researchers should give reasons for their normative preferences instead of making implicit and probably arbitrary decisions.
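To make the contrast between the two operationalizations in Example 1 concrete, the following sketch computes both inequality measures on the same simulated data: the dispersion of achievement scores, in the spirit of the variance-based measure used by Hanushek and Woessmann (2006), and the strength of the association between family background and achievement, in the spirit of the background gradient examined by Waldinger (2007). The data, variable names, and the choice of the standard deviation and a regression slope as summary statistics are illustrative assumptions, not a reproduction of either study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data for one hypothetical country: 2,000 students with a
# standardized family-background index and an achievement score on a
# PISA/TIMSS-like scale.
n = 2000
background = rng.normal(0, 1, n)
achievement = 500 + 35 * background + rng.normal(0, 80, n)

# Operationalization 1: inequality as dispersion of achievement scores.
dispersion = achievement.std(ddof=1)

# Operationalization 2: inequality as the social gradient, i.e. the
# regression slope of achievement on family background.
slope = np.polyfit(background, achievement, deg=1)[0]

print(f"SD of achievement:            {dispersion:6.1f} score points")
print(f"Background-achievement slope: {slope:6.1f} points per SD of background")
```

A reform could reduce one of these numbers while leaving the other unchanged, which is why two studies can reach apparently contradictory conclusions about equity without either being wrong.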
Conclusion
The definition of a social or educational problem is itself a normative decision based on value judgments, and any empirical study carries its own moral values. Researchers may or may not make them explicit. Different political philosophies like liberalism, egalitarianism, and utilitarianism – as well as approaches like human rights, human capital, and capabilities – take different perspectives on inequalities and justice. This essay has demonstrated that, within the field of EER, the relation of inequality and justice is an important but neglected topic. The future debate has to combine normative and measurement issues in order to arrive at well-founded operationalizations of educational equity. Inequalities can be studied, for instance, in terms of differences between certain groups of students (e.g. indigenous vs. immigrant students), by defining minimum requirements, or by studying the variance in certain outcomes. Researchers should explain why they prefer a certain operationalization to another. Generally, researchers are obliged to be transparent about their entire research process. As such, EER studies should also reveal their normative accounts and empower the reader to evaluate the theoretical foundations of a particular study. That does not mean that there is no objectivity in research but, rather, that the research process involves normativity. In this context, the fact-value dichotomy is misleading
because normative statements are the starting point of any study on inequality. The researcher has to make a decision about relevant and negligible inequalities. The vital point is the need to provide well-founded reasons concerning which information is considered to be relevant. If outcome levels and equity are both important goals of education, we have to explain why we select and focus on certain information to assess and study inequalities. This has to be done according to scientific standards.
References
Albert, H., Dahrendorf, R., Habermas, J., Pilot, H., & Popper, K. R. (1976). The positivist dispute in German sociology. London: Heinemann. Bell, D. (1972). Meritocracy and equality. Public Interest, 29(Fall), 29–68. Boudon, R. (1974). Education, opportunity, and social inequality: Changing prospects in Western society. New York, NY: John Wiley & Sons. Coleman, J. S. & Hoffer, T. (1987). Public and private high schools: The impact of communities. New York: Basic Books. Creemers, B. P. M. & Kyriakides, L. (2008). The dynamics of educational effectiveness. London: Routledge. Drèze, J. & Sen, A. (2002). India: Development and participation. Oxford: Oxford University Press. Goldthorpe, H. J. (1996). Problems of “meritocracy”. In R. Erikson & J. O. Jonsson (Eds.), Can education be equalized? The Swedish case in comparative perspective. Boulder, Colorado: WestView Press. Hanushek, E. A. & Woessmann, L. (2006). Does educational tracking affect performance and inequality? Differences-in-differences evidence across countries. Economic Journal, 116(510), C63–C76. Maaz, K., Baumert, J., Gresch, C., & McElvany, N. (2010). Der Übergang von der Grundschule in die weiterführende Schule – Leistungsgerechtigkeit und regionale, soziale und ethnisch-kulturelle Disparitäten. Bonn: BMBF. Mill, J. S. (1848). Principles of political economy (Vol. II, book V, ch. II, sec. 3). Morsink, J. (1999). The universal declaration of human rights: origins, drafting, and intent. Philadelphia, PA: University of Pennsylvania Press. Nussbaum, M. (2003). Women’s education: a global challenge. Signs: Journal of Women in Culture and Society, 29(2), 325–355. Robeyns, I. (2006). Three models of education: rights, capabilities and human capital. Theory and Research in Education, 4(1), 69–84. doi: 10.1177/1477878506060683 Schultz, T. W. (1963). The economic value of education. New York: Columbia University Press. Sen, A. (1979). Equality of what? The Tanner Lecture on Human Values, Stanford University. Sen, A. (1992). Inequality re-examined. Oxford: Clarendon Press. Slavin, R. E. (1990). Achievement effects of ability grouping in secondary schools: a best-evidence synthesis. Review of Educational Research, 60(3), 471–499. Solga, H. (2005). Meritokratie – die moderne Legitimation ungleicher Bildungschancen. In P. A. Berger & H. Kahlert (Eds.), Institutionalisierte Ungleichheiten. Wie das Bildungssystem Chancen blockiert (pp. 19–38). Weinheim: Juventa.
UNESCO. (2003). Gender and Education for All. The leap to equality. Paris: UNESCO Publishing. von Schmoller, G. (1893). Die Volkswirtschaft, die Volkswirtschaftslehre und ihre Methode. Frankfurt a.M.: Klostermann. Waldinger, F. (2007). Does ability tracking exacerbate the role of family background for students’ test scores? Mimeo, University of Warwick. Weber, M. (1904). Die „Objektivität“ sozialwissenschaftlicher und sozialpolitischer Erkenntnis. Archiv für Sozialwissenschaft und Sozialpolitik, 19(1), 22–87. Ziegler, H. & Böllert, K. (2011). Gerechtigkeit und Soziale Arbeit – Einige Anmerkungen zur Debatte um Normativität. Soziale Passagen, 3(2), 165–174.
Eugenio J. Gonzalez
Calculating Standard Errors of Sample Statistics when Using International Large-Scale Assessment Data
Abstract
International large-scale assessments use complex sampling and assessment designs to deliver the assessment to the population of interest. They administer relatively large amounts of material to a sample from the target population, minimizing individual burden and the number of people needing to be assessed while achieving results that can be used with a certain level of confidence for planning and decision making. When using these methods, the resulting information has some uncertainty that needs to be accounted for, and this uncertainty is expressed in the form of standard errors. This paper presents an overview of the procedures and formulas used in the calculation of standard errors. The goal is to provide a didactic guide to which formulas to use and when, rather than provide mathematical explanations and derivations of the formulas.
The use of international large-scale assessment (ILSA) data has become ubiquitous among educational researchers and policy makers. Results from these large-scale assessments are often cited in different contexts to refer to differences between groups, provide indicators of the proficiency of students, make distinctions between contextual conditions, and so on. But while we might treat these results as valid indicators of underlying processes, any estimate of means, distributional characteristics and correlations reported based on ILSAs is associated with a level of uncertainty. No ILSA tests every person in the population, and no test produces measures that are without error. If we drew a different sample, or if we tested the same sample on a different day, or with another version of a test measuring the same skill, we would see slight variations in the results. The expected variations of the results are captured by statistical quantities called standard errors, which quantify this sampling and measurement uncertainty. ILSA methods and procedures are rooted in the survey tradition, which relies heavily on sampling principles and methods to collect the data in an efficient manner. As a result, findings from ILSA need to be interpreted in the context of the uncertainty that accompanies them.
Sources of Uncertainty in ILSA
Large-scale assessments, in the broadest sense, are defined as surveys of knowledge, skills, or behaviours in a given domain, with the intent to describe a population of interest. They typically involve sampling (a) knowledge and skills using a comprehensive theoretical framework, (b) a relatively large number of items or tasks to cover the domain, and (c) relatively large samples of representatives of the population of interest. Results tend to be reported and aggregated at the group level, which is why the term “group score” is sometimes added to the description of ILSAs.
There are several sources of uncertainty that need to be considered when interpreting results from ILSA. These include the translation of the instruments, administrative conditions, the scoring of the items, data entry and processing, instrumentation, and so on. These sources of uncertainty are generally controlled by standardizing conditions across and within participating countries and building quality control procedures to ensure the equivalence of these conditions – as well as procedures that ensure the comparability of the translated test instruments – across the different cultures and contexts. In addition, there are two other sources of uncertainty that are controlled, not by standardizing, but by the equivalent of statistical manipulation and control. These sources are related to the selection of the respondents participating in the assessment and the selection of the items that are included in the assessment.
To avoid overburdening those surveyed, large-scale assessments often make use of multiple matrix sampling designs (von Davier et al., 2009), which allow for surveying smaller subsets of participants with fewer items. This works by selecting a subset of the population to participate in the survey, with each item administered to a subset of those selected. Or, viewed in a different way, any one person selected is presented with a subset of all the items, and his or her estimated ability is based on this subset of items. The assignment of items to people is not completely random; it follows carefully considered designs that ensure sufficient overlap of the items across the population. This approach, as would any other measurement scheme, results in uncertainty about the estimates.
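A minimal sketch of the idea behind multiple matrix sampling is given below: the item pool is split into blocks, blocks are rotated into booklets so that every pair of blocks appears together in some booklet, and each sampled student receives only one booklet. The block sizes and booklet structure are simplified illustrations, not the design of any particular study, where rotation schemes are considerably more elaborate.

```python
from itertools import combinations

# Split a pool of 28 items into 7 blocks of 4 items each.
items = [f"item{i:02d}" for i in range(1, 29)]
blocks = {f"B{b+1}": items[b*4:(b+1)*4] for b in range(7)}

# Assemble booklets so that every pair of blocks appears in exactly one
# booklet (a simple pairwise rotation); each student sees one booklet,
# i.e. 8 of the 28 items.
booklets = [list(pair) for pair in combinations(blocks, 2)]
print(f"{len(booklets)} booklets, each with "
      f"{sum(len(blocks[b]) for b in booklets[0])} items")

# Assign booklets to a sample of students in rotation.
students = [f"student{s:04d}" for s in range(1, 211)]
assignment = {s: booklets[i % len(booklets)] for i, s in enumerate(students)}
print(assignment["student0001"])
```

Because any one student answers only a fraction of the item pool, the resulting ability estimates carry the item-selection uncertainty discussed later under measurement error.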
Standard Errors
The concept of standard error is embedded in sampling theory. The theory basically says that when we select multiple random subsets or samples from a population, a statistic calculated based on each of these collections of observations will be approximately normally distributed. The expected value of this statistic is the value it would take for the whole population if we were able to observe this variable and calculate the statistic using all observations from the entire population. However, if we just sample from the population, rather than using the whole population, we will observe some variability in the statistics calculated based on different samples. This variability about the population average is what we call the standard error.
The larger the samples drawn, the less variability we would expect. The more homogeneous the population, the less variability we would expect. This is reflected in the formula for the standard error of the mean $\bar{x}$, or $SE_{\bar{x}}$, which, in the case that a simple random sample from the population is drawn, can be calculated as:

$$SE_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$$
Notice in this formula that the larger the standard deviation ($\sigma_x$), the larger the uncertainty or standard error, and the larger the sample ($n$), the smaller the expected uncertainty or error. The formula above assumes the sample is drawn from the population using simple random sampling procedures. But what is the use of this $SE_{\bar{x}}$? Assuming that the means $\bar{x}$ calculated from the multiple samples are approximately normally distributed around the population average, we can talk about the distribution of the $\bar{x}$ around that parameter. In doing this, we make the assumption that the mean of the means of the samples has a distribution with expected value equal to the population mean, and the standard deviation of this distribution of sample means is the standard error of the mean. For example, if the mean math score of the population is 50, with standard deviation of 10, and we select multiple simple random samples of 25 observations, the standard error will be $SE_{\bar{x}} = \frac{10}{\sqrt{25}} = 2$. Using what we know of the area under
the normal distribution, we can say that if we were to draw multiple samples of 25 observations from this population, 95% of our samples will have means between (50−1.96∗2) and (50+1.96∗2). But we simply do not know this with certainty because we only get to select one sample. The calculation and interpretation of the standard error becomes a bit more complicated if we want to compare statistics computed using two different samples. This is the case when we want to compare the average performance of two countries participating in an ILSA, or when we want to compare the average performance of a single country across two different cycles of the assessment. The inference we want to make is whether these two samples belong to populations that are the same in the parameter of interest. Because we know there is some uncertainty surrounding each of the two estimates, we combine these uncertainties when establishing whether the samples were likely drawn from populations with the same value of the parameter. In the case of the means computed from samples drawn independently from each other, as is the case of samples from different countries, or the same country in different years, we simply combine the uncertainties around each of the means and treat this as the uncertainty about the differences between these means. For that purpose, we use the following formula:
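The reasoning in this section can be checked with a small simulation, sketched below under simple random sampling: drawing many samples of n = 25 from a population with mean 50 and standard deviation 10, the standard deviation of the sample means comes out close to the theoretical standard error of 2, and roughly 95% of the sample means fall within 1.96 standard errors of the population mean. The population values follow the worked example in the text; the simulation itself is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
pop_mean, pop_sd, n, n_samples = 50, 10, 25, 10_000

# Draw many simple random samples and record each sample mean.
sample_means = rng.normal(pop_mean, pop_sd, size=(n_samples, n)).mean(axis=1)

theoretical_se = pop_sd / np.sqrt(n)        # sigma_x / sqrt(n) = 2
empirical_se = sample_means.std(ddof=1)     # SD of the simulated sample means

coverage = np.mean(np.abs(sample_means - pop_mean) <= 1.96 * theoretical_se)

print(f"theoretical SE = {theoretical_se:.2f}, empirical SE = {empirical_se:.2f}")
print(f"share of sample means within ±1.96 SE: {coverage:.3f}")
```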
$$SE_{(\bar{x}-\bar{y})} = \sqrt{SE_{\bar{x}}^2 + SE_{\bar{y}}^2}$$
Standard errors are then used to assess the relative magnitude of a statistic, or of the difference between two statistics. We achieve this by converting the observed statistic, or the difference between statistics, to standard deviation units: we divide the observed statistic by its standard error and compare this ratio with values based on theoretical distributions. The conversion to standard deviation units is done because these statistics are on an arbitrary scale with no particular origin and would not be very informative otherwise.
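As a sketch of how this works for two independent estimates, the snippet below divides the difference between two country means by the standard error of the difference and compares the resulting ratio with the usual 1.96 threshold for a two-sided test at the 5% level. The numbers are invented for illustration.

```python
import math

# Hypothetical country means and their standard errors (each already
# combining sampling and measurement error, as described later).
mean_a, se_a = 512.3, 2.8
mean_b, se_b = 505.1, 3.1

# Standard error of the difference for independent samples.
se_diff = math.sqrt(se_a**2 + se_b**2)

# Convert the difference to standard deviation units and compare with 1.96.
ratio = (mean_a - mean_b) / se_diff
significant = abs(ratio) > 1.96

print(f"difference = {mean_a - mean_b:.1f}, SE(diff) = {se_diff:.2f}, "
      f"ratio = {ratio:.2f}, significant at 5%: {significant}")
```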
Calculating Standard Errors in ILSA
In ILSAs, there are generally two measurable or quantifiable sources of error used to calculate the standard errors. There is a source of error related to the uncertainty about the particular sample that was achieved, and there is one related to the uncertainty about the particular subset of items selected for the assessment and ultimately administered to the individual. In addition, some ILSAs add a linking component to the standard error, which is related to the error from linking the assessment from one year to the next. But this source of error and its calculation are beyond the scope of this paper. Details on its calculation and use can be found in Hsieh, Xu and von Davier (2009).
Sampling Error
Because of the complex sampling design used in ILSAs, standard errors cannot be calculated as in the equation above. One customary way of assessing sampling error in ILSA is through replication methods (e.g., Efron, 1982). Other estimation methods are available, such as those based on the sample design or Taylor expansion, but they are seldom used in ILSA (Binder, 1983; Rust, 1985). Replication methods use differences between the replicates and the entire sample to evaluate variability. There are several replication methods that are used in ILSAs, but the main principle across all of them is to resample from the existing sample and compare the statistics obtained. Operationally, this is achieved by systematically reducing or eliminating the contribution of parts of the entire sample to compute a replicate of the statistic, comparing this replicate statistic with the statistic computed using the entire sample, summarizing these differences, and multiplying them by a factor.
The principle behind using these replication methods is as follows: If you systematically take out portions of your sample, compute the statistic, compare it with the statistic computed using the full sample, and systematically find there are no differences or only small ones, you then assume your sample is relatively homogenous and drawn from a relatively homogenous population. Therefore, more draws from the same population are expected to be similar to the one you obtained. So you have a small sampling error, or sampling uncertainty.
However, if you systematically take out portions of your sample, compute the statistic, compare it with the statistic computed using the entire sample, and find relatively large differences, then you assume your sample is quite diverse and drawn from a diverse population. The assumption is that further draws from the same population will likely yield somewhat different samples. So you have a relatively higher sampling error, or uncertainty. Depending on the systematic procedure for selecting the replicate sample, the population affecting the replication results could be a stratum or the entire population of interest.
Currently in ILSA, there are three replication methods commonly used: Jackknife 1 (JK1), Jackknife 2 (JK2) and Fay’s Balanced Repeated Replicate variant (FAY) (Fay, 1989). In brief, JK1 is used for unstratified sample designs, where the replicates are formed by systematically deleting observations from the entire sample and adjusting the contribution of the remaining observations in the sample. JK2 is used for stratified designs consisting of clusters, where the replicate samples are formed by pairing the clusters and then systematically dropping one cluster of each pair from the sample to form the replicates, while the other cluster member of the pair has its contribution adjusted. FAY is also used for stratified designs consisting of pairs of clusters, but instead of completely dropping a cluster to form the replicates, the weights for the clusters within each pair are systematically adjusted by a factor (FAY factor). The adjustment is done systematically using a Hadamard matrix. A Hadamard matrix, named after the French mathematician Jacques Hadamard, is a square matrix whose entries are either +1 or −1 and whose rows are mutually orthogonal. Fay’s approach has some advantages over the other replication approaches when computing estimates for small subgroups of the population. All these replication approaches take into account dependencies and the intercorrelation of the sampled observations within the strata.
Because of the specifics of the population structure and sampling procedures, JK2 and FAY are more often used in school-based samples, where clusters (schools) are paired and the replicates are formed according to this pairing. JK1 is more often used in samples that select from the general population, such as adult literacy and numeracy studies. For more information on the implementation and the mechanics of each of these replication procedures, please refer to Wolter (1985). In addition, publicly available software for variance estimation, such as WesVar, has a useful user’s guide in which computational examples are provided for each of the replication methods described above (WesVar, 2007). Annex B of this paper lists the replication procedure used in the largest international studies currently available. Specifics about the implementation of the sampling design and the calculation of the replicate weights can be found in the corresponding technical report of each study.
The replicate samples are achieved by creating replicate weights according to a set procedure and using these in the analysis to compute the replicate statistics. Each replicate weight is obtained by multiplying the full sampling weight by a factor according to the replication design. For example, to “drop” an observation within the sample, you multiply its weight by zero to create a replicate weight. To double the contribution of an observation, you multiply its weight by a factor of 2.
When creating replicate weights using the FAY approach, you multiply the sample weight of the corresponding observation by a factor k (where 0 < k < 1), or by 2 − k, according to the specific Hadamard matrix used to create the replicate weights. Once the set of R replicate weights is created, we use these to compute the replicate statistics and we summarize the results using the following formula to calculate the sampling error:

$$\text{Sampling\_SE}_{\theta} = \sqrt{f \cdot \sum_{r=1}^{R} \left(\theta_r - \theta_0\right)^2}$$

where $\theta_0$ is the statistic computed using the full sample, $\theta_r$ is the same statistic computed using replicate sample $r$, and

$f = \frac{R-1}{R}$ in the case of JK1,
$f = 1.0$ in the case of JK2,
$f = \frac{1}{R \cdot (1 - \text{Fay factor})^2}$ in the case of FAY.
To avoid redundancy in the presentation of the nomenclature used in the formulas, and to avoid providing specific information about any of the main ILSAs within the body of the paper, we include the nomenclature and the values used in the formulas (mainly the number and method of replication, as well as the number of plausible values) in Annex A and Annex B of this paper. Notice that, when we use this formula, we basically accumulate the squared differences between the statistic computed using the full sample and the same statistic computed using each replicate sample, multiply this sum by a factor based on the replication procedure used, and take the square root of the result.
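The following sketch illustrates the mechanics for a JK2-type design: schools are paired, each replicate weight zeroes out one school of a pair and doubles its partner, the statistic of interest is recomputed with every replicate weight, and the squared deviations from the full-sample statistic are summed (with f = 1.0 for JK2). The data, the pairing, and the small number of replicates are illustrative assumptions; operational analyses rely on the replicate structure supplied with each study's database.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy sample: 8 schools (4 pairs) with 10 students each.
n_pairs, per_school = 4, 10
school = np.repeat(np.arange(2 * n_pairs), per_school)
weight = np.full(school.size, 12.5)                  # full sampling weights
score = rng.normal(500 + 10 * (school % 3), 60)      # achievement scores

def weighted_mean(w):
    return np.sum(w * score) / np.sum(w)

theta_0 = weighted_mean(weight)

# One replicate per pair: drop one member (weight * 0), double the other.
replicate_thetas = []
for pair in range(n_pairs):
    w_rep = weight.copy()
    w_rep[school == 2 * pair] = 0.0                  # dropped school
    w_rep[school == 2 * pair + 1] *= 2.0             # partner doubled
    replicate_thetas.append(weighted_mean(w_rep))

f = 1.0                                              # JK2 factor
sampling_se = np.sqrt(f * np.sum((np.array(replicate_thetas) - theta_0) ** 2))
print(f"full-sample mean = {theta_0:.1f}, sampling SE = {sampling_se:.2f}")
```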
Measurement Error
A second portion of the error of the statistics reported in ILSAs stems from the measurement model. While there is uncertainty about all the answers provided by the respondents, the main outcome measure of ILSAs is the proficiency estimate resulting from the assessment portions of the instruments. Sophisticated and complex procedures have been developed to assign scores, or proficiency estimates, in the form of plausible values (von Davier, Sinharay, Oranje, & Beaton, 2007). Plausible values are random draws from the estimated ability distribution of students with the given item response patterns and background characteristics. In this sense, participants in ILSA receive not one but several plausible values. The variability of these plausible values represents the uncertainty of the measurement. Usual practice has been to assign five plausible values to each participant, but the availability of fast modern-day computing power and the need for better precision in estimating the measurement error have recently led to assigning up to 10 plausible values to each participant. Because we do not have one but several outcomes, we can compute the statistics of interest (say, the country mean) with each plausible value and examine the
difference between the five or 10 estimates. The standard error of measurement is calculated from the variance between the different outcomes obtained using each of the plausible values, multiplied by an expansion factor, as shown in the formula below:

$$\text{Measurement\_SE}_{\theta} = \sqrt{\left(1 + \frac{1}{P}\right) \cdot \frac{\sum_{p=1}^{P} \left(\theta_{0,p} - \bar{\theta}_{0}\right)^2}{P-1}}$$

where $\theta_{0,p}$ is the statistic computed using the full sample and plausible value $p$, and $\bar{\theta}_{0}$ is the average of these $P$ estimates.
It is worth noting here that plausible values should never be averaged. Each is a plausible outcome and should be treated independently. Statistics should be computed for each plausible value, and what is reported is the average of the statistics computed with each, together with the standard error consisting of the two components discussed in this and the previous sections.
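A minimal sketch of the measurement component is shown below: the statistic (here a weighted mean) is computed once per plausible value, the reported estimate is the average of those statistics, and the measurement error follows the variance-times-expansion-factor formula above. The data and the number of plausible values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 500 respondents, full sampling weights, and P = 5 plausible
# values per respondent (columns pv1..pv5).
n, P = 500, 5
weight = rng.uniform(10, 20, n)
pv = rng.normal(500, 90, (n, P)) + rng.normal(0, 15, (n, 1))

# Statistic of interest computed separately with each plausible value.
theta_p = np.array([np.average(pv[:, p], weights=weight) for p in range(P)])

theta_reported = theta_p.mean()                      # average of the P statistics
var_between = np.sum((theta_p - theta_reported) ** 2) / (P - 1)
measurement_se = np.sqrt((1 + 1 / P) * var_between)  # expansion factor (1 + 1/P)

print(f"estimates per PV: {np.round(theta_p, 1)}")
print(f"reported mean = {theta_reported:.1f}, measurement SE = {measurement_se:.2f}")
```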
Combining the Errors
We discussed two sources of error above. These need to be combined, as they can be considered independent components of the total variance of an estimate based on measures with measurement error collected using a probabilistic clustered sample. When a reported statistic does not involve plausible values, we cannot report on the component of the variance that is represented by plausible values, namely the measurement error. Therefore we simply use and report the sampling portion of the error as presented in the earlier formula:
$$SE_{\theta} = \text{Sampling\_SE}_{\theta}, \quad \text{or} \quad SE_{\theta} = \sqrt{f \cdot \sum_{r=1}^{R} \left(\theta_r - \theta_0\right)^2}$$
However, when we are reporting a statistic that is affected by measurement error, we need to combine the values of both sources of error and report the combined errors. In this case, the standard error of the statistic is calculated as:
$$SE_{\theta} = \sqrt{\text{Sampling\_SE}_{\theta}^{\,2} + \text{Measurement\_SE}_{\theta}^{\,2}}$$
But notice that we said earlier that, when plausible values are involved, we need to calculate the statistic of interest with each of the plausible values. This can become very cumbersome and time consuming, as it implies also calculating the sampling portion of the error with each of the five or 10 plausible values. Traditionally, given limited computing resources, the sampling portion of the error was calculated using only the first plausible value (the “shortcut method”) and then combined with the measurement portion of the error, resulting in the following calculation:
$$SE_{\theta} = \sqrt{f\sum_{r=1}^{R}\left(\theta_{r,1}-\theta_{0,1}\right)^{2} + \left(1+\frac{1}{P}\right)\frac{\sum_{p=1}^{P}\left(\theta_{0,p}-\bar{\theta}_{0,P}\right)^{2}}{P-1}}$$

where

$$\bar{\theta}_{0,P} = \frac{\sum_{p=1}^{P}\theta_{0,p}}{P},$$

or the average of the P results using each of the plausible values.
However, the availability of faster and more efficient computer systems now allows us to calculate the sampling error with each of the plausible values and to use the average of these as the estimate of the sampling error (the “full method”), resulting in the following calculation:

$$SE_{\theta} = \sqrt{\frac{\sum_{p=1}^{P}\left(f\sum_{r=1}^{R}\left(\theta_{r,p}-\theta_{0,p}\right)^{2}\right)}{P} + \left(1+\frac{1}{P}\right)\frac{\sum_{p=1}^{P}\left(\theta_{0,p}-\bar{\theta}_{0,P}\right)^{2}}{P-1}}$$
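The two approaches can be contrasted in a short sketch; this is not the code of any operational package, the names are assumptions, `stat_full[p]` stands for the statistic computed with plausible value p and the full weight, and `stat_repl[r, p]` for the statistic computed with replicate weight r and plausible value p.

```python
import numpy as np

def combined_se(stat_full, stat_repl, factor, method="full"):
    """Combine sampling and measurement error for a statistic based on plausible values.

    stat_full -- shape (P,):   statistic per plausible value, full sampling weight
    stat_repl -- shape (R, P): statistic per replicate weight and plausible value
    factor    -- replication factor f (study specific; see Annex B)
    method    -- "shortcut": sampling error from the first plausible value only
                 "full":     sampling error averaged over all plausible values
    """
    stat_full = np.asarray(stat_full, dtype=float)
    stat_repl = np.asarray(stat_repl, dtype=float)
    P = stat_full.size
    # Measurement (imputation) variance across the P plausible values
    meas_var = (1.0 + 1.0 / P) * stat_full.var(ddof=1)
    # Sampling variance computed separately for each plausible value
    samp_var_per_pv = factor * ((stat_repl - stat_full) ** 2).sum(axis=0)
    samp_var = samp_var_per_pv[0] if method == "shortcut" else samp_var_per_pv.mean()
    return np.sqrt(samp_var + meas_var)
```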
To illustrate the consequences of choosing one of the two approaches described above, consider a study such as the Programme for the International Assessment of Adult Competencies, which uses 10 plausible values and up to 80 replicates for the calculation of the sampling error. Using the shortcut method would require calculating any statistic involving plausible values 90 times and summarizing the results. Using the full method would require computing the same statistic 810 times and summarizing the results (80 replications for each of the 10 plausible values, plus 10 results from the plausible values using the full sample). This is not a trivial difference, even if modern computers can handle this many calculations in a relatively short amount of time. The calculations above apply to the standard error of any statistic of interest, including, but not limited to, means, percentages, proportions, regression and correlation coefficients, item parameters, and so on. They also apply to the calculation of the error of the difference between two statistics, but this requires some additional considerations, which we discuss in the next section.
Calculating the Standard Error of a Difference between two Samples

Often, we want to establish whether two statistics are statistically different from each other. The quantity of interest for making such a determination is the standard error of the difference between the two statistics: we divide the difference by this standard error and compare the result against a critical value or threshold. When computing the standard error of a difference between two statistics, we need to distinguish whether the statistics are computed from samples that were drawn independently of, or dependent on, each other. Independent samples are those that are drawn from different sampling frames. For example, the samples from different countries are drawn from different sampling frames, as are samples drawn within a country but in different years. Dependent samples are those that are drawn from the same sampling frame. Examples include samples of males and females within a country when they are selected from the same selection of schools, and, more generally, any two or more samples that are selected simultaneously using the same sampling frame. The importance of this distinction is that when samples are independently drawn, their errors are not correlated, whereas when they are not, their errors are correlated and we need to take this dependency into account. In the case of independent samples, because their errors are assumed to be independent, we can simply use the formula presented earlier for combining the errors of the two statistics:

$$SE_{(a-b)} = \sqrt{SE_{a}^{2} + SE_{b}^{2}}$$
We then use this error as the denominator for the difference and compare the result with the threshold or critical value. In this case, SE_a and SE_b are the errors of the statistics a and b, computed using the formulas described in the previous section, depending on whether the statistics involve plausible values or not. In the case of dependent samples, we need to take the dependency between the samples into account, so we should not simply combine the errors of the two statistics. Instead, we need to replicate the differences between the statistics (a − b) and use these replicates to calculate the standard error. Here we are again faced with a few choices. When the difference refers to a statistic that does not involve plausible values, we only need to calculate the sampling portion of the error of the difference using the procedure described earlier, as in:

$$SE_{(a-b)} = \sqrt{f\sum_{r=1}^{R}\left(\left(a_{r}-b_{r}\right)-\left(a_{0}-b_{0}\right)\right)^{2}}$$
But when the statistics involve the use of plausible values, we need to decide whether to use the shortcut or the full method for calculating the sampling error. When using the shortcut method, we have the following:
$$SE_{(a-b)} = \sqrt{f\sum_{r=1}^{R}\left(\left(a_{r,1}-b_{r,1}\right)-\left(a_{0,1}-b_{0,1}\right)\right)^{2} + \left(1+\frac{1}{P}\right)\frac{\sum_{p=1}^{P}\left(\left(a_{0,p}-b_{0,p}\right)-\left(\bar{a}_{0,P}-\bar{b}_{0,P}\right)\right)^{2}}{P-1}}$$

where

$$\bar{a}_{0,P}-\bar{b}_{0,P} = \frac{\sum_{p=1}^{P}\left(a_{0,p}-b_{0,p}\right)}{P},$$

or the average of the P results using each of the plausible values. When using the full method, we have the following:

$$SE_{(a-b)} = \sqrt{\frac{\sum_{p=1}^{P}\left(f\sum_{r=1}^{R}\left(\left(a_{r,p}-b_{r,p}\right)-\left(a_{0,p}-b_{0,p}\right)\right)^{2}\right)}{P} + \left(1+\frac{1}{P}\right)\frac{\sum_{p=1}^{P}\left(\left(a_{0,p}-b_{0,p}\right)-\left(\bar{a}_{0,P}-\bar{b}_{0,P}\right)\right)^{2}}{P-1}}$$
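To see why the dependency matters, the following self-contained sketch (with made-up numbers, and with plausible values left aside for brevity) replicates the difference directly and contrasts the result with what the independent-samples formula would give. The names and the degree of correlation between the replicate errors are assumptions.

```python
import numpy as np

def replication_var(stat_full, stat_repl, factor):
    """Sampling variance of one statistic from its replicate estimates."""
    return factor * ((np.asarray(stat_repl, dtype=float) - stat_full) ** 2).sum()

# Illustrative replicate estimates for two dependent samples a and b
rng = np.random.default_rng(7)
a0, b0 = 520.0, 505.0
a_r = a0 + rng.normal(0.0, 1.5, 75)
b_r = b0 + rng.normal(0.0, 1.5, 75) + 0.8 * (a_r - a0)   # correlated replicate errors

# Appropriate for dependent samples: replicate the difference itself
se_dependent = np.sqrt(replication_var(a0 - b0, a_r - b_r, factor=1.0))
# What the independent-samples formula would give (it ignores the covariance)
se_independent = np.sqrt(replication_var(a0, a_r, 1.0) + replication_var(b0, b_r, 1.0))
print(se_dependent, se_independent)   # the dependent-samples SE is smaller here because the errors are positively correlated
```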
Notice that in these last three formulas we have simply replaced the original statistic θ with (a − b), our new statistic of interest. Nowadays, most statistical software packages that compute sampling and measurement errors while taking the sampling and assessment designs into account also have the capability of conducting regression analysis. Careful coding of group membership, in combination with a regression analysis tool, allows us to conduct comparisons between group means and proportions and to calculate the corresponding errors of a difference. For example, dummy coding group membership yields regression coefficients that reflect the magnitude of the difference with respect to a reference group; effect coding yields coefficients that tell us the magnitude of the difference with respect to the overall mean; and contrast coding yields coefficients that reflect mean differences between the contrasted groups (Winer, Brown, & Michels, 1991). When using software that incorporates regression analysis, you can calculate these regression coefficients with corresponding standard errors that take into account sampling and measurement uncertainty. Two free and publicly available software applications, the IEA’s IDB Analyzer (IEA’s IDB Analyzer User Guide, 2013) and WesVar (WesVar 4.3 User’s Guide, 2007), include instructions on how to conduct this type of regression analysis using contrast coding.
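As a minimal illustration of the coding logic only (without sampling weights, replicate weights, or plausible values, which the packages just mentioned handle), dummy coding a two-group membership variable yields a regression coefficient equal to the difference in group means. The data below are made up.

```python
import numpy as np

# Illustrative scores for a reference group (0) and a focal group (1)
rng = np.random.default_rng(3)
group = np.repeat([0, 1], 200)
score = np.where(group == 0, 500.0, 512.0) + rng.normal(0.0, 30.0, group.size)

# Ordinary least squares with an intercept and a dummy for group membership
X = np.column_stack([np.ones(group.size), group])
intercept, slope = np.linalg.lstsq(X, score, rcond=None)[0]

print(slope)                                                  # regression coefficient
print(score[group == 1].mean() - score[group == 0].mean())   # identical value
```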
Calculating the Standard Error of a Difference between a Sample and a Composite

Another common situation in ILSA is comparing a statistic with a composite that is calculated using the statistic in question. This is the case when we want to compare a national statistic with a so-called international (composite) average. Here, it is important to consider how the composite statistic is computed. The most common way to compute the international average is to take the simple average across a set of countries that participated in the assessment. An alternative is to weight the contribution of each country according to its population size. In the first case, the international average refers to the average of the statistic across the countries; in the latter, it refers to the average of the statistic across the entire population. The issue to consider when comparing the statistic from a sample with that of a composite that includes the sample is one of dependency. The calculated standard error of the composite contains a portion of the error of the sample it is being compared against; this redundancy needs to be “discounted” from the calculation. The standard error of a composite calculated as the simple average of several components is calculated as:

$$SE_{\delta} = \sqrt{\frac{\sum_{c=1}^{C} SE_{\theta_{c}}^{2}}{C^{2}}}$$
The standard error of a composite calculated as the weighted average of the components is calculated as:

$$SE_{\delta} = \sqrt{\frac{\sum_{c=1}^{C} w_{c}^{2}\,SE_{\theta_{c}}^{2}}{\left(\sum_{c=1}^{C} w_{c}\right)^{2}}}$$
Notice that the difference between these two formulas is the explicit use of weights in the calculation. In the first formula we are effectively weighting every component equally (w = 1), and we can therefore eliminate these elements from the equation.
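A compact sketch of both composite formulas follows, with made-up country standard errors and weights; the function name is an assumption.

```python
import numpy as np

def composite_se(se, weights=None):
    """Standard error of an international average built from C country statistics."""
    se = np.asarray(se, dtype=float)
    if weights is None:                      # simple (unweighted) average of C components
        return np.sqrt((se ** 2).sum()) / se.size
    w = np.asarray(weights, dtype=float)     # weighted average of the components
    return np.sqrt((w ** 2 * se ** 2).sum()) / w.sum()

country_se = np.array([2.1, 1.8, 2.7, 3.0])                      # illustrative country SEs
print(composite_se(country_se))                                  # simple international average
print(composite_se(country_se, weights=[5e6, 1e6, 8e5, 3e6]))    # population-weighted average
```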
To calculate the standard error of the difference between the sample statistic and the composite statistic, we use the following formulas. In the case where the composite is calculated as the simple average of the samples, we have:

$$SE_{(\delta-a)} = \sqrt{SE_{\delta}^{2} + \left(\frac{\left(C-1\right)^{2}-1}{C^{2}}\right) SE_{a}^{2}}$$
When the composite is calculated as the weighted average of the statistics, the error of the difference between the statistic from a sample and the composite estimate is given by:

$$SE_{(\delta-a)} = \sqrt{SE_{\delta}^{2} + \left(\frac{\left(\sum_{c=1}^{C} w_{c} - w_{a}\right)^{2} - w_{a}^{2}}{\left(\sum_{c=1}^{C} w_{c}\right)^{2}}\right) SE_{a}^{2}}$$
Again, in the first of these two formulas we are effectively weighting every component equally (w=1), and we can therefore eliminate these elements from the equation.
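Continuing with made-up numbers, the comparison of one country (index 0) against a simple international average that includes it might look as follows; the critical value of 1.96 is the usual two-sided 5% threshold.

```python
import numpy as np

means = np.array([512.4, 498.0, 520.3, 505.1])   # illustrative country means
se    = np.array([2.1, 1.8, 2.7, 3.0])           # and their standard errors
C = means.size

intl_mean = means.mean()                          # simple international average
se_intl = np.sqrt((se ** 2).sum()) / C            # its standard error

# Country 0 versus the international average (which includes country 0)
diff = intl_mean - means[0]
se_diff = np.sqrt(se_intl ** 2 + ((C - 1) ** 2 - 1) / C ** 2 * se[0] ** 2)
print(diff / se_diff, abs(diff / se_diff) > 1.96)   # test statistic and 5% decision
```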
Summary

Throughout this paper we have discussed the calculation of the standard error due to sampling and to measurement, how to combine these to arrive at the standard error of a statistic, and how to compute the standard error of differences between statistics and between a statistic and a composite. Currently available software designed specifically for use with ILSA data implements these formulas. For more details on the specific implementation of these formulas in the different surveys, we suggest the reader consult the corresponding technical report for each study. We conclude by presenting annexes with the nomenclature for the formulas used throughout this paper (Annex A) and a summary table with the number of replicates, the number of plausible values, and the replication method used for each ILSA (Annex B).
Annex A: Nomenclature

The following nomenclature is used throughout this paper:

R: Number of replicates
P: Number of plausible values
C: Number of elements in a composite estimate
θ_0: Any statistic of interest (percent, mean, variance, regression coefficient, etc.) not involving plausible values, computed using the full sampling weight
θ_r: Any statistic of interest (percent, mean, variance, regression coefficient, etc.) not involving plausible values, computed using replicate weight r
θ_{0,p}: Any statistic of interest (percent, mean, variance, regression coefficient, etc.) computed using plausible value p and the full sampling weight
θ_{r,p}: Any statistic of interest (percent, mean, variance, regression coefficient, etc.) computed using plausible value p and replicate weight r
a, b, etc.: Any statistic of interest calculated for sample a, b, etc.
δ: Any statistic of interest calculated as the international average
w: Population size for a country (sum of the weights)
Annex B: Parameters for Number of Plausible Values and Replicates

Study      | Value of P | Value of R | Replication Method
TIMSS      | 5          | 75         | JK2 – Shortcut
PIRLS      | 5          | 75         | JK2 – Shortcut
PISA       | 5          | 80         | FAY – Full
ICCS       | 5          | 75         | JK2 – Full
PIAAC      | 10         | VENREPS    | VEMETHOD – Full
IALS/ALL¹  | 5/10       | 30         | JK2 – Shortcut/Full
ALL: Adult Literacy and Lifeskills Survey¹
IALS: International Adult Literacy Survey
ICCS: International Civic and Citizenship Education Survey
PIAAC: Programme for the International Assessment of Adult Competencies
PIRLS: Progress in International Reading Literacy Study
PISA: Programme for International Student Assessment
TIMSS: Trends in International Mathematics and Science Study

¹ IALS and ALL survey data were rescaled in 2013 to link the numeracy and literacy scales with PIAAC. At that time, 10 plausible values were drawn and the full method is used. The original scales in IALS and ALL have only five plausible values.
References

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279–292.
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful? IERI monograph series: Issues and methodologies in large scale assessments (Vol. 2, pp. 9–36).
von Davier, M., Sinharay, S., Oranje, A., & Beaton, A. (2007). The statistical procedures used in National Assessment of Educational Progress: Recent developments and future directions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Vol. 26. Psychometrics (pp. 1039–1055). Amsterdam: Elsevier.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. CBMS-NSF Regional Conference Series in Applied Mathematics, Monograph 38. Philadelphia: SIAM.
Fay, R. E. (1989). Theoretical application of weighting for variance calculation. Proceedings of the Section on Survey Research Methods of the American Statistical Association, 212–217.
Hsieh, C., Xu, X., & von Davier, M. (2009). Variance estimation for NAEP data using a resampling-based approach: An application of cognitive diagnostic models. In M. von Davier & D. Hastedt (Eds.), IERI monograph series: Issues and methodologies in large scale assessments (Vol. 2).
IEA’s IDB Analyzer User Guide. (2013).
Rust, K. (1985). Variance estimation for complex estimators in sample surveys. Journal of Official Statistics, 1(4), 381–397.
WesVar® 4.3 User’s Guide. (2007).
Winer, B., Brown, D., & Michels, K. (1991). Statistical principles in experimental design. New York, NY: McGraw-Hill.
Wolter, K. M. (1985). Introduction to variance estimation. New York, NY: Springer.
Agnes Stancel-Piątak and Deana Desa
Methodological Implementation of Multi Group Multilevel SEM with PIRLS 2011: Improving Reading Achievement
Abstract
This study demonstrates how to apply Multiple Group Multilevel Structural Equation Modelling (MG-MSEM) for cross-country comparisons. The methodological implementation of a multilevel model is presented together with a consideration of related topics such as weighting and scaling, model fit indices, centring, standardization of parameters, and missing values. Using PIRLS 2011 data from three selected European countries (Denmark, Germany, and France), the study analyses school factors that are important for reading achievement. The complex nature of psychological and sociological constructs is taken into account by using a latent variable approach.
Current analyses within the field of educational research often aim to explain phenomena related to learning processes. Going beyond a mere description of reality, efforts have been made to explore educational processes. Models of School Effectiveness Research (SER) describe learning outcomes as resulting from the interaction of students’ individual and family circumstances with the characteristics of their class and school. It is also recognized that this interaction depends on sociological and political conditions that are specific to the geographical location of the school (Creemers & Kyriakides, 2008). Although these developments may help to overcome previous criticisms concerning the lack of theoretical grounding of school effectiveness models (Sandoval-Hernandez, 2008), the empirical implementation remains challenging. Recent methodological developments such as Multilevel Structural Equation Modelling (MSEM) (Hox & Maas, 2004; Marsh et al., 2009) provide a framework suitable for the empirical analysis of complex models. Within this framework, individual and contextual effects can be analysed simultaneously, including differential effects for sub-groups (cross-level interactions). Furthermore, with Structural Equation Modelling (SEM), constructs can be modelled as latent variables with multiple indicators via confirmatory factor analysis (CFA)¹. The latent constructs are free of measurement error and are therefore more reliable proxies for psychological and sociological phenomena than manifest variables (Lüdtke et al., 2008). With CFA, SEM allows for more advanced hypothesis testing compared with traditional methods (e.g. regression analysis). Thereby, the measurement part and the structural part of the structural equation model are estimated in a joint model.
¹ Current software applications also allow for exploratory factor analysis (EFA) (Caro, Sandoval-Hernández, & Lüdtke, 2013).
Using fit indices, the quality of the whole model can be judged in terms of how well it represents the empirical data. Latent multilevel modelling has methodological advantages over Multilevel Modelling (MLM) with manifest variables. For example, in contrast to MLM, MSEM produces unbiased estimates of the between-level indirect effects (Preacher, Zyphur, & Zhang, 2010). Moreover, MSEM permits modelling of random means not only as dependent variables, as in MLM, but also as mediators or predictors. Nevertheless, the implementation of MSEM is rather uncommon in the field of educational research. This can be attributed to technical issues, such as the availability of appropriate software, but is also related to the lack of data that fulfil the requirements in terms of sampling design, sample sizes (individuals and clusters), or the number and quality of items. Furthermore, as some methodological features are still under development, there are no simple solutions for many of the related issues, such as significance testing of random slopes (LaHuis & Ferguson, 2009), standardization of the coefficients (Marsh et al., 2009), or the role of the sampling ratio² in the choice of aggregation technique for individual variables (latent vs. manifest aggregation) (Lüdtke et al., 2008). In the context of data from Large Scale Assessments (LSA), some issues related to weighting and scaling need further clarification (Asparouhov & Muthén, 2007). Drawing on an empirical example, this study presents a framework for the implementation of MG-MSEM (Multiple Group Multilevel Structural Equation Modelling) with LSA data. In doing so, we refer to the most recent literature, which is not always coherent since the method is still under development. After an overview of latent multilevel models and related methodological issues, an example of the empirical implementation of MG-MSEM for country comparison is presented and discussed.
Multiple Group Multilevel Structural Equation Modelling

In comparison to MLM, latent multilevel modelling (Muthén & Asparouhov, 2011) is more appropriate for clustered educational data, where students are clustered within classrooms, classrooms within schools, schools within school districts, and so forth. The basic idea of multilevel latent modelling is explained in Muthén (1994). Within the SEM approach, latent constructs are modelled in the measurement part of the model, while the relationships between them are modelled in the structural part. Two-level SEM extends the structural model into a level-one (L1, e.g. student level) and a level-two (L2, e.g. school level) model. Latent constructs can be included on each level. In the linear systems of MSEM, within-cluster dependent variables (e.g. student characteristics) are predicted from the covariance structure of the observed variables and covariates between units (e.g. students) within clusters (e.g. schools).
² The sampling ratio reflects the ratio of the sample size to the size of the population. In the case of clustered data, the sampling ratio within clusters reflects the ratio of the sample size per cluster to the cluster size (Lüdtke et al., 2008).
For the between-cluster modelling, the cluster effect and covariates are analysed simultaneously when predicting the dependent variables at the within-cluster (e.g. student) level. Muthén (1994) demonstrated that an unbiased estimate of the L1 population covariance matrix is given by the pooled L1 sample covariance matrix, S_PW, and that the unbiased and consistent estimate of the sample between-cluster covariance matrix is given by S_B = Σ_W + mΣ_B, where m is the scaling factor associated with the group size, Σ_B is the between-level model-implied covariance structure, and Σ_W is the within-level model-implied covariance structure. Using this model, for example, the achievement and background of students (L1) who are nested within schools (L2) can be examined simultaneously without underestimating significant differences arising from student and school factors. The bias due to the clustering of students within schools can be reduced, and relevant information from both students and schools can be investigated without an ecological fallacy confounding the results (Blakely & Woodward, 2000). The MSEM approach can be generalized to the case of multiple samples, such as a cross-country comparison (MG-MSEM). The covariance structure is defined for the pooled within-covariance matrix S_PW and for the scaled between-covariance matrix, S_B*. This structure is used when there are G balanced groups (e.g. countries) with equal sizes n, which gives the total sample size N = nG (Hox & Maas, 2001). For unbalanced group sizes, Full Information Maximum Likelihood (FIML) or an ad hoc solution is used to estimate the between-group model separately for each distinct group size (Hox, 2002).
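Returning to Muthén's decomposition above, an estimate of the between-level covariance structure can be read off directly; this is a sketch of the algebra implied by the equations quoted, not a formula taken from the study:

$$\hat{\Sigma}_{W} = S_{PW}, \qquad \hat{\Sigma}_{B} = \frac{S_{B} - S_{PW}}{m}.$$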
Methodological Issues and Applications

Compositional and Contextual Effects

Although the terms compositional and contextual effects both describe L2 effects on L1 outcomes, their use in the literature is inconsistent. In this study, we follow Maaz, Baumert, and Trautwein (2011), who define compositional effects as referring to characteristics of the environment that are related to the social composition of the area or social institution the individuals are in (e.g. students’ family background in schools). In educational LSA this information is usually collected at the individual level, i.e. via student or home questionnaires. Following Maaz et al. (2011), contextual effects refer, by contrast, to characteristics of the institution itself, which are often the subject of principal or teacher questionnaires in educational research: for example the institutional structure, resources, and teaching styles in a school. Although the method of analysing contextual and compositional effects might appear to be determined by these definitions, both types of variables can nevertheless be collected at either L1 or L2. For example, information about students’ family background might be collected via principals’ questionnaires, or students’ reports on their teachers might be used as aggregated teacher characteristics. In the following study, the variables for analysing compositional effects are collected at L1, and the variables for analysing contextual effects are collected at both L1 (students’ reports on instruction) and L2 (e.g. resources, school management).
Latent versus Manifest L2 Aggregation of L1 Variables

It has been stated that, in MLM, manifest L2 aggregation of L1 constructs leads to biased, and latent aggregation to unbiased, estimation of the between-level indirect effects (Preacher et al., 2010) (Appendix 3)³. Nevertheless, Lüdtke et al. (2008) argue that the choice of latent vs. manifest aggregation depends on the nature of the construct, which can be described by distinguishing reflective and formative aggregation. The purpose of L1 measures in reflective aggregation is to provide reflective indicators of an L2 construct (e.g. students’ ratings of class characteristics). The observed measure is designed to reflect the L2 construct directly; thus, variation within each cluster can be regarded as L2 measurement error. In contrast, formative aggregation is considered an index of L1 measures within each L2 group (e.g. student reports on their own characteristics). The observed L1 measure is designed to reflect an L1 construct rather than being a direct measure of an L2 construct. In this case, the variation within each class reflects individual “true scores”; thus, the measurement error becomes trivial in size and can eventually be ignored. Lüdtke et al. (2008) argue that the reliability of the group mean for reflective aggregations of L1 constructs in a group-mean centred approach depends on the intra-class correlation coefficient (ICC) and the number of observations per cluster (sampling ratio). In particular, if the ICC is low and/or the sampling ratio is small, the manifest group mean can lead to biased estimation of contextual effects, although the latent aggregation is not very precise in such cases either. Notwithstanding this, the authors recommend using latent aggregation for reflective constructs even if the sampling ratio is high. They argue that there is potentially an infinite number of L1 indicators (e.g. students that could be in a classroom), so that, even with a sampling ratio per cluster equal to one, the measurement error is not completely levelled out⁴. Then again, the authors consider cases of formative aggregation in which using manifest L2 aggregation could be more reasonable than latent aggregation. For example, if the number of observations per cluster on a gender variable is very high, the variation of the scores reflects the “true” variation of gender within the classroom rather than measurement error. Modelling such variables as latent constructs could lead to overestimation of the measurement error and underestimation of the coefficients when the sampling ratio is high.
³ To be obtained from the authors on request.
⁴ Analogous to the typical assumptions of test construction, the differences between L1 ratings are assumed to reflect measurement error, which varies as a function of the number of those ratings (or items, in test theory) and the size of the ICC (or the correlations among items, in test theory).
When the sampling ratio is low, however, the sampling error is larger and the latent approach produces less biased results, while manifest modelling tends to produce negatively biased coefficients (underestimation of the contextual effects) and underestimated standard errors (Lüdtke et al., 2008). In this study, latent L2 aggregation is applied for the reflective aggregation (students’ ratings of teachers’ characteristics) as well as for the formative constructs (students’ reports on their cultural capital), due to missing values.
Fit Indices

Testing model goodness-of-fit

Fit indices that can be used within the MSEM framework, and which are based on the total covariance structure S_PW, are the overall chi-square test (χ²), the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR) (Hu & Bentler, 1998; Steiger, 1990). The overall χ² is sensitive to the sample size and the number of parameters and is likely to support the conclusion that the estimated model does not fit the data well. Therefore, this statistic is not commonly used in applied research as the sole index of model fit within the SEM framework (Brown, 2006). As a general guideline for the other fit indices, Hu and Bentler (1998) suggest cut-offs of SRMR ≤ .08, RMSEA ≤ .06, CFI ≥ .95, and TLI ≥ .95 to support the conclusion of a reasonably good fit between the target model and the observed data. Other authors regard an RMSEA between .06 and .08 as indicating adequate model fit, and values between .90 and .95 for CFI and TLI as indicative of acceptable fit (Browne & Cudeck, 1992; MacCallum, Browne, & Sugawara, 1996). For the SRMR, an acceptable model fit is indicated by values between .08 and .10 (Hox & Bechger, 1999; Schermelleh-Engel, Moosbrugger, & Müller, 2003). This standard approach is argued to be insensitive to the model at the higher cluster level, because of the much larger sample size at the individual level, so that the fit indices are likely to be dominated by the goodness-of-fit of the within-level model (Hox, 2002). There are several ways to address this problem, such as evaluating the model at each level using Yuan and Bentler’s two-step procedure (Yuan & Bentler, 2007) or using Ryu and West’s level-specific evaluation (Ryu & West, 2009). In Mplus, the SRMR fit index for the within and the between model is reported separately, so that the residuals of the model can be evaluated at each level. In this study, we used the following criteria to evaluate the fit of the models (summarized in the sketch after this list):
• good fit: CFI ≥ .95; TLI ≥ .95; RMSEA ≤ .06; SRMR (within; between) ≤ .08;
• acceptable fit: CFI ≥ .90; TLI ≥ .90; RMSEA ≤ .08; SRMR (within; between) ≤ .10.
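Purely as a convenience for readers who script their model screening, the decision rule just listed can be written as a small helper; the function name and the returned labels are our own choices, not part of any software package.

```python
def fit_category(cfi, tli, rmsea, srmr_within, srmr_between):
    """Classify model fit according to the criteria listed above."""
    srmr = max(srmr_within, srmr_between)
    if cfi >= .95 and tli >= .95 and rmsea <= .06 and srmr <= .08:
        return "good fit"
    if cfi >= .90 and tli >= .90 and rmsea <= .08 and srmr <= .10:
        return "acceptable fit"
    return "not acceptable"

print(fit_category(cfi=.96, tli=.95, rmsea=.05, srmr_within=.04, srmr_between=.07))
```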
Model comparisons

For comparisons of nested models, chi-square difference tests and comparisons of chi-square based fit indices can be used to choose between models. Complex analysis in Mplus (TYPE = COMPLEX) uses the Satorra-Bentler algorithm (MLR)⁵, which produces maximum likelihood parameter estimates that are robust to non-normality and non-independence of observations (Muthén & Satorra, 1995). When models are estimated with MLR, the difference between the scaled chi-square values of nested models does not follow a chi-square distribution. Therefore, Satorra and Bentler (1999) have proposed a scaled chi-square difference test, which involves a scaling correction factor (c) that is given in the Mplus output by default. However, the chi-square difference test has to be calculated manually in some cases⁶ (Bryant & Satorra, 2012).
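For the cases in which the scaled difference test must be computed by hand, the commonly used computation can be sketched as follows. This is a sketch under the assumption that the scaled chi-square values (t), scaling correction factors (c), and degrees of freedom (d) are taken from the MLR output of the two nested models, with model 0 being the more restrictive one.

```python
def sb_scaled_chi2_diff(t0, c0, d0, t1, c1, d1):
    """Satorra-Bentler scaled chi-square difference test for nested models."""
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)   # scaling correction for the difference test
    trd = (t0 * c0 - t1 * c1) / cd         # scaled difference statistic
    return trd, d0 - d1                    # compare trd against chi-square with d0 - d1 df
```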
Cross-country Factor Invariance

For comparisons across groups (e.g. countries), the latent constructs should have the same meaning in all groups. The across-group measurement invariance of such constructs can be tested with the Multiple Group (MG) approach (Rutkowski & Svetina, 2013). The three levels of measurement invariance (configural, metric, and scalar) should be tested in hierarchical order: metric invariance requires configural invariance, and scalar invariance requires configural and metric invariance (Millsap, 2011). At the basic level of invariance (configural invariance), all groups have common factors and items. However, in order for the common factors to have the same meaning and the same measurement unit across groups, metric (or weak) invariance has to be established. This implies equal strength of the associations between the items and the factor for all countries, and it is the minimum level of invariance required for comparing relationships between factors and observed variables across countries. At the scalar (or strong) invariance level, the intercepts are all equal, and thus all items indicate the same differences in latent means across groups (e.g. countries). This is the minimum level of invariance required for a valid cross-country comparison of scale scores (i.e., a comparison of means) (Byrne, 2008). In the current study, the configural, metric, and scalar levels of invariance are tested for each latent construct in the models prior to the final modelling and evaluated using the fit indices described in section 3.3. The analysis is implemented in Mplus 7.11 with the CONFIGURAL METRIC SCALAR statements in the analysis command of the syntax (Appendix 2)⁷. When using this approach, the different levels of invariance are compared based on the fit indices and on the scaled chi-square difference tests with scaling correction factor (c)⁸. When comparing a more restrictive with a less restrictive model, it is recommended to favour the more restrictive model when absolute CFI and TLI changes