Assessing Writing 18 (2013) 7–24
On the relation between automated essay scoring and modern views of the writing construct

Paul Deane

Educational Testing Service, Rosedale Road, MS 11-R, Princeton, NJ 08541, United States
Article history: Available online 16 November 2012
Keywords: Automated essay scoring (AES); Writing; Assessment; Writing construct; Cognitively Based Assessments of, for, and as Learning (CBAL)
Abstract

This paper examines the construct measured by automated essay scoring (AES) systems. AES systems measure features of the text structure, linguistic structure, and conventional print form of essays; as such, the systems primarily measure text production skills. In the current state of the art, AES systems provide little direct evidence about such matters as strength of argumentation or rhetorical effectiveness. However, since there is a relationship between ease of text production and ability to mobilize cognitive resources to address rhetorical and conceptual problems, AES systems have strong correlations with overall performance and can effectively distinguish students in a position to apply a broader writing construct from those for whom text production constitutes a significant barrier to achievement. The paper begins by defining writing as a construct and then turns to the e-rater scoring engine as an example of state-of-the-art AES construct measurement. Common criticisms of AES are defined and explicated—fundamental objections to the construct measured, methods used to measure the construct, and technical inadequacies—and a direction for future research is identified through a socio-cognitive approach to AES.
1. Introduction

Automated essay scoring (AES) has been the subject of significant controversy, with strong forces encouraging its adoption in large-scale testing despite major reservations and criticisms. On the one hand, there is considerable support for AES, as documented in Shermis and Burstein (2003) and Shermis and Burstein (in press). Arguments for the use of AES include reliability and ease of scoring,
with AES scores correlating highly with human-scored holistic ratings (Burstein & Chodorow, 2003, 2010; Chodorow & Burstein, 2004; Powers, Burstein, Chodorow, Fowles, & Kukich, 2001). AES systems have been adopted as a second or check scorer in high-volume, high-stakes systems such as the Graduate Record Examination (GRE®), the Test of English as a Foreign Language (TOEFL®; cf. Haberman, 2011), and the Graduate Management Admissions Test (GMAT®), among others; AES systems are the primary essay scoring engine in various assessment and instructional products, including Accuplacer®, the Criterion® Online Writing Evaluation Service, and the Pearson Test of English®. AES techniques have also been applied in languages other than English. This includes multilingual applications of the IntelliMetric system (Elliot, 2003) and other efforts to develop AES systems in various languages, including Chinese (Chang, Lee, & Chang, 2006; Chang, Lee, & Tam, 2007), French (Lemaire & Dessus, 2001), German (Wild, Stahl, Stermsek, Penya, & Neumann, 2005), Hebrew (Ben-Simon & Cohen, 2011; Cohen, Ben-Simon, & Hovav, 2003), and Spanish (Castro-Castro et al., 2008).

On the other hand, there has been and continues to be significant and vocal opposition to AES, particularly when and where it might replace human scoring, a position elaborated by a variety of authors, as documented in Ericsson and Haswell (2006), Herrington and Stanley (2012), and Perelman (2012). A major organization for writing professionals, the Conference on College Composition and Communication (CCCC), explicitly opposes the use of AES. Its position statement on Teaching, Learning, and Assessing Writing in Digital Environments (2004) states:

    Because all writing is social, all writing should have human readers, regardless of the purpose of the writing. Assessment of writing that is scored by human readers can take time; machine-reading of placement writing gives quick, almost-instantaneous scoring and thus helps provide the kind of quick assessment that helps facilitate college orientation and registration procedures as well as exit assessments. The speed of machine-scoring is offset by a number of disadvantages. Writing-to-a-machine violates the essentially social nature of writing: we write to others for social purposes. If a student's first writing-experience at an institution is writing to a machine, for instance, this sends a message: writing at this institution is not valued as human communication—and this in turn reduces the validity of the assessment. Further, since we cannot know the criteria by which the computer scores the writing, we cannot know whether particular kinds of bias may have been built into the scoring. And finally, if high schools see themselves as preparing students for college writing, and if college writing becomes to any degree machine-scored, high schools will begin to prepare their students to write for machines. We understand that machine-scoring programs are under consideration not just for the scoring of placement tests, but for responding to student writing in writing centers and as exit tests. We oppose the use of machine-scored writing in the assessment of writing.

The articles in this special issue are contributions to a dialog about the role of automated writing analysis, a dialog whose extreme poles may be framed, on the one hand, by a position advocating unrestricted use of AES to replace human scoring, and, on the other, by a position advocating complete avoidance of automated methods.
However, it would be a mistake to focus only on polar positions. Writing is a complex skill, assessed for various audiences and purposes. AES is one instantiation of a larger universe of methods for automatic writing analysis. A more nuanced view may be necessary to assess accurately how automated essay scoring—and other forms of automated writing evaluation—fit into education and assessment. This article explores this middle space. In particular, the article considers how technological possibilities defined by automated scoring systems interact with well-articulated views of writing skill that incorporate modern research findings, whether from social or cognitive perspectives. Critical to this exploration—and thus to the argument to be explored—is the complexity of writing skill and the variety of automated methods that can be applied. Automated analysis can more appropriately be applied to some aspects of writing skill than to others. Some of the tensions inherent in the current situation may be resolved as we explicate this complexity and consider how these issues play out in particular scenarios. It is important to consider the affordances created by current deployments of automated essay evaluation technologies and to analyze how these affordances interact with user requirements and the needs of assessment; it is equally important to recognize that AES technology is not static. We must also consider how the underlying technologies, appropriately configured, could support other use cases
where there is less reason to postulate a conflict between writing viewed as a social, communicative practice and writing viewed as an automatized (and automatically scorable) skill.

2. Established practices: defining writing as a construct to be assessed

Historically, direct writing assessment in the North American context was plagued by an inability of raters to agree on common, consistently applied standards. Elliot (2005), citing, among others, Brigham (1933) and Noyes, Sale, and Stalnaker (1945), outlines some of these difficulties. Widespread adoption of direct writing assessment (White, 1985) depended upon the establishment of reliable holistic scoring methods (Godshalk, Swineford, & Coffman, 1966), which built in turn upon earlier analyses of factors affecting rater judgment (Diederich, French, & Carlton, 1961). Similar issues have also arisen in contexts where assessment must be reliable and valid not only within a single language like English, but also comparable across languages. Approaches like the Common European Framework of Reference for Languages represent an effort to establish cross-linguistically valid scales for specific elements of literate competence, including writing skill (Council of Europe, 2001; Morrow, 2004).

Approaches to direct writing assessment have been heavily influenced by evolving concepts of validity within the educational measurement community, which had initially focused on criterion validity (demonstration of correlations with external criteria) and on content validity (alignment of task content with the skill intended to be measured). Ultimately the educational measurement community settled on the importance of establishing a construct argument: a rational case connecting the design of the test to the skill one intends to measure (Cronbach, 1971; Messick, 1989; Mislevy, 2006, 2007). However, there has been a continuing tension between efforts to define writing assessment in these terms and concerns with writing considered in a humanistic context (Anson, 2008; Cheville, 2004; Cooper & Odell, 1977), where reasoning skills, writing processes, genre practices, and the cultural and social contexts in which genres develop take center stage. Humanistic concerns can be conceptualized as challenges to the validity of standardized writing assessments. They emphasize various factors, such as domain knowledge, linguistic/cultural background, disciplinary training, variation in topic, differences between conditions of testing and typical writing practices, and variability in rhetorical purposes. Each of these factors can affect perceptions of text quality (Beck & Jeffery, 2007; Murphy & Yancey, 2008). These same issues also affect other forms of performance assessment, which have led some to propose alternative writing assessment methods such as portfolios (Murphy & Camp, 1996).

As Condon (2012) establishes in his account of portfolio assessment, the desire for greater construct representation brought evaluation out of its black box in the mid-1980s as instructors shared writing criteria with their students. Seminal collections (Belanoff & Dickson, 1991; Black, Daiker, Sommers, & Stygall, 1994; Yancey & Weiser, 1997) documented various perspectives on the roles of portfolio assessment in education. Recent work by Hamp-Lyons and Condon (2000) demonstrates the significance of considering writing assessment as a process involving iteration, learning, and multiple stakeholders (Willard-Traub, 2002).
Research in electronic portfolio assessment (Cambridge, Cambridge, & Yancey, 2009) reveals both the endurance of portfolio assessment and its evolution to embrace the exigencies of digital communication (Yancey, 2012). In terms of articulating, exploring, and expanding the writing construct, the process of dynamic criteria mapping holds promise for those who wish to transform instruction and facilitate assessment (Broad et al., 2009). What is at stake in such work is the definition of the writing construct. There is a series of questions that must be answered; and the answers will determine not just the design of an assessment, but its impact upon teaching and the culture at large. For example, we must ask: • To what extent does assessment focus upon integrated writing performances rather than on specific skills thought to be necessary components of those larger performances? • If integrated performances are assessed, what aspects of writing such as content domains, genre, and rhetorical purposes matter enough to be included in an assessment?
Fig. 1. Elements of the writing assessment episode: participants, processes, and texts (from Ruth & Murphy, 1984, p. 414, reproduced in Ruth & Murphy, 1988, p. 128).
• If integrated performances are assessed, will they be samples of actual work, or written on demand specifically for assessment purposes? • If performances are written on demand, what features of real world contexts for writing matter enough to be replicated in or incorporated into the assessment? • To what extent does assessment draw inferences based not just on evidence drawn from the final written product but also from other aspects of the writing process? • What aspects of the written product are emphasized in determining a score or other information reported to stakeholders? The design of a writing assessment implies value judgments on these and many other questions, and thus implicitly defines a specific understanding of skilled writing. Such considerations have long been prominent in discussions of automated scoring, as an application of a more general approach to assessment design that emphasizes the importance of modeling the domain and using that model to build an evidentiary argument (Braun, Bennett, Frye, & Soloway, 1990; Sebrechts, Bennett, & Rock, 1991). One variant of this approach is becoming increasingly prominent: Evidence-Centered Design (ECD, Mislevy & Haertel, 2006), which emphasizes the importance of starting with a definition of the construct, and building out an explicit model of the evidence to be collected, before an assessment is designed. However, the current generation of AES systems predates the introduction of ECD, and it is possible that new approaches to AES will emerge under its influence. This question will be examined in greater detail below. But it will be useful first to focus on the state of the art in automated essay evaluation and to consider what elements are emphasized and directly measured in current-generation AES systems. Construct definitions can be implicit in features of the testing situation. Consider Fig. 1, from Ruth and Murphy (1984). This diagram enumerates major elements at play in direct, on-demand, timed
Fig. 2. Elements of a model of the writing assessment episode in automated essay scoring.
writing assessments, where writing occupies a single contiguous block of time, there is no audience except the rater, and only the final written product is to be submitted to the rater for evaluation. But note that this situation already emphasizes some aspects of writing skill and deemphasizes others. Explicit text features and content are foregrounded, while social and process elements of writing expertise are backgrounded. Such elements function primarily as influences on (and explanations for variation in) performance, and it is in that capacity that Ruth and Murphy examine them.

Current-generation AES systems largely presuppose the definition of the writing construct implicit in a standardized testing situation and embed automated scoring within that frame, as shown in Fig. 2.¹ This diagram highlights parallels between machine and human scoring, but with several added elements:

• A sampling process for model student responses.
• A model design process that builds the elements needed to implement a general scoring engine and/or a specific scoring model.
• An automated process for analyzing student responses to identify text features, assign a score, and, where required, generate feedback.
1 At a high level of abstraction, this description applies to all the systems that competed in the vendor competition recently sponsored by the Hewlett Foundation (Shermis & Hamner, in press), with one caveat. The Truscore® system developed by the MetaMetrics corporation (Lockridge, 2012) trains its models to replicate an external grade-level scale aligned to the Lexile® readability system, rather than training on human scores for individual prompts. The other systems represented in the competition fit the general framework, though they may use different specific features, such as Latent Semantic Analysis vs. Content Vector Analysis, or employ different statistical techniques for model building.
Note the dotted arrow in the upper right-hand corner of the diagram, from human raters to human scores. A common application, particularly important in high-stakes assessments, applies the AES system as a second or check scorer. This diagram highlights ways in which objections to AES are actually objections to the assumptions of standardized testing. For example, the CCCC position statement cited earlier claims that AES sends the message that writing is not valued primarily as a form of communication; but the standardized situation has already changed the audience and the purpose of writing. The objection depends on the application of AES, not upon the nature of the technology, which could be configured differently to support different assessment or instructional purposes. This point will be examined in greater detail below.

In other words, many of the features of the current generation of automated scoring engines are contingent upon the application for which they are designed. Typical AES engines measure a set of text features very like that indicated by Ruth and Murphy, with its emphasis on subject, genre, structure, and language (at least if we include conventionality of language as well as Ruth and Murphy’s triad of lexical range, detail, and abstraction). But in principle, any feature of the written product, the social situation in which writing takes place, or the process by which a written product is developed could also be measured automatically and included in an AES system, as long as algorithms can be devised to measure them. Moreover, it is possible to use the features that enter into an AES system to define trait scores and thus to support more nuanced scoring and feedback.

These points notwithstanding, it remains fair to conclude that the current generation of automated scoring systems is built around assumptions very similar to Ruth and Murphy’s assessment episode. These systems are designed to be applied in the first instance to holistically scored, on-demand, timed assessments. In general, AES systems rely on features intended to measure many of the traits specified in holistic scoring rubrics, or in the Six-Trait model offered by Spandel (2005). The systems match much less well with criteria advanced in such contexts as the Framework for Success in Postsecondary Education (Council of Writing Program Administrators, National Council of Teachers of English, and the National Writing Project, 2011; O’Neill, Adler-Kassner, Fleischer, & Hall, 2012), which emphasizes rhetorical knowledge, critical thinking, and control of the writing process. In part this disjuncture is due to the contrast between models focused on “text quality,” measured in the end product, versus models focused on “writing skill,” which is an attribute of the writer, not the text. In part, it reflects differences in what kinds of features can readily be measured in the current state of natural language processing technology.

In this context, the difference between direct and indirect measurement is critical. If an AES system does not have features that address (for instance) the arguments presented in an essay, then argumentation is not directly included in the construct, even if it measures features that can be used as proxies for that skill. Of course, this distinction is a matter of degree since features intended to measure a construct may do so well or poorly.
AES involves all the issues that arise when any assessment is validated (Bennett, 2006); additional issues must be taken into account because of the ways that AES relies upon defined algorithms and not upon direct human judgment. However, it will be useful to delay further discussion of these issues until greater detail has been provided about how current AES systems work. For this purpose, one AES system—the e-rater scoring engine developed at ETS—can provide a concrete illustration.
3. An example of the current state of the art: the e-rater scoring engine

The e-rater scoring engine is well documented (Attali & Burstein, 2006; Burstein, Chodorow, & Leacock, 2003). Technical details differ among engines, so caution should be used in generalizing from it to other commercial AES systems, but based upon available overviews such as Shermis and Burstein (2003), it seems reasonable to assume that roughly similar sets of features and roughly comparable methods for calculating scores are implemented in the major AES systems.
Fig. 3. Features employed by e-rater engine v.11.1 and their construct interpretation.
Fig. 3, following the representation first presented in Quinlan, Higgins, and Wolf (2009), presents the feature decomposition, and hence an implicit construct definition, for e-rater as of version 11.1.² As Fig. 3 indicates, there are twelve high-level e-rater features, grouped into nine major categories, labeled with such headings as organization and development, vocabulary, content, grammar, usage, mechanics, and style. These headings correspond to elements that appear in commonly used rubrics for assessing text quality, though the fidelity with which the features represent the rubric categories varies across features in a manner to be discussed in more detail below. The high-level features are combined using a standard statistical method (linear regression), which provides a way to calculate a weighted sum over the values of the features that maximizes the prediction of scores.

Most of these high-level features depend upon natural language processing (NLP) technologies. In some cases, such as the lexical complexity features, relatively simple measures (such as transforms of word length or of frequency in a general English corpus) may be used. Most of the remaining features depend upon relatively complex algorithms that capture some aspect of the structure and wording of the text. Some features measure text structure (Burstein, Marcu, Andreyev, & Chodorow, 2001; Burstein & Marcu, 2003; Burstein, Marcu, & Knight, 2003; Higgins, Burstein, Marcu, & Gentile, 2004), though the features labeled “organization” and “development” focus on the number and length of text units, and not organization and development as a human rater might interpret the terms. Other features identify errors, with an array of specific algorithms focusing on the identification of ways in which essays deviate from conventional patterns. In the most recent version of e-rater, these error features are supplemented by “positive features” that search for patterns characteristic of normal, conventional English usage (Chodorow & Leacock, 2002; Futagi, Deane, Chodorow, & Tetreault, 2008; Leacock & Chodorow, 2001). Finally, textual content features can also play a role. E-rater uses content-vector analysis (Attali & Burstein, 2006), a method that focuses on patterns of vocabulary overlap. Other engines use related techniques, such as Latent Semantic Analysis (LSA) (Landauer, Laham, & Foltz, 1988).

The work of defining features is an incremental process, responding both to construct demands and to technical advances in natural language processing. For instance, the error features that are appropriate for a native speaker population and for an English-language learner (ESL) population are different, which has led to development of features targeting typical ESL error patterns, such as errors with articles, prepositions, and collocations (Chodorow, Gamon, & Tetreault, 2010).³ It may not yet be technically feasible to measure other aspects of text quality, such as quality of argumentation, but this situation may change with advances in NLP techniques (thus, see Feng & Hirst, 2011, for early research into automated identification of the structure of argument).

While features play an important role, the statistical model training process is also critical.⁴ Statistical analysis of sample essays (a training set) determines the optimum weights to apply when combining features to predict human scores. The training set thus determines what populations and tasks the model can reasonably be generalized to cover.
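To make the regression step concrete, the sketch below fits weights to a handful of invented feature values and human holistic scores, and then applies the resulting weighted sum to a new response. It is a minimal illustration of the general approach described above, not e-rater's actual feature set, training data, or model-building procedure.

```python
# Illustrative sketch only: a toy "weighted sum of features" scorer fit to
# human holistic scores, in the spirit of the linear-regression combination
# described above. Feature columns and values are invented for illustration.
import numpy as np

# Each row: hypothetical feature values for one training essay
# (e.g., discourse-unit count, mean word-frequency transform, error rate).
X_train = np.array([
    [4.0, 0.62, 0.012],
    [9.0, 0.71, 0.004],
    [6.0, 0.55, 0.020],
    [11.0, 0.80, 0.002],
    [3.0, 0.50, 0.030],
])
human_scores = np.array([2.0, 5.0, 3.0, 6.0, 1.0])  # holistic ratings on a 1-6 scale

# Fit weights (plus an intercept) that minimize squared error against the
# human scores; this corresponds to the "optimum weights" step in the text.
X_design = np.hstack([X_train, np.ones((len(X_train), 1))])
weights, *_ = np.linalg.lstsq(X_design, human_scores, rcond=None)

def predict_score(features):
    """Weighted sum of feature values plus intercept -> predicted holistic score."""
    return float(np.dot(np.append(features, 1.0), weights))

print(predict_score([7.0, 0.65, 0.010]))  # predicted score for a new, unseen essay
```

As the surrounding text emphasizes, everything about such a model depends on the training sample: the weights simply reproduce, as closely as possible, whatever the human raters rewarded in that sample.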
The specific statistical methods and procedures used to build a model have significant impacts on the quality of the final engine (Bejar, 2011; Ben-Simon & Bennett, 2007; Haberman and Sinharay, 2010). There are two major modes for creating AES models using e-rater: prompt-specific and generic, depending on the nature of the training set. Prompt-specific models are trained on essays written to a single prompt, whereas generic models are trained on essays written to multiple prompts (usually within the same genre) and drawn from the same population. Prompt-specific models use content
2 E-rater major version numbers are incremented annually; thus e-rater 11 represents the 2011 version, reflecting incremental improvements and modifications from the baseline version established by e-rater 2 in 2002. The various versions from 2.0 to 11.1 use the same core framework, but they incorporate numerous small modifications at the level of individual features. 3 These efforts are not, so far, particularly closely aligned to scales like the Common European Framework of Reference for Languages that attempt to define aspects of language skill cross-linguistically, nor do existing AES engines appear to have been fully optimized for what Weigle (2013) terms “Assessing Language through Writing,” though as the citations in the body of the text indicate, there are continuing efforts to adapt AES for use with language learners. Such applications involve a significant change in the underlying construct. Thus, efforts to apply AES systems to non-native populations are likely to entail significant adaptations of existing AES systems both at the level of individual features and in the way those features are aggregated. 4 There are alternatives to this process, such as the identification of traits motivated by a factor analysis (Attali & Powers, 2008), but as of the time of writing such alternatives are not widely used in operational settings.
features; generic models do not. Prompt-specific models typically perform somewhat better than generic models, but cannot be applied to new prompts (Attali, Bridgeman, & Trapani, 2010). In addition, generic models may be genre or task specific, program specific, or site-specific, depending on the features of the task and population that define the training set. The quality of automated scoring can be affected by the variability and representativeness of the training set. It can be important to oversample atypical responses. For instance, an AES system must detect and respond appropriately to off-topic essays or other insincere responses (Higgins, Burstein, & Attali, 2006; Powers et al., 2001). In addition, AES features can be disaggregated and used to support detailed reporting and feedback. This is the function of Criterion, which uses the e-rater engine to provide feedback on errors of grammar, usage, mechanics and style, and on other aspects of text quality (Burstein, Chodorow et al., 2003; Shermis, Burstein, Higgins, & Zechner, 2010). Attali and Powers (2008) identify three factors measured by e-rater, and propose reporting them as traits: fluency, accuracy of text production, and vocabulary sophistication. Attali and Powers argue that these factors can be used to compare student performance across grades. The limitations of an AES engine are a direct consequence of the way that engine is built. For instance, AES models are trained against the judgments of human raters, and might therefore replicate biases present in the original human ratings. Operational AES systems such as e-rater are usually trained against operational human ratings, which are subject to a variety of constraints (such as time available and quality of training). It is also important not to confuse the intended construct with the directly measured construct, which may differ from the intended construct in various ways. AES is appropriate where mismatches between the intended and measured construct are not too great and the predictive value of the scores is sufficient for the use to which it will be put. But it is precisely the technical details that define the measured construct and thus determine where and when an AES engine may fail to perform as expected.
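As a rough illustration of why content features are prompt-specific, the sketch below computes a simple vocabulary-overlap (cosine) similarity between a response and a profile built from text written to a particular prompt. The example texts and profile are invented, and the idea of treating a very low similarity value as one signal of an off-topic response is offered only as an illustration; this is not the actual content-vector analysis implementation used in e-rater.

```python
# Illustrative sketch only: bag-of-words cosine similarity in the spirit of
# content-vector analysis. The "training" text below stands in for a set of
# high-scoring essays written to a single prompt; real engines use much larger
# samples and more sophisticated weighting.
import math
import re
from collections import Counter

def word_vector(text):
    """Lowercased word-count vector for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(v1, v2):
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Vocabulary profile built from (hypothetical) high-scoring essays on one prompt.
prompt_profile = word_vector(
    "School uniforms reduce distraction and cost, although students lose "
    "some freedom of expression when uniforms are required."
)

def content_similarity(essay_text):
    return cosine(word_vector(essay_text), prompt_profile)

# A response with little vocabulary overlap receives a low similarity value,
# which an engine might use as one signal that a response is off topic.
print(content_similarity("Uniforms limit expression but can reduce cost."))
print(content_similarity("My favorite recipe for pasta involves garlic."))
```

Because the profile is built from essays on one prompt, the feature says nothing useful about responses to a different prompt, which is one reason generic models omit content features.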
4. Construct representation and common criticisms of AES

As noted in the introduction, the use of AES in writing assessment has been the subject of controversy (Cheville, 2004; Ericsson, 2006; Haswell, 2006; Herrington & Moran, 2006; Jones, 2006; McGee, 2006). Three major kinds of issues have been raised:
• Fundamental objections focusing on how AES affects the testing situation, since the knowledge that essays will be machine-scored may lead to changes in behavior that undermine the intended construct.
• Fundamental objections to the construct measured or to the construct-appropriateness of specific measurement methods. These include objections that focus on the fact that AES systems cannot interpret meaning, infer communicative intent, evaluate factual correctness and quality of argumentation, or take the writing process into account. A related family of objections focuses on the danger that an AES system can be gamed. Some observers have conducted experiments to demonstrate this issue, for instance by feeding different versions of an essay into an AES system (Perelman, 2012). AES systems can be insensitive to salient features that human raters might detect and penalize, such as massive repetition of content, lack of semantic coherence, purposeful (but inappropriate) manipulation of vocabulary to raise the rating, and so forth.
• Objections to measurement methods focused on technical inadequacies of an approach that might in principle be accepted. For instance, if a grammatical error detection system fails to achieve complete accuracy in recognizing subject/verb agreement, the presence of false positives or false negatives may be cited as reasons not to trust the system as a whole.
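The role that false positives and false negatives play in this last objection can be made concrete with a small arithmetic sketch. The counts below are invented for illustration and are not drawn from any published evaluation of e-rater or any other engine.

```python
# Illustrative sketch only: precision/recall arithmetic for a hypothetical
# subject-verb agreement detector. The counts are invented.
true_positives = 80   # real agreement errors the detector flagged
false_positives = 20  # correct sentences wrongly flagged as errors
false_negatives = 40  # real agreement errors the detector missed

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.67

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# High precision with modest recall means most flags are genuine errors but
# many errors go unflagged; critics may cite either kind of mistake as a
# reason to distrust the component, and by extension the system as a whole.
```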
We consider the more fundamental objections below.
4.1. Objections based on the impact of AES on the testing situation

This class of objections applies most saliently when AES is deployed in high-stakes testing situations and is crucial if AES is proposed as the sole scorer. As noted previously, where AES is currently deployed in high-stakes testing situations, this extreme situation typically does not apply; both AES and at least one human score are calculated, and disagreements cause the essay to be referred to a second human rater. This is an area where further research is important, as there is relatively little literature on the impact of automated scoring on test-taker behavior (though see Bridgeman, Trapani, & Attali, 2012, for an analysis of differential discrepancies by group between e-rater and human scores on the Test of English as a Foreign Language and the Graduate Record Examination). The impact of automated scoring may be very different in a low-stakes situation, where AES may be viewed as a tool embedded in a larger communicative purpose, than it is in a high-stakes situation, where test-takers may be motivated to distort their usual writing practices to achieve a higher score. In a high-stakes situation, practices such as using an inflated vocabulary may be adopted quite independently of the use of automated scoring, since the test candidate may view the scorer as a barrier to be gamed rather than as a person to communicate with.

4.2. Fundamental construct objections

Objections to use of AES that focus on its failure to measure meaningfulness of content, argumentation quality, or rhetorical effectiveness boil down, in essence, to the observation that AES systems do not measure the full writing construct, but rather, a restricted construct. Almost by definition, that construct excludes elements that are difficult to measure automatically, but which are critical for writing to succeed in its communicative purpose. Recall an earlier distinction: the difference between text quality (emphasized in scoring systems such as the Six-Trait model) and writing skill (emphasized in the Framework for Success in Postsecondary Education). It is clear that e-rater (like most state-of-the-art AES engines) directly measures text quality, not writing skill. Many construct objections directly hinge upon that difference. It is obvious enough that AES does not measure the full Framework. On the other hand, the features used in AES systems do provide evidence relevant to many of the traits specified in text-quality rubrics such as the Six-Trait model, as long as we accept that certain elements (such as quality of content) are not measured, and that other elements (such as word choice) are measured by proxies. The differences are sufficiently large, however, that it would be unwise to assume that AES measures the same construct as human scoring, no matter how strongly AES and human scoring are correlated.

Very well then. What does it measure? This question can be addressed by examining the universe of discourse—and the universe of supporting skills—that come into play when writing is involved. Consider Fig. 4, a diagram presented in Deane (2011, p. 8).⁵ It is intended to capture the fact that the writing process is not strictly linear; instead, it emphasizes that writing is a sociocognitive process that draws upon and coordinates many different component skills. If we were to view writing as a purely linear expressive process, we might start at the upper right-hand corner of Fig.
4 and work our way down, noting that a writer must (a) engage the audience and purpose; (b) conceptualize what to write; (c) structure that content by developing a rhetorical plan; (d) find ways to phrase each piece of the resulting document structure, conveying the right meaning in the right words; and (e) transcribe the intended sentences onto the page. But writing is not a simple linear process. Interpretive processes come into play—writers may reread their own writing or interrupt to read other materials that may assist them, or they may stop to think, calling upon various processes involving deliberation, reflection, metacognition, and strategy use. This flexibility supports nonlinear, highly recursive writing practices with an emphasis on what Bereiter and Scardamalia (1987) term knowledge transformation.
5 A related table is presented in Klobucar et al. (2012). That table emphasizes the kinds of constructs that may be relevant to the writing construct, but deemphasizes the interaction among skills that is emphasized by the arrows in Fig. 4.
Fig. 4. Modes of thought and modes of representation in the literacy process (slightly modified from Deane, 2011, p. 8).
Fig. 4 also represents writing as involving multiple forms of cognition. The writer must deal in social realities, reasoning about people’s communicative intentions, about social conventions and expectations, about subtext and the sort of practices in which the act of writing is embedded. The writer must deal in abstract conceptual modeling, and may engage in building models, framing arguments, or clarifying understandings of the world. The writer must deal in knowledge of text and its structure, recognize genres, and take advantage of familiar forms. The writer must deal in the basic stuff of language, choosing the right words and the right grammatical constructions to achieve clear, precise, concise modes of expression. The writer must deal with the encoding of language in text, and the conventions that go with that encoding. These elements form a single interconnected whole, at least for skilled writers, since skill requires interpenetration and coordination of these elements, not their analytical separation.

If we use Fig. 4 as a way of describing the construct decisions that go into defining writing as a construct, we can characterize the Framework for Success in Postsecondary Education as defining writing very broadly to encompass all five levels shown in Fig. 4, and as drawing the boundaries of the writing construct very broadly to include a wide range of reflective, deliberative, metacognitive practices and strategies. But when an essay is holistically scored, attention focuses on the written product—on text quality—and thus on those skills that directly support written expression. Current-generation AES systems, since they do not model the human ability to perform social and conceptual reasoning, necessarily measure an even narrower range of skills. The features of state-of-the-art AES systems almost exclusively deal with textual evidence relevant to three cells in Fig. 4: those labeled structure, phrase, and transcribe, which address the text production abilities that enable writers to organize text according to some outline and elaborate upon the points it contains, using appropriate, clear, concise and unambiguous language, in conventional orthography and grammar. Emphasis on these features is essentially the conclusion of Attali and Powers (2008), who interpret the three factors that emerge from their analysis as having to do with fluency, accuracy, and vocabulary. Quinlan et al. (2009) reach similar conclusions when they argue that e-rater measures efficiency in what Bereiter and Scardamalia (1987) term a ‘knowledge-telling’ approach to writing.

There is a deep connection between skill in text production and the broader bundle of skills deployed by expert writers. The literature on the development of writing expertise indicates that cognitive capacity plays a critical role, and suggests that efficiencies in text production free up the capacity needed to apply effective writing strategies (Kellogg, 1988; McCutchen, Covill, Hoyne, & Mildes, 1994; McCutchen, 1996, 2000). This literature emphasizes that it is critical for writers to achieve full fluency and control over text translation (text production), since the social and cognitive skills emphasized in
the Framework for Success involve the application of strategies that compete with text production for attention and memory resources. In other words, those who have developed high fluency and control over text production processes are precisely those who have the cognitive resources needed to practice the skills needed to master a broader writing construct. This conclusion is implicit in Bereiter and Scardamalia’s (1987) distinction between writing as knowledge-telling and writing as knowledge-transforming, and is increasingly explicit in studies of the processing demands of writing (Chenoweth & Hayes, 2001; Hayes, 2009; McCutchen, 1996; Torrance & Galbraith, 2008). Thus it is not particularly surprising that AES systems can achieve high levels of agreement with human raters even when the human scoring rubric may include elements that go well beyond core text production skills. For instance, Bennett (2011b) reports trials of a persuasive writing prompt piloted as part of the ETS CBAL (Cognitively Based Assessments of, for, and as Learning) research initiative. In this study, essays were scored by one set of raters on a critical-thinking rubric that valued effective argumentation and attention to audience, and by other raters on a text-production rubric that valued fluency and coherence of expression, effective word choice, accuracy of text production and adherence to conventions—elements closely related to the features included in most AES systems. The ratings appeared to be of high quality. The raters successfully identified essays with strong arguments but weak text production quality (or strong language skills but weak arguments), yet usually the two traits coincided, yielding an overall correlation of 0.80.

Of course, to the extent that AES systems fail to measure the same constructs as expert human raters, there is the danger of undesired side-effects. What matters is how AES is deployed in practice, and how that interacts with the other elements that enter into instruction and assessment. When the use case emphasizes the identification of students who need to improve the fluency, control, and sophistication of their text production processes, or affords such students the opportunity to practice and increase their fluency while also learning strategies that will decrease their cognitive load, the case for AES is relatively strong; but if the focus of an assessment is to use quality of argumentation, sensitivity to audience, and other such elements to differentiate among students who have already achieved fundamental control of text production processes, the case for AES is relatively weak. The former case is advanced, for instance, in Kellogg (2007), who argues that one of the strongest potential benefits of AES lies precisely in its capacity to support practice to build fluency and control in text production processes.

In many contexts, since text production skills are a necessary enabling skill, and strongly correlated with overall performance, it is reasonable to deploy AES in combination with other sources of evidence, including human ratings (Bennett, 2011a). But if we accept that an AES system primarily measures the fluency, accuracy and sophistication of text production, there is something incoherent about deploying AES as the sole scorer for a high-stakes assessment intended to measure sensitivity to audience, quality of argumentation, or other elements drawn from a broader writing construct. In short, the deployment of an AES system must be sensitive to its limitations.
No assessment technology should be applied blindly; but neither should any method be rejected a priori, without considering how it can be used to support effective learning and teaching.

4.3. Objections to measurement methods

Of course, even if one accepts the construct definition built into the current generation of AES systems, quality of measurement may vary. Relatively simple proxies may suffice for many purposes, yet be entirely inappropriate where other aspects of the construct play a critical role. Thus, users of an AES system can justifiably ask how much separation there is between the specific construct addressed by its component features and the actual implementation of those features. If features are treated as black boxes, there may be little visible difference in performance between two AES systems until one examines how they perform on unusual and relatively improbable inputs. Moreover, many of the most interesting, and potentially useful, applications of AES require access to information about the detailed pattern of performance displayed by writers, and not simply an unanalyzed summary score.

Consider, for instance, the issue of document length and its use in AES. Even if one seeks to avoid any direct use of document length, it can be very hard to remove the effects of length from a scoring
model—a point that continues to influence e-rater development efforts, and which illustrates the fundamental issue at play. Attali and Powers (2008) point out that the e-rater organization and development features have an intrinsic connection with essay length, since they measure the number and length of discourse units. This is related to an issue raised in Ben-Simon and Bennett (2007, Table 10), who provide evidence from National Assessment of Educational Progress (NAEP) writing test data that standard, statistically created e-rater models weight essay length more strongly than human raters. While the e-rater organization and development features do more than measure raw length, since the NLP analysis searches specifically for the presence of markers indicating organized text, this case illustrates the fundamental issue rather well. A particular indicator (in this case, the number and length of discourse units identified by a variety of textual cues) is not equivalent to the construct it is intended to measure (in this case, the ability to produce well-structured, fully elaborated essays), and excessive reliance on such indicators may be misleading, and may (if relied upon blindly) have problematic implications for the way teachers and students interact with an assessment (Bennett, 2011a).

And yet one should not presume that document length is irrelevant to the measurement of writing skill. Fluency matters. If writers do not fully complete the task, or only do so by sacrificing other aspects of text quality, then length—or more accurately, the lack of length—may be a strong indicator of lack of skill. It goes without saying, however, that excessive length should mean very little, and that mere length is not what really matters, but rather, fluency at producing the kind of text that actually carries out the assigned task. E-rater addresses this issue in two ways. First, it incorporates advisories: filters that detect abnormally constructed essays that might yield misleading scores. Second, it discounts excessive length by measuring variety and elaboration of discourse units on a logarithmic scale. As a result, the greatest score contrasts will distinguish minimal responses, without extended structure, from responses that display the length and variety expected in a full-length essay. Adding more content to an essay than is required will have a much smaller impact. Fluency matters; but it only matters if other aspects of text quality are not sacrificed in the process. However, the goal is to measure the intended construct, and that goal may entail a more elaborate model than needed for predictive accuracy. Cognitive theory predicts a tradeoff between fluency and other text features, such as cohesion and coherence. Thus, ongoing research at ETS focuses on improving measurement of text coherence (Burstein, Tetreault, & Andreyev, 2010). Such improvements to existing AES engines may make a major difference in engine quality, yet may not be visible if all one examines is the agreement between machine and human scoring.

Somewhat paradoxically, simpler indicators can be revealing even when considered separately, if one’s goal is to understand what is going wrong among students who fail to achieve a satisfactory level of performance. Consider such observations as the following:

• Some students produce less text than required, and give up when other students have barely begun, but with very few errors.
• Some students produce text fluently but at the cost of an increased rate of spelling and grammar errors.

Observations like these may suggest useful inferences, singly or in combination, pointing to a potential use for AES that has not yet been fully explored. A similar set of observations may be raised in connection with another aspect of the typical AES construct: sophistication of vocabulary. Very simple proxy features, such as word length and word frequency, may work very well, even though the construct we ultimately wish to measure—control of word choice—is a complex and nuanced phenomenon. Once again, there is a key range where such features do quite a lot of work, distinguishing impoverished from more sophisticated lexis. Once again, the way such features are scaled can minimize potential negative impacts. For instance, since e-rater uses the square root of word length in its scoring model, writers will get relatively little benefit from using an average of four-syllable rather than three-syllable words. And once again, there is a gap between the features used and the construct to be measured. What matters for the judgment of text quality is that a writer makes use of the right vocabulary, in the right meanings, in the right sentential contexts, while avoiding vagueness, unnecessary and misleading specificity, or excessive ambiguity.
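A short sketch may help make these scaling choices concrete. The functions below are illustrative stand-ins, not e-rater's actual formulas, feature definitions, or weights; they simply show how logarithmic and square-root transforms produce diminishing returns for additional discourse units and longer words.

```python
# Illustrative sketch only: diminishing-returns transforms in the spirit of the
# scaling described above. These are stand-in functions, not e-rater's actual
# formulas or weights.
import math

def development_feature(discourse_unit_count):
    """Log scaling: large gains from minimal to moderate structure, small gains after."""
    return math.log(1 + discourse_unit_count)

def word_length_feature(mean_word_length_in_characters):
    """Square-root scaling: longer words help, but with rapidly diminishing returns."""
    return math.sqrt(mean_word_length_in_characters)

# Moving from 2 to 8 discourse units changes the feature far more than moving
# from 8 to 14, so padding an already developed essay buys little.
print(development_feature(2), development_feature(8), development_feature(14))

# Likewise, raising average word length from 4 to 5 characters matters more
# than raising it from 7 to 8.
print(word_length_feature(4), word_length_feature(5),
      word_length_feature(7), word_length_feature(8))
```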
Accurate measurement of vocabulary use in writing entails that the scoring system be able to tell the difference between a meaningless jumble of words and a syntactically and semantically coherent text. While the current e-rater engine has some checks built in to detect aberrant text, improvements to e-rater and arguably other AES systems will involve enriching the measurement to capture this construct more precisely. NLP researchers are attempting to model such features of language in any case, and certain aspects have been built into e-rater in the form of so-called “positive features” (Futagi et al., 2008; see also Deane & Quinlan, 2010, for a review of some of the elements currently measurable using AES techniques). But once again, much of the potential benefit of AES lies in the patterns observed when students fail to demonstrate mastery, especially if combined with other sources of information about student performance. For example, if we could analyze a log of the text production process, and demonstrate that a writer who avoided longer and rarer words also showed hesitancy and spelling errors when such words were produced, we could advance specific hypotheses about why the writer avoided more challenging vocabulary, and what interventions might be most helpful.

As these examples illustrate, even if attention is restricted to the construct that AES systems are intended to measure, criticisms of the scoring methods have a point. These criticisms need not invalidate use of AES, as long as adequate steps are taken to prevent problematic cases from distorting scoring or encouraging users to game the system, but they highlight an ongoing tension that has ramifications throughout this special issue. To the extent that we emphasize those aspects of writing that are easily measured and about which agreement can easily be obtained, whether by machines or humans, we emphasize those aspects of writing that are routine and routinizable; but the writing teacher needs to encourage novice writers to adopt new strategies, to adapt flexibly to new contexts, and to address rhetorical problems that require innovative solutions. Writing assessment, whether scored by humans or by machines, needs to be structured to support the teacher and to encourage novice writers to develop the wide variety of skills required to achieve high levels of mastery.

5. A way forward: toward a socio-cognitive approach to automated writing analysis

One way to conceptualize the issues we have been exploring is that they involve a tension between ease of measurement and construct coverage. On the one hand, the kinds of features currently measured by AES systems focus on a restricted construct that emphasizes text production and applies it primarily to replicating human scoring in a high-stakes context. On the other hand, a richer conception of the writing construct is available from a variety of sources, one that takes into account both the social and cognitive dimensions of writing skill (Flower, 1994; Hayes, 2012). However, the current state of AES is historically contingent. The technology emerged when direct writing assessment was dominated by holistic scoring, and in its early stages of development it was framed precisely to support high-stakes holistic essay scoring.
As we have already discussed, there is room to expand the range of features that enter into AES to make it more responsive to a richer understanding of the construct. But it is also important to recognize that AES can be deployed in innovative ways that might provide better support for writer cognition and integrate more fruitfully with the social practices that encourage quality writing. In other words, much of the promise of the technology arises in a different range of applications in which there might be less perceived conflict between AES and the teaching of writing as a humanistic practice.

Let us conduct a thought experiment based on an unfortunately common pattern in writing assessment. Suppose that a group of students sits down to take a timed writing assessment. Upon analysis, we discover that essays that get low ratings are shorter than their higher-scoring counterparts and less sophisticated on almost every linguistically measured dimension. By itself, this analysis reveals little. Why did the lower-scoring students fail to complete the task satisfactorily? There is no way of knowing. However, if we begin to collect other kinds of information, the picture begins to change. Consider, for instance, what happens if we collect features that measure the writing process, not just the written product. We may observe that some of those who perform poorly show hesitancy in word choice, correlated with difficulties in spelling that surface in the final written product. That information means something different to the teacher than if we observed that the student seemed highly fluent and produced text that was entirely unobjectionable but for its length and conformance to task requirements.
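As a hedged illustration of the kind of process evidence this thought experiment imagines, the sketch below derives two simple indicators from a hypothetical keystroke-style log: the pause before each word and the number of revisions a word received. The log format, values, and cutoff are invented for illustration and are not drawn from any operational logging system.

```python
# Illustrative sketch only: two toy writing-process indicators computed from a
# hypothetical log of (word, pause_before_word_in_seconds, times_revised).
hypothetical_log = [
    ("the", 0.3, 0), ("experiment", 4.2, 2), ("was", 0.4, 0),
    ("successful", 5.1, 3), ("and", 0.2, 0), ("repeatable", 6.0, 2),
]

LONG_WORD = 7  # characters; an arbitrary cutoff for this illustration

def mean(values):
    return sum(values) / len(values) if values else 0.0

long_word_pauses = [pause for word, pause, _ in hypothetical_log if len(word) >= LONG_WORD]
short_word_pauses = [pause for word, pause, _ in hypothetical_log if len(word) < LONG_WORD]
revisions_on_long_words = [revs for word, _, revs in hypothetical_log if len(word) >= LONG_WORD]

# A writer who pauses and revises mainly around longer words may be struggling
# with spelling or word retrieval, which suggests a different intervention than
# one appropriate for a writer who is fluent but simply stops early.
print("mean pause before long words:", mean(long_word_pauses))
print("mean pause before short words:", mean(short_word_pauses))
print("mean revisions on long words:", mean(revisions_on_long_words))
```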
More generally, analysis of linguistic features of individual samples, combined with a rich array of electronically collected information, might enable us to identify appropriate instructional interventions without removing the writing from a channel in which it was written to humans for real communicative purposes. Such an approach is, as yet, merely a possibility, though one that has been noted before (Brent & Townsend, 2006). But there is every reason to consider how AES might be used in concert with other assessment systems, within an ecology that emphasizes electronic media and embeds AES in a richly articulated instructional framework. The ultimate promise of automated writing evaluation emerges when we open up the space of possibilities and consider it not only as a technology that supports automated essay scoring, but as the basis for large-scale, embedded forms of automated writing analysis in which social and cognitive aspects of the writing process are taken more richly into account.

Note from the Editors of the Special Issue

This article is part of a special issue of Assessing Writing on the automated assessment of writing. The invited articles are intended to provide a comprehensive overview of the design, development, use, applications, and consequences of this technology. Please find the full contents list available online at: http://www.sciencedirect.com/science/journal/10752935.

References

Anson, C. M. (2008). Closed systems and standardized writing tests. College Composition and Communication, 60, 113–128. Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning & Assessment, 10. Retrieved from http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1603/1455 Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning & Assessment, 4. Retrieved from http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1650/1492 Attali, Y., & Powers, D. (2008). A developmental writing scale (ETS Research Report RR-08-19). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-08-19.pdf Beck, S. W., & Jeffery, J. (2007). Genres of high stakes writing assessments and the construct of writing competence. Assessing Writing, 12, 60–79. Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice, 18, 319–341. Belanoff, P., & Dickson, M. (1991). Portfolios: Process and product. Portsmouth, NH: Heinemann. Bennett, R. E. (2006). Moving the field forward: Some thoughts on validity and automated scoring. In: D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 403–412). Hillsdale, NJ: Lawrence Erlbaum. Bennett, R. E. (2011a). Automated scoring of constructed-response literacy and mathematics items. Washington, DC: Arabella Philanthropic Advisors. Retrieved from http://www.ets.org/s/k12/pdf/k12 commonassess automated scoring math.pdf Bennett, R. E. (2011b). CBAL: Results from piloting innovative K-12 assessments (ETS Research Report RR-11-23). Princeton, NJ: Educational Testing Service.
Retrieved from http://www.ets.org/Media/Research/pdf/RR-11-23.pdf Ben-Simon, A., & Bennett, R. E. (2007). Toward more substantively meaningful automated essay scoring. Journal of Technology, Learning & Assessment, 6. Retrieved from http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1631/1475 Ben-Simon, A., & Cohen, Y. (2011). The Hebrew Language Project: Automated essay scoring & readability analysis. Paper presented at the annual meeting of the IAEA, Manila. Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. Hillsdale, NJ: Lawrence Erlbaum. Black, L., Daiker, D., Sommers, J., & Stygall, G. (1994). New directions in portfolio assessment. Portsmouth, NH: Heinemann. Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert systems. Journal of Educational Measurement, 27, 93–108. Brent, E., & Townsend, M. (2006). Automated essay scoring in the classroom: Finding common ground. In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 177–198). Logan, UT: Utah State University Press. Bridgeman, B., Trapani, C. S., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity and country. Applied Measurement in Education, 25, 27–40. Brigham, C. C. (1933). The reading of the comprehensive examination in English: An analysis of the procedures followed during the five reading periods from 1929 to 1933. Princeton: Princeton University Press. Broad, B., Adler-Kassner, L., Alford, B., Detweiler, J., Estrem, H., Harrington, S., et al. (2009). Organic writing assessment: Dynamic criteria mapping in action. Logan, UT: Utah State University Press. Burstein, J., & Chodorow, M. (2003). Directions in automated essay scoring. In: R. Kaplan (Ed.), Handbook of applied linguistics (pp. 487–497). Oxford, UK: Oxford University Press.
Burstein, J., & Chodorow, M. (2010). Progress and new directions in technology for automated essay evaluation. In: R. Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 487–497). Oxford, UK: Oxford University Press.
Burstein, J., Chodorow, M., & Leacock, C. (2003). Criterion: Online essay evaluation: An application for automated evaluation of student essays. In: Proceedings of the fifteenth annual conference on innovative applications of artificial intelligence (pp. 3–10). Acapulco, Mexico: Association for the Advancement of Artificial Intelligence.
Burstein, J., & Marcu, D. (2003). Automated evaluation of discourse structure in student essays. In: M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 200–219). Mahwah, NJ: Lawrence Erlbaum.
Burstein, J., Marcu, D., Andreyev, S., & Chodorow, M. (2001). Towards automatic classification of discourse elements in essays. In: Proceedings of the 39th annual meeting of the Association for Computational Linguistics (pp. 98–105). Toulouse, France: Association for Computational Linguistics.
Burstein, J., Marcu, D., & Knight, K. (2003). Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems: Special Issue on Advances in Natural Language Processing, 18, 32–39.
Burstein, J., Tetreault, J., & Andreyev, S. (2010). Using entity-based features to model coherence in student essays. In: Human language technologies: The 2010 annual conference of the North American chapter of the ACL (pp. 681–684). Association for Computational Linguistics.
Cambridge, D., Cambridge, B., & Yancey, K. B. (Eds.). (2009). Electronic portfolios 2.0: Emergent research on implementation and impact. Washington, DC: Stylus.
Castro-Castro, D., Lannes-Losada, R., Maritxalar, M., Niebla, I., Pérez-Marqués, C., Álamo-Suárez, N. C., et al. (2008). A multilingual application for automated essay scoring. Advances in artificial intelligence. Lecture Notes in Computer Science, 5290, 243–251.
Chang, T.-H., Lee, C.-H., & Tam, H.-P. (2007). On issues of feature extraction in Chinese automatic essay scoring system. In: Proceedings of the 2007 conference on artificial intelligence in education: Building technology rich learning contexts that work (pp. 545–547). Amsterdam: IOS Press.
Chang, T.-H., Lee, C.-H., & Chang, Y.-M. (2006). Enhancing automatic Chinese essay scoring system from figures-of-speech. In: Proceedings of the twentieth Pacific Asia conference on language, information and computation (pp. 28–34).
Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing: Generating text in L1 and L2. Written Communication, 18, 80–98.
Cheville, J. (2004). Automated scoring technologies and the rising influence of error. English Journal, 93, 47–52.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater's performance on TOEFL® essays (TOEFL® Research Report RR-73, ETS Research Report RR-04-04). Princeton, NJ: Educational Testing Service.
Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English language learners: Feedback and assessment. Language Testing, 27, 419–436.
Chodorow, M., & Leacock, C. (2002). Techniques for detecting syntactic errors in text. Technical Report of the Institute of Electronics, Information, and Communication Engineers, 102, 37–41.
Cohen, Y., Ben-Simon, A., & Hovav, M. (2003). The effect of specific language features on the complexity of systems for automated essay scoring. Paper presented at the International Association for Educational Assessment annual conference, October 2003, Manchester.
Conference on College Composition and Communication. (2004). Position statement on teaching, learning, and assessing writing in digital environments. Retrieved from http://www.ncte.org/cccc/resources/positions/digitalenvironments
Condon, W. (2012). The future of portfolio-based writing assessment: A cautionary tale. In: N. Elliot & L. Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 233–245). New York, NY: Hampton Press.
Cooper, C. R., & Odell, L. (Eds.). (1977). Evaluating writing: Describing, measuring, judging. Urbana, IL: NCTE.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, UK: Cambridge University Press.
Council of Writing Program Administrators, National Council of Teachers of English, & National Writing Project. (2011). Framework for success in postsecondary writing. Retrieved from http://wpacouncil.org/files/framework-for-success-postsecondary-writing.pdf
Cronbach, L. J. (1971). Test validation. In: R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
Deane, P. (2011). Writing assessment and cognition (ETS Research Report RR-11-14). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-11-14.pdf
Deane, P., & Quinlan, T. (2010). What automated analyses of corpora can tell us about students' writing skills. Journal of Writing Research, 2, 151–177.
Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability (ETS Research Bulletin RB-61-15). Princeton, NJ: Educational Testing Service.
Elliot, N. (2005). On a scale: A social history of writing assessment in America. New York, NY: Peter Lang.
Elliot, S. (2003). IntelliMetric: From here to validity. In: M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum.
Ericsson, P. F. (2006). The meaning of meaning: Is a paragraph more than an equation? In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 28–37). Logan, UT: Utah State University Press.
Ericsson, P. F., & Haswell, R. H. (Eds.). (2006). Machine scoring of student essays: Truth and consequences. Logan, UT: Utah State University Press.
Feng, W. F., & Hirst, G. (2011). Classifying arguments by scheme. In: Proceedings of the 49th annual meeting of the Association for Computational Linguistics (pp. 987–996). Portland, OR: Association for Computational Linguistics.
Flower, L. (1994). The construction of negotiated meaning: A social cognitive theory of writing. Carbondale, IL: Southern Illinois University Press.
Futagi, Y., Deane, P., Chodorow, M., & Tetreault, J. (2008). A computational approach to detecting collocation errors in the writing of non-native speakers of English. Computer Assisted Language Learning, 21, 353–367.
Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. New York, NY: College Entrance Examination Board.
Haberman, S. J. (2011). Use of e-rater in scoring of the TOEFL iBT Writing Test (ETS Research Report RR-11-25). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-11-25.pdf
Haberman, S. J., & Sinharay, S. (2010). The application of the cumulative logistic regression model to automated essay scoring. Journal of Educational & Behavioral Statistics, 35, 586–602.
Hamp-Lyons, L., & Condon, W. (2000). Assessing the portfolio: Principles for practice, theory, and research. Cresskill, NJ: Hampton Press.
Haswell, R. H. (2006). Automatons and automated scoring: Drudges, black boxes, and dei ex machina. In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 57–78). Logan, UT: Utah State University Press.
Hayes, J. R. (2009). From idea to text. In: R. Beard, D. Myhill, J. Riley, & M. Nystrand (Eds.), The Sage handbook of writing development (pp. 65–79). London, UK: Sage.
Hayes, J. R. (2012). Modeling and remodeling writing. Written Communication, 29, 369–388.
Herrington, A., & Moran, C. (2006). WritePlacer Plus in place: An exploratory case study. In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 114–129). Logan, UT: Utah State University Press.
Herrington, A., & Stanley, S. (2012). Criterion℠: Promoting the standard. In: A. B. Inoue & M. Poe (Eds.), Race and writing assessment (pp. 47–61). New York, NY: Peter Lang.
Higgins, D., Burstein, J., & Attali, Y. (2006). Identifying off-topic student essays without topic-specific training data. Natural Language Engineering, 12, 145–159.
Higgins, D., Burstein, J., Marcu, D., & Gentile, C. (2004). Evaluating multiple aspects of coherence in student essays. In: S. Dumais, D. Marcu, & S. Roukos (Eds.), HLT-NAACL 2004: Main proceedings (pp. 185–192). Boston, MA: Association for Computational Linguistics.
Jones, E. (2006). ACCUPLACER's essay-scoring technology: When reliability does not equal validity. In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 93–113). Logan, UT: Utah State University Press.
Kellogg, R. T. (1988). Attentional overload and writing performance: Effects of rough draft and outline strategies. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 355–365.
Kellogg, R. T. (2007). Improving the writing skills of college students. Psychonomic Bulletin & Review, 14, 237–242.
Klobucar, A., Deane, P., Elliot, N., Ramineni, C., Deess, P., & Rudniy, A. (2012). Automated essay scoring and the search for valid writing assessment. In: C. Bazerman, C. Dean, J. Early, K. Lunsford, S. Null, P. Rogers, & A. Stansell (Eds.), International advances in writing research: Cultures, places, measures (pp. 103–119). Fort Collins, CO: WAC Clearinghouse/Anderson, SC: Parlor Press. Retrieved from http://wac.colostate.edu/books/wrab2011/chapter6.pdf
Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by singular value decomposition: A progress report. In: M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems 10 (pp. 45–51). Cambridge, MA: MIT Press.
Leacock, C., & Chodorow, M. (2001). Automatic assessment of vocabulary usage without negative evidence (TOEFL® Research Report 67, ETS Research Report RR-01-21). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-01-21.pdf
Lemaire, B., & Dessus, P. (2001). A system to assess the semantic content of student essays. Journal of Educational Computing Research, 24, 305–320.
Lockridge, S. (2012, April). Truscore. Paper presented as part of the symposium on contrasting state-of-the-art in automated scoring of essays at the annual meeting of the National Council on Measurement in Education, Vancouver.
McCutchen, D. (1996). A capacity theory of writing: Working memory in composition. Educational Psychology Review, 8, 299–325.
McCutchen, D. (2000). Knowledge, processing, and working memory: Implications for a theory of writing. Educational Psychologist, 35, 13–23.
McCutchen, D., Covill, A., Hoyne, S. H., & Mildes, K. (1994). Individual differences in writing: Implications of translating fluency. Journal of Educational Psychology, 86, 256–266.
McGee, T. (2006). Taking a spin on the Intelligent Essay Assessor. In: P. F. Ericsson & R. H. Haswell (Eds.), Machine scoring of student essays: Truth and consequences (pp. 79–92). Logan, UT: Utah State University Press.
Messick, S. (1989). Validity. In: R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In: R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–305). Westport, CT: American Council on Education/Praeger.
Mislevy, R. J. (2007). Validity by design. Educational Researcher, 36, 463–469. http://dx.doi.org/10.3102/0013189X07311660
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25, 6–20.
Morrow, K. (Ed.). (2004). Insights from the Common European Framework. Oxford, UK: Oxford University Press.
Murphy, S., & Camp, R. (1996). Toward systemic coherence: A discussion of conflicting perspectives on portfolio assessment. In: R. Calfee & P. Perfumo (Eds.), Writing portfolios in the classroom: Policy and practice, promise and peril. Mahwah, NJ: Lawrence Erlbaum.
Murphy, S., & Yancey, K. B. (2008). Construct and consequence: Validity in writing assessment. In: C. Bazerman (Ed.), Handbook of writing research: History, society, school, individual, text (pp. 365–386). New York, NY: Lawrence Erlbaum.
Noyes, E. A., Sale, W. M., Jr., & Stalnaker, J. M. (1945). Report on the first six tests in English composition, with sample answers from the tests of April and June, 1944. New York, NY: College Entrance Examination Board.
O'Neill, P., Adler-Kassner, L., Fleischer, C., & Hall, A. M. (2012). Creating the framework for success in postsecondary writing. College English, 76, 520–524.
Perelman, L. (2012). Construct validity, length, score, and time in holistically graded writing assessments: The case against automated essay scoring (AES). In: C. Bazerman, C. Dean, J. Early, K. Lunsford, S. Null, P. Rogers, & A. Stansell (Eds.), International advances in writing research: Cultures, places, measures (pp. 121–131). Fort Collins, CO: WAC Clearinghouse/Anderson, SC: Parlor Press. Retrieved from http://wac.colostate.edu/books/wrab2011/chapter7.pdf
Powers, D. E., Burstein, J., Chodorow, M., Fowles, M. E., & Kukich, K. (2001). Stumping e-rater: Challenging the validity of automated essay scoring (GRE® Board Professional Report 98-08bP, ETS Research Report RR-01-03). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-01-03-Powers.pdf
Quinlan, T., Higgins, D., & Wolf, S. (2009). Evaluating the construct coverage of the e-rater scoring engine (ETS Research Report RR-09-01). Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-09-01.pdf
Ruth, L., & Murphy, S. (1984). Designing topics for writing assessment: Problems of meaning. College Composition and Communication, 35, 410–422.
Ruth, L., & Murphy, S. (1988). Designing writing tasks for the assessment of writing. Norwood, NJ: Ablex.
Sebrechts, M. M., Bennett, R. E., & Rock, D. A. (1991). Agreement between expert system and human raters' scores on complex constructed-response quantitative items. Journal of Applied Psychology, 76, 856–862.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum.
Shermis, M. D., & Burstein, J. (Eds.). (in press). Handbook of automated essay evaluation: Current applications and new directions. New York, NY: Routledge.
Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. In: E. Baker, B. McGaw, & N. S. Petersen (Eds.), International encyclopedia of education (3rd ed., Vol. 4, pp. 20–26). Oxford, UK: Elsevier Science.
Shermis, M. D., & Hamner, B. (in press). Contrasting state-of-the-art automated scoring of essays. In: M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions. New York, NY: Routledge.
Spandel, V. (2005). Creating writers through 6-trait writing assessment and instruction (4th ed.). Boston, MA: Pearson/Allyn & Bacon.
Torrance, M., & Galbraith, D. (2008). The processing demands of writing. In: C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (pp. 67–82). New York, NY: The Guilford Press.
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18. http://dx.doi.org/10.1016/j.asw.2012.10.006
White, E. (1985). Teaching and assessing writing: Recent advances in understanding, evaluating, and improving student performance. San Francisco, CA: Jossey-Bass.
Wild, F., Stahl, C., Stermsek, G., Penya, Y., & Neumann, G. (2005). Factors influencing effectiveness in automated essay scoring with LSA. In: C.-K. Looi, et al. (Eds.) (pp. 947–949). Amsterdam: IOS Press.
Willard-Traub, M. K. (2002). Assessing the portfolio: Hamp-Lyons and Condon [Review of the book Assessing the portfolio: Principles for practice, theory, and research, by L. Hamp-Lyons & W. Condon]. Assessing Writing, 8, 65–69.
Yancey, K. (2012). The rhetorical situation of writing assessment: Exigence, location, and the making of knowledge. In: N. Elliot & L. Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 475–492). New York, NY: Hampton Press.
Yancey, K., & Weiser, I. (1997). Situating portfolios: Four perspectives. Logan, UT: Utah State University Press.

Paul Deane received his Ph.D. in theoretical linguistics from the University of Chicago (1986). He has published on a variety of subjects, including lexical semantics, language and cognition, computational linguistics, and writing assessment and pedagogy. His research focuses on vocabulary assessment, automated essay scoring, and innovative approaches to writing assessment.