Automatic Coding of Short Text Responses via

0 downloads 0 Views 2MB Size Report
puter adaptive testing (CAT) is constrained to the ..... sulting matrix could be interpreted accordingly in that .... ically evoke more complex answers in linguistic.
Automatic Coding of Short Text Responses via Clustering in Educational Assessment Fabian Zehner

1

1,3

1,3

, Christine Sälzer

2,3

, Frank Goldhammer

Technical University of Munich, 2 German Institute for International Educational Research, 3

Centre for International Student Assessment (ZIB) e.V.

Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated by using data collected in the

Student Assessment (PISA) 2012 in Germany.

Programme for International

Free text responses of 10 items with

n = 41, 990

responses in total were analyzed. We further examined the eect of dierent methods, parameter values, and sample sizes on performance of the implemented system. The system reached fair to good up to excellent agreement with human codings (.458

≤ κ ≤ .959).

Especially

items that are solved by naming specic semantic concepts appeared properly coded. system performed equally well with to

n = 249.

The

n ≥ 1661 and somewhat poorer but still acceptable down

Based on our ndings, we discuss potential innovations for assessment that are

enabled by automatic coding of short text responses.

Keywords:

Computer-Based Assessment, Automatic Coding, Automatic Short-Answer

Grading, Computer-Automated Scoring

Originally published in:

Zehner, F., Sälzer, C., & Goldhammer, F. (2016).

of short text responses via clustering in educational assessment.

Measurement, 76 (2), 280303.

Automatic coding

Educational and Psychological

doi: 10.1177/0013164415590022 epm.sagepub.com

Fabian Zehner, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Technical University of Munich; Christine Sälzer, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Technische Universität München; Frank Goldhammer, German Institute for International Educational Research, Centre for International Student Assessment (ZIB) e.V. Correspondence concerning this article should be addressed to Fabian Zehner, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Arcisstr. 21, 80331 Munich, Germany. E-mail: [email protected]

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

In automatic coding of short text responses

opens new doors in assessment. For example, com-

(AC) a computer categorizes or scores short text

puter adaptive testing (CAT) is constrained to the

responses. Such open-ended response formats, as

use of response formats that can be evaluated im-

opposed to closed ones, can improve an instru-

mediately. AC enlarges the scope of CAT to open-

ment's construct validity (e.g., for mathematics:

ended text responses. On the other hand, fourth,

cf.

Birenbaum & Tatsuoka, 1987; Bridgeman,

an AC-system has its own limitations, errors, and

1991; for reading: cf. Millis, Magliano, Wiemer-

costs (e.g., software development, manual coding

Hastings, Todaro, & McNamara, 2011; Rauch &

for training data, etc.). The reader might notice

Hartig, 2010; Rupp, Ferne, & Choi, 2006). Yet to-

that we use the term

day, closed formats have become state of the art,

mon

because they allow data to be easily processed and

as a special case of coding and do not want to re-

are more objective in nature. They do not require

strict AC's scope. In coding, a nominal category

human coders, who entail some disadvantages (cf.

is assigned to an element; for example, a response

Bejar, 2012): their subjective perspectives, vary-

could be categorized as either dealing with a -

ing abilities (e.g., coding experience and stamina),

nancial or social aspect.

the need for consensus building measures, and, in

scoring, the categories additionally have a natural

turn, a high demand on resources. Taking a step

order; in the example, the social response could

back and reconsidering the response format's im-

be favored over the nancial one.

scoring.

coding

instead of the com-

That is because we regard scoring

On the other hand, in

pact on construct validity, one realizes that these

Since Carlson and Ward (1988), several research

well-founded practical reasons in favor of closed

groups have been dealing with natural language

formats might not be the most important deci-

processing (NLP) striving to automatically code

sion criteria. Construct validity is the very base

short text responses.

of all subsequent outcomes (e.g., analysis, conclu-

strongly related eld of essay grading, the respec-

sions, treatments) and, hence, can be claimed to

tive methods are not commonly used in practice.

be the most important criterion. Closed response

Reasons for this, among others, may be propri-

formats should be used when the construct deni-

etary licenses, under which most o-the-shelf soft-

tion demands, or is supported by, closed response

ware in this eld is published, as well as a gen-

formats, and open-ended formats should be used

eral human skepticism towards machines process-

when the construct denition requires open-ended

ing natural language (cf. Dennett, 1991).

ones.

But in contrast to the

In fact, a large number of open NLP libraries

One way to avoid or to compensate weaknesses

are available which can be combined with statisti-

of human coding is AC. Its employment has sev-

cal components. In this paper, we propose a col-

eral implications. First, it is internally more con-

lection of baseline methods and related software

sistent, because the nal system is deterministic

components that can be used under open licenses.

and always takes the same decision paths. This,

We implemented and integrated these components

in turn, circumvents consensus building measures.

and report analyses demonstrating the method

Second, the computer has stamina not to fatigue

collection's feasibility and accuracy by applying it

even at big data.

Third, the coding of a new,

to data assessed at the Programme for Inter-

unseen text response is done instantly and, thus,

national Student Assessment (PISA) 2012

2

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

PISA is initiated by the Organ-

It is beyond the scope of this paper to pro-

isation for Economic Co-operation and

vide an extensive overview about existing AC-

Development (OECD) and assesses scientic,

systems.

mathematical, and reading literacy in a large-

Burrows, Gurevych, and Stein (2014) who give a

scale context, partly by administering open-ended

comprehensive overview. The main dierences of

items. Large-scale assessments, in particular, typ-

the here presented approach and software opposed

ically require enormous eort and resources in

to existing systems are the following. On the one

terms of human coding of open-ended responses.

hand, the approach does not yet include the most

Naturally, they are a highly suitable eld to ap-

sophisticated machine learning methods with the

ply AC. This seems particularly true for inter-

advantage of higher exibility and transparency

national studies, since they inherently endeavor

for researchers of the social sciences by, among

to maximize consistency across dierent test lan-

others, applying clustering.

guages (cf. Organization for Economic Coopera-

the software will be made freely available with a

tion and Development, 2013).

graphical interface to encourage researchers to use

in Germany.

The three main constructs assessed in PISA ex-

we

would

like

to

refer

to

On the other hand,

AC.

emplarily show the importance of incorporating open-ended response formats.

Rather,

Our study was led by three research questions

First, PISA de-

that investigate, rst, the performance of the AC-

nes mathematical literacy as the capacity to

system, second, the need of single steps in the

formulate, employ, and interpret mathematics

proposed collection of methods as well as possi-

(Organization for Economic Cooperation and De-

ble alternate congurations, and, third, the re-

velopment, 2013, p. 25); particularly the rst two

quired sample size for its employment. This paper

elements can hardly be validly assessed in closed

aims to demonstrate the performance of an AC-

formats. Second, scientic literacy is, among oth-

system for short text responses which relies solely

ers, dened as an individual's scientic knowl-

on baseline-methods.

edge and use of that knowledge

...

Thereby, researchers shall

[as well as

be encouraged to design instruments with open-

their] awareness of how science and technology

ended response format where appropriate and as-

shape our material, intellectual and cultural envi-

sessment practitioners to use them. The following

ronments (p. 100). Particularly the latter clearly

sections present the collection of proposed meth-

shows the requirement for respondents to inte-

ods for AC, how the empirical analysis of these

grate individual knowledge,

and values,

methods was conducted, and the obtained results.

which is best assessed in open-ended formats; oth-

The results illustrate the overall performance of

erwise, the given response options would crucially

AC and how it depends on the item evoking the

inuence the respondents' cognition. This is also

free text response, the selection of a particular

true for, third, the construct of reading literacy,

method and its conguration as well as sample

dened as understanding, using, reecting on and

size.

engaging with written texts

ideas,

...

[which] requires

The nal discussion explicates exemplary

ways in which AC could enhance educational as-

some reection, drawing on information from out-

sessment.

side the text (p. 61).

3

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

1 Proposed Collection of Methods

1.1 Making Sense of Responses via NLP

for Automatic Coding

Techniques

In a rst step, the computer preprocesses the reThe following subsections describe methods that

sponse. Therefor, NLP-techniques split the given

step by step process a test taker's free text re-

text response into tokens (words) and transform

sponse represented as a character string.

First,

these into more normalized ones; that is, their

NLP methods are used to transform each text re-

variation is reduced. The preprocessing steps are

sponse into a quantied representation of its se-

described in the following, and each step's eect

mantics. Second, clustering methods are used to

on the example response is visualized in Part A of

build a clustering model by grouping similar re-

Figure 1.

sponses based on the numerical representations

Punctuation removal omits punctuations, similar to 2) digit removal omitting digits. Next, 3) decapitalization converts all chars to lowercase. 1)

of the responses. Third, machine learning methods are applied to assign an interpretable code to each of these groups.

For easier comprehensibil-

For some languages, language-specic techniques

ity, an example response leads through the various

need to be added.

ing through a fentasy world.

For instance, in German 4)

umlaut-normalizing is used to change umlauts into their corresponding vowel (e.g., ä into a ). Split-

steps: Let A girl falling into and wanderbe a test

taker's response to an item asking what the main

ting the response into tokens, meant to be the

tokenizing. Next, it is advisable to apply 6) spelling correction. Then,

plot of Lewis Carroll's Alice in Wonderland

level of analysis, is called 5)

is about. Figure 1 illustrates the entire procedure using this example response.

functional words are omitted that do not carry crucial semantic information themselves. This is

The method selection was led by two main

stop word removal

paradigms that helpfully balance the simplicity

called 7)

of assumptions and their eectiveness in practice.

to words such as articles and conjunctions. As last

First, the so-called sen.

bag of words approach

very important variation reduction, 8)

is cho-

example now reads:

are considered separately, irrespective their or-

co-occurrence -paradigm

stemming

is applied, which means to cut o axes.

This means, all terms given in a response

der and references to each other.

and typically applies

world.

Second, the

serves as a tool for au-

The

girl, fall, wander, fantasy,

These preprocessing techniques are typically

tomatic extraction of semantic relationships be-

useful.

tween words. The idea is to use the phenomenon

reasonable to adapt their assembly. If for example,

that semantically similar terms occur in similar

digits are important in the item's solution, they,

contextsnot necessarily in exactly the same con-

of course, should not be omitted but considered.

text but at least indirectly in similar ones (cf. Lan-

Furthermore, it is possible to replace stemming by

dauer, McNamara, Dennis, & Kintsch, 2011).

the more sophisticated

Nevertheless, in some cases, it might be

lemmatization

(using the

basic word form instead of just cutting o axes), and, dependent on language and application, it can be useful to apply

4

compound splitting

(split-

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

Figure 1: Schematic Figure of ACFirst, the computer preprocesses the response (A) and extracts its semantics (B). These numerical semantic representations are used for clustering (C) and machine learning in which a code is assigned to each cluster (D). Finally, a new response receives the code of the cluster which is most similar to it (E).

5

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

ting words that are made of more than one stem).

vectors.

Such semantic spaces can be computed

The above listed eight techniques have been used

by, for instance,

for analysis presented in this paper.

Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990) or

Now, having reduced linguistic variation in re-

Latent Semantic Analysis

Explicit Semantic Analysis

(LSA;

(ESA;

sponses, the computer needs some kind of numeri-

Gabrilovich & Markovitch, 2007). Both start with

cal representation of the remaining tokens' seman-

a co-occurrence matrix, called term×document-

bag of words 

matrix, carrying frequencies of how often which

tics (ignoring order information as a

approach is used). Therefor, a big text corpus is

term occurs in which context (i.e.,

automatically analyzed by a statistical method

paragraph, or sentence). This matrix is weighted

using the co-occurrence-paradigmto build the

based on the idea of relevance feedback (Salton &

machine's lexical knowledge. This is often called

Buckley, 1990), giving more weight to terms oc-

semantic space.

curring in rather few, specic contexts over those

Wikipedia, for instance, can

document,

serve as an adequate corpus. If feasible, multiple

occurring diusely across many contexts.

sources can lead to higher performance (Szarvas,

way, words that adhere to a specic semantic

Zesch, & Gurevych, 2011). The semantic spaces

contextwhich typically is rather true for autose-

used for analysis reported in this paper were built

mantic opposed to function wordsbecome im-

1

on basis of a German Wikipedia dump

This

portant carriers of this context's semantic infor-

(from

06-13-2013) as German text responses were to be

mation.

coded.

Solely in cases of items that only evoke

using the log-Entropy function (Dumais, 1991).

a strongly limited amount of dierent words in

Creating a semantic space by means of ESA in-

the responsesfor example, if test takers are re-

cludes several steps (for details see Gabrilovich &

quired to repeat four terms explicitly given in the

Markovitch, 2009) to nally gain a vector model

stimulusit can be conceivable to disregard the

describing the semantic space. The ESA is based

words' semantics. In such cases, it is also possible

on a manually structured text corpus; that is,

to test the plain existence of words and to con-

documents (such as articles in Wikipedia) are

tinue with clustering (see next subsection). Nev-

included as intact entities.

ertheless, in most cases considering the semantics

structure (e.g., composed of articles) each element

is worth the eort, and, in the empirical part, we

corresponds to one dimension in space, often lead-

investigate whether the usage of plain words can

ing to a very high number of dimensions. For in-

outperform the usage of semantics at all.

stance, a term, such as

fantasy )

ness between an article and the term.

is rep-

tion ),

fantasy

and

imagina-

which performs a singular value decomposition on the weighted term×document-matrix.

dened as two vectors having a small an-

Then, as

text corpora are samples of language containing

gle, or more precisely as a high cosine of two

1

The sec-

ond approach to create a semantic space is LSA,

resented by a vector. Similar vectors (terms) are semantically similar (e.g.,

fantasy, is represented by a

of articles) weights, each indicating the related-

These are hyper-dimensional

spaces in which each term (e.g.,

Within this natural

vector comprising over one million (total number

A common way to model semantics is through

vector space models.

Typically, relevance feedback is applied

Wikipedia dumps are available at:

noise, and as contexts are assumed to be correlated, the decomposed matrix is reduced to less

http://dumps.wikimedia.org/

6

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

suming LSA was used, a 300-dimensional vector.

often 300dimensions, comparable to factor anal-

2

fantasy

is rep-

Finally, the ve vectors' centroid is computed

resented by a vector comprising 300 values, each

that is one 300-dimensional vector constituting

indicating the relatedness between a latent seman-

the example response's total, or average, seman-

tic concept and the term. For details of LSA see

tics. In all further analyses this centroid vector is

Landauer (2011) and Martin and Berry (2011).

the response's representation.

ysis . For example, a term such as

An

important

specicity

of

matter

text

in

corpora.

LSA

is

domain-

When

taking

1.2 Model Building via Clustering

Wikipedia as a basis, one can make use of its Represented by their semantic centroid vector, all

internal, explicitly connected structure. One way

responses given in the data are spread across the

to gain a domain-specic corpus is to start at

semantic space. If an appropriate semantic space

an article which is strongly related to the item's topic (e.g.,

Alice's Adventures in Wonderland )

has been built, the responses now typically form groups (cf. part C in Figure 1). This means, some

and crawl the incoming and outgoing links, also called children, to related articles (e.g.,

Fantasy ).

responses are more semantically similar to each other than to othersto put it more precisely, re-

Dependent on the desired corpus size and the ini-

sponses in one group contain semantically similar

tial article's connectedness, it is reasonable to do

words.

this within a degree of connectedness of 1 to 3

be applied by conducting a cluster analysis.

(e.g., 2 meaning to take articles' children as well

In

case the data are meant to be explored only, all

as children's children into the corpus). This pro-

responses are taken into account for model build-

cedure can easily be adapted. For example, more

ing. Otherwise, if some sort of supervised learn-

than one starting point can be chosen to gather

ing takes placewhich is going to be explained in

bigger corpora, or lters can be applied which kind

the next subsection, only a subset of responses

of articles must not be included (e.g., articles only

(training data) is considered for this model build-

dealing with a specic date; for more such consid-

ing step.

erations see Gabrilovich & Markovitch, 2009).

Here,

Having provided the computer with semantic

an

agglomerative

hierarchical

cluster

3

analysis is used which comprises three steps.

knowledge, the semantics of the responses can be

First, the distance between every pair of responses

extracted by computing the centroid vector of all

is computed. Second, every response is regarded

tokens in the preprocessed response. At this stage,

as being its own cluster at the beginning. An ag-

the example response comprises ve tokens (cf. part B in Figure 1).

This perspective on the data can easily

glomeration method iteratively decides which two

Each is represented by, as-

clusters are minimal distant from each other and

2 Building

an analogy to factor analysis, the extracted semantic concepts would correspond to the factors in factor analysis, contexts (documents) would represent variables, and terms would stand for persons. The resulting matrix could be interpreted accordingly in that a single value of a term's vector would be the variable's factor loading, carrying the information of how semantically important the term (variable) reveals to be for the semantic concept (factor).

are to be merged to one cluster in the next iteration step. The agglomeration stops when all responses are assigned to one single cluster. Third, the researcher needs to decide which number of

3 Cluster

analyses comprise a large method family with several adaptations. Only one commonly used is described here.

7

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

clusters is the best solution.

This is an empiri-

response data only. This is reasonable for data ex-

cal matter and can be determined by plotting the

ploration and is called unsupervised learning; no

distances (rest-component) of clusters for each

further criterion is available (e.g., a variable indi-

solution and applying the elbow criterion.

cating whether a response is

The

correct

or

incorrect )

respective number of clusters constitutes the de-

which would allow model optimization with regard

sired solution at which the rest-component de-

to this criterion. For example, the researcher can

creases signicantly higher, relative to solutions

choose the number of clusters in an unsupervised

with more clusters.

See Rasch, Kubinger, and

manner as described above. But if the overall aim

Yanagida (2011) for a concise description how this

is to do some kind of supervised learning (e.g.,

is done using R or SPSS.

scoring responses), it is reasonable to vary the number of clusters and choose the best perform-

A cluster analysis has several adjustable paramOne important decision is how to dene

ing one in terms of humancomputer-agreement.

similarity, which can either be done by Euclidean

The following subsection about supervised learn-

distance orworthwhile although computation-

ing describes how an interpretation such as

ally more expensivecosine. The latter means to

or

eters.

incorrect

correct

can be assigned to every cluster.

divide the vectors' dot product (Martin & Berry, 2011) by the product of their vector lengths to ap-

1.3 Assigning Codes to Neutral Categories

ply an L2 norm which is done to neutralize dier-

via Supervised Machine Learning

ent document lengths (Gabrilovich & Markovitch, 2009).

Cosine's advantage over Euclidean dis-

Where the overall goal is AC, meaning that one

class needs to (e.g., correct ), an

tance is to take only the vector directions into

of a xed range of valuescalled

accountand not their lengthsbecause in the

be assigned to a text response

semantic space a vector's length mainly depends

external criterion is used by which to learn the

on the term's frequency, and as a term's frequency

relationship between the semantic vectors and the

does not carry semantic information, the distance

intended code.

metric should incorporate only the vector's direc-

supervised learning.

dis similarity, called arccosine,

This kind of procedure is called Such an external criterion

tion. Clustering algorithms base on

can be judgments by trained human coders, also

and hence, the cosine's inverse,

called classication. Ideally, the whole data set is

should be used.

Besides choosing a measure of

classied.

dissimilarity, the researcher needs to decide on the

Supervised

machine

learning

procedures

training

test.

are

number of clusters as well as on an agglomeration

separated into two phases:

method, such as Ward's method, which merges

each phase only a subset of data is used, while the

those two clusters leading to the smallest increase

rest is put aside to control overtting. Overtting

of variance within clusters (Ward, 1963). Never-

is given if a model is optimized too thoroughly,

theless, all these parameters' optima are suscep-

because it then performs very well on the data it

tible to the data's nature, and therefore, can be

was trained on but terribly poor on unseen data;

adapted with respect to theoretical or empirical

while to apply models on unseen data usually is

knowledge about the respective item.

the purpose of building them. Hence, the data is

Up to this point, it is possible to work with text

split into ten pieces, called

8

folds.

and

In

Nine folds are

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

used together at a time to train the model while its

ter the response is assigned to. Then, this cluster's

performance is tested on the remaining tenth fold

code is assumed to be the response's code (cf. part

afterward. This is repeated ten times so that every

E in Figure 1).

fold serves as test data once. Yet, two things need

At the end of the day, the automatic classier

to be considered. First, the data's overall class dis-

needs to be evaluated.

tribution should be represented in each fold; this

called

is called

stratication.

Second, because still then

accuracy.

Its performance is often

The simplest accuracy coecient

is percentage of agreement between computer and

chance highly impacts performance while choosing

human.

the folds, the whole procedure is again repeated

corrects for skewed class distribution.

Another important one is kappa, which

ten times, resulting in one hundred performance estimates. This is called

cross-validation ;

stratied, repeated tenfold

2 Methods

details can be found at Witten,

Frank, and Hall (2011) and for comparison with 2.1 Measures of Coder Agreement

other methods see Borra and Di Ciaccio (2010). In the training phase, clusters are built as de-

The

Results

section

reports

percentages

of

scribed in the previous subsection. Then, to each

humancomputer-agreement (%h:c ) as well as Co-

cluster one code is assigned to (cf. part D in Fig-

hen's kappa (κh:c ) for data from PISA 2012. These

ure 1), determined by the within-cluster class dis-

statistics were computed by averaging the re-

tribution.

sults from repeated tenfold cross-validation. The

For example while scoring, a cluster

might mainly comprise responses labelled as

rect ;

correct.

hence, the cluster code is:

cor-

kappa coecient was transformed prior to aver-

The de-

aging accordingly to Fisher (1915, 1921) and re-

cision which code is assigned to a cluster is based

transformed for reporting, on which the denition

on the highest conditional probability using Bayes P (gj |ci )∗P (ci ) , representing theorem, P (ci |gj )k,j = P (gj )

of Fleiss (1981) is applied: Values of

the probability that response

j th cluster g , has c by the external

k,

ith

higher

class

Moreover, a corrected coecient of percentage of %h:ci −%h:c1 , is reported, giving agreement, λh:c = 100−%h:c1

criterion (e.g., human coder).

For example, the probability of a response that is known to be member of cluster

13,

correct

the proportion of the actual humancomputer-

which is true

agreement's increase to the highest attainable in-

that it had been

crease; therefor, the percentage of agreement for

by the human coder,

the model with only one cluster (%h:c1 ) is sub-

ˆ (gj ) = 25% of all responses, for P assigned to the class

fair to good, lower values poor, and ones excellent agreement beyond chance.

constitute

belonging to the

been assigned to the

.40 ≤ κ ≤ .75

ˆ (ci ) = 76% of all responses, which is true for P ˆ (gj |ci ) = 31% of all correct reequals 96% if P sponses belong to cluster 13.

tracted from the percentage of agreement for the nal model with

i

clusters (%h:ci ), and this dif-

ference is divided by

100

minus the percentage of

agreement for the model with one cluster, con-

In the test phase, the unseen responses are classied by using the cluster model and the cluster

stituting the highest attainable increase.

codes. At rst, the highest similarity, for instance

Kappa, this coecient does not only correct for

in terms of cosine, between the unseen response

agreement by chance based on the class distribu-

and a cluster centroid determines to which clus-

tion but also for the trivial coding of empty re-

9

Unlike

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

sponses. It furthermore indicates how clustering

terpret

aects performance in the data.

ically evoke more complex answers in linguistic

Finally, Fleiss'

and

Reect and Evaluate

kappa for multiple raters is used to report the sys-

terms than the third aspect does.

tem's internal reliability across repetitions (κc:c ).

pect,

that both typ-

The third as-

Access and Retrieve, was only represented by

one item and is assigned to items asking the test taker to nd the relevant information given in the 2.2 Materials

stimulus and repeat it, resulting in low language diversity in the responses.

Items in PISA typically comprise a stimulus and a question referring to it. Responses used for analysis stem from ten dichotomous items; eight items assessing reading literacy as well as one assessing

2.3 Analyses

scientic and mathematical literacy each. That is, there are items of three dierent constructs and

According to the rst research question, we report

specic coding guides for human coders for ev-

the measures of coder agreement introduced above

ery item guiding them to code responses as either

for all items to show the AC-performance.

correct

Although only items with di-

each item, we varied the number of clusters from

chotomous coding were used in this study, consti-

11000 and report the best performing solution us-

tuting the vast majority in PISA, the chosen ap-

ing cosine and Ward's method for clustering. We

proach to AC is also applicable to polytomously

cross-validated all measures by stratied, repeated

coded items.

(ten times) tenfold cross-validation, which is also

or

incorrect.

Due to repeated measurements,

true for the analyses described next.

partly using the same items across cycles, the item contents are condential and cannot be described here.

For

Following the second research question, the rela-

The ten items are listed in Table 1, each

tionship between performance and dierent meth-

given a name for better traceability in the follow-

ods or their congurations were examined. There-

ing. Prior to item selection, a theoretical frame-

for, seven analyses were carried out and are re-

work had been developed which item character-

ported for single representative items. In Analysis

istics might potentially inuence AC-performance

I, the need for semantic space building opposed to

using the proposed methods.

According to this

the use of plain words is demonstrated. Analysis

scheme, the selected items were intended to vary

II illuminates the impact of spelling correction on

heterogeneously in their characteristics to test the

AC-performance by comparing uncorrected, auto-

AC-method's scope.

matically corrected, and manually corrected re-

As presented in the table, the items ranged

sponses. The latter was possible by means of the

from dicult (10%) to easy (83%) with the major-

transcription process (cf.

ity being medium dicult and a slightly skewed

mantic space building was varied in III using ESA

distribution towards easiness.

For the reading

or LSA, in IV by using dierent text corpora, and

items, the assessed aspect according to the PISA

in V by varying the number of LSA dimensions.

framework (Organization for Economic Coopera-

In VI, various common distance metrics were ap-

tion and Development, 2013) mainly varied be-

plied (

tween the two of three aspects

next subsection).

Se-

cosine, Euclidean, Chebyshev, Manhattan, Canberra ), and in Analysis VII, we studied the im-

Integrate and In-

10

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

Table 1: Item Characteristics

Item

Domaina Aspectb Correct

1·Explain Protagonist's Feeling

read

B

2·Evaluate Statement

read

C

3·Interpret the Author's Intention

read

B

4·List Recall

read

A

5·Evaluate Stylistic Element

read

C

6·Verbal Production

read

B

7·Select and Judge

read

C

8·Explain Story Element

read

B

9·Math

math

M

scie

S

10·Science

Total a b

c

83% 43% 10% 59% 56% 80% 68% 69% 35% 58% 56%

Wordsc

n 4,152 4,234 4,234 4,223 4,234 4,152 4,152 4,223 4,205 4,181 41,990

12.3 (4.6) 15.6 (9.0) 12.5 (6.3) 5.6 (3.0) 14.7 (6.2) 12.4 (6.9) 13.6 (7.0) 14.4 (5.5) 14.0 (6.8) 11.1 (5.2) 12.6 (6.1)

one of read ing, math ematics, scie nce according to PISA framework (Organization for Economic Cooperation and Development, 2013), A = Access & Retrieve, B = Integrate & Interpret, C = Reect & Evaluate, M = Uncertainty & Data, S = Explain Phenomena Scientically word count in non-empty responses on average (with SD)

Ward's 4 and McQuitty's method ; Single, Complete, Average, Median, and Centroid Linkage ). For each pact of dierent agglomeration methods (

peated tenfold cross-validation, in that, we rst split the data into ten stratied parts (called metafolds in the following) and subsequently ex-

variation all other parameters and methods were

cluded one metafold at a time until only one

set constant to those from the analyses for the

was

rst research question despite the number of clus-

all left metafolds were merged and an indepen-

ters which was set with respect to the performance

dent stratied, repeated (ten times) tenfold cross-

optimum. Percentages of agreement served as de-

validation was run on them, disregarding the pre-

pendent variable in these experiments. The runs

vious metafold assignments.

within one experiment were compared by consid-

the rst reduction

ering the dierence of means and the dispersions,

sponses contributed to the analysis, in the next

for which percentage of agreement is the optimal

step only

left.

At

each

of

these

reduction

steps,

For example, after

n = 4152 − 415 = 3737

re-

n = 3737 − 415 = 3322 did. Next, when only 415 responses were left, the last remain-

measure. Finally, in the third research question, we went

ing metafold was split into another ten smaller

about the relationship between performance and

metafolds to gain more ne-grained reduction

sample size,

investigating to which extent the

steps within the last tenth; and again, we sub-

methods were applicable to studies with less data

sequently excluded these smaller metafolds until

by conducting a simulation.

To reduce random

only one was left and run independent repeated

errors, we designed the simulation similar to re-

tenfold cross-validations on the remaining data at each step. This procedure was repeated ten times,

4 As implemented in R's hclust-function,

the analysis distinguishes ward.D2, which squares cluster distances before updating, and ward.D, which does not.

each time shifting the order of metafold exclusion to reduce the impact of the metafold sam-

11

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

pling itself. Figure 2 visualizes this design. The

stem from transcription typos) using spelling cor-

simulation used the data of item 7·Select and

rection.

Judge, which turned out to be representative re-

revealed to contain a typo this way, which were

garding the obtained results in various analyses

corrected in the data. In analyses, the original re-

and was automatically coded medium wellwith

sponses were used disregarding the manual correc-

that, being sensible to improvements and worsen-

tion but applying an automatic spelling correction

ing caused by sample size variation. Again, all pa-

on students' misspellings.

At total, 0.2 percent of all words were

rameters and methods were set constant to those from the analyses for the rst research question

2.5 Software

despite the number of clusters which was set with

The software programmed for the above described

respect to the performance optimum.

procedure implements open software, libraries and packages.

loaded Wikipedia dump (cf.

The responses we analyzed come from the German PISA 2012 sample.

2008), which also comes with an application programming interface to access the corpus data.

a representative sample of ninth-graders in Ger-

DKPro Similarity (Bär, Zesch, & Gurevych,

many. A detailed sample description can be found

2013), which in turn primarily utilizes S-Space

at Prenzel, Sälzer, Klieme, and Köller (2013) and

(Jurgens & Stevens, 2010), is used to build a vec-

Organization for Economic Cooperation and De-

tor space model. The response processing makes

Due to a booklet design,

use

the numbers of test takers varied for each item (4152

≥ n ≥ 4234;

In PISA 2012,

oered

in

DKPro

Core

For stemming, Snowball (Porter, 2001) is used. For statistical matters, such as clustering, the soft-

be scanned and responses were transcribed by six

ware evokes R (R Core Team, 2014).

To reduce typos, an additional mecha-

nism was employed.

components

UIMA Framework (Ferrucci & Lally, 2004).

were

assessed paper-based. Hence, booklets needed to

persons.

of

(Gurevych et al., 2007), which t into the Apache

cf. Table 1).

reading, maths, and science

subsection

built by using JWPL (Zesch, Müller, & Gurevych,

This includes a repre-

sentative sample of 15-year-old students as well as

velopment (2014).

First, a database storing the down-

Making Sense of Responses via NLP Techniques) is

2.4 Participants and Procedure

Misspelled terms were an-

3 Results

notated with their correct spelling, which additionally served as upper bound of achievable improvement by spelling correction in Analysis II

3.1 Overall Performance and Its

Relationship Between Performance and Alternation of Parameter Values or Methods). Only for the additional typo reduction (cf. subsection

Relationship to Item Characteristics

As indicated by Table 2, the AC can lead to

mechanism, the misspelled terms were replaced

good up to excellent humancomputer-agreement;

by their annotation and these manually corrected

still, its performance varied by item remarkably.

data were put into Microsoft Word 2013 and

Across all items, the system reached high per-

checked for left misspellings (that now could only

centages of agreement from

12

76%

to

98%,

whereas

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

Figure 2: Sample Size Simulation DesignThe

x -labels indicate which metafold was excluded in which

order; for example, in the second repetition (t

= 2),

the data in metafold E is included in the

rst three runs and excluded for all subsequent ones (x4) of this repetition. See the text for a more detailed description. Table 2: Accuracy and Reliability

Item 1·Explain Protagonist's Feeling 2·Evaluate Statement 3·Interpret the Author's Intention 4·List Recall 5·Evaluate Stylistic Element 6·Verbal Production 7·Select and Judge 8·Explain Story Element 9·Math 10·Science

a b c d

%h:c a 91.2 [±0.2] 86.5 [±0.3] 90.4 [±0.2] 98.4 [±0.1] 76.2 [±0.4] 92.7 [±0.2] 86.6 [±0.3] 86.7 [±0.3] 85.0 [±0.3] 88.4 [±0.3]

λh:c b 16.5 63.8 9.6 84.2 12.4 39.7 39.2 41.5 56.6 45.4

.533 .729 .458 .955 .503 .704 .679 .676 .669 .761

κh:c c [.519; .547] [.723; .735] [.444; .472] [.952; .959] [.495; .511] [.695; .713] [.672; .686] [.668; .683] [.661; .676] [.754; .767]

κc:c d .814 .910 .738 .983 .771 .925 .875 .886 .822 .918

95% condence intervals given in brackets percentage of humancomputer-agreement %h:ci −%h:c1 relative accuracy increase by clustering; λh:c = 100−% (for details cf. Methods section) h:c1 Cohen's kappa for humancomputer-agreement Fleiss' kappa for within-computer reliability across repetitions

the kappa values ranged from fair to good

ties the just named three items as performing

item 3·Interpret the Author's Intention

poorest but still to a fair to good extent, the coef-

(κh:c

cient

= .458)up

to excellent agreement be-

yond chanceitems 10·Science (κh:c and 4·List Recall (κh:c

= .955).

λh:c

underlines that their AC-performance

did not crucially improve by clustering (9.6

= .761)

λh:c ≤ 16.5).

Besides item



A detailed analysis showed that

3, items 1·Explain Protagonist's Feeling

the responses' correctness to items 3 and 5 could

= .533) and 5·Evaluate Stylistic Element (κh:c = .503) attained the poorest agree-

partially be aected by other linguistic elements

ment beyond chance, and the ve other items

rather poor AC were not obvious. On the other



hand, especially for item 4·List Recall, the

Whereas the kappa coecient iden-

humancomputer-agreement was excellent and it

(κh:c

reached good agreement beyond chance (.669

κh:c ≤ .729).

than pure semantics, while the reasons for item 1's

13

Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76 (2), 280303. doi: 10.1177/0013164415590022

increased remarkably by clustering (λh:c

= 84.2)

showed that using a semantic space opposed to

as it also did for item 2·Evaluate Statement

simply using the presence of words results in sig-

(λh:c

= 63.8)

and 9·Math (λh:c

= 56.6).

These

nicantly better AC,

t(197) = 3.34, p < .001, d =

items share one important characteristic, namely,

0.94(=0.3% ˆ )although

the evoked response's correctness is determined by

the conservative approach. Whereas the semantic

the test taker expressing specic semantic con-

space needed 65 clusters to reach its optimal per-

In item 4·List Recall, this is done in

formance, the usage of raw words already required

cepts.

a very simple way by just listing terms while

with small eect due to

500 clusters for this simple item to do so.

in the other items the verbal constructions need

Analysis II studied the impact of spelling cor-

to be more complex to be able to express the

rection on performance comparing three condi-

required semantics.

Finally, the system's relia-

tions: no, automatic, and manual spelling correc-

bility mainly attained excellent agreement with

tion. The latter constituted an upper bound for

.771 ≤ κc:c ≤ .983,

the possible eect of spelling correction on perfor-

except of the AC of the al-

ready discussed item 3 (κc:c

= .738).

mance how much an almost perfect spelling correction could improve the AC. Using an ANOVA, the analysis revealed that the automatic spelling

3.2 Relationship Between Performance and

correction neither improved performance signi-

Alternation of Parameter Values or

cantly opposed to dispensing with it nor did the

Methods

performance with manual spelling correction differ signicantly,

Competing methods and parameter values were compared

in

seven

analyses

with

regard

To

further investigate if this nding can be general-

to

ized across items for the PISA data, cross-checks

percentage of humancomputer-agreement (%h:c )

on other items were run.

whereas the sets of one hundred percentage values

These showed, the

necessity of spelling correction depends on the

from stratied, repeated tenfold cross-validation

item's naturethe same analysis with the data

served as sample for the respective experimental conditions.

F (297, 2) = 2.1, p = .130.

of item 10·Science yielded in similarly insignif-

Most analyses were conducted using

icant results,

the data of item 7·Select and Judge constitut-

F (297, 2) = 2.4, p = .089,

whereas

performance with and without spelling correction

ing a medium well working item for AC. This way,

for item 6·Verbal Production diered signi-

the measure of agreement could optimally be af-

cantly with relevant eect size,

fected by the experimental variation, which would

t(197) = 5.81, p

Suggest Documents