Automatic Coding of Short Text Responses via Clustering in Educational Assessment

Fabian Zehner (1,3), Christine Sälzer (1,3), Frank Goldhammer (2,3)

(1) Technical University of Munich, (2) German Institute for International Educational Research, (3) Centre for International Student Assessment (ZIB) e.V.
Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated by using data collected in the Programme for International Student Assessment (PISA) 2012 in Germany. Free text responses of 10 items with n = 41,990 responses in total were analyzed. We further examined the effect of different methods, parameter values, and sample sizes on performance of the implemented system. The system reached fair to good up to excellent agreement with human codings (.458 ≤ κ ≤ .959). Especially items that are solved by naming specific semantic concepts appeared properly coded. The system performed equally well with n ≥ 1661 and somewhat poorer but still acceptable down to n = 249. Based on our findings, we discuss potential innovations for assessment that are enabled by automatic coding of short text responses.
Keywords: Computer-Based Assessment, Automatic Coding, Automatic Short-Answer Grading, Computer-Automated Scoring
Originally published in: Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76(2), 280-303. doi: 10.1177/0013164415590022. epm.sagepub.com
Fabian Zehner, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Technical University of Munich; Christine Sälzer, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Technische Universität München; Frank Goldhammer, German Institute for International Educational Research, Centre for International Student Assessment (ZIB) e.V. Correspondence concerning this article should be addressed to Fabian Zehner, TUM School of Education, Centre for International Student Assessment (ZIB) e.V., Arcisstr. 21, 80331 Munich, Germany. E-mail:
[email protected]
In automatic coding of short text responses (AC), a computer categorizes or scores short text responses. Such open-ended response formats, as opposed to closed ones, can improve an instrument's construct validity (e.g., for mathematics: cf. Birenbaum & Tatsuoka, 1987; Bridgeman, 1991; for reading: cf. Millis, Magliano, Wiemer-Hastings, Todaro, & McNamara, 2011; Rauch & Hartig, 2010; Rupp, Ferne, & Choi, 2006). Yet today, closed formats have become state of the art, because they allow data to be easily processed and are more objective in nature. They do not require human coders, who entail some disadvantages (cf. Bejar, 2012): their subjective perspectives, varying abilities (e.g., coding experience and stamina), the need for consensus building measures, and, in turn, a high demand on resources. Taking a step back and reconsidering the response format's impact on construct validity, one realizes that these well-founded practical reasons in favor of closed formats might not be the most important decision criteria. Construct validity is the very base of all subsequent outcomes (e.g., analysis, conclusions, treatments) and, hence, can be claimed to be the most important criterion. Closed response formats should be used when the construct definition demands, or is supported by, closed response formats, and open-ended formats should be used when the construct definition requires open-ended ones.

One way to avoid or to compensate weaknesses of human coding is AC. Its employment has several implications. First, it is internally more consistent, because the final system is deterministic and always takes the same decision paths. This, in turn, circumvents consensus building measures. Second, the computer has stamina not to fatigue even at big data. Third, the coding of a new, unseen text response is done instantly and, thus, opens new doors in assessment. For example, computer adaptive testing (CAT) is constrained to the use of response formats that can be evaluated immediately. AC enlarges the scope of CAT to open-ended text responses. On the other hand, fourth, an AC-system has its own limitations, errors, and costs (e.g., software development, manual coding for training data, etc.). The reader might notice that we use the term coding instead of the common scoring. That is because we regard scoring as a special case of coding and do not want to restrict AC's scope. In coding, a nominal category is assigned to an element; for example, a response could be categorized as either dealing with a financial or a social aspect. On the other hand, in scoring, the categories additionally have a natural order; in the example, the social response could be favored over the financial one.

Since Carlson and Ward (1988), several research groups have been dealing with natural language processing (NLP), striving to automatically code short text responses. But in contrast to the strongly related field of essay grading, the respective methods are not commonly used in practice. Reasons for this, among others, may be proprietary licenses, under which most off-the-shelf software in this field is published, as well as a general human skepticism towards machines processing natural language (cf. Dennett, 1991). In fact, a large number of open NLP libraries are available which can be combined with statistical components. In this paper, we propose a collection of baseline methods and related software components that can be used under open licenses. We implemented and integrated these components and report analyses demonstrating the method collection's feasibility and accuracy by applying it to data assessed at the Programme for International Student Assessment (PISA) 2012 in Germany.
PISA is initiated by the Organisation for Economic Co-operation and Development (OECD) and assesses scientific, mathematical, and reading literacy in a large-scale context, partly by administering open-ended items. Large-scale assessments, in particular, typically require enormous effort and resources in terms of human coding of open-ended responses. Naturally, they are a highly suitable field to apply AC. This seems particularly true for international studies, since they inherently endeavor to maximize consistency across different test languages (cf. Organization for Economic Cooperation and Development, 2013).

The three main constructs assessed in PISA exemplarily show the importance of incorporating open-ended response formats. First, PISA defines mathematical literacy as the capacity to "formulate, employ, and interpret mathematics" (Organization for Economic Cooperation and Development, 2013, p. 25); particularly the first two elements can hardly be validly assessed in closed formats. Second, scientific literacy is, among others, defined as "an individual's scientific knowledge and use of that knowledge ... [as well as their] awareness of how science and technology shape our material, intellectual and cultural environments" (p. 100). Particularly the latter clearly shows the requirement for respondents to integrate individual knowledge, ideas, and values, which is best assessed in open-ended formats; otherwise, the given response options would crucially influence the respondents' cognition. This is also true for, third, the construct of reading literacy, defined as "understanding, using, reflecting on and engaging with written texts ... [which] requires some reflection, drawing on information from outside the text" (p. 61).

It is beyond the scope of this paper to provide an extensive overview about existing AC-systems. Rather, we would like to refer to Burrows, Gurevych, and Stein (2014), who give a comprehensive overview. The main differences of the here presented approach and software opposed to existing systems are the following. On the one hand, the approach does not yet include the most sophisticated machine learning methods, with the advantage of higher flexibility and transparency for researchers of the social sciences by, among others, applying clustering. On the other hand, the software will be made freely available with a graphical interface to encourage researchers to use AC.

Our study was led by three research questions that investigate, first, the performance of the AC-system, second, the need of single steps in the proposed collection of methods as well as possible alternate configurations, and, third, the required sample size for its employment. This paper aims to demonstrate the performance of an AC-system for short text responses which relies solely on baseline methods. Thereby, researchers shall be encouraged to design instruments with open-ended response format where appropriate and assessment practitioners to use them. The following sections present the collection of proposed methods for AC, how the empirical analysis of these methods was conducted, and the obtained results. The results illustrate the overall performance of AC and how it depends on the item evoking the free text response, the selection of a particular method and its configuration as well as sample size. The final discussion explicates exemplary ways in which AC could enhance educational assessment.
1 Proposed Collection of Methods for Automatic Coding

1.1 Making Sense of Responses via NLP Techniques
The following subsections describe methods that step by step process a test taker's free text response represented as a character string. First, NLP methods are used to transform each text response into a quantified representation of its semantics. Second, clustering methods are used to build a clustering model by grouping similar responses based on the numerical representations of the responses. Third, machine learning methods are applied to assign an interpretable code to each of these groups. For easier comprehensibility, an example response leads through the various steps: Let "A girl falling into and wandering through a fentasy world." be a test taker's response to an item asking what the main plot of Lewis Carroll's Alice in Wonderland is about. Figure 1 illustrates the entire procedure using this example response.

The method selection was led by two main paradigms that helpfully balance the simplicity of assumptions and their effectiveness in practice. First, the so-called bag of words approach is chosen. This means, all terms given in a response are considered separately, irrespective of their order and references to each other. Second, the co-occurrence-paradigm serves as a tool for automatic extraction of semantic relationships between words. The idea is to use the phenomenon that semantically similar terms occur in similar contexts, not necessarily in exactly the same context but at least indirectly in similar ones (cf. Landauer, McNamara, Dennis, & Kintsch, 2011).
In a first step, the computer preprocesses the response. Therefor, NLP-techniques split the given text response into tokens (words) and transform these into more normalized ones; that is, their variation is reduced. The preprocessing steps are described in the following, and each step's effect on the example response is visualized in Part A of Figure 1. 1) Punctuation removal omits punctuation, similar to 2) digit removal omitting digits. Next, 3) decapitalization converts all characters to lowercase. For some languages, language-specific techniques need to be added. For instance, in German, 4) umlaut-normalizing is used to change umlauts into their corresponding vowel (e.g., ä into a). Splitting the response into tokens, meant to be the level of analysis, is called 5) tokenizing. Next, it is advisable to apply 6) spelling correction. Then, functional words are omitted that do not carry crucial semantic information themselves. This is called 7) stop word removal and typically applies to words such as articles and conjunctions. As a last, very important variation reduction, 8) stemming is applied, which means to cut off affixes. The example now reads: girl, fall, wander, fantasy, world.

These preprocessing techniques are typically useful. Nevertheless, in some cases, it might be reasonable to adapt their assembly. If, for example, digits are important in the item's solution, they, of course, should not be omitted but considered. Furthermore, it is possible to replace stemming by the more sophisticated lemmatization (using the basic word form instead of just cutting off affixes), and, dependent on language and application, it can be useful to apply compound splitting (splitting words that are made of more than one stem). The above listed eight techniques have been used for the analyses presented in this paper.
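To make the eight steps concrete, the following is a minimal Python sketch of such a preprocessing pipeline. It is illustrative only: the paper's actual implementation uses Java/UIMA components on German responses, the stop word list here is a toy one, and spelling correction is only stubbed.

# Minimal sketch of the eight preprocessing steps; not the authors' pipeline.
import re
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = {"a", "the", "and", "into", "through"}  # toy stop word list
stemmer = SnowballStemmer("english")                 # the study used German Snowball

def preprocess(response: str) -> list:
    text = re.sub(r"[^\w\s]", " ", response)   # 1) punctuation removal
    text = re.sub(r"\d", " ", text)            # 2) digit removal
    text = text.lower()                        # 3) decapitalization
    for umlaut, vowel in (("ä", "a"), ("ö", "o"), ("ü", "u")):
        text = text.replace(umlaut, vowel)     # 4) umlaut normalizing (German)
    tokens = text.split()                      # 5) tokenizing
    # 6) spelling correction would go here (e.g., "fentasy" -> "fantasy")
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 7) stop word removal
    return [stemmer.stem(t) for t in tokens]   # 8) stemming

print(preprocess("A girl falling into and wandering through a fentasy world."))
# without step 6, roughly: ['girl', 'fall', 'wander', 'fentasi', 'world']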
Figure 1: Schematic figure of AC. First, the computer preprocesses the response (A) and extracts its semantics (B). These numerical semantic representations are used for clustering (C) and machine learning in which a code is assigned to each cluster (D). Finally, a new response receives the code of the cluster which is most similar to it (E).
Now, having reduced linguistic variation in responses, the computer needs some kind of numerical representation of the remaining tokens' semantics (ignoring order information, as a bag of words approach is used). Therefor, a big text corpus is automatically analyzed, by a statistical method using the co-occurrence-paradigm, to build the machine's lexical knowledge. This is often called a semantic space. Wikipedia, for instance, can serve as an adequate corpus. If feasible, multiple sources can lead to higher performance (Szarvas, Zesch, & Gurevych, 2011). The semantic spaces used for the analyses reported in this paper were built on the basis of a German Wikipedia dump (from 06-13-2013; dumps are available at http://dumps.wikimedia.org/), as German text responses were to be coded. Solely in cases of items that only evoke a strongly limited amount of different words in the responses (for example, if test takers are required to repeat four terms explicitly given in the stimulus), it can be conceivable to disregard the words' semantics. In such cases, it is also possible to test the plain existence of words and to continue with clustering (see next subsection). Nevertheless, in most cases considering the semantics is worth the effort, and, in the empirical part, we investigate whether the usage of plain words can outperform the usage of semantics at all.

A common way to model semantics is through vector space models. These are hyper-dimensional spaces in which each term (e.g., fantasy) is represented by a vector. Similar vectors (terms) are semantically similar (e.g., fantasy and imagination), defined as two vectors having a small angle, or more precisely as a high cosine of two vectors. Such semantic spaces can be computed by, for instance, Latent Semantic Analysis (LSA; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990) or Explicit Semantic Analysis (ESA; Gabrilovich & Markovitch, 2007). Both start with a co-occurrence matrix, called term×document-matrix, carrying frequencies of how often which term occurs in which context (i.e., document, paragraph, or sentence). This matrix is weighted based on the idea of relevance feedback (Salton & Buckley, 1990), giving more weight to terms occurring in rather few, specific contexts over those occurring diffusely across many contexts. This way, words that adhere to a specific semantic context, which typically is rather true for autosemantic opposed to function words, become important carriers of this context's semantic information. Typically, relevance feedback is applied using the log-entropy function (Dumais, 1991).

Creating a semantic space by means of ESA includes several steps (for details see Gabrilovich & Markovitch, 2009) to finally gain a vector model describing the semantic space. The ESA is based on a manually structured text corpus; that is, documents (such as articles in Wikipedia) are included as intact entities. Within this natural structure (e.g., composed of articles), each element corresponds to one dimension in space, often leading to a very high number of dimensions. For instance, a term such as fantasy is represented by a vector comprising over one million (total number of articles) weights, each indicating the relatedness between an article and the term. The second approach to create a semantic space is LSA, which performs a singular value decomposition on the weighted term×document-matrix.
Then, as text corpora are samples of language containing noise, and as contexts are assumed to be correlated, the decomposed matrix is reduced to fewer dimensions, often 300, comparable to factor analysis. (Building an analogy to factor analysis, the extracted semantic concepts would correspond to the factors in factor analysis, contexts (documents) would represent variables, and terms would stand for persons. The resulting matrix could be interpreted accordingly in that a single value of a term's vector would be the variable's factor loading, carrying the information of how semantically important the term (variable) reveals to be for the semantic concept (factor).) For example, a term such as fantasy is represented by a vector comprising 300 values, each indicating the relatedness between a latent semantic concept and the term. For details of LSA see Landauer (2011) and Martin and Berry (2011).
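As a sketch of this construction, the following Python snippet builds LSA-style term vectors from a raw term×document count matrix via log-entropy weighting and a truncated SVD. It assumes matrices small enough for a dense decomposition; the paper itself used the S-Space/DKPro stack rather than numpy.

# Sketch of LSA semantic space construction: log-entropy weighting of the
# term x document matrix, then rank reduction via SVD (cf. Dumais, 1991).
import numpy as np

def log_entropy_weight(counts):
    """counts: terms x documents matrix of raw frequencies."""
    gf = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    p = counts / gf                                # P(document | term)
    n_docs = counts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = 1 + np.nansum(p * np.log(p), axis=1) / np.log(n_docs)
    return np.log1p(counts) * entropy[:, None]     # local log weight x global weight

def lsa_term_vectors(counts, dims=300):
    weighted = log_entropy_weight(counts)
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)  # singular value decomposition
    k = min(dims, s.size)
    return u[:, :k] * s[:k]   # one k-dimensional semantic vector per term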
An important matter of LSA is domain-specificity in text corpora. When taking Wikipedia as a basis, one can make use of its internal, explicitly connected structure. One way to gain a domain-specific corpus is to start at an article which is strongly related to the item's topic (e.g., Alice's Adventures in Wonderland) and crawl the incoming and outgoing links, also called children, to related articles (e.g., Fantasy). Dependent on the desired corpus size and the initial article's connectedness, it is reasonable to do this within a degree of connectedness of 1 to 3 (e.g., 2 meaning to take articles' children as well as children's children into the corpus). This procedure can easily be adapted. For example, more than one starting point can be chosen to gather bigger corpora, or filters can be applied defining which kinds of articles must not be included (e.g., articles only dealing with a specific date; for more such considerations see Gabrilovich & Markovitch, 2009).
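A breadth-first sketch of such a crawl follows; get_linked_articles is a hypothetical stand-in for a link-access API such as the one JWPL provides, not an actual function of that library.

# Sketch: gather a domain-specific corpus by following article links
# up to a chosen degree of connectedness (here, 2 by default).
def crawl_corpus(start_articles, get_linked_articles, degree=2):
    corpus = set(start_articles)
    frontier = set(start_articles)
    for _ in range(degree):        # degree 2: children plus children's children
        frontier = {child for article in frontier
                    for child in get_linked_articles(article)} - corpus
        corpus |= frontier         # filters on unwanted article types could go here
    return corpus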
Having provided the computer with semantic knowledge, the semantics of the responses can be extracted by computing the centroid vector of all tokens in the preprocessed response. At this stage, the example response comprises five tokens (cf. part B in Figure 1). Each is represented by, assuming LSA was used, a 300-dimensional vector. Finally, the five vectors' centroid is computed; that is one 300-dimensional vector constituting the example response's total, or average, semantics. In all further analyses this centroid vector is the response's representation.
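In code, the extraction amounts to averaging the token vectors; a minimal sketch, in which tokens missing from the semantic space are simply skipped:

# Sketch: a response's semantic representation is the centroid of the
# LSA vectors of its preprocessed tokens.
import numpy as np

def response_centroid(tokens, term_vectors, dims=300):
    vectors = [term_vectors[t] for t in tokens if t in term_vectors]
    if not vectors:
        return np.zeros(dims)      # e.g., empty response after preprocessing
    return np.mean(vectors, axis=0)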
1.2 Model Building via Clustering

Represented by their semantic centroid vector, all responses given in the data are spread across the semantic space. If an appropriate semantic space has been built, the responses now typically form groups (cf. part C in Figure 1). This means, some responses are more semantically similar to each other than to others; to put it more precisely, responses in one group contain semantically similar words. This perspective on the data can easily be applied by conducting a cluster analysis. In case the data are meant to be explored only, all responses are taken into account for model building. Otherwise, if some sort of supervised learning takes place (which is going to be explained in the next subsection), only a subset of responses (training data) is considered for this model building step.

Here, an agglomerative hierarchical cluster analysis is used which comprises three steps. (Cluster analyses comprise a large method family with several adaptations. Only one commonly used variant is described here.) First, the distance between every pair of responses is computed. Second, every response is regarded as being its own cluster at the beginning. An agglomeration method iteratively decides which two clusters are minimally distant from each other and are to be merged to one cluster in the next iteration step. The agglomeration stops when all responses are assigned to one single cluster. Third, the researcher needs to decide which number of clusters is the best solution.
This is an empirical matter and can be determined by plotting the distances ("rest-component") of clusters for each solution and applying the elbow criterion. The respective number of clusters constitutes the desired solution at which the rest-component decreases significantly higher, relative to solutions with more clusters. See Rasch, Kubinger, and Yanagida (2011) for a concise description of how this is done using R or SPSS.
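A sketch of that elbow inspection, assuming a linkage matrix as returned by scipy's linkage (as used in the clustering sketch below); its third column holds the merge distances, which play the role of the rest-component plotted over candidate solutions:

# Sketch of the elbow criterion: plot the merge distances of the last
# agglomeration steps against the implied number of clusters.
import matplotlib.pyplot as plt

def plot_elbow(linkage_matrix, max_clusters=50):
    heights = linkage_matrix[:, 2][::-1][:max_clusters]  # largest merges first
    plt.plot(range(1, len(heights) + 1), heights, marker="o")
    plt.xlabel("number of clusters")
    plt.ylabel("merge distance (rest-component)")
    plt.show()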
scoring responses), it is reasonable to vary the number of clusters and choose the best perform-
A cluster analysis has several adjustable paramOne important decision is how to dene
ing one in terms of humancomputer-agreement.
similarity, which can either be done by Euclidean
The following subsection about supervised learn-
distance orworthwhile although computation-
ing describes how an interpretation such as
ally more expensivecosine. The latter means to
or
eters.
incorrect
correct
can be assigned to every cluster.
divide the vectors' dot product (Martin & Berry, 2011) by the product of their vector lengths to ap-
1.3 Assigning Codes to Neutral Categories
ply an L2 norm which is done to neutralize dier-
via Supervised Machine Learning
ent document lengths (Gabrilovich & Markovitch, 2009).
Cosine's advantage over Euclidean dis-
Where the overall goal is AC, meaning that one
class needs to (e.g., correct ), an
tance is to take only the vector directions into
of a xed range of valuescalled
accountand not their lengthsbecause in the
be assigned to a text response
semantic space a vector's length mainly depends
external criterion is used by which to learn the
on the term's frequency, and as a term's frequency
relationship between the semantic vectors and the
does not carry semantic information, the distance
intended code.
metric should incorporate only the vector's direc-
supervised learning.
dis similarity, called arccosine,
This kind of procedure is called Such an external criterion
tion. Clustering algorithms base on
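Put together, the clustering step might look as follows in Python, with scipy standing in for the R routines the software actually evokes. Note that combining Ward's method with the arccosine dissimilarity follows the paper's configuration rather than Ward's original Euclidean assumption.

# Sketch: arccosine dissimilarities between response centroids,
# agglomerated with Ward's method and cut at a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_responses(centroids, n_clusters):
    cosine_dist = pdist(centroids, metric="cosine")      # 1 - cosine similarity
    dissim = np.arccos(np.clip(1 - cosine_dist, -1, 1))  # arccosine dissimilarity
    tree = linkage(dissim, method="ward")                # agglomerative merging
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels                                  # dendrogram + cluster ids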
Up to this point, it is possible to work with text response data only. This is reasonable for data exploration and is called unsupervised learning; no further criterion is available (e.g., a variable indicating whether a response is correct or incorrect) which would allow model optimization with regard to this criterion. For example, the researcher can choose the number of clusters in an unsupervised manner as described above. But if the overall aim is to do some kind of supervised learning (e.g., scoring responses), it is reasonable to vary the number of clusters and choose the best performing one in terms of human-computer-agreement. The following subsection about supervised learning describes how an interpretation such as correct or incorrect can be assigned to every cluster.

1.3 Assigning Codes to Neutral Categories via Supervised Machine Learning

Where the overall goal is AC, meaning that one of a fixed range of values, called a class (e.g., correct), needs to be assigned to a text response, an external criterion is used by which to learn the relationship between the semantic vectors and the intended code. This kind of procedure is called supervised learning. Such an external criterion can be judgments by trained human coders, also called classification. Ideally, the whole data set is classified.

Supervised machine learning procedures are separated into two phases: training and test. In each phase only a subset of data is used, while the rest is put aside to control overfitting. Overfitting is given if a model is optimized too thoroughly, because it then performs very well on the data it was trained on but terribly poor on unseen data, while applying models to unseen data usually is the purpose of building them. Hence, the data is split into ten pieces, called folds.
Nine folds are used together at a time to train the model, while its performance is tested on the remaining tenth fold afterward. This is repeated ten times so that every fold serves as test data once. Yet, two things need to be considered. First, the data's overall class distribution should be represented in each fold; this is called stratification. Second, because still then chance highly impacts performance while choosing the folds, the whole procedure is again repeated ten times, resulting in one hundred performance estimates. This is called stratified, repeated tenfold cross-validation; details can be found at Witten, Frank, and Hall (2011), and for a comparison with other methods see Borra and Di Ciaccio (2010).
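The scheme can be sketched with scikit-learn's RepeatedStratifiedKFold; build_model and predict_codes are hypothetical placeholders for the clustering and code-assignment steps described here, not functions of the authors' software.

# Sketch of stratified, repeated (10x) tenfold cross-validation,
# yielding one hundred per-fold agreement estimates.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def cross_validate(X, y, build_model, predict_codes):
    splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    agreements = []
    for train_idx, test_idx in splitter.split(X, y):
        model = build_model(X[train_idx], y[train_idx])    # training phase
        predictions = predict_codes(model, X[test_idx])    # test phase, held-out fold
        agreements.append(np.mean(predictions == y[test_idx]))
    return np.array(agreements)                            # 100 estimates in total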
In the training phase, clusters are built as described in the previous subsection. Then, to each cluster one code is assigned (cf. part D in Figure 1), determined by the within-cluster class distribution. For example, while scoring, a cluster might mainly comprise responses labelled as correct; hence, the cluster code is: correct. The decision which code is assigned to a cluster is based on the highest conditional probability using Bayes' theorem,

P(c_i | g_j)_{k,j} = P(g_j | c_i) * P(c_i) / P(g_j),

representing the probability that response k, belonging to the j-th cluster g, had been assigned to the i-th class c by the external criterion (e.g., the human coder). For example, the probability that a response known to be a member of cluster 13 (which is true for P̂(g_j) = 25% of all responses) had been assigned to the class correct by the human coder (which is true for P̂(c_i) = 76% of all responses) equals 96% if P̂(g_j | c_i) = 31% of all correct responses belong to cluster 13.
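Because the cluster is the only conditioning variable, maximizing this posterior over codes reduces to taking the majority human code within each cluster: the denominator P(g_j) does not depend on the code. A minimal sketch:

# Sketch: assign to each cluster the code with the highest posterior
# P(c | g) over the training data; P(g) cancels, so the joint counts decide.
from collections import Counter, defaultdict

def assign_cluster_codes(cluster_ids, human_codes):
    joint = defaultdict(Counter)
    for g, c in zip(cluster_ids, human_codes):
        joint[g][c] += 1                     # frequency of (cluster, code) pairs
    return {g: codes.most_common(1)[0][0]    # argmax_c P(g | c) * P(c)
            for g, codes in joint.items()}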
In the test phase, the unseen responses are classified by using the cluster model and the cluster codes. At first, the highest similarity, for instance in terms of cosine, between the unseen response and a cluster centroid determines which cluster the response is assigned to. Then, this cluster's code is assumed to be the response's code (cf. part E in Figure 1).

At the end of the day, the automatic classifier needs to be evaluated. Its performance is often called accuracy. The simplest accuracy coefficient is the percentage of agreement between computer and human. Another important one is kappa, which corrects for skewed class distribution.

2 Methods

2.1 Measures of Coder Agreement

The Results section reports percentages of human-computer-agreement (%h:c) as well as Cohen's kappa (κh:c) for data from PISA 2012. These statistics were computed by averaging the results from repeated tenfold cross-validation. The kappa coefficient was transformed prior to averaging according to Fisher (1915, 1921) and retransformed for reporting, on which the definition of Fleiss (1981) is applied: Values of .40 ≤ κ ≤ .75 constitute fair to good, lower values poor, and higher ones excellent agreement beyond chance. Moreover, a corrected coefficient of percentage of agreement, λh:c = (%h:ci − %h:c1) / (100 − %h:c1), is reported, giving the proportion of the actual human-computer-agreement's increase to the highest attainable increase; therefor, the percentage of agreement for the model with only one cluster (%h:c1) is subtracted from the percentage of agreement for the final model with i clusters (%h:ci), and this difference is divided by 100 minus the percentage of agreement for the model with one cluster, constituting the highest attainable increase. Unlike kappa, this coefficient does not only correct for agreement by chance based on the class distribution but also for the trivial coding of empty responses. It furthermore indicates how clustering affects performance in the data. Finally, Fleiss' kappa for multiple raters is used to report the system's internal reliability across repetitions (κc:c).
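These coefficients can be sketched as follows in Python, with scikit-learn's cohen_kappa_score computing the per-fold kappas; the numeric arguments in the last two lines are purely illustrative, not values from the study.

# Sketch of the reported agreement coefficients.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def mean_kappa(per_fold_kappas):
    z = np.arctanh(np.asarray(per_fold_kappas))  # Fisher transformation
    return float(np.tanh(z.mean()))              # retransformed for reporting

def lambda_hc(pct_final, pct_one_cluster):
    # lambda = (%_{h:c_i} - %_{h:c_1}) / (100 - %_{h:c_1})
    return (pct_final - pct_one_cluster) / (100 - pct_one_cluster)

kappa = cohen_kappa_score([1, 0, 1, 1], [1, 0, 0, 1])  # one fold's human vs. computer codes
print(mean_kappa([kappa]), lambda_hc(90.0, 80.0))      # lambda -> 0.5 (hypothetical values)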
2.2 Materials

Items in PISA typically comprise a stimulus and a question referring to it. Responses used for analysis stem from ten dichotomous items; eight items assessing reading literacy as well as one each assessing scientific and mathematical literacy. That is, there are items of three different constructs and specific coding guides for human coders for every item, guiding them to code responses as either correct or incorrect. Although only items with dichotomous coding were used in this study, constituting the vast majority in PISA, the chosen approach to AC is also applicable to polytomously coded items. Due to repeated measurements, partly using the same items across cycles, the item contents are confidential and cannot be described here. The ten items are listed in Table 1, each given a name for better traceability in the following. Prior to item selection, a theoretical framework had been developed regarding which item characteristics might potentially influence AC-performance using the proposed methods. According to this scheme, the selected items were intended to vary heterogeneously in their characteristics to test the AC-method's scope.

As presented in the table, the items ranged from difficult (10%) to easy (83%), with the majority being medium difficult and a slightly skewed distribution towards easiness. For the reading items, the assessed aspect according to the PISA framework (Organization for Economic Cooperation and Development, 2013) mainly varied between two of the three aspects, Integrate and Interpret and Reflect and Evaluate, which both typically evoke more complex answers in linguistic terms than the third aspect does. The third aspect, Access and Retrieve, was only represented by one item and is assigned to items asking the test taker to find the relevant information given in the stimulus and repeat it, resulting in low language diversity in the responses.

2.3 Analyses

According to the first research question, we report the measures of coder agreement introduced above for all items to show the AC-performance. For each item, we varied the number of clusters from 1 to 1000 and report the best performing solution using cosine and Ward's method for clustering. We cross-validated all measures by stratified, repeated (ten times) tenfold cross-validation, which is also true for the analyses described next.

Following the second research question, the relationships between performance and different methods or their configurations were examined. Therefor, seven analyses were carried out and are reported for single representative items. In Analysis I, the need for semantic space building opposed to the use of plain words is demonstrated. Analysis II illuminates the impact of spelling correction on AC-performance by comparing uncorrected, automatically corrected, and manually corrected responses. The latter was possible by means of the transcription process (cf. next subsection). Semantic space building was varied in Analysis III using ESA or LSA, in IV by using different text corpora, and in V by varying the number of LSA dimensions. In VI, various common distance metrics were applied (cosine, Euclidean, Chebyshev, Manhattan, Canberra), and in Analysis VII, we studied the impact of different agglomeration methods (Ward's and McQuitty's method; Single, Complete, Average, Median, and Centroid Linkage). (As implemented in R's hclust function, the analysis distinguishes ward.D2, which squares cluster distances before updating, and ward.D, which does not.)
Table 1: Item Characteristics

Item                                 Domain (a)  Aspect (b)  Correct  n       Words (c)
1·Explain Protagonist's Feeling      read        B           83%      4,152   12.3 (4.6)
2·Evaluate Statement                 read        C           43%      4,234   15.6 (9.0)
3·Interpret the Author's Intention   read        B           10%      4,234   12.5 (6.3)
4·List Recall                        read        A           59%      4,223    5.6 (3.0)
5·Evaluate Stylistic Element         read        C           56%      4,234   14.7 (6.2)
6·Verbal Production                  read        B           80%      4,152   12.4 (6.9)
7·Select and Judge                   read        C           68%      4,152   13.6 (7.0)
8·Explain Story Element              read        B           69%      4,223   14.4 (5.5)
9·Math                               math        M           35%      4,205   14.0 (6.8)
10·Science                           scie        S           58%      4,181   11.1 (5.2)
Total                                                        56%      41,990  12.6 (6.1)

(a) one of reading, mathematics, science according to the PISA framework (Organization for Economic Cooperation and Development, 2013); (b) A = Access & Retrieve, B = Integrate & Interpret, C = Reflect & Evaluate, M = Uncertainty & Data, S = Explain Phenomena Scientifically; (c) average word count in non-empty responses (with SD)
For each variation, all other parameters and methods were set constant to those from the analyses for the first research question, except for the number of clusters, which was set with respect to the performance optimum. Percentages of agreement served as the dependent variable in these experiments. The runs within one experiment were compared by considering the difference of means and the dispersions, for which percentage of agreement is the optimal measure.

Finally, for the third research question, we examined the relationship between performance and sample size, investigating to which extent the methods were applicable to studies with less data by conducting a simulation. To reduce random errors, we designed the simulation similar to repeated tenfold cross-validation, in that we first split the data into ten stratified parts (called metafolds in the following) and subsequently excluded one metafold at a time until only one was left. At each of these reduction steps, all remaining metafolds were merged and an independent stratified, repeated (ten times) tenfold cross-validation was run on them, disregarding the previous metafold assignments. For example, after the first reduction n = 4152 − 415 = 3737 responses contributed to the analysis, in the next step only n = 3737 − 415 = 3322 did. Next, when only 415 responses were left, the last remaining metafold was split into another ten smaller metafolds to gain more fine-grained reduction steps within the last tenth; and again, we subsequently excluded these smaller metafolds until only one was left and ran independent repeated tenfold cross-validations on the remaining data at each step. This procedure was repeated ten times, each time shifting the order of metafold exclusion to reduce the impact of the metafold sampling itself.
Figure 2 visualizes this design. The simulation used the data of item 7·Select and Judge, which turned out to be representative regarding the obtained results in various analyses and was automatically coded medium well, being, with that, sensible to improvements and worsening caused by sample size variation. Again, all parameters and methods were set constant to those from the analyses for the first research question, except for the number of clusters, which was set with respect to the performance optimum.
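A condensed sketch of the metafold design follows; the finer-grained splitting of the last tenth and the ten order-shifting repetitions are omitted, and run_cv is a hypothetical callable wrapping the cross-validation sketched in the Methods above.

# Sketch: ten stratified metafolds, dropped one at a time; an independent
# cross-validation is run on the merged remainder at every reduction step.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def sample_size_curve(X, y, run_cv):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    metafolds = [test for _, test in skf.split(X, y)]   # ten stratified index sets
    results = []
    for kept in range(10, 0, -1):                       # e.g., n = 4152, 3737, 3322, ...
        idx = np.concatenate(metafolds[:kept])
        results.append((idx.size, run_cv(X[idx], y[idx])))
    return results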
2.4 Participants and Procedure

The responses we analyzed come from the German PISA 2012 sample. This includes a representative sample of 15-year-old students as well as a representative sample of ninth-graders in Germany. A detailed sample description can be found at Prenzel, Sälzer, Klieme, and Köller (2013) and Organization for Economic Cooperation and Development (2014). Due to a booklet design, the numbers of test takers varied for each item (4152 ≤ n ≤ 4234; cf. Table 1). In PISA 2012, reading, maths, and science were assessed paper-based. Hence, booklets needed to be scanned, and responses were transcribed by six persons. To reduce typos, an additional mechanism was employed. Misspelled terms were annotated with their correct spelling, which additionally served as an upper bound of the achievable improvement by spelling correction in Analysis II (cf. subsection Relationship Between Performance and Alternation of Parameter Values or Methods). Only for the additional typo reduction mechanism, the misspelled terms were replaced by their annotation, and these manually corrected data were put into Microsoft Word 2013 and checked for remaining misspellings (that now could only stem from transcription typos) using spelling correction. In total, 0.2 percent of all words were revealed to contain a typo this way, and these were corrected in the data. In the analyses, the original responses were used, disregarding the manual correction but applying an automatic spelling correction to students' misspellings.

2.5 Software

The software programmed for the above described procedure implements open software, libraries, and packages. First, a database storing the downloaded Wikipedia dump (cf. subsection Making Sense of Responses via NLP Techniques) is built by using JWPL (Zesch, Müller, & Gurevych, 2008), which also comes with an application programming interface to access the corpus data. DKPro Similarity (Bär, Zesch, & Gurevych, 2013), which in turn primarily utilizes S-Space (Jurgens & Stevens, 2010), is used to build a vector space model. The response processing makes use of components offered in DKPro Core (Gurevych et al., 2007), which fit into the Apache UIMA Framework (Ferrucci & Lally, 2004). For stemming, Snowball (Porter, 2001) is used. For statistical matters, such as clustering, the software evokes R (R Core Team, 2014).

3 Results

3.1 Overall Performance and Its Relationship to Item Characteristics

As indicated by Table 2, the AC can lead to good up to excellent human-computer-agreement; still, its performance varied remarkably by item.
Figure 2: Sample size simulation design. The x-labels indicate which metafold was excluded in which order; for example, in the second repetition (t = 2), the data in metafold E is included in the first three runs and excluded for all subsequent ones (x4) of this repetition. See the text for a more detailed description.

Table 2: Accuracy and Reliability

Item                                 %h:c (a)      λh:c (b)  κh:c (c)            κc:c (d)
1·Explain Protagonist's Feeling      91.2 [±0.2]   16.5      .533 [.519; .547]   .814
2·Evaluate Statement                 86.5 [±0.3]   63.8      .729 [.723; .735]   .910
3·Interpret the Author's Intention   90.4 [±0.2]    9.6      .458 [.444; .472]   .738
4·List Recall                        98.4 [±0.1]   84.2      .955 [.952; .959]   .983
5·Evaluate Stylistic Element         76.2 [±0.4]   12.4      .503 [.495; .511]   .771
6·Verbal Production                  92.7 [±0.2]   39.7      .704 [.695; .713]   .925
7·Select and Judge                   86.6 [±0.3]   39.2      .679 [.672; .686]   .875
8·Explain Story Element              86.7 [±0.3]   41.5      .676 [.668; .683]   .886
9·Math                               85.0 [±0.3]   56.6      .669 [.661; .676]   .822
10·Science                           88.4 [±0.3]   45.4      .761 [.754; .767]   .918

95% confidence intervals given in brackets. (a) percentage of human-computer-agreement; (b) relative accuracy increase by clustering, λh:c = (%h:ci − %h:c1)/(100 − %h:c1) (for details cf. Methods section); (c) Cohen's kappa for human-computer-agreement; (d) Fleiss' kappa for within-computer reliability across repetitions
Across all items, the system reached high percentages of agreement from 76% to 98%, whereas the kappa values ranged from fair to good agreement beyond chance, for item 3·Interpret the Author's Intention (κh:c = .458), up to excellent agreement beyond chance, for items 10·Science (κh:c = .761) and 4·List Recall (κh:c = .955). Besides item 3, items 1·Explain Protagonist's Feeling (κh:c = .533) and 5·Evaluate Stylistic Element (κh:c = .503) attained the poorest agreement beyond chance, and the five other items reached good agreement beyond chance (.669 ≤ κh:c ≤ .729). Whereas the kappa coefficient identifies the three items just named as performing poorest, but still to a fair to good extent, the coefficient λh:c underlines that their AC-performance did not crucially improve by clustering (9.6 ≤ λh:c ≤ 16.5). A detailed analysis showed that the responses' correctness for items 3 and 5 could partially be affected by other linguistic elements than pure semantics, while the reasons for item 1's rather poor AC were not obvious. On the other hand, especially for item 4·List Recall, the human-computer-agreement was excellent, and it
increased remarkably by clustering (λh:c = 84.2), as it also did for items 2·Evaluate Statement (λh:c = 63.8) and 9·Math (λh:c = 56.6). These items share one important characteristic, namely, the evoked response's correctness is determined by the test taker expressing specific semantic concepts. In item 4·List Recall, this is done in a very simple way by just listing terms, while in the other items the verbal constructions need to be more complex to be able to express the required semantics. Finally, the system's reliability mainly attained excellent agreement with .771 ≤ κc:c ≤ .983, except for the AC of the already discussed item 3 (κc:c = .738).

3.2 Relationship Between Performance and Alternation of Parameter Values or Methods

Competing methods and parameter values were compared in seven analyses with regard to the percentage of human-computer-agreement (%h:c), whereas the sets of one hundred percentage values from stratified, repeated tenfold cross-validation served as the samples for the respective experimental conditions. Most analyses were conducted using the data of item 7·Select and Judge, constituting a medium well working item for AC. This way, the measure of agreement could optimally be affected by the experimental variation, which would ...

Analysis I showed that using a semantic space opposed to simply using the presence of words results in significantly better AC, t(197) = 3.34, p < .001, d = 0.94 (≙ 0.3%), although with a small effect due to the conservative approach. Whereas the semantic space needed 65 clusters to reach its optimal performance, the usage of raw words already required 500 clusters for this simple item to do so.

Analysis II studied the impact of spelling correction on performance, comparing three conditions: no, automatic, and manual spelling correction. The latter constituted an upper bound for the possible effect of spelling correction on performance, that is, how much an almost perfect spelling correction could improve the AC. Using an ANOVA, the analysis revealed that the automatic spelling correction neither improved performance significantly opposed to dispensing with it, nor did the performance with manual spelling correction differ significantly, F(297, 2) = 2.1, p = .130. To further investigate if this finding can be generalized across items for the PISA data, cross-checks on other items were run. These showed that the necessity of spelling correction depends on the item's nature: the same analysis with the data of item 10·Science yielded similarly insignificant results, F(297, 2) = 2.4, p = .089, whereas performance with and without spelling correction for item 6·Verbal Production differed significantly with a relevant effect size, t(197) = 5.81, p ...