Presentation Slides, July 7th, Brighton, UK
8/07/09
Adriana Berlanga, Francis Brouns, Peter van Rosmalen, Kamakshi Rajagopal, Marco Kalz, & Slavi Stoyanov
Natural Language Processing in support of Learning: Metrics, Feedback and Connectivity
Open Universiteit Nederland
AI-ED 2009 July 7th 2009, Brighton, UK
Outline
• Background & LTfLL (Language Technologies for Lifelong Learning)
• Positioning of the learner in a domain
• Providing formative feedback on a learner's conceptual development
  - Approach
  - Showcases
  - Future work
• Questions
3
4
Survey: 'critical' support activities
• Assessment of student work: formative feedback (including plagiarism)
• Answering questions: routing questions; formulating personalised answers
• Monitoring progress: drop-out prevention; personal advice
• Supporting groups and communities: selecting and creating groups; providing overviews & feedback on activities
6
Arts et al.; Van Rosmalen et al. (2008)
This led to LTfLL (www.ltfll-project.org):
- FP7-TEL: a 3-year project, 2008-2011
- 11 partners (8 countries, 6 languages)

LTfLL Objective
To create a set of next-generation support and advice services that will enhance individual and collaborative building of competences and knowledge creation in educational as well as organizational settings. The project makes extensive use of language technologies and cognitive models in the services.

LTfLL - Themes
- Theme 1: position of the learner in a domain
- Theme 2: support & feedback services
- Theme 3: social and informal learning
7
8
Theme 1: Positioning
• Determine the learner's knowledge in a domain (given a specific context, e.g. in support of Assessment of Prior Learning, or with regard to a specific topic, competence or learning goal)
• Determine, in a (semi-)automatic way, the learner's prior knowledge - by analyzing her portfolio and the domain of study - in order to recommend learning materials or courses to follow
• Locate the most suitable learning materials or courses to follow
• Provide formative feedback with regard to the learner's profile in the domain of study and recommend remedial actions to overcome conceptual gaps
9
10
Formative feedback
• Services will offer semi-automatic measurement of conceptual development within a particular expertise area
• Diagnosing conceptual development: a person's knowledge of a domain is assessed by looking at how s/he organizes the concepts of that domain (novice vs. expert approach)
EXPERTISE DEVELOPMENT: KNOWLEDGE PROCESSES - FORMATIVE FEEDBACK
11
12
2
8/07/09
The approach: Novice vs. Expert Novices and experts differ in • How they express the concepts underlying a domain • How they discriminate relevant from nonrelevant information • And how they use and relate the concepts to one another 13
Evidence from:
• Medicine: networks, encapsulations, scripts
• Health sciences: networks, scripts
• Business administration: networks, scripts
• Law: networks, encapsulation +/-, ...
14
Expertise Level     | Knowledge Structure                      | Learning                                        | Reasoning process                                        | Problem solving
Novice              | Networks (incomplete and loosely linked) | Knowledge accretion, integration and validation | Long chains of detailed reasoning steps through networks | Step-by-step process
Intermediate        | Networks (tightly linked and integrated) | Encapsulation                                   | Reasoning through encapsulated network; abbreviated      | Big steps (but still one at a time)
Expert              | Illness scripts                          | Illness script formation                        | Illness script activation and instantiation              | Groups of steps activated as a whole
Experienced expert  | Memory traces of previous cases          |                                                 | Instantiated scripts                                     | Automatic reminding
Boshuizen et al., 2004; Nievelstein, 2004
“Expert” Model • Defines the expected set of concepts and relations that represent the domain of knowledge at a specific point in time of the development of a learner. • It is not absolute • Derive it (semi-)automatically 16
"Expert" Model
1. 'Archetypical expert' model: state-of-the-art information (e.g., scientific literature)
2. 'Theoretical expert' model: documents of a particular course or context (e.g., course material, tutor notes, presentations)
3. 'Emerging expert' model: concepts and the relations a group of people (co-workers, peers, ...) use to describe a domain
These models range from absolute to relative.
17
18
Measuring conceptual development

Knowledge elicitation
• Measure the learner's understanding of the relationships among a set of concepts.
• Methods: concept maps, think aloud, card sorting, word association

Knowledge representation
• Define representations of the elicited knowledge that reflect the underlying cognitive organization.
• Methods: cluster analysis, tree constructions, dimensional representations, pathfinder nets

Evaluation of the representation
• Relative to some standard data
• Compare cognitive structures of experts and novices

Exploring the approach: investigating the use of different 'expert' models
1. Theoretical expert model: formal education; medical students, course and tutor materials; Leximancer and Pathfinder
2. Emergent expert model: informal learning; employees; Leximancer
19
Exploring the approach: investigating the use of different 'expert' models
1. Theoretical expert model: formal education; medical students, course and tutor materials; Leximancer and Pathfinder
   - Continuous or discontinuous? Gaps and transitions (Prince; Boshuizen, Schmidt)
2. Emergent expert model: informal learning; employees; Leximancer (Arts et al.)
20
Theoretical Expert Model (Leximancer and Pathfinder)
• Knowledge elicitation: a think-aloud protocol to elicit students' knowledge; the think-aloud protocols were transcribed
• Knowledge representation: Leximancer was used to generate concept maps for novices (think-alouds) & the theoretical expert model (tutor notes, learning materials)
• Evaluation of the representation: Pathfinder to compare cognitive structures of novices & model, identify similarities and differences
21
22
Generation of expert and student concept maps: Leximancer
23

Initial findings
Verification: output discussed with an expert
• The concept maps differ in their level of detail:
  - Student's concept map: detailed concepts (biology)
  - Model: encapsulated concepts, panoramic view of the knowledge (the disease)
24
Emergent Expert Model (Leximancer)
• Knowledge elicitation: a think-aloud protocol to elicit employees' knowledge; the think-aloud protocols were transcribed
• Knowledge representation: Leximancer was used to generate a single concept map of all think-alouds
• Evaluation of the representation: Leximancer to compare cognitive structures of novices & model, identify similarities and differences
25

Initial findings
• Indicate procedural knowledge, mentioning how to solve a problem ("the how")
• Explain the reasons and conditions of a problem ("the why")
26
Feedback Report (Leximancer)
"These are the concepts you mentioned the most: ...
From your peers, these are the most mentioned concepts: ...
The differences are: ...
This means that you might find it useful to:
• Read this material
• Do this activity
• Contact this person"
27
28

Future work
• Emergent model (representation, number, quantitative metrics)
• Validation of the reliability and usability of the emerging expert map & report
• Design and develop service v.1
• Pilot with medical students (English)
29

Questions?
Question mark photo by Leo Reynolds. Licensed under Creative Commons.
30
Contact: [email protected] or [email protected]
Project website: www.ltfll-project.org
Publications: DSpace, dspace.ou.nl/simple-search?query=LTfLL

Comparison of expert and student map: Pathfinder
31
Introduction
Lexical similarity metrics for vocabulary learning modeling in Computer-Assisted Language Learning (CALL)
Ismael ÁVILA and Ricardo GUDWIN, University of Campinas
• The L1 can create a basis for learning the vocabulary of an L2: the L1 lexicon helps the learner to infer the meanings of words in L2
• Techniques to compare the word-level distance between L1 and L2 are necessary to model this cross-linguistic influence (incl. quantitatively)
Introduction
• With this metric an ITS can anticipate which L2 words are more easily learned due to transfers from L1 and which ones produce interferences
• We present here a technique for measuring lexical similarity in terms of its effect on the learners' perceptual ability in recognizing L2 words with the help of the L1 lexicon
• The ITS can use this metric to initialize the LM (learner model) or to sequence the lexical units in terms of their easiness to a particular L1 audience
Lexical similarity
• Lexical similarities may be due to:
  - Common origin: e.g. Spanish "corazón" and Portuguese "coração"
  - Borrowings: e.g. Japanese "arigato" and Portuguese "obrigado"
  - Coincidences: e.g. Greek "oikia" and Tupi "oca"
• Examples: Direction (en) ↔ Direction (fr); House (en) ↔ Haus (de); Casa (it) ↔ Casa (pt)
• The similarity level has two main parallel dimensions: orthographic and phonetic. Each of them may vary from a level of "no similarity" to a level of "absolute match".
• Regardless of their origins, these similarities affect the language learning process and have to be considered by the ITS
Methods to measure string distance
• Levenshtein distance (LD) uses the minimum number of insertions, deletions and letter substitutions needed to transform one string into another:
  LD(s1, s2) = min(n_ins + n_del + n_subst)
• To account for the fact that one letter change is more relevant in short words than in long ones, normalized versions of LD have been used.
• Feature distance (FD) is given by the number of features (usually N-grams, substrings of N consecutive letters) in which two strings differ:
  FD(s1, s2) = max(N1, N2) - m(s1, s2)
  where N1 and N2 are the numbers of N-grams in s1 and s2, and m(s1, s2) is the number of matching N-grams
• The Levenshtein distance leads to slightly better classification accuracy, but the Feature distance allows for much faster searching.
Lexical similarity & language proximity
• An automated method avoids the subjectivity that is inherent in human-made comparisons: e.g. Gala (el) ↔ Leche (es)
• We want to measure effective similarity, not linguistic kinship, for similarity, even accidental, is what matters for learning easiness.
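The LD and FD measures introduced above can be sketched in a few lines. This is a minimal illustration under my own assumptions (function names and the choice of bigrams as features are mine, not from the slides):

```python
from collections import Counter

def levenshtein(s1, s2):
    """LD(s1, s2): minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]

def feature_distance(s1, s2, n=2):
    """FD(s1, s2) = max(N1, N2) - m(s1, s2), with N-grams as the features."""
    grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
    g1, g2 = grams(s1), grams(s2)
    matches = sum((g1 & g2).values())                   # m(s1, s2)
    return max(sum(g1.values()), sum(g2.values())) - matches

print(levenshtein("kitten", "sitting"))    # 3
print(feature_distance("night", "nacht"))  # 3 (4 bigrams each, only "ht" shared)
```

The dynamic-programming LD is quadratic in the word lengths, while FD only needs one multiset intersection per pair, which is why FD supports much faster searching.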
Lexical Similarity: perceptual aspects
• A written or printed word is a visual stimulus in the first place.
• Word recognition is easier after fixation of the leftmost than the rightmost letter of a word (the initial in many languages).
• Fixation on the leftmost letter makes the whole word fall in the right visual half-field, in direct connection to the dominant left hemisphere.
Lexical Similarity: perceptual aspects
• Word processing accuracy and speed depend on two factors:
  - the perceptibility of the individual letters as a function of the fixation location
  - the extent to which the most visible letters isolate the target word from its competitors
• The leftmost letters have a special role in word recognition (isolation from competitors).
• Reading and word recognition are not simply based on orthographic information, but involve the activation of phonological codes.

Lexical Similarity: semiotic aspects
• Intuitive word recognition factors are used as a common-sense technique when we create abbreviations: tks (thanks), pg (page), cmd (command) or ctrl (control).
• Matching initials and consonants is more likely to enable word recognition than matching the same number (same LD) of other letters without the initial or with vowels included (resp. tak, ae, oma, coto).
Lexical Similarity: semiotic aspects
• The recognition of an L2 word due to a similarity with correlated L1 words is an inference based on diagrammatic (iconic) features.
• This "intersymbolic iconicity" explains all the recognitions based on similarity, regardless of their cause: common origin, borrowings or simple coincidence.
• Slon (cz) ??? Elefant (dn) Elefante (pt)
The proposed LS metric
• In our technique we assign more value to the diagrammatic role of consonants than to other matchings, and emphasize the role of initials.
• It may be necessary to normalize consonants and clusters to the same notation: for instance, "š", "ŝ" and "sch" to "sh".
• The comparisons of the consonant or vowel sequences consider letter groupings such as "cntrl" or "oo".
• Weights are adjusted so that the maximum similarity is 1 (totally matching words) and the minimum is 0 (totally different words). The equation for intersymbolic similarity is:
IS = α(γ1·I + γ2·C + γ3·V) + β·P    (1)

where:
  IS: intersymbolic similarity (maximum = 1, minimum = 0)
  I: initials; C: consonants; V: vowels
  P: phonemes (can be decomposed as the orthographical part: γ4·I + γ5·C + γ6·V)
  α: weight of the orthographical similarity (adjusted according to the context)
  β: weight of the phonetic similarity (adjusted according to the context)
  γn: weights of the factors of similarity (e.g. γ1 = 0.4; γ2 = 0.4; γ3 = 0.2)
  with α + β = 1, γ1 + γ2 + γ3 = 1, and γ4 + γ5 + γ6 = 1
The proposed LS metric
Example: the intersymbolic similarities of the Italian word "tempo" respectively to speakers of Portuguese, Spanish, English, German and Finnish are:

L1 (tempo) → L2 (tempo): initials t=t; consonants tmp=tmp; vowels eo=eo
  IS = 0.6*(0.4*1 + 0.4*1 + 0.2*1) + 0.4*1 = 1
L1 (tempo) → L2 (tiempo): initials t=t; consonants tmp=tmp; vowels eo≈ieo
  IS = 0.6*(0.4*1 + 0.4*1 + 0.2*0.66) + 0.4*0.9 = 0.92
L1 (tempo) → L2 (time): initials t=t; consonants tmp≈tm; vowels eo≠ie
  IS = 0.6*(0.4*1 + 0.4*0.66 + 0.2*0) + 0.4*0.4 = 0.48
L1 (tempo) → L2 (Zeit): initials t≈Z(ts); consonants tmp≈Zt; vowels eo≈ei
  IS = 0.6*(0.4*0.5 + 0.4*0.16 + 0.2*0.33) + 0.4*0.2 = 0.28
L1 (tempo) → L2 (aika): initials t≠a; consonants tmp≠k; vowels eo≠aia
  IS = 0.6*(0.4*0 + 0.4*0 + 0.2*0) + 0.4*0 = 0
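The worked examples plug hand-judged component scores into equation (1). A minimal sketch, assuming the example weights α=0.6, β=0.4, γ1=γ2=0.4, γ3=0.2 (the component scores I, C, V, P are supplied by hand, as they are in the slides):

```python
def intersymbolic_similarity(I, C, V, P, alpha=0.6, beta=0.4,
                             g1=0.4, g2=0.4, g3=0.2):
    """Equation (1): IS = alpha*(g1*I + g2*C + g3*V) + beta*P."""
    # Sanity-check the weight constraints from the slides.
    assert abs(alpha + beta - 1.0) < 1e-9 and abs(g1 + g2 + g3 - 1.0) < 1e-9
    return alpha * (g1 * I + g2 * C + g3 * V) + beta * P

# tempo (it) -> tempo (pt): every component matches
print(round(intersymbolic_similarity(1, 1, 1, 1), 2))       # 1.0
# tempo (it) -> tiempo (es): vowels eo ~ ieo (0.66), phonetics ~ 0.9
print(round(intersymbolic_similarity(1, 1, 0.66, 0.9), 2))  # 0.92
# tempo (it) -> aika (fi): nothing matches
print(round(intersymbolic_similarity(0, 0, 0, 0), 2))       # 0.0
```

The hard part of the metric is not the weighted sum but producing the component scores; the slides derive them from aligned initials, consonant sequences and vowel sequences of the two words.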
The proposed LS metric
Original word: "physics"; transformations:
  to Czech "fizyka" (sisssss), LD = 13
  to Polish "fyzika" (sixsxss), LD = 9
  to Afrikaans "fisika" (sisxxss), LD = 9
  to Italian "fisica" (sisxxxs), LD = 7
  to French "physique" (xxxxxssi), LD = 5

The results for intersymbolic similarity are:
  IS1 = 0.6*(0.4*0.8 + 0.4*0.65 + 0.2*0.8) + 0.4*0.8 = 0.764
  IS2 = 0.6*(0.4*0.8 + 0.4*0.65 + 0.2*0.9) + 0.4*0.8 = 0.776
  IS3 = 0.6*(0.4*0.8 + 0.4*0.72 + 0.2*0.8) + 0.4*0.8 = 0.781
  IS4 = 0.6*(0.4*0.8 + 0.4*0.80 + 0.2*0.8) + 0.4*0.8 = 0.800
  IS5 = 0.6*(0.4*1.0 + 0.4*0.90 + 0.2*0.9) + 0.4*0.8 = 0.884
The proposed LS metric - Conclusions
• Whereas LD measured distances ranging from 5 to 13, the IS produced similar scores for the five L2 words, arguably because the technique can capture the fact that all the words are more or less recognizable from the original word.
• Conversely, an opposite situation, in which two words produce a smaller LD but score worse on IS, would be "glamour" (en) and "amour" (fr): their LD = 2 is smaller, but their IS = 0.52 indicates less actual similarity.
• We believe that the IS captures the crucial features that make a word more easily recognizable by learners.
• We can assume that there is a threshold below which recognition will no longer be possible (based on IS).
• A field study is being designed to investigate how this threshold relates to the lexicon of each subject's L1 and to other known L2s.
Conclusions
• This technique is aimed at offering a practical word-level similarity metric to compare words from different languages, so that this measure can be used as an input to initialize the LM or to evaluate word-level errors in the context of CALL applications. It is not aimed at replacing other formalisms, nor at creating new computational treatments of lexical rules.
Cohesion, Semantics and Learning in Reflective Dialog
Arthur Ward, John Connelly, Sandra Katz, Diane Litman, Christine Wilson
Learning Research and Development Center, University of Pittsburgh

Outline
- Motivation: why study cohesion? A way to study interactivity in tutorial dialog
- Previous work: automatic "lexical" cohesive ties; now try a more sophisticated measure
- Tag definitions: a set of "semantic" cohesive ties
- Corpus: pre/post-tests & transfer questions
- Applying the tags
- Results: abstraction & specialization important for learning, and transfer
2
Interactivity in Tutorial Dialog
- Human tutoring is very effective (Bloom 1984; Cohen, Kulik & Kulik 1982). Why?
- Maybe because it is interactive (Chi et al. 2001, 2008; Graesser et al. 1995)
- What specific interactive mechanisms help? Other?
3

Cohesive Ties
- Cohesion: how a text "hangs together" (Halliday & Hasan 1976)
- Repetition of words, use of pronouns, ellipsis, etc.
- Measurable using "cohesive ties": ways to study interactivity in dialog
4

Previous work (Ward & Litman 2006, 2008)
- Counted cohesive ties between tutor & student: repetition of words, word stems, hyponyms/hypernyms (identified using WordNet)
- Correlated with learning; automatically computable
- But missed many of Halliday & Hasan's cohesive devices
Current work
- Manually tag for cohesive ties not automatically identifiable, in a different corpus
- Like before, focus on when tutor and student refer to each other's contributions
5

Cohesion Tag Set
Lexical ties (e.g. word repetition, like before):
- Exact: word or word stem repetition
- Synonym: two words with similar meanings
- Paraphrase: phrase repetition w/ substitution
- Pronoun: pronominal reference ("she", "it")
Semantic ties:
- Superordinate-class: more general referring term
- Class-member: more specific referring term
- Collocation: complementarity ("up-down")
- Negation: direct contradiction
6
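The earlier automatic approach counted lexical ties such as exact word or stem repetitions between tutor and student turns. A toy sketch, under my own simplifying assumptions (a regex tokenizer and a crude suffix-stripping stemmer stand in for the original system, which also used WordNet hyponym/hypernym links):

```python
import re

def stem(word):
    # Toy suffix-stripping stemmer (a simplification; the original
    # work matched word stems and WordNet hyponym/hypernym relations).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def exact_ties(turn_a, turn_b):
    """Count 'exact' cohesive ties (word or word-stem repetitions)
    between two adjacent dialog turns."""
    tokens = lambda t: {stem(w) for w in re.findall(r"[a-z]+", t.lower())}
    return len(tokens(turn_a) & tokens(turn_b))

s_turn = "yes, because gravity pulls the firecracker down"
t_turn = "Good, that's right. What about the horizontal direction?"
print(exact_ties(s_turn, t_turn))  # 1 (the shared token "the")
```

A real tie counter works on spans rather than bags of words, which is exactly the span-identification difficulty the deck discusses later.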
The Corpus
- Reflective tutoring dialogs with a human tutor (Katz et al. 2003), after problem solving in Andes (VanLehn et al. 2005)
- Study procedure:
  - 16 students solved 12 physics problems each
  - Answered 3-8 reflection questions
  - Example: "Suppose the maximum tension that the bungee cord could maintain without snapping was 700 N. What would happen ..."
- Resulting corpus has 953 reflective dialogs
  - 2,218 student turns
  - 2,136 tutor turns
7

The Corpus
- Counter-balanced pre- & post-tests
  - 9 quantitative mechanics questions, similar to Andes problems
  - 27 qualitative physics questions: new questions, not like Andes problems ("far transfer" questions)
- Students learned significantly by both measures
8

Cohesion Tag Example (slides 9-16)
Tagging the Corpus
- Training: 518 student & tutor turns; refining tag definitions
- Initial tagging pass: lexical features only, without reference to semantic context; spans agreed by discussion
- Final tagging pass: re-evaluated 3 tags using contextual features: "superordinate-class," "class-member," "collocation"
  - Eliminated ties that didn't make sense: mismatched topics or referents; didn't seem to involve knowledge construction
- 2nd tagger re-tagged a random 10%; Kappa = .57
17

Final Tagging Example
S: "yes, because gravity pulls the firecracker down and gives it motion in the `y` direction."
T: "Good, that's right. What about in the horizontal directions? for example the 'x' direction on your diagram?"
In first pass: tagged lexical relations
- "down" is a specific "direction"
- so tag down-direction as "superordinate-class"
18
Final Tagging Example
S: "yes, because gravity pulls the firecracker down and gives it motion in the `y` direction."
T: "Good, that's right. What about in the horizontal directions? for example the 'x' direction on your diagram?"
In second pass:
- notice that the student already used "direction"
- tutor did not do a new generalization
- remove the tag
19

Analysis
Linear model for each cohesion tag
- Predict post-test score from:
  - pre-test score (because correlated with post-test score)
  - standardized math score (useful predictor of learning in Andes)
  - tag count (normalized by # of student or tutor turns)
- Separate models for:
  - high pre-testers, low pre-testers, all students
  - qual ("near"), quant ("far") & all questions
20
Analysis
Linear model for each cohesion tag: example for the "student superordinate-class" tag
21

Results
All students, all questions
("T:" = Tutor; "S:" = Student; "Super-Ord" = superordinate-class; "Class-mem" = class-member)
- "semantic" ties correlate
- no results for "exact" in this corpus
22

Discussion
- Previous work showed that automatic measures of cohesion correlated with learning
- Current work suggests cohesion also correlates in the new corpus
- Abstraction/specialization seem to be important cohesive mechanisms in tutoring
- Span identification is the hardest part
23

Span Identification is Hard
Example
- S: "No the force the airbag exerts back on the man after he goes into is one."
- T: "The airbag force and the force of the person on the airbag is such a pair. good. All forces come in such pairs! What is the 'reaction force' for the driver's weight?"
Overlapping spans:
- "force"-"forces": exact
- "force the airbag exerts"-"airbag force": paraphrase
- "force the airbag exerts back on the man"-"pair": superordinate class
24
Span Identification is Hard
(The airbag example is repeated on slides 25-27, highlighting each overlapping span in turn.)
Overlapping spans:
- Spans often don't correspond to syntactic structures
- Words often participate in more than one span
- Spans are sometimes split ("those forces")
28
Future Work
- Investigate automatic detection (maybe don't need accurate spans?)
- Could improve student models by detecting student abstraction
- Could improve tutoring by including more tutor abstraction/specialization at appropriate places (what's an appropriate place?)
29

Thanks
Learning Research & Development Center
ONR N000140710039
The ITSpoke group
Pam Jordan
30
Intelligent Tutoring Systems
Speling Mistacks & Typeos: Can Your ITS Handle Them?
Adam M. Renner (a), Philip M. McCarthy (b), Chutima Boonthum (c), Danielle S. McNamara (a)
(a) University of Memphis, Psychology / Institute for Intelligent Systems
(b) University of Memphis, English / Institute for Intelligent Systems
(c) Hampton University, Computer Science
ITS User-Language
- Contains a high rate of typographical & grammatical errors; not a new issue in NLP
- Traditional spellchecking not suitable (e.g., MS Word, email)
- ITSs necessitate automatic corrections: Why2-Atlas (VanLehn et al., 2002); CIRCSIM-Tutor (Elmi & Evens, 1998); many more just ignore errors
- NLP tools thought resistant to errors: LSA (Landauer et al., 2007) measures semantic overlap across two whole texts. But short responses? Responses with multiple errors?
- NLP tools are trained on edited text; when used in an ITS, similarity assessment is inevitably affected
User-Language Paraphrase Corpus
- 1998 target sentence/student response pairs
- Paraphrase attempts by high school students during interactions with iSTART (McNamara, Levinstein, & Boonthum, 2004)
- Paraphrases evaluated on widely used computational indices:
  - Latent Semantic Analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007)
  - Entailment (Rus et al., 2007)
  - Type-Token Ratio (TTR; Graesser, McNamara, et al., 2004)
  - Mean Edit Distance (MED; McCarthy et al., 2007)
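Of these indices, TTR is straightforward, and an edit-distance-based index can be illustrated as well. This is a simplified sketch, not the exact formulae of the cited papers; `mean_word_edit_distance` in particular is my own stand-in for MED:

```python
def type_token_ratio(text):
    """TTR: number of distinct words divided by total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mean_word_edit_distance(response, target):
    """Illustrative MED-style index (an assumption, not McCarthy et al.'s
    formula): mean edit distance from each response word to its closest
    target word. Misspellings raise the score; corrections lower it."""
    t_words = target.lower().split()
    r_words = response.lower().split()
    return sum(min(edit_distance(w, t) for t in t_words)
               for w in r_words) / len(r_words)

print(type_token_ratio("a lot of heat made by a lazy person"))
print(mean_word_edit_distance("increace in tempiture", "increase in temperature"))
```

The point the slides make is that indices like these shift substantially once spelling errors are corrected, so feedback computed from the raw response can misjudge the paraphrase.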
Intelligent Tutoring Systems
- Many ITSs use conversational dialogue; NLP provides assessment of user input and determines feedback
- Input is matched to a benchmark and assessed for similarity; guided feedback is based on the user's response
- Assessment is limited by the proficiency of the user: high school students or younger make typing errors and spelling mistakes, so the system must judge what the student intended

Problems with Evaluating User-Language
- Lack of "colloquial" paraphrase corpora
  - Microsoft Research Paraphrase Corpus (Dolan, Quirk, & Brockett, 2004): only a binary rating (is/is not a paraphrase)
  - Echo Chamber (Brockett & Dolan, 2005); Paraphrase Game (Chklovski, 2005)
- Limitations in "cleaning" ITS input
  - Datasets artificially created (Fossati & Di Eugenio, 2008)
  - Target populations are relatively proficient: Why2-Atlas (college undergraduates); CIRCSIM-Tutor (1st-year medical students)
  - Use lexicons; computationally expensive

Research Questions
- How are established computational indices affected by the types of errors found in typed user-language?
- Do user errors affect NLP assessment and feedback produced by an established ITS?
- Does correcting user errors improve the capacity of ITS assessment to correspond to human ratings?

Paraphrases were also evaluated by trained experts on 10 dimensions with Likert ratings.
iSTART
- High school students (U.S. grades 9-12)
- Reading strategy training: paraphrasing, elaboration, making bridging inferences, comprehension monitoring
- Paraphrase the following: "Over two thirds of the heat generated by a resting human is created by organs of the thoracic and abdominal cavities and the brain."
  Student response: "a lot of heat made bya lazy person is made by systems of your stomack and thinking box."

iSTART Evaluation Process
- Based on the match between paraphrase and target sentence
- Frozen expressions (e.g., "I think this is saying ...") are responded to or removed
- Word & Soundex matching against the benchmark for length, relevance, & similarity:
  - Irrelevant (IRR): too few words match
  - Too short (SH): response is shorter than a specified threshold
  - Too similar (SIM1): length and word match are close to the benchmark
- Word match & LSA cosines for quality:
  - Adequate paraphrase (SIM2)
  - Better than a paraphrase (OK)
- Detailed formulae: McNamara, Boonthum, et al. (2007)
Soundex
- Compensates for misspellings (Christian, 1998)
- Vowels are removed; like-sounding consonants are mapped onto the same symbol (e.g., b, f, p, v)
- Lexicon-free, which avoids the word-frequency problem: students make more mistakes on new or uncommon words

Procedure
- Identified, coded, & corrected all errors, based on validated models of grammar (e.g., Foster & Vogel, 2004)
- Interrater agreement for a subset (n = 200): Kappa = .70, p < .001; a single rater coded the entire corpus
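The Soundex scheme described above can be sketched as the classic algorithm. This is the standard Russell/Odell variant, which may differ in detail from iSTART's implementation:

```python
def soundex(word):
    """Classic Soundex code: keep the first letter, map like-sounding
    consonants to the same digit, drop vowels, collapse repeats,
    and pad/truncate to 4 characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":   # h and w do not separate same-coded consonants
            prev = digit
    return (result + "000")[:4]

# A misspelling maps to the same code as the intended word:
print(soundex("increase"), soundex("increace"))  # I526 I526
```

This is why Soundex matching is robust to many of the internal spelling errors the corpus analysis found, without needing a lexicon.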
Results
- 83% of responses contained some form of error
- 52% had some form of spelling error
- 63% of spelling errors were internal to the target sentence

Error types & frequencies:
  Spelling (internal): 665 (33%)
  Spelling (external): 386 (19%)
  Capitalization: 1157 (58%)
  S-V agreement: 367 (18%)
  Article agreement: 75 (4%)
  Preposition agreement: 53 (3%)
  Determiner agreement: 59 (3%)
  Spacing: 174 (9%)
  Punctuation: 344 (17%)
  Conjunction agreement: 43 (2%)
  Possessive agreement: 71 (4%)
  Extra/omitted/substituted words: 230 (12%)
Results
- Significant effect of error correction on the computational similarity indices (partial Eta2): LSA = .178; Entailment = .268; TTR = .240; MED = .111
- Spelling (internal) accounts for a large portion of variance (adjusted R2): LSA = .35; Entailment = .45; TTR = .46; MED = .17
Results: Example
Target sentence: "An increase in temperature of a substance is an indication that it has gained heat energy."
Student response: "increace in tempiture has gaind heat energy."
Revised response: "Increase in temperature has gained heat energy."
Effect of the correction on the indices: LSA .54 → .90; Entailer .41 → .78; TTR .86 → .62; MED .78 → .60
Results
- Compared iSTART feedback's correspondence to human ratings of paraphrase quality
- Removed cases that required no correction or were entirely garbage (n = 328)
- Separate ANOVAs for original and corrected responses; dependent = Paraphrase Quality, fixed factor = iSTART response
  - Original paraphrases: F(5, 1636) = 53.324, p < .001
  - Corrected paraphrases: F(5, 1636) = 58.543, p < .001
Table 1: Crosstabulation of iSTART responses to user paraphrases
iSTART response      |           iSTART response - corrected
(original paraphrase)| Better | Good | Too Similar | Too Short | Irrelevant | Frozen | Total
Better               |  691   |  45  |     37      |     4     |     0      |   0    |  777
Good                 |   12   | 194  |     98      |     0     |     0      |   0    |  304
Too Similar          |    7   |   7  |    527      |     0     |     0      |   0    |  541
Too Short            |   11   |   0  |      1      |   206     |     2      |   1    |  221
Irrelevant           |    6   |   0  |      0      |     6     |   120      |   7    |  139
Frozen               |    0   |   0  |      0      |     0     |     0      |  16    |   16
Total                |  727   | 245  |    663      |   216     |   122      |  24    | 1998
Cramer's V = .849, p < .001; Marginal Homogeneity (MH) = 5.892, p < .001
Results: separate pairwise comparisons of Paraphrase Quality
                         |        Original          |        Corrected
Comparison               | Mean Diff.  SE    Sig.(a)| Mean Diff.  SE    Sig.(a)
Frozen vs Irrelevant     |   .152     .402   1      |   .081     .361   1
Frozen vs Too short      |  -.776     .370   .581   |  -.922     .299   .032
Frozen vs Too Sim        | -1.955     .363   < .001 | -2.176     .288   < .001
Frozen vs Good           | -2.071     .366   < .001 | -2.421     .297   < .001
Frozen vs Better         | -1.897     .361   < .001 | -2.106     .288   < .001
Irrelevant vs Too short  |  -.918     .209   < .001 | -1.002     .245   .001
Irrelevant vs Too Sim    | -2.107     .196   < .001 | -2.257     .231   < .001
Irrelevant vs Good       | -2.223     .203   < .001 | -2.502     .242   < .001
Irrelevant vs Better     | -2.0249    .192   < .001 | -2.187     .231   < .001
Too short vs Too Sim     | -1.189     .115   < .001 | -1.255     .112   < .001
Too short vs Good        | -1.305     .127   < .001 | -1.500     .133   < .001
Too short vs Better      | -1.131     .111   < .001 | -1.185     .111   < .001
Too similar vs Good      |  -.116     .103   1      |  -.245     .107   .331
Too similar vs Better    |   .058     .082   1      |   .070     .077   1
Good vs Better           |   .174     .097   1      |   .315     .106   .044
(a) Adjustment for multiple comparisons: Bonferroni.
Discussion
- ITS feedback algorithms may be optimized if user-language can be filtered prior to processing
- Misclassification is OK for motivation; inaccuracy is not: a simple rewording can pass for a good paraphrase, and a paraphrase can pass for better
- Established NLP approaches are not as robust to user-language as believed: response length is not enough to wash out individual errors; the ULPC represents the types & amount of errors real students make
- Most variance is accounted for by internal misspellings, which provides direction for future research: automatic spelling corrections applied only for words in the benchmark would be silent & computationally light
Thank you! We would like to thank: Vasile Rus, Ben Duncan, John Myers, Rebekah Guess.
Research supported by: IIS-0735682, R305A080589
Application of an episodic memory modeling perspective on text categorization using Random Indexing

The episodic memory metaphor in text categorization with Random Indexing
Yann Vigile Hoareau & El Ghali
CHArt - Lutin (Paris 8 / EPHE)
DEFT'09, Paris, 22 June 2009

Outline
- From episodes to concepts
- The DEFT'09 text-mining contest
- Results
- Perspectives

The episode (Hintzman, 1988)
2
3
Effect of frequency of episodes on echo (Hintzman, 1988)
• Mean and variance of the echo increase with frequency

From episodes to concepts
• A famous model of episodic memory: MINERVA 2
  – Description
  – Simulation of the effect of episode frequency
• A Word Vector model: Random Indexing
  – Description
  – Text categorization and the application of the distributional hypothesis
• The DEFT'09 text-mining contest
• Results
• Perspectives
A Word Vector model: Random Indexing

The common principles behind Word Vectors:
• Implementing the distributional hypothesis
• Dealing with large corpora
• Working on a context window
• Building a matrix that holds the uses of words as a function of their contexts
• Reducing the matrix

• Create a matrix s (c × e) containing the index vectors
  – c is the number of documents or contexts in the corpus
  – e is the number of dimensions (≈ 1000)
• Index vectors are sparse, randomly generated vectors; they consist of a small number of +1 and −1 entries
• Using vectorial methods to manipulate words or groups of words
• Create a matrix (t × e) containing the term vectors
  – t is the number of terms composing the corpus
• The process is incremental: to start the matrix compilation, all cell values are initialized to 0
• For each document of the corpus, each time a term appears in a document, accumulate the index vector corresponding to the document into the term vector corresponding to the term
• At the end of the process, term vectors that appeared in similar contexts have accumulated similar index vectors
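The Random Indexing construction described on these slides can be sketched end to end: generate one sparse ternary index vector per document, then add that vector into the row of every term the document contains. The dimensionality, sparsity, and the three-document toy corpus below are illustrative assumptions, not the contest setup.

```python
import random

random.seed(0)
DIM = 1000   # reduced dimensionality e (slides suggest around 1000)
NNZ = 10     # number of non-zero (+1/-1) entries per index vector

def index_vector():
    """Sparse, randomly generated vector with a few +1 and -1 entries."""
    v = [0] * DIM
    for pos in random.sample(range(DIM), NNZ):
        v[pos] = random.choice([-1, 1])
    return v

# Toy corpus: each document is a list of terms.
corpus = [
    ["memory", "episode", "trace"],
    ["memory", "episode", "echo"],
    ["matrix", "vector", "dimension"],
]

doc_index = {d: index_vector() for d in range(len(corpus))}  # matrix s (c x e)
term_vectors = {}                                            # term matrix (t x e)

# Incremental accumulation: each time a term appears in document d,
# add d's index vector into the term's vector.
for d, doc in enumerate(corpus):
    for term in doc:
        tv = term_vectors.setdefault(term, [0] * DIM)
        for i, x in enumerate(doc_index[d]):
            tv[i] += x

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Terms sharing contexts end up with similar accumulated vectors.
print(cosine(term_vectors["memory"], term_vectors["episode"]))  # high
print(cosine(term_vectors["memory"], term_vectors["matrix"]))   # near 0
```

Because no matrix factorization is needed, the whole process stays incremental: a new document only requires one fresh index vector and a pass over its terms.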
The DEFT'09 text-mining contest
• DEFT'09's main task was opinion categorization: subjectivity/objectivity detection in multilingual journal corpora (fr, en, it)
• 60% of the corpora for training
• Limited time test period (a few days)
)"
)#
Effect of frequency of episodes on echo (Hintzman, 1988)
• Mean and variance of the echo increase with frequency

A French Text-Mining contest

Principles
• Build a semantic memory from all the available episodes
• Organize episodes in categories following the principles of episodic memory models
  – Splitting the categories into homogeneous sub-categories regarding their typicality
  – The created sub-categories are considered as local episodic memories
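The Hintzman (1988) frequency effect, that both the mean and the variance of the echo grow with the number of stored episodes, can be illustrated with a toy simulation in the spirit of MINERVA 2: each trace is a noisy copy of the item, the probe's similarity to each trace is cubed to give an activation, and activations sum into the echo intensity. The feature count `N`, encoding probability `L`, and run counts are illustrative assumptions, not the authors' simulation parameters.

```python
import random
import statistics

random.seed(1)
N = 20    # features per trace
L = 0.6   # probability that a feature is encoded (vs. lost as 0)

def echo_intensity(probe, traces):
    """MINERVA 2-style echo: similarity cubed, summed over all traces."""
    total = 0.0
    for t in traces:
        s = sum(p * q for p, q in zip(probe, t)) / N
        total += s ** 3
    return total

def simulate(freq, n_runs=500):
    """Mean and variance of echo when the probed item was stored freq times."""
    intensities = []
    for _ in range(n_runs):
        item = [random.choice([-1, 1]) for _ in range(N)]
        traces = [[f if random.random() < L else 0 for f in item]
                  for _ in range(freq)]
        intensities.append(echo_intensity(item, traces))
    return statistics.mean(intensities), statistics.variance(intensities)

for freq in (1, 3, 5):
    m, v = simulate(freq)
    print(f"frequency={freq}: mean echo={m:.2f}, variance={v:.3f}")
```

Each additional episode contributes an independent, noisy activation, so both the mean and the spread of the echo scale up with frequency.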
Assigning a category
Principles
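The slides do not spell out the assignment procedure at this point, but a sketch consistent with the "local episodic memories" idea is nearest-centroid assignment: each category is represented by one or more sub-category centroids, and a new text goes to the category of its most similar centroid. Everything below is a toy assumption, including the 3-dimensional vectors and the two hypothetical sub-categories; the real system would use Random Indexing term vectors.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean vector of a sub-category's training texts."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical sub-categories ("local episodic memories"), built from
# toy training vectors split by typicality.
memories = {
    "subjective": [centroid([[3, 1, 0], [4, 0, 1]]),   # strongly opinionated
                   centroid([[2, 2, 0]])],              # mildly opinionated
    "objective": [centroid([[0, 1, 4], [1, 0, 5]])],
}

def assign(vec):
    """Category of the most similar sub-category centroid."""
    return max(((cosine(vec, c), cat)
                for cat, cents in memories.items() for c in cents))[1]

print(assign([3, 1, 1]))  # -> "subjective"
```

Splitting a category into homogeneous sub-categories before taking centroids keeps atypical members from dragging a single prototype away from any of its texts.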
Results

Perspectives
• Application to larger opinion categorization tasks: TREC'09 Blog track
• Interfacing with other word-vector methods (LSA, HAL, …)
Possible applications in Education
• Educational resources management:
  – Resource retrieval: help users determine the value of an educational resource (factual vs. opinion)
  – Resource classification in both thematic and opinion dimensions
• Assessment or essay-scoring:
!"##$ %&'&()$ *+",)"-.$ /0&($ 1($ 23"(&.$ "#0$ 4)#&5$ 6)',+5$ !""#$%&'( )*+,-.,#/%'0( ',( %/,12$#-0/,-$3( 4'( ,'5,'0( /6'%( 7/34$8( 934'5-32.$ 4"#5$ ()5$ !"#$%& '$& (#$)*$+& '$& "),#-+$& '$& )./'*#*01& 2334& '-& '/5& 60-*))$& '$& #$7#$& 89:;.34.$ 77$ 8-$ 799:.$ ;",&5.$ !"##$%&'&()$*+",)"-$"#0$/0&($1($23"(&$(7/34$8(934'5-32(/34(,&'('"-0$4-%(8'8$#;( 8',/"&$#=( !""+-%/,-$3( ,$( ,'5,( %/,'2$#-