Presentation Slides, July 7th, Brighton, UK
8/07/09
Adriana Berlanga, Francis Brouns, Peter van Rosmalen, Kamakshi Rajagopal, Marco Kalz, & Slavi Stoyanov
Natural Language Processing in support of Learning: Metrics, Feedback and Connectivity
Open Universiteit Nederland
AI-ED 2009 July 7th 2009, Brighton, UK
Outline
• Background & LTfLL (Language Technologies for Lifelong Learning)
• Positioning of the learner in a domain
• Providing formative feedback on a learner's conceptual development
  - Approach
  - Showcases
  - Future work
• Questions
3
4
Survey: 'critical' support activities
• Assessment of student work: formative feedback (including plagiarism)
• Answering questions: routing questions; formulating personalised answers
• Monitoring progress: drop-out prevention; personal advice
• Supporting groups and communities: selecting and creating groups; providing overviews & feedback on activities
6
Arts et al.; Van Rosmalen et al. (2008)
This led to LTfLL (www.ltfll-project.org):
- FP7-TEL: a 3-year project, 2008-2011
- 11 partners (8 countries, 6 languages)

LTfLL Objective
To create a set of next-generation support and advice services that will enhance individual and collaborative building of competences and knowledge creation in educational as well as organizational settings. The project makes extensive use of language technologies and cognitive models in the services.

LTfLL - Themes
- Theme 1: position of the learner in a domain
- Theme 2: support & feedback services
- Theme 3: social and informal learning
7
8
Theme 1: Positioning
• Determine the learner's knowledge in a domain (given a specific context, e.g. in support of Assessment of Prior Learning, or with regard to a specific topic, competence or learning goal)
• Determine, in a (semi-)automatic way, the learner's prior knowledge - by analyzing her portfolio and the domain of study - in order to recommend learning materials or courses to follow
• Locate the most suitable learning materials or courses to follow
• Provide formative feedback with regard to the learner's profile in the domain of study and recommend remedial actions to overcome conceptual gaps
9
10
Formative feedback
• Services will offer semi-automatic measurement of conceptual development within a particular expertise area
• Diagnosing conceptual development: a person's knowledge of a domain is assessed by looking at how s/he organizes the concepts of that domain (novice vs. expert approach)
EXPERTISE DEVELOPMENT: KNOWLEDGE PROCESSES - FORMATIVE FEEDBACK
11
12
2
8/07/09
The approach: Novice vs. Expert Novices and experts differ in • How they express the concepts underlying a domain • How they discriminate relevant from nonrelevant information • And how they use and relate the concepts to one another 13
Evidence from:
• Medicine: networks, encapsulations, scripts
• Health sciences: networks, scripts
• Business administration: networks, scripts
• Law: networks, encapsulation +/-, ...
14
Expertise Level     | Knowledge Structure                      | Learning                                        | Reasoning process                                        | Problem solving
Novice              | Networks (incomplete and loosely linked) | Knowledge accretion, integration and validation | Long chains of detailed reasoning steps through networks | Step-by-step process
Intermediate        | Networks (tightly linked and integrated) | Encapsulation                                   | Reasoning through encapsulated network; abbreviated      | Big steps (but still one at a time)
Expert              | Illness scripts                          | Illness script formation                        | Illness script activation and instantiation              | Groups of steps activated as a whole
Experienced expert  | Memory traces of previous cases          |                                                 | Instantiated scripts                                     | Automatic reminding
Boshuizen et al., 2004; Nievelstein, 2004
“Expert” Model • Defines the expected set of concepts and relations that represent the domain of knowledge at a specific point in time of the development of a learner. • It is not absolute • Derive it (semi-)automatically 16
"Expert" Model
1. 'Archetypical expert' model: state-of-the-art information (e.g., scientific literature)
2. 'Theoretical expert' model: documents of a particular course or context (e.g., course material, tutor notes, presentations)
3. 'Emerging expert' model: concepts and the relations a group of people (co-workers, peers, ...) use to describe a domain
These models range from absolute to relative.
17
18
Measuring conceptual development

Knowledge elicitation
• Measure the learner's understanding of the relationships among a set of concepts.
• Methods: concept maps, think aloud, card sorting, word association

Knowledge representation
• Define representations of the elicited knowledge that reflect the underlying cognitive organization.
• Methods: cluster analysis, tree constructions, dimensional representations, pathfinder nets

Evaluation of the representation
• Relative to some standard data
• Compare cognitive structures of experts and novices

Exploring the approach: investigating the use of different 'expert' models
1. Theoretical expert model: formal education; medical students, course and tutor materials; Leximancer and Pathfinder
2. Emergent expert model: informal learning; employees; Leximancer
19
Exploring the approach: investigating the use of different 'expert' models
1. Theoretical expert model: formal education; medical students, course and tutor materials; Leximancer and Pathfinder
   - Continuous or discontinuous? Gaps and transitions (Prince; Boshuizen, Schmidt)
2. Emergent expert model: informal learning; employees; Leximancer (Arts et al.)
20
Theoretical Expert Model (Leximancer and Pathfinder)
• Knowledge elicitation: a think-aloud protocol to elicit students' knowledge; the think-aloud protocols were transcribed
• Knowledge representation: Leximancer was used to generate concept maps for novices (think-alouds) & the theoretical expert model (tutor notes, learning materials)
• Evaluation of the representation: Pathfinder to compare cognitive structures of novices & model, identify similarities and differences
21
22
Generation of expert and student concept maps: Leximancer
23

Initial findings
Verification: output discussed with an expert
• The concept maps differ in their level of detail:
  - Student's concept map: detailed concepts (biology)
  - Model: encapsulated concepts, panoramic view of the knowledge (the disease)
24
Emergent Expert Model (Leximancer)
• Knowledge elicitation: a think-aloud protocol to elicit employees' knowledge; the think-aloud protocols were transcribed
• Knowledge representation: Leximancer was used to generate a single concept map of all think-alouds
• Evaluation of the representation: Leximancer to compare cognitive structures of novices & model, identify similarities and differences
25

Initial findings
• Indicate procedural knowledge, mentioning how to solve a problem ("the how")
• Explain the reasons and conditions of a problem ("the why")
26
Feedback Report (Leximancer)
"These are the concepts you mentioned the most: ...
From your peers, these are the most mentioned concepts: ...
The differences are: ...
This means that you might find it useful to:
• Read this material
• Do this activity
• Contact this person"
27
28

Future work
• Emergent model (representation, number, quantitative metrics)
• Validation of the reliability and usability of the emerging expert map & report
• Design and develop service v.1
• Pilot with medical students (English)
29

Questions?
Question mark photo by Leo Reynolds. Licensed under Creative Commons.
30
Contact: [email protected] or [email protected]
Project website: www.ltfll-project.org
Publications: DSpace, dspace.ou.nl/simple-search?query=LTfLL

Comparison of expert and student map: Pathfinder
31
Introduction
Lexical similarity metrics for vocabulary learning modeling in Computer-Assisted Language Learning (CALL)
Ismael ÁVILA and Ricardo GUDWIN, University of Campinas
• The L1 can create a basis for learning the vocabulary of an L2: the L1 lexicon helps the learner to infer the meanings of words in L2
• Techniques to compare the word-level distance between L1 and L2 are necessary to model this cross-linguistic influence (incl. quantitatively)
Introduction
• With this metric an ITS can anticipate which L2 words are more easily learned due to transfers from L1 and which ones produce interferences
• We present here a technique for measuring lexical similarity in terms of its effect on the learners' perceptual ability in recognizing L2 words with the help of the L1 lexicon
• The ITS can use this metric to initialize the LM (learner model) or to sequence the lexical units in terms of their easiness to a particular L1 audience
Lexical similarity
• Lexical similarities may be due to:
  - Common origin: e.g. Spanish "corazón" and Portuguese "coração"
  - Borrowings: e.g. Japanese "arigato" and Portuguese "obrigado"
  - Coincidences: e.g. Greek "oikia" and Tupi "oca"
• Examples: Direction (en) ↔ Direction (fr); House (en) ↔ Haus (de); Casa (it) ↔ Casa (pt)
• The similarity level has two main parallel dimensions: orthographic and phonetic. Each of them may vary from a level of "no similarity" to a level of "absolute match".
• Regardless of their origins, these similarities affect the language learning process and have to be considered by the ITS
Methods to measure string distance
• Levenshtein distance (LD) uses the minimum number of insertions, deletions and letter substitutions needed to transform one string into another:
  LD(s1, s2) = min(n_ins + n_del + n_subst)
• To account for the fact that one letter change is more relevant in short words than in long ones, normalized versions of LD have been used.
• Feature distance (FD) is given by the number of features (usually N-grams, substrings of N consecutive letters) in which two strings differ:
  FD(s1, s2) = max(N1, N2) - m(s1, s2)
  where N1 and N2 are the numbers of N-grams in s1 and s2, and m(s1, s2) is the number of matching N-grams
• The Levenshtein distance leads to slightly better classification accuracy, but the Feature distance allows for much faster searching.
Lexical similarity & language proximity
• An automated method avoids the subjectivity that is inherent in human-made comparisons: e.g. Gala (el) ↔ Leche (es)
• We want to measure effective similarity, not linguistic kinship, for similarity, even accidental, is what matters for learning easiness.
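The LD and FD measures introduced above can be sketched in a few lines. This is a minimal illustration under my own assumptions (function names and the choice of bigrams as features are mine, not from the slides):

```python
from collections import Counter

def levenshtein(s1, s2):
    """LD(s1, s2): minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]

def feature_distance(s1, s2, n=2):
    """FD(s1, s2) = max(N1, N2) - m(s1, s2), with N-grams as the features."""
    grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
    g1, g2 = grams(s1), grams(s2)
    matches = sum((g1 & g2).values())                   # m(s1, s2)
    return max(sum(g1.values()), sum(g2.values())) - matches

print(levenshtein("kitten", "sitting"))    # 3
print(feature_distance("night", "nacht"))  # 3 (4 bigrams each, only "ht" shared)
```

The dynamic-programming LD is quadratic in the word lengths, while FD only needs one multiset intersection per pair, which is why FD supports much faster searching.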
Lexical Similarity: perceptual aspects
• A written or printed word is a visual stimulus in the first place.
• Word recognition is easier after fixation of the leftmost than the rightmost letter of a word (the initial in many languages).
• Fixation on the leftmost letter makes the whole word fall in the right visual half-field, in direct connection to the dominant left hemisphere.
Lexical Similarity: perceptual aspects
• Word processing accuracy and speed depend on two factors:
  - the perceptibility of the individual letters as a function of the fixation location
  - the extent to which the most visible letters isolate the target word from its competitors
• The leftmost letters have a special role in word recognition (isolation from competitors).
• Reading and word recognition are not simply based on orthographic information, but involve the activation of phonological codes.

Lexical Similarity: semiotic aspects
• Intuitive word recognition factors are used as a common-sense technique when we create abbreviations: tks (thanks), pg (page), cmd (command) or ctrl (control).
• Matching initials and consonants is more likely to enable word recognition than matching the same number (same LD) of other letters without the initial or with vowels included (resp. tak, ae, oma, coto).
Lexical Similarity: semiotic aspects
• The recognition of an L2 word due to a similarity with correlated L1 words is an inference based on diagrammatic (iconic) features.
• This "intersymbolic iconicity" explains all the recognitions based on similarity, regardless of their cause: common origin, borrowings or simple coincidence.
• Slon (cz) ??? Elefant (dn) Elefante (pt)
The proposed LS metric
• In our technique we assign more value to the diagrammatic role of consonants than to other matchings, and emphasize the role of initials.
• It may be necessary to normalize consonants and clusters to the same notation: for instance, "š", "ŝ" and "sch" to "sh".
• The comparisons of the consonant or vowel sequences consider letter groupings such as "cntrl" or "oo".
• Weights are adjusted so that the maximum similarity is 1 (totally matching words) and the minimum is 0 (totally different words). The equation for intersymbolic similarity is:
IS = α(γ1·I + γ2·C + γ3·V) + β·P    (1)

where:
  IS: intersymbolic similarity (maximum = 1, minimum = 0)
  I: initials; C: consonants; V: vowels
  P: phonemes (can be decomposed as the orthographical part: γ4·I + γ5·C + γ6·V)
  α: weight of the orthographical similarity (adjusted according to the context)
  β: weight of the phonetic similarity (adjusted according to the context)
  γn: weights of the factors of similarity (e.g. γ1 = 0.4; γ2 = 0.4; γ3 = 0.2)
  with α + β = 1, γ1 + γ2 + γ3 = 1, and γ4 + γ5 + γ6 = 1
The proposed LS metric
Example: the intersymbolic similarities of the Italian word "tempo" respectively to speakers of Portuguese, Spanish, English, German and Finnish are:

L1 (tempo) → L2 (tempo): initials t=t; consonants tmp=tmp; vowels eo=eo
  IS = 0.6*(0.4*1 + 0.4*1 + 0.2*1) + 0.4*1 = 1
L1 (tempo) → L2 (tiempo): initials t=t; consonants tmp=tmp; vowels eo≈ieo
  IS = 0.6*(0.4*1 + 0.4*1 + 0.2*0.66) + 0.4*0.9 = 0.92
L1 (tempo) → L2 (time): initials t=t; consonants tmp≈tm; vowels eo≠ie
  IS = 0.6*(0.4*1 + 0.4*0.66 + 0.2*0) + 0.4*0.4 = 0.48
L1 (tempo) → L2 (Zeit): initials t≈Z(ts); consonants tmp≈Zt; vowels eo≈ei
  IS = 0.6*(0.4*0.5 + 0.4*0.16 + 0.2*0.33) + 0.4*0.2 = 0.28
L1 (tempo) → L2 (aika): initials t≠a; consonants tmp≠k; vowels eo≠aia
  IS = 0.6*(0.4*0 + 0.4*0 + 0.2*0) + 0.4*0 = 0
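The worked examples plug hand-judged component scores into equation (1). A minimal sketch, assuming the example weights α=0.6, β=0.4, γ1=γ2=0.4, γ3=0.2 (the component scores I, C, V, P are supplied by hand, as they are in the slides):

```python
def intersymbolic_similarity(I, C, V, P, alpha=0.6, beta=0.4,
                             g1=0.4, g2=0.4, g3=0.2):
    """Equation (1): IS = alpha*(g1*I + g2*C + g3*V) + beta*P."""
    # Sanity-check the weight constraints from the slides.
    assert abs(alpha + beta - 1.0) < 1e-9 and abs(g1 + g2 + g3 - 1.0) < 1e-9
    return alpha * (g1 * I + g2 * C + g3 * V) + beta * P

# tempo (it) -> tempo (pt): every component matches
print(round(intersymbolic_similarity(1, 1, 1, 1), 2))       # 1.0
# tempo (it) -> tiempo (es): vowels eo ~ ieo (0.66), phonetics ~ 0.9
print(round(intersymbolic_similarity(1, 1, 0.66, 0.9), 2))  # 0.92
# tempo (it) -> aika (fi): nothing matches
print(round(intersymbolic_similarity(0, 0, 0, 0), 2))       # 0.0
```

The hard part of the metric is not the weighted sum but producing the component scores; the slides derive them from aligned initials, consonant sequences and vowel sequences of the two words.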
The proposed LS metric
Original word: "physics"; transformations:
  to Czech "fizyka" (sisssss), LD = 13
  to Polish "fyzika" (sixsxss), LD = 9
  to Afrikaans "fisika" (sisxxss), LD = 9
  to Italian "fisica" (sisxxxs), LD = 7
  to French "physique" (xxxxxssi), LD = 5

The results for intersymbolic similarity are:
  IS1 = 0.6*(0.4*0.8 + 0.4*0.65 + 0.2*0.8) + 0.4*0.8 = 0.764
  IS2 = 0.6*(0.4*0.8 + 0.4*0.65 + 0.2*0.9) + 0.4*0.8 = 0.776
  IS3 = 0.6*(0.4*0.8 + 0.4*0.72 + 0.2*0.8) + 0.4*0.8 = 0.781
  IS4 = 0.6*(0.4*0.8 + 0.4*0.80 + 0.2*0.8) + 0.4*0.8 = 0.800
  IS5 = 0.6*(0.4*1.0 + 0.4*0.90 + 0.2*0.9) + 0.4*0.8 = 0.884
The proposed LS metric - Conclusions
• Whereas LD measured distances ranging from 5 to 13, the IS produced similar scores for the five L2 words, arguably because the technique can capture the fact that all the words are more or less recognizable from the original word.
• Conversely, an opposite situation, in which two words produce a smaller LD but score worse on IS, would be "glamour" (en) and "amour" (fr): their LD = 2 is smaller, but their IS = 0.52 indicates less actual similarity.
• We believe that the IS captures the crucial features that make a word more easily recognizable by learners.
• We can assume that there is a threshold below which recognition will no longer be possible (based on IS).
• A field study is being designed to investigate how this threshold relates to the lexicon of each subject's L1 and to other known L2s.
Conclusions
• This technique is aimed at offering a practical word-level similarity metric to compare words from different languages, so that this measure can be used as an input to initialize the LM or to evaluate word-level errors in the context of CALL applications. It is not aimed at replacing other formalisms, nor at creating new computational treatments of lexical rules.
Cohesion, Semantics and Learning in Reflective Dialog
Arthur Ward, John Connelly, Sandra Katz, Diane Litman, Christine Wilson
Learning Research and Development Center, University of Pittsburgh

Outline
- Motivation: why study cohesion? A way to study interactivity in tutorial dialog
- Previous work: automatic "lexical" cohesive ties; now try a more sophisticated measure
- Tag definitions: a set of "semantic" cohesive ties
- Corpus: pre/post-tests & transfer questions
- Applying the tags
- Results: abstraction & specialization important for learning, and transfer
2
Interactivity in Tutorial Dialog
- Human tutoring is very effective (Bloom 1984; Cohen, Kulik & Kulik 1982). Why?
- Maybe because it is interactive (Chi et al. 2001, 2008; Graesser et al. 1995)
- What specific interactive mechanisms help? Other?
3

Cohesive Ties
- Cohesion: how a text "hangs together" (Halliday & Hasan 1976)
- Repetition of words, use of pronouns, ellipsis, etc.
- Measurable using "cohesive ties": ways to study interactivity in dialog
4

Previous work (Ward & Litman 2006, 2008)
- Counted cohesive ties between tutor & student: repetition of words, word stems, hyponyms/hypernyms (identified using WordNet)
- Correlated with learning; automatically computable
- But missed many of Halliday & Hasan's cohesive devices
Current work
- Manually tag for cohesive ties not automatically identifiable, in a different corpus
- Like before, focus on when tutor and student refer to each other's contributions
5

Cohesion Tag Set
Lexical ties (e.g. word repetition, like before):
- Exact: word or word stem repetition
- Synonym: two words with similar meanings
- Paraphrase: phrase repetition w/ substitution
- Pronoun: pronominal reference ("she", "it")
Semantic ties:
- Superordinate-class: more general referring term
- Class-member: more specific referring term
- Collocation: complementarity ("up-down")
- Negation: direct contradiction
6
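The earlier automatic approach counted lexical ties such as exact word or stem repetitions between tutor and student turns. A toy sketch, under my own simplifying assumptions (a regex tokenizer and a crude suffix-stripping stemmer stand in for the original system, which also used WordNet hyponym/hypernym links):

```python
import re

def stem(word):
    # Toy suffix-stripping stemmer (a simplification; the original
    # work matched word stems and WordNet hyponym/hypernym relations).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def exact_ties(turn_a, turn_b):
    """Count 'exact' cohesive ties (word or word-stem repetitions)
    between two adjacent dialog turns."""
    tokens = lambda t: {stem(w) for w in re.findall(r"[a-z]+", t.lower())}
    return len(tokens(turn_a) & tokens(turn_b))

s_turn = "yes, because gravity pulls the firecracker down"
t_turn = "Good, that's right. What about the horizontal direction?"
print(exact_ties(s_turn, t_turn))  # 1 (the shared token "the")
```

A real tie counter works on spans rather than bags of words, which is exactly the span-identification difficulty the deck discusses later.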
The Corpus
- Reflective tutoring dialogs with a human tutor (Katz et al. 2003), after problem solving in Andes (VanLehn et al. 2005)
- Study procedure:
  - 16 students solved 12 physics problems each
  - Answered 3-8 reflection questions
  - Example: "Suppose the maximum tension that the bungee cord could maintain without snapping was 700 N. What would happen ..."
- Resulting corpus has 953 reflective dialogs
  - 2,218 student turns
  - 2,136 tutor turns
7

The Corpus
- Counter-balanced pre- & post-tests
  - 9 quantitative mechanics questions, similar to Andes problems
  - 27 qualitative physics questions: new questions, not like Andes problems ("far transfer" questions)
- Students learned significantly by both measures
8

Cohesion Tag Example (slides 9-16)
Tagging the Corpus
- Training: 518 student & tutor turns; refining tag definitions
- Initial tagging pass: lexical features only, without reference to semantic context; spans agreed by discussion
- Final tagging pass: re-evaluated 3 tags using contextual features: "superordinate-class," "class-member," "collocation"
  - Eliminated ties that didn't make sense: mismatched topics or referents; didn't seem to involve knowledge construction
- 2nd tagger re-tagged a random 10%; Kappa = .57
17

Final Tagging Example
S: "yes, because gravity pulls the firecracker down and gives it motion in the `y` direction."
T: "Good, that's right. What about in the horizontal directions? for example the 'x' direction on your diagram?"
In first pass: tagged lexical relations
- "down" is a specific "direction"
- so tag down-direction as "superordinate-class"
18
Final Tagging Example
S: "yes, because gravity pulls the firecracker down and gives it motion in the `y` direction."
T: "Good, that's right. What about in the horizontal directions? for example the 'x' direction on your diagram?"
In second pass:
- notice that the student already used "direction"
- tutor did not do a new generalization
- remove the tag
19

Analysis
Linear model for each cohesion tag
- Predict post-test score from:
  - pre-test score (because correlated with post-test score)
  - standardized math score (useful predictor of learning in Andes)
  - tag count (normalized by # of student or tutor turns)
- Separate models for:
  - high pre-testers, low pre-testers, all students
  - qual ("near"), quant ("far") & all questions
20
Analysis
Linear model for each cohesion tag: example for the "student superordinate-class" tag
21

Results
All students, all questions
("T:" = Tutor; "S:" = Student; "Super-Ord" = superordinate-class; "Class-mem" = class-member)
- "semantic" ties correlate
- no results for "exact" in this corpus
22

Discussion
- Previous work showed that automatic measures of cohesion correlated with learning
- Current work suggests cohesion also correlates in the new corpus
- Abstraction/specialization seem to be important cohesive mechanisms in tutoring
- Span identification is the hardest part
23

Span Identification is Hard
Example
- S: "No the force the airbag exerts back on the man after he goes into is one."
- T: "The airbag force and the force of the person on the airbag is such a pair. good. All forces come in such pairs! What is the 'reaction force' for the driver's weight?"
Overlapping spans:
- "force"-"forces": exact
- "force the airbag exerts"-"airbag force": paraphrase
- "force the airbag exerts back on the man"-"pair": superordinate class
24
Span Identification is Hard
(The airbag example is repeated on slides 25-27, highlighting each overlapping span in turn.)
Overlapping spans:
- Spans often don't correspond to syntactic structures
- Words often participate in more than one span
- Spans are sometimes split ("those forces")
28
Future Work
- Investigate automatic detection (maybe don't need accurate spans?)
- Could improve student models by detecting student abstraction
- Could improve tutoring by including more tutor abstraction/specialization at appropriate places (what's an appropriate place?)
29

Thanks
Learning Research & Development Center
ONR N000140710039
The ITSpoke group
Pam Jordan
30
Intelligent Tutoring Systems
Speling Mistacks & Typeos: Can Your ITS Handle Them?
Adam M. Renner (a), Philip M. McCarthy (b), Chutima Boonthum (c), Danielle S. McNamara (a)
(a) University of Memphis, Psychology / Institute for Intelligent Systems
(b) University of Memphis, English / Institute for Intelligent Systems
(c) Hampton University, Computer Science
ITS User-Language
- Contains a high rate of typographical & grammatical errors; not a new issue in NLP
- Traditional spellchecking not suitable (e.g., MS Word, email)
- ITSs necessitate automatic corrections: Why2-Atlas (VanLehn et al., 2002); CIRCSIM-Tutor (Elmi & Evens, 1998); many more just ignore errors
- NLP tools thought resistant to errors: LSA (Landauer et al., 2007) measures semantic overlap across two whole texts. But short responses? Responses with multiple errors?
- NLP tools are trained on edited text; when used in an ITS, similarity assessment is inevitably affected
User-Language Paraphrase Corpus
- 1998 target sentence/student response pairs
- Paraphrase attempts by high school students during interactions with iSTART (McNamara, Levinstein, & Boonthum, 2004)
- Paraphrases evaluated on widely used computational indices:
  - Latent Semantic Analysis (LSA; Landauer, McNamara, Dennis, & Kintsch, 2007)
  - Entailment (Rus et al., 2007)
  - Type-Token Ratio (TTR; Graesser, McNamara, et al., 2004)
  - Mean Edit Distance (MED; McCarthy et al., 2007)
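Of these indices, TTR is straightforward, and an edit-distance-based index can be illustrated as well. This is a simplified sketch, not the exact formulae of the cited papers; `mean_word_edit_distance` in particular is my own stand-in for MED:

```python
def type_token_ratio(text):
    """TTR: number of distinct words divided by total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mean_word_edit_distance(response, target):
    """Illustrative MED-style index (an assumption, not McCarthy et al.'s
    formula): mean edit distance from each response word to its closest
    target word. Misspellings raise the score; corrections lower it."""
    t_words = target.lower().split()
    r_words = response.lower().split()
    return sum(min(edit_distance(w, t) for t in t_words)
               for w in r_words) / len(r_words)

print(type_token_ratio("a lot of heat made by a lazy person"))
print(mean_word_edit_distance("increace in tempiture", "increase in temperature"))
```

The point the slides make is that indices like these shift substantially once spelling errors are corrected, so feedback computed from the raw response can misjudge the paraphrase.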
Intelligent Tutoring Systems
- Many ITSs use conversational dialogue; NLP provides assessment of user input and determines feedback
- Input is matched to a benchmark and assessed for similarity; guided feedback is based on the user's response
- Assessment is limited by the proficiency of the user: high school students or younger make typing errors and spelling mistakes, so the system must judge what the student intended

Problems with Evaluating User-Language
- Lack of "colloquial" paraphrase corpora
  - Microsoft Research Paraphrase Corpus (Dolan, Quirk, & Brockett, 2004): only a binary rating (is/is not a paraphrase)
  - Echo Chamber (Brockett & Dolan, 2005); Paraphrase Game (Chklovski, 2005)
- Limitations in "cleaning" ITS input
  - Datasets artificially created (Fossati & Di Eugenio, 2008)
  - Target populations are relatively proficient: Why2-Atlas (college undergraduates); CIRCSIM-Tutor (1st-year medical students)
  - Use lexicons; computationally expensive

Research Questions
- How are established computational indices affected by the types of errors found in typed user-language?
- Do user errors affect NLP assessment and feedback produced by an established ITS?
- Does correcting user errors improve the capacity of ITS assessment to correspond to human ratings?

Paraphrases were also evaluated by trained experts on 10 dimensions with Likert ratings.
iSTART
- High school students (U.S. grades 9-12)
- Reading strategy training: paraphrasing, elaboration, making bridging inferences, comprehension monitoring
- Paraphrase the following: "Over two thirds of the heat generated by a resting human is created by organs of the thoracic and abdominal cavities and the brain."
  Student response: "a lot of heat made bya lazy person is made by systems of your stomack and thinking box."

iSTART Evaluation Process
- Based on the match between paraphrase and target sentence
- Frozen expressions (e.g., "I think this is saying ...") are responded to or removed
- Word & Soundex matching against the benchmark for length, relevance, & similarity:
  - Irrelevant (IRR): too few words match
  - Too short (SH): response is shorter than a specified threshold
  - Too similar (SIM1): length and word match are close to the benchmark
- Word match & LSA cosines for quality:
  - Adequate paraphrase (SIM2)
  - Better than a paraphrase (OK)
- Detailed formulae: McNamara, Boonthum, et al. (2007)
Soundex
- Compensates for misspellings (Christian, 1998)
- Vowels are removed; like-sounding consonants are mapped onto the same symbol (e.g., b, f, p, v)
- Lexicon-free, which avoids the word-frequency problem: students make more mistakes on new or uncommon words

Procedure
- Identified, coded, & corrected all errors, based on validated models of grammar (e.g., Foster & Vogel, 2004)
- Interrater agreement for a subset (n = 200): Kappa = .70, p < .001; a single rater coded the entire corpus
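The Soundex scheme described above can be sketched as the classic algorithm. This is the standard Russell/Odell variant, which may differ in detail from iSTART's implementation:

```python
def soundex(word):
    """Classic Soundex code: keep the first letter, map like-sounding
    consonants to the same digit, drop vowels, collapse repeats,
    and pad/truncate to 4 characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":   # h and w do not separate same-coded consonants
            prev = digit
    return (result + "000")[:4]

# A misspelling maps to the same code as the intended word:
print(soundex("increase"), soundex("increace"))  # I526 I526
```

This is why Soundex matching is robust to many of the internal spelling errors the corpus analysis found, without needing a lexicon.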
Results
- 83% of responses contained some form of error
- 52% had some form of spelling error
- 63% of spelling errors were internal to the target sentence

Error types & frequencies:
  Spelling (internal): 665 (33%)
  Spelling (external): 386 (19%)
  Capitalization: 1157 (58%)
  S-V agreement: 367 (18%)
  Article agreement: 75 (4%)
  Preposition agreement: 53 (3%)
  Determiner agreement: 59 (3%)
  Spacing: 174 (9%)
  Punctuation: 344 (17%)
  Conjunction agreement: 43 (2%)
  Possessive agreement: 71 (4%)
  Extra/omitted/substituted words: 230 (12%)
Results
- Significant effect of error correction on the computational similarity indices (partial Eta2): LSA = .178; Entailment = .268; TTR = .240; MED = .111
- Spelling (internal) accounts for a large portion of variance (adjusted R2): LSA = .35; Entailment = .45; TTR = .46; MED = .17
Results: Example
Target sentence: "An increase in temperature of a substance is an indication that it has gained heat energy."
Student response: "increace in tempiture has gaind heat energy."
Revised response: "Increase in temperature has gained heat energy."
Effect of the correction on the indices: LSA .54 → .90; Entailer .41 → .78; TTR .86 → .62; MED .78 → .60
Results
- Compared iSTART feedback's correspondence to human ratings of paraphrase quality
- Removed cases that required no correction or were entirely garbage (n = 328)
- Separate ANOVAs for original and corrected responses; dependent = Paraphrase Quality, fixed factor = iSTART response
  - Original paraphrases: F(5, 1636) = 53.324, p < .001
  - Corrected paraphrases: F(5, 1636) = 58.543, p < .001
Table 1: Crosstabulation of iSTART responses to user paraphrases
iSTART response      |           iSTART response - corrected
(original paraphrase)| Better | Good | Too Similar | Too Short | Irrelevant | Frozen | Total
Better               |  691   |  45  |     37      |     4     |     0      |   0    |  777
Good                 |   12   | 194  |     98      |     0     |     0      |   0    |  304
Too Similar          |    7   |   7  |    527      |     0     |     0      |   0    |  541
Too Short            |   11   |   0  |      1      |   206     |     2      |   1    |  221
Irrelevant           |    6   |   0  |      0      |     6     |   120      |   7    |  139
Frozen               |    0   |   0  |      0      |     0     |     0      |  16    |   16
Total                |  727   | 245  |    663      |   216     |   122      |  24    | 1998
Cramer's V = .849, p < .001; Marginal Homogeneity (MH) = 5.892, p < .001
Results: separate pairwise comparisons of Paraphrase Quality
                         |        Original          |        Corrected
Comparison               | Mean Diff.  SE    Sig.(a)| Mean Diff.  SE    Sig.(a)
Frozen vs Irrelevant     |   .152     .402   1      |   .081     .361   1
Frozen vs Too short      |  -.776     .370   .581   |  -.922     .299   .032
Frozen vs Too Sim        | -1.955     .363   < .001 | -2.176     .288   < .001
Frozen vs Good           | -2.071     .366   < .001 | -2.421     .297   < .001
Frozen vs Better         | -1.897     .361   < .001 | -2.106     .288   < .001
Irrelevant vs Too short  |  -.918     .209   < .001 | -1.002     .245   .001
Irrelevant vs Too Sim    | -2.107     .196   < .001 | -2.257     .231   < .001
Irrelevant vs Good       | -2.223     .203   < .001 | -2.502     .242   < .001
Irrelevant vs Better     | -2.0249    .192   < .001 | -2.187     .231   < .001
Too short vs Too Sim     | -1.189     .115   < .001 | -1.255     .112   < .001
Too short vs Good        | -1.305     .127   < .001 | -1.500     .133   < .001
Too short vs Better      | -1.131     .111   < .001 | -1.185     .111   < .001
Too similar vs Good      |  -.116     .103   1      |  -.245     .107   .331
Too similar vs Better    |   .058     .082   1      |   .070     .077   1
Good vs Better           |   .174     .097   1      |   .315     .106   .044
(a) Adjustment for multiple comparisons: Bonferroni.
Discussion
- ITS feedback algorithms may be optimized if user-language can be filtered prior to processing
- Misclassification is OK for motivation; inaccuracy is not: a simple rewording can pass for a good paraphrase, and a paraphrase can pass for better
- Established NLP approaches are not as robust to user-language as believed: response length is not enough to wash out individual errors; the ULPC represents the types & amount of errors real students make
- Most variance is accounted for by internal misspellings, which provides direction for future research: automatic spelling corrections applied only for words in the benchmark would be silent & computationally light
Thank you! We would like to thank: Vasile Rus, Ben Duncan, John Myers, Rebekah Guess.
Research supported by: IIS-0735682, R305A080589
Application of an episodic memory modeling perspective on text categorization using Random Indexing

The episodic memory metaphor in text categorization with Random Indexing
Yann Vigile Hoareau & El Ghali
CHArt - Lutin (Paris 8 / EPHE)
DEFT'09, Paris, 22 June 2009

Outline
- From episodes to concepts
- The DEFT'09 text-mining contest
- Results
- Perspectives

The episode (Hintzman, 1988)
2
3
Effect of frequency of episodes on echo (Hintzman, 1988)
• Mean and variance of the echo increase with frequency

From episodes to concepts
• A famous model of episodic memory: MINERVA 2
  – Description
  – Simulation of the effect of episode frequency
• A Word Vector model: Random Indexing
  – Description
  – Text categorization and the application of the distributional hypothesis
• The DEFT'09 text-mining contest
• Results
• Perspectives
A Word Vector model: Random Indexing

The common principles behind Word Vectors:
• Implementing the distributional hypothesis
• Dealing with large corpora
• Working on a context window
• Building a matrix that holds the uses of words as a function of their contexts
• Reducing the matrix

• Create a matrix s (c × e) containing the index vectors
  – c is the number of documents or contexts in the corpus
  – e is the number of dimensions (≈ 1000)
• Index vectors are sparse, randomly generated vectors; they consist of a small number of +1 and −1 entries
• Using vectorial methods to manipulate words or groups of words
• Create a matrix (t × e) containing the term vectors
  – t is the number of terms composing the corpus
• The process is incremental: to start the matrix compilation, all cell values are initialized to 0
• For each document of the corpus, each time a term appears in a document, accumulate the index vector corresponding to the document into the term vector corresponding to the term
• At the end of the process, term vectors that appeared in similar contexts have accumulated similar index vectors
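The Random Indexing construction described on these slides can be sketched end to end: generate one sparse ternary index vector per document, then add that vector into the row of every term the document contains. The dimensionality, sparsity, and the three-document toy corpus below are illustrative assumptions, not the contest setup.

```python
import random

random.seed(0)
DIM = 1000   # reduced dimensionality e (slides suggest around 1000)
NNZ = 10     # number of non-zero (+1/-1) entries per index vector

def index_vector():
    """Sparse, randomly generated vector with a few +1 and -1 entries."""
    v = [0] * DIM
    for pos in random.sample(range(DIM), NNZ):
        v[pos] = random.choice([-1, 1])
    return v

# Toy corpus: each document is a list of terms.
corpus = [
    ["memory", "episode", "trace"],
    ["memory", "episode", "echo"],
    ["matrix", "vector", "dimension"],
]

doc_index = {d: index_vector() for d in range(len(corpus))}  # matrix s (c x e)
term_vectors = {}                                            # term matrix (t x e)

# Incremental accumulation: each time a term appears in document d,
# add d's index vector into the term's vector.
for d, doc in enumerate(corpus):
    for term in doc:
        tv = term_vectors.setdefault(term, [0] * DIM)
        for i, x in enumerate(doc_index[d]):
            tv[i] += x

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Terms sharing contexts end up with similar accumulated vectors.
print(cosine(term_vectors["memory"], term_vectors["episode"]))  # high
print(cosine(term_vectors["memory"], term_vectors["matrix"]))   # near 0
```

Because no matrix factorization is needed, the whole process stays incremental: a new document only requires one fresh index vector and a pass over its terms.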
The DEFT'09 text-mining contest
• DEFT'09's main task was opinion categorization: subjectivity/objectivity detection in multilingual journal corpora (fr, en, it)
• 60% of the corpora for training
• Limited time test period (a few days)
)"
)#
Effect of frequency of episodes on echo (Hintzman, 1988)
• Mean and variance of the echo increase with frequency

A French Text-Mining contest

Principles
• Build a semantic memory from all the available episodes
• Organize episodes in categories following the principles of episodic memory models
  – Splitting the categories into homogeneous sub-categories regarding their typicality
  – The created sub-categories are considered as local episodic memories
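The Hintzman (1988) frequency effect, that both the mean and the variance of the echo grow with the number of stored episodes, can be illustrated with a toy simulation in the spirit of MINERVA 2: each trace is a noisy copy of the item, the probe's similarity to each trace is cubed to give an activation, and activations sum into the echo intensity. The feature count `N`, encoding probability `L`, and run counts are illustrative assumptions, not the authors' simulation parameters.

```python
import random
import statistics

random.seed(1)
N = 20    # features per trace
L = 0.6   # probability that a feature is encoded (vs. lost as 0)

def echo_intensity(probe, traces):
    """MINERVA 2-style echo: similarity cubed, summed over all traces."""
    total = 0.0
    for t in traces:
        s = sum(p * q for p, q in zip(probe, t)) / N
        total += s ** 3
    return total

def simulate(freq, n_runs=500):
    """Mean and variance of echo when the probed item was stored freq times."""
    intensities = []
    for _ in range(n_runs):
        item = [random.choice([-1, 1]) for _ in range(N)]
        traces = [[f if random.random() < L else 0 for f in item]
                  for _ in range(freq)]
        intensities.append(echo_intensity(item, traces))
    return statistics.mean(intensities), statistics.variance(intensities)

for freq in (1, 3, 5):
    m, v = simulate(freq)
    print(f"frequency={freq}: mean echo={m:.2f}, variance={v:.3f}")
```

Each additional episode contributes an independent, noisy activation, so both the mean and the spread of the echo scale up with frequency.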
Assigning a category
Principles
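The slides do not spell out the assignment procedure at this point, but a sketch consistent with the "local episodic memories" idea is nearest-centroid assignment: each category is represented by one or more sub-category centroids, and a new text goes to the category of its most similar centroid. Everything below is a toy assumption, including the 3-dimensional vectors and the two hypothetical sub-categories; the real system would use Random Indexing term vectors.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean vector of a sub-category's training texts."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical sub-categories ("local episodic memories"), built from
# toy training vectors split by typicality.
memories = {
    "subjective": [centroid([[3, 1, 0], [4, 0, 1]]),   # strongly opinionated
                   centroid([[2, 2, 0]])],              # mildly opinionated
    "objective": [centroid([[0, 1, 4], [1, 0, 5]])],
}

def assign(vec):
    """Category of the most similar sub-category centroid."""
    return max(((cosine(vec, c), cat)
                for cat, cents in memories.items() for c in cents))[1]

print(assign([3, 1, 1]))  # -> "subjective"
```

Splitting a category into homogeneous sub-categories before taking centroids keeps atypical members from dragging a single prototype away from any of its texts.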
Results

Perspectives
• Application to larger opinion categorization tasks: TREC'09 Blog track
• Interfacing with other word-vector methods (LSA, HAL, …)
Possible applications in Education
• Educational resources management:
  – Resource retrieval: help users determine the value of an educational resource (factual vs. opinion)
  – Resource classification in both thematic and opinion dimensions
• Assessment or essay-scoring:
!"##$ %&'&()$ *+",)"-.$ /0&($ 1($ 23"(&.$ "#0$ 4)#&5$ 6)',+5$ !""#$%&'( )*+,-.,#/%'0( ',( %/,12$#-0/,-$3( 4'( ,'5,'0( /6'%( 7/34$8( 934'5-32.$ 4"#5$ ()5$ !"#$%& '$& (#$)*$+& '$& "),#-+$& '$& )./'*#*01& 2334& '-& '/5& 60-*))$& '$& #$7#$& 89:;.34.$ 77$ 8-$ 799:.$ ;",&5.$ !"##$%&'&()$*+",)"-$"#0$/0&($1($23"(&$(7/34$8(934'5-32(/34(,&'('"-0$4-%(8'8$#;( 8',/"&$#=( !""+-%/,-$3( ,$( ,'5,( %/,'2$#-