Evaluating the Output of an Embodied Multimodal Dialogue System

Mary Ellen Foster
Technische Universität München / University of Edinburgh
DNLG, 21 January 2008

Evaluating generation systems
● Generation is an open-ended task
  – Like “counting from zero to infinity” (Wilks, 1990)
● Issues for evaluation (Mellish and Dale, 1998):
  – Defining input and output
  – Choosing what to measure
  – Selecting a control or baseline for comparison
  – Obtaining adequate training/test data
  – Dealing with disagreement of human judges


Task-based evaluation
● Test whether the system achieves its goals better with advanced generation techniques enabled
  – Carenini and Moore (2006): tailoring evaluative arguments to users affected their choices
  – Di Eugenio et al. (2005): users learned more from a tutoring system that used aggregation
  – Elhadad et al. (2005): doctors found information more quickly with clinical summaries tailored to patients
  – Reiter et al. (2003): tailored smoking-cessation materials did not make a difference


Direct human judgements
● Task-based evaluation requires ...
  – A fully-working system that can be used by naive users
  – A baseline version of the same system
  – A pool of subjects who can perform the task
● Asking users for direct judgements
  – Use canned outputs from a partially-working system
  – Alleviate issues with slow or unreliable input
  – Participants do not need to be experts (“overhearer”)


Some direct-judgement studies
● Binsted et al. (1997): children found (some) generated jokes as good as human-written jokes
● Hartley et al. (2000): acceptability of generated documentation comparable to human-written
● Walker et al. (2004): users liked restaurant recommendations tailored to their own preferences
● Belz and Reiter (2006): subjects compared generated and hand-written weather forecasts (and sometimes found generated better)


Other human evaluation techniques
● Analyse post-edits (Sripada et al., 2005; Sripada and Gao, 2007)
● Direct physiological measures:
  – Prendinger et al. (2005a): eye-tracking
  – Prendinger et al. (2005b): galvanic skin response
  – Bailenson et al. (2004): interpersonal distance in a virtual environment


Automated evaluation
● Why use it?
  – Stochastic systems must be evaluated frequently during development
  – Recruiting judges can be time-consuming and costly
● The goal: find metrics that
  1. Can easily be computed, and
  2. Agree with human judgements of quality
● Existing metrics: PARADISE (Walker et al., 1997); BLEU (Papineni et al., 2002); ROUGE (Lin, 2004)
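To make the kind of metric meant here concrete, the sketch below computes a clipped n-gram precision of a generated sentence against reference texts, in the spirit of BLEU. It is a simplified toy with invented example strings, not the official BLEU implementation.

from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n=2):
    """Clipped n-gram precision of a candidate against reference texts
    (a simplified, BLEU-style overlap score)."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram counts only up to the maximum number of
    # times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Toy usage with invented strings:
refs = ["here is a design in the family style"]
print(ngram_precision("this is a design in the family style", refs))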


Some automated-evaluation studies
● Bangalore et al. (2000): corpus-based metrics for word ordering in a realiser
● White (2004): measure accuracy and speed of the OpenCCG realiser through cross-validation
● Karamanis and Mellish (2005): corpus-based methods for evaluating information ordering
● Note: users may not like systems that score highly on corpus similarity (Reiter and Sripada, 2002; Stent et al., 2005; Belz and Reiter, 2006)


Shared-task evaluation?
● Approaches to a well-defined problem are compared based on their performance on a common task; common in NLU
● Hot topic in NLG (White and Dale, 2007)
● Pilot study (Belz et al., 2007): attribute selection for referring expressions
  – 6 teams, 22 systems
  – Follow-up challenge for 2008 underway
● Also: giving instructions in virtual environments


The COMIC dialogue system
● Adds a multimodal dialogue interface to a CAD-like application used for bathroom redesign
● Output generation in COMIC
  – Dialogue manager selects the high-level message
  – Fission module plans the multimodal content, using OpenCCG for text realisation
  – Content is produced using specific output modules (Festival, RUTH)


(COMIC demo)


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Tailored generation in COMIC
● COMIC takes user preferences into account when generating output
  – Selecting the designs to show
  – Choosing the attributes to describe
  – Structuring the content
  – Choosing specific words to use


Sample tailored description
● User: Tell me about this design [click]
● COMIC:
  – (target) Here is a family design. As you can see, the tiles have a blue and green colour scheme. It has floral motifs and artwork on the decorative tiles.
  – (non-target) Here is a family design. Its tiles are from the Lollipop collection by Agrob Buchtal. Although the tiles have a blue colour scheme, it does also feature green.
  – (neutral) This design is in the family style. It uses tiles from Agrob Buchtal's Lollipop series. There are artwork and floral motifs on the decorative tiles.


Evaluation
● Subjects saw and heard one eight-turn dialogue between COMIC and a hypothetical user
● User preferences were on screen at all times
● For each system turn, subjects saw two options and had to choose which was more appropriate for the user
● Four possible user models (plus neutral); each subject made all pairwise comparisons twice
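One standard way to analyse such two-alternative forced-choice counts (not necessarily the analysis used in the original study) is an exact binomial sign test against the chance level of 0.5:

from math import comb

def sign_test_p(successes, trials):
    """Two-sided exact binomial test against p = 0.5.

    Probability of an outcome at least as extreme as `successes`
    under the null hypothesis that both options are equally likely."""
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Toy usage with the overall selection counts reported on the next slide
# (target version chosen 114 times out of 184 trials):
print(sign_test_p(114, 184))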


Overall selections
[Bar chart: across all trials, subjects chose the version tailored to the target user model 114 times and the other version 70 times.]

“Conflicting concessions”
● Explicit concessions to negative preferences (“although”, “though”, etc.)
● Two classes of trials:
  – Conflict: at least one conflicting concession across the two versions
  – No conflict: no concessions, or the same concessions in both versions


Conflicting-concession selections
[Bar chart: target vs. other selections, split into conflict and no-conflict trials; the counts shown are 90, 41, 29 and 24.]

What's going on?
● Positive preferences do have an effect
  – Positive features are placed near the start
● Negative features have a bigger effect (“although”, etc.)
● Because it was not the subjects' own preferences?
● Because COMIC is a “sales” system, so the default mode is positive, and only marked departures are perceptible?


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Avoiding repetition in generated text
● Classic rule of writing: avoid repetition to keep text interesting and lively
● Often followed in generation systems:
  – The ability to create multiple expressions “lets us vary the way in which the system expresses the same fact.” (Isard et al., 2003)
  – Random choice of realisation option “maximise[s] the variety of sentences produced by the system” (van Deemter et al., 2005)
● Never explicitly evaluated


Implementation
● Using the OpenCCG realiser, which incorporates n-gram language models into its processing
  – Including disjunctive logical forms (White, 2006)
● Two anti-repetition techniques (sketched below):
  – ε-best sampling: perform normal realisation and then choose one of the top alternatives
  – Anti-repetition scoring: store the words from previous sentences and penalise repetitions
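A minimal sketch of the two techniques, operating on a plain list of scored realisation candidates rather than on OpenCCG's internal structures; the function names, parameter values and example candidates are invented for illustration.

import random

def epsilon_best_sample(candidates, epsilon=0.2):
    """candidates: list of (score, text) pairs from the realiser.
    Instead of always returning the single best realisation, pick
    uniformly among all candidates within epsilon of the top score."""
    best = max(score for score, _ in candidates)
    near_best = [text for score, text in candidates if score >= best - epsilon]
    return random.choice(near_best)

def anti_repetition_rescore(candidates, previous_words, penalty=0.1):
    """Penalise each candidate for every word it repeats from
    earlier sentences in the description sequence."""
    rescored = []
    for score, text in candidates:
        repeats = sum(1 for w in text.lower().split() if w in previous_words)
        rescored.append((score - penalty * repeats, text))
    return rescored

# Toy usage with invented candidates and history:
history = {"this", "design", "is", "country"}
cands = [(0.95, "This design is also country ."),
         (0.90, "Here is another design in the country style .")]
print(epsilon_best_sample(anti_repetition_rescore(cands, history), epsilon=0.2))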


Human evaluation
● Gather user judgements on short description sequences, presented as minimal pairs
● Two versions:
  – Baseline
  – Both anti-repetition measures enabled
● Subjects made three forced choices
  – Understandability
  – Repetitiveness
  – Writing quality


Description sequences
● Default:
  – This design is country. It is based on the Sandstein collection by Porcelaingres. The colours are brown, grey and black. There are geometric shapes on the decorative tiles. This design is also country. It is based on the Cardiff collection by Aparici. The colours are cream and dark red. It also has geometric shapes on the decorative tiles.
● Anti-repetition:
  – Here is a design in the country style. It uses tiles from the Sandstein collection by Porcelaingres. It has brown, grey and black in the colour scheme. The decorative tiles have geometric shapes. This one is also country. It draws from Cardiff, by Aparici. The colour scheme features cream and dark red. The decorative tiles also have geometric shapes.

Evaluation interface


Results
[Bar chart of forced-choice counts for the default vs. anti-repetition versions: “more repetitive” 210 (default) vs. 82 (anti-repetition); “better written” 180 (anti-repetition) vs. 112 (default); “easier to understand” split 157 vs. 135.]

Automated evaluation
● Exploring the parameter settings
  – ε-best sampling: threshold
  – Anti-repetition scoring: penalty
● Test the impact on generated text of a range of parameter values
  – 0, 1, 5, 10, 20 for each parameter (log space)
  – Sentences in the human evaluation had both parameters = 20


Metrics
● Variability: edit distance between pairs of descriptions
● N-gram scores
● Dispreferred paraphrases: word sequences permitted by the grammar but normally filtered out by the n-gram models
  – Also at start and end
  – We here have and We have ... here
  – Is family
(A sketch of the variability and dispreferred-paraphrase measures follows below.)
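A minimal sketch of two of these measures, mean pairwise edit distance as a variability score and a count of dispreferred word sequences, using invented example outputs; this illustrates the idea and is not the code used in the study.

from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance between two sentences."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def mean_pairwise_distance(sentences):
    """Variability: mean edit distance over all pairs of sentences."""
    pairs = list(combinations(sentences, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

DISPREFERRED = ["we here have", "is family"]   # example sequences from the slide above

def count_dispreferred(sentences):
    """Count occurrences of dispreferred word sequences."""
    return sum(s.lower().count(p) for s in sentences for p in DISPREFERRED)

outputs = ["this design is country",
           "here is a design in the country style",
           "this one is family"]
print(mean_pairwise_distance(outputs), count_dispreferred(outputs))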


Summary of findings
● Variability (edit distance): increased
  – Both methods had an effect; not simply additive
● N-gram scores: decreased
  – Low values for either parameter did not have an effect
  – Effect of ε-best threshold > effect of repetition penalty
● Dispreferred paraphrases: increased
  – ε-best threshold had an effect on all paraphrases
  – Repetition penalty affected only is family


Automated vs. human evaluation (1)
● Human judges generally found the anti-repetition versions better written and less repetitive, with no difference in understandability
● Corpus-similarity metrics scored the default versions better
● Measures of variability scored the anti-repetition versions better


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Selecting behaviour for the head
● Annotated a corpus of head movements recorded while a speaker read sentences from the COMIC domain
● Implemented three selection methods:
  – Rule-based
  – Majority
  – Weighted
● Compared them in a human study and automatically


Corpus details
● 450 sentences generated by COMIC
● Link displays to the corresponding span of nodes in the OpenCCG derivation tree
● Also include contextual information:
  – User-preference evaluation
  – Information status
  – Speech-synthesiser prosody
  – Clause of the sentence
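One plausible way to represent such an annotation (a display linked to a span of derivation-tree nodes, plus the contextual features listed above) is sketched below; the class and field names are invented, not those of the actual corpus.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DisplayAnnotation:
    """A facial display aligned with a span of OpenCCG derivation nodes."""
    display: str              # e.g. "nd" (nod) or "nd, bu" (nod plus brow raise)
    node_span: List[int]      # indices of the covered derivation-tree nodes

@dataclass
class AnnotatedSentence:
    text: str
    displays: List[DisplayAnnotation] = field(default_factory=list)
    # Contextual features recorded for each sentence:
    user_preference: str = "neutral"   # positive / negative / neutral
    information_status: str = "new"    # new vs. given
    pitch_accent: str = "none"         # speech-synthesiser prosody
    clause: int = 1                    # which clause of the sentence

# Toy instance:
sent = AnnotatedSentence(
    text="Although it's in the family style, the tiles are by Alessi Tiles.",
    displays=[DisplayAnnotation("nd", [3, 4, 5])],
    user_preference="negative", clause=1)
print(sent.displays[0].display)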


Characteristic facial displays


Generation strategies
● Rule-based: displays only on tile-design facts
  – Positive, negative, neutral displays as appropriate
● Data-driven: consider the full context (the two variants are sketched below)
  – Majority: always take the highest-probability option
  – Weighted: make a stochastic choice weighted by frequency
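The difference between the two data-driven strategies can be sketched as follows, selecting a display from corpus counts conditioned on a context; the counts and the context key are invented for illustration.

import random

# Invented corpus counts: how often each display co-occurred with a context
counts = {
    ("negative", "user-preference"): {"nd": 12, "nd, bu": 5, "ll": 3},
    ("positive", "user-preference"): {"nd": 20, "tr": 4},
}

def majority_choice(context):
    """Always pick the display seen most often in this context."""
    options = counts[context]
    return max(options, key=options.get)

def weighted_choice(context):
    """Pick a display at random, weighted by its corpus frequency."""
    options = counts[context]
    displays = list(options)
    return random.choices(displays, weights=[options[d] for d in displays])[0]

ctx = ("negative", "user-preference")
print(majority_choice(ctx))   # always "nd"
print(weighted_choice(ctx))   # "nd" about 60% of the time, "nd, bu" 25%, "ll" 15%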


Generated display schedules
[Table: facial-display schedules for the sentence “Although it’s in the family style, the tiles are by Alessi Tiles.”, as annotated in the original corpus and as generated by the Rule-based, Majority, and Weighted strategies. The Majority and Weighted schedules consist only of nods (nd); the Original and Rule-based schedules also include other displays (bu, ll, bd, sq, tr).]

Human evaluation
● Ask judges to make pairwise choices among the output of different strategies
  – Experiment 1: Corpus vs. Weighted vs. Majority
  – Experiment 2: Corpus vs. Weighted vs. Rule-based
● Showed videos in pairs – same speech, different displays – and asked which they preferred


Results of experiment 1
[Bar chart of pairwise preferences: Corpus 295 vs. Majority 153; Weighted 278 vs. Majority 170; Corpus 251 vs. Weighted 197.]

Results of experiment 2
[Bar chart of pairwise preferences for Corpus vs. Rule-based, Weighted vs. Rule-based, and Original (corpus) vs. Weighted; the counts shown are 123, 93, 123, 120, 97 and 92.]

Discussion
● Original corpus displays always preferred
● Majority displays strongly disliked
● Others in between
● Features of the majority choice:
  – Moves less frequently than corpus or weighted
  – Nearly always chooses a nod when it does move
● Rule-based strategy moves even less, but uses a more even distribution


Automated evaluation
● Corpus-reproduction metrics (see the sketch below)
  – Precision, recall, F score
  – Node accuracy
  – Value for β agreement measure
● Sentence-level metrics
  – Number and diversity of displays per sentence
● Global metrics
  – Number and diversity of displays across the corpus
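As an illustration of the corpus-reproduction idea, the sketch below computes precision, recall and F score of a generated display schedule against the corpus annotation, treating a schedule as a bag of (node, display) pairs; node accuracy and the β measure are omitted, and the data format is invented.

from collections import Counter

def prf(generated, reference):
    """Precision, recall and F1 of generated (node, display) pairs
    against the corpus annotation for the same sentence."""
    gen, ref = Counter(generated), Counter(reference)
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy schedules: (derivation-node index, display) pairs
corpus_schedule = [(2, "nd"), (5, "nd, bu"), (9, "nd")]
majority_schedule = [(2, "nd"), (5, "nd"), (9, "nd")]
print(prf(majority_schedule, corpus_schedule))  # roughly (0.67, 0.67, 0.67)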


Corpus-reproduction metrics
[Bar chart: precision, recall, F score, node accuracy and β for the Majority, Weighted and Rule-based strategies; values range from 0.06 to 0.82, with the Majority strategy scoring highest overall.]

Sentence-level metrics
[Bar chart: mean number of display tokens and of distinct display types per sentence for the Corpus, Weighted, Majority and Rule-based schedules. Tokens per sentence range from about 3 (Majority, Rule-based) up to 5.39 (Corpus), with Weighted at 4.62; types per sentence range from about 1.2 up to 3.]

Total distinct combinations
[Bar chart: number of distinct display combinations used across the whole corpus: Corpus 40, Weighted 35, Majority 11, Rule-based 3.]

Proportion of downward nods
[Bar chart: proportion of all displays that are downward nods; about 0.89 for the Majority strategy and between 0.51 and 0.54 for the Corpus, Weighted and Rule-based schedules.]

Automated vs. human evaluation (2)
● Humans preferred the original corpus displays, hated the majority displays, and didn't have a strong preference between the other two
● Corpus-reproduction metrics favoured majority
● Other metrics scored corpus highest, weighted next, with majority and rule-based further down
● Only the “diversity” score (proportion of nods) favoured rule-based over majority


Discussion


Types of corpus similarity
● Reproducing each sentence (cross-validation)
  – Favours systems that produce “average” outputs, with little variation
  – Users dislike this type of output
● Reproducing the patterns in the corpus
  – Favours systems whose output is distributed like that in the corpus
  – These systems score higher on user preference
(The contrast is sketched below.)
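The contrast can be made concrete with a toy sketch: per-item reproduction scores each output against the corpus item for the same sentence, while distributional similarity compares the overall label distributions regardless of which sentence got which label. The example data and the choice of total variation distance are illustrative, not taken from the studies cited.

from collections import Counter

def reproduction_accuracy(generated, corpus):
    """Per-item similarity: fraction of sentences where the generated
    label exactly matches the corpus label for that sentence."""
    matches = sum(g == c for g, c in zip(generated, corpus))
    return matches / len(corpus)

def distributional_similarity(generated, corpus):
    """Pattern similarity: 1 minus the total variation distance between
    the label frequency distributions, ignoring sentence alignment."""
    p, q = Counter(generated), Counter(corpus)
    labels = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[l] / len(generated) - q[l] / len(corpus))
                    for l in labels)
    return 1.0 - tvd

corpus   = ["nd", "nd", "bu", "ll", "nd", "bu"]
majority = ["nd"] * 6                              # always the most frequent label
weighted = ["bu", "nd", "nd", "nd", "ll", "bu"]    # same mix, different places

print(reproduction_accuracy(majority, corpus),      # 0.50: looks good per item
      distributional_similarity(majority, corpus))  # 0.50: but poor overall mix
print(reproduction_accuracy(weighted, corpus),      # 0.33: worse per item
      distributional_similarity(weighted, corpus))  # 1.00: matches the corpus mix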


Possible future work
● Task-based evaluation?
  – One result: weighted displays harm users' ability to detect correctly tailored output; rule-based displays do not
● Enhanced display selection strategies – the corpus always scored highest
● Other evaluation metrics


Thank you

http://homepages.inf.ed.ac.uk/mef/


References
● BAILENSON J, AHARONI E, BEALL A, GUADAGNO R, DIMOV A, and BLASCOVICH J (2004). Comparing behavioural and self-report measures of embodied agents' social presence in immersive virtual environments. In Proceedings of the 7th Annual International Workshop on PRESENCE.
● BANGALORE S, RAMBOW O, and WHITTAKER S (2000). Evaluation metrics for generation. In Proceedings of the 1st International Conference on Natural Language Generation (INLG 2000).
● BELZ A, GATT A, REITER E, and VIETHEN J, editors (2007). The Attribute Selection for Generation of Referring Expressions Challenge. http://www.csd.abdn.ac.uk/research/evaluation/
● BELZ A and REITER E (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
● BINSTED K, PAIN H, and RITCHIE G (1997). Children's evaluation of computer-generated punning riddles. Pragmatics and Cognition, 5(2):309–358.
● CARENINI G and MOORE J D (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952. doi:10.1016/j.artint.2006.05.003.
● VAN DEEMTER K, KRAHMER E, and THEUNE M (2005). Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24.


References
● DI EUGENIO B, FOSSATI D, YU D, HALLER S, and GLASS M (2005). Aggregation improves learning: experiments in natural language generation for intelligent tutoring systems. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (ACL 2005). Ann Arbor, MI.
● ELHADAD N, MCKEOWN K, KAUFMAN D, and JORDAN D (2005). Facilitating physicians' access to information via tailored text summarization. In Proceedings of the AMIA Annual Symposium.
● FOSTER M E (2008). Automated metrics that agree with human judgements on generated output. Submitted.
● FOSTER M E and OBERLANDER J (2008). Corpus-based generation of conversational facial displays. Language Resources and Evaluation. To appear.
● FOSTER M E and WHITE M (2005). Assessing the impact of adaptive generation in the COMIC multimodal dialogue system. In Proceedings of the IJCAI 2005 Workshop on Knowledge and Reasoning in Practical Dialogue Systems. Edinburgh, Scotland.
● FOSTER M E and WHITE M (2007). Avoiding repetition in generated text. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG 2007). Schloss Dagstuhl.
● HARTLEY A, SCOTT D, KRUIJFF-KORBAYOVÁ I, SHAROFF S, SOKOLOVA L, DOCHEV D, STAYKOVA K, ČMEJREK M, HANA J, and TEICH E (2000). Evaluation of the final prototype. Technical Report EVAL2, AGILE project.


References
● ISARD A, OBERLANDER J, ANDROUTSOPOULOS I, and MATHESON C (2003). Speaking the users' languages. IEEE Intelligent Systems, 18(1):40–45. doi:10.1109/MIS.2003.1179192.
● KARAMANIS N and MELLISH C (2005). A review of recent corpus-based methods for evaluating information ordering in text production. In Proceedings of the Corpus Linguistics 2005 Workshop on Using Corpora for Natural Language Generation.
● LIN C Y (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Barcelona, Spain.
● MELLISH C and DALE R (1998). Evaluation in the context of natural language generation. Computer Speech and Language, 12(4):349–373. doi:10.1006/csla.1998.0106.
● PAPINENI K, ROUKOS S, WARD T, and ZHU W J (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
● PRENDINGER H, MA C, YINGZI J, NAKASONE A, and ISHIZUKA M (2005a). Understanding the effect of life-like interface agents through users' eye movements. In Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI 2005), pp. 108–115. doi:10.1145/1088463.1088484.


References
● PRENDINGER H, MORI J, and ISHIZUKA M (2005b). Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International Journal of Human-Computer Studies, 62(2):231–245. doi:10.1016/j.ijhcs.2004.11.009.
● REITER E, ROBERTSON R, and OSMAN L M (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2):41–58. doi:10.1016/S0004-3702(02)00370-3.
● REITER E and SRIPADA S (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd International Language Generation Conference (INLG 2002).
● SRIPADA S and GAO F (2007). Summarising dive computer data: A case study in integrating textual and graphical presentations of numerical data. In Proceedings of the Workshop on Multimodal Output Generation (MOG 2007).
● SRIPADA S G, REITER E, and HAWIZY L (2005). Evaluation of an NLG system using post-edit data: Lessons learnt. In Proceedings of the 10th European Workshop on Natural Language Generation (EWNLG 2005).
● STENT A, MARGE M, and SINGHAI M (2005). Evaluating evaluation methods for generation in the presence of variation. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, volume 3406/2005, pp. 341–351. Springer. doi:10.1007/b105772.


References
● WALKER M A, LITMAN D, KAMM C A, and ABELLA A (1997). PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL/EACL 1997).
● WALKER M, WHITTAKER S, STENT A, MALOOR P, MOORE J, JOHNSTON M, and VASIREDDY G (2004). Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840. doi:10.1016/j.cogsci.2004.06.002.
● WHITE M (2004). Reining in CCG chart realization. In Proceedings of the 3rd International Conference on Natural Language Generation (INLG 2004).
● WHITE M (2006). CCG chart realization from disjunctive inputs. In Proceedings of the 4th International Conference on Natural Language Generation (INLG 2006).
● WHITE M and DALE R, editors (2007). Report from the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
● WILKS Y (1990). Where am I coming from: The reversibility of analysis and generation in natural language processing. In M Pütz, editor, Thirty Years of Linguistic Evolution. John Benjamins Publishing Company.

