Evaluating the Output of an Embodied Multimodal Dialogue System

Mary Ellen Foster
Technische Universität München / University of Edinburgh
DNLG, 21 January 2008

Evaluating generation systems
● Generation is an open-ended task
  – Like “counting from zero to infinity” (Wilks, 1990)
● Issues for evaluation (Mellish and Dale, 1998):
  – Defining input and output
  – Choosing what to measure
  – Selecting a control or baseline for comparison
  – Obtaining adequate training/test data
  – Dealing with disagreement of human judges


Task-based evaluation
● Test whether the system achieves its goals better with advanced generation techniques enabled
  – Carenini and Moore (2006): tailoring evaluative arguments to users affected their choices
  – Di Eugenio et al. (2005): users learned more from a tutoring system that used aggregation
  – Elhadad et al. (2005): doctors found information more quickly with clinical summaries tailored to patients
  – Reiter et al. (2003): tailored smoking-cessation materials did not make a difference


Direct human judgements
● Task-based evaluation requires ...
  – A fully-working system that can be used by naive users
  – A baseline version of the same system
  – A pool of subjects who can perform the task
● Asking users for direct judgements
  – Use canned outputs from a partially-working system
  – Alleviate issues with slow or unreliable input
  – Participants do not need to be experts (“overhearer”)


Some direct-judgement studies
● Binsted et al. (1997): children found (some) generated jokes as good as human-written jokes
● Hartley et al. (2000): acceptability of generated documentation comparable to human-written
● Walker et al. (2004): users liked restaurant recommendations tailored to their own preferences
● Belz and Reiter (2006): subjects compared generated and hand-written weather forecasts (and sometimes found generated better)


Other human evaluation techniques
● Analyse post-edits (Sripada et al., 2005; Sripada and Gao, 2007)
● Direct physiological measures:
  – Prendinger et al. (2005a): eye-tracking
  – Prendinger et al. (2005b): galvanic skin response
  – Bailenson et al. (2004): interpersonal distance in a virtual environment


Automated evaluation
● Why use it?
  – Stochastic systems must be evaluated frequently during development
  – Recruiting judges can be time-consuming and costly
● The goal: find metrics that
  1. Can easily be computed, and
  2. Agree with human judgements of quality
● Existing metrics: PARADISE (Walker et al., 1997); BLEU (Papineni et al., 2002); ROUGE (Lin, 2004)
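To make the kind of metric meant here concrete, the sketch below computes a clipped n-gram precision of a generated sentence against reference texts, in the spirit of BLEU. It is a simplified toy with invented example strings, not the official BLEU implementation.

from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n=2):
    """Clipped n-gram precision of a candidate against reference texts
    (a simplified, BLEU-style overlap score)."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram counts only up to the maximum number of
    # times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Toy usage with invented strings:
refs = ["here is a design in the family style"]
print(ngram_precision("this is a design in the family style", refs))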


Some automated-evaluation studies
● Bangalore et al. (2000): corpus-based metrics for word ordering in a realiser
● White (2004): measure accuracy and speed of the OpenCCG realiser through cross-validation
● Karamanis and Mellish (2005): corpus-based methods for evaluating information ordering
● Note: users may not like systems that score highly on corpus similarity (Reiter and Sripada, 2002; Stent et al., 2005; Belz and Reiter, 2006)


Shared-task evaluation?
● Approaches to a well-defined problem are compared based on their performance on a common task; common in NLU
● Hot topic in NLG (White and Dale, 2007)
● Pilot study (Belz et al., 2007): attribute selection for referring expressions
  – 6 teams, 22 systems
  – Follow-up challenge for 2008 underway
● Also: giving instructions in virtual environments


The COMIC dialogue system
● Adds a multimodal dialogue interface to a CAD-like application used for bathroom redesign
● Output generation in COMIC
  – Dialogue manager selects the high-level message
  – Fission module plans the multimodal content, using OpenCCG for text realisation
  – Content is produced using specific output modules (Festival, RUTH)


(COMIC demo)


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Tailored generation in COMIC
● COMIC takes user preferences into account when generating output
  – Selecting the designs to show
  – Choosing the attributes to describe
  – Structuring the content
  – Choosing specific words to use


Sample tailored description
● User: Tell me about this design [click]
● COMIC:
  – (target) Here is a family design. As you can see, the tiles have a blue and green colour scheme. It has floral motifs and artwork on the decorative tiles.
  – (non-target) Here is a family design. Its tiles are from the Lollipop collection by Agrob Buchtal. Although the tiles have a blue colour scheme, it does also feature green.
  – (neutral) This design is in the family style. It uses tiles from Agrob Buchtal's Lollipop series. There are artwork and floral motifs on the decorative tiles.


Evaluation
● Subjects saw and heard one eight-turn dialogue between COMIC and a hypothetical user
● User preferences were on screen at all times
● For each system turn, subjects saw two options and had to choose which was more appropriate for the user
● Four possible user models (plus neutral); each subject made all pairwise comparisons twice
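One standard way to analyse such two-alternative forced-choice counts (not necessarily the analysis used in the original study) is an exact binomial sign test against the chance level of 0.5:

from math import comb

def sign_test_p(successes, trials):
    """Two-sided exact binomial test against p = 0.5.

    Probability of an outcome at least as extreme as `successes`
    under the null hypothesis that both options are equally likely."""
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Toy usage with the overall selection counts reported on the next slide
# (target version chosen 114 times out of 184 trials):
print(sign_test_p(114, 184))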


Overall selections
[Bar chart: across all trials, subjects chose the version tailored to the target user model 114 times and the other version 70 times.]

“Conflicting concessions”
● Explicit concessions to negative preferences (“although”, “though”, etc.)
● Two classes of trials:
  – Conflict: at least one conflicting concession across the two versions
  – No conflict: no concessions, or the same concessions in both versions


Conflicting-concession selections
[Bar chart: target vs. other selections, split into conflict and no-conflict trials; the counts shown are 90, 41, 29 and 24.]

What's going on?
● Positive preferences do have an effect
  – Positive features are placed near the start
● Negative features have a bigger effect (“although”, etc.)
● Because it was not the subjects' own preferences?
● Because COMIC is a “sales” system, so the default mode is positive, and only marked departures are perceptible?


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Avoiding repetition in generated text
● Classic rule of writing: avoid repetition to keep text interesting and lively
● Often followed in generation systems:
  – The ability to create multiple expressions “lets us vary the way in which the system expresses the same fact.” (Isard et al., 2003)
  – Random choice of realisation option “maximise[s] the variety of sentences produced by the system” (van Deemter et al., 2005)
● Never explicitly evaluated


Implementation
● Using the OpenCCG realiser, which incorporates n-gram language models into its processing
  – Including disjunctive logical forms (White, 2006)
● Two anti-repetition techniques (sketched below):
  – ε-best sampling: perform normal realisation and then choose one of the top alternatives
  – Anti-repetition scoring: store the words from previous sentences and penalise repetitions
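A minimal sketch of the two techniques, operating on a plain list of scored realisation candidates rather than on OpenCCG's internal structures; the function names, parameter values and example candidates are invented for illustration.

import random

def epsilon_best_sample(candidates, epsilon=0.2):
    """candidates: list of (score, text) pairs from the realiser.
    Instead of always returning the single best realisation, pick
    uniformly among all candidates within epsilon of the top score."""
    best = max(score for score, _ in candidates)
    near_best = [text for score, text in candidates if score >= best - epsilon]
    return random.choice(near_best)

def anti_repetition_rescore(candidates, previous_words, penalty=0.1):
    """Penalise each candidate for every word it repeats from
    earlier sentences in the description sequence."""
    rescored = []
    for score, text in candidates:
        repeats = sum(1 for w in text.lower().split() if w in previous_words)
        rescored.append((score - penalty * repeats, text))
    return rescored

# Toy usage with invented candidates and history:
history = {"this", "design", "is", "country"}
cands = [(0.95, "This design is also country ."),
         (0.90, "Here is another design in the country style .")]
print(epsilon_best_sample(anti_repetition_rescore(cands, history), epsilon=0.2))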


Human evaluation
● Gather user judgements on short description sequences, presented as minimal pairs
● Two versions:
  – Baseline
  – Both anti-repetition measures enabled
● Subjects made three forced choices
  – Understandability
  – Repetitiveness
  – Writing quality


Description sequences
● Default:
  – This design is country. It is based on the Sandstein collection by Porcelaingres. The colours are brown, grey and black. There are geometric shapes on the decorative tiles. This design is also country. It is based on the Cardiff collection by Aparici. The colours are cream and dark red. It also has geometric shapes on the decorative tiles.
● Anti-repetition:
  – Here is a design in the country style. It uses tiles from the Sandstein collection by Porcelaingres. It has brown, grey and black in the colour scheme. The decorative tiles have geometric shapes. This one is also country. It draws from Cardiff, by Aparici. The colour scheme features cream and dark red. The decorative tiles also have geometric shapes.

Evaluation interface


Results
[Bar chart of forced-choice counts for the default vs. anti-repetition versions: “more repetitive” 210 (default) vs. 82 (anti-repetition); “better written” 180 (anti-repetition) vs. 112 (default); “easier to understand” split 157 vs. 135.]

Automated evaluation
● Exploring the parameter settings
  – ε-best sampling: threshold
  – Anti-repetition scoring: penalty
● Test the impact on generated text of a range of parameter values
  – 0, 1, 5, 10, 20 for each parameter (log space)
  – Sentences in the human evaluation had both parameters = 20


Metrics
● Variability: edit distance between pairs of descriptions
● N-gram scores
● Dispreferred paraphrases: word sequences permitted by the grammar but normally filtered out by the n-gram models
  – Also at start and end
  – We here have and We have ... here
  – Is family
(A sketch of the variability and dispreferred-paraphrase measures follows below.)
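A minimal sketch of two of these measures, mean pairwise edit distance as a variability score and a count of dispreferred word sequences, using invented example outputs; this illustrates the idea and is not the code used in the study.

from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance between two sentences."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def mean_pairwise_distance(sentences):
    """Variability: mean edit distance over all pairs of sentences."""
    pairs = list(combinations(sentences, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

DISPREFERRED = ["we here have", "is family"]   # example sequences from the slide above

def count_dispreferred(sentences):
    """Count occurrences of dispreferred word sequences."""
    return sum(s.lower().count(p) for s in sentences for p in DISPREFERRED)

outputs = ["this design is country",
           "here is a design in the country style",
           "this one is family"]
print(mean_pairwise_distance(outputs), count_dispreferred(outputs))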


Summary of findings
● Variability (edit distance): increased
  – Both methods had an effect; not simply additive
● N-gram scores: decreased
  – Low values for either parameter did not have an effect
  – Effect of ε-best threshold > effect of repetition penalty
● Dispreferred paraphrases: increased
  – ε-best threshold had an effect on all paraphrases
  – Repetition penalty affected only is family


Automated vs. human evaluation (1)
● Human judges generally found the anti-repetition versions better written and less repetitive, with no difference in understandability
● Corpus-similarity metrics scored the default versions better
● Measures of variability scored the anti-repetition versions better


Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)


Selecting behaviour for the head
● Annotated a corpus of head movements recorded while a speaker read sentences from the COMIC domain
● Implemented three selection methods:
  – Rule-based
  – Majority
  – Weighted
● Compared them in a human study and automatically


Corpus details
● 450 sentences generated by COMIC
● Link displays to the corresponding span of nodes in the OpenCCG derivation tree
● Also include contextual information:
  – User-preference evaluation
  – Information status
  – Speech-synthesiser prosody
  – Clause of the sentence
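One plausible way to represent such an annotation (a display linked to a span of derivation-tree nodes, plus the contextual features listed above) is sketched below; the class and field names are invented, not those of the actual corpus.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DisplayAnnotation:
    """A facial display aligned with a span of OpenCCG derivation nodes."""
    display: str              # e.g. "nd" (nod) or "nd, bu" (nod plus brow raise)
    node_span: List[int]      # indices of the covered derivation-tree nodes

@dataclass
class AnnotatedSentence:
    text: str
    displays: List[DisplayAnnotation] = field(default_factory=list)
    # Contextual features recorded for each sentence:
    user_preference: str = "neutral"   # positive / negative / neutral
    information_status: str = "new"    # new vs. given
    pitch_accent: str = "none"         # speech-synthesiser prosody
    clause: int = 1                    # which clause of the sentence

# Toy instance:
sent = AnnotatedSentence(
    text="Although it's in the family style, the tiles are by Alessi Tiles.",
    displays=[DisplayAnnotation("nd", [3, 4, 5])],
    user_preference="negative", clause=1)
print(sent.displays[0].display)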


Characteristic facial displays


Generation strategies
● Rule-based: displays only on tile-design facts
  – Positive, negative, neutral displays as appropriate
● Data-driven: consider the full context (the two variants are sketched below)
  – Majority: always take the highest-probability option
  – Weighted: make a stochastic choice weighted by frequency
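The difference between the two data-driven strategies can be sketched as follows, selecting a display from corpus counts conditioned on a context; the counts and the context key are invented for illustration.

import random

# Invented corpus counts: how often each display co-occurred with a context
counts = {
    ("negative", "user-preference"): {"nd": 12, "nd, bu": 5, "ll": 3},
    ("positive", "user-preference"): {"nd": 20, "tr": 4},
}

def majority_choice(context):
    """Always pick the display seen most often in this context."""
    options = counts[context]
    return max(options, key=options.get)

def weighted_choice(context):
    """Pick a display at random, weighted by its corpus frequency."""
    options = counts[context]
    displays = list(options)
    return random.choices(displays, weights=[options[d] for d in displays])[0]

ctx = ("negative", "user-preference")
print(majority_choice(ctx))   # always "nd"
print(weighted_choice(ctx))   # "nd" about 60% of the time, "nd, bu" 25%, "ll" 15%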


Generated display schedules
[Table: facial-display schedules for the sentence “Although it’s in the family style, the tiles are by Alessi Tiles.”, as annotated in the original corpus and as generated by the Rule-based, Majority, and Weighted strategies. The Majority and Weighted schedules consist only of nods (nd); the Original and Rule-based schedules also include other displays (bu, ll, bd, sq, tr).]

Human evaluation
● Ask judges to make pairwise choices among the output of different strategies
  – Experiment 1: Corpus vs. Weighted vs. Majority
  – Experiment 2: Corpus vs. Weighted vs. Rule-based
● Showed videos in pairs – same speech, different displays – and asked which they preferred


Results of experiment 1
[Bar chart of pairwise preferences: Corpus 295 vs. Majority 153; Weighted 278 vs. Majority 170; Corpus 251 vs. Weighted 197.]

Results of experiment 2
[Bar chart of pairwise preferences for Corpus vs. Rule-based, Weighted vs. Rule-based, and Original (corpus) vs. Weighted; the counts shown are 123, 93, 123, 120, 97 and 92.]

Discussion
● Original corpus displays always preferred
● Majority displays strongly disliked
● Others in between
● Features of the majority choice:
  – Moves less frequently than corpus or weighted
  – Nearly always chooses a nod when it does move
● Rule-based strategy moves even less, but uses a more even distribution


Automated evaluation
● Corpus-reproduction metrics (see the sketch below)
  – Precision, recall, F score
  – Node accuracy
  – Value for β agreement measure
● Sentence-level metrics
  – Number and diversity of displays per sentence
● Global metrics
  – Number and diversity of displays across the corpus
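As an illustration of the corpus-reproduction idea, the sketch below computes precision, recall and F score of a generated display schedule against the corpus annotation, treating a schedule as a bag of (node, display) pairs; node accuracy and the β measure are omitted, and the data format is invented.

from collections import Counter

def prf(generated, reference):
    """Precision, recall and F1 of generated (node, display) pairs
    against the corpus annotation for the same sentence."""
    gen, ref = Counter(generated), Counter(reference)
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy schedules: (derivation-node index, display) pairs
corpus_schedule = [(2, "nd"), (5, "nd, bu"), (9, "nd")]
majority_schedule = [(2, "nd"), (5, "nd"), (9, "nd")]
print(prf(majority_schedule, corpus_schedule))  # roughly (0.67, 0.67, 0.67)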


Corpus-reproduction metrics
[Bar chart: precision, recall, F score, node accuracy and β for the Majority, Weighted and Rule-based strategies; values range from 0.06 to 0.82, with the Majority strategy scoring highest overall.]

Sentence-level metrics
[Bar chart: mean number of display tokens and of distinct display types per sentence for the Corpus, Weighted, Majority and Rule-based schedules. Tokens per sentence range from about 3 (Majority, Rule-based) up to 5.39 (Corpus), with Weighted at 4.62; types per sentence range from about 1.2 up to 3.]

Total distinct combinations
[Bar chart: number of distinct display combinations used across the whole corpus: Corpus 40, Weighted 35, Majority 11, Rule-based 3.]

Proportion of downward nods
[Bar chart: proportion of all displays that are downward nods; about 0.89 for the Majority strategy and between 0.51 and 0.54 for the Corpus, Weighted and Rule-based schedules.]

Automated vs. human evaluation (2)
● Humans preferred the original corpus displays, hated the majority displays, and didn't have a strong preference between the other two
● Corpus-reproduction metrics favoured majority
● Other metrics scored corpus highest, weighted next, with majority and rule-based further down
● Only the “diversity” score (proportion of nods) favoured rule-based over majority


Discussion


Types of corpus similarity
● Reproducing each sentence (cross-validation)
  – Favours systems that produce “average” outputs, with little variation
  – Users dislike this type of output
● Reproducing the patterns in the corpus
  – Favours systems whose output is distributed like that in the corpus
  – These systems score higher on user preference
(The contrast is sketched below.)
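The contrast can be made concrete with a toy sketch: per-item reproduction scores each output against the corpus item for the same sentence, while distributional similarity compares the overall label distributions regardless of which sentence got which label. The example data and the choice of total variation distance are illustrative, not taken from the studies cited.

from collections import Counter

def reproduction_accuracy(generated, corpus):
    """Per-item similarity: fraction of sentences where the generated
    label exactly matches the corpus label for that sentence."""
    matches = sum(g == c for g, c in zip(generated, corpus))
    return matches / len(corpus)

def distributional_similarity(generated, corpus):
    """Pattern similarity: 1 minus the total variation distance between
    the label frequency distributions, ignoring sentence alignment."""
    p, q = Counter(generated), Counter(corpus)
    labels = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[l] / len(generated) - q[l] / len(corpus))
                    for l in labels)
    return 1.0 - tvd

corpus   = ["nd", "nd", "bu", "ll", "nd", "bu"]
majority = ["nd"] * 6                              # always the most frequent label
weighted = ["bu", "nd", "nd", "nd", "ll", "bu"]    # same mix, different places

print(reproduction_accuracy(majority, corpus),      # 0.50: looks good per item
      distributional_similarity(majority, corpus))  # 0.50: but poor overall mix
print(reproduction_accuracy(weighted, corpus),      # 0.33: worse per item
      distributional_similarity(weighted, corpus))  # 1.00: matches the corpus mix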


Possible future work
● Task-based evaluation?
  – One result: weighted displays harm users' ability to detect correctly tailored output; rule-based displays do not
● Enhanced display selection strategies – the corpus always scored highest
● Other evaluation metrics


Thank you

http://homepages.inf.ed.ac.uk/mef/


References
● BAILENSON J, AHARONI E, BEALL A, GUADAGNO R, DIMOV A, and BLASCOVICH J (2004). Comparing behavioural and self-report measures of embodied agents' social presence in immersive virtual environments. In Proceedings of the 7th Annual International Workshop on PRESENCE.
● BANGALORE S, RAMBOW O, and WHITTAKER S (2000). Evaluation metrics for generation. In Proceedings of the 1st International Conference on Natural Language Generation (INLG 2000).
● BELZ A, GATT A, REITER E, and VIETHEN J, editors (2007). The Attribute Selection for Generation of Referring Expressions Challenge. http://www.csd.abdn.ac.uk/research/evaluation/
● BELZ A and REITER E (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
● BINSTED K, PAIN H, and RITCHIE G (1997). Children's evaluation of computer-generated punning riddles. Pragmatics and Cognition, 5(2):309–358.
● CARENINI G and MOORE J D (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952. doi:10.1016/j.artint.2006.05.003.
● VAN DEEMTER K, KRAHMER E, and THEUNE M (2005). Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24.


References
● DI EUGENIO B, FOSSATI D, YU D, HALLER S, and GLASS M (2005). Aggregation improves learning: experiments in natural language generation for intelligent tutoring systems. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (ACL 2005). Ann Arbor, MI.
● ELHADAD N, MCKEOWN K, KAUFMAN D, and JORDAN D (2005). Facilitating physicians' access to information via tailored text summarization. In Proceedings of the AMIA Annual Symposium.
● FOSTER M E (2008). Automated metrics that agree with human judgements on generated output. Submitted.
● FOSTER M E and OBERLANDER J (2008). Corpus-based generation of conversational facial displays. Language Resources and Evaluation. To appear.
● FOSTER M E and WHITE M (2005). Assessing the impact of adaptive generation in the COMIC multimodal dialogue system. In Proceedings of the IJCAI 2005 Workshop on Knowledge and Reasoning in Practical Dialogue Systems. Edinburgh, Scotland.
● FOSTER M E and WHITE M (2007). Avoiding repetition in generated text. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG 2007). Schloss Dagstuhl.
● HARTLEY A, SCOTT D, KRUIJFF-KORBAYOVÁ I, SHAROFF S, SOKOLOVA L, DOCHEV D, STAYKOVA K, ČMEJREK M, HANA J, and TEICH E (2000). Evaluation of the final prototype. Technical Report EVAL2, AGILE project.


References
● ISARD A, OBERLANDER J, ANDROUTSOPOULOS I, and MATHESON C (2003). Speaking the users' languages. IEEE Intelligent Systems, 18(1):40–45. doi:10.1109/MIS.2003.1179192.
● KARAMANIS N and MELLISH C (2005). A review of recent corpus-based methods for evaluating information ordering in text production. In Proceedings of the Corpus Linguistics 2005 Workshop on Using Corpora for Natural Language Generation.
● LIN C Y (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Barcelona, Spain.
● MELLISH C and DALE R (1998). Evaluation in the context of natural language generation. Computer Speech and Language, 12(4):349–373. doi:10.1006/csla.1998.0106.
● PAPINENI K, ROUKOS S, WARD T, and ZHU W J (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
● PRENDINGER H, MA C, YINGZI J, NAKASONE A, and ISHIZUKA M (2005a). Understanding the effect of life-like interface agents through users' eye movements. In Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI 2005), pp. 108–115. doi:10.1145/1088463.1088484.


References
● PRENDINGER H, MORI J, and ISHIZUKA M (2005b). Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International Journal of Human-Computer Studies, 62(2):231–245. doi:10.1016/j.ijhcs.2004.11.009.
● REITER E, ROBERTSON R, and OSMAN L M (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2):41–58. doi:10.1016/S0004-3702(02)00370-3.
● REITER E and SRIPADA S (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd International Language Generation Conference (INLG 2002).
● SRIPADA S and GAO F (2007). Summarising dive computer data: A case study in integrating textual and graphical presentations of numerical data. In Proceedings of the Workshop on Multimodal Output Generation (MOG 2007).
● SRIPADA S G, REITER E, and HAWIZY L (2005). Evaluation of an NLG system using post-edit data: Lessons learnt. In Proceedings of the 10th European Workshop on Natural Language Generation (EWNLG 2005).
● STENT A, MARGE M, and SINGHAI M (2005). Evaluating evaluation methods for generation in the presence of variation. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, volume 3406/2005, pp. 341–351. Springer. doi:10.1007/b105772.


References
● WALKER M A, LITMAN D, KAMM C A, and ABELLA A (1997). PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL/EACL 1997).
● WALKER M, WHITTAKER S, STENT A, MALOOR P, MOORE J, JOHNSTON M, and VASIREDDY G (2004). Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840. doi:10.1016/j.cogsci.2004.06.002.
● WHITE M (2004). Reining in CCG chart realization. In Proceedings of the 3rd International Conference on Natural Language Generation (INLG 2004).
● WHITE M (2006). CCG chart realization from disjunctive inputs. In Proceedings of the 4th International Conference on Natural Language Generation (INLG 2006).
● WHITE M and DALE R, editors (2007). Report from the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
● WILKS Y (1990). Where am I coming from: The reversibility of analysis and generation in natural language processing. In M Pütz, editor, Thirty Years of Linguistic Evolution. John Benjamins Publishing Company.

