Evaluating the Output of an Embodied Multimodal Dialogue System
Mary Ellen Foster
Technische Universität München / University of Edinburgh
DNLG, 21 January 2008
Evaluating generation systems
● Generation is an open-ended task
  – Like “counting from zero to infinity” (Wilks, 1990)
● Issues for evaluation (Mellish and Dale, 1998):
  – Defining input and output
  – Choosing what to measure
  – Selecting a control or baseline for comparison
  – Obtaining adequate training/test data
  – Dealing with disagreement of human judges
Task-based evaluation
● Test whether the system achieves its goals better with advanced generation techniques enabled
  – Carenini and Moore (2006): tailoring evaluative arguments to users affected their choices
  – Di Eugenio et al. (2005): users learned more from a tutoring system that used aggregation
  – Elhadad et al. (2005): doctors found information more quickly with clinical summaries tailored to patients
  – Reiter et al. (2003): tailored smoking-cessation materials did not make a difference
Direct human judgements
● Task-based evaluation requires ...
  – A fully working system that can be used by naive users
  – A baseline version of the same system
  – A pool of subjects who can perform the task
● Asking users for direct judgements instead
  – Use canned outputs from a partially working system
  – Alleviate issues with slow or unreliable input
  – Participants do not need to be experts (the “overhearer” setting)
Some direct-judgement studies
● Binsted et al. (1997): children found (some) generated jokes as good as human-written jokes
● Hartley et al. (2000): acceptability of generated documentation comparable to human-written
● Walker et al. (2004): users liked restaurant recommendations tailored to their own preferences
● Belz and Reiter (2006): subjects compared generated and hand-written weather forecasts (and sometimes found the generated ones better)
Other human evaluation techniques
● Analyse post-edits (Sripada et al., 2005; Sripada and Gao, 2007)
● Direct physiological measures:
  – Prendinger et al. (2005a): eye-tracking
  – Prendinger et al. (2005b): galvanic skin response
  – Bailenson et al. (2004): interpersonal distance in a virtual environment
Automated evaluation
● Why use it?
  – Stochastic systems must be evaluated frequently during development
  – Recruiting judges can be time-consuming and costly
● The goal: find some metrics that
  1. Can easily be computed, and
  2. Agree with human judgements of quality
● Other metrics: PARADISE (Walker et al., 1997); BLEU (Papineni et al., 2002); ROUGE (Lin, 2004) (see the n-gram sketch below)
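The slide only names these metrics; as a rough illustration of the kind of easily computed, corpus-based score involved, here is a minimal BLEU-style clipped n-gram precision in Python. This is a sketch only, not the formulation used in any of the cited studies, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_ngram_precision(candidate, reference, n=2):
    """BLEU-style clipped n-gram precision of a candidate against a single reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    if not cand:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

generated = "here is a design in the family style".split()
corpus_sentence = "this design is in the family style".split()
print(modified_ngram_precision(generated, corpus_sentence, n=2))
```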
Some automated-evaluation studies
● Bangalore et al. (2000): corpus-based metrics for word ordering in a realiser
● White (2004): measure accuracy and speed of the OpenCCG realiser through cross-validation
● Karamanis and Mellish (2005): corpus-based methods for evaluating information ordering
● Note: users may not like systems that score highly on corpus similarity (Reiter and Sripada, 2002; Stent et al., 2005; Belz and Reiter, 2006)
Shared-task evaluation?
● Approaches to a well-defined problem are compared based on their performance on a common task; common in NLU
● Hot topic in NLG (White and Dale, 2007)
● Pilot study (Belz et al., 2007): attribute selection for referring expressions
  – 6 teams, 22 systems
  – Follow-up challenge for 2008 underway
● Also: giving instructions in virtual environments
The COMIC dialogue system
● Adds a multimodal dialogue interface to a CAD-like application used for bathroom redesign
● Output generation in COMIC (sketched below)
  – Dialogue manager selects the high-level message
  – Fission module plans multimodal content, using OpenCCG for text realisation
  – Content is produced by modality-specific output modules (Festival, RUTH)
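As a rough picture of how such a fission pipeline hangs together, here is a hypothetical Python sketch. All class, method, and message names are invented for illustration and do not correspond to the actual COMIC code.

```python
from dataclasses import dataclass

@dataclass
class OutputPlan:
    text: str              # realised sentence (would come from OpenCCG)
    facial_displays: list  # schedule of head/face behaviours for the talking head

class FissionModule:
    """Plans multimodal content for one high-level message (illustrative only)."""

    def plan(self, message: dict) -> OutputPlan:
        text = self.realise_text(message)         # stand-in for calling the realiser
        displays = self.select_displays(message)  # rule-based or corpus-derived strategy
        return OutputPlan(text, displays)

    def realise_text(self, message: dict) -> str:
        return f"Here is a {message['style']} design."

    def select_displays(self, message: dict) -> list:
        return ["nod"]

# The dialogue manager selects a high-level message, fission plans it, and the
# output modules (speech synthesiser, talking head) would then render the plan.
plan = FissionModule().plan({"style": "family"})
print(plan.text, plan.facial_displays)
```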
(COMIC demo)
Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)
Tailored generation in COMIC
● COMIC takes user preferences into account when generating output
  – Selecting the designs to show
  – Choosing the attributes to describe
  – Structuring the content
  – Choosing specific words to use
Sample tailored description
● User: Tell me about this design [click]
● COMIC:
  – (target) Here is a family design. As you can see, the tiles have a blue and green colour scheme. It has floral motifs and artwork on the decorative tiles.
  – (non-target) Here is a family design. Its tiles are from the Lollipop collection by Agrob Buchtal. Although the tiles have a blue colour scheme, it does also feature green.
  – (neutral) This design is in the family style. It uses tiles from Agrob Buchtal's Lollipop series. There are artwork and floral motifs on the decorative tiles.
Evaluation
● Subjects saw and heard one eight-turn dialogue between COMIC and a hypothetical user
● User preferences were on screen at all times
● For each system turn, subjects saw two options and had to choose which was more appropriate for the user
● Four possible user models (plus neutral); each subject made all pairwise comparisons twice
Overall selections
[Chart: across all trials, the target version was chosen 114 times and the other version 70 times]
“Conflicting concessions”
● Explicit concessions to negative preferences (“although”, “though”, etc.)
● Two classes of trials:
  – Conflict: at least one conflicting concession across the two versions
  – No conflict: no concessions, or the same concessions in both versions
Conflicting-concession selections
[Chart: in Conflict trials the target version was chosen 24 times vs. 41 for the other version; in No-conflict trials the target was chosen 90 times vs. 29]
What's going on?
● Positive preferences do have an effect
  – Positive features are placed near the start
● Negative features have a bigger effect (“although”, etc.)
● Because it was not the subjects' own preferences?
● Because COMIC is a “sales” system, so the default mode is positive, and only marked departures are perceptible?
Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)
Avoiding repetition in generated text
● Classic rule of writing: avoid repetition to keep text interesting and lively
● Often followed in generation systems:
  – The ability to create multiple expressions “lets us vary the way in which the system expresses the same fact” (Isard et al., 2003)
  – Random choice of realisation options “maximise[s] the variety of sentences produced by the system” (van Deemter et al., 2005)
● Never explicitly evaluated
Implementation
● Uses the OpenCCG realiser, which incorporates n-gram language models into its processing
  – Including disjunctive logical forms (White, 2006)
● Two anti-repetition techniques (see the sketch below):
  – ε-best sampling: perform normal realisation and then choose one of the top alternatives
  – Anti-repetition scoring: store the words from previous sentences and penalise repetitions
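Neither technique is spelled out in detail on the slide; the following Python sketch shows the two ideas under simple assumptions. The scores, the ε threshold, and the penalty value are illustrative, and this is not the OpenCCG implementation.

```python
import random

def epsilon_best_sample(scored_realisations, epsilon=0.2):
    """ε-best sampling: realise as normal, then pick at random among the
    realisations whose score is within epsilon of the best one."""
    best = max(score for _, score in scored_realisations)
    candidates = [text for text, score in scored_realisations if score >= best - epsilon]
    return random.choice(candidates)

def anti_repetition_score(realisation, previous_sentences, base_score, penalty=0.1):
    """Anti-repetition scoring: penalise words already used in earlier sentences."""
    used = {w for s in previous_sentences for w in s.split()}
    repeats = sum(1 for w in realisation.split() if w in used)
    return base_score - penalty * repeats

options = [("This design is country.", -1.0),
           ("Here is a design in the country style.", -1.1)]
history = ["This design is country."]
rescored = [(t, anti_repetition_score(t, history, s)) for t, s in options]
print(epsilon_best_sample(rescored, epsilon=0.2))
```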
Human evaluation
● Gather user judgements on short description sequences, presented as minimal pairs
● Two versions:
  – Baseline
  – Both anti-repetition measures enabled
● Subjects made three forced choices:
  – Understandability
  – Repetitiveness
  – Writing quality
Description sequences
● Default:
  – This design is country. It is based on the Sandstein collection by Porcelaingres. The colours are brown, grey and black. There are geometric shapes on the decorative tiles.
  – This design is also country. It is based on the Cardiff collection by Aparici. The colours are cream and dark red. It also has geometric shapes on the decorative tiles.
● Anti-repetition:
  – Here is a design in the country style. It uses tiles from the Sandstein collection by Porcelaingres. It has brown, grey and black in the colour scheme. The decorative tiles have geometric shapes.
  – This one is also country. It draws from Cardiff, by Aparici. The colour scheme features cream and dark red. The decorative tiles also have geometric shapes.
Evaluation interface
Results
[Chart: forced-choice counts for Default vs. Anti-repetition. The anti-repetition versions were judged less repetitive (82 vs. 210) and better written (180 vs. 112), with little difference in understandability (157 vs. 135)]
Automated evaluation
● Exploring the parameter settings
  – ε-best sampling: threshold
  – Anti-repetition scoring: penalty
● Test the impact on generated text of a range of parameter values
  – 0, 1, 5, 10, 20 for each parameter (log space)
  – Sentences in the human evaluation had both set to 20
Metrics
● Variability: edit distance between pairs of outputs (sketch below)
● N-gram scores
● Dispreferred paraphrases: word sequences permitted by the grammar but normally filtered out by the n-grams
  – “also” at the start or end of a sentence
  – “We here have” and “We have ... here”
  – “is family”
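The variability metric is only named here; as a minimal sketch, the following assumes word-level Levenshtein distance averaged over all pairs of outputs, which may differ from the exact formulation used in the study.

```python
from itertools import combinations

def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def mean_pairwise_variability(outputs):
    """Average word-level edit distance over all pairs of generated outputs."""
    pairs = list(combinations([o.split() for o in outputs], 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

outputs = ["This design is country.",
           "Here is a design in the country style.",
           "This one is also country."]
print(mean_pairwise_variability(outputs))
```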
Summary of findings
● Variability (edit distance): increased
  – Both methods had an effect; not simply additive
● N-gram scores: decreased
  – Low values for either parameter did not have an effect
  – Effect of the ε-best threshold > effect of the repetition penalty
● Dispreferred paraphrases: increased
  – The ε-best threshold had an effect on all paraphrases
  – The repetition penalty affected only “is family”
Automated vs. human evaluation (1)
● Human judges generally found the anti-repetition versions better written and less repetitive, with no difference in understandability
● Corpus-similarity metrics scored the default versions better
● Measures of variability scored the anti-repetition versions better
Evaluation studies
● Comparing tailored and non-tailored text generation (Foster and White, 2005)
● Avoiding repetition in generated text (Foster and White, 2007)
● Comparing corpus-derived methods of selecting behaviour for the embodied agent (Foster and Oberlander, 2008; Foster, 2008)
Selecting behaviour for the head
● Annotated a corpus of head movements made while reading sentences from the COMIC domain
● Implemented three selection methods:
  – Rule-based
  – Majority
  – Weighted
● Compared them in a human study and automatically
Corpus details
● 450 sentences generated by COMIC
● Link displays to the corresponding span of nodes in the OpenCCG derivation tree
● Also include contextual information:
  – User-preference evaluation
  – Information status
  – Speech-synthesiser prosody
  – Clause of the sentence
Characteristic facial displays
Generation strategies
● Rule-based: displays only on tile-design facts
  – Positive, negative, or neutral displays as appropriate
● Data-driven: consider the full context (see the sketch below)
  – Majority: always take the highest-probability option
  – Weighted: make a stochastic choice weighted by frequency
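As a concrete illustration of the two data-driven strategies, the following sketch assumes the corpus has already been reduced to a table of display counts for one context; the counts and display labels are invented.

```python
import random

# Hypothetical corpus counts: how often each display was seen in one context.
display_counts = {"none": 60, "nod": 30, "nod+lean": 10}

def majority_choice(counts):
    """Majority strategy: always pick the most frequent display for this context."""
    return max(counts, key=counts.get)

def weighted_choice(counts):
    """Weighted strategy: sample a display with probability proportional to its frequency."""
    displays = list(counts)
    weights = [counts[d] for d in displays]
    return random.choices(displays, weights=weights, k=1)[0]

print(majority_choice(display_counts))                       # always "none"
print([weighted_choice(display_counts) for _ in range(5)])   # varies run to run
```

The majority strategy always returns the same display for a given context, while the weighted strategy reproduces the corpus frequencies in the long run.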
Generated display schedules
[Figure: display schedules for the sentence “Although it’s in the family style, the tiles are by Alessi Tiles.”, comparing the Original corpus schedule with those produced by the Rule-based, Majority, and Weighted strategies; displays are shown as abbreviated codes (nd, ll, bd, sq, tr, bu) aligned with the words they accompany]
Human evaluation
● Ask judges to make pairwise choices among the output of different strategies
  – Experiment 1: Corpus vs. Weighted vs. Majority
  – Experiment 2: Corpus vs. Weighted vs. Rule-based
● Showed videos in pairs (same speech, different displays) and asked which they preferred
Results of experiment 1
[Chart: pairwise preference counts. Corpus vs. Majority: 295 vs. 153; Weighted vs. Majority: 278 vs. 170; Corpus vs. Weighted: 251 vs. 197]
Results of experiment 2
[Chart: pairwise preference counts. Corpus vs. Rule-based: 123 vs. 93; Weighted vs. Rule-based: 120 vs. 97; Original (corpus) vs. Weighted: 123 vs. 92]
Discussion
● Original corpus displays always preferred
● Majority displays strongly disliked
● Others in between
● Features of the majority choice:
  – Moves less frequently than corpus or weighted
  – Nearly always chooses a nod when it does move
● The rule-based strategy moves even less, but uses a more even distribution of displays
Automated evaluation
● Corpus-reproduction metrics (see the sketch below)
  – Precision, recall, F score
  – Node accuracy
  – Value of the β agreement measure
● Sentence-level metrics
  – Number and diversity of displays per sentence
● Global metrics
  – Number and diversity of displays across the corpus
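A rough sketch of the corpus-reproduction idea, assuming each strategy's output and the corpus annotation are represented as sets of (node, display) pairs; the precise definitions used in the study, and the node-accuracy and β measures, are not reproduced here. The example data are invented.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and F1 of predicted (node, display) pairs against the corpus."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Hypothetical example: displays predicted for three derivation-tree nodes.
gold = [("node1", "nod"), ("node2", "none"), ("node3", "nod+lean")]
predicted = [("node1", "nod"), ("node2", "nod"), ("node3", "nod+lean")]
print(precision_recall_f(predicted, gold))   # roughly (0.667, 0.667, 0.667)
```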
Corpus-reproduction metrics
[Chart: precision, recall, F score, node accuracy, and the β agreement measure for the Majority, Weighted, and Rule-based strategies; the Majority strategy scored highest on all of these corpus-reproduction metrics]
Sentence-level metrics
[Chart: display tokens and types per sentence for Corpus, Weighted, Majority, and Rule-based; the corpus had the most displays per sentence (5.39 tokens), followed by Weighted (4.62), with Majority and Rule-based producing fewer tokens and fewer distinct types per sentence]
Total distinct combinations
[Chart: total distinct display combinations across the corpus: Corpus 40, Weighted 35, Majority 11, Rule-based 3]
Proportion of downward nods
[Chart: proportion of downward nods among the displays produced by each strategy; the Majority strategy nodded almost exclusively (0.89), while the Corpus, Weighted, and Rule-based proportions were all close to 0.5 (0.51–0.54)]
Automated vs. human evaluation (2)
● Humans preferred the original corpus displays, hated the majority displays, and didn't have a strong preference between the other two
● Corpus-reproduction metrics favoured majority
● Other metrics scored corpus highest and weighted next, with majority and rule-based further down
● Only the “diversity” score (proportion of nods) favoured rule-based over majority
Discussion
Types of corpus similarity
● Reproducing each sentence (cross-validation)
  – Favours systems that produce “average” outputs, with little variation
  – Users dislike this type of output
● Reproducing the patterns in the corpus (contrast sketched below)
  – Favours systems whose output is distributed like that in the corpus
  – These systems score higher on user preference
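One way to make the contrast concrete is to compare a per-item reproduction score with a purely distributional one. This is an illustrative sketch only (the labels and counts are invented); the actual studies used the metrics listed on the earlier slides.

```python
from collections import Counter
from math import sqrt

def per_item_accuracy(predicted, gold):
    """Exact reproduction: fraction of items where the system matches the corpus."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def distribution_distance(predicted, gold):
    """Pattern reproduction: Euclidean distance between the two label distributions
    (0 = identical distributions, regardless of which item got which label)."""
    p, g = Counter(predicted), Counter(gold)
    labels = set(p) | set(g)
    return sqrt(sum((p[l] / len(predicted) - g[l] / len(gold)) ** 2 for l in labels))

gold     = ["nod", "none", "nod", "lean", "none", "nod"]
majority = ["nod", "nod", "nod", "nod", "nod", "nod"]      # safe, "average" choices
weighted = ["none", "nod", "lean", "nod", "none", "nod"]   # corpus-like distribution

for name, system in [("majority", majority), ("weighted", weighted)]:
    print(name, per_item_accuracy(system, gold), round(distribution_distance(system, gold), 3))
```

In this toy example the majority-style output scores higher on per-item accuracy but far worse on distributional similarity, mirroring the contrast on this slide.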
Possible future work
● Task-based evaluation?
  – One result: weighted displays harm users' ability to detect correctly tailored output; rule-based displays do not
● Enhanced display selection strategies
  – The corpus always scored highest
● Other evaluation metrics
Thank you
http://homepages.inf.ed.ac.uk/mef/
References
● BAILENSON J, AHARONI E, BEALL A, GUADAGNO R, DIMOV A, and BLASCOVICH J (2004). Comparing behavioural and self-report measures of embodied agents’ social presence in immersive virtual environments. In Proceedings of the 7th Annual International Workshop on PRESENCE.
● BANGALORE S, RAMBOW O, and WHITTAKER S (2000). Evaluation metrics for generation. In Proceedings of the 1st International Conference on Natural Language Generation (INLG 2000).
● BELZ A, GATT A, REITER E, and VIETHEN J, editors (2007). The Attribute Selection for Generation of Referring Expressions Challenge. http://www.csd.abdn.ac.uk/research/evaluation/
● BELZ A and REITER E (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
● BINSTED K, PAIN H, and RITCHIE G (1997). Children’s evaluation of computer-generated punning riddles. Pragmatics and Cognition, 5(2):309–358.
● CARENINI G and MOORE J D (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952. doi:10.1016/j.artint.2006.05.003.
● VAN DEEMTER K, KRAHMER E, and THEUNE M (2005). Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24.
References
● DI EUGENIO B, FOSSATI D, YU D, HALLER S, and GLASS M (2005). Aggregation improves learning: experiments in natural language generation for intelligent tutoring systems. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (ACL 2005). Ann Arbor, MI.
● ELHADAD N, MCKEOWN K, KAUFMAN D, and JORDAN D (2005). Facilitating physicians’ access to information via tailored text summarization. In Proceedings of the AMIA Annual Symposium.
● FOSTER M E (2008). Automated metrics that agree with human judgements on generated output. Submitted.
● FOSTER M E and OBERLANDER J (2008). Corpus-based generation of conversational facial displays. Language Resources and Evaluation. To appear.
● FOSTER M E and WHITE M (2005). Assessing the impact of adaptive generation in the COMIC multimodal dialogue system. In Proceedings of the IJCAI 2005 Workshop on Knowledge and Reasoning in Practical Dialogue Systems. Edinburgh, Scotland.
● FOSTER M E and WHITE M (2007). Avoiding repetition in generated text. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG 2007). Schloss Dagstuhl.
● HARTLEY A, SCOTT D, KRUIJFF-KORBAYOVÁ I, SHAROFF S, SOKOLOVA L, DOCHEV D, STAYKOVA K, ČMEJREK M, HANA J, and TEICH E (2000). Evaluation of the final prototype. Technical Report EVAL2, AGILE project.
References
● ISARD A, OBERLANDER J, ANDROUTSOPOULOS I, and MATHESON C (2003). Speaking the users’ languages. IEEE Intelligent Systems, 18(1):40–45. doi:10.1109/MIS.2003.1179192.
● KARAMANIS N and MELLISH C (2005). A review of recent corpus-based methods for evaluating information ordering in text production. In Proceedings of the Corpus Linguistics 2005 Workshop on Using Corpora for Natural Language Generation.
● LIN C Y (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Barcelona, Spain.
● MELLISH C and DALE R (1998). Evaluation in the context of natural language generation. Computer Speech and Language, 12(4):349–373. doi:10.1006/csla.1998.0106.
● PAPINENI K, ROUKOS S, WARD T, and ZHU W J (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
● PRENDINGER H, MA C, YINGZI J, NAKASONE A, and ISHIZUKA M (2005a). Understanding the effect of life-like interface agents through users’ eye movements. In Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI 2005), pp. 108–115. doi:10.1145/1088463.1088484.
References
● PRENDINGER H, MORI J, and ISHIZUKA M (2005b). Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International Journal of Human-Computer Studies, 62(2):231–245. doi:10.1016/j.ijhcs.2004.11.009.
● REITER E, ROBERTSON R, and OSMAN L M (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2):41–58. doi:10.1016/S0004-3702(02)00370-3.
● REITER E and SRIPADA S (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd International Language Generation Conference (INLG 2002).
● SRIPADA S and GAO F (2007). Summarising dive computer data: A case study in integrating textual and graphical presentations of numerical data. In Proceedings of the Workshop on Multimodal Output Generation (MOG 2007).
● SRIPADA S G, REITER E, and HAWIZY L (2005). Evaluation of an NLG system using post-edit data: Lessons learnt. In Proceedings of the 10th European Workshop on Natural Language Generation (EWNLG 2005).
● STENT A, MARGE M, and SINGHAI M (2005). Evaluating evaluation methods for generation in the presence of variation. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, volume 3406/2005, pp. 341–351. Springer. doi:10.1007/b105772.
References
● WALKER M A, LITMAN D, KAMM C A, and ABELLA A (1997). PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL/EACL 1997).
● WALKER M, WHITTAKER S, STENT A, MALOOR P, MOORE J, JOHNSTON M, and VASIREDDY G (2004). Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840. doi:10.1016/j.cogsci.2004.06.002.
● WHITE M (2004). Reining in CCG chart realization. In Proceedings of the 3rd International Conference on Natural Language Generation (INLG 2004).
● WHITE M (2006). CCG chart realization from disjunctive inputs. In Proceedings of the 4th International Conference on Natural Language Generation (INLG 2006).
● WHITE M and DALE R, editors (2007). Report from the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
● WILKS Y (1990). Where am I coming from: The reversibility of analysis and generation in natural language processing. In M Pütz, editor, Thirty Years of Linguistic Evolution. John Benjamins Publishing Company.