Comparing Human and Algorithm Performance on Estimating Word-based Semantic Similarity

Nils Batram, Markus Krause, Paul-Olivier Dehaye
Leibniz University, University of Zurich
[email protected], [email protected], [email protected]

Abstract. Understanding natural language is an inherently complex task for computer algorithms. Crowdsourcing natural language tasks such as semantic similarity estimation is therefore a promising approach. In this paper, we investigate the performance of crowdworkers and compare them to offline contributors as well as to state-of-the-art algorithms. We illustrate that algorithms do outperform single human contributors but still cannot compete with results aggregated from groups of contributors. Furthermore, we demonstrate that this effect is persistent across different contributor populations. Finally, we give guidelines for easing the challenge of collecting word-based semantic similarity data from human contributors.

1 Introduction

Semantic similarity plays an important role in many natural language processing tasks, especially word sense disambiguation and information retrieval [1–5]. Humans are better than algorithms at rating the semantic similarity between two words. Employing human raters, however, is inflexible, time-consuming, and expensive. Involving paid online workers on a crowdsourcing platform (crowdworkers) can reduce costs, but the response quality is harder to predict. The incentive structure for crowdworkers is mostly financial, and purposefully under-performing crowdworkers are still a major challenge for crowdsourcing [6]. Different algorithmic approaches do exist [7–9] but are not yet able to reproduce human-level results [10]. We aim to answer the question of whether crowdworkers provide results at a quality level that justifies the extra costs compared to state-of-the-art algorithms. In this article, we compare three data sets covering the same word pairs, generated by offline contributors with native command of the English language, by crowdworkers, and by state-of-the-art algorithms. We will illustrate that although predictions based on individual human rater scores underperform compared to algorithms, predictions based on averaged human rater scores do outperform algorithms in estimating population averages.

2 Related Work

Using crowdworkers is effective in many areas such as paraphrase recognition [11], transcribing large amounts of text [12], or detecting accessibility problems on sidewalks in Google Streetview [13], but the response quality of crowdworkers judging semantic similarity is still unknown. Temporal Semantic Analysis (TSA) is a state-of-the-art algorithm for predicting semantic similarity with good results on standardized data sets [10]. TSA uses statistical information about words, including their usage over time. Radinsky et al. [10] already use results from crowdworkers and offline contributors on semantic similarity to estimate their algorithm's quality. However, they do not compare crowdworkers with offline contributors directly, leaving the following questions open:

RQ1: Do averaged crowdworker results perform better than algorithms?
RQ2: Do averaged crowdworker results perform on the same level as offline contributors?

Both questions yield a sub-question. As current research only observes average scores of human raters, we are also interested in the performance on an individual level:

RQ1a: Do individual crowdworkers perform better than algorithms?
RQ2a: Do individual crowdworkers perform as well as offline contributors?

As explained above, rigorous quality management is sometimes necessary to achieve reasonable results from crowdworkers [14]. In most scenarios, experimenters only use simple quality control mechanisms such as ground truth questions [15], also called gold questions or control questions. We also want to investigate the relevance of this aspect:

RQ3: What level of quality control is necessary for optimal response quality?

Since human ratings are the baseline, we formulate the following hypotheses for the posed research questions:

H1: Crowdworkers do outperform algorithms when quality control is applied.
H2: The response quality of crowdworkers is comparable to offline contributors.

3 Study Design

Our study uses a between-group design with three conditions. The first two conditions are crowdworkers without ground truth questions (uncontrolled) and crowdworkers with ground truth questions (controlled). We performed all crowdsourcing-based experiments on the CrowdFlower online self-service platform. We accepted workers from Australia, the UK, and the US for all conditions. As in other studies on word-based semantic similarity, we use the WordSimilarity-353 test collection [16]. The set consists of 353 English word pairs and contains 13 to 16 human judgments for each word pair on a continuous scale from 0 (totally unrelated) to 10 (very related). Finkelstein et al. [16] collected this set from offline contributors with near-native command of the English language. We refer to this set as the WS-353 condition. We compare the results of our human conditions to two state-of-the-art algorithms from different authors, namely TSA [10] and ESA (Explicit Semantic Analysis) [17]. For the controlled condition, we used twelve word pairs from the WS-353 data set as ground truth questions. We used their averaged similarity scores as a baseline and accepted all responses not more than two score points away from this mean as valid.

All crowdworkers in the controlled condition had to take an introduction test of four random ground truth questions. We only accepted workers with an accuracy of 70% or higher. While working on the actual task, crowdworkers in the controlled condition received a ground truth question every sixth task. We discarded responses from workers whose accuracy dropped below 70%. For each word pair, we collected 13-14 judgments in the controlled condition, for a total of 4413 judgments. Additionally, we collected 73-113 judgments for the ground truth questions, resulting in a further 1096 judgments. 42 different crowdworkers contributed to the results, with an average of 107 judgments per worker (without ground truth questions). For our analysis, we dropped all workers with fewer than 50 responses, leaving 23 workers and 3894 judgments for evaluation. We also repeated the controlled condition two times with different settings for the introduction test. As the results were identical, we only refer to the first instance of the controlled condition. For the uncontrolled condition, we collected 13 judgments for each word pair from 16 different workers, resulting in 4589 judgments with an average of approximately 287 judgments per worker. We did not use ground truth questions or introduction tests in this task.
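The following minimal sketch illustrates the two acceptance rules described above (the two-point tolerance around the gold mean and the 70% accuracy threshold); the function and variable names are our own and are not taken from the actual CrowdFlower configuration.

# Sketch of the acceptance rules described above; names and data layout are
# our own illustration, not the original CrowdFlower setup.

GOLD_TOLERANCE = 2.0   # responses within +/- 2 score points of the gold mean are valid
MIN_ACCURACY = 0.70    # workers below 70% accuracy on gold questions are discarded


def gold_response_is_valid(response: float, gold_mean: float,
                           tolerance: float = GOLD_TOLERANCE) -> bool:
    """Accept a gold-question answer if it lies within `tolerance` score
    points of the averaged WS-353 similarity score for that pair."""
    return abs(response - gold_mean) <= tolerance


def keep_worker(gold_answers_correct: list[bool],
                min_accuracy: float = MIN_ACCURACY) -> bool:
    """Keep a worker only while their running gold-question accuracy stays
    at or above the threshold."""
    if not gold_answers_correct:
        return False
    return sum(gold_answers_correct) / len(gold_answers_correct) >= min_accuracy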

Fig. 1. Screenshot of the Word Similarity task on CrowdFlower.

4 Procedure

We conducted the experiments for the uncontrolled and controlled conditions using CrowdFlower. We gave instructions identical to the WS-353 survey. We left out some instructions specific to the WS-353 implementation; e.g., crowdworkers did not have to state their name at the beginning of the survey since their ID is stored automatically. Every crowdworker had to rate the similarity between pairs of words on a Likert scale from 1 (unrelated) to 10 (very related). For technical reasons, the scale goes from 1 to 10, contrary to the WS-353 data set with its scale from 0 to 10. We normalized all ratings to lie in the range [0.0, 1.0] for the evaluation process. Figure 1 shows a screenshot of the experiment's interface as seen by our crowdworkers.

5 Measures

To estimate the performance of crowdworkers and offline contributors we use Spearman's rank-order correlation (Spearman's 𝜌). To estimate the overall quality of a condition, we calculate 𝜌 between the condition and the average similarity scores from WS-353. For each condition, we average the scores of each word pair over all contributors (averaged results). We calculate these average scores for each condition and for the results of WS-353. Correlation can only reveal relative differences. Therefore, we also estimate the mean squared error (MSE), again using the average of the WS-353 data set as an estimator for the population mean. To estimate the MSE, we normalize all scores X to lie in the interval [0.0, 1.0]:

X' = (X − X_min) / (X_max − X_min)

We calculate the mean of these normalized scores for each word pair in each condition. We calculate the MSE using the mean normalized scores of the WS-353 data set as an estimate for the true values. Although semantic similarity is subjective, we treat the MSE of a condition as a measure of how well that condition predicts the population average. If the results from a condition are close to the population average, we can assume a higher utility. To measure the response quality of individual contributors within a condition, we calculate Spearman's 𝜌 against the average scores of WS-353 for every contributor. We then calculate the mean and standard deviation (𝜎) of these resulting 𝜌 scores (individual results). This allows us to compare individuals from each condition with each other. We also calculate individual MSE scores in the same way. For the TSA algorithm, we use the 𝜌 values as reported by Radinsky et al. [10]. Besides correlation, another important quality indicator is inter-rater agreement (IRA). Although semantic similarity in general is a subjective task, the WS-353 list of word pairs contains few controversial pairs. Therefore, IRA is a useful indicator for response quality. To estimate IRA, we calculate Krippendorff's alpha coefficient 𝛼 [18] for the contributors in each condition.
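As an illustration, the following sketch shows how the averaged condition scores, Spearman's 𝜌, and MSE could be computed; the (word pair × rater) matrix layout and the names are an assumption of ours, not the original analysis code.

# Sketch of the aggregate measures described above (min-max normalization,
# Spearman's rho against the WS-353 averages, and MSE).
import numpy as np
from scipy.stats import spearmanr


def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize scores to the interval [0.0, 1.0]."""
    return (scores - scores.min()) / (scores.max() - scores.min())


def condition_quality(condition_scores: np.ndarray, ws353_scores: np.ndarray):
    """Return (Spearman's rho, MSE) of the averaged condition scores against
    the averaged WS-353 scores. Both inputs have shape (n_pairs, n_raters),
    one row per word pair, with np.nan for missing judgments."""
    cond_mean = normalize(np.nanmean(condition_scores, axis=1))
    ws_mean = normalize(np.nanmean(ws353_scores, axis=1))
    rho, _ = spearmanr(cond_mean, ws_mean)
    mse = float(np.mean((cond_mean - ws_mean) ** 2))
    return rho, mse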

6 Results

First, we investigate our first and third questions (RQ1, RQ3): can crowdworkers outperform current state-of-the-art algorithms without quality control mechanisms? The controlled condition has the highest correlation of 𝜌 = 0.88, followed by the TSA algorithm with 𝜌 = 0.79 and ESA with 𝜌 = 0.75. In the uncontrolled condition, crowdworkers achieved 𝜌 = 0.62 without quality control. All correlations are significant at an 𝛼-level of 0.001. This supports hypothesis H1: crowdworkers outperform state-of-the-art algorithms only when quality control is applied; without it, the algorithms outperform crowdworkers in our task. The mean squared error of the crowdsourced conditions supports this hypothesis as well. After normalizing the similarity scores, the MSE for the controlled condition is 0.08 compared to 0.31 in the uncontrolled condition. The MSE scores of both algorithms are also higher, with 0.29 for TSA and 0.31 for ESA.

Our second question RQ2 compares crowdworkers and offline contributors. The uncontrolled condition has an individual average correlation of 𝜌 = 0.26 (𝜎 = 0.24). In the controlled condition, crowdworkers have an individual average correlation of 𝜌 = 0.69 (𝜎 = 0.07). Contributors in the WS-353 condition have an average correlation of 𝜌 = 0.75 (𝜎 = 0.07). To calculate the individual average agreement for the WS-353 condition, we compare each contributor to the average score of WS-353 computed without this contributor. Figure 2 shows a boxplot of the individual results. Again, all correlations are significant at an 𝛼-level of 0.001.
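A minimal sketch of this leave-one-out comparison, assuming the same (word pair × rater) matrix layout as in the earlier sketch:

# Sketch of the leave-one-out comparison for the WS-353 condition described
# above; the (n_pairs x n_raters) matrix layout is our own assumption.
import numpy as np
from scipy.stats import spearmanr


def individual_rhos_leave_one_out(ws353_scores: np.ndarray) -> np.ndarray:
    """For each WS-353 contributor, correlate their ratings with the average
    of all remaining contributors. ws353_scores has shape (n_pairs, n_raters)."""
    rhos = []
    for i in range(ws353_scores.shape[1]):
        others_mean = np.delete(ws353_scores, i, axis=1).mean(axis=1)
        rho, _ = spearmanr(ws353_scores[:, i], others_mean)
        rhos.append(rho)
    return np.array(rhos)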


Fig. 2. Individual correlations between the human-based conditions and the average of the WS-353 data set. Horizontal lines give the results for the TSA and ESA algorithms as reported in [10].

As the results of the WS-353 condition are very close to the controlled condition, it is not clear whether the differences originate from random effects, possible errors introduced through the normalization, or other effects. Therefore, we compare the IRA of both conditions. A higher IRA indicates better agreement among contributors and is an indicator for the quality of responses. Although semantic similarity is an inherently subjective task, the WS-353 data set contains very few controversial word pairs. Consequently, results with a higher IRA are likely to be of higher quality than results with a low IRA. Table 1 reports Krippendorff's alpha scores for all human-based conditions. The agreement among contributors in the controlled condition (𝛼 = 0.444) is lower than in the WS-353 condition (𝛼 = 0.506). Therefore, our experiment cannot support our hypothesis H2 that crowdworkers perform on the same level as offline contributors. These findings are consistent with the measured individual MSE scores. The average individual MSE score for the controlled condition is 0.19 (𝜎 = 0.05), which is higher than the average MSE score for the WS-353 condition of 0.03 (𝜎 = 0.007).

Table 1. Inter-rater agreement within each condition, estimated with Krippendorff's alpha.

Uncontrolled: 𝛼 = 0.004
Controlled:   𝛼 = 0.444
WS-353:       𝛼 = 0.506
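For reference, the following sketch shows one way these agreement values could be estimated; the third-party krippendorff package is our choice for illustration and not necessarily the tool used in the original analysis.

# Hedged sketch of the IRA estimate using the third-party `krippendorff`
# package (pip install krippendorff); our own data layout, with ratings of
# shape (n_raters, n_pairs) and np.nan for missing judgments.
import numpy as np
import krippendorff


def inter_rater_agreement(ratings: np.ndarray) -> float:
    """Krippendorff's alpha for interval-scaled similarity ratings."""
    return krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="interval")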


These results indicate that on an individual level the TSA algorithm can outperform offline human contributors and crowdworkers alike. Yet the MSE scores of both algorithms (TSA = 0.29 (𝜎 = 0.14), ESA = 0.31 (𝜎 = 0.18)) are higher than the average individual MSE in the uncontrolled and WS-353 conditions. Therefore, we cannot yet definitively answer our questions RQ1a and RQ2a, i.e., whether contributors are less accurate on an individual level in predicting population averages. The results show that, for our experiment, aggregated human responses still outperform algorithms in predicting population averages. In order to estimate the number of contributors necessary to achieve good results, we drew seven sets of 1000 unique combinations of contributors from our 23 contributors in the controlled condition. Each of the seven sets was selected so that each word pair aggregated n responses from different contributors, with n in the range of two to eight. We averaged the scores for each pair and calculated Spearman's 𝜌 between each set's averages and the averages of WS-353. Figure 3 shows the number of responses per word pair necessary to achieve certain Spearman 𝜌 scores with crowdworkers from the controlled condition in our experiment.
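The following simplified sketch illustrates this aggregation experiment; it samples n responses per word pair rather than fixed worker combinations, and all names are illustrative rather than the original analysis code.

# Simplified sketch of the aggregation experiment described above. judgments
# maps each word pair to the list of collected scores, ws353_means maps each
# pair to its WS-353 mean score.
import random
import numpy as np
from scipy.stats import spearmanr


def aggregated_rhos(judgments: dict, ws353_means: dict, n: int,
                    n_samples: int = 1000, seed: int = 0) -> list:
    """For each of n_samples draws, average n randomly chosen responses per
    word pair and correlate the aggregates with the WS-353 mean scores."""
    rng = random.Random(seed)
    pairs = sorted(judgments)
    ws = np.array([ws353_means[p] for p in pairs])
    rhos = []
    for _ in range(n_samples):
        agg = np.array([np.mean(rng.sample(judgments[p], n)) for p in pairs])
        rho, _ = spearmanr(agg, ws)
        rhos.append(float(rho))
    return rhos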


Fig. 3. Boxplot of correlations with different numbers of responses aggregated for each word pair. For each box, we used 1000 permutations of crowdworkers from our controlled set of 19 workers so that an average of n workers contributed to the result of each word pair. We compare each set to the average of the WS-353 data set. Horizontal lines indicate the TSA and ESA scores as reported in [10].

As stated earlier, we repeated the controlled condition with different settings for the introductory test. In the first repetition, we removed the test completely. In the second repetition, we used eight instead of four ground truth questions, of which seven had to be answered correctly. All repetitions and the controlled condition show nearly identical results for IRA, correlation, and MSE. The differences in Spearman's 𝜌 are not significant at an 𝛼-level of 0.1 with a power > 95%. These results not only demonstrate that minimal precautions were already sufficient to enhance response quality in our experiment but also illustrate that the results are repeatable: we repeated the controlled condition three weeks after the initial trial.

7 Discussion and Future Work

Semantic similarity plays an important role in natural language processing, especially in word sense disambiguation and information retrieval. In our experiment, we compare the responses of crowdworkers, offline contributors, and two algorithms, TSA (Temporal Semantic Analysis) and ESA (Explicit Semantic Analysis), on the same 353 word pairs. We found that human contributors are better than algorithms at estimating population averages for word-based semantic similarity when their responses are aggregated. The correlation of the best crowdsourced condition is 𝜌 = 0.88 and therefore higher than the best score from any algorithm (TSA) with 𝜌 = 0.79. In our experiment, aggregating responses from at least five contributors per word pair was necessary to achieve higher 𝜌-scores than the TSA algorithm. On an individual level, the TSA algorithm achieves better Spearman's 𝜌 correlation scores than individual contributors in any condition. Yet individual contributors in our controlled condition still achieve a better MSE (mean squared error) score of 0.19 (𝜎 = 0.05) than the TSA algorithm with a score of 0.29. We found that for our experiment the design of the introductory test had no impact on response quality. The reason for this might be that spammers avoid tasks as soon as minimal precautions against spamming are taken. The measured 𝜌-scores of the controlled condition are close to those of the offline contributors from the WS-353 data set. However, the measured MSE scores differ significantly. Future work should investigate whether more sophisticated quality control mechanisms and feedback methods [19, 20] can enhance the response quality of crowdworkers. Furthermore, the WS-353 data set was introduced in 2002 and contains words with a high temporal component. For instance, Maradona or Arafat might have had more relevance in 2002 than in 2014. Future work should evaluate the influence of temporal differences by repeating the WS-353 experiment at the same time as the crowdsourced conditions.

8 Acknowledgements

We would like to thank Kira Radinsky for responding so quickly to our inquiry, although she had not visited her university for three years. This paper would have been much less concise without her help.

9 References

1. Feng, J., Zhou, Y., Martin, T.: Sentence Similarity based on Relevance. Proceedings of IPMU '08, pp. 832–839 (2008).
2. Navigli, R.: Word Sense Disambiguation: A Survey. ACM Computing Surveys, Vol. 41, No. 2, Article 10 (2009).
3. Krause, M., Porzel, R.: It is about Time: Time Aware Quality Management for Interactive Systems with Humans in the Loop. CHI '13 EA. ACM Press, Paris, France (2013).
4. Strube, M., Ponzetto, S.P.: WikiRelate! Computing Semantic Relatedness Using Wikipedia. AAAI, pp. 1419–1424 (2006).
5. Yang, D., Powers, D.M.W.: Measuring Semantic Similarity in the Taxonomy of WordNet. Conferences in Research and Practice in Information Technology, Vol. 38, pp. 315–322 (2005).
6. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of IJCAI '95 (1995).
7. Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S.: A Word at a Time: Computing Word Relatedness using Temporal Semantic Analysis. Proceedings of WWW '11, pp. 337–346 (2011).
8. Tschirsich, M., Hintz, G.: Leveraging Crowdsourcing for Paraphrase Recognition. Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, pp. 205–213 (2013).
9. Williams, J.D., Melamed, I.D., Alonso, T., Hollister, B., Wilpon, J.: Crowd-sourcing for difficult transcription of speech. 2011 IEEE ASRU Workshop, pp. 535–540 (2011).
10. Hara, K., Le, V., Froehlich, J.: Combining Crowdsourcing and Google Street View to Identify Street-level Accessibility Problems. Proceedings of CHI '13, pp. 631–640 (2013).
11. Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S.: A Word at a Time: Computing Word Relatedness using Temporal Semantic Analysis. Proceedings of WWW '11, pp. 337–346 (2011).
12. Wang, J., Ipeirotis, P., Provost, F.: Quality-Based Pricing for Crowdsourced Workers. NYU Working Paper No. 2451/31833 (2013).
13. Oleson, D., Sorokin, A., Laughlin, G., Hester, V., Le, J., Biewald, L.: Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. Human Computation: Papers from the 2011 AAAI Workshop, pp. 43–48 (2011).
14. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, Vol. 20, No. 1, pp. 116–131 (2002).
15. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI, Vol. 7, pp. 1606–1611 (2007).
16. Krippendorff, K.: Estimating the Reliability, Systematic Error and Random Error of Interval Data. Educational and Psychological Measurement, 30(1), pp. 61–70 (1970).
17. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 614 (2008).
18. Krause, M.: GameLab: A Tool Suit to Support Designers of Systems with Homo Ludens in the Loop. HComp '13: Works in Progress and Demonstration Abstracts, pp. 38–39. AAAI Press, Palm Springs, CA, USA (2013).
