An Open Source Tool for Crowd-Sourcing the Manual Annotation of Texts

Brett Drury, Paula C.F. Cardoso, Jorge Valverde-Rebaza, Alan Valejo, Fabio Pereira, and Alneu de Andrade Lopes

ICMC, University of São Paulo, Av. Trabalhador São Carlense 400, São Carlos, SP, Brazil, C.P. 668, CEP 13560-970
http://www.icmc.usp.br/
Abstract. Manually annotated data is the basis for a large number of natural language processing tasks, as either evaluation or training data. The annotation of large amounts of data by dedicated full-time annotators can be an expensive task, which may be beyond the budgets of many research projects. An alternative is crowd-sourcing, where the annotations are split among many part-time annotators. This paper presents a freely available open-source platform for crowd-sourcing manual annotation tasks, and describes its application to annotating causative relations.

Keywords: crowd-sourcing, annotations, causative relations.
1 Introduction
Manually annotated data is the basis for a large number of natural language processing tasks, as either evaluation or training data. The manual annotation of data is a time-intensive and repetitive task, and it may be beyond many projects' budgets to hire full-time annotators. Crowd-sourcing is an alternative annotation policy in which a large manual annotation task is split among a large number of part-time annotators. The advantage of crowd-sourcing is that each individual spends a relatively small period of time on the annotation task. Crowd-sourcing has become a popular approach for many natural language processing tasks, but popular tools such as Mechanical Turk are commercial. To remedy the lack of freely available open-source annotation tools, we present a freely available tool which can be used to crowd-source annotation tasks. The remainder of the paper is organised as follows: Section 2 discusses related work on crowd-sourcing and manual annotation; Section 3 describes the platform architecture and the underlying design principles of the application; and Section 4 describes the application of the tool to annotating causative relations.
2 Related Work
This section describes general research on manual annotation as well as crowd-sourcing. A common strategy for the manual annotation of text is to use multiple annotators and to use inter-annotator agreement to accept common annotations. This strategy is designed to mitigate the biases of a single annotator. A systematic study of inter-annotator agreement was conducted by [3], who evaluated two tasks, POS tagging and structure tagging, on a large German news corpus and found annotator agreement of 98.6% for the POS task and 92.4% for structure tagging. The research literature also contains inter-annotator agreement results for tasks such as word-sense disambiguation [9], opinion retrieval [1] and semantics [11].

The works described thus far used dedicated annotators; an alternative is crowd-sourcing, which divides the annotation task among many annotators. The motivation for performing the annotation task may be financial [13][7] or altruistic [13]. A potential problem for crowd-sourcing is the quantity and quality of the annotated data. Wang [13] suggests that there is a trade-off between quality and quantity because high-quality annotated data requires a comprehensive set of annotation rules. In addition, crowd-sourcing may attract non-expert annotators. Hsueh et al. [5] conducted a study comparing expert and non-expert annotators and found that, given a sufficient number of non-expert annotators, they performed nearly as well as a small number of expert annotators. This finding is partially supported by [10]. A review of the research literature on crowd-sourcing natural language tasks found that researchers tend to use closed-source commercial systems such as Mechanical Turk (www.mturk.com) and CrowdFlower (www.crowdflower.com) [12].
3 Platform Description
The traditional crowd-sourcing platforms may be insufficient for Portuguese-language researchers because: 1. there may not be sufficient native Portuguese speakers with access to the Internet who are willing to work for typical crowd-sourcing pay and 2. there may not be a budget to pay annotators. To resolve this problem we have developed a prototype platform for crowd-sourcing annotations so that researchers can rely on altruistic annotators rather than ones motivated by money. The platform is designed around an “n-tier” architecture [6]. An n-tier architecture separates presentation, logic and storage into distinct layers so that they have no inter-dependencies. The architecture of the application is displayed in Figure 1.

Fig. 1. N-tier architecture of the application

3.1 Presentation Layer
The presentation layer uses HTML to provide a rudimentary user interface. The HTML can be hard-coded into a Web page or generated dynamically with a Java Server Page (JSP). The user interface is typically: 1. a read-only text field, 2. a submit button and 3. a category check box. Figure 2 shows an example of the user interface.

Fig. 2. Rudimentary user interface

The presentation layer uses client-side JavaScript with the Dojo library (http://dojotoolkit.org/) to control the annotator's interaction with the Web page. In addition, the JavaScript is used to communicate with the logic layer to pass and receive text information. This is achieved with an asynchronous “post” against a server-side application, a technique commonly known as AJAX.

3.2 Logic Layer
The logic layer's function is to communicate with the presentation and data layers. In this application we used Servlets, which are written in Java and rely upon a Java Virtual Machine (JVM); consequently we were obliged to choose a “Servlet Container”, in this case Tomcat. The Servlet has two functions: 1. to save information passed to it from the presentation layer to the data layer and 2. to randomly select information from the data layer. The Servlet uses JDBC to communicate with the database; consequently the logic layer is independent of the database management system (DBMS) used in the data layer.
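As an illustration of this design, the sketch below shows a minimal Servlet that saves an annotation posted asynchronously by the presentation layer and writes it to the data layer via JDBC. It is a sketch under assumed names only: the class name, connection URL, credentials and the annotations table schema are illustrative and are not taken from the released source code.

```java
// Minimal sketch only: class name, connection details and the table/column
// names (annotations, sentence_id, category, annotation) are assumptions,
// not taken from the released source code.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class AnnotationServlet extends HttpServlet {

    // Saves an annotation posted asynchronously ("AJAX") by the presentation layer.
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String sentenceId = req.getParameter("sentenceId");
        String category = req.getParameter("category");      // causative / non-causative
        String annotation = req.getParameter("annotation");  // highlighted cause, verb and effect spans

        // JDBC keeps the logic layer independent of the DBMS; the URL and
        // credentials below are placeholders.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/annotation", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO annotations (sentence_id, category, annotation) VALUES (?, ?, ?)")) {
            ps.setLong(1, Long.parseLong(sentenceId));
            ps.setString(2, category);
            ps.setString(3, annotation);
            ps.executeUpdate();
        } catch (Exception e) {
            throw new ServletException(e);
        }
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}
```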
3.3 Data Layer
The data layer responds to save and select requests from the logic layer. There is no prescribed DBMS, but we used MySQL. The table which contains the data to be annotated must have a numeric primary key, which can be either synthetic or natural. A numeric key is required because the Servlet uses a random selection strategy which relies upon a numeric range.
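The sketch below illustrates how such a numeric-range random selection could be implemented with JDBC: it picks a random value within the key range and returns the first row at or above it, which also copes with gaps in the key sequence. The table and column names (sentences, id, text) are assumptions for illustration.

```java
// Sketch of a numeric-range random selection; table and column names are assumptions.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public class RandomSentenceSelector {

    private final Random random = new Random();

    // Returns the text of a randomly selected sentence, or null if the table is empty.
    public String selectRandomSentence(Connection con) throws SQLException {
        long min;
        long max;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT MIN(id), MAX(id) FROM sentences")) {
            rs.next();
            min = rs.getLong(1);
            max = rs.getLong(2);
        }
        // Pick a random key in [min, max] and take the first row at or above it,
        // which tolerates gaps left by deleted or skipped keys.
        long target = min + (long) (random.nextDouble() * (max - min + 1));
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT text FROM sentences WHERE id >= ? ORDER BY id LIMIT 1")) {
            ps.setLong(1, target);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```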
4 Causative Relation Case Study
The platform was used to annotate causative relations in Brazilian agricultural news. The agricultural news was scraped from news sites on the Web, and the resulting text was segmented into sentences using the sentence splitter in the NLTK library [2]. A random selection of 2,000 sentences was reserved for annotation and imported into a table in MySQL. Six annotators (five novice and one experienced) were invited to annotate the text. The annotators had two tasks: 1. classify each sentence into one of two categories, causative or non-causative, and 2. annotate the causative sentences. The annotation of a causative sentence required the annotator to identify: 1. a cause, 2. a causative verb and 3. an effect. The annotator highlighted each part of the causative phrase in the text box with the mouse; right-clicking the highlighted text displayed a context menu from which the annotator selected the requisite item (cause, effect or verb). The annotation period ran from February to March 2014. The annotation site is located at http://goo.gl/d2UN93, where the data can be downloaded. The source code is available from http://goo.gl/qTvlsx.

4.1 Evaluation
The evaluation of the annotated data measured two tasks: 1. inter-annotator agreement for the classification task and 2. the similarity of the annotation of causative phrases. The measures used for evaluating annotator agreement were: 1. the average percentage of annotators agreeing on a single category and 2. the average Cohen's kappa coefficient [4]. An average Cohen's kappa was calculated because there was a varying number of categorizations for each sentence, and consequently the chance annotator agreement varied with the number of categorizations. The results are shown in Table 1.
Table 1. Evaluation of Causative Categorization Task

Measure                          Value
Avg. percentage                  77.96% (±41.44)
Avg. Cohen's kappa coefficient   60.74 (±75.00)
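The paper does not publish its agreement computation, and it does not specify exactly how the kappa values were averaged; the sketch below shows one plausible building block, Cohen's kappa for a single pair of annotators on the binary causative/non-causative decision, which could then be averaged over pairs or sentences. All names and the example data are illustrative.

```java
// Illustrative only: Cohen's kappa for two annotators on a binary task.
// How the paper averaged the kappa values is not specified and is not reproduced here.
import java.util.List;

public class KappaExample {

    // labels: true = causative, false = non-causative
    static double cohensKappa(List<Boolean> a, List<Boolean> b) {
        int n = a.size();
        int agree = 0, aTrue = 0, bTrue = 0;
        for (int i = 0; i < n; i++) {
            if (a.get(i).equals(b.get(i))) agree++;
            if (a.get(i)) aTrue++;
            if (b.get(i)) bTrue++;
        }
        double observed = agree / (double) n;
        // Chance agreement: both label "causative" plus both label "non-causative".
        double chance = (aTrue / (double) n) * (bTrue / (double) n)
                      + ((n - aTrue) / (double) n) * ((n - bTrue) / (double) n);
        return (observed - chance) / (1.0 - chance);
    }

    public static void main(String[] args) {
        List<Boolean> ann1 = List.of(true, true, false, false, true);
        List<Boolean> ann2 = List.of(true, false, false, false, true);
        System.out.printf("kappa = %.2f%n", cohensKappa(ann1, ann2));
    }
}
```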
The results are lower than those of the comparable manual annotation tasks described in the research literature; this may be due to the relative inexperience of some of the annotators as well as the subjective nature of the task. The second evaluation was the similarity of the annotations of causative relations. The evaluation measure was an average Levenshtein distance [8] for each annotated part of the causative relation (cause, verb and effect). If more than two annotators had annotated a sentence, then each annotation was compared separately and an average distance was calculated. We calculated the average distance for two levels of annotator agreement: majority and full agreement. The results are shown in Table 2. The values for full and majority agreement are identical, which suggests that all sentences annotated more than twice had full annotator agreement. This may indicate that the expert annotator agreed with the novice annotators where they annotated the same sentence. There was some confusion amongst the annotators about whether to annotate resultative causative verbs as part of the effect or as the causative verb. This may account for the difference between verb and object similarity and subject similarity.

Table 2. Similarity of Causative Relation Annotation

Annotator Agreement   Subject        Object         Verb
Full                  0.73 (±0.32)   0.68 (±0.29)   0.71 (±0.35)
Majority              0.73 (±0.32)   0.68 (±0.29)   0.71 (±0.35)
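To make the span comparison concrete, the sketch below computes a normalised Levenshtein similarity over the annotated spans of one relation part (cause, verb or effect) and averages it over annotator pairs. Normalising the edit distance by the length of the longer string is an assumption, as is the example data; the paper does not state its exact normalisation or averaging scheme.

```java
// Sketch: pairwise Levenshtein-based similarity of annotated spans, averaged over
// annotator pairs. The normalisation is an assumption, not taken from the paper.
import java.util.List;

public class SpanSimilarity {

    // Classic dynamic-programming edit distance with two rolling rows.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    // Average similarity over all annotator pairs for one span (cause, verb or effect).
    static double averageSimilarity(List<String> spans) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < spans.size(); i++) {
            for (int j = i + 1; j < spans.size(); j++) {
                int maxLen = Math.max(spans.get(i).length(), spans.get(j).length());
                double similarity = maxLen == 0
                        ? 1.0
                        : 1.0 - levenshtein(spans.get(i), spans.get(j)) / (double) maxLen;
                sum += similarity;
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    public static void main(String[] args) {
        List<String> causes = List.of("a seca prolongada", "a seca", "a seca prolongada");
        System.out.printf("avg. cause similarity = %.2f%n", averageSimilarity(causes));
    }
}
```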
5 Conclusion
This paper describes a freely available crowd-sourcing application designed for distributing manual annotation tasks. The results it produced in a causative relation annotation exercise were inferior to those of comparable exercises described in the research literature; however, this may have been due to the nature of the task rather than the distributed annotation. We intend to increase the number of annotators to improve the quality of the annotations. We also plan to release the causative relation annotation data to the community as a benchmark for evaluating causative relation extraction strategies. In addition, we will donate the tool to the community in the hope that it can be improved and assist with manual annotation tasks.

This work was supported by FAPESP grants 11/20451-1, 2011/22749-8 and 2013/12191-5, as well as by the CAPES funding agency.
References

1. Bermingham, A., Smeaton, A.F.: A study of inter-annotator agreement for opinion retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, pp. 784–785 (2009)
2. Bird, S.: NLTK: The Natural Language Toolkit. In: COLING/ACL 2006, pp. 69–72. Association for Computational Linguistics (2006)
3. Brants, T.: Inter-annotator agreement for a German newspaper corpus. In: Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000 (2000)
4. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
5. Hsueh, P.-Y., Melville, P., Sindhwani, V.: Data quality from crowdsourcing: A study of annotation selection criteria. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pp. 27–35. Association for Computational Linguistics, Stroudsburg, PA, USA (2009)
6. Malkowski, S., Hedwig, M., Pu, C.: Experimental evaluation of n-tier systems: Observation and analysis of multi-bottlenecks. In: IEEE International Symposium on Workload Characterization, IISWC 2009, pp. 118–127. IEEE (2009)
7. Mason, W., Watts, D.J.: Financial incentives and the "performance of crowds". SIGKDD Explorations Newsletter 11(2), 100–108 (2010)
8. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
9. Ng, H.T., Yong, C., Foo, K.S.: A case study on inter-annotator agreement for word sense disambiguation. In: Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources, SIGLEX 1999, College Park, Maryland (1999)
10. Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing? A study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the International Conference on Multimedia Information Retrieval, MIR 2010, pp. 557–566. ACM, New York (2010)
11. Passonneau, R., Habash, N.Y., Rambow, O.: Inter-annotator agreement on a multilingual semantic annotation task. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006 (2006)
12. Sabou, M., Bontcheva, K., Scharl, A.: Crowdsourcing research opportunities: Lessons from natural language processing. In: Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies, I-KNOW 2012, pp. 17:1–17:8. ACM (2012)
13. Wang, A., Hoang, V.C.D., Kan, M.-Y.: Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation 47(1), 9–31 (2013)