Human Language Technologies – The Baltic Perspective A. Utka et al. (Eds.) © 2014 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License. doi:10.3233/978-1-61499-442-8-163
Error-Annotated Corpus of Latvian

Daiga DEKSNE a,1 and Inguna SKADIŅA a
a Tilde SIA, Latvia
Abstract. This paper reports on the development of the annotated Latvian language error corpus designed for grammar checker development and evaluation. We describe the error classification system introduced for this purpose, the annotation process, and guidelines. Two corpora (the corpus of student papers and the balanced text corpus) consisting of a total of 20,877 sentences have been created and annotated. A general characterisation of the corpora and a summary of the annotation results are presented.

Keywords. Error classification, corpus annotation, error annotated corpus, grammar checking, Latvian language
Introduction

Grammar errors in texts are made not only by people who have recently started to learn a language, but also by native speakers. This is especially true for morphologically and syntactically complex languages like Latvian. Spelling and grammar checkers are developed to support users during the writing process. An indispensable part of the development of a grammar checker is the analysis of errors made by users.

There are many language learner corpora with error annotations, for example, The NUS Corpus of Learner English [1], The Cambridge Learner Corpus [2], and The International Corpus of Learner English [3] (see footnote 2). However, a standardised scheme for error annotation does not exist, although attempts to create one have been made. The FLAG project [4] used a different source for its corpus: email messages posted to USENET groups. Its goal was to develop a controlled language and grammar checking application. Unfortunately, studies of errors made by native speakers are less common than research on mistakes made by language learners.

Error-annotated corpora can be used for different natural language processing tasks, including grammar error correction systems. Such a corpus is important for the initial analysis of common error types, for tuning the rules of a grammar checker, and for measuring its performance.

In this paper, we describe the creation of an error-annotated corpus for grammar checker assessment (see footnote 3). We present the establishment of the error classification, the collected corpora, the annotation process, and an assessment of the results.

1 Corresponding Author: Daiga Deksne; E-mail: [email protected]
2 See also The Learner Corpus Bibliography at http://www.uclouvain.be/en-cecl-lcbiblio.html
3 The grammar checker is described in [5].
1. Classification of Errors

Latvian language norms are described in different grammar textbooks [6]. There are also various studies that describe common errors in the language [7]. While exploring this information, we defined 22 types of errors that were used for corpus annotation. These error types can be grouped into five larger groups:

• Formatting errors involve incorrect usage of space, incorrect abbreviation format, incorrect data format, or an opening bracket which is not followed by a closing bracket in the scope of the same sentence.
• Orthography errors include wrong usage of capital letters in the names of organisations/institutions, errors in words that should be written together or separately, and short/long vowel errors.
• Morphology and syntax errors include case/number/gender disagreement between an attribute and its nominal, person/number/gender disagreement between a subject and predicate, and the incorrect usage of indefinite/definite endings for adjectives depending on the context. They also include more specific cases such as an incorrect noun case when a verb is used in the debitive mood or when a negation of the auxiliary verb ‘būt’ (to be) is used.
• Punctuation errors occur when a comma is missing in a compound sentence or a complex sentence, a participial clause/insertion/grouping is not separated by a comma/commas from the rest of the sentence, equal parts of a sentence are not separated by a comma, or a dash is missing. The opposite kind of error is the unnecessary use of commas.
• Style errors involve calque usage or the usage of undesirable lexical items.
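
As an illustration only, the five groups listed above could be encoded as a small lookup structure. This is a minimal sketch: the names `ErrorGroup`, `EXAMPLE_SUBTYPES`, and `group_of` are hypothetical, and the subtype labels are paraphrased from the descriptions above; they are not the authors' actual list of 22 error types, which is not reproduced in the paper.

```python
from enum import Enum


class ErrorGroup(Enum):
    """The five larger error groups named in the classification."""
    FORMATTING = "Formatting errors"
    ORTHOGRAPHY = "Orthography errors"
    MORPHOLOGY_SYNTAX = "Morphology and syntax errors"
    PUNCTUATION = "Punctuation errors"
    STYLE = "Style errors"


# Illustrative subtype labels paraphrased from the group descriptions.
# Assumption: these are NOT the official 22 error types of the corpus.
EXAMPLE_SUBTYPES = {
    ErrorGroup.FORMATTING: ["incorrect spacing", "incorrect abbreviation format",
                            "unclosed bracket"],
    ErrorGroup.ORTHOGRAPHY: ["capitalisation of organisation name",
                             "written together or separately",
                             "short/long vowel"],
    ErrorGroup.MORPHOLOGY_SYNTAX: ["attribute-nominal disagreement",
                                   "subject-predicate disagreement",
                                   "noun case with debitive mood"],
    ErrorGroup.PUNCTUATION: ["missing comma", "missing dash", "unnecessary comma"],
    ErrorGroup.STYLE: ["calque", "undesirable lexical item"],
}


def group_of(subtype: str) -> ErrorGroup:
    """Return the larger group that an illustrative subtype label belongs to."""
    for group, subtypes in EXAMPLE_SUBTYPES.items():
        if subtype in subtypes:
            return group
    raise KeyError(f"unknown subtype: {subtype}")
```

For instance, `group_of("unclosed bracket")` returns `ErrorGroup.FORMATTING`. Keeping the subtype-to-group mapping in one place makes it straightforward to aggregate fine-grained annotations into the five groups used for reporting.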
2. Corpus Creation Process

2.1. Corpora and Data

Two corpora were built: a student paper corpus and a balanced corpus. The student paper corpus was constructed from scientific papers and essays of high school students and from academic papers of IT students. The corpus was split into two parts. One part (the development corpus) was used for error rule creation, while the other part (the test corpus) was used for the assessment of the grammar checker. Each part contains 5,157 sentences (10,314 sentences in total).

The balanced test corpus was created with the aim of assessing the quality of the grammar checker. In this corpus, we wanted to represent the diversity of texts that a potential user of the grammar checking system might wish to check. Since the level of Latvian language proficiency of the potential user is not known, we included both texts written by native speakers and texts written by non-native speakers. The corpus contains similar proportions of the following text types: blogs, news, high school student papers, student academic papers, project drafts, legal texts,
texts from non-native Latvian speakers, e-mails, texts from various domains, and young writers’ texts. The corpus contains a total of 10,563 sentences.

2.2. Annotation Process and Difficulties

The corpora were manually annotated by two annotators. Since this is a time-consuming process, each corpus was annotated by only one annotator. For convenience, the annotators used a Microsoft Excel file containing four columns: Error type, Corrected Sentence, Initial Sentence, and Comments. For every incorrect sentence, the annotators were asked:

• to fix all mistakes and write the correct sentence;
• to assign the appropriate error type from the predefined error list;
• if the sentence contained several errors, to choose the most common error and to write a comment about the less severe errors in the Comments field.
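
The four-column annotation layout described above can be sketched as a simple record type. This is a hedged illustration, not the authors' tooling: it assumes the spreadsheet has been exported to CSV with the four column names as headers, and that rows with an empty Error type stand for sentences judged correct. The names `AnnotationRow` and `read_rows` and the sample sentences are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotationRow:
    """One row of the annotation sheet (four columns, as described above)."""
    error_type: str
    corrected_sentence: str
    initial_sentence: str
    comments: str

    @property
    def is_incorrect(self) -> bool:
        # Assumption: an empty error type means the sentence was judged correct.
        return bool(self.error_type.strip())


def read_rows(csv_text: str) -> List[AnnotationRow]:
    """Parse CSV text whose header matches the four column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        AnnotationRow(
            error_type=record["Error type"],
            corrected_sentence=record["Corrected Sentence"],
            initial_sentence=record["Initial Sentence"],
            comments=record["Comments"],
        )
        for record in reader
    ]


# Hypothetical sample data, not taken from the corpus.
SAMPLE = '''Error type,Corrected Sentence,Initial Sentence,Comments
comma error in a subordinate clause,"Es zinu, ka viņš nāks.",Es zinu ka viņš nāks.,
,,Šodien ir jauka diena.,
'''

rows = read_rows(SAMPLE)
incorrect = [row for row in rows if row.is_incorrect]
```

Because only one error type is stored per sentence, any further errors survive only as free text in the Comments field, which is worth keeping in mind when counting errors per category.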
Although the annotation task seemed simple, the annotators faced several difficulties during the annotation process. To begin with, if the sentence contained several errors, it was difficult to decide which was the most common error. Also, in some cases, several error types could be assigned to the same error. For example, if a comma is missing after a subordinate clause that contains a participial clause, it is correct to mark it either as ‘a comma error in a subordinate clause’ or as ‘a comma error in a participial clause’. If there was an error that was not listed in the predefined list of error types, the annotators assigned the type ‘unknown error’ and wrote an explanation in the Comments field. The annotated corpora were then transformed into a Gold Standard format and used for the assessment of the grammar checker (more information in [5]).
3. Results and Discussion

Table 1 presents the distribution of errors in our corpora. The percentage of incorrect sentences was 38.38% in the student development corpus, 44.48% in the student test corpus, and 39.95% in the balanced corpus.

Table 1. Number of errors (grouped by main error types) in different corpora.

Error type                      Student (dev.)   Student (test)   Balanced
Formatting errors                          146              149        789
Orthography errors                         224              393        975
Morphology and syntax errors               462              459        434
Punctuation errors                         768              813      1,002
Style errors                               208              302        399
Unspecified errors                         171              178        621
Total                                    1,979            2,294      4,220
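
The reported percentages of incorrect sentences can be reproduced from the totals in Table 1 and the corpus sizes given in Section 2.1. The sketch below assumes that each incorrect sentence contributes exactly one entry to the table; this is an inference from the one-error-type-per-sentence guideline rather than something stated explicitly. The names `SENTENCES`, `ERRORS`, `incorrect_share`, and `top_groups` are illustrative.

```python
# Corpus sizes in sentences (Section 2.1).
SENTENCES = {"Student (dev.)": 5157, "Student (test)": 5157, "Balanced": 10563}

# Error counts per group, copied from Table 1.
ERRORS = {
    "Student (dev.)": {"Formatting": 146, "Orthography": 224,
                       "Morphology and syntax": 462, "Punctuation": 768,
                       "Style": 208, "Unspecified": 171},
    "Student (test)": {"Formatting": 149, "Orthography": 393,
                       "Morphology and syntax": 459, "Punctuation": 813,
                       "Style": 302, "Unspecified": 178},
    "Balanced": {"Formatting": 789, "Orthography": 975,
                 "Morphology and syntax": 434, "Punctuation": 1002,
                 "Style": 399, "Unspecified": 621},
}


def incorrect_share(corpus: str) -> float:
    """Percentage of incorrect sentences, assuming one table entry per sentence."""
    total_errors = sum(ERRORS[corpus].values())
    return round(100 * total_errors / SENTENCES[corpus], 2)


def top_groups(corpus: str, k: int = 3) -> list:
    """The k most frequent error groups in a corpus, most frequent first."""
    counts = ERRORS[corpus]
    return sorted(counts, key=counts.get, reverse=True)[:k]


for name in SENTENCES:
    print(name, incorrect_share(name), top_groups(name))
```

Under that assumption the computed shares match the reported 38.38%, 44.48%, and 39.95%, and the ranking shows punctuation errors as the most frequent group in all three corpora.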
The student development and test corpora were gathered from similar sources, but annotated by two different annotators. Although both corpora are the same size, the total number of errors found differs (1,979 vs. 2,294). Students most often make punctuation errors and morphology and syntax errors. In the balanced corpus, the most frequent are punctuation errors, orthography errors, and formatting errors.

We have investigated the sentences and annotator comments for the unspecified error category more closely. There are some common, previously unnoticed error types which should be appended to the predefined list of error types. Mostly, specific formatting errors and punctuation errors are marked, for example, “extra dash”, “hyphen used where dash is required”, “verb used with a wrong prefix”, and others.

As the sentences of a corpus are stored in plain text format, we do not have information about the original formatting. About 100 sentences have the comment “the title or foreign name must be in quotation marks or written in italic”. Since we do not know if italics were used in the original text, these sentences should not be marked as incorrect.
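
One way to act on this observation is to clear the error label of sentences whose only recorded problem depends on formatting that the plain-text corpus cannot preserve. The following is a minimal sketch under stated assumptions: rows are represented as dictionaries with hypothetical keys `error_type` and `comments`, the label ‘unknown error’ is the one the annotators used for unlisted errors, and an empty label is taken to mean "not marked as incorrect". The function name `clear_unverifiable` and the sample rows are illustrative.

```python
from typing import Dict, List

# Wording taken from the annotator comment quoted above.
UNVERIFIABLE_NOTE = "must be in quotation marks or written in italic"


def clear_unverifiable(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the rows, clearing labels that rely on lost formatting."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if (row["error_type"] == "unknown error"
                and UNVERIFIABLE_NOTE in row["comments"]):
            # Assumption: an empty label means "not marked as incorrect".
            row["error_type"] = ""
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows, not taken from the corpus.
rows = [
    {"error_type": "unknown error",
     "comments": "the title or foreign name must be in quotation marks "
                 "or written in italic"},
    {"error_type": "unknown error", "comments": "extra dash"},
]
cleaned = clear_unverifiable(rows)
```

Matching on the comment text is fragile if annotators phrase the note differently, so a manual check of the affected sentences would still be advisable.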
4. Conclusion

This paper presents work on the Latvian error-annotated corpus for grammar checker development. The total size of the corpus is 20,877 sentences. The corpus consists of two parts (a corpus of student papers and a balanced corpus). Both parts cover systematic errors of Latvian speakers. Punctuation errors are the most common error type in both parts of the corpus. Orthography and syntax errors are also very common.
Acknowledgements

The research leading to these results has received funding from the research project “Information and Communication Technology Competence Centre” of EU Structural funds, contract nr. L-KC-11-0003 signed between the ICT Competence Centre (www.itkc.lv) and the Investment and Development Agency of Latvia, Research No. 2.8 “Research of Automatic Methods for Text Structural Analysis”.
References

[1] D. Dahlmeier, H.T. Ng, S.M. Wu, Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In 8th Workshop on Innovative Use of NLP for Building Educational Applications, (2013), 22–31.
[2] D. Nicholls, The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In D. Archer, P. Rayson, A. Wilson and T. McEnery (eds.), Proceedings of the Corpus Linguistics 2003 Conference (2003), 572–581.
[3] S. Granger, The International Corpus of Learner English, The European English Messenger (1993) Vol. 2(1), 34.
[4] M. Becker, A. Bredenkamp, B. Crysmann, J. Klein, Annotation of error types for German newsgroup corpus. In A. Abeillé (ed.), Treebanks. Building and Using Parsed Corpora. Text, Speech And Language Technology, (2003), Vol. 20, 89–100.
[5] D. Deksne, I. Skadiņa and R. Skadiņš, Extended CFG formalism for grammar checker and parser development. Proceedings of the 15th International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, (2014), Vol. 8403, 237–249.
[6] Latviešu valodas gramatika. Rīga: LU Latviešu valodas institūts, 2013.
[7] I. Freimane, Valodas kultūra teorētiskā skatījumā. Rīga: Zvaigzne, 1993.