Building a learner corpus

Jirka Hana · Alexandr Rosen · Barbora Štindlová · Jan Štěpánek
PREPRINT VERSION of 20 December 2013

Abstract The need for data about the acquisition of Czech by non-native learners prompted the compilation of the first learner corpus of Czech. After introducing its basic design and parameters, including a multi-tier manual annotation scheme and error taxonomy, we focus on the more technical aspects: the transcription of handwritten source texts, the process of annotation, and options for exploiting the result, together with the tools used for these tasks and the decisions behind their choice. To support or even replace manual annotation, we assign some error tags automatically and use automatic annotation tools (a tagger and a spell checker).

Keywords learner corpus · error annotation · Czech
Jirka Hana, Charles University, MFF, Prague, Czech Rep. E-mail: [email protected]
Alexandr Rosen, Charles University, FF, Prague, Czech Rep. E-mail: [email protected]
Barbora Štindlová, Technical University, Liberec, Czech Rep. E-mail: [email protected]
Jan Štěpánek, Charles University, MFF, Prague, Czech Rep. E-mail: [email protected]

1 Introduction

We describe the process, annotation scheme, tools and technical decisions behind the annotation of a learner corpus of Czech. The corpus is compiled mainly from texts written by students of Czech as a second or foreign language and by near-native young speakers of Czech with a Romani (Roma, Gypsy) background. We discuss the whole processing workflow: starting from the transcription of hand-written texts,
conversion into the annotation format, the error annotation itself, and post-processing. In this paper, we focus mainly on the computational and organizational issues; for a detailed discussion of the linguistic aspects and an evaluation of the annotation scheme and taxonomy see esp. Rosen et al (2013) or references cited therein, e.g. Hana et al (2010); Jelínek et al (2012); Štindlová et al (2013). After a brief introduction to the Czech learner corpus project in §2, followed by a sketch of the annotation scheme in §3, we present a description of the workflow in §4.
2 A learner corpus of Czech

Texts produced by learners of a second or foreign language are a precious source of linguistic evidence for experts in language acquisition, teachers, authors of didactic tools, and students themselves. A corpus of such texts can be used to compare different varieties of non-native language, or non-native and native language against the background of traditional native-language corpora. An error-tagged corpus can also be subjected to computer-aided error analysis as a means to explore the target language and to test hypotheses about the functioning of L2 grammar, e.g. in the domain of verbal tenses (Granger, 1999), lexical errors (Leńko-Szymańska, 2004) or phrasal verbs (Waibel, 2008).

The learner corpus of Czech as a Second Language (CzeSL) is built as a part of a larger project, the Acquisition Corpora of Czech (AKCES), a research programme pursued since 2005 (Šebesta, 2010). In addition to CzeSL, AKCES has a written (SKRIPT) and a spoken (SCHOLA) part, collected from native Czech pupils, and a part collected from pupils with a Romani background (ROMi). The methods and tools used for collecting, transcribing, annotating and managing the texts are the same at least for CzeSL and ROMi. However, here we deal only with the written texts – the spoken texts are still to be annotated.

The written parts of CzeSL and ROMi have now reached the size of 2.2 million word tokens. Short essays written by non-native learners of Czech and students with a Romani background account for 1.2 million and 0.5 million words, respectively, while theses written in Czech by foreign students account for 0.5 million words. A part of the hand-written essays, amounting to about 0.4 million words, is error-annotated manually. At the time of writing, the anonymized transcripts and theses are available via a standard concordancer as a part of the Czech National Corpus, or as full texts under the Creative Commons license. The error-annotated parts, with their complex markup, due to be published in full in a similar way, are accessible online via a purpose-built search tool.1 All these resources will soon be supplemented by rich metadata and more detailed, automatically assigned annotation.

CzeSL contains data from native speakers of the following languages: (1) Slavic, (2) other Indo-European, (3) non-Indo-European. The hand-written texts cover all language levels, from real beginners (A1) to advanced learners (B2 and higher). Each text is equipped with metadata records, some of them relating to the respondent (such as age, gender, first language, proficiency in Czech, knowledge of other languages, duration and conditions of language acquisition), while others specify the character of the text and the circumstances of its production (availability of reference tools, type of elicitation, temporal and size restrictions, etc.).

1 See http://utkl.ff.cuni.cz/learncorp/ for links and more details.
3 Annotation scheme

Texts produced by non-native speakers can be annotated in a way similar to standard corpora, e.g., by POS tags, syntactic functions or syntactic structure,2 but also corrected ('emended') and labelled by error categories.3 The optimal error annotation strategy is determined by the goals of the project and by the type of the language. Single-level schemes could be used, e.g., for a specific purpose or for a language without an elaborate inflection system. However, our corpus should be open to multiple research goals and handle a highly inflectional language. This is why it is based on a multi-level annotation scheme, allowing for: (i) registering successive emendations and errors spanning multiple (potentially discontinuous) forms, and (ii) maintaining links between the original and the emended form even when the word order changes, or in cases of dropped or added expressions.

We adopted a solution with two levels of annotation.4 First, we correct deviant forms detectable out of context (e.g. misspellings, wrong inflection). Only then do we correct forms which seem correct in isolation but are wrong in context (e.g. errors in agreement). The entire scheme consists of the transcript, its tokenized form and the two annotation tiers:

– Tier -1 – Anonymized transcript of the hand-written original in the HTML format, encoding self-corrections etc.
– Tier 0 – Tokenized text
– Tier 1 – Correction of orthographical and morphological errors in isolated forms; the result is a string consisting of correct Czech forms, even though the sentence may not be correct as a whole.
– Tier 2 – All other types of errors (valency, agreement, word order, etc.)

The correspondences between successively emended forms are explicitly expressed (see Fig. 1). Nodes at neighboring tiers are usually linked 1:1, but words can be joined (kdy by in Fig. 1), split, deleted or added. These relations can interlink any number of potentially non-contiguous words across the neighboring tiers. Multiple words can thus be identified as a single unit, while any of the participating word forms can retain their 1:1 links with their counterparts at other tiers.

2 See, e.g., Díaz-Negrillo et al (2010); Meurers (2009); Dickinson and Ragheb (2009); Hirschmann et al (2007).
3 Only a few learner corpora use error tags to classify errors, e.g. Fitzpatrick and Seegmiller (2001); Granger (2003); Abuhakema et al (2009). For an overview see, e.g., Štindlová (2011) or https://www.uclouvain.be/en-cecl-lcworld.html.
4 Two annotation layers, each with error labels belonging to categories of several types, are also used by Dickinson and Ledbetter (2012) in an annotation scheme for Hungarian. However, the two layers serve a slightly different purpose, namely to distinguish between corrections of errors detectable directly in the learner text and adjustments of the text needed because of those corrections.
Myslím, že kdybych byl se svým dítětem, . . .
think.SG1 that if.SG1 was.MASC.SG with my child
'I think that if I were with my child, . . . '

Fig. 1 Example of the three-level error annotation scheme
[figure: the three annotation tiers for this sentence, with linked and error-labelled nodes; only the glossed example is reproduced here]
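To make the link structure concrete, here is a minimal sketch of one possible in-memory representation in Python. The class names and the placement of the tags on the kdy by link are our illustration, not the project's actual data structures (the annotation editor itself is implemented in Java):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A word form on one annotation tier (0, 1 or 2)."""
    tier: int
    form: str

@dataclass
class Edge:
    """Connects m nodes on tier t to n nodes on tier t+1; may carry
    error labels and references (e.g. to the source of agreement)."""
    lower: List[Node]
    higher: List[Node]
    errors: List[str] = field(default_factory=list)
    refs: List[Node] = field(default_factory=list)

# The 'kdy by' -> 'kdybych' join from Fig. 1 (tag placement illustrative):
myslim = Node(2, "myslím")
join = Edge(lower=[Node(1, "kdy"), Node(1, "by")],
            higher=[Node(2, "kdybych")],
            errors=["wbdOther", "agr"],
            refs=[myslim])
```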
The type of error can be specified as a label at the link connecting the incorrect form at a lower tier with its emended form at a higher tier. Error labels used in Fig. 1 include incorInfl or incorBase for morphological errors in inflectional endings or stems, stylColl as a stylistic marker (here for a colloquial form), wbdOther as a word boundary error (other than a wrongly separated prefix or a preposition without a following space), and agr as an error in agreement. Some errors may additionally require a link pointing to a form specifying proper agreement categories or valency requirements (such as myslím in our example).

The taxonomy of errors is based primarily on linguistic categories. Currently, there are 22 manually assigned error tags (8 tags for Tier 1, 11 tags for Tier 2, and 3 tags for either tier). Some are automatically subspecified (e.g., whether an error in a complex verb form involves an auxiliary or a modal verb) or determined from the correction (errors in word order, missing or redundant items), resulting in 7 additional tags. All of the above types of errors are complemented by a classification of superficial alterations of the source text, such as the indication of missing, redundant, faulty or incorrectly ordered characters. There are 50 error tags of this type, all of them assigned automatically on the basis of the 'linguistic' error type and the corrected form (a sketch of the idea is given at the end of this section). In addition to the form of the word, each node may be assigned information such as lemma, morphosyntactic category or syntactic function.

Each of the choices made in the design of the annotation scheme is a compromise between its feasibility in a practical large-scale annotation process and the requirement of a detailed and complex analysis. To give an example, more annotation tiers could be provided, each with a linguistic interpretation. A tier for errors in graphemics could be followed by tiers dedicated to morphemics, morphosyntax, syntax, lexical phenomena, semantics and pragmatics. More realistically, there could be a tier for errors in graphemics and morphemics, another for errors in morphosyntax (agreement,
government) and one more for everything else, including word order and phraseology. In a world of unlimited annotator time and experience, this would be the optimal solution. In the real world, the choice must be different. Ours, based on two annotation tiers distinguished by largely formal criteria, has proved its feasibility while still being useful and linguistically relevant.

On the other hand, some distinctions in the categories of the error taxonomy, and the instructions for their use, are not accepted by all annotators as appropriate and/or well defined. Such cases include the annotation of constructions involving function words (prepositions, auxiliaries, conjunctions) or colloquial expressions. Some of these error categories seem to be concerned more with the analysis of standard language than with learner language. An automatic tool could be used in the future to relieve the annotator of some tedious and complex tasks, e.g. by replacing the links that point to the source of an error, which in effect simulate syntactic relations, with a proper syntactic parse.
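The automatic assignment of the character-level tags can be approximated by aligning a source form with its correction. Here is a minimal sketch in Python, assuming invented tag names (missingChar etc.) and ignoring transposed characters; the project's actual rules also take the manually assigned 'linguistic' error type into account:

```python
import difflib

def character_alterations(source: str, corrected: str):
    """Classify superficial differences between a learner form and its
    correction. A sketch: tag names are illustrative, and incorrectly
    ordered (transposed) characters would need extra handling."""
    tags = []
    sm = difflib.SequenceMatcher(a=source, b=corrected)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":
            tags.append(("missingChar", corrected[j1:j2]))
        elif op == "delete":
            tags.append(("redundantChar", source[i1:i2]))
        elif op == "replace":
            tags.append(("faultyChar", source[i1:i2], corrected[j1:j2]))
    return tags

print(character_alterations("dítetem", "dítětem"))
# -> [('faultyChar', 'e', 'ě')]
```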
4 Tasks and tools in the workflow

The whole annotation process proceeds in the following steps:

1. Acquisition: The original hand-written texts are collected and scanned. Data about the author and the circumstances of the elicitation of the text are supplied.
2. Transcription: The scan is manually anonymized and transcribed into an HTML format (see §4.3).
3. Proofreading: Each transcription is checked by a supervisor.
4. Conversion to PML (an XML format, see §4.2): The transcribed HTML text is tokenized and the corresponding PML-encoded Tier 0 is generated, together with a default Tier 1 and an empty Tier 2 (see the sketch below). The conversion includes basic checks for incorrect or suspicious transcription.
5. Error annotation: Errors in the text are manually corrected and classified; this is done independently by two annotators.
6. Review: Each annotation is reviewed by the appropriate supervisor, who can approve it, modify it, or – in cases when an inexperienced annotator failed to do a proper job or clearly misinterpreted the annotation rules – return it to the annotator with comments for revision.
7. Adjudication: Each doubly annotated text is checked and adjudicated, resulting in a single annotated version.
8. Postprocessing: Error information that can be inferred automatically is added. The corrected text is lemmatized and tagged with morphosyntactic information.

The storage of the documents and their flow through this process are managed by Speed, a purpose-built text management system (see §4.1). Conversion from HTML, annotation, supervision and adjudication are done with the help of feat, an annotation editor designed as a part of the project (see §4.4).
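The conversion in step 4 amounts essentially to tokenization plus the generation of a default Tier 1 and an empty Tier 2. A minimal sketch of the idea in Python (the flat-list representation and the tokenizer are ours for illustration; the actual converter emits PML, applies Czech-specific tokenization and runs the transcription checks):

```python
import re

def tokenize(text: str):
    """Rough tokenizer: words and punctuation. Illustrative only;
    the real conversion follows Czech-specific tokenization rules."""
    return re.findall(r"\w+|[^\w\s]", text)

def default_tiers(transcript: str):
    """Tier 0 = tokens; Tier 1 starts as an identical copy with 1:1
    links; Tier 2 starts empty and is filled during annotation."""
    tier0 = tokenize(transcript)
    tier1 = list(tier0)                                # no corrections yet
    links01 = [([i], [i]) for i in range(len(tier0))]  # default 1:1 links
    return tier0, tier1, links01, [], []               # tier2, links12 empty
```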
4.1 Text management

To coordinate the work of a large project team and to control the passage of texts along the path from the scanned manuscript to the annotated and adjudicated result, all versions of every document throughout the whole process are stored and maintained by Speed, a text management system, another tool developed as a part of the project. The system distributes documents to transcribers, annotators, coordinators and adjudicators for processing and accepts their results, monitoring their workload and generating error-rate statistics on demand. Coordinators can thus manage the team of 30 annotators efficiently, without wasting time on administrative tasks.

User privileges are applied both horizontally and vertically: each user is assigned her own views of the data and the filters associated with those views. As a result, an annotator is prevented from seeing an interpretation used by a colleague. The system is designed on top of a general workflow engine, reusable for similar applications, and it is linked with the feat annotation tool using web services. The users receive tasks and deliver results without leaving the environment of the application. This includes quality checking: through the same channel, the annotator may receive an inadequately annotated text back for revision, with comments by the supervisor.
4.2 Data format

To encode the layered annotation, we have designed an annotation schema in the Prague Markup Language (PML),5 a generic XML-based data format intended to cope with rich linguistic annotation. Each of the higher tiers contains information about the words on that tier, about the corrected errors and about relations to the tokens on the lower tiers.

We have also considered using a TEI format.6 However, at least from the perspective of the present project, the support for stand-off (layered) annotation offered by PML is superior to that of TEI, mainly in the availability of tools and libraries. This concerns tasks such as validation, structural parsing, corpus management and searching. While some of those libraries do exist for TEI, many would have to be developed.

To allow for data exchange, the feat editor now supports import from several formats, including EXMARaLDA (Schmidt, 2009; Schmidt et al, 2011); it also allows export, limited to the features supported by the respective format.

5 http://ufal.mff.cuni.cz/jazz/pml/
6 http://www.tei-c.org
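For illustration only, a PML-like encoding of one link from Fig. 1 might look roughly as follows; the element and attribute names are invented for this sketch and do not reproduce the actual CzeSL schema:

```xml
<!-- Invented names, not the real CzeSL schema: a Tier 2 node for
     'kdybych', linked to the Tier 1 forms it emends; the link carries
     error labels and a reference to the agreement source.
     Tag placement is illustrative. -->
<tier2>
  <w id="w2-3">
    <form>kdybych</form>
    <edge to="w1-3 w1-4">              <!-- 'kdy' + 'by' joined -->
      <error tag="wbdOther"/>
      <error tag="agr" ref="w2-1"/>    <!-- w2-1 = 'myslím' -->
    </edge>
  </w>
</tier2>
```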
4.3 Transcription of manuscripts

The hand-written documents are transcribed using off-the-shelf editors supporting HTML (e.g., Microsoft Word or OpenOffice Writer). This means that the transcribers can use a tool they are familiar with and no technical training is required. A set of codes is used to capture some properties of the manuscript, e.g. variants, illegible
strings, self-corrections – see Štindlová (2011, p. 106ff). Some of these encodings are supported via macros of the editor. The manuscript properties are recorded in order to support research on the handwriting of students with a different native writing system, or to enable multiple interpretations (the same glyph may be interpreted as i in the handwriting of one student, e of another, and a of yet another). An additional reason for collecting hand-written texts is to avoid the use of a spell checker, because the result would not reflect the student's skills. In a highly inflectional language such as Czech, deviations in spelling often reflect not only wrong graphemics but also errors in morphology. A sample text and its transcription are presented in Fig. 2.
Viktor je mladý pan z Polska Ruska. Studuje {češtinu} ve škole, protože ne umí psat a čist spravně. Bydlí na koleje vedle školy, má jednu sestru Irenu, která se učí na univerzite u profesora Smutneveselého. Bohužel, Viktor není dobrý student, protože spí na lekci, ale jeho sestra {piše všechno -> všechno piše} a vyborně rozumí českeho profesora Smutneveselého {a brzo delá domací ukol}. Večeře Irena jde na prohasku spolu z kamaradem, ale její bratr dělá nic. Jeho čeština je špatná, vím, že se vratit ve Polsko Ruskou a tam budí studovat u pomalu myt podlahy.

Kamarad Ireny je {A|a}meričan a chytry můž. On miluje Irenu a chce se vzít na ní. protože ona je hezká, taky chytra, rozumí ho a umí vyborně vařit.

Fig. 2 A sample hand-written document with its transcription
During the transcription process the texts are also anonymized: private information is replaced either by generic names (e.g. for the names of persons and towns) or by special codes (e.g. for telephone numbers). In the former case, we strive to preserve agreement features (by matching the name's gender, number and case) and some of the possible errors (e.g. capitalization and some errors in declension). We use different substitutes for declinable and non-declinable names, but we do not attempt to match the declension class – e.g. all declined female given names are substituted by an appropriate form of the name Eva, even if the original name (such as Lucie) has a different set of declension endings.

The decision to use HTML produced by an off-the-shelf editor was made intentionally, to minimize training time and not to limit the pool of potential transcribers (it is hard enough to find people who know the rules of handwriting of speakers of language X; it is even harder to find such experts who are also able to transcribe into XML). In retrospect, however, we feel this was not the right decision, because the effort needed to review the transcripts clearly outweighs the benefit of using a widely known tool. First, it is really important to minimize the occurrence of errors in transcription, as they influence all the subsequent annotation steps, and it is easier to enforce formal correctness in an XML editor such as XMLmind than in an HTML editor. Second, the ability to learn to use an XML editor is actually a good indication of other abilities that are important in the transcription process, for example the ability to follow the formal rules of a transcription manual.
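The case-preserving substitution can be sketched as follows; the paradigm of Eva is real Czech, but the function interface, the case labels and the capitalization handling are our invention, much simpler than the actual procedure:

```python
# Case-preserving substitution of declined female given names by forms
# of 'Eva'. A sketch: the paradigm is real Czech, but the interface and
# the capitalization rule are invented; the real tool's rules are richer.
EVA_PARADIGM = {
    "nom": "Eva", "gen": "Evy", "dat": "Evě", "acc": "Evu",
    "voc": "Evo", "loc": "Evě", "ins": "Evou",
}

def anonymize_female_name(form: str, case: str) -> str:
    """Replace a declined female given name, preserving its case and
    a possible capitalization error in the original."""
    out = EVA_PARADIGM[case]
    if form[:1].islower():     # keep the learner's lower-case initial
        out = out.lower()
    return out

print(anonymize_female_name("Lucii", "acc"))  # -> 'Evu'
print(anonymize_female_name("lucie", "nom"))  # -> 'eva'
```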
4.4 The annotation editor feat

The manual portion of the error annotation is supported by feat,7 an annotation tool we have developed. The annotator corrects the text on the appropriate tiers, modifies relations between elements on adjacent tiers (by default all relations are 1:1) and annotates relations with error tags as needed. The context of the annotated text is shown both as a transcribed HTML document and – optionally – as a scan of the original document.

Both the editor and the data format accommodate various approaches to the process of multi-tier annotation. Some annotators prefer to work by paragraphs, annotating the whole paragraph on Tier 1 first and then on Tier 2, while others work by sentences, annotating a sentence on both tiers before moving to the next one.

The tool is written in Java on top of the NetBeans platform.8 Fig. 3 shows the tool's user interface. It automatically synchronizes with Speed, the text management system: the user – whether an annotator, supervisor or adjudicator – receives the assigned documents in her Inbox, processes them and moves them to her Outbox. The tool is also used for (manual) adjudication: two documents are displayed in parallel, differences in their annotation are highlighted and the preferred option can be selected.

Fig. 3 The user interface of feat

7 http://purl.org/net/feat/
8 http://platform.netbeans.org/
4.5 Towards automatic annotation

Once the text is manually corrected and annotated, additional annotation can be provided automatically (see Jelínek et al (2012) for more details):
1. A tagger/lemmatizer, such as the one described by Spoustová et al (2007), can be applied to the corrected text with an error rate similar to that for standard texts. The resulting annotation (morphosyntactic tags and lemmas) can then be projected onto the original forms.
2. Some manually assigned error tags can be specified in more detail using formal rules. For example, manually marked errors in complex verb forms on Tier 2 are further automatically specified as errors in analytical verb forms (cvf), modal verbs (mod), or verbo-nominal predicates, passive and resultative forms (vnp). Rules for other tags can be completely formalized and the tags can be assigned fully automatically (see the sketch below).

The tools for extending manual annotation can also be used to check the quality of the manual annotation, especially to identify tags that are probably missing or incorrect.

Some options are available even for texts without corrections and/or error annotation. So far, we have experimented with Korektor (Richter, 2010), a spell checker that has some functionalities of a grammar checker, using a combination of a lexicon, a morphology model and a syntax model. The experimental results were good enough to substantiate the decision to process all transcribed texts in the corpus in this way.9 This step is followed by the application of a standard tagger/lemmatizer to both the original uncorrected input and its automatically corrected version.

9 The spell checker matched human corrections at Tier 1 with an accuracy of 74%. Only forms unrecognized by a morphological analyzer were considered in the test. See Rosen et al (2013) for details.
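The rule-based subspecification in item 2 above can be pictured as follows. This is a sketch only: the lemma list and the tag tests are invented stand-ins for the project's actual, tagset-based rules:

```python
# Subspecify a manually assigned error in a complex verb form (Tier 2)
# as cvf, mod or vnp. Lemma list and tag tests are illustrative only.
MODAL_LEMMAS = {"moci", "muset", "smět", "chtít", "umět"}

def subspecify_verb_error(lemmas, tags):
    """lemmas/tags describe the corrected complex verb form."""
    if any(l in MODAL_LEMMAS for l in lemmas):
        return "mod"        # a modal verb is involved
    if "být" in lemmas and any(t[:1] in ("N", "A") for t in tags):
        return "vnp"        # verbo-nominal predicate, passive, resultative
    return "cvf"            # analytical verb form (e.g. auxiliary + participle)

print(subspecify_verb_error(["být", "unavený"], ["VB", "AC"]))  # -> 'vnp'
```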
4.6 Searching

The anonymized transcripts and theses are available via a standard concordancer as a part of the Czech National Corpus,10 or as full texts under the Creative Commons license. The error-annotated parts with their complex mark-up are accessible online via SeLaQ (Second Language Query), a web-based search tool developed for the project; see Fig. 4.

Fig. 4 The user interface of SeLaQ

The user can build a query from "boxes" corresponding to nodes at different tiers. A new box is created by specifying its relation to an existing box (e.g. following/preceding, immediately following/preceding on the same tier, corresponding to a higher/lower tier node); its form, lemma, or tag can be further constrained by a condition (e.g. equal/not equal to, matching a regular expression, same as another box's). If the relation connects two tiers, the error type can also be specified. The tool is written in Perl on top of the Dancer framework11 and a PostgreSQL database.12

10 http://www.korpus.cz/english/czesl-plain.php
11 http://www.perldancer.org
12 http://www.postgresql.org
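Conceptually, a SeLaQ query is thus a set of node descriptions ("boxes") plus relations and conditions between them. The following rendering as a Python data structure is purely illustrative (the key names are invented; SeLaQ itself evaluates queries against the PostgreSQL database):

```python
# A hypothetical, simplified rendering of a SeLaQ-style query:
# find a Tier 1 word whose Tier 2 counterpart differs in form and
# whose connecting link is tagged as an agreement error.
query = {
    "boxes": {
        "a": {"tier": 1},                      # a node on Tier 1
        "b": {"tier": 2},                      # a node on Tier 2
    },
    "relations": [
        ("a", "corresponds_to", "b"),          # link across the two tiers
    ],
    "conditions": [
        ("a.form", "!=", "b.form"),            # the form was emended
        ("link(a,b).error", "==", "agr"),      # error type on the link
    ],
}
```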
5 Conclusion

We have discussed the schema and process of annotation of a learner corpus of Czech. The corpus is now available for on-line queries, both in its error-annotated part and as the complete set of transcripts without manual annotation. The latter is also available under the Creative Commons license as full texts, with the error-annotated part to follow.
In addition to using automatic tools for extending manual annotation and providing basic markup for learner texts without manual annotation, our plans include a parser to assign syntactic annotation (functions and structure).13

Our experience has confirmed that the design of the annotation scheme and the error taxonomy, as well as the choice of methods and tools, have a profound effect on the final result. While the choice of two annotation tiers seems to be an optimal strategy, the error taxonomy could be modified in response to the experience from annotating larger volumes of data and to the users' feedback. The methods and tools developed within this project are not tied to their specific use here, and we hope they will prove useful in other projects.

13 Ott and Ziai (2010) report that in texts produced by learners of German the main functor-argument relation types can generally be identified with precision and recall in the area of 80–90%. This is an encouraging result, but the success will necessarily depend on the proficiency level of the learners.

Acknowledgements The corpus was one of the tasks of the project Innovation of Education in the Field of Czech as a Second Language (project no. CZ.1.07/2.2.00/07.0259), a part of the operational programme Education for Competitiveness, funded by the European Structural Funds (ESF) and the Czech government. The tools and data format development were partially funded by grants no. P406/10/P328 and P406/2010/0875 of the Grant Agency of the Czech Republic. This work is also partially supported within the programme Large Research, Development and Innovation Infrastructures of the Czech Ministry of Education, Youth and Sports, the project 'The Czech National Corpus', no. LM2011023.
References

Abuhakema G, Feldman A, Fitzpatrick E (2009) ARIDA: An Arabic interlanguage database and its applications: A pilot study. Journal of the National Council of Less Commonly Taught Languages (NCOLCTL) 7:161–184

Díaz-Negrillo A, Meurers D, Valera S, Wunsch H (2010) Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36(1–2):139–154, URL http://purl.org/dm/papers/diaz-negrillo-et-al-09.html, Special Issue on Corpus Linguistics for Teaching and Learning. In Honour of John Sinclair

Dickinson M, Ragheb M (2009) Dependency annotation for learner corpora. In: Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT-8), Milan, Italy, URL http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-ragheb09.html

Dickinson M, Ledbetter S (2012) Annotating errors in a Hungarian learner corpus. In: Proceedings of the 8th Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, URL http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-ledbetter12.html

Fitzpatrick E, Seegmiller S (2001) The Montclair electronic language learner database. In: Proceedings of the International Conference on Computing and Information Technologies (ICCIT)

Granger S (1999) Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus. In: Hasselgård H, Oksefjell S (eds) Out of Corpora –
Studies in Honour of Stig Johansson, Amsterdam, Atlanta, URL http://hdl.handle.net/2078.1/76322

Granger S (2003) Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal 20(3):465–480

Hana J, Rosen A, Škodová S, Štindlová B (2010) Error-tagged learner corpus of Czech. In: Proceedings of the Fourth Linguistic Annotation Workshop, Association for Computational Linguistics, Uppsala, Sweden, URL http://utkl.ff.cuni.cz/~rosen/public/hanaetal_law2010.pdf

Hirschmann H, Doolittle S, Lüdeling A (2007) Syntactic annotation of non-canonical linguistic structures. In: Proceedings of Corpus Linguistics 2007, Birmingham, URL http://ucrel.lancs.ac.uk/publications/CL2007/paper/128_Paper.pdf

Jelínek T, Štindlová B, Rosen A, Hana J (2012) Combining manual and automatic annotation of a learner corpus. In: Sojka P, Horák A, Kopeček I, Pala K (eds) Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, no. 7499 in Lecture Notes in Computer Science, Springer, pp 127–134

Leńko-Szymańska A (2004) Demonstratives as anaphora markers in advanced learners' English. In: Aston G, Bernardini S, Stewart D (eds) Corpora and Language Learners, John Benjamins, Amsterdam, pp 89–107

Meurers D (2009) On the automatic analysis of learner language: Introduction to the special issue. CALICO Journal 26(3):469–473, URL http://purl.org/dm/papers/meurers-09.html

Ott N, Ziai R (2010) Evaluating dependency parsing performance on German learner language. In: Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), NEALT Proceeding Series, URL http://drni.de/zap/ott-ziai-10

Richter M (2010) Pokročilý korektor češtiny [An advanced spell checker of Czech]. Master's thesis, Faculty of Mathematics and Physics, Charles University, Prague

Rosen A, Hana J, Štindlová B, Feldman A (2013) Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation – Special Issue on Resources and Tools for Language Learners pp 1–28, URL http://dx.doi.org/10.1007/s10579-013-9226-3

Schmidt T (2009) Creating and working with spoken language corpora in EXMARaLDA. In: Lyding V (ed) LULCL II: Lesser Used Languages and Computer Linguistics II, pp 151–164, URL http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/LULCL_II_proceedings.htm

Schmidt T, Wörner K, Hedeland H, Lehmberg T (2011) New and future developments in EXMARaLDA. In: Multilingual Resources and Multilingual Applications. Proceedings of GSCL Conference 2011 Hamburg, URL http://www.exmaralda.org/files/Exmaralda_GSCL2011.pdf

Spoustová D, Hajič J, Votrubec J, Krbec P, Květoň P (2007) The best of two worlds: Cooperation of statistical and rule-based taggers for Czech. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing 2007, Association for Computational Linguistics, Praha, Czechia, pp 67–74

Šebesta K (2010) Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]. Studie z aplikované lingvistiky/Studies in Applied Linguistics 1:11–34
Štindlová B (2011) Evaluace chybové anotace v žákovském korpusu češtiny [Evaluation of error mark-up in a learner corpus of Czech]. PhD thesis, Charles University, Faculty of Arts, Prague

Štindlová B, Škodová S, Hana J, Rosen A (2013) A learner corpus of Czech: Current state and future directions. In: Granger S, Gilquin G, Meunier F (eds) Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead, Presses Universitaires de Louvain, Louvain-la-Neuve, Corpora and Language in Use – Proceedings 1

Waibel B (2008) Phrasal verbs: German and Italian learners of English compared. VDM, Saarbrücken