GOTHENBURG MONOGRAPHS IN LINGUISTICS 24

Automatic Detection of Grammar Errors in Primary School Children’s Texts

A Finite State Approach

Sylvana Sofkova Hashemi

Doctoral Dissertation

Publicly defended in Lilla Hörsalen, Humanisten, Göteborg University, on June 7, 2003, at 10.15, for the degree of Doctor of Philosophy

Department of Linguistics, Göteborg University, Sweden

ISBN 91-973895-5-2

© 2003 Sylvana Sofkova Hashemi
Typeset by the author using LaTeX
Printed by Intellecta Docusys, Göteborg, Sweden, 2003


Abstract

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers than for adults, and the distribution of the error types is different in children’s texts. In addition, other writing errors above word level are discussed here, including punctuation errors and spelling errors that result in existing words. The method used in the implemented tool FiniteCheck involves the subtraction of finite state automata that represent grammars with varying degrees of detail, creating a machine that classifies phrases in a text that contain certain kinds of errors. The current version of the system handles errors concerning agreement in noun phrases and verb selection of finite and non-finite forms. At the lexical level, we attach all lexical tags to words and do not use a tagger, which could eliminate information in incorrect text that might be needed later to find the error. At higher levels, structural ambiguity is treated by parsing order, grammar extension and some other heuristics.

The simple finite state technique of subtraction has the advantage that the grammars one needs to write to find errors are always positive, describing the valid rules of Swedish rather than the structure of errors. The rule sets remain quite small and practically no prediction of errors is necessary. The linguistic performance of the system is promising: for the implemented error types it shows results comparable to other Swedish grammar checking tools when tested on a small adult text not previously analyzed by the system. The performance of the other Swedish tools was also tested on the children’s data collected for this study, revealing quite low recall rates.

This fact motivates the need for the adaptation of grammar checking techniques to children, whose errors differ from those found in adult writing and pose more of a challenge to current grammar checkers, which are oriented towards texts written by adults. The robustness and modularity of FiniteCheck make it possible to perform both error detection and diagnostics. Moreover, the grammars can in principle be reused for other applications that do not necessarily have anything to do with error detection, such as extracting information from a given text or even parsing.

KEYWORDS: grammar errors, spelling errors, punctuation, children’s writing, Swedish, language checking, light parsing, finite state technology


Acknowledgements

Work on this thesis would not have been possible without contributions, support and encouragement from many people. The idea of developing a writing tool for supporting children in their text production and grammar emerged from a study on how primary school children write by hand in comparison to when they use a computer. Special thanks to my colleague Torbjörn Lager, who inspired me to do this study and whose children attended the school where I gathered my data.

My main supervisor Robin Cooper awakened the idea of using finite state methods for grammar checking and launched the collaboration with the Xerox research group. I want to express my greatest gratitude to him for inspiring discussions during project meetings and supervision sessions, and for his patience with my writing, struggling to understand every bit of it, always raising questions and always full of new exciting ideas. I really enjoyed our discussions and look forward to more. I would also like to thank my assistant supervisor Elisabet Engdahl, who carefully read my writing and made sure that I expressed myself more clearly.

Many thanks to all my colleagues at the Department of Linguistics for creating an inspiring research environment with interesting projects, seminars and conferences. I especially want to mention Leif Grönqvist for being the helping hand next door whenever needed, Robert Andersson for being my project colleague, Stina Ericsson for the loan of a LaTeX manual and for always being helpful, Ulla Veres for help with the recruitment of new victims for writing experiments, Jens Allwood and Elisabeth Ahlsén for introducing me to the world of transcription and coding, Sally Boyd, Nataliya Berbyuk and Ulrika Ferm for support and encouragement, Shirley Nicholson for always being available with books (and also milk for coffee), and Pia Cromberger for always being ready for a chat.

A special thanks to Ylva Hård af Segerstad for fruitful discussions leading to future collaboration that I am looking forward to, and for being a friend. I also want to thank the children in my study and their teachers for providing me with their text creations, and Sven Strömqvist and Victoria Johansson for sharing their data collection. A special thanks to Genie Perdin, who carefully proofread this thesis and gave me some encouraging last-minute ‘kicks’. I also want to thank all my friends, who reminded me now and then about life outside the university.

My deepest gratitude to my family for being there for me and for always believing in me. My husband Ali – I know the way was long and there were times I could be distant, but I am back. My daughter Sarah, for being the sunshine of my life, my inspiration, my everything. My mother, father, sister and my big little brother ...

Sylvana Sofkova Hashemi
Göteborg, May 2003


Table of Contents

1 Introduction
  1.1 Written Language in a Computer Literate Society
  1.2 Aim and Scope of the Study
  1.3 Outline of the Thesis

I Writing

2 Writing and Grammar
  2.1 Introduction
  2.2 Research on Writing in General
  2.3 Written Language and Computers
    2.3.1 Learning to Write
    2.3.2 The Influence of Computers on Writing
  2.4 Studies of Grammar Errors
    2.4.1 Introduction
    2.4.2 Primary and Secondary Level Writers
    2.4.3 Adult Writers
  2.5 Conclusion

3 Data Collection and Analysis
  3.1 Introduction
  3.2 Data Collection
    3.2.1 Introduction
    3.2.2 The Sub-Corpora
  3.3 Error Categories
    3.3.1 Introduction
    3.3.2 Spelling Errors
    3.3.3 Grammar Errors
    3.3.4 Spelling or Grammar Error?
    3.3.5 Punctuation
  3.4 Types of Analysis
  3.5 Error Coding and Tools
    3.5.1 Corpus Formats
    3.5.2 CHAT-format and CLAN-software

4 Error Profile of the Data
  4.1 Introduction
  4.2 General Overview
  4.3 Grammar Errors
    4.3.1 Agreement in Noun Phrases
    4.3.2 Agreement in Predicative Complement
    4.3.3 Definiteness in Single Nouns
    4.3.4 Pronoun Case
    4.3.5 Verb Form
    4.3.6 Sentence Structure
    4.3.7 Word Choice
    4.3.8 Reference
    4.3.9 Other Grammar Errors
    4.3.10 Distribution of Grammar Errors
    4.3.11 Summary
  4.4 Child Data vs. Other Data
    4.4.1 Primary and Secondary Level Writers
    4.4.2 Evaluation Texts of Proof Reading Tools
    4.4.3 Scarrie’s Error Database
    4.4.4 Summary
  4.5 Real Word Spelling Errors
    4.5.1 Introduction
    4.5.2 Spelling in Swedish
    4.5.3 Segmentation Errors
    4.5.4 Misspelled Words
    4.5.5 Distribution of Real Word Spelling Errors
    4.5.6 Summary
  4.6 Punctuation
    4.6.1 Introduction
    4.6.2 General Overview of Sentence Delimitation
    4.6.3 The Orthographic Sentence
    4.6.4 Punctuation Errors
    4.6.5 Summary
  4.7 Conclusions

II Grammar Checking

5 Error Detection and Previous Systems
  5.1 Introduction
  5.2 What Is a Grammar Checker?
    5.2.1 Spelling vs. Grammar Checking
    5.2.2 Functionality
    5.2.3 Performance Measures and Their Interpretation
  5.3 Possibilities for Error Detection
    5.3.1 Introduction
    5.3.2 The Means for Detection
    5.3.3 Summary and Conclusion
  5.4 Grammar Checking Systems
    5.4.1 Introduction
    5.4.2 Methods and Techniques in Some Previous Systems
    5.4.3 Current Swedish Systems
    5.4.4 Overview of The Swedish Systems
    5.4.5 Summary
  5.5 Performance on Child Data
    5.5.1 Introduction
    5.5.2 Evaluation Procedure
    5.5.3 The Systems’ Detection Procedures
    5.5.4 The Systems’ Detection Results
    5.5.5 Overall Detection Results
  5.6 Summary and Conclusion

6 FiniteCheck: A Grammar Error Detector
  6.1 Introduction
  6.2 Finite State Methods and Tools
    6.2.1 Finite State Methods in NLP
    6.2.2 Regular Grammars and Automata
    6.2.3 Xerox Finite State Tool
    6.2.4 Finite State Parsing
  6.3 System Architecture
    6.3.1 Introduction
    6.3.2 The System Flow
    6.3.3 Types of Automata
  6.4 The Lexicon
    6.4.1 Composition of The Lexicon
    6.4.2 The Tagset
    6.4.3 Categories and Features
  6.5 Broad Grammar
  6.6 Parsing
    6.6.1 Parsing Procedure
    6.6.2 The Heuristics of Parsing Order
    6.6.3 Further Ambiguity Resolution
    6.6.4 Parsing Expansion and Adjustment
  6.7 Narrow Grammar
    6.7.1 Noun Phrase Grammar
    6.7.2 Verb Grammar
  6.8 Error Detection and Diagnosis
    6.8.1 Introduction
    6.8.2 Detection of Errors in Noun Phrases
    6.8.3 Detection of Errors in the Verbal Head
  6.9 Summary

7 Performance Results
  7.1 Introduction
  7.2 Initial Performance on Child Data
    7.2.1 Performance Results: Phase I
    7.2.2 Grammatical Coverage
    7.2.3 Flagging Accuracy
  7.3 Current Performance on Child Data
    7.3.1 Introduction
    7.3.2 Improving Flagging Accuracy
    7.3.3 Performance Results: Phase II
  7.4 Overview of Performance on Child Data
  7.5 Performance on Other Text
    7.5.1 Performance Results of FiniteCheck
    7.5.2 Performance Results of Other Tools
    7.5.3 Overview of Performance on Other Text
  7.6 Summary and Conclusion

8 Summary and Conclusion
  8.1 Introduction
  8.2 Summary
    8.2.1 Introduction
    8.2.2 Children’s Writing Errors
    8.2.3 Diagnosis and Possibilities for Detection
    8.2.4 Detection of Grammar Errors
  8.3 Conclusion
  8.4 Future Plans
    8.4.1 Introduction
    8.4.2 Improving the System
    8.4.3 Expanding Detection
    8.4.4 Generic Tool?
    8.4.5 Learning to Write in the Information Society

Bibliography

Appendices

A Grammatical Feature Categories

B Error Corpora
  B.1 Grammar Errors
  B.2 Misspelled Words
  B.3 Segmentation Errors

C SUC Tagset

D Implementation
  D.1 Broad Grammar
  D.2 Narrow Grammar: Noun Phrases
  D.3 Narrow Grammar: Verb Phrases
  D.4 Parser
  D.5 Filtering
  D.6 Error Finder

List of Tables

3.1 Child Data Overview
4.1 General Overview of Sub-Corpora
4.2 General Overview by Age
4.3 General Overview of Spelling Errors in Sub-Corpora
4.4 General Overview of Spelling Errors by Age
4.5 Number Agreement in Swedish
4.6 Gender Agreement in Swedish
4.7 Definiteness Agreement in Swedish
4.8 Noun Phrases with Proper Nouns as Head
4.9 Noun Phrases with Pronouns as Head
4.10 Noun Phrases without (Nominal) Head
4.11 Agreement in Partitive Noun Phrase in Swedish
4.12 Gender and Number Agreement in Predicative Complement
4.13 Personal Pronouns in Swedish
4.14 Finite and Non-finite Verb Forms
4.15 Tense Structure
4.16 Fa-sentence Word Order
4.17 Af-sentence Word Order
4.18 Distribution of Grammar Errors in Sub-Corpora
4.19 Distribution of Grammar Errors by Age
4.20 Examples of Grammar Errors in Teleman’s Study
4.21 Examples of Grammar Errors from the Skrivsyntax Project
4.22 Grammar Errors in the Evaluation Texts of Grammatifix
4.23 Grammar Errors in Granska’s Evaluation Corpus
4.24 General Error Ratio in Grammatifix, Granska and Child Data
4.25 Three Error Types in Grammatifix, Granska and Child Data
4.26 Grammar Errors in Scarrie’s ECD and Child Data
4.27 Examples of Spelling Error Categories
4.28 Spelling Variants
4.29 Distribution of Real Word Segmentation Errors
4.30 Distribution of Real Word Spelling Errors in Sub-Corpora
4.31 Distribution of Real Word Spelling Errors by Age
4.32 Sentence Delimitation in the Sub-Corpora
4.33 Sentence Delimitation by Age
4.34 Major Delimiter Errors in Sub-Corpora
4.35 Major Delimiter Errors by Age
4.36 Comma Errors in Sub-Corpora
4.37 Comma Errors by Age
5.1 Summary of Detection Possibilities in Child Data
5.2 Overview of the Grammar Error Types in Grammatifix (GF), Granska (GR) and Scarrie (SC)
5.3 Overview of the Performance of Grammatifix, Granska and Scarrie
5.4 Performance Results of Grammatifix on Child Data
5.5 Performance Results of Granska on Child Data
5.6 Performance Results of Scarrie on Child Data
5.7 Performance Results of Targeted Errors
6.1 Some Expressions and Operators in XFST
6.2 Types of Directed Replacement
6.3 Noun Phrase Types
7.1 Performance Results on Child Data: Phase I
7.2 False Alarms in Noun Phrases: Phase I
7.3 False Alarms in Finite Verbs: Phase I
7.4 False Alarms in Verb Clusters: Phase I
7.5 False Alarms in Noun Phrases: Phase II
7.6 False Alarms in Finite Verbs: Phase II
7.7 False Alarms in Verb Clusters: Phase II
7.8 Performance Results on Child Data: Phase II
7.9 Performance Results of FiniteCheck on Other Text
7.10 Performance Results of Grammatifix on Other Text
7.11 Performance Results of Granska on Other Text
7.12 Performance Results of Scarrie on Other Text

List of Figures

3.1 Principles for Error Categorization
4.1 Grammar Error Distribution
4.2 Error Density in Sub-Corpora
4.3 Error Density in Age Groups
4.4 Three Error Types in Grammatifix (black line), Granska (gray line) and Child Data (white line)
4.5 Error Distribution of Selected Error Types in Scarrie
4.6 Error Distribution of Selected Error Types in Child Data
6.1 The System Architecture of FiniteCheck
7.1 False Alarms: Phase I vs. Phase II
7.2 Overview of Recall in Child Data
7.3 Overview of Precision in Child Data
7.4 Overview of Overall Performance in Child Data
7.5 Overview of Recall in Other Text
7.6 Overview of Precision in Other Text
7.7 Overview of Overall Performance in Other Text

Chapter 1

Introduction

1.1 Written Language in a Computer Literate Society

Written language plays an important role in our society. A great deal of our communication occurs by means of writing, which, besides the traditional paper and pen, is facilitated by the computer, the Internet and other devices such as the mobile phone. Word processing and sending messages via email are among the most common activities on computers. Other media that enable written communication are also becoming popular, such as webchat and instant messaging on the Internet, or text messaging (Short Message Service, SMS) via the mobile phone.[1]

The present doctoral dissertation concerns word processing on computers, in particular the linguistic tools integrated in such authoring aids. The use of word processors for writing, both in educational and professional settings, modifies the process, practice and acquisition of writing. With a word processor, it is not only easy to produce a text with a neat layout; the writer is also supported throughout the whole writing process. Text may be restructured and revised at any time during text production without leaving any trace of the changes that have been made. Text may be reused and a new text composed by cutting and pasting passages. Iconic material such as pictures[2] (or even sounds) can be inserted, and linguistic aids can be used for proofreading a text.

Writing acquisition can be enhanced by the use of a word processor. For instance, focus on the more technical aspects of writing, such as physically shaping letters with a pen, shifts toward the more cognitive processes of text production, enabling the writer to apply the whole language register. Writing on a computer in general enhances the motivation to write, revise or completely change a text (cf. Wresch, 1984; Daiute, 1985; Severinson Eklundh, 1993; Pontecorvo, 1997).

The status of written language in our modern information society has changed. In contrast to ancient times, writing is no longer reserved for a small minority of professional groups (e.g. priests and monks, bankers, important merchants). In particular, the emergence of computers in writing has led to the involvement of new user groups besides today’s writing professionals such as journalists, novelists and scientists. We write more nowadays in general, and the freedom of and control over one’s own writing has increased. Texts are produced rapidly and are more seldom proofread by a careful secretary with knowledge of language. This is sometimes reflected in the quality and correctness of the resulting text (cf. Severinson Eklundh, 1995).

Linguistic tools that check mechanics, grammar and style have taken over the secretarial function to some degree and are usually integrated in word processing software. Spelling checkers and hyphenators, which check writing mechanics and identify violations in individual words, have existed for some time now. Grammar checkers, which recognize syntactic errors and often also violations of punctuation, word capitalization conventions, number and date formatting and other style-related issues, thus working above the word level, are a rather new technology, especially for smaller languages such as Swedish. Grammar checking tools for languages such as English, French, Dutch, Spanish, and Greek were being developed in the 1980s, whereas research on Swedish writing aids aimed at grammatical deviance started quite recently. In addition to the present work, there are three research groups working in this area.

[1] Studies of computer-mediated communication are provided by e.g. Severinson Eklundh (1994); Crystal (2001); Herring (2001). A recent dissertation by Hård af Segerstad (2002) explores especially how written Swedish is used in email, webchat and SMS.
[2] Smileys or emoticons (e.g. :-) “happy face”) are commonly used in computer-mediated communication.
The Department of Numerical Analysis and Computer Science (NADA) at the Royal Institute of Technology (KTH), with a long tradition of research in writing and authoring aids, is responsible for Granska. Development of this tool has occurred over a series of projects starting in 1994 (Domeij et al., 1996, 1998; Carlberger et al., 2002). The Department of Linguistics, Uppsala University was involved in an EU-sponsored project, Scarrie, between 1996 and 1999. The goal of this project was the development of language tools for Danish, Norwegian and Swedish (Sågvall Hein, 1998a; Sågvall Hein et al., 1999). Finally, a Finnish language engineering company, Lingsoft Inc., developed Grammatifix. Initiated in 1997 and completed in 1999, this tool was released on the market in November 1998, and has been part of the Swedish Microsoft Office package since 2000 (Arppe, 2000; Birn, 2000).

The three Swedish systems mainly use parsing techniques with some degree of feature relaxation and/or explicit error rules for the detection of errors. Grammatifix and Granska are developed as generic tools and are tested on adult (mostly professional) texts. Scarrie’s end-users are professional writers from newspapers and publishing firms.

1.2 Aim and Scope of the Study

The primary purpose of the present work is to detect grammar errors by means of linguistic descriptions of correct language use rather than descriptions of the structure of errors. The ideal would be to develop a generic method for the detection of grammar errors in unrestricted text that could be applied to different writing populations displaying different error types, without the need for rewriting the grammars of the system. That is, instead of describing the errors made by different groups of writers, resulting in distinct sets of error rules, the same grammar set is used for detection. This approach of identifying errors in text without explicitly describing them contrasts with the other three Swedish grammar checkers. Using this method, we will hopefully cover many different cases of errors and minimize the possibility of overlooking some of them.

We chose primary school children as the target population, a new group of users not covered by the previous Swedish projects. Children, as beginning writers, are in the process of acquiring written language, unlike adult writers, and will probably produce relatively more errors, and errors of a different kind, than adults. Their writing errors probably have more to do with competence than performance. Grammar checkers for this group need different coverage and must concentrate on different kinds of errors. Further, the positive impact of computers on children’s writing opens new opportunities for the application of language technology. The use of proofreading tools for educational purposes is a rather new application area, and this work can be considered a first step in that direction.

Against this background, the main goal of the present thesis is handling children’s errors and experimenting with positive grammatical descriptions using finite state techniques.
The work is divided into three subtasks: first, an overall error analysis of the collected children’s texts; then, an exploration of the nature of the errors and the possibilities for detecting them; and finally, the implementation of detection for (some) grammatical error types. Here is a brief characterization of these three tasks:

I. Investigation of children’s writing errors: The targeted data for a grammar checker can be selected either by intuitions about errors that will probably occur, or by directly looking at errors that actually occur. In the present work, the second approach of empirical analysis is applied. Texts from pupils at three primary schools were collected and analyzed for errors, focusing on errors above word level, including grammar errors, spelling errors resulting in existent words, and punctuation. The main focus lies on grammar errors as the basis for implementation. The questions that arise are: What grammar errors occur? How should the errors be categorized? What spelling errors result in lexicalized strings and are thus not captured by a spelling checker? What is the nature of these? How is punctuation used and what errors occur?

II. Investigation of the possibilities for detecting these writing errors: The nature of the errors is explored along with the available technology that can be applied in order to detect them. An interesting point is how the errors that are found are handled by the current systems. The questions that arise are: What is the nature of the error? What is the diagnosis of the error? What is needed to be able to detect the error? How are the grammar errors handled by the current Swedish grammar checkers, Grammatifix, Granska and Scarrie?

III. Implementation of the detection of (some) grammar errors: A subset of errors is chosen for implementation, taking grammar checking to the level of detecting errors. Detected errors obtain a description of the type of error; the implementation does not include any additional diagnosis or any suggestion of how to correct the error. The analysis is shallow, using finite state techniques. The grammars describe real syntactic relations rather than the structure of erroneous patterns. The errors are revealed by the difference between grammars of varying accuracy, which, represented as finite state automata, can be subtracted from each other. Karttunen et al. (1997a) use this technique to find instances of invalid dates; the present work is an attempt to apply their approach to a larger language domain.
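The subtraction idea can be sketched in a few lines of Python. This is an illustrative toy, not FiniteCheck itself: the two-word mini-lexicon (DETS, NOUNS), the gender labels and the flag helper are all invented for illustration, and the "grammars" here are finite sets of determiner-noun patterns, whereas the actual system compiles full finite state automata with XFST.

```python
from itertools import product

# Hypothetical mini-lexicon: Swedish-like determiners and nouns tagged
# with gender (utr = common, neu = neuter). Agreement requires a match.
DETS = {"en": "utr", "ett": "neu"}
NOUNS = {"bil": "utr", "hus": "neu"}

# "Broad" grammar: any determiner followed by any noun (structure only).
broad = {(d, n) for d, n in product(DETS, NOUNS)}

# "Narrow" grammar: determiner and noun must also agree in gender.
narrow = {(d, n) for d, n in broad if DETS[d] == NOUNS[n]}

# Subtraction yields exactly the ill-formed phrases -- no error rule
# is ever written by hand; both grammars describe valid structure.
errors = broad - narrow

def flag(tokens):
    """Flag two-word noun phrases that match a derived error pattern."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)
            if tuple(tokens[i:i + 2]) in errors]

print(flag("jag ser en hus och ett bil".split()))
# flags ('en', 'hus') and ('ett', 'bil')
```

The design point the toy makes is the one argued above: both grammars stay positive descriptions of Swedish, and the machine that recognizes errors falls out of their difference.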
The work on this grammar error detector started at the Department of Linguistics at Göteborg University in 1998, in the project Finite State Grammar for Finding Grammatical Errors in Swedish Text, and was carried out in collaboration with the NADA group at KTH in the project Integrated Language Tools for Writing and Document Handling.3 The present thesis describes both the initial development within this project and its continuation. The main contributions of this thesis concern the understanding of incorrect language use in primary school children's writing and the computational analysis of such incorrect text by means of correct language use, in particular:

• Collection of texts written by primary school children, both by hand and on a computer.

3 This project was sponsored by the HSFR/NUTEK Language Technology Programme and has its site at: http://www.nada.kth.se/iplab/langtools/

Introduction


• Analysis of grammar errors, spelling errors and punctuation in the texts of primary school writers.
• Comparison of errors found in the present data with errors found in other studies on grammar errors.
• Comparison of the error types covered by the three Swedish grammar checkers.
• Performance analysis of the three Swedish grammar checkers on the present data.
• Implementation of a grammar error detector that derives/compiles error patterns rather than writing the error grammar by hand.
• Performance analysis of the detector on the collected data and some portion of other data.

1.3 Outline of the Thesis

The remaining chapters of the thesis fall into two parts.

Part I: The first part is devoted to a discussion of writing and an analysis of the collected data and consists of three chapters. Chapter 2 provides a brief introduction to research on writing in general, writing acquisition, and how computers influence writing, along with descriptions of previous findings on grammar errors, concluding with what grammar errors are to be expected in written Swedish. Chapter 3 gives an overview of the data collected and a discussion of error classification. Chapter 4 presents the error profile of the data and concludes with a discussion of the requirements for a grammar error detector for the particular subjects of this study.

Part II: The second part of the thesis concerns grammar checking and includes three chapters. Chapter 5 starts with a general overview of the requirements and functionalities of a grammar checker and what is required for the errors in the present data. The Swedish grammar checkers are described and their performance is checked on the present data. Chapter 6 presents the implementation of a grammar error detector that handles these errors, including a description of the finite state formalism, and explains the techniques of finite state parsing. Chapter 7 presents the performance of this tool. The thesis ends with a concluding summary (Chapter 8).

In addition, the thesis contains four appendices. Appendix A presents the grammatical feature categories



used in the examples of errors or when explaining the grammar of Swedish. Appendix B presents the error corpora, consisting of the grammar errors found in the present study (Appendix B.1), misspelled words (Appendix B.2) and segmentation errors (Appendix B.3). The tagset used is presented in Appendix C, and some listings from the implementation are given in Appendix D.

Part I

Writing


Chapter 2

Writing and Grammar

2.1 Introduction

Learning to write does not imply acquiring a completely new language (a new grammar), since at this stage (i.e. beginning school) a child usually already knows the majority of the (general) grammar rules. Rather, learning to write is a process of learning the difference between written language and the already acquired spoken language. Consequently, the errors one finds in the writing of primary school children are often due to their lack of knowledge of written language: they consist of attempts to reproduce spoken language norms as an alternative to the standard written norm, or stem from parts of written language that have not yet been acquired. Further, even when the writer knows the standard norm, errors can occur either as the result of disturbances such as tiredness, stress, etc., or because the writer cannot manage to keep together complex content and meaning constructions (cf. Teleman, 1991a). Another source of errors is the tools we use for writing: computers also affect our writing and may give rise to errors of their own.

The main purpose of the present chapter is to see whether previous studies on writing can give some hint of what grammar errors are to be expected in the writing of Swedish children. It provides a survey of previous studies of grammar errors, as well as some background research on writing in general and some insights into what it means to learn to write and how computers influence our writing. First, a short review of research on writing is presented (Section 2.2), followed by a short explanation of what acquisition of written language involves and how computers influence the way we write (Section 2.3). Previous findings on grammar errors in Swedish are presented in the following section, including studies of the writing of children and adolescents, adults and the disabled (Section 2.4).



2.2 Research on Writing in General

For a long period of time, many (beginning with e.g. de Saussure, 1922; Bloomfield, 1933) considered written language to be a mere transcription of spoken (oral) language, not as important as, or even inferior to, speech. A similar view is also reflected in the research on literacy, where studies on writing were very few in comparison to research on reading. A turning point at the end of the 1970s, described by many as "the writing crisis" (Scardamalia and Bereiter, 1986), brought an expansion of research on the teaching of native language writing. During this period, more naturalistic methods for writing were advocated, i.e. "learning to write by writing" (Moffett, 1968), the writing situation in English schools was examined (e.g. Britton, 1982; Emig, 1982), and the focus of study shifted from judgments of products and text-oriented research to the strategies involved in the process of writing (see Flower and Hayes, 1981).

In Sweden, writing skills were studied by focusing on the written product, often related to the social background of the child. Research was devoted to spelling (e.g. Haage, 1954; Wallin, 1962, 1967; Dahlquist and Henrysson, 1963; Ahlström, 1964, 1966; Lindell, 1964) and to composition writing in connection with standardized tests (e.g. Björnsson, 1957, 1977; Ljung, 1959). There are also studies concerning writing development in primary and upper secondary schools (e.g. Grundin, 1975; Björnsson, 1977; Hultman and Westman, 1977; Lindell et al., 1978; Larsson, 1984). During the latter half of the 1980s, research in Sweden took a new direction towards studies of writing strategies concerning writing as a process (e.g. Björk and Björk, 1983; Strömquist, 1987, 1989), the development of writing abilities focusing on writing activities between children and parents (e.g. Liberg, 1990) and text analysis (e.g. Garme, 1988; Wikborg and Björk, 1989; Josephson et al., 1990).
This turning point was reflected in education by the introduction of process-oriented writing as well. Some research concerned writing as a cognitive text-creating process, using video-recordings of persons engaged in writing (e.g. Matsuhasi, 1982) or clinical experiments (e.g. Bereiter and Scardamalia, 1985). The use of computers in writing prompted studies of their influence on writing (e.g. Severinson Eklundh and Sjöholm, 1989; Severinson Eklundh, 1993; Wikborg, 1990), resulting in the development of computer programs that register and record writing activities (e.g. Flower and Hayes, 1981; Severinson Eklundh, 1990; Kollberg, 1996; Strömqvist, 1996).



2.3 Written Language and Computers

2.3.1 Learning to Write

Writing, like speaking, is primarily aimed at expressing meaning. The most evident difference between written and spoken language lies in the physical channel. Written language is a single-channelled monologue, using only the visual channel (the eye), with the addressee not present at the same time. It is a more relaxed, rather slow process, affording longer time for consideration and the possibility to edit and correct the end product. Speech as a dialogue is simultaneous and involves participants present at the same time, where all the senses can be used to receive information. It is a fast process with little time for consideration and difficulty in correcting the end product. The rules and conventions of written language are more restrictive than those of spoken language, in the sense that there are constructions in spoken language regarded as "incorrect" in written language. Writing is, in general, standardized with less (dialectal) variation, in contrast to spoken language, which is dialectal and varied. Further, the acquisition of written and spoken language occurs under different conditions and in different ways. Writing is taught in school by teachers with specific training, whereas speaking is learned privately (in the family, from peers, etc.), without any planning of the process. When learning to speak, we learn the language; when learning to write, we already know the language (in its spoken form) (cf. Linell, 1982; Teleman, 1991b; Liberg, 1990).1

Learning a written language means not only acquiring its more or less explicit norms and rules, but also learning to handle the overall writing system, including the more technical aspects, such as how to shape the letters, the boundaries between words and how a sentence is formed, as well as acquiring the grammatical, discursive and strategic competence to convey a thought or message to the reader. In other words, writing entails being able to handle the means of writing, i.e.
letters and grammar rules, arranging them to form words and sentences, and being able to use them in a variety of contexts and for different purposes. During this development, children may compose texts of different genres, but not necessarily apply the conventions of the writing system correctly. Children are quite creative and often use conventions in their own ways, for instance using periods between words to separate them instead of blank spaces (cf. Mattingly, 1972; Chall, 1979; Lundberg, 1989; Liberg, 1990; Pontecorvo, 1997; Håkansson, 1998).

1 For further, more extensive definitions of differences between written and spoken language see e.g. Chafe (1985); Halliday (1985); Biber (1988).



The above discussion leads to a view of learning to write as the acquisition of a complex system of communication with several components. Following Hultman (1989, p. 73), we can identify three aspects of writing:

1. the motor aspect: the movement of the hand when forming the letters or typing on the keyboard

2. the grammar aspect: the rules for spelling and punctuation, morphology and syntax at clause, sentence and text level

3. the pragmatic aspect: the use of writing for a purpose: to argue, tell, describe, discuss, inform, refer, etc. The text has to be readable, reflecting the meaning of words and the effect they have.

This thesis focuses on the grammar aspect, in particular on the syntactic relationships between words. Some aspects of spelling and punctuation are also covered. The text level is not analyzed here.

2.3.2 The Influence of Computers on Writing

The view of writing has changed: it is no longer interpreted as a linear activity consisting of independent and temporally sequenced phases, but rather considered to be a dynamic, problem-solving activity. According to Hayes and Flower (1980), writing as a cognitive process is influenced by the task environment (the external social conditions) and the writer's long-term memory, and comprises the cognitive processes of planning (generating and organizing ideas, setting goals, and deciding what to include and what to concentrate on), translation (the actual production) and revision (evaluation of what has been written, proof-reading, writing out and publishing). This process-based approach, with the phases also referred to as prewriting, writing and rewriting, has been adopted in writing instruction in school (e.g. Graves, 1983; Calkins, 1986; Strömquist, 1993) and is also considered to be well-suited to computer-assisted composition (Wresch, 1984; Montague, 1990).

Writing on a computer makes text easy to structure, rearrange and rewrite. Many studies report writers' decreased resistance to writing: they find it easier to start writing, can revise throughout the whole writing process, and can leave the text, come back to it later, and update and reuse old texts (e.g. Wresch, 1984; Severinson Eklundh, 1993). Also, studies of children's use of computers show that children who use a word-processor in school enjoy writing and editing activities more, considering writing on a computer to be much easier and more fun. They are more willing to revise and even completely



change their texts, and they write more in general (e.g. Daiute, 1985; Pontecorvo, 1997).

The word processor affects the way we write in general. We usually plan less at the beginning when writing on a computer and revise more during writing. Thus, editing occurs during the whole process of writing and is not left solely to the final phase. In an investigation by Severinson Eklundh (1995) of twenty adult writers with academic backgrounds, more than 80% of all editing was performed during writing and not after. The main disadvantage reported is that it is hard to get an overall perspective of a text on the screen, which makes planning and revision more difficult and can in turn lead to texts of worse quality (e.g. Hansen and Haas, 1988; Severinson Eklundh, 1993).

Rewriting and rearranging a text is easy to do in a word processor, for instance with copy and paste utilities, which may easily give rise to errors that are hard to discover afterwards, especially in a brief perusal. Words and phrases can be repeated, omitted or transposed. Sentences can become too long (Pontecorvo, 1997), and errors occur that normally are not found in native speakers' writing. The common claim is that writing in one's mother tongue normally results only in error types that deviate from the public language norm, since most of the mother tongue's grammar is in place before we begin school (Teleman, 1979). There are studies that clearly show that the use of word processors leads to completely new error types, including some errors that were previously considered characteristic of second language writers. For instance, morpho-syntactic (agreement) errors have been found to be quite common among native speakers in the studies of Bustamante and León (1996) and Domeij et al. (1996). These errors are connected to how we use the functions of a word processor and to the fact that revision is more local due to the limited view on the screen (cf. Domeij et al., 1996; Domeij, 2003).
Concerning text quality, there are studies that point out that the use of a word processor results in longer texts, both among children and adults. Some researchers claim that the quality of compositions improved when word processors were used (see e.g. Haas, 1989; Sofkova Hashemi, 1998). However, no reliable quality enhancement beyond the length of the text is evident in any study. The effects of using a computer for revision are regarded by some as positive for both the mechanics and the content of writing, while others feel it promotes only surface-level revision, enhancing neither content nor meaning (see the surveys in Hawisher, 1986; Pontecorvo, 1997; Domeij, 2003).



2.4 Studies of Grammar Errors

2.4.1 Introduction

There are not many studies of grammar errors in written Swedish. Studies of adult writing are few, while research on children's writing development mostly concerns the early ages of three to six years and the development of spelling and the use of the period and other punctuation marks and conventions (e.g. Allard and Sundblad, 1991). The recent expansion in the development of grammar checking tools, however, contributes to this field. Below, studies are presented of grammar errors found in the writing of primary and upper secondary school children and adults, of the error types covered by current proof reading tools, and of the grammar errors found in the texts of adult writers used for the evaluation of these tools. Some of these studies are described in further detail and compared to the analysis of the children's texts gathered for the present thesis in Chapter 4 (Section 4.4).

2.4.2 Primary and Secondary Level Writers

During the 1980s, several projects investigated the language of Swedish school children as a contribution to the discussion of language development and language instruction (see e.g. the surveys in Östlund-Stjärnegårdh, 2002; Nyström, 2000). The writing of children in primary and upper secondary school was analyzed mostly with a focus on lexical measures of productivity and language use, in terms of analysis of vocabulary, parts-of-speech distribution, length of words, word variation and also content, relation to gender, social background and the grades assigned to the texts (e.g. Hersvall et al., 1974; Hultman and Westman, 1977; Lindell et al., 1978; Pettersson, 1980; Larsson, 1984). Then, when the traditional product-oriented view of writing switched to the new process-oriented paradigm, studies on writing concerned the text as a whole and as a communicative act (e.g. Chrystal and Ekvall, 1996, 1999; Liberg, 1999) and became more devoted to the analysis of genre and referential issues (e.g. Öberg, 1997; Nyström, 2000), the relation to the grades assigned (e.g. Östlund-Stjärnegårdh, 2002) and modality (speech or writing) (e.g. Strömqvist et al., 2002). Quantitative analysis in this field still concerns lexical measures of variation, length, coherence, word order and sentence structure; very few studies note errors other than spelling or punctuation (e.g. Olevard, 1997; Hallencreutz, 2002).

A study by Teleman (1979) shows examples (no quantitative measures) of both lexical and syntactic errors observed in the writing of children from the seventh year of primary school (among others). He reports on errors in function words,



inflection with dialectal endings in nouns, dropping of tense endings on verbs, and the use of nominative forms of pronouns in place of accusative forms, as is often the case in spoken Swedish. Errors in definiteness agreement, missing constituents, reference problems, word order and tense shift are also exemplified, as well as observations of erroneous use of, or missing, prepositions in idiomatic expressions.

Another study, by Hultman and Westman (1977), concerns the analysis of national tests from third year students in upper secondary school. The aim of the project Skrivsyntax "Written Syntax" was to study writing practice in school from a linguistic point of view. The material included 151 compositions (88,757 words in total) on the subject Familjen och äktenskapet än en gång 'Family and marriage once more'. Vocabulary, distribution of word categories, syntax and spelling were studied and compared to adult texts, between the marks assigned to the texts and between boys and girls. The study also included error analysis of punctuation, orthography, grammar, lexicon, semantics, stylistics and the functionality of the text. Among grammar errors, gender agreement errors were reported as common, and relatively many errors in pronoun case after prepositions occurred. Errors in agreement between subject and predicative complement are also reported as rather frequent, as are word order errors, mostly in the placement of adverbials. Other examples include verb form errors, subject-related errors, reference, preposition use in idiomatic expressions and clauses with odd structure.

2.4.3 Adult Writers

There are few studies of adult writing in Swedish. Those that exist are mostly devoted to the writing process as a whole or to social aspects of it, with very little attention being paid to the mechanics of writing. However, the recent expansion in the development of Swedish computational grammar checking tools, which requires an understanding of what error types should be treated by such tools, has made contributions to this field. The decision as to which error types occur and should thus be included in such an authoring aid may be based on intuitive presuppositions about what rules could be violated, in addition to empirical analysis of text. More empirical evidence of grammar violations also comes from the evaluation of such tools, where the system is tested against a text corpus with hand-coded analysis of errors. There are three available grammar checkers for Swedish: Granska (Knutsson, 2001), Grammatifix (Birn, 2000) and Scarrie (Sågvall Hein et al., 1999).2 Scarrie is explicitly aimed at professional writers of newspaper articles. The other two systems are not explicitly aimed at any special user groups, although their performance tests were conducted mainly on newspaper texts.

2 These tools are described in detail in Chapter 5.



Below, a survey is presented of studies of professional and non-professional writers and adult disabled writers, of the grammar errors covered by the three Swedish grammar checkers, and of the grammar errors that occurred in the evaluation texts on which the performance of these systems was tested.

Professional and Non-professional Writers

Studies focusing on adult non-professional writing concern analysis of crime reports (Leijonhielm, 1989), post-school writing development (Hammarbäck, 1989), a socio-linguistic study of writing attitudes, i.e. what is written and who writes what at a local government office regardless of writing conventions (Gunnarsson, 1992), and some "typical features in non-proof-read adult prose" at a government authority reported in Göransson (1998), the only investigation that addresses (to some extent) grammatical structure.

Göransson (1998) describes her immediate impressions when proof-reading texts written by her colleagues at a government authority, showing some typical features of this unedited adult prose. She examined reports, instructional texts, newspaper articles, formal letters, etc. The analysis distinguishes between high and low level errors. High level concerns the comprehensibility of the text, coherence and style, relevance for the context, the ability to see one's own text through the eyes of others, choice of words, etc. Low level errors cover grammar and spelling errors. Among the grammar errors she reports only reference problems, choice of preposition and agreement errors.

Among studies of professional writers, the language consultant Gabriella Sandström (1996) analyzed editing at the Swedish newspaper Svenska Dagbladet, covering 29 articles written by 15 reporters. The original script, the edited version and the final version of the articles were analyzed. The analysis involved spelling, errors at the lexical and syntactic levels, formation errors, punctuation and graphical errors.
The results showed that the journalists made most errors in punctuation, graphical errors and lexical errors, and that most of these disappeared during the editing process. Among the lexical errors, Sandström mentions errors in idiomatic expressions and in the choice of prepositions. Syntax errors also seem to be quite common, but the article does not give an analysis of the different kinds of syntax errors.



Adults with Writing Disabilities

Studies on the writing strategies of disabled groups were conducted within the project Reading and Writing Strategies of Disabled Groups,3 including an analysis of grammar in the writing of dyslexic and deaf adults (Wengelin, 2002). The analysis of the writing of deaf adults included no frequency data and is less important for the present study, since it tends to reflect strategies found in second language acquisition. Adult dyslexics mostly show problems with the formation of sentences and frequent omission of constituents; missing or erroneous conjunctions were especially frequent. Other errors concern agreement in the noun phrase or the form of noun phrases, verb form, tense shift within sentences and incorrect choice of prepositions. The marking of sentence boundaries and punctuation is the main problem for these writers.

Error Types in Proof Reading Tools

The error types covered by a grammar checker should, in general, include the central constructions of the language and, in particular, those which give rise to errors. These constructions should allow precise descriptions so that false alarms can be avoided. The selection of error types to include is then also dependent on the available technology and the possibility of detecting and correcting the error types (cf. Arppe, 2000; Birn, 2000).

In the development of Grammatifix, the pre-analysis of existing error types in Swedish was based on linguistic intuition, personal observation and reference literature on Swedish grammar and writing conventions (Arppe, 2000). In the case of Granska, the pre-analysis involved analysis of empirical data such as newspaper texts and student compositions (Domeij et al., 1996; Domeij, 2003). In the Scarrie project, where journalists are the end-users, the pre-analysis stage consisted of gathering corrections made by professional proof-readers at the newspapers involved.
These corrections were stored in a database (the Swedish Error Corpora Database, ECD) that contains nearly 9,000 error entries, including spelling, grammar, punctuation, graphic and style, meaning and reference errors.

Arppe (2000) provides an overview of the types of errors covered by the Swedish tools and reports, in short, that all the tools treat errors in noun phrase agreement and verb forms in verb chains. Scarrie and Granska also treat errors in compounds, whereas Grammatifix has the widest coverage of punctuation and number formatting errors. He points out that the error classification in these tools is similar, but not exactly the same. The depth and breadth of the included error categories differ in the subsets of phrases, the level of syntactic complexity or the position of detection in the sentence. They may, for instance, detect errors in syntactically simple fragments, but fail with syntactically more complex structures. These factors are further explained and exemplified in Chapter 5, where I also compare the error types covered by the individual tools. Among the grammar errors presented in Scarrie's ECD, errors in noun phrase agreement, predicative complement agreement, definiteness in single nouns, verb subcategorization and choice of preposition are the most frequent error types.

3 More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r

Evaluation Texts of Proof Reading Tools

Other empirical evidence of grammar errors can be observed in the evaluation of the three grammar checkers (Birn, 2000; Knutsson, 2001; Sågvall Hein et al., 1999). The performance of all the tools was tested on newspaper text written by professional writers. Only the evaluation corpus of Granska also included texts written by non-professionals, represented by student compositions. In general, the analyzed corpora are dominated by errors in verb form, agreement in noun phrases, prepositions and missing constituents.

2.5 Conclusion

The main purpose of the present chapter was to investigate whether previous research reveals which grammar errors to expect in the writing of primary school children. Apparently, grammar in general has a very low priority in research on writing in Swedish. Grammar errors in children's writing have been analyzed at the upper level of primary school and in upper secondary school, and the results exist only as reports with some examples, without any particular reference to frequency. Some analyses have been performed on the writing of professional adult writers and in research on the writing of dyslexic and deaf adults, with quantitative data for the dyslexic group. The only area that directly approaches grammar errors concerns the development of proofreading tools aimed particularly at grammar. These studies report on grammar errors in the writing of adults.

Previous research presents no general characterization of grammar errors in children's writing. There are, however, a few indications that children as beginning writers make errors different from those of adult writers. Teleman's observations indicate the use of spoken forms that were not reported in the other studies. Some examples of errors in the Skrivsyntax project are evidently more related to the fact that the children have not yet mastered writing conventions (e.g. errors in the accusative



case of plural pronouns) than to "slips of the pen" (e.g. due to lack of attention). In general, all the studies report errors in agreement (both in the noun phrase and the predicative complement), verb form and the choice of prepositions in idiomatic expressions. Are these the central constructions in Swedish that give rise to grammar errors? This may be true for adult writers, but it is unclear for beginning writers. The analysis of grammar errors in the children's data collected for the present study is presented in Chapter 4, together with a comparison with the findings of the previous studies of grammar errors presented above.


Chapter 3

Data Collection and Analysis

3.1 Introduction

In this chapter we report on the data that has been gathered for this study and the types of analysis performed on it. First, the data collection is presented and the different sub-corpora are described (Section 3.2). Then follows a discussion of the kinds of errors analyzed and how they are classified (Section 3.3). The types of analyses in the present study are described in the subsequent section (Section 3.4), and a description of the error coding and the tools used for that purpose ends this chapter (Section 3.5).

3.2 Data Collection

3.2.1 Introduction

The main goal of this thesis is the automatic detection of grammar errors in texts written by children. In order to explore what errors actually occur, texts on different topics written by different subjects were collected to build an underlying corpus for analysis, hereafter referred to as the Child Data corpus. The material was collected on three separate occasions and has served as the basis for other (previous) studies. The first collection consists of both hand written and computer written compositions on set topics by 18 children between 9 and 11 years old - The Hand versus Computer Collection. The second collection involves the same subjects; this time, the children participated in an experiment and told a story about a series of pictures, both orally and in writing on a computer - The Frog Story Collection. The third collection comes from a project on



development of literacy and includes eighty computer written compositions by 10 and 13 year old children on set topics in two genres - The Spencer Collection.1

Table 3.1 gives an overview of the whole Child Data corpus, including the three collections mentioned above, divided into five sub-corpora by the writing topics the subjects were given: Deserted Village, Climbing Fireman, Frog Story, Spencer Narrative, Spencer Expository. Further information concerns the age of the subjects involved, the number of compositions, the number of words, whether the children wrote by hand or on a computer, and which writing aid was used.

Table 3.1: Child Data Overview

                       AGE      COMP  WORDS   TOPIC                             WRITING AID
Hand vs. Computer Collection:
  Deserted Village     9-11     18    7 586   "They arrived in a                paper and pen
                                              deserted village"
  Climbing Fireman     9-11     18    4 505   Shown: a picture of a fireman     Claris Works 3.0
                                              climbing on a ladder
Frog Story Collection:
  Frog Story           9-11     18    4 907   Story-retelling: "Frog,           ScriptLog
                                              where are you?"
Spencer Collection:
  Spencer Narrative    10 & 13  40    5 487   Narrative: Tell about a           ScriptLog
                                              predicament you had rescued
                                              somebody from, or you had
                                              been rescued from
  Spencer Expository   10 & 13  40    7 327   Expository: Discuss the           ScriptLog
                                              problems seen in the video
TOTAL                           134   29 812

Altogether 58 children between 9 and 13 years old wrote 134 papers, comprising a corpus of 29,812 words. Most of the papers are written on the computer. Only the first sub-corpus (Deserted Village) consists of 18 hand written compositions. The editor Claris Works 3.0 was used for 18 computer written texts. ScriptLog, a tool for experimental research on the on-line process of writing, was used for the remaining (98) computer written compositions. ScriptLog looks just like an ordin1

Many thanks to Victoria Johansson and Sven Stro¨ mqvist, Department of Linguistics, Lund University for sharing this collection of data.

Data Collection and Analysis

23

ary word processor to the user, but in addition to producing the written text, it also logs information of all events on the keyboard, the screen position of these events and their temporal distribution.2 This section proceeds with detailed descriptions of the three collections that form the corpus, with information about when and for what purpose the material was collected, the subjects involved, the tasks they were given and the experiments they took part in.
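As a quick sanity check, the per-sub-corpus word counts reported in Table 3.1 can be summed and compared with the stated corpus total. A small Python sketch (the dictionary is illustrative, not part of the thesis's tooling):

```python
# Word counts per sub-corpus, as reported in Table 3.1.
counts = {
    "Deserted Village": 7586,
    "Climbing Fireman": 4505,
    "Frog Story": 4907,
    "Spencer Narrative": 5487,
    "Spencer Expository": 7327,
}

# The five sub-corpora sum to the reported total of 29,812 words.
assert sum(counts.values()) == 29812
print(sum(counts.values()))  # -> 29812
```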

3.2.2 The Sub-Corpora

The Hand vs. Computer Collection

The first collection originates from a study of the computer's influence on children's writing, gathered in the autumn of 1996. The writing performance in hand written and computer written compositions by the same subjects was compared (see Sofkova, 1997). Results from this study showed both great individual variation among the subjects and similarities between the two modes, e.g. in the distribution of spelling and segmentation errors, as well as improved performance in the essays written on the computer, especially in the use of punctuation and capitals and in the number of spelling errors.

The subjects were a group of eighteen children, twelve girls and six boys, between the ages of 9 and 11, all pupils at the intermediate level of a primary school. This school was picked because the children had some experience with writing on computers. Computers had already been introduced in their instruction and pupils were free to choose whether to write on a computer or by hand. If they chose to write on a computer, they wrote directly on the computer, using the Claris Works 3.0 word processor. Other requirements were that the subjects should be monolingual and not have any reading or writing disabilities.

The children wrote two essays - one by hand and one on the computer. At the beginning of this study, the children were already busy writing a composition, which now is part of the hand written material. They were given a heading for the hand written task: De kom till en övergiven by 'They arrived in a deserted village'. For the computer written task, pupils were shown a picture of a fireman climbing on a ladder. They were also told not to use the spelling checker when writing, in order to make the two tasks as comparable as possible.

2 A first prototype was developed in the project Reading and Writing in a Linguistic and a Didactic Perspective (Strömqvist and Hellstrand, 1994). An early version of ScriptLog developed for Macintosh computers was used for collecting the data in this thesis (Strömqvist and Malmsten, 1998). There is now also a Windows version (Strömqvist and Karlsson, 2002).


The Frog Story Collection

The second collection comes from a story-telling experiment and involves the same subjects as the Hand vs. Computer Collection. In April 1997, we invited the children to the Department of Linguistics at Göteborg University to take part in the experiment. They served as a control group in the research project Reading and Writing Strategies of Disabled Groups, which aims at developing a unified research environment for contrastive studies of reading and writing processes in language users with different types of functional disabilities.3

The experiment included a production task and the data were elicited in both written and spoken form (video-taped). A wordless picture story booklet, Frog, where are you? by Mercer Mayer (1969), was used: a cartoon-like series of 24 pictures about a boy, his dog and a frog that disappears. Each subject was asked to tell the story, picture by picture. At the beginning of the experiment the children were invited to look through the book to get an idea of the content. Then the instruction was literally Kan du berätta vad som händer på de här bilderna? 'Can you tell what is happening in these pictures?' Half of the children first wrote and then told the story, and half of them did the opposite. For the written task, the on-line process editor ScriptLog was used, storing all the writing activities.

The Spencer Collection

The Spencer Project on Developing Literacy across Genres, Modalities and Languages4 lasted from July 1997 to June 2000. The aim was to investigate the development of literacy in both speech and writing. Four age groups (grade school students, junior high school students, high school students and university students) and seven languages (Dutch, English, French, Hebrew, Icelandic, Spanish and Swedish) were studied. Schools were picked from areas where one could expect few immigrants in the classes, and also where the children had some experience with computers. The subjects came from middle class, monolingual families and had no reading or writing disabilities. Another criterion was that at least one of the subject's parents had education beyond high school.

3 The project's directors are Sven Strömqvist and Elisabeth Ahlsén from the Department of Linguistics, Göteborg University. More information about this project may be found at: http://www.ling.gu.se/~wengelin/projects/r&r.
4 The project was funded by the Spencer Foundation Major Grant for the Study of Developing Literacy to Ruth Berman, Tel Aviv University, who was the coordinator of this project. Each language/country involved has had its own contact person; for Swedish it was Sven Strömqvist from the Department of Linguistics at Lund University.


All subjects had to create two spoken and two written texts, in two genres, expository and narrative. Each subject saw a short video (3 minutes long), containing scenes from a school day. After the video, the procedure varied depending on the order of genre and modality.5 The topic for the narratives was to tell about an event when the subject had rescued somebody, or had been rescued by somebody from a predicament. They were asked to tell how it started, how it went on and how it ended. The topic for the expository text was to discuss the problems they had seen in the video, and possibly give some solutions. They were explicitly asked not to describe the video. Written material for two age groups from the Swedish part of the study is included in the present Child Data: the grade school students (10 year olds) and junior high school students (13 year olds). In total, 20 subjects from each age group were recruited. The texts the subjects wrote were logged in the on-line process editor ScriptLog.

3.3 Error Categories

3.3.1 Introduction

The texts under analysis contain a wide variety of violations of written language norms, at all levels: lexical, syntactic, semantic and discourse. The main focus of this thesis is to analyze and detect grammar errors, but first we need to establish what a grammar error is and what distinguishes a grammar error from, for instance, a spelling error. Punctuation is another category of interest, important for deciding how a grammar error detector should handle a text syntactically.

The following section discusses the categorization of the errors found in the data and explains which errors are classified as spelling errors, as well as where the boundary lies between spelling and grammar errors. The error examples provided are glossed literally and translated into English. Grammatical features are placed within brackets following the word in the English gloss (e.g. klockan 'watch [def]'); the different feature categories are listed in Appendix A. Occurrences of spelling violations are followed by the correct form within parentheses, preceded by '⇒', both in the Swedish example and in the English gloss (e.g. var (⇒ vad) 'was (⇒ what)').

5 There were four different orders in the experiment. Order A: Narrative spoken, Narrative written, Expository spoken, Expository written. Order B: Narrative written, Narrative spoken, Expository written, Expository spoken. Order C: Expository spoken, Expository written, Narrative spoken, Narrative written. Order D: Expository written, Expository spoken, Narrative written, Narrative spoken.


3.3.2 Spelling Errors

Spelling errors are violations of the orthographic norms of a language, such as insertion (e.g. errour instead of error), omission (e.g. eror), substitution (e.g. errer) or transposition (e.g. erorr) of one or more letters within the boundaries of a word, or omission of space between words (i.e. when words are written together) or insertion of space within a word (i.e. splitting a word into parts).

Spelling errors may occur due to the subject's lack of linguistic knowledge of a particular rule (competence errors) or as a typographical mistake, when the subject knows the spelling but makes a motor coordination slip (performance errors). The difference between a competence and a performance error is not always easy to see in a given text. For example, the (nonsense) string gube deviates from the intended correct word gubbe 'old man' by the missing doubling of 'b' and thus violates the consonant gemination rule for this particular word. The text where the error comes from shows that this subject is (to some degree) familiar with this rule, applying consonant gemination in other words, indicating that the error is likely to be a typo (i.e. a performance error) and that it occurred by mistake. On the other hand, the subject may not be aware that this rule applies to this particular word.6 It is then more a question of insufficient knowledge and thus a competence error.

Spelling errors often give rise to non-existent words (non-word errors), as in the example above, but they can also lead to an already lexicalized string (a real word error).7 For example, in the sentence in (3.1), the string damen also violates the consonant doubling rule and deviates from the intended correct word dammen 'dam [def]' by omission of 'm'. However, in this case the resulting string coincides with the existent word damen 'lady [def]'.8 The error still concerns a single word, but differs from non-word errors in that its realization now influences not only the erroneously spelled string but also the surrounding context. The newly-formed word completely changes the meaning of the sentence and gives rise to a sentence with a very peculiar meaning, where a particular lady is not deep.

(3.1) Men *damen (⇒ dammen) är inte så djup.
      but lady [def] (⇒ dam [def]) is not that deep
      – But the dam is not so deep.

Homophones, words that sound alike but are spelled differently, are another example of a spelling error realized as a real word. The classical examples are the words hjärna 'brain' and gärna 'with pleasure', which are often substituted for each other in written production and, as carriers of different meanings, completely change the semantics of the whole sentence.

Another category of words that may result in non-words or real words in writing are the alternative morphological forms in different dialects. For instance, a spoken dialectal variant of the standard final plural suffix -or on nouns, as in flicker 'girls' (the standard form is flick-or 'girls'), is normally not accepted in written form and thus realizes as a non-word in the written language. Other spoken forms, such as jag 'I', normally reduced to ja in speech, coincide with other existent words and form real word errors in writing. In this case ja is homonymous with the interjection (or affirmative) ja 'yes'. In neither case is it clear whether the spoken form is used intentionally as some kind of stylistic marker or is spelled in this way due to competence or performance insufficiency, meaning that the subject either had not acquired the written norm or made a typographical error.

Spelling errors are then violations of characters (or spaces) in single isolated words, that form (mostly) non-words or real words, the latter causing ungrammaticalities in text.

6 The word gubbe 'old man' was used only once in the text.
7 Usually around 40% of all misspellings result in lexicalized strings (e.g. Kukich, 1992). The notion of non-word vs. real word spelling errors is a terminology used in research on spelling (cf. Kukich, 1992; Ingels, 1996).
8 Consonant doubling is used for distinguishing short and long vowels in Swedish.
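The four single-letter operations defined above (insertion, omission, substitution, transposition) can be identified mechanically when the intended word is known. A short Python sketch (an illustration only; `classify_edit` is a hypothetical helper, not part of the thesis's finite state implementation):

```python
def classify_edit(err: str, target: str):
    """Name the single edit operation that turns the intended word
    `target` into the misspelling `err`, or return None if the two
    strings differ by more than one such operation."""
    if err == target:
        return None
    le, lt = len(err), len(target)
    if le == lt + 1:                      # one extra letter in the error
        for i in range(le):
            if err[:i] + err[i + 1:] == target:
                return "insertion"
    if le == lt - 1:                      # one letter missing
        for i in range(lt):
            if target[:i] + target[i + 1:] == err:
                return "omission"
    if le == lt:
        diffs = [i for i in range(le) if err[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and err[diffs[0]] == target[diffs[1]]
                and err[diffs[1]] == target[diffs[0]]):
            return "transposition"
    return None

# The four example misspellings of "error" from the text:
for bad in ["errour", "eror", "errer", "erorr"]:
    print(bad, "->", classify_edit(bad, "error"))
```

The same function labels the Swedish examples: gube/gubbe and damen/dammen both come out as omissions.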

3.3.3 Grammar Errors

Grammar errors violate (mostly) the syntactic rules of a language, such as feature agreement or the order or choice of constituents in a phrase or sentence, and thus concern a wider context than a single word.9 Like spelling errors, a grammar error may occur due to the subject's insufficient knowledge of such language rules. However, the difference is that when learning to write as a native speaker (like the subjects in this study), only the written language norms that deviate from the already acquired (spoken) grammatical knowledge have to be learned. As mentioned earlier, research reveals that native speakers make not only errors reflecting the norms of the group one belongs to, as one might expect, but also other grammar errors that have been ascribed to the influence of computers on writing. That is, even a native speaker can make grammar errors when writing on a computer, due to rewriting or rearranging text.

Again, the real cause of an error is not always clear from the text. For instance, in the noun phrase denna uppsatsen 'this [def] essay [def]' a violation of definiteness agreement occurs, since the demonstrative determiner denna 'this' normally requires the following noun to be in the indefinite form. In this case, the form denna uppsats 'this [def] essay [indef]' is the correct one (see Section 4.3.1). However, in certain regions of Sweden this construction is grammatical in speech. This means that the error appears to be a competence error if the subject is not familiar with the written norm and applies the acquired spoken norm. On the other hand, it could also be a typographical mistake, as would be the case if the subject first used a determiner like den 'the/that [def]', which requires the following noun to be in the definite form, and then changed the determiner to the demonstrative one but forgot to change the definite form of the subsequent noun to indefinite.

In earlier research grammar errors have been divided along two lines. Some researchers characterize the errors by applying the same operations as for orthographic rules also at this level, with omissions, insertions, substitutions and transpositions of words. Feature mismatch is then treated as a special case of substitution (e.g. Vosse, 1994; Ingels, 1996). For instance, in the incorrect noun phrase denna uppsatsen 'this [def] essay [def]' the required indefinite noun is substituted by a definite noun. Word choice errors, such as incorrect verb particles or prepositions, are other examples of grammatical substitution. Word order errors occur as transpositions of words, i.e. all the correct words are present but their order is incorrect. Missing constituents in sentences concern omission of words, whereas redundant words concern insertion. Others separate feature mismatch from other error types and distinguish between structural errors, which include violations of the syntactic structure of a clause, and non-structural errors, which concern feature mismatch (e.g. Bustamente and León, 1996; Sågvall Hein, 1998a).

9 Choice of words may also lead to semantic or pragmatic anomaly.

3.3.4 Spelling or Grammar Error?

As mentioned at the beginning of this section, writing errors occur at all levels, including lexicon, syntax, semantics and discourse. The nature of an error is sometimes obvious, but in many cases it is unclear how to classify errors. The final version of a text gives very little hint of what was going on in the writer's mind at the time of text production.10 Some kind of classification of writing errors is necessary, however, for their detection and diagnosis.

Consider for instance the sentence in (3.2), where a (non-finite) supine verb form försökt 'tried [sup]' is used as the main verb of the second sentence. The word in isolation is an existent word in Swedish, but syntactically a verb in the supine form is ungrammatical as the predicate of a main clause (see Section 4.3.5). This non-finite verb form has to be preceded by a (finite) temporal auxiliary verb (har försökt 'have [pres] tried [sup]' or hade försökt 'had [pret] tried [sup]') or the form has to be exchanged for a finite verb form, such as the present (försöker 'try [pres]') or the preterite (försökte 'tried [pret]'). In regard to the tense used in the preceding context, the last alternative, the preterite form, would be the best choice.

(3.2) Han tittade på hunden. Hunden *försökt att klättra ner.
      he looked [pret] at the-dog the-dog tried [sup] to climb down
      – He looked at the dog. The dog tried to climb down.

10 Probably some information could be gained from the log-files in the ScriptLog versions, but since not all data in the corpus are stored in that format, such an analysis has not been included in this thesis.

The problem of classification lies in the fact that although one single letter distinguishes the word from the intended preterite form, which could then be considered an orthographic violation, the error is realized not as a new word; rather, another form of the intended word is formed. This error could occur as a result of editing, if the writer first used a past perfect tense (hade försökt 'had tried') and later changed the tense to the preterite (försökte 'tried') by removing the temporal auxiliary verb, but forgot also to change the supine form (försökt 'tried [sup]') to the correct preterite form. On the other hand, the correct preterite tense could have been intended by the subject from the start. Then it is rather a question of a (real word) spelling error: the subject intended from the beginning to write a preterite form, but intentionally or unintentionally omitted the final vowel -e, which happens to be a distinctive suffix for this verb.

In the next example (3.3), a gender agreement error occurs between the neuter determiner det 'the' and the common gender noun ända 'end', as a result of replacing enda 'only' with ända 'end'. The erroneous word is an existent word and differs from the intended word only in the single letter at the beginning (an orthographic violation). This is clearly a spelling error, since the word does not form any other form of the intended word and is realized as a completely new word with a distinct meaning.

(3.3) Det *ända (⇒ enda) jag vet om ...
      the [neu] end [com] (⇒ only) I know about
      – The only thing I know about ...

In the grammar checking literature, the categorization of writing errors is primarily divided into word-level errors and errors requiring context larger than a word (cf. Sågvall Hein, 1998a; Arppe, 2000). Real word spelling errors were treated in Scarrie's Error Corpora Database as errors requiring wider context for recognition and were categorized in accordance with the means used for their detection. In other words, errors either belong to the category of grammar errors when violating syntactic rules, or are otherwise placed in the style, meaning and reference category (Wedbjer Rambell, 1999a, p.5). In this thesis, where grammar errors (syntactic violations) are the main focus, real word spelling errors will be classified as a separate category. This distinction is important for examination of the real nature of such errors, especially when presenting a diagnosis to the user. Such considerations are especially important when the user is a beginning writer.

Obvious cases of spelling errors such as the one in (3.3) are treated as such, whereas the treatment of errors lying on the borderline between a spelling and a grammar error, as in (3.2), depends on:

• what type of new formation occurred (another form of the same lemma or a new lemma)
• what type of violation occurred (change in letter, morpheme or word)
• what level is influenced (lexical, syntactic or semantic)

These principles are primarily aimed at the unclear cases, but seem to be applicable to other real word violations as well. The fact is that a majority of real word spelling errors form new words and violate semantics rather than syntax, and just a few of them "accidentally" cause syntactic errors (see further Section 5.3.2). It is the ones that form other forms of the same lemma that are tricky. They are treated here as grammar errors, but for diagnosis it is important to bear in mind that they could also be spelling errors.

Figure 3.1 shows a scheme for error categorization. All violations of the written norm are categorized starting with whether the error is realized as a non-word or a real word. Non-words are always classified as spelling errors. Real word errors are then further considered with regard to whether they form other forms of the same lemma or whether new lemmas are created. In the case of the same lemma (as in (3.2)), errors are classified as grammar errors. When new lemmas are formed, syntactic or semantic errors occur. Here a distinction is made between whether just a single letter is influenced, categorizing the error as a spelling error, or a whole word was substituted, categorizing it as a grammar error. For errors realized as real words, the following principles for error categorization then apply:11

(3.4) (i) All real word errors that violate a syntactic rule and result in other forms of the same lemma are classified as grammar errors.
      (ii) All real word errors resulting in new lemmas by a change of the whole word are classified as grammar errors.
      (iii) All real word errors resulting in new lemmas by a change in (one or more) letter(s) are classified as spelling errors.

11 Homophones are excepted from principle (ii). They certainly form a new lemma by a change of the whole word, but they are related to how the word is pronounced and are thus considered spelling errors.


Figure 3.1: Principles for Error Categorization

For the above example (3.2), this means the following. Considering the word in isolation, försökt 'tried [sup]' is an existent word in Swedish. Considering its deviation from the intended preterite form, no new lemma is created; rather, another form of the same lemma is formed, one that happens to lack the final suffix, realized as a single vowel. Considering the context it appears in, a syntactic violation occurs, since the sentence has no finite verb. So, according to principle (i) for error categorization in (3.4), this error is classified as a grammar error, since no new lemma was created; the required preterite form was simply replaced by a supine form of the same verb. The error in (3.3) also involves a real word, but here a new lemma was created by substitution of a letter. The error is then, according to principle (iii) in (3.4), considered to be a spelling error, since no other form of the same lemma and no substitution of the whole word occurred.
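The decision scheme of Figure 3.1 and principles (3.4) can be written down as a small decision function. A Python sketch (illustrative only; the function and its boolean parameters are hypothetical names, and the homophone exception of footnote 11 is folded in as an extra flag):

```python
def categorize(real_word: bool, same_lemma: bool,
               whole_word_swap: bool, homophone: bool = False) -> str:
    """Categorize a written-norm violation following principles (3.4):
    non-words are always spelling errors; real-word errors forming
    another form of the same lemma are grammar errors (i); new lemmas
    are grammar errors when the whole word was exchanged (ii) and
    spelling errors when only letters changed (iii). Homophones are
    excepted from (ii) and count as spelling errors."""
    if not real_word:
        return "spelling error"
    if same_lemma:
        return "grammar error"            # e.g. försökt for försökte, (3.2)
    if whole_word_swap and not homophone:
        return "grammar error"            # whole word exchanged, (ii)
    return "spelling error"               # letter change or homophone, (iii)

# (3.2) försökt/försökte: real word, same lemma        -> grammar error
print(categorize(True, True, False))
# (3.3) ända/enda: real word, new lemma, letter change -> spelling error
print(categorize(True, False, False))
```

The flags encode judgments a human (or a lexicon plus morphological analysis) must supply; the sketch only makes the branching of the scheme explicit.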

3.3.5 Punctuation

Research on sentence development and the use of punctuation reveals that children mark out entities that are content-driven rather than syntactically driven (e.g. Kress, 1994; Ledin, 1998). They form larger textual units, for instance, by joining together sentences that are "topically closely connected", according to Kress (1994). In speech, such sequences would be joined by intonation due to topic. An example of such adjoined clauses is "The boy I am writing about is called Sam he lived in the fields of Biggs Flat." (Kress, 1994, p.84). Others use a strategy of linking together sentences with connectives like 'and', 'then', 'so' instead of punctuation marks, which can result in sentences of great length, here called long sentences (see Section 4.6 for examples).

As we will see later in Chapter 5, the Swedish grammar checking systems are based on texts written by adults and are able to rely on punctuation conventions for marking syntactic sentences in their detection rules, or for scanning a text sentence by sentence. In light of the above discussion, this is not possible with the present data, which consist of texts written by children. Occurrences of adjoined and long sentences are quite probable. In other words, analysis of the use of punctuation is important to confirm that the subjects of the present study also mark larger units. Thus, omissions of sentence boundaries are expected and have to be taken into consideration.

3.4 Types of Analysis

The analysis of the Child Data starts with a general overview of the corpus, including frequency counts of words, word types and all spelling errors. The main focus is a descriptive error-oriented study of all errors above the lexical level, i.e. all that influence context. Only spelling errors resulting in non-words are not part of this analysis. The error types included are:

1. Real word spelling errors - misspelled words and segmentation errors resulting in existent words.
2. Grammar errors - syntactic and semantic violations in phrases and sentences.
3. Punctuation - sentence delimitation and the use of major delimiters and commas.

The main focus lies on the second group, the grammar errors. Real word spelling errors and grammar errors are listed as separate error corpora - see Appendix B.1 for grammar errors, Appendix B.2 for misspelled words and Appendix B.3 for segmentation errors. All errors are represented with the surrounding context of the clause they appear in (in some cases greater parts are included, e.g. in the case of referential errors). Errors are indexed and categorized by the type of error and annotated with information about the possible correction (intended word) and the error's origin in the core data.


The analysis also includes descriptions of the overall distribution of error types and error density. Comparisons are made between errors found in the different sub-corpora and by age. Here it is important to bear in mind that the texts were gathered under different circumstances and that not all subjects participated in all the experiments (see Section 3.2).

Error frequencies are related differently depending on the error type. Spelling errors, which concern isolated words, are related to the total number of words. In the case of grammar errors, the best strategy would be to relate some error types to phrases, some to clauses or sentences and some to even bigger entities in order to get an appropriate comparison measure. However, counting such entities is problematic, especially in texts that contain many structural errors. The best solution is to compare frequencies of the attested error types, which will reflect the error profile of the texts.

The main focus in the analysis of the use of punctuation in this thesis is not the syntactic complexity of sentences, but rather whether the children mark larger units than syntactic sentences and whether they use sentence markers in wrong ways. The most intuitive procedure would be to compare the orthographic sentences, i.e. the real markings done by the writers, with the ("correct") syntactic sentences. The main problem with such an analysis is that in the case of long sentences it will often be hard to decide where to draw the line, since they are for the most part syntactically correct. Several solutions for delimitation into syntactic sentences may be available.12 The subjects' own orthographic sentences will instead be analyzed by length, in terms of the number of words, and by the occurrence of adjoined clauses. Further, erroneous use of punctuation marks will be accounted for. Analysis of the use of connectives as sentence delimiters would certainly be appropriate here, but we leave this for future research.
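For word-bound error types such as spelling errors, the density measure just described is a simple rate over the word total. A minimal Python sketch (the function name and the error count used in the example are hypothetical; only the 7,586-word sub-corpus size comes from Table 3.1):

```python
def per_100_words(n_errors: int, n_words: int) -> float:
    """Error density normalized to errors per 100 words, the measure
    used for error types that attach to single words."""
    return round(100.0 * n_errors / n_words, 2)

# Hypothetical figure: 150 spelling errors in the 7,586-word
# Deserted Village sub-corpus.
print(per_100_words(150, 7586))  # -> 1.98
```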
All error examples represent errors found in the Child Data corpus. The example format includes the error index in the corresponding error corpora (G for grammar errors (Appendix B.1), M for misspelled words (Appendix B.2) and S for segmentation errors (Appendix B.3)) and, as already mentioned, the text is glossed and translated into English, with grammatical features (see Appendix A) attached to words and spelling violations followed by the correct form within parentheses, preceded by a double right-arrow '⇒'.

12 The macro-syntagm (Loman and Jörgensen, 1971; Hultman and Westman, 1977) and the T-unit (Hunt, 1970) are other units of measure, more related to the investigation of sentence development and grammatical complexity in education-oriented research in Sweden and America, respectively.


3.5 Error Coding and Tools

3.5.1 Corpus Formats

In order to be able to carry out automatic analyses of the collected material, the hand written texts were converted to a machine-readable format and compiled with the computer written texts to form one corpus. All the texts were transcribed in accordance with the CHAT-format (see (3.5) below) and coded for spelling, segmentation and punctuation errors and some grammar errors. Other grammar errors were identified and extracted either manually or by scripts specially written for the purpose. Non-word spelling errors were corrected in the original texts in order to be able to test the text in the developing error detector, which includes no spelling checker. The spelling checker in Word 2001 was used for this purpose. The original Child Data corpus now exists in three versions: the original texts in machine-readable format, a coded version in CHAT-format and a spell-checked version. The version free from non-words was used as the basis for the manual grammar error analysis and as input to the error detector in progress and the other grammar checking tools that were tested (see Chapter 5).

3.5.2 CHAT-format and CLAN-software

The CHAT (Codes for the Human Analysis of Transcripts) transcription and coding format and the CLAN (Computerized Language Analysis) program are tools developed within the CHILDES (Child Language Data Exchange System) project (first conceived in 1981), a computerized exchange system for language data (MacWhinney, 2000). This software is designed primarily for transcription and analysis of spoken data. It is, however, practical to apply this format to written material in order to take advantage of the quantitative analysis that this tool provides. For instance, the current material includes a lot of spelling errors that can easily be coded, and a corresponding correct word may be added following the transcription format. This means that not only the number of words, but also the correct number of word types may be included in the analysis. Analyses concerning, for instance, the spelling of words may also be easily extracted.

In practice, conversion of a written text to CHAT-format involves the addition of an information field and division of the text into units corresponding to "speaker's lines", since the transcript format is adjusted to spoken material. The information field at the beginning of a transcript usually includes information on the subject(s) involved, the time and location of the experiment, the type of material coded, the type of analysis done, the name of the transcriber, etc. Speaker's lines in spoken

Data Collection and Analysis


material correspond naturally to utterances. For the written material, we chose to use a finite clause as the corresponding unit, which means that every line must include a finite verb, except for imperatives and titles, which form their own “speaker’s lines”. The whole transcript includes just one participant, as it is a monologue. The transcribed text example in (3.5) below, taken from the corpus, follows the CHAT-format: the information field comprises all the lines at the beginning of the text starting with the @-sign, and lines starting with *SBJ: correspond to the separate clauses in the text. Comments can be inserted in brackets in speaker’s lines, e.g. [+ tit] indicating that the line corresponds to the title of the text. The intended word is given in brackets following a colon, e.g. & [: och] ‘and’. Relations to more than one word are indicated by the angle-bracket signs < >, which enclose the whole segment, e.g. <över jivna> [: övergivna] ‘abandoned’. Other signs and codes can be inserted in the transcription.13

(3.5)
@Begin
@Participants: SBJ Subject
@Filename: caan09mHW.cha
@Age of SBJ: 9
@Birth of SBJ: 1987
@Sex of SBJ: Male
@Language: Swedish
@Text Type: Hand written
@Date: 10-NOV-1996
@Location: Gbg
@Version: spelling, punctuation, grammar
@Transcriber: Sylvana Sofkova Hashemi
*SBJ: de kom till en överjiven [: övergiven] by [+ tit]
*SBJ: vi kom över molnen jag & [: och] per på en flygande gris
*SBJ: som hete [: hette] urban .
*SBJ: då såg jag nåt [: något]
*SBJ: som jag aldrig har set [: sett] .
*SBJ: en ö som var helt [: igentäckt] av palmer
*SBJ: & [: och] i miten [: mitten] var en by av äkta guld .
*SBJ: när vi kom ner .
*SBJ: så gick vi & [: och] titade [: tittade] .
*SBJ: vi såg ormar spindlar krokodiler ödler [: ödlor] & [: och] anat [: annat] .
*SBJ: när vi hade gåt [: gått] en lång bit så sa [: sade] per .
*SBJ: vi [: vilar] oss .
*SBJ: per luta [: lutade] sig mot en .
*SBJ: palmen vek sig
*SBJ: & [: och] så åkte vi ner i ett hål .
*SBJ: sen [: sedan] svimag [: svimmade jag] .
*SBJ: när jag vakna [: vaknade] .
*SBJ: satt jag per & [: och] urban mit [: mitt] i byn .
*SBJ: vi gick runt & [: och] titade [: tittade] .
*SBJ: alla hus var [: övergivna] .

13 Further information about this transcription format and coding, including manuals for download, may be found at: http://childes.psy.cmu.edu/.

Chapter 3.


*SBJ: då sa [: sade] per .
*SBJ: vi har hitat den <över jivna> [: övergivna] byn .
@End

The β-value determines whether precision is of greater value (β > 1) or whether recall is of greater value (β < 1). When both recall and precision are equally important the value of β is 1 (β = 1).

Chapter 5.


error types defined in Grammatifix (including style, punctuation, formatting and grammar errors) and also set the maximum length of a sentence in number of words. The tool also provides a report on the text’s readability, including the sum of tokens, words, sentences and paragraphs. The mean score of these is computed, providing a readability index. One diagnosis of the error is always given, and usually a suggestion for correction.

Granska
The web-based demonstrator of Granska includes no interactive mode, and spelling and grammar are checked independently, based on the tagging information. The user may choose a presentation format for the result that includes either all sentences with comments on spelling and grammar or only the erroneous sentences. Further adjustments include whether to display error corrections, whether to show the result of tagging, and whether a newline is interpreted as end of sentence. The last setting is quite important for children’s writing, where punctuation is often absent or not used properly and the use of new lines is also arbitrary, i.e. a new line in the middle of a sentence is not unusual. In some cases Granska also yields more than one suggestion for error correction, and there is a possibility of constructing new detection rules. Long stretches of text without any punctuation or new lines (usual in children’s writing) are apparently hard for the tool to handle: it simply rejects the text without any output.

Scarrie
The web demonstrator of Scarrie likewise includes no interactive mode. Individual sentences (or a longer text) can be entered, with requirements on end-of-sentence punctuation. Both spelling and grammar are checked and the results of detection are displayed at the same time. Errors are highlighted and a diagnosis is displayed in the status bar. The system gives no suggestions for correction.

5.5.4 The Systems’ Detection Results

In this section I present the results of the systems’ performance on Child Data. For every error type I first present to what extent the errors are explicitly covered according to the systems’ specifications, and then I proceed system by system, presenting the detection results for the particular error type and discussing which errors were actually detected, which were incorrectly diagnosed, the characteristics of errors that were not found, and false alarms. A short conclusion ends every error

Error Detection and Previous Systems


type presentation. Exemplified errors from Child Data refer either to previously discussed samples or directly to the index numbers of the error corpus presented in Appendix B.1. A system’s diagnosis is presented exactly as given by the particular system. All detection results are summarized and the overall performance is presented in Section 5.5.5.

Agreement in Noun Phrases
Most of the errors in Child Data concern definiteness in the noun and gender or number in the determiner of the noun phrase, errors that, according to the error specifications, are explicitly covered by all three tools. They all also check for errors in masculine gender of adjectives and agreement between the quantifier and the noun in partitive constructions. The latter type found in Child Data concerns the form of the noun rather than the form of the quantifier (see (4.11) on p.50).

Grammatifix detected seven errors in definiteness and gender agreement. One of the errors in the masculine form of adjective was only detected in part and was given a wrong diagnosis. The error concerns inconsistency in the use of adjectives (previously discussed in (4.9) on p.49): either both adjectives should carry the masculine gender form or both should have the unmarked form. The error detection by Grammatifix is exemplified in (5.18), where we see that due to the split noun, the error was diagnosed as a gender agreement error between the common-gender determiner den ‘the [com]’ and the first part of the split, troll ‘troll [neu]’, which is neuter. An interesting observation is that when the split noun is corrected to the proper word trollkarlen ‘magician [com,def]’, Grammatifix does not react and the error in the adjectives is not discovered. Grammatifix checks only whether the masculine form of an adjective occurs together with a non-masculine noun, but not consistency of use, as in this error sample.

(5.18)
ALARM: det va *den hemske fula troll karlen (⇒ trollkarlen) tokig som ...
 it was the [com,def] awful [masc,wk] ugly [wk] troll [neu,indef] man [com,def] (⇒ magician [com,def]) Tokig that ...
GRAMMATIFIX’S DIAGNOSIS: Check the word form den ‘the [com,def]’. If a determiner modifies a noun with neuter gender, e.g. troll ‘troll’, the determiner should also have neuter gender ⇒ det ‘the [neu,def]’
– It was the awful ugly magician Tokig that ...

In general, simple constructions with a determiner and a noun are detected, whereas more complex noun phrases were missed. Three errors in the definiteness form of the noun were overlooked (G1.1.1, G1.1.2 - see (4.2) p.46, G1.1.3 - see (4.3) p.46). Concerning gender agreement, one error involving the masculine form of an adjective was missed (G1.2.4 - see (4.8) p.48). None of the errors in number agreement were detected: one with a determiner error (G1.3.1 - see (4.10) p.49) and two with partitive constructions (G1.3.2 - see (4.11) p.50, G1.3.3). Grammatifix made altogether 20 false assumptions, of which 16 involved other error categories, mostly splits (12 false alarms), such as the one in (5.19):

(5.19)
ALARM: det var ett stort sten hus
 it was a [neu] big [neu] stone [com] house [neu]
GRAMMATIFIX’S DIAGNOSIS: Check the word form ett ‘a [neu]’. If a determiner modifies a noun with common gender, e.g. sten ‘stone [com]’, the determiner should also have common gender ⇒ en ‘a [com]’
– It was a big stone-house.

The overall performance for Grammatifix’s detection of errors in noun phrase agreement then amounts to 53% recall and 29% precision.

Granska detected six errors in definiteness and two in gender agreement, one of them in a partitive noun phrase (G1.2.2). In three cases, where the error concerned the definiteness form of the noun, Granska suggested instead changing the determiner (and adjective), correcting G1.1.7 as den räkningen ‘the [def] bill [def]’ instead of en räkning ‘a [indef] bill [indef]’ (see (4.6) p.47). The same happened for error G1.1.8, where en kompisen ‘a [indef] friend [def]’ is corrected as den kompisen ‘the [def] friend [def]’, and the opposite for G1.1.2, where the definite determiner and adjective in den hemska pyroman ‘the [def] awful pyromaniac [indef]’ are changed to indefinite forms instead of changing the form of the noun to definite (see (4.2) p.46). Two errors in definiteness agreement (G1.1.1, G1.1.3 - see (4.3) p.46), none of the errors in masculine form of adjective (G1.2.3 - see (4.9) p.49, G1.2.4 - see (4.8) p.48) and all errors in number agreement were left undiscovered by Granska. Grammatical coverage for this error type thus results in 53% recall. Some false alarms occurred (25), of which 17 included other error categories, with splits the most represented (9 false alarms), resulting in a slightly lower precision rate of 24% in comparison to Grammatifix.

Scarrie detected six errors in definiteness agreement, one in gender agreement in a partitive noun phrase, two in the masculine form of adjective and one in number agreement. In the case of number agreement, the error in det tre tjejerna ‘the [sg] three girls [pl]’ (G1.3.1 - see (4.10) p.49) is incorrectly diagnosed as an error in the noun instead of in the determiner.
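The recall and precision percentages reported in this section follow the standard definitions: recall is the share of all errors of a given type that a system detects, and precision the share of its alarms that are true detections. A small sketch (the counts of 8 detections, 15 errors and 20 false alarms are inferred from the Grammatifix discussion above, so treat them purely as an illustration) reproduces the reported 53%/29%, together with the β-weighted F-measure used in this chapter’s evaluation (β = 1 weights recall and precision equally):

```python
def recall(detected: int, total_errors: int) -> float:
    """Share of the real errors that the system found."""
    return detected / total_errors

def precision(detected: int, false_alarms: int) -> float:
    """Share of the system's alarms that were real errors."""
    return detected / (detected + false_alarms)

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1: balanced)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

r = recall(8, 15)     # 8 detections out of 15 NP agreement errors
p = precision(8, 20)  # 8 detections against 20 false alarms
print(f"recall={r:.0%} precision={p:.0%} F1={f_measure(p, r):.2f}")
# -> recall=53% precision=29% F1=0.37
```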


Exactly as in Grammatifix, Scarrie detected the error in G1.2.3 due to the split noun and gave the same incorrect diagnosis (see (5.18) above). The missed errors include two errors in definiteness in the noun, one with a possessive determiner (G1.1.4 - see (4.4) p.47) and one with an indefinite determiner (G1.1.7 - see (4.6) p.47). One error concerned gender agreement with an incorrect determiner with a compound noun (G1.2.1 - see (4.7) p.48). Finally, two errors in number of the noun in partitive constructions were not detected (G1.3.2 - see (4.11) p.50, G1.3.3). Many false alarms occurred (133) and 50 of them concerned other error categories, mostly splits (33 false alarms), as in (5.20):

(5.20)
ALARM: han tittade i ett jord hål
 he looked into a [neu] ground [com] hole [neu]
SCARRIE’S DIAGNOSIS: wrong gender
– He looked into a hole in the ground.

Others involved spelling errors (10 false alarms), as in (5.21), where the pronoun vad ‘what’ is written as var and interpreted as the pronoun ‘each’, which does not agree in number with the following noun.

(5.21)
ALARM: Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var (⇒ vad) tjejernas metoder är.
 self think I that the-boys’ methods [pl] are more open and honest but also more mean than each [sg] (⇒ what) the-girls’ [pl] methods are
SCARRIE’S DIAGNOSIS: wrong number
– I think myself that the boys’ methods are more open and honest but also more mean than the girls’ methods are.

Some false flaggings also concerned sentence boundaries (7 false alarms), as in (5.22):

(5.22)
ALARM: pojken gick till fönstret och ropade på grodan men vad dumt hunden har fastnat i burken där grodan var.
 the-boy went to the-window and shouted at the-frog but what silly [neu] the-dog [com] had stuck in the-pot there the-frog was
SCARRIE’S DIAGNOSIS: wrong form in adjective
– The boy went to the window and shouted at the frog, but how silly, the dog got stuck in the pot where the frog was.

But mostly, ambiguity problems occurred (83 false alarms), as in (5.23a) and (5.23b):

(5.23) a.
ALARM: dessutom luktade det saltgurka.
 besides smelled it/the [neu] pickle-gherkin [com]
SCARRIE’S DIAGNOSIS: wrong gender
– Besides it smelled like pickled gherkin.

b.
ALARM: Jag trampade rakt på den och skar upp hela min vänstra fot.
 I walked right on it and cut up whole my left [pl,def] foot [sg,indef]
SCARRIE’S DIAGNOSIS: wrong number

The coverage for this error type in Scarrie is 67%, but the high number of false alarms results in a very low precision value of only 7%. In conclusion, only Scarrie detected more than half of the errors in noun phrase agreement, but at the cost of many false alarms. Grammatifix and Granska displayed similarities in the detection of this error type, detecting almost the same errors, and their false alarms are also not that many. Scarrie’s coverage is different from the other tools, and the high number of false alarms considerably decreased its precision score for this error type. All tools failed to find the erroneous forms in the head nouns of the partitive noun phrases (G1.3.2 - see (4.11) p.50, G1.3.3), which are most likely not defined in the grammars of these systems.
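The determiner–noun checks that all three tools perform on simple noun phrases can be illustrated with a toy sketch. The two-entry-per-class lexicon and feature labels below are invented for illustration; the real systems use full morphological lexica and grammars. Note how a split compound such as sten hus triggers exactly the kind of false alarm discussed above:

```python
# Toy lexicon: word -> (category, gender). Entries invented for illustration.
LEXICON = {
    "den": ("det", "com"), "det": ("det", "neu"),
    "en":  ("det", "com"), "ett": ("det", "neu"),
    "hus": ("noun", "neu"), "sten": ("noun", "com"),
    "by":  ("noun", "com"),
}

def gender_alarms(tokens):
    """Flag determiner+noun bigrams whose gender features clash."""
    alarms = []
    for first, second in zip(tokens, tokens[1:]):
        d, n = LEXICON.get(first), LEXICON.get(second)
        if d and n and d[0] == "det" and n[0] == "noun" and d[1] != n[1]:
            alarms.append((first, second))
    return alarms

# The split compound 'sten hus' (for 'stenhus') makes 'ett sten' look like
# a gender disagreement -- the false-alarm pattern seen in (5.19).
print(gender_alarms("det var ett sten hus".split()))  # -> [('ett', 'sten')]
```

On a correctly written compound the bigram never arises, which is why split errors dominate the false alarms of all three systems for this error type.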


Agreement in Predicative Complement
All the tools cover errors in both number and gender agreement in the predicative complement. These error types in Child Data are, however, in most cases represented by rather complex phrase structures and will thus at most result in three detections.

Grammatifix detected only one instance of all the agreement errors in predicative complement (G2.2.6), and yielded an incomplete analysis of this particular error. It failed because only the context of a single sentence is taken into consideration. Due to ambiguity in the noun between a singular and plural form, Grammatifix detected this error as gender agreement, but should suggest the plural form instead, which is clear from the preceding context (see (5.1) and the discussion on detection possibilities in Section 5.3, p.119). Grammatifix obtained very low recall (13%) for this error type. Three false alarms (one with a split) result in a precision value of 25%.

The three simple constructions with agreement errors in the predicative complement were all detected by Granska (G2.1.1 - see (4.12) p.51, G2.2.3 - see (4.13) p.51, G2.2.6 - see (5.1) p.119). In the case of G2.2.6 discussed above, the plural alternative is suggested. In error G2.2.3, the predicative complement includes a coordinated adjective phrase with errors in all three adjectives. Granska detected the first part:

(5.24)
ALARM: Själv tycker jag att killarnas metoder är mer *öppen och *ärlig men också mer *elak än var (⇒ vad) tjejernas metoder är.
 self think I that the-boys’ [pl] methods [pl] are more open [sg] and honest [sg] but also more mean [sg] than was (⇒ what) the-girls’s methods are
GRANSKA’S DIAGNOSIS: If öppen ‘open [sg]’ refers to metoder ‘methods [pl]’, that is an agreement error ⇒ killarnas metoder är mer öppna ‘the boys’ [pl] methods [pl] are more open [pl]’
– I think myself that the boys’ methods are more open and honest but also more mean than the girls’ methods.

Granska thus obtained a coverage value of 38% for this error type; with 5 false alarms (including one with a split and one with a spelling error), the precision rate is also 38%.

In the case of Scarrie, no errors in predicative complement agreement were detected; only 13 false flaggings occurred, which leaves this category with no results for recall or precision. The false alarms occurred due to incorrectly chosen segments, as in the following examples. In (5.25a) we have a compound noun phrase, where only the second part is considered and interpreted as a singular noun that does not agree with the plural adjective phrase as its predicative complement. In (5.25b) the verb pratade ‘spoke [pret]’ is interpreted as a plural past participle form and is considered as not agreeing with the preceding singular pronoun hon ‘she [sg]’.

(5.25) a.
ALARM: Han och hans hund var mycket stolta över den.
 he and his dog [sg] were/was very proud [pl] over it
SCARRIE’S DIAGNOSIS: wrong number in adjective in predicative complement
– He and his dog were very proud over it.

b.
ALARM: då sa jag till dom och våran lärare att hon blev mobbad och efter det så pratade läraren med dom som mobbade henne och då slutade dom med det.
 then said I to them and our teacher that she [sg] was harassed and after that so spoke [pl] the-teacher with them that harassed her and then stopped they with that
SCARRIE’S DIAGNOSIS: wrong number in adjective in predicative complement
– Then I said to them and our teacher that she was harassed and after that the teacher spoke to them that harassed her and then they stopped with that.

In conclusion, only Granska detected at least the simplest forms of agreement errors in the predicative complement. The other tools had problems with selecting correct segments, especially Scarrie with its high number of false alarms.

Pronoun Form Errors
All three tools check explicitly for pronoun case errors after certain prepositions. Three of the four error instances in Child Data are preceded by a preposition.

Grammatifix found two errors in the form of the pronoun in the context of different prepositions (G4.1.1 - see (4.19) p.54, G4.1.3). No false flagging occurred.

Granska found three errors in the context of the prepositions efter ‘after’ and med ‘with’ (G4.1.1 - see (4.19) p.54, G4.1.4, G4.1.5 - see (4.18) p.53), which gives a recall rate of 60%. However, many false alarms (24) occurred, involving conjunctions being interpreted as prepositions (17 flaggings) or prepositions at a sentence boundary where punctuation is missing (5 flaggings), resulting in a very low precision value of 11%. In (5.26a) we see an example of a false alarm with the conjunction för ‘because’, and in (5.26b) one with a preposition ending a sentence, followed by a personal pronoun as the subject of the next sentence:

(5.26) a.
ALARM: Vi skulle åka in i hamnen för hon skulle berätta något för sin mamma.
 we would go in into the-port for she [nom] would tell something for her mother
GRANSKA’S DIAGNOSIS: Erroneous pronoun form, use object form ⇒ för henne ‘for her [acc]’
– We would go into the port because she should tell something to her mother.

b.
ALARM: ... och jag kom då tänka på den byn vi va (⇒ var) i jag berätta (⇒ berättade) om byn och dom sa att det va (⇒ var) deras by.
 and I came then think at the the-village we what (⇒ were) in I [nom] tell (⇒ told) about the-village and they said that it what (⇒ was) their village
GRANSKA’S DIAGNOSIS: Erroneous pronoun form, use object form ⇒ i mig ‘in me [acc]’
– and I came to think at the village we were in. I told about the village and they said that it was their village.

Scarrie also found three error instances (G4.1.1 - see (4.19) p.54, G4.1.3, G4.1.4), all with different prepositions. False flaggings occurred here, too, due to ambiguity problems, as for example in (5.27) and (5.28).

(5.27)
ALARM: Jag gick och gick tills jag hörde Pappa skrika kom kom
 I walked and walked until I heard daddy scream come come
SCARRIE’S DIAGNOSIS: wrong form of pronoun
– I walked and walked until I heard daddy scream: Come! Come!

(5.28) a.
ALARM: Erik frågade om han kunde få ett barn.
 Erik asked if OR about he could get a child
SCARRIE’S DIAGNOSIS: wrong form of pronoun
– Erik asked if he could get a child.

b.
ALARM: Tänk om jag bott hos pappa.
 think if OR about I lived with daddy
SCARRIE’S DIAGNOSIS: wrong form of pronoun
– Think if I lived with daddy.

Scarrie obtains a recall of 60%, but with 17 false alarms attains a precision rate of only 15% for errors in pronoun case. In conclusion, as seen in the above examples, the tools search for errors in pronoun form after certain types of prepositions, but due to ambiguity in these words they fail more often than they succeed in detecting these errors.

Finite Verb Form Errors
Errors in finite verbs concern non-inflected verb forms, which is also the most common error found in Child Data. All of the tools search for missing finite verbs in sentences and, judging from the examples in the error specifications, it seems that they detect exactly this type of error.

Grammatifix detected very few instances of sentences that lack a finite verb. Altogether four such errors were recognized, and in one of them Grammatifix suggested correcting another verb. In total, seven false alarms occurred, flagging verbs after an infinitive marker as in (5.29) or after an auxiliary verb as in (5.30).

(5.29)
ALARM: dom la sig ner för att ta skydd under natten
 they lay themselves down for to take [inf] shelter during the-night
GRAMMATIFIX’S DIAGNOSIS: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change ta ‘take’.
– They lay down to take shelter during the night.

(5.30)
ALARM: det kan ju bero på att föräldrarna inte bryr sig dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn för en del kan ju behöva hjälp av sina föräldrar
 it can of-course depend on that the-parents not care themselves they maybe not even know that one has test for they listen not to their children for a bit can of-course need help from their parents
GRAMMATIFIX’S DIAGNOSIS: The sentence seems to lack a tense-inflected verb form. If such a construction is necessary, you can try to change behöva ‘need’.
– It can depend on that the parents do not care. They probably do not even know that you have a test, because they do not listen to their child, because some can need help from their parents.

It seems that Grammatifix cannot cope with longer sentences. For instance, when the example in (5.30) is broken down after det kan ju bero på ..., the error marking is not highlighted anymore. Since many errors with non-finite verbs as the predicates of sentences occurred in Child Data, Grammatifix obtains a low recall value of 4%. False alarms were relatively few, which gives a precision rate of 36%.

Granska also checks for errors in clauses where a finite verb form is missing. It detected nine errors in verbs lacking tense endings altogether, resulting in a recall of just 8%. Nine false flaggings occurred, mostly with imperatives, which gives it a precision score of 44%. Some other alarms concerned exclamations such as Grodan! ‘Frog!’ or Tyst! ‘Silence!’, or fragment clauses where no verb was used (29 alarms). These are excluded from the present analysis.

Scarrie explicitly checks verb forms in the predicate of a sentence and detected 17 errors in Child Data, with two diagnoses: ‘wrong verb form in the predicate’ or ‘no inflected predicative verb’. Altogether, 13 false flaggings occurred due to marking correct finite verbs. One false alarm included a split, as shown below in (5.31). Scarrie has the best result of the three systems for this error type, with 15% recall and 57% precision.

(5.31)
ALARM: Han ring de till mig sen och sa samma sak.
 he call [pret] to me later and said same thing
SCARRIE’S DIAGNOSIS: wrong verb form in predicate
– He phoned me later and said the same thing.


In conclusion, the tools succeeded in detecting at most 17 cases of errors in finite verb form. The tools have a very low coverage rate for this frequent error type; the worst detection rate is Grammatifix’s, the best Scarrie’s.

Verb Form after Auxiliary Verb
All the tools include detection of errors in the verb form after auxiliary verbs. In Child Data, only one of these erroneous verb clusters included an inserted adverb and one occurred in a coordinated verb.

Grammatifix did not find any of these errors. Four instances of erroneous verb form after auxiliary verbs were detected by Granska. The remaining three, which were not detected, are presented in (5.32) and concern G6.1.2, a coordinated verb in (5.32a), G6.1.5, a verb with a preceding adverb (5.32b), and G6.1.6, an auxiliary verb followed by a verb in imperative form (5.32c).

(5.32) a.
Ibland får man bjuda på sig själv och *låter henne/honom vara med!
 sometimes can [pres] one offer [inf] on oneself and let [pres] her/him be with
– Sometimes one can make a sacrifice and let him/her take part.

b.
han råkade bara *kom emot getingboet
 he happened [pret] just came [pret] against the wasp-nest
– He just happened to come across the wasp’s nest.

c.
Det är något som vi alla nog skulle *gör om vi inte hade läst på ett prov.
 it is something that we all probably would [pret] do [imp] if we not had read to a test
– This is something that we all probably would do if we had not been studying for a test.
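A common thread in the misses above is that a verb-form check after an auxiliary must allow optional material, e.g. an adverb, between the auxiliary and the verb it governs. A toy sketch of such a check (the mini-lexicon is invented for illustration; the real systems work on tagged text and cover far more forms):

```python
MODALS = {"skulle", "kan", "måste", "får"}
ADVERBS = {"bara", "nog", "inte", "ju"}
# verb form -> True if it is an infinitive (toy entries, invented)
VERBS = {"kom": False, "komma": True, "gör": False, "göra": True, "vara": True}

def aux_verb_alarms(tokens):
    """Flag a non-infinitive verb governed by a modal, skipping one optional adverb."""
    alarms = []
    for i, tok in enumerate(tokens):
        if tok in MODALS:
            j = i + 1
            if j < len(tokens) and tokens[j] in ADVERBS:
                j += 1  # allow one intervening adverb between modal and verb
            if j < len(tokens) and tokens[j] in VERBS and not VERBS[tokens[j]]:
                alarms.append((tok, tokens[j]))
    return alarms

# preterite 'kom' after a modal, with an intervening adverb
print(aux_verb_alarms("han skulle bara kom emot getingboet".split()))
# -> [('skulle', 'kom')]
```

Without the adverb-skipping step, the alarm in the usage example would be missed, which is the configuration behind the undetected error with a preceding adverb.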

Five false alarms occurred at sentence boundaries. In (5.33a) we see an example where the end of a preceding direct-speech clause is not marked and its final verb is grouped with the main verb of the subsequent clause. Similarly, in (5.33b) the verb cluster ending a clause whose boundary is not marked is grouped together with the adverb and the initial main verb of the subsequent clause.


(5.33) a.
ALARM: Jo, det kanske han kan sa pappa.
 no that maybe he can [pres] said [pret] Daddy
GRANSKA’S DIAGNOSIS: unusual with verb form sa ‘said [pret]’ after modal verb kan ‘can [pres]’. ⇒ kan säga ‘can [pres] say [inf]’
– No, maybe he can, said Daddy.

b.
ALARM: precis när dom skulle börja så hörde dom en röst
 just when they would [pret] start [inf] so heard [pret] they a voice
GRANSKA’S DIAGNOSIS: unusual with verb form hörde ‘heard [pret]’ after modal verb skulle ‘would [pret]’. ⇒ skulle börja så ha hört ‘would [pret] start [inf] so have [inf] heard [sup]’ or skulle börja så höra ‘would [pret] start [inf] so hear [inf]’
– Just when they were about to begin, they heard a voice.

Granska’s performance rates are 57% recall and 44% precision. Scarrie detected only one error in verb form after an auxiliary verb in Child Data (G6.1.6 - see (5.32c) above) and made altogether nine false flaggings. Two false alarms occurred at sentence boundaries, one of them on the same instance as Granska, see (5.33a) above. Scarrie ends up with a performance result of 14% recall and 10% precision. In conclusion, Granska detects more than half of the verb errors after the auxiliary, but the performance of the other tools is very low, detecting either none or just one such error.

Missing Auxiliary Verb
All the tools check explicitly for supine verb forms without the auxiliary infinitive form ha ‘have’. It is not clear if they also check for omission of finite forms of the auxiliary verb in front of a bare supine. In Swedish, the bare supine is only used in subordinate clauses (see Section 4.3.5). Two errors with a bare supine form in main clauses were found in Child Data.

Grammatifix did not find the two errors in Child Data. Instead, Grammatifix suggested insertion of the auxiliary verb ha ‘have’ in constructions between an auxiliary verb and a supine verb form. This is rather a stylistic correction and is not part of the present analysis. Altogether, nine such suggestions were made, of the kind given below:

(5.34)
ALARM: jag skulle ätit för en kvart sen
 I should [pret] eaten [sup] for a quarter later
GRAMMATIFIX’S DIAGNOSIS: Consider the word ätit ‘eaten [sup]’. A verb such as skulle ‘should [pret]’ combines in polished style with ha ‘have [inf]’ + supine rather than only a supine. ⇒ skulle ha ätit ‘should [pret] have [inf] eaten [sup]’
– I should have eaten a quarter of an hour ago.

The same happened with Granska: no errors were detected, and the suggestions made were for insertion of the auxiliary ha ‘have’ in front of supine forms preceded by auxiliary verbs. Seven such flaggings occurred, as in (5.35), and two flaggings were false and occurred at sentence boundaries.

(5.35)
ALARM: Jag måste svimmat.
 I must [pret] fainted [sup]
GRANSKA’S DIAGNOSIS: unusual with verb form svimmat ‘fainted [sup]’ after the modal verb måste ‘must [pret]’. ⇒ måste ha svimmat ‘must [pret] have [inf] fainted [sup]’
– I must have fainted.

Scarrie did find one of the error instances with a missing auxiliary verb in Child Data (G6.2.1). Eight other detections involved the same stylistic issue as for the other tools, suggesting insertion of ha ‘have’ between an auxiliary verb and a supine verb form, as in:

(5.36)
ALARM: de kunde berott på att dom gillade samma tjej
 it could [pret] depend [sup] on that they liked same girl
SCARRIE’S DIAGNOSIS: wrong verb form after modal verb
– It could have been because they liked the same girl.

In conclusion, just one of the two missing auxiliary verb errors in Child Data was found, by Scarrie. The systems pay more attention to the stylistic issue of omitted ha ‘have’ with supine forms, pointing out that the supine verb form should not stand alone in formal prose.


Verb Form in Infinitive Phrase
Granska and Scarrie search for erroneous verb forms following an infinitive marker and should not have problems finding these errors in Child Data, where only one instance included an adverb splitting the infinitive.

Granska identified three errors in verb form after an infinitive marker, missing only the one with an adverb between the parts of the infinitive (G7.1.1 - see (4.35) p.62). This problem of syntactic coverage was already discussed in Section 5.4.4 in the examples in (5.13), which also showed that Granska does not take adverbs into consideration. Altogether six false alarms occurred. Granska’s overall performance rates are 75% recall and 33% precision.

Scarrie detected one of the errors in Child Data, where the infinitive marker is followed by a verb in imperative form instead of infinitive: att gör ‘to do [imp]’ (G7.1.4). Also, one false flagging occurred, shown in (5.37), where it seems that the system misinterpreted the conjunction för att ‘because’ as the infinitive marker att ‘to’:

(5.37)
ALARM: så jag sa att hon skulle ta det lite lugnt för att annars så kan hon skada sig och det är ju inte så bra.
 so I said that she should take it little easy for that otherwise so can [pres] she hurt [inf] herself and it is of-course not so good
SCARRIE’S DIAGNOSIS: inflected verb form after att ‘to’
– So I said that she should take it easy a little because otherwise she might hurt herself and that is of course not so good.

In conclusion, Granska finds all but one of the errors, due to insufficient syntactic coverage and makes also quite many false flaggings. Scarrie has difficulties with this error type and Grammatifix does not target it at all. Missing Infinitive Marker with Verbs All the tools check explicitly for both missing and extra inserted infinitive marker. Three errors in missing infinitive marker with verbs occurred in Child Data in the context of the auxiliary verb komma ‘will’. As presented in Section 4.3.5, certain main verbs take also an infinitive phrase as complement and some lack the infinitive marker and start to behave as auxiliary verbs, that normally do not combine with an

Chapter 5.

160

infinitive marker and only take bare infinitives as complement. This development is now in progress in Swedish, which indicates then rather to treat these constructions as stylistic issues. Grammatifix did not find the three errors in Child Data with omitted infinitive markers with the auxiliary verb komma ‘will’ (see example (4.36) p.62). In seven cases, the tool rather suggested removing the infinitive marker with the verbs b o¨ rja ‘begin’ and t¨anka ‘think’, e.g.: (5.38) a.

ALARM: Jag och Virginia började att berätta om tromben och den övergivna byn
GLOSS: I and Virginia started [pret] to tell [inf] about the-tornado and the abandoned the-village
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and berätta 'tell [inf]'. If an infinitive is governed by the verb började 'started [pret]', the infinitive should not be preceded by att 'to' ⇒ började berätta 'started [pret] tell [inf]'
– Virginia and I started to tell about the tornado and the abandoned village.

b.
ALARM: 4 hus och 5 affärer var ordning gjorda av gumman som hade tänkt att göra museum av den gamla staden
GLOSS: 4 houses and 5 shops were order done by old-lady who had [pret] thought [sup] to make [inf] museum of the old the-city
GRAMMATIFIX'S DIAGNOSIS: Check the words att 'to' and göra 'make [inf]'. If an infinitive is governed by the verb tänkt 'thought [sup]', the infinitive should not be preceded by att 'to' ⇒ tänkt göra 'thought [sup] make [inf]'
– 4 houses and 5 shops were tidied up by the old lady who had planned to make a museum of the old city.

Granska detected all three omitted infinitive markers in the context of the auxiliary verb komma 'will'. In this case, six false flaggings also occurred, concerning the same verb used as a main verb, e.g.:


(5.39)
ALARM: han kommer och klappar alla på handen utan en kille undra (⇒ undrar) hur han känner sig då?
GLOSS: he comes [pres] and pats all on the-hand except a boy wonder [inf] (⇒ wonder [pres]) how he feels himself then
GRANSKA'S DIAGNOSIS: kommer 'will' without att 'to' before verb in infinitive
– He comes and pats everybody's hand except one boy. (I) wonder how he feels then?

In two cases, Granska also suggested insertion of the infinitive marker with the verbs fortsätta 'continue' and prova 'try'. In nine cases, it wanted to remove the infinitive marker with the verbs börja 'begin', försöka 'try', sluta 'stop' and tänka 'think'. Scarrie detected two of the three missing infinitive marker errors with the verb komma 'will' found in Child Data. Quite a large number of false alarms (13) occurred with the verb used as a main verb, as in (5.40), where så is ambiguous between the conjunction 'so'/'and' and a verb reading 'sow'. The precision rate is then only 13%.

(5.40)

ALARM: men kom nu så går vi hem
GLOSS: but come now so OR sow go we home
SCARRIE'S DIAGNOSIS: att 'to' missing
– But come now and we'll go home.

In five cases, Scarrie suggested removal of the infinitive marker in the context of the verbs börja 'begin', fortsätta 'continue' and sluta 'stop'. In conclusion, whereas both Granska and Scarrie performed well, Grammatifix did not succeed in tracing any of the errors with omitted infinitive markers with the auxiliary verb komma 'will'. Overall, all the tools suggested both omission and insertion of infinitive markers with certain main verbs. In some cases they agree, but there are also cases where one system suggests removal of the infinitive marker and another suggests insertion. A clear sign of confusion in the use or omission of the infinitive marker showed up when Granska suggested inserting the infinitive marker in the verb sequence fortsätta leva 'continue live', as shown in (5.41a), whereas Scarrie suggested removing it in the same verb sequence in (5.41b). This indicates that the issue should be classified as a matter of style rather than as a pure grammar error.


(5.41) a.
ALARM: när jag dog 1978 i cancer återvände jag hit för att fortsätta leva mitt liv här
GLOSS: when I died 1978 of cancer returned I here for that continue [inf] live [inf] my life here
GRANSKA'S DIAGNOSIS: ⇒ fortsätta att leva 'continue to live'
– When I died in 1978 of cancer, I returned here to continue live my life here.

b.
ALARM: Vi fortsatte att leva som en hel familj i vårt nya hus här i Göteborg.
GLOSS: we continued [pret] to live [inf] as a whole family in our new house here in Göteborg
SCARRIE'S DIAGNOSIS: ⇒ fortsatte leva 'continued live'
– We continued to live as a whole family in our new house here in Göteborg.

Word Order Errors

All three tools check for the position of adverbs (or negation) in subordinate clauses and for constituent order in interrogative subordinate clauses. Scarrie also checks for word order in main clauses with inversion. All the word order errors found in Child Data are quite complex, and none of the tools succeeded in detecting this type of error. However, false flaggings of correct sentences occurred. Grammatifix made 15 false alarms when checking word order; one included a split word and three occurred at a clause boundary. A false flagging involving a clause boundary is presented in (5.42a), where Grammatifix regarded the adverb hem 'home' as wrongly placed between verbs. This problem is complicated not only by the second verb initiating a subsequent clause, but also by the fact that not all adverbs can precede verbs. Another false flagging is presented in (5.42b), where Grammatifix checked for adverbs placed after the main verb in the expected subordinate clause, but here main clause word order is found in the indirect speech construction.28

Main clause word order occurs when the clause expresses the speaker’s or the subject’s opinion or beliefs.


(5.42) a.
ALARM: När vi kom hem undra (⇒ undrar) självklart mamma vart vi varit...
GLOSS: when we came home wonder [inf] (⇒ wonder [pres]) of-course mother where we been
GRAMMATIFIX'S DIAGNOSIS: Check the placement of hem 'home'. In a subclause an adverb is not usually placed between the verbs. Placement before the finite verb is often suitable.
– When we came home, mother wondered of course where we had been.

b.
ALARM: killen i luren sa att han kommer genast
GLOSS: the-guy in the-receiver said that he comes immediately
GRAMMATIFIX'S DIAGNOSIS: Check the placement of genast 'immediately'. In a subclause a sentential adverb is placed by rule before the finite verb. ⇒ genast kommer 'immediately comes'
– The guy in the receiver said that he would come immediately.

In (5.43) the sentence is erroneously marked as a word order error in the placement of negation. The problem, however, concerns the choice of the (explanative) conjunction för att 'since/due to', which combines with main clauses and is more typical of spoken Swedish (Teleman et al., 1999, Part 2:730). This conjunction corresponds to för 'due to/in order to' in writing and then coordinates only main clauses. It is often confused with the causal subjunction för att 'because/with the intention of', which is used only with subordinate clauses and then requires adverbs to be placed before the main verb (Teleman et al., 1999, Part 2:736).

(5.43)

ALARM: ...då sa han ja för att han ville inte berätta för fröken att han var ensam
GLOSS: then said he yes for that he wanted not tell to the-teacher that he was alone
GRAMMATIFIX'S DIAGNOSIS: Check the placement of inte 'not'. In a subclause a sentential adverb is placed by rule before the finite verb. ⇒ inte ville 'not wanted'
– ... then he said yes, because he did not want to tell the teacher that he was alone.

All of Granska's 15 flaggings were false, either interpreting conjunctions as subjunctions, as in (5.44a), or not taking indirect speech into consideration, as in (5.44b), where the subject's opinion is expressed by main clause word order and not by the subordinate clause word order assumed by the tool.


(5.44) a.
ALARM: ... men den gick av så jag hade bara lite gips kvar.
GLOSS: but it went off so I had just little plaster left
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ bara hade 'just had'
– ... but it broke off so I only had a little plaster left.

b.
ALARM: då tycker jag att det var inte hans fel utan deras.
GLOSS: then believe I that it was not his fault but theirs
GRANSKA'S DIAGNOSIS: Word order error, erroneous placement of adverb in subordinate clause. ⇒ inte var 'not was'
– Then I think that it was not his fault but theirs.

Scarrie's 11 diagnoses were also false, mostly of the type "subject taking the position of the verb", as in (5.45a), along with cases of interpreting conjunctions as subjunctions, as in (5.45b):

(5.45) a.

ALARM: Då vi kom till min by. Trillade jag av brand bilen för det var en guppig väg.
GLOSS: when we came to my village fell I off fire the-car for it was a bumpy road
SCARRIE'S DIAGNOSIS: the subject in the verb position
– When we arrived in my village, I fell off the fire-engine because the road was bumpy.

b.
ALARM: dom kanske inte ens vet att man har prov för dom lyssnar inte på sitt barn ...
GLOSS: they maybe not even know that one has test for they listen not at their child
SCARRIE'S DIAGNOSIS: the inflected verb before sentence adverbial in subordinate clause
– They probably do not even know that you have a test, because they do not listen to their child ...

In conclusion, word order errors were hard to find due to their inner complexity. The tools seem to apply rather straightforward approaches, which resulted in many false flaggings.

Redundancy

According to the error specifications, only Grammatifix searches for repeated words and should thus be able to at least detect errors with doubled words.


Grammatifix identified the five errors with duplicated words immediately following each other. The number of false alarms is quite high (18 occurrences). One example is given below: (5.46)

ALARM: Var var den där överraskningen.
GLOSS: where was the there surprise
GRAMMATIFIX'S DIAGNOSIS: doubled word
– Where was that surprise?

No other superfluous elements were detected, so the system ends up with a performance rate of 38% in recall and 23% in precision.

Missing Constituents

All three tools search for sentences with omitted verbs or infinitive markers, also in the context of a preceding preposition. Grammatifix did not find any missing verbs, but detected the only error with a missing infinitive marker in front of an infinitive verb after certain prepositions (G10.3.1), shown in (5.47).

(5.47) a.
Efter — ha sprungit igenom häckarna två gånger så vilade vi lite ...
GLOSS: after — have [inf] run [sup] through the-hurdles two times then rest we little
– After twice running through the hurdles, we rested a little.

b. Efter att ha sprungit
GLOSS: after to have [inf] run [sup]

Six false alarms occurred for this error type, mostly when the adverb tillbaka 'back' was split, as shown in (5.48). The problem lies in the fact that the split word results in a preposition till 'to' and the verb baka 'bake'.

(5.48)

ALARM: inget kvack kom till — baka
GLOSS: no quack came to — bake
GRAMMATIFIX'S DIAGNOSIS: Check the word baka. If an infinitive is governed by a preposition it should be preceded by att 'to'
– no quack came back.

In the case of omitted verbs, Granska checked only for occurrences of single words such as Slut. 'End.' or sentence fragments such as Tom grå och tyst. 'Empty grey and silent.' or Inte ens pappa. 'Not even daddy.'. The program further suggested that the error might be a title: "Verb seems to be missing in the sentence. If this is a title it should not be ended with a period." Altogether, 25 sentences were judged to be missing a verb and 12 false alarms occurred. None of the errors listed in Child Data were detected by Granska. This particular error type is not included in the present performance analysis. Granska also checks for missing subjects. Two cases concerned short sentence fragments and two were false flaggings, such as the one in (5.49) below.

(5.49)

ALARM: Hade alla 7 vandrat förgäves?
GLOSS: had all 7 walked in-vain
GRANSKA'S DIAGNOSIS: a subject seems to be missing in the sentence
– Had all seven walked in vain?

Scarrie also checks for missing subjects and successfully detected the error G10.1.5, shown in (5.50). The other three flaggings were false. In the case of a missing infinitive marker in constructions where a preposition precedes an infinitive phrase, six false flaggings occurred. Like Grammatifix, Scarrie marks erroneous splits homonymous with prepositions (see (5.48) above).

(5.50) a.
man försöker att lära barnen att om — fuskar med t ex ett prov då ...
GLOSS: one tries to teach the-children to if — cheat with e.g. a test then
– One tries to teach children that if they cheat on e.g. a test then ...

b. om de fuskar med
GLOSS: if they cheat with

In conclusion, many of the omitted constituents are not covered by these tools, and the checks result mostly in false flaggings. Grammatifix successfully detected a missing infinitive marker preceded by a preposition and Scarrie detected a missing subject.

Other Errors

Among other error types, all the tools also check whether a sentence has too many finite verbs. Grammatifix succeeded in finding three instances of unmarked sentence boundaries. In three cases, false flaggings occurred, listed in (5.51). Two such flaggings concerned ambiguity between a verb and a pronoun, and the one in (5.51c) involved a spelling error that resulted in a verb. These alarms are not part of the system's performance test, since such errors were not the target of this analysis.


(5.51) a.
ALARM: Han undrade var de var någonstans
GLOSS: he wondered where they were somewhere
GRAMMATIFIX'S DIAGNOSIS: Check the word forms undrade 'wondered' and var 'where/was'. It seems as if the sentence would have too many finite verbs.
– He wondered where they were?

b.
ALARM: Var var den där överraskningen.
GLOSS: where was the there surprise
GRAMMATIFIX'S DIAGNOSIS: Check the word forms var 'where/was' and var 'where/was'. It seems as if the sentence might have too many finite verbs.
– Where was that surprise?

c.
ALARM: Pojken blev red (⇒ rädd)
GLOSS: the-boy became rode (⇒ afraid)
GRAMMATIFIX'S DIAGNOSIS: Check the word forms blev 'became' and red 'rode'. It seems as if the sentence might have too many finite verbs.
– The boy became afraid.

Granska checks for occurrences of other finite verbs after the copula verb vara 'be'. In Child Data, however, the only detections were false flaggings (8 occurrences), mostly due to homonymy between the verb and the adverb var 'where', as in (5.52a) (5 occurrences). Three false alarms occurred because of spelling errors, as in (5.52a), or at sentence boundaries, as in (5.52b):

(5.52) a.

ALARM: Pojken blev red (⇒ rädd)
GLOSS: the-boy became [pret] rode [pret] (⇒ afraid)
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb blev 'became [pret]'
– The boy became afraid.

b.
ALARM: som tur var landade jag på skyddsnätet på brandbilen
GLOSS: as luck was [pret] landed [pret] I on the-safety-net on the-fire-engine
GRANSKA'S DIAGNOSIS: it is unusual to have a verb after the verb var 'was [pret]'
– Luckily I landed on the safety-net on the fire-engine.

Scarrie also checks for occurrences of two finite verbs in a row, but provides a diagnosis of a possible sentence boundary as well. Eight sentence boundaries were found and eight false markings occurred, often due to lexical ambiguity, as in (5.53). In Scarrie's case, too, these alarms are not included in the analysis.


(5.53)
ALARM: Men sen kom en tjej som visste vem jag var för hon ...
GLOSS: but then came a girl that knew who I was [pret] for OR lead [imp] she
SCARRIE'S DIAGNOSIS: two inflected verbs in predicate position or a sentence boundary
– But then came a girl that knew who I was, because she ...

Finally, Scarrie checks noun case, where the genitive form of a proper noun is suggested in constructions of a proper noun followed by a noun. All such flaggings were false, due to part-of-speech ambiguity, e.g.:

(5.54)

ALARM: Men på morgonen när Erik såg att hans groda var försvunnen.
GLOSS: but in the-morning when Erik [nom] saw that his frog was disappeared
SCARRIE'S DIAGNOSIS: basic form instead of genitive
– But in the morning when Erik saw that his frog had disappeared.

5.5.5 Overall Detection Results

In accordance with the error specifications of the systems, none of the Swedish tools detects errors in definiteness in single nouns or in reference, and only Grammatifix checks for repeated words among the redundancy errors. Missing constituents are checked only when a verb, subject or infinitive marker is missing. Word choice errors, represented by prepositions in idiomatic expressions, are checked by Granska. The detection results on Child Data, discussed in the previous section, are summarized in Tables 5.4, 5.5 and 5.6 below. Among the most frequent error types in Child Data, represented by errors in finite verbs, missing constituents, word choice errors, agreement in noun phrases and redundant words, Grammatifix succeeded in finding errors in four of these types, Scarrie in three of them and Granska in two. All the tools were best at finding errors in noun phrase agreement, with a recall rate between 53% and 67% and precision between 7% and 37%. For the most common error, finite verb form, all obtained very low coverage, with recall between 4% and 15% and precision between 36% and 57%. Grammatifix succeeded in finding all the repeated words among the redundancy errors and one occurrence of a missing constituent. Scarrie also found one missing constituent. No word choice errors were found by Granska. Other error types in Child Data occurred fewer than ten times, and no general conclusions can be drawn about how the tools performed on those.


Table 5.4: Performance Results of Grammatifix on Child Data
(ERRORS | CORRECT ALARM: correct diagnosis, incorrect diagnosis | FALSE ALARM: no error, other error | RECALL | PRECISION | F-VALUE)

Agreement in NP: 15 | 7, 1 | 4, 16 | 53% | 29% | 37%
Agreement in PRED: 8 | 1 | 2, 1 | 13% | 25% | 17%
Definiteness in single nouns: 6 | – | – | 0% | – | –
Pronoun case: 5 | 2 | – | 40% | 100% | 57%
Finite Verb Form: 110 | 3, 1 | 5, 2 | 4% | 36% | 7%
Verb Form after Vaux: 7 | – | – | 0% | – | –
Vaux Missing: 2 | – | – | 0% | – | –
Verb Form after inf. marker: 4 | – | – | 0% | – | –
Inf. marker Missing: 3 | – | – | 0% | – | –
Word order: 5 | – | 11, 4 | 0% | 0% | –
Redundancy: 13 | 5 | 16, 1 | 38% | 23% | 29%
Missing Constituents: 44 | 1, 1 | 6 | 5% | 25% | 8%
Word Choice: 28 | – | – | 0% | – | –
Reference: 8 | – | – | 0% | – | –
Other: 4 | – | – | 0% | – | –
TOTAL: 262 | 18, 4 | 38, 30 | 8% | 24% | 12%

Table 5.5: Performance Results of Granska on Child Data
(ERRORS | CORRECT ALARM: correct diagnosis, incorrect diagnosis | FALSE ALARM: no error, other error | RECALL | PRECISION | F-VALUE)

Agreement in NP: 15 | 5, 3 | 8, 17 | 53% | 24% | 33%
Agreement in PRED: 8 | 3 | 3, 2 | 38% | 38% | 38%
Definiteness in single nouns: 6 | – | – | 0% | – | –
Pronoun case: 5 | 3 | 24 | 60% | 11% | 19%
Finite Verb Form: 110 | 8, 1 | 8, 1 | 8% | 50% | 14%
Verb Form after Vaux: 7 | 4 | 5 | 57% | 44% | 50%
Vaux Missing: 2 | – | 2 | 0% | 0% | –
Verb Form after inf. marker: 4 | 3 | 6 | 75% | 33% | 46%
Inf. marker Missing: 3 | 3 | 6 | 100% | 33% | 50%
Word order: 5 | – | 15 | 0% | 0% | –
Redundancy: 13 | – | – | 0% | – | –
Missing Constituents: 44 | – | 2 | 0% | 0% | –
Word Choice: 28 | – | – | 0% | – | –
Reference: 8 | – | – | 0% | – | –
Other: 4 | – | – | 0% | – | –
TOTAL: 262 | 29, 4 | 79, 20 | 13% | 25% | 17%


Table 5.6: Performance Results of Scarrie on Child Data
(ERRORS | CORRECT ALARM: correct diagnosis, incorrect diagnosis | FALSE ALARM: no error, other error | RECALL | PRECISION | F-VALUE)

Agreement in NP: 15 | 8, 2 | 83, 50 | 67% | 7% | 13%
Agreement in PRED: 8 | – | 12, 1 | 0% | 0% | –
Definiteness in single nouns: 6 | – | – | 0% | – | –
Pronoun case: 5 | 3 | 17 | 60% | 15% | 24%
Finite Verb Form: 110 | 16, 1 | 13 | 15% | 57% | 24%
Verb Form after Vaux: 7 | 1 | 7, 2 | 14% | 10% | 12%
Vaux Missing: 2 | 1 | – | 50% | 100% | 67%
Verb Form after inf. marker: 4 | 1 | 1 | 25% | 50% | 33%
Inf. marker Missing: 3 | 2 | 13 | 67% | 13% | 22%
Word order: 5 | – | 11 | 0% | 0% | –
Redundancy: 13 | – | – | 0% | – | –
Missing Constituents: 44 | 1 | 4, 5 | 2% | 10% | 3%
Word Choice: 28 | – | – | 0% | – | –
Reference: 8 | – | – | 0% | – | –
Other: 4 | – | – | 0% | – | –
TOTAL: 262 | 33, 3 | 161, 58 | 14% | 14% | 14%

Overall performance figures in detecting the errors in Child Data show that Grammatifix did not detect many of the verb errors at all and has the lowest recall. Scarrie, on the other hand, detects the most errors of the three, but has a high number of false flaggings. Errors in agreement with a predicative complement were hard to find in general, even in cases where the subject and the predicate were adjacent; more complex structures would obviously pose even more of a problem for the tools. Even when errors in these constructions were found, the tools often gave an incorrect diagnosis. Among the false flaggings, quite a few involved errors other than grammatical ones. The overall performance of the tools on all error types when applied to Child Data ends up at a recall rate of 14% at most, and a precision rate between 14% and 25%. Grammatifix detected the fewest errors and had the fewest false alarms, but its quite low recall rate leads to the lowest F-value of 12%. Granska found slightly more errors and had more false flaggings, obtaining the best F-value of 17%. Scarrie performed best of the tools in grammatical coverage, but at the cost of many false alarms, giving an F-value of 14%.


In Table 5.7 the overall performance of the systems is presented for the errors they target specifically, excluding the zero-results. Observe that the F-values are slightly higher due to increased recall. Precision rates remain the same.29

Table 5.7: Performance Results of Targeted Errors

TOOL         ERRORS   CORRECT ALARM   FALSE ALARM   RECALL   PRECISION   F-VALUE
Grammatifix  166      22              68            13%      24%         17%
Granska      174      33              97            19%      25%         22%
Scarrie      170      36              214           21%      14%         17%
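The figures in Table 5.7 follow directly from the error and alarm counts. The following sketch recomputes them using the standard definitions (recall = correct alarms / targeted errors; precision = correct alarms / all alarms; F-value = harmonic mean of recall and precision), with the counts taken from the table:

```python
# Recompute the Table 5.7 performance figures from the raw counts.
def metrics(errors, correct_alarms, false_alarms):
    recall = correct_alarms / errors
    precision = correct_alarms / (correct_alarms + false_alarms)
    f_value = 2 * recall * precision / (recall + precision)
    return recall, precision, f_value

# (tool, targeted errors, correct alarms, false alarms) from Table 5.7
table_5_7 = [
    ("Grammatifix", 166, 22, 68),
    ("Granska",     174, 33, 97),
    ("Scarrie",     170, 36, 214),
]

for name, errors, hits, fa in table_5_7:
    r, p, f = metrics(errors, hits, fa)
    print(f"{name:12s} recall {r:.0%}  precision {p:.0%}  F-value {f:.0%}")
# Grammatifix  recall 13%  precision 24%  F-value 17%
# Granska      recall 19%  precision 25%  F-value 22%
# Scarrie      recall 21%  precision 14%  F-value 17%
```

The rounded outputs match the published table, confirming that the tabulated F-values were computed from the unrounded recall and precision figures.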

The performance tests on published adult texts and some student papers provided by the developers of these tools (see Table 5.3 on p.141) show on average much higher validation rates for these texts, with an overall coverage between 35% and 85% and precision between 53% and 77%. Granska proves to be best at detecting errors in verb form in the adult text data evaluated by the developers, with a recall rate of 97%. Verb form errors are mostly represented by errors in finite verb form in Child Data, where Granska obtained a recall of 8%. Other types of verb errors occurred fewer than ten times, which makes the performance result uncertain. For agreement errors in noun phrases, which is the second best category of Granska when tested on adult texts, Granska obtained much better results and detected at least half of the errors, with a recall of 53%. Since the error frequency is much higher in texts written by children, the size of the Child Data corpus can be considered satisfactory and safe for evaluation, at least for the most frequent error types. This performance test shows that the three Swedish tools, designed primarily for adult writers, have in general difficulty detecting errors in texts such as Child Data. As indicated in some examples, this is not only due to insufficient error coverage of the defined error types in the systems. The structure of the texts may also be a reason why certain errors are not detected or are erroneously marked as errors. Different results were sometimes obtained when sentences were split into smaller units.

29 Grammatifix: redundancy includes 5 errors in doubled words; missing constituents are counted as infinitive marker (1) and verb (5). Granska: missing verb (5), choice of preposition (10). Scarrie: missing subject (10), missing infinitive marker (1).


5.6 Summary and Conclusion

From the above analyses it is clear that among the grammar errors found in Child Data, all non-structural errors and some types of structural errors should be possible to detect by syntactic analysis and partial parsing, whereas other errors require more complex analysis or wider context. Among the central error types in Child Data, errors in finite verb form and agreement errors in noun phrases can be handled by partial parsing, as I will show in Chapter 6. The other more frequent errors, such as missing constituents, word choice errors and redundant words forming new lemmas, require deeper analysis. Furthermore, some real-word spelling errors might be detected if they violate syntax. Missing punctuation at sentence boundaries requires analysis of at least the predicate's complement structure.

All the errors in Child Data except definiteness in single nouns and reference seem to be more or less covered by the Swedish tools, considering their error specifications. The performance results show that agreement errors in noun phrases are the error type best covered, whereas errors in finite verb forms obtained a very low recall in all three systems relative to their frequency. Grammatifix had in general difficulty detecting any errors concerning verbs; Granska performed best in this respect. Overall, all the tools detect few errors in Child Data and the precision rate is quite low. It is not clear how many of the missed errors were due to insufficient syntactic coverage and how many were due to the complexity of the sentences in Child Data. That is, all three tools rely on the sentence as the unit of analysis, but "sentences" in Child Data do not always correspond to syntactic sentences. They often include adjoined clauses or are quite long (see Section 4.6). These tools are not designed to handle such complex structures.
In conclusion, many errors in Child Data that can be handled by partial parsing are detected at a rate of not more than 60% by the Swedish grammar checkers. Errors in finite verb form obtained quite low results and are the type of error that needs the most improvement, especially since they are the most common error in Child Data.

Chapter 6

FiniteCheck: A Grammar Error Detector

6.1 Introduction

This chapter reports on the automatic detection of some of the grammar errors discussed in Chapter 4. The challenge of this part of the work is to exploit correct descriptions of the language, instead of describing the structure of errors, and to apply finite state techniques to the whole process of error detection. The implemented grammar error detector FiniteCheck identifies grammar errors using partial finite state methods, identifying syntactic patterns through a set of regular grammar rules (see Section 6.2.4). Constraints are used to reduce alternative parses or adjust the parsing result. There are no explicit error rules in the grammars of the system, in the sense that no grammar rules state the syntax of erroneous (ungrammatical) patterns. The rules of the grammar are always positive and define the grammatical structure of Swedish. The only error-related constraint is the context of the error type. The present grammar is highly corpus-oriented, based on the lexical and syntactic circumstances displayed in the Child Data corpus. Ungrammatical patterns are detected by adopting the same method that Karttunen et al. (1997a) use for the extraction of invalid date expressions, presented in Section 6.2.4. In short, potential candidates for grammatical violations are identified through a broad grammar that overgenerates and also accepts invalid (ungrammatical) constructions. Valid (grammatical) patterns are defined in another, narrow grammar, and the ungrammaticalities among the selected candidates are identified as the difference between these two grammars. In other words, the strings selected


by the rules of the broad grammar that are not accepted by the narrow grammar are the remaining ungrammatical violations.

The current system looks for errors in noun phrase agreement and verb form, such as the selection of finite and non-finite verb forms in main and subordinate clauses and infinitival complements. Errors in the finite form of the main verb were the most natural choice for implementation, since these are the most frequent error type in the Child Data corpus, represented by 110 error instances (see Figure 4.1 on p.73). Moreover, verb form errors can be detected using partial parsing techniques (see Section 5.3.3). Inclusion of errors in the finite main verb motivated an expansion of this category to other verb-related errors, adding other types of finite verb errors and errors in non-finite verb forms. Errors in noun phrase agreement were among the five most frequent error types. In comparison to other writing populations, this type of error might be considered one of the central error types in Swedish (see Section 4.7). Furthermore, noun phrase errors are limited to within the noun phrase and can most likely be detected by partial parsing (see Section 5.3). The other errors among the five most common error types in Child Data, including word choice errors and errors with extra or missing constituents, are not locally restricted in this way and will certainly require more complex analysis.

The development of the grammar error detector started with the project Finite State Grammar for Finding Grammatical Errors in Swedish Text (1998-1999). It was part of a larger project, Integrated Language Tools for Writing and Document Handling, in collaboration with the Numerical Analysis and Computer Science Department (NADA) at the Royal Institute of Technology (KTH) in Stockholm.1 The project group in Göteborg consisted of Robin Cooper, Robert Andersson and myself.
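The broad-versus-narrow subtraction can be illustrated with a small toy sketch. The example below is not one of the actual FiniteCheck grammars: the mini-lexicon (en/ett, bil/flicka/hus/bord) is a hypothetical stand-in, and Python's re module stands in for a finite state calculus. The logic, however, is the same: every candidate accepted by the broad (overgenerating) pattern but rejected by the narrow (agreement-respecting) pattern is flagged as an error.

```python
import re

# Broad grammar: any determiner-noun pair, regardless of gender agreement.
# Narrow grammar: only pairs in which determiner and noun agree in gender.
COMMON_DET, NEUTER_DET = "en", "ett"   # common vs neuter determiners
COMMON_N = "(?:bil|flicka)"            # common-gender nouns (toy lexicon)
NEUTER_N = "(?:hus|bord)"              # neuter nouns (toy lexicon)

broad = re.compile(
    rf"\b(?:{COMMON_DET}|{NEUTER_DET}) (?:{COMMON_N}|{NEUTER_N})\b")
narrow = re.compile(
    rf"\b(?:{COMMON_DET} {COMMON_N}|{NEUTER_DET} {NEUTER_N})\b")

def flag_np_errors(text):
    """Return noun phrases matched by the broad pattern but not the narrow one."""
    return [m.group(0) for m in broad.finditer(text)
            if not narrow.fullmatch(m.group(0))]

print(flag_np_errors("jag såg ett bil och ett hus och en flicka"))
# → ['ett bil']  (neuter determiner with a common-gender noun)
```

In the actual system the two grammars are compiled into finite state networks and the difference is computed by automaton subtraction rather than by filtering regex matches, but the candidate-then-subtract structure is the same.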
In the description of the system I will cover the whole system and its functionality, in particular my own contributions, which mainly concern a first version of the lexicon, the expansion of the grammar and its adjustment to the present corpus of children's texts, disambiguation and other adjustments to parsing results, as well as evaluation and improvements of the system's flagging accuracy. The work of the other two members primarily concerns the final version of the lexicon, optimization of the tagset, the basic grammar and the system interface. I will not discuss their contributions in detail but will refer to the project reports when relevant.

The chapter proceeds with a short introduction to finite state techniques and parsing (Section 6.2). The description of FiniteCheck starts with an overview of the system's architecture, including short presentations of the different modules (Section 6.3). Then follows a section on the composition of the lexicon, with a description of the tagset and the identification of grammatical categories and features (Section 6.4). Next, the overgenerating broad grammar set is presented (Section 6.5), followed by a section on parsing (Section 6.6). The chapter then proceeds with a presentation of the narrow grammar of noun phrases and the verbal core (Section 6.7) and the actual error detection (Section 6.8). The chapter concludes with a summary (Section 6.9). Performance results of FiniteCheck are presented in Chapter 7.

1 The project was sponsored by the HSFR/NUTEK Language Technology Programme. See http://www.ling.gu.se/~sylvana/FSG/ for the methods and goals of our part of the project.

6.2 Finite State Methods and Tools

6.2.1 Finite State Methods in NLP

Finite state technology as such has been used since the emergence of computer science, for instance for program compilation, hardware modeling or database management (Roche, 1997). Finite state calculus is in general considered powerful and well-designed, providing flexible, space- and time-efficient engineering applications. In the domain of Natural Language Processing (NLP), however, finite state models were long considered efficient but somewhat inaccurate, often resulting in applications of limited size. Other formalisms such as context-free grammars were preferred and considered more accurate than finite state methods, despite difficulties in reaching reasonable efficiency. Thus, grammars approximated by finite state models were considered more efficient and simpler, but at the cost of a loss of accuracy. Improvements in the mathematical properties of finite state methods and a re-examination of their descriptive possibilities enabled the emergence of applications for a variety of NLP tasks, such as morphological analysis (e.g. Karttunen et al., 1992; Clemenceau and Roche, 1993; Beesley and Karttunen, 2003), phonetic and speech processing (e.g. Pereira and Riley, 1997; Laporte, 1997) and parsing (e.g. Koskenniemi et al., 1992; Appelt et al., 1993; Abney, 1996; Grefenstette, 1996; Roche, 1997; Schiller, 1996).

In this section the finite state formalism is described, along with possibilities for compiling such devices (Section 6.2.2). Next, the Xerox compiler used in the present implementation is presented (Section 6.2.3). The techniques of finite state parsing are then explained, along with a description of a method for extracting invalid input from unrestricted text that plays an important role in the present implementation (Section 6.2.4).

Chapter 6.


6.2.2 Regular Grammars and Automata

Adopting finite state techniques in parsing means modeling the syntactic relations between words using regular grammars2 and applying finite state automata to recognize (or generate) the corresponding patterns defined by such a grammar. A finite state automaton is a computational model representing the regular expressions defined in a regular grammar: it takes a string of symbols as input, executes some operations in a finite number of steps and halts, the result being interpreted, depending on the grammar, as the machine either accepting or rejecting the input. Formally, it is defined as a tuple consisting of a finite set of symbols (the alphabet), a finite set of states with a unique initial state, a number of intermediate states and final states, and finally a transition relation defining how to proceed between the different states.3 Regular expressions represent sets of simple strings (a language) or sets of pairs of strings (a relation) mapping between two regular languages, upper and lower. Regular languages are represented by simple automata and regular relations by transducers. Transducers are bi-directional finite state automata, which means for example that the same automaton can be used for both analysis and generation. Several tools for the compilation of regular expressions exist. AT&T's FSM Library4 is a toolbox designed for building speech recognition systems and supports the development of phonetic, lexical and language-modeling components. The compiler runs under UNIX and includes about 30 commands to construct weighted finite-state machines (Mohri and Sproat, 1996; Pereira and Riley, 1997; Mohri et al., 1998). FSA Utilities5 is another compiler, developed primarily for experimental purposes in applying finite-state techniques in NLP.
The tool is implemented in SICStus Prolog and makes it possible to compile new regular expressions from the basic operations, thus extending the set of regular expressions handled by the system (van Noord and Gerdemann, 1999). The compiler used in the present implementation is the Xerox Finite-State Tool, one of the Xerox software tools for computing with finite state networks, described further in the subsequent section.

2. Regular grammars are also called type-3 in the classification introduced by Noam Chomsky (Chomsky, 1956, 1959).
3. See e.g. Hopcroft and Ullman (1979); Boman and Karlgren (1996) for exact formal definitions of finite state automata. A 'gentle' introduction is presented in Beesley and Karttunen (2003).
4. The homepage of AT&T's FSM Library: http://www.research.att.com/sw/tools/fsm/
5. The homepage of FSA Utilities: http://www.let.rug.nl/~vannoord/Fsa/
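The formal definition just given can be made concrete with a minimal sketch in Python (not part of the thesis's XFST implementation; the automaton, its states and the language a*b are invented purely for illustration): an automaton as a tuple of alphabet, initial state, final states and a transition relation.

```python
# A minimal deterministic finite automaton for the regular language a*b.
# All names and states here are illustrative only.
DFA = {
    "alphabet": {"a", "b"},
    "initial": 0,
    "finals": {1},
    "delta": {(0, "a"): 0, (0, "b"): 1},  # no transitions out of state 1
}

def accepts(dfa, s):
    """Run the automaton over s; reject on a missing transition."""
    state = dfa["initial"]
    for sym in s:
        if (state, sym) not in dfa["delta"]:
            return False
        state = dfa["delta"][(state, sym)]
    return state in dfa["finals"]

print(accepts(DFA, "aaab"))  # accepted: True
print(accepts(DFA, "aba"))   # rejected: False
```

The same structure underlies the XFST networks used later in the chapter; the difference is that XFST compiles such machines automatically from regular expressions rather than from hand-written transition tables.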

FiniteCheck: A Grammar Error Detector

6.2.3 Xerox Finite State Tool

Introduction

Xerox research developed a system for computing with and compiling finite-state networks, the Xerox Finite State Tool (XFST).6 The tool is a successor to two earlier interfaces: IFSM, created at PARC by Lauri Karttunen and Todd Yampol in 1990-92, and FSC, developed at RXRC by Pasi Tapanainen in 1994-95 (Karttunen et al., 1997b). The system runs under UNIX and is supplemented with an interactive interface and a compiler. Finite state networks of simple automata or transducers are compiled from regular expressions and can be saved into a binary file. The networks can also be converted to Prolog format.

The Regular Expression Formalism

The metalanguage of regular expressions in XFST includes a set of basic operators, such as union (or), concatenation, optionality, ignoring, iteration, complement (negation), intersection (and), subtraction (minus), crossproduct and composition, and an extended set of operators, such as containment, restriction and replacement. The notational conventions for the part of the regular expression formalism in XFST that is used in the present implementation, including operators and atomic expressions, are presented in Table 6.1 (cf. Karttunen et al., 1997b; Beesley and Karttunen, 2003). Uppercase letters such as A here denote regular expressions. For a description of the syntax and semantics of these operators see Karttunen et al. (1997a). The replacement operators play an important role in the present implementation and are further explained below.

6. Technical documentation and a demonstration of XFST can be found at: http://www.rxrc.xerox.com/research/mltt/fst/


Table 6.1: Some Expressions and Operators in XFST

ATOMIC EXPRESSIONS
  0          epsilon symbol (the empty string)
  ?, ?*      any (unknown) symbol; the universal language
UNARY OPERATIONS
  A*         iteration: zero or more (Kleene star)
  A+         iteration: one or more (Kleene plus)
  (A)        optionality
  $A         containment
  ∼A         complement (not)
BINARY OPERATIONS
  A B        concatenation
  A|B        union (or)
  A&B        intersection (and)
  A/B        ignoring
  A .o. B    composition
  A-B        subtraction (minus)
  A→B        replacement (simple)

Replacement Operators

The original version of the replacement operator was developed by Ronald M. Kaplan and Martin Kay in the early 1980s and was applied in the form of phonological rewrite rules implemented by finite state transducers. Replacement rules can be applied in an unconditional version or constrained by context or direction (Karttunen, 1995, 1996). Simple (unconditional) replacement has the format UPPER → LOWER, denoting the regular relation (Karttunen, 1995):7

(RE6.1) [ NO_UPPER [UPPER .x. LOWER] ]* NO_UPPER;

For example, the relation [a b c → d e]8 maps the string abcde to dede. Replacement may start at any point and include alternative replacements, making these transducers non-deterministic and able to yield multiple results. For example, a transducer represented by the regular expression in (RE6.2) produces four different results (axa, ax, xa, x) for the input string aba, as shown in (6.1) (Karttunen, 1996).

7. NO_UPPER corresponds to ∼$[UPPER - []].
8. Lower-case letters, such as a, represent symbols. Symbols can be unary (e.g. a, b, c) or symbol pairs (e.g. a:x, b:0) denoting relations (i.e. transducers). The identity relation, where a symbol maps to the same symbol, as in a:a, is simply written a.
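For the single pattern case, the effect of the relation [a b c → d e] can be checked with an ordinary string substitution (a Python sketch; re.sub replaces every non-overlapping occurrence, whereas the transducer relation also admits the partial, ambiguous replacements discussed next):

```python
import re

# Replace every occurrence of "abc" by "de", as in the relation [a b c -> d e].
print(re.sub("abc", "de", "abcde"))  # -> dede
```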

(RE6.2) a b | b | b a | a b a → x

(6.1) a b a → a x a
      a b a → a x
      a b a → x a
      a b a → x
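The four-way ambiguity in (6.1) can be reproduced by brute-force enumeration. The sketch below is my own Python illustration, not the XFST algorithm: it walks through the input, at each point either copying a character or replacing a matching pattern, and keeps only outputs whose copied stretches contain no remaining match (the NO_UPPER condition of (RE6.1)).

```python
def replace_outputs(s, patterns, x="x"):
    """Enumerate outputs of the unconditional replacement relation:
    copied stretches may not contain any pattern (the NO_UPPER condition)."""
    clean = lambda seg: not any(p in seg for p in patterns)
    results = set()

    def walk(i, out, run):
        if i == len(s):
            if clean(run):
                results.add(out + run)
            return
        walk(i + 1, out, run + s[i])          # copy one character
        if clean(run):                        # or start a replacement here
            for p in patterns:
                if s.startswith(p, i):
                    walk(i + len(p), out + run + x, "")

    walk(0, "", "")
    return results

# The relation [ab|b|ba|aba -> x] applied to "aba" yields axa, ax, xa and x.
print(replace_outputs("aba", ["ab", "b", "ba", "aba"]))
```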

Directionality and the length of replacement can be constrained by the directed replacement operators. The replacement can start from the left or from the right, choosing the longest or the shortest replacement. Four types of directed replacement are defined (Karttunen, 1996):

Table 6.2: Types of Directed Replacement

                 longest match   shortest match
left-to-right        @→              @>
right-to-left        →@              >@

Now, applying the same patterns in a left-to-right longest-match replacement, as in the regular expression in (RE6.3), yields just one solution for the string aba, as shown in (6.2).

(RE6.3) a b | b | b a | a b a @→ x

(6.2) a b a → x
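Python's re engine is leftmost but picks the first matching alternative rather than the longest one; ordering the alternatives longest-first imitates the left-to-right, longest-match behaviour of @→ for this example (a sketch, not a general emulation of the operator):

```python
import re

# "aba|ab|ba|b" lists the patterns longest-first, so the leftmost match
# is also the maximal one, as with the @-> operator.
print(re.sub(r"aba|ab|ba|b", "x", "aba"))  # -> x
print(re.sub(r"aba|ab|ba|b", "x", "aab"))  # -> ax
```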

Directed replacement is defined as a composition of four relations, composed in advance by the XFST compiler. The advantage is that the replacement takes place in one step, without any additional levels or symbols. For instance, the left-to-right longest-match replacement UPPER @→ LOWER is the composition of the following relations (Karttunen, 1996):

(6.3) Input string
      .o. Initial match
      .o. Left-to-right constraint
      .o. Longest-match constraint
      .o. Replacement

With these operators, transducers that mark (or filter) patterns in text can be constructed easily. For instance, strings can be inserted before and after a string that matches a defined regular expression. For this purpose a special insertion symbol "..." is used on the right-hand side to represent the string found to match the left-hand side: UPPER @→ PREFIX ... SUFFIX. Following an example from Karttunen (1996), a noun phrase that consists of an optional determiner (d), any number of adjectives a* and one or more nouns n+ can be marked using the regular expression in (RE6.4), mapping dannvaan into [dann]v[aan] as shown in (6.4). Thus, the expression compiles to a transducer that inserts brackets around maximal instances of the noun phrase pattern.

(RE6.4) (d) a* n+ @→ %[ ... %]

(6.4) dannvaan → [dann]v[aan]
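The bracketing transducer in (RE6.4) can be imitated with a greedy regular expression substitution (a Python sketch; greedy leftmost matching happens to coincide with left-to-right longest match for this pattern):

```python
import re

# (d) a* n+ , with brackets inserted around each maximal match.
marked = re.sub(r"d?a*n+", lambda m: "[" + m.group(0) + "]", "dannvaan")
print(marked)  # -> [dann]v[aan]
```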

The replacement can be constrained further by a specific context, both on the left and on the right of a particular pattern: UPPER @→ LOWER || LEFT _ RIGHT (see Karttunen, 1995, for further variations). Furthermore, the replacement can be parallel, meaning that multiple replacements are performed at the same time (see Kempe and Karttunen, 1996). For instance, the regular expression in (RE6.5) denotes a constrained parallel replacement, where the symbol a is replaced by b and, at the same time, the symbol b is replaced by c. Both replacements occur simultaneously and only if the symbols are preceded by x and followed by y. Applying this automaton to the string xaxayby yields xaxbyby, and applying it to xbybyxa yields xcybyxa, as presented in (6.5).

(RE6.5) a → b , b → c || x _ y

(6.5) xaxayby → xaxbyby
      xbybyxa → xcybyxa

6.2.4 Finite State Parsing

Introduction

New approaches to parsing with the finite state formalism show that the calculus can represent complex linguistic phenomena accurately and that large-scale lexical grammars can be represented in a compact way (Roche, 1997). There are various techniques for creating careful representations while increasing efficiency. For instance, parts of rules that are similar are represented only once, reducing the whole set of rules; each state is given at most one outgoing transition per symbol (determinization); and an automaton can be reduced to a minimal number of states (minimization). Moreover, one can create bi-directional machines, where the same automaton is used for both parsing and generation. Applications of finite state parsing are found mostly in the fields of terminology extraction, lexicography and information retrieval on large-scale text. The methods are "partial" in the sense that the goal is not the production of complete syntactic descriptions of sentences, but rather the recognition of various syntactic patterns in a text (e.g. noun phrases, verbal groups).

Parsing Methods

Many finite-state parsers adopt the chunking techniques of Abney (1991) and collect sets of pattern rules into ordered sequences of a finite number of levels, so-called cascades, where the result of one level is the input to the next (e.g. Appelt et al., 1993; Abney, 1996; Chanod and Tapanainen, 1996; Grefenstette, 1996; Roche, 1997). The parsing procedure over a text tagged for parts of speech usually proceeds by marking the boundaries of adjacent patterns, such as noun or verbal groups; then the nominal and verbal heads within these groups are identified. Finally, patterns between non-adjacent heads are extracted, identifying syntactic relations between words, within and across group boundaries. For this purpose, finite state transducers are used. The automata are applied both as finite state markers, which introduce extra symbols such as surrounding brackets into the input (as exemplified in the previous section), and as finite state filters, which extract and label patterns. Usually, a combination of non-finite state methods and finite state procedures is applied, but the whole parser can be built as a finite state system (see further Karttunen et al., 1997a).
The first application of finite state transducers to parsing was a parser developed at the University of Pennsylvania between 1958 and 1959 (Joshi and Hopely, 1996).9 The parser is essentially a cascade of finite state transducers, and its parsing style resembles Abney's "chunking" parser (Abney, 1991). Syntactic patterns using subcategorization frames and local grammars were constructed to recognize simple NPs, PPs, AdvPs, simple verb clusters and clauses. All of the modules of the parser, including dictionary look-up and part-of-speech disambiguation, are finite state computations, except for the module for the recognition of clauses.

9. The original version of the parser is presented in Joshi (1961). Up-to-date information about the reconstructed version of this parser, Uniparse, can be accessed from: http://www.cis.upenn.edu/~phopely/tdap-fe-post.html.


Besides Abney's chunking approach (Abney, 1991, 1996), i.e. constructive finite state parsing over collections of syntactic patterns and local grammars, others use this technique to locate noun phrases (or other basic phrases) in unrestricted text (e.g. Appelt et al., 1993; Schiller, 1996; Senellart, 1998). Further, Grefenstette (1996) uses this technique to mark syntactic functions such as subject and object. Other approaches to finite-state parsing start from a large number of alternative analyses and, through the application of constraints in the form of elimination or restriction rules, reduce the alternative parses (e.g. Voutilainen and Tapanainen, 1993; Koskenniemi et al., 1992). These techniques have also been used for the extraction of noun phrases or other basic phrases (e.g. Voutilainen, 1995; Chanod and Tapanainen, 1996; Voutilainen and Padró, 1997). Salah Ait-Mokhtar and Jean-Pierre Chanod constructed a parser that combines the constructive and reductionist approaches. The system defines segments by constraints rather than patterns: the constraints mark potential beginnings and ends of phrases, and replacement transducers insert the phrase boundaries. Incremental decisions are made throughout the whole parsing process, but at each step linguistic constraints may eliminate or correct some of the previously added information (Ait-Mokhtar and Chanod, 1997). In the case of Swedish, finite state methods have been applied on a small scale to lexicography and information extraction. A Swedish regular expression grammar was implemented early on at Umeå University, parsing a limited set of sentences (Ejerhed and Church, 1983; Ejerhed, 1985). More recently, a cascaded finite state parser, Cass-Swe, was developed for the syntactic analysis of Swedish (Kokkinakis and Johansson Kokkinakis, 1999), based on Abney's parser. Here the regular expression patterns are applied in cascades, ordered by complexity and length, to recognize phrases.
The output of one level in the sequence is used as input to the subsequent level, starting from tagging and syntactic labeling and proceeding to the recognition of grammatical functions. The grammar of Cass-Swe was semi-automatically extracted from written text by the application of probabilistic methods, such as mutual information statistics, which allow the exclusion of incorrect part-of-speech n-grams (Magerman and Marcus, 1990), and by looking at which function words signal boundaries between phrases and clauses.

Discrimination of Input

One parsing application using finite state methods, presented by Karttunen et al. (1997a), aims at the extraction not only of valid expressions, but also of invalid patterns occurring in free text due to errors and misprints. The method is applied to date expressions, and the idea is simply to define two language sets: one that overgenerates and accepts all date expressions, including dates that do not exist, and one that defines only correct date expressions. The language of invalid dates is then obtained by subtracting the more specific language from the more general one. Thus, by distinguishing the valid date expressions within the language of all date expressions, we obtain the set of expressions corresponding to invalid dates, i.e. those dates not accepted by the language of valid expressions. To illustrate, the definitions in Karttunen et al. (1997a) express date expressions from January 1, 1 to December 31, 9999 and are represented by a small finite state automaton (13 states, 96 arcs) that accepts date expressions consisting of a day of the week, a month and a date with or without a year, or a combination of the two, as defined in (RE6.6a) (SP is a separator consisting of a comma and a space, i.e. ', '). The parser for that language, presented in (RE6.6b), is constrained by the left-to-right, longest-match replacement operator, which means that only the maximal instances of such expressions are accepted. However, this automaton also accepts dates that do not exist, such as "April 31", which exceeds the maximum number of days for that month. Other problems concern leap days and the relationship between the day of the week and the date. A new language is defined by intersecting constraints against invalid types of dates with the language of date expressions, as presented in (RE6.6c).10 This much larger automaton (1346 states, 21006 arcs) accepts only valid date expressions, and again a transducer marks the maximal instances of such dates, see (RE6.6d).

(RE6.6) a. DateExpression = Day | (Day SP) Month " " Date (SP Year)
        b. DateExpression @→ %[ ... %]
        c. ValidDate = DateExpression & MaxDaysInMonth & LeapDays & WeekDayDates
        d. ValidDate @→ %[ ... %]

As the authors point out, it may be of use to distinguish valid dates from invalid ones, but in practice we also need to recognize the invalid dates that occur in real text corpora due to errors and misprints. For this purpose we do not need to define a new language that reveals the structure of invalid dates. Instead, we make use of the already defined languages of all date expressions, DateExpression, and of valid dates, ValidDate, and obtain the language of invalid dates by subtracting the latter from the former: [DateExpression - ValidDate].
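Since these toy languages are finite, the subtraction is literally set difference. A minimal Python sketch (using sets of strings instead of automata, and only the month-length constraint):

```python
# Toy version of [DateExpression - ValidDate]: two months, days as numbers.
MONTH_DAYS = {"March": 31, "April": 30}

# Overgenerating language: every "Month D" string with D = 1..31.
date_expressions = {f"{m} {d}" for m in MONTH_DAYS for d in range(1, 32)}

# Restricted language: only days that actually exist in each month.
valid_dates = {f"{m} {d}" for m, n in MONTH_DAYS.items() for d in range(1, n + 1)}

invalid_dates = date_expressions - valid_dates  # the subtraction
print(invalid_dates)  # -> {'April 31'}
```

With real automata the same subtraction is computed over infinite-capacity networks, but the principle, two positive grammars and one difference, is identical.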

10. For more detail on the separate definitions of the constraints, see Karttunen et al. (1997a).


A parser that identifies maximal instances of date expressions and tags both the valid (VD) and the invalid (ID) ones is presented in (RE6.7).

(RE6.7) [ [DateExpression - ValidDate] @→ "[ID " ... %] ,
          ValidDate @→ "[VD " ... %] ]

In the example in (6.6) below, given by the authors, the parser identified two date expressions: first a valid one (VD) and then an invalid one (ID), differing from the valid one only in the weekday. Notice that the effect of the longest-match application is reflected when, for instance, the invalid date Tuesday, September 16, 1996 is selected over Tuesday, September 16, 19, which is a valid date.11

(6.6) The correct date for today is [VD Monday, September 16, 1996]. There is an error in the program. Today is not [ID Tuesday, September 16, 1996].

6.3 System Architecture

6.3.1 Introduction

After this short introduction to finite state automata, parsing methods with finite state techniques and a description of the XFST compiler, I will now proceed with a description of the implemented grammar error detector FiniteCheck. In this section an overview is given of the system's architecture and of how the system proceeds in the individual modules to identify errors in text. The types of automata used in the implementation are also described. The implementation methods and detailed descriptions of the individual modules are discussed in subsequent sections.

The framework of FiniteCheck is built as a cascade of finite state transducers compiled from regular expressions, including operators defined in the Xerox Finite-State Tool (XFST; see Section 6.2.3). Each automaton in the network composes with the result of the previous application. The implemented tool applies a strategy of simple dictionary lookup, incremental partial parsing with minimal disambiguation by parsing order and filtering, and error detection using subtraction of 'positive' grammars that differ in their level of detail. Accordingly, the current system of sequenced finite state transducers is divided into four main modules: the dictionary lookup, the grammar, the parser and the error finder; see Figure 6.1 below. The system runs under UNIX in a simple emacs environment, implemented by Robert Andersson, with an XFST mode that allows menus to be used to recompile files in the system. The modules are further described in the following subsection on the flow of data in the error detector. The forms of the types of automata are discussed at the end of this section.

11. This date is, however, only valid in theory, since the Gregorian calendar was not yet in use in the year 19 AD. The Gregorian calendar, which replaced the Julian calendar, was introduced in Catholic countries by Pope Gregory XIII on Friday, October 15, 1582 (in Sweden in 1753).

Figure 6.1: The System Architecture of FiniteCheck


6.3.2 The System Flow

The Dictionary Lookup

The input text to FiniteCheck is first manually tokenized so that spaces occur between all strings and tokens, including punctuation. This formatted text is then tagged with part-of-speech and feature annotations by the lookup module, which assigns to each string in the text all the lexical tags stored in the lexicon of the system. No disambiguation is involved, only a simple lookup. The underlying lexicon of around 160,000 word forms is built as a finite state transducer. The tagset is based on the tag format defined in the Stockholm-Umeå Corpus (Ejerhed et al., 1992), combining part-of-speech information with feature information (see Section 6.4 and Appendix C). As an example, the sentence in (6.7a) is ungrammatical, containing a (finite) auxiliary verb followed by yet another finite verb (see (4.32) on p. 61). It will be annotated by the dictionary lookup as shown in (6.7b):

(6.7) a. ∗Men kom ihåg att det inte ska blir någon riktig brand
         but remember that it not will [pres] becomes [pres] some real fire
         – But remember that there will not be a real fire.

      b. Men[kn] kom[vb prt akt] ihåg[ab][pl] att[sn][ie] det[pn neu sin def sub/obj][dt neu sin def] inte[ab] ska[vb prs akt] blir[vb prs akt] någon[dt utr sin ind][pn utr sin ind sub/obj] riktig[jj pos utr sin ind nom] brand[nn utr sin ind nom]
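As a sketch of this lookup step (in Python; the mini-lexicon and the notion of a plain dictionary are illustrative simplifications of the system's 160,000-entry transducer), a full-form lexicon attaches every stored tag to each token and leaves all ambiguity in place:

```python
# Ambiguity-preserving dictionary lookup: every token keeps ALL its tags.
# The entries and tags below are a tiny illustrative subset, not the real lexicon.
LEXICON = {
    "ska":  ["vb prs akt"],
    "blir": ["vb prs akt"],
    "det":  ["pn neu sin def sub/obj", "dt neu sin def"],
    "inte": ["ab"],
}

def lookup(tokens):
    """Attach all lexical tags to each token; unknown words get an empty list."""
    return [(t, LEXICON.get(t, [])) for t in tokens]

for word, tags in lookup("det ska inte blir".split()):
    print(word + "".join(f"[{t}]" for t in tags))
```

Keeping all tags, rather than running a statistical tagger, is what later allows the error finder to see readings that a tagger trained on correct text would have discarded.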

The Grammar

The grammar module includes two grammars with (positive) rules reflecting the grammatical structure of Swedish, differing in their level of detail. The broad grammar (Section 6.5) is especially designed to handle text containing ungrammaticalities; its linguistic descriptions are less precise, accepting both valid and invalid patterns. The narrow grammar (Section 6.7) is more refined and accepts only grammatical segments. For example, the regular expression in (RE6.8) belongs to the broad grammar and recognizes potential verb clusters (VC) (both grammatical and ungrammatical) as a pattern consisting of a sequence of two or three verbs in combination with (zero or more) adverbs (Adv∗).

(RE6.8)

define VC [Verb Adv* Verb (Verb)];

This automaton accepts all the verb cluster examples in (6.8), including the ungrammatical instance (6.8c), extracted from the text in (6.7), where a finite verb in present tense follows a (finite) auxiliary verb instead of a verb in infinitive form (i.e. bli 'be [inf]').

(6.8) a. kan inte springa
         can not run [inf]
      b. skulle ha sprungit
         would have run [sup]
      c. ska ∗blir
         will be [pres]

Corresponding rules in the narrow grammar, represented by the regular expressions in (RE6.9), take into account the internal structure of a verb cluster and define the grammar of modal auxiliary verbs (Mod) followed by (zero or more) adverbs (Adv∗) and either a verb in infinitive form (VerbInf), as in (RE6.9a), or a temporal verb in infinitive (PerfInf) and a verb in supine form (VerbSup), as in (RE6.9b). These rules thus accept only the grammatical segments in (6.8) and will not include example (6.8c). The actual grammar of grammatical verb clusters is somewhat more complex (see Section 6.7).

(RE6.9) a. define VC1 [Mod Adv* VerbInf];
        b. define VC2 [Mod Adv* PerfInf VerbSup];

The Parser

The system proceeds, and the tagged text in (6.7b) is now the input to the next phase, where various kinds of constituents are selected applying a lexical-prefix-first strategy, i.e. parsing first from the left margin of a phrase to the head and then extending the phrase by adding on complements. The phrase rules are ordered in levels. The system proceeds in three steps, first recognizing the head phrases in a certain order (verbal head vpHead, prepositional head ppHead, adjective phrase ap) and then selecting and extending the phrases with complements in a certain order (noun phrase np, prepositional phrase pp, verb phrase vp). The heuristics of parsing order give better flexibility to the system in that (some) false parses can be blocked. This approach is further explained in the section on parsing (Section 6.6). The system then yields the output in (6.9).12 Simple angle brackets around a phrase-tag (e.g. <np>) denote the beginning of a phrase, and the same brackets together with a slash (e.g. </np>) indicate the end.

12. For better readability, the lexical tags are kept only in the erroneous segment and removed manually in the rest of the exemplified sentence.


(6.9) Men kom ihåg att det inte <vc> ska[vb prs akt] blir[vb prs akt] </vc> någon riktig brand

We apply the rules defined in the broad grammar set for this parsing purpose, like the one in (RE6.8) that identified the verb cluster in (6.9) above as a sequence of two verbs. The parsing output may be refined and/or revised by the application of filtering transducers. Earlier parsing decisions depending on lexical ambiguity are resolved, and phrases are extended, e.g. with postnominal modifiers (see further in Section 6.6). Other structural ambiguities, such as verb coordinations or clausal modifiers on nouns, are also taken care of (see Section 6.7).

The Error Finder

Finally, the error finder module is used to discriminate the grammatical patterns from the ungrammatical ones by subtracting the narrow grammar from the broad grammar. The resulting transducers are used to mark the ungrammatical segments in a text. For example, the regular expression in (RE6.10a) identifies verb clusters that violate the narrow grammar of modal verb clusters (VC1 or VC2 in (RE6.9)) by subtracting ('-') these rules from the more general (overgenerating) rule VC of the broad grammar in (RE6.8), within the boundaries of a verb cluster (<vc>, </vc>) previously marked in the parsing stage in (6.9). That is, the output of the parsing stage in (6.9) is the input to this level. By application of the marking transducer in (RE6.10b), the erroneous verb cluster, consisting of two verbs in present tense in a row, is annotated directly in the text, as shown in (6.10).

(RE6.10) a. define VCerror [ "<vc>" [VC - [VC1 | VC2]] "</vc>" ];
         b. define markVCerror [ VCerror -> "" ... "" ];

(6.10) Men kom ihåg att det inte ska[vb prs akt] blir[vb prs akt] någon riktig brand
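The subtraction scheme in (RE6.10) can be sketched in Python over simplified tag strings, using ordinary regular expressions in place of finite state networks. The tag names (vb.mod, vb.inf, ab, etc.) and rule shapes below are illustrative simplifications of my own, not the system's actual grammars or the SUC tagset:

```python
import re

# Broad pattern: verb, optional adverbs, verb, optional third verb,
# playing the role of VC in (RE6.8).
VERB = r"vb\.(mod|prs|inf|sup|hav)"
BROAD_VC = re.compile(rf"{VERB}( ab)* {VERB}( {VERB})?")

# Narrow patterns: the grammatical clusters, like VC1 and VC2 in (RE6.9).
NARROW = [re.compile(r"vb\.mod( ab)* vb\.inf"),
          re.compile(r"vb\.mod( ab)* vb\.hav vb\.sup")]

def vc_errors(tags):
    """Return broad-grammar matches not licensed by the narrow grammar."""
    s = " ".join(tags)
    return [m.group(0) for m in BROAD_VC.finditer(s)
            if not any(n.fullmatch(m.group(0)) for n in NARROW)]

print(vc_errors(["vb.mod", "ab", "vb.prs"]))  # flagged, like 'ska inte *blir'
print(vc_errors(["vb.mod", "ab", "vb.inf"]))  # -> [] (grammatical)
```

The point of the construction survives the simplification: anything matched by the broad pattern but not by the narrow ones is flagged, so no explicit grammar of errors ever has to be written.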


6.3.3 Types of Automata

In accordance with the techniques of finite-state parsing (see Section 6.2.4), two types of transducers are in general use: one that annotates text in order to select certain segments, and one that redefines or refines earlier decisions. Annotations are handled by transducers called finite state markers, which add reserved symbols to the text and mark out syntactic constituents, grammar errors, or other relevant patterns. For instance, the regular expression in (RE6.11) inserts noun phrase tags into the text by application of the left-to-right-longest-match replacement operator ('@→') (see Section 6.2.3).

(RE6.11) define markNP [NP @-> "<np>" ... "</np>"];

The automaton finds the pattern that matches the maximal instance of a noun phrase (NP) and replaces it by inserting a beginning marker (<np>), copying the whole pattern by application of the insertion operator ('...') and then assigning the end marker (</np>). Three (maximal) instances of noun phrase segments are recognized in the example sentence (6.11a), discussed earlier in Chapter 4 (see (4.2) on p. 46), as shown in (6.11b), where one violates definiteness agreement (in boldface).13

(6.11) a. ∗En gång blev den hemska pyroman utkastad ur stan.
          one time was the [def] awful [def] pyromaniac [indef] thrown-out from the-city
          – Once the awful pyromaniac was thrown out of the city.

       b. <np> En gång </np> blev <np> den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] </np> utkastad ur <np> stan </np> .

The regular expression in (RE6.12) represents another example of an annotating automaton.

(RE6.12) define markNPDefError [ npDefError -> "" ... "" ];

This finite state transducer marks out agreement violations of definiteness in noun phrases (npDefError; see Section 6.8). It detects, for instance, the erroneous noun phrase den hemska pyroman in the example sentence, where the determiner den 'the' is in definite form and the noun pyroman 'pyromaniac' is in indefinite form (6.12).

13. Only the erroneous segment is marked by lexical tags.

By application of the left-to-right replacement operator


('→'), the identified segment is replaced by first inserting an error-diagnosis-marker as the beginning of the identified pattern; then the pattern is copied and the error-end-marker is added.

(6.12) En gång blev den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] utkastad ur stan .

The marking transducers of the system have the form A @→ S ... E, marking the maximal instances of A from left to right by application of the left-to-right-longest-match replacement operator ('@→') and inserting a start symbol S (e.g. <np>) and an end symbol E (e.g. </np>). In cases where the maximal instances are already recognized and only the operation of replacement is necessary, the transducers use the form A → S ... E, applying only the left-to-right replacement operator ('→'). The other type of transducer is used for refinement and/or revision of earlier decisions. These finite state filters can, for instance, be used to remove the noun phrase tags from the example sentence, leaving just the error marking. The regular expression in (RE6.13) replaces all occurrences of noun phrase tags with the empty string ('0') by application of the left-to-right replacement operator ('→'). The result is shown in (6.13).

The marking transducers of the system have the form A @→ S ... E, when marking the maximal instances of A from left to right by application of the left-toright-longest-match replacement operator (‘@→’) and inserting a start-symbol S (e.g. ) and an end-symbol E (e.g. ). In cases where the maximal instances are already recognized and only the operation of replacement is necessary, the transducers use the form A → S ... E, applying only the left-to-right replacement operator (‘→’). The other types of transducers are used for refinement and/or revision of earlier decisions. These finite state filters can for instance be used to remove the noun phrase tags from the example sentence, leaving just the error marking. The regular expression in (RE6.13) replaces all occurrences of noun phrase tags with an empty string (‘0’) by application of the left-to-right replacement operator (‘→’). The result is shown in (6.13). (RE6.13)

define removeNP ["<np>" -> 0, "</np>" -> 0];

(6.13) En gång blev den[dt utr sin def][pn utr sin def sub/obj] hemska[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] pyroman[nn utr sin ind nom] utkastad ur stan.

These filtering transducers have the form A → B and are used for simple replacement of instances of A by B by application of the left-to-right replacement operator ('→'). In cases where the context plays a crucial role, the automata are extended with requirements on the left and/or the right context and have the form A → B || L _ R. Here, the patterns in A are replaced by B only if A is preceded by the left context L and followed by the right context R. In some cases only the left context is constrained, in others only the right, and in some cases both are needed.
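Both filter forms can be sketched with ordinary substitutions in Python (the <np> tag names and the toy strings are illustrative; lookbehind and lookahead stand in for the left and right contexts L and R):

```python
import re

# Simple filter A -> B: remove noun phrase tags, mirroring removeNP in (RE6.13).
marked = "En gång blev <np> den hemska pyroman </np> utkastad ur <np> stan </np> ."

def remove_np(text):
    return re.sub(r"</?np> ?", "", text)

print(remove_np(marked))

# Context-constrained filter A -> B || L _ R: replace "a" by "b" only when
# preceded by "x" and followed by "y" (lookbehind / lookahead).
print(re.sub(r"(?<=x)a(?=y)", "b", "xay aya xa"))  # -> "xby aya xa"
```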


6.4 The Lexicon

6.4.1 Composition of the Lexicon

The lexicon of the system is a full-form lexicon based on two resources: Lexin (Skolverket, 1992), developed at the Department of Swedish Language, Section of Lexicology, Göteborg University, and a corpus-based lexicon from the SveLex project under the direction of Daniel Ridings, LexiLogik AB. At the initial stage of lexicon composition, only the Lexin dictionary of 58,326 word forms was available to us, and we chose it especially for the lexical information stored in it, in particular its information on valence. I converted the Lexin text records to one single regular expression by a two-step process using the programming language gawk (Robbins, 1996). From the Lexin records (exemplified in (6.14a) and (6.14b)) a new file was created with one lemma per row, as in (6.14c). The first row there represents the Lexin entry for the noun bil 'car' in (6.14a), and the second the entry for the verb bilar 'travels by car [pres]' in (6.14b). Only a word's part-of-speech (entry #02), lemma (entry #01) and declined forms (entry #12) are listed in the current implementation.14 The number and type of forms vary according to the part-of-speech, and sometimes even within a part-of-speech.

(6.14) a. #01 bil
          #02 subst
          #04 ett slags motordrivet fordon
          #07 åka bil
          #09 bild 17:34, 18:36-37
          #11 bil~trafik -en
          #11 personbil
          #11 bil~buren
          #11 bil~fri
          #11 bil~sjuk
          #11 bil~sjuka
          #11 bil~telefon
          #11 lastbil
          #12 bilen bilar
          #14 bi:l

       b. #01 bilar
          #02 verb
          #04 åka bil
          #10 A & (+ RIKTNING)
          #12 bilade bilat bila(!)
          #14 2bI:lar

       c. subst bil bilen bilar
          verb bilar bilade bilat bila

14 Future work will further extend the lexicon with the other kinds of information stored in Lexin, such as valence and compounding.

Chapter 6.

192

In the next step I converted the data in (6.14c) directly to a single regular expression, as shown in (RE6.14). Each word entry in the lexicon was represented as a single finite state transducer with the string on the LOWER side and the category and features on the UPPER side, allowing both analysis and generation. The whole dictionary is formed as the union of these automata. At this stage I used only simple tagsets, which were later converted to the SUC format (see below). Because the regular expression is generated automatically from the lexical entries, alternative versions of the lexicon are easy to create, for example with different tagsets or with other information from Lexin included (e.g. valence, compounds).
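The analysis direction of such a lexicon transducer (surface string on the lower side, tag on the upper side) can be imitated with a toy lookup table. The sketch below is illustrative only: the entries are a hand-picked fragment and a Python dictionary stands in for the transducer, but it shows the key design point, namely that the ambiguous form bilar keeps both of its readings, since no tagger is used to pick one.

```python
# Toy stand-in for the lexicon transducer: each surface form maps to all
# of its analyses. The entries and the tag names ([+NSI] etc., following
# RE6.14) are an illustrative fragment, not the real lexicon.
LEXICON = {
    "bil":    ["[+NSI]"],                # noun, singular indefinite
    "bilen":  ["[+NSD]"],                # noun, singular definite
    "bilar":  ["[+NPI]", "[+VPres]"],    # noun plural OR verb present
    "bila":   ["[+VImp]"],
    "bilade": ["[+VPret]"],
    "bilat":  ["[+VSup]"],
}

def analyze(word):
    """Annotate a word with ALL of its lexical tags; unknown words get
    the tag [nil], mirroring the treatment of out-of-vocabulary items."""
    return word + "".join(LEXICON.get(word, ["[nil]"]))

print(analyze("bilar"))  # bilar[+NPI][+VPres] -- both readings kept
print(analyze("xyz"))    # xyz[nil]
```

Keeping every reading is what later allows the grammar, rather than a statistical tagger, to decide which analysis survives in erroneous text.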

(RE6.14)

[ A % - i n k o m s t 0:%[%+NSI%]
| A % - k a s s a 0:%[%+NSI%]
| A % - s k a t t 0:%[%+NSI%]
| ...
| b i l 0:%[%+NSI%]
| b i l e n 0:%[%+NSD%]
| b i l a r 0:%[%+NPI%]
| b i l a 0:%[%+VImp%]
| b i l a r 0:%[%+VPres%]
| b i l a d e 0:%[%+VPret%]
| b i l a t 0:%[%+VSup%]
| ...
| ö v ä r l d 0:%[%+NSI%]
| ö v ä r l d a r 0:%[%+NPI%]
| ö v ä r l d e n 0:%[%+NSD%]
];

The Lexin dictionary was later extended with the 100,000 most frequent word forms selected from the corpus-based SveLex. At this stage the format of the lexicon was revised. The new lexicon of 158,326 word forms was compiled into a new transducer using instead the Xerox Finite-State Lexicon Compiler (LEXC) (Karttunen, 1993), which made the lexicon more compact and efficient. This software facilitates in particular the development of natural-language lexicons: instead of regular expression declarations, a high-level declarative language is used to specify the morphotactics of a language. I was not part of the composition of the new version of the lexicon; the procedures and achievements of this work are described further in Andersson et al. (1998, 1999).

6.4.2 The Tagset

In the present version of the lexicon, the set of tags follows the conventions of the Stockholm Umeå Corpus project (Ejerhed et al., 1992), including 23 category classes and 29 feature classes (see Appendix C). Four additional categories were added to this set for recognition of copula verbs (cop), modal verbs (mvb), verbs with infinitival complement (qmvb), and unknown words, which obtain the tag [nil]. This morphosyntactic information is used for identification of strings by their category and/or feature(s). For reasons of efficiency, the whole tag with category and feature definitions is read by the system as a single symbol and not as a separate list of atoms. An experiment conducted by Robert Andersson showed that the size of an automaton recognizing a grammatical noun phrase decreased by 90% in the number of states and 60% in the number of transitions in comparison to declaring a tag as a category plus a set of features (see further Andersson et al., 1999). As a consequence of this choice, the automata representing the tagset are divided both according to the category and according to the features they encode, always rendering the whole tag. The automata are constructed as a union of all the tags of the same category or feature. In practice this means that the same tag occurs in as many tag definitions as it has defined characteristics. For instance, the tag defining an active verb in present tense, [vb prs akt], occurs in three definitions: first in the union of all tags defining the verb category (TagVB in (RE6.15)), then among all tags for present tense (TagPRS in (RE6.16)), and then among all tags for active voice (TagAKT in (RE6.17)).

(RE6.15)

define TagVB [ "[vb an]"
             | "[vb sms]"
             | "[vb prt akt]"
             | "[vb prt sfo]"
             | "[vb prs akt]"
             | "[vb prs sfo]"
             | "[vb sup akt]"
             | "[vb sup sfo]"
             | "[vb imp akt]"
             | "[vb imp sfo]"
             | "[vb inf akt]"
             | "[vb inf sfo]"
             | "[vb kon prt akt]"
             | "[vb kon prt sfo]"
             | "[vb kon prs akt]" ];


(RE6.16)

define TagPRS [ "[pc prs utr/neu sin/plu ind/def gen]"
              | "[pc prs utr/neu sin/plu ind/def nom]"
              | "[vb prs akt]"
              | "[vb prs sfo]"
              | "[vb kon prs akt]" ];

(RE6.17)

define TagAKT [ "[vb prt akt]"
              | "[vb prs akt]"
              | "[vb sup akt]"
              | "[vb imp akt]"
              | "[vb inf akt]"
              | "[vb kon prt akt]"
              | "[vb kon prs akt]" ];

On the other hand, the tag for an interjection ([in]), which consists of the category alone, occurs just once among the tag definitions:

(RE6.18)

define TagIN [ "[in]" ];

There are in total 55 different lexical-tag definitions of categories and features. A single automaton (Tag), composed as the union of these 55 definitions, represents all the different categories and features. The largest class, the singular feature (TagSIN), includes 80 different tags.

6.4.3 Categories and Features

In the parsing and error detection processes, strings need to be recognized by their category and/or feature membership. The morphosyntactic information in the tags is used for this purpose, and automata identifying different categories and feature sets are defined. For instance, the regular expression in (RE6.19a) recognizes the tagged string kan[vb prs akt] ‘can’ as a verb, i.e. a sequence of one or more (the iteration sign ‘+’) letters followed by a sequence of tags, one of which is a tag containing ‘vb’ (TagVB). Features are defined in the same manner, so the same string can be recognized as a carrier of the feature of present tense: the regular expression in (RE6.19b) defines the automaton for present tense as a sequence of (one or more) letters followed by a sequence of tags, where one of them fulfills the feature of present tense ‘prs’ (TagPRS).

(RE6.19) a. define Verb  Letter+ Tag* TagVB Tag*;
         b. define Prs   Letter+ Tag* TagPRS Tag*;


By using intersection (‘&’) of category and feature sets, category-feature combinations can also be recognized. The same string can then be recognized directly as a verb in present tense by the regular expression VerbPrs given in (RE6.20), which lists all the verb tense combinations.

(RE6.20)

define VerbImp [Verb & Imp];
define VerbPrs [Verb & Prs];
define VerbPrt [Verb & Prt];
define VerbSup [Verb & Sup];
define VerbInf [Verb & Inf];

Even higher-level sets can be built. For instance, categories of tensed (finite) and untensed (non-finite) verbs may be defined as in (RE6.21), as unions of the appropriate verb form definitions from the verb tense feature set in (RE6.20) above. Our example string, being a verb in present tense form, then falls among the finite verb forms (VerbTensed).

(RE6.21)

define VerbTensed   [VerbPrs | VerbPrt];
define VerbUntensed [VerbSup | VerbInf];
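Because each whole tag is read as one atomic symbol, these intersections and unions behave much like set operations over tag inventories. The following sketch is purely illustrative (Python, with a deliberately truncated tag inventory and invented function names): a word counts as, e.g., a present-tense verb when its tag sequence hits both the category class and the feature class.

```python
import re

# Each whole tag is one atomic symbol; category and feature classes are
# simply sets of such symbols (inventory truncated for this sketch).
TagVB  = {"[vb prs akt]", "[vb prt akt]", "[vb inf akt]", "[vb sup akt]"}
TagPRS = {"[vb prs akt]", "[pc prs utr/neu sin/plu ind/def nom]"}
TagPRT = {"[vb prt akt]"}

def has_tag(tagged_word, tagset):
    """True if any tag attached to the word belongs to the given class."""
    return any(t in tagset for t in re.findall(r"\[[^\]]*\]", tagged_word))

def is_verb_prs(w):
    """cf. VerbPrs = [Verb & Prs]: a verb tag and a present-tense tag."""
    return has_tag(w, TagVB) and has_tag(w, TagPRS)

def is_verb_prt(w):
    return has_tag(w, TagVB) and has_tag(w, TagPRT)

def is_verb_tensed(w):
    """cf. VerbTensed = [VerbPrs | VerbPrt]."""
    return is_verb_prs(w) or is_verb_prt(w)

print(is_verb_prs("kan[vb prs akt]"))        # True
print(is_verb_tensed("kunnat[vb sup akt]"))  # False
```

The real system of course performs these operations on automata rather than Python sets, but the membership logic carries over directly.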

6.5 Broad Grammar

The rules of the broad grammar are used to mark potential phrases in a text, both grammatical and ungrammatical. The grammar consists of valid (grammatical) rules that define the syntactic relations of constituents, mostly in terms of categories, and list their order. There are no constraints on the selections other than which parts of speech combine with each other to form phrases. The grammar is, in other words, underspecified and does not distinguish between grammatical and ungrammatical patterns. The parsing is incremental, i.e. it identifies first heads and then complements. This is also reflected in the broad grammar listed in (RE6.22), whose rules are divided into head rules and complement rules. The whole broad grammar consists of six rules: the head rules for the adjective phrase (AP), the verbal head (VPhead) and the prepositional head (PPhead), and then rules for the noun phrase (NP), the prepositional phrase (PP) and the verb phrase (VP).


(RE6.22)

# Head rules
define AP     [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

# Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

An adjective phrase (AP) consists of an (optional) adverb and a sequence of (one or more) adjectives. This means that an adjective phrase contains at least one adjective. The head of a prepositional phrase (PPhead) is a preposition. A verbal head (VPhead) includes a verb preceded or followed by (zero or more) adverbs, possibly followed by (zero or more) verbs and an optional pronoun. This means that a verbal head consists of at least a single verb, which in turn may be preceded or followed by adverbs and followed by verbs. In order to prevent pronouns from being analyzed as determiners in noun phrases, e.g. jag anser det bra ‘I think it is good’, single neuter definite pronouns are included in the verbal head. The regular expression describing a noun phrase (NP) consists of two parts. The first states that a noun phrase includes a determiner (Det), a determiner with the adverbial här ‘here’ or där ‘there’ (Det2), or a possessive noun (NGen), followed by a numeral (Num), an adjective phrase (APPhr) and a (proper) noun (Noun). Since not only the noun can form the head of the noun phrase, all the constituents are optional; the intersection with ‘any-symbol’ (‘?’) followed by the iteration sign (‘+’) is needed to state that at least one of the listed constituents has to occur. The second part of the noun phrase rule states that a noun phrase may consist of a single pronoun (Pron). A prepositional phrase (PP) is recognized as a prepositional head (PPheadPhr) followed by a noun phrase (NPPhr). A verb phrase consists of a verbal head (VPheadPhr) followed by at most three (optional) noun phrases and (zero or more) prepositional phrases.
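The intended looseness of the NP rule can be illustrated with a toy matcher over category-tagged tokens. The sketch below is not FiniteCheck code: the tag names (DT, NUM, JJ, NN) and the regular expression are invented for illustration, but like the broad rule it accepts a determiner-adjective-noun sequence whether or not the features agree, and it rejects the empty sequence (cf. the intersection with ‘?+’).

```python
import re

# Broad NP rule over category-tagged tokens: optional determiner,
# optional numeral, any number of adjectives, optional noun.
NP_BROAD = re.compile(r"(\S+/DT\s*)?(\S+/NUM\s*)?(\S+/JJ\s*)*(\S+/NN\s*)?")

def is_broad_np(tokens):
    """Accept any (Det)(Num)(AP)(Noun) sequence with at least one
    constituent (the non-empty check mirrors the '& ?+' intersection);
    agreement features are deliberately not checked."""
    return tokens != "" and NP_BROAD.fullmatch(tokens) is not None

print(is_broad_np("en/DT liten/JJ groda/NN"))  # grammatical: True
print(is_broad_np("en/DT litet/JJ groda/NN"))  # agreement error: also True
print(is_broad_np("han/PN"))                   # not this pattern: False
```

That the second, ungrammatical phrase is accepted is the point: the broad grammar must capture erroneous phrases so that the narrow grammar in Section 6.7 can later reject them.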

6.6 Parsing

6.6.1 Parsing Procedure

The rules of the (underspecified) broad grammar are used to mark syntactic patterns in a text. A partial, lexical-prefix-first, longest-match, incremental strategy is used for parsing. The parsing procedure is partial in the sense that only portions of text are recognized and no full parse is provided. Patterns not recognized by the rules of the (broad) grammar remain unchanged. The maximal instances of a particular phrase are selected by application of the left-to-right-longest-match replacement operator (‘@→’) (see Section 6.2.3). In (RE6.23) we see all the marking transducers recognizing the syntactic patterns defined in the broad grammar. The automata replace the corresponding phrase (e.g. a noun phrase, NP) with a label indicating the beginning of such a pattern, the phrase itself, and a label that marks the end of that pattern.

(RE6.23)

define markPPhead [PPhead @-> "" ... ""];
define markVPhead [VPhead @-> "" ... ""];
define markAP     [AP @-> "" ... ""];
define markNP     [NP @-> "" ... ""];
define markPP     [PP @-> "" ... ""];
define markVP     [VP @-> "" ... ""];

The segments are built up in cascades in the sense that first the heads are recognized, starting from the left-most edge to the head (the so-called lexical prefix), and then the segments are expanded at the next level by addition of complement constituents. The regular expressions in (RE6.24) compose the marking transducers of the separate segments into a three-step process.

(RE6.24)

define parse1 [markVPhead .o. markPPhead .o. markAP];
define parse2 [markNP];
define parse3 [markPP .o. markVP];

First the verbal heads, prepositional heads and adjective phrases are recognized by composition in that order (parse1). The corresponding marking transducers presented in (RE6.23) insert syntactic tags around the found phrases as in (6.15a).15 This output then serves as input to the next level, where the adjective phrases are extended and noun phrases are recognized (parse2) and marked as exemplified in (6.15b). This output in turn serves as input to the last level, where the whole prepositional phrases and verb phrases are recognized in that order (parse3) and marked as in (6.15c).

15 The original sentence example is presented in (6.11) on p. 189.


(6.15) a. PARSE 1: VPHead .o. PPHead .o. AP
       En gång blev den hemska pyroman utkastad ur stan .
       b. PARSE 2: NP
       En gång blev den hemska pyroman utkastad ur stan .
       c. PARSE 3: PP .o. VP
       En gång blev den hemska pyroman utkastad ur stan .

During and after this parsing annotation, some phrase types are further expanded with post-modifiers, split segments are joined, and empty results are removed (see Section 6.6.4). The ‘broadness’ of the grammar and the lexical ambiguity of words, necessary for parsing text containing errors, also yield ambiguous and/or alternative phrase annotations. We block some of the (erroneous) alternative parses through the order in which phrase segments are selected, which causes bleeding of some rules, so that more ‘correct’ parsing results are achieved. The order in which the labels are inserted into the string influences the segmentation of patterns into phrases (see Section 6.6.2). Further ambiguity resolution is provided by filtering automata (see Section 6.6.3).
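The cascade of marking passes can be imitated with successive substitutions over a category-tagged string. In the sketch below (Python; the word/PN-style tags and the label names are invented for illustration, and `re.sub` only approximates the @-> operator), marking the verbal head before the noun phrase keeps a pronoun-verb sequence from merging into one NP, as in example (6.16):

```python
import re

def mark(text, pattern, open_tag, close_tag):
    """Wrap every match of `pattern` in begin/end labels, a rough
    analogue of left-to-right longest-match marking with @->."""
    return re.sub("(" + pattern + ")", open_tag + r"\1" + close_tag, text)

# Toy tagged sentence (word/CATEGORY); cf. example (6.16).
sent = "De/PN såg/VB ledsna/JJ ut/AB"

# Cascade: verbal heads first, then noun phrases.
step1 = mark(sent, r"\S+/VB", "<vc> ", " </vc>")
step2 = mark(step1, r"\S+/PN", "<np> ", " </np>")
print(step2)  # <np> De/PN </np> <vc> såg/VB </vc> ledsna/JJ ut/AB
```

Running the passes in the opposite order on a suitable NP pattern would let the pronoun and the ambiguous verb merge into one noun phrase, which is exactly the segmentation problem the parsing order is chosen to avoid.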

6.6.2 The Heuristics of Parsing Order

The order in which phrases are labeled supports ambiguity resolution in the parse to some degree. The choice of marking verbal heads before noun phrases prevents constituents of verbal heads from merging into noun phrases, which would yield noun phrases with too wide a range. For instance, marking the sentence in (6.16a) first for noun phrases ((6.16b) ∗NP:)16 would interpret the pronoun De ‘they’ as a determiner and the verb såg ‘saw’, which exactly as in English is homonymous with the noun ‘saw’, as a noun, and merge these two constituents into a noun phrase. The output would then be composed with the selection of the verbal head ((6.16b) ∗NP .o. VPHead), which ends up within the boundaries of the noun phrase. Composing the marking transducers in the opposite order instead yields the more correct parse in (6.16c). Although the alternative of the verb being parsed as a verbal head or a noun remains (såg), the pronoun is now marked correctly as a separate noun phrase and not merged together with the main verb into a noun phrase.

16 An asterisk ‘*’ indicates an erroneous parse.

FiniteCheck: A Grammar Error Detector

199

(6.16) a. De såg ledsna ut
       They looked sad out
       – They looked sad.
       b. ∗NP: De såg ledsna ut .
       ∗NP .o. VPHead: De såg ledsna ut .
       c. VPHead: De såg ledsna ut .
       VPHead .o. NP: De såg ledsna ut .

This ordering strategy is not absolute, however, since the opposite scenario is possible, where parsing noun phrases before verbal heads is more suitable. Consider for instance example (6.17a) below, where the string öppna ‘open’ in the noun phrase det öppna fönstret ‘the open window’ is split into three separate noun phrase segments when applying the order of parsing verbal heads before noun phrases (6.17c), due to the homonymy between the adjective and the infinitive or imperative verb form. The opposite scenario of parsing noun phrases before verbal heads yields a more correct parse (6.17b), where the whole noun phrase is recognized as one segment.

(6.17) a. han tittade genom det öppna fönstret
       he looked through the open window
       – he looked through the open window
       b. NP: han tittade genom det öppna fönstret
       NP .o. VPHead: han tittade genom det öppna fönstret
       c. VPHead: han tittade genom det öppna fönstret
       ∗VPHead .o. NP: han tittade genom det öppna fönstret

We analyzed the ambiguity frequency in the Child Data corpus and found that occurrences of nouns recognized as verbs are more frequent than the opposite. On this ground, we chose the strategy of marking verbal heads before marking noun phrases. In the opposite scenario, the false parse can be revised and corrected by an additional filter (see Section 6.6.3). A similar problem occurs with words homonymous between prepositions and adjectives. For instance, the string vid is ambiguous between an adjective (‘wide’) and a preposition (‘by’) and influences the order of marking prepositional heads and noun phrases. Parsing prepositional heads before noun phrases is more suitable for preposition occurrences, as shown in (6.18c), in order to prevent the preposition from being merged into a noun phrase, as in (6.18b).

(6.18) a. Jag satte mig vid bordet
       I sat me by the-table
       – I sat down at the table.
       b. NP: Jag satte mig vid bordet .
       ∗NP .o. PP: Jag satte mig vid bordet .
       c. PP: Jag satte mig vid bordet .
       PP .o. NP: Jag satte mig vid bordet .

The opposite order is more suitable for adjective occurrences, as in (6.19), where the adjective is joined together with the head noun when noun phrases are selected first, as in (6.19b). But when the adjective is recognized as a prepositional head, the noun phrase is split into two noun phrases, as in (6.19c). Again, the choice of marking prepositional heads before noun phrases was based on the result of a frequency analysis in the corpus, i.e. the string vid occurred more often as a preposition than as an adjective.


(6.19) a. Hon hade vid kjol på sig.
       She had wide skirt on herself.
       – She was wearing a wide skirt.
       b. NP: Hon hade vid kjol på sig .
       NP .o. PP: Hon hade vid kjol på sig .
       c. PP: Hon hade vid kjol på sig .
       ∗PP .o. NP: Hon hade vid kjol på sig .

6.6.3 Further Ambiguity Resolution

As discussed above, the parsing order does not give the correct result in every context. Nouns, adjectives and pronouns homonymous with verbs may be interpreted by the parser as verbal heads, and adjectives homonymous with prepositions can be analyzed as prepositional heads. These parsing decisions can be redefined at a later stage by application of filtering transducers (see Section 6.3.3). As exemplified in (6.17) above, parsing verbal heads before noun phrases may yield noun phrases that are split into parts, due to adjectives being interpreted as verbs. The filtering transducer in (RE6.25) adjusts such segments: it removes the erroneous (inner) syntactic tags (i.e. replaces them with the empty string ‘0’) so that only the outer noun phrase markers remain, converting the split phrase in (6.20a) to the single noun phrase in (6.20b). The regular expression consists of two replacement rules that apply in parallel. They are constrained by the surrounding context: a preceding determiner (Det) and a subsequent adjective phrase (APPhr) and noun phrase (NPPhr) in the first rule, and a preceding determiner and adjective phrase in the second rule.

(6.20) a. han tittade genom det öppna fönstret
       b. han tittade genom det öppna fönstret

(RE6.25)

define adjustNPAdj [ "" -> 0 || Det _ APPhr "" NPPhr,,
                     "" -> 0 || Det "" APPhr _ ];


Noun phrases with a possessive noun as the modifier are split when the head noun is homonymous with a verb, as in (6.21).17 The parse is then adjusted by a filter that simply extracts the noun from the verbal head and moves the borders of the noun phrase, yielding (6.21c).

(6.21) a. barnens far hade dött
       children’s father had died
       – the father of the children had died
       b. barnens far hade dött
       c. barnens far hade dött

The filtering automaton in (RE6.26) inserts a start-marker for the verbal head (i.e. replaces the empty string ‘0’ with the syntactic tag for a verbal head) right after the end of the actual noun phrase, and removes the redundant syntactic tags in the second replacement rule. The replacement procedure is (again) simultaneous, by application of parallel replacement.

(RE6.26)

define adjustNPGen [ 0 -> "" || NGen "" NPPhr _,,
                     "" -> 0 || NGen _ ~$"" "" ];

Another ambiguity problem occurs with the interrogative pronoun var ‘where’, which in Swedish is homonymous with the copula verb var ‘was’ or ‘were’. Since verbal heads are annotated first and the system identifies segments of maximal length, the homonymous pronoun is recognized as a verb and combined with the subsequent verb, as in (6.22) and (6.23).

(6.22) a. Var var den där överraskningen.
       where was the there surprise
       – Where was that surprise?
       b. Var var den där överraskningen ?

(6.23) a. Var såg du hästen Madde frågar jag.
       where saw you the-horse Madde ask I
       – Where did you see the horse, Madde? I asked.
       b. Var såg du hästen Madde frågar jag .

17 Here the string far is ambiguous between the noun reading ‘father’ and the present tense verb form ‘goes’.


A similar problem occurs with adjectives or participles homonymous with verbs, as in (6.24), where the adjective rädda ‘scared [pl]’ is identical to the infinitive or imperative form of the verb ‘rescue’ and is joined with the preceding copula verb to form a verb cluster.

(6.24) a. Alla blev rädda ...
       all became afraid
       – All became afraid ...
       b. Alla blev rädda ...

All verbal heads recognized as sequences of verbs beginning with a copula verb are selected by the replacement transducer in (RE6.27), which changes the verb cluster label to a new marking. This selection makes no change to the parsing result, in that no markings are moved or removed; its purpose is rather to prevent false error detection by marking such verb clusters as different. For instance, applying this transducer to the example in (6.22) yields the output presented in (6.25).

(RE6.27)

define SelectVCCopula [ "" -> "" || _ [CopVerb / NPTags] ~$"" "" ];

(6.25) Var var den där överraskningen ?

6.6.4 Parsing Expansion and Adjustment

The text is now annotated with syntactic tags, and some of the segments have to be further expanded with postnominal attributes and coordinations. In the current system, partitive prepositional phrases are the only postnominal attributes taken care of, since grammatical errors were found in these constructions. At the parsing stage, the partitive noun phrase in the example text in (6.26a) is split into a noun phrase, a prepositional head including the partitive preposition av ‘of’, and yet another noun phrase, as in (6.26b). By application of the filtering transducer in (RE6.28), these segments are merged to form a single noun phrase, as in (6.26c). This automaton removes the redundant inner syntactic markers by application of two replacement rules, constrained by the right or left context. The replacement occurs simultaneously, by application of parallel replacement.

(RE6.28)

define adjustNPPart [ "" -> 0 || _ PPart "",,
                      "" -> 0 || "" PPart _ ];


(6.26) a. Mamma och Virginias mamma hade öppnat en tygaffär i en av Dom gamla husen.
       mum and Virginia’s mum had opened a fabric-store in one of the old the-houses
       – Mum and Virginia’s mum had opened a fabric-store in one of the old houses.
       b. Mamma och Virginias mamma hade öppnat en tygaffär i en av Dom gamla husen .
       c. Mamma och Virginias mamma hade öppnat en tygaffär i en av Dom gamla husen .

Another type of phrase that needs to be expanded are verbal groups with a noun phrase in the middle, which normally occur when a sentence is initiated by a constituent other than the subject (i.e. with inverted word order; see Section 4.3.6), as in (6.27a). In the parsing phase the verbal group is split into two verbal heads, as in (6.27b), which should be joined into one, as in (6.27c).

(6.27) a. En dag tänkte Urban göra varma mackor
       One day thought Urban do hot sandwiches
       – One day Urban thought of making hot sandwiches.
       b. En dag tänkte Urban göra varma mackor .
       c. En dag tänkte Urban göra varma mackor .

The filtering automaton merging the parts of a verb cluster into a single segment is constrained so that two verbal heads are joined only if there is a noun phrase in between them and the preceding verbal head includes an auxiliary verb or a verb that combines with an infinitive verb form (VBAux). The corresponding regular expression (RE6.29) removes the redundant verbal head markers in this constrained context. The replacement works in parallel, here removing both the redundant start-marker and the redundant end-marker at the same time. There are two (alternative) replacement rules for every tag, since the noun phrase can either occur directly after the first verbal head, as in our example (6.27) above, or, as a pronoun, be part of the first verbal head. Tags not relevant for this replacement (VCTags) are ignored (/).

(RE6.29)

define adjustVC [ "" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr,,
                  "" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr,,
                  "" -> 0 || [[Adv* VBAux Adv*] / VCTags] "" NPPhr _ ~$"" "",,
                  "" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr "" _ ~$"" "" ];

Other filtering transducers are used for refining the parsing result. Incomplete parsing decisions are eliminated at the end of parsing. For instance, incomplete prepositional phrases, i.e. a prepositional head without a following noun phrase, defined in the regular expression (RE6.30a), are removed. Also removed are empty verbal heads, as in (RE6.30b), and other misplaced tags.

(RE6.30) a. define errorPPhead [ "" -> 0 || \[""] _ ,,
                                 "" -> 0 || _ \[""]];
         b. define errorVPHead [ "" -> 0];

6.7 Narrow Grammar

The narrow grammar is the grammar proper, whose purpose is to distinguish the grammatical segments from the ungrammatical ones. The automata of this grammar express the valid (grammatical) rules of Swedish, and constrain both the order of constituents and the feature requirements. The current grammar is based on the Child Data corpus and includes rules for noun phrases and the verbal core.

6.7.1 Noun Phrase Grammar

Noun Phrases

The rules in the noun phrase grammar are divided, following Cooper’s approach (Cooper, 1984, 1986), according to what types of constituent they consist of and what feature conditions they have to fulfill (see Section 4.3.1). There are altogether ten noun phrase types implemented, listed in Table 6.3, covering noun phrases headed by a (proper) noun, a pronoun or determiner, an adjective or a numeral, as well as noun phrases with a partitive attribute, reflecting the profile of the Child Data corpus.


Table 6.3: Noun Phrase Types

RULE SET  NOUN PHRASE TYPE            PATTERN               EXAMPLE
NP1       single noun                 (Num) N               (två) grodor ‘(two) frogs’
                                      PNoun                 Kalle
NP2       determiner and noun         Det (DetAdv) (Num) N  de (här) (två) grodorna ‘the/these (two) frogs’
          poss. noun and noun         NGen (Num) N          flickans (två) grodor ‘girl’s (two) frogs’
NP3       determiner, adj. and noun   Det AP N              den lilla grodan ‘the little frog’
          poss. noun, adj. and noun   NGen AP N             flickans lilla groda ‘girl’s little frog’
NP4       adjective and noun          (Num) AP N            (två) små grodor ‘(two) little frogs’
NP5       single pronoun              PN                    han ‘he’
NP6       single determiner           Det                   den ‘that’
NP7       adjective                   Adj+                  obehörig ‘unauthorized’
NP8       determiner and adjective    Det Adj+              de gamla ‘the old’
NP9       numeral                     (Det) Num             den tredje, 8 ‘the third, 8’
NPPart    partitive                   Num PPart NP          två av husen ‘two of houses’
          partitive                   Det PPart NP          ett av de gamla husen ‘one of the old houses’

Every noun phrase type is divided into six subrules, expressing the different types of errors: two for definiteness (NPDef, NPInd), two for number (NPSg, NPPl) and two for gender agreement (NPUtr, NPNeu).18 For instance, (RE6.31) gives the set of rules representing noun phrases consisting of a single pronoun. They state the feature requirements on the pronoun as the only constituent, e.g. that a definite form of the pronoun (PNDef) is required for the phrase to be considered a definite noun phrase (NPDef).

18 Utr denotes the common gender, called utrum in Swedish.

(RE6.31)

define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5  [PNSg];
define NPPl5  [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

The rule set NP2, presented in (RE6.32), is more complex and defines the grammar for definite, indefinite and mixed noun phrases (see Section 4.3.1) consisting of a determiner (or a possessive noun) and a noun. For instance, the definite form of this noun phrase type (NPDef2) is defined as a sequence of a definite determiner (DetDef), an optional adverbial (DetAdv; e.g. här ‘here’), an optional numeral (Num), and a definite noun, or as a sequence of a mixed determiner (DetMixed, i.e. one that takes an indefinite noun as complement, e.g. denna ‘this’) or a possessive noun (NGen), followed by an optional numeral and an indefinite noun.

(RE6.32)

define NPDef2 [DetDef (DetAdv) (Num) NDef]
            | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2  [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2  [[DetPl (DetAdv) | NGen] (Num) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

This particular automaton (NPDef2) accepts all the noun phrases in (6.28), except for the first one, which forms an indefinite noun phrase and will be handled by the subsequent automaton for indefinite noun phrases of this kind (NPInd2). It also accepts the ungrammatical noun phrase in (6.28c), since it only constrains the definiteness features. This erroneous noun phrase is then handled by the automaton representing singular noun phrases of this type (NPSg2), which states that only ordinal numbers (NumO) can be combined with singular determiners and nouns.

(6.28) a. en (första) blomma
       a [indef] (first) flower [indef]
       b. den (här) (första) blomman
       this [def] (here) (first) flower [def]
       c. ∗den (här) (två) blomman
       this [def] (here) (two) flower [def]
       d. denna (första) blomma
       this [def] (first) flower [indef]
       e. flickans (första) blomma
       girl’s [gen] (first) flower [indef]


The different noun phrase rules can be joined by union into larger sets, divided in accordance with the different feature conditions they meet. For instance, the set of all definite noun phrases is defined as in (RE6.33a) and the set of all indefinite noun phrases as in (RE6.33b). All noun phrases that meet definiteness agreement are then represented by the regular expression in (RE6.33c), an automaton formed as the union of all definite and all indefinite noun phrase automata.

(RE6.33) a. ### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5
            | NPDef6 | NPDef7 | NPDef8 | NPDef9];
b. ### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5
            | NPInd6 | NPInd7 | NPInd8 | NPInd9];
c. ###### NPs that meet definiteness agreement
define NPDefs [NPDef | NPInd];
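This organization makes error detection by subtraction straightforward: a candidate phrase accepted by the broad pattern but not by the feature-constrained union is a suspect. The toy sketch below is purely illustrative (Python; the two-word lexicon and feature encoding are invented, and only definiteness in determiner-noun phrases is modeled), but it captures the logic of subtracting the narrow grammar from the broad one:

```python
# Toy det+noun fragment; word lists and feature values are invented
# for illustration, and only definiteness agreement is modeled.
DETS  = {"en": "ind", "den": "def"}
NOUNS = {"blomma": "ind", "blomman": "def"}

def broad_np(det, noun):
    """Broad grammar: any determiner followed by any noun."""
    return det in DETS and noun in NOUNS

def narrow_np(det, noun):
    """Narrow grammar: determiner and noun agree in definiteness
    (cf. NPDefs = [NPDef | NPInd])."""
    return broad_np(det, noun) and DETS[det] == NOUNS[noun]

def definiteness_error(det, noun):
    """The subtraction idea: accepted by the broad grammar but rejected
    by the narrow one, hence a candidate agreement error."""
    return broad_np(det, noun) and not narrow_np(det, noun)

print(definiteness_error("en", "blomman"))   # True: *en blomman
print(definiteness_error("den", "blomman"))  # False: den blomman is fine
```

As in the finite-state setting, only positive rules are written; the errors emerge as the difference between the two descriptions, with no explicit error grammar.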

Noun phrases with partitive attributes have a noun phrase as the head and are treated separately in the grammar. Although agreement holds only in gender, between the quantifier and the noun phrase, the rules for definiteness and number state that the head noun phrase has to be definite and plural; see (RE6.34).

(RE6.34)

define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg  [[DetSg | Num] PPart NPPl];
define NPPartPl  [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

Adjective Phrases

Adjective phrases occur as modifiers in two of the defined noun phrase types (NP3 and NP4) and form heads of their own in two others (NP7 and NP8). In the present implementation, an adjective phrase consists of an optional adverb and a sequence of one or more adjectives, and is likewise defined in accordance with the feature conditions that have to be fulfilled for definiteness, number and gender, as shown in (RE6.35). The gender feature set also includes an additional definition for masculine gender.

(RE6.35)

define APDef ["" (Adv) AdjDef+ ""];
define APInd ["" (Adv) AdjInd+ ""];
define APSg  ["" (Adv) AdjSg+ ""];
define APPl  ["" (Adv) AdjPl+ ""];
define APNeu ["" (Adv) AdjNeu+ ""];
define APUtr ["" (Adv) AdjUtr+ ""];
define APMas ["" (Adv) AdjMas+ ""];

FiniteCheck: A Grammar Error Detector


One problem related to error detection concerns the ambiguity of weak and strong adjective forms, which coincide in the plural, while in the singular the weak form is used only in definite noun phrases (see Section 4.3.1). Consequently, such adjectives obtain both singular and plural tags, and errors such as the one in (6.29) would be overlooked by the system. As (6.29a) shows, the adjective trasiga 'broken' is analyzed both as singular (and definite) and as plural (and indefinite), like the surrounding determiner and head noun, so the checks for number and for definiteness will each succeed. Since the whole noun phrase is singular, the plural tag highlighted in bold face in (6.29b) is irrelevant and can be removed by the automaton defined in (RE6.36), allowing a definiteness error to be reported.

(6.29)
a. en trasiga speldosa
   a [sg,indef] broken [sg,wk] or [pl] musical box [sg,indef]
b. en[dt utr sin ind][pn utr sin ind sub/obj] trasiga[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] speldosa[nn utr sin ind nom]

(RE6.36)

define removePluralTagsNPSg [ TagPLU -> 0 || DetSg "" Adj _ ˜$"" ""];
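The effect of such a filtering rule can be sketched with an ordinary regular-expression substitution. This is a simplified stand-in, not the system's actual transducer: the tag strings are shortened versions of those in (6.29b), and the pattern only handles one adjective directly after the determiner.

```python
import re

# Sketch of the filtering idea in (RE6.36): inside a noun phrase opened by a
# singular determiner, the spurious plural reading on the following adjective
# is deleted, so the remaining singular definite tag can trigger the
# definiteness check against the indefinite determiner.
text = ("en[dt utr sin ind] "
        "trasiga[jj pos utr/neu sin def nom][jj pos utr/neu plu ind/def nom] "
        "speldosa[nn utr sin ind nom]")

# After a singular determiner tag, drop a following adjective's [... plu ...] tag.
filtered = re.sub(
    r"(\[dt [^\]]*sin[^\]]*\]\s*\w+\[[^\]]*\])\[[^\]]*plu[^\]]*\]",
    r"\1",
    text)
print(filtered)
# The plural reading is gone; only the singular definite tag remains on trasiga.
```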

Other Selections

In addition to these noun phrase rules, noun phrases with a determiner and a noun as the head that are followed by a relative subordinate clause are treated separately, for the reason that the definiteness conditions are different in this context (see Section 4.3.1). As in (6.30), the head noun that is normally in definite form after a definite article lacks the suffixed article and stands instead in indefinite form. In the current system, these segments are selected as separate from other noun phrases by application of the filtering transducer in (RE6.37), which simply changes the opening noun phrase label to a separate label for relative-clause noun phrases in the context of a definite determiner, further constituents and the complementizer som 'that'. The grammar is thus prepared for later extension to the detection of these error types as well.

Chapter 6.


(6.30)
a. Jag tycker att det borde finnas en hjälpgrupp för de elever som har lite sociala problem.
   I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
   'I think that there should be a help-group for the pupils that have some social problems.'
b. Jag tycker att det borde finnas en hjälpgrupp för de elever som har sociala problem .

(RE6.37)
define SelectNPRel ["" -> "" || _ DetDef ˜$"" "" (" ") {som} Tag*];

6.7.2 Verb Grammar

The narrow grammar of verbs specifies the valid rules for finite and non-finite verbs (see Section 4.3.5). The rules consider the form of the main finite verb, verb clusters and verbs in infinitive phrases.

Finite Verb Forms

The finite verb form occurs in verbal heads either as a single main verb or as an auxiliary verb in a verb cluster. The grammar rule in (RE6.38) states that the first verb in the verbal head (possibly preceded by adverbs) has to be tensed. Any following verbs (or other constituents) in the verbal head are then ignored (indicated by the any-symbol '?*').

(RE6.38)

define VPFinite [Adv* VerbTensed ?*];
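The same constraint can be sketched as an ordinary regular expression over tag strings. This is an illustration only: the tags (ab = adverb, vb.prs/vb.prt = present/preterite, vb.inf = infinitive) are assumed simplifications, not the system's actual tag set.

```python
import re

# Regex analogue of the VPFinite rule: inside a verbal head, optional adverbs
# may precede the first verb, which must be tensed; whatever follows the first
# verb is ignored, like '?*' in the XFST rule.
VP_FINITE = re.compile(r"^(?:ab )*vb\.(?:prs|prt)\b")

print(bool(VP_FINITE.match("ab vb.prs vb.inf")))  # first verb tensed: True
print(bool(VP_FINITE.match("ab ab vb.inf")))      # first verb untensed: False
```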

Infinitive Verb Phrases

The rule defining the verb form in infinitive phrases concerns verbal heads preceded by an infinitive marker. The marking transducer in (RE6.39a) selects these verbal heads and changes their label to that of an infinitival verbal head. The grammar rule of the infinitive verbal core is defined in (RE6.39b), including just one verb in infinitive form (VerbInf), possibly preceded by (zero or more) adverbs and/or a modal verb also in infinitive form (ModInf).


(RE6.39)
a. define SelectInfVP ["" -> "" || InfMark "" _ ];
b. define VPInf [Adv* (ModInf) VerbInf Adv* ?*];

Verb Clusters

The narrow grammar of verb clusters is more complex, including rules both for modal (Mod) and temporal auxiliary verbs (Perf) and for verbs combining with infinitives (INFVerb), i.e. infinitive phrases without an infinitive marker (see Section 4.3.5). The grammar rules state the order of constituents and the form of the verbs following the auxiliary verb. The form of the auxiliary verb itself is defined in the VPFinite rule above (see (RE6.38)), i.e. the verb has to be finite. The marking automaton in (RE6.40b) selects as verb clusters all verbal heads that include more than one verb, using the VC-rule in (RE6.40a). A potential verb cluster has the form of a verb followed by (zero or more) adverbs, an (optional) noun phrase, (zero or more) adverbs and subsequently one or two verbs. Other syntactic tags (NPTags) are ignored ('/' is the ignore-operator).

(RE6.40)
a. define VC [ [[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags] ];
b. define SelectVC [VC @-> "" ... ""];

Five different rules describe the grammar of verb clusters. Three rules concern the modal verbs (VC1, VC2, VC3, presented in (RE6.41)) and two rules deal with temporal auxiliary verbs (VC4, VC5, presented in (RE6.42)). Verbs that take infinitival phrases (without the infinitival marker) (INFVerb) share two rules with the modal verbs (VC1, VC2). All the verb cluster rules have the form VBaux (NP) Adv* Verb (Verb), i.e. an auxiliary verb followed by an optional noun phrase, (zero or more) adverbs, a verb and an optional verb. By including the optional noun phrase, the grammar also handles inverted sentences. Again, irrelevant tags (NPTags) are ignored.

(RE6.41)
a. define VC1 [ [[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags] ];
b. define VC2 [ [Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags] ];
c. define VC3 [ [Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags] ];


(RE6.42)
a. define VC4 [ [Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags] ];
b. define VC5 [ [Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags] ];

All five rules can be combined by union into one automaton that represents the grammar of all verb clusters, presented in (RE6.43).

(RE6.43)

define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

Other Selections

Coordinations of verbal heads in verb clusters or in infinitive verb phrases are selected as separate segments by the marking transducer in (RE6.44). The automaton replaces the verbal head marking with a new label that indicates coordination of verbs, as exemplified in (6.31) and (6.32).

(RE6.44)

define SelectVPCoord ["" -> "" || ["" | ""] ˜$"" ˜$"" [{eller} | {och}] Tag* (" ") "" _ ];

(6.31)
a. hon skulle springa ner och larma
   she would run down and alarm
   'She was about to run down and give the alarm.'
b. hon skulle springa ner och larma

(6.32)
a. det är dags att gå och lägga sig.
   it is time to go and lay oneself
   'It is time to go to bed.'
b. det är dags att gå och lägga sig .

The infinitive marker att is in Swedish homonymous with the complementizer att 'that' and with part of för att 'because', and is thus not necessarily followed by an infinitive, as in (6.33), (6.34) and (6.35). Such ambiguous constructions are selected as separate segments by the regular expression in (RE6.45), which changes the verbal head label to a separate label for these contexts.


(6.33)
a. Tuni ringde mig sen och sa att allt hade gått bara bra.
   Tuni called me later and said that everything had [pret] gone [sup] just good
   'Tuni called me later and said that everything had gone just fine.'
b. Tuni ringde mig sen och sa att allt hade gått bara bra .

(6.34)
a. Men det skulle han aldrig ha gjort för att då börjar grenen att röra på sig ...
   but it should he never have done because then starts [pres] the-branch to move on itself
   'But he should never have done that, because then the branch starts to move.'
b. Men det skulle han aldrig ha gjort för att då börjar grenen att röra på sig ...

(6.35)
a. så tänkte jag att nu hade jag chansen.
   so thought I that now had [pret] I the-chance
   'So I thought that now I had the chance.'
b. så tänkte jag att nu hade jag chansen .

(RE6.45)

define SelectATTFinite [ "" -> "" ||
  [ [ [[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("") ] |
    [ [{tänkte} Tag+] [[NPPhr ""] | ["" NPPhr ""]] ] ]
  InfMark "" _ ];

Verbal heads containing only a supine verb, as in (6.36) and (6.37), are also selected separately. They are considered grammatical in subordinate clauses, whereas main clauses with supine verbs without a preceding auxiliary verb are invalid in Swedish (see Section 4.3.5). The transducer created by the regular expression in (RE6.46) replaces the verbal head marking with a separate label for supine verbal heads.


(6.36)
a. Tänk om jag bott hos pappa.
   think if I lived [sup] with daddy
   'Think if I had lived at Daddy's.'
b. Tänk om jag bott hos pappa .

(6.37)
a. det var en gång en pojke som fångat en groda.
   it was a time a boy that caught [sup] a frog
   'There was once a boy that had caught a frog.'
b. det var en gång en pojke som fångat en groda .

(RE6.46)

define SelectSupVP [ "" -> "" || _ VerbSup ""];

6.8 Error Detection and Diagnosis

6.8.1 Introduction

The broad grammar is applied to mark both grammatical and ungrammatical phrases in a text. The narrow grammar expresses the nature of grammatical phrases in Swedish and is then used to distinguish the true grammatical patterns from the ungrammatical ones. The automata created at the error detection stage correspond to the patterns that do not meet the constraints of the narrow grammar and thus compile into a grammar of errors. This is achieved by subtracting the narrow grammar from the broad grammar: the potential phrase segments recognized by the broad grammar are checked against the rules of the narrow grammar, and by taking the difference, the constructions violating those rules are identified. The detection process is also partial in the sense that errors are located in an appropriately delimited context, i.e. a noun phrase when looking for agreement errors in noun phrases, a verbal head when looking for violations of finiteness, etc. The replacement operator is used for marking the errors in the text.
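The subtraction step can be illustrated in miniature. The sketch below (Python, not the thesis's XFST implementation) builds the difference of two DFAs by the standard product construction: a toy "broad" grammar accepting any determiner-adjective-noun tag sequence, and a toy "narrow" grammar accepting only number-agreeing sequences, so that their difference recognizes exactly the erroneous phrases. All state names and tags are invented for the example.

```python
def complete(dfa, alphabet):
    """Add a dead state so every (state, symbol) pair has a transition."""
    start, finals, delta = dfa
    states = {start} | set(finals) | {s for s, _ in delta} | set(delta.values())
    full = dict(delta)
    for s in states | {"DEAD"}:
        for sym in alphabet:
            full.setdefault((s, sym), "DEAD")
    return start, finals, full

def difference(a, b, alphabet):
    """Product DFA for L(a) - L(b): final iff final in a and non-final in b."""
    a0, af, ad = complete(a, alphabet)
    b0, bf, bd = complete(b, alphabet)
    delta, todo, seen = {}, [(a0, b0)], {(a0, b0)}
    while todo:                      # explore reachable product states
        p, q = todo.pop()
        for sym in alphabet:
            nxt = (ad[(p, sym)], bd[(q, sym)])
            delta[((p, q), sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    finals = {s for s in seen if s[0] in af and s[1] not in bf}
    return (a0, b0), finals, delta

def accepts(dfa, word):
    start, finals, delta = dfa
    state = start
    for sym in word:
        state = delta.get((state, sym), "DEAD")
        if state == "DEAD":
            return False
    return state in finals

ALPHABET = ["Dsg", "Dpl", "Asg", "Apl", "Nsg", "Npl"]
# Broad grammar: any Det Adj* Noun, ignoring number.
BROAD = (0, {2}, {(0, "Dsg"): 1, (0, "Dpl"): 1, (1, "Asg"): 1,
                  (1, "Apl"): 1, (1, "Nsg"): 2, (1, "Npl"): 2})
# Narrow grammar: only number-agreeing Det Adj* Noun.
NARROW = (0, {3}, {(0, "Dsg"): 1, (1, "Asg"): 1, (1, "Nsg"): 3,
                   (0, "Dpl"): 2, (2, "Apl"): 2, (2, "Npl"): 3})

ERRORS = difference(BROAD, NARROW, ALPHABET)
print(accepts(ERRORS, ["Dsg", "Apl", "Nsg"]))  # number clash: True (flagged)
print(accepts(ERRORS, ["Dsg", "Asg", "Nsg"]))  # agreeing: False (not flagged)
```

Because both input grammars are positive descriptions of valid Swedish, the error machine comes out of the subtraction for free, which is the design point made above.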

6.8.2 Detection of Errors in Noun Phrases

In the current narrow grammar, there are three rules for agreement errors in noun phrases without postnominal attributes and three for partitive constructions, all reflecting the features of definiteness, number and gender and differing only in the context they are detected in. We present the detection rules for noun phrases without postnominal attributes in (RE6.47) and for partitive noun phrases in (RE6.48). These automata represent the result of subtracting the narrow grammar of, e.g., all noun phrases that meet the definiteness conditions (NPDefs; (RE6.33) on p.208) from the overgenerating grammar of all noun phrases (NP; (RE6.22) on p.196). By application of a marking transducer, the ungrammatical segments are selected and annotated with appropriate diagnosis markers related to the types of rules that are violated, as presented in (RE6.47) and (RE6.48).

(RE6.47)
a. define npDefError ["" [NP - NPDefs] ""];
   define npNumError ["" [NP - NPNum] ""];
   define npGenError ["" [NP - NPGen] ""];
b. define markNPDefError [npDefError -> "" ... ""];
   define markNPNumError [npNumError -> "" ... ""];
   define markNPGenError [npGenError -> "" ... ""];

(RE6.48)
a. define NPPartDefError ["" [NPPart - NPPartDefs] ""];
   define NPPartNumError ["" [NPPart - NPPartNum] ""];
   define NPPartGenError ["" [NPPart - NPPartGen] ""];
b. define markNPPartDefError [NPPartDefError -> "" ... ""];
   define markNPPartNumError [NPPartNumError -> "" ... ""];
   define markNPPartGenError [NPPartGenError -> "" ... ""];

The narrow grammar of noun phrases is prepared for further extension to noun phrases modified by relative clauses, which in the current version of the system are just selected as distinct from the other noun phrase types.


6.8.3 Detection of Errors in the Verbal Head

Three detection rules are defined for verb errors, identifying the three types of context they can appear in. Errors in finite verb form are checked directly in the verbal head (vpHead). Errors in infinitive phrases are detected in the context of a verbal head preceded by an infinitive marker (vpHeadInf). Errors in verb form following an auxiliary verb are detected in the context of previously selected (potential) verb clusters (vc). The nets of the detecting regular expressions presented in (RE6.49a) correspond (as for noun phrases) to the difference between the grammatical rules (e.g. VPInf in (RE6.39) on p.211) and the more general rules (e.g. VPhead in (RE6.22) on p.196), yielding the ungrammatical verbal head patterns. The annotating automata in (RE6.49b) are used for error diagnosis.

(RE6.49)
a. define vpFiniteError ["" [VPhead - VPFinite] ""];
   define vpInfError ["" [VPhead - VPInf] ""];
   define VCerror ["" [VC - VCgram] ""];
b. define markFiniteError [vpFiniteError -> "" ... ""];
   define markInfError [vpInfError -> "" ... ""];
   define markVCerror [VCerror -> "" ... ""];

Also, the narrow grammar of verbs can be extended with the grammar of coordinated verbs, use of finite verb forms after att ‘that’ and bare supine verb form as the predicate, all selected as separate patterns.

6.9 Summary

This chapter presented the final step of this thesis: implementing detection of some of the grammar errors found in the Child Data corpus. The whole system is implemented as a network of finite state transducers, disambiguation is minimal, achieved essentially by parsing order and filtering techniques, and the grammars of the system are always positive. The system detects errors in noun phrase agreement and errors in finite and non-finite verb forms. The strength of the implemented system lies in the definition of grammars as positive rule sets, covering the valid rules of the language. The rule sets remain


quite small and practically no description of errors by hand is necessary. There are altogether six rules defining the broad grammar set, and the narrow grammar set is also quite small. Other automata are used for selection and filtering. We do not have to elaborate on what errors may occur, only on the contexts in which they occur, and certainly not spend time stipulating their structure. The approach further aimed at minimal information loss, in order to be able to handle texts containing errors. The degree of ambiguity is maximal at the lexical level, where we choose to attach all lexical tags to strings. At higher levels, structural ambiguity is treated by parsing order, grammar extension and filtering techniques. The parsing order resolves some structural ambiguities and is complemented by grammar extensions in the form of filtering transducers that refine and/or redefine the parsing decisions. Other disambiguation heuristics are applied, for instance, to noun phrases, where pronouns that follow a verbal head are attached directly to the verbal head in order to prevent them from attaching to a subsequent noun.


Chapter 7

Performance Results

7.1 Introduction

The implementation of the grammar error detector is to a large extent based on the lexical and syntactic circumstances displayed in the Child Data corpus. The actual implementation proceeded in two steps. In the first phase we developed the grammar so that the system could run on sentences containing errors and correctly identify the errors. When the system was then run on complete texts, including correct material, the false alarms allowed by the system were revealed. The second phase involved adjustment of the grammar to improve the flagging accuracy of the system. FiniteCheck was tested for grammatical coverage (recall) and flagging accuracy (precision) on Child Data and on an arbitrary text not known to the system, in accordance with the performance test of the other three grammar checkers (see Section 5.2.3). In this chapter I present results from both the initial phase in the development of the system (Section 7.2) and the improved current version (Section 7.3). The results are further compared to the performance of the other three Swedish checkers on both Child Data (Section 7.4) and the unseen adult text (Section 7.5). The chapter ends with a short summary and conclusions (Section 7.6).

7.2 Initial Performance on Child Data

7.2.1 Performance Results: Phase I

The results of the implemented detection of errors in noun phrase agreement, verb form in finite verbs, after auxiliary verb and after infinitive markers in Child


Data from the initial Phase I in the development of FiniteCheck are presented in Table 7.1.

Table 7.1: Performance Results on Child Data: Phase I

                                     Correct Alarm          False Alarm          Performance
Error Type                Errors   Correct  Incorrect     No       Other     Recall  Precision  F-value
                                   Diagn.   Diagn.        Error    Error
Agreement in NP              15      14        1            76       64       100%     10%       18%
Finite Verb Form            110      98        0           237       19        89%     28%       42%
Verb Form after Vaux          7       6        0            61       10        86%      8%       15%
Verb Form after inf. m.       4       4        0             5        0       100%     44%       62%
TOTAL                       136     122        1           379       93        90%     21%       34%

The grammatical coverage (recall) on this training corpus was maximal, except for one erroneous verb form after an auxiliary verb and a few instances of errors in finite verb form. The overall recall rate for these four error types was 90%. When tested on the whole Child Data corpus, many segments were wrongly marked as errors and the precision rate was quite low, only 21% in total, resulting in an overall F-value of 34%. Most of the false alarms occurred for errors in finite verb form, followed by errors in noun phrase agreement. Relative to the error frequency of the individual error types, errors in verb form after an auxiliary verb had the lowest precision (8%), closely followed by errors in noun phrase agreement (10%). The grammar of the system was at this initial stage based essentially on the syntactic constructions displayed in the erroneous patterns that we wanted to capture. Many of the false alarms were due to missing grammar rules when the system was tested on the whole text corpus. Other false markings of correct text occurred due to ambiguity, incorrect segmentation of the text at the parsing stage, or occurrences of other error categories than grammatical ones. Below I discuss the grammatical coverage and flagging accuracy in this initial phase in more detail.
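The summary scores can be reproduced from the raw counts in Table 7.1. The following sketch (in Python, not part of the thesis toolchain) recomputes them under the evaluation convention used here, where an alarm on a real error counts as a detection even when the diagnosis is incorrect, and false alarms cover both flagged correct text and flagged occurrences of other error categories.

```python
def scores(correct_diag, incorrect_diag, false_alarms, total_errors):
    """Recall, precision and F-value as used in Table 7.1."""
    detected = correct_diag + incorrect_diag   # any alarm on a real error
    recall = detected / total_errors
    precision = detected / (detected + false_alarms)
    f_value = 2 * precision * recall / (precision + recall)
    return recall, precision, f_value

# Totals row: 122 + 1 detections, 379 + 93 false alarms, 136 errors overall.
r, p, f = scores(122, 1, 379 + 93, 136)
print(round(r * 100), round(p * 100), round(f * 100))  # -> 90 21 34
```

The same function reproduces the per-type rows, e.g. the agreement row (14 + 1 detections, 76 + 64 false alarms, 15 errors) yields 100%, 10% and 18%.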

7.2.2 Grammatical Coverage

Errors in Noun Phrase Agreement

All errors in noun phrase agreement were detected, one of them with incorrect diagnosis due to a split in the head noun. FiniteCheck is not prepared to handle segmentation errors, and exactly like the other Swedish grammar checkers it detects the noun phrase with inconsistent use of adjectives (G1.2.3; see (4.9) on p.49) only in part. The detector yields both the correct diagnosis of a gender mismatch and


an incorrect diagnosis of a definiteness mismatch, since the first part troll 'troll' of the head noun is indefinite and neuter and does not agree with the definite, common gender determiner den 'the', as seen in (7.1a). When the head noun has the correct form and is no longer split into two parts, the whole noun phrase is selected and a gender mismatch is reported, as seen in (7.1b).

(7.1) (G1.2.3)
a. det va den troll karlen tokig som ...
   it was the [com,def] troll [neu,indef] man [com,def] Tokig that
   'It was the awful ugly magician Tokig that ...'
b. det va den hemske fula trollkarlen tokig som ...
   it was the [com,def] awful [masc] ugly [def] magician [com,def] Tokig that
   'It was the awful ugly magician Tokig that ...'

Errors in Finite Verb Form

Among the errors in finite verb form, none of the errors concerning main verbs realized as participles were detected (G5.2.90-G5.2.99; see (4.30) on p.60). They require other methods for detection since, as seen in (7.2), they are interpreted as adjective phrases.

(7.2)
a. älgen sprang med olof till ett stup och *kastad ner olof och hans hund
   the-moose ran with Olof to a cliff and threw [past part] down Olof and his dog
   'The moose ran with Olof to a cliff and threw Olof and his dog over it.'
b. älgen sprang med olof till ett stup och kastad ner olof och hans hund

Two errors were missed because preceding verbs were joined into the same segment and treated as verb clusters, as shown in (7.3) and (7.4).


(7.3) (G5.1.1)
a. Madde och jag bestämde oss för att sova i kojan och se om vi *få se vind.
   Madde and I decided ourselves for to sleep in the-hut and see if we can [untensed] see Vind
   'Madde and I decided to sleep in the hut and see if we will see Vind.'
b. Madde och jag bestämde oss för att sova i kojan och se om vi få se vind .

(7.4) (G5.2.40)
a. När vi kom fram *börja vi packa upp våra grejer och rulla upp sovsäcken.
   when we came forward start [untensed] we pack up our stuff and roll up the-sleeping-bag
   'When we arrived, we started to unpack our things and roll out the sleeping bag.'
b. När vi kom fram börja vi packa upp våra grejer och rulla upp sovsäcken .

One of the errors in finite verb form was wrongly selected, as seen in (7.5b). Here, the noun bo 'nest' is homonymous with the verb bo 'live' and is joined together with the main verb into a verb cluster.1 The detector selects this verb cluster and diagnoses it as an error in finite verb form, which is actually true, but only of the main verb, the second constituent of the segment.

1 The noun phrase tags surrounding bo are ignored in the selection as verb cluster, see (RE6.40) on p.211.

(7.5) (G5.2.70)
a. Då gick pojken vidare och såg inte att binas bo *trilla ner.
   then went the-boy further and saw not that the-bees's nest tumble [untensed] down
   'Then the boy went further on and did not see that the nest of the bees tumbled down.'
b. Då gick pojken vidare och såg inte att binas bo trilla ner.

Rest of the Verb Form Errors

One error in verb form after an auxiliary verb was not detected (see (7.6)); it involved coordination of a verb cluster with yet another verb, which should follow the same pattern and thus be in infinitive form (i.e. låta 'let [inf]'). The system does not take coordination of verbs into consideration, and the coordinated verb is identified as a separate verbal head with a finite verb, which is a valid form according to the grammar rules of the system, so the error is overlooked.

(7.6) (G6.1.2)
a. Ibland får man bjuda på sig själv och *låter henne/honom vara med!
   sometimes must [pres] one offer [inf] on oneself and let [pres] her/him be with
   'Sometimes one has to make a sacrifice and let him/her come along.'
b. Ibland får man bjuda på sig själv och låter henne/honom vara med !

Finally, all errors in verb form after infinitive marker were detected.

7.2.3 Flagging Accuracy

This subsection presents the kinds of false flaggings that occurred in this first test of the system. The description proceeds error type by error type, specifying whether a false alarm was due to missing grammar rules, erroneous segmentation of the text at the parsing stage, or ambiguity. Furthermore, the false alarms involving other error categories are specified.


False Alarms in Noun Phrase Agreement

The kinds and the number of false alarms occurring in noun phrases are presented in Table 7.2.

Table 7.2: False Alarms in Noun Phrases: Phase I

FALSE ALARM TYPE                        NO.
Not in Grammar:   NPInd+som               5
                  Adv in NP              28
                  other                   8
Segmentation:     too long parse         26
Ambiguity:        PP                      7
                  V                       2
Other Error:      misspelling            12
                  split                  48
                  sentence boundary       4

Most of these false alarms were due to constructions not included in the grammar of the system. For instance, adverbs in noun phrases as in (7.7a) were not covered, causing alarms in gender agreement, since in Swedish the neuter form of an adjective often coincides with the adverb of the same lemma. Further, noun phrases with a subsequent relative clause, such as (7.7b), were selected as errors in definiteness, although they are correct, since the head noun takes indefinite form when followed by such a clause (see Section 4.3.1).

(7.7)
a. Det var i skolan och jag kom lite för sent till en lektion med väldigt sträng lärare.
   it was in school and I came little too late to a class with very hard/strict teacher
   'It was in school, and I came a little late to a class with a very strict teacher.'
b. Jag tycker att det borde finnas en hjälpgrupp för de elever som har lite sociala problem.
   I think that it should exist a help-group for the [pl,def] pupils [pl,indef] that have some social problems
   'I think that there should be a help-group for the pupils that have some social problems.'

Other false flaggings depended on the application of longest match, resulting in noun phrases with too wide a range, as in (7.8), where the modifying predicative


complement and the subject are merged into one noun phrase: the inverted word order places the verb at the end of the sentence instead of its usual place in between, so skolan 'school' should form a noun phrase of its own.

(7.8) dom tänker inte hur viktig skolan är
      they think not how important [str] school [def] is
      'They do not think how important school is.'

Furthermore, due to lexical ambiguity, some prepositional phrases, as in (7.9), and some verbs were parsed as noun phrases and later marked as errors.

(7.9) Det är en ganska stor väg ungefär vid hamnen
      it is a rather big road somewhere wide [indef]/at harbor [def]
      'It is a rather big road somewhere at the harbor.'

Also, false flaggings involving other error categories than grammar errors were quite common. Mostly splits, as in (7.10a), were flagged; here the noun ögonblick 'moment' is split, and its first part ögon 'eyes' does not agree in number with the preceding singular determiner and adjective. Flaggings involving misspellings also occurred, as in (7.10b), where the newly formed word results in a noun of different gender and definiteness than the preceding determiner and causes agreement alarms. Some cases of missing sentence boundaries were likewise flagged as errors in noun phrase agreement.

(7.10)
a. För ett kort ögon blick trodde jag ...
   for a [sg] short eye [pl] blinking thought I
   'For a short moment I thought ...'
b. Det ända jag vet ...
   the [neu,def] end [com,indef] I know
   'The only thing I know ...'

Furthermore, erroneous tags assigned in the lexical lookup caused trouble, for instance when many words were erroneously selected as proper names.


False Alarms in Finite Verb Form

The types and number of false alarms in finite verbs are summarized in Table 7.3. These occurred mostly because of the small size of the grammar, but also due to ambiguity problems.

Table 7.3: False Alarms in Finite Verbs: Phase I

FALSE ALARM TYPE                             NO.
Not in Grammar:  imperative                   56
                 coordinated infinitive       74
                 discontinuous verb cluster   43
Ambiguity:       noun                         36
                 pronoun                       8
                 preposition/adjective        20
Other Error:     misspelling                   9
                 split                        10

Imperative verb forms, which in the first phase were not part of the grammar, caused false alarms not only on actual imperatives as in (7.11a), but also on strings homonymous with such forms as in (7.11b). Here the noun sätt is ambiguous between the noun reading 'way' and the imperative verb form 'set'.

(7.11)
a. Men titta en stock.
   but look [imp] a log
   'But look, a log.'
b. Dom samlade in pengar på olika sätt
   they collected in money in different ways/set [imp]
   'They collected money in different ways.'

Further, coordinated infinitives as in (7.12) were diagnosed as errors in finite verb form, since, due to the partial parsing strategy, they were selected as separate verbal heads (see (6.31) and (6.32) on p.212).

(7.12)
a. hon skulle springa ner och larma
   she would run [inf] down and alarm
   'She would run down and give the alarm.'
b. det är dags att gå och lägga sig.
   it is time to go and lay [inf] oneself
   'It is time to go and lie down.'


Similar problems occurred with discontinuous verb clusters, where a noun followed the auxiliary verb and the subsequent verb forms were treated as separate verbal heads (see (6.27) on p.204). Further, primarily nouns, but also pronouns, adjectives and prepositions were recognized as verbal heads, causing false error diagnoses. Other error categories selected as errors in finite verb form concerned both splits and misspellings, but these were considerably fewer than the corresponding false alarms in noun phrase agreement.

False Alarms in Verb Forms after an Auxiliary Verb

False alarms in verb forms after an auxiliary verb occurred either due to ambiguous nouns, pronouns, adjectives and prepositions being interpreted as verbs, or due to occurrences of other error categories (Table 7.4). Pronouns were interpreted as verbs (mostly) in front of a copula verb and merged into a verb cluster segment. Similar problems occurred with adjectives and participles (see (6.22)-(6.24) starting on p.202).

Table 7.4: False Alarms in Verb Clusters: Phase I

FALSE ALARM TYPE                              NO.
Ambiguity:             noun                    26
                       pronoun                 18
                       preposition/adjective   17
Other error category:  misspelling              3
                       split                    7

Among the false flaggings concerning other error categories, both spelling errors and splits were flagged. In (7.13) we see an example of a misspelling, where the adjective rädd 'afraid' is written as red, coinciding with the verb red 'rode', and is marked as an error in verb form after an auxiliary verb.2

(7.13) pojken blev red
       the-boy became rode
       'The boy became afraid.'

Furthermore, many instances of missing punctuation at a sentence boundary were flagged as errors in verb clusters, as the ones in (7.14).3 Similarly to the

2 The broad grammar rule for verb clusters joins all types of verbs, which is why the copula verb blev 'became' is included.
3 Two vertical bars indicate the missing clause or sentence boundary.


performance test of the other grammar checkers, these flaggings are not included in the test. They represent correct flaggings, although the diagnosis is not correct.

(7.14)
a. Jag fortsatte vägen fram || då såg jag en brandbil || jag visste vad det var.
   I continued the-road forward then saw I a fire-car I knew what it was
   'I continued forward on the road. Then I saw a fire truck. I knew what it was.'
b. I hålet pojken hittat || fanns en mullvad.
   in the-hole the-boy found was a mole
   'In the hole the boy had found, there was a mole.'

False Alarms in Verb Forms in Infinitive Phrases

Finally, five false alarms on infinitival verbal heads occurred in constructions that do not require an infinitive verb form after att, which is both an infinitive marker 'to' and a subjunction 'that' (see (6.33)-(6.35) starting on p.213).

7.3 Current Performance on Child Data

7.3.1 Introduction

As shown above, almost all the errors in Child Data were detected by FiniteCheck. The segments erroneously classified as errors by the detector were mostly due to the small number of grammatical structures covered by the grammar, tagging problems and the high degree of ambiguity in the system. Many alarms also involved other error categories, such as misspellings, splits and omitted punctuation. In accordance with these observations, the detection performance of the system was improved in three ways in order to avoid false alarms:

• extend and correct the lexicon
• extend the grammar
• improve parsing

The full-form lexicon of the system is rather small (around 160,000 words) and not without errors. So, the first and rather easy step was to correct erroneous


tagging and add new words to the lexicon. The grammar rules were extended and filtering transducers were used to block false parsing. Below follows a description of the grammar extension and other improvements in the system to avoid false alarms in the individual error types. Then the current performance of the system is presented (Section 7.3.3).

7.3.2 Improving Flagging Accuracy

Improving Flagging Accuracy in Noun Phrase Agreement

The grammar of adjective phrases was expanded with missing adverbs. Noun phrases followed by relative clauses display distinct agreement constraints and were selected separately by the already discussed regular expression (RE6.37) (see p.210). This does not mean that the grammar is extended for such noun phrases, but false alarms in these constructions may be avoided. The false alarms in noun phrases caused by limitations in the grammar set were all avoided. This grammar update further improved parsing in the system and decreased the number of wide parses giving rise to false alarms. The types and number of false alarms that remain are presented in Table 7.5.

Table 7.5: False Alarms in Noun Phrases: Phase II

  FALSE ALARM TYPE                        NO.
  Segmentation:  too long parse             5
  Ambiguity:     PP                        10
  Other Error:   misspelling               10
                 split                     35
                 sentence boundary          2

Among these are (relative) clauses without complementizers, as in (7.15).

(7.15) a. det var den godaste frukost jag någonsin ätit ...
          it was the best breakfast I ever eaten
          – It was the best breakfast I have ever eaten ...
       b. det var den godaste frukost jag någonsin ätit .


Improving Flagging Accuracy in Finite Verbs

In the case of finite verbs, the problem with imperative verbs is solved to the extent that forms that do not coincide with other verb forms are accepted as finite verb forms, e.g. tänk 'think'. The imperative forms that coincide with infinitives (e.g. titta 'look') remain. The problem is mainly that errors in verbs realized as missing tense endings often coincide with the imperative (and infinitive) form of the verb. Allowing all imperative verb forms as grammatical finite verb forms would then mean that such errors would not be detected by the system. Normally other hints, such as checking for end-of-sentence marking or a noun phrase before the predicate, are used to identify imperative forms of verbs. These methods are however not suitable for texts written by children, since these texts often lack end-of-sentence punctuation or capitals indicating the beginning of a sentence; a noun phrase preceding the predicate could thus just as well be the end of a previous sentence. However, simply defining imperative verb forms that do not coincide with other verb forms as grammatical finite verb forms reduced the number of false alarms in imperatives by half, as shown in Table 7.6 below.

Finite verb false alarms in coordinations with infinitive verbs decreased to just nine alarms; they were blocked by selecting infinitive verbs preceded by a verbal group or infinitive phrase as a separate pattern category with the transducer in (RE6.44) (see p.212). Discontinuous verbal groups with a noun phrase following the auxiliary verb were joined together by the automaton (RE6.29) (see p.205), and the narrow grammar of verb clusters was expanded to include (optional) noun phrases. Almost half of such false alarms were avoided. False alarms in finite verbs occurring because of ambiguous interpretation also decreased. Some of those were avoided by the grammar update that also improved parsing.
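The imperative heuristic described above amounts to a set difference over word forms: accept as a grammatical finite verb only those imperative forms that have no infinitive homograph. A minimal sketch, using toy mini-lexicons rather than the system's full-form lexicon (tänk and titta are the examples from the text; spring/springa is added for illustration):

```python
# Toy mini-lexicons for illustration only; the real system consults its
# full-form lexicon of about 160,000 words.
IMPERATIVE_FORMS = {"tänk", "spring", "titta"}
INFINITIVE_FORMS = {"tänka", "springa", "titta"}

# Accept as grammatical finite verb forms only those imperatives that do
# not coincide with an infinitive (set difference):
SAFE_IMPERATIVES = IMPERATIVE_FORMS - INFINITIVE_FORMS

def accept_as_finite(form):
    """True if the imperative form can safely count as a finite verb."""
    return form in SAFE_IMPERATIVES

accept_as_finite("tänk")   # True: unambiguously imperative
accept_as_finite("titta")  # False: coincides with the infinitive, still flagged
```

The design choice mirrors the trade-off in the text: the ambiguous forms remain flagged, since accepting them would also hide genuine missing-tense-ending errors.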
Further adjustments concerned nouns interpreted as verbs in possessive noun phrases and adjectives in noun phrases interpreted as verbal heads; these were filtered out by applying the automata (RE6.26) and (RE6.25) (see p.201). Furthermore, verbal heads with a single supine verb form were selected separately, since they are grammatical in subordinate clauses (see (RE6.46) on p.214).


The remaining false alarms are summarized in Table 7.6.

Table 7.6: False Alarms in Finite Verbs: Phase II

  FALSE ALARM TYPE                               NO.
  Not in Grammar:  imperative                     27
                   coordinated infinitive          9
                   discontinuous verb clusters    28
  Ambiguity:       noun                            9
                   pronoun                         1
                   preposition/adjective          14
  Other:                                           6
  Other Error:     misspelling                    18
                   split                          14

Improving Flagging Accuracy in Verb Form after Auxiliary Verb

The ambiguity resolutions defined for finite verbs blocked not only the false alarms in finite verbs, but also those in verb clusters. Furthermore, an annotation filter (RE6.27) (see p.203) was defined for copula verbs, blocking false markings of copula verbs combined with other constituents, such as pronouns, adjectives and participles, as a sequence of verbs. The types and number of false alarms that remain are presented in Table 7.7.

Table 7.7: False Alarms in Verb Clusters: Phase II

  FALSE ALARM TYPE                                NO.
  Ambiguity:             noun                       4
                         pronoun                    4
                         preposition/adjective     24
  Other error category:  misspelling                6
                         split                      9

Improving Flagging Accuracy in Verb Form in Infinitive Phrases

The false alarms in infinitive verb phrases occurred due to constructions that do not require an infinitive verb form after an infinitive marker. These were selected as separate patterns by the automaton (RE6.45) (see p.213), and false markings of this type were blocked.
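The blocking strategy can be sketched schematically. The tag strings and regular expressions below are invented stand-ins for the Xerox-style expression (RE6.45), rendered with Python's re module: sequences where att is followed by a finite verb are selected as a separate (subjunction) pattern and exempted from the infinitive-form check.

```python
import re

# Hypothetical tag strings for illustration; "att verb.fin" stands for
# att used as the subjunction 'that', which requires no infinitive.
SUBORDINATE = re.compile(r"att verb\.fin\b")
INF_PHRASE = re.compile(r"att verb\.inf\b")

def flag_infinitive_error(tagged):
    """Flag att + verb sequences that are neither a valid infinitive
    phrase nor a subordinate clause selected as a separate pattern."""
    if SUBORDINATE.search(tagged):
        # Blocked: att as subjunction, selected as its own pattern category.
        return False
    return bool(re.search(r"att verb\.\w+", tagged)) and not INF_PHRASE.search(tagged)

flag_infinitive_error("att verb.sup")  # True: supine after infinitive marker
flag_infinitive_error("att verb.inf")  # False: grammatical infinitive phrase
flag_infinitive_error("att verb.fin")  # False: subjunction reading, exempted
```

Selecting the subjunction reading as its own category, rather than listing it as an error exception, keeps the grammar positive in the sense used throughout the thesis.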


7.3.3 Performance Results: Phase II

The performance of the system in the new, improved version (Phase II) is presented in Table 7.8. The grammatical coverage is the same for all error types, except for finite verbs, where the recall rate decreased slightly from 89% to 87%.

Table 7.8: Performance Results on Child Data: Phase II

  FINITECHECK: PHASE II
                                    CORRECT ALARM        FALSE ALARM
  ERROR TYPE               ERRORS   Corr.Dg.  Inc.Dg.    No Err.  Oth.Err.   Recall  Precision  F-value
  Agreement in NP              15        14        1        15       47       100%      19%       33%
  Finite Verb Form            110        96        0        94       32        87%      43%       58%
  Verb Form after Vaux          7         6        0        32       15        86%      11%       20%
  Verb Form after inf. m.       4         4        0         0        0       100%     100%      100%
  TOTAL                       136       120        1       145       94        89%      34%       49%

This is due to the improvement in flagging accuracy. That is, in addition to the errors not detected by the system from the initial stage (see Section 7.2), two further errors in finite verb form realized as bare supine were not detected (G5.2.88, G5.2.89), as a consequence of selecting all bare supine forms as separate segments, as shown in (7.16). This selection was necessary in order to avoid marking correct uses of bare supine forms as erroneous. When the grammar for the bare supine verb form is covered, these errors can be detected as well.

(7.16) (G5.2.89)
       a. Han tittade på hunden. Hunden *försökt att klättra ner.
          he looked [pret] at the-dog  the-dog tried [sup] to climb down
          – He looked at the dog. The dog tried to climb down.
       b. Han tittade på hunden , hunden försökt att klättra ner

We were able to avoid many of the false flaggings by improving the lexical assignment of tags and expanding the grammar. The parsing results of the system improved as well. The total precision rate improved from 21% to 34%. The remaining false alarms most often have to do with ambiguity. Only in the case of verb clusters is further expansion of the grammar needed. Figure 7.1 shows the number of false markings of correct text as erroneous, comparing the initial Phase I and the current Phase II.


Figure 7.1: False Alarms: Phase I vs. Phase II

The types and number of alarms revealing other error categories are more or less constant and can be considered a side-effect of such a system. Methods for recognizing these error types are of interest. In the case of splits and misspellings, most of them were discovered due to agreement problems. Omission of sentence boundaries is in many cases covered by the verb cluster analysis. The overall performance of the system in detecting the four error types defined increased in F-value from 34% in the initial phase to 49% in the current improved version.

7.4 Overview of Performance on Child Data

I presented earlier, in Section 5.5, the linguistic performance on the Child Data corpus of the other three Swedish tools: Grammatifix, Granska and Scarrie. Here I discuss the results of these tools for the four error types targeted by FiniteCheck and explore the similarities and differences in performance between our system and the other tools. The purpose is not to claim that FiniteCheck is in general superior to the other tools: FiniteCheck was developed on the Child Data corpus, whereas the other tools were not. However, it is important to show that FiniteCheck represents some improvement over systems that were not even designed to cover this particular data.


The grammatical coverage of these three tools and our detector for the four error types is presented in Figure 7.2.[4] The three other tools are designed to detect errors in adult texts, and not surprisingly their detection rates are low. Among these four error types, agreement errors in noun phrases are the type best covered by these tools, whereas errors in verb form obtained in general much lower results. All three systems managed to detect at least half of the errors in noun phrase agreement. Errors in the finite verb form obtained the worst results. In the case of Grammatifix, errors in verbs obtained no or very few results. Granska targeted all four error types and detected more than half of the errors in three of the types, but only 4% of the errors in finite verb form. Scarrie also had problems in detecting errors in verbs, although it performed best on finite verbs in comparison to the other tools, detecting 15% of them.

Figure 7.2: Overview of Recall in Child Data

[4] The number of errors per error type is presented within parentheses next to the error type name.

FiniteCheck, which was trained on this data, obtained maximal recall rates for errors in noun phrase agreement and verb form after infinitive markers. Errors in other types of verb form obtained a somewhat lower recall (around 86%). Although this is a good result, we should keep in mind that FiniteCheck is here tested on the data that was used for development. That is, it is not clear if the system would


receive such high recall rates for all four error types even for unseen child texts.[5] However, the high performance in detecting errors, especially in the frequent finite verb form error type, is an obvious difference in comparison to the low performance of the other tools, which at least seems to motivate the tailoring of grammar checkers to children's texts.

Precision rates are presented in Figure 7.3. They are in most cases below 50% for all systems. The result is however relative to the number of errors. The most informative results are probably those for errors in finite verb form, a quite frequent error type. The errors in verb form after infinitive marker are too few to draw any concrete conclusions about the outcome.

Figure 7.3: Overview of Precision in Child Data

[5] We have not been able to test the system on new child data. Texts written by children are hard to get and require a lot of preprocessing.

Evaluating the overall performance of the systems in detecting these four error types, presented in Figure 7.4 below, the three other systems obtained a recall of 16% on average. The recall rate of FiniteCheck is considerably higher, which can mean that the tool is good at finding erroneous patterns in texts written by children, but that remains to be seen when tests on unseen texts are performed. Flagging accuracy is slightly above 30% for Grammatifix, Granska and FiniteCheck. Scarrie obtained slightly lower precision rates. Combining these rates and measuring the overall system performance in F-value, Grammatifix obtained the lowest rate, probably due to its low recall, closely followed by Scarrie. Granska had a slightly higher result of 23%. Our system obtained twice the value of Granska.

Figure 7.4: Overview of Overall Performance in Child Data

In conclusion, among these four error types the three other grammar checkers had difficulties in detecting the verb form errors in Child Data and only detected around half of the errors in noun phrase agreement. FiniteCheck had high recall rates for all four error types and a precision on the same level as the other tools. It is unclear how much the outcome is influenced by the fact that the system is based on exactly this data, but FiniteCheck does not seem to have the difficulties in finding errors in verb form (especially in finite verbs) that the other tools clearly display. Further evaluation of FiniteCheck on a small text not known to the system is reported in the following section.


7.5 Performance on Other Text

In order to see how FiniteCheck would perform on unseen text, of the kind used to test the other Swedish grammar checkers, a small literary text of 1,070 words describing a trip was evaluated. This text is used as a demonstration text by Granska.[6] The text includes 17 errors in noun phrase agreement, five errors in finite verb form and one error in verb form after an auxiliary verb. The purpose of this test is to see if the results are comparable to the other Swedish tools. Note that the aim is not to compare the performance between all the checkers, which would be unfair since the text is a demonstration text of Granska, but rather to see how our detector performs on just the error types it targets, in comparison to tools designed for this kind of text. Below, I first present and discuss the results of FiniteCheck; then the performance of the three other checkers is presented, followed by a comparative discussion.

7.5.1 Performance Results of FiniteCheck

Introduction

The text was first manually prepared and spaces were inserted when needed between all strings, including punctuation. Further, the lexicon had to be updated, since the text uses a particular jargon.[7] The detection results of FiniteCheck are presented in Table 7.9.

Table 7.9: Performance Results of FiniteCheck on Other Text

  FINITECHECK: OTHER TEXT
                                    CORRECT ALARM        FALSE ALARM
  ERROR TYPE               ERRORS   Corr.Dg.  Inc.Dg.    No Err.  Oth.Err.   Recall  Precision  F-value
  Agreement in NP              17        13        1         2        4        82%      70%       76%
  Finite Verb Form              5         5        0         1        0       100%      83%       91%
  Verb Form after Vaux          1         1        0         1        0       100%      50%       67%
  TOTAL                        23        19        1         4        4        87%      71%       78%

[6] Demonstration page of Granska: http://www.nada.kth.se/theory/projects/granska/.
[7] FiniteCheck's lexicon would need to be extended anyway to make a general grammar checking application.

FiniteCheck missed three errors in noun phrase agreement, which leaves it with a total recall of 87%. False alarms occurred in all three error types, mostly in noun phrase agreement, resulting in a total precision of 71%. Below I discuss the performance results in more detail.

Errors in Noun Phrase Agreement

Among the noun phrase agreement errors, three errors were not detected and one was incorrectly diagnosed. The latter concerned a proper noun preceded by an indefinite common gender determiner. The noun phrase was selected and marked for all three types of agreement errors, as shown in (7.17). The reason for this selection is that the noun phrase was recognized by the broad grammar as a noun phrase, but rejected by the narrow grammar as ungrammatical. This is in fact true, since the proper noun should stand alone or be preceded by a neuter gender determiner, but the system should signal only an error in gender agreement. That is, the noun phrase was rejected as a whole by the system, since there are no rules for noun phrases with a determiner and a proper noun.

(7.17) a. Detta är sannerligen *en Mekka för fjällälskaren ...
          this is certainly a [com,indef] Mekka for the mountain-lover
          – This is certainly a Mekka for the mountain-lover ...
       b. Detta är sannerligen en Mekka för fjällälskaren ...

The undetected errors all concerned constructions not covered by our grammar. The first one, in (7.18a),[8] involves a possessive noun phrase modifying another noun; FiniteCheck covers only noun phrases with single possessive nouns as modifiers. The other two concern numerals with nouns in definite form. Our current grammar does not explore much about numerals and definiteness.

(7.18) a. den stora *forsen brus            ⇒ den stora forsens brus
          the big stream [nom] roar [nom]      the big stream [gen] roar [nom]
       b. två *nackdelarna                  ⇒ två nackdelar
          two disadvantages [def]              two disadvantages [indef]
       c. två *kåsorna kaffe                ⇒ två kåsor kaffe
          two scoops [def] coffee              two scoops [indef] coffee

[8] Correct forms are presented to the right after the arrow in the examples.

Altogether six false flaggings occurred in noun phrase agreement, four of them due to a split, thus involving another error category. Two were due to ambiguity in the parsing. Both types are exemplified in the sentence in (7.19), where in


the first case the noun fjällutrustningen 'mountain-equipment [sg,com,def]' is split and the first part does not agree with the preceding modifiers. The second case involves the complex preposition framför allt 'above all', where allt is joined with the following noun to build a noun phrase and a gender mismatch occurs.

(7.19) a. ...i tältet och den övriga fjäll utrustningen vilar tryggheten och framför allt friheten.
          in tent and the [sg,com,def] rest mountain [sg/pl,neu,indef] equipment [sg,com,def] rests the-safety and above all [neu,indef] freedom [com,def]
          – ... in the tent and the other mountain equipment lies the safety and above all freedom.
       b. i tältet och den övriga fjäll utrustningen vilar tryggheten och fram för allt friheten .

Errors in Verb Form

All the errors in verb form were detected. One false alarm occurred in each error type. In the case of finite verbs, the alarm was caused by homonymy in the noun styrka 'force', interpreted as the verb 'prove', as seen in (7.20).

(7.20) a. Vinden mojnar inte under natten utan fortsätter med oför minskad styrka.
          the-wind subside not during the-night but continues with undiminished force
          – The wind does not subside during the night, but continues with undiminished force.
       b. Vinden mojnar inte under natten utan fortsätter med oför minskad styrka .

The false alarm in verb form after auxiliary verb concerned the split noun sovsäcken 'sleeping-bag', where the first part sov is homonymous with the verb 'sleep' and was joined with the preceding verb to form a verb cluster, as shown in (7.21).


(7.21) a. Det finns dock två nackdelarna med tältning, pjäxorna måste i sov säcken för att inte krympa ihop av kylan ...
          there exist however two disadvantages with camping the skiing-boots must into sleeping bag because not shrink together from the-cold
          – There are two disadvantages with camping, the skiing-boots must be inside the sleeping-bag in order not to shrink from the cold ...
       b. Det finns dock två nackdelarna med tältning, pjäxorna måste i sov säcken för att inte krympa ihop av kylan

7.5.2 Performance Results of Other Tools

Grammatifix

The results for Grammatifix are presented in Table 7.10 below, with 12 detected errors in noun phrase agreement, one detected error in finite verb form and one false alarm in verb form after an auxiliary verb. This leaves the system with a total recall of 57% and a precision of 93% for these three error types.

Table 7.10: Performance Results of Grammatifix on Other Text

  GRAMMATIFIX: OTHER TEXT
                                    CORRECT    FALSE ALARM
  ERROR TYPE               ERRORS   ALARM      No Err.  Oth.Err.   Recall  Precision  F-value
  Agreement in NP              17      12         0        0         71%     100%       83%
  Finite Verb Form              5       1         0        0         20%     100%       33%
  Verb Form after Vaux          1       0         0        1          0%       0%        –
  TOTAL                        23      13         0        1         57%      93%       70%

The five errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and the one with a numeral and a noun in the definite form (see (7.18b)). Other cases concerned a possessive proper noun with erroneous definite noun (see (7.22a)), another definiteness error in noun (see (7.22b)) and a strong form of adjective used in definite noun phrase (see (7.22c)). Correct forms are presented to the right, next to the erroneous phrases.


(7.22) a. Lapplands *drottningen            ⇒ Lapplands drottning
          Lappland's queen [def]               Lappland's queen [indef]
       b. *en ny dagen                      ⇒ en ny dag
          a [indef] new [indef] day [def]      a [indef] new [indef] day [indef]
       c. den *djup snön                    ⇒ den djupa snön
          the [def] deep [str] snow [def]      the [def] deep [wk] snow [def]

No false alarms occurred other than the one with a verb form after an auxiliary verb, concerning exactly the same segment and error suggestions as our detector, as exemplified in (7.21) above.

Granska

The results for Granska are presented in Table 7.11. This system detected 11 agreement errors in noun phrases and the one error in verb form after auxiliary verb; one false alarm occurred in noun phrase agreement. No errors in finite verb form were identified. The total recall is 52% and precision 92% for these three error types.

Table 7.11: Performance Results of Granska on Other Text

  GRANSKA: OTHER TEXT
                                    CORRECT    FALSE ALARM
  ERROR TYPE               ERRORS   ALARM      No Err.  Oth.Err.   Recall  Precision  F-value
  Agreement in NP              17      11         0        1         65%      92%       76%
  Finite Verb Form              5       0         0        0          0%       –         –
  Verb Form after Vaux          1       1         0        0        100%     100%      100%
  TOTAL                        23      12         0        1         52%      92%       67%

The six errors in noun phrase agreement that were missed concerned the same segment with a possessive noun modifying another noun (see (7.18a)) and both cases with the numeral and a noun in definite form (see (7.18b-c)). Further errors concerned a possessive noun with an erroneous definite noun (see (7.23a)), a neuter gender possessive pronoun with a common gender noun (see (7.23b)) and an indefinite determiner with a definite noun (see (7.23c)).


(7.23) a. ripornas *kurren                  ⇒ ripornas kurr
          grouse's hoot [def]                  grouse's hoot [indef]
       b. *mitt huva                        ⇒ min huva
          my [neu] hood [com]                  my [com] hood [com]
       c. en *smulan                        ⇒ en smula
          a [indef] bit [def]                  a [indef] bit [indef]

One false alarm occurred in a noun phrase with a split adjective and a missing noun, as shown in (7.24). Here the adjective vinteröppna 'winter-open' (i.e. open for the winter) is split and the first part causes an agreement error in definiteness.

(7.24) ... i den andra vinter öppna — husera en arg gubbe ...
          in the [def] other winter [indef] open — haunt [inf] an angry old man
          – ... the other cottage open for the winter was haunted by an angry old man ...

Scarrie

The results for Scarrie are presented in Table 7.12. This system detected 10 agreement errors in noun phrases and one error in finite verb form. It had six false markings concerning noun phrase agreement. The total recall is 48% and precision 65%.

Table 7.12: Performance Results of Scarrie on Other Text

  SCARRIE: OTHER TEXT
                                    CORRECT    FALSE ALARM
  ERROR TYPE               ERRORS   ALARM      No Err.  Oth.Err.   Recall  Precision  F-value
  Agreement in NP              17      10         2        4         59%      63%       61%
  Finite Verb Form              5       1         0        0         20%     100%       33%
  Verb Form after Vaux          1       0         0        0          0%       –         –
  TOTAL                        23      11         2        4         48%      65%       55%

The seven errors in noun phrase agreement that were missed concerned the three our system did not find (see (7.18)) and two that Granska did not find (see (7.23a) and (7.23c)). The others are presented below, where two concerned gender agreement between a determiner and a (proper) noun (see (7.25a) and (7.25b)), and one concerned definiteness agreement with a weak form adjective together with an indefinite noun (see (7.25c)).


(7.25) a. *en Mekka                         ⇒ ett Mekka
          a [com] Mekka                        a [neu] Mekka
       b. *en mantra                        ⇒ ett mantra
          a [com] mantra [neu]                 a [neu] mantra [neu]
       c. *orörd fjällnatur                 ⇒ orörda fjällnatur
          untouched [str] mountain-nature [indef]   untouched [wk] mountain-nature [indef]

All false alarms concerned noun phrase agreement, where four of them concerned other error categories, as for instance in the ones presented in (7.19) or in (7.24).

7.5.3 Overview of Performance on Other Text

In Figure 7.5 I present the recall values for the three other grammar checkers and our FiniteCheck for the three evaluated error types. All the tools detected 60% or more of the errors in noun phrase agreement, whereas verb form errors obtained different results. The other tools detected at most one verb form error in total, of either the finite verb kind or after an auxiliary verb. FiniteCheck identified all six of the verb form errors. The errors in verb form are in fact quite few (six instances in total), but even for such a small amount there are indications that the other tools have problems identifying errors in verb form.

Flagging accuracy for these error types is presented in Figure 7.6. Concerning errors in noun phrase agreement, Grammatifix had no false flaggings and obtains a precision of 100%. Granska's precision rate is also quite high, with only one false alarm. Scarrie and FiniteCheck obtained a lower precision around 70%, due to six false alarms by each tool. Concerning verb errors, the three systems obtained full rates without any false flaggings when detection occurred. FiniteCheck had one false alarm in each error type, thus obtaining lower precision rates. The flagging accuracy of FiniteCheck on this text is a bit lower in comparison to Grammatifix and Granska, but comparable to the results of Scarrie.


Figure 7.5: Overview of Recall in Other Text

Figure 7.6: Overview of Precision in Other Text


Regarding the overall performance on the evaluated text, presented in Figure 7.7: with its 23 grammar errors, the three grammar checkers obtained on average 52% in recall, while FiniteCheck scored 87%. The opposite scenario applies for precision, where FiniteCheck had a slightly worse rate (71%) than Grammatifix and Granska, which had precision above 90%. Scarrie's precision rate was 65%. In the combined measure of recall and precision (F-value), our system obtained a rate of 78%, which is slightly better than the other tools, which had 70% or less in F-value.

Figure 7.7: Overview of Overall Performance in Other Text

In conclusion, this test only compared a few of the constructions covered by the other systems, represented by the error types targeted by FiniteCheck. The result is promising for our detector, which obtained comparable or better performance rates for coverage on this text. Flagging accuracy was slightly worse, especially in comparison to Grammatifix and Granska. Moreover, the text was small with few errors, and future tests on larger unseen texts are of interest for a better understanding of the system's performance.


7.6 Summary and Conclusion

The performance of FiniteCheck was tested during the developmental stage and on the current version. The system is in general good at finding errors, and the flagging accuracy of the system can be improved by relatively simple means. The initial performance was improved solely by extension of the grammar and some ambiguity resolution. The broad grammar was extended by filtering transducers that extended head phrases with complements, merged split constituents or otherwise adjusted the parsing output as a disambiguation step. The narrow grammar was improved by either extension of existing grammar rules or additional selection of segments. These new selections provide a basis for definitions of new grammars, and thus the possibility of extending the detection to other types of errors. In the current version, noun phrases followed by relative clauses, coordinated infinitives and verbs in supine form were selected as separate segments and can be further extended with corresponding grammar rules.

Detection of the four implemented error types in FiniteCheck was tested on both Child Data and a short adult text, not only for our detector but also for the other three Swedish grammar checkers.[9] In the case of Child Data, FiniteCheck achieved maximal or high grammatical coverage, being based on this corpus, and a total precision of around 30%. The other tools detected in general few errors in Child Data in the included error types, with an average recall of 16%. Flagging accuracy is also around 30% for two of these tools and lower for one of them. The outcome of FiniteCheck is hard to compare to the performance of the other tools, since our system is based on the Child Data corpus which was also used for evaluation, but there are indications of differences in the detection of errors in verb form at least, especially in finite verbs, where the other tools obtained quite low recall, on average 9%.
A similar effect occurred when the tools were tested on the adult text: here too, the other tools had difficulties detecting errors in verb form (although these errors were few), whereas FiniteCheck identified all of them. Otherwise, FiniteCheck obtained recall on the adult text comparable to (or even better than) the three tools and a slightly lower accuracy in comparison to two of them. The performance rates of all the tools are in general higher on this adult text than on Child Data, with a recall around 50% and a precision around 80%. The corresponding rates for Child Data are around 16% in recall[10] and 30% in precision.

The validation tests on Child Data and the adult text clearly indicate that children's texts and the errors in them really are different from adult texts and errors, and that they are more challenging for current grammar checkers, which have been developed for texts and errors written by adult writers. The low performance of the Swedish tools on Child Data clearly demonstrates the need for adaptation of grammar checking techniques to other users, such as children. The performance of FiniteCheck is promising but at this point only preliminary. More tests are needed in order to see the real performance of this tool, both on other unseen children's texts and on texts written by other users, such as adult writers or even second language learners.

[9] Recall that these tools target many more error types. Evaluation of these grammar checkers on all errors found in Child Data is presented in Chapter 5 (Section 5.5).
[10] Here, the recall rates of FiniteCheck were not included, since it was developed on this data.


Chapter 8

Summary and Conclusion

8.1 Introduction

This concluding chapter begins with a short summary of the thesis (Section 8.2), followed by a section on conclusions (Section 8.3); finally, some future plans are discussed (Section 8.4).

8.2 Summary

8.2.1 Introduction

This thesis concerns the analysis of grammar errors in Swedish texts written by primary school children and the development of a finite state system for finding such errors. Grammar errors are more frequent for this group of writers, and the distribution of the error types differs from texts written by adults. Other writing errors above word level are also discussed here, including punctuation and spelling errors resulting in existing words. The method used in the implemented tool FiniteCheck involves subtraction of finite state automata that represent two 'positive' grammars with varying degrees of detail. The difference between the automata corresponds to the search for writing problems that violate the grammars. The technique shows promising results on the implemented agreement and verb selection phenomena.

The work is divided into three subtasks: analysis of errors in the gathered data, investigation of the possibilities for detecting these errors automatically and, finally, implementation of detection of some errors. The summary of the thesis presented below follows these three subtasks.
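The subtraction idea can be illustrated schematically. The thesis implements it as automata difference in a finite-state calculus; the sketch below mimics the effect with Python regular expressions over invented part-of-speech tag strings: a broad pattern accepting any determiner–noun sequence, and a narrow one additionally requiring agreement in number and gender. A phrase in the difference (broad minus narrow) is flagged.

```python
import re

# Illustrative tag format "pos.number.gender"; the tag names are invented.
# Broad grammar: any determiner followed by a noun counts as an NP.
BROAD = re.compile(r"det\.\w+\.\w+ noun\.\w+\.\w+")
# Narrow grammar: determiner and noun must agree in number and gender
# (enforced here with backreferences).
NARROW = re.compile(r"det\.(\w+)\.(\w+) noun\.\1\.\2")

def flag_np(tagged):
    """True if the NP parses in the broad grammar but not the narrow one,
    i.e. it lies in the difference of the two languages."""
    return bool(BROAD.fullmatch(tagged)) and not NARROW.fullmatch(tagged)

flag_np("det.sg.neu noun.sg.com")  # True: e.g. *ett flicka, gender mismatch
flag_np("det.sg.com noun.sg.com")  # False: e.g. en flicka, grammatical
```

Note how both grammars stay positive descriptions of Swedish: no error pattern is ever written down, which is the advantage of the subtraction technique argued for in the thesis.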


8.2.2 Children's Writing Errors

Data, Error Categories and Error Classification

The analysis of children's writing errors is based on empirical data totalling 29,812 words, gathered in a Child Data corpus consisting of three separate collections of hand-written and computer-written compositions by primary school children between 9 and 13 years of age (see Section 3.2). The analysis concentrates primarily on grammar. Other categories under investigation concern spelling errors which give rise to real word strings, and punctuation.

Error classification of the involved error categories is discussed in Chapter 3 (Section 3.3), where I present a taxonomy (Figure 3.1, p.31) and principles for classifying writing errors. Although this taxonomy was designed particularly for errors on the borderline between spelling and grammar errors, it can be used for classification of both spelling and grammar errors. It takes into consideration the kind of new formation involved (a new lemma or another form of the same lemma), the type of violation (change in letter, morpheme or word) and what level is influenced (lexical, syntactic or semantic).

What Grammar Errors Occur?

In the survey of the considerably few existing studies on grammar errors in Chapter 2 (Section 2.4), I show that the most typical grammar errors in these studies are errors in noun phrase and predicative complement agreement, verb form and choice of prepositions in idiomatic expressions. Furthermore, some indications of errors influenced by spoken language are also evident in children's writing. However, grammar has in general low priority in research on writing in Swedish. In particular, there are no recent studies concerning grammar errors by children and certainly no studies whatsoever for the youngest primary school children (see Section 2.3).

In the present analysis of Child Data in Chapter 4 (Section 4.3), a total of 262 grammar errors occur. They are spread over more than ten error types.
The expected “typical” errors occur, but not all of them are particularly frequent. The most common errors concern finite verb form, omission of obligatory sentence constituents, word choice, noun phrase agreement, and extra constituents added to sentences. Compared to adult writers (Section 4.4), there are clear differences in both error frequency and the distribution of error types. Grammar errors occur on average 9 times per 1,000 words in a child text, considerably more often than for adult writers, who average one grammar error per 1,000 words. For some error types (e.g. noun phrase agreement) the frequency differs only marginally, whereas more significant differences arise, for instance, for errors in verb form, which are on average eight times more common in Child Data. The frequency distribution across error types also differs, although the most common error types are similar except for finite verb form errors. The most common error types for the adults in the studies surveyed were missing or redundant sentence constituents, noun phrase agreement and word choice errors. In contrast, the most common verb error among adult writers concerns the verb form after an auxiliary verb, not the finite verb form, as is the case for children.

What Real Word Spelling Errors Occur?

Spelling errors resulting in existing words are usually not captured by a spelling checker. They have therefore been included in the present analysis, since detecting them often requires analyzing context larger than a single word. Those found in the Child Data corpus (presented and discussed in Section 4.5) are three times less frequent than non-word spelling errors, where misspelled words are the most common error type. These errors indicate a clear confusion as to which form to use in which context, as well as the influence of spoken language. Splits were in general more common as real word errors.

How Is Punctuation Used?

The main purpose of the analysis of punctuation (Section 4.6) was to investigate how children delimit text and use major delimiters and commas to signal clauses and sentences. The analysis of Child Data reveals that mostly the younger children join sentences into larger units without using any major delimiters to signal sentence boundaries. The oldest children formed the longest units with the fewest adjoined clauses. Erroneous use of punctuation mostly takes the form of omitted delimiters, but markings also occur at syntactically incorrect places.
The punctuation analysis concludes at this point with the recommendation not to rely on sentence marking conventions in children's texts when describing the grammar and rules of a system aimed at analyzing such texts.

8.2.3 Diagnosis and Possibilities for Detection

Possibilities and Means for Detection

The errors found in Child Data were analyzed according to the means and the amount of context needed to detect them. Most of the non-structural errors (i.e. substitutions of words involving feature mismatch) and some structural errors (i.e. omission, insertion and transposition of words) can be detected successfully by means of partial parsing. These errors concern agreement in noun phrases, verb form or missing constituents in verb clusters, some pronoun case errors, repeated words that cause redundant constituents, some word order errors and, to some extent, agreement errors in predicative complements. Furthermore, real word spelling errors giving rise to syntactic violations can also be traced by partial parsing. Other error types require more elaborate analysis in the form of parsing larger portions of a clause or even full sentence parsing (e.g. missing or extra inserted constituents), analysis above sentence level drawing on the preceding discourse (e.g. definiteness in single nouns, reference), or even semantics and world knowledge (e.g. word choice errors). Among the most common errors in the Child Data corpus, errors in verb form and noun phrase agreement can be detected by partial parsing, whereas errors in sentence structure, such as insertions or omissions of constituents, and word choice errors require more elaborate analysis.

Coverage and Performance of Swedish Tools

The three existing Swedish grammar checkers Grammatifix, Granska and Scarrie are designed for and primarily tested on texts written by (mostly professional) adult writers. According to their error specifications, they cover many of the error types found in Child Data. The errors that none of these tools targets include definiteness errors in single nouns and reference errors. The tools were tested on Child Data in order to gauge their real performance. The result indicates low coverage overall, and in particular for the most common error types. The systems are best at identifying errors in noun phrase agreement, with an average recall rate of 58%. However, the most common error in children's writing, finite verb form, is on average covered only to 9% (see Tables 5.4, 5.5 and 5.6 starting on p. 169, or Figure 7.2 on p. 234). The overall grammatical coverage (recall) of the adult grammar checkers across all errors in Child Data averages around 12%.
This figure is almost five times lower than in the tests on adult texts provided by the developers of these tools, where the average recall rate is 57% (see Table 5.3 on p. 141). The test showed that although these three proofing tools target the grammar errors occurring in Child Data, they have problems detecting them. In some cases this can be ascribed to the complexity of the error (e.g. insertion of optional constituents). More often, however, the low performance is due to the high frequency of some error types (e.g. errors in finite verb form are much less frequent in adult texts; see Figure 4.5 on p. 87) and to the complexity of the sentence and discourse structure of the texts used in this study (e.g. violations of punctuation and capitalization conventions resulting in adjoined clauses).
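For reference, the recall and precision figures quoted throughout this chapter follow the usual definitions; a minimal sketch (the counts below are illustrative, not the study's actual numbers):

```python
def recall(true_positives, false_negatives):
    """Share of the real errors in the text that the checker flags."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Share of the checker's alarms that point at real errors."""
    return true_positives / (true_positives + false_positives)

# Illustrative counts only: a checker that flags 30 of 250 real errors
# has recall 0.12 (12%); 30 correct alarms out of 100 give precision 0.3.
```

Recall thus measures coverage of the error types, while precision measures how many false alarms the writer has to wade through.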

8.2.4 Detection of Grammar Errors

Targeted Errors

Among the errors found in Child Data, errors in noun phrase agreement and in the form of finite and non-finite verbs were chosen for implementation, for two reasons. First, (almost all of) these errors are among the five most common error types. Second, they are all limited to certain portions of text and can therefore be detected by means of partial parsing. The current implementation detects agreement errors in noun phrases with a noun, adjective, pronoun or numeral as head, as well as in noun phrases with partitive attributes. The noun phrase rules are defined according to the feature requirements they have to fulfill (i.e. definiteness, number and gender). The noun phrase grammar is prepared for further detection of errors in noun phrases with a relative subordinate clause as complement, which display different agreement conditions. In the present implementation these are selected as segments separate from the other noun phrases, chiefly to avoid marking correct noun phrase segments of this type as erroneous. The verb grammar detects errors in finite form, both in bare main verbs and in auxiliary verbs in a verb cluster, as well as in non-finite forms in a verb cluster and in infinitive phrases following an infinitive marker. The grammar is designed to take into consideration the insertion of optional constituents such as adverbs or noun phrases, and also handles inverted word order. The verb grammar, too, is prepared for expansion to cover the detection of other verb errors. Coordinated verbs preceded by a verb cluster or infinitive phrase are selected as individual segments, inviting further expansion of the system's grammar to the detection of errors manifested as finite verbs where a non-finite verb form is expected. Similarly, verbal heads in the bare supine form are selected as separate segments, laying a basis for the detection of omitted temporal auxiliary verbs in main clauses.

Detection Approach

The implemented grammar error detector FiniteCheck is built as a cascade of finite state transducers compiled from regular grammars using the expressions and operators defined in the Xerox Finite-State Tool. The detection of errors in a given text is based on the difference between two positive grammars differing in degree of accuracy. This is the same method that Karttunen et al. (1997a) use for distinguishing valid and invalid date expressions. The two grammars always describe valid rules of Swedish. The first, more relaxed (underspecified) grammar is needed in a text containing errors to identify all segments that could contain errors; it marks both the grammatical and the ungrammatical segments. The second grammar is a precise grammar of valid Swedish and is used to distinguish the ungrammatical segments from the grammatical ones. The parsing strategy of FiniteCheck is partial rather than full, annotating portions of text with syntactic tags. The procedure is incremental, first recognizing the heads (lexical prefix) and then expanding them with complements, always selecting maximal instances of segments. In order to prevent overlooking errors, ambiguity is kept maximal at the lexical level, assigning all the lexical tags listed in the lexicon. Structural ambiguity at higher levels is treated partly by parsing order and partly by filtering techniques that block or rearrange the insertion of syntactic tags.

Performance Results

FiniteCheck was tested both on the (training) Child Data written by children and on an adult text not known to the system. On Child Data, the system showed high coverage (recall) in the initial phase of development, but many correct segments were also flagged as erroneous. Many of these false alarms were avoided by extending the grammar of the system, which blocked on average half of all false markings. The remaining false alarms relate mostly to ambiguity in parsing or to the selection of other error categories (i.e. misspelled words, splits and missing sentence boundaries). Only in the case of verb clusters did the system mark constructions not yet covered by its grammar. Since the system was developed on this corpus, grammatical coverage is high: the total recall rate is 89% for the four implemented error types, with a precision of 34%. The other three Swedish tools had on average lower recall, with a total rate of 16% on Child Data for the four error types targeted by FiniteCheck; the corresponding total precision is on average 27%.
Further, the performance of FiniteCheck on a text not known to the system shows that the system is good at finding errors, whereas precision is lower. The three undetected errors in noun phrase agreement were due to the small size of the grammar. False flaggings involved both ambiguity problems and selections triggered by other error categories. The total grammatical coverage (recall) of FiniteCheck on this text was 87%, with a precision of 71%. The other three Swedish tools are (again) good at finding errors in noun phrase agreement, whereas the verb errors obtain quite low results; their average total recall rate is 52% and precision 83% for the three evaluated error types. These validation tests show that the performance of FiniteCheck on the four implemented error types is promising and comparable to current Swedish checkers. The low performance of the Swedish systems on children's texts indicates that the errors found in texts written by primary school writers differ in nature from those in adult texts and are more challenging for current systems oriented towards texts written by adult writers.

8.3 Conclusion

The present work contributes to research on children's writing by revealing the nature of grammar errors in their texts, filling a gap in a field where few studies are devoted to grammar in writing. It further shows that it is important to develop aids for children, since both error frequency and error types differ from adult writing, and current tools have difficulty coping with such texts.

The findings also show that using positive rules for error detection is plausible and promising. The advantages of applying positive grammars to error detection are, first, that only the valid grammar has to be described; I do not have to speculate about what errors may occur, and the prediction of errors is limited to the portions of text that can be delimited. For example, errors in number in noun phrases with a partitive complement were not identified by any of the three Swedish checkers, presumably because adults rarely make errors of this type. The grammar of FiniteCheck describes the overall structure of such phrases in Swedish, including agreement between the quantifying numeral or determiner and the modifying noun phrase, and it states that the noun phrase must be plural in order to be considered correct. The Swedish tools consider only the agreement between the constituents, not the structure of the whole phrase. Secondly, the rule sets remain quite small. Thirdly, the grammars can be used for other purposes: since they describe the real grammar of Swedish, they can also be used to detect valid noun phrases and verbs and be applied, for instance, to information extraction or even parsing.

The performance of FiniteCheck is promising in that good results were obtained not only on the ‘training’ Child Data, but also on adult text, where the results are comparable to the other current tools.
This result may also indicate that the approach could be used as a generic method for error detection. The ambiguity in the system is not fully resolved, but this does not disturb the error detection. False parses are, however, hard to predict and may give rise to undetected errors or false alarms.
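The partitive case discussed in this conclusion can be sketched in the same positive-grammar style as before (the tags are invented simplifications; the point is that the whole phrase shape, including the plural requirement on the modifying noun phrase, is checked):

```python
import re

# Hypothetical tag strings for partitive NPs like 'två av bilarna'
# ('two of the cars'): NUM av N.<number>.DEF
RELAXED_PART = re.compile(r"NUM av N\.\w+\.DEF")   # any partitive-shaped NP
PRECISE_PART = re.compile(r"NUM av N\.PL\.DEF")    # modifying NP must be plural

def flag_partitive(tag_string):
    """Flag partitive NPs whose modifying noun phrase is not plural."""
    return [m.group(0) for m in RELAXED_PART.finditer(tag_string)
            if not PRECISE_PART.fullmatch(m.group(0))]
```

Under this toy scheme a phrase tagged `NUM av N.SG.DEF` is flagged while `NUM av N.PL.DEF` passes. Checking the structure of the whole phrase, rather than only pairwise agreement between constituents, is what lets the positive grammar catch an error type the other tools miss.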


8.4 Future Plans

8.4.1 Introduction

The current version of the implemented grammar error detector is not intended as a full-fledged grammar checker or a generic tool for detecting errors in any text written by any writer. The present version of FiniteCheck is based on a lexicon of limited size, its ambiguity is not fully resolved, and it detects a limited set of grammar errors, yielding simple diagnoses. The next challenges are to expand the lexicon, experiment with disambiguation versus error detection, extend the coverage of the system to other error types, explore the diagnosis stage, and test detection on new texts written by different users. Furthermore, applying the grammars of the system to other purposes is also interesting to explore.

8.4.2 Improving the System

The lexicon of the system has to be expanded with missing forms, new lemmas and other valuable information, such as valence or compound information. The latter is practically accomplished already, being stored in the original text version of part of the lexicon.

There is a high level of ambiguity in the system, especially at the lexical level, since we do not use a tagger, which might eliminate information in incorrect text that is later needed to find the error. Unresolved ambiguity can, however, sometimes lead to false parses, which in turn can mean false alarms. The degree of lexical ambiguity, and its impact on parsing and by extension on error detection, can be studied through experiments with weighted lexical annotation, i.e. lexical tags ordered by probability measures (e.g. weighted automata). Such taggers are, however, often trained on texts written by adults and could give misleading results. Disambiguation is not fully resolved at the structural level either, where some insertions are blocked by parsing order and the output is further adjusted by filtering automata. Extending the grammars of the system has shown a positive impact on parsing, and further evaluation is needed to determine the degree of ambiguity and the prospects for predicting false parses, both of which influence error detection. Another possibility is to explore the use of alternative parses, implemented for instance as charts.

The rules of the broad grammar overgenerate to a great extent. One thing to experiment with is the degree of broadness, to see how it influences the detection process. Will the parsing of text improve at the cost of worse error detection? How far could the grammar set be extended to improve parsing without affecting error detection? Since the grammars of the system are positive, experiments with using them for other purposes are in order; for instance, the more accurate narrow grammar could be applied to information extraction or even parsing.
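The risk of tagging away the evidence, noted above, can be made concrete with a toy comparison between keeping all lexical tags and keeping only the most probable one (the probabilities and the error pattern are invented for the illustration; "malar" does have a genuine noun/finite-verb ambiguity in Swedish):

```python
# Toy ambiguous lexicon: every word lists all its tags, with made-up
# probabilities such as a corpus-trained tagger might assign.
LEXICON = {
    "att":   {"INF_MARK": 0.6, "SUBJ": 0.4},   # infinitive marker / 'that'
    "malar": {"N": 0.8, "VB.FIN": 0.2},        # 'moths' (noun) / 'grinds' (finite verb)
}

def tags_for(word, keep_all):
    if keep_all:
        return set(LEXICON[word])              # FiniteCheck-style: keep every tag
    # tagger-style: keep only the single most probable tag
    return {max(LEXICON[word], key=LEXICON[word].get)}

def finite_after_inf_marker(words, keep_all=True):
    """Flag a finite-verb reading directly after an infinitive marker."""
    flags = []
    for prev, cur in zip(words, words[1:]):
        if "INF_MARK" in tags_for(prev, keep_all) and "VB.FIN" in tags_for(cur, keep_all):
            flags.append((prev, cur))
    return flags
```

With all tags kept, the suspicious sequence *att* + finite verb is flagged; with argmax tagging, the finite-verb reading is discarded (the noun reading wins at 0.8) and the error can no longer be seen. The price of keeping everything is, as noted above, more false parses to filter out later.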

8.4.3 Expanding Detection

The first step in expanding FiniteCheck's detection would naturally involve the types already selected for such expansion, i.e. noun phrases with relative clauses, coordinated infinitives and bare supine verbs. Furthermore, the verb grammar can be expanded with other constructions, such as the auxiliary verb komma ‘will’, which requires an infinitive marker before the main verb, or main verbs that combine with infinitive phrases (see Section 4.3.5). Further expansion would naturally concern the errors that require the least analysis. Beyond noun phrase and verb form errors, only some constructions can be detected by simple partial parsing; the rest require more complex analysis. The system can be further expanded to include detection of errors in predicative agreement, some pronoun case errors, some word order errors and probably some definiteness errors in single nouns. With regard to children, the most crucial additions would be coverage of missing or redundant constituents in clauses and of word choice errors, which represent two of the more frequent error types. As my analysis reveals, these errors will most probably require quite complex investigation, with descriptions of complement structure. It would be worthwhile to analyze children's writing further, to investigate whether some such errors are limited to certain portions of text and could then be detected by means of partial parsing.

Considering children as the users of a grammar checker for educational purposes, the most important development concerns error diagnosis and the error messages given to the user. A tool that supports beginning writers in their acquisition must place high demands on the diagnosis of, and information about, errors in order to be useful. The message to the user has to be clear and adjusted to the skills of the child.
A child who is not familiar with a given writing error, or with the grammatical terminology associated with it, will certainly not profit from having such an error detected or from information couched in grammatical terms. Studies of children's interaction with authoring aids are in order, to explore how alternatives in detection, diagnosis and error messages could best benefit this user group. For instance, such a tool could be used for grammar training, allowing customization and options for which error types to detect or train on. There could also be different levels of diagnosis and error messages depending on the individual child's level of writing acquisition. Other users could also find such a tool interesting, for instance second language learners.

The diagnosis stage could be informed by analysis of the ongoing processes in children's writing, as a step toward revealing the cause of an error. By logging all activities during the writing process on screen, all revisions could, for instance, be stored and then analyzed for repeated patterns, and in particular for whether making a spelling error differs from making a grammar error. Could a grammar checker gain from such on-line information? Such analysis would be of further interest for errors on the borderline between grammar and spelling, and could aid the detection of other error categories incorrectly flagged as grammar errors.

8.4.4 Generic Tool?

The detection and overall performance of the system have so far been tested on the ‘training’ Child Data corpus and on a small adult text not known to the system. The results for the four implemented error types are promising on both texts, which represent two different writing populations; this could imply that the method is usable as a generic one. FiniteCheck obtained performance comparable to the other Swedish grammar checkers on both the adult text and Child Data. Although FiniteCheck was based on these texts, considerable differences in coverage occurred for some error types that the other tools had difficulty finding. The system needs to be tested further on other children's texts not known to the system, and on texts from other writers, primarily texts of different genres written by adults. Furthermore, it would be interesting to test FiniteCheck on texts written by second language learners, dyslexics or even the hearing impaired, in order to explore how generic the tool is.

8.4.5 Learning to Write in the Information Society

Some of the future work discussed above has already been initiated within the framework of a three-year project, Learning to Write in the Information Society, started in 2003 and sponsored by Vetenskapsrådet. The project group, consisting of Robin Cooper, Ylva Hård af Segerstad and me, aims to investigate school children's written language in different modalities and the effects of the use of computers and other communication media, such as webchat and text messaging over mobile phones. The main aims are to see how writing is used today and how information technology can better be used for support. Texts written by primary school children will be gathered, both hand-written and computer-written. The study will also involve writing experiments with email, SMS (Short Message Service) and webchat, and further studies dealing with interaction with different writing aids. The results should reveal how writing aids influence children's writing, what needs and requirements this writing population places on such tools, and how writing aids can be improved to enhance writing development and instruction in school.


Bibliography

Abney, S. (1991). Parsing by chunks. In Berwick, R. C., Abney, S., and Tenny, C., editors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257–278. Kluwer Academic Publishers, Dordrecht.

Abney, S. (1996). Partial parsing via finite-state cascades. In Workshop on Robust Parsing at The European Summer School in Logic, Language and Information, ESSLLI'96, Prague, Czech Republic.

Ahlström, K.-G. (1964). Studies in spelling I. Uppsala University, The Institute of Education. Report 20.

Ahlström, K.-G. (1966). Studies in spelling II. Uppsala University, The Institute of Education. Report 27.

Ait-Mohtar, S. and Chanod, J.-P. (1997). Incremental finite-state parsing. In ANLP'97, pages 72–79, Washington.

Allard, B. and Sundblad, B. (1991). Skrivandets genes under skoltiden med fokus på stavning och övriga konventioner. Doktorsavhandling, Stockholms Universitet, Pedagogiska Institutionen.

Andersson, R., Cooper, R., and Sofkova Hashemi, S. (1998). Finite state grammar for finding grammatical errors in Swedish text: a finite-state word analyser. Technical report, Göteborg University, Department of Linguistics. [http://www.ling.gu.se/~sylvana/FSG/Report-9808.ps].

Andersson, R., Cooper, R., and Sofkova Hashemi, S. (1999). Finite state grammar for finding grammatical errors in Swedish text: a system for finding ungrammatical noun phrases in Swedish text. Technical report, Göteborg University, Department of Linguistics. [http://www.ling.gu.se/~sylvana/FSG/Report-9903.ps].


Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., and Tyson, M. (1993). FASTUS: A finite-state processor for information extraction from real-word text. In The Proceedings of IJCAI'93, Chambery, France.

Arppe, A. (2000). Developing a grammar checker for Swedish. In Nordgård, T., editor, The 12th Nordic Conference in Computational Linguistics, NODALIDA'99, pages 13–27. Department of Linguistics, Norwegian University of Science and Technology, Trondheim.

Arppe, A., Birn, J., and Westerlund, F. (1998). Lingsoft's Swedish Grammar Checker. [http://www.lingsoft.fi/doc.swegc].

Beesley, K. R. and Karttunen, L. (2003). Finite-State Morphology. CSLI Publications.

Bereiter, C. and Scardamalia, M. (1985). Cognitive coping strategies and the problem of inert knowledge. In Chipman, S. F., Segal, J. W., and Glaser, R., editors, Thinking and learning skills. Vol. 2, Research and open questions. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Biber, D. (1988). Variation across speech and writing. Cambridge University Press, Cambridge.

Birn, J. (1998). Swedish Constraint Grammar: A Short Presentation. [http://www.lingsoft.fi/doc/swecg/].

Birn, J. (2000). Detecting grammar errors with Lingsoft's Swedish grammar checker. In Nordgård, T., editor, The 12th Nordic Conference in Computational Linguistics, NODALIDA'99, pages 28–40. Department of Linguistics, Norwegian University of Science and Technology, Trondheim.

Björk, L. and Björk, M. (1983). Amerikanskt projekt för bättre skrivundervisning. Det viktiga är själva skrivprocessen - inte resultatet. Lärartidningen 1983:28, pages 30–33.

Björnsson, C.-H. (1957). Uppsatsbedömning och mätning av skrivförmåga. Licentiatavhandling, Stockholm.

Björnsson, C.-H. (1977). Skrivförmågan förr och nu. Pedagogiskt centrum, Stockholm.

Bloomfield, L. (1933). Language. Henry Holt & Co, New York.

Boman, M. and Karlgren, J. (1996). Abstrakta maskiner och formella språk. Studentlitteratur, Lund.


Britton, J. (1982). Spectator role and the beginnings of writing. In Nystrand, M., editor, The Structure of Written Communication. Studies in Reciprocity Between Writers and Readers. Academic Press, New York.

Bustamente, F. R. and León, F. S. (1996). GramCheck: A grammar and style checker. In The 16th International Conference on Computational Linguistics, Copenhagen, pages 175–181.

Calkins, L. M. (1986). The Art of Teaching Writing. Heinemann, Portsmouth.

Carlberger, J., Domeij, R., Kann, V., and Ola, K. (2002). A Swedish Grammar Checker. Submitted to Association for Computational Linguistics, October 2002.

Carlberger, J. and Kann, V. (1999). Implementing an efficient part-of-speech tagger. Software - Practice and Experience, 29(9):815–832.

Carlsson, M. (1981). Uppsala Chart Parser 2: System documentation (UCDL-R81-1). Technical report, Uppsala University: Center for Computational Linguistics.

Chafe, W. L. (1985). Linguistic differences produced by differences between speaking and writing. In Olson, D. R., Torrance, N., and Hildyard, A., editors, Literacy, language, and learning: The nature and consequences of reading and writing. Cambridge University Press, Cambridge.

Chall, J. (1979). The great debate: ten years later, with a modest proposal for reading stages. In Resnick, L. B. and Weaver, P. A., editors, Theory and practice of early reading, Vol. 2. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Chanod, J.-P. (1993). A broad-coverage French grammar checker: Some underlying principles. In the Sixth International Conference on Symbolic and Logical Computing, Dakota State University, Madison, South Dakota.

Chanod, J.-P. and Tapanainen, P. (1996). A robust finite-state parser for French. In Workshop on Robust Parsing at The European Summer School in Logic, Language and Information, ESSLLI'96, Prague, Czech Republic.

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2, pages 113–124.

Chomsky, N. (1959). On certain formal properties of grammars. Information and Control, 1, pages 91–112.


Chrystal, J.-A. and Ekvall, U. (1996). Skribenter in spe. Elevers skrivförmåga och skriftspråkliga kompetens. ASLA-information 22:3, pages 28–35.

Chrystal, J.-A. and Ekvall, U. (1999). Planering och revidering i skolskrivande. In Andersson, L.-G. et al., editor, Svenskans beskrivning 23, pages 57–76. Lund.

Clemenceau, D. and Roche, E. (1993). Enhancing a large scale dictionary with a two-level system. In EACL-93.

Cooper, R. (1984). Svenska nominalfraser och kontext-fri grammatik. Nordic Journal of Linguistics, 7(2):115–144.

Cooper, R. (1986). Swedish and the head-feature convention. In Hellan, L. and Koch Christensen, K., editors, Topics in Scandinavian Syntax.

Crystal, D. (2001). Language and the Internet. Cambridge University Press, Cambridge.

Dahlquist, A. and Henrysson, H. (1963). Om rättskrivning. Klassificering av fel i diagnostiska skrivprov. Folkskolan 3.

Daiute, C. (1985). Writing and computers. Addison-Wesley, New York.

Domeij, R. (1996). Detecting and presenting errors for Swedish writers at work. IPLab 108, TRITA-NA-P9629, KTH, Department of Numerical Analysis and Computing Science, Stockholm.

Domeij, R. (2003). Datorstödd språkgranskning under skrivprocessen. Svensk språkkontroll ur användarperspektiv. Doktorsavhandling, Stockholms Universitet, Institutionen för lingvistik.

Domeij, R. and Knutsson, O. (1998). Granskaprojektet: Rapport från arbetet med granskningsregler och kommentarer. KTH, Institutionen för numerisk analys och datalogi, Stockholm.

Domeij, R. and Knutsson, O. (1999). Specifikation av grammatiska feltyper i Granska. Internal working paper. KTH, Institutionen för numerisk analys och datalogi, Stockholm.

Domeij, R., Knutsson, O., Larsson, S., Eklundh, K., and Rex, Å. (1998). Granskaprojektet 1996-1997. IPLab-146, KTH, Institutionen för numerisk analys och datalogi, Stockholm.


Domeij, R., Ola, K., and Stefan, L. (1996). Datorstöd för språklig granskning under skrivprocessen: en lägesrapport. IPLab 109, TRITA-NA-P9630, KTH, Institutionen för numerisk analys och datalogi, Stockholm.

EAGLES (1996). EAGLES Evaluation of Natural Language Processing Systems. Final Report. EAGLES Document EAG-EWG-PR.2. [http://www.issco.unige.ch/projects/ewg96/ewg96.html].

Ejerhed, E. (1985). En ytstrukturgrammatik för svenska. In Allén, S., Andersson, L.-G., Löfström, J., Nordenstam, K., and Ralph, B., editors, Svenskans beskrivning 15. Göteborg.

Ejerhed, E. and Church, K. (1983). Finite state parsing. In Karlsson, F., editor, Papers from the 7th Scandinavian Conference of Linguistics. University of Helsinki. No. 10(2):410-431.

Ejerhed, E., Källgren, G., Wennstedt, O., and Åström, M. (1992). The Linguistic Annotation System of the Stockholm-Umeå Corpus Project. Report 33. University of Umeå, Department of Linguistics.

Emig, J. (1982). Writing, composition, and rhetoric. In Mitzel, H. E., editor, Encyclopedia of Educational Research. The Free Press, New York.

Flower, L. and Hayes, J. R. (1981). A cognitive process theory of writing. College Composition and Communication, 32:365–387.

Garme, B. (1988). Text och tanke. Liber, Malmö.

Graves, D. H. (1983). Writing: Teachers and Children at Work. Heinemann, Portsmouth.

Grefenstette, G. (1996). Light parsing as finite-state filtering. In Kornai, A., editor, ECAI'96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.

Grundin, H. (1975). Läs- och skrivförmågans utveckling genom skolåren. Utbildningsforskning 20. Liber, Stockholm.

Gunnarsson, B.-L. (1992). Skrivande i yrkeslivet: en sociolingvistisk studie. Studentlitteratur, Lund.

Göransson, A.-L. (1998). Hur skriver vuxna? Språkvård 3.

Haage, H. (1954). Rättskrivningens psykologiska och pedagogiska problem. Folkskolans metodik.


Haas, C. (1989). Does the medium make a difference? Two studies of writing with pen and paper and with computers. Human-Computer Interaction, 4:149–169.

Hallencreutz, K. (2002). Särskrivningar och andra skrivningar - nu och då. Språkvårdssamfundets skrifter 33.

Halliday, M. A. K. (1985). Spoken and Written Language. Oxford University Press, Oxford.

Hammarbäck, S. (1989). Skriver, det gör jag aldrig. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988, pages 103–113. Svenska föreningen för tillämpad språkvetenskap, Uppsala.

Hansen, W. J. and Haas, C. (1988). Reading and writing with computers: a framework for explaining differences in performance. Communications of the ACM, 31, Sept, pages 1080–1089.

Hawisher, G. E. (1986). Studies in word processing. Computers and Composition, 4:7–31.

Hayes, J. R. and Flower, L. (1980). Identifying the organisation of the writing process. In Gregg, L. W. and Steinberg, E. R., editors, Cognitive processes in writing. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Heidorn, G. (1993). Experience with an easily computed metric for ranking alternative parses. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.

Herring, S. C. (2001). Computer-mediated discourse. In Tannen, D., Schiffrin, D., and Hamilton, H., editors, Handbook of Discourse Analysis. Blackwell, Oxford.

Hersvall, M., Lindell, E., and Petterson, I.-L. (1974). Om kvalitet i gymnasisters skriftspråk. Pedagogisk-psykologiska problem 253.

Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, New York.

Hultman, T. G. (1989). Skrivande i skolan: sett i ett utvecklingsperspektiv. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988, pages 69–89. Svenska föreningen för tillämpad språkvetenskap, Uppsala.


Hultman, T. G. and Westman, M. (1977). Gymnasistsvenska. Liber, Lund.

Hunt, K. W. (1970). Recent measures in syntactic development. In Lester, M., editor, Readings in Applied Transformational Grammar. New York.

Håkansson, G. (1998). Språkinlärning hos barn. Studentlitteratur, Lund.

Hård af Segerstad, Y. (2002). Use and Adaptation of Written Language to the Conditions of Computer-Mediated Communication. PhD thesis, Göteborg University, Department of Linguistics.

Ingels, P. (1996). A Robust Text Processing Technique Applied to Lexical Error Recovery. Licentiate Thesis. Linköping University, Sweden.

Jensen, K. (1993). PEG: The PLNLP English Grammar. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.

Jensen, K., Heidorn, G., Miller, L., and Ravin, Y. (1993a). Parse fitting and prose fixing. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.

Jensen, K., Heidorn, G., and Richardson, S. D., editors (1993b). Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.

Josephson, O., Melin, L., and Oliv, T. (1990). Elevtext. Analyser av skoluppsatser från åk 1 till åk 9. Studentlitteratur, Lund.

Joshi, A. K. (1961). Computation of syntactic structure. Advances in Documentation and Library Science, Vol. III, Part 2.

Joshi, A. K. and Hopely, P. (1996). A parser from antiquity: An early application of finite state transducers to natural language parsing. In Kornai, A., editor, ECAI'96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.

Järvinen, T. and Tapanainen, P. (1998). Towards an implementable dependency grammar. In Kahane, S. and Polguère, A., editors, The Proceedings of COLING-ACL'98, Workshop on 'Processing of Dependency-Based Grammars', pages 1–10. Université de Montréal, Canada.

Karlsson, F. (1990). Constraint grammar as a system for parsing running text. In The Proceedings of the International Conference on Computational Linguistics, COLING'90, pages 168–173, Helsinki.


Karlsson, F. (1992). SWETWOL: Comprehensive morphological analyzer for Swedish. Nordic Journal of Linguistics, 15:1–45.

Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar: a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin.

Karttunen, L. (1993). Finite-State Lexicon Compiler. Technical Report ISTL-NLTT-1993-04-02, Xerox PARC. April 1993. Palo Alto, California.

Karttunen, L. (1995). The replace operator. In The Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL'95, pages 16–23, Boston, Massachusetts.

Karttunen, L. (1996). Directed replacement. In The Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL'96, Santa Cruz, California.

Karttunen, L., Chanod, J.-P., Grefenstette, G., and Schiller, A. (1997a). Regular expressions for language engineering. Natural Language Engineering 2(4), pages 305–328. Cambridge University Press.

Karttunen, L., Gaál, T., and Kempe, A. (1997b). Xerox Finite-State Tool. Technical report, Xerox Research Centre Europe, Grenoble. June 1997. Meylan, France.

Karttunen, L., Kaplan, R. M., and Zaenen, A. (1992). Two-level morphology with composition. In The Proceedings of the International Conference on Computational Linguistics, COLING'92, Vol. I, pages 141–148, July 25-28, Nantes, France.

Kempe, A. and Karttunen, L. (1996). Parallel replacement in the finite-state calculus. In The Proceedings of the Sixteenth International Conference on Computational Linguistics, COLING'96, Copenhagen, Denmark.

Kirschner, Z. (1994). CZECKER - a maquette grammar-checker for Czech. The Prague Bulletin of Mathematical Linguistics 62. Universita Karlova, Praha.

Knutsson, O. (2001). Automatisk språkgranskning av svensk text. Licentiatavhandling, KTH, Institutionen för numerisk analys och datalogi, Stockholm.

Kokkinakis, D. and Johansson Kokkinakis, S. (1999). A cascaded finite-state parser for syntactic analysis of Swedish. In EACL'99, pages 245–248.


Kollberg, P. (1996). S-notation as a tool for analysing the episodic structure of revisions. In European writing conferences, Barcelona, October 1996.

Koskenniemi, K., Tapanainen, P., and Voutilainen, A. (1992). Compiling and using finite-state syntactic rules. In The Proceedings of the International Conference on Computational Linguistics, COLING'92, Vol. I, pages 156–162, Nantes, France.

Kress, G. (1994). Learning to write. Routledge, London.

Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439.

Laporte, E. (1997). Rational transductions for phonetic conversion and phonology. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.

Larsson, K. (1984). Skrivförmåga: studier i svenskt elevspråk. Liber, Malmö.

Ledin, P. (1998). Att sätta punkt. Hur elever på låg- och mellanstadiet använder meningen i sina uppsatser. Språk och stil, 8:5–47.

Leijonhielm, B. (1989). Beskrivning av språket i brottsanmälningar. In Gunnarsson, B.-L., Liberg, C., and Wahlén, S., editors, Skrivande. Rapport från ASLA:s nordiska symposium, Uppsala, 10-12 november 1988. Svenska föreningen för tillämpad språkvetenskap, Uppsala.

Liberg, C. (1990). Learning to Read and Write. RUUL 20. Reports from Uppsala University, Department of Linguistics.

Liberg, C. (1999). Elevers möte med skolans textvärldar. ASLA-information 25:2, pages 40–44.

Lindell, E. (1964). Den svenska rättskrivningsmetodiken: bidrag till dess pedagogisk-psykologiska grundval. Studia psychologica et paedagogica 12.

Lindell, E., Lundquist, B., Martinsson, A., Nordlund, A., and Petterson, I.-L. (1978). Om fri skrivning i skolan. Utbildningsforskning 32. Liber, Stockholm.

Linell, P. (1982). The written language bias in linguistics. Department of Communication Studies, University of Linköping.

Ljung, B.-O. (1959). En metod för standardisering av uppsatsbedömning. Pedagogisk forskning 1. Universitetsforlaget, Oslo.


Ljung, M. and Ohlander, S. (1993). Allmän Grammatik. Gleerups Förlag, Surte.

Loman, B. and Jörgensen, N. (1971). Manual för analys och beskrivning av makrosyntagmer. Studentlitteratur, Lund.

Lundberg, I. (1989). Språkutveckling och läsinlärning. In Sandqvist, C. and Teleman, U., editors, Språkutveckling under skoltiden. Studentlitteratur, Lund.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk, Vol. 1: Transcription Format and Programs. Lawrence Erlbaum Associates, Hillsdale, New Jersey.

Magerman, D. M. and Marcus, M. P. (1990). Parsing a natural language using mutual information statistics. In AAAI'90, Boston, Ma.

Manzi, S., King, M., and Douglas, S. (1996). Working towards user-oriented evaluation. In The Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLP+IA 96), pages 155–160. Moncton, New-Brunswick, Canada.

Matsuhasi, A. (1982). Explorations in the real-time production of written discourse. In Nystrand, M., editor, What writers know: the language, process, and structure of written discourse. Academic Press, New York.

Mattingly, I. G. (1972). Reading, the linguistic process and linguistic awareness. In Kavanagh, J. F. and Mattingly, I. G., editors, Language by Ear and by Eye, pages 133–147. MIT Press, Cambridge.

Moffett, J. (1968). Teaching the Universe of Discourse. Houghton Mifflin Company, New York.

Mohri, M., Pereira, F. C. N., and Riley, M. (1998). A rational design for a weighted finite-state transducer library. Lecture Notes in Computer Science 1436.

Mohri, M. and Sproat, R. (1996). An efficient compiler for weighted rewrite rules. In The Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL'96, Santa Cruz, California.

Montague, M. (1990). Computers and writing process instruction. Computers in the Schools 7(3).

Nauclér, K. (1980). Perspectives on misspellings. A phonetic, phonological and psycholinguistic study. Liber Läromedel, Lund.


van Noord, G. and Gerdemann, D. (1999). An extendible regular expression compiler for finite-state approaches in natural language processing. In Workshop on Implementing Automata '99, Potsdam, Germany.

Nyström, C. (2000). Gymnasisters skrivande. En studie av genre, textstruktur och sammanhang. Doktorsavhandling, Institutionen för Nordiska språk, Uppsala Universitet.

Näslund, H. (1981). Satsradningar i svenskt elevspråk. FUMS 95: Forskningskommittén i Uppsala för modern svenska. Institutionen för nordiska språk, Uppsala universitet.

Olevard, H. (1997). Tonårsliv. En pilotstudie av 60 elevtexter från standardproven för skolår 9 åren 1987 och 1996. Svenska i utveckling nr 11. FUMS Rapport nr 194.

Paggio, P. and Music, B. (1998). Evaluation in the SCARRIE project. In First International Conference on Language Resources and Evaluation, Granada, Spain, pages 277–281.

Paggio, P. and Underwood, N. L. (1998). Validating the TEMAA LE evaluation methodology: a case study on Danish spelling checkers. Natural Language Engineering, 4(3):211–228. Cambridge University Press.

Pereira, F. C. N. and Riley, M. D. (1997). Speech recognition by composition of weighted finite automata. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.

Pettersson, A. (1980). Hur gymnasister skriver. Svensklärarserien 184.

Pettersson, A. (1989). Utvecklingslinjer och utvecklingskrafter i elevernas uppsatser. In Sandqvist, C. and Teleman, U., editors, Språkutveckling under skoltiden. Studentlitteratur, Lund.

Pontecorvo, C. (1997). Studying writing and writing acquisition today: A multidisciplinary view. In Pontecorvo, C., editor, Writing development: An interdisciplinary view. John Benjamins Publishing Company.

Povlsen, C., Sågvall Hein, A., and de Smedt, K. (1999). Final Project Report. Reports from the SCARRIE project, Deliverable 0.4. [http://fasting.hf.uib.no/~desmedt/scarrie/final-report.html].

Ravin, Y. (1993). Grammar errors and style weakness in a text-critiquing system. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.


Richardson, S. D. (1993). The experience of developing a large-scale natural language processing system: Critique. In Jensen, K., Heidorn, G., and Richardson, S. D., editors, Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Dordrecht.

van Rijsbergen, C. J. (1979). Information Retrieval. London.

Robbins, A. D. (1996). AWK Language Programming. A User's Guide for GNU AWK. Free Software Foundation, Boston.

Roche, E. (1997). Parsing with finite-state transducers. In Roche, E. and Schabes, Y., editors, Finite State Language Processing. MIT Press, Cambridge, Massachusetts.

Sandström, G. (1996). Språklig redigering på en dagstidning. Språkvård 1.

de Saussure, F. (1922). Course in General Linguistics. Translation by Roy Harris. Duckworth, London.

Scardamalia, M. and Bereiter, C. (1986). Research on written composition. In Wittrock, M. C., editor, Handbook of Research on Teaching. Third edition. A project of the American Educational Research Association. Macmillan Publishing Company, New York.

Schiller, A. (1996). Multilingual finite-state noun phrase extraction. In ECAI'96 Workshop on Extended Finite State Models of Language, Budapest, Hungary.

Senellart, J. (1998). Locating noun phrases with finite state transducers. In The Proceedings of COLING-ACL'98, pages 1212–1219.

Severinson Eklundh, K. (1990). Global strategies in computer-based writing: the use of logging data. In 2nd Nordic Conference on Text Comprehension in Man and Machine, Täby.

Severinson Eklundh, K. (1993). Skrivprocessen och datorn. IPLab 61, KTH, Institutionen för numerisk analys och datalogi, Stockholm.

Severinson Eklundh, K. (1994). Electronic mail as a medium for dialogue. In van Waes, L., Woudstra, E., and van den Hoven, P., editors, Functional Communication Quality. Rodopi Publishers, Amsterdam/Atlanta.

Severinson Eklundh, K. (1995). Skrivmönster med ordbehandlare. Språkvård 4.


Severinson Eklundh, K. and Sjöholm, K. (1989). Writing with a computer. A longitudinal survey of writers of technical documents. IPLab 19, KTH, Department of Numerical Analysis and Computing Science, Stockholm.

Skolverket (1992). LEXIN: språklexikon för invandrare. Norstedts Förlag.

Sofkova Hashemi, S. (1998). Writing on a computer and writing with a pencil and paper. In Strömqvist, S. and Ahlsén, E., editors, The Process of Writing - a progress report, pages 195–208. Göteborg University, Department of Linguistics.

Starbäck, P. (1999). ScarCheck - a software for word and grammar checking. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Strömquist, S. (1987). Styckevis och helt. Liber, Malmö.

Strömquist, S. (1989). Skrivboken. Liber, Malmö.

Strömquist, S. (1993). Skrivprocessen. Teori och tillämpning. Studentlitteratur, Lund.

Strömqvist, S. (1996). Discourse flow and linguistic information structuring: Explorations in speech and writing. Gothenburg Papers in Theoretical Linguistics 78. Göteborg University, Department of Linguistics.

Strömqvist, S. and Hellstrand, Å. (1994). Tala och skriva i lingvistiskt och didaktiskt perspektiv - en projektbeskrivning. Didaktisk Tidskrift, Nr 1-2.

Strömqvist, S., Johansson, V., Kriz, S., Ragnarsdottir, H., Aisenman, R., and Ravid, D. (2002). Towards a crosslinguistic comparison of lexical quanta in speech and writing. Written Language and Literacy, Vol. 5, No. 1, pages 45–68.

Strömqvist, S. and Karlsson, H. (2002). ScriptLog for Windows - User's Manual. Department of Linguistics and University College of Stavanger: Centre for Reading Research.

Strömqvist, S. and Malmsten, L. (1998). ScriptLog Pro 1.04 - User's Manual. Göteborg University, Department of Linguistics.

Svenska Akademiens Ordlista (1986). 11 uppl. Norstedts förlag, Stockholm.

Sågvall Hein, A. (1981). An Overview of the Uppsala Chart Parser Version I (UCP-1). Uppsala University, Department of Linguistics.


Sågvall Hein, A. (1983). A Parser for Swedish. Status Report for SveUcp. (UCDLR-83-2). Uppsala University, Department of Linguistics. February 1983.

Sågvall Hein, A. (1998a). A chart-based framework for grammar checking: Initial studies. In The 11th Nordic Conference in Computational Linguistics, NODALIDA'98.

Sågvall Hein, A. (1998b). A specification of the required grammar checking machinery. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 6.5.2, June 1998. Uppsala University, Department of Linguistics.

Sågvall Hein, A. (1999). A grammar checking module for Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 6.6.3, June 1999. Uppsala University, Department of Linguistics.

Sågvall Hein, A., Olsson, L.-G., Dahlqvist, B., and Mats, E. (1999). Evaluation report for the Swedish prototype. In Sågvall Hein, A., editor, Reports from the SCARRIE project, Deliverable 8.1.3, June 1999. Uppsala University, Department of Linguistics.

Teleman, U. (1974). Manual för beskrivning av talad och skriven svenska. Lund.

Teleman, U. (1979). Språkrätt. Gleerups, Malmö.

Teleman, U. (1991a). Lära svenska: Om språkbruk och modersmålsundervisning. Skrifter utgivna av Svenska Språknämnden, Almqvist and Wiksell, Solna.

Teleman, U. (1991b). Vad kan man när man kan skriva? In Malmgren and Sandqvist, editors, Skrivpedagogik.

Teleman, U., Hellberg, S., and Andersson, E. (1999). Svenska Akademiens grammatik. Svenska Akademien.

Vanneste, A. (1994). Checking grammar checkers. Utrecht Studies and Communication, 4.

Vernon, A. (2000). Computerized grammar checkers 2000: Capabilities, limitations, and pedagogical possibilities. Computers and Composition 17, pages 329–349.

Vosse, T. G. (1994). The Word Connection. Grammar-based Spelling Error Correction in Dutch. PhD thesis, Neslia Paniculata, Enschede.

Voutilainen, A. (1995). NPtool, a detector of English noun phrases. In The Proceedings of the Workshop on Very Large Corpora, Ohio State University.


Voutilainen, A. and Padró, L. (1997). Developing a hybrid NP parser. In ANLP'97, Washington.

Voutilainen, A. and Tapanainen, P. (1993). Ambiguity resolution in a reductionistic parser. In EACL-93, pages 394–403, Utrecht.

Wallin, E. (1962). Bidrag till rättstavningsförmågans psykologi och pedagogik. Göteborgs Universitet, Pedagogiska Institutionen.

Wallin, E. (1967). Spelling. Factorial and experimental studies. Almqvist and Wiksell, Stockholm.

Wedbjer Rambell, O. (1999a). Error typology for automatic proof-reading purposes. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Wedbjer Rambell, O. (1999b). Swedish phrase constituent rules. A formalism for the expression of local error rules for Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Wedbjer Rambell, O. (1999c). Three types of grammatical errors in Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Wedbjer Rambell, O., Dahlqvist, B., Tjong Kim Sang, E., and Hein, N. (1999). An error database of Swedish. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Weijnitz, P. (1999). Uppsala Chart Parser Light: system documentation. In Sågvall Hein, A., editor, Reports from the SCARRIE project. Uppsala University, Department of Linguistics.

Wengelin, Å. (2002). Text Production in Adults with Reading and Writing Difficulties. PhD thesis, Göteborg University, Department of Linguistics.

Wikborg, E. (1990). Composing on the computer: a study of writing habits on the job. In Nordtext Symposium, Text structuring - reception and production strategies, Hanasaari, Helsinki.

Wikborg, E. and Björk, L. (1989). Sammanhang i text. Hallgren and Fallgren Studieförlag AB, Uppsala.

Wresch, W. (1984). The computer in composition instruction. National Council of Teachers of English.


Öberg, H. S. (1997). Referensbindning i elevuppsatser. En preliminär modell och en analys i två delar. Svenska i utveckling nr 7. FUMS Rapport nr 187.

Östlund-Stjärnegårdh, E. (2002). Godkänd i svenska? Bedömning och analys av gymnasieelevers texter. Doktorsavhandling, Institutionen för Nordiska språk, Uppsala Universitet.

Appendices


Appendix A

Grammatical Feature Categories

GENDER:
  com        common gender
  neu        neuter gender
  masc       masculine gender
  fem        feminine gender

DEFINITENESS:
  def        definite form
  indef      indefinite form
  wk         weak form of adjective
  str        strong form of adjective

CASE:
  nom        nominative case
  acc        accusative case
  gen        genitive case

NUMBER:
  sg         singular
  pl         plural

TENSE:
  imp        imperative
  inf        infinitive
  pres       present
  pret       preterite
  perf       perfect
  past perf  past perfect
  sup        supine
  past part  past participle
  untensed   non-finite verb

VOICE:
  pass       passive

OTHER:
  adj        adjective
  adv        adverb
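For readers who want to process the annotated data programmatically, the feature categories listed above can be held as a simple lookup structure. The following Python sketch is purely illustrative and is not the representation used in FiniteCheck itself; the name `FEATURES` is an assumption introduced here.

```python
# Illustrative only: the grammatical feature categories of Appendix A as
# nested dictionaries mapping each abbreviation to its gloss. This mirrors
# the tables above; it is not the thesis software's internal representation.
FEATURES = {
    "GENDER": {"com": "common gender", "neu": "neuter gender",
               "masc": "masculine gender", "fem": "feminine gender"},
    "DEFINITENESS": {"def": "definite form", "indef": "indefinite form",
                     "wk": "weak form of adjective",
                     "str": "strong form of adjective"},
    "CASE": {"nom": "nominative case", "acc": "accusative case",
             "gen": "genitive case"},
    "NUMBER": {"sg": "singular", "pl": "plural"},
    "TENSE": {"imp": "imperative", "inf": "infinitive", "pres": "present",
              "pret": "preterite", "perf": "perfect",
              "past perf": "past perfect", "sup": "supine",
              "past part": "past participle", "untensed": "non-finite verb"},
    "VOICE": {"pass": "passive"},
    "OTHER": {"adj": "adjective", "adv": "adverb"},
}

# Example: glossing two of the abbreviations used in the error corpora.
assert FEATURES["NUMBER"]["pl"] == "plural"
assert FEATURES["DEFINITENESS"]["indef"] == "indefinite form"
```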


Appendix B

Error Corpora

This Appendix presents the errors found in Child Data and consists of three corpora:

B.1 Grammar Errors
B.2 Misspelled Words
B.3 Segmentation Errors

Every listed instance of an error (ERROR) is indexed and followed by a suggestion for a possible correction (CORRECTION) and information about which sub-corpus (CORP) it originates from, who the writer was (SUBJ), the writer's age (AGE) and sex (SEX; m for male and f for female). The different sub-corpora are abbreviated as DV Deserted Village, CF Climbing Fireman, FS Frog Story, SN Spencer Narrative, SE Spencer Expository.
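Read as a data structure, each record in the corpora below pairs an erroneous passage with its correction and four pieces of metadata. The Python sketch that follows is a minimal illustration of that record layout; the class and variable names are hypothetical and not part of the thesis software, while the field names and sub-corpus abbreviations follow the description above.

```python
from dataclasses import dataclass

# The five sub-corpora of Child Data, as abbreviated above.
SUB_CORPORA = {
    "DV": "Deserted Village",
    "CF": "Climbing Fireman",
    "FS": "Frog Story",
    "SN": "Spencer Narrative",
    "SE": "Spencer Expository",
}

@dataclass
class ErrorRecord:
    index: str       # hierarchical error index, e.g. "1.2.1"
    error: str       # the erroneous passage as the child wrote it
    correction: str  # suggested correction
    corp: str        # sub-corpus abbreviation (a key of SUB_CORPORA)
    subj: str        # writer identifier
    age: int         # writer's age
    sex: str         # "m" for male, "f" for female

# One record from corpus B.1 (error 1.2.1, a gender agreement error):
r = ErrorRecord("1.2.1", "pojken fick en grodbarn", "ett", "FS", "haic", 11, "f")
assert SUB_CORPORA[r.corp] == "Frog Story"
```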

B.1 Grammar Errors

Grammar errors are categorized by the type of error that occurred.

1 AGREEMENT IN NOUN PHRASE

1.1 Definiteness agreement

Indefinite head with definite modifier:
1.1.1 Jag tar den närmsta handduk och slänger den i vasken och blöter den,
      Correction: handduken (CF, alhe, 9, f)
1.1.2 En gång blev den hemska pyroman utkastad ur stan.
      Correction: pyromanen (CF, frma, 9, m)
1.1.3 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför.
      Correction: en stol/den stolen (SE, wj16, 13, f)

Definite head with possessive modifier:
1.1.4 Pär tittar på sin klockan och det var tid för familjen att gå hem.
      Correction: klocka (DV, frma, 9, m)
1.1.5 hunden sa på pojkens huvet.
      Correction: huve/huvud (FS, haic, 11, f)

Definite head with modifier 'denna':
1.1.6 Nu när jag kommer att skriva denna uppsatsen så kommer jag ha en rubrik om några problem och ...
      Correction: uppsats (SE, wj03, 13, f)

Definite head with indefinite modifier:
1.1.7 Men senare ångrade dom sig för det var en räkningen på deras lägenhet.
      Correction: räkning (DV, jowe, 9, f)
1.1.8 Man ska inte fråga en kompisen om något arbete, man ska fråga läraren.
      Correction: kompis (SE, wg05, 10, m)

1.2 Gender agreement

Wrong article:
1.2.1 pojken fick en grodbarn
      Correction: ett (FS, haic, 11, f)

Wrong article in partitive:
1.2.2 Virginias mamma hade öppnat en tyg affär i en av Dom gamla husen.
      Correction: ett (DV, idja, 11, f)

Masculine form of adjective:
1.2.3 sen berätta den minsta att det va den hemske fula troll karlen tokig som ville göra mos av dom för han skulle bo i deras by.
      Correction: fule (DV, alhe, 9, f)
1.2.4 nasse blev arg han gick och la sig med dom andre syskonen.
      Correction: andra (CF, haic, 11, f)

1.3 Number agreement

Singular modifier with plural head:
1.3.1 Den dära scenen med det tre tjejerna tyckte jag att de var taskiga som går ifrån den tredje tjejen
      Correction: de (SE, wg09, 10, m)

Singular noun in partitive attribute:
1.3.2 Alla männen och pappa gick in i ett av huset.
      Correction: husen (DV, haic, 11, f)
1.3.3 en av boven tog bensinen och gick bakåt.
      Correction: bovarna (CF, haic, 11, f)

2 AGREEMENT IN PREDICATIVE COMPLEMENT

2.1 Gender agreement

2.1.1 då börja Urban lipa och sa: Mitt hus är blöt.
      Correction: blött (CF, caan, 9, m)
2.1.2 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att vara rädd att bli utskrattat av avundsjuka personer.
      Correction: utskrattad (SE, wg11, 10, f)

2.2 Number agreement

Singular:
2.2.1 En som är mobbad gråter säkert varje dag känner sig menigslösa.
      Correction: meningslös (SE, wj05, 13, m)

Plural:
2.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet.
      Correction: mobbade (SE, wj05, 13, m)
2.2.3 Sjävl tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är.
      Correction: öppna, ärliga, elaka (SE, wj13, 13, m)
2.2.4 jag tror att dom som är s har själva varit ut satt någon gång och nu vill dom hämnas och...
      Correction: utsatta (SE, wj19, 13, m)
2.2.5 ... för folk tänker mest på sig själv.
      Correction: själva (SE, wj20, 13, m)
2.2.6 nasse är en gris som har massor av syskon. nasse är skär. Men nasses syskon är smutsig.
      Correction: smutsiga (CF, haic, 11, f)

3 DEFINITENESS IN SINGLE NOUNS

3.1.1 dom gick till by
      Correction: byn (DV, haic, 11, f)
3.1.2 dom som bodde på ön kanske försökte komma på skepp
      Correction: skeppet (DV, haic, 11, f)
3.1.3 Jag såg en ö vi gick till ö
      Correction: ön (DV, haic, 11, f)
3.1.4 dom sa till borgmästare vad ska vi göra!
      Correction: borgmästaren (CF, frma, 9, m)
3.1.5 män han hade skrikit så börjar gren röra på sig
      Correction: grenen (FS, frma, 9, m)
3.1.6 pojke hoppade ner till hunden
      Correction: pojken (FS, frma, 9, m)

4 PRONOUN CASE

4.1 Case - Objective form:
4.1.1 bilarna bromsade så att det blev svarta streck efter de.
      Correction: dem (SN, wg10, 10, m)
4.1.2 Två av brandmännen sprang in i huset för att rädda de
      Correction: dem (CF, klma, 10, f)
4.1.3 jag tycker synd om de
      Correction: dem (SE, wg16, 10, f)
4.1.4 då kan ju den eleven som blir utsatt gå fram och prata med han
      Correction: honom (SE, wj14, 13, m)
4.1.5 bara för man inte vill vara med han
      Correction: honom (SE, wj14, 13, m)

5 FINITE MAIN VERB

5.1 Present tense

Regular verbs:
5.1.1 Madde och jag bestämde oss för att sova i kojan och se om vi få se vind.
      Correction: får (CF, alhe, 9, f)

5.1.2 När hon kommer ner undrar hon varför det lukta så bränt och varför det låg en handduk över spisen.
      Correction: luktar (CF, alhe, 9, f)
5.1.3 undra vad det brann nånstans jag måste i alla fall larma
      Correction: undrar (CF, erja, 9, m)
5.1.4 Få se nu vilken väg är det, den här.
      Correction: Får (FS, idja, 11, f)
5.1.5 han kommer och klappar alla på handen utan en kille undra hur han känner sig då?
      Correction: undrar (SE, wj03, 13, f)
5.1.6 ... det kan även vara att nån kan sparka eller att man få vara enstöring...
      Correction: får (SE, wj08, 13, f)
5.1.7 ... där några tjejer/killar sitter och prata.
      Correction: pratar (SE, wj08, 13, f)
5.1.8 men det kanske bero på att det var en mindre skola
      Correction: beror (SE, wj13, 13, m)
5.1.9 ... och inte bry sig om han man inte få vara med,
      Correction: får (SE, wj14, 13, m)

Strong verbs:
5.1.10 Att stjäla är inte bra speciellt inte om man tar en sak av en person som gick för en i ett led och inte säga till att man hittade den utan att man behåller den.
      Correction: säger (SE, wj03, 13, f)

5.2 Preterite

Regular verbs:
5.2.1 vi berätta och ...
      Correction: berättade (DV, alhe, 9, f)
5.2.2 den äldsta som va 80 år berätta
      Correction: berättade (DV, alhe, 9, f)
5.2.3 jag berätta om byn
      Correction: berättade (DV, alhe, 9, f)
5.2.4 sen berätta den minsta
      Correction: berättade (DV, alhe, 9, f)
5.2.5 då börja alla i hela tunneln förutom pappa och ja gråta
      Correction: började (DV, alhe, 9, f)
5.2.6 sen cykla vi dit igen.
      Correction: cyklade (DV, alhe, 9, f)
5.2.7 ...gick ner och hämta min och pappas cyklar ...
      Correction: hämtade (DV, alhe, 9, f)
5.2.8 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
      Correction: hämtade (DV, alhe, 9, f)
5.2.9 Pappa gick och knacka på en dörr för att vi var väldigt hungriga
      Correction: knackade (DV, alhe, 9, f)
5.2.10 Pappa gick och Knacka på en dörr till medan jag hämta cyklarna
      Correction: knackade (DV, alhe, 9, f)
5.2.11 jag knacka på dörren
      Correction: knackade (DV, alhe, 9, f)
5.2.12 men jag lugna mig och kände på marken
      Correction: lugnade (DV, alhe, 9, f)
5.2.13 dom peka på väggen av tunneln
      Correction: pekade (DV, alhe, 9, f)
5.2.14 jag ramla i en rutschbana
      Correction: ramlade (DV, alhe, 9, f)
5.2.15 långt åkte ja tills jag stanna vid en port
      Correction: stannade (DV, alhe, 9, f)
5.2.16 ... när vi kom hem undra självklart mamma vart vi varit
      Correction: undrade (DV, alhe, 9, f)
5.2.17 pappa och jag undra va nycklarna va
      Correction: undrade (DV, alhe, 9, f)
5.2.18 sen undra han va dom bodde
      Correction: undrade (DV, alhe, 9, f)
5.2.19 på morgonen när vi vakna...
      Correction: vaknade (DV, alhe, 9, f)
5.2.20 men ingen öppna
      Correction: öppnade (DV, alhe, 9, f)
5.2.21 någon eller något öppna dörren
      Correction: öppnade (DV, alhe, 9, f)
5.2.22 vi till och med öppna pensionathem
      Correction: öppnade (DV, alhe, 9, f)
5.2.23 Lena Ropa mamma
      Correction: ropade (DV, angu, 9, f)
5.2.24 Lena vakna Plötsligt vakna Hon av att någon sa Lena Lena.
      Correction: vaknade (DV, angu, 9, f)
5.2.25 Per luta sig mot en
      Correction: lutade (DV, anhe, 11, m)

Preterite Regular verbs vi ber¨atta och ... den a¨ ldsta som va 80 a˚ r ber¨atta jag ber¨atta om byn sen ber¨atta den minsta d˚a b¨orja alla i hela tunneln f¨orutom pappa och ja gr˚ata sen cykla vi dit igen. ...gick ner och h¨amta min och pappas cyklar ... Pappa gick och Knacka p˚a en d¨orr till medan jag h¨amta cyklarna Pappa gick och knacka p˚a en d¨orr f¨or att vi var v¨aldigt hungriga Pappa gick och Knacka p˚a en d¨orr till medan jag h¨amta cyklarna jag knacka p˚a d¨orren men jag lugna mig och k¨ande p˚a marken dom peka p˚a v¨aggen av tunneln jag ramla i en rutschbana l˚angt a˚ kte ja tills jag stanna vid en port ... n¨ar vi kom hem undra sj¨alvklart mamma vart vi varit pappa och jag undra va nycklarna va sen undra han va dom bodde p˚a morgonen n¨ar vi vakna... men ingen o¨ ppna n˚agon eller n˚agot o¨ ppna d¨orren vi till och med o¨ ppna pensionathem Lena Ropa mamma Lena vakna Pl¨otsligt vakna Hon av att n˚agon sa Lena Lena. Per luta sig mot en

Error Corpora

E RROR 5.2.26 Sen Svimma jag 5.2.27 n¨ar jag vakna satt Jag Per och Urban mitt i byn. 5.2.28 och n¨ar vi kom hem s˚a Vakna jag och allt var en dr¨om. 5.2.29 Pl¨otsligt b¨orja en lavin 5.2.30 n¨ar Gunnar o¨ ppna d¨orren till det stora huset rasa det ihop 5.2.31 och snart rasa hela byn ihop. 5.2.32 n¨ar Gunnar o¨ ppna d¨orren till det stora huset rasa det ihop 5.2.33 Niklas och Benny hoppa av kamelerna 5.2.34 och snabbt hoppa dom p˚a kamelerna 5.2.35 och rusa iv¨ag och red bort 5.2.36 snabbt samla han ihop alla sina j¨agare 5.2.37 men undra varf¨or den a¨ r o¨ vergiven. 5.2.38 Ida gick och t¨ankte p˚a vad dom skulle g¨ora hon snubbla p˚a n˚at 5.2.39 Jag tog min v¨aska och Madde tog sin, och vi b¨orja g˚a mot v˚ar koja, d¨ar vi skulle sova. 5.2.40 N¨ar vi kom fram b¨orja vi packa upp v˚ara grejer och rulla upp sovs¨acken. 5.2.41 Madde vaknade av mitt skrik, hon fr˚aga va det var f¨or n˚at. 5.2.42 P˚a morgonen vaknade vi och kl¨adde p˚a oss sen packa vi ner v˚ara grejer. 5.2.43 jag sa att det inte va n˚at s˚a somna vi om. 5.2.44 F¨or ett o¨ gon blick trodde jag att den h¨asten vakta v˚aran koja. 5.2.45 p˚a natten vakna jag av att brandlarmet tjo¨ t 5.2.46 d˚a b¨orja Urban lipa och sa: Mitt hus a¨ r bl¨ot. 5.2.47 Brandk˚aren kom och spola ner huset 5.2.48 Cristoffer stod och titta p˚a ugglan i tr¨adet 5.2.49 Erik gick till skogen och ropa allt han kunde. 5.2.50 R˚adjuret sprang iv¨ag med honom. Och kasta av pojken vid ett berg. 5.2.51 De kl¨attra o¨ ver en stock. 5.2.52 Pojken ropa groda groda var a¨ r du 5.2.53 De gick ut och ropa men de fick inget svar. 5.2.54 Ruff r˚aka trilla ut ur f¨onstret. 5.2.55 Pojken satt varje kv¨all och titta p˚a grodan 5.2.56 N¨ar pojken vakna n¨asta morgon och fann att grodan var f¨orsvunnen blev han orolig 5.2.57 Och utan att pojken visste om det hoppa grodan ur burken n¨ar han l˚ag. 5.2.58 N¨asta dag vakna pojken och s˚ag att grodan hade rymt 5.2.59 hunden halka efter. 5.2.60 N¨ar han landa s˚a svepte massa bin o¨ ver honom. 
5.2.61 Pojken leta och leta i sitt rum. 5.2.62 Pojken leta och leta i sitt rum. 5.2.63 Hunden leta ocks˚a 5.2.64 Pojken gick d˚a ut och leta efter grodan 5.2.65 Pojken leta i ett tr¨ad

285

C ORP

S UBJ

AGE

S EX

svimmade vaknade vaknade

C ORRECTION

DV DV DV

anhe anhe anhe

11 11 11

m m m

b¨orjade rasade

DV DV

erha erha

10 10

m m

rasade o¨ ppnade

DV DV

erha erha

10 10

m m

hoppade hoppade rusade samlade undrade snubblade

DV DV DV DV DV DV

erja erja erja erja idja jowe

9 9 9 9 11 9

m m m m f f

b¨orjade

CF

alhe

9

f

b¨orjade

CF

alhe

9

f

fr˚agade

CF

alhe

9

f

packade

CF

alhe

9

f

somnade vaktade

CF CF

alhe alhe

9 9

f f

vaknade b¨orjade spolade tittade ropade kastade

CF CF CF FS FS FS

angu caan caan alca alhe angu

9 9 9 11 9 9

f m m f f f

kl¨attrade ropade ropade r˚akade tittade vaknade

FS FS FS FS FS FS

angu angu angu angu angu angu

9 9 9 9 9 9

f f f f f f

hoppade

FS

caan

9

m

vaknade

FS

caan

9

m

halkade landade letade letade letade letade letade

FS FS FS FS FS FS FS

erge erge erge erge erge erge erge

9 9 9 9 9 9 9

f f f f f f f

Appendix B. Error Corpora


ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

5.2.66 Då helt plötsligt ramla hunden ner från fönstret → ramlade [FS, erge, 9, f]
5.2.67 där bodde bara en uggla som skrämde honom så han ramla ner på marken. → ramlade [FS, erge, 9, f]
5.2.68 Där ställde pojken sig och ropa efter grodan → ropade [FS, erge, 9, f]
5.2.69 Hej då ropa han hej då. → ropade [FS, erge, 9, f]
5.2.70 Då gick pojken vidare och såg inte att binas bo trilla ner. → trillade [FS, erge, 9, f]
5.2.71 när dom båda trilla i. → trillade [FS, erge, 9, f]
5.2.72 Han ropa hallå var är du → ropade [FS, haic, 11, f]
5.2.73 han gick upp på stora stenen ropa hallå hallå → ropade [FS, haic, 11, f]
5.2.74 Då öppnade han fönstret & ropa på grodan. → ropade [FS, jobe, 10, m]
5.2.75 I min förra skola hade man nåt som man kallade för kamratstödjare, Det funka väl ganska bra men... → funkade [SE, wj13, 13, m]
5.2.76 man visade ingen hänsyn eller att man inte heja eller bara bråka → bråkade [SE, wj18, 13, m]
5.2.77 man visade ingen hänsyn eller att man inte heja eller bara bråka → hejade [SE, wj18, 13, m]
5.2.78 Var var den där överraskningen. Ni svara jag men båda tittade på varandra ... → svarade [SN, wg07, 10, f]
5.2.79 Ni svara jag → svarade [SN, wg07, 10, f]
5.2.80 det gick inte så hon klättrade upp bredvid mig och medan jag för sökte lyfta upp mig skälv medan hon putta bort jackan från pelare. → puttade [SN, wg16, 10, f]
5.2.81 medan hon putta jackan från pelaren → puttade [SN, wg16, 10, f]
5.2.82 jag var på mitt land och bada → badade [SN, wg18, 10, m]
5.2.83 så här börja det → började [SN, wg18, 10, m]
5.2.84 där sövde dom mig och gipsa handen. → gipsade [SN, wj05, 13, m]
5.2.85 Hon hade bara kladdskrivit den uppsats jag lämna in ... → lämnade [SN, wj16, 13, f]
5.2.86 ...så jag ångra verkligen att jag tog hennes uppsats... → ångrade [SN, wj16, 13, f]
5.2.87 När jag gick förbi den djupa avdelningen så kom en annan kille och putta i mig → puttade [SN, wj20, 13, m]

Supine
5.2.88 det låg massor av saker runtomkring jag försökt att kom till fören → försökte [DV, haic, 11, f]
5.2.89 Han tittade på hunden, hunden försökt att klättra ner → försökte [FS, haic, 11, f]

Participle
5.2.90 Fönstrena ser lite blankare ut där uppe sa Virginia och börjad klättra upp för den ruttna stegen. → började [DV, idja, 11, f]
5.2.91 älgen sprang med olof till ett stup och kastad ner olof och hans hund → kastade [FS, frma, 9, m]
5.2.92 dom letad överallt → letade [FS, frma, 9, m]
5.2.93 när han letad kollade en sork upp → letade [FS, frma, 9, m]
5.2.94 han letad bakom stocken → letade [FS, frma, 9, m]
5.2.95 alla pratad om borgmästaren → pratade [CF, frma, 9, m]
5.2.96 hunden råkade skakad ner ett getingbo → skaka [FS, frma, 9, m]
5.2.97 det var en liten pojke som satt och snyftad → snyftade [DV, haic, 11, f]
5.2.98 svarad han → svarade [DV, alco, 9, f]
5.2.99 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen. → torkade [DV, idja, 11, f]

Strong verbs
5.2.100 Nästa dag så var en ryggsäck borta och mera grejer försvinna → försvann [DV, erge, 9, f]

6 VERB CLUSTER
6.1 Verb form after auxiliary verb
Present
6.1.1 Och i morgon är det brandövning men kom ihåg att det inte ska blir någon riktig brand. → bli [CF, klma, 10, f]
6.1.2 Ibland får man bjuda på sig själv och låter henne/honom vara med ! → låta [SE, wj17, 13, f]
Preterite
6.1.3 hon ville inte att jag skulle följde med men med lite tjat fick jag. → följa [DV, alhe, 9, f]
Imperative
6.1.4 Men de var fult med buskar utan för som vi fick rid igenom. → rida [DV, idja, 11, f]
6.1.5 han råkade bara kom i mot getingboet. → komma [FS, haic, 11, f]
6.1.6 Det är något som vi alla nog skulle gör om vi inte hade läst på ett prov. → göra [SE, wj20, 13, m]
6.1.7 Jag skrattade och undrade hur Tromben skulle ha kom igenom det lilla hålet. → kommit [DV, idja, 11, f]
6.2 Missing auxiliary verb
Temporal 'ha'
6.2.1 ni måste hjälpa mig om ni ska få henne. och dom — lovat att bygga upp staden och de blev hotell → har/hade [DV, erge, 9, f]
6.2.2 Men pappa — frågat mig om jag ville följa med → har/hade [DV, haic, 11, f]

7 INFINITIVE PHRASE
7.1 Verb form after infinitive marker
Present
7.1.1 Men hunden klarar att inte slår sig. → slå [FS, haic, 11, f]
Imperative
7.1.2 glöm inte att stäng dörren → stänga [DV, hais, 11, f]
7.1.3 jag försökt att kom till fören → komma [DV, haic, 11, f]
7.1.4 Åt det går det nog inte att gör så mycket åt. → göra [SE, wj20, 13, m]
7.2 Missing infinitive marker
7.2.1 Men det vågar man kanske inte i första taget för då kan man ju bli rädd att man kommer — få ett kännetecken som skolans skvallerbytta eller något sånt! → kommer att få [E13, wj01, 13, f]
7.2.2 ... tänkte jag att om man ska hålla på så kommer det — inte gå bra i skolan. → kommer det inte att gå [E13, wj06, 13, f]

7.2.3 Nu när jag kommer att skriva denna uppsatsen så kommer jag — ha en rubrik om några problem och vad man kan göra för att förbättra dom. → kommer jag att ha [E13, wj03, 13, f]

8 WORD ORDER
8.1.1 När han kom hem så åt han middag gick och borstade tänderna och gick och sedan lade sig. → sedan och [FS, jowe, 9, f]
8.1.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting → kan bara [SE, wg03, 10, f]
8.1.3 Jag den dan gjorde inget bättre. → Jag gjorde inget bättre den dan. [SN, wg07, 10, f]
8.1.4 att jag har ett problem att jag måste hela tiden fuska på proven annars med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet. → på matten med [SE, wg10, 10, m]
8.1.5 kompisarna gör det inte men om tvingar dom inte dig till att göra det → dom tvingar [SE, wj12, 13, f]

9 REDUNDANCY
9.1 Doubled word
Following directly
9.1.1 Han tittade på sin hund hund oliver → hund [FS, alhe, 9, f]
9.1.2 Kompisen ska få titta på en ibland också men, men det får inte bli regelbundet för då... → , men [SE, wj17, 13, f]
9.1.3 många som mobbar har har det oftast dåligt hemma → har [SE, wj19, 13, m]
9.1.4 vi skall i alla fall träffas idag 20 mars 1999 måndagen kanske imorgon också också → också [SN, wg04, 10, m]
9.1.5 Jag hade tur jag klarade klarade mig → klarade [SN, wg10, 10, m]
Word between
9.1.6 jag tycker jag att alla måste få vara med → jag tycker att alla måste få vara med [SE, wg18, 10, m]
9.1.7 jag fick jag hjälp med det. → jag fick hjälp med det [SN, wj11, 13, f]
9.1.8 Åt det går det nog inte att gör så mycket åt. → Åt det går det nog inte att gör så mycket. [SE, wj20, 13, m]
9.1.9 Nasse sprang efter som en liten fnutknapp efter Bovarna. → Nasse sprang som en liten fnutknapp efter bovarna. [CF, haic, 11, f]
9.2 Redundant word
9.2.1 Kalle som blev jätte rädd och sprang till närmaste hus som låg 9, kilometer bort → [CF, anhe, 11, m]
9.2.2 för då kan man inte något ting bara kan gå på stan det då fattar hjärnan ingenting → inte [SE, wg03, 10, f]
9.2.3 Hon och han borde pratat med en vuxen person (läraren). Eller pratat med föräldrarna. → [SE, wg12, 10, f]

ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

9.2.4 när De kom till en övergiven by va Tor och jag var rädda → [DV, haic, 11, f]

10 Missing Constituents
10.1 Subject
10.1.1 — undra vad det brann nånstans jag måste i alla fall larma → jag [CF, erja, 9, m]
10.1.2 vidare hoppas — att vi kommer att vara kompisar rätt länge → jag [SN, wg04, 10, m]
10.1.3 Jag tror — skulle hjälpa dem är att ... → något/det [SE, wg08, 10, f]
10.1.4 I början på filmen var det massa — kollade på den andras papper på uppgiften → folk som [SE, wg14, 10, m]
10.1.5 man försöker att lära barnen att om — fuskar med t ex ett prov då... → de [SE, wg19, 10, m]
10.1.6 han kommer och klappar alla på handen utan en kille — undra hur han känner sig då? → jag [SE, wj03, 13, f]
10.1.7 När jag var ungefär 5 år och gick på dagis så skulle — åka på ett barnkalas hos en tjej med dagiset. → jag/vi [SN, wj09, 13, m]
10.1.8 När man tror att man har kompisar blir — ledsen när man bara går där ifrån om just kom dit → man [SE, wj19, 13, m]
10.1.9 När man tror att man har kompisar blir ledsen när man bara går där ifrån om — just kom dit → man [SE, wj19, 13, m]
10.1.10 Dom satte av efter Billy och Åke som suttit i ett träd men blivit nerputtad av en uggla — blev nästan nertrampad. → han [FS, mawe, 11, f]
10.2 Object or other NPs
10.2.1 Om dom bråkar som — är det inte så mycket man kan göra åt saken → de? [SE, wg03, 10, f]
10.2.2 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt —. → det [SN, wg06, 10, f]
10.2.3 Om man sätter barn som är lika bra som — på samma ställe blir det bättre för... → varandra [SE, wg18, 10, m]
10.3 Infinitive marker
10.3.1 Efter — ha sprungit igenom häckarna två gånger så vilade vi lite... → att [SN, wj03, 13, f]
10.4 (att) Verb
10.4.1 en port som va helt glittrig och — 2 guldögon och silver mun. → hade [DV, alhe, 9, f]
10.4.2 sedan skuttade han fram vidare till den öppna burken där grodan — han. Nosade förundrat på grodan → var [FS, hais, 11, f]
10.4.3 Jag tycker att det har med ens uppfostran — om man nu ger eller inte ger hon/han den saken som man tappade. → att göra [SE, wj07, 13, f]
10.4.4 ... så kom det några utlänningar och tog bollen och vi — inte tillbaka den. → fick [SN, wj13, 13, m]
10.4.5 då bar det av i 14 dagar och 14 äventyrsfyllda nätter jagade av älg — kompis med huggorm trampat på igelkott mycket hände verkligen. → , blev (?) [DV, hais, 11, f]


ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

10.5 Adverb
10.5.1 tuni hade jätte ont i knät men hon ville — sluta för det. → inte [SN, wj03, 13, f]
10.6 Preposition
10.6.1 Gunnar var på semester — Norge och åkte skidor. → i [DV, erha, 10, m]
10.6.2 dom bär massor av sken smycken massor — saker → av [DV, haic, 11, f]
10.6.3 det ena huset efter det andra gjordes — ordning → i [DV, hais, 11, f]
10.6.4 Hunden hoppade ner — ett getingbo. → i [FS, anhe, 11, m]
10.6.5 Nej det var inte grodan som bodde — hålet. → i [FS, haic, 11, f]
10.6.6 Pojken som var på väg upp — ett träd fick slänga sig på marken... → i [FS, idja, 11, f]
10.6.7 att de som kollade på den andras papper skall träna mer — sin uppgift → på [SE, wg14, 10, m]
10.6.8 ... så tänkte jag att det är — verklighet sånt händer → i verkligheten [SE, wj06, 13, f]
10.6.9 Mobbning handlar nog mycket — att man inte förstår olika människor. → om [SE, wj20, 13, m]
10.6.10 men jag blev — alla fall jätte rädd för... → i [SN, wg18, 10, m]
10.6.11 mobbing är det värsta som finns och — dom som gör det saknas det säkert någonting i huvudet. → hos [SE, wj05, 13, m]
10.7 Conjunction and subjunction
10.7.1 han gick upp på stora stenen — ropa hallå! hallå! → och [FS, haic, 11, f]
10.7.2 Simon klädde på sig — åt frukost. → och [FS, hais, 11, f]
10.7.3 Det som flickan gjorde när det var en vuxen — svarade i sin mobiltelefon som tappade en 100 lapp. → som [SE, wg14, 10, m]
10.7.4 ...till exempel — den här killen gör så igen så... → om [SE, wj03, 13, f]
10.7.5 om det är en tjej man inte alls är bra kompis med — kommer och sätter sig på bänken → som [SE, wj17, 13, f]
10.8 Other
10.8.1 Alla blev rädda för hans skrik hans hämnd kunde vara — som helst ... → hur hemsk/vad [CF, frma, 9, m]
10.8.2 dom gick ut på kullek och letade. — och på marken och i luften. → de letade? [FS, hais, 11, f]
10.8.3 De körde långt bort och till slut kom de fram till en gärdsgård och det var massor av hus —. → där [DV, alca, 11, f]
10.8.4 sen levde vi lyckliga — våra dagar → i alla [DV, hais, 11, f]
10.8.5 att jag har ett problem att jag måste hela tiden fuska på proven annars — med på matten nog alla lektioner måste jag fuska och alltid bråka för att få uppmärksamhet. → (?) [SE, wg10, 10, m]
10.8.6 den som hörde de där stygga orden vågade kanske inte spela på en konsert för att — vara rädd att bli utskrattat av avundsjuka personer. → han/hon var [SE, wg11, 10, f]
10.8.7 Om man inte kan det man ska göra och tittar på någon annan visar — någon annans resultat sen. → (?) [SE, wj05, 13, m]
10.8.8 För att förbättra det är nog — att man ska prata med en lärare eller förälder så... → det bästa? [SE, wj07, 13, f]

11 WORD CHOICE
11.1 Prepositions and particles
11.1.1 dom peka på väggen av tunneln → i [DV, alhe, 9, f]
11.1.2 Vi sprang allt vad vi orkade ner till sjön och slängde ur oss kläderna. → av [DV, idja, 11, f]
11.1.3 Jag kom ihåg allt som hänt innan jag trillat ifrån grenen. → från [CF, jowe, 9, f]
11.1.4 Han ropade ut igenom fönstret men inget kvack kom tillbaka. → genom [FS, caan, 9, m]
11.1.5 sen var det problem på klass fotot → med [SE, wg18, 10, m]
11.1.6 Jag tycker att om man har svårigheter för att skriva eller nåt annat skall man visa det... → med [SE, wj11, 13, f]
11.1.7 vi var väldigt lika på sättet alltså vi tyckte om samma saker → till [SN, wg04, 10, m]
11.1.8 Jag blev glad på Malin att hon hjälpte mig att säga det till honom för... → (?) [SN, wg06, 10, f]
11.1.9 han kommer och klappar alla på handen utan en kille → utom [SE, wj03, 13, f]
11.1.10 När vi skulle gå av satt jag och dagdrömde och så gick alla av utan jag. → utom [SN, wj09, 13, m]
11.2 Adverb
11.2.1 Jag undrar ibland vart mamma är men det är ingen som vet. → var [CF, erge, 9, f]
11.2.2 Men vart ska jag bo? → var [CF, erge, 9, f]
11.2.3 Men vart dom en letade hittade dom ingen groda. → var [FS, anhe, 11, m]
11.3 Infinitive marker
11.3.1 det var onödigt och skrika pappa → att [DV, alhe, 9, f]
11.3.2 sen gick jag in och la mig och sova → att [DV, alhe, 9, f]
11.3.3 men jag vet inte hur man ska få dom och göra det. → att [SE, wg18, 10, m]
11.3.4 ... men om man vill försöka bli kompis med några tjejer/killar och kanske försöker och gå fram ... → att [SE, wj08, 13, f]
11.3.5 ... det fick en och tänka till hur man kan hjälpa såna som är utsatta. → att [SE, wj16, 13, f]
11.4 Pronoun
11.4.1 vad skulle dom göra dess pengar tog nästan slut → deras [DV, jowe, 9, f]
11.4.2 Det är vanligt att om man har problem hemma att man lätt blir arg och det går då ut över sina kompisar. → ens [SE, wj12, 13, f]


ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

11.5 Blend
11.5.1 när dom kommer hem så märker inte föräldrarna något även fast att man luktar rök och sprit → även om/fastän [SE, wj12, 13, f]
11.5.2 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora → även om/fastän [SE, wj12, 13, f]
11.5.3 jag sprang så fort så mycket jag var värd → allt vad [DV, haic, 11, f]
11.6 Other
11.6.1 Hon satte sig på det guldigaste och mjukaste gräset i hela världen. → mest gulda [DV, angu, 9, f]
11.6.2 men se där är ni ju det lilla följet bestående av snutna djur från djuraffären. → stulna [DV, hais, 11, f]
11.6.3 Jag tittade på Virginia som torkad av sin näsa som var blodig på tröjarmen. → ärmen [DV, idja, 11, f]
11.6.4 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. → näringslära? [CF, angu, 9, f]
11.6.5 Nasse sprang efter som en liten fnutknapp efter Bovarna. → ? [CF, haic, 11, f]

12 REFERENCE
12.1 Erroneous referent
Number
12.1.1 Lena fick en kattunge...Och Alexander fick ett spjut. sen gav den sej iväg → de [DV, angu, 9, f]
12.1.2 när de gått och gått så hände något långt bort skymtade ett gult hus. vi närmade oss de sakta → det [DV, hais, 11, f]
12.1.3 Att Urban hade en fru. och en massa ungar hade det. → de [FS, alhe, 9, f]
12.1.4 Oliver försökte få av sig burken så aggressivt så han ramlade över kanten. Erik tittade efter honom med en frågande min När Oliver hade dom i baken så hopade Erik ner. → den [FS, alhe, 9, f]
Gender
12.1.5 ...vad heter din mamma? Det stod bara helt still i huvudet vad var det han hette nu igen? → hon [CF, hais, 11, f]
12.1.6 Om nu någon tappar någon som pengar... → något [SE, wj07, 13, f]
12.2 Change of referent
12.2.1 spring ut nu vi har besökare när ni kom ut ... → vi [DV, hais, 11, f]
12.2.2 Om dom som mobbar någon gång blir mobbad själv skulle han ändras helt och hållet. → dom/han (?) [SE, wj05, 13, m]

13 OTHER
13.1 Adverb
13.1.1 När jag var liten mindre ... → lite [SN, wj11, 13, f]
13.2 Strange construction
13.2.1 så Pär var läggdags → [DV, frma, 9, m]
13.2.2 god natt på er Ses i morgon i går god natt → [DV, hais, 11, f]
13.2.3 när vi rast skulle stänga affären så gömde jag mig. → [DV, hais, 11, f]


B.2 Misspelled Words

Errors are categorized by part-of-speech and then by the part-of-speech they are realized in, indicated by an arrow (e.g. 'Noun → Noun': a noun becoming another noun).

ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

1 NOUN
1.1 Noun → Noun
1.1.1 Medan Oliver hoppade efter bot. → boet [FS, alhe, 9, f]
1.1.2 Grävde sig Erik längre ner i bot → boet [FS, alhe, 9, f]
1.1.3 men upp ur bot kom ett djur upp. → boet [FS, alhe, 9, f]
1.1.4 Erik sprang i väg medan Oliver välte ner det surande bot. → boet [FS, alhe, 9, f]
1.1.5 Bina som bodde i bot rusade i mot Oliver → boet [FS, alhe, 9, f]
1.1.6 men hunden hade fastnat i buken → burken [FS, frma, 9, m]
1.1.7 att dom bot i en jätte fin dy → by [DV, alhe, 9, f]
1.1.8 det va deras dy. → by [DV, alhe, 9, f]
1.1.9 Det KaM Till EN övergiven Bi → by [DV, erja, 9, m]
1.1.10 dam bodde i en bi → by [DV, erja, 9, m]
1.1.11 pappa i har hittat än övergiven bi → by [DV, erja, 9, m]
1.1.12 de var en by en öde dy. → by [DV, frma, 9, m]
1.1.13 både pappa och jag kom då att tänka på den dyn vi va i → byn [DV, alhe, 9, f]
1.1.14 på vägen hem undrade pär hur dyn hade kommit till. → byn [DV, frma, 9, m]
1.1.15 jag sprang till boten → båten [DV, haic, 11, f]
1.1.16 sen vaknade vi i botten → båten [DV, haic, 11, f]
1.1.17 Den där scenen med dammen som tappade sedlarna → damen [SE, wg09, 10, m]
1.1.18 Renen sprang tills dom kom till en dam → damm [FS, alhe, 9, f]
1.1.19 kastad ner olof och hans hund i en dam → damm [FS, frma, 9, m]
1.1.20 En dag när han var vid damen drog han med håven i vattnet och fick upp en groda. → dammen [FS, alhe, 9, f]
1.1.21 Men damen är inte så djup. → dammen [FS, jobe, 10, m]
1.1.22 Vi kom Över Molnen Jag och Per på en flygande fris som hette Urban. → gris??? [DV, caan, 9, m]
1.1.23 pojken och huden kom i vattnet. → hunden [FS, haic, 11, f]
1.1.24 de lät precis som Fjory hennes hast → häst [DV, alco, 9, f]
1.1.25 August rosen gren har lämnat hjorden... → jorden [DV, hais, 11, f]
1.1.26 därför skulle dom andra i klasen visa hur duktiga dom var. → klassen [SE, wg02, 10, f]
1.1.27 Den brinnande makan → mackan [CF, caan, 9, m]
1.1.28 huset brann upp för att makan hade tagit eld. → mackan [CF, caan, 9, m]
1.1.29 En dag tänkte Urban göra varma makor. → mackor [CF, caan, 9, m]
1.1.30 Manen var tjock och rökte cigarr. → mannen [CF, alco, 9, f]
1.1.31 Ni har en son som ringt efter oss sa manen. → mannen [CF, idja, 11, f]
1.1.32 Den gamla manen Berättade om en by han Bot i för länge sedan → mannen [DV, angu, 9, f]
1.1.33 den här gamla manen har tagit hand om oss. → mannen [DV, angu, 9, f]
1.1.34 manen kom ut med tre skålar härlig soppa. → mannen [DV, angu, 9, f]
1.1.35 men så en dag kom en man som hette svarta manen → mannen [DV, angu, 9, f]
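Each row in these tables carries the same six fields (ERROR, CORRECTION, CORP, SUBJ, AGE, SEX). As a side illustration — not part of FiniteCheck itself — the sketch below shows one possible machine-readable representation of such records and a simple per-corpus tally; the two sample rows are taken verbatim from the table above, while the class and field names are the author's own labels for the columns.

```python
# Illustrative sketch: representing appendix rows as records.
# Field names mirror the table columns ERROR / CORRECTION / CORP / SUBJ / AGE / SEX.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ErrorRecord:
    number: str      # item number, e.g. "1.1.1"
    error: str       # the sentence as the child wrote it
    correction: str  # the intended word(s)
    corp: str        # sub-corpus code (DV, FS, CF, SN, SE, ...)
    subj: str        # anonymised writer code
    age: int
    sex: str         # "m" or "f"

records = [
    ErrorRecord("1.1.1", "Medan Oliver hoppade efter bot.", "boet", "FS", "alhe", 9, "f"),
    ErrorRecord("1.1.27", "Den brinnande makan", "mackan", "CF", "caan", 9, "m"),
]

# Tally errors per sub-corpus:
by_corpus = Counter(r.corp for r in records)
print(dict(by_corpus))  # → {'FS': 1, 'CF': 1}
```

The same structure extends directly to the full appendix, e.g. for computing the error-type distributions by age or sub-corpus discussed in the body of the thesis.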


ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

1.1.36 för manen hade många djur → mannen [DV, angu, 9, f]
1.1.37 Det var nog den här byn manen talade om → mannen [DV, angu, 9, f]
1.1.38 det var svarta manen. → mannen [DV, angu, 9, f]
1.1.39 Lena gick fram till svarta manen → mannen [DV, angu, 9, f]
1.1.40 svarta manen blev rädd → mannen [DV, angu, 9, f]
1.1.41 svarta manen sprang sin väg → mannen [DV, angu, 9, f]
1.1.42 det log maser av saker runtomkring → massor [DV, haic, 11, f]
1.1.43 dom bär maser av sken smycken → massor [DV, haic, 11, f]
1.1.44 ... men plötsligt tog matten slut. → maten [DV, erge, 9, f]
1.1.45 alla menen och Pappa gick in i ett av huset → männen [DV, haic, 11, f]
1.1.46 pojken skrek ett tupp! → stup [FS, haic, 11, f]
1.1.47 ja tak → tack [DV, haic, 11, f]
1.1.48 just då ringde telefånen och pappa svarade: → telefonen [CF, erge, 9, f]
1.1.49 Fram ur vasen kom det något → vassen [FS, idja, 11, f]
1.1.50 Sen gick jag ut, och fram för mig stod värdens finaste häst. → världens [CF, alhe, 9, f]
1.1.51 dom som borde på örn kanske försökte koma på skepp → ön [DV, haic, 11, f]

1.2 Noun → Adjective
1.2.1 man kunde rida fyra i bred → bredd [DV, idja, 11, f]
1.2.2 kale som blev jätte rädd... → Kalle [CF, anhe, 11, m]
1.2.3 ... och där fans ett tempel fult med matt. → mat [DV, erge, 9, f]

1.3 Noun → Pronoun
1.3.1 Men det han höll i var ett par hon som i sin tur satt fast i en hjort. → horn [FS, anhe, 11, m]

1.4 Noun → Numeral
1.4.1 olof som klättrade i ett tre → träd [FS, frma, 9, m]

1.5 Noun → Verb
1.5.1 pappa gick och knacka på en dör → dörr [DV, alhe, 9, f]
1.5.2 till och knacka på en dör → dörr [DV, alhe, 9, f]
1.5.3 Lena var en flika som var 8 år. → flicka [DV, angu, 9, f]
1.5.4 Han letade i ett hål medans hunden skällde på masa bin. → massa [FS, erge, 9, f]
1.5.5 När han landa så svepte masa bin över honom. → massa [FS, erge, 9, f]
1.5.6 hunden hade hittat masa getingar → massa [FS, haic, 11, f]
1.5.7 där va en masa människor → massa [DV, alhe, 9, f]
1.5.8 Jag tycker att om man inte gillar en viss person ska man inte visa det på ett så taskigt sett. → sätt [SE, wg17, 10, f]

1.6 Noun → Preposition
1.6.1 Då fick muffins syn på en massa in och började jaga dom. → bin [FS, jowe, 9, f]
1.6.2 dam flyttade naturligtvis till den övergivna b in → byn [DV, erja, 9, m]

1.7 Noun → More than one category
1.7.1 Jag hade en jacka på mig som det var ett litet håll i... → hål [SN, wg16, 10, f]
1.7.2 Hur ska men kunna göra för att förbättra dessa problem? → man [SE, wj03, 13, f]

ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

1.7.3 ...och vad men kan göra för att förbättra dom. → man [SE, wj03, 13, f]
1.7.4 Att utfrysa en kompis eller någon annan kan vara det värsta men någonsin kan göra tycker jag. → man [SE, wj03, 13, f]
1.7.5 Precis då kom pappa och hans men. → män [DV, haic, 11, f]

2 ADJECTIVE
2.1 Adjective → Adjective
2.1.1 Pappa du har glömt att tända brasan och det är kalt. → kallt [CF, erge, 9, f]
2.1.2 det är den plikt att få ås att bli dryga → trygga [CF, frma, 9, m]

2.2 Adjective → Noun
2.2.1 när hon var som best → bäst [CF, hais, 11, f]
2.2.2 ... men inte en ända människa syntes till. → enda [CF, idja, 11, f]
2.2.3 det här brevet är det ända jag kan ge dig idag → enda [DV, jowe, 9, f]
2.2.4 Det är nog inte ett ända barn som inte har något problem även fast att man inte har så stora → enda [SE, wj12, 13, f]
2.2.5 Det ända jag vet om grov mobbing är det jag har sett på tv! → enda [SE, wj13, 13, m]
2.2.6 ... för det var det ända sättet att komma upp till en koja → enda [SN, wg19, 10, m]
2.2.7 kalle som blev jätte räd → rädd [CF, anhe, 11, m]
2.2.8 han blev så räd → rädd [FS, frma, 9, m]
2.2.9 han var lite räd för kråkan → rädd [FS, frma, 9, m]
2.2.10 jag blev alla fall jätte räd → rädd [SN, wg18, 10, m]
2.2.11 alla var reda → rädda [CF, frma, 9, m]
2.2.12 man behöver inte vara tycken bara för man inte vill vara med han. → tyken [SE, wj14, 13, m]

2.3 Adjective → Verb
2.3.1 Och kanske var det ett barn till hans föra groda. → förra [FS, idja, 11, f]
2.3.2 ... och spökena blev skända... → kända [DV, erge, 9, f]
2.3.3 jag tror man ska ta ett lett prov först men... → lätt [SE, wg03, 10, f]
2.3.4 pojken blev red → rädd [FS, erja, 9, m]

3 PRONOUN
3.1 Pronoun → Pronoun
3.1.1 fortsatte det att ringa i alle fall → alla [DV, idja, 11, f]
3.1.2 och en massa ungar hade det. → de [FS, alhe, 9, f]
3.1.3 Han sa till hunden att vara tyst för att det skull titta efter. → de [FS, caan, 9, m]
3.1.4 Det kom till en övergiven by → de [DV, alco, 9, f]
3.1.5 Det KaM Till EN övergiven Bi → de [DV, erja, 9, m]
3.1.6 när det kam hem sade pappa... → de [DV, erja, 9, m]
3.1.7 när det hade kommit en liten bit sa pappa... → de [DV, erja, 9, m]
3.1.8 då hörde det att det bubblade... → de [DV, erja, 9, m]
3.1.9 Det kom till en, plats som de aldrig hade varit, på. → de [DV, frma, 9, m]
3.1.10 Det kom till en övergiven by → de [DV, jobe, 10, m]
3.1.11 jag förstår inte vad fröken menar med grammatik näringsväv och allt de andra. → det [CF, angu, 9, f]
3.1.12 Och sen den dagen de brann i Kamillas lägenhet leker vi alltid brandmän. → det [CF, angu, 9, f]
3.1.13 de börjar att skymma → det [CF, frma, 9, m]
3.1.14 De var han och han hade hittat en partner. → det [FS, caan, 9, m]
3.1.15 ... men de kom ingen groda den här gången heller → det [FS, frma, 9, m]
3.1.16 De va en pojke som hette olof → det [FS, frma, 9, m]
3.1.17 de va en älg → det [FS, frma, 9, m]
3.1.18 mormor berättade att de fanns en by bortom solens rike → det [DV, alco, 9, f]
3.1.19 där de fanns små röda hus med vita knutar → det [DV, alco, 9, f]
3.1.20 ja men nu är de läggdags sa mormor. → det [DV, alco, 9, f]
3.1.21 Anna funderade halva natten över de där med morfar → det [DV, alco, 9, f]
3.1.22 de lät precis som Fjory hennes häst → det [DV, alco, 9, f]
3.1.23 de såg faktiskt ut som en övergiven by → det [DV, alco, 9, f]
3.1.24 de var bara ett fönster som lyste → det [DV, alco, 9, f]
3.1.25 De var en kväll som Lisa jag alltså ville höra en saga... → det [DV, erge, 9, f]
3.1.26 och dom lovat att bygga upp staden och de blev hotell → det [DV, erge, 9, f]
3.1.27 de var en by en öde by. → det [DV, frma, 9, m]
3.1.28 de var tid för familjen att gå hem. → det [DV, frma, 9, m]
3.1.29 Det var dåligt väder de blåste och regnade. → det [DV, hais, 11, f]
3.1.30 de blåste mer och mer → det [DV, idja, 11, f]
3.1.31 Men de var fullt med buskar utanför → det [DV, idja, 11, f]
3.1.32 Dom gick in genom dörren och blev förvånade av de dom såg. → det [DV, mawe, 11, f]
3.1.33 de kunde berott på att dom gillade samma tjej. → det [SE, wg07, 10, f]
3.1.34 När jag får se en son här film tänker jag på att de nog är så i de flesta skolorna → det [SE, wg20, 10, m]
3.1.35 ... för de är nog något typiskt med de → det [SE, wg20, 10, m]
3.1.36 ... för de är nog något typiskt med de → det [SE, wg20, 10, m]
3.1.37 de får man nog för man får så mycket att göra när man blir större → det [SE, wg20, 10, m]
3.1.38 Den är ju inte heller säkert att den kompisen man kollar på har rätt → det [SE, wj17, 13, f]
3.1.39 De var bara ungdomar inga vuxna. → det [SE, wj18, 13, m]
3.1.40 De hela började med att jag och min morfar skulle cykla ner till sjön för... → det [SN, wg10, 10, m]
3.1.41 de verkade lugnt. → det [SN, wg11, 10, f]
3.1.42 de va en vanlig måndag → det [SN, wg20, 10, m]
3.1.43 ... efter som de fanns en hel del snälla kompisar i min klass så hjälpte dom mig... → det [SN, wg20, 10, m]
3.1.44 När jag kom på fötter igen så hade de kommit cirka tolv stycken i min klass och hjälpte mig → det [SN, wj10, 13, m]
3.1.45 det är den plikt att få ås att bli dryga → din [CF, frma, 9, m]
3.1.46 Dem kom med en stegbil och hämtade oss. → dom [CF, jobe, 10, m]
3.1.47 Nästa dag gick dem upp till en grotta → dom [DV, angu, 9, f]
3.1.48 där fick dem var sin korg med saker i → dom [DV, angu, 9, f]
3.1.49 Dem hade ett privatplan → dom [DV, jobe, 10, m]
3.1.50 nu slår dem upp tältet för att vila... → dom [DV, jobe, 10, m]
3.1.51 nästa morgon går dem långt långt → dom [DV, jobe, 10, m]
3.1.52 men till slut kom dem till en övergiven by. → dom [DV, jobe, 10, m]
3.1.53 där stannade dem och bodde där resten av livet → dom [DV, jobe, 10, m]

ERROR → CORRECTION [CORP, SUBJ, AGE, SEX]

3.1.54 dem kanske bodde i ett hus som dem fick hyra → dom [SE, wg01, 10, f]
3.1.55 dem kanske bodde i ett hus som dem fick hyra → dom [SE, wg01, 10, f]
3.1.56 ... dem måste få höga betyg annars får de skäll av sina föräldrar. → dom [SE, wg01, 10, f]
3.1.57 Dem andra människorna som kollade på sina kompisars provpapper, → dom [SE, wg01, 10, f]
3.1.58 ... när dem började bråka, → dom [SN, wg01, 10, f]
3.1.59 dem kunde väl hjälpa varandra. → dom [SN, wg01, 10, f]
3.1.60 Men dem fortsatte. → dom [SN, wg01, 10, f]
3.1.61 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, → dom [SN, wg07, 10, f]

3.2 Pronoun → Noun
3.2.1 ... för att du är ju alt jag har. → allt [CF, erge, 9, f]
3.2.2 ... och alt var en dröm. → allt [DV, caan, 9, m]
3.2.3 Någon anan la mig på en bår... → annan [CF, erge, 9, f]
3.2.4 och gick till en anan tunnel → annan [DV, alhe, 9, f]
3.2.5 Det finns nog en anan väg... → annan [DV, idja, 11, f]
3.2.6 så jag fik åka med en anan som skulle också hänga med → annan [SN, wg20, 10, m]
3.2.7 var är set... → det [DV, hais, 11, f]
3.2.8 var är set här → det [DV, hais, 11, f]
3.2.9 snabbt springer dam ut ur brand bilarna → dom [CF, erja, 9, m]
3.2.10 snabbt tar dam fram stegen → dom [CF, erja, 9, m]
3.2.11 dam ramlar rakt ner i en damm → dom [FS, erja, 9, m]
3.2.12 då är dam ännu närmare ljudet → dom [FS, erja, 9, m]
3.2.13 dam bodde i en by → dom [DV, erja, 9, m]
3.2.14 dam tåg och så med sig sina två tigrar → dom [DV, erja, 9, m]
3.2.15 när dam hade kommit än bit in i skogen → dom [DV, erja, 9, m]
3.2.16 å dam två tigrarna följde också med → dom [DV, erja, 9, m]
3.2.17 dam red bod → dom [DV, erja, 9, m]
3.2.18 när dam kam hem → dom [DV, erja, 9, m]
3.2.19 dam flyttade naturligtvis till den övergivna in → dom [DV, erja, 9, m]
3.2.20 där levde dam lyckliga → dom [DV, erja, 9, m]
3.2.21 tillslut blev dam två kamelerna så trötta... → dom [DV, erja, 9, m]
3.2.22 när dam kam hem var kl. 12 → dom [DV, frma, 9, m]
3.2.23 hon fråga va det var för not → nåt [CF, alhe, 9, f]
3.2.24 och efter som det inte fans not lock på burken → nåt [FS, alhe, 9, f]
3.2.25 han har fot syn på not → nåt [FS, frma, 9, m]
3.2.26 om det skulle hända not → nåt [DV, alhe, 9, f]
3.2.27 om man såg en älg eller räv och not anat stort djur → nåt [DV, alhe, 9, f]
3.2.28 en poäng alltid not → nåt [DV, alhe, 9, f]
3.2.29 ni får gärna bo hos oss under tid en ni inte har not att bo i. → nåt [DV, idja, 11, f]
3.2.30 det är den plikt att få ås att bli dryga → oss [CF, frma, 9, m]
3.2.31 och la os på varsin sida av den spikiga toppen → oss [DV, alhe, 9, f]
3.2.32 och utrusta os → oss [DV, alhe, 9, f]
3.2.33 sa Desere med en son skarp röst hon alltid använde. → sån [DV, hais, 11, f]
3.2.34 gick vi upp till utgången av tältet men upptäckte varan och vi blev så rädda → varann [DV, alhe, 9, f]
3.2.35 Visa i filmen gillade inte varan → varann [SE, wg06, 10, f]
3.2.36 det första problemet är att dom kollar på varan → varann [SE, wg18, 10, m]
3.2.37 för då tittar man inte på varan. → varann [SE, wg18, 10, m]
3.2.38 Men jag fortsatte kämpa för dem två skulle kunna se på varan utan att vända bort huvudet, → varann [SN, wg07, 10, f]

3.3 Pronoun → Verb
3.3.1 om man såg en älg eller räv och not anat stort djur → annat [DV, alhe, 9, f]
3.3.2 Vi såg ormar spindlar krokodiler ödlor och anat. → annat [DV, caan, 9, m]
3.3.3 hanns groda var försvunnen. → hans [FS, alhe, 9, f]
3.3.4 hanns mamma hade slängt ut den. → hans [FS, alhe, 9, f]
3.3.5 som nu satt på hanns huvud. → hans [FS, alhe, 9, f]
3.3.6 för att hanns kruka hade gått sönder → hans [FS, alhe, 9, f]
3.3.7 kastad ner olof och hanns hund i en dam → hans [FS, frma, 9, m]
3.3.8 jag fick låna hanns mobiltelefon. → hans [SN, wg14, 10, m]
3.3.9 han frågade honom nått → nåt [DV, haic, 11, f]
3.3.10 ... den killen eller tjejen måste ha nått problem eller... → nåt [SE, wj08, 13, f]
3.3.11 om det kommer nån ny till klassen eller nått → nåt [SE, wj08, 13, f]
3.3.12 ...så hon hamnade inne i skogen på nått konstigt sätt... → nåt [SN, wj08, 13, f]
3.3.13 När det var två flickor som satt på en bänk så kom det en annan flicka som satte säg bredvid → sig [SE, wg14, 10, m]
3.3.14 Det var också väldigt roligt för att man kände säg inte ensam om det. → sig [SN, wj11, 13, f]
3.3.15 man får nog mer sona problem när man kommer högre upp i skolan → såna [SE, wg20, 10, m]

3.4 Pronoun → Preposition
3.4.1 vi bar allt till mamma hos sa... → hon [DV, haic, 11, f]
3.4.2 sen när in kompis skulle hoppa så... → min [SN, wj08, 13, f]

3.5 Pronoun → Interjection
3.5.1 va fiffigt tänkte ja → jag [DV, alhe, 9, f]
3.5.2 då börja alla i hela tunneln förutom pappa och ja gråta → jag [DV, alhe, 9, f]
3.5.3 vilken fin klänning ja har → jag [DV, angu, 9, f]
3.5.4 Madde vaknade av mitt skrik, hon fråga va det var för nåt. → vad [CF, alhe, 9, f]

3.6 Pronoun → More than one category
3.6.1 Det var än gång än man som hette Gustav → en [CF, erja, 9, m]
3.6.2 Det var än gång än man som hette Gustav → en [CF, erja, 9, m]
3.6.3 än dag när Gustav var på jobbet ringde det → en [CF, erja, 9, m]
3.6.4 han trycker på än knapp → en [CF, erja, 9, m]
3.6.5 Gustav sitter i än av brand bilarna → en [CF, erja, 9, m]
3.6.6 där e än → en [CF, erja, 9, m]
3.6.7 där uppe på än balkong står det ett barn → en [CF, erja, 9, m]
3.6.8 han hade än groda → en [FS, erja, 9, m]
3.6.9 män än natt klev grodan upp ur glas burken → en [FS, erja, 9, m]
3.6.10 det var än gång två pojkar → en [DV, erja, 9, m]
3.6.11 dam bodde i än bi. → en [DV, erja, 9, m]

Error Corpora

3.6.12 pappa vi har hittat än övergiven bi. → en (DV, erja, 9, m)
3.6.13 än dag sa Niklas ska vi rida ut → en (DV, erja, 9, m)
3.6.14 när dam hade kommit än bit in i skogen → en (DV, erja, 9, m)
3.6.15 än liten bit in i skogen såg dom än övergiven by → en (DV, erja, 9, m)
3.6.16 än liten bit in i skogen såg dom än övergiven by → en (DV, erja, 9, m)
3.6.17 Man ska vara en bra kompis, när någon vill vara än själv. → en (SE, wg05, 10, m)
3.6.18 jag satt ner men packning → min (DV, haic, 11, f)
3.6.19 Men var nu då? dörren går inte upp. → vad (CF, idja, 11, f)
3.6.20 När simon kom ut och såg var som hade hänt... → vad (FS, hais, 11, f)
3.6.21 Hans hund Taxi var nyfiken på var det var för något i burken. → vad (FS, idja, 11, f)
3.6.22 Men var är det för ljud? → vad (FS, idja, 11, f)
3.6.23 var fan gör du → vad (SE, wg07, 10, f)
3.6.24 Själv tycker jag att killarnas metoder är mer öppen och ärlig men också mer elak än var tjejernas metoder är. → vad (SE, wj13, 13, m)
3.6.25 Hjälp det brinner vad nånstans → var (CF, erja, 9, m)
3.6.26 undra vad det brann nånstans jag måste i alla fall larma → var (CF, erja, 9, m)
3.6.27 Jag visste inte att brandbilen vad på väg förbi min egen by. → var (CF, jowe, 9, f)
3.6.28 Lena sa vad är vi hon såg sig omkring → var (DV, angu, 9, f)
3.6.29 Visa i filmen gillade inte varan → Vissa (SE, wg06, 10, f)
3.6.30 dom bråkade och lämnade visa utanför. → vissa (SE, wg06, 10, f)

4 VERB

4.1 Verb → Verb
4.1.1 Upp ur hålet kom en grävling och bett pojken i näsan → bet (FS, angu, 9, f)
4.1.2 dom som borde på örn kanske försökte koma på skepp → bodde (DV, haic, 11, f)
4.1.3 När Oliver hade dom i baken så hopade Erik ner. → hoppade (FS, alhe, 9, f)
4.1.4 Och pojken hopade efter hunden. → hoppade (FS, anhe, 11, m)
4.1.5 Vi hopade upp på hästarna... → hoppade (DV, idja, 11, f)
4.1.6 ...för att hälla henne sällskap. → hålla (SN, wj12, 13, f)
4.1.7 först försökte hon att lufta mig... → lyfta (SN, wg16, 10, f)
4.1.8 det log maser av saker runtomkring → låg (DV, haic, 11, f)
4.1.9 han behöver inte lossas om som ingenting har hänt, → låtsas (SE, wj14, 13, m)
4.1.10 brand männen rykte ut och släkte elden → ryckte (CF, anhe, 11, m)
4.1.11 hunden sa på pojkens huvet. → satt (FS, haic, 11, f)
4.1.12 då surade bina rakt över pojken → surrade (FS, erja, 9, m)
4.1.13 sett dig hon gjorde som mannen sa → sätt (DV, alco, 9, f)

4.2 Verb → Noun
4.2.1 Och problemet kanske bror på att kompisarna inte tyckte om den personen → beror (SE, wg12, 10, f)
4.2.2 Den gamla manen Berättade om en by han Bot i för länge sedan → bott (DV, angu, 9, f)

4.2.3 Men konstigt nog ville jag se den hästen fastän den inte fans. → fanns (CF, alhe, 9, f)
4.2.4 Det fans en doktor som pratade vänligt med mig, → fanns (CF, erge, 9, f)
4.2.5 och efter som det inte fans not lock på burken → fanns (FS, alhe, 9, f)
4.2.6 Men i hålet fans bara... → fanns (FS, erge, 9, f)
4.2.7 mormor berättade att de fans en by bortom solens rike → fanns (DV, alco, 9, f)
4.2.8 därde fans små röda hus med vita knutar där Annas morfar hade bott → fanns (DV, alco, 9, f)
4.2.9 ... och där fans ett tempel fult med matt. → fanns (DV, erge, 9, f)
4.2.10 men efter som de fans en hel del snälla kompisar i min klass → fanns (SN, wg20, 10, m)
4.2.11 när jag kom ut ur huset sa Kamilla att jag fik hunden... → fick (CF, angu, 9, f)
4.2.12 Så fik pojken ett grodbarn → fick (FS, caan, 9, m)
4.2.13 Och vad fik dom se? → fick (FS, erge, 9, f)
4.2.14 men med lite tjat fik jag → fick (DV, alhe, 9, f)
4.2.15 och för varje djur fik man 1 eller 3 poäng → fick (DV, alhe, 9, f)
4.2.16 fik man tio poäng → fick (DV, alhe, 9, f)
4.2.17 först fik jag panik → fick (DV, alhe, 9, f)
4.2.18 hon hoppade till när hon fik syn på oss → fick (DV, hais, 11, f)
4.2.19 Men de var fult med buskar utan för som vi fik rid igenom. → fick (DV, idja, 11, f)
4.2.20 så jag fik åka med en anan som skulle också hänga med → fick (SN, wg20, 10, m)
4.2.21 han har fot syn på not → fått (FS, frma, 9, m)
4.2.22 ... som dom hade fot tillsammans. → fått (FS, haic, 11, f)
4.2.23 På morgonen vaknade vi och kläde på oss → klädde (CF, alhe, 9, f)
4.2.24 Madde sprang upp till sitt rum och kläde på sig → klädde (CF, alhe, 9, f)
4.2.25 Han kläde på sig → klädde (FS, haic, 11, f)
4.2.26 Det Kam Till EN övergiven Bi → kom (DV, erja, 9, m)
4.2.27 när det kam hem sade pappa... → kom (DV, erja, 9, m)
4.2.28 när Niklas och Bennys halva kam fram till en damm → kom (DV, erja, 9, m)
4.2.29 upp ur dammen kam två krokodiler → kom (DV, erja, 9, m)
4.2.30 när dam kam hem → kom (DV, erja, 9, m)
4.2.31 när dam kam hem var kl. 12 → kom (DV, frma, 9, m)
4.2.32 då ko min bror → kom (SN, wg18, 10, m)
4.2.33 När jag kom ut såg jag en liten eld låga koma ut genom fönstret, → komma (CF, alhe, 9, f)
4.2.34 det tog en timme att koma ditt → komma (CF, anhe, 11, m)
4.2.35 Pojken som var på väg upp ett träd fick slänga sig på marken för att inte koma i vägen för bin. → komma (FS, idja, 11, f)
4.2.36 dom som borde på örn kanske försökte koma på skepp → komma (DV, haic, 11, f)
4.2.37 hans hämnd kund vara som helst → kunde (CF, frma, 9, m)
4.2.38 på vägen till pappa möte jag en katt → mötte (DV, alhe, 9, f)
4.2.39 Jag gick in och sate mig vid bordet och åt. → satte (CF, alhe, 9, f)
4.2.40 Han sate sig upp och lyssnade → satte (FS, alhe, 9, f)
4.2.41 Hon sate sej på det guldigaste och mjukaste gräset i hela världen. → satte (DV, angu, 9, f)

4.2.42 Redan nästa dag sate vi igång med reparationen av byn. → satte (DV, idja, 11, f)
4.2.43 Då såg jag nåt som jag aldrig har set → sett (DV, caan, 9, m)
4.2.44 Jag tycker att hon skal prata med dom. → skall (SE, wg02, 10, f)
4.2.45 brandmännen släkte elden → släckte (CF, frma, 9, m)
4.2.46 där nere i det höga gräset låg dalmatinen tess, grisen kalle-knorr... och sav → sov (DV, hais, 11, f)
4.2.47 Ring till Börje sej att vi låst oss ute. → säg (CF, idja, 11, f)
4.2.48 dam tåg och så med sig sina två tigrar → tog (DV, erja, 9, m)
4.2.49 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
4.2.50 typ när man pratar om grejer som inte man villa att alla ska höra! → vill (SE, wj17, 13, f)
4.2.51 ... att Mia inte viste om att mamma var en strandskata. → visste (CF, hais, 11, f)
4.2.52 Och utan att pojken viste om det hoppa grodan ur burken när han låg. → visste (FS, caan, 9, m)
4.2.53 jag viste att han skulle bli lite ledsen då efter som vi hade bestämt. → visste (SN, wg06, 10, f)
4.2.54 då viste jag inte vad jag skulle göra → visste (SN, wg20, 10, m)
4.2.55 hon kan ju inte skylla på att hon inte märker nåt för det ärr alltid tydligt. → är (SE, wj13, 13, m)

4.3 Verb → Pronoun
4.3.1 mer han jag inte tänka... → hann (DV, idja, 11, f)

4.4 Verb → Adjective
4.4.1 å älgen bara gode → glodde? (FS, frma, 9, m)
4.4.2 Niklas och Benny kunde inte hala emot → hålla (DV, erja, 9, m)
4.4.3 han höll sig i och road → ropade? (FS, frma, 9, m)
4.4.4 Jag såg på ett TV program där en metod mot mobbing var att satta mobbarn på den stol och andra människor runt den personen och då fråga varför. → sätta (SE, wj16, 13, f)
4.4.5 Hade Erik vekt en uggla → väckt (FS, alhe, 9, f)

4.5 Verb → Interjection
4.5.1 jag blev jätte besviken för jag trodde att klockan va sådär 7. → var (CF, alhe, 9, f)
4.5.2 men jag va visst jätte ledsen så jag gick ut. → var (CF, alhe, 9, f)
4.5.3 Vi kom tillbaks vid 6 tiden, och då va vi jätte trötta och hungriga. → var (CF, alhe, 9, f)
4.5.4 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.5 Klockan va ungefär 12 när jag vaknade, och va får jag se om inte hästen. → var (CF, alhe, 9, f)
4.5.6 jag sa att det inte va nåt så somna vi om. → var (CF, alhe, 9, f)
4.5.7 alla va överens → var (CF, frma, 9, m)
4.5.8 De va en pojke som hette olof → var (FS, frma, 9, m)
4.5.9 de va en älg → var (FS, frma, 9, m)
4.5.10 Nu va det bara att hoppa ut från fönstret. → var (FS, haic, 11, f)
4.5.11 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
4.5.12 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)

4.5.13 Det börjar med att pappa och jag va ute och cyklade på landet... → var (DV, alhe, 9, f)
4.5.14 ... att vi inte va på toppen av berget utan i en by → var (DV, alhe, 9, f)
4.5.15 han va för tung → var (DV, alhe, 9, f)
4.5.16 vi va i en jätte liten och fin by → var (DV, alhe, 9, f)
4.5.17 nej det va en blåmes → var (DV, alhe, 9, f)
4.5.18 Sen sa pappa att vi va tvungna att leta. → var (DV, alhe, 9, f)
4.5.19 om dom va öppna → var (DV, alhe, 9, f)
4.5.20 När jag kom dit va redan pappa där → var (DV, alhe, 9, f)
4.5.21 en port som va helt glittrig → var (DV, alhe, 9, f)
4.5.22 en katt som va svart och len → var (DV, alhe, 9, f)
4.5.23 en platta som nästan va omringad av lava → var (DV, alhe, 9, f)
4.5.24 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.25 där va en massa människor som va fastkedjade med tjocka kedjor → var (DV, alhe, 9, f)
4.5.26 den äldsta som va 80 år berätta att... → var (DV, alhe, 9, f)
4.5.27 den byn vi va i → var (DV, alhe, 9, f)
4.5.28 det va deras by → var (DV, alhe, 9, f)
4.5.29 det va den hemske fula trollkarlen tokig → var (DV, alhe, 9, f)
4.5.30 som tur va gick hästarna i hagen. → var (DV, idja, 11, f)
4.5.31 ... då vill ju han vara med den kompisen som han va med innan. → var (SE, wg12, 10, f)
4.5.32 ... men eftersom det inte va så mycket mobbing så... → var (SE, wj13, 13, m)
4.5.33 Det var i somras när jag, min syster och två andra kompisar va på vårat vanliga ställe... → var (SN, wj06, 13, f)
4.5.34 Vi va kanske inte så bra på det utan vi ramlade ganska ofta. → var (SN, wj07, 13, f)
4.5.35 det kunde ju va att en sjusovare bor där inne → vara (DV, alhe, 9, f)
4.5.36 ... utan det kan även vara att nån kan sparka eller att man få vara enstöring och sitta själv hela tiden eller kanske spotta eller bara kanske va taskiga mot den personen → vara (SE, wj08, 13, f)
4.5.37 ... att försöka va tuff hela tiden (eller?) → vara (SE, wj08, 13, f)
4.5.38 det kan ju va att den som blir mobbad inte uppför sig på rätt sätt, → vara (SE, wj13, 13, m)
4.5.39 dom vill inte va kompis med hon/han. → vara (SE, wj19, 13, m)
4.5.40 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → vara (SE, wj19, 13, m)
4.5.41 Och om kompisarna tycker att man är ful och inte vill va med en som är ful så... → vara (SE, wj19, 13, m)
4.5.42 Marianne sa fort farande hur jag kunde va med henne → vara (SN, wg07, 10, f)

4.6 Verb → More than one category
4.6.1 så kommer det att vara svårare att skaffa jobb om dom inte har gott i skolan → gått (SE, wg03, 10, f)
4.6.2 han fick hetta Hubert. → heta (FS, haic, 11, f)
4.6.3 Men pojken är inte så glad för nu måste han hetta en ny glasburk. → hitta (FS, haic, 11, f)
4.6.4 Men sen så dom att det var små grodor. → såg (FS, idja, 11, f)
4.6.5 ...vi hade precis gått förbi skolan när vi så ett gäng på ca tio personer komma emot oss. → såg (SN, wj15, 13, m)

4.6.6 Hela majs fältet vad svart → var (CF, jowe, 9, f)
4.6.7 Oliver bodde i en liten stuga en liten bit i från skogen och vad väldigt intresserad av djur. → var (FS, jowe, 9, f)
4.6.8 Hans älsklings färg vad grön → var (FS, jowe, 9, f)
4.6.9 För han vad mycket trött. → var (FS, jowe, 9, f)
4.6.10 till slut vad han uppe på stocken med stort besvär. → var (FS, jowe, 9, f)
4.6.11 när jag senare vad klar kom grannen och skrek... → var (DV, jowe, 9, f)
4.6.12 För att komma till Strömstad vad de tvungna att åka från Göteborg... och sedan Strömstad. → var (DV, klma, 10, f)
4.6.13 Det var en ganska dålig lärare som inte märkte hans fusklapp han hade i pennfacket eller vad det vad. → var (SE, wj07, 13, f)

5 PARTICIPLE

5.1 Participle → Participle
5.1.1 Erik sprang i väg medan Oliver välte ner det surande bot. → surrande (FS, alhe, 9, f)

6 ADVERB

6.1 Adverb → Noun
6.1.1 snabbt hoppa dom på kamelerna och rusa iväg och red bod till pappa → bort (DV, erja, 9, m)
6.1.2 dam red bod → bort (DV, erja, 9, m)
6.1.3 ingen sov got den natten → gott (CF, frma, 9, m)
6.1.4 Oliver hjälpte till så got han kunde. → gott (FS, alhe, 9, f)
6.1.5 att säga ifrån och förklara ur den utsatta skall uppföra sig. → hur (SE, wj13, 13, m)
6.1.6 När de gick ifrån tjejen som kom så var det väll för att hon inte hjälpte dem med provet → väl (SE, wg08, 10, f)
6.1.7 ...men sen måste dom väll få skuld känslor. → väl (SE, wj04, 13, m)
6.1.8 så kan man väll fortfarande vara kompis med han hon. → väl (SE, wj07, 13, f)
6.1.9 det gick väll ganska bra. → väl (SN, wj08, 13, f)
6.1.10 jag får väll ta av min snowboard. → väl (SN, wj08, 13, f)

6.2 Adverb → Adjective
6.2.1 ... och där fans ett tempel fult med matt. → fullt (DV, erge, 9, f)
6.2.2 Men de var fult med buskar utan för som vi fik rid igenom. → fullt (DV, idja, 11, f)
6.2.3 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → inte (SE, wj12, 13, f)
6.2.4 ny vänta nu kommer hon → nu (DV, hais, 11, f)
6.2.5 ny öppna inte garderoben → nu (DV, hais, 11, f)
6.2.6 Det var rät blåsigt. → rätt (CF, idja, 11, f)
6.2.7 ... men jag va vist jätte ledsen såjag gick ut. → visst (CF, alhe, 9, f)
6.2.8 det började vist brinna → visst (CF, jobe, 10, m)
6.2.9 dom hade vist ungar och där var hans groda. → visst (FS, erge, 9, f)
6.2.10 då får vi Natta över i byn vist. → visst (DV, haic, 11, f)
6.2.11 Och så landade du vist i en möglig ko skit också. → visst (DV, idja, 11, f)

6.3 Adverb → Pronoun
6.3.1 det tog en timme att koma ditt → dit (CF, anhe, 11, m)
6.3.2 Men vart dom en letade hittade dom ingen groda. → än (FS, anhe, 11, m)
6.3.3 men hur han en lockade så kom den inte. → än (FS, erge, 9, f)
6.3.4 Det beror på att den andra har jobbat bättre en den andra den som kollade på honom. → än (SE, wg03, 10, f)
6.3.5 men det kan ju vara andra saker en bara skolan? → än (SE, wg03, 10, f)
6.3.6 men det är ju mycket coolare att säga nej tack jag röker inte en att säga ja jag är väl inre feg. → än (SE, wj12, 13, f)

6.4 Adverb → Verb
6.4.1 förts att vi inte sögs med tromben → först (DV, idja, 11, f)
6.4.2 som jag förts trodde → först (SN, wj16, 13, f)
6.4.3 så har gick det till: → här (DV, hais, 11, f)
6.4.4 är ett sånt problem uppstår försöker man klart hjälpa till. → När (SE, wg07, 10, f)

6.5 Adverb → Interjection
6.5.1 ... att vi åkt ner från berget och åkt så långt att vi inte viste va vi va. → var (DV, alhe, 9, f)
6.5.2 pappa och jag undra va nycklarna va → var (DV, alhe, 9, f)
6.5.3 sen undra han va dom bodde → var (DV, alhe, 9, f)

6.6 Adverb → More than one category
6.6.1 Hunden hade skäll t så mycket att geting boet hade ramlat när. → ner (FS, caan, 9, m)

7 PREPOSITION

7.1 Preposition → Verb
7.1.1 Min kompis tänkte hämta hjälp så han hängde sig i viadukten och hoppa ber sprang till närmaste huset och sa att det var en som hade trillat ner och att han skulle ringa ambulansen. → ner (SN, wj05, 13, m)

7.2 Preposition → More than one category
7.2.1 kan vi inte gå nu sa Filippa men darrig röst → med (DV, hais, 11, f)
7.2.2 Man beslöt att börja men marknaderna igen. → med (DV, mawe, 11, f)

8 CONJUNCTION

8.1 Conjunction → Noun
8.1.1 pojken fick nästan inte resa på sig fören en uggla kom. → förrän (FS, haic, 11, f)
8.1.2 Pojken hinner knappt resa sig upp fören en uggla kommer flygande mot honom. → förrän (FS, idja, 11, f)
8.1.3 fören pappa kom in rusande i mitt rum. → förrän (DV, idja, 11, f)
8.1.4 inte fören när jag skulle gå ner märkte jag att jag hade fastnat, → förrän (SN, wg16, 10, f)
8.1.5 män än natt klev grodan upp ur glas burken → men (FS, erja, 9, m)
8.1.6 män plötsligt hoppade hunden ut ur fönstret → men (FS, erja, 9, m)
8.1.7 män då hoppade pojken efter → men (FS, erja, 9, m)
8.1.8 gick vi upp till utgången av tältet mer upptäckte varan och vi blev så rädda → men (DV, alhe, 9, f)
8.1.9 män han hade skrikit så... → men/medan (FS, frma, 9, m)

8.1.10 ... å ställde cyklarna på den utskurna plattan. → och (CF, alhe, 9, f)
8.1.11 Vi bor i samma hus jag och Kamilla å hennes hund. → och (CF, angu, 9, f)
8.1.12 Så vi fick vänta tills pappa kom hem å då skulle jag visa pappa mamma → och (CF, hais, 11, f)
8.1.13 å älgen bara gode → och (FS, frma, 9, m)
8.1.14 å dam två tigrarna följde också med → och (DV, erja, 9, m)

8.2 Conjunction → More than one category
8.2.1 Då måste man fråga dom som inte vill va kompis med en vad man gör får fel... → för (SE, wj19, 13, m)
8.2.2 då skulle vi samlas 11.30 får bussen gick lite senare → för (SN, wg20, 10, m)
8.2.3 vi har så mycket saker så vi kan ha i byn → som??? (DV, haic, 11, f)

9 INTERJECTION

9.1 Interjection → Adjective
9.1.1 när vi kom in till mig så stod mamma och pappa i dörren och sa gratis till mig när jag kom. → grattis (CF, alhe, 9, f)

10 OTHER
10.1.1 där e huset som brinner → är (CF, erja, 9, m)
10.1.2 nu e nog alla människor ute → är (CF, erja, 9, m)
10.1.3 där e än → är (CF, erja, 9, m)
10.1.4 då e dam ännu närmare ljudet → är (CF, erja, 9, m)
10.1.5 Att bli mobbad e nog det värsta som finns, → är (SE, wj08, 13, f)
10.1.6 Han slog då till mig över kinden så att jag fick ett R. → ärr (SN, wg15, 10, m)

B.3 Segmentation Errors

Errors are categorized by part-of-speech.

ERROR → CORRECTION (CORP, SUBJ, AGE, SEX)

1 NOUN
1.1.1 VI VAR PÅ BORÅS BAD HUS → badhus (SN, wg13, 10, m)
1.1.2 ... har hunden fått syn på en bi kupa. → bikupa (FS, klma, 10, f)
1.1.3 Han hoppar upp på bi kupan → bikupan (FS, klma, 10, f)
1.1.4 ... så att bi kupan börjar att skaka → bikupan (FS, klma, 10, f)
1.1.5 bi kupan ramlar ner till marken! → bikupan (FS, klma, 10, f)
1.1.6 då kom det en bi svärm surrande förbi → bisvärm (FS, alca, 11, f)
1.1.7 tillslut välte han ner hela kupan och en hel bi svärm surrade ut. → bisvärm (FS, hais, 11, f)
1.1.8 Efter 5 minuter körde en brand bil in på gården. → brandbil (CF, idja, 11, f)
1.1.9 Då vi kom till min by. Trillade jag av brand bilen → brandbilen (CF, jowe, 9, f)
1.1.10 Men grannen intill ringde brand kåren. → brandkåren (CF, jobe, 10, m)
1.1.11 när brand kåren kom hade hela vår ranch brunnit ner till grunden. → brandkåren (DV, idja, 11, f)
1.1.12 brand larmet går → brandlarmet (CF, erja, 9, m)
1.1.13 Just när han hörde smällen gick brand larmet på riktigt! → brandlarmet (CF, klma, 10, f)
1.1.14 Han rusade ut till brandmännen som inte hade hört smällen och brand larmet. → brandlarmet (CF, klma, 10, f)
1.1.15 Han jobbade som brand man → brandman (CF, erja, 9, m)
1.1.16 En brand man klättrade upp till oss. → brandman (CF, idja, 11, f)
1.1.17 om det fanns någon ledig brand man → brandman (CF, idja, 11, f)
1.1.18 jag håller på och utbildar mig till brand man → brandman (CF, idja, 11, f)
1.1.19 Petter sa att han tänkte bli Brand man när han blir stor. → brandman (CF, idja, 11, f)
1.1.20 En brand man berättade att... → brandman (CF, jowe, 9, f)
1.1.21 BRAND MANEN → brandmannen (CF, erja, 9, m)
1.1.22 det här var en bra träning för mig sa brand manen → brandmannen (CF, idja, 11, f)
1.1.23 brand menen ryckte ut och släckte elden. → brandmännen (CF, anhe, 11, m)
1.1.24 jag ringde till brand stationen → brandstationen (CF, idja, 11, f)
1.1.25 Och i morgon är det brand övning → brandövning (CF, klma, 10, f)
1.1.26 där brand övningen skulle hålla till. → brandövningen (CF, klma, 10, f)
1.1.27 vi skulle börja göra i ordning den lilla byn som bestod av 8 hus 6 affärer och ett by hus → byhus (DV, hais, 11, f)
1.1.28 Desere jobbade i en djur affär → djuraffär (DV, hais, 11, f)
1.1.29 men se där är ni ju det lilla följet bestående av snutna djur från djur affären. → djuraffären (DV, hais, 11, f)
1.1.30 när det lilla djur följet gått i fyra timmar → djurföljet (DV, hais, 11, f)
1.1.31 Efter några sekunder stod såfus med tungan halvvägs hängande ut i mun i dörr öppningen. → dörröppningen (FS, hais, 11, f)
1.1.32 hon lurade i min pojkvän massa elak heter om Linnea. → elakheter (SN, wg07, 10, f)
1.1.33 han hade ett 4 mannatält I sin fik kniv. → fickkniv (DV, alhe, 9, f)
1.1.34 Då sprang dom fort till tunneln och fort till skidbacken och Fort till flyg platsen → flygplatsen (DV, erha, 10, m)

1.1.35 Jag hör fot steg från trappan → fotsteg (CF, alhe, 9, f)
1.1.36 frukost klockan ringde → frukostklockan (DV, hais, 11, f)
1.1.37 jag går ner och ringer i frukost klockan → frukostklockan (DV, hais, 11, f)
1.1.38 genom att han tappat en jord fläck på fönster karmen. → fönsterkarmen (FS, hais, 11, f)
1.1.39 Ronja hittade en förbands låda → förbandslåda (DV, mawe, 11, f)
1.1.40 Men lars fick försäkrings pengarna → försäkringspengarna (CF, erha, 10, m)
1.1.41 Hunden hoppar vid ett geting bo. → getingbo (FS, erha, 10, m)
1.1.42 Geting boet trillar ner på marken. → getingboet (FS, erha, 10, m)
1.1.43 Geting boet går sönder. → getingboet (FS, erha, 10, m)
1.1.44 det var en gips skena som... → gipsskena (SN, wj05, 13, m)
1.1.45 Nu hade han den i en ganska stor glas burk, på sitt rum. → glasburk (FS, alca, 11, f)
1.1.46 så han tog med sig grodan hem i en glas burk. → glasburk (FS, alhe, 9, f)
1.1.47 grodan klev upp ur glas burken. → glasburken (FS, alca, 11, f)
1.1.48 hunden stack in huvudet i glas burken → glasburken (FS, alca, 11, f)
1.1.49 Glas burken som hunden hade på huvudet gick i tusen bitar → glasburken (FS, alca, 11, f)
1.1.50 Oliver innerligt försökte få av sig den glas burken som... → glasburken (FS, alhe, 9, f)
1.1.51 Hunden hade fastnat i glas burken och ramlade ner. → glasburken (FS, caan, 9, m)
1.1.52 Pojken och hunden sitter och kollar på grodan i glas burken. → glasburken (FS, erha, 10, m)
1.1.53 När pojken och hunden har somnat kryper grodan ut ur glas burken. → glasburken (FS, erha, 10, m)
1.1.54 Glas burken går sönder. → glasburken (FS, erha, 10, m)
1.1.55 såfus hade letat i glas burken → glasburken (FS, hais, 11, f)
1.1.56 han fick ha på sig glas burken över huvudet. → glasburken (FS, hais, 11, f)
1.1.57 såfus landade med huvudet före och hela glas burken sprack. → glasburken (FS, hais, 11, f)
1.1.58 ... så gick glas burken sönder. → glasburken (FS, klma, 10, f)
1.1.59 dom plockade många kran kvistar och la som täcke → grankvistar (DV, hais, 11, f)
1.1.60 här är också en grav sten från 1989. → gravsten (DV, hais, 11, f)
1.1.61 jag satte upp grav stenar efter dom → gravstenar (DV, hais, 11, f)
1.1.62 dan efter grävde vi upp deras grav stenar → gravstenar (DV, hais, 11, f)
1.1.63 hit ut går det ju bara en grus väg → grusväg (DV, idja, 11, f)
1.1.64 Hästarna saktade av när dom kom ut på en grus väg. → grusväg (DV, idja, 11, f)
1.1.65 vi fortsatte på den lilla grus vägen. → grusvägen (DV, idja, 11, f)
1.1.66 grus vägen ledde fram till en övergiven by. → grusvägen (DV, idja, 11, f)
1.1.67 Vi följde grus vägen → grusvägen (DV, idja, 11, f)
1.1.68 Vi red i genom det stora hålet och kom in på grus vägen → grusvägen (DV, idja, 11, f)
1.1.69 vart tionde år måste han ha 5 guld klimpar → guldklimpar (DV, angu, 9, f)
1.1.70 en hund på 14 hund år → hundår (DV, hais, 11, f)
1.1.71 trampat på igel kott → igelkott (DV, hais, 11, f)
1.1.72 En dag hade vi en informations dag om mobbing → informationsdag (SE, wj16, 13, f)
1.1.73 Då kom det upp en jord ekorre → jordekorre (FS, alca, 11, f)
1.1.74 han tittade i ett jord hål. → jordhål (FS, alhe, 9, f)

1.1.75 det är ju jul afton om 3 dagar → julafton (CF, erge, 9, f)
1.1.76 Innan jul skulle våran klass ha jul fest. → julfest (SN, wg02, 10, f)
1.1.77 sen var det problem på klass fotot → klassfotot (SE, wg18, 10, m)
1.1.78 man vill ju vara fin på klass fotot → klassfotot (SE, wg18, 10, m)
1.1.79 På t ex klass fotot → klassfotot (SE, wg19, 10, m)
1.1.80 MIN KLASS KAMRAT VILLE INTE HOPPA FRÅN HOPPTORNET → klasskamrat (SN, wg13, 10, m)
1.1.81 snabbt tog han på sig klä där → kläder (FS, erja, 9, m)
1.1.82 Och så landade du visst i en möglig ko skit också → koskit (DV, idja, 11, f)
1.1.83 men det finns i alla fall ingen tur med en möglig ko skit. → koskit (DV, idja, 11, f)
1.1.84 De hade med sig : ett spritkök, ett tält, och Massa Mat, några kul gevär, och ammunition M.M. → kulgevär (DV, jobe, 10, m)
1.1.85 När kvälls daggen kom var vi helt klara → kvällsdaggen (DV, hais, 11, f)
1.1.86 Kvälls daggen hade fallit → kvällsdaggen (DV, mawe, 11, f)
1.1.87 det brann på Macintosh vägen 738c → Macintoshvägen (CF, anhe, 11, m)
1.1.88 Att få status är kanske det maffia ledarna håller på med. → maffialedarna (SE, wj20, 13, m)
1.1.89 Hela majs fältet var svart → majsfältet (CF, jowe, 9, f)
1.1.90 Vid mat bordet var det en livlig stämma → matbordet (DV, idja, 11, f)
1.1.91 dom kom in till oss med 2 stora mat kassar. → matkassar (CF, alhe, 9, f)
1.1.92 det var när jag gick i mellan stadiet → mellanstadiet (SN, wj14, 13, m)
1.1.93 Jag satt vid middags bordet tillsammans med mamma och min lillebror Simon. → middagsbordet (CF, mawe, 11, f)
1.1.94 där stannade dem och bodde där resten av livet för mobil telefonen räckte inte enda hem. → mobiltelefonen (DV, jobe, 10, m)
1.1.95 alla djur rusade ut ur affären upp på mölndals vägen → Mölndalsvägen (DV, hais, 11, f)
1.1.96 Han hade fångat en groda när han var i parken vid den stora näckros dammen. → näckrosdammen (FS, alca, 11, f)
1.1.97 skuggorna föll förundrat på det vita parkett golvet. → parkettgolvet (FS, hais, 11, f)
1.1.98 En vecka senare så var det en polis patrull som letade efter skol klassen → polispatrull (DV, alca, 11, f)
1.1.99 och precis när en av dem skulle slå till mig så hörde jag polis sirener → polissirener (SN, wj15, 13, m)
1.1.100 Man hämtar då en rast vakt. → rastvakt (SE, wg07, 10, f)
1.1.101 följer du med på en rid tur → ridtur (DV, idja, 11, f)
1.1.102 här står det August rosen gren har lämnat jorden → Rosengren (DV, hais, 11, f)
1.1.103 jag hade fått en sjuk dom → sjukdom (CF, erge, 9, f)
1.1.104 helt plötsligt var jag på sjuk huset. → sjukhuset (CF, erge, 9, f)
1.1.105 ... förrän jag vaknade i en sjukhus säng. → sjukhussäng (CF, mawe, 11, f)
1.1.106 jag tog mina saker ner i en sken påse → skenpåse (DV, haic, 11, f)
1.1.107 dom bär massor av sken smycken → skensmycken (DV, haic, 11, f)
1.1.108 Pappa det var du som la den i skrivbords lådan → skrivbordslådan (CF, erge, 9, f)
1.1.109 ...men sen måste dom väl få skuld känslor. → skuldkänslor (SE, wj04, 13, m)
1.1.110 därför är lärarens skyldig het att se till att eleven får hjälp. → skyldighet (SE, wj19, 13, m)
1.1.111 Sedan var det ett sov rum med 4 bäddar. → sovrum (DV, mawe, 11, f)
1.1.112 Dem kom med en steg bil och hämtade oss. → stegbil (CF, jobe, 10, m)

1.1.113 det var ett stort sten hus → stenhus (DV, erha, 10, m)
1.1.114 Kalle-knorr hade hittat ett stort sten kors → stenkors (DV, hais, 11, f)
1.1.115 där står ett gult hus med stock rosor slingrande efter väggarna → stockrosor (DV, hais, 11, f)
1.1.116 allt från att förstå en telefon apparat till att förstå en människa. → telefonapparat (SE, wj20, 13, m)
1.1.117 när de var hemma så tittade de i telefon katalogen → telefonkatalogen (CF, alca, 11, f)
1.1.118 ni får gärna bo hos oss under tid en ni inte har nåt att bo i. → tiden (DV, idja, 11, f)
1.1.119 så kom brandbilen och räddade mamma ut genom toalett fönstret. → toalettfönstret (CF, hais, 11, f)
1.1.120 där bakom några grenar låg någonting ett trä hus → trähus (DV, hais, 11, f)
1.1.121 Ett vardags rum med 2 soffor 1 bord och en stor öppenspis → vardagsrum (DV, mawe, 11, f)
1.1.122 Johan gick in i vardags rummet och satte upp elementet. → vardagsrummet (CF, alca, 11, f)
1.1.123 hela vardags rummet stod i brand → vardagsrummet (CF, alca, 11, f)
1.1.124 hans älsklings djur var groda. → älsklingsdjur (FS, jowe, 9, f)
1.1.125 Hans älsklings färg vad grön → älsklingsfärg (FS, jowe, 9, f)
1.1.126 Och det är nog en överlevnads instinkt. → överlevnadsinstinkt (SE, wj20, 13, m)

2 ADJECTIVE/PARTICIPLE
2.1.1 Fast pappa hade utrustat alla hus brand säkra. → brandsäkra (DV, idja, 11, f)
2.1.2 där va massa människor som va fast kedjade med tjocka kedjor → fastkedjade (DV, alhe, 9, f)
2.1.3 Människorna hade haft färg glada dräkter på sig → färgglada (DV, mawe, 11, f)
2.1.4 Tanja sydde glatt färgade kläder åt allihop → glattfärgade (DV, mawe, 11, f)
2.1.5 Fönstret stod halv öppet → halvöppet (FS, hais, 11, f)
2.1.6 där han låg hjälp lös på marken. → hjälplös (FS, hais, 11, f)
2.1.7 Cristoffer hoppade ner och var jätte arg för att burken gick sönder. → jättearg (FS, alca, 11, f)
2.1.8 Cristoffer lyfte upp hunden och var fortfarande jätte arg men ... → jättearg (FS, alca, 11, f)
2.1.9 Ett par horn på en hjort som blev jätte arg. → jättearg (FS, erge, 9, f)
2.1.10 Bina som var inne i boet blev jätte arga och surrade upp ur boet. → jättearga (FS, alca, 11, f)
2.1.11 så kanske de blir jätte bra kompisar. → jättebra (SE, wg16, 10, f)
2.1.12 och tänk om den som man skrev av hade skrivit en jätte bra dikt → jättebra (SE, wg17, 10, f)
2.1.13 Det var inte så jätte djupt på den delen av floden som Cristoffer och hunden föll i på. → jättedjupt (FS, alca, 11, f)
2.1.14 dom bott i en jätte fin by → jättefin (DV, alhe, 9, f)
2.1.15 Sen hjälpte vi dom att göra om byn till en jätte fin by → jättefin (DV, alhe, 9, f)
2.1.16 Mamma och pappa tyckte det var en jätte fin by → jättefin (DV, idja, 11, f)
2.1.17 Jag hade ett jätte fint rum. → jättefint (DV, idja, 11, f)
2.1.18 då blev jag jätte glad → jätteglad (SN, wg18, 10, f)
2.1.19 Då blev dom jätte glada. → jätteglada (DV, alhe, 9, f)
2.1.20 där man kan äta jätte god picknick → jättegod (DV, alhe, 9, f)

ERROR / CORRECTION  CORP  SUBJ  AGE  SEX

2.1.21  det var helt lila och såg jätte hemskt ut,
        jättehemskt   SN   wj03   13   f
2.1.22  pappa och jag tänkte att vi skulle cykla upp på det jätte höga berget för att titta på ut sikten.
        jättehöga   DV   alhe   9   f
2.1.23  pappa gick ut och såg att vi va I en jätte liten och fin by,
        jätteliten   DV   alhe   9   f
2.1.24  Den andra frågan är jätte lätt
        jättelätt   SE   wj03   13   f
2.1.25  vi mulade och kastade jätte många snöbollar på dom
        jättemånga   SN   wj10   13   m
2.1.26  tuni hade jätte ont i knät
        jätteont   SN   wj03   13   f
2.1.27  Nästa dag när Oliver vaknade blev han jätte rädd för han såg inte grodan i glasburken.
        jätterädd   FS   jowe   9   f
2.1.28  Då blev Oliver jätte rädd.
        jätterädd   FS   jowe   9   f
2.1.29  jag blev jätte rädd
        jätterädd   SN   wj03   13   f
2.1.30  både muffins och Oliver blev jätte rädda.
        jätterädda   FS   jowe   9   f
2.1.31  Det blev jätte struligt med allt möjligt inblandat.
        jättestruligt   SN   wg11   10   f
2.1.32  han sade till muffins att vara jätte tyst.
        jättetyst   FS   jowe   9   f
2.1.33  man ser att det är nåt jätte viktigt hon ville berätta.
        jätteviktigt   CF   alhe   9   f
2.1.34  Med en gång blev jag klar vaken
        klarvaken   DV   idja   11   f
2.1.35  en platta som nästan va om ringad av lava.
        omringad   DV   alhe   9   f
2.1.36  vi slog upp tältet på den spik spetsiga toppen
        spikspetsiga   DV   alhe   9   f
2.1.37  det var en varm och stjärn klar natt.
        stjärnklar   DV   hais   11   f
2.1.38  En gång blev den hemska pyroman ut kastad ur stan.
        utkastad   CF   frma   9   m
2.1.39  Om man blir ut satt för något ...
        utsatt   SE   wj19   13   m
2.1.40  i vart enda hus var alla saker kvar från 1600 talet
        vartenda   DV   hais   11   f
2.1.41  då bar det av i 14 dagar och 14 äventyrs fyllda nätter
        äventyrsfyllda   DV   hais   11   f
2.1.42  då kom dom till en över given by
        övergiven   DV   erge   9   f
2.1.43  de kom till en över given by
        övergiven   DV   erha   10   m
2.1.44  de kom till en över given by
        övergiven   DV   hais   11   f
2.1.45  Det var en över given by.
        övergiven   DV   hais   11   f
2.1.46  då för stod vi att det var en över given by
        övergiven   DV   hais   11   f
2.1.47  till slut kom dem till en över given By.
        övergiven   DV   jobe   10   m
2.1.48  vi passerade många över vuxna hus
        övervuxna   DV   hais   11   f
2.1.49  Oliver fick se ett geting bo och blev hel galen.
        helgalen   FS   alhe   9   f

3  PRONOUN
3.1.1  hon hade bara drömt allt ihop.
       alltihop   DV   angu   9   f
3.1.2  simon låg på sin kudde och hade inte märkt någon ting.
       någonting   FS   hais   11   f
3.1.3  Nu ska jag visa er någon ting
       någonting   DV   hais   11   f
3.1.4  Dom flesta var duktiga på någon ting
       någonting   DV   mawe   11   f
3.1.5  för då kan man inte något ting
       någonting   SE   wg03   10   f

4  VERB
4.1.1  när jag dog 1978 i cancer återvände jag hit för att fort sätta mitt liv här
       fortsätta   DV   alco   9   f
4.1.2  Jag tror att killen inte kan för bättra sig själv...
       förbättra   SE   wj03   13   f
4.1.3  då för stod vi att det var en över given by
       förstod   DV   hais   11   f
4.1.4  medan jag för sökte lyfta upp mig skälv
       försökte   SN   wg16   10   f

Error Corpora

ERROR / CORRECTION  CORP  SUBJ  AGE  SEX

4.1.5  ni för tjänar verkligen mina hem kokta kladdkakor
       förtjänar   DV   hais   11   f
4.1.6  a Tess min fina gamla hund du på minner mig om någon jag har träffat förut
       påminner   DV   hais   11   f
4.1.7  Han ring de till mig sen och sa samma sak.
       ringde   SN   wg07   10   f
4.1.8  Hon under sökte noga hans fot.
       undersökte   DV   mawe   11   f

5  ADVERB
5.1.1   Där efter dog mamma på sjukhuset.
        därefter   CF   hais   11   f
5.1.2   men han tog sig snabbt där i från.
        därifrån   FS   hais   11   f
5.1.3   när man bara går där ifrån
        därifrån   SE   wj19   13   m
5.1.4   SEN GICK VI DÄR IFRÅN
        därifrån   SN   wg13   10   m
5.1.5   Jag ställde mig på en sten och efter ett tag så ville jag gå där ifrån,
        därifrån   SN   wj01   13   f
5.1.6   så till slut så sprang dom där ifrån
        därifrån   SN   wj10   13   m
5.1.7   Bina som bodde i bot rusade i mot Oliver
        emot   FS   alhe   9   f
5.1.8   han råkade bara kom i mot getingboet.
        emot   FS   haic   11   f
5.1.9   Marianne sa fort farande hur jag kunde va med henne
        fortfarande   SN   wg07   10   f
5.1.10  Alla såg fram emot att åka
        framemot   SN   wj09   13   m
5.1.11  Då kom hunden för bi med getingar
        förbi   FS   caan   9   m
5.1.12  människor som går för bi kan höra oss.
        förbi   DV   hais   11   f
5.1.13  Eller när man går för bi varandra
        förbi   SE   wg07   10   f
5.1.14  vi hade aldrig fått smaka plättar sylt och kola för ut
        förut   DV   hais   11   f
5.1.15  Inte konstigt att vi inte har upptäckt den här ingången för ut
        förut   DV   idja   11   f
5.1.16  jag som alltid tyckt det var så högt här i från.
        härifrån   CF   idja   11   f
5.1.17  stick här i från annars är du dödens
        härifrån   DV   angu   9   f
5.1.18  I bland kan allt vara jobbigt och hemskt
        ibland   SE   wj02   13   f
5.1.19  Men i bland kan det vara så att dom tror att dom är coola
        ibland   SE   wj09   13   m
5.1.20  jag var tvungen att berätta hela historien om i gen.
        igen   CF   hais   11   f
5.1.21  vad var det han hete nu i gen?
        igen   CF   hais   11   f
5.1.22  jag vill bli kompis med henne i gen
        igen   SN   wg03   10   f
5.1.23  och så ville Johanna bli kompis i gen.
        igen   SN   wg03   10   f
5.1.24  Pojken och hunden söker i genom rummet.
        igenom   FS   erha   10   m
5.1.25  morfar och dom andra letar och letar i genom staden
        igenom   DV   erge   9   f
5.1.26  Vi red i genom det stora hålet
        igenom   DV   idja   11   f
5.1.27  Vi red i genom byn
        igenom   DV   idja   11   f
5.1.28  när Gunnar öppna dörren till det stora huset rasa det i hop
        ihop   DV   erha   10   m
5.1.29  snart rasa hela byn i hop
        ihop   DV   erha   10   m
5.1.30  snabbt samla han i hop alla sina jägare
        ihop   DV   erja   9   m
5.1.31  Rådjuret sprang i väg med honom.
        iväg   FS   angu09   9   f
5.1.32  Han sprang i vägg och klättrade upp på en kulle.
        iväg   FS   anhe   11   m
5.1.33  Lena såg en gammal man sitta i ett tält av guld intill sov säckarna som och så var av guld.
        också   DV   angu   9   f
5.1.34  dam tåg och så med sig sina två tigrar
        också   DV   erja   9   m
5.1.35  undulater flög om kring
        omkring   DV   hais   11   f

ERROR / CORRECTION  CORP  SUBJ  AGE  SEX

5.1.36  när de såg sig om kring
        omkring   DV   jowe   9   f
5.1.37  han trillar om kull.
        omkull   FS   klma   10   f
5.1.38  Han ropade igenom fönstret men inget kvack kom till baka.
        tillbaka   FS   caan   9   m
5.1.39  vi gick till baka igen
        tillbaka   DV   alhe   9   f
5.1.40  svarta manen sprang sin väg och kom aldrig mer till baka.
        tillbaka   DV   angu   9   f
5.1.41  Efter det gick vi till baka
        tillbaka   DV   idja   11   f
5.1.42  ... ska man lämna till baka den.
        tillbaka   SE   wg17   10   f
5.1.43  Sedan slumrade såfus, grodan och simon djupt till sammans.
        tillsammans   FS   hais   11   f
5.1.44  Men de var fult med buskar utan för som vi fick rid igenom.
        utanför   DV   idja   11   f
5.1.45  en kille blev utan för,
        utanför   SE   wj11   13   f
5.1.46  men olof var glad en då
        ändå   FS   frma   9   m
5.1.47  men om man inte får vara med än då
        ändå   SE   wj14   13   m
5.1.48  Erik letade över allt
        överallt   FS   alhe   9   f
5.1.49  Han letade över allt i sitt rum
        överallt   FS   jobe   10   m
5.1.50  Han letade under sängen under pallen i tofflorna bland kläderna ja över allt
        överallt   FS   jowe   9   f
5.1.51  Han letade över allt
        överallt   FS   mawe   11   f
5.1.52  Desere letade över allt
        överallt   DV   hais   11   f
5.1.53  jag har letat över allt
        överallt   DV   hais   11   f

6  PREPOSITION
6.1.1  fram för mig stod världens finaste häst.
       framför   CF   alhe   9   f
6.1.2  Vi gick längs vägen tills vi såg ett stort hus som låg en bit utan för själva stan
       utanför   DV   idja   11   f

7  CONJUNCTION
7.1.1  Efter som han frös och inte såg sig för snubblade han på en sten.
       eftersom   DV   mawe   11   f
7.1.2  ... och efter som det inte fanns nåt lock på burken...
       eftersom   FS   alhe   9   f
7.1.3  men jag kunde inte säga det till honom för att jag visste att han skulle bli lite ledsen då efter som vi hade bestämt.
       eftersom   SN   wg06   10   f

8  RUN-ONS
8.1.1   Nathalie berättade alltför mig
        allt för   SN   wg11   10   f
8.1.2   därbakom fanns 2 grodor.
        där bakom   FS   jowe   9   f
8.1.3   och tillslut stod vi alla på marken
        till slut   CF   idja   11   f
8.1.4   tillslut välte han ner hela kupan
        till slut   FS   hais   11   f
8.1.5   tillslut kom de fram till en gärdsgård
        till slut   DV   alca   11   f
8.1.6   men tillslut tyckte de också att ...
        till slut   DV   alca   11   f
8.1.7   tillslut blev dam två kamelerna så trötta...
        till slut   DV   erja   9   m
8.1.8   tillslut kom de fram till en vacker plats
        till slut   DV   hila   10   f
8.1.9   tillslut sa pappa
        till slut   DV   idja   11   f
8.1.10  Tillslut kom dom upp mot sidan av oss och sa,
        till slut   SN   wj04   13   m
8.1.11  Tillslut kom det en massa vuxna som...
        till slut   SN   wj04   13   m
8.1.12  Vi åkte tillslut på bio.
        till slut   SN   wj04   13   m
8.1.13  mobbing råkar väldigt många utför.
        ut för   SE   wj05   13   m

Appendix C

SUC Tagset

The set of tags used was taken from the Stockholm Umeå Corpus (SUC):

Code  Category
AB    Adverb
DL    Delimiter (Punctuation)
DT    Determiner
HA    Interrogative/Relative Adverb
HD    Interrogative/Relative Determiner
HP    Interrogative/Relative Pronoun
HS    Interrogative/Relative Possessive
IE    Infinitive Marker
IN    Interjection
JJ    Adjective
KN    Conjunction
NN    Noun
PC    Participle
PL    Particle
PM    Proper Noun
PN    Pronoun
PP    Preposition
PS    Possessive
RG    Cardinal Number
RO    Ordinal Number
SN    Subjunction
UO    Foreign Word
VB    Verb

Code     Feature
UTR      Common (Utrum)   Gender
NEU      Neuter           Gender
MAS      Masculine        Gender
UTR/NEU  Underspecified   Gender
-        Unspecified      Gender
SIN      Singular         Number
PLU      Plural           Number
SIN/PLU  Underspecified   Number
-        Unspecified      Number
IND      Indefinite       Definiteness
DEF      Definite         Definiteness
IND/DEF  Underspecified   Definiteness
-        Unspecified      Definiteness
NOM      Nominative       Case
GEN      Genitive         Case
SMS      Compound         Case
-        Unspecified      Case
POS      Positive         Degree
KOM      Comparative      Degree
SUV      Superlative      Degree
SUB      Subject          Pronoun Form
OBJ      Object           Pronoun Form
SUB/OBJ  Underspecified   Pronoun Form
PRS      Present          Verb Form
PRT      Preterite        Verb Form
INF      Infinitive       Verb Form
SUP      Supinum          Verb Form
IMP      Imperative       Verb Form
AKT      Active           Voice
SFO      S-form           Voice
KON      Subjunctive      Mood
PRF      Perfect          Perfect
AN       Abbreviation     Form
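A full SUC tag combines a category code with a string of feature codes from the tables above. As an illustration only (this helper is not part of FiniteCheck), a tag such as `NN UTR SIN DEF NOM` can be split into its category and feature codes:

```python
# Minimal sketch: split a SUC tag string into category + features.
# The category codes follow the table above; the function itself
# is illustrative, not taken from the thesis implementation.

CATEGORIES = {"AB", "DL", "DT", "HA", "HD", "HP", "HS", "IE", "IN",
              "JJ", "KN", "NN", "PC", "PL", "PM", "PN", "PP", "PS",
              "RG", "RO", "SN", "UO", "VB"}

def parse_suc_tag(tag):
    """Return the category code and the remaining feature codes."""
    parts = tag.split()
    if not parts or parts[0] not in CATEGORIES:
        raise ValueError("unknown SUC category: %r" % tag)
    return {"category": parts[0], "features": parts[1:]}

print(parse_suc_tag("NN UTR SIN DEF NOM"))
# {'category': 'NN', 'features': ['UTR', 'SIN', 'DEF', 'NOM']}
```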

Appendix D

Implementation

D.1 Broad Grammar

#### Declare categories
define PPheadPhr ["" ˜$"" ""];
define VPheadPhr ["" ˜$"" ""];
define APPhr ["" ˜$"" ""];
define NPPhr ["" ˜$"" ""];
define PPPhr ["" ˜$"" ""];
define VPPhr ["" ˜$"" ""];

#### Head rules
define AP [(Adv) Adj+];
define PPhead [Prep];
define VPhead [[[Adv* Verb] | [Verb Adv*]] Verb* (PNDef & PNNeu)];

#### Complement rules
define NP [[[(Det | Det2 | NGen) (Num) (APPhr) (Noun)] & ?+] | Pron];
define PP [PPheadPhr NPPhr];
define VP [VPheadPhr (NPPhr) (NPPhr) (NPPhr) PPPhr*];

#### Verb clusters
define VC [[[Verb Adv*] / NPTags] (NPPhr) [[Adv* Verb (Verb)] / NPTags]];
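To make the agreement-blindness of the broad grammar concrete, here is a small sketch (Python regular expressions over hypothetical `word/TAG` strings; the actual system is compiled xfst networks and the tag markup here is invented for illustration): a pattern in the spirit of the broad NP rule accepts a determiner-adjective-noun sequence even when the forms disagree.

```python
import re

# Hypothetical flat input: each token is written as word/TAG.
# The pattern loosely mirrors the agreement-blind broad NP rule,
# admitting (Det) (Num) Adj* Noun with no feature checking.
BROAD_NP = re.compile(r"(\S+/DT\s+)?(\S+/RG\s+)?(\S+/JJ\s+)*\S+/NN")

# "en litet hus" disagrees in gender (utrum det, neuter adj/noun),
# yet the broad grammar still parses the whole span as an NP.
text = "en/DT litet/JJ hus/NN"
m = BROAD_NP.search(text)
print(m.group(0))  # → en/DT litet/JJ hus/NN
```

It is exactly this permissiveness that the narrow grammars below restrict, one agreement dimension at a time.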

D.2 Narrow Grammar: Noun Phrases

############### Narrow grammar for APs:
define APDef ["" (Adv) AdjDef+ ""];
define APInd ["" (Adv) AdjInd+ ""];
define APSg ["" (Adv) AdjSg+ ""];
define APPl ["" (Adv) AdjPl+ ""];
define APNeu ["" (Adv) AdjNeu+ ""];
define APUtr ["" (Adv) AdjUtr+ ""];
define APMas ["" (Adv) AdjMas+ ""];


############### Narrow grammar for NPs:

###### NPs consisting of a single noun
define NPDef1 [(Num) [NDef | PNoun]];
define NPInd1 [(Num) NInd];
define NPSg1 [(NumO) NSg | [NPl & NInd] | PNoun];
define NPPl1 [(NumC) [NPl | PNoun]];
define NPNeu1 [(Num) [NNeu | [NUtr & NInd] | PNoun]];
define NPUtr1 [(Num) [[NUtr & NPl] | [NUtr & NDef] | PNoun]];

###### NPs consisting of a determiner (or a noun in genitive) and a noun
define NPDef2 [DetDef (DetAdv) (Num) NDef] | [[DetMixed | NGen] (Num) NInd];
define NPInd2 [DetInd (Num) NInd];
define NPSg2 [[DetSg (DetAdv) | NGen] (NumO) NSg];
define NPPl2 [[DetPl (DetAdv) | NGen] (NumC) NPl];
define NPNeu2 [[DetNeu (DetAdv) | NGen] (Num) NNeu];
define NPUtr2 [[DetUtr (DetAdv) | NGen] (Num) NUtr];

###### NPs consisting of [Det (AP) N]
define NPDef3 [DetDef (DetAdv) (Num) (APDef) NDef] | [[DetMixed | NGen] (Num) (APDef) NInd];
define NPInd3 [DetInd (NumO) (APInd) NInd];
define NPSg3 [[DetSg (DetAdv) | NGen] (NumO) (APSg) NSg];
define NPPl3 [[DetPl (DetAdv) | NGen] (NumC) (APPl) NPl];
#define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) (APNeu) NNeu];
define NPNeu3 [[DetNeu (DetAdv) | NGen] (Num) [[(APNeu) NNeu] | [(APMas) NMas]]];
define NPUtr3 [[DetUtr (DetAdv) | NGen] (Num) (APUtr) NUtr];

###### NPs consisting of [Adj+ N]
# optional numbers only in NPInd and NPPl
define NPDef4 [APDef NDef];
define NPInd4 [(Num) APInd NInd];
define NPSg4 [APSg NSg];
define NPPl4 [(Num) APPl NPl];
define NPNeu4 [APNeu NNeu];
define NPUtr4 [APUtr NUtr];

###### NPs consisting of a single pronoun
define NPDef5 [PNDef];
define NPInd5 [PNInd];
define NPSg5 [PNSg];
define NPPl5 [PNPl];
define NPNeu5 [PNNeu];
define NPUtr5 [PNUtr];

###### NPs consisting of a single determiner
define NPDef6 [DetDef (DetAdv)];
define NPInd6 [DetInd];
define NPSg6 [DetSg (DetAdv)];
define NPPl6 [DetPl (DetAdv)];
define NPNeu6 [DetNeu (DetAdv)];
define NPUtr6 [DetUtr (DetAdv)];

###### NPs consisting of adjectives
define NPDef7 [APDef+];
define NPInd7 [APInd+];
define NPSg7 [APSg+];
define NPPl7 [APPl+];
define NPNeu7 [APNeu+];
define NPUtr7 [APUtr+];

###### NPs consisting of a single determiner and adjectives
define NPDef8 [DetDef APDef];
define NPInd8 [DetInd APInd];
define NPSg8 [DetSg APSg];
define NPPl8 [DetPl APPl];
define NPNeu8 [DetNeu APNeu];
define NPUtr8 [DetUtr APUtr];

###### NPs consisting of number as the main word
define NPDef9 [(DetDef) NumO];
define NPInd9 [Num];
define NPSg9 [Num];
define NPPl9 [Num];
define NPNeu9 [Num];
define NPUtr9 [Num];

###### NPs that meet definiteness agreement
### Definite NPs
define NPDef [NPDef1 | NPDef2 | NPDef3 | NPDef4 | NPDef5 | NPDef6 | NPDef7 | NPDef8 | NPDef9];
### Indefinite NPs
define NPInd [NPInd1 | NPInd2 | NPInd3 | NPInd4 | NPInd5 | NPInd6 | NPInd7 | NPInd8 | NPInd9];
define NPDefs [NPDef | NPInd];

###### NPs that meet number agreement
### Singular NPs
define NPSg [NPSg1 | NPSg2 | NPSg3 | NPSg4 | NPSg5 | NPSg6 | NPSg7 | NPSg8 | NPSg9];
### Plural NPs
define NPPl [NPPl1 | NPPl2 | NPPl3 | NPPl4 | NPPl5 | NPPl6 | NPPl7 | NPPl8 | NPPl9];
define NPNum [NPSg | NPPl];

###### NPs that meet gender agreement
### Utrum NPs
define NPUtr [NPUtr1 | NPUtr2 | NPUtr3 | NPUtr4 | NPUtr5 | NPUtr6 | NPUtr7 | NPUtr8 | NPUtr9];
### Neutrum NPs
define NPNeu [NPNeu1 | NPNeu2 | NPNeu3 | NPNeu4 | NPNeu5 | NPNeu6 | NPNeu7 | NPNeu8 | NPNeu9];
define NPGen [NPNeu | NPUtr];

########## Partitive NPs
define NPPart [[Det | Num] PPart NP];

define NPPartDef [[Det | Num] PPart NPDef];
define NPPartInd [[Det | Num] PPart NPDef];
define NPPartSg [[DetSg | Num] PPart NPPl];
define NPPartPl [[DetPl | Num] PPart NPPl];
define NPPartNeu [[DetNeu | Num] PPart NPNeu];
define NPPartUtr [[DetUtr | Num] PPart NPUtr];

define NPPartDefs [NPPartDef | NPPartInd];
define NPPartNum [NPPartSg | NPPartPl];
define NPPartGen [NPPartNeu | NPPartUtr];

########## NPs followed by relative subclause
define SelectNPRel [ "" -> "" || _ DetDef ˜$"" "" (" ") {som} Tag*];

D.3 Narrow Grammar: Verb Phrases

#### Infinitive VPs
# select Infinitive VPs
define SelectInfVP ["" -> "" || InfMark "" _ ];
# Infinitive VP
define VPInf [Adv* (ModInf) VerbInf Adv* (NPPhr)];

#### Tensed verb first
define VPFinite [ Adv* VerbTensed ?* ];

#### Verb Clusters:
# select VCs
define SelectVC [VC @-> "" ... "" ];

define VC1 [ [[Mod | INFVerb] / NPTags] (NPPhr) [[Adv* VerbInf] / NPTags] ];
define VC2 [ [Mod / NPTags] (NPPhr) [[Adv* ModInf VerbInf] / NPTags] ];
define VC3 [ [Mod / NPTags] (NPPhr) [[Adv* PerfInf VerbSup] / NPTags] ];
define VC4 [ [Perf / NPTags] (NPPhr) [[Adv* VerbSup] / NPTags] ];
define VC5 [ [Perf / NPTags] (NPPhr) [[Adv* ModSup VerbInf] / NPTags] ];

define VCgram [VC1 | VC2 | VC3 | VC4 | VC5];

### Coordinated VPs:
define SelectVPCoord ["" -> "" || ["" | ""] ˜$"" ˜$"" [{eller} | {och}] Tag* (" ") "" _ ];

#** ATT-VPs that do not require infinitive
define SelectATTFinite [ "" -> "" ||
    [ [ [[{sa} Tag+] | [[{för} Tag+] / NPTags]] ("") ]
    | [ [{tänkte} Tag+] [[NPPhr ""] | ["" NPPhr ""]] ] ]
    InfMark "" _ ];

### Supine VPs
define SelectSupVP [ "" -> "" || _ VerbSup ""];

D.4 Parser

###### Mark head phrases (lexical prefix)
define markPPhead [PPhead @-> "" ... ""];
define markVPhead [VPhead @-> "" ... ""];
define markAP [AP @-> "" ... ""];

###### Mark phrases with complements
define markNP [NP @-> "" ... ""];
define markPP [PP @-> "" ... ""];
define markVP [VP @-> "" ... ""];

###### Composing parsers
define parse1 [markVPhead .o. markPPhead .o. markAP];
define parse2 [markNP];
define parse3 [markPP .o. markVP];

D.5 Filtering

################# Filtering Parsing Results
### Possessive NPs
define adjustNPGen [ 0 -> "" || NGen "" NPPhr _,,
    "" -> 0 || NGen _ ˜$"" "];

### Adjectives
define adjustNPAdj [ "" -> 0 || Det _ APPhr "" NPPhr ,,
    "" -> 0 || Det "" APPhr _];

### Adjective form, i.e. remove plural tags if singular NP
define removePluralTagsNPSg [ TagPLU -> 0 || DetSg "" Adj _ ˜$"" ""];

### Partitive NPs
define adjustNPPart [ "" -> 0 || _ PPart "",,
    "" -> 0 || "" PPart _];

### Complex VCs stretched over two vpHeads:
define adjustVC [ "" -> 0 || [[Adv* VBAux Adv*] / VCTags] _ NPPhr VPheadPhr,,
    "" -> 0 || [[Adv* VBAux Adv*] / VCTags] NPPhr _ VPheadPhr,,
    "" -> 0 || [[Adv* VBAux Adv*] / VCTags]] "" NPPhr _ ˜$"" "",,
    "" -> 0 || [[Adv* VBAux Adv*] / VCTags]] NPPhr "" _ ˜$"" "" ];

### VCs with two copula or copula and an adjective:
define SelectVCCopula [ "" -> "" || _ [CopVerb / NPTags] ˜$"" ""];

################# Removing Parsing Errors
### not complete PPs, i.e. ppHeads without following NP
define errorPPhead [ "" -> 0 || \[""] _ ,,
    "" -> 0 || _ \[""]];
### empty VPHead
define errorVPHead [ "" -> 0];

D.6 Error Finder

######### Finding grammatical errors (Error marking)

###### NPs
# Define NP-errors
define npDefError ["" [NP - NPDefs] ""];
define npNumError ["" [NP - NPNum] ""];
define npGenError ["" [NP - NPGen] ""];

# Mark NP-errors
define markNPDefError [ npDefError -> "" ... ""];
define markNPNumError [ npNumError -> "" ... ""];
define markNPGenError [ npGenError -> "" ... ""];

# Define NPPart-errors
define NPPartDefError ["" [NPPart - NPPartDefs] ""];
define NPPartNumError ["" [NPPart - NPPartNum] ""];
define NPPartGenError ["" [NPPart - NPPartGen] ""];

# Mark NPPart-errors
define markNPPartDefError [ NPPartDefError -> "" ... ""];
define markNPPartNumError [ NPPartNumError -> "" ... ""];
define markNPPartGenError [ NPPartGenError -> "" ... ""];

###### VPs
# Define errors in VPs
define vpFiniteError ["" [VPhead - VPFinite] ""];
define vpInfError ["" [VPhead - VPInf] ""];
define VCerror ["" [VC - VCgram] ""];

# Mark VP-errors
define markFiniteError [ vpFiniteError -> "" ... ""];
define markInfError [ vpInfError -> "" ... ""];
define markVCerror [ VCerror -> "" ... ""];
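The error definitions above all follow the same subtraction scheme: the phrases accepted by the broad grammar minus those licensed by a narrow, agreement-checking grammar. A toy sketch of that idea (Python regular expressions over hypothetical `word/TAG.GENDER` strings; the tag format and the tiny lexicon are invented for illustration and are not taken from FiniteCheck):

```python
import re

# Broad NP: any determiner followed by a noun, agreement-blind,
# in the spirit of npGenError = [NP - NPGen] above.
BROAD = re.compile(r"\S+/DT\.(UTR|NEU)\s+\S+/NN\.(UTR|NEU)")

# Narrow NP: determiner and noun must carry the same gender code.
NARROW = re.compile(r"\S+/DT\.(UTR)\s+\S+/NN\.\1|\S+/DT\.(NEU)\s+\S+/NN\.\2")

def gender_errors(tagged):
    """Return the broad-NP matches that the narrow grammar rejects."""
    return [m.group(0) for m in BROAD.finditer(tagged)
            if not NARROW.fullmatch(m.group(0))]

# "en hus" (utrum det, neuter noun) is flagged; "ett hus" is not.
print(gender_errors("en/DT.UTR hus/NN.NEU och/KN ett/DT.NEU hus/NN.NEU"))
# → ['en/DT.UTR hus/NN.NEU']
```

Note that, as in the xfst version, the only grammars one has to write are positive descriptions of valid Swedish; the error language falls out of the set difference.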