A Prototype Syntax Checker for German Learners of English
Paper presented at Intelligent Computer Aided Language Learning ICALL ’91, UMIST - September 1991
Deirdre Mulligan, Cambridge Technology Partners, Cambridge, Mass., U.S.A. & Kevin Ryan, Dept. of Computer Science & Information Systems, University of Limerick, IRELAND
NOTE This paper has not been previously submitted for publication to any other conference or journal
1. Introduction

This paper describes a prototype system that will act as a "style-checker" for the written English of native German speakers. It is a German-specific system in that it detects errors "typically" made by Germans, which occur due to the influence of their mother-tongue on their English. The "style-checker" also searches for errors that all learners of English are prone to make, irrespective of their mother-tongue.

The motivations for this work are many. An increasing number of German-speaking people are studying English and so must submit essays and assignments for various courses. One of the major problems is that they write "German-English". A "style-checker" which could detect "German-English" errors could be of great benefit to these learners.

2. Error Analysis

Although there has been (and still is) a great deal of research in the area of language learning, the questions of how people actually "learn" their first language and how they master other languages have remained unanswered. However, many hypotheses and theories have been put forward to try to account for these processes. One important research approach has been error analysis, where the errors made by second language learners have been examined and used to formulate hypotheses about the learning process. This approach has been followed in developing the prototype system.

It is sometimes assumed that what can be expressed in one language can automatically be expressed in another. However, this is not always true since: "each language articulates or organises the world differently. Languages do not simply name existing categories, they articulate their own." (Littlewood, 1984). This point was much emphasised by the ’Contrastive Analysis Hypothesis’, which was one of the major ideas on error detection and analysis in the 1960’s. It involved the study of how prior acquisition of one or more languages affected the process of acquiring other languages.
(Singleton, 1981) Its aim was to predict the most error-prone areas in a second language for learners of a given first language background. It reasoned that if these areas could be identified and given particular attention in the teaching of the second language in question, then the acquisition of that language could be eased. According to the contrastive analysis hypothesis these errors were predictable on the basis of a comparison of the target language and the mother-tongue of the learner. For example, Nemser stated that it could "predict where learning difficulties or facilitation will occur on the basis of a comparison of descriptions of the learners language (base or source language) and the language he is attempting to learn (target or reference language)" (Nemser, 1975), while Eric Kadler felt that "non-native speakers do make syntactic mistakes as frequently and stubbornly as they make semantic and morphological mistakes, because they tend to transfer to the foreign language their native syntactical system as well as their morphological habits and semantic values". (Kadler, 1970). So if the learner knows that rule x exists in his language, it is quite possible that he will try to apply this rule in the target language.

On the other hand, cross linguistic influence can be very helpful to someone learning a second language. In fact, as Corder suggests, the first language provides "a rather rich and specific set of hypotheses". (Corder, 1970) However, contrastive analysis suggests that where the languages differ, errors can result. One needs to be careful not to overstate the case, but the following balanced statement of the claims of Contrastive Analysis was given by James: "... Contrastive Analysis has never claimed that L1 interference is the sole source of error.
As Lado put it: ’these differences are the chief source of difficulty in learning a second language,’ and, ’the most important factor determining ease and difficulty in learning the patterns of a foreign language is their similarity to or difference from the patterns of the native language’. (Lado, 1964) "chief source" and "most important" imply that L1 interference is not conceived to be the only source". (James, 1971)

It seems reasonable to proceed on the assumption that cross linguistic influence plays a role in the errors discovered in a second language learner’s target language, although it is by no means the sole cause of errors.

3. System Objectives
As was stated in the introduction, the main objective of this work was to design and implement a system that would detect syntax errors in English text written by native German speakers. The design involved both practical and theoretical work. The practical work took the form of data collection from a number of native German speakers. From this data "typical" German errors and some universal errors were selected. These errors were then analysed to identify their underlying cause. Once this had been done, a system was designed and implemented which could both parse (a range of) correct sentences of English and also process, and report on, a set of possible incorrect sentences.

Since the objective of this work was to demonstrate the feasibility of the approach, there was no question of producing a parser for the entire English language. Instead, only a restricted set of English sentences is allowed as input to the system. The restriction placed on the user is that he/she can only input declarative sentences. Of course, interrogative and imperative sentence types could be added relatively easily by incorporating these into the grammar. Given the current limitations, however, it is quite possible for erroneous sentences to be accepted or for the system to respond simply that it cannot handle the input.

The range of errors possible in the set of declarative sentences of English is vast and certainly impossible to enumerate. For this reason a system such as this can only hope to handle a limited set of errors, in this case the German-specific errors. So unless the error input by the user is a member of the defined error list, a message will be output saying that the sentence is acceptable or that it is not part of the grammar.

4. Data Collection

This section details the conduct and results of data collection. The English language exercise books of a number of German exchange students were studied for recurring error patterns.
In general, many instances of apparent cross linguistic influence were identified; however, there were also many instances of apparent developmental (i.e. language learning) errors. In some cases the error type could have been attributed either to the developmental process or to cross linguistic influence.

Errors occurred frequently in the choice of word order. German word order can be quite different from English word order, and it was apparent that some German word order was being used in English sentences. For example, the fact that in German some conjunctions send the verb to the end of the clause or sentence was a source of many problems. One sentence written was:
’while this between Bathsheba and Boldwood happens, ...’
The word 'while' is 'während' in German and is one of the conjunctions that sends a verb to the end of the clause or sentence, so the German version is:
'während dieses zwischen Bathsheba und Boldwood passiert, ...'
This was apparently translated literally by the learner into English.

Another word order norm in German is that the verb should come second in the sentence. Again there appears to be some cross linguistic influence on the word order in English. For example:
'secondly exist some cultural societies like the traditional-music society'
and in German:
'Zweitens existieren einige ... '

A similar error type that seemed to be cross-linguistic was the misplacing of the negative element ’not’, as in 'I enjoy not the food very much', which is probably based on 'Ich geniesse das Essen nicht sehr'.

There were also "false friends", where a German word was translated to a similar but noticeably different English word, such as translating 'als' to 'as' (in comparisons) or 'bekommen' to 'become', as well as the misspelling of English words whose German counterparts are spelt slightly differently, notably writing 'befor' instead of 'before'. Disagreements between determiners and nouns or between nouns and verbs were also commonplace.
Based on the errors observed in the data, a system was designed and implemented to deal with the following set of errors:
1. placing the verb at the end of the clause or sentence if the conjunction is one of those that in German send the verb to the end of the sentence or clause, i.e. misplaced verb.
2. placing the negative element (not) directly after the finite verb, instead of adding an auxiliary and placing the negative element after the auxiliary.
3. determiner - noun disagreement.
4. noun - verb disagreement.
5. the phrase "more ... as ..." being used instead of "more ... than ...".
6. specific words being misspelt.
These errors do not purport to be a comprehensive set but were chosen as a good cross-section of the errors discovered. There are many other errors that could be handled; however, with the time and resources available the number of errors being dealt with had to be restricted.

5. Implementation

This system was implemented in Scheme on a PC. Scheme (a dialect of LISP) is well suited to natural language processing. It provides excellent support for recursive programming, which is central to natural language parsing techniques. The system was designed in five main sections: the grammar, the lexicon, the parser, the backtracking algorithm and the error handling devices. These were related as shown in Figure 1.
[Figure 1. Overall System Structure (Lexicon; Input Preprocessor; Grammar; Parser and Backtracker; Output Error Analysis; Spelling Errors)]

A small lexicon of computer related terminology was constructed and a rule-based parser was implemented. It should be noted that the grammar contained both the rules needed to parse the correct English sentences and an additional set of ’error’ rules needed to detect the German-specific errors.

6. Error Detection

The system detects errors in three separate passes and keeps a record of all errors detected for subsequent output. The three passes are described below.

The first process is a ’preprocessor’ and it is applied before the actual parsing of the sentence input by the user. Its function is to search for any words that are typically misspelt by German learners of English, for example "befor" and "allthough". The function takes the sentence input by the user and searches through it word by word, looking for any incorrect spellings. If any errors are encountered, the incorrect version of the word and the correct version are stored, along with an error message, in the global list of error messages. The system then outputs the error message to the user, indicating which word (or words) was wrong and what the correct version of the word is. For example:
User Input: ’the user input the program befor he should have’.
Output to the User: ’The following word has been spelled incorrectly, please re-enter the sentence, using the correct word: Befor Before’.

At present the user must re-enter the sentence with the corrected spelling before he can proceed. When this is completed, the parsing of the sentence can begin. While it is true that these errors could also have been detected later in the process, this would have meant including both the correct and the incorrect words in the lexicon, which seemed wasteful.

The remainder of the error analysis is carried out after the sentence has been parsed successfully and depends on having a complete set of all the rules that were used in the parse process. This set is called ’rulesused’.
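The preprocessing pass described above can be illustrated with a minimal sketch. The original system was written in Scheme; Python is used here purely for illustration, and the names `KNOWN_MISSPELLINGS` and `preprocess` are hypothetical. Only the two misspellings "befor" and "allthough" are taken from the paper, and the message wording is shortened.

```python
# Hypothetical table of misspellings typical of German learners of English.
# The two entries come from the paper; the table layout is an assumption.
KNOWN_MISSPELLINGS = {
    "befor": "before",
    "allthough": "although",
}


def preprocess(sentence, error_messages):
    """Scan the sentence word by word and record any known misspellings.

    Returns a list of (incorrect, correct) pairs and appends one message
    per misspelling to the shared list of error messages.
    """
    found = []
    for word in sentence.split():
        correction = KNOWN_MISSPELLINGS.get(word.lower())
        if correction is not None:
            found.append((word, correction))
            error_messages.append(
                "The following word has been spelled incorrectly: "
                f"{word} -> {correction}"
            )
    return found


if __name__ == "__main__":
    messages = []
    pairs = preprocess("the user input the program befor he should have", messages)
    print(pairs)
    print(messages)
```

Keeping the misspellings in a separate table, rather than in the lexicon, mirrors the design choice stated above: the parser never has to know about incorrect word forms.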
The second process of error detection and analysis involves searching for error rules in ’rulesused’. As mentioned previously, the grammar implemented includes a specific set of incorrect rules for English. If any of these rules were used during parsing then this implies that an erroneous sentence was input by the user. So if ’rulesused’ contains any of the error rules, a message related to the specific error rule used must be added to the list of errors to date. At present the error rules included in the grammar are the rules that reflect the negative element (not) being in the incorrect position in the clause or sentence and the rule where the auxiliary has been omitted altogether. Many other errors could be defined and detected in a similar way.

The third process of error detection is to search specifically for the following errors:
1. Determiner-Noun disagreement.
2. Noun-Verb disagreement.
3. The verb placed incorrectly at the end of a clause or sentence because of an incorrect assumption by the learner that certain conjunctions send the verb to the end of the clause or sentence. (Misplaced verb)
4. The incorrect use of the phrase: "more ... as ...".
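The second and third passes might be sketched as follows. This is an illustration in Python, not the Scheme implementation: the rule names, the toy number tables and the function names are all hypothetical, and the third pass here works on a flat word list, whereas the system described above works from the result of a successful parse. Only two of the four third-pass checks are shown.

```python
# Hypothetical error-rule names mapped to messages. The paper does not list
# its rule names; the first message is quoted from the example run.
ERROR_RULE_MESSAGES = {
    "vp-misplaced-not": "the negative element (not) is in the incorrect "
                        "position. It should occur directly after the auxiliary",
    "vp-missing-aux": "the auxiliary has been omitted",  # wording assumed
}

# Toy number tables (assumptions standing in for the lexicon).
DETERMINER_NUMBER = {"this": "sg", "that": "sg", "these": "pl", "those": "pl"}
NOUN_NUMBER = {"program": "sg", "programs": "pl",
               "technique": "sg", "techniques": "pl"}


def second_pass(rules_used):
    """Return a message for every error rule that appears in rules_used."""
    return [ERROR_RULE_MESSAGES[rule] for rule in rules_used
            if rule in ERROR_RULE_MESSAGES]


def third_pass(tokens):
    """Check determiner-noun agreement and the 'more ... as' construct."""
    words = [token.lower() for token in tokens]
    errors = []
    for i, word in enumerate(words):
        following = words[i + 1] if i + 1 < len(words) else None
        # Determiner immediately followed by a noun of a different number.
        if word in DETERMINER_NUMBER and following in NOUN_NUMBER:
            if DETERMINER_NUMBER[word] != NOUN_NUMBER[following]:
                errors.append("the determiner doesn't agree with the noun: "
                              f"{word.upper()} {following.upper()}")
        # 'more' followed by 'as' with no earlier 'than' (a crude test).
        if word == "more":
            rest = words[i + 1:]
            if "as" in rest and ("than" not in rest
                                 or rest.index("as") < rest.index("than")):
                errors.append("the 'more as' construct is not acceptable in "
                              "English instead use the construct 'more than'")
    return errors


if __name__ == "__main__":
    print(second_pass(["s-np-vp", "vp-misplaced-not"]))
    sentence = "These program was more efficient as those program"
    for message in third_pass(sentence.split()):
        print(message)
```

For the sentence 'These program was more efficient as those program' this sketch reports three errors in the order determiner, 'more as', determiner. Noun-verb disagreement and the misplaced verb are omitted because they need information about clause structure that a flat word list does not provide.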
7. Example Run

This section describes a typical session with the current system. The examples shown cover most of the error types currently detected. A few examples are also given of errors that the system cannot currently detect: cases where the system informs the user that the input is correct when in fact it contains an error, and cases where the system rejects a correct sentence because there is no relevant rule in the current grammar.

Note: In the following example, lines beginning with '>' are user input and all other lines are system responses.

[1] (startup '())
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> The user did not input data , although the program data required q
Your input was incorrect, you have made 1 error
"The verb is incorrectly placed at the end of the sentence" REQUIRED
Would you like to input another sentence?
> y
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> The user did input not his data q
Your input was incorrect, you have made 1 error
"the negative element (not) is in the incorrect position. It should occur directly after the auxiliary"
Would you like to input another sentence?
> y
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> These program was more efficient as those program q
Your input was incorrect, you have made 3 errors
"The determiner doesn’t agree with the noun, the determiner here must have a plural noun following it" THESE PROGRAM
"the ’more as’ construct is not acceptable in English instead use the construct ’more than’ "
"the determiner doesn’t agree with the noun, the determiner here must have a plural noun following it" THOSE PROGRAM
Would you like to input another sentence?
> y
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> The techniques are more useful as the software q
Your input was incorrect, you have made 1 error
"the ’more as’ construct is not acceptable in English instead use the construct ’more than’ "
Would you like to input another sentence?
> y
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> Many problems have had been solved q
Your input is correct.
Would you like to input another sentence?
> y
Please type in your input putting a space between each word and a ’q’, followed by a carriage return to finish
> The personal computer was efficient
This sentence is not acceptable by the system’s grammar, please check it with someone.
Would you like to input another sentence?
> n
okay
8. Evaluation and Future Work

The current prototype system is inadequate in many ways and will require considerable extension and improvement before it can be of use to students in writing correct English. Both the grammar and the lexicon are severely restricted, and while neither could ever be "complete" in any absolute sense, it is feasible to extend the lexicon to cover the majority of terms likely to be encountered. In fact the lexicon could be restructured so that it consists of a "general" lexicon, which contains the basic vocabulary needed irrespective of the area in question, along with a set of ’subject-specific’ lexicons relating to specific areas. A user could then choose as many sub-lexicons as necessary to handle the sentence types being input. A feature might also be added so that the lexicon could "learn" new words. This would allow the user to add new words, and relevant information about these words, as they are encountered. Such a lexicon design, analogous to the specialist dictionaries in spell-checkers, would allow users in many areas to use the system.

The incompleteness of the grammar for English is more difficult to remedy. No complete grammar is really feasible, and even if it were, it is likely that the user would want the freedom to create ungrammatical sentences for purposes of illustration, novelty or dramatic effect. Another, easier, development would be to survey more source data and thereby increase the coverage of error types.

Finally, the human-computer interface must be greatly improved, making use of windows, highlighting and possibly sound so that the user’s input and the error messages are clearly related in the display, possibly producing a suggested correct version of the sentence for the user’s approval. It is anticipated that these improvements will require a complete re-implementation of the system. One radical change which might also be considered is to drop the parsing of correct English sentences and only look for the defined errors.
This might be done on a monitoring basis so that the syntax-checker ran in the background and only drew the user’s attention when a possible error was detected.

Aside from improving the current system, it could be used to support a number of types of studies, both longitudinal and cross-sectional. Longitudinal study, using the same group of learners over time, might identify which errors are prone to occur at certain stages of the language learning process. Such knowledge could then be incorporated into the system so that the student’s current level (e.g. total beginner, advanced student etc.) and the relevant error list could guide the search. Cross-sectional study, collecting all the interactions with users over a fixed time period, would provide statistical information on the variety and frequency of errors.
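The layered lexicon proposed earlier in this section (a general lexicon, user-selected subject-specific sub-lexicons, and words "learned" from the user) could be organised roughly as in the sketch below. This is a design illustration in Python under stated assumptions, not part of the implemented system: the class name, method names, lookup order and the single part-of-speech string stored per word are all hypothetical.

```python
class LayeredLexicon:
    """A general lexicon plus optional subject-specific sub-lexicons.

    Lookup order (an assumption): words learned from the user first,
    then the active sub-lexicons in activation order, then the general
    lexicon.
    """

    def __init__(self, general):
        self.general = {word.lower(): cat for word, cat in general.items()}
        self.sublexicons = {}   # sub-lexicon name -> {word: category}
        self.active = []        # names of the sub-lexicons chosen by the user
        self.learned = {}       # words added by the user during a session

    def add_sublexicon(self, name, entries):
        self.sublexicons[name] = {w.lower(): c for w, c in entries.items()}

    def activate(self, name):
        if name not in self.sublexicons:
            raise KeyError(name)
        if name not in self.active:
            self.active.append(name)

    def learn(self, word, category):
        """Record a new word supplied by the user."""
        self.learned[word.lower()] = category

    def lookup(self, word):
        """Return the category of a word, or None if it is unknown."""
        key = word.lower()
        if key in self.learned:
            return self.learned[key]
        for name in self.active:
            if key in self.sublexicons[name]:
                return self.sublexicons[name][key]
        return self.general.get(key)


if __name__ == "__main__":
    lexicon = LayeredLexicon({"the": "determiner", "was": "verb"})
    lexicon.add_sublexicon("computing", {"program": "noun", "software": "noun"})
    print(lexicon.lookup("program"))   # unknown until the sub-lexicon is active
    lexicon.activate("computing")
    print(lexicon.lookup("program"))
    lexicon.learn("compiler", "noun")
    print(lexicon.lookup("compiler"))
```

A real lexicon entry would carry more than a category (number, verb form and so on); a single string is used here only to keep the sketch short.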
Overall it is felt that the limited study carried out so far has shown the feasibility and desirability of this approach to computer-aided language learning.

Acknowledgement

The authors are indebted to David Singleton of Trinity College Dublin for his advice and information on error analysis in general and the contrastive analysis hypothesis in particular.

References

Corder 74
Corder S.P., "The Significance of Learner’s Errors", in Error Analysis: Perspectives on Second Language Acquisition, Richards J. (Ed.), 1974
James 71
James C., article in Singleton 1981
Kadler 70
Kadler E.H., Linguistics and Teaching Foreign Languages 1970
Lado 57
Lado R., article in Singleton 1981
Littlewood 74
Littlewood W., Foreign and Second Language Learning, Cambridge University Press, 1974
Nemser 75
Nemser, article in Singleton 1981
Singleton 81
Singleton D.M., "Language Transfer : A review of some recent research", Centre for Language and Communication Studies, Trinity College Dublin, 1981