TOWARDS A COMPLETE DETECTION/CORRECTION SYSTEM

J. COURTIN, D. DUJARDIN, I. KOWARSKI, D. GENTHIAL, V.L. DE LIMA
Laboratoire de Génie Informatique, IMAG-Campus BP 53X, F-38041 Grenoble CEDEX, FRANCE
Phone : (33) 76 51 48 78

Abstract :

In this paper we describe a system for detection and correction of lexical and syntactic errors in a French text. The lexical level uses several techniques (keys, phonetics and morphological generation). The aim of the syntactic level is verification of the structure of a sentence and of relations between words (e.g. agreement). It proceeds by unification of the decorations on the nodes of a dependency tree, by application of condition-action rules. By use of a morphological generator, corrections can be offered for agreement errors. We propose to integrate the two levels in a complete tool, based on a real size lexicon.

Keywords :

morphological parsing and generation, similarity keys, phonetics, syntactic parsing, syntactic verification, unification, concordancy correction

1. INTRODUCTION

We are developing an interactive system for the detection and correction of lexical and syntactic errors in written French text. The system is modular and is composed of two main parts : morphology and syntax. The morphology component performs the lexical parsing of the sentence, while the syntax component performs the syntactic parsing of a sentence assumed to be morphologically correct (we use the morphological and syntactic parsers of the PILAF system (Interactive Linguistic Procedure Applied to French) [COURTIN 77]).

The aim of our system is to detect and correct lexical errors and agreement errors. To this end, we have implemented several error correction methods, which are inserted in the morphological and syntactic processing modules. Figure 1 presents the general architecture of the system. Three units are at present functional : a prototype (called DECOR [COHARD 88]) which performs detection and correction at the lexical level ; a syntactic verifier, which detects the existence of errors at the syntactic level and corrects agreement errors.

In this paper, we briefly describe the error handling mechanisms used at the lexical level, then present in detail detection and correction at the syntactic level, followed by an analysis of the system we have implemented. We conclude with a discussion of perspectives and proposals for lexico-syntactic detection and correction of texts written in natural language.

International Conference on Current Issues in Computational Linguistics, Penang, Malaysia


[Figure 1 : Architecture. General architecture of the system, in two columns. MORPHOLOGY : morphological parsing (and detection of lexical errors) ; correction of lexical errors (skeleton key, phonetics). SYNTAX : syntactic parsing ; syntactic verification ; correction of concordancy mistakes. A generation module also appears in the diagram.]

2. THE LEXICAL LEVEL

Morphological parsing of a sentence yields the lexico-syntactic variables associated with each word (category, gender, number...). A word which cannot be analyzed triggers the correction process. To treat errors found at the lexical level, we have implemented three different correction techniques.

The first technique is based on the skeleton key (described in detail in [POLLOCK 84, LU 86]). It relies on the idea that if a string can be transformed into a dictionary word by inverting one of the fundamental editing operations (insertion, omission, substitution, transposition), then this word is a possible correction for the string. The correction algorithm computes a key for the misspelt form. For instance, let disaue be the written form to be corrected. We generate the key DSIAUE by keeping the first letter, then appending the consonants (once each) in their order of appearance, and then adding the vowels of the misspelt word in the same way. We then search the dictionary for words whose keys are similar to the computed key. From this set of words, one or several alternatives are selected as corrections : those which differ from the original word by at most one of the editing operations listed above (for instance disque, obtained by substituting q for the a of disaue).

The second method obtains phonetic equivalents for the misspelt word (a complete description may be found in [COURTIN 88]). By means of a transduction processor, we compute, starting from any graphic string, all the possible phonetic transcriptions associated with this string. For instance, phonetic parsing of chevale gives the transcriptions ch.é.v.a.l., k.é.v.a.l., ch.e.v.a.l. and k.e.v.a.l. From these phonetic transcriptions, we generate written forms, which are the correction proposals for the misspelt word. For chevale, for instance, we compute the written forms kevale, keval, chevale, cheval, chéval, chèval, etc. A final check retains only the correct graphic strings (in the example, cheval).

The third method relies on morphology. Morphological parsing of a word by means of the finite state transducer can determine a root and a set of inflections. If an error is detected on a term which has a correct root, we can process this term by supposing that the remaining part of the string can be parsed starting at the initial state of the automaton. This allows us to obtain certain grammatical variables associated with the misspelt word. We use this method to correct words in which a root or an ending has been misused, as in chevals. This string is incorrect, but the root chev yields the data common noun. Then, parsing the ending als starting at the initial state yields the markers masculine and plural. This kind of error is corrected by means of the morphological generator of PILAF. For a detected root, we generate all the forms corresponding to the grammatical variables found in the misspelt word (in the example, chevaux).
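The skeleton key technique described in this section can be sketched in Python as follows. This is only an illustration, not the DECOR implementation : the function names, the vowel set, the toy lexicon and the reading of "similar keys" as "neighbours in key-sorted order within a window" are all assumptions made for the example.

```python
import bisect

# Assumption: letters treated as vowels (French accented vowels included).
VOWELS = set("aeiouyàâäéèêëîïôöùûü")

def skeleton_key(word):
    """First letter, then the other distinct consonants, then the distinct vowels."""
    word = word.lower()
    if not word:
        return ""
    seen = {word[0]}
    consonants, vowels = [], []
    for ch in word[1:]:
        if ch in seen or not ch.isalpha():
            continue
        seen.add(ch)
        (vowels if ch in VOWELS else consonants).append(ch)
    return (word[0] + "".join(consonants) + "".join(vowels)).upper()

def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, omission,
    substitution or transposition of adjacent letters."""
    if a == b:
        return True
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    i = 0
    while i < min(la, lb) and a[i] == b[i]:
        i += 1                                  # skip the common prefix
    if la == lb:
        if a[i + 1:] == b[i + 1:]:
            return True                         # substitution
        return (i + 1 < la and a[i] == b[i + 1] and a[i + 1] == b[i]
                and a[i + 2:] == b[i + 2:])     # transposition
    if la < lb:
        a, b = b, a                             # make a the longer string
    return a[i + 1:] == b[i:]                   # one extra letter in a

def suggest(misspelt, lexicon, window=50):
    """Hypothetical search: look at words whose keys sort near the key of
    the misspelt form, and keep those at most one editing operation away."""
    entries = sorted((skeleton_key(w), w) for w in lexicon)
    keys = [k for k, _ in entries]
    pos = bisect.bisect_left(keys, skeleton_key(misspelt))
    nearby = entries[max(0, pos - window):pos + window]
    return [w for _, w in nearby if within_one_edit(misspelt, w)]

toy_lexicon = ["disque", "dique", "cheval", "risque"]   # illustrative only
print(skeleton_key("disaue"))
print(suggest("disaue", toy_lexicon))
```

The key only narrows the search ; the final filter is what enforces the "at most one editing operation" criterion stated in the text.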

3. SYNTACTIC PARSING

The aim of syntactic parsing in the framework of a detection/correction system is to establish relationships between words or groups of words, in order to apply verification rules such as :
• the verb agrees in number with its subject,
• the determiner and the adjective agree in gender and in number with the noun they depend on,
• a transitive verb can accept only one direct object complement,
• ...

On the other hand, the parser must be able to build structures disregarding agreement rules, and must thus accept sentences such as :
le chat est belle.
les chat chasse les souris.

We have used the syntactic parser of PILAF, which builds all the decorated tree structures associated with a given sentence that are compatible with a given dependency grammar. It must be noted that this parser takes as input a sequence of lexical categories (verb, common noun,...), thus supposing that all the lexical level errors (unknown words) have been resolved. Example : see figure 2 below.

The grammar is a set of dependency relations of the form :
GOUV * DEP := List of weights
The weights indicate the position of the dependent under the governor : negative weights mean a left dependent and positive weights a right dependent (see [COURTIN 77, 86] for more details). This formalism has the advantage of simplicity : easy implementation and high efficiency. On the other hand, it does not allow any contextual control of the relations other than the relative positions of dependents under a given governor.

Example : the two relations
subc * det := -32
subc * adjq := -16, -15
indicate that a determiner cannot be on the right of an adjective and imply that there will be no determiner between two adjectives.

Similarly, the presence of a rule authorises the building of a tree, but there is no way of expressing that a determiner is compulsory. This has the advantage of allowing incomplete structures to be built, such as [tree : mange governing soupe and chien], which, although incorrect, may be interpreted.
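The way weights constrain the order of dependents can be illustrated with a small Python sketch. It rests on an assumption that the text does not state explicitly : each dependent takes one weight from the list of its relation, and weights must strictly increase from left to right under the governor. The function name can_attach is hypothetical ; see [COURTIN 77, 86] for the actual semantics.

```python
# Relations taken from the example in the text: GOUV * DEP := list of weights.
GRAMMAR = {
    ("subc", "det"): [-32],
    ("subc", "adjq"): [-16, -15],
}

def can_attach(governor, left, right, grammar=GRAMMAR):
    """left / right: categories of the dependents on each side of the
    governor, in surface order. Assumption: weights strictly increase
    from left to right, negative on the left, positive on the right."""
    previous = float("-inf")
    for side, dependents in (("left", left), ("right", right)):
        for dep in dependents:
            candidates = [
                w for w in grammar.get((governor, dep), [])
                if w > previous and (w < 0) == (side == "left")
            ]
            if not candidates:
                return False
            previous = min(candidates)   # greedy: keep room for later dependents
    return True

print(can_attach("subc", ["det", "adjq", "adjq"], []))
print(can_attach("subc", ["adjq", "det"], []))
print(can_attach("subc", ["adjq", "det", "adjq"], []))
```

Under this reading, the two relations of the example reject a determiner to the right of an adjective and a determiner between two adjectives, as the text indicates, while nothing forces a determiner to be present.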


[Figure 2 : Syntactic parsing. The PARSER takes as input the sentence L'analyseur construit une structure de dépendances, as a sequence of decorated lexical categories, together with the grammar below, and outputs the decorated dependency trees compatible with that grammar. Tree diagrams not reproduced.]

Input :
L' : det (mas, fem, sin, def)
analyseur : subc (mas, sin)
construit : verb (ind, pre, tre, sin)
une : det (fem, sin, ide)
structure : subc (fem, sin)
de : prep
dépendances : subc (fem, plu)

Grammar :
subc * det := -32
verb * subc := -32 32
prep * subc := 32
verb * prep := 24 32 48
subc * prep := 32 48

But these limitations lead to the presence, among the results of parsing, of a large number of incoherent structures.

Examples :
Pierre et Jacques gives : [tree diagrams over the nodes Pierre, et and Jacques ; not reproduced]
il profite de la vie : [tree diagrams over the nodes il, profite, de, la and vie ; not reproduced]

This problem has been temporarily solved by adding to each lexical category a configuration which indicates which tree shapes are allowed for this category. Thus, the category prep now carries a configuration in which a right-hand dependent is required under the preposition, and the coordination conjunction (coco) carries a configuration in which a dependent on each side is required. For most categories, a special configuration indicates that any structure is allowed. This method has allowed us to eliminate many incoherent structures, but it is clear that a good way to solve this problem would be to add to the grammar a means of taking contextual information into account.

The PILAF parser thus builds decorated dependency structures, but these structures are not perfectly adapted to the application of verification rules (defined in paragraph 4) because :
• certain categories which do not have strictly the same syntactic behaviour share the same agreement rules and play the same part in them (an adjective and a past participle agree in the same manner if they are attached to a noun) ;
• the agreement rules only take two categories into account and it is sometimes necessary to specify the context. Thus, for a verb it is important to know whether it is part of a relative clause or not, and if so, to know the referent of the pronoun.

This has led us to define an interface between the syntactic parser and the syntactic verifier which allows :
• regrouping categories which have the same behaviour concerning agreement, for instance :
adjq (qualifying adjective), adji (indefinite adjective), ppas (past participle) --> adjo
adv (adverb), padv (adverbial preposition), loca (adverbial locution) --> advo
• modification of certain structures in order to adapt them more readily to detection/correction. This restructuring aims to obtain more regularity, so that verification rules provided for simple structures can be applied to complex structures.

Example (each tree is written as governor(dependents)) :
est(soupe(la), et(bon, chaud)) becomes est(soupe(la), bon, chaud)

We thus avoid having special rules for the verification of coordinate structures, by applying twice the rules provided for the "to be" + adjective construction.

Although its power is limited by the simplicity of the formalism it uses, the syntactic parser of PILAF has allowed us to implement and test a prototype for detection/correction at the syntactic level. Present efforts are directed towards improving the expressive power of the grammar and integrating semantic knowledge.


4. DETECTION AND CORRECTION AT SYNTACTIC LEVEL

4.1. The aims

Once we have the dependency structures associated with a sentence (computed by the syntactic parser), the syntactic verifier determines whether they are valid in French, in other words whether the construction is correct or not.

Examples :
le soupe is considered incorrect : there is an error in gender agreement between the determiner and the noun ;
je le vous donne : misplacement of pronouns ;
il mangeras : error of agreement in person between the subject and the verb.

Syntactic verification consists in applying a set of syntax rules which describe the most fundamental relationships between the components of a clause written in French, with the aim of detecting syntactic errors. The relationships we have considered include the links between a noun and its determiner, a noun and its adjective, a verb and its subject, a verb and its attribute, as well as the coordination conjunction et, and subordinations depending on a relative pronoun or on a subordination conjunction.

Our system [STRUBE DE LIMA 90] is limited at present to the correction of a subset of the syntactic errors it can detect : the agreement errors. However, correction of other classes of syntactic errors, such as transposition of words, is made possible by the use of specific correction modules. Correction of agreement errors is performed by generation based on the parameters computed during syntactic verification. For instance, to correct il mangeras we retain the number and person of the subject il and the tense and mood of the verb, generating mangera. If a correction proposal is selected, we substitute the new word for the mistake and a new syntactic verification is launched, until the sentence is completely corrected.

4.2. The process for syntactic checking and correction

Syntactic checking is performed by traversing a dependency tree in order to analyze the relations between governors and dependents. When processing a pair of nodes which have a correct syntactic relationship, we unify the syntactic features, producing a single node to which is attached the lexico-syntactic data resulting from unification. This process uses a base of syntactic rules and meta-rules pertaining to relationships between the components of a clause. In case of error, the syntactic checker passes the word and the required morphological variables to the correction module, which generates the corresponding correct forms. If a correction proposal is chosen, the wrong word is replaced by the corrected word directly in the syntactic tree (the syntactic parsing is not redone) and a new verification is launched, until the sentence is completely corrected.

Our system for verification of a sentence uses only the morphological and syntactic variables, without taking semantic variables into account. This choice simplifies and generalizes the process, since it does not exclude structures for a sentence taken out of its context (we consider only the syntactic structure of the sentence, without any reference to its meaning).

Examples : Le chêne pleurait à regarder le roseau. is a sentence which the syntactic verifier will accept. Elle mange lire, also considered to be correct, is treated, as far as syntax is concerned, like the sentence Elle aime lire.

The syntactic checking program makes use of two important structures : the dependency trees associated with a sentence (they contain, for each node, the word, its lexico-syntactic category and the associated morphological variables) and the syntactic verification rules, which describe a subset of the French language.


Example of a dependency tree (indentation shows dependency) :

aime verbo (sin tre uno pre ind) (sin tre uno pre sub) (sin dos imp)
    chat gnomo (sin mas)
        le detpo (sin mas tre cod)
        noir adjo (sin mas)
    souris gnomo (sin plu fem)
        la deto (sin fem tre cod)

Figure 3 : Dependency tree

Codes used for the lexico-syntactic categories : verbo : verb ; deto : determiner ; detpo : determiner pronoun ; gnomo : noun ; adjo : adjective.

Codes used for the morphological variables : sin : singular ; plu : plural ; mas : masculine ; fem : feminine ; uno : 1st person ; dos : 2nd person ; tre : 3rd person ; ind : indicative ; sub : subjunctive ; imp : imperative ; pre : present ; cod : direct object complement.

Remark : during the processing of a tree by the syntactic verifier, we add the lexico-syntactic data associated with nodes to the lists of morphological variables. Example : during the checking of the tree shown above, processing the nodes la and souris produces a single node, bearing the word souris, whose associated data includes a variable specifying that souris has already "taken" its article.

4.2.1. The rules

The rules for syntactic checking constitute a set of clauses associated with a PROLOG predicate, which are applied in order to determine whether a construction is syntactically correct or not in French. The objective of a rule is to accept a certain link between two nodes of the syntactic tree, either a left son → father link or a father → right son link, producing two results : a list of lexico-syntactic data and an indicator of detection of an agreement error.

A rule is composed of three parts :
• a header or syntactic proposal, which takes into account the lexico-syntactic categories of the nodes to which the rule is applicable ;
• a conditional part (called IF), which contains the conditions required for applicability of this rule ;
• an expression (called THEN), which computes the lexico-syntactic data associated with the node resulting from the application of this rule.

The rule given below as an example (figure 4) applies to a pair of nodes made up of a deto (determiner) on the left and a gnomo (noun) on the right : (deto * gnomo). The elements of the result which are computed are : the gender and number of the gnomo, and the lexico-syntactic constants art, which means that the gnomo has already "taken" its article, and tre, which denotes that the gnomo is in the third person. Other possible lexico-syntactic data include set, which signals that a verb has already "taken" its subject, and obj, which signals that a verb has already "taken" its object. The output also contains the parameter REP, which indicates the existence of an agreement problem between the two nodes. REP leads to correction by generation if the rule has detected an agreement error.


rule(ms(LeftWord,deto,Left),
     ms(RightWord,gnomo,Right),
     R,Rep) :-
    /* IF */
    coraccord(ms(LeftWord,deto,Left),
              ms(RightWord,gnomo,Right),
              ["GNR","NBR"],R1,Rep),
    ecr_res(1,ms(LeftWord,deto,Left),
              ms(RightWord,gnomo,Right)),
    /* THEN */
    somme([R1,["art","tre"]],R).

Figure 4 : A verification rule

Inside a rule, agreement checking is performed by the predicate coraccord. This predicate is also in charge of storing the recovery parameters in the case of agreement errors. The rule shown as an example checks agreement in gender (GNR) and number (NBR) between the two nodes. Of course, not all rules contain agreement checking : processing a pair of nodes without agreement checking serves to propagate through the tree the data associated with nodes, taking into account the relationships which exist between nodes. The ecr_res predicate in the IF part of the example takes care of displaying a trace.

We have developed an elementary set of rules, which constitutes the kernel of syntactic checking. Access to this set of rules is coordinated by a set of meta-syntactic rules. The aim of a meta-rule is to direct rule application : by checking the lexico-syntactic categories of the two nodes in the pair being processed, a meta-rule allows similar situations to be processed by a single rule. For example, processing of the links qualifying adjective → common noun and common noun → qualifying adjective is directed towards the same rule.
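For readers unfamiliar with PROLOG, the determiner-noun rule of figure 4 can be paraphrased in Python. This is a loose sketch and not the authors' code : the Node type, the representation of variables as sets, the treatment of an unspecified variable as compatible with anything, and the choice of the determiner as the word to be corrected are all assumptions made for the illustration. Only the names coraccord, GNR, NBR, art and tre come from the paper.

```python
from typing import NamedTuple

GENDER = frozenset({"mas", "fem"})
NUMBER = frozenset({"sin", "plu"})
DIMENSIONS = {"GNR": GENDER, "NBR": NUMBER}

class Node(NamedTuple):          # hypothetical counterpart of ms(Word, Category, Variables)
    word: str
    category: str
    features: frozenset

def coraccord(left, right, dims):
    """Check agreement on the given dimensions.
    Returns (shared values, rep); rep is None when the nodes agree,
    otherwise it holds the parameters kept for correction."""
    shared, mismatched = set(), []
    for dim in dims:
        values = DIMENSIONS[dim]
        lv, rv = left.features & values, right.features & values
        if lv and rv and not (lv & rv):
            mismatched.append(dim)
            shared |= rv                      # assumption: the noun's values win
        else:
            shared |= (lv & rv) if (lv and rv) else (lv | rv)
    rep = None
    if mismatched:
        rep = {"word": left.word, "dims": mismatched,
               "target": sorted(right.features & (GENDER | NUMBER))}
    return shared, rep

def rule_deto_gnomo(left, right):
    """Header: (deto * gnomo). IF: agreement check. THEN: add art and tre."""
    if (left.category, right.category) != ("deto", "gnomo"):
        return None                           # not a candidate rule for this pair
    r1, rep = coraccord(left, right, ["GNR", "NBR"])
    return r1 | {"art", "tre"}, rep

le = Node("le", "deto", frozenset({"sin", "mas"}))
la = Node("la", "deto", frozenset({"sin", "fem"}))
soupe = Node("soupe", "gnomo", frozenset({"sin", "fem"}))
print(rule_deto_gnomo(la, soupe))
print(rule_deto_gnomo(le, soupe))
```

As in the PROLOG rule, the result carries both the data of the merged node and a separate indicator (here rep) that an agreement error was found.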

4.2.2. Verification method

Syntactic verification is performed by traversing the dependency trees in postfix order, as shown in figure 5 for the sentence il mange la soupe. N.B. : suj indicates a subject.

Let (left * right) be the pair of nodes being processed : a left son with its father, or a father with its right son. The verifier tries to find a syntactic rule which can apply to the pair (left * right). A rule is a candidate for application when its header concerns the current (left * right) pair. If several rules are defined for a pair, the first is tried ; the next one is considered only if that candidate is inapplicable. The search ends as soon as a candidate rule has been applied.

A rule is said to be applicable if :
1) it concerns the lexico-syntactic categories associated respectively with the left and right nodes of the pair being processed (it is a candidate rule), and
2) all the conditions it contains are true.

Tree for il mange la soupe (indentation shows dependency) :

mange verbo (sin tre uno pre ind) (sin tre uno pre sub) (sin dos imp)
    il popel (sin mas tre suj)
    soupe gnomo (sin fem)
        la detpo (sin fem tre cod)

Step 1 : mange with its left son il
Result : mange verbo (mas sin tre pre ind set) (mas sin tre pre sub set)

Step 2 : soupe with its left son la
Result : soupe gnomo (fem sin art tre)

Step 3 : mange with its right son soupe
Result : mange verbo (sin tre pre ind obj) (sin tre pre sub obj)

Figure 5 : Example of verification

The result of the application of a rule to a (left * right) pair is a single node, which retains the lexico-syntactic data produced by the verification. This data, in list form, is computed by the THEN part of the rule if all the conditions (IF part) are true. If a candidate rule is found but one of its conditions is not true, we have a case of failure. Cases of failure due to non-agreement between the two components of the pair being processed are taken into account : the morphological variables associated with the words are retained with a view to correction (see Part 4.3). In cases of failure other than agreement errors, the system continues its search for an applicable rule.

If no applicable rule is found during verification of a pair of nodes, the user is warned that the system contains no rule capable of processing the given sentence. Such a situation can arise for two reasons :
A) the rules which have been defined are not sufficient,
B) the proposed sentence is not acceptable in French.

To deal with situation A, the missing rules must be added in order to cover exhaustively all possible (left * right) pairs. Situation B results from the insufficiency of the syntactic parser of PILAF (see Part 3). The solution to both problems is to widen the linguistic scope of the system : we have completed the implementation of a large dictionary (220 000 forms), and are studying a more complete and more reliable dependency grammar.
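The verification method of this section can be sketched as a recursive traversal in Python. The sketch is deliberately simplified and is not the authors' verifier : each node carries a single set of variables (the verb's alternative readings are collapsed into one), the merged node keeps the father's word, the three rules shown are illustrative stand-ins for the real rule base, and all function names are hypothetical.

```python
from dataclasses import dataclass, field

GENDER = frozenset({"mas", "fem"})
NUMBER = frozenset({"sin", "plu"})
PERSON = frozenset({"uno", "dos", "tre"})

@dataclass
class Tree:
    word: str
    category: str
    features: frozenset
    left: list = field(default_factory=list)    # left sons, in surface order
    right: list = field(default_factory=list)   # right sons, in surface order

def agree(a, b, values):
    """Assumption: an unspecified variable is compatible with anything."""
    av, bv = a & values, b & values
    return not av or not bv or bool(av & bv)

def subject_verb(subject, verb):
    ok = agree(subject, verb, NUMBER) and agree(subject, verb, PERSON)
    return verb | {"set"}, ok                   # the verb has "taken" its subject

def determiner_noun(det, noun):
    ok = agree(det, noun, GENDER) and agree(det, noun, NUMBER)
    return noun | {"art", "tre"}, ok            # the noun has "taken" its article

def verb_object(verb, obj):
    return verb | {"obj"}, True                 # no agreement check in this rule

RULES = {("popel", "verbo"): subject_verb,
         ("detpo", "gnomo"): determiner_noun,
         ("verbo", "gnomo"): verb_object}

def reduce_pair(left, right, trace):
    (lword, lcat, lfeats), (rword, rcat, rfeats) = left, right
    rule = RULES.get((lcat, rcat))
    if rule is None:
        raise LookupError(f"no rule for ({lcat} * {rcat})")
    feats, ok = rule(lfeats, rfeats)
    trace.append((lword, rword, ok))            # ok is False on an agreement error
    return feats

def verify(node, trace):
    """Sons are reduced before being paired with their father."""
    feats = node.features
    for son in node.left:                       # left son -> father links
        son_feats = verify(son, trace)
        feats = reduce_pair((son.word, son.category, son_feats),
                            (node.word, node.category, feats), trace)
    for son in node.right:                      # father -> right son links
        son_feats = verify(son, trace)
        feats = reduce_pair((node.word, node.category, feats),
                            (son.word, son.category, son_feats), trace)
    return feats

il = Tree("il", "popel", frozenset({"sin", "mas", "tre", "suj"}))
la = Tree("la", "detpo", frozenset({"sin", "fem", "tre", "cod"}))
soupe = Tree("soupe", "gnomo", frozenset({"sin", "fem"}), left=[la])
mange = Tree("mange", "verbo", frozenset({"sin", "tre", "pre", "ind"}),
             left=[il], right=[soupe])
trace = []
print(sorted(verify(mange, trace)))
print(trace)
```

On this tree the pairs are processed in the same order as the three steps of figure 5 : il with mange, then la with soupe, then mange with soupe.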

4.3. Correction

Failure due to an agreement error launches the correction process, which takes the syntactically incorrect words and the variables required for correction, and uses the morphological generation module of PILAF. The parameters we retain for correction are the position of the incorrect word in the sentence, the word itself and the morphological variables for generation. The variables to be retained depend on the lexico-syntactic categories of the words being processed. For example, in order to generate a verbal form we retain the number and person of the subject, and the original mood and tense of the verb in the sentence. In order to generate an adjective, we need the gender and the number of the noun it qualifies. Example : to correct noire in le chat noire we retain : 3/noire, mas, sin. We correct a word by computing its morphological base and then generating the forms derived from this base which are characterised by the set of generation variables.

When a set of correction proposals is displayed, the user can :
• either choose one of these proposals ;
• or invalidate the correction attempt and go on to the next tree for the same sentence.

If an option for correction is chosen :
1) the sentence is modified ;
2) if necessary, the tree being processed is modified ;
3) all the trees produced for the corrected sentence are eliminated, except the tree being processed (we suppose that if the user has chosen a correction for a certain tree, the others can be discarded) ;
4) a new syntactic verification validates the correction.

Note that the system redoes neither the morphological parsing nor the syntactic parsing of the corrected sentence : it uses the existing structures and modifies them.
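The correction step can be illustrated with a toy generator in Python. The paper does not show the morphological generator of PILAF, so the paradigm table and the functions base_of, propose and apply_correction below are hypothetical stand-ins ; only the example parameters (3/noire, mas, sin) come from the text.

```python
# Toy paradigm table standing in for a morphological generator
# (illustrative data only).
PARADIGMS = {
    "noir": {
        ("mas", "sin"): "noir",
        ("fem", "sin"): "noire",
        ("mas", "plu"): "noirs",
        ("fem", "plu"): "noires",
    },
}

def base_of(word):
    """Find the morphological base of a known form, or None."""
    for base, forms in PARADIGMS.items():
        if word in forms.values():
            return base
    return None

def propose(word, variables):
    """Generate the forms of the word's base that carry all the
    variables retained during verification."""
    base = base_of(word)
    if base is None:
        return []
    wanted = set(variables)
    return [form for feats, form in PARADIGMS[base].items()
            if wanted <= set(feats)]

def apply_correction(words, position, replacement):
    """Replace the word at a 1-based position, leaving the input untouched."""
    corrected = list(words)
    corrected[position - 1] = replacement
    return corrected

sentence = ["le", "chat", "noire"]
proposals = propose("noire", ["mas", "sin"])     # parameters 3/noire, mas, sin
print(proposals)
print(apply_correction(sentence, 3, proposals[0]))
```

Keeping the position alongside the word mirrors the parameters listed above and lets the replacement be made in place, without parsing the sentence again.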

4.4. An example of syntactic verification and correction

Consider the sentence Le chat noire et le chien mange. The syntactic structure produced for this sentence is the following (written as governor(dependents)) :
mange(et(chat(le, noire), chien(le)))

We start syntactic verification of the tree by processing the following pairs :
1 le * chat → chat
2 chat(masc) * noire(fem) → agreement error
Proposal for correction : noir

If the user accepts this proposal, the word noire is replaced by noir, and syntactic verification restarts :
1 le * chat → chat
2 chat * noir → chat
3 chat * et → et
4 le * chien → chien
5 et * chien → chien (number : plural)
6 chien(plural) * mange(singular) → agreement error
Proposal for correction : mangent

If the user accepts this proposal, the word mange is replaced by mangent. Syntactic verification restarts and we finally have a sentence considered to be correct.

4.5. Past participle agreement

Agreement rules are less straightforward for the past participle than for the noun group (determiner, noun, adjective) or for the verb group (subject, verb). They have certain particularities which require specific procedures. We correct the past participle used as an epithet (un homme averti), used with the auxiliary "être" (including the case of pronominal verbs) and used with the auxiliary "avoir" (even if the direct complement is placed before the verb, as in la soupe que j'ai mangée).

In classical French grammar, agreement of the past participle followed by an infinitive takes into account the relationship between this participle and the pronoun which precedes it [GREVISSE 69], thus allowing correct sentences such as :
Les chanteuses que j'ai entendues chanter. (the direct object is related to the participle, which must agree with it)
Les chansons que j'ai entendu chanter. (the direct object is related to the infinitive, and the participle must remain invariable)

Analyzing this type of situation would require the use of semantic features in order to know whether the subject is capable of performing the action described by the verb. We have therefore opted for correction according to a simplified syntax (authorised by the minister Georges Leygues, Paris, 26/02/1901), which allows a past participle conjugated with the auxiliary avoir and followed by an infinitive to remain invariable, whatever the gender and number of the complements which precede it.

4.6. Data required for an exhaustive set of agreement rules

The difficulties met during the design of a syntactic verifier have led us to a certain number of considerations concerning the data which is necessary at the lexical and syntactic levels. Syntactic checking, even for a basic subset of the language, requires knowledge of both lexical and syntactic data, such as that which describes the nominal group, verbs and complements. Furthermore, even at the stage of agreement verification, we need semantic level data. Example : La souris blanche du voisin que le chat a mangé(e) ?

As to rule writing, certain lexical categories are far more diversified than in the usual grammars, which makes it more difficult to determine groups of syntactic properties. For instance, each verb should be attached to certain specific categories for the purpose of syntactic checking. The traditional categories "direct transitive", "indirect transitive", etc. are insufficient for automatic syntactic verification, and new categories must be proposed, taking into account the categories which may follow the verb, the possibility of accepting a relative clause, etc. The system designed by Lapalme and Lacouture [LACOUTURE 86, 88] uses, for instance, the attribute compl to note that a verb accepts a relative clause.

In order to take these considerations into account, we propose to refine the data attached to the categories nouns, verbs and adjectives. We also propose to split the categories relative pronoun and coordination conjunction. Nouns must be analyzed with regard to their ability to assume the role of subject or direct object of a given verb. This ability is an important semantic feature, and superficial parsing cannot solve the problem we have met in correcting concordancy errors.

Use of a large-scale dictionary, such as the one we now have, entails a tedious task of classification, which is also necessary for work towards a more powerful syntactic parser (including and manipulating semantic information).


The classification used at present in PILAF associates verbs with conjugation paradigms. In order to perform syntactic checking and agreement error correction, we must also know :
a) the transitivity of verbs - we must know if a verb is intransitive (such as dormir in Elle dort difficilement), direct transitive (such as manger in Il a mangé la pomme), indirect transitive (such as succéder in Ce jeune homme succédera à son père) or both direct and indirect transitive (such as offrir in Il offre des fleurs à sa mère), that is, whether it accepts a direct object complement, an indirect object complement or both.
b) the possibility of accepting a relative clause - some verbs may accept a conjunctive subordinate clause (introduced by a subordination conjunction). Certain conjunctives are direct object complements (relative clauses, as in Je pense qu'il viendra). We must therefore know whether or not the verb can accept a relative clause as direct object complement.
c) the possibility of accepting an infinitive as direct object complement - some verbs may accept an infinitive as direct object complement (such as aimer in Il aime lire, which is equivalent to Il aime la lecture).
d) the possibility of functioning as an "auxiliary" - some verbs may be used as auxiliaries (example : pouvoir in Elle peut lire). In this case, they may also be followed by an infinitive.
e) the case of stative verbs - some auxiliary verbs are known as stative verbs : they do not imply an action, even if the subject is animate (example : être, sembler, devenir, etc., as in Elle devient de plus en plus belle). These verbs accept an attribute which refers to the subject and which follows agreement rules. Therefore, it is necessary to know whether or not a verb is stative.

In order to verify agreement rules, we must also know the nature of certain adverbs. For instance, it is important to know if an adverb is a quantifier (as in Beaucoup de ces enfants partent en promenade).

Among the coordination conjunctions we now count et, ni, ou, mais, donc, car, or, the comma, etc. But it is evident that the present group is not well adapted to the verification and correction of agreement. In order to solve this problem, we might introduce a new classification such as "agreeing coordination conjunction" and "simple coordination conjunction". Relative pronouns also present a problem in agreement checking : some relative pronouns (example : dont in la fille dont je me souviens) do not induce agreement, while others do (such as que in la réunion que j'ai provoquée). In view of this problem, it would be useful to classify relative pronouns according to their behaviour in this respect.
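The verb data listed in points a) to e) above could be held in a lexical entry of the following kind. This is a hypothetical layout in Python, not the PILAF dictionary format : the field names are invented for the illustration, and each example verb only carries the property that the text attributes to it, the other fields being left unknown (None).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Transitivity(Enum):
    INTRANSITIVE = "intransitive"
    DIRECT = "direct transitive"
    INDIRECT = "indirect transitive"
    DIRECT_AND_INDIRECT = "direct and indirect transitive"

@dataclass(frozen=True)
class VerbEntry:
    lemma: str
    transitivity: Optional[Transitivity] = None        # point a)
    accepts_clause_object: Optional[bool] = None       # point b)
    accepts_infinitive_object: Optional[bool] = None   # point c)
    auxiliary: Optional[bool] = None                   # point d)
    stative: Optional[bool] = None                     # point e)

# Only the properties stated in the text are filled in.
LEXICON = {entry.lemma: entry for entry in [
    VerbEntry("dormir", transitivity=Transitivity.INTRANSITIVE),
    VerbEntry("manger", transitivity=Transitivity.DIRECT),
    VerbEntry("succéder", transitivity=Transitivity.INDIRECT),
    VerbEntry("offrir", transitivity=Transitivity.DIRECT_AND_INDIRECT),
    VerbEntry("penser", accepts_clause_object=True),
    VerbEntry("aimer", accepts_infinitive_object=True),
    VerbEntry("pouvoir", auxiliary=True),
    VerbEntry("devenir", stative=True),
]}

def may_take_noun_direct_object(entry):
    """True or False when the transitivity is recorded, None otherwise."""
    if entry.transitivity is None:
        return None
    return entry.transitivity in (Transitivity.DIRECT,
                                  Transitivity.DIRECT_AND_INDIRECT)

print(may_take_noun_direct_object(LEXICON["offrir"]))
print(may_take_noun_direct_object(LEXICON["dormir"]))
```

Leaving unknown properties as None rather than False keeps the distinction between "the verb does not accept this construction" and "the dictionary has not been classified yet", which matters given the classification effort the text describes.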

5. PERSPECTIVES

5.1. At the lexical level

The joint use of three techniques (phonetic, morphological, similarity keys) has given us quite good results : the right correction for an incorrect word usually appears among the proposals offered by the system. At present we apply these techniques in the following order : keys, phonetics, morphology ; this sequence is very well adapted to the correction of common typing mistakes and is therefore well suited to texts written by a person who masters the language but not the keyboard. On the other hand, for a text written by a child or a foreigner, it may be preferable to start with phonetic correction. Since the aim of an interactive system is to cut down response times as far as possible, we believe the most appropriate technique should be chosen for a given user [COURTIN 89]. The system should therefore be able to reconfigure dynamically the order in which the different techniques are applied, by means of a user profile. The simplest version might be a set of statistics on the mistakes already made by the user.

The first tests of the system were carried out on a limited vocabulary (5000 forms) ; recent tests with a far wider vocabulary (25 000 roots, 220 000 forms) have shown the importance of "noise" phenomena. On this scale, which is that of a real-size application, the number of proposed corrections must be considerably enlarged in order to ensure that the right correction is included (some skeleton keys have over 25 associated forms ; the same problem arises with phonetics : on a large dictionary, a single phonetic invariant generates a large number of words).
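The idea of a user profile that reorders the correction techniques can be sketched as follows. This is an illustration under stated assumptions, not the DECOR implementation : the class `UserProfile`, the function `technique_order` and the choice of counting, per technique, how often it produced the correction the user accepted are all hypothetical.

```python
from collections import Counter
from typing import List, Mapping

# Order currently used by the system according to the text.
DEFAULT_ORDER = ["keys", "phonetics", "morphology"]


def technique_order(successes: Mapping[str, int]) -> List[str]:
    """Most successful technique first; ties keep the default order."""
    return sorted(
        DEFAULT_ORDER,
        key=lambda t: (-successes.get(t, 0), DEFAULT_ORDER.index(t)),
    )


class UserProfile:
    """Hypothetical profile: statistics on past accepted corrections."""

    def __init__(self) -> None:
        self.successes: Counter = Counter()

    def record(self, technique: str) -> None:
        """Note that `technique` produced the correction the user chose."""
        if technique not in DEFAULT_ORDER:
            raise ValueError(f"unknown technique: {technique}")
        self.successes[technique] += 1

    def order(self) -> List[str]:
        return technique_order(self.successes)
```

A new user starts with the default sequence ; a user whose mistakes are mostly phonetic would, after a few accepted corrections, see phonetic correction tried first.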

5.2. At the syntactic level

A system for detection and correction must completely and coherently cover the domain it is supposed to correct. If we correct some agreement errors in gender and number, we must correct them all if users are to find the system credible. This obviously requires a lexicon which is sufficiently complete to cover all of "basic French", and should also include a terminological dictionary tailored to any particular application domain. It also requires a complete grammar of the language, or at least a grammar which will accept all correct sentences.

Here a fundamental problem arises : a grammar of the language (whatever formalism is used) is classically supposed to describe completely and consistently a given language or some subset of it. Therefore an incorrect sentence of the language cannot be described by this grammar ; in terms of parsing, this means that for an incorrect sentence, the parser associated with the grammar will not be able to build a structure, and therefore will not be able to correct it ! Thus, the PILAF parser can be used in our system to build structures (checked afterwards), only because it accepts a superset of the language : it can build incorrect structures. The question which then arises is : where is the limit between what can be built and what cannot ? If we authorise (1) by refraining from verifying agreement, what shall we do about (2), which is structurally identical to (3) ?

(1) [dependency tree : mangent, with dependents chien and soupe]
(2) [dependency tree : mange, with dependents il and lire]
(3) [dependency tree : aime, with dependents il and lire]

On the other hand, if we cannot build (1), how can we check agreement situations as simple as that of a verb with its subject ?

The answer to these questions, in our opinion, is that detection and correction of errors in a text should be considered as one of the functions of a morphological and syntactic parser. From the point of view of detection and correction, the structure which is built is not the end, but a means for temporarily storing the data required to validate a sentence (or a text). If this data is to be usable, the structure must be reliable, that is correct, which implies :
• that checking is done during parsing ;
• the use of complete syntactic knowledge (particularly for verbs : transitivity, stative verbs,...) and semantic knowledge.
This last type of information is needed to remove ambiguities in certain sentences, especially for attachment of prepositional groups or determination of referents for pronouns. Example : To what word should qui refer in le chien de la voisine qui aboie sur le palier ?

We shall therefore design a system for syntactic parsing which will have the classical structure of figure 6. To this structure, we shall "graft" the modules which take care of detection and correction, thus obtaining the structure shown in figure 7. For words which are correct, the system functions in the classical way, but detection of an incorrect word (failure in morphological parsing) generates proposals for correction by use of one (or possibly two or even all three) of the techniques described in Part 1. As we have noted above, there will probably be too many proposals ; their number can be limited by using a filter of hypotheses based on syntactic-semantic parsing : the hypotheses will be the values of morphological, syntactic and semantic variables which the parser expects.

Example : Le chien mange la ???
Hypotheses for ??? :
adjq, fem, sin
adv
subc, fem, sin, EDIBLE
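The filtering of correction proposals by parser hypotheses, as in the Le chien mange la ??? example, can be sketched in a few lines. This is a toy illustration only : the function `filter_proposals`, the representation of hypotheses as sets of feature names, and the candidate forms with their features are assumptions made for the example, not data from the system's lexicon.

```python
from typing import FrozenSet, Iterable, List, Tuple

Features = FrozenSet[str]

# Hypotheses the parser expects after "Le chien mange la" (from the text).
HYPOTHESES: List[Features] = [
    frozenset({"adjq", "fem", "sin"}),
    frozenset({"adv"}),
    frozenset({"subc", "fem", "sin", "EDIBLE"}),
]


def filter_proposals(
    proposals: Iterable[Tuple[str, Features]],
    hypotheses: Iterable[Features],
) -> List[str]:
    """Keep the forms whose features satisfy at least one hypothesis.

    A proposal satisfies a hypothesis when it carries every feature
    value that the hypothesis requires (set inclusion).
    """
    hypotheses = list(hypotheses)
    return [
        form
        for form, features in proposals
        if any(h <= features for h in hypotheses)
    ]


# Made-up output of the lexical level for a misspelt word such as "sope".
CANDIDATES: List[Tuple[str, Features]] = [
    ("soupe", frozenset({"subc", "fem", "sin", "EDIBLE"})),
    ("soupes", frozenset({"subc", "fem", "plu", "EDIBLE"})),
    ("saper", frozenset({"verb", "inf"})),
]

kept = filter_proposals(CANDIDATES, HYPOTHESES)
```

Here only soupe survives : soupes fails on number and saper on category, which is the kind of pruning the hypothesis filter is meant to provide on a large dictionary.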


Figure 6 : Classical parser
(diagram not reproduced ; its elements are labelled Text, Morphological Parsing, Words, Syntactic-Semantic Parsing, Hyp., Morph., Synt., Sem., Dict., and Grammar + agreement rules)

Figure 7 : Detection/correction
(diagram not reproduced ; in addition to the elements of figure 6, it shows Keys + forms, Phonetic dict., Skeleton keys, Morphological generation, Phonetics, Base + Morph. var, Words / Errors / Possible corrections, Filter + Resumption, Hypothesis generator, and Proposals for correction)

Such a method is not feasible without a high degree of cooperation between the two modules (lexical and syntactic), because the syntactic module will produce good hypotheses only if it can make use of what it has already parsed and also of what follows the incorrect word. Example : with le ??? it is very difficult to make any hypothesis other than (mas, sin), while with le ??? aboie, we can deduce (mas, sin, subc, CANIDE).


The morphological parser must therefore be equipped with a restart mechanism which will allow it to skip the incorrect word and go on feeding the syntactic module, which in turn must be capable of integrating unknown words in its structures. Detection of an error at the syntactic level leads to issuing proposals for correction, by use of the morphological generator (see Part 4). Even if correction of certain errors can be rendered entirely automatic, the final decision must be left to the user ; the system must only issue proposals. For instance, even in as simple a case of error as la chien mange, the system cannot decide alone whether the right correction is la chienne mange or le chien mange. Such a decision would suppose an overall understanding of the text and the context, not only the local context (neighbouring sentences) but also the intentions of the author. Use of such a system is thus typically interactive : it can be imagined as a component of a natural-language man-machine interface, if integrated with a dialogue module and a syntactic generator, or as the linguistic kernel of a sophisticated word-processor. In this last case, the presence of a syntactic parser and of a morphological generator could allow implementation of new functionalities such as : put a noun group in the plural and make the rest of the sentence agree, change the tense of the whole text,...
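The la chien mange case can be made concrete with a small sketch of how a morphological generator yields both proposals and leaves the choice to the user. Everything here is hypothetical : the table `FORMS` stands in for a real generator, and `generate` and `agreement_proposals` are illustrative names, not functions of the system described in this paper.

```python
from typing import List, Optional, Tuple

# (lemma, gender, number) -> surface form; a toy stand-in for the generator.
FORMS = {
    ("le", "mas", "sin"): "le",
    ("le", "fem", "sin"): "la",
    ("chien", "mas", "sin"): "chien",
    ("chien", "fem", "sin"): "chienne",
}

Word = Tuple[str, str, str]  # (lemma, gender, number)


def generate(lemma: str, gender: str, number: str) -> Optional[str]:
    """Return the inflected form, or None if the toy table lacks it."""
    return FORMS.get((lemma, gender, number))


def agreement_proposals(det: Word, noun: Word) -> List[str]:
    """Propose corrections for a determiner-noun agreement error.

    Both directions are tried: inflect the noun to match the
    determiner, then inflect the determiner to match the noun.
    No proposal is preferred; the user makes the final decision.
    """
    if det[1:] == noun[1:]:
        return []  # gender and number already agree
    proposals = []
    new_noun = generate(noun[0], det[1], det[2])
    if new_noun is not None:
        proposals.append(f"{generate(*det)} {new_noun}")
    new_det = generate(det[0], noun[1], noun[2])
    if new_det is not None:
        proposals.append(f"{new_det} {generate(*noun)}")
    return proposals


# "la chien": determiner le (fem, sin) with noun chien (mas, sin)
options = agreement_proposals(("le", "fem", "sin"), ("chien", "mas", "sin"))
```

For la chien the sketch returns la chienne and le chien, in that order, without ranking them, which mirrors the requirement that the system only issues proposals.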

REFERENCES

[COHARD 88] : Brigitte COHARD. Logiciel de détection et de correction des erreurs lexicales. Mémoire CNAM, Mars 1988.
[COURTIN 76] : Jacques COURTIN, Danièle DUJARDIN. Paramètres linguistiques de la Morphologie Française dans le système PIAF. Université Scientifique et Médicale de Grenoble, Laboratoire d'Informatique, Déc. 1976.
[COURTIN 77] : Jacques COURTIN. Algorithmes pour le traitement interactif des langues naturelles. Thèse d'état, USMG INPG, Octobre 1977.
[COURTIN 88] : Jacques COURTIN, Danièle DUJARDIN, Irène KOWARSKI, Damien GENTHIAL, Vera Lúcia STRUBE DE LIMA. Correção de erros de ortografia através da fonética em textos escritos em francês. XIV Conferencia Latinoamericana de Informática, 17avas Jornadas Argentinas de Informática e Investigación Operativa, Buenos Aires, Sep. 1988, p. 873-891.
[COURTIN 89] : Jacques COURTIN, Danièle DUJARDIN, Irène KOWARSKI, Damien GENTHIAL, Vera Lúcia STRUBE DE LIMA. Interactive multi-level systems for correction of ill-formed French texts. Proceedings of the Scandinavian Conference on Artificial Intelligence, Tampere, Finland, June 1989.
[GREVISSE 69] : Maurice GREVISSE. Le bon usage. Hatier, Paris, 1969, 1228 p.
[LACOUTURE 86] : Roxane LACOUTURE. Une implantation informatique du français fondamental. Mémoire présenté en vue de l'obtention du grade de Maîtrise en Sciences (M.Sc.), Département d'informatique et de recherche opérationnelle, Université de Montréal, Mai 1986.
[LACOUTURE 88] : Roxane LACOUTURE, Guy LAPALME. Une implantation informatique du français fondamental. T.S.I. - Techniques et Sciences Informatiques, Vol. 7, nº 5, 1988.
[LAHENS 87] : François LAHENS. Un modèle stochastique pour la vérification et la correction automatique de textes : le système VORTEX. Thèse, Université Paul Sabatier, Toulouse, Janvier 1987.
[LAPALME 86] : Guy LAPALME, Danièle RICHARD. Un système de correction automatique des accords des participes passés. TSI 86-4, 1986.
[LU 86] : LU CHENGREN. Détection et correction des erreurs dans un texte écrit en langue naturelle. Rapport de DEA, INPG USMG, Septembre 1986.
[MARET 87] : Dominique MARET. Comparaisons de chaînes de caractères, accès lexicaux tolérants et applications. Thèse, USS Grenoble, Mai 1987.
[PETERSON 80] : J.L. PETERSON. Computer programs for detecting and correcting spelling errors. CACM, Vol. 23, 1980, p. 676-687.
[POLLOCK 84] : Joseph J. POLLOCK, Antonio ZAMORA. Automatic spelling correction in scientific and scholarly text. CACM, Vol. 27, Number 4, April 1984.
[STRUBE DE LIMA 90] : Vera Lúcia STRUBE DE LIMA. Contribution à l'étude du traitement des erreurs au niveau lexico-syntaxique dans un texte écrit en français. Thèse de l'Université Joseph Fourier, Grenoble I, Mars 1990.
