Textual and Stylistic Error Detection and Correction - IRIT

2009 Eighth International Symposium on Natural Language Processing

Textual and Stylistic Error Detection and Correction: Categorization, Annotation and Correction Strategies

Laurie Buscail and Patrick Saint-Dizier

Abstract: In this paper we present an analysis of the most frequently encountered style and text structure errors made by various types of authors when producing texts. We then show how such errors can be annotated together with their correction(s). From these annotations, via generalizations, correction rules can be induced. Since correcting errors is a complex process, with several solutions and possibilities, we finally show how an argumentation system can be used so that the user can get arguments for or against a certain correction.

I. INTRODUCTION

Non-native English-speaking authors producing documents in English often encounter lexical, grammatical and stylistic difficulties that make their texts difficult for native English speakers to understand. As a result, the professionalism and the credibility of these texts are often affected. Our main aim is to develop procedures for the correction of those errors which cannot (and will not in the near future) be treated by the most advanced text processing systems such as those proposed in the Office Suite, OpenOffice and the like. In this paper, we focus on stylistic and textual errors, a level of error which is very frequent but seldom addressed because of its linguistic complexity (Vinay et al. 1963, Chuquet 1989). This type of error involves the whole complexity of language: morphology, lexicon, grammar, style, domain usages, context of production, target audience, etc.

When attempting to correct errors, it turns out that, in a large number of cases, (1) there may be ambiguities in the analysis of the nature of errors, (2) errors can receive various types and levels of corrections depending on the type of document, reader, etc., and (3) some corrections cannot be realized without an interaction with the author. To achieve these aims we need to produce a model of the cognitive strategies deployed by human experts (e.g. language teachers) when they detect and correct errors. Our observations show that it is not a simple and straightforward strategy, but that error diagnosis and corrections are often based on a complex analytical and decisional process. Since we want our system to have a didactic capacity, to help writers understand their text errors, we propose an analysis of error diagnosis based on argumentation theory, outlining arguments for or against a certain correction and their relative strength paired with a decision theory.

The modeling of correction strategies is based on the annotation of a large variety of types of textual documents in English and in French, produced by native speakers. Annotations allow us to identify and to categorize errors as well as the parameters at stake when making corrections. Those parameters are a priori neutral in the annotation schemas; they make more explicit the different characteristics of an error and its related corrections. We then define a preference model that assigns polarity (positive, negative) and a weight to each of these parameters, together with additional parameters among which the target reader, the type of document, etc. An argumentation model that considers these parameters as weighted arguments, for or against a certain correction, can thus be introduced. Paired with a decision model, it allows optimal corrections to be proposed to the author together with explanations. This approach confers a formal interpretation on our annotation schema.

Work on the correction of grammatical and textual errors is not widespread in research circles. Let us note the recently emerged systems that perform machine translation post-editing (Isabelle et al. 2007), (Simard et al. 2007) or that correct human authors (e.g. Writer's v. 8.2). These systems do not propose any explicit analysis of the errors, nor do they help the user to understand them. The approach presented here, which is still preliminary, is an attempt to include some didactic aspects in the correction by explaining to the user the nature of her/his errors, while weighing the pros and cons of a correction, via argumentation and decision theories (Boutilier et al. 1999), (Amgoud et al. 2008). Persuasion aspects are also of importance within the didactical perspective (e.g. Persuasion Technology symposiums), (Prakken 2006).

In this short document we present the premises of an approach to correcting textual and style errors, which allows us to evaluate difficulties, challenges, deadlocks, etc. We first present the development corpus considered, then we give the different annotations we introduced to characterize errors and their correction. Finally, we briefly outline the argumentation system that leads to correction proposals.

L. BUSCAIL is a PhD student at IRIT, Toulouse University, France (e-mail: [email protected]). P. SAINT-DIZIER is a CNRS research director at the same institution, where he heads the ILPL language processing group (e-mail: [email protected]).

978-1-4244-4139-6/09/$25.00 ©2009 IEEE

II. THE CORPUS

The documents considered range from spontaneous short productions, with little control and proofreading, such as personal emails, blogs or posts on forums, to highly controlled documents such as web pages, wiki productions, publications or professional reports. These latter form the essentials of our corpus. Within each of these types, we also observed variations in the control of the quality of the writing. For example, emails sent to friends are less controlled than those produced in a professional environment, which have a stronger textual nature. Note that even in this latter framework, messages sent to the hierarchy or to foreign colleagues receive more attention than those sent to close colleagues or to friends. Therefore, the different corpora we have collected form a certain continuum over several parameters (control, orality, etc.); they allow us to observe a large variety of language productions. More details on the elaboration of corpora, definition of attributes, and annotation scenarios investigated by bilingual speakers and didacticians can be found in (Anonymous, Corpus Linguistics conference, 2009).

The corpus is composed of 384 pages from various papers and reports written by various authors:

                   Total     Papers    Reports
Number of pages    384       84        300
Number of words    184320    40320     144000

We noted about 3 errors per page, for a total of 1152 errors. The errors have been classified into 6 categories (see next section for definitions):

Category of errors                  Number of errors
Register                            230
Deixis                              150
Coordination and reference          403
Pronominal references               196
Sentence management                 173
Paragraph structure and contents    115
Total                               1152

III. ERROR CATEGORIZATIONS

In this section we present an original text and style error categorization that we have elaborated from corpus analysis. We basically identify three main types of errors: register (where different levels of language are mixed), structural (including a large variety of problems related to the expression of reference, pronominal, temporal or spatial) and macro-structural (errors in the construction of paragraphs and in the use of discourse connectors). We review and illustrate these errors below. Examples are in English; they either come from the English corpus or are glosses of French texts.

A. Register errors

Register is the manner of speaking or writing specific to a certain domain of communication. This way of communicating is determined by how much a writer chooses to distance himself from his audience and topic. We find in the literature four broadly defined registers: Familiar, Informal, Formal and Ceremonial. To these very classical registers, we add the technological and scientific registers, which have their own internal logic and textual organization. In most languages from Asia, register is often paired with the honorifics level, which then introduces another, orthogonal, register level. Within the English language academic community, the conventions of formal register are generally followed. The choice of register for a particular text, part of text, or spoken presentation will vary depending on genre and audience, but it is important not to mix registers up. The following examples show register errors:

This character has a proclivity for screwing up everything.
Following the clean-up, algae levels in the river were pretty good.

The examples above should be written as follows:

This character tends to wreck everything.
Following the clean-up, algae levels in the river were improved.

Register errors also include the need to distinguish between oral style (as in emails) and written style. These two types of writing have their own conventions. In written style, a number of elements of oral style should not be used, such as contractions in English or the use of the first person:

The English Monarchy wasn't responding to the needs of the population.

The example above should be written as follows:

The English Government was not responding to the needs of the population.

Finally, a number of forms need some polishing or care. For example, the expression of opinion needs a lot of care:

I believe NATO’s strategy was poorly designed and carelessly implemented.
We argue that NATO’s strategy was poorly designed and carelessly implemented.
You can see that NATO’s strategy was poorly designed and carelessly implemented.

In all the examples above, the writer believes that NATO’s strategy was poorly designed and carelessly implemented. The pronouns I, we and you are not always appropriate in formal written register. To increase formality, impersonal structures are required. This can be done in a variety of ways. First, an opinion can be stated directly. It should be obvious what the writer believes if she/he writes:

NATO’s strategy was poorly designed and carelessly implemented.

The writer can also invoke authority.

B. Structure of sentences

From our corpus investigations, it turns out that quite a large number of referential expressions need some revision or elaboration to be clear and unambiguous. We survey here the four main types of errors we encountered: deixis, referential aspects associated with coordination, pronoun production and sequence of tenses.

1) Deixis

Time deixis concerns itself with the various times involved in and referred to in an utterance – typically, the moment of utterance. It is much better to replace deictic terms, such as now or yesterday, by adverbs which are not related to the moment of utterance.

After long waking hours, the character was dozing off now.
The character knew that he had gone through the town yesterday.

The examples above should be written as follows:

After long waking hours, the character was then dozing off.
The character knew that he had gone through the town the day before.

2) Coordination and reference

The coordinating conjunction or may be a source of confusion when it connects two noun phrases; the determiner of the second noun phrase (in the linear chain) is often deleted, which induces a substantial change in meaning:

The lack of cerebral stimulus or brain death...
# He activates his utopia or idealism.

In the first example, the omission of the article the is appropriate because the lack of cerebral stimulus is the definition of brain death. However, in the second sentence a determiner – other than Ø – should appear before the noun idealism because this term and utopia do not share the same meaning.

3) Pronominal references

Third-person pronouns – he/she/it – are anaphoric pronouns: their role is to replace a noun or a noun phrase in order to avoid repetition. Furthermore, they allow the writer/speaker to organize the unfolding discourse by maintaining a particular referent in the reader/hearer’s attention focus. However, too many personal pronouns or too many candidates for the role of the referent of the antecedent may be a source of confusion:

Yesterday, Peter came to see John; unfortunately, he was out at the time, and he was disappointed when he realized that he wouldn’t see him for a long time.

In the example above, who is the referent of the antecedent of he/him? John? Peter? To circumvent such a problem, the writer/speaker must reinsert the referent of the antecedent using a full or semi referential noun phrase:

Yesterday, Peter came to see John; unfortunately, he was out at the time, and Peter (full referential NP) was disappointed when he realized that he wouldn’t see his friend (semi referential NP) for a long time.

4) Tense sequences

The writer/speaker must pay attention to the relationship between the grammatical tenses of verbs in related clauses or sentences, in order to show the temporal relationship of the events to which they refer:

In the twenties, drinking and smoking became an issue for women, who want to be as respected as men.

The example above should be written as follows:

In the twenties, drinking and smoking became an issue for women, who wanted to be as respected as men.

C. Text Macro-Structure

1) Sentence management

There are no rules governing sentence length or complexity in English. Sentences can be short and complex, or long and simple. Generally, it is best to vary the length and complexity of sentences so that the reader’s interest is maintained and/or the reader is not overburdened. In general:

Very simple sentences should be combined:

In 1976, he was assassinated. This was bad politically. Chaos resulted.

His assassination in 1976 resulted in political chaos.

Too many ideas should not be included in the same sentence:

His assassination in 1976 resulted in political chaos as all three opposition parties refused to recognize the president’s hand-picked successor, and for several weeks the situation remained uncertain and tense until a delegation from the OAU arrived in the country and met with members from all sides in the dispute, and brokered a peaceful resolution to the crisis before any violence took place.


The example above should be split as follows:

His assassination in 1976 resulted in political chaos. All three opposition parties refused to recognize the president’s hand-picked successor, and for several weeks the situation remained uncertain and tense. Finally, a delegation from the OAU arrived in the country, met with members from all sides in the dispute, and brokered a peaceful resolution to the crisis before any violence took place.

2) Paragraph structure and contents

Just as each text has an overall structure, with an introduction, a development of a main argument and a conclusion, the smaller parts of each text also have their own micro-level structure. The smallest significant unit for the development of an idea is the paragraph. A paragraph is a text unit within which a single idea, or one aspect of a larger and more complex topic, is developed and supported through a series of closely related sentences. The articulations between sentences may be implicit (contextually inferred) or explicit, via the use of rhetorical marks. The paragraph level introduces a new topic (or subtopic, or aspect of a topic) and develops it, usually making some sort of claim and supporting it. Each sentence has its role in building up an argument, whether by introducing a claim, expanding upon it, providing or analyzing evidence, or drawing a conclusion.

Though the theoretical research on paragraph structure is too complex to cover in detail here, some broad guidelines may be given. To be effective, a paragraph must have coherence and unity, that is, the entire paragraph should concern itself with a single focus. If it begins with one focus or major point of discussion, it should not end with another or wander among different ideas. The sentences should lead on from each other logically so that each one answers the questions that come into the reader's mind when they read the sentence before it. If the reader has to go back and read more than once to understand what the writer is saying, this is an indication that the paragraph may not be coherent. A very long paragraph may also lack unity.

Most paragraphs have a topic sentence. The topic sentence presents the subject of the paragraph; the remainder of the paragraph then supports and develops that statement with carefully related details. Because it introduces the subject that the paragraph is to develop, the topic sentence is typically the first sentence of the paragraph. It is effective in this position because the reader knows immediately what the paragraph is about.

It is very common after the topic sentence for writers to develop further or expand their main idea. This may also involve a more detailed or qualified restatement of the topic sentence. It is a relatively common stylistic device to start with a relatively simple topic sentence and then restate or expand it. With a minimal set of rhetorical marks, we can identify the main lines of the paragraph structure and propose some restructuring to the author.

Another common strategy after the topic sentence is to immediately limit or narrow the paragraph to a precise aspect of the topic which will be discussed:

Amongst these problems, however, some of the most serious are those experienced by women, whether this be in the family or in the workplace.

A frequent feature of good paragraphs is that, having made a claim in the topic sentence and elaborated it, the writer then brings examples or evidence to support his or her claim. This can be very helpful in persuading the reader of the validity of the writer’s position. In academic writing, this illustration may well take the form of quotation from or reference to research carried out by others:

Research by Hofstetter and Igel (1995), for example, has shown that women in former East Germany experienced considerably higher rates of depression and resorted more often to psychiatric help in coping with social change than their male counterparts.
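Several of the surface cues discussed in Section III (contractions and personal pronouns for register, time deictics, overlong sentences) lend themselves to simple pattern matching. The Python sketch below is only an illustration under strong assumptions: the regular expressions, the 40-word threshold and the name flag_sentence are hypothetical and are not taken from the system described in this paper. Such patterns can at best locate candidate error zones; deciding whether a flagged item is really an error still requires the co-text and the document type.

```python
import re

# Hypothetical surface patterns; the paper does not specify detection rules.
CONTRACTION = re.compile(r"\b\w+(?:n['’]t|['’](?:re|ve|ll|d|m))\b")
PERSONAL_PRONOUN = re.compile(r"\b(?:I|[Ww]e|[Yy]ou)\b")
TIME_DEICTIC = re.compile(r"\b(?:now|yesterday|today|tomorrow)\b", re.IGNORECASE)
MAX_WORDS = 40  # assumed threshold for "too many ideas in one sentence"


def flag_sentence(sentence):
    """Return the list of candidate error categories found in one sentence."""
    flags = []
    if CONTRACTION.search(sentence):
        flags.append("register:contraction")
    if PERSONAL_PRONOUN.search(sentence):
        flags.append("register:personal-pronoun")
    if TIME_DEICTIC.search(sentence):
        flags.append("deixis:time")
    if len(sentence.split()) > MAX_WORDS:
        flags.append("sentence-management:too-long")
    return flags


if __name__ == "__main__":
    for s in [
        "The English Monarchy wasn't responding to the needs of the population.",
        "After long waking hours, the character was dozing off now.",
    ]:
        print(flag_sentence(s), s)
```

Applied to the examples of Section III, the contraction and the deictic now are flagged, while their corrected counterparts produce no flag.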

IV. ANNOTATING TEXT AND STYLE ERRORS

Let us now introduce the annotation schema we have developed. It is an ongoing effort which is gradually being evaluated by real users. This schema is an attempt to reflect, in a factual and declarative way, the different parameters considered by didacticians and teachers when detecting and correcting errors. It contains several groups of tags given below. The values for each attribute are based on a granularity level evaluated by the didacticians of our group. These are still preliminary and need evaluation and revisions. Their structure has been designed so that they can be used in an argumentation framework. In contrast with the detection of grammatical errors, detecting and correcting style errors often requires taking into account a large text portion, the co-text, if not the context of the utterance.

(A) Error delimitation and characterization: tags the text segment involved in the error. The error zone is meant to be as minimal as possible. This tag has several attributes:
comprehension: from 0 to 4 (0 being worst): indicates whether the segment is understandable in spite of the error,
level of error: from 0 to 2: indicates how heavy the error is,
categ: indicates the nature of the error, from the classifications presented in the previous section,
source: indicates the origin of the error, such as calque (analogy with source language), spoken language, etc.

(B) Delimitation of the correction: tags the text fragment involved in the correction. It is equal to or larger than the error zone.

(C) Characterization of a given correction: each correction is characterized by a tag and associated attributes; positively oriented ones are underlined:
surface: size of the text segment affected by the correction: minimal, average, maximal,
status: indicates, whenever appropriate, whether the correction proposed is the standard one or a more complex one; values are: by-default, alternative, unlikely,
meaning: indicates whether the meaning has been altered: yes, somewhat, no,
var-size: an integer that indicates the increase/decrease in number of words of the correction w.r.t. the original fragment,
comp: indicates the difficulty of understanding the correction: it may be more difficult than the error, even though it is correct,
change: indicates what the changes are in the correction: syntactic, morphological, etc.,
qualify: indicates the certainty level of the annotator and didacticians; it qualifies separately the certainty of the error detection and of the proposed correction,
correct: gives the correction.

Let us now consider two illustrative examples; attribute values are assigned by annotators:

(1) Register level:

Algae levels were much better

(2) Sequence of tenses:

drinking and smoking became an issue for women, who want to be as respected as men

V. FROM ANNOTATIONS TO CORRECTION RULES

Our corpus has been annotated following the above schema. There are several steps to reach the correction rule stage that we briefly outline here. The approach is still exploratory, and needs further elaboration and evaluation. Style and text form correction is indeed little explored and raises many difficulties due to a lack of linguistic criteria.

Generalizations are realized by a gradual and manually controlled machine generalization process. In particular we aim at reaching the highest possible level of generalization, but generalizing may be delicate and may lead to rules that lack nuances. One of our main tasks is, for each phenomenon, to adjust the level of generalization so that corrections can be done adequately.

To define a correction rule, the segment of words at stake in the error zone and in the correction first gets a morphosyntactic tagging that characterizes the error, so that the error can be easily identified in any circumstance as an erroneous pattern. All errors that have the same erroneous pattern are grouped to form a single correction procedure, possibly leading to several corrections. Informally, a correction rule is defined as the union of all the corrections found for that particular pattern:
(1) merge all corrections which are similar, i.e. where the position of each word in the erroneous segment is identical to that of the correction; the values of the different attributes of the tag are averaged,
(2) append all corrections which have a different correction following the word-to-word criterion above, and also all corrections for which the attribute 'fix' is true,
(3) remove the text segments or keep them as examples.

After generalizations over examples, it is induced that when the tense of the main clause is in the past, then the tense of any subordinated clause must also be in the past. For the other situations (e.g. main verb in present tense), it all depends on meaning. For the above example, we then get the following rule, via generalizations:

The correction indicates that the two verbs of the main and relative clause must be in the same tense when the verb of the main clause is in the past. The register level example is treated in the same way: any register besides formal is noted for formal papers, any register besides informal or familiar is noted for emails, etc., and if any discrepancy is noted, then register equivalence is enforced so that all parts of the sentence or paragraph have the same register level. Correction for register level is realized via the domain ontology and terminology, which, in principle, identifies language levels.

For paragraph structure, we are now testing the text tiling technique (Hearst 1997), which, in general, identifies zones in a text that talk about the same topic, based on the density of pronouns, common terms and synonyms, etc. Zones dealing with the same topic should form one or more paragraphs depending on size. If two topics are identified in a paragraph, then the paragraph is decomposed into two units.

We may have several competing solutions: several corrections from the same rule or from different rules may be competing. Argumentation is used to structure the correction space.
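The merge and append steps described above can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the authors' induction procedure: the class Correction, its fields, the tag labels and the function induce_rules are hypothetical; "similar" corrections are approximated here as corrections sharing the same correction pattern; and only two attributes (var-size and comp, the latter assumed to be numeric) are averaged.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Correction:
    error_pattern: tuple    # hypothetical: morphosyntactic tags of the error zone
    correct_pattern: tuple  # hypothetical: morphosyntactic tags of the correction
    var_size: float         # change in number of words (attribute var-size)
    comp: float             # difficulty of the correction, assumed numeric


def induce_rules(corrections):
    """Group annotated corrections by error pattern.

    Corrections with the same correction pattern are merged and their
    numeric attributes averaged; different ones are appended as alternatives.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for c in corrections:
        grouped[c.error_pattern][c.correct_pattern].append(c)
    rules = {}
    for err, by_fix in grouped.items():
        rules[err] = [
            Correction(err, fix,
                       mean(x.var_size for x in group),
                       mean(x.comp for x in group))
            for fix, group in by_fix.items()
        ]
    return rules


if __name__ == "__main__":
    err = ("main:past", "relative:present")      # illustrative tags only
    same_tense = ("main:past", "relative:past")
    rephrased = ("main:past", "participle")
    annotated = [
        Correction(err, same_tense, 0.0, 1.0),
        Correction(err, same_tense, 2.0, 3.0),
        Correction(err, rephrased, -1.0, 2.0),
    ]
    for candidate in induce_rules(annotated)[err]:
        print(candidate)
```

In this toy run the two annotations proposing the same correction pattern collapse into one candidate with averaged attributes, and the third remains as a competing alternative for the same erroneous pattern.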

VI. AN ARGUMENTATION MODEL FOR DEALING WITH MULTIPLE CORRECTIONS

Our goal, within an 'active didactics' perspective, consists in identifying the best corrections and proposing them to the writer together with explanations, so that he can make the most relevant decisions. Classical decision theory must be paired with argumentation to produce explanations. In our framework, argumentation is based on the attributes associated with the tags of the correction rules. We assume that decisions are made in a rational way, i.e. in a way which is consistent with a set of preferences (note that in some areas of language, such as poetry, rationality is not as central). This view confers a kind of operational semantics on the tags and attributes we have defined.

Formally, a decision based on practical arguments is represented by a vector (D, K, G, R) defined as follows:
(1) D is a vector composed of decision variables associated with explanations: the list of the different decisions which can be considered, including no correction. The final decision is then made by the writer.
(2) K is a structure of stratified knowledge, possibly inconsistent. Stratifications encode priorities (e.g. Bratman 1987, Amgoud et al. 2008). K includes, for example, knowledge on readers (e.g. in emails they like short messages, close to oral communication), grammatical and stylistic conventions or by-default behaviors, and global constraints on texts or sentences. Each stratum is associated with a weight wK ∈ [0,1].
(3) G is a set of goals, possibly inconsistent, that correspond to positive attributes Ai to promote in a correction. These goals depend on the type of document being written.
(4) R is a set of rejections, i.e. criteria that are not desired, e.g. a longer text after correction. The format for R is the same as for G. R and G have an empty intersection. Rejections may also have weights.

Some attributes may remain neutral (e.g. var-size) for a given type of document or profile.

The global scenario for correcting an error is the following: while checking a text, when an error pattern (or more if patterns are ambiguous) is activated, the corrections proposed in the tag are activated and a number of them become active because the corresponding 'correct' attribute is active. Then, for each such correction, the attributes in the correction, which form arguments, are integrated in the decision process. Their weight in G or R is integrated in a decision formula; these weights may be reinforced or weakened via the knowledge and preferences given in K. For each correction decision, a meta-argument that contains all the weighted pros and cons is produced. This meta-argument is the motivation and explanation for realizing the correction as suggested. It has no polarity.
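To make the role of G, R and K more concrete, here is a minimal Python sketch of one possible scoring scheme. The paper does not give its decision formula, so everything specific here is an assumption: the additive score (weights of satisfied goals minus weights of triggered rejections), the single scaling factor k_weight standing in for the stratified knowledge K, the attribute labels, and the function names evaluate and rank.

```python
def evaluate(correction_attrs, goals, rejections, k_weight=1.0):
    """Return (score, pros, cons) for one candidate correction.

    correction_attrs: set of attribute values, e.g. {"surface=minimal"}
    goals, rejections: dicts mapping attribute values to weights in [0, 1]
    k_weight: assumed scaling factor standing in for the knowledge base K
    """
    pros = {a: w * k_weight for a, w in goals.items() if a in correction_attrs}
    cons = {a: w * k_weight for a, w in rejections.items() if a in correction_attrs}
    score = sum(pros.values()) - sum(cons.values())
    return score, pros, cons


def rank(candidates, goals, rejections):
    """Order candidate decisions by decreasing score, keeping their arguments."""
    scored = {name: evaluate(attrs, goals, rejections)
              for name, attrs in candidates.items()}
    order = sorted(scored, key=lambda name: scored[name][0], reverse=True)
    return order, scored


if __name__ == "__main__":
    goals = {"surface=minimal": 0.5, "meaning=no": 1.0}   # attributes to promote
    rejections = {"var-size>0": 0.25}                      # e.g. longer text
    candidates = {
        "A": {"surface=minimal", "meaning=no"},
        "B": {"meaning=no", "var-size>0"},
        "none": set(),                                     # no correction
    }
    order, scored = rank(candidates, goals, rejections)
    for name in order:
        score, pros, cons = scored[name]
        print(name, score, "pros:", pros, "cons:", cons)
```

The pros and cons returned with each score play the part of the meta-argument: they are shown to the writer as the explanation of a proposal, and the writer, not the ranking, makes the final decision.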

VII. IMPLEMENTATION

At the moment, we are developing in Java a prototype running on 4 major style and text errors in order to be able to evaluate the challenges, the needed resources and the system behavior. This step should be carried out shortly. When this is done and evaluated, we can proceed to a larger realization. From a linguistic point of view, the evaluation of this approach is a real challenge. At the moment, we aim at evaluating the error recognition rate and whether the corrections proposed are appropriate. We are now exploring a way to evaluate the construction of arguments and the hierarchy of decisions which are proposed. Obviously, this task has not been done yet, since it first requires a prototype and an evaluation protocol.

VIII. CONCLUSION

We have presented in this paper an analysis of the most frequently encountered style and text structure errors. This is a really challenging, but important, problem since there is very little work in this area. We have then presented the way we annotate errors and their possible corrections, noting that errors may get several types of corrections. From these annotations, we have shown how correction rules can be induced. Evaluating the performance of such a system is necessary but raises many problems, in particular due to the difficulty of detecting and analyzing errors: our linguists disagree from time to time. This motivates our interactive approach, where the writer gets arguments for or against a certain correction.

REFERENCES

[1] Amgoud, L., Dimopoulos, Y., Moraitis, P., Making decisions through preference-based argumentation. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’08), AAAI Press, 2008.
[2] Boutilier, C., Dean, T., Hanks, S., Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[3] Bratman, M., Intentions, plans, and practical reason. Harvard University Press, Massachusetts, 1987.
[4] Chuquet, H., Paillard, M., Approche Linguistique des Problèmes de Traduction, Paris, Ophrys, 1989.
[5] Hammadou, J., The Impact of Analogy and Content Knowledge on Reading Comprehension: What Helps, What Hurts, ERIC, 2000.
[6] Hearst, M.A., TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages, Computational Linguistics, vol. 23-1, 1997.
[7] Isabelle, P., Goutte, C., Simard, M., Domain adaptation of MT systems through automatic post-editing, MT Summit XI, 2007.
[8] Prakken, H., Formal systems for persuasion dialogue, Knowledge Engineering Review, 21:163–188, 2006.
[9] Simard, M., Goutte, C., Isabelle, P., Statistical Phrase-based Post-editing, Proceedings of the NAACL-HLT, 2007.
[10] Vinay, J.P., Darbelnet, J., Stylistique Comparée du Français et de l'Anglais, Paris, Didier, 1963.
