Vol. 73 | No. 10 | Oct 2017 DOI: 10.21506/j.ponte.2017.10.10

International Journal of Sciences and Research

BIGRAMS AND CHUNKING: ADVANTAGES FOR USING IN AUTOMATIC SPELLING CORRECTION IN RUSSIAN AND ENGLISH

Vladimir Polyakov (Corresponding Author), National University of Science and Technology “MISIS”, Russia
Ivan Anisimov, Yandex, LLC, Russia
Elena Makarova, Institute of Linguistics of Russian Academy of Sciences, Russia

Abstract
The present research is concerned with the problem of automatic spelling correction for Russian and English. The program, which runs in a batch mode, draws upon chunking, a model of incomplete syntactic analysis. Building on the previous version of the program and its advantages and shortcomings, we decided to introduce a stage of analysis using bigrams into the chunking pipeline, which considerably increased the efficiency of spelling correction. Unlike other programs that presuppose an interactive mode with human interference, the spelling corrector described in the present paper is completely automatic, i.e. the program itself chooses the best variant of correction and makes the necessary replacement. The work of the program was tested on two mini-collections (for Russian and for English) of one hundred clauses each, collected from Twitter. Though there is still room for improvement, the results testify to the fact that the joint use of bigrams and chunks has great potential.

Keywords: Automatic Spelling Correction; Bigrams; Chunking; Dependency Tree Model; Russian; English; Syntax.

Introduction
The task of spelling correction, an important field of natural language processing, has been pending for many decades, but it still remains unsolved despite numerous works in this sphere [6], [15], [16], [20], [21], [22], [23], [25], [27]. Such problems as ungrammaticality (neologisms, proper names, company names that a human brain can easily detect and deal with) or multilingualism (when a word or a phrase in one language is written using a different alphabet) present certain difficulties and become obstacles for the development of an automatic spelling corrector. Thus, although there are a lot of spelling correctors, they function in an interactive mode, i.e. the final decision – the choice between several generated variants of correction for a misspelled word – is made by the user.

The analysis of advances in the syntactic processing of natural-language texts shows that during the 60 years since the beginning of work in this field [8], [30] the sphere of building syntactic models for natural language has come a considerable way. Parsers (programs for grammatical analysis) based on these grammar formalisms have appeared for a wide range of languages, such as English, Russian, French, German, Turkish, Arabic, Czech, Bulgarian, etc. [14], [18], [19]. Besides, numerous treebanks containing syntactic descriptions, similar to text corpora, have been developed. Nevertheless, both the sphere of generative grammars and the sphere of dependency grammars are characterized by a deficit of practical applications, which is, first of all, accounted for by the difficulty of applying these models in practice. Besides, there are only a few libraries aimed at Russian language processing [29: http://dx.doi.org/10.1155/2016/4183760].


The present research is dedicated to the creation of a spelling corrector based on syntactic context analysis and operating in a batch mode. Thus, the work does not come down to the detection of misspelled words and the generation of possible corrections. The main challenge is to select an efficient model that would automatically choose the correct variant among several options. The first version of such spelling correctors for Russian and for English was created in 2016 using the UIMA framework [https://uima.apache.org], Java [https://www.java.com/en/] and the NLP@Cloud(Moscow) library. There are two versions of the NLP@Cloud library: the first is NLP@Cloud(Moscow) at NUST “MISIS”, and the second is NLP@Cloud(Kazan) at KFU. The work on the two variants of the parsing program (for English and for Russian) has been conducted simultaneously. A chunk in the dependency grammar is an edge of the dependency tree, and the spelling correction program is based on the results of chunking. Although the first results seemed quite promising, there still remained a lot to be improved. In order to increase the productivity of the spelling correction program, we decided to introduce a stage “Correction filtering using a dictionary of bigrams” into the pipeline so as to reduce the number of possible corrections for each word, which considerably improved the general efficiency of the program. Thus, the program in question is a spelling corrector with support of syntactic context, which serves as a restriction when the program has to choose the most suitable correction among several generated variants. Both chunks and bigrams can be regarded as the syntactic context of a word, and separately each of them corrects a certain percentage of mistakes, but it is their joint use that proved to be a good solution to the problem of automatic spelling correction. One of the goals of the original studies was to suggest a new formalism for describing natural language syntax based on chunking in the dependency tree and a dictionary of bigrams. The task of spelling correction was chosen for debugging this syntactic model.

Material and Methods
The task of chunking was first approached in 2009 by Bushtedt and Polyakov [7] for Russian. Later, in 2016, it was significantly reconsidered on the basis of new heuristics. The model of incomplete syntactic analysis, or chunking, is based on Tesnière’s dependency grammar [30], [31], as it is closer to native speakers of languages with free word order [24], such as Russian. The conference proceedings [14], [18], [19] show a wide range of natural human languages whose syntax is described using the formalism of dependency grammar. Besides, compared to the second dominating syntactic model, Chomsky’s [8], [9], it provides an easier transition to the logical notation, which was defined as the ultimate goal of the research. At present, the main goal of the work on the NLP@Cloud(Moscow) program library is the creation of a partial syntactic analyzer (chunker). However, due to the complexity of testing syntactic analysis as such, the authors decided to use the task of automated spelling correction as a means of testing the suggested syntactic model. Below we present the material and the method used in the present research, namely the NLP@Cloud program library and the pipeline created on its basis, as well as the structure of the test collection and the process of its creation.

a) NLP@Cloud Library
NLP@Cloud is a program library aimed at natural language text processing. The work began in 2012 [26] and still continues. At present there are two versions of the library, NLP@Cloud(Moscow) and NLP@Cloud(Kazan); the present research is based on the former. The library is oriented towards desktop applications for deep natural language text processing. The range of prospective goals of the NLP@Cloud(Moscow) library includes the problems of automated spelling correction and of homophony and homonymy disambiguation based on syntactic models.


The NLP@Cloud(Moscow) program library is written in Java using the UIMA (Unstructured Information Management Architecture) framework. A detailed description of the technical parameters of the library is given in Table 1.

Table 1. Technical characteristics of the NLP@Cloud(Moscow) library
Goals: processing (analysis) of natural-language texts
Languages the library is designed for: English, Russian
Programming language: Java
Additional libraries: UIMA, Tika, JFlex, Liblevenshtein, Stanford NLP Group PoS Tagger
Operating systems: MS Windows, Linux, Mac OS
Required RAM: 6 GB
Volume of dictionaries: COCA (for English), over 100 thousand words; OpenCorpora (for Russian), about 300 thousand words
Size of NLP@Cloud library: 733 MB
Load time of dictionaries: COCA, less than 1 minute; OpenCorpora, less than 1 minute
Time for pipeline execution, per sentence: from 0.1 sec to 1 minute (English, Russian)
Types of processed documents: NLP@Cloud(Moscow) includes a library (Apache Tika) which can process over 1400 types of files, including DOC, DOCX, PDF, TXT, HTML, XML, etc.
Stages of NLP: tokenization, morphological analysis, syntactic analysis, spelling correction
Tasks: morphological analysis, partial syntactic analysis, spelling correction, homonymy disambiguation, homophony disambiguation (in prospect), partial semantic analysis (in prospect), scientific text processing (in prospect)
Parsing model: rule-based parsing in the dependency model, partial grammar analysis, chunking
Model of DB “Chunking”: version for English, Spring 2016, number of chunks 228; version for Russian, Spring 2016, number of chunks 242
Resources for creation: Time, 2012-ongoing (over 5 years); People, at the beginning in 2012 there were 10 people (including top-ranked specialists), now there are 4 people working on the library
YouTube channel: www.youtube.com/channel/UCufoW0VTmimbM5zSRTY3fOg (under development)

b) Pipeline description
The program of spelling correction using chunking includes the following stages, described jointly for English and Russian [1], [2], [3], [4]:
- Pre-processing. Using the Apache Tika library [https://tika.apache.org], the program extracts the whole text and identifies its language.
- Tokenization. The program identifies two annotations, a word in Russian/English or a complex word in Russian/English, and normalizes the text by bringing all words to lower case. The stage is executed by a finite-state machine generated by the JFlex analyzer [http://jflex.de].
- Morphological analysis. The stage is applicable only to Russian. Based on the OpenCorpora dictionary [5], [http://opencorpora.org/dict.php], the program detects the inflectional grammar form of a word or a set of such forms (homonyms). The number of homonymous forms can exceed a dozen, as Russian is a highly inflectional language.
- PoS-tagging. The stage is applicable only to English. Each variant of the clause generated at the previous stage is processed with the Stanford Log-linear Part-Of-Speech Tagger [17], [http://nlp.stanford.edu/software/tagger.shtml]. All word forms generated at this stage include their part of speech and lemma (basic form of the word). The stage also includes unification of the part-of-speech markup used in COCA [12], [13], [http://corpus.byu.edu/coca/], the Stanford NLP Group PoS-tagger and the “Chunking” database.
- Homonymy list cleaning and work with the dictionary. The stage is applicable only to Russian. At this stage the program eliminates “false” homonyms (cases where one-letter prepositions, conjunctions and pronouns coincide with letters of the alphabet) and corrects words with the letter “ё”.
- Synthesis of potential corrections. For each word that was not found in the dictionary, all possible corrections at a Levenshtein distance of one step are generated. This procedure is executed by a Levenshtein automaton built over the morphological dictionary using Liblevenshtein [https://github.com/universal-automata/liblevenshtein-java].
- Correction filtering using a dictionary of bigrams. The stage was introduced into the pipeline within the frames of the present study and is described in the section “Correction filtering using a dictionary of bigrams as a new stage of the pipeline”.
- Building of an extended word tuple. A new heuristic introduced into the original version of the pipeline comes down to the following: all prepositions, particles, conjunctions, adverbs, and articles (for English) are excluded from the lexical contents of the sentence and marked in the extended word tuple of the words they are attributed to. In the upcoming version of the database “Chunking”, constructions with auxiliary verbs will also be shifted to the extended word tuple. This allowed us to keep all chunks strictly two-member and all chunk trees homogeneous.
- Forming a set of potential chunks. The program successively joins all word forms pairwise, and, if such a pair of words meets the criteria of one of the chunk types from the database “Chunking” (version Spring 2016), it becomes part of the set.
- Search for the main parts of the sentence. Based on the decision scheme developed within the frame of the research, the program finds all possible “subject-predicate” pairs, which further become the tops of the chunk trees. The decision scheme was developed separately for Russian and for English.
- Building a set of chunk trees. The program builds a single graph that includes all chunks from the previous two stages and, after traversing this graph, forms a set of potential chunk trees.
- Choosing the best chunk tree. The tree that includes the most words from the initial clause becomes the best chunk tree.
- Output of results. All corrections obtained at the previous stages are introduced into the original text, and the corrected text is output into the resulting file.

c) Correction filtering using a dictionary of bigrams as a new stage of the pipeline
The first results of testing the program on a mini-collection of Twitter messages were not bad, but they showed that chunking alone is not enough for a sufficient percentage of correct corrections. In particular, one of the weak sides of the program was choosing the best chunk tree and, consequently, the right variant of correction, in situations when two or more variants had the same grammatical characteristics. In such cases the program made a random choice, which sometimes turned out to be incorrect. Besides, a big number of generated corrections led to an unreasonably big number of chunks and chunk trees, which slowed down the execution of the program. Thus, we decided to introduce a stage “Correction filtering using a dictionary of bigrams” into the original pipeline.
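Before describing the new stage, it is worth recalling what exactly it has to filter. At the stage “Synthesis of potential corrections” every out-of-dictionary word receives as candidates all dictionary words within a Levenshtein distance of one. The pipeline obtains them with a Levenshtein automaton from the Liblevenshtein library; the sketch below is only a naive illustrative enumeration of distance-one edits checked against a small dictionary, not the library's API, and the class and method names are ours. With the toy dictionary in main(), it reproduces the candidates shown for “starrt” and “scrarch” in the log fragment of Table 2 below.

```java
import java.util.*;

// Naive illustration of the "Synthesis of potential corrections" stage (English only):
// enumerate all strings within Levenshtein distance 1 of a misspelled word
// and keep those found in the dictionary. The real pipeline uses a Levenshtein
// automaton built over the morphological dictionary via Liblevenshtein;
// this sketch only shows what the filtering stages receive as input.
public class Edits1Candidates {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    public static Set<String> candidates(String word, Set<String> dictionary) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i <= word.length(); i++) {
            // insertions of one letter at position i
            for (char c : ALPHABET.toCharArray()) {
                check(word.substring(0, i) + c + word.substring(i), dictionary, result);
            }
            if (i < word.length()) {
                // deletion of the letter at position i
                check(word.substring(0, i) + word.substring(i + 1), dictionary, result);
                // substitution of the letter at position i
                for (char c : ALPHABET.toCharArray()) {
                    check(word.substring(0, i) + c + word.substring(i + 1), dictionary, result);
                }
            }
        }
        return result;
    }

    private static void check(String cand, Set<String> dictionary, Set<String> out) {
        if (dictionary.contains(cand)) {
            out.add(cand);
        }
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("start", "starry", "starr", "scratch"));
        System.out.println(candidates("starrt", dict));  // [starr, starry, start]
        System.out.println(candidates("scrarch", dict)); // [scratch]
    }
}
```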


A bigram is a sequence of two elements [10: http://digitalcommons.butler.edu/wordways/vol22/iss3/8], in our case a sequence of two adjacent words. Within the frames of the spelling correction problem, bigrams can help eliminate a number of improbable corrections generated by the program. The stage “Correction filtering using a dictionary of bigrams” was introduced into the pipeline after the stage at which the program synthesizes a set of potential corrections for a word not found in the dictionary using the Levenshtein distance method. For each variant of correction the program builds two bigrams, consisting of the correction variant itself and the word on its left or the word on its right; if the correction variant is the first or the last word of the sentence, the program builds only one bigram. If both bigrams for a correction are present in the dictionary of bigrams (COCA for English and OpenCorpora for Russian), the correction is marked as “highly probable”. If only one bigram is present in the dictionary, it is marked as “probable”. Finally, the correction is marked as “improbable” if neither bigram is included in the dictionary. If there are “highly probable” variants of correction, the others are not taken into consideration at the subsequent stages of the pipeline. If there are only “probable” and “improbable” variants, the latter are excluded. The algorithm proved to be a very effective way of eliminating a part of the false corrections. In some cases the stage “Correction filtering using a dictionary of bigrams” keeps only one variant of correction; in other cases it eliminates some of the generated variants, which facilitates and accelerates the further stages of chunking. A fragment of the log of the program execution for English is shown in Table 2.

Table 2. Extracts from the log of the program execution for one English clause.

Extract from log:
Sentence 1[The partyu neesd to starrt over from scrarch]
Comment: This is the original sentence (with spelling mistakes) from the collection. There are four misspelled words in the sentence: partyu, neesd, starrt, scrarch.

Extract from log:
Word missing in dictionary: partyu
Replacements: [ party. party party ]
Word missing in dictionary: neesd
Replacements: [ needs need nesd ]
Word missing in dictionary: starrt
Replacements: [ starry start starr ]
Word missing in dictionary: scrarch
Replacements: [ scratch ]
Comment: Stage “Synthesis of potential corrections”. Four words (“partyu”, “neesd”, “starrt” and “scrarch”) from the sentence were not found in the dictionary. For each of them the program generated all possible corrections with Levenshtein distance equaling one step. Thus, there are 3 variants of correction for the words “partyu”, “neesd”, “starrt”, and 1 variant of correction for the word “scrarch”.

Extract from log:
Finished generating possible corrections in 8 ms
Finished filtering sentences with simple n-gram dictionary in 0 ms
Finished generating possible sentences and postagging in 1 ms
Finished filtering sentences with pos n-gram dictionary in 0 ms
Comment: After each stage the program writes the time spent on it.

Extract from log:
Word [the] wordforms: [ [word:"the", pos: "DT", lemma: "the"] ]
Word [partyu] wordforms: [ [word:"party", pos: "NN", lemma: "party"] ]
Word [neesd] wordforms: [ [word:"needs", pos: "VBZ", lemma: "need"] ]
Word [to] wordforms: [ [word: "to", pos: "TO", lemma: "to"] ]
Word [starrt] wordforms: [ [word:"start", pos: "VB", lemma: "start"] ]
Word [over] wordforms: [ [word:"over", pos: "RP", lemma: "over"] ]
Word [from] wordforms: [ [word:"from", pos: "IN", lemma: "from"] ]
Word [scrarch] wordforms: [ [word: "scratch", pos: "NN", lemma: "scratch"] ]
...
Comment: For each word form the program defines its part of speech and lemma (base form). Besides, the list of word forms also shows the results of the stage “Correction filtering using a dictionary of bigrams”. For example, the words “partyu”, “neesd” and “starrt” have only one word form each, which means that during the analysis of the generated corrections against the dictionary of bigrams the program eliminated all other variants of correction.

Extract from log:
Extended Word [partyu] attributes: [
article: the
personal pronoun: null
possessive pronoun: null
possessive ending: null
negative particle: null
preposition: null
...
Comment: A fragment of building an extended word tuple for the word “partyu”. In particular, the article “the” was attributed to it.

Extract from log:
Best tree:
party needs (4)
needs start (27)
needs scratch (25)
...
Comment: The last but one stage of the pipeline, “Choosing the best chunk tree”. In each chunk the first word is the main word and the second is the dependent one. The first chunk in the tree contains the subject and the predicate of the clause. The numbers in brackets show the ID of the chunk type from the database “Chunking”.

Extract from log:
Output sentence: [The party needs to start over from scratch..]
...
Processing time 64 ms
Comment: The last stage of the pipeline, “Output of results”, presents the clause with all the corrections chosen by the Levenshtein method and the methods of bigrams and chunking. In the example all the misspelled words underwent the necessary corrections. All the stages for one sentence were executed in 64 ms.
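The classification rule described at the beginning of this subsection can be summarized in a short sketch. The bigram dictionary is represented here simply as a set of space-joined word pairs, and its contents in the example are invented for illustration; the real pipeline relies on the COCA and OpenCorpora bigram resources, and the class and method names below are ours rather than part of the NLP@Cloud library.

```java
import java.util.*;

// Sketch of the stage "Correction filtering using a dictionary of bigrams".
// A candidate correction is scored by how many of its two bigrams
// (with the left and the right neighbour) occur in the bigram dictionary:
// 2 -> "highly probable", 1 -> "probable", 0 -> "improbable".
// If any "highly probable" candidates exist, only they are kept;
// otherwise only the "improbable" ones are dropped.
public class BigramFilter {

    static int score(String left, String candidate, String right, Set<String> bigrams) {
        int hits = 0;
        if (left != null && bigrams.contains(left + " " + candidate)) hits++;
        if (right != null && bigrams.contains(candidate + " " + right)) hits++;
        return hits; // for the first/last word of a clause only one bigram is built
    }

    static List<String> filter(String left, List<String> candidates, String right,
                               Set<String> bigrams) {
        int best = 0;
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String c : candidates) {
            int s = score(left, c, right, bigrams);
            scores.put(c, s);
            best = Math.max(best, s);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            // keep only "highly probable" if any exist, otherwise drop the "improbable" ones
            if (best == 2 ? e.getValue() == 2 : e.getValue() >= best) kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // illustrative bigram dictionary entries
        Set<String> bigrams = new HashSet<>(Arrays.asList("to start", "start over", "needs to"));
        // candidate corrections generated for the misspelled word "starrt"
        // in "the partyu neesd to starrt over from scrarch"
        List<String> kept = filter("to", Arrays.asList("starry", "start", "starr"), "over", bigrams);
        System.out.println(kept); // [start]
    }
}
```

In this toy run only “start” forms both bigrams with its neighbours, so it is the only candidate passed on to chunking, which matches the single word form kept for “starrt” in the log above.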

d) Test collection
The work of the program for Russian and English was tested on mini-collections of 100 clauses (simple sentences) each, taken from Twitter. Each collection is divided into four parts: 25 sentences with 4 misspelled words, 25 sentences with 3 misspelled words, 25 sentences with 2 misspelled words, and 25 sentences with 1 misspelled word. All the misspelled words in the collections contain one mistake, at a Levenshtein distance of one.


The sentences that formed the two mini-collections were manually selected from the whole set of Twitter messages in the languages in question. The sentences had to meet the following criteria: they had to be simple sentences containing no swear words and no mentions of sex and pornography, politics, or racial discrimination. The mistakes were introduced into the selected clauses manually, so that different types of mistakes are presented within one sentence and the overall representation of all error types is almost equal. Twitter was chosen as the source of data for the collection, as it is one of the largest freely available sets of messages from real people. Besides, it provides an advanced search tool, which simplified the process of collection creation. During the work with the collection, such characteristics of the authors of Twitter messages as age, gender, social status, etc., were not taken into consideration, as the main goal of the research was to analyze the performance of the spelling correction pipeline, which does not depend on these characteristics.

Results
This section presents the results of the program execution for Russian and English separately.

a) Results for Russian
The results of testing the program for Russian are shown in Fig. 1. The graph presents the number of sentences that underwent all the necessary corrections (all misspelled words in the sentence were replaced with the right correction) for the four parts of the collection separately. The input (original collection), the output (result of the work of the program) and the logs for both languages are available for download at https://cloud.mail.ru/public/8qMm/yytN3tmDF.

Fig. 1 Program testing results (for Russian).

As shown in the diagram, out of the 25 sentences that contain four spelling errors each, only two sentences underwent all the necessary corrections. In the second and the third parts of the collection (sentences with three and two spelling mistakes each, respectively), five sentences were corrected. The biggest percentage of completely corrected sentences (64%, or 16 sentences) belongs to the fourth part of the collection, which contains 25 sentences with one misspelled word each. A more detailed analysis of the program's work is presented in Fig. 2.


Fig. 2 Uncorrected mistakes and mistakes corrected by Levenshtein method, bigrams and chunking (for Russian).

The diagram in Fig. 2 represents the four parts of the mini-collection separately (sentences with 4 misspelled words each, sentences with 3 misspelled words, etc.). The black parts denote the percentage of mistakes corrected by the Levenshtein method, i.e. misspelled words for which the program generated only one possible correction. The dark-grey parts present the percentage of mistakes corrected by bigrams: the program generated several possible correction variants, but the stage “Correction filtering using a dictionary of bigrams” eliminated all but one of them. The grey parts of the bars show the percentage of mistakes corrected by chunking: after the choice of the best chunk tree, the program introduced into the original sentences the correction variants that were included in the best tree.

b) Results for English
The results of testing the program for English are shown in Fig. 3.

Fig. 3 Program testing results (for English).

As shown in the diagram (Fig. 3), the number of completely corrected sentences in the four parts of the collection is as follows:
- Sentences 1-25 (four mistakes each): 11 completely corrected clauses;
- Sentences 26-50 (three mistakes each): 5 completely corrected clauses;
- Sentences 51-75 (two mistakes each): 10 completely corrected clauses;
- Sentences 76-100 (one mistake each): 18 completely corrected clauses.


A detailed analysis of the contribution of each method to spelling correction is presented in Fig. 4.

Fig. 4 Uncorrected mistakes and mistakes corrected by Levenshtein method, bigrams and chunking (for English).

The diagram in Fig. 4 shows the high contribution of the stage “Correction filtering using a dictionary of bigrams” to the general number of corrected mistakes, which testifies to the justifiability of the decision to include this stage in the pipeline.

Discussion
The new version of the pipeline, which includes the stage “Correction filtering using a dictionary of bigrams”, showed better results than the pipeline without it, in which the only criterion for choosing the most suitable correction was the result of chunking. First of all, there is a certain percentage of mistakes that can be corrected at the bigram stage, as the program eliminates all but one possible variant. In Russian the share of such mistakes varies from 12% to 19%; in English it is even higher, from 32% to 50%. As shown in Figures 2 and 4 (for Russian and English, respectively), the introduction of the bigram stage increased the percentage of corrected mistakes, which proved the joint usage of chunking and bigrams within the frames of the spelling correction task to be a winning combination. Besides, at the bigram stage all “improbable” variants are excluded from the set of potential corrections, which further decreases the number of possible chunks and, consequently, the time needed to search for chunk trees in the graph. This enhances the chances of choosing the right variant of correction. Noteworthy, the number of misspelled words in a clause does not, as might be expected, considerably influence the result of the program execution: in Russian the percentage of corrected mistakes varies from 44% to 64% (Fig. 2), and for English it ranges from 64% to 77% (Fig. 4). The contribution of chunking for Russian is higher than for English. This can be accounted for by the fact that English has a strict word order, due to which most mistakes are eliminated at the stage of analysis using a dictionary of bigrams. Besides, the fact that the results for the two languages are rather similar testifies to the correct realization of the program and to the prospects of the suggested method.


As for the shortcomings of using bigrams, at the stage “Correction filtering using a dictionary of bigrams” the correct variant of a misspelled word's replacement was eliminated in 8 cases in Russian and in 4 cases in English, thus leaving the clause no chance to undergo the necessary correction. Although the percentage of such errors is quite low, 3.2% for Russian and 1.6% for English, this still requires improvement, which will be addressed in further research. Possible ways of improving the work of the program include the creation of a frequency dictionary of chunks, the enhancement of the algorithm for choosing the best chunk tree, and the search for other means (besides bigrams) to reduce the number of trees.

The analysis of the proceedings [14], [18], [19] showed a rather low number of applied works based on syntactic models. This does not concern the creation of treebanks, which will always be in high demand in the scientific community and which can also be interesting from a purely theoretical point of view. We believe that in the near future syntactic models of text can become a successful field of application for a wide range of problems concerning different types of disambiguation: the choice of the correct homophone from a number of variants in voice assistants like Siri [https://www.apple.com/ios/siri/], homonymy resolution, error correction in spellcheckers, batch preprocessing of “dirty” texts (texts with mistakes) within the frames of sentiment analysis, etc. [28]. The main goal of the research is to use the task of spelling correction to demonstrate that the stage of partial syntactic analysis works successfully. Then we will proceed to other stages, e.g. the stage of partial semantic analysis.

Approaches based on HMM and probabilistic models [32] have their advantages, and the chunking model has its own; we do not oppose them. For example, the chunking model explicitly sets the lemma (basic word form) and the word tuple, which do not exist in the mentioned models. In fact, we suggest rule-based parsing in the dependency model, where the rules are formed as the “Chunking” database and the decision scheme. Besides, we introduced such limitations as strictly two-member chunks and extended word tuples. There are rules for reducing the list of homonyms and the number of chunks, aimed at decreasing the combinatorial complexity of the problem, as well as rules for building a forest (i.e. a set) of chunk trees (i.e. dependencies) and rules for choosing the best tree from the forest.

For spelling correction we use a Levenshtein (edit) distance of only one step. If we corrected mistakes with a Levenshtein distance over 1, we would get a combinatorial explosion, and the program would never finish. We use only a dictionary of bigrams and do not take into consideration n-grams of a higher order [12], [13]. N-grams for N > 2 include bigrams; consequently, bigrams are enough for eliminating the improbable variants. Besides, if we used n-grams of a higher order, the size of the dictionary would increase, and thus the load time would also increase.

We use a relatively small test collection. At the early stages of program debugging the execution of the pipeline even for such a small collection took quite a long time; for example, the execution of the program for the described collection took about 10 hours. So we decided to use a small collection in order to speed up the work on the analysis. Now the program has been optimized, and it processes the collection of 100 sentences within 10-15 minutes. In the future we are planning to increase the size of the collection to 400 sentences (a precision of 1%) and to generate mistakes using a random number generator.
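The planned random generation of mistakes could rely on a very small corruption routine. The sketch below is only an illustration of this idea and is not part of the existing pipeline; its class and method names are hypothetical. It applies one random edit (insertion, deletion or substitution) to one randomly chosen word, so the corrupted word stays within a Levenshtein distance of one from the original.

```java
import java.util.Random;

// Illustration of the planned random generation of mistakes:
// pick one word of the clause and apply a single random edit
// (insertion, deletion or substitution), so that the corrupted word
// stays within Levenshtein distance 1 of the original.
public class MistakeGenerator {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private static final Random RND = new Random();

    static String corruptClause(String clause) {
        String[] words = clause.split(" ");
        int w = RND.nextInt(words.length);          // choose a word at random
        words[w] = corruptWord(words[w]);
        return String.join(" ", words);
    }

    static String corruptWord(String word) {
        int pos = RND.nextInt(word.length());        // choose a position at random
        char letter = ALPHABET.charAt(RND.nextInt(ALPHABET.length()));
        switch (RND.nextInt(3)) {
            case 0:  return word.substring(0, pos) + letter + word.substring(pos);      // insertion
            case 1:  return word.substring(0, pos) + word.substring(pos + 1);           // deletion
            default: return word.substring(0, pos) + letter + word.substring(pos + 1);  // substitution
        }
    }

    public static void main(String[] args) {
        System.out.println(corruptClause("the party needs to start over from scratch"));
    }
}
```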
In the future we are also planning to study how the time of the program execution depends on clause characteristics (such as the number of mistakes, word length, the number of words and chunks, the number of generated variants of correction, the number of homonyms, etc.), which will let us optimize the program. It is also planned to analyze the program execution using non-parametric testing [33].
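Finally, the criterion currently used at the stage “Choosing the best chunk tree” (the tree covering the most words of the clause wins) is simple enough to be shown as a sketch. Here a chunk is assumed to be just a pair of word forms (main word and dependent word) and a tree a list of such chunks; ties are not handled, and the names are ours, not those of the NLP@Cloud implementation.

```java
import java.util.*;

// Sketch of the "Choosing the best chunk tree" criterion:
// among the candidate chunk trees, pick the one whose two-member chunks
// cover the largest number of distinct words of the clause.
public class BestChunkTree {

    // A chunk is a strictly two-member fragment: main word and dependent word.
    record Chunk(String main, String dependent) {}

    static List<Chunk> bestTree(List<List<Chunk>> forest) {
        List<Chunk> best = Collections.emptyList();
        int bestCoverage = -1;
        for (List<Chunk> tree : forest) {
            Set<String> covered = new HashSet<>();
            for (Chunk c : tree) {
                covered.add(c.main());
                covered.add(c.dependent());
            }
            if (covered.size() > bestCoverage) {   // more covered words wins
                bestCoverage = covered.size();
                best = tree;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two hypothetical trees for "the party needs to start over from scratch"
        // (function words sit in the extended word tuples, not in the chunks).
        List<Chunk> t1 = List.of(new Chunk("party", "needs"), new Chunk("needs", "start"));
        List<Chunk> t2 = List.of(new Chunk("party", "needs"), new Chunk("needs", "start"),
                                 new Chunk("needs", "scratch"));
        System.out.println(bestTree(List.of(t1, t2)).size()); // prints 3: the larger tree covers 4 words
    }
}
```

It is precisely this coverage criterion that the frequency dictionary of chunks mentioned above could refine.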


Conclusion
The present paper is dedicated to the problem of automatic spelling correction for Russian and English on the basis of the model of incomplete syntactic analysis (chunking) and bigrams. The main characteristic of the program is that it does not require human interference in the process of correction. The program generates all possible corrections at a Levenshtein distance of one for all the words it does not find in the morphological dictionary. At the stage “Correction filtering using a dictionary of bigrams” all the variants are marked as “highly probable”, “probable” or “improbable” depending on their occurrence in the dictionary; at the next stages only the variants with the highest available level of probability are taken into consideration. As shown in the diagrams (Fig. 2, 4), all three methods of spelling correction contribute to the general number of corrected mistakes. Thus, the cumulative effect of applying the Levenshtein method, bigrams and chunking is higher than their separate use. Overall, the joint use of the Levenshtein method, bigrams and chunking within the frames of the spelling correction task proved to be a promising direction.

Among the limitations of the program is the fact that at present it processes only simple sentences (i.e. those consisting of only one clause). In the future, after the realization of the segmentation stage, the work of the program will also be tested on compound and complex sentences. Once the work on the pipeline is accomplished, it can be formed as an API and used during keyboard input into text boxes with spelling correction support in medical institutions (sickness certificates, clinical records), in courts (trial records), in the police (investigation records), as well as in tweets, forums, emails, etc.

Acknowledgement, Authors' Contribution
The idea of the study and the heuristics belong to Vladimir Polyakov. The database “Chunking” and the decision scheme for the search for the subject and the predicate were developed by Vladimir Polyakov and Elena Makarova. The database “Chunking”, the COCA dictionaries and the Stanford PoS-tagger were adapted by Ivan Anisimov. The test collection was compiled by Elena Makarova and Vladimir Polyakov. The collection was processed by Ivan Anisimov. The results were analyzed by Elena Makarova and Vladimir Polyakov. The formal setup of the problem of reducing the matrix of potential word forms was initially given by Vladimir Polyakov. The formal setup of the problem of applying a dictionary of bigrams to reduce a set of corrections was solved by Ivan Anisimov. The algorithm for choosing the best tree was suggested by Ivan Anisimov. The program was implemented by Ivan Anisimov. The research was supported by RSF grant # 15-11-10019.

References
1. Anisimov, I., Makarova, E., Polyakov, V. (2016-1). Chunking in Dependency Model and Spelling Correction in Russian. In: Proceedings of DTGS-2016, 23-24 June, St. Petersburg, Russia. Communications in Computer and Information Science, vol. 674. Springer. Pp. 565-575. DOI: 10.1007/978-3-319-49700-6_56.
2. Anisimov, I., Makarova, E., Polyakov, V. (2016-2). Chunking in Dependency Model and Spelling Correction in Russian and English. In: Proceedings of the 2016 SAI Intelligent Systems Conference (IntelliSys), 21-22 September 2016, London, United Kingdom. Pp. 143-150. ISBN (IEEE Xplore): 978-1-5090-1121-6. ISBN (USB): 978-1-5090-1665-5.
3. Anisimov, I., Makarova, E., Polyakov, V. (2017-1). Choosing chunk trees in the task of multi-objective optimization. In: Proceedings of the 2017 SAI Computing Conference (Computing-2017), 18-20 July 2017, London, United Kingdom. (in print)
4. Anisimov, I., Makarova, E., Polyakov, V., Solovyev, V. (2017-2). Spelling Correction in English: joint use of bigrams and chunking. In: Proceedings of the 2017 SAI Intelligent Systems Conference (IntelliSys-2017), 7-8 September 2017, London, United Kingdom. (in print)
5. Bocharov, V., Bichineva, S., Granovsky, D., Ostapuk, N., Stepanova, M. (2011). Quality assurance tools in the OpenCorpora project. In: Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference “Dialog”, pp. 10-17.
6. Brill, E., Moore, R. C. (2000). An improved error model for noisy channel spelling correction. In: ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pp. 286-293. Association for Computational Linguistics.
7. Bushtedt, V., Polyakov, V. (2009). Heuristics for Improvement of Partial Syntactic Analyzer Work. Scientific Notes of Kazan State University, vol. 151, book 3, pp. 214-228.
8. Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory, vol. IT-2, pp. 113-124.
9. Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton. (Reprint: Chomsky, N. Syntactic Structures. De Gruyter Mouton, 2002. ISBN 3-11-017279-8.)
10. Corbin, Kyle (1989). Double, Triple, and Quadruple Bigrams. Word Ways, vol. 22, iss. 3, article 8.
11. Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7, pp. 171-176.
12. Davies, M. (2009). The 385+ million word Corpus of Contemporary American English: design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14, pp. 159-190.
13. Davies, M. (2013). Google Scholar vs. COCA: two very different approaches to examining academic English. Journal of English for Academic Purposes, 12, pp. 155-165.
14. Gerdes, K., Hajičová, E., Wanner, L. (eds.) (2011). Proceedings of the First International Conference on Dependency Linguistics (Depling-2011). Barcelona, Spain. ISBN 978-84-615-1834-0.
15. Golding, A. R., Roth, D. (1999). A winnow-based approach to context-sensitive spelling correction. Machine Learning, vol. 34, no. 1-3, pp. 107-130.
16. Golding, A. R., Schabes, Y. (1996). Combining trigram-based and feature-based methods for context-sensitive spelling correction. In: Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pp. 71-78. Association for Computational Linguistics.
17. Goldwater, S., Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Association for Computational Linguistics (ACL).
18. Hajičová, E., Gerdes, K., Wanner, L. (eds.) (2013). Proceedings of the Second International Conference on Dependency Linguistics (DepLing-2013). Prague, Czech Republic. ISBN 978-80-7378-240-5.
19. Hajičová, E., Nivre, J. (eds.) (2015). Proceedings of the Third International Conference on Dependency Linguistics (DepLing-2015). Uppsala, Sweden. ISBN 978-91-637-8965-6.
20. Kernighan, M. D., Church, K. W., Gale, W. A. (1990). A spelling correction program based on a noisy channel model. In: Proceedings of the 13th Conference on Computational Linguistics, pp. 205-210. Association for Computational Linguistics.
21. Kukich, K. (1992). Techniques for automatically correcting words in texts. ACM Computing Surveys, 24, pp. 377-439.
22. Mays, E., Damerau, F. J., Mercer, R. L. (1991). Context based spelling correction. Information Processing & Management, vol. 27, no. 5, pp. 517-522.
23. McIlroy, M. D. (1982). Development of a Spelling List. AT&T Bell Laboratories.
24. Melchuk, I. (2003). Levels of dependency in linguistic description: Concepts and problems. In: Ágel et al., pp. 170-187.
25. Oflazer, K. (1996). Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, vol. 22, no. 1, pp. 73-89.
26. Polyakov, V., Solovyev, V., Anisimov, I., Ponomarev, A. (2013). Creation of a new generation of intellectual systems of semantic text processing. Neurocomputers: Development, Application, issue 1, pp. 31-39. ISSN 1999-8554. (in Russian)
27. Ristad, E. S., Yianilos, P. N. (1998). Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 522-532.
28. Solovyev, V., Ivanov, V. (2014). Dictionary-Based Problem Phrase Extraction from User Reviews. LNAI, vol. 8655, pp. 225-232.
29. Solovyev, V., Ivanov, V. (2016). Knowledge-Driven Event Extraction in Russian: Corpus-Based Linguistic Resources. Computational Intelligence and Neuroscience, vol. 2016, article ID 4183760.
30. Tesnière, L. (1959). Éléments de syntaxe structurale (Elements of Structural Syntax). Klincksieck, Paris. Preface by Jean Fourquet, professeur à la Sorbonne. Second edition, reviewed and corrected, ISBN 2-25202620-0; re-edition of: Tesnière, L. (1959). Éléments de syntaxe structurale, Klincksieck, Paris, ISBN 2-25201861-5.
31. Tesnière, L. (1988). Dependency Syntax: Theory and Practice. Albany, N.Y.: SUNY Press, 428 p.
32. Clark, A., Fox, C., Lappin, S. (eds.) (2013). The Handbook of Computational Linguistics and Natural Language Processing. John Wiley & Sons.
33. Weiss, Neil A. (1999). Introductory Statistics (5th ed.), p. 802. ISBN 0-201-59877-9.
