VOL. 3, NO.10 Oct, 2012

ISSN 2079-8407

Journal of Emerging Trends in Computing and Information Sciences ©2009-2012 CIS Journal. All rights reserved. http://www.cisjournal.org

Spelling Error Trends and Patterns in Sindhi

1 Zeeshan Bhatti, 2 Imdad Ali Ismaili, 3 Asad Ali Shaikh, 4 Waseem Javaid

1,2,3,4 Institute of Information and Communication Technology, University of Sindh, Jamshoro
Email: {1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]}

ABSTRACT

Statistical error correction is the most accurate and widely used approach today, but for a low-resourced language like Sindhi the required training corpora are not available, so statistical techniques are not feasible. A useful alternative is to exploit the spelling error trends of Sindhi through a rule-based approach. An essential prerequisite for designing such a technique is a study of the error patterns of the language. This paper presents a study of spelling error trends and their types in the Sindhi language. The research shows that the error trends common to all languages are also encountered in Sindhi, but that there also exist some error patterns that are specific to Sindhi.

Keywords: Sindhi, Error Trends, Phonetic Errors, Cognitive Errors, Typographic Errors, Visual Errors.

1. INTRODUCTION

Spell checkers have been developed for many of the world's languages. Research on detecting errors in typed words started in 1960 [2]. The first spell checkers were only ‘verifiers’, not ‘correctors’: they offered no suggestions for an incorrectly spelled word. Such verifiers helped only with typing mistakes and were of little use for logical or phonetic errors, so early developers faced considerable difficulty in offering useful suggestions for misspelled words. Since then, various spell checking techniques and algorithms have been developed for various languages of the world. Sindhi, spoken in Pakistan and some parts of India, is one of the oldest languages of the world and is written in the Persio-Arabic and Devanagari scripts. Owing to the complexity of its script and ligatures, it is essential to understand the ligature structure and to analyze the error patterns or trends in Sindhi before any attempt at developing a spell checker is made.

2. THE WORD “ERROR”

An ‘error’ can be placed in one of two distinct categories: the “Non-Word Error” and the “Real-Word Error”. Consider a candidate string, that is, a sequence of characters usually delimited by spaces or punctuation marks. If the string has a meaning it is a valid word; otherwise it is referred to as a ‘Non-Word Error’ [2]. Real-word errors occur when the typist or writer types a valid, correctly spelled word that is not the word actually intended in the context. The resulting sentence is therefore syntactically or semantically incorrect [3].
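The distinction can be illustrated with a minimal sketch of a lexicon-based verifier. The function name `find_non_words` and the toy lexicon are assumptions made for illustration only; they are not part of this study, and a practical checker would load a full Sindhi word list.

```python
import re

def find_non_words(text, lexicon):
    """Return the tokens of `text` that are absent from `lexicon`."""
    # A token is a maximal run of word characters; spaces and
    # punctuation marks act as delimiters.
    tokens = re.findall(r"\w+", text)
    return [token for token in tokens if token not in lexicon]

# Hypothetical toy lexicon (assumption); not a real Sindhi word list.
CORRECT = "پاڪستان"
LEXICON = {CORRECT, "سنڌي", "صحي", "صحيح"}

# Dropping the fourth letter yields a string that is not in the lexicon.
NON_WORD = CORRECT[:3] + CORRECT[4:]

# The non-word is flagged by the lookup.
print(find_non_words("سنڌي " + NON_WORD, LEXICON))
# A valid word typed in place of another valid word is not flagged.
print(find_non_words("صحيح", LEXICON))
```

A plain lookup of this kind can only report non-word errors. A real-word error passes unnoticed because the mistyped string is itself in the lexicon.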

3. CLASSIFICATIONS OF ERRORS

In the literature there are several theories on the classification of errors, and these theories do not exclude one another. Several of them have been discussed by Kukich [4]. The next subsection (3.1) reviews the classification theories from several different studies, and the subsections from 3.2 onwards discuss the classifications of errors from these studies.

3.1 Typing Errors And Spelling Errors

The basic classification of errors is between typing errors (commonly known as ‘typos’) and spelling errors. Typing errors occur when the typist or author knows the correct spelling of the word but mistakenly, or by a slip of the finger, presses a wrong key. Since the writer is aware of the actual spelling of the word, the possibility of multiple mistakes within the word is very low, and the resulting misspelled word will usually resemble the intended word orthographically (e.g. {cupreme court} for {Supreme Court (noun): supreme kourt}). The term spelling errors, in contrast, refers to errors that arise from the ignorance of the author or typist, regardless of whether the actual spelling of the intended word is known. Usually the terms spelling errors and typing errors are treated as the same and referred to simply as spelling errors, which sometimes creates confusion among users. There are three general causes of spelling errors:

i. Phonetic errors arise from similarly pronounced characters in a language, where the misspelled word is phonetically similar to the intended word (e.g. ماڻحو for ماڻهو {human: mardhon} or پاڪصتان for پاڪستان {Pakistan (noun): Pakistan}). The phonetically incorrect word is sometimes also referred to as a homophone of the intended word.

ii. The misspelled word is semantically similar to the intended word (e.g. صحيح for صحي {correct: sahi} or صلاحتو for صلاحو {guider: salahoo} or صلاحي).

iii. The writer or typist is unaware of the basic grammatical rules of the language while typing the text. According to researchers and linguists, grammatical errors do not fall under the category of spelling errors and should be treated as a separate category, so most general spelling error detection tools and software ignore them.

3.2 Single Error Words And The Multiple Error Words

The second category of errors discussed in the literature review of Kukich [4] is the distinction between words that contain a single character error and words that contain multiple errors, hence the terms single error words and multiple error words. Kukich describes single error mistakes as usually arising from one of the following four causes [4]:

i. A single character inserted at an invalid position in a string.
ii. A single character deleted from an intended word.
iii. A character substituted by another incorrect or undesired character.
iv. Transposition of two letters in a word.

These are sometimes collectively referred to as ‘single transformations’ and are also known as ‘Damerau transformations’ [5]. Words that include more than one of the above error types are simply known as multiple error words.

3.3 Errors In Long Words And The Errors In Short Words

The next category of error classification is the distinction between errors in short words and errors in long words, that is, valid strings consisting of a large number of characters. The figures reported for short words vary widely. Pollock and Zamora [6] found that errors in short words of three or four characters constitute 9.2% of all errors. According to Yannakoudakis and Fawthrop [7] the frequency of errors in short words is 1.5%. Kukich [8], however, disagrees with the previous two analyses and reports that 63% of errors occur in short words of two, three and four characters.

3.4 First Character Errors And The Nth Character Errors

The next type of error classification discussed in the literature is the distinction between an error in the first character of the word and an error in any other character of the word. The first character of the intended word can be deleted (فاظت for حفاظت {protection: hifazat}), substituted (ھفاظت for حفاظت {protection: hifazat}), transposed with the second character (فحاظت for حفاظت {protection: hifazat}), or some other character can be inserted before it (جحفاظت for حفاظت {protection: hifazat}). An nth character error occurs when the error lies in a character other than the first; in this type of error the first character must be correct.

3.5 Errors Within One Word And The Word Boundary Errors

In this classification, errors within a Sindhi word are distinguished from errors at the boundary between words. Word boundary errors result from incorrect or misplaced spacing of words. There are generally two types of word boundary error:

i. Incorrect splits, when the space character is inserted at an invalid position of a word, hence splitting a single word into two, e.g. a split form of {Jamshoro (noun): jamshoro}.
ii. Run-ons, when the space character is omitted between two separate words, hence joining the two words into a single word, e.g. a run-on involving {University (noun): uoniversity}.

Kukich [9] found that 15% of all the errors discovered in that study were word boundary errors, and Mitton [10] similarly found that 13% of the errors were word boundary errors.
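The two kinds of word boundary error can be sketched as two lexicon lookups. The helper names `split_run_on` and `join_split` and the toy lexicon are hypothetical and serve only to illustrate the idea; the literature cited above does not prescribe this procedure.

```python
def split_run_on(token, lexicon):
    """Return every (left, right) split of `token` into two lexicon words."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in lexicon and token[i:] in lexicon
    ]

def join_split(left, right, lexicon):
    """Return the joined word if two adjacent tokens form one lexicon word."""
    joined = left + right
    return joined if joined in lexicon else None

# Hypothetical toy lexicon (assumption), built from words used in this paper.
FIRST, SECOND, WORD = "لعل", "شهباز", "پاڻي"
LEXICON = {FIRST, SECOND, WORD}

# Run-on: the space between two words was omitted.
print(split_run_on(FIRST + SECOND, LEXICON))
# Incorrect split: a space was inserted inside one word.
print(join_split(WORD[:2], WORD[2:], LEXICON))
```

With a large lexicon a run-on may have several valid splits, so a practical corrector would need some way of ranking the candidates.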

3.6 The Real-Word Errors And Non-Word Errors

In the sixth and last classification of errors, a distinction is drawn between real-word errors and non-word errors. Non-word errors are invalid words that are not found in any dictionary or corpus (e.g. ھفاظت for حفاظت {protection: hifazat}). Real-word errors, in contrast, yield a string of characters that is itself a valid word with a meaning, found in a dictionary (e.g. صحيح for صحي {correct: sahi}); hence the spell checker will not be able to identify such errors. Peterson [11] is amongst the few who have investigated and produced theories on the frequency of real-word errors. Peterson [11] studied the relationship between lexicon size and the real-word error rate, considering only single error misspellings [5].
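The link between lexicon size and undetected real-word errors can be illustrated with a small sketch. It enumerates the single-error variants of a word and keeps those that are themselves lexicon entries. The function names, the toy lexicons and the reduced alphabet are assumptions for illustration; this is not Peterson's experimental procedure.

```python
def single_error_variants(word, alphabet):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}

def undetectable_errors(word, lexicon, alphabet):
    """Single-error variants of `word` that are themselves lexicon words."""
    return sorted(single_error_variants(word, alphabet) & set(lexicon))

# Hypothetical toy data (assumption): the pair used as an example above.
SAHI = "صحي"
SAHIH = SAHI + "ح"
ALPHABET = set(SAHIH)          # reduced alphabet, enough for this example
SMALL = {SAHI, "حفاظت"}
LARGE = SMALL | {SAHIH}

print(undetectable_errors(SAHI, SMALL, ALPHABET))   # nothing is masked
print(undetectable_errors(SAHI, LARGE, ALPHABET))   # one variant is now a real word
```

Adding a word to the lexicon turns a previously detectable misspelling into a real-word error, which is the trade-off that makes lexicon size matter.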

4. SPELLING ERROR TRENDS IN SINDHI

The common technique for developing a spell correcting algorithm is to begin by studying and analyzing the spelling error trends (also known as error patterns) in a language. According to Damerau [5] and Peterson [11], spelling errors in the English language are divided into two main categories, typographic errors and cognitive errors. Through research and analysis, this study has ascertained that four types of errors usually occur in the Sindhi language:

• Typographic Errors
• Cognitive or Phonetic Errors
• Visual Errors
• Space Related Errors

A single error therefore refers to any error produced by one of the four basic editing operations of insertion, deletion, substitution and transposition [4]. Damerau's observation [5] was later confirmed by a number of researchers and scholars, including Peterson [11]. Table 1 shows the results of a study by Peterson [11]. The source data were taken from Webster's Dictionary and Government Printing Office (GPO) documents, which were retyped by college students [1].

Table 1: Statistics of the Four Basic Types of Errors (for English language)

Error Type | GPO | Web7
Transposition | 4 (2.6%) | 47 (13.1%)
Insertion | 29 (18.7%) | 73 (20.3%)
Deletion | 49 (31.6%) | 124 (34.4%)
Substitution | 62 (40.0%) | 97 (26.9%)
Total | 144 (92.9%) | 341 (94.7%)

5. TYPOGRAPHIC ERRORS

Typographic errors occur when a word is mistyped, usually because a finger was placed in the wrong position, even though the correct spelling of the intended word is known. These errors mostly depend on the keyboard and on the typing skill and accuracy of the typist, and so do not follow any linguistic rule [12][13]. The error patterns are somewhat dependent on the language and the keyboard layout. Damerau [5] in his study emphasized that approximately 80% of typographic errors fall into one of the following four categories:

• Letter Insertion Error: If a character is pressed twice while writing a word, or some other character is pressed along with the intended character, the undesired character is considered to be inserted, e.g. جحفاظت for حفاظت {protection: hifazat} or سنهڌي for سنڌي {Sindhi (noun): sindhi}.
• Letter Deletion Error: If one or more characters are missing from a word, they are said to be deleted, and the error is called a deletion error or omission error, e.g. فاظت for حفاظت {protection: hifazat} or پاڪتان for پاڪستان {Pakistan (noun): Pakistan}.
• Letter Substitution Error: The desired character in a word is replaced by some other character, e.g. پارلعامينٽ for پارليامينٽ {parliament (noun): parliament}.
• Letter Transposition Error: As the name suggests, transposition errors occur when characters have "transposed", that is, they have switched places. Transposition errors are almost always human in origin. Characters are most commonly transposed when a user is touch typing at a speed that makes them enter one character before the other, possibly because the brain is one step ahead of the fingers, e.g. فحاظت for حفاظت {protection: hifazat} or قئاداعظم for قائداعظم {Quaid-e-Azam (noun): quaideazam}.

Table 2: Various Typographic Error Types

Error Type | Wrong Word | English Transliteration | Correct Word
Letter Insertion | سنهڌي | Sindhi | سنڌي
Letter Omission | پاڪتان | Pakistan | پاڪستان
Letter Substitution | پارلعامينٽ | Parliament | پارليامينٽ
Transposition of two adjacent letters | قئاداعظم | Quaideazam | قائداعظم
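The four typographic error types can be told apart mechanically once the intended word is known. The following is a minimal sketch; the function name `classify_single_error` and its return labels are assumptions for illustration and are not taken from Damerau [5].

```python
def classify_single_error(wrong, correct):
    """Name the single editing operation that turns `correct` into `wrong`."""
    if wrong == correct:
        return "none"
    # Skip the common prefix to reach the first differing position.
    i = 0
    while i < min(len(wrong), len(correct)) and wrong[i] == correct[i]:
        i += 1
    if len(wrong) == len(correct) + 1 and wrong[i + 1:] == correct[i:]:
        return "insertion"
    if len(wrong) + 1 == len(correct) and wrong[i:] == correct[i + 1:]:
        return "deletion"
    if len(wrong) == len(correct):
        if wrong[i + 1:] == correct[i + 1:]:
            return "substitution"
        if (i + 1 < len(wrong)
                and wrong[i] == correct[i + 1]
                and wrong[i + 1] == correct[i]
                and wrong[i + 2:] == correct[i + 2:]):
            return "transposition"
    return "multiple"

# (wrong, correct) pairs taken from the examples in this section.
EXAMPLES = [
    ("سنهڌي", "سنڌي"),
    ("پاڪتان", "پاڪستان"),
    ("پارلعامينٽ", "پارليامينٽ"),
    ("قئاداعظم", "قائداعظم"),
]
for wrong, correct in EXAMPLES:
    print(wrong, correct, classify_single_error(wrong, correct))
```

Anything that is not exactly one of the four operations is reported as a multiple error word, in line with the definition in Section 3.2.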

6. COGNITIVE OR PHONETIC ERRORS

Cognitive or phonetic errors, sometimes also referred to as hearing errors, usually occur due to similarly pronounced (homophone) characters in Sindhi. Generally the writer does not know the actual spelling of the word and guesses a probable spelling using letters whose pronunciation sounds similar to that of the actual word. Similar sounding or homophone letters are often substituted in this type of error, e.g. ‘ھ’ {hah} for ‘ح’ {hey} or ‘ت’ {tey} for ‘ط’ {toye}, so that “تاريڪ” {darkness: tareeik} is replaced by “طاريڪ”. Table 3 shows the list of all the Sindhi letters grouped according to their similar sound of pronunciation.

Table 3: Similarly pronounced letters in Sindhi (columns: Sound code | Similar Sound Character Group; cell contents listed column by column)

Sound codes 1 to 6: ا پ ڀ ت ٿ ٽ | ب ط ٺ | ٻ | آ | ء | ي | ئ
Sound codes 7 to 22: ث ج ڃ ح خ د ذ ڙ ش ف ق گ ل م ن ڳ | ڱ | س ڇ ڄ ھ ک ڌ ز ڻ ڦ ڪ غ | ڏ ض ر | ہ ڍ ظ | ص چ
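Sound groups of this kind can be used to match a phonetic misspelling against a lexicon. The sketch below is an assumption-laden illustration: it uses only the two letter pairs named in the text of this section (ھ with ح, and ت with ط), it does not reproduce the numeric sound codes of Table 3, and the names `sound_key` and `phonetic_candidates` are hypothetical.

```python
# Partial sound groups (assumption): only the pairs named in the text.
SOUND_GROUPS = [("ح", "ھ"), ("ت", "ط")]

# Every letter in a group is mapped to the group's first letter.
SOUND_CODE = {letter: group[0] for group in SOUND_GROUPS for letter in group}

def sound_key(word):
    """Replace each letter by the representative of its sound group."""
    return "".join(SOUND_CODE.get(ch, ch) for ch in word)

def phonetic_candidates(misspelling, lexicon):
    """Lexicon words that sound the same as `misspelling` under SOUND_CODE."""
    key = sound_key(misspelling)
    return sorted(word for word in lexicon if sound_key(word) == key)

# Hypothetical toy lexicon (assumption).
CORRECT = "حفاظت"
LEXICON = {CORRECT, "سنڌي"}

# The first letter is replaced by its similar sounding counterpart.
MISSPELLED = "ھ" + CORRECT[1:]
print(phonetic_candidates(MISSPELLED, LEXICON))
```

Because both spellings reduce to the same key, the intended word is recovered even though the misspelling itself is not in the lexicon.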

Some of the errors that usually occur due to the similar sound of Sindhi letters are shown in Table 4. Here the writer tries to guess the correct word by inserting a letter which sounds appropriate for the word. Each row lists the misspelled forms alongside the correct word.

Table 4: Various Examples of cognitive errors

S.No: | Wrong Word | English Transliteration | Correct Word
1 | مفروزو or مفروذو | Mafroozo | مفروضو
2 | حڪومط or ھڪومت | Hukomat | حڪومت
3 | قرارداڌ or ڪرارداد | Qrrardaad | قرارداد
4 | موجودگهي or موجهودگي | Moujeedgy | موجودگي

7. VISUAL ERRORS

Visual errors are errors that occur due to the similar shapes of various Sindhi characters. They could also be termed reading errors or similar shape errors. According to Tahira Naseem, “Similar shape based errors are not cognitive in nature” [1]. There could be several reasons for this type of error. First, a professional typist typing from a handwritten draft will usually type the text exactly as it looks at first glance, without giving much consideration to its meaning or word structure, and thus characters with a similar visual representation are confused with each other, resulting in spelling mistakes. Second, a novice typist or writer of Sindhi may know the correct spelling of the word but confuse letters due to shape similarity; for example ڃ {jey} could be mistyped for ڄ {chey}, or ث for ٽ {Tey}. Third, the writer may be unaware of the correct spelling of the word and try to guess it based on the similar shape of characters. Table 5 shows a few examples of such visual errors in the Sindhi language.

Table 5: Types of Error that occur

Wrong Word | English Transliteration | Correct Word
دنڌو or دندو | Dhandho | ڌندو
ڃڄ or جڃ | Jyunj | ڄڃ
ٻڪزي or بڪري | Bakri | ٻڪري
احتيازات or اخٺيارات | Akhtyrait | اختيارات

The various shape groups for Sindhi are given in Table 6. Some characters in the Sindhi language do not share their shape with any other character and so are placed alone in the table; there are eleven such characters (i.e. ا, جه, ق, ڪ, گه, ل, م, و, ھ, ء, ي). All other characters in the Sindhi alphabet have similar shape characters, as listed in Table 6.

Table 6: Basic Shape groups in Sindhi

S.No: | Base Shape | Shape Group | No: of Elements
1 | ا | ا | 1
2 | ٮ | ب, ٻ, پ, ڀ, ت, ٿ, ٽ, ٺ, ث | 9
3 | ح | ج, ڄ, ڃ, چ, ڇ, ح, خ | 7
4 | حه | جه | 1
5 | د | د, ڌ, ڏ, ڊ, ڍ, ذ | 6
6 | ر | ر, ڙ, ز | 3
7 | س | س, ش | 2
8 | ص | ص, ض | 2
9 | ط | ط, ظ | 2
10 | ع | ع, غ | 2
11 | ڡ | ف, ڦ | 2
12 | ٯ | ق | 1
13 | ڪ | ڪ | 1
14 | ک | ک, گ, ڳ, ڱ | 4
15 | گه | گه | 1
16 | ل | ل | 1
17 | م | م | 1
18 | ں | ن, ڻ | 2
19 | و | و | 1
20 | ه | ھ | 1
21 | ء | ء | 1
22 | ى | ي | 1
Total | | | 52
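The shape groups of Table 6 lend themselves to a simple check of whether a misspelling can be explained purely by shape confusion. The sketch below encodes only four of the groups (base shapes ٮ, ح, د and ر); the function name `is_visual_error` is hypothetical and the approach is an illustration, not a method proposed in this paper.

```python
# Four shape groups from Table 6, keyed by base shape (partial, for illustration).
SHAPE_GROUPS = {
    "ٮ": "بٻپڀتٿٽٺث",
    "ح": "جڄڃچڇحخ",
    "د": "دڌڏڊڍذ",
    "ر": "رڙز",
}

# Map every letter to the base shape of its group.
BASE_SHAPE = {
    letter: base for base, letters in SHAPE_GROUPS.items() for letter in letters
}

def is_visual_error(wrong, correct):
    """True if `wrong` differs from `correct` only by same-shape letters."""
    if wrong == correct or len(wrong) != len(correct):
        return False
    return all(
        BASE_SHAPE.get(a, a) == BASE_SHAPE.get(b, b)
        for a, b in zip(wrong, correct)
    )

# A pair from Table 5: the first letter is swapped within the ٮ group.
CORRECT = "ٻڪري"
WRONG = "ب" + CORRECT[1:]
print(is_visual_error(WRONG, CORRECT))
```

Letters that are not listed in `SHAPE_GROUPS` are compared as themselves, so a full implementation would include all 22 groups of Table 6.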

8. SPACE RELATED ERRORS

The space is one of the most commonly and habitually used characters in document drafting and composition. Because of its heavy use, involuntary insertion of a space occurs during typing. In Sindhi, however, space insertion is less common than space omission, which is much more frequent. One reason could be that typists usually want to type quickly and minimize effort, and so omit the space character (or any other character), while at the same time inserting a space character either by mistake or of necessity. Omission of a space is found to be most common when writing long and complex words, as the Sindhi language has several words that use a space between two parts to keep the sentence clear while reading. For example, consider “لعل شهباز” {Lal Shabaz (noun): lal shahbaaz}; if the space character is omitted from the middle, the word becomes لعلشهباز {LalShabaz (noun): lalshaabaz}, which is incorrect and misspelled. Table 7 shows some of the space related errors in the Sindhi language.

Table 7: Shows the various space related errors

Error | Wrong Word | English Transliteration | Correct Word
Insertion | پا ڻي | Pardein | پاڻي
Omission | لعلشهباز | Lal Shabaz | لعل شهباز
Transposition | زن دگي | Zindegi | زندگي

9. CONCLUSION

From the discussion and studies given in this paper it can be summarized that a few unique, script-specific error trends exist in Sindhi which cannot be found in a language like English. Amongst the error trends discussed for Sindhi, the most frequent are substitution errors caused by the shape similarity of letters in the Sindhi alphabet and by the similar pronunciation of various letters. The other type of error found in Sindhi is the omission or deletion of the space character at word boundaries. From the study presented for Sindhi it can be assumed that these results and error trends and patterns will also apply to other languages written in a script similar to Persio-Arabic.

REFERENCES

[1] NASEEM, T. AND HUSSAIN, S. (2007), “Spelling Error Trends in Urdu,” in Proceedings of the Conference on Language Technology (CLT07), University of Peshawar, Pakistan.

[2] CHAUDHURI, B. B. (2001), “Reversed word dictionary and phonetically similar word grouping based spellchecker to Bangla text,” in Proc. LESAL Workshop, Mumbai.

[3] BHATT, A., CHOUDHURY, M., SARKAR, S. AND BASU, A. (2005), “Exploring the Limits of Spellcheckers: A Comparative Study in Bengali and English,” in Proc. The Second Symposium on Indian Morphology, Phonology and Language Engineering (SIMPLE'05), CIIL Mysore, Kharagpur, India, February 2005, pp. 60-65.

[4] KUKICH, K. (1992b), “Techniques for automatically correcting words in text,” in ACM Computing Surveys, 24(4): 377-439.

[5] DAMERAU, F. J. (1964), “A Technique for Computer Detection and Correction of Spelling Errors,” in Communications of the ACM, 7(3): 171-177.

[6] POLLOCK, J. J. AND ZAMORA, A. (1983), “Collection and characterization of spelling errors in scientific and scholarly text,” in Journal of American Society of Informatics and Science, 34(1): 51-58.

[7] YANNAKOUDAKIS, E. J. AND FAWTHROP, D. (1983), “The rules of spelling errors,” in Information Processing and Management, 19(12): 101-108.

[8] KUKICH, K. (1990), “A comparison of some novel and traditional lexical distance metrics for spelling correction,” in Proceedings of INNC-90-Paris, 309-313.

[9] KUKICH, K. (1992a), “Spelling correction for the telecommunications network for the deaf,” in Communications of the ACM, 35(5): 80-90.

[10] MITTON, R. (1987), “Spelling Checkers, Spelling Correctors, and the Misspellings of Poor Spellers,” in Information Processing & Management, 23(5): 495-505.

[11] PETERSON, J. L. (1986), “A note on undetected typing errors,” in Communications of the ACM, 29(7): 633-637.

[12] ETHNOLOGUE.COM, “Sindhi: A language of Pakistan.” Retrieved from http://www.ethnologue.com/show_language.asp?code=snd (Accessed on: 26th April, 2011).

[13] SINDHI LANGUAGE AUTHORITY, “Sindhi Script.” Website: http://www.sindhila.org/SindhiLanguage.htm (Accessed: November 2011).