Coding System For Bangla Spell Checker

Md. Tamjidul Hoque and M. Kaykobad
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Email: [email protected]

Abstract: Developing Bangla applications is relatively complex because of the complexities of the Bangla character set. In a Bangla spell checker, replacement errors, deletion errors, insertion errors and swap (transposition) errors must be handled effectively and efficiently because they occur frequently. A coding system has been implemented to handle these frequent errors as well as other types of errors or misspellings.

Keywords: Bangla alphabet, replacement error, deletion error, insertion error, swap error, minor error, coding system, spell check.

1. INTRODUCTION

A spell checker is expected to search a document for invalid or misspelled words. The search area may be pre-selected by highlighting a portion of the document; otherwise checking should run forward from the cursor position of the active document to its end. Functionally, each word is identified on the run and matched against the database of valid words (the word dictionary). If no match is found, the word is invalid or misspelled. For Bangla, development is relatively tedious because of the complexities of the language. As in other languages, the Bangla alphabet is divided into vowels [2][3][8][10] (there are 11 vowels) and consonants (there are 39 consonants). Half-forms of vowels exist where vowels guide the sound of consonants [4]. The position of the half-form is not fixed: a few reside to the left of the consonant they modify and are called left-biased; similarly there are right-biased, bottom-biased and even both-biased half-forms of vowels. Compound forms of consonants and half-forms of compound consonants, in addition to plain consonants, exist in the character set at the application level. Although the Bangla alphabet is case-insensitive, Bangla fonts have about 190 different characters [6], of which 50 are in principal forms.

2. POSSIBLE SOURCES OF SPELLING ERRORS

The error categories that produce misspelled words while typing are replacement errors, deletion errors, insertion errors and swap errors. The error rates reported for Bangla [1] are as follows:

Table 1: Percentage of various types of errors in Bangla.

Type of Error                      Percentage
Substitution / Replacement Error   66.32
Deletion Error                     21.88
Insertion Error                     6.53
Swap / Transposition Error          5.27

The percentage of words by the number of characters involved in the error is as follows [1]:

Table 2: Error localization.

Error zone length (in no. of characters)   % of words
1                                          41.36
2                                          32.94
3                                          16.58
4                                           7.10
5                                           1.78
6                                           0.24

A swap error happens when two adjacent letters are swapped. For example, by one swap error ‘Kjm’ may be turned into another letter sequence; if the result is not in the dictionary word-list, then it is an error. But if the result is in the word-list, then there is no way to declare it an invalid entry except by semantic error checking.

An insertion error happens when the typed word contains one or more extra letters.

A deletion error happens when the typed word is missing one or more letters of the intended word.

A replacement error occurs when an intended character of the word is replaced by another character. Keyboard adjacency errors are mostly responsible for replacement errors. Structural similarities between letters of the alphabet sometimes cause replacement errors as well.

3. SQL QUERIES TO DETECT FREQUENT ERRORS

The question now is how the word-list database can be used to detect the above-mentioned errors. First of all, the Simplified (or Open) format [1][4], rather than the actual word-list, is the basis for searching, since the non-principal forms of vowels would add computational complexity and they are absent in the Simplified format. This paper concentrates on one-character errors, since a one-character error is the most probable case (Table 2).
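As an illustration of the four one-character error categories of Section 2, the following minimal Python sketch decides which single error, if any, turns an intended word into the typed one. The function name classify_one_char_error is hypothetical, and plain Latin strings stand in for Simplified-format Bangla words; it is not the paper's implementation.

```python
from typing import Optional


def classify_one_char_error(typed: str, intended: str) -> Optional[str]:
    """Name the single-character error that turns `intended` into `typed`.

    Returns None when the words are equal or differ by more than one
    swap, insertion, deletion or replacement.
    """
    if typed == intended:
        return None
    n, m = len(typed), len(intended)
    if n == m:
        diffs = [i for i in range(n) if typed[i] != intended[i]]
        if len(diffs) == 1:
            return "replacement"
        # Two adjacent positions whose letters are exchanged.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "swap"
        return None
    if n == m + 1:
        # The typed word has one extra letter.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(n)):
            return "insertion"
    if n + 1 == m:
        # The typed word misses one letter of the intended word.
        if any(intended[:i] + intended[i + 1:] == typed for i in range(m)):
            return "deletion"
    return None


print(classify_one_char_error("rhe", "the"))    # replacement
print(classify_one_char_error("XWYZ", "WXYZ"))  # swap
```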

The database is assumed to have a table TblWordList for the word-list, with a column BWord holding the actual word, a column Simplified holding the Simplified (Open) format of the actual word, a length field holding the length of the Simplified field, and so on.

Let the Simplified form of the misspelled word be ‘WXYZ’.

‘Swap Error suggestion candidate’ SQL might be:
“Select BWord from TblWordList where Simplified in ( 'XWYZ', 'WYXZ', 'WXZY' )”
Using a loop (say, a FOR loop) the query variable can easily be generated. The complexity is O(n), where n = length(‘WXYZ’) - 1.

‘Insertion Error suggestion candidate’ SQL might be:
“Select BWord from TblWordList where Simplified in ( 'XYZ', 'WYZ', 'WXZ', 'WXY' )”
The complexity of query variable generation is O(n), where n = length(‘WXYZ’).

‘Deletion Error suggestion candidate’ SQL might be:
“Select BWord from TblWordList where Simplified like '_WXYZ' or Simplified like 'W_XYZ' or Simplified like 'WX_YZ' or Simplified like 'WXY_Z' or Simplified like 'WXYZ_'”
The complexity of query variable generation is O(n), where n = length(‘WXYZ’) + 1.

‘Replacement Error suggestion candidate’ SQL might be:
“Select BWord from TblWordList where Simplified like '_XYZ' or Simplified like 'W_YZ' or Simplified like 'WX_Z' or Simplified like 'WXY_'”
The complexity of query variable generation is O(n), where n = length(‘WXYZ’).

Here, in the above queries, ‘_’ stands for any single character.

4. PROBLEM WITH THE SQL QUERY

In practical cases, databases are set up as case-insensitive by default. For Bangla that would collect a lot of garbage words, so the search must be case-sensitive. To make the search case-sensitive, either the database must be in binary mode or the comparison must be made in binary mode. Binary mode has the following limitations:

a. In binary mode searching becomes extremely slow.
b. A number of databases do not support binary mode.
c. Binary comparison supports only exact-equality comparison, not other kinds of comparison.

That is, the SQL queries would not support binary search, which implies that unwanted (garbage) suggestions would be generated. Removing the garbage from the suggestions would be an even more tedious computation.

5. DEVISING CODING AND ITS UTILIZATION

These difficulties can be overcome by devising an effective coding system. Obviously, a coding system adds column(s) to the database table, but in practice the impact of a moderate increase in database size is minor here. In the time vs. space trade-off for dictionary searching, minimizing search time is a far more prominent requirement than optimizing space. Let the following columns be added to the database table:

SimWLen = length of the Simplified format column.
WSum = sum of the cubed character values of the characters of the Simplified format.
Chari = 0 if i > length(Simplified), else (WSum - cube of the character value of the i-th character), where i = 1 to m and m is the maximum length of the Simplified column; in practice m is chosen slightly larger to allow for future requirements.

6. CODING BASED MODIFIED QUERY AND ITS BENEFIT

With the codes, the query is more controllable, and the output is quicker and more trouble-free. The code-based versions of the queries of Section 3 are demonstrated below. In these queries, WSum('...') denotes the WSum code computed for the quoted string.

‘Swap Error suggestion candidate’ SQL using coded columns might be:
“Select BWord from TblWordList where SimWLen = Length('WXYZ') and WSum = WSum('WXYZ')”
The complexity of generating the above query is O(1).

‘Insertion Error suggestion candidate’ SQL using coded columns might be:
“Select BWord from TblWordList where SimWLen = Length('WXYZ') - 1 and WSum in ( WSum('XYZ'), WSum('WYZ'), WSum('WXZ'), WSum('WXY') )”
Using a loop (say, a for loop) the query variable can easily be generated. The complexity is O(n), where n = length(‘WXYZ’).
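The coded columns of Section 5 and the code-based lookups of this section can be sketched in Python as follows. This is only an illustration under stated assumptions: an in-memory list of rows stands in for TblWordList, the Unicode code point is assumed as the character value, Latin strings stand in for Simplified-format words, and the names M, char_value, wsum, coded_row, drop_one and the *_candidates functions are hypothetical. Because distinct strings can share the same sum, the results are suggestion candidates, not guaranteed matches.

```python
M = 8  # assumed maximum length of the Simplified column (the paper's m)


def char_value(ch):
    # Assumption: the code point serves as the character value.
    return ord(ch)


def wsum(s):
    """WSum: sum of the cubed character values of s."""
    return sum(char_value(c) ** 3 for c in s)


def coded_row(bword, simplified):
    """Build one row with the SimWLen, WSum and Char1..CharM codes."""
    assert len(simplified) <= M
    total = wsum(simplified)
    # Char_i = WSum minus the cube of the i-th character; 0 past the end.
    chars = [total - char_value(simplified[i]) ** 3 if i < len(simplified) else 0
             for i in range(M)]
    return {"BWord": bword, "SimWLen": len(simplified),
            "WSum": total, "Char": chars}


def drop_one(s):
    """All strings obtained by deleting one character of s."""
    return [s[:i] + s[i + 1:] for i in range(len(s))]


def swap_candidates(table, typed):
    return [r["BWord"] for r in table
            if r["SimWLen"] == len(typed) and r["WSum"] == wsum(typed)]


def insertion_candidates(table, typed):
    sums = {wsum(t) for t in drop_one(typed)}
    return [r["BWord"] for r in table
            if r["SimWLen"] == len(typed) - 1 and r["WSum"] in sums]


def deletion_candidates(table, typed):
    target = wsum(typed)
    return [r["BWord"] for r in table
            if r["SimWLen"] == len(typed) + 1
            and target in r["Char"][:r["SimWLen"]]]


def replacement_candidates(table, typed):
    parts = drop_one(typed)
    return [r["BWord"] for r in table
            if r["SimWLen"] == len(typed)
            and any(r["Char"][i] == wsum(parts[i]) for i in range(len(typed)))]


table = [coded_row(w, w) for w in ["the", "then", "he", "tea"]]
print(swap_candidates(table, "hte"))         # ['the']
print(insertion_candidates(table, "thhe"))   # ['the']
print(deletion_candidates(table, "te"))      # ['the', 'tea']
print(replacement_candidates(table, "rhe"))  # ['the']
```

Every lookup compares only integers, so it does not depend on the case sensitivity of the database's string comparison.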

‘Deletion Error suggestion candidate’ SQL using coded columns might be:
“Select BWord from TblWordList where SimWLen = Length('WXYZ') + 1 and WSum('WXYZ') in ( Char1, Char2, Char3, Char4, Char5 )”
The complexity is O(n), where n = length(‘WXYZ’) + 1.

‘Replacement Error suggestion candidate’ SQL using coded columns might be:
“Select BWord from TblWordList where SimWLen = Length('WXYZ') and ( Char1 = WSum('XYZ') or Char2 = WSum('WYZ') or Char3 = WSum('WXZ') or Char4 = WSum('WXY') )”
The complexity is O(n), where n = length(‘WXYZ’).

The modified queries using coded columns have the following benefits:
a. The query generation complexity is reduced.
b. The searching is faster.
c. The suggestion candidate list is garbage-free, i.e., the case-insensitive search has been suppressed.

To reduce the number of search results (suggestion candidates), the power used in the sum might be raised above the cube, or a very large distinguishing value might be assigned to each character and stored.

7. IMPROVING THE SUGGESTION LIST AND ITS PRESENTATION

In the case of suggestions for ‘Replacement Error’, the suggestion candidates should be ordered according to whether they are candidates for a ‘Keyboard adjacency error’. A keyboard adjacency error occurs when, instead of the required key, an adjacent key is pressed. For example, on the keyboard ‘t’ is adjacent to both ‘y’ and ‘r’. Owing to a keyboard adjacency error, ‘rhe’ or ‘yhe’ may be typed instead of ‘the’.

Keyboard adjacency mapping might be applied to the search result of the ‘replacement error’ suggestion list to order it, to filter it, or to generate a short suggestion list. The coding system can be re-arranged to identify suggestion candidates for possible keyboard adjacency errors. The character values might be re-assigned so that adjacent keys receive values at the minimum distance from one another. For example, indicating adjacency by ‘–’, consider the keys

Q – W – E – R – T – Y;

their assigned values are, respectively,

α – (α+ϕ) – (α+2ϕ) – (α+3ϕ) – (α+4ϕ) – (α+5ϕ), and so on.

That is, if a key’s assigned value is β, then the values of its two possible adjacent keys are (β-ϕ) and (β+ϕ), where α, β and ϕ are positive integers and βmin > ϕ.

Thus, through proper coding, suggestion candidates for keyboard adjacency errors can be identified.

8. PRIORITY OF MINOR ERROR

Different users have different levels of typing skill and generate different levels of errors. Users with higher typing skill usually make minor or delta errors, for example missing non-principal vowels, so that one word is mistyped as another. Users sometimes wrongly choose a similar-sounding but different letter out of confusion, e.g. when choosing between the letters of a similar-sounding pair. It is hard for users to remember all the grammatical rules and their exceptions in order to select the correct letter for confusing words. The use of duplicated consonants in composite form is sometimes confusing as well, and one word may sometimes be confused with another.

9. CODING FOR DETECTING THE MINOR OR DELTA ERROR

Let errors of the above kind be named ‘Delta Error’. Again, a coding system is introduced for Delta Error Detection Mapping, and a value column, say ‘DEDMValue’, needs to be added to the database table. The main ideas of the coding are:
a. Same-sounding letters have the same value.
b. The same composite consonants have a single value.
c. The non-principal vowels have a very low (delta) value with respect to the other forms of letters.

Let ‘WmXnYoZp’ be a word where ‘m’, ‘n’, ‘o’ and ‘p’ are non-principal vowels having very low values and the others (‘W’, ‘X’, ‘Y’, ‘Z’) have higher values. Let the weight of a character having a higher weight be ω and that of a character having a lower weight be d.

DEDMValue(‘WmXnYoZp’) = 1*ω(W)*ω(W)*ω(W) + d(m) + 2*ω(X)*ω(X)*ω(X) + d(n) + 3*ω(Y)*ω(Y)*ω(Y) + d(o) + 4*ω(Z)*ω(Z)*ω(Z) + d(p)

For every character except the non-principal vowels, the cubic value is added, and multiplication by the positional value increases the accuracy of the suggestions. Let the DEDMValue of the misspelled word be calculated and assigned to a variable DeltaValue. The possible SQL for access is:

“select Bword from TblBWords where abs(" + Trim(DeltaValue) + " - DEDMValue) < δ and SimWLen = " + Trim(SimWLen) + " order by abs(" + Trim(DeltaValue) + " - DEDMValue)”

Note that δ is a threshold value with 0 ≤ δ ≤ δmax, where δmax is the maximum value assigned to a non-principal vowel and δmax < ωmin. Say the DEDMValue of a misspelled or invalid word is λx and the DEDMValue column is λ. The detection range will be (λx - δ) ≤ λ ≤ (λx + δ). It is desirable for the coding system that, δmax