The 9th International Forum on Strategic Technology (IFOST), October 21-23, 2014, Cox’s Bazar, Bangladesh

Developing an Efficient Algorithm for Representation and Compression of Large Bengali Text

Md. Abu Marjan1, Md. Palash Uddin2, Masud Ibn Afjal3 and Md. Dulal Haque4
Faculty of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University (HSTU), Dinajpur-5200, Bangladesh
[email protected], [email protected], [email protected] and [email protected]

Abstract—Efficient coding is one of the challenging aspects of information and communication theory. Natural languages such as Bengali are coded using Unicode, which requires more space and thus takes more time to transfer data in that language. In this paper, we propose a novel algorithm to represent Bengali text efficiently and then to compress the text with a better compression ratio. Each Bengali character is represented by a unique 2-digit intermediate decimal value. After indexing and sorting all the word values, successive subtraction is performed on the values in the hope of reducing the weight of the numbers. The new value of each word can then be encoded with very few bits. In comparison to other compressors, the compression ratio of the proposed algorithm decreases considerably for large texts, which may contain more duplicate or redundant words, more words of the same length, and more words of the same length with the same prefix, called Uposorgo in Bengali.

Keywords—Compression, Decompression, Bengali text representation, Bengali text compression, compression ratio.

I. INTRODUCTION

Bengali, one of the rich languages, is the first language of more than 245 million people in the world. Approximately 10% of the world's population speaks Bengali [5]. Thus, there is a crying need to computerize this language in an efficient way, but little has been done in this regard so far [6]. Bengali also lacks a standard for the keyboard layout and for the computer representation of its characters. On the other hand, in computer science and telecommunications, storing and transmitting data using less space is an important issue. Hence, we need to encode this data in an efficient manner. The coding technique that reduces the size of data is called data compression [3]. In a lossless compression scheme, no data is lost when the data is decompressed, whereas in lossy compression some data may be lost. Thus, the development of a convenient, efficient, and versatile compression algorithm for encoding the Bengali language is a big challenge. To cope with that challenge, we introduce a new lossless compression technique for Bangla text. Before presenting it, we review some basic concepts of the Bengali language.

A. Bangla Language

In the printed form of Bengali, there are 11 vowels, 39 consonants and 10 numerical digits. Furthermore, there are 10 short forms of vowels called vowel modifiers (Kar) and 7 short forms of consonants called consonant modifiers (Fala) [1]. Besides these, there are about 253 compound characters composed of 2, 3 or 4 consonants: 200 compound characters composed of 2 consonants, 51 composed of 3 consonants and 2 composed of 4 consonants [2]. According to the Bengali Academy standard, some vowels and the corresponding vowel modifiers, with their placement, are listed in Table I.

TABLE I. VOWELS AND VOWEL MODIFIERS

Some consonant modifiers with their corresponding consonants are listed in Table II [7]. Besides the vowels, consonants and their modified forms, there is a special character, Hoshonto (্).

TABLE II. CONSONANT MODIFIERS

Unlike English, Bangla words are not composed only of individual characters placed one after another. In Bangla, 2, 3 or 4 consonants can be merged together to form a single compound character. Some examples are given in Table III.

TABLE III. COMPOUND CHARACTERS

B. Coding Techniques

ASCII is one of the coding schemes used to represent languages in a computer; it is a machine-dependent character encoding standard. Another, machine-independent coding scheme is Unicode, whose encoding forms are UTF-8, UTF-16 and UTF-32. UTF-8 transforms every Unicode character into a variable-length encoding (VLE) of bytes. UTF-16 is reasonably compact: all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units. In UTF-32, each Unicode character is encoded in a single 32-bit code unit. Although ASCII and Unicode play the same role for English character coding, for many major languages such as Bengali, Unicode is much more beneficial than ASCII for documentation. Therefore, it is important to represent all Bengali characters in Unicode so that they can be presented easily on all kinds of digital machines. Hence, we propose a Unicode-based representation and compression technique with a view to achieving better performance. For Bengali characters, the reserved Unicode range is \u0980 to \u09FF [4].

978-1-4799-6062-0/14/$31.00 ©2014 IEEE

TABLE IV.

II.

Where Ni is the decimal value for the word of index i, C is the 2-digit intermediate value of each character, i is the index number (0
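The representation step summarized in the abstract (2-digit character values, sorted word values, successive subtraction) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the codebook (codes assigned from 10 upward in code-point order), the base-100 concatenation used to form each word value, and all function names are hypothetical, since the paper's own character table and its formula for Ni are not reproduced here.

```python
from itertools import accumulate


def build_codebook(words):
    """Give every distinct character a 2-digit code (10..99).

    Hypothetical mapping: codes follow code-point order; the paper's own
    character table is not available here.
    """
    chars = sorted({ch for word in words for ch in word})
    if len(chars) > 90:
        raise ValueError("more than 90 symbols do not fit in 2 digits")
    return {ch: 10 + k for k, ch in enumerate(chars)}


def word_value(word, codebook):
    """Concatenate the 2-digit codes of a word into one decimal integer."""
    value = 0
    for ch in word:
        value = value * 100 + codebook[ch]
    return value


def value_to_word(value, codebook):
    """Invert word_value by peeling off two decimal digits at a time."""
    inverse = {code: ch for ch, code in codebook.items()}
    chars = []
    while value:
        value, code = divmod(value, 100)
        chars.append(inverse[code])
    return "".join(reversed(chars))


def delta_encode(values):
    """Sort the distinct values and keep successive differences.

    Returns (deltas, order): deltas[0] is the smallest value, and order
    maps each original word position to its rank among the sorted values.
    """
    if not values:
        return [], []
    ordered = sorted(set(values))
    rank = {v: k for k, v in enumerate(ordered)}
    deltas = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
    return deltas, [rank[v] for v in values]


def delta_decode(deltas, order):
    """Rebuild the original value sequence from the differences."""
    ordered = list(accumulate(deltas))
    return [ordered[k] for k in order]


# Small demonstration with three distinct words, one of them repeated.
words = ["কলম", "কলা", "কথা", "কলম"]
codebook = build_codebook(words)
values = [word_value(w, codebook) for w in words]
deltas, order = delta_encode(values)
restored = [value_to_word(v, codebook) for v in delta_decode(deltas, order)]
print(deltas, order, restored == words)
```

In this sketch, words that share a prefix map to nearby integers, so after sorting their differences are small; a repeated word costs only an extra rank entry. Starting the codes at 10 keeps every code exactly two digits wide, which is what lets `value_to_word` split a value without ambiguity.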
