CPC-A Novel Mapping Technique for Inter Conversion between Multi-Script / Multi-Standard Non-Unicode and Unicode Encoded Documents

Nibaran Das, Partha Sarathi Roy, Ram Sarkar, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu

Department of Computer Science & Engineering, Jadavpur University, Kolkata-700032, India

Abstract: To convert digital documents of different languages/scripts already created in ASCII/ISCII or any other non-Unicode standard into Unicode, we need different converters for different fonts, because these fonts do not follow any standard rules in the construction of characters. The problem becomes more severe for languages like Bangla, Hindi and Marathi, which are enriched by the presence of compound characters. No tools are available to convert multi-script non-Unicode encoded documents to Unicode encoded documents and vice versa. To address these challenges, we have developed a 10-byte character positioning code (CPC) for these categories of scripts, maintaining the proper layout and architecture of the source. It is noteworthy that with CPC we try to identify, during mapping, a group of characters which has a unique glyph (or glyphs) representation in non-standard fonts.

Keywords: Character positioning code (CPC), Unicode, ASCII, ISCII, Document conversion, Multi-script/Multi-standard document converter.

1. INTRODUCTION

Nowadays Unicode, a universal and unique encoding system for representing characters in a computer, is widely used to represent non-Latin characters. Before the introduction of Unicode, ASCII (American Standard Code for Information Interchange) was very popular. In Extended ASCII, each character is represented by an 8-bit code, so only 256 code values are available to represent all the characters of the world. When different scripts/fonts are represented in ASCII, characters from different scripts are encoded with the same numbers. Characters of the same script but of different fonts can also have different code values under the same operating system environment. All these create problems in representing documents that contain multilingual scripts/fonts, and documents that contain the same font but were created under different operating systems. Besides, there is no standard rule in ASCII for representing non-Latin scripts, so standardization was problematic for the encoding of documents based on non-Latin scripts.

Unicode solves most of the problems of non-standardisation and parsing. But converting the huge number of existing non-Unicode documents to Unicode encoded documents remained a problem. Therefore a number of converters [1-7] have been developed to convert non-Unicode encoded documents to Unicode encoded documents. However, no standard methodology or converter is available which can convert documents encoded in several different non-standard fonts to Unicode encoded documents simultaneously. There exists a universal font converter [8], but its modus operandi is completely different: it converts fonts between formats, for example from .ttf to .otf or from .otf to .ttf. Apart from that, in the Unicode gateway converter [2] some Indic scripts are converted to Unicode for display, but not in multi-script format.

In the present work we propose a methodology which can convert multiple non-Unicode encoded documents to Unicode encoded ones while maintaining the logical structure of the sections and paragraphs of the documents. For that reason we have designed an HTML file based font conversion technique, whose files can easily be inter-converted with .doc/.docx files. The proposed methodology is unique in the sense that it can convert multi-script documents both online and offline. In our previous attempt [9] we proposed a scheme for font conversion from ASCII/ISCII to Unicode or vice versa for multilingual scripts or multiple fonts of any standard, focused on English, Bangla and Hindi scripts. We have modified our earlier work by introducing a character position encoding format to represent the constituent characters of compound characters (normally prevalent in Indic scripts, including Bangla), so that the font conversion process is handled in a better way. The key advantages of the current technique in comparison to others and to our earlier work are the following:

2nd International Conference on Computer Processing of Bangla, 2011 Organized by: Independent University, Bangladesh (IUB); and Bangladesh Computer Council (BCC)


1. More methodical way to convert Unicode to non-standard encoded documents and vice versa.
2. Introduction of CPC.
3. The logical structure of sections and paragraphs of the documents can be maintained.

2. PRESENT WORK

In this work, we propose a new scheme to convert a multilingual Unicode encoded document to the corresponding ASCII-based document and vice versa. The methodology can also be applied to any non-standard font. In a plain text editor such as Notepad, it is difficult to identify the differences between two fonts/scripts in ASCII/ISCII code, so our first step is to identify the different fonts and mark them. For this purpose, we use an HTML (HyperText Markup Language) page as input. The output file in Unicode encoded format is also created in HTML format. The advantages of using HTML files are the following:
(a) A clear description of all labelling and text in the document is available. This description carries information about the fonts/scripts and the writing direction in the document.
(b) The logical structure of the sections and paragraphs of documents can be maintained.
(c) A description of the image formats in the document pages is available.
(d) An HTML file can easily be inter-converted with a .doc file if the HTML is created with a word processor such as Microsoft Word or OpenOffice Writer.
The steps involved in the conversion of non-Unicode to Unicode encoded documents and vice versa are shown in FIG. 1.
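The font-identification step can be illustrated with a minimal sketch. It assumes that font information is carried by `<font face="...">` tags; HTML saved from a word processor may instead use `style="font-family: ..."` attributes, which this sketch does not handle. The class name `FontRunParser`, the function `font_runs` and the default font are hypothetical and are not taken from the software described here.

```python
from html.parser import HTMLParser


class FontRunParser(HTMLParser):
    """Collect (font_name, text) runs from <font face="..."> tags."""

    def __init__(self, default_font="Times New Roman"):
        super().__init__()
        self.font_stack = [default_font]  # innermost active font is last
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # Inherit the enclosing font when no face attribute is given.
            face = dict(attrs).get("face") or self.font_stack[-1]
            self.font_stack.append(face)

    def handle_endtag(self, tag):
        if tag == "font" and len(self.font_stack) > 1:
            self.font_stack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text nodes
            self.runs.append((self.font_stack[-1], data))


def font_runs(html_text):
    """Return the text of an HTML page as a list of (font, text) runs."""
    parser = FontRunParser()
    parser.feed(html_text)
    parser.close()
    return parser.runs


sample = ('<p><font face="SAMIT_ATMLight">abc</font> and '
          '<font face="Arial">xyz</font></p>')
for font, text in font_runs(sample):
    print(repr(font), repr(text))
```

Each run can then be routed to the converter for its font, while the surrounding tags are copied to the output page unchanged so that the paragraph structure is preserved.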

a. IMPORTANCE OF CPC

The design of the CPC is discussed in this section. As mentioned earlier, almost all non-Unicode fonts follow no standard methodology for representing languages written in the same script, because no common rules were followed when their characters were constructed. The problem is more severe for languages like Bangla, Hindi and Marathi, which are enriched by the presence of compound characters. Some basic and compound characters are represented by a single numeric code; others are constructed from more than one numeric code. Moreover, the corresponding Unicode characters must be written to the output file following the rules of the Unicode encoded representation, because its format differs from that of the ASCII encoded representation. For example, when a vowel modifier/allograph such as <◌ি> is associated with a consonant, the numeric code of the modifier is placed after the numeric code of the consonant in Unicode. But for an ASCII-based font like SAMIT_ATMLight (a popular Bangla font among DTP operators in West Bengal), the coding is done in the reverse order. For instance, the character sequence in the SAMIT font family encoding is <◌ি> followed by the consonant, whereas the Unicode encoding is the consonant followed by <◌ি>.
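The ordering difference can be sketched as follows. The sketch assumes that the font codes have already been mapped to Unicode characters one by one, so that only the order still has to be corrected, and that the modifier precedes a single consonant (compound consonants are not handled). The set of modifiers and the function names are illustrative assumptions, and ক is used only as an example consonant.

```python
# Vowel signs that a visual-order font stores BEFORE the consonant:
# i (U+09BF), e (U+09C7), ai (U+09C8). Assumed set, for illustration only.
PRE_BASE_SIGNS = {"\u09BF", "\u09C7", "\u09C8"}


def is_bangla_consonant(ch):
    """True for code points in the Bangla consonant range U+0995..U+09B9."""
    return "\u0995" <= ch <= "\u09B9"


def visual_to_logical(chars):
    """Move each pre-base vowel sign after the consonant that follows it."""
    out = []
    pending = None  # a vowel sign waiting for its consonant
    for ch in chars:
        if pending is not None:
            if is_bangla_consonant(ch):
                out.append(ch + pending)  # Unicode order: consonant, then sign
                pending = None
                continue
            out.append(pending)  # stray sign: leave it where it was
            pending = None
        if ch in PRE_BASE_SIGNS:
            pending = ch
        else:
            out.append(ch)
    if pending is not None:
        out.append(pending)
    return "".join(out)


# Font order <i-sign> + <ka> becomes Unicode order <ka> + <i-sign>.
print(visual_to_logical("\u09BF\u0995") == "\u0995\u09BF")
```

A sign that is already in logical order, such as <◌া> after a consonant, passes through unchanged.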

FIG. 1 (a) Steps involved during conversion of Unicode to non-Unicode encoded documents; (b) Steps involved during conversion of non-Unicode to Unicode encoded documents


In Unicode, a compound character is not encoded with a single numeric code. Rather, the codes of the constituent characters are arranged in sequence, with the code for the joiner (hasanta) inserted between each pair of successive consonants. This is not the case for ASCII-based fonts. For example, in SAMIT_ATMLight most compound characters are represented by a single character code instead of multiple codes, although some compound characters are arranged as a combination of two glyph codes before encoding. In Unicode, the same characters are arranged as consonant + hasanta + consonant sequences, for example <ম + ◌্ + ব>, before encoding. There are also some basic characters which are represented by a pair of numeric codes in ASCII/ISCII and are therefore viewed as a combination of two individual characters; in Unicode, these characters are encoded with a single numeric code. At the time of conversion from ASCII/ISCII to Unicode one has to be aware of these facts. Hence we have introduced "Character Positioning Codes" (CPC) for a font, so that it is possible to convert from the font to Unicode and from Unicode to the font. It is noteworthy that with CPC we try to identify a group of characters which has a unique glyph (or glyphs) representation in non-standard fonts. The construction of the CPC for Bangla script is shown in Table I.

TABLE I. CONSTRUCTION OF CPC

Byte position    Corresponding character code
9, 8             First consonant
7, 6             Second consonant
5, 4             Third consonant
3, 2             Fourth consonant
1, 0             Vowel modifier

During the design of the CPC for consonants and compound characters, if a consonant is encountered first, its code is stored at the first consonant position of the CPC. The next character is then checked to see whether it is a vowel modifier or a hasanta. If it is a hasanta, the following character is fetched and stored at the second consonant position of the CPC, and the process continues until two consecutive characters other than a hasanta or a vowel modifier are found. In the case of a vowel modifier, we check the next letter. We have assigned unique codes to all of the consonants and vowel modifiers, as shown in Table II.

TABLE II. CODES ASSIGNED TO THE BENGALI ALPHABET DURING CONSTRUCTION OF CPC

Letter  Unique code    Letter  Unique code    Letter  Unique code    Letter  Unique code
ক       1              ঢ       14             র       27             ◌া      51
খ       2              ণ       15             ল       28             ◌ি      52
গ       3              ত       16             শ       29             ◌ী      53
ঘ       4              থ       17             ষ       30             ◌ু      54
ঙ       5              দ       18             স       31             ◌ূ      55
চ       6              ধ       19             হ       32             ◌ৃ      56
ছ       7              ন       20                     33             ◌ে      57
জ       8              প       21                     34             ◌ৈ      58
ঝ       9              ফ       22             ◌ং      35             ◌ো      59
ঞ       10             ব       23             ◌ঃ      36             ◌ৌ      60
ট       11             ভ       24             ◌ঁ      37
ঠ       12             ম       25
ড       13             য       26

For the character 'ক':
The unique code number for the character <ক> is 01.
So the CPC representing the letter 'ক' is 0100000000.

For the character 'কা = ক + ◌া':
The unique code number for the character <ক> is 01.
The unique code number for the character <◌া> is 51.
So the CPC representing the letter 'কা' is 0100000051 (the last two digits are reserved for vowel modifiers only).

For the character 'ক্ব = ক + ◌্ + ব':
The unique code number for the character <ক> is 01.
The unique code number for the character <ব> is 23.
So the CPC representing the letter 'ক্ব' is 0123000000.

When a vowel modifier accompanies a compound consonant containing য-phala (◌্য), both are added at their proper places in the CPC of the compound consonant.

For the letter 'স্ন্যা = স + ◌্ + ন + ◌্ + য + ◌া':
The unique code number for the character <স> is 31.
The unique code number for the character <ন> is 20.
The unique code number for the character <য> is 26.
The unique code number for the character <◌া> is 51.
So the CPC representing the letter 'স্ন্যা' is 3120260051.


FIG. 2 The flow diagram with CPC during conversion from Unicode to non standard ANSI code

In the case of compound characters containing reph (a leading র + ◌্), the unique code number of র is not taken into account; instead a flag is used and the reph is added at the proper place.

For the letter 'র্শি = র + ◌্ + শ + ◌ি', we derive the numeric value only for 'শি = শ + ◌ি':
The unique code number for the character <শ> is 29.
The unique code number for the character <◌ি> is 52.
So the CPC representing the letter 'র্শি' is 2900000052.

As per Bangla grammar, there may be at most four consonants in any compound character; therefore the above logic is applicable to all types of consonant letters (including the special conjunct forms and vowel modifiers). How CPC solves the problem of font conversion for Bangla script is described in FIG. 1 and FIG. 2.
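The construction rules can be summarised in a short sketch. It is an illustration of the 10-digit layout under stated assumptions, not the implementation used in the software: the name `build_cpc` is hypothetical, the input is assumed to be one consonant cluster already in Unicode order, reph is detected simply as a leading র + hasanta, and only consonant codes 1 to 32 and vowel modifier codes 51 to 60 are included.

```python
HASANTA = "\u09CD"  # joiner placed between the consonants of a compound

# Consonant codes 1..32 in the order used by the code table (assumed here).
CONSONANTS = "কখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ"
CONSONANT_CODE = {ch: i for i, ch in enumerate(CONSONANTS, start=1)}

# Vowel modifier codes 51..60: aa, i, ii, u, uu, vocalic r, e, ai, o, au.
VOWEL_SIGNS = "\u09BE\u09BF\u09C0\u09C1\u09C2\u09C3\u09C7\u09C8\u09CB\u09CC"
VOWEL_SIGN_CODE = {ch: i for i, ch in enumerate(VOWEL_SIGNS, start=51)}

RA = "\u09B0"


def build_cpc(cluster):
    """Return (cpc, has_reph) for one consonant cluster in Unicode order."""
    chars = list(cluster)
    has_reph = False
    # A leading ra + hasanta is recorded as a flag, not in a consonant slot.
    if len(chars) > 2 and chars[0] == RA and chars[1] == HASANTA:
        has_reph = True
        chars = chars[2:]

    consonant_codes = []
    vowel_code = 0
    for ch in chars:
        if ch == HASANTA:
            continue  # the joiner itself carries no code
        if ch in CONSONANT_CODE:
            consonant_codes.append(CONSONANT_CODE[ch])
        elif ch in VOWEL_SIGN_CODE:
            vowel_code = VOWEL_SIGN_CODE[ch]
        else:
            raise ValueError(f"no code assigned to {ch!r}")

    if not 1 <= len(consonant_codes) <= 4:
        raise ValueError("a cluster must contain one to four consonants")

    consonant_codes += [0] * (4 - len(consonant_codes))  # pad unused slots
    digits = "".join(f"{code:02d}" for code in consonant_codes)
    return digits + f"{vowel_code:02d}", has_reph


print(build_cpc("\u0995"))                                # ka
print(build_cpc("\u09B8\u09CD\u09A8\u09CD\u09AF\u09BE"))  # sa + na + ya + aa
print(build_cpc("\u09B0\u09CD\u09B6\u09BF"))              # reph + sha + i
```

Because the CPC is a fixed-width string, it can serve directly as a lookup key in a per-font mapping table, which is what allows a whole group of characters to be mapped in one step.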

3. RESULTS AND DISCUSSION

We have used the above-mentioned technique to convert English-Bangla mixed documents from non-Unicode to Unicode and from Unicode to the corresponding fonts. To validate our scheme we have considered documents written in bilingual (Bangla and English) scripts in Unicode, containing about 890717 words in total. For the selection of text categories we have considered different types of documents, as described in Table III.

TABLE III. DIFFERENT TYPES OF TEXTS USED DURING CONVERSION

Types of text tested    No. of words
Social Science          760690
Natural Science         46573
Economics               70100
Computer Science        13354


FIG. 3 The flow diagram with CPC during conversion from Non Unicode to Unicode characters

The experimental results of conversion from existing ASCII-based multilingual code to Unicode are found to be quite robust. A sample input and its output in HTML pages after conversion from ASCII to Unicode are shown in FIG. 4a and 4b. A sample input and its output in HTML pages after conversion from Unicode to ASCII are shown in FIG. 5a and 5b. The developed software is currently being used by Computer Jagat [10], a Bangla monthly computer magazine, for preparing the content of the magazine, where the manuscripts are written by the authors in Unicode encoded format and contain both Bangla and Roman scripts. Our software converts each manuscript to a Samit font family encoded document for the Bangla script and to Times New Roman for the Roman script. The Samit font encoded documents are used for the final preparation of the tracing of the magazine. After the magazine has been prepared, the reverse procedure is used to convert it back to Unicode with the proper layout of the magazine. The graphical user interface of our software for Samit to Unicode conversion is shown in FIG. 6.


FIG. 4 (a) Bilingual ASCII-based document before conversion; (b) Bilingual Unicode encoded document after conversion

FIG. 5 (a) Bilingual Unicode encoded document before conversion; (b) Bilingual ASCII-based document after conversion

FIG. 6 A glimpse of our developed software

 


4. CONCLUSION

A new CPC-based mapping technique is developed here for inter-conversion between multi-script non-Unicode and Unicode encoded documents. The CPC is designed for group-wise conversion instead of one-to-one mapping of characters. It facilitates proper glyph representation during conversion between encoding systems. The developed technique can be used to convert multi-standard/multi-script encoded documents to Unicode encoded documents. The technique has been successfully applied to the preparation of Bangla documents mixed with English words, converting them from one standard to another. The technique may also be applicable to searching web documents encoded in non-Unicode fonts.

ACKNOWLEDGEMENT

The authors are thankful to the "Center for Microprocessor Application for Training Education and Research", the "Project on Storage Retrieval and Understanding of Video for Multimedia" and the Computer Science & Engineering Department, Jadavpur University, for providing infrastructure facilities during the progress of the work. The authors are also thankful to the "Society for Natural Language Technology Research", Kolkata, India, for partial funding of the work. One of the authors, Prof. Dipak Kumar Basu, is thankful to the A.I.C.T.E. (New Delhi, India) for awarding him an Emeritus Fellowship (F. No: 1-51/RID/EF(13)/200708).

References:
1. http://omicronlab.com/avro-converter.html
2. http://uni.medhas.org/
3. Bangla Converter: version 3.0 (PHP+UTF-8), http://thpbd.org/bangla/
4. Bangla text converter (Boishakhi:Bijoy:Unicode), http://utsablog.blogspot.com/2006/06/bangla-textconverter-boishakhi-bijoy.html
5. Roddur Bangla Unicode Converter, http://www.roddur.net/convert/
6. http://ltrc.iiit.ac.in/showfile.php?filename=downloads/FC-1.0/fc.html
7. http://nltr.org/snltr-software/
8. http://www.softpedia.com/get/Others/FontUtils/TransType-SE.shtml
9. N. Das, A. Murmu, S. Roy, R. Sarkar, S. Basu, "A Novel Scheme for Editing and Inter Conversion between ASCII/ISCII/Unicode Encoded Multilingual Documents", accepted for publication in LANGUAGE FORUM, Vol. 35, No. 2, July-Dec 2009
10. http://computerjagat.org/
