This paper focuses on the evolution and standardization of Urdu ... development
related to Urdu collation sequence and proposes a set of collation elements to.
Urdu Encoding and Collation Sequence for Localization Sarmad Hussain1, Sana Gul, Afifah Waseem Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences 1. Introduction Information has become such an integral part of our global society that its access is considered as a basic human right [1, 2]. Access to this information may rightly be equated with access to the Information and Communication Technology (ICT) specifically the Internet that is believed to be the dominant carrier of information across the globe. Currently, English is the lingua franca for ICTs with the majority of content being in English. This makes English language information available on ICTs inaccessible to the majority of world population that cannot read or write in English. This is especially applicable to Pakistan where only 5-10% of the total population can understand English [3, 4]. Due to this low English literacy rate the ratio of internet usage in the country has been adversely affected. According to one estimate, at present only 1.2 % of the total country population is using internet [4]. Evidently, unless the ICTs are available in languages that are understood by the masses, the majority of the country's population would remain deprived of the information available through the English-centric ICT devices. Enabling ICTs in the local language of the user is known as "localization". Specifically, it is enabling computing experience in linguistic culture of the user. Linguistic culture is not just limited to the language but how the language is used by the environment of the user. Thus, for Punjabi speakers in India, the computer should display the language in Gurmukhi script and for Punjabi speakers in Pakistan, the same language should be displayed in Arabic script. Localization of ICTs requires local language applications based on local language computing standards. These standards include character set encoding, keyboard (and keypad) layout, collation/sorting sequence, locale and ICT terminology (overview of these standards is given in [5, 6]). In addition to definition of standards, applications also need to be developed for local language computing to support script, speech and language processing. Within these areas, there is a large variety of applications required, some being more fundamental in nature, while others are more advanced and complex but equally vital for end users. These applications include definition of input methods, normalization and sorting utility [e.g., 7, 8], find/replace utility, fonts, lexicon, thesaurus, spell checker, grammar checker, text-to-speech system, speech recognition system, machine translation system and optical character recognition system [5, 6] Currently work is being done to develop these standards and applications for Urdu language, the lingua franca of Pakistan. This paper focuses on the evolution and standardization of Urdu language character set encoding. In addition, this paper also describes the issues and development related to Urdu collation sequence and proposes a set of collation elements to enable it. Both standards are essential for the development of Urdu computing. 2. Urdu Encoding Character encoding may be defined as assigning a unique number to each language character to be processed by the computer. Whenever a character is input from keyboard or other input devices, this particular code is generated internally in the computer. Arbitrary encoding may be defined for any application (e.g. 80 for letter ‘a’, 81 for letter ‘b’). However, if different vendors are Associate Professor, National University of Computer and Emerging Sciences, B Block, Faisal Town, Lahore, Pakistan. Email:
[email protected]. URL: www.crulp.org.
1
defining arbitrary encodings, their encodings may not agree with one another. This would mean that the data files generated by their application(s) will be accessible only through their applications and may not correctly represent data in another vendor’s software (e.g. another vendor may define 80 to be letter “P” and 81 to be “Q”). Though initially application developers were able to get away with this, even successfully use this mechanism to tie in users, with the advent of the Internet, it has now become increasingly essential to standardize the encoding scheme because users are accessing data created by a variety of sources through web browsers (a single application). Realizing the significance of standardizing encoding, work was done early for English and American Standard Code for Information Interchange (ASCII) was defined in 1968. This standard had 128 slots defined using 7 bits by American National Standards Institute (ANSI). However, this code chart only contained English characters. This table was extended to 8 bits to include other Latin based languages spoken in Europe and South America. Later international organizations like International Standards Organization (ISO) also adopted ASCII as ISO/IEC 646 standard. 8-bit code pages for other languages were also introduced as international standards ISO/IEC 8859 –x (e.g. ISO/IEC 8859 – 6 for Arabic). Many different encoding schemes have been developed for Urdu as well by various vendors. However, no national or international effort was undertaken to standardize Urdu encoding, resulting in difficulties in exchanging Urdu data. The necessity of a standardized character set was first highlighted in Oct, 1997 at the 4th National Computer Conference, Islamabad, organized by the Pakistan Computer Bureau [9]. To formulate these standards a one-day long seminar on “The Standardization of Urdu Keyboard Layout and Internal Character Representation” was conducted at Foundation for Advancement of Science and Technology (FAST) Institute of Computer Science, Lahore, in 1998. This seminar resulted in the formation of four separate committees focusing on Urdu Code Page, Urdu Keyboard Layout, Urdu Email and Urdu Internet. As the work on the last three committees was based upon the work accomplished by Urdu Code page team, therefore the development of Urdu Code page was prioritize over others. The recommendation through this committee resulted in the formulation of the Urdu Zabta Takhti (UZT 1.01; see [10] for details) which was accepted as 8-bit encoding standard for Urdu by Government of Pakistan in 2000 [11]. This standard is briefly described below (for a detailed discussion see [12]). 2.1. UZT 1.01 [12] UZT 1.01 is an 8 bit code page and can accommodate 256 slots for character representation. This standard has all the characters identified to do publishing for Urdu. Though it is primarily an encoding standard (and sorting is usually not catered directly in such standards, e.g. see Collation Section below), the standard was devised to do some implicit sorting based directly on encoding. It has been divided into various logical sections that are as follows: i. ii. iii. iv. v. vi. vii. viii. ix. x.
The Control characters Punctuation and arithmetic symbols Digits Urdu aerabs/diacritics Urdu characters Reserved control space Special symbols Reserved expansion space Vendor area Toggle character
UZT 1.01 is shown in Figure 1 below. A brief description of these logical sections is also given.
Figure 1: Urdu Zabta Takhti 1.01 Control Characters Slots from decimal positions 0 – 31 and 127 are dedicated to the control characters which include characters like null (0), new line (10), escape (27), delete (127), etc. As these characters are used universally, UZT has used the same slots for these characters as the ASCII characters, in order to avoid conflicts. Punctuation and Arithmetic Symbols These symbols have two dedicated places from slots 32 – 47 and 58 – 65 assigned for punctuation and arithmetic symbols respectively. This ordering is similar to the ASCII. Some symbols are logically similar however transformed into a different shape. Finally a ligature break (65) has been included and distinguished from word break (32). The Urdu script is although connected, however breaks between characters may occur within words or across words as illustrated in Figure 2. To distinguish word boundary from ligature boundary hard space and space are used respectively where through former a single ligature can be broken into two ligatures without breaking the word, while the latter breaks ligatures into different words.
ن
ن Word Ligature Break Break
Figure 2. Word break and Ligature break in Urdu Digits Urdu numerals from zero through nine are included in slots 48 – 57 same as ASCII. Urdu Diacritics (Aerab) The aerab are included from slots 66 – 79 and slots 123 – 126. These aerab cover basic Urdu vowels Zer, Zabar and Pesh and also extended aerab set (e.g. Khari Zabar, Tashdeed, etc.). Infrequent aerab like Noon Ghunna sign were also included to cater to the publishing needs of Urdu. Urdu Characters Urdu characters fill slots 80 – 121, therefore there are 42 characters included. ‘Do-chashmey hay’ is included at the end of the list in slot 122 (instead of its traditional position adjacent to ‘Gol hay’ in slot 117) for two reasons. First, though written separately, ‘Do-chashmey hay’ is not a character of Urdu but forms characters when combined with other characters (e.g. /bh/, /ph/, etc.) and therefore needs to be distinguished from characters. Second, the sorting sequence, which requires all words starting with /b/ to be before all words starting with /bh/ (and similarly with other unaspirated-aspirated consonantal pairs), comes out naturally if ‘Do-chashmey hay’ is placed at the end of the character list. Reserved Control Space ASCII is a seven-bit standard, from 0-127. Earlier systems, of which are possibly still deployed, still conform to this standard. Therefore, if one byte code is sent to these systems, they truncate the most significant bit and convert the byte code to seven-bit code. This would be especially dangerous if slots 128 – 159 are used, because truncating the most significant bit would map these slots onto the control characters resulting in unpredictable behavior. Therefore, these slots have not been used. For the same reasons of truncation, all the Urdu characters, aerab and digits have been included in lower 128 slots. This ensures that even if a system conforms to a seven-bit code, it may still effectively use UZT to store information in Urdu. Special Symbols Slots 160 – 176 include these symbols, with room for further expansion after it. Slots 192 – 199 include more general symbols also in ASCII, including square brackets, curly brackets, underscore and dash. Reserved Expansion Space Slots 177 – 191, 200 – 207 and 240 – 253 have been reserved for future expansion. The committee devising Urdu standards may fill these slots in the future. The slots have been left vacant in anticipation of the future needs of Urdu. Vendor Area The code page covers the general and widely used Urdu language characters, aerab and symbols. However, there may be special needs that may arise for formatting or for including other more specialized characters not commonly used in Urdu. Slots 208 – 239 have been reserved for this purpose. Toggle Character
Finally, a character in slot 254 has been reserved to toggle between various code pages. Currently toggle character followed by character zero (slot 48) would mean start of UZT 1.01 and toggle character followed by one (slot 49) would mean ASCII. This will be relevant for the (Urdu Text File) format. 2.2. Unicode Initially most documentation was done in a single language, therefore 8-bit single language code pages served the need. However, in 1990s, with increasing needs for multi-lingual documents (where one could require Japanese and Arabic in the same document), it was realized that defining 8-bit code pages were not a scalable solution. Adding code pages for various languages and scripts and using them together in one application created a lot of difficulty and complexity in processing because users had to keep toggling between them. To address this issue, major vendors got together and created Unicode consortium (www.unicode.org). This consortium started working on developing a singular, unified and universal code chart which would contain all characters of all languages. As 8-bit (256 slots) code pages were insufficient for this requirement, Unicode character encoding standard was developed using 16 bits (65536 slots) [13]. This space has been divided to cater to various scripts and thus bypassed the need for toggling for different languages. Unicode provides a consistent way of encoding plain text in multiple languages within a single document and brings in a simple, efficient manner. Unicode standard has also been adopted by ISO as ISO/IEC 10646, and is now jointly monitored and updated by Unicode consortium and ISO/IEC JTC1/SC2/WG2, the working group assigned to look at character encodings. As Unicode is a script based standard, Urdu support in Unicode is given in Arabic Script block from Hexadecimal 0600 till 06FF (because Urdu is written in Arabic script along with Farsi, Pashto, Sindhi, etc.). In addition, there are two more blocks for Arabic, which were added to provide backward compatibility and have now been deprecated. These blocks are from Hexadecimal FB50-FDFF and FE70-FEFF (except FDFx sequence, which is still in use and contains ligatures). Only basic level work was done for Urdu when UZT 1.01 was formalized. UZT gave a good measure of Urdu requirements. Using this as a baseline, gap analysis was done between UZT and Unicode [14]. The gap analysis resulted in proposal for adding missing Urdu characters to ISO/IEC 10646 standard, which was submitted to ISO/IEC JTC1/SC2/WG2 meeting in Dublin in 2002 for consideration [15]. After the addition of characters through this proposal and other proposals presented by other ISO/IEC JTC1/SC2/WG2 members, Unicode 4.0 has complete support of Urdu language. As Unicode standard has to cater to multiple existing systems and multiple languages within a script, redundancies are introduced in it. However, sometimes these redundancies are visual and not present functionally (this is not always evident from the glyphs given in the code charts, but is explained where character names, descriptions and behaviors are elaborated in [13]; e.g. for DoChashmey Hay, there are two codes, Hexadecimal 0647 and 06BE, but latter has been recommended for use in Urdu; also see 0649 and 06CC). Therefore, before using Unicode, it is necessary to determine (and perhaps standardize nationally) which codes in Unicode need to be short-listed for Urdu. In addition, some normalization also needs to be done to address the redundancies introduced through the standard (e.g. Hamza Alif = Alif + Hamza above: Hexadecimal 0623 = 0627 + 0654; also 0622 = 0627 + 0653) [7]. Figure 3 below is a subset of the Unicode code chart for Arabic (available from http://www.unicode.org/charts/PDF/U0600.pdf) which identifies the minimal set of Urdu relevant characters. These characters include most of the characters in UZT 1.01 discussed in the previous section. The characters which have been “blacked-out” from the original chart either belong to other languages (e.g. Sindhi, Pashto) and not used in Urdu or include characters which can be derived from the characters shown (e.g. Hamza Alif = Alif + Hamza above).
060
061
062
0
1
2
063
064
ء
ر
ف
آ
ز
ق
4
ش
ل
5
ص
م
ض
ن
6
ا
ط
ب
ظ
9 A B
C
D
E
F
067
068
069
06A
06B
06C
06D
ڑ
غ
ہ
ۃ
۴ ۵
چ
۶ ۷
ؤ
ڈ ٹ
ژ
۸ ک
٪
۹ ں
ج
ی
ح
د
۲ ۳
۔
ث
خ
06F
۱ ے
ع ث
06E
۰
س
8
066
ذ
3
7
065
پ
ھ گ
Figure 3: Urdu Character Subset for Unicode Arabic
In addition to these another important character used for Urdu is the Zero Width Non-Joiner (ZWNJ) Unicode 200C (functionally equivalent to Ligature Break in UZT 1.01; see Figure 2). Also, there are ten ligatures from the Unicode FDF2-FDFB that are included. Unicode standard provides almost complete support for Urdu. However, there are still few deficiencies. Hamza is one such example. In Unicode, Hamza character
ءis declared a non-
joiner (i.e. it does not connect with letter following it). However, in Urdu language words like require Hamza to join with characters following it. Similarly Bari Yay ےis also considered a non-joiner in Urdu (with the following character), but word ر5 Urdu as ر
is also commonly written in
. These issues still need to be resolved with Unicode standard to complete Urdu
support. 3. Urdu Collation After encoding, definition of collation sequence for a language has perhaps the most important requirement for standardization. Any significant data processing (e.g. maintaining employee records, voter lists, etc.) requires lexicographic sorting of textual data, which is usually different from sorting on the codes assigned to characters of the language. For example, if English is sorted on ASCII code, it would sort an alphabet string to “ABCDE…Zabcde…z”. However, lexicographic sorting requirement of English is “AaBbCcDdEd…Zz”. In addition, unlike Unicode encoding, which is a script based standard, collation is usually specific to a particular language. This would mean that different languages, which are based on the same script may still have different sorting sequences. These differences arise because languages base sorting on many different factors including characters, capitalization, marks/diacritics and strokes [17]. For example, Urdu and Sindhi are both based on Arabic script, but have different character level sorting sequences:
( ب پ ث ثfor Urdu) and ( ب ث ث پfor Sindhi) [16].
Sorting order differences can also be demonstrated through examples from the Latin script languages. For example sorting sequence of the same words is shown for English and French diacritic level differences in sorting conventions in Figure 4 (for a detailed discussion see [17]). English casa chaque choisi choy císta Ciuriką cote coté côte côté
French casa chaque choisi choy císta Ciuriką cote côte coté côté
Figure 4: English and French Sorting Conventions [17]
As Unicode encoding standard is script based, where only one sequence of characters may be stored for all languages using that script, it is not possible to realize varying sorting sequences within it. Therefore, separate codes (based on languages) different from Unicodes are used to do language-centric sorting. These codes are referred to as collation elements. In addition, to cater to the various levels of sorting (depending on character, mark, capitalization, etc.) these collation elements are divided into multiple levels (see [8] for details). A default collation element set has also been standardized (not for any particular language) and may be used as a starting reference [32]. This section first discusses the issues and their resolution to develop Urdu lexicographic standard for Pakistan. The section then defines collation elements of Urdu compatible with Unicode collation algorithm as defined in [7, 8]. This section also presents results of implementation of the algorithm using the proposed collation elements to show that they can be successfully used to sort in Urdu. 3.1. Issues with Urdu Collation [18] Initial work was done to identify Urdu collation sequence by studying multiple dictionaries available for Urdu. Results showed that multiple conventions were being used and no one sequence was agreed as final [18]. The issues discovered are summarized below. Issue # 1: Ambiguity between the sequential position of
آand ا
Data gathered from eight different dictionaries of Urdu language, depict the following inconsistency in the sequential position of the Urdu characters
آand ا.
Data from [20], [24],
[25], [26] shows that آfollows the basic اin a sorted sequence. The sorting sequence would be:
ا آ آب اب
(from right to left). Justification for this order is that historically آwas considered
to be a different stylist way of righting double اi.e. two اoccurring in a row [27]. Therefore it was postulated that linguistically آdoes not exist as a separate character. Another sorting sequence for
اand آcan
be seen from the data gathered from [19]. This dictionary recognizes
آas
a
separate independent character of the language, and places it sequentially before ا. The sorting order thus formed is: اب
آ آب ا
(from right to left)
This sequence is a more widely used sorting order convention in Urdu language. Yet another sorting order for the same characters can be seen, by gathering data from [21], [22], [23]. Here as in [19], linguists recognize آas a separate character, contrary to [20], [24], [25], [26], but order
آsequentially after اcontrary to the rules in [19].
The sorting order thus formed is as follows:
( ا اب آ آبfrom right to left). Issue # 2: The ambiguity of digraphs formed with Do-Chashmey Hey () ھ The ambiguity in determining the sequential position of Urdu characters (e.g.
ã = ب+ ھ
character or a sequence of
).
ھwhen it is combined with other basic
The question is whether this diagraph is a separate
بand ھwhere ھcomes before/after ہ.
This discrepancy in the
sorting order of Urdu language can be observed while examining data in Urdu dictionaries. A set of dictionaries including [21], [22], [23] believe that the composite characters like the joining of a
ب+ ھ, is an independent character.
They treat
ã
ã
formed after
as a separate character and
position it after the occurrence of بsequentially, giving the following sorting sequence: (from right to left). Another Urdu dictionary [20] treats the diagraph
ã
, not as an independent entity but a sequential occurrence of the characters بand ھ. Here ھ
in contrary is considered as a separate and independent character. This dictionary sorts the two characters,
ہand ھat the same level in the first pass, but gives precedence to ھover ہin the
second pass of sorting. The sorting sequence thus formed is: right to left). Similar treatment of the diagraph
ã
, is observed in [19], [24], [25], [26] as it was
observed in [20], where the diagraph is treated as a sequence of dictionaries give precedence to formed is:
(from
ب
ہover ھin the second pass of sorting. (from right to left).
+
ھ,
however, these
The sorting sequence so
Issue # 3: Ambiguity in the sequential position of
نand ں
There are two major issues regarding the morphological and phonetic characteristics of the Urdu
ں.
Firstly, status of
ںas a separate, independent character is very ambiguous.
arises because of its phonetic behavior. Observations on the use of
This ambiguity
ںin Urdu language shows
that this character adds “nasalization” feature to the preceding vowel with which it occurs in a word. E.g.
ں، ؤں5 at the same time it does not produce a separate sound.
nasal constant itself. For this reason few dictionaries consider character
ں.
ن.
ں
as a variation of the Urdu
The second major issue is pertaining to the sorting order of the characters
One set of dictionaries [19], [21], [20], [26], [25], and [24] gives precedence to
sorting order following the sequence (from right to left)
ں ن
نis a
However,
نand
ں
in the
, while another set of dictionaries
[22], [23] give precedence to the Urdu character noon over the Urdu character noon gunnah in a sequentially sorted sequence for exampleں
ن
.
Issue # 4: Ambiguities pertaining to the sequential position of ۃin the Urdu language sort order The Urdu character
ۃis
borrowed from Arabic language. In Arabic Tay sound is represented
by ثwhen it is part of the root morpheme. However, if the sound occurs as part of feminine or other suffix, Tay sound is represented by ۃ. In Arabic, it is pronounced as ہwhenever there is a pause observed after the word ending with ۃ. However, this phonetic property of the Gol Tay does not exist in Urdu language. In [28] following sequence of the words is observed:
ؤفh ۃا:دا.
ہ: داthen
This sequence shows that here Gol Tay is placed sequentially after the Gol hay.
However, in [29] the word ہ
: داfollows ؤفh ۃا: دا.
Issue # 5: Ambiguities related with status of
ء
In certain cases Hamza takes a character joiner but in other cases, it resides directly on another
character. For example, in words like
ل
Hamza takes an explicit glyph joiner (compare with
where Hamza is not written), but does not take a joiner for the following words:
(compare with
@ ,ث
ؤ, أث
where Hamza takes an explicit joiner), Characters always have a
joiner and aerab never have a joiner. Thus, Hamza shows a mixed behaviour and as both types have differing sorting behaviour, it is also unclear how to sort Hamza. Issue # 6: Ambiguity between the sequential position of یand ے Though it is fairly conventional and agreed upon by all dictionaries that
ى
comes before
ے,
nevertheless few dictionaries do not agree whether this difference in collation is in the first or second pass of sorting, giving the following variations. Sorting sequence of three words is given in two different ways by two sets of Urdu dictionaries. According to [19], [21], [22], [23], [26] the sorting sequence from right to left is:
'
according to [20] ], [24], [25] .
while the same words are sorted as In addition there exists another ambiguity
regarding the sorting sequence of yeh when it is written within a word, e.g. in the word unclear whether to sort the middle Yay as a یor a ے.
ن6 it is
Issue # 7: Sequence for aerab: Zer, Zabar, Pesh, khari Zabar, tashdeed etc. The words with diacritic marks or aerab are processed at the second level of the sorting process. However it has to be determined how they are mutually sorted and that whether words without them are to be placed sequentially before or after them, e.g. sorting order of words like
ُ
َ
؛ ؛ ؛ ِ
has to be determined.
3.2. Recommendations for the Resolution of Issues with Urdu Collation Work had been in progress to finalize a single collation sequence, which could be standardized for Government of Pakistan. The above mentioned problems were first highlighted in 2002 [30]. Later in 2003 these issues were reviewed in a meeting held at Center for Research in Urdu Language Processing at National University of Computer and Emerging Sciences. Finally, the issues were resolved in the final meeting held at National Language Authority on 26th January, 2004. The following are the recommended resolutions of the issues discussed. Before individual issues were taken up, for consistency some basic rules of thumb were agreed for standardizing the collation sequence. They were then used to form the recommendations in a consistent manner. These rules were: 1. Given two characters, one being core and other being secondary in nature and not conventionally considered as core, the former would precede the latter in sorting order.
2. As core Urdu characters are grouped according to their shapes (and not sounds), any nonprimary characters should be introduced according to the shape and not their sound. Following these two basic principles the ambiguities pertaining to the conflicting sorting orders were resolved as explained below. Resolution of Issue # 1
آ
is taken as an independent character and is no longer considered as a variation of ا.
Justification behind declaring
آas a separate character is because it is very productive in Urdu
and now native speakers’ intuition treats it as different from a sequence of two ( اmost likely now this stylistic variation has been lexicalized).
آis placed sequentially after اin the sorting order
using the two rules of thumb discussed above. Resolution of Issue # 2
ھis not a separate character of Urdu, but forms a digraph indicating the aspirated version of the associated consonant. As these digraphs form different phonemes and are very productive in their usage, following earlier work [23] they were taken as separate characters and placed sequentially after their basic character, e.g. in the ã will follow ث. Resolution of Issue # 3 Noon gunnah was agreed to be separate from noon, and using the two rules, placed after noon for sorting. Resolution Issue # 4 Gol Tay was also identified as a separate character. Although it produces a Tay sound but following the two rules, it was recommended to be placed after ہ. Resolution Issue # 5 Hamez
ءtakes dual form, as a combining mark and a character.
It was recommended that with
vowel characters ے،ی،ؤ،ا, it should be considered as part of aerab, but outside this context it is taken as a character. When it appears as a character, the conventional sorting sequence of
ءis
maintained, where it appears after
ہbut before ی.
When it occurs as an aerab, it follows the
sorting sequence discussed in Resolution of Issue # 7 below. Resolution of Issue # 6 Both یand ےare taken as characters and thus are to be sorted at the first level. Additionally, it was clarified that diction rules of Urdu state that where two forms for Urdu words exist, only the
ر5
“broken” multiple ligature form is correct. Thus, writing
is the only correct way and
ر
is
incorrect. Finally, in the medial Yay should be sorted according to its code, which ever it may be. The code should depend on rules of Urdu diction (which are out of scope of this paper). Resolution of Issue # 7 The sorting order for the aerab (diacritics) was recommended to be one shown in Figure 5 below. Alif is taken as a representative character to show the aerab, however it should be noted that there are some constraints on placing the aerab. Zer, Zabar, Pesh and Hamza can be placed over all characters except Noon Ghunna. Khari Zabar occurs on vao and Yay, khari Zer is not normally used in Urdu (though included for words borrowed from Arabic) and Ulta Pesh is normally restricted to Gol hay (word finally). In Urdu, Do-Zabar is restricted to Alif (and Do-Zer and Do-Pesh are only included for words borrowed from Arabic). Finally, Tashdeed can occur on all characters (except Alif).
ْ ا ا
ٔ ا Figure 5: Sorting Sequence of Aerab in Urdu Figure 6 below shows the recommended sequence for Urdu characters.
ٹ
ã
ث
ã
پ
ã
ب
آ
ا
د
خ
ح
ã
چ
ã
ج
ث
ã
ز
ڑھ
ڑ
رھ
ر
ذ
ڈھ
ڈ
دھ
غ
ع
ظ
ط
ض
ص
ش
س
ژ
م
ã
ل
ã
گ
ãD
ک
ق
ف
ۃ
ہ
ؤھ
ؤ
2
ã
ں
ã
ن
ã
ے
ã
ی
ء
Figure 6: Sorting Sequence of Urdu Characters3 3.3. Urdu Collation Elements To realize the given sorting sequence, collation elements for Urdu characters need to be defined. These collation elements are used by the Unicode Collation Algorithm (UCA; equivalent to ISO/IEC 14651) [8, 31]. UCA uses the collation elements to generate sort-keys for the input strings to realize the required lexicographic sorting for Urdu (also see [33, 34] for description of UCA and efficient realization). Collation element is an ordered list of four weights, e.g. 0861.0020.0002.0061 for Latin Small Letter a. The first (left-most) weight is called the Level 1 weight (or primary weight), the second is called the Level 2 weight (secondary weight), the third is called the Level 3 weight (tertiary weight), the fourth is called the Level 4 weight (quaternary weight). These levels are used to distinguish between characters, diacritics, capitalization ( 'a' vs. 'A') and other lesser significant differences respectively [8]. 0000 sequence is used to make the character ignorable at the level this weight is assigned. Unicode also describes default values for collation elements for all Unicode letters in Allkeys.txt [32]. However, this default sequence cannot realize collation for a specific language. In order to define collation properly for any particular language, these collations codes must be re-ordered in such a way that the coding sequence conforms to the standardized linguistic sorting order. Urdu sorting requires at least three levels. At the first level of sorting, only the basic Urdu characters will be sorted. Once the characters determine word sequence, aerab are used to determine the sequence of words having the same characters. Finally the third level is used to sort special symbols e.g. honorific mark ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM (see Figure 7 below). Allkeys.txt sequence proposed by Unicode Consortium is used [32] and where required, weights are updated to give the collation sequence required for Urdu.
Glyph
Unicode
Proposed Collation Elements
English Description
ZWNJ
200C
0000
0010
0002
200C
ZERO WIDTH NON-JOINER
0600
0000
0000
0000
0600
ARABIC NUMBER SIGN
0601
0000
0000
0000
0601
ARABIC SIGN SANAH
0602
0000
0000
0000
0602
ARABIC FOOTNOTE MARKER
This letter has got a Noon Ghunna Mark (U+0658) on top of it, not currently supported. Do-Chashmee Hay has not been taken as a separate character of Urdu as it is combined with base characters. However, as it is separately encoded in Unicode, it has still been given a collation element, as in Figure 7.
2 3
0603
0000
0000
0000
0603
ARABIC SIGN SAFHA
0615
0000
0000
0000
0615
ARABIC SMALL HIGH TAH
060C
0000
0000
0000
060C
ARABIC COMMA
060D
0000
0000
0000
060D
ARABIC DATE SEPARATOR
066B
0000
0000
0000
066B
ARABIC DECIMAL SEPARATOR
066C
0000
0000
0000
066C
ARABIC THOUSANDS SEPARATOR
061F
0000
0000
0000
061F
ARABIC QUESTION MARK
061B
0000
0000
0000
061B
ARABIC SEMICOLON
06D4
0000
0000
0000
06D4
ARABIC FULL STOP
066A
0000
0000
0000
066A
ARABIC PERCENT SIGN
060E
0000
0000
0000
060E
ARABIC POETIC VERSE SIGN
060F
0000
0000
0000
060F
ARABIC SIGN MISRA
0610
0000
0000
000A
0610
ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
0611
0000
0000
001A
0611
ARABIC SIGN ALAYHE ASSALLAM
0613
0000
0000
002A
0613
ARABIC SIGN RADI ALLAHOU ANHU
0612
0000
0000
003A
0612
ARABIC SIGN RAHMATULLAH ALAYHE
0614
0000
0000
004A
0614
ARABIC SIGN TAKHALLUS
0652
0000
00C4
0002
0652
ARABIC SUKUN
064E
0000
00C9
0002
064E
ARABIC FATHA
0650
0000
00CA
0002
0650
ARABIC KASRA
064F
0000
00CB
0002
064F
ARABIC DAMMA
0670
0000
00CD
0002
0670
ARABIC LETTER SUPERSCRIPT ALEF
0656
0000
00D5
0002
0656
ARABIC SUBSCRIPT ALEF
0657
0000
00D8
0002
0657
ARABIC INVERTED DAMMA
064B
0000
00DB
0002
064B
ARABIC FATHATAN
064D
0000
00DE
0002
064D
ARABIC KASRATAN
064C
0000
00E2
0002
064C
ARABIC DAMMATAN
0654
0000
00E5
0002
0654
ARABIC HAMZA ABOVE
0651
0000
00E8
0002
0651
ARABIC SHADDA
0658
0000
00EA
0002
0658
ARABIC MARK NOON GHUNNA
0653
0000
00F1
0002
0653
ARABIC MADDAH ABOVE
۰
06F0
0E29
0020
0002
06F0
EXTENDED ARABIC-INDIC DIGIT ZERO
۱
06F1
0E2A
0020
0002
06F1
EXTENDED ARABIC-INDIC DIGIT ONE
۲
06F2
0E2B
0020
0002
06F2
EXTENDED ARABIC-INDIC DIGIT TWO
۳
06F3
0E2C
0020
0002
06F3
EXTENDED ARABIC-INDIC DIGIT THREE
۴
06F4
0E2D
0020
0002
06F4
EXTENDED ARABIC-INDIC DIGIT FOUR
۵
06F5
0E2E
0020
0002
06F5
EXTENDED ARABIC-INDIC DIGIT FIVE
۶
06F6
0E2F
0020
0002
06F6
EXTENDED ARABIC-INDIC DIGIT SIX
۷
06F7
0E30
0020
0002
06F7
EXTENDED ARABIC-INDIC DIGIT SEVEN
۸
06F8
0E31
0020
0002
06F8
EXTENDED ARABIC-INDIC DIGIT EIGHT
۹
06F9
0E32
0020
0002
06F9
EXTENDED ARABIC-INDIC DIGIT NINE
ا
0627
1350
0020
0002
0627
ARABIC LETTER ALEF
آ
0622 or 0627 0653
1351
0020
0002
0622
ARABIC LETTER ALEF WITH MADDA ABOVE
ب
0628
1352
0020
0002
0628
ARABIC LETTER BEH
ã
0628 06BE
1353
0020
0002
0628
ARABIC LETTER BEH + ARABIC LETTER HEH DOCHASHMEE
پ
067E
1354
0020
0002
067E
ARABIC LETTER PEH
ã
067E 06BE
1355
0020
0002
067E
ARABIC LETTER PEH + ARABIC LETTER HEH DOCHASHMEE
ث
062A
1357
0020
0002
062A
ARABIC LETTER TEH
ã
062A 06BE
1358
0020
0002
062A
ARABIC LETTER THE + ARABIC LETTER HEH DOCHASHMEE
ٹ
0679
135A
0020
0002
0679
ARABIC LETTER TTEH
ã
0679 06BE
135B
0020
0002
062B
ARABIC LETTER TTEH + ARABIC LETTER HEH DOCHASHMEE
ث
062B
135D
0020
0002
062B
ARABIC LETTER THEH
ج
062C
135E
0020
0002
062C
ARABIC LETTER JEEM
ã
062C 06BE
135F
0020
0002
062C
ARABIC LETTER JEEM + ARABIC LETTER HEH DOCHASHMEE
چ
0686
1361
0020
0002
0686
ARABIC LETTER TCHEH
ã
0686 06BE
1362
0020
0002
0686
ARABIC LETTER TCHEH + ARABIC LETTER HEH DOCHASHMEE
ح
062D
1364
0020
0002
062D
ARABIC LETTER HAH
خ
062E
1365
0020
0002
062E
ARABIC LETTER KHAH
د
062F
1369
0020
0002
062F
ARABIC LETTER DAL
دھ
062F 06BE
136A
0020
0002
062F
ARABIC LETTER DAL + ARABIC LETTER HEH DOCHASHMEE
ڈ
0688
136B
0020
0002
0688
ARABIC LETTER DDAL
ڈھ
0688 06BE
136C
0020
0002
0688
ARABIC LETTER DDAL + ARABIC LETTER HEH DOCHASHMEE
ذ
0630
1370
0020
0002
0630
ARABIC LETTER THAL
ر
0631
1375
0020
0002
0631
ARABIC LETTER REH
رھ
0631 06BE
1376
0020
0002
0631
ARABIC LETTER REH + ARABIC LETTER HEH DOCHASHMEE
ڑ
0691
1377
0020
0002
0691
ARABIC LETTER RREH
ڑھ
0691 06BE
1378
0020
0002
0691
ARABIC LETTER RREH + ARABIC LETTER HEH DOCHASHMEE
ز
0632
137C
0020
0002
0632
ARABIC LETTER ZAIN
ژ
0698
137E
0020
0002
0698
ARABIC LETTER JEH
س
0633
1381
0020
0002
0633
ARABIC LETTER SEEN
ش
0634
1382
0020
0002
0634
ARABIC LETTER SHEEN
ص
0635
1387
0020
0002
0635
ARABIC LETTER SAD
ض
0636
1388
0020
0002
0636
ARABIC LETTER DAD
ط
0637
138C
0020
0002
0637
ARABIC LETTER TAH
ظ
0638
138D
0020
0002
0638
ARABIC LETTER ZAH
ع
0639
138F
0020
0002
0639
ARABIC LETTER AIN
غ
063A
1390
0020
0002
063A
ARABIC LETTER GHAIN
ف
0641
1393
0020
0002
0641
ARABIC LETTER FEH
ق
0642
139B
0020
0002
0642
ARABIC LETTER QAF
ک
06A9
139F
0020
0002
06A9
ARABIC LETTER KEHEH
ãD
06A9 06BE
13A2
0020
0002
06A9
ARABIC LETTER KEHEH + ARABIC LETTER HEH DOCHASHMEE
گ
06AF
13A5
0020
0002
06AF
ARABIC LETTER GAF
ã
06AF 06BE
13A6
0020
0002
06AF
ARABIC LETTER GAF + ARABIC LETTER HEH DOCHASHMEE
ل
0644
13AB
0020
0002
0644
ARABIC LETTER LAM
ã
0644 06BE
13AC
0020
0002
0644
ARABIC LETTER LAM + ARABIC LETTER HEH DOCHASHMEE
م
0645
13B0
0020
0002
0645
ARABIC LETTER MEEM
ã
0645 06BE
13B1
0020
0002
0645
ARABIC LETTER MEEM + ARABIC LETTER HEH DOCHASHMEE
ن
0646
13B4
0020
0002
0646
ARABIC LETTER NOON
ã
0646 06BE
13B5
0020
0002
0646
ARABIC LETTER NOON + ARABIC LETTER HEH DOCHASHMEE
ں
06BA
13B9
0020
0002
06BA
ARABIC LETTER NOON GHUNNA
ã
06BA 06BE
13BA
0020
0002
06BA
ؤ
0648
13BD
0020
0002
0648
ARABIC LETTER NOON GHUNNA + ARABIC LETTER HEH DOCHASHMEE (see Footnote 2) ARABIC LETTER WAW
ؤھ
0648 06BE
13BE
0020
0002
0648
ARABIC LETTER WAW + ARABIC LETTER HEH DOCHASHMEE
ہ
06C1
13C2
0020
0002
06C1
ARABIC LETTER HEH GOAL
ھ
06BE
13C4
0020
0002
06BE
ARABIC LETTER HEH DOCHASHMEE
ۃ
06C3
13C6
0020
0002
06C3
ARABIC LETTER TEH MARBUTA GOAL
ء
0621
13C7
0020
0002
0621
ARABIC LETTER HAMZA
ی
06CC
13C9
0020
0002
06CC
ARABIC LETTER FARSI YEH
ã
06CC 06BE
13CB
0020
0002
06CC
ARABIC LETTER FARSI YEH + ARABIC LETTER HEH DOCHASHMEE
ے
06D2
13CE
0020
0002
06D2
ARABIC LETTER YEH BARREE
Figure 7: Collation Elements for Urdu Sorting While sorting some normalization and/or contractions [8, 31, 33] also need to be done before the collation elements are assigned, in cases where one-to-one mapping is not available between
Urdu character and Unicode encoding, e.g. contraction would collapse two Unicodes بand ھ to ã which is a single character of Urdu. In addition, sometimes there might be instances when there are multiple ways of encoding the same character(s), e.g. the character ( ۓUnicode 06D3) may also be written by first writing a ( ےUnicode 06D2) and then writing a
( ءUnicode
0654). The collation elements given are sorted by a computer program developed to realize simplified version of four-level sorting (see [8, 31, 33, 34] for details of the algorithm). Figure 8 below shows input and sorted results based on the collation elements. (a) Input
َّ
اب ِ
َ
ن
ن
ز ةز تز ہ
ُ Ùن ٰ َّ َ ز ة
ن ں
ا
آب ا Ù
ََ
Ù
َّ
وف آ
دائرہ ُ دائرةاÙ
ا ُ
َ
دائرة
(b) Output
ن دائرہ ن َّ َّ َ َ َ َ َّ
ُ
ُ ُ Ù Ù
Ù
ِ
َ
ن آ آب
ٰ Ù وف ز ت ز ہ ز ة ز ة
ا
ا اب ا
دائرة دائرةا ن ں
Figure 8: Unsorted and Sorted Data for Urdu The results show that the collation elements can be successfully used to sort for Urdu. However, further work needs to be done to explore if there is need to do further semantic-level Urdu processing for sorting text strings, e.g. in English, sorting of Acronyms "UN", numerals like "9 1/2
weeks" etc. [33]. Further work also needs to be done to define how Arabic script characters in Unicode but not in Urdu (e.g. Sindhi and Pashto characters) would be sorted with Urdu. 4. Conclusion This paper explains the development and usage of two of the most fundamental standards for Urdu: Urdu character set encoding and Urdu collation sequence. These two standards are essential for the development of Urdu ICTs and would be vital in the generation, exchange and processing of Urdu textual data for ICTs. What has been standardized (or is in the process of being standardized) is not a solution which agrees with all the conventions and norms of Urdu. However, that would be impossible to achieve, as in many case parallel norms exist which are mutually conflicting. Therefore, relevant people and organizations have tried to standardize on best possible reasoning. Nevertheless, natural languages do not always depend on reason and logic and therefore sometimes a level of arbitrariness has to be induced, as has been done here as well. The most important goal and challenge is to preserve the linguistic intuition as much as possible (and not allow the technology to compromise it), though it may not be entirely possible to achieve it totally. Much more work lies ahead for Urdu to create and mature standards required for ICTs (e.g. Keyboard, locale, etc.) and to develop meaningful applications and content around these standards which can eventually benefit Urdu speakers. 5. Acknowledgements The historical work discussed here has been possible through inputs from many individuals in their private capacity or acting on behalf of Urdu and Regional Languages’ Software Development Forum, National Language Authority, Pakistan Computer Bureau and Ministry of IT & Telecom (formerly Ministry of Science and Technology, IT&T Division). This work has been partially supported through PAN Localization Project grant from the International Development Research Center, Ottawa, Canada, administered through Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan. We would also like to acknowledge Åke Persson, Ernst Tremel, Dr. Khaver Zia and Dr. Muhammad Afzal for their feedback on the paper. 6. References [1] Statement presented by Minister of Information and Communications Technology of Thailand, at World Summit for Information Society, Regional Conference for WSIS in Tokyo, Jan 2002. [2] Rule 5 of the Standard Rules on the Equalization of Opportunities, UN 1993. [3] Languages of Pakistan, http://www.ethnologue.com/show_country.asp?name=Pakistan. [4] Masood, J. Pakistan, in Digital Review of Asia Pacific 2003/2004, Orbicom International Secretariat, Montreal, Canada, 2003. [5] Hussain, S. and Gul, S. LICT4D.pk, accepted for publication in the Proceedings of Frontiers of IT, COMSATS Institute of Information Technology, Islamabad, Pakistan, December 2003. [6] Hussain, S. Localization of FOSS, in the Proceedings of Free and Open Source Software Asia-Pacific (FOSSAP) Consultation, APDIP UNDP, Malaysia, February 2004.
[7] Davis, M. and Durst, M. Unicode Normalization Forms, http://www.unicode.org/reports/tr15/tr15-23.html, Unicode Consortium, 2003.
published
at
[8] Davis, M. and Whistler, K. Unicode Collation Algorithm, http://www.unicode.org/reports/tr10/tr10-11.html, Unicode Consortium, 2004.
published
at
[9] Afzal, M, Urdu Software Industry: Prospects, Problems and Need for Standards, Science Vision 5(2). Islamabad, Pakistan, 1999. [10] Afzal, M. and Hussain, S., Urdu Computing Standards: Development of Urdu Zabta Takhti, in the Proceedings of International IEEE Multi-topic Conference (INMIC), Lahore University of Management Sciences (LUMS), Lahore, Pakistan, 2001. [11] Chief Executive Approves Standard Urdu Code Page, in Akhbar-e-Urdu, National Language Authority, Islamabad, Pakistan, 2000. [12] Hussain, S. and Afzal, M. Urdu Computing Standards: Urdu Zabta Takhti (UZT) 1.01, in the Proceedings of International IEEE Multi topic Conference (INMIC), Lahore University of Management Sciences (LUMS), Lahore, Pakistan, 2001. [13] Unicode Consortium, Unicode 4.0, Addison-Wesley, Reading, UK, 2003. [14] Zia, K., Towards Unicode Standard for Urdu, in the Proceedings of Fouth Symposium on Multi-lingual Information Processing (MLIT4), Yangoon, Myanmar (CICC Japan), 1999. [15] Hussain, S., Proposal to Add Marks and Digits in Arabic Code Block (for Urdu), in the Proceedings of ISO/IEC JTC1/SC2/WG2 Meeting, Dublin, Ireland, 2002. [16] Bhurgari, A. M., Enabling Pakistani Languages through Unicode, published at http://download.microsoft.com/download/1/4/2/142aef9f-1a74-4a24-b1f4782d48d41a6d/PakLang.pdf. [17] Wissink, C. and Kaplan, M., Sorting it all out: An Introduction to Collation, in the Proceedings of 23rd International Unicode Conference, March 2003, Prague, Czech Republic. [18] Hussain, S. and Karamat, N. Urdu Collation Sequence, in the Proceedings of International IEEE Multi topic Conference (INMIC), Ghulam Ishaq Khan Institute (GIKI), Islamabad, Pakistan, 2003. [19] Feroz-ul-Lughat Jamay (
ؤزا ثa ), Feroz Sons, Lahore, Pakistan.
[20] Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Dehli, India. [21]Haqqi, S., Farhang-e-Talafuz (
h ), Muqtadra Qaumi Zaban, Umaiz Publications, 1995,
Islamabad, Pakistan. [22] Jadeed Urdu Lulghat ( [23] Urdu Lughat (
اردؤ
), Muqtadra Qaumi Zuban, Islamabad, Pakistan.
)اردؤ, Urdu Lughat Board, Karachi, Pakistan.
[24] Platts, J., A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London, UK, 1911, reprinted by Sang-e-Meel Publishers, Lahore, Pakistan. [25] Dehlvi, M. S. A., Farhang-e-Asfia (ê
آ
h ), reprinted by Sang-e-Meel Publications, Lahore,
Pakistan, 2002. [26] Nayyer, M. N. H., Noor-ul-Lughat (
)@ را, Sang-e-Meel Publications, 1989, Lahore, Pakistan.
[27] Platts, J., A Grammar of the Hindustani or Urdu Language, 1909, reprinted by Sang-e-Meel Publishers, Lahore, Pakistan. [28] Almi Urdu Lughat. [29] Naseem-ul-Lughat. [30] Hussain, S., Collation Sequence of Pakistani Languages, invited talk at Seminar on Software Development of Urdu and Other Pakistani Languages held at Karachi Institute of Information Technology, September 2002. [31] ISO/IEC 14651: Information Technology - International String Ordering and Comparison - Method for Comparing Character Strings and Description of Common Template Tailorable Ordering, ISO Geneva, Switzerland (www.iso.ch), 2001. [32] Unicode Consortium, Default Unicode Collation Element Table, http://www.unicode.org/Public/UCA/latest/allkeys.txt, 2003. [33] Gillman, R., Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, Addison-Wesley, Boston, USA, 2003. [34] International Components for Unicode, IBM, ICU Collation Design Documentation, http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/ICU_collation_design.htm.