Sindh Univ. Res. Jour. (Sci. Ser.) Vol.47 (2) 287-290 (2015)
SINDH UNIVERSITY RESEARCH JOURNAL (SCIENCE SERIES)

Towards an Efficient Urdu Keyboard Layout

I. A. KHAN, M. USMAN*, M. SHAFI*, K. GHANI*, N. MINALLAH*

Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad

Received 11th December 2014 and Revised 24th February 2015

Abstract- English is the most common language for conveying digital information in the world. However, a large part of the world's population does not understand English, so digital information needs to be translated into other languages as well. Translation of Information and Communication Technologies (ICTs) into local languages has been a focus of research for quite a while. The keyboard is one of the main input devices in ICTs and, like the information itself, it needs localization (efficient customization of the keyboard for other languages). This paper addresses efficient keyboard localization for the Urdu language. It proposes four techniques adopted from Dvorak to analyze Urdu words, based on alphabet frequencies, for designing an efficient Urdu keyboard layout: frequent words, letter frequencies, digraph frequencies and reverse digraph frequencies. The five thousand most frequently used Urdu words were analyzed with these techniques, and it was concluded that the layouts resulting from these analyses would differ from the most commonly used Urdu phonetic keyboard. These layouts therefore need to be considered when designing an efficient keyboard layout for Urdu users.

Keywords: Unicode, Localization, Digraph, Urdu, Keyboard, Layout
1. INTRODUCTION

Access to information is considered a basic human right. The internet is one of the major sources of information in the modern world. Most of the information on the internet is available in English, a widely spoken language. However, there is still a huge population that cannot use this information because it cannot read and write English (Hussain, et al., 2004). For example, in Pakistan only 5-10% of the entire population understands English, of which only 1.2% use the internet (Hussain, et al., 2004). Hence this information needs to be transformed into local languages as well. This process of transforming ICTs into local languages is known as localization. Local language computing standards (e.g. keyboard (and keypad) layout, character set encoding, sorting/collation sequence, locale and ICT terminology (Hussain, et al., 2004)) must be followed during the localization process.

Urdu, apart from being the national language of Pakistan, is the first language of more than 60 million people and is spoken by more than 100 million people across various countries of the world (Anon., 2013). The use of Urdu in computing started in the 1980s, but no standard encoding scheme was available at that time. Various vendors therefore introduced multiple encoding schemes for the localization of Urdu later on (Hussain, 2008). One of them is Urdu Zabta Takhti (UZT 1.01), an 8-bit encoding standard accepted in 2000 by the Government of Pakistan (Hussain, et al., 2004). After the inception of Unicode in the early 1990s, some online Urdu publications started using Unicode (Hussain, 2008). Unicode 4.0 completely supports the Urdu language (Hussain, et al., 2004).

However, the display of text is only one dimension of ICTs; other dimensions include writing as well. The basic device for such input is the keyboard. Keyboard layout design is an integral part of local language computing standards (Hussain, et al., 2004). The QWERTY keyboard has been the primary input method for a long time, although many experts consider it a less efficient, incorrect standard and a poor arrangement for touch typing (Matias, et al., 1996). The QWERTY keyboard layout was invented in 1868 and was designed to separate the most frequently occurring letter pairs in order to avoid the mechanical clashes caused by fast typing (Jasmin & Casasanto, 2012). Ease of typing or learning was not considered during its design (Norman and Fisher, 1981). The letter placements are not ideal, which causes fatigue, errors and lower typing speed. The typing speed and comfort of the user are also greatly influenced by the arrangement of letters on the keyboard (Wagner, et al., 2003).

Research has been carried out to overcome these issues. One of the most significant attempts was made by Dvorak et al., whose dominant theme was maximum efficiency (Parkinson, 2000). Dvorak arranged the keyboard layout on scientific principles. He assigned 30 characters to three rows, of which 26 are letters and 4 are punctuation marks.
Corresponding author email: [email protected]
* Department of Computer System, University of Engineering and Technology, Peshawar, [email protected]
Impressive results were claimed for better speed, fatigue reduction and improvement in accuracy. The research took more than a decade to produce final results because of the lack of computers (Wagner, et al., 2003). The literature above describes various efforts to improve typing accuracy, typing speed etc. for the English language. However, as discussed earlier, localization is becoming very important with the advancement of technology and its ubiquity. Non-phonetic keyboards, which are usually based on the frequency of characters, have better efficiency (Sarfraz, et al., 2011). The existing Urdu keyboard layout designs are mostly based on phonetics; to the best of the authors' knowledge, no layout has been designed on scientific principles. The focus of this paper is to formulate some scientific principles for designing an Urdu keyboard layout. The rest of the paper is organized as follows: Section 2 describes the collection of material for such a layout, Section 3 covers the analysis process, Section 4 highlights the results, and Section 5 concludes the overall findings and highlights some of the limitations of this research and possible future enhancements.
2. MATERIAL AND METHODS

Efficient keyboard layout designs such as (Parkinson, 2000) used thousands of unique words, letter frequencies and digraph frequencies. Therefore, the first step was to collect unique and most frequently used Urdu words. This required a large amount of Urdu data in Unicode encoding, which was downloaded from the website of the Centre for Research in Urdu Language Processing (CRULP). CRULP, at the National University of Computer and Emerging Sciences in Pakistan, has been involved in developing corpora for Urdu (Hussain, 2008). CRULP has compiled a wordlist of the 5000 most frequently used Urdu words, extracted from a corpus of 19.3 million words gathered from a wide range of domains such as news, sports, entertainment and finance. This wordlist of 5000 high frequency Urdu words is used in our experiments.

3. ANALYSIS PROCESS

The 5000 high frequency words were saved in a text file, which is called the unique words file in the rest of the paper. The experiments that extract letter frequencies, digraphs etc. from it using programmed algorithms are discussed below.

3.1 Letter Frequencies

Many efficient keyboard layout designs are based upon the frequency of letter usage in the target language (Ichbiah, 1996). Letter and letter-sequence frequencies were taken into account by Dvorak while designing his keyboard (West, 1998). To calculate the Urdu letter frequencies, an algorithm was developed that traversed each word in the unique words file and counted the occurrences of each alphabet of the Urdu language in those words. The letters and their frequencies were then saved in an Excel file. A list of letter frequencies in descending order and its bar graph were created in the Excel sheet, as shown in Fig. 1. "ﺍ" (Alif) came at the top of the list with 3086 occurrences. In other words, the analysis of these 5000 high frequency words showed that "ﺍ" (Alif) is the most frequently used alphabet in the Urdu language. All thirty eight (38) alphabets were found in the 5000 most frequently occurring unique words. The alphabet frequency distribution is given in Fig. 1.

Fig. 1: Urdu alphabet frequency distribution in the 5000 most frequently occurring unique words

Algorithm 1 – Extracting Letter Frequencies
Begin
  Let uWord be a String
  Set uWord ← “”
  Let charArray be an array of 38 integers, one per alphabet in sequence, with the first column holding the character and the second its frequency counter
  While (uWord not Null)
    Set uWord ← word from the file and move to next word
    Let uChar be a character
    While (there are characters in uWord)
      Set uChar ← character from word and move to next character
      Set charSeq ← sequence number of uChar
      Set charArray[charSeq] ← charArray[charSeq] + 1
    End While (there are characters in uWord)
  End While (uWord not Null)
  Save charArray into Excel file
End Algorithm

3.2 Digraph Frequency

For work of acceptable quality, speed is the only valid criterion (West, 1998). Throughout the history of keyboard comparison, typing speed has been the major quantitative indicator (Buzing, 2003). Dvorak and Dealey noticed that the speed at which one types is mostly a function of the speed at which one can type two-letter sequences (Harbaugh, 1996). The fastest combinations are those in which both keys are typed with alternate hands and lie on the home row (Harbaugh, 1996). Such two-letter sequences are called digraphs. The next analysis on our collection of words was therefore to calculate the frequencies of letters that occur in sequence. For example, the word “alphabet” contains the sequence of character ‘a’ followed by character ‘l’. An algorithm was developed that searched for all combinations of Urdu digraphs in the unique words file of the 5000 most frequently occurring words and calculated the frequency of each digraph. The frequency table of the digraphs and its graph were created in an Excel sheet in descending order. “ﺑﺞ” topped the list with 282 occurrences. A total of 298 unique digraph sequences were found by the algorithm. The frequency table of the top 40 most frequently occurring digraphs is given in Fig. 2. The complete frequency graph can be downloaded from the website (see footnote 1).
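Returning to the letter-frequency step of Section 3.1, the counting in Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `letter_frequencies` and the three-word `sample_words` list are hypothetical stand-ins for the unique words file, and a `Counter` is assumed in place of the fixed 38-element array so that no letter-to-index mapping is needed.

```python
from collections import Counter


def letter_frequencies(words):
    """Count how often each character occurs across a list of unique words."""
    counts = Counter()
    for word in words:
        # Every character of the word increments its own counter by one.
        counts.update(word)
    # (letter, count) pairs, most frequent letter first.
    return counts.most_common()


# Hypothetical three-word sample standing in for the 5000-word file.
sample_words = ["اب", "بات", "رات"]

for letter, count in letter_frequencies(sample_words):
    print(letter, count)
```

The sorted list of pairs corresponds to the descending frequency table that the paper builds in Excel; writing it to a spreadsheet or CSV file is left out of the sketch.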
Fig. 2: The frequency table of the top 40 most frequently occurring digraphs
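The digraph count described in Section 3.2 can be sketched in Python as below. It is a hedged illustration rather than the authors' code: the name `digraph_frequencies` and the flag `first_pair_only` are assumptions introduced here. The flag exists because the text speaks of all digraph combinations in a word, while the pseudocode pairs only the first and second characters of each word; the sketch supports both readings.

```python
from collections import Counter


def digraph_frequencies(words, first_pair_only=False):
    """Count two-letter sequences (digraphs) in a list of words.

    first_pair_only=True mirrors the pseudocode, which pairs only the
    first and second characters of each word; the default counts every
    adjacent pair of characters in the word.
    """
    counts = Counter()
    for word in words:
        if first_pair_only:
            pairs = [word[:2]] if len(word) >= 2 else []
        else:
            pairs = [word[i:i + 2] for i in range(len(word) - 1)]
        counts.update(pairs)
    return counts


# The paper's own English example: 'a' followed by 'l' in "alphabet".
print(digraph_frequencies(["alphabet"])["al"])
print(digraph_frequencies(["alphabet", "latitude"], first_pair_only=True))
```

Calling `most_common(40)` on the returned counter would give a ranking comparable in form to the table in Fig. 2.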
Algorithm 2 – Extracting Digraph Frequencies
Begin
  Let uWord be a String
  Set uWord ← “”
  Let charArray be a dynamic array with a first (string) column and a second (frequency counter) column
  While (uWord not Null)
    Set uWord ← word from the file and move to next word
    Let uDigraph be a string
    Set uDigraph ← uWord[firstCharacter] + uWord[secondCharacter]
    Match uDigraph in charArray
    If uDigraph not matched then
      Add new row with uDigraph in the charArray string column and its frequency counter set to 1
    Else
      Increment the frequency counter of the matched element in charArray
    End If
  End While (uWord not Null)
  Save charArray into Excel file with frequencies
End Algorithm

1 https://sites.google.com/site/mailtoiftikhar/home/urduresources

3.3 Sum of Digraph Frequency and Reverse Digraph Frequency

Dvorak treated a digraph as a pair of letters typed in a particular order. For example, EY and YE consist of the same letters, but he considered them two different digraphs because of their order (Parkinson, 2000). However, research suggests that digraphs should not be distinguished by the order in which the alphabets occur in the word (Parkinson, 2000). Ignoring the order yields the same layout for a digraph and its reverse, so more words could be typed efficiently with the same layout. In the earlier example of the word “alphabet”, it does not matter whether ‘a’ comes first or second. The word “alphabet” (which begins with ‘al’) and the word “latitude” (which begins with ‘la’) therefore contain the same pair of characters and should be counted under the same sequence (Parkinson, 2000). Such digraphs are referred to as reverse digraphs. The reverse digraph technique was also applied to the Urdu digraphs and their frequencies. The previously found digraphs were traversed to find their reverses, and the frequency of each digraph was added to the frequency of its reverse, if it existed, to obtain the reverse digraph frequency. A table of the summed frequencies of digraphs and their reverse-order digraphs was created, and this data was used to create graphs in descending order of frequency sum. “ﺑﺞ” again topped the list, with 283 occurrences. As various digraphs were combined with their reverses, the total number of digraphs fell from 298 to 263. This means that, for 35 digraphs, the reversed sequence of the first two alphabets was also present. A comparison of the first 40 digraphs in Fig. 2 with the reverse digraphs in Fig. 3 reveals a few changes in the reverse digraph ranking.
For example, in Fig. 2, the digraph ‘Wawo Faay’ (ﺅﻑ) comes before the digraph ‘Rey Yeh’ (ﺭﺉ) because of its higher frequency. In the reverse digraph table (Fig. 3), ‘Faay Wawo’ comes after ‘Rey Yeh’ because the frequency of ‘Rey Yeh’ is now higher. In other words, the sum of ‘Rey Yeh’ and ‘Yeh Rey’ is higher than the sum of ‘Wawo Faay’ and ‘Faay Wawo’. The situation is similar for some of the other digraphs. Since 35 digraphs were merged with their reverses, a keyboard layout based on reverse digraphs will differ from a layout based on simple digraph frequencies.
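The merging of a digraph with its reverse, as described in Section 3.3, can be sketched in Python as follows. This is an assumption-laden illustration: the function name `merge_reverse_digraphs` and the counts in `example_counts` are hypothetical, and the sketch keeps whichever order was seen first as the label of the merged pair.

```python
def merge_reverse_digraphs(digraph_counts):
    """Add the count of each digraph to that of its reverse-order twin."""
    merged = {}
    for pair, count in digraph_counts.items():
        reverse = pair[::-1]
        # Reuse the existing row if the reversed pair was already stored.
        key = reverse if reverse in merged else pair
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical counts: 'al' and 'la' collapse into a single entry.
example_counts = {"al": 4, "la": 2, "ph": 3}
merged = merge_reverse_digraphs(example_counts)

# Rank by summed frequency, highest first, as in the paper's tables.
ranking = sorted(merged.items(), key=lambda item: item[1], reverse=True)
print(ranking)
print(len(example_counts) - len(merged))  # number of digraphs merged away
```

The difference in table sizes before and after merging is the same kind of figure as the paper's reduction from 298 to 263 digraphs (35 merged).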
Fig. 3: The frequency table of the top 40 most frequently occurring digraphs combined with their reverses
Algorithm 3 – Extracting Reverse Digraph Frequencies
Begin
  Let uWord be a String
  Set uWord ← “”
  Let charArray be a dynamic array with a first (string) column and a second (frequency counter) column
  While (uWord not Null)
    Set uWord ← word from the file and move to next word
    Let uDigraph be a string
    Set uDigraph ← uWord[firstCharacter] + uWord[secondCharacter]
    Match uDigraph or reverse of uDigraph in charArray
    If neither uDigraph nor its reverse matched then
      Add new row with uDigraph in the charArray string column and its frequency counter set to 1
    Else
      Increment the frequency counter of the matched element in charArray
    End If
  End While (uWord not Null)
  Save charArray into Excel file with frequencies
End Algorithm

Conclusions, Limitations and Future Research

This paper analyzed some of the propositions for an efficient Urdu keyboard layout design. Some of the popular basic steps for the design of an efficient Urdu keyboard were taken. The resulting resources can now be used to design a layout, just as has been done for English language designs. The collected material was based on the 5000 most frequently occurring unique Urdu words from CRULP. However, it remains to be determined which of the designs based on the above-mentioned resources will perform best in terms of speed, reduction of errors and elimination of fatigue. The performance of these multiple designs needs to be compared in future to select the best layout.
One of the limitations of this research is that it analyzed only 5000 unique words, which form a very minor part of the Urdu language. Since this research ignored the large dictionary of Urdu words, such words can be extracted from other sources and compared with the results of this paper. Related material could also be extracted from sources such as a dictionary of unique Urdu words. The frequencies of letters, digraphs and reverse-order digraphs can then be obtained from them for use in the design of an efficient keyboard layout. Again, performance can be compared using the methods proposed in this research to arrive at a keyboard layout for the Urdu language.

REFERENCES:

Norman, D. A. and D. Fisher, (1981) "Why Alphabetic Keyboards Are Not Easy to Use: Keyboard Layout Doesn't Much Matter", Human Factors, vol. 24, 509-515.

Matias, E., I. S. MacKenzie, and W. Buxton, (1996) "One-Handed Touch-Typing on a," Human-Computer Interaction, vol. 2, no. 1, 1-27.

Harbaugh, G. B., (1996) "Computer Keyboard Layout," Patent No. 5584588.

Sarfraz, H., A. Dilawari, and S. Hussain, (2011) "Assessing Urdu Language Support on the Multilingual Web".

Anon., (2013, Apr.) Wikipedia. [Online]. http://en.wikipedia.org/wiki/Urdu, accessed 10th December 2014.

Ichbiah, J. D., (1996) "Method For Designing An Ergonomic One-Finger Keyboard And Apparatus Therefor," US Patent No. 5487616, Jan. 30.

Parkinson, J. V., (2000) "User-Friendly And Efficient Keyboard," Patent No. 6 053 647.

West, L. J., (1998) "The Standard and Dvorak Keyboards Revisited:," Research in Economics, 98-105.

Jasmin, K. and D. Casasanto, (2012) "The QWERTY Effect: How typing shapes the meanings of words," Psychonomic Bulletin & Review, vol. 19, no. 3, 499-504.

Wagner, M. O., B. Yannou, S. Kehl, D. Feillet, and J. Eggers, (2003) "Ergonomic Modelling and Optimization of the Keyboard," Journal of Engineering Design, vol. 14, no. 2, 187-208.

Buzing, P., (2003) "Comparing Different Keyboard Layouts: Aspects of Qwerty, Dvorak and alphabetical keyboards," Delft University of Technology articles.

Hussain, S., (2008) "Resources for Urdu Language Processing," in The 6th Workshop on Asian Language Resources, Hyderabad India, 99-100.