Script Indices - Semantic Scholar

Script Indices

Richard Sproat (University of Illinois at Urbana-Champaign) and Prakash Padakannaya (University of Mysore)

Abstract

Sproat (2000, 2003) proposed a formal model of layout in writing systems wherein graphic elements are conjoined by two-dimensional catenation operators. It has been shown in previous work (e.g. Prakash et al. 1993, Vaid & Gupta 2002) that issues such as layout and diacritization of symbols have relevance for segmental awareness and online reading among readers of Indian writing systems. In this paper we develop a script index based on properties of the layout and illustrate its application with examples from Kannada and Devanagari. We discuss the relevance of this index to readers’ performance on metaphonological awareness tasks.

1 Introduction

When letters are combined into words in English or other languages that use the Latin alphabet, the basic rule for doing this combination is simple: add the next letter to the right of the previous letter. More formally put, in English one (con)catenates letters together in a left-to-right fashion. Other scripts, such as Hebrew, switch the direction to right-to-left, but overall the process is equally simple. This simple method of combination found in Latin-based and other canonical alphabetic scripts is in stark contrast to writing systems, such as those of India, where glyphs are combined in a variety of manners. Thus, in the Devanagari script, for example, while consonants are typically arranged left-to-right, post-consonantal vowels may be written after, above, below or even before the consonant(s) they logically follow.

2 Script Layout

In previous work (Sproat 2000; Sproat 2003) a model of script layout was developed wherein the various manners of combining glyphs were expressed in terms of directional catenation operators. In particular, five types of catenation were countenanced, as shown in Figure 1. The semantics of the different catenation operators is shown schematically in Figure 2. In Figure 3 we illustrate each of the catenators with examples from Chinese.

• leftwards catenation: $\overset{\rightarrow}{\cdot}$
• rightwards catenation: $\overset{\leftarrow}{\cdot}$
• upwards catenation: $\overset{\uparrow}{\cdot}$
• downwards catenation: $\overset{\downarrow}{\cdot}$
• surrounding catenation: $\odot$

Figure 1: Catenation operators, after Sproat 2000.

Figure 2: Schematic illustration of script layout catenators.

Figure 3: Illustration of the layout catenators in Chinese. Approximately 95% of all Chinese characters are composed of a so-called semantic radical, shown in green in the above, and a so-called phonetic radical, shown in red. The semantic radical often gives a rough semantic sense of the character; the phonetic component gives a hint as to the pronunciation. As shown above, the semantic and phonetic components of characters can be combined using any of the five catenator types defined previously. Usually, the particular catenator used is determined by the semantic radical.

Sproat (2003) developed the system further for Brahmi-derived scripts of India, and provided formal analysis of the Devanagari, Oriya, Kannada, Malayalam and Tamil scripts. Among the important properties of these scripts, as described in the literature on writing systems (e.g. Daniels & Bright 1996) are the following:

• Glyphs are arranged into orthographic syllables termed akṣara. An akṣara begins at the left edge of a consonant sequence and continues until the end of the vowel of the syllable, possibly including a postvocalic nasal. Thus a sequence VCCCVCCVC will be divided into V.CCCV.CCV.C, with the first vowel being in an akṣara by itself, and subsequent consonant plus vowel combinations forming their own akṣara. It is important to note that the orthographic syllables do not in general correspond to phonological syllables.

• Within an akṣara, consonants are combined using various catenators depending upon the script. Similarly, vowels are written around the consonantal core using various catenators. Figures 4–5 show the layout properties of various glyphs in different scripts. Full vowel forms are used akṣara-initially; otherwise reduced (diacritic) forms are used. (In Hindi, full vowel forms can appear at noninitial word position too.) Similarly, consonants when combined into the consonant sequences in akṣara may include various kinds of reductions from their full “standalone” variant. Finally, varying degrees of fusion may be seen with different glyph combinations.

Figure 4: Full and diacritic forms for Devanagari vowels, classified by catenator inherent to the diacritic forms.

• In many Brahmi-derived scripts, some vowels are written as complexes of other more basic symbols. Often these more basic symbols are themselves indicators of other vowels. This is illustrated for Kannada in Figure 5. Thus “o” in Kannada is written as a combination of the glyphs for “e” and “uu.” Each of these combines with the consonants within the akṣara using the catenators that the independent symbol would normally have. Thus the “e” portion is written above and ligatured to the consonant sequence, the “uu” portion after and ligatured to the consonant sequence. In the other complex vowel cases listed in Figure 5, special symbols (“LONG”, “ai”) are used that are not used except in complex vowels.

• Akṣaras always have an inherent vowel, usually a schwa-like vowel, which is the vowel sound that is understood if no explicit vowel mark is used.

• Tamil differs from other Brahmi-derived scripts in that consonant sequences have become linearized. A consequence of this is that vowel symbols are written around the last consonant in a prevocalic sequence. So, if a vowel mark uses rightwards catenation and thus is written before what it logically follows, in Tamil it will show up before the last consonant of the sequence. This contrasts with the situation in other scripts where such a vowel mark would show up before the first consonant of a sequence. Similarly in Kannada, where consonant sequences tend to be stacked using the downwards catenator, some vowels, such as the “ulke” diacritic for /e/, are ligatured above the first consonant in the sequence; this again is in contrast to the situation in Tamil. In Sproat 2003, it is shown that this difference between Tamil and other scripts follows from the theory of Sproat 2000 as a consequence of Tamil having lost the traditional akṣara.
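The division of a consonant/vowel sequence into orthographic syllables described above (VCCCVCCVC divided into V.CCCV.CCV.C) can be illustrated with a small sketch. It operates on an abstract skeleton string rather than on real text, and it makes assumptions that are not part of the original analysis: the symbols "C", "V" and "N" (for a postvocalic nasal that has already been identified as such) and the function name split_aksaras are hypothetical.

```python
import re
from typing import List

# One orthographic syllable: an optional consonant cluster, a vowel and an
# optional postvocalic nasal; or a consonant run that no vowel follows.
_AKSARA = re.compile(r"C*VN?|C+")
# A skeleton is well formed if it is entirely covered by such syllables.
_WELL_FORMED = re.compile(r"(?:C*VN?|C+)*")


def split_aksaras(skeleton: str) -> List[str]:
    """Split a skeleton over the alphabet {C, V, N} into orthographic syllables.

    Assumption: 'N' marks a nasal already known to be postvocalic, so it may
    only appear directly after a 'V'.
    """
    if not _WELL_FORMED.fullmatch(skeleton):
        raise ValueError(f"not a valid C/V/N skeleton: {skeleton!r}")
    return _AKSARA.findall(skeleton)


if __name__ == "__main__":
    print(".".join(split_aksaras("VCCCVCCVC")))
```

Because the pattern prefers the longest consonant cluster before each vowel, a consonant run is always attached to the following vowel, which is what distinguishes this orthographic division from a phonological syllabification.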

Figure 5: Forms for diacritic Kannada vowels, classified by catenator. Complex vowels are given at the bottom of the table.

The above discussion motivates a set of features or questions that can be asked of a phonological element in Indian languages, vis-à-vis its graphic expression. In particular, for a given phoneme φ and corresponding glyph α (possibly null) one can ask the following questions. Possible answers are given in parentheses:

• Is φ expressed at all? (Yes/No)
• What catenator is used to combine α with its logical predecessor? (One of four possibilities, since surrounding catenation, $\odot$, is not used in Indian scripts.)
• Is α reduced in form relative to its standalone variant? (Yes/No)
• Are α and an adjacent glyph fused? (Yes/No)
• Is α a complex glyph as with the Kannada vowels from Figure 5? (Yes/No)

Finally, recognizing that in some Indic writing systems (e.g. Tamil) the spelling is less than fully phonemic (i.e. in Tamil, voiced/voiceless distinctions are not made in the script though they are significant in the language), one can add a feature of transparency:

• Is φ transparently encoded?
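The questions above amount to a small record with one slot per question. The following is a minimal sketch of such a record, not the authors' implementation: the names Catenator and GlyphFeatures are hypothetical, and the validation rule (an unexpressed phoneme carries no layout features) is an assumption made here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Catenator(Enum):
    """The four catenators available in Indian scripts (surrounding excluded)."""
    LEFT = "left"
    RIGHT = "right"
    UP = "up"
    DOWN = "down"


@dataclass(frozen=True)
class GlyphFeatures:
    """Answers to the questions above for one phoneme and its glyph."""
    expressed: bool
    catenator: Optional[Catenator] = None
    reduced: Optional[bool] = None
    fused: Optional[bool] = None
    is_complex: Optional[bool] = None
    transparent: Optional[bool] = None

    def __post_init__(self) -> None:
        layout = (self.catenator, self.reduced, self.fused, self.is_complex)
        # Assumption: layout questions only make sense for an expressed phoneme.
        if self.expressed and any(value is None for value in layout):
            raise ValueError("an expressed phoneme needs every layout feature")
        if not self.expressed and any(value is not None for value in layout):
            raise ValueError("an unexpressed phoneme has no layout features")


# Values listed for Devanagari /e/ in Figure 11.
devanagari_e = GlyphFeatures(
    expressed=True,
    catenator=Catenator.UP,
    reduced=True,
    fused=True,
    is_complex=False,
    transparent=True,
)
```

Using None for "not applicable" keeps the yes/no answers distinct from the case where a question cannot be asked at all.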

3 Script Indices

Based on the above features, we propose a vector of feature values that we will term a Script Index, which can be applied to any phoneme and its graphical expression (including allographs). The purpose of this index is to correlate features of the graphical expression of phonemes with readers’ performance on phoneme awareness tasks in psycholinguistic experiments. It has been shown in previous work (Prakash et al. 1993, Padakannaya 1999, Padakannaya 2000, Padakannaya 2001, Vaid & Gupta 2002) that issues of layout (catenation direction) and degree of fusion or diacriticity have a measurable effect on readers’ abilities to show awareness of the phoneme corresponding to the glyph in question. A striking example of this is the differential awareness of postvocalic nasals, written with the anusvara symbol, in Hindi (Devanagari script) versus Kannada (Prakash et al. 1993). Hindi speakers have great difficulty in treating the postvocalic nasal as a separate segment, whereas this is relatively easy for Kannada speakers. This correlates with the fact that in Devanagari the anusvara is written as a small diacritic dot above the right-hand edge of the akṣara, whereas in Kannada script the anusvara is a large circle written inline with the main consonant of the akṣara. See Figure 6.

Figure 6: Anusvara in Devanagari (left) and Kannada (right) for the word /pensil/. In Devanagari the anusvara is the small dot above the initial element in the word. In Kannada, the anusvara is the large circle in the second position. Note that the actual word “pencil” in Kannada is not written using anusvara; the spelling shown is merely a theoretically possible spelling for illustrative purposes.

We illustrate Script Indices by means of the examples in Figures 7–9. Each phoneme is associated with a vector, where each element of the vector corresponds to an answer to one of the questions outlined above. Thus each vector will have the form in Figure 10.

Consider Figure 7, which shows the graphical expression of /peMsil/ (here /M/ represents a postvocalic nasal) in (Hindi) Devanagari. The phonemes /peMsil/ map to a formal expression of the graphical layout represented by the middle line in the figure. In this expression, the akṣaras are delimited with square brackets. The spellout of each phoneme is expressed, following Sproat 2000, in the form γ(φ) where φ is the phoneme and γ represents a relation that maps phonemes to their graphical expression. A special version of γ, namely γ_ρ, represents a reduced form. Within each akṣara, the catenators used to combine the elements are explicitly given, with the exception of the logically first element of the akṣara, which is always arranged using the script’s macroscopic catenator (Sproat 2000), which is left catenation in Indic scripts. Not explicitly shown here is whether the elements are fused with other elements, or whether they are phonologically transparent. Figure 7 also shows how the formal layout is actually expressed in the form /peMsil/, with arrows showing the correspondence between the formal elements and the glyphs. In Figure 11 we list the Script Index vector for each phoneme. Note that /p/ is marked as fused since it is fused with the logically following /e/. Also, in Devanagari most glyphs are joined together with the headbar: we do not consider the headbar to constitute fusion.

Figure 7: Layout details for /pensil/ in Devanagari.

Turning now to Figure 8, we see a similar diagram for the Kannada word /eešvaikya/. In this case, one of the vowels, /ai/, is expressed as a complex symbol consisting of /e/ and a symbol that is only used with /ai/ (and therefore glossed here as /ai/). In the formal expression of this mapping, this complex symbol is expressed as a set in curly braces, with the individual components being elements of the set. The inherent vowel is not expressed graphically, and this is represented in the formal representation with the expression γ (a), and in the graphical representation with the “kill” symbol of a skull-and-crossbones. The Script Index values for each phoneme in this case are as in Figure 12. Note that if a phoneme is not expressed, none of the other features are relevant; this is why the values for all features except expr are 0 for /a/. One component of the expression of /ai/ is written to the right of the expression of /v/ rather than below it, as one might expect from its use of the downwards catenator. Sproat 2003 argues for a rule of Kannada linearization whereby all but the first of a sequence of downwards catenators are converted to left catenators. Thus the first glyph in a sequence of glyphs that use downwards catenation is written below the glyph it logically follows, then all subsequent glyphs in the sequence are written to the right. Finally, notice that the features cat, red and fused are ordered sets when compl has the value yes.

Figure 8: Layout details for /eešvaikya/ in Kannada.

Figure 9 depicts the graphic expression of Kannada /lakšmiiša/. The Script Indices for the phonemes in this case are as in Figure 13.

Figure 9: Layout details for /lakṣmiiṣa/ in Kannada.
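The Kannada linearization rule attributed to Sproat 2003 above (all but the first of a sequence of downwards catenators become left catenators) can be sketched as a single pass over a list of catenator names. This is an illustrative reading of the rule as stated in the text, not code from that work; the function name linearize_down and the plain string labels are assumptions.

```python
from typing import List


def linearize_down(catenators: List[str]) -> List[str]:
    """Keep the first 'down' in each consecutive run and turn the rest into 'left'."""
    result = []
    in_down_run = False  # True while the previous input catenator was 'down'
    for cat in catenators:
        if cat == "down":
            # The first glyph of a run stays below; later ones move to the right.
            result.append("left" if in_down_run else "down")
            in_down_run = True
        else:
            result.append(cat)
            in_down_run = False
    return result


if __name__ == "__main__":
    print(linearize_down(["left", "down", "down", "up"]))
```

Runs are determined on the input sequence, so a later, separate run of downwards catenators again keeps its first member as "down".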

4 Script Indices and Phonological Awareness: Preliminary Evidence

Previous research mentioned earlier showed that issues of layout (catenation direction) and degree of fusion in Indian alphasyllabaries have a significant effect on one’s performance on metaphonological tasks. In all these studies, phonemic tests had corresponding syllabic tests that controlled for the general task demands. In general, participants performed consistently better on syllabic tests than on the corresponding phonemic tests. However, qualitative analysis of performance on various phonemic tasks reveals the following. In the phoneme recognition test, performance is better when the target is the consonant part of an akṣara (orthographic syllable) rather than a vowel. Phoneme oddity test results also demonstrate an advantage of consonant targets over vowel targets. Children who find phonemic tasks very difficult are almost 100% successful in deleting phonemes (e.g., arka /r/, and anusvara “nasal postvocalic sounds”) that have independent glyphs written inline with other akṣaras. The second consonant in a sequence that has a reduced but not fused glyph also posed no problem in the deletion task. On the other hand, the success rate in deleting consonants with full but fused forms is very low. In addition, nonlinearity of some of the features (e.g., arka in Kannada) also poses problems in such tasks as akṣara deletion. Thus, catenations, their explicitness, independence and linearity are the factors that determine one’s success in metaphonological manipulations of the glyphs in question. The overall performance of individuals in metaphonological tasks should be a function of these factors. Our aim in future work is to arrive at a script index based on the total vector values of individual characters and to compare the performance on metaphonological tests with script indices in the Hindi (Devanagari), Kannada, Malayalam, Oriya, and Tamil languages.

References

Daniels, Peter and William Bright, editors. 1996. The World’s Writing Systems. Oxford University Press, New York, NY.
Padakannaya, Prakash. 1999. Reading disability and knowledge of orthographic principles. Psychological Studies, 44:59–64.
Padakannaya, Prakash. 2000. Is phonemic awareness an artefact of alphabetic literacy?! Presented at ARMADILLO 11, Texas A&M, October 13–14, 2000.
Padakannaya, Prakash. 2001. Syllabic and phonemic awareness in children acquiring literacy in a semisyllabic script. Department of Psychology, University of Mysore.
Prakash, P., D. Rekha, R. Nigam, and P. Karanth. 1993. Phonological awareness, orthography and literacy. In Robert Scholes, editor, Literacy and Language Analysis. Lawrence Erlbaum Associates, Hillsdale, NJ, pages 55–70.
Sproat, Richard. 2000. A Computational Theory of Writing Systems. ACL Studies in Natural Language Processing. Cambridge University Press, Cambridge.
Sproat, Richard. 2003. A formal computational analysis of Indic scripts. International Symposium on Indic Scripts: Past and Future, Tokyo, December 2003. Available from http://compling.ai.uiuc.edu/rws/newindex/indic.pdf
Vaid, Jyotsna and Ashum Gupta. 2002. Exploring word recognition in a semi-alphabetic script: The case of Devanagari. Brain and Language, 81:679–690.

[expressed, catenator, reduced, fused, complex, transparent]

Figure 10: Feature vector slots.

/p/  [expr=yes, cat=left, red=no, fused=yes, compl=no, trans=yes]
/e/  [expr=yes, cat=up, red=yes, fused=yes, compl=no, trans=yes]
/M/  [expr=yes, cat=up, red=yes, fused=no, compl=no, trans=yes]
/s/  [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]
/i/  [expr=yes, cat=right, red=yes, fused=no, compl=no, trans=yes]
/l/  [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]

Figure 11: Feature vector slots for Devanagari /peMsil/.

/ee/  [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]
/sh/  [expr=yes, cat=left, red=no, fused=yes, compl=no, trans=yes]
/v/   [expr=yes, cat=down, red=yes, fused=no, compl=no, trans=yes]
/ai/  [expr=yes, cat={up,down}, red={yes,yes}, fused={yes,no}, compl=yes, trans=yes]
/k/   [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]
/y/   [expr=yes, cat=down, red=yes, fused=no, compl=no, trans=yes]
/a/   [expr=no, cat=0, red=0, fused=0, compl=0, trans=0]

Figure 12: Feature vector slots for Kannada /eešvaikya/.

/l/   [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]
/a/   [expr=no, cat=0, red=0, fused=0, compl=0, trans=yes]
/k/   [expr=yes, cat=left, red=no, fused=yes, compl=no, trans=yes]
/s/   [expr=yes, cat=down, red=yes, fused=no, compl=no, trans=yes]
/m/   [expr=yes, cat=down, red=yes, fused=no, compl=no, trans=yes]
/ii/  [expr=yes, cat={up,left}, red={yes,yes}, fused={yes,no}, compl=yes, trans=yes]
/s/   [expr=yes, cat=left, red=no, fused=no, compl=no, trans=yes]
/a/   [expr=no, cat=0, red=0, fused=0, compl=0, trans=yes]

Figure 13: Feature vector slots for Kannada /lakṣmiiṣa/.
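For readers who want to process vectors written in the bracket notation of Figures 11–13, the following sketch parses one vector into a dictionary. It is only an illustration of the notation: the function name parse_script_index is hypothetical, values are kept as the strings that appear in the figures (including "0"), and ordered sets in curly braces are returned as lists.

```python
import re
from typing import Dict, List, Union

Value = Union[str, List[str]]

# Split on commas that are not inside a {...} set.
_TOP_LEVEL_COMMA = re.compile(r",\s*(?![^{}]*\})")


def parse_script_index(text: str) -> Dict[str, Value]:
    """Parse '[expr=yes, cat={up,down}, ...]' into a feature dictionary."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("a vector must be enclosed in square brackets")
    fields: Dict[str, Value] = {}
    for item in _TOP_LEVEL_COMMA.split(body[1:-1]):
        key, sep, value = item.strip().partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        value = value.strip()
        if value.startswith("{") and value.endswith("}"):
            # Ordered set: one entry per component of a complex glyph.
            fields[key] = [part.strip() for part in value[1:-1].split(",")]
        else:
            fields[key] = value
    return fields


if __name__ == "__main__":
    print(parse_script_index(
        "[expr=yes, cat={up,down}, red={yes,yes}, fused={yes, no}, compl=yes, trans=yes]"
    ))
```

Keeping the ordered sets as lists preserves the pairing between the catenator, reduction and fusion values of each component of a complex glyph.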
