The Computational Semantics of Characters
Dafydd Gibbon,1 Baden Hughes,2 Thorsten Trippel1
1 Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, 33501 Bielefeld, Germany {gibbon,ttrippel}@uni-bielefeld.de
2 Department of Computer Science and Software Engineering, The University of Melbourne, VIC 3010 Australia
[email protected]
1 Introduction
The semantic properties associated with larger text units, typically morphemes, words, phrases, and sentences, have been treated computationally in innumerable studies. Smaller text units have largely been ignored to date in computational semantics research. We argue in this paper that although the forms of smaller units at the orthographic level have received considerable attention in the Unicode discussion, they have semantic properties which require additional explicit consideration. We present a new approach to the computational semantics of characters, which fills this gap:¹ the orthographic projection of linguistic information, analogous to phonetic interpretation. We consider a number of use cases prior to discussion of three different perspectives. Adopting a holistic view of semantics, we find that there are properties at this lower level which require specification similar to that at better-studied levels, and which can coherently extend computational linguistic models to the domain of orthography.
Word semantic representations, e.g. in the Generative Lexicon approach [6], allow the decomposition of word semantics into specific levels of description, e.g. argument structure, event structure, qualia structure and lexical inheritance structure. Syntactic frameworks, e.g. HPSG [7], apply a similar representation to combine those syntactic and semantic features which are relevant for sentence syntax. A representation in (often underspecified) hierarchical classes seems to be generic enough for the purpose.
In character-to-glyph projection, such principles are generally not applied, on the assumption, perhaps, that character semantics is trivial. The opposite is the case. The semantics of a specific glyph (or an entire font, e.g. Hiragana or Kanji) may include an assignment to a language (e.g. Japanese) or language family, to a specific social group, e.g. phoneticians (the International Phonetic Alphabet, IPA), or to a company which defines a font as part of its corporate identity. Such semantic properties will play a useful role in practical Semantic Web search.
In the computational language documentation paradigm, this kind of explicit description of font and glyph semantics, rather than the ad hoc interpretations given by the average font user, is an absolute necessity. Especially in legacy data, representations of a language in written form will not be accessible if the font information is not interpretable. There are many other contexts where an explicit semantics is required: e.g., grapheme-to-phoneme rules implicitly define language-specific semantics for characters.
¹ Throughout, the term character is used to denote one particular element of a set of symbols which represent words in writing, e.g. the letters of an alphabet rendered in a specific shape.
2 Related Work
The work in this paper extends previously published work on the accessibility of language documentation, where character encoding is often idiosyncratic, to descriptive linguistic models of complementary properties of characters, which can be combined to yield more ‘dense’ semantic descriptions of the characters. Issues concerning the interpretability of fonts in the context of rescuing legacy data on endangered languages have been dealt with by [1], who point out that an undocumented encoding may result in fatal information loss, as the mapping of the electronically encoded glyph, rendered as a character, cannot be reconstructed if the font becomes unavailable. A registered semantic specification of a font, however, would permit reconstruction of an interpretable approximation to the original font. Unicode may simplify this process, but the semantic interpretability problem remains because of the use of ad hoc mixtures of characters for a given language from completely different regions of Unicode space.
In [2] and [3], a feature structure representation and an XML-encoded description of characters, based on [5], are proposed and illustrated by examples from the IPA [4]. In contrast, representations in the present study are based on linguistic properties of a character (syntactic, semantic and stylistic): the present study outlines a strategy for embedding these properties in a formal model of character representation.
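The legacy-font scenario described above can be illustrated with a minimal Python sketch. It assumes a hypothetical registered specification (`LEGACY_FONT_SPEC`) that maps the byte codes of an undocumented font to standard Unicode character names; the byte values and the choice of characters are invented for illustration and are not taken from [1]. Only the standard-library function `unicodedata.lookup` is used.

```python
import unicodedata

# Hypothetical registered specification for an undocumented legacy font:
# each non-ASCII byte code is described by a standard Unicode character name.
# The byte values are invented for this example.
LEGACY_FONT_SPEC = {
    0x8A: "LATIN SMALL LETTER OPEN O",
    0x8B: "LATIN SMALL LETTER OPEN E",
}


def reconstruct(legacy_bytes: bytes, spec: dict) -> str:
    """Rebuild an interpretable approximation of legacy-encoded text."""
    result = []
    for code in legacy_bytes:
        if code in spec:
            # The registered description tells us which character was meant.
            result.append(unicodedata.lookup(spec[code]))
        elif code < 0x80:
            # Plain ASCII is assumed to be unchanged in the legacy font.
            result.append(chr(code))
        else:
            # Undocumented code: the information is lost, so mark the gap.
            result.append("\ufffd")
    return "".join(result)
```

With such a specification, `reconstruct(b"k\x8a", LEGACY_FONT_SPEC)` yields the letter k followed by U+0254, while any byte that lacks a registered description is replaced by U+FFFD, which makes the information loss visible rather than silent.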
3 Representational Models for Glyphs
Specifically, we propose a model for glyph representation with three major dimensions, representing major semiotic attributes, and integrated by a combinatorial operation which will be discussed below. The three dimensions in character property space are:
1. Functional: conventional object semantics of characters, their “meaning” in context, e.g. in specific languages, phonetics, logic, companies, ...
2. Formal: combinatorial properties of the parts of characters, e.g. ascenders, descenders, serifs, diacritics, ...
3. Physical: mapping of the combinatorial definition into spatial coordinates.
Traditional font specifications are restricted to a formal perspective, generally involving a hybrid specification based on the attributes Formal and Physical. Different approaches to glyph description may be found:
1. Representation as numerically coded physical entities, i.e. bitmaps, as in TrueType fonts, which are combined to form a picture mapping.
2. Vector representation of each character, which is the basis of the Metafont system used by LaTeX, with the advantage of independence from rendering size, permitting high-quality scaling, and yielding higher printing quality. However, the actual rendering is left to interpretation.
3. Combinatorial representation, much like instructions for moving a plotter or pen on paper, describing characters compositionally in terms of ascenders, descenders, horizontal directions, serifs, etc., and combinatory operations for these.
Good illustrations of character semantics are found in the well-defined formal notations of mathematics, including formal logic and indeed the numerical systems of natural languages. However, the perception that this concerns character semantics is obscured by the fact that such characters are logographic: the characters can be interpreted as representing words. The figure ‘7’ is a common logographic character, and the number ‘1983’ is a common semantically compositional logographic character sequence. This property is also well known in connection with the Chinese writing system: the semantics provided for such characters is simultaneously a semantics of words.
With alphabetic characters, the situation is comparable, except that the meanings are not simultaneously word meanings: a clear example is the standard ASCII alphabet, in which characters identified by code can be assigned a “syntax” or form, and a dual semantics, in the domain of English spelling (or other functional domains) and in the domain of teletype glyphs and layout instructions. In linguistics, the IPA is also comparable: a character not only has formal properties of name, code number or glyph definition, but also semantic properties which define sound types (whether phoneme or phone), as well as the classic glyph shapes of the International Phonetic Alphabet. Thus, there is a specified character which is interpreted on the one hand as a p glyph, and on the other as a bilabial voiceless plosive. In fact, the well-known tables of the IPA effectively define a bijective function directly from the “phonetic semantics” to the “glyph semantics”.
In our approach, then, a character as a formal object has two “semantic” interpretations: the mapping to its linguistic meaning, and the mapping to its glyph shape, in the same way as a word has an object semantic interpretation as well as a phonetic (or orthographic) interpretation.
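The bijection between “phonetic semantics” and “glyph semantics” described above for the IPA can be sketched in Python for a small excerpt of the consonant chart. The class and function names (`Phone`, `IPA_PLOSIVES`, `describe`, `is_bijective`) are illustrative assumptions and do not reproduce the feature structure encoding of [2] or [3]; only the six plosives listed are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phone:
    """Phonetic semantics of a character (illustrative attribute names)."""
    place: str
    voicing: str
    manner: str


# Excerpt of the IPA consonant chart: phonetic semantics -> glyph.
IPA_PLOSIVES = {
    Phone("bilabial", "voiceless", "plosive"): "p",
    Phone("bilabial", "voiced", "plosive"): "b",
    Phone("alveolar", "voiceless", "plosive"): "t",
    Phone("alveolar", "voiced", "plosive"): "d",
    Phone("velar", "voiceless", "plosive"): "k",
    Phone("velar", "voiced", "plosive"): "\u0261",  # IPA script g
}

# Because the table is one-to-one, it can be inverted without loss.
GLYPH_TO_PHONE = {glyph: phone for phone, glyph in IPA_PLOSIVES.items()}


def is_bijective(table: dict) -> bool:
    """A finite mapping is one-to-one if no value occurs twice."""
    return len(set(table.values())) == len(table)


def describe(glyph: str) -> str:
    """Return the linguistic interpretation of a glyph."""
    phone = GLYPH_TO_PHONE[glyph]
    return f"{phone.place} {phone.voicing} {phone.manner}"
```

Here `describe("p")` returns the string "bilabial voiceless plosive", and `is_bijective(IPA_PLOSIVES)` confirms that, within this excerpt, each sound type has exactly one glyph and each glyph exactly one sound type.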
4 Integrating Models
The three dimensions (Functional, Formal, and Physical) can be combined into a unitary semantic character representation. Each dimension has a model with the usual form $\langle \mathit{Interpretation}, \mathit{Domain} \rangle$, with some derivative cases based on dependencies between the domains. Alternative approaches can be considered:
1. $\mathit{model}_{func} = \langle \mathit{func}(\mathit{char}_{code}), \mathit{GlyphProperties} \rangle$, that is, numeric representations of the character are mapped to the instructions needed to create the glyph in a linear process.
2. $\mathit{model}_{form} = \langle \mathit{func}(\mathit{char}_{glyphprops}), \mathit{FormalDefinitions} \rangle$, that is, the properties of a glyph are mapped into vectors or features.
3. $\mathit{model}_{phys} = \langle \mathit{func}(\mathit{char}_{formaldefs}), \mathit{BitmapCoordinates} \rangle$, that is, the formal definition of a character is mapped directly to a picture, as a bitmap.
The “combination” operation can be made explicit as function composition:
$\mathit{char}_{bitmap} = \mathit{phys}(\mathit{form}(\mathit{func}(\mathit{char}_{code})))$,
where the character is compositionally interpreted on the basis of the Physical, Formal and Functional mapping of the numeric character representation. Other combination concepts can be considered. A TrueType font has a simpler composition function:
$\mathit{char}_{bitmap} = \mathit{phys}(\mathit{func}(\mathit{char}_{code}))$,
where the formal combinatorial or componential level is unspecified and (except for the serif/nonserif distinction and arguably the basic italic and bold face, etc. character “styles”) the spatial rendering of characters is left to low-level printer and screen drivers.
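The composition of the three mappings can be made concrete with a toy Python sketch. Everything in it is an assumption made for illustration: the two-character “font” (`FUNC_TABLE`, `FORM_TABLE`), the property names, the restriction to vertical strokes and the 3×5 grid are invented, and only the order of composition, phys after form after func, follows the text.

```python
from typing import Callable, List, Tuple

Stroke = Tuple[int, int, int]  # (column, top row, bottom row), inclusive

# Functional level: character code -> named glyph properties (toy data).
FUNC_TABLE = {
    0x6C: ("ascender_stem",),      # l
    0x69: ("short_stem", "dot"),   # i
}

# Formal level: glyph property -> combinatorial stroke definition (toy data).
FORM_TABLE = {
    "ascender_stem": (1, 0, 4),
    "short_stem": (1, 2, 4),
    "dot": (1, 0, 0),
}


def func(char_code: int) -> Tuple[str, ...]:
    """Map a numeric character code to its glyph properties."""
    return FUNC_TABLE[char_code]


def form(glyph_props: Tuple[str, ...]) -> List[Stroke]:
    """Map glyph properties to formal stroke definitions."""
    return [FORM_TABLE[prop] for prop in glyph_props]


def phys(strokes: List[Stroke], width: int = 3, height: int = 5) -> List[str]:
    """Map formal stroke definitions to bitmap coordinates."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for column, top, bottom in strokes:
        for row in range(top, bottom + 1):
            grid[row][column] = "#"
    return ["".join(row) for row in grid]


def compose(*stages: Callable) -> Callable:
    """Chain stages left to right: compose(f, g, h)(x) == h(g(f(x)))."""
    def pipeline(value):
        for stage in stages:
            value = stage(value)
        return value
    return pipeline


# char_bitmap = phys(form(func(char_code)))
char_bitmap = compose(func, form, phys)
```

Calling `char_bitmap(0x69)` passes the code for i through all three levels and returns five rows of a bitmap with a dot in the top row, an empty second row and a stem in the remaining three. A font without an explicit formal level would, in this sketch, correspond to replacing `form` and `phys` by a single table lookup from glyph properties to a stored picture.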
5 Future Work
We have outlined a new framework for a formal computational semantics of the orthographic level, covering Formal or “syntactic” combinatorial properties, Functional or “object semantic” properties, and Physical or “layout” properties of characters, i.e. mappings to linguistic properties on the one hand, and to glyph sets or fonts on the other. Previous work has used attribute-value logics, but other standard semantic frameworks can potentially be used in this domain, as in other linguistic domains, e.g. word and sentence semantics. Several solutions to problems in this field have been presented in previous work, but open questions remain. In particular, the “combination” operation requires further attention, as does application to specific use cases along the lines of previous studies of the IPA use case, e.g. to language-specific or institution-specific font families. Applications of this approach to Semantic Web tagging and search, mentioned above, are in progress in domains of user group identification.
References
[1] D. Gibbon, C. Bow, S. Bird, and B. Hughes. Securing Interpretability: The Case of Ega Language Documentation. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1369–1372. European Language Resources Association: Paris, 2004.
[2] D. Gibbon, B. Hughes, and T. Trippel. Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery. In From Data and Information Analysis to Knowledge Engineering, volume 30 of Studies in Classification, Data Analysis and Knowledge Organisation, pages 366–373. Springer-Verlag: Berlin, 2006.
[3] B. Hughes, D. Gibbon, and T. Trippel. Feature-based Encoding and Querying Language Resources with Character Semantics. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 939–944. European Language Resources Association: Paris, 2006.
[4] International Phonetic Association. Handbook of the International Phonetic Association, 1999.
[5] ISO 24610-1. Language resources management - Feature Structures - Part 1: Feature Structure representation, ISO 24610-1, 2006.
[6] J. Pustejovsky. The Generative Lexicon. MIT Press, Cambridge (Ma), London, 1995.
[7] I. Sag and T. Wasow. Syntactic Theory: A Formal Introduction. Center for the Study of Language and Information: Stanford, 1999.