IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 12, NO. 4, DECEMBER 2004
Text Generation From Taiwanese Sign Language Using a PST-Based Language Model for Augmentative Communication

Chung-Hsien Wu, Senior Member, IEEE, Yu-Hsien Chiu, and Chi-Shiang Guo
Abstract—This paper proposes a novel approach to the generation of Chinese sentences from ill-formed Taiwanese Sign Language (TSL) for people with hearing impairments. First, a sign icon-based virtual keyboard is constructed to provide a visualized interface for retrieving sign icons from a sign database. A proposed language model (LM), based on a predictive sentence template (PST) tree, integrates a statistical variable n-gram LM and linguistic constraints to deal with the translation problem from ill-formed sign sequences to grammatical written sentences. The PST tree, trained on a corpus collected from deaf schools, is used to model the correspondence between signed and written Chinese. In addition, a set of phrase formation rules, based on trigger pair categories, was derived for sentence pattern expansion. These approaches improve the efficiency of text generation and the accuracy of word prediction and, therefore, improve the input rate. For the assessment of practical communication aids, a reading-comprehension training program with ten profoundly deaf students was undertaken in a deaf school in Tainan, Taiwan. Evaluation results show that the literacy aptitude test scores and the subjective satisfactory level were significantly improved.

Index Terms—Hearing aids, natural language, sign language, text processing.

Manuscript received August 28, 2001; revised August 20, 2003. This work was supported by the National Science Council, Taiwan, R.O.C., under Contract NSC89-2614-H-006-003-F20. The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. (e-mail: [email protected]). Digital Object Identifier 10.1109/TNSRE.2003.819930
I. INTRODUCTION
People with hearing impairments usually have several language learning problems because they cannot receive feedback from their own voices. Augmentative and alternative communication (AAC) technology aims to let disabled users generate comprehensive information output with minimal alternative input effort, greatly improving expressive communication abilities [1], [2]. In the past years, many communication-aided methods for the hearing impaired have been proposed and applied to enhance language and communication abilities, such as sign language, finger spelling, lip-reading, and total communication approaches [3], [4]. Current research into augmentative communication technologies has mostly focused on developing more efficient user interfaces and extending accessibility [5]. For the interface configuration, row-column scanning and arrangement layouts are commonly used to reduce the number of scanning steps required to reach the desired selections [6]. A variety of
rate-enhancement techniques have been applied to enhance the capability of computer access, such as word prediction, abbreviation expansion, and semantic coding strategies [5]–[8]. In [5], a virtual keyboard (VK) and various selection and prediction strategies were surveyed, and comments were offered on the tradeoff between increased motor load and decreased cognitive load. The VK is an alternative interface for generating text and allows users to integrate various physical devices and language sets. It allows more direct access to vocabulary at the word level and illustrates the sentence compansion technique, which speeds up the communication rate by letting the user select only the uninflected content words of the desired sentence. Dundee University, Dundee, U.K., has developed a system called Predictive Adaptive Lexicon (PAL), which uses statistical information about common English words to enhance communication rate [8]. However, early studies showed that this might generate spelling or grammatical errors in text construction. Because of this problem, a word prediction method, SynPal, was further developed [9]. Unlike PAL, this method integrates syntactic parsing rules into the predictive algorithm by predicting next words of a suitable syntactic class. For young children, especially in preschool or primary school, symbol-based systems such as Bliss or multimeaning icons were used [10]. A well-known product named Minspeak translates the icon input into sentences and outputs them via a voice synthesizer. Its basic idea is to use ambiguous icons to represent a concept [11]. For example, when the icon "Apple" is combined with the icon "VERB" or "NOUN," the concept can be represented as "eat" or "food," respectively. In this way, although the ambiguity can be resolved, more keystrokes are required to express the proper intention. In addition, the input order also influences the meaning of a sentence. Although the AAC technologies described above have been developed for years, they cannot be directly adopted for people using the Chinese language in Taiwan. In recent years, several national research projects, funded by the National Science Council or the Ministry of Education, Taiwan, have been initiated to develop domestic assistive technologies. In this paper, we focus on developing assistive technologies for the hearing impaired, especially for deaf students. In Taiwan, signed Chinese, similar to signed English in American Sign Language, is a branch of Taiwanese Sign Language (TSL). It is a conventional teaching approach for deaf people and has been developed and used for years. Comparing signed and written Chinese, we discovered that deaf students using TSL as their native language usually have difficulty constructing written Chinese. They are accustomed to using
Fig. 1. Flow diagram of the proposed strategies.
their own conventions of TSL structure. Written Chinese sentences directly translated from signed language representations are generally ill formed from the viewpoint of Chinese grammar. We identified several structural differences between them, such as word order, common word deletion, quantifier deletion, and syntax conjugation [4]. Accordingly, deaf students usually produce ill-formed sentences from the viewpoint of written Chinese. In addition, reading Chinese articles is not an easy task for deaf students [12]. For example, adjectives in a sentence do not provide any information for them. According to this investigation, a database with sentences represented by both TSL symbols and Chinese characters is needed to derive the mapping between signed and written Chinese [13]. The objective of language processing in an AAC system is to enhance the input rate and accuracy of message composition, particularly in generating well-formed sentences from ill-formed input for the hearing impaired. Recently, the development of speech recognition and information retrieval technologies has allowed researchers to design and develop high-performance AAC systems with the help of language processing techniques. These techniques and design issues for ill-formed sentence processing have been introduced in many approaches, such as example-based, knowledge-based, and statistics-based language modeling and machine translation [14]. A well-established approach, the n-gram language model (LM), has been widely used to model the relationship between words. Such language models provide prediction capability and a structural representation of a language. Considering language perplexity, current research integrates grammatical constraints into various processing frameworks. Siu et al. proposed a variable n-gram model and its extensions [15]. Niesler et al. [16] and Hamaker [17] proposed similar algorithms to build part-of-speech LMs using grammatical constraints.
These approaches motivate us to reduce the effect of literacy limitations and to explore the use of rich linguistic information in the LM. This paper proposes an approach to the generation of sentences from ill-formed Taiwanese Sign Language for people with hearing impairments. The overall flow diagram of the proposed system is depicted in Fig. 1. A sign icon-based virtual keyboard provides a visualized interface to access sign icons from a sign icon database. The sign sequence is then decoded as a keyword sequence. A proposed LM, based on a predictive sentence template (PST) tree, integrates a statistical variable n-gram LM and linguistic constraints to deal with the translation problem from ill-formed sign sequences to grammatical written sentences. The linguistic analysis performs a partial parsing process to construct as many structures as the sentence generation task requires. Then, trigger pair modeling is applied to capture preferred relationships such as semantic relations and grammar regularities. A template-matching algorithm is proposed to generate well-formed sentences. In addition, a sentence scoring function, which considers sentence pattern and semantic characteristics, is proposed to rank the output sentences for user selection. This approach can be further integrated with various interface strategies, such as physical devices and scanning strategies, to serve a wide range of functions for the hearing and speech impaired depending on their unique needs. This paper is organized as follows. In Section II, the framework for constructing a predictive sentence template language model is illustrated. In Section III, we introduce how to generate well-formed sentences from ill-formed TSL. In Section IV, several experiments for evaluating the proposed approaches and their results are described. In Section V, conclusions and future work are presented.
TABLE I LINGUISTIC RELATIONS BETWEEN SYNTACTIC AND SEMANTIC INFORMATION
II. PREDICTIVE SENTENCE TEMPLATE TREE

A. Linguistic Processing of TSL Vocabulary

The TSL vocabulary was widely collected from the education and training materials used in deaf schools in Taiwan. It contains the most commonly used signs, which are tagged with Chinese phonetic symbols, characters, words, and phrases. It is hierarchically categorized based on the syntactic and semantic features of the signs in the TSL vocabulary. In the tagging process, the syntactic annotation is performed automatically using the Chinese Knowledge Information Processing Group (CKIP) AutoTag system, which includes a lexicon and a parser developed by the CKIP, Academia Sinica, Taiwan. The Chinese lexicon and parser are used to analyze the sentences in the training corpus and output the corresponding part-of-speech (POS) for each keyword and the sentence structure pattern for each sentence. The parser, based on the head-driven principle [18], aims at ambiguity resolution in syntactic analysis and describes the detailed word dependencies with respect to argument and adjunct characteristics. The HowNet knowledge base developed by Dong [19] is adopted to analyze the acting roles of keywords, such as agent, event, time, place, and theme. It describes the relations between concepts using the attributes of the concepts. In total, the bilingual HowNet knowledge base covers 65 953 concepts (in 53 335 words) in Chinese and 75 365 concepts (in 57 392 words) in English and contains detailed descriptions of the inter-concept relations. The definition of a concept in HowNet represents one or more of the following relations: hyponymy, synonymy, antonymy, meronymy, attribute-host, material-product, converse, dynamic role, and concept co-occurrence [20]. Many words have multiconceptual representations, i.e., polysemy. Each concept contains two categories of semantic features given by the definition "DEF" in HowNet. The primary feature indicates the concept's category, which can be represented as a hierarchical taxonomy similar to the concept relationships in WordNet. For example, the word {(student)} has the definition DEF = {human, study, education}; the feature {human} is primary and the others are secondary. The secondary features give a more concrete description. Based on the hypernym and hyponym hierarchical relations, the underlying taxonomy can further be used to infer the similarity between concepts. In other words, features at deeper nodes of the taxonomy have more similar properties. Table I shows the linguistic relations between syntactic and semantic information. Through
the linguistic processing, each word is represented as a set of POS features POS, a semantic feature SF, and a word frequency WF for language modeling and prediction.

B. Derivation of Phrase Formation Rules

In the tokenization process, the training corpus for constructing the sentence template tree is collected from teaching materials and daily dialogues, mostly from textbooks for deaf students in primary schools. Schoolteachers and special educators in the deaf school transcribed the signed Chinese into written Chinese. The corpus is used to discover the structural differences between signed and written Chinese. The TSL vocabulary is defined in this paper as the keyword set, which represents the main intention of an expression; the other words are regarded as function words. A function word set, which contains grammatical words such as (and), (of), or (a phrase-final particle used in questions) and plays an important role in grammar, is extracted automatically from the training corpus. In general, each sentence can be represented as keywords or key-phrases together with several function words. Based on this idea, each sentence S, regarded as a sequence of keywords and function words, can be expressed as follows:

S = {kw_1, fw_1, kw_2, fw_2, ..., kw_i, ..., fw_j, ...}   (1)

where kw_i is the i-th keyword or key-phrase and fw_j is the j-th function word in sentence S. All the sentences in the training corpus are parsed and tagged with the associated POS and semantic roles shown in Table I. In this paper, generalized structures, which possess several semantic and syntactic relations, are used to construct a well-formed sentence from a subsentence or phrase sequence with one or more function words skipped. Based on the observation and comparison of written and signed Chinese, several linguistic differences, such as word order, quantifiers, function words, conjunctions, and interrogatives, are encountered [4], [12]. These describe our main design issues with respect to structural translation. Traditionally, example-based, rule-based, and partial parsing approaches have been applied to deal with these kinds of problems. However, pattern matching problems always arise from rule explosion or insufficient rules. In addition, from the viewpoint of written Chinese, the word dependency in signed Chinese does not rely on adjacent words. Take the quantifier as an example: the written Chinese phrase {(a flower)} is represented in signed Chinese as {(flower) (one)}. Obviously, the quantifier disappears
TABLE II TRIGGER PAIR CATEGORIES WITH RESPECT TO PHRASE FORMATION AND SENTENCE EXPANSION RULES.
in signed Chinese and the word order is reversed. As discussed earlier, in this paper we apply the trigger pair modeling approach to capture short- and long-distance dependencies between words [21], especially semantic relations and grammar regularities. There exist highly associated word pairs in natural language, such as {sunshine/hot} and {doctor/nurse}. As in the signed Chinese example shown previously, these kinds of useful preference information can also be extracted from the corpus. Furthermore, the grammar regularities can deal with rule explosion issues and can further be used to expand sentence patterns using phrase formation rules. The use of semantic preferences between noun/noun, adjective/noun, and quantifier/noun relations becomes crucial in our study for reducing the language perplexity and for inferring word alignment. Motivated by this, the trigger pair concept was adopted for extracting word association rules from highly correlated word pairs. If a word w_A is highly correlated with another word w_B, then (w_A -> w_B) is regarded as a trigger pair, with w_A being the trigger, which occurs somewhere in the past history of the training corpus, and w_B being the triggered word, the next word. Each pair can be used to form significant subsentences or phrases. We can represent these relationships as the conditional probability P(w_B | w_A) within a distance d. When d = 1, this model is simplified to the conventional bi-gram language model. However, for a highly associated pair of two rare words, such as {(one)} and {(flower)}, the use of long-distance bi-grams is not sufficient to determine the relationship; compare this with a common word pair such as {(my brother)}, which is less correlated. For this reason, the average mutual information (AMI) [21], which takes the joint probability of the dependencies between the two words into account, is used and defined as follows:

AMI(w_A; w_B) = P(w_A, w_B) log [P(w_B | w_A) / P(w_B)]
              + P(w_A, ~w_B) log [P(~w_B | w_A) / P(~w_B)]
              + P(~w_A, w_B) log [P(w_B | ~w_A) / P(w_B)]
              + P(~w_A, ~w_B) log [P(~w_B | ~w_A) / P(~w_B)].   (2)
All candidate trigger pairs are preliminarily selected using the AMI criterion. Then, the mutual information (MI) is further used to measure the degree of preference relationship of a given pair, which is defined as follows:

MI(w_A; w_B) = log [P(w_A, w_B) / (P(w_A) P(w_B))].   (3)
A high value of MI(w_A; w_B) reflects a strong correlation between these two words. According to the characteristics of signed Chinese and the semantic information, all the extracted trigger pairs are categorized into six preference sets: agent, time, place, theme, quantifier, and compound noun. The corresponding grammar regularities are then collected from pairs with high MI values and labeled as instances of the designated trigger pair categories. For the quantifier example shown before, a trigger pair (Na -> Neu) defined in this paper was extracted. All the derived POS-based instances inherit their syntactic attributes with respect to the associated semantic roles. Part of the phrase formation criteria using trigger pair modeling, together with examples, is shown in Table II.
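As a concrete illustration of how candidate trigger pairs might be screened, the following sketch estimates the AMI and MI scores of (2) and (3) from co-occurrence counts. It is a minimal sketch rather than the authors' implementation: the window-based counting, the corpus format, the function name, and the use of only the first AMI term are assumptions made here for brevity.

```python
from collections import Counter
from math import log2

def trigger_pair_scores(sentences, max_distance=5):
    """Score ordered word pairs (w_A -> w_B), where w_B follows w_A within
    max_distance words, by (simplified) average and pointwise mutual information."""
    word_count, pair_count = Counter(), Counter()
    n_words, n_pairs = 0, 0
    for words in sentences:
        word_count.update(words)
        n_words += len(words)
        for i, a in enumerate(words):
            for b in words[i + 1:i + 1 + max_distance]:
                pair_count[(a, b)] += 1
                n_pairs += 1

    scores = {}
    for (a, b), c in pair_count.items():
        p_a = word_count[a] / n_words
        p_b = word_count[b] / n_words
        p_ab = c / n_pairs
        # Pointwise MI, Eq. (3): a large value marks a strongly associated pair.
        mi = log2(p_ab / (p_a * p_b))
        # First term of the AMI in Eq. (2): the association weighted by the joint
        # probability, so that very rare pairs are not over-valued.  The three
        # complement-event terms of the full definition are omitted in this sketch.
        ami = p_ab * log2((p_ab / p_a) / p_b)
        scores[(a, b)] = (ami, mi)
    return scores

# Candidate trigger pairs would first be screened by AMI and then ranked by MI, e.g.:
# sorted(trigger_pair_scores(corpus).items(), key=lambda kv: kv[1][1], reverse=True)
```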
Fig. 2. Example of a tree representation for a variable n-gram model. The letters represent POS tags.
C. Construction of Predictive Sentence Template Tree

Given the training corpus, the PST tree is designed as a tree structure that combines a sentence-template-based grammar tree and a variable n-gram language model. By reformulating the variable n-gram language model, the PST tree can be used to predict the next sign given its sentence template history with lexical information. We apply the POS-based variable n-gram LM to deal with the word order issues in syntactic structure during the translation from signed to written Chinese. The concept of POS classes is used to reduce the perplexity of a language model [16], [17]. Furthermore, we also use the n-gram LM and POS information to predict the next sign. The n-gram LM is generally adopted for word prediction. Given a word sequence w_1, w_2, ..., w_{n-1}, simplified as w_1^{n-1}, the task of predicting the next word w_n can be stated as attempting to estimate the probability function P(w_n | w_1^{n-1}). The probability estimation for predicting the next word is

P(w_n | w_1^{n-1}) = P(w_1^n) / P(w_1^{n-1}).   (4)

Using the maximum likelihood estimation (MLE) algorithm, this probability is calculated from the following relative frequencies:

P(w_n | w_1^{n-1}) = C(w_1^n) / C(w_1^{n-1})   (5)

where C(.) is the frequency of a word sequence in the training corpus, and each probability is estimated as such a count divided by the number of training instances. We consider each word occurring in its correct location as an independent event and use the chain rule of probability to decompose the probability of a whole sequence as follows:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}).   (6)

Traditionally, an n-gram LM can be constructed as a hierarchical tree structure of context nodes of fixed depth. A variable n-gram model can be viewed as a nonuniform tree, which is constructed using stopping or merging criteria in the construction process. The n-gram model with vocabulary size V is an n-level, V-ary branching tree. Each node represents a
context unit, and the relation between a node and its descendant nodes is characterized by a conditional probability. The root node has no context. Each internal node connects to the possible branches of the context. In this paper, we adopt the variable n-gram tree structure to construct the sentence templates with predictive capability, as shown in Fig. 2. In this tree representation, the depth varies across different branches. The data structure of the PST tree is described as follows. Each internal node n_i in the PST tree is represented by a feature vector

n_i = {DEF(n_i), SF(n_i), F_POS(n_i), KW_j(n_i), WF_j(n_i), P(n_i | Path(n_i)), SN(n_i), SL(n_i)}   (7)

which consists of the following eight parameters:
1) DEF(n_i) is the definition of n_i;
2) SF(n_i) is the semantic feature of n_i;
3) F_POS(n_i) is the referenced frequency of the POS of n_i;
4) KW_j(n_i) is the j-th keyword or function word in n_i;
5) WF_j(n_i) is the j-th word frequency in n_i;
6) P(n_i | Path(n_i)) is the conditional probability of n_i given the context history Path(n_i), where i-1 is the number of internal nodes preceding n_i;
7) SN(n_i) is the number of sons of n_i in the PST tree;
8) SL(n_i) is the linked list of sons of n_i in the PST tree.

Each external node can be represented as follows:

{N_Noun(Path_i), N_Verb(Path_i), F(Path_i)}   (8)

where i is the number of internal nodes before the external node, N_Noun(Path_i) is the number of nouns in the sentence template Path_i, N_Verb(Path_i) is the number of verbs in the sentence template Path_i, and F(Path_i) is the referenced frequency of the sentence template Path_i. Before describing the PST tree construction algorithm, the statistical language model of the PST tree has to be introduced first. Each internal node n_i in the PST tree represents one context unit and is characterized by the conditional probability P(n_i | Path(n_i)) given the context history Path(n_i). Path(n_i) is the full-length history traversed from the root node to n_i:

Path(n_i) = {n_1, n_2, ..., n_{i-1}}.   (9)
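To make the node layout concrete, the sketch below mirrors the eight internal-node parameters of (7) and the external-node summary of (8) as plain Python data classes. This is a minimal sketch: the field and class names are illustrative assumptions, since the paper does not give symbol names for them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PSTLeaf:
    """External node summarizing one sentence template, Eq. (8)."""
    n_nouns: int        # number of nouns in the template Path
    n_verbs: int        # number of verbs in the template Path
    ref_frequency: int  # referenced frequency of the template Path

@dataclass
class PSTNode:
    """Internal node of the PST tree, parameters 1)-8) of Eq. (7)."""
    definition: str                       # 1) definition of the node (e.g., its POS class)
    semantic_feature: str                 # 2) HowNet-style semantic feature (e.g., "agent")
    pos_frequency: int = 0                # 3) referenced frequency of the node's POS
    words: Dict[str, int] = field(default_factory=dict)      # 4)+5) keyword/function word -> frequency
    cond_prob: float = 0.0                # 6) P(node | context history Path)
    children: List["PSTNode"] = field(default_factory=list)  # 7)+8) sons of the node
    external: Optional[PSTLeaf] = None    # set when the node terminates a sentence template
```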
Fig. 3. Construction Algorithm of the PST tree.
Using the MLE algorithm, the conditional probability P(n_i | Path(n_i)) can be estimated as follows:

P(n_i | Path(n_i)) = C(n_1, n_2, ..., n_i) / C(n_1, n_2, ..., n_{i-1})   (10)

where C(.) is the number of times the node sequence is observed in the training corpus. Fig. 3 shows the processing diagram for constructing the PST tree. Given each parsed sentence, the various types of trigger pair category (TPC) described earlier are used to generate all the possible sentence patterns, which alleviates the drawback of insufficient training data. For example, given the sentence {(you/Nh/agent) (tomorrow/Nd/time) (have free time/VH) (an interrogative/T)} (Do you have free time tomorrow?), several derived sentence patterns are shown as follows:

• (you/Nh/agent) (tomorrow/Nd/time) (have free time/VH) (an interrogative/T);
• TPC 7: (you/Nh/agent) (tomorrow/Nd/time) (have free time/VH);
• TPC 2: (tomorrow/Nd/time) (have free time/VH) (an interrogative/T);
• TPC 2 and 7: (tomorrow/Nd/time) (have free time/VH);
• TPC 2: (you/Nh/agent) (have free time/VH) (an interrogative/T);
• TPC 2 and 7: (you/Nh/agent) (have free time/VH).

All the possible patterns for each sentence are used to construct the PST tree. In the construction process of the tree, a two-stage node-matching criterion is applied at each node to examine the node definitions. If the semantic features of the input word and the visited node are the same, the word has higher priority to merge; otherwise, the POS features are considered and the matching process is performed again. For each matched word, if the word already exists in the internal node, its word frequency is increased; otherwise, the word is assigned to the internal node. All the entries in the feature vector of (7) are recorded and updated. Using the breadth-first search algorithm, the PST tree can be derived. After all the training sentences have been used, all the entries (N_Noun, N_Verb, F) in the external nodes are updated and the parameters of the statistical language model are re-estimated for further training. In this construction process, the semantic features are used to refine and merge similar sentence structures. For example, given the two training sentences {(brother/Na/agent) (like/VK/event) (apple/Na/theme)} and {(I/Nh/agent) (like/VK/event) (apple/Na/theme)}, we would derive two different structures because of the different POS tags of (I) and (brother). But (I) and (brother) are regarded as the same agent, a semantic feature defined in HowNet. Therefore, we use the semantic features to merge structures with similar semantic features to deal with the syntactic rule explosion problem. This can also be applied to avoid logically conflictive situations like "(Apple likes brother)" in the sentence generation stage. Fig. 4 shows an example of the PST tree.
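The pattern-expansion step above can be sketched as deleting optional role-bearing constituents according to the trigger pair categories. The sketch below assumes that each parsed word carries a semantic role label and that the roles listed in droppable_roles are the ones a TPC rule allows to be omitted; both are simplifications of the rules in Table II, not the authors' exact procedure.

```python
from itertools import combinations

def expand_sentence_patterns(tagged_sentence, droppable_roles=("time", "agent", "interrogative")):
    """Generate training patterns by omitting optional constituents.

    tagged_sentence: list of (word, pos, role) tuples, e.g.
        [("you", "Nh", "agent"), ("tomorrow", "Nd", "time"),
         ("have free time", "VH", "event"), ("ma", "T", "interrogative")]
    Returns the original pattern plus every pattern obtained by dropping any
    subset of the droppable constituents (the event core is always kept).
    """
    optional = [i for i, (_, _, role) in enumerate(tagged_sentence)
                if role in droppable_roles]
    patterns = []
    for r in range(len(optional) + 1):
        for drop in combinations(optional, r):
            pattern = [tok for i, tok in enumerate(tagged_sentence) if i not in drop]
            if pattern not in patterns:
                patterns.append(pattern)
    return patterns
```

Applied to the "Do you have free time tomorrow?" example, this enumerates the original pattern and its reduced variants, including those listed above.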
III. SENTENCE GENERATION USING PST TREE

Generally, the signed Chinese produced by the hearing impaired is almost always ill formed. To enhance accuracy in the sentence generation process, the PST tree is used to generate well-formed sentences from ill-formed sign sequence inputs. Fig. 5 shows the sentence generation process from signed Chinese to written Chinese. This mechanism regards the input sign sequences as keyword sequences expressing a comprehensive message. By using the statistical language model, the contextual information of the keyword sequences can also be used to reduce the search space and to mutually constrain the generation results. Therefore, this also provides a fast approach to sentence construction using sentence pattern prediction. The processing scheme of sentence generation is based on bottom-up parsing and top-down filtering strategies. In the bottom-up parsing stage, the trigger pair categories are first used to identify all the possible phrases, such as noun phrases. Then, each keyword is treated as a basic unit and aligned to one or more suitable nodes in the PST tree.
Fig. 4. Example of the PST tree.
In the top-down filtering stage, the template-matching criterion and a sentence pattern scoring function with node-matching considerations are used to retrieve the top sentence templates in the generation process. Then, the established keyword/function-word bi-gram language model is used to fill suitable function words into the nodes. Finally, the PST-based syntactic score and the bi-gram LM score are combined to rank the output sentences. The most promising sentence is then used to generate the voice output via a Chinese text-to-speech conversion system.

Fig. 5. Flowchart of sentence generation and learning.

A. Development of Sign Prediction Interface

In accordance with the keystroke saving requirement and the design issues of the human–computer interface (HCI), the concept of Fitts' law [22] and the row-column scanning strategy are applied to develop a TSL virtual keyboard (VK) with a row-column arrangement layout [23]. According to Fitts' law, the movement time MT to select a target of width W that lies at distance or amplitude A is formulated as follows:

MT = a + b log_2(2A / W)   (11)
where the parameters a and b are observable constants depending on the user's response. Through investigation, some underlying design issues become apparent: a small target width W and a large distance A result in a large MT. In our proposed TSL VK, the frequently used symbols are dynamically ranked and arranged toward the upper-left corner. This makes the system more accessible to the hearing and speech impaired by providing whole-word or phrase access via pictorial representations. In [4], Hsing investigated schoolteachers' and students' consistency in communication and teaching behaviors in Taiwan. The results showed that the mean length of utterance (MLU) in sign communication ranges from 3.18 to 9.98 Chinese characters per utterance. According to this domestic research finding and the MLU of practical dialogue patterns in daily life, we constructed a TSL VK with two rows and four columns, in which each sign icon has a size of 150 x 150 pixels on a display resolution of 800 x 600 pixels. Fig. 6 shows an example of sentence translation from ill-formed sign sequence input to well-formed sentences via the TSL VK.
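As a quick check of the design implication of (11), the following sketch computes MT for two icon sizes. The constants a and b here are arbitrary illustrative values, not measured ones, and the classic 2A/W formulation is assumed.

```python
from math import log2

def movement_time(amplitude, width, a=0.2, b=0.1):
    """Fitts' law, Eq. (11): MT = a + b * log2(2A / W), in seconds."""
    return a + b * log2(2.0 * amplitude / width)

# Larger icons (and shorter distances) reduce the predicted selection time:
print(movement_time(amplitude=400, width=150))  # 150-pixel sign icon
print(movement_time(amplitude=400, width=50))   # smaller 50-pixel icon -> longer MT
```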
Fig. 6. An example of sentence translation from ill-formed sign sequence input to a well-formed sentence via the TSL VK.

The left-upper block shows the 15 meaningful categories of the TSL vocabulary for fast access to a sign from its category. The left-lower block shows the categories of the most frequently used dialogue sentences defined by users. The right-upper block shows the proposed TSL VK with the row-column arrangement; each sign icon also shows its corresponding Chinese word. The right-lower block shows system functions, such as text generation, text-to-speech conversion, and sentence pattern prediction.

Word prediction is an essential task in augmentative communication for the hearing impaired. Based on the PST, a bi-gram LM is used for predicting the next sign. This model assumes that the probability of a sign depends only on the previous sign, which is called a Markov assumption. We can thus approximate the n-gram model with the bi-gram model. The general equation for this bi-gram approximation to the conditional probability for predicting the next sign is

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1}).   (12)

We take the training corpus into account through the sum of all the bi-grams that share the same first sign:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w).   (13)

Then, according to the POS information and the corresponding word frequencies, all the candidate signs are dynamically arranged on the TSL VK for selection. Thus, the bi-gram LM is used to model the dependency for sign prediction and, further, for function word filling.

B. Key-Phrase Identification Using Trigger Pair Category

Input keyword sequences are first analyzed using the TPC listed in Table II. Each keyword has its corresponding linguistic information defined in Table I. Using the parsing ability of the trigger pairs, all the possible keyword fragments are merged to form a new keyword or key-phrase sequence, and these are regarded as the basic processing units. All the derived key-phrases are inherited with the corresponding syntactic and semantic features. For example, because of the lack of quantifiers in signed Chinese, the parsing process accepts the signed sequence {(I/Nh/agent), (mother/Na/agent), (red/Na), (apple/Na), (two/Neu)} and generates a new phrase sequence [(my mother/Na/agent), (two red apples/Na)]. The key-phrase {(my mother/Na/agent)} is derived using the first rule of the TPC, and the key-phrase {(two red apples/Na)} is derived using the fifth rule.
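A minimal sketch of this key-phrase identification step is given below: adjacent keywords whose POS pair matches a trigger pair category are merged into one key-phrase that inherits a head POS and semantic role. The POS pairs and inherited roles in PHRASE_RULES are assumptions loosely based on the agent, quantifier, and compound-noun examples; the full rule set of Table II is not reproduced, and the left-to-right greedy merge order is also an assumption.

```python
# (left POS, right POS) -> (POS, semantic role) inherited by the merged key-phrase
PHRASE_RULES = {
    ("Nh", "Na"): ("Na", "agent"),   # e.g. (I)(mother) -> (my mother)
    ("Na", "Neu"): ("Na", "theme"),  # quantifier rule, e.g. (flower)(one) -> (a flower)
    ("Na", "Na"): ("Na", "theme"),   # compound noun
}

def identify_key_phrases(keywords):
    """keywords: list of (word, pos, role) tuples; returns the merged key-phrase list."""
    tokens = list(keywords)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            (w1, p1, _), (w2, p2, _) = tokens[i], tokens[i + 1]
            rule = PHRASE_RULES.get((p1, p2))
            if rule:
                pos, role = rule
                tokens[i:i + 2] = [(w1 + " " + w2, pos, role)]  # merge the fragment
                merged = True
                break
    return tokens
```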
C. Template Matching

For the syntactic structure translation from ill-formed signed Chinese to written Chinese, the key-phrase sequence is filled into the PST tree using the node-matching criterion. Detailed syntactic and semantic information is adopted in the matching process. For example, even though the sentence template {(I)/Nh/agent (love)/VL (you)/Nh/agent} exists in the PST tree, the keyword sequence {(I)/Nh/agent, (like)/VK, (you)/Nh/agent} cannot find a matching structure. This is because of the verb constraint, in which the POS tags of the words {(love)/VL} and {(like)/VK} are different. Motivated by this, we loosen the verb constraints using their hierarchical characteristics, a verb taxonomy tree defined by CKIP [18], to improve the success of sentence generation. These matching criteria vary according to the aspects of the equivalence classes. The node-matching process measures the matching similarity in accordance with semantic features or POS features and is defined as follows: if the semantic features of the input word and the visited node are the same, then fill the input word into the matched node; otherwise, match the syntactic features, such as noun or verb. Loosening the constraints of a node with a noun feature using its corresponding semantic feature can deal with the logically conflictive situation. The verb features in the CKIP definition have 12 categories and are mainly separated into active and stative types. Each active or stative type can be further classified into transitive and intransitive types. This three-level hierarchical classification is used to measure the POS relation between nodes and keywords or key-phrases and to loosen the constraints of the verb definitions. This process can further be applied to sentence-level or discourse-level translation, in which matches on any member of the equivalence class containing morphological variants, such as synonyms, hypernyms, and hyponyms, are considered equally.

Based on the analysis of node similarity, a two-stage matching process is used to filter out unsuitable paths and reduce the search space. In the first stage, the template-matching criteria are used and defined as follows:
1) the number of verbs N_Verb in the template is equal to the number of verbs in the keywords or key-phrases;
2) the number of nouns N_Noun is smaller than or equal to the number of nouns in the keywords or key-phrases; and
3) the number of function words in each template is equal to or smaller than two.
According to these three criteria, we choose all possible paths as the candidate templates. In the second stage, the node-matching algorithm is used and described as follows.

Step 1: Initialization. All the keywords or key-phrases are regarded as nonmatched. For each candidate template, search from the root.
Step 2: Recursion.
Step 2.1: For all keywords or key-phrases, perform Step 2.2 to Step 2.4.
Step 2.2: For each node, check the node-matching criteria.
Step 2.2.1: If the node matches, fill the current keyword into the matched node in the candidate template. Go to Step 2.1.
Step 2.2.2: Match the succeeding node with a verb or noun feature. If there is no match, skip it and go to Step 2.2.
Step 2.3: Otherwise, go to Step 3.1.
Step 3: Termination.
Step 3.1: If all the candidate templates have been searched, then stop.
Step 3.2: Otherwise, consider the next candidate template. Go to Step 1.
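The two-stage matching can be sketched as a template pre-filter followed by greedy node matching. The semantic-then-POS priority follows the criteria above, while the data representation (key-phrases as (word, pos, sem) tuples, template nodes as dictionaries) and the greedy left-to-right filling are assumptions of this sketch.

```python
def template_prefilter(template, keyphrases):
    """Stage 1: keep a candidate template only if it satisfies the
    verb / noun / function-word count criteria 1)-3)."""
    n_verbs = sum(1 for n in template if n["pos"].startswith("V"))
    n_nouns = sum(1 for n in template if n["pos"].startswith("N"))
    kw_verbs = sum(1 for _, pos, _ in keyphrases if pos.startswith("V"))
    kw_nouns = sum(1 for _, pos, _ in keyphrases if pos.startswith("N"))
    n_func = sum(1 for n in template if n.get("is_function_word"))
    return n_verbs == kw_verbs and n_nouns <= kw_nouns and n_func <= 2

def match_template(template, keyphrases):
    """Stage 2: greedily fill key-phrases into nodes, preferring a
    semantic-feature match over a bare POS-class match."""
    filled, used = {}, set()
    for word, pos, sem in keyphrases:
        for i, node in enumerate(template):
            if i in used:
                continue
            if node["sem"] == sem or node["pos"][0] == pos[0]:
                filled[i] = word
                used.add(i)
                break
        else:
            return None  # some key-phrase found no node: reject this template
    return filled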
Then, a sentence pattern scoring function is used to rank the output sentence templates. This function takes the feature similarity of the matched nodes and the referenced frequency F(Path) of each sentence template Path into account (14), where Sim(.) is the similarity measure for node matching. One obvious way to measure the node similarity between two verbs is to measure the distance between them in the CKIP definitions. This can be done by estimating the similarity between verbs V_A and V_B in the CKIP verb taxonomy: we compute a path-length similarity between the two verbs (15), where L(V_A, V_B) is the number of nodes on the path from V_A to V_B and D is the maximum depth of the taxonomy. The node-similarity measure function Sim(n, w) of (16) is defined piecewise: it takes its maximum value if the syntactic feature NPOS of the node matches, if the semantic feature Case matches, or if the node is a function word node; for verb nodes it is evaluated in stages, first comparing the verb features (CPOS), then the active/stative feature (SPOS), and otherwise the transitive/intransitive feature.

D. Function Word Filling

After all the key-phrases have been processed, the keyword/function-word bi-gram LM is used to fill function words into the sentences. This has the advantage of generating more natural sentences and speeding up the rate of communication by omitting the input of function words. Fig. 4 shows an example of filling the word "(very)" or "(greatly)" to generate sentences. Based on the bi-gram information, (very) and (greatly) are filled into the Dfa node for function words. In addition, ("ma") and ("ba") can also be inserted at the tail of a sentence to form an interrogative sentence. Finally, we combine the sentence pattern and semantic scores, using an empirically determined weighting factor λ, to rank the output sentences for user selection:

Score(Sentence) = λ Score_pattern(Path) + (1 - λ) Score_semantic(Path).   (17)

E. Automatic Sentence Pattern Learning

Using the TPC, the PST tree is trained automatically and regarded as the initial model for sentence construction. Fig. 5 shows the processing diagram for the predictive sentence construction. The TSL icon sequences on the display are dynamically updated based on the previous input and are constrained by the sentence-template-based grammar. The task of predicting the next keyword can be stated as attempting to estimate the probability function shown in (10). This has the advantage of longer-distance predictive ability. In addition, a statistical learning mechanism for rate enhancement is utilized to dynamically record personalized and frequently used words, phrases, and sentence patterns in the PST tree. All the statistical information in each Path can be updated after the desired sentence has been selected by the user. The learning mechanism accounts for both the sequence of words and the POS tags of those words, and the prediction probability of (10) can be rewritten accordingly (18). This can be used to rank the order of the row-column arrangement layout and to further reduce the number of scanning steps, improving flexibility and search efficiency. Through learning, the system can adapt to customized issues for individual users and so speed up the communication rate.
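A sketch of the final ranking step: each candidate sentence is scored by a weighted combination of its sentence-pattern score and its semantic (bi-gram) score, as in (17). The weight value and the form of the two component scores are placeholders here; only the weighted combination itself is taken from the text.

```python
def rank_sentences(candidates, weight=0.6):
    """candidates: list of dicts with 'sentence', 'pattern_score', and
    'semantic_score' (both assumed pre-normalized to [0, 1]).
    Returns the candidates sorted best-first by the combined score of Eq. (17)."""
    def combined(c):
        return weight * c["pattern_score"] + (1.0 - weight) * c["semantic_score"]
    return sorted(candidates, key=combined, reverse=True)

# Example: the top-ranked sentence is offered to the user first.
ranked = rank_sentences([
    {"sentence": "brother likes to eat apples", "pattern_score": 0.9, "semantic_score": 0.7},
    {"sentence": "apple likes to eat brother", "pattern_score": 0.4, "semantic_score": 0.2},
])
print(ranked[0]["sentence"])
```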
IV. RESULTS AND DISCUSSION

In order to evaluate the performance of our approach, a computer-based TSL virtual keyboard AAC system was developed for the hearing impaired, especially for profoundly deaf students. On the basis of the cognitive effect of visual language and the practical demands of the deaf, a set of TSL signs is used and
regarded as the alternative interface units for augmentative communication. In developing a TSL sign-based interface, a suitable human–computer interface should consider the user's physiological, cognitive, and learning abilities. Moreover, this specialized design with suitable operating strategies becomes an important factor in performance evaluation. For this reason, linguistic knowledge, dynamic display, and arrangement strategies are considered in our TSL-based virtual keyboard design. By selecting TSL icons from the TSL VK, users can input sign sequences via keyboard, mouse, or touch screen. The choice of interface technology generally depends on individual physical capability. This system facilitates TSL teaching for teachers and increases learning motivation for deaf students via the interactive bi-modal interface. Since the implementation of the TSL VK is a highly individualized process, a rehabilitation team funded by the National Science Council in Taiwan evaluated the various users' responses to accommodate the tradeoff among speed, icon size, visible region, and the number of sign sequences. The members consist of schoolteachers, special educators, ear-nose-throat (ENT) and rehabilitation doctors, speech and hearing therapists, and computer scientists. Clinicians and special educators were asked to evaluate hearing loss and learning ability through the use of the bi-modal interface. According to the practical needs of deaf students with particular kinds of disabling conditions and a survey of existing augmentative communication devices, an icon-based computer-aided prescription was documented and proposed. The commonly used TSL signs were collected and organized as a message base. Then, a row-column layout with dynamic arrangement and prediction capability was investigated. This makes it possible to integrate various input control modes, such as direct selection, encoded selection, and scanning selection.

A. Training Databases and Perplexity Evaluation

The training databases, which consist of a text corpus and TSL sign lexicons, were collected from teaching materials and daily dialogues. The first part, the text corpus, contains the content of the first- to fifth-grade textbooks of a primary deaf school in Taiwan. The second part, the dialogue corpus, was transcribed from videotapes by native users of TSL. The training corpus was transcribed by schoolteachers and special educators. The TSL vocabulary was widely collected from the education and training materials in the deaf schools. In this paper, the 1178 most frequently used TSL signs were used to investigate the effectiveness of message expression and were categorized into 15 categories for fast and direct access. The classification criteria are based on the standard teaching materials for deaf schools published by the Ministry of Education in Taiwan. Fig. 7 shows the distribution of the 15 classes. We selected 1064 sentences, with an MLU of 4.9, to develop 491 sentence templates with an average of 4.5 nodes per template in the PST tree. The MLU is used to examine the relation between signed Chinese and written Chinese. In order to evaluate the degree of sophistication of this task, a measure of performance called perplexity is used. This quantity represents the average
Fig. 7. The distribution of 15 semantic classes of TSL vocabulary.

TABLE III PERPLEXITY EVALUATION RESULTS FOR THE BASELINE, POS-BASED, AND TRIGGER PAIR BASED MODELS
word branching factor of a language model and is defined as follows:

Perplexity = 2^H   (19)

where H represents the estimated entropy, the average information of a given word sequence. Therefore, the lower the perplexity obtained, the better the language model achieved. In this paper, we adopted the bi-gram model as the baseline model for the task of predicting the next word from the corpus and compared it with the model with POS features and the model with both POS and semantic features. Table III shows the perplexity comparison of the three models and the improvement obtained using our proposed method.

B. Functional Evaluation for Sign Prediction Interface

In practical evaluation, the usefulness of the proposed system depends on many factors, ranging from the user's cognitive abilities to the interface design. Therefore, an evaluation of the potential effectiveness was developed based on an analysis of the training corpus to estimate the keystroke saving rate over a comparable word-based approach. The evaluation focuses on the various strategies for rate enhancement.

1) Function Word Deletion: An additional goal of the PST tree is to relieve the user from the burden of inputting function words such as prepositions and stop words in Chinese. Table IV shows the statistics of the function words in the training corpus. Table V shows a summary of this analysis. A 26.25% keystroke (KS) saving rate was achieved for sentences with function word deletion. Therefore, using meaningful TSL signs or keywords to construct sentences helps reduce the number of scanning or input steps.
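For reference, the perplexity of (19) can be computed from a model's word probabilities on held-out text as follows. This is a minimal sketch; the bi-gram probability interface `prob(word, prev_word)` is an assumption about how the model is exposed, not part of the paper.

```python
from math import log2

def perplexity(test_sentences, prob):
    """Perplexity = 2**H, where H is the average negative log2-probability per word.
    prob(word, prev_word) returns the model probability of a word given its
    predecessor (a bi-gram interface is assumed)."""
    log_sum, n_words = 0.0, 0
    for sentence in test_sentences:
        prev = "<s>"
        for word in sentence:
            log_sum += -log2(prob(word, prev))
            prev = word
            n_words += 1
    return 2.0 ** (log_sum / n_words)
```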
TABLE IV STATISTICS OF THE FUNCTION WORDS OCCURRING IN THE TRAINING CORPUS
TABLE V KEYSTROKE SAVING RATE FOR FUNCTION WORD DELETION
TABLE VI COMPARISON FOR SCANNING IMPROVEMENT RATE WITH VARIOUS PREDICTION STRATEGIES
2) Sign-Based Scanning: In this row-column scanning and arrangement framework, the estimation must consider several factors. The initial estimates are perhaps low for subjects because of effects related to the arrangement and scanning strategies. That is, the interface design has been further tailored to match user operation requirements, such as the degree of attention and focus. When an action is activated, the corresponding function is highlighted and the others are disabled. To evaluate our proposed prediction strategies, we model the scanning process from page to page and from row to column in order to estimate the scanning steps required to select a desired sign; a similar measure was proposed in [23], [24]. The scanning cost in (20) is computed, for each keyword of each test sentence, from the page number for locating the desired sign and the row and column numbers of the sign on that page, averaged over the total number of keywords and test sentences; for each scanning process, the scanning time and location of each step are recorded for analysis of user responses. The scanning rate enhancements using keyword prediction and sentence pattern prediction were 67.71% and 79.50%, respectively. Table VI shows the comparison of the various prediction strategies in the evaluation of sign scanning. Fig. 4 shows the input sign sequence {(Apple) (Brother) (Like) (Eat)}. After text generation and text-to-speech conversion, the decoded message can be generated with bi-modal information, which consists of a well-formed sentence {(eat apple)} and its corresponding voice output. Hence, this approach helps establish a TSL vocabulary for the user and organize it in a way that is both efficient and relatively easy to learn and retain over time.
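A sketch of the scanning-cost estimate used in this evaluation is given below: the cost of selecting one sign is taken as the number of page steps plus its row and column positions, averaged over the keywords of the test sentences. The exact functional form of (20) is not fully recoverable from the source, so this is only an approximation based on the quantities listed with it, and the `locate` interface is an assumption.

```python
def average_scanning_steps(test_sentences, locate):
    """test_sentences: list of keyword lists.
    locate(keyword) -> (page, row, col), the 1-based position of the sign on the
    virtual keyboard at the moment it is needed (depends on the layout/prediction)."""
    total_steps, total_keywords = 0, 0
    for keywords in test_sentences:
        for kw in keywords:
            page, row, col = locate(kw)
            total_steps += page + row + col   # page flips, then row scan, then column scan
            total_keywords += 1
    return total_steps / total_keywords
```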
C. Reading Comprehension Evaluation

For the assessment of practical communication aids, a training program funded by the National Science Council in Taiwan was undertaken for clinical evaluation and education training. The training program was conducted for a three-month period and then suspended for three months to evaluate the stability of the learning performance. In the training period, the functional parameters of the proposed system, such as the scanning steps and time and the arrangement and highlighting modes of the sign prediction interface, provided a customized prescription and design matching individuals' communication needs, not only in daily activities but also for education and training. Each training course consisted of 50 minutes of guided practice for both groups, with ten minutes for conversation and the remainder for practice. The aim of this empirical study was to augment the written Chinese learning ability of deaf students in Taiwan.

1) Subjects and Evaluation Criteria: Ten profoundly deaf students who were in the fifth grade and enrolled in the primary deaf school in Tainan, Taiwan, served as the subjects and were randomly divided into a testing group using the proposed system and a control group trained by the conventional teaching method. The special educator was asked to select suitable subjects meeting a predefined requirement in literacy and language learning capability. Literacy is regarded as a cognitive process, including the development of reading and writing. The literacy aptitude test was examined by the sign language teaching committee of the Ministry of Education in Taiwan at the end of a semester. In this paper, an evaluation index called semantic integration (SI) per utterance is used to examine the degree of intention expression in a reading comprehension task [25], especially for textbooks. This index simultaneously considers the amount of semantic or keyword information and the order of information presentation when students integrate the answer to a question asked by teachers into a single utterance. Table VII shows the basic information and aptitude test results of the deaf students. In the SI test, the outcome is reported as the correct rate on questions randomly selected from the literacy aptitude test files. After training, we also conducted a subjective evaluation called satisfactory level, using a questionnaire, to investigate the schoolteachers' and students' consistency in interactive communication behaviors. The satisfactory level reflects the state of intention expression generated by the proposed method and is defined on five levels: excellent (credit of 5), good (credit of 4), fair (credit of 3), poor (credit of 2), and unsatisfactory (credit of 1).

2) Evaluation Results: Fig. 8 shows the improvement rate in the reading comprehension evaluation. A t-test revealed a significant difference between the two groups (p < 0.05).
TABLE VII BASIC INFORMATION OF THE SUBJECTS
Table VIII shows a summary of the rate enhancement, SI, and satisfactory level evaluations at the first-, third-, and sixth-month stages for the test group. Both show a tendency toward improvement in reading ability and intention expression after the training process using our proposed communication-aided system. The conventional training of the control group depends highly on the teacher's skills. The comprehension of words and the structural differences between signed and written Chinese were the major factors affecting the learning performance. In the training stage, subjects were accustomed to constructing sentences in the conventional way and were unfamiliar with the functions provided by the proposed system. Moreover, the MLU was higher than in other stages because of a few unneeded words. In order to produce a sentence, subjects were forced to spend most of their time scanning and selecting the desired signs on the VK. This degraded the performance of sentence generation and increased the scanning steps and time. Therefore, the evaluation results did not achieve the desired goal at that stage. In the fourth-month stage in Fig. 8, the correct rate of SI decreased slightly because subjects relaxed their efforts after the midterm examination; this situation was anticipated. Nevertheless, taken as a whole, the evaluation showed a significant improvement. During the suspended period, the improvement tendency increased slightly owing to the breakthrough of the learning problem, as students had become familiar with expressing their intentions using the keyword-based and template prediction methods. However, as a result of language complexity, our proposed scheme still faces multi-intention and complicated sentence-type problems. An alternative solution is to increase the training corpus and to adopt semantic understanding technology.

Fig. 8. Improvement of the literacy aptitude test through the reading-comprehension training.

V. CONCLUSION
Language learning, including home language, sign language, and other AAC sciences, enables people to make meaning, access education, and think and express their thoughts. When people have a certain degree of communication disorder, appropriate communication-aided and learning strategies should be provided for their specific demands. The proposed system applies an innovative design methodology of sentence prediction and construction to develop a TSL AAC system for deaf students. The newly proposed TSL VK design takes into account human factors, visual concentration, and eye-hand coordination. By way of signed Chinese to written Chinese translation, deaf students can regard this system as a specialized word processor that automatically translates input sign sequences into well-formed sentences. In addition to speeding up the sentence generation rate for individuals, the proposed system has the potential to assist people with language deficits in sentence formation and grammatical correction. Evaluation results show that a significant improvement was achieved. However, the proposed PST language model can only be used to generate task- or domain-specific sentence types. For further development of language generation, our work will focus on analyzing more detailed syntactic structure translation and context understanding between signed Chinese and written Chinese. For evaluating the benefit of training and education, a computer-aided instruction training program, such as the Individual Education Program in Taiwan, will be further developed to improve the language learning and speech communication abilities of people with communication impairments. This study aims to
TABLE VIII SUMMARY OF RATE ENHANCEMENT, SEMANTIC INTEGRATION RATE AND SATISFACTORY LEVEL EVALUATION FOR TEST GROUP
establish domestic AAC technologies that provide users, clinicians, and special educators with flexible and useful tools to design and use customized communication-aided devices, and the system can further be applied to multidisciplinary evaluation and therapy. After the AAC system has been developed and fine-tuned, it will be transferred to a palm-sized platform and other multimedia applications. More importantly, portable communication systems will be designed and fabricated for people with cerebral palsy or stroke-related motor disorders to use in their daily activities.

ACKNOWLEDGMENT

The authors wish to express their gratitude to Z. Dong for providing the HowNet knowledge base.

REFERENCES

[1] A. M. Cook and S. M. Hussey, Assistive Technologies: Principles and Practice. St. Louis, MO: Mosby-Year Book, 1995.
[2] J. Reichle, Implementing Augmentative and Alternative Communication: Strategies for Learners with Severe Disabilities. Baltimore, MD: Paul H. Brookes, 1992.
[3] M. D. Manoranjan and J. A. Robinson, "Practical low-cost visual communication using binary images for deaf sign language," IEEE Trans. Rehab. Eng., vol. 8, pp. 81–88, Mar. 2000.
[4] M. H. Hsing, "An analysis on deaf-school teachers' utterance messages and morpheme semantic features between spoken and signed language channels," Bull. Special Educat. Rehab., vol. 8, pp. 27–52, June 2000.
[5] P. Demasco and K. F. McCoy, "Generating text from compressed input: An intelligent interface for people with severe motor impairments," Commun. ACM, vol. 35, pp. 68–78, 1992.
[6] R. C. Simpson and H. H. Koester, "Adaptive one-switch row-column scanning," IEEE Trans. Rehab. Eng., vol. 7, pp. 464–473, Dec. 1999.
[7] H. H. Koester and S. P. Levine, "Modeling the speed of text entry with a word prediction interface," IEEE Trans. Rehab. Eng., vol. 2, pp. 177–187, Sept. 1994.
[8] A. L. Swiffin, J. L. Arnott, and A. F. Newell, "Adaptive and predictive techniques in a communication prosthesis," Augment. Altern. Commun., vol. 3, no. 4, pp. 181–191, 1987.
[9] C. Morris, A. Newell, L. Booth, and J. Arnott, "Syntax pal—A system to improve the syntax of those with language dysfunction," in Proc. 14th Annu. RESNA Conf., Washington, DC, 1991, pp. 105–106.
[10] C. Sara and S. Yechayahu, "Semantic transparency and translucency in compound Blissymbols," Augment. Altern. Commun., vol. 14, pp. 171–183, 1998.
[11] S. K. Chang et al., "A methodology for iconic language design with application to augmentative communication," in Proc. IEEE Workshop on Visual Languages, 1992, pp. 110–116.
[12] Y. C. Wang and T. J. Chou, "A comparative study of reading comprehension strategies used by elementary reading disabled and proficient reading students," Bull. Special Educat. Rehab., vol. 8, pp. 161–182, June 2000.
[13] T. Wilson and M. Hyde, "The use of signed English pictures to facilitate reading comprehension by deaf students," Amer. Ann. Deaf, vol. 142, pp. 333–341, 1997.
[14] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[15] M. Siu and M. Ostendorf, "Variable n-grams and extensions for conversational speech language modeling," IEEE Trans. Speech Audio Processing, vol. 8, pp. 63–75, Jan. 2000.
[16] T. Niesler and P. Woodland, "Variable-length category n-gram language models," Comput. Speech Lang., vol. 21, pp. 1–26, 1999.
[17] J. S. Hamaker, "Toward building a better language model for switchboard: The POS tagging task," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1999, pp. 579–582.
[18] The Chinese Knowledge Information Processing Group, "Analysis of Chinese part of speech," (in Chinese), Institute of Information Science, Academia Sinica, Taipei, CKIP Tech. Rep. 93-05, 1993.
[19] D. Zhendong, The HowNet Website, http://www.how-net.com/.
[20] Q. Zhou and S. Feng, "Build a relation network representation for HowNet," in Proc. Int. Conf. Multilingual Information Processing, 2000, pp. 139–145.
[21] R. Rosenfeld, "Adaptive statistical language modeling: A maximum entropy approach," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 1994.
[22] I. S. MacKenzie and W. Buxton, "A tool for the rapid evaluation of input devices using Fitts' law models," SIGCHI Bull., vol. 25, no. 3, pp. 58–63, 1993.
[23] J. G. Webster, A. M. Cook, W. J. Tompkins, and G. C. Vanderheiden, Electronic Devices for Rehabilitation. New York: Wiley, 1985.
[24] G. C. Vanderheiden, "Computers can play a dual role for disabled individuals," BYTE, vol. 7, no. 9, pp. 136–162, 1982.
[25] S. Oviatt, "Predicting spoken disfluencies during human-computer interaction," Comput. Speech Lang., vol. 9, pp. 19–35, 1995.
Chung-Hsien Wu (M’94–SM’03) received the B.S. degree in electronics engineering from National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University (NCKU), Tainan, Taiwan, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, NCKU. He became a Professor in August 1997. From 1999 to 2002, he served as the Chairman of the Department. He also worked at Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, in summer 2003 as a Visiting Scientist. He is currently the Editor-in-Chief for the International Journal of Computational Linguistics and Chinese Language Processing. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing impaired. Dr. Wu is a member of International Speech Communication Association and ROCLING.
Yu-Hsien Chiu received the B.S. degree in electrical engineering from I-Shou University, Kaohsiung, Taiwan, R.O.C., in 1997, and the M.S. degree in biomedical engineering in 1999 from National Cheng Kung University, Tainan, Taiwan, where he is currently working toward the Ph.D. degree in the Department of Computer Science and Information Engineering. His research interests include speech and biomedical signal processing, embedded system design, spoken language processing, and sign language processing for the hearing impaired.
Chi-Shiang Guo received the B.S. degree in information engineering from National Central University, Chung-Li, Taiwan, R.O.C., in 1998, and the M.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2000. His research interests include digital signal processing, text-to-speech synthesis, natural language processing, and assistive technology for the hearing-impaired.