IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 26, NO. 4, APRIL 2004

Error-Tolerant Sign Retrieval Using Visual Features and Maximum A Posteriori Estimation

Chung-Hsien Wu, Senior Member, IEEE, Yu-Hsien Chiu, and Kung-Wei Cheng

Abstract—This paper proposes an efficient error-tolerant approach to retrieving sign words from a Taiwanese Sign Language (TSL) database. This database is tagged with visual gesture features and organized as a multilist code tree. These features are defined in terms of the visual characteristics of sign gestures, by which the signs are indexed for retrieval and displayed using an anthropomorphic interface. Maximum a posteriori estimation is exploited to retrieve the most likely sign word given the input feature sequence. An error-tolerant mechanism based on a mutual information criterion is proposed to retrieve a sign word of interest efficiently and robustly. A user-friendly anthropomorphic interface is also developed to assist in learning TSL. Several experiments were performed in an educational environment to investigate the system's retrieval accuracy. Our proposed approach outperformed a dynamic programming algorithm in this task and shows tolerance to user input errors.

Index Terms—Taiwanese Sign Language, alternative and augmentative communication, error-tolerant retrieval, gesture feature.

1 INTRODUCTION

Allowing disabled people to participate in everyday life is important, and so is developing the technological means to support this participation. Accordingly, augmentative and alternative communication (AAC) technology is used to develop more efficient user interfaces and predictive strategies to increase the text entry rate [1], [2], [3]. Over the last decade, many iconic languages and virtual keyboards have been successfully used in special education, including Mayer-Johnson's Picture Communication Symbols (PCS), Picsyms, Blissymbolics, and others [1], [4], [5]. Successful text entry mostly depends on the use of easily comprehensible ideographs or pictographs. However, deaf students may have difficulty in sensing letters, words, or symbols; conventionally, they use sign language to represent their ideas [6], [7]. Accordingly, the aim in designing a sign retrieval system is to exploit Taiwanese Sign Language (TSL) phonological features as an alternative system of symbols, enabling the more intuitive retrieval of a sign word from a sequence of gesture feature inputs. This could be regarded as an interface to a search engine for databases of signed videos and books, making it easier for users to look up the meaning or pronunciation of a specific sign.

To enable access to a vocabulary by a computer, many letter- or word-based selection strategies have been proposed that place the most frequently used items in the most easily accessible locations, as in direct and row-column scanning methods [8], [2], [6]. However, a particular item is typically harder to find when items are sorted by frequency. Therefore, many methods, including word prediction, abbreviation expansion, semantic coding, the n-gram model, and the variable-length model, have been proposed [2], [9]. Representative systems include the MinSpeak2 iconic keyboard [5] and the Reactive Keyboard [9].

The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. E-mail: {chwu, chiuyh, kungwei}@csie.ncku.edu.tw.

Manuscript received 12 Sept. 2002; revised 1 Apr. 2003; accepted 7 Oct. 2003. Recommended for acceptance by L. Vincent. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 117336.

0162-8828/04/$20.00 © 2004 IEEE

Recently, research into information retrieval has involved combining several useful representations and similarity-matching models with index structures to increase the efficiency of retrieval from large databases [10]. For effective indexing, perceptual features, including the shapes of and spatial relationships among images or objects, are important [11]. In [11], such indexing strategies and metric-tree-based index structures are compared, and feature matching methods based on a dynamic programming algorithm and a distance measure are also discussed. This paper proposes a multilist code tree (MCT), a fixed-length metric tree, to organize the TSL database using gesture features for effective indexing and searching. Manual sign information consists of hand shapes, movement, location, and palm orientation as gesture features [12]. Currently, these features are also adopted to recognize sign language [13], [14]. In [14], Stokoe's model [15] was used to detect gesture features and to perform explicit segmentation based on discontinuities in the movements of the hands. However, Stokoe's model assumes that all aspects of signs occur simultaneously, whereas much recent research has shown that sequentiality is an important characteristic of sign languages [16]. Furthermore, in [17], [18], [19], the authors used the hidden Markov model (HMM) to recognize continuous American Sign Language (ASL). They dropped whole-word context-dependent modeling in favor of explicitly modeling movement epenthesis between signs. This method of using phonological constraints [20] to aid image segmentation and to reduce the lexical search space can simplify the recognition process. However, the technology for capturing gestures is expensive: either a vision system or a data glove is needed, and the vocabulary set to be recognized remains of limited size. This paper applies phonological features of TSL to define several combinations of features as entry patterns.

For sign retrieval, the entry patterns that include the most discriminating gesture features are input first, followed by the remaining ones, until a sign word is derived. This paper proposes an innovative sign retrieval system to assist in learning or teaching TSL. The signs, tagged with gesture feature indices, are organized as a multilist code tree to


enable efficient and effective access from a large database of signs. To retrieve a specific one-handed sign, at most five sign features are required, including the initial hand shape and location, movement/orientation, and final hand shape and location. Likewise, at most 10 sign features are required to retrieve a two-handed sign. However, inputting such long feature sequences requires users to learn or memorize the long sequence that represents a sign word. In this paper, we define four entry patterns with fewer feature combinations to solve this problem. Maximum a posteriori (MAP) estimation is then exploited to find the most likely sign word given an input feature sequence. An error-tolerant sign retrieval algorithm, based on a mutual information criterion, is proposed to address the misjudgment of sign features and to enable the efficient and robust retrieval of the sign of interest. Moreover, a novel anthropomorphic interface based on these gesture features was developed to assess the performance of the proposed approach. The interface design considers human factors, including visual concentration and eye-hand coordination, and can easily be integrated into various input control devices.

The rest of this paper is organized as follows: The next section describes the sign representation and indexing of the TSL. Sections 3 and 4 present the proposed MAP-based retrieval framework and the error-tolerant sign retrieval mechanism, respectively. Section 5 summarizes the experimental results. Finally, Sections 6 and 7 present a discussion and conclusions and suggest directions for future work.

Fig. 1. Graphical illustration of signs "Father" and "Mother" in TSL.

2 SIGN REPRESENTATION AND INDEXING USING A MULTILIST CODE TREE

Sign language is a visually expressive language in which gesture information includes temporal and spatial cues. Fig. 1 depicts two TSL signs, "Father" and "Mother." Each sign can be represented as a starting gesture (a basic hand shape in TSL), followed by a motion (the orientation of the palm), and a final gesture (another hand shape). The arrow indicates the orientation. The end gesture distinguishes one sign from another.

2.1 Sign Feature Definition and Representation

According to Dr. Wayne Smith's analysis of TSL phonology, a sign includes four visual features [12]:

1. dez (designator): hand shape or configuration of the hand involved in the sign.
2. tab (tabula): location of the sign in relation to the body.
3. sig (signation): movement executed by the hands.
4. ori (orientation): orientation of the palm and fingers.

In this paper, Figs. 2 and 3 show 51 distinct hand shapes (dez) and 21 distinct locations (tab), respectively. The gesture motions in three-dimensional space have 12 degrees of freedom, including translation and rotation. Accordingly, eight modes (up, down, left, right, front, back, rotation, and shaking/wiggling) for single-handed motion and three modes (up-down, left-right, and front-back) for interactive motion of two hands are defined as motion features involving movement (sig) or orientation (ori). In this paper, a TSL vocabulary of the 1,881 distinct and most commonly used signs was collected from the teaching materials used in deaf schools [21], [22]. Each sign was tagged with these gesture features. Fig. 4 shows the number of occurrences of the 10 most frequently used gesture features in the database.

2.2 Sign Indexing Using a Multilist Code Tree (MCT)

In elucidating the hand typology of TSL, each sign can be categorized as one of three main configurations: 1) one-handed signs, 2) two-handed signs in which each hand makes the same shape, and 3) two-handed signs in which each hand makes a different hand shape. Signs made in the neck and face areas generally use only one hand, while those made below the neck normally use both hands. If both hands are involved, the hand shapes are generally the same. Hand contact with the body generally occurs at the head, trunk, arm, and hand. The hand shape(s) in the initial and final locations and the corresponding movements or orientations of each hand are considered to characterize a sign in detail. Table 1 shows the data structure for representing a TSL sign. All the indexed signs are organized as a multilist code tree (MCT) with a fixed-length feature sequence for retrieval. The MCT is similar to the trie data structure [23], shown in Fig. 5, and enables fast access instead of sequential search. Each sign s_i is represented as an ordered vertex list (I_dez_i, I_tab_i, sig_ori_i, F_dez_i, F_tab_i), where I_dez_i and F_dez_i represent the hand shape features of the initial and final gestures, respectively; sig_ori_i represents the movement or orientation feature; and I_tab_i and F_tab_i represent the location features of the initial and final gestures, respectively. Two ordered lists are used to represent each sign that involves a two-handed configuration. For building the MCT, the sign features, each denoted by a node, are sequentially linked with a downward pointer, and each terminal node represents a unique sign. This paper aims to provide users with a means of retrieving a sign word with a minimal number of gesture features.
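The MCT lookup described above can be sketched as a trie over fixed-length feature sequences. The following is a minimal illustration; the feature labels and sign names are hypothetical examples, not the paper's actual database:

```python
# Sketch of a multilist code tree (MCT): a trie over fixed-length feature
# sequences (I_dez, I_tab, sig_ori, F_dez, F_tab). Feature values and sign
# names below are hypothetical.

def build_mct(signs):
    """signs: dict mapping sign name -> fixed-length tuple of feature labels."""
    root = {}
    for name, features in signs.items():
        node = root
        for f in features:
            node = node.setdefault(f, {})   # one trie level per feature
        node["#sign"] = name                # terminal node stores the sign
    return root

def lookup(root, features):
    """Follow the full feature sequence; return the sign or None."""
    node = root
    for f in features:
        if f not in node:
            return None
        node = node[f]
    return node.get("#sign")

signs = {
    "Father": ("HS2", "Chin", "Front", "HS2", "Chin"),
    "Mother": ("HS2", "Cheek", "Front", "HS2", "Cheek"),
}
mct = build_mct(signs)
print(lookup(mct, ("HS2", "Chin", "Front", "HS2", "Chin")))  # Father
```

Because signs sharing prefix features share trie nodes, a lookup costs one dictionary access per feature rather than a sequential scan of the vocabulary.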
Based on the investigation of the phonology and hand typology of TSL, four basic combinations of gesture features are defined as entry patterns: (dez, tab), (dez, dez, tab), (dez, sig_ori, tab), and (dez, dez, sig_ori, tab). For the entry patterns (dez, tab) and (dez, sig_ori, tab), only one hand shape (dez) is input, regardless of whether it is the left or right hand shape, or the initial or final one. For the patterns (dez, dez, tab) and (dez, dez, sig_ori, tab), two hand shapes are needed. In


Fig. 2. The 51 fundamental hand shapes in TSL.

these entry patterns, users can input both the initial and final hand shapes of a single hand, or the left and right hand shapes, independently of whether they are initial or final. Fig. 6 shows the number of occurrences of the four types of entry patterns in the database. Clearly, the entry pattern (dez, dez, sig_ori, tab), which has more input features, as shown in Fig. 6d, is associated with fewer candidate sign words than the entry patterns with fewer input features.

3 MAXIMUM A POSTERIORI (MAP)-BASED RETRIEVAL

For each entry pattern input, the retrieval of a sign from the multilist code tree generally confronts a matching problem. For example, users may express a two-handed sign using only one-handed features or by omitting certain features. This situation occurs frequently, especially when driving or communicating during daily activities, and increases the number of retrieved candidates. Some visually similar features or cognitive inconsistencies among users can lead to incorrect judgments, particularly with regard to hand shape features. This paper generalizes three cases of matching to describe the retrieval process:

1. Exact matching: all the input features completely match the indexed feature sequence in the MCT.
2. Partial matching: the input features only partially match the indexed feature sequence in the MCT, because users make incorrect judgments.
3. No matching: no input features match any indexed feature sequence in the MCT, due to an error in the input or a lack of vocabulary.

Fig. 7 shows the block diagram of the proposed system. This mechanism involves two main processes: gesture feature sequence matching and hand shape recovery. First, the sign features of the input are considered to extract a few sign candidates from the MCT database. When no match

Fig. 3. The 21 fundamental locations in TSL displayed with an anthropomorphic interface.


Fig. 4. Number of occurrences for Top-10 (a) hand shapes, (b) movements and orientations, and (c) locations in the database.

TABLE 1 Data Structure for the Representation of a TSL Sign

exists, the recovery patterns, trained on real user observations as described in Section 5.2.2, replace the hand shape(s) of the user inputs. Finally, several recovered sign candidates are retrieved for further probabilistic ranking.

This paper proposes a MAP-based approach to model the feature matching process and exploits it to rank candidate signs according to their estimated relevance to the query. This ranking mechanism accounts for the entry patterns, the feature matching cases, and the a priori probabilities of the related gesture features. Table 2 shows examples of the exact matching, partial matching, and no matching cases for a given sequence of indexed features, s_j = (HS2, Front_Body, Front, HS27, Front_Body). In exact matching, users may input a query sequence X = (HS2, HS27, Front_Body) in which all the input features match the indexed sequence. In partial matching, users may input a query sequence X = (HS2, HS19, Front_Body) in which the features HS2 and Front_Body are matched. A binary sequence (1, 0, 1) is used to illustrate feature matching in this example; a "0" represents an incorrect judgment of the input hand shape feature. HS19 can be recovered and replaced by HS27. In the same way, for no matching, HS39 and HS17 can be recovered and replaced by HS2 and HS27, respectively. In this case, only the most likely indexed feature sequence is retrieved as a candidate.

Given the input sign feature sequence X = (x_1, x_2, ..., x_n), the retrieval involves finding the sign s_j = (y_j1, y_j2, ..., y_jm) with the highest conditional probability, as follows:

s* = arg max_j P(s_j | X)
   = arg max_j P(X | s_j) P(s_j)
   = arg max_j P(x_1, x_2, ..., x_n | y_j1, y_j2, ..., y_jm) P(y_j1, y_j2, ..., y_jm),   (1)

where x_i represents the ith input gesture feature and y_jk represents the kth gesture feature of the jth feature sequence in the MCT database; m and n, with m = 5 and n = 2, 3, 4 according to the definition of the matching cases, represent the numbers of indexed features in the MCT and of input features, respectively. The probability P(X | s_j) is defined as an alignment probability of the matched sequence with the input feature sequence; it describes the feature matching case. P(s_j) represents the a priori probability of the matched


Fig. 5. Example of the MCT presentation.

Fig. 6. Top-10 number of occurrences of four types of entry patterns: (a) (dez, tab), (b) (dez,dez, tab), (c) (dez, sig_ori, tab), and (d) (dez, dez, sig_ori, tab).

features in the MCT. The following sections describe the estimation of these two probabilities.
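The MAP decision of (1) amounts to scoring each candidate sign by the product of its alignment probability and its a priori probability and keeping the argmax. A toy sketch, with made-up probability tables (not the paper's estimates):

```python
# MAP retrieval sketch for (1): pick the sign maximizing P(X|s) * P(s).
# The probability tables below are illustrative numbers only.

def map_retrieve(likelihood, prior):
    """likelihood: P(X|s) per sign; prior: P(s) per sign."""
    return max(likelihood, key=lambda s: likelihood[s] * prior[s])

likelihood = {"Father": 0.6, "Mother": 0.3, "Teacher": 0.1}
prior      = {"Father": 0.2, "Mother": 0.5, "Teacher": 0.3}
print(map_retrieve(likelihood, prior))  # Mother: 0.3*0.5 = 0.15 beats 0.6*0.2 = 0.12
```

Note that a sign with a weaker match can still win the ranking when its prior probability is high enough, which is exactly the behavior the error-tolerant mechanism relies on.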

3.1 Alignment Probability Estimation

As illustrated in Table 2, the alignment probability follows a multivariate binary distribution in which every matched indexed feature takes the value 0 or 1. Consider a sequence of input features, each of which, independently of the others, matches or does not match the indexed features; the probability of a successful match is an unknown constant. This unknown probability can be estimated from a limited number of observations by adopting a kernel averaging method [24]. The kernel density is approximated as a sum of kernel functions. In our approach, sign s_j can be


retrieved by N distinct feature sequences, {X_i, 1 ≤ i ≤ N}, with some features omitted. The probability of the input sign feature sequence X with respect to sign s_j is estimated using the kernel density as follows:

P̂(X | s_j) = (1/N) Σ_{i=1}^{N} K(X_i, s_j) = (1/N) Σ_{i=1}^{N} Π_{k=1}^{n} K_o(x_k^i, y_k^j),   (2)

with

K_o(x_k^i, y_k^j) = λ, if x_k^i = y_k^j; 1 − λ, if x_k^i ≠ y_k^j,   (3)

where K(X_i, s_j) and K_o(x_k^i, y_k^j) are the kernel functions and λ represents the contribution of each feature to the estimate. In (3), the alignment between x_k^i and y_k^j is performed according to the gesture feature; for example, the hand shape feature x_k^i in X_i is matched with the hand shape feature y_k^j in s_j. To estimate λ, Aitchison and Aitken recommend the jackknife method as yielding a good value [24]. The estimation is described as follows:

λ* = arg max_λ W(λ | X_i) = arg max_λ (1/n) Σ_{k=1}^{n} log P̂(x_k^i | X_i − x_k^i, λ),   (4)

where P̂(x_k^i | X_i − x_k^i, λ) is the estimate of the probability based on the feature sequence X_i − x_k^i = {x_1^i, ..., x_{k−1}^i, x_{k+1}^i, ..., x_n^i}; the log-likelihood is averaged over each choice of the omitted x_k^i. It follows from plugging (3) into (2) that the alignment probability can be computed as follows:

P(X | s_j) ≈ (1/N) Σ_{i=1}^{N} λ^{δ(X_i, s_j)} (1 − λ)^{n − δ(X_i, s_j)},   (5)

where δ(X_i, s_j) indicates the number of matched features between the input sequence and the indexed sequence in the MCT. This alignment probability can be regarded as a weighting factor for distinguishing the feature matching cases. The unmatched component is used to describe the partial matching or no matching situations; it occurs most often with respect to the hand shape feature. Cases of incorrect judgment are rectified using the similarity measure and error recovery strategies described in Sections 4.1 and 4.2.

Fig. 7. The error-tolerant retrieval framework.
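The alignment probability of (5) can be computed directly from match counts. A minimal sketch, with hypothetical feature sequences and an assumed kernel weight λ = 0.9:

```python
# Kernel estimate of the alignment probability, following (5):
#   P(X|s_j) ≈ (1/N) Σ_i λ^δ(X_i, s_j) * (1 − λ)^(n − δ(X_i, s_j)),
# where δ counts positionwise feature matches. Feature labels are toy values.

def delta(x, y):
    """Number of matched features between input x and indexed sequence y."""
    return sum(1 for a, b in zip(x, y) if a == b)

def alignment_prob(variants, indexed, lam):
    """variants: the N distinct feature sequences X_i that can retrieve the sign."""
    n = len(variants[0])
    return sum(lam ** delta(x, indexed) * (1 - lam) ** (n - delta(x, indexed))
               for x in variants) / len(variants)

indexed  = ("HS2", "Front_Body", "Front")
variants = [("HS2", "Front_Body", "Front"),    # exact match
            ("HS19", "Front_Body", "Front")]   # one misjudged hand shape
p = alignment_prob(variants, indexed, lam=0.9)
print(round(p, 4))  # (0.9**3 + 0.9**2 * 0.1) / 2
```

A mismatch multiplies the term by (1 − λ) instead of λ, so candidates with more matched features receive sharply higher weight.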

3.2 A Priori Probability Estimation

The a priori probability of the sequence in the MCT can be modeled by assuming that the components of the feature sequence s_j = (y_j1, y_j2, ..., y_jm) are statistically independent, i.e.,

P(s_j) = P(y_j1, y_j2, ..., y_jm) = Π_{k=1}^{m} P(y_jk).   (6)

The parameter of interest is the occurrence probability P(y_jk) for the independent features. Each feature y_jk of the feature sequence s_j can take on θ_k possible outcomes c_k(1), c_k(2), ..., c_k(θ_k) with respective probabilities p = (p_k(1), ..., p_k(θ_k)). For example, the feature dez has 51 possible hand shape outcomes (θ_k = 51). Now, suppose that the vector p = (p_k(1), ..., p_k(θ_k)) is specified by a "uniform" distribution. Such a distribution is of the form

P(p) = d_k, if Σ_{ν=1}^{θ_k} p_k(ν) = 1 and 0 ≤ p_k(ν) ≤ 1 for ν = 1, ..., θ_k; 0, otherwise.   (7)
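The Bayesian estimate of a feature's occurrence probability under this uniform prior, worked out in the equations that follow, reduces to add-one (Laplace) smoothing, (l_k(ν) + 1)/(L + θ_k). A sketch with toy counts (not the paper's database statistics; for simplicity θ_k is taken as the size of the toy count table rather than the paper's 51):

```python
# Bayesian (uniform-Dirichlet) estimate of a feature's occurrence probability,
# which reduces to add-one smoothing: P(y_k = c_k(v)) = (l_k(v) + 1) / (L + theta_k).
# Counts below are toy numbers, not the paper's 1,881-sign database statistics.

def feature_prob(counts, outcome):
    """counts: occurrences l_k(v) of each outcome across the L indexed signs."""
    L = sum(counts.values())
    theta_k = len(counts)   # number of possible outcomes for feature k (toy value)
    return (counts.get(outcome, 0) + 1) / (L + theta_k)

dez_counts = {"HS2": 300, "HS27": 150, "HS19": 50}   # hypothetical l_k(v)
print(round(feature_prob(dez_counts, "HS2"), 4))     # (300 + 1) / (500 + 3)
```

The "+1" keeps the estimate nonzero for outcomes that never occur in the database, which matters later when unmatched or recovered features must still be scored.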

TABLE 2 An Example for Illustrating Exact Matching, Partial Matching, and No Matching Cases


This multivariate distribution is a special case known as the Dirichlet distribution [25], [26]. Imposing the constraint Σ_{ν=1}^{θ_k} p_k(ν) = 1 gives d_k = (θ_k − 1)!. The Bayesian estimator is used to compute the probability of a feature, as follows:

P(y_jk) = ∫ P(y_jk | p) P(y_k^1, ..., y_k^L | p) P(p) dp / ∫ P(y_k^1, ..., y_k^L | p) P(p) dp,   (8)

where L (= 1,881) is the number of signs in the MCT and

P(y_jk | p) = p_k(ν), for y_jk = c_k(ν), ν = 1, ..., θ_k.   (9)

Suppose that the discrete probability P(y_k^1, ..., y_k^L | p) can be modeled as a multinomial distribution [25], [26] and that the features are random and mutually independent. The denominator is computed as follows:

∫ P(y_k^1, ..., y_k^L | p) P(p) dp = d_k · [L! / (l_k(1)! ··· l_k(θ_k)!)] · ∫_{0 ≤ p_k(ν) ≤ 1, Σ_ν p_k(ν) = 1} Π_{ν=1}^{θ_k} (p_k(ν))^{l_k(ν)} dp_k(1) ··· dp_k(θ_k),   (10)

where l_k(ν) is the number of feature sequences in the MCT whose kth feature takes the νth outcome. Using the fact that Σ_{ν=1}^{θ_k} l_k(ν) = L and the properties of the Dirichlet distribution, the following definite integral is obtained:

∫ P(y_k^1, ..., y_k^L | p) P(p) dp = [d_k · L! / (l_k(1)! ··· l_k(θ_k)!)] · [Γ(l_k(1) + 1) ··· Γ(l_k(θ_k) + 1) / Γ(l_k(1) + ··· + l_k(θ_k) + θ_k)].   (11)

The definite integral of the numerator is derived using a similar approach. Dividing the numerator by the denominator yields the probability P(y_jk), as follows:

P(y_jk = c_k(ν)) ≈ [Γ(l_k(ν) + 2) Γ(L + θ_k)] / [Γ(l_k(ν) + 1) Γ(L + θ_k + 1)] = (l_k(ν) + 1) / (L + θ_k).   (12)

Finally, the a priori probability can be written as

P(s_j) ≈ Π_{k=1}^{m} P(y_jk = c_k(ν)) = Π_{k=1}^{m} (l_k(ν) + 1) / (L + θ_k).   (13)

4 ERROR-TOLERANT SIGN RETRIEVAL

4.1 Hand Shape Recovery for Partial Matching Using a Similarity Measure

Users of a sign retrieval system are likely to select a wrong feature because of the visual similarities among some hand shapes, resulting in partial matching as defined above. This paper proposes a similarity measure to retrieve signs with similar hand shapes. Hence, a feature judgment matrix for the 51 hand shapes is defined as follows:

D = [d_{i,j}], i, j = 1, ..., 51, with rows indexed by the test hand shapes T_h_1, ..., T_h_51 and columns by the reference hand shapes R_h_1, ..., R_h_51, where d_{i,j} = 1 for a correct judgment and d_{i,j} = 0 for an incorrect judgment,   (14)

where T_h_i represents the ith hand shape under test, R_h_j represents the jth reference hand shape, and d_{i,j} records the judgment of whether T_h_i is similar to R_h_j. In this paper, an empirical study was conducted to obtain the similarity matrix; Section 5.2.1 describes the related approaches and experiments. The judgment matrices of all tested subjects are accumulated and investigated. Furthermore, the cosine measure function is adopted to estimate the similarity between hand shapes, as follows [10]:

Sim(h_i, h_j) = (ad_i · ad_j) / (|ad_i| · |ad_j|) = Σ_{t=1}^{51} ad_{i,t} ad_{j,t} / sqrt(Σ_{t=1}^{51} (ad_{i,t})² · Σ_{t=1}^{51} (ad_{j,t})²),   (15)

where ad_{i,j} represents the number of accumulated correct judgments by all subjects about the similarity between test hand shape i and reference hand shape j, and ad_i = (ad_{i,1}, ad_{i,2}, ..., ad_{i,51}) represents the vector whose elements are the total numbers of correct judgments about hand shape i. In this matrix, a higher value of ad_{i,j} represents stronger agreement among users on the similarity between the ith test hand shape and the jth reference hand shape. In the partial-matching situation, the similarity Sim(h_r, h_m) between two hand shapes is regarded as a weighting factor for the recovery of h_r from h_m. Accordingly, a weighted a priori probability for h_r is defined as follows:

P̂(y_jk = h_r | y_jk = h_m) = Sim(h_r, h_m) · P(y_jk = h_r) = Sim(h_r, h_m) · (l_k(r) + 1) / (L + θ_k),   (16)

where Sim(h_m, h_m) = 1 represents exact matching. Finally, in the partial matching situation, the retrieval function can be written as follows:

s* = arg max_j P(s_j | X)
   = arg max_j [ (1/N) Σ_{i=1}^{N} λ^{δ(X_i, s_j)} (1 − λ)^{n − δ(X_i, s_j)} ] · [ Π_{a=1}^{n_dez} Sim(h_r, h_m) P(y_ja = h_r) · Π_{w=1}^{n − n_dez} P(y_jw) ],   (17)

where n_dez represents the number of input hand shapes. This approach shows the discriminative power of the retrieved candidates.
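The cosine similarity of (15) compares rows of the accumulated-judgment matrix as vectors. A sketch with a toy 3×3 count table (the paper's matrix is 51×51):

```python
# Cosine similarity between hand shapes, as in (15): rows of the
# accumulated-judgment matrix ad are compared as vectors.
# The 3x3 judgment counts below are toy values.

import math

def cosine_sim(ad, i, j):
    dot = sum(a * b for a, b in zip(ad[i], ad[j]))
    norm = math.sqrt(sum(a * a for a in ad[i])) * math.sqrt(sum(b * b for b in ad[j]))
    return dot / norm

ad = [
    [9, 3, 0],   # accumulated judgments for test hand shape 0
    [3, 8, 1],
    [0, 1, 7],
]
print(round(cosine_sim(ad, 0, 1), 4))
```

Because the judgment counts are nonnegative, the similarity lies in [0, 1], so it can be used directly as the multiplicative weight in (16) without rescaling.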

4.2 Error Recovery for No Matching Following the investigation of feature matching cases, an error-recovery mechanism is proposed for the no matching situation. A database of a large number of hand shape recovery patterns was collected from an empirical study described in Section 5.2.2. This information was used to

retrieve a possible hand shape h_t from the frequently misjudged h_s. The relationship between h_t and h_s, called a recovery pattern, was trained using the average mutual information (AMI) criterion [27], as follows:

AMI(h_t, h_s) = P(h_t, h_s) log[P(h_t, h_s) / (P(h_t) P(h_s))] + P(h_t, h̄_s) log[P(h_t, h̄_s) / (P(h_t) P(h̄_s))] + P(h̄_t, h_s) log[P(h̄_t, h_s) / (P(h̄_t) P(h_s))] + P(h̄_t, h̄_s) log[P(h̄_t, h̄_s) / (P(h̄_t) P(h̄_s))],   (18)

where h̄_s specifies that h_s does not occur. P(h_t, h_s) and P(h̄_t, h̄_s) represent the probabilities of the co-occurrence of (h_t, h_s) and (h̄_t, h̄_s), respectively; P(h_t, h̄_s) and P(h̄_t, h_s) reveal the discrepancy of (h_t, h̄_s) and (h̄_t, h_s), respectively. This AMI measure is used to normalize the mutual information of each recovery pattern. In the no-matching situation, a weighting RP(h_t, h_s) is proposed to decrease the contribution of the a priori probability of the recovery of hand shape h_t from h_s, and is defined as follows:

RP(h_t, h_s) = { P(h_t, h_s) log[P(h_t, h_s) / (P(h_t) P(h_s))] + P(h̄_t, h̄_s) log[P(h̄_t, h̄_s) / (P(h̄_t) P(h̄_s))] } / AMI(h_t, h_s),   (19)

where the co-occurrence terms in the numerator represent the probabilistic dependency between h_t and h_s. This approach can also be applied to recover other types of gesture features. The weighted a priori probability of recovering h_t from h_s is rewritten as follows:

P̂(y_jk = h_t | y_jk = h_s) = RP(h_t, h_s) · (l_k(t) + 1) / (L + θ_k).   (20)

Furthermore, a recovery function RF¹(X, s_j) is defined to compute the weighted a priori probability of the recovery of a certain input hand shape, as follows:

RF¹(X, s_j) = max( Π_{a=1}^{2} γ(a, k) · P̂(y_jk = h_t | y_jk = h_s), Π_{a=1}^{2} γ(2 − a + 1, k) · P̂(y_jk = h_d | y_jk = h_q) ),   (21)

where h_d and h_q are another two hand shapes such that the possible hand shape h_d is recovered from the frequently misjudged h_q, and

γ(a, k) = RP(h_t, h_s), if a = k; 1, if a ≠ k.   (22)

Moreover, a recovery function RF²(X, s_j) is defined to compute the weighted a priori probability of the recovery of both input hand shapes, as follows:

RF²(X, s_j) = max_{t,d} { RP(h_t, h_s) · RP(h_d, h_q) · P̂(y_jk1 = h_t, y_jk2 = h_d | y_jk1 = h_s, y_jk2 = h_q) }.   (23)

Finally, in the no-matching case, the retrieval function can be written as follows:

s* = arg max_j P(s_j | X)
   = arg max_j [ (1/N) Σ_{i=1}^{N} λ^{δ(X_i, s_j)} (1 − λ)^{n − δ(X_i, s_j)} ] · [ RF(X, s_j) · Π_{w=1}^{n − n_dez} P(y_jw) ].   (24)

Considering the exact matching, partial matching, and no matching conditions, the final retrieval function can be summarized as follows:

s* = arg max_j P(s_j | X)
   = for exact and partial matching:
       arg max_j [ (1/N) Σ_{i=1}^{N} λ^{δ(X_i, s_j)} (1 − λ)^{n − δ(X_i, s_j)} ] · [ Π_{a=1}^{n_dez} Sim(h_r, h_m) P(y_ja = h_r) · Π_{w=1}^{n − n_dez} P(y_jw) ];
     for no matching:
       arg max_j [ (1/N) Σ_{i=1}^{N} λ^{δ(X_i, s_j)} (1 − λ)^{n − δ(X_i, s_j)} ] · [ RF(X, s_j) · Π_{w=1}^{n − n_dez} P(y_jw) ].   (25)
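The AMI criterion of (18) sums the pointwise mutual information over the four presence/absence cells of a 2×2 contingency table. A sketch with hypothetical co-occurrence counts (not the paper's observation data):

```python
# Average mutual information (AMI) of a recovery pattern (h_t, h_s),
# following (18): sum over the four co-occurrence/absence cells.
# The joint counts below are toy observation numbers.

import math

def ami(table):
    """table[(t, s)]: counts for h_t present/absent vs. h_s present/absent."""
    n = sum(table.values())
    pt = {v: (table[(v, True)] + table[(v, False)]) / n for v in (True, False)}
    ps = {v: (table[(True, v)] + table[(False, v)]) / n for v in (True, False)}
    total = 0.0
    for t in (True, False):
        for s in (True, False):
            p = table[(t, s)] / n
            if p > 0:   # cells with zero count contribute nothing
                total += p * math.log(p / (pt[t] * ps[s]))
    return total

table = {(True, True): 40, (True, False): 10,
         (False, True): 5, (False, False): 45}
print(round(ami(table), 4))
```

A strongly associated pair yields a large AMI, while statistically independent hand shapes yield an AMI of zero; dividing selected cells by this total, as in (19), normalizes each recovery pattern's weight.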

5 EXPERIMENTAL RESULTS

5.1 Functional Evaluation of Sign Retrieval

Our proposed approach enables users to enter features with a minimal number of keystrokes to retrieve a sign word of interest. The query for an entry pattern is input as an incomplete feature sequence and compared with the indexed feature sequences in the MCT. This kind of feature matching problem is similar to the longest common subsequence problem. To deal with this string-matching problem, many efficient algorithms, such as the Knuth-Morris-Pratt and Boyer-Moore methods, have been used [10], [28]. However, these algorithms are effectively restricted to sequential searches of completely specified query patterns. In this paper, the dynamic programming (DP) approach, a more flexible and popular method for matching features, was compared with our proposed approach [10], [29], [30]. In the DP approach, a matrix C[0..n, 0..m] is filled column by column, where C[i, j] represents the overall cost, i.e., the minimum number of mismatches needed to match P_1..i to a suffix of T_1..j, given the initial conditions C[0, j] = 0 and C[i, 0] = i. The computation is as follows:

C[i, j] = C[i − 1, j − 1], if P_i = T_j;
          1 + min(C[i − 1, j], C[i, j − 1], C[i − 1, j − 1]), otherwise,   (26)

where P_i represents the ith feature in the input sequence and T_j represents the jth feature/node in the MCT.

This functional evaluation consists of three experiments, involving no deletion of gesture features, deletion of a hand shape, and deletion of the motion feature within a given entry pattern. For these three experiments, the 1,881 distinct signs with their corresponding gesture features were adopted to construct the MCT and were also used to estimate the alignment


TABLE 3 Comparison of Dynamic Programming Algorithm and Our Proposed Approach (Database Size = 1,881 Signs)

and the a priori probabilities. In the following experiments, 380 signs involving one-handed and two-handed configurations were randomly selected from our sign database and used as the testing database. In the experiment without deletion, the gesture features in the entry pattern (dez, dez, sig_ori, tab) formed the indexed feature sequence in the testing database without feature deletion. In the hand shape deletion experiment, a hand shape in each indexed feature sequence in the testing database was omitted to form a displaced entry sequence, such as (dez, sig_ori, tab). In the motion feature deletion experiment, the motion feature in each indexed feature sequence in the testing database was omitted to form a displaced query sequence, such as (dez, dez, tab). In the evaluation, a retrieval was regarded as correct if the target sign of interest was included in the Top-N candidates of the retrieved signs. Table 3 shows the evaluation results. The proposed approach outperformed the DP-based approach, and the hand shape proved to be an important feature for sign retrieval. In the experiments with no deletion and with motion feature deletion, the proposed approach yields a retrieval rate above 90 percent under the Top-15 condition. In the experiment with hand shape deletion, the proposed approach underperforms because one-handed signs were retrieved simultaneously. The DP-based approach performed poorly because most of the retrieved candidate signs had the same number of mismatches and could not be ranked. However, the retrieval rates obtained under the Top-1 to Top-10 conditions were not satisfactory. We attribute this to the fact that at most four input features are given in the entry pattern; this suggests a tradeoff between fewer input features and a higher retrieval rate.
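The DP baseline of (26) can be sketched as follows, filling C with C[0][j] = 0 and C[i][0] = i so that the minimum of the last row gives the fewest mismatches for matching the query pattern P anywhere inside an indexed sequence T. Feature labels are toy values:

```python
# DP approximate-matching baseline from (26): C[0][j] = 0, C[i][0] = i,
# C[i][j] = C[i-1][j-1] if P_i == T_j, else 1 + min of the three neighbors.
# The minimum of the final row is the best mismatch cost for matching P
# to a suffix ending at any position of T. Feature labels are toy values.

def dp_match_cost(P, T):
    n, m = len(P), len(T)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i                       # matching P_1..i against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if P[i - 1] == T[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return min(C[n])                      # best cost over all end positions

T = ("HS2", "Front_Body", "Front", "HS27", "Front_Body")
print(dp_match_cost(("HS2", "HS27", "Front_Body"), T))  # 1
```

As the experiments above note, many candidates tie on this single mismatch count, which is why the DP baseline cannot rank them the way the probabilistic scores in (25) can.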

5.2 Development and Evaluation of the Retrieval Interface

To develop the sign retrieval interface, 83 gesture features, including 51 hand shapes, 21 locations, and 11 motion features, were adopted. In this paper, Fitts' law was applied as the guiding principle for the design of this interface [31]. This law is formulated as follows:

MT = a + b log2(A/W + 0.5),   (27)

where MT represents the time spent selecting a target of width W that lies at a distance A from a reference point, generally the upper-left corner of a display window; a and b are empirical constants that depend on the user's response. This law implies that a small A and a large W result in a small MT. Moreover, integrating 83 distinct symbols (gesture features) into the limited display area is a challenge of the interface design. Accordingly, we cluster hand shapes and motion features into several categories and arrange hand shapes into a two-layer layout, with cluster indices in the first layer and the corresponding hand shapes in the second layer.
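The tradeoff expressed by (27) can be checked numerically; the constants a and b below are made-up placeholders for a user's response, not values measured in the paper.

```python
# Illustrative evaluation of Fitts' law (Eq. 27): MT = a + b*log2(A/W + 0.5).
# The constants a and b are hypothetical; real values are fit per user.
import math

def movement_time(A, W, a=0.2, b=0.1):
    """Predicted selection time for a target of width W at distance A."""
    return a + b * math.log2(A / W + 0.5)

# A smaller distance A and a larger width W both reduce MT, which motivates
# clustering the 83 gesture features into a two-layer layout of large targets.
near_big  = movement_time(A=100, W=50)   # close, large target
far_small = movement_time(A=400, W=10)   # far, small target
print(near_big, far_small)
```

The close, large target always yields the smaller predicted MT, which is the rationale for grouping hand shapes into clusters rather than laying out all 83 symbols at once.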

1. Experiment on Similarity and Clustering of Hand Shapes. As stated in Section 4.1, the visual similarity between hand shape features was used to categorize the 51 hand shapes into several clusters. For this purpose, an empirical study was conducted involving 23 profoundly deaf students (17 males and 9 females, with an average age of 12.2) at the National Tainan Deaf School in Taiwan. The students were in the fifth grade. In the first study, of hand shape similarities recognized by human judgment, each hand shape was subjectively judged to resemble several other hand shapes. The hand shape similarity matrix was then derived as in (14). Cochran's Q test was applied to evaluate the consistency among the subjects' judgments of each hand shape [32]. Table 4 shows the contingency table between test and reference hand shapes. Each similarity test in this table is extracted from the row component in (14). Let R_i, i = 1, 2, ..., r (r = 23), represent the row totals and C_j, j = 1, 2, ..., c (c = 51), the column totals in the contingency table. The test statistic is then calculated as follows:

Q = c(c-1) [Σ_{j=1..c} (C_j - N/c)^2] / [Σ_{i=1..r} R_i(c - R_i)],   (28)
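The statistic in (28) can be computed directly from a binary contingency table; the toy table below (rows = subjects, columns = hand shapes) is illustrative, not the paper's empirical data.

```python
# Sketch of the Cochran's Q statistic in Eq. (28) for a binary contingency
# table of subject judgments; the toy table here is a made-up example.

def cochran_q(table):
    r = len(table)            # number of subjects (rows)
    c = len(table[0])         # number of hand shapes (columns)
    R = [sum(row) for row in table]                       # row totals R_i
    C = [sum(row[j] for row in table) for j in range(c)]  # column totals C_j
    N = sum(R)                # total number of ones in the table
    num = c * (c - 1) * sum((Cj - N / c) ** 2 for Cj in C)
    den = sum(Ri * (c - Ri) for Ri in R)
    return num / den

table = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]
print(cochran_q(table))  # 28/6, about 4.667
```

For this table the statistic agrees with the standard two-way form of Cochran's Q, (c-1)[c ΣC_j² - (ΣC_j)²] / (cN - ΣR_i²), which is an algebraic rearrangement of (28).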


TABLE 4 Contingency Table for Data Layout of Each Hand Shape

where N represents the total number of ones in the table. The evaluation implies consistency of hand shape judgment among subjects at a significance level of p < 0.1; the deaf students were familiar with the gesture features, which were taken from formal teaching materials. These empirical data were then used as described in Section 4.1. The accumulated feature similarity matrix was further clustered, using Ward's method and the Pearson-r distance measure [33], into several clusters of similar hand shapes. The Wendy Sandler feature system [34] and a manual clustering approach were also considered for comparison. The Wendy Sandler-based criteria can be illustrated as a hierarchical structure: the first layer represents the number of fingers; the second layer specifies the flex type (open, closed, curved, bent, or spread) of each finger; the third layer specifies the flex type of the nonselected finger(s); and the final layer represents the contact between fingers. The manual clustering was performed by Professor Y.-S. Deng of the Department of Industrial Design, National Cheng Kung University, Taiwan, who is familiar with our approach and clustered the hand shapes used in the experiments. Fig. 8 shows the clustering results obtained using these approaches. The hand shapes in each cluster obtained by the proposed and manual clustering approaches are visually similar to each other.

2. Experiments on Extraction of Hand Shape Recovery Patterns. Based on the results of hand shape clustering, a second empirical study was conducted. Fig. 9 depicts the interface for evaluating the performance of hand shape selection. The selection process defined in this paper involves first selecting a particular cluster and then selecting a hand shape of interest. The left of the main window shows the testing vocabulary; the right side shows the clusters, illustrated as icons for the left and right hands, respectively. In this interface, the cluster icons are extracted from the first column of hand shapes shown in Fig. 8c and arranged in a first-layer row-column layout. When users select a particular cluster icon, the cluster window displays a second-layer row-column layout with the hand shapes in the corresponding cluster, as shown in cluster (5) in Fig. 8c. When a desired hand shape is selected, users press a button to complete the selection. In this paper, 255 signs, in which each hand shape occurs five times, were randomly selected from the collection for assessing the accuracy of hand shape selection. Moreover, certain system descriptors are defined to describe their relationship to the retrieval rate. These descriptors are the scanning number, the scanning time, and the average scanning time, which are recorded during the selection of a desired hand shape. The scanning number represents the total number of keystrokes, and the scanning time represents the time spent during selection. The average scanning time T_avg describes the time spent by users in selecting a cluster or a hand shape. It is defined as follows:

T_avg = Scanning time / Scanning number.   (29)

Similar evaluation criteria were addressed in [6] and [9]. Furthermore, selection may be performed repeatedly to select the correct hand shape after a preceding wrongly judged one. This information was collected and used to obtain the hand shape recovery patterns using the average mutual information criterion. Table 5 compares hand shape selection performance. The proposed approach outperformed Wendy Sandler's and the manual clustering approaches in terms of the scanning numbers, the number of mistaken selections, and the time taken, because the proposed approach was developed from real observations.

3. Case Study for Practical Evaluation. A novel interface for inputting gesture features was designed to retrieve a sign word of interest. Fig. 10 shows the sign retrieval interface. Fitts' law and the two-layer row-column layout of hand shapes were applied as stated above. This interface allows users to select a sequence of visual gesture features using a mouse; it then ranks and displays several candidate sign words with textual illustrations for user selection. In this interface, the motion features were simplified and categorized into five classes: the front, back, left, and right features and the interactive motion of the two hands. For evaluation, a training program funded by the National Science Council in Taiwan was undertaken with 23 deaf students over a three-week period. Each training course involved 50 minutes of practice and 10 minutes of conversation. A testing vocabulary set of 50 signs was randomly selected in each period to evaluate the retrieval rate for each subject. Table 6 shows the results of the evaluation. Clearly, the performance was significantly improved after training; the proposed error-tolerant retrieval mechanism worked well in practical situations, and users became increasingly familiar with the system functions.
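The clustering step in item 1 — Ward's method with a Pearson-r distance — can be sketched in plain Python using the Lance-Williams recurrence; the toy feature vectors below are illustrative, not the empirical similarity data collected from the signers.

```python
# Sketch of agglomerative clustering with Ward linkage over a Pearson-r
# distance, via the Lance-Williams update. Toy data; assumes vectors are
# nonconstant so the correlation is well defined.

def pearson_distance(u, v):
    """1 - Pearson correlation between two feature vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return 1.0 - cov / (su * sv)

def ward_cluster(vectors, k):
    """Merge clusters under Ward's criterion until k clusters remain."""
    clusters = {i: [i] for i in range(len(vectors))}
    dist = {(i, j): pearson_distance(vectors[i], vectors[j])
            for i in clusters for j in clusters if i < j}

    def d(i, j):
        return dist[(i, j)] if i < j else dist[(j, i)]

    next_id = len(vectors)
    while len(clusters) > k:
        # merge the closest pair of active clusters
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: d(*p))
        ni, nj = len(clusters[i]), len(clusters[j])
        for m in clusters:
            if m in (i, j):
                continue
            nm = len(clusters[m])
            # Lance-Williams update for Ward's method
            dist[(min(m, next_id), max(m, next_id))] = (
                ((ni + nm) * d(i, m) ** 2 + (nj + nm) * d(j, m) ** 2
                 - nm * d(i, j) ** 2) / (ni + nj + nm)) ** 0.5
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Two positively correlated pairs of toy "hand shape" feature vectors:
feats = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
print(ward_cluster(feats, 2))
```

Because the Pearson-r distance depends only on the shape of each feature profile, the two perfectly correlated pairs merge first, yielding the two expected clusters.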

6 DISCUSSION

We have shown an error-tolerant framework for retrieving a TSL sign word from a database of 1,881 signs. While it


Fig. 8. Graphical illustration of the clustering results with (a) the proposed, (b) Wendy Sandler’s, and (c) manual clustering approaches.

could be argued that the system performance depends on the entry patterns and the user interface, a further consideration is that more feature inputs are needed to improve retrieval performance, at the cost of increased search time. To meet this demand, phonological constraints could be applied to reduce the lexical search space. For modeling the phonological rules, statistical models such as the hidden Markov model (HMM), the parallel HMM [18], and the finite state machine [35] are suggested. This study can draw on the phonological analysis of ASL, which is also crucial in sign language recognition [16], [18], [20]. In addition, as shown by the evaluation of the retrieval interface, human judgment in the recognition of symbols plays an important role in this task.


Fig. 9. Interface for evaluating the selection performance of hand shapes.

TABLE 5 Comparison of Hand Shape Selection Performance for the Proposed, Wendy Sandler’s, and Manual Clustering Approaches

Fig. 10. The interface for sign feature input and text generation.


TABLE 6 Case Study Results

This kind of development is highly empirical. In this paper, a study involving 23 native signers was conducted to explore this behavior with respect to the visual similarity and recovery patterns of hand shapes. Other gesture features could be examined using the same method. Investigating the user independence of the proposed system requires collecting data from many subjects; this would contribute to understanding different subjects' own variations in symbol judgment and the entry process.


7 CONCLUSIONS

This work has proposed an innovative sign retrieval system for the deaf to assist in teaching and learning TSL. The proposed theoretical approach, based on the maximum a posteriori framework, models the retrieval behavior by estimating the probability of matching and the occurrence of entry feature patterns. Experimental results reveal that the error-tolerant mechanism is robust and discriminative in retrieving a sign word of interest compared with the DP approach. In an investigation of the linguistic properties of TSL, the gesture feature set was used as an indexing strategy, and the benefits of developing an alternative symbol system for alternative communication were explored. The experimental results show that a significant improvement in performance was achieved. In the future, statistical grammar rules and language models for sign language can be integrated into this framework as the front end of an augmentative communication tool. This design concept will be extended to translation between natural language and sign language, as well as image-based TSL synthesis.


ACKNOWLEDGMENTS

The authors would like to thank the National Science Council, Taiwan, Republic of China, for financially supporting this research under contract no. NSC91-2614-H-006-003F20. They would also like to thank Professor Yi-Shin Deng for his professional assistance in the design of the sign retrieval interface.


REFERENCES

[1] L.L. Lloyd, D.R. Fuller, and H.H. Arvidson, Augmentative and Alternative Communication: A Handbook of Principles and Practices. Allyn and Bacon, Inc., 1997.
[2] P. Demasco and K.F. McCoy, "Generating Text from Compressed Input: An Intelligent Interface for People with Severe Motor Impairments," Comm. ACM, vol. 35, no. 5, pp. 68-78, 1992.
[3] A.L. Swiffin, J.L. Arnott, and A.F. Newell, "Adaptive and Predictive Techniques in a Communication Prosthesis," Augmentative and Alternative Comm., vol. 3, no. 4, pp. 181-191, 1987.
[4] A.M. Cook and S.M. Hussey, Assistive Technologies: Principles and Practice. St. Louis, MO: Mosby-Year Book, 1995.
[5] S.K. Chang, G. Costagliola, S. Orefice, et al., "A Methodology for Iconic Language Design with Application to Augmentative Communication," Proc. 1992 IEEE Workshop Visual Language, pp. 110-116, 1992.
[6] J.G. Webster, A.M. Cook, W.J. Tompkins, and G.C. Vanderheiden, Electronic Devices for Rehabilitation. John Wiley & Sons, 1985.
[7] F. Alonso, A. de Antonio, J.L. Fuertes, et al., "Teaching Communication Skills to Hearing-Impaired Children," IEEE Multimedia, pp. 55-67, 1995.
[8] R.C. Simpson and H.H. Koester, "Adaptive One-Switch Row-Column Scanning," IEEE Trans. Rehabilitation Eng., vol. 7, no. 4, pp. 464-473, 1999.
[9] J. Darragh and I. Witten, The Reactive Keyboard. Cambridge Univ. Press, 1992.
[10] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, 1999.
[11] S. Berretti, A.D. Bimbo, and P. Pala, "Retrieval by Shape Similarity with Perceptual Distance and Effective Indexing," IEEE Trans. Multimedia, vol. 2, no. 4, 2000.
[12] W.H. Smith, "Morphological Characteristics of Verbs in Taiwan Sign Language," PhD dissertation, Indiana Univ., 1989.
[13] M.-C. Su, Y.-X. Zhao, H. Huang, and H.-F. Chen, "A Fuzzy Rule-Based Approach to Recognizing 3-D Arm Movements," IEEE Trans. Neural Systems and Rehabilitation Eng., vol. 9, no. 2, 2001.
[14] R. Liang, "Continuous Gesture Recognition System for Taiwanese Sign Language," PhD dissertation, Nat'l Taiwan Univ., Taiwan, 1997.
[15] C. Valli and C. Lucas, Linguistics of American Sign Language: An Introduction. Gallaudet Univ. Press, 1995.
[16] S.K. Liddell and R.E. Johnson, "American Sign Language: The Phonological Base," Sign Language Studies, vol. 64, pp. 195-277, 1989.
[17] C. Vogler and D. Metaxas, "Toward Scalability in ASL Recognition: Breaking Down Signs into Phonemes," Lecture Notes in Artificial Intelligence, vol. 1739, pp. 211-224, 1999.
[18] C. Vogler and D. Metaxas, "A Framework for Recognizing the Simultaneous Aspects of American Sign Language," Computer Vision and Image Understanding, no. 81, pp. 358-384, 2001.
[19] T. Starner, J. Weaver, and A. Pentland, "Real-Time American Sign Language Recognition Using Desk and Wearable Computer-Based Video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, Dec. 1998.
[20] W. Sandler, "Phonological Features and Feature Classes: The Case of Movements in Sign Language," Lingua 98, pp. 197-220, 1996.
[21] W.H. Smith and L.-F. Ting, Shou Neng Sheng Chyau (Your Hands Can Become a Bridge), vols. 1 and 2, Taipei, R.O.C.: Deaf Sign Language Research Assoc., 1997.
[22] Ministry of Education, Division of Special Education, Changyong Cihui Shouyu Huace (Sign Album of Common Words), vol. 1, Taipei: Ministry of Education, 2000.
[23] G.H. Gonnet, Handbook of Algorithms and Data Structures. Addison-Wesley Publishing Company, 1984.
[24] D.J. Hand, Discrimination and Classification, pp. 100-101, John Wiley & Sons, 1989.
[25] S.M. Ross, Introduction to Probability Models. Academic Press, Inc., 1993.
[26] M.H. DeGroot, Optimal Statistical Decisions. McGraw-Hill Publishing, 1970.
[27] Z.G. Dong and L.K. Teng, "Interpolation of n-Gram and Mutual-Information Based Trigger Pair Language Models for Mandarin Speech Recognition," Computer Speech and Language, vol. 13, pp. 125-141, 1999.
[28] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms. The MIT Press, 1994.


[29] K. Oflazer, "Error-Tolerant Retrieval of Trees," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 12, Dec. 1997.
[30] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[31] I.S. MacKenzie and W. Buxton, "A Tool for Rapid Evaluation of Input Devices Using Fitts' Law Models," SIGCHI Bull., vol. 25, no. 3, pp. 58-63, 1993.
[32] G.K. Kanji, 100 Statistical Tests. SAGE Publications Ltd., 1999.
[33] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 1999.
[34] J. Ann, "A Linguistic Investigation of the Relationship between Physiology and Handshape," PhD dissertation, Univ. of Arizona, 1993.
[35] M. Mohri, "Finite-State Transducers in Language and Speech Processing," Assoc. for Computational Linguistics, vol. 23, pp. 1-42, 1997.


Chung-Hsien Wu received the BS degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981, and the MS and PhD degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. He became a professor in August 1997. From 1999 to 2002, he served as the chairman of the department. He also worked at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, in the summer of 2003 as a visiting scientist. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing-impaired. Dr. Wu is a senior member of the IEEE, the International Speech Communication Association (ISCA), and ROCLING.

Yu-Hsien Chiu received the BS degree in electrical engineering from I-Shou University, Kaohsiung, Taiwan, R.O.C., in 1997, and the MS degree in biomedical engineering from National Cheng Kung University, Tainan, Taiwan, in 1999. He is a PhD candidate in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. His research interests include speech and biomedical signal processing, embedded system design, spoken language processing, and sign language processing for the hearing-impaired.

Kung-Wei Cheng received the BS degree in information science from Tunghai University, Taichung, Taiwan, in 2000, and the MS degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2002. His research interests include digital signal processing, text-to-speech synthesis, natural language processing, and assistive technology for the hearing-impaired.

. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
