Efficient Generation of Large Amounts of Training Data for Sign Language Recognition: A Semi-automatic Tool

Ruiduo Yang¹, Sudeep Sarkar¹, Barbara Loeding², and Arthur Karshmer²

¹ University of South Florida, Tampa, FL 33620, USA
² University of South Florida, Lakeland, FL 33803, USA

Abstract. We have developed a video hand segmentation tool that helps generate hand ground truth from sign language image sequences. This tool may greatly facilitate research in the area of sign language recognition. The tool offers a semi-automatic scheme to assist with the localization of hand pixels, which is important for recognition. A candidate hand generator is applied, using the mean shift image segmentation algorithm and a greedy seed growing algorithm. After a number of hand candidates are generated, the user can reduce the candidates with simple mouse clicks. The tool also provides a hand tracking function for faster processing and a face detection function for groundtruthing non-manual signals. In addition, we provide a two-pass groundtruthing scheme, unlike other tools that perform only one pass. The first pass is automatic and requires no user interaction. Experimental results demonstrate that, based on the results of the first pass, one can groundtruth more than 10,000 frames of sign language within 8 hours.

1 Introduction

With the growing capability of computer vision technology, increasing interest is being directed towards the automatic recognition of sign language based solely on image sequences or videos. We refer the reader to [1] for a review of the applications of related vision technologies. Briefly, if an accurate and reliable method can be developed to recognize and translate sign sentences into spoken or written languages, it will help alleviate the communication difficulties that Deaf people experience daily. Such a method could be incorporated into an assistive device carried by a Deaf person, or made available within the computer systems of stores, banks, and other places where a Deaf person needs to communicate with non-signers.

In a vision-based sign language recognition system, a model is generally trained at the phoneme, sign, or sentence level. This requires accurate feature extraction from the image frames; for example, one may need to extract facial expression, hand position, and hand shape parameters. However, automatic hand segmentation is a challenging problem, and for real-world image sequences unwanted noise is generated, so the accuracy of the model can decrease. For example, an automatic scheme for segmenting the hand can be based on a hand color model [2], stereo depth [3], colored gloves [4], or motion [5] information. However, a hand color model is not sufficient for real-world images with large variations in lighting conditions. Using depth images is slow and fails when the hand moves into the same plane as the body. Colored gloves might help during training, but they change the appearance of the hand and make the signer feel unnatural. Finally, although motion information is usually robust, it cannot accommodate subtle hand shape changes or a moving background. For these reasons, a hand groundtruthing step with an annotation tool is often taken before training begins.

Sign language training databases generally consist of many video shots and frames, which makes it very hard to groundtruth them quickly, frame by frame, with a general image processing tool (e.g., Photoshop). Even with a vision-based video annotation tool, this task can be labor intensive and time-consuming. For example, consider a small training database with 100 signs, 25 training samples per sign, and an average sample length of 50 frames. That is a total of 100x25x50 = 125,000 frames. Using a general image processing tool that needs 5 minutes to groundtruth one frame, the time cost for the whole database would be about 434 days of continuous work.

In this paper, we offer a semi-automatic scheme to assist with the localization of hand pixels, which is important for recognition. A candidate hand generator is applied, using the mean shift image segmentation algorithm and a greedy seed growing algorithm. After a number of hand candidates are generated, the user can reduce the candidates with simple mouse clicks. In addition, we provide a two-pass groundtruthing scheme, unlike other tools that perform only one pass. The first pass is automatic and needs no user interaction.

We are aware of many other video annotation tools. However, most focus on scene segmentation or key frame detection (e.g., IBM EVA [6], the ESP Game [7]). Some combine local feature extraction and temporal tracking: the VIPER annotation tool proposed by Pfund et al. [8] provides image segmentation, temporal segmentation, and event annotation together; ViPER-GT, proposed by Doermann et al. [9], can detect multiple objects and track them automatically using bounding boxes; and Marcotegui et al. proposed VOGUE [10], where a number of image and video segmentation techniques are incorporated for object annotation purposes. All of these tools are standalone applications providing semi-automatic groundtruthing functions with user-friendly interfaces.

Our annotation tool is a by-product of our vision-based American Sign Language (ASL) recognition system. It also provides a semi-automatic scheme for efficient groundtruthing, but its main purpose is to segment the hand pixels frame by frame. Instead of the general segmentation and tracking algorithms used in existing tools, we advocate a candidate hand generator approach, which is more reliable when the hand shape changes or the hand crosses the face. Unlike existing tools where only one pass is conducted, we offer a two-pass scheme for faster processing, where the first pass generates candidate hands automatically.

Fig. 1. Overview of the two-pass approach. (a) First pass: grouping — original frames I1, I2, I3 are segmented into primitives p, which are grouped into candidate groups g, with links between groups across the grouping layers. (b) Second pass: selection — user selection and tracking over the grouping layers and the links between groups.

Fig. 1 illustrates our two-pass scheme. In Fig. 1(a), ASL video frames are first segmented into seed primitives; these primitives are grouped by a grouping engine to generate overlapping candidate hand groups, where each group may consist of one or more primitives. This step is automatic, and no user interaction is involved. In the second pass, Fig. 1(b), the grouped results are loaded back for the user's examination. Since the number of generated groups can be huge, we allow the user to click on the hand region to reduce the number of candidates. At the same time, a tracking method is applied across adjacent frames to improve efficiency. The paper is organized as follows: Section 2 discusses the algorithm used in the first pass, Section 3 describes the Graphic User Interface (GUI), Section 4 presents the experimental results, and Section 5 concludes.

2 First-Pass: Segmentation and Grouping

The first-pass annotation can be done completely in the background, since no user interaction is involved. The goal of this pass is to generate the hand candidates and store them in a file to be reloaded in the second pass. In existing annotation tools, these two steps generally run at the same time, forcing the user to wait for the segmentation and tracking results. With a complex scene, the segmentation itself can be very time consuming (e.g., mean shift segmentation [11] of a single frame with a highly textured scene can take over 10 seconds). Many tracking techniques take an iterative approach, which can also be slow. Tracking methods also require good initialization, which must be provided for each sign in the training dataset separately. In our case, since we aim to generate as many candidates as possible to accommodate the fast-changing appearance of the hand while signing, processing a single frame is even slower: a complex scene can yield over 500 candidate hands, pushing the running time for one frame beyond 30 seconds. Fortunately, unlike other tracking or hand segmentation schemes, our approach does not require perfect results or user modification while the candidate hands are being detected, so this step can be done entirely in the background.
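The only contract between the two passes is the candidate file written offline and reloaded by the GUI. The paper does not specify this file format, so the following is only a minimal sketch of such a round trip under an assumed plain-text layout; the `Box` struct and the function names are illustrative, not the tool's actual code.

```cpp
#include <fstream>
#include <map>
#include <vector>

struct Box { int x, y, w, h; };   // simplified candidate: a bounding box per group

// Write one record per frame: frame index, candidate count, then one box per line.
void savePass1(const std::map<int, std::vector<Box>>& perFrame, const char* path) {
    std::ofstream out(path);
    for (const auto& kv : perFrame) {
        out << kv.first << ' ' << kv.second.size() << '\n';
        for (const Box& b : kv.second)
            out << b.x << ' ' << b.y << ' ' << b.w << ' ' << b.h << '\n';
    }
}

// Reload the per-frame candidates in the interactive second pass.
std::map<int, std::vector<Box>> loadPass1(const char* path) {
    std::map<int, std::vector<Box>> perFrame;
    std::ifstream in(path);
    int frame, count;
    while (in >> frame >> count) {
        std::vector<Box> boxes(count);
        for (Box& b : boxes) in >> b.x >> b.y >> b.w >> b.h;
        perFrame[frame] = boxes;
    }
    return perFrame;
}
```

In the real tool the stored groups would also carry their member primitives and the inter-frame links used for tracking; bounding boxes are used here only to keep the example short.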

2.1 Grouping of Primitives

We adopt a greedy approach to form the groups. Let us denote the low-level region patches in the $k$-th image frame by $S_k = \{p_1^k, \cdots, p_{N_k}^k\}$. From this initial set of primitives, we select a subset of primitives that are likely to come from the hand, based on the size of the patch. These are our seed patches. Given some knowledge of the approximate size of the hands in the sequence, we can eliminate large, non-homogeneous region patches from further consideration. We use a list $L$ to store the possible groups. This list is initialized by choosing each selected primitive to be a singleton group. These groups are then merged to form larger conglomerates. The grouping process starts by picking the first group in $L$, denoted here by $p$, and searching for its neighbors $\{N_p^i\}$. Each neighbor $N_p^i$ is considered for grouping with $p$ to generate a tentative larger grouping. We select the best local grouping and denote it as $g$. The group $g$ is further tested to see if it can possibly represent a hand. If the group $g$ passes this test, it is inserted into the final candidate group list, $C$; otherwise it is inserted at the end of the list $L$, to be considered for further grouping.
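The paper does not spell out the local scoring function or the hand-plausibility test beyond their dependence on patch size, so the sketch below substitutes simple area-based stand-ins; the iteration cap and the handling of groups that cannot grow are likewise assumptions made only to keep the example self-contained.

```cpp
#include <deque>
#include <set>
#include <vector>

// A primitive is a low-level region patch p_i^k; adjacency comes from the segmentation.
struct Primitive {
    int area;                        // pixel count of the patch
    std::vector<int> neighbors;      // indices of adjacent primitives
};

// A group is a set of primitives that may jointly form a hand candidate.
struct Group {
    std::set<int> members;
    int area = 0;
};

// Assumed plausibility test: total area within a rough hand-size range.
static bool couldBeHand(const Group& g, int minHand, int maxHand) {
    return g.area >= minHand && g.area <= maxHand;
}

std::vector<Group> growCandidates(const std::vector<Primitive>& prims,
                                  int seedMaxArea, int minHand, int maxHand) {
    std::deque<Group> L;             // working list of groups
    std::vector<Group> C;            // final candidate list

    // Seeds: small patches likely to belong to a hand; each starts as a singleton group.
    for (int i = 0; i < static_cast<int>(prims.size()); ++i) {
        if (prims[i].area > seedMaxArea) continue;
        Group s;
        s.members.insert(i);
        s.area = prims[i].area;
        L.push_back(s);
    }

    int budget = 10000;              // safety cap on iterations (assumption, not from the paper)
    while (!L.empty() && budget-- > 0) {
        Group p = L.front();
        L.pop_front();

        // Merge p with each neighboring primitive in turn; keep the best local grouping.
        Group best;
        bool found = false;
        for (int m : p.members) {
            for (int n : prims[m].neighbors) {
                if (p.members.count(n)) continue;        // neighbor already inside the group
                Group g = p;
                g.members.insert(n);
                g.area += prims[n].area;
                // Illustrative local score: prefer the smallest area increase.
                if (!found || g.area < best.area) { best = g; found = true; }
            }
        }
        if (!found || best.area > maxHand) continue;     // cannot grow, or grew past hand size

        if (couldBeHand(best, minHand, maxHand))
            C.push_back(best);                           // plausible hand candidate
        else
            L.push_back(best);                           // keep for further growing
    }
    return C;
}
```

Seed groups that never find a neighbor are silently dropped in this sketch; the paper does not say how such groups are handled.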

2.2 Associating Groups Across Frames

We denote the $j$-th group detected in the $t$-th frame as $G_t^j$. The groups detected in each frame are associated with those detected in previous frames, resulting in a linked sequence of groups spanning all the frames. We define the predecessor set of each element in each group set as

$$\mathrm{Pre}(G_t^j) = [\,G_{t-1}^{j_1}, \cdots, G_{t-1}^{j_n}\,], \qquad t \geq 2,\; 1 \leq j_k \leq c_{t-1}, \qquad (1)$$

where $G_{t-1}^{j_k}$ is one possible predecessor of $G_t^j$ and $c_{t-1}$ is the number of groups detected in frame $t-1$. The predecessor relationship between the groups from different time instants is based on feature similarity. It captures how likely the groups are to arise from the same underlying cause in the image. We use the best successors as the tracking result.
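The paper states only that the predecessor relationship is based on feature similarity, without naming the features. The sketch below therefore assumes two illustrative features per group, centroid position and area, and links a group to every sufficiently similar group of the previous frame, promoting the best match so it can serve as the default tracking link.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-group features used for matching; these particular features are assumptions.
struct GroupFeature {
    float cx, cy;    // centroid of the group
    float area;      // total pixel area of the group (assumed non-zero)
};

static float similarity(const GroupFeature& a, const GroupFeature& b) {
    float d = std::hypot(a.cx - b.cx, a.cy - b.cy);                   // centroid distance
    float r = std::min(a.area, b.area) / std::max(a.area, b.area);    // area agreement in (0,1]
    return r / (1.0f + d);                                            // larger = more similar
}

// For each group j of frame t, return the indices of its predecessors in frame t-1,
// i.e. Pre(G_t^j); the best-scoring predecessor is moved to the front so it can serve
// as the default link used by the tracker.
std::vector<std::vector<int>> linkToPreviousFrame(const std::vector<GroupFeature>& prev,
                                                  const std::vector<GroupFeature>& curr,
                                                  float minSim) {
    std::vector<std::vector<int>> pre(curr.size());
    for (std::size_t j = 0; j < curr.size(); ++j) {
        int bestK = -1;
        float bestS = minSim;
        for (std::size_t k = 0; k < prev.size(); ++k) {
            float s = similarity(curr[j], prev[k]);
            if (s < minSim) continue;                                 // not similar enough
            pre[j].push_back(static_cast<int>(k));
            if (s > bestS) { bestS = s; bestK = static_cast<int>(k); }
        }
        if (bestK >= 0) {                                             // promote best predecessor
            auto it = std::find(pre[j].begin(), pre[j].end(), bestK);
            std::iter_swap(pre[j].begin(), it);
        }
    }
    return pre;
}
```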

3 Second Pass: Graphic User Interface

After the candidate groups and their links are generated, user interaction is needed to select the best group, with a default group suggested in each frame by the tracking method. We provide functionality that works specifically well for sign sentences. These functions are built upon the candidate generator discussed in Section 2, a simple tracking technique that works with the links between the candidate groups, a face detector for non-manual information analysis, a glossing tool, and various elements that facilitate hand groundtruthing.
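The face detection function is only mentioned, not described. As a minimal sketch of how such a detector could be wired up, the code below uses OpenCV's stock Viola-Jones cascade; the specific cascade file, parameters, and the assumption that this is the detector the tool uses are all illustrative.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Returns face bounding boxes for one frame. "haarcascade_frontalface_default.xml"
// ships with OpenCV; whether the tool uses this particular detector is an assumption.
std::vector<cv::Rect> detectFaces(const cv::Mat& frame, cv::CascadeClassifier& cascade) {
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);   // the cascade operates on grayscale
    cv::equalizeHist(gray, gray);                    // improve contrast before detection
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, 1.1, 3);   // scale step and neighbor threshold
    return faces;
}

// Usage sketch:
//   cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");
//   std::vector<cv::Rect> faces = detectFaces(frame, cascade);
//   // e.g. draw the first face as the blue bounding box shown in the GUI:
//   // if (!faces.empty()) cv::rectangle(frame, faces[0], cv::Scalar(255, 0, 0), 2);
```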

Fig. 2. The Graphic User Interface

The application is coded in the Microsoft Visual Studio environment, using MFC classes, the OpenCV library, and related Windows APIs. Fig. 2 illustrates the GUI. The following functions relevant to sign language groundtruthing are supported:

1. Three views of the current frame: the first is the dominant hand view, the second is the non-dominant hand view, and the third shows both hands.
2. Click to select: one can click on the hand area to reduce the groups to be shown; the shown groups must include the point where the mouse was clicked (a minimal sketch follows after this list).
3. Missing hand checkbox: one can mark the current hand as missing if the hand is out of the scene.
4. Two hand listboxes: the two listboxes below show the lists of candidate dominant and non-dominant hands, ranked by boundary smoothness and the tracking result. The highlighted item is the currently selected candidate hand.
5. Face detector: an automatic face detector, shown as the blue bounding box.
6. Play, Stop, and Step buttons: pressing the Play button automatically tracks the hands through all frames and saves the result; the Stop button stops the tracking; the Step button tracks one frame and then waits for the user's response, which is the most frequently used mode.
7. Redo checkbox: allows one to re-detect the current sequence.
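Click-to-select (item 2) amounts to a containment filter over the candidate list. A minimal sketch follows; the `Candidate` struct is illustrative, and a bounding-box test stands in for the tool's pixel-accurate membership test on the group's region.

```cpp
#include <vector>

struct Candidate {
    int x, y, w, h;                  // bounding box of a candidate hand group
};

// Keep only the candidates that contain the clicked pixel (px, py).
std::vector<Candidate> filterByClick(const std::vector<Candidate>& cands, int px, int py) {
    std::vector<Candidate> kept;
    for (const Candidate& c : cands)
        if (px >= c.x && px < c.x + c.w && py >= c.y && py < c.y + c.h)
            kept.push_back(c);
    return kept;
}
```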

4 Experimental Results

We tested our annotation tool on two datasets with different parameter settings, both consisting of ASL sign sentences. The first dataset has a simple background, 10,675 frames, and a resolution of 460x290. On average, 100 candidate hands are generated per frame for this dataset, yet it took us less than 8 hours to finish groundtruthing both hands. The second dataset has a complex background and a 640x480 resolution, with about 500 candidate hands generated per frame. We took 500 frames and established the groundtruth within 1 hour. Note that the times reported here are user interaction times, that is, the time of the second pass. Fig. 3(a) shows the per-frame time taken (in seconds) by the two passes on the two datasets, measured on a P4 2.4 GHz CPU with 4 GB of memory. Our first pass takes relatively longer, since it incorporates automatic segmentation, face detection, and the greedy seed growing algorithm; however, it is completed offline. The second pass only reloads the candidate results and therefore takes much less time.

Fig. 3. (a) Time taken by the first and second passes (in seconds). (b) Time taken with different method sets (in seconds).

Fig. 4. (a) Original frame 1. (b) Segmented frame 1. (c) List of candidate groups in frame 1. (d) Original frame 2. (e) Segmented frame 2. (f) List of candidate groups in frame 2.

Fig. 3(b) shows the time taken with different method sets of the tool, using 500 frames from the simple-background set. Here "C" refers to using only the generated candidates, "C+C" to adding the click-to-select method to reduce the candidate set, "C+T" to using the candidates with the tracking method, and "C++" to using the candidates with both click-to-select and tracking. Tracking contributes the most because it exploits the temporal relationship. The click-to-select method helps when tracking fails, for example under large motion, drastic hand shape changes, or occlusion. For comparison, we also used the wand tool of an image editing program to groundtruth 10 frames randomly chosen from the dataset; the average time per frame exceeded 3 minutes. Fig. 4 shows some visual results of the generated candidate hands. Our method can discern the hand shape even when there are occlusions and overlaps. In particular, Fig. 4(c) shows the result where the face crosses a hand, and Fig. 4(f) shows the two hands crossing each other.

5 Conclusion

In this paper, we describe our semi-automatic tool for annotating sign language video sequences. The objective of this tool is to generate large amounts of data with labeled hand pixels that can support a reliable training process. The tool uses segmentation, tracking, and offline processing to considerably reduce the time needed to obtain reliable hand data. For future work, an assisting edit tool can be added to handle situations where hand pixels merge with non-hand pixels and segmentation and grouping fail.

Acknowledgment. This work was supported in part by the National Science Foundation under grant IIS 0312993. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

References

1. B. L. Loeding, S. Sarkar, A. Parashar, and A. I. Karshmer: Progress in Automated Computer Recognition of Sign Language. Lecture Notes in Computer Science, 3118 (2004) 1079–1087
2. T. Starner, J. Weaver, and A. P. Pentland: Real-time American Sign Language Recognition using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12) (1998) 1371–1375
3. L. V. Tsap and M. C. Shin: Dynamic Disparity Adjustment and Histogram-Based Filtering of Range Data for Fast 3-D Hand Tracking. Digital Signal Processing, 14 (2004) 550–565
4. A. Just and S. Marcel: Two-Handed Gesture Recognition. IDIAP Research Institute, CH-1920 Martigny, Switzerland (2005)
5. M. H. Yang, N. Ahuja, and M. Tabb: Extraction of 2D Motion Trajectories and its Application to Hand Gesture Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002) 168–185
6. T. Volkmer, J. R. Smith, and A. Natsev: A Web-based System for Collaborative Annotation of Large Image and Video Collections: An Evaluation and User Study. In: MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia (2005) 892–901
7. L. von Ahn and L. Dabbish: Labeling Images with a Computer Game. In: Conference on Human Factors in Computing Systems (CHI) (2004) 319–326
8. T. Pfund and S. M. Maillet: A Dynamic Multimedia Annotation Tool. In: G. Beretta and R. Schettini (eds.): Proceedings of SPIE Photonics West, Electronic Imaging 2002, Internet Imaging III, 4672 (2002) 216–224
9. D. Doermann and D. Mihalcik: Tools and Techniques for Video Performance Evaluation. In: Proceedings of the IEEE International Conference on Pattern Recognition (ICPR'00), 4 (2000) 167–170
10. B. Marcotegui and P. Correia: A Video Object Generation Tool Allowing Friendly User Interaction. In: International Conference on Image Processing, 4 (1999) 391–395
11. D. Comaniciu and P. Meer: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002) 603–619
