Image-Based Wearable Tangible Interface

Jiung-Yao Huang1, Yong-Zeng Yeo1, Lin Huei2, Chung-Hsien Tsai3

1 Department of Computer Science and Information Engineering, NTPU, Taiwan
{jyhuang, s79983201}@mail.ntpu.edu.tw
2 Department of Computer Science and Information Engineering, TKU, Taiwan
[email protected]
3 Department of Computer Science & Information Engineering, NDU, Taiwan
[email protected]
Abstract. This paper presents a novel technique for a mobile TUI system that consists of only a pico projector and a camera. The proposed system can transform an arbitrary flat surface into a touch panel, allowing the user to interact with a computer through hand gestures on any flat surface, anytime and anywhere. The contributions of the proposed system include the extraction of the display screen from the image captured live by an RGB camera, a shadow-based fingertip detection approach, and a fast yet reliable FSM grammar to determine the user's finger gestures. Finally, a prototype system is built to validate the proposed techniques, and a validation experiment is reported at the end.

Keywords: Tangible User Interface, Computer Vision, Finger Tracking, Camera Projector.
1   Introduction
Intuitiveness is an important feature in User Interface (UI) research [1, 2]. For a long time, research on UI or Human-Computer Interaction (HCI) has been limited by the graphic display and the standard I/O interface. Recently this paradigm has changed as wireless, handheld, and mobile devices have emerged as promising technologies. The Tangible User Interface (TUI) is the result of this trend; it pursues seamless coupling between digital information and the physical environment [3, 4], which allows a much richer modal interaction between human and computer. It would be more comfortable and effective if the user could directly control the computer anytime and anywhere without any additional hardware [5]. This paper proposes an effective TUI system composed of a mobile computer integrated with a pico projector and a camera, as illustrated in Fig. 1. As shown in Fig. 1, the proposed approach assumes the user wears the system in front, which projects the computer display onto any flat surface and allows the user to manipulate the computer by hand. In other words, the proposed approach allows the user to operate the computer on any flat surface just as if interacting with a tablet computer. The proposed system projects the display of the computer onto a flat surface, similar to a virtual desktop, and recognizes the user's finger gestures through the camera-captured image.
Fig. 1. Scenario for the proposed system
In the rest of this paper, related works are presented first, followed by an overview of the system. The techniques of the proposed system are then described in the following four sections. Finally, the paper concludes with the implementation of a prototype system and a performance validation experiment.
2   Related Works
A great variety of interactive table and interactive wall research has been proposed in recent years, such as WUW - Wear Ur World [6] by Pranav Mistry (MIT), PlayAnywhere [7] by Andrew D. Wilson (Microsoft Research), and Twinkle [8] by Takumi Yoshida (The University of Tokyo).

Mistry et al. from MIT demonstrated WUW - Wear Ur World [6] at the ACM CHI conference in 2009. This system is also well known as the SixthSense system. It allows the user to access information as though a computer were always next to him, yet the computer is completely controlled by hand gestures. It combines a number of standard gadgets, including a webcam, a projector, a mobile phone, and a notebook. In its current form, the battery-powered projector is attached to a hat, the webcam is hung around the neck (or is also positioned on the hat), and the mobile phone provides the connection to the Internet.

In the PlayAnywhere [7] system, Wilson coupled a camera with a projector and placed the unit at a fixed end of a flat surface. The system was capable of detecting and recognizing objects placed on the surface. The most important contribution of this system was that it did not rely on fragile tracking algorithms. Instead, it used a fast and simple shadow-based touch detection algorithm for gesture recognition.

In Twinkle [8], Yoshida et al. proposed an interface for interacting with an arbitrary physical surface by using a handheld projector and camera. The handheld device recognizes features of the physical environment and tags images and sounds onto the environment in real time according to the user's motion and the collisions of the projected images with objects. Similar to PlayAnywhere, they also employed a fast and simple motion estimation algorithm.
3   System Overview
The proposed method extracts the computer display from the projected images that are captured live by the camera. The user's fingertips are identified from the extracted display, and the touch gestures are then recognized accordingly. The proposed method adopts the finger touch gestures defined by Apple Inc. [15]. Further, an FSM is proposed to infer the touch gestures. Fig. 2 shows the pipeline of the proposed method. The first step is to compute the four corner points of the computer display from the projected images captured by the camera. The correlation between the projected computer display and the original computer display is derived in terms of a transformation matrix in the second step. The positions of the user's fingertips, along with the number of fingertips, are then extracted from the projected image. Finally, the FSM is used to determine the touch gestures of the user.
Fig. 2. Pipeline of the proposed system
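To make the dataflow of Fig. 2 concrete, the following sketch outlines one possible per-frame loop. This is a minimal sketch only: the four stage functions are placeholders supplied by the caller (they are not names defined in this paper) and correspond to the techniques of Sections 4-7; OpenCV is assumed for camera capture.

```python
import cv2

# Per-frame loop mirroring the pipeline of Fig. 2 (sketch). The stage
# functions are caller-supplied placeholders for Sections 4-7.
def run_pipeline(extract_screen, estimate_homography,
                 extract_fingertips, recognize_gesture, camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    state = "S0"                                   # initial FSM state (Section 7)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                             # "Next frame?" == No -> end process
                break
            corners = extract_screen(frame)                  # Section 4
            H = estimate_homography(corners)                 # Section 5
            tips = extract_fingertips(frame, H)              # Section 6
            state, gesture = recognize_gesture(state, tips)  # Section 7
            if gesture is not None:
                print("gesture:", gesture)
    finally:
        cap.release()
```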
4   Screen Extraction
The proposed approach aims to let the user interact with the computer on any flat surface, anytime and anywhere. However, since the proposed system is designed to be worn by the user, the field of view of the camera constantly changes with the user's position. Most similar systems handle such background changes with additional sensors or devices, such as a Kinect or IR sensors. The proposed system, in contrast, relies entirely on image processing techniques. The goal of the screen extraction stage is to extract the projected display from the camera-captured image. This stage is further decomposed into four steps: gray-scaling, thresholding, morphological de-noising, and four-corner computation.
4.1   Convert Color Image to Grayscale

In the proposed system, the camera-captured image is in the RGB color space.
Traditionally, there are two approaches to convert an RGB image into a grayscale image: static weighting and adaptive conversion. Since the pixel values of the projected image vary with the ambient light of the operating environment, an adaptive gray-scale mapping approach is adopted to convert the RGB image into a grayscale image. The conversion formula is shown in Eq. (1):

I_G(x, y) = R_w R(x, y) + G_w G(x, y) + B_w B(x, y)                              (1)

where I_G denotes the grayscale image and R(x, y), G(x, y), B(x, y) represent the R, G, and B channels of the color image, respectively. Furthermore, R_w is the weight of the R channel, which is computed by Eq. (2) and Eq. (3) with N as the total number of pixels:

R_{av} = \frac{1}{N} \sum_{x,y} R(x, y)                                          (2)

R_w = \frac{R_{av}}{R_{av} + G_{av} + B_{av}}                                    (3)
The average intensity of the red channel is calculated by Eq. (2) first, and the weight of the red channel is then computed by Eq. (3). Similarly, G_w and B_w are computed by the same method. Notably, this conversion approach enhances edges and reduces lighting noise. As illustrated in Fig. 3, (a) is the projected image captured by the camera, whereas (b) is the grayscale image after the adaptive gray-scale mapping computation.
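A minimal sketch of this conversion, assuming OpenCV and NumPy and an OpenCV-style BGR input frame, is shown below; the function name is illustrative.

```python
import cv2
import numpy as np

# Adaptive gray-scale mapping of Eqs. (1)-(3): each channel's weight is its
# average intensity divided by the sum of the three channel averages.
def adaptive_grayscale(bgr):
    b, g, r = cv2.split(bgr.astype(np.float32))
    r_av, g_av, b_av = r.mean(), g.mean(), b.mean()            # Eq. (2) per channel
    total = r_av + g_av + b_av
    r_w, g_w, b_w = r_av / total, g_av / total, b_av / total   # Eq. (3)
    gray = r_w * r + g_w * g + b_w * b                         # Eq. (1)
    return np.clip(gray, 0, 255).astype(np.uint8)
```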
Fig. 3. (a) Camera-captured image; (b) Gray-level image; (c) Thresholding and de-noising result
4.2   Adaptive Thresholding
Given a grayscale image, thresholding is a pixel-by-pixel operation, as shown in Eq. (4), used to create a binary image as illustrated in Fig. 3(c). The key parameter in the thresholding process is the choice of the threshold value T.

I_b(x, y) = \begin{cases} 1 & \text{if } I_g(x, y) < T \\ 0 & \text{otherwise} \end{cases}                  (4)

where I_b is the resulting binary image and I_b(x, y) is the value of the pixel at location (x, y). Since the camera and the projector are bound together in the proposed system, the ratio of the projected computer display size to the projected image is roughly constant at all times, even when occlusion happens. Moreover, the projected computer display area is always brighter than the surrounding area. These two conditions enable us to propose a purely intensity-based approach to obtain a reasonably good threshold value T. We define the Area of Projected Display (APD) as the total number of pixels of the projected display in the camera-captured projection image. According to our experiments, this value can be treated as a constant throughout the whole process. Hence, the threshold value T is given by Eq. (5):

T = \arg\min_{g \in [0, 255]} \left| \sum_{x=0}^{g} p(x) - A \right|                                        (5)

where A is the Area of Projected Display (APD) and p(x) is the number of pixels in I_g at gray level x.
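A minimal sketch of this threshold selection, implementing Eqs. (4)-(5) as stated with NumPy, is shown below; the function names are illustrative.

```python
import numpy as np

def apd_threshold(gray, apd):
    """Pick T per Eq. (5): the gray level whose cumulative histogram is closest to the APD constant."""
    hist = np.bincount(gray.ravel(), minlength=256)   # p(x) for x in [0, 255]
    cumulative = np.cumsum(hist)                      # sum_{x=0}^{g} p(x)
    return int(np.argmin(np.abs(cumulative - apd)))   # Eq. (5)

def binarize(gray, T):
    """Binary image per Eq. (4)."""
    return (gray < T).astype(np.uint8)
```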
4.3   Morphology De-noise
Some bright pixels may remain in the surrounding area after the thresholding process. The proposed system uses the erosion method [9] to erode away these background noise pixels.
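A short sketch of this step with OpenCV is given below; the 3x3 structuring element and single iteration are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

# Morphological erosion to remove isolated bright pixels left after thresholding.
def denoise(binary):
    kernel = np.ones((3, 3), np.uint8)        # illustrative structuring element
    return cv2.erode(binary, kernel, iterations=1)
```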
4.4   Find Corners
After the gray-scaling, thresholding, and morphological de-noising steps are executed, the resulting image is as shown in Fig. 3(c). Fig. 3(c) also shows that the projected computer display is close to a quadrangle within the projected image. Hence, the next step is to compute the coordinates of the four corners of this quadrangle. Each corner is the pixel of the quadrangle that has the longest distance to the center point ⟨a/2, b/2⟩. Denote these four corners as N, W, E, S, and let a and b be the height and width of the camera-captured image; the corner-computing formulas are given in Eq. (6):

N = \arg\max_{(i,j)} \left\{ \left\| \langle i, j \rangle - \langle \tfrac{a}{2}, \tfrac{b}{2} \rangle \right\|^2 \;\middle|\; i \le \tfrac{a}{2},\; j > \tfrac{b}{2} \right\}

W = \arg\max_{(i,j)} \left\{ \left\| \langle i, j \rangle - \langle \tfrac{a}{2}, \tfrac{b}{2} \rangle \right\|^2 \;\middle|\; i < \tfrac{a}{2},\; j \le \tfrac{b}{2} \right\}
                                                                                  (6)
E = \arg\max_{(i,j)} \left\{ \left\| \langle i, j \rangle - \langle \tfrac{a}{2}, \tfrac{b}{2} \rangle \right\|^2 \;\middle|\; i > \tfrac{a}{2},\; j \ge \tfrac{b}{2} \right\}

S = \arg\max_{(i,j)} \left\{ \left\| \langle i, j \rangle - \langle \tfrac{a}{2}, \tfrac{b}{2} \rangle \right\|^2 \;\middle|\; i \ge \tfrac{a}{2},\; j < \tfrac{b}{2} \right\}
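A minimal NumPy sketch of Eq. (6) is shown below: within each quadrant relative to the image center, the foreground pixel of the binary quadrangle that lies farthest from the center is taken as the corner. The function name is illustrative.

```python
import numpy as np

def find_corners(binary):
    """Eq. (6): one corner per quadrant, the farthest foreground pixel from the image center."""
    a, b = binary.shape[:2]                 # image height (a) and width (b)
    ci, cj = a / 2.0, b / 2.0
    ii, jj = np.nonzero(binary)             # foreground pixel coordinates (row i, column j)
    dist2 = (ii - ci) ** 2 + (jj - cj) ** 2
    quadrants = {
        "N": (ii <= ci) & (jj > cj),
        "W": (ii < ci) & (jj <= cj),
        "E": (ii > ci) & (jj >= cj),
        "S": (ii >= ci) & (jj < cj),
    }
    corners = {}
    for name, mask in quadrants.items():
        if mask.any():
            k = np.argmax(np.where(mask, dist2, -1))   # farthest pixel inside the quadrant
            corners[name] = (int(ii[k]), int(jj[k]))
    return corners
```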
5   Planar Homography
In this step, the relation between the original computer display and the projected display is computed. Here, we consider only the case where the camera and the projector are bound together, as in the proposed system. We also assume that the projection surface is perfectly planar. In other words, we assume that the projected display lies entirely inside the camera-captured image.
5.1   Projector/Camera Calibration
To enable users to author and interact with the projected display, it is necessary to calibrate the projector and the camera in a unified 3D space [10]. This research adopts Zhang's method [11] and Falcao's method [12] to perform camera and projector calibration.
5.2   Original Reference Image and Projected Image Homography Estimation
Since the coordinates of the four corners of the projected display have already been computed by Eq. (6), the homography matrix H of Eq. (8), which relates the original computer display to the projected display, can be obtained by solving Eq. (7), where x'_i, y'_i are coordinates in the original computer display and x_i, y_i are coordinates in the projected display:

\begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x'_1 & -y_1 x'_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y'_1 & -y_1 y'_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 x'_2 & -y_2 x'_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 y'_2 & -y_2 y'_2 \\
x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3 x'_3 & -y_3 x'_3 \\
0 & 0 & 0 & x_3 & y_3 & 1 & -x_3 y'_3 & -y_3 y'_3 \\
x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4 x'_4 & -y_4 x'_4 \\
0 & 0 & 0 & x_4 & y_4 & 1 & -x_4 y'_4 & -y_4 y'_4
\end{bmatrix}
\begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{bmatrix}
=
\begin{bmatrix} x'_1 \\ y'_1 \\ x'_2 \\ y'_2 \\ x'_3 \\ y'_3 \\ x'_4 \\ y'_4 \end{bmatrix}                 (7)

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}                                                                   (8)
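A minimal OpenCV sketch is given below: with the four corner correspondences, cv2.getPerspectiveTransform solves the same 8x8 linear system as Eq. (7) (with h_{33} fixed to 1), and cv2.perspectiveTransform applies Eq. (8) including the homogeneous divide. The function names, corner ordering, and display size are assumptions for illustration.

```python
import cv2
import numpy as np

def display_homography(projected_corners, display_size):
    """Solve Eqs. (7)-(8) from four corner correspondences."""
    w, h = display_size                        # original computer display resolution
    # projected_corners must be ordered consistently with dst:
    # top-left, bottom-left, bottom-right, top-right of the projected quadrangle.
    src = np.float32(projected_corners)
    dst = np.float32([[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]])
    return cv2.getPerspectiveTransform(src, dst)

def to_display_coords(H, point):
    """Map a point from the captured image onto the display, per Eq. (8)."""
    p = np.float32([[point]])                  # shape (1, 1, 2) as required by OpenCV
    return cv2.perspectiveTransform(p, H)[0, 0]
```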
6   Fingertips Extraction
The transformation matrix of Eq. (8) is then used to compute the user's finger positions on the original computer display from the shadow of the user's fingers. From the computed finger positions, the user's touch gesture can then be derived. To further speed up the derivation of the touch gesture, finger detection is limited to identifying fingertips from the finger shadows. Hence, this step is further divided into two sub-tasks: shadow detection and template matching. Since the grayscale image and the binary image of the camera-captured image were computed in the first stage of the proposed pipeline (Fig. 2), we can simply perform an AND operation on these two images to derive the shadow of the user's fingers. A pixel of the resulting image is labeled as part of a shadow if its value is lower than a predefined threshold, and the resulting shape is the finger shadow. A thinning process is applied next to the detected finger shadow to locate the fingertip positions. The number of thinned lines depends on the detected finger shadows, and all generated lines are processed to determine the user's gesture. The result of the AND operation and the line-thinning effect are shown in Fig. 4(a). A line that exceeds a certain length is identified as a finger and is tracked. From Fig. 4(a), we can clearly see that a straight finger generates a line that is the combination of a straight segment and a small curve at its end. In other words, we use this feature to determine the fingertip location, since the resulting curve is caused by the shape of the fingertip, and the last pixel of the line is set as the fingertip location. In addition, a line that consists of only a straight segment is interpreted as a "select", since a press motion does not generate a curved shape. Fig. 4(b) denotes the detected fingertips by red circles. Note that, in order to display the computed result clearly, the red circles are marked on the original captured image.
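A hedged sketch of the shadow detection and thinning step is shown below. The shadow threshold value is an illustrative assumption, cv2.ximgproc.thinning requires the opencv-contrib package, fingertip candidates are taken as skeleton endpoints, and the line-length filter described above is omitted for brevity.

```python
import cv2
import numpy as np

SHADOW_T = 60   # illustrative shadow-intensity threshold, not a value from the paper

def fingertips_from_shadow(gray, display_mask):
    # Restrict the gray image to the projected display (AND with the binary mask).
    inside = cv2.bitwise_and(gray, gray, mask=display_mask)
    # Dark pixels inside the display are treated as finger shadow.
    shadow = ((inside > 0) & (inside < SHADOW_T)).astype(np.uint8) * 255
    # Thin the shadow to one-pixel-wide lines (one line per finger).
    skeleton = cv2.ximgproc.thinning(shadow)
    # Skeleton endpoints: skeleton pixels with exactly one skeleton neighbour
    # (3x3 sum equals itself plus one neighbour).
    neighbours = cv2.filter2D((skeleton > 0).astype(np.uint8), -1, np.ones((3, 3), np.uint8))
    endpoints = np.argwhere((skeleton > 0) & (neighbours == 2))
    return [tuple(p[::-1]) for p in endpoints]   # fingertip candidates as (x, y)
```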
Fig. 4. (a) Shadow detection; (b) Fingertip extraction
7   FSM-based Recognition of Dynamic Finger Gestures
Finally, the proposed system employs an FSM to infer the user's finger gestures from the computed fingertip positions and the number of fingertips. The survey in [14] shows that previous efforts on gesture recognition can be categorized into hidden Markov models (HMMs), particle filtering and condensation, finite-state machines (FSM), and neural networks. In addition, most FSM approaches model a gesture as an ordered sequence of states in a spatial-temporal configuration space. The proposed system currently recognizes nine gestures and, differently from [13], they are modeled into one ordered sequence of states. This design offers two advantages: (1) no specific initial gesture is required; (2) the computational complexity is reduced. This research adopts the touch gestures from Apple Inc. [14] as shown in Table 1. We define the set of states as {S0, S1, S2, S3, S4, S5, S6, S=, Su, Sd, Se, Zm, Sh, Sl, Sr}, with S0 being the initial state. Furthermore, the input alphabet, i.e. {f1, f2, f3, f4, d>, dSu| dZm| dSl| d
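To illustrate the FSM-style inference, a minimal sketch is given below. Since the full transition grammar is not reproduced in this section, the transition table shown (a single-finger "select" path and a two-finger "zoom" path) is an illustrative subset only, and the Event fields are assumptions rather than the paper's exact input alphabet.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    fingers: int    # number of detected fingertips per frame (cf. f1..f4)
    moving: bool    # whether the fingertips moved since the previous frame

# Illustrative transitions only: (state, fingers, moving) -> (next state, emitted gesture)
TRANSITIONS = {
    ("S0", 1, False): ("S1", None),       # one still finger: candidate select
    ("S1", 1, False): ("S1", "select"),   # held in place: report select
    ("S0", 2, False): ("S2", None),       # two fingers down
    ("S2", 2, True):  ("Zm", "zoom"),     # two moving fingers: zoom gesture
}

def step(state, event):
    """Advance the FSM by one frame; unknown inputs fall back to the initial state S0."""
    key = (state, event.fingers, event.moving)
    return TRANSITIONS.get(key, ("S0", None))

# Usage: state, gesture = step("S0", Event(fingers=1, moving=False))
```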