Neurocomputing 116 (2013) 242–249
Hand gesture tracking and recognition system using Lucas–Kanade algorithms for control of consumer electronics

Prashan Premaratne (a,*), Sabooh Ajaz (b), Malin Premaratne (c)

(a) School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, North Wollongong, NSW 2522, Australia
(b) Department of Information and Communication, Inha University, Incheon, South Korea
(c) Department of Electrical and Computer Systems Engineering, Monash University, Victoria, Australia
Available online 9 October 2012

Abstract
A dynamic hand gesture tracking and recognition system can simplify the way humans interact with computers and many other non-critical consumer electronic devices. This system is based on the well-known "Wave Controller" technology developed at the University of Wollongong [1–3] and is certainly a step forward in video gaming and consumer electronics control interfaces. Currently, computer interfacing mainly involves a keyboard, mouse, joystick or gaming wheel, and occasionally voice recognition, for user input. These modes of interaction have constrained the artistic ability of many users, as they are required to respond to the computer by pressing buttons or moving other apparatus. Voice recognition is seen as unreliable and impractical in areas where more than one user is present. All these drawbacks can be tackled by using a reliable hand gesture tracking and recognition system based on both the Lucas–Kanade and moment invariant approaches. This will facilitate interaction between users and computers and other consumer electronic equipment in real time. It will further enhance the user experience, as users no longer have any physical connection to the equipment being controlled. In this research, we compare our proposed moment-invariant-based algorithm with template based and Fourier descriptor based methods to highlight the advantages and limitations of the proposed system.
Keywords: Dynamic gesture recognition; Support vector machines; Tracking; Lucas–Kanade algorithm; Moment invariants; Human–computer interaction (HCI)
1. Introduction

With their ever-increasing processing power and plummeting costs, the use of computers and other microcontroller-based consumer electronics equipment in the average home is increasing. Whether in our lounge room, bedroom or office, a number of electronic devices may require commands to perform some valuable task. It could be the television set, the VCR or the set-top box waiting for our command to provide us with music or perhaps news, and the command may reach them with the push of a button on a remote controller or a keyboard. People have long tried to replace such controllers with voice recognition or glove-based devices [4–9], with mixed results. Glove-based devices [8,9] are tethered to the main processor with cables that restrict the user's natural ability to communicate. Many of these approaches have been implemented to focus on a single aspect of gestures, such as hand tracking, hand posture estimation, or hand pose classification, using uniquely colored gloves or markers on hands or fingers [6–10]. Technological experts overwhelmingly agree that the bottleneck in computer and gaming adoption is the human–computer interface. This is evident from the popularity of the gravity sensor-based gaming console, which has
* Corresponding author. Tel.: +61 2 4221 4778.
E-mail address: [email protected] (P. Premaratne).
URL: http://www.elec.uow.edu.au/staff/pp/ (P. Premaratne).
http://dx.doi.org/10.1016/j.neucom.2011.11.039
replaced its traditional game pad with a Nintendo Wii controller. This has increased computer usage and is increasingly engaging the older generation, who do not like using the traditional keyboard and mouse to interact with electronic entertainment units. This has further thrust human–computer interaction (HCI) to a new level of sophistication, and hand gesture recognition is seen at the forefront of that trend [2]. The development of the "Wave Controller" has demonstrated that traditional computer and consumer electronic controls such as remote controllers can be replaced effectively with a hand gesture recognition system. Hand gesture recognition systems rely on computer vision to recognize hand gestures in the presence of any background clutter. These hand gestures must then be uniquely identified to issue control signals to the computer or another entertainment unit. During the past 18 years, many attempts have been made, with mixed success, to recognize hand gestures. Some of the drawbacks of the original systems were their inability to run in real time with modest computing power and their low classification scores in gesture recognition. Most of the techniques relied on template matching [11] or shape descriptors [12–14] and required more processing time than could be provided in real time. Furthermore, the users were restricted to wearing gloves [8,9] or markers to increase reliability, and were also required to be at a preset distance from the camera. These restrictions meant that a practical system that could operate in a lounge room was unthinkable. In 2007, a truly practical system that was capable of running on
modest computing power in real time was unveiled at the University of Wollongong and was hailed as a major breakthrough in hand gesture recognition [1]. This research was distinguished from previous attempts by a few marked differences: (i) hand tracking isolates the region of interest (ROI); (ii) using a minimum number of gestures offers higher accuracy with less confusion; (iii) only low processing power is required to process the gestures, making the system useful for simple consumer control devices; (iv) the system is very robust to lighting variations; (v) the system operates in real time; and (vi) the distance from the hand to the camera is immaterial. The previous system operated in real time as it processed static images. It also had very high accuracy, as gestures with conflicting moment invariants were removed from the system. In the new system proposed here, we have been able to accommodate dynamic gestures (multiple static images) for increased accuracy and to remove the constraint of requiring long-sleeved garments. In order to gain a complete understanding of the dynamic system, it is essential to understand the static hand gesture system developed in our previous research. The next section discusses the static hand gesture recognition system, followed by hand tracking using the Lucas–Kanade algorithm. The dynamic gesture recognition system is discussed in Section 4. Section 5 discusses the feature extraction process, followed by gesture classification. The hardware interface that we built to realize the system is discussed in Section 7, followed by the Results and Discussion sections.
2. Static hand gesture recognition system

As depicted in Fig. 1, the system camera initially captures a frame every second. In order to determine the hand gesture, skin segmentation is performed, looking for skin-like regions. Since the RGB color domain is unhelpful in achieving effective skin segmentation, the image is initially converted to the YCbCr domain. Even though the YCbCr domain facilitates the skin segmentation process, it always results in noisy spots due to lighting variations. As can be seen from Fig. 2, gestures are accompanied by skin-like regions in cluttered backgrounds. This distortion, as expected, becomes more pronounced in low lighting conditions. As a result, the skin-segmented image is noisy and distorted and will likely result in incorrect recognition at the subsequent stages. These
distortions, however, can be removed during the gesture normalization stage, which uses the well-known morphological filtering technique of erosion combined with dilation [20]. The output of this stage is a smooth region of the hand gesture, which is stored as a logical bitmap image.

2.1. Hand region segmentation

In our previous research [1], we described a system that required a long-sleeved shirt or garment, which yielded straightforward hand gestures when processed with the skin segmentation filter. However, when part of the arm is captured along with the hand, as seen in Fig. 3(a), further processing is needed to isolate the hand area. We were inspired by the work carried out by Abe et al. [15] in removing the arm region from the hand region for effective hand gesture recognition. However, their approach does not produce real-time wrist recognition, as it involves developing contours for the entire hand and arm area. In our observation, a rolled-up sleeve or short-sleeved garment exposes an arm region that is straight compared to the hand. If these straight sections are represented using two straight lines (red) as in Fig. 3(b), we can estimate the deviation of the hand contour from a straight line. Using a dataset of more than 50 hand–arm images, we determined that the wrist section can be reliably estimated at the point where the contour deviates by 10% from the straight lines. Once the wrist is located, hand regions can be segmented from potential hand gestures.
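To make the pre-processing pipeline concrete, the sketch below shows one way the YCbCr skin segmentation and erosion/dilation smoothing could be coded with OpenCV in C++. The Cr/Cb thresholds and the kernel size are illustrative assumptions, not the values used in the paper (which are obtained from a database of skin samples).

```cpp
// Sketch of the skin segmentation and gesture normalization stages, assuming OpenCV.
// The Cr/Cb thresholds and the 5x5 kernel are placeholders, not the paper's values.
#include <opencv2/opencv.hpp>

cv::Mat segmentSkin(const cv::Mat& bgrFrame)
{
    cv::Mat ycrcb;
    cv::cvtColor(bgrFrame, ycrcb, cv::COLOR_BGR2YCrCb);      // separate luminance from chrominance

    // Keep pixels whose chrominance falls inside an assumed skin range (channels are Y, Cr, Cb).
    cv::Mat mask;
    cv::inRange(ycrcb, cv::Scalar(0, 133, 77), cv::Scalar(255, 173, 127), mask);

    // Erosion followed by dilation removes the small noisy spots caused by lighting variations.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::erode(mask, mask, kernel);
    cv::dilate(mask, mask, kernel);

    return mask;   // smooth logical bitmap of the candidate hand region
}
```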
3. Hand tracking using Lucas–Kanade algorithm

A user who is issuing a command to a hand gesture system may move his/her hand intentionally or unintentionally during the course of the gesture or gestures. It is vital for an effective system to track the hand so that finding the ROI is fast and error free. The Lucas–Kanade algorithm [16,17] provides useful insight for implementing a sparse hand tracking method. Since most computer vision problems such as hand tracking involve moderate changes from frame to frame, this algorithm provides an effective method to track the movement of robust features. It assumes certain properties of a pixel or a region of pixels. The algorithm devises a velocity equation and tracks each feature point from one frame to the next using iterative approximation with a Newton–Raphson type method. The method relies on some of the properties of common video, which
Fig. 2. Noisy image after skin segmentation and smoothed image.
Fig. 1. System overview of a static hand gesture controller.
Fig. 3. (a) Hand region with arm and (b) wrist section. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
include small changes in brightness and distance between frames, and that neighboring pixels move in the same direction. The matching of these regions is achieved by comparing the pixel intensities between patch areas. The best match is the one that gives the minimum value of the sum of squared differences (SSD). The Lucas–Kanade algorithm can be understood by considering that an image can be warped onto another using a homography and homogeneous coordinates such that

$$a\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

or

$$x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + 1}, \qquad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + 1}$$

The homography above uniquely maps 2-D homogeneous points to other 2-D homogeneous points. In tracking, the homography between two images can be established by finding a sparse set of feature points. Suppose that $h_{ij}$ is incremented infinitesimally so as to move the mapping of point $(x,y)$ from $(x',y')$ to $\bigl(x' + \partial x'/\partial h_{ij},\; y' + \partial y'/\partial h_{ij}\bigr)$; the brightness change can be established as

$$\frac{\partial}{\partial h_{ij}} I'(x',y') = \left(\frac{\partial x'}{\partial h_{ij}}, \frac{\partial y'}{\partial h_{ij}}\right) \cdot \nabla I'(x',y')$$

Since the matching of two regions can be estimated by minimizing their mismatch, the following error minimization establishes the Lucas–Kanade algorithm for hand tracking. Let $J$ denote the overall squared error, $J = \sum_{x} \bigl(I(x) - I'(x')\bigr)^2$. We want to find $\partial J/\partial h_{ij}$ for each parameter, i.e. $\nabla_{H} J$:

$$
\begin{aligned}
\frac{\partial}{\partial h_{ij}} J &= \frac{\partial}{\partial h_{ij}} \sum_{x} \bigl(I(x) - I'(x')\bigr)^2 \\
&= \sum_{x} \frac{\partial}{\partial h_{ij}} \bigl(I(x) - I'(x')\bigr)^2 \\
&= \sum_{x} 2\bigl(I(x) - I'(x')\bigr) \frac{\partial}{\partial h_{ij}} \bigl(I(x) - I'(x')\bigr) \\
&= -\sum_{x} 2\bigl(I(x) - I'(x')\bigr) \frac{\partial}{\partial h_{ij}} I'(x') \\
&= -\sum_{x} 2\bigl(I(x) - I'(x')\bigr) \frac{\partial x'}{\partial h_{ij}} \cdot \nabla I'(x')
\end{aligned}
$$

This can be stated as

$$\frac{\partial}{\partial h_{ij}} J = \sum_{x} \underbrace{-2\bigl(I(x) - I'(x')\bigr)}_{\text{error}} \; \underbrace{\frac{\partial x'}{\partial h_{ij}}}_{\text{point motion}} \cdot \underbrace{\nabla I'(x')}_{\text{spatial gradient}}$$
Iteratively minimizing this error allows robust features to be tracked from frame to frame while achieving real-time performance. In this research, we have used fingertips as 'features' to track the hand from frame to frame. The success of hand tracking is discussed in Section 8. The Lucas–Kanade tracking algorithm calculates a brightness gradient (Sobel operator) along at least two directions for a good feature to be tracked over time [16,17]. Our implementation is similar to an attempt by Fogelton [18], which uses the tracker in combination with image pyramids (a series of progressively smaller-resolution interpolations of the original image). An image area from one video frame is matched efficiently to the most similar area within a search window in the following frame. If the feature match correlation between two consecutive frames is below a threshold, the feature is considered 'lost'. A hand detection method supplies both a rectangular bounding box and a probability distribution to initialize tracking. However, since we are not using a video but merely a collection of images taken 100 ms apart, the amount of processing required is low. The probability mask states, for every pixel in the bounding box, the likelihood that it belongs to the hand. Features are selected within the bounding box according to their ranking, while observing a pairwise minimum distance. Features are ranked according to the combined probability of their locations and color. Highly ranked features are tracked individually from frame to frame. Their new locations become the areas with the highest match correlation between the two frames' areas. Individual features can latch onto arbitrary artifacts of the object being tracked, such as the fingers of a hand. They move independently with the artifact, without disturbing other features. Sections with high concentrations of features are avoided due to the minimum distance constraint, while stray features that drift too far from the object of interest are brought back into the flock by the maximum distance constraint. In order to get more stable results, about 15% of the features furthest from the median are discarded from the median computation. The speed of pyramid-based KLT [16,17] feature tracking allows us to overcome the computational limitations of model-based tracking approaches.
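A rough sketch of this pyramidal feature tracking step is given below, assuming OpenCV; the feature count, window size, pyramid depth and the trackHand wrapper are illustrative choices rather than the authors' implementation.

```cpp
// Illustrative pyramidal Lucas-Kanade tracking of features inside the hand bounding box.
// Parameter values (30 features, 21x21 window, 3 pyramid levels) are assumptions.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point2f> trackHand(const cv::Mat& prevGray, const cv::Mat& nextGray,
                                   const cv::Rect& handBox)
{
    // Select strong corner features (e.g. fingertips) inside the hand bounding box.
    cv::Mat roiMask = cv::Mat::zeros(prevGray.size(), CV_8UC1);
    roiMask(handBox).setTo(255);
    std::vector<cv::Point2f> prevPts;
    cv::goodFeaturesToTrack(prevGray, prevPts, 30, 0.01, 7, roiMask);
    if (prevPts.empty()) return prevPts;

    // Track the features into the next frame using image pyramids.
    std::vector<cv::Point2f> nextPts;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, nextGray, prevPts, nextPts, status, err,
                             cv::Size(21, 21), 3);

    // Keep only features that were found; lost features (status == 0) are dropped.
    std::vector<cv::Point2f> kept;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) kept.push_back(nextPts[i]);
    return kept;   // the median of these points gives the new hand position
}
```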
4. Dynamic gesture recognition

The dynamic gesture recognition system discussed here is quite similar to the static gesture recognition system. However, it differs from the static system in the following aspects:

- New images are handled every 100 ms as opposed to every 1 s in the static gesture system.
- A gesture routine is initiated when a start gesture is found and completed when a stop gesture is detected.
- Any unintentional gestures not preceded by start are rejected.
- All routines are written in C and C++ as opposed to the very high level functional language used in the previous system.
- The system is easily implementable on a smart phone equipped with a camera.
- It supports a theoretical figure of 42 dynamic gestures as opposed to seven gestures in the previous system.
- It supports bare arm regions without any long-sleeved garments.

Dynamic gesture recognition also minimizes the problems associated with hand movements in static hand gesture recognition systems. Apart from the above aspects, the initial processing of captured images remains the same as described in Section 2. A flow chart of the implementation steps of the system is depicted in Fig. 4. The flow of the actions can be highlighted as follows: when the system starts, the camera captures an image every 100 ms. This image is converted to the YCbCr domain and undergoes skin segmentation. Since the image obtained from the skin segmentation step is noisy, morphological filtering is performed and the result undergoes a threshold filter to remove 'non-skin' components. The major advantage of this approach is that the influence of luminosity can be removed during the conversion process, making the segmentation less dependent on the lighting conditions, which has always been a critical obstacle for image recognition. The threshold values are obtained from a database; a number of sample points that represent skin patches and non-skin patches are then obtained. The image is further processed to isolate the hand region in case the arm is present, as described in Section 2.1. Next, this image is evaluated for features using template matching, Fourier descriptors and moment invariants, and is classified using both neural networks and a support vector machine for comparison. If the system classifies the first image as start, then the system expects a dynamic gesture sequence and looks for other gestures such as Volume, Channel, Equipment Select or any other gesture that is in use. Once valid gestures are recognized, the system records them as a sequence of actions to be performed until it recognizes the stop
Fig. 4. Dynamic hand gesture recognition system.
gesture, terminating that dynamic gesture sequence. The hardware then issues the sequence of gesture commands, and the system returns to its initial stage and looks for another start command to begin another sequence.
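The start/stop sequencing logic described above can be summarized by the small control-loop sketch below. The gesture labels and the classifyFrame/issueCommand stubs are placeholders for the classification and hardware stages, not the authors' actual routines.

```cpp
// Hedged sketch of the dynamic gesture sequencing loop: commands accumulated between a
// start and a stop gesture are issued as one sequence; anything else is rejected.
#include <vector>

enum Gesture { START, STOP, VOLUME_UP, CHANNEL_UP, UNKNOWN };

// Placeholder stubs for the feature extraction/classification and remote-control stages.
Gesture classifyFrame() { return UNKNOWN; }   // capture, segment, extract moments, classify
void issueCommand(Gesture) {}                 // drive the universal remote via the interface

void gestureLoop()
{
    bool inSequence = false;
    std::vector<Gesture> sequence;

    for (;;) {                                 // one iteration every 100 ms
        Gesture g = classifyFrame();

        if (!inSequence) {
            if (g == START) { inSequence = true; sequence.clear(); }
            // gestures not preceded by a start gesture are rejected
        } else if (g == STOP) {
            for (Gesture cmd : sequence) issueCommand(cmd);   // play back the whole sequence
            inSequence = false;                               // return to the initial stage
        } else if (g != UNKNOWN) {
            sequence.push_back(g);             // record a valid command gesture
        }
    }
}
```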
5. Feature extraction of the hand gesture

There have been many attempts to use a variety of features for gesture classification. Most of these features have been adapted from similar classification problems in other areas of object classification. The following sections will detail template matching and moment invariant based feature extraction methods to evaluate their suitability for gesture classification.
5.1. Template matching

Template matching is a fundamental pattern recognition technique that has been utilized in the context of both posture and gesture recognition. In the context of images, template matching is performed by the pixel-by-pixel comparison of a prototype and a candidate image. The similarity of the candidate to the prototype is proportional to the total score on a preselected similarity measure. In order to recognize hand gestures, the image of a detected hand forms the candidate image, which is directly compared with prototype images of hand gestures. The best matching prototype (if any) is considered the matching gesture. However, due to the pixel-by-pixel image comparison, template matching is not invariant to scaling and rotation. Template matching was one of the first methods employed to detect hands in images [19]. Scale and rotational normalization [20] as well as multiple views [21] can be used to tackle the variant nature of hand gestures. In [20], the image of the hand is normalized for rotation based on the detection of the hand's main axis and then scaled with respect to the hand dimensions in the image. This leads to a constrained hand that can only move on a planar surface that is frontoparallel to the camera. To cope with the increased computational cost of comparing with multiple views of the same prototype, these views were annotated with orientation parameters [22]. Searching for the matching prototype was accelerated by searching only relevant postures with respect to the one detected in the previous frame. A template comprised of edge directions was utilized in [23]. Edge detection is performed on the image of the isolated hand and edge orientations are computed. The histogram of these orientations is used as the feature vector. The evaluation of this approach showed that edge orientation histograms are not very discriminative, because several semantically different gestures exhibit similar histograms.
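For comparison with the methods below, a minimal pixel-wise template matching step could look like the following sketch, assuming OpenCV and equally sized, normalized prototype images; the SSD-based scoring and gallery handling are illustrative choices, not the paper's implementation.

```cpp
// Sketch of prototype matching by sum of squared differences (SSD); the best (lowest SSD)
// prototype is taken as the recognized gesture. Prototype sizes are assumed to match.
#include <opencv2/opencv.hpp>
#include <limits>
#include <vector>

int bestMatchingPrototype(const cv::Mat& candidate, const std::vector<cv::Mat>& prototypes)
{
    int best = -1;
    double bestScore = std::numeric_limits<double>::max();
    for (size_t i = 0; i < prototypes.size(); ++i) {
        cv::Mat result;
        cv::matchTemplate(candidate, prototypes[i], result, cv::TM_SQDIFF);
        double minVal;
        cv::minMaxLoc(result, &minVal);                 // smallest SSD over the search area
        if (minVal < bestScore) { bestScore = minVal; best = static_cast<int>(i); }
    }
    return best;   // index of the closest prototype, or -1 if the gallery is empty
}
```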
5.2. Moment invariants

Image classification is a very mature field today. There are many approaches to finding matches between images or image segments. From the basic correlation approach to scale-space techniques, they offer a variety of feature extraction methods with varying success. However, it is very critical in hand gesture recognition that the feature extraction is fast and captures the essence of a gesture in a unique, small data set. Neither the Fourier descriptor [12], which results in a large set of values for a given image, nor scale space [24] succeeds in this context. The proposed approach of using moment invariants stems from our success in developing the "Wave Controller". Gesture variations caused by rotation, scaling and translation can be circumvented by using a set of features, such as moment invariants, that are invariant to these operations. The moment invariants algorithm has been recognized as one of the most effective methods to extract descriptive features for object recognition and has been widely applied in the classification of subjects such as aircraft, ships and ground targets [25,26]. Essentially, the algorithm derives a number of self-characteristic properties from a binary image of an object. These properties are invariant to rotation, scale and translation.

Let $f(i,j)$ be a point of a digital image of size $M \times N$ ($i = 1,2,\ldots,M$ and $j = 1,2,\ldots,N$). The 2-D moments and central moments of order $(p+q)$ of $f(i,j)$ are defined as

$$m_{pq} = \sum_{i=1}^{M}\sum_{j=1}^{N} i^{p} j^{q} f(i,j) \quad\text{and}\quad \mu_{pq} = \sum_{i=1}^{M}\sum_{j=1}^{N} (i-\bar{i})^{p} (j-\bar{j})^{q} f(i,j)$$

where $\bar{i} = m_{10}/m_{00}$ and $\bar{j} = m_{01}/m_{00}$. From the second-order and third-order moments, a set of seven moment invariants is derived as follows [22,23]:

$$\varphi_1 = \eta_{20} + \eta_{02} \tag{1}$$

$$\varphi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \tag{2}$$

$$\varphi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \tag{3}$$

$$\varphi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \tag{4}$$

$$\varphi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \tag{5}$$

$$\varphi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \tag{6}$$

$$\varphi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \tag{7}$$

where $\eta_{pq}$ is the normalized central moment defined by $\eta_{pq} = \mu_{pq}/\mu_{00}^{r}$, with $r = (p+q)/2 + 1$ and $p+q = 2,3,\ldots$
Fig. 5. Letter 'A' in different orientations.

Fig. 6. Letter 'L' in different orientations.

Table 1
Moment invariants of the different orientations of letter 'A'.

Moments   A1           A2           A3           A4
φ1        0.2165       0.2165       0.204        0.25153
φ2        0.001936     0.001936     0.001936     0.002161
φ3        3.6864e-005  3.6864e-005  3.6864e-005  0.004549
φ4        1.6384e-005  1.6384e-005  1.6384e-005  0.002358
φ5        4.0265e-010  4.0265e-010  4.0265e-010  7.59e-06
φ6        7.209e-007   7.209e-007   7.209e-007   7.11e-05
φ7        0            0            0            1.43e-06
It is possible that using fewer than seven values will still provide a very useful set of features. We have used the first four moments as the feature set in our system to classify hand gestures using both neural networks and a support vector machine (SVM), as discussed later.
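As a concrete sketch of this feature extraction step, the snippet below computes the seven Hu moment invariants of Eqs. (1)–(7) from the binary hand mask and keeps the first four as the feature vector; OpenCV is assumed here, and the handFeatures wrapper is our own naming rather than the authors' code.

```cpp
// Sketch of moment-invariant feature extraction from the segmented binary hand image,
// assuming OpenCV; only phi1..phi4 are retained, as in the classification stage above.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<double> handFeatures(const cv::Mat& binaryHandMask)
{
    cv::Moments m = cv::moments(binaryHandMask, /*binaryImage=*/true);
    double hu[7];
    cv::HuMoments(m, hu);                        // the seven invariants of Eqs. (1)-(7)
    return std::vector<double>(hu, hu + 4);      // feature vector: phi1, phi2, phi3, phi4
}
```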
5.3. Example of invariant properties

Fig. 5 shows images containing the letter 'A': rotated, scaled, translated and noisy versions. Table 1 shows their respective moment invariants calculated using Eqs. (1)–(7). It is obvious from Table 1 that the algorithm produces the same result for the first three orientations of letter 'A' despite the different transformations applied to them. Only one value, φ1, displays a small discrepancy of 5.7% due to the difference in scale. The other values of the three figures are effectively the same for φ2, φ3, φ4, φ5, φ6 and φ7. The last letter, however, reveals the drawback of the algorithm: it is susceptible to noise. Specifically, the added noisy spot in the letter has changed the entire set of moment invariants. This drawback suggests that moment invariants should only be applied to noise-free images in order to achieve the best results. Since the algorithm is firmly effective against transformations, a simple classifier can exploit these moment invariant values to differentiate and recognize the letter 'A' from other letters, such as the letter 'L' depicted in Fig. 6 and Table 2.
6. Gesture classification

Having accomplished all the above stages, we have successfully extracted a data set from an image of a user hand gesture. However, this data set remains meaningless unless the program can interpret it as a preset command to control the electronic device. We decided to use both neural network and support vector machine classifiers, as both offer similar complexity and classification accuracy for well-defined inputs. After the feature extraction stage, each group of the sample images that represent the same gesture produces a certain range of φ1, φ2, φ3 and φ4. These ranges are then used as preset values to classify a random input image. The procedure implicitly states that the more samples we have, the better the classification becomes.
Table 2
Moment invariants of the different orientations of letter 'L'.

Moments   L1           L2           L3
φ1        0.34028      0.31944      0.31944
φ2        0.043403     0.043403     0.043403
φ3        0.023148     0.023148     0.023148
φ4        0.002572     0.002572     0.002572
φ5        5.56e-06     5.56e-06     5.56e-06
φ6        0.00015      0.00015      0.00015
φ7        1.91e-05     1.91e-05     1.91e-05
6.1. The proposed neural network design

Neural networks have been applied to perform complex functions in numerous applications including pattern recognition [27–30], classification [31] and identification [32]. Once implemented, a neural network can compute the output significantly faster than the nearest-neighbor classifier. A neural network also has the ability to learn and predict over time. This property enables the system to be viewed more as a human-like entity that can actually 'understand' the user, which is also one of the major objectives of our research. The system is designed to capture one image frame (static image) every 100 ms, which is then segmented for skin region detection and other pre-processing before the invariant moments are calculated. These invariant moments are the input to the neural network for classification and the subsequent action using the remote control and the feedback system. If a static image is captured while the hand is moving, the resulting image will be blurred. This will result in an unrecognized hand gesture, and the user will be informed of it through the system feedback display. The designed neural network is a backpropagation network, in which input vectors (invariant moments of the sample set of user hand gestures) and the corresponding target vectors (the command set) are used to train the network until it can approximate a function between the input and the output [33–35]. In this particular design, there are only three layers due to the limited number of hand gestures to be classified. A more complex network could be designed and implemented, but it is neither practical nor necessary for our research. For better visualization, the network is illustrated in Fig. 7, where W represents the weighting function in which each input is weighted with an appropriate w, and b represents the bias coefficient, which is set to 1 in this design.

6.2. Support vector machine classification

In our previous attempt, we successfully used neural networks for classification. We have also implemented a support vector machine (SVM) classifier to compare the performance of both classifiers. The SVM is increasingly used for statistical classification and regression analysis. The SVM classifier learns from example data points labeled as belonging to their respective categories.
An SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class, since in general the larger the margin, the lower the generalization error of the classifier. In our approach, the features φ1–φ4 (the first four Hu moments) are used as inputs to the SVM. The weights are calculated using a training data set, similar to training a neural network.
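A minimal sketch of this classifier is shown below, assuming OpenCV's cv::ml module (not necessarily the SVM implementation the authors used); the RBF kernel choice and matrix layouts are illustrative.

```cpp
// Sketch of SVM training/prediction on the four-moment feature vectors (phi1..phi4).
// Kernel and parameters are illustrative assumptions.
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

cv::Ptr<cv::ml::SVM> trainGestureSvm(const cv::Mat& features,   // N x 4, CV_32F rows of phi1..phi4
                                     const cv::Mat& labels)     // N x 1, CV_32S gesture ids
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);                 // maximise the margin between gesture classes
    svm->train(features, cv::ml::ROW_SAMPLE, labels);
    return svm;
}

int predictGesture(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& featureRow)   // 1 x 4, CV_32F
{
    return static_cast<int>(svm->predict(featureRow));
}
```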
7. Computer remote interface
Fig. 7. Structure of the neural network classifier.
Fig. 8. Block diagram of the hardware interface.
Once the system determines that an appropriate gesture or set of gestures has been issued, it issues these commands through a universal remote controller that is linked through an interface with the computer. Fig. 8 shows the block diagram of this hardware interface, identifying the channels of communication and the functionalities. The diagram depicts how the eight data pins of the computer parallel port are split into two sets: three bits and five bits. The first bit set identifies which device to control, whilst the second set specifies which command is being issued to that device. By combining the two sets with relevant hex decoders, the system is theoretically capable of controlling up to eight (2^3) different devices, with 32 (2^5) commands for each device. However, for ease of use only four bits are used, reducing the number of commands to 16 (2^4). In order to implement the system, a 3-to-8 line decoder is used to process the first bit set (i.e. the remote control code) and a 4-to-16 line decoder is used for the second bit set (i.e. the command code). As a result, input–output (I/O) pin number 6 of the parallel port is left unused and could be employed for other purposes at a later stage. Fig. 9 illustrates the detailed mapping of the I/O pins from the parallel port to the decoders. As the selected output of the 4-to-16 decoder is always low-asserted, a hex inverter is positioned after the decoder to convert the output to high-asserted. The final output from the inverter activates the transistor circuit, which acts as a switch to press the button. The inverter chosen for the research provided six inputs and six outputs; therefore, in order to handle the 16 outputs of the 4-to-16 decoder, three inverters were connected in parallel. One noticeable advantage of the system is that both the decoders and the inverters have the same operating voltage characteristic. This makes it easier to design a power supply source for all the devices with the same voltage level. When a particular command is to be given to a consumer electronics device, it is sent through the parallel port, which in turn selects the appropriate remote controller, and a particular
Fig. 9. Parallel port mapping detail (left) and photo of the prototype.
Table 3
Some hand gestures and their corresponding classification scores under different features.

Gesture            Proposed technique (%)   Template based (%)   Fourier descriptor (%)
(gesture image)    74                       53                   82
(gesture image)    27                       95                   83
(gesture image)    46                       47                   42
(gesture image)    96                       76                   52
(gesture image)    97                       78                   72
(gesture image)    95                       85                   71
(gesture image)    90                       82                   78
(gesture image)    89                       63                   64
switch (similar to pressing a button on the remote) is activated through the closing of a transistor switch. A photo of this hardware interface is shown in Fig. 9(right).
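A small sketch of how the device and command codes could be packed into the byte written to the parallel port data pins is given below; the bit positions and the port-writing helper are illustrative assumptions, since the exact pin-to-decoder wiring is given in Fig. 9 rather than reproduced here.

```cpp
// Illustrative packing of a 3-bit device code and a 5-bit command code (only 4 bits used)
// into the single byte driven onto the parallel port data pins. Bit positions are assumed;
// the actual mapping of pins to the 3-to-8 and 4-to-16 decoders follows Fig. 9.
#include <cstdint>

uint8_t encodeCommand(uint8_t device, uint8_t command)
{
    return static_cast<uint8_t>(((device & 0x07) << 5) | (command & 0x1F));
}

// Example usage: device 2, command 9 -> one byte for the port.
// uint8_t value = encodeCommand(2, 9);
// writeParallelPort(value);   // hypothetical, platform-dependent register write (e.g. LPT1)
```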
8. Results and discussion

The results of the proposed system have been very encouraging. We have successfully implemented the Lucas–Kanade algorithm to track the hand when it is initially displayed on the screen. A user may start the gesture recognition process by holding the hand vertically with the palm displayed to the camera so that the whole hand is in the middle of the frame. The system then captures the region of interest (ROI) as a rectangle encompassing the hand and tracks the hand. As long as the gesturing is done in a calm and orderly manner, the hand region will always be tracked despite any background variations. We extracted features using template matching, Fourier descriptors and moment invariants (Hu moments) to compare their effectiveness. The results are shown in Table 3. As can be seen from the results for these gestures, moment invariants offer outstanding classification scores. These results further prove that moment invariants can be used for object recognition applications, since they are rigidly invariant to scale, rotation and translation. The following account summarizes the advantages of the moment invariants algorithm for gesture classification.
- Moment invariants are invariant to translation, scaling and rotation. Therefore, the user can issue commands regardless of the orientation of his/her hand.
- The algorithm is susceptible to noise. Most of this noise, however, is filtered at the gesture normalization stage.
- The algorithm is moderately easy to implement and requires only an insignificant computational effort from the CPU. Feature extraction, as a result, can proceed rapidly and efficiently.
- The first four moments, φ1, φ2, φ3 and φ4, are adequate to represent a gesture uniquely and hence result in a simple feature vector with only four values.
We have also implemented a dynamic gesture recognition system that recognizes dynamic gestures that clearly identify start and stop functions in real time. The system now runs on any machine that is capable of running OpenCV, unlike our previous system, which required MATLAB with some of the latest toolboxes. This is a further boost to our quest to implement the whole system on a 'smart phone' currently available in the market so that gesture recognition can be easily available for any equipment housing a camera. The system captures images every 100 ms. The gestures that follow start need to be made deliberately so that they do not cause blur or smearing. We are also conducting research into running the entire processing on a field-programmable gate array (FPGA), bypassing a computer. Our goal is to implement the dynamic hand gesture system on a chip.
References

[1] P. Premaratne, Q. Nguyen, Consumer electronics control system based on hand gesture moment invariants, IET Comput. Vis. 1 (1) (2007) 35–41.
[2] S. Hutcheon, Last Hurrah for Lost Remote, Sydney Morning Herald, 2007. http://www.smh.com.au/articles/2007/07/18/1184559833067.html
[3] International Reporter, 2007. http://www.internationalreporter.com/News2402/Now,-seven-simple-hand-gestures-to-switch-your-TV-on.html
[4] Y. Fujita, S. Lam, Menu-driven user interface for home system, IEEE Trans. Consum. Electron. 40 (3) (1994) 587–597.
[5] D.W. Lee, J.M. Lim, J. Sunwoo, I.Y. Cho, C.H. Lee, Actual remote control: a universal remote control using hand motions on a virtual menu, IEEE Trans. Consum. Electron. 55 (3) (2009) 1439–1446.
[6] Y. Han, A low cost visual motion data glove as an input device to interpret human hand gestures, IEEE Trans. Consum. Electron. 56 (2) (2010) 501–509.
[7] D. Lee, Y. Park, Vision-based remote control system by motion detection and open finger counting, IEEE Trans. Consum. Electron. 55 (4) (2009) 2308–2313.
[8] D.L. Quam, Gesture recognition with a dataglove, in: Proceedings of the 1990 IEEE National Aerospace and Electronics Conference, vol. 2, 1990, pp. 755–760.
[9] D.J. Sturman, D. Zeltzer, A survey of glove-based input, IEEE Comput. Graph. Appl. 14 (1994) 30–39.
[10] J. Davis, M. Shah, Recognizing hand gestures, in: Proceedings of the European Conference on Computer Vision, 1994, pp. 331–340.
[11] C. Shan, Y. Wei, X. Qiu, T. Tan, Gesture recognition using temporal template based trajectories, in: Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, 2004, pp. 954–957.
[12] P.R.G. Harding, T. Ellis, Recognizing hand gesture using Fourier descriptors, in: Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, 2004, pp. 286–289.
[13] C.S. Lin, C.L. Hwang, New forms of shape invariants from elliptic Fourier descriptors, Pattern Recogn. 20 (5) (1987) 535–545.
[14] C.S. Lin, C. Jungthirapanich, Invariants of three dimensional contours, Pattern Recogn. 23 (8) (1990) 833–842.
[15] K. Abe, H. Saito, S. Ozawa, 3-D drawing system via hand motion recognition from two cameras, in: Proceedings of the Sixth Korea–Japan Joint Workshop on Computer Vision, 2000, pp. 138–143.
[16] J. Shi, C. Tomasi, Good features to track, in: Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '94), 1994, pp. 593–600.
[17] C. Tomasi, T. Kanade, Detection and Tracking of Point Features, Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991.
[18] A. Fogelton, Real-time hand tracking using modificated flocks of features algorithm, Inf. Sci. Technol. Bull. ACM Slovakia 3 (2) (2011) 1–5.
[19] W. Freeman, C. Weissman, Television control by hand gestures, in: International Workshop on Automatic Face and Gesture Recognition, 1995, pp. 179–183.
[20] H. Birk, T.B. Moeslund, C.B. Madsen, Real-time recognition of hand alphabet gestures using principal component analysis, in: Proceedings of the Scandinavian Conference on Image Analysis, 1997.
[21] T. Darrell, A. Pentland, Space-time gestures, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1993, pp. 335–340.
[22] H. Fillbrandt, S. Akyol, K.F. Kraiss, Extraction of 3D hand shape and posture from image sequences for sign language recognition, in: Proceedings of the International Workshop on Analysis and Modeling of Faces and Gestures, 2003, pp. 181–186.
[23] W. Freeman, M. Roth, Orientation histograms for hand gesture recognition, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1995, pp. 296–301.
[24] S. Ho, G. Greig, Scale-space on image profiles about an object boundary, in: Scale Space Methods in Computer Vision, Lecture Notes in Computer Science, vol. 2695, 2003, pp. 564–575.
[25] P. Premaratne, ISAR Ship Classification: An Alternative Approach, CSSIP-DSTO Internal Publication, 2003.
[26] Q. Zhongliang, W. Wenjun, Automatic ship classification by superstructure moment invariants and two-stage classifier, in: ICCS/ISITA '92 Communications on the Move, 1992.
[27] W.S. Chen, P.C. Yuen, J. Huang, D.Q. Dai, Kernel machine-based one-parameter regularized Fisher discriminant method for face recognition, IEEE Trans. Syst. Man Cybernet. 35 (4) (2005) 659–669.
[28] W. Jia, D.S. Huang, D. Zhang, Palmprint verification based on robust line orientation code, Pattern Recogn. 41 (5) (2008) 1521–1530.
[29] D.S. Huang, W. Jia, D. Zhang, Palmprint verification based on principal lines, Pattern Recogn. 41 (4) (2008) 1316–1328.
[30] Z.Q. Zhao, D.S. Huang, Palmprint recognition with 2DPCA+PCA based on modular neural networks, Neurocomputing 71 (1–3) (2007) 448–454.
[31] C. Caleanu, D.S. Huang, V. Gui, V. Tiponu, V. Maranescu, Interest operator vs. Gabor filtering for facial imagery classification, Pattern Recogn. Lett. 28 (8) (2007) 950–956.
[32] Z.Q. Zhao, D.S. Huang, B.Y. Sun, Human face recognition based on multiple features using neural networks committee, Pattern Recogn. Lett. 25 (12) (2004) 1351–1358.
[33] MATLAB Help, Backpropagation, Neural Network, 5.2–5.71.
[34] D.S. Huang, Radial basis probabilistic neural networks: model and application, Int. J. Pattern Recogn. Artif. Intell. 13 (7) (1999) 1083–1101.
[35] L. Shang, D.S. Huang, J.X. Du, C.H. Zheng, Palmprint recognition using FastICA algorithm and radial basis probabilistic neural network, Neurocomputing 69 (13–15) (2006) 1782–1786.
Prashan Premaratne was awarded a full scholarship in 1994 to pursue undergraduate studies under the John Crawford Scholarship Scheme at the University of Melbourne. He obtained his Bachelor of Engineering (Electrical and Electronics with Hons) from the University of Melbourne in 1997 and joined Fujitsu (Singapore) Pty. Ltd. as a Network Software Engineer at Network Software Development Division. In 1999 he was awarded a National University of Singapore Postgraduate Scholarship, National Science and Technology Board of Singapore award and a Motorola Research Grant to pursue a Doctor of Philosophy degree in Electrical and Computer Engineering. He was awarded his PhD in 2002 for his work on ‘‘Blind Deconvolution for Image Restoration’’.
He is a Senior Member of IEEE and has been the Guest Editor of the International Journal of Information Technology, vol. 11, no. 12 (special issue on Security and Financial Series Analysis); Guest Editor of the International Journal of Wavelets, Multiresolution and Information Processing (IJWMIP); and Guest Editor of a special issue of the Neurocomputing journal (Elsevier, 2006). He is also an ARC OzReader in Mathematics, Information and Communication. He is the Tutorial Chair of the ICIC2008 Conference in Shanghai, China, and was the Program Committee Chair of the ICIC2006 Conference held in Kunming, China. He has also published over 50 IEE/IEEE refereed publications in signal and image processing and has been invited to chair sessions in major IEEE conferences such as ISIMP2004 in Hong Kong and ICIC2005 in China.
Sabooh Ajaz received his Bachelor of Engineering (Electronics) from NED University of Engineering & Technology, Karachi, Pakistan, in 2006. He graduated with a Master of Engineering with Distinction from the University of Wollongong, NSW, Australia, in July 2010. Since then he has started a PhD at Inha University in Korea. His core research interests include VLSI and embedded systems, specifically their application in the field of signal and image processing.
Malin Premaratne received the BSc (math.) and BE (elec.) degrees with first class honors and the PhD degree from the University of Melbourne, Victoria, Australia, in 1995, 1995, and 1998, respectively. At present, he holds visiting research appointments with the University of Melbourne, the Australian National University, the University of California–Los Angeles (UCLA) and the University of Rochester, New York. From 1998 to 2000, he was with the Photonics Research Laboratory, a division of the Australian Photonics Cooperative Research Center (APCRC) at the University of Melbourne, where he was the Co-project Leader of the APCRC Optical Amplifier Project. During this period, he worked with Telstra, Australia, and Hewlett Packard, USA, through the University of Melbourne. From 2000 to 2003, he was involved with several leading startups in the photonics area either as an employee or a consultant. During this period, he also served on the editorial boards of SPIE/Kluwer and Wiley publishers in the optical communications area. From 2001 to 2003, he worked as the Product Manager (Research and Development) of the VPIsystems Optical Systems group. Since 2003, he has steered the research program in high-performance computing applications to complex systems simulations at the Advanced Computing and Simulation Laboratory (AXL) at Monash University, where he currently serves as the Research Director and Associate Professor. He has published over 100 research papers in the areas of semiconductor lasers, EDFA and Raman amplifiers, optical network design algorithms and numerical simulation techniques. He is a Fellow of the Institution of Engineers Australia (FIEAust). He is an executive member of IEAust Victoria, Australia, and since 2001 has served as the Chairman of the IEEE Lasers and Electro-Optics Society in Victoria, Australia.